CN114885422A - Dynamic edge computing unloading method based on hybrid access mode in ultra-dense network - Google Patents

Dynamic edge computing unloading method based on hybrid access mode in ultra-dense network

Info

Publication number
CN114885422A
CN114885422A (application CN202210299457.1A)
Authority
CN
China
Prior art keywords
network
task
base station
user
representing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210299457.1A
Other languages
Chinese (zh)
Inventor
鲜永菊
刘闯
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing University of Post and Telecommunications
Original Assignee
Chongqing University of Post and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing University of Post and Telecommunications filed Critical Chongqing University of Post and Telecommunications
Priority to CN202210299457.1A priority Critical patent/CN114885422A/en
Publication of CN114885422A publication Critical patent/CN114885422A/en
Pending legal-status Critical Current

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04W WIRELESS COMMUNICATION NETWORKS
    • H04W72/00 Local resource management
    • H04W72/04 Wireless resource allocation
    • H04W72/044 Wireless resource allocation based on the type of the allocated resource
    • H04W72/0473 Wireless resource allocation based on the type of the allocated resource, the resource being transmission power
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04W WIRELESS COMMUNICATION NETWORKS
    • H04W72/00 Local resource management
    • H04W72/04 Wireless resource allocation
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04W WIRELESS COMMUNICATION NETWORKS
    • H04W72/00 Local resource management
    • H04W72/50 Allocation or scheduling criteria for wireless resources
    • H04W72/51 Allocation or scheduling criteria for wireless resources based on terminal or device properties
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04W WIRELESS COMMUNICATION NETWORKS
    • H04W72/00 Local resource management
    • H04W72/50 Allocation or scheduling criteria for wireless resources
    • H04W72/52 Allocation or scheduling criteria for wireless resources based on load
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D30/00 Reducing energy consumption in communication networks
    • Y02D30/70 Reducing energy consumption in communication networks in wireless communication networks

Abstract

The invention belongs to the technical field of mobile communication, and particularly relates to a dynamic edge computing offloading method based on a hybrid access mode in an ultra-dense network, which comprises the following steps: constructing a multi-user ultra-dense network; user equipment generates a computation task and sends a task request to the macro base station; the macro base station constructs a task model and acquires the state information of the current network; different task transmission modes are executed according to the state information of the current network; the state information of the current network is input into the trained neural network to obtain an offloading decision and a resource allocation scheme; the macro base station sends the offloading decision to each user and the resource allocation scheme to the micro base stations; each user offloads its task according to the decision, and the micro base stations allocate resources according to the resource allocation scheme. The method solves the optimization problem with a twin delayed deep deterministic policy gradient (TD3) algorithm with target networks, which improves the training efficiency of the model.

Description

Dynamic edge computing unloading method based on hybrid access mode in ultra-dense network
Technical Field
The invention belongs to the technical field of mobile communication, and particularly relates to a dynamic edge computing unloading method based on a hybrid access mode in an ultra-dense network.
Background
With the rapid development of wireless communication technology and the widespread adoption of smart devices, mobile applications such as face recognition, online mobile gaming, Virtual Reality (VR) and Augmented Reality (AR) have grown explosively in recent years. Most of them are computation-intensive or delay-sensitive applications, whereas mobile devices (e.g. smartphones, wearable devices) are typically limited in both computational power and battery capacity. The tension between such applications and resource-constrained devices poses a significant challenge to improving the user computing experience.
Mobile Edge Computing (MEC) sinks computing servers from the cloud center to the edge of the network, greatly shortening the distance between user equipment and the server, so that users can use computation offloading to offload tasks to edge servers and meet the demands of intensive computing. In addition, the Ultra-Dense Network (UDN) under the 5G architecture is a heterogeneous network scheme with multi-base-station cooperative service; the overall performance of the network is improved by deploying a large number of micro base stations alongside macro base stations in hot-spot areas. Non-Orthogonal Multiple Access (NOMA) is a radio access technology that enables multiple users to share the same channel for information transmission, improving throughput and network capacity. Therefore, UDNs integrating MEC and NOMA are considered a reliable technology for 5G applications. However, the densely deployed micro base stations and MEC servers place multiple users within the coverage of several micro base stations, and different base stations have different computing capabilities, so channels should be allocated reasonably; how to make offloading decisions and resource allocation for the users is a challenge.
Existing resource allocation methods include a learning-assisted mean-field game method, which clusters groups of base stations in an ultra-dense NOMA-MEC system, with each cluster using NOMA to transmit information. In dynamic request scheduling optimization for mobile edge computing in the Internet of Things, the joint request offloading and resource scheduling problem is modeled as a mixed-integer nonlinear program, and user mobility is considered to minimize the response delay of requests.
Some of the above methods do consider user mobility in a dynamic MEC system; however, in those works users only make simple position changes during offloading, the impact on offloading is limited to changes in channel state, and changes in the number of users caused by user movement are not considered. In areas where foot traffic varies greatly, fixed base stations deployed there cannot meet every user's transmission and computation requirements when the number of users rises sharply, and the conventional transmission method no longer suffices. Therefore, designing an offloading scheme that meets the computing requirements of users in such areas of an ultra-dense network has important research value.
Disclosure of Invention
In order to solve the problems in the prior art, the invention provides a dynamic edge computing unloading method based on a hybrid access mode in an ultra-dense network, which comprises the following steps:
constructing a multi-user ultra-dense network, and initializing the network;
the user equipment generates a calculation task and sends a task request to the macro base station;
a macro base station receives a task request in a current system and constructs a task model according to the task request;
the macro base station acquires state information of a current network, wherein the state information comprises a task model, computing resources of a base station server, a base station channel state and channel information between the base station and a user;
the controller of the macro base station judges, according to the state information of the current network, whether the network has reached the capacity limit of the OMA transmission mode; if it has, the additional users transmit their tasks in NOMA mode, and if it has not, tasks are transmitted in OMA mode;
the base station controller inputs the state information of the current network into the trained neural network to obtain an unloading decision and a resource allocation scheme;
the macro base station sends the decision scheme to each user and sends the resource allocation scheme to the micro base station; and the user unloads the tasks according to the decision scheme, and the micro base station allocates resources according to the resource allocation scheme.
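For illustration only, the per-slot control flow described above can be sketched as follows; the helper callables (get_state, actor, apply_decision) and the OMA-capacity bookkeeping are assumptions introduced for the sketch and are not part of the claimed method:

```python
def schedule_slot(num_users, oma_capacity, get_state, actor, apply_decision):
    """One scheduling slot at the macro base station (illustrative sketch).

    get_state()      -> current network state (task models, server resources, channel gains, occupancy)
    actor(state)     -> (offload_decision, tx_power, cpu_alloc) from the trained policy network
    apply_decision() -> pushes the offloading decision to users and the allocation to micro base stations
    """
    state = get_state()
    # Hybrid access: users beyond the OMA capacity (K channels x N base stations) transmit in NOMA mode.
    modes = ["OMA" if u < oma_capacity else "NOMA" for u in range(num_users)]
    offload_decision, tx_power, cpu_alloc = actor(state)
    apply_decision(offload_decision, tx_power, cpu_alloc, modes)
    return modes
```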
Preferably, the multi-user ultra-dense network comprises a macro base station and N micro base stations, and each micro base station is provided with an MEC server for executing computation tasks; each micro base station serves user equipment by orthogonal frequency division multiple access.
Preferably, constructing the task model comprises: the task of an existing user, Q_u(t) = [d_u(t), c_u(t), τ_u(t)], and the task of a newly added user, Q_{u⁺}(t) = [d_{u⁺}(t), c_{u⁺}(t), τ_{u⁺}(t)], wherein Q_u(t) and Q_{u⁺}(t) are the tasks of the existing user and the newly added user respectively, d_u(t) and d_{u⁺}(t) respectively represent the data size of the existing user's task and of the newly added user's task, c_u(t) and c_{u⁺}(t) respectively represent the number of CPU cycles per bit required to execute the existing user's task and the newly added user's task, and τ_u(t) and τ_{u⁺}(t) respectively represent the delay threshold of the existing user's task and of the newly added user's task.
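As a small illustration, the task tuple could be held in a data structure like the one below; the field names are assumptions made for the sketch, not the patent's notation:

```python
from dataclasses import dataclass

@dataclass
class Task:
    """Computation task Q_u(t) = [d_u(t), c_u(t), tau_u(t)] arriving in slot t."""
    data_bits: float        # d_u(t): input data size in bits
    cycles_per_bit: float   # c_u(t): CPU cycles required per bit
    deadline_s: float       # tau_u(t): maximum tolerated delay in seconds

# Example: a 400 kbit task needing 1000 cycles/bit with a 30 ms deadline
task = Task(data_bits=400e3, cycles_per_bit=1000.0, deadline_s=0.030)
```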
Preferably, the OMA transmission mode comprises: carrying out data transmission over orthogonal frequency division multiplexing channels.
Preferably, the NOMA transmission mode comprises: when the network capacity is not enough to accommodate the number of users, the newly added users transmit data in NOMA mode, i.e. several users share the same channel, and the network capacity is increased at the cost of higher transmission power.
Preferably, the training of the neural network comprises: the neural network is trained with a twin delayed deterministic policy gradient algorithm, comprising the following steps:
step 1: initializing the parameters of the networks, wherein the neural network comprises a policy network and a value network;
step 2: in each time slot, the macro base station acting as the agent acquires the current environment state information, wherein the information comprises the number of occupied channels in the network, the users' computation task information, the computing resources of each micro base station server, and the channel states between the micro base stations and the users;
step 3: inputting the current environment state information into the policy network to obtain the task action; the task actions comprise the user offloading decision, power control, and computing resource allocation actions;
step 4: calculating the immediate reward of the agent according to the task action and the current network state information;
step 5: the agent storing the current network state, the task action, the immediate reward, and the network state at the next moment into the prioritized experience replay array as a quadruple;
step 6: training the policy network and the value network using the data in the prioritized experience replay array.
Further, the calculation expression of the immediate reward function is as follows:

r_t(s_t, a_t) = -Σ_{u∈U} [E_u(t) + u(t)]

wherein r_t represents the immediate reward function at time t, s_t represents the system environment state information at time t, a_t represents the task action at time t, U represents the user set, E_u(t) represents the total energy consumed by the u-th user at time t, and u(t) represents the penalty function, which takes different values depending on whether the task execution delay T_u(t) exceeds the threshold τ_u(t); υ_1 and υ_2 are two positive real numbers satisfying υ_1 ≤ E_u(t) ≤ υ_2.
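A minimal sketch of this reward, assuming the penalty is a fixed positive constant added whenever a task misses its deadline (the exact form of the penalty function is not fully specified above):

```python
def immediate_reward(energy, delay, deadline, penalty):
    """r_t = -(total user energy + penalty terms for missed deadlines) for one slot."""
    total = 0.0
    for e_u, t_u, tau_u in zip(energy, delay, deadline):
        total += e_u                 # E_u(t): energy consumed by user u
        if t_u > tau_u:              # T_u(t) > tau_u(t): deadline violated
            total += penalty         # penalty term u(t) (assumed constant here)
    return -total

# Example: three users, one of which misses its deadline
r = immediate_reward(energy=[0.4, 0.6, 0.5], delay=[0.02, 0.06, 0.03],
                     deadline=[0.03, 0.05, 0.05], penalty=2.0)
```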
Preferably, the process of training the policy network includes:
step 611: initializing parameters of a policy network;
step 612: randomly extracting a quadruple from the prioritized experience replay array, inputting the data in the quadruple into the policy network, and obtaining the score of the action in the quadruple, q_t = Q(s_t, a_t; θ_Q);
step 613: acquiring the target network corresponding to the value network after its parameters are updated;
step 614: updating the parameters of the policy network with a gradient descent algorithm according to the parameters of the value network;
step 615: recalculating the action score according to the updated parameters; the training of the policy network is finished when the action score is maximized.
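A sketch of steps 612 to 615 in PyTorch-style code, assuming `actor` and `critic` are torch modules with the forward signatures `actor(states)` and `critic(states, actions)` (these signatures are assumptions of the sketch): maximizing the score q_t is implemented as gradient descent on its negative.

```python
import torch

def update_actor(actor, critic, states, actor_optimizer):
    """One policy-network update: raise the critic's score Q(s, mu(s)) (steps 612-615)."""
    actions = actor(states)                        # a_t = mu(s_t; theta_mu)
    actor_loss = -critic(states, actions).mean()   # maximizing q_t == minimizing -q_t
    actor_optimizer.zero_grad()
    actor_loss.backward()
    actor_optimizer.step()
    return -actor_loss.item()                      # average score of the sampled actions
```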
Preferably, the process of training the value network includes:
step 621: initializing parameters of a value network;
step 622: extracting a set of experience tuples (s_t, a_t, r_t, s_{t+1}) from the experience replay array, and inputting the extracted experience tuples into the value network to obtain the task evaluation at time t, q_t = Q(s_t, a_t; θ_Q), and the task evaluation at time t+1, q_{t+1} = Q(s_{t+1}, a_{t+1}; θ_Q);
step 623: acquiring the target network after the policy network parameters are updated, calculating the task action at the next moment according to the acquired target network, and adding noise to the task action;
step 624: calculating two TD targets according to the task action at the next moment after the noise is added, and selecting the minimum of the two TD targets; the expressions are:

y_1 = r_t + γ·Q(s_{t+1}, a_{t+1}; θ_{Q1'})

y_2 = r_t + γ·Q(s_{t+1}, a_{t+1}; θ_{Q2'})

y_t = min(y_1, y_2)

wherein y_1 represents the TD target of value network 1, r_t represents the immediate reward, γ represents the discount factor, Q represents the value network, s_{t+1} represents the state at the next moment, a_{t+1} represents the action at the next moment, θ_{Q1'} represents the parameters of target value network 1, y_2 represents the TD target of value network 2, and θ_{Q2'} represents the parameters of target value network 2;
step 625: calculating a loss function of the model according to the TD target;
step 626: and updating parameters of the model by adopting a gradient descent algorithm, and finishing the training of the model when the loss function is minimum.
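A sketch of steps 621 to 626, again in PyTorch style under the same assumed module signatures; the noise scale, clipping range and discount factor are illustrative values, not values given in the patent:

```python
import torch

def update_value_networks(critic1, critic2, target_critic1, target_critic2, target_actor,
                          batch, opt1, opt2, gamma=0.99, noise_std=0.2, noise_clip=0.5):
    """Clipped double-Q update of the two value networks with target policy smoothing."""
    s, a, r, s_next = batch                                    # sampled quadruples
    with torch.no_grad():
        noise = (torch.randn_like(a) * noise_std).clamp(-noise_clip, noise_clip)
        a_next = target_actor(s_next) + noise                  # a_{t+1} = mu(s_{t+1}; theta_mu') + xi
        y1 = r + gamma * target_critic1(s_next, a_next)        # TD target of value network 1
        y2 = r + gamma * target_critic2(s_next, a_next)        # TD target of value network 2
        y = torch.min(y1, y2)                                  # take the smaller TD target
    loss = ((critic1(s, a) - y) ** 2).mean() + ((critic2(s, a) - y) ** 2).mean()
    opt1.zero_grad(); opt2.zero_grad()
    loss.backward()
    opt1.step(); opt2.step()
    return loss.item()
```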
Further, the formula for updating the parameters of the model with the gradient descent algorithm is:

θ_Q ← θ_Q − α · (1/|B|) · Σ_{b∈B} [Q(s_b, a_b; θ_Q) − y_b] · ∇_{θ_Q} Q(s_b, a_b; θ_Q)

wherein Q(s_b, a_b; θ_Q) − y_b is the TD error, α is the learning rate, |B| represents the number of experience tuples extracted from the experience replay array, ∇_{θ_Q} represents taking the gradient with respect to the value network parameters, Q represents the value network, b indexes a single experience tuple, s_b represents the state of the experience tuple, a_b represents the action of the experience tuple, and θ_Q represents the value network parameters.
The invention has the beneficial effects that:
the invention provides a mixed NOMA-MEC system, which can be used for transmitting tasks by a user in a NOMA mode when the number of the users is large and the original network cannot accommodate the users. An optimization problem is presented that minimizes energy consumption while maximizing user capacity. The traditional optimization algorithm is relatively laboursome in solving dynamic and multidimensional problems, and the invention adopts a double-delay deterministic strategy gradient algorithm with an advanced target to solve the optimization problem. Through simulation experiments, the system based on the hybrid access mode has certain advantages compared with a single access mode system.
Drawings
FIG. 1 is a diagram of a multi-user ultra-dense network scenario;
FIG. 2 is a frame diagram of a deep reinforcement learning algorithm;
FIG. 3 is a flowchart of a deep reinforcement learning algorithm;
FIG. 4 is a graph of the impact of number of users on average energy consumption;
FIG. 5 is a graph of the impact of number of users on task completion rate.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
A specific implementation mode of a dynamic edge computing unloading method based on a hybrid access mode in an ultra-dense network comprises the following steps:
constructing a multi-user ultra-dense network and initializing the network;
the method comprises the steps that user equipment generates a calculation task and sends a task request to a macro base station;
a macro base station receives a task request in a current system and constructs a task model according to the task request;
the macro base station acquires state information of a current network, wherein the state information comprises a task model, computing resources of a base station server, a base station channel state and channel information between the base station and a user;
the controller of the macro base station judges, according to the state information of the current network, whether the network has reached the capacity limit of the OMA transmission mode; if it has, the additional users transmit their tasks in NOMA mode, and if it has not, tasks are transmitted in OMA mode;
the base station controller inputs the state information of the current network into the trained neural network to obtain an unloading decision and a resource allocation scheme;
the macro base station sends the decision scheme to each user and sends the resource allocation scheme to the micro base station; and the user unloads the tasks according to the decision scheme, and the micro base station allocates resources according to the resource allocation scheme.
In a specific embodiment of the dynamic edge computing offloading method based on a hybrid access mode in an ultra-dense network, as shown in FIG. 1, building the system model of the multi-user ultra-dense network comprises: a heterogeneous network scenario in an ultra-dense network with multiple users and multiple base stations is considered. The current user set is U = {1, 2, ..., U}; when the number of users changes, the users that the system cannot accommodate are denoted U⁺ = {1, 2, ..., U⁺}. Each base station is equipped with an MEC server, and an SDN controller is deployed at the macro base station for centralized control. The base station set is denoted N = {1, 2, ..., N}, and each MEC server has a different computing capability. Orthogonal Frequency Division Multiple Access (OFDMA) transmission is normally used: the base station channel is divided into K orthogonal sub-channels with channel set K = {1, 2, ..., K}. Channels are multiplexed between base stations, so the capacity of the network is KN under normal conditions. A discrete time model divides time into a set of time slots of length l, T = {1, 2, ..., T}.
The task arriving at user u in time slot t is denoted Q_u(t) = [d_u(t), c_u(t), τ_u(t)], and Q_{u⁺}(t) denotes the task arriving at user u⁺ in time slot t, where d denotes the input data size (bits), c denotes the computation required per bit (cycles/bit), and τ is the maximum tolerated delay. Because the user positions change continuously, the network scene is modeled in three dimensions: the position of a user is denoted p_u(t) = (x_u, y_u, 0), while the base station positions and heights are fixed, with the position of each base station denoted p_n = (x_n, y_n, H), H being the base station height.
λ_{u,n}(t) denotes the offloading decision of user u: when λ_{u,n}(t) = k, the user chooses to offload its task to the n-th base station for computation and uses the k-th channel for task transmission; λ_{u⁺,n}(t) has the same meaning for a newly added user. When a user has a task to compute, the task is transmitted to the selected base station over the corresponding wireless uplink, the computation is executed by the MEC server of that base station, and after the task is completed the result is sent back to the user equipment over the wireless downlink. Since the result returned in the last phase is much smaller, only the first two phases, the communication phase and the computation phase, are considered.
The transmission rate of a user in the multi-user ultra-dense network is:

R_{u,n}(t) = (B_n / K) · log2(1 + p_{u,n}(t)·h_{u,n}(t) / (I_{u,n}(t) + σ²))

wherein B_n represents the fixed bandwidth of base station n, p_{u,n}(t) is the transmit power of user equipment u, σ² represents the Additive White Gaussian Noise (AWGN) power, and h_{u,n}(t) is the channel gain between the user and the base station.
In an ultra-dense network, interference between devices cannot be ignored since multiple base stations reuse the spectrum. I_{u,n}(t) is the interference experienced by the current device on the current channel:

I_{u,n}(t) = Σ_{n'≠n} Σ_{u'} p_{u',n'}(t) · h_{u',n}(t)

wherein p_{u',n'}(t) is the transmit power of other users using the current channel, and h_{u',n}(t) represents the channel gain between user u' on the current channel and base station n.
From the user's transmission rate, the transmission delay of offloaded computation is:

T^{tr}_{u,n}(t) = d_u(t) / R_{u,n}(t)

wherein d_u(t) represents the task data size.
The time consumed by the computation at the server side is:

T^{comp}_{u,n}(t) = d_u(t)·c_u(t) / f_{u,n}(t)

wherein c_u(t) represents the number of CPU cycles per bit required to execute the task, and f_{u,n}(t) represents the computing resource allocated by base station n to user u, satisfying Σ_u f_{u,n}(t) ≤ F_n, where F_n is the maximum computational capability of base station n.
The total delay of user u can be expressed as:

T_u(t) = T^{tr}_{u,n}(t) + T^{comp}_{u,n}(t)

The energy consumed by the user in uploading the task is:

E_u(t) = p_{u,n}(t) · T^{tr}_{u,n}(t)

Since only the user experience is considered, the computing energy consumption of the server is ignored, and the energy consumed in uploading the task is the total energy consumption of user u.
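For illustration, the OMA rate, delay and upload-energy expressions above can be evaluated as follows; the per-sub-channel bandwidth B_n/K and the numeric values in the example are assumptions of the sketch:

```python
import math

def oma_offload_cost(d_bits, c_cycles_per_bit, p_tx, h_gain, interference,
                     noise_power, bandwidth_hz, num_subchannels, f_alloc_hz):
    """Total delay and upload energy of one user offloading over an OMA sub-channel."""
    rate = (bandwidth_hz / num_subchannels) * math.log2(
        1 + p_tx * h_gain / (interference + noise_power))
    t_tx = d_bits / rate                               # transmission delay
    t_comp = d_bits * c_cycles_per_bit / f_alloc_hz    # server-side computation delay
    energy = p_tx * t_tx                               # upload energy (server energy ignored)
    return t_tx + t_comp, energy

# Example: 400 kbit task, 1000 cycles/bit, 0.5 W, B = 10 MHz over K = 4 sub-channels, 5 GHz allocated
delay, energy = oma_offload_cost(400e3, 1000.0, 0.5, 1e-6, 0.0, 1e-13, 10e6, 4, 5e9)
```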
At some point the number of users grows beyond what the system can accommodate, and newly joined users must offload their tasks over the same channels as users that have already started transmitting. After the base station to offload to has been determined by a given metric, a sub-channel to reuse must also be selected. According to the NOMA transmission principle and the SIC decoding rule, the receiver first decodes the user with the larger channel gain and treats the signal of the user with the smaller channel gain as interference; therefore, in order not to affect the experience of the earlier user, the newly added user should select a channel on which its channel gain is large relative to the current user's.
The transmission rate of user u⁺ when using NOMA is:

R^{N}_{u⁺,n}(t) = (B_n / K) · log2(1 + p^{N}_{u⁺,n}(t)·h_{u⁺,n}(t) / (I_{u⁺,n}(t) + p_{u,n}(t)·h_{u,n}(t) + σ²))

wherein B_n represents the total bandwidth of the base station, K represents the total number of transmission channels, σ² represents the Gaussian white noise power, p^{N}_{u⁺,n}(t) represents the transmit power when the newly added user uses NOMA transmission, h_{u⁺,n}(t) represents the channel gain, I_{u⁺,n}(t) represents the co-channel interference, and p_{u,n}(t)·h_{u,n}(t) is the interference from the original user on the same channel to the newly added user.
Once the earlier user on the channel has finished transmitting, the newly added user can continue transmitting in OMA mode to save energy, with transmission rate

R^{O}_{u⁺,n}(t) = (B_n / K) · log2(1 + p^{O}_{u⁺,n}(t)·h_{u⁺,n}(t) / (I_{u⁺,n}(t) + σ²))

wherein p^{O}_{u⁺,n}(t) represents the transmit power when the newly added user uses OMA transmission.
Its transmission time T^{tr}_{u⁺,n}(t) is determined by the part of the task data transmitted at rate R^{N}_{u⁺,n}(t) during the NOMA phase and the remainder transmitted at rate R^{O}_{u⁺,n}(t) afterwards, wherein d_{u⁺}(t) indicates the task data size of the newly added user.
The time consumed by the computation at the server side is:

T^{comp}_{u⁺,n}(t) = d_{u⁺}(t)·c_{u⁺}(t) / f_{u⁺,n}(t)

wherein c_{u⁺}(t) represents the number of CPU cycles per bit required to execute the newly added user's task, f_{u⁺,n}(t) represents the computing resource allocated by base station n to user u⁺, satisfying Σ f_{u⁺,n}(t) ≤ F_n, and F_n is the maximum computational capability of base station n.
The total delay of user u⁺ can be expressed as:

T_{u⁺}(t) = T^{tr}_{u⁺,n}(t) + T^{comp}_{u⁺,n}(t)

The energy consumed by user u⁺ in uploading its task is the transmit power multiplied by the transmission time in each phase; considering only the user experience and ignoring the computing energy consumption of the server, this upload energy is the total energy consumption E_{u⁺}(t) of user u⁺.
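A small sketch of the NOMA-phase rate for a newly added user, where the earlier user's signal on the shared channel is treated as additional interference per the SIC decoding order described above; the B_n/K factor and the numeric values are assumptions of the sketch:

```python
import math

def noma_uplink_rate(p_new, h_new, p_old, h_old, interference,
                     noise_power, bandwidth_hz, num_subchannels):
    """Rate of a newly added user sharing an occupied sub-channel via NOMA."""
    sinr = p_new * h_new / (interference + p_old * h_old + noise_power)
    return (bandwidth_hz / num_subchannels) * math.log2(1 + sinr)

# Newly added user reusing a channel occupied by an earlier user with smaller channel gain
rate = noma_uplink_rate(p_new=1.0, h_new=2e-6, p_old=0.5, h_old=1e-6,
                        interference=0.0, noise_power=1e-13,
                        bandwidth_hz=10e6, num_subchannels=4)
```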
This specific embodiment of the dynamic edge computing offloading method based on a hybrid access mode in an ultra-dense network aims to solve the optimization problem of minimizing user energy consumption while meeting the delay requirements and maximizing the number of admitted users, namely:

P: min_{λ, p, f} Σ_{t=1}^{T} [ Σ_{u=1}^{U} E_u(t) + Σ_{u⁺=1}^{U⁺} E_{u⁺}(t) ]

s.t.
C1: λ_{u,n}(t) ∈ {1, 2, ..., K}, λ_{u⁺,n}(t) ∈ {1, 2, ..., K}
C2: Σ_u G{λ_{u,n}(t) = k} ≤ 1, for every base station n and channel k
C3: Σ_u f_{u,n}(t) + Σ_{u⁺} f_{u⁺,n}(t) ≤ F_n, for every base station n
C4: 0 ≤ p_{u,n}(t) ≤ p_max
C5: 0 ≤ p^{O}_{u⁺,n}(t) ≤ p⁺_max, 0 ≤ p^{N}_{u⁺,n}(t) ≤ p⁺_max
C6: T_u(t) ≤ τ_u(t)
C7: T_{u⁺}(t) ≤ τ_{u⁺}(t)

wherein C1 and C2 are offloading-variable constraints: C1 limits the offloading decision values to the K channels, and C2 indicates that one channel of a base station can be allocated to at most one user, where G{#} = 1 indicates that # is true. C3 is the computing resource constraint: the computing resources allocated by a base station to all users cannot exceed the maximum resources it owns. C4 and C5 are transmit power constraints. C6 and C7 ensure that tasks can be completed on time. T represents the total time, U the total number of users, U⁺ the total number of newly added users, λ the offloading decision variable, p the transmit power, f the computing resource allocation variable, E_u(t) the energy consumption, E_{u⁺}(t) the energy consumption of a newly added user, N the total number of base stations, λ_{u,n}(t) the offloading decision variable, λ_{u⁺,n}(t) the offloading decision variable of a newly added user, K the total number of channels, u a user, u⁺ a newly added user, n a base station, G{·} the indicator function on the offloading decision, f_{u,n}(t) the computing resource allocation variable, f_{u⁺,n}(t) the computing resource allocation variable of a newly added user, F_n the maximum computing resource of base station n, p_{u,n}(t) the transmit power, p_max the maximum transmit power, p^{O}_{u⁺,n}(t) the transmit power when a newly added user uses OMA transmission, p^{N}_{u⁺,n}(t) the transmit power when a newly added user uses NOMA transmission, p⁺_max the maximum transmit power of a newly added user, T_u(t) the task execution delay, τ_u(t) the task execution delay threshold, T_{u⁺}(t) the execution delay of a newly added user's task, and τ_{u⁺}(t) the execution delay threshold of a newly added user's task.
The problem P involves the optimization of three variables: λ is a decision matrix of dimension U × N whose elements are discrete integers no greater than K, while p and f are continuous real vectors over all users. Problem P is therefore a non-convex mixed-integer nonlinear programming problem; moreover, the present work targets a dynamic MEC system, and solving P with conventional optimization algorithms under time-varying conditions is difficult. Reinforcement Learning (RL) is a decision-making method that obtains the maximum return by continuously performing trial-and-error learning in a target environment, observing the feedback and adjusting the strategy. Despite its many advantages, it lacks scalability and is inherently limited to rather low-dimensional problems, mainly because reinforcement learning algorithms share the memory, computational, and sample complexity issues of other learning algorithms. Therefore, to handle high-dimensional decision problems that are difficult for plain reinforcement learning, Deep Reinforcement Learning (DRL) combines the perception capability of deep learning with the decision capability of reinforcement learning, and tackles high-dimensional state and action spaces through function approximation with deep neural networks.
The present invention is intended to solve the above mixed integer nonlinear programming problem using deep reinforcement learning.
The reinforcement learning framework of the present invention comprises an agent, an environment, and three elements: the state space S, which is the set of all possible environment states; the action space A, which is the set of all possible actions of the agent; and the reward function, whereby after the agent performs an action the environment returns a reward r_t = r(s_t, a_t) to the agent. The agent learns continuously in discrete time-slot steps through constant interaction with the environment and the rewards it earns. At each time slot t, the agent observes the environment state s_t ∈ S and takes an action a_t ∈ A; the action of the agent is the output determined by the policy function μ. After performing the action, the environment returns a scalar reward r_t to the agent and transitions to the next state s_{t+1}.
The present invention defines the following elements according to the system model:
(1) State space: at the beginning of each time slot, the macro base station observes the system state of the wireless network, including the details of the tasks requested by all devices, the available computing resources of each base station's MEC server, the channel gains between the users and the base stations, and the channel occupancy. The system state s_t ∈ S can be defined as:

s_t = [V(t), H(t), F(t), J(t)]

where V(t) is the task request feature matrix of all users, containing the data features of all user devices' tasks; H(t) is the channel gain matrix between each base station and each user; F(t) is the computing resource vector of all base station servers, representing the computing resources currently available at each server; and J(t) is the channel occupancy of each base station.
(2) Action space: based on the currently observed system state s_t, the agent selects an action according to the decision variables of problem P. The action a_t can be defined as:

a_t = [λ(t), p(t), f(t)]

where λ(t) is the base station and channel selection action, p(t) is the transmit power allocation action of the user equipment, and f(t) is the computing resource allocation action of the base stations.
(3) Reward: the objective of the joint computation offloading and resource allocation problem P proposed by the invention is to minimize the energy consumption of the users, and the negative of this quantity can be used directly as the immediate reward. The immediate reward function is defined as:

r_t(s_t, a_t) = -Σ_{u∈U} [E_u(t) + u(t)]

where u(t) is a penalty function applied when the task execution delay exceeds its threshold, and υ_1 and υ_2 are two positive real numbers satisfying υ_1 ≤ E_u(t) ≤ υ_2.
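As an illustration of how the state s_t and action a_t could be packed into flat vectors for the neural networks, where the discretization of λ(t) from the actor's continuous output is an assumption of the sketch (the patent does not specify the mapping):

```python
import numpy as np

def build_state(task_features, channel_gains, server_resources, channel_occupancy):
    """s_t = [V(t), H(t), F(t), J(t)] flattened into one observation vector."""
    return np.concatenate([np.ravel(task_features), np.ravel(channel_gains),
                           np.ravel(server_resources), np.ravel(channel_occupancy)])

def split_action(raw_action, num_users, num_bs, num_channels):
    """a_t = [lambda(t), p(t), f(t)]: split the actor output into the three decision groups."""
    lam = raw_action[:num_users]                          # continuous values in [0, 1]
    # Map each value to a (base station, channel) index; this discretization is illustrative.
    offload = np.floor(lam * num_bs * num_channels).clip(0, num_bs * num_channels - 1).astype(int)
    power = raw_action[num_users:2 * num_users]           # transmit power fractions
    cpu = raw_action[2 * num_users:]                      # computing-resource fractions
    return offload, power, cpu
```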
The optimization problem of the invention contains continuous variables, and Deep Deterministic Policy Gradient (DDPG) is the most common continuous control method. It is an Actor-Critic based deep reinforcement learning algorithm consisting of a policy network (actor) μ(s; θ_μ) that approximates the policy function and a value network (critic) Q(s, a; θ_Q) that approximates the action-value function, where θ_μ and θ_Q are the corresponding neural network parameters. The policy network is responsible for outputting actions according to the current state, and its output is a deterministic action; the value network outputs the value of the current state and the action produced by the policy network, i.e., it scores the action given the current state to guide the actor toward better actions. However, DDPG is prone to overestimation during network training; research has tried to alleviate the overestimation with target networks, but the effect is not ideal. The Twin Delayed Deep Deterministic Policy Gradient (TD3) algorithm uses clipped double Q-learning to mitigate the overestimation problem.
TD3 uses two value networks and one policy network, namely Q(s_t, a_t; θ_Q1), Q(s_t, a_t; θ_Q2), and μ(s_t; θ_μ). Each of the three neural networks also has a corresponding target network, Q(s_t, a_t; θ_Q1'), Q(s_t, a_t; θ_Q2'), and μ(s_t; θ_μ'), where θ_Q1', θ_Q2', θ_μ' are the neural network parameters of the corresponding target networks. The network model is shown in FIG. 2.
Experience Replay is an important technique in reinforcement learning that can greatly improve its performance. The records (experiences) of the agent's interactions with the environment are stored in an array, referred to as the experience replay array (Replay Buffer), and used to train the agent. The policy network controls the agent's interaction with the environment, the quadruples (s_t, a_t, r_t, s_{t+1}) are placed into the experience replay array, and a certain amount of experience is then extracted from the array to train the policy network and the value network.
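A minimal replay buffer sketch; for brevity it samples uniformly, whereas the embodiment above stores the quadruples in a prioritized experience replay array:

```python
import random
from collections import deque

class ReplayBuffer:
    """Experience replay array holding (s_t, a_t, r_t, s_{t+1}) quadruples."""

    def __init__(self, capacity=512):
        self.buffer = deque(maxlen=capacity)

    def store(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```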
The process of training the policy network comprises the following steps: given the current state s_t, the policy network outputs an action a_t = μ(s_t; θ_μ), and the value network scores that action based on the current state: q_t = Q(s_t, a_t; θ_Q). The policy network parameters θ_μ influence the action a_t and thereby influence q_t; the aim of training the policy network is to improve θ_μ so that q_t becomes as large as possible. The parameters are updated using the following formula:

θ_μ ← θ_μ + β · (1/|B|) · Σ_{b∈B} ∇_{θ_μ} μ(s_b; θ_μ) · ∇_{a} Q(s_b, a; θ_Q) |_{a=μ(s_b; θ_μ)}

Gradient ascent is used to make the score higher, where β is the learning rate and |B| is the number of experience tuples extracted from the experience replay array.
The process of training the value network includes: the value network acts as the critic, and for its scoring of the actor to become more and more accurate, its scores must be calibrated against the actually observed rewards. The value network is trained mainly with the Temporal Difference (TD) algorithm, which fits the value network to the TD target. A set of experience tuples (s_t, a_t, r_t, s_{t+1}) is extracted from the experience replay array, and the value network first produces the evaluations q_t = Q(s_t, a_t; θ_Q) and q_{t+1} = Q(s_{t+1}, a_{t+1}; θ_Q), where the next action is given by the target policy network with added noise:

a_{t+1} = μ(s_{t+1}; θ_μ') + ξ

where ξ is a random noise variable following a truncated (clipped) normal distribution CN(0, σ², -c, c), i.e. a normal distribution with mean 0 and standard deviation σ whose values fall only in the interval [-c, c]. Using a truncated normal distribution prevents excessive noise, and adding this noise when computing the TD target makes the action output smoother, which reduces errors.
Two TD targets are then computed,

y_1 = r_t + γ·Q(s_{t+1}, a_{t+1}; θ_{Q1'})

y_2 = r_t + γ·Q(s_{t+1}, a_{t+1}; θ_{Q2'})

where γ is the discount factor, and the smaller of the two is taken as the TD target:

y_t = min(y_1, y_2)

where y_1 denotes the TD target of value network 1, r_t the immediate reward, γ the discount factor, Q the value network, s_{t+1} the state at the next moment, a_{t+1} the action at the next moment, θ_{Q1'} the parameters of target value network 1, y_2 the TD target of value network 2, and θ_{Q2'} the parameters of target value network 2.
The loss function of the model is calculated from the TD target as:

L(θ_Q) = (1/|B|) · Σ_{b∈B} [Q(s_b, a_b; θ_Q) − y_b]²

The parameters are updated with the gradient descent method:

θ_Q ← θ_Q − α · (1/|B|) · Σ_{b∈B} δ_b · ∇_{θ_Q} Q(s_b, a_b; θ_Q)

where δ_b = Q(s_b, a_b; θ_Q) − y_b is the TD error and α is the learning rate. Gradient descent makes the loss function smaller, i.e. it brings the evaluation of the value network closer to the TD target.
The Actor-Critic method uses the value network Q to guide the update of the policy network μ. If the value network itself is unreliable, its scores of the actions are inaccurate and do not help improve the policy network. Updating the policy network too eagerly while the value network is still poor not only fails to improve μ, but the changes in μ also make the training of Q unstable. Experiments have shown that the policy network and the three target networks should be updated more slowly than the value networks. Each round of training of a conventional Actor-Critic updates the policy network, the value network, and the target networks once; a better approach is to update the value networks once per round, but update the policy network and the three target networks only every k rounds. The specific algorithm flow is shown in FIG. 3.
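The delayed update schedule can be sketched as below; the Polyak (soft) target update and the coefficient tau are assumptions, since the passage above only states that the policy network and the three target networks are updated every k rounds:

```python
def soft_update(target_net, net, tau=0.005):
    """Slowly track the online network: theta' <- tau*theta + (1 - tau)*theta' (assumed Polyak update)."""
    for tp, p in zip(target_net.parameters(), net.parameters()):
        tp.data.copy_(tau * p.data + (1.0 - tau) * tp.data)

def train_round(step, k, update_value_networks_fn, update_actor_fn, soft_update_fns):
    """Update the value networks every round, the actor and target networks every k rounds."""
    update_value_networks_fn()
    if step % k == 0:
        update_actor_fn()
        for fn in soft_update_fns:
            fn()
```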
Fig. 4 and 5 compare the algorithm of the present invention with conventional algorithms. The main simulation parameters are set as follows. The number of micro base stations in the network is 10. Each user starts a computation task; in the task parameters the data size d is taken from [300, 500] kbit, the number of CPU cycles required per bit is c = 1000 cycles/bit for all tasks, and the maximum tolerated delay is τ ∈ [20, 50] ms. The maximum transmit power of the user equipment is p_max = 2 W. Different base stations have different computing capabilities, F ∈ [15, 25] GHz. The bandwidth of each base station is set to B = 10 MHz, and the channel is divided into K = 4 sub-channels. The additive white Gaussian noise is set to -174 dBm/Hz. The channel gain between a user and a base station follows a free-space path loss model with antenna gain A_d = 4.11 and carrier frequency f_c = 900 MHz, where the path loss depends on the distance between the user equipment and the base station.
The algorithm contains 6 neural networks: two value networks, a policy network, and their corresponding target networks. The value networks and the policy network each consist of four fully connected layers: an input layer, two hidden layers, and an output layer, with 256 neurons per hidden layer. The input layer of the policy network has the size of the state space and its output layer has the size of the action space; the input of the value network is the state together with the action, and its output layer has a single neuron. The hidden layers of the neural networks use the Rectified Linear Unit (ReLU) as the activation function, the output layers use the sigmoid function, and an Adam optimizer updates the neural network parameters. The neural network learning rate is 0.01, and the prioritized experience replay array size is 512.
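A sketch of the described network structure in PyTorch; the state and action dimensions used to instantiate the networks are placeholders, not values from the patent:

```python
import torch
import torch.nn as nn

def make_policy_network(state_dim, action_dim, hidden=256):
    """Policy network: input layer, two 256-neuron hidden layers (ReLU), sigmoid output layer."""
    return nn.Sequential(
        nn.Linear(state_dim, hidden), nn.ReLU(),
        nn.Linear(hidden, hidden), nn.ReLU(),
        nn.Linear(hidden, action_dim), nn.Sigmoid(),
    )

def make_value_network(state_dim, action_dim, hidden=256):
    """Value network: takes the state-action pair and outputs a single score."""
    return nn.Sequential(
        nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
        nn.Linear(hidden, hidden), nn.ReLU(),
        nn.Linear(hidden, 1),
    )

policy = make_policy_network(state_dim=64, action_dim=30)
value = make_value_network(state_dim=64, action_dim=30)
policy_opt = torch.optim.Adam(policy.parameters(), lr=0.01)
value_opt = torch.optim.Adam(value.parameters(), lr=0.01)
```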
Fig. 4 and fig. 5 compare the effects of the three transmission modes. As the figures show, the average energy consumption of task execution increases with the number of users in all three modes, because as the number of users grows the computing resources allocated to each user decrease, and a larger transmit power is required to complete the task within the delay threshold. By comparison, as the number of users increases the energy consumption of the hybrid NOMA transmission method is slightly higher than that of the OMA method, but its advantage in task completion rate is obvious. This is because, as the number grows, users in the OMA method cannot transmit their tasks and can only wait, which easily exceeds the delay threshold, whereas hybrid NOMA can transmit the data in time at the cost of extra energy. Compared with pure NOMA, hybrid NOMA performs slightly worse in task completion rate but saves energy.
The above-mentioned embodiments, which further illustrate the objects, technical solutions and advantages of the present invention, should be understood that the above-mentioned embodiments are only preferred embodiments of the present invention, and should not be construed as limiting the present invention, and any modifications, equivalents, improvements, etc. made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. A dynamic edge computing unloading method based on a hybrid access mode in an ultra-dense network is characterized by comprising the following steps:
constructing a multi-user ultra-dense network and initializing the network;
the user equipment generates a calculation task and sends a task request to the macro base station;
a macro base station receives a task request in a current system and constructs a task model according to the task request;
the macro base station acquires state information of a current network, wherein the state information comprises a task model, computing resources of a base station server, a base station channel state and channel information between the base station and a user;
the controller of the macro base station judges, according to the state information of the current network, whether the network has reached the capacity limit of the OMA transmission mode; if it has, the additional users transmit their tasks in NOMA mode, and if it has not, tasks are transmitted in OMA mode;
the base station controller inputs the state information of the current network into the trained neural network to obtain an unloading decision and a resource allocation scheme;
the macro base station sends the decision scheme to each user and sends the resource allocation scheme to the micro base station; and the user unloads the tasks according to the decision scheme, and the micro base station allocates resources according to the resource allocation scheme.
2. The method for unloading dynamic edge computing based on a hybrid access mode in the ultra-dense network according to claim 1, wherein the multi-user ultra-dense network comprises a macro base station and N micro base stations, and each micro base station is configured with an MEC server to execute computing tasks; each micro base station adopts orthogonal frequency division multiple access to user equipment.
3. The method for dynamic edge computing offloading based on hybrid access in ultra-dense network as claimed in claim 1, wherein constructing the task model comprises: the task of an existing user, Q_u(t) = [d_u(t), c_u(t), τ_u(t)], and the task of a newly added user, Q_{u⁺}(t) = [d_{u⁺}(t), c_{u⁺}(t), τ_{u⁺}(t)], wherein Q_u(t) and Q_{u⁺}(t) are the tasks of the existing user and the newly added user respectively, d_u(t) and d_{u⁺}(t) respectively represent the data size of the existing user's task and of the newly added user's task, c_u(t) and c_{u⁺}(t) respectively represent the number of CPU cycles per bit required to execute the existing user's task and the newly added user's task, and τ_u(t) and τ_{u⁺}(t) respectively represent the delay threshold of the existing user's task and of the newly added user's task.
4. The method of claim 1, wherein the OMA transport mode comprises: and carrying out data transmission by adopting an orthogonal frequency division multiplexing channel.
5. The method of claim 1, wherein the NOMA transmission scheme includes: when the network capacity is not enough to accommodate the number of users, the newly added users adopt the NOMA mode to transmit data, namely a plurality of users use the same channel, and the network capacity is increased by using the mode of increasing transmission power.
6. The method according to claim 1, wherein the training of the neural network comprises: the neural network is trained with a twin delayed deterministic policy gradient algorithm, comprising the following steps:
step 1: initializing the parameters of the networks, wherein the neural network comprises a policy network and a value network;
step 2: in each time slot, the macro base station acting as the agent acquires the current environment state information, wherein the information comprises the number of occupied channels in the network, the users' computation task information, the computing resources of each micro base station server, and the channel states between the micro base stations and the users;
step 3: inputting the current environment state information into the policy network to obtain the task action; the task actions comprise the user offloading decision, power control, and computing resource allocation actions;
step 4: calculating the immediate reward of the agent according to the task action and the current network state information;
step 5: the agent storing the current network state, the task action, the immediate reward, and the network state at the next moment into the prioritized experience replay array as a quadruple;
step 6: training the policy network and the value network using the data in the prioritized experience replay array.
7. The method of claim 6, wherein the calculation expression of the immediate reward function is as follows:

r_t(s_t, a_t) = -Σ_{u∈U} [E_u(t) + u(t)]

wherein r_t represents the immediate reward function at time t, s_t represents the system environment state information at time t, a_t represents the task action at time t, U represents the user set, E_u(t) represents the total energy consumed by the u-th user at time t, and u(t) represents the penalty function, which takes different values depending on whether the task execution delay T_u(t) exceeds the threshold τ_u(t); υ_1 and υ_2 are two positive real numbers satisfying υ_1 ≤ E_u(t) ≤ υ_2.
8. The method of claim 6, wherein the training of the policy network comprises:
step 611: initializing parameters of a policy network;
step 612: random in preferred empirical replay arraysExtracting a quadruple, inputting data in the quadruple into a policy network to obtain the fraction q of actions in quadruple data t =Q(s t ,a t ;θ Q );
Step 613: acquiring a target network after the value network updating parameters are acquired;
step 614: updating the parameters of the policy network by adopting a gradient descent algorithm according to the parameters of the value network;
step 615: and recalculating the action score according to the updated parameters, and finishing the training of the strategy network when the action score is maximum.
9. The method of claim 6, wherein the training of the value network comprises:
step 621: initializing parameters of a value network;
step 622: extracting a set of experience tuples (s_t, a_t, r_t, s_{t+1}) from the experience replay array, and inputting the extracted experience tuples into the value network to obtain the task evaluation at time t, q_t = Q(s_t, a_t; θ_Q), and the task evaluation at time t+1, q_{t+1} = Q(s_{t+1}, a_{t+1}; θ_Q);
step 623: acquiring the target network after the policy network parameters are updated, calculating the task action at the next moment according to the acquired target network, and adding noise to the task action;
step 624: calculating two TD targets according to the task action at the next moment after the noise is added, and selecting the minimum of the two TD targets; the expressions are:

y_1 = r_t + γ·Q(s_{t+1}, a_{t+1}; θ_{Q1'})

y_2 = r_t + γ·Q(s_{t+1}, a_{t+1}; θ_{Q2'})

y_t = min(y_1, y_2)

wherein y_1 represents the TD target of value network 1, r_t represents the immediate reward, γ represents the discount factor, Q represents the value network, s_{t+1} represents the state at the next moment, a_{t+1} represents the action at the next moment, θ_{Q1'} represents the parameters of target value network 1, y_2 represents the TD target of value network 2, and θ_{Q2'} represents the parameters of target value network 2;
step 625: calculating a loss function of the model according to the TD target;
step 626: and updating parameters of the model by adopting a gradient descent algorithm, and finishing the training of the model when the loss function is minimum.
10. The method according to claim 9, wherein the formula for updating the parameters of the model by using the gradient descent algorithm is as follows:

θ_Q ← θ_Q − α · (1/|B|) · Σ_{b∈B} [Q(s_b, a_b; θ_Q) − y_b] · ∇_{θ_Q} Q(s_b, a_b; θ_Q)

wherein Q(s_b, a_b; θ_Q) − y_b is the TD error, α is the learning rate, |B| represents the number of experience tuples extracted from the experience replay array, ∇_{θ_Q} represents taking the gradient with respect to the value network parameters, Q represents the value network, b indexes a single experience tuple, s_b represents the state of the experience tuple, a_b represents the action of the experience tuple, and θ_Q represents the value network parameters.
CN202210299457.1A 2022-03-25 2022-03-25 Dynamic edge computing unloading method based on hybrid access mode in ultra-dense network Pending CN114885422A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210299457.1A CN114885422A (en) 2022-03-25 2022-03-25 Dynamic edge computing unloading method based on hybrid access mode in ultra-dense network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210299457.1A CN114885422A (en) 2022-03-25 2022-03-25 Dynamic edge computing unloading method based on hybrid access mode in ultra-dense network

Publications (1)

Publication Number Publication Date
CN114885422A true CN114885422A (en) 2022-08-09

Family

ID=82668159

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210299457.1A Pending CN114885422A (en) 2022-03-25 2022-03-25 Dynamic edge computing unloading method based on hybrid access mode in ultra-dense network

Country Status (1)

Country Link
CN (1) CN114885422A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115499441A (en) * 2022-09-15 2022-12-20 中原工学院 Deep reinforcement learning-based edge computing task unloading method in ultra-dense network
CN116939668A (en) * 2023-09-15 2023-10-24 清华大学 Method and device for distributing communication resources of vehicle-mounted WiFi-cellular heterogeneous network
CN116939668B (en) * 2023-09-15 2023-12-12 清华大学 Method and device for distributing communication resources of vehicle-mounted WiFi-cellular heterogeneous network


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination