CN115640131A - Unmanned aerial vehicle assisted computation migration method based on deep deterministic policy gradient - Google Patents

Unmanned aerial vehicle assisted computation migration method based on deep deterministic policy gradient

Info

Publication number
CN115640131A
Authority
CN
China
Prior art keywords
unmanned aerial
aerial vehicle
network
calculation
agent
Prior art date
Legal status
Pending
Application number
CN202211341446.1A
Other languages
Chinese (zh)
Inventor
陈志江
雷磊
宋晓勤
蒋泽星
唐胜
王执屹
Current Assignee
Nanjing University of Aeronautics and Astronautics
Original Assignee
Nanjing University of Aeronautics and Astronautics
Priority date
Filing date
Publication date
Application filed by Nanjing University of Aeronautics and Astronautics filed Critical Nanjing University of Aeronautics and Astronautics
Priority to CN202211341446.1A priority Critical patent/CN115640131A/en
Publication of CN115640131A publication Critical patent/CN115640131A/en
Pending legal-status Critical Current


Classifications

    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Traffic Control Systems (AREA)

Abstract

The invention provides a computation task offloading algorithm based on deep reinforcement learning for computation-intensive and delay-sensitive mobile services. Constraints such as the flight ranges and flight speeds of the multiple unmanned aerial vehicles and the fairness benefit of the system are considered, and the weighted sum of the average network computation delay and the energy consumption of the unmanned aerial vehicles is minimized. The non-convex, NP-hard problem is converted into a partially observable Markov decision process, and a multi-agent deep deterministic policy gradient algorithm is used to make offloading decisions for the mobile users and to optimize the flight trajectories of the unmanned aerial vehicles. Simulation results show that the algorithm outperforms baseline algorithms in terms of fairness to the mobile service terminals, average system delay, and total energy consumption of the multiple unmanned aerial vehicles.

Description

Unmanned aerial vehicle assisted computation migration method based on deep deterministic policy gradient
Technical Field
The invention belongs to the field of Mobile Edge Computing (MEC), relates to a multi-unmanned-aerial-vehicle assisted mobile edge computing method, and more particularly relates to a computation migration method based on the Multi-Agent Deep Deterministic Policy Gradient (MADDPG).
Background
With the development of 5G technology, computation-intensive applications running on user devices, such as online gaming, VR/AR and telemedicine, will become ever more widespread. These mobile applications typically require large amounts of computing resources and energy, and because a server's coverage is limited, the connection with the server may be interrupted while the user moves. The server to which the task was originally offloaded then cannot deliver the computation result to the user's next position in time, which wastes the server's computing resources and increases the delay and energy consumption of re-uploading and re-offloading the computation task. For offloadable tasks, many studies offload the entire task to the MEC server for execution, but when there are many users or many offloaded tasks, the limited computing resources of the server cause task queuing and increase the offloading computation delay. Thanks to their high mobility and flexibility, Unmanned Aerial Vehicles (UAVs) can assist mobile edge computing in military and civilian areas without relying on infrastructure, especially in remote or disaster-stricken regions. When a natural disaster makes the network infrastructure unavailable, or the number of mobile devices suddenly exceeds the network service capacity, UAVs can act as temporary communication relays or edge computing platforms to enhance wireless coverage and provide computing support in areas with interrupted communication or traffic hot spots. However, the computing resources and power of a UAV are limited, and many key issues remain to be solved to improve the performance of the MEC system, including security [8], task offloading, energy consumption, resource allocation, and user delay performance under various channel scenarios.
In a UAV MEC network, many kinds of variables (such as UAV trajectories, task offloading strategies and computing resource allocation) can be optimized to achieve the desired scheduling objective. Traditional optimization methods need a large number of iterations and much prior knowledge to obtain a near-optimal solution, so they are not suitable for real-time MEC applications in dynamic environments. With the wide application of machine learning in research, many researchers are exploring learning-based MEC scheduling algorithms, and in view of its recent progress, deep reinforcement learning has become a research focus. As the network scale grows, multi-agent deep reinforcement learning provides a distributed perspective for resource management in a multi-UAV MEC network.
The invention provides a UAV-assisted mobile edge computing system that uses the computing resources carried by UAVs to provide offloading services for nearby user devices. The UAV trajectory and offloading optimization problem is solved by a multi-agent deep reinforcement learning method so as to obtain a scalable and effective scheduling strategy: a terminal offloads part of its computation task to a UAV while executing the remaining tasks locally, and the system processing delay and UAV energy consumption are minimized through joint optimization of user scheduling, task offloading ratio, UAV flight angle and flight speed.
Disclosure of Invention
Purpose of the invention: Considering the non-convexity, high-dimensional state space and continuous action space of the problem, a deep reinforcement learning algorithm based on MADDPG is proposed; the algorithm can obtain an optimal computation offloading strategy in a dynamic environment, thereby jointly minimizing the system delay and the energy consumption of the unmanned aerial vehicles.
Technical scheme: Considering the scenario in which multiple users offload computation tasks simultaneously, the system delay and the UAV energy consumption are jointly optimized through reasonable and efficient UAV path planning and offloading decisions. Each unmanned aerial vehicle is regarded as an agent, and a distributed-execution, centralized-training scheme is adopted in which the associated user is selected based on the locally observed state information and the task information obtained in each time slot. A deep reinforcement learning model is established and optimized with the MADDPG algorithm, and the optimal flight trajectory and offloading strategy are obtained from the optimized MADDPG model. The invention is realized by the following technical scheme: an unmanned aerial vehicle assisted computation migration method based on MADDPG, comprising the following steps:
(1) Traditional MEC servers are deployed at base stations or other fixed facilities; here a mobile MEC server is adopted instead, combining unmanned aerial vehicle technology with edge computing;
(2) The user equipment unloads the calculation task to the unmanned aerial vehicle end through wireless communication so as to reduce the calculation delay;
(3) An unmanned aerial vehicle auxiliary user unloading system model, a moving model, a communication model and a calculation model are constructed, and an optimization objective function is given;
(4) The unmanned aerial vehicle acquires a user position set, a task set, service times and channel parameter information in an observation range;
(5) Partially Observable Markov Decision Process (POMDP) modeling is adopted; taking the flight range and safety distance of the unmanned aerial vehicles into account, the flight trajectories of the multiple unmanned aerial vehicles and the computation offloading strategies are jointly optimized based on the positions and task information of the users, and a deep reinforcement learning model is constructed with the goal of minimizing system delay and unmanned aerial vehicle energy consumption while ensuring service fairness for the users;
(6) Considering a continuous state space and a continuous action space, and performing model training of computational migration by using a multi-agent deep reinforcement learning algorithm based on MADDPG;
(7) In the execution stage, the unmanned aerial vehicle obtains an optimal user unloading scheme and a flight track by using a trained model based on the state s (tau) of the current environment;
further, the step (3) comprises the following specific steps:
(3a) Establish a mobile edge computing system model in which unmanned aerial vehicles assist user offloading. The system contains M mobile user devices (MDs) and U unmanned aerial vehicles each carrying an MEC server, represented by the sets M = {1, 2, ..., M} and U = {1, 2, ..., U} respectively. Each unmanned aerial vehicle flies at a fixed altitude H_u. The total duration of one flight mission of the unmanned aerial vehicles is T, which is divided into N time slots of equal length, and the set of time slots is denoted N = {1, 2, ..., N}. Each MD has one computation-intensive task in each time slot τ, denoted S_m(τ) = {D_m(τ), F_m(τ)}, where D_m(τ) is the number of data bits and F_m(τ) is the number of CPU cycles required per bit;
(3b) In each time slot τ, each unmanned aerial vehicle provides computation offloading service for only one terminal device. A user needs to compute only a small part of its task locally and offloads the remaining part to the unmanned aerial vehicle for assisted computation, thereby reducing computation delay and energy consumption; the offloaded fraction of the computation is denoted Δ_{m,u}(τ) ∈ [0, 1]. The offloading decision variables between the unmanned aerial vehicles and the user devices can be expressed as:
D = {α_{m,u}(τ) | u ∈ U, m ∈ M, τ ∈ N}   (Expression 1)
where α_{m,u}(τ) ∈ {0, 1}. α_{m,u}(τ) = 1 indicates that the computation task of device MD_m in time slot τ is assisted by unmanned aerial vehicle UAV_u, with Δ_{m,u}(τ) > 0; α_{m,u}(τ) = 0 indicates that the task is executed only locally, with Δ_{m,u}(τ) = 0. The decision variables must satisfy:
Σ_{m∈M} α_{m,u}(τ) ≤ 1,  ∀u ∈ U, ∀τ ∈ N   (Expression 2)
(3c) Establish the mobility model. Each mobile device moves randomly to a new position in every time slot, and the movement of each device depends on its current speed and direction. Suppose the coordinate of MD_m in time slot τ is c_m(τ) = [x_m(τ), y_m(τ)]; the coordinate in the next slot τ+1 can then be expressed as:
c_m(τ+1) = c_m(τ) + ρ_{1,m} d_max [cos(2π ρ_{2,m}), sin(2π ρ_{2,m})]   (Expression 3)
where d_max is the maximum distance a device can move in one slot, the movement direction and distance follow uniform distributions with ρ_{1,m}, ρ_{2,m} ~ U(0, 1), and the unmanned aerial vehicle serves a terminal considering only its starting position in the slot.
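For illustration, a minimal Python sketch of the per-slot user movement of (3c), assuming the uniform direction/distance update reconstructed in Expression 3; the area size and step length used in the example are illustrative only:

```python
import numpy as np

def move_users(positions, d_max, rng=np.random.default_rng()):
    """Update user coordinates for one slot: each MD moves a uniform
    fraction of d_max in a uniformly random direction (Expression 3)."""
    m = positions.shape[0]
    rho1 = rng.uniform(0.0, 1.0, size=m)   # distance fraction rho_{1,m}
    rho2 = rng.uniform(0.0, 1.0, size=m)   # direction fraction rho_{2,m}
    angle = 2.0 * np.pi * rho2
    step = rho1 * d_max
    return positions + np.stack([step * np.cos(angle),
                                 step * np.sin(angle)], axis=1)

# example: 5 users in a 1000 m x 1000 m area, at most 30 m of movement per slot
users = np.random.default_rng(0).uniform(0, 1000, size=(5, 2))
users = move_users(users, d_max=30.0)
```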
(3d) Each unmanned aerial vehicle flies at altitude H_u, and its discrete position in each time slot is denoted c_u(τ). Suppose UAV_u chooses to fly toward and serve MD_m in time slot τ; its flight direction is β_u(τ) ∈ [0, 2π], its flight speed is v_u(τ) ∈ [0, V_max], and its flight time is t_fly. The energy consumed by the flight of the unmanned aerial vehicle is:
E_u^fly(τ) = μ v_u(τ)^2   (Expression 4)
where μ = 0.5 M_u t_fly and M_u is the total mass of the unmanned aerial vehicle.
(3e) Computation offloading adopts a partial-offloading strategy. The local computation delay of MD_m in slot τ can then be expressed as:
T_m^loc(τ) = (1 − Δ_{m,u}(τ)) D_m(τ) F_m(τ) / f_m   (Expression 5)
where f_m denotes the local computing capability of MD_m (CPU cycles per second).
(3f) The actual UAV-to-ground communication is modeled with a line-of-sight link, and the channel gain h_{m,u}(τ) between the unmanned aerial vehicle and a user follows the free-space path-loss model, which can be expressed as:
h_{m,u}(τ) = g_0 / (H_u^2 + ||c_u(τ) − c_m(τ)||^2)   (Expression 6)
where g_0 is the channel power gain at a reference distance of one meter.
(3g) The instantaneous transmission rate r_{m,u}(τ) between the unmanned aerial vehicle and the ground device is defined as:
r_{m,u}(τ) = B log_2(1 + p_m^up h_{m,u}(τ) / σ^2)   (Expression 7)
where B is the channel bandwidth, p_m^up is the uplink transmit power of the mobile device, and σ^2 is the Gaussian white noise power at the unmanned aerial vehicle.
The data transmission delay of the associated user MD_m is:
T_{m,u}^tr(τ) = Δ_{m,u}(τ) D_m(τ) / r_{m,u}(τ)   (Expression 8)
after the computation task is transmitted, the unmanned aerial vehicle executes an unloading computation task, wherein the time delay and the energy consumption of the unloading computation are respectively as follows:
Figure BSA0000287870260000044
Figure BSA0000287870260000045
wherein f is u Representing the computational power of the drone,
Figure BSA0000287870260000046
denotes the CPU power, κ, of the drone when performing the calculations u =10 -27 Is a chip constant
(3h) Since the output data of typical computation-intensive tasks is much smaller than the input, the delay of the downlink transmission can be ignored. The delay T_m(τ) for user MD_m to complete task S_m(τ) in time slot τ can therefore be expressed as:
T_m(τ) = max{ T_m^loc(τ), T_{m,u}^tr(τ) + T_u^comp(τ) }   (Expression 11)
The total energy consumption of UAV_u for assisting computation offloading in slot τ is:
E_u(τ) = E_u^fly(τ) + E_u^comp(τ)   (Expression 12)
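The per-slot bookkeeping of (3d)-(3h) can be collected into a short sketch. The formulas follow the reconstructed Expressions 4-12, and every numerical value in the example call is an illustrative assumption, not a parameter fixed by this description:

```python
import numpy as np

def slot_cost(D, F, delta, f_m, f_u, dist2, H, g0, B, p_up, sigma2,
              v, mass, t_fly, kappa=1e-27):
    """Delay and UAV energy for one (user, UAV) pair in one slot.
    D: task bits, F: CPU cycles per bit, delta: offloaded fraction."""
    # local computation delay (Expression 5)
    t_loc = (1.0 - delta) * D * F / f_m
    # LoS channel gain and uplink rate (Expressions 6-7)
    h = g0 / (H**2 + dist2)
    rate = B * np.log2(1.0 + p_up * h / sigma2)
    # upload delay and UAV-side computation delay/energy (Expressions 8-10)
    t_tx = delta * D / rate
    t_comp = delta * D * F / f_u
    e_comp = kappa * f_u**2 * delta * D * F
    # flight energy (Expression 4) and per-slot totals (Expressions 11-12)
    e_fly = 0.5 * mass * t_fly * v**2
    return max(t_loc, t_tx + t_comp), e_fly + e_comp

# illustrative numbers: a 1 Mbit task, 1000 cycles/bit, half of it offloaded
t, e = slot_cost(D=1e6, F=1000, delta=0.5, f_m=1e9, f_u=10e9,
                 dist2=200.0**2, H=100.0, g0=1e-5, B=1e6,
                 p_up=0.1, sigma2=1e-13, v=15.0, mass=9.65, t_fly=1.0)
print(f"slot delay {t:.3f} s, UAV energy {e:.1f} J")
```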
(3i) The average delay of user MD_m can be expressed as:
T̄_m = (1/N) Σ_{τ∈N} T_m(τ)   (Expression 13)
and the average computation delay of the system can be calculated as:
T_mean = (1/M) Σ_{m∈M} T̄_m   (Expression 14)
(3j) To ensure user fairness, a fairness index ξ_τ is defined over the accumulated service counts k_m(τ) to measure the fairness of service:
ξ_τ = (Σ_{m∈M} k_m(τ))^2 / (M Σ_{m∈M} k_m(τ)^2)   (Expression 15)
(3k) In summary, the following objective function and constraints can be established:
min_{P,Z} [ φ_t T_mean + φ_e Σ_{τ∈N} Σ_{u∈U} E_u(τ) ]   s.t. C1–C7   (Expression 16)
where P = {β_u(τ), v_u(τ)}, Z = {α_{m,u}(τ), Δ_{m,u}(τ)}, and φ_t and φ_e are weighting parameters. C1 restricts each unmanned aerial vehicle to serve only one user per time slot, C2 and C6 restrict the flight range of the unmanned aerial vehicles, C3 and C4 restrict the flight speed and angle of the unmanned aerial vehicles, C5 states that the computation task can be partially offloaded, C7 guarantees the fairness benefit of the system, and d_safe and ξ_min are the preset minimum safety distance between unmanned aerial vehicles and the minimum fairness index.
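A brief sketch of how the fairness index and the per-slot weighted cost can be evaluated, assuming the Jain-type index reconstructed in Expression 15 and the weighted delay-plus-energy form of Expression 16; the weights phi_t and phi_e in the example are illustrative assumptions:

```python
import numpy as np

def fairness_index(service_counts):
    """Jain-type fairness over accumulated service counts k_m (Expression 15)."""
    k = np.asarray(service_counts, dtype=float)
    return k.sum() ** 2 / (k.size * np.square(k).sum() + 1e-12)

def slot_cost_weighted(delays, uav_energies, phi_t=1.0, phi_e=0.01):
    """Weighted sum of mean user delay and total UAV energy for one slot,
    i.e. the per-slot summand of the objective in Expression 16."""
    return phi_t * float(np.mean(delays)) + phi_e * float(np.sum(uav_energies))

print(fairness_index([3, 3, 2, 4]))                     # close to 1: nearly fair service
print(slot_cost_weighted([0.4, 0.6, 0.5], [900.0, 1100.0]))
```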
Further, the step (5) comprises the following specific steps:
(5a) The multi-UAV assisted computation offloading problem is regarded as a partially observable Markov decision process, described by the tuple {S, A, O, Pr, R}. Multiple agents interact with the environment: based on the current state s_τ, each agent obtains its own observation o_τ ∈ O and takes an action a_τ ∈ A; the environment returns an immediate reward r_τ ∈ R to evaluate the action and transitions to the next state with probability Pr(S_{τ+1} | S_τ, A_τ), where the new state depends only on the current state and the agents' actions. Each agent acts according to a policy π(a_τ | o_τ), and the goal is to learn an optimal policy that maximizes the long-term cumulative reward, which can be expressed as:
max_π E[ Σ_τ γ^τ r_τ ]   (Expression 17)
where γ is the reward discount factor.
(5b) Define the observation space. Each unmanned aerial vehicle has only a limited observation range of radius r_obs, so it can observe only partial state information; the global state and the actions of the other unmanned aerial vehicles are unknown. The information that UAV_u can observe in time slot τ consists of its own position c_u(τ) and the current positions, task information and service counts of the K mobile users within its observation range, denoted k_u(τ). The observation is therefore represented as:
o_u(τ) = {c_u(τ), k_u(τ)}   (Expression 18)
(5c) Define the action space. Based on the observed information, the unmanned aerial vehicle must decide which user to serve in the current slot τ and the offloading ratio Δ_{m,u}(τ), and must determine its flight angle β_u(τ) and flight speed v_u(τ), which can be written as:
a_u(τ) = {m(τ), Δ_{m,u}(τ), β_u(τ), v_u(τ)}   (Expression 19)
(5d) Define the state space. The state of the system can be regarded as the collection of all unmanned aerial vehicle observations:
s(τ) = {o_u(τ) | u ∈ U}   (Expression 20)
(5e) Define the reward. The feedback an agent obtains after executing an action is called the reward; it evaluates how good the action was and guides the agent in updating its policy. Generally the reward function corresponds to the optimization objective: the objective here is to minimize the energy consumption of the unmanned aerial vehicles and the average system computation delay, which is exactly the negative of the reward to be maximized, so the reward after a UAV executes an action is defined as:
r_u(τ) = D_m(τ) · (−T_mean(τ) − ψ E_u(τ) − P_u(τ))   (Expression 21)
where D_m(τ) ∈ [0, 1] is an attenuation coefficient describing the benefit the unmanned aerial vehicle obtains from processing the mobile terminal's offloaded task, calculated as:
D_m(τ) = 1 / (1 + e^{−η (k_m(τ) − β)})   (Expression 22)
where η and β are constants; the function has a sigmoid shape whose input is the accumulated service count of the current user, so the more often a user has already been served, the larger the value and the smaller the reward and benefit. ψ is used to align the numerical scales of the unmanned aerial vehicle energy consumption and the average user delay. P_u(τ) is an additional penalty that is added if, after executing the action, the unmanned aerial vehicle flies out of the area or its distance to any other unmanned aerial vehicle is less than the safe distance.
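A minimal sketch of the per-UAV reward of Expression 21, assuming the sigmoid-type attenuation reconstructed in Expression 22 and an illustrative out-of-bounds/collision penalty; the values of eta, beta, psi and the penalty magnitude are assumptions, not values fixed by this description:

```python
import math

def attenuation(k_m, eta=1.0, beta=5.0):
    """Sigmoid-type attenuation D_m in [0, 1]: grows with the accumulated
    service count k_m, so frequently served users yield smaller rewards."""
    return 1.0 / (1.0 + math.exp(-eta * (k_m - beta)))

def uav_reward(t_mean, e_u, k_m, out_of_range=False, too_close=False,
               psi=1e-3, penalty=10.0):
    """Reward of Expression 21: negative weighted cost scaled by the attenuation,
    plus an extra penalty for leaving the area or violating the safe distance."""
    p_u = penalty if (out_of_range or too_close) else 0.0
    return attenuation(k_m) * (-t_mean - psi * e_u - p_u)

print(uav_reward(t_mean=0.5, e_u=1000.0, k_m=8))
print(uav_reward(t_mean=0.5, e_u=1000.0, k_m=8, out_of_range=True))
```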
(5f) Based on the defined S, A, O and R, a deep reinforcement learning model is built on MADDPG using the actor-critic framework: each agent has its own actor network, critic network and corresponding target networks. The actor network is responsible for producing the agent's policy π(o_u(τ) | θ_u), where θ_u denotes its network parameters; the critic network outputs an estimate of the optimal state-action value function, denoted Q(s(τ), a_1(τ), ..., a_U(τ) | w_u), where w_u denotes its network parameters. The input of the critic network contains the observations and actions of all agents within a time slot, whereas during distributed execution the actor network needs only its own observation as input.
The algorithm learns the Q function and the optimal policy simultaneously. When the critic network is updated, H records are drawn from the experience pool of each agent, and the records belonging to the same time step are concatenated to obtain H joint samples, denoted {s_{t,i}, a_{1,i}, ..., a_{U,i}, r_{1,i}, ..., r_{U,i}, s_{t+1,i} | i = 1, 2, ..., H}. The critic network of each agent is trained with a set of temporal differences, and the loss function for training the Q-value function is defined as:
L(w_u) = (1/H) Σ_{i=1}^{H} ( y_{u,i} − Q(s_{t,i}, a_{1,i}, ..., a_{U,i} | w_u) )^2   (Expression 23)
where y_{u,i} is obtained from Expression 24:
y_{u,i} = r_{u,i} + γ Q′(s_{t+1,i}, a′_1, ..., a′_U | w′_u), with a′_j = π′_j(o_{j,i+1})   (Expression 24)
where Q′ and π′_j denote the critic target network of UAV_u and the actor target networks respectively; the target networks keep delayed copies of the network parameters, which makes the training more stable.
The critic network minimizes this loss to approximate the true Q*, while the actor network updates its parameters by gradient ascent along the deterministic policy gradient of the Q value so as to maximize the action value:
∇_{θ_u} J ≈ (1/H) Σ_{i=1}^{H} ∇_{θ_u} π_u(o_{u,i} | θ_u) ∇_{a_u} Q(s_{t,i}, a_{1,i}, ..., a_u, ..., a_{U,i} | w_u) |_{a_u = π_u(o_{u,i})}   (Expression 25)
Finally, at a fixed interval, the target networks are updated with update rate ς:
θ′_u ← ς θ_u + (1 − ς) θ′_u,   w′_u ← ς w_u + (1 − ς) w′_u   (Expression 26)
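For concreteness, a compact PyTorch sketch of one MADDPG update round in the spirit of Expressions 23-26: centralized critics that see all observations and actions, decentralized actors, and soft target updates. Network sizes, dimensions, batch size and learning rates are illustrative assumptions; a real implementation would add exploration noise and an environment-driven replay buffer.

```python
import torch
import torch.nn as nn

OBS, ACT, N_AGENTS, H = 8, 4, 2, 64          # illustrative dimensions / batch size

def mlp(inp, out, hidden=64):
    return nn.Sequential(nn.Linear(inp, hidden), nn.ReLU(), nn.Linear(hidden, out))

actors    = [mlp(OBS, ACT) for _ in range(N_AGENTS)]
critics   = [mlp(N_AGENTS * (OBS + ACT), 1) for _ in range(N_AGENTS)]
t_actors  = [mlp(OBS, ACT) for _ in range(N_AGENTS)]
t_critics = [mlp(N_AGENTS * (OBS + ACT), 1) for _ in range(N_AGENTS)]
for net, tgt in zip(actors + critics, t_actors + t_critics):
    tgt.load_state_dict(net.state_dict())     # target networks start as copies

a_opt = [torch.optim.Adam(a.parameters(), lr=1e-3) for a in actors]
c_opt = [torch.optim.Adam(c.parameters(), lr=1e-3) for c in critics]
gamma, rho = 0.95, 0.01                        # discount and soft-update rate

# a fake sampled minibatch {s_i, a_i, r_i, s_{i+1}} with per-agent observations
obs      = torch.randn(N_AGENTS, H, OBS)
acts     = torch.randn(N_AGENTS, H, ACT)
rews     = torch.randn(N_AGENTS, H, 1)
next_obs = torch.randn(N_AGENTS, H, OBS)

def joint(o, a):
    """Concatenate all agents' observations and actions into the critic input."""
    return torch.cat([o.transpose(0, 1).reshape(H, -1),
                      a.transpose(0, 1).reshape(H, -1)], dim=1)

for u in range(N_AGENTS):
    # critic update: minimize the TD error against the target y (Expressions 23-24)
    with torch.no_grad():
        next_acts = torch.stack([t_actors[j](next_obs[j]) for j in range(N_AGENTS)])
        y = rews[u] + gamma * t_critics[u](joint(next_obs, next_acts))
    q = critics[u](joint(obs, acts))
    critic_loss = nn.functional.mse_loss(q, y)
    c_opt[u].zero_grad(); critic_loss.backward(); c_opt[u].step()

    # actor update: deterministic policy gradient ascent on Q (Expression 25)
    acts_u = [acts[j] for j in range(N_AGENTS)]
    acts_u[u] = actors[u](obs[u])              # only agent u's action is re-computed
    actor_loss = -critics[u](joint(obs, torch.stack(acts_u))).mean()
    a_opt[u].zero_grad(); actor_loss.backward(); a_opt[u].step()

    # soft update of the target networks (Expression 26)
    for net, tgt in ((actors[u], t_actors[u]), (critics[u], t_critics[u])):
        for p, tp in zip(net.parameters(), tgt.parameters()):
            tp.data.mul_(1.0 - rho).add_(rho * p.data)
```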
further, the step (6) comprises the following specific steps:
(6a) Start the environment simulation and initialize each agent's actor network, critic network and their respective target network parameters;
(6b) Initializing the number of training rounds;
(6c) Updating the position set, the task set and the service times of the user, and the position set and the channel parameters of the unmanned aerial vehicle;
(6d) For each agent, the distributed actor network outputs an action a_u(τ) according to the observation o_u(τ) and obtains an immediate reward r_u(τ) while the environment moves to the next state s_{τ+1}, thereby producing the training data {o_u(τ), a_u(τ), r_u(τ), o_u(τ+1)};
(6e) Storing the training data into respective experience playback pools;
(6f) Each agent randomly samples H training data from the experience playback pool to form a training data set;
(6g) Each agent computes the loss value L(w_u) through its critic network and target network and updates w_u; the deterministic policy gradient is used for gradient ascent, and the actor network parameters θ_u are updated through backpropagation of the neural network;
(6h) When the training times reach the target network updating interval, updating the target network parameters;
(6i) Judge whether convergence is reached; if so, the optimization ends and the optimized deep reinforcement learning model is obtained; otherwise return to step (6c);
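The training procedure of step (6) can be summarized as the following skeleton. Here env, maddpg and their methods (reset, step, act, update, update_targets) are placeholders standing in for an environment and a learner built from the models above; they are assumptions for illustration, not components defined by this description:

```python
import random
from collections import deque

def train(env, maddpg, n_episodes=1000, n_slots=100, batch_size=64,
          target_update_every=100):
    """Skeleton of steps (6a)-(6i): collect experience with the distributed
    actors, then train each agent's critic/actor from its replay buffer."""
    buffers = [deque(maxlen=100_000) for _ in range(maddpg.n_agents)]
    step = 0
    for episode in range(n_episodes):                          # (6b)
        obs = env.reset()                                      # (6c) users/UAVs/channels
        for _ in range(n_slots):
            acts = [maddpg.act(u, obs[u]) for u in range(maddpg.n_agents)]   # (6d)
            next_obs, rewards, done = env.step(acts)
            for u in range(maddpg.n_agents):                   # (6e)
                buffers[u].append((obs[u], acts, rewards[u], next_obs[u]))
            obs = next_obs
            if all(len(b) >= batch_size for b in buffers):
                for u in range(maddpg.n_agents):               # (6f)-(6g)
                    batch = random.sample(buffers[u], batch_size)
                    maddpg.update(u, batch)
            step += 1
            if step % target_update_every == 0:                # (6h)
                maddpg.update_targets()
            if done:
                break
        # (6i): in practice, stop once the episode return has converged
```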
further, the step (7) comprises the following specific steps:
(7a) Train the deep reinforcement learning model with the MADDPG algorithm and input the state information at a given moment;
(7b) Output the optimal action policy a*_u(τ) = π_u(o_u(τ) | θ_u) and obtain the optimal migration strategy and flight trajectory.
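In the execution phase each UAV simply runs its trained actor on its local observation; a sketch with stand-in callables (the observe helper and the tuple layout of the actor output are illustrative assumptions):

```python
def execute_slot(uavs, actors, observe):
    """Distributed execution (7a)-(7b): each UAV feeds its local observation
    o_u(tau) to its trained actor and applies the resulting action
    {served user, offload ratio, flight angle, flight speed}."""
    decisions = []
    for u, actor in zip(uavs, actors):
        o_u = observe(u)                      # position + nearby users' tasks/counts
        m, delta, beta, speed = actor(o_u)
        decisions.append({"uav": u, "user": m, "offload_ratio": delta,
                          "angle": beta, "speed": speed})
    return decisions

# illustrative call with stand-in actors and observation function
print(execute_slot(
    uavs=[0, 1],
    actors=[lambda o: (0, 0.7, 1.57, 12.0), lambda o: (1, 0.5, 3.14, 10.0)],
    observe=lambda u: {"uav": u}))
```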
Beneficial effects: In the large-scale multi-UAV assisted MEC network computation offloading method based on the MADDPG algorithm, under the given constraints the energy consumption of the unmanned aerial vehicles and the average computation delay of the system are reduced as much as possible by jointly optimizing the offloading decisions and the flight trajectories of the unmanned aerial vehicles, and the method performs stably over a series of continuous state spaces and continuous action spaces. Under comparable scenarios, the MADDPG-based deep reinforcement learning algorithm is superior in reducing energy consumption and average task delay.
Drawings
Fig. 1 is a schematic structural diagram of an unmanned aerial vehicle-assisted computation offloading model provided in an embodiment of the present invention;
fig. 2 is a schematic diagram of a POMDP decision process of a multi-drone computation migration algorithm according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of algorithm training based on MADDPG according to an embodiment of the present invention;
fig. 4 is a simulation result diagram of the relationship between energy consumption and computation performance of the unmanned aerial vehicle under the MADDPG algorithm provided by the embodiment of the present invention.
Detailed Description
The core idea of the invention is: a distributed reinforcement learning method is adopted, each unmanned aerial vehicle is regarded as an agent, a deep reinforcement learning model is established and optimized with the MADDPG algorithm, and the optimal migration strategy and flight trajectory are obtained from the optimized model.
The present invention is described in further detail below.
(1) Traditional MEC servers are deployed at base stations or other fixed facilities; here a mobile MEC server is adopted instead, combining unmanned aerial vehicle technology with edge computing;
(2) The user equipment unloads the calculation task to the unmanned aerial vehicle end through wireless communication so as to reduce the calculation delay;
(3) An unmanned aerial vehicle auxiliary user unloading system model, a mobile model, a communication model and a calculation model are constructed, and an optimization objective function is given;
the method comprises the following specific steps:
(3a) Establish a mobile edge computing system model in which unmanned aerial vehicles assist user offloading. The system contains M mobile user devices (MDs) and U unmanned aerial vehicles each carrying an MEC server, represented by the sets M = {1, 2, ..., M} and U = {1, 2, ..., U} respectively. Each unmanned aerial vehicle flies at a fixed altitude H_u. The total duration of one flight mission of the unmanned aerial vehicles is T, which is divided into N time slots of equal length, and the set of time slots is denoted N = {1, 2, ..., N}. Each MD has one computation-intensive task in each time slot τ, denoted S_m(τ) = {D_m(τ), F_m(τ)}, where D_m(τ) is the number of data bits and F_m(τ) is the number of CPU cycles required per bit;
(3b) In each time slot τ, each unmanned aerial vehicle provides computation offloading service for only one terminal device. A user needs to compute only a small part of its task locally and offloads the remaining part to the unmanned aerial vehicle for assisted computation, thereby reducing computation delay and energy consumption; the offloaded fraction of the computation is denoted Δ_{m,u}(τ) ∈ [0, 1]. The offloading decision variables between the unmanned aerial vehicles and the user devices can be expressed as:
D = {α_{m,u}(τ) | u ∈ U, m ∈ M, τ ∈ N}   (Expression 1)
where α_{m,u}(τ) ∈ {0, 1}. α_{m,u}(τ) = 1 indicates that the computation task of device MD_m in time slot τ is assisted by unmanned aerial vehicle UAV_u, with Δ_{m,u}(τ) > 0; α_{m,u}(τ) = 0 indicates that the task is executed only locally, with Δ_{m,u}(τ) = 0. The decision variables must satisfy:
Σ_{m∈M} α_{m,u}(τ) ≤ 1,  ∀u ∈ U, ∀τ ∈ N   (Expression 2)
(3c) Establish the mobility model. Each mobile device moves randomly to a new position in every time slot, and the movement of each device depends on its current speed and direction. Suppose the coordinate of MD_m in time slot τ is c_m(τ) = [x_m(τ), y_m(τ)]; the coordinate in the next slot τ+1 can then be expressed as:
c_m(τ+1) = c_m(τ) + ρ_{1,m} d_max [cos(2π ρ_{2,m}), sin(2π ρ_{2,m})]   (Expression 3)
where d_max is the maximum distance a device can move in one slot, the movement direction and distance follow uniform distributions with ρ_{1,m}, ρ_{2,m} ~ U(0, 1), and the unmanned aerial vehicle serves a terminal considering only its starting position in the slot.
(3d) Each unmanned aerial vehicle flies at altitude H_u, and its discrete position in each time slot is denoted c_u(τ). Suppose UAV_u chooses to fly toward and serve MD_m in time slot τ; its flight direction is β_u(τ) ∈ [0, 2π], its flight speed is v_u(τ) ∈ [0, V_max], and its flight time is t_fly. The energy consumed by the flight of the unmanned aerial vehicle is:
E_u^fly(τ) = μ v_u(τ)^2   (Expression 4)
where μ = 0.5 M_u t_fly and M_u is the total mass of the unmanned aerial vehicle.
(3e) Computation offloading adopts a partial-offloading strategy. The local computation delay of MD_m in slot τ can then be expressed as:
T_m^loc(τ) = (1 − Δ_{m,u}(τ)) D_m(τ) F_m(τ) / f_m   (Expression 5)
where f_m denotes the local computing capability of MD_m (CPU cycles per second).
(3f) The actual UAV-to-ground communication is modeled with a line-of-sight link, and the channel gain h_{m,u}(τ) between the unmanned aerial vehicle and a user follows the free-space path-loss model, which can be expressed as:
h_{m,u}(τ) = g_0 / (H_u^2 + ||c_u(τ) − c_m(τ)||^2)   (Expression 6)
where g_0 is the channel power gain at a reference distance of one meter.
(3g) The instantaneous transmission rate r_{m,u}(τ) between the unmanned aerial vehicle and the ground device is defined as:
r_{m,u}(τ) = B log_2(1 + p_m^up h_{m,u}(τ) / σ^2)   (Expression 7)
where B is the channel bandwidth, p_m^up is the uplink transmit power of the mobile device, and σ^2 is the Gaussian white noise power at the unmanned aerial vehicle.
The data transmission delay of the associated user MD_m is:
T_{m,u}^tr(τ) = Δ_{m,u}(τ) D_m(τ) / r_{m,u}(τ)   (Expression 8)
after the computation task is transmitted, the unmanned aerial vehicle executes an unloading computation task, wherein the time delay and the energy consumption of the unloading computation are respectively as follows:
Figure BSA0000287870260000098
Figure BSA0000287870260000101
wherein f is u Representing the computational power of the drone,
Figure BSA0000287870260000102
denotes the CPU power, κ, at which the drone performs the calculations u =10 -27 Is a chip constant
(3h) Since the resulting output data volume of various compute-intensive tasks is much smaller than the input, the delay spent on downlink transmission can be neglected, user MD m Completing task S in time slot tau m Time delay T of (tau) m (τ) can be expressed as:
Figure BSA0000287870260000103
unmanned Aerial Vehicle (UAV) u The total energy consumption of the auxiliary computing offload at slot τ is:
Figure BSA0000287870260000104
(3i) The average delay of user MD_m can be expressed as:
T̄_m = (1/N) Σ_{τ∈N} T_m(τ)   (Expression 13)
and the average computation delay of the system can be calculated as:
T_mean = (1/M) Σ_{m∈M} T̄_m   (Expression 14)
(3j) To ensure user fairness, a fairness index ξ_τ is defined over the accumulated service counts k_m(τ) to measure the fairness of service:
ξ_τ = (Σ_{m∈M} k_m(τ))^2 / (M Σ_{m∈M} k_m(τ)^2)   (Expression 15)
(3k) In summary, the following objective function and constraints can be established:
min_{P,Z} [ φ_t T_mean + φ_e Σ_{τ∈N} Σ_{u∈U} E_u(τ) ]   s.t. C1–C7   (Expression 16)
where P = {β_u(τ), v_u(τ)}, Z = {α_{m,u}(τ), Δ_{m,u}(τ)}, and φ_t and φ_e are weighting parameters. C1 restricts each unmanned aerial vehicle to serve only one user per time slot, C2 and C6 restrict the flight range of the unmanned aerial vehicles, C3 and C4 restrict the flight speed and angle of the unmanned aerial vehicles, C5 states that the computation task can be partially offloaded, C7 guarantees the fairness benefit of the system, and d_safe and ξ_min are the preset minimum safety distance between unmanned aerial vehicles and the minimum fairness index.
(4) The unmanned aerial vehicle acquires a user position set, a task set, service times and channel parameter information in an observation range;
(5) Partially Observable Markov Decision Process (POMDP) modeling is adopted; taking the flight range and safety distance of the unmanned aerial vehicles into account, the flight trajectories of the multiple unmanned aerial vehicles and the computation offloading strategies are jointly optimized based on the positions and task information of the users, and a deep reinforcement learning model is constructed with the goal of minimizing system delay and unmanned aerial vehicle energy consumption while ensuring service fairness for the users, with the following specific steps:
(5a) The multi-UAV assisted computation offloading problem is regarded as a partially observable Markov decision process, described by the tuple {S, A, O, Pr, R}. Multiple agents interact with the environment: based on the current state s_τ, each agent obtains its own observation o_τ ∈ O and takes an action a_τ ∈ A; the environment returns an immediate reward r_τ ∈ R to evaluate the action and transitions to the next state with probability Pr(S_{τ+1} | S_τ, A_τ), where the new state depends only on the current state and the agents' actions. Each agent acts according to a policy π(a_τ | o_τ), and the goal is to learn an optimal policy that maximizes the long-term cumulative reward, which can be expressed as:
max_π E[ Σ_τ γ^τ r_τ ]   (Expression 17)
where γ is the reward discount factor.
(5b) Define the observation space. Each unmanned aerial vehicle has only a limited observation range of radius r_obs, so it can observe only partial state information; the global state and the actions of the other unmanned aerial vehicles are unknown. The information that a single unmanned aerial vehicle UAV_u can observe in time slot τ consists of its own position c_u(τ) and the current positions, task information and service counts of the K mobile users within its observation range, denoted k_u(τ). The observation is therefore represented as:
o_u(τ) = {c_u(τ), k_u(τ)}   (Expression 18)
(5c) Define the action space. Based on the observed information, the unmanned aerial vehicle must decide which user to serve in the current slot τ and the offloading ratio Δ_{m,u}(τ), and must determine its flight angle β_u(τ) and flight speed v_u(τ), which can be written as:
a_u(τ) = {m(τ), Δ_{m,u}(τ), β_u(τ), v_u(τ)}   (Expression 19)
(5d) Define the state space. The state of the system can be regarded as the collection of all unmanned aerial vehicle observations:
s(τ) = {o_u(τ) | u ∈ U}   (Expression 20)
(5e) Define the reward. The feedback an agent obtains after executing an action is called the reward; it evaluates how good the action was and guides the agent in updating its policy. Generally the reward function corresponds to the optimization objective: the objective here is to minimize the energy consumption of the unmanned aerial vehicles and the average system computation delay, which is exactly the negative of the reward to be maximized, so the reward after a UAV executes an action is defined as:
r_u(τ) = D_m(τ) · (−T_mean(τ) − ψ E_u(τ) − P_u(τ))   (Expression 21)
where D_m(τ) ∈ [0, 1] is an attenuation coefficient describing the benefit the unmanned aerial vehicle obtains from processing the mobile terminal's offloaded task, calculated as:
D_m(τ) = 1 / (1 + e^{−η (k_m(τ) − β)})   (Expression 22)
where η and β are constants; the function has a sigmoid shape whose input is the accumulated service count of the current user, so the more often a user has already been served, the larger the value and the smaller the reward and benefit. ψ is used to align the numerical scales of the unmanned aerial vehicle energy consumption and the average user delay. P_u(τ) is an additional penalty that is added if, after executing the action, the unmanned aerial vehicle flies out of the area or its distance to any other unmanned aerial vehicle is less than the safe distance.
(5f) Based on the defined S, A, O and R, a deep reinforcement learning model is built on MADDPG using the actor-critic framework: each agent has its own actor network, critic network and corresponding target networks. The actor network is responsible for producing the agent's policy π(o_u(τ) | θ_u), where θ_u denotes its network parameters; the critic network outputs an estimate of the optimal state-action value function, denoted Q(s(τ), a_1(τ), ..., a_U(τ) | w_u), where w_u denotes its network parameters. The input of the critic network contains the observations and actions of all agents within a time slot, whereas during distributed execution the actor network needs only its own observation as input.
The algorithm learns the Q function and the optimal policy simultaneously. When the critic network is updated, H records are drawn from the experience pool of each agent, and the records belonging to the same time step are concatenated to obtain H joint samples, denoted {s_{t,i}, a_{1,i}, ..., a_{U,i}, r_{1,i}, ..., r_{U,i}, s_{t+1,i} | i = 1, 2, ..., H}. The critic network of each agent is trained with a set of temporal differences, and the loss function for training the Q-value function is defined as:
L(w_u) = (1/H) Σ_{i=1}^{H} ( y_{u,i} − Q(s_{t,i}, a_{1,i}, ..., a_{U,i} | w_u) )^2   (Expression 23)
where y_{u,i} is obtained from Expression 24:
y_{u,i} = r_{u,i} + γ Q′(s_{t+1,i}, a′_1, ..., a′_U | w′_u), with a′_j = π′_j(o_{j,i+1})   (Expression 24)
where Q′ and π′_j denote the critic target network of UAV_u and the actor target networks respectively; the target networks keep delayed copies of the network parameters, which makes the training more stable.
The critic network minimizes this loss to approximate the true Q*, while the actor network updates its parameters by gradient ascent along the deterministic policy gradient of the Q value so as to maximize the action value:
∇_{θ_u} J ≈ (1/H) Σ_{i=1}^{H} ∇_{θ_u} π_u(o_{u,i} | θ_u) ∇_{a_u} Q(s_{t,i}, a_{1,i}, ..., a_u, ..., a_{U,i} | w_u) |_{a_u = π_u(o_{u,i})}   (Expression 25)
Finally, at a fixed interval, the target networks are updated with update rate ς:
θ′_u ← ς θ_u + (1 − ς) θ′_u,   w′_u ← ς w_u + (1 − ς) w′_u   (Expression 26)
(6) Considering a continuous state space and a continuous action space, and performing model training of computation migration by using a MADDPG-based multi-agent deep reinforcement learning algorithm, the method comprises the following specific steps:
(6a) Start the environment simulation and initialize each agent's actor network, critic network and their respective target network parameters;
(6b) Initializing the number of training rounds;
(6c) Updating the position set, the task set and the service times of the user, and the position set and the channel parameters of the unmanned aerial vehicle;
(6d) For each agent, the distributed actor network outputs an action a_u(τ) according to the observation o_u(τ) and obtains an immediate reward r_u(τ) while the environment moves to the next state s_{τ+1}, thereby producing the training data {o_u(τ), a_u(τ), r_u(τ), o_u(τ+1)};
(6e) Storing the training data into respective experience playback pools;
(6f) Each agent randomly samples H training data from the experience playback pool to form a training data set;
(6g) Each agent computes the loss value L(w_u) through its critic network and target network and updates w_u; the deterministic policy gradient is used for gradient ascent, and the actor network parameters θ_u are updated through backpropagation of the neural network;
(6h) When the training times reach the target network updating interval, updating the target network parameters;
(6i) Judge whether convergence is reached; if so, the optimization ends and the optimized deep reinforcement learning model is obtained; otherwise return to step (6c);
(7) In the execution stage, the unmanned aerial vehicle obtains an optimal user unloading scheme and a flight track by using a trained model based on the state s (tau) of the current environment;
(7a) Train the deep reinforcement learning model with the MADDPG algorithm and input the state information at a given moment;
(7b) Output the optimal action policy a*_u(τ) = π_u(o_u(τ) | θ_u) and obtain the optimal migration strategy and flight trajectory.
In fig. 1, a model of a mobile edge computing system for drone-assisted user offloading is depicted, where the user offloads computing tasks to drone-assisted computing to reduce latency and energy consumption of the computing.
In fig. 2, the deep reinforcement learning model of the unmanned-aerial-vehicle assisted MEC network is described; the multiple unmanned aerial vehicles, acting as agents, select the current optimal action according to their policies based on the current state and obtain rewards from the environment.
In fig. 3, the training model of the actor-critic framework is described; through centralized training and distributed execution, the critic network can take the behaviors of the other agents into account during training, so that the actor network is evaluated more accurately and the stability of the policy is improved.
In fig. 4, simulation results of the computation performance and energy consumption of the unmanned aerial vehicle under different algorithms are described; optimal power-consumption control under different computation capabilities can be obtained with the MADDPG algorithm, and when the CPU frequency is 12.5 GHz the energy consumption is reduced by 29.16% compared with the baseline and by 8.67% compared with the stochastic policy gradient algorithm.
Those matters not described in detail in the present application are well within the knowledge of those skilled in the art.

Claims (1)

1. An unmanned aerial vehicle assisted computation migration method based on multi-agent deep deterministic policy gradient, characterized by comprising the following steps:
(1) Traditional MEC servers are deployed at base stations or other fixed facilities; here a mobile MEC server is adopted instead, combining unmanned aerial vehicle technology with edge computing, and the user equipment offloads computation tasks to the unmanned aerial vehicle through wireless communication so as to reduce the computation delay;
(2) Constructing an unmanned aerial vehicle auxiliary user unloading system model, a mobile model, a communication model and a calculation model, and giving an optimization objective function;
(3) Partially Observable Markov Decision Process (POMDP) modeling is adopted; taking the flight range and safety distance of the unmanned aerial vehicles into account, the flight trajectories of the multiple unmanned aerial vehicles and the computation offloading strategies are jointly optimized based on the positions and task information of the users, and a deep reinforcement learning model is constructed with the goal of minimizing system delay and unmanned aerial vehicle energy consumption while ensuring service fairness for the users, with the following specific steps:
(3a) The multi-UAV assisted computation offloading problem is regarded as a partially observable Markov decision process, described by the tuple {S, A, O, Pr, R}; multiple agents interact with the environment: based on the current state s_τ, each agent obtains its own observation o_τ ∈ O and takes an action a_τ ∈ A, the environment returns an immediate reward r_τ ∈ R to evaluate the action and transitions to the next state with probability Pr(S_{τ+1} | S_τ, A_τ), and the new state depends only on the current state and the agents' actions; each agent acts according to a policy π(a_τ | o_τ), and the goal is to learn an optimal policy that maximizes the long-term cumulative reward, which can be expressed as:
max_π E[ Σ_τ γ^τ r_τ ]
where γ is the reward discount factor;
(3b) Define the observation space; each unmanned aerial vehicle has only a limited observation range of radius r_obs, so it can observe only partial state information, and the global state and the actions of the other unmanned aerial vehicles are unknown; the information that a single unmanned aerial vehicle UAV_u can observe in time slot τ consists of its own position c_u(τ) and the current positions, task information and service counts of the K mobile users within its observation range, denoted k_u(τ); the observation is therefore represented as:
o_u(τ) = {c_u(τ), k_u(τ)}
(3c) Define the action space; based on the observed information, the unmanned aerial vehicle must decide which user to serve in the current slot τ and the offloading ratio Δ_{m,u}(τ), and must determine its flight angle β_u(τ) and flight speed v_u(τ), which can be written as:
a_u(τ) = {m(τ), Δ_{m,u}(τ), β_u(τ), v_u(τ)}
(3d) Define the state space; the state of the system can be regarded as the collection of all unmanned aerial vehicle observations:
s(τ) = {o_u(τ) | u ∈ U}
(3e) Define the reward; the feedback an agent obtains after executing an action is called the reward and is used to judge how good the action was and to guide the agent in updating its policy; the reward function corresponds to the optimization objective: the objective here is to minimize the energy consumption of the unmanned aerial vehicles and the average system computation delay, which is exactly the negative of the reward to be maximized, so the reward after a UAV executes an action is defined as:
r_u(τ) = D_m(τ) · (−T_mean(τ) − ψ E_u(τ) − P_u(τ))
where D_m(τ) ∈ [0, 1] is an attenuation coefficient describing the benefit the unmanned aerial vehicle obtains from processing the mobile terminal's offloaded task, calculated as:
D_m(τ) = 1 / (1 + e^{−η (k_m(τ) − β)})
where η and β are constants; the function has a sigmoid shape whose input is the accumulated service count of the current user, so the more often a user has already been served, the larger the value and the smaller the reward and benefit; ψ is used to align the numerical scales of the unmanned aerial vehicle energy consumption and the average user delay; P_u(τ) is an additional penalty that is added if, after executing the action, the unmanned aerial vehicle flies out of the area or its distance to any other unmanned aerial vehicle is less than the safe distance;
(3f) Based on the defined S, A, O and R, a deep reinforcement learning model is built on MADDPG using the actor-critic framework: each agent has its own actor network, critic network and corresponding target networks; the actor network is responsible for producing the agent's policy π(o_u(τ) | θ_u), where θ_u denotes its network parameters; the critic network outputs an estimate of the optimal state-action value function, denoted Q(s(τ), a_1(τ), ..., a_U(τ) | w_u), where w_u denotes its network parameters; the input of the critic network contains the observations and actions of all agents within a time slot, whereas during distributed execution the actor network needs only its own observation as input;
the algorithm learns the Q function and the optimal policy simultaneously; when the critic network is updated, H records are drawn from the experience pool of each agent, and the records belonging to the same time step are concatenated to obtain H joint samples, denoted {s_{t,i}, a_{1,i}, ..., a_{U,i}, r_{1,i}, ..., r_{U,i}, s_{t+1,i} | i = 1, 2, ..., H}; the critic network of each agent is trained with a set of temporal differences, and the loss function for training the Q-value function is defined as:
L(w_u) = (1/H) Σ_{i=1}^{H} ( y_{u,i} − Q(s_{t,i}, a_{1,i}, ..., a_{U,i} | w_u) )^2   (23)
where y_{u,i} is obtained from formula (24):
y_{u,i} = r_{u,i} + γ Q′(s_{t+1,i}, a′_1, ..., a′_U | w′_u), with a′_j = π′_j(o_{j,i+1})   (24)
where Q′ and π′_j denote the critic target network of UAV_u and the actor target networks respectively; the target networks keep delayed copies of the network parameters, which makes the training more stable;
the critic network minimizes this loss to approximate the true Q*, while the actor network updates its parameters by gradient ascent along the deterministic policy gradient of the Q value so as to maximize the action value:
∇_{θ_u} J ≈ (1/H) Σ_{i=1}^{H} ∇_{θ_u} π_u(o_{u,i} | θ_u) ∇_{a_u} Q(s_{t,i}, a_{1,i}, ..., a_u, ..., a_{U,i} | w_u) |_{a_u = π_u(o_{u,i})}
finally, at a fixed interval, the target networks are updated with update rate ς:
θ′_u ← ς θ_u + (1 − ς) θ′_u,   w′_u ← ς w_u + (1 − ς) w′_u
(4) Considering a continuous state space and a continuous action space, and performing model training of computational migration by using a multi-agent deep reinforcement learning algorithm based on MADDPG;
(5) In the execution stage, the unmanned aerial vehicle obtains an optimal user unloading scheme and a flight track by using a trained model based on the state s (tau) of the current environment.
CN202211341446.1A 2022-10-28 2022-10-28 Unmanned aerial vehicle auxiliary computing migration method based on depth certainty strategy gradient Pending CN115640131A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211341446.1A CN115640131A (en) 2022-10-28 2022-10-28 Unmanned aerial vehicle auxiliary computing migration method based on depth certainty strategy gradient

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211341446.1A CN115640131A (en) 2022-10-28 2022-10-28 Unmanned aerial vehicle auxiliary computing migration method based on depth certainty strategy gradient

Publications (1)

Publication Number Publication Date
CN115640131A true CN115640131A (en) 2023-01-24

Family

ID=84947041

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211341446.1A Pending CN115640131A (en) 2022-10-28 2022-10-28 Unmanned aerial vehicle auxiliary computing migration method based on depth certainty strategy gradient

Country Status (1)

Country Link
CN (1) CN115640131A (en)


Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116502547B (en) * 2023-06-29 2024-06-04 深圳大学 Multi-unmanned aerial vehicle wireless energy transmission method based on graph reinforcement learning
CN116502547A (en) * 2023-06-29 2023-07-28 深圳大学 Multi-unmanned aerial vehicle wireless energy transmission method based on graph reinforcement learning
CN116546559A (en) * 2023-07-05 2023-08-04 南京航空航天大学 Distributed multi-target space-ground combined track planning and unloading scheduling method and system
CN116546559B (en) * 2023-07-05 2023-10-03 南京航空航天大学 Distributed multi-target space-ground combined track planning and unloading scheduling method and system
US11961409B1 (en) 2023-07-05 2024-04-16 Nanjing University Of Aeronautics And Astronautics Air-ground joint trajectory planning and offloading scheduling method and system for distributed multiple objectives
CN117371761A (en) * 2023-12-04 2024-01-09 集美大学 Intelligent ocean Internet of things task scheduling method, device, equipment and medium
CN117354759B (en) * 2023-12-06 2024-03-19 吉林大学 Task unloading and charging scheduling combined optimization method for multi-unmanned aerial vehicle auxiliary MEC
CN117354759A (en) * 2023-12-06 2024-01-05 吉林大学 Task unloading and charging scheduling combined optimization method for multi-unmanned aerial vehicle auxiliary MEC
CN117376985B (en) * 2023-12-08 2024-03-19 吉林大学 Energy efficiency optimization method for multi-unmanned aerial vehicle auxiliary MEC task unloading under rice channel
CN117376985A (en) * 2023-12-08 2024-01-09 吉林大学 Energy efficiency optimization method for multi-unmanned aerial vehicle auxiliary MEC task unloading under rice channel
CN117553803B (en) * 2024-01-09 2024-03-19 大连海事大学 Multi-unmanned aerial vehicle intelligent path planning method based on deep reinforcement learning
CN117553803A (en) * 2024-01-09 2024-02-13 大连海事大学 Multi-unmanned aerial vehicle intelligent path planning method based on deep reinforcement learning
CN117573383A (en) * 2024-01-17 2024-02-20 南京信息工程大学 Unmanned aerial vehicle resource management method based on distributed multi-agent autonomous decision
CN117573383B (en) * 2024-01-17 2024-03-29 南京信息工程大学 Unmanned aerial vehicle resource management method based on distributed multi-agent autonomous decision

Similar Documents

Publication Publication Date Title
CN115640131A (en) Unmanned aerial vehicle auxiliary computing migration method based on depth certainty strategy gradient
CN113162679B (en) DDPG algorithm-based IRS (intelligent resilient software) assisted unmanned aerial vehicle communication joint optimization method
CN114422056B (en) Space-to-ground non-orthogonal multiple access uplink transmission method based on intelligent reflecting surface
CN111787509B (en) Unmanned aerial vehicle task unloading method and system based on reinforcement learning in edge calculation
CN113032904B (en) Model construction method, task allocation method, device, equipment and medium
CN113395654A (en) Method for task unloading and resource allocation of multiple unmanned aerial vehicles of edge computing system
CN114690799A (en) Air-space-ground integrated unmanned aerial vehicle Internet of things data acquisition method based on information age
CN113254188B (en) Scheduling optimization method and device, electronic equipment and storage medium
CN115037751B (en) Unmanned aerial vehicle-assisted heterogeneous Internet of vehicles task migration and resource allocation method
CN117499867A (en) Method for realizing high-energy-efficiency calculation and unloading through strategy gradient algorithm in multi-unmanned plane auxiliary movement edge calculation
Zeng et al. Joint resource allocation and trajectory optimization in UAV-enabled wirelessly powered MEC for large area
CN116257335A (en) Unmanned plane auxiliary MEC system joint task scheduling and motion trail optimization method
CN117580105B (en) Unmanned aerial vehicle task unloading optimization method for power grid inspection
CN114698125A (en) Method, device and system for optimizing computation offload of mobile edge computing network
CN114079882B (en) Method and device for cooperative calculation and path control of multiple unmanned aerial vehicles
Sobouti et al. Managing sets of flying base stations using energy efficient 3D trajectory planning in cellular networks
CN116208968B (en) Track planning method and device based on federal learning
Termehchi et al. Distributed Safe Multi-Agent Reinforcement Learning: Joint Design of THz-enabled UAV Trajectory and Channel Allocation
Zhang et al. Cybertwin-driven multi-intelligent reflecting surfaces aided vehicular edge computing leveraged by deep reinforcement learning
CN115967430A (en) Cost-optimal air-ground network task unloading method based on deep reinforcement learning
Yu et al. Efficient UAV/Satellite-assisted IoT Task Offloading: A Multi-agent Reinforcement Learning Solution
CN114727323A (en) Unmanned aerial vehicle base station control method and device and model training method and device
Kumar et al. Proximal Policy Optimization based computations offloading for delay optimization in UAV-assisted mobile edge computing
CN114169234B (en) Scheduling optimization method and system for unmanned aerial vehicle auxiliary mobile edge calculation
Cheng et al. Joint Optimization of Multi-UAV Deployment and User Association Via Deep Reinforcement Learning for Long-Term Communication Coverage

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination