CN115640131A - Unmanned aerial vehicle assisted computation migration method based on deep deterministic policy gradient
- Publication number: CN115640131A
- Application number: CN202211341446.1A (filed 2022-10-28; published 2023-01-24)
- Authority: CN (China)
- Prior art keywords: unmanned aerial vehicle, network, calculation, agent
- Legal status: Pending
Abstract
The invention provides a computation task offloading algorithm based on deep reinforcement learning for computation-intensive and delay-sensitive mobile services. Subject to constraints such as the flight ranges and flight speeds of multiple unmanned aerial vehicles and the fairness benefit of the system, the weighted sum of the average network computation delay and the energy consumption of the unmanned aerial vehicles is minimized. The non-convex, NP-hard problem is converted into a partially observable Markov decision process, and a multi-agent deep deterministic policy gradient algorithm is used to make offloading decisions for the mobile users and to optimize the flight trajectories of the unmanned aerial vehicles. Simulation results show that the algorithm outperforms baseline algorithms in terms of mobile-terminal fairness, average system delay, and total energy consumption of the multiple unmanned aerial vehicles.
Description
Technical Field
The invention belongs to the field of Mobile Edge Computing (MEC), relates to a multi-unmanned-aerial-vehicle assisted mobile edge computing method, and more particularly relates to a computation migration method based on the Multi-Agent Deep Deterministic Policy Gradient (MADDPG).
Background
With the development of 5G technology, computation-intensive applications running on user devices, such as online gaming, VR/AR, and telemedicine, are becoming increasingly widespread. These mobile applications typically require large amounts of computing resources and energy, and because a server's coverage is limited, the connection with the server may be interrupted while the user moves. The server to which the task was originally offloaded then cannot deliver the computation result to the user's next location in time, which wastes server computing resources and increases the delay and energy consumed when the user uploads and offloads the computation task again. For offloadable tasks, many studies offload the entire task to the MEC server for execution, but when there are many users or many offloaded tasks, the limited computing resources of the server cause task queuing and increase the offloading computation delay. Owing to their high mobility and flexibility, unmanned aerial vehicles (UAVs) can assist mobile edge computing in military and civilian scenarios without relying on infrastructure, especially in remote or disaster-stricken areas. When a natural disaster makes the network infrastructure unavailable, or the number of mobile devices suddenly exceeds the network service capacity, UAVs can serve as temporary communication relays or edge computing platforms to enhance wireless coverage and provide computing support in areas with communication outages or traffic hotspots. However, the computing resources and power of a UAV are limited, and many key issues remain to be solved to improve the performance of the MEC system, including security [8], task offloading, energy consumption, resource allocation, and user delay performance under various channel scenarios.
In a UAV MEC network, many types of variables (such as UAV trajectories, task offloading strategies, and computing resource allocation) can be optimized to reach a desired scheduling objective. Traditional optimization methods require many iterations and substantial prior knowledge to obtain a near-optimal solution, and are therefore unsuitable for real-time MEC applications in dynamic environments. With the wide application of machine learning in research, many researchers are exploring learning-based MEC scheduling algorithms, and deep reinforcement learning has become a research focus given the recent progress of machine learning. As the network scale grows, multi-agent deep reinforcement learning provides a distributed perspective for resource management in a multi-UAV MEC network.
The invention provides a UAV-assisted mobile edge computing system that uses the computing resources carried by UAVs to provide offloading services for nearby user equipment. The UAV trajectory and offloading optimization problem is solved with a multi-agent deep reinforcement learning method to obtain a scalable and effective scheduling strategy: each terminal offloads part of its computation task to a UAV and executes the remaining part locally, and system processing delay and UAV energy consumption are minimized by jointly optimizing user scheduling, the task offloading ratio, and the UAV flight angle and flight speed.
Disclosure of Invention
The purpose of the invention is as follows: considering the non-convexity, high-dimensional state space, and continuous action space of the problem, a deep reinforcement learning algorithm based on MADDPG is provided. The algorithm can obtain an optimal computation offloading strategy in a dynamic environment, thereby jointly minimizing system delay and UAV energy consumption.
The technical scheme is as follows: considering a scenario in which multiple users offload computation tasks at the same time, system delay and UAV energy consumption are jointly optimized through reasonable and efficient UAV path planning and offloading decisions. Each UAV is regarded as an agent and, using a centralized-training and distributed-execution paradigm, selects its associated user based on locally observed state information and the task information obtained in each time slot. A deep reinforcement learning model is established and optimized with the MADDPG algorithm, and the optimal flight trajectory and offloading strategy are obtained from the optimized MADDPG model. The invention is realized by the following technical scheme. An unmanned aerial vehicle assisted computation migration method based on MADDPG comprises the following steps:
(1) Whereas the traditional MEC server is deployed in a base station or other fixed facility, a mobile MEC server is adopted here, combining UAV technology with edge computing;
(2) The user equipment offloads computation tasks to the UAV through wireless communication to reduce computation delay;
(3) A UAV-assisted user offloading system model, a mobility model, a communication model, and a computation model are constructed, and the optimization objective function is given;
(4) Each UAV acquires the user position set, task set, service counts, and channel parameter information within its observation range;
(5) The problem is modeled as a Partially Observable Markov Decision Process (POMDP); taking the UAV flight range and safety distance into account, the flight trajectories and computation offloading strategies of the multiple UAVs are jointly optimized based on the users' positions and task information, and a deep reinforcement learning model is constructed with the goal of minimizing system delay and UAV energy consumption while ensuring fair service to users;
(6) Considering the continuous state space and continuous action space, the computation migration model is trained with the MADDPG-based multi-agent deep reinforcement learning algorithm;
(7) In the execution stage, each UAV uses the trained model to obtain the optimal user offloading scheme and flight trajectory based on the current environment state s(τ);
Further, step (3) comprises the following specific steps:
(3a) A mobile edge computing system model for UAV-assisted user offloading is established. The system contains M mobile user devices (MDs) and U UAVs each carrying an MEC server, denoted by the sets M = {1, 2, ..., M} and U = {1, 2, ..., U}, respectively. Each UAV flies at a fixed altitude H_u. The total duration of one flight mission is T, which is divided into N time slots of equal length, the set of slots being denoted T = {1, 2, ..., N}. In each time slot τ, each MD has one computation-intensive task, denoted S_m(τ) = {D_m(τ), F_m(τ)}, where D_m(τ) is the task data size in bits and F_m(τ) is the number of CPU cycles required per bit;
(3b) Each UAV provides computation offloading service for only one terminal device in each time slot τ. A user computes only a small part of its task locally and offloads the rest to the UAV for assisted computation, which reduces computation delay and energy consumption; the offloaded proportion of the computation load is denoted Δ_{m,u}(τ) ∈ [0,1]. The offloading decision variables between the UAVs and the user devices can be expressed as:
D = {α_{m,u}(τ) | u ∈ U, m ∈ M, τ ∈ T}    Expression 1
where α_{m,u}(τ) ∈ {0,1}; α_{m,u}(τ) = 1 indicates that the computation task of device MD_m in slot τ is assisted by UAV_u, with Δ_{m,u}(τ) > 0, while α_{m,u}(τ) = 0 indicates that the task is executed only locally, with Δ_{m,u}(τ) = 0. The decision variables must satisfy the single-association constraint that each UAV serves at most one device per slot (cf. constraint C1 below), i.e. Σ_{m∈M} α_{m,u}(τ) ≤ 1 for all u ∈ U and τ ∈ T.
(3c) A mobility model is established. In each time slot each mobile device moves randomly to a new position, and its movement is related to its current speed and angle. Let the coordinate of MD_m at slot τ be c_m(τ) = [x_m(τ), y_m(τ)]; the coordinate at the next slot τ+1 can then be expressed as
c_m(τ+1) = c_m(τ) + ρ_{1,m} d_max [cos(2π ρ_{2,m}), sin(2π ρ_{2,m})],
where d_max is the maximum distance a device can move in one slot, the movement direction and distance are drawn from uniform distributions with ρ_{1,m}, ρ_{2,m} ~ U(0,1), and the UAV serves a terminal considering only the terminal's position at the start of the slot.
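For illustration only, the Python sketch below implements this random mobility update under the assumptions above; the service-area clipping and all variable names are illustrative and not taken from the patent.

```python
import numpy as np

def move_devices(coords, d_max, field_size, rng):
    """One random-walk step per mobile device: distance rho1*d_max, direction 2*pi*rho2."""
    m = coords.shape[0]
    rho1 = rng.uniform(0.0, 1.0, size=m)          # distance factor ~ U(0, 1)
    rho2 = rng.uniform(0.0, 1.0, size=m)          # direction factor ~ U(0, 1)
    angle = 2.0 * np.pi * rho2
    step = rho1[:, None] * d_max * np.stack([np.cos(angle), np.sin(angle)], axis=1)
    return np.clip(coords + step, 0.0, field_size)  # keep devices inside the service area (assumption)

rng = np.random.default_rng(0)
positions = rng.uniform(0.0, 500.0, size=(5, 2))   # 5 devices in a 500 m x 500 m area (illustrative)
positions = move_devices(positions, d_max=10.0, field_size=500.0, rng=rng)
```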
(3d) Each UAV flies at altitude H_u, and its discrete position in time slot τ is denoted c_u(τ). Assume UAV_u chooses to fly toward and serve MD_m in slot τ; its flight direction is β_u(τ) ∈ [0, 2π], its flight speed is v_u(τ) ∈ [0, V_max], and its flight time is t_fly. The energy consumed by the UAV for flying is
E_u^fly(τ) = μ v_u(τ)²,
where μ = 0.5 M_u t_fly and M_u is the total mass of the UAV.
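A minimal sketch of the per-slot UAV update under this model follows; the position-update form c_u(τ+1) = c_u(τ) + v t_fly [cos β, sin β] is an assumption consistent with the stated direction and speed variables, and the example values are illustrative.

```python
import numpy as np

def uav_step(c_u, beta, v, t_fly, mass_u):
    """Advance one UAV by one slot and return (new_position, flight_energy)."""
    new_pos = c_u + v * t_fly * np.array([np.cos(beta), np.sin(beta)])  # assumed kinematic update
    mu = 0.5 * mass_u * t_fly
    e_fly = mu * v ** 2                                                 # E_fly = mu * v^2
    return new_pos, e_fly

# Example: a 9.65 kg UAV flying for 1 s at 15 m/s toward 45 degrees (illustrative numbers).
pos, energy = uav_step(np.array([100.0, 100.0]), beta=np.pi / 4, v=15.0, t_fly=1.0, mass_u=9.65)
```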
(3e) Computation offloading adopts a partial offloading strategy, so the local computation delay of MD_m at slot τ can be expressed as
T_m^loc(τ) = (1 - Δ_{m,u}(τ)) D_m(τ) F_m(τ) / f_m,
where f_m denotes the local computing capability of MD_m (CPU cycles per second).
(3f) The actual UAV-to-ground communication is modeled with a line-of-sight link, and the channel gain h_{m,u}(τ) between the UAV and a user follows a free-space path-loss model, which can be expressed as
h_{m,u}(τ) = g_0 / d_{m,u}(τ)²,
where g_0 is the channel power gain at a reference distance of one meter and d_{m,u}(τ) is the distance between UAV_u and MD_m.
(3g) The instantaneous transmission rate r_{m,u}(τ) between the UAV and the ground device is defined as
r_{m,u}(τ) = B log2( 1 + p_m h_{m,u}(τ) / σ² ),
where B is the channel bandwidth, p_m is the uplink transmit power of the mobile device, and σ² is the power of the additive white Gaussian noise at the UAV.
The delay for the associated user MD_m to transmit its offloaded data is
T_m^tx(τ) = Δ_{m,u}(τ) D_m(τ) / r_{m,u}(τ).
After the task data have been transmitted, the UAV executes the offloaded computation; the delay and energy consumption of the offloaded computation are, respectively,
T_{m,u}^off(τ) = Δ_{m,u}(τ) D_m(τ) F_m(τ) / f_u and E_u^off(τ) = κ_u f_u² Δ_{m,u}(τ) D_m(τ) F_m(τ),
where f_u is the computing capability of the UAV, its CPU power when performing the computation is modeled as κ_u f_u³, and κ_u = 10^-27 is a chip-dependent constant.
(3h) Since the output data volume of computation-intensive tasks is generally much smaller than the input, the delay spent on downlink transmission can be ignored. The delay T_m(τ) for user MD_m to complete task S_m(τ) in slot τ can therefore be expressed as
T_m(τ) = max{ T_m^loc(τ), T_m^tx(τ) + T_{m,u}^off(τ) }.
The total energy consumption of UAV_u for assisting computation offloading in slot τ combines the flight energy and the offloaded-computation energy defined above:
E_u(τ) = E_u^fly(τ) + E_u^off(τ).
(3i) The average delay of user MD_m can be expressed as
T_m^avg = (1/N) Σ_{τ∈T} T_m(τ),
and the system average computation delay in slot τ can be calculated as
T_mean(τ) = (1/M) Σ_{m∈M} T_m(τ).
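The per-slot delay and energy bookkeeping of steps (3e)-(3i) can be summarized in the Python sketch below; it uses the partial-offloading forms reconstructed above (local and offloaded parts executed in parallel, downlink delay ignored) and illustrative parameter values, so it is a sketch rather than a verbatim transcription of the patent's expressions.

```python
import numpy as np

def slot_delay_energy(D, F, delta, f_local, f_uav, dist, g0, B, p_tx, noise, kappa=1e-27):
    """Per-slot task completion delay and UAV computation energy for one MD-UAV pair."""
    h = g0 / dist ** 2                                  # free-space path-loss channel gain
    rate = B * np.log2(1.0 + p_tx * h / noise)          # uplink transmission rate (bit/s)
    t_local = (1.0 - delta) * D * F / f_local           # delay of the locally computed part
    t_tx = delta * D / rate                             # upload delay of the offloaded bits
    t_off = delta * D * F / f_uav                       # UAV computation delay
    t_task = max(t_local, t_tx + t_off)                 # parts run in parallel; downlink ignored
    e_off = kappa * f_uav ** 2 * delta * D * F          # UAV computation energy
    return t_task, e_off

# Example: a 1 Mbit task, 1000 cycles/bit, 60 % offloaded over a 150 m link (illustrative values).
t, e = slot_delay_energy(D=1e6, F=1e3, delta=0.6, f_local=1e9, f_uav=1e10,
                         dist=150.0, g0=1e-4, B=1e6, p_tx=0.1, noise=1e-13)
```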
(3j) To ensure user fairness, a fairness index ξ_τ is defined to measure how evenly the UAVs serve the users.
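The patent text does not reproduce the expression for ξ_τ at this point; a common choice consistent with the cumulative service counts k_m(τ) used later is Jain's fairness index, sketched below as an assumption.

```python
import numpy as np

def jain_fairness(service_counts):
    """Jain's fairness index over per-user cumulative service counts; ranges from 1/M (unfair) to 1 (fair)."""
    k = np.asarray(service_counts, dtype=float)
    if k.sum() == 0.0:
        return 1.0                       # no one served yet: treat as perfectly fair
    return k.sum() ** 2 / (k.size * np.square(k).sum())

print(jain_fairness([3, 3, 3, 3]))       # 1.0
print(jain_fairness([12, 0, 0, 0]))      # 0.25
```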
(3k) In summary, the following objective function and constraints can be established:
min_{P,Z}  φ_t (1/N) Σ_{τ∈T} T_mean(τ) + φ_e Σ_{u∈U} Σ_{τ∈T} E_u(τ)    s.t. C1-C7,
where P = {β_u(τ), v_u(τ)}, Z = {α_{m,u}(τ), Δ_{m,u}(τ)}, and φ_t and φ_e are weighting parameters. C1 restricts each UAV to serving only one user per time slot, C2 and C6 limit the UAV flight range, C3 and C4 limit the UAV flight speed and angle, C6 states that the computation task can be partially offloaded, C7 guarantees the fairness benefit of the system, and d_safe and ξ_min are the preset minimum safety distance between UAVs and the minimum fairness index.
Further, step (5) comprises the following specific steps:
(5a) The multi-UAV-assisted computation offloading problem is treated as a partially observable Markov decision process, defined by the tuple {S, A, O, Pr, R}. Multiple agents interact with the environment: based on the current state s_τ, each agent obtains its own observation o_τ ∈ O and takes an action a_τ ∈ A; the environment produces an instant reward r_τ ∈ R to evaluate the current action and transitions to the next state with probability Pr(S_{τ+1} | S_τ, A_τ), the new state depending only on the current state and the actions of the agents. Each agent acts according to a policy π(a_τ | o_τ), and the goal is to learn an optimal policy that maximizes the long-term cumulative reward, which can be expressed as
J(π) = E[ Σ_{τ∈T} γ^τ r_τ ],
where γ is the reward discount factor.
(5b) The observation space is defined specifically. Each UAV has only a limited observation range, whose radius is set to r_obs, so it can observe only partial state information; the global state information and the actions of the other UAVs are unknown to it. The information UAV_u can observe in time slot τ consists of its own position c_u(τ) together with the current positions, task information, and cumulative service counts, summarized as k_u(τ), of the K mobile users within its observation range. The observation is represented as:
o_u(τ) = {c_u(τ), k_u(τ)}    Expression 18
(5c) The action space is defined specifically. Based on the observed information, the UAV must decide which user m(τ) to serve in the current time slot τ and the offloading ratio Δ_{m,u}(τ), and must determine its flight angle β_u(τ) and flight speed v_u(τ). The action can be written as:
a_u(τ) = {m(τ), Δ_{m,u}(τ), β_u(τ), v_u(τ)}    Expression 19
(5d) The state space is defined. The state of the system can be regarded as the collection of all UAV observations:
s(τ) = {o_u(τ) | u ∈ U}    Expression 20
(5e) The reward is defined specifically. The feedback an agent obtains after executing an action is called the reward; it evaluates how good the action is and guides the agent in updating its policy. In general the reward function corresponds to the optimization objective. The objective here is to minimize UAV energy consumption and the average system computation delay, which is exactly the negative of what the reward maximizes, so the reward obtained after a UAV executes an action is defined as:
r_u(τ) = D_m(τ) · ( -T_mean(τ) - ψ E_u(τ) - P_u(τ) )    Expression 21
where D_m(τ) ∈ [0,1] is an attenuation coefficient defining the benefit the UAV obtains after processing the mobile terminal's offloaded task; it is computed as a sigmoid-type function whose constants are η and β and whose input is the cumulative number of times the current user has been served: the more times the user has been served, the larger the input, the smaller the reward, and the lower the benefit. ψ is used to numerically align the UAV energy consumption with the user average delay. P_u(τ) is an additional penalty that is added if, after executing the action, the UAV flies out of the field or its distance to any other UAV is smaller than the safety distance.
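A minimal sketch of this reward signal is given below; the logistic form used for the attenuation coefficient D_m(τ) is an assumed instance of the "sigmoid-type" function described above (its exact expression is not reproduced here), and the penalty magnitude and ψ value are illustrative.

```python
import numpy as np

def attenuation(service_count, eta=1.0, beta=5.0):
    """Assumed sigmoid-type attenuation: decreases toward 0 as the user's cumulative service count grows."""
    return 1.0 / (1.0 + np.exp(eta * (service_count - beta)))

def uav_reward(t_mean, e_u, service_count, out_of_field, too_close, psi=0.01, penalty=10.0):
    """Per-slot reward r_u = D_m * (-T_mean - psi * E_u - P_u)."""
    p_u = penalty if (out_of_field or too_close) else 0.0
    return attenuation(service_count) * (-t_mean - psi * e_u - p_u)

r = uav_reward(t_mean=0.4, e_u=60.0, service_count=2, out_of_field=False, too_close=False)
```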
(5f) Based on the S, A, O, and R established above, a deep reinforcement learning model is built on MADDPG using an actor-critic framework; each agent has its own actor network, critic network, and corresponding target networks. The actor network is responsible for producing the agent's policy π(o_u(τ) | θ_u), where θ_u denotes its network parameters; the critic network outputs an estimate of the optimal state-action value function, denoted Q(s(τ), a_1(τ), ..., a_U(τ) | w_u), where w_u denotes its network parameters. The input of the critic network includes the observations and actions of all agents in the time slot, whereas during distributed execution the actor network needs only the agent's own observation.
The algorithm learns the Q function and the optimal policy simultaneously. When the critic network is updated, H records are drawn from the experience pool of each agent, and the records with the same index are concatenated to form H joint records, written as {s_{t,i}, a_{1,i}, ..., a_{U,i}, r_{1,i}, ..., r_{U,i}, s_{t+1,i} | i = 1, 2, ..., H}. The critic network of each agent is trained with temporal-difference targets, and the loss function for training the Q-value function is defined as
L(w_u) = (1/H) Σ_i ( y_{u,i} - Q(s_{t,i}, a_{1,i}, ..., a_{U,i} | w_u) )²,
where y_{u,i} is obtained from formula (24):
y_{u,i} = r_{u,i} + γ Q'(s_{t+1,i}, a'_{1,i}, ..., a'_{U,i} | w'_u),  with a'_{j,i} = π'_j(o_{j,t+1,i} | θ'_j),
where Q'(· | w'_u) and π'_j(· | θ'_j) denote the critic target network and the actor target networks, whose parameters are updated with a delay so that training becomes more stable.
The critic network minimizes this loss to approximate the true optimal value Q*, while the actor network updates its parameters by gradient ascent along the deterministic policy gradient of the Q value so as to maximize the action value:
∇_{θ_u} J ≈ (1/H) Σ_i ∇_{θ_u} π_u(o_{u,i} | θ_u) ∇_{a_u} Q(s_{t,i}, a_{1,i}, ..., a_u, ..., a_{U,i} | w_u) |_{a_u = π_u(o_{u,i} | θ_u)}.
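To make the centralized-critic, distributed-actor update concrete, the PyTorch sketch below shows one MADDPG training step consistent with the loss and policy-gradient forms above; the network sizes, Tanh output activation, optimizer choice, and the Agent container are illustrative assumptions rather than details taken from the patent.

```python
import torch
import torch.nn as nn

def mlp(in_dim, out_dim, hidden=128, out_act=None):
    """Two-hidden-layer fully connected network (illustrative architecture)."""
    layers = [nn.Linear(in_dim, hidden), nn.ReLU(),
              nn.Linear(hidden, hidden), nn.ReLU(),
              nn.Linear(hidden, out_dim)]
    if out_act is not None:
        layers.append(out_act)
    return nn.Sequential(*layers)

class Agent:
    """One UAV agent: the actor maps its own observation to an action, the critic scores the joint state-action."""
    def __init__(self, index, obs_dim, act_dim, n_agents, lr=1e-3):
        self.index = index
        joint_dim = n_agents * (obs_dim + act_dim)
        self.actor = mlp(obs_dim, act_dim, out_act=nn.Tanh())
        self.critic = mlp(joint_dim, 1)
        self.target_actor = mlp(obs_dim, act_dim, out_act=nn.Tanh())
        self.target_critic = mlp(joint_dim, 1)
        self.target_actor.load_state_dict(self.actor.state_dict())
        self.target_critic.load_state_dict(self.critic.state_dict())
        self.actor_opt = torch.optim.Adam(self.actor.parameters(), lr=lr)
        self.critic_opt = torch.optim.Adam(self.critic.parameters(), lr=lr)

def maddpg_update(agent, agents, obs, act, rew, next_obs, gamma=0.95):
    """One critic + one actor update for `agent` on a mini-batch of H joint transitions.

    obs, act, next_obs: lists with one (H, dim) tensor per agent; rew: (H, 1) rewards of this agent.
    """
    # Critic update: temporal-difference target built from the target actors and the target critic.
    with torch.no_grad():
        next_act = [a.target_actor(next_obs[j]) for j, a in enumerate(agents)]
        y = rew + gamma * agent.target_critic(torch.cat(next_obs + next_act, dim=1))
    q = agent.critic(torch.cat(obs + act, dim=1))
    critic_loss = nn.functional.mse_loss(q, y)
    agent.critic_opt.zero_grad(); critic_loss.backward(); agent.critic_opt.step()

    # Actor update: deterministic policy gradient, ascending Q with respect to this agent's own action.
    joint_act = [a.actor(obs[j]) if j == agent.index else act[j] for j, a in enumerate(agents)]
    actor_loss = -agent.critic(torch.cat(obs + joint_act, dim=1)).mean()
    agent.actor_opt.zero_grad(); actor_loss.backward(); agent.actor_opt.step()
```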
Further, step (6) comprises the following specific steps (a condensed training-loop sketch follows this list):
(6a) Start the environment simulation and initialize each agent's actor network, critic network, and their respective target network parameters;
(6b) Initialize the number of training episodes;
(6c) Update the users' position set, task set, and service counts, and the UAVs' position set and channel parameters;
(6d) For each agent, the actor network is executed in a distributed manner: given the observation o_u(τ), it outputs the action a_u(τ), obtains the instant reward r_u(τ), and moves to the next state s_{τ+1}, thereby producing the training sample {o_u(τ), a_u(τ), r_u(τ), o_u(τ+1)};
(6e) Store the training samples in the respective experience replay pools;
(6f) Each agent randomly samples H training samples from the experience replay pool to form a training mini-batch;
(6g) Each agent computes the loss value L(w_u) with its critic network and target networks to update w_u, performs gradient ascent along the deterministic policy gradient, and updates the actor network parameters θ_u through back-propagation of the neural network;
(6h) When the number of training steps reaches the target network update interval, update the target network parameters;
(6i) Check whether convergence is reached; if so, the optimization is finished and the optimized deep reinforcement learning model is obtained, otherwise return to step (6c);
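Steps (6a)-(6i) can be assembled into the training loop sketched below; it reuses the hypothetical Agent class and maddpg_update function from the previous sketch, and the environment interface (reset/step returning per-agent observations, rewards, and a done flag), the hard target-network copy every 10 episodes, and all hyper-parameter values are assumptions for illustration.

```python
import collections
import random
import numpy as np
import torch

def train(env, agents, episodes=500, steps_per_episode=100, batch_size=64, buffer_size=100_000):
    """Centralized training with distributed execution, following steps (6a)-(6i)."""
    buffer = collections.deque(maxlen=buffer_size)        # replay pool of joint transitions
    for episode in range(episodes):
        obs = env.reset()                                 # (6c) one local observation per UAV agent
        for _ in range(steps_per_episode):
            with torch.no_grad():                         # (6d) each actor acts on its own observation
                act = [a.actor(torch.as_tensor(o, dtype=torch.float32)).numpy()
                       for a, o in zip(agents, obs)]
            next_obs, rew, done = env.step(act)           # env returns per-agent rewards
            buffer.append((obs, act, rew, next_obs))      # (6e) store the joint transition
            obs = next_obs
            if len(buffer) >= batch_size:                 # (6f)-(6g) sample and update every agent
                samples = random.sample(buffer, batch_size)
                n = len(agents)
                obs_b = [torch.as_tensor(np.stack([s[0][j] for s in samples]), dtype=torch.float32) for j in range(n)]
                act_b = [torch.as_tensor(np.stack([s[1][j] for s in samples]), dtype=torch.float32) for j in range(n)]
                nxt_b = [torch.as_tensor(np.stack([s[3][j] for s in samples]), dtype=torch.float32) for j in range(n)]
                for u, agent in enumerate(agents):
                    rew_b = torch.as_tensor([[s[2][u]] for s in samples], dtype=torch.float32)
                    maddpg_update(agent, agents, obs_b, act_b, rew_b, nxt_b)
            if done:
                break
        if episode % 10 == 0:                             # (6h) periodic (hard) target-network update
            for a in agents:
                a.target_actor.load_state_dict(a.actor.state_dict())
                a.target_critic.load_state_dict(a.critic.state_dict())
```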
Further, step (7) comprises the following specific steps:
(7a) The state information at the current moment is fed into the deep reinforcement learning model trained with the MADDPG algorithm;
(7b) The optimal action strategy is output, from which the optimal migration strategy and flight trajectory are obtained.
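The execution stage then reduces to a forward pass of each trained actor on its own local observation; a minimal sketch under the same assumed Agent and environment interfaces follows.

```python
import torch

def execute(env, agents, steps=100):
    """Execution stage: each UAV applies its trained actor to its local observation only."""
    obs = env.reset()
    trajectory = []
    for _ in range(steps):
        with torch.no_grad():
            act = [a.actor(torch.as_tensor(o, dtype=torch.float32)).numpy() for a, o in zip(agents, obs)]
        obs, _, done = env.step(act)           # act encodes {served user, offload ratio, angle, speed}
        trajectory.append(act)
        if done:
            break
    return trajectory                           # per-slot offloading decisions and flight controls
```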
Beneficial effects: in the MADDPG-based computation offloading method for a large-scale multi-UAV-assisted MEC network, the UAV energy consumption and the system average computation delay are reduced as far as possible, under the given constraints, by jointly optimizing the offloading decisions and the UAV flight trajectories, and the method behaves stably when optimizing over continuous state spaces and continuous action spaces. Under comparable scenarios, the MADDPG-based deep reinforcement learning algorithm outperforms the baselines in reducing energy consumption and average task delay.
Drawings
Fig. 1 is a schematic structural diagram of an unmanned aerial vehicle-assisted computation offloading model provided in an embodiment of the present invention;
fig. 2 is a schematic diagram of a POMDP decision process of a multi-drone computation migration algorithm according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of algorithm training based on MADDPG according to an embodiment of the present invention;
fig. 4 is a simulation result diagram of the relationship between unmanned aerial vehicle energy consumption and computing capability under the MADDPG algorithm provided by the embodiment of the present invention.
Detailed Description
The core idea of the invention is as follows: a distributed reinforcement learning method is adopted, each UAV is regarded as an agent, a deep reinforcement learning model is established and optimized with the MADDPG algorithm, and the optimal migration strategy and flight trajectory are obtained from the optimized model.
The present invention is described in further detail below.
(1) Whereas the traditional MEC server is deployed in a base station or other fixed facility, a mobile MEC server is adopted here, combining UAV technology with edge computing;
(2) The user equipment offloads computation tasks to the UAV through wireless communication to reduce computation delay;
(3) A UAV-assisted user offloading system model, a mobility model, a communication model, and a computation model are constructed, and the optimization objective function is given;
The specific steps are as follows:
(3a) A mobile edge computing system model for UAV-assisted user offloading is established. The system contains M mobile user devices (MDs) and U UAVs each carrying an MEC server, denoted by the sets M = {1, 2, ..., M} and U = {1, 2, ..., U}, respectively. Each UAV flies at a fixed altitude H_u. The total duration of one flight mission is T, which is divided into N time slots of equal length, the set of slots being denoted T = {1, 2, ..., N}. In each time slot τ, each MD has one computation-intensive task, denoted S_m(τ) = {D_m(τ), F_m(τ)}, where D_m(τ) is the task data size in bits and F_m(τ) is the number of CPU cycles required per bit;
(3b) Each UAV provides computation offloading service for only one terminal device in each time slot τ. A user computes only a small part of its task locally and offloads the rest to the UAV for assisted computation, which reduces computation delay and energy consumption; the offloaded proportion of the computation load is denoted Δ_{m,u}(τ) ∈ [0,1]. The offloading decision variables between the UAVs and the user devices can be expressed as:
D = {α_{m,u}(τ) | u ∈ U, m ∈ M, τ ∈ T}    Expression 1
where α_{m,u}(τ) ∈ {0,1}; α_{m,u}(τ) = 1 indicates that the computation task of device MD_m in slot τ is assisted by UAV_u, with Δ_{m,u}(τ) > 0, while α_{m,u}(τ) = 0 indicates that the task is executed only locally, with Δ_{m,u}(τ) = 0. The decision variables must satisfy the single-association constraint that each UAV serves at most one device per slot (cf. constraint C1 below), i.e. Σ_{m∈M} α_{m,u}(τ) ≤ 1 for all u ∈ U and τ ∈ T.
(3c) A mobility model is established. In each time slot each mobile device moves randomly to a new position, and its movement is related to its current speed and angle. Let the coordinate of MD_m at slot τ be c_m(τ) = [x_m(τ), y_m(τ)]; the coordinate at the next slot τ+1 can then be expressed as
c_m(τ+1) = c_m(τ) + ρ_{1,m} d_max [cos(2π ρ_{2,m}), sin(2π ρ_{2,m})],
where d_max is the maximum distance a device can move in one slot, the movement direction and distance are drawn from uniform distributions with ρ_{1,m}, ρ_{2,m} ~ U(0,1), and the UAV serves a terminal considering only the terminal's position at the start of the slot.
(3d) Each UAV flies at altitude H_u, and its discrete position in time slot τ is denoted c_u(τ). Assume UAV_u chooses to fly toward and serve MD_m in slot τ; its flight direction is β_u(τ) ∈ [0, 2π], its flight speed is v_u(τ) ∈ [0, V_max], and its flight time is t_fly. The energy consumed by the UAV for flying is
E_u^fly(τ) = μ v_u(τ)²,
where μ = 0.5 M_u t_fly and M_u is the total mass of the UAV.
(3e) Computation offloading adopts a partial offloading strategy, so the local computation delay of MD_m at slot τ can be expressed as
T_m^loc(τ) = (1 - Δ_{m,u}(τ)) D_m(τ) F_m(τ) / f_m,
where f_m denotes the local computing capability of MD_m (CPU cycles per second).
(3f) The actual UAV-to-ground communication is modeled with a line-of-sight link, and the channel gain h_{m,u}(τ) between the UAV and a user follows a free-space path-loss model, which can be expressed as
h_{m,u}(τ) = g_0 / d_{m,u}(τ)²,
where g_0 is the channel power gain at a reference distance of one meter and d_{m,u}(τ) is the distance between UAV_u and MD_m.
(3g) The instantaneous transmission rate r_{m,u}(τ) between the UAV and the ground device is defined as
r_{m,u}(τ) = B log2( 1 + p_m h_{m,u}(τ) / σ² ),
where B is the channel bandwidth, p_m is the uplink transmit power of the mobile device, and σ² is the power of the additive white Gaussian noise at the UAV.
The delay for the associated user MD_m to transmit its offloaded data is
T_m^tx(τ) = Δ_{m,u}(τ) D_m(τ) / r_{m,u}(τ).
After the task data have been transmitted, the UAV executes the offloaded computation; the delay and energy consumption of the offloaded computation are, respectively,
T_{m,u}^off(τ) = Δ_{m,u}(τ) D_m(τ) F_m(τ) / f_u and E_u^off(τ) = κ_u f_u² Δ_{m,u}(τ) D_m(τ) F_m(τ),
where f_u is the computing capability of the UAV, its CPU power when performing the computation is modeled as κ_u f_u³, and κ_u = 10^-27 is a chip-dependent constant.
(3h) Since the output data volume of computation-intensive tasks is generally much smaller than the input, the delay spent on downlink transmission can be ignored. The delay T_m(τ) for user MD_m to complete task S_m(τ) in slot τ can therefore be expressed as
T_m(τ) = max{ T_m^loc(τ), T_m^tx(τ) + T_{m,u}^off(τ) }.
The total energy consumption of UAV_u for assisting computation offloading in slot τ combines the flight energy and the offloaded-computation energy defined above:
E_u(τ) = E_u^fly(τ) + E_u^off(τ).
(3i) The average delay of user MD_m can be expressed as
T_m^avg = (1/N) Σ_{τ∈T} T_m(τ),
and the system average computation delay in slot τ can be calculated as
T_mean(τ) = (1/M) Σ_{m∈M} T_m(τ).
(3j) To ensure user fairness, a fairness index ξ_τ is defined to measure how evenly the UAVs serve the users.
(3k) In summary, the following objective function and constraints can be established:
min_{P,Z}  φ_t (1/N) Σ_{τ∈T} T_mean(τ) + φ_e Σ_{u∈U} Σ_{τ∈T} E_u(τ)    s.t. C1-C7,
where P = {β_u(τ), v_u(τ)}, Z = {α_{m,u}(τ), Δ_{m,u}(τ)}, and φ_t and φ_e are weighting parameters. C1 restricts each UAV to serving only one user per time slot, C2 and C6 limit the UAV flight range, C3 and C4 limit the UAV flight speed and angle, C6 states that the computation task can be partially offloaded, C7 guarantees the fairness benefit of the system, and d_safe and ξ_min are the preset minimum safety distance between UAVs and the minimum fairness index.
(4) Each UAV acquires the user position set, task set, service counts, and channel parameter information within its observation range;
(5) The problem is modeled as a Partially Observable Markov Decision Process (POMDP); taking the UAV flight range and safety distance into account, the flight trajectories and computation offloading strategies of the multiple UAVs are jointly optimized based on the users' positions and task information, and a deep reinforcement learning model is constructed with the goal of minimizing system delay and UAV energy consumption while ensuring fair service to users. The specific steps are as follows:
(5a) The multi-UAV-assisted computation offloading problem is treated as a partially observable Markov decision process, defined by the tuple {S, A, O, Pr, R}. Multiple agents interact with the environment: based on the current state s_τ, each agent obtains its own observation o_τ ∈ O and takes an action a_τ ∈ A; the environment produces an instant reward r_τ ∈ R to evaluate the current action and transitions to the next state with probability Pr(S_{τ+1} | S_τ, A_τ), the new state depending only on the current state and the actions of the agents. Each agent acts according to a policy π(a_τ | o_τ), and the goal is to learn an optimal policy that maximizes the long-term cumulative reward, which can be expressed as
J(π) = E[ Σ_{τ∈T} γ^τ r_τ ],
where γ is the reward discount factor.
(5b) The observation space is defined specifically. Each UAV has only a limited observation range, whose radius is set to r_obs, so it can observe only partial state information; the global state information and the actions of the other UAVs are unknown to it. The information UAV_u can observe in time slot τ consists of its own position c_u(τ) together with the current positions, task information, and cumulative service counts, summarized as k_u(τ), of the K mobile users within its observation range. The observation is represented as:
o_u(τ) = {c_u(τ), k_u(τ)}    Expression 18
(5c) The action space is defined specifically. Based on the observed information, the UAV must decide which user m(τ) to serve in the current time slot τ and the offloading ratio Δ_{m,u}(τ), and must determine its flight angle β_u(τ) and flight speed v_u(τ). The action can be written as:
a_u(τ) = {m(τ), Δ_{m,u}(τ), β_u(τ), v_u(τ)}    Expression 19
(5d) The state space is defined. The state of the system can be regarded as the collection of all UAV observations:
s(τ) = {o_u(τ) | u ∈ U}    Expression 20
(5e) The reward is defined specifically. The feedback an agent obtains after executing an action is called the reward; it evaluates how good the action is and guides the agent in updating its policy. In general the reward function corresponds to the optimization objective. The objective here is to minimize UAV energy consumption and the average system computation delay, which is exactly the negative of what the reward maximizes, so the reward obtained after a UAV executes an action is defined as:
r_u(τ) = D_m(τ) · ( -T_mean(τ) - ψ E_u(τ) - P_u(τ) )    Expression 21
where D_m(τ) ∈ [0,1] is an attenuation coefficient defining the benefit the UAV obtains after processing the mobile terminal's offloaded task; it is computed as a sigmoid-type function whose constants are η and β and whose input is the cumulative number of times the current user has been served: the more times the user has been served, the larger the input, the smaller the reward, and the lower the benefit. ψ is used to numerically align the UAV energy consumption with the user average delay. P_u(τ) is an additional penalty that is added if, after executing the action, the UAV flies out of the field or its distance to any other UAV is smaller than the safety distance.
(5f) Based on the S, A, O, and R established above, a deep reinforcement learning model is built on MADDPG using an actor-critic framework; each agent has its own actor network, critic network, and corresponding target networks. The actor network is responsible for producing the agent's policy π(o_u(τ) | θ_u), where θ_u denotes its network parameters; the critic network outputs an estimate of the optimal state-action value function, denoted Q(s(τ), a_1(τ), ..., a_U(τ) | w_u), where w_u denotes its network parameters. The input of the critic network includes the observations and actions of all agents in the time slot, whereas during distributed execution the actor network needs only the agent's own observation.
The algorithm learns the Q function and the optimal policy simultaneously. When the critic network is updated, H records are drawn from the experience pool of each agent, and the records with the same index are concatenated to form H joint records, written as {s_{t,i}, a_{1,i}, ..., a_{U,i}, r_{1,i}, ..., r_{U,i}, s_{t+1,i} | i = 1, 2, ..., H}. The critic network of each agent is trained with temporal-difference targets, and the loss function for training the Q-value function is defined as
L(w_u) = (1/H) Σ_i ( y_{u,i} - Q(s_{t,i}, a_{1,i}, ..., a_{U,i} | w_u) )²,
where y_{u,i} is obtained from formula (24):
y_{u,i} = r_{u,i} + γ Q'(s_{t+1,i}, a'_{1,i}, ..., a'_{U,i} | w'_u),  with a'_{j,i} = π'_j(o_{j,t+1,i} | θ'_j),
where Q'(· | w'_u) and π'_j(· | θ'_j) denote the critic target network and the actor target networks, whose parameters are updated with a delay so that training becomes more stable.
The critic network minimizes this loss to approximate the true optimal value Q*, while the actor network updates its parameters by gradient ascent along the deterministic policy gradient of the Q value so as to maximize the action value:
∇_{θ_u} J ≈ (1/H) Σ_i ∇_{θ_u} π_u(o_{u,i} | θ_u) ∇_{a_u} Q(s_{t,i}, a_{1,i}, ..., a_u, ..., a_{U,i} | w_u) |_{a_u = π_u(o_{u,i} | θ_u)}.
(6) Considering the continuous state space and continuous action space, the computation migration model is trained with the MADDPG-based multi-agent deep reinforcement learning algorithm. The specific steps are as follows:
(6a) Start the environment simulation and initialize each agent's actor network, critic network, and their respective target network parameters;
(6b) Initialize the number of training episodes;
(6c) Update the users' position set, task set, and service counts, and the UAVs' position set and channel parameters;
(6d) For each agent, the actor network is executed in a distributed manner: given the observation o_u(τ), it outputs the action a_u(τ), obtains the instant reward r_u(τ), and moves to the next state s_{τ+1}, thereby producing the training sample {o_u(τ), a_u(τ), r_u(τ), o_u(τ+1)};
(6e) Store the training samples in the respective experience replay pools;
(6f) Each agent randomly samples H training samples from the experience replay pool to form a training mini-batch;
(6g) Each agent computes the loss value L(w_u) with its critic network and target networks to update w_u, performs gradient ascent along the deterministic policy gradient, and updates the actor network parameters θ_u through back-propagation of the neural network;
(6h) When the number of training steps reaches the target network update interval, update the target network parameters;
(6i) Check whether convergence is reached; if so, the optimization is finished and the optimized deep reinforcement learning model is obtained, otherwise return to step (6c);
(7) In the execution stage, each UAV uses the trained model to obtain the optimal user offloading scheme and flight trajectory based on the current environment state s(τ);
(7a) The state information at the current moment is fed into the deep reinforcement learning model trained with the MADDPG algorithm;
(7b) The optimal action strategy is output, from which the optimal migration strategy and flight trajectory are obtained.
In fig. 1, the mobile edge computing system model for UAV-assisted user offloading is depicted: users offload computation tasks to the UAVs for assisted computation so as to reduce computation delay and energy consumption.
In fig. 2, the deep reinforcement learning model of the UAV-assisted MEC network is described; the multiple UAVs, acting as agents, select the currently optimal action according to their policies based on the current state and obtain rewards from the environment.
In fig. 3, the training model of the actor-critic framework is described; with centralized training and distributed execution, the critic network can take the behaviors of the other agents into account during training, so that the actor network is evaluated more accurately and the stability of the policy is improved.
In fig. 4, simulation results of UAV energy consumption versus computing capability under different algorithms are described; optimal power consumption control under different computing capabilities can be obtained with the MADDPG-based algorithm, and when the CPU frequency is 12.5 GHz the energy consumption is reduced by 29.16% compared with the baseline and by 8.67% compared with the stochastic policy gradient algorithm.
Those matters not described in detail in the present application are well within the knowledge of those skilled in the art.
Claims (1)
1. An unmanned aerial vehicle assisted computation migration method based on the multi-agent deep deterministic policy gradient, characterized by comprising the following steps:
(1) Whereas the traditional MEC server is deployed in a base station or other fixed facility, a mobile MEC server is adopted here, combining UAV technology with edge computing, and the user equipment offloads computation tasks to the UAV through wireless communication to reduce computation delay;
(2) A UAV-assisted user offloading system model, a mobility model, a communication model, and a computation model are constructed, and the optimization objective function is given;
(3) The problem is modeled as a Partially Observable Markov Decision Process (POMDP); taking the UAV flight range and safety distance into account, the flight trajectories and computation offloading strategies of the multiple UAVs are jointly optimized based on the users' positions and task information, and a deep reinforcement learning model is constructed with the goal of minimizing system delay and UAV energy consumption while ensuring fair service to users; the specific steps are as follows:
(3a) The multi-UAV-assisted computation offloading problem is treated as a partially observable Markov decision process, defined by the tuple {S, A, O, Pr, R}; multiple agents interact with the environment: based on the current state s_τ, each agent obtains its own observation o_τ ∈ O and takes an action a_τ ∈ A; the environment produces an instant reward r_τ ∈ R to evaluate the current action and transitions to the next state with probability Pr(S_{τ+1} | S_τ, A_τ), the new state depending only on the current state and the actions of the agents; each agent acts according to a policy π(a_τ | o_τ), and the goal is to learn an optimal policy that maximizes the long-term cumulative reward, which can be expressed as
J(π) = E[ Σ_{τ∈T} γ^τ r_τ ],
where γ is the reward discount factor;
(3b) The observation space is defined specifically; each UAV has only a limited observation range, whose radius is set to r_obs, so it can observe only partial state information, the global state information and the actions of the other UAVs being unknown to it; the information UAV_u can observe in time slot τ consists of its own position c_u(τ) together with the current positions, task information, and cumulative service counts, summarized as k_u(τ), of the K mobile users within its observation range; the observation is represented as:
o_u(τ) = {c_u(τ), k_u(τ)}
(3c) The action space is defined specifically; based on the observed information, the UAV must decide which user m(τ) to serve in the current time slot τ and the offloading ratio Δ_{m,u}(τ), and must determine its flight angle β_u(τ) and flight speed v_u(τ); the action can be written as:
a_u(τ) = {m(τ), Δ_{m,u}(τ), β_u(τ), v_u(τ)}
(3d) The state space is defined; the state of the system can be regarded as the collection of all UAV observations:
s(τ) = {o_u(τ) | u ∈ U}
(3e) The reward is defined specifically; the feedback an agent obtains after executing an action is called the reward, which evaluates how good the action is and guides the agent in updating its policy; in general the reward function corresponds to the optimization objective; the objective here is to minimize UAV energy consumption and the average system computation delay, which is exactly the negative of what the reward maximizes, so the reward obtained after a UAV executes an action is defined as:
r_u(τ) = D_m(τ) · ( -T_mean(τ) - ψ E_u(τ) - P_u(τ) )
where D_m(τ) ∈ [0,1] is an attenuation coefficient defining the benefit the UAV obtains after processing the mobile terminal's offloaded task; it is computed as a sigmoid-type function whose constants are η and β and whose input is the cumulative number of times the current user has been served: the more times the user has been served, the larger the input, the smaller the reward, and the lower the benefit; ψ is used to numerically align the UAV energy consumption with the user average delay; P_u(τ) is an additional penalty that is added if, after executing the action, the UAV flies out of the field or its distance to any other UAV is smaller than the safety distance;
(3f) Based on the S, A, O, and R established above, a deep reinforcement learning model is built on MADDPG using an actor-critic framework; each agent has its own actor network, critic network, and corresponding target networks; the actor network is responsible for producing the agent's policy π(o_u(τ) | θ_u), where θ_u denotes its network parameters; the critic network outputs an estimate of the optimal state-action value function, denoted Q(s(τ), a_1(τ), ..., a_U(τ) | w_u), where w_u denotes its network parameters; the input of the critic network includes the observations and actions of all agents in the time slot, whereas during distributed execution the actor network needs only the agent's own observation;
The algorithm learns the Q function and the optimal policy simultaneously; when the critic network is updated, H records are drawn from the experience pool of each agent, and the records with the same index are concatenated to form H joint records, written as {s_{t,i}, a_{1,i}, ..., a_{U,i}, r_{1,i}, ..., r_{U,i}, s_{t+1,i} | i = 1, 2, ..., H}; the critic network of each agent is trained with temporal-difference targets, and the loss function for training the Q-value function is defined as
L(w_u) = (1/H) Σ_i ( y_{u,i} - Q(s_{t,i}, a_{1,i}, ..., a_{U,i} | w_u) )²,
where y_{u,i} is obtained from formula (24):
y_{u,i} = r_{u,i} + γ Q'(s_{t+1,i}, a'_{1,i}, ..., a'_{U,i} | w'_u),  with a'_{j,i} = π'_j(o_{j,t+1,i} | θ'_j),
where Q'(· | w'_u) and π'_j(· | θ'_j) denote the critic target network and the actor target networks, whose parameters are updated with a delay so that training becomes more stable;
The critic network minimizes this loss to approximate the true optimal value Q*, while the actor network updates its parameters by gradient ascent along the deterministic policy gradient of the Q value so as to maximize the action value:
∇_{θ_u} J ≈ (1/H) Σ_i ∇_{θ_u} π_u(o_{u,i} | θ_u) ∇_{a_u} Q(s_{t,i}, a_{1,i}, ..., a_u, ..., a_{U,i} | w_u) |_{a_u = π_u(o_{u,i} | θ_u)};
(4) Considering the continuous state space and continuous action space, the computation migration model is trained with the MADDPG-based multi-agent deep reinforcement learning algorithm;
(5) In the execution stage, each UAV uses the trained model to obtain the optimal user offloading scheme and flight trajectory based on the current environment state s(τ).
Legal Events
- PB01: Publication
- SE01: Entry into force of request for substantive examination