CN115640131A

CN115640131A - Unmanned aerial vehicle auxiliary computing migration method based on depth certainty strategy gradient

Info

Publication number: CN115640131A
Application number: CN202211341446.1A
Authority: CN
Inventors: 陈志江; 雷磊; 宋晓勤; 蒋泽星; 唐胜; 王执屹
Original assignee: Nanjing University of Aeronautics and Astronautics
Current assignee: Nanjing University of Aeronautics and Astronautics
Priority date: 2022-10-28
Filing date: 2022-10-28
Publication date: 2023-01-24

Abstract

The invention provides a calculation task unloading algorithm based on deep reinforcement learning aiming at the requirements of calculation intensive and delay sensitive mobile services. And (4) considering constraint conditions such as the flight ranges, the flight speeds, the system fairness benefits and the like of the multiple unmanned aerial vehicles, and minimizing the weighted sum of the network average calculation delay and the energy consumption of the unmanned aerial vehicles. The non-convex and NP difficult problems are converted into a partial observation Markov decision process, and the multi-agent depth certainty strategy gradient algorithm is used for unloading decision of the mobile user and optimizing the flight track of the unmanned aerial vehicle. Simulation results show that the performance of the algorithm is superior to that of a baseline algorithm in the aspects of fairness of the mobile service terminal, average system time delay, total energy consumption of multiple unmanned aerial vehicles and the like.

Description

A UAV-aided Computational Migration Method Based on Deep Deterministic Policy Gradients

技术领域technical field

本发明属于移动边缘计算(Mobile Edge Computing，MEC)领域，涉及一种多无人机辅助移动边缘计算方法，更具体地说，涉及一种基于多智能体深度确定策略梯度(Multi-Agent Deep Deterministic Policy Gradient，MADDPG)的计算迁移方法。The invention belongs to the field of mobile edge computing (Mobile Edge Computing, MEC), relates to a multi-UAV assisted mobile edge computing method, more specifically, relates to a strategy gradient based on multi-agent depth determination (Multi-Agent Deep Deterministic Policy Gradient, MADDPG) computational migration method.

背景技术Background technique

随着5G技术的发展，运行在用户设备上的计算密集型应用，如网络游戏、VR/AR、远程医疗等将变得更加繁荣和流行。这些移动应用程序通常需要大量的计算资源，消耗大量的能量，而且由于服务器的覆盖范围有限，用户在移动的时候可能会中断与服务器的连接。原先请求卸载的服务器无法在用户的下一位置及时发送计算结果，会引起服务器计算资源的浪费，增加用户再次上传卸载计算任务的时延和能耗。对于用户的可卸载任务，许多研究都会采取全部卸载到MEC服务器中执行的方式，但是当用户数量较多或卸载任务量较大时，有限的服务器计算资源会导致任务排队，卸载计算时延增长。由于高机动性和灵活性，无人机(Unmanned Aerial Vehicle，UAV)可不依靠基础设施在军事和民用领域中辅助移动边缘计算，特别是在偏远地区或自然灾害地区。当受到自然灾害导致网络基础设施不可用或移动设备的突然增多超出了网络服务能力，无人机就可以作为临时的通信中继站或边缘计算平台在通信中断或流量热点地区增强无线覆盖，提供计算支持。但是无人机的计算资源与电量受限，为提高MEC系统的性能，有许多关键问题还需解决，包括安全性^[8]、任务卸载、能量消耗、资源分配和各种信道情况下的用户延迟性能等。With the development of 5G technology, computing-intensive applications running on user equipment, such as online games, VR/AR, and telemedicine, will become more prosperous and popular. These mobile applications usually require a lot of computing resources, consume a lot of energy, and due to the limited coverage of the server, the user may lose the connection with the server while moving. The server that originally requested to offload cannot send the calculation results in time at the user's next location, which will cause a waste of server computing resources and increase the delay and energy consumption of the user's re-uploading and offloading calculation tasks. For the user's offloadable tasks, many studies will take the method of offloading them all to the MEC server for execution. However, when the number of users is large or the amount of offloaded tasks is large, the limited server computing resources will lead to task queuing, and the offloading calculation delay will increase. . Due to high mobility and flexibility, UAV (Unmanned Aerial Vehicle, UAV) can assist mobile edge computing in military and civilian fields without relying on infrastructure, especially in remote areas or natural disaster areas. When the network infrastructure is unavailable due to natural disasters or the sudden increase of mobile devices exceeds the network service capacity, drones can be used as temporary communication relay stations or edge computing platforms to enhance wireless coverage and provide computing support in areas where communication is interrupted or traffic hotspots . However, the computing resources and power of drones are limited. In order to improve the performance of the MEC system, there are many key issues that need to be solved, including security ^[8] , task offloading, energy consumption, resource allocation, and users under various channel conditions. Latency performance etc.

在无人机MEC网络中，可以优化多种类型的变量(如无人机的轨迹、任务卸载策略、计算资源分配)以实现期望的调度目标，传统的优化方法由于需要大量的迭代和先验知识来获得一个近似最优解，因此不适用于动态环境中的实时MEC应用。随着机器学习在研究中的广泛应用，许多研究人员也在探索基于学习的MEC调度算法，鉴于机器学习的最新进展，深度强化学习现已成为研究热点。随着网络规模的增长，多智能体深度强化学习为多无人机MEC网络的资源管理提供了分布式的视角。In the UAV MEC network, various types of variables (such as UAV trajectory, task offloading strategy, and computing resource allocation) can be optimized to achieve the desired scheduling goal. Traditional optimization methods require a large number of iterations and priori knowledge to obtain an approximate optimal solution, so it is not suitable for real-time MEC applications in dynamic environments. With the widespread application of machine learning in research, many researchers are also exploring learning-based MEC scheduling algorithms, and deep reinforcement learning has now become a research hotspot in view of the latest advances in machine learning. As the network scale grows, multi-agent deep reinforcement learning provides a distributed perspective for resource management of multi-UAV MEC networks.

本发明提出了一种无人机辅助移动边缘计算系统，利用无人机提供的计算资源为附近的用户设备提供卸载服务。通过多智能体深度强化学习的方法来求解无人机轨迹和卸载优化问题，从而获得可扩展和有效的调度策略，终端将计算任务的一部分卸载给UAV，而其余的任务在该终端本地执行，通过联合优化用户调度、任务卸载比、无人机飞行角和飞行速度来最小化系统处理延迟和无人机能耗。The present invention proposes a UAV-assisted mobile edge computing system, which uses the computing resources provided by the UAV to provide offloading services for nearby user equipment. Solve the UAV trajectory and unloading optimization problem through the method of multi-agent deep reinforcement learning, so as to obtain a scalable and effective scheduling strategy. The terminal offloads part of the computing tasks to the UAV, while the rest of the tasks are performed locally on the terminal. System processing delay and UAV energy consumption are minimized by jointly optimizing user scheduling, task offload ratio, UAV flight angle and flight speed.

发明内容Contents of the invention

发明目的：考虑到该问题的非凸性、高维状态空间和连续动作空间，我们提出了一种基于MADDPG的深度强化学习算法，该算法可以在动态环境下得到最优计算卸载策略，从而实现最低系统延时和无人机能耗的联合优化。Purpose of the invention: Considering the non-convexity, high-dimensional state space and continuous action space of the problem, we propose a deep reinforcement learning algorithm based on MADDPG, which can obtain the optimal calculation unloading strategy in a dynamic environment, so as to realize Joint optimization for lowest system latency and UAV energy consumption.

技术方案：在考虑同一时刻多用户计算任务卸载的场景，以合理高效的无人机路径规划和卸载决策达到联合优化系统时延和无人机能耗的目的。将每架无人机看作智能体，采用分布式执行和集中式训练的方式，基于本地观察到的状态信息和每个时隙得到的任务信息来选择关联用户。通过建立深度强化学习模型，利用MADDPG算法优化深度强化学习模型。根据优化后的MADDPG模型，得到最优的飞行轨迹和卸载策略。完成上述发明通过以下技术方案实现：一种基于MADDPG的无人机辅助计算迁移方法，包括步骤如下：Technical solution: Considering the scenario of multi-user computing task offloading at the same time, the purpose of joint optimization of system delay and UAV energy consumption is achieved through reasonable and efficient UAV path planning and offloading decisions. Considering each UAV as an agent, using distributed execution and centralized training, it selects associated users based on locally observed state information and task information obtained in each time slot. By establishing a deep reinforcement learning model, the MADDPG algorithm is used to optimize the deep reinforcement learning model. According to the optimized MADDPG model, the optimal flight trajectory and unloading strategy are obtained. The above-mentioned invention is achieved through the following technical solutions: a MADDPG-based unmanned aerial vehicle-aided calculation migration method, including the following steps:

(1)，传统MEC服务器都是部署在基站或其他固定设施中，本次采用可移动式MEC服务器，将无人机技术与边缘计算相结合；(1) Traditional MEC servers are deployed in base stations or other fixed facilities. This time, mobile MEC servers are used to combine UAV technology with edge computing;

(2)，用户设备通过无线通信将计算任务卸载到无人机端从而降低计算延时；(2), the user equipment offloads computing tasks to the UAV side through wireless communication to reduce computing delays;

(3)，构建无人机辅助用户卸载系统模型、移动模型、通信模型与计算模型，给出优化目标函数；(3) Construct the UAV-assisted user unloading system model, mobile model, communication model and calculation model, and give the optimization objective function;

(4)，无人机获取观测范围内用户位置集合、任务集合和服务次数以及信道参数信息；(4), the UAV obtains the user location set, task set and service times and channel parameter information within the observation range;

(5)，采用部分可观测马尔可夫决策过程(Partially Observable MarkovDecision Process，POMDP)建模，在考虑无人机飞行范围和安全距离的情况下，基于用户的位置和任务信息，联合优化多无人机的飞行轨迹和计算卸载策略，以最小化系统时延和无人机能耗同时保证用户的服务公平为目标，构建深度强化学习模型；(5), using Partially Observable Markov Decision Process (POMDP) modeling, considering the flight range and safety distance of the UAV, based on the user's location and task information, jointly optimize the The flight trajectory and calculation offloading strategy of the man-machine, with the goal of minimizing system delay and UAV energy consumption while ensuring user service fairness, builds a deep reinforcement learning model;

(6)，考虑连续状态空间和连续动作空间，利用基于MADDPG的多智能体深度强化学习算法进行计算迁移的模型训练；(6), considering continuous state space and continuous action space, using MADDPG-based multi-agent deep reinforcement learning algorithm for model training of computational migration;

(7)，在执行阶段，无人机基于当前环境的状态s(τ)，利用训练好的模型得到最优的用户卸载方案和飞行轨迹；(7), in the execution phase, based on the state s(τ) of the current environment, the UAV uses the trained model to obtain the optimal user offloading scheme and flight trajectory;

进一步的，所述步骤(3)包括如下具体步骤：Further, the step (3) includes the following specific steps:

(3a)，建立无人机辅助用户卸载的移动边缘计算系统模型，系统有M个移动用户设备(Mobile Device，MD)和U架搭载MEC服务器的无人机，分别用集合

和

表示。无人机以固定高度H_u飞行，设无人机执行一次飞行任务的总时长为T，总时长可被分为N个等长的时隙，时隙的集合记为

每个MD在每个时隙τ有一个计算密集型任务，任务记为S_m(τ)＝{D_m(τ)，F_m(τ)}，其中D_m(τ)表示数据比特量，F_m(τ)表示每比特所需CPU周期；(3a), establish a mobile edge computing system model for UAV-assisted user offloading. The system has M mobile user devices (Mobile Device, MD) and U UAVs equipped with MEC servers.

and

express. The UAV flies at a fixed height Hu, assuming that the total time for the _UAV to perform a flight mission is T, the total time can be divided into N time slots of equal length, and the set of time slots is recorded as

Each MD has a computationally intensive task in each time slot τ, and the task is recorded as S _m (τ) = {D _m (τ), F _m (τ)}, where D _m (τ) represents the amount of data bits, F _m (τ) represents the required CPU cycles per bit;

(3b)，每架无人机在每个时隙τ只为一个终端设备提供计算卸载服务，用户只需在本地计算任务的一小部分，其余卸载到无人机辅助计算，以减少计算的延时和能耗，卸载计算量的比例记为Δ_m，u(τ)∈[0，1]。无人机和用户设备之间的卸裁决策变量可表示为：(3b), each UAV only provides computing offloading service for one terminal device in each time slot τ, the user only needs to calculate a small part of the task locally, and the rest is offloaded to the UAV’s auxiliary computing to reduce the computational cost Delay and energy consumption, the proportion of unloaded computation is recorded as Δ _{m, u} (τ) ∈ [0, 1]. The offloading decision variables between the UAV and the user equipment can be expressed as:

D＝{α_m，u(τ)|u∈U，m∈M，τ∈T} 表达式1D={α _{m, u} (τ)|u∈U, m∈M, τ∈T} Expression 1

其中α_m，u(τ)∈{0，1}，当α_m，u(τ)＝1时表示设备MD_m在时隙τ的计算任务由无人机UAV_u辅助计算，Δ_m，u(τ)＞0；当α_m，u(τ)＝0时则表示只在本地执行计算任务，Δ_m，u(τ)＝0。决策变量需要满足：Where α _{m, u} (τ) ∈ {0, 1}, when α _{m, u} (τ) = 1, it means that the calculation task of device MD _m in time slot τ is assisted by UAV _u , Δ _{m, u} (τ)>0; when α _m,u (τ)=0, it means that the computing task is only performed locally, and Δ _m,u (τ)=0. Decision variables need to satisfy:

(3c)，建立移动模型，移动设备会在每个时隙内随机移动到新的位置，每个设备的移动与其当前的速度和角度有关。假设MD_m在时隙τ的坐标记为c_m(τ)＝[x_m(τ)，y_m(τ)]，则其下一时隙τ+1的坐标可表示为：(3c), establish a mobile model, the mobile device will randomly move to a new position in each time slot, and the movement of each device is related to its current speed and angle. Assuming that the coordinates of MD _m in the time slot τ are marked as c _m (τ)=[x _m (τ), y _m (τ)], then the coordinates of the next time slot τ+1 can be expressed as:

其中d_max代表备移动的最大距离，移动方向和距离概率均服从均匀分布，ρ_1，m，ρ_2，m～U(0，1)，无人机服务终端时仅考虑其在该时隙的起始位置。Among them, d _max represents the maximum moving distance of the standby, and the moving direction and distance probability all obey the uniform distribution, ρ _{1, m} , ρ _{2, m} ~ U(0, 1), and the UAV service terminal only considers its time slot the starting position of .

(3d)，每架无人机在高度H_u的水平面轨迹也可以用无人机在每个时隙的离散位置c_u(τ)来表示，假设UAV_u在时隙τ选择飞去服务MD_m，则其飞行方向记为β_u(τ)∈[0，2π]，飞行速度为v_u(τ)∈[0，V_max]，飞行时间为t_fly，无人机飞行消耗的能量为：(3d), the horizontal trajectory of each UAV at height Hu can also be represented by the UAV’s discrete position c _u (τ _{) at each time slot, assuming that UAV u} _chooses to fly to serve MD at time slot τ _m , the flight direction is recorded as β _u (τ)∈[0, 2π], the flight speed is v _u (τ)∈[0, V _max ], the flight time is t _fly , and the energy consumed by UAV flight is :

其中μ＝0.5M_ut_fly，M_u为无人机总质量。Where μ=0.5M _u t _fly , M _u is the total mass of the UAV.

(3e)，计算卸载采用可部分卸载策略，则MD_m在时隙τ的本地计算延时可表示为：(3e), the calculation offload adopts a partial offload strategy, then the local calculation delay of MD _m in time slot τ can be expressed as:

其中f_m表示MD_m的本地计算能力(每秒CPU周期数)。Where f _m represents the local computing power of MD _m (the number of CPU cycles per second).

(3f)，采用视距链路模型模拟实际的无人机对地通信，无人机和用户之间的信道增益h_m，u(τ)遵循自由空间路径损失模型，可表示为：(3f), using the line-of-sight link model to simulate the actual UAV-to-ground communication, the channel gain _hm,u (τ) between the UAV and the user follows the free-space path loss model, which can be expressed as:

其中g₀是每米信道功率增益。where _g0 is the channel power gain per meter.

(3g)，无人机和地面设备之间的瞬时传输速率r_m，u(τ)定义为：(3g), the instantaneous transmission rate r _m,u (τ) between the UAV and the ground equipment is defined as:

其中B代表了信道带宽，

为移动设备上传链路的发射功率，σ²代表无人机端的高斯白噪声。where B represents the channel bandwidth,

is the transmitting power of the upload link of the mobile device, and σ ² represents the Gaussian white noise at the UAV side.

关联用户MD_m的传输数据延时为：The transmission data delay of associated user MD _m is:

计算任务传输完成后，无人机执行卸载计算任务，卸载计算的延时和能耗分别为：After the calculation task transmission is completed, the UAV performs the offloading calculation task, and the delay and energy consumption of the offloading calculation are respectively:

其中f_u表示无人机的计算能力，

表示无人机执行计算时的CPU功率，κ_u＝10^-27为芯片常数where f _u represents the computing power of the UAV,

Indicates the CPU power when the UAV performs calculations, κ _u = 10 ^-27 is the chip constant

(3h)，由于各种计算密集型任务的结果输出数据量都远小于输入，因此可以忽略下行链路传输所花费的延时，用户MD_m在时隙τ完成任务S_m(τ)的时延T_m(τ)可以表示为：(3h), since the amount of output data of various computationally intensive tasks is much smaller than the input, so the delay spent in downlink transmission can be ignored, when user MD _m completes task S _m (τ) in time slot τ The extension T _m (τ) can be expressed as:

无人机UAV_u辅助计算卸载在时隙τ的总能耗为：The total energy consumption of unmanned aerial vehicle UAV _u auxiliary computing offloading in time slot τ is:

(3i)，用户MD_m的平均延时可以表示为：(3i), the average delay of user MD _m can be expressed as:

系统平均计算延时可计算为：The average computing delay of the system can be calculated as:

(3j)，为了保证用户公平，定义公平指数ξ_τ来衡量服务的公平性：(3j), in order to ensure user fairness, define the fairness index ξ _τ to measure the fairness of the service:

(3k)，综上，可以建立如下目标函数和约束条件：(3k), in summary, the following objective functions and constraints can be established:

其中P＝{β_u(τ)，v_u(τ)}，Z＝{α_m，u(τ)，Δ_m，u(τ)}，φ_t和φ_e为权重参数，C1限制无人机每个时隙只服务一个用户，C2和C6限制无人机的飞行范围，C3和C4限制无人机的飞行速度和角度，C6表示计算任务可以被部分卸载，C7保证系统的公平效益，d_safe和ξ_min为预先设定的无人机之间最小安全距离和最低公平指数。where P＝{β _u (τ), v _u (τ)}, Z＝{α _{m, u} (τ), Δ _{m, u} (τ)}, φ _t and φ _e are weight parameters, and C1 restricts no one Each time slot of the machine only serves one user, C2 and C6 limit the flight range of the drone, C3 and C4 limit the flight speed and angle of the drone, C6 indicates that the calculation task can be partially offloaded, and C7 guarantees the fair benefit of the system. d _safe and ξ _min are the pre-set minimum safe distance and minimum fairness index between UAVs.

进一步的，所述步骤(5)包括如下具体步骤：Further, the step (5) includes the following specific steps:

(5a)，将多无人机辅助计算卸载问题看作是一个部分观测马尔可夫决策过程，由元组{S，A，O，Pr，R}构成。通常有多个智能体与环境交互，每个智能体基于当前状态s_τ得到自身观察o_τ∈O并做出动作a_τ∈A，环境对动作产生即时奖励r_τ∈R以评估当前动作的好坏，并以概率Pr(S_τ+1|S_τ，A_τ)进入下一状态，新状态只取决于当前的状态和各个智能体的动作。智能体的动作基于策略π(a_τ|o_τ)执行，其目标为学习到最优策略以最大化长期累积奖励，可表示为：(5a), consider the multi-UAV assisted computation offloading problem as a partial observation Markov decision process, consisting of tuples {S, A, O, Pr, R}. There are usually multiple agents interacting with the environment, each agent obtains its own observation o _τ ∈ O based on the current state s _τ and makes an action a _τ ∈ A, the environment generates an immediate reward r _τ ∈ R for the action to evaluate the current action Good or bad, and enter the next state with probability Pr(S _τ+1 |S _τ , A _τ ), the new state only depends on the current state and the actions of each agent. The actions of the agent are executed based on the policy π(a _τ |o _τ ), and its goal is to learn the optimal policy to maximize the long-term cumulative reward, which can be expressed as:

其中γ为奖励折扣。where γ is the reward discount.

(5b)，具体定义观测空间，每架无人机都只有有限的观测范围，观测范围的半径设为r_obs，因此只能观测到部分状态信息，而全局的状态信息和其他无人机的动作都是未知的。单架无人机UAV_u在时隙τ能观测到的信息有自身的位置信息c_u(τ)和观测范围内K个移动用户当前的位置信息、任务信息以及服务次数

动作空间A为发射功率和选择的信道，表示为：(5b), specifically define the observation space, each UAV has only a limited observation range, and the radius of the observation range is set to r _obs , so only part of the state information can be observed, while the global state information and other UAV's Actions are unknown. The information that a single unmanned aerial vehicle UAV _u can observe in the time slot τ includes its own position information c _u (τ) and the current position information, task information and service times of K mobile users within the observation range

The action space A is the transmit power and the selected channel, expressed as:

o_u(τ)＝{c_u(τ)，k_u(τ)} 表达式18o _u (τ) = {c _u (τ), k _u (τ)} Expression 18

(5c)，具体定义动作空间，基于观测到的信息，无人机需要确定在当前时隙τ服务哪位用户以及缷载比例Δ_m，u(τ)，再决定自身的飞行角度β_u(τ)和飞行速度v_u(τ)，可记为：(5c), specifically define the action space. Based on the observed information, the UAV needs to determine which user to serve in the current time slot τ and the unloading ratio Δ _{m, u} (τ), and then determine its own flight angle β _u ( τ) and flight speed v _u (τ), can be written as:

a_u(τ)＝{m(τ)，Δ_m，u(τ)，β_u(τ)，v_u(τ)} 表达式19a _u (τ) = {m(τ), Δ _{m, u} (τ), β _u (τ), v _u (τ)} Expression 19

(5d)，定义状态空间，系统的状态可看作所有无人机观测结果的集合：(5d), define the state space, the state of the system can be regarded as the collection of all UAV observations:

s(τ)＝{o_u(τ)|u∈U} 表达式20s(τ)={o _u (τ)|u∈U} Expression 20

(5e)，具体定义奖励，智能体执行动作后得到的反馈称之为奖励，用于判定动作的好坏，指导智能体更新其策略。一般来说，奖励函数都与优化目标相对应，本次优化的目标是最小化无人机的能耗和系统平均计算延时，与最大奖励回报正好呈负相关，因此将无人机执行动作后的奖励定义为：(5e), specifically define the reward, the feedback obtained by the agent after performing the action is called the reward, which is used to judge the quality of the action and guide the agent to update its strategy. Generally speaking, the reward function corresponds to the optimization goal. The goal of this optimization is to minimize the energy consumption of the UAV and the average calculation delay of the system, which is exactly negatively correlated with the maximum reward return. Therefore, the UAV performs the action The latter reward is defined as:

r_u(τ)＝D_m(τ)·(-T^mean(τ)-ψE_u(τ)-P_u(τ)) 表达式21r _u (τ)＝D _m (τ)·(-T ^mean (τ)-ψE _u (τ)-P _u (τ)) Expression 21

其中D_m(τ)∈[0，1]为衰减系数，定义为无人机处理移动终端卸载任务后得到的效益，具体计算如下：Among them, D _m (τ) ∈ [0, 1] is the attenuation coefficient, which is defined as the benefit obtained after the UAV handles the unloading task of the mobile terminal. The specific calculation is as follows:

其中η和β为相关常数，其函数图像为类sigmoid型，输入为当前用户的累积服务次数，次数越多，其值越大，奖励越小，效益越低。ψ用来对无人机能耗和用户平均时延进行数值对齐。P_u(τ)为额外的惩罚项，如果无人机执行动作后飞出场地或和其余无人机的距离小于安全距离，就需要增加惩罚。Among them, η and β are related constants, and its function image is sigmoid-like, and the input is the cumulative service times of the current user. The more times, the larger the value, the smaller the reward, and the lower the benefit. ψ is used to numerically align UAV energy consumption and user average delay. P _u (τ) is an additional penalty item. If the UAV flies out of the field after performing an action or the distance with other UAVs is less than the safe distance, the penalty needs to be increased.

(5f)，依据建立好的S，A，O和R，在MADDPG的基础上建立深度强化学习模型，采用actor-critic框架，每个智能体都有自己的actor网络和critic网络，以及各自的目标网络。Actor网络负责为智能体制定策略π(o_u(τ)|θ_u)，θ_u代表其网络参数；critic网络输出对最优状态-动作价值函数的估计记为Q(s(τ)，a₁(τ)，...，a_U(τ)|w_u)，w_u代表其网络参数。critic网络的输入包含一个时隙内所有智能体的观测值和动作，但在分布执行时，actor网络的输入仅需要自身的观测值。(5f), according to the established S, A, O and R, establish a deep reinforcement learning model on the basis of MADDPG, using the actor-critic framework, each agent has its own actor network and critic network, and their own target network. The Actor network is responsible for formulating strategies for the agent π(o _u (τ)|θ _u ), θ _u represents its network parameters; the critic network output estimates the optimal state-action value function as Q(s(τ), a ₁ (τ),..., a _U (τ)|w _u ), w _u represents its network parameters. The input of the critic network contains the observations and actions of all agents in a time slot, but during distributed execution, the input of the actor network only needs its own observations.

算法同时对Q函数以及最优策略进行学习，在更新critic网络时，需要从每个智能体的经验池中抽取H组记录，将同样时刻的每组拼接得到H条新记录，记为：{s_t，i，a_1，i，...，a_U，i，r_1，i，...，r_U，i，s_t+1，i|i＝1，2，..，H}，使用时序差分集中训练每一个智能体的critic网络，训练Q值函数的损失函数定义为：The algorithm learns the Q function and the optimal strategy at the same time. When updating the critic network, it needs to extract H groups of records from the experience pool of each agent, and splice each group at the same time to obtain H new records, which are recorded as: { s _{t, i} , a _{1, i} ,..., a _{U, i} , r _{1, i} ,..., r _{U, i} , s _{t+1, i} |i=1, 2, .., H }, use the temporal difference to train the critic network of each agent intensively, and the loss function for training the Q-value function is defined as:

其中y_u，i由式(24)得到：where y _{u, i} can be obtained from formula (24):

其中，

和

分别代表无人机UAV_u的critic目标网络和actor目标网络，目标网络都具有滞后更新的网络参数，使训练变得更稳定。in,

and

represent the critic target network and the actor target network of the UAV _u , respectively, and the target network has network parameters that are updated with a lag, making the training more stable.

Critic网络需要尽量降低损失以逼近真实的Q^*值，actor网络则用Q值的确定策略梯度作梯度上升更新网络参数以最大化动作价值：The critic network needs to reduce the loss as much as possible to approach the real Q ^* value, and the actor network uses the determined strategy gradient of the Q value for gradient ascent to update the network parameters to maximize the action value:

最后在固定的间隔以更新率

更新目标网络：Finally at regular intervals with an update rate of

Update target network:

进一步的，所述步骤(6)包括如下具体步骤：Further, the step (6) includes the following specific steps:

(6a)，启动环境模拟，初始化每个智能体actor网络和critic网络及其各自目标网络参数；(6a), start the environment simulation, initialize each agent actor network and critic network and their respective target network parameters;

(6b)，初始化训练回合数；(6b), initialize the number of training rounds;

(6c)，更新用户的位置集合、任务集合和服务次数，无人机的位置集合以及信道参数；(6c), update the user's location set, task set and service times, the location set and channel parameters of the drone;

(6d)，对每个智能体分布运行actor网络，根据观察o_u(τ)，输出动作a_u(τ)，并获取即时的奖励r_u(τ)，同时转到下一状态s_τ+1，从而获得训练数据{o_u(τ)，a_u(τ)，r_u(τ)，o_u(τ+1)}；(6d), run the actor network for each agent distribution, output the action a _u (τ) according to the observation o _u (τ), and get the immediate reward r _u (τ), and go to the next state s _{τ+ 1} , so as to obtain the training data {o _u (τ), a _u (τ), r _u (τ), o _u (τ+1)};

(6e)，将训练数据存入各自的经验回放池中；(6e), store the training data in their respective experience playback pools;

(6f)，每个智能体从经验回放池中随机采样H个训练数据，构成训练数据集；(6f), each agent randomly samples H training data from the experience playback pool to form a training data set;

(6g)，每个智能体通过critic网络和目标网络计算出损失值L(w_u)更新w_u，并采用确定策略梯度作梯度上升，通过神经网络的反向传播来更新actor网络的参数θ_u；(6g), each agent calculates the loss value L(w _u ) through the critic network and the target network to update w _u , and uses the determined strategy gradient for gradient ascent, and updates the parameter θ of the actor network through the backpropagation of the neural network _u ;

(6h)，训练次数达到目标网络更新间隔，更新目标网络参数；(6h), the number of training times reaches the target network update interval, and the target network parameters are updated;

(6i)，判断是否满足收敛，若是，优化结束得到优化后的深度强化学习模型，否则进入步骤(6c)；(6i), judging whether convergence is satisfied, if so, the optimized deep reinforcement learning model is obtained after optimization, otherwise enter step (6c);

进一步的，所述步骤(7)包括如下具体步骤：Further, the step (7) includes the following specific steps:

(7a)，利用MADDPG算法训练好的深度强化学习模型，输入某时刻的状态信息；(7a), using the deep reinforcement learning model trained by the MADDPG algorithm, input the state information at a certain moment;

(7b)，输出最优动作策略

得到最优的迁移策略和飞行路径。(7b), output the optimal action strategy

Get the optimal migration strategy and flight path.

有益效果：本发明提出的一种基于MADDPG算法的大规模多无人机辅助MEC网络计算卸载方法，在满足约束条件的情况下，通过联合优化无人机卸载决策和飞行轨迹最大限度的减少无人机的能耗和系统平均计算延时，可以在一系列连续状态空间和连续动作空间的优化中表现稳定。在满足场景相似的情况下，本发明提出的一种基于MADDPG的深度强化学习算法在降低能耗和平均任务延时方面的性能是优越的。Beneficial effects: A large-scale multi-UAV assisted MEC network computing unloading method based on the MADDPG algorithm proposed by the present invention, in the case of satisfying the constraint conditions, can minimize the unloading by jointly optimizing the UAV unloading decision and flight trajectory. The energy consumption of man-machine and the average calculation delay of the system can be stable in a series of optimizations of continuous state space and continuous action space. Under the condition that the scenarios are similar, a deep reinforcement learning algorithm based on MADDPG proposed by the present invention has superior performance in reducing energy consumption and average task delay.

附图说明Description of drawings

图1为本发明实施例提供的无人机辅助计算卸载模型结构示意图；Fig. 1 is a schematic structural diagram of an unmanned aerial vehicle-assisted computing offloading model provided by an embodiment of the present invention;

图2为本发明实施例提供的多无人机计算迁移算法的POMDP决策过程示意图；Fig. 2 is a schematic diagram of the POMDP decision-making process of the multi-unmanned aerial vehicle calculation migration algorithm provided by the embodiment of the present invention;

图3为本发明实施例提供的基于MADDPG的算法训练示意图；Fig. 3 is the schematic diagram of algorithm training based on MADDPG that the embodiment of the present invention provides;

图4为本发明实施例提供的MADDPG算法下无人机能耗与计算性能的关系的仿真结果图。FIG. 4 is a simulation result diagram of the relationship between the energy consumption and computing performance of the drone under the MADDPG algorithm provided by the embodiment of the present invention.

具体实施方式Detailed ways

本发明的核心思想在于：采用分布式的强化学习方法，将每架无人机视为智能体，通过建立深度强化学习模型，利用MADDPG算法优化深度强化学习模型。根据优化后的模型，得到最优的迁移策略和飞行路径。The core idea of the present invention is: using a distributed reinforcement learning method, each UAV is regarded as an agent, and by establishing a deep reinforcement learning model, the MADDPG algorithm is used to optimize the depth reinforcement learning model. According to the optimized model, the optimal migration strategy and flight path are obtained.

下面对本发明做进一步详细描述。The present invention is described in further detail below.

包括如下具体步骤：Including the following specific steps:

和

and

(3b)，每架无人机在每个时隙τ只为一个终端设备提供计算卸载服务，用户只需在本地计算任务的一小部分，其余卸载到无人机辅助计算，以减少计算的延时和能耗，卸载计算量的比例记为Δ_m，u(τ)∈[0，1]。无人机和用户设备之间的卸载决策变量可表示为：(3b), each UAV only provides computing offloading service for one terminal device in each time slot τ, the user only needs to calculate a small part of the task locally, and the rest is offloaded to the UAV’s auxiliary computing to reduce the computational cost Delay and energy consumption, the proportion of unloaded computation is recorded as Δ _{m, u} (τ) ∈ [0, 1]. The offloading decision variable between UAV and user equipment can be expressed as:

其中B代表了信道带宽，

其中f_u表示无人机的计算能力，

(5)，采用部分可观测马尔可夫决策过程(Partially Observable MarkovDecision Process，POMDP)建模，在考虑无人机飞行范围和安全距离的情况下，基于用户的位置和任务信息，联合优化多无人机的飞行轨迹和计算卸载策略，以最小化系统时延和无人机能耗同时保证用户的服务公平为目标，构建深度强化学习模型，包括如下具体步骤：(5), using Partially Observable Markov Decision Process (POMDP) modeling, considering the flight range and safety distance of the UAV, based on the user's location and task information, jointly optimize the The human-machine flight trajectory and calculation offloading strategy aim to minimize system delay and UAV energy consumption while ensuring user service fairness as the goal, and build a deep reinforcement learning model, including the following specific steps:

其中γ为奖励折扣。where γ is the reward discount.

(5c)，具体定义动作空间，基于观测到的信息，无人机需要确定在当前时隙τ服务哪位用户以及卸载比例Δ_m，u(τ)，再决定自身的飞行角度β_u(τ)和飞行速度v_u(τ)，可记为：(5c), specifically define the action space. Based on the observed information, the UAV needs to determine which user to serve in the current time slot τ and the unloading ratio Δ _{m, u} (τ), and then determine its own flight angle β _u (τ ) and flight speed v _u (τ), can be written as:

s(τ)＝{o_u(τ)|u∈U} 表达式20s(τ)={o _u (τ)|u∈U} Expression 20

(5f)，依据建立好的S，A，O和R，在MADDPG的基础上建立深度强化学习模型，采用actor-critic框架，每个智能体都有自己的actor网络和critic网络，以及各自的目标网络。Actor网络负责为智能体制定策略π(o_u(τ)|θ_u)，θ_u代表其网络参数；critic网络输出对最优状态-动作价值函数的估计记为Q(s(τ)，a₁(τ)，..，a_U(τ)|w_u)，w_u代表其网络参数。critic网络的输入包含一个时隙内所有智能体的观测值和动作，但在分布执行时，actor网络的输入仅需要自身的观测值。(5f), according to the established S, A, O and R, establish a deep reinforcement learning model on the basis of MADDPG, using the actor-critic framework, each agent has its own actor network and critic network, and their own target network. The Actor network is responsible for formulating strategies for the agent π(o _u (τ)|θ _u ), θ _u represents its network parameters; the critic network output estimates the optimal state-action value function as Q(s(τ), a ₁ (τ), .., a _U (τ)|w _u ), w _u represents its network parameters. The input of the critic network contains the observations and actions of all agents in a time slot, but during distributed execution, the input of the actor network only needs its own observations.

其中，

和

and

最后在固定的间隔以更新率

更新目标网络：Finally at regular intervals with an update rate of

Update target network:

(6)，考虑连续状态空间和连续动作空间，利用基于MADDPG的多智能体深度强化学习算法进行计算迁移的模型训练，包括如下具体步骤：(6), considering continuous state space and continuous action space, using MADDPG-based multi-agent deep reinforcement learning algorithm for model training of computational migration, including the following specific steps:

(7b)，输出最优动作策略

得到最优的迁移策略和飞行路径。(7b), output the optimal action policy

Get the optimal migration strategy and flight path.

在图1中，描述了无人机辅助用户卸载的移动边缘计算系统模型，用户将计算任务卸载到无人机辅助计算，以减少计算的延时和能耗。In Figure 1, a mobile edge computing system model is described in which UAV-assisted user offloading is described. The user offloads computing tasks to UAV-assisted computing to reduce computing delays and energy consumption.

在图2中，描述了无人机辅助MEC网络的深度强化学习模型，可以看出多架无人机作为智能体基于当前状态根据策略选择当前最优策略，并从环境中获取奖励。In Figure 2, the deep reinforcement learning model of UAV-assisted MEC network is described. It can be seen that multiple UAVs act as agents to select the current optimal strategy based on the current state and obtain rewards from the environment.

在图3中，描述了actor-critic框架的训练模型，通过集中式训练和分散式执行，critic网络可以在训练过程中参考其他智能体的行为，从而更好评估actor网络的性能，提高了策略的稳定性。In Figure 3, the training model of the actor-critic framework is described. Through centralized training and decentralized execution, the critic network can refer to the behavior of other agents during the training process, so as to better evaluate the performance of the actor network and improve the strategy. stability.

在图4中，描述了不同算法无人机的计算性能与能耗的仿真结果，基于MADDPG算法能够得到不同计算性能下的最佳功耗控制，当CPU频率为12.5GHz时，能耗相比基线降低29.16％，相比随机策略梯度算法降低8.67％。。In Figure 4, the simulation results of computing performance and energy consumption of drones with different algorithms are described. Based on the MADDPG algorithm, the optimal power consumption control under different computing performances can be obtained. When the CPU frequency is 12.5GHz, the energy consumption is compared to The baseline is reduced by 29.16%, which is 8.67% lower than the stochastic policy gradient algorithm. .

本发明申请书中未作详细描述的内容属于本领域专业技术人员公知的现有技术。The contents not described in detail in the application of the present invention belong to the prior art known to those skilled in the art.

Claims

1. An unmanned aerial vehicle assisted calculation migration method based on multi-agent depth determination strategy gradient is characterized by comprising the following steps:

(1) The traditional MEC servers are deployed in a base station or other fixed facilities, a movable MEC server is adopted at this time, the unmanned aerial vehicle technology is combined with edge calculation, and user equipment unloads calculation tasks to an unmanned aerial vehicle end through wireless communication so as to reduce calculation delay;

(2) Constructing an unmanned aerial vehicle auxiliary user unloading system model, a mobile model, a communication model and a calculation model, and giving an optimization objective function;

(3) The method is characterized in that a Partially Observable Markov Decision Process (POMDP) is adopted for modeling, under the condition of considering the flight range and the safety distance of the unmanned aerial vehicle, the flight tracks and the calculation unloading strategies of the multiple unmanned aerial vehicles are jointly optimized on the basis of the position and the task information of a user, a deep reinforcement learning model is constructed by taking the aim of minimizing the system delay and the energy consumption of the unmanned aerial vehicle and simultaneously ensuring the service fairness of the user as the target, and the method comprises the following specific steps:

(3a) The problem of the multi-unmanned aerial vehicle auxiliary computation unloading is regarded as a partial observation Markov decision process and is composed of tuples { S, A, O, pr and R }; there are typically multiple agents interacting with the environment, each agent based on the current state s _τ Get self observation o _τ E.g. O and make action a _τ E.g. A, the environment generates an instant reward r for the action _τ E.g. R to evaluate the current action and with probability Pr (S) _τ+1 |S _τ ，A _τ ) Entering a next state, the new state depending only on the current state and the actions of the respective agent; actions of Agents are based on policy π (a) _τ |o _τ ) Enforcement, with the goal of learning to an optimal strategy to maximize long-term jackpot, can be expressed as:

wherein γ is a reward discount;

(3b) Specifically defining observation space, each unmanned aerial vehicle only has limited observation range, and the radius of the observation range is set as r _obs Therefore, only partial state information can be observed, and the global state information and the actions of other unmanned planes are unknown; single unmanned aerial vehicle UAV _u The information observable in the time slot τ has its own position information c _u (tau) and current location information, task information and number of services for K mobile users in observation range

The action space a is the transmit power and the selected channel, and is represented as:

o _u (τ)＝{c _u (τ)，k _u (τ)}

(3c) Specifically defining an action space, based on the observed information, the drone needs to determine which user is served at the current time slot τ and the offloading ratio Δ _m，u (tau) and determining the flight angle beta thereof _u (τ) and flight velocity v _u (τ), which can be written as:

a _u (τ)＝{m(τ)，Δ _m，u (τ)，β _u (τ)，v _u (τ)}

(3d) Defining a state space, the state of the system can be regarded as a set of all unmanned aerial vehicle observations:

s(τ)＝{o _u (τ)|u∈U}

(3e) Specifically defining rewards, wherein feedback obtained after the intelligent agent executes actions is called the rewards and is used for judging the quality of the actions and guiding the intelligent agent to update the strategy; generally speaking, the reward functions all correspond to optimization objectives, the objective of this optimization is to minimize the energy consumption of the drone and the average system computation delay, and the maximum reward return is just negatively correlated, so the reward after the drone executes actions is defined as:

r _u (τ)＝D _m (τ)·(-T ^mean (τ)-ψE _u (τ)-P _u (τ))

wherein D _m (τ)∈[0，1]For the attenuation coefficient, the benefit obtained after the unmanned aerial vehicle processes the mobile terminal unloading task is defined, and the specific calculation is as follows:

wherein eta and beta are correlation constants, the function image is of a sigmoid type, the number of times of accumulated service of the current user is input, the more times are, the larger the value is, the smaller the reward is, and the lower the benefit is; psi is used to average the drone energy consumption and usersCarrying out numerical value alignment; p _u (tau) is an additional punishment item, and if the unmanned aerial vehicle flies out of the field after executing actions or the distance between the unmanned aerial vehicle and the rest unmanned aerial vehicles is less than the safe distance, the punishment item needs to be added;

(3f) Establishing a deep reinforcement learning model on the basis of MADDPG according to the established S, A, O and R, adopting an operator-critic framework, wherein each agent has an own operator network, critic network and respective target network; the Actor network is responsible for formulating a policy pi (o) for an agent _u (τ)|θ _u )，θ _u Representing its network parameters; the critic network outputs an estimate of the optimum state-action cost function denoted as Q (s (τ), a) ₁ (τ)，...，a _U (τ)|w _u )，w _u Representing its network parameters; the input of the critic network comprises the observed values and actions of all agents in a time slot, but the input of the actor network only needs the observed values of the actor network when the distribution is executed;

the algorithm learns the Q function and the optimal strategy at the same time, when the criticc network is updated, H groups of records need to be extracted from the experience pool of each intelligent agent, and each group at the same time is spliced to obtain H new records, which are recorded as: { s _t，i ，a _1，i ，...，a _U，i ，r _1，i ，...，r _U，i ，s _t+1，i I =1, 2., H }, training the criticic network of each agent using a set of timing differences, the penalty function of the training Q-value function being defined as:

wherein y is _u，i Obtained from formula (24):

wherein,

and

respectively representing Unmanned Aerial Vehicles (UAVs) _u The critic target network and the actor target network, the target network has network parameters updated later, so that the training becomes more stable;

critic networks need to minimize losses to approximate true Q ^* The operator network updates the network parameters by using the gradient of the determination strategy of the Q value to perform gradient rise so as to maximize the action value:

finally at a fixed interval at an update rate

Updating the target network:

(4) Considering a continuous state space and a continuous action space, and performing model training of computational migration by using a multi-agent deep reinforcement learning algorithm based on MADDPG;

(5) In the execution stage, the unmanned aerial vehicle obtains an optimal user unloading scheme and a flight track by using a trained model based on the state s (tau) of the current environment.