CN111786713A - A UAV network hovering position optimization method based on multi-agent deep reinforcement learning - Google Patents

A UAV network hovering position optimization method based on multi-agent deep reinforcement learning

Info

Publication number
CN111786713A
Authority
CN
China
Prior art keywords
unmanned aerial
aerial vehicle
ground
network
user
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010497656.4A
Other languages
Chinese (zh)
Other versions
CN111786713B (en)
Inventor
刘中豪
覃振权
卢炳先
王雷
朱明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Dalian University of Technology
Original Assignee
Dalian University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Dalian University of Technology filed Critical Dalian University of Technology
Priority to CN202010497656.4A priority Critical patent/CN111786713B/en
Publication of CN111786713A publication Critical patent/CN111786713A/en
Application granted granted Critical
Publication of CN111786713B publication Critical patent/CN111786713B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04B TRANSMISSION
    • H04B 7/00 Radio transmission systems, i.e. using radiation field
    • H04B 7/14 Relay systems
    • H04B 7/15 Active relay systems
    • H04B 7/185 Space-based or airborne stations; Stations for satellite systems
    • H04B 7/18502 Airborne stations
    • H04B 7/18506 Communications with or from aircraft, i.e. aeronautical mobile service
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04W WIRELESS COMMUNICATION NETWORKS
    • H04W 16/00 Network planning, e.g. coverage or traffic planning tools; Network deployment, e.g. resource partitioning or cells structures
    • H04W 16/18 Network planning tools
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04W WIRELESS COMMUNICATION NETWORKS
    • H04W 16/00 Network planning, e.g. coverage or traffic planning tools; Network deployment, e.g. resource partitioning or cells structures
    • H04W 16/22 Traffic simulation tools or models
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 30/00 Reducing energy consumption in communication networks
    • Y02D 30/70 Reducing energy consumption in communication networks in wireless communication networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Biomedical Technology (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Aviation & Aerospace Engineering (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Astronomy & Astrophysics (AREA)
  • Mobile Radio Communication Systems (AREA)

Abstract

An unmanned aerial vehicle (UAV) network hovering position optimization method based on multi-agent deep reinforcement learning comprises: first, modeling the channel model, coverage model and energy loss model of a UAV-to-ground communication scenario; modeling the throughput maximization problem of the UAV-to-ground communication network as a partially observable Markov decision process; obtaining local observation information and instantaneous rewards through continuous interaction between the UAVs and the environment, and performing centralized training based on this information to obtain distributed policy networks; and deploying the policy network to each UAV, so that each UAV can derive a moving-direction and moving-distance decision from its own local observations, adjust its hovering position, and cooperate in a distributed manner. The invention also introduces proportional fair scheduling and UAV energy consumption information into the instantaneous reward function, thereby improving throughput while ensuring fairness of service to ground users, reducing energy consumption, and enabling the UAV swarm to adapt to dynamic environments.

Description

A UAV network hovering position optimization method based on multi-agent deep reinforcement learning

Technical Field

The present invention relates to the technical field of wireless communication, and in particular to a method for optimizing the hovering positions of a multi-UAV network based on multi-agent deep reinforcement learning.

Background Art

In recent years, owing to the high mobility, easy deployment and low cost of UAVs, UAV-based communication technology has attracted extensive attention and has become a new research hotspot in the field of wireless communication. UAV-assisted communication mainly has the following application scenarios: a UAV acting as a mobile base station to provide communication coverage for areas with sparse infrastructure or post-disaster areas; a UAV acting as a relay node to provide a wireless connection between two distant communication nodes that cannot establish a direct link; and UAV-based data distribution and collection. The present invention mainly addresses the first scenario, in which the hovering positions of the UAVs determine the coverage performance and throughput of the entire UAV network. The ground devices served by the UAV network may be mobile, so the UAVs need to continually adjust their hovering positions to achieve optimal performance.

In 2018, Qingqing Wu et al. proposed a UAV path planning scheme for a multi-UAV-to-ground communication system in the paper "Joint Trajectory and Communication Design for Multi-UAV Enabled Wireless Networks". Time is divided into multiple periods with identical UAV trajectories in each period, and in each time slot the UAV base stations serve specific ground users. The scheme models the optimization problem as a mixed-integer programming problem and solves it using block coordinate gradient descent and approximate convex optimization techniques, obtaining the optimal hovering position for each time slice within a period so as to maximize the downlink throughput to ground users. However, the scheme proposed in that paper only suits static environments: it assumes that ground devices are not mobile, and therefore does not apply to scenarios in which ground users keep moving. Chi Harold Liu et al., in the paper "Energy-Efficient UAV Control for Effective and Fair Communication Coverage: A Deep Reinforcement Learning Approach", proposed a UAV path planning algorithm based on deep reinforcement learning, training a decision model that outputs the UAVs' next decision (moving direction and moving distance) according to the current state. The method can achieve fair wireless coverage over a large area while minimizing the energy consumption of the UAVs. However, it only considers the coverage performance of the UAV network, and the fairness it pursues is coarse-grained coverage fairness over regions rather than fine-grained fairness over users. Furthermore, the method is a centralized scheme that requires a controller to collect information from all UAVs in every time slot before a decision can be made.

In summary, existing UAV path planning techniques for ground communication networks based on UAV base stations have the following main defects: (1) the dynamics of the environment, i.e. the mobility of ground users, are not considered; (2) centralized algorithms are used, relying on global information and centralized control, yet in some large-scale scenarios centralized control is difficult, so a distributed control strategy is needed in which each UAV base station makes decisions relying only on the information it obtains itself; (3) user-level service fairness is ignored. These defects make existing UAV trajectory optimization methods unsuitable for practical communication environments.

Summary of the Invention

The purpose of the present invention is to propose a multi-UAV hovering position optimization method based on multi-agent reinforcement learning, so as to solve the above technical problems.

The technical solution of the present invention:

A UAV network hovering position optimization method based on multi-agent deep reinforcement learning, comprising the following steps:

(1) Establish a multi-UAV-to-ground communication network model, which mainly includes the following four steps:

(1.1) Establish the scene model: a square target area with side length l is set up, containing N ground users and M UAV base stations (UAV-BSs) that provide communication services for the ground users. Time is divided into T identical time slots; from the previous time slot to the current one, a ground user may stay still or may move, so each UAV base station needs to find a new optimal hovering position in every time slot and, after reaching the target position, select ground users for data transmission service.
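For illustration only, the following is a minimal Python sketch of such a scene; the numeric values (side length, numbers of users and UAVs, slot count, step size) and the random-walk mobility model are placeholder assumptions, not figures from the patent.

import numpy as np

# Placeholder scene parameters; the patent does not fix these numerically.
L_SIDE = 1000.0   # side length l of the square target area (meters, assumed)
N_USERS = 50      # number of ground users N (assumed)
M_UAVS = 4        # number of UAV base stations M (assumed)
T_SLOTS = 200     # number of time slots T (assumed)

rng = np.random.default_rng(0)

# Initial horizontal positions of users and of UAV ground projections.
user_pos = rng.uniform(0.0, L_SIDE, size=(N_USERS, 2))
uav_pos = rng.uniform(0.0, L_SIDE, size=(M_UAVS, 2))

def move_users(positions, max_step=5.0):
    # Simple random-walk mobility: each user stays put or moves a small step.
    step = rng.uniform(-max_step, max_step, size=positions.shape)
    return np.clip(positions + step, 0.0, L_SIDE)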

(1.2) Establish the air-to-ground communication model: the present invention uses an air-to-ground channel model to model the channel between a UAV base station and a ground user. Owing to its high flying altitude, a UAV base station can establish a line-of-sight (LoS) link with a ground user more easily than a terrestrial base station. In the LoS case, the path loss model between UAV base station m and ground user n is:

L_{n,m}(t) = η · (4π · f_c · d_{n,m}(t) / c)^α

where η denotes the excess path loss coefficient, c the speed of light, f_c the subcarrier frequency, and α the path loss exponent; d_{n,m} = sqrt(r_{n,m}^2 + h^2) denotes the distance between UAV base station m and ground user n, where r_{n,m} is their horizontal distance and h is the fixed flying height of the UAV base station. From the path loss, the channel gain can be expressed as g_{n,m}(t) = 1 / L_{n,m}(t). Based on the channel gain, the data transmission rate between UAV base station m and ground user n in time slot t is:

R_{n,m}(t) = log_2(1 + p_t · g_{n,m}(t) / σ)

where σ denotes the additive white Gaussian noise power, p_t the transmit power of the UAV base station, and g_{n,m}(t) the channel gain between UAV base station m and ground user n at time t.
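The following Python sketch evaluates the LoS path loss, channel gain and per-link rate as reconstructed above; the default carrier frequency, path loss exponent, excess loss, transmit power and noise power are placeholder assumptions, and the bandwidth is normalized to 1 Hz.

import numpy as np

def path_loss(r_horiz, h, eta=1.0, f_c=2e9, alpha=2.0, c=3e8):
    # LoS path loss L_{n,m} = eta * (4*pi*f_c*d/c)^alpha with d = sqrt(r^2 + h^2);
    # the default parameter values are assumptions, not patent figures.
    d = np.sqrt(r_horiz ** 2 + h ** 2)
    return eta * (4.0 * np.pi * f_c * d / c) ** alpha

def data_rate(r_horiz, h, p_t=1.0, noise=1e-13, **pl_kwargs):
    # Shannon-style rate log2(1 + p_t * g / sigma) with channel gain g = 1/L;
    # bandwidth is normalized to 1 Hz in this sketch.
    g = 1.0 / path_loss(r_horiz, h, **pl_kwargs)
    return np.log2(1.0 + p_t * g / noise)

# Example: rate for a user 200 m away horizontally from a UAV flying at 100 m.
print(data_rate(200.0, 100.0))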

(1.3) Establish the coverage model: due to hardware limitations, the coverage of each UAV base station is limited. The present invention defines a maximum tolerable path loss L_max: if, at a given moment, the path loss between a UAV base station and a user is less than L_max, the established connection is considered reliable; otherwise, establishing the connection is considered to have failed. Therefore, the effective coverage of each UAV base station can be defined according to the maximum tolerable path loss, as a circle centered at the projection point of the UAV base station on the ground with radius R_cov. From the path loss formula, R_cov can be expressed as:

R_cov = sqrt( (c / (4π · f_c))^2 · (L_max / η)^(2/α) - h^2 )
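A small helper, under the same assumptions as the sketch above, that inverts the reconstructed path-loss model at L_max to obtain the coverage radius R_cov; the default parameter values are again placeholders.

import numpy as np

def coverage_radius(L_max, h, eta=1.0, f_c=2e9, alpha=2.0, c=3e8):
    # Maximum link distance at which the LoS path loss equals L_max, projected
    # onto the ground plane; returns 0 if the altitude alone already exceeds
    # the tolerable loss.
    d_max = (c / (4.0 * np.pi * f_c)) * (L_max / eta) ** (1.0 / alpha)
    return float(np.sqrt(max(d_max ** 2 - h ** 2, 0.0)))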

(1.4) Establish the energy loss model: the present invention mainly focuses on the energy loss caused by UAV movement. Considering the flight speed V and flight power p_f of the UAV, the flight energy consumption of UAV base station m in time slot t depends on the distance flown:

Δe_m(t) = (p_f / V) · sqrt( (x_m(t+1) - x_m(t))^2 + (y_m(t+1) - y_m(t))^2 )

where x_m(t) and y_m(t) denote the position coordinates of the UAV on the x-axis and y-axis of the horizontal plane, respectively.
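A sketch of the per-slot flight energy under the model above: energy is flight power times flight time, i.e. the distance moved divided by the speed V; the default power and speed values are assumptions.

import numpy as np

def flight_energy(pos_t, pos_t1, p_f=100.0, v=10.0):
    # Delta e_m(t) = p_f * (distance flown in slot t) / V;
    # p_f in watts and v in m/s are placeholder values.
    dist = float(np.linalg.norm(np.asarray(pos_t1) - np.asarray(pos_t)))
    return p_f * dist / v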

(2) Model the problem as a partially observable Markov decision process:

Each UAV base station corresponds to an agent. In each time slot with environment state S(t), agent m can only obtain a local observation o_m within its own coverage range and, according to its decision function u_m(o_m), selects an action a_m from the action set A so as to maximize the total discounted expected reward E[ Σ_t γ^t · r_m(t) ], where γ ∈ (0,1) is the discount factor and r_m(t) denotes the reward of agent m at time t.

System state set S = {S(t) | S(t) = (S_u(t), S_g(t))}, comprising the current state of the UAV base stations S_u(t) and the current state of the ground users S_g(t). The UAV base station state S_u(t) includes the current position information of the UAVs; the ground user state S_g(t) includes the current position information of the ground users.

UAV action set A = {a(t) | a(t) = (θ(t), d(t))}: in time slot t, after obtaining its current local observation, UAV m needs to make a decision a_m(t) and move to the next hovering position; the action therefore consists of a flight rotation angle θ(t) and a moving distance d(t).
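As an illustration of how an action (θ(t), d(t)) is applied, the sketch below moves a UAV's horizontal position along the chosen heading and clips it to the target area; the clipping rule is an assumption of this sketch.

import numpy as np

def apply_action(pos, theta, d, area_side):
    # Move by distance d along heading theta (radians) and keep the UAV
    # inside the square target area.
    new_pos = np.asarray(pos) + d * np.array([np.cos(theta), np.sin(theta)])
    return np.clip(new_pos, 0.0, area_side)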

System instantaneous reward r(t): the goal of the present invention is to maximize the throughput of the UAV network while taking user service fairness and energy consumption into account. Therefore, the extra throughput produced at each time t by adjusting the hovering positions of the UAVs is a positive reward term, expressed as:

ΔC(t) = C(S_u(t+1), S_g(t)) - C(S_u(t), S_g(t))

where C(S_u(t), S_g(t)) denotes the throughput produced by the network when the UAV base station state is S_u(t) and the ground user state is S_g(t), and C(S_u(t+1), S_g(t)) denotes the throughput produced when the UAV base station state is S_u(t+1) and the ground user state is S_g(t). Considering the fairness of user service: if many users are gathered in one area while another area contains only a single user, the UAV base stations would keep hovering over the high-density area in pursuit of maximum throughput and neglect the low-density area. The present invention therefore applies a weight w_n(t) to each user's throughput reward to achieve proportional fair scheduling. R_req denotes the minimum communication rate required by a ground user, and R_n(t) denotes the average communication rate of ground user n from the start up to time t. When a UAV base station serves the user, R_n(t) grows and the user's weight gradually decreases; if the user is not served, R_n(t) decreases and the user's weight keeps increasing. As a result, the reward weight of user-sparse areas keeps growing, attracting UAV base stations to serve them.

[Equation: proportional-fairness weight w_n(t), defined in terms of R_req and R_n(t)]

[Equation: fairness-weighted throughput reward term involving w_n(t) and the user-association indicator a_{n,m}(t)]

where a_{n,m}(t) is an indicator variable: at time t, if UAV base station m serves ground user n, then a_{n,m}(t) = 1, and otherwise a_{n,m}(t) = 0. Therefore, jointly considering the fairness-weighted throughput reward and the energy consumption penalty, the present invention gives the system instantaneous reward r(t):

[Equation: system instantaneous reward r(t), combining the fairness-weighted throughput reward with an energy consumption penalty weighted by α]

where α denotes the weight of the energy consumption penalty; the larger α is, the more the system emphasizes energy loss when making decisions, and the smaller α is, the more energy loss is ignored.

Local observation set O(t) = {o_1(t), ..., o_M(t)}: when multiple UAV base stations work cooperatively over a large area, each UAV cannot observe the global information and can only observe the ground user information within its own coverage range. o_m(t) denotes the position information of the ground users within its coverage observed by UAV base station m at time t.
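Since the weight and reward formulas appear as images in the original text, the sketch below only illustrates one plausible composition consistent with the described behaviour: a ratio-type proportional-fairness weight w_n(t) of roughly R_req / R_n(t), and a reward equal to the fairness-weighted throughput gain minus an α-weighted energy penalty. Both choices are assumptions, not the patent's exact formulas.

import numpy as np

def fairness_weight(R_req, R_n, eps=1e-6):
    # Users whose running-average rate R_n falls below the requirement R_req
    # get a larger weight; the ratio form is an assumption.
    return R_req / (np.asarray(R_n) + eps)

def instant_reward(delta_rate_per_user, weights, served_mask, energy_per_uav, alpha=0.1):
    # Assumed reward: fairness-weighted throughput gain of the served users
    # minus the energy penalty weighted by alpha.
    fair_gain = float(np.sum(np.asarray(weights) * np.asarray(served_mask)
                             * np.asarray(delta_rate_per_user)))
    return fair_gain - alpha * float(np.sum(energy_per_uav))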

(3) Train based on the multi-agent deep reinforcement learning algorithm:

The present invention introduces the multi-agent deep reinforcement learning algorithm MADDPG into the hovering position optimization of the UAV-to-ground communication network, adopting a centralized-training and distributed-execution architecture: global information is used during training to better guide the gradient updates of each UAV's decision function, while during execution each UAV makes its next decision using only the local information it observes itself, which better fits the needs of practical scenarios. Each agent is trained with a DDPG network in the Actor-Critic architecture: the policy network is used to fit the policy function u(o), taking the local observation o as input and outputting the action a; the critic (evaluation) network is used to fit the state-action function Q(s, a), which denotes the expected reward obtained by taking action a when the system state is s. Let u = {u_1, ..., u_M} denote the deterministic policy functions of the M agents, θ_u = {θ_{u_1}, ..., θ_{u_M}} the parameters of the policy networks, Q = {Q_1, ..., Q_M} the critic networks of the M agents, and θ_Q = {θ_{Q_1}, ..., θ_{Q_M}} the parameters of the critic networks. Step (3) includes:
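The following PyTorch sketch shows one possible shape of the actor (policy) and centralized critic (evaluation) networks described above; layer sizes, activations and the action scaling are assumptions rather than the patent's specification.

import torch
import torch.nn as nn

class Actor(nn.Module):
    # Policy network u_m(o): local observation -> bounded action (angle, distance).
    def __init__(self, obs_dim, act_dim=2, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, act_dim), nn.Tanh(),  # outputs in [-1, 1]
        )

    def forward(self, obs):
        return self.net(obs)

class Critic(nn.Module):
    # Centralized critic Q_m(s, a_1..a_M): global state plus the joint action
    # of all UAVs, used only during centralized training.
    def __init__(self, state_dim, joint_act_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + joint_act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state, joint_action):
        return self.net(torch.cat([state, joint_action], dim=-1))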

(3.1) Initialize the experience replay buffer and set its size, and initialize the parameters of each DDPG network, the number of training episodes, etc.

(3.2) Start from training episode epoch = 1 and from time t = 1.

(3.3) Obtain the local observation information o of the current UAVs and the current state s of the whole system. Using the local observation obtained in time slot t, each UAV m adjusts its hovering position based on the ε-greedy strategy and the decision a_m output by its DDPG network, and, according to the path loss to the ground users, selects the W ground users with the lowest path loss for communication service based on a greedy scheme, obtaining the instantaneous reward r, reaching the next system state s′ and obtaining the local observation o′. Store (s, o, a, r, s′, o′) as a sample in the experience replay buffer, where a = {a_1, ..., a_M} denotes the joint action of all UAVs and o = {o_1, ..., o_M} denotes the local observations of all UAVs; set t = t + 1.
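A minimal experience replay buffer for the (s, o, a, r, s′, o′) samples described in step (3.3); the capacity is a placeholder value.

import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=100000):
        # Fixed-size buffer; the oldest samples are discarded first.
        self.buffer = deque(maxlen=capacity)

    def add(self, s, o, a, r, s_next, o_next):
        self.buffer.append((s, o, a, r, s_next, o_next))

    def sample(self, k):
        # Uniform random mini-batch of k stored transitions.
        return random.sample(self.buffer, k)

    def __len__(self):
        return len(self.buffer)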

(3.4) If the number of samples stored in the replay buffer is greater than B, go to step (3.5); otherwise, continue collecting samples and return to step (3.3).

(3.5) For each agent m, randomly sample a fixed number K of samples from the experience replay buffer and compute the target values, where the target value y_k of the k-th sample (s_k, o_k, a_k, r_k, s′_k, o′_k) can be expressed as y_k = r_k + γ · Q′_m(s′_k, a′_1, ..., a′_M), where Q′_m denotes the target network of the m-th agent's critic network, u′_m denotes the target network of the m-th agent's policy network, r_k denotes the instantaneous reward in the k-th sample, and a′_m denotes the decision made by UAV m in system state s′_k according to its local observation o′_{m,k}. Based on global information, the parameters of this agent's critic network are updated by minimizing the loss function L(θ_{Q_m}) = (1/K) · Σ_k (y_k - Q_m(s_k, a^k_1, ..., a^k_M))^2 using gradient descent:

[Equation: gradient-descent update of the critic network parameters θ_{Q_m}]

Then, according to the critic network and the sample information, update the parameters of this agent's policy network based on the sampled policy gradient:

[Equation: sampled policy-gradient update of the policy network parameters θ_{u_m}]
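The sketch below shows one MADDPG-style update for agent m on a sampled mini-batch, following the centralized-critic description above; tensor layouts, the shared scalar reward, and the use of per-agent optimizers are assumptions of this sketch, not the patent's exact update rules.

import torch
import torch.nn.functional as F

def maddpg_update(m, batch, actors, critics, target_actors, target_critics,
                  actor_opts, critic_opts, gamma=0.95):
    # batch: s, s_next are [K, state_dim] tensors; o, a, o_next are lists of
    # per-agent tensors; r is a [K, 1] tensor with the instantaneous reward.
    s, o, a, r, s_next, o_next = batch
    M = len(actors)

    # Target value y_k = r_k + gamma * Q'_m(s', a'_1..a'_M), with a'_j taken
    # from the target policy of agent j on its next local observation.
    with torch.no_grad():
        a_next = [target_actors[j](o_next[j]) for j in range(M)]
        y = r + gamma * target_critics[m](s_next, torch.cat(a_next, dim=-1))

    # Critic update: minimize (y - Q_m(s, a_1..a_M))^2 over the mini-batch.
    q = critics[m](s, torch.cat(a, dim=-1))
    critic_loss = F.mse_loss(q, y)
    critic_opts[m].zero_grad()
    critic_loss.backward()
    critic_opts[m].step()

    # Actor update: ascend the sampled policy gradient through the critic,
    # re-evaluating only agent m's action with its current policy.
    a_pred = [actors[j](o[j]) if j == m else a[j] for j in range(M)]
    actor_loss = -critics[m](s, torch.cat(a_pred, dim=-1)).mean()
    actor_opts[m].zero_grad()
    actor_loss.backward()
    actor_opts[m].step()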

(3.6) After a certain number of iterations, update the critic target network parameters θ_Q′ and the policy target network parameters θ_u′: θ_Q′ = τ·θ_Q + (1-τ)·θ_Q′, θ_u′ = τ·θ_u + (1-τ)·θ_u′. When the total duration T is reached or the UAV energy is exhausted, exit the current training episode; otherwise, return to step (3.3). If the set number of training episodes has been reached, exit the training process; otherwise, enter a new training episode.
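A small helper, assuming PyTorch modules as in the earlier sketches, for the soft (Polyak) target-network update θ′ = τθ + (1-τ)θ′ used in step (3.6).

def soft_update(target_net, net, tau=0.01):
    # theta' <- tau * theta + (1 - tau) * theta', with tau in (0, 1).
    for p_target, p in zip(target_net.parameters(), net.parameters()):
        p_target.data.copy_(tau * p.data + (1.0 - tau) * p_target.data)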

(4) Assign the trained policy network u to each UAV and deploy the UAVs to the target area; in each time slot, each UAV adjusts its hovering position according to its own local observations and provides communication services to the ground users.
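A sketch of the distributed execution in step (4), assuming the Actor module from the earlier sketch: each UAV feeds only its own local observation to its policy and moves accordingly; the scaling of the network output to an angle and a distance is an assumption.

import numpy as np
import torch

def execute_step(actors, local_obs, uav_pos, area_side, max_dist=50.0):
    # One decision step per UAV using only local observations (no controller).
    new_pos = np.array(uav_pos, dtype=float)
    for m, actor in enumerate(actors):
        with torch.no_grad():
            a = actor(torch.as_tensor(local_obs[m], dtype=torch.float32)).numpy()
        theta = float(a[0]) * np.pi                   # heading in [-pi, pi]
        d = (float(a[1]) + 1.0) / 2.0 * max_dist      # distance in [0, max_dist]
        move = d * np.array([np.cos(theta), np.sin(theta)])
        new_pos[m] = np.clip(new_pos[m] + move, 0.0, area_side)
    return new_pos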

Beneficial effects of the present invention: the present invention proposes a UAV network hovering position optimization method based on multi-agent deep reinforcement learning, which models the throughput maximization problem of the UAV-to-ground communication network as a partially observable Markov decision process and introduces the multi-agent deep reinforcement learning method MADDPG for centralized training and distributed execution, solving the UAV hovering position optimization problem in dynamic environments. The method enables the UAV swarm to adapt better to dynamic environments, and multiple UAVs can cooperate in a distributed manner without relying on a centralized controller. The present invention introduces proportional-fairness weights and energy consumption information into the construction of the instantaneous reward function, which improves throughput while ensuring, to a certain extent, the fairness of user service and the low energy consumption of the UAV swarm.

Description of Drawings

FIG. 1 is a schematic diagram of the UAV-to-ground communication network scenario according to the present invention.

FIG. 2 is a flowchart of the UAV network hovering position optimization method based on multi-agent deep reinforcement learning of the present invention.

FIG. 3 is a flowchart of training the distributed policy networks of the UAVs based on multi-agent deep reinforcement learning according to the present invention.

Detailed Description

In order to make the objectives, technical solutions and advantages of the present invention clearer, the present invention is further described in detail below with reference to the embodiments. It should be understood that the specific embodiments described herein are only used to explain the present invention and are not intended to limit it.

A UAV network hovering position optimization method based on multi-agent deep reinforcement learning is applied to emergency communication recovery in areas lacking ground infrastructure or in post-disaster areas. As shown in FIG. 1, the area lacks basic communication facilities and UAVs serve as mobile base stations to provide communication coverage. The ground environment changes dynamically and the ground devices may move, so the UAV base stations need to continually adjust their hovering positions to provide better communication service (maximizing the throughput of the system). At the same time, service fairness and energy loss must also be considered: ground users must not be neglected in the pursuit of maximum throughput, and the energy loss caused by the movement of the UAV base stations should be minimized. The flow of the present invention is shown in FIG. 2. First, the communication model, coverage model and energy consumption model of the specific application scenario are modeled and the optimization objective is constructed; second, according to the optimization objective and the characteristics of the multi-UAV system, the optimization problem is modeled as a partially observable Markov decision process; then, a simulation platform is used to simulate the multi-UAV-to-ground communication scenario, samples are collected through the interaction between the UAV swarm and the environment, and centralized training is performed with the multi-agent deep reinforcement learning algorithm MADDPG to obtain a distributed policy for each UAV. Finally, the trained policy networks are deployed to the UAVs and the UAV swarm is deployed to the target area, where the UAVs cooperate with each other to achieve high-throughput, low-energy-consumption and fair communication coverage.

The specific steps are as follows:

(1) Establish a multi-UAV-to-ground communication network model, which mainly includes the following four steps:

(1.1) Establish the scene model: a square target area with side length l is set up, containing N ground users and M UAV base stations (UAV-BSs) that provide communication services for the ground users. Time is divided into T identical time slots; from the previous time slot to the current one, a ground user may stay still or may move, so each UAV base station needs to find a new optimal hovering position in every time slot and, after reaching the target position, select ground users for data transmission service.

(1.2) Establish the air-to-ground communication model: the present invention uses an air-to-ground channel model to model the channel between a UAV base station and a ground user. Owing to its high flying altitude, a UAV base station can establish a line-of-sight (LoS) link with a ground user more easily than a terrestrial base station. In the LoS case, the path loss model between UAV base station m and ground user n is:

L_{n,m}(t) = η · (4π · f_c · d_{n,m}(t) / c)^α

where η denotes the excess path loss coefficient, c the speed of light, f_c the subcarrier frequency, and α the path loss exponent; d_{n,m} = sqrt(r_{n,m}^2 + h^2) denotes the distance between UAV base station m and ground user n, where r_{n,m} is the horizontal distance and h is the fixed flying height of the UAV base station. From the path loss, the channel gain can be expressed as g_{n,m}(t) = 1 / L_{n,m}(t). Based on the channel gain, the data transmission rate between UAV base station m and ground user n in time slot t is:

R_{n,m}(t) = log_2(1 + p_t · g_{n,m}(t) / σ)

where σ denotes the additive white Gaussian noise power, p_t the transmit power of the UAV base station, and g_{n,m}(t) the channel gain between UAV base station m and ground user n at time t.

(1.3) Establish the coverage model: due to hardware limitations, the coverage of each UAV base station is limited. The present invention defines a maximum tolerable path loss L_max: if, at a given moment, the path loss between a UAV base station and a user is less than L_max, the established connection is considered reliable; otherwise, establishing the connection is considered to have failed. Therefore, the effective coverage of each UAV base station can be defined according to the maximum tolerable path loss, as a circle centered at the projection point of the UAV base station on the ground with radius R_cov. From the path loss formula, R_cov can be expressed as:

R_cov = sqrt( (c / (4π · f_c))^2 · (L_max / η)^(2/α) - h^2 )

(1.4) Establish the energy loss model: the present invention mainly focuses on the energy loss caused by UAV movement. Considering the flight speed V and flight power p_f of the UAV, the flight energy consumption of UAV base station m in time slot t depends on the distance flown:

Δe_m(t) = (p_f / V) · sqrt( (x_m(t+1) - x_m(t))^2 + (y_m(t+1) - y_m(t))^2 )

where x_m(t) and y_m(t) denote the position coordinates of the UAV on the x-axis and y-axis of the horizontal plane, respectively.

(2) Model the problem as a partially observable Markov decision process:

Each UAV base station corresponds to an agent. In each time slot with environment state S(t), agent m can only obtain a local observation o_m within its own coverage range and, according to its decision function u_m(o_m), selects an action a_m from the action set A so as to maximize the total discounted expected reward E[ Σ_t γ^t · r_m(t) ], where γ ∈ (0,1) is the discount factor and r_m(t) denotes the reward of agent m at time t.

System state set S = {S(t) | S(t) = (S_u(t), S_g(t))}, comprising the current state of the UAV base stations S_u(t) and the current state of the ground users S_g(t). The UAV base station state S_u(t) includes the current position information of the UAVs; the ground user state S_g(t) includes the current position information of the ground users.

UAV action set A = {a(t) | a(t) = (θ(t), d(t))}: in time slot t, after obtaining its current local observation, UAV m needs to make a decision a_m(t) and move to the next hovering position; the action therefore consists of a flight rotation angle θ(t) and a moving distance d(t).

System instantaneous reward r(t): the goal of the present invention is to maximize the throughput of the UAV network while taking user service fairness and energy consumption into account. Therefore, the extra throughput produced at each time t by adjusting the hovering positions of the UAVs is a positive reward term, expressed as:

ΔC(t) = C(S_u(t+1), S_g(t)) - C(S_u(t), S_g(t))

where C(S_u(t), S_g(t)) denotes the throughput produced by the network when the UAV base station state is S_u(t) and the ground user state is S_g(t), and C(S_u(t+1), S_g(t)) denotes the throughput produced when the UAV base station state is S_u(t+1) and the ground user state is S_g(t). Considering the fairness of user service: if many users are gathered in one area while another area contains only a single user, the UAV base stations would keep hovering over the high-density area in pursuit of maximum throughput and neglect the low-density area. The present invention therefore applies a weight w_n(t) to each user's throughput reward to achieve proportional fair scheduling. R_req denotes the minimum communication rate required by a ground user, and R_n(t) denotes the average communication rate of ground user n from the start up to time t. When a UAV base station serves the user, R_n(t) grows and the user's weight gradually decreases; if the user is not served, R_n(t) decreases and the user's weight keeps increasing. As a result, the reward weight of user-sparse areas keeps growing, attracting UAV base stations to serve them.

[Equation: proportional-fairness weight w_n(t), defined in terms of R_req and R_n(t)]

[Equation: fairness-weighted throughput reward term involving w_n(t) and the user-association indicator a_{n,m}(t)]

Therefore, jointly considering the fairness-weighted throughput reward and the energy consumption penalty, the present invention gives the system instantaneous reward r(t):

[Equation: system instantaneous reward r(t), combining the fairness-weighted throughput reward with an energy consumption penalty weighted by α]

where α denotes the weight of the energy consumption penalty; the larger α is, the more the system emphasizes energy loss when making decisions, and the smaller α is, the more energy loss is ignored.

Local observation set O(t) = {o_1(t), ..., o_M(t)}: when multiple UAV base stations work cooperatively over a large area, each UAV cannot observe the global information and can only observe the ground user information within its own coverage range. o_m(t) denotes the position information of the ground users within its coverage observed by UAV base station m.

(3) Train based on the multi-agent deep reinforcement learning algorithm:

The present invention introduces the multi-agent deep reinforcement learning algorithm MADDPG into the hovering position optimization of the UAV-to-ground communication network, adopting a centralized-training and distributed-execution architecture: global information is used during training to better guide the gradient updates of each UAV's decision function, while during execution each UAV makes its next decision using only the local information it observes itself, which better fits the needs of practical scenarios. Each agent is trained with a DDPG network in the Actor-Critic architecture: the policy network is used to fit the policy function u(o), taking the local observation o as input and outputting the action a; the critic (evaluation) network is used to fit the state-action function Q(s, a), which denotes the expected reward obtained by taking action a when the system state is s. Let u = {u_1, ..., u_M} denote the deterministic policy functions of the M agents, θ_u = {θ_{u_1}, ..., θ_{u_M}} the parameters of the policy networks, Q = {Q_1, ..., Q_M} the critic networks of the M agents, and θ_Q = {θ_{Q_1}, ..., θ_{Q_M}} the parameters of the critic networks. As shown in FIG. 3, step (3) includes:

(3.1) Initialize the experience replay buffer and set its size B, and initialize the parameters θ of each DDPG network, the number of training episodes P, the duration T, etc.

(3.2) Start from training episode epoch = 1 and from time t = 1.

(3.3) Obtain the local observation information o of the current UAVs and the current state s of the whole system. Using the local observation obtained in time slot t, each UAV m adjusts its hovering position based on the ε-greedy strategy and the decision a_m output by its DDPG network, and, according to the path loss to the ground users, selects the W ground users with the lowest path loss for communication service based on a greedy scheme, obtaining the instantaneous reward r, reaching the next system state s′ and obtaining the local observation o′. Store (s, o, a, r, s′, o′) as a sample in the experience replay buffer, where a = {a_1, ..., a_M} denotes the joint action of all UAVs and o = {o_1, ..., o_M} denotes the local observations of all UAVs; set t = t + 1.

(3.4) If the number of samples stored in the replay buffer is greater than B, go to step (3.5); otherwise, continue collecting samples and return to step (3.3).

(3.5) For each agent m, randomly sample a fixed number K of samples from the experience replay buffer and compute the target values, where the target value y_k of the k-th sample (s_k, o_k, a_k, r_k, s′_k, o′_k) can be expressed as y_k = r_k + γ · Q′_m(s′_k, a′_1, ..., a′_M), where Q′_m denotes the target network of the m-th agent's critic network, u′_m denotes the target network of the m-th agent's policy network, r_k denotes the instantaneous reward in the k-th sample, and a′_m denotes the decision made by UAV m in system state s′_k according to its local observation o′_{m,k}. Based on global information, the parameters of this agent's critic network are updated by minimizing the loss function L(θ_{Q_m}) = (1/K) · Σ_k (y_k - Q_m(s_k, a^k_1, ..., a^k_M))^2 using gradient descent:

[Equation: gradient-descent update of the critic network parameters θ_{Q_m}]

Then, according to the critic network and the sample information, update the parameters of this agent's policy network based on the sampled policy gradient:

[Equation: sampled policy-gradient update of the policy network parameters θ_{u_m}]

(3.6) After a certain number of iterations, update the critic target network parameters θ_Q′ and the policy target network parameters θ_u′: θ_Q′ = τ·θ_Q + (1-τ)·θ_Q′, θ_u′ = τ·θ_u + (1-τ)·θ_u′. When the total duration T is reached or the UAV energy is exhausted, exit the current training episode; otherwise, return to step (3.3). If the set number of training episodes has been reached, exit the training process; otherwise, enter a new training episode.

(4) Assign the trained policy network u to each UAV and deploy the UAVs to the target area; in each time slot, each UAV adjusts its hovering position according to its own local observations and provides communication services to the ground users.

In summary:

The present invention proposes a UAV network hovering position optimization method based on multi-agent deep reinforcement learning. By modeling the throughput maximization problem of the multi-UAV-to-ground communication scenario as a partially observable Markov decision process and solving it with the MADDPG algorithm, the UAV swarm can adapt to dynamic environments and cooperate in a distributed manner, achieving high throughput, low energy consumption and service fairness of the network.

The basic principles, main features and advantages of the present invention have been shown and described above. Those skilled in the art should understand that the present invention is not limited by the above embodiments; the above embodiments and the description only illustrate the principle of the present invention, and various changes and improvements may be made without departing from its spirit and scope, all of which fall within the scope of the claimed invention. The scope of protection claimed by the present invention is defined by the appended claims and their equivalents.

Claims (1)

1. An unmanned aerial vehicle network hovering position optimization method based on multi-agent deep reinforcement learning is characterized by comprising the following steps:
(1) establishing a multi-unmanned aerial vehicle to ground communication network model:
(1.1) establishing a scene model: establishing a square target area with side length l, wherein N ground users and M unmanned aerial vehicle base stations are arranged in the area, and the unmanned aerial vehicle base stations provide communication service for the ground users; time is divided into T identical time slots, and from the last time slot to the current time slot a ground user may be static or may move, so the unmanned aerial vehicle base station needs to search for a new optimal hovering position in each time slot and select ground users for data transmission service after reaching the target position;
(1.2) establishing an air-to-ground communication model: an air-to-ground channel model is used to model the channel between an unmanned aerial vehicle base station and a ground user; owing to its high flight altitude, an unmanned aerial vehicle base station establishes a line-of-sight (LoS) link with a ground user more easily than a ground base station does; in the LoS case, the path loss model between unmanned aerial vehicle base station m and ground user n is:
L_{n,m}(t) = η · (4π · f_c · d_{n,m}(t) / c)^α
where η denotes the excess path loss coefficient, c denotes the speed of light, f_c denotes the subcarrier frequency, α denotes the path loss exponent, d_{n,m} = sqrt(r_{n,m}^2 + h^2) denotes the distance between the unmanned aerial vehicle base station m and the ground user n, r_{n,m} is the horizontal distance, and h is the fixed flying height of the unmanned aerial vehicle base station; the channel gain is expressed as g_{n,m}(t) = 1 / L_{n,m}(t);
according to the channel gain, the data transmission rate between the unmanned aerial vehicle base station m and the ground user n in time slot t is R_{n,m}(t):
R_{n,m}(t) = log_2(1 + p_t · g_{n,m}(t) / σ)
where σ represents the additive white Gaussian noise power, p_t represents the transmit power of the unmanned aerial vehicle base station, and g_{n,m}(t) represents the channel gain between the unmanned aerial vehicle base station m and the ground user n at time t;
(1.3) establishing a coverage model: defining a maximum tolerable path loss L_max; if, at a given moment, the path loss between the unmanned aerial vehicle base station and the user is less than L_max, the established connection is considered reliable, and otherwise establishing the connection is considered to have failed; defining the effective coverage range of each unmanned aerial vehicle base station according to the maximum tolerable path loss, wherein the range takes the projection point of the unmanned aerial vehicle base station on the ground as the circle center and R_cov as the radius; according to the path loss formula, R_cov is expressed as:
R_cov = sqrt( (c / (4π · f_c))^2 · (L_max / η)^(2/α) - h^2 )
(1.4) establishing an energy loss model: focusing on the energy loss caused by movement of the unmanned aerial vehicle, and considering the flying speed V and the flying power p_f of the unmanned aerial vehicle, the flight energy consumption Δe_m(t) of the unmanned aerial vehicle base station m in time slot t depends on the distance flown:
Δe_m(t) = (p_f / V) · sqrt( (x_m(t+1) - x_m(t))^2 + (y_m(t+1) - y_m(t))^2 )
wherein x_m(t) and y_m(t) respectively represent the position coordinates of the unmanned aerial vehicle on the x-axis and the y-axis of the horizontal plane at time t;
(2) modeling the problem as a partially observable Markov decision process:
each unmanned aerial vehicle base station is equivalent to an agent; in each time slot with environment state S(t), the agent m can only obtain a local observation o_m within its own coverage range and, according to a decision function u_m(o_m), selects an action a_m from the action set so as to maximize the total discounted expected reward E[ Σ_t γ^t · r_m(t) ], where γ ∈ (0,1) is the discount coefficient and r_m(t) represents the reward of agent m at time t;
the system state set S = {S(t) | S(t) = (S_u(t), S_g(t))} respectively contains the current state of the unmanned aerial vehicle base stations S_u(t) and the current state of the ground users S_g(t); the state of each unmanned aerial vehicle base station includes the current position information of the unmanned aerial vehicle; the state of each ground user includes the location information of the current ground user;
in time slot t, the unmanned aerial vehicle m needs to make a decision a_m(t) after obtaining its current local observation information and move to the next hovering position, so the action set includes a flight rotation angle θ(t) and a movement distance d(t);
system real-time reward r(t): the throughput of the unmanned aerial vehicle network is maximized while user service fairness and energy consumption are considered; thus, the extra throughput generated by adjusting the hovering positions of the unmanned aerial vehicles at each time t is a positive reward, expressed as:
ΔC(t) = C(S_u(t+1), S_g(t)) - C(S_u(t), S_g(t))
wherein C(S_u(t), S_g(t)) represents the throughput generated by the network when the unmanned aerial vehicle base station state is S_u(t) and the ground user state is S_g(t), and C(S_u(t+1), S_g(t)) represents the throughput generated by the network when the unmanned aerial vehicle base station state is S_u(t+1) and the ground user state is S_g(t); considering the fairness of user service, if a large number of users are gathered in a certain area while another area has only a small number of users, the unmanned aerial vehicle base stations would always hover in the high-density area in pursuit of maximum throughput and ignore the low-density area, so a weight w_n(t) is applied to the throughput reward of each user to implement proportional fair scheduling; R_req represents the minimum communication rate required by a ground user, and R_n(t) represents the average communication rate of ground user n from the beginning to time t; when an unmanned aerial vehicle base station serves this user, R_n(t) increases and the user's weight gradually becomes smaller; if the user is not served, R_n(t) decreases and the user's weight keeps increasing; therefore, the reward weight of a user-sparse area continuously increases, attracting unmanned aerial vehicle base stations to provide service;
[Equation: proportional-fairness weight w_n(t), defined in terms of R_req and R_n(t)]
[Equation: fairness-weighted throughput reward term involving w_n(t) and the indicator a_{n,m}(t)]
wherein a_{n,m}(t) is an indicator variable: at time t, if unmanned aerial vehicle base station m serves ground user n, then a_{n,m}(t) = 1, and otherwise a_{n,m}(t) = 0; therefore, comprehensively considering the fairness-weighted throughput reward and the energy loss penalty, the system real-time reward r(t) is:
[Equation: system real-time reward r(t), combining the fairness-weighted throughput reward with an energy consumption penalty weighted by α]
wherein α represents the weight of the energy consumption penalty; the larger α is, the more the system emphasizes energy loss when making decisions, and conversely, the more energy loss is ignored;
local observation set O(t) = {o_1(t), ..., o_M(t)}: when a plurality of unmanned aerial vehicle base stations work cooperatively in a large area, each unmanned aerial vehicle cannot observe global information and can only observe the ground user information within its own coverage area; o_m(t) represents the position information of the ground users within the coverage range of unmanned aerial vehicle base station m observed at time t;
(3) training based on a multi-agent deep reinforcement learning algorithm:
the multi-agent deep reinforcement learning algorithm MADDPG is introduced into the hovering position optimization of the unmanned aerial vehicle to-ground communication network, adopting a centralized-training and distributed-execution architecture; global information is used during training to better guide the gradient updates of the decision function of each unmanned aerial vehicle, and during execution each unmanned aerial vehicle only uses the local information it observes itself to make the next decision, which better fits the needs of practical scenarios; each agent adopts a DDPG network with the Actor-Critic architecture for training: the policy network is used to fit the policy function u(o), taking the local observation o as input and outputting an action a; the evaluation network is used to fit the state-action function Q(s, a), which represents the expected reward of taking action a when the system state is s; let u = {u_1, ..., u_M} denote the deterministic policy functions of the M agents, θ_u = {θ_{u_1}, ..., θ_{u_M}} denote the parameters of the policy networks, Q = {Q_1, ..., Q_M} denote the evaluation networks of the M agents, and θ_Q = {θ_{Q_1}, ..., θ_{Q_M}} denote the parameters of the evaluation networks;
(3.1) initializing an experience playback space, setting the size of the experience playback space, initializing parameters of each DDPG network and training rounds;
(3.2) starting from training round epoch = 1 and from time t = 1;
(3.3) acquire each UAV's local observation o and the current state s of the whole system; each UAV m uses the local observation obtained in time slot t and outputs its decision a_m from an ε-greedy strategy and its DDPG network, adjusts its hovering position, and, based on the path loss between the hovering position and the ground users, greedily selects the W ground users with the lowest path loss to serve; the instant reward r is obtained, the system reaches the next state s', and the new local observations o' are collected; store (s, o, a, r, s', o') as a sample in the experience replay space, where a = {a_1, ..., a_M} denotes the joint action of all UAVs and o = {o_1, ..., o_M} denotes the local observations of all UAVs; set t = t + 1;
(3.4) if the number of samples stored in the replay space is greater than B, go to step (3.5); otherwise, continue collecting samples and return to step (3.3);
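A small sketch of the experience replay space used in steps (3.3)-(3.4); the capacity and tuple layout are assumptions for illustration:

```python
import random
from collections import deque

class ReplayBuffer:
    """Shared experience replay space storing (s, o, a, r, s', o') tuples."""
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def add(self, s, o, a, r, s_next, o_next):
        self.buffer.append((s, o, a, r, s_next, o_next))

    def __len__(self):
        return len(self.buffer)

    def sample(self, k):
        """Uniformly sample K stored transitions for one training update."""
        return random.sample(self.buffer, k)

# Collection loop sketch: interact with the environment and keep adding
# samples until more than B transitions are stored, then start training.
```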
(3.5) for each agent m, randomly sample a fixed number K of samples from the experience replay space and compute the target values, where the target value y_k of the k-th sample (s_k, o_k, a_k, r_k, s'_k, o'_k) can be expressed as:

y_k = r_k + γ·Q'_m(s'_k, a'_1, ..., a'_M), with a'_j = u'_j(o'_{j,k}) for every agent j,

where Q'_m denotes the target network of the evaluation network of the m-th agent, u'_m denotes the target network of the policy network of the m-th agent, r_k denotes the instant reward in the k-th sample, and a'_m denotes the decision made by UAV m from its local observation o'_{m,k} when the system state is s'_k; using gradient descent based on global information, minimize the loss function

L(θ^Q_m) = (1/K)·Σ_k ( y_k − Q_m(s_k, a^k_1, ..., a^k_M) )²

and update the parameters of the agent's evaluation network; then, according to the evaluation network and the sample information, update the parameters of the agent's policy network along the sampled policy gradient:

∇_{θ^u_m} J ≈ (1/K)·Σ_k ∇_{θ^u_m} u_m(o^k_m) · ∇_{a_m} Q_m(s_k, a^k_1, ..., a_m, ..., a^k_M) |_{a_m = u_m(o^k_m)};
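The following sketch shows how step (3.5) could be realized for one agent m in PyTorch, reusing the Actor/Critic shapes sketched earlier; the batch layout (lists of per-agent tensors, column-shaped rewards) and the optimizer handling are assumptions:

```python
import torch
import torch.nn.functional as F

def maddpg_update(m, batch, actors, critics, target_actors, target_critics,
                  actor_opts, critic_opts, gamma):
    """One MADDPG update of agent m from a sampled mini-batch.

    batch["s"], batch["s_next"]: global state tensors, shape (K, state_dim)
    batch["obs"], batch["obs_next"]: lists of per-agent observation tensors
    batch["acts"]: list of per-agent action tensors
    batch["r"][m]: agent m's rewards, shape (K, 1)
    """
    # Target value y_k: every agent's target policy proposes its next action.
    with torch.no_grad():
        next_acts = [ta(o) for ta, o in zip(target_actors, batch["obs_next"])]
        q_next = target_critics[m](batch["s_next"], torch.cat(next_acts, dim=-1))
        y = batch["r"][m] + gamma * q_next

    # Evaluation (critic) network: minimize (y_k - Q_m(s_k, a_1..a_M))^2.
    q = critics[m](batch["s"], torch.cat(batch["acts"], dim=-1))
    critic_loss = F.mse_loss(q, y)
    critic_opts[m].zero_grad()
    critic_loss.backward()
    critic_opts[m].step()

    # Policy (actor) network: ascend the sampled policy gradient by
    # re-evaluating agent m's own action and maximizing Q_m.
    acts = [a.detach() for a in batch["acts"]]
    acts[m] = actors[m](batch["obs"][m])
    actor_loss = -critics[m](batch["s"], torch.cat(acts, dim=-1)).mean()
    actor_opts[m].zero_grad()
    actor_loss.backward()
    actor_opts[m].step()
```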
(3.6) after a certain interval of rounds, update the target-network parameters θ^{Q'} of the evaluation network and θ^{u'} of the policy network: θ^{Q'} = τ·θ^Q + (1−τ)·θ^{Q'}, θ^{u'} = τ·θ^u + (1−τ)·θ^{u'}, where τ ∈ (0,1) is the update weight; when the total duration T is reached or the UAV's energy is exhausted, exit the current training round, otherwise return to step (3.3); if the number of training rounds has been reached, exit the training process, otherwise start a new training round;
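The soft target-network update in step (3.6) can be written as a short helper; the function name is illustrative:

```python
def soft_update(target_net, net, tau):
    """θ' ← τ·θ + (1 − τ)·θ' for every parameter of a target network."""
    for p_target, p in zip(target_net.parameters(), net.parameters()):
        p_target.data.copy_(tau * p.data + (1.0 - tau) * p_target.data)
```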
(4) distribute the trained policy networks u to the UAVs and deploy the UAVs to the target area; in each time slot, each UAV adjusts its hovering position according to its own local observation and provides communication service to the ground users.
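A sketch of the distributed execution phase in step (4): at run time each UAV only runs its trained policy network on its own local observation; the helper name and the tensor conversion are assumptions:

```python
import torch

def execute_step(actor, local_obs):
    """Distributed execution: a UAV chooses its next hovering adjustment from
    its own local observation only; no global state is needed at run time."""
    with torch.no_grad():
        action = actor(torch.as_tensor(local_obs, dtype=torch.float32))
    return action.cpu().numpy()
```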
CN202010497656.4A 2020-06-04 2020-06-04 A UAV network hovering position optimization method based on multi-agent deep reinforcement learning Active CN111786713B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010497656.4A CN111786713B (en) 2020-06-04 2020-06-04 A UAV network hovering position optimization method based on multi-agent deep reinforcement learning

Publications (2)

Publication Number Publication Date
CN111786713A true CN111786713A (en) 2020-10-16
CN111786713B CN111786713B (en) 2021-06-08

Family

ID=72753669

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010497656.4A Active CN111786713B (en) 2020-06-04 2020-06-04 A UAV network hovering position optimization method based on multi-agent deep reinforcement learning

Country Status (1)

Country Link
CN (1) CN111786713B (en)

Patent Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180129882A1 (en) * 2016-11-08 2018-05-10 Dedrone Holdings, Inc. Systems, Methods, Apparatuses, and Devices for Identifying, Tracking, and Managing Unmanned Aerial Vehicles
CN109923799A (en) * 2016-11-11 2019-06-21 高通股份有限公司 The method restored for wave beam in millimeter-wave systems
CN209085657U (en) * 2017-08-02 2019-07-09 强力物联网投资组合2016有限公司 For data gathering system related or industrial environment with chemical production technology
US20200115047A1 (en) * 2018-10-11 2020-04-16 Beihang University Multi-uav continuous movement control method, apparatus, device, and storage medium for energy efficient communication coverage
CN110198531A (en) * 2019-05-24 2019-09-03 吉林大学 A kind of dynamic D2D relay selection method based on relative velocity
CN110430527A (en) * 2019-07-17 2019-11-08 大连理工大学 A kind of unmanned plane safe transmission power distribution method over the ground
CN110488861A (en) * 2019-07-30 2019-11-22 北京邮电大学 Unmanned plane track optimizing method, device and unmanned plane based on deeply study
CN110531617A (en) * 2019-07-30 2019-12-03 北京邮电大学 Multiple no-manned plane 3D hovering position combined optimization method, device and unmanned plane base station
CN110730028A (en) * 2019-08-29 2020-01-24 广东工业大学 Unmanned aerial vehicle-assisted backscatter communication device and resource allocation control method
CN110809274A (en) * 2019-10-28 2020-02-18 南京邮电大学 An enhanced network optimization method for UAV base stations for narrowband Internet of Things
CN111132009A (en) * 2019-12-23 2020-05-08 北京邮电大学 Mobile edge calculation method, device and system of Internet of things
CN111026147A (en) * 2019-12-25 2020-04-17 北京航空航天大学 A zero-overshoot UAV position control method and device based on deep reinforcement learning

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
CONG WANG: "Research of UAV Target Detection and Flight Control Based on Deep Learning", 《2018 INTERNATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE AND BIG DATA (ICAIBD)》 *
周毅: "基于深度强化学习的无人机自主部署及能效优化策略", 《物联网学报》 *

Cited By (128)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112256056A (en) * 2020-10-19 2021-01-22 中山大学 Unmanned aerial vehicle control method and system based on multi-agent deep reinforcement learning
CN112256056B (en) * 2020-10-19 2022-03-01 中山大学 UAV control method and system based on multi-agent deep reinforcement learning
CN112512115B (en) * 2020-11-20 2022-02-11 北京邮电大学 A method, device and electronic device for determining the position of an air base station
CN112512115A (en) * 2020-11-20 2021-03-16 北京邮电大学 Method and device for determining position of air base station and electronic equipment
CN112566209A (en) * 2020-11-24 2021-03-26 山西三友和智慧信息技术股份有限公司 UAV-BSs energy and service priority track design method based on double Q learning
CN112511197A (en) * 2020-12-01 2021-03-16 南京工业大学 Unmanned aerial vehicle auxiliary elastic video multicast method based on deep reinforcement learning
CN112752357B (en) * 2020-12-02 2022-06-17 宁波大学 Online UAV-assisted data collection method and device based on energy harvesting technology
CN112752357A (en) * 2020-12-02 2021-05-04 宁波大学 Online unmanned aerial vehicle auxiliary data collection method and device based on energy harvesting technology
CN112511250A (en) * 2020-12-03 2021-03-16 中国人民解放军火箭军工程大学 DRL-based multi-unmanned aerial vehicle air base station dynamic deployment method and system
CN112636811A (en) * 2020-12-08 2021-04-09 北京邮电大学 Relay unmanned aerial vehicle deployment method and device
CN112672361A (en) * 2020-12-17 2021-04-16 东南大学 Large-scale MIMO capacity increasing method based on unmanned aerial vehicle cluster deployment
CN112672361B (en) * 2020-12-17 2022-12-02 东南大学 Large-scale MIMO capacity increasing method based on unmanned aerial vehicle cluster deployment
CN112821938A (en) * 2021-01-08 2021-05-18 重庆大学 Total throughput and energy consumption optimization method of air-space-ground satellite communication system
CN112904890B (en) * 2021-01-15 2023-06-30 北京国网富达科技发展有限责任公司 Unmanned aerial vehicle automatic inspection system and method for power line
CN112904890A (en) * 2021-01-15 2021-06-04 北京国网富达科技发展有限责任公司 Unmanned aerial vehicle automatic inspection system and method for power line
CN112947575A (en) * 2021-03-17 2021-06-11 中国人民解放军国防科技大学 Unmanned aerial vehicle cluster multi-target searching method and system based on deep reinforcement learning
CN113094982A (en) * 2021-03-29 2021-07-09 天津理工大学 Internet of vehicles edge caching method based on multi-agent deep reinforcement learning
CN113194488A (en) * 2021-03-31 2021-07-30 西安交通大学 Unmanned aerial vehicle track and intelligent reflecting surface phase shift joint optimization method and system
CN113162679A (en) * 2021-04-01 2021-07-23 南京邮电大学 DDPG algorithm-based IRS (inter-Range instrumentation System) auxiliary unmanned aerial vehicle communication joint optimization method
CN113162679B (en) * 2021-04-01 2023-03-10 南京邮电大学 DDPG algorithm-based IRS (intelligent resilient software) assisted unmanned aerial vehicle communication joint optimization method
CN113342029A (en) * 2021-04-16 2021-09-03 山东师范大学 Maximum sensor data acquisition path planning method and system based on unmanned aerial vehicle cluster
CN113115344A (en) * 2021-04-19 2021-07-13 中国人民解放军火箭军工程大学 Unmanned aerial vehicle base station communication resource allocation strategy prediction method based on noise optimization
CN113115344B (en) * 2021-04-19 2021-12-14 中国人民解放军火箭军工程大学 Unmanned aerial vehicle base station communication resource allocation strategy prediction method based on noise optimization
CN113286275A (en) * 2021-04-23 2021-08-20 南京大学 Unmanned aerial vehicle cluster efficient communication method based on multi-agent reinforcement learning
CN113190039A (en) * 2021-04-27 2021-07-30 大连理工大学 Unmanned aerial vehicle acquisition path planning method based on hierarchical deep reinforcement learning
CN113190039B (en) * 2021-04-27 2024-04-16 大连理工大学 Unmanned aerial vehicle acquisition path planning method based on layered deep reinforcement learning
CN113364495A (en) * 2021-05-25 2021-09-07 西安交通大学 Multi-unmanned aerial vehicle track and intelligent reflecting surface phase shift joint optimization method and system
CN113286314B (en) * 2021-05-25 2022-03-08 重庆邮电大学 Unmanned aerial vehicle base station deployment and user association method based on Q learning algorithm
CN113364495B (en) * 2021-05-25 2022-08-05 西安交通大学 A method and system for joint optimization of multi-UAV trajectory and intelligent reflector phase shift
CN113286314A (en) * 2021-05-25 2021-08-20 重庆邮电大学 Unmanned aerial vehicle base station deployment and user association method based on Q learning algorithm
CN113255218A (en) * 2021-05-27 2021-08-13 电子科技大学 Unmanned aerial vehicle autonomous navigation and resource scheduling method of wireless self-powered communication network
CN113255218B (en) * 2021-05-27 2022-05-31 电子科技大学 Unmanned aerial vehicle autonomous navigation and resource scheduling method of wireless self-powered communication network
CN113328775B (en) * 2021-05-28 2022-06-21 怀化学院 UAV height positioning system and computer storage medium
CN113328775A (en) * 2021-05-28 2021-08-31 怀化学院 UAV height positioning system and computer storage medium
CN113660681A (en) * 2021-05-31 2021-11-16 西北工业大学 Multi-agent resource optimization method applied to unmanned aerial vehicle cluster auxiliary transmission
CN113660681B (en) * 2021-05-31 2023-06-06 西北工业大学 Multi-agent resource optimization method applied to unmanned aerial vehicle cluster auxiliary transmission
CN113242556B (en) * 2021-06-04 2022-08-23 重庆邮电大学 Unmanned aerial vehicle resource dynamic deployment method based on differentiated services
CN113242556A (en) * 2021-06-04 2021-08-10 重庆邮电大学 Unmanned aerial vehicle resource dynamic deployment method based on differentiated services
CN113382060A (en) * 2021-06-07 2021-09-10 北京理工大学 Unmanned aerial vehicle track optimization method and system in Internet of things data collection
CN113382060B (en) * 2021-06-07 2022-03-22 北京理工大学 A method and system for UAV trajectory optimization in IoT data collection
CN113392971A (en) * 2021-06-11 2021-09-14 武汉大学 Strategy network training method, device, equipment and readable storage medium
CN113364630A (en) * 2021-06-15 2021-09-07 广东技术师范大学 Quality of service (QoS) differentiation optimization method and device
CN113572548A (en) * 2021-06-18 2021-10-29 南京理工大学 A fast frequency hopping method for UAV network collaboration based on multi-agent reinforcement learning
CN113572548B (en) * 2021-06-18 2023-07-07 南京理工大学 A Cooperative Fast Frequency Hopping Method for UAV Network Based on Multi-agent Reinforcement Learning
CN113346944B (en) * 2021-06-28 2022-06-10 上海交通大学 Time delay minimization calculation task unloading method and system in air-space-ground integrated network
CN113346944A (en) * 2021-06-28 2021-09-03 上海交通大学 Time delay minimization calculation task unloading method and system in air-space-ground integrated network
CN113467508B (en) * 2021-06-30 2022-06-28 天津大学 Multi-unmanned aerial vehicle intelligent cooperative decision-making method for trapping task
CN113467508A (en) * 2021-06-30 2021-10-01 天津大学 Multi-unmanned aerial vehicle intelligent cooperative decision-making method for trapping task
CN113641192A (en) * 2021-07-06 2021-11-12 暨南大学 A Path Planning Method for UAV Swarm Perception Task Based on Reinforcement Learning
CN113641192B (en) * 2021-07-06 2023-07-18 暨南大学 A Path Planning Method for Unmanned Aerial Vehicle Crowd Sensing Task Based on Reinforcement Learning
CN113613339B (en) * 2021-07-10 2023-10-17 西北农林科技大学 Channel access method for multi-priority wireless terminals based on deep reinforcement learning
CN113613339A (en) * 2021-07-10 2021-11-05 西北农林科技大学 Channel access method of multi-priority wireless terminal based on deep reinforcement learning
CN113395708A (en) * 2021-07-13 2021-09-14 东南大学 Multi-autonomous-subject centralized region coverage method and system based on global environment prediction
CN113359480A (en) * 2021-07-16 2021-09-07 中国人民解放军火箭军工程大学 Multi-unmanned aerial vehicle and user cooperative communication optimization method based on MAPPO algorithm
CN113359480B (en) * 2021-07-16 2022-02-01 中国人民解放军火箭军工程大学 Multi-unmanned aerial vehicle and user cooperative communication optimization method based on MAPPO algorithm
CN113776531A (en) * 2021-07-21 2021-12-10 电子科技大学长三角研究院(湖州) Multi-UAV autonomous navigation and task assignment algorithm based on wireless self-powered communication network
CN113625751B (en) * 2021-08-05 2023-02-24 南京航空航天大学 Unmanned aerial vehicle position and resource joint optimization method for air-ground integrated federal learning
CN113625751A (en) * 2021-08-05 2021-11-09 南京航空航天大学 Unmanned aerial vehicle position and resource joint optimization method for air-ground integrated federal learning
CN113625569B (en) * 2021-08-12 2022-02-08 中国人民解放军32802部队 Small unmanned aerial vehicle prevention and control decision method and system based on hybrid decision model
CN113625569A (en) * 2021-08-12 2021-11-09 中国人民解放军32802部队 Small unmanned aerial vehicle prevention and control hybrid decision method and system based on deep reinforcement learning and rule driving
CN113706023B (en) * 2021-08-31 2022-07-12 哈尔滨理工大学 Shipboard aircraft guarantee operator scheduling method based on deep reinforcement learning
CN113706023A (en) * 2021-08-31 2021-11-26 哈尔滨理工大学 Shipboard aircraft guarantee operator scheduling method based on deep reinforcement learning
CN113691294A (en) * 2021-09-27 2021-11-23 中国人民解放军空军预警学院 Near-field sparse array antenna beam establishing method and device
CN114051252A (en) * 2021-09-28 2022-02-15 嘉兴学院 Multi-user intelligent transmitting power control method in wireless access network
CN114021775A (en) * 2021-09-30 2022-02-08 成都海天数联科技有限公司 Intelligent body handicap device putting method based on optimal solution
CN113762512A (en) * 2021-11-10 2021-12-07 北京航空航天大学杭州创新研究院 Distributed model training method, system and related device
CN114142912A (en) * 2021-11-26 2022-03-04 西安电子科技大学 Resource management and control method for high dynamic air network time coverage continuity guarantee
CN114222251A (en) * 2021-11-30 2022-03-22 中山大学·深圳 Adaptive network forming and track optimizing method for multiple unmanned aerial vehicles
CN114268986A (en) * 2021-12-14 2022-04-01 北京航空航天大学 Unmanned aerial vehicle computing unloading and charging service efficiency optimization method
CN114372612A (en) * 2021-12-16 2022-04-19 电子科技大学 Route planning and task unloading method for unmanned aerial vehicle mobile edge computing scene
CN114372612B (en) * 2021-12-16 2023-04-28 电子科技大学 Path planning and task unloading method for unmanned aerial vehicle mobile edge computing scene
CN114268963B (en) * 2021-12-24 2023-07-11 北京航空航天大学 Communication coverage-oriented unmanned aerial vehicle network autonomous deployment method
CN114268963A (en) * 2021-12-24 2022-04-01 北京航空航天大学 A method for autonomous deployment of UAV networks for communication coverage
CN114339842A (en) * 2022-01-06 2022-04-12 北京邮电大学 Dynamic trajectory design method and device for UAV swarms in time-varying scenarios based on deep reinforcement learning
CN114339842B (en) * 2022-01-06 2022-12-20 北京邮电大学 Method and device for dynamic trajectory design of UAV swarms in time-varying scenarios based on deep reinforcement learning
CN114374951B (en) * 2022-01-12 2024-04-30 重庆邮电大学 Dynamic pre-deployment method for multiple unmanned aerial vehicles
CN114374951A (en) * 2022-01-12 2022-04-19 重庆邮电大学 A method for dynamic pre-deployment of multiple UAVs
CN114124784A (en) * 2022-01-27 2022-03-01 军事科学院系统工程研究院网络信息研究所 Intelligent routing decision protection method and system based on vertical federation
CN114124784B (en) * 2022-01-27 2022-04-12 军事科学院系统工程研究院网络信息研究所 Intelligent routing decision protection method and system based on vertical federation
CN114548551A (en) * 2022-02-21 2022-05-27 广东汇天航空航天科技有限公司 Method and device for determining residual endurance time, aircraft and medium
CN114578335A (en) * 2022-03-03 2022-06-03 电子科技大学长三角研究院(衢州) Positioning method based on multi-agent deep reinforcement learning and least square
CN114578335B (en) * 2022-03-03 2024-08-16 电子科技大学长三角研究院(衢州) Positioning method based on multi-agent deep reinforcement learning and least square
CN114567888B (en) * 2022-03-04 2023-12-26 国网浙江省电力有限公司台州市黄岩区供电公司 Multi-unmanned aerial vehicle dynamic deployment method
CN114567888A (en) * 2022-03-04 2022-05-31 重庆邮电大学 Multi-unmanned aerial vehicle dynamic deployment method
CN114625151A (en) * 2022-03-10 2022-06-14 大连理工大学 Underwater robot obstacle avoidance path planning method based on reinforcement learning
CN114625151B (en) * 2022-03-10 2024-05-28 大连理工大学 Underwater robot obstacle avoidance path planning method based on reinforcement learning
CN114449482A (en) * 2022-03-11 2022-05-06 南京理工大学 Heterogeneous vehicle networking user association method based on multi-agent deep reinforcement learning
CN114449482B (en) * 2022-03-11 2024-05-14 南京理工大学 Heterogeneous Internet of vehicles user association method based on multi-agent deep reinforcement learning
CN114679699A (en) * 2022-03-23 2022-06-28 重庆邮电大学 Multi-UAV energy-saving cruise communication coverage method based on deep reinforcement learning
CN114884895A (en) * 2022-05-05 2022-08-09 郑州轻工业大学 Intelligent traffic scheduling method based on deep reinforcement learning
CN114884895B (en) * 2022-05-05 2023-08-22 郑州轻工业大学 Intelligent flow scheduling method based on deep reinforcement learning
CN114980169B (en) * 2022-05-16 2024-08-20 北京理工大学 Unmanned aerial vehicle auxiliary ground communication method based on track and phase joint optimization
CN114980169A (en) * 2022-05-16 2022-08-30 北京理工大学 A UAV-assisted ground communication method based on joint optimization of trajectory and phase
CN114980020A (en) * 2022-05-17 2022-08-30 重庆邮电大学 Unmanned aerial vehicle data collection method based on MADDPG algorithm
CN114980020B (en) * 2022-05-17 2024-07-12 中科润物科技(南京)有限公司 MADDPG algorithm-based unmanned aerial vehicle data collection method
CN114997617B (en) * 2022-05-23 2024-06-07 华中科技大学 A method and system for allocating multi-unmanned platform multi-target joint detection tasks
CN115038155A (en) * 2022-05-23 2022-09-09 香港中文大学(深圳) Ultra-dense multi-access-point dynamic cooperative transmission method
CN114997617A (en) * 2022-05-23 2022-09-02 华中科技大学 Multi-unmanned platform multi-target joint detection task allocation method and system
CN115314904A (en) * 2022-06-14 2022-11-08 北京邮电大学 Communication coverage method and related equipment based on multi-agent maximum entropy reinforcement learning
CN115314904B (en) * 2022-06-14 2024-03-29 北京邮电大学 Communication coverage method and related equipment based on multi-agent maximum entropy reinforcement learning
CN115113651B (en) * 2022-07-18 2024-11-08 中国电子科技集团公司第五十四研究所 An optimization method for UAV leader-officer collaborative coverage based on ellipse fitting
CN115113651A (en) * 2022-07-18 2022-09-27 中国电子科技集团公司第五十四研究所 Unmanned robot bureaucratic cooperative coverage optimization method based on ellipse fitting
CN114942653A (en) * 2022-07-26 2022-08-26 北京邮电大学 Method, device and electronic device for determining flight strategy of unmanned swarm
CN115460543A (en) * 2022-08-31 2022-12-09 中国地质大学(武汉) A distributed ring fence covering method, device and storage device
CN115460543B (en) * 2022-08-31 2024-04-19 中国地质大学(武汉) A distributed annular fence covering method, device and storage device
CN115713130A (en) * 2022-09-07 2023-02-24 华东交通大学 Vehicle scheduling method based on hyper-parameter network weight distribution deep reinforcement learning
CN115713130B (en) * 2022-09-07 2023-09-05 华东交通大学 Vehicle scheduling method based on super-parameter network weight distribution deep reinforcement learning
CN115616906A (en) * 2022-09-08 2023-01-17 南京航空航天大学 A Joint Optimization Method of UAV Data Acquisition Trajectories and User Association Based on Reinforcement Learning in Wireless Networks
CN115802313A (en) * 2022-11-16 2023-03-14 河南大学 Air-ground mobile network energy-carrying fair communication method based on intelligent reflecting surface
CN115499849B (en) * 2022-11-16 2023-04-07 国网湖北省电力有限公司信息通信公司 A method for cooperation between a wireless access point and a reconfigurable smart surface
CN115499849A (en) * 2022-11-16 2022-12-20 国网湖北省电力有限公司信息通信公司 A method for cooperation between a wireless access point and a reconfigurable smart surface
CN116017479B (en) * 2022-12-30 2024-10-25 河南大学 Distributed multi-unmanned aerial vehicle relay network coverage method
CN116017479A (en) * 2022-12-30 2023-04-25 河南大学 A method for distributed multi-UAV relay network coverage
CN116208968B (en) * 2022-12-30 2024-04-05 北京信息科技大学 Trajectory planning method and device based on federated learning
CN116208968A (en) * 2022-12-30 2023-06-02 北京信息科技大学 Trajectory planning method and device based on federated learning
CN116009590A (en) * 2023-02-01 2023-04-25 中山大学 Distributed trajectory planning method, system, equipment and medium for UAV network
CN116009590B (en) * 2023-02-01 2023-11-17 中山大学 Unmanned aerial vehicle network distributed track planning method, system, equipment and medium
CN116456307B (en) * 2023-05-06 2024-04-09 山东省计算中心(国家超级计算济南中心) Q learning-based energy-limited Internet of things data acquisition and fusion method
CN116456307A (en) * 2023-05-06 2023-07-18 山东省计算中心(国家超级计算济南中心) A data acquisition and fusion method for energy-constrained Internet of Things based on Q-learning
CN116502547A (en) * 2023-06-29 2023-07-28 深圳大学 Multi-unmanned aerial vehicle wireless energy transmission method based on graph reinforcement learning
CN116502547B (en) * 2023-06-29 2024-06-04 深圳大学 Multi-unmanned aerial vehicle wireless energy transmission method based on graph reinforcement learning
CN116980881A (en) * 2023-08-29 2023-10-31 北方工业大学 Multi-unmanned aerial vehicle collaboration data distribution method, system, electronic equipment and medium
CN116980881B (en) * 2023-08-29 2024-01-23 北方工业大学 Multi-unmanned aerial vehicle collaboration data distribution method, system, electronic equipment and medium
CN117856903A (en) * 2023-12-07 2024-04-09 山东科技大学 Marine unmanned aerial vehicle optical link data transmission method based on multi-agent reinforcement learning
CN117856903B (en) * 2023-12-07 2024-08-30 山东科技大学 Marine unmanned aerial vehicle optical link data transmission method based on multi-agent reinforcement learning
CN117376934B (en) * 2023-12-08 2024-02-27 山东科技大学 Deep reinforcement learning-based multi-unmanned aerial vehicle offshore mobile base station deployment method
CN117376934A (en) * 2023-12-08 2024-01-09 山东科技大学 Deep reinforcement learning-based multi-unmanned aerial vehicle offshore mobile base station deployment method
CN117835463A (en) * 2023-12-27 2024-04-05 武汉大学 Space-to-ground ad hoc communication network space-time dynamic deployment method based on deep reinforcement learning

Also Published As

Publication number Publication date
CN111786713B (en) 2021-06-08

Similar Documents

Publication Publication Date Title
CN111786713B (en) A UAV network hovering position optimization method based on multi-agent deep reinforcement learning
CN112118556B (en) Unmanned aerial vehicle track and power joint optimization method based on deep reinforcement learning
Li et al. On-board deep Q-network for UAV-assisted online power transfer and data collection
You et al. 3D trajectory optimization in Rician fading for UAV-enabled data harvesting
US20210165405A1 (en) Multiple unmanned aerial vehicles navigation optimization method and multiple unmanned aerial vehicles system using the same
CN114025330B (en) A self-organizing network data transmission method for air-ground coordination
CN109885088B (en) Optimization method of UAV flight trajectory based on machine learning in edge computing network
CN110380776B (en) Internet of things system data collection method based on unmanned aerial vehicle
CN114339842B (en) Method and device for dynamic trajectory design of UAV swarms in time-varying scenarios based on deep reinforcement learning
Zhou et al. QoE-driven adaptive deployment strategy of multi-UAV networks based on hybrid deep reinforcement learning
CN115499921A (en) Three-dimensional trajectory design and resource scheduling optimization method for complex unmanned aerial vehicle network
CN113163332B (en) Metric learning-based roadmap coloring UAV energy-saving endurance data collection method
CN112671451B (en) A kind of unmanned aerial vehicle data collection method, equipment, electronic equipment and storage medium
CN113359480A (en) Multi-unmanned aerial vehicle and user cooperative communication optimization method based on MAPPO algorithm
Wu et al. 3D aerial base station position planning based on deep Q-network for capacity enhancement
CN114945182B (en) Multi-unmanned aerial vehicle relay optimization deployment method in urban environment
CN113660681A (en) Multi-agent resource optimization method applied to unmanned aerial vehicle cluster auxiliary transmission
CN114980169A (en) A UAV-assisted ground communication method based on joint optimization of trajectory and phase
CN112702713A (en) Low-altitude unmanned-machine communication deployment method under multi-constraint condition
CN114020024A (en) UAV path planning method based on Monte Carlo tree search
CN115119174A (en) Autonomous deployment method of unmanned aerial vehicle based on energy consumption optimization in irrigation area
CN109413664A (en) A kind of super-intensive based on interference is tethered at unmanned plane base station height adjusting method
CN115314904A (en) Communication coverage method and related equipment based on multi-agent maximum entropy reinforcement learning
Yang et al. Deep reinforcement learning in NOMA-assisted UAV networks for path selection and resource offloading
CN113776531A (en) Multi-UAV autonomous navigation and task assignment algorithm based on wireless self-powered communication network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant