CN116301007A - A multi-quadrotor UAV assembly task path planning method based on reinforcement learning - Google Patents

A multi-quadrotor UAV assembly task path planning method based on reinforcement learning

Info

Publication number
CN116301007A
CN116301007A
Authority
CN
China
Prior art keywords
unmanned aerial vehicle
quadrotor
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310454330.7A
Other languages
Chinese (zh)
Inventor
罗俊海
严泽成
田雨鑫
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China filed Critical University of Electronic Science and Technology of China
Priority to CN202310454330.7A priority Critical patent/CN116301007A/en
Publication of CN116301007A publication Critical patent/CN116301007A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G05CONTROLLING; REGULATING
    • G05DSYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D1/00Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
    • G05D1/08Control of attitude, i.e. control of roll, pitch, or yaw
    • G05D1/0808Control of attitude, i.e. control of roll, pitch, or yaw specially adapted for aircraft
    • GPHYSICS
    • G05CONTROLLING; REGULATING
    • G05DSYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D1/00Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
    • G05D1/10Simultaneous control of position or course in three dimensions
    • G05D1/101Simultaneous control of position or course in three dimensions specially adapted for aircraft

Landscapes

  • Engineering & Computer Science (AREA)
  • Aviation & Aerospace Engineering (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Remote Sensing (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Automation & Control Theory (AREA)
  • Control Of Position, Course, Altitude, Or Attitude Of Moving Bodies (AREA)

Abstract

The invention discloses a reinforcement-learning-based path planning method for assembly (rendezvous) tasks of multiple quadrotor unmanned aerial vehicles (UAVs). The method first constructs a PyBullet-based multi-quadrotor Gym environment, abstracts the state space and action space of the quadrotor UAVs and sets up a reward function mechanism, then makes path planning decisions with an improved deep reinforcement learning algorithm, and finally trains the improved deep reinforcement learning network and controls the quadrotor UAVs through the output action information, so that each UAV reaches the specified target in the shortest time. By introducing N-step returns into the TD3 algorithm, the method obtains more accurate return estimates, faster learning, and better generalization; prioritized experience replay reduces sampling bias and the model bias caused by unbalanced sampling and improves the stability of the algorithm, making TD3 better suited to continuous multi-dimensional decision problems, so that an optimal route can be planned in the shortest time while meeting the specified real-time and accuracy requirements.

Description

A multi-quadrotor UAV assembly task path planning method based on reinforcement learning

Technical Field

The invention belongs to the technical field of path planning for multiple quadrotor UAVs, and in particular relates to a reinforcement-learning-based path planning method for multi-quadrotor UAV assembly (rendezvous) tasks.

Background

A quadrotor UAV balances its weight with the lift generated by multiple rotors; it can hover, take off and land vertically, and has low requirements on the take-off site, but its flight speed is relatively slow. Multi-rotor UAVs are therefore suited to complex, small-range application scenarios such as aerial photography, surveillance, and architectural modeling. With the continuous development of UAV technology, UAVs have been widely used in civilian fields, and the complexity of the tasks they perform keeps increasing. Since the payload and flight capability of a single UAV are limited, cooperation among multiple UAVs is required to extend the capability and scope of mission execution.

Since almost all UAV missions involve shortest-path planning, the shortest-path planning problem has been a focus and a difficulty of UAV path planning research in recent years. According to the characteristics of the task, shortest-path planning can be further divided into assembly (rendezvous) tasks and allocation tasks. A rendezvous task plans, for each UAV, the optimal path from its own starting point to a common target point. The goal of such missions is usually to have all UAVs arrive at the target point at the same time and complete the mission as quickly as possible; the objective is generally to minimize the total mission time or the total path length. Compared with allocation tasks, rendezvous tasks are more generally applicable.

Compared with existing algorithms based on rules or heuristic search, path planning methods based on reinforcement learning have better adaptability and scalability. Existing methods require rules to be designed and tuned manually for each environment, whereas reinforcement learning lets the agent adapt to the environment through autonomous learning. Because the agent in reinforcement learning makes its own decisions, it can learn the optimal behavior by interacting with the environment. Furthermore, deep learning provides strong perception capability, and deep reinforcement learning, which combines deep learning with reinforcement learning, can handle higher-dimensional inputs and is better suited to the multi-UAV setting considered here. Compared with existing methods, deep reinforcement learning algorithms therefore cope better with unknown situations and changes, and the agents can perform continuous decision-making tasks in complex environments.

Current solutions for multi-UAV rendezvous tasks still face many challenges, including environment modeling, low learning efficiency, and complex action and state spaces. First, for agents based on deep reinforcement learning, building the simulation environment is the foundation of the whole experiment, and the design of the UAV system must rely on simulation tools. Building an appropriate UAV simulator is therefore crucial for academic research and for the development of safety-critical applications. However, many environments currently used for simulation experiments with deep reinforcement learning models lack real-world portability, and many of them sacrifice realism to obtain high sample throughput. In addition, the training efficiency of multi-UAV path planning with deep reinforcement learning is generally low: in most simulated environments the rewards for path planning are sparse, the agent only receives a reward signal after the task ends, and effective exploration in complex environments is difficult, so training is hard to get started in the early stages. Finally, such multi-UAV path planning problems usually involve multiple agents and multiple obstacles, so the state space, action space, and reward function tend to be high-dimensional and complex, which makes the problem harder to model and to solve. Since the action space of multi-UAV path planning is very large, an effective search strategy is needed to handle the high-dimensional action space. In summary, rendezvous path planning is of great significance for multi-UAV mission execution.

Summary of the Invention

In order to solve the above technical problems, the present invention proposes a reinforcement-learning-based path planning method for multi-quadrotor UAV assembly tasks, focusing on the assembly (rendezvous) task in multi-UAV path planning.

The technical solution of the present invention is a reinforcement-learning-based path planning method for multi-quadrotor UAV assembly tasks, with the following specific steps:

S1. Build a PyBullet-based multi-quadrotor UAV Gym environment;

S2. Abstract the state space and action space of the quadrotor UAVs and set up a reward function mechanism, so that the UAVs interact with the environment;

S3. Use the improved deep reinforcement learning algorithm to make path planning decisions, planning a path for each quadrotor UAV under the assembly task;

S4. Train the improved deep reinforcement learning network and control the angular and linear velocities of the quadrotor UAVs through the output action information, so that each quadrotor UAV reaches the specified target in the shortest time.

Further, step S1 is specifically as follows:

S11. Build a dynamics simulation model of the multiple quadrotor UAVs;

The dynamics equations of the multiple quadrotor UAVs are composed of the quadrotor equations of motion together with the aerodynamic effects, which completes the construction of the multi-quadrotor dynamics simulation model, as follows:

PyBullet is used to model the forces and torques acting on each quadrotor UAV in the Gym environment, and the physics engine computes and updates the dynamics equations of all quadrotor UAVs.

Each quadrotor UAV is given arm length L, mass m and inertial properties J; the physical constants and the convex collision shape are described in a separate URDF file used for the "x"-type quadrotor configuration.

First, the gravitational acceleration g and the physics stepping frequency are set in PyBullet. The force F_i applied by each of the four motors and the torque T_o around the Z axis of the UAV are proportional to the square of the motor speed P_i:

F_i = k_F · P_i^2   (1)

T_o = k_T · P_i^2   (2)

where k_F and k_T are preset constants.
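
As an illustration, a minimal sketch of this per-motor model in Python (the function name, the assumption that motor speeds are given in RPM, and the alternating sign pattern used for the net yaw torque of an "x" configuration are ours, not prescribed by the patent):

```python
import numpy as np

def motor_forces_and_yaw_torque(rpm, k_f, k_t):
    """Per-motor thrust F_i = k_F * P_i^2 (eq. (1)) and yaw torque k_T * P_i^2 (eq. (2)).

    rpm : array-like with the four motor speeds P_0..P_3
    """
    rpm = np.asarray(rpm, dtype=float)
    forces = k_f * rpm ** 2            # thrust along each motor axis, eq. (1)
    torques = k_t * rpm ** 2           # per-motor torque about the body Z axis, eq. (2)
    # For an "x" configuration the net yaw torque alternates sign between motors (assumed convention).
    z_torque = -torques[0] + torques[1] - torques[2] + torques[3]
    return forces, z_torque
```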

Assuming the model is controlled in real time, the dynamics equation of the quadrotor UAV is expressed as:

J^T · T_o = M·a + h   (3)

where J is the Jacobian matrix, M is the inertia matrix, a is the generalized acceleration, h captures the Coriolis and gravity effects, and the superscript T denotes transposition.

In practice, flying close to the ground or close to other UAVs produces additional aerodynamic effects, which are modeled separately in PyBullet and used jointly: the propeller drag D, the ground effect G_i acting on each individual motor, and the downwash effect W acting on the center of mass.

The rotating propellers of the quadrotor UAV produce a drag force D, which is proportional to the linear velocity Ẋ of the UAV, to the angular velocity of the propellers, and to a constant drag coefficient matrix k_D:

D = −k_D · (2π·P_i/60) · Ẋ   (4)

where 2π·P_i/60 is the angular velocity of the propeller obtained from the motor speed P_i in revolutions per minute (the 60 converts minutes to seconds). The constant drag coefficient matrix k_D is:

k_D = diag(k_⊥, k_⊥, k_‖)   (5)

where k_⊥ is the perpendicular drag coefficient, k_‖ is the parallel drag coefficient, and the matrix k_D is fitted to data by least squares.

When hovering at a very low altitude there is a ground effect; the influence G_i of the ground effect on each motor is related to the propeller radius r_P, the motor speed P_i, the height h_i and a constant k_G through the proportionality

G_i = k_G · k_F · P_i^2 · ( r_P / (4·h_i) )^2   (6)

When two quadrotor UAVs pass the same horizontal position at different heights there is a downwash effect. Its influence is simplified to a single force applied at the center of mass of the UAV, whose magnitude W depends on the distance (δ_x, δ_y, δ_z) between the two UAVs in the coordinate system x, y, z and on experimentally determined constants k_D1, k_D2, k_D3:

W = k_D1 · ( r_P / (4·δ_z) )^2 · exp( −½ · ( √(δ_x² + δ_y²) / (k_D2·δ_z + k_D3) )² )   (7)

S12. Construct the observation space and action space of the multiple quadrotor UAVs;

In the constructed Gym environment, every action executed by a quadrotor UAV produces an observation vector. The observation space of the multiple quadrotor UAVs is:

{n: [X_n, q_n, r_n, p_n, y_n, Ẋ_n, ω_n, P_n]}   (8)

where n ∈ [0...N] indexes the quadrotor UAVs; X_n = [x, y, z]_n is the position of the n-th quadrotor UAV; q_n is the quaternion used for attitude control; r_n, p_n, y_n are the roll, pitch and yaw angles, i.e. the three angles used for attitude estimation; Ẋ_n is the linear velocity of the n-th quadrotor UAV; ω_n = [ω_x, ω_y, ω_z]_n is its angular velocity; and P_n = [P_0, P_1, P_2, P_3]_n gives the motor speeds of each UAV.

In the present invention the quadrotor UAVs use lidar to detect obstacles; in the model each quadrotor UAV is assumed to carry k lidar rays, which are used to observe the environment.

The k lidar rays cover a scanning angle range of π, with an angle of 2π/k between two adjacent rays; (d_1, ..., d_k) are the ray lengths of the k rays in the horizontal plane, and d_i is the ray length of the i-th ray:

d_i = the distance to the detected point if an object is detected within the maximum range, and the maximum detectable distance otherwise   (9)

The environment information s_E is then defined as:

s_E = [ρ_i, d_i]^T, i = 1...k   (10)

where ρ_i is the one-hot code of the i-th ray:

ρ_i = 1 if the i-th ray detects an object within the maximum range, and 0 otherwise   (11)
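
A minimal sketch of how such lidar observations could be assembled with PyBullet ray tests (the function name, the default maximum range d_max and the use of rayTestBatch are our illustration, not a prescription of the patent):

```python
import numpy as np
import pybullet as p

def lidar_observation(drone_pos, k=24, d_max=5.0):
    """Cast k horizontal rays and return (rho, d) as in eqs. (9)-(11)."""
    drone_pos = np.asarray(drone_pos, dtype=float)
    angles = 2 * np.pi / k * np.arange(k)
    starts = [drone_pos] * k
    ends = [drone_pos + d_max * np.array([np.cos(a), np.sin(a), 0.0]) for a in angles]
    results = p.rayTestBatch(starts, ends)
    rho = np.zeros(k)        # one-hot hit flags, eq. (11)
    d = np.full(k, d_max)    # ray lengths, eq. (9); default to the maximum range
    for i, (hit_id, _, hit_fraction, *_rest) in enumerate(results):
        if hit_id != -1:     # -1 means the ray hit nothing within d_max
            rho[i] = 1.0
            d[i] = hit_fraction * d_max
    return rho, d
```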

For any quadrotor UAV, the action space is then:

{n: [v_x, v_y, v_z, v_M]_n}   (12)

where [v_x, v_y, v_z, v_M]_n is the velocity command given to the quadrotor UAV: v_x, v_y, v_z are the components of a unit direction vector and v_M is the magnitude of the desired speed. The action space can also be expressed by the speeds of the four motors:

{n: [P_0, P_1, P_2, P_3]_n}   (13)

Finally, converting the inputs into pulse-width-modulation (PWM) signals and motor speeds is delegated to a controller composed of position and attitude control subroutines.

Further, step S2 is specifically as follows:

S21. Abstract the state space and action space of the quadrotor UAVs;

The state of a quadrotor UAV includes: the position of the quadrotor UAV, the quaternion q_n, the roll angle r_n, the pitch angle p_n, the yaw angle y_n, the linear velocity Ẋ_n, the angular velocity ω_n, the motor speeds P_n = [P_0, P_1, P_2, P_3], the angle β_n between the UAV's first-person viewing direction and the line to the target, and the difference d_0n between the global coordinates (x, y, z) of the n-th UAV and the global coordinates (x_t, y_t, z_t) of the target.

In the state, the global position of the UAV is replaced by the relative position ΔX_n = [Δx, Δy, Δz]_n between the UAV and the target, so the UAV state s_U is:

s_U = [ΔX_n, q_n, r_n, p_n, y_n, Ẋ_n, ω_n, P_n, β_n, d_0n]^T   (14)

From the UAV state s_U and the environment state s_E detected by the lidar, the state space s of the quadrotor UAV is obtained as:

s = [s_U, s_E]^T   (15)

The action space of the multi-quadrotor environment consists of the velocity commands given to the quadrotor UAVs; referring to equation (12), for any quadrotor UAV the action space is:

a = [v_x, v_y, v_z, v_M]^T   (16)

S22. Set up the reward function mechanism so that the quadrotor UAVs interact with the environment;

The reward function R(s, a) is the environment feedback obtained by taking action a in state s. A reward function composed of three parts is set so that the quadrotor UAVs reach the rendezvous target point as quickly as possible, as follows:

First, a distance reward R_t between the quadrotor UAV and the target point drives the quadrotor UAV toward the target. R_t is given by equation (17), where d_0 is the distance of the quadrotor UAV from the target, d_0^n is the distance of the n-th quadrotor UAV from the target, and d_0^{n'} is that distance at the next time step.

Second, a distance reward R_o between the quadrotor UAV and the obstacles keeps the UAV away from obstacles. R_o is given by equation (18), where d_i is the ray length of the i-th lidar ray, i.e. the detected distance from the quadrotor UAV to an obstacle or to another quadrotor UAV, d_0^n is the distance of the n-th quadrotor UAV from the target, and d_safe is the safety distance set between the UAV and obstacles.

Finally, an angle reward R_a between the quadrotor UAV and the target point pushes the UAV toward the target direction; the larger β_n is, the larger the penalty, as expressed by equation (19).

Further, step S3 is specifically as follows:

The ITD3 algorithm is obtained by improving the TD3 algorithm with N-step returns and prioritized experience replay. ITD3 consists of four sub-networks, namely two critic networks and two actor networks, and the improved deep reinforcement learning algorithm is implemented by ITD3.

First, N-step returns are introduced into the TD3 algorithm. An N-step return sums the rewards of the next n time steps and therefore provides more comprehensive information than a one-step return;

when rewards are sparse, most state transitions p(s′|s, a) carry no reward information and a one-step return is ineffective; the N-step return mitigates the reward sparsity problem by sampling N transitions.

With the N-step return, the equation of the TD3 critic network is modified; in the j-th sampling round, the modified temporal-difference error δ_j is:

δ_j = Σ_{k=t}^{t+N−1} γ^{k−t} · r_k + γ^N · Q′(s_{t+N}, a_{t+N} | φ′) − Q(s_t, a_t | φ)   (20)

where φ and φ′ are the parameters of the twin critic networks, k indexes the k-th step, r_k is the reward at step k, s and a are the current state and action, s_N and a_N are the target state and action, Q(s_t, a_t|φ) is the value function of the critic network, Q′(s_{t+N}, a_{t+N}|φ′) is the value function of the target critic network, and γ is the discount factor.
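
As an illustration, a small sketch of how the N-step temporal-difference error of equation (20) might be computed (the function name and the transition layout are our assumptions):

```python
def n_step_td_error(steps, s_N, a_N, q_value, q_target_value, gamma=0.99):
    """Temporal-difference error with an N-step return, cf. eq. (20).

    steps            : list of (state, action, reward) tuples for steps t .. t+N-1
    s_N, a_N         : state and action reached after the N steps
    q_value          : callable Q(s, a) of the online critic
    q_target_value   : callable Q'(s, a) of the target critic
    """
    n = len(steps)
    g = sum(gamma ** k * r for k, (_, _, r) in enumerate(steps))  # discounted sum of the N rewards
    target = g + gamma ** n * q_target_value(s_N, a_N)            # bootstrapped N-step target
    s_t, a_t, _ = steps[0]
    return target - q_value(s_t, a_t)                             # delta_j in eq. (20)
```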

Second, prioritized experience replay is used in the original TD3 algorithm. At the start of sampling, the sampling probability of the j-th transition is defined as P(j):

P(j) = p_j^α / Σ_i p_i^α   (21)

where p_j is the priority of the j-th experience and α is a constant used to adjust the sampling weights; α determines how much prioritization is used, and when α equals 0 sampling degenerates to uniform random sampling.

The sampling weight w_j of each transition used to update the network, which expresses the importance of each transition, is computed as

w_j = (M · P(j))^(−β) / max_i w_i   (22)

where M is the size of the mini-batch, β is the importance-sampling exponent that controls how strongly the correction is applied, and dividing by max_i w_i normalizes the weights.

Finally, proportional prioritization is used and the priority of a transition is updated according to its temporal-difference error:

p_j = |δ_j| + ϵ   (23)

where δ_j is the temporal-difference error and ϵ is a small preset value that avoids zero priority.
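
A minimal sketch of proportional prioritized sampling consistent with equations (21)-(23) (the buffer layout, class and method names are our illustration, not the patent's implementation):

```python
import numpy as np

class PrioritizedReplayBuffer:
    """Proportional prioritized experience replay, cf. eqs. (21)-(23)."""

    def __init__(self, capacity, alpha=0.6, eps=1e-6):
        self.capacity, self.alpha, self.eps = capacity, alpha, eps
        self.data, self.priorities = [], []

    def add(self, transition):
        # New transitions get the current maximum priority so they are sampled at least once.
        p_max = max(self.priorities, default=1.0)
        self.data.append(transition)
        self.priorities.append(p_max)
        if len(self.data) > self.capacity:
            self.data.pop(0)
            self.priorities.pop(0)

    def sample(self, batch_size, beta=0.4):
        scaled = np.array(self.priorities) ** self.alpha
        probs = scaled / scaled.sum()                        # P(j), eq. (21)
        idx = np.random.choice(len(self.data), batch_size, p=probs)
        weights = (batch_size * probs[idx]) ** (-beta)       # importance weights, eq. (22) with M = batch_size
        weights /= weights.max()                             # normalize by the largest weight
        return idx, [self.data[i] for i in idx], weights

    def update_priorities(self, idx, td_errors):
        for i, delta in zip(idx, td_errors):
            self.priorities[i] = abs(delta) + self.eps       # eq. (23)
```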

Further, step S4 is specifically as follows:

The ITD3 training network is implemented by two neural networks: an actor network composed of three fully connected layers that maps states to actions, and a critic network that uses four fully connected layers to estimate the Q-value.

In the ITD3 network, the input of the two actor networks is the state and their output is the action; the critic networks take a state-action pair as input and produce the state-action value function (Q-value). The ITD3 training procedure is as follows:

First, a mini-batch of samples (s, a, s′, r) is drawn from the experience replay buffer according to priority, and s′ is fed into the actor target network. The next action a′ is thereby obtained, and the state-action pair (s′, a′) is fed into the critic target networks.

After the two target Q-values (Q̃_1 and Q̃_2) are obtained, the smaller one is selected to compute the target value function y(r, s′):

y(r, s′) = r + γ · min_{i=1,2} Q̃_i(s′, a′)   (24)

where r is the reward, the discount factor γ takes the same value as in equation (20), and φ_i are the parameters of the critic networks.

On the other hand, (s, a) is fed into the critic networks to obtain two Q-values, Q_1(·) and Q_2(·). These are used to compute the mean squared errors against y(r, s′), and the sum of the mean squared errors is back-propagated to update the parameters of the two critic networks, with the N-step return incorporated in the temporal-difference error update.

Next, the Q-value obtained from the first critic network is fed into the actor model network, and the parameters of the actor network are updated in the direction of increasing Q-value (once every two iterations).

Finally, a soft update is applied to all target networks.
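
A compressed sketch of one such ITD3 update step (PyTorch-style; the network objects, optimizers, the replay-buffer interface and all names are our assumptions, and the actor and target networks are updated every two iterations as described above):

```python
import torch

def itd3_update(buffer, actor, actor_t, critics, critics_t, actor_opt, critic_opt,
                step, gamma=0.99, n=4, tau=0.005, batch_size=128):
    """One ITD3 update: prioritized N-step critic update, delayed actor update, soft target update."""
    idx, (s, a, r_n, s_n, done), w = buffer.sample(batch_size)    # tensors assumed; r_n = discounted N-step reward sum
    w = torch.as_tensor(w, dtype=torch.float32).reshape(-1, 1)
    critic1, critic2 = critics
    critic1_t, critic2_t = critics_t
    with torch.no_grad():
        a_n = actor_t(s_n)                                         # action from the actor target network
        q_t = torch.min(critic1_t(s_n, a_n), critic2_t(s_n, a_n))  # smaller target Q-value, eq. (24)
        y = r_n + (gamma ** n) * (1.0 - done) * q_t                # N-step target, cf. eq. (20)
    q1, q2 = critic1(s, a), critic2(s, a)
    critic_loss = (w * ((q1 - y) ** 2 + (q2 - y) ** 2)).mean()     # weighted sum of the two squared errors
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()
    buffer.update_priorities(idx, (y - q1).abs().detach().cpu().numpy().ravel())   # eq. (23)
    if step % 2 == 0:                                              # delayed policy update
        actor_loss = -critic1(s, actor(s)).mean()                  # ascend the first critic's Q-value
        actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
        for net, net_t in ((actor, actor_t), (critic1, critic1_t), (critic2, critic2_t)):
            for p, p_t in zip(net.parameters(), net_t.parameters()):
                p_t.data.mul_(1.0 - tau).add_(tau * p.data)        # soft update of all target networks
    return float(critic_loss)
```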

After training is completed, the angular and linear velocities of the quadrotor UAVs are controlled through the output action information, so that each quadrotor UAV reaches the specified target in the shortest time and the assembly-task path planning is completed.

The beneficial effects of the present invention are as follows. The method first constructs a PyBullet-based multi-quadrotor Gym environment, abstracts the state space and action space of the quadrotor UAVs and sets up a reward function mechanism, then uses the improved deep reinforcement learning algorithm to make path planning decisions, and finally trains the improved deep reinforcement learning network and controls the quadrotor UAVs through the output action information, so that each quadrotor UAV reaches the specified target in the shortest time. By using N-step returns in the TD3 algorithm, the method obtains more accurate return estimates, faster learning and better generalization; prioritized experience replay reduces sampling bias and the model bias caused by unbalanced sampling and improves the stability of the algorithm, making TD3 better suited to continuous multi-dimensional decision problems, so that an optimal route can be planned in the shortest time while meeting the specified real-time and accuracy requirements.

Description of the Drawings

Fig. 1 is a flow chart of the reinforcement-learning-based multi-quadrotor UAV assembly task path planning method of the present invention.

Fig. 2 is a model diagram of the "x"-type quadrotor UAV in an embodiment of the present invention.

Fig. 3 is a schematic diagram of the lidar detecting horizontal environment information in an embodiment of the present invention.

Fig. 4 is a state diagram of the quadrotor UAV in an embodiment of the present invention.

Fig. 5 is a schematic diagram of the ITD3 algorithm in an embodiment of the present invention.

Fig. 6 shows the structure of the neural networks in ITD3 in an embodiment of the present invention.

Detailed Description

The method of the present invention is further described below with reference to the drawings and an embodiment.

As shown in Fig. 1, the reinforcement-learning-based multi-quadrotor UAV assembly task path planning method of the present invention comprises the following steps:

S1. Build a PyBullet-based multi-quadrotor UAV Gym environment;

S2. Abstract the state space and action space of the quadrotor UAVs and set up a reward function mechanism, so that the UAVs interact with the environment;

S3. Use the improved deep reinforcement learning algorithm to make path planning decisions, planning a path for each quadrotor UAV under the assembly task;

S4. Train the improved deep reinforcement learning network and control the angular and linear velocities of the quadrotor UAVs through the output action information, so that each quadrotor UAV reaches the specified target in the shortest time.

In this embodiment, step S1 is specifically as follows:

S11. Build a dynamics simulation model of the multiple quadrotor UAVs;

PyBullet is used to model the forces and torques acting on each quadrotor UAV in the Gym environment, and the physics engine computes and updates the dynamics equations of all UAVs.

As shown in Fig. 2, in the simplified "x"-type quadrotor dynamics model built in this embodiment, each UAV has arm length L, mass m and inertial properties J; its physical constants and convex collision shape are described in a separate URDF file used for the "x"-type quadrotor configuration.

First, the gravitational acceleration g and the physics stepping frequency (finer than the control frequency of the Gym step) are set in PyBullet; besides the physical properties and constants, the URDF information can also be used in PyBullet to load a CAD model of the quadrotor. The force F_i applied by each of the four motors and the torque T_o around the Z axis of the UAV are proportional to the square of the motor speed P_i:

F_i = k_F · P_i^2   (1)

T_o = k_T · P_i^2   (2)

where k_F and k_T are preset constants.

F_i and P_i are linearly related to the input pulse-width modulation (PWM) signal. Assuming the model is controlled in real time, the equation of motion of the quadrotor UAV is:

J^T · T_o = M·a + h   (3)

where J is the Jacobian matrix, M is the inertia matrix, a is the generalized acceleration, h captures the Coriolis and gravity effects, and the superscript T denotes transposition.

In practice, flying close to the ground or close to other UAVs may produce additional aerodynamic effects, shown in Fig. 2 as the forces G_{i=0,1,2,3}, D and W. They can be modeled separately in PyBullet and used jointly: the propeller drag D, the ground effect G_i acting on each individual motor, and the downwash effect W acting on the center of mass.

The rotating propellers of the quadrotor UAV produce a drag force D, a force acting opposite to the direction of motion. D is proportional to the linear velocity Ẋ of the UAV, to the angular velocity of the propellers, and to the coefficient matrix k_D:

D = −k_D · (2π·P_i/60) · Ẋ   (4)

where 2π·P_i/60 is the angular velocity of the propeller obtained from the motor speed P_i in revolutions per minute (the 60 converts minutes to seconds). To model cross-coupling, a matrix k_D with 9 coefficients would have to be fitted. The fit requires a certain symmetry: the drag coefficients and the cross-coupling between the x and y axes should be the same; by symmetry, a wind speed along z produces the same force along x and y, and the drag along z caused by a velocity along x should equal the drag along z caused by a velocity along y. The drag coefficient matrix k_D therefore reduces to:

k_D = diag(k_⊥, k_⊥, k_‖)   (5)

where k_⊥ is the perpendicular drag coefficient, k_‖ is the parallel drag coefficient, and k_D is fitted to data by least squares.

When hovering at a very low altitude there is a ground effect, i.e. the thrust caused by the interaction of the propeller airflow with the ground increases. The influence G_i of the ground effect on each motor is related to the propeller radius r_P, the motor speed P_i, the height h_i and a constant k_G through the proportionality

G_i = k_G · k_F · P_i^2 · ( r_P / (4·h_i) )^2   (6)

When two quadrotor UAVs pass the same horizontal position at different heights there is a downwash effect, which reduces the lift of the lower vehicle. Its influence is simplified to a single force applied at the center of mass of the UAV, whose magnitude W depends on the distance (δ_x, δ_y, δ_z) between the two UAVs in the coordinate system x, y, z and on experimentally determined constants k_D1, k_D2, k_D3:

W = k_D1 · ( r_P / (4·δ_z) )^2 · exp( −½ · ( √(δ_x² + δ_y²) / (k_D2·δ_z + k_D3) )² )   (7)
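
A minimal sketch of these aerodynamic corrections, following the forms given above for equations (4), (6) and (7) (the function names, and the exact equation forms where the patent text only states proportionalities, are our assumptions):

```python
import numpy as np

def propeller_drag(vel, rpm, k_d):
    """Drag opposing the motion, eq. (4); k_d = diag(k_perp, k_perp, k_par), eq. (5)."""
    # The four rotor speeds are summed here; the patent text only states the proportionality.
    omega = 2 * np.pi * np.sum(rpm) / 60
    return -k_d @ (omega * np.asarray(vel, dtype=float))

def ground_effect(rpm, heights, k_f, k_g, r_p):
    """Extra per-motor thrust near the ground, eq. (6)."""
    return k_g * k_f * np.asarray(rpm, dtype=float) ** 2 * (r_p / (4 * np.asarray(heights))) ** 2

def downwash(delta, k_d1, k_d2, k_d3, r_p):
    """Force on the lower UAV's center of mass caused by a UAV above it, eq. (7)."""
    dx, dy, dz = delta
    alpha = k_d1 * (r_p / (4 * dz)) ** 2
    beta = k_d2 * dz + k_d3
    w = alpha * np.exp(-0.5 * (np.hypot(dx, dy) / beta) ** 2)
    return np.array([0.0, 0.0, -w])               # acts downward on the lower vehicle
```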

S12. Construct the observation space and action space of the multiple quadrotor UAVs;

In the constructed Gym environment, every action executed by a quadrotor UAV produces an observation vector. The observation space of the multiple quadrotor UAVs is:

{n: [X_n, q_n, r_n, p_n, y_n, Ẋ_n, ω_n, P_n]}   (8)

where n ∈ [0...N] indexes the quadrotor UAVs; X_n = [x, y, z]_n is the position of the n-th quadrotor UAV; q_n is the quaternion used for attitude control; r_n, p_n, y_n are the roll, pitch and yaw angles, i.e. the three angles used for attitude estimation; Ẋ_n is the linear velocity of the n-th quadrotor UAV; ω_n = [ω_x, ω_y, ω_z]_n is its angular velocity; and P_n = [P_0, P_1, P_2, P_3]_n gives the motor speeds of each UAV.

As shown in Fig. 3, in this embodiment the quadrotor UAVs use lidar to detect obstacles; each quadrotor UAV is assumed to carry k lidar rays, which are used to observe the environment.

The k lidar rays cover a scanning angle range of π (in this embodiment k = 24), and the angle between two adjacent rays is 2π/k; (d_1, ..., d_k) are the ray lengths of the k rays in the horizontal plane. If a ray does not detect any object within the limited range, its length is the maximum detectable distance; otherwise it is the distance between the UAV and the point detected by the ray. The ray length d_i of the i-th ray is therefore:

d_i = the distance to the detected point if an object is detected within the maximum range, and the maximum detectable distance otherwise   (9)

The environment information s_E is then defined as:

s_E = [ρ_i, d_i]^T, i = 1...k   (10)

where ρ_i is the one-hot code of the i-th ray; ρ_i is 1 if the ray detects an object within the limited range and 0 otherwise:

ρ_i = 1 if an object is detected within the maximum range, and 0 otherwise   (11)

For any quadrotor UAV, the action space is then:

{n: [v_x, v_y, v_z, v_M]_n}   (12)

where [v_x, v_y, v_z, v_M]_n is the velocity command given to the quadrotor UAV: v_x, v_y, v_z are the components of a unit direction vector and v_M is the magnitude of the desired speed. The action space can also be expressed by the speeds of the four motors:

{n: [P_0, P_1, P_2, P_3]_n}   (13)

Finally, converting the inputs into pulse-width-modulation (PWM) signals and motor speeds is delegated to a controller composed of position and attitude control subroutines.
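
For illustration, a skeleton of how such a PyBullet-based multi-drone Gym environment could be organized (the class name, limits and placeholder bodies are our assumptions; only the observation and action space layouts follow the text above):

```python
import numpy as np
import gym
from gym import spaces

class MultiQuadRendezvousEnv(gym.Env):
    """Minimal multi-quadrotor rendezvous environment skeleton (velocity-command actions, eq. (12))."""

    def __init__(self, num_drones=3, k_rays=24, max_steps=500):
        self.num_drones, self.k_rays, self.max_steps = num_drones, k_rays, max_steps
        # Per drone: unit direction (v_x, v_y, v_z) and speed magnitude v_M.
        self.action_space = spaces.Box(-1.0, 1.0, shape=(num_drones, 4), dtype=np.float32)
        # Per drone: 20 kinematic values (position, quaternion, roll/pitch/yaw,
        # linear/angular velocity, 4 motor speeds) plus k one-hot flags and k ray lengths.
        obs_dim = 20 + 2 * k_rays
        self.observation_space = spaces.Box(-np.inf, np.inf,
                                            shape=(num_drones, obs_dim), dtype=np.float32)
        self._t = 0

    def reset(self):
        self._t = 0
        # A full implementation would (re)load the URDFs and place drones, obstacles and the target here.
        return np.zeros(self.observation_space.shape, dtype=np.float32)

    def step(self, action):
        self._t += 1
        # A full implementation would convert the velocity commands to motor speeds via the
        # position/attitude controller, step PyBullet, and rebuild the observations and rewards.
        obs = np.zeros(self.observation_space.shape, dtype=np.float32)
        rewards = np.zeros(self.num_drones, dtype=np.float32)   # R_t + R_o + R_a per drone
        done = self._t >= self.max_steps
        return obs, rewards, done, {}
```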

In this embodiment, step S2 is specifically as follows:

S21. Abstract the state space and action space of the quadrotor UAVs;

The state of quadrotor UAV n includes: the position of the quadrotor UAV, the quaternion q_n (used for attitude control), the roll angle r_n, the pitch angle p_n, the yaw angle y_n, the linear velocity Ẋ_n, the angular velocity ω_n, the motor speeds P_n = [P_0, P_1, P_2, P_3], the angle β_n between the UAV's first-person viewing direction and the line to the target (shown in Fig. 4), and the difference d_0n between the global coordinates (x, y, z) of the n-th UAV and the global coordinates (x_t, y_t, z_t) of the target.

To make the UAVs reach the target faster and improve the convergence speed, the global position of the UAV in the state is replaced by the relative position ΔX_n = [Δx, Δy, Δz]_n between the UAV and the target, so the UAV state s_U is:

s_U = [ΔX_n, q_n, r_n, p_n, y_n, Ẋ_n, ω_n, P_n, β_n, d_0n]^T   (14)

From the UAV state s_U and the environment state s_E detected by the lidar, the state space s of the quadrotor UAV is:

s = [s_U, s_E]^T   (15)
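
A small sketch of assembling this state vector (the function and argument names are our illustration):

```python
import numpy as np

def build_state(pos, target, quat, rpy, lin_vel, ang_vel, rpm, beta, rho, d):
    """State s = [s_U, s_E] as in eqs. (14)-(15)."""
    delta = np.asarray(target, dtype=float) - np.asarray(pos, dtype=float)  # relative position ΔX_n
    d0 = np.linalg.norm(delta)                                              # distance to the target
    s_u = np.concatenate([delta, quat, rpy, lin_vel, ang_vel, rpm, [beta], [d0]])
    s_e = np.concatenate([rho, d])                                          # lidar flags and ray lengths
    return np.concatenate([s_u, s_e])
```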

The action space of the multi-quadrotor environment consists of the velocity commands given to the quadrotor UAVs; referring to equation (12), for any quadrotor UAV the action space is:

a = [v_x, v_y, v_z, v_M]^T   (16)

S22. Set up the reward function mechanism so that the quadrotor UAVs interact with the environment;

The design of the reward function has a large influence on the performance of a deep reinforcement learning model and determines the UAV's policy. The reward function R(s, a) is the environment feedback obtained by taking action a in state s and evaluates the quality of that action: if R(s, a) is large, taking action a in state s helps achieve the goal, so the probability of taking a in s is increased at the next policy update; otherwise the probability is reduced.

To make the quadrotor UAVs reach the rendezvous target point as quickly as possible, this embodiment uses a reward function composed of three parts, as follows:

First, a distance reward R_t between the quadrotor UAV and the target point drives the quadrotor UAV toward the target. R_t is set as follows: if the UAV moves closer to the target the reward is positive, and it is largest when the target point is reached; if the UAV moves away from the target the reward is negative; and if the target has not been reached after the preset time, the penalty is at most −5. R_t is given by equation (17), where d_0 is the distance of the quadrotor UAV from the target, d_0^n is the distance of the n-th quadrotor UAV from the target, and d_0^{n'} is that distance at the next time step.

Second, a distance reward R_o between the quadrotor UAV and the obstacles keeps the UAV away from obstacles. R_o is set as follows: if the distance between the UAV and the nearest obstacle is smaller than d_safe, the UAV is penalized; if the UAV collides with an obstacle, the penalty is −3; if the distance between the UAV and the nearest obstacle is not smaller than d_safe, the UAV is safe and receives no penalty. R_o is given by equation (18), where d_i is the ray length of the i-th lidar ray, i.e. the detected distance from the quadrotor UAV to an obstacle or to another quadrotor UAV, d_0^n is the distance of the n-th quadrotor UAV from the target, and d_safe is the safety distance set between the UAV and obstacles.

Finally, an angle reward R_a between the quadrotor UAV and the target point pushes the UAV toward the target direction; the larger β_n (shown in Fig. 4) is, the larger the penalty, as expressed by equation (19).
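
A minimal sketch of a per-drone reward combining the three terms described above (only the −5 timeout and −3 collision penalties and the d_safe threshold come from the text; the arrival bonus, the shaping inside the safety zone, and the normalization of the angle penalty are our assumptions):

```python
import numpy as np

def compute_reward(d_prev, d_curr, d_rays, beta, d_safe=0.3,
                   reached=False, collided=False, timed_out=False):
    """Reward R = R_t + R_o + R_a for one quadrotor at one step."""
    # R_t: positive when approaching the target, negative when moving away,
    # largest on arrival, penalty of -5 on timeout.
    if reached:
        r_t = 10.0                        # assumed arrival bonus
    elif timed_out:
        r_t = -5.0
    else:
        r_t = d_prev - d_curr             # progress toward the target
    # R_o: penalize getting closer than d_safe to the nearest obstacle, -3 on collision.
    d_min = float(np.min(d_rays))
    if collided:
        r_o = -3.0
    elif d_min < d_safe:
        r_o = -(d_safe - d_min)           # assumed proportional penalty inside the safety zone
    else:
        r_o = 0.0
    # R_a: penalty growing with the angle between the heading and the target direction.
    r_a = -abs(beta) / np.pi              # assumed normalization of the angle penalty
    return r_t + r_o + r_a
```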

In this embodiment, step S3 is specifically as follows:

In this embodiment the ITD3 algorithm is used to realize multi-UAV rendezvous path planning in an unknown environment. The TD3 algorithm resolves the overestimation bias of the Deep Deterministic Policy Gradient (DDPG) algorithm; it is a deterministic-policy reinforcement learning algorithm suited to high-dimensional continuous action spaces.

TD3 addresses Q-value estimation error and excessive variance with twin Q-networks and a delayed update strategy. However, when the environment has delayed rewards, TD3 may need more experience to learn how to make correct decisions; to address this, this embodiment introduces N-step returns into TD3.

The ITD3 algorithm is obtained by improving the TD3 algorithm with N-step returns and prioritized experience replay. ITD3 consists of four sub-networks, namely two critic networks and two actor networks, and the improved deep reinforcement learning algorithm is implemented by ITD3.

First, N-step returns are introduced into the TD3 algorithm. An N-step return sums the rewards of the next n time steps and provides more comprehensive information than a one-step return, so the algorithm can make better use of delayed rewards and learn more efficiently.

When rewards are sparse, most state transitions p(s′|s, a) carry no reward information and a one-step return is ineffective; the N-step return mitigates the reward sparsity by sampling N transitions (here N = 4 is used).

Adding N-step returns to TD3 in this embodiment increases the chance of finding rewarded transitions and learning from them, which improves learning efficiency. With the N-step return, the equation of the TD3 critic network is modified; in the j-th sampling round, the modified temporal-difference error δ_j is:

δ_j = Σ_{k=t}^{t+N−1} γ^{k−t} · r_k + γ^N · Q′(s_{t+N}, a_{t+N} | φ′) − Q(s_t, a_t | φ)   (20)

where φ and φ′ are the parameters of the twin critic networks, k indexes the k-th step, r_k is the reward at step k, s and a are the current state and action, s_N and a_N are the target state and action, Q(s_t, a_t|φ) is the value function of the critic network, Q′(s_{t+N}, a_{t+N}|φ′) is the value function of the target critic network, and γ is the discount factor.

Second, in the original TD3 algorithm every experience is sampled uniformly; without distinguishing important from unimportant experiences, learning is less efficient, and adding priorities solves this problem. Prioritized experience replay is a technique for enhancing DRL performance: on top of experience replay, it ranks samples by importance so that important samples are drawn more frequently, improving the learning efficiency and performance of the model. At the start of sampling, the sampling probability of the j-th transition is defined as:

P(j) = p_j^α / Σ_i p_i^α   (21)

where p_j is the priority of the j-th experience and α is a constant used to adjust the sampling weights; α determines how much prioritization is used, and when α equals 0 sampling degenerates to uniform random sampling.

The sampling weight w_j of each transition used to update the network, which expresses the importance of each transition, is computed as

w_j = (M · P(j))^(−β) / max_i w_i   (22)

where M is the size of the mini-batch, β is the importance-sampling exponent that controls how strongly the correction is applied, and dividing by max_i w_i normalizes the weights.

Finally, proportional prioritization is used and the priority of a transition is updated according to its temporal-difference error:

p_j = |δ_j| + ϵ   (23)

where δ_j is the temporal-difference error and ϵ is a small preset value that avoids zero priority.

In this embodiment, step S4 is specifically as follows:

The ITD3 training network is implemented by two neural networks: an actor network composed of three fully connected layers that maps states to actions, and a critic network that uses four fully connected layers to estimate the Q-value.

In the ITD3 network, the input of the two actor networks is the state and their output is the action; the critic networks take a state-action pair as input and produce the state-action value function (Q-value). As shown in Fig. 5, the ITD3 training procedure is as follows:

First, a mini-batch of samples (s, a, s′, r) is drawn from the experience replay buffer according to priority, and s′ is fed into the actor target network. The next action a′ is thereby obtained, and the state-action pair (s′, a′) is fed into the critic target networks.

After the two target Q-values (Q̃_1 and Q̃_2) are obtained, the smaller one is selected to compute the target value function y(r, s′):

y(r, s′) = r + γ · min_{i=1,2} Q̃_i(s′, a′)   (24)

where r is the reward, the discount factor γ takes the same value as in equation (20), and φ_i are the parameters of the critic networks.

On the other hand, (s, a) is fed into the critic networks to obtain two Q-values, Q_1(·) and Q_2(·). These are used to compute the mean squared errors against y(r, s′), and the sum of the mean squared errors is back-propagated to update the parameters of the two critic networks, with the N-step return incorporated in the temporal-difference error update.

Next, the Q-value obtained from the first critic network is fed into the actor model network, and the parameters of the actor network are updated in the direction of increasing Q-value (once every two iterations).

Finally, a soft update is applied to all target networks.

According to empirical risk minimization, the complexity of the neural network is related to the number of samples. Based on the state space and action space (i.e. the size of the observations) obtained in the preceding steps of this embodiment, the ITD3 network structure is designed as shown in Fig. 6.

In this embodiment, the ITD3 actor network consists of three fully connected (FC) layers with 512, 128 and 4 nodes. The first and second layers are activated by the rectified linear unit (ReLU) and the third by the hyperbolic tangent (tanh), so that the actor output lies in the range [-1, 1]. After the three FC layers, the actor maps the input s_t to the UAV action command a_t. The critic network uses four FC layers to estimate the Q-value: the input s_t first passes through an FC layer with 1024 nodes and a ReLU activation, and the resulting vector is concatenated with the action a_t into a 1028-dimensional vector. After two more ReLU activations and one tanh activation, the critic network maps this vector to the Q-value.
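
A sketch of this architecture in PyTorch, following the layer sizes given above (the hidden sizes of the critic's second and third layers are not given in the text, so those are our assumptions):

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Actor: 512-128-4 fully connected layers, ReLU-ReLU-tanh, output in [-1, 1]."""
    def __init__(self, state_dim, action_dim=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 512), nn.ReLU(),
            nn.Linear(512, 128), nn.ReLU(),
            nn.Linear(128, action_dim), nn.Tanh(),
        )

    def forward(self, state):
        return self.net(state)

class Critic(nn.Module):
    """Critic: state -> FC(1024)+ReLU, concatenate the action, then three more FC layers."""
    def __init__(self, state_dim, action_dim=4, hidden=(512, 128)):   # hidden sizes assumed
        super().__init__()
        self.fc1 = nn.Linear(state_dim, 1024)
        self.fc2 = nn.Linear(1024 + action_dim, hidden[0])
        self.fc3 = nn.Linear(hidden[0], hidden[1])
        self.fc4 = nn.Linear(hidden[1], 1)

    def forward(self, state, action):
        x = torch.relu(self.fc1(state))
        x = torch.cat([x, action], dim=-1)          # 1024 + 4 = 1028-dimensional vector
        x = torch.relu(self.fc2(x))
        x = torch.relu(self.fc3(x))
        return torch.tanh(self.fc4(x))              # final tanh as described in the text
```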

After training is complete, the output action information is used to control the angular and linear velocities of each quadrotor UAV, so that every quadrotor UAV successfully reaches its assigned target in the shortest time, completing the assembly-task path planning.

In summary, the method of the present invention first models the multi-quadrotor UAV Gym environment based on PyBullet, which improves the portability of the invention. Second, N-step returns are used in the Twin Delayed Deep Deterministic Policy Gradient (TD3) algorithm to obtain more accurate return estimates, faster learning, and better generalization; at the same time, prioritized experience replay is used to reduce sampling bias and the model bias caused by unbalanced sampling, improving the stability of the algorithm and making TD3 better suited to continuous multi-dimensional decision problems. Finally, the improved TD3 (Improved TD3, ITD3) network is trained and, by controlling the angular and linear velocities of the quadrotor UAVs, enables them to reach the assembly-task target points in the shortest time even when the environment is unknown.

The embodiments described above are intended to help readers understand the principles of the present invention; it should be understood that the protection scope of the present invention is not limited to these specific statements and embodiments. Those skilled in the art may make various other specific modifications and combinations based on the technical teachings disclosed herein without departing from the essence of the present invention, and such modifications and combinations remain within the protection scope of the present invention.

Claims (5)

1. A reinforcement-learning-based assembly-task path planning method for multiple quadrotor unmanned aerial vehicles (UAVs), comprising the following steps:
S1, constructing a PyBullet-based multi-quadrotor UAV Gym environment;
S2, abstracting the state space and action space of the quadrotor UAVs, and setting a reward function mechanism so that the UAVs interact with the environment;
S3, making path planning decisions with an improved deep reinforcement learning algorithm, and planning a path for each quadrotor UAV under the assembly task;
S4, training the improved deep reinforcement learning network, and controlling the angular and linear velocities of the quadrotor UAVs through the output action information, so that each quadrotor UAV reaches its specified target task in the shortest time.
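For illustration only, steps S1 and S2 could be organized as a PyBullet-backed Gym environment along the lines of the following sketch; the URDF model, the observation layout, the simplified velocity handling, and the placeholder reward are assumptions of this example rather than the patent's implementation.

```python
# Hypothetical skeleton of a PyBullet-backed multi-quadrotor Gym environment (steps S1-S2).
# The URDF file, observation layout, reward shaping, and the crude velocity handling below
# are simplifications for illustration only.
import gym
import numpy as np
import pybullet as p
import pybullet_data

class MultiQuadEnv(gym.Env):
    def __init__(self, num_drones=3, target=(5.0, 5.0, 2.0)):
        self.client = p.connect(p.DIRECT)                       # headless physics server
        p.setAdditionalSearchPath(pybullet_data.getDataPath())
        p.setGravity(0, 0, -9.81, physicsClientId=self.client)
        self.target = np.asarray(target)
        self.drones = [p.loadURDF("quadrotor.urdf", [i, 0, 1.0],
                                  physicsClientId=self.client)
                       for i in range(num_drones)]
        # Per-drone action [v_x, v_y, v_z, v_M]; the observation size is an assumption.
        self.action_space = gym.spaces.Box(-1.0, 1.0, shape=(num_drones, 4), dtype=np.float32)
        self.observation_space = gym.spaces.Box(-np.inf, np.inf, shape=(num_drones, 6),
                                                dtype=np.float32)

    def _get_obs(self):
        rows = []
        for d in self.drones:
            pos, _ = p.getBasePositionAndOrientation(d, physicsClientId=self.client)
            vel, _ = p.getBaseVelocity(d, physicsClientId=self.client)
            rows.append(np.concatenate([self.target - np.asarray(pos), vel]))
        return np.asarray(rows, dtype=np.float32)

    def step(self, action):
        for d, (vx, vy, vz, vm) in zip(self.drones, action):
            # Crude stand-in for the position/attitude control subroutines of the claims:
            # the commanded unit direction scaled by v_M is applied as a base velocity.
            v = np.array([vx, vy, vz]) * abs(vm)
            p.resetBaseVelocity(d, linearVelocity=v.tolist(), physicsClientId=self.client)
        p.stepSimulation(physicsClientId=self.client)
        obs = self._get_obs()
        dist = np.linalg.norm(obs[:, :3], axis=1)
        reward = float(-dist.sum())                             # placeholder shaping term
        done = bool((dist < 0.2).all())
        return obs, reward, done, {}

    def reset(self):
        for i, d in enumerate(self.drones):
            p.resetBasePositionAndOrientation(d, [i, 0, 1.0], [0, 0, 0, 1],
                                              physicsClientId=self.client)
        return self._get_obs()
```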
2. The reinforcement-learning-based assembly-task path planning method for multiple quadrotor UAVs according to claim 1, wherein step S1 is specifically as follows:
S11, constructing a dynamics simulation model of the multi-quadrotor UAVs;
the quadrotor UAV dynamics equations are formed from the quadrotor equations of motion and the aerodynamic effects, completing the construction of the quadrotor UAV dynamics simulation model, as follows:
PyBullet is used to establish the force and torque models acting on each quadrotor UAV in Gym, and the physics engine computes and updates the dynamics equations of all quadrotor UAVs;
the arm length of each quadrotor UAV is set to L, the mass to m, and the inertial properties to J; the physical constants and the convex collision shape are described in a separate URDF file, and the quadrotors are configured in the 'x' configuration;
first, the gravitational acceleration g and the physics step frequency are set in PyBullet; the force F_i applied to each of the 4 motors and the torque T_o about the Z-axis of the UAV are proportional to the square of the motor speed P_i, and the expressions for F_i and T_o are as follows:
F_i = k_F · P_i^2   (1)
T_o = k_T · (−P_0^2 + P_1^2 − P_2^2 + P_3^2)   (2)
wherein k_F and k_T denote predetermined constants;
with real-time control of the model set, the dynamics equation of the quadrotor UAV is expressed as follows:
J^T · T_o = M·a + h   (3)
wherein J denotes the Jacobian matrix, M denotes the inertia matrix, a denotes the generalized acceleration, h denotes the Coriolis and gravity effects, and the superscript T denotes the transpose operation;
in practice, flying near the ground or near other UAVs creates additional aerodynamic effects, which are modeled separately and combined in PyBullet, including: the propeller drag D, the ground effect G_i acting on each individual motor, and the downwash effect W acting on the center of mass;
the drag D is produced by the rotating propellers of the quadrotor UAV; D is proportional to the quadrotor's linear velocity Ẋ, the angular velocity of the propellers, and a constant drag coefficient matrix k_D, expressed as follows:
D = −k_D · ( Σ_{i=0}^{3} 2π·P_i / 60 ) · Ẋ   (4)
wherein 2π·P_i/60 denotes the angular velocity of the propeller (the factor 60 converts revolutions per minute to revolutions per second); the constant drag coefficient matrix k_D is specifically expressed as follows:
k_D = diag(k_⊥, k_⊥, k_∥)   (5)
wherein k_⊥ denotes the perpendicular drag coefficient, k_∥ denotes the parallel drag coefficient, and the matrix k_D is fitted to the data by the least squares method;
when hovering at very low altitude, a ground effect exists; its influence G_i on each motor is proportional to the propeller radius r_P, the speed P_i, the height h_i, and a constant k_G, as follows:
G_i = k_G · P_i^2 · ( r_P / (4·h_i) )^2   (6)
when two quadrotor UAVs pass through the same horizontal position at different heights, a downwash effect exists; its influence is simplified to a single force applied at the center of mass of the UAV, and the magnitude W of the downwash effect depends on the distances (δ_x, δ_y, δ_z) between the two UAVs and the experimentally determined constants k_D1, k_D2, k_D3; the expression for W is as follows:
W = −k_D1 · ( r_P / (4·δ_z) )^2 · exp( −( √(δ_x^2 + δ_y^2) / (k_D2·δ_z + k_D3) )^2 / 2 )   (7)
S12, constructing the observation space and action space of the multi-quadrotor UAVs;
in the constructed Gym environment, an observation vector is output each time a quadrotor UAV executes an action; the observation space of the quadrotor UAV is expressed as follows:
{ n: [X_n, q_n, r_n, p_n, y_n, Ẋ_n, ω_n, P_n] }   (8)
wherein n ∈ [0, ..., N) denotes the index of the quadrotor UAV; X_n = [x, y, z]_n denotes the position of the quadrotor UAV; q_n denotes the quaternion used for attitude control of the quadrotor UAV; r_n, p_n, y_n denote the roll, pitch, and yaw angles, respectively, i.e., the three angles used for attitude estimation; Ẋ_n = [v_x, v_y, v_z]_n denotes the linear velocity of the n-th quadrotor UAV; ω_n = [ω_x, ω_y, ω_z]_n denotes the angular velocity of the n-th quadrotor UAV; and P_n = [P_0, P_1, P_2, P_3]_n denotes the motor speeds of each UAV;
in the present invention, the quadrotor UAVs use lidar to detect obstacles; k lidar rays are set in the model and used to observe the environment;
wherein the scanning angle range of the k lidar rays is π and the angle between two adjacent rays is 2π/k; (d_1, ..., d_k) denote the ray lengths of the k radars in the horizontal plane; d_i denotes the ray length of the i-th radar, expressed as follows:
[Equation (9), rendered as an image in the source: the expression for the ray length d_i of the i-th radar.]
the environment information s_E is then defined as follows:
s_E = [ρ_i, d_i]^T, i = 1...k   (10)
wherein ρ_i denotes the one-hot code of the i-th radar, expressed as follows:
[Equation (11), rendered as an image in the source: the one-hot code ρ_i of the i-th radar.]
then, for any one quadrotor UAV, the action space is expressed as follows:
{ n: [v_x, v_y, v_z, v_M]_n }   (12)
wherein [v_x, v_y, v_z, v_M]_n denotes the velocity command input to the quadrotor UAV, v_x, v_y, v_z denote the components of a unit direction vector, and v_M denotes the magnitude of the desired velocity; the action space can also be represented by the rotational speeds of the four motors, expressed as follows:
{ n: [P_0, P_1, P_2, P_3]_n }   (13)
finally, the conversion of the input into pulse-width modulation (PWM) signals and motor speeds is delegated to a controller consisting of position and attitude control subroutines.
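As a rough illustration of how the force and drag model of this claim could be applied at each physics step, consider the sketch below; the numeric coefficients, the link indexing of the URDF, and the sign convention of the yaw torque are assumptions (the signs follow common quadrotor simulators, not the patent text).

```python
# Hypothetical per-step application of motor thrust and drag (claim 2, equations (1), (4), (5));
# the coefficients and link indices are assumed (motor links 0-3, center-of-mass link 4).
import numpy as np
import pybullet as p

K_F, K_T = 3.16e-10, 7.94e-12                    # assumed thrust/torque constants
K_D = np.diag([9.17e-7, 9.17e-7, 10.3e-7])       # assumed diag(k_perp, k_perp, k_par)

def apply_motor_and_drag(drone_id, rpm, client):
    """rpm: array of the four motor speeds P_0..P_3 (rev/min)."""
    # Equation (1): thrust of each motor, applied along the body z-axis at its motor link.
    forces = K_F * np.square(rpm)
    for i, f in enumerate(forces):
        p.applyExternalForce(drone_id, linkIndex=i, forceObj=[0, 0, float(f)],
                             posObj=[0, 0, 0], flags=p.LINK_FRAME,
                             physicsClientId=client)
    # Equation (2) (sign convention assumed): net yaw torque about the body z-axis.
    t_o = K_T * (-rpm[0]**2 + rpm[1]**2 - rpm[2]**2 + rpm[3]**2)
    p.applyExternalTorque(drone_id, linkIndex=4, torqueObj=[0, 0, float(t_o)],
                          flags=p.LINK_FRAME, physicsClientId=client)
    # Equations (4)-(5): drag proportional to the summed propeller angular speed
    # and the base linear velocity.
    lin_vel, _ = p.getBaseVelocity(drone_id, physicsClientId=client)
    omega_sum = np.sum(2 * np.pi * rpm / 60.0)
    drag = -K_D @ (omega_sum * np.asarray(lin_vel))
    p.applyExternalForce(drone_id, linkIndex=4, forceObj=drag.tolist(),
                         posObj=[0, 0, 0], flags=p.LINK_FRAME,
                         physicsClientId=client)
```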
3. The reinforcement-learning-based assembly-task path planning method for multiple quadrotor UAVs according to claim 1, wherein step S2 is specifically as follows:
S21, abstracting the state space and action space of the quadrotor UAVs;
the states of the quadrotor UAV include: the position of the quadrotor UAV and its quaternion q_n, roll angle r_n, pitch angle p_n, yaw angle y_n, linear velocity Ẋ_n, angular velocity ω_n, the motor speeds P_n = [P_0, P_1, P_2, P_3] of each UAV, the angle β_n between the first-person viewing direction of the UAV and the line to the target, and the difference d_0n between the global coordinates (x, y, z) of the n-th UAV and the global coordinates (x_t, y_t, z_t) of the target;
the global position of the UAV in the state is replaced by the relative position ΔX_n = [Δx, Δy, Δz]_n between the UAV and the target; the UAV state s_U is then as follows:
s_U = [ΔX_n, q_n, r_n, p_n, y_n, Ẋ_n, ω_n, P_n, β_n, d_0n]   (14)
from the UAV state s_U and the environment state s_E detected by the lidar, the state space s of the quadrotor UAV is obtained, expressed as follows:
s = [s_U, s_E]   (15)
the action space of the quadrotor UAV environment consists of the velocity commands input to the quadrotor UAVs; with reference to formula (12), the action space for any one quadrotor UAV is expressed as follows:
{ n: [v_x, v_y, v_z, v_M]_n }   (16)
S22, setting a reward function mechanism so that the quadrotor UAVs interact with the environment;
the reward function R(s, a) represents the environmental feedback obtained by taking action a in state s; a reward function consisting of three parts is set so that the quadrotor UAVs reach the assembly-task target point as soon as possible, specifically as follows:
first, a distance reward R_t between the quadrotor UAV and the target point is set to drive the quadrotor UAV toward the target; R_t is expressed as follows:
[Equation (17), rendered as an image in the source: the distance reward R_t, defined from the distance between the quadrotor UAV and the target at the current and next time steps.]
wherein d_0 denotes the distance from the quadrotor UAV to the target, d_0^n denotes the distance from the n-th quadrotor UAV to the target, and d_0^n′ denotes the distance from the n-th quadrotor UAV to the target at the next time step;
second, a distance reward R_o between the quadrotor UAV and the obstacles is set to keep the UAV away from obstacles; R_o is expressed as follows:
[Equation (18), rendered as an image in the source: the obstacle-distance reward R_o.]
wherein d_i denotes the ray length of the i-th radar, i.e., the detected distance from the quadrotor UAV to an obstacle or to another quadrotor UAV; d_i^n denotes this detected distance for the n-th quadrotor UAV; and a safety distance d_safe between the UAV and obstacles is set;
finally, an angle reward R_a between the quadrotor UAV and the target point is set to drive the UAV toward the target direction, with a larger β_n incurring a larger penalty; R_a is expressed as follows:
[Equation (19), rendered as an image in the source: the angle reward R_a as a function of β_n.]
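Because equations (17)-(19) appear only as images in the source, the following sketch shows one plausible realization of the three-part reward consistent with the surrounding description; the arrival bonus, thresholds, and weights are assumptions.

```python
# Hypothetical three-part reward consistent with the description of step S22;
# the thresholds and weights are assumptions, since equations (17)-(19) are images
# in the source document.
import numpy as np

def assembly_reward(d_prev, d_curr, lidar, beta, d_safe=0.5, d_goal=0.2):
    """d_prev/d_curr: distance to the target at the previous/current step;
    lidar: ray lengths d_1..d_k; beta: angle between heading and target direction."""
    # R_t: reward progress toward the target, with a bonus on arrival (bonus value assumed).
    r_t = (d_prev - d_curr) if d_curr > d_goal else 10.0
    # R_o: penalize coming closer than the safety distance to any obstacle or other UAV.
    d_min = float(np.min(lidar))
    r_o = -(d_safe - d_min) if d_min < d_safe else 0.0
    # R_a: penalty growing with the heading error beta (larger beta, larger penalty).
    r_a = -0.1 * abs(beta)
    return r_t + r_o + r_a
```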
4. The reinforcement-learning-based assembly-task path planning method for multiple quadrotor UAVs according to claim 1, wherein step S3 is specifically as follows:
the TD3 algorithm is improved with N-step returns and prioritized experience replay to obtain the ITD3 algorithm, which consists of four sub-networks, namely two critic networks and two actor networks; the improved deep reinforcement learning algorithm is realized by the ITD3 algorithm;
first, N-step returns are introduced into the TD3 algorithm; an N-step return sums the rewards of N future time steps and therefore provides more comprehensive information than a single-step return;
in the case of sparse rewards, most state transitions p(s′|s, a) carry no reward information and the one-step return becomes ineffective; the N-step return alleviates the sparse-reward problem by sampling N transitions;
the equation of the TD3 critic network is modified by adding the N-step return; in the j-th round of sampling, the modified temporal-difference error δ_j is expressed as follows:
δ_j = Σ_{k=0}^{N−1} γ^k · r_k + γ^N · Q′(s_{t+N}, a_{t+N}, φ′) − Q(s_t, a_t, φ)   (20)
wherein φ and φ′ denote the parameters of the twin critic networks and of the target critic networks, respectively, k denotes the k-th step of the return, r_k denotes the reward at step k, s and a denote the current state and action, s_N and a_N denote the target state and action, Q(s_t, a_t, φ) denotes the value function of the critic network, Q′(s_{t+N}, a_{t+N}, φ′) denotes the value function of the target critic network, and γ denotes the discount factor;
second, prioritized experience replay is used in the original TD3 algorithm; at the start of sampling, the sampling probability of the j-th transition is defined as P(j), expressed as follows:
P(j) = p_j^α / Σ_i p_i^α   (21)
wherein p_j denotes the priority of the j-th experience; α denotes a constant that adjusts the sampling weight and determines how much prioritization is used; when α equals 0, uniform random sampling is employed;
then, the sampling weight w_j of each transition used to update the network is calculated by the following formula and represents the importance of each piece of transition data, where M denotes the size of the mini-batch and max_i w_i denotes the maximum sampling weight used for normalization:
w_j = ( 1 / (M · P(j)) )^β / max_i w_i   (22)
finally, proportional prioritization is used, and the priority of each transition is updated according to its temporal-difference error, as shown in the following formula:
p_j = |δ_j| + ε   (23)
wherein δ_j denotes the temporal-difference error, and ε denotes a small preset value that avoids a priority of 0.
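The N-step return and the prioritized sampling of equations (20)-(23) could be combined in a replay buffer roughly as sketched below; the importance-sampling exponent beta and the data layout are assumptions, since the claim text does not name them.

```python
# Hypothetical N-step prioritized replay buffer in the spirit of equations (20)-(23);
# the beta exponent and data layout are assumptions for illustration.
import numpy as np
from collections import deque

class NStepPER:
    def __init__(self, capacity=100_000, n_steps=5, gamma=0.99, alpha=0.6, beta=0.4, eps=1e-6):
        self.capacity, self.n, self.gamma = capacity, n_steps, gamma
        self.alpha, self.beta, self.eps = alpha, beta, eps
        self.data, self.prios = [], []
        self.nstep_queue = deque(maxlen=n_steps)

    def add(self, s, a, r, s_next, done):
        self.nstep_queue.append((s, a, r, s_next, done))
        if len(self.nstep_queue) < self.n and not done:
            return
        # Collapse the window into a single N-step transition:
        # R = sum_k gamma^k r_k, bootstrapping from the last state in the window.
        s0, a0 = self.nstep_queue[0][0], self.nstep_queue[0][1]
        r_n = sum((self.gamma ** k) * t[2] for k, t in enumerate(self.nstep_queue))
        s_n, d_n = self.nstep_queue[-1][3], self.nstep_queue[-1][4]
        max_p = max(self.prios, default=1.0)       # new samples get the current max priority
        if len(self.data) >= self.capacity:
            self.data.pop(0)
            self.prios.pop(0)
        self.data.append((s0, a0, r_n, s_n, d_n))
        self.prios.append(max_p)
        if done:
            self.nstep_queue.clear()

    def sample(self, batch_size):
        p = np.asarray(self.prios) ** self.alpha
        probs = p / p.sum()                        # equation (21): P(j) = p_j^a / sum_i p_i^a
        idx = np.random.choice(len(self.data), batch_size, p=probs)
        w = (len(self.data) * probs[idx]) ** (-self.beta)
        w /= w.max()                               # equation (22): normalized sampling weights
        batch = [self.data[i] for i in idx]
        return batch, w, idx

    def update_priorities(self, idx, td_errors):
        for i, delta in zip(idx, td_errors):
            self.prios[i] = abs(float(delta)) + self.eps   # equation (23): p_j = |delta_j| + eps
```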
5. The reinforcement-learning-based assembly-task path planning method for multiple quadrotor UAVs according to claim 1, wherein step S4 is specifically as follows:
the ITD3 training network is implemented by a neural network with two parts: an actor network consisting of three fully connected layers, which maps states to actions, and a critic network using four fully connected layers to estimate the Q-value;
in the ITD3 network, for the two actor networks, the input is the state and the output is the action; the critic networks take state-action pairs as input and produce the state-action value function (Q-value); the ITD3 algorithm training process is specifically as follows:
first, a mini-batch of samples (s, a, s′, r) is preferentially drawn from the experience replay buffer, and s′ is input into the actor target network; the next action a′ is then obtained, and the state-action pair (s′, a′) is input into the critic target networks;
after the two target Q-values Q′_{φ′_1}(s′, a′) and Q′_{φ′_2}(s′, a′) are obtained, the smaller one is selected to calculate the target value function y(r, s′), expressed as follows:
y(r, s′) = r + γ · min_{i=1,2} Q′_{φ′_i}(s′, a′)
wherein r denotes the reward, the discount factor γ takes the same value as in formula (20), and φ′_i denotes the parameters of the i-th target critic network;
on the other hand, (s, a) is input into the critic networks to obtain two Q-values, Q_1(·) and Q_2(·); these are then used to calculate the mean squared errors with respect to y(r, s′), and the sum of the mean squared errors is backpropagated to update the parameters of the two critic networks, with the N-step return added to the temporal-difference error update;
next, the Q-value obtained from the first critic network is input into the actor network, and the parameters of the actor network are updated in the direction in which the Q-value increases (once every two iterations);
finally, all target networks are updated by a soft update method;
after training, the angular and linear velocities of the quadrotor UAVs are controlled through the output action information, so that each quadrotor UAV successfully reaches its specified target task in the shortest time, completing the assembly-task path planning.
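As a closing illustration of step S4, a trained actor could be deployed roughly as follows; `MultiQuadEnv` and `Actor` refer to the earlier sketches, and the loop structure and step budget are assumptions.

```python
# Hypothetical deployment loop for a trained ITD3 actor (step S4); it reuses the
# MultiQuadEnv and Actor sketches above, and the loop structure is an assumption.
import torch

def fly_to_assembly_points(env, actor, max_steps=2000):
    obs = env.reset()
    done = False
    for _ in range(max_steps):
        with torch.no_grad():
            # One action [v_x, v_y, v_z, v_M] per drone; the actor output in [-1, 1]
            # is interpreted directly as the normalized velocity command.
            act = actor(torch.as_tensor(obs, dtype=torch.float32)).numpy()
        obs, reward, done, _ = env.step(act)
        if done:
            break
    return done  # True if every quadrotor reached its assembly target
```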
CN202310454330.7A 2023-04-25 2023-04-25 A multi-quadrotor UAV assembly task path planning method based on reinforcement learning Pending CN116301007A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310454330.7A CN116301007A (en) 2023-04-25 2023-04-25 A multi-quadrotor UAV assembly task path planning method based on reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310454330.7A CN116301007A (en) 2023-04-25 2023-04-25 A multi-quadrotor UAV assembly task path planning method based on reinforcement learning

Publications (1)

Publication Number Publication Date
CN116301007A true CN116301007A (en) 2023-06-23

Family

ID=86788905

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310454330.7A Pending CN116301007A (en) 2023-04-25 2023-04-25 A multi-quadrotor UAV assembly task path planning method based on reinforcement learning

Country Status (1)

Country Link
CN (1) CN116301007A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117215197A (en) * 2023-10-23 2023-12-12 南开大学 Four-rotor aircraft online track planning method, four-rotor aircraft online track planning system, electronic equipment and medium
CN117215197B (en) * 2023-10-23 2024-03-29 南开大学 Quadrotor aircraft online trajectory planning method, system, electronic equipment and media
CN118733901A (en) * 2024-06-25 2024-10-01 中国人民解放军32133部队 PTZ control path search method based on big data

Similar Documents

Publication Publication Date Title
Kim et al. A comprehensive survey of control strategies for autonomous quadrotors
Abdelmaksoud et al. Control strategies and novel techniques for autonomous rotorcraft unmanned aerial vehicles: A review
CN111596684B (en) Fixed-wing unmanned aerial vehicle dense formation and anti-collision obstacle avoidance semi-physical simulation system and method
CN111538255B (en) Aircraft control method and system for an anti-swarm drone
Ma et al. Target tracking control of UAV through deep reinforcement learning
CN111026146B (en) Attitude control method of a composite wing vertical take-off and landing unmanned aerial vehicle
CN112947572A (en) Terrain following-based four-rotor aircraft self-adaptive motion planning method
Levin et al. Agile fixed-wing uav motion planning with knife-edge maneuvers
Kose et al. Simultaneous design of morphing hexarotor and autopilot system by using deep neural network and SPSA
CN112161626B (en) High-flyability route planning method based on route tracking mapping network
Chen Research on AI application in the field of quadcopter UAVs
Brahim et al. Finite time adaptive smc for uav trajectory tracking under unknown disturbances and actuators constraints
CN116301007A (en) A multi-quadrotor UAV assembly task path planning method based on reinforcement learning
Chowdhury et al. Interchangeable reinforcement-learning flight controller for fixed-wing UASs
Khizer et al. Stable hovering flight for a small unmanned helicopter using fuzzy control
Zhou et al. Control of a tail-sitter VTOL UAV based on recurrent neural networks
Chen et al. Robust fixed-time flight controller for a dual-system convertible UAV in the cruise mode
Zou et al. Design and implementation of a hardware-in-the-loop simulation platform for a tail-sitter UAV
d’Apolito et al. Flight control of a multicopter using reinforcement learning
De Almeida et al. Controlling tiltrotors unmanned aerial vehicles (UAVs) with deep reinforcement learning
Kadhim et al. Improving the Size of the Propellers of the Parrot Mini-Drone and an Impact Study on its Flight Controller System.
Yu et al. A Novel Brain-inspired Architecture and Flight Experiments for Autonomous Maneuvering Flight of Unmanned Aerial Vehicles
Aslan et al. Immersion and invariance control for Euler angles of a fixed‐wing unmanned aerial vehicle
Sani et al. Wind Estimator Using Attitude Measurement From Quadrotor Flight Under Wind Disturbance
Nair Extended Modeling and Control of Quadrotors using System Identification

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination