CN114020013B - Unmanned aerial vehicle formation collision avoidance method based on deep reinforcement learning - Google Patents
Unmanned aerial vehicle formation collision avoidance method based on deep reinforcement learning Download PDFInfo
- Publication number
- CN114020013B (application CN202111246299.5A)
- Authority
- CN
- China
- Legal status: Active (assumption, not a legal conclusion)
Classifications
-
- G—PHYSICS
- G05—CONTROLLING; REGULATING
- G05D—SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
- G05D1/00—Control of position, course or altitude of land, water, air, or space vehicles, e.g. automatic pilot
- G05D1/10—Simultaneous control of position or course in three dimensions
- G05D1/101—Simultaneous control of position or course in three dimensions specially adapted for aircraft
- G05D1/104—Simultaneous control of position or course in three dimensions specially adapted for aircraft involving a plurality of aircrafts, e.g. formation flying
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02T—CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
- Y02T10/00—Road transport of goods or passengers
- Y02T10/10—Internal combustion engine [ICE] based vehicles
- Y02T10/40—Engine management systems
Abstract
The invention provides an unmanned aerial vehicle (UAV) formation collision avoidance method based on deep reinforcement learning, comprising the following steps: outputting a strategy that enables each UAV to avoid collisions autonomously in flight, with different constraint conditions set so that the UAVs can hold formation; training the UAVs in a simulation environment, generating a collision-avoidance behaviour strategy by assigning different reward values to different behaviours, and recording the UAVs' state information and collision-avoidance strategies; processing external environment information with an LSTM, a form of recurrent neural network, and continuing training from the initial strategy in combination with the UAVs' state information; and adding further constraint conditions on top of collision avoidance so that the UAVs hold a given formation while avoiding collisions within the team, with the model continuously run and optimized. The invention effectively unifies UAV collision avoidance and formation keeping, integrates resources effectively, and adjusts individual behaviour in real time to obtain the optimal collision-avoidance behaviour.
Description
Technical Field
The invention relates to the field of deep reinforcement learning and the technical field of unmanned aerial vehicles, in particular to an unmanned aerial vehicle formation collision avoidance method based on deep reinforcement learning.
Background
In recent years, multi-agent systems have been studied increasingly intensively because of their great potential in many fields, including cooperative exploration for surveillance and rescue, satellite-cluster cooperative control, and UAV formation control. The basic idea of a multi-agent system is to solve, through the cooperation of individuals, complex tasks that a single agent could not accomplish even with expensive equipment. Formation control is a fundamental problem for multi-agent systems; its goal is to reach and maintain a formation shape that lets the system jointly accomplish a specific task, and formation keeping is an important issue within it. In addition, collision avoidance must be considered to guarantee the safety of the system. Because of the interactions between agents and the trade-off between collision avoidance and formation keeping, finding collision-free, time-efficient paths in an uncertain dynamic environment remains a challenge.
For the formation-keeping problem, several formation-control techniques have been proposed in the literature, including behaviour-based formation control, the virtual-structure method, and formation-control schemes based on a leader-follower architecture. Among these, the leader-follower architecture is widely used for its simple structure and practicality. Although a series of advances has been made in leader-follower formation control, the problem of formation control with collision avoidance has not been fully studied in previous work. Especially in dynamic environments, collisions between agents, and between the multi-agent system and obstacles, make collision avoidance increasingly difficult.
For the collision-avoidance problem, conventional algorithms generally fall into three classes: offline planning methods, methods based on artificial potential fields, and sense-and-avoid methods. Offline planning methods compute a collision-free trajectory in advance and use it as the desired trajectory for a subsequent tracking controller; however, they are computationally intensive and require the whole environment to be known in advance, which makes them impractical in dynamic environments. Methods based on artificial potential fields avoid collisions by assuming virtual attractive and repulsive fields between the individuals and the environment; however, they can get stuck in local minima, and sometimes the destination becomes unreachable. Sense-and-avoid methods solve the collision-avoidance problem by sensing the environment and adjusting the current action accordingly, in a human-like way. Work on these methods divides into two categories, reaction-based and prediction-based. Reaction-based methods avoid collisions by applying movement rules to the current state, for example collision avoidance based on fuzzy logic or the reciprocal velocity obstacle method; because they ignore future states, they have limitations and may be unreliable in some situations. Prediction-based methods forecast the motion of obstacles and the future state, and then output long-term actions to avoid collision; however, they suffer from two evident problems: inaccurate estimates caused by various uncertainties, and the enormous computational complexity of the prediction step.
Therefore, traditional collision-avoidance methods are strongly limited and lack formation-control capability, so the focus of research on formation collision avoidance has gradually shifted to the field of reinforcement learning.
Disclosure of Invention
Aiming at the collision-avoidance problem of multi-UAV formations, the invention provides a UAV formation collision avoidance method based on deep reinforcement learning, which coordinates the control of the whole UAV formation so that collisions are avoided and tasks are completed smoothly.
In order to achieve the above object, the present invention adopts the following technical scheme:
an unmanned aerial vehicle formation collision avoidance method based on deep reinforcement learning comprises the following specific steps:
step one: selecting a deep reinforcement learning model as the main framework, setting initial parameters according to established experiments in the field, explicitly training a strategy that enables the UAV to avoid collisions autonomously in flight, and, on this basis, enabling the UAVs to hold formation by setting different constraint conditions;
step two: training the UAV in a simulation environment through imitation learning, so that the UAV operates by imitating the choices a human would make; gradually generating a collision-avoidance behaviour strategy by assigning different reward values to different behaviours; and recording and storing the UAV's state information and collision-avoidance strategies to serve as input for the subsequent learning model;
step three: processing the external environment information, mainly the state information of the obstacles, with an LSTM, a form of recurrent neural network, and then training the UAV from the initial strategy in combination with the UAV state information of step two; during training a second-order dynamic model is used to adjust the UAV's velocity so that velocity changes are smooth, and the training objective is for the UAV to reach the target position along a shorter path;
Long short-term memory (LSTM) is a special kind of RNN, designed mainly to overcome the vanishing- and exploding-gradient problems that arise when training on long sequences. In short, an LSTM performs better than an ordinary RNN on longer sequences.
Step four: adding different constraint conditions on top of collision avoidance, so that the UAVs hold a given formation while avoiding collisions within the team; optimization continues as the model runs, and the expected output is a flexible flight strategy that holds the formation and returns to the correct path after an avoidance manoeuvre. In step one, the environment comprises a leader, a follower and an obstacle, denoted by the superscripts L, F and O, respectively;
the state space of the UAV at time $t$ is denoted $s_t$ and its action space $a_t$; the other parameters of the training environment are: $t$, the time; $\Delta t$, the time step; $p_t=[p_x,p_y]$, the position of the UAV at time $t$; $v_t=[v_x,v_y]$, its velocity at time $t$; $r$, its occupied radius; $p_g=[p_{gx},p_{gy}]$, the target position; $v_{pref}$, the preferred speed; $\theta_t$, the heading angle; and $s_t^{F}$, $s_t^{L}$, $s_t^{O}$, the state spaces of the follower, the leader and the obstacle, respectively;

its state information at time $t$ is expressed as $s_t=[s_t^{o},s_t^{h}]$, where $s_t^{o}$ denotes the state information that can be observed by the other UAVs and $s_t^{h}$ denotes the hidden state information that cannot be observed;
for the UAV action $a_t$, the UAV is assumed to respond quickly once a control command is received, so the action is set to the commanded velocity, $a_t=[v_x,v_y]$; the goal of training is to design the follower strategy $\pi: s_t \mapsto a_t$ so as to select appropriate actions that maintain the formation and avoid obstacles;
in the learning architecture, the problem is converted into the optimization of an objective function, composed of multiple terms, under a set of constraints: the objective combines the time $t_g$ the follower needs to reach the target with the accumulated formation-keeping error, while the constraints include the collision-avoidance requirement;
the objective function of formation collision avoidance is as follows:

$$\min_{\pi}\ t_g+\sum_{t}\big\|p_t^{F}-p_t^{L}-H_t\big\| \qquad (1.1)$$
$$\text{s.t.}\quad \big\|p_t^{F}-\tilde p_t\big\|\ge 2r \qquad (1.2)$$
$$p_{t_g}^{F}=p_g \qquad (1.3)$$
$$p_{t+\Delta t}^{F}=p_t^{F}+\Delta t\,a_t \qquad (1.4)$$

In formula (1.2), $\tilde p_t$ denotes the positions of the other UAVs in the environment, excluding the follower itself, and $H_t$ denotes the desired relative offset vector of the follower with respect to the leader; (1.2) is the collision-avoidance constraint, (1.3) the constraint of reaching the target position, and (1.4) the kinematic constraint of the UAV.
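As an illustrative sketch, not part of the claimed method and with all function names hypothetical, the collision-avoidance constraint (keeping at least twice the occupied radius, $2r$, from every other UAV) and the target-reaching constraint can be checked as:

```python
import math

def collision_free(p_follower, others, r):
    """Collision-avoidance constraint: the follower keeps at least 2r
    (the sum of two occupied radii) from every other UAV."""
    return all(math.dist(p_follower, p) >= 2 * r for p in others)

def reached_target(p_follower, p_g, tol=1e-6):
    """Target constraint: the follower has arrived at the target position."""
    return math.dist(p_follower, p_g) <= tol
```

A strategy satisfying the constraints would keep `collision_free` true at every time step and make `reached_target` true at time $t_g$.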
The second step specifically comprises the following steps:
first, the joint state space of the UAV is defined as $s_t^{jn}=\big[s_t,\tilde s_t^{F},\tilde s_t^{O}\big]$, where $\tilde s_t^{F}$ denotes the observable states of all followers and $\tilde s_t^{O}$ the observable states of the obstacles;
Secondly, a value network is designed to estimate the value of the state space, the purpose of the value network is to find the optimal value function, and the definition of the value function is as follows:
in the formula (1.5), the amino acid sequence,representing the rewards acquired at time t, gamma representing the discount factor;
the optimal strategy $\pi^{*}: s_t^{jn}\mapsto a_t$ is obtained iteratively from the value function:

$$\pi^{*}\big(s_t^{jn}\big)=\arg\max_{a_t}\Big[R_t\big(s_t^{jn},a_t\big)+\gamma\sum_{s_{t+\Delta t}^{jn}}P\big(s_{t+\Delta t}^{jn}\mid s_t^{jn},a_t\big)\,V^{*}\big(s_{t+\Delta t}^{jn}\big)\Big] \qquad (1.6)$$

in formula (1.6), $P\big(s_{t+\Delta t}^{jn}\mid s_t^{jn},a_t\big)$ denotes the transition probability between times $t$ and $t+\Delta t$;
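To illustrate the iterative use of the value function, the following minimal one-step lookahead selects the action maximising reward plus discounted successor value, assuming a deterministic, known transition model (a simplification of the transition probability); all callables are hypothetical stand-ins:

```python
def greedy_action(state, actions, step, reward, value, gamma=0.9):
    """One-step lookahead: pick the action maximising the immediate
    reward plus the discounted value of the deterministic successor."""
    return max(actions, key=lambda a: reward(state, a) + gamma * value(step(state, a)))
```

For example, on a number line with goal 3, `greedy_action` steers toward the goal because the value function rewards proximity.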
Finally, solving the problem of formation control by adopting a formation evaluation function based on the idea of reinforcement learning, wherein the formation evaluation function is used for evaluating the quality of the formation and calculating the rewards of the formation and reflecting the errors of the formation track in real time; taking Euclidean distance between the target position and the actual position as an input; the constructed reward function for formation is defined as:
in the formula (1.7), the amino acid sequence,a formation error value formed at time t;
the reward function for collision avoidance is expressed as follows:
in the formula (1.8)Representing a minimum distance between the follower and the other drone;
combining equations (1.7) and (1.8) to obtain a complete bonus function R t The method comprises the following steps:
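A minimal sketch of the reward shaping described above, assuming a simple additive combination, a unit collision penalty, and the Euclidean formation error as the formation term; the function names and exact penalty values are illustrative assumptions, not the patented reward:

```python
import math

def formation_reward(p_actual, p_desired):
    """Formation term: penalise the Euclidean formation error
    between the actual and the desired position."""
    return -math.dist(p_actual, p_desired)

def collision_reward(d_min, r):
    """Collision term: fixed penalty when the minimum distance to
    another UAV falls below the safety threshold 2r."""
    return -1.0 if d_min < 2 * r else 0.0

def total_reward(p_actual, p_desired, d_min, r):
    """Complete reward: sum of the formation and collision terms."""
    return formation_reward(p_actual, p_desired) + collision_reward(d_min, r)
```

With this shaping, a follower in perfect formation and clear of all neighbours receives zero reward, and every deviation or near-collision is penalised.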
in the third step, the second-order dynamic model is:

$$\dot P^{F}=V^{F},\qquad \dot V^{F}=U^{F} \qquad (1.10)$$

in formula (1.10), $P^{F}$, $V^{F}$ and $U^{F}$ denote the position, velocity and control-input vectors of the follower, respectively; correspondingly, $P^{L}$ and $V^{L}$ denote the position and velocity vectors of the leader; the follower should keep a fixed distance from the leader according to the formation to be maintained, and $H_p=[H_x,H_y]^{T}$ denotes the relative offset vector the follower must maintain with respect to the leader;

suppose $\xi^{F}=[(P^{F})^{T},(V^{F})^{T}]^{T}$ denotes the position and velocity of the follower and $\xi^{L}=[(P^{L})^{T},(V^{L})^{T}]^{T}$ those of the leader; their desired relative offset is $H=[H_p^{T},\mathbf 0^{T}]^{T}$; for any given initial states of the follower and the leader, the condition for maintaining the formation is:

$$\lim_{t\to\infty}\big(\xi^{F}(t)-\xi^{L}(t)-H\big)=0$$

from this condition a control protocol is assumed, where $k_1,k_2>0$:

$$U^{F}=\dot V^{L}-k_1\big(P^{F}-P^{L}-H_p\big)-k_2\big(V^{F}-V^{L}\big)$$
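A short numerical check of such a leader-follower protocol, here in one dimension with a constant-velocity leader (so the leader-acceleration feed-forward term vanishes); the gains, offset and initial states are illustrative assumptions:

```python
def simulate(k1=2.0, k2=3.0, dt=0.01, steps=4000):
    """Euler-integrate a 1-D double-integrator follower under the
    protocol u = -k1*(pF - pL - H) - k2*(vF - vL); the leader moves
    at constant velocity, so the follower's offset converges to H."""
    H = -2.0                      # desired offset behind the leader
    pL, vL = 0.0, 1.0             # leader state
    pF, vF = -5.0, 0.0            # follower starts out of formation
    for _ in range(steps):
        u = -k1 * (pF - pL - H) - k2 * (vF - vL)
        pF += vF * dt             # double-integrator dynamics
        vF += u * dt
        pL += vL * dt
    return pF - pL                # should approach H
```

With these gains, the tracking error obeys a stable second-order equation, so the returned offset settles at the desired value.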
Compared with the prior art, the invention has the advantages that:
UAV collision-avoidance behaviour is hard to unify effectively with UAV formation behaviour: traditional collision-avoidance approaches lack flexibility and cannot be integrated smoothly with a UAV formation system, while most traditional formation control is based on control theory, leans toward fixed, unchanging motion tasks, and lacks dynamic adjustment to dynamic environments. Approaching the problem instead from the field of deep reinforcement learning, the proposed UAV formation collision avoidance method unifies collision avoidance and formation keeping effectively, integrates resources effectively, and adjusts individual behaviour in real time to obtain the optimal collision-avoidance behaviour, greatly improving the mobility of the cluster system and its adaptability to complex environments.
Drawings
FIG. 1 is a general idea of the formation collision avoidance of the present invention;
FIG. 2 is a flow chart of processing data by the LSTM module of the present invention;
fig. 3 is a general algorithm block diagram of the formation collision avoidance of the present invention.
Detailed Description
The invention will be described in further detail below with reference to the accompanying drawings and by way of examples in order to make the objects, technical solutions and advantages of the invention more apparent.
The unmanned aerial vehicle formation collision avoidance method based on deep reinforcement learning comprises the following specific steps:
Step one: first, a deep reinforcement learning model is selected as the main framework; then initial parameters are set according to established experiments in the field, and a strategy that enables the UAV to avoid collisions autonomously in flight is explicitly output; on this basis, setting different constraint conditions allows the UAVs to hold formation to a certain extent. The environment of the invention includes a leader, a follower and an obstacle, distinguished for convenience by the superscripts F, L and O.
The state space of the UAV at time $t$ can be expressed as $s_t$ and its action space as $a_t$; the other parameters of the training environment are shown in Table 1.
TABLE 1 parameter details list
Table 1 clarifies the various quantities used while the UAV operates. The state information of the UAV at time $t$ can be expressed as $s_t=[s_t^{o},s_t^{h}]$, where $s_t^{o}$ denotes the state information that can be observed by the other UAVs and $s_t^{h}$ denotes the hidden state information that cannot be observed.
For the UAV action $a_t$, the UAV is assumed to respond quickly once a control command is received, so the action is set to the commanded velocity, $a_t=[v_x,v_y]$. The goal of training is to design the strategy $\pi: s_t \mapsto a_t$ so as to select appropriate actions that maintain the formation and avoid obstacles.
In the learning architecture, this problem can be converted into the optimization of an objective function, composed of multiple terms, under a set of constraints: the objective combines the time $t_g$ the follower needs to reach the target with the accumulated formation-keeping error, while the constraints include the collision-avoidance requirement.
Thus, the objective function of formation collision avoidance is as follows:

$$\min_{\pi}\ t_g+\sum_{t}\big\|p_t^{F}-p_t^{L}-H_t\big\| \qquad (1.1)$$
$$\text{s.t.}\quad \big\|p_t^{F}-\tilde p_t\big\|\ge 2r \qquad (1.2)$$
$$p_{t_g}^{F}=p_g \qquad (1.3)$$
$$p_{t+\Delta t}^{F}=p_t^{F}+\Delta t\,a_t \qquad (1.4)$$

In formula (1.2), $\tilde p_t$ denotes the positions of the other UAVs in the environment, excluding the follower itself, and $H_t$ denotes the desired relative offset vector of the follower with respect to the leader. (1.2) is the collision-avoidance constraint, (1.3) the constraint of reaching the target position, and (1.4) the kinematic constraint of the UAV.
Step two: the UAV is trained in a simulation environment through imitation learning, so that it operates by imitating the choices a human would make; a collision-avoidance behaviour strategy is gradually generated by assigning different reward values to different behaviours; the UAV's state information and collision-avoidance strategies are then recorded and stored to serve as input for the subsequent learning model.
First, the joint state space of the UAV is defined as $s_t^{jn}=\big[s_t,\tilde s_t^{F},\tilde s_t^{O}\big]$, where $\tilde s_t^{F}$ denotes the observable states of all followers and $\tilde s_t^{O}$ the observable states of the obstacles.
Secondly, a value network is designed to estimate the value of the state space; its purpose is to find the optimal value function, which is defined as:

$$V^{*}\big(s_t^{jn}\big)=\mathbb E\Big[\sum_{k=t}^{T}\gamma^{\,k-t}R_k\Big] \qquad (1.5)$$

In formula (1.5), $R_t$ denotes the reward obtained at time $t$ and $\gamma$ the discount factor.
The optimal strategy $\pi^{*}: s_t^{jn}\mapsto a_t$ can be obtained iteratively from the value function:

$$\pi^{*}\big(s_t^{jn}\big)=\arg\max_{a_t}\Big[R_t\big(s_t^{jn},a_t\big)+\gamma\sum_{s_{t+\Delta t}^{jn}}P\big(s_{t+\Delta t}^{jn}\mid s_t^{jn},a_t\big)\,V^{*}\big(s_{t+\Delta t}^{jn}\big)\Big] \qquad (1.6)$$

In formula (1.6), $P\big(s_{t+\Delta t}^{jn}\mid s_t^{jn},a_t\big)$ denotes the transition probability between times $t$ and $t+\Delta t$.
Finally, based on the idea of reinforcement learning, the formation-control problem is solved with a formation evaluation function: in essence it evaluates the quality of the formation and computes the formation reward, and in particular it reflects the error of the formation track in real time. It takes the Euclidean distance between the target position and the actual position as input. The constructed formation reward function is defined as:

$$R_t^{f}=-\big\|e_t^{f}\big\| \qquad (1.7)$$

In formula (1.7), $e_t^{f}$ is the formation error at time $t$. The collision-avoidance reward function is expressed as follows:

$$R_t^{c}=\begin{cases}-1, & d_t^{\min}<2r\\ 0, & \text{otherwise}\end{cases} \qquad (1.8)$$

In formula (1.8), $d_t^{\min}$ denotes the minimum distance between the follower and the other UAVs. Combining formulas (1.7) and (1.8) yields the complete reward function:

$$R_t=R_t^{f}+R_t^{c} \qquad (1.9)$$
the idea of combining collision avoidance and clustering can be considered that unmanned aerial vehicle clusters are collision avoidance is shown in fig. 1.
Step three: the external environment information, mainly the state information of the obstacles, is processed with an LSTM, a form of recurrent neural network, and the UAV is then trained from the initial strategy in combination with the UAV state information of step two; during training a second-order dynamic model is used to adjust the UAV's velocity so that velocity changes are smooth, and the training objective is for the UAV to reach the target position along a shorter path.
At time $t$, the states of the obstacles are treated as the input sequence of the LSTM network; as shown in fig. 2, the LSTM processes the obstacle state information one element at a time and finally produces an encoding of all obstacles. The LSTM network can thus handle a variable number of obstacles, avoiding the need to rebuild the network whenever the obstacle count changes.
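This variable-length encoding step can be sketched with a minimal, untrained LSTM cell in pure Python; the class, its random weights and dimensions are illustrative assumptions, showing only that obstacle lists of any length fold into a fixed-size vector:

```python
import math
import random

def _sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

class TinyLSTM:
    """Minimal LSTM cell (illustrative, untrained random weights):
    folds a variable-length sequence of obstacle states into one
    fixed-size hidden vector, as described for the encoding step."""

    def __init__(self, n_in, n_hid, seed=0):
        rng = random.Random(seed)
        self.n_in, self.n_hid = n_in, n_hid
        # one weight matrix per gate: input, forget, output, candidate
        self.W = {g: [[rng.uniform(-0.1, 0.1) for _ in range(n_in + n_hid)]
                      for _ in range(n_hid)] for g in "ifoc"}

    def _gate(self, g, xh):
        return [sum(w * v for w, v in zip(row, xh)) for row in self.W[g]]

    def encode(self, obstacles):
        h = [0.0] * self.n_hid
        c = [0.0] * self.n_hid
        for x in obstacles:            # process obstacle states one by one
            xh = list(x) + h
            i = [_sigmoid(v) for v in self._gate("i", xh)]
            f = [_sigmoid(v) for v in self._gate("f", xh)]
            o = [_sigmoid(v) for v in self._gate("o", xh)]
            g = [math.tanh(v) for v in self._gate("c", xh)]
            c = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c, i, g)]
            h = [oj * math.tanh(cj) for oj, cj in zip(o, c)]
        return h                        # fixed size regardless of count
```

In practice a trained deep-learning framework would replace this sketch; the point is that two obstacles and five obstacles both yield an encoding of the same dimension.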
The selected second-order dynamic model is as follows:

$$\dot P^{F}=V^{F},\qquad \dot V^{F}=U^{F} \qquad (1.10)$$

In formula (1.10), $P^{F}$, $V^{F}$ and $U^{F}$ denote the position, velocity and control-input vectors of the follower, respectively; correspondingly, $P^{L}$ and $V^{L}$ denote the position and velocity vectors of the leader. The follower should keep a fixed distance from the leader according to the formation to be maintained; thus $H_p=[H_x,H_y]^{T}$ denotes the relative offset vector the follower must maintain with respect to the leader.

Suppose $\xi^{F}=[(P^{F})^{T},(V^{F})^{T}]^{T}$ denotes the position and velocity of the follower and $\xi^{L}=[(P^{L})^{T},(V^{L})^{T}]^{T}$ those of the leader; their desired relative offset is $H=[H_p^{T},\mathbf 0^{T}]^{T}$. For any given initial states of the follower and the leader, the condition for maintaining the formation is:

$$\lim_{t\to\infty}\big(\xi^{F}(t)-\xi^{L}(t)-H\big)=0$$

From the control conditions, the following control protocol is assumed, where $k_1,k_2>0$:

$$U^{F}=\dot V^{L}-k_1\big(P^{F}-P^{L}-H_p\big)-k_2\big(V^{F}-V^{L}\big)$$
Step four: finally, different constraint conditions are added on top of collision avoidance, so that the UAVs hold a given formation while avoiding collisions within the team; optimization continues as the model runs, and the expected output is a flexible flight strategy that holds the formation and returns to the correct path after an avoidance manoeuvre.
The main body frame of the unmanned aerial vehicle formation collision avoidance algorithm is shown in fig. 3.
The training process in the above steps can be subdivided as follows:
(1) Execute the formation-based optimal reciprocal collision avoidance strategy algorithm and collect a demonstration data set $D$; then go to (2).
(2) Initialize the value network $V$ from the demonstration data set $D$; then go to (3).
(3) Initialize the target value network $\hat V$; then go to (4).
(4) Initialize the experience replay memory $M$, used to break the correlation between samples; then go to (5).
(5) Loop: if the maximum number of iterations has not been reached, execute the following steps; otherwise exit and return the value function $V$.
a) Initialize a random training scenario.
b) Repeat the following steps until success or timeout:
i. select a behaviour: choose the optimal behaviour $a_t$ by maximizing the sum of the value function and the reward;
ii. store the transition tuple $(s_t^{jn}, a_t, R_t, s_{t+\Delta t}^{jn})$ in the experience replay memory $M$;
iii. randomly draw a training batch from $M$;
iv. set the target value of the output;
v. perform gradient descent on the loss function.
c) Every $C$ training iterations, update the target value network $\hat V$ once.
d) End; return the value function $V$.
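The training loop above can be sketched, heavily simplified, in pure Python: a tabular value function on a toy one-dimensional chain stands in for the value network, and a temporal-difference update on random minibatches stands in for the gradient-descent step; the environment, batch size, learning rate and all other details beyond the loop structure are illustrative assumptions:

```python
import random
from collections import deque

def train_toy(episodes=300, gamma=0.9, alpha=0.5, batch=16, seed=1):
    """Skeleton of steps (1)-(5): interact with a toy environment,
    store transitions in a replay memory, and update a (here: tabular)
    value function from random minibatches drawn from that memory."""
    rng = random.Random(seed)
    goal, V = 5, [0.0] * 6            # 1-D chain of states 0..5
    memory = deque(maxlen=1000)       # experience replay memory M
    for _ in range(episodes):
        s = 0
        while s != goal:
            a = rng.choice([-1, 1])   # exploratory behaviour selection
            s2 = min(max(s + a, 0), goal)
            r = 1.0 if s2 == goal else 0.0
            memory.append((s, r, s2)) # store the transition tuple
            s = s2
            # gradient-descent stand-in: TD update on a random batch
            for (bs, br, bs2) in rng.sample(memory, min(batch, len(memory))):
                target = br + (0.0 if bs2 == goal else gamma * V[bs2])
                V[bs] += alpha * (target - V[bs])
    return V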
Those of ordinary skill in the art will appreciate that the embodiments described herein are intended to aid the reader in understanding the practice of the invention and that the scope of the invention is not limited to such specific statements and embodiments. Those of ordinary skill in the art can make various other specific modifications and combinations from the teachings of the present disclosure without departing from the spirit thereof, and such modifications and combinations remain within the scope of the present disclosure.
Claims (1)
1. An unmanned aerial vehicle formation collision avoidance method based on deep reinforcement learning, characterized by comprising the following steps:
step one: selecting a deep reinforcement learning model as the main framework, setting initial parameters according to established experiments in the field, explicitly training a strategy that enables the UAV to avoid collisions autonomously in flight, and, on this basis, enabling the UAVs to hold formation by setting different constraint conditions;
in the first step, the environment comprises a leader, a follower and an obstacle, which are respectively represented by superscripts L, F and O;
the state space of the UAV at time $t$ is expressed as $s_t$ and its action space as $a_t$; the other parameters of the training environment are: $t$, the time; $\Delta t$, the time step; $p_t=[p_x,p_y]$, the position of the UAV at time $t$; $v_t=[v_x,v_y]$, its velocity at time $t$; $r$, its occupied radius; $p_g=[p_{gx},p_{gy}]$, the target position; $v_{pref}$, the preferred speed; $\theta_t$, the heading angle; and $s_t^{F}$, $s_t^{L}$, $s_t^{O}$, the state spaces of the follower, the leader and the obstacle, respectively;

its state information at time $t$ is expressed as $s_t=[s_t^{o},s_t^{h}]$, where $s_t^{o}$ denotes the state information that can be observed by the other UAVs and $s_t^{h}$ denotes the hidden state information that cannot be observed;

for the UAV action $a_t$, the UAV is assumed to respond quickly once a control command is received, so the action is set to the commanded velocity, $a_t=[v_x,v_y]$; the goal of training is to design the strategy $\pi: s_t \mapsto a_t$ so as to select appropriate actions that maintain the formation and avoid obstacles;
in the learning architecture, the problem is converted into the optimization of an objective function, composed of multiple terms, under a set of constraints: the objective combines the time $t_g$ the follower needs to reach the target with the accumulated formation-keeping error, while the constraints include the collision-avoidance requirement;
the objective function of formation collision avoidance is as follows:

$$\min_{\pi}\ t_g+\sum_{t}\big\|p_t^{F}-p_t^{L}-H_t\big\| \qquad (1.1)$$
$$\text{s.t.}\quad \big\|p_t^{F}-\tilde p_t\big\|\ge 2r \qquad (1.2)$$
$$p_{t_g}^{F}=p_g \qquad (1.3)$$
$$p_{t+\Delta t}^{F}=p_t^{F}+\Delta t\,a_t \qquad (1.4)$$

in formula (1.2), $\tilde p_t$ denotes the positions of the other UAVs in the environment, excluding the follower itself, and $H_t$ denotes the desired relative offset vector of the follower with respect to the leader; (1.2) is the collision-avoidance constraint, (1.3) the constraint of reaching the target position, and (1.4) the kinematic constraint of the UAV;
step two: training the UAV in a simulation environment through imitation learning, so that the UAV operates by imitating the choices a human would make; gradually generating a collision-avoidance behaviour strategy by assigning different reward values to different behaviours; and recording and storing the UAV's state information and collision-avoidance strategies to serve as input for the subsequent learning model;
the second step specifically comprises the following steps:
first, the joint state space of the unmanned aerial vehicles is defined, consisting of the observable space of all followers together with the observable space of the obstacles;
secondly, a value network is designed to estimate the value of the state space; the purpose of the value network is to find the optimal value function, where the value function is defined as:

V(s_t) = E[ Σ_{k=0}^{∞} γ^{kΔt} · R_{t+kΔt} ]  (1.5)

in formula (1.5), R_t represents the reward acquired at time t and γ represents the discount factor;
the optimal strategy π*: s_t → a_t is acquired iteratively from the value function:

π*(s_t) = argmax_{a_t} [ R(s_t, a_t) + γ^{Δt} ∫ P(s_{t+Δt} | s_t, a_t) · V*(s_{t+Δt}) ds_{t+Δt} ]  (1.6)

in formula (1.6), P(s_{t+Δt} | s_t, a_t) represents the transition probability between time t and t+Δt;
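On a finite state space, the iterative rule of formula (1.6) reduces to classical value iteration: the integral over successor states becomes a transition-weighted sum. The tiny two-state MDP used to exercise it below is a made-up illustration, not the UAV state space:

```python
def value_iteration(n_states, actions, P, R, gamma=0.9, iters=100):
    """P[s][a] is a list of (prob, next_state) pairs; R[s][a] is the reward.
    Returns the converged values and the greedy (optimal) policy."""
    V = [0.0] * n_states
    for _ in range(iters):
        # Bellman optimality backup: max over actions of reward plus
        # discounted expected value of the successor state (cf. (1.6))
        V = [max(R[s][a] + gamma * sum(p * V[s2] for p, s2 in P[s][a])
                 for a in actions) for s in range(n_states)]
    policy = [max(actions, key=lambda a: R[s][a] +
                  gamma * sum(p * V[s2] for p, s2 in P[s][a]))
              for s in range(n_states)]
    return V, policy
```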
finally, based on the idea of reinforcement learning, a formation evaluation function is adopted to solve the formation control problem; this function evaluates the quality of the formation, calculates the formation reward, and reflects the formation trajectory error in real time; the Euclidean distance between the target position and the actual position is taken as its input; the constructed reward function for the formation is defined as:
in formula (1.7), the input term represents the formation error value formed at time t;
the reward function for collision avoidance is expressed as follows:
in formula (1.8), the distance term represents the minimum distance between the follower and the other unmanned aerial vehicles;
formulas (1.7) and (1.8) are combined to obtain the complete reward function R_t:
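The combined reward can be sketched as follows; the penalty shape, safety threshold d_safe and penalty magnitude are illustrative assumptions, since the patent's exact expressions (1.7)-(1.9) are given only as images:

```python
import math

def formation_reward(target_pos, actual_pos):
    """Formation term (cf. (1.7)): penalize the Euclidean distance between
    the target position and the actual position."""
    return -math.dist(target_pos, actual_pos)

def collision_reward(min_distance, d_safe=1.0, penalty=-10.0):
    """Collision term (cf. (1.8)): a large penalty when the minimum
    distance to another UAV falls below the assumed safety threshold."""
    return penalty if min_distance < d_safe else 0.0

def total_reward(target_pos, actual_pos, min_distance):
    """Complete reward R_t (cf. (1.9)): sum of the two terms."""
    return formation_reward(target_pos, actual_pos) + collision_reward(min_distance)
```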
step three: the external environment information is processed by the LSTM structure of a recurrent neural network; combined with the state information of the unmanned aerial vehicle from step two, the unmanned aerial vehicle is then trained on the basis of the initial strategy; during training, a second-order dynamics model is adopted to adjust the speed of the unmanned aerial vehicle so as to obtain smooth speed changes; the expected training outcome is that the unmanned aerial vehicle reaches the target position along a shorter path;
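To illustrate why an LSTM suits this step, the toy single-unit cell below folds a variable-length sequence of environment observations (e.g. per-obstacle readings) into one fixed-size hidden state. Sharing one scalar weight across all four gates is a deliberate simplification for readability; a real implementation would use a deep-learning library with full weight matrices:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h, c, w=1.0, u=0.5, b=0.0):
    """One LSTM step with shared scalar weights for all four gates (a toy
    simplification); returns the new hidden and cell states."""
    f = sigmoid(w * x + u * h + b)    # forget gate
    i = sigmoid(w * x + u * h + b)    # input gate
    o = sigmoid(w * x + u * h + b)    # output gate
    g = math.tanh(w * x + u * h + b)  # candidate cell state
    c_new = f * c + i * g
    h_new = o * math.tanh(c_new)
    return h_new, c_new

def encode_sequence(xs):
    """Run the cell over a sequence of observations and return the final
    hidden state, a fixed-size summary regardless of sequence length."""
    h, c = 0.0, 0.0
    for x in xs:
        h, c = lstm_step(x, h, c)
    return h
```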
in the third step, the second-order dynamics model is as follows:

dP^F/dt = V^F,  dV^F/dt = U^F  (1.10)

in formula (1.10), P^F, V^F and U^F represent the position, velocity and control input vectors of the follower, respectively; correspondingly, P^L and V^L represent the position and velocity vectors of the leader; according to the formation to be maintained, the follower should keep a certain distance from the leader, and H_p = [H_x, H_y]^T represents the relative offset vector that the follower needs to maintain with respect to the leader;
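The double-integrator dynamics can be integrated with a simple Euler step; the time step value is an assumed simulation parameter:

```python
def step_dynamics(p, v, u, dt=0.1):
    """One Euler step of dP/dt = V, dV/dt = U for a 2-D follower:
    position is driven by velocity, velocity by the control input."""
    p_new = (p[0] + v[0] * dt, p[1] + v[1] * dt)
    v_new = (v[0] + u[0] * dt, v[1] + u[1] * dt)
    return p_new, v_new
```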
suppose ζ^F = [(P^F)^T, (V^F)^T]^T denotes the position and velocity of the follower and ζ^L = [(P^L)^T, (V^L)^T]^T denotes the position and velocity of the leader, with H = [(H_p)^T, 0^T]^T as the relative offset vector between the two; for any given initial state of the follower and the leader, the condition for maintaining the formation is:

lim_{t→∞} || ζ^F(t) − ζ^L(t) − H || = 0  (1.11)
according to this condition, the following control protocol is assumed, where k_1, k_2 > 0:

U^F = −k_1 · (P^F − P^L − H_p) − k_2 · (V^F − V^L)  (1.12)
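A closed-loop sketch of this protocol on the double-integrator model: the follower accelerates toward the leader's position shifted by the offset H_p while matching the leader's velocity. The gain values, the stationary-leader scenario and the initial conditions are illustrative assumptions:

```python
def control_input(pf, vf, pl, vl, hp, k1=1.0, k2=2.0):
    """U^F = -k1 (P^F - P^L - H_p) - k2 (V^F - V^L), per formula (1.12)."""
    return tuple(-k1 * (pf[i] - pl[i] - hp[i]) - k2 * (vf[i] - vl[i])
                 for i in range(2))

def simulate(steps=500, dt=0.05):
    """Euler-integrate the follower under the protocol toward the offset
    slot relative to a stationary leader at the origin."""
    pf, vf = (5.0, 5.0), (0.0, 0.0)                # follower start
    pl, vl, hp = (0.0, 0.0), (0.0, 0.0), (-1.0, 0.0)  # leader and offset
    for _ in range(steps):
        u = control_input(pf, vf, pl, vl, hp)
        pf = (pf[0] + vf[0] * dt, pf[1] + vf[1] * dt)
        vf = (vf[0] + u[0] * dt, vf[1] + u[1] * dt)
    return pf
```

With these gains (k2^2 = 4 k1) the error dynamics are critically damped, so the follower settles at the offset position without oscillation.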
Step four: different constraint conditions are added on the basis of collision avoidance, so that the unmanned aerial vehicles maintain a certain formation in flight while avoiding collisions within the team; continuous operational optimization is realized through the model, and the expected output is a flexible flight strategy that maintains the formation and can return to the correct path after collision avoidance.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111246299.5A CN114020013B (en) | 2021-10-26 | 2021-10-26 | Unmanned aerial vehicle formation collision avoidance method based on deep reinforcement learning |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114020013A CN114020013A (en) | 2022-02-08 |
CN114020013B true CN114020013B (en) | 2024-03-15 |
Family
ID=80057596
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114815882A (en) * | 2022-04-08 | 2022-07-29 | 北京航空航天大学 | Unmanned aerial vehicle autonomous formation intelligent control method based on reinforcement learning |
CN116069023B (en) * | 2022-12-20 | 2024-02-23 | 南京航空航天大学 | Multi-unmanned vehicle formation control method and system based on deep reinforcement learning |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109992000A (en) * | 2019-04-04 | 2019-07-09 | 北京航空航天大学 | A kind of multiple no-manned plane path collaborative planning method and device based on Hierarchical reinforcement learning |
WO2020056875A1 (en) * | 2018-09-20 | 2020-03-26 | 初速度(苏州)科技有限公司 | Parking strategy based on deep reinforcement learning |
CN111694365A (en) * | 2020-07-01 | 2020-09-22 | 武汉理工大学 | Unmanned ship formation path tracking method based on deep reinforcement learning |
JP2021034050A (en) * | 2019-08-21 | 2021-03-01 | 哈爾浜工程大学 | Auv action plan and operation control method based on reinforcement learning |
CN113110592A (en) * | 2021-04-23 | 2021-07-13 | 南京大学 | Unmanned aerial vehicle obstacle avoidance and path planning method |
CN113253733A (en) * | 2021-06-03 | 2021-08-13 | 杭州未名信科科技有限公司 | Navigation obstacle avoidance method, device and system based on learning and fusion |
CN113485323A (en) * | 2021-06-11 | 2021-10-08 | 同济大学 | Flexible formation method for cascaded multiple mobile robots |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||