CN111897316B - Multi-aircraft autonomous decision-making method under scene fast-changing condition - Google Patents
- Publication number
- CN111897316B (application CN202010575719.3A)
- Authority
- CN
- China
- Prior art keywords
- aircraft
- distance
- action
- ith
- target
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G05—CONTROLLING; REGULATING
- G05D—SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
- G05D1/00—Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
- G05D1/0088—Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots characterized by the autonomous decision making process, e.g. artificial intelligence, predefined behaviours
-
- G—PHYSICS
- G05—CONTROLLING; REGULATING
- G05D—SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
- G05D1/00—Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
- G05D1/10—Simultaneous control of position or course in three dimensions
- G05D1/101—Simultaneous control of position or course in three dimensions specially adapted for aircraft
- G05D1/104—Simultaneous control of position or course in three dimensions specially adapted for aircraft involving a plurality of aircrafts, e.g. formation flying
Landscapes
- Engineering & Computer Science (AREA)
- Aviation & Aerospace Engineering (AREA)
- Radar, Positioning & Navigation (AREA)
- Remote Sensing (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Automation & Control Theory (AREA)
- Business, Economics & Management (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Computation (AREA)
- Game Theory and Decision Science (AREA)
- Medical Informatics (AREA)
- Control Of Position, Course, Altitude, Or Attitude Of Moving Bodies (AREA)
Abstract
The invention discloses a multi-aircraft autonomous decision-making method under scene fast-changing conditions, belonging to the technical field of aircraft. The method comprises the following steps: first, each aircraft carries a laser radar (lidar) for target detection and identifies static obstacles or other aircraft within its detection range from the returned three-dimensional point cloud data; next, an autonomous conflict resolution model is constructed from the aircraft's three-dimensional point cloud data; the model is then solved within a multi-agent reinforcement learning framework to obtain a reward function that scores the action selected for an input state; finally, a neural-network learning module performs centralized training and decentralized execution based on the reward function, computes with the converged neural network the values of all actions available in a given state, and solves for the multi-agent joint actions by combinatorial optimization. When scene information changes, the invention can perform inheritance training via transfer learning, and therefore transfers well to new scenes.
Description
Technical Field
The invention belongs to the technical field of aircraft, relates to a conflict resolution method, and particularly relates to a multi-aircraft autonomous decision-making method under scene fast-changing conditions.
Background
With the rapid development of aeronautical science and technology, low-altitude small aircraft are widely applied in complex, severe, high-risk operating environments for aerial surveillance, forest rescue, reconnaissance and exploration, military applications, and the like. The problems of path planning and conflict resolution in multi-aircraft autonomous decision-making have therefore attracted wide attention from scholars worldwide.
The most important characteristic of the actual low-altitude operating environment is that the scene is complex and highly dynamic, and dynamic threats with unknown motion characteristics may exist. In many practical tasks the agents' targets are generally not static but dynamic, whereas the regulation and control of existing aircraft mainly depend on pre-planned or predefined action sets, which can hardly adapt to future complex, dynamic scenes.
Multi-aircraft autonomous decision-making is a typical multi-agent cooperation problem. The agents are expected to be able to learn from the environment, that is, to automatically acquire knowledge, accumulate experience, continuously update and expand that knowledge, and improve their performance. Learning capability is the ability of an agent to update knowledge through experimentation, observation, and inference. Only through continuous learning, acquiring knowledge by continuously interacting with the environment, can an agent improve its adaptive ability.
Disclosure of Invention
In view of these problems, the invention provides a multi-aircraft autonomous decision-making method under scene fast-changing conditions that fully considers the dynamic nature of the scene and improves the learning capability of multiple aircraft.
The multi-aircraft autonomous decision-making method under scene fast-changing conditions comprises the following steps:
Step one: for a scene composed of N_U aircraft and N_T targets, each aircraft carries a lidar sensor and periodically emits radar echo signals within its detection range.

Each aircraft corresponds to one target, whose initial position is set randomly; N_U and N_T are equal.

The detection range is defined as follows: each aircraft is treated as a point mass, the maximum detection distance is the radius, the horizontal detection-range angle is θ_i, and the vertical detection-range angle is φ_i.
Step two: each aircraft identifies static obstacles or other aircraft within its detection range from the three-dimensional point cloud data returned by the radar echo signals.
When another aircraft is detected, the returned three-dimensional point cloud data are the three-dimensional coordinates and velocity direction of that aircraft; when a static obstacle is detected, the returned data are the boundary coordinates of the obstacle; if there is no obstacle, the returned data are 0.
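As an illustrative sketch only (the patent specifies no data format), the return convention above — aircraft yield coordinates plus a velocity direction, static obstacles yield boundary coordinates, nothing yields zeros — can be expressed as:

```python
import numpy as np

def parse_lidar_return(detection):
    """Interpret a lidar return per the convention described above.

    Illustrative names and shapes; the patent does not specify a format.
    - another aircraft  -> (x, y, z) position plus a velocity direction
    - static obstacle   -> boundary coordinates of the obstacle
    - nothing detected  -> all zeros
    """
    if detection is None:
        return np.zeros(6)                      # no obstacle: zero point cloud
    kind, data = detection
    if kind == "aircraft":
        pos, vel_dir = data
        return np.concatenate([pos, vel_dir])   # 3-D coords + velocity direction
    if kind == "static":
        return np.asarray(data, dtype=float).ravel()  # boundary coordinates
    raise ValueError(f"unknown detection kind: {kind}")
```

A downstream state encoder could consume these fixed-length vectors directly as the per-neighbor portion of the observation.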
Step three: for time slot t, an autonomous conflict resolution model is constructed from the three-dimensional point cloud data of the ith aircraft and the other aircraft.
the autonomous conflict resolution model aims at the shortest distance from each aircraft to the respective target point, and the objective function is as follows:
s.t.R1,R2,R3
diand the distance between the ith aircraft and the target point corresponding to the aircraft is represented.
The three constraints are as follows:
(1) R_1 denotes the return function for each aircraft reaching its target position, computed as R_1 = Σ_{i'=1}^{N_T} S_{i'}, i' ∈ {1, 2, …, N_T}, where S_{i'} judges the completion of target i': S_{i'} = −1 if the target is not completed, and S_{i'} = 0 once it is completed.
(2) R_2 denotes the return function requiring that no aircraft collide with a static obstacle: the path P_i of the ith aircraft must not intersect the mth static obstacle D_m, m ∈ [1, N_M], where N_M is the total number of static obstacles in the scene.
(3) R_3 denotes the return function requiring that no two aircraft collide, computed from p_t^i, the position coordinates of the ith aircraft at the current moment, and p_t^j, the position coordinates of the jth aircraft at the current moment.
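A minimal sketch of the three constraint return terms R_1, R_2, R_3 as described above. The −1-per-violation penalty magnitudes, the spherical obstacle model, and the function names are assumptions; the patent states only the conditions each term enforces.

```python
import numpy as np

def r1(targets_done):
    """R_1: sum of S_i' over targets (0 if target complete, -1 otherwise)."""
    return sum(0 if done else -1 for done in targets_done)

def r2(path, obstacles):
    """R_2: penalize any path point entering a static obstacle D_m
    (obstacles modeled as (center, radius) spheres for illustration)."""
    penalty = 0
    for p in path:
        for center, radius in obstacles:
            if np.linalg.norm(np.asarray(p, float) - np.asarray(center, float)) < radius:
                penalty -= 1
    return penalty

def r3(positions, min_sep):
    """R_3: penalize any pair of aircraft positions p_t^i, p_t^j closer
    than the minimum separation."""
    pts = np.asarray(positions, dtype=float)
    penalty = 0
    for i in range(len(pts)):
        for j in range(i + 1, len(pts)):
            if np.linalg.norm(pts[i] - pts[j]) < min_sep:
                penalty -= 1
    return penalty
```

All three return 0 when their constraint is satisfied, so a feasible joint plan leaves the objective unchanged.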
solving the autonomous conflict resolution model of the multiple aircrafts based on the multi-agent reinforcement learning framework to obtain a reward function for selecting actions according to the input state;
the reward functions include the following:
(1) The reward function r_a, set so that each aircraft's path from its initial position to its target is shortest.

First, r_a is initialized to 0.

Then, for the ith aircraft X_i in state s_t^i at time t, taking action a_t^i, the position p_t^i of X_i after executing the action is computed, along with its distance d(p_t^i, g_i) to the target position g_i.

Finally, the distances between current position and target position of all N_U aircraft after moving at time t are summed, and the reward function r_a is updated.

Hence the larger the accumulated distance sum, the worse the joint strategy; conversely, the smaller the sum, the better the joint strategy.
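The accumulation of r_a described above can be sketched as follows. The sign convention (negating the distance sum so a larger sum yields a lower reward) is an assumption consistent with the statement that a larger accumulated distance means a worse joint strategy:

```python
import numpy as np

def reward_ra(positions, goals):
    """r_a: negative of the summed current-position-to-target distances
    over all N_U aircraft at time t (sign convention assumed)."""
    ra = 0.0                                    # initial r_a = 0
    for p, g in zip(np.asarray(positions, float), np.asarray(goals, float)):
        ra -= np.linalg.norm(p - g)             # accumulate d(p_t^i, g_i)
    return ra
```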
(2) The reward function r_b, set for collision detection between aircraft and obstacles.

First, r_b is initialized to 0.

Then, for the ith aircraft X_i taking action a_t^i at time t, the position p_t^i of X_i after executing the action is computed, along with the distance d(p_t^i, p_m) between it and the position p_m of the mth static obstacle in the detection range.

Next, it is judged whether this distance is smaller than the minimum safe distance n_o between aircraft X_i and a static obstacle; if so, a penalty value is assigned, otherwise the penalty value is 0.

For the ith aircraft X_i at time t, the distances between X_i and all static obstacles in the detection range are judged against the minimum safe distance n_o, and the penalty values are summed.

The penalty sums of all N_U aircraft at time t are accumulated, and the reward function r_b is updated.

Hence the closer an aircraft comes to an obstacle, the smaller the joint return obtained by the overall multi-aircraft autonomous decision.
(3) The reward function r_c, set for collision detection between aircraft.

First, r_c is initialized to 0.

Then, for the ith aircraft X_i taking action a_t^i at time t, the position p_t^i of X_i after executing the action is computed, along with the distance d(p_t^i, p_t^j) between it and the current position p_t^j of the jth aircraft in the detection range.

Because observations of other aircraft are noisy, a delay of one time step is applied to them.

Next, this distance is judged against the aircraft collision distance n_c and the proximity risk distance n_m, where n_c < n_m: if d(p_t^i, p_t^j) < n_c, a large penalty value is assigned; if n_c ≤ d(p_t^i, p_t^j) < n_m, a smaller penalty value is assigned; otherwise the penalty value is 0.

For the ith aircraft X_i at time t, the distances between X_i and all other aircraft are judged against the collision distance n_c and the proximity risk distance n_m, and the penalty values are summed.

The penalty sums of all N_U aircraft at time t are accumulated, and the reward function r_c is updated.

Hence the closer an aircraft comes to other aircraft, the smaller the joint return obtained by the overall multi-aircraft autonomous decision.
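A sketch of the three-tier r_c penalty just described. The two penalty magnitudes are assumptions; only the ordering (collision punished harder than proximity risk, with n_c < n_m) comes from the text:

```python
def penalty_pair(dist, n_c, n_m, p_collide=-2.0, p_near=-1.0):
    """Three-tier pairwise penalty with n_c < n_m: collision (< n_c) is
    punished hardest, proximity risk (n_c <= dist < n_m) less, and a clear
    separation (>= n_m) not at all."""
    assert n_c < n_m
    if dist < n_c:
        return p_collide
    if dist < n_m:
        return p_near
    return 0.0

def reward_rc(pair_dists, n_c, n_m):
    """r_c: accumulate the pairwise penalties over all aircraft pairs at time t."""
    return sum(penalty_pair(d, n_c, n_m) for d in pair_dists)
```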
Step five: the neural-network learning module performs centralized training and decentralized execution based on the reward functions, computes with the converged neural network the values of all actions that can be taken in a given state, and solves for the multi-agent joint actions by combinatorial optimization.
The invention has the advantages that:
(1) The multi-aircraft autonomous decision-making method under scene fast-changing conditions has important practical significance: its research background is a low-altitude airspace that is complex and highly dynamic, with multiple elements whose operating characteristics are unknown, a complex coupling relation between the airspace environment and traffic objects, and complex, fast-changing tasks.

(2) The method not only fully considers the dynamic nature of the scene but also accounts for incomplete information and non-ideal communication, providing a way to guide the autonomous decision-making of aircraft.
Drawings
Fig. 1 is a schematic diagram of the detection range of a laser radar when an aircraft performs collision detection according to the present invention.
FIG. 2 is a diagram of a multi-agent reinforcement learning model according to the present invention.
Fig. 3 is a schematic view of the aircraft safety distance of the present invention.
Fig. 4 is a flowchart of a multi-aircraft autonomous decision method under a scene fast change condition according to the present invention.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings, so that those skilled in the art can understand and practice it.
The invention provides a multi-aircraft autonomous decision-making method under scene fast-changing conditions, aimed at complex, highly dynamic scenes with the following characteristics: (1) static and dynamic obstacles coexist in the scene, and a target may change dynamically during flight; (2) the perception range of a single unmanned aerial vehicle is limited, so global information cannot be obtained; (3) the unmanned aerial vehicles can communicate with each other to share local airspace information; (4) communication between the unmanned aerial vehicles suffers interference and random loss. The multi-aircraft autonomous decision is decomposed into two sub-problems: (1) path planning; (2) conflict resolution. Both path planning and conflict resolution have been proven to be NP-hard optimization problems, so heuristic algorithms are required. The autonomous decision of multiple aircraft can therefore be solved by division: solve path planning and conflict resolution separately, then combine the two sub-solutions into the final solution.
As shown in fig. 4, the aircraft autonomous decision method includes the following steps:
Step one: for a scene composed of N_U aircraft and N_T targets, each aircraft carries a lidar sensor, periodically performs collision detection within its detection range, and emits radar echo signals.

Flight conflict detection adopts non-cooperative threat detection based on a radar system, in which lidar plays an important role in autonomous navigation. The main performance parameters of a lidar are the laser wavelength, the detection distance, and the field of view (FOV), which divides into a horizontal FOV and a vertical FOV. The two most commonly used lidar wavelengths are 905 nm and 1550 nm; a 1550 nm sensor can operate at higher power and detect farther than a 905 nm one, but weighs more.

The invention sets up N_U aircraft and N_T targets, each aircraft corresponding to one target whose initial position is set randomly; N_U and N_T are equal. For the ith aircraft X_i at time t, the state is s_t^i and the action is a_t^i; the state is obtained from the three-dimensional point cloud data returned by the lidar sensor carried on the aircraft, which yields the position information of the static obstacles.

As shown in fig. 1, the detection range of the radar is: each aircraft is treated as a point mass, the maximum detection distance is the radius, the horizontal FOV angle is θ_i, and the vertical FOV angle is φ_i.
Step two: each aircraft identifies static obstacles or other aircraft within its detection range from the three-dimensional point cloud data returned by the radar echo signals.
the aircraft is regarded as a mass point, the aircraft sends radar echo signals within a detection range regularly, when other aircraft are detected, returned three-dimensional point cloud data are three-dimensional coordinates and speed directions of other aircraft, when static obstacles are detected, the returned three-dimensional point cloud data are boundary coordinates of the static obstacles, and if no obstacles exist, the returned three-dimensional point cloud data are 0.
Step three: for time slot t, an autonomous decision-making model is constructed from the three-dimensional point cloud data of the ith aircraft and the other aircraft.

The invention describes the design of the autonomous decision problem from three aspects: observation, action, and return function.
1) Observation s_t: at each time t, t = 1, 2, …, T, where T is the maximum time for the aircraft to reach its target. Because the agent in reinforcement learning makes control decisions from the collected current state and the aircraft reward value, an observation s_t is constructed first. The observation of the ith aircraft X_i at time t is denoted o_t^i, and the joint state of the multi-agent system composed of all aircraft is denoted x_t = (o_t^1, …, o_t^{N_U}).

At time slot t, the ith aircraft X_i takes action a_t^i; after the action is executed, the position p_t^i of X_i and its distance to the target position g_i are computed to judge whether the current task is finished.

At time t, after the ith aircraft X_i performs its action, the distance between its position p_t^i and the current position p_t^j of the jth aircraft in the detection range is used to judge whether a conflict between aircraft occurs; the observation of other aircraft is noisy and delayed by one time step.

At time t, after the ith aircraft X_i performs its action, the distance between its position p_t^i and the position p_m of the mth static obstacle in the detection range is used to judge whether a collision between aircraft and obstacle occurs.
2) Action a_t: from the perspective of the DRL mechanism, the movement of the aircraft is characterized as an action: an action causes a change in the environment, and the distance the aircraft moves determines its energy consumption. Reinforcement-learning actions are therefore represented by the flight-direction accelerations of the aircraft movement model.

ρ_j(t) ∈ [0, ρ_max] denotes the pitch-direction velocity received by the jth aircraft with time t as the starting time, where ρ_max is the maximum pitch-direction velocity.

The pitch-direction acceleration received by the jth aircraft with time t as the starting time is bounded between the minimum and the maximum pitch-direction acceleration.

The yaw-direction acceleration received by the jth aircraft with time t as the starting time is bounded analogously.

The set a_t contains 2 × N_U elements. After the agents receive the action a_t, each jth aircraft can be determined to hover at its current position or move to a new position, realizing control of the continuous movement of the aircraft.
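A hedged sketch of the joint action vector: the text states that a_t has 2 × N_U elements, a pitch-direction and a yaw-direction acceleration per aircraft, bounded by minimum and maximum accelerations. The concrete bound values and the function name are assumptions:

```python
import numpy as np

def clip_joint_action(a, acc_min, acc_max):
    """Joint action a_t of 2 * N_U entries: per aircraft one pitch-direction
    and one yaw-direction acceleration, each clipped to [acc_min, acc_max]
    (bounds stand in for the rho_min / rho_max style limits in the text)."""
    a = np.asarray(a, dtype=float).reshape(-1, 2)   # (N_U, 2): pitch, yaw
    return np.clip(a, acc_min, acc_max)
```

Reshaping to (N_U, 2) makes the per-aircraft structure explicit before the environment applies the accelerations.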
3) Return function r_t: the objective of the autonomous decision problem is that the distance from each aircraft to its corresponding target point is shortest, subject to three constraints (every target must be completed, and no aircraft may collide with an obstacle or with another aircraft). To design the return function, the invention treats the objective and the constraints of the autonomous risk-avoidance problem separately.

First, the optimization objective of the multi-aircraft autonomous decision is that each aircraft's path to its target is shortest; the objective function is min Σ_i d_i, where d_i denotes the distance between the ith aircraft and its corresponding target point.

In addition, three constraint conditions are designed; the following must be satisfied:

(1) All targets are completed:

R_1 denotes the return function for each aircraft reaching its target position, computed as R_1 = Σ_{i'=1}^{N_T} S_{i'}, i' ∈ {1, 2, …, N_T}, where S_{i'} judges the completion of target i': S_{i'} = −1 if the target is not completed, and S_{i'} = 0 once it is completed.

(2) No collision between aircraft and obstacles:

R_2 denotes the return function requiring that no aircraft collide with a static obstacle: the path P_i of the ith aircraft, i.e. the sequence of its flight position coordinates up to time T, must not intersect the mth static obstacle D_m, m ∈ [1, N_M], where N_M is the total number of static obstacles in the scene.

(3) No collision between aircraft:

R_3 denotes the return function requiring that no two aircraft collide, computed from p_t^i, the position coordinates of the ith aircraft at the current moment, and p_t^j, the position coordinates of the jth aircraft at the current moment.

The multi-aircraft autonomous decision problem thus becomes a combinatorial optimization problem: the autonomous conflict resolution model takes the shortest distance from each aircraft to its target point as the objective, with objective function

min Σ_{i=1}^{N_U} d_i,   s.t. R_1, R_2, R_3
solving the autonomous decision model of the multi-aircraft based on a multi-agent reinforcement learning (MADDPG) frame to obtain a reward function for selecting actions according to the input state;
the specific process is as follows:
1) establishing a multi-agent neural network
The state space and action space of each agent are abstracted to coincide exactly with those of the aircraft. The policy of each agent is determined by a parameter θ_i, the neural-network parameters of the ith aircraft; θ = {θ_1, …, θ_{N_U}} collects the parameters of all N_U aircraft, and the policy of agent i under parameters θ_i is μ_i. Assuming deterministic policies, the action of an agent is completely determined by its policy and the corresponding parameters: a_i = μ_i(o_i | θ_i).

Here a_i is the action of the ith aircraft; o_i is the observation of the ith aircraft, including the distances between the agent and obstacles, targets, and other agents; θ_i denotes the neural-network parameters of the ith aircraft.

J(θ_i) denotes the action-network objective function; E_{x,a~D} denotes the expectation over the random strategy sequence; x = (o_1, …, o_{N_U}) denotes the joint observation of the agents; Q_i^μ(x, a_1, …, a_{N_U}) denotes the Q-value function. D denotes the experience pool (experience replay buffer) in MADDPG and contains the tuples (x, a_1, …, a_{N_U}, r_1, …, r_{N_U}, x').

x' denotes the joint observation of the agents at the next moment, and r_{N_U} denotes the reward function of the N_U-th aircraft. The action-value function of the critic's network policy is realized entirely by a neural network, named the critic network, and is updated according to the objective

L(θ_i) = E[(Q_i^μ(x, a_1, …, a_{N_U}) − y_i)^2],   y_i = r_i + γ Q_i^{μ'}(x', a'_1, …, a'_{N_U})

where r_i is the reward function of the ith aircraft; γ ∈ (0, 1) is the attenuation factor; x' is the joint observation at the next moment; a'_j = μ'_j(o_j) is the action of the jth aircraft at the next moment, with μ'_j the next-moment policy of the jth aircraft and o_j its observation. The target networks Q_i^{μ'} and μ'_j have structures identical to Q_i^μ and μ_j, but their parameter updates lag behind, which gives the critic's action-value function better physical meaning to assist training of the action network. The action network is updated by the sampled policy gradient

∇_{θ_i} J ≈ (1/S) Σ ∇_{θ_i} μ_i(o_i) ∇_{a_i} Q_i^μ(x, a_1, …, a_{N_U}) |_{a_i = μ_i(o_i)}

where J denotes the action-network objective function and S is a small batch of samples drawn at random.
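The critic's TD target y_i = r_i + γ Q'_i(x', a'_1, …, a'_{N_U}) and the lagged target-network parameter update described above can be sketched as follows; the γ and τ defaults are assumed values, not from the patent:

```python
import numpy as np

def critic_td_target(rewards, q_next, gamma=0.95, done=None):
    """TD target y_i = r_i + gamma * Q'_i(x', a'_1..a'_N) for updating each
    centralized critic; gamma in (0, 1) is the attenuation factor."""
    rewards = np.asarray(rewards, dtype=float)
    q_next = np.asarray(q_next, dtype=float)
    mask = 0.0 if done else 1.0                 # no bootstrap past the end
    return rewards + gamma * mask * q_next

def soft_update(target_params, online_params, tau=0.01):
    """Polyak averaging: target critic/actor parameters lag the online ones,
    matching the delayed parameter update noted in the text."""
    return [(1 - tau) * t + tau * o for t, o in zip(target_params, online_params)]
```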
The model of the entire design is shown in fig. 2.
2) Reward function design
To satisfy the constraint conditions, the reward function must be designed for MADDPG; as shown in fig. 3, the reward functions include the following:
(1) The cumulative reward function r_a, set so that each aircraft's path from its initial position to its target is shortest.

First, r_a is initialized to 0.

Then, for the ith aircraft X_i in state s_t^i at time t, taking action a_t^i, the position p_t^i of X_i after executing the action is computed, along with its distance d(p_t^i, g_i) to the target position g_i.

Finally, the distances between current position and target position of all N_U aircraft after moving at time t are summed, and the reward function r_a is updated.

Hence the larger the accumulated distance sum, the worse the joint strategy; conversely, the smaller the sum, the better the joint strategy.
(2) The reward function r_b, set for collision detection between aircraft and obstacles.

To guarantee that aircraft and obstacles do not collide, collision detection is required. First, r_b is initialized to 0.

Then, for the ith aircraft X_i taking action a_t^i at time t, the position p_t^i of X_i after executing the action is computed, along with the distance d(p_t^i, p_m) between it and the position p_m of the mth static obstacle in the detection range.

Next, it is judged whether this distance is smaller than the minimum safe distance n_o between aircraft X_i and a static obstacle; if so, a penalty value is assigned, otherwise the penalty value is 0.

For the ith aircraft X_i at time t, the distances between X_i and all static obstacles in the detection range are judged against the minimum safe distance n_o, and the penalty values are summed.

The penalty sums of all N_U aircraft at time t are accumulated, and the reward function r_b is updated.

Hence the closer an aircraft comes to an obstacle, the smaller the joint return obtained by the overall multi-aircraft autonomous decision.
(3) The reward function r_c, set for collision detection between aircraft.

To guarantee that no collision occurs between aircraft, collision detection is required. First, r_c is initialized to 0.

Then, for the ith aircraft X_i taking action a_t^i at time t, the position p_t^i of X_i after executing the action is computed, along with the distance d(p_t^i, p_t^j) between it and the current position p_t^j of the jth aircraft in the detection range.

Because observations of other aircraft are noisy, a delay of one time step is applied to them.

Next, this distance is judged against the aircraft collision distance n_c and the proximity risk distance n_m, where n_c < n_m: if d(p_t^i, p_t^j) < n_c, a large penalty value is assigned; if n_c ≤ d(p_t^i, p_t^j) < n_m, a smaller penalty value is assigned; otherwise the penalty value is 0.

For the ith aircraft X_i at time t, the distances between X_i and all other aircraft are judged against the collision distance n_c and the proximity risk distance n_m, and the penalty values are summed.

The penalty sums of all N_U aircraft at time t are accumulated, and the reward function r_c is updated.

Hence the closer an aircraft comes to other aircraft, the smaller the joint return obtained by the overall multi-aircraft autonomous decision.
Step five: the neural-network learning module performs centralized training and decentralized execution based on the reward functions, computes with the converged neural network the values of all actions that can be taken in a given state, and solves for the multi-agent joint actions by combinatorial optimization.
Each agent contains an action network (actor network) and a critic network. The critic of each agent can acquire the action information of all other agents, enabling centralized training and decentralized execution: during training, a global critic with full observation is introduced to guide actor training; during testing, only the actor, with its local observation, is used to take actions.
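The centralized-training / decentralized-execution split can be sketched as follows; the actor callables and the concatenation layout of the critic input are illustrative assumptions:

```python
import numpy as np

def act_decentralized(actors, local_obs):
    """Execution phase: each agent picks its action from its own actor and its
    LOCAL observation only -- no global critic is involved."""
    return [actor(o) for actor, o in zip(actors, local_obs)]

def critic_input(joint_obs, joint_actions):
    """Training phase: the centralized critic sees the joint observation x and
    the actions of ALL agents, concatenated into one flat vector."""
    return np.concatenate([np.ravel(joint_obs), np.ravel(joint_actions)])
```

At test time only `act_decentralized` runs on board each aircraft; `critic_input` exists only in the training loop.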
Claims (4)
1. A multi-aircraft autonomous decision method under a scene fast-changing condition is characterized by comprising the following steps:
step one, for a scene composed of N_U aircraft and N_T targets, each aircraft carries a lidar sensor and periodically emits radar echo signals within its detection range;
step two, each aircraft identifies static obstacles or other aircraft within its detection range from the three-dimensional point cloud data returned by the radar echo signals;
step three, aiming at the time slot t, constructing an autonomous conflict resolution model by the three-dimensional point cloud data of the ith aircraft and other aircraft;
the autonomous conflict resolution model takes as its goal the shortest distance from each aircraft to its respective target point, with the objective function:
min Σ_{i=1}^{NU} d_i
s.t. R1, R2, R3
where d_i represents the distance between the ith aircraft and its corresponding target point;
the three constraints are as follows:
(1) R1 represents the reward function for each aircraft reaching its respective target position; its calculation uses the target-completion flag S_i′, i′ ∈ {1, 2, …, NT}, where S_i′ = −1 if target i′ has not been completed, and S_i′ = 0 once the target is completed;
(2) R2 represents the return function requiring that no aircraft collides with any static obstacle; in its calculation, P_i is the path of the ith aircraft and D_m denotes the mth static obstacle, m ∈ [1, NM], where NM is the total number of static obstacles in the scene;
(3) R3 represents the return function requiring that no collision occurs between any two aircraft; in its calculation, p_i^t denotes the position coordinates of the ith aircraft at the current moment, and p_j^t denotes the position coordinates of the jth aircraft at the current moment;
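As an illustrative sketch (not the patent's formulas, whose equation images are elided here), the objective and the three constraints R1, R2 and R3 can be checked as follows; all function and variable names, and the Euclidean-distance form, are assumptions:

```python
import numpy as np

def objective(positions, targets):
    """Sum of distances d_i from each aircraft to its target (to be minimized)."""
    return sum(np.linalg.norm(p - t) for p, t in zip(positions, targets))

def r1_target_completion(done_flags):
    """R1 sketch: S_i' = 0 if target i' is completed, -1 otherwise, summed."""
    return sum(0 if done else -1 for done in done_flags)

def r2_no_obstacle_collision(positions, obstacles, safe_dist):
    """R2 sketch: every aircraft keeps clear of every static obstacle D_m."""
    return all(np.linalg.norm(p - o) >= safe_dist
               for p in positions for o in obstacles)

def r3_no_mutual_collision(positions, collision_dist):
    """R3 sketch: every pair of aircraft keeps a minimum separation."""
    n = len(positions)
    return all(np.linalg.norm(positions[i] - positions[j]) >= collision_dist
               for i in range(n) for j in range(i + 1, n))
```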
step four, solving the multi-aircraft autonomous conflict resolution model based on a multi-agent reinforcement learning framework to obtain the reward functions for selecting actions according to the input state;
the reward functions include the following:
(1) the reward function ra set for the shortest path between each aircraft and the initial position of its respective target;
First, set the initial value ra = 0;
Then, for the ith aircraft Xi at time t, the state is s_i^t and the action is a_i^t; after executing the action a_i^t, the distance between the current position p_i^t of aircraft Xi and the target position is calculated as the Euclidean distance between the two positions, denoted d_i^t;
Finally, the sum of the distances between the current positions and the target positions of all NU aircraft after their selected actions at time t is accumulated, and the reward function ra is updated;
therefore, the larger the accumulated sum of distances over the aircraft, the worse the joint strategy; conversely, the smaller the sum, the better the joint strategy;
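The update of ra described above can be sketched as follows; the Euclidean-distance form and the negative-sum convention (larger accumulated distance means lower reward, matching the claim's reasoning) are assumptions, since the exact elided formula is unavailable:

```python
import numpy as np

def reward_shortest_path(next_positions, target_positions):
    """ra sketch: accumulate every aircraft's post-action distance to its
    target and return the negative sum, so a larger accumulated distance
    corresponds to a worse joint strategy."""
    total = 0.0
    for p, p_target in zip(next_positions, target_positions):
        total += np.linalg.norm(p - p_target)   # d_i^t after executing a_i^t
    return -total                               # smaller distance sum => larger reward
```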
(2) the reward function rb set for collision detection between aircraft and obstacles;
First, set the initial value rb = 0;
Then, for the ith aircraft Xi, according to the action a_i^t at time t, the distance between the current position p_i^t after executing the action and the position p_m of the mth static obstacle within the detection range is calculated as the Euclidean distance, denoted d_im^t;
Further, it is determined whether the distance d_im^t is less than the minimum safe distance no between aircraft Xi and a static obstacle; if so, a penalty value is set; otherwise, the penalty is set to zero;
For the ith aircraft Xi at time t, the distances between Xi and all static obstacles within the detection range are each compared against the minimum safe distance no, and the resulting penalty values are summed;
The penalty sums of all NU aircraft at time t are accumulated, and the reward function rb is updated:
Therefore, the closer an aircraft comes to an obstacle, the smaller the joint reward obtained by the overall multi-aircraft autonomous decision;
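A minimal sketch of the obstacle-collision reward rb follows; the penalty magnitude is an assumed placeholder, since the claim leaves the exact value to an elided formula:

```python
import numpy as np

def reward_obstacle_avoidance(next_positions, obstacle_positions, n_o, penalty=-1.0):
    """rb sketch: for each aircraft, penalize every static obstacle within
    detection range that is closer than the minimum safe distance n_o;
    contribute zero otherwise."""
    r_b = 0.0
    for p in next_positions:                    # each aircraft's post-action position
        for p_m in obstacle_positions:          # each obstacle in detection range
            d = np.linalg.norm(p - p_m)         # d_im^t
            if d < n_o:
                r_b += penalty                  # unsafe: apply penalty
    return r_b
```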
(3) the reward function rc set for collision detection between aircraft;
First, set the initial value rc = 0;
Then, for the ith aircraft Xi, according to the action a_i^t at time t, the distance between the current position p_i^t after executing the action and the current position p_j^t of the jth aircraft within the detection range is calculated as the Euclidean distance, denoted d_ij^t;
where the observations of other aircraft are noisy and delayed by a time step;
Further, it is determined whether the distance d_ij^t is less than the collision distance nc of the aircraft or the proximity-risk distance nm, where nc < nm; if d_ij^t < nc, a large penalty value is set; otherwise, if nc ≤ d_ij^t < nm, a smaller penalty value is set; and if d_ij^t ≥ nm, the penalty is set to zero;
For the ith aircraft Xi at time t, the distances between Xi and all other aircraft are each compared against the collision distance nc and the proximity-risk distance nm, and the resulting penalty values are summed;
The penalty sums of all NU aircraft at time t are accumulated, and the reward function rc is updated:
Therefore, the closer an aircraft comes to other aircraft, the smaller the joint reward obtained by the overall multi-aircraft autonomous decision;
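A minimal sketch of the tiered inter-aircraft reward rc follows, with assumed placeholder penalty magnitudes for the two zones (the claim's exact values are in elided formulas):

```python
import numpy as np

def reward_separation(positions, n_c, n_m, p_collide=-10.0, p_near=-1.0):
    """rc sketch: tiered penalty on pairwise separation, with n_c < n_m.
    d <  n_c        -> large penalty (collision distance violated)
    n_c <= d < n_m  -> smaller penalty (proximity-risk zone)
    d >= n_m        -> no penalty
    """
    assert n_c < n_m
    r_c = 0.0
    n = len(positions)
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            d = np.linalg.norm(positions[i] - positions[j])  # d_ij^t
            if d < n_c:
                r_c += p_collide
            elif d < n_m:
                r_c += p_near
    return r_c
```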
and step five, the neural network learning module performs centralized training and decentralized execution based on the reward functions, calculates all action values that can be taken in a given state through the converged neural network, and solves for the multi-agent behavior actions by joint optimization.
2. The multi-aircraft autonomous decision-making method under scene fast-changing conditions according to claim 1, wherein in step one, each aircraft corresponds to one target, and the initial value of the target is set randomly.
3. The multi-aircraft autonomous decision-making method under scene fast-changing conditions according to claim 1, wherein in step two, when another aircraft is detected, the returned three-dimensional point cloud data are the three-dimensional coordinates and velocity direction of that aircraft; when a static obstacle is detected, the returned three-dimensional point cloud data are the boundary coordinates of the static obstacle; and if no obstacle is present, the returned three-dimensional point cloud data are 0.
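The three detection-return cases of claim 3 can be sketched as a small data structure; the field names and the container type are assumptions, and only the three cases themselves come from the claim:

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Detection:
    kind: str                                    # "aircraft", "obstacle", or "none"
    coords: Optional[List[float]] = None         # 3-D coordinates / boundary coordinates
    velocity_dir: Optional[List[float]] = None   # only for a detected aircraft

def parse_return(kind, coords=None, velocity_dir=None):
    if kind == "aircraft":   # other aircraft: 3-D coordinates and velocity direction
        return Detection("aircraft", coords, velocity_dir)
    if kind == "obstacle":   # static obstacle: boundary coordinates
        return Detection("obstacle", coords)
    # nothing detected: the claim says the returned point cloud data are 0
    return Detection("none", [0.0, 0.0, 0.0])
```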
4. The multi-aircraft autonomous decision-making method under scene fast-changing conditions according to claim 1, wherein in step five, each agent comprises an action network (Actor Network) and a critic network (Critic Network); the critic part of each agent can obtain the action information of all other agents, and centralized training with decentralized execution is performed, that is, during training, actor training is guided by introducing a critic that observes the global state, and during testing, actions are taken using only the actor with local observations.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010575719.3A CN111897316B (en) | 2020-06-22 | 2020-06-22 | Multi-aircraft autonomous decision-making method under scene fast-changing condition |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111897316A (en) | 2020-11-06
CN111897316B (en) | 2021-05-14
Family
ID=73207769
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010575719.3A Active CN111897316B (en) | 2020-06-22 | 2020-06-22 | Multi-aircraft autonomous decision-making method under scene fast-changing condition |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111897316B (en) |
Families Citing this family (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11907335B2 (en) * | 2020-10-16 | 2024-02-20 | Cognitive Space | System and method for facilitating autonomous target selection |
CN112462804B (en) * | 2020-12-24 | 2022-05-10 | 四川大学 | Unmanned aerial vehicle perception and avoidance strategy based on ADS-B and ant colony algorithm |
CN114679757B (en) * | 2020-12-26 | 2023-11-03 | 中国航天科工飞航技术研究院(中国航天海鹰机电技术研究院) | Cross-zone switching method and device for ultra-high-speed low-vacuum pipeline aircraft |
CN112633415B (en) * | 2021-01-11 | 2023-05-19 | 中国人民解放军国防科技大学 | Unmanned aerial vehicle cluster intelligent task execution method and device based on rule constraint training |
CN113705921B (en) * | 2021-09-03 | 2024-02-27 | 厦门闽江智慧科技有限公司 | Electric vehicle dynamic path planning optimization method based on hybrid charging strategy |
CN114115309B (en) * | 2021-11-24 | 2024-09-06 | 西北工业大学 | Planetary flight obstacle avoidance guidance method based on ARS reinforcement learning algorithm |
CN114115350B (en) * | 2021-12-02 | 2024-05-10 | 清华大学 | Aircraft control method, device and equipment |
CN114237235B (en) * | 2021-12-02 | 2024-01-19 | 之江实验室 | Mobile robot obstacle avoidance method based on deep reinforcement learning |
CN114237293B (en) * | 2021-12-16 | 2023-08-25 | 中国人民解放军海军航空大学 | Deep reinforcement learning formation transformation method and system based on dynamic target allocation |
CN113962031B (en) * | 2021-12-20 | 2022-03-29 | 北京航空航天大学 | Heterogeneous platform conflict resolution method based on graph neural network reinforcement learning |
CN117177275B (en) * | 2023-11-03 | 2024-01-30 | 中国人民解放军国防科技大学 | SCMA-MEC-based Internet of things equipment calculation rate optimization method |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109116854A (en) * | 2018-09-16 | 2019-01-01 | A multi-group robot cooperative control method and control system based on reinforcement learning
CN109725532A (en) * | 2018-12-24 | 2019-05-07 | A relative distance control and adaptive correction method applied between multiple agents
CN109992000A (en) * | 2019-04-04 | 2019-07-09 | A multi-UAV collaborative path planning method and device based on hierarchical reinforcement learning
WO2019234702A2 (en) * | 2018-06-08 | 2019-12-12 | Tata Consultancy Services Limited | Actor model based architecture for multi robot systems and optimized task scheduling method thereof |
CN110991545A (en) * | 2019-12-10 | 2020-04-10 | 中国人民解放军军事科学院国防科技创新研究院 | Multi-agent confrontation oriented reinforcement learning training optimization method and device |
CN111045445A (en) * | 2019-10-23 | 2020-04-21 | 浩亚信息科技有限公司 | Aircraft intelligent collision avoidance method, equipment and medium based on reinforcement learning |
CN111103881A (en) * | 2019-12-25 | 2020-05-05 | 北方工业大学 | Multi-agent formation anti-collision control method and system |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11533593B2 (en) * | 2018-05-01 | 2022-12-20 | New York University | System method and computer-accessible medium for blockchain-based distributed ledger for analyzing and tracking environmental targets |
Non-Patent Citations (2)
Title |
---|
A Satisficing Conflict Resolution Approach for Multiple UAVs; Yumeng Li et al.; IEEE Internet of Things Journal; 2019-04; Vol. 6, No. 2; full text *
Multi-UAV mission planning method with trajectory prediction; Qi Naiming et al.; Journal of Harbin Institute of Technology; 2016-04; Vol. 48, No. 4; full text *
Also Published As
Publication number | Publication date |
---|---|
CN111897316A (en) | 2020-11-06 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111897316B (en) | Multi-aircraft autonomous decision-making method under scene fast-changing condition | |
CN110456823B (en) | Double-layer path planning method aiming at unmanned aerial vehicle calculation and storage capacity limitation | |
Tisdale et al. | Autonomous UAV path planning and estimation | |
CN113848974B (en) | Aircraft trajectory planning method and system based on deep reinforcement learning | |
CN110703804A (en) | Layering anti-collision control method for fixed-wing unmanned aerial vehicle cluster | |
CN112304314B (en) | Navigation method of distributed multi-robot | |
Li et al. | Autonomous maneuver decision-making for a UCAV in short-range aerial combat based on an MS-DDQN algorithm | |
CN111811511A (en) | Unmanned aerial vehicle cluster real-time track generation method based on dimension reduction decoupling mechanism | |
CN113900449B (en) | Multi-unmanned aerial vehicle track planning method and device, unmanned aerial vehicle and storage medium | |
CN111880574B (en) | Unmanned aerial vehicle collision avoidance method and system | |
Chen et al. | Path planning and cooperative control for multiple UAVs based on consistency theory and Voronoi diagram | |
Bodi et al. | Reinforcement learning based UAV formation control in GPS-denied environment | |
CN114679729B (en) | Unmanned aerial vehicle cooperative multi-target detection method integrating radar communication | |
CN110825112B (en) | Oil field dynamic invasion target tracking system and method based on multiple unmanned aerial vehicles | |
CN116679751A (en) | Multi-aircraft collaborative search method considering flight constraint | |
Huang et al. | Cooperative collision avoidance method for multi-uav based on kalman filter and model predictive control | |
CN114138022A (en) | Distributed formation control method for unmanned aerial vehicle cluster based on elite pigeon swarm intelligence | |
CN117170238B (en) | Heterogeneous unmanned aerial vehicle cluster search algorithm based on collaborative distributed MPC | |
Duoxiu et al. | Proximal policy optimization for multi-rotor uav autonomous guidance, tracking and obstacle avoidance | |
CN116822362B (en) | Unmanned aerial vehicle conflict-free four-dimensional flight path planning method based on particle swarm optimization | |
CN113110593A (en) | Flight formation cooperative self-adaptive control method based on virtual structure and estimation information transmission | |
Yan et al. | Collaborative path planning based on MAXQ hierarchical reinforcement learning for manned/unmanned aerial vehicles | |
Zhang et al. | Survey of safety management approaches to unmanned aerial vehicles and enabling technologies | |
Lu et al. | Dual redundant UAV path planning and mission analysis based on Dubins curves | |
Sahawneh et al. | Path planning in the local-level frame for small unmanned aircraft systems |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||