CN113268078A

CN113268078A - Target tracking and trapping method for self-adaptive environment of unmanned aerial vehicle group

Info

Publication number: CN113268078A
Application number: CN202110423332.0A
Authority: CN
Inventors: 宁芊; 杨川力; 周新志; 陈炳才; 黄霖宇
Original assignee: Sichuan University
Current assignee: Sichuan University
Priority date: 2021-04-20
Filing date: 2021-04-20
Publication date: 2021-08-17
Anticipated expiration: 2041-04-20
Also published as: CN113268078B

Abstract

The invention discloses a target tracking and trapping method of an unmanned aerial vehicle group self-adaptive environment, which comprises the following steps: (1) establishing a multi-agent collaborative planning model by using an MADDPG algorithm to realize the tracking and trapping of the unmanned aerial vehicle group to the target; (2) when the unmanned aerial vehicle cluster approaches to the threat area, the GA algorithm is utilized to automatically adjust and re-plan the position of the unmanned aerial vehicle so as to avoid entering the threat area, the survival rate of the unmanned aerial vehicle is improved, and meanwhile, the enclosure task is completed. The establishment of the layered trapping model is divided into two layers: a trapping layer and a multi-agent training layer. The unmanned aerial vehicle cluster interacts with the environment in real time, so that the current environment state can be obtained at any time. And the trapping layer judges whether to adjust the formation from the current state and calculates a trapping position distribution scheme. Aiming at the dynamic change of the environment and the task, the invention improves the success rate of task execution in the relatively complex environment with threat, and simultaneously autonomously changes the enclosure position to avoid the threat area, thereby reducing the risk of the unmanned aerial vehicle cluster.

Description

Target tracking and trapping method for self-adaptive environment of unmanned aerial vehicle group

Technical Field

The invention relates to the technical field of unmanned aerial vehicle cluster task planning, in particular to a target tracking and trapping method of an unmanned aerial vehicle cluster self-adaptive environment.

Background

Unmanned aerial vehicle has zero casualties, continuous operation, low cost and outstanding mobility's unique advantage such as for unmanned aerial vehicle cluster battle becomes research focus in recent years. In the aspect of unmanned aerial vehicle cluster command control cooperative task decision, colony intelligent algorithms based on ant colony algorithm, wolf colony algorithm and the like are mostly adopted.

However, the group intelligent algorithm cannot meet the characteristics of autonomy and autonomy of unmanned aerial vehicle clusters, and intensive learning is widely concerned and applied in recent years. However, the existing unmanned aerial vehicle based on reinforcement learning is simpler in target trapping problem scene setting, threat factors are less considered, and the trapping mode is not flexible enough.

Disclosure of Invention

In order to solve the technical problems, the invention provides a target tracking and trapping method of an unmanned aerial vehicle group self-adaptive environment, aiming at the dynamic change of the environment and tasks, the success rate of task execution is improved in a relatively complex environment with threat, and meanwhile, the trapping position is autonomously changed to avoid a threat area, so that the risk of the unmanned aerial vehicle group is reduced.

The technical purpose of the invention is realized by the following technical scheme:

a target tracking and enclosing method of an unmanned aerial vehicle cluster self-adaptive environment comprises the following steps:

s1: constructing an enclosure layer and a multi-agent training layer;

s2: determining the current environment state of the unmanned aerial vehicle through the real-time interaction between the unmanned aerial vehicle cluster and the environment;

s3: judging whether the ideal enclosure position of the unmanned aerial vehicle cluster is in a threat area, if so, executing S4, otherwise, executing S5;

s4: calculating a new capture position from the newly calculated capture position through the capture layer, and then calculating to obtain a capture position distribution scheme consumed by the shortest voyage;

s5: through the multi-agent training layer, a multi-agent cooperation model is constructed, and tracking and trapping of the unmanned aerial vehicle group on the target are achieved.

As a preferred scheme, the S4 process specifically includes the following steps:

s401: judging an overlapping area of an enclosure position around the enclosure target and a threat area;

s402: and selecting a set of non-overlapping regions theta ═ theta₁，θ₂Maximum value of θ_maxThe average value of the angles of the non-overlapping areas of the unmanned planes is

S403: determination of theta_maxAnd

the relationship of (1);

s404: if it is not

Then order

And repeating the step S403;

s405: if it is not

Then at the selected theta_maxThe location of the entrapment is calculated in part according to the following formula:

wherein: t is_iIs the enclosure position of unmanned plane i, (c)_x，c_y) Is the position coordinate of the object to be enclosed, r is the enclosing radius, theta_aIs the average range of capture, theta, of each unmanned aerial vehicle_tIs the extent of overlap of the standard formation with the threat without taking into account the avoidance of the threat zone, θ_sIs the starting angle;

s406: calculating all allocation schemes according to the arrangement method;

s407: and calculating the route consumption of each scheme, and selecting the scheme with the minimum route consumption as the optimal distribution scheme.

As a preferred scheme, the S5 process specifically includes the following steps:

let a scene include N drones, each drone having a policy parameter θ ═ θ₁，θ₂,...θ_NUInstruction of

For the strategy set of all drones, the strategy gradient of the unmanned plane i is obtained as follows:

wherein:

is a function of Q value, a_iIs the action of unmanned aerial vehicle i, o_iIs the observation information of the unmanned aerial vehicle, including the position and speed information of the unmanned aerial vehicle relative to the target, x ═ o₁，...o_N) Representing observation information for N drones.

As a preferable scheme, in the S5 process, the reward sparseness problem is solved through a guiding reward function, which specifically includes the following steps:

let D denote an experience pool, which is used to store (x, x', a)₁，...a_N，r₁，...，r_N) Recording the experience of all the drones, x' being the new state after all the drones have performed the action, r_iIs a reward for interaction with the environment after the drone i performs an action.

Preferably, the Critic network is used in S5

By a loss function

To update;

wherein

As a preferred scheme, in the process of S5, the minimum policy gradient formula of the Critic network is:

to update the information of the content,

wherein: s is a small batch of random samples and j is the index of the samples.

As a preferred scheme, the Actor network and the Critic network both adopt 4-layer fully-connected artificial neural networks, the number of neurons in each layer of the Actor network is [64,64,64,2], the input of the last layer is a 2-dimensional vector corresponding to the acceleration of the unmanned aerial vehicle on the x axis and the y axis; the number of neurons in the Critic network coal seam is [64,64,64,1], and the output of the last layer is the evaluation of the action.

As a preferred scheme, the Reward structure of each unmanned aerial vehicle is composed of three parts₁+r₂+r₃，r₁Reward, r, representing the distance of the drone from its location of enclosure₂Penalty, r, representing collision of unmanned aerial vehicle with threat zone₃Represents the penalty of collision of the drone with other drones.

In conclusion, the invention has the following beneficial effects:

the target tracking and enclosing method of the self-adaptive environment of the unmanned aerial vehicle cluster provided by the invention is combined with the MADDPG (Multi-agent Deep Deterministic Policy Gradient) and the GA Algorithm (Greedy Algorithm) to design the self-adaptive Algorithm of the MADDPG-GA, the fusion Algorithm can accelerate the learning efficiency, improve the rapidity of target tracking and enclosing of the unmanned aerial vehicle cluster in the complex environment, optimize the formation of the unmanned aerial vehicle cluster to reduce the probability of the unmanned aerial vehicle falling into the threat area, improve the enclosing success rate and reduce the cluster risk;

the unmanned aerial vehicle group trained by the unmanned aerial vehicle group self-adaptive environment target tracking and trapping method provided by the invention can rapidly trap targets in a complex environment, so that the defects of slow training and low learning rate of the traditional unmanned aerial vehicle based on reinforcement learning are overcome, and the defect that the traditional fixed trapping mode is easy to fall into a threat area and cannot complete trapping is avoided.

Drawings

FIG. 1 is a diagram of a layered containment model in an embodiment of the invention;

FIG. 2 is a schematic diagram of an unmanned aerial vehicle fleet for capturing targets proximate to a threat zone in an embodiment of the present invention;

fig. 3 is a schematic diagram of unmanned aerial vehicle enclosure location allocation in the embodiment of the present invention;

FIG. 4 is a block diagram of a MADDPG in an embodiment of the present invention;

fig. 5 is a schematic diagram of an Actor network in an embodiment of the present invention.

Detailed Description

This specification and claims do not intend to distinguish between components that differ in name but not function. In the following description and in the claims, the terms "include" and "comprise" are used in an open-ended fashion, and thus should be interpreted to mean "include, but not limited to. "substantially" means within an acceptable error range, and a person skilled in the art can solve the technical problem within a certain error range to substantially achieve the technical effect.

The terms in upper, lower, left, right and the like in the description and the claims are combined with the drawings to facilitate further explanation, so that the application is more convenient to understand and is not limited to the application.

The present invention will be described in further detail with reference to the accompanying drawings.

The invention designs a target tracking and trapping method of an unmanned aerial vehicle group self-adaptive environment, which mainly comprises the following steps: (1) establishing a multi-agent collaborative planning model by using an MADDPG algorithm to realize the tracking and trapping of the unmanned aerial vehicle group to the target; (2) when the unmanned aerial vehicle cluster approaches to the threat area, the GA algorithm is utilized to automatically adjust and re-plan the position of the unmanned aerial vehicle so as to avoid entering the threat area, the survival rate of the unmanned aerial vehicle is improved, and meanwhile, the enclosure task is completed. Establishing a layered trapping model as shown in figure 1, which is divided into two layers: a trapping layer and a multi-agent training layer. The unmanned aerial vehicle cluster interacts with the environment in real time, so that the current environment state can be obtained at any time. And the layer of enclosure judges whether to adjust the formation from the current state, and when the ideal enclosure position falls into a threat area, the layer calculates a new enclosure position by using a GA algorithm, and then calculates an enclosure position distribution scheme consumed by the shortest flight. After the pursuit position is determined, each unmanned aerial vehicle is trained in turn on a multi-agent system training layer, and the trained unmanned aerial vehicle cluster can execute a capture strategy to interact with the environment. The feasibility of the scheme is verified by simulation experiments.

1. Trapping layer

Determining the encirclement position according to the target position by using a greedy algorithm as shown in FIG. 2 comprises the following specific steps: if the surrounding positions around the target are overlapped with the environment threat area, selecting a non-overlapped area set theta as { theta ═ theta₁，θ₂Maximum value of θ_maxThe average value of the angles of the non-overlapping areas of the unmanned planes is

If it is

Then order

If it is not

in the formula: t is_iIs the enclosure position of unmanned plane i, (c)_x，c_y) Is the position coordinate of the object to be enclosed, r is the enclosing radius, theta_aIs the average range of capture, theta, of each unmanned aerial vehicle_tIs the extent of overlap of the standard formation with the threat without taking into account the avoidance of the threat zone, θ_sIs the starting angle.

After the encirclement position is determined, the encirclement position is distributed according to the minimum total route consumption and the constraint relation between the unmanned aerial vehicle and the encirclement position as shown in figure 3, and the specific steps are as follows: calculating all allocation schemes according to the arrangement method; and further calculating the route consumption of each scheme, and selecting the scheme with the minimum route consumption as the optimal distribution scheme.

2. Multi-agent training layer

As shown in FIG. 4, the multi-agent system training layer based on the MADDPG framework has the characteristics of centralized training and distributed application. A plurality of strategies are learned for each intelligent agent (unmanned aerial vehicle), and the overall effect of all the strategies is utilized for optimization during improvement, so that the stability and robustness of the algorithm are improved. Each agent has an Actor and a Critic network for training and learning, and the specific principle is as follows: considering that there are N agents in a scene, the policy parameter of each drone is θ ═ θ₁，θ₂,...θ_NUInstruction of

For the set of policies for all drones, it can be derived that the policy gradient for drone i is:

is a function of Q value, a_iIs the action of unmanned aerial vehicle i, o_iIs the observation information of the unmanned aerial vehicle, including the position and speed information of the unmanned aerial vehicle relative to the target, x ═ o₁，...o_N) Representing observation information of N drones for each agent

Are independent, so the Reward structure of each agent is arbitrary.

In the invention, a guiding Reward function is designed to solve the Reward Sparse problem (Sparse Reward). D represents the experience pool used to store (x, x', a)₁，...a_N，r₁，...，r_N) The experience of all agents is recorded, x' is the new state after all agents have performed the action, r_iIs a reward for the agent i to interact with the environment after performing an action. Of Critic networks

By a loss function

Is updated, wherein

The Actor network is formed by the minimum policy gradient formula:

to update the information of the content,

Actor and Critic network architecture

The invention relates to an Actor and Critic network structure of an intelligent agent, namely an unmanned aerial vehicle, which adopts a 4-layer full-connection artificial neural network, wherein the number of neurons of each layer of the Actor is (64, 64,64, 2)]As in fig. 5, the input to the last layer is a 2-dimensional vector, corresponding to the accelerations of the drone in the x and y axes. The number of neurons in each layer of the Critic network is [64,64,64,1]]The output of the last layer is an evaluation of the action. The Reward structure of each unmanned aerial vehicle consists of three parts r₁+r₂+r₃，r₁The reward that the unmanned aerial vehicle is far away from the target position of the surrounding capture is represented, and the closer the distance is, the greater the reward is; r is₂The penalty of collision between the unmanned aerial vehicle and the threat zone is represented and is related to a penalty coefficient xi, r₃Represents the penalty of collision of the drone with other drones. The Reward structure can be flexibly set according to the maneuverability, the trapping task characteristic and the like of the unmanned aerial vehicle.

The present embodiment is only for explaining the present invention, and it is not limited to the present invention, and those skilled in the art can make modifications of the present embodiment without inventive contribution as needed after reading the present specification, but all of them are protected by patent law within the scope of the claims of the present invention.

Claims

1. A target tracking and enclosing method of an unmanned aerial vehicle group self-adaptive environment is characterized by comprising the following steps:

s1: constructing an enclosure layer and a multi-agent training layer;

2. The method for target tracking and enclosure in a drone swarm adaptive environment according to claim 1, wherein the S4 process specifically includes the following steps:

S403: determination of theta_maxAnd

the relationship of (1);

s404: if it is not

Then order

And repeating the step S403;

s405: if it is not

s406: calculating all allocation schemes according to the arrangement method;

3. The method for target tracking and enclosure in a drone swarm adaptive environment according to claim 2, wherein the S5 process specifically includes the following steps:

wherein:

4. The target tracking and trapping method for the unmanned aerial vehicle fleet adaptive environment as claimed in claim 3, wherein in the step of S5, the problem of sparse rewards is solved through a guiding reward function, specifically comprising the steps of:

5. The method for target tracking and enclosure in a fleet of unmanned aerial vehicles as claimed in claim 3, wherein said step of S5 is performed in a Critic network

By a loss function

To update;

wherein

6. The method for target tracking and enclosure in a fleet of unmanned aerial vehicles adaptive environment of claim 3, wherein in the step of S5, the minimum policy gradient formula of Critic network is:

to update the information of the content,

7. The target tracking and capturing method for the unmanned aerial vehicle cluster self-adaptive environment according to any one of claims 4 to 6, wherein the Actor network and the Critic network both adopt 4 layers of fully connected artificial neural networks, the number of neurons in each layer of the Actor network is [64,64,64,2], the input of the last layer is a 2-dimensional vector corresponding to the acceleration of the unmanned aerial vehicle in the x and y axes; the number of neurons in the Critic network coal seam is [64,64,64,1], and the output of the last layer is the evaluation of the action.

8. The method for target tracking and encirclement in a fleet of unmanned aerial vehicles adaptive environment as claimed in claim 7, wherein the Reward structure of each unmanned aerial vehicle is composed of three parts r₁+r₂+r₃，r₁Reward, r, representing the distance of the drone from its location of enclosure₂Penalty, r, representing collision of unmanned aerial vehicle with threat zone₃Represents the penalty of collision of the drone with other drones.