CN114815891A - PER-IDQN-based multi-unmanned aerial vehicle enclosure capture tactical method - Google Patents

PER-IDQN-based multi-unmanned aerial vehicle enclosure capture tactical method

Info

Publication number
CN114815891A
CN114815891A (Application No. CN202210525303.XA)
Authority
CN
China
Prior art keywords
unmanned aerial vehicle, enclosure, target, IDQN
Prior art date
Legal status
Pending
Application number
CN202210525303.XA
Other languages
Chinese (zh)
Inventor
李波
黄晶益
谢国燕
杨志鹏
杨帆
万开方
高晓光
Current Assignee
Northwestern Polytechnical University
Original Assignee
Northwestern Polytechnical University
Priority date
Filing date
Publication date
Application filed by Northwestern Polytechnical University
Priority to CN202210525303.XA
Publication of CN114815891A
Status: Pending

Classifications

    • G05D1/683 Intercepting moving targets
    • G05D1/101 Simultaneous control of position or course in three dimensions, specially adapted for aircraft
    • G05D1/2464 Arrangements for determining position or orientation using environment maps (e.g. SLAM) using an occupancy grid
    • G05D1/43 Control of position or course in two dimensions
    • G05D1/6983 Coordinated control of two or more vehicles; control allocation by distributed or sequential control
    • G05D2101/15 Control architectures using artificial intelligence [AI] techniques, e.g. machine learning / neural networks
    • G05D2105/35 Specific applications of the controlled vehicles: combat
    • G05D2109/20 Types of controlled vehicles: aircraft, e.g. drones

Landscapes

  • Engineering & Computer Science (AREA)
  • Aviation & Aerospace Engineering (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Remote Sensing (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Automation & Control Theory (AREA)
  • Control Of Position, Course, Altitude, Or Attitude Of Moving Bodies (AREA)

Abstract

The invention provides a PER-IDQN-based multi-unmanned aerial vehicle enclosure capture tactical method. A grid digital map and an unmanned aerial vehicle motion model are first constructed. A multi-unmanned aerial vehicle neural network model is then deployed with a deep Q-network algorithm through the interaction of each unmanned aerial vehicle with the environment, and the algorithm model is optimized with a prioritized experience replay strategy. A state space, an action space and a reward function are designed specifically for the multi-unmanned aerial vehicle enclosure capture task. The resulting multi-unmanned aerial vehicle enclosure capture tactical model can formulate effective enclosure capture tactics in complex obstacle environments and realize the capture of maneuvering targets. The method effectively improves the sampling efficiency of experience samples, alleviates the slow training of unmanned aerial vehicle decision models in complex task scenarios, is applicable to multi-unmanned aerial vehicle enclosure capture and autonomous obstacle avoidance tasks in complex dynamic environments, and yields a tactical model with high stability.

Description

PER-IDQN-based multi-unmanned aerial vehicle enclosure capture tactical method
Technical Field
The invention relates to the field of multi-agent systems and unmanned aerial vehicle intelligent decision making, in particular to a multi-unmanned aerial vehicle enclosure tactical method.
Background
Unmanned aerial vehicles offer strong concealment, high safety and other advantages, and provide a new way to meet the multi-vehicle cooperation and low casualty rate required by modern information-based defense tactics. In scenarios where an enemy aircraft intrudes into friendly airspace to conduct illegal reconnaissance, it is of great significance to form a formation of defensive unmanned aerial vehicles that can, according to the situational environment, automatically enclose and expel the target or accompany and monitor it.
Existing research on multi-unmanned aerial vehicle enclosure capture tactics is limited; most approaches estimate the target position in real time with an artificial-intelligence method and then plan a corresponding tracking path to approach and capture the target. Patent publication CN112241173A proposes an artificial-potential-field-based intelligent planning method for multi-agent rendezvous points: the target is converted into virtual rendezvous points, the repulsive forces between agents and between agents and obstacles are computed with an artificial potential field model, and the positions and path information of the virtual rendezvous points are derived. However, that method does not consider the large computational load of the model in a dynamic environment and cannot guarantee real-time multi-agent decision making. In recent years, deep reinforcement learning has provided a new approach to real-time online intelligent decision making for unmanned systems. Patent publication CN113625775A provides a multi-unmanned aerial vehicle enclosure capture method combining state prediction with DDPG: the states of the unmanned aerial vehicles are predicted with a least-squares method, the unmanned aerial vehicle model is trained with the deep reinforcement learning DDPG algorithm, and the trained model is deployed in a multi-unmanned aerial vehicle system to make cooperative capture decisions. However, when that method trains the decision model, the volume of training samples is large and the variable types are complex, so training is inefficient, and the resulting multi-unmanned aerial vehicle capture model has poor stability and certain limitations.
Prioritized experience replay is a deep reinforcement learning optimization method: by computing the importance of each experience sample and ranking the samples by priority, it increases the usage rate of high-priority samples and thereby speeds up agent training. How to introduce prioritized experience replay into multi-agent deep reinforcement learning, combine it with a complex multi-unmanned aerial vehicle enclosure capture tactical model to improve the autonomous behavior of each unmanned aerial vehicle, and finally capture the target through cooperative decisions has therefore become a difficult problem in applying deep reinforcement learning to multi-unmanned aerial vehicle intelligent decision making.
Disclosure of Invention
To overcome the shortcomings of the prior art, the invention provides a multi-unmanned aerial vehicle enclosure capture tactical method based on a Prioritized Experience Replay Independent Deep Q-Network (PER-IDQN). Specifically, a grid digital map and an unmanned aerial vehicle motion model are constructed; a multi-unmanned aerial vehicle neural network model is deployed with the Deep Q-Network (DQN) algorithm through the interaction of each unmanned aerial vehicle with the environment, and the algorithm model is optimized with a Prioritized Experience Replay (PER) strategy; a state space, an action space and a reward function are then designed specifically for the multi-unmanned aerial vehicle enclosure capture task. The finally constructed multi-unmanned aerial vehicle enclosure capture tactical model can formulate effective enclosure capture tactics in complex obstacle environments and realize the capture of maneuvering targets.
The technical scheme adopted by the invention for solving the technical problem comprises the following steps:
step 1: constructing a grid digital map model and an unmanned aerial vehicle model;
step 2: constructing a multi-unmanned aerial vehicle enclosure decision model based on a PER-IDQN algorithm;
step 3: constructing and training a multi-unmanned aerial vehicle enclosure capture decision model based on the PER-IDQN algorithm; each unmanned aerial vehicle inputs its state information into its neural network, the trained PER-IDQN network maps the state to a flight action, and the enclosure unmanned aerial vehicles realize the capture of the target through cooperative decisions.
The steps of constructing the grid digital map model and the unmanned aerial vehicle model are as follows:
step 1-1: in order to conveniently quantify the specific position of the unmanned aerial vehicle, the whole airspace is uniformly divided into grids, each grid being a square with side length l kilometers; the task scene contains a × b grids, so its total width is l_width = a·l kilometers and its total length is l_length = b·l kilometers;
step 1-2: setting the speed of the capture unmanned aerial vehicle to be l kilometer/time step, and setting the speed of the target unmanned aerial vehicle to be n x l kilometer/time step;
step 1-3: setting the size of the action space of the unmanned aerial vehicle to be 4, namely the unmanned aerial vehicle can only move in four directions, namely up, down, left and right directions in each step;
step 1-4: the detectable range of each unmanned aerial vehicle is set as a circular area of radius l kilometers, which in the grid scene is approximated by the 3 × 3 nine-grid neighborhood centered on the unmanned aerial vehicle.
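For illustration, steps 1-1 to 1-4 can be sketched as a minimal grid environment and single-step motion model; the class and parameter names below (GridWorld, cell_len, obstacle_ratio, etc.) are assumptions made for this sketch and are not taken from the invention.

```python
import numpy as np

class GridWorld:
    """Minimal grid digital map: 0 = free cell, 1 = obstacle (steps 1-1 to 1-4)."""
    def __init__(self, a=80, b=40, cell_len=0.1, obstacle_ratio=0.05, seed=0):
        self.a, self.b, self.cell_len = a, b, cell_len        # a x b cells, each cell_len km on a side
        rng = np.random.default_rng(seed)
        self.grid = (rng.random((a, b)) < obstacle_ratio).astype(int)

    def in_bounds(self, pos):
        x, y = pos
        return 0 <= x < self.a and 0 <= y < self.b

    def is_free(self, pos):
        return self.in_bounds(pos) and self.grid[pos] == 0

def step_uav(pos, action, world, speed_cells=1):
    """Apply one action; the pursuer moves 1 cell and the target n cells per time step."""
    dx, dy = action                                           # action in {(0,-1), (0,1), (-1,0), (1,0)}
    new = (pos[0] + dx * speed_cells, pos[1] + dy * speed_cells)
    return new if world.is_free(new) else pos                 # blocked or out-of-map moves leave the UAV in place
```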
The step 2 of constructing a multi-unmanned aerial vehicle trapping decision model based on a PER-IDQN algorithm comprises the following steps:
step 2-1: the action space A of the enclosure unmanned aerial vehicle is set as follows:
A = [(0, -l), (0, l), (-l, 0), (l, 0)]
where (0, -l), (0, l), (-l, 0) and (l, 0) represent the 4 actions of the unmanned aerial vehicle moving down, up, left and right, and l represents the side length of each grid;
step 2-2: setting the state space S of the enclosure unmanned aerial vehicle as follows:
S = [S_uav, S_teamer, S_obser, S_target, S_finish]
where S_uav, S_teamer, S_obser, S_target and S_finish respectively represent the unmanned aerial vehicle's own state information, the information of the other friendly unmanned aerial vehicles, the unmanned aerial vehicle's detection information, the target information and the task state information;
specifically, for the i-th unmanned aerial vehicle in the multi-vehicle enclosure system, its own state information is set as S_uav^i = [x_i, y_i], where x_i and y_i are the horizontal and vertical coordinates of the i-th unmanned aerial vehicle;
for the i-th unmanned aerial vehicle, the obtainable friendly-vehicle state information S_teamer^i consists of the coordinates (x_j, y_j) of the other friendly unmanned aerial vehicles, where n represents the number of unmanned aerial vehicles;
the observation information of unmanned aerial vehicle i is set as S_obser^i = [o_1, …, o_m, …], where the detection readings o_m represent the exploration information of the enclosure unmanned aerial vehicle at the positions of its surrounding nine-grid neighborhood;
in addition, combining the relative distance and bearing information of the target with respect to unmanned aerial vehicle i, the obtainable target information of the i-th enclosure unmanned aerial vehicle is set as S_target^i = [d_i, θ_i], where d_i and θ_i respectively represent the distance and relative azimuth between the enclosure unmanned aerial vehicle and the target, and x_e and y_e are the horizontal and vertical coordinates of the escaping target;
in addition, to help the enclosure unmanned aerial vehicles effectively complete the capture of the target, a sub-state quantity S_finish^i is set for the i-th enclosure unmanned aerial vehicle, indicating whether the target has been completely captured;
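As a concrete illustration of how such a state vector could be assembled for unmanned aerial vehicle i, a minimal sketch is given below; the field layout and helper names (build_state, world.is_free, etc.) are assumptions for illustration rather than the exact encoding of the invention.

```python
import numpy as np

def build_state(i, uav_pos, target_pos, world, captured):
    """Concatenate [own position | teammate positions | nine-grid detection | target distance/azimuth | task flag]."""
    xi, yi = uav_pos[i]
    own = [xi, yi]                                                         # S_uav
    teammates = [c for j, p in enumerate(uav_pos) if j != i for c in p]    # S_teamer
    # S_obser: detection readings of the surrounding nine-grid cells (1 = obstacle or out of map, 0 = free)
    obser = [0 if world.is_free((xi + dx, yi + dy)) else 1
             for dx in (-1, 0, 1) for dy in (-1, 0, 1) if (dx, dy) != (0, 0)]
    xe, ye = target_pos
    d = np.hypot(xe - xi, ye - yi)                                         # distance d_i to the escaping target
    theta = np.arctan2(ye - yi, xe - xi)                                   # relative azimuth theta_i
    return np.array(own + teammates + obser + [d, theta, float(captured)], dtype=np.float32)
```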
step 2-3: considering the three decision processes involved in the multi-unmanned aerial vehicle enclosure capture tactics (maneuvering approach to the target, cooperative capture and autonomous obstacle avoidance), the reward function R of each individual enclosure unmanned aerial vehicle is set as follows:
R = σ_1·r_pos + σ_2·r_safe + σ_3·r_effi + σ_4·r_task
where r_pos, r_safe, r_effi and r_task respectively represent the position reward, the safe-flight reward, the efficient-flight reward and the task-completion reward, and σ_1 to σ_4 are the corresponding weights of the rewards;
specifically, the position sub-reward is set as follows:
r_pos = (|x_e - x_i| + |y_e - y_i|) - (|x_e - x_i| + |y_e - y_i|)′
where (|x_e - x_i| + |y_e - y_i|) and (|x_e - x_i| + |y_e - y_i|)′ respectively represent the distance between the unmanned aerial vehicle and the target at the current moment and at the next moment;
the safe-flight sub-reward of the enclosure unmanned aerial vehicle is given by a piecewise expression (shown only as an image in the original and not reproduced here);
the efficient-flight sub-reward of the enclosure unmanned aerial vehicle is set as follows:
r_effi = -n_stay
where n_stay represents the number of times the enclosure unmanned aerial vehicle has stayed at its current grid position;
the task-completion sub-reward of the enclosure unmanned aerial vehicle is likewise given by a piecewise expression (shown only as an image in the original and not reproduced here);
step 2-4: the multi-unmanned aerial vehicle capture judgment condition is set as follows: when the distance between the target and each enclosure unmanned aerial vehicle is one unit grid, the target is regarded as unable to escape and the capture task is completed.
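A hedged sketch of this composite reward follows; the weight values and the piecewise safe-flight and task-completion terms are illustrative assumptions, since the patent gives those sub-rewards only as images.

```python
def reward(prev_dist, next_dist, hit_obstacle, n_stay, captured,
           sigma=(1.0, 1.0, 0.1, 1.0)):
    """R = s1*r_pos + s2*r_safe + s3*r_effi + s4*r_task (weights are illustrative)."""
    r_pos  = prev_dist - next_dist          # positive when the pursuer closes in on the target
    r_safe = -1.0 if hit_obstacle else 0.0  # assumed collision penalty
    r_effi = -n_stay                        # discourage loitering on the same cell
    r_task = 10.0 if captured else 0.0      # assumed completion bonus
    s1, s2, s3, s4 = sigma
    return s1 * r_pos + s2 * r_safe + s3 * r_effi + s4 * r_task
```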
Step 3: constructing and training the multi-unmanned aerial vehicle enclosure capture decision model based on the PER-IDQN algorithm:
step 3-1: for each enclosure unmanned aerial vehicle, construct the main BP neural network with parameters θ_i and the state-action value function Q(s_t^i, a_t^i; θ_i) of the PER-IDQN algorithm, where the inputs s_t^i and a_t^i are respectively the state and action of unmanned aerial vehicle i at time t; copy the main network parameters θ_i to the target network θ_i′, i.e. θ_i → θ_i′, where i represents the unmanned aerial vehicle serial number;
step 3-2: set the size of the experience replay queue to M, the discount factor to γ, the maximum number of episodes to E, the maximum number of steps per episode to T and the batch size to N_batch; set the episode counter e = 0;
step 3-3: initialize the n states s_1, …, s_n of the enclosure unmanned aerial vehicles and set the current time t = 0;
step 3-4: generate a random number z and, for each unmanned aerial vehicle i, select the action
a_t^i = a random action from A, if z < ε_greedy; otherwise a_t^i = argmax_a Q(s_t^i, a; θ_i)
where ε_greedy is the greedy coefficient and argmax_a Q(s_t^i, a; θ_i) is the action corresponding to the maximum Q value output by the main network;
step 3-5: execute the action set a_1, …, a_n, calculate the reward values r_1, …, r_n, update the states to s′_1, …, s′_n, compute the priorities p_1, …, p_n, and store the transitions together in the experience replay queue;
step 3-6: sample N_batch experience samples according to the probability P(j) = p_j^α / Σ_k p_k^α, where j represents the serial number of the extracted experience sample, p_j represents its priority, and the parameter α adjusts the degree of prioritized sampling;
calculate the importance-sampling weight coefficient w_j:
w_j = (M·P(j))^(-β) / max_i w_i
where β is a hyper-parameter used to adjust the influence of importance sampling on the PER algorithm and the convergence rate of the model;
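The prioritized sampling and importance-sampling correction described above can be sketched in a few lines (a minimal NumPy illustration following the standard PER formulas, not the exact implementation of the invention):

```python
import numpy as np

def per_sample(priorities, n_batch, alpha=0.6, beta=0.4, rng=np.random.default_rng()):
    """Sample indices with P(j) = p_j^alpha / sum_k p_k^alpha and return IS weights."""
    p = np.asarray(priorities, dtype=np.float64) ** alpha
    probs = p / p.sum()
    idx = rng.choice(len(probs), size=n_batch, p=probs)
    M = len(probs)
    w = (M * probs[idx]) ** (-beta)          # importance-sampling correction
    w /= w.max()                             # normalize by the largest weight
    return idx, w.astype(np.float32)
```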
calculate the temporal-difference error at the current moment:
δ_t^i = Y_t^i - Q(s_t^i, a_t^i; θ_i)
where r_{t+1}^i denotes the reward obtained by unmanned aerial vehicle i at time t + 1;
calculate the target value Y_t^i:
Y_t^i = r_{t+1}^i + γ·max_a Q(s′_j, a; θ_i′)
where γ is the reward discount factor, j is the sample number, and θ_i′ represents the target network of the i-th agent;
combining the importance weights w_j, update the parameters of the current network by minimizing the loss function L(θ_i):
L(θ_i) = (1/N_batch)·Σ_j w_j·(Y_j^i - Q(s_j^i, a_j^i; θ_i))²
step 3-7: update the target network parameters of each unmanned aerial vehicle agent respectively:
θ_i′ ← τ·θ_i + (1 - τ)·θ_i′
where τ represents the update scale factor;
step 3-8: increment the step counter t by 1 and perform the judgment: if t < T and the multi-unmanned aerial vehicle capture judgment condition is not met, return to step 3-4; otherwise, go to step 3-9;
step 3-9: increment the episode counter e by 1 and perform the judgment: if e < E, return to step 3-3; otherwise, finish the training and go to step 3-10;
step 3-10: terminate the PER-IDQN network training process and save the current network parameters; load the saved parameters into the multi-unmanned aerial vehicle capture system; at each moment, each unmanned aerial vehicle inputs its state information into its neural network, the trained PER-IDQN network outputs the flight action of the unmanned aerial vehicle, and the enclosure unmanned aerial vehicles realize the capture of the target through cooperative decisions.
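Steps 3-4 to 3-7 can be combined into a single per-agent update, sketched below in a PyTorch flavour; the network object, buffer interface and hyper-parameter values (gamma, tau) are assumptions for illustration, not the exact implementation of the invention.

```python
import torch

def per_idqn_update(q_net, target_net, optimizer, batch, weights, gamma=0.95, tau=0.01):
    """One gradient step for agent i: importance-weighted TD loss plus soft target-network update."""
    s, a, r, s_next, done = batch            # tensors for the N_batch transitions drawn with per_sample()
    w = torch.as_tensor(weights)
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)                    # Q(s_t, a_t; theta_i)
    with torch.no_grad():
        y = r + gamma * (1 - done) * target_net(s_next).max(dim=1).values   # TD target Y_t^i
    td_error = y - q_sa                       # used afterwards to refresh the sample priorities
    loss = (w * td_error.pow(2)).mean()       # importance-weighted mean squared error
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    # soft update: theta_i' <- tau*theta_i + (1 - tau)*theta_i'
    for p, p_targ in zip(q_net.parameters(), target_net.parameters()):
        p_targ.data.mul_(1 - tau).add_(tau * p.data)
    return td_error.abs().detach()            # new priorities p_j = |delta_j|
```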
The beneficial effects of the proposed PER-IDQN-based multi-unmanned aerial vehicle enclosure capture tactical method are as follows:
(1) The constructed multi-unmanned aerial vehicle enclosure capture decision system does not require the tactics of each unmanned aerial vehicle to be specified individually; tactical and task cooperation is achieved through environment sensing and information sharing among the unmanned aerial vehicles, and the finally formulated multi-unmanned aerial vehicle enclosure capture tactics can realize the capture of maneuvering targets.
(2) The method introduces the prioritized experience replay (PER) strategy into the IDQN algorithm, which effectively improves the sampling efficiency of experience samples and alleviates the slow training of the unmanned aerial vehicle decision model in complex task scenarios. The finally constructed multi-unmanned aerial vehicle enclosure capture tactical model is more stable and is applicable to multi-unmanned aerial vehicle capture and autonomous obstacle avoidance tasks in complex dynamic environments.
Drawings
Fig. 1 is a schematic view of unmanned aerial vehicle detection.
Fig. 2 is a schematic diagram of a position relationship between the unmanned aerial vehicle for enclosure and the target.
Fig. 3 is a schematic diagram of training of a multi-unmanned aerial vehicle capture model based on PER-IDQN.
Fig. 4 is a schematic diagram of a multi-drone enclosure capture.
Detailed Description
The invention is further illustrated with reference to the following figures and examples.
The invention provides a PER-IDQN-based multi-unmanned aerial vehicle enclosure capture tactical method, whose overall flow is shown in Fig. 3. The technical solution is further described clearly and completely below with reference to the accompanying drawings and specific embodiments:
step 1: constructing a grid digital map model and an unmanned aerial vehicle model;
step 1-1: in order to conveniently quantify the specific position of the unmanned aerial vehicle, the whole airspace is divided into grids, the side length of each grid is set to 0.1 kilometer, and the total width and total length of the task scene are set to l_width = 8 km and l_length = 4 km, respectively;
step 1-2: setting the speed of the capture unmanned aerial vehicle to be 0.1 kilometer per time step, and setting the speed of the target unmanned aerial vehicle to be 0.2 kilometer per time step;
step 1-3: setting the size of the action space of the unmanned aerial vehicle to be 4, namely the unmanned aerial vehicle can only move in four directions, namely up, down, left and right directions in each step;
step 1-4: setting the detectable range of each unmanned aerial vehicle as a circular area of radius 0.1 kilometer, which in the grid scene is approximated by the 3 × 3 nine-grid neighborhood centered on the unmanned aerial vehicle;
step 2: constructing a multi-unmanned aerial vehicle enclosure decision model based on a PER-IDQN algorithm;
step 2-1: setting the action space A of the enclosure unmanned aerial vehicle as follows:
A = [(0, -l), (0, l), (-l, 0), (l, 0)]
where (0, -l), (0, l), (-l, 0) and (l, 0) represent the 4 actions of the unmanned aerial vehicle moving down, up, left and right, and l represents the side length of each grid;
step 2-2: setting the state space S of the enclosure unmanned aerial vehicle as follows:
S = [S_uav, S_teamer, S_obser, S_target, S_finish]
where S_uav, S_teamer, S_obser, S_target and S_finish respectively represent the unmanned aerial vehicle's own state information, the information of the other friendly unmanned aerial vehicles, the unmanned aerial vehicle's detection information, the target information and the task state information;
specifically, for the i-th unmanned aerial vehicle in the multi-vehicle enclosure system, its own state information is set as S_uav^i = [x_i, y_i], where x_i and y_i are the horizontal and vertical coordinates of the i-th unmanned aerial vehicle;
for the i-th unmanned aerial vehicle, the obtainable friendly-vehicle state information S_teamer^i consists of the coordinates (x_j, y_j) of the other friendly unmanned aerial vehicles, where n represents the number of unmanned aerial vehicles;
the observation information of unmanned aerial vehicle i is set as S_obser^i = [o_1, …, o_m, …], where the detection readings o_m represent the exploration information of the enclosure unmanned aerial vehicle at the positions of its surrounding nine-grid neighborhood; the unmanned aerial vehicle detection information is shown in Fig. 1;
in addition, combining the relative distance and bearing information of the target with respect to unmanned aerial vehicle i, the obtainable target information of the i-th enclosure unmanned aerial vehicle is set as S_target^i = [d_i, θ_i], where d_i and θ_i respectively represent the distance and relative azimuth between the enclosure unmanned aerial vehicle and the target, and x_e and y_e are the horizontal and vertical coordinates of the escaping target; the positional relationship between the enclosure unmanned aerial vehicle and the target is shown in Fig. 2;
in addition, to help the enclosure unmanned aerial vehicles effectively complete the capture of the target, a sub-state quantity S_finish^i is set for the i-th enclosure unmanned aerial vehicle, indicating whether the target has been completely captured;
step 2-3: considering the decision processes involved in the multi-unmanned aerial vehicle enclosure capture tactics, such as maneuvering approach to the target, cooperative capture and autonomous obstacle avoidance, the reward function R of each individual enclosure unmanned aerial vehicle is set as follows:
R = σ_1·r_pos + σ_2·r_safe + σ_3·r_effi + σ_4·r_task
where r_pos, r_safe, r_effi and r_task respectively represent the position reward, the safe-flight reward, the efficient-flight reward and the task-completion reward, and σ_1 to σ_4 are the corresponding weights of the rewards;
specifically, the position sub-reward is set as follows:
r_pos = (|x_e - x_i| + |y_e - y_i|) - (|x_e - x_i| + |y_e - y_i|)′
where (|x_e - x_i| + |y_e - y_i|) and (|x_e - x_i| + |y_e - y_i|)′ respectively represent the distance between the unmanned aerial vehicle and the target at the current moment and at the next moment;
the safe-flight sub-reward of the enclosure unmanned aerial vehicle is given by a piecewise expression (shown only as an image in the original and not reproduced here);
the efficient-flight sub-reward of the enclosure unmanned aerial vehicle is set as follows:
r_effi = -n_stay
where n_stay represents the number of times the enclosure unmanned aerial vehicle has stayed at its current grid position;
the task-completion sub-reward of the enclosure unmanned aerial vehicle is likewise given by a piecewise expression (shown only as an image in the original and not reproduced here);
step 2-4: setting the multi-unmanned aerial vehicle capture judgment condition: when the distance between the target and each enclosure unmanned aerial vehicle is one unit grid, the target is regarded as unable to escape and the capture task is completed;
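As a sketch, the capture judgment condition of step 2-4 can be checked with a simple Manhattan-distance test; treating "one unit grid distance" as a Manhattan distance of at most one cell is an assumption made for this illustration.

```python
def capture_complete(target_pos, pursuer_positions):
    """Step 2-4: the target is considered unable to escape when every pursuer is within one grid cell."""
    xe, ye = target_pos
    return all(abs(xe - xi) + abs(ye - yi) <= 1 for xi, yi in pursuer_positions)
```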
and step 3: constructing a multi-unmanned aerial vehicle capture decision model and training the model based on a deep reinforcement learning PER-IDQN algorithm;
step 3-1: for each enclosure unmanned aerial vehicle, construct the main BP neural network with parameters θ_i and the state-action value function Q(s_t^i, a_t^i; θ_i) of the PER-IDQN algorithm, where the inputs s_t^i and a_t^i are respectively the state and action of unmanned aerial vehicle i at time t; copy the main network parameters θ_i to the target network θ_i′, i.e. θ_i → θ_i′, where i represents the unmanned aerial vehicle serial number;
step 3-2: set the size of the experience replay queue to M, the discount factor to γ, the maximum number of episodes to E, the maximum number of steps per episode to T and the batch size to N_batch; set the episode counter e = 0;
step 3-3: initialize the n states s_1, …, s_n of the enclosure unmanned aerial vehicles and set the current time t = 0;
step 3-4: generate a random number z and, for each unmanned aerial vehicle i, select the action
a_t^i = a random action from A, if z < ε_greedy; otherwise a_t^i = argmax_a Q(s_t^i, a; θ_i)
where ε_greedy is the greedy coefficient and argmax_a Q(s_t^i, a; θ_i) is the action corresponding to the maximum Q value output by the main network;
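A minimal sketch of this ε-greedy selection (the function name, the PyTorch-style q_net callable and the random-number handling are illustrative assumptions):

```python
import numpy as np
import torch

def select_action(q_net, state, n_actions, eps_greedy, rng=np.random.default_rng()):
    """Step 3-4: explore with probability eps_greedy, otherwise act greedily on the main network."""
    if rng.random() < eps_greedy:                       # random number z falls in the exploration range
        return int(rng.integers(n_actions))             # random action from A
    q_values = q_net(torch.as_tensor(state, dtype=torch.float32).unsqueeze(0))
    return int(q_values.argmax(dim=1).item())           # action with the maximum Q value
```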
step 3-5: execute the action set a_1, …, a_n, calculate the reward values r_1, …, r_n, update the states to s′_1, …, s′_n, compute the priorities p_1, …, p_n, and store the transitions together in the experience replay queue;
step 3-6: sample N_batch experience samples according to the probability P(j) = p_j^α / Σ_k p_k^α, where j represents the serial number of the extracted experience sample, p_j represents its priority, and the parameter α adjusts the degree of prioritized sampling;
calculate the importance-sampling weight coefficient w_j:
w_j = (M·P(j))^(-β) / max_i w_i
where M is the size of the experience replay queue, and β is a hyper-parameter used to adjust the influence of importance sampling on the PER algorithm and the convergence rate of the model;
calculate the temporal-difference error at the current moment:
δ_t^i = Y_t^i - Q(s_t^i, a_t^i; θ_i)
where r_{t+1}^i denotes the reward obtained by unmanned aerial vehicle i at time t + 1;
calculate the target value Y_t^i:
Y_t^i = r_{t+1}^i + γ·max_a Q(s′_j, a; θ_i′)
where γ is the reward discount factor, j is the sample number, and θ_i′ represents the target network of the i-th agent;
combining the importance weights w_j, update the parameters of the current network by minimizing the loss function L(θ_i):
L(θ_i) = (1/N_batch)·Σ_j w_j·(Y_j^i - Q(s_j^i, a_j^i; θ_i))²
Step 3-7: respectively updating the target network parameters of each unmanned aerial vehicle agent:
θ i′ ←τθ i +(1-τ)θ i′
τ represents an update scale factor;
step 3-8: update the step counter t to t + 1 and perform the judgment: if t < T and the multi-unmanned aerial vehicle capture judgment condition of step 2-4 is not met, return to step 3-4; otherwise, go to step 3-9;
step 3-9: update the episode counter e to e + 1 and perform the judgment: if e < E, return to step 3-3; otherwise, finish the training and go to step 3-10;
step 3-10: terminate the PER-IDQN network training process and save the current network parameters; load the saved parameters into the multi-unmanned aerial vehicle capture system. At each moment, each unmanned aerial vehicle inputs its state information into its neural network, the trained PER-IDQN network outputs the flight action of the unmanned aerial vehicle, and the enclosure unmanned aerial vehicles realize the capture of the target through cooperative decisions.
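The deployment phase of step 3-10 might look roughly as follows (a sketch assuming one trained PyTorch Q-network per pursuer and a hypothetical env interface; neither is specified by the invention):

```python
import torch

def deploy(q_nets, env, max_steps=200):
    """Run the trained PER-IDQN policies greedily until the target is captured (step 3-10)."""
    # q_nets are assumed to be restored beforehand, e.g. net.load_state_dict(torch.load(path))
    states = env.reset()                                   # one state vector per enclosure UAV
    for t in range(max_steps):
        actions = []
        for i, net in enumerate(q_nets):                   # one independent Q-network per pursuer
            s = torch.as_tensor(states[i], dtype=torch.float32).unsqueeze(0)
            actions.append(int(net(s).argmax(dim=1)))      # greedy flight action
        states, done = env.step(actions)
        if done:                                           # capture condition of step 2-4 satisfied
            return t + 1
    return None
```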
In order to better illustrate the superiority of the method of the present invention, the present embodiment was tested in different scenes. Specifically, in a task scenario in which the number of grids is 80 × 40, the obstacle mobility is kept at 10%, different obstacle coverage rates are set and the test is performed, and the test results are shown in table 1.
Table 1. Multi-unmanned aerial vehicle enclosure capture performance under different environmental obstacle coverage rates (the table data appear only as images in the original and are not reproduced here).
As the table shows, the multi-unmanned aerial vehicle capture time increases as the environmental obstacle coverage increases. When the obstacle coverage reaches 0.10 or above, the enclosure capture tactics based on the PER-IDQN algorithm require fewer average simulation steps than the traditional IDQN algorithm, meaning that the tactics formulated by the PER-IDQN-based multi-unmanned aerial vehicle system in a complex obstacle environment are more effective and the targets can be captured in a shorter time. A simulation diagram of the multi-unmanned aerial vehicle enclosure capture is shown in Fig. 4.
In summary, the PER-IDQN-based multi-unmanned aerial vehicle enclosure capture tactical method provided by the invention trains the neural networks offline, stores the data generated during training in an experience pool to provide learning samples for optimizing the networks, and designs the actions and states of the unmanned aerial vehicles according to the maneuvering-control and cooperative-capture task requirements, thereby realizing intelligent decision control of multiple unmanned aerial vehicles.
The multi-unmanned aerial vehicle enclosure capture tactical method provided by the invention trains its model efficiently, and the constructed tactical model can be applied in complex dynamic scenes, improving the execution efficiency of multi-unmanned aerial vehicle enclosure capture tactics.
The above description is only a preferred embodiment of the present invention, and it should be noted that: the embodiments of the present invention are not limited to the above-described implementation methods; it will be apparent to those skilled in the art that other variations and modifications can be made without departing from the spirit of the invention. It should be understood that any equivalent substitutions, modifications and improvements made within the spirit and principle of the present invention should be included in the scope of the claims of the present invention.

Claims (4)

1. A multi-unmanned aerial vehicle enclosure tactical method based on PER-IDQN is characterized by comprising the following steps:
step 1: constructing a grid digital map model and an unmanned aerial vehicle model;
step 2: constructing a multi-unmanned aerial vehicle enclosure decision model based on a PER-IDQN algorithm;
step 3: constructing and training a multi-unmanned aerial vehicle enclosure capture decision model based on the PER-IDQN algorithm; each unmanned aerial vehicle inputs its state information into its neural network, the trained PER-IDQN network maps the state to a flight action, and the enclosure unmanned aerial vehicles realize the capture of the target through cooperative decisions.
2. The PER-IDQN-based multi-drone containment tactical method of claim 1, wherein:
the steps of constructing the grid digital map model and the unmanned aerial vehicle model are as follows:
step 1-1: in order to conveniently quantify the specific position of the unmanned aerial vehicle, the whole airspace is uniformly divided into grids, each grid being a square with side length l kilometers; the task scene contains a × b grids, so its total width is l_width = a·l kilometers and its total length is l_length = b·l kilometers;
step 1-2: setting the speed of the capture unmanned aerial vehicle to be l kilometer/time step, and setting the speed of the target unmanned aerial vehicle to be n x l kilometer/time step;
step 1-3: setting the size of the action space of the unmanned aerial vehicle to be 4, namely the unmanned aerial vehicle can only move in four directions, namely up, down, left and right directions in each step;
step 1-4: the detectable range of each unmanned aerial vehicle is set as a circular area of radius l kilometers, which in the grid scene is approximated by the 3 × 3 nine-grid neighborhood centered on the unmanned aerial vehicle.
3. The PER-IDQN-based multi-drone containment tactical method of claim 1, wherein:
the step 2 of constructing a multi-unmanned aerial vehicle trapping decision model based on a PER-IDQN algorithm comprises the following steps:
step 2-1: setting the action space A of the enclosure unmanned aerial vehicle as follows:
A = [(0, -l), (0, l), (-l, 0), (l, 0)]
where (0, -l), (0, l), (-l, 0) and (l, 0) represent the 4 actions of the unmanned aerial vehicle moving down, up, left and right, and l represents the side length of each grid;
step 2-2: setting the state space S of the enclosure unmanned aerial vehicle as follows:
S = [S_uav, S_teamer, S_obser, S_target, S_finish]
where S_uav, S_teamer, S_obser, S_target and S_finish respectively represent the unmanned aerial vehicle's own state information, the information of the other friendly unmanned aerial vehicles, the unmanned aerial vehicle's detection information, the target information and the task state information;
specifically, for the i-th unmanned aerial vehicle in the multi-vehicle enclosure system, its own state information is set as S_uav^i = [x_i, y_i], where x_i and y_i are the horizontal and vertical coordinates of the i-th unmanned aerial vehicle;
for the i-th unmanned aerial vehicle, the obtainable friendly-vehicle state information S_teamer^i consists of the coordinates (x_j, y_j) of the other friendly unmanned aerial vehicles, where n represents the number of unmanned aerial vehicles;
the observation information of unmanned aerial vehicle i is set as S_obser^i = [o_1, …, o_m, …], where the detection readings o_m represent the exploration information of the enclosure unmanned aerial vehicle at the positions of its surrounding nine-grid neighborhood;
in addition, combining the relative distance and bearing information of the target with respect to unmanned aerial vehicle i, the obtainable target information of the i-th enclosure unmanned aerial vehicle is set as S_target^i = [d_i, θ_i], where d_i and θ_i respectively represent the distance and relative azimuth between the enclosure unmanned aerial vehicle and the target, and x_e and y_e are the horizontal and vertical coordinates of the escaping target;
in addition, to help the enclosure unmanned aerial vehicles effectively complete the capture of the target, a sub-state quantity S_finish^i is set for the i-th enclosure unmanned aerial vehicle, indicating whether the target has been completely captured;
step 2-3: considering the three decision processes involved in the multi-unmanned aerial vehicle enclosure capture tactics (maneuvering approach to the target, cooperative capture and autonomous obstacle avoidance), the reward function R of each individual enclosure unmanned aerial vehicle is set as follows:
R = σ_1·r_pos + σ_2·r_safe + σ_3·r_effi + σ_4·r_task
where r_pos, r_safe, r_effi and r_task respectively represent the position reward, the safe-flight reward, the efficient-flight reward and the task-completion reward, and σ_1 to σ_4 are the corresponding weights of the rewards;
specifically, the position sub-reward is set as follows:
r_pos = (|x_e - x_i| + |y_e - y_i|) - (|x_e - x_i| + |y_e - y_i|)′
where (|x_e - x_i| + |y_e - y_i|) and (|x_e - x_i| + |y_e - y_i|)′ respectively represent the distance between the unmanned aerial vehicle and the target at the current moment and at the next moment;
the safe-flight sub-reward of the enclosure unmanned aerial vehicle is given by a piecewise expression (shown only as an image in the original and not reproduced here);
the efficient-flight sub-reward of the enclosure unmanned aerial vehicle is set as follows:
r_effi = -n_stay
where n_stay represents the number of times the enclosure unmanned aerial vehicle has stayed at its current grid position;
the task-completion sub-reward of the enclosure unmanned aerial vehicle is likewise given by a piecewise expression (shown only as an image in the original and not reproduced here);
step 2-4: the multi-unmanned aerial vehicle capture judgment condition is set as follows: when the distance between the target and each enclosure unmanned aerial vehicle is one unit grid, the target is regarded as unable to escape and the capture task is completed.
4. The PER-IDQN-based multi-drone containment tactical method of claim 1, wherein:
step 3: constructing and training the multi-unmanned aerial vehicle enclosure capture decision model based on the PER-IDQN algorithm:
step 3-1: for each enclosure unmanned aerial vehicle, construct the main BP neural network with parameters θ_i and the state-action value function Q(s_t^i, a_t^i; θ_i) of the PER-IDQN algorithm, where the inputs s_t^i and a_t^i are respectively the state and action of unmanned aerial vehicle i at time t; copy the main network parameters θ_i to the target network θ_i′, i.e. θ_i → θ_i′, where i represents the unmanned aerial vehicle serial number;
step 3-2: set the size of the experience replay queue to M, the discount factor to γ, the maximum number of episodes to E, the maximum number of steps per episode to T and the batch size to N_batch; set the episode counter e = 0;
step 3-3: initialize the n states s_1, …, s_n of the enclosure unmanned aerial vehicles and set the current time t = 0;
step 3-4: generate a random number z and, for each unmanned aerial vehicle i, select the action
a_t^i = a random action from A, if z < ε_greedy; otherwise a_t^i = argmax_a Q(s_t^i, a; θ_i)
where ε_greedy is the greedy coefficient and argmax_a Q(s_t^i, a; θ_i) is the action corresponding to the maximum Q value output by the main network;
step 3-5: execute the action set a_1, …, a_n, calculate the reward values r_1, …, r_n, update the states to s′_1, …, s′_n, compute the priorities p_1, …, p_n, and store the transitions together in the experience replay queue;
step 3-6: sample N_batch experience samples according to the probability P(j) = p_j^α / Σ_k p_k^α, where j represents the serial number of the extracted experience sample, p_j represents its priority, and the parameter α adjusts the degree of prioritized sampling;
calculate the importance-sampling weight coefficient w_j:
w_j = (M·P(j))^(-β) / max_i w_i
where β is a hyper-parameter used to adjust the influence of importance sampling on the PER algorithm and the convergence rate of the model;
calculate the temporal-difference error at the current moment:
δ_t^i = Y_t^i - Q(s_t^i, a_t^i; θ_i)
where r_{t+1}^i denotes the reward obtained by unmanned aerial vehicle i at time t + 1;
calculate the target value Y_t^i:
Y_t^i = r_{t+1}^i + γ·max_a Q(s′_j, a; θ_i′)
where γ is the reward discount factor, j is the sample number, and θ_i′ represents the target network of the i-th agent;
combining the importance weights w_j, update the parameters of the current network by minimizing the loss function L(θ_i):
L(θ_i) = (1/N_batch)·Σ_j w_j·(Y_j^i - Q(s_j^i, a_j^i; θ_i))²
step 3-7: update the target network parameters of each unmanned aerial vehicle agent respectively:
θ_i′ ← τ·θ_i + (1 - τ)·θ_i′
where τ represents the update scale factor;
step 3-8: increment the step counter t by 1 and perform the judgment: if t < T and the multi-unmanned aerial vehicle capture judgment condition is not met, return to step 3-4; otherwise, go to step 3-9;
step 3-9: increment the episode counter e by 1 and perform the judgment: if e < E, return to step 3-3; otherwise, finish the training and go to step 3-10;
step 3-10: terminate the PER-IDQN network training process and save the current network parameters; load the saved parameters into the multi-unmanned aerial vehicle capture system; at each moment, each unmanned aerial vehicle inputs its state information into its neural network, the trained PER-IDQN network outputs the flight action of the unmanned aerial vehicle, and the enclosure unmanned aerial vehicles realize the capture of the target through cooperative decisions.
CN202210525303.XA 2022-05-15 2022-05-15 PER-IDQN-based multi-unmanned aerial vehicle enclosure capture tactical method Pending CN114815891A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210525303.XA CN114815891A (en) 2022-05-15 2022-05-15 PER-IDQN-based multi-unmanned aerial vehicle enclosure capture tactical method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210525303.XA CN114815891A (en) 2022-05-15 2022-05-15 PER-IDQN-based multi-unmanned aerial vehicle enclosure capture tactical method

Publications (1)

Publication Number Publication Date
CN114815891A 2022-07-29

Family

ID=82514417

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210525303.XA Pending CN114815891A (en) 2022-05-15 2022-05-15 PER-IDQN-based multi-unmanned aerial vehicle enclosure capture tactical method

Country Status (1)

Country Link
CN (1) CN114815891A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116166034A (en) * 2023-04-25 2023-05-26 清华大学 Cross-domain collaborative trapping method, device and system
CN116337086A (en) * 2023-05-29 2023-06-27 中国人民解放军海军工程大学 Method, system, medium and terminal for calculating optimal capturing position of unmanned aerial vehicle network capturing
CN116337086B (en) * 2023-05-29 2023-08-04 中国人民解放军海军工程大学 Method, system, medium and terminal for calculating optimal capturing position of unmanned aerial vehicle network capturing

Similar Documents

Publication Publication Date Title
CN113495578B (en) Digital twin training-based cluster track planning reinforcement learning method
CN108731684B (en) Multi-unmanned aerial vehicle cooperative area monitoring airway planning method
CN113095481B (en) Air combat maneuver method based on parallel self-game
CN113110592A (en) Unmanned aerial vehicle obstacle avoidance and path planning method
CN114815891A (en) PER-IDQN-based multi-unmanned aerial vehicle enclosure capture tactical method
CN112131786A (en) Target detection and distribution method and device based on multi-agent reinforcement learning
CN113791634A (en) Multi-aircraft air combat decision method based on multi-agent reinforcement learning
CN112198892B (en) Multi-unmanned aerial vehicle intelligent cooperative penetration countermeasure method
CN113848974B (en) Aircraft trajectory planning method and system based on deep reinforcement learning
CN114330115B (en) Neural network air combat maneuver decision-making method based on particle swarm search
CN116661503B (en) Cluster track automatic planning method based on multi-agent safety reinforcement learning
CN113268081B (en) Small unmanned aerial vehicle prevention and control command decision method and system based on reinforcement learning
CN113625569B (en) Small unmanned aerial vehicle prevention and control decision method and system based on hybrid decision model
CN114089776B (en) Unmanned aerial vehicle obstacle avoidance method based on deep reinforcement learning
CN113893539A (en) Cooperative fighting method and device for intelligent agent
CN115185294B (en) QMIX-based aviation soldier multi-formation collaborative autonomous behavior decision modeling method
CN115097861B (en) Multi-unmanned aerial vehicle trapping strategy method based on CEL-MADDPG
CN117150757A (en) Simulation deduction system based on digital twin
CN115981369A (en) Method for joint task allocation and flight path planning of multiple unmanned aerial vehicles under limited communication
CN117313561B (en) Unmanned aerial vehicle intelligent decision model training method and unmanned aerial vehicle intelligent decision method
CN117908565A (en) Unmanned aerial vehicle safety path planning method based on maximum entropy multi-agent reinforcement learning
CN117574950A (en) Multi-agent self-organizing collaborative trapping method in non-convex environment
CN115903885B (en) Unmanned aerial vehicle flight control method of swarm Agent model based on task traction
CN113110101A (en) Production line mobile robot gathering type recovery and warehousing simulation method and system
CN114089751A (en) Mobile robot path planning method based on improved DDPG algorithm

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination