CN116560409A - Unmanned aerial vehicle cluster path planning simulation method based on MADDPG-R

Unmanned aerial vehicle cluster path planning simulation method based on MADDPG-R

Info

Publication number
CN116560409A
CN116560409A (application CN202310721379.4A)
Authority
CN
China
Prior art keywords: unmanned aerial vehicle, action, representing, state
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310721379.4A
Other languages
Chinese (zh)
Inventor
王尔申
宏晨
陈纪浩
屈力刚
李寒冰
徐嵩
王传云
崔璨
庞涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Songshan Laboratory
Shenyang Aerospace University
Original Assignee
Songshan Laboratory
Shenyang Aerospace University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Songshan Laboratory, Shenyang Aerospace University filed Critical Songshan Laboratory
Priority to CN202310721379.4A
Publication of CN116560409A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G05: CONTROLLING; REGULATING
    • G05D: SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D1/00: Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
    • G05D1/10: Simultaneous control of position or course in three dimensions
    • G05D1/101: Simultaneous control of position or course in three dimensions specially adapted for aircraft
    • G05D1/104: Simultaneous control of position or course in three dimensions specially adapted for aircraft involving a plurality of aircrafts, e.g. formation flying
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T: CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00: Road transport of goods or passengers
    • Y02T10/10: Internal combustion engine [ICE] based vehicles
    • Y02T10/40: Engine management systems


Abstract

The invention provides an unmanned aerial vehicle cluster path planning simulation method based on MADDPG-R and relates to the technical field of unmanned aerial vehicle cluster air-to-air search decision making. First, a scene of unmanned aerial vehicle cluster air-to-air search is set up, and the positions of the unmanned aerial vehicle cluster and of the target areas are set randomly. Then, an unmanned aerial vehicle cluster air-to-air search model based on reinforcement learning is constructed using a reinforcement learning algorithm. The dependency relationship between unmanned aerial vehicles is measured by a pointwise mutual information method, and the cooperative relationship between unmanned aerial vehicles is improved by maximizing the pointwise mutual information. The optimal pointwise mutual information estimate is then obtained by maximizing the mutual information, and the reward function is redefined accordingly. The above process is repeated and the task completion rate of the unmanned aerial vehicle cluster is calculated until it reaches a set value, completing the unmanned aerial vehicle cluster path planning simulation. The method enables the unmanned aerial vehicle cluster to learn cooperation strategies in a decentralized manner and better improves cooperation among the unmanned aerial vehicles.

Description

Unmanned aerial vehicle cluster path planning simulation method based on MADDPG-R
Technical Field
The invention relates to the technical field of unmanned aerial vehicle cluster air-to-air search decision making, in particular to an unmanned aerial vehicle cluster path planning simulation method based on MADDPG-R.
Background
Because unmanned aerial vehicles are of great significance for future air combat, the major military powers of the world are racing to develop them. With the advancement of technologies such as artificial intelligence, distributed systems and networked communication, and the marked improvement of airborne hardware, unmanned aerial vehicle clusters have attracted great attention and are being vigorously developed in many countries.
Unmanned aerial vehicle clusters are also regarded as a combat means capable of overturning future battlefield situations, so military powers are pushing unmanned aerial vehicle cluster technology and carrying out cluster flight tests, striving to bring unmanned aerial vehicle clusters that integrate new technologies into actual combat. Unmanned cluster intelligence is a critical technology that is sufficient to change warfare patterns and rules. Swarm intelligence in essence belongs to the field of bionics and derives from human observation of the collective behavior of animal groups, including the activities of bee colonies and ant colonies and the migration of birds. Intelligent unmanned aerial vehicle clusters are based on such biological swarm behaviors; through mutual information transmission and coordination, the unmanned aerial vehicles can complete various tasks in complex and dangerous environments.
In the process of searching multiple target points, the environment of the unmanned aerial vehicle cluster is often uncertain, incomplete and dynamic, and most existing methods cannot be extended well to decentralized unmanned aerial vehicle clusters because of computational complexity or the need for global information. With the development of unmanned aerial vehicle technology, unmanned aerial vehicle swarm cooperation has become a research hotspot. Through close collaboration, an unmanned aerial vehicle swarm can exhibit better coordination, intelligence and autonomy than a traditional multi-UAV system. Meanwhile, multi-target search in unknown environments has become an important application direction of unmanned aerial vehicle swarms.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides an unmanned aerial vehicle cluster path planning simulation method based on MADDPG-R, which realizes the planning simulation of unmanned aerial vehicle cluster paths and provides a method for the unmanned aerial vehicle cluster to collaboratively search for targets in a dynamic environment.
In order to solve the technical problems, the invention adopts the following technical scheme: the unmanned aerial vehicle cluster path planning simulation method based on MADDPG-R comprises the following steps:
step 1: setting a scene of unmanned aerial vehicle cluster air-to-air search, and randomly setting the unmanned aerial vehicle cluster and target area positions;
the unmanned aerial vehicle cluster is set to comprise n unmanned aerial vehicles, the n unmanned aerial vehicles randomly search m target points, one or more target points can be selected by the unmanned aerial vehicles to conduct coverage monitoring in the searching process, the unmanned aerial vehicles fly on a two-dimensional plane at a speed not exceeding a set speed, and the action of each unmanned aerial vehicle is a continuous random variable.
In the unmanned aerial vehicle cluster air-air search scene, a particle dynamics model is endowed for each unmanned aerial vehicle, and the following formula is shown:
wherein phi represents the roll angle of the unmanned aerial vehicle, r φ Represents the rolling angle speed, F represents the driving force,indicating heading angular velocity,/-, and>representing the course angle of the unmanned aerial vehicle, m is the mass of the unmanned aerial vehicle, v x ,v y The speed components of the unmanned aerial vehicle in the two-dimensional plane X axis and the Y axis are respectively, X and Y represent the positions of the unmanned aerial vehicle in the two-dimensional plane X axis and the Y axis, and dt represents a differential variable about time t;
the motion space of the unmanned plane is set to be a continuous space, and the motion is expressed as a two-dimensional velocity vector v x ,v y ]The scene allows the drone to move at variable speeds in any direction;
in order to effectively simulate the motion behavior of an actual unmanned aerial vehicle, the motion of the unmanned aerial vehicle is set to be a vector consisting of driving force and rolling angle speedThe flight speed and heading angle of the unmanned aerial vehicle at the time t are obtained by the following formula:
wherein v represents the speed of the unmanned aerial vehicle, i represents the number of the unmanned aerial vehicle, i=1, 2, …, n, Δt represents the time interval;
the position of the target in the scene is random, and the unmanned aerial vehicle cannot acquire the position information of the target in advance; the unmanned aerial vehicle can observe the target, and when the target is positioned in the monitoring range of the unmanned aerial vehicle, the unmanned aerial vehicle can sense and monitor the target; meanwhile, the unmanned aerial vehicle can receive communication information of other adjacent unmanned aerial vehicles, and the unmanned aerial vehicle cannot determine the quantity and position information of other targets;
one unmanned plane can monitor a plurality of target points simultaneously, and as no specific target allocation exists, the unmanned plane always keeps the targets in the perception range and searches more target points as much as possible; in addition, the drone should avoid collisions and fly out boundaries to meet safety constraints;
step 2: constructing an unmanned aerial vehicle cluster air-to-air search model based on reinforcement learning by using a reinforcement learning algorithm;
step 2.1: setting a decision process of the unmanned aerial vehicle cluster to be defined by using a partially observable Markov decision process, wherein each unmanned aerial vehicle can only acquire local observation information and cannot acquire a global state, each unmanned aerial vehicle makes an action decision according to the local observation information in each time step, and all unmanned aerial vehicles execute combined actions to update the environment;
The unmanned aerial vehicle cluster is set to comprise n unmanned aerial vehicles, and the partially observable Markov decision process is defined as the tuple (N, S, A, T, R, O, Z, γ);
wherein N is the set of n unmanned aerial vehicles, with i ∈ N; S is the state space, composed of the state s of each unmanned aerial vehicle, s ∈ S; A = {A^(1), A^(2), …, A^(n)} denotes the joint action space of all unmanned aerial vehicles, and the action of the ith unmanned aerial vehicle satisfies a^(i) ∈ A^(i); T: p(s'|s, a) → [0, 1] is the state transition function of the environment, giving the probability of moving from state s to the next state s' after the joint action a = {a^(1), a^(2), …, a^(n)} is executed; R(s, a) = {r^(1), r^(2), …, r^(n)} is the joint reward obtained after executing the joint action a in state s; O = {O^(1), O^(2), …, O^(n)} denotes the joint observation space of all unmanned aerial vehicles; Z: o^(i) = Z(s, i) denotes the observation model of each unmanned aerial vehicle in state s, with o^(i) ∈ O^(i), where o^(i) is the observation of the ith unmanned aerial vehicle and O^(i) is its observation space; γ ∈ [0, 1] is the discount factor, balancing the relative importance of long-term rewards against the current reward;
Each unmanned aerial vehicle makes action decisions according to its own local information; the action policy of the ith unmanned aerial vehicle is π^(i): O^(i) → A^(i), and the joint action policy of all unmanned aerial vehicles is expressed as π = {π^(1), π^(2), …, π^(n)}; when the joint observation of the unmanned aerial vehicles is o and the joint action policy is π, the state value function is expressed as V^π(o) = E_π[Σ_t γ^t r_t | o], and when the joint action a is executed, the action value function is expressed as Q^π(o, a) = E_π[Σ_t γ^t r_t | o, a], where E_π denotes the expectation under the joint action policy π;
step 2.2: the unmanned aerial vehicle cluster adopts the multi-agent deep deterministic policy gradient algorithm to learn the unmanned aerial vehicle action policy;
For each agent i, during centralized training the Critic network receives the state-action pairs of all agents, while during decentralized execution the Actor network receives only its own local observed state to make action decisions;
The joint observation state vector of the unmanned aerial vehicle cluster is defined as o = [o_1, o_2, …, o_n] and the joint action vector as a = [a_1, a_2, …, a_n]; further, θ = {θ_1, θ_2, …, θ_n} denotes the parameter set of the unmanned aerial vehicle cluster action policies, and μ = {μ_1, μ_2, …, μ_n} denotes the set of joint action policies of all unmanned aerial vehicles;
For the ith unmanned aerial vehicle, the Actor network is updated along the gradient of the expected return J(μ_i) = E[R_i], as shown in the following formula:
∇_{θ_i} J(μ_i) = E_{o, a ~ D}[ ∇_{θ_i} μ_i(o_i) ∇_{a_i} Q_i^μ(o, a_1, …, a_n) |_{a_i = μ_i(o_i)} ]
wherein J denotes the expected return, E denotes the expectation, R_i denotes the reward of the ith unmanned aerial vehicle, a_{-i} = [a_1, …, a_{i-1}, a_{i+1}, …, a_N] denotes the joint action of all agents except agent i, D is the experience replay buffer recording the transitions of all unmanned aerial vehicles, Q_i^μ(o, a) is the centralized state-action value function whose input comprises the joint observation state o of the unmanned aerial vehicle cluster and the joint action a of all agents, and μ_i denotes the deterministic policy of the ith unmanned aerial vehicle;
The output of the Critic network is the centralized state-action value function Q_i^μ(o, a), and its loss function is defined as follows:
L(θ_i) = E_{o, a, r, o'}[ (Q_i^μ(o, a) - y)^2 ],  y = r_i + γ Q_i^{μ'}(o', a'_1, …, a'_n) |_{a'_j = μ'_j(o'_j)}
wherein Q_i^{μ'} is the value function output by the critic target network, y is the target value, o' is the observed state of the unmanned aerial vehicle at the next time step, a' is the action of the unmanned aerial vehicle at the next time step, and the target networks are periodically soft-updated from the parameters of the latest online networks;
The input of each unmanned aerial vehicle consists of an action vector and an observation state vector; the observation state vector contains three kinds of state information, namely:
(1) The state information of the current unmanned aerial vehicle;
The observation state vector of the ith unmanned aerial vehicle at time t contains s_{i,t}, which represents the state and action information of the unmanned aerial vehicle itself, the situation feature information of the other unmanned aerial vehicles in the same team, and e_{j,t}, the coordinate information of the target areas, with j ∈ m, where m is the set of target areas; the ith unmanned aerial vehicle can obtain its own position coordinates (x_i, y_i), speed (v_{x,i}, v_{y,i}) and heading angle, which together constitute the self state and action information s_{i,t};
(2) The situation feature vector information of the other unmanned aerial vehicles;
The unmanned aerial vehicle observes the state feature information of its teammates; the corresponding state feature vector therefore contains n-1 subvectors, defined as follows:
wherein the kth subvector is the observation state subvector of the kth unmanned aerial vehicle, as follows:
wherein a_{k,-1} is the action of the kth unmanned aerial vehicle at the previous moment, k ∈ N, i ≠ k; d_{i,k,t} represents the current relative distance between the ith unmanned aerial vehicle and its kth teammate, described by the 2-norm as follows:
wherein the superscript H denotes the conjugate transpose, p_{i,t} denotes the two-dimensional coordinate vector of the ith unmanned aerial vehicle at time t, and p_{k,t} denotes the two-dimensional coordinate vector of the kth unmanned aerial vehicle at time t;
(3) The coordinate vector information of the target areas;
The unmanned aerial vehicle observes the state feature information of the target areas and obtains the observation vector e_{j,t}, which contains m subvectors defined as follows:
wherein the jth subvector represents the observation of the jth target, as follows:
wherein d_{i,j,t} represents the distance between the unmanned aerial vehicle and the target, and x_{j,t}, y_{j,t} represent the position of the target;
step 3: the dependency relationship between unmanned aerial vehicles is measured by a pointwise mutual information method, and the cooperative relationship between unmanned aerial vehicles is improved by maximizing the pointwise mutual information;
The dependency relationship between unmanned aerial vehicles is captured using positive pointwise mutual information (PPMI), and the dependency index d^(ik) between the ith unmanned aerial vehicle and the kth unmanned aerial vehicle is expressed as:
wherein l denotes the local information of each unmanned aerial vehicle, b_{i,t} denotes the boundary information of the ith unmanned aerial vehicle, b_{k,t} denotes the boundary information of the kth unmanned aerial vehicle, i ≠ k, and p(·) denotes the joint distribution function;
The positive pointwise mutual information PPMI(l^(i), a^(i), l^(k), a^(k)) is estimated with a neural network f_w(l^(i), a^(i), l^(k), a^(k)) trained by maximizing the Jensen-Shannon mutual information I_JS(L^(i), A^(i), L^(k), A^(k)), where L^(i) denotes all joint observations of unmanned aerial vehicle i, A^(i) denotes all joint actions of unmanned aerial vehicle i, L^(k) denotes all joint observations of unmanned aerial vehicle k, and A^(k) denotes all joint actions of unmanned aerial vehicle k;
step 4: obtaining the optimal pointwise mutual information estimate by maximizing the mutual information, and redefining the reward function;
A reward function is designed to encourage the unmanned aerial vehicles to reach the targets while minimizing the travel distance; the set d of distances from each unmanned aerial vehicle to the targets is computed, and the travel reward r_1^(i) of the unmanned aerial vehicle is:
r_1^(i) = -min(d)
When the observation regions of two unmanned aerial vehicles overlap, this is called a pseudo collision; when the unmanned aerial vehicles themselves collide, this is called a real collision; during training, the number of real collisions is reduced by penalizing the occurrence of pseudo collisions, and the collision reward is as follows:
wherein dist_2 denotes the distance between two unmanned aerial vehicles and σ denotes the observation radius of the unmanned aerial vehicle;
Each unmanned aerial vehicle is set to fly within the task area; the minimum distance from the unmanned aerial vehicle to the boundary is defined as d_min, and the boundary reward r_3^(i) is as follows:
The repeated-tracking reward between unmanned aerial vehicles is as follows:
Accordingly, the per-step environmental reward of the unmanned aerial vehicle is:
In the multi-agent path planning problem, it is assumed that if an unmanned aerial vehicle can cooperate with an adjacent unmanned aerial vehicle to obtain a better reward, then that unmanned aerial vehicle should be appropriately encouraged; conversely, when its neighbors obtain a negative reward, the unmanned aerial vehicle should bear part of the responsibility and therefore be penalized; accordingly, the immediate reward r^(i) of each unmanned aerial vehicle is modified as follows:
wherein the per-step environmental reward defined above is the private reward obtained from the environment, r_c^(i) is the reciprocal reward from the other unmanned aerial vehicles, and the weight coefficient α balances the private reward against the reciprocal reward; when α = 0 the unmanned aerial vehicle is completely selfish, which is the initial setting of reinforcement learning; when α = 1 the unmanned aerial vehicle is completely selfless; when 0 < α < 1 it not only maximizes its own expectation but is also encouraged to cooperate with the other unmanned aerial vehicles;
The reciprocal reward r_c^(i) that unmanned aerial vehicle i receives from the other unmanned aerial vehicles is expressed as:
wherein N(i) denotes the neighbors of unmanned aerial vehicle i, d^(ik) denotes the dependency between unmanned aerial vehicle i and unmanned aerial vehicle k, and the remaining term is the normalization coefficient;
Step 5: and (3) repeatedly executing the steps 2 to 4, and calculating the task completion rate of the unmanned aerial vehicle cluster until the task completion rate reaches a set value, so as to complete unmanned aerial vehicle cluster path planning simulation.
The beneficial effects of the above technical solution are as follows: the MADDPG-R-based unmanned aerial vehicle cluster path planning simulation method provided by the invention introduces a reciprocal-reward method based on the multi-agent deep deterministic policy gradient algorithm (MADDPG-R), so that the unmanned aerial vehicle cluster can learn cooperation strategies in a decentralized manner, cooperation among unmanned aerial vehicles is better improved, and richer forms of cooperative behavior emerge within the unmanned aerial vehicle cluster.
Drawings
Fig. 1 is a flowchart of the unmanned aerial vehicle cluster path planning simulation method based on MADDPG-R according to an embodiment of the present invention.
Detailed Description
The following describes in further detail the embodiments of the present invention with reference to the drawings and examples. The following examples are illustrative of the invention and are not intended to limit the scope of the invention.
In this embodiment, the unmanned aerial vehicle cluster path planning simulation method based on MADDPG-R, as shown in FIG. 1, comprises the following steps:
step 1: setting a scene of unmanned aerial vehicle cluster air-to-air search, and randomly setting the unmanned aerial vehicle cluster and target area positions;
The unmanned aerial vehicle cluster is set to comprise n unmanned aerial vehicles; the n unmanned aerial vehicles randomly search for m target points, and during the search an unmanned aerial vehicle may select one or more target points for coverage monitoring; the unmanned aerial vehicles fly in a two-dimensional plane at a speed not exceeding 10 km/h, and, in order to better match the actual behavior of unmanned aerial vehicles, the action of each unmanned aerial vehicle is a continuous random variable.
In the unmanned aerial vehicle cluster air-air search scene, a particle dynamics model is endowed for each unmanned aerial vehicle, and the following formula is shown:
wherein phi represents the roll angle of the unmanned aerial vehicle, r φ Represents the rolling angle speed, F represents the driving force,indicating heading angular velocity,/-, and>representing the course angle of the unmanned aerial vehicle, m is the mass of the unmanned aerial vehicle, v x ,v y The speed components of the unmanned aerial vehicle in the two-dimensional plane X axis and the Y axis are respectively, X and Y represent the positions of the unmanned aerial vehicle in the two-dimensional plane X axis and the Y axis, and dt represents a differential variable about time t;
the motion space of the unmanned plane is set to be a continuous space, and the motion is expressed as a two-dimensional velocity vector v x ,v y ]The scene allows the drone to move at variable speeds in any direction;
in order to effectively simulate the motion behavior of an actual unmanned aerial vehicle, the motion of the unmanned aerial vehicle is set to be a vector consisting of driving force and rolling angle speedThe flight speed and heading angle of the unmanned aerial vehicle at the time t are obtained by the following formula:
wherein v represents the speed of the unmanned aerial vehicle, i represents the number of the unmanned aerial vehicle, i=1, 2, …, n, Δt represents the time interval;
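As an illustration of the kinematics described above, the following Python sketch advances one unmanned aerial vehicle by a single time step. The patent's exact dynamics equations are not reproduced in this text, so the bank-to-turn relation between roll angle and heading rate, the gravity constant G, and the speed cap are assumptions made only for this sketch.

```python
import math

G = 9.81             # gravitational acceleration, assumed for the bank-to-turn relation
V_MAX = 10.0 / 3.6   # speed cap in m/s (10 km/h, per the embodiment)

def step_uav(state, action, dt=0.1, mass=1.0):
    """Advance one UAV by dt.

    state  = (x, y, v, heading, roll)   position, speed, heading angle, roll angle
    action = (force, roll_rate)         driving force F and roll angular velocity r_phi
    """
    x, y, v, heading, roll = state
    force, roll_rate = action

    roll = roll + roll_rate * dt                                   # integrate the roll rate
    heading = heading + (G / max(v, 1e-3)) * math.tan(roll) * dt   # assumed bank-to-turn heading rate
    v = min(max(v + (force / mass) * dt, 0.0), V_MAX)              # force-driven speed, capped

    vx, vy = v * math.cos(heading), v * math.sin(heading)          # velocity components v_x, v_y
    return (x + vx * dt, y + vy * dt, v, heading, roll)
```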
The position of each target in the scene is random, and the unmanned aerial vehicle cannot acquire the position information of the targets in advance; the unmanned aerial vehicle can observe a target, and when a target lies within its monitoring range, the unmanned aerial vehicle can perceive and monitor it; meanwhile, the unmanned aerial vehicle can receive communication information from other adjacent unmanned aerial vehicles, but it cannot determine the number and positions of the remaining targets;
One unmanned aerial vehicle can monitor several target points simultaneously; since there is no specific target allocation, each unmanned aerial vehicle keeps the targets it has found within its perception range and searches for as many additional target points as possible; in addition, the unmanned aerial vehicle should avoid collisions and avoid flying out of the boundary in order to satisfy the safety constraints;
step 2: constructing an unmanned aerial vehicle cluster air-to-air search model based on reinforcement learning by using a reinforcement learning algorithm;
step 2.1: the decision process for setting up a cluster of unmanned aerial vehicles is defined using a Partially Observable Markov Decision Process (POMDP), each unmanned aerial vehicle can only acquire local observation information and cannot acquire global state, at each time step, each unmanned aerial vehicle makes its action decision according to its local observation information, and all unmanned aerial vehicles perform their joint actions to update the environment;
The unmanned aerial vehicle cluster is set to comprise n unmanned aerial vehicles, and the partially observable Markov decision process is defined as the tuple (N, S, A, T, R, O, Z, γ);
wherein N is the set of n unmanned aerial vehicles, with i ∈ N; S is the state space, composed of the state s of each unmanned aerial vehicle, s ∈ S; A = {A^(1), A^(2), …, A^(n)} denotes the joint action space of all unmanned aerial vehicles, and the action of the ith unmanned aerial vehicle satisfies a^(i) ∈ A^(i); T: p(s'|s, a) → [0, 1] is the state transition function of the environment, giving the probability of moving from state s to the next state s' after the joint action a = {a^(1), a^(2), …, a^(n)} is executed; R(s, a) = {r^(1), r^(2), …, r^(n)} is the joint reward obtained after executing the joint action a in state s; O = {O^(1), O^(2), …, O^(n)} denotes the joint observation space of all unmanned aerial vehicles; Z: o^(i) = Z(s, i) denotes the observation model of each unmanned aerial vehicle in state s, with o^(i) ∈ O^(i), where o^(i) is the observation of the ith unmanned aerial vehicle and O^(i) is its observation space; γ ∈ [0, 1] is the discount factor, balancing the relative importance of long-term rewards against the current reward;
Each unmanned aerial vehicle makes action decisions according to its own local information; the action policy of the ith unmanned aerial vehicle is π^(i): O^(i) → A^(i), and the joint action policy of all unmanned aerial vehicles is expressed as π = {π^(1), π^(2), …, π^(n)}; when the joint observation of the unmanned aerial vehicles is o and the joint action policy is π, the state value function is expressed as V^π(o) = E_π[Σ_t γ^t r_t | o], and when the joint action a is executed, the action value function is expressed as Q^π(o, a) = E_π[Σ_t γ^t r_t | o, a], where E_π denotes the expectation under the joint action policy π;
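A minimal sketch of the per-time-step decentralized decision process defined above: each unmanned aerial vehicle maps only its local observation to an action through its own policy, the joint action is applied to the environment, and new local observations and rewards are returned. The `env` and `policies` objects are hypothetical stand-ins for illustration, not interfaces defined in the patent.

```python
def run_episode(env, policies, max_steps=200):
    """Decentralized execution: UAV i acts only on its local observation o_i."""
    obs = env.reset()                                           # one local observation per UAV
    rewards = None
    for _ in range(max_steps):
        joint_action = [pi(o) for pi, o in zip(policies, obs)]  # a = {a^(1), ..., a^(n)}
        obs, rewards, done = env.step(joint_action)             # transition p(s'|s, a), joint reward
        if done:
            break
    return obs, rewards
```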
step 2.2: the unmanned aerial vehicle cluster adopts the multi-agent deep deterministic policy gradient (MADDPG) algorithm to learn the unmanned aerial vehicle action policy;
In independent learning, the environment is non-stationary from each agent's point of view. An agent that evaluates and decides only through its local observations may obtain an inaccurate critic network and an actor network with poor decision-making ability. However, during simulated training the agents can usually obtain additional information and communicate without obstruction, which alleviates the non-stationarity of the environment. Therefore, a centralized training with decentralized execution (CTDE) framework is adopted, which is currently the most popular paradigm in multi-agent reinforcement learning.
The multi-agent deep deterministic policy gradient (MADDPG) algorithm was the first work to propose the CTDE training framework, applying it to the DDPG algorithm.
For each agent i (i.e., unmanned aerial vehicle), during centralized training, the Critic network receives State-action pairs (State-action pairs) from all agents, during decentralized execution, the Actor network receives only its own local observed State to make action decisions;
define the joint observation state vector o= [ o ] of unmanned aerial vehicle cluster 1 ,o 2 ,…,o n ]And joint motion vector a= [ a ] 1 ,a 2 ,…,a n ]The method comprises the steps of carrying out a first treatment on the surface of the Further, define θ= { θ 12 ,…,θ n And the parameterized set of unmanned plane cluster action strategies is shown as μ= { μ 12 ,…,μ n -a set of joint action policies for all unmanned aerial vehicles;
for the ith drone, a gradient of expected return is given, actor network J (μ i )=E[R i ]The gradient of (2) is shown in the following formula:
wherein J represents a desire for rewards, E represents a desire, R i Representing a reward for the ith unmanned aerial vehicle, a -i =[a 1 ,…,a i-1 ,a i+1 ,…,a N ]Representing joint motion vectors for all agents except agent i, D is an empirical buffer recording all unmanned aerial vehicles,is a centralized state action cost function, and inputs the combined observation state o comprising the unmanned aerial vehicle cluster and the combined actions a, mu of all the intelligent agents i A deterministic strategy of the ith unmanned aerial vehicle is represented;
the output of the Critic network is a centralized state action cost functionThe loss function is defined as follows:
wherein ,the method comprises the steps that (1) the method is a cost function output by a commentator-target network, y is a target cost function, o 'is an observation state of a next time step unmanned aerial vehicle, a' is an action of the next time step unmanned aerial vehicle, u is a parameter of the actor-target network, and u is updated softly by periodically using the latest commentator network parameter mu;
The input of each unmanned aerial vehicle consists of an action vector and an observation state vector; the observation state vector contains three kinds of state information, namely:
(1) The state information of the current unmanned aerial vehicle;
The observation state vector of the ith unmanned aerial vehicle at time t contains s_{i,t}, which represents the state and action information of the unmanned aerial vehicle itself, the situation feature information of the other unmanned aerial vehicles in the same team, and e_{j,t}, the coordinate information of the target areas, with j ∈ m, where m is the set of target areas; the ith unmanned aerial vehicle can obtain its own position coordinates (x_i, y_i), speed (v_{x,i}, v_{y,i}) and heading angle, which together constitute the self state and action information s_{i,t};
(2) The situation feature vector information of the other unmanned aerial vehicles;
The unmanned aerial vehicle observes the state feature information of its teammates; the corresponding state feature vector therefore contains n-1 subvectors, defined as follows:
wherein the kth subvector is the observation state subvector of the kth unmanned aerial vehicle, as follows:
wherein a_{k,-1} is the action of the kth unmanned aerial vehicle at the previous moment, k ∈ N, i ≠ k; d_{i,k,t} represents the current relative distance between the ith unmanned aerial vehicle and its kth teammate, described by the 2-norm as follows:
wherein the superscript H denotes the conjugate transpose, p_{i,t} denotes the two-dimensional coordinate vector of the ith unmanned aerial vehicle at time t, and p_{k,t} denotes the two-dimensional coordinate vector of the kth unmanned aerial vehicle at time t;
(3) The coordinate vector information of the target areas;
The unmanned aerial vehicle observes the state feature information of the target areas and obtains the observation vector e_{j,t}, which contains m subvectors defined as follows:
wherein the jth subvector represents the observation of the jth target, as follows:
wherein d_{i,j,t} represents the distance between the unmanned aerial vehicle and the target, and x_{j,t}, y_{j,t} represent the position of the target;
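A sketch of how the observation state vector described above can be assembled, concatenating the UAV's own state, one subvector per teammate (last action and relative distance), and one subvector per target (distance and position). The dictionary-based interface and the exact ordering of the entries are assumptions for illustration only.

```python
import numpy as np

def build_observation(i, uavs, targets):
    """Assemble o_{i,t} = [own state, teammate subvectors, target subvectors].

    uavs:    list of dicts with numpy values 'pos' (2,), 'vel' (2,), 'heading', 'last_action' (2,)
    targets: list of dicts with numpy value 'pos' (2,)
    """
    me = uavs[i]
    own = np.concatenate([me['pos'], me['vel'], [me['heading']]])    # (x_i, y_i, v_x, v_y, heading)

    teammates = []
    for k, other in enumerate(uavs):
        if k == i:
            continue
        d_ik = np.linalg.norm(me['pos'] - other['pos'])              # 2-norm relative distance d_{i,k,t}
        teammates.append(np.concatenate([other['last_action'], [d_ik]]))

    goals = []
    for tgt in targets:
        d_ij = np.linalg.norm(me['pos'] - tgt['pos'])                # distance to the j-th target
        goals.append(np.concatenate([[d_ij], tgt['pos']]))           # (d_{i,j,t}, x_{j,t}, y_{j,t})

    return np.concatenate([own] + teammates + goals)
```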
step 3: the dependency relationship between unmanned aerial vehicles is measured by a pointwise mutual information method, and the cooperative relationship between unmanned aerial vehicles is improved by maximizing the pointwise mutual information;
The dependency relationship between unmanned aerial vehicles is captured using positive pointwise mutual information (PPMI), and the dependency index d^(ik) between the ith unmanned aerial vehicle and the kth unmanned aerial vehicle is expressed as:
wherein l denotes the local information of each unmanned aerial vehicle, b_{i,t} denotes the boundary information of the ith unmanned aerial vehicle, b_{k,t} denotes the boundary information of the kth unmanned aerial vehicle, i ≠ k, and p(·) denotes the joint distribution function;
The positive pointwise mutual information PPMI(l^(i), a^(i), l^(k), a^(k)) is estimated with a neural network f_w(l^(i), a^(i), l^(k), a^(k)) trained by maximizing the Jensen-Shannon mutual information I_JS(L^(i), A^(i), L^(k), A^(k)), where L^(i) denotes all joint observations of unmanned aerial vehicle i, A^(i) denotes all joint actions of unmanned aerial vehicle i, L^(k) denotes all joint observations of unmanned aerial vehicle k, and A^(k) denotes all joint actions of unmanned aerial vehicle k;
Let the joint distribution of two random variables (X, Y) be P(X, Y) and the marginal distributions be P(X) and P(Y); the mutual information (MI) of (X, Y) is expressed as:
I(X; Y) = D_KL( P(X, Y) || P(X)P(Y) ) = Σ_{x,y} p(x, y) log( p(x, y) / (p(x)p(y)) )
where D_KL denotes the KL divergence and log denotes the logarithm; the stronger the correlation between X and Y, the larger I(X; Y), and if X and Y are independent, I(X; Y) = 0. For a particular random event pair (x, y), pointwise mutual information (PMI) is used to measure the degree of dependency between them:
PMI(x, y) = log( p(x, y) / (p(x)p(y)) )
Mutual information and pointwise mutual information are symmetric, and I(X; Y) is the weighted sum of the pointwise mutual information of all possible pairs (x, y); in order to avoid log 0 = -∞, positive pointwise mutual information (PPMI) is defined as:
PPMI(x, y) = max( PMI(x, y), 0 )
Given the environmental reward of each unmanned aerial vehicle, one main difficulty in calculating the reciprocal reward is calculating the dependency index between two cooperating unmanned aerial vehicles. In the unmanned aerial vehicle swarm, each unmanned aerial vehicle is only partially observable and makes action decisions according to its local information, including communication with neighbors, observation of targets, boundary information, state information and the like. Since all unmanned aerial vehicles coexist in one environment and the global state is unknown, the local information and actions of two adjacent unmanned aerial vehicles may not be independent. Pointwise mutual information (PMI) is used to capture the dependency relationship between unmanned aerial vehicles, and the dependency index d^(ik) between the ith unmanned aerial vehicle and the kth unmanned aerial vehicle is then expressed as:
wherein l denotes the local information of each unmanned aerial vehicle, b_{i,t} denotes the boundary information of the ith unmanned aerial vehicle, b_{k,t} denotes the boundary information of the kth unmanned aerial vehicle, i ≠ k, and p(·) denotes the joint distribution function;
Since pointwise mutual information is not easy to obtain directly in a complex environment, the best pointwise mutual information (PMI) estimate can be obtained by maximizing the mutual information (MMI). Mutual information can be expressed as the expectation of a divergence between the joint probability distribution and the product of the marginal probability distributions, and, depending on the measure used, a number of works have proposed maximizing variational lower bounds of mutual information; among them, I_JS estimates the mutual information using the JS divergence. Inspired by mutual information neural estimation, this embodiment uses a neural network to estimate the pointwise mutual information. For simplicity and convenience of derivation, X_1 and X_2 are used in place of (l^(i), a^(i)) and (l^(k), a^(k)), and the pointwise mutual information is then estimated using the following lemma;
For random variables X_1 and X_2, their JS mutual information is defined as:
and its variational lower bound is:
wherein f_w(x_1, x_2) is a fitting function parameterized by w and sp(x) = log(1 + e^x); when the variational lower bound reaches its maximum, a best-fit function can be found:
This is proved as follows: taking the first derivative of the lower bound with respect to f_w(x_1, x_2) and setting it to zero gives the stationary point, and the second derivative with respect to f_w(x_1, x_2) is negative. Therefore I_JS has a unique maximum with respect to f_w, and at this maximum the optimal fitting function equals the pointwise mutual information of the random event pair (x_1, x_2). A neural network f_w(l^(i), a^(i), l^(k), a^(k)) can therefore be used to estimate the pointwise mutual information of (l^(i), a^(i), l^(k), a^(k)) by maximizing the JS mutual information I_JS(L^(i), A^(i), L^(k), A^(k)), where L^(i) denotes all joint observations of unmanned aerial vehicle i, A^(i) denotes all joint actions of unmanned aerial vehicle i, L^(k) denotes all joint observations of unmanned aerial vehicle k, and A^(k) denotes all joint actions of unmanned aerial vehicle k.
In practice, when there is no dependency between two random events the minimum pointwise mutual information is 0, but the output of f_w may be less than 0; the invention therefore uses positive pointwise mutual information (PPMI) to capture the dependency between unmanned aerial vehicles and to reshape the reward function accordingly.
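The following sketch estimates the pointwise mutual information with a small network f_w trained on the Jensen-Shannon lower bound described above, using sp(x) = log(1 + e^x), and clips negative estimates to obtain PPMI. The network architecture and the in-batch shuffling used to approximate the product of marginals are assumptions; x1 stands for (l^(i), a^(i)) and x2 for (l^(k), a^(k)).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F_nn

class PMIEstimator(nn.Module):
    """f_w(l_i, a_i, l_k, a_k): scalar statistic trained on the JS lower bound."""
    def __init__(self, in_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1))

    def forward(self, x1, x2):
        return self.net(torch.cat([x1, x2], dim=-1))

def js_mi_lower_bound(f, x1, x2):
    """E_p(x1,x2)[-sp(-f)] - E_p(x1)p(x2)[sp(f)]; the product of marginals is
    approximated by shuffling x2 within the batch."""
    joint = f(x1, x2)
    marginal = f(x1, x2[torch.randperm(x2.shape[0])])
    return (-F_nn.softplus(-joint)).mean() - F_nn.softplus(marginal).mean()

def ppmi(f, x1, x2):
    """Positive pointwise mutual information: negative estimates are clipped to 0."""
    with torch.no_grad():
        return torch.clamp(f(x1, x2), min=0.0)

# Training step (maximize the bound, i.e. minimize its negative), with 'opt' an optimizer:
# loss = -js_mi_lower_bound(f, x1_batch, x2_batch); opt.zero_grad(); loss.backward(); opt.step()
```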
Step 4: obtaining optimal point-to-point mutual information estimation by maximizing mutual information, and redefining a reward function;
to encourage the drones to learn the collaboration strategy, each drone must not only reach the already perceived target well, but also avoid repetitive perception to maximize perceived target. In addition, they should avoid flying out of the mission scene, ensuring flight safety. These expectations are represented by potential rewards modeling.
Therefore, it is necessary to design a rational reward for the drone to guide the learning process. Designing a reward function to encourage the drones to reach the target and minimize the drone travel distance, calculating a set d of distances from each drone to the target, where the drone travel reward r 1 (i) The following are provided:
r 1 (i) =(-min(d))
when two observation regions collide with each other, it is called a pseudo collision; also, when unmanned aerial vehicles collide with each other, it is referred to as a real collision; in training, the number of real collisions is reduced by punishing the occurrence of false collisions, collision rewardsThe following are provided:
wherein dist 2 Representing the distance between two unmanned aerial vehicles, sigma representing the radius of the unmanned aerial vehicle observation;
each unmanned aerial vehicle flies in the task area; when the drone is too close to the boundary, a portion of its perception range may fall outside the mission zone and the perception of the non-mission zone may be useless. Setting the minimum distance from the unmanned aerial vehicle to the boundary to be d min Boundary rewardsThe following are provided:
repeated tracking means that one target is tracked by multiple unmanned aerial vehicles at the same time, which causes resource waste and increases collision risk. When the relative distance between two unmanned aerial vehicles is greater than twice the perceived radius, repeated tracking does not occur. Thus, repeated tracking rewards between dronesThe following are provided:
thus, each step of environmental rewards for the droneThe method comprises the following steps: />
In the multi-agent path planning problem, it is assumed that if one drone can cooperate with an adjacent drone to get a better incentive, then the drone should be properly encouraged; otherwise, when the neighbors get a negative reward, the drone should take part of the responsibility and therefore be penalized; thus, the instant prize r of each unmanned plane is awarded (i) The modification is as follows:
wherein ,is a personal reward obtained from the environment, r c (i) Is a reciprocal reward from other unmanned aerial vehicles, the weight coefficient alpha balances between private rewards and reciprocal rewards; when α=0, the drone is fully selfish, which is the initial setting for reinforcement learning; whereas α=1, the drone is totally unbiased; when 0 < alpha < 1, not only is its expectations maximized, but also cooperation with other unmanned aerial vehicles is encouraged;
for each unmanned aerial vehicle, a reciprocal reward only exists when the unmanned aerial vehicle cooperates with other unmanned aerial vehicles, and a dependency relationship exists between the unmanned aerial vehicles. And the greater the degree of collaboration, the greater the dependency. Thus, reciprocal rewards are related not only to rewards of others, but also to their dependencies. Then, unmanned aerial vehicle i receives reciprocal rewards from other unmanned aerial vehiclesExpressed as:
wherein ,N(i) Representing the neighbors, d of unmanned plane i (ik) Representing the dependency between drone i and drone k,is a normalized coefficient, ++>
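A sketch of the reward shaping and reciprocal-reward mixing described above. The exact piecewise forms of the collision, boundary and repeated-tracking terms, the penalty magnitude, and the mixing formula r = (1 - alpha) * r_env + alpha * r_c are assumptions consistent with the text (alpha = 0 selfish, alpha = 1 selfless), since the corresponding formulas are not reproduced in this extraction.

```python
import numpy as np

def environmental_reward(dists_to_targets, dists_to_neighbors, dist_to_boundary,
                         sigma, d_min, penalty=1.0):
    """Per-step environmental reward of one UAV (travel + collision + boundary + repeat).

    The piecewise forms and the penalty magnitude are assumptions made for this sketch.
    """
    r_travel = -float(np.min(dists_to_targets))                      # r_1 = -min(d)
    close = np.asarray(dists_to_neighbors) < 2.0 * sigma             # observation regions overlap
    r_collision = -penalty * float(np.sum(close))                    # penalize pseudo collisions
    r_boundary = -penalty if dist_to_boundary < d_min else 0.0       # keep away from the boundary
    r_repeat = -penalty * float(np.any(close))                       # repeated tracking only possible within 2*sigma
    return r_travel + r_collision + r_boundary + r_repeat

def immediate_reward(r_env_i, r_env_neighbors, dependency, alpha=0.5):
    """Mix the private reward with the dependency-weighted reciprocal reward of the neighbors."""
    dependency = np.asarray(dependency, dtype=float)                 # d^(ik) from the PPMI estimator
    r_env_neighbors = np.asarray(r_env_neighbors, dtype=float)
    z = dependency.sum() + 1e-8                                      # normalization coefficient
    r_c = float((dependency * r_env_neighbors).sum() / z)            # reciprocal reward r_c^(i)
    return (1.0 - alpha) * r_env_i + alpha * r_c                     # assumed mixing form
```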
The contents of steps 2 to 4 above are collectively referred to as the multi-agent deep deterministic policy gradient algorithm with reciprocal rewards (MADDPG-R), by which the dependency relationship between unmanned aerial vehicles can finally be determined; the pseudo code of the algorithm is shown in Table 1.
TABLE 1 Multi-agent deep deterministic policy gradient algorithm with reciprocal rewards
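The pseudo code referred to as Table 1 is not reproduced in this text. The following skeleton is a hedged reconstruction of the overall MADDPG-R loop from steps 2 to 5; all callables are injected placeholders (for example the reward reshaping and agent-update routines sketched earlier), the episode count of 15000 matches the embodiment, and the completion target value is an assumption.

```python
def train_maddpg_r(env, policies, update_agents, estimate_dependencies,
                   reshape_rewards, episodes=15000, completion_target=0.9):
    """Skeleton of the MADDPG-R loop: interact, reshape rewards with PPMI-based
    dependencies, update the actors/critics, and stop once the task completion
    rate reaches the set value."""
    buffer = []                                                   # shared experience buffer D
    for ep in range(episodes):
        obs, done = env.reset(), False
        while not done:
            actions = [pi(o) for pi, o in zip(policies, obs)]     # decentralized execution
            next_obs, env_rewards, done = env.step(actions)
            deps = estimate_dependencies(obs, actions)            # d^(ik) via PPMI (step 3)
            rewards = reshape_rewards(env_rewards, deps)          # reciprocal rewards (step 4)
            buffer.append((obs, actions, rewards, next_obs))
            obs = next_obs
        update_agents(buffer)                                     # centralized training (step 2.2)
        if getattr(env, "task_completion_rate", lambda: 0.0)() >= completion_target:
            break                                                 # step 5 stopping criterion
    return policies
```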
Step 5: and (3) repeatedly executing the steps 2 to 4, and calculating the task completion rate of the unmanned aerial vehicle cluster until the task completion rate reaches a set value, so as to complete unmanned aerial vehicle cluster path planning simulation.
In this embodiment, after 15000 training rounds, the path planning results shown in Table 2 are obtained over the 100 rounds of the test phase:
Table 2 Comparison of path planning for different numbers of unmanned aerial vehicles
In this embodiment, the comparison between the baseline algorithm and the MADDPG-R algorithm of the invention for path planning in the same scene is shown in Table 3:
Table 3 Comparison of path planning for different algorithms
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some or all of the technical features thereof can be replaced with equivalents; such modifications and substitutions do not depart from the spirit of the corresponding technical solutions, which are defined by the scope of the appended claims.

Claims (6)

1. An unmanned aerial vehicle cluster path planning simulation method based on MADDPG-R is characterized in that: the method comprises the following steps:
step 1: setting a scene of unmanned aerial vehicle cluster air-to-air search, and randomly setting the unmanned aerial vehicle cluster and target area positions;
step 2: constructing an unmanned aerial vehicle cluster air-to-air search model based on reinforcement learning by using a reinforcement learning algorithm;
step 2.1: setting a decision process of the unmanned aerial vehicle cluster to be defined by using a partially observable Markov decision process, wherein each unmanned aerial vehicle can only acquire local observation information and cannot acquire a global state, each unmanned aerial vehicle makes an action decision according to the local observation information in each time step, and all unmanned aerial vehicles execute combined actions to update the environment;
step 2.2: the unmanned aerial vehicle cluster adopts the multi-agent deep deterministic policy gradient algorithm to learn the unmanned aerial vehicle action policy;
step 3: the dependency relationship between unmanned aerial vehicles is measured by a pointwise mutual information method, and the cooperative relationship between unmanned aerial vehicles is improved by maximizing the pointwise mutual information;
step 4: obtaining the optimal pointwise mutual information estimate by maximizing the mutual information, and redefining the reward function;
step 5: repeatedly executing steps 2 to 4, and calculating the task completion rate of the unmanned aerial vehicle cluster until the task completion rate reaches the set value, so as to complete the unmanned aerial vehicle cluster path planning simulation.
2. The unmanned aerial vehicle cluster path planning simulation method based on MADDPG-R according to claim 1, wherein the unmanned aerial vehicle cluster path planning simulation method is characterized in that: the specific method of the step 1 is as follows:
the unmanned aerial vehicle cluster is set to comprise n unmanned aerial vehicles, the n unmanned aerial vehicles randomly search for m target points, one or more target points can be selected by the unmanned aerial vehicles for coverage monitoring in the searching process, the unmanned aerial vehicles fly at the speed not exceeding the set speed on a two-dimensional plane, and the action of each unmanned aerial vehicle is a continuous random variable;
In the unmanned aerial vehicle cluster air-to-air search scene, a particle dynamics model is assigned to each unmanned aerial vehicle, as shown in the following formula:
wherein φ represents the roll angle of the unmanned aerial vehicle, r_φ represents the roll angular velocity, F represents the driving force, the remaining state variables are the heading angular velocity and the heading angle of the unmanned aerial vehicle, m is the mass of the unmanned aerial vehicle, v_x and v_y are the velocity components of the unmanned aerial vehicle along the X and Y axes of the two-dimensional plane, x and y represent the positions of the unmanned aerial vehicle along the X and Y axes, and dt denotes the differential with respect to time t;
The action space of the unmanned aerial vehicle is set as a continuous space, and the action is expressed as a two-dimensional velocity vector [v_x, v_y]; the scene allows the unmanned aerial vehicle to move at a variable speed in any direction;
In order to effectively simulate the motion behavior of an actual unmanned aerial vehicle, the action of the unmanned aerial vehicle is set as a vector composed of the driving force and the roll angular velocity; the flight speed and heading angle of the unmanned aerial vehicle at time t are obtained by the following formula:
wherein v represents the speed of the unmanned aerial vehicle, i represents the index of the unmanned aerial vehicle, i = 1, 2, …, n, and Δt represents the time interval;
The position of each target in the scene is random, and the unmanned aerial vehicle cannot acquire the position information of the targets in advance; the unmanned aerial vehicle can observe a target, and when a target lies within its monitoring range, the unmanned aerial vehicle can perceive and monitor it; meanwhile, the unmanned aerial vehicle can receive communication information from other adjacent unmanned aerial vehicles, but it cannot determine the number and positions of the remaining targets;
One unmanned aerial vehicle can monitor several target points simultaneously; since there is no specific target allocation, each unmanned aerial vehicle keeps the targets it has found within its perception range and searches for as many additional target points as possible; in addition, the unmanned aerial vehicle should avoid collisions and avoid flying out of the boundary in order to satisfy the safety constraints.
3. The unmanned aerial vehicle cluster path planning simulation method based on MADDPG-R according to claim 2, wherein the unmanned aerial vehicle cluster path planning simulation method is characterized in that: the specific method of the step 2.1 is as follows:
The unmanned aerial vehicle cluster is set to comprise n unmanned aerial vehicles, and the partially observable Markov decision process is defined as the tuple (N, S, A, T, R, O, Z, γ);
wherein N is the set of n unmanned aerial vehicles, with i ∈ N; S is the state space, composed of the state s of each unmanned aerial vehicle, s ∈ S; A = {A^(1), A^(2), …, A^(n)} denotes the joint action space of all unmanned aerial vehicles, and the action of the ith unmanned aerial vehicle satisfies a^(i) ∈ A^(i); T: p(s'|s, a) → [0, 1] is the state transition function of the environment, giving the probability of moving from state s to the next state s' after the joint action a = {a^(1), a^(2), …, a^(n)} is executed; R(s, a) = {r^(1), r^(2), …, r^(n)} is the joint reward obtained after executing the joint action a in state s; O = {O^(1), O^(2), …, O^(n)} denotes the joint observation space of all unmanned aerial vehicles; Z: o^(i) = Z(s, i) denotes the observation model of each unmanned aerial vehicle in state s, with o^(i) ∈ O^(i), where o^(i) is the observation of the ith unmanned aerial vehicle and O^(i) is its observation space; γ ∈ [0, 1] is the discount factor, balancing the relative importance of long-term rewards against the current reward;
Each unmanned aerial vehicle makes action decisions according to its own local information; the action policy of the ith unmanned aerial vehicle is π^(i): O^(i) → A^(i), and the joint action policy of all unmanned aerial vehicles is expressed as π = {π^(1), π^(2), …, π^(n)}; when the joint observation of the unmanned aerial vehicles is o and the joint action policy is π, the state value function is expressed as V^π(o) = E_π[Σ_t γ^t r_t | o], and when the joint action a is executed, the action value function is expressed as Q^π(o, a) = E_π[Σ_t γ^t r_t | o, a], where E_π denotes the expectation under the joint action policy π.
4. The unmanned aerial vehicle cluster path planning simulation method based on MADDPG-R according to claim 3, wherein the unmanned aerial vehicle cluster path planning simulation method based on MADDPG-R is characterized in that: the specific method of the step 2.2 is as follows:
For each agent i, during centralized training the Critic network receives the state-action pairs of all agents, while during decentralized execution the Actor network receives only its own local observed state to make action decisions;
The joint observation state vector of the unmanned aerial vehicle cluster is defined as o = [o_1, o_2, …, o_n] and the joint action vector as a = [a_1, a_2, …, a_n]; further, θ = {θ_1, θ_2, …, θ_n} denotes the parameter set of the unmanned aerial vehicle cluster action policies, and μ = {μ_1, μ_2, …, μ_n} denotes the set of joint action policies of all unmanned aerial vehicles;
For the ith unmanned aerial vehicle, the Actor network is updated along the gradient of the expected return J(μ_i) = E[R_i], as shown in the following formula:
∇_{θ_i} J(μ_i) = E_{o, a ~ D}[ ∇_{θ_i} μ_i(o_i) ∇_{a_i} Q_i^μ(o, a_1, …, a_n) |_{a_i = μ_i(o_i)} ]
wherein J denotes the expected return, E denotes the expectation, R_i denotes the reward of the ith unmanned aerial vehicle, a_{-i} = [a_1, …, a_{i-1}, a_{i+1}, …, a_N] denotes the joint action of all agents except agent i, D is the experience replay buffer recording the transitions of all unmanned aerial vehicles, Q_i^μ(o, a) is the centralized state-action value function whose input comprises the joint observation state o of the unmanned aerial vehicle cluster and the joint action a of all agents, and μ_i denotes the deterministic policy of the ith unmanned aerial vehicle;
The output of the Critic network is the centralized state-action value function Q_i^μ(o, a), and its loss function is defined as follows:
L(θ_i) = E_{o, a, r, o'}[ (Q_i^μ(o, a) - y)^2 ],  y = r_i + γ Q_i^{μ'}(o', a'_1, …, a'_n) |_{a'_j = μ'_j(o'_j)}
wherein Q_i^{μ'} is the value function output by the critic target network, y is the target value, o' is the observed state of the unmanned aerial vehicle at the next time step, a' is the action of the unmanned aerial vehicle at the next time step, and the target networks are periodically soft-updated from the parameters of the latest online networks;
The input of each unmanned aerial vehicle consists of an action vector and an observation state vector; the observation state vector contains three kinds of state information, namely:
(1) The state information of the current unmanned aerial vehicle;
The observation state vector of the ith unmanned aerial vehicle at time t contains s_{i,t}, which represents the state and action information of the unmanned aerial vehicle itself, the situation feature information of the other unmanned aerial vehicles in the same team, and e_{j,t}, the coordinate information of the target areas, with j ∈ m, where m is the set of target areas; the ith unmanned aerial vehicle can obtain its own position coordinates (x_i, y_i), speed (v_{x,i}, v_{y,i}) and heading angle, which together constitute the self state and action information s_{i,t};
(2) The situation feature vector information of the other unmanned aerial vehicles;
The unmanned aerial vehicle observes the state feature information of its teammates; the corresponding state feature vector therefore contains n-1 subvectors, defined as follows:
wherein the kth subvector is the observation state subvector of the kth unmanned aerial vehicle, as follows:
wherein a_{k,-1} is the action of the kth unmanned aerial vehicle at the previous moment, k ∈ N, i ≠ k; d_{i,k,t} represents the current relative distance between the ith unmanned aerial vehicle and its kth teammate, described by the 2-norm as follows:
wherein the superscript H denotes the conjugate transpose, p_{i,t} denotes the two-dimensional coordinate vector of the ith unmanned aerial vehicle at time t, and p_{k,t} denotes the two-dimensional coordinate vector of the kth unmanned aerial vehicle at time t;
(3) The coordinate vector information of the target areas;
The unmanned aerial vehicle observes the state feature information of the target areas and obtains the observation vector e_{j,t}, which contains m subvectors defined as follows:
wherein the jth subvector represents the observation of the jth target, as follows:
wherein d_{i,j,t} represents the distance between the unmanned aerial vehicle and the target, and x_{j,t}, y_{j,t} represent the position of the target.
5. The unmanned aerial vehicle cluster path planning simulation method based on MADDPG-R according to claim 4, wherein the unmanned aerial vehicle cluster path planning simulation method based on MADDPG-R is characterized in that: the specific method of the step 3 is as follows:
The dependency relationship between unmanned aerial vehicles is captured using positive pointwise mutual information (PPMI), and the dependency index d^(ik) between the ith unmanned aerial vehicle and the kth unmanned aerial vehicle is expressed as:
wherein l denotes the local information of each unmanned aerial vehicle, b_{i,t} denotes the boundary information of the ith unmanned aerial vehicle, b_{k,t} denotes the boundary information of the kth unmanned aerial vehicle, i ≠ k, and p(·) denotes the joint distribution function;
The positive pointwise mutual information PPMI(l^(i), a^(i), l^(k), a^(k)) is estimated with a neural network f_w(l^(i), a^(i), l^(k), a^(k)) trained by maximizing the Jensen-Shannon mutual information I_JS(L^(i), A^(i), L^(k), A^(k)), where L^(i) denotes all joint observations of unmanned aerial vehicle i, A^(i) denotes all joint actions of unmanned aerial vehicle i, L^(k) denotes all joint observations of unmanned aerial vehicle k, and A^(k) denotes all joint actions of unmanned aerial vehicle k.
6. The unmanned aerial vehicle cluster path planning simulation method based on MADDPG-R according to claim 5, characterized in that the specific method of the step 4 is as follows:
a reward function is designed to encourage the unmanned aerial vehicles to reach the target while minimizing the travel distance; a set d of distances from each unmanned aerial vehicle to the target is calculated, and the travel reward r_1^(i) of the unmanned aerial vehicle is:
r_1^(i) = -min(d)
when the observation regions of two unmanned aerial vehicles collide with each other, it is called a pseudo collision; when the unmanned aerial vehicles themselves collide with each other, it is called a real collision; during training, the number of real collisions is reduced by penalizing the occurrence of pseudo collisions, and the collision reward r_2^(i) is defined accordingly, where dist_2 represents the distance between two unmanned aerial vehicles and σ represents the observation radius of the unmanned aerial vehicle;
each unmanned aerial vehicle is set to fly within the task area, the minimum allowed distance between the unmanned aerial vehicle and the boundary is defined as d_min, and the boundary reward r_3^(i) is defined accordingly;
a repeated-tracking reward among the unmanned aerial vehicles is defined accordingly;
thus, the per-step environmental reward of each unmanned aerial vehicle is obtained by combining the travel, collision, boundary and repeated-tracking rewards above;
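A minimal sketch of these shaping terms follows, assuming a square task area of side length area_size, a pseudo-collision penalty whenever two observation disks of radius σ overlap, a boundary penalty when the distance to the task-area boundary falls below d_min, a repeated-tracking penalty when two unmanned aerial vehicles track the same target, and a plain sum of the four terms; the travel term here uses the distance from the unmanned aerial vehicle to its nearest target. The constants and the exact combination are illustrative assumptions, not the patent's published formulas.

```python
import numpy as np

def environmental_reward(i, positions, target_positions, sigma, area_size, d_min,
                         tracked_targets):
    """Illustrative per-step environmental reward for UAV i (sum of four shaping terms)."""
    p_i = positions[i]

    # r1: travel reward, negative distance to the closest target
    d = [np.linalg.norm(p_i - t) for t in target_positions]
    r1 = -min(d)

    # r2: collision reward, penalize pseudo collisions (observation disks overlapping)
    r2 = 0.0
    for k, p_k in enumerate(positions):
        if k != i and np.linalg.norm(p_i - p_k) < 2.0 * sigma:
            r2 -= 1.0

    # r3: boundary reward, penalize flying closer than d_min to the task-area boundary
    dist_to_boundary = min(p_i[0], p_i[1], area_size - p_i[0], area_size - p_i[1])
    r3 = -1.0 if dist_to_boundary < d_min else 0.0

    # r4: repeated-tracking reward, penalize tracking a target another UAV already tracks
    r4 = -1.0 if tracked_targets.count(tracked_targets[i]) > 1 else 0.0

    return r1 + r2 + r3 + r4
```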
in the multi-agent path planning problem it is assumed that if one unmanned aerial vehicle can cooperate with a neighboring unmanned aerial vehicle to obtain a better reward, it should be encouraged accordingly; conversely, when its neighbors receive a negative reward, the unmanned aerial vehicle should bear part of the responsibility and therefore be penalized; thus, the instant reward r^(i) of each unmanned aerial vehicle is modified as:
r^(i) = (1 - α) · r_e^(i) + α · r_c^(i)
wherein r_e^(i) is the private reward obtained from the environment, r_c^(i) is the reciprocal reward from the other unmanned aerial vehicles, and the weight coefficient α balances the private and reciprocal rewards; when α = 0 the unmanned aerial vehicle is fully selfish, which is the default setting in reinforcement learning; when α = 1 the unmanned aerial vehicle is fully selfless; when 0 < α < 1 the unmanned aerial vehicle not only maximizes its own expected return but is also encouraged to cooperate with the other unmanned aerial vehicles;
the reciprocal reward r_c^(i) that unmanned aerial vehicle i receives from the other unmanned aerial vehicles is expressed in terms of its dependencies with its neighbors, wherein N(i) represents the neighbors of unmanned aerial vehicle i, d^(ik) represents the dependency between unmanned aerial vehicle i and unmanned aerial vehicle k, and a normalization coefficient normalizes the dependencies over N(i).
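For illustration, the sketch below blends private and reciprocal rewards, assuming that the reciprocal reward of unmanned aerial vehicle i is its neighbors' private rewards weighted by their normalized dependency indices d^(ik); this specific weighting and the helper names are assumptions of this example rather than the patent's published formula.

```python
import numpy as np

def mixed_rewards(private_rewards, dependency, neighbors, alpha=0.3):
    """Blend each UAV's private reward with a dependency-weighted reciprocal reward.

    private_rewards: array of shape (n,)   -- r_e^(i) from the environment
    dependency:      array of shape (n, n) -- non-negative d^(ik) (PPMI-based)
    neighbors:       dict i -> list of k   -- neighborhood N(i)
    """
    n = len(private_rewards)
    mixed = np.zeros(n)
    for i in range(n):
        ks = neighbors[i]
        if ks:
            w = np.array([dependency[i, k] for k in ks], dtype=float)
            w = w / w.sum() if w.sum() > 0 else np.full(len(ks), 1.0 / len(ks))
            r_c = float(np.dot(w, [private_rewards[k] for k in ks]))  # reciprocal reward
        else:
            r_c = 0.0
        # alpha = 0: fully selfish; alpha = 1: fully selfless
        mixed[i] = (1.0 - alpha) * private_rewards[i] + alpha * r_c
    return mixed

# Example: three UAVs, UAV 0 strongly depends on UAV 1
r_e = np.array([1.0, -0.5, 0.2])
d = np.array([[0.0, 0.8, 0.1],
              [0.8, 0.0, 0.3],
              [0.1, 0.3, 0.0]])
nbrs = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
print(mixed_rewards(r_e, d, nbrs, alpha=0.3))
```

With α = 0 the output equals the private rewards; increasing α shifts credit (and blame) toward the unmanned aerial vehicles each agent depends on most.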
CN202310721379.4A 2023-06-16 2023-06-16 Unmanned aerial vehicle cluster path planning simulation method based on MADDPG-R Pending CN116560409A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310721379.4A CN116560409A (en) 2023-06-16 2023-06-16 Unmanned aerial vehicle cluster path planning simulation method based on MADDPG-R

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310721379.4A CN116560409A (en) 2023-06-16 2023-06-16 Unmanned aerial vehicle cluster path planning simulation method based on MADDPG-R

Publications (1)

Publication Number Publication Date
CN116560409A true CN116560409A (en) 2023-08-08

Family

ID=87494869

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310721379.4A Pending CN116560409A (en) 2023-06-16 2023-06-16 Unmanned aerial vehicle cluster path planning simulation method based on MADDPG-R

Country Status (1)

Country Link
CN (1) CN116560409A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117270393A (en) * 2023-10-07 2023-12-22 重庆大学 Intelligent robot cluster cooperative control system
CN117270393B (en) * 2023-10-07 2024-05-17 重庆大学 Intelligent robot cluster cooperative control system
CN117420849A (en) * 2023-12-18 2024-01-19 山东科技大学 Marine unmanned aerial vehicle formation granularity-variable collaborative search and rescue method based on reinforcement learning
CN117420849B (en) * 2023-12-18 2024-03-08 山东科技大学 Marine unmanned aerial vehicle formation granularity-variable collaborative search and rescue method based on reinforcement learning

Similar Documents

Publication Publication Date Title
CN112465151A (en) Multi-agent federal cooperation method based on deep reinforcement learning
CN116560409A (en) Unmanned aerial vehicle cluster path planning simulation method based on MADDPG-R
CN112180967B (en) Multi-unmanned aerial vehicle cooperative countermeasure decision-making method based on evaluation-execution architecture
Zhou et al. Bayesian reinforcement learning for multi-robot decentralized patrolling in uncertain environments
CN110442129B (en) Control method and system for multi-agent formation
Teng et al. Adaptive computer-generated forces for simulator-based training
Hao et al. Independent generative adversarial self-imitation learning in cooperative multiagent systems
Wu et al. Distributed task allocation for multiple heterogeneous UAVs based on consensus algorithm and online cooperative strategy
CN114510012A (en) Unmanned cluster evolution system and method based on meta-action sequence reinforcement learning
Geng et al. Learning to cooperate in decentralized multi-robot exploration of dynamic environments
CN114679729B (en) Unmanned aerial vehicle cooperative multi-target detection method integrating radar communication
Sadhu et al. Aerial-DeepSearch: Distributed multi-agent deep reinforcement learning for search missions
Xia et al. Cooperative multi-target hunting by unmanned surface vehicles based on multi-agent reinforcement learning
Rastogi et al. Sample-efficient reinforcement learning via difference models
Tan et al. Proximal policy based deep reinforcement learning approach for swarm robots
Huang et al. Multi-UAV Collision Avoidance using Multi-Agent Reinforcement Learning with Counterfactual Credit Assignment
CN116757249A (en) Unmanned aerial vehicle cluster strategy intention recognition method based on distributed reinforcement learning
Huang et al. A deep reinforcement learning approach to preserve connectivity for multi-robot systems
Roldán et al. A proposal of multi-UAV mission coordination and control architecture
Yang Self-Adaptive Swarm System
Zu et al. Research on UAV path planning method based on improved HPO algorithm in multi-task environment
Yang et al. Understanding the Application of Utility Theory in Robotics and Artificial Intelligence: A Survey
Nguyen et al. Apprenticeship learning for continuous state spaces and actions in a swarm-guidance shepherding task
Khaleghi et al. Analysis of uav/ugv control strategies in a dddams-based surveillance system
Lu et al. Strategy Generation Based on DDPG with Prioritized Experience Replay for UCAV

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination