CN116560409A - Unmanned aerial vehicle cluster path planning simulation method based on MADDPG-R

Unmanned aerial vehicle cluster path planning simulation method based on MADDPG-R

Info

Publication number
CN116560409A
CN116560409A (application CN202310721379.4A)
Authority
CN
China
Prior art keywords: unmanned aerial vehicle, action, representing, state
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310721379.4A
Other languages
Chinese (zh)
Inventor
王尔申
宏晨
陈纪浩
屈力刚
李寒冰
徐嵩
王传云
崔璨
庞涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Songshan Laboratory
Shenyang Aerospace University
Original Assignee
Songshan Laboratory
Shenyang Aerospace University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Songshan Laboratory, Shenyang Aerospace University filed Critical Songshan Laboratory
Priority to CN202310721379.4A
Publication of CN116560409A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G05: CONTROLLING; REGULATING
    • G05D: SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D1/00: Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
    • G05D1/10: Simultaneous control of position or course in three dimensions
    • G05D1/101: Simultaneous control of position or course in three dimensions specially adapted for aircraft
    • G05D1/104: Simultaneous control of position or course in three dimensions specially adapted for aircraft involving a plurality of aircrafts, e.g. formation flying
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T: CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00: Road transport of goods or passengers
    • Y02T10/10: Internal combustion engine [ICE] based vehicles
    • Y02T10/40: Engine management systems


Abstract

The invention provides an unmanned aerial vehicle cluster path planning simulation method based on MADDPG-R and relates to the technical field of unmanned aerial vehicle cluster air-to-air search decision making. First, a scene of unmanned aerial vehicle cluster air-to-air search is set up, and the positions of the unmanned aerial vehicle cluster and of the target areas are set randomly. Then, an unmanned aerial vehicle cluster air-to-air search model based on reinforcement learning is constructed using a reinforcement learning algorithm. The dependency relationship between unmanned aerial vehicles is measured by a pointwise mutual information method, and the cooperative relationship between unmanned aerial vehicles is improved by maximizing the pointwise mutual information. The optimal pointwise mutual information estimate is then obtained by maximizing the mutual information, and the reward function is redefined accordingly. The above process is repeated and the task completion rate of the unmanned aerial vehicle cluster is calculated until it reaches a set value, completing the unmanned aerial vehicle cluster path planning simulation. The method enables the unmanned aerial vehicle cluster to learn cooperation strategies in a decentralized manner and better improves cooperation among the unmanned aerial vehicles.

Description

Unmanned aerial vehicle cluster path planning simulation method based on MADDPG-R
Technical Field
The invention relates to the technical field of unmanned aerial vehicle cluster air-to-air search decision making, in particular to an unmanned aerial vehicle cluster path planning simulation method based on MADDPG-R.
Background
Because unmanned aerial vehicles are of great significance for future air combat, the major military powers of the world are racing to develop them. With the advancement of technologies such as artificial intelligence, distributed systems and networked communication, and the marked improvement of airborne hardware, unmanned aerial vehicle clusters have attracted great attention and are being vigorously developed in many countries.
Unmanned aerial vehicle clusters are also regarded as a combat means capable of overturning future battlefield situations, so military powers are pushing unmanned aerial vehicle cluster technology and carrying out cluster flight tests, striving to bring unmanned aerial vehicle clusters that integrate new technologies into actual combat. Unmanned cluster intelligence is a critical technology that is sufficient to change warfare patterns and rules. Swarm intelligence in essence belongs to the field of bionics and derives from human observation of the collective behavior of animal groups, including the activities of bee colonies and ant colonies and the migration of birds. Intelligent unmanned aerial vehicle clusters are based on such biological swarm behaviors; through mutual information transmission and coordination, the unmanned aerial vehicles can complete various tasks in complex and dangerous environments.
In the process of searching multiple target points, the environment of the unmanned aerial vehicle cluster is often uncertain, incomplete and dynamic, and most existing methods cannot be extended well to decentralized unmanned aerial vehicle clusters because of computational complexity or the need for global information. With the development of unmanned aerial vehicle technology, unmanned aerial vehicle swarm cooperation has become a research hotspot. Through close collaboration, an unmanned aerial vehicle swarm can exhibit better coordination, intelligence and autonomy than a traditional multi-UAV system. Meanwhile, multi-target search in unknown environments has become an important application direction of unmanned aerial vehicle swarms.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides an unmanned aerial vehicle cluster path planning simulation method based on MADDPG-R, which realizes the planning simulation of unmanned aerial vehicle cluster paths and provides a method for the unmanned aerial vehicle cluster to collaboratively search for targets in a dynamic environment.
In order to solve the technical problems, the invention adopts the following technical scheme: the unmanned aerial vehicle cluster path planning simulation method based on MADDPG-R comprises the following steps:
step 1: setting a scene of unmanned aerial vehicle cluster air-to-air search, and randomly setting the unmanned aerial vehicle cluster and target area positions;
the unmanned aerial vehicle cluster is set to comprise n unmanned aerial vehicles, the n unmanned aerial vehicles randomly search m target points, one or more target points can be selected by the unmanned aerial vehicles to conduct coverage monitoring in the searching process, the unmanned aerial vehicles fly on a two-dimensional plane at a speed not exceeding a set speed, and the action of each unmanned aerial vehicle is a continuous random variable.
In the unmanned aerial vehicle cluster air-air search scene, a particle dynamics model is endowed for each unmanned aerial vehicle, and the following formula is shown:
wherein phi represents the roll angle of the unmanned aerial vehicle, r φ Represents the rolling angle speed, F represents the driving force,indicating heading angular velocity,/-, and>representing the course angle of the unmanned aerial vehicle, m is the mass of the unmanned aerial vehicle, v x ,v y The speed components of the unmanned aerial vehicle in the two-dimensional plane X axis and the Y axis are respectively, X and Y represent the positions of the unmanned aerial vehicle in the two-dimensional plane X axis and the Y axis, and dt represents a differential variable about time t;
the motion space of the unmanned plane is set to be a continuous space, and the motion is expressed as a two-dimensional velocity vector v x ,v y ]The scene allows the drone to move at variable speeds in any direction;
in order to effectively simulate the motion behavior of an actual unmanned aerial vehicle, the motion of the unmanned aerial vehicle is set to be a vector consisting of driving force and rolling angle speedThe flight speed and heading angle of the unmanned aerial vehicle at the time t are obtained by the following formula:
wherein v represents the speed of the unmanned aerial vehicle, i represents the number of the unmanned aerial vehicle, i=1, 2, …, n, Δt represents the time interval;
the position of the target in the scene is random, and the unmanned aerial vehicle cannot acquire the position information of the target in advance; the unmanned aerial vehicle can observe the target, and when the target is positioned in the monitoring range of the unmanned aerial vehicle, the unmanned aerial vehicle can sense and monitor the target; meanwhile, the unmanned aerial vehicle can receive communication information of other adjacent unmanned aerial vehicles, and the unmanned aerial vehicle cannot determine the quantity and position information of other targets;
one unmanned plane can monitor a plurality of target points simultaneously, and as no specific target allocation exists, the unmanned plane always keeps the targets in the perception range and searches more target points as much as possible; in addition, the drone should avoid collisions and fly out boundaries to meet safety constraints;
step 2: constructing an unmanned aerial vehicle cluster air-to-air search model based on reinforcement learning by using a reinforcement learning algorithm;
step 2.1: setting a decision process of the unmanned aerial vehicle cluster to be defined by using a partially observable Markov decision process, wherein each unmanned aerial vehicle can only acquire local observation information and cannot acquire a global state, each unmanned aerial vehicle makes an action decision according to the local observation information in each time step, and all unmanned aerial vehicles execute combined actions to update the environment;
The unmanned aerial vehicle cluster is set to comprise n unmanned aerial vehicles, and the partially observable Markov decision process is defined as the tuple (N, S, A, T, R, O, Z, γ);
wherein N is the set of n unmanned aerial vehicles, with i ∈ N; S is the state space, composed of the state s of each unmanned aerial vehicle, s ∈ S; A = {A^(1), A^(2), …, A^(n)} denotes the joint action space of all unmanned aerial vehicles, and the action of the ith unmanned aerial vehicle satisfies a^(i) ∈ A^(i); T: p(s'|s, a) → [0, 1] is the state transition function of the environment, giving the probability of moving from state s to the next state s' after the joint action a = {a^(1), a^(2), …, a^(n)} is executed; R(s, a) = {r^(1), r^(2), …, r^(n)} is the joint reward obtained after executing the joint action a in state s; O = {O^(1), O^(2), …, O^(n)} denotes the joint observation space of all unmanned aerial vehicles; Z: o^(i) = Z(s, i) denotes the observation model of each unmanned aerial vehicle in state s, with o^(i) ∈ O^(i), where o^(i) is the observation of the ith unmanned aerial vehicle and O^(i) is its observation space; γ ∈ [0, 1] is the discount factor, balancing the relative importance of long-term rewards against the current reward;
Each unmanned aerial vehicle makes action decisions according to its own local information; the action policy of the ith unmanned aerial vehicle is π^(i): O^(i) → A^(i), and the joint action policy of all unmanned aerial vehicles is expressed as π = {π^(1), π^(2), …, π^(n)}; when the joint observation of the unmanned aerial vehicles is o and the joint action policy is π, the state value function is expressed as V^π(o) = E_π[Σ_t γ^t r_t | o], and when the joint action a is executed, the action value function is expressed as Q^π(o, a) = E_π[Σ_t γ^t r_t | o, a], where E_π denotes the expectation under the joint action policy π;
step 2.2: the unmanned aerial vehicle cluster adopts the multi-agent deep deterministic policy gradient algorithm to learn the unmanned aerial vehicle action policy;
For each agent i, during centralized training the Critic network receives the state-action pairs of all agents, while during decentralized execution the Actor network receives only its own local observed state to make action decisions;
The joint observation state vector of the unmanned aerial vehicle cluster is defined as o = [o_1, o_2, …, o_n] and the joint action vector as a = [a_1, a_2, …, a_n]; further, θ = {θ_1, θ_2, …, θ_n} denotes the parameter set of the unmanned aerial vehicle cluster action policies, and μ = {μ_1, μ_2, …, μ_n} denotes the set of joint action policies of all unmanned aerial vehicles;
For the ith unmanned aerial vehicle, the Actor network is updated along the gradient of the expected return J(μ_i) = E[R_i], as shown in the following formula:
∇_{θ_i} J(μ_i) = E_{o, a ~ D}[ ∇_{θ_i} μ_i(o_i) ∇_{a_i} Q_i^μ(o, a_1, …, a_n) |_{a_i = μ_i(o_i)} ]
wherein J denotes the expected return, E denotes the expectation, R_i denotes the reward of the ith unmanned aerial vehicle, a_{-i} = [a_1, …, a_{i-1}, a_{i+1}, …, a_N] denotes the joint action of all agents except agent i, D is the experience replay buffer recording the transitions of all unmanned aerial vehicles, Q_i^μ(o, a) is the centralized state-action value function whose input comprises the joint observation state o of the unmanned aerial vehicle cluster and the joint action a of all agents, and μ_i denotes the deterministic policy of the ith unmanned aerial vehicle;
The output of the Critic network is the centralized state-action value function Q_i^μ(o, a), and its loss function is defined as follows:
L(θ_i) = E_{o, a, r, o'}[ (Q_i^μ(o, a) - y)^2 ],  y = r_i + γ Q_i^{μ'}(o', a'_1, …, a'_n) |_{a'_j = μ'_j(o'_j)}
wherein Q_i^{μ'} is the value function output by the critic target network, y is the target value, o' is the observed state of the unmanned aerial vehicle at the next time step, a' is the action of the unmanned aerial vehicle at the next time step, and the target networks are periodically soft-updated from the parameters of the latest online networks;
The input of each unmanned aerial vehicle consists of an action vector and an observation state vector; the observation state vector contains three kinds of state information, namely:
(1) The state information of the current unmanned aerial vehicle;
The observation state vector of the ith unmanned aerial vehicle at time t contains s_{i,t}, which represents the state and action information of the unmanned aerial vehicle itself, the situation feature information of the other unmanned aerial vehicles in the same team, and e_{j,t}, the coordinate information of the target areas, with j ∈ m, where m is the set of target areas; the ith unmanned aerial vehicle can obtain its own position coordinates (x_i, y_i), speed (v_{x,i}, v_{y,i}) and heading angle, which together constitute the self state and action information s_{i,t};
(2) The situation feature vector information of the other unmanned aerial vehicles;
The unmanned aerial vehicle observes the state feature information of its teammates; the corresponding state feature vector therefore contains n-1 subvectors, defined as follows:
wherein the kth subvector is the observation state subvector of the kth unmanned aerial vehicle, as follows:
wherein a_{k,-1} is the action of the kth unmanned aerial vehicle at the previous moment, k ∈ N, i ≠ k; d_{i,k,t} represents the current relative distance between the ith unmanned aerial vehicle and its kth teammate, described by the 2-norm as follows:
wherein the superscript H denotes the conjugate transpose, p_{i,t} denotes the two-dimensional coordinate vector of the ith unmanned aerial vehicle at time t, and p_{k,t} denotes the two-dimensional coordinate vector of the kth unmanned aerial vehicle at time t;
(3) The coordinate vector information of the target areas;
The unmanned aerial vehicle observes the state feature information of the target areas and obtains the observation vector e_{j,t}, which contains m subvectors defined as follows:
wherein the jth subvector represents the observation of the jth target, as follows:
wherein d_{i,j,t} represents the distance between the unmanned aerial vehicle and the target, and x_{j,t}, y_{j,t} represent the position of the target;
step 3: the dependency relationship between unmanned aerial vehicles is measured by a pointwise mutual information method, and the cooperative relationship between unmanned aerial vehicles is improved by maximizing the pointwise mutual information;
The dependency relationship between unmanned aerial vehicles is captured using positive pointwise mutual information (PPMI), and the dependency index d^(ik) between the ith unmanned aerial vehicle and the kth unmanned aerial vehicle is expressed as:
wherein l denotes the local information of each unmanned aerial vehicle, b_{i,t} denotes the boundary information of the ith unmanned aerial vehicle, b_{k,t} denotes the boundary information of the kth unmanned aerial vehicle, i ≠ k, and p(·) denotes the joint distribution function;
The positive pointwise mutual information PPMI(l^(i), a^(i), l^(k), a^(k)) is estimated with a neural network f_w(l^(i), a^(i), l^(k), a^(k)) trained by maximizing the Jensen-Shannon mutual information I_JS(L^(i), A^(i), L^(k), A^(k)), where L^(i) denotes all joint observations of unmanned aerial vehicle i, A^(i) denotes all joint actions of unmanned aerial vehicle i, L^(k) denotes all joint observations of unmanned aerial vehicle k, and A^(k) denotes all joint actions of unmanned aerial vehicle k;
step 4: obtaining the optimal pointwise mutual information estimate by maximizing the mutual information, and redefining the reward function;
A reward function is designed to encourage the unmanned aerial vehicles to reach the targets while minimizing the travel distance; the set d of distances from each unmanned aerial vehicle to the targets is computed, and the travel reward r_1^(i) of the unmanned aerial vehicle is:
r_1^(i) = -min(d)
When the observation regions of two unmanned aerial vehicles overlap, this is called a pseudo collision; when the unmanned aerial vehicles themselves collide, this is called a real collision; during training, the number of real collisions is reduced by penalizing the occurrence of pseudo collisions, and the collision reward is as follows:
wherein dist_2 denotes the distance between two unmanned aerial vehicles and σ denotes the observation radius of the unmanned aerial vehicle;
Each unmanned aerial vehicle is set to fly within the task area; the minimum distance from the unmanned aerial vehicle to the boundary is defined as d_min, and the boundary reward r_3^(i) is as follows:
The repeated-tracking reward between unmanned aerial vehicles is as follows:
Accordingly, the per-step environmental reward of the unmanned aerial vehicle is:
In the multi-agent path planning problem, it is assumed that if an unmanned aerial vehicle can cooperate with an adjacent unmanned aerial vehicle to obtain a better reward, then that unmanned aerial vehicle should be appropriately encouraged; conversely, when its neighbors obtain a negative reward, the unmanned aerial vehicle should bear part of the responsibility and therefore be penalized; accordingly, the immediate reward r^(i) of each unmanned aerial vehicle is modified as follows:
wherein the per-step environmental reward defined above is the private reward obtained from the environment, r_c^(i) is the reciprocal reward from the other unmanned aerial vehicles, and the weight coefficient α balances the private reward against the reciprocal reward; when α = 0 the unmanned aerial vehicle is completely selfish, which is the initial setting of reinforcement learning; when α = 1 the unmanned aerial vehicle is completely selfless; when 0 < α < 1 it not only maximizes its own expectation but is also encouraged to cooperate with the other unmanned aerial vehicles;
The reciprocal reward r_c^(i) that unmanned aerial vehicle i receives from the other unmanned aerial vehicles is expressed as:
wherein N(i) denotes the neighbors of unmanned aerial vehicle i, d^(ik) denotes the dependency between unmanned aerial vehicle i and unmanned aerial vehicle k, and the remaining term is the normalization coefficient;
Step 5: and (3) repeatedly executing the steps 2 to 4, and calculating the task completion rate of the unmanned aerial vehicle cluster until the task completion rate reaches a set value, so as to complete unmanned aerial vehicle cluster path planning simulation.
The beneficial effects of the above technical solution are as follows: the MADDPG-R-based unmanned aerial vehicle cluster path planning simulation method provided by the invention introduces a reciprocal-reward method based on the multi-agent deep deterministic policy gradient algorithm (MADDPG-R), so that the unmanned aerial vehicle cluster can learn cooperation strategies in a decentralized manner, cooperation among unmanned aerial vehicles is better improved, and richer forms of cooperative behavior emerge within the unmanned aerial vehicle cluster.
Drawings
Fig. 1 is a flowchart of the unmanned aerial vehicle cluster path planning simulation method based on MADDPG-R according to an embodiment of the present invention.
Detailed Description
The following describes in further detail the embodiments of the present invention with reference to the drawings and examples. The following examples are illustrative of the invention and are not intended to limit the scope of the invention.
In this embodiment, the unmanned aerial vehicle cluster path planning simulation method based on MADDPG-R, as shown in FIG. 1, comprises the following steps:
step 1: setting a scene of unmanned aerial vehicle cluster air-to-air search, and randomly setting the unmanned aerial vehicle cluster and target area positions;
The unmanned aerial vehicle cluster is set to comprise n unmanned aerial vehicles; the n unmanned aerial vehicles randomly search for m target points, and during the search an unmanned aerial vehicle may select one or more target points for coverage monitoring; the unmanned aerial vehicles fly in a two-dimensional plane at a speed not exceeding 10 km/h, and, in order to better match the actual behavior of unmanned aerial vehicles, the action of each unmanned aerial vehicle is a continuous random variable.
In the unmanned aerial vehicle cluster air-air search scene, a particle dynamics model is endowed for each unmanned aerial vehicle, and the following formula is shown:
wherein phi represents the roll angle of the unmanned aerial vehicle, r φ Represents the rolling angle speed, F represents the driving force,indicating heading angular velocity,/-, and>representing the course angle of the unmanned aerial vehicle, m is the mass of the unmanned aerial vehicle, v x ,v y The speed components of the unmanned aerial vehicle in the two-dimensional plane X axis and the Y axis are respectively, X and Y represent the positions of the unmanned aerial vehicle in the two-dimensional plane X axis and the Y axis, and dt represents a differential variable about time t;
the motion space of the unmanned plane is set to be a continuous space, and the motion is expressed as a two-dimensional velocity vector v x ,v y ]The scene allows the drone to move at variable speeds in any direction;
in order to effectively simulate the motion behavior of an actual unmanned aerial vehicle, the motion of the unmanned aerial vehicle is set to be a vector consisting of driving force and rolling angle speedThe flight speed and heading angle of the unmanned aerial vehicle at the time t are obtained by the following formula:
wherein v represents the speed of the unmanned aerial vehicle, i represents the number of the unmanned aerial vehicle, i=1, 2, …, n, Δt represents the time interval;
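As an illustration of the kinematics described above, the following Python sketch advances one unmanned aerial vehicle by a single time step. The patent's exact dynamics equations are not reproduced in this text, so the bank-to-turn relation between roll angle and heading rate, the gravity constant G, and the speed cap are assumptions made only for this sketch.

```python
import math

G = 9.81             # gravitational acceleration, assumed for the bank-to-turn relation
V_MAX = 10.0 / 3.6   # speed cap in m/s (10 km/h, per the embodiment)

def step_uav(state, action, dt=0.1, mass=1.0):
    """Advance one UAV by dt.

    state  = (x, y, v, heading, roll)   position, speed, heading angle, roll angle
    action = (force, roll_rate)         driving force F and roll angular velocity r_phi
    """
    x, y, v, heading, roll = state
    force, roll_rate = action

    roll = roll + roll_rate * dt                                   # integrate the roll rate
    heading = heading + (G / max(v, 1e-3)) * math.tan(roll) * dt   # assumed bank-to-turn heading rate
    v = min(max(v + (force / mass) * dt, 0.0), V_MAX)              # force-driven speed, capped

    vx, vy = v * math.cos(heading), v * math.sin(heading)          # velocity components v_x, v_y
    return (x + vx * dt, y + vy * dt, v, heading, roll)
```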
The position of each target in the scene is random, and the unmanned aerial vehicle cannot acquire the position information of the targets in advance; the unmanned aerial vehicle can observe a target, and when a target lies within its monitoring range, the unmanned aerial vehicle can perceive and monitor it; meanwhile, the unmanned aerial vehicle can receive communication information from other adjacent unmanned aerial vehicles, but it cannot determine the number and positions of the remaining targets;
One unmanned aerial vehicle can monitor several target points simultaneously; since there is no specific target allocation, each unmanned aerial vehicle keeps the targets it has found within its perception range and searches for as many additional target points as possible; in addition, the unmanned aerial vehicle should avoid collisions and avoid flying out of the boundary in order to satisfy the safety constraints;
step 2: constructing an unmanned aerial vehicle cluster air-to-air search model based on reinforcement learning by using a reinforcement learning algorithm;
step 2.1: the decision process for setting up a cluster of unmanned aerial vehicles is defined using a Partially Observable Markov Decision Process (POMDP), each unmanned aerial vehicle can only acquire local observation information and cannot acquire global state, at each time step, each unmanned aerial vehicle makes its action decision according to its local observation information, and all unmanned aerial vehicles perform their joint actions to update the environment;
The unmanned aerial vehicle cluster is set to comprise n unmanned aerial vehicles, and the partially observable Markov decision process is defined as the tuple (N, S, A, T, R, O, Z, γ);
wherein N is the set of n unmanned aerial vehicles, with i ∈ N; S is the state space, composed of the state s of each unmanned aerial vehicle, s ∈ S; A = {A^(1), A^(2), …, A^(n)} denotes the joint action space of all unmanned aerial vehicles, and the action of the ith unmanned aerial vehicle satisfies a^(i) ∈ A^(i); T: p(s'|s, a) → [0, 1] is the state transition function of the environment, giving the probability of moving from state s to the next state s' after the joint action a = {a^(1), a^(2), …, a^(n)} is executed; R(s, a) = {r^(1), r^(2), …, r^(n)} is the joint reward obtained after executing the joint action a in state s; O = {O^(1), O^(2), …, O^(n)} denotes the joint observation space of all unmanned aerial vehicles; Z: o^(i) = Z(s, i) denotes the observation model of each unmanned aerial vehicle in state s, with o^(i) ∈ O^(i), where o^(i) is the observation of the ith unmanned aerial vehicle and O^(i) is its observation space; γ ∈ [0, 1] is the discount factor, balancing the relative importance of long-term rewards against the current reward;
Each unmanned aerial vehicle makes action decisions according to its own local information; the action policy of the ith unmanned aerial vehicle is π^(i): O^(i) → A^(i), and the joint action policy of all unmanned aerial vehicles is expressed as π = {π^(1), π^(2), …, π^(n)}; when the joint observation of the unmanned aerial vehicles is o and the joint action policy is π, the state value function is expressed as V^π(o) = E_π[Σ_t γ^t r_t | o], and when the joint action a is executed, the action value function is expressed as Q^π(o, a) = E_π[Σ_t γ^t r_t | o, a], where E_π denotes the expectation under the joint action policy π;
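A minimal sketch of the per-time-step decentralized decision process defined above: each unmanned aerial vehicle maps only its local observation to an action through its own policy, the joint action is applied to the environment, and new local observations and rewards are returned. The `env` and `policies` objects are hypothetical stand-ins for illustration, not interfaces defined in the patent.

```python
def run_episode(env, policies, max_steps=200):
    """Decentralized execution: UAV i acts only on its local observation o_i."""
    obs = env.reset()                                           # one local observation per UAV
    rewards = None
    for _ in range(max_steps):
        joint_action = [pi(o) for pi, o in zip(policies, obs)]  # a = {a^(1), ..., a^(n)}
        obs, rewards, done = env.step(joint_action)             # transition p(s'|s, a), joint reward
        if done:
            break
    return obs, rewards
```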
step 2.2: the unmanned aerial vehicle cluster adopts the multi-agent deep deterministic policy gradient (MADDPG) algorithm to learn the unmanned aerial vehicle action policy;
In independent learning, the environment is non-stationary from each agent's point of view. An agent that evaluates and decides only through its local observations may obtain an inaccurate critic network and an actor network with poor decision-making ability. However, during simulated training the agents can usually obtain additional information and communicate without obstruction, which alleviates the non-stationarity of the environment. Therefore, a centralized training with decentralized execution (CTDE) framework is adopted, which is currently the most popular paradigm in multi-agent reinforcement learning.
The multi-agent deep deterministic policy gradient (MADDPG) algorithm was the first work to propose the CTDE training framework, applying it to the DDPG algorithm.
For each agent i (i.e., unmanned aerial vehicle), during centralized training, the Critic network receives State-action pairs (State-action pairs) from all agents, during decentralized execution, the Actor network receives only its own local observed State to make action decisions;
define the joint observation state vector o= [ o ] of unmanned aerial vehicle cluster 1 ,o 2 ,…,o n ]And joint motion vector a= [ a ] 1 ,a 2 ,…,a n ]The method comprises the steps of carrying out a first treatment on the surface of the Further, define θ= { θ 12 ,…,θ n And the parameterized set of unmanned plane cluster action strategies is shown as μ= { μ 12 ,…,μ n -a set of joint action policies for all unmanned aerial vehicles;
for the ith drone, a gradient of expected return is given, actor network J (μ i )=E[R i ]The gradient of (2) is shown in the following formula:
wherein J represents a desire for rewards, E represents a desire, R i Representing a reward for the ith unmanned aerial vehicle, a -i =[a 1 ,…,a i-1 ,a i+1 ,…,a N ]Representing joint motion vectors for all agents except agent i, D is an empirical buffer recording all unmanned aerial vehicles,is a centralized state action cost function, and inputs the combined observation state o comprising the unmanned aerial vehicle cluster and the combined actions a, mu of all the intelligent agents i A deterministic strategy of the ith unmanned aerial vehicle is represented;
the output of the Critic network is a centralized state action cost functionThe loss function is defined as follows:
wherein ,the method comprises the steps that (1) the method is a cost function output by a commentator-target network, y is a target cost function, o 'is an observation state of a next time step unmanned aerial vehicle, a' is an action of the next time step unmanned aerial vehicle, u is a parameter of the actor-target network, and u is updated softly by periodically using the latest commentator network parameter mu;
The input of each unmanned aerial vehicle consists of an action vector and an observation state vector; the observation state vector contains three kinds of state information, namely:
(1) The state information of the current unmanned aerial vehicle;
The observation state vector of the ith unmanned aerial vehicle at time t contains s_{i,t}, which represents the state and action information of the unmanned aerial vehicle itself, the situation feature information of the other unmanned aerial vehicles in the same team, and e_{j,t}, the coordinate information of the target areas, with j ∈ m, where m is the set of target areas; the ith unmanned aerial vehicle can obtain its own position coordinates (x_i, y_i), speed (v_{x,i}, v_{y,i}) and heading angle, which together constitute the self state and action information s_{i,t};
(2) The situation feature vector information of the other unmanned aerial vehicles;
The unmanned aerial vehicle observes the state feature information of its teammates; the corresponding state feature vector therefore contains n-1 subvectors, defined as follows:
wherein the kth subvector is the observation state subvector of the kth unmanned aerial vehicle, as follows:
wherein a_{k,-1} is the action of the kth unmanned aerial vehicle at the previous moment, k ∈ N, i ≠ k; d_{i,k,t} represents the current relative distance between the ith unmanned aerial vehicle and its kth teammate, described by the 2-norm as follows:
wherein the superscript H denotes the conjugate transpose, p_{i,t} denotes the two-dimensional coordinate vector of the ith unmanned aerial vehicle at time t, and p_{k,t} denotes the two-dimensional coordinate vector of the kth unmanned aerial vehicle at time t;
(3) The coordinate vector information of the target areas;
The unmanned aerial vehicle observes the state feature information of the target areas and obtains the observation vector e_{j,t}, which contains m subvectors defined as follows:
wherein the jth subvector represents the observation of the jth target, as follows:
wherein d_{i,j,t} represents the distance between the unmanned aerial vehicle and the target, and x_{j,t}, y_{j,t} represent the position of the target;
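A sketch of how the observation state vector described above can be assembled, concatenating the UAV's own state, one subvector per teammate (last action and relative distance), and one subvector per target (distance and position). The dictionary-based interface and the exact ordering of the entries are assumptions for illustration only.

```python
import numpy as np

def build_observation(i, uavs, targets):
    """Assemble o_{i,t} = [own state, teammate subvectors, target subvectors].

    uavs:    list of dicts with numpy values 'pos' (2,), 'vel' (2,), 'heading', 'last_action' (2,)
    targets: list of dicts with numpy value 'pos' (2,)
    """
    me = uavs[i]
    own = np.concatenate([me['pos'], me['vel'], [me['heading']]])    # (x_i, y_i, v_x, v_y, heading)

    teammates = []
    for k, other in enumerate(uavs):
        if k == i:
            continue
        d_ik = np.linalg.norm(me['pos'] - other['pos'])              # 2-norm relative distance d_{i,k,t}
        teammates.append(np.concatenate([other['last_action'], [d_ik]]))

    goals = []
    for tgt in targets:
        d_ij = np.linalg.norm(me['pos'] - tgt['pos'])                # distance to the j-th target
        goals.append(np.concatenate([[d_ij], tgt['pos']]))           # (d_{i,j,t}, x_{j,t}, y_{j,t})

    return np.concatenate([own] + teammates + goals)
```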
step 3: the dependency relationship between unmanned aerial vehicles is measured by a pointwise mutual information method, and the cooperative relationship between unmanned aerial vehicles is improved by maximizing the pointwise mutual information;
The dependency relationship between unmanned aerial vehicles is captured using positive pointwise mutual information (PPMI), and the dependency index d^(ik) between the ith unmanned aerial vehicle and the kth unmanned aerial vehicle is expressed as:
wherein l denotes the local information of each unmanned aerial vehicle, b_{i,t} denotes the boundary information of the ith unmanned aerial vehicle, b_{k,t} denotes the boundary information of the kth unmanned aerial vehicle, i ≠ k, and p(·) denotes the joint distribution function;
The positive pointwise mutual information PPMI(l^(i), a^(i), l^(k), a^(k)) is estimated with a neural network f_w(l^(i), a^(i), l^(k), a^(k)) trained by maximizing the Jensen-Shannon mutual information I_JS(L^(i), A^(i), L^(k), A^(k)), where L^(i) denotes all joint observations of unmanned aerial vehicle i, A^(i) denotes all joint actions of unmanned aerial vehicle i, L^(k) denotes all joint observations of unmanned aerial vehicle k, and A^(k) denotes all joint actions of unmanned aerial vehicle k;
Let the joint distribution of two random variables (X, Y) be P(X, Y) and the marginal distributions be P(X) and P(Y); the mutual information (MI) of (X, Y) is expressed as:
I(X; Y) = D_KL( P(X, Y) || P(X)P(Y) ) = Σ_{x,y} p(x, y) log( p(x, y) / (p(x)p(y)) )
where D_KL denotes the KL divergence and log denotes the logarithm; the stronger the correlation between X and Y, the larger I(X; Y), and if X and Y are independent, I(X; Y) = 0. For a particular random event pair (x, y), pointwise mutual information (PMI) is used to measure the degree of dependency between them:
PMI(x, y) = log( p(x, y) / (p(x)p(y)) )
Mutual information and pointwise mutual information are symmetric, and I(X; Y) is the weighted sum of the pointwise mutual information of all possible pairs (x, y); in order to avoid log 0 = -∞, positive pointwise mutual information (PPMI) is defined as:
PPMI(x, y) = max( PMI(x, y), 0 )
Given the environmental reward of each unmanned aerial vehicle, one main difficulty in calculating the reciprocal reward is calculating the dependency index between two cooperating unmanned aerial vehicles. In the unmanned aerial vehicle swarm, each unmanned aerial vehicle is only partially observable and makes action decisions according to its local information, including communication with neighbors, observation of targets, boundary information, state information and the like. Since all unmanned aerial vehicles coexist in one environment and the global state is unknown, the local information and actions of two adjacent unmanned aerial vehicles may not be independent. Pointwise mutual information (PMI) is used to capture the dependency relationship between unmanned aerial vehicles, and the dependency index d^(ik) between the ith unmanned aerial vehicle and the kth unmanned aerial vehicle is then expressed as:
wherein l denotes the local information of each unmanned aerial vehicle, b_{i,t} denotes the boundary information of the ith unmanned aerial vehicle, b_{k,t} denotes the boundary information of the kth unmanned aerial vehicle, i ≠ k, and p(·) denotes the joint distribution function;
Since pointwise mutual information is not easy to obtain directly in a complex environment, the best pointwise mutual information (PMI) estimate can be obtained by maximizing the mutual information (MMI). Mutual information can be expressed as the expectation of a divergence between the joint probability distribution and the product of the marginal probability distributions, and, depending on the measure used, a number of works have proposed maximizing variational lower bounds of mutual information; among them, I_JS estimates the mutual information using the JS divergence. Inspired by mutual information neural estimation, this embodiment uses a neural network to estimate the pointwise mutual information. For simplicity and convenience of derivation, X_1 and X_2 are used in place of (l^(i), a^(i)) and (l^(k), a^(k)), and the pointwise mutual information is then estimated using the following lemma;
For random variables X_1 and X_2, their JS mutual information is defined as:
and its variational lower bound is:
wherein f_w(x_1, x_2) is a fitting function parameterized by w and sp(x) = log(1 + e^x); when the variational lower bound reaches its maximum, a best-fit function can be found:
This is proved as follows: taking the first derivative of the lower bound with respect to f_w(x_1, x_2) and setting it to zero gives the stationary point, and the second derivative with respect to f_w(x_1, x_2) is negative. Therefore I_JS has a unique maximum with respect to f_w, and at this maximum the optimal fitting function equals the pointwise mutual information of the random event pair (x_1, x_2). A neural network f_w(l^(i), a^(i), l^(k), a^(k)) can therefore be used to estimate the pointwise mutual information of (l^(i), a^(i), l^(k), a^(k)) by maximizing the JS mutual information I_JS(L^(i), A^(i), L^(k), A^(k)), where L^(i) denotes all joint observations of unmanned aerial vehicle i, A^(i) denotes all joint actions of unmanned aerial vehicle i, L^(k) denotes all joint observations of unmanned aerial vehicle k, and A^(k) denotes all joint actions of unmanned aerial vehicle k.
In practice, when there is no dependency between two random events the minimum pointwise mutual information is 0, but the output of f_w may be less than 0; the invention therefore uses positive pointwise mutual information (PPMI) to capture the dependency between unmanned aerial vehicles and to reshape the reward function accordingly.
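The following sketch estimates the pointwise mutual information with a small network f_w trained on the Jensen-Shannon lower bound described above, using sp(x) = log(1 + e^x), and clips negative estimates to obtain PPMI. The network architecture and the in-batch shuffling used to approximate the product of marginals are assumptions; x1 stands for (l^(i), a^(i)) and x2 for (l^(k), a^(k)).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F_nn

class PMIEstimator(nn.Module):
    """f_w(l_i, a_i, l_k, a_k): scalar statistic trained on the JS lower bound."""
    def __init__(self, in_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1))

    def forward(self, x1, x2):
        return self.net(torch.cat([x1, x2], dim=-1))

def js_mi_lower_bound(f, x1, x2):
    """E_p(x1,x2)[-sp(-f)] - E_p(x1)p(x2)[sp(f)]; the product of marginals is
    approximated by shuffling x2 within the batch."""
    joint = f(x1, x2)
    marginal = f(x1, x2[torch.randperm(x2.shape[0])])
    return (-F_nn.softplus(-joint)).mean() - F_nn.softplus(marginal).mean()

def ppmi(f, x1, x2):
    """Positive pointwise mutual information: negative estimates are clipped to 0."""
    with torch.no_grad():
        return torch.clamp(f(x1, x2), min=0.0)

# Training step (maximize the bound, i.e. minimize its negative), with 'opt' an optimizer:
# loss = -js_mi_lower_bound(f, x1_batch, x2_batch); opt.zero_grad(); loss.backward(); opt.step()
```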
Step 4: obtaining optimal point-to-point mutual information estimation by maximizing mutual information, and redefining a reward function;
to encourage the drones to learn the collaboration strategy, each drone must not only reach the already perceived target well, but also avoid repetitive perception to maximize perceived target. In addition, they should avoid flying out of the mission scene, ensuring flight safety. These expectations are represented by potential rewards modeling.
Therefore, it is necessary to design a rational reward for the drone to guide the learning process. Designing a reward function to encourage the drones to reach the target and minimize the drone travel distance, calculating a set d of distances from each drone to the target, where the drone travel reward r 1 (i) The following are provided:
r 1 (i) =(-min(d))
when two observation regions collide with each other, it is called a pseudo collision; also, when unmanned aerial vehicles collide with each other, it is referred to as a real collision; in training, the number of real collisions is reduced by punishing the occurrence of false collisions, collision rewardsThe following are provided:
wherein dist 2 Representing the distance between two unmanned aerial vehicles, sigma representing the radius of the unmanned aerial vehicle observation;
each unmanned aerial vehicle flies in the task area; when the drone is too close to the boundary, a portion of its perception range may fall outside the mission zone and the perception of the non-mission zone may be useless. Setting the minimum distance from the unmanned aerial vehicle to the boundary to be d min Boundary rewardsThe following are provided:
repeated tracking means that one target is tracked by multiple unmanned aerial vehicles at the same time, which causes resource waste and increases collision risk. When the relative distance between two unmanned aerial vehicles is greater than twice the perceived radius, repeated tracking does not occur. Thus, repeated tracking rewards between dronesThe following are provided:
thus, each step of environmental rewards for the droneThe method comprises the following steps: />
In the multi-agent path planning problem, it is assumed that if one drone can cooperate with an adjacent drone to get a better incentive, then the drone should be properly encouraged; otherwise, when the neighbors get a negative reward, the drone should take part of the responsibility and therefore be penalized; thus, the instant prize r of each unmanned plane is awarded (i) The modification is as follows:
wherein ,is a personal reward obtained from the environment, r c (i) Is a reciprocal reward from other unmanned aerial vehicles, the weight coefficient alpha balances between private rewards and reciprocal rewards; when α=0, the drone is fully selfish, which is the initial setting for reinforcement learning; whereas α=1, the drone is totally unbiased; when 0 < alpha < 1, not only is its expectations maximized, but also cooperation with other unmanned aerial vehicles is encouraged;
for each unmanned aerial vehicle, a reciprocal reward only exists when the unmanned aerial vehicle cooperates with other unmanned aerial vehicles, and a dependency relationship exists between the unmanned aerial vehicles. And the greater the degree of collaboration, the greater the dependency. Thus, reciprocal rewards are related not only to rewards of others, but also to their dependencies. Then, unmanned aerial vehicle i receives reciprocal rewards from other unmanned aerial vehiclesExpressed as:
wherein ,N(i) Representing the neighbors, d of unmanned plane i (ik) Representing the dependency between drone i and drone k,is a normalized coefficient, ++>
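A sketch of the reward shaping and reciprocal-reward mixing described above. The exact piecewise forms of the collision, boundary and repeated-tracking terms, the penalty magnitude, and the mixing formula r = (1 - alpha) * r_env + alpha * r_c are assumptions consistent with the text (alpha = 0 selfish, alpha = 1 selfless), since the corresponding formulas are not reproduced in this extraction.

```python
import numpy as np

def environmental_reward(dists_to_targets, dists_to_neighbors, dist_to_boundary,
                         sigma, d_min, penalty=1.0):
    """Per-step environmental reward of one UAV (travel + collision + boundary + repeat).

    The piecewise forms and the penalty magnitude are assumptions made for this sketch.
    """
    r_travel = -float(np.min(dists_to_targets))                      # r_1 = -min(d)
    close = np.asarray(dists_to_neighbors) < 2.0 * sigma             # observation regions overlap
    r_collision = -penalty * float(np.sum(close))                    # penalize pseudo collisions
    r_boundary = -penalty if dist_to_boundary < d_min else 0.0       # keep away from the boundary
    r_repeat = -penalty * float(np.any(close))                       # repeated tracking only possible within 2*sigma
    return r_travel + r_collision + r_boundary + r_repeat

def immediate_reward(r_env_i, r_env_neighbors, dependency, alpha=0.5):
    """Mix the private reward with the dependency-weighted reciprocal reward of the neighbors."""
    dependency = np.asarray(dependency, dtype=float)                 # d^(ik) from the PPMI estimator
    r_env_neighbors = np.asarray(r_env_neighbors, dtype=float)
    z = dependency.sum() + 1e-8                                      # normalization coefficient
    r_c = float((dependency * r_env_neighbors).sum() / z)            # reciprocal reward r_c^(i)
    return (1.0 - alpha) * r_env_i + alpha * r_c                     # assumed mixing form
```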
The contents of steps 2 to 4 above are collectively referred to as the multi-agent deep deterministic policy gradient algorithm with reciprocal rewards (MADDPG-R), by which the dependency relationship between unmanned aerial vehicles can finally be determined; the pseudo code of the algorithm is shown in Table 1.
TABLE 1 Multi-agent deep deterministic policy gradient algorithm with reciprocal rewards
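The pseudo code referred to as Table 1 is not reproduced in this text. The following skeleton is a hedged reconstruction of the overall MADDPG-R loop from steps 2 to 5; all callables are injected placeholders (for example the reward reshaping and agent-update routines sketched earlier), the episode count of 15000 matches the embodiment, and the completion target value is an assumption.

```python
def train_maddpg_r(env, policies, update_agents, estimate_dependencies,
                   reshape_rewards, episodes=15000, completion_target=0.9):
    """Skeleton of the MADDPG-R loop: interact, reshape rewards with PPMI-based
    dependencies, update the actors/critics, and stop once the task completion
    rate reaches the set value."""
    buffer = []                                                   # shared experience buffer D
    for ep in range(episodes):
        obs, done = env.reset(), False
        while not done:
            actions = [pi(o) for pi, o in zip(policies, obs)]     # decentralized execution
            next_obs, env_rewards, done = env.step(actions)
            deps = estimate_dependencies(obs, actions)            # d^(ik) via PPMI (step 3)
            rewards = reshape_rewards(env_rewards, deps)          # reciprocal rewards (step 4)
            buffer.append((obs, actions, rewards, next_obs))
            obs = next_obs
        update_agents(buffer)                                     # centralized training (step 2.2)
        if getattr(env, "task_completion_rate", lambda: 0.0)() >= completion_target:
            break                                                 # step 5 stopping criterion
    return policies
```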
Step 5: and (3) repeatedly executing the steps 2 to 4, and calculating the task completion rate of the unmanned aerial vehicle cluster until the task completion rate reaches a set value, so as to complete unmanned aerial vehicle cluster path planning simulation.
In this embodiment, after 15000 training rounds, the path planning results shown in Table 2 are obtained over the 100 rounds of the test phase:
Table 2 Comparison of path planning for different numbers of unmanned aerial vehicles
In this embodiment, the comparison between the baseline algorithm and the MADDPG-R algorithm of the invention for path planning in the same scene is shown in Table 3:
Table 3 Comparison of path planning for different algorithms
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some or all of the technical features thereof can be replaced with equivalents; such modifications and substitutions do not depart from the spirit of the corresponding technical solutions, which are defined by the scope of the appended claims.

Claims (6)

1. An unmanned aerial vehicle cluster path planning simulation method based on MADDPG-R is characterized in that: the method comprises the following steps:
step 1: setting a scene of unmanned aerial vehicle cluster air-to-air search, and randomly setting the unmanned aerial vehicle cluster and target area positions;
step 2: constructing an unmanned aerial vehicle cluster air-to-air search model based on reinforcement learning by using a reinforcement learning algorithm;
step 2.1: setting a decision process of the unmanned aerial vehicle cluster to be defined by using a partially observable Markov decision process, wherein each unmanned aerial vehicle can only acquire local observation information and cannot acquire a global state, each unmanned aerial vehicle makes an action decision according to the local observation information in each time step, and all unmanned aerial vehicles execute combined actions to update the environment;
step 2.2: the unmanned aerial vehicle cluster adopts the multi-agent deep deterministic policy gradient algorithm to learn the unmanned aerial vehicle action policy;
step 3: the dependency relationship between unmanned aerial vehicles is measured by a pointwise mutual information method, and the cooperative relationship between unmanned aerial vehicles is improved by maximizing the pointwise mutual information;
step 4: obtaining the optimal pointwise mutual information estimate by maximizing the mutual information, and redefining the reward function;
step 5: repeatedly executing steps 2 to 4, and calculating the task completion rate of the unmanned aerial vehicle cluster until the task completion rate reaches the set value, so as to complete the unmanned aerial vehicle cluster path planning simulation.
2. The unmanned aerial vehicle cluster path planning simulation method based on MADDPG-R according to claim 1, wherein the unmanned aerial vehicle cluster path planning simulation method is characterized in that: the specific method of the step 1 is as follows:
the unmanned aerial vehicle cluster is set to comprise n unmanned aerial vehicles, the n unmanned aerial vehicles randomly search for m target points, one or more target points can be selected by the unmanned aerial vehicles for coverage monitoring in the searching process, the unmanned aerial vehicles fly at the speed not exceeding the set speed on a two-dimensional plane, and the action of each unmanned aerial vehicle is a continuous random variable;
In the unmanned aerial vehicle cluster air-to-air search scene, a particle dynamics model is assigned to each unmanned aerial vehicle, as shown in the following formula:
wherein φ represents the roll angle of the unmanned aerial vehicle, r_φ represents the roll angular velocity, F represents the driving force, the remaining state variables are the heading angular velocity and the heading angle of the unmanned aerial vehicle, m is the mass of the unmanned aerial vehicle, v_x and v_y are the velocity components of the unmanned aerial vehicle along the X and Y axes of the two-dimensional plane, x and y represent the positions of the unmanned aerial vehicle along the X and Y axes, and dt denotes the differential with respect to time t;
The action space of the unmanned aerial vehicle is set as a continuous space, and the action is expressed as a two-dimensional velocity vector [v_x, v_y]; the scene allows the unmanned aerial vehicle to move at a variable speed in any direction;
In order to effectively simulate the motion behavior of an actual unmanned aerial vehicle, the action of the unmanned aerial vehicle is set as a vector composed of the driving force and the roll angular velocity; the flight speed and heading angle of the unmanned aerial vehicle at time t are obtained by the following formula:
wherein v represents the speed of the unmanned aerial vehicle, i represents the index of the unmanned aerial vehicle, i = 1, 2, …, n, and Δt represents the time interval;
The position of each target in the scene is random, and the unmanned aerial vehicle cannot acquire the position information of the targets in advance; the unmanned aerial vehicle can observe a target, and when a target lies within its monitoring range, the unmanned aerial vehicle can perceive and monitor it; meanwhile, the unmanned aerial vehicle can receive communication information from other adjacent unmanned aerial vehicles, but it cannot determine the number and positions of the remaining targets;
One unmanned aerial vehicle can monitor several target points simultaneously; since there is no specific target allocation, each unmanned aerial vehicle keeps the targets it has found within its perception range and searches for as many additional target points as possible; in addition, the unmanned aerial vehicle should avoid collisions and avoid flying out of the boundary in order to satisfy the safety constraints.
3. The unmanned aerial vehicle cluster path planning simulation method based on MADDPG-R according to claim 2, wherein the unmanned aerial vehicle cluster path planning simulation method is characterized in that: the specific method of the step 2.1 is as follows:
The unmanned aerial vehicle cluster is set to comprise n unmanned aerial vehicles, and the partially observable Markov decision process is defined as the tuple (N, S, A, T, R, O, Z, γ);
wherein N is the set of n unmanned aerial vehicles, with i ∈ N; S is the state space, composed of the state s of each unmanned aerial vehicle, s ∈ S; A = {A^(1), A^(2), …, A^(n)} denotes the joint action space of all unmanned aerial vehicles, and the action of the ith unmanned aerial vehicle satisfies a^(i) ∈ A^(i); T: p(s'|s, a) → [0, 1] is the state transition function of the environment, giving the probability of moving from state s to the next state s' after the joint action a = {a^(1), a^(2), …, a^(n)} is executed; R(s, a) = {r^(1), r^(2), …, r^(n)} is the joint reward obtained after executing the joint action a in state s; O = {O^(1), O^(2), …, O^(n)} denotes the joint observation space of all unmanned aerial vehicles; Z: o^(i) = Z(s, i) denotes the observation model of each unmanned aerial vehicle in state s, with o^(i) ∈ O^(i), where o^(i) is the observation of the ith unmanned aerial vehicle and O^(i) is its observation space; γ ∈ [0, 1] is the discount factor, balancing the relative importance of long-term rewards against the current reward;
Each unmanned aerial vehicle makes action decisions according to its own local information; the action policy of the ith unmanned aerial vehicle is π^(i): O^(i) → A^(i), and the joint action policy of all unmanned aerial vehicles is expressed as π = {π^(1), π^(2), …, π^(n)}; when the joint observation of the unmanned aerial vehicles is o and the joint action policy is π, the state value function is expressed as V^π(o) = E_π[Σ_t γ^t r_t | o], and when the joint action a is executed, the action value function is expressed as Q^π(o, a) = E_π[Σ_t γ^t r_t | o, a], where E_π denotes the expectation under the joint action policy π.
4. The unmanned aerial vehicle cluster path planning simulation method based on MADDPG-R according to claim 3, wherein the unmanned aerial vehicle cluster path planning simulation method based on MADDPG-R is characterized in that: the specific method of the step 2.2 is as follows:
For each agent i, during centralized training the Critic network receives the state-action pairs of all agents, while during decentralized execution the Actor network receives only its own local observed state to make action decisions;
The joint observation state vector of the unmanned aerial vehicle cluster is defined as o = [o_1, o_2, …, o_n] and the joint action vector as a = [a_1, a_2, …, a_n]; further, θ = {θ_1, θ_2, …, θ_n} denotes the parameter set of the unmanned aerial vehicle cluster action policies, and μ = {μ_1, μ_2, …, μ_n} denotes the set of joint action policies of all unmanned aerial vehicles;
For the ith unmanned aerial vehicle, the Actor network is updated along the gradient of the expected return J(μ_i) = E[R_i], as shown in the following formula:
∇_{θ_i} J(μ_i) = E_{o, a ~ D}[ ∇_{θ_i} μ_i(o_i) ∇_{a_i} Q_i^μ(o, a_1, …, a_n) |_{a_i = μ_i(o_i)} ]
wherein J denotes the expected return, E denotes the expectation, R_i denotes the reward of the ith unmanned aerial vehicle, a_{-i} = [a_1, …, a_{i-1}, a_{i+1}, …, a_N] denotes the joint action of all agents except agent i, D is the experience replay buffer recording the transitions of all unmanned aerial vehicles, Q_i^μ(o, a) is the centralized state-action value function whose input comprises the joint observation state o of the unmanned aerial vehicle cluster and the joint action a of all agents, and μ_i denotes the deterministic policy of the ith unmanned aerial vehicle;
The output of the Critic network is the centralized state-action value function Q_i^μ(o, a), and its loss function is defined as follows:
L(θ_i) = E_{o, a, r, o'}[ (Q_i^μ(o, a) - y)^2 ],  y = r_i + γ Q_i^{μ'}(o', a'_1, …, a'_n) |_{a'_j = μ'_j(o'_j)}
wherein Q_i^{μ'} is the value function output by the critic target network, y is the target value, o' is the observed state of the unmanned aerial vehicle at the next time step, a' is the action of the unmanned aerial vehicle at the next time step, and the target networks are periodically soft-updated from the parameters of the latest online networks;
The input of each unmanned aerial vehicle consists of an action vector and an observation state vector; the observation state vector contains three kinds of state information, namely:
(1) The state information of the current unmanned aerial vehicle;
The observation state vector of the ith unmanned aerial vehicle at time t contains s_{i,t}, which represents the state and action information of the unmanned aerial vehicle itself, the situation feature information of the other unmanned aerial vehicles in the same team, and e_{j,t}, the coordinate information of the target areas, with j ∈ m, where m is the set of target areas; the ith unmanned aerial vehicle can obtain its own position coordinates (x_i, y_i), speed (v_{x,i}, v_{y,i}) and heading angle, which together constitute the self state and action information s_{i,t};
(2) The situation feature vector information of the other unmanned aerial vehicles;
The unmanned aerial vehicle observes the state feature information of its teammates; the corresponding state feature vector therefore contains n-1 subvectors, defined as follows:
wherein the kth subvector is the observation state subvector of the kth unmanned aerial vehicle, as follows:
wherein a_{k,-1} is the action of the kth unmanned aerial vehicle at the previous moment, k ∈ N, i ≠ k; d_{i,k,t} represents the current relative distance between the ith unmanned aerial vehicle and its kth teammate, described by the 2-norm as follows:
wherein the superscript H denotes the conjugate transpose, p_{i,t} denotes the two-dimensional coordinate vector of the ith unmanned aerial vehicle at time t, and p_{k,t} denotes the two-dimensional coordinate vector of the kth unmanned aerial vehicle at time t;
(3) The coordinate vector information of the target areas;
The unmanned aerial vehicle observes the state feature information of the target areas and obtains the observation vector e_{j,t}, which contains m subvectors defined as follows:
wherein the jth subvector represents the observation of the jth target, as follows:
wherein d_{i,j,t} represents the distance between the unmanned aerial vehicle and the target, and x_{j,t}, y_{j,t} represent the position of the target.
5. The unmanned aerial vehicle cluster path planning simulation method based on MADDPG-R according to claim 4, wherein the unmanned aerial vehicle cluster path planning simulation method based on MADDPG-R is characterized in that: the specific method of the step 3 is as follows:
The dependency relationship between unmanned aerial vehicles is captured using positive pointwise mutual information (PPMI), and the dependency index d^(ik) between the ith unmanned aerial vehicle and the kth unmanned aerial vehicle is expressed as:
wherein l denotes the local information of each unmanned aerial vehicle, b_{i,t} denotes the boundary information of the ith unmanned aerial vehicle, b_{k,t} denotes the boundary information of the kth unmanned aerial vehicle, i ≠ k, and p(·) denotes the joint distribution function;
The positive pointwise mutual information PPMI(l^(i), a^(i), l^(k), a^(k)) is estimated with a neural network f_w(l^(i), a^(i), l^(k), a^(k)) trained by maximizing the Jensen-Shannon mutual information I_JS(L^(i), A^(i), L^(k), A^(k)), where L^(i) denotes all joint observations of unmanned aerial vehicle i, A^(i) denotes all joint actions of unmanned aerial vehicle i, L^(k) denotes all joint observations of unmanned aerial vehicle k, and A^(k) denotes all joint actions of unmanned aerial vehicle k.
6. The unmanned aerial vehicle cluster path planning simulation method based on MADDPG-R according to claim 5, characterized in that the specific method of the step 4 is as follows:
a reward function is designed to encourage the unmanned aerial vehicles to reach the target while minimizing the travel distance; a set d of distances from each unmanned aerial vehicle to the target is calculated, and the travel reward r_1^(i) of the unmanned aerial vehicle is:
r_1^(i) = -min(d)
when the observation regions of two unmanned aerial vehicles collide with each other, it is called a pseudo collision; when the unmanned aerial vehicles themselves collide with each other, it is called a real collision; during training, the number of real collisions is reduced by penalizing the occurrence of pseudo collisions, and the collision reward r_2^(i) is defined accordingly, where dist_2 represents the distance between two unmanned aerial vehicles and σ represents the observation radius of the unmanned aerial vehicle;
each unmanned aerial vehicle is set to fly within the task area, the minimum allowed distance between the unmanned aerial vehicle and the boundary is defined as d_min, and the boundary reward r_3^(i) is defined accordingly;
a repeated-tracking reward among the unmanned aerial vehicles is defined accordingly;
thus, the per-step environmental reward of each unmanned aerial vehicle is obtained by combining the travel, collision, boundary and repeated-tracking rewards above;
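A minimal sketch of these shaping terms follows, assuming a square task area of side length area_size, a pseudo-collision penalty whenever two observation disks of radius σ overlap, a boundary penalty when the distance to the task-area boundary falls below d_min, a repeated-tracking penalty when two unmanned aerial vehicles track the same target, and a plain sum of the four terms; the travel term here uses the distance from the unmanned aerial vehicle to its nearest target. The constants and the exact combination are illustrative assumptions, not the patent's published formulas.

```python
import numpy as np

def environmental_reward(i, positions, target_positions, sigma, area_size, d_min,
                         tracked_targets):
    """Illustrative per-step environmental reward for UAV i (sum of four shaping terms)."""
    p_i = positions[i]

    # r1: travel reward, negative distance to the closest target
    d = [np.linalg.norm(p_i - t) for t in target_positions]
    r1 = -min(d)

    # r2: collision reward, penalize pseudo collisions (observation disks overlapping)
    r2 = 0.0
    for k, p_k in enumerate(positions):
        if k != i and np.linalg.norm(p_i - p_k) < 2.0 * sigma:
            r2 -= 1.0

    # r3: boundary reward, penalize flying closer than d_min to the task-area boundary
    dist_to_boundary = min(p_i[0], p_i[1], area_size - p_i[0], area_size - p_i[1])
    r3 = -1.0 if dist_to_boundary < d_min else 0.0

    # r4: repeated-tracking reward, penalize tracking a target another UAV already tracks
    r4 = -1.0 if tracked_targets.count(tracked_targets[i]) > 1 else 0.0

    return r1 + r2 + r3 + r4
```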
in the multi-agent path planning problem it is assumed that if one unmanned aerial vehicle can cooperate with a neighboring unmanned aerial vehicle to obtain a better reward, it should be encouraged accordingly; conversely, when its neighbors receive a negative reward, the unmanned aerial vehicle should bear part of the responsibility and therefore be penalized; thus, the instant reward r^(i) of each unmanned aerial vehicle is modified as:
r^(i) = (1 - α) · r_e^(i) + α · r_c^(i)
wherein r_e^(i) is the private reward obtained from the environment, r_c^(i) is the reciprocal reward from the other unmanned aerial vehicles, and the weight coefficient α balances the private and reciprocal rewards; when α = 0 the unmanned aerial vehicle is fully selfish, which is the default setting in reinforcement learning; when α = 1 the unmanned aerial vehicle is fully selfless; when 0 < α < 1 the unmanned aerial vehicle not only maximizes its own expected return but is also encouraged to cooperate with the other unmanned aerial vehicles;
the reciprocal reward r_c^(i) that unmanned aerial vehicle i receives from the other unmanned aerial vehicles is expressed in terms of its dependencies with its neighbors, wherein N(i) represents the neighbors of unmanned aerial vehicle i, d^(ik) represents the dependency between unmanned aerial vehicle i and unmanned aerial vehicle k, and a normalization coefficient normalizes the dependencies over N(i).
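For illustration, the sketch below blends private and reciprocal rewards, assuming that the reciprocal reward of unmanned aerial vehicle i is its neighbors' private rewards weighted by their normalized dependency indices d^(ik); this specific weighting and the helper names are assumptions of this example rather than the patent's published formula.

```python
import numpy as np

def mixed_rewards(private_rewards, dependency, neighbors, alpha=0.3):
    """Blend each UAV's private reward with a dependency-weighted reciprocal reward.

    private_rewards: array of shape (n,)   -- r_e^(i) from the environment
    dependency:      array of shape (n, n) -- non-negative d^(ik) (PPMI-based)
    neighbors:       dict i -> list of k   -- neighborhood N(i)
    """
    n = len(private_rewards)
    mixed = np.zeros(n)
    for i in range(n):
        ks = neighbors[i]
        if ks:
            w = np.array([dependency[i, k] for k in ks], dtype=float)
            w = w / w.sum() if w.sum() > 0 else np.full(len(ks), 1.0 / len(ks))
            r_c = float(np.dot(w, [private_rewards[k] for k in ks]))  # reciprocal reward
        else:
            r_c = 0.0
        # alpha = 0: fully selfish; alpha = 1: fully selfless
        mixed[i] = (1.0 - alpha) * private_rewards[i] + alpha * r_c
    return mixed

# Example: three UAVs, UAV 0 strongly depends on UAV 1
r_e = np.array([1.0, -0.5, 0.2])
d = np.array([[0.0, 0.8, 0.1],
              [0.8, 0.0, 0.3],
              [0.1, 0.3, 0.0]])
nbrs = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
print(mixed_rewards(r_e, d, nbrs, alpha=0.3))
```

With α = 0 the output equals the private rewards; increasing α shifts credit (and blame) toward the unmanned aerial vehicles each agent depends on most.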
CN202310721379.4A 2023-06-16 2023-06-16 Unmanned aerial vehicle cluster path planning simulation method based on MADDPG-R Pending CN116560409A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310721379.4A CN116560409A (en) 2023-06-16 2023-06-16 Unmanned aerial vehicle cluster path planning simulation method based on MADDPG-R

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310721379.4A CN116560409A (en) 2023-06-16 2023-06-16 Unmanned aerial vehicle cluster path planning simulation method based on MADDPG-R

Publications (1)

Publication Number Publication Date
CN116560409A true CN116560409A (en) 2023-08-08

Family

ID=87494869

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310721379.4A Pending CN116560409A (en) 2023-06-16 2023-06-16 Unmanned aerial vehicle cluster path planning simulation method based on MADDPG-R

Country Status (1)

Country Link
CN (1) CN116560409A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117270393A (en) * 2023-10-07 2023-12-22 重庆大学 Intelligent robot cluster cooperative control system
CN117270393B (en) * 2023-10-07 2024-05-17 重庆大学 Intelligent robot cluster cooperative control system
CN117420849A (en) * 2023-12-18 2024-01-19 山东科技大学 Marine unmanned aerial vehicle formation granularity-variable collaborative search and rescue method based on reinforcement learning
CN117420849B (en) * 2023-12-18 2024-03-08 山东科技大学 Marine unmanned aerial vehicle formation granularity-variable collaborative search and rescue method based on reinforcement learning

Similar Documents

Publication Publication Date Title
CN112465151A (en) Multi-agent federal cooperation method based on deep reinforcement learning
CN116560409A (en) Unmanned aerial vehicle cluster path planning simulation method based on MADDPG-R
CN112180967B (en) Multi-unmanned aerial vehicle cooperative countermeasure decision-making method based on evaluation-execution architecture
Zhou et al. Bayesian reinforcement learning for multi-robot decentralized patrolling in uncertain environments
CN110442129B (en) Control method and system for multi-agent formation
Teng et al. Adaptive computer-generated forces for simulator-based training
Hao et al. Independent generative adversarial self-imitation learning in cooperative multiagent systems
Wu et al. Distributed task allocation for multiple heterogeneous UAVs based on consensus algorithm and online cooperative strategy
CN114510012A (en) Unmanned cluster evolution system and method based on meta-action sequence reinforcement learning
Geng et al. Learning to cooperate in decentralized multi-robot exploration of dynamic environments
CN114679729B (en) Unmanned aerial vehicle cooperative multi-target detection method integrating radar communication
Sadhu et al. Aerial-DeepSearch: Distributed multi-agent deep reinforcement learning for search missions
Xia et al. Cooperative multi-target hunting by unmanned surface vehicles based on multi-agent reinforcement learning
Rastogi et al. Sample-efficient reinforcement learning via difference models
Tan et al. Proximal policy based deep reinforcement learning approach for swarm robots
Huang et al. Multi-UAV Collision Avoidance using Multi-Agent Reinforcement Learning with Counterfactual Credit Assignment
CN116757249A (en) Unmanned aerial vehicle cluster strategy intention recognition method based on distributed reinforcement learning
Huang et al. A deep reinforcement learning approach to preserve connectivity for multi-robot systems
Roldán et al. A proposal of multi-UAV mission coordination and control architecture
Yang Self-Adaptive Swarm System
Zu et al. Research on UAV path planning method based on improved HPO algorithm in multi-task environment
Yang et al. Understanding the Application of Utility Theory in Robotics and Artificial Intelligence: A Survey
Nguyen et al. Apprenticeship learning for continuous state spaces and actions in a swarm-guidance shepherding task
Khaleghi et al. Analysis of uav/ugv control strategies in a dddams-based surveillance system
Lu et al. Strategy Generation Based on DDPG with Prioritized Experience Replay for UCAV

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination