CN113900445A - Unmanned aerial vehicle cooperative control training method and system based on multi-agent reinforcement learning - Google Patents

Unmanned aerial vehicle cooperative control training method and system based on multi-agent reinforcement learning

Info

Publication number
CN113900445A
CN113900445A (application CN202111193986.5A)
Authority
CN
China
Prior art keywords
unmanned aerial
aerial vehicle
network
target
drone
Prior art date
Legal status: Pending
Application number
CN202111193986.5A
Other languages
Chinese (zh)
Inventor
洪万福
王旺
Current Assignee
Xiamen Yuanting Information Technology Co ltd
Original Assignee
Xiamen Yuanting Information Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Xiamen Yuanting Information Technology Co ltd
Priority to CN202111193986.5A
Publication of CN113900445A

Classifications

    • GPHYSICS
    • G05CONTROLLING; REGULATING
    • G05DSYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D1/00Control of position, course or altitude of land, water, air, or space vehicles, e.g. automatic pilot
    • G05D1/10Simultaneous control of position or course in three dimensions
    • G05D1/101Simultaneous control of position or course in three dimensions specially adapted for aircraft
    • G05D1/104Simultaneous control of position or course in three dimensions specially adapted for aircraft involving a plurality of aircrafts, e.g. formation flying

Abstract

The invention discloses an unmanned aerial vehicle cooperative control training method and system based on multi-agent reinforcement learning. The method comprises the following steps: establishing a large-scale unmanned aerial vehicle cluster task model; establishing a Markov game model according to the task model; constructing a MADDPG algorithm neural network; and adjusting the hyper-parameters of the neural network. During training with the MADDPG algorithm, samples are drawn with a set probability from both exploration experience and high-quality experience; the own state information and environment information of each unmanned aerial vehicle serve as the input of the neural network and the velocities of the multiple unmanned aerial vehicles serve as the output, so that a motion planning strategy is trained and the multiple unmanned aerial vehicles can autonomously avoid obstacles and reach their target positions safely and quickly in a complex environment. The method improves the robustness of the strategy, can train an excellent strategy with stronger adaptability and higher flexibility, and has good application prospects in multi-unmanned-aerial-vehicle collaborative motion planning scenarios.

Description

Unmanned aerial vehicle cooperative control training method and system based on multi-agent reinforcement learning
Technical Field
The invention belongs to the technical field of artificial intelligence and unmanned aerial vehicles, and particularly relates to an unmanned aerial vehicle cooperative control training method and system based on multi-agent reinforcement learning.
Background
In recent years, autonomous clustered unmanned combat has become an exploration trend in intelligent military applications. Each unmanned platform can be regarded as an agent; various unstable factors exist in the unmanned combat process and the battlefield situation changes continuously, so the dynamic responses to the combat situation differ. Supervised learning requires a large number of training samples and the resulting models generalize weakly, whereas deep reinforcement learning only needs to react to evaluation information about the effect of the current system's actions. Deep reinforcement learning therefore offers higher real-time performance and robustness and is better suited to modeling the behavior of intelligent game confrontation.
The cooperative control of an unmanned cluster system comprises two aspects: coordination and cooperation. The purpose of coordination is to ensure that multiple unmanned platforms do not conflict while executing tasks; it studies the motion-control problem among multiple unmanned platforms. The purpose of cooperation is to organize multiple unmanned platforms to complete tasks jointly; it studies high-level organization and decision-making mechanisms. Unmanned cluster cooperative control involves the structural design of the unmanned cluster system, the distributed control of the cluster, and so on. By introducing reinforcement learning, the autonomous distributed control of an unmanned platform gains stronger adaptability and flexibility, the capability of a single unmanned platform to complete combat missions is improved, the coordination and cooperation of the unmanned cluster can be enhanced, and the overall performance of the cluster system is improved. In an unmanned cluster system, the environmental information perceived by a single unmanned platform is local, so a strategy obtained by a traditional single-agent reinforcement learning algorithm is not universal. To solve this problem, the number of agents is increased on the basis of single-agent reinforcement learning; by introducing a distributed cooperative strategy mechanism, each agent gains autonomy, purposefulness and coordination, together with learning, reasoning and self-organizing capabilities.
Disclosure of Invention
The invention aims to provide an unmanned aerial vehicle cooperative control training method and system based on multi-agent reinforcement learning, so as to achieve cooperative decision making by large-scale unmanned aerial vehicle fleets when executing various complex tasks in complex environments. While each unmanned aerial vehicle in the fleet trains and learns its own action strategy, it also learns the strategies of the other agents, which improves the robustness of the strategies and trains excellent strategies with higher adaptability and flexibility, giving the method good application prospects in multi-unmanned-aerial-vehicle cooperative control scenarios.
In order to achieve the above object, a first aspect of the present invention provides an unmanned aerial vehicle cooperative control training method based on multi-agent reinforcement learning, including:
step S1: establishing a task model of a large-scale unmanned aerial vehicle cluster;
step S2: establishing a Markov game model according to the task model;
step S3: constructing a MADDPG algorithm neural network;
step S4: training a MADDPG algorithm neural network;
step S5: loading the MADDPG algorithm neural network into an unmanned aerial vehicle cluster, executing unmanned aerial vehicle cluster cooperative control, and mapping actions output by the neural network into corresponding control instructions of the unmanned aerial vehicle.
Further, the step S1 specifically includes:
(1) task description: describing a cooperative task of an unmanned aerial vehicle cluster in a scene, wherein the cooperative task is that the unmanned aerial vehicle cluster needs to reach a designated destination in a certain time, and a building group and an obstacle exist in a certain range; all unmanned aerial vehicles in the unmanned aerial vehicle cluster are isomorphic and have the same performance parameters;
(2) Environmental constraints:
initial coordinate constraint: in the scene, unmanned aerial vehicle i is randomly generated in an initial area, and the target position and the obstacle positions appear randomly within a certain distance of the target area; the distance d_i,g from unmanned aerial vehicle i to the target area g at the initial moment satisfies:
d_i,g ≥ d_init,
where d_init is the effective distance for successful completion of the task;
height and boundary constraints: the flying height h satisfies the following constraint:
h_min ≤ h ≤ h_max,
where h_min is the minimum flying height and h_max is the maximum flying height;
speed and acceleration constraints: in three-dimensional space, the speed and acceleration of the unmanned aerial vehicle need to satisfy the maximum constraints:
|v_x,y,z| ≤ v_max,x,y,z,
|a_x,y,z| ≤ a_max,x,y,z;
maximum yaw angle constraint: let the coordinates of unmanned aerial vehicle track point i be (x_i, y_i, z_i); then the horizontal projection of the track segment from point i-1 to point i is α_i = (x_i - x_{i-1}, y_i - y_{i-1})^T, and the maximum yaw angle φ constrains the turn between adjacent track segments:
(α_i^T · α_(i+1)) / (|α_i| · |α_(i+1)|) ≥ cos φ;
obstacle constraint: the distance l between the unmanned aerial vehicle and an obstacle satisfies:
l ≥ R_saft + l_min + R_UAV,
where R_saft is the prescribed safe distance, l_min is the length of the obstacle in the direction of the unmanned aerial vehicle, and R_UAV is the radius of the unmanned aerial vehicle.
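A minimal sketch of how these environmental constraints might be checked in a simulation step is given below. The function names, the NumPy state layout and all numerical defaults are illustrative assumptions, not values taken from the patent.

```python
import numpy as np

# Illustrative constraint parameters (assumed values, not from the patent).
D_INIT = 50.0            # d_init: effective start distance to the target area
H_MIN, H_MAX = 5.0, 120.0
V_MAX, A_MAX = 15.0, 5.0
R_SAFT, R_UAV = 2.0, 0.5

def satisfies_constraints(pos, vel, acc, obstacles):
    """Check the environmental constraints of step S1 for one drone.

    pos, vel, acc: np.ndarray of shape (3,)
    obstacles: iterable of (center, l_min) pairs, center of shape (3,)
    """
    ok_height = H_MIN <= pos[2] <= H_MAX
    ok_speed = np.all(np.abs(vel) <= V_MAX)
    ok_accel = np.all(np.abs(acc) <= A_MAX)
    # Obstacle constraint: l >= R_saft + l_min + R_UAV for every obstacle.
    ok_obstacles = all(
        np.linalg.norm(pos - center) >= R_SAFT + l_min + R_UAV
        for center, l_min in obstacles
    )
    return ok_height and ok_speed and ok_accel and ok_obstacles

def valid_initial_position(pos, target):
    """Initial coordinate constraint: d_i,g >= d_init at the first time step."""
    return np.linalg.norm(pos - target) >= D_INIT
```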
Further, the step S2 specifically includes:
(1) A quintuple <N, S, A, P, R> is used to represent the Markov game model, wherein: N = {1, 2, …, n} denotes the set of n unmanned aerial vehicles; S is the joint state, S = s_1 × s_2 × … × s_n, the Cartesian product of the states of all unmanned aerial vehicles, where s_i represents the state of unmanned aerial vehicle i; A is the joint action, A = a_1 × a_2 × … × a_n, the Cartesian product of the actions of all unmanned aerial vehicles, where a_i represents the action of unmanned aerial vehicle i; P: S × A × S → [0, 1] is the state transition model, giving the probability that all unmanned aerial vehicles, taking the joint action in the current state, reach the next state; R is the joint reward, i.e. the Cartesian product of all unmanned aerial vehicle reward functions, R = R_1 × R_2 × … × R_n, where R_i denotes the reward value obtained by unmanned aerial vehicle i through interaction with the environment;
(2) the state space of each unmanned aerial vehicle is set in a polar coordinate system: with the center of unmanned aerial vehicle i as the origin and the direction from unmanned aerial vehicle i to its target as the positive direction, a polar coordinate system is established, and the state of unmanned aerial vehicle i is expressed as s_i = (s, s_U, s_E), where s = (P_ix, P_iy, P_igx, P_igy) is the position information of unmanned aerial vehicle i and its target, P_ix, P_iy being the position of unmanned aerial vehicle i and P_igx, P_igy the position of its target; s_U = (P_jx, P_jy) is the position of the obstacle closest to unmanned aerial vehicle i within its communication range, and if no other obstacle exists within the communication range, s_E = (0, 0);
(3) the action space of the unmanned aerial vehicle is set; for unmanned aerial vehicle i the action space is a_i = (ω_it), where ω_it is the angular velocity of unmanned aerial vehicle i at time t;
(4) the reward function of the unmanned aerial vehicle is set; the reward function of unmanned aerial vehicle i is specifically set as follows:
R_1 = 10 + R_it,
R_2 = -20,
R_3 = -2|α| + l - τ,
R_4 = -2|α| + 1,
R_i = ω_1 R_1 + ω_2 R_2 + ω_3 R_3 + ω_4 R_4,
where R_1 represents the reward value when the unmanned aerial vehicle reaches the target, and R_it is a penalty on the time spent reaching the target, determined by the penalty coefficient W_t, the actual time T_i spent reaching the target position, and the shortest time |P_ig - P_io| / u_i needed to reach the target position along a straight line at linear velocity u_i, P_io and P_ig being the initial position and target position of the unmanned aerial vehicle; R_2 is the collision penalty value of the unmanned aerial vehicle; R_3 is the collision early-warning term: the obstacle or other unmanned aerial vehicle closest to the unmanned aerial vehicle within the communication range of communication distance l is selected as the dangerous object, and a corresponding penalty is given when the Euclidean distance between the unmanned aerial vehicle and the dangerous object ahead at the current moment is smaller than that at the previous moment; R_4 is the dense reward function of the unmanned aerial vehicle, whose degree of penalty increases with the target angular velocity α of the unmanned aerial vehicle; the reward R_i of unmanned aerial vehicle i is the weighted sum of R_1, R_2, R_3 and R_4 with the respective weights ω_1, ω_2, ω_3, ω_4.
Further, the step S3 specifically includes:
(1) Construct the policy network in the MADDPG algorithm: the policy network μ_i of unmanned aerial vehicle i consists of an input layer, hidden layers and an output layer; its input is the state vector s_i of unmanned aerial vehicle i and its output is the action vector a_i = μ_i(s_i) of unmanned aerial vehicle i;
(2) construct the evaluation network of the MADDPG algorithm: the evaluation network Q_i of unmanned aerial vehicle i consists of an input layer, hidden layers and an output layer; its input is the state vector x = (s_1, …, s_n) of all unmanned aerial vehicles together with the actions a_1, …, a_n obtained by all unmanned aerial vehicles from their respective policy networks, and its output is the action value function of unmanned aerial vehicle i, i.e. the centralized action value function Q_i(x, a_1, …, a_n);
(3) construct the target neural networks: for unmanned aerial vehicle i, the policy network μ_i and the evaluation network Q_i are each copied into a corresponding target network μ'_i and Q'_i, where θ_i^μ and θ_i^Q respectively denote the parameters of the current policy network and evaluation network, and θ_i^μ' and θ_i^Q' respectively denote the parameters of the target policy network and target evaluation network.
Further, the step S4 specifically includes:
(1) Initialize the parameters θ_i^μ, θ_i^Q, θ_i^μ', θ_i^Q' of all networks and empty each experience replay buffer;
(2) set the total number of training episodes and start iterating;
(3) for each unmanned aerial vehicle, obtain the action a_i = μ_i(s_i) from the current policy network based on the state s_i;
(4) each unmanned aerial vehicle executes its action a_i to obtain its new state s'_i and reward R_i, and the transition (s, s', a_1, …, a_n, r_1, …, r_n) is added to the experience replay buffer;
(5) for each unmanned aerial vehicle, sample M transitions from the experience replay buffer and start updating the networks;
(6) compute the best action to take at the next moment through the target policy network: a'_i = μ'_i(s'_i);
(7) compute the approximate true value through the target evaluation network, with the states and actions as input and output y = r_i + γ·Q'_i(x', a'_1, …, a'_n), where γ is the discount factor;
(8) update the current evaluation network with the loss function L(θ_i^Q) = (1/M) Σ_k (y_k − Q_i(x_k, a_1^k, …, a_n^k))²;
(9) update the current policy network through the sampled policy gradient ∇_θ_i^μ J ≈ (1/M) Σ_k ∇_θ_i^μ μ_i(s_i^k) · ∇_a_i Q_i(x_k, a_1^k, …, a_i, …, a_n^k) evaluated at a_i = μ_i(s_i^k);
(10) if the number of iterations reaches the update frequency of the network parameters, update the parameters of the target evaluation network and the target policy network:
θ_i^Q' ← τ·θ_i^Q + (1 − τ)·θ_i^Q',
θ_i^μ' ← τ·θ_i^μ + (1 − τ)·θ_i^μ',
where the update is a soft update and τ is the soft-update proportionality coefficient.
Further, the hyper-parameters of the neural network in step S4 include:
the policy network and the evaluation network are fully-connected neural networks and are trained with the Adam optimizer;
setting basic parameters of a policy network and an evaluation network, wherein the basic parameters comprise: the number of hidden layers, the activation function, the learning rate, the number of samples for batch updating and each reward weight in the reward function.
Further, the policy network is configured as: two hidden layers whose activation functions are relu functions, the first layer with 64 nodes and the second with 32 nodes; the output layer has 1 node, namely the action taken by the unmanned aerial vehicle, and uses a tanh activation function; the learning rate of the policy network is 0.001. The evaluation network is configured as: two hidden layers with relu activation functions and 64 nodes; the output layer has 1 node, namely the Q value function, and uses the linear activation function y = x + b, where b is a bias parameter; the learning rate of the evaluation network is 0.0001. The number of samples drawn for each batch update by random sampling from experience is set to N = 128.
The second aspect of the present invention provides an unmanned aerial vehicle cooperative control training system based on multi-agent reinforcement learning, used for implementing the unmanned aerial vehicle cooperative control training method based on multi-agent reinforcement learning according to any technical scheme of the first aspect of the present invention, the system comprising:
the task model data acquisition module is used for preprocessing data of the multi-unmanned-aerial-vehicle environment in a task, coding the observation space and the global state space of each unmanned aerial vehicle in the environment and converting the coded observation space and the global state space into vector features which can be identified by a neural network;
the neural network construction module is used for constructing the MADDPG neural network according to the task model and receiving the vector characteristics transmitted by the task model data acquisition module;
the parameter adjusting module is used for setting the hyper-parameters of the neural network, wherein the hyper-parameters comprise the number of hidden layers, an activation function, a learning rate, the number of samples updated in batches and reward weight in a reward function;
and the main control unit is used for loading the MADDPG algorithm neural network into the unmanned aerial vehicle cluster, executing the unmanned aerial vehicle cluster cooperative control, and mapping the action output by the neural network into a corresponding control instruction of the unmanned aerial vehicle.
The unmanned aerial vehicle cooperative control training method based on multi-agent reinforcement learning improves the robustness of strategies, can train excellent strategies with stronger adaptability and higher flexibility, and has good application prospect in the scene of multi-unmanned aerial vehicle cooperative control.
Drawings
FIG. 1 is a schematic flow chart of the method of the present invention.
FIG. 2 is a schematic diagram of the system of the present invention.
Detailed Description
To further illustrate the various embodiments, the invention provides the accompanying drawings. The accompanying drawings, which are incorporated in and constitute a part of this disclosure, illustrate embodiments of the invention and, together with the description, serve to explain the principles of the embodiments. Those skilled in the art will appreciate still other possible embodiments and advantages of the present invention with reference to these figures.
The invention will now be further described with reference to the accompanying drawings and detailed description.
As shown in fig. 1, the present invention provides an unmanned aerial vehicle cooperative control training method based on multi-agent reinforcement learning, which includes the following steps:
the method comprises the following steps: and establishing a large-scale unmanned aerial vehicle cluster task model.
The method specifically comprises the following steps: the unmanned aerial vehicle is set as a circular agent, and the radius of unmanned aerial vehicle i is r_i; the obstacles are set as circles of radius r_o, and the collision distance between an unmanned aerial vehicle and an obstacle is D_io = r_i + r_o; the target position of unmanned aerial vehicle i is a circular area of radius r_ig, and when unmanned aerial vehicle i touches the target range, i.e. the distance D_ig between the center of unmanned aerial vehicle i and the center of the target range satisfies D_ig ≤ r_i + r_ig, unmanned aerial vehicle i is judged to have successfully reached the target position.
The position of unmanned aerial vehicle i is P_i = [x_i, y_i]^T, and the communication distance of the unmanned aerial vehicle is denoted L_c; the communication range of the unmanned aerial vehicle is a circle centered on the unmanned aerial vehicle with radius L_c, and within its communication range the unmanned aerial vehicle can sense information about other unmanned aerial vehicles or obstacles.
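The following minimal sketch illustrates one way this circular-agent task model could be expressed in simulation code; the class and function names and the 2-D NumPy representation are assumptions made for illustration only.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class Drone:
    position: np.ndarray   # P_i = [x_i, y_i]
    radius: float          # r_i
    comm_range: float      # L_c

@dataclass
class Disc:
    center: np.ndarray
    radius: float          # r_o for obstacles, r_ig for the target area

def collided(drone: Drone, obstacle: Disc) -> bool:
    """Collision when the centre distance falls below D_io = r_i + r_o."""
    return np.linalg.norm(drone.position - obstacle.center) <= drone.radius + obstacle.radius

def reached_target(drone: Drone, target: Disc) -> bool:
    """Success when the drone touches the circular target range."""
    return np.linalg.norm(drone.position - target.center) <= drone.radius + target.radius

def sensed(drone: Drone, other_center: np.ndarray) -> bool:
    """An object is observable only inside the communication circle of radius L_c."""
    return np.linalg.norm(drone.position - other_center) <= drone.comm_range
```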
Step two: and establishing a Markov game model according to the task model.
The method specifically comprises the following steps:
(1) A quintuple <N, S, A, P, R> is used to represent the Markov game model, and each component is explained as follows: N = {1, 2, …, n} denotes the set of n unmanned aerial vehicles; S is the joint state, S = s_1 × s_2 × … × s_n, the Cartesian product of the states of all unmanned aerial vehicles, where s_i represents the state of unmanned aerial vehicle i; A is the joint action, A = a_1 × a_2 × … × a_n, the Cartesian product of the actions of all unmanned aerial vehicles, where a_i represents the action of unmanned aerial vehicle i;
P: S × A × S → [0, 1] is the state transition model, giving the probability that all unmanned aerial vehicles, taking the joint action in the current state, reach the next state; R is the joint reward, i.e. the Cartesian product of all unmanned aerial vehicle reward functions, R = R_1 × R_2 × … × R_n, where R_i denotes the reward value obtained by unmanned aerial vehicle i through interaction with the environment;
(2) the state space of each unmanned aerial vehicle is set in a polar coordinate system: with the center of unmanned aerial vehicle i as the origin and the direction from unmanned aerial vehicle i to its target as the positive direction, a polar coordinate system is established, and the state of unmanned aerial vehicle i is expressed as s_i = (s, s_U, s_E), where s = (P_ix, P_iy, P_igx, P_igy) is the position information of unmanned aerial vehicle i and its target, P_ix, P_iy being the position of unmanned aerial vehicle i and P_igx, P_igy the position of its target; s_U = (P_jx, P_jy) is the position of the obstacle closest to unmanned aerial vehicle i within its communication range, and if no other obstacle exists within the communication range, s_E = (0, 0);
(3) the action space of the unmanned aerial vehicle is set; for unmanned aerial vehicle i the action space is a_i = (ω_it), where ω_it is the angular velocity of unmanned aerial vehicle i at time t; because of the flight restrictions of the unmanned aerial vehicle and the restrictions imposed by obstacles, the selectable actions differ at different moments, and the unmanned aerial vehicle can only select actions from its current action space;
(4) the reward function of the unmanned aerial vehicle is set; the reward function of unmanned aerial vehicle i is specifically set as follows (a sketch of this reward computation is given after this step):
R_1 = 10 + R_it,
R_2 = -20,
R_3 = -2|α| + l - τ,
R_4 = -2|α| + 1,
R_i = ω_1 R_1 + ω_2 R_2 + ω_3 R_3 + ω_4 R_4,
where R_1 represents the reward value when the unmanned aerial vehicle reaches the target, and R_it is a penalty on the time spent reaching the target, determined by the penalty coefficient W_t, the actual time T_i spent reaching the target position, and the shortest time |P_ig - P_io| / u_i needed to reach the target position along a straight line at linear velocity u_i, P_io and P_ig being the initial position and target position of the unmanned aerial vehicle; R_2 is the collision penalty value of the unmanned aerial vehicle; R_3 is the collision early-warning term: the obstacle or other unmanned aerial vehicle closest to the unmanned aerial vehicle within the communication range of communication distance l is selected as the dangerous object, and a corresponding penalty is given when the Euclidean distance between the unmanned aerial vehicle and the dangerous object ahead at the current moment is smaller than that at the previous moment; R_4 is the dense reward function of the unmanned aerial vehicle, whose degree of penalty increases with the target angular velocity α of the unmanned aerial vehicle; the reward R_i of unmanned aerial vehicle i is the weighted sum of R_1, R_2, R_3 and R_4 with the respective weights ω_1, ω_2, ω_3, ω_4.
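Purely as an illustration of the weighted-sum reward structure above, here is a sketch in Python; the helper signature, the exact form of the time term R_it and of the danger-warning trigger, and the default weights are assumptions and would have to be matched to the formulas in the original filing.

```python
def drone_reward(reached_target, collided, dist_to_danger, prev_dist_to_danger,
                 alpha, l, tau, time_spent, shortest_time, W_t=0.1,
                 weights=(1.0, 1.0, 1.0, 1.0)):
    """Weighted-sum reward R_i = w1*R1 + w2*R2 + w3*R3 + w4*R4 for one drone.

    alpha          : target angular quantity used in R3/R4 (|alpha| is penalised)
    l, tau         : communication distance and the constant used in R3
    time_spent     : T_i, time actually taken to reach the target
    shortest_time  : straight-line time |P_ig - P_io| / u_i
    """
    w1, w2, w3, w4 = weights

    # R1: success reward plus a time-based term R_it (assumed linear penalty).
    R1 = 0.0
    if reached_target:
        R1 = 10.0 - W_t * (time_spent - shortest_time)

    # R2: fixed collision penalty.
    R2 = -20.0 if collided else 0.0

    # R3: collision early warning, applied only when the drone is closing in
    # on the nearest dangerous object within its communication range.
    R3 = 0.0
    if dist_to_danger is not None and dist_to_danger < prev_dist_to_danger:
        R3 = -2.0 * abs(alpha) + l - tau

    # R4: dense shaping term, growing penalty with |alpha|.
    R4 = -2.0 * abs(alpha) + 1.0

    return w1 * R1 + w2 * R2 + w3 * R3 + w4 * R4
```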
Step three: constructing a MADDPG (Multi agent reinforcement learning) algorithm neural network.
The method specifically comprises the following steps:
(1) Construct the policy network (Actor) in the MADDPG algorithm: the policy network μ_i of unmanned aerial vehicle i consists of an input layer, hidden layers and an output layer; its input is the state vector s_i of unmanned aerial vehicle i and its output is the action vector a_i = μ_i(s_i) of unmanned aerial vehicle i;
(2) construct the evaluation network (Critic) of the MADDPG algorithm: the evaluation network Q_i of unmanned aerial vehicle i consists of an input layer, hidden layers and an output layer; its input is the state vector x = (s_1, …, s_n) of all unmanned aerial vehicles together with the actions a_1, …, a_n obtained by all unmanned aerial vehicles from their respective policy networks, and its output is the action value function of unmanned aerial vehicle i, i.e. the centralized action value function Q_i(x, a_1, …, a_n);
(3) construct the target neural networks: for unmanned aerial vehicle i, the policy network μ_i and the evaluation network Q_i are each copied into a corresponding target network μ'_i and Q'_i, where θ_i^μ and θ_i^Q respectively denote the parameters of the current policy network and evaluation network, and θ_i^μ' and θ_i^Q' respectively denote the parameters of the target policy network and target evaluation network.
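A minimal PyTorch sketch of this decentralized-actor / centralized-critic layout is given below, assuming PyTorch as the framework; the class names, hidden-layer sizes and the flat concatenation of all states and actions into the critic input are illustrative choices, not the patent's prescribed implementation.

```python
import copy
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Policy network mu_i: maps the drone's own state s_i to its action a_i."""
    def __init__(self, state_dim, action_dim, hidden=(64, 32)):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden[0]), nn.ReLU(),
            nn.Linear(hidden[0], hidden[1]), nn.ReLU(),
            nn.Linear(hidden[1], action_dim), nn.Tanh(),
        )

    def forward(self, state):
        return self.net(state)

class Critic(nn.Module):
    """Centralized evaluation network Q_i(x, a_1, ..., a_n)."""
    def __init__(self, n_drones, state_dim, action_dim, hidden=64):
        super().__init__()
        in_dim = n_drones * (state_dim + action_dim)
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, all_states, all_actions):
        # all_states: (batch, n*state_dim), all_actions: (batch, n*action_dim)
        return self.net(torch.cat([all_states, all_actions], dim=-1))

def make_networks(n_drones, state_dim, action_dim):
    """One actor/critic pair per drone, each copied into a target network."""
    actors = [Actor(state_dim, action_dim) for _ in range(n_drones)]
    critics = [Critic(n_drones, state_dim, action_dim) for _ in range(n_drones)]
    target_actors = [copy.deepcopy(a) for a in actors]
    target_critics = [copy.deepcopy(c) for c in critics]
    return actors, critics, target_actors, target_critics
```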
Step four: training the maddppg algorithm neural network.
The method specifically comprises the following steps:
(1) Initialize the parameters θ_i^μ, θ_i^Q, θ_i^μ', θ_i^Q' of all networks and empty each experience replay buffer;
(2) set the total number of training episodes and start iterating;
(3) for each unmanned aerial vehicle, obtain the action a_i = μ_i(s_i) from the current policy network based on the state s_i;
(4) each unmanned aerial vehicle executes its action a_i to obtain its new state s'_i and reward R_i, and the transition (s, s', a_1, …, a_n, r_1, …, r_n) is added to the experience replay buffer;
(5) for each unmanned aerial vehicle, sample M transitions from the experience replay buffer and start updating the networks;
(6) compute the best action to take at the next moment through the target policy network: a'_i = μ'_i(s'_i);
(7) compute the approximate true value through the target evaluation network, with the states and actions as input and output y = r_i + γ·Q'_i(x', a'_1, …, a'_n), where γ is the discount factor;
(8) update the current evaluation network with the loss function L(θ_i^Q) = (1/M) Σ_k (y_k − Q_i(x_k, a_1^k, …, a_n^k))²;
(9) update the current policy network through the sampled policy gradient ∇_θ_i^μ J ≈ (1/M) Σ_k ∇_θ_i^μ μ_i(s_i^k) · ∇_a_i Q_i(x_k, a_1^k, …, a_i, …, a_n^k) evaluated at a_i = μ_i(s_i^k);
(10) if the number of iterations reaches the update frequency of the network parameters, update the parameters of the target evaluation network and the target policy network:
θ_i^Q' ← τ·θ_i^Q + (1 − τ)·θ_i^Q',
θ_i^μ' ← τ·θ_i^μ + (1 − τ)·θ_i^μ',
where the update is a soft update and τ is the soft-update proportionality coefficient.
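As an illustration of steps (5)-(10), here is a hedged PyTorch sketch of one MADDPG update for drone i; it assumes the `Actor`/`Critic` modules sketched earlier, a replay buffer that yields already-batched tensors, and the standard MADDPG update rules, so the exact bookkeeping may differ from the original implementation.

```python
import torch
import torch.nn.functional as F

def maddpg_update(i, batch, actors, critics, target_actors, target_critics,
                  actor_opts, critic_opts, gamma=0.95, tau=0.01):
    """One update of drone i's evaluation (critic) and policy (actor) networks.

    batch: dict with per-drone lists of tensors (batch dimension first):
        states[j], next_states[j], actions[j], rewards[j]
    """
    states, actions = batch["states"], batch["actions"]
    next_states, rewards = batch["next_states"], batch["rewards"]
    n = len(actors)

    # (6) best next actions from every drone's target policy network.
    with torch.no_grad():
        next_actions = [target_actors[j](next_states[j]) for j in range(n)]
        # (7) approximate true value y = r_i + gamma * Q'_i(x', a'_1..a'_n).
        y = rewards[i] + gamma * target_critics[i](
            torch.cat(next_states, dim=-1), torch.cat(next_actions, dim=-1))

    # (8) critic update: mean squared error between Q_i and y.
    q = critics[i](torch.cat(states, dim=-1), torch.cat(actions, dim=-1))
    critic_loss = F.mse_loss(q, y)
    critic_opts[i].zero_grad(); critic_loss.backward(); critic_opts[i].step()

    # (9) actor update: ascend Q_i with a_i replaced by mu_i(s_i).
    actions_pred = [a.detach() for a in actions]
    actions_pred[i] = actors[i](states[i])
    actor_loss = -critics[i](torch.cat(states, dim=-1),
                             torch.cat(actions_pred, dim=-1)).mean()
    actor_opts[i].zero_grad(); actor_loss.backward(); actor_opts[i].step()

    # (10) soft update of the target networks with coefficient tau.
    for target, source in ((target_critics[i], critics[i]),
                           (target_actors[i], actors[i])):
        for tp, sp in zip(target.parameters(), source.parameters()):
            tp.data.copy_(tau * sp.data + (1.0 - tau) * tp.data)
```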
During the training process, the neural network hyper-parameters need to be set.
The method specifically comprises the following steps:
the policy network and the evaluation network are fully-connected neural networks and are trained with the Adam optimizer;
setting parameters such as the number of hidden layers, an activation function, a learning rate, the number of samples updated in batches and the like of the strategy network and the evaluation network, and adjusting the reward weight in the reward function.
An example set of parameter settings is given below:
the strategy network comprises two hidden layers, the activation functions of the hidden layers are relu functions, the first layer is 64 nodes, the second layer is 32 nodes, the output layer is 1 node, namely the action taken by the unmanned aerial vehicle, the activation function adopted by the output layer is a tanh function, and the learning rate of the strategy network is 0.001.
The evaluation network also comprises two hidden layers with relu activation functions and 64 nodes; the output layer has 1 node, namely the Q value function, and uses the linear activation function y = x + b, where b is a bias parameter; the learning rate of the evaluation network is 0.0001. The number of samples drawn for each batch update by random sampling from experience is set to N = 128.
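The parameter settings above could be realised, for example, as follows; this is a hedged configuration sketch in PyTorch that reuses the layer sizes, activations, learning rates and batch size quoted in the text, while the module layout and the dimension constants are assumptions.

```python
import torch
import torch.nn as nn

STATE_DIM, ACTION_DIM, N_DRONES = 6, 1, 4   # illustrative dimensions
BATCH_SIZE = 128                            # N = 128 samples per batch update

# Policy network (Actor): 64- and 32-node relu hidden layers, tanh output, lr = 0.001.
policy_net = nn.Sequential(
    nn.Linear(STATE_DIM, 64), nn.ReLU(),
    nn.Linear(64, 32), nn.ReLU(),
    nn.Linear(32, ACTION_DIM), nn.Tanh(),
)
policy_optimizer = torch.optim.Adam(policy_net.parameters(), lr=1e-3)

# Evaluation network (Critic): two 64-node relu hidden layers, linear output
# (the final Linear already computes y = Wx + b), lr = 0.0001; its input is the
# concatenation of all drones' states and actions.
critic_in = N_DRONES * (STATE_DIM + ACTION_DIM)
eval_net = nn.Sequential(
    nn.Linear(critic_in, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 1),
)
eval_optimizer = torch.optim.Adam(eval_net.parameters(), lr=1e-4)
```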
Step six: loading the MADDPG algorithm neural network into an unmanned aerial vehicle cluster, executing unmanned aerial vehicle cluster cooperative control, and mapping actions output by the neural network into corresponding control instructions of the unmanned aerial vehicle. The method specifically comprises the following steps: and loading the stored strategy network and evaluation network parameter data into the unmanned aerial vehicle cluster, so that the multiple unmanned aerial vehicles execute flight actions according to the trained network, and completing a large-scale unmanned aerial vehicle motion planning task.
As shown in fig. 2, the present invention further provides a system for implementing the method according to the foregoing embodiment, including:
the task model data acquisition module 10 is used for preprocessing data of the multi-unmanned-aerial-vehicle environment in a task, coding the observation space and the global state space of each unmanned aerial vehicle in the environment, and converting the coded observation space and the global state space into vector features which can be identified by a neural network;
the neural network building module 20 is used for building the MADDPG neural network according to the task model, setting hidden layer dimensionality and receiving coding information from the environment;
the parameter adjusting module 30 is configured to set the hyper-parameters of the neural network, including: setting different numbers of hidden layers for the network structure, replacing different activation functions, controlling the learning rate of the network, and setting the number of samples per batch update; the reward weights in the reward function can also be adjusted to improve the cooperative control effect;
and the main control unit 40 is used for loading the MADDPG algorithm neural network into the unmanned aerial vehicle cluster, executing the unmanned aerial vehicle cluster cooperative control, and mapping the action output by the neural network into a corresponding control instruction of the unmanned aerial vehicle.
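To make the division of responsibilities concrete, here is a hedged sketch of how these four modules might be wired together; the class and method names are invented for illustration and do not appear in the patent.

```python
class TrainingSystem:
    """Illustrative wiring of the four modules described above."""

    def __init__(self, data_module, network_module, param_module, controller):
        self.data_module = data_module        # encodes observations into vector features
        self.network_module = network_module  # builds the MADDPG networks
        self.param_module = param_module      # holds hyper-parameters and reward weights
        self.controller = controller          # maps actions to drone control commands

    def build(self, task_model):
        hyper = self.param_module.hyperparameters()
        return self.network_module.build_maddpg(task_model, hyper)

    def step(self, raw_observations, networks):
        features = self.data_module.encode(raw_observations)
        actions = [actor(obs) for actor, obs in zip(networks.actors, features)]
        return [self.controller.to_command(a) for a in actions]
```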
The embodiments in the present specification are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, apparatus or computer program product. Accordingly, embodiments of the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, embodiments of the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
Embodiments of the present application are described with reference to flowchart illustrations and/or block diagrams of methods, terminal devices (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing terminal to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing terminal, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The protection of the present invention is not limited to the above embodiments. Variations and advantages that may occur to those skilled in the art may be incorporated into the invention without departing from the spirit and scope of the inventive concept and the scope of the appended claims is intended to be protected.

Claims (9)

1. An unmanned aerial vehicle cooperative control training method based on multi-agent reinforcement learning is characterized by comprising the following steps:
step S1: establishing a task model of a large-scale unmanned aerial vehicle cluster;
step S2: establishing a Markov game model according to the task model;
step S3: constructing a MADDPG algorithm neural network;
step S4: training a MADDPG algorithm neural network;
step S5: loading the MADDPG algorithm neural network into an unmanned aerial vehicle cluster, executing unmanned aerial vehicle cluster cooperative control, and mapping actions output by the neural network into corresponding control instructions of the unmanned aerial vehicle.
2. The cooperative drone control training method based on multi-agent reinforcement learning of claim 1, wherein the step S1 specifically includes:
(1) task description: describing a cooperative task of an unmanned aerial vehicle cluster in a scene, wherein the cooperative task is that the unmanned aerial vehicle cluster needs to reach a designated destination in a certain time, and a building group and an obstacle exist in a certain range; all unmanned aerial vehicles in the unmanned aerial vehicle cluster are isomorphic and have the same performance parameters;
(2) Environmental constraints:
initial coordinate constraint: in the scene, unmanned aerial vehicle i is randomly generated in an initial area, and the target position and the obstacle positions appear randomly within a certain distance of the target area; the distance d_i,g from unmanned aerial vehicle i to the target area g at the initial moment satisfies:
d_i,g ≥ d_init,
where d_init is the effective distance for successful completion of the task;
height and boundary constraints: the flying height h satisfies the following constraint:
h_min ≤ h ≤ h_max,
where h_min is the minimum flying height and h_max is the maximum flying height;
speed and acceleration constraints: in three-dimensional space, the speed and acceleration of the unmanned aerial vehicle need to satisfy the maximum constraints:
|v_x,y,z| ≤ v_max,x,y,z,
|a_x,y,z| ≤ a_max,x,y,z;
maximum yaw angle constraint: let the coordinates of unmanned aerial vehicle track point i be (x_i, y_i, z_i); then the horizontal projection of the track segment from point i-1 to point i is α_i = (x_i - x_{i-1}, y_i - y_{i-1})^T, and the maximum yaw angle φ constrains the turn between adjacent track segments:
(α_i^T · α_(i+1)) / (|α_i| · |α_(i+1)|) ≥ cos φ;
obstacle constraint: the distance l between the unmanned aerial vehicle and an obstacle satisfies:
l ≥ R_saft + l_min + R_UAV,
where R_saft is the prescribed safe distance, l_min is the length of the obstacle in the direction of the unmanned aerial vehicle, and R_UAV is the radius of the unmanned aerial vehicle.
3. The cooperative drone control training method based on multi-agent reinforcement learning of claim 1, wherein the step S2 specifically includes:
(1) A quintuple <N, S, A, P, R> is used to represent the Markov game model, wherein: N = {1, 2, …, n} denotes the set of n unmanned aerial vehicles; S is the joint state, S = s_1 × s_2 × … × s_n, the Cartesian product of the states of all unmanned aerial vehicles, where s_i represents the state of unmanned aerial vehicle i; A is the joint action, A = a_1 × a_2 × … × a_n, the Cartesian product of the actions of all unmanned aerial vehicles, where a_i represents the action of unmanned aerial vehicle i; P: S × A × S → [0, 1] is the state transition model, giving the probability that all unmanned aerial vehicles, taking the joint action in the current state, reach the next state; R is the joint reward, i.e. the Cartesian product of all unmanned aerial vehicle reward functions, R = R_1 × R_2 × … × R_n, where R_i denotes the reward value obtained by unmanned aerial vehicle i through interaction with the environment;
(2) the state space of each unmanned aerial vehicle is set in a polar coordinate system: with the center of unmanned aerial vehicle i as the origin and the direction from unmanned aerial vehicle i to its target as the positive direction, a polar coordinate system is established, and the state of unmanned aerial vehicle i is expressed as s_i = (s, s_U, s_E), where s = (P_ix, P_iy, P_igx, P_igy) is the position information of unmanned aerial vehicle i and its target, P_ix, P_iy being the position of unmanned aerial vehicle i and P_igx, P_igy the position of its target; s_U = (P_jx, P_jy) is the position of the obstacle closest to unmanned aerial vehicle i within its communication range, and if no other obstacle exists within the communication range, s_E = (0, 0);
(3) the action space of the unmanned aerial vehicle is set; for unmanned aerial vehicle i the action space is a_i = (ω_it), where ω_it is the angular velocity of unmanned aerial vehicle i at time t;
(4) and setting the reward function of the unmanned aerial vehicle.
4. The cooperative unmanned aerial vehicle control training method based on multi-agent reinforcement learning of claim 3, wherein the reward function of the unmanned aerial vehicle i in the step (4) is specifically set as follows:
R_1 = 10 + R_it,
R_2 = -20,
R_3 = -2|α| + l - τ,
R_4 = -2|α| + 1,
R_i = ω_1 R_1 + ω_2 R_2 + ω_3 R_3 + ω_4 R_4,
where R_1 represents the reward value when the unmanned aerial vehicle reaches the target, and R_it is a penalty on the time spent reaching the target, determined by the penalty coefficient W_t, the actual time T_i spent reaching the target position, and the shortest time |P_ig - P_io| / u_i needed to reach the target position along a straight line at linear velocity u_i, P_io and P_ig being the initial position and target position of the unmanned aerial vehicle; R_2 is the collision penalty value of the unmanned aerial vehicle; R_3 is the collision early-warning term: the obstacle or other unmanned aerial vehicle closest to the unmanned aerial vehicle within the communication range of communication distance l is selected as the dangerous object, and a corresponding penalty is given when the Euclidean distance between the unmanned aerial vehicle and the dangerous object ahead at the current moment is smaller than that at the previous moment; R_4 is the dense reward function of the unmanned aerial vehicle, whose degree of penalty increases with the target angular velocity α of the unmanned aerial vehicle.
5. The cooperative drone control training method based on multi-agent reinforcement learning of claim 1, wherein the step S3 specifically includes:
(1) construct the policy network in the MADDPG algorithm: the policy network μ_i of unmanned aerial vehicle i consists of an input layer, hidden layers and an output layer; its input is the state vector s_i of unmanned aerial vehicle i and its output is the action vector a_i = μ_i(s_i) of unmanned aerial vehicle i;
(2) construct the evaluation network of the MADDPG algorithm: the evaluation network Q_i of unmanned aerial vehicle i consists of an input layer, hidden layers and an output layer; its input is the state vector x = (s_1, …, s_n) of all unmanned aerial vehicles together with the actions a_1, …, a_n obtained by all unmanned aerial vehicles from their respective policy networks, and its output is the action value function of unmanned aerial vehicle i, i.e. the centralized action value function Q_i(x, a_1, …, a_n);
(3) construct the target neural networks: for unmanned aerial vehicle i, the policy network μ_i and the evaluation network Q_i are each copied into a corresponding target network μ'_i and Q'_i, where θ_i^μ and θ_i^Q respectively denote the parameters of the current policy network and evaluation network, and θ_i^μ' and θ_i^Q' respectively denote the parameters of the target policy network and target evaluation network.
6. The cooperative drone control training method based on multi-agent reinforcement learning as claimed in claim 5, wherein the step S4 specifically includes:
(1) initialize the parameters θ_i^μ, θ_i^Q, θ_i^μ', θ_i^Q' of all networks and empty each experience replay buffer;
(2) set the total number of training episodes and start iterating;
(3) for each unmanned aerial vehicle, obtain the action a_i = μ_i(s_i) from the current policy network based on the state s_i;
(4) each unmanned aerial vehicle executes its action a_i to obtain its new state s'_i and reward R_i, and the transition (s, s', a_1, …, a_n, r_1, …, r_n) is added to the experience replay buffer;
(5) for each unmanned aerial vehicle, sample M transitions from the experience replay buffer and start updating the networks;
(6) compute the best action to take at the next moment through the target policy network: a'_i = μ'_i(s'_i);
(7) compute the approximate true value through the target evaluation network, with the states and actions as input and output y = r_i + γ·Q'_i(x', a'_1, …, a'_n), where γ is the discount factor;
(8) update the current evaluation network with the loss function L(θ_i^Q) = (1/M) Σ_k (y_k − Q_i(x_k, a_1^k, …, a_n^k))²;
(9) update the current policy network through the sampled policy gradient ∇_θ_i^μ J ≈ (1/M) Σ_k ∇_θ_i^μ μ_i(s_i^k) · ∇_a_i Q_i(x_k, a_1^k, …, a_i, …, a_n^k) evaluated at a_i = μ_i(s_i^k);
(10) if the number of iterations reaches the update frequency of the network parameters, update the parameters of the target evaluation network and the target policy network:
θ_i^Q' ← τ·θ_i^Q + (1 − τ)·θ_i^Q',
θ_i^μ' ← τ·θ_i^μ + (1 − τ)·θ_i^μ',
where the update is a soft update and τ is the soft-update proportionality coefficient.
7. The cooperative drone control training method based on multi-agent reinforcement learning of claim 1, wherein the hyper-parameters of the neural network in the step S4 include:
the policy network and the evaluation network are fully-connected neural networks and are trained with the Adam optimizer;
setting basic parameters of a policy network and an evaluation network, wherein the basic parameters comprise: the number of hidden layers, the activation function, the learning rate, the number of samples for batch updating and each reward weight in the reward function.
8. The multi-agent reinforcement learning-based unmanned aerial vehicle cooperative control training method of claim 7, wherein the policy network is configured as: two hidden layers whose activation functions are relu functions, the first layer with 64 nodes and the second with 32 nodes; the output layer has 1 node, namely the action taken by the unmanned aerial vehicle, and uses a tanh activation function; the learning rate of the policy network is 0.001; the evaluation network is configured as: two hidden layers with relu activation functions and 64 nodes; the output layer has 1 node, namely the Q value function, and uses the linear activation function y = x + b, where b is a bias parameter; the learning rate of the evaluation network is 0.0001; the number of samples drawn for each batch update by random sampling from experience is set to N = 128.
9. An unmanned aerial vehicle cooperative control training system based on multi-agent reinforcement learning, for implementing the unmanned aerial vehicle cooperative control training method based on multi-agent reinforcement learning of any one of claims 1 to 8, the system comprising:
the task model data acquisition module is used for preprocessing data of the multi-unmanned-aerial-vehicle environment in a task, coding the observation space and the global state space of each unmanned aerial vehicle in the environment and converting the coded observation space and the global state space into vector features which can be identified by a neural network;
the neural network construction module is used for constructing the MADDPG neural network according to the task model and receiving the vector characteristics transmitted by the task model data acquisition module;
the parameter adjusting module is used for setting the hyper-parameters of the neural network, wherein the hyper-parameters comprise the number of hidden layers, an activation function, a learning rate, the number of samples updated in batches and reward weight in a reward function;
and the main control unit is used for loading the MADDPG algorithm neural network into the unmanned aerial vehicle cluster, executing the unmanned aerial vehicle cluster cooperative control, and mapping the action output by the neural network into a corresponding control instruction of the unmanned aerial vehicle.
CN202111193986.5A 2021-10-13 2021-10-13 Unmanned aerial vehicle cooperative control training method and system based on multi-agent reinforcement learning Pending CN113900445A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111193986.5A CN113900445A (en) 2021-10-13 2021-10-13 Unmanned aerial vehicle cooperative control training method and system based on multi-agent reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111193986.5A CN113900445A (en) 2021-10-13 2021-10-13 Unmanned aerial vehicle cooperative control training method and system based on multi-agent reinforcement learning

Publications (1)

Publication Number Publication Date
CN113900445A true CN113900445A (en) 2022-01-07

Family

ID=79191936

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111193986.5A Pending CN113900445A (en) 2021-10-13 2021-10-13 Unmanned aerial vehicle cooperative control training method and system based on multi-agent reinforcement learning

Country Status (1)

Country Link
CN (1) CN113900445A (en)

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114415735A (en) * 2022-03-31 2022-04-29 天津大学 Dynamic environment-oriented multi-unmanned aerial vehicle distributed intelligent task allocation method
CN114721409A (en) * 2022-06-08 2022-07-08 山东大学 Underwater vehicle docking control method based on reinforcement learning
CN114722946A (en) * 2022-04-12 2022-07-08 中国人民解放军国防科技大学 Unmanned aerial vehicle asynchronous action and cooperation strategy synthesis method based on probability model detection
CN114879742A (en) * 2022-06-17 2022-08-09 电子科技大学 Unmanned aerial vehicle cluster dynamic coverage method based on multi-agent deep reinforcement learning
CN115019185A (en) * 2022-08-03 2022-09-06 华中科技大学 Brain-like continuous learning cooperative trapping method, system and medium
CN115273501A (en) * 2022-07-27 2022-11-01 同济大学 Automatic driving vehicle ramp confluence cooperative control method and system based on MADDPG
CN115334165A (en) * 2022-07-11 2022-11-11 西安交通大学 Underwater multi-unmanned platform scheduling method and system based on deep reinforcement learning
CN115525058A (en) * 2022-10-24 2022-12-27 哈尔滨工程大学 Unmanned underwater vehicle cluster cooperative countermeasure method based on deep reinforcement learning
CN116069023A (en) * 2022-12-20 2023-05-05 南京航空航天大学 Multi-unmanned vehicle formation control method and system based on deep reinforcement learning
CN116736883A (en) * 2023-05-23 2023-09-12 天津大学 Unmanned aerial vehicle cluster intelligent cooperative motion planning method
CN117076134A (en) * 2023-10-13 2023-11-17 天之翼(苏州)科技有限公司 Unmanned aerial vehicle state data processing method and system based on artificial intelligence
CN117111620A (en) * 2023-10-23 2023-11-24 山东省科学院海洋仪器仪表研究所 Autonomous decision-making method for task allocation of heterogeneous unmanned system
CN117406706A (en) * 2023-08-11 2024-01-16 汕头大学 Multi-agent obstacle avoidance method and system combining causal model and deep reinforcement learning

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180174038A1 (en) * 2016-12-19 2018-06-21 Futurewei Technologies, Inc. Simultaneous localization and mapping with reinforcement learning
US20190266489A1 (en) * 2017-10-12 2019-08-29 Honda Motor Co., Ltd. Interaction-aware decision making
CN110958680A (en) * 2019-12-09 2020-04-03 长江师范学院 Energy efficiency-oriented unmanned aerial vehicle cluster multi-agent deep reinforcement learning optimization method
CN111667513A (en) * 2020-06-01 2020-09-15 西北工业大学 Unmanned aerial vehicle maneuvering target tracking method based on DDPG transfer learning
CN112256056A (en) * 2020-10-19 2021-01-22 中山大学 Unmanned aerial vehicle control method and system based on multi-agent deep reinforcement learning
CN112488310A (en) * 2020-11-11 2021-03-12 厦门渊亭信息科技有限公司 Multi-agent group cooperation strategy automatic generation method
CN112947562A (en) * 2021-02-10 2021-06-11 西北工业大学 Multi-unmanned aerial vehicle motion planning method based on artificial potential field method and MADDPG
CN113190032A (en) * 2021-05-10 2021-07-30 重庆交通大学 Unmanned aerial vehicle perception planning system and method applied to multiple scenes and unmanned aerial vehicle
CN113298368A (en) * 2021-05-14 2021-08-24 南京航空航天大学 Multi-unmanned aerial vehicle task planning method based on deep reinforcement learning

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180174038A1 (en) * 2016-12-19 2018-06-21 Futurewei Technologies, Inc. Simultaneous localization and mapping with reinforcement learning
US20190266489A1 (en) * 2017-10-12 2019-08-29 Honda Motor Co., Ltd. Interaction-aware decision making
CN110958680A (en) * 2019-12-09 2020-04-03 长江师范学院 Energy efficiency-oriented unmanned aerial vehicle cluster multi-agent deep reinforcement learning optimization method
CN111667513A (en) * 2020-06-01 2020-09-15 西北工业大学 Unmanned aerial vehicle maneuvering target tracking method based on DDPG transfer learning
CN112256056A (en) * 2020-10-19 2021-01-22 中山大学 Unmanned aerial vehicle control method and system based on multi-agent deep reinforcement learning
CN112488310A (en) * 2020-11-11 2021-03-12 厦门渊亭信息科技有限公司 Multi-agent group cooperation strategy automatic generation method
CN112947562A (en) * 2021-02-10 2021-06-11 西北工业大学 Multi-unmanned aerial vehicle motion planning method based on artificial potential field method and MADDPG
CN113190032A (en) * 2021-05-10 2021-07-30 重庆交通大学 Unmanned aerial vehicle perception planning system and method applied to multiple scenes and unmanned aerial vehicle
CN113298368A (en) * 2021-05-14 2021-08-24 南京航空航天大学 Multi-unmanned aerial vehicle task planning method based on deep reinforcement learning

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
李宝安: "Research on unmanned surface vehicle control based on deep reinforcement learning" *
赵丽华; 万晓冬: "Multi-UAV cooperative path planning based on an improved A algorithm" *

Cited By (23)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114415735A (en) * 2022-03-31 2022-04-29 天津大学 Dynamic environment-oriented multi-unmanned aerial vehicle distributed intelligent task allocation method
CN114415735B (en) * 2022-03-31 2022-06-14 天津大学 Dynamic environment-oriented multi-unmanned aerial vehicle distributed intelligent task allocation method
CN114722946A (en) * 2022-04-12 2022-07-08 中国人民解放军国防科技大学 Unmanned aerial vehicle asynchronous action and cooperation strategy synthesis method based on probability model detection
CN114721409A (en) * 2022-06-08 2022-07-08 山东大学 Underwater vehicle docking control method based on reinforcement learning
CN114879742A (en) * 2022-06-17 2022-08-09 电子科技大学 Unmanned aerial vehicle cluster dynamic coverage method based on multi-agent deep reinforcement learning
CN114879742B (en) * 2022-06-17 2023-07-04 电子科技大学 Unmanned aerial vehicle cluster dynamic coverage method based on multi-agent deep reinforcement learning
CN115334165B (en) * 2022-07-11 2023-10-17 西安交通大学 Underwater multi-unmanned platform scheduling method and system based on deep reinforcement learning
CN115334165A (en) * 2022-07-11 2022-11-11 西安交通大学 Underwater multi-unmanned platform scheduling method and system based on deep reinforcement learning
CN115273501B (en) * 2022-07-27 2023-08-29 同济大学 MADDPG-based automatic driving vehicle ramp confluence cooperative control method and system
CN115273501A (en) * 2022-07-27 2022-11-01 同济大学 Automatic driving vehicle ramp confluence cooperative control method and system based on MADDPG
CN115019185B (en) * 2022-08-03 2022-10-21 华中科技大学 Brain-like continuous learning cooperative trapping method, system and medium
CN115019185A (en) * 2022-08-03 2022-09-06 华中科技大学 Brain-like continuous learning cooperative trapping method, system and medium
CN115525058A (en) * 2022-10-24 2022-12-27 哈尔滨工程大学 Unmanned underwater vehicle cluster cooperative countermeasure method based on deep reinforcement learning
CN116069023A (en) * 2022-12-20 2023-05-05 南京航空航天大学 Multi-unmanned vehicle formation control method and system based on deep reinforcement learning
CN116069023B (en) * 2022-12-20 2024-02-23 南京航空航天大学 Multi-unmanned vehicle formation control method and system based on deep reinforcement learning
CN116736883A (en) * 2023-05-23 2023-09-12 天津大学 Unmanned aerial vehicle cluster intelligent cooperative motion planning method
CN116736883B (en) * 2023-05-23 2024-03-08 天津大学 Unmanned aerial vehicle cluster intelligent cooperative motion planning method
CN117406706A (en) * 2023-08-11 2024-01-16 汕头大学 Multi-agent obstacle avoidance method and system combining causal model and deep reinforcement learning
CN117406706B (en) * 2023-08-11 2024-04-09 汕头大学 Multi-agent obstacle avoidance method and system combining causal model and deep reinforcement learning
CN117076134A (en) * 2023-10-13 2023-11-17 天之翼(苏州)科技有限公司 Unmanned aerial vehicle state data processing method and system based on artificial intelligence
CN117076134B (en) * 2023-10-13 2024-04-02 天之翼(苏州)科技有限公司 Unmanned aerial vehicle state data processing method and system based on artificial intelligence
CN117111620A (en) * 2023-10-23 2023-11-24 山东省科学院海洋仪器仪表研究所 Autonomous decision-making method for task allocation of heterogeneous unmanned system
CN117111620B (en) * 2023-10-23 2024-03-29 山东省科学院海洋仪器仪表研究所 Autonomous decision-making method for task allocation of heterogeneous unmanned system

Similar Documents

Publication Publication Date Title
CN113900445A (en) Unmanned aerial vehicle cooperative control training method and system based on multi-agent reinforcement learning
CN112947562B (en) Multi-unmanned aerial vehicle motion planning method based on artificial potential field method and MADDPG
De Souza et al. Decentralized multi-agent pursuit using deep reinforcement learning
CN113495578B (en) Digital twin training-based cluster track planning reinforcement learning method
Liu et al. Multi-UAV path planning based on fusion of sparrow search algorithm and improved bioinspired neural network
CN113093802B (en) Unmanned aerial vehicle maneuver decision method based on deep reinforcement learning
CN112180967B (en) Multi-unmanned aerial vehicle cooperative countermeasure decision-making method based on evaluation-execution architecture
CN112433525A (en) Mobile robot navigation method based on simulation learning and deep reinforcement learning
CN113253733B (en) Navigation obstacle avoidance method, device and system based on learning and fusion
CN113848974B (en) Aircraft trajectory planning method and system based on deep reinforcement learning
CN111723931B (en) Multi-agent confrontation action prediction method and device
Kimmel et al. Maintaining team coherence under the velocity obstacle framework.
Chen et al. Runtime safety assurance for learning-enabled control of autonomous driving vehicles
Diallo et al. Multi-agent pattern formation: a distributed model-free deep reinforcement learning approach
Xue et al. Multi-agent deep reinforcement learning for uavs navigation in unknown complex environment
Shen Bionic communication network and binary pigeon-inspired optimization for multiagent cooperative task allocation
Al-Sharman et al. Self-learned autonomous driving at unsignalized intersections: A hierarchical reinforced learning approach for feasible decision-making
CN113110101A (en) Production line mobile robot gathering type recovery and warehousing simulation method and system
CN112651486A (en) Method for improving convergence rate of MADDPG algorithm and application thereof
CN116796843A (en) Unmanned aerial vehicle many-to-many chase game method based on PSO-M3DDPG
Huang et al. A deep reinforcement learning approach to preserve connectivity for multi-robot systems
CN116661503A (en) Cluster track automatic planning method based on multi-agent safety reinforcement learning
CN114326826B (en) Multi-unmanned aerial vehicle formation transformation method and system
CN116243727A (en) Unmanned carrier countermeasure and obstacle avoidance method for progressive deep reinforcement learning
Zhang et al. Multi-UAV cooperative short-range combat via attention-based reinforcement learning using individual reward shaping

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination