CN117035435A - Multi-unmanned aerial vehicle task allocation and track planning optimization method in dynamic environment


Info

Publication number: CN117035435A
Application number: CN202310619913.0A
Authority: CN
Original language: Chinese (zh)
Inventors: 袁冬冰, 牛昱斌, 李冬妮
Applicant and current assignee: Beijing Institute of Technology (BIT)
Priority and filing date: 2023-05-29
Publication date: 2023-11-10
Legal status: Pending

Classifications

    • G PHYSICS
    • G05 CONTROLLING; REGULATING
    • G05B CONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
    • G05B13/00 Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion
    • G05B13/02 Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric
    • G05B13/04 Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric involving the use of models or simulators
    • G05B13/042 Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric involving the use of models or simulators in which a parameter or coefficient is automatically adjusted to optimise the performance
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/092 Reinforcement learning
    • G06N7/00 Computing arrangements based on specific mathematical models
    • G06N7/01 Probabilistic graphical models, e.g. probabilistic networks
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/06 Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063 Operations research, analysis or management
    • G06Q10/0637 Strategic management or analysis, e.g. setting a goal or target of an organisation; Planning actions based on goals; Analysis or evaluation of effectiveness of goals

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • General Physics & Mathematics (AREA)
  • Human Resources & Organizations (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Educational Administration (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Economics (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Strategic Management (AREA)
  • Quality & Reliability (AREA)
  • General Health & Medical Sciences (AREA)
  • Marketing (AREA)
  • Tourism & Hospitality (AREA)
  • General Business, Economics & Management (AREA)
  • Game Theory and Decision Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Development Economics (AREA)
  • Operations Research (AREA)
  • Molecular Biology (AREA)
  • Automation & Control Theory (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Probability & Statistics with Applications (AREA)
  • Algebra (AREA)
  • Computational Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Pure & Applied Mathematics (AREA)

Abstract

A multi-unmanned aerial vehicle task allocation and track planning optimization method in a dynamic environment belongs to the technical field of unmanned aerial vehicles. A multi-agent reinforcement learning algorithm, MA-SAC, built on the classical deep reinforcement learning SAC algorithm is adopted: the SAC algorithm is fused into a multi-agent network structure and a centralized-training, distributed-execution mode is used, so that the agents can interact with and learn from one another, converge to higher reward values and task completion rates in a shorter time, improve the timeliness of task planning decisions, and reduce the number of iterations required compared with intelligent optimization algorithms, improving the timeliness of the method. By constructing a reward value function based on a policy set, the training efficiency and stability of reinforcement learning are improved, the sparse-reward problem in a dynamic environment is alleviated, and the convergence speed of the multi-agent reinforcement learning algorithm is increased. The method is applicable to the unmanned aerial vehicle field and can improve the real-time task planning efficiency of unmanned aerial vehicles in dynamic scenarios.

Description

Multi-unmanned aerial vehicle task allocation and track planning optimization method in dynamic environment
Technical Field
The invention relates to a multi-unmanned aerial vehicle task allocation and track planning optimization method in a dynamic environment, and belongs to the technical field of unmanned aerial vehicles.
Background
In the face of increasingly complex and changeable battlefield environments and numerous, varied combat tasks, it is increasingly difficult for a single unmanned aerial vehicle to meet actual battlefield requirements, and cooperative operation of unmanned aerial vehicle clusters has become the mainstream development trend. As a new type of multi-agent system, unmanned aerial vehicle clusters are currently controlled by most researchers with swarm intelligence algorithms, such as wolf pack, bee colony and bird flocking algorithms, to realize autonomous and intelligent cooperative control. Although such clusters have partial autonomous capability, the complexity of building the cluster control model, the sensitivity to parameters, the inherent computational burden of the algorithms and the low degree of intelligence make it difficult for unmanned aerial vehicle clusters to execute high-precision, highly real-time track planning tasks in complex adversarial environments. In recent years, multi-agent deep reinforcement learning (Multi-Agent Deep Reinforcement Learning, MADRL) has received widespread attention as one approach to intelligent control and decision problems. MADRL allows agents to interact with the environment and carry out cooperative or adversarial autonomous learning on the basis of powerful situational awareness and information processing capabilities. Therefore, MADRL is expected to provide unmanned aerial vehicle clusters with sufficient intelligent coordination to accomplish complex adversarial tasks.
Currently, researchers have carried out exploratory studies on unmanned aerial vehicle cluster track planning with deep reinforcement learning methods. Yang Qingqing et al. apply a deep reinforcement learning algorithm based on the Rainbow model to naval battlefield path planning; the Rainbow model fuses six DQN improvement mechanisms, namely the Double DQN network, prioritized experience replay, the dueling network, the noisy network, distributional learning and multi-step learning, and experiments show that the algorithm achieves a better path planning effect. Tang et al. use the Deep-Sarsa algorithm for unmanned aerial vehicle track planning; Deep-Sarsa fits the Q table end-to-end with a deep neural network on the basis of the Sarsa algorithm, adopts an on-policy learning method, and shows a higher learning speed and better performance in single-aircraft real-time track planning.
However, most of the above works do not consider dynamic factors present in an actual combat environment, such as unknown obstacle positions and dynamically moving unmanned aerial vehicles, or the possibility that a UAV is destroyed after failing to avoid an obstacle so that the track planning task fails. Their application scenarios are relatively simple and partly restricted to a two-dimensional plane, and research on unmanned aerial vehicle cluster track planning strategies in dynamic, uncertain environments is lacking.
Disclosure of Invention
Aiming at the low real-time decision efficiency of unmanned aerial vehicle task allocation and track optimization in dynamic scenarios in the prior art, the main purpose of the invention is to provide a multi-unmanned aerial vehicle task allocation and track planning optimization method for dynamic scenarios: a multi-agent reinforcement learning algorithm (MA-SAC) based on the classical deep reinforcement learning SAC algorithm is proposed and a heuristic reward value model is constructed, which alleviates the sparse-reward problem in dynamic scenarios and improves the real-time task planning efficiency of unmanned aerial vehicles in dynamic scenarios.
The invention aims at realizing the following technical scheme:
the invention discloses a multi-unmanned aerial vehicle task allocation and track planning optimization method in a dynamic environment: a multi-UAV task planning scene model in the dynamic environment is established, the joint optimization problem of UAV task allocation and track planning in the dynamic environment is described as a Markov decision model, and then a multi-agent reinforcement learning algorithm (MA-SAC) based on the classical deep reinforcement learning (SAC) algorithm, together with a corresponding heuristic reward value strategy, is used to solve the sparse-reward problem in the dynamic environment, thereby improving the real-time task planning efficiency of unmanned aerial vehicles in dynamic scenarios.
The invention discloses a multi-unmanned aerial vehicle task allocation and track planning optimization method under a dynamic environment, which comprises the following steps:
step 1: establishing a multi-unmanned aerial vehicle task planning scene model in a dynamic environment;
step 1.1: establishing a kinematic model of the unmanned aerial vehicle:
Δx = V·cosγ·cosθ·Δt, Δy = V·cosγ·sinθ·Δt, Δz = V·sinγ·Δt, where (Δx, Δy, Δz) represent the displacement offsets of the unmanned aerial vehicle along the x-axis, the y-axis and the z-axis over one time step Δt, θ represents the aircraft turning angle, γ represents the aircraft pitch angle, and V represents the unmanned aerial vehicle speed.
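For illustration only (not part of the original disclosure), a minimal Python sketch of this point-mass kinematic step, assuming a discretization time step dt that the text does not fix:

```python
import math

def kinematic_step(x, y, z, theta, gamma, v, dt=1.0):
    """Point-mass UAV kinematics: advance the position by one time step.

    theta: turning (heading) angle, gamma: pitch angle, v: UAV speed.
    dt is an assumed discretization step, not specified in the text.
    """
    dx = v * math.cos(gamma) * math.cos(theta) * dt
    dy = v * math.cos(gamma) * math.sin(theta) * dt
    dz = v * math.sin(gamma) * dt
    return x + dx, y + dy, z + dz
```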
Step 1.2: establishing a radar and threat zone model;
considering the maximum detection distance of the radar, the maximum radius of the missile kill zone and the maximum range of the no-escape zone, the threat value model of each defense unit with respect to the unmanned aerial vehicle is as follows:
where D is the distance between the unmanned aerial vehicle and the defense unit; R_Rmax is the maximum detection distance of the radar; R_Mmax is the maximum radius of the missile kill zone; R_Mkmax is the maximum range of the no-escape zone.
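The piecewise formula itself is not reproduced above. The sketch below only illustrates one plausible shape of such a threat model, assuming R_Mkmax ≤ R_Mmax ≤ R_Rmax and using illustrative threat levels that are not the patent's values:

```python
def threat_value(d, r_rmax, r_mmax, r_mkmax):
    """Illustrative piecewise threat value of one defense unit to a UAV.

    d: distance between the UAV and the defense unit.
    The breakpoints follow the text (radar detection range, missile kill
    radius, no-escape radius); the levels and interpolation are assumptions.
    """
    if d >= r_rmax:      # outside radar detection: no threat
        return 0.0
    if d <= r_mkmax:     # inside the no-escape zone: maximal threat
        return 1.0
    if d <= r_mmax:      # inside the missile kill zone: high threat
        return 0.8 + 0.2 * (r_mmax - d) / max(r_mmax - r_mkmax, 1e-6)
    # detected by radar but outside the kill zone: threat decays with distance
    return 0.8 * (r_rmax - d) / max(r_rmax - r_mmax, 1e-6)
```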
Step 1.3: with the aim of avoiding collisions and minimizing track distance, an objective function is built, i.ed i The track distance of the drone i is indicated.
Step 2: establishing a Markov decision model of the unmanned aerial vehicle task allocation and track planning combined optimization problem in a dynamic environment;
step 2.1: establishing a joint state space model of the unmanned aerial vehicle system;
the joint state space is
S = [S_1, S_2, ..., S_n]
The number of unmanned aerial vehicles is n, where each sub-state space is
S_i = [x_u, y_u, z_u, d_a, d_t, x_o, y_o, z_o]^T (i = 1, 2, ..., n)
where (x_u, y_u, z_u) represent the longitude, latitude and altitude of UAV i, d_a represents the distance between UAV i and its nearest UAV, d_t represents the distance between UAV i and its nearest target point, and (x_o, y_o, z_o) represents the position of the nearest enemy threat observed by UAV i within its effective detection range.
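For illustration, one UAV's sub-state vector S_i could be assembled as follows; the uav, uavs, targets and threats containers and their pos attributes are hypothetical helpers, not part of the disclosure:

```python
import numpy as np

def build_observation(uav, uavs, targets, threats, detect_range):
    """Assemble S_i = [x_u, y_u, z_u, d_a, d_t, x_o, y_o, z_o]^T for one UAV."""
    others = [u for u in uavs if u is not uav]
    d_a = min(np.linalg.norm(uav.pos - u.pos) for u in others)    # nearest UAV
    d_t = min(np.linalg.norm(uav.pos - t.pos) for t in targets)   # nearest target point
    # threats observed within the effective detection range only
    visible = [o for o in threats if np.linalg.norm(uav.pos - o.pos) <= detect_range]
    if visible:
        nearest = min(visible, key=lambda o: np.linalg.norm(uav.pos - o.pos)).pos
    else:
        nearest = np.zeros(3)   # assumed placeholder when no threat is detected
    return np.concatenate([uav.pos, [d_a, d_t], nearest]).astype(np.float32)
```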
Step 2.2: establishing a joint action space model of an unmanned aerial vehicle system
For the action space, each unmanned aerial vehicle in the formation can select own action, and then the combined action space of the whole unmanned aerial vehicle formation is set as
A=[A 1 ,A 2 ,...,A n ](i=1,2,...,n)
Wherein the action subspace is represented as
A i =[ψ ii ] T
Wherein psi is i Represents the turning angle change value of the unmanned aerial vehicle, gamma i Representing the change value of the pitching angle of the unmanned plane
Step 2.3: constructing a reward value function based on a strategy set;
(1) According to step 1.3, with collision avoidance and minimized track distance as the optimization targets, and to encourage the unmanned aerial vehicles to reach their combat target positions as soon as possible, the minimized track distance is realized through a reward structure: traverse all target points, compute for each target point the distance to its nearest UAV, sum these distances, and take the negative of the sum as r_1.
(2) According to step 1.3, with collision avoidance as the optimization target, the unmanned aerial vehicles keep a spatial cooperative relationship, i.e. they avoid collisions with each other. When a collision occurs the agent receives a negative reward, and a critical area is added as a collision early-warning mechanism, so as to train a reward structure for avoiding collisions between UAVs: cyclically traverse all other UAVs and compute the collision-avoidance reward value r_2 of each UAV as shown in the following formula:
where dist represents the distance between the UAV and its nearest UAV, dist_min2 represents the sum of the sizes of the two UAVs, and σ represents the width of the critical area.
(3) According to step 1.3, the radar and threat zone model is used to calculate the threat value T_s of a threat object to the unmanned aerial vehicle; then, according to T_s, the reward value r_3 for avoiding threat zones and obstacle zones is calculated as shown in the following formula
where T_s represents the threat value of the threat object to the UAV and T_σ represents the threat threshold.
The real-time reward R of one unmanned aerial vehicle consists of the above three parts, i.e. R = r_1 + r_2 + r_3; an illustrative sketch of this composite reward is given below.
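In the sketch, r_1 is simplified to the UAV's own nearest-target distance, and the penalty magnitudes in r_2 and r_3 are assumed values, since the exact formulas are not reproduced above:

```python
import numpy as np

def composite_reward(uav_pos, other_uav_pos, target_pos, t_s,
                     dist_min2, sigma, t_sigma=0.8):
    """R = r1 + r2 + r3 for one UAV (illustrative magnitudes)."""
    # r1: encourage reaching targets; here the UAV's own nearest-target distance
    r1 = -min(np.linalg.norm(uav_pos - t) for t in target_pos)

    # r2: collision penalty with an early-warning band of width sigma
    dist = min(np.linalg.norm(uav_pos - p) for p in other_uav_pos)
    if dist < dist_min2:                 # collision
        r2 = -1.0
    elif dist < dist_min2 + sigma:       # inside the critical (early-warning) area
        r2 = -0.5 * (dist_min2 + sigma - dist) / sigma
    else:
        r2 = 0.0

    # r3: penalty when the threat value exceeds the threshold T_sigma
    r3 = -1.0 if t_s > t_sigma else 0.0

    return r1 + r2 + r3
```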
Step 3: constructing a neural network of a multi-agent reinforcement learning algorithm;
the system architecture of the neural network is divided into a task abstraction layer, an algorithm training layer and an execution layer; the task abstraction layer converts the task optimization process into the convergence process of a corresponding reward structure, and the optimization goal is to reduce collisions as much as possible and minimize track distance; the training layer consists of a training environment and a training algorithm, where the environment contains the unmanned aerial vehicles, the targets and the threat areas, and the training algorithm is MA-SAC; after training, each UAV agent obtains a policy, which is in fact an actor network: the actor network receives an observation and outputs an action; in the execution layer, the policy of each agent is deployed onto a real UAV in the formation.
The method comprises n UAV agents in total. Each UAV agent has an Actor network, a Target-Actor network, two Critic networks and two Target-Critic networks, all of which are fully connected neural networks. The Critic network of each UAV receives as input not only the environmental state but also the actions of the other UAVs, and the Q value is computed from the local information observed by each UAV agent.
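For illustration, the per-agent networks described here could be sketched as follows in PyTorch (the use of PyTorch and the layer sizes are assumptions of this sketch); only one of the two Critic networks is shown:

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Maps one UAV's local observation to a squashed Gaussian action (psi, gamma)."""
    def __init__(self, obs_dim=8, act_dim=2, hidden=128):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU(),
                                  nn.Linear(hidden, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, act_dim)
        self.log_std = nn.Linear(hidden, act_dim)

    def forward(self, obs):
        h = self.body(obs)
        std = self.log_std(h).clamp(-20, 2).exp()
        dist = torch.distributions.Normal(self.mu(h), std)
        raw = dist.rsample()                    # reparameterized sample
        action = torch.tanh(raw)                # bounded angle increments
        # log-probability with tanh correction, summed over action dimensions
        logp = (dist.log_prob(raw) - torch.log(1 - action.pow(2) + 1e-6)).sum(-1)
        return action, logp

class CentralCritic(nn.Module):
    """Q(x, a_1..a_n): receives the joint state and the joint action of all UAVs."""
    def __init__(self, joint_obs_dim, joint_act_dim, hidden=128):
        super().__init__()
        self.q = nn.Sequential(nn.Linear(joint_obs_dim + joint_act_dim, hidden), nn.ReLU(),
                               nn.Linear(hidden, hidden), nn.ReLU(),
                               nn.Linear(hidden, 1))

    def forward(self, joint_obs, joint_act):
        return self.q(torch.cat([joint_obs, joint_act], dim=-1)).squeeze(-1)
```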
Step 4: training the neural network constructed in the step 3;
step 4.1: initializing critic network and actor network parameters, experience pool capacity D, sampling sample number B for training, and training round number episodes.
Step 4.2: for each training round, firstly, obtaining a state space si defined in step 2.1 of each unmanned aerial vehicle i intelligent agent through a simulation environment, obtaining an action space ai defined in step 2.2 of each unmanned aerial vehicle i intelligent agent according to an Actor network, and calculating a next state s 'according to a kinematic model of the unmanned aerial vehicle in step 1.1' i Calculating the prize value r obtained by each unmanned plane i intelligent agent according to the step 2.3 i
Sample < a i ,s i ,s i ',r i > store in experience pool. If the number of experience pool samples is greater than the number of sampling samples for training, go to step 4.3, otherwise continue to step 4.2.
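A minimal sketch of the experience pool and the rollout check of steps 4.1-4.2; the env and agents interfaces are hypothetical, and a plain uniform-sampling pool is shown although step 4.3 refers to a prioritized replay buffer:

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity experience pool storing joint transitions."""
    def __init__(self, capacity=1_000_000):
        self.buffer = deque(maxlen=capacity)

    def add(self, states, actions, rewards, next_states):
        self.buffer.append((states, actions, rewards, next_states))

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)

# Hypothetical rollout step mirroring step 4.2:
#   states = env.observe()                                   # s_i for every UAV agent i
#   actions = [ag.act(s) for ag, s in zip(agents, states)]   # from each Actor network
#   next_states, rewards = env.step(actions)                 # kinematics + rewards
#   buffer.add(states, actions, rewards, next_states)
#   if len(buffer) > batch_size:                             # only then go to step 4.3
#       batch = buffer.sample(batch_size)
```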
Step 4.3: the number of samples B is randomly drawn from the sample pool. To be used forThe Critic current network is updated as a loss function,
wherein the method comprises the steps of
E (x,a,r,x')~D Representing a desire to sample (x, a, r, x ') from the priority playback buffer pool D, where x represents the unmanned joint state, a represents the joint action, r represents the joint prize value, and x' represents the next joint state.Representing that a joint action (a) is performed in a joint state x given a random policy pi 1 ,...,a Nu ) State-action value of (c). y is i For the estimated state-action cost function in the joint state x, r i Represents a prize value for unmanned plane i, gamma represents a discount rate representing a percentage of future benefits to be referenced, +.>Representing solving for a given random strategy->When in state s i ' Down execution action a i 'desire,'>Representing a given random strategy->When in state s i ' Down execution action a i ' target state-action value, +.>Is state s i ' Down strategy>Output action a i Probability of';
by passing throughUpdating the Actor network, pi i (a i |s i ) Is state s i Lower policy pi output action a i Probability of->Representing the state s at a given random strategy pi i Lower execution action a i Target state-action value of +.>Representing the state s at the time of solving a given random strategy pi i Lower execution action a i Is not limited to the above-described embodiments.
By passing throughUpdating the target network, wherein w 'represents the parameter of the target-critic network, w represents the parameter of the critic network, theta' represents the parameter of the target-actor network, theta represents the parameter of the actor network, and tau is the target networkUpdate rate of the complex.
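Putting steps 4.2-4.3 together, one MA-SAC update for agent i could be sketched as below. The agent containers, their optimizers and the batch layout are assumed interfaces; a single critic is used for brevity although the text specifies two Critic networks per agent, and the entropy coefficient alpha is kept fixed:

```python
import torch
import torch.nn.functional as F

def update_agent(i, batch, agents, gamma=0.95, alpha=0.2, tau=0.01):
    """One illustrative MA-SAC gradient step for UAV agent i.

    batch holds per-agent tensor lists: obs[k], act[k], rew[k], next_obs[k].
    Each agent bundles actor, critic, target_actor, target_critic and optimizers.
    """
    ag = agents[i]
    obs, act, rew, nobs = batch["obs"], batch["act"], batch["rew"], batch["next_obs"]

    # Critic: regress Q_i(x, a) onto y_i = r_i + gamma * (Q'_i(x', a') - alpha * log pi'(a'_i|s'_i))
    with torch.no_grad():
        next_acts, next_logps = zip(*[a.target_actor(nobs[k]) for k, a in enumerate(agents)])
        q_next = ag.target_critic(torch.cat(nobs, -1), torch.cat(next_acts, -1))
        y = rew[i] + gamma * (q_next - alpha * next_logps[i])
    q = ag.critic(torch.cat(obs, -1), torch.cat(act, -1))
    critic_loss = F.mse_loss(q, y)
    ag.critic_opt.zero_grad(); critic_loss.backward(); ag.critic_opt.step()

    # Actor: minimize E[alpha * log pi_i(a_i|s_i) - Q_i(x, a_1..a_n)]
    new_act_i, logp_i = ag.actor(obs[i])
    joint_act = torch.cat([new_act_i if k == i else act[k] for k in range(len(agents))], -1)
    actor_loss = (alpha * logp_i - ag.critic(torch.cat(obs, -1), joint_act)).mean()
    ag.actor_opt.zero_grad(); actor_loss.backward(); ag.actor_opt.step()

    # Soft target update: w' = tau * w + (1 - tau) * w', same for theta'
    for net, target in [(ag.critic, ag.target_critic), (ag.actor, ag.target_actor)]:
        for p, tp in zip(net.parameters(), target.parameters()):
            tp.data.mul_(1 - tau).add_(tau * p.data)
```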
Further comprising step 5: performing task allocation and track planning for the multiple unmanned aerial vehicles in the dynamic environment with the multi-agent reinforcement learning neural network trained in step 4, while simultaneously optimizing the internal policy of each UAV agent and the global task planning policy, thereby improving the real-time performance and adaptive capability of UAV task planning in the dynamic environment, so that all UAV combat tasks in the dynamic environment are completed with higher battlefield payoff, shorter track planning distance and better time-sensitivity.
The beneficial effects are that:
1. The multi-unmanned aerial vehicle task allocation and track planning optimization method in a dynamic environment disclosed by the invention adopts a multi-agent reinforcement learning algorithm based on the classical deep reinforcement learning SAC algorithm: the SAC algorithm is fused into a multi-agent network structure and a centralized-training, distributed-execution mode is adopted, so that the agents can interact with and learn from one another and converge to higher reward values and task completion rates in a shorter time, which improves the timeliness of task planning decisions, reduces the number of iterations compared with intelligent optimization algorithms, and improves the timeliness of the method.
2. The multi-unmanned aerial vehicle task allocation and track planning optimization method in a dynamic environment disclosed by the invention constructs a reward value function based on a policy set, which improves the training efficiency and stability of reinforcement learning, alleviates the sparse-reward problem in a dynamic environment, and increases the convergence speed of the multi-agent reinforcement learning algorithm.
Drawings
FIG. 1 is a flow chart of a multi-unmanned aerial vehicle task allocation and track planning optimization method in a dynamic environment;
FIG. 2 is a diagram of a multi-agent reinforcement learning MA-SAC algorithm system architecture in the present embodiment;
fig. 3 is an initial situation map of a multi-unmanned aerial vehicle task planning simulation scene in the dynamic environment in the embodiment;
FIG. 4 is a diagram of a multi-agent reinforcement learning algorithm MA-SAC neural network architecture in the present embodiment;
fig. 5 is a comparison of the reward values of the MA-SAC, MADDPG and DDPG algorithms in the multi-unmanned aerial vehicle task allocation and track planning optimization method in a dynamic environment disclosed in the present embodiment;
FIG. 6 is a diagram of an unmanned aerial vehicle successfully discovering an enemy missile platoon and completing the obstacle avoidance process in the dynamic environment of the present embodiment;
fig. 7 is a view of an unmanned aerial vehicle launching a missile at an enemy ground tank platoon and destroying the target in the dynamic environment of the present embodiment.
Detailed Description
The present invention will be described in detail below with reference to the accompanying drawings and examples. The technical problems solved and the beneficial effects of the technical solution of the invention are also described; the described embodiment is intended only to facilitate understanding of the invention and has no limiting effect.
The embodiment is implemented on a domestic joint-operations combat simulation platform. The platform provides a back-end Python development interface, algorithm training is carried out in a Docker environment, and the simulation is run at 15 times real-time speed. Combat tasks are formulated on the basis of a custom combat design for this environment; the task types include maneuver, assault, strike, land transfer, patrol and support. A dynamic land-air unmanned combat scenario is created on the platform, in which different combat units such as unmanned aerial vehicles, SA-22 ground-to-air missile platoons and T-72B tank platoons are deployed, and real-time visual simulation of unmanned aerial vehicle task planning is performed.
The embodiment discloses a multi-unmanned aerial vehicle task allocation and track planning optimization method in a dynamic environment, which specifically comprises the following steps as shown in fig. 1:
step 1: multi-unmanned aerial vehicle task planning scene modeling in dynamic environment
Step 1.1: the method comprises the steps of establishing a fight assumption and randomly initializing 4 number of unmanned aerial vehicles, 8 number of threat objects, 10 number of tasks to be executed on a simulation platform, dynamically showing enemy threat objects and ground targets, moving the positions at any time, and attacking the enemy dynamic targets on the premise that the fight task of a plurality of unmanned aerial vehicles avoids all the threat objects of the enemy, wherein the initial situation of a simulation scene is shown in fig. 3.
Step 1.2: establishing an aircraft kinematics model
Assuming the drone as a particle in three-dimensional space, the kinematic model of the drone can be expressed as:
Δx = V·cosγ·cosθ·Δt, Δy = V·cosγ·sinθ·Δt, Δz = V·sinγ·Δt, where (Δx, Δy, Δz) represent the displacement offsets of the unmanned aerial vehicle along the x-axis, the y-axis and the z-axis over one time step Δt, θ represents the aircraft turning angle, γ represents the aircraft pitch angle, and V represents the unmanned aerial vehicle speed.
Step 1.3: establishing a radar and threat zone model
The maximum detection distance of the radar, the maximum radius of the missile killing area and the maximum distance of the escape-free area are considered. Therefore, the threat value formula of each defending unit to the unmanned aerial vehicle is as follows:
where D is the distance between the unmanned aerial vehicle and the defense unit; R_Rmax is the maximum detection distance of the radar; R_Mmax is the maximum radius of the missile kill zone; R_Mkmax is the maximum range of the no-escape zone.
Step 1.4: establishing an objective function
The aim of the problem is to reduce collisions as much as possible and to minimize track distance, i.ed i The track distance of the drone i is indicated.
Step 2: establishing a Markov decision model for unmanned aerial vehicle task allocation and track planning combined optimization problem under dynamic environment
Step 2.1: unmanned aerial vehicle system state space design
Let the joint state space be
S = [S_1, S_2, ..., S_n]
The number of unmanned aerial vehicles is 4, where each sub-state space is
S_i = [x_u, y_u, z_u, d_a, d_t, x_o, y_o, z_o]^T (i = 1, 2, ..., n)
where (x_u, y_u, z_u) represent the longitude, latitude and altitude of UAV i, d_a represents the distance between UAV i and its nearest UAV, d_t represents the distance between UAV i and its nearest target point, and (x_o, y_o, z_o) represents the position of the nearest enemy threat observed by UAV i within its effective detection range.
Step 2.2: unmanned aerial vehicle system joint motion space design
For the action space, each unmanned aerial vehicle in the formation can select own action, and then the combined action space of the whole unmanned aerial vehicle formation is set as
A=[A 1 ,A 2 ,...,A n ](i=1,2,...,n)
Wherein the action subspace can be represented as
A i =[ψ ii ] T
Wherein psi is i Represents the turning angle change value of the unmanned aerial vehicle, gamma i Representing the change value of the pitching angle of the unmanned plane
Step 2.3: constructing a strategy set-based prize value function
(1) According to step 1.3, with collision avoidance and minimized track distance as the optimization targets, and to encourage the unmanned aerial vehicles to reach their combat target positions as soon as possible, the minimized track distance is realized through a reward structure: traverse all target points, compute for each target point the distance to its nearest UAV, sum these distances, and take the negative of the sum as r_1.
(2) According to step 1.3, with collision avoidance as the optimization target, the unmanned aerial vehicles keep a spatial cooperative relationship, i.e. they avoid collisions with each other. When a collision occurs the agent receives a negative reward, and a critical area is added as a collision early-warning mechanism, so as to train a reward structure for avoiding collisions between UAVs: cyclically traverse all other UAVs and compute the collision-avoidance reward value r_2 of each UAV as shown in the following formula:
where dist represents the distance between the UAV and its nearest UAV, dist_min2 represents the sum of the sizes of the two UAVs, and σ represents the width of the critical area.
(3) According to step 1.3, the radar and threat zone model is used to calculate the threat value T_s of a threat object to the unmanned aerial vehicle; then, according to T_s, the reward value r_3 for avoiding threat zones and obstacle zones is calculated as shown in the following formula
where T_s represents the threat value of the threat object to the UAV and T_σ represents the threat threshold, set to 0.8.
The real-time reward R of one unmanned aerial vehicle consists of the above three parts, i.e. R = r_1 + r_2 + r_3.
Step 3: constructing a neural network of a multi-agent reinforcement learning algorithm;
the system architecture of the whole method is divided into a task abstraction layer, an algorithm training layer and an execution layer. The task abstraction layer converts the task optimization process into the convergence process of a corresponding reward structure; the goal of the problem is to reduce collisions as much as possible and minimize track distance. The algorithm training layer consists of a training environment and a training algorithm, where the environment contains the unmanned aerial vehicles, the targets and the threat areas and the algorithm is MA-SAC; each UAV agent obtains a policy after training, which is in fact an actor network: the actor network receives an observation and outputs an action. In the execution layer, the policy of each agent is deployed onto a real unmanned aerial vehicle in the formation.
The algorithm contains n UAV agents in total. Each unmanned aerial vehicle has an Actor network, a Target-Actor network, two Critic networks and two Target-Critic networks, all of which are fully connected neural networks. The Critic network of each UAV receives as input not only the environmental state but also the actions of the other UAVs, and the Q value is computed from the local information observed by each UAV agent.
Step 4: training the neural network constructed in the step 3;
step 4.1: initialize the Critic network and Actor network parameters; the experience pool capacity D is 1,000,000, the number of samples B drawn for training is 1024, and the number of training rounds (episodes) is 15,000.
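For reference, the settings stated in this embodiment can be collected into a configuration mapping; the field names are illustrative only:

```python
# Hyperparameters and scenario sizes stated in this embodiment (names are illustrative).
MASAC_CONFIG = {
    "replay_capacity": 1_000_000,   # experience pool size D
    "batch_size": 1024,             # number of samples B drawn for training
    "episodes": 15_000,             # number of training rounds
    "num_uavs": 4,                  # UAV agents in the scenario (step 1.1)
    "num_threats": 8,               # threat objects (step 1.1)
    "num_tasks": 10,                # tasks to be executed (step 1.1)
    "threat_threshold": 0.8,        # T_sigma used in the r3 reward term (step 2.3)
}
```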
Step 4.2: for each training round, firstly obtaining a state space s defined in step 2.1 by each unmanned plane i intelligent agent through a simulation environment i Obtaining an action space a defined in step 2.2 by each unmanned aerial vehicle i intelligent agent according to an Actor network i Calculating the next state s 'according to the kinematic model of the unmanned aerial vehicle in the step 1.1' i Calculating the prize value r obtained by each unmanned plane i intelligent agent according to the step 2.3 i
Sample < a i ,s i ,s i ',r i > store in experience pool. If the number of experience pool samples is greater than the number of sampling samples for training, go to step 4.3, otherwise continue to step 4.2.
Step 4.3:
the number of samples B is randomly drawn from the sample pool. To be used forThe Critic current network is updated as a loss function,
E (x,a,r,x')~D representing a desire to sample (x, a, r, x ') from the priority playback buffer pool D, where x represents the unmanned joint state, a represents the joint action, r represents the joint prize value, and x' represents the next jointStatus of the device.Representing that a joint action is performed in the joint state x given a random policy pi>State-action value of (c). y is i For the estimated state-action cost function in the joint state x, r i Represents a prize value for unmanned plane i, gamma represents a discount rate representing a percentage of future benefits to be referenced, +.>Representing solving for a given random strategy->When in state s i ' Down execution action a i 'desire,'>Representing a given random strategy->When in state s i ' Down execution action a i ' target state-action value, +.>Is state s i ' Down strategy>Output action a i Probability of';
by passing throughUpdating the Actor network, pi i (a i |s i ) Is state s i Lower policy pi output action a i Probability of->Representing the state s at a given random strategy pi i Lower execution action a i Target state-action value of +.>Representing the state s at the time of solving a given random strategy pi i Lower execution action a i Is not limited to the above-described embodiments.
By passing throughUpdating the target network, w 'represents the parameter of the target-critic network, w represents the parameter of the critic network, theta' represents the parameter of the target-actor network, theta represents the parameter of the actor network, and tau is the updating rate of the target network. The parameters of the MA-SAC algorithm are set in the experiment as shown in the following table.
Comparing the reward values of the MA-SAC algorithm with those of the MADDPG and DDPG algorithms, as shown in fig. 5: after 15000 rounds of learning, the reward value curves of the multiple unmanned aerial vehicles show that, although the curves fluctuate somewhat because of random noise during training, all three algorithms exhibit a convergence trend. There are, however, differences in their convergence speed and time. At around episode 2000, the rewards of all three algorithms begin to rise. The convergence of DDPG is the slowest, showing no convergence trend until about episode 6000. Both MA-SAC and MADDPG converge earlier than DDPG, and MA-SAC converges fastest.
Simulation results show that, as shown in fig. 6, each unmanned aerial vehicle successfully discovers and evades the attack of the enemy missile platoon, and, as shown in fig. 7, each unmanned aerial vehicle launches a missile at the ground target it is assigned to attack and successfully destroys the target, thereby completing the unmanned aerial vehicle combat tasks in the dynamic environment.
Indicators such as the reward value after algorithm convergence, the average task success rate, the total number of searches required to reach the task points and the algorithm training efficiency are compared and analyzed in numerical experiments. The experimental results show that, on these indicators, the multi-unmanned aerial vehicle task allocation and track planning optimization method in a dynamic environment disclosed in this embodiment is superior to unmanned aerial vehicle task allocation and track planning optimization methods based on traditional optimization algorithms.
The foregoing detailed description has set forth the objects, technical solutions and advantages of the invention in further detail. It should be understood that the foregoing is only illustrative of the invention and is not intended to limit its scope, which is defined by the appended claims.

Claims (4)

1. A multi-unmanned aerial vehicle task allocation and track planning optimization method in a dynamic environment, characterized by comprising the following steps:
step 1: establishing a multi-unmanned aerial vehicle task planning scene model in a dynamic environment;
step 2: establishing a Markov decision model of the unmanned aerial vehicle task allocation and track planning combined optimization problem in a dynamic environment;
step 2.1: establishing a joint state space model of the unmanned aerial vehicle system;
the joint state space is
S = [S_1, S_2, ..., S_n]
the number of unmanned aerial vehicles is n, where each sub-state space is
S_i = [x_u, y_u, z_u, d_a, d_t, x_o, y_o, z_o]^T (i = 1, 2, ..., n)
where (x_u, y_u, z_u) represent the longitude, latitude and altitude of UAV i, d_a represents the distance between UAV i and its nearest UAV, d_t represents the distance between UAV i and its nearest target point, and (x_o, y_o, z_o) represents the position of the nearest enemy threat observed by UAV i within its effective detection range;
step 2.2: establishing a joint action space model of the unmanned aerial vehicle system;
for the action space, each unmanned aerial vehicle in the formation can select its own action, so the joint action space of the whole UAV formation is set as
A = [A_1, A_2, ..., A_n]
where each action subspace is represented as
A_i = [ψ_i, γ_i]^T
where ψ_i represents the change in the turning angle of UAV i and γ_i represents the change in the pitch angle of UAV i;
Step 2.3: constructing a reward value function based on a strategy set;
(1) According to step 1.3, with collision avoidance and minimized track distance as the optimization targets, and to encourage the unmanned aerial vehicles to reach their combat target positions as soon as possible, the minimized track distance is realized through a reward structure: traverse all target points, compute for each target point the distance to its nearest UAV, sum these distances, and take the negative of the sum as r_1;
(2) According to step 1.3, with collision avoidance as the optimization target, the unmanned aerial vehicles keep a spatial cooperative relationship, i.e. they avoid collisions with each other; when a collision occurs the agent receives a negative reward, and a critical area is added as a collision early-warning mechanism, so as to train a reward structure for avoiding collisions between UAVs: cyclically traverse all other UAVs and compute the collision-avoidance reward value r_2 of each UAV as shown in the following formula:
where dist represents the distance between the UAV and its nearest UAV, dist_min2 represents the sum of the sizes of the two UAVs, and σ represents the width of the critical area;
(3) According to step 1.3, the radar and threat zone model is used to calculate the threat value T_s of a threat object to the unmanned aerial vehicle; then, according to T_s, the reward value r_3 for avoiding threat zones and obstacle zones is calculated as shown in the following formula
where T_s represents the threat value of the threat object to the UAV and T_σ represents the threat threshold;
the real-time reward R of one unmanned aerial vehicle consists of the above three parts, i.e. R = r_1 + r_2 + r_3;
Step 3: constructing a neural network of a multi-agent reinforcement learning algorithm;
the system architecture of the neural network is divided into a task abstraction layer, an algorithm training layer and an execution layer; the task abstraction layer converts the task optimization process into the convergence process of a corresponding reward structure, and the optimization goal is to reduce collisions as much as possible and minimize track distance; the training layer of the neural network consists of a training environment and a training algorithm, the environment comprising the unmanned aerial vehicles, the targets and the threat areas, and the training algorithm being MA-SAC; after training, each UAV agent obtains a policy, which is an actor network; the actor network receives an observation and outputs an action; the policy of each agent in the execution layer is deployed onto a real unmanned aerial vehicle in the formation;
the method comprises n UAV agents in total; each UAV agent has an Actor network, a Target-Actor network, two Critic networks and two Target-Critic networks, all of which are fully connected neural networks; the Critic network of each UAV receives as input not only the environmental state but also the actions of the other UAVs, and the Q value is computed from the local information observed by each UAV agent;
step 4: training the neural network constructed in the step 3.
2. The method for optimizing task allocation and track planning of multiple unmanned aerial vehicles in a dynamic environment according to claim 1, characterized by further comprising step 5: performing task allocation and track planning for the multiple unmanned aerial vehicles in the dynamic environment with the multi-agent reinforcement learning neural network trained in step 4, while simultaneously optimizing the internal policy of each UAV agent and the global task planning policy, thereby improving the real-time performance and adaptive capability of UAV task planning in the dynamic environment, so that all UAV combat tasks in the dynamic environment are completed with higher battlefield payoff, shorter track planning distance and better time-sensitivity.
3. The method for optimizing task allocation and track planning of multiple unmanned aerial vehicles in a dynamic environment according to claim 1, wherein the method comprises the following steps: the implementation method of the step 1 is that,
step 1.1: establishing a kinematic model of the unmanned aerial vehicle:
Δx = V·cosγ·cosθ·Δt, Δy = V·cosγ·sinθ·Δt, Δz = V·sinγ·Δt, where (Δx, Δy, Δz) represent the displacement offsets of the unmanned aerial vehicle along the x-axis, the y-axis and the z-axis over one time step Δt; θ represents the aircraft turning angle, γ represents the aircraft pitch angle, and V represents the unmanned aerial vehicle speed;
step 1.2: establishing a radar and threat zone model;
considering the maximum detection distance of the radar, the maximum radius of the missile kill zone and the maximum range of the no-escape zone, the threat value model of each defense unit with respect to the unmanned aerial vehicle is as follows:
where D is the distance between the unmanned aerial vehicle and the defense unit; R_Rmax is the maximum detection distance of the radar; R_Mmax is the maximum radius of the missile kill zone; R_Mkmax is the maximum range of the no-escape zone;
step 1.3: with the aim of avoiding collisions and minimizing track distance, building an objective function, i.e. min Σ d_i summed over all UAVs i = 1, ..., n, where d_i denotes the track distance of UAV i.
4. A multi-unmanned aerial vehicle task allocation and track planning optimization method in a dynamic environment as claimed in claim 3, wherein: the implementation method of the step 4 is that,
step 4.1: initializing the Critic network and Actor network parameters, the experience pool capacity D, the number of samples B drawn for training, and the number of training rounds (episodes);
step 4.2: for each training round, first obtaining the state s_i (defined in step 2.1) of each UAV agent i from the simulation environment, obtaining the action a_i (defined in step 2.2) of each UAV agent i from its Actor network, computing the next state s'_i according to the UAV kinematic model of step 1.1, and computing the reward value r_i obtained by each UAV agent i according to step 2.3;
storing the sample <a_i, s_i, s'_i, r_i> in the experience pool; if the number of samples in the experience pool is greater than the number of samples drawn for training, going to step 4.3, otherwise continuing with step 4.2;
step 4.3: randomly extracting B samples from the experience pool; updating the current Critic network with the loss function
L(w_i) = E_(x,a,r,x')~D [ (Q_i(x, a_1, ..., a_n) - y_i)^2 ]
wherein the target value is
y_i = r_i + γ·E_(a'_i~π'_i) [ Q'_i(x', a'_1, ..., a'_n) - α·log π'_i(a'_i|s'_i) ]
E_(x,a,r,x')~D represents the expectation over samples (x, a, r, x') drawn from the prioritized replay buffer pool D, wherein x represents the joint state of the unmanned aerial vehicles, a represents the joint action, r represents the joint reward value, and x' represents the next joint state; Q_i(x, a_1, ..., a_n) represents the state-action value of performing the joint action (a_1, ..., a_n) in joint state x given the random policy π; y_i is the estimated state-action value target in joint state x; r_i represents the reward value of unmanned aerial vehicle i; γ represents the discount rate, i.e. the fraction of future returns taken into account; E_(a'_i~π'_i)[·] represents the expectation of executing action a'_i in state s'_i under the given random target policy π'_i; Q'_i(x', a'_1, ..., a'_n) represents the target state-action value of executing action a'_i in state s'_i; π'_i(a'_i|s'_i) is the probability that the target policy π'_i outputs action a'_i in state s'_i; and α is the entropy temperature coefficient;
updating the Actor network through the policy loss J(θ_i) = E_(a_i~π_i) [ α·log π_i(a_i|s_i) - Q_i(x, a_1, ..., a_n) ], wherein π_i(a_i|s_i) is the probability that policy π_i outputs action a_i in state s_i, Q_i(x, a_1, ..., a_n) represents the state-action value of executing action a_i in state s_i given the random policy π, and E_(a_i~π_i)[·] represents the expectation of executing action a_i in state s_i given the random policy π;
updating the target networks through w' = τw + (1-τ)w' and θ' = τθ + (1-τ)θ', wherein w' represents the parameters of the Target-Critic network, w represents the parameters of the Critic network, θ' represents the parameters of the Target-Actor network, θ represents the parameters of the Actor network, and τ is the update rate of the target network.


Cited By (2) (* cited by examiner, † cited by third party)

Publication number | Priority date | Publication date | Assignee | Title
CN117707207A * | 2024-02-06 | 2024-03-15 | 中国民用航空飞行学院 | Unmanned aerial vehicle ground target tracking and obstacle avoidance planning method based on deep reinforcement learning
CN117707207B * | 2024-02-06 | 2024-04-19 | 中国民用航空飞行学院 | Unmanned aerial vehicle ground target tracking and obstacle avoidance planning method based on deep reinforcement learning


Legal Events

Code | Title
PB01 | Publication
SE01 | Entry into force of request for substantive examination