CN117035435A - Multi-unmanned aerial vehicle task allocation and track planning optimization method in dynamic environment


Info

Publication number: CN117035435A
Application number: CN202310619913.0A
Authority: CN
Original language: Chinese (zh)
Inventors: 袁冬冰, 牛昱斌, 李冬妮
Applicant and current assignee: Beijing Institute of Technology (BIT)
Priority and filing date: 2023-05-29
Publication date: 2023-11-10
Legal status: Pending

Classifications

    • G PHYSICS
    • G05 CONTROLLING; REGULATING
    • G05B CONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
    • G05B13/00 Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion
    • G05B13/02 Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric
    • G05B13/04 Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric involving the use of models or simulators
    • G05B13/042 Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric involving the use of models or simulators in which a parameter or coefficient is automatically adjusted to optimise the performance
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/092 Reinforcement learning
    • G06N7/00 Computing arrangements based on specific mathematical models
    • G06N7/01 Probabilistic graphical models, e.g. probabilistic networks
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/06 Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063 Operations research, analysis or management
    • G06Q10/0637 Strategic management or analysis, e.g. setting a goal or target of an organisation; Planning actions based on goals; Analysis or evaluation of effectiveness of goals

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • General Physics & Mathematics (AREA)
  • Human Resources & Organizations (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Educational Administration (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Economics (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Strategic Management (AREA)
  • Quality & Reliability (AREA)
  • General Health & Medical Sciences (AREA)
  • Marketing (AREA)
  • Tourism & Hospitality (AREA)
  • General Business, Economics & Management (AREA)
  • Game Theory and Decision Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Development Economics (AREA)
  • Operations Research (AREA)
  • Molecular Biology (AREA)
  • Automation & Control Theory (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Probability & Statistics with Applications (AREA)
  • Algebra (AREA)
  • Computational Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Pure & Applied Mathematics (AREA)

Abstract

A multi-unmanned aerial vehicle task allocation and track planning optimization method in a dynamic environment belongs to the technical field of unmanned aerial vehicles. A multi-agent reinforcement learning algorithm, MA-SAC, built on the classical deep reinforcement learning SAC algorithm is adopted: the SAC algorithm is fused into a multi-agent network structure and a centralized-training, distributed-execution mode is used, so that the agents can interact with and learn from one another, converge to higher reward values and task completion rates in a shorter time, improve the timeliness of task planning decisions, and reduce the number of iterations required compared with intelligent optimization algorithms, improving the timeliness of the method. By constructing a reward value function based on a policy set, the training efficiency and stability of reinforcement learning are improved, the sparse-reward problem in a dynamic environment is alleviated, and the convergence speed of the multi-agent reinforcement learning algorithm is increased. The method is applicable to the unmanned aerial vehicle field and can improve the real-time task planning efficiency of unmanned aerial vehicles in dynamic scenarios.

Description

Multi-unmanned aerial vehicle task allocation and track planning optimization method in dynamic environment
Technical Field
The invention relates to a multi-unmanned aerial vehicle task allocation and track planning optimization method in a dynamic environment, and belongs to the technical field of unmanned aerial vehicles.
Background
In the face of increasingly complex and changeable battlefield environments and numerous, varied combat tasks, it is increasingly difficult for a single unmanned aerial vehicle to meet actual battlefield requirements, and cooperative operation of unmanned aerial vehicle clusters has become the mainstream development trend. As a new type of multi-agent system, unmanned aerial vehicle clusters are currently controlled by most researchers with swarm intelligence algorithms, such as wolf pack, bee colony and bird flocking algorithms, to realize autonomous and intelligent cooperative control. Although such clusters have partial autonomous capability, the complexity of building the cluster control model, the sensitivity to parameters, the inherent computational burden of the algorithms and the low degree of intelligence make it difficult for unmanned aerial vehicle clusters to execute high-precision, highly real-time track planning tasks in complex adversarial environments. In recent years, multi-agent deep reinforcement learning (Multi-Agent Deep Reinforcement Learning, MADRL) has received widespread attention as one approach to intelligent control and decision problems. MADRL allows agents to interact with the environment and carry out cooperative or adversarial autonomous learning on the basis of powerful situational awareness and information processing capabilities. Therefore, MADRL is expected to provide unmanned aerial vehicle clusters with sufficient intelligent coordination to accomplish complex adversarial tasks.
Currently, researchers have carried out exploratory studies on unmanned aerial vehicle cluster track planning with deep reinforcement learning methods. Yang Qingqing et al. apply a deep reinforcement learning algorithm based on the Rainbow model to naval battlefield path planning; the Rainbow model fuses six DQN improvement mechanisms, namely the Double DQN network, prioritized experience replay, the dueling network, the noisy network, distributional learning and multi-step learning, and experiments show that the algorithm achieves a better path planning effect. Tang et al. use the Deep-Sarsa algorithm for unmanned aerial vehicle track planning; Deep-Sarsa fits the Q table end-to-end with a deep neural network on the basis of the Sarsa algorithm, adopts an on-policy learning method, and shows a higher learning speed and better performance in single-aircraft real-time track planning.
However, most of the above works do not consider dynamic factors present in an actual combat environment, such as unknown obstacle positions and dynamically moving unmanned aerial vehicles, or the possibility that a UAV is destroyed after failing to avoid an obstacle so that the track planning task fails. Their application scenarios are relatively simple and partly restricted to a two-dimensional plane, and research on unmanned aerial vehicle cluster track planning strategies in dynamic, uncertain environments is lacking.
Disclosure of Invention
Aiming at the low real-time decision efficiency of unmanned aerial vehicle task allocation and track optimization in dynamic scenarios in the prior art, the main purpose of the invention is to provide a multi-unmanned aerial vehicle task allocation and track planning optimization method for dynamic scenarios: a multi-agent reinforcement learning algorithm (MA-SAC) based on the classical deep reinforcement learning SAC algorithm is proposed and a heuristic reward value model is constructed, which alleviates the sparse-reward problem in dynamic scenarios and improves the real-time task planning efficiency of unmanned aerial vehicles in dynamic scenarios.
The invention aims at realizing the following technical scheme:
the invention discloses a multi-unmanned aerial vehicle task allocation and track planning optimization method in a dynamic environment: a multi-UAV task planning scene model in the dynamic environment is established, the joint optimization problem of UAV task allocation and track planning in the dynamic environment is described as a Markov decision model, and then a multi-agent reinforcement learning algorithm (MA-SAC) based on the classical deep reinforcement learning (SAC) algorithm, together with a corresponding heuristic reward value strategy, is used to solve the sparse-reward problem in the dynamic environment, thereby improving the real-time task planning efficiency of unmanned aerial vehicles in dynamic scenarios.
The invention discloses a multi-unmanned aerial vehicle task allocation and track planning optimization method under a dynamic environment, which comprises the following steps:
step 1: establishing a multi-unmanned aerial vehicle task planning scene model in a dynamic environment;
step 1.1: establishing a kinematic model of the unmanned aerial vehicle:
Δx = V·cosγ·cosθ·Δt, Δy = V·cosγ·sinθ·Δt, Δz = V·sinγ·Δt, where (Δx, Δy, Δz) represent the displacement offsets of the unmanned aerial vehicle along the x-axis, the y-axis and the z-axis over one time step Δt, θ represents the aircraft turning angle, γ represents the aircraft pitch angle, and V represents the unmanned aerial vehicle speed.
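For illustration only (not part of the original disclosure), a minimal Python sketch of this point-mass kinematic step, assuming a discretization time step dt that the text does not fix:

```python
import math

def kinematic_step(x, y, z, theta, gamma, v, dt=1.0):
    """Point-mass UAV kinematics: advance the position by one time step.

    theta: turning (heading) angle, gamma: pitch angle, v: UAV speed.
    dt is an assumed discretization step, not specified in the text.
    """
    dx = v * math.cos(gamma) * math.cos(theta) * dt
    dy = v * math.cos(gamma) * math.sin(theta) * dt
    dz = v * math.sin(gamma) * dt
    return x + dx, y + dy, z + dz
```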
Step 1.2: establishing a radar and threat zone model;
considering the maximum detection distance of the radar, the maximum radius of the missile kill zone and the maximum range of the no-escape zone, the threat value model of each defense unit with respect to the unmanned aerial vehicle is as follows:
where D is the distance between the unmanned aerial vehicle and the defense unit; R_Rmax is the maximum detection distance of the radar; R_Mmax is the maximum radius of the missile kill zone; R_Mkmax is the maximum range of the no-escape zone.
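The piecewise formula itself is not reproduced above. The sketch below only illustrates one plausible shape of such a threat model, assuming R_Mkmax ≤ R_Mmax ≤ R_Rmax and using illustrative threat levels that are not the patent's values:

```python
def threat_value(d, r_rmax, r_mmax, r_mkmax):
    """Illustrative piecewise threat value of one defense unit to a UAV.

    d: distance between the UAV and the defense unit.
    The breakpoints follow the text (radar detection range, missile kill
    radius, no-escape radius); the levels and interpolation are assumptions.
    """
    if d >= r_rmax:      # outside radar detection: no threat
        return 0.0
    if d <= r_mkmax:     # inside the no-escape zone: maximal threat
        return 1.0
    if d <= r_mmax:      # inside the missile kill zone: high threat
        return 0.8 + 0.2 * (r_mmax - d) / max(r_mmax - r_mkmax, 1e-6)
    # detected by radar but outside the kill zone: threat decays with distance
    return 0.8 * (r_rmax - d) / max(r_rmax - r_mmax, 1e-6)
```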
Step 1.3: with the aim of avoiding collisions and minimizing track distance, an objective function is built, i.ed i The track distance of the drone i is indicated.
Step 2: establishing a Markov decision model of the unmanned aerial vehicle task allocation and track planning combined optimization problem in a dynamic environment;
step 2.1: establishing a joint state space model of the unmanned aerial vehicle system;
the joint state space is
S = [S_1, S_2, ..., S_n]
The number of unmanned aerial vehicles is n, where each sub-state space is
S_i = [x_u, y_u, z_u, d_a, d_t, x_o, y_o, z_o]^T (i = 1, 2, ..., n)
where (x_u, y_u, z_u) represent the longitude, latitude and altitude of UAV i, d_a represents the distance between UAV i and its nearest UAV, d_t represents the distance between UAV i and its nearest target point, and (x_o, y_o, z_o) represents the position of the nearest enemy threat observed by UAV i within its effective detection range.
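For illustration, one UAV's sub-state vector S_i could be assembled as follows; the uav, uavs, targets and threats containers and their pos attributes are hypothetical helpers, not part of the disclosure:

```python
import numpy as np

def build_observation(uav, uavs, targets, threats, detect_range):
    """Assemble S_i = [x_u, y_u, z_u, d_a, d_t, x_o, y_o, z_o]^T for one UAV."""
    others = [u for u in uavs if u is not uav]
    d_a = min(np.linalg.norm(uav.pos - u.pos) for u in others)    # nearest UAV
    d_t = min(np.linalg.norm(uav.pos - t.pos) for t in targets)   # nearest target point
    # threats observed within the effective detection range only
    visible = [o for o in threats if np.linalg.norm(uav.pos - o.pos) <= detect_range]
    if visible:
        nearest = min(visible, key=lambda o: np.linalg.norm(uav.pos - o.pos)).pos
    else:
        nearest = np.zeros(3)   # assumed placeholder when no threat is detected
    return np.concatenate([uav.pos, [d_a, d_t], nearest]).astype(np.float32)
```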
Step 2.2: establishing a joint action space model of an unmanned aerial vehicle system
For the action space, each unmanned aerial vehicle in the formation can select own action, and then the combined action space of the whole unmanned aerial vehicle formation is set as
A=[A 1 ,A 2 ,...,A n ](i=1,2,...,n)
Wherein the action subspace is represented as
A i =[ψ ii ] T
Wherein psi is i Represents the turning angle change value of the unmanned aerial vehicle, gamma i Representing the change value of the pitching angle of the unmanned plane
Step 2.3: constructing a reward value function based on a strategy set;
(1) According to step 1.3, with collision avoidance and minimized track distance as the optimization targets, and to encourage the unmanned aerial vehicles to reach their combat target positions as soon as possible, the minimized track distance is realized through a reward structure: traverse all target points, compute for each target point the distance to its nearest UAV, sum these distances, and take the negative of the sum as r_1.
(2) According to step 1.3, with collision avoidance as the optimization target, the unmanned aerial vehicles keep a spatial cooperative relationship, i.e. they avoid collisions with each other. When a collision occurs the agent receives a negative reward, and a critical area is added as a collision early-warning mechanism, so as to train a reward structure for avoiding collisions between UAVs: cyclically traverse all other UAVs and compute the collision-avoidance reward value r_2 of each UAV as shown in the following formula:
where dist represents the distance between the UAV and its nearest UAV, dist_min2 represents the sum of the sizes of the two UAVs, and σ represents the width of the critical area.
(3) According to step 1.3, the radar and threat zone model is used to calculate the threat value T_s of a threat object to the unmanned aerial vehicle; then, according to T_s, the reward value r_3 for avoiding threat zones and obstacle zones is calculated as shown in the following formula
where T_s represents the threat value of the threat object to the UAV and T_σ represents the threat threshold.
The real-time reward R of one unmanned aerial vehicle consists of the above three parts, i.e. R = r_1 + r_2 + r_3; an illustrative sketch of this composite reward is given below.
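In the sketch, r_1 is simplified to the UAV's own nearest-target distance, and the penalty magnitudes in r_2 and r_3 are assumed values, since the exact formulas are not reproduced above:

```python
import numpy as np

def composite_reward(uav_pos, other_uav_pos, target_pos, t_s,
                     dist_min2, sigma, t_sigma=0.8):
    """R = r1 + r2 + r3 for one UAV (illustrative magnitudes)."""
    # r1: encourage reaching targets; here the UAV's own nearest-target distance
    r1 = -min(np.linalg.norm(uav_pos - t) for t in target_pos)

    # r2: collision penalty with an early-warning band of width sigma
    dist = min(np.linalg.norm(uav_pos - p) for p in other_uav_pos)
    if dist < dist_min2:                 # collision
        r2 = -1.0
    elif dist < dist_min2 + sigma:       # inside the critical (early-warning) area
        r2 = -0.5 * (dist_min2 + sigma - dist) / sigma
    else:
        r2 = 0.0

    # r3: penalty when the threat value exceeds the threshold T_sigma
    r3 = -1.0 if t_s > t_sigma else 0.0

    return r1 + r2 + r3
```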
Step 3: constructing a neural network of a multi-agent reinforcement learning algorithm;
the system architecture of the neural network is divided into a task abstraction layer, an algorithm training layer and an execution layer; the task abstraction layer converts the task optimization process into the convergence process of a corresponding reward structure, and the optimization goal is to reduce collisions as much as possible and minimize track distance; the training layer consists of a training environment and a training algorithm, where the environment contains the unmanned aerial vehicles, the targets and the threat areas, and the training algorithm is MA-SAC; after training, each UAV agent obtains a policy, which is in fact an actor network: the actor network receives an observation and outputs an action; in the execution layer, the policy of each agent is deployed onto a real UAV in the formation.
The method comprises n UAV agents in total. Each UAV agent has an Actor network, a Target-Actor network, two Critic networks and two Target-Critic networks, all of which are fully connected neural networks. The Critic network of each UAV receives as input not only the environmental state but also the actions of the other UAVs, and the Q value is computed from the local information observed by each UAV agent.
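For illustration, the per-agent networks described here could be sketched as follows in PyTorch (the use of PyTorch and the layer sizes are assumptions of this sketch); only one of the two Critic networks is shown:

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Maps one UAV's local observation to a squashed Gaussian action (psi, gamma)."""
    def __init__(self, obs_dim=8, act_dim=2, hidden=128):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU(),
                                  nn.Linear(hidden, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, act_dim)
        self.log_std = nn.Linear(hidden, act_dim)

    def forward(self, obs):
        h = self.body(obs)
        std = self.log_std(h).clamp(-20, 2).exp()
        dist = torch.distributions.Normal(self.mu(h), std)
        raw = dist.rsample()                    # reparameterized sample
        action = torch.tanh(raw)                # bounded angle increments
        # log-probability with tanh correction, summed over action dimensions
        logp = (dist.log_prob(raw) - torch.log(1 - action.pow(2) + 1e-6)).sum(-1)
        return action, logp

class CentralCritic(nn.Module):
    """Q(x, a_1..a_n): receives the joint state and the joint action of all UAVs."""
    def __init__(self, joint_obs_dim, joint_act_dim, hidden=128):
        super().__init__()
        self.q = nn.Sequential(nn.Linear(joint_obs_dim + joint_act_dim, hidden), nn.ReLU(),
                               nn.Linear(hidden, hidden), nn.ReLU(),
                               nn.Linear(hidden, 1))

    def forward(self, joint_obs, joint_act):
        return self.q(torch.cat([joint_obs, joint_act], dim=-1)).squeeze(-1)
```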
Step 4: training the neural network constructed in the step 3;
step 4.1: initializing critic network and actor network parameters, experience pool capacity D, sampling sample number B for training, and training round number episodes.
Step 4.2: for each training round, firstly, obtaining a state space si defined in step 2.1 of each unmanned aerial vehicle i intelligent agent through a simulation environment, obtaining an action space ai defined in step 2.2 of each unmanned aerial vehicle i intelligent agent according to an Actor network, and calculating a next state s 'according to a kinematic model of the unmanned aerial vehicle in step 1.1' i Calculating the prize value r obtained by each unmanned plane i intelligent agent according to the step 2.3 i
Sample < a i ,s i ,s i ',r i > store in experience pool. If the number of experience pool samples is greater than the number of sampling samples for training, go to step 4.3, otherwise continue to step 4.2.
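A minimal sketch of the experience pool and the rollout check of steps 4.1-4.2; the env and agents interfaces are hypothetical, and a plain uniform-sampling pool is shown although step 4.3 refers to a prioritized replay buffer:

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity experience pool storing joint transitions."""
    def __init__(self, capacity=1_000_000):
        self.buffer = deque(maxlen=capacity)

    def add(self, states, actions, rewards, next_states):
        self.buffer.append((states, actions, rewards, next_states))

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)

# Hypothetical rollout step mirroring step 4.2:
#   states = env.observe()                                   # s_i for every UAV agent i
#   actions = [ag.act(s) for ag, s in zip(agents, states)]   # from each Actor network
#   next_states, rewards = env.step(actions)                 # kinematics + rewards
#   buffer.add(states, actions, rewards, next_states)
#   if len(buffer) > batch_size:                             # only then go to step 4.3
#       batch = buffer.sample(batch_size)
```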
Step 4.3: the number of samples B is randomly drawn from the sample pool. To be used forThe Critic current network is updated as a loss function,
wherein the method comprises the steps of
E (x,a,r,x')~D Representing a desire to sample (x, a, r, x ') from the priority playback buffer pool D, where x represents the unmanned joint state, a represents the joint action, r represents the joint prize value, and x' represents the next joint state.Representing that a joint action (a) is performed in a joint state x given a random policy pi 1 ,...,a Nu ) State-action value of (c). y is i For the estimated state-action cost function in the joint state x, r i Represents a prize value for unmanned plane i, gamma represents a discount rate representing a percentage of future benefits to be referenced, +.>Representing solving for a given random strategy->When in state s i ' Down execution action a i 'desire,'>Representing a given random strategy->When in state s i ' Down execution action a i ' target state-action value, +.>Is state s i ' Down strategy>Output action a i Probability of';
by passing throughUpdating the Actor network, pi i (a i |s i ) Is state s i Lower policy pi output action a i Probability of->Representing the state s at a given random strategy pi i Lower execution action a i Target state-action value of +.>Representing the state s at the time of solving a given random strategy pi i Lower execution action a i Is not limited to the above-described embodiments.
By passing throughUpdating the target network, wherein w 'represents the parameter of the target-critic network, w represents the parameter of the critic network, theta' represents the parameter of the target-actor network, theta represents the parameter of the actor network, and tau is the target networkUpdate rate of the complex.
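Putting steps 4.2-4.3 together, one MA-SAC update for agent i could be sketched as below. The agent containers, their optimizers and the batch layout are assumed interfaces; a single critic is used for brevity although the text specifies two Critic networks per agent, and the entropy coefficient alpha is kept fixed:

```python
import torch
import torch.nn.functional as F

def update_agent(i, batch, agents, gamma=0.95, alpha=0.2, tau=0.01):
    """One illustrative MA-SAC gradient step for UAV agent i.

    batch holds per-agent tensor lists: obs[k], act[k], rew[k], next_obs[k].
    Each agent bundles actor, critic, target_actor, target_critic and optimizers.
    """
    ag = agents[i]
    obs, act, rew, nobs = batch["obs"], batch["act"], batch["rew"], batch["next_obs"]

    # Critic: regress Q_i(x, a) onto y_i = r_i + gamma * (Q'_i(x', a') - alpha * log pi'(a'_i|s'_i))
    with torch.no_grad():
        next_acts, next_logps = zip(*[a.target_actor(nobs[k]) for k, a in enumerate(agents)])
        q_next = ag.target_critic(torch.cat(nobs, -1), torch.cat(next_acts, -1))
        y = rew[i] + gamma * (q_next - alpha * next_logps[i])
    q = ag.critic(torch.cat(obs, -1), torch.cat(act, -1))
    critic_loss = F.mse_loss(q, y)
    ag.critic_opt.zero_grad(); critic_loss.backward(); ag.critic_opt.step()

    # Actor: minimize E[alpha * log pi_i(a_i|s_i) - Q_i(x, a_1..a_n)]
    new_act_i, logp_i = ag.actor(obs[i])
    joint_act = torch.cat([new_act_i if k == i else act[k] for k in range(len(agents))], -1)
    actor_loss = (alpha * logp_i - ag.critic(torch.cat(obs, -1), joint_act)).mean()
    ag.actor_opt.zero_grad(); actor_loss.backward(); ag.actor_opt.step()

    # Soft target update: w' = tau * w + (1 - tau) * w', same for theta'
    for net, target in [(ag.critic, ag.target_critic), (ag.actor, ag.target_actor)]:
        for p, tp in zip(net.parameters(), target.parameters()):
            tp.data.mul_(1 - tau).add_(tau * p.data)
```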
Further comprising step 5: performing task allocation and track planning for the multiple unmanned aerial vehicles in the dynamic environment with the multi-agent reinforcement learning neural network trained in step 4, while simultaneously optimizing the internal policy of each UAV agent and the global task planning policy, thereby improving the real-time performance and adaptive capability of UAV task planning in the dynamic environment, so that all UAV combat tasks in the dynamic environment are completed with higher battlefield payoff, shorter track planning distance and better time-sensitivity.
The beneficial effects are that:
1. The multi-unmanned aerial vehicle task allocation and track planning optimization method in a dynamic environment disclosed by the invention adopts a multi-agent reinforcement learning algorithm based on the classical deep reinforcement learning SAC algorithm: the SAC algorithm is fused into a multi-agent network structure and a centralized-training, distributed-execution mode is adopted, so that the agents can interact with and learn from one another and converge to higher reward values and task completion rates in a shorter time, which improves the timeliness of task planning decisions, reduces the number of iterations compared with intelligent optimization algorithms, and improves the timeliness of the method.
2. The multi-unmanned aerial vehicle task allocation and track planning optimization method in a dynamic environment disclosed by the invention constructs a reward value function based on a policy set, which improves the training efficiency and stability of reinforcement learning, alleviates the sparse-reward problem in a dynamic environment, and increases the convergence speed of the multi-agent reinforcement learning algorithm.
Drawings
FIG. 1 is a flow chart of a multi-unmanned aerial vehicle task allocation and track planning optimization method in a dynamic environment;
FIG. 2 is a diagram of a multi-agent reinforcement learning MA-SAC algorithm system architecture in the present embodiment;
fig. 3 is an initial situation map of a multi-unmanned aerial vehicle task planning simulation scene in the dynamic environment in the embodiment;
FIG. 4 is a diagram of a multi-agent reinforcement learning algorithm MA-SAC neural network architecture in the present embodiment;
fig. 5 is a comparison of the reward values of the MA-SAC, MADDPG and DDPG algorithms in the multi-unmanned aerial vehicle task allocation and track planning optimization method in a dynamic environment disclosed in the present embodiment;
FIG. 6 is a diagram of an unmanned aerial vehicle successfully discovering an enemy missile platoon and completing the obstacle avoidance process in the dynamic environment of the present embodiment;
fig. 7 is a view of an unmanned aerial vehicle launching a missile at an enemy ground tank platoon and destroying the target in the dynamic environment of the present embodiment.
Detailed Description
The present invention will be described in detail below with reference to the accompanying drawings and examples. The technical problems solved and the beneficial effects of the technical solution of the invention are also described; the described embodiment is intended only to facilitate understanding of the invention and has no limiting effect.
The embodiment is implemented on a domestic joint-operations combat simulation platform. The platform provides a back-end Python development interface, algorithm training is carried out in a Docker environment, and the simulation is run at 15 times real-time speed. Combat tasks are formulated on the basis of a custom combat design for this environment; the task types include maneuver, assault, strike, land transfer, patrol and support. A dynamic land-air unmanned combat scenario is created on the platform, in which different combat units such as unmanned aerial vehicles, SA-22 ground-to-air missile platoons and T-72B tank platoons are deployed, and real-time visual simulation of unmanned aerial vehicle task planning is performed.
The embodiment discloses a multi-unmanned aerial vehicle task allocation and track planning optimization method in a dynamic environment, which specifically comprises the following steps as shown in fig. 1:
step 1: multi-unmanned aerial vehicle task planning scene modeling in dynamic environment
Step 1.1: the method comprises the steps of establishing a fight assumption and randomly initializing 4 number of unmanned aerial vehicles, 8 number of threat objects, 10 number of tasks to be executed on a simulation platform, dynamically showing enemy threat objects and ground targets, moving the positions at any time, and attacking the enemy dynamic targets on the premise that the fight task of a plurality of unmanned aerial vehicles avoids all the threat objects of the enemy, wherein the initial situation of a simulation scene is shown in fig. 3.
Step 1.2: establishing an aircraft kinematics model
Assuming the drone as a particle in three-dimensional space, the kinematic model of the drone can be expressed as:
Δx = V·cosγ·cosθ·Δt, Δy = V·cosγ·sinθ·Δt, Δz = V·sinγ·Δt, where (Δx, Δy, Δz) represent the displacement offsets of the unmanned aerial vehicle along the x-axis, the y-axis and the z-axis over one time step Δt, θ represents the aircraft turning angle, γ represents the aircraft pitch angle, and V represents the unmanned aerial vehicle speed.
Step 1.3: establishing a radar and threat zone model
The maximum detection distance of the radar, the maximum radius of the missile killing area and the maximum distance of the escape-free area are considered. Therefore, the threat value formula of each defending unit to the unmanned aerial vehicle is as follows:
where D is the distance between the unmanned aerial vehicle and the defense unit; R_Rmax is the maximum detection distance of the radar; R_Mmax is the maximum radius of the missile kill zone; R_Mkmax is the maximum range of the no-escape zone.
Step 1.4: establishing an objective function
The aim of the problem is to reduce collisions as much as possible and to minimize track distance, i.ed i The track distance of the drone i is indicated.
Step 2: establishing a Markov decision model for unmanned aerial vehicle task allocation and track planning combined optimization problem under dynamic environment
Step 2.1: unmanned aerial vehicle system state space design
Let the joint state space be
S = [S_1, S_2, ..., S_n]
The number of unmanned aerial vehicles is 4, where each sub-state space is
S_i = [x_u, y_u, z_u, d_a, d_t, x_o, y_o, z_o]^T (i = 1, 2, ..., n)
where (x_u, y_u, z_u) represent the longitude, latitude and altitude of UAV i, d_a represents the distance between UAV i and its nearest UAV, d_t represents the distance between UAV i and its nearest target point, and (x_o, y_o, z_o) represents the position of the nearest enemy threat observed by UAV i within its effective detection range.
Step 2.2: unmanned aerial vehicle system joint motion space design
For the action space, each unmanned aerial vehicle in the formation can select own action, and then the combined action space of the whole unmanned aerial vehicle formation is set as
A=[A 1 ,A 2 ,...,A n ](i=1,2,...,n)
Wherein the action subspace can be represented as
A i =[ψ ii ] T
Wherein psi is i Represents the turning angle change value of the unmanned aerial vehicle, gamma i Representing the change value of the pitching angle of the unmanned plane
Step 2.3: constructing a strategy set-based prize value function
(1) According to step 1.3, with collision avoidance and minimized track distance as the optimization targets, and to encourage the unmanned aerial vehicles to reach their combat target positions as soon as possible, the minimized track distance is realized through a reward structure: traverse all target points, compute for each target point the distance to its nearest UAV, sum these distances, and take the negative of the sum as r_1.
(2) According to step 1.3, with collision avoidance as the optimization target, the unmanned aerial vehicles keep a spatial cooperative relationship, i.e. they avoid collisions with each other. When a collision occurs the agent receives a negative reward, and a critical area is added as a collision early-warning mechanism, so as to train a reward structure for avoiding collisions between UAVs: cyclically traverse all other UAVs and compute the collision-avoidance reward value r_2 of each UAV as shown in the following formula:
where dist represents the distance between the UAV and its nearest UAV, dist_min2 represents the sum of the sizes of the two UAVs, and σ represents the width of the critical area.
(3) According to step 1.3, the radar and threat zone model is used to calculate the threat value T_s of a threat object to the unmanned aerial vehicle; then, according to T_s, the reward value r_3 for avoiding threat zones and obstacle zones is calculated as shown in the following formula
where T_s represents the threat value of the threat object to the UAV and T_σ represents the threat threshold, set to 0.8.
The real-time reward R of one unmanned aerial vehicle consists of the above three parts, i.e. R = r_1 + r_2 + r_3.
Step 3: constructing a neural network of a multi-agent reinforcement learning algorithm;
the system architecture of the whole method is divided into a task abstraction layer, an algorithm training layer and an execution layer. The task abstraction layer converts the task optimization process into the convergence process of a corresponding reward structure; the goal of the problem is to reduce collisions as much as possible and minimize track distance. The algorithm training layer consists of a training environment and a training algorithm, where the environment contains the unmanned aerial vehicles, the targets and the threat areas and the algorithm is MA-SAC; each UAV agent obtains a policy after training, which is in fact an actor network: the actor network receives an observation and outputs an action. In the execution layer, the policy of each agent is deployed onto a real unmanned aerial vehicle in the formation.
The algorithm contains n UAV agents in total. Each unmanned aerial vehicle has an Actor network, a Target-Actor network, two Critic networks and two Target-Critic networks, all of which are fully connected neural networks. The Critic network of each UAV receives as input not only the environmental state but also the actions of the other UAVs, and the Q value is computed from the local information observed by each UAV agent.
Step 4: training the neural network constructed in the step 3;
step 4.1: initialize the Critic network and Actor network parameters; the experience pool capacity D is 1,000,000, the number of samples B drawn for training is 1024, and the number of training rounds (episodes) is 15,000.
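For reference, the settings stated in this embodiment can be collected into a configuration mapping; the field names are illustrative only:

```python
# Hyperparameters and scenario sizes stated in this embodiment (names are illustrative).
MASAC_CONFIG = {
    "replay_capacity": 1_000_000,   # experience pool size D
    "batch_size": 1024,             # number of samples B drawn for training
    "episodes": 15_000,             # number of training rounds
    "num_uavs": 4,                  # UAV agents in the scenario (step 1.1)
    "num_threats": 8,               # threat objects (step 1.1)
    "num_tasks": 10,                # tasks to be executed (step 1.1)
    "threat_threshold": 0.8,        # T_sigma used in the r3 reward term (step 2.3)
}
```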
Step 4.2: for each training round, firstly obtaining a state space s defined in step 2.1 by each unmanned plane i intelligent agent through a simulation environment i Obtaining an action space a defined in step 2.2 by each unmanned aerial vehicle i intelligent agent according to an Actor network i Calculating the next state s 'according to the kinematic model of the unmanned aerial vehicle in the step 1.1' i Calculating the prize value r obtained by each unmanned plane i intelligent agent according to the step 2.3 i
Sample < a i ,s i ,s i ',r i > store in experience pool. If the number of experience pool samples is greater than the number of sampling samples for training, go to step 4.3, otherwise continue to step 4.2.
Step 4.3:
the number of samples B is randomly drawn from the sample pool. To be used forThe Critic current network is updated as a loss function,
E (x,a,r,x')~D representing a desire to sample (x, a, r, x ') from the priority playback buffer pool D, where x represents the unmanned joint state, a represents the joint action, r represents the joint prize value, and x' represents the next jointStatus of the device.Representing that a joint action is performed in the joint state x given a random policy pi>State-action value of (c). y is i For the estimated state-action cost function in the joint state x, r i Represents a prize value for unmanned plane i, gamma represents a discount rate representing a percentage of future benefits to be referenced, +.>Representing solving for a given random strategy->When in state s i ' Down execution action a i 'desire,'>Representing a given random strategy->When in state s i ' Down execution action a i ' target state-action value, +.>Is state s i ' Down strategy>Output action a i Probability of';
by passing throughUpdating the Actor network, pi i (a i |s i ) Is state s i Lower policy pi output action a i Probability of->Representing the state s at a given random strategy pi i Lower execution action a i Target state-action value of +.>Representing the state s at the time of solving a given random strategy pi i Lower execution action a i Is not limited to the above-described embodiments.
By passing throughUpdating the target network, w 'represents the parameter of the target-critic network, w represents the parameter of the critic network, theta' represents the parameter of the target-actor network, theta represents the parameter of the actor network, and tau is the updating rate of the target network. The parameters of the MA-SAC algorithm are set in the experiment as shown in the following table.
Comparing the reward values of the MA-SAC algorithm with those of the MADDPG and DDPG algorithms, as shown in fig. 5: after 15000 rounds of learning, the reward value curves of the multiple unmanned aerial vehicles show that, although the curves fluctuate somewhat because of random noise during training, all three algorithms exhibit a convergence trend. There are, however, differences in their convergence speed and time. At around episode 2000, the rewards of all three algorithms begin to rise. The convergence of DDPG is the slowest, showing no convergence trend until about episode 6000. Both MA-SAC and MADDPG converge earlier than DDPG, and MA-SAC converges fastest.
Simulation results show that, as shown in fig. 6, each unmanned aerial vehicle successfully discovers and evades the attack of the enemy missile platoon, and, as shown in fig. 7, each unmanned aerial vehicle launches a missile at the ground target it is assigned to attack and successfully destroys the target, thereby completing the unmanned aerial vehicle combat tasks in the dynamic environment.
Indicators such as the reward value after algorithm convergence, the average task success rate, the total number of searches required to reach the task points and the algorithm training efficiency are compared and analyzed in numerical experiments. The experimental results show that, on these indicators, the multi-unmanned aerial vehicle task allocation and track planning optimization method in a dynamic environment disclosed in this embodiment is superior to unmanned aerial vehicle task allocation and track planning optimization methods based on traditional optimization algorithms.
The foregoing detailed description has set forth the objects, technical solutions and advantages of the invention in further detail. It should be understood that the foregoing is only illustrative of the invention and is not intended to limit its scope, which is defined by the appended claims.

Claims (4)

1. A multi-unmanned aerial vehicle task allocation and track planning optimization method in a dynamic environment, characterized by comprising the following steps:
step 1: establishing a multi-unmanned aerial vehicle task planning scene model in a dynamic environment;
step 2: establishing a Markov decision model of the unmanned aerial vehicle task allocation and track planning combined optimization problem in a dynamic environment;
step 2.1: establishing a joint state space model of the unmanned aerial vehicle system;
the joint state space is
S = [S_1, S_2, ..., S_n]
the number of unmanned aerial vehicles is n, where each sub-state space is
S_i = [x_u, y_u, z_u, d_a, d_t, x_o, y_o, z_o]^T (i = 1, 2, ..., n)
where (x_u, y_u, z_u) represent the longitude, latitude and altitude of UAV i, d_a represents the distance between UAV i and its nearest UAV, d_t represents the distance between UAV i and its nearest target point, and (x_o, y_o, z_o) represents the position of the nearest enemy threat observed by UAV i within its effective detection range;
step 2.2: establishing a joint action space model of the unmanned aerial vehicle system;
for the action space, each unmanned aerial vehicle in the formation can select its own action, so the joint action space of the whole UAV formation is set as
A = [A_1, A_2, ..., A_n]
where each action subspace is represented as
A_i = [ψ_i, γ_i]^T
where ψ_i represents the change in the turning angle of UAV i and γ_i represents the change in the pitch angle of UAV i;
Step 2.3: constructing a reward value function based on a strategy set;
(1) According to step 1.3, with collision avoidance and minimized track distance as the optimization targets, and to encourage the unmanned aerial vehicles to reach their combat target positions as soon as possible, the minimized track distance is realized through a reward structure: traverse all target points, compute for each target point the distance to its nearest UAV, sum these distances, and take the negative of the sum as r_1;
(2) According to step 1.3, with collision avoidance as the optimization target, the unmanned aerial vehicles keep a spatial cooperative relationship, i.e. they avoid collisions with each other; when a collision occurs the agent receives a negative reward, and a critical area is added as a collision early-warning mechanism, so as to train a reward structure for avoiding collisions between UAVs: cyclically traverse all other UAVs and compute the collision-avoidance reward value r_2 of each UAV as shown in the following formula:
where dist represents the distance between the UAV and its nearest UAV, dist_min2 represents the sum of the sizes of the two UAVs, and σ represents the width of the critical area;
(3) According to step 1.3, the radar and threat zone model is used to calculate the threat value T_s of a threat object to the unmanned aerial vehicle; then, according to T_s, the reward value r_3 for avoiding threat zones and obstacle zones is calculated as shown in the following formula
where T_s represents the threat value of the threat object to the UAV and T_σ represents the threat threshold;
the real-time reward R of one unmanned aerial vehicle consists of the above three parts, i.e. R = r_1 + r_2 + r_3;
Step 3: constructing a neural network of a multi-agent reinforcement learning algorithm;
the system architecture of the neural network is divided into a task abstraction layer, an algorithm training layer and an execution layer; the task abstraction layer converts the task optimization process into the convergence process of a corresponding reward structure, and the optimization goal is to reduce collisions as much as possible and minimize track distance; the training layer of the neural network consists of a training environment and a training algorithm, the environment comprising the unmanned aerial vehicles, the targets and the threat areas, and the training algorithm being MA-SAC; after training, each UAV agent obtains a policy, which is an actor network; the actor network receives an observation and outputs an action; the policy of each agent in the execution layer is deployed onto a real unmanned aerial vehicle in the formation;
the method comprises n UAV agents in total; each UAV agent has an Actor network, a Target-Actor network, two Critic networks and two Target-Critic networks, all of which are fully connected neural networks; the Critic network of each UAV receives as input not only the environmental state but also the actions of the other UAVs, and the Q value is computed from the local information observed by each UAV agent;
step 4: training the neural network constructed in the step 3.
2. The method for optimizing task allocation and track planning of multiple unmanned aerial vehicles in a dynamic environment according to claim 1, characterized by further comprising step 5: performing task allocation and track planning for the multiple unmanned aerial vehicles in the dynamic environment with the multi-agent reinforcement learning neural network trained in step 4, while simultaneously optimizing the internal policy of each UAV agent and the global task planning policy, thereby improving the real-time performance and adaptive capability of UAV task planning in the dynamic environment, so that all UAV combat tasks in the dynamic environment are completed with higher battlefield payoff, shorter track planning distance and better time-sensitivity.
3. The method for optimizing task allocation and track planning of multiple unmanned aerial vehicles in a dynamic environment according to claim 1, wherein the method comprises the following steps: the implementation method of the step 1 is that,
step 1.1: establishing a kinematic model of the unmanned aerial vehicle:
Δx = V·cosγ·cosθ·Δt, Δy = V·cosγ·sinθ·Δt, Δz = V·sinγ·Δt, where (Δx, Δy, Δz) represent the displacement offsets of the unmanned aerial vehicle along the x-axis, the y-axis and the z-axis over one time step Δt; θ represents the aircraft turning angle, γ represents the aircraft pitch angle, and V represents the unmanned aerial vehicle speed;
step 1.2: establishing a radar and threat zone model;
considering the maximum detection distance of the radar, the maximum radius of the missile kill zone and the maximum range of the no-escape zone, the threat value model of each defense unit with respect to the unmanned aerial vehicle is as follows:
where D is the distance between the unmanned aerial vehicle and the defense unit; R_Rmax is the maximum detection distance of the radar; R_Mmax is the maximum radius of the missile kill zone; R_Mkmax is the maximum range of the no-escape zone;
step 1.3: with the aim of avoiding collisions and minimizing track distance, building an objective function, i.e. min Σ d_i summed over all UAVs i = 1, ..., n, where d_i denotes the track distance of UAV i.
4. A multi-unmanned aerial vehicle task allocation and track planning optimization method in a dynamic environment as claimed in claim 3, wherein: the implementation method of the step 4 is that,
step 4.1: initializing the Critic network and Actor network parameters, the experience pool capacity D, the number of samples B drawn for training, and the number of training rounds (episodes);
step 4.2: for each training round, first obtaining the state s_i (defined in step 2.1) of each UAV agent i from the simulation environment, obtaining the action a_i (defined in step 2.2) of each UAV agent i from its Actor network, computing the next state s'_i according to the UAV kinematic model of step 1.1, and computing the reward value r_i obtained by each UAV agent i according to step 2.3;
storing the sample <a_i, s_i, s'_i, r_i> in the experience pool; if the number of samples in the experience pool is greater than the number of samples drawn for training, going to step 4.3, otherwise continuing with step 4.2;
step 4.3: randomly extracting B samples from the experience pool; updating the current Critic network with the loss function
L(w_i) = E_(x,a,r,x')~D [ (Q_i(x, a_1, ..., a_n) - y_i)^2 ]
wherein the target value is
y_i = r_i + γ·E_(a'_i~π'_i) [ Q'_i(x', a'_1, ..., a'_n) - α·log π'_i(a'_i|s'_i) ]
E_(x,a,r,x')~D represents the expectation over samples (x, a, r, x') drawn from the prioritized replay buffer pool D, wherein x represents the joint state of the unmanned aerial vehicles, a represents the joint action, r represents the joint reward value, and x' represents the next joint state; Q_i(x, a_1, ..., a_n) represents the state-action value of performing the joint action (a_1, ..., a_n) in joint state x given the random policy π; y_i is the estimated state-action value target in joint state x; r_i represents the reward value of unmanned aerial vehicle i; γ represents the discount rate, i.e. the fraction of future returns taken into account; E_(a'_i~π'_i)[·] represents the expectation of executing action a'_i in state s'_i under the given random target policy π'_i; Q'_i(x', a'_1, ..., a'_n) represents the target state-action value of executing action a'_i in state s'_i; π'_i(a'_i|s'_i) is the probability that the target policy π'_i outputs action a'_i in state s'_i; and α is the entropy temperature coefficient;
updating the Actor network through the policy loss J(θ_i) = E_(a_i~π_i) [ α·log π_i(a_i|s_i) - Q_i(x, a_1, ..., a_n) ], wherein π_i(a_i|s_i) is the probability that policy π_i outputs action a_i in state s_i, Q_i(x, a_1, ..., a_n) represents the state-action value of executing action a_i in state s_i given the random policy π, and E_(a_i~π_i)[·] represents the expectation of executing action a_i in state s_i given the random policy π;
updating the target networks through w' = τw + (1-τ)w' and θ' = τθ + (1-τ)θ', wherein w' represents the parameters of the Target-Critic network, w represents the parameters of the Critic network, θ' represents the parameters of the Target-Actor network, θ represents the parameters of the Actor network, and τ is the update rate of the target network.


Cited By (2) (* cited by examiner, † cited by third party)

Publication number | Priority date | Publication date | Assignee | Title
CN117707207A * | 2024-02-06 | 2024-03-15 | 中国民用航空飞行学院 | Unmanned aerial vehicle ground target tracking and obstacle avoidance planning method based on deep reinforcement learning
CN117707207B * | 2024-02-06 | 2024-04-19 | 中国民用航空飞行学院 | Unmanned aerial vehicle ground target tracking and obstacle avoidance planning method based on deep reinforcement learning


Legal Events

Code | Title
PB01 | Publication
SE01 | Entry into force of request for substantive examination