CN116301022A - Unmanned aerial vehicle cluster task planning method and device based on deep reinforcement learning
- Publication number: CN116301022A
- Application number: CN202310006846.5A
- Authority: CN (China)
- Prior art keywords: unmanned aerial vehicle, simulation, network, cluster
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G05D1/00: Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
- G05D1/10: Simultaneous control of position or course in three dimensions
- G05D1/101: Simultaneous control of position or course in three dimensions specially adapted for aircraft
- G05D1/104: Simultaneous control of position or course in three dimensions specially adapted for aircraft involving a plurality of aircraft, e.g. formation flying
- Y02T10/00: Road transport of goods or passengers
- Y02T10/10: Internal combustion engine [ICE] based vehicles
- Y02T10/40: Engine management systems
Abstract
The application provides an unmanned aerial vehicle cluster task planning method and device based on deep reinforcement learning, comprising the following steps: randomly selecting one unmanned aerial vehicle from the cluster as a first unmanned aerial vehicle and taking the other unmanned aerial vehicles as second unmanned aerial vehicles, the second unmanned aerial vehicles forming the remaining unmanned aerial vehicle cluster; acquiring the actual task execution environment of the first unmanned aerial vehicle and the task planning model of the unmanned aerial vehicle cluster; and inputting the actual task execution environment into the unmanned aerial vehicle cluster task planning model to obtain the task plan of the unmanned aerial vehicle cluster. The unmanned aerial vehicle cluster task planning model is obtained by training an improved MADDPG model with the simulated task execution environment as training samples; the improved MADDPG model comprises a MADDPG network and an average field theory module, wherein the average field theory module is arranged in the MADDPG network. Through the method, each unmanned aerial vehicle can perceive global environment changes while acting, so that the unmanned aerial vehicle cluster is guided to make an optimal task plan in an unknown dynamic environment.
Description
Technical Field
The invention relates to the technical field of unmanned aerial vehicles, in particular to an unmanned aerial vehicle cluster task planning method and device based on deep reinforcement learning.
Background
Unmanned aerial vehicles have the advantages of convenient operation, flexibility, reliability, low cost, and reduced risk of accidents for operators. In recent years, unmanned aerial vehicles have developed rapidly: the single-machine task execution capability of various unmanned aerial vehicles has grown stronger, and their autonomy and intelligence continue to improve. As the scale of the tasks executed by unmanned aerial vehicles keeps expanding, task complexity also increases, which has given rise to unmanned aerial vehicle clusters in which multiple unmanned aerial vehicles cooperate to complete a task jointly.
Task planning within an unmanned aerial vehicle cluster is the basis for multiple unmanned aerial vehicles to process tasks collaboratively. The traditional unmanned aerial vehicle cluster task planning method is divided into two parts, track planning and task allocation, and is carried out under the assumption that the environment is essentially fixed and completely known. With such a method, the unmanned aerial vehicle cannot acquire global information and environmental changes in time and is easily disturbed by the external environment, and any incomplete environment perception or environment estimation deviation can cause problems in the task planning of the unmanned aerial vehicle cluster. Meanwhile, the traditional unmanned aerial vehicle cluster task planning method does not fully consider the coupling relation between track planning and task allocation.
Disclosure of Invention
Based on this, the unmanned aerial vehicle cluster task planning method and device based on deep reinforcement learning provided by the invention enable the unmanned aerial vehicle to acquire global environment information and its changes in time, guiding the unmanned aerial vehicle cluster to make an optimal decision for a specific state.
In a first aspect, the present invention provides an unmanned aerial vehicle cluster task planning method based on deep reinforcement learning, including:
acquiring an actual task execution environment of an unmanned aerial vehicle and a task planning model of an unmanned aerial vehicle cluster, and inputting the actual task execution environment into the task planning model of the unmanned aerial vehicle cluster to obtain a task plan of the unmanned aerial vehicle cluster;
the unmanned aerial vehicle cluster task planning model is obtained by training an improved MADDPG model with simulated task execution environments as training samples; the improved MADDPG model comprises a MADDPG network and an average field theory module.
In a second aspect, the present invention provides an unmanned aerial vehicle cluster mission planning apparatus based on deep reinforcement learning, including:
the parameter acquisition module is used for acquiring an actual task execution environment of the unmanned aerial vehicle and a task planning model of the unmanned aerial vehicle cluster;
the task planning module is used for inputting the actual task execution environment into the unmanned aerial vehicle cluster task planning model to obtain the task plan of the unmanned aerial vehicle cluster;
the unmanned aerial vehicle cluster task planning model is obtained by training an improved MADDPG model with simulated task execution environments as training samples; the improved MADDPG model comprises a MADDPG network and an average field theory module.
In a third aspect, the present invention provides a computer readable storage medium having stored thereon a computer program which when executed by a processor implements the steps of any one of the deep reinforcement learning based unmanned aerial vehicle cluster task planning methods of the first aspect.
In a fourth aspect, the present invention provides a computer device, including a memory and a processor, where the memory stores a computer program, and the processor executes any one of the unmanned aerial vehicle cluster task planning methods based on deep reinforcement learning in the first aspect when executing the computer program.
The beneficial effects of the above technical scheme are as follows: according to the unmanned aerial vehicle cluster task planning method based on deep reinforcement learning, an unmanned aerial vehicle cluster task planning model is obtained through learning and training, so that each unmanned aerial vehicle can acquire global environment information and its changes during every action, and the unmanned aerial vehicle cluster can be guided to make an optimal task plan in an unknown dynamic environment.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below.
Fig. 1 is a schematic diagram of an unmanned aerial vehicle cluster task planning method based on deep reinforcement learning according to an embodiment of the present invention;
fig. 2 is a schematic diagram of the framework of the MADDPG network according to an embodiment of the present invention;
fig. 3 is a schematic diagram of implementation of an unmanned aerial vehicle cluster task planning method based on deep reinforcement learning according to an embodiment of the present invention;
fig. 4a is a test environment for performing cooperative communication between unmanned aerial vehicles by using the unmanned aerial vehicle cluster task planning method based on deep reinforcement learning according to an embodiment of the present invention;
fig. 4b is a test environment for performing physical spoofing between unmanned aerial vehicles by using the unmanned aerial vehicle cluster task planning method based on deep reinforcement learning according to an embodiment of the present invention;
fig. 5 a-5 c are task planning results of performing cooperative communication between unmanned aerial vehicles according to the unmanned aerial vehicle cluster task planning method based on deep reinforcement learning provided by an embodiment of the present invention;
fig. 6 a-6 c are task planning results of performing physical spoofing between unmanned aerial vehicles by using the unmanned aerial vehicle cluster task planning method based on deep reinforcement learning according to an embodiment of the present invention;
fig. 7a is a graph comparing a reward value obtained by a deep reinforcement learning-based unmanned aerial vehicle cluster task planning method and an existing deep learning method when performing a cooperative communication task between unmanned aerial vehicles;
fig. 7b is a comparison chart of success rate obtained by the unmanned aerial vehicle cluster task planning method based on deep reinforcement learning and the existing deep learning method in the embodiment of the invention when the cooperative communication task between unmanned aerial vehicles is executed;
fig. 8 is a schematic block diagram of an unmanned aerial vehicle cluster mission planning apparatus based on deep reinforcement learning according to an embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention. In order to more specifically explain the invention, the unmanned aerial vehicle cluster task planning method and device based on deep reinforcement learning, which are provided by the invention, are specifically described below with reference to the accompanying drawings.
Unmanned aerial vehicle cluster task planning compensates for the limited task execution capability of a single unmanned aerial vehicle through the cooperation of multiple unmanned aerial vehicles, meeting increasingly complex task processing requirements. At present, when an unmanned aerial vehicle cluster executes area defense tasks, the information acquired by a single unmanned aerial vehicle is limited, so the global environment cannot be acquired in time and the optimal strategy in the task planning process cannot be known. Aiming at this problem, the application provides an unmanned aerial vehicle cluster task planning method, device, storage medium and equipment.
The embodiment of the application provides a specific application scenario of an unmanned aerial vehicle cluster task planning method based on deep reinforcement learning. The application scenario includes the terminal device provided by the embodiment, and the terminal device may be various electronic devices including, but not limited to, a smart phone and a computer device, where the computer device may be at least one of a desktop computer, a portable computer, a laptop computer, a tablet computer, and the like. The user operates the terminal equipment, sends out an operation instruction of unmanned aerial vehicle cluster task planning, and executes the unmanned aerial vehicle cluster task planning method based on deep reinforcement learning.
Based on this, an unmanned aerial vehicle cluster task planning method based on deep reinforcement learning is provided in the embodiment of the application. The method is described below as applied to terminal equipment, with reference to the schematic diagram of the unmanned aerial vehicle cluster task planning method based on deep reinforcement learning shown in fig. 1.
In this embodiment of the present application, each unmanned aerial vehicle in the unmanned aerial vehicle cluster is regarded as a spherical intelligent agent with radius $r_{uav}$. The initial position of the i-th unmanned aerial vehicle is set to $P_i=[x_i,y_i,z_i]^T$, its initial speed is set to $V_i=[v_{i,x},v_{i,y},v_{i,z}]^T$, and its speed at the preset time is set to $V_i'=[v'_{i,x},v'_{i,y},v'_{i,z}]^T=V_i+a\Delta t$, wherein $v_{i,x}$, $v_{i,y}$ and $v_{i,z}$ are the x-, y- and z-axis components of the initial speed of the i-th unmanned aerial vehicle, $v'_{i,x}$, $v'_{i,y}$ and $v'_{i,z}$ are the x-, y- and z-axis components of the speed of the i-th unmanned aerial vehicle at the preset time, $a$ is the acceleration of the i-th unmanned aerial vehicle, and $\Delta t$ is the preset time. The speed of any unmanned aerial vehicle satisfies $V_i\le V_{max}$, where $V_{max}$ is the preset maximum speed of the flying object, and the y-axis component of any unmanned aerial vehicle position satisfies $h_{min}\le y_i\le h_{max}$, where $h_{min}$ and $h_{max}$ are the preset minimum and maximum heights of the flying object.
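As a minimal sketch of the kinematic model above (the constant-acceleration step and the stated speed and height limits), assuming illustrative values for the limits, which are not taken from the patent:

```python
import numpy as np

V_MAX = 10.0              # preset maximum speed V_max (assumed value)
H_MIN, H_MAX = 1.0, 50.0  # preset minimum/maximum flight heights h_min, h_max (assumed values)

def step_uav(position, velocity, acceleration, dt):
    """Advance one UAV by the preset time dt: V' = V + a*dt, then clamp speed and height."""
    new_velocity = velocity + acceleration * dt
    speed = np.linalg.norm(new_velocity)
    if speed > V_MAX:                       # enforce |V'| <= V_max
        new_velocity = new_velocity * (V_MAX / speed)
    new_position = position + new_velocity * dt
    new_position[1] = np.clip(new_position[1], H_MIN, H_MAX)  # y-axis component is the height
    return new_position, new_velocity
```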
One or more obstacles and a destination are also included in the task execution environment of the unmanned aerial vehicle cluster. Each obstacle is likewise regarded as a sphere with radius $r_{adv}$. The initial position of the obstacle is set to $P_k=[x_k,y_k,z_k]^T$, its initial speed is set to $V_k=[v_{k,x},v_{k,y},v_{k,z}]^T$, and its speed at the preset time is set to $V_k'=[v'_{k,x},v'_{k,y},v'_{k,z}]^T=V_k+a_k\Delta t$, wherein $v_{k,x}$, $v_{k,y}$ and $v_{k,z}$ are the x-, y- and z-axis components of the initial speed of the obstacle, $v'_{k,x}$, $v'_{k,y}$ and $v'_{k,z}$ are the x-, y- and z-axis components of the speed of the obstacle at the preset time, $a_k$ is the acceleration of the obstacle, and $\Delta t$ is the preset time. The speed of the obstacle satisfies $V_k\le V_{max}$, where $V_{max}$ is the preset maximum speed of the flying object, and the y-axis component of the obstacle position satisfies $h_{min}\le y_k\le h_{max}$, where $h_{min}$ and $h_{max}$ are the preset minimum and maximum heights of the flying object.
The location of the destination is set to $G=[x_g,y_g,z_g]^T$, and the radius of the destination is set to $r_{aim}$.
The collision distance between the i-th unmanned aerial vehicle and an obstacle is set to $D_{col}=r_{uav}+r_{adv}$; the i-th unmanned aerial vehicle is considered to have reached the target area when the distance between the unmanned aerial vehicle and the target satisfies $D_{aim}\le r_{uav}+r_{aim}$.
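A brief sketch of the collision and arrival tests implied by these distances; the function names are illustrative only:

```python
import numpy as np

def has_collided(p_uav, p_obs, r_uav, r_adv):
    """Collision test: distance to the obstacle is at most D_col = r_uav + r_adv."""
    return np.linalg.norm(np.asarray(p_uav) - np.asarray(p_obs)) <= r_uav + r_adv

def has_arrived(p_uav, goal, r_uav, r_aim):
    """Arrival test: distance to the destination satisfies D_aim <= r_uav + r_aim."""
    return np.linalg.norm(np.asarray(p_uav) - np.asarray(goal)) <= r_uav + r_aim
```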
In the embodiment of the application, unmanned aerial vehicle cluster task planning can be represented by a Markov game model $\langle N,S,A,\Gamma,R,O,\gamma\rangle$, wherein $N$ is the total number of unmanned aerial vehicles in the simulated task execution environment; $S$ is the joint local state of all unmanned aerial vehicles of the unmanned aerial vehicle cluster; $A$ is the joint action of all unmanned aerial vehicles of the cluster, $A=A_1\times A_2\times\cdots\times A_N$; $\Gamma$ is the probability that the unmanned aerial vehicle cluster transitions to the next state when taking a joint action in the current state, $\Gamma: S\times A_1\times A_2\times\cdots\times A_N\to S'$, where $S'$ is the next joint local state of all unmanned aerial vehicles of the cluster; $R$ is the joint reward of the unmanned aerial vehicles, composed of the reward values $r_i$ obtained by each unmanned aerial vehicle from interacting with the environment; $\gamma$ is the discount coefficient; and $O$ is the local state observed by each unmanned aerial vehicle.
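For illustration, the elements of this tuple could be grouped into a simple container; the field names below are assumptions, and the transition and reward functions would be supplied by the simulation environment:

```python
from dataclasses import dataclass
from typing import Callable, Sequence

@dataclass
class UAVMarkovGame:
    n_uavs: int             # N: total number of UAVs in the simulated environment
    gamma: float            # discount coefficient
    action_spaces: Sequence # A_1, ..., A_N: per-UAV action spaces (A is their product)
    observe: Callable       # O: maps the global state to each UAV's local state
    transition: Callable    # Gamma: (joint state, joint action) -> next joint state
    rewards: Callable       # r_1, ..., r_N: per-UAV rewards from environment interaction
```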
Based on the physical model and the motion model of the unmanned aerial vehicle, the obstacle and the destination, the unmanned aerial vehicle cluster task planning method based on the deep reinforcement learning in the embodiment of the application specifically comprises the following steps:
step S101: one unmanned aerial vehicle is selected as a first unmanned aerial vehicle in the unmanned aerial vehicle cluster at will, and other unmanned aerial vehicles are used as second unmanned aerial vehicles, and the second unmanned aerial vehicles form the rest unmanned aerial vehicle clusters.
In this embodiment, for convenience of explanation, the selected first unmanned aerial vehicle is denoted as the i-th unmanned aerial vehicle in the unmanned aerial vehicle cluster, a second unmanned aerial vehicle is denoted as the j-th unmanned aerial vehicle in the cluster, and the remaining unmanned aerial vehicle cluster is denoted as $d(i)$, with $j\in d(i)$.
Step S102: and acquiring an actual task execution environment of the first unmanned aerial vehicle and a task planning model of the unmanned aerial vehicle cluster.
The unmanned aerial vehicle cluster task planning model is obtained by training an improved MADDPG model with the simulated task execution environment as training samples; the improved MADDPG model comprises a MADDPG network and an average field theory module, wherein the average field theory module is arranged in the MADDPG network, and the framework of the MADDPG network is shown in fig. 2. The MADDPG network is a multi-agent deep deterministic policy gradient network that adopts a framework of centralized training and decentralized execution for multi-agent planning.
Step S103: and inputting the actual task execution environment into an unmanned aerial vehicle cluster task planning model to obtain the task plan of the unmanned aerial vehicle cluster.
Specifically, the local states of all unmanned aerial vehicles in the actual task execution environment are input into the unmanned aerial vehicle cluster task planning model to obtain the task plan of the unmanned aerial vehicle cluster in the actual task execution environment.
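A minimal sketch of this decentralized execution, in which each unmanned aerial vehicle feeds only its own observed local state to its trained policy (the function and variable names are illustrative, not from the patent):

```python
def plan_cluster_tasks(policy_nets, local_observations):
    """Decentralized execution: each UAV's action depends only on its own local state."""
    return [policy(obs) for policy, obs in zip(policy_nets, local_observations)]
```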
With reference to fig. 3, the unmanned aerial vehicle cluster task planning model used in steps S102-S103 is explained further below.

The unmanned aerial vehicle cluster task planning model is obtained by training an improved MADDPG model with the simulated task execution environment as training samples, wherein the improved MADDPG model comprises a MADDPG network and an average field theory module; further, the MADDPG network comprises a policy network and an evaluation network connected in sequence, and the average field theory module is nested in the evaluation network.
The simulated task execution environment can be built with the OpenAI Gym simulation platform or the Universe simulation platform.
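Before the training steps below, the following is a minimal PyTorch sketch of the two networks just described: the policy (actor) network, and the evaluation (critic) network that additionally receives the mean-field quantities as inputs. Layer sizes and activations are assumptions, not taken from the patent.

```python
import torch
import torch.nn as nn

class PolicyNet(nn.Module):
    """Policy network: maps a UAV's local observation o_i to an action a_i."""
    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, act_dim), nn.Tanh(),
        )

    def forward(self, obs):
        return self.net(obs)

class MeanFieldCritic(nn.Module):
    """Evaluation network: scores (state space s_i, own action a_i, mean action of the rest)."""
    def __init__(self, state_dim, act_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + 2 * act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state, action, mean_action):
        return self.net(torch.cat([state, action, mean_action], dim=-1))
```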
The unmanned aerial vehicle cluster task planning model is established through the following steps:

Step S201: obtaining a training sample, wherein the training sample comprises the simulation state space $s$ of the unmanned aerial vehicle cluster at the current moment, the simulation state space $s'$ of the unmanned aerial vehicle cluster at the next moment, the simulated rewards $r$ of all unmanned aerial vehicles in the cluster, the simulated motion vectors $a$ of all unmanned aerial vehicles in the cluster, and the average motion vectors $\bar{a}$ of the remaining unmanned aerial vehicle clusters.

Each training sample may be denoted as $(s,s',r,a,\bar{a})$. The simulation state space of the cluster at the current moment comprises the simulated local state of each unmanned aerial vehicle at the current moment and is recorded as $s=(o_{t,1},o_{t,2},\ldots,o_{t,N})$, where $o_{t,i}$ is the simulated local state of the i-th unmanned aerial vehicle at the current moment and $N$ is the number of unmanned aerial vehicles in the cluster. The simulation state space of the cluster at the next moment comprises the simulated local state of each unmanned aerial vehicle at the next moment and is recorded as $s'=(o_{t',1},o_{t',2},\ldots,o_{t',N})$, where $o_{t',i}$ is the simulated local state of the i-th unmanned aerial vehicle at the next moment. The simulated rewards of all unmanned aerial vehicles comprise the simulated reward value of each unmanned aerial vehicle at the current moment and are recorded as $r=(r_{t,1},r_{t,2},\ldots,r_{t,N})$, where $r_{t,i}$ is the simulated reward value of the i-th unmanned aerial vehicle at the current moment. The simulated motion vectors of all unmanned aerial vehicles comprise the simulated motion vector of each unmanned aerial vehicle at the current moment and are recorded as $a=(a_{t,1},a_{t,2},\ldots,a_{t,N})$, where $a_{t,i}$ is the simulated motion vector of the i-th unmanned aerial vehicle at the current moment. The average motion vectors of the remaining unmanned aerial vehicle clusters comprise, for each unmanned aerial vehicle, the average motion vector of its corresponding remaining cluster at the current moment and are recorded as $\bar{a}=(\bar{a}_{t,1},\bar{a}_{t,2},\ldots,\bar{a}_{t,N})$, where $\bar{a}_{t,i}$ is the average motion vector of the remaining cluster corresponding to the i-th unmanned aerial vehicle at the current moment.
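A sketch of how such training samples could be stored in an experience pool (the experience pool itself is mentioned later in the description); the field names are illustrative assumptions:

```python
import random
from collections import deque, namedtuple

# One sample: (s, s', r, a, a_bar) as described above, stored per cluster step.
Transition = namedtuple("Transition", ["s", "s_next", "r", "a", "a_mean"])

class ReplayBuffer:
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, *fields):
        self.buffer.append(Transition(*fields))

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)
```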
Specifically, training the unmanned aerial vehicle cluster task planning model with the training samples comprises the following steps:

Step S202: obtaining the simulated local states $s=(o_{t,1},o_{t,2},\ldots,o_{t,N})$ of all unmanned aerial vehicles in the unmanned aerial vehicle cluster at the current moment, wherein the simulated local state of each unmanned aerial vehicle is the local state that the unmanned aerial vehicle can observe in the simulation.

Step S203: calculating the simulated motion vector $a_{t,i}$ of each unmanned aerial vehicle from its simulated local state $o_{t,i}$ at the current moment, which specifically comprises:

inputting the simulated local state $o_{t,i}$ of the first unmanned aerial vehicle in the unmanned aerial vehicle cluster at the current moment into the policy network $\mu_i$ to obtain the simulated intermediate motion vector $\mu_i(o_{t,i})$ of the first unmanned aerial vehicle at the current moment;

superposing the simulated intermediate motion vector of the first unmanned aerial vehicle at the current moment with a noise vector to obtain the simulated motion vector of the first unmanned aerial vehicle at the current moment, expressed as $a_{t,i}=\mu_i(o_{t,i})+P$, where $P$ is the noise vector; introducing the noise vector increases the exploratory nature of the policy function.
The simulated motion vectors of the second unmanned aerial vehicles in the unmanned aerial vehicle cluster can be calculated with the same expression, so the description is not repeated here.

Step S204: inputting the simulated local state $o_{t,j}$ of each second unmanned aerial vehicle in the unmanned aerial vehicle cluster at the current moment and the simulated motion vector $a_{t,j}$ of each second unmanned aerial vehicle into the average field theory module to obtain the simulated local state average value of the remaining unmanned aerial vehicle cluster at the current moment and the simulated average motion vector of the remaining unmanned aerial vehicle cluster.

Step S204 of calculating the simulated local state average value of the remaining unmanned aerial vehicle cluster and the simulated average motion vector of the remaining unmanned aerial vehicle cluster at the current moment includes steps S301-S302:

Step S301: superposing the simulated local states of the second unmanned aerial vehicles in the cluster at the current moment and calculating the average value to obtain the simulated local state average value of the remaining unmanned aerial vehicle cluster at the current moment; the specific expression is $\bar{o}_{t,i}=\frac{1}{|d(i)|}\sum_{j\in d(i)}o_{t,j}$, wherein $\bar{o}_{t,i}$ is the simulated local state average value of the remaining cluster at the current moment, $|d(i)|$ is the number of unmanned aerial vehicles in the remaining cluster, and $d(i)$ is the remaining unmanned aerial vehicle cluster corresponding to the i-th unmanned aerial vehicle.

Step S302: superposing the simulated motion vectors of the second unmanned aerial vehicles in the cluster and calculating the average value to obtain the simulated average motion vector of the remaining unmanned aerial vehicle cluster; the specific expression is $\bar{a}_{t,i}=\frac{1}{|d(i)|}\sum_{j\in d(i)}a_{t,j}$, wherein $\bar{a}_{t,i}$ is the simulated average motion vector of the remaining cluster at the current moment.

Step S205: combining the simulated local state of the first unmanned aerial vehicle at the current moment with the simulated local state average value of the remaining cluster at the current moment to obtain the simulation state space of the unmanned aerial vehicle cluster at the current moment, recorded as $s_{t,i}=(o_{t,i},\bar{o}_{t,i})$, wherein $s_{t,i}$ is the simulation state space of the cluster at the current moment.

Step S206: interacting the simulated motion vector of each unmanned aerial vehicle in the unmanned aerial vehicle cluster at the current moment with the simulated task execution environment to obtain the simulated reward of each unmanned aerial vehicle in the cluster at the current moment and the simulated local state of each unmanned aerial vehicle in the cluster at the next moment.

Step S207: calculating the simulated motion vector $a'_{t,i}$ of each unmanned aerial vehicle from its simulated local state $o_{t',i}$ at the next moment, which specifically comprises:

inputting the simulated local state $o_{t',i}$ of the first unmanned aerial vehicle in the unmanned aerial vehicle cluster at the next moment into the policy network $\mu_i$ to obtain the simulated intermediate motion vector $\mu_i(o_{t',i})$ of the first unmanned aerial vehicle at the next moment;

superposing the simulated intermediate motion vector of the first unmanned aerial vehicle at the next moment with the noise vector to obtain the simulated motion vector of the first unmanned aerial vehicle at the next moment, expressed as $a'_{t,i}=\mu_i(o_{t',i})+P$, where $P$ is the noise vector; introducing the noise vector increases the exploratory nature of the policy function.
The simulated motion vectors of the second unmanned aerial vehicles in the unmanned aerial vehicle cluster at the next moment can also be calculated with the same expression, so the description is not repeated here.
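A minimal sketch of the per-UAV motion vector computation $a_{t,i}=\mu_i(o_{t,i})+P$ from steps S203 and S207, together with the averaging over the remaining cluster $d(i)$ performed by the average field theory module in steps S301-S302 (and S303-S304 below). The Gaussian form of the noise vector $P$ and the clamping range are assumptions for illustration:

```python
import numpy as np
import torch

def select_action(policy_net, obs, noise_std=0.1):
    """a_{t,i} = mu_i(o_{t,i}) + P, with P drawn as Gaussian exploration noise (assumed form)."""
    with torch.no_grad():
        action = policy_net(torch.as_tensor(obs, dtype=torch.float32))
    noise = noise_std * torch.randn_like(action)      # noise vector P increases exploration
    return (action + noise).clamp(-1.0, 1.0).numpy()

def mean_field(neighbor_obs, neighbor_actions):
    """Average the local states and motion vectors over the remaining cluster d(i)."""
    o_mean = np.mean(np.stack(neighbor_obs), axis=0)      # simulated local state average value
    a_mean = np.mean(np.stack(neighbor_actions), axis=0)  # simulated average motion vector
    return o_mean, a_mean
```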
Step S208: inputting the simulated local state $o_{t',j}$ of each second unmanned aerial vehicle in the unmanned aerial vehicle cluster at the next moment and the simulated motion vector $a'_{t,j}$ of each second unmanned aerial vehicle at the next moment into the average field theory module to obtain the simulated local state average value of the remaining unmanned aerial vehicle cluster at the next moment and the simulated average motion vector of the remaining unmanned aerial vehicle cluster.

Step S208 of calculating the simulated local state average value of the remaining unmanned aerial vehicle cluster and the simulated average motion vector of the remaining unmanned aerial vehicle cluster at the next moment includes steps S303-S304:

Step S303: superposing the simulated local states of the second unmanned aerial vehicles in the cluster at the next moment and calculating the average value to obtain the simulated local state average value of the remaining unmanned aerial vehicle cluster at the next moment; the specific expression is $\bar{o}_{t',i}=\frac{1}{|d(i)|}\sum_{j\in d(i)}o_{t',j}$, wherein $\bar{o}_{t',i}$ is the simulated local state average value of the remaining cluster at the next moment, $|d(i)|$ is the number of unmanned aerial vehicles in the remaining cluster, and $d(i)$ is the remaining unmanned aerial vehicle cluster corresponding to the i-th unmanned aerial vehicle.

Step S304: superposing the simulated motion vectors of the second unmanned aerial vehicles in the cluster at the next moment and calculating the average value to obtain the simulated average motion vector of the remaining unmanned aerial vehicle cluster; the specific expression is $\bar{a}'_{t,i}=\frac{1}{|d(i)|}\sum_{j\in d(i)}a'_{t,j}$, wherein $\bar{a}'_{t,i}$ is the simulated average motion vector of the remaining cluster at the next moment.

Step S209: combining the simulated local state of the first unmanned aerial vehicle at the next moment with the simulated local state average value of the remaining cluster at the next moment to obtain the simulation state space of the unmanned aerial vehicle cluster at the next moment, recorded as $s_{t',i}=(o_{t',i},\bar{o}_{t',i})$, wherein $s_{t',i}$ is the simulation state space of the cluster at the next moment.
Step S210: and after the simulation state space of the unmanned aerial vehicle cluster at the next moment, the simulation motion vector of the first unmanned aerial vehicle at the next moment and the simulation average motion vector of the rest unmanned aerial vehicle clusters at the next moment are input into the evaluation network, the simulation rewards of the first unmanned aerial vehicle in the unmanned aerial vehicle cluster at the current moment are overlapped, and the evaluation value of the evaluation network is obtained.
Specifically, the step S210 of calculating the evaluation value of the evaluation network includes steps S401 to S403:
step S401: and inputting the simulation state space of the unmanned aerial vehicle cluster at the next moment, the simulation motion vector of the first unmanned aerial vehicle at the next moment and the simulation average motion vector of the rest unmanned aerial vehicle clusters at the next moment into an evaluation network to obtain an evaluation network motion value at the next moment.
Step S402: and multiplying the evaluation network action value at the next moment by the discount coefficient to obtain an intermediate evaluation network action value.
Step S403: superposing the intermediate evaluation network action value with the simulated reward of the unmanned aerial vehicle at the current moment to obtain the evaluation value of the evaluation network. The specific expression of the evaluation value of the evaluation network is $y=r_{t,i}+\gamma Q_i'(s_{t',i},a'_{t,i},\bar{a}'_{t,i})$, wherein $y$ is the evaluation value of the evaluation network, $r_{t,i}$ is the simulated reward of the first unmanned aerial vehicle in the unmanned aerial vehicle cluster at the current moment, $\gamma$ is the discount coefficient, and $Q_i'(s_{t',i},a'_{t,i},\bar{a}'_{t,i})$ is the evaluation network action value at the next moment.
Step S211: and inputting the simulation state space of the unmanned aerial vehicle cluster at the current moment, the simulation motion vector of the first unmanned aerial vehicle at the current moment and the simulation average motion vector of the rest unmanned aerial vehicle clusters at the current moment into an evaluation network to obtain the motion value of the evaluation network at the current moment.
Step S212: and obtaining a loss function of the evaluation network according to the evaluation value of the evaluation network and the action value of the evaluation network at the current moment.
Step S212 of calculating a loss function of the evaluation network includes steps S501 to S502:
step S501: performing difference processing on the evaluation value of the evaluation network and the action value of the evaluation network at the current moment to obtain a loss error of the evaluation network
Step S502: calculating the average value of the squared loss errors according to the number of training samples to obtain the loss function of the evaluation network. The specific expression of the loss function of the evaluation network is $L(\theta_i)=\frac{1}{M}\sum\left(y-Q_i(s_{t,i},a_{t,i},\bar{a}_{t,i})\right)^2$, wherein $L(\theta_i)$ is the loss function of the evaluation network, $M$ is the number of training samples, and $Q_i(s_{t,i},a_{t,i},\bar{a}_{t,i})$ is the action value of the evaluation network at the current moment.
Step S213: and obtaining the evaluation network parameters through the loss function.
Step S214: and updating the strategy network gradient according to the evaluation network parameters to obtain strategy network parameters.
Step S214 of calculating policy network parameters includes steps S601-S602:
step S601: obtaining the current strategy network strategy function gradient according to the evaluation network parameters;
step S602: and carrying out product operation on the current strategy network strategy function gradient, the current evaluation network action value function gradient and the action value of the current moment evaluation network, and calculating a mean value according to the number of training samples to obtain the strategy gradient of the current strategy network parameters. The specific expression is: wherein->As a policy gradient of the policy network parameters,gradient of policy function for policy network, +.>And (5) evaluating the network action value function gradient for the current time.
Step S215: and updating the strategy network parameters according to the scale coefficients, and updating the strategy network and the evaluation network according to the updated strategy network coefficients until the updating times are reached, so as to obtain the unmanned aerial vehicle cluster task planning model.
The policy network parameters are specifically updated as $\theta_i'=\tau\theta_i+(1-\tau)\theta_i'$, wherein $\theta_i'$ denotes the updated policy network parameters, $\theta_i$ denotes the policy network parameters before the update, and $\tau$ is the scale coefficient of the current policy network parameters.
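A condensed PyTorch sketch of one training update following steps S210-S215: the target (evaluation) value, the evaluation-network loss, the policy-gradient update, and the soft parameter update with scale coefficient tau. The use of separate target networks follows the double-network structure mentioned later in the description; the batch layout, optimizers, and hyperparameter values are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def train_step(actor, critic, target_actor, target_critic,
               actor_opt, critic_opt, batch, gamma=0.95, tau=0.01):
    """One update for UAV i on a sampled batch (assumed field order shown below)."""
    s, s_next, r, a, a_mean, o, o_next, a_mean_next = batch

    # Evaluation value y = r + gamma * Q'(s', a', a_mean') with a' = mu'(o')   (steps S401-S403)
    with torch.no_grad():
        a_next = target_actor(o_next)
        y = r + gamma * target_critic(s_next, a_next, a_mean_next)

    # Evaluation-network loss L = (1/M) * sum (y - Q(s, a, a_mean))^2          (steps S501-S502)
    critic_loss = F.mse_loss(critic(s, a, a_mean), y)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Policy gradient: ascend Q(s, mu(o), a_mean) with respect to the policy   (steps S601-S602)
    actor_loss = -critic(s, actor(o), a_mean).mean()
    actor_opt.zero_grad()
    actor_loss.backward()   # autograd chains grad_a Q and grad_theta mu
    actor_opt.step()

    # Soft update theta' <- tau*theta + (1-tau)*theta' for both networks       (step S215)
    for net, target in ((actor, target_actor), (critic, target_critic)):
        for p, tp in zip(net.parameters(), target.parameters()):
            tp.data.copy_(tau * p.data + (1.0 - tau) * tp.data)

    return critic_loss.item(), actor_loss.item()
```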
In addition, calculating the simulated reward of the first drone in step 210 includes the steps of:
the simulated reward of the first unmanned aerial vehicle comprises a collision reward r of the first unmanned aerial vehicle and an obstacle c Arrival reward r for arrival of first unmanned aerial vehicle at destination g Action rewards r for the first unmanned aerial vehicle to execute an action s 。
The collision reward $r_c$ takes the value $r_{col}=-5$ when the distance $D$ between the unmanned aerial vehicle and the obstacle is not greater than the collision distance $D_{col}$ between the unmanned aerial vehicle and the obstacle.
The arrival reward $r_g$ takes the value $r_{arr}=10$ when the unmanned aerial vehicle reaches the destination, wherein $\varepsilon=1.1$ is the guiding coefficient of the unmanned aerial vehicle approaching the destination, $p_i$ is the position of the i-th unmanned aerial vehicle, $g$ is the destination, $D_{aim}$ is the distance between the unmanned aerial vehicle and the destination, $r_{uav}$ is the radius of the unmanned aerial vehicle, and $r_{aim}$ is the radius of the destination range.
The specific expression of the action reward $r_s$ is $r_s=-3$.
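A sketch combining the three reward terms; the exact piecewise structure outside the collision and arrival regions, including the use of the guiding coefficient as a distance penalty, is an assumption where the original expressions were not fully recoverable:

```python
import numpy as np

R_COL, R_ARR, R_STEP = -5.0, 10.0, -3.0   # collision, arrival, per-action rewards
EPSILON = 1.1                             # guiding coefficient toward the destination

def uav_reward(p_i, p_obs, goal, r_uav, r_adv, r_aim):
    reward = R_STEP                                        # action reward r_s
    if np.linalg.norm(p_i - p_obs) <= r_uav + r_adv:       # collision reward r_c
        reward += R_COL
    dist_goal = np.linalg.norm(p_i - goal)
    if dist_goal <= r_uav + r_aim:                         # arrival reward r_g
        reward += R_ARR
    else:
        reward -= EPSILON * dist_goal                      # assumed guiding term
    return reward
```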
According to the embodiment of the application, the average field theory module and the MADDPG network are combined into the unmanned aerial vehicle cluster task planning model. This addresses the problem that, in existing multi-agent reinforcement learning algorithms, the set of observation values obtained by every agent is used as the state value, so the dimension of the state value grows exponentially when the number of agents is large; the average field theory module reduces the dimension of the state value during model training and accelerates convergence of the training process. The MADDPG network adopts the principle of centralized training and distributed decision making, so that the unmanned aerial vehicle can still make task decisions efficiently when the environment is unknown, and the convergence rate is further improved by adopting an experience pool and a double-network structure.
In addition, because a large amount of data is used to update and adjust the policy network and the evaluation network during training, the final unmanned aerial vehicle cluster task planning model can achieve globally optimal planning. The unmanned aerial vehicle cluster adopts centralized training and distributed execution in an unknown dynamic three-dimensional environment: in the training environment the unmanned aerial vehicles communicate with each other and learn cooperation strategies, while in the actual task execution environment each unmanned aerial vehicle makes decisions relying only on the local state it observes, without communication, which greatly shortens decision time.
In order to better show the technical effects of the unmanned aerial vehicle cluster task planning method based on deep reinforcement learning, figs. 4a-6c show the simulated test environments and task planning results of the unmanned aerial vehicle cluster when executing cooperative communication and physical spoofing tasks. The simulation results show that the method achieves good results in cooperative communication and physical spoofing within unmanned aerial vehicle clusters. To show more intuitively that the method outperforms the existing deep learning method, figs. 7a and 7b show that, when executing the cooperative communication task, the unmanned aerial vehicle cluster task planning method based on deep reinforcement learning reaches a higher reward value during operation, and its task execution success rate is far higher than that of the existing deep learning method.
It should be understood that, although the steps in the flowchart of fig. 1 are shown in order as indicated by the arrow, the steps are not necessarily performed in order as indicated by the arrow. The steps are not strictly limited to the order of execution unless explicitly recited herein, and the steps may be executed in other orders. Moreover, at least a portion of the steps of FIG. 1 may include multiple sub-steps or sub-stages that are not necessarily performed at the same time, but may be performed at different times, nor does the order in which the sub-steps or stages are performed necessarily occur sequentially, but may be performed alternately or alternately with at least a portion of the other steps or sub-steps of other steps.
The embodiment of the invention discloses an unmanned aerial vehicle cluster task planning method based on deep reinforcement learning, which can be implemented with various types of equipment, so the invention also discloses an unmanned aerial vehicle cluster task planning device based on deep reinforcement learning corresponding to the method. A specific embodiment is described in detail below with reference to fig. 8.
The unmanned aerial vehicle selection module 701 is configured to randomly select one unmanned aerial vehicle as a first unmanned aerial vehicle in the unmanned aerial vehicle cluster, and other unmanned aerial vehicles as second unmanned aerial vehicles, where the second unmanned aerial vehicle forms a remaining unmanned aerial vehicle cluster.
The parameter obtaining module 702 is configured to obtain an actual task execution environment of the first unmanned aerial vehicle and a task planning model of the unmanned aerial vehicle cluster.
And the task planning module 703 is configured to input the actual task execution environment to a task planning model of the unmanned aerial vehicle cluster, so as to obtain a task plan of the unmanned aerial vehicle cluster.
The unmanned aerial vehicle cluster task planning model is obtained by training an improved MADDPG model with the simulated task execution environment as training samples; the improved MADDPG model comprises a MADDPG network and an average field theory module, wherein the average field theory module is arranged in the MADDPG network.
For specific limitations of the unmanned aerial vehicle cluster task planning device based on deep reinforcement learning, reference may be made to the above limitation of the unmanned aerial vehicle cluster task planning method based on deep reinforcement learning, and the description thereof will not be repeated here. Each of the modules in the above-described apparatus may be implemented in whole or in part by software, hardware, and combinations thereof. The above modules may be embedded in hardware or may be stored in a processor of the terminal device, or may be stored in software in a memory of the terminal device, so that the processor invokes and executes operations corresponding to the above modules.
In one embodiment, the present invention further provides a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the unmanned aerial vehicle cluster task planning method based on deep reinforcement learning described above.
The computer readable storage medium may be an electronic memory such as a flash memory, an EEPROM (electrically erasable programmable read-only memory), an EPROM (erasable programmable read-only memory), a hard disk, or a ROM. Optionally, the computer readable storage medium comprises a non-transitory computer-readable storage medium. The computer readable storage medium has storage space for program code to perform any of the method steps described above. The program code can be read from or written to one or more computer program products and can be compressed in a suitable form.
In one embodiment, the invention provides a computer device, which comprises a memory and a processor, wherein the memory stores a computer program, and the processor executes the unmanned aerial vehicle cluster task planning method based on deep reinforcement learning when executing the computer program.
The computer device includes a memory, a processor, and one or more computer programs, wherein the one or more computer programs are storable in the memory and configured to be executed by the one or more processors, and one or more application programs configured to perform the deep reinforcement learning-based unmanned aerial vehicle cluster mission planning method described above.
The processor may include one or more processing cores. The processor uses various interfaces and lines to connect the various portions of the overall computer device, and performs the various functions of the computer device and processes data by running or executing the instructions, programs, code sets, or instruction sets stored in the memory and invoking the data stored in the memory. Optionally, the processor may be implemented in hardware in at least one of digital signal processing (Digital Signal Processing, DSP), field programmable gate array (Field-Programmable Gate Array, FPGA), and programmable logic array (Programmable Logic Array, PLA). The processor may integrate one or a combination of a central processing unit (Central Processing Unit, CPU), a graphics processing unit (Graphics Processing Unit, GPU), and a modem. The CPU mainly handles the operating system, the user interface, application programs, and the like; the GPU is responsible for rendering and drawing display content; the modem is used to handle wireless communications. It will be appreciated that the modem may not be integrated into the processor and may instead be implemented by a separate communication chip.
The memory may include random access memory (Random Access Memory, RAM) or read-only memory (Read-Only Memory, ROM). The memory may be used to store instructions, programs, code sets, or instruction sets. The memory may include a stored-program area and a stored-data area, wherein the stored-program area may store instructions for implementing an operating system, instructions for implementing at least one function (such as a touch function, a sound playing function, an image playing function, etc.), instructions for implementing the various method embodiments described above, and the like. The stored-data area may also store data created by the terminal device during use, and the like.
The above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.
Claims (10)
1. The unmanned aerial vehicle cluster task planning method based on deep reinforcement learning is characterized by comprising the following steps of:
randomly selecting one unmanned aerial vehicle from the unmanned aerial vehicle clusters as a first unmanned aerial vehicle, and selecting other unmanned aerial vehicles as second unmanned aerial vehicles, wherein the second unmanned aerial vehicles form the rest unmanned aerial vehicle clusters;
acquiring an actual task execution environment of a first unmanned aerial vehicle and a task planning model of an unmanned aerial vehicle cluster;
inputting the actual task execution environment into an unmanned aerial vehicle cluster task planning model to obtain a task plan of an unmanned aerial vehicle cluster;
the unmanned aerial vehicle cluster task planning model is obtained by learning and training an improved MADDPG model by taking a simulated task execution environment as a training sample; the improved MADDPG model comprises a MADDPG network and an average field theory module, wherein the average field theory module is arranged in the MADDPG network.
2. The unmanned aerial vehicle cluster task planning method based on deep reinforcement learning of claim 1, wherein the improved MADDPG model comprises a policy network and an evaluation network connected in sequence, and the average field theory module is nested in the evaluation network;
the unmanned aerial vehicle cluster task planning model establishment comprises the following steps:
obtaining a training sample, wherein the training sample comprises a simulation state space of an unmanned aerial vehicle cluster at the current moment, a simulation state space of an unmanned aerial vehicle cluster at the next moment, simulation rewards of all unmanned aerial vehicles in the unmanned aerial vehicle cluster, simulation action vectors of all unmanned aerial vehicles in the unmanned aerial vehicle cluster and average action vectors of the rest unmanned aerial vehicle clusters in the unmanned aerial vehicle cluster;
inputting the simulated local state of the first unmanned aerial vehicle in the unmanned aerial vehicle cluster at the current moment into a strategy network to obtain a simulated intermediate motion vector of the first unmanned aerial vehicle at the current moment;
superposing the simulated intermediate motion vector and the noise vector of the first unmanned aerial vehicle at the current moment to obtain a simulated motion vector of the first unmanned aerial vehicle at the current moment;
respectively inputting the simulated local state of each second unmanned aerial vehicle in the unmanned aerial vehicle cluster at the current moment and the simulated motion vector of each second unmanned aerial vehicle into an average field theory module to obtain the simulated local state average value of the rest unmanned aerial vehicle cluster at the current moment and the simulated average motion vector of the rest unmanned aerial vehicle cluster;
combining the simulation local state of the first unmanned aerial vehicle at the current moment with the average value of the simulation local states of the rest unmanned aerial vehicle clusters at the current moment to obtain a simulation state space of the unmanned aerial vehicle clusters at the current moment;
interaction is carried out on the simulation action vectors of all unmanned aerial vehicles in the unmanned aerial vehicle cluster at the current moment and the simulation task execution environment, so that simulation rewards of all unmanned aerial vehicles in the unmanned aerial vehicle cluster at the current moment and simulation local states of all unmanned aerial vehicles in the unmanned aerial vehicle cluster at the next moment are obtained;
inputting the simulated local state of the first unmanned aerial vehicle at the next moment into a strategy network to obtain a simulated intermediate motion vector of the first unmanned aerial vehicle at the next moment;
superposing the simulated intermediate motion vector of the first unmanned aerial vehicle at the next moment with the noise vector to obtain a simulated motion vector of the first unmanned aerial vehicle at the next moment;
respectively inputting the simulated local state of each second unmanned aerial vehicle in the unmanned aerial vehicle cluster at the next moment and the simulated motion vector of each second unmanned aerial vehicle in the unmanned aerial vehicle cluster at the next moment into the average field theory module to obtain the simulated local state average value of the rest unmanned aerial vehicle clusters at the next moment and the simulated average motion vector of the rest unmanned aerial vehicle clusters;
combining the simulation local state of the first unmanned aerial vehicle at the next moment with the average value of the simulation local states of the rest unmanned aerial vehicle clusters at the next moment to obtain a simulation state space of the unmanned aerial vehicle clusters at the next moment;
after a simulation state space of the unmanned aerial vehicle cluster at the next moment, a simulation motion vector of the first unmanned aerial vehicle at the next moment and a simulation average motion vector of the remaining unmanned aerial vehicle clusters at the next moment are input into an evaluation network, a simulation reward of the first unmanned aerial vehicle in the unmanned aerial vehicle cluster at the current moment is overlapped, and an evaluation value of the evaluation network is obtained;
inputting the simulation state space of the unmanned aerial vehicle cluster at the current moment, the simulation motion vector of the first unmanned aerial vehicle at the current moment and the simulation average motion vector of the rest unmanned aerial vehicle clusters at the current moment into an evaluation network to obtain the motion value of the evaluation network at the current moment;
obtaining a loss function of the evaluation network according to the evaluation value of the evaluation network and the action value of the evaluation network at the current moment;
obtaining evaluation network parameters through the loss function;
updating the strategy network gradient according to the evaluation network parameters to obtain strategy network parameters;
and updating the strategy network parameters according to the scale coefficients, and updating the strategy network and the evaluation network according to the updated strategy network parameters until the updating times are reached, so as to obtain the unmanned aerial vehicle cluster task planning model.
3. The unmanned aerial vehicle cluster task planning method based on deep reinforcement learning according to claim 2, wherein the step of respectively inputting the simulated local state of each second unmanned aerial vehicle in the unmanned aerial vehicle cluster and the simulated action vector of each second unmanned aerial vehicle in the unmanned aerial vehicle cluster into the average field theory module to obtain the simulated local state average value of the remaining unmanned aerial vehicle cluster and the simulated average action vector of the remaining unmanned aerial vehicle cluster comprises:
superposing the simulated local states of the second unmanned aerial vehicles in the unmanned aerial vehicle cluster and calculating the average value to obtain the simulated local state average value of the remaining unmanned aerial vehicle cluster;
and superposing the simulated action vectors of the second unmanned aerial vehicles in the unmanned aerial vehicle cluster and calculating the average value to obtain the simulated average action vector of the remaining unmanned aerial vehicle cluster.
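A minimal sketch of the averaging performed by the average field theory module of claim 3, assuming the second-UAV quantities have been stacked into NumPy arrays (the array shapes are illustrative):

```python
import numpy as np

# Minimal sketch of the average field theory module of claim 3.
def mean_field(second_states: np.ndarray, second_actions: np.ndarray):
    """second_states: (N-1, state_dim); second_actions: (N-1, action_dim)."""
    state_mean = second_states.mean(axis=0)    # simulated local state average value
    action_mean = second_actions.mean(axis=0)  # simulated average action vector
    return state_mean, action_mean
```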
4. The unmanned aerial vehicle cluster task planning method based on deep reinforcement learning according to claim 2, wherein the simulated reward of the first unmanned aerial vehicle comprises:
a collision reward of the first unmanned aerial vehicle with an obstacle, an arrival reward of the first unmanned aerial vehicle reaching the destination, and an action reward of the first unmanned aerial vehicle executing an action.
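A hedged illustration of how the three reward terms of claim 4 could be combined; the numerical weights and the boolean/cost inputs below are assumptions for illustration, not values disclosed in the patent:

```python
# Illustrative combination of the three reward terms of claim 4; the weights
# are assumed, not taken from the patent.
def simulated_reward(collided: bool, reached_destination: bool, action_cost: float) -> float:
    r_collision = -10.0 if collided else 0.0           # collision reward (penalty) with an obstacle
    r_arrival = 10.0 if reached_destination else 0.0   # arrival reward at the destination
    r_action = -0.1 * action_cost                      # action reward for the executed action
    return r_collision + r_arrival + r_action
```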
5. The unmanned aerial vehicle cluster task planning method based on deep reinforcement learning according to claim 2, wherein inputting the simulated state space of the unmanned aerial vehicle cluster at the next moment, the simulated action vector of the first unmanned aerial vehicle at the next moment and the simulated average action vector of the remaining unmanned aerial vehicle cluster at the next moment into the evaluation network, and superposing the simulated reward of the first unmanned aerial vehicle in the unmanned aerial vehicle cluster at the current moment to obtain the evaluation value of the evaluation network, comprises:
inputting the simulated state space of the unmanned aerial vehicle cluster at the next moment, the simulated action vector of the first unmanned aerial vehicle at the next moment and the simulated average action vector of the remaining unmanned aerial vehicle cluster at the next moment into the evaluation network to obtain the action value of the evaluation network at the next moment;
multiplying the action value of the evaluation network at the next moment by a discount coefficient to obtain an intermediate evaluation network action value;
and superposing the intermediate evaluation network action value and the simulated reward of the first unmanned aerial vehicle at the current moment to obtain the evaluation value of the evaluation network.
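The evaluation value of claim 5 is the familiar temporal-difference target; a minimal sketch, with gamma as the assumed discount coefficient:

```python
# Minimal sketch of the evaluation value of claim 5: q_next is the action value
# of the evaluation network at the next moment, reward the first UAV's
# simulated reward at the current moment, gamma the discount coefficient.
def evaluation_value(reward: float, q_next: float, gamma: float = 0.95) -> float:
    intermediate = gamma * q_next  # intermediate evaluation network action value
    return reward + intermediate   # evaluation value of the evaluation network
```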
6. The unmanned aerial vehicle cluster task planning method based on deep reinforcement learning according to claim 2, wherein obtaining the loss function of the evaluation network according to the evaluation value of the evaluation network and the action value of the evaluation network at the current moment comprises:
taking the difference between the evaluation value of the evaluation network and the action value of the evaluation network at the current moment to obtain the loss error of the evaluation network;
and squaring the loss errors and averaging over the number of training samples to obtain the loss function of the evaluation network.
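Read this way, the loss of claim 6 is a mean squared error over the batch of training samples; a minimal NumPy sketch under that interpretation:

```python
import numpy as np

# Minimal sketch of the evaluation-network loss of claim 6, read as a mean
# squared error: y is the evaluation value and q the action value of the
# evaluation network at the current moment, both over a batch of samples.
def evaluation_loss(y: np.ndarray, q: np.ndarray) -> float:
    error = y - q                      # loss error of the evaluation network
    return float(np.mean(error ** 2))  # average of the squared errors over the samples
```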
7. The unmanned aerial vehicle cluster task planning method based on deep reinforcement learning according to claim 2, wherein updating the strategy network gradient according to the evaluation network parameters to obtain the strategy network parameters comprises:
obtaining the strategy function gradient of the current strategy network according to the evaluation network parameters;
and multiplying the strategy function gradient of the current strategy network, the action value function gradient of the current evaluation network and the action value of the evaluation network at the current moment, and calculating the mean value over the number of training samples to obtain the strategy gradient of the current strategy network parameters.
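In standard MADDPG/DDPG notation, the product-and-average step of claim 7 resembles the deterministic policy gradient below, where μ_θ is the strategy network, Q_φ the evaluation network, x_i the simulated state space, ā_i the simulated average action vector and N the number of training samples (the claim additionally multiplies by the current action value, which is not shown in this standard form):

```latex
\nabla_{\theta} J \approx \frac{1}{N}\sum_{i=1}^{N}
  \nabla_{\theta}\,\mu_{\theta}(s_i)\;
  \nabla_{a} Q_{\phi}\bigl(x_i, a, \bar{a}_i\bigr)\Big|_{a=\mu_{\theta}(s_i)}
```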
8. An unmanned aerial vehicle cluster task planning device based on deep reinforcement learning, characterized in that the device comprises:
an unmanned aerial vehicle selection module, used for randomly selecting one unmanned aerial vehicle from the unmanned aerial vehicle cluster as a first unmanned aerial vehicle and taking the other unmanned aerial vehicles as second unmanned aerial vehicles, wherein the second unmanned aerial vehicles form the remaining unmanned aerial vehicle cluster;
a parameter acquisition module, used for acquiring an actual task execution environment of the first unmanned aerial vehicle and an unmanned aerial vehicle cluster task planning model;
a task planning module, used for inputting the actual task execution environment into the unmanned aerial vehicle cluster task planning model to obtain the task plan of the unmanned aerial vehicle cluster;
wherein the unmanned aerial vehicle cluster task planning model is obtained by learning and training an improved MADDPG model with a simulated task execution environment as a training sample; the improved MADDPG model comprises a MADDPG network and an average field theory module, and the average field theory module is arranged in the MADDPG network.
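A minimal sketch of how the three modules of claim 8 could be composed in Python; every class, attribute and callable name here is hypothetical and used only to show how the modules fit together:

```python
# Hypothetical composition of the device of claim 8.
class UavClusterTaskPlanner:
    def __init__(self, uav_selector, parameter_acquirer, planning_model):
        self.uav_selector = uav_selector              # unmanned aerial vehicle selection module
        self.parameter_acquirer = parameter_acquirer  # parameter acquisition module
        self.planning_model = planning_model          # trained improved MADDPG model

    def plan(self, cluster):
        first_uav, second_uavs = self.uav_selector(cluster)  # pick the first UAV at random
        env = self.parameter_acquirer(first_uav)             # actual task execution environment
        return self.planning_model(env)                      # task plan of the cluster
```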
9. A computer readable storage medium having a computer program stored thereon, characterized in that the computer program, when executed by a processor, implements the steps of the unmanned aerial vehicle cluster task planning method based on deep reinforcement learning according to any one of claims 1-7.
10. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor, when executing the computer program, implements the steps of the unmanned aerial vehicle cluster task planning method based on deep reinforcement learning according to any one of claims 1-7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310006846.5A CN116301022A (en) | 2023-01-04 | 2023-01-04 | Unmanned aerial vehicle cluster task planning method and device based on deep reinforcement learning |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310006846.5A CN116301022A (en) | 2023-01-04 | 2023-01-04 | Unmanned aerial vehicle cluster task planning method and device based on deep reinforcement learning |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116301022A true CN116301022A (en) | 2023-06-23 |
Family
ID=86833097
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310006846.5A Pending CN116301022A (en) | 2023-01-04 | 2023-01-04 | Unmanned aerial vehicle cluster task planning method and device based on deep reinforcement learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116301022A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN118226889A (en) * | 2024-05-24 | 2024-06-21 | 北京数易科技有限公司 | Unmanned plane cluster multi-seat collaborative training method, system and medium |
Similar Documents
Publication | Title |
---|---|
Lei et al. | Dynamic path planning of unknown environment based on deep reinforcement learning |
CN110383298B | Data efficient reinforcement learning for continuous control tasks |
US11900244B1 | Attention-based deep reinforcement learning for autonomous agents |
CN112580795B | Neural network acquisition method and related equipment |
CN112015174A | Multi-AGV motion planning method, device and system |
CN111095170B | Virtual reality scene, interaction method thereof and terminal equipment |
CN112106073A | Performing navigation tasks using grid code |
CN114261400B | Automatic driving decision method, device, equipment and storage medium |
CN114514524A | Multi-agent simulation |
CN113561986A | Decision-making method and device for automatically driving automobile |
Chang et al. | Redirection controller using reinforcement learning |
CN114521262A | Controlling an agent using a causal correct environment model |
CN113052253B | Super-parameter determination method, device, deep reinforcement learning framework, medium and equipment |
CN116301022A (en) | Unmanned aerial vehicle cluster task planning method and device based on deep reinforcement learning |
WO2020183877A1 | Information processing device and information processing method |
Gao et al. | Asymmetric self-play-enabled intelligent heterogeneous multirobot catching system using deep multiagent reinforcement learning |
CN115648204A | Training method, device, equipment and storage medium of intelligent decision model |
Zhang et al. | An improved algorithm of robot path planning in complex environment based on Double DQN |
CN117606490B | Collaborative search path planning method for autonomous underwater vehicle |
CN117908565A | Unmanned aerial vehicle safety path planning method based on maximum entropy multi-agent reinforcement learning |
US20240161377A1 | Physics-based simulation of human characters in motion |
CN116339349A | Path planning method, path planning device, electronic equipment and storage medium |
Sidenko et al. | Machine Learning for Unmanned Aerial Vehicle Routing on Rough Terrain |
Li et al. | Efficiency-reinforced learning with auxiliary depth reconstruction for autonomous navigation of mobile devices |
CN115993783A | Method executed by intelligent device and related device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||