CN114706418A - Unmanned aerial vehicle fighting autonomous decision-making method based on deep reinforcement learning TD3 algorithm - Google Patents

Info

Publication number
CN114706418A
Authority
CN
China
Prior art keywords
unmanned aerial
aerial vehicle
enemy
drone
algorithm
Prior art date
Legal status
Pending
Application number
CN202210264539.2A
Other languages
Chinese (zh)
Inventor
高显忠
候中喜
金泉
王玉杰
邓小龙
Current Assignee
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date
Filing date
Publication date
Application filed by National University of Defense Technology
Priority to CN202210264539.2A
Publication of CN114706418A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G05 - CONTROLLING; REGULATING
    • G05D - SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D 1/00 - Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
    • G05D 1/60 - Intended control result
    • G05D 1/656 - Interaction with payloads or external entities
    • G05D 1/689 - Pointing payloads towards fixed or moving targets
    • G05D 1/10 - Simultaneous control of position or course in three dimensions
    • G05D 1/101 - Simultaneous control of position or course in three dimensions specially adapted for aircraft
    • G05D 1/40 - Control within particular dimensions
    • G05D 1/46 - Control of position or course in three dimensions
    • G05D 2101/00 - Details of software or hardware architectures used for the control of position
    • G05D 2101/10 - Details of software or hardware architectures used for the control of position using artificial intelligence [AI] techniques
    • G05D 2101/15 - Details of software or hardware architectures used for the control of position using artificial intelligence [AI] techniques using machine learning, e.g. neural networks
    • G05D 2105/00 - Specific applications of the controlled vehicles
    • G05D 2105/35 - Specific applications of the controlled vehicles for combat
    • G05D 2107/00 - Specific environments of the controlled vehicles
    • G05D 2107/30 - Off-road
    • G05D 2107/34 - Battlefields
    • G05D 2109/00 - Types of controlled vehicles
    • G05D 2109/20 - Aircraft, e.g. drones
    • G05D 2109/22 - Aircraft, e.g. drones with fixed wings

Landscapes

  • Engineering & Computer Science (AREA)
  • Aviation & Aerospace Engineering (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Remote Sensing (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Automation & Control Theory (AREA)
  • Control Of Position, Course, Altitude, Or Attitude Of Moving Bodies (AREA)

Abstract

The invention discloses an unmanned aerial vehicle (UAV) combat autonomous decision-making method based on the deep reinforcement learning TD3 algorithm. The method comprises: establishing a UAV motion model; establishing a UAV air-combat model based on a Markov decision process according to the UAV motion model, wherein the air-combat model is expressed as a quadruple comprising a state space, an action space, a reward function and a discount factor, and the UAV motion model serves as the state transfer function; and training the UAV to learn a maneuvering strategy based on the TD3 algorithm according to the UAV air-combat model. Because the TD3 algorithm alleviates the Q-value overestimation problem, the UAV learns a better maneuvering strategy and obtains a positional advantage in combat.

Description

Unmanned aerial vehicle fighting autonomous decision-making method based on deep reinforcement learning TD3 algorithm
Technical Field
The invention relates to the technical field of unmanned aerial vehicles, in particular to an unmanned aerial vehicle fighting autonomous decision-making method based on a deep reinforcement learning TD3 algorithm.
Background
Intelligent autonomous combat unmanned aerial vehicles (UAVs) and UAV swarms have great potential to change the battlefield. Maneuver decision-making is the core technology of UAV combat confrontation, and studying how a UAV can maneuver autonomously to gain an operational advantage according to the battlefield situation and the mission objective is of great significance.
When the UAV air-combat problem is studied with traditional mathematical methods such as differential game theory, an accurate mathematical model must be established, and the maneuvering strategies and performance parameters of both sides must be known before the problem can be posed qualitatively or quantitatively, which is impossible in reality. In future combat, information such as the enemy's strategic intent, tactics and equipment performance cannot be accurately predicted in advance, and the interference of various uncertain factors and the low detectability of targets in the combat environment further limit the applicability of such methods. In addition, the UAV dynamics model is complex and its state equations are nonlinear differential equations, so solving them is difficult, the amount of computation is huge, a large amount of computing resources is occupied, much time is consumed, and the curse of dimensionality arises when the number of UAVs on the two sides increases further.
Although the Deep Deterministic Policy Gradient (DDPG) algorithm is well suited to high-dimensional continuous action spaces, using it for deep reinforcement learning of a UAV in the air-combat environment can lead to overestimation of the Q value, so that the total reward obtained by the UAV stays low: when the Q value is overestimated, the policy selected by the UAV contains errors that grow larger and larger, so an effective policy cannot be found and a positional advantage cannot be obtained in combat.
Disclosure of Invention
To solve the problems in the prior art, the invention provides a UAV combat autonomous decision-making method based on the deep reinforcement learning TD3 algorithm, which alleviates the Q-value overestimation problem so that the UAV learns a better maneuvering strategy and obtains a positional advantage in combat.
In order to achieve the purpose, the technical scheme of the invention is realized as follows:
In a first aspect, the invention provides a UAV combat autonomous decision-making method based on the deep reinforcement learning TD3 algorithm, which comprises the following steps:
establishing a UAV motion model;
establishing a UAV air-combat model based on a Markov decision process according to the UAV motion model, wherein the air-combat model is represented by a quadruple comprising a state space, an action space, a reward function and a discount factor, and the UAV motion model serves as the state transfer function of the air-combat model;
and training the UAV to learn a maneuvering strategy based on the TD3 algorithm according to the UAV air-combat model.
Compared with the prior art, the first aspect of the invention has the following beneficial effects:
the method comprises the steps of establishing an unmanned aerial combat model of unmanned aerial vehicles fighting with my unmanned aerial vehicle and an enemy unmanned aerial vehicle based on a Markov decision process according to an unmanned aerial vehicle motion model, wherein the unmanned aerial combat model is represented by a quadruple including a state space, an action space, a reward function and a discount factor; according to the unmanned aerial vehicle air combat model, the unmanned aerial vehicle is trained to learn the maneuver strategy based on the TD3 algorithm, and the TD3 algorithm can solve the problem of Q value overestimation, so that the unmanned aerial vehicle learns a better maneuver strategy, and position advantages are obtained in battle.
Further, the unmanned aerial vehicle motion model comprises a dynamics model and a kinematics model, and establishing the unmanned aerial vehicle motion model comprises:
establishing a dynamic model of the unmanned aerial vehicle in an inertial coordinate system:
[Dynamics model: equations given as an image in the original publication]
where g denotes the gravitational acceleration; v denotes the speed of the UAV and satisfies the constraint v_min ≤ v ≤ v_max; the track inclination angle γ denotes the angle between v and the horizontal plane XOY, γ ∈ [-π/2, π/2]; the track deviation angle ψ denotes the angle between the projection of v onto the horizontal plane XOY and the X axis, ψ ∈ (-π, π]; n_τ denotes the tangential overload; n_f denotes the normal overload; and μ denotes the roll angle;
establishing a kinematic model of the unmanned aerial vehicle in the inertial coordinate system:
[Kinematics model: equations given as an image in the original publication]
where x, y and z denote the coordinates of the UAV in the inertial coordinate system.
Further, the state space comprises the own states of the enemy UAV and my UAV and their relative state.
Further, the state space is constructed by:
setting the own states of the enemy UAV and my UAV:
S = [x_r, y_r, z_r, x_b, y_b, z_b, v_r, v_b, γ_r, γ_b, ψ_r, ψ_b, μ_r, μ_b]
setting the relative state of the enemy UAV and my UAV based on their own states:
S_rb = [D, α, β, v_r, v_b, γ_r, γ_b, ψ_r, ψ_b, μ_r, μ_b]
where x_r, y_r, z_r denote the coordinates of my UAV in three-dimensional space, x_b, y_b, z_b denote the coordinates of the enemy UAV in three-dimensional space, v_r denotes the speed of my UAV, v_b the speed of the enemy UAV, γ_r the track inclination angle of my UAV, γ_b the track inclination angle of the enemy UAV, ψ_r the track deviation angle of my UAV, ψ_b the track deviation angle of the enemy UAV, μ_r the roll angle of my UAV, μ_b the roll angle of the enemy UAV, D the relative distance between the enemy UAV and my UAV, the horizontal line-of-sight angle α the angle between the X axis and the projection onto the horizontal plane of the line of sight between the two UAVs, and the longitudinal line-of-sight angle β the angle between that line of sight and the horizontal plane.
Further, the action space is constructed by the following formula:
A = [n_τ, n_f, ω]
where n_τ denotes the tangential overload, n_f the normal overload, and ω the body roll rate.
Further, the reward functions include a lock reward function, an angle advantage function, a distance advantage function, a height advantage function, and a speed advantage function, wherein the lock reward function is:
[Lock reward function r_lock: given as an image in the original publication]
where D* denotes the minimum distance between the two UAVs at which my UAV successfully locks the enemy UAV, p* denotes the maximum angle by which the velocity direction of my UAV may deviate from the direction toward the enemy UAV's centre of mass while the lock is still satisfied, e* denotes the maximum angle by which the velocity direction of the enemy UAV may deviate from the line of sight from my UAV to its centre of mass while the lock is still satisfied, D denotes the relative distance between my UAV and the enemy UAV, p denotes the angle between the velocity direction of my UAV and the direction toward the enemy UAV's centre of mass, and e denotes the angle between the velocity direction of the enemy UAV and the line-of-sight vector from my UAV toward its centre of mass;
the angular merit function is:
[Equation: given as an image in the original publication]
the distance merit function is:
[Equation: given as an image in the original publication]
the height merit function is:
[Equation: given as an image in the original publication]
the speed merit function is:
[Equation: given as an image in the original publication]
where v_r denotes the speed of my UAV, v_b the speed of the enemy UAV, D_max the maximum detection distance of the UAV, Δh the height difference between the enemy UAV and my UAV, v_max the maximum flight speed of the UAV, and v_min the minimum flight speed of the UAV;
the single-step reward function of the UAV is:
R = r_lock + k_1 r_1 + k_2 r_2 + k_3 r_3 + k_4 r_4
where k_1 to k_4 are weight values whose sum is 1.
Further, the training of the unmanned aerial vehicle learning maneuver strategy based on the TD3 algorithm includes:
step S1, initializing the two evaluator networks Q_1 and Q_2 with parameters θ^{Q_1} and θ^{Q_2}, the actuator network with parameter θ^μ, the target-network parameters θ'^{Q_1}, θ'^{Q_2} and θ'^μ, and the experience pool;
step S2, presetting the number of rounds, and executing the following steps in each round:
step S2-1, presetting the maximum limit on the number of steps of the unmanned aerial vehicle in each round;
step S2-2, the unmanned aerial vehicle selects an action according to the current state and the policy, with random noise added;
step S2-3, the unmanned aerial vehicle executes the action, and the next state and the reward are obtained from the state transfer function;
step S2-4, storing the current state, the selected action, the reward and the next state obtained in steps S2-2 and S2-3 into the experience pool;
step S2-5, randomly extracting N samples from the experience pool;
step S2-6, calculating expected return of action through the two evaluator target networks, selecting a smaller Q value, and updating parameters of the evaluator networks;
step S2-7, updating the parameters of the actuator network through a deterministic strategy gradient;
step S2-8, after updating the parameters of the evaluator network and the parameters of the actuator network, updating the parameters of the target network;
step S2-9, ending the round when the number of steps reaches the maximum limit;
and step S3, finishing the training of the unmanned aerial vehicle learning maneuver strategy after all rounds are finished.
Further, calculating the expected return of the action through the two evaluator target networks, selecting the smaller Q value, and updating the parameters of the evaluator networks comprises:
learning and updating the network parameters of the evaluator, wherein the formula of the loss function L is as follows:
[Loss function L: given as an image in the original publication]
where s_i denotes the current state and a_i denotes the current action; the target expected value y_i is obtained from the current real reward value r_i plus the next-step output value multiplied by the discount factor λ, and is given by:
[Target expected value y_i: given as an image in the original publication]
where s_{i+1} denotes the next state.
Further, updating the parameters of the actuator network through the deterministic policy gradient comprises:
learning and updating actuator network parameters, wherein a deterministic strategy gradient formula of the actuator network is as follows:
[Deterministic policy gradient of the actuator network: given as an image in the original publication]
where N denotes the number of samples randomly drawn from the experience pool, Q(s, a | θ^Q) denotes the evaluator network with parameters θ^Q, and μ(s | θ^μ) denotes the actuator network with parameters θ^μ.
In a second aspect, the invention provides an unmanned aerial vehicle fighting autonomous decision making system based on a deep reinforcement learning TD3 algorithm, including:
the unmanned aerial vehicle motion model establishing unit is used for establishing an unmanned aerial vehicle motion model;
the unmanned aerial vehicle aerial combat model establishing unit is used for establishing an unmanned aerial vehicle aerial combat model based on a Markov decision process according to the unmanned aerial vehicle motion model, wherein the unmanned aerial vehicle aerial combat model is represented by a quadruple including a state space, an action space, a reward function and a discount factor;
and the unmanned aerial vehicle learning maneuvering strategy training unit is used for training the unmanned aerial vehicle learning maneuvering strategy based on a TD3 algorithm according to the unmanned aerial vehicle air combat model.
Compared with the prior art, the second aspect of the invention has the following beneficial effects:
the unmanned aerial vehicle aerial combat model establishing unit of the system establishes, based on a Markov decision process and according to the unmanned aerial vehicle motion model, a UAV air-combat model in which my UAV fights the enemy UAV; the model is represented by a quadruple comprising a state space, an action space, a reward function and a discount factor. The unmanned aerial vehicle learning maneuvering strategy training unit then trains the UAV to learn a maneuvering strategy based on the TD3 algorithm according to this air-combat model; because TD3 alleviates the Q-value overestimation problem, the UAV learns a better maneuvering strategy and obtains a positional advantage in combat.
Drawings
The above and/or additional aspects and advantages of the present invention will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
fig. 1 is a flowchart of an unmanned aerial vehicle combat autonomous decision method based on a deep reinforcement learning TD3 algorithm according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of an inertial frame provided in accordance with an embodiment of the present invention;
fig. 3 is a schematic structural diagram of a TD3 algorithm according to an embodiment of the present invention;
FIG. 4 is a diagram of simulation results based on TD3 algorithm training according to an embodiment of the present invention;
FIG. 5 is a diagram of simulation results comparing the TD3 algorithm with the training of DDPG algorithm according to an embodiment of the present invention;
fig. 6 is a diagram illustrating simulation results of a running trajectory when the TD3 algorithm converges according to an embodiment of the present invention;
FIG. 7 is a diagram of simulation results of dominant situations provided by an embodiment of the present invention;
FIG. 8 is a diagram illustrating simulation results of a disadvantage situation according to an embodiment of the present invention;
FIG. 9 is a diagram illustrating simulation results of an equilibrium situation according to an embodiment of the present invention;
fig. 10 is a structural diagram of an unmanned aerial vehicle combat autonomous decision making system based on a deep reinforcement learning TD3 algorithm according to an embodiment of the present invention.
Detailed Description
The technical solutions of the embodiments of the present disclosure will be described clearly and completely with reference to the accompanying drawings, and it is to be understood that the described embodiments are only a part of the embodiments of the present disclosure, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present disclosure without making any creative effort, shall fall within the protection scope of the present disclosure. It should be noted that the features of the embodiments and examples of the present disclosure may be combined with each other without conflict. In addition, the drawings are intended to supplement the description of the text part of the specification with figures so that one can visually and vividly understand each feature and the whole technical solution of the present disclosure, but they should not be construed as limiting the scope of the present disclosure.
In the description of the invention, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature. In the description of the present invention, "a plurality" means two or more unless otherwise specified.
In the description of the present invention, unless otherwise explicitly limited, terms such as arrangement, installation, connection and the like should be understood in a broad sense, and those skilled in the art can reasonably determine the specific meanings of the above terms in the present invention in combination with the specific contents of the technical solutions.
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
When the UAV air-combat problem is studied with traditional mathematical methods such as differential game theory, an accurate mathematical model must be established, and the maneuvering strategies and performance parameters of both sides must be known before the problem can be posed qualitatively or quantitatively, which is impossible in reality. In future combat, information such as the enemy's strategic intent, tactics and equipment performance cannot be accurately predicted in advance, and the interference of various uncertain factors and the low detectability of targets in the combat environment further limit the applicability of such methods. In addition, the UAV dynamics model is complex and its state equations are nonlinear differential equations, so solving them is difficult, the amount of computation is huge, a large amount of computing resources is occupied, much time is consumed, and the curse of dimensionality arises when the number of UAVs on the two sides increases further.
Although the Deep Deterministic Policy Gradient (DDPG) algorithm is well suited to high-dimensional continuous action spaces, using it for deep reinforcement learning of a UAV in the air-combat environment can lead to overestimation of the Q value, so that the total reward obtained by the UAV stays low: when the Q value is overestimated, the policy selected by the UAV contains errors that grow larger and larger, so an effective policy cannot be found and a positional advantage cannot be obtained in combat.
To solve the above problems, the invention establishes, based on a Markov decision process and according to the UAV motion model, an air-combat model in which my UAV fights the enemy UAV; the air-combat model is represented by a quadruple comprising a state space, an action space, a reward function and a discount factor. According to this air-combat model, the UAV is trained to learn a maneuvering strategy based on the Twin Delayed Deep Deterministic Policy Gradient (TD3) algorithm; because TD3 alleviates the Q-value overestimation problem, the UAV learns a better maneuvering strategy and obtains a positional advantage in combat.
Referring to fig. 1 to 9, an embodiment of the present invention provides an unmanned aerial vehicle combat autonomous decision method based on a deep reinforcement learning TD3 algorithm, including the steps of:
and S100, establishing an unmanned aerial vehicle motion model.
Specifically, an inertial coordinate system is established, referring to fig. 2: the positive direction of the X axis points due east, the positive direction of the Y axis points due north, and the positive direction of the Z axis points vertically upward, perpendicular to the ground. During flight the UAV is mainly subjected to engine thrust, gravity and aerodynamic forces. The UAV motion model comprises a dynamics model and a kinematics model, so establishing the UAV motion model comprises:
establishing a dynamic model of the unmanned aerial vehicle in an inertial coordinate system:
[Dynamics model: equations given as an image in the original publication]
where g denotes the gravitational acceleration; v denotes the speed of the UAV and satisfies the constraint v_min ≤ v ≤ v_max; the track inclination angle γ is the angle between v and the horizontal plane XOY, γ ∈ [-π/2, π/2], equal to 0° for level flight and positive upward; the track deviation angle ψ is the angle between the projection of v onto the horizontal plane XOY and the X axis, ψ ∈ (-π, π], equal to 0° when pointing due north and positive when turning toward the west; n_τ denotes the tangential overload, which represents the effect on the speed of the resultant of thrust and drag (including the component of gravity); its direction is along the velocity of the UAV, so n_τ is used to change the UAV's speed; n_f denotes the normal overload, whose direction is toward the top of the UAV, and μ denotes the roll angle, i.e. the rotation angle of the UAV about its longitudinal axis; together, n_f and μ change the flight direction and flight altitude of the UAV.
Establishing a kinematic model of the unmanned aerial vehicle in an inertial coordinate system:
[Kinematics model: equations given as an image in the original publication]
where x, y and z denote the coordinates of the UAV in the inertial coordinate system.
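The dynamics and kinematics equations themselves appear only as images in the published text. For reference, a standard three-degree-of-freedom point-mass model that is consistent with the variables defined above can be written as follows; this is an assumed textbook form, not a transcription of the patent's own equations:

% Assumed standard 3-DOF point-mass model (LaTeX); g, v, gamma, psi, n_tau, n_f, mu, x, y, z as defined in the text.
\begin{aligned}
\dot{v} &= g\,\bigl(n_{\tau} - \sin\gamma\bigr), &\qquad \dot{x} &= v\cos\gamma\cos\psi,\\
\dot{\gamma} &= \frac{g}{v}\,\bigl(n_{f}\cos\mu - \cos\gamma\bigr), &\qquad \dot{y} &= v\cos\gamma\sin\psi,\\
\dot{\psi} &= \frac{g\,n_{f}\sin\mu}{v\cos\gamma}, &\qquad \dot{z} &= v\sin\gamma.
\end{aligned}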
Step S200, establishing a UAV air-combat model based on a Markov decision process according to the UAV motion model, wherein the air-combat model is represented by a quadruple comprising a state space, an action space, a reward function and a discount factor, and the UAV motion model serves as the state transfer function of the air-combat model.
In particular, the reinforcement learning process is a process of trial and error, and a Markov decision process is typically used as the model framework to describe a reinforcement learning task. A UAV air-combat model is therefore established based on a Markov decision process according to the UAV motion model; the air-combat model is represented by the quadruple (S, A, R, λ), where S denotes the state space, A the action space, R the reward function and λ the discount factor. Assuming that the immediate reward fed back to the UAV by the environment is r_t = r_t(s_t, a_t), the long-term reward of the UAV in the current state is defined as
[Long-term discounted return: given as an image in the original publication]
where λ is the discount factor; the larger the discount factor, the more far-sighted the UAV is, i.e. the more weight it attaches to future rewards.
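The cumulative-return expression is likewise reproduced only as an image; the standard discounted form consistent with this description (an assumed form, not a transcription) is

R_t = \sum_{k=0}^{\infty} \lambda^{k}\, r_{t+k}(s_{t+k}, a_{t+k}) = r_t + \lambda r_{t+1} + \lambda^{2} r_{t+2} + \cdots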
The state space is constructed by:
setting the own states of the enemy UAV and my UAV:
S = [x_r, y_r, z_r, x_b, y_b, z_b, v_r, v_b, γ_r, γ_b, ψ_r, ψ_b, μ_r, μ_b]
setting the relative state of the enemy UAV and my UAV based on their own states:
S_rb = [D, α, β, v_r, v_b, γ_r, γ_b, ψ_r, ψ_b, μ_r, μ_b]
where x_r, y_r, z_r denote the coordinates of my UAV in three-dimensional space, x_b, y_b, z_b denote the coordinates of the enemy UAV in three-dimensional space, v_r denotes the speed of my UAV, v_b the speed of the enemy UAV, γ_r the track inclination angle of my UAV, γ_b the track inclination angle of the enemy UAV, ψ_r the track deviation angle of my UAV, ψ_b the track deviation angle of the enemy UAV, μ_r the roll angle of my UAV, μ_b the roll angle of the enemy UAV, D the relative distance between the enemy UAV and my UAV, the horizontal line-of-sight angle α the angle between the X axis and the projection onto the horizontal plane of the line of sight between the two UAVs, and the longitudinal line-of-sight angle β the angle between that line of sight and the horizontal plane.
The state space in the embodiment can describe the battlefield situation more intuitively, and the dimension of the state space can be reduced.
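As an illustration of how such a relative state can be assembled, the sketch below computes D, α and β from the two aircraft positions in the inertial frame; the function and field names are hypothetical and the exact construction used in the patent may differ.

import numpy as np

def relative_state(red, blue):
    """Build S_rb = [D, alpha, beta, v_r, v_b, gamma_r, gamma_b, psi_r, psi_b, mu_r, mu_b].
    `red` (my UAV) and `blue` (enemy UAV) are dicts with keys 'pos', 'v', 'gamma', 'psi', 'mu';
    these names are illustrative only."""
    los = np.asarray(blue["pos"], dtype=float) - np.asarray(red["pos"], dtype=float)
    D = np.linalg.norm(los)                      # relative distance
    alpha = np.arctan2(los[1], los[0])           # horizontal line-of-sight angle w.r.t. the X axis
    beta = np.arcsin(los[2] / max(D, 1e-6))      # elevation of the line of sight above the horizontal plane
    return np.array([D, alpha, beta,
                     red["v"], blue["v"],
                     red["gamma"], blue["gamma"],
                     red["psi"], blue["psi"],
                     red["mu"], blue["mu"]])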
The state transfer function is defined as the probability of reaching the next input state s_{i+1} given the current input state s_i and the adopted action a_i.
The action space is constructed by the following formula:
A = [n_τ, n_f, ω]
where n_τ denotes the tangential overload, n_f the normal overload, and ω the body roll rate.
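A minimal way to encode this continuous action space is a Gym-style Box; the numeric overload and roll-rate limits below are illustrative assumptions, since the patent's actual values appear only in the image tables.

import numpy as np
from gym import spaces

# Illustrative bounds only (assumed, not taken from the patent).
N_TAU_MAX, N_F_MAX, OMEGA_MAX = 2.0, 8.0, np.pi / 2

action_space = spaces.Box(
    low=np.array([-N_TAU_MAX, -N_F_MAX, -OMEGA_MAX], dtype=np.float32),
    high=np.array([N_TAU_MAX, N_F_MAX, OMEGA_MAX], dtype=np.float32),
)  # a = [n_tau, n_f, omega]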
The reward functions include a lock reward function, an angle advantage function, a distance advantage function, a height advantage function, and a speed advantage function, wherein the lock reward function is:
[Lock reward function r_lock: given as an image in the original publication]
where D* denotes the minimum distance between the two UAVs at which my UAV successfully locks the enemy UAV, p* denotes the maximum angle by which the velocity direction of my UAV may deviate from the direction toward the enemy UAV's centre of mass while the lock is still satisfied, e* denotes the maximum angle by which the velocity direction of the enemy UAV may deviate from the line of sight from my UAV to its centre of mass while the lock is still satisfied, D denotes the relative distance between my UAV and the enemy UAV, p denotes the angle between the velocity direction of my UAV and the direction toward the enemy UAV's centre of mass, and e denotes the angle between the velocity direction of the enemy UAV and the line-of-sight vector from my UAV toward its centre of mass;
the angular merit function is:
[Equation: given as an image in the original publication]
the distance merit function is:
[Equation: given as an image in the original publication]
the height merit function is:
[Equation: given as an image in the original publication]
the speed merit function is:
[Equation: given as an image in the original publication]
where p denotes the angle between the velocity direction of my UAV and the direction toward the enemy UAV's centre of mass, e denotes the angle between the velocity direction of the enemy UAV and the line-of-sight vector from my UAV toward its centre of mass, D denotes the relative distance between my UAV and the enemy UAV, v_r denotes the speed of my UAV, v_b the speed of the enemy UAV, D_max the maximum detection distance of the UAV, Δh the height difference between the enemy UAV and my UAV, v_max the maximum flight speed of the UAV, and v_min the minimum flight speed of the UAV;
the single-step reward function of the UAV is:
R = r_lock + k_1 r_1 + k_2 r_2 + k_3 r_3 + k_4 r_4
where k_1 to k_4 are weight values whose sum is 1.
The single-step reward function in this embodiment combines the lock reward function with the angle, distance, height and speed advantage functions; this shaped reward alleviates the difficulty of convergence caused by sparse rewards.
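A sketch of how this composite single-step reward can be evaluated is given below. The four advantage terms and the lock test appear only as images in the patent, so the lock condition, thresholds and weights used here are assumptions for illustration, not the patent's values.

import numpy as np

def step_reward(D, p, e, r1, r2, r3, r4,
                D_star=1000.0, p_star=np.radians(30.0), e_star=np.radians(30.0),
                k=(0.3, 0.3, 0.2, 0.2), r_win=10.0):
    """Composite reward R = r_lock + k1*r1 + k2*r2 + k3*r3 + k4*r4.
    Assumed lock test: the enemy UAV is within D_star, my velocity points within
    p_star of the enemy's centre of mass, and the enemy's velocity lies within
    e_star of the line of sight (a rear-quarter lock). All numbers are illustrative."""
    locked = (D <= D_star) and (p <= p_star) and (e <= e_star)
    r_lock = r_win if locked else 0.0
    k1, k2, k3, k4 = k
    return r_lock + k1 * r1 + k2 * r2 + k3 * r3 + k4 * r4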
Step S300, training the UAV to learn a maneuvering strategy based on the TD3 algorithm according to the UAV air-combat model.
Referring to fig. 3, the TD3 algorithm is adopted to solve the Q-value overestimation problem. TD3 comprises one actuator (Actor) network and two evaluator (Critic) networks; the Q value is estimated with both Critic networks, and the smaller of the two estimates is selected as the update target. My UAV is trained with neural networks based on the TD3 algorithm: training performs gradient descent on the constructed objective functions, and the optimal network parameters are obtained after the iterations converge. For example, at step i the current state s_i of my UAV is fed into the Actor network μ(s | θ^μ), which outputs the current maneuver a_i according to s_i; to increase exploration, random noise N_i is added, giving the executed action a_i = μ(s_i | θ^μ) + N_i. The current s_i and a_i are passed to the interaction environment, the reward r_i and the next state s_{i+1} are obtained through the state transfer function, and the tuple (s_i, a_i, r_i, s_{i+1}) is stored in the experience pool; N sample transitions (a minibatch) are then drawn at random from the experience pool for learning and updating the network parameters.
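The interaction step just described (query the actuator network, add exploration noise, execute the action and store the transition) can be sketched as follows; a Gym-style environment, a PyTorch actor and a list-like replay buffer are assumed, and all names are illustrative.

import numpy as np
import torch

def collect_step(env, actor, replay_buffer, state, noise_std=0.1, a_low=None, a_high=None):
    """One interaction step: a_i = mu(s_i | theta_mu) + N_i, then store (s_i, a_i, r_i, s_{i+1})."""
    with torch.no_grad():
        action = actor(torch.as_tensor(state, dtype=torch.float32)).numpy()
    action = action + np.random.normal(0.0, noise_std, size=action.shape)   # exploration noise N_i
    if a_low is not None and a_high is not None:
        action = np.clip(action, a_low, a_high)      # keep overloads / roll rate feasible
    next_state, reward, done, _ = env.step(action)   # environment applies the state transfer function
    replay_buffer.append((state, action, reward, next_state, done))
    return next_state, reward, done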
Training the unmanned aerial vehicle learning maneuver strategy based on the TD3 algorithm comprises the following steps:
step S1, initializing the two evaluator networks Q_1 and Q_2 with parameters θ^{Q_1} and θ^{Q_2}, the actuator network with parameter θ^μ, the target-network parameters θ'^{Q_1}, θ'^{Q_2} and θ'^μ, and the experience pool;
step S2, presetting the number of rounds, and executing the following steps in each round:
step S2-1, presetting the maximum limit on the number of steps of the unmanned aerial vehicle in each round;
step S2-2, the unmanned aerial vehicle selects an action according to the current state and the policy, with random noise added;
step S2-3, the unmanned aerial vehicle executes the action, and the next state and the reward are obtained from the state transfer function;
step S2-4, storing the current state, the selected action, the reward and the next state obtained in steps S2-2 and S2-3 into the experience pool;
step S2-5, randomly extracting N samples from the experience pool;
step S2-6, calculating the expected return of the action through two evaluator target networks, selecting a smaller Q value, and updating parameters of the evaluator networks;
step S2-7, updating parameters of the actuator network through the deterministic strategy gradient;
step S2-8, after updating the parameters of the evaluator network and the parameters of the actuator network, updating the parameters of the target network;
step S2-9, ending the round when the number of steps reaches the maximum limit;
and step S3, finishing the training of the unmanned aerial vehicle learning maneuver strategy after all rounds are finished.
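A compact PyTorch-style sketch of the update in steps S2-5 to S2-8 is given below: twin evaluator (critic) networks, a shared target built from the smaller of the two target Q values, a delayed actuator (actor) update and soft target updates. The network interfaces, hyper-parameter names and values are assumptions for illustration, not taken from the patent.

import torch
import torch.nn.functional as F

def td3_update(batch, actor, actor_t, critic1, critic2, critic1_t, critic2_t,
               opt_actor, opt_critics, it, discount=0.99, tau=0.005,
               policy_noise=0.2, noise_clip=0.5, policy_delay=2):
    s, a, r, s2, done = batch                         # minibatch drawn from the experience pool

    with torch.no_grad():
        noise = (torch.randn_like(a) * policy_noise).clamp(-noise_clip, noise_clip)
        a2 = actor_t(s2) + noise                      # target policy smoothing (clipping to action bounds omitted)
        q_target = torch.min(critic1_t(s2, a2), critic2_t(s2, a2))   # take the smaller Q value
        y = r + discount * (1.0 - done) * q_target

    # step S2-6: update both evaluator networks toward the shared target y
    loss_q = F.mse_loss(critic1(s, a), y) + F.mse_loss(critic2(s, a), y)
    opt_critics.zero_grad(); loss_q.backward(); opt_critics.step()

    if it % policy_delay == 0:
        # step S2-7: deterministic policy gradient for the actuator network
        loss_pi = -critic1(s, actor(s)).mean()
        opt_actor.zero_grad(); loss_pi.backward(); opt_actor.step()

        # step S2-8: soft update of all target networks
        for net, net_t in ((actor, actor_t), (critic1, critic1_t), (critic2, critic2_t)):
            for p, p_t in zip(net.parameters(), net_t.parameters()):
                p_t.data.mul_(1 - tau).add_(tau * p.data)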
Calculating the expected return of the action through the two evaluator target networks, selecting the smaller Q value, and updating the parameters of the evaluator networks comprises:
learning and updating the evaluator network parameters, wherein the formula of the loss function L is as follows:
[Loss function L: given as an image in the original publication]
where the target expected value y_i is obtained from the current real reward value r_i plus the next-step output value multiplied by the discount factor λ, and is given by:
[Target expected value y_i: given as an image in the original publication]
updating parameters of the actuator network by a deterministic policy gradient, comprising:
learning and updating actuator network parameters, wherein a deterministic strategy gradient formula of the actuator network is as follows:
[Deterministic policy gradient of the actuator network: given as an image in the original publication]
where N denotes the number of samples randomly drawn from the experience pool, Q(s, a | θ^Q) denotes the evaluator network with parameters θ^Q, and μ(s | θ^μ) denotes the actuator network with parameters θ^μ.
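The loss function, the target value and the policy gradient above are reproduced only as images. In standard TD3/DDPG notation consistent with the symbols defined in this description (an assumed form, not a transcription of the patent's images), they read:

L = \frac{1}{N}\sum_{i=1}^{N}\bigl(y_i - Q(s_i, a_i \mid \theta^{Q})\bigr)^{2},
\qquad
y_i = r_i + \lambda \min_{j=1,2} Q'_j\bigl(s_{i+1}, \mu'(s_{i+1} \mid \theta'^{\mu}) \mid \theta'^{Q_j}\bigr),

\nabla_{\theta^{\mu}} J \approx \frac{1}{N}\sum_{i=1}^{N} \nabla_{a} Q(s, a \mid \theta^{Q})\Big|_{s=s_i,\, a=\mu(s_i)}\; \nabla_{\theta^{\mu}} \mu(s \mid \theta^{\mu})\Big|_{s=s_i}.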
In this embodiment, the UAV is trained to learn a maneuvering strategy based on the TD3 algorithm; because TD3 computes the Q value with two evaluator networks and uses the smaller one to update the evaluator parameters, overestimation of the Q value is prevented, the UAV learns a better maneuvering strategy, and a positional advantage is obtained in combat.
For better illustration, a simulation experiment was performed in this embodiment. The training environment and the algorithm were programmed in Python, the deep reinforcement learning framework was built on PyTorch, all neural networks in the algorithm adopt a fully connected architecture, and the activation function is the rectified linear unit (ReLU).
The red-side and blue-side UAVs have identical performance; at the initial moment they have the same altitude, a fixed initial horizontal distance, the same initial speed, the same climb angle (set to 0) and random heading angles. At each subsequent moment the red-side UAV maneuvers according to the reinforcement learning strategy, while the blue-side UAV selects among 7 basic maneuvers using a minimax strategy (Minmax Strategy). At every step the environment issues a reward according to the states of the two sides until the number of steps in a single round reaches the upper limit or one side successfully locks its opponent, and the round then ends.
The blue-side unmanned aerial vehicle has 7 basic actions including constant speed level flight, acceleration, deceleration, climbing, descending, left turning and right turning, and the parameters are selected as follows:
[Table 1: parameter settings for the 7 basic maneuvers of the blue-side UAV; provided as an image in the original publication]
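For the blue side's maneuver selection, a one-step minimax choice over the 7 basic maneuvers can be sketched as follows. The simulate and advantage helpers are hypothetical (the patent does not spell out this routine), and the assumption that the red side's candidate replies are the same 7 basic maneuvers is made only for illustration.

import numpy as np

BASIC_MANEUVERS = ["level", "accelerate", "decelerate", "climb", "descend", "turn_left", "turn_right"]

def minmax_maneuver(blue_state, red_state, simulate, advantage):
    """Pick the blue maneuver whose worst-case outcome (over red's replies) is best.
    `simulate(blue_state, red_state, m_blue, m_red)` advances both states one step and
    `advantage(blue_state, red_state)` scores the result for blue; both are assumed helpers."""
    best_m, best_worst = None, -np.inf
    for m_blue in BASIC_MANEUVERS:
        worst = min(advantage(*simulate(blue_state, red_state, m_blue, m_red))
                    for m_red in BASIC_MANEUVERS)
        if worst > best_worst:
            best_m, best_worst = m_blue, worst
    return best_m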
The specific experimental parameters of the unmanned aerial vehicle aerial combat simulation scene are shown in the following table:
[Table 2: experimental parameters of the UAV air-combat simulation scenario; provided as an image in the original publication]
The relevant neural network parameters and training learning parameters are shown in the following table:
[Table 3: neural-network and training-learning parameters; provided as an image in the original publication]
With this parameter design, the UAV maneuvering strategy is trained for 100000 rounds based on the TD3 algorithm; the accumulated reward of each round is recorded, the average reward over every 200 rounds is computed, and the change of the average reward with the number of training rounds is obtained.
Since the algorithm selected in this embodiment, the Twin Delayed Deep Deterministic Policy Gradient (TD3), is an improvement of the Deep Deterministic Policy Gradient (DDPG), the training processes of the two algorithms were compared without changing the network structure or the training parameters. As shown in fig. 5, under the same training environment the TD3 algorithm converges faster than DDPG, and the stable reward value obtained after convergence is slightly higher than that of DDPG.
The simulation also recorded the trajectories of the two UAVs after the TD3 algorithm converged; referring to fig. 6, after being trained with the TD3 algorithm, my UAV can make maneuvering decisions autonomously and obtain a dominant position in the game against the enemy UAV.
To verify the adaptability of the TD3 algorithm, 1000 Monte Carlo simulation tests were carried out under random initial conditions with the enemy UAV adopting a single-step MINMAX maneuvering strategy. The results show that across the 1000 tests my UAV maintains a high accumulated return and a dominant situation, with a win rate above 80%, indicating that the TD3 algorithm has strong adaptability.
This embodiment also simulates confrontations between my UAV and the enemy UAV under different situations: the advantageous situation in which my UAV is tailing the enemy UAV, the disadvantageous situation in which my UAV is being tailed by the enemy UAV, and the balanced situation in which the two sides approach each other head-on.
When my unmanned aerial vehicle is in the dominant situation, the initial state information of my unmanned aerial vehicle and enemy unmanned aerial vehicle is as shown in the following table:
[Table 4: initial states of my UAV and the enemy UAV in the advantageous situation; provided as an image in the original publication]
Referring to fig. 7, when my UAV starts in the advantageous situation, it keeps the angle advantage while looking for opportunities to gradually reduce the distance to the enemy UAV. When the enemy UAV makes a turning-and-descending maneuver to escape the lock, my UAV turns at a more reasonable moment and speed, maneuvering autonomously until the attack conditions are met.
When my unmanned aerial vehicle is in the disadvantaged situation, the initial state information of my unmanned aerial vehicle and enemy unmanned aerial vehicle is as shown in the following table:
[Table 5: initial states of my UAV and the enemy UAV in the disadvantageous situation; provided as an image in the original publication]
Referring to fig. 8, when my UAV is in the disadvantageous situation and cannot escape its opponent by speed alone, it chooses to fly an approximately S-shaped maneuver, continuously changing heading to shake off the opponent's pursuit, and reverses the adverse situation by inducing the opposing UAV to decelerate and overshoot. At the initial moment my UAV is being tailed by the enemy UAV, which keeps its heading and accelerates to close in. From the initial moment my UAV keeps changing heading to stay outside the enemy's weapon attack angle; during the mutual approach the enemy UAV never manages to lock my UAV, and eventually my UAV reverses the situation and occupies the more favourable position.
When my unmanned aerial vehicle and enemy unmanned aerial vehicle are in the equilibrium situation, the initial state information of my unmanned aerial vehicle and enemy unmanned aerial vehicle is as shown in the following table:
[Table 6: initial states of my UAV and the enemy UAV in the balanced situation; provided as an image in the original publication]
Referring to fig. 9, when my UAV and the enemy UAV approach each other head-on in a balanced situation, my UAV can still plan a more reasonable flight path and control its speed, obtaining the battlefield position advantage through autonomous maneuvering.
Referring to fig. 10, an embodiment of the present invention provides an unmanned aerial vehicle fighting autonomous decision system based on a deep reinforcement learning TD3 algorithm, including:
the unmanned aerial vehicle motion model establishing unit is used for establishing an unmanned aerial vehicle motion model;
the unmanned aerial combat model establishing unit is used for establishing an unmanned aerial combat model based on a Markov decision process according to an unmanned aerial vehicle motion model, wherein the unmanned aerial combat model is represented by a quadruple including a state space, an action space, a reward function and a discount factor;
and the unmanned aerial vehicle learning maneuvering strategy training unit is used for training the unmanned aerial vehicle learning maneuvering strategy based on a TD3 algorithm according to the unmanned aerial vehicle air combat model.
It should be noted that, since the unmanned aerial vehicle combat decision system based on the deep reinforcement learning TD3 algorithm in the present embodiment is based on the same inventive concept as the above-mentioned unmanned aerial vehicle combat autonomous decision method based on the deep reinforcement learning TD3 algorithm, the corresponding contents in the method embodiment are also applicable to the present system embodiment, and are not described in detail herein.
In the description herein, references to the description of the term "one embodiment," "some embodiments," "an illustrative embodiment," "an example," "a specific example," or "some examples" or the like mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
While the preferred embodiments of the present invention have been described in detail, it will be understood by those skilled in the art that the foregoing and various other changes, omissions and deviations in the form and detail thereof may be made without departing from the scope of this invention.

Claims (10)

1. An unmanned aerial vehicle fighting autonomous decision-making method based on a deep reinforcement learning TD3 algorithm is characterized by comprising the following steps:
establishing an unmanned aerial vehicle motion model;
establishing an unmanned aerial combat model based on a Markov decision process according to the unmanned aerial vehicle motion model, wherein the unmanned aerial combat model is represented by a quadruple comprising a state space, an action space, a reward function and a discount factor, and the unmanned aerial vehicle motion model represents a state transfer function in the unmanned aerial combat model;
and training an unmanned aerial vehicle learning maneuvering strategy based on a TD3 algorithm according to the unmanned aerial vehicle air combat model.
2. The unmanned aerial vehicle combat autonomous decision method based on the deep reinforcement learning TD3 algorithm of claim 1, wherein the unmanned aerial vehicle motion model comprises a dynamics model and a kinematics model, and the establishing the unmanned aerial vehicle motion model comprises:
establishing a dynamic model of the unmanned aerial vehicle in an inertial coordinate system:
[Dynamics model: equations given as an image in the original publication]
wherein g represents the gravitational acceleration; v represents the speed of the unmanned aerial vehicle and satisfies the constraint v_min ≤ v ≤ v_max; the track inclination angle γ represents the angle between v and the horizontal plane, γ ∈ [-π/2, π/2]; the track deviation angle ψ represents the angle between the projection of v on the horizontal plane and the X axis, ψ ∈ (-π, π]; n_τ represents the tangential overload; n_f represents the normal overload; and μ represents the roll angle;
establishing a kinematic model of the unmanned aerial vehicle in the inertial coordinate system:
[Kinematics model: equations given as an image in the original publication]
wherein x, y and z represent the coordinates of the unmanned aerial vehicle in the inertial coordinate system.
3. The unmanned aerial vehicle combat autonomous decision method based on the deep reinforcement learning TD3 algorithm according to claim 2, wherein the state space comprises the own states of the enemy unmanned aerial vehicle and my unmanned aerial vehicle and their relative state.
4. The unmanned aerial vehicle fighting autonomous decision method based on the deep reinforcement learning TD3 algorithm according to claim 3, wherein the state space is constructed by:
setting the own states of the enemy unmanned aerial vehicle and my unmanned aerial vehicle:
S = [x_r, y_r, z_r, x_b, y_b, z_b, v_r, v_b, γ_r, γ_b, ψ_r, ψ_b, μ_r, μ_b]
setting the relative state of the enemy unmanned aerial vehicle and my unmanned aerial vehicle based on their own states:
S_rb = [D, α, β, v_r, v_b, γ_r, γ_b, ψ_r, ψ_b, μ_r, μ_b]
wherein x_r, y_r, z_r represent the coordinates of my unmanned aerial vehicle in three-dimensional space, x_b, y_b, z_b represent the coordinates of the enemy unmanned aerial vehicle in the three-dimensional space, v_r represents the speed of my unmanned aerial vehicle, v_b represents the speed of the enemy unmanned aerial vehicle, γ_r represents the track inclination angle of my unmanned aerial vehicle, γ_b represents the track inclination angle of the enemy unmanned aerial vehicle, ψ_r represents the track deviation angle of my unmanned aerial vehicle, ψ_b represents the track deviation angle of the enemy unmanned aerial vehicle, μ_r represents the roll angle of my unmanned aerial vehicle, μ_b represents the roll angle of the enemy unmanned aerial vehicle, D represents the relative distance between the enemy unmanned aerial vehicle and my unmanned aerial vehicle, the horizontal line-of-sight angle α represents the angle between the X axis and the projection on the horizontal plane of the line of sight between the two unmanned aerial vehicles, and the longitudinal line-of-sight angle β represents the angle between that line of sight and the horizontal plane.
5. The unmanned aerial vehicle fighting autonomous decision method based on the deep reinforcement learning TD3 algorithm according to claim 2, wherein the action space is constructed by the following formula:
A = [n_τ, n_f, ω]
wherein n_τ represents the tangential overload, n_f represents the normal overload, and ω represents the body roll rate.
6. The unmanned aerial vehicle fighting autonomous decision method based on the deep reinforcement learning TD3 algorithm according to claim 4, wherein the reward functions include a lock reward function, an angle advantage function, a distance advantage function, a height advantage function and a speed advantage function, wherein the lock reward function is:
[Lock reward function r_lock: given as an image in the original publication]
wherein D* represents the minimum distance between the two unmanned aerial vehicles at which my unmanned aerial vehicle successfully locks the enemy unmanned aerial vehicle, p* represents the maximum angle by which the velocity direction of my unmanned aerial vehicle may deviate from the direction toward the centre of mass of the enemy unmanned aerial vehicle while the lock is still satisfied, e* represents the maximum angle by which the velocity direction of the enemy unmanned aerial vehicle may deviate from the line of sight from my unmanned aerial vehicle to its centre of mass while the lock is still satisfied, D represents the relative distance between my unmanned aerial vehicle and the enemy unmanned aerial vehicle, p represents the angle between the velocity direction of my unmanned aerial vehicle and the direction toward the centre of mass of the enemy unmanned aerial vehicle, and e represents the angle between the velocity direction of the enemy unmanned aerial vehicle and the line-of-sight vector from my unmanned aerial vehicle toward its centre of mass;
the angle advantage function is:
[formula image FDA0003552139290000032]
the distance advantage function is:
[formula image FDA0003552139290000033]
the height advantage function is:
[formula image FDA0003552139290000041]
the speed advantage function is:
[formula image FDA0003552139290000042]
wherein v_r represents the speed of my drone, v_b represents the speed of the enemy drone, D_max represents the maximum detection distance of the drone, Δh represents the height difference between the enemy drone and my drone, v_max represents the maximum flight speed of the drone, and v_min represents the minimum flight speed of the drone;
the single-step reward function of the drone is:
R = r_lock + k_1·r_1 + k_2·r_2 + k_3·r_3 + k_4·r_4
wherein k_1 to k_4 represent weight values, and the sum of k_1 to k_4 is 1.
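A minimal sketch of the single-step reward composition and of the lock condition implied by the thresholds D*, p*, e*. The advantage functions and the exact values returned by r_lock are given only as formula images in the filing, so they are left as inputs here; the mapping of r_1..r_4 to the angle/distance/height/speed terms and the equal default weights are assumptions.

```python
def lock_achieved(D, p, e, D_star, p_star, e_star):
    """Lock condition implied by claim 6 (an assumption about how the
    thresholds are combined): the enemy drone counts as locked when the
    relative distance and both deviation angles are within their limits."""
    return D <= D_star and p <= p_star and e <= e_star


def single_step_reward(r_lock, r_angle, r_dist, r_height, r_speed,
                       k=(0.25, 0.25, 0.25, 0.25)):
    """R = r_lock + k1*r1 + k2*r2 + k3*r3 + k4*r4 with the weights summing
    to 1; the equal weights used here are placeholders only."""
    assert abs(sum(k) - 1.0) < 1e-9, "claim 6 requires k1 + k2 + k3 + k4 = 1"
    return r_lock + k[0] * r_angle + k[1] * r_dist + k[2] * r_height + k[3] * r_speed
```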
7. The unmanned aerial vehicle fighting autonomous decision method based on the deep reinforcement learning TD3 algorithm of claim 1, wherein the training of the unmanned aerial vehicle learning maneuver strategy based on the TD3 algorithm comprises:
step S1, initializing the parameters of the two evaluator networks Q_1 and Q_2, the parameter θ^μ of the actuator network, the corresponding target network parameters (those of the two evaluator target networks and θ'^μ), and an experience pool;
step S2, presetting the number of rounds, and executing the following steps in each round:
step S2-1, presetting the maximum limit step number of the unmanned aerial vehicle in each round;
step S2-2, the unmanned aerial vehicle selects an action according to the current state and the policy, and random noise is added to the action;
step S2-3, the unmanned aerial vehicle executes the action and obtains the next state and the reward through the state transition function;
step S2-4, storing the current state, the selected action, the reward and the next state obtained in steps S2-2 and S2-3 into the experience pool;
step S2-5, randomly extracting N samples from the experience pool;
step S2-6, calculating the expected return of the action through the two evaluator target networks, taking the smaller of the two Q values, and updating the parameters of the evaluator networks;
step S2-7, updating the parameters of the actuator network through a deterministic strategy gradient;
step S2-8, after updating the parameters of the evaluator network and the parameters of the actuator network, updating the parameters of the target network;
step S2-9, repeating steps S2-2 to S2-8 until the number of steps reaches the maximum limit step number, and then ending the round;
and step S3, finishing the training of the unmanned aerial vehicle learning maneuver strategy after all rounds are finished.
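A structural sketch of steps S1 to S3 in Python. The environment interface, network stand-ins and hyper-parameter values are assumptions inserted only so the skeleton runs; the evaluator, actuator and target-network updates of steps S2-6 to S2-8 are deferred to the stubs (TD3 conventionally delays the actor and target updates).

```python
import random
from collections import deque

import numpy as np

# Hypothetical stand-ins so the skeleton runs; replace them with the real
# air-combat environment and TD3 networks (all names here are assumptions).
STATE_DIM, ACTION_DIM = 11, 3
def env_reset():                       return np.zeros(STATE_DIM)
def env_step(state, action):           return np.zeros(STATE_DIM), 0.0   # next state, reward
def policy(state):                     return np.zeros(ACTION_DIM)       # actuator network mu(s)
def update_critics(batch):             pass   # step S2-6: min-Q target + MSE loss
def update_actor_and_targets(batch):   pass   # steps S2-7 and S2-8

replay = deque(maxlen=100_000)                                  # experience pool (step S1)
EPISODES, MAX_STEPS, BATCH, NOISE_STD = 500, 400, 64, 0.1       # assumed hyper-parameters

for episode in range(EPISODES):                                 # step S2: preset number of rounds
    state = env_reset()
    for step in range(MAX_STEPS):                               # step S2-1: step limit per round
        action = policy(state) + np.random.normal(0.0, NOISE_STD, ACTION_DIM)  # S2-2: add noise
        next_state, reward = env_step(state, action)            # S2-3: state transition + reward
        replay.append((state, action, reward, next_state))      # S2-4: store the transition
        if len(replay) >= BATCH:
            batch = random.sample(list(replay), BATCH)          # S2-5: random minibatch of N samples
            update_critics(batch)                               # S2-6: evaluator update
            update_actor_and_targets(batch)                     # S2-7 / S2-8 (TD3 usually delays these)
        state = next_state                                      # S2-9: repeat until the step limit
# step S3: training of the maneuver policy ends after all rounds
```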
8. The unmanned aerial vehicle fighting autonomous decision method based on the deep reinforcement learning TD3 algorithm according to claim 7, wherein the calculating the expected return of action through the two evaluator target networks, selecting a smaller Q value, and updating the parameters of the evaluator networks comprises:
learning and updating the network parameters of the evaluator, wherein the formula of the loss function L is as follows:
[formula image FDA0003552139290000051]
wherein s_i represents the current state, a_i represents the current action, and the target expected value y_i is obtained from the current real reward value r_i plus the next-step output value of the target networks multiplied by the discount factor λ; the formula for the target expected value y_i is:
[formula image FDA0003552139290000052]
wherein s_{i+1} represents the next-step state.
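A hedged sketch of the evaluator update in the standard TD3 form: both evaluator target networks are evaluated at the target actuator's smoothed next action, the smaller value defines y_i = r_i + λ·min(Q'_1, Q'_2), and each evaluator is regressed onto y_i with a mean-squared loss over the N samples. The network modules, tensor shapes (column vectors of size N×1) and the smoothing-noise parameters are assumptions; the exact formulas of the filing are the images referenced above.

```python
import torch
import torch.nn.functional as F

def critic_targets_and_losses(batch, critic1, critic2,
                              target_critic1, target_critic2, target_actor,
                              lam=0.99, noise_std=0.2, noise_clip=0.5):
    """Evaluator update in the standard TD3 form: build
    y_i = r_i + lambda * min(Q'_1, Q'_2) at the target actor's smoothed next
    action, then regress each evaluator onto y_i with a mean-squared loss."""
    s, a, r, s_next = batch               # tensors; r is expected as shape (N, 1)
    with torch.no_grad():
        noise = (torch.randn_like(a) * noise_std).clamp(-noise_clip, noise_clip)
        a_next = target_actor(s_next) + noise                 # target-policy smoothing (standard TD3)
        q_next = torch.min(target_critic1(s_next, a_next),
                           target_critic2(s_next, a_next))    # keep the smaller Q value
        y = r + lam * q_next                                  # target expected value y_i
    loss1 = F.mse_loss(critic1(s, a), y)                      # loss L for evaluator network Q1
    loss2 = F.mse_loss(critic2(s, a), y)                      # loss L for evaluator network Q2
    return loss1, loss2
```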
9. The unmanned aerial vehicle combat autonomous decision making method based on the deep reinforcement learning TD3 algorithm according to claim 8, wherein the updating of the parameters of the actuator network through the deterministic policy gradient includes:
learning and updating actuator network parameters, wherein a deterministic strategy gradient formula of the actuator network is as follows:
[formula image FDA0003552139290000053]
wherein N represents the number of samples randomly drawn from the experience pool, Q(s, a|θ^Q) represents the evaluator network, θ^Q represents the parameter of the evaluator network, μ(s|θ^μ) represents the actuator network, and θ^μ represents the parameter of the actuator network.
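A sketch of step S2-7 under standard TD3 conventions: the actuator μ(s|θ^μ) is updated by ascending the deterministic policy gradient, implemented as minimising −(1/N)·Σ Q_1(s_i, μ(s_i)). Using the first evaluator for the actuator loss is an assumption taken from usual TD3 practice, not something the claim specifies.

```python
import torch

def actor_update(batch, actor, critic1, actor_optimizer):
    """Deterministic policy gradient step: minimising -(1/N) * sum_i Q1(s_i, mu(s_i))
    follows the gradient (1/N) * sum_i grad_a Q(s,a)|a=mu(s_i) * grad_theta mu(s_i)."""
    s = batch[0]                                   # only the sampled states are needed
    actor_loss = -critic1(s, actor(s)).mean()      # negative mean Q over the N samples
    actor_optimizer.zero_grad()
    actor_loss.backward()
    actor_optimizer.step()
    return actor_loss.item()
```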
10. An unmanned aerial vehicle fighting autonomous decision making system based on a deep reinforcement learning TD3 algorithm is characterized by comprising:
the unmanned aerial vehicle motion model establishing unit is used for establishing an unmanned aerial vehicle motion model;
the unmanned aerial vehicle aerial combat model establishing unit is used for establishing an unmanned aerial vehicle aerial combat model based on a Markov decision process according to the unmanned aerial vehicle motion model, wherein the unmanned aerial vehicle aerial combat model is represented by a quadruple including a state space, an action space, a reward function and a discount factor;
and the unmanned aerial vehicle learning maneuvering strategy training unit is used for training the unmanned aerial vehicle learning maneuvering strategy based on a TD3 algorithm according to the unmanned aerial vehicle air combat model.
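Purely as an illustration of how the three claimed units could be composed; none of these class or method names appear in the patent.

```python
class UavCombatDecisionSystem:
    """Structural sketch of the three claimed units working together;
    the class and method names are illustrative, not from the patent."""

    def __init__(self, motion_model_unit, air_combat_model_unit, training_unit):
        self.motion_model_unit = motion_model_unit          # establishes the UAV motion model
        self.air_combat_model_unit = air_combat_model_unit  # builds the MDP quadruple (S, A, R, discount)
        self.training_unit = training_unit                  # trains the maneuver policy with TD3

    def run(self):
        motion_model = self.motion_model_unit.build()
        air_combat_model = self.air_combat_model_unit.build(motion_model)
        return self.training_unit.train(air_combat_model)
```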
CN202210264539.2A 2022-03-17 2022-03-17 Unmanned aerial vehicle fighting autonomous decision-making method based on deep reinforcement learning TD3 algorithm Pending CN114706418A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210264539.2A CN114706418A (en) 2022-03-17 2022-03-17 Unmanned aerial vehicle fighting autonomous decision-making method based on deep reinforcement learning TD3 algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210264539.2A CN114706418A (en) 2022-03-17 2022-03-17 Unmanned aerial vehicle fighting autonomous decision-making method based on deep reinforcement learning TD3 algorithm

Publications (1)

Publication Number Publication Date
CN114706418A true CN114706418A (en) 2022-07-05

Family

ID=82167866

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210264539.2A Pending CN114706418A (en) 2022-03-17 2022-03-17 Unmanned aerial vehicle fighting autonomous decision-making method based on deep reinforcement learning TD3 algorithm

Country Status (1)

Country Link
CN (1) CN114706418A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117420849A (en) * 2023-12-18 2024-01-19 山东科技大学 Marine unmanned aerial vehicle formation granularity-variable collaborative search and rescue method based on reinforcement learning
CN117420849B (en) * 2023-12-18 2024-03-08 山东科技大学 Marine unmanned aerial vehicle formation granularity-variable collaborative search and rescue method based on reinforcement learning

Similar Documents

Publication Publication Date Title
CN112947581B (en) Multi-unmanned aerial vehicle collaborative air combat maneuver decision method based on multi-agent reinforcement learning
CN113093802B (en) Unmanned aerial vehicle maneuver decision method based on deep reinforcement learning
CN113095481A (en) Air combat maneuver method based on parallel self-game
CN110531786B (en) Unmanned aerial vehicle maneuvering strategy autonomous generation method based on DQN
CN113050686B (en) Combat strategy optimization method and system based on deep reinforcement learning
CN114721424B (en) Multi-unmanned aerial vehicle cooperative countermeasure method, system and storage medium
CN112906233B (en) Distributed near-end strategy optimization method based on cognitive behavior knowledge and application thereof
Chai et al. A hierarchical deep reinforcement learning framework for 6-DOF UCAV air-to-air combat
CN113962012B (en) Unmanned aerial vehicle countermeasure strategy optimization method and device
CN113282061A (en) Unmanned aerial vehicle air game countermeasure solving method based on course learning
Ruan et al. Autonomous maneuver decisions via transfer learning pigeon-inspired optimization for UCAVs in dogfight engagements
CN115688268A (en) Aircraft near-distance air combat situation assessment adaptive weight design method
CN114756959A (en) Design method of aircraft short-distance air combat maneuver intelligent decision machine model
CN115903865A (en) Aircraft near-distance air combat maneuver decision implementation method
CN116432310A (en) Six-degree-of-freedom incompletely observable air combat maneuver intelligent decision model design method
CN116700079A (en) Unmanned aerial vehicle countermeasure occupation maneuver control method based on AC-NFSP
CN114063644A (en) Unmanned combat aircraft air combat autonomous decision method based on pigeon flock reverse confrontation learning
CN116820134A (en) Unmanned aerial vehicle formation maintaining control method based on deep reinforcement learning
CN114706418A (en) Unmanned aerial vehicle fighting autonomous decision-making method based on deep reinforcement learning TD3 algorithm
Luo et al. Multi-UAV cooperative maneuver decision-making for pursuit-evasion using improved MADRL
Guo et al. Maneuver decision of UAV in air combat based on deterministic policy gradient
CN114967732A (en) Method and device for formation and aggregation of unmanned aerial vehicles, computer equipment and storage medium
CN117313561B (en) Unmanned aerial vehicle intelligent decision model training method and unmanned aerial vehicle intelligent decision method
Baykal et al. An Evolutionary Reinforcement Learning Approach for Autonomous Maneuver Decision in One-to-One Short-Range Air Combat
CN116432030A (en) Air combat multi-intention strategy autonomous generation method based on deep reinforcement learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination