CN114706418A - Unmanned aerial vehicle fighting autonomous decision-making method based on deep reinforcement learning TD3 algorithm - Google Patents

Info

Publication number
CN114706418A
Authority
CN
China
Prior art keywords
unmanned aerial
aerial vehicle
enemy
drone
algorithm
Prior art date
Legal status
Pending
Application number
CN202210264539.2A
Other languages
Chinese (zh)
Inventor
高显忠
候中喜
金泉
王玉杰
邓小龙
Current Assignee
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date
Filing date
Publication date
Application filed by National University of Defense Technology
Priority to CN202210264539.2A
Publication of CN114706418A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G05 - CONTROLLING; REGULATING
    • G05D - SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D 1/00 - Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
    • G05D 1/60 - Intended control result
    • G05D 1/656 - Interaction with payloads or external entities
    • G05D 1/689 - Pointing payloads towards fixed or moving targets
    • G05D 1/10 - Simultaneous control of position or course in three dimensions
    • G05D 1/101 - Simultaneous control of position or course in three dimensions specially adapted for aircraft
    • G05D 1/40 - Control within particular dimensions
    • G05D 1/46 - Control of position or course in three dimensions
    • G05D 2101/00 - Details of software or hardware architectures used for the control of position
    • G05D 2101/10 - Details of software or hardware architectures used for the control of position using artificial intelligence [AI] techniques
    • G05D 2101/15 - Details of software or hardware architectures used for the control of position using artificial intelligence [AI] techniques using machine learning, e.g. neural networks
    • G05D 2105/00 - Specific applications of the controlled vehicles
    • G05D 2105/35 - Specific applications of the controlled vehicles for combat
    • G05D 2107/00 - Specific environments of the controlled vehicles
    • G05D 2107/30 - Off-road
    • G05D 2107/34 - Battlefields
    • G05D 2109/00 - Types of controlled vehicles
    • G05D 2109/20 - Aircraft, e.g. drones
    • G05D 2109/22 - Aircraft, e.g. drones with fixed wings

Landscapes

  • Engineering & Computer Science (AREA)
  • Aviation & Aerospace Engineering (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Remote Sensing (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Automation & Control Theory (AREA)
  • Control Of Position, Course, Altitude, Or Attitude Of Moving Bodies (AREA)

Abstract

The invention discloses an unmanned aerial vehicle (UAV) combat autonomous decision-making method based on the deep reinforcement learning TD3 algorithm. The method comprises: establishing a UAV motion model; establishing a UAV air-combat model based on a Markov decision process according to the UAV motion model, wherein the air-combat model is expressed as a quadruple comprising a state space, an action space, a reward function and a discount factor, and the UAV motion model serves as the state transfer function; and training the UAV to learn a maneuvering strategy based on the TD3 algorithm according to the UAV air-combat model. Because the TD3 algorithm alleviates the Q-value overestimation problem, the UAV learns a better maneuvering strategy and obtains a positional advantage in combat.

Description

Unmanned aerial vehicle fighting autonomous decision-making method based on deep reinforcement learning TD3 algorithm
Technical Field
The invention relates to the technical field of unmanned aerial vehicles, in particular to an unmanned aerial vehicle fighting autonomous decision-making method based on a deep reinforcement learning TD3 algorithm.
Background
Intelligent autonomous combat unmanned aerial vehicles (UAVs) and UAV swarms have great potential to change the battlefield. Maneuver decision-making is the core technology of UAV combat confrontation, and studying how a UAV can maneuver autonomously to gain an operational advantage according to the battlefield situation and the mission objective is of great significance.
When the UAV air-combat problem is studied with traditional mathematical methods such as differential game theory, an accurate mathematical model must be established, and the maneuvering strategies and performance parameters of both sides must be known before the problem can be posed qualitatively or quantitatively, which is impossible in reality. In future combat, information such as the enemy's strategic intent, tactics and equipment performance cannot be accurately predicted in advance, and the interference of various uncertain factors and the low detectability of targets in the combat environment further limit the applicability of such methods. In addition, the UAV dynamics model is complex and its state equations are nonlinear differential equations, so solving them is difficult, the amount of computation is huge, a large amount of computing resources is occupied, much time is consumed, and the curse of dimensionality arises when the number of UAVs on the two sides increases further.
Although the Deep Deterministic Policy Gradient (DDPG) algorithm is well suited to high-dimensional continuous action spaces, using it for deep reinforcement learning of a UAV in the air-combat environment can lead to overestimation of the Q value, so that the total reward obtained by the UAV stays low: when the Q value is overestimated, the policy selected by the UAV contains errors that grow larger and larger, so an effective policy cannot be found and a positional advantage cannot be obtained in combat.
Disclosure of Invention
To solve the problems in the prior art, the invention provides a UAV combat autonomous decision-making method based on the deep reinforcement learning TD3 algorithm, which alleviates the Q-value overestimation problem so that the UAV learns a better maneuvering strategy and obtains a positional advantage in combat.
In order to achieve the purpose, the technical scheme of the invention is realized as follows:
In a first aspect, the invention provides a UAV combat autonomous decision-making method based on the deep reinforcement learning TD3 algorithm, which comprises the following steps:
establishing a UAV motion model;
establishing a UAV air-combat model based on a Markov decision process according to the UAV motion model, wherein the air-combat model is represented by a quadruple comprising a state space, an action space, a reward function and a discount factor, and the UAV motion model serves as the state transfer function of the air-combat model;
and training the UAV to learn a maneuvering strategy based on the TD3 algorithm according to the UAV air-combat model.
Compared with the prior art, the first aspect of the invention has the following beneficial effects:
the method comprises the steps of establishing an unmanned aerial combat model of unmanned aerial vehicles fighting with my unmanned aerial vehicle and an enemy unmanned aerial vehicle based on a Markov decision process according to an unmanned aerial vehicle motion model, wherein the unmanned aerial combat model is represented by a quadruple including a state space, an action space, a reward function and a discount factor; according to the unmanned aerial vehicle air combat model, the unmanned aerial vehicle is trained to learn the maneuver strategy based on the TD3 algorithm, and the TD3 algorithm can solve the problem of Q value overestimation, so that the unmanned aerial vehicle learns a better maneuver strategy, and position advantages are obtained in battle.
Further, the unmanned aerial vehicle motion model comprises a dynamics model and a kinematics model, and establishing the unmanned aerial vehicle motion model comprises:
establishing a dynamic model of the unmanned aerial vehicle in an inertial coordinate system:
[Dynamics model: equations given as an image in the original publication]
where g denotes the gravitational acceleration; v denotes the speed of the UAV and satisfies the constraint v_min ≤ v ≤ v_max; the track inclination angle γ denotes the angle between v and the horizontal plane XOY, γ ∈ [-π/2, π/2]; the track deviation angle ψ denotes the angle between the projection of v onto the horizontal plane XOY and the X axis, ψ ∈ (-π, π]; n_τ denotes the tangential overload; n_f denotes the normal overload; and μ denotes the roll angle;
establishing a kinematic model of the unmanned aerial vehicle in the inertial coordinate system:
[Kinematics model: equations given as an image in the original publication]
where x, y and z denote the coordinates of the UAV in the inertial coordinate system.
Further, the state space comprises the own states of the enemy UAV and my UAV and their relative state.
Further, the state space is constructed by:
setting the own states of the enemy UAV and my UAV:
S = [x_r, y_r, z_r, x_b, y_b, z_b, v_r, v_b, γ_r, γ_b, ψ_r, ψ_b, μ_r, μ_b]
setting the relative state of the enemy UAV and my UAV based on their own states:
S_rb = [D, α, β, v_r, v_b, γ_r, γ_b, ψ_r, ψ_b, μ_r, μ_b]
where x_r, y_r, z_r denote the coordinates of my UAV in three-dimensional space, x_b, y_b, z_b denote the coordinates of the enemy UAV in three-dimensional space, v_r denotes the speed of my UAV, v_b the speed of the enemy UAV, γ_r the track inclination angle of my UAV, γ_b the track inclination angle of the enemy UAV, ψ_r the track deviation angle of my UAV, ψ_b the track deviation angle of the enemy UAV, μ_r the roll angle of my UAV, μ_b the roll angle of the enemy UAV, D the relative distance between the enemy UAV and my UAV, the horizontal line-of-sight angle α the angle between the X axis and the projection onto the horizontal plane of the line of sight between the two UAVs, and the longitudinal line-of-sight angle β the angle between that line of sight and the horizontal plane.
Further, the action space is constructed by the following formula:
A = [n_τ, n_f, ω]
where n_τ denotes the tangential overload, n_f the normal overload, and ω the body roll rate.
Further, the reward functions include a lock reward function, an angle advantage function, a distance advantage function, a height advantage function, and a speed advantage function, wherein the lock reward function is:
[Lock reward function r_lock: given as an image in the original publication]
where D* denotes the minimum distance between the two UAVs at which my UAV successfully locks the enemy UAV, p* denotes the maximum angle by which the velocity direction of my UAV may deviate from the direction toward the enemy UAV's centre of mass while the lock is still satisfied, e* denotes the maximum angle by which the velocity direction of the enemy UAV may deviate from the line of sight from my UAV to its centre of mass while the lock is still satisfied, D denotes the relative distance between my UAV and the enemy UAV, p denotes the angle between the velocity direction of my UAV and the direction toward the enemy UAV's centre of mass, and e denotes the angle between the velocity direction of the enemy UAV and the line-of-sight vector from my UAV toward its centre of mass;
the angular merit function is:
[Equation: given as an image in the original publication]
the distance merit function is:
[Equation: given as an image in the original publication]
the height merit function is:
[Equation: given as an image in the original publication]
the speed merit function is:
[Equation: given as an image in the original publication]
where v_r denotes the speed of my UAV, v_b the speed of the enemy UAV, D_max the maximum detection distance of the UAV, Δh the height difference between the enemy UAV and my UAV, v_max the maximum flight speed of the UAV, and v_min the minimum flight speed of the UAV;
the single-step reward function of the UAV is:
R = r_lock + k_1 r_1 + k_2 r_2 + k_3 r_3 + k_4 r_4
where k_1 to k_4 are weight values whose sum is 1.
Further, the training of the unmanned aerial vehicle learning maneuver strategy based on the TD3 algorithm includes:
step S1, initializing the two evaluator networks Q_1 and Q_2 with parameters θ^{Q_1} and θ^{Q_2}, the actuator network with parameter θ^μ, the target-network parameters θ'^{Q_1}, θ'^{Q_2} and θ'^μ, and the experience pool;
step S2, presetting the number of rounds, and executing the following steps in each round:
step S2-1, presetting the maximum limit on the number of steps of the unmanned aerial vehicle in each round;
step S2-2, the unmanned aerial vehicle selects an action according to the current state and the policy, with random noise added;
step S2-3, the unmanned aerial vehicle executes the action, and the next state and the reward are obtained from the state transfer function;
step S2-4, storing the current state, the selected action, the reward and the next state obtained in steps S2-2 and S2-3 into the experience pool;
step S2-5, randomly extracting N samples from the experience pool;
step S2-6, calculating expected return of action through the two evaluator target networks, selecting a smaller Q value, and updating parameters of the evaluator networks;
step S2-7, updating the parameters of the actuator network through a deterministic strategy gradient;
step S2-8, after updating the parameters of the evaluator network and the parameters of the actuator network, updating the parameters of the target network;
step S2-9, ending the round when the number of steps reaches the maximum limit;
and step S3, finishing the training of the unmanned aerial vehicle learning maneuver strategy after all rounds are finished.
Further, calculating the expected return of the action through the two evaluator target networks, selecting the smaller Q value, and updating the parameters of the evaluator networks comprises:
learning and updating the network parameters of the evaluator, wherein the formula of the loss function L is as follows:
[Loss function L: given as an image in the original publication]
where s_i denotes the current state and a_i denotes the current action; the target expected value y_i is obtained from the current real reward value r_i plus the next-step output value multiplied by the discount factor λ, and is given by:
[Target expected value y_i: given as an image in the original publication]
where s_{i+1} denotes the next state.
Further, updating the parameters of the actuator network through the deterministic policy gradient comprises:
learning and updating actuator network parameters, wherein a deterministic strategy gradient formula of the actuator network is as follows:
[Deterministic policy gradient of the actuator network: given as an image in the original publication]
where N denotes the number of samples randomly drawn from the experience pool, Q(s, a | θ^Q) denotes the evaluator network with parameters θ^Q, and μ(s | θ^μ) denotes the actuator network with parameters θ^μ.
In a second aspect, the invention provides an unmanned aerial vehicle fighting autonomous decision making system based on a deep reinforcement learning TD3 algorithm, including:
the unmanned aerial vehicle motion model establishing unit is used for establishing an unmanned aerial vehicle motion model;
the unmanned aerial vehicle aerial combat model establishing unit is used for establishing an unmanned aerial vehicle aerial combat model based on a Markov decision process according to the unmanned aerial vehicle motion model, wherein the unmanned aerial vehicle aerial combat model is represented by a quadruple including a state space, an action space, a reward function and a discount factor;
and the unmanned aerial vehicle learning maneuvering strategy training unit is used for training the unmanned aerial vehicle learning maneuvering strategy based on a TD3 algorithm according to the unmanned aerial vehicle air combat model.
Compared with the prior art, the second aspect of the invention has the following beneficial effects:
the unmanned aerial vehicle aerial combat model establishing unit of the system establishes, based on a Markov decision process and according to the unmanned aerial vehicle motion model, a UAV air-combat model in which my UAV fights the enemy UAV; the model is represented by a quadruple comprising a state space, an action space, a reward function and a discount factor. The unmanned aerial vehicle learning maneuvering strategy training unit then trains the UAV to learn a maneuvering strategy based on the TD3 algorithm according to this air-combat model; because TD3 alleviates the Q-value overestimation problem, the UAV learns a better maneuvering strategy and obtains a positional advantage in combat.
Drawings
The above and/or additional aspects and advantages of the present invention will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
fig. 1 is a flowchart of an unmanned aerial vehicle combat autonomous decision method based on a deep reinforcement learning TD3 algorithm according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of an inertial frame provided in accordance with an embodiment of the present invention;
fig. 3 is a schematic structural diagram of a TD3 algorithm according to an embodiment of the present invention;
FIG. 4 is a diagram of simulation results based on TD3 algorithm training according to an embodiment of the present invention;
FIG. 5 is a diagram of simulation results comparing the TD3 algorithm with the training of DDPG algorithm according to an embodiment of the present invention;
fig. 6 is a diagram illustrating simulation results of a running trajectory when the TD3 algorithm converges according to an embodiment of the present invention;
FIG. 7 is a diagram of simulation results of dominant situations provided by an embodiment of the present invention;
FIG. 8 is a diagram illustrating simulation results of a disadvantage situation according to an embodiment of the present invention;
FIG. 9 is a diagram illustrating simulation results of an equilibrium situation according to an embodiment of the present invention;
fig. 10 is a structural diagram of an unmanned aerial vehicle combat autonomous decision making system based on a deep reinforcement learning TD3 algorithm according to an embodiment of the present invention.
Detailed Description
The technical solutions of the embodiments of the present disclosure will be described clearly and completely with reference to the accompanying drawings, and it is to be understood that the described embodiments are only a part of the embodiments of the present disclosure, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present disclosure without making any creative effort, shall fall within the protection scope of the present disclosure. It should be noted that the features of the embodiments and examples of the present disclosure may be combined with each other without conflict. In addition, the drawings are intended to supplement the description of the text part of the specification with figures so that one can visually and vividly understand each feature and the whole technical solution of the present disclosure, but they should not be construed as limiting the scope of the present disclosure.
In the description of the invention, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature. In the description of the present invention, "a plurality" means two or more unless otherwise specified.
In the description of the present invention, unless otherwise explicitly limited, terms such as arrangement, installation, connection and the like should be understood in a broad sense, and those skilled in the art can reasonably determine the specific meanings of the above terms in the present invention in combination with the specific contents of the technical solutions.
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
When the UAV air-combat problem is studied with traditional mathematical methods such as differential game theory, an accurate mathematical model must be established, and the maneuvering strategies and performance parameters of both sides must be known before the problem can be posed qualitatively or quantitatively, which is impossible in reality. In future combat, information such as the enemy's strategic intent, tactics and equipment performance cannot be accurately predicted in advance, and the interference of various uncertain factors and the low detectability of targets in the combat environment further limit the applicability of such methods. In addition, the UAV dynamics model is complex and its state equations are nonlinear differential equations, so solving them is difficult, the amount of computation is huge, a large amount of computing resources is occupied, much time is consumed, and the curse of dimensionality arises when the number of UAVs on the two sides increases further.
Although the Deep Deterministic Policy Gradient (DDPG) algorithm is well suited to high-dimensional continuous action spaces, using it for deep reinforcement learning of a UAV in the air-combat environment can lead to overestimation of the Q value, so that the total reward obtained by the UAV stays low: when the Q value is overestimated, the policy selected by the UAV contains errors that grow larger and larger, so an effective policy cannot be found and a positional advantage cannot be obtained in combat.
To solve the above problems, the invention establishes, based on a Markov decision process and according to the UAV motion model, an air-combat model in which my UAV fights the enemy UAV; the air-combat model is represented by a quadruple comprising a state space, an action space, a reward function and a discount factor. According to this air-combat model, the UAV is trained to learn a maneuvering strategy based on the Twin Delayed Deep Deterministic Policy Gradient (TD3) algorithm; because TD3 alleviates the Q-value overestimation problem, the UAV learns a better maneuvering strategy and obtains a positional advantage in combat.
Referring to fig. 1 to 9, an embodiment of the present invention provides an unmanned aerial vehicle combat autonomous decision method based on a deep reinforcement learning TD3 algorithm, including the steps of:
and S100, establishing an unmanned aerial vehicle motion model.
Specifically, an inertial coordinate system is established, referring to fig. 2: the positive direction of the X axis points due east, the positive direction of the Y axis points due north, and the positive direction of the Z axis points vertically upward, perpendicular to the ground. During flight the UAV is mainly subjected to engine thrust, gravity and aerodynamic forces. The UAV motion model comprises a dynamics model and a kinematics model, so establishing the UAV motion model comprises:
establishing a dynamic model of the unmanned aerial vehicle in an inertial coordinate system:
[Dynamics model: equations given as an image in the original publication]
where g denotes the gravitational acceleration; v denotes the speed of the UAV and satisfies the constraint v_min ≤ v ≤ v_max; the track inclination angle γ is the angle between v and the horizontal plane XOY, γ ∈ [-π/2, π/2], equal to 0° for level flight and positive upward; the track deviation angle ψ is the angle between the projection of v onto the horizontal plane XOY and the X axis, ψ ∈ (-π, π], equal to 0° when pointing due north and positive when turning toward the west; n_τ denotes the tangential overload, which represents the effect on the speed of the resultant of thrust and drag (including the component of gravity); its direction is along the velocity of the UAV, so n_τ is used to change the UAV's speed; n_f denotes the normal overload, whose direction is toward the top of the UAV, and μ denotes the roll angle, i.e. the rotation angle of the UAV about its longitudinal axis; together, n_f and μ change the flight direction and flight altitude of the UAV.
Establishing a kinematic model of the unmanned aerial vehicle in an inertial coordinate system:
[Kinematics model: equations given as an image in the original publication]
where x, y and z denote the coordinates of the UAV in the inertial coordinate system.
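The dynamics and kinematics equations themselves appear only as images in the published text. For reference, a standard three-degree-of-freedom point-mass model that is consistent with the variables defined above can be written as follows; this is an assumed textbook form, not a transcription of the patent's own equations:

% Assumed standard 3-DOF point-mass model (LaTeX); g, v, gamma, psi, n_tau, n_f, mu, x, y, z as defined in the text.
\begin{aligned}
\dot{v} &= g\,\bigl(n_{\tau} - \sin\gamma\bigr), &\qquad \dot{x} &= v\cos\gamma\cos\psi,\\
\dot{\gamma} &= \frac{g}{v}\,\bigl(n_{f}\cos\mu - \cos\gamma\bigr), &\qquad \dot{y} &= v\cos\gamma\sin\psi,\\
\dot{\psi} &= \frac{g\,n_{f}\sin\mu}{v\cos\gamma}, &\qquad \dot{z} &= v\sin\gamma.
\end{aligned}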
Step S200, establishing a UAV air-combat model based on a Markov decision process according to the UAV motion model, wherein the air-combat model is represented by a quadruple comprising a state space, an action space, a reward function and a discount factor, and the UAV motion model serves as the state transfer function of the air-combat model.
In particular, the reinforcement learning process is a process of trial and error, and a Markov decision process is typically used as the model framework to describe a reinforcement learning task. A UAV air-combat model is therefore established based on a Markov decision process according to the UAV motion model; the air-combat model is represented by the quadruple (S, A, R, λ), where S denotes the state space, A the action space, R the reward function and λ the discount factor. Assuming that the immediate reward fed back to the UAV by the environment is r_t = r_t(s_t, a_t), the long-term reward of the UAV in the current state is defined as
[Long-term discounted return: given as an image in the original publication]
where λ is the discount factor; the larger the discount factor, the more far-sighted the UAV is, i.e. the more weight it attaches to future rewards.
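The cumulative-return expression is likewise reproduced only as an image; the standard discounted form consistent with this description (an assumed form, not a transcription) is

R_t = \sum_{k=0}^{\infty} \lambda^{k}\, r_{t+k}(s_{t+k}, a_{t+k}) = r_t + \lambda r_{t+1} + \lambda^{2} r_{t+2} + \cdots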
The state space is constructed by:
setting the own states of the enemy UAV and my UAV:
S = [x_r, y_r, z_r, x_b, y_b, z_b, v_r, v_b, γ_r, γ_b, ψ_r, ψ_b, μ_r, μ_b]
setting the relative state of the enemy UAV and my UAV based on their own states:
S_rb = [D, α, β, v_r, v_b, γ_r, γ_b, ψ_r, ψ_b, μ_r, μ_b]
where x_r, y_r, z_r denote the coordinates of my UAV in three-dimensional space, x_b, y_b, z_b denote the coordinates of the enemy UAV in three-dimensional space, v_r denotes the speed of my UAV, v_b the speed of the enemy UAV, γ_r the track inclination angle of my UAV, γ_b the track inclination angle of the enemy UAV, ψ_r the track deviation angle of my UAV, ψ_b the track deviation angle of the enemy UAV, μ_r the roll angle of my UAV, μ_b the roll angle of the enemy UAV, D the relative distance between the enemy UAV and my UAV, the horizontal line-of-sight angle α the angle between the X axis and the projection onto the horizontal plane of the line of sight between the two UAVs, and the longitudinal line-of-sight angle β the angle between that line of sight and the horizontal plane.
The state space in the embodiment can describe the battlefield situation more intuitively, and the dimension of the state space can be reduced.
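As an illustration of how such a relative state can be assembled, the sketch below computes D, α and β from the two aircraft positions in the inertial frame; the function and field names are hypothetical and the exact construction used in the patent may differ.

import numpy as np

def relative_state(red, blue):
    """Build S_rb = [D, alpha, beta, v_r, v_b, gamma_r, gamma_b, psi_r, psi_b, mu_r, mu_b].
    `red` (my UAV) and `blue` (enemy UAV) are dicts with keys 'pos', 'v', 'gamma', 'psi', 'mu';
    these names are illustrative only."""
    los = np.asarray(blue["pos"], dtype=float) - np.asarray(red["pos"], dtype=float)
    D = np.linalg.norm(los)                      # relative distance
    alpha = np.arctan2(los[1], los[0])           # horizontal line-of-sight angle w.r.t. the X axis
    beta = np.arcsin(los[2] / max(D, 1e-6))      # elevation of the line of sight above the horizontal plane
    return np.array([D, alpha, beta,
                     red["v"], blue["v"],
                     red["gamma"], blue["gamma"],
                     red["psi"], blue["psi"],
                     red["mu"], blue["mu"]])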
The state transfer function is defined as the probability of reaching the next input state s_{i+1} given the current input state s_i and the adopted action a_i.
The action space is constructed by the following formula:
A = [n_τ, n_f, ω]
where n_τ denotes the tangential overload, n_f the normal overload, and ω the body roll rate.
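A minimal way to encode this continuous action space is a Gym-style Box; the numeric overload and roll-rate limits below are illustrative assumptions, since the patent's actual values appear only in the image tables.

import numpy as np
from gym import spaces

# Illustrative bounds only (assumed, not taken from the patent).
N_TAU_MAX, N_F_MAX, OMEGA_MAX = 2.0, 8.0, np.pi / 2

action_space = spaces.Box(
    low=np.array([-N_TAU_MAX, -N_F_MAX, -OMEGA_MAX], dtype=np.float32),
    high=np.array([N_TAU_MAX, N_F_MAX, OMEGA_MAX], dtype=np.float32),
)  # a = [n_tau, n_f, omega]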
The reward functions include a lock reward function, an angle advantage function, a distance advantage function, a height advantage function, and a speed advantage function, wherein the lock reward function is:
[Lock reward function r_lock: given as an image in the original publication]
where D* denotes the minimum distance between the two UAVs at which my UAV successfully locks the enemy UAV, p* denotes the maximum angle by which the velocity direction of my UAV may deviate from the direction toward the enemy UAV's centre of mass while the lock is still satisfied, e* denotes the maximum angle by which the velocity direction of the enemy UAV may deviate from the line of sight from my UAV to its centre of mass while the lock is still satisfied, D denotes the relative distance between my UAV and the enemy UAV, p denotes the angle between the velocity direction of my UAV and the direction toward the enemy UAV's centre of mass, and e denotes the angle between the velocity direction of the enemy UAV and the line-of-sight vector from my UAV toward its centre of mass;
the angular merit function is:
[Equation: given as an image in the original publication]
the distance merit function is:
[Equation: given as an image in the original publication]
the height merit function is:
[Equation: given as an image in the original publication]
the speed merit function is:
[Equation: given as an image in the original publication]
where p denotes the angle between the velocity direction of my UAV and the direction toward the enemy UAV's centre of mass, e denotes the angle between the velocity direction of the enemy UAV and the line-of-sight vector from my UAV toward its centre of mass, D denotes the relative distance between my UAV and the enemy UAV, v_r denotes the speed of my UAV, v_b the speed of the enemy UAV, D_max the maximum detection distance of the UAV, Δh the height difference between the enemy UAV and my UAV, v_max the maximum flight speed of the UAV, and v_min the minimum flight speed of the UAV;
the single-step reward function of the UAV is:
R = r_lock + k_1 r_1 + k_2 r_2 + k_3 r_3 + k_4 r_4
where k_1 to k_4 are weight values whose sum is 1.
The single-step reward function in this embodiment combines the lock reward function with the angle, distance, height and speed advantage functions; this shaped reward alleviates the difficulty of convergence caused by sparse rewards.
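A sketch of how this composite single-step reward can be evaluated is given below. The four advantage terms and the lock test appear only as images in the patent, so the lock condition, thresholds and weights used here are assumptions for illustration, not the patent's values.

import numpy as np

def step_reward(D, p, e, r1, r2, r3, r4,
                D_star=1000.0, p_star=np.radians(30.0), e_star=np.radians(30.0),
                k=(0.3, 0.3, 0.2, 0.2), r_win=10.0):
    """Composite reward R = r_lock + k1*r1 + k2*r2 + k3*r3 + k4*r4.
    Assumed lock test: the enemy UAV is within D_star, my velocity points within
    p_star of the enemy's centre of mass, and the enemy's velocity lies within
    e_star of the line of sight (a rear-quarter lock). All numbers are illustrative."""
    locked = (D <= D_star) and (p <= p_star) and (e <= e_star)
    r_lock = r_win if locked else 0.0
    k1, k2, k3, k4 = k
    return r_lock + k1 * r1 + k2 * r2 + k3 * r3 + k4 * r4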
Step S300, training the UAV to learn a maneuvering strategy based on the TD3 algorithm according to the UAV air-combat model.
Referring to fig. 3, the TD3 algorithm is adopted to solve the Q-value overestimation problem. TD3 comprises one actuator (Actor) network and two evaluator (Critic) networks; the Q value is estimated with both Critic networks, and the smaller of the two estimates is selected as the update target. My UAV is trained with neural networks based on the TD3 algorithm: training performs gradient descent on the constructed objective functions, and the optimal network parameters are obtained after the iterations converge. For example, at step i the current state s_i of my UAV is fed into the Actor network μ(s | θ^μ), which outputs the current maneuver a_i according to s_i; to increase exploration, random noise N_i is added, giving the executed action a_i = μ(s_i | θ^μ) + N_i. The current s_i and a_i are passed to the interaction environment, the reward r_i and the next state s_{i+1} are obtained through the state transfer function, and the tuple (s_i, a_i, r_i, s_{i+1}) is stored in the experience pool; N sample transitions (a minibatch) are then drawn at random from the experience pool for learning and updating the network parameters.
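The interaction step just described (query the actuator network, add exploration noise, execute the action and store the transition) can be sketched as follows; a Gym-style environment, a PyTorch actor and a list-like replay buffer are assumed, and all names are illustrative.

import numpy as np
import torch

def collect_step(env, actor, replay_buffer, state, noise_std=0.1, a_low=None, a_high=None):
    """One interaction step: a_i = mu(s_i | theta_mu) + N_i, then store (s_i, a_i, r_i, s_{i+1})."""
    with torch.no_grad():
        action = actor(torch.as_tensor(state, dtype=torch.float32)).numpy()
    action = action + np.random.normal(0.0, noise_std, size=action.shape)   # exploration noise N_i
    if a_low is not None and a_high is not None:
        action = np.clip(action, a_low, a_high)      # keep overloads / roll rate feasible
    next_state, reward, done, _ = env.step(action)   # environment applies the state transfer function
    replay_buffer.append((state, action, reward, next_state, done))
    return next_state, reward, done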
Training the unmanned aerial vehicle learning maneuver strategy based on the TD3 algorithm comprises the following steps:
step S1, initializing the two evaluator networks Q_1 and Q_2 with parameters θ^{Q_1} and θ^{Q_2}, the actuator network with parameter θ^μ, the target-network parameters θ'^{Q_1}, θ'^{Q_2} and θ'^μ, and the experience pool;
step S2, presetting the number of rounds, and executing the following steps in each round:
step S2-1, presetting the maximum limit on the number of steps of the unmanned aerial vehicle in each round;
step S2-2, the unmanned aerial vehicle selects an action according to the current state and the policy, with random noise added;
step S2-3, the unmanned aerial vehicle executes the action, and the next state and the reward are obtained from the state transfer function;
step S2-4, storing the current state, the selected action, the reward and the next state obtained in steps S2-2 and S2-3 into the experience pool;
step S2-5, randomly extracting N samples from the experience pool;
step S2-6, calculating the expected return of the action through two evaluator target networks, selecting a smaller Q value, and updating parameters of the evaluator networks;
step S2-7, updating parameters of the actuator network through the deterministic strategy gradient;
step S2-8, after updating the parameters of the evaluator network and the parameters of the actuator network, updating the parameters of the target network;
step S2-9, ending the round when the number of steps reaches the maximum limit;
and step S3, finishing the training of the unmanned aerial vehicle learning maneuver strategy after all rounds are finished.
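A compact PyTorch-style sketch of the update in steps S2-5 to S2-8 is given below: twin evaluator (critic) networks, a shared target built from the smaller of the two target Q values, a delayed actuator (actor) update and soft target updates. The network interfaces, hyper-parameter names and values are assumptions for illustration, not taken from the patent.

import torch
import torch.nn.functional as F

def td3_update(batch, actor, actor_t, critic1, critic2, critic1_t, critic2_t,
               opt_actor, opt_critics, it, discount=0.99, tau=0.005,
               policy_noise=0.2, noise_clip=0.5, policy_delay=2):
    s, a, r, s2, done = batch                         # minibatch drawn from the experience pool

    with torch.no_grad():
        noise = (torch.randn_like(a) * policy_noise).clamp(-noise_clip, noise_clip)
        a2 = actor_t(s2) + noise                      # target policy smoothing (clipping to action bounds omitted)
        q_target = torch.min(critic1_t(s2, a2), critic2_t(s2, a2))   # take the smaller Q value
        y = r + discount * (1.0 - done) * q_target

    # step S2-6: update both evaluator networks toward the shared target y
    loss_q = F.mse_loss(critic1(s, a), y) + F.mse_loss(critic2(s, a), y)
    opt_critics.zero_grad(); loss_q.backward(); opt_critics.step()

    if it % policy_delay == 0:
        # step S2-7: deterministic policy gradient for the actuator network
        loss_pi = -critic1(s, actor(s)).mean()
        opt_actor.zero_grad(); loss_pi.backward(); opt_actor.step()

        # step S2-8: soft update of all target networks
        for net, net_t in ((actor, actor_t), (critic1, critic1_t), (critic2, critic2_t)):
            for p, p_t in zip(net.parameters(), net_t.parameters()):
                p_t.data.mul_(1 - tau).add_(tau * p.data)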
Calculating the expected return of the action through the two evaluator target networks, selecting the smaller Q value, and updating the parameters of the evaluator networks comprises:
learning and updating the evaluator network parameters, wherein the formula of the loss function L is as follows:
[Loss function L: given as an image in the original publication]
where the target expected value y_i is obtained from the current real reward value r_i plus the next-step output value multiplied by the discount factor λ, and is given by:
[Target expected value y_i: given as an image in the original publication]
updating parameters of the actuator network by a deterministic policy gradient, comprising:
learning and updating actuator network parameters, wherein a deterministic strategy gradient formula of the actuator network is as follows:
[Deterministic policy gradient of the actuator network: given as an image in the original publication]
where N denotes the number of samples randomly drawn from the experience pool, Q(s, a | θ^Q) denotes the evaluator network with parameters θ^Q, and μ(s | θ^μ) denotes the actuator network with parameters θ^μ.
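The loss function, the target value and the policy gradient above are reproduced only as images. In standard TD3/DDPG notation consistent with the symbols defined in this description (an assumed form, not a transcription of the patent's images), they read:

L = \frac{1}{N}\sum_{i=1}^{N}\bigl(y_i - Q(s_i, a_i \mid \theta^{Q})\bigr)^{2},
\qquad
y_i = r_i + \lambda \min_{j=1,2} Q'_j\bigl(s_{i+1}, \mu'(s_{i+1} \mid \theta'^{\mu}) \mid \theta'^{Q_j}\bigr),

\nabla_{\theta^{\mu}} J \approx \frac{1}{N}\sum_{i=1}^{N} \nabla_{a} Q(s, a \mid \theta^{Q})\Big|_{s=s_i,\, a=\mu(s_i)}\; \nabla_{\theta^{\mu}} \mu(s \mid \theta^{\mu})\Big|_{s=s_i}.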
In this embodiment, the UAV is trained to learn a maneuvering strategy based on the TD3 algorithm; because TD3 computes the Q value with two evaluator networks and uses the smaller one to update the evaluator parameters, overestimation of the Q value is prevented, the UAV learns a better maneuvering strategy, and a positional advantage is obtained in combat.
For better illustration, a simulation experiment was performed in this embodiment. The training environment and the algorithm were programmed in Python, the deep reinforcement learning framework was built on PyTorch, all neural networks in the algorithm adopt a fully connected architecture, and the activation function is the rectified linear unit (ReLU).
The red-side and blue-side UAVs have identical performance; at the initial moment they have the same altitude, a fixed initial horizontal distance, the same initial speed, the same climb angle (set to 0) and random heading angles. At each subsequent moment the red-side UAV maneuvers according to the reinforcement learning strategy, while the blue-side UAV selects among 7 basic maneuvers using a minimax strategy (Minmax Strategy). At every step the environment issues a reward according to the states of the two sides until the number of steps in a single round reaches the upper limit or one side successfully locks its opponent, and the round then ends.
The blue-side unmanned aerial vehicle has 7 basic actions including constant speed level flight, acceleration, deceleration, climbing, descending, left turning and right turning, and the parameters are selected as follows:
[Table 1: parameter settings for the 7 basic maneuvers of the blue-side UAV; provided as an image in the original publication]
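For the blue side's maneuver selection, a one-step minimax choice over the 7 basic maneuvers can be sketched as follows. The simulate and advantage helpers are hypothetical (the patent does not spell out this routine), and the assumption that the red side's candidate replies are the same 7 basic maneuvers is made only for illustration.

import numpy as np

BASIC_MANEUVERS = ["level", "accelerate", "decelerate", "climb", "descend", "turn_left", "turn_right"]

def minmax_maneuver(blue_state, red_state, simulate, advantage):
    """Pick the blue maneuver whose worst-case outcome (over red's replies) is best.
    `simulate(blue_state, red_state, m_blue, m_red)` advances both states one step and
    `advantage(blue_state, red_state)` scores the result for blue; both are assumed helpers."""
    best_m, best_worst = None, -np.inf
    for m_blue in BASIC_MANEUVERS:
        worst = min(advantage(*simulate(blue_state, red_state, m_blue, m_red))
                    for m_red in BASIC_MANEUVERS)
        if worst > best_worst:
            best_m, best_worst = m_blue, worst
    return best_m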
The specific experimental parameters of the unmanned aerial vehicle aerial combat simulation scene are shown in the following table:
[Table 2: experimental parameters of the UAV air-combat simulation scenario; provided as an image in the original publication]
The relevant neural network parameters and training learning parameters are shown in the following table:
[Table 3: neural-network and training-learning parameters; provided as an image in the original publication]
With this parameter design, the UAV maneuvering strategy is trained for 100000 rounds based on the TD3 algorithm; the accumulated reward of each round is recorded, the average reward over every 200 rounds is computed, and the change of the average reward with the number of training rounds is obtained.
Since the algorithm selected in this embodiment, the Twin Delayed Deep Deterministic Policy Gradient (TD3), is an improvement of the Deep Deterministic Policy Gradient (DDPG), the training processes of the two algorithms were compared without changing the network structure or the training parameters. As shown in fig. 5, under the same training environment the TD3 algorithm converges faster than DDPG, and the stable reward value obtained after convergence is slightly higher than that of DDPG.
The simulation also recorded the trajectories of the two UAVs after the TD3 algorithm converged; referring to fig. 6, after being trained with the TD3 algorithm, my UAV can make maneuvering decisions autonomously and obtain a dominant position in the game against the enemy UAV.
To verify the adaptability of the TD3 algorithm, 1000 Monte Carlo simulation tests were carried out under random initial conditions with the enemy UAV adopting a single-step MINMAX maneuvering strategy. The results show that across the 1000 tests my UAV maintains a high accumulated return and a dominant situation, with a win rate above 80%, indicating that the TD3 algorithm has strong adaptability.
This embodiment also simulates confrontations between my UAV and the enemy UAV under different situations: the advantageous situation in which my UAV is tailing the enemy UAV, the disadvantageous situation in which my UAV is being tailed by the enemy UAV, and the balanced situation in which the two sides approach each other head-on.
When my unmanned aerial vehicle is in the dominant situation, the initial state information of my unmanned aerial vehicle and enemy unmanned aerial vehicle is as shown in the following table:
[Table 4: initial states of my UAV and the enemy UAV in the advantageous situation; provided as an image in the original publication]
Referring to fig. 7, when my UAV starts in the advantageous situation, it keeps the angle advantage while looking for opportunities to gradually reduce the distance to the enemy UAV. When the enemy UAV makes a turning-and-descending maneuver to escape the lock, my UAV turns at a more reasonable moment and speed, maneuvering autonomously until the attack conditions are met.
When my unmanned aerial vehicle is in the disadvantaged situation, the initial state information of my unmanned aerial vehicle and enemy unmanned aerial vehicle is as shown in the following table:
[Table 5: initial states of my UAV and the enemy UAV in the disadvantageous situation; provided as an image in the original publication]
Referring to fig. 8, when my UAV is in the disadvantageous situation and cannot escape its opponent by speed alone, it chooses to fly an approximately S-shaped maneuver, continuously changing heading to shake off the opponent's pursuit, and reverses the adverse situation by inducing the opposing UAV to decelerate and overshoot. At the initial moment my UAV is being tailed by the enemy UAV, which keeps its heading and accelerates to close in. From the initial moment my UAV keeps changing heading to stay outside the enemy's weapon attack angle; during the mutual approach the enemy UAV never manages to lock my UAV, and eventually my UAV reverses the situation and occupies the more favourable position.
When my unmanned aerial vehicle and enemy unmanned aerial vehicle are in the equilibrium situation, the initial state information of my unmanned aerial vehicle and enemy unmanned aerial vehicle is as shown in the following table:
[Table 6: initial states of my UAV and the enemy UAV in the balanced situation; provided as an image in the original publication]
Referring to fig. 9, when my UAV and the enemy UAV approach each other head-on in a balanced situation, my UAV can still plan a more reasonable flight path and control its speed, obtaining the battlefield position advantage through autonomous maneuvering.
Referring to fig. 10, an embodiment of the present invention provides an unmanned aerial vehicle fighting autonomous decision system based on a deep reinforcement learning TD3 algorithm, including:
the unmanned aerial vehicle motion model establishing unit is used for establishing an unmanned aerial vehicle motion model;
the unmanned aerial combat model establishing unit is used for establishing an unmanned aerial combat model based on a Markov decision process according to an unmanned aerial vehicle motion model, wherein the unmanned aerial combat model is represented by a quadruple including a state space, an action space, a reward function and a discount factor;
and the unmanned aerial vehicle learning maneuvering strategy training unit is used for training the unmanned aerial vehicle learning maneuvering strategy based on a TD3 algorithm according to the unmanned aerial vehicle air combat model.
It should be noted that, since the unmanned aerial vehicle combat decision system based on the deep reinforcement learning TD3 algorithm in the present embodiment is based on the same inventive concept as the above-mentioned unmanned aerial vehicle combat autonomous decision method based on the deep reinforcement learning TD3 algorithm, the corresponding contents in the method embodiment are also applicable to the present system embodiment, and are not described in detail herein.
In the description herein, references to the description of the term "one embodiment," "some embodiments," "an illustrative embodiment," "an example," "a specific example," or "some examples" or the like mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
While the preferred embodiments of the present invention have been described in detail, it will be understood by those skilled in the art that the foregoing and various other changes, omissions and deviations in the form and detail thereof may be made without departing from the scope of this invention.

Claims (10)

1. An unmanned aerial vehicle fighting autonomous decision-making method based on a deep reinforcement learning TD3 algorithm is characterized by comprising the following steps:
establishing an unmanned aerial vehicle motion model;
establishing an unmanned aerial combat model based on a Markov decision process according to the unmanned aerial vehicle motion model, wherein the unmanned aerial combat model is represented by a quadruple comprising a state space, an action space, a reward function and a discount factor, and the unmanned aerial vehicle motion model represents a state transfer function in the unmanned aerial combat model;
and training an unmanned aerial vehicle learning maneuvering strategy based on a TD3 algorithm according to the unmanned aerial vehicle air combat model.
2. The unmanned aerial vehicle combat autonomous decision method based on the deep reinforcement learning TD3 algorithm of claim 1, wherein the unmanned aerial vehicle motion model comprises a dynamics model and a kinematics model, and the establishing the unmanned aerial vehicle motion model comprises:
establishing a dynamic model of the unmanned aerial vehicle in an inertial coordinate system:
[Dynamics model: equations given as an image in the original publication]
wherein g represents the gravitational acceleration; v represents the speed of the unmanned aerial vehicle and satisfies the constraint v_min ≤ v ≤ v_max; the track inclination angle γ represents the angle between v and the horizontal plane, γ ∈ [-π/2, π/2]; the track deviation angle ψ represents the angle between the projection of v on the horizontal plane and the X axis, ψ ∈ (-π, π]; n_τ represents the tangential overload; n_f represents the normal overload; and μ represents the roll angle;
establishing a kinematic model of the unmanned aerial vehicle in the inertial coordinate system:
[Kinematics model: equations given as an image in the original publication]
wherein x, y and z represent the coordinates of the unmanned aerial vehicle in the inertial coordinate system.
3. The unmanned aerial vehicle combat autonomous decision method based on the deep reinforcement learning TD3 algorithm according to claim 2, wherein the state space comprises the own states of the enemy unmanned aerial vehicle and my unmanned aerial vehicle and their relative state.
4. The unmanned aerial vehicle fighting autonomous decision method based on the deep reinforcement learning TD3 algorithm according to claim 3, wherein the state space is constructed by:
setting the own states of the enemy unmanned aerial vehicle and my unmanned aerial vehicle:
S = [x_r, y_r, z_r, x_b, y_b, z_b, v_r, v_b, γ_r, γ_b, ψ_r, ψ_b, μ_r, μ_b]
setting the relative state of the enemy unmanned aerial vehicle and my unmanned aerial vehicle based on their own states:
S_rb = [D, α, β, v_r, v_b, γ_r, γ_b, ψ_r, ψ_b, μ_r, μ_b]
wherein x_r, y_r, z_r represent the coordinates of my unmanned aerial vehicle in three-dimensional space, x_b, y_b, z_b represent the coordinates of the enemy unmanned aerial vehicle in the three-dimensional space, v_r represents the speed of my unmanned aerial vehicle, v_b represents the speed of the enemy unmanned aerial vehicle, γ_r represents the track inclination angle of my unmanned aerial vehicle, γ_b represents the track inclination angle of the enemy unmanned aerial vehicle, ψ_r represents the track deviation angle of my unmanned aerial vehicle, ψ_b represents the track deviation angle of the enemy unmanned aerial vehicle, μ_r represents the roll angle of my unmanned aerial vehicle, μ_b represents the roll angle of the enemy unmanned aerial vehicle, D represents the relative distance between the enemy unmanned aerial vehicle and my unmanned aerial vehicle, the horizontal line-of-sight angle α represents the angle between the X axis and the projection on the horizontal plane of the line of sight between the two unmanned aerial vehicles, and the longitudinal line-of-sight angle β represents the angle between that line of sight and the horizontal plane.
5. The unmanned aerial vehicle fighting autonomous decision method based on the deep reinforcement learning TD3 algorithm according to claim 2, wherein the action space is constructed by the following formula:
A = [n_τ, n_f, ω]
wherein n_τ represents the tangential overload, n_f represents the normal overload, and ω represents the body roll rate.
6. The unmanned aerial vehicle fighting autonomous decision method based on the deep reinforcement learning TD3 algorithm according to claim 4, wherein the reward functions include a lock reward function, an angle advantage function, a distance advantage function, a height advantage function and a speed advantage function, wherein the lock reward function is:
[Lock reward function r_lock: given as an image in the original publication]
wherein D* represents the minimum distance between the two unmanned aerial vehicles at which my unmanned aerial vehicle successfully locks the enemy unmanned aerial vehicle, p* represents the maximum angle by which the velocity direction of my unmanned aerial vehicle may deviate from the direction toward the centre of mass of the enemy unmanned aerial vehicle while the lock is still satisfied, e* represents the maximum angle by which the velocity direction of the enemy unmanned aerial vehicle may deviate from the line of sight from my unmanned aerial vehicle to its centre of mass while the lock is still satisfied, D represents the relative distance between my unmanned aerial vehicle and the enemy unmanned aerial vehicle, p represents the angle between the velocity direction of my unmanned aerial vehicle and the direction toward the centre of mass of the enemy unmanned aerial vehicle, and e represents the angle between the velocity direction of the enemy unmanned aerial vehicle and the line-of-sight vector from my unmanned aerial vehicle toward its centre of mass;
the angle advantage function is:
[formula image FDA0003552139290000032]
the distance advantage function is:
[formula image FDA0003552139290000033]
the height advantage function is:
[formula image FDA0003552139290000041]
the speed advantage function is:
[formula image FDA0003552139290000042]
wherein v_r represents the speed of my drone, v_b represents the speed of the enemy drone, D_max represents the maximum detection distance of the drone, Δh represents the height difference between the enemy drone and my drone, v_max represents the maximum flight speed of the drone, and v_min represents the minimum flight speed of the drone;
the single-step reward function of the drone is:
R = r_lock + k_1·r_1 + k_2·r_2 + k_3·r_3 + k_4·r_4
wherein k_1 to k_4 represent weight values, and the sum of k_1 to k_4 is 1.
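A minimal sketch of the single-step reward composition and of the lock condition implied by the thresholds D*, p*, e*. The advantage functions and the exact values returned by r_lock are given only as formula images in the filing, so they are left as inputs here; the mapping of r_1..r_4 to the angle/distance/height/speed terms and the equal default weights are assumptions.

```python
def lock_achieved(D, p, e, D_star, p_star, e_star):
    """Lock condition implied by claim 6 (an assumption about how the
    thresholds are combined): the enemy drone counts as locked when the
    relative distance and both deviation angles are within their limits."""
    return D <= D_star and p <= p_star and e <= e_star


def single_step_reward(r_lock, r_angle, r_dist, r_height, r_speed,
                       k=(0.25, 0.25, 0.25, 0.25)):
    """R = r_lock + k1*r1 + k2*r2 + k3*r3 + k4*r4 with the weights summing
    to 1; the equal weights used here are placeholders only."""
    assert abs(sum(k) - 1.0) < 1e-9, "claim 6 requires k1 + k2 + k3 + k4 = 1"
    return r_lock + k[0] * r_angle + k[1] * r_dist + k[2] * r_height + k[3] * r_speed
```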
7. The unmanned aerial vehicle fighting autonomous decision method based on the deep reinforcement learning TD3 algorithm of claim 1, wherein the training of the unmanned aerial vehicle learning maneuver strategy based on the TD3 algorithm comprises:
step S1, initializing the parameters of the two evaluator networks Q_1 and Q_2, the parameter θ^μ of the actuator network, the corresponding target network parameters (those of the two evaluator target networks and θ'^μ), and an experience pool;
step S2, presetting the number of rounds, and executing the following steps in each round:
step S2-1, presetting the maximum limit step number of the unmanned aerial vehicle in each round;
step S2-2, the unmanned aerial vehicle selects an action according to the current state and the policy, and random noise is added to the action;
step S2-3, the unmanned aerial vehicle executes the action and obtains the next state and the reward through the state transition function;
step S2-4, storing the current state, the selected action, the reward and the next state obtained in steps S2-2 and S2-3 into the experience pool;
step S2-5, randomly extracting N samples from the experience pool;
step S2-6, calculating the expected return of the action through the two evaluator target networks, taking the smaller of the two Q values, and updating the parameters of the evaluator networks;
step S2-7, updating the parameters of the actuator network through a deterministic strategy gradient;
step S2-8, after updating the parameters of the evaluator network and the parameters of the actuator network, updating the parameters of the target network;
step S2-9, repeating steps S2-2 to S2-8 until the number of steps reaches the maximum limit step number, and then ending the round;
and step S3, finishing the training of the unmanned aerial vehicle learning maneuver strategy after all rounds are finished.
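A structural sketch of steps S1 to S3 in Python. The environment interface, network stand-ins and hyper-parameter values are assumptions inserted only so the skeleton runs; the evaluator, actuator and target-network updates of steps S2-6 to S2-8 are deferred to the stubs (TD3 conventionally delays the actor and target updates).

```python
import random
from collections import deque

import numpy as np

# Hypothetical stand-ins so the skeleton runs; replace them with the real
# air-combat environment and TD3 networks (all names here are assumptions).
STATE_DIM, ACTION_DIM = 11, 3
def env_reset():                       return np.zeros(STATE_DIM)
def env_step(state, action):           return np.zeros(STATE_DIM), 0.0   # next state, reward
def policy(state):                     return np.zeros(ACTION_DIM)       # actuator network mu(s)
def update_critics(batch):             pass   # step S2-6: min-Q target + MSE loss
def update_actor_and_targets(batch):   pass   # steps S2-7 and S2-8

replay = deque(maxlen=100_000)                                  # experience pool (step S1)
EPISODES, MAX_STEPS, BATCH, NOISE_STD = 500, 400, 64, 0.1       # assumed hyper-parameters

for episode in range(EPISODES):                                 # step S2: preset number of rounds
    state = env_reset()
    for step in range(MAX_STEPS):                               # step S2-1: step limit per round
        action = policy(state) + np.random.normal(0.0, NOISE_STD, ACTION_DIM)  # S2-2: add noise
        next_state, reward = env_step(state, action)            # S2-3: state transition + reward
        replay.append((state, action, reward, next_state))      # S2-4: store the transition
        if len(replay) >= BATCH:
            batch = random.sample(list(replay), BATCH)          # S2-5: random minibatch of N samples
            update_critics(batch)                               # S2-6: evaluator update
            update_actor_and_targets(batch)                     # S2-7 / S2-8 (TD3 usually delays these)
        state = next_state                                      # S2-9: repeat until the step limit
# step S3: training of the maneuver policy ends after all rounds
```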
8. The unmanned aerial vehicle fighting autonomous decision method based on the deep reinforcement learning TD3 algorithm according to claim 7, wherein the calculating the expected return of action through the two evaluator target networks, selecting a smaller Q value, and updating the parameters of the evaluator networks comprises:
learning and updating the network parameters of the evaluator, wherein the formula of the loss function L is as follows:
[formula image FDA0003552139290000051]
wherein s_i represents the current state, a_i represents the current action, and the target expected value y_i is obtained from the current real reward value r_i plus the next-step output value of the target networks multiplied by the discount factor λ; the formula for the target expected value y_i is:
[formula image FDA0003552139290000052]
wherein s_{i+1} represents the next-step state.
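A hedged sketch of the evaluator update in the standard TD3 form: both evaluator target networks are evaluated at the target actuator's smoothed next action, the smaller value defines y_i = r_i + λ·min(Q'_1, Q'_2), and each evaluator is regressed onto y_i with a mean-squared loss over the N samples. The network modules, tensor shapes (column vectors of size N×1) and the smoothing-noise parameters are assumptions; the exact formulas of the filing are the images referenced above.

```python
import torch
import torch.nn.functional as F

def critic_targets_and_losses(batch, critic1, critic2,
                              target_critic1, target_critic2, target_actor,
                              lam=0.99, noise_std=0.2, noise_clip=0.5):
    """Evaluator update in the standard TD3 form: build
    y_i = r_i + lambda * min(Q'_1, Q'_2) at the target actor's smoothed next
    action, then regress each evaluator onto y_i with a mean-squared loss."""
    s, a, r, s_next = batch               # tensors; r is expected as shape (N, 1)
    with torch.no_grad():
        noise = (torch.randn_like(a) * noise_std).clamp(-noise_clip, noise_clip)
        a_next = target_actor(s_next) + noise                 # target-policy smoothing (standard TD3)
        q_next = torch.min(target_critic1(s_next, a_next),
                           target_critic2(s_next, a_next))    # keep the smaller Q value
        y = r + lam * q_next                                  # target expected value y_i
    loss1 = F.mse_loss(critic1(s, a), y)                      # loss L for evaluator network Q1
    loss2 = F.mse_loss(critic2(s, a), y)                      # loss L for evaluator network Q2
    return loss1, loss2
```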
9. The unmanned aerial vehicle combat autonomous decision making method based on the deep reinforcement learning TD3 algorithm according to claim 8, wherein the updating of the parameters of the actuator network through the deterministic policy gradient includes:
learning and updating actuator network parameters, wherein a deterministic strategy gradient formula of the actuator network is as follows:
[formula image FDA0003552139290000053]
wherein N represents the number of samples randomly drawn from the experience pool, Q(s, a|θ^Q) represents the evaluator network, θ^Q represents the parameter of the evaluator network, μ(s|θ^μ) represents the actuator network, and θ^μ represents the parameter of the actuator network.
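A sketch of step S2-7 under standard TD3 conventions: the actuator μ(s|θ^μ) is updated by ascending the deterministic policy gradient, implemented as minimising −(1/N)·Σ Q_1(s_i, μ(s_i)). Using the first evaluator for the actuator loss is an assumption taken from usual TD3 practice, not something the claim specifies.

```python
import torch

def actor_update(batch, actor, critic1, actor_optimizer):
    """Deterministic policy gradient step: minimising -(1/N) * sum_i Q1(s_i, mu(s_i))
    follows the gradient (1/N) * sum_i grad_a Q(s,a)|a=mu(s_i) * grad_theta mu(s_i)."""
    s = batch[0]                                   # only the sampled states are needed
    actor_loss = -critic1(s, actor(s)).mean()      # negative mean Q over the N samples
    actor_optimizer.zero_grad()
    actor_loss.backward()
    actor_optimizer.step()
    return actor_loss.item()
```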
10. An unmanned aerial vehicle fighting autonomous decision making system based on a deep reinforcement learning TD3 algorithm is characterized by comprising:
the unmanned aerial vehicle motion model establishing unit is used for establishing an unmanned aerial vehicle motion model;
the unmanned aerial vehicle aerial combat model establishing unit is used for establishing an unmanned aerial vehicle aerial combat model based on a Markov decision process according to the unmanned aerial vehicle motion model, wherein the unmanned aerial vehicle aerial combat model is represented by a quadruple including a state space, an action space, a reward function and a discount factor;
and the unmanned aerial vehicle learning maneuvering strategy training unit is used for training the unmanned aerial vehicle learning maneuvering strategy based on a TD3 algorithm according to the unmanned aerial vehicle air combat model.
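Purely as an illustration of how the three claimed units could be composed; none of these class or method names appear in the patent.

```python
class UavCombatDecisionSystem:
    """Structural sketch of the three claimed units working together;
    the class and method names are illustrative, not from the patent."""

    def __init__(self, motion_model_unit, air_combat_model_unit, training_unit):
        self.motion_model_unit = motion_model_unit          # establishes the UAV motion model
        self.air_combat_model_unit = air_combat_model_unit  # builds the MDP quadruple (S, A, R, discount)
        self.training_unit = training_unit                  # trains the maneuver policy with TD3

    def run(self):
        motion_model = self.motion_model_unit.build()
        air_combat_model = self.air_combat_model_unit.build(motion_model)
        return self.training_unit.train(air_combat_model)
```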
CN202210264539.2A 2022-03-17 2022-03-17 Unmanned aerial vehicle fighting autonomous decision-making method based on deep reinforcement learning TD3 algorithm Pending CN114706418A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210264539.2A CN114706418A (en) 2022-03-17 2022-03-17 Unmanned aerial vehicle fighting autonomous decision-making method based on deep reinforcement learning TD3 algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210264539.2A CN114706418A (en) 2022-03-17 2022-03-17 Unmanned aerial vehicle fighting autonomous decision-making method based on deep reinforcement learning TD3 algorithm

Publications (1)

Publication Number Publication Date
CN114706418A true CN114706418A (en) 2022-07-05

Family

ID=82167866

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210264539.2A Pending CN114706418A (en) 2022-03-17 2022-03-17 Unmanned aerial vehicle fighting autonomous decision-making method based on deep reinforcement learning TD3 algorithm

Country Status (1)

Country Link
CN (1) CN114706418A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117420849A (en) * 2023-12-18 2024-01-19 山东科技大学 Marine unmanned aerial vehicle formation granularity-variable collaborative search and rescue method based on reinforcement learning
CN117420849B (en) * 2023-12-18 2024-03-08 山东科技大学 Marine unmanned aerial vehicle formation granularity-variable collaborative search and rescue method based on reinforcement learning

Similar Documents

Publication Publication Date Title
CN112947581B (en) Multi-unmanned aerial vehicle collaborative air combat maneuver decision method based on multi-agent reinforcement learning
CN113093802B (en) Unmanned aerial vehicle maneuver decision method based on deep reinforcement learning
CN113095481A (en) Air combat maneuver method based on parallel self-game
CN110531786B (en) Unmanned aerial vehicle maneuvering strategy autonomous generation method based on DQN
CN113050686B (en) Combat strategy optimization method and system based on deep reinforcement learning
CN114721424B (en) Multi-unmanned aerial vehicle cooperative countermeasure method, system and storage medium
CN112906233B (en) Distributed near-end strategy optimization method based on cognitive behavior knowledge and application thereof
Chai et al. A hierarchical deep reinforcement learning framework for 6-DOF UCAV air-to-air combat
CN113962012B (en) Unmanned aerial vehicle countermeasure strategy optimization method and device
CN113282061A (en) Unmanned aerial vehicle air game countermeasure solving method based on course learning
Ruan et al. Autonomous maneuver decisions via transfer learning pigeon-inspired optimization for UCAVs in dogfight engagements
CN115688268A (en) Aircraft near-distance air combat situation assessment adaptive weight design method
CN114756959A (en) Design method of aircraft short-distance air combat maneuver intelligent decision machine model
CN115903865A (en) Aircraft near-distance air combat maneuver decision implementation method
CN116432310A (en) Six-degree-of-freedom incompletely observable air combat maneuver intelligent decision model design method
CN116700079A (en) Unmanned aerial vehicle countermeasure occupation maneuver control method based on AC-NFSP
CN114063644A (en) Unmanned combat aircraft air combat autonomous decision method based on pigeon flock reverse confrontation learning
CN116820134A (en) Unmanned aerial vehicle formation maintaining control method based on deep reinforcement learning
CN114706418A (en) Unmanned aerial vehicle fighting autonomous decision-making method based on deep reinforcement learning TD3 algorithm
Luo et al. Multi-UAV cooperative maneuver decision-making for pursuit-evasion using improved MADRL
Guo et al. Maneuver decision of UAV in air combat based on deterministic policy gradient
CN114967732A (en) Method and device for formation and aggregation of unmanned aerial vehicles, computer equipment and storage medium
CN117313561B (en) Unmanned aerial vehicle intelligent decision model training method and unmanned aerial vehicle intelligent decision method
Baykal et al. An Evolutionary Reinforcement Learning Approach for Autonomous Maneuver Decision in One-to-One Short-Range Air Combat
CN116432030A (en) Air combat multi-intention strategy autonomous generation method based on deep reinforcement learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination