CN114519292A - Air-to-air missile over-shoulder launching guidance law design method based on deep reinforcement learning - Google Patents

Air-to-air missile over-shoulder launching guidance law design method based on deep reinforcement learning

Info

Publication number
CN114519292A
Authority
CN
China
Prior art keywords
missile
air
network
shoulder
reinforcement learning
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111550831.2A
Other languages
Chinese (zh)
Inventor
陈万春
龚晓鹏
陈中原
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beihang University
Original Assignee
Beihang University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beihang University filed Critical Beihang University
Priority to CN202111550831.2A priority Critical patent/CN114519292A/en
Publication of CN114519292A publication Critical patent/CN114519292A/en
Pending legal-status Critical Current

Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F 30/00 — Computer-aided design [CAD]
    • G06F 30/20 — Design optimisation, verification or simulation
    • G06F 30/27 — Design optimisation, verification or simulation using machine learning, e.g. artificial intelligence, neural networks, support vector machines [SVM] or training a model
    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F 2119/00 — Details relating to the type or aim of the analysis or the optimisation
    • G06F 2119/02 — Reliability analysis or reliability optimisation; Failure analysis, e.g. worst case scenario performance, failure mode and effects analysis [FMEA]
    • Y — GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 — TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T — CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T 90/00 — Enabling technologies or technologies with a potential or indirect contribution to GHG emissions mitigation

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Computer Hardware Design (AREA)
  • Geometry (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Aiming, Guidance, Guns With A Light Source, Armor, Camouflage, And Targets (AREA)

Abstract

The invention relates to a method for designing an over-the-shoulder launch guidance law for air-to-air missiles based on deep reinforcement learning, comprising the following steps. Step 1: perform normalized dynamics modeling of the over-the-shoulder launch; normalizing the model gives every state quantity a similar magnitude, which makes the weight updates of the neural network more stable. Step 2: to fit the reinforcement learning paradigm, the problem of step 1 is modeled as a Markov decision process. Step 3: build the algorithm networks and set the algorithm parameters. Step 4: until training reaches the target reward value or the maximum number of steps, the agent keeps collecting state-transition data and rewards according to the PPO algorithm and iteratively updates the parameters of the Actor and Critic networks. By applying the technical scheme of the invention, the missile obtains a suboptimal and robust angle-of-attack guidance law in a complex aerodynamic environment; the limits of different missile maneuvering capabilities are taken into account, and the method has practical value for future air combat.

Description

Air-to-air missile over-shoulder launching guidance law design method based on deep reinforcement learning
Technical Field
The invention relates to the field of aircraft control, and in particular to a method for designing an over-the-shoulder launch guidance law for air-to-air missiles based on deep reinforcement learning.
Background
In modern air combat, as fighter maneuverability keeps improving, close-range engagement scenarios are becoming increasingly complex. To improve a fighter's effectiveness in close combat, the over-the-shoulder launch mode, which allows attacking targets in the rear hemisphere, is an important research topic. An air-to-air missile launched over the shoulder can rapidly reverse its flight direction after launch and enters the terminal guidance phase once the seeker locks onto the target, giving the missile all-aspect attack capability. However, during the high angle-of-attack turn the missile faces complex aerodynamic phenomena such as asymmetric vortex shedding and induced moments; it is a typical strongly nonlinear system with high uncertainty, which places higher demands on the missile guidance and control system.
Current research on over-the-shoulder launch focuses mainly on robust autopilot design, and relatively little work addresses guidance law design. A commonly adopted approach is to let the autopilot track an offline-optimized trajectory or a constant angle of attack, but under a complex aerodynamic environment and a rapidly changing air combat situation the target is easily lost after an over-the-shoulder launch. A suitable guidance law can adapt to the dynamic changes of the battlefield, reduce the design burden on the autopilot, and improve the overall robustness of the missile guidance and control system.
Further taking into account the maneuvering capability of current missile models and their future development potential, the guidance law designed with deep reinforcement learning allows the maximum available angle of attack of the missile to be set conveniently, which broadens the possible range of application and the practical feasibility of the invention. Facing increasingly complex air combat environments and highly maneuverable fighters, the intelligent guidance law provided by the invention has important application value.
Disclosure of Invention
The main purpose of the invention is to provide a method for designing an over-the-shoulder launch guidance law for air-to-air missiles based on deep reinforcement learning, so as to address at least the problems described in the background above.
According to one aspect of the invention, a method for designing an over-the-shoulder launch guidance law for air-to-air missiles based on deep reinforcement learning is provided, comprising the following steps:
Step 1: perform normalized dynamics modeling of the over-the-shoulder launch. Normalizing the model gives every state quantity a similar magnitude, which makes the weight updates of the neural network more stable. First, the missile over-the-shoulder launch scenario is modeled, yielding the dynamics equations in the velocity (wind) frame, the kinematic equations in the inertial frame, and an equation accounting for the mass change.
Step 2: further, to fit the reinforcement learning paradigm, the problem of step 1 is modeled as a Markov decision process. The specific procedure comprises steps 201 to 203.
Step 201: set the action space. To ensure the dynamic stability of the system, the first derivative of the angle of attack α, i.e. $\dot{\alpha}$, is selected as the system input. In addition, taking $\dot{\alpha}$ as the action makes it convenient to satisfy the missile maneuvering capability limit. However, as the maneuvering capability of air-to-air missiles develops, especially with the assistance of thrust vectoring or reaction jets, the available angle-of-attack limit can be removed.
Step 202: set the state space and observation space. On the basis of the action set in step 201, the state space and observation space of the agent are set; not all states of the system are meaningful for deciding the control command. Redundant observations lead to unstable training, while insufficient observations tend to cause the training not to converge at all.
Step 203: set the reward function. The reward function strongly influences the final training result. To avoid reward sparsity, the reward function is designed as a dense shaping function of the deviation between the missile trajectory inclination angle $\theta_M$ and the desired turning angle $\bar{\theta}_c$, where $\lambda_1, \lambda_2, \lambda_3$ are hyperparameters to be set that adjust the proportions among the terms. To improve the final turning accuracy, an additional reward $r_{bonus}$ is introduced,

$r_{bonus} = \begin{cases} r_b, & |\theta_M - \bar{\theta}_c| \le \theta_{thre} \\ 0, & \text{otherwise} \end{cases}$

where $r_b$ is the extra reward granted when the accuracy condition is satisfied; $r_b$ must be coordinated with the preceding terms so that the agent obtains an appropriate reward within the desired accuracy $\theta_{thre}$.
Step 3: build the algorithm networks and set the algorithm parameters. The deep reinforcement learning algorithm selected in the invention is Proximal Policy Optimization (PPO); the algorithm comprises an Actor network and a Critic network, and the network weight parameters are initialized randomly.
Step 4: until training reaches the target reward value or the maximum number of steps, the agent keeps collecting state-transition data and rewards according to the PPO algorithm and iteratively updates the parameters of the Actor and Critic networks. Specifically, this comprises steps 401 to 405.
Step 401: under the current policy $\pi_{\theta_k}$, collect trajectory data and cache them in the experience pool until the pool is full. At each simulation step, for the current observation $o_t$, execute the current policy $\pi_{\theta_k}$ to obtain the current action $a_t$, integrate the system dynamics to obtain the next state $s_{t+1}$ and observation $o_{t+1}$, and obtain the reward $r_t$.
Step 402: estimate the advantage function $\hat{A}_t$ with generalized advantage estimation (GAE). The final optimization objective is

$J^{PPO}(\theta) = \hat{\mathbb{E}}_t\left[J^{CLIP}(\theta) - c_{vf} L^{VF}(\theta) + c_s S[\pi_\theta](o_t)\right]$

where $c_{vf}$ and $c_s$ are hyperparameters that adjust the proportions of the terms, $J^{CLIP}(\theta)$ is the probability-clipped objective that increases the probability of more advantageous actions, $L^{VF}(\theta)$ is the value-function loss term, and $S[\pi_\theta]$ is the maximum-entropy term that encourages exploration.
Step 403: draw trajectory data from the experience pool in minibatches and optimize the parameters of the Actor and Critic networks on the objective $J^{PPO}(\theta)$ by stochastic gradient steps until the data in the experience pool have been used for K epochs of updates.
Step 404: taking into account the randomness of the initial turning command, compare the expected cumulative rewards obtained by the new and old policies and update the finally output network parameters.
Step 405: repeat steps 401 to 404 until training reaches the target reward value or the maximum number of training steps. The resulting Actor network, as the final policy network, can be deployed directly on the missile-borne computer to generate angle-of-attack guidance commands in real time.
The advantages and beneficial effects of the invention are as follows: by applying the technical scheme of the invention, the missile obtains a suboptimal and robust angle-of-attack guidance law in a complex aerodynamic environment; the limits of different missile maneuvering capabilities are taken into account, and the method has practical value for future air combat.
Drawings
Fig. 1 is a schematic diagram of the planar engagement geometry of an air-to-air missile over-the-shoulder launch, provided according to an embodiment of the invention.
Fig. 2 is a schematic diagram of the interaction between an agent using the PPO algorithm and the environment, according to an embodiment of the invention.
Fig. 3 shows the missile learning curves with limited and with unlimited maneuvering capability, respectively, according to an embodiment of the invention.
Fig. 4a shows the convergence of the missile turning angle with limited maneuvering capability.
Fig. 4b shows the convergence of the missile turning angle with unlimited maneuvering capability.
Fig. 5a shows the missile velocity versus time for the maneuverability-limited agent and the optimal solution, according to an embodiment of the invention.
Fig. 5b shows the missile angle of attack versus time for the maneuverability-limited agent and the optimal solution.
Fig. 5c shows the missile trajectory inclination angle versus time for the maneuverability-limited agent and the optimal solution.
Fig. 5d shows the trajectory in the longitudinal plane for the maneuverability-limited agent and the optimal solution.
Fig. 6a shows the missile velocity versus time for the agent with unlimited maneuvering capability and the optimal solution.
Fig. 6b shows the missile angle of attack versus time for the agent with unlimited maneuvering capability and the optimal solution.
Fig. 6c shows the missile trajectory inclination angle versus time for the agent with unlimited maneuvering capability and the optimal solution.
Fig. 6d shows the trajectory in the longitudinal plane for the agent with unlimited maneuvering capability and the optimal solution.
Fig. 7a shows the terminal trajectory inclination deviation and angle-of-attack distribution with limited missile maneuvering capability.
Fig. 7b shows the terminal trajectory inclination deviation and angle-of-attack distribution with unlimited missile maneuvering capability.
Detailed Description
It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. The following description of at least one exemplary embodiment is merely illustrative in nature and is in no way intended to limit the invention, its application, or uses. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Step 1: perform normalized dynamics modeling of the over-the-shoulder launch. Normalizing the model gives every state quantity a similar magnitude, which makes the weight updates of the neural network more stable. First, the missile over-the-shoulder launch scenario is modeled, yielding the following dynamics equations in the velocity (wind) frame, kinematic equations in the inertial frame, and equation accounting for the mass change:
$\dot{\bar{V}} = \dfrac{1}{V^*}\left[\dfrac{P\cos\alpha\,u_p - F_D}{m} - g\sin\theta\right]$

$\dot{\bar{\theta}} = \dfrac{1}{\theta^*}\left[\dfrac{P\sin\alpha\,u_p + T_{rcs}u_{rcs} + F_L}{mV} - \dfrac{g\cos\theta}{V}\right]$

$\dot{\bar{x}} = \dfrac{V\cos\theta}{x^*}, \qquad \dot{\bar{y}} = \dfrac{V\sin\theta}{y^*}$

$\dot{m} = -m_c$
where $\bar{V} = V/V^*$ is the normalized missile flight speed, $\bar{\theta} = \theta/\theta^*$ is the normalized trajectory inclination angle, $\bar{x} = x/x^*$ is the normalized horizontal coordinate, $\bar{y} = y/y^*$ is the normalized vertical coordinate, $\dot{\bar{V}}, \dot{\bar{\theta}}, \dot{\bar{x}}, \dot{\bar{y}}$ are the rates of change of the foregoing quantities, and $V^*, \theta^*, x^*, y^*$ are the corresponding normalization factors. In addition, α is the missile angle of attack, P is the main-engine thrust, $T_{rcs}$ is the reaction-jet thrust, $u_p$ and $u_{rcs}$ are the on-off logic variables of the main engine and the reaction jet respectively, $F_D$ and $F_L$ are the highly uncertain drag and lift, m is the missile mass, $m_c$ is the mass consumption rate, and g is the gravitational acceleration constant.
Step 2: further, to fit the reinforcement learning paradigm, the problem of step 1 is modeled as a Markov decision process. The specific procedure comprises steps 201 to 203.
Step 201: set the action space. To ensure the dynamic stability of the system, the first derivative of the angle of attack α, i.e. $\dot{\alpha}$, is selected as the system input. In addition, taking $\dot{\alpha}$ as the action makes it convenient to satisfy the missile maneuvering capability limit: if the missile has an available angle-of-attack limit, i.e. $|\alpha| \le \alpha_{max}$, where $\alpha_{max}$ is the angle-of-attack limit, then the angle of attack obtained by integrating $\dot{\alpha}$ is kept within $[-\alpha_{max}, \alpha_{max}]$. However, as the maneuvering capability of air-to-air missiles develops, especially with the assistance of thrust vectoring or reaction jets, the available angle-of-attack limit can be removed.
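To make steps 1 and 201 concrete, the following is a minimal sketch of one integration step of the planar model reconstructed above, with the angle-of-attack rate as the action and the angle of attack saturated at $\alpha_{max}$. It assumes simple Euler integration; the drag() and lift() placeholders and all numerical constants are illustrative, since the patent computes the aerodynamic coefficients with CFD and does not list them here.

```python
import numpy as np

def drag(v, alpha):
    # Placeholder: the patent obtains the high angle-of-attack drag from CFD, not a formula.
    return 0.0

def lift(v, alpha):
    # Placeholder: the patent obtains the high angle-of-attack lift from CFD, not a formula.
    return 0.0

def step(state, alpha_dot, dt, p, t_rcs, u_p, u_rcs, m_c,
         v_star, th_star, x_star, y_star,
         alpha_max=np.deg2rad(90.0), g=9.81):
    """One Euler step of the normalized planar model; the action is the angle-of-attack rate."""
    v_bar, th_bar, x_bar, y_bar, m, alpha = state
    v, th = v_bar * v_star, th_bar * th_star            # recover physical speed and inclination
    f_d, f_l = drag(v, alpha), lift(v, alpha)

    dv_bar = ((p * np.cos(alpha) * u_p - f_d) / m - g * np.sin(th)) / v_star
    dth_bar = ((p * np.sin(alpha) * u_p + t_rcs * u_rcs + f_l) / (m * v)
               - g * np.cos(th) / v) / th_star
    dx_bar = v * np.cos(th) / x_star
    dy_bar = v * np.sin(th) / y_star
    dm = -m_c                                           # mass change while the propellant burns

    alpha_new = np.clip(alpha + alpha_dot * dt, -alpha_max, alpha_max)  # available angle-of-attack limit
    return np.array([v_bar + dv_bar * dt, th_bar + dth_bar * dt,
                     x_bar + dx_bar * dt, y_bar + dy_bar * dt,
                     m + dm * dt, alpha_new])
```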
Step 202: set the state space and observation space. With the action set in step 201, the state space of the system is the set of normalized dynamics states augmented with the angle of attack α, but not all states of the system are meaningful for deciding the control command. Redundant observations lead to unstable training, while insufficient observations tend to cause the training not to converge. In the present invention, the observation space is set as a subset of the state quantities together with the error relative to the desired turning angle $\bar{\theta}_c$.
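For illustration only, the snippet below assembles one possible low-dimensional observation of the kind described in step 202. The exact components chosen in the patent appear only in its figures, so this particular vector (normalized speed, angle of attack, remaining turn-angle error) is an assumption.

```python
import numpy as np

def make_observation(v_bar, alpha, theta_m, theta_c):
    # Hypothetical observation vector: normalized speed, angle of attack,
    # and the remaining error between the desired turning angle and the
    # current trajectory inclination angle.
    return np.array([v_bar, alpha, theta_c - theta_m], dtype=np.float32)
```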
Step 203: set the reward function. The reward function strongly influences the final training result. To avoid reward sparsity, the reward function is designed as a dense shaping function of the deviation between the missile trajectory inclination angle $\theta_M$ and the desired turning angle $\bar{\theta}_c$, where $\lambda_1, \lambda_2, \lambda_3$ are hyperparameters to be set that adjust the proportions among the terms. To improve the final turning accuracy, an additional reward $r_{bonus}$ is introduced,

$r_{bonus} = \begin{cases} r_b, & |\theta_M - \bar{\theta}_c| \le \theta_{thre} \\ 0, & \text{otherwise} \end{cases}$

where $r_b$ is coordinated with the preceding terms so that the agent obtains an appropriate reward within the desired accuracy $\theta_{thre}$.
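A sketch of the reward structure of step 203, assuming the dense term is a shaping of the remaining turn-angle error; the exact shaping expression appears only in the patent's figures, so the exponential form below is illustrative. The λ weights use the example values given later in the embodiment (-0.1, 2, 1.5), while θ_thre and r_b are hypothetical.

```python
import numpy as np

def reward(theta_m, theta_c, lam1=-0.1, lam2=2.0, lam3=1.5,
           theta_thre=np.deg2rad(1.0), r_b=10.0):
    """Dense shaping term plus the sparse bonus r_bonus near the desired accuracy."""
    err = abs(theta_c - theta_m)
    r_dense = lam1 + lam2 * np.exp(-lam3 * err)   # hypothetical shaping of the turn-angle error
    r_bonus = r_b if err <= theta_thre else 0.0   # extra reward once the accuracy condition is met
    return r_dense + r_bonus
```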
Step 3: build the algorithm networks and set the algorithm parameters. The deep reinforcement learning algorithm selected in the invention is Proximal Policy Optimization (PPO); the algorithm comprises an Actor network and a Critic network, and the network weight parameters are initialized randomly.
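A minimal PyTorch sketch of the Actor (Gaussian policy) and Critic (state-value) networks used by PPO in step 3. The layer sizes and activations are placeholders: the actual architecture is listed in Table 2 of the embodiment, which is not reproduced here.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Outputs the mean of a Gaussian over the action (alpha_dot); the std is a learned parameter."""
    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh(),
                                  nn.Linear(hidden, hidden), nn.Tanh(),
                                  nn.Linear(hidden, act_dim))
        self.log_std = nn.Parameter(torch.zeros(act_dim))  # exploration noise level

    def forward(self, obs):
        mean = self.body(obs)
        return torch.distributions.Normal(mean, self.log_std.exp())

class Critic(nn.Module):
    """Estimates the state value used to compute advantages."""
    def __init__(self, obs_dim, hidden=64):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh(),
                                  nn.Linear(hidden, hidden), nn.Tanh(),
                                  nn.Linear(hidden, 1))

    def forward(self, obs):
        return self.body(obs).squeeze(-1)
```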
Step 4: until training reaches the target reward value or the maximum number of steps, the agent keeps collecting state-transition data and rewards according to the PPO algorithm and iteratively updates the parameters of the Actor and Critic networks. Specifically, this comprises steps 401 to 405.
Step 401: at each simulation step, based on the current observation $o_t$, the current policy is executed to obtain the probability mean of the current action, i.e. $\mu_t = \pi_{\theta_k}(o_t)$; the current action $a_t$ is obtained by sampling from the Gaussian distribution $\mathcal{N}(\mu_t, \sigma^2)$, the system dynamics $f(x_t, a_t, t)$ are integrated to obtain the next state $s_{t+1}$ and observation $o_{t+1}$, and the reward $r_t$ is computed. This continues until the turn is over, yielding a set of trajectories $\{s_0, o_0, a_0, r_1, s_1, o_1, a_1, r_2, s_2, \dots\}$. Under the current policy $\pi_{\theta_k}$, trajectory data are thus collected and cached in the experience pool, which may hold the trajectories of several rounds, until the pool is full.
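A sketch of the data-collection loop of step 401: the current policy is run in the environment and transitions are cached until the experience pool holds the required number of samples. The env object with reset()/step() is an assumed wrapper around the dynamics sketched earlier.

```python
import torch

def collect_rollouts(env, actor, critic, pool_size):
    """Run the current policy and cache (obs, action, log-prob, reward, value, done) tuples."""
    buf = {k: [] for k in ("obs", "act", "logp", "rew", "val", "done")}
    obs = env.reset()
    while len(buf["obs"]) < pool_size:
        obs_t = torch.as_tensor(obs, dtype=torch.float32)
        with torch.no_grad():
            dist = actor(obs_t)                  # Gaussian over the angle-of-attack rate
            act = dist.sample()                  # stochastic action encourages exploration
            logp = dist.log_prob(act).sum()
            val = critic(obs_t)
        next_obs, rew, done = env.step(act.numpy())
        for k, v in zip(("obs", "act", "logp", "rew", "val", "done"),
                        (obs, act, logp, rew, val, done)):
            buf[k].append(v)
        obs = env.reset() if done else next_obs  # a finished turn ends the episode
    return buf
```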
Step 402: the advantage function $\hat{A}_t$ is estimated with generalized advantage estimation (GAE), and the objective function of the network update is computed in clipped (truncated) form to increase the probability of more advantageous actions:

$J^{CLIP}(\theta) = \hat{\mathbb{E}}_t\left[\min\left(r_t(\theta)\hat{A}_t,\ \mathrm{clip}\left(r_t(\theta), 1-\epsilon, 1+\epsilon\right)\hat{A}_t\right)\right]$

where $\hat{\mathbb{E}}_t[\cdot]$ denotes the expectation, $r_t(\theta)$ is the probability ratio characterizing the new and old policies so as to keep the update step within the trust region, $\mathrm{clip}(\cdot)$ is the truncation function, and $\epsilon$ is a hyperparameter acting as the truncation factor. Further, to improve the accuracy of the Critic network's value estimate, the value-function loss term $L^{VF}(\theta)$ and the maximum-entropy term $S[\pi_\theta]$ that encourages exploration are added to $J^{CLIP}(\theta)$, giving the final optimization objective

$J^{PPO}(\theta) = \hat{\mathbb{E}}_t\left[J^{CLIP}(\theta) - c_{vf} L^{VF}(\theta) + c_s S[\pi_\theta](o_t)\right]$

where $c_{vf}$ and $c_s$ are hyperparameters that adjust the proportions of the terms.
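A sketch of step 402: generalized advantage estimation followed by the clipped surrogate objective with the value-loss and entropy terms. Only ε, c_vf and c_s are named in the patent; γ, λ and the specific coefficient values below are assumed.

```python
import torch

def gae(rewards, values, dones, gamma=0.99, lam=0.95):
    """Generalized advantage estimation (GAE-lambda) over one buffer of transitions."""
    adv, running = [0.0] * len(rewards), 0.0
    for t in reversed(range(len(rewards))):
        next_v = 0.0 if (dones[t] or t + 1 >= len(values)) else values[t + 1]
        delta = rewards[t] + gamma * next_v - values[t]
        running = delta + gamma * lam * (0.0 if dones[t] else running)
        adv[t] = running
    return torch.tensor(adv, dtype=torch.float32)

def ppo_objective(actor, critic, obs, act, logp_old, adv, ret,
                  eps=0.3, c_vf=0.5, c_s=0.01):
    """J_PPO = clipped surrogate - c_vf * value loss + c_s * entropy (to be maximized)."""
    dist = actor(obs)
    logp = dist.log_prob(act).sum(-1)
    ratio = torch.exp(logp - logp_old)                     # r_t(theta): new/old probability ratio
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)
    j_clip = torch.min(ratio * adv, clipped * adv).mean()  # raises the probability of advantageous actions
    v_loss = ((critic(obs) - ret) ** 2).mean()             # value-function loss term
    entropy = dist.entropy().sum(-1).mean()                # maximum-entropy term encouraging exploration
    return j_clip - c_vf * v_loss + c_s * entropy
```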
Step 403: trajectory data are drawn from the experience pool in minibatches of size $N_B$, and the objective $J^{PPO}(\theta)$ is optimized over the parameters of the Actor and Critic networks by stochastic gradient steps until the data in the experience pool have been used for K epochs of updates. The network parameters are updated according to

$\theta \leftarrow \theta + \alpha_{LR}\nabla_\theta J^{PPO}(\theta)$

where $\alpha_{LR}$ is the learning rate.
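A sketch of the minibatch update of step 403, using the ppo_objective above: the objective is maximized by stochastic gradient ascent, implemented as Adam on its negative, for K epochs over the experience pool. N_B, K and the learning rate follow the example values given later in the embodiment (2^10, 4 and 10^-4); the returns in batch["ret"] are assumed to be the advantages plus the value estimates.

```python
import torch

def ppo_update(actor, critic, batch, n_epochs=4, minibatch=1024, lr=1e-4):
    """Gradient ascent on J_PPO for K epochs of minibatches drawn from the experience pool."""
    params = list(actor.parameters()) + list(critic.parameters())
    opt = torch.optim.Adam(params, lr=lr)
    n = batch["obs"].shape[0]
    for _ in range(n_epochs):                              # K epochs over the whole pool
        for mb in torch.randperm(n).split(minibatch):      # shuffled minibatches of size N_B
            j = ppo_objective(actor, critic,
                              batch["obs"][mb], batch["act"][mb], batch["logp"][mb],
                              batch["adv"][mb], batch["ret"][mb])
            opt.zero_grad()
            (-j).backward()                                # maximize J_PPO  <=>  minimize -J_PPO
            opt.step()
```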
Step 404: taking into account the randomness of the initial turning command, the expected cumulative rewards obtained by the new and old policies are compared. If

$\mathbb{E}\left[R(\theta)\right] \ge \mathbb{E}\left[R(\theta^-)\right]$

where $\mathbb{E}[\cdot]$ denotes the expectation, $R(\theta^-)$ is the episode cumulative reward obtained under the finally output policy $\pi_{\theta^-}$, and $R(\theta)$ is the episode cumulative reward obtained under the policy $\pi_\theta$ updated in step 403, then the network parameters of the final policy are set to $\theta^- = \theta$.
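A sketch of the selection rule of step 404: because the initial turning command is randomized, the expected episode return of the newly updated policy is estimated over several evaluation rounds and compared with that of the current best policy, and the better parameter set is kept as the finally output policy. The evaluation episode count and the deterministic-mean helper are assumptions.

```python
import numpy as np
import torch

def episode_return(env, actor):
    """Cumulative reward of one evaluation round under the deterministic (mean) policy."""
    obs, done, total = env.reset(), False, 0.0
    while not done:
        with torch.no_grad():
            act = actor(torch.as_tensor(obs, dtype=torch.float32)).mean  # no exploration noise
        obs, rew, done = env.step(act.numpy())
        total += rew
    return total

def select_policy(env, actor_new, actor_best, episodes=20):
    """Keep the parameter set with the larger expected cumulative reward as theta_minus."""
    r_new = np.mean([episode_return(env, actor_new) for _ in range(episodes)])
    r_best = np.mean([episode_return(env, actor_best) for _ in range(episodes)])
    return actor_new if r_new >= r_best else actor_best
```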
Step 405: repeat steps 401 to 404 until training reaches the target reward value or the maximum number of training steps. The resulting Actor network, as the final policy network, can be deployed directly on the missile-borne computer to generate angle-of-attack guidance commands in real time.
Because reinforcement learning follows an off-line training, on-line deployment paradigm, the training part, which has high computational requirements, is completed at a ground workstation; the finally obtained policy network $\pi_{\theta^*}$ is essentially a sequence of matrix operations and activation-function evaluations, occupies little memory and few computing resources, and can meet the real-time requirements of online computation.
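A sketch of the on-board use of the trained Actor network: after off-line training, generating a command is a single forward pass (a few matrix products and activations), with the Gaussian mean taken deterministically and integrated into the angle-of-attack command. The file name, time step and limit value are placeholders.

```python
import numpy as np
import torch

actor = torch.load("actor_final.pt")        # hypothetical file holding the trained policy network
actor.eval()

def guidance_command(obs, alpha_prev, dt=0.01, alpha_max=np.deg2rad(90.0)):
    """One real-time guidance step: observation -> alpha_dot -> integrated alpha command."""
    with torch.no_grad():
        alpha_dot = actor(torch.as_tensor(obs, dtype=torch.float32)).mean.item()  # deterministic mean
    return float(np.clip(alpha_prev + alpha_dot * dt, -alpha_max, alpha_max))
```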
In addition, the Proximal Policy Optimization (PPO) algorithm adopted by the invention is a model-free, policy-gradient reinforcement learning algorithm that is commonly used to generate policies over continuous action spaces and has shown excellent performance in many fields. Because the PPO algorithm does not need to model the system during training but makes decisions directly from the policy network, it is well suited to scenarios in which the missile's aerodynamic parameters are unstable and disturbances are strong at high angles of attack, and it can exhibit robustness far beyond that of conventional guidance laws. The PPO algorithm is highly stable, insensitive to its parameters, and simple to implement, making it suitable for control problems of complex systems that are difficult to model and subject to strong disturbance noise, uncertainty, and nonlinearity. PPO adopts an Actor-Critic architecture comprising an Actor network and a Critic network, and optimizes the Actor policy by maximizing the cumulative reward. The Actor network outputs the mean of a Gaussian distribution over the action given the current observation; setting the noise appropriately during training encourages exploration and prevents premature convergence to a local optimum, while no noise is added during deployment and verification so as to ensure stable execution of the policy. The Critic network evaluates the Actor network's action under the current observation and serves as the basis for optimizing the Actor network. Therefore, during iterative training both the Actor and Critic networks are trained, whereas only the Actor network, not the Critic network, is needed for deployment and verification. The deep neural networks are trained by drawing the trajectory data stored in the experience pool in minibatches and training for K epochs. It should be noted that although the PPO algorithm uses an experience pool, it is an on-policy rather than an off-policy algorithm, because the trajectory data stored in the pool are all obtained under the current policy $\pi_{\theta_k}$ and not under other policies.
For a further understanding of the present invention, the method for designing an over-the-shoulder launch guidance law for air-to-air missiles based on deep reinforcement learning is described in detail below with reference to the accompanying drawings.
Step 1: perform normalized dynamics modeling of the over-the-shoulder launch. The engagement geometry is shown in Fig. 1, where $R_{TM}$ is the distance between the missile and the rear-hemisphere target and RCS denotes the reaction jet control; an ideal autopilot is assumed in this example. The aerodynamic parameters of the missile at high angle of attack (mainly the lift and drag coefficients in this example) carry large uncertainties and are difficult to obtain from empirical formulas; in this example they are computed with the computational fluid dynamics software Fluent. The initial conditions for training in each round are shown in Table 1. In addition, the commanded turning angle $\bar{\theta}_c$ given to the agent is constant within a round, but at each round initialization $\bar{\theta}_c$ takes a random value in $[30°, 180°]$ or $[-180°, -30°]$.
Table 1. Initial conditions of the training scenario
Step 2: to fit the reinforcement learning paradigm, the problem of step 1 is further modeled as a Markov decision process. The specific procedure comprises steps 201 to 203.
Step 201: the first derivative of the angle of attack α, i.e. $\dot{\alpha}$, is selected as the action. If the missile has an available angle-of-attack limit, i.e. $|\alpha| \le \alpha_{max}$, where $\alpha_{max}$ is the angle-of-attack limit, then $\alpha_{max} = 90°$ is taken in this example and the angle of attack obtained by integrating $\dot{\alpha}$ is kept within $[-\alpha_{max}, \alpha_{max}]$; however, as the maneuvering capability of air-to-air missiles develops, especially with the assistance of thrust vectoring or reaction jets, the available angle-of-attack limit can be removed.
Step 202: the state space and observation space of the dynamic system are set on the basis of step 201. The state of the system is the set of normalized dynamics states augmented with the angle of attack α, and the observation space is set as a subset of the state quantities together with the error relative to the desired turning angle $\bar{\theta}_c$.
Step 203: set the reward function. The reward function strongly influences the final training result. To avoid reward sparsity, the reward function is designed as a dense shaping function of the turn-angle error weighted by the hyperparameters $\lambda_1, \lambda_2, \lambda_3$, which are set to -0.1, 2, and 1.5 respectively in this example; the instantaneous reward is computed accordingly at every step. To improve the final turning accuracy, an additional reward $r_{bonus}$ is introduced, which takes the value $r_b$ when the turn-angle error is within the desired accuracy and 0 otherwise, so that an appropriate reward is maintained.
Step 3: build the algorithm networks and set the algorithm parameters. The deep reinforcement learning algorithm selected in the invention is Proximal Policy Optimization (PPO), which comprises an Actor network and a Critic network. Further, because the turning command $\bar{\theta}_c$ takes random values in the two intervals $[30°, 180°]$ and $[-180°, -30°]$, two Actor networks are set up and trained in parallel, and the network weight parameters are initialized randomly. The structures of the Actor and Critic networks are shown in Table 2.
Table 2. Network architecture parameters
Step 4: until training reaches the target reward value or the maximum number of steps, the agent keeps collecting state-transition data and rewards according to the PPO algorithm and iteratively updates the parameters of the Actor and Critic networks; specifically, this comprises steps 401 to 405. A schematic diagram of the agent interacting with the environment to collect data and update the network parameters is shown in Fig. 2. If the target's initial position lies within $[-30°, 30°]$, the missile does not need to make a large-maneuver turn, and an overload command $n_c$ is given directly by a classical guidance law such as proportional navigation; if the missile needs to make a high angle-of-attack turn, the policy output $\dot{\alpha}$ is passed through an integrator to obtain the angle-of-attack command $\alpha_c$, which is then tracked by the missile autopilot. The autopilot actuators use thrust vector control (TVC) or the reaction control system (RCS) during the large maneuver, while aerodynamic rudders may be added in the small angle-of-attack terminal guidance phase.
Step 401: under the current policy $\pi_{\theta_k}$, trajectory data are collected and cached in the experience pool, with the pool size taken as $N_D = 2^{12}$. At each simulation step, for the current observation $o_t$, the current policy $\pi_{\theta_k}$ is executed to obtain the current action $a_t$, i.e. the angle-of-attack rate $\dot{\alpha}$; the system dynamics are integrated to obtain the next state $s_{t+1}$ and observation $o_{t+1}$, and the reward $r_t$ is obtained. The trajectories of several rounds may be cached in the experience pool until it is full.
Step 402: the advantage function $\hat{A}_t$ is estimated with generalized advantage estimation (GAE), and the objective function of the network update is computed in clipped form to increase the probability of more advantageous actions:

$J^{CLIP}(\theta) = \hat{\mathbb{E}}_t\left[\min\left(r_t(\theta)\hat{A}_t,\ \mathrm{clip}\left(r_t(\theta), 1-\epsilon, 1+\epsilon\right)\hat{A}_t\right)\right]$

where $r_t(\theta)$ is the probability ratio characterizing the new and old policies so as to keep the update step within the trust region, and the truncation factor is taken as $\epsilon = 0.3$. Further, to improve the accuracy of the Critic network's value estimate, the value-function loss term $L^{VF}(\theta)$ and the maximum-entropy term $S[\pi_\theta]$ that encourages exploration are added to $J^{CLIP}(\theta)$, giving the final optimization objective

$J^{PPO}(\theta) = \hat{\mathbb{E}}_t\left[J^{CLIP}(\theta) - c_{vf} L^{VF}(\theta) + c_s S[\pi_\theta](o_t)\right]$

where $c_{vf}$ and $c_s = 0.01$ are hyperparameters that adjust the proportions of the terms; $c_{vf}$ is scaled by the reward standard deviation $\sigma(R)$, i.e. it is a dynamically adjusted coefficient.
Step 403: trajectory data are drawn from the experience pool in minibatches of size $N_B = 2^{10}$, and the objective $J^{PPO}(\theta)$ is optimized over the parameters of the Actor and Critic networks by stochastic gradient steps until the data in the experience pool have been used for K = 4 epochs of updates. The network parameters are updated according to

$\theta \leftarrow \theta + \alpha_{LR}\nabla_\theta J^{PPO}(\theta)$

where the learning rate is $\alpha_{LR} = 10^{-4}$.
Step 404: taking into account the randomness of the initial turning command, the expected cumulative rewards obtained by the new and old policies are compared. If

$\mathbb{E}\left[R(\theta)\right] \ge \mathbb{E}\left[R(\theta^-)\right]$

where $R(\theta^-)$ is the episode cumulative reward obtained under the finally output policy $\pi_{\theta^-}$ and $R(\theta)$ is the episode cumulative reward obtained under the policy $\pi_\theta$ updated in step 403, then the network parameters of the final policy are set to $\theta^- = \theta$.
Step 405: repeat steps 401 to 404 until training reaches the target reward value or the maximum number of training steps. The resulting Actor network, as the final policy network, can be deployed directly on the missile-borne computer to generate angle-of-attack guidance commands in real time.
Following the specific example above, the agent is trained with the maximum number of training steps set to $5\times10^6$; the resulting training curves are shown in Fig. 3. The two training curves correspond to the cases with and without the angle-of-attack limit, the shaded regions represent the reward distribution over different random initial turning commands under the same network parameters, and the solid lines are the mean of that distribution. The training curves show that the agent converges smoothly in this example.
Further, the trained agent is verified over turning angles in $[30°, 180°]$ and $[-180°, -30°]$, and the deviation between the missile trajectory inclination angle and the commanded angle is shown in Figs. 4a-4b. Fig. 4a shows the convergence of the missile turning angle when the maneuvering capability is limited, and Fig. 4b shows it when the maneuvering capability is unlimited. The missile completes the turning task regardless of whether the maneuvering capability, i.e. the maximum angle of attack, is limited; however, the limited available angle of attack reduces the missile's maneuverability, so the turn takes longer.
Furthermore, simulation experiments verify that the over-the-shoulder launch guidance law based on deep reinforcement learning meets the requirements of the basic task and is suboptimal and robust. The results obtained by the agent in this example are compared with those of the general-purpose optimization software GPOPS. The limited angle-of-attack case is shown in Figs. 5a-5d, where Fig. 5a is the missile speed versus time, Fig. 5b the missile angle of attack versus time, Fig. 5c the missile trajectory inclination angle versus time, and Fig. 5d the missile trajectory in the longitudinal plane. Figs. 6a-6d show the unlimited angle-of-attack case, where Fig. 6a is the missile speed versus time, Fig. 6b the missile angle of attack versus time, Fig. 6c the missile trajectory inclination angle versus time, and Fig. 6d the missile trajectory in the longitudinal plane. From the results in Figs. 5a-5d and 6a-6d it can be seen that the angle-of-attack commands given by the reinforcement learning guidance law in Figs. 5b and 6b are very close to the optimal angle-of-attack commands solved by GPOPS, and the speed, trajectory inclination angle, and trajectory curves approach the optimal solution. It should be noted that the GPOPS solution is obtained in an open-loop, off-line manner, whereas the reinforcement learning guidance law works closed-loop and on-line, i.e. the same Actor network adapts to a variety of turning angles. Moreover, optimization software struggles when the environment is too complex; in this example the missile's complex aerodynamics had to be suitably simplified to obtain a solution, which is nevertheless meaningful for demonstrating the suboptimality of the reinforcement learning guidance law. To further verify the robustness of the agent, harsh initial conditions different from the training environment were set in this example, as shown in Table 3. The deviation conditions include not only deviations of the aerodynamic coefficients but also generalization of the initial speed and instability of the main-engine thrust, and the agent never encountered this verification environment during training. The final terminal states are shown in Figs. 7a and 7b, where Fig. 7a shows the terminal trajectory inclination deviation and angle-of-attack distribution with limited missile maneuvering capability, and Fig. 7b shows them with unlimited maneuvering capability. With the maneuvering capability limit, the terminal trajectory inclination deviation is no more than 0.3° and the terminal angle-of-attack deviation no more than 0.9°; without the limit, the terminal trajectory inclination deviation is no more than 0.7° and the angle-of-attack deviation no more than 0.9°. The missile thus still completes the turn with high accuracy, demonstrating the agent's generalization ability in scenarios outside of training.
This is mainly attributable to the use of a model-free deep reinforcement learning algorithm, so that high accuracy is maintained even when the model deviates considerably.
Table 3. Deviation conditions
In conclusion, the deep-reinforcement-learning-based method for designing an over-the-shoulder launch guidance law for air-to-air missiles can provide real-time angle-of-attack commands for large-maneuver turns in complex combat environments. First, normalized dynamics modeling of the missile over-the-shoulder launch is carried out for the combat scenario in which the missile is threatened from the rear hemisphere, taking into account the strong aerodynamic uncertainty determined by the missile's particular aerodynamic characteristics at high angle of attack. To cope with this aerodynamic uncertainty, the method adopts the model-free deep reinforcement learning algorithm PPO, casts the dynamic model as a Markov decision process, and finally obtains a trained policy network. Both the limited and unlimited available angle-of-attack cases are considered in training, which accounts for the practical capability of current missiles as well as their future development potential. The training of the agent requires substantial computing performance, but it can be completed off-line at a ground workstation, and the finally obtained policy network can be deployed directly on the missile-borne computer, occupying little memory and few computing resources while meeting the real-time requirements of online computation. With the reinforcement learning guidance law, the missile's angle of attack increases rapidly after launch, the speed decreases, and the trajectory inclination angle converges to the commanded value, so that the missile seeker can smoothly lock onto the target, providing favorable handover conditions for terminal guidance. In addition, the suboptimality and robustness of the trained agent are verified in the examples provided: the solution given by the agent is very close to the optimal solution given by the general-purpose optimization software, and the agent maintains high accuracy in a harsh simulation environment, verifying its generalization and robustness. Facing increasingly complex air combat environments and highly maneuverable fighters, the intelligent guidance law provided by the invention has important application value.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. A method for designing an over-the-shoulder launch guidance law for air-to-air missiles based on deep reinforcement learning, characterized by comprising the following steps:
step 1, performing normalized dynamics modeling of the over-the-shoulder launch; normalizing the model gives every state quantity a similar magnitude, which makes the weight updates of the neural network more stable; first, the missile over-the-shoulder launch scenario is modeled to obtain the dynamics equations in the velocity (wind) frame, the kinematic equations in the inertial frame, and an equation accounting for the mass change;
step 2, modeling the problem of step 1 as a Markov decision process in order to fit the reinforcement learning paradigm;
step 3, building the algorithm networks and setting the algorithm parameters; the selected deep reinforcement learning algorithm is the proximal policy optimization algorithm PPO, which comprises an Actor network and a Critic network, and the network weight parameters are initialized randomly;
step 4, until training reaches the target reward value or the maximum number of steps, the agent keeps collecting state-transition data and rewards according to the PPO algorithm and iteratively updates the parameters of the Actor and Critic networks.
2. The method for designing an over-the-shoulder launch guidance law for air-to-air missiles based on deep reinforcement learning, characterized in that: in step 1, the equations are specifically:
$\dot{\bar{V}} = \dfrac{1}{V^*}\left[\dfrac{P\cos\alpha\,u_p - F_D}{m} - g\sin\theta\right]$

$\dot{\bar{\theta}} = \dfrac{1}{\theta^*}\left[\dfrac{P\sin\alpha\,u_p + T_{rcs}u_{rcs} + F_L}{mV} - \dfrac{g\cos\theta}{V}\right]$

$\dot{\bar{x}} = \dfrac{V\cos\theta}{x^*}, \qquad \dot{\bar{y}} = \dfrac{V\sin\theta}{y^*}$

$\dot{m} = -m_c$
wherein $\bar{V} = V/V^*$ is the normalized missile flight speed, $\bar{\theta} = \theta/\theta^*$ is the normalized trajectory inclination angle, $\bar{x} = x/x^*$ is the normalized horizontal coordinate, $\bar{y} = y/y^*$ is the normalized vertical coordinate, $\dot{\bar{V}}, \dot{\bar{\theta}}, \dot{\bar{x}}, \dot{\bar{y}}$ are the rates of change of the foregoing quantities, and $V^*, \theta^*, x^*, y^*$ are the corresponding normalization factors; in addition, α is the missile angle of attack, P is the main-engine thrust, $T_{rcs}$ is the reaction-jet thrust, $u_p$ and $u_{rcs}$ are the on-off logic variables of the main engine and the reaction jet respectively, $F_D$ and $F_L$ are the highly uncertain drag and lift, m is the missile mass, $m_c$ is the mass consumption rate, and g is the gravitational acceleration constant.
3. The method for designing an over-the-shoulder launch guidance law for air-to-air missiles based on deep reinforcement learning, characterized in that: in step 2, the specific procedure comprises steps 201 to 203;
step 201, setting the action space; to ensure the dynamic stability of the system, the first derivative of the angle of attack α, i.e. $\dot{\alpha}$, is selected as the system input; in addition, taking $\dot{\alpha}$ as the action makes it convenient to satisfy the missile maneuvering capability limit; however, as the maneuvering capability of air-to-air missiles develops, especially with the assistance of thrust vectoring or reaction jets, the available angle-of-attack limit can also be removed;
step 202, setting the state space and observation space; the state space and observation space of the agent are set on the basis of the action set in step 201, but not all states of the system are meaningful for deciding the control command; redundant observations lead to unstable training, while insufficient observations tend to cause the training not to converge;
step 203, setting the reward function; the reward function strongly influences the final training result; to avoid reward sparsity, the reward function is designed as a dense shaping function of the deviation between the missile trajectory inclination angle $\theta_M$ and the desired turning angle $\bar{\theta}_c$, where $\lambda_1, \lambda_2, \lambda_3$ are hyperparameters to be set that adjust the proportions among the terms; to improve the final turning accuracy, an additional reward $r_{bonus}$ is introduced, which takes the value $r_b$ when the accuracy condition is satisfied and 0 otherwise, where $r_b$ is coordinated with the preceding terms to ensure that the agent obtains an appropriate reward within the desired accuracy $\theta_{thre}$.
4. The method for designing an over-the-shoulder launch guidance law for air-to-air missiles based on deep reinforcement learning, characterized in that: step 4 specifically comprises steps 401 to 405;
step 401, under the current policy $\pi_{\theta_k}$, collecting trajectory data and caching them in the experience pool until the pool is full; at each simulation step, for the current observation $o_t$, the current policy $\pi_{\theta_k}$ is executed to obtain the current action $a_t$, the system dynamics are integrated to obtain the next state $s_{t+1}$ and observation $o_{t+1}$, and the reward $r_t$ is obtained;
step 402, estimating the advantage function $\hat{A}_t$ with generalized advantage estimation GAE; the final optimization objective is $J^{PPO}(\theta) = \hat{\mathbb{E}}_t\left[J^{CLIP}(\theta) - c_{vf} L^{VF}(\theta) + c_s S[\pi_\theta](o_t)\right]$, where $c_{vf}$ and $c_s$ are hyperparameters adjusting the proportions of the terms, $J^{CLIP}(\theta)$ is the clipped objective that increases the probability of more advantageous actions, $L^{VF}(\theta)$ is the value-function loss term, and $S[\pi_\theta]$ is the maximum-entropy term encouraging exploration;
step 403, drawing trajectory data from the experience pool in minibatches and optimizing the parameters of the Actor and Critic networks on the objective $J^{PPO}(\theta)$ by stochastic gradient steps until the data in the experience pool have been used for K epochs of updates;
step 404, taking into account the randomness of the initial turning command, comparing the expected cumulative rewards obtained by the new and old policies and updating the finally output network parameters;
step 405, repeating steps 401 to 404 until training reaches the target reward value or the maximum number of training steps, whereupon the Actor network, as the final policy network, is deployed directly on the missile-borne computer to generate angle-of-attack guidance commands in real time.
5. The method for designing an over-the-shoulder launch guidance law for air-to-air missiles based on deep reinforcement learning, characterized in that: in step 201, if the missile has an available angle-of-attack limit, i.e. $|\alpha| \le \alpha_{max}$, where $\alpha_{max}$ is the angle-of-attack limit, then the angle of attack obtained by integrating $\dot{\alpha}$ is kept within $[-\alpha_{max}, \alpha_{max}]$.
6. The method for designing an over-the-shoulder launch guidance law for air-to-air missiles based on deep reinforcement learning, characterized in that: in step 202, the state space of the system is the set of normalized dynamics states augmented with the angle of attack α, and the observation space is set as a subset of the state quantities together with the error relative to the desired turning angle $\bar{\theta}_c$.
7. The method for designing an over-the-shoulder launch guidance law for air-to-air missiles based on deep reinforcement learning, characterized in that: in step 401, at each simulation step, based on the current observation $o_t$, the current policy is executed to obtain the probability mean of the current action, i.e. $\mu_t = \pi_{\theta_k}(o_t)$; the current action $a_t$ is obtained by sampling from the Gaussian distribution $\mathcal{N}(\mu_t, \sigma^2)$, the system dynamics $f(x_t, a_t, t)$ are integrated to obtain the next state $s_{t+1}$ and observation $o_{t+1}$, and the reward $r_t$ is computed; this continues until the turn is over, yielding a set of trajectories $\{s_0, o_0, a_0, r_1, s_1, o_1, a_1, r_2, s_2, \dots\}$; under the current policy $\pi_{\theta_k}$, trajectory data are thus collected and cached in the experience pool, which may hold the trajectories of several rounds, until the pool is full.
8. The method for designing an over-the-shoulder launch guidance law for air-to-air missiles based on deep reinforcement learning, characterized in that: in step 402, the objective function of the network update is further computed in clipped form to increase the probability of more advantageous actions, $J^{CLIP}(\theta) = \hat{\mathbb{E}}_t\left[\min\left(r_t(\theta)\hat{A}_t,\ \mathrm{clip}\left(r_t(\theta), 1-\epsilon, 1+\epsilon\right)\hat{A}_t\right)\right]$, where $\hat{\mathbb{E}}_t[\cdot]$ denotes the expectation, $r_t(\theta)$ is the probability ratio characterizing the new and old policies so as to keep the update step within the trust region, $\mathrm{clip}(\cdot)$ is the truncation function, and $\epsilon$ is a hyperparameter acting as the truncation factor; further, to improve the accuracy of the Critic network's value estimate, the value-function loss term $L^{VF}(\theta)$ and the maximum-entropy term $S[\pi_\theta]$ encouraging exploration are added to $J^{CLIP}(\theta)$, giving the final optimization objective $J^{PPO}(\theta) = \hat{\mathbb{E}}_t\left[J^{CLIP}(\theta) - c_{vf} L^{VF}(\theta) + c_s S[\pi_\theta](o_t)\right]$.
9. The method for designing an over-the-shoulder launch guidance law for air-to-air missiles based on deep reinforcement learning, characterized in that: in step 403, trajectory data are drawn from the experience pool in minibatches of size $N_B$, and the objective $J^{PPO}(\theta)$ is optimized over the parameters of the Actor and Critic networks by stochastic gradient steps until the data in the experience pool have been used for K epochs of updates; the network parameters are updated according to $\theta \leftarrow \theta + \alpha_{LR}\nabla_\theta J^{PPO}(\theta)$, where $\alpha_{LR}$ is the learning rate.
10. The method for designing an over-the-shoulder launch guidance law for air-to-air missiles based on deep reinforcement learning, characterized in that: in step 404, if $\mathbb{E}\left[R(\theta)\right] \ge \mathbb{E}\left[R(\theta^-)\right]$, where $\mathbb{E}[\cdot]$ denotes the expectation, $R(\theta^-)$ is the episode cumulative reward obtained under the finally output policy $\pi_{\theta^-}$, and $R(\theta)$ is the episode cumulative reward obtained under the policy $\pi_\theta$ updated in step 403, then the network parameters of the final policy are set to $\theta^- = \theta$.
CN202111550831.2A 2021-12-17 2021-12-17 Air-to-air missile over-shoulder launching guidance law design method based on deep reinforcement learning Pending CN114519292A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111550831.2A CN114519292A (en) 2021-12-17 2021-12-17 Air-to-air missile over-shoulder launching guidance law design method based on deep reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111550831.2A CN114519292A (en) 2021-12-17 2021-12-17 Air-to-air missile over-shoulder launching guidance law design method based on deep reinforcement learning

Publications (1)

Publication Number Publication Date
CN114519292A true CN114519292A (en) 2022-05-20

Family

ID=81596073

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111550831.2A Pending CN114519292A (en) 2021-12-17 2021-12-17 Air-to-air missile over-shoulder launching guidance law design method based on deep reinforcement learning

Country Status (1)

Country Link
CN (1) CN114519292A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114861368A (en) * 2022-06-13 2022-08-05 中南大学 Method for constructing railway longitudinal section design learning model based on near-end strategy
CN115328638A (en) * 2022-10-13 2022-11-11 北京航空航天大学 Multi-aircraft task scheduling method based on mixed integer programming

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111930141A (en) * 2020-07-21 2020-11-13 哈尔滨工程大学 Three-dimensional path visual tracking method for underwater robot
CN112799429A (en) * 2021-01-05 2021-05-14 北京航空航天大学 Multi-missile cooperative attack guidance law design method based on reinforcement learning
CN112861442A (en) * 2021-03-10 2021-05-28 中国人民解放军国防科技大学 Multi-machine collaborative air combat planning method and system based on deep reinforcement learning

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111930141A (en) * 2020-07-21 2020-11-13 哈尔滨工程大学 Three-dimensional path visual tracking method for underwater robot
CN112799429A (en) * 2021-01-05 2021-05-14 北京航空航天大学 Multi-missile cooperative attack guidance law design method based on reinforcement learning
CN112861442A (en) * 2021-03-10 2021-05-28 中国人民解放军国防科技大学 Multi-machine collaborative air combat planning method and system based on deep reinforcement learning

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
FEI LIU et al.: "Cooperative differential games guidance laws for multiple attackers against an active defense target", RESEARCHGATE, 31 October 2021 (2021-10-31) *
Chen Zhongyuan; Wei Wenshu; Chen Wanchun: "Intelligent guidance law for cooperative attack of multiple missiles based on reinforcement learning", Acta Armamentarii, no. 008, 31 August 2021 (2021-08-31) *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114861368A (en) * 2022-06-13 2022-08-05 中南大学 Method for constructing railway longitudinal section design learning model based on near-end strategy
CN114861368B (en) * 2022-06-13 2023-09-12 中南大学 Construction method of railway longitudinal section design learning model based on near-end strategy
CN115328638A (en) * 2022-10-13 2022-11-11 北京航空航天大学 Multi-aircraft task scheduling method based on mixed integer programming
CN115328638B (en) * 2022-10-13 2023-01-10 北京航空航天大学 Multi-aircraft task scheduling method based on mixed integer programming

Similar Documents

Publication Publication Date Title
CN108445766A (en) Model-free quadrotor drone contrail tracker and method based on RPD-SMC and RISE
CN114519292A (en) Air-to-air missile over-shoulder launching guidance law design method based on deep reinforcement learning
Hong et al. Model predictive convex programming for constrained vehicle guidance
Wang et al. Improving maneuver strategy in air combat by alternate freeze games with a deep reinforcement learning algorithm
Li et al. Autonomous maneuver decision-making for a UCAV in short-range aerial combat based on an MS-DDQN algorithm
CN114253296B (en) Hypersonic aircraft airborne track planning method and device, aircraft and medium
Lin et al. Development of an integrated fuzzy-logic-based missile guidance law against high speed target
CN111898201B (en) High-precision autonomous attack guiding method for fighter in air combat simulation environment
CN113377121B (en) Aircraft intelligent disturbance rejection control method based on deep reinforcement learning
Lee et al. Autonomous control of combat unmanned aerial vehicles to evade surface-to-air missiles using deep reinforcement learning
Chai et al. A hierarchical deep reinforcement learning framework for 6-DOF UCAV air-to-air combat
Gong et al. All-aspect attack guidance law for agile missiles based on deep reinforcement learning
CN115576353A (en) Aircraft formation control method based on deep reinforcement learning
CN115079565A (en) Variable-coefficient constraint guidance method and device with falling angle and aircraft
Duan et al. Autonomous maneuver decision for unmanned aerial vehicle via improved pigeon-inspired optimization
Qazi et al. Rapid trajectory optimization using computational intelligence for guidance and conceptual design of multistage space launch vehicles
CN115357051B (en) Deformation and maneuvering integrated avoidance and defense method
Du et al. Deep reinforcement learning based missile guidance law design for maneuvering target interception
CN116432539A (en) Time consistency collaborative guidance method, system, equipment and medium
Chen et al. The design of particle swarm optimization guidance using a line-of-sight evaluation method
Minglang et al. Maneuvering decision in short range air combat for unmanned combat aerial vehicles
Gui et al. Reaction control system optimization for maneuverable reentry vehicles based on particle swarm optimization
Bin et al. Cooperative guidance for maneuvering penetration with attack time consensus and bounded input
CN117192982B (en) Control parameterization-based short-distance air combat maneuver decision optimization method
Wu et al. Decision-Making Method of UAV Maneuvering in Close-Range Confrontation based on Deep Reinforcement Learning

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination