CN114519292A - Air-to-air missile over-shoulder launching guidance law design method based on deep reinforcement learning - Google Patents
Info
- Publication number
- CN114519292A (application number CN202111550831.2A)
- Authority
- CN
- China
- Prior art keywords
- missile
- air
- network
- shoulder
- reinforcement learning
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F30/00—Computer-aided design [CAD]
- G06F30/20—Design optimisation, verification or simulation
- G06F30/27—Design optimisation, verification or simulation using machine learning, e.g. artificial intelligence, neural networks, support vector machines [SVM] or training a model
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2119/00—Details relating to the type or aim of the analysis or the optimisation
- G06F2119/02—Reliability analysis or reliability optimisation; Failure analysis, e.g. worst case scenario performance, failure mode and effects analysis [FMEA]
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02T—CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
- Y02T90/00—Enabling technologies or technologies with a potential or indirect contribution to GHG emissions mitigation
Abstract
The invention relates to a method for designing an over-the-shoulder launch guidance law for air-to-air missiles based on deep reinforcement learning, comprising the following steps. Step 1: build a normalized dynamic model of the over-the-shoulder launch; normalization gives each state quantity a similar magnitude, which stabilizes the neural network's weight updates. Step 2: to fit the research paradigm of reinforcement learning, reformulate the problem of step 1 as a Markov decision process. Step 3: build the algorithm networks and set the algorithm parameters. Step 4: until training reaches the target reward value or the maximum number of steps, the agent continuously collects state-transition data and rewards according to the PPO algorithm and iteratively updates the parameters of the Actor and Critic networks. With the technical scheme of the invention, the missile obtains a near-optimal, robust angle-of-attack guidance law in a complex aerodynamic environment; the method accounts for the limits of different missile maneuvering capabilities and has practical value for future air combat.
Description
Technical Field
The invention relates to the field of aircraft control, in particular to a design method of an air-to-air missile over-shoulder launching guidance law based on deep reinforcement learning.
Background
In modern air combat, as fighter maneuverability keeps improving, close-combat scenarios are increasingly complicated. To improve a fighter's effectiveness in close combat, the over-the-shoulder launch mode, which can attack targets in the rear hemisphere, is an important research direction. An air-to-air missile using over-the-shoulder launch can change its flight direction rapidly after launch and enters the terminal guidance phase once the seeker locks onto the target, giving the missile all-aspect attack capability. However, during the high-angle-of-attack turn the missile faces complex aerodynamic phenomena such as asymmetric vortex shedding and induced moments; it is a typical strongly nonlinear, highly uncertain system, which places higher demands on the missile's guidance and control system.
Current research on over-the-shoulder launch focuses mainly on robust autopilot design, with relatively few approaches to guidance law design. The commonly adopted approach is to have the autopilot track an offline-optimized trajectory or a constant angle of attack, but under a complex aerodynamic environment and a fast-changing air combat situation the target is easily lost after an over-the-shoulder launch. A suitable guidance law can adapt to the dynamics of the battlefield, reduce the design burden on the autopilot, and improve the overall robustness of the missile guidance and control system.
Further, considering the maneuvering capability of current missile models and their future development potential, the designed guidance law based on deep reinforcement learning makes it convenient to set the missile's maximum available angle of attack, which widens the range of possible applications and the practicality of the invention. Facing increasingly complex air combat environments and highly maneuverable fighters, the intelligent guidance law provided by the invention has significant application value.
Disclosure of Invention
The main aim of the invention is to provide a method for designing an over-the-shoulder launch guidance law for air-to-air missiles based on deep reinforcement learning, so as to address at least the problems above.
According to one aspect of the invention, a method for designing an over-the-shoulder launch guidance law for air-to-air missiles based on deep reinforcement learning is provided, comprising the following steps:
Step 201, setting the action space. To ensure the dynamic stability of the system, the first derivative of the angle of attack α, i.e. the angle-of-attack rate dα/dt, is selected as the system input. Taking dα/dt as the action also makes it convenient to satisfy the missile's maneuvering-capability limits. With the future development of air-to-air missile maneuverability, in particular with the assistance of thrust vectoring or reaction jets, the available angle-of-attack limit can be removed.
Step 202, setting the state space and observation space. On the basis of the action set in step 201, the state space and observation space of the agent are set; not all states of the system are meaningful for deciding the control command. Redundant observations lead to unstable training, while insufficient observations tend to prevent training from converging at all.
Step 203, setting the reward function. The reward function strongly influences the final training result. To avoid reward sparsity, the reward function is designed as a dense per-step reward, where θ_c is the desired turning angle, θ_M is the missile's trajectory inclination angle, and λ_1, λ_2, λ_3 are hyperparameters to be set that adjust the proportion between the terms. To improve the final turning accuracy, an additional reward r_bonus is introduced, equal to r_b when the accuracy condition is satisfied; r_b must be coordinated with the preceding terms so that the agent obtains an appropriate reward at the desired accuracy θ_thre.
Step 3, building the algorithm networks and setting the algorithm parameters. The deep reinforcement learning algorithm selected in the invention is Proximal Policy Optimization (PPO); it comprises an Actor network and a Critic network, whose weight parameters are randomly initialized.
Step 4, until training reaches the target reward value or the maximum number of steps, the agent continuously collects state-transition data and rewards according to the PPO algorithm, and iteratively updates the parameters of the Actor and Critic networks. Specifically, this comprises steps 401 to 405.
Step 401, under the current policy π_θk, collect trajectory data and cache it in the experience pool until the pool is full. In each simulation step, for the current observation o_t, execute the current policy to obtain the current action a_t, integrate the system dynamics to obtain the next state s_t+1 and observation o_t+1, and receive the reward r_t.
Step 402, estimate the advantage function Â_t with the generalized advantage estimation (GAE) method. The final optimization objective is J^PPO(θ) = Ê_t[J^CLIP(θ) − c_vf·L^VF(θ) + c_s·S[π_θ](o_t)], where c_vf and c_s are hyperparameters adjusting the proportion of each term, J^CLIP(θ) is the probability-ratio-clipped objective that increases the probability of more advantageous actions, L^VF(θ) is the value-function loss term, and S is the maximum-entropy term that encourages exploration.
Step 403, extract trajectory data from the experience pool in minibatches and optimize the parameters of the Actor and Critic networks with stochastic gradient steps on the objective J^PPO(θ), until the data in the experience pool has been used for K epochs of updates.
Step 404, considering the randomness of the initial turning command, compare the expected cumulative rewards obtained by the new and old policies and update the finally output network parameters accordingly.
Step 405, repeat steps 401 to 404 until training reaches the target reward value or the maximum number of training steps. The resulting Actor network, as the final policy network, can be deployed directly on the missile-borne computer to generate angle-of-attack guidance commands in real time.
The advantages and beneficial effects of the invention: with this technical scheme, the missile obtains a near-optimal, robust angle-of-attack guidance law in a complex aerodynamic environment; the method accounts for the limits of different missile maneuvering capabilities and has practical value for future air combat.
Drawings
Fig. 1 is a geometric schematic of the planar engagement for an air-to-air missile over-the-shoulder launch, provided according to an embodiment of the invention.
Fig. 2 is a schematic of the interaction between an agent using the PPO algorithm and the environment, according to an embodiment of the invention.
Fig. 3 shows the learning curves of the missile with limited and unlimited maneuvering capability, respectively, according to an embodiment of the invention.
Fig. 4a shows the convergence of the missile turning angle with limited maneuvering capability.
Fig. 4b shows the convergence of the missile turning angle with unlimited maneuvering capability.
Fig. 5a shows missile speed over time for the maneuvering-limited agent and the optimal solution, according to an embodiment of the invention.
Fig. 5b shows missile angle of attack over time for the maneuvering-limited agent and the optimal solution.
Fig. 5c shows missile trajectory inclination angle over time for the maneuvering-limited agent and the optimal solution.
Fig. 5d shows the ballistic trajectory in the longitudinal plane for the maneuvering-limited agent and the optimal solution.
Fig. 6a shows missile speed over time for the agent with unlimited maneuvering capability and the optimal solution, according to an embodiment of the invention.
Fig. 6b shows missile angle of attack over time for the agent with unlimited maneuvering capability and the optimal solution.
Fig. 6c shows missile trajectory inclination angle over time for the agent with unlimited maneuvering capability and the optimal solution.
Fig. 6d shows the ballistic trajectory in the longitudinal plane for the agent with unlimited maneuvering capability and the optimal solution.
Fig. 7a shows the terminal-time trajectory-inclination deviation and angle-of-attack distribution with limited missile maneuvering capability.
Fig. 7b shows the terminal-time trajectory-inclination deviation and angle-of-attack distribution with unlimited missile maneuvering capability.
Detailed Description
It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. The following description of at least one exemplary embodiment is merely illustrative in nature and is in no way intended to limit the invention, its application, or uses. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Here, the normalized quantities are the missile flight speed, the trajectory inclination angle, the horizontal coordinate, and the vertical coordinate, whose rates of change are scaled by the corresponding normalization factors v*, θ*, x*, y*. In addition, α is the missile angle of attack, P is the main-engine thrust, T_rcs is the reaction-jet thrust, u_p and u_rcs are the on-off logic variables of the main engine and the reaction-jet engine respectively, F_D and F_L are the drag and lift, which carry strong uncertainty, m is the missile mass, m_c the rate of change of the missile mass, and g is the gravitational acceleration constant.
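The normalization described above can be sketched as follows. The specific factor values (V_STAR and so on) are illustrative assumptions, not values from the patent; only the scaling scheme itself follows the text.

```python
# Sketch of the step-1 state normalization: each raw state quantity is
# divided by its normalization factor so all states have similar
# magnitude, which stabilizes neural-network weight updates.
# The factor values below are assumed for illustration.
import math

V_STAR = 300.0        # m/s, reference speed (assumed)
THETA_STAR = math.pi  # rad, reference trajectory inclination (assumed)
X_STAR = 1000.0       # m, reference horizontal range (assumed)
Y_STAR = 1000.0       # m, reference altitude (assumed)

def normalize_state(v, theta, x, y):
    """Scale raw dynamic states to similar magnitudes."""
    return (v / V_STAR, theta / THETA_STAR, x / X_STAR, y / Y_STAR)

def denormalize_state(vn, thetan, xn, yn):
    """Inverse mapping back to physical units."""
    return (vn * V_STAR, thetan * THETA_STAR, xn * X_STAR, yn * Y_STAR)
```

Round-tripping a state through both functions recovers the original physical values, which is what lets the trained policy be evaluated against the unnormalized dynamics.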
Step 201, setting the action space. To ensure the dynamic stability of the system, the first derivative of the angle of attack α, i.e. the angle-of-attack rate dα/dt, is selected as the system input. Taking dα/dt as the action also makes it convenient to satisfy the missile's maneuvering-capability limits. If the missile has an available angle-of-attack limit, i.e. |α| ≤ α_max, where α_max is the angle-of-attack limit, then dα/dt is constrained so that this bound is never exceeded. With the future development of air-to-air missile maneuverability, in particular with the assistance of thrust vectoring or reaction jets, the available angle-of-attack limit can be removed.
Step 202, setting the state space and observation space. With the action set in step 201, the state space of the system comprises the normalized dynamic state quantities. Not all states of the system are meaningful for deciding the control command: redundant observations lead to unstable training, while insufficient observations tend to prevent training from converging at all. In the invention, the observation space is set to a subset of the state quantities together with θ_c, the desired turning angle.
Step 203, setting the reward function. The reward function strongly influences the final training result. To avoid reward sparsity, the reward function is designed as a dense per-step reward, where λ_1, λ_2, λ_3 are hyperparameters to be set that adjust the proportion between the terms. To improve the final turning accuracy, an additional reward r_bonus is introduced, equal to r_b when the accuracy condition is satisfied; r_b is coordinated with the preceding terms so that the agent obtains an appropriate reward at the desired accuracy θ_thre.
Step 3, building the algorithm networks and setting the algorithm parameters. The deep reinforcement learning algorithm selected in the invention is Proximal Policy Optimization (PPO); it comprises an Actor network and a Critic network, whose weight parameters are randomly initialized.
Step 4, until training reaches the target reward value or the maximum number of steps, the agent continuously collects state-transition data and rewards according to the PPO algorithm, and iteratively updates the parameters of the Actor and Critic networks. Specifically, this comprises steps 401 to 405.
Step 401: in each simulation step, based on the current observation o_t, execute the current policy to obtain the mean μ_t of the action distribution, sample the current action a_t from the Gaussian distribution N(μ_t, σ²), integrate the system dynamics f(x_t, a_t, t) to obtain the next state s_t+1 and observation o_t+1, and compute the reward r_t. This continues until the turn ends, collecting a trajectory {s_0, o_0, a_0, r_1, s_1, o_1, a_1, r_2, s_2, ...}. Under the current policy π_θk, the trajectory data are cached in the experience pool; trajectories from multiple episodes may be cached until the pool is full.
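The rollout loop of step 401 can be sketched as below. The one-dimensional dynamics, the policy mean, the integration step, and the toy reward are all illustrative stand-ins, not the patent's missile model; only the collect-sample-integrate-reward structure follows the text.

```python
# Sketch of step 401: execute the current Gaussian policy, integrate the
# dynamics, and collect (state, observation, action, reward) tuples for
# the experience pool. All numeric choices below are assumptions.
import random

DT = 0.01  # integration step (assumed)

def policy_mean(obs):
    # Stand-in for the Actor network's Gaussian mean mu_t.
    return -0.5 * obs[0]

def dynamics(state, action, dt=DT):
    # Stand-in Euler step for the system dynamics f(x_t, a_t, t).
    return state + action * dt

def collect_trajectory(s0, sigma=0.1, horizon=50, seed=0):
    """Sample a_t ~ N(mu_t, sigma^2) each step and return the trajectory
    tuples that would be cached into the experience pool."""
    rng = random.Random(seed)
    traj, s = [], s0
    for _ in range(horizon):
        o = (s,)                              # observation of the state
        a = rng.gauss(policy_mean(o), sigma)  # exploration noise
        s_next = dynamics(s, a)
        r = -abs(s_next)                      # toy reward: drive state to 0
        traj.append((s, o, a, r))
        s = s_next
    return traj
```

At deployment time the sampling step is replaced by using the mean directly, matching the text's note that no noise is added during verification.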
Step 402: estimate the advantage function Â_t with the generalized advantage estimation (GAE) method, and compute the clipped objective for the network update, which increases the probability of more advantageous actions: J^CLIP(θ) = Ê_t[min(r_t(θ)·Â_t, clip(r_t(θ), 1−ε, 1+ε)·Â_t)], where Ê_t denotes the expectation, r_t(θ) is the probability ratio between the new and old policies, used to keep the update step inside the trust region, clip(·) is the truncation function, and ε is the clipping hyperparameter. Further, to improve the accuracy of the Critic network's value-function estimate, a value-function loss term L^VF(θ) and a maximum-entropy term S[π_θ](o_t) encouraging exploration are added on top of J^CLIP(θ), giving the final optimization objective J^PPO(θ) = Ê_t[J^CLIP(θ) − c_vf·L^VF(θ) + c_s·S[π_θ](o_t)], where c_vf and c_s are hyperparameters adjusting the proportion of each term.
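The two computations named in step 402 can be sketched as follows, using the standard GAE recursion and the standard PPO clipped term; the discount and GAE parameters gamma and lam are assumed values not given in the source.

```python
# Sketch of step 402: generalized advantage estimation (GAE) and the
# per-sample clipped term inside J^CLIP. gamma and lam are assumptions;
# eps = 0.3 follows the worked example later in the text.
def gae(rewards, values, gamma=0.99, lam=0.95):
    """values has len(rewards)+1 entries (bootstrap value appended).
    Returns the advantage estimate A_hat_t for each step."""
    advantages, adv = [], 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        adv = delta + gamma * lam * adv
        advantages.append(adv)
    advantages.reverse()
    return advantages

def clipped_term(ratio, advantage, eps=0.3):
    """min(r*A, clip(r, 1-eps, 1+eps)*A): caps the incentive to move the
    new policy far from the old one, keeping updates in the trust region."""
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped * advantage)
```

Averaging `clipped_term` over a minibatch gives the J^CLIP part of the objective; the value loss and entropy bonus are added separately with their c_vf and c_s weights.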
Step 403: extract trajectory data from the experience pool in minibatches of size N_B, and optimize the parameters of the Actor and Critic networks with stochastic gradient steps on the objective J^PPO(θ), until the data in the experience pool has been used for K epochs of updates. The network parameters are updated as θ ← θ + α_LR·∇_θ J^PPO(θ), where α_LR is the learning rate.
Step 404: considering the randomness of the initial turning command, compare the expected cumulative rewards obtained by the new and old policies. If Ê[R(θ)] > Ê[R(θ*)], where Ê denotes the expectation, R(θ*) is the episode cumulative reward obtained under the policy π_θ* to be output at the end, and R(θ) is that obtained under the policy π_θ updated in step 403, then set the network parameters of the final policy to θ* = θ.
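The keep-the-best comparison of step 404 can be sketched as below. `evaluate_policy` is an assumed stand-in that returns per-episode cumulative rewards for a given parameter set; averaging over several random turning commands approximates the expectation the text compares.

```python
# Sketch of step 404: because the initial turning command is random, the
# mean episode return of the updated parameters is compared with that of
# the currently stored best parameters, and the output is replaced only
# on improvement. evaluate_policy is an assumed callback.
def select_policy(theta_best, theta_new, evaluate_policy, n_episodes=8):
    returns_best = evaluate_policy(theta_best, n_episodes)
    returns_new = evaluate_policy(theta_new, n_episodes)
    mean = lambda xs: sum(xs) / len(xs)
    # Keep the parameters with the higher expected cumulative reward.
    return theta_new if mean(returns_new) > mean(returns_best) else theta_best
```

This guards the final output against a gradient update that happened to help on one turning command but hurt on average.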
Step 405: repeat steps 401 to 404 until training reaches the target reward value or the maximum number of training steps. The resulting Actor network, as the final policy network, can be deployed directly on the missile-borne computer to generate angle-of-attack guidance commands in real time.
Because reinforcement learning uses offline training with online deployment, the computationally demanding training is completed on a ground workstation. The finally obtained policy network π_θ* is essentially a series of matrix operations and activation-function evaluations; it occupies little memory and few computing resources, and can meet the real-time requirements of online computation.
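To illustrate how light the onboard inference is, a trained Actor reduces to a forward pass like the one sketched here. The single-hidden-layer tanh architecture and the weights are illustrative assumptions, not the patent's trained network (its actual structure is given in Table 2).

```python
# Sketch of onboard deployment: the policy network's inference is just a
# few matrix-vector products plus activation functions, written here in
# pure Python to stress how small the compute footprint is.
import math

def mlp_forward(obs, w1, b1, w2, b2):
    """One hidden tanh layer, linear output: maps the observation to the
    angle-of-attack-rate command."""
    hidden = [math.tanh(sum(w * x for w, x in zip(row, obs)) + b)
              for row, b in zip(w1, b1)]
    return [sum(w * h for w, h in zip(row, hidden)) + b
            for row, b in zip(w2, b2)]
```

In practice the weights would be exported from the ground workstation after training and executed at each guidance cycle on the missile-borne computer.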
In addition, the Proximal Policy Optimization (PPO) algorithm adopted by the invention is a model-free, policy-gradient-based reinforcement learning algorithm, commonly used to generate policies over continuous action spaces, with excellent performance demonstrated in many fields. Because the PPO algorithm does not model the system during training but decides directly from the policy network, it is well suited to scenarios where the missile's aerodynamic parameters are unstable and disturbances are strong at high angles of attack, and it can exhibit robustness far beyond that of general guidance laws. The PPO algorithm is stable, insensitive to its parameters, and simple to implement, making it suitable for control problems in complex systems that are hard to model and subject to strong disturbance, noise, uncertainty, and nonlinearity. PPO adopts an Actor-Critic framework, comprising an Actor network and a Critic network, and optimizes the Actor's policy by maximizing the cumulative reward. The Actor network produces the mean of a Gaussian action distribution from the current observation; appropriately set noise encourages exploration during training and avoids premature convergence to a local optimum, while no noise is added during deployment and verification to keep policy execution stable. The Critic network evaluates the Actor network's action under the current observation, serving as the basis for optimizing the Actor. Therefore, both the Actor and Critic networks must be trained during iterative training, but only the Actor network, not the Critic, is needed for deployment and verification. The deep neural networks are trained by drawing the trajectory data stored in the experience pool in minibatches and training for K epochs.
It should be noted that although the PPO algorithm uses an experience pool, it is an on-policy rather than an off-policy algorithm, because the trajectory data stored in the pool are all obtained under the current policy π_θk and not under other policies.
To aid further understanding of the invention, the method for designing an air-to-air missile over-the-shoulder launch guidance law based on deep reinforcement learning is described in detail below with reference to the accompanying drawings.
TABLE 1 initial conditions for training scenarios
Step 201: the first derivative of the angle of attack α, i.e. the angle-of-attack rate dα/dt, is selected as the action. If the missile has an available angle-of-attack limit, i.e. |α| ≤ α_max, where α_max is the angle-of-attack limit, α_max is taken as 90° in this example; with the future development of air-to-air missile maneuverability, in particular with the assistance of thrust vectoring or reaction jets, this limit can be removed. If the limit is present, dα/dt is constrained so that |α| never exceeds α_max.
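One way to enforce the available angle-of-attack limit on the dα/dt action is sketched below. The zero-at-the-boundary saturation rule is an assumed reading of the constraint described in the text, not the patent's stated formula; only α_max = 90° comes from the example.

```python
# Sketch of the step-201 constraint: the commanded angle-of-attack rate
# is zeroed whenever it would push |alpha| beyond alpha_max. The
# saturation rule itself is an assumption.
ALPHA_MAX = 90.0  # deg, available angle-of-attack limit from the example

def limited_alpha_dot(alpha, alpha_dot, alpha_max=ALPHA_MAX):
    """Return the admissible alpha-dot action given the current alpha."""
    if alpha >= alpha_max and alpha_dot > 0:
        return 0.0
    if alpha <= -alpha_max and alpha_dot < 0:
        return 0.0
    return alpha_dot
```

Removing the limit (the unlimited-maneuverability case in the figures) amounts to returning `alpha_dot` unconditionally.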
Step 202: set the state space and observation space of the dynamic system on the basis of step 201. The state space comprises the normalized dynamic state quantities, and the observation space is set to a subset of the state quantities together with θ_c, the desired turning angle.
Step 203: setting the reward function. The reward function strongly influences the final training result. To avoid reward sparsity, the reward function is designed as a dense per-step reward, where λ_1, λ_2, λ_3 are hyperparameters to be set, here set to −0.1, 2, and 1.5 respectively. To improve the final turning accuracy, an additional reward r_bonus is introduced to maintain the desired accuracy.
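A reward in the spirit of step 203 is sketched below: a weighted dense term on the gap between the trajectory inclination θ_M and the desired turning angle θ_c, plus a bonus once the gap falls within the accuracy threshold θ_thre. The exact functional form, the threshold, and the bonus magnitude are not given in the source and are assumptions here; only the λ weights (−0.1, 2, 1.5) follow the example's settings.

```python
# Illustrative reward shaping (assumed form). L1 acts as a per-step time
# penalty, L2 rewards closing the turning-angle gap, L3 penalizes
# control effort, and a terminal-accuracy bonus sharpens the turn.
L1, L2, L3 = -0.1, 2.0, 1.5   # lambda_1..3 from the example settings
THETA_THRE = 1.0              # deg, accuracy threshold (assumed)
R_B = 10.0                    # bonus magnitude r_b (assumed)

def bonus(theta_m, theta_c, r_b=R_B, theta_thre=THETA_THRE):
    """Additional reward r_bonus, granted only once the trajectory
    inclination is within the desired accuracy of the turn command."""
    return r_b if abs(theta_c - theta_m) <= theta_thre else 0.0

def reward(theta_m, theta_c, alpha_dot_norm):
    """Dense per-step reward plus accuracy bonus (assumed form)."""
    gap = abs(theta_c - theta_m)
    dense = L1 + L2 * (1.0 - gap / 180.0) - L3 * abs(alpha_dot_norm)
    return dense + bonus(theta_m, theta_c)
```

The text's requirement that r_b be coordinated with the dense terms corresponds here to choosing R_B large enough that finishing the turn accurately dominates any shaping reward the agent could farm by loitering.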
Step 3: building the algorithm networks and setting the algorithm parameters. The algorithm selected in the invention is Proximal Policy Optimization (PPO), comprising an Actor network and a Critic network. Since the turning command θ_c takes random values in the two intervals [30°, 180°] and [−180°, −30°], two Actor networks are set up and trained in parallel; the network weight parameters are randomly initialized. The structures of the Actor and Critic networks are given in Table 2.
Table 2 network architecture parameters
Step 4: until training reaches the target reward value or the maximum number of steps, the agent continuously collects state-transition data and rewards according to the PPO algorithm, and iteratively updates the parameters of the Actor and Critic networks; specifically, this comprises steps 401 to 405. A schematic of the agent interacting with the environment to collect data and update the network parameters is shown in Fig. 2. If the initial target position is within [−30°, 30°], the missile does not need a high-g turn, and an overload command n_c is given directly by a classical guidance law such as proportional navigation. If the missile must turn at a high angle of attack, the dα/dt command is passed through an integrator to obtain the angle-of-attack command α_c, which is then tracked by the missile's autopilot. During the high-g turn the autopilot actuates thrust vector control (TVC) or the reaction control system (RCS), while aerodynamic rudders can be added during the small-angle-of-attack terminal guidance phase.
Step 401: under the current policy π_θk, collect trajectory data and cache it in the experience pool, taking the pool size N_D = 2^12. In each simulation step, for the current observation o_t, execute the current policy to obtain the current action a_t, i.e. the angle-of-attack rate dα/dt, integrate the system dynamics to obtain the next state s_t+1 and observation o_t+1, and receive the reward r_t. Trajectories from multiple episodes may be cached until the pool is full.
Step 402: estimate the advantage function Â_t with the generalized advantage estimation (GAE) method, and compute the clipped objective J^CLIP(θ) that increases the probability of more advantageous actions, where r_t(θ) is the probability ratio between the new and old policies, used to keep the update step inside the trust region, and the clipping factor is taken as ε = 0.3. Further, to improve the accuracy of the Critic network's value-function estimate, the value-function loss term L^VF(θ) and a maximum-entropy term S encouraging exploration are added on top of J^CLIP(θ), giving the final optimization objective J^PPO(θ). Of the weighting hyperparameters, c_vf is adjusted dynamically using the reward standard deviation σ(R), and c_s = 0.01.
Step 403: extract trajectory data from the experience pool in minibatches of size N_B = 2^10, and optimize the parameters of the Actor and Critic networks with stochastic gradient steps on the objective J^PPO(θ), until the data in the experience pool has been used for K = 4 epochs of updates. The network parameters are updated as θ ← θ + α_LR·∇_θ J^PPO(θ), with learning rate α_LR = 10^−4.
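The minibatch-and-epoch update schedule of step 403 can be sketched as follows. The scalar "parameter" and the gradient callback are toy stand-ins for the real network and its gradient; N_B, K, and α_LR come from the example.

```python
# Sketch of the step-403 update schedule: sweep the experience pool in
# minibatches of size N_B for K epochs, taking one gradient step per
# minibatch. grad_fn is an assumed stand-in for the gradient of J^PPO.
import random

N_B = 2 ** 10   # minibatch size from the example
K = 4           # epochs from the example
ALPHA_LR = 1e-4 # learning rate from the example

def update_epochs(pool, theta, grad_fn, n_b=N_B, k=K, lr=ALPHA_LR, seed=0):
    rng = random.Random(seed)
    for _ in range(k):
        rng.shuffle(pool)  # fresh minibatch split each epoch
        for i in range(0, len(pool), n_b):
            batch = pool[i:i + n_b]
            theta = theta + lr * grad_fn(theta, batch)  # ascent on J^PPO
    return theta
```

With the example's pool size N_D = 2^12, each epoch yields four minibatches, so a full step-403 pass applies sixteen gradient steps before fresh data are collected.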
Step 404: considering the randomness of the initial turning command, compare the expected cumulative rewards obtained by the new and old policies. Here R(θ*) is the episode cumulative reward obtained under the policy π_θ* to be output at the end, and R(θ) is that obtained under the policy π_θ updated in step 403; if R(θ) exceeds R(θ*) in expectation, the network parameters of the final policy are set to θ* = θ.
Step 405: repeat steps 401 to 404 until training reaches the target reward value or the maximum number of training steps. The resulting Actor network, as the final policy network, can be deployed directly on the missile-borne computer to generate angle-of-attack guidance commands in real time.
Following the example above, the agent is trained with the maximum number of training steps set to 5×10^6, yielding the training curves shown in Fig. 3. The two curves correspond to the limited and unlimited angle-of-attack cases; the shaded region represents the reward distribution over different random initial turning commands under the same network parameters, and the solid line is its mean. The training curves show that the agent converged smoothly in this example.
Further, the trained agent is verified over turning angles of [30°, 180°] and [−180°, −30°], and the deviation between the missile trajectory inclination angle and the commanded angle is shown in figs. 4a-4b. Fig. 4a shows the convergence of the missile turning angle when the maneuvering capability is limited, and fig. 4b the convergence when it is not. The missile completes the turning task whether or not the maneuvering capability, i.e., the maximum attack angle, is limited; however, limiting the allowable attack angle reduces the missile's maneuverability, so the turning time is longer.
Furthermore, simulation experiments verify that the over-the-shoulder launch guidance law based on deep reinforcement learning meets the basic task requirements and exhibits near-optimal performance and robustness. Comparing the results obtained by the agent in this example with those of the general-purpose optimization software GPOPS, the limited-attack-angle case is shown in figs. 5a-5d, where fig. 5a is the missile speed versus time, fig. 5b the missile attack angle versus time, fig. 5c the missile trajectory inclination angle versus time, and fig. 5d the missile trajectory in the longitudinal plane. Figs. 6a-6d show the unlimited-attack-angle case for the same four quantities. From figs. 5a-5d and 6a-6d it can be seen that the attack-angle commands given by the reinforcement-learning guidance law in figs. 5b and 6b are very close to the optimal commands solved by GPOPS, and that the speed, trajectory inclination angle, and trajectory curves approach the optimal solution. It should be noted that the GPOPS solution is obtained open-loop and offline, whereas the reinforcement-learning guidance law works closed-loop and online, i.e., the same Actor network adapts to various turning angles. Moreover, the optimization software struggles to solve overly complex environments; in this example the missile's complex aerodynamics had to be suitably simplified for GPOPS to converge, yet the comparison remains meaningful for demonstrating the near-optimal performance of the reinforcement-learning guidance law.
To further verify the robustness of the agent, harsh initial conditions different from the training environment were set in this example, as shown in table 3. The deviation conditions consider not only aerodynamic-coefficient deviations but also generalization over the initial speed and instability of the main-engine thrust; the agent never encountered this verification environment during training. The final terminal states are shown in figs. 7a and 7b, where fig. 7a shows the trajectory-inclination deviation and attack-angle distribution at the terminal time when the missile's maneuvering capability is limited, and fig. 7b the same quantities when it is not. With the maneuvering capability limited, the terminal trajectory-inclination deviation does not exceed 0.3° and the terminal attack-angle deviation does not exceed 0.9°; without the limitation, the corresponding deviations do not exceed 0.7° and 0.9°, respectively. The missile thus still completes the turn with high accuracy, demonstrating the agent's generalization ability in scenarios outside the training set. This benefits mainly from the adoption of a model-free deep reinforcement learning algorithm, which maintains high accuracy even when the model deviates substantially.
TABLE 3 Deviation conditions
In conclusion, the deep-reinforcement-learning-based design method for the air-to-air missile over-the-shoulder launch guidance law can provide real-time attack-angle commands for large-maneuver turning of the missile in a complex combat environment. First, normalized dynamic modeling of the missile over-the-shoulder launch is carried out for the combat scenario of a threat from the rear hemisphere, accounting for the strong aerodynamic uncertainty that arises from the missile's special aerodynamic characteristics at high angle of attack. To cope with this uncertainty, the method adopts the model-free deep reinforcement learning algorithm PPO, formulates the dynamic model as a Markov decision model, and finally obtains a trained strategy network. Training considers both the limited and unlimited available-attack-angle cases, covering both the practical capability of current missiles and their future development potential. The training stage of the agent demands high computing performance, but it can be completed offline at a ground workstation; the resulting strategy network can be deployed directly on the missile-borne computer, occupies little memory and few computing resources, and meets the real-time requirements of online computation. With the reinforcement-learning guidance law, the missile's attack angle increases rapidly after launch, its speed decreases, and the trajectory inclination angle converges to the commanded value, so that the missile seeker can smoothly lock onto the target, providing favorable handover conditions for terminal guidance. In addition, the invention verifies the near-optimality and robustness of the trained agent in the provided examples.
The solution given by the agent is very close to the optimal solution given by the general-purpose optimization software, and the agent maintains high accuracy even in a harsh simulation environment, verifying its generalization performance and robustness. Under the challenges of increasingly complex air-combat environments and highly maneuverable fighters, the intelligent guidance law provided by the invention has important application value.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.
Claims (10)
1. A design method of an air-to-air missile over-shoulder launching guidance law based on deep reinforcement learning is characterized by comprising the following steps:
step 1, carrying out normalized dynamic modeling of the over-shoulder launch; the model is normalized so that each state quantity has a similar magnitude, making the weight updates of the neural network more stable; first, the over-the-shoulder missile launch scenario is modeled to obtain the dynamic equations in the aerodynamic frame, the kinematic equations in the inertial frame, and an equation accounting for mass change;
step 2, modeling the research problem of step 1 as a Markov decision process, in order to fit the research paradigm of reinforcement learning;
step 3, building the algorithm networks and setting the algorithm parameters; the selected deep reinforcement learning algorithm is the proximal policy optimization algorithm PPO, which comprises an Actor network and a Critic network whose weight parameters are randomly initialized;
and step 4, until training reaches the target reward value or the maximum number of steps, the agent continuously collects state-transition data and rewards according to the PPO algorithm and continuously, iteratively updates the parameters of the Actor network and the Critic network.
2. The air-to-air missile over-shoulder launching guidance law design method based on deep reinforcement learning is characterized in that, in step 1, the equations are specifically:
wherein $\bar V$ is the normalized missile velocity, $\bar\theta$ the normalized trajectory inclination angle, $\bar x$ the normalized abscissa, and $\bar y$ the normalized ordinate, with $\dot{\bar V}$, $\dot{\bar\theta}$, $\dot{\bar x}$, $\dot{\bar y}$ their respective rates of change, and $V^*$, $\theta^*$, $x^*$, $y^*$ the normalization factors corresponding to each quantity; in addition, $\alpha$ is the missile attack angle, $P$ the main-engine thrust, $T_{rcs}$ the reaction-jet engine thrust, $u_p$ and $u_{rcs}$ the on-off logic variables of the main engine and the reaction-jet engine respectively, $F_D$ and $F_L$ the drag and lift with strong uncertainty, $m$ the missile mass, $m_c$ the rate of mass change, and $g$ the gravitational-acceleration constant.
3. The air-to-air missile over-shoulder launching guidance law design method based on deep reinforcement learning is characterized in that: in step 2, the specific process includes steps 201 to 203;
step 201, setting the action space; to ensure the dynamic stability of the system, the first derivative of the attack angle $\alpha$, namely $\dot\alpha$, is selected as the system input; in addition, taking $\dot\alpha$ as the action also satisfies the missile's maneuvering-capability limit; however, with the future development of air-to-air missile maneuvering capability, in particular with the assistance of thrust vectoring or reaction-jet action, the limit on the usable attack angle can also be removed;
step 202, setting the state space and observation space; the state space and observation space of the agent are set on the basis of the action chosen in step 201, but not all states in the system are meaningful for deciding the control command; redundant observations lead to instability in training, while insufficient observations tend to directly cause training not to converge;
step 203, setting the reward function; the setting of the reward function has an important influence on the final training effect; to avoid reward sparsity, the reward function is designed as a weighted sum of penalty terms on the deviation between the missile trajectory inclination angle $\theta_M$ and the desired turning angle, where $\lambda_1,\lambda_2,\lambda_3$ are hyperparameters to be set that adjust the proportion among the terms; and to improve the final turning accuracy, an additional reward $r_{bonus}$ is introduced, taking the value $r_b$ when the accuracy condition is met and zero otherwise; $r_b$ needs to be coordinated with the preceding terms to ensure that the agent obtains an appropriate reward at the desired accuracy $\theta_{thre}$.
4. The air-to-air missile over-shoulder launching guidance law design method based on deep reinforcement learning is characterized in that: step 4 specifically comprises steps 401 to 405;
step 401, collecting trajectory data under the current strategy $\pi_{\theta_{old}}$ and caching it in the experience pool until the pool is full; in each simulation step, for the current observation $o_t$, the current policy is executed to obtain the current action $a_t$, and the system dynamic equation is integrated to obtain the state $s_{t+1}$ and observation $o_{t+1}$ at the next moment, while the reward $r_t$ is obtained;
step 402, estimating the advantage function $\hat A_t$ by the generalized advantage estimation GAE method, with the final optimization target $J^{PPO}(\theta)=J^{clip}(\theta)-c_{vf}L^{VF}(\theta)+c_sS[\pi_\theta](o_t)$, wherein $c_{vf}$ and $c_s$ are hyperparameters for adjusting the proportion of each term; $J^{clip}(\theta)$ is the truncated target for increasing the probability of more advantageous actions, $L^{VF}(\theta)$ the value-function loss term, and $S[\pi_\theta](o_t)$ the maximum-entropy term encouraging exploration;
step 403, extracting trajectory data from the experience pool in minibatches of the batch size, and optimizing the parameters of the Actor network and the Critic network with the optimization target $J^{PPO}(\theta)$ by stochastic gradient descent, until the data in the experience pool have completed $K$ epochs of updates;
step 404, considering the randomness of the initial turning instruction, comparing the expectations of the accumulated rewards obtained by the new and old strategies, and updating the finally output network parameters;
and step 405, repeating steps 401 to 404 until training reaches the target reward value or the maximum number of training steps; the resulting Actor network, as the final strategy network, is deployed directly on the missile-borne computer to generate attack-angle guidance commands in real time.
5. The air-to-air missile over-shoulder launching guidance law design method based on deep reinforcement learning is characterized in that: in step 201, if the missile has an available-attack-angle limit, then $|\alpha|\le\alpha_{max}$, where $\alpha_{max}$ is the attack-angle limit, and the action $\dot\alpha$ is required to keep the attack angle within this limit.
7. The air-to-air missile over-shoulder launching guidance law design method based on deep reinforcement learning is characterized in that: in step 401, in each simulation step, for the current observation $o_t$, the current policy is executed to obtain the probability mean of the current action, i.e. $\mu_t=\pi_{\theta_{old}}(o_t)$; the current action $a_t$ is obtained by sampling from the Gaussian distribution $N(\mu_t,\sigma^2)$, and the system dynamic equation $f(x_t,a_t,t)$ is integrated to obtain the state $s_{t+1}$ and observation $o_{t+1}$ at the next moment, while the reward $r_{t+1}$ is calculated; this continues until the turn ends, collecting a set of traces $\{s_0,o_0,a_0,r_1,s_1,o_1,a_1,r_2,s_2,\dots\}$; under the current strategy $\pi_{\theta_{old}}$, trajectory data are collected and cached in the experience pool, where traces of multiple rounds can be cached until the pool is full.
8. The air-to-air missile over-shoulder launching guidance law design method based on deep reinforcement learning is characterized in that: in step 402, the objective function of the network update is further computed in a truncated manner to increase the probability of more advantageous actions:
$J^{clip}(\theta)=\hat{\mathbb{E}}_t\big[\min\big(r_t(\theta)\hat A_t,\ \mathrm{clip}(r_t(\theta),1-\epsilon,1+\epsilon)\hat A_t\big)\big]$,
wherein $\hat{\mathbb{E}}_t$ denotes the expected value, $r_t(\theta)$ is the probability ratio characterizing the new and old policies so as to control the update step size within the trust region, $\mathrm{clip}()$ is a truncation function, and $\epsilon$ is a hyperparameter serving as the truncation factor; further, in order to improve the accuracy of the Critic network's estimation of the value function, a value-function loss term $L^{VF}(\theta)$ and a maximum-entropy term $S[\pi_\theta](o_t)$ encouraging exploration are added to $J^{clip}(\theta)$, yielding the final optimization target $J^{PPO}(\theta)=J^{clip}(\theta)-c_{vf}L^{VF}(\theta)+c_sS[\pi_\theta](o_t)$.
9. The air-to-air missile over-shoulder launching guidance law design method based on deep reinforcement learning is characterized in that: in step 403, trajectory data is extracted from the experience pool in minibatches of the batch size, set to $N_B$; the optimization target $J^{PPO}(\theta)$ is used to optimize the parameters of the Actor network and the Critic network by stochastic gradient descent until the data in the experience pool have completed $K$ epochs of updates; the network parameters are updated according to the formula $\theta\leftarrow\theta+\alpha_{LR}\nabla_\theta J^{PPO}(\theta)$, where $\alpha_{LR}$ is the learning rate.
10. The air-to-air missile over-shoulder launching guidance law design method based on deep reinforcement learning is characterized in that: in step 404, if $R(\theta)>R(\theta^\star)$, where $R(\theta^\star)$ is the per-round cumulative reward under the finally output strategy $\pi_{\theta^\star}$ and $R(\theta)$ is the per-round cumulative reward under the strategy $\pi_\theta$ updated by the network in step 403, then the network parameter of the final strategy is set to $\theta^\star=\theta$.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111550831.2A CN114519292A (en) | 2021-12-17 | 2021-12-17 | Air-to-air missile over-shoulder launching guidance law design method based on deep reinforcement learning |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111550831.2A CN114519292A (en) | 2021-12-17 | 2021-12-17 | Air-to-air missile over-shoulder launching guidance law design method based on deep reinforcement learning |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114519292A true CN114519292A (en) | 2022-05-20 |
Family
ID=81596073
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111550831.2A Pending CN114519292A (en) | 2021-12-17 | 2021-12-17 | Air-to-air missile over-shoulder launching guidance law design method based on deep reinforcement learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114519292A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114861368A (en) * | 2022-06-13 | 2022-08-05 | 中南大学 | Method for constructing railway longitudinal section design learning model based on near-end strategy |
CN115328638A (en) * | 2022-10-13 | 2022-11-11 | 北京航空航天大学 | Multi-aircraft task scheduling method based on mixed integer programming |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111930141A (en) * | 2020-07-21 | 2020-11-13 | 哈尔滨工程大学 | Three-dimensional path visual tracking method for underwater robot |
CN112799429A (en) * | 2021-01-05 | 2021-05-14 | 北京航空航天大学 | Multi-missile cooperative attack guidance law design method based on reinforcement learning |
CN112861442A (en) * | 2021-03-10 | 2021-05-28 | 中国人民解放军国防科技大学 | Multi-machine collaborative air combat planning method and system based on deep reinforcement learning |
- 2021-12-17 CN CN202111550831.2A patent/CN114519292A/en active Pending
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111930141A (en) * | 2020-07-21 | 2020-11-13 | 哈尔滨工程大学 | Three-dimensional path visual tracking method for underwater robot |
CN112799429A (en) * | 2021-01-05 | 2021-05-14 | 北京航空航天大学 | Multi-missile cooperative attack guidance law design method based on reinforcement learning |
CN112861442A (en) * | 2021-03-10 | 2021-05-28 | 中国人民解放军国防科技大学 | Multi-machine collaborative air combat planning method and system based on deep reinforcement learning |
Non-Patent Citations (2)
Title |
---|
FEI LIU et al.: "Cooperative differential games guidance laws for multiple attackers against an active defense target", ResearchGate, 31 October 2021 (2021-10-31) *
CHEN Zhongyuan; WEI Wenshu; CHEN Wanchun: "Intelligent guidance law for cooperative attack of multiple missiles based on reinforcement learning", Acta Armamentarii (兵工学报), no. 008, 31 August 2021 (2021-08-31) *
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114861368A (en) * | 2022-06-13 | 2022-08-05 | 中南大学 | Method for constructing railway longitudinal section design learning model based on near-end strategy |
CN114861368B (en) * | 2022-06-13 | 2023-09-12 | 中南大学 | Construction method of railway longitudinal section design learning model based on near-end strategy |
CN115328638A (en) * | 2022-10-13 | 2022-11-11 | 北京航空航天大学 | Multi-aircraft task scheduling method based on mixed integer programming |
CN115328638B (en) * | 2022-10-13 | 2023-01-10 | 北京航空航天大学 | Multi-aircraft task scheduling method based on mixed integer programming |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108445766A (en) | Model-free quadrotor drone contrail tracker and method based on RPD-SMC and RISE | |
CN114519292A (en) | Air-to-air missile over-shoulder launching guidance law design method based on deep reinforcement learning | |
Hong et al. | Model predictive convex programming for constrained vehicle guidance | |
Wang et al. | Improving maneuver strategy in air combat by alternate freeze games with a deep reinforcement learning algorithm | |
Li et al. | Autonomous maneuver decision-making for a UCAV in short-range aerial combat based on an MS-DDQN algorithm | |
CN114253296B (en) | Hypersonic aircraft airborne track planning method and device, aircraft and medium | |
Lin et al. | Development of an integrated fuzzy-logic-based missile guidance law against high speed target | |
CN111898201B (en) | High-precision autonomous attack guiding method for fighter in air combat simulation environment | |
CN113377121B (en) | Aircraft intelligent disturbance rejection control method based on deep reinforcement learning | |
Lee et al. | Autonomous control of combat unmanned aerial vehicles to evade surface-to-air missiles using deep reinforcement learning | |
Chai et al. | A hierarchical deep reinforcement learning framework for 6-DOF UCAV air-to-air combat | |
Gong et al. | All-aspect attack guidance law for agile missiles based on deep reinforcement learning | |
CN115576353A (en) | Aircraft formation control method based on deep reinforcement learning | |
CN115079565A (en) | Variable-coefficient constraint guidance method and device with falling angle and aircraft | |
Duan et al. | Autonomous maneuver decision for unmanned aerial vehicle via improved pigeon-inspired optimization | |
Qazi et al. | Rapid trajectory optimization using computational intelligence for guidance and conceptual design of multistage space launch vehicles | |
CN115357051B (en) | Deformation and maneuvering integrated avoidance and defense method | |
Du et al. | Deep reinforcement learning based missile guidance law design for maneuvering target interception | |
CN116432539A (en) | Time consistency collaborative guidance method, system, equipment and medium | |
Chen et al. | The design of particle swarm optimization guidance using a line-of-sight evaluation method | |
Minglang et al. | Maneuvering decision in short range air combat for unmanned combat aerial vehicles | |
Gui et al. | Reaction control system optimization for maneuverable reentry vehicles based on particle swarm optimization | |
Bin et al. | Cooperative guidance for maneuvering penetration with attack time consensus and bounded input | |
CN117192982B (en) | Control parameterization-based short-distance air combat maneuver decision optimization method | |
Wu et al. | Decision-Making Method of UAV Maneuvering in Close-Range Confrontation based on Deep Reinforcement Learning |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination |