CN114519292A - Air-to-air missile over-shoulder launching guidance law design method based on deep reinforcement learning - Google Patents

Air-to-air missile over-shoulder launching guidance law design method based on deep reinforcement learning

Info

Publication number
CN114519292A
Authority
CN
China
Prior art keywords
missile
air
network
shoulder
reinforcement learning
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111550831.2A
Other languages
Chinese (zh)
Inventor
陈万春
龚晓鹏
陈中原
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beihang University
Original Assignee
Beihang University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beihang University filed Critical Beihang University
Priority to CN202111550831.2A priority Critical patent/CN114519292A/en
Publication of CN114519292A publication Critical patent/CN114519292A/en
Pending legal-status Critical Current

Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F 30/00 — Computer-aided design [CAD]
    • G06F 30/20 — Design optimisation, verification or simulation
    • G06F 30/27 — Design optimisation, verification or simulation using machine learning, e.g. artificial intelligence, neural networks, support vector machines [SVM] or training a model
    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F 2119/00 — Details relating to the type or aim of the analysis or the optimisation
    • G06F 2119/02 — Reliability analysis or reliability optimisation; Failure analysis, e.g. worst case scenario performance, failure mode and effects analysis [FMEA]
    • Y — GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 — TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T — CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T 90/00 — Enabling technologies or technologies with a potential or indirect contribution to GHG emissions mitigation

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Computer Hardware Design (AREA)
  • Geometry (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Aiming, Guidance, Guns With A Light Source, Armor, Camouflage, And Targets (AREA)

Abstract

The invention relates to a method for designing an over-the-shoulder launch guidance law for air-to-air missiles based on deep reinforcement learning, comprising the following steps. Step 1: perform normalized dynamics modeling of the over-the-shoulder launch; normalizing the model gives every state quantity a similar magnitude, which makes the weight updates of the neural network more stable. Step 2: to fit the reinforcement learning paradigm, the problem of step 1 is modeled as a Markov decision process. Step 3: build the algorithm networks and set the algorithm parameters. Step 4: until training reaches the target reward value or the maximum number of steps, the agent keeps collecting state-transition data and rewards according to the PPO algorithm and iteratively updates the parameters of the Actor and Critic networks. By applying the technical scheme of the invention, the missile obtains a suboptimal and robust angle-of-attack guidance law in a complex aerodynamic environment; the limits of different missile maneuvering capabilities are taken into account, and the method has practical value for future air combat.

Description

Air-to-air missile over-shoulder launching guidance law design method based on deep reinforcement learning
Technical Field
The invention relates to the field of aircraft control, and in particular to a method for designing an over-the-shoulder launch guidance law for air-to-air missiles based on deep reinforcement learning.
Background
In modern air combat, as fighter maneuverability keeps improving, close-range engagement scenarios are becoming increasingly complex. To improve a fighter's effectiveness in close combat, the over-the-shoulder launch mode, which allows attacking targets in the rear hemisphere, is an important research topic. An air-to-air missile launched over the shoulder can rapidly reverse its flight direction after launch and enters the terminal guidance phase once the seeker locks onto the target, giving the missile all-aspect attack capability. However, during the high angle-of-attack turn the missile faces complex aerodynamic phenomena such as asymmetric vortex shedding and induced moments; it is a typical strongly nonlinear system with high uncertainty, which places higher demands on the missile guidance and control system.
Current research on over-the-shoulder launch focuses mainly on robust autopilot design, and relatively little work addresses guidance law design. A commonly adopted approach is to let the autopilot track an offline-optimized trajectory or a constant angle of attack, but under a complex aerodynamic environment and a rapidly changing air combat situation the target is easily lost after an over-the-shoulder launch. A suitable guidance law can adapt to the dynamic changes of the battlefield, reduce the design burden on the autopilot, and improve the overall robustness of the missile guidance and control system.
Further taking into account the maneuvering capability of current missile models and their future development potential, the guidance law designed with deep reinforcement learning allows the maximum available angle of attack of the missile to be set conveniently, which broadens the possible range of application and the practical feasibility of the invention. Facing increasingly complex air combat environments and highly maneuverable fighters, the intelligent guidance law provided by the invention has important application value.
Disclosure of Invention
The main purpose of the invention is to provide a method for designing an over-the-shoulder launch guidance law for air-to-air missiles based on deep reinforcement learning, so as to address at least the problems described in the background above.
According to one aspect of the invention, a method for designing an over-the-shoulder launch guidance law for air-to-air missiles based on deep reinforcement learning is provided, comprising the following steps:
Step 1: perform normalized dynamics modeling of the over-the-shoulder launch. Normalizing the model gives every state quantity a similar magnitude, which makes the weight updates of the neural network more stable. First, the missile over-the-shoulder launch scenario is modeled, yielding the dynamics equations in the velocity (wind) frame, the kinematic equations in the inertial frame, and an equation accounting for the mass change.
Step 2: further, to fit the reinforcement learning paradigm, the problem of step 1 is modeled as a Markov decision process. The specific procedure comprises steps 201 to 203.
Step 201: set the action space. To ensure the dynamic stability of the system, the first derivative of the angle of attack α, i.e. $\dot{\alpha}$, is selected as the system input. In addition, taking $\dot{\alpha}$ as the action makes it convenient to satisfy the missile maneuvering capability limit. However, as the maneuvering capability of air-to-air missiles develops, especially with the assistance of thrust vectoring or reaction jets, the available angle-of-attack limit can be removed.
Step 202: set the state space and observation space. On the basis of the action set in step 201, the state space and observation space of the agent are set; not all states of the system are meaningful for deciding the control command. Redundant observations lead to unstable training, while insufficient observations tend to cause the training not to converge at all.
Step 203: set the reward function. The reward function strongly influences the final training result. To avoid reward sparsity, the reward function is designed as a dense shaping function of the deviation between the missile trajectory inclination angle $\theta_M$ and the desired turning angle $\bar{\theta}_c$, where $\lambda_1, \lambda_2, \lambda_3$ are hyperparameters to be set that adjust the proportions among the terms. To improve the final turning accuracy, an additional reward $r_{bonus}$ is introduced,

$r_{bonus} = \begin{cases} r_b, & |\theta_M - \bar{\theta}_c| \le \theta_{thre} \\ 0, & \text{otherwise} \end{cases}$

where $r_b$ is the extra reward granted when the accuracy condition is satisfied; $r_b$ must be coordinated with the preceding terms so that the agent obtains an appropriate reward within the desired accuracy $\theta_{thre}$.
Step 3: build the algorithm networks and set the algorithm parameters. The deep reinforcement learning algorithm selected in the invention is Proximal Policy Optimization (PPO); the algorithm comprises an Actor network and a Critic network, and the network weight parameters are initialized randomly.
Step 4: until training reaches the target reward value or the maximum number of steps, the agent keeps collecting state-transition data and rewards according to the PPO algorithm and iteratively updates the parameters of the Actor and Critic networks. Specifically, this comprises steps 401 to 405.
Step 401: under the current policy $\pi_{\theta_k}$, collect trajectory data and cache them in the experience pool until the pool is full. At each simulation step, for the current observation $o_t$, execute the current policy $\pi_{\theta_k}$ to obtain the current action $a_t$, integrate the system dynamics to obtain the next state $s_{t+1}$ and observation $o_{t+1}$, and obtain the reward $r_t$.
Step 402: estimate the advantage function $\hat{A}_t$ with generalized advantage estimation (GAE). The final optimization objective is

$J^{PPO}(\theta) = \hat{\mathbb{E}}_t\left[J^{CLIP}(\theta) - c_{vf} L^{VF}(\theta) + c_s S[\pi_\theta](o_t)\right]$

where $c_{vf}$ and $c_s$ are hyperparameters that adjust the proportions of the terms, $J^{CLIP}(\theta)$ is the probability-clipped objective that increases the probability of more advantageous actions, $L^{VF}(\theta)$ is the value-function loss term, and $S[\pi_\theta]$ is the maximum-entropy term that encourages exploration.
Step 403: draw trajectory data from the experience pool in minibatches and optimize the parameters of the Actor and Critic networks on the objective $J^{PPO}(\theta)$ by stochastic gradient steps until the data in the experience pool have been used for K epochs of updates.
Step 404: taking into account the randomness of the initial turning command, compare the expected cumulative rewards obtained by the new and old policies and update the finally output network parameters.
Step 405: repeat steps 401 to 404 until training reaches the target reward value or the maximum number of training steps. The resulting Actor network, as the final policy network, can be deployed directly on the missile-borne computer to generate angle-of-attack guidance commands in real time.
The advantages and beneficial effects of the invention are as follows: by applying the technical scheme of the invention, the missile obtains a suboptimal and robust angle-of-attack guidance law in a complex aerodynamic environment; the limits of different missile maneuvering capabilities are taken into account, and the method has practical value for future air combat.
Drawings
Fig. 1 is a schematic diagram of the planar engagement geometry of an air-to-air missile over-the-shoulder launch, provided according to an embodiment of the invention.
Fig. 2 is a schematic diagram of the interaction between an agent using the PPO algorithm and the environment, according to an embodiment of the invention.
Fig. 3 shows the missile learning curves with limited and with unlimited maneuvering capability, respectively, according to an embodiment of the invention.
Fig. 4a shows the convergence of the missile turning angle with limited maneuvering capability.
Fig. 4b shows the convergence of the missile turning angle with unlimited maneuvering capability.
Fig. 5a shows the missile velocity versus time for the maneuverability-limited agent and the optimal solution, according to an embodiment of the invention.
Fig. 5b shows the missile angle of attack versus time for the maneuverability-limited agent and the optimal solution.
Fig. 5c shows the missile trajectory inclination angle versus time for the maneuverability-limited agent and the optimal solution.
Fig. 5d shows the trajectory in the longitudinal plane for the maneuverability-limited agent and the optimal solution.
Fig. 6a shows the missile velocity versus time for the agent with unlimited maneuvering capability and the optimal solution.
Fig. 6b shows the missile angle of attack versus time for the agent with unlimited maneuvering capability and the optimal solution.
Fig. 6c shows the missile trajectory inclination angle versus time for the agent with unlimited maneuvering capability and the optimal solution.
Fig. 6d shows the trajectory in the longitudinal plane for the agent with unlimited maneuvering capability and the optimal solution.
Fig. 7a shows the terminal trajectory inclination deviation and angle-of-attack distribution with limited missile maneuvering capability.
Fig. 7b shows the terminal trajectory inclination deviation and angle-of-attack distribution with unlimited missile maneuvering capability.
Detailed Description
It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. The following description of at least one exemplary embodiment is merely illustrative in nature and is in no way intended to limit the invention, its application, or uses. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Step 1: perform normalized dynamics modeling of the over-the-shoulder launch. Normalizing the model gives every state quantity a similar magnitude, which makes the weight updates of the neural network more stable. First, the missile over-the-shoulder launch scenario is modeled, yielding the following dynamics equations in the velocity (wind) frame, kinematic equations in the inertial frame, and equation accounting for the mass change:
$\dot{\bar{V}} = \dfrac{1}{V^*}\left[\dfrac{P\cos\alpha\,u_p - F_D}{m} - g\sin\theta\right]$

$\dot{\bar{\theta}} = \dfrac{1}{\theta^*}\left[\dfrac{P\sin\alpha\,u_p + T_{rcs}u_{rcs} + F_L}{mV} - \dfrac{g\cos\theta}{V}\right]$

$\dot{\bar{x}} = \dfrac{V\cos\theta}{x^*}, \qquad \dot{\bar{y}} = \dfrac{V\sin\theta}{y^*}$

$\dot{m} = -m_c$
where $\bar{V} = V/V^*$ is the normalized missile flight speed, $\bar{\theta} = \theta/\theta^*$ is the normalized trajectory inclination angle, $\bar{x} = x/x^*$ is the normalized horizontal coordinate, $\bar{y} = y/y^*$ is the normalized vertical coordinate, $\dot{\bar{V}}, \dot{\bar{\theta}}, \dot{\bar{x}}, \dot{\bar{y}}$ are the rates of change of the foregoing quantities, and $V^*, \theta^*, x^*, y^*$ are the corresponding normalization factors. In addition, α is the missile angle of attack, P is the main-engine thrust, $T_{rcs}$ is the reaction-jet thrust, $u_p$ and $u_{rcs}$ are the on-off logic variables of the main engine and the reaction jet respectively, $F_D$ and $F_L$ are the highly uncertain drag and lift, m is the missile mass, $m_c$ is the mass consumption rate, and g is the gravitational acceleration constant.
Step 2: further, to fit the reinforcement learning paradigm, the problem of step 1 is modeled as a Markov decision process. The specific procedure comprises steps 201 to 203.
Step 201: set the action space. To ensure the dynamic stability of the system, the first derivative of the angle of attack α, i.e. $\dot{\alpha}$, is selected as the system input. In addition, taking $\dot{\alpha}$ as the action makes it convenient to satisfy the missile maneuvering capability limit: if the missile has an available angle-of-attack limit, i.e. $|\alpha| \le \alpha_{max}$, where $\alpha_{max}$ is the angle-of-attack limit, then the angle of attack obtained by integrating $\dot{\alpha}$ is kept within $[-\alpha_{max}, \alpha_{max}]$. However, as the maneuvering capability of air-to-air missiles develops, especially with the assistance of thrust vectoring or reaction jets, the available angle-of-attack limit can be removed.
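To make steps 1 and 201 concrete, the following is a minimal sketch of one integration step of the planar model reconstructed above, with the angle-of-attack rate as the action and the angle of attack saturated at $\alpha_{max}$. It assumes simple Euler integration; the drag() and lift() placeholders and all numerical constants are illustrative, since the patent computes the aerodynamic coefficients with CFD and does not list them here.

```python
import numpy as np

def drag(v, alpha):
    # Placeholder: the patent obtains the high angle-of-attack drag from CFD, not a formula.
    return 0.0

def lift(v, alpha):
    # Placeholder: the patent obtains the high angle-of-attack lift from CFD, not a formula.
    return 0.0

def step(state, alpha_dot, dt, p, t_rcs, u_p, u_rcs, m_c,
         v_star, th_star, x_star, y_star,
         alpha_max=np.deg2rad(90.0), g=9.81):
    """One Euler step of the normalized planar model; the action is the angle-of-attack rate."""
    v_bar, th_bar, x_bar, y_bar, m, alpha = state
    v, th = v_bar * v_star, th_bar * th_star            # recover physical speed and inclination
    f_d, f_l = drag(v, alpha), lift(v, alpha)

    dv_bar = ((p * np.cos(alpha) * u_p - f_d) / m - g * np.sin(th)) / v_star
    dth_bar = ((p * np.sin(alpha) * u_p + t_rcs * u_rcs + f_l) / (m * v)
               - g * np.cos(th) / v) / th_star
    dx_bar = v * np.cos(th) / x_star
    dy_bar = v * np.sin(th) / y_star
    dm = -m_c                                           # mass change while the propellant burns

    alpha_new = np.clip(alpha + alpha_dot * dt, -alpha_max, alpha_max)  # available angle-of-attack limit
    return np.array([v_bar + dv_bar * dt, th_bar + dth_bar * dt,
                     x_bar + dx_bar * dt, y_bar + dy_bar * dt,
                     m + dm * dt, alpha_new])
```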
Step 202: set the state space and observation space. With the action set in step 201, the state space of the system is the set of normalized dynamics states augmented with the angle of attack α, but not all states of the system are meaningful for deciding the control command. Redundant observations lead to unstable training, while insufficient observations tend to cause the training not to converge. In the present invention, the observation space is set as a subset of the state quantities together with the error relative to the desired turning angle $\bar{\theta}_c$.
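For illustration only, the snippet below assembles one possible low-dimensional observation of the kind described in step 202. The exact components chosen in the patent appear only in its figures, so this particular vector (normalized speed, angle of attack, remaining turn-angle error) is an assumption.

```python
import numpy as np

def make_observation(v_bar, alpha, theta_m, theta_c):
    # Hypothetical observation vector: normalized speed, angle of attack,
    # and the remaining error between the desired turning angle and the
    # current trajectory inclination angle.
    return np.array([v_bar, alpha, theta_c - theta_m], dtype=np.float32)
```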
Step 203: set the reward function. The reward function strongly influences the final training result. To avoid reward sparsity, the reward function is designed as a dense shaping function of the deviation between the missile trajectory inclination angle $\theta_M$ and the desired turning angle $\bar{\theta}_c$, where $\lambda_1, \lambda_2, \lambda_3$ are hyperparameters to be set that adjust the proportions among the terms. To improve the final turning accuracy, an additional reward $r_{bonus}$ is introduced,

$r_{bonus} = \begin{cases} r_b, & |\theta_M - \bar{\theta}_c| \le \theta_{thre} \\ 0, & \text{otherwise} \end{cases}$

where $r_b$ is coordinated with the preceding terms so that the agent obtains an appropriate reward within the desired accuracy $\theta_{thre}$.
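A sketch of the reward structure of step 203, assuming the dense term is a shaping of the remaining turn-angle error; the exact shaping expression appears only in the patent's figures, so the exponential form below is illustrative. The λ weights use the example values given later in the embodiment (-0.1, 2, 1.5), while θ_thre and r_b are hypothetical.

```python
import numpy as np

def reward(theta_m, theta_c, lam1=-0.1, lam2=2.0, lam3=1.5,
           theta_thre=np.deg2rad(1.0), r_b=10.0):
    """Dense shaping term plus the sparse bonus r_bonus near the desired accuracy."""
    err = abs(theta_c - theta_m)
    r_dense = lam1 + lam2 * np.exp(-lam3 * err)   # hypothetical shaping of the turn-angle error
    r_bonus = r_b if err <= theta_thre else 0.0   # extra reward once the accuracy condition is met
    return r_dense + r_bonus
```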
Step 3: build the algorithm networks and set the algorithm parameters. The deep reinforcement learning algorithm selected in the invention is Proximal Policy Optimization (PPO); the algorithm comprises an Actor network and a Critic network, and the network weight parameters are initialized randomly.
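A minimal PyTorch sketch of the Actor (Gaussian policy) and Critic (state-value) networks used by PPO in step 3. The layer sizes and activations are placeholders: the actual architecture is listed in Table 2 of the embodiment, which is not reproduced here.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Outputs the mean of a Gaussian over the action (alpha_dot); the std is a learned parameter."""
    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh(),
                                  nn.Linear(hidden, hidden), nn.Tanh(),
                                  nn.Linear(hidden, act_dim))
        self.log_std = nn.Parameter(torch.zeros(act_dim))  # exploration noise level

    def forward(self, obs):
        mean = self.body(obs)
        return torch.distributions.Normal(mean, self.log_std.exp())

class Critic(nn.Module):
    """Estimates the state value used to compute advantages."""
    def __init__(self, obs_dim, hidden=64):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh(),
                                  nn.Linear(hidden, hidden), nn.Tanh(),
                                  nn.Linear(hidden, 1))

    def forward(self, obs):
        return self.body(obs).squeeze(-1)
```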
Step 4: until training reaches the target reward value or the maximum number of steps, the agent keeps collecting state-transition data and rewards according to the PPO algorithm and iteratively updates the parameters of the Actor and Critic networks. Specifically, this comprises steps 401 to 405.
Step 401: at each simulation step, based on the current observation $o_t$, the current policy is executed to obtain the probability mean of the current action, i.e. $\mu_t = \pi_{\theta_k}(o_t)$; the current action $a_t$ is obtained by sampling from the Gaussian distribution $\mathcal{N}(\mu_t, \sigma^2)$, the system dynamics $f(x_t, a_t, t)$ are integrated to obtain the next state $s_{t+1}$ and observation $o_{t+1}$, and the reward $r_t$ is computed. This continues until the turn is over, yielding a set of trajectories $\{s_0, o_0, a_0, r_1, s_1, o_1, a_1, r_2, s_2, \dots\}$. Under the current policy $\pi_{\theta_k}$, trajectory data are thus collected and cached in the experience pool, which may hold the trajectories of several rounds, until the pool is full.
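A sketch of the data-collection loop of step 401: the current policy is run in the environment and transitions are cached until the experience pool holds the required number of samples. The env object with reset()/step() is an assumed wrapper around the dynamics sketched earlier.

```python
import torch

def collect_rollouts(env, actor, critic, pool_size):
    """Run the current policy and cache (obs, action, log-prob, reward, value, done) tuples."""
    buf = {k: [] for k in ("obs", "act", "logp", "rew", "val", "done")}
    obs = env.reset()
    while len(buf["obs"]) < pool_size:
        obs_t = torch.as_tensor(obs, dtype=torch.float32)
        with torch.no_grad():
            dist = actor(obs_t)                  # Gaussian over the angle-of-attack rate
            act = dist.sample()                  # stochastic action encourages exploration
            logp = dist.log_prob(act).sum()
            val = critic(obs_t)
        next_obs, rew, done = env.step(act.numpy())
        for k, v in zip(("obs", "act", "logp", "rew", "val", "done"),
                        (obs, act, logp, rew, val, done)):
            buf[k].append(v)
        obs = env.reset() if done else next_obs  # a finished turn ends the episode
    return buf
```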
Step 402: the advantage function $\hat{A}_t$ is estimated with generalized advantage estimation (GAE), and the objective function of the network update is computed in clipped (truncated) form to increase the probability of more advantageous actions:

$J^{CLIP}(\theta) = \hat{\mathbb{E}}_t\left[\min\left(r_t(\theta)\hat{A}_t,\ \mathrm{clip}\left(r_t(\theta), 1-\epsilon, 1+\epsilon\right)\hat{A}_t\right)\right]$

where $\hat{\mathbb{E}}_t[\cdot]$ denotes the expectation, $r_t(\theta)$ is the probability ratio characterizing the new and old policies so as to keep the update step within the trust region, $\mathrm{clip}(\cdot)$ is the truncation function, and $\epsilon$ is a hyperparameter acting as the truncation factor. Further, to improve the accuracy of the Critic network's value estimate, the value-function loss term $L^{VF}(\theta)$ and the maximum-entropy term $S[\pi_\theta]$ that encourages exploration are added to $J^{CLIP}(\theta)$, giving the final optimization objective

$J^{PPO}(\theta) = \hat{\mathbb{E}}_t\left[J^{CLIP}(\theta) - c_{vf} L^{VF}(\theta) + c_s S[\pi_\theta](o_t)\right]$

where $c_{vf}$ and $c_s$ are hyperparameters that adjust the proportions of the terms.
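A sketch of step 402: generalized advantage estimation followed by the clipped surrogate objective with the value-loss and entropy terms. Only ε, c_vf and c_s are named in the patent; γ, λ and the specific coefficient values below are assumed.

```python
import torch

def gae(rewards, values, dones, gamma=0.99, lam=0.95):
    """Generalized advantage estimation (GAE-lambda) over one buffer of transitions."""
    adv, running = [0.0] * len(rewards), 0.0
    for t in reversed(range(len(rewards))):
        next_v = 0.0 if (dones[t] or t + 1 >= len(values)) else values[t + 1]
        delta = rewards[t] + gamma * next_v - values[t]
        running = delta + gamma * lam * (0.0 if dones[t] else running)
        adv[t] = running
    return torch.tensor(adv, dtype=torch.float32)

def ppo_objective(actor, critic, obs, act, logp_old, adv, ret,
                  eps=0.3, c_vf=0.5, c_s=0.01):
    """J_PPO = clipped surrogate - c_vf * value loss + c_s * entropy (to be maximized)."""
    dist = actor(obs)
    logp = dist.log_prob(act).sum(-1)
    ratio = torch.exp(logp - logp_old)                     # r_t(theta): new/old probability ratio
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)
    j_clip = torch.min(ratio * adv, clipped * adv).mean()  # raises the probability of advantageous actions
    v_loss = ((critic(obs) - ret) ** 2).mean()             # value-function loss term
    entropy = dist.entropy().sum(-1).mean()                # maximum-entropy term encouraging exploration
    return j_clip - c_vf * v_loss + c_s * entropy
```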
Step 403: trajectory data are drawn from the experience pool in minibatches of size $N_B$, and the objective $J^{PPO}(\theta)$ is optimized over the parameters of the Actor and Critic networks by stochastic gradient steps until the data in the experience pool have been used for K epochs of updates. The network parameters are updated according to

$\theta \leftarrow \theta + \alpha_{LR}\nabla_\theta J^{PPO}(\theta)$

where $\alpha_{LR}$ is the learning rate.
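A sketch of the minibatch update of step 403, using the ppo_objective above: the objective is maximized by stochastic gradient ascent, implemented as Adam on its negative, for K epochs over the experience pool. N_B, K and the learning rate follow the example values given later in the embodiment (2^10, 4 and 10^-4); the returns in batch["ret"] are assumed to be the advantages plus the value estimates.

```python
import torch

def ppo_update(actor, critic, batch, n_epochs=4, minibatch=1024, lr=1e-4):
    """Gradient ascent on J_PPO for K epochs of minibatches drawn from the experience pool."""
    params = list(actor.parameters()) + list(critic.parameters())
    opt = torch.optim.Adam(params, lr=lr)
    n = batch["obs"].shape[0]
    for _ in range(n_epochs):                              # K epochs over the whole pool
        for mb in torch.randperm(n).split(minibatch):      # shuffled minibatches of size N_B
            j = ppo_objective(actor, critic,
                              batch["obs"][mb], batch["act"][mb], batch["logp"][mb],
                              batch["adv"][mb], batch["ret"][mb])
            opt.zero_grad()
            (-j).backward()                                # maximize J_PPO  <=>  minimize -J_PPO
            opt.step()
```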
Step 404: taking into account the randomness of the initial turning command, the expected cumulative rewards obtained by the new and old policies are compared. If

$\mathbb{E}\left[R(\theta)\right] \ge \mathbb{E}\left[R(\theta^-)\right]$

where $\mathbb{E}[\cdot]$ denotes the expectation, $R(\theta^-)$ is the episode cumulative reward obtained under the finally output policy $\pi_{\theta^-}$, and $R(\theta)$ is the episode cumulative reward obtained under the policy $\pi_\theta$ updated in step 403, then the network parameters of the final policy are set to $\theta^- = \theta$.
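A sketch of the selection rule of step 404: because the initial turning command is randomized, the expected episode return of the newly updated policy is estimated over several evaluation rounds and compared with that of the current best policy, and the better parameter set is kept as the finally output policy. The evaluation episode count and the deterministic-mean helper are assumptions.

```python
import numpy as np
import torch

def episode_return(env, actor):
    """Cumulative reward of one evaluation round under the deterministic (mean) policy."""
    obs, done, total = env.reset(), False, 0.0
    while not done:
        with torch.no_grad():
            act = actor(torch.as_tensor(obs, dtype=torch.float32)).mean  # no exploration noise
        obs, rew, done = env.step(act.numpy())
        total += rew
    return total

def select_policy(env, actor_new, actor_best, episodes=20):
    """Keep the parameter set with the larger expected cumulative reward as theta_minus."""
    r_new = np.mean([episode_return(env, actor_new) for _ in range(episodes)])
    r_best = np.mean([episode_return(env, actor_best) for _ in range(episodes)])
    return actor_new if r_new >= r_best else actor_best
```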
Step 405: repeat steps 401 to 404 until training reaches the target reward value or the maximum number of training steps. The resulting Actor network, as the final policy network, can be deployed directly on the missile-borne computer to generate angle-of-attack guidance commands in real time.
Because reinforcement learning follows an off-line training, on-line deployment paradigm, the training part, which has high computational requirements, is completed at a ground workstation; the finally obtained policy network $\pi_{\theta^*}$ is essentially a sequence of matrix operations and activation-function evaluations, occupies little memory and few computing resources, and can meet the real-time requirements of online computation.
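A sketch of the on-board use of the trained Actor network: after off-line training, generating a command is a single forward pass (a few matrix products and activations), with the Gaussian mean taken deterministically and integrated into the angle-of-attack command. The file name, time step and limit value are placeholders.

```python
import numpy as np
import torch

actor = torch.load("actor_final.pt")        # hypothetical file holding the trained policy network
actor.eval()

def guidance_command(obs, alpha_prev, dt=0.01, alpha_max=np.deg2rad(90.0)):
    """One real-time guidance step: observation -> alpha_dot -> integrated alpha command."""
    with torch.no_grad():
        alpha_dot = actor(torch.as_tensor(obs, dtype=torch.float32)).mean.item()  # deterministic mean
    return float(np.clip(alpha_prev + alpha_dot * dt, -alpha_max, alpha_max))
```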
In addition, the Proximal Policy Optimization (PPO) algorithm adopted by the invention is a model-free, policy-gradient reinforcement learning algorithm that is commonly used to generate policies over continuous action spaces and has shown excellent performance in many fields. Because the PPO algorithm does not need to model the system during training but makes decisions directly from the policy network, it is well suited to scenarios in which the missile's aerodynamic parameters are unstable and disturbances are strong at high angles of attack, and it can exhibit robustness far beyond that of conventional guidance laws. The PPO algorithm is highly stable, insensitive to its parameters, and simple to implement, making it suitable for control problems of complex systems that are difficult to model and subject to strong disturbance noise, uncertainty, and nonlinearity. PPO adopts an Actor-Critic architecture comprising an Actor network and a Critic network, and optimizes the Actor policy by maximizing the cumulative reward. The Actor network outputs the mean of a Gaussian distribution over the action given the current observation; setting the noise appropriately during training encourages exploration and prevents premature convergence to a local optimum, while no noise is added during deployment and verification so as to ensure stable execution of the policy. The Critic network evaluates the Actor network's action under the current observation and serves as the basis for optimizing the Actor network. Therefore, during iterative training both the Actor and Critic networks are trained, whereas only the Actor network, not the Critic network, is needed for deployment and verification. The deep neural networks are trained by drawing the trajectory data stored in the experience pool in minibatches and training for K epochs. It should be noted that although the PPO algorithm uses an experience pool, it is an on-policy rather than an off-policy algorithm, because the trajectory data stored in the pool are all obtained under the current policy $\pi_{\theta_k}$ and not under other policies.
For a further understanding of the present invention, the method for designing an over-the-shoulder launch guidance law for air-to-air missiles based on deep reinforcement learning is described in detail below with reference to the accompanying drawings.
Step 1: perform normalized dynamics modeling of the over-the-shoulder launch. The engagement geometry is shown in Fig. 1, where $R_{TM}$ is the distance between the missile and the rear-hemisphere target and RCS denotes the reaction jet control; an ideal autopilot is assumed in this example. The aerodynamic parameters of the missile at high angle of attack (mainly the lift and drag coefficients in this example) carry large uncertainties and are difficult to obtain from empirical formulas; in this example they are computed with the computational fluid dynamics software Fluent. The initial conditions for training in each round are shown in Table 1. In addition, the commanded turning angle $\bar{\theta}_c$ given to the agent is constant within a round, but at each round initialization $\bar{\theta}_c$ takes a random value in $[30°, 180°]$ or $[-180°, -30°]$.
Table 1. Initial conditions of the training scenario
Step 2: to fit the reinforcement learning paradigm, the problem of step 1 is further modeled as a Markov decision process. The specific procedure comprises steps 201 to 203.
Step 201: the first derivative of the angle of attack α, i.e. $\dot{\alpha}$, is selected as the action. If the missile has an available angle-of-attack limit, i.e. $|\alpha| \le \alpha_{max}$, where $\alpha_{max}$ is the angle-of-attack limit, then $\alpha_{max} = 90°$ is taken in this example and the angle of attack obtained by integrating $\dot{\alpha}$ is kept within $[-\alpha_{max}, \alpha_{max}]$; however, as the maneuvering capability of air-to-air missiles develops, especially with the assistance of thrust vectoring or reaction jets, the available angle-of-attack limit can be removed.
Step 202: the state space and observation space of the dynamic system are set on the basis of step 201. The state of the system is the set of normalized dynamics states augmented with the angle of attack α, and the observation space is set as a subset of the state quantities together with the error relative to the desired turning angle $\bar{\theta}_c$.
Step 203: set the reward function. The reward function strongly influences the final training result. To avoid reward sparsity, the reward function is designed as a dense shaping function of the turn-angle error weighted by the hyperparameters $\lambda_1, \lambda_2, \lambda_3$, which are set to -0.1, 2, and 1.5 respectively in this example; the instantaneous reward is computed accordingly at every step. To improve the final turning accuracy, an additional reward $r_{bonus}$ is introduced, which takes the value $r_b$ when the turn-angle error is within the desired accuracy and 0 otherwise, so that an appropriate reward is maintained.
Step 3: build the algorithm networks and set the algorithm parameters. The deep reinforcement learning algorithm selected in the invention is Proximal Policy Optimization (PPO), which comprises an Actor network and a Critic network. Further, because the turning command $\bar{\theta}_c$ takes random values in the two intervals $[30°, 180°]$ and $[-180°, -30°]$, two Actor networks are set up and trained in parallel, and the network weight parameters are initialized randomly. The structures of the Actor and Critic networks are shown in Table 2.
Table 2. Network architecture parameters
Step 4: until training reaches the target reward value or the maximum number of steps, the agent keeps collecting state-transition data and rewards according to the PPO algorithm and iteratively updates the parameters of the Actor and Critic networks; specifically, this comprises steps 401 to 405. A schematic diagram of the agent interacting with the environment to collect data and update the network parameters is shown in Fig. 2. If the target's initial position lies within $[-30°, 30°]$, the missile does not need to make a large-maneuver turn, and an overload command $n_c$ is given directly by a classical guidance law such as proportional navigation; if the missile needs to make a high angle-of-attack turn, the policy output $\dot{\alpha}$ is passed through an integrator to obtain the angle-of-attack command $\alpha_c$, which is then tracked by the missile autopilot. The autopilot actuators use thrust vector control (TVC) or the reaction control system (RCS) during the large maneuver, while aerodynamic rudders may be added in the small angle-of-attack terminal guidance phase.
Step 401: under the current policy $\pi_{\theta_k}$, trajectory data are collected and cached in the experience pool, with the pool size taken as $N_D = 2^{12}$. At each simulation step, for the current observation $o_t$, the current policy $\pi_{\theta_k}$ is executed to obtain the current action $a_t$, i.e. the angle-of-attack rate $\dot{\alpha}$; the system dynamics are integrated to obtain the next state $s_{t+1}$ and observation $o_{t+1}$, and the reward $r_t$ is obtained. The trajectories of several rounds may be cached in the experience pool until it is full.
Step 402: the advantage function $\hat{A}_t$ is estimated with generalized advantage estimation (GAE), and the objective function of the network update is computed in clipped form to increase the probability of more advantageous actions:

$J^{CLIP}(\theta) = \hat{\mathbb{E}}_t\left[\min\left(r_t(\theta)\hat{A}_t,\ \mathrm{clip}\left(r_t(\theta), 1-\epsilon, 1+\epsilon\right)\hat{A}_t\right)\right]$

where $r_t(\theta)$ is the probability ratio characterizing the new and old policies so as to keep the update step within the trust region, and the truncation factor is taken as $\epsilon = 0.3$. Further, to improve the accuracy of the Critic network's value estimate, the value-function loss term $L^{VF}(\theta)$ and the maximum-entropy term $S[\pi_\theta]$ that encourages exploration are added to $J^{CLIP}(\theta)$, giving the final optimization objective

$J^{PPO}(\theta) = \hat{\mathbb{E}}_t\left[J^{CLIP}(\theta) - c_{vf} L^{VF}(\theta) + c_s S[\pi_\theta](o_t)\right]$

where $c_{vf}$ and $c_s = 0.01$ are hyperparameters that adjust the proportions of the terms; $c_{vf}$ is scaled by the reward standard deviation $\sigma(R)$, i.e. it is a dynamically adjusted coefficient.
Step 403: trajectory data are drawn from the experience pool in minibatches of size $N_B = 2^{10}$, and the objective $J^{PPO}(\theta)$ is optimized over the parameters of the Actor and Critic networks by stochastic gradient steps until the data in the experience pool have been used for K = 4 epochs of updates. The network parameters are updated according to

$\theta \leftarrow \theta + \alpha_{LR}\nabla_\theta J^{PPO}(\theta)$

where the learning rate is $\alpha_{LR} = 10^{-4}$.
Step 404: taking into account the randomness of the initial turning command, the expected cumulative rewards obtained by the new and old policies are compared. If

$\mathbb{E}\left[R(\theta)\right] \ge \mathbb{E}\left[R(\theta^-)\right]$

where $R(\theta^-)$ is the episode cumulative reward obtained under the finally output policy $\pi_{\theta^-}$ and $R(\theta)$ is the episode cumulative reward obtained under the policy $\pi_\theta$ updated in step 403, then the network parameters of the final policy are set to $\theta^- = \theta$.
Step 405: repeat steps 401 to 404 until training reaches the target reward value or the maximum number of training steps. The resulting Actor network, as the final policy network, can be deployed directly on the missile-borne computer to generate angle-of-attack guidance commands in real time.
Following the specific example above, the agent is trained with the maximum number of training steps set to $5\times10^6$; the resulting training curves are shown in Fig. 3. The two training curves correspond to the cases with and without the angle-of-attack limit, the shaded regions represent the reward distribution over different random initial turning commands under the same network parameters, and the solid lines are the mean of that distribution. The training curves show that the agent converges smoothly in this example.
Further, the trained agent is verified over turning angles in $[30°, 180°]$ and $[-180°, -30°]$, and the deviation between the missile trajectory inclination angle and the commanded angle is shown in Figs. 4a-4b. Fig. 4a shows the convergence of the missile turning angle when the maneuvering capability is limited, and Fig. 4b shows it when the maneuvering capability is unlimited. The missile completes the turning task regardless of whether the maneuvering capability, i.e. the maximum angle of attack, is limited; however, the limited available angle of attack reduces the missile's maneuverability, so the turn takes longer.
Furthermore, simulation experiments verify that the over-the-shoulder launch guidance law based on deep reinforcement learning meets the requirements of the basic task and is suboptimal and robust. The results obtained by the agent in this example are compared with those of the general-purpose optimization software GPOPS. The limited angle-of-attack case is shown in Figs. 5a-5d, where Fig. 5a is the missile speed versus time, Fig. 5b the missile angle of attack versus time, Fig. 5c the missile trajectory inclination angle versus time, and Fig. 5d the missile trajectory in the longitudinal plane. Figs. 6a-6d show the unlimited angle-of-attack case, where Fig. 6a is the missile speed versus time, Fig. 6b the missile angle of attack versus time, Fig. 6c the missile trajectory inclination angle versus time, and Fig. 6d the missile trajectory in the longitudinal plane. From the results in Figs. 5a-5d and 6a-6d it can be seen that the angle-of-attack commands given by the reinforcement learning guidance law in Figs. 5b and 6b are very close to the optimal angle-of-attack commands solved by GPOPS, and the speed, trajectory inclination angle, and trajectory curves approach the optimal solution. It should be noted that the GPOPS solution is obtained in an open-loop, off-line manner, whereas the reinforcement learning guidance law works closed-loop and on-line, i.e. the same Actor network adapts to a variety of turning angles. Moreover, optimization software struggles when the environment is too complex; in this example the missile's complex aerodynamics had to be suitably simplified to obtain a solution, which is nevertheless meaningful for demonstrating the suboptimality of the reinforcement learning guidance law. To further verify the robustness of the agent, harsh initial conditions different from the training environment were set in this example, as shown in Table 3. The deviation conditions include not only deviations of the aerodynamic coefficients but also generalization of the initial speed and instability of the main-engine thrust, and the agent never encountered this verification environment during training. The final terminal states are shown in Figs. 7a and 7b, where Fig. 7a shows the terminal trajectory inclination deviation and angle-of-attack distribution with limited missile maneuvering capability, and Fig. 7b shows them with unlimited maneuvering capability. With the maneuvering capability limit, the terminal trajectory inclination deviation is no more than 0.3° and the terminal angle-of-attack deviation no more than 0.9°; without the limit, the terminal trajectory inclination deviation is no more than 0.7° and the angle-of-attack deviation no more than 0.9°. The missile thus still completes the turn with high accuracy, demonstrating the agent's generalization ability in scenarios outside of training.
This is mainly attributable to the use of a model-free deep reinforcement learning algorithm, so that high accuracy is maintained even when the model deviates considerably.
Table 3. Deviation conditions
In conclusion, the deep-reinforcement-learning-based method for designing an over-the-shoulder launch guidance law for air-to-air missiles can provide real-time angle-of-attack commands for large-maneuver turns in complex combat environments. First, normalized dynamics modeling of the missile over-the-shoulder launch is carried out for the combat scenario in which the missile is threatened from the rear hemisphere, taking into account the strong aerodynamic uncertainty determined by the missile's particular aerodynamic characteristics at high angle of attack. To cope with this aerodynamic uncertainty, the method adopts the model-free deep reinforcement learning algorithm PPO, casts the dynamic model as a Markov decision process, and finally obtains a trained policy network. Both the limited and unlimited available angle-of-attack cases are considered in training, which accounts for the practical capability of current missiles as well as their future development potential. The training of the agent requires substantial computing performance, but it can be completed off-line at a ground workstation, and the finally obtained policy network can be deployed directly on the missile-borne computer, occupying little memory and few computing resources while meeting the real-time requirements of online computation. With the reinforcement learning guidance law, the missile's angle of attack increases rapidly after launch, the speed decreases, and the trajectory inclination angle converges to the commanded value, so that the missile seeker can smoothly lock onto the target, providing favorable handover conditions for terminal guidance. In addition, the suboptimality and robustness of the trained agent are verified in the examples provided: the solution given by the agent is very close to the optimal solution given by the general-purpose optimization software, and the agent maintains high accuracy in a harsh simulation environment, verifying its generalization and robustness. Facing increasingly complex air combat environments and highly maneuverable fighters, the intelligent guidance law provided by the invention has important application value.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. A method for designing an over-the-shoulder launch guidance law for air-to-air missiles based on deep reinforcement learning, characterized by comprising the following steps:
step 1, performing normalized dynamics modeling of the over-the-shoulder launch; normalizing the model gives every state quantity a similar magnitude, which makes the weight updates of the neural network more stable; first, the missile over-the-shoulder launch scenario is modeled to obtain the dynamics equations in the velocity (wind) frame, the kinematic equations in the inertial frame, and an equation accounting for the mass change;
step 2, modeling the problem of step 1 as a Markov decision process in order to fit the reinforcement learning paradigm;
step 3, building the algorithm networks and setting the algorithm parameters; the selected deep reinforcement learning algorithm is the proximal policy optimization algorithm PPO, which comprises an Actor network and a Critic network, and the network weight parameters are initialized randomly;
step 4, until training reaches the target reward value or the maximum number of steps, the agent keeps collecting state-transition data and rewards according to the PPO algorithm and iteratively updates the parameters of the Actor and Critic networks.
2. The method for designing an over-the-shoulder launch guidance law for air-to-air missiles based on deep reinforcement learning, characterized in that: in step 1, the equations are specifically:
$\dot{\bar{V}} = \dfrac{1}{V^*}\left[\dfrac{P\cos\alpha\,u_p - F_D}{m} - g\sin\theta\right]$

$\dot{\bar{\theta}} = \dfrac{1}{\theta^*}\left[\dfrac{P\sin\alpha\,u_p + T_{rcs}u_{rcs} + F_L}{mV} - \dfrac{g\cos\theta}{V}\right]$

$\dot{\bar{x}} = \dfrac{V\cos\theta}{x^*}, \qquad \dot{\bar{y}} = \dfrac{V\sin\theta}{y^*}$

$\dot{m} = -m_c$
wherein $\bar{V} = V/V^*$ is the normalized missile flight speed, $\bar{\theta} = \theta/\theta^*$ is the normalized trajectory inclination angle, $\bar{x} = x/x^*$ is the normalized horizontal coordinate, $\bar{y} = y/y^*$ is the normalized vertical coordinate, $\dot{\bar{V}}, \dot{\bar{\theta}}, \dot{\bar{x}}, \dot{\bar{y}}$ are the rates of change of the foregoing quantities, and $V^*, \theta^*, x^*, y^*$ are the corresponding normalization factors; in addition, α is the missile angle of attack, P is the main-engine thrust, $T_{rcs}$ is the reaction-jet thrust, $u_p$ and $u_{rcs}$ are the on-off logic variables of the main engine and the reaction jet respectively, $F_D$ and $F_L$ are the highly uncertain drag and lift, m is the missile mass, $m_c$ is the mass consumption rate, and g is the gravitational acceleration constant.
3. The method for designing an over-the-shoulder launch guidance law for air-to-air missiles based on deep reinforcement learning, characterized in that: in step 2, the specific procedure comprises steps 201 to 203;
step 201, setting the action space; to ensure the dynamic stability of the system, the first derivative of the angle of attack α, i.e. $\dot{\alpha}$, is selected as the system input; in addition, taking $\dot{\alpha}$ as the action makes it convenient to satisfy the missile maneuvering capability limit; however, as the maneuvering capability of air-to-air missiles develops, especially with the assistance of thrust vectoring or reaction jets, the available angle-of-attack limit can also be removed;
step 202, setting the state space and observation space; the state space and observation space of the agent are set on the basis of the action set in step 201, but not all states of the system are meaningful for deciding the control command; redundant observations lead to unstable training, while insufficient observations tend to cause the training not to converge;
step 203, setting the reward function; the reward function strongly influences the final training result; to avoid reward sparsity, the reward function is designed as a dense shaping function of the deviation between the missile trajectory inclination angle $\theta_M$ and the desired turning angle $\bar{\theta}_c$, where $\lambda_1, \lambda_2, \lambda_3$ are hyperparameters to be set that adjust the proportions among the terms; to improve the final turning accuracy, an additional reward $r_{bonus}$ is introduced, which takes the value $r_b$ when the accuracy condition is satisfied and 0 otherwise, where $r_b$ is coordinated with the preceding terms to ensure that the agent obtains an appropriate reward within the desired accuracy $\theta_{thre}$.
4. The method for designing an over-the-shoulder launch guidance law for air-to-air missiles based on deep reinforcement learning, characterized in that: step 4 specifically comprises steps 401 to 405;
step 401, under the current policy $\pi_{\theta_k}$, collecting trajectory data and caching them in the experience pool until the pool is full; at each simulation step, for the current observation $o_t$, the current policy $\pi_{\theta_k}$ is executed to obtain the current action $a_t$, the system dynamics are integrated to obtain the next state $s_{t+1}$ and observation $o_{t+1}$, and the reward $r_t$ is obtained;
step 402, estimating the advantage function $\hat{A}_t$ with generalized advantage estimation GAE; the final optimization objective is $J^{PPO}(\theta) = \hat{\mathbb{E}}_t\left[J^{CLIP}(\theta) - c_{vf} L^{VF}(\theta) + c_s S[\pi_\theta](o_t)\right]$, where $c_{vf}$ and $c_s$ are hyperparameters adjusting the proportions of the terms, $J^{CLIP}(\theta)$ is the clipped objective that increases the probability of more advantageous actions, $L^{VF}(\theta)$ is the value-function loss term, and $S[\pi_\theta]$ is the maximum-entropy term encouraging exploration;
step 403, drawing trajectory data from the experience pool in minibatches and optimizing the parameters of the Actor and Critic networks on the objective $J^{PPO}(\theta)$ by stochastic gradient steps until the data in the experience pool have been used for K epochs of updates;
step 404, taking into account the randomness of the initial turning command, comparing the expected cumulative rewards obtained by the new and old policies and updating the finally output network parameters;
step 405, repeating steps 401 to 404 until training reaches the target reward value or the maximum number of training steps, whereupon the Actor network, as the final policy network, is deployed directly on the missile-borne computer to generate angle-of-attack guidance commands in real time.
5. The method for designing an over-the-shoulder launch guidance law for air-to-air missiles based on deep reinforcement learning, characterized in that: in step 201, if the missile has an available angle-of-attack limit, i.e. $|\alpha| \le \alpha_{max}$, where $\alpha_{max}$ is the angle-of-attack limit, then the angle of attack obtained by integrating $\dot{\alpha}$ is kept within $[-\alpha_{max}, \alpha_{max}]$.
6. The method for designing an over-the-shoulder launch guidance law for air-to-air missiles based on deep reinforcement learning, characterized in that: in step 202, the state space of the system is the set of normalized dynamics states augmented with the angle of attack α, and the observation space is set as a subset of the state quantities together with the error relative to the desired turning angle $\bar{\theta}_c$.
7. The method for designing an over-the-shoulder launch guidance law for air-to-air missiles based on deep reinforcement learning, characterized in that: in step 401, at each simulation step, based on the current observation $o_t$, the current policy is executed to obtain the probability mean of the current action, i.e. $\mu_t = \pi_{\theta_k}(o_t)$; the current action $a_t$ is obtained by sampling from the Gaussian distribution $\mathcal{N}(\mu_t, \sigma^2)$, the system dynamics $f(x_t, a_t, t)$ are integrated to obtain the next state $s_{t+1}$ and observation $o_{t+1}$, and the reward $r_t$ is computed; this continues until the turn is over, yielding a set of trajectories $\{s_0, o_0, a_0, r_1, s_1, o_1, a_1, r_2, s_2, \dots\}$; under the current policy $\pi_{\theta_k}$, trajectory data are thus collected and cached in the experience pool, which may hold the trajectories of several rounds, until the pool is full.
8. The method for designing an over-the-shoulder launch guidance law for air-to-air missiles based on deep reinforcement learning, characterized in that: in step 402, the objective function of the network update is further computed in clipped form to increase the probability of more advantageous actions, $J^{CLIP}(\theta) = \hat{\mathbb{E}}_t\left[\min\left(r_t(\theta)\hat{A}_t,\ \mathrm{clip}\left(r_t(\theta), 1-\epsilon, 1+\epsilon\right)\hat{A}_t\right)\right]$, where $\hat{\mathbb{E}}_t[\cdot]$ denotes the expectation, $r_t(\theta)$ is the probability ratio characterizing the new and old policies so as to keep the update step within the trust region, $\mathrm{clip}(\cdot)$ is the truncation function, and $\epsilon$ is a hyperparameter acting as the truncation factor; further, to improve the accuracy of the Critic network's value estimate, the value-function loss term $L^{VF}(\theta)$ and the maximum-entropy term $S[\pi_\theta]$ encouraging exploration are added to $J^{CLIP}(\theta)$, giving the final optimization objective $J^{PPO}(\theta) = \hat{\mathbb{E}}_t\left[J^{CLIP}(\theta) - c_{vf} L^{VF}(\theta) + c_s S[\pi_\theta](o_t)\right]$.
9. The method for designing an over-the-shoulder launch guidance law for air-to-air missiles based on deep reinforcement learning, characterized in that: in step 403, trajectory data are drawn from the experience pool in minibatches of size $N_B$, and the objective $J^{PPO}(\theta)$ is optimized over the parameters of the Actor and Critic networks by stochastic gradient steps until the data in the experience pool have been used for K epochs of updates; the network parameters are updated according to $\theta \leftarrow \theta + \alpha_{LR}\nabla_\theta J^{PPO}(\theta)$, where $\alpha_{LR}$ is the learning rate.
10. The method for designing an over-the-shoulder launch guidance law for air-to-air missiles based on deep reinforcement learning, characterized in that: in step 404, if $\mathbb{E}\left[R(\theta)\right] \ge \mathbb{E}\left[R(\theta^-)\right]$, where $\mathbb{E}[\cdot]$ denotes the expectation, $R(\theta^-)$ is the episode cumulative reward obtained under the finally output policy $\pi_{\theta^-}$, and $R(\theta)$ is the episode cumulative reward obtained under the policy $\pi_\theta$ updated in step 403, then the network parameters of the final policy are set to $\theta^- = \theta$.
CN202111550831.2A 2021-12-17 2021-12-17 Air-to-air missile over-shoulder launching guidance law design method based on deep reinforcement learning Pending CN114519292A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111550831.2A CN114519292A (en) 2021-12-17 2021-12-17 Air-to-air missile over-shoulder launching guidance law design method based on deep reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111550831.2A CN114519292A (en) 2021-12-17 2021-12-17 Air-to-air missile over-shoulder launching guidance law design method based on deep reinforcement learning

Publications (1)

Publication Number Publication Date
CN114519292A true CN114519292A (en) 2022-05-20

Family

ID=81596073

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111550831.2A Pending CN114519292A (en) 2021-12-17 2021-12-17 Air-to-air missile over-shoulder launching guidance law design method based on deep reinforcement learning

Country Status (1)

Country Link
CN (1) CN114519292A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114861368A (en) * 2022-06-13 2022-08-05 中南大学 Method for constructing railway longitudinal section design learning model based on near-end strategy
CN115328638A (en) * 2022-10-13 2022-11-11 北京航空航天大学 Multi-aircraft task scheduling method based on mixed integer programming

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111930141A (en) * 2020-07-21 2020-11-13 哈尔滨工程大学 Three-dimensional path visual tracking method for underwater robot
CN112799429A (en) * 2021-01-05 2021-05-14 北京航空航天大学 Multi-missile cooperative attack guidance law design method based on reinforcement learning
CN112861442A (en) * 2021-03-10 2021-05-28 中国人民解放军国防科技大学 Multi-machine collaborative air combat planning method and system based on deep reinforcement learning

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111930141A (en) * 2020-07-21 2020-11-13 哈尔滨工程大学 Three-dimensional path visual tracking method for underwater robot
CN112799429A (en) * 2021-01-05 2021-05-14 北京航空航天大学 Multi-missile cooperative attack guidance law design method based on reinforcement learning
CN112861442A (en) * 2021-03-10 2021-05-28 中国人民解放军国防科技大学 Multi-machine collaborative air combat planning method and system based on deep reinforcement learning

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
FEI LIU et al.: "Cooperative differential games guidance laws for multiple attackers against an active defense target", RESEARCHGATE, 31 October 2021 (2021-10-31) *
Chen Zhongyuan; Wei Wenshu; Chen Wanchun: "Intelligent guidance law for cooperative attack of multiple missiles based on reinforcement learning", Acta Armamentarii, no. 008, 31 August 2021 (2021-08-31) *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114861368A (en) * 2022-06-13 2022-08-05 中南大学 Method for constructing railway longitudinal section design learning model based on near-end strategy
CN114861368B (en) * 2022-06-13 2023-09-12 中南大学 Construction method of railway longitudinal section design learning model based on near-end strategy
CN115328638A (en) * 2022-10-13 2022-11-11 北京航空航天大学 Multi-aircraft task scheduling method based on mixed integer programming
CN115328638B (en) * 2022-10-13 2023-01-10 北京航空航天大学 Multi-aircraft task scheduling method based on mixed integer programming

Similar Documents

Publication Publication Date Title
CN108445766A (en) Model-free quadrotor drone contrail tracker and method based on RPD-SMC and RISE
CN114519292A (en) Air-to-air missile over-shoulder launching guidance law design method based on deep reinforcement learning
Hong et al. Model predictive convex programming for constrained vehicle guidance
Wang et al. Improving maneuver strategy in air combat by alternate freeze games with a deep reinforcement learning algorithm
Li et al. Autonomous maneuver decision-making for a UCAV in short-range aerial combat based on an MS-DDQN algorithm
CN114253296B (en) Hypersonic aircraft airborne track planning method and device, aircraft and medium
Lin et al. Development of an integrated fuzzy-logic-based missile guidance law against high speed target
CN111898201B (en) High-precision autonomous attack guiding method for fighter in air combat simulation environment
CN113377121B (en) Aircraft intelligent disturbance rejection control method based on deep reinforcement learning
Lee et al. Autonomous control of combat unmanned aerial vehicles to evade surface-to-air missiles using deep reinforcement learning
Chai et al. A hierarchical deep reinforcement learning framework for 6-DOF UCAV air-to-air combat
Gong et al. All-aspect attack guidance law for agile missiles based on deep reinforcement learning
CN115576353A (en) Aircraft formation control method based on deep reinforcement learning
CN115079565A (en) Variable-coefficient constraint guidance method and device with falling angle and aircraft
Duan et al. Autonomous maneuver decision for unmanned aerial vehicle via improved pigeon-inspired optimization
Qazi et al. Rapid trajectory optimization using computational intelligence for guidance and conceptual design of multistage space launch vehicles
CN115357051B (en) Deformation and maneuvering integrated avoidance and defense method
Du et al. Deep reinforcement learning based missile guidance law design for maneuvering target interception
CN116432539A (en) Time consistency collaborative guidance method, system, equipment and medium
Chen et al. The design of particle swarm optimization guidance using a line-of-sight evaluation method
Minglang et al. Maneuvering decision in short range air combat for unmanned combat aerial vehicles
Gui et al. Reaction control system optimization for maneuverable reentry vehicles based on particle swarm optimization
Bin et al. Cooperative guidance for maneuvering penetration with attack time consensus and bounded input
CN117192982B (en) Control parameterization-based short-distance air combat maneuver decision optimization method
Wu et al. Decision-Making Method of UAV Maneuvering in Close-Range Confrontation based on Deep Reinforcement Learning

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination