CN115857548A - Terminal guidance law design method based on deep reinforcement learning - Google Patents
- Publication number
- CN115857548A CN115857548A CN202211509505.1A CN202211509505A CN115857548A CN 115857548 A CN115857548 A CN 115857548A CN 202211509505 A CN202211509505 A CN 202211509505A CN 115857548 A CN115857548 A CN 115857548A
- Authority
- CN
- China
- Prior art keywords
- missile
- target
- state
- action
- network
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Aiming, Guidance, Guns With A Light Source, Armor, Camouflage, And Targets (AREA)
Abstract
The invention discloses a terminal guidance law design method based on deep reinforcement learning, and belongs to the field of missile and rocket guidance. The method comprises the following steps: establishing the relative kinematic equations between the missile and the target in the longitudinal plane of the terminal guidance phase in which the missile intercepts the target; abstracting and modeling the problem as a Markov decision process in order to fit the research paradigm of reinforcement learning; building the algorithm network, setting the algorithm parameters, and selecting DQN as the deep reinforcement learning algorithm; in the terminal guidance process of each round, collecting a sufficient number of training samples through Q-learning, training the neural network and updating the target network at their respective fixed frequencies, and repeating this process until the set number of learning rounds is reached. By applying the technical scheme of the invention, the guidance precision of the traditional proportional navigation guidance law can be improved, and the missile acquires a certain autonomous decision-making capability.
Description
Technical Field
The invention belongs to the field of missile and rocket guidance, and particularly relates to a terminal guidance law design method based on deep reinforcement learning.
Background
The control law that steers a missile flying at very high speed so that it accurately hits an enemy target is called the terminal guidance law, and it is a vital technology of a defense system. The control quantity output by the guidance law is the key basis on which the intercepting missile adjusts its flight attitude, and most guidance laws actually applied in engineering practice are still the proportional navigation guidance law or improved variants of it. Their principle is to use control means such as the missile-borne steering engine to keep the rotation rate of the missile's velocity vector in constant direct proportion to the rotation rate of the missile-target line of sight.
Under ideal conditions, the proportional navigation guidance law can achieve a good hit effect; however, once the non-ideal characteristics of the missile aerodynamic model, the inherent delay of the autopilot, and large target maneuvers are taken into account, this guidance law can lead to a large miss distance.
Disclosure of Invention
In order to overcome the above deficiencies of the prior art, the invention provides a terminal guidance law design method based on deep reinforcement learning.
The technical scheme for realizing the purpose of the invention is as follows: a terminal guidance law design method based on deep reinforcement learning comprises the following steps:
Step 1: establishing the relative kinematic equations between the missile and the target in the longitudinal plane of the terminal guidance phase in which the missile intercepts the target;
Step 2: abstracting the solution of the kinematic equations and modeling it as a Markov decision process;
Step 3: building an algorithm network, setting the algorithm parameters, and training the algorithm network on a randomly initialized data set to determine the weight parameters of the initial network;
Step 4: according to the Q-learning algorithm, the agent continuously caches state-transition data and reward values in an experience pool as learning samples, and repeatedly draws a fixed number of samples at random from the experience pool to train the network, until the set number of learning rounds is reached;
Step 5: in an actual guidance process, the learned network is used to generate an action in real time according to the current state and to transition to the next state, and this process is repeated until the target is hit and the guidance process ends.
Preferably, the relative kinematic equations between the missile and the target in the longitudinal plane of the terminal guidance phase established in step 1 are:
x_r = x_t - x_m
y_r = y_t - y_m
where x_t is the abscissa of the target, x_m is the abscissa of the missile, x_r is the lateral relative distance between the target and the missile, y_t is the ordinate of the target, y_m is the ordinate of the missile, y_r is the longitudinal relative distance between the target and the missile, V_t is the linear velocity of the target, θ_t is the angle between the target velocity direction and the horizontal, V_m is the linear velocity of the missile, θ_m is the angle between the missile velocity direction and the horizontal, ẋ_r is the rate of change of the lateral distance between the target and the missile, ẏ_r is the rate of change of the longitudinal distance between the target and the missile, r is the relative distance between the target and the missile, q is the angle between the missile-target line of sight and the horizontal (also called the line-of-sight angle), ṙ is the rate of change of the relative distance, and q̇ is the rate of change of the line-of-sight angle.
Preferably, abstracting and modeling the solution of the kinematic equations as a Markov decision process specifically comprises:
The action space setting is specifically as follows: the action space is constructed by taking the proportional navigation guidance law PNG as expert experience;
The state space setting is specifically as follows: the line-of-sight angle rate q̇ is taken as the state space of the current problem;
The reward function setting is specifically as follows:
In the formula, r_hit is the relative distance within which the missile is judged to have finally hit the target, r_end is the relative distance between the missile and the target at the termination time, end is the total number of simulation periods at the termination time, and r_t is the distance between the missile and the target at time t during the simulation.
Preferably, the specific process of constructing the action space by taking the proportional navigation guidance law PNG as expert experience is as follows:
Taking the relative velocity and the line-of-sight rotation rate as inputs, the output is an overload command, expressed as a_c = K·|ṙ|·q̇, where K is the proportionality coefficient (navigation ratio), ṙ is the relative velocity, and q̇ is the line-of-sight rotation rate. The proportionality coefficient K is discretized within a given range of values into a finite set that serves as the action space; the proportionality coefficient is determined by selecting an action from the action space, and the overload command is calculated from it.
Preferably, the specific steps of initializing the neural network weight parameters are as follows:
Step 301: determining that the algorithm network adopts a BP neural network, whose input is the (state, action) two-dimensional column vector and whose output is the Q value corresponding to the (state, action) pair;
calling a random function within a given value range to generate a series of random data as the input data set of the network, and calculating, according to the reward function, the reward values obtained when the random data are treated as states and actions, which serve as the output reference data set;
Step 302: training the neural network on the data set obtained in step 301 to determine the initial weight parameters of the neural network.
Preferably, the specific method for training the neural network and updating the target network with a fixed frequency is as follows:
in each simulation step, selecting an executed action from an action space according to an epsilon-greedy strategy for the current state, integrating according to a kinetic equation to obtain the state of the next moment, and calculating to obtain an obtained reward value; setting an experience pool and saving the current state, the executed action, the reward value and the next state as experience into the experience pool;
randomly taking out a data set with a certain size from the experience pool at a fixed frequency, calculating a corresponding target value of the data set, training a neural network by using the data set and the target value corresponding to the data set, and updating the target network by adopting a certain frequency, namely replacing the target network with the network trained in a period of time.
Preferably, the specific calculation method of the target value is as follows:
Q_target = Q(s_t, a_t) + α[R_t + γ·max_a Q(s_{t+1}, a) - Q(s_t, a_t)]
where Q_target represents the updated Q value corresponding to (s_t, a_t), s_t represents the state at time t, a_t represents the action executed in state s_t, Q(s_t, a_t) represents the Q value of executing action a_t in state s_t, α is the learning rate, i.e. the rate at which the Q value is updated, R_t represents the reward value obtained by executing action a_t in state s_t, γ represents the discount rate, s_{t+1} represents the state at time t+1, and max_a Q(s_{t+1}, a) represents the Q value of executing the optimal action in state s_{t+1}.
Compared with the prior art, the invention has the following notable advantages: within a given range of navigation ratios, a deep reinforcement learning algorithm is applied offline to learn an optimal sequence of navigation ratios, so that at every moment the missile can select the most appropriate navigation-ratio parameter according to its current state and generate the required overload. This alleviates, to a certain extent, the difficulty of choosing the navigation ratio, while also improving hit accuracy.
Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.
Drawings
The drawings are only for purposes of illustrating particular embodiments and are not to be construed as limiting the invention, wherein like reference numerals are used to designate like parts throughout.
FIG. 1 is a geometric diagram of the planar engagement in the missile interception terminal guidance phase according to an embodiment of the invention.
FIG. 2 is a diagram of terminal guidance law learning with deep reinforcement learning according to the present invention.
FIG. 3 is a flowchart of the deep reinforcement learning algorithm of the present invention.
Fig. 4 is a relationship between the amount of miss and the number of learning rounds for an embodiment of the present invention.
FIG. 5 is a two-dimensional trajectory graph of the missile and target provided in accordance with an embodiment of the present invention.
FIG. 6 is a sequence of navigation ratios for a specific example of the present invention.
Fig. 7 is a graph of line-of-sight angular velocity for a specific example of the present invention.
Detailed Description
It is easily understood that those skilled in the art can conceive various embodiments of the present invention from its technical solution without changing its essential spirit. Therefore, the following detailed description and the accompanying drawings are merely illustrative of the technical solution of the invention and should not be construed as constituting the whole of the invention or as restricting or limiting its technical solution. Rather, these embodiments are provided so that this disclosure will be thorough and complete. The preferred embodiments of the present invention will now be described in detail with reference to the accompanying drawings, which form a part hereof and which, together with the embodiments of the invention, serve to explain the innovative concepts of the invention.
The invention discloses a terminal guidance law design method based on deep reinforcement learning, which comprises the following steps:
Step 1: with reference to the geometric schematic diagram of the planar engagement of the missile interception terminal guidance phase in FIG. 1, the relative kinematic equations between the missile and the target in the longitudinal plane of the terminal guidance phase are established as follows:
x_r = x_t - x_m
y_r = y_t - y_m
where x_t is the abscissa of the target, x_m is the abscissa of the missile, x_r is the lateral relative distance between the target and the missile, y_t is the ordinate of the target, y_m is the ordinate of the missile, y_r is the longitudinal relative distance between the target and the missile, V_t is the linear velocity of the target, θ_t is the angle between the target velocity direction and the horizontal, V_m is the linear velocity of the missile, θ_m is the angle between the missile velocity direction and the horizontal, ẋ_r is the rate of change of the lateral distance between the target and the missile, ẏ_r is the rate of change of the longitudinal distance between the target and the missile, r is the relative distance between the target and the missile, q is the angle between the missile-target line of sight and the horizontal (also called the line-of-sight angle), ṙ is the rate of change of the relative distance, and q̇ is the rate of change of the line-of-sight angle.
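For readability, planar engagement kinematics of this kind can be sketched in Python as below. This is a minimal sketch under stated assumptions: the constant speeds, the state layout, the function names, and the use of normal acceleration divided by speed as the turn rate are illustrative choices, not values taken from the patent.

```python
import numpy as np

V_M, V_T = 600.0, 300.0   # assumed constant missile/target speeds (m/s)

def engagement_derivatives(state, a_m, a_t=0.0):
    """Time derivatives of the planar engagement state.

    state = [x_m, y_m, theta_m, x_t, y_t, theta_t]
    a_m, a_t: normal accelerations (overloads) of missile and target.
    """
    x_m, y_m, th_m, x_t, y_t, th_t = state
    return np.array([
        V_M * np.cos(th_m),   # dx_m/dt
        V_M * np.sin(th_m),   # dy_m/dt
        a_m / V_M,            # dtheta_m/dt: normal acceleration rotates the velocity vector
        V_T * np.cos(th_t),   # dx_t/dt
        V_T * np.sin(th_t),   # dy_t/dt
        a_t / V_T,            # dtheta_t/dt
    ])

def relative_quantities(state):
    """Relative distance r, line-of-sight angle q, and their rates r_dot, q_dot."""
    x_m, y_m, th_m, x_t, y_t, th_t = state
    x_r, y_r = x_t - x_m, y_t - y_m                      # x_r = x_t - x_m, y_r = y_t - y_m
    xr_dot = V_T * np.cos(th_t) - V_M * np.cos(th_m)     # lateral relative velocity
    yr_dot = V_T * np.sin(th_t) - V_M * np.sin(th_m)     # longitudinal relative velocity
    r = np.hypot(x_r, y_r)                               # relative distance
    q = np.arctan2(y_r, x_r)                             # line-of-sight angle
    r_dot = (x_r * xr_dot + y_r * yr_dot) / r            # relative (closing) velocity
    q_dot = (x_r * yr_dot - y_r * xr_dot) / r**2         # line-of-sight angle rate
    return r, q, r_dot, q_dot
```

The quantities r, q, ṙ and q̇ computed here are the same ones used below by the guidance law and by the reinforcement learning state.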
Step 2: abstracting the solution of the kinematic equations and modeling it as a Markov decision process;
further, step 2 specifically includes:
and (4) setting an action space, and constructing the action space by taking the proportional guidance law PNG as expert experience in order to avoid the condition that the final algorithm cannot be converged due to overlarge action space search.
Specifically, the proportional navigation guidance law takes the relative velocity and the line-of-sight rotation rate as inputs and outputs an overload command, expressed as a_c = K·|ṙ|·q̇, where K is the proportionality coefficient (navigation ratio), ṙ is the relative velocity, and q̇ is the line-of-sight rotation rate. The proportionality coefficient K is discretized within a certain range of values into a finite set that serves as the action space; the proportionality coefficient is determined by selecting an action from the action space, and the overload command is calculated from it;
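A minimal sketch of such a discretized, PNG-based action space is given below. The concrete values (2.0 to 5.0 in steps of 0.1) follow the embodiment described later in this description, while the function name and the use of |ṙ| in the command are assumptions.

```python
import numpy as np

# Discretized navigation ratios used as the action space
# (assumed values: K from 2.0 to 5.0 in steps of 0.1, as in the embodiment).
ACTIONS = np.round(np.arange(2.0, 5.0 + 1e-9, 0.1), 1)

def overload_command(action_index, r_dot, q_dot):
    """PNG-style overload command a_c = K * |r_dot| * q_dot,
    where K is the navigation ratio selected by the agent."""
    K = ACTIONS[action_index]
    return K * abs(r_dot) * q_dot
```

Selecting an action therefore amounts to picking one navigation ratio per decision step; the resulting command is fed back into the engagement dynamics sketched above.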
setting a state space, wherein in the design of the guidance ratio, the selected state space must contain all states of the guidance process, and converting the sight line into the ratioAs a state space for the current problem-aware, all states of motion can be adequately represented;
setting a reward function, wherein the DQN algorithm judges whether the execution action is good or not by using the reward function; in the missile pursuit process, if the relative distance between the missile and the target is shortened at adjacent moments, a positive reward is obtained; if the missile finally hits the target, a larger reward will be obtained, but otherwise the missile that did not hit the target will set the reward to 0, and in summary, the reward function is set to:
in the formula, r hit Relative distance, r, for the final target hit by the missile end Is the time of terminationThe relative distance between the missile and the target, end is the total period duration of the termination time, r t The distance between the missile and the target at the moment t in the simulation process. Relative velocity during target pursuit by the missileIs always negative when>The time is changed from negative to positive at a certain moment, and the moment is the termination moment.
Step 3: building an algorithm network, setting the algorithm parameters, and training the algorithm network on a randomly initialized data set to determine the weight parameters of the initial network;
specifically, the algorithm determining network adopts a BP neural network, inputs the (state, action) two-dimensional column vectors, and outputs Q values corresponding to the (state, action) two-tuple, wherein the significance of the Q values is to determine the optimal action to be executed according to the Q values of different actions to be executed in the same state; calling a random function within a given value range to generate a series of random data as an input data set of the network, and calculating a reward value taking the random data set as a state and an action according to the reward function in the step 4 as an output reference data set;
and 4, step 4: the intelligent agent continuously caches state transition data and reward values as learning samples in an experience pool according to a Q-learning algorithm, and continuously selects a fixed number of sample training networks from the experience pool at random until a set learning turn is reached, and the method specifically comprises the following steps:
in each simulation step, determining which action is taken for the current state through an epsilon-greedy strategy, integrating according to a kinetic equation to obtain the state of the next moment, and meanwhile calculating to obtain an obtained reward value; setting experience pool and obtaining the current state, the executed action, the reward value and the next state(s) t ,a t ,r t ,s t+1 ) Saving the experience in an experience pool as experience;
randomly fetching a data set of a certain size from an experience pool at a fixed frequencyThen, calculating a corresponding target value of the data set, wherein the specific calculation method is as follows: q t arg et =Q(s t ,a t )+α[R t +γmax a Q(s t+1 ,a)-Q(s t ,a t )]Wherein Q is t arg et Represents updated(s) t ,a t ) Corresponding Q value, s t Representing the state at time t, a t Representative state s t Action performed, Q(s) t ,a t ) Representative state s t Lower execution action a t Is the rate of updating of the Q value, R t Representative state s t Lower execution action a t The value of the reward earned, γ representing the discount rate, is how important the future experience is in performing the action on the current state, s t+1 Represents the state at time t + 1, max a Q(s t+1 A) represents the state s t+1 The Q value of the optimal action is executed; then training a neural network by using the data set and the corresponding target value obtained by calculation until reaching the set learning turn;
and 5: in a specific guidance process, a learned network is used for generating action in real time according to the current state and transferring to the next state, and the process is continuously repeated until a target is hit to finish the guidance process.
As a specific example of the present invention, the initial conditions are set as follows:
Meanwhile, the action space, i.e. the set of navigation ratios, is designed as A = {2.0, 2.1, 2.2, 2.3, 2.4, 2.5, 2.6, 2.7, 2.8, 2.9, 3.0, 3.1, 3.2, 3.3, 3.4, 3.5, 3.6, 3.7, 3.8, 3.9, 4.0, 4.1, 4.2, 4.3, 4.4, 4.5, 4.6, 4.7, 4.8, 4.9, 5.0}; the neural network has 2 hidden layers with 40 neurons each, and gradient descent is selected as the error back-propagation strategy; the total number of learning rounds is 2200. As can be seen from FIG. 4, the miss distance converges from an initially random distribution to a low value as the number of learning rounds increases, which demonstrates the convergence of the algorithm of the invention.
The learned algorithm model is then applied to intercept the target, and the guidance trajectory is computed with the fourth-order Runge-Kutta method, yielding the trajectory diagram shown in FIG. 5. Comparing the deep-reinforcement-learning-based guidance law (DQNG) with the traditional proportional navigation guidance law (PNG), the miss distance of DQNG is 0.5386 m while that of PNG is 1.3268 m, showing that the DQNG guidance trajectory approaches the target more directly and strikes it accurately; meanwhile, the hit time of DQNG is 12.44 s versus 12.94 s for PNG, so DQNG intercepts the target faster.
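For reference, a classic fourth-order Runge-Kutta step of the kind used here to propagate the engagement dynamics can be sketched as follows; the function name and step size are generic, not values from the embodiment.

```python
def rk4_step(f, t, state, dt):
    """One classic fourth-order Runge-Kutta step for d(state)/dt = f(t, state)."""
    k1 = f(t, state)
    k2 = f(t + dt / 2.0, state + dt / 2.0 * k1)
    k3 = f(t + dt / 2.0, state + dt / 2.0 * k2)
    k4 = f(t + dt, state + dt * k3)
    return state + dt / 6.0 * (k1 + 2.0 * k2 + 2.0 * k3 + k4)
```

With f wrapping the engagement derivatives and the overload command produced by the learned policy, repeatedly applying rk4_step yields the guidance trajectory compared in FIG. 5.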
While the invention has been described with reference to specific preferred embodiments, it will be understood by those skilled in the art that various changes and modifications may be made without departing from the spirit and scope of the invention as defined in the following claims.
It should be appreciated that, in the foregoing description of exemplary embodiments, various features of the invention are sometimes described within a single embodiment or with reference to a single figure in order to streamline the disclosure and aid the understanding of various aspects of the invention by those skilled in the art. However, this should not be construed as meaning that the features of the exemplary embodiments are all essential technical features of the claims.
It should be understood that the modules, units, components, and the like included in the device of one embodiment of the present invention may be adaptively changed to be provided in a device different from that of the embodiment. The different modules, units or components comprised by the apparatus of an embodiment may be combined into one module, unit or component or may be divided into a plurality of sub-modules, sub-units or sub-components.
Claims (7)
1. A terminal guidance law design method based on deep reinforcement learning, characterized by comprising the following steps:
Step 1: establishing the relative kinematic equations between the missile and the target in the longitudinal plane of the terminal guidance phase in which the missile intercepts the target;
Step 2: abstracting the solution of the kinematic equations and modeling it as a Markov decision process;
Step 3: building an algorithm network, setting the algorithm parameters, and training the algorithm network on a randomly initialized data set to determine the weight parameters of the initial network;
Step 4: according to the Q-learning algorithm, the agent continuously caches state-transition data and reward values in an experience pool as learning samples, and repeatedly draws a fixed number of samples at random from the experience pool to train the network, until the set number of learning rounds is reached;
Step 5: in an actual guidance process, the learned network is used to generate an action in real time according to the current state and to transition to the next state, and this process is repeated until the target is hit and the guidance process ends.
2. The terminal guidance law design method based on deep reinforcement learning according to claim 1, wherein the relative kinematic equations between the missile and the target in the longitudinal plane of the terminal guidance phase established in step 1 are specifically:
x_r = x_t - x_m
y_r = y_t - y_m
where x_t is the abscissa of the target, x_m is the abscissa of the missile, x_r is the lateral relative distance between the target and the missile, y_t is the ordinate of the target, y_m is the ordinate of the missile, y_r is the longitudinal relative distance between the target and the missile, V_t is the linear velocity of the target, θ_t is the angle between the target velocity direction and the horizontal, V_m is the linear velocity of the missile, θ_m is the angle between the missile velocity direction and the horizontal, ẋ_r is the rate of change of the lateral distance between the target and the missile, ẏ_r is the rate of change of the longitudinal distance between the target and the missile, r is the relative distance between the target and the missile, q is the angle between the missile-target line of sight and the horizontal (also called the line-of-sight angle), ṙ is the rate of change of the relative distance, and q̇ is the rate of change of the line-of-sight angle.
3. The terminal guidance law design method based on deep reinforcement learning according to claim 1, wherein abstracting and modeling the solution of the kinematic equations as a Markov decision process specifically comprises:
the action space setting is specifically as follows: the action space is constructed by taking the proportional navigation guidance law PNG as expert experience;
the state space setting is specifically as follows: the line-of-sight angle rate q̇ is taken as the state space of the current problem;
the reward function setting is specifically as follows:
in the formula, r_hit is the relative distance within which the missile is judged to have finally hit the target, r_end is the relative distance between the missile and the target at the termination time, end is the total number of simulation periods at the termination time, and r_t is the distance between the missile and the target at time t during the simulation.
4. The terminal guidance law design method based on deep reinforcement learning according to claim 3, wherein the specific process of constructing the action space by taking the proportional navigation guidance law PNG as expert experience is as follows:
taking the relative velocity and the line-of-sight rotation rate as inputs, the output is an overload command, expressed as a_c = K·|ṙ|·q̇, where K is the proportionality coefficient, ṙ is the relative velocity, and q̇ is the line-of-sight rotation rate; the proportionality coefficient K is discretized within a certain range of values into a finite set that serves as the action space, the proportionality coefficient is determined by selecting an action from the action space, and the overload command is calculated from it.
5. The terminal guidance law design method based on deep reinforcement learning according to claim 1, wherein the specific steps of initializing the neural network weight parameters are as follows:
step 301: determining that the algorithm network adopts a BP neural network, whose input is the (state, action) two-dimensional column vector and whose output is the Q value corresponding to the (state, action) pair;
calling a random function within a given value range to generate a series of random data as the input data set of the network, and calculating, according to the reward function, the reward values obtained when the random data are treated as states and actions, which serve as the output reference data set;
step 302: training the neural network on the data set obtained in step 301 to determine the initial weight parameters of the neural network.
6. The terminal guidance law design method based on deep reinforcement learning according to claim 1, wherein the specific method of training the neural network and updating the target network at a fixed frequency is as follows:
in each simulation step, an action to execute is selected from the action space for the current state according to an ε-greedy strategy, the state at the next moment is obtained by integrating the kinetic equations, and the resulting reward value is calculated; an experience pool is set up, and the current state, the executed action, the reward value and the next state are saved into the experience pool as experience;
a data set of a certain size is randomly drawn from the experience pool at a fixed frequency, its corresponding target values are calculated, and the neural network is trained with this data set and its target values; the target network is updated at a certain frequency, that is, the target network is periodically replaced by the network that has been trained over the preceding period.
7. The terminal guidance law design method based on the deep reinforcement learning as claimed in claim 6, wherein the specific calculation method of the target value is as follows:
Q_target = Q(s_t, a_t) + α[R_t + γ·max_a Q(s_{t+1}, a) - Q(s_t, a_t)]
where Q_target represents the updated Q value corresponding to (s_t, a_t), s_t represents the state at time t, a_t represents the action executed in state s_t, Q(s_t, a_t) represents the Q value of executing action a_t in state s_t, α is the learning rate, i.e. the rate at which the Q value is updated, R_t represents the reward value obtained by executing action a_t in state s_t, γ represents the discount rate, s_{t+1} represents the state at time t+1, and max_a Q(s_{t+1}, a) represents the Q value of executing the optimal action in state s_{t+1}.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211509505.1A CN115857548A (en) | 2022-11-29 | 2022-11-29 | Terminal guidance law design method based on deep reinforcement learning |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211509505.1A CN115857548A (en) | 2022-11-29 | 2022-11-29 | Terminal guidance law design method based on deep reinforcement learning |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115857548A true CN115857548A (en) | 2023-03-28 |
Family
ID=85667624
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211509505.1A Pending CN115857548A (en) | 2022-11-29 | 2022-11-29 | Terminal guidance law design method based on deep reinforcement learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115857548A (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116222310A (en) * | 2023-04-13 | 2023-06-06 | 哈尔滨工业大学 | Two-pair synchronous region coverage interception method based on RBF_G in three-dimensional space |
CN116222310B (en) * | 2023-04-13 | 2024-04-26 | 哈尔滨工业大学 | Two-pair synchronous region coverage interception method based on RBF_G in three-dimensional space |
CN117989923A (en) * | 2024-03-22 | 2024-05-07 | 哈尔滨工业大学 | Variable proportion coefficient multi-bullet collaborative guidance method and system based on reinforcement learning |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||