CN115046433B

CN115046433B - Aircraft time collaborative guidance method based on deep reinforcement learning

Info

Publication number: CN115046433B
Application number: CN202110256808.6A
Authority: CN
Inventors: 王江; 刘子超; 何绍溟; 侯淼; 王鹏
Original assignee: Beijing Institute of Technology BIT
Current assignee: Beijing Institute of Technology BIT
Priority date: 2021-03-09
Filing date: 2021-03-09
Publication date: 2023-04-07
Anticipated expiration: 2041-03-09
Also published as: CN115046433A

Abstract

The invention discloses an aircraft time collaborative guidance method based on deep reinforcement learning, which outputs a bias term a according to the flight state of an aircraft through a deep reinforcement learning model _t Obtaining new guidance instruction a based on the form of bias proportion guidance _m Finally according to the guidance instruction a _m And controlling the aircraft control system. According to the aircraft time collaborative guidance method based on the deep reinforcement learning, the selected input states are the current speed, the current speed direction, the current position and the residual flight time error, the mapping relation is reasonable, and the feasibility of fitting the mapping relation by using the deep reinforcement learning is high.

Description

Aircraft time collaborative guidance method based on deep reinforcement learning

Technical Field

The invention relates to the technical field of aircrafts, in particular to the field of flight time cooperation, and particularly relates to an aircraft time cooperation guidance method based on deep reinforcement learning.

Background

Aircraft (such as missiles) are the medium strength for hitting important strategic targets, but in modern war, the defense and countermeasure means of enemies are various, and particularly, ground or ship-based platforms have remote interception weapons and near defense weapons, which all pose great threat to the aircraft.

The multi-bullet cooperative strike is a high-efficiency penetration measure, and can saturate the defense system of an enemy and improve the success rate of penetration. The flight time cooperation is a feasible means for realizing multi-bullet cooperative strike, and the flight time cooperation at present mainly comprises the following two ways: 1. coordinating the predicted arrival time of each projectile through inter-projectile communication; 2. equal expected arrival times are set for the missiles prior to launch. However, in any way, the remaining flight time of each missile needs to be accurately controlled, and for the problem, most of the existing guidance laws are based on a constant speed hypothesis, and the problem is converted into the control of the remaining flight path. Although the prediction accuracy can be improved by iterative calculation using a differential equation, the amount of calculation is large, and online prediction is difficult to achieve.

The multi-bullet cooperative confrontation decision-making technology needs to establish a task model or an environment model of the confrontation environment, uncertainty of the model cannot be fully considered, and the method for establishing the behavior model or the behavior criterion can artificially limit the solution space of the behavior strategy and is difficult to obtain the optimal strategy, so that the multi-bullet cooperative confrontation environment which is dynamically variable cannot be adapted. In addition, under a complex environment, the dimensions of environment variables and decision variables are increased, and the complexity of the problem is increased, so that the multi-aircraft cooperative countermeasure decision making technology cannot adapt to the complex environment or an algorithm is difficult to solve.

Therefore, it is necessary to provide an aircraft time cooperative guidance method which overcomes the defects of relying on the assumption of constant velocity and has good control effect.

Disclosure of Invention

In order to overcome the problems, the inventor of the invention makes a keen study to design an aircraft time collaborative guidance method based on deep reinforcement learning, and the method trains a deep reinforcement learning model according to the current speed, the current speed direction, the current position and the residual flight time error of an aircraft and realizes the residual flight time control by the deep reinforcement learning model. The method overcomes the defect of dependence on constant velocity assumption, has good control effect, and can be applied to an online guidance control scene, thereby completing the invention.

Specifically, the invention aims to provide an aircraft time collaborative guidance method based on deep reinforcement learning, and the method outputs a bias term a through a deep reinforcement learning model according to the flight state of an aircraft _t Deriving new guidance instructions based on the form of bias proportional guidancea _m Finally according to the guidance instruction a _m Controlling an aircraft control system;

the guidance instruction a _m Obtained by the following formula (one):

wherein, a _m Representing a guidance command, v representing the absolute velocity of the aircraft, λ representing the line-of-sight angle of the projectile,

representing the rate of change of the viewing angle of the bullet, a _b A bias term is represented.

The bias term a _b Obtained by the following steps:

step 1, designing a simulated flight test, and training to obtain a deep reinforcement learning model;

step 2, testing the deep reinforcement learning model;

step 3, when the aircraft flies, the bias item a is obtained by using the depth reinforcement learning model passing the test _t Obtaining new guidance instruction a based on the form of bias proportion guidance _m Finally according to the guidance instruction a _m And controlling the aircraft control system.

In step 1, the deep reinforcement learning model is preferably learned by a near-end strategy optimization method (PPO);

preferably, said step 1 comprises the following sub-steps:

step 1-1, designing a simulated flight test according to an aircraft model;

and 1-2, designing the structure and parameters of the deep reinforcement learning model, and training to obtain the deep reinforcement learning model.

The step 1-1 comprises the following substeps:

1-1-1, acquiring aerodynamic parameters and reference area of the aircraft through a wind tunnel test of the aircraft;

1-1-2, designing an aircraft simulation model according to a motion differential equation set of an aircraft to obtain a flight state s of the aircraft;

1-1-3, taking an offset proportion guidance law as a guidance law, deploying interfaces of a deep reinforcement learning model and an aircraft simulation model, wherein the interfaces comprise an interface from an aircraft state to the deep reinforcement learning model, an interface from the deep reinforcement learning model to an offset term guided by the offset proportion, and an incentive value interface given by the aircraft during training of the deep reinforcement learning model.

The step 1-2 comprises the following substeps:

step 1-2-1, the deep reinforcement learning model outputs a bias item a according to the flight state _b To an aircraft simulation model;

step 1-2-2, collecting data of interaction between a deep reinforcement learning model and an aircraft, and storing the data in an experience pool;

step 1-2-3, improving the output bias term a of the deep reinforcement learning model by using data in the experience pool _b The policy of (1).

In step 1-2-2, the interaction data of the deep reinforcement learning model and the aircraft simulation model is element group(s) _t ，a _t ，r _t )；

Wherein s is _t Representing the flight state of the aircraft at the time t; a is _t A bias term representing the output of the deep reinforcement learning model at the time t; r is _t Representing that the aircraft executes the offset term a at the moment t _t A reward given later;

said r _t Obtained according to the following formula:

wherein, t _d Representing the desired time of flight, t _f Representing an actual time of flight; r represents the projectile distance;

c ₁ a normalized parameter representing time-of-flight reward, set to a constant of 100; c. C ₂ The normalized parameter, which represents the reward for the bullet distance, is set to a constant 10000.

The deep reinforcement learning model comprises two different neural networks: a policy network and an evaluation network;

the strategy network takes a flight state s as an input, and biases an item a _b Is an output;

the evaluation network takes a flight state s as input, and a state value function V of the state s _π (s) is an output;

wherein the merit function is

For improving a policy network, the merit function is obtained by:

where k is the number of awards, V represents a state value function, r _t Indicating the reward at time t, r _t+1 Denotes the reward at time t +1, r _t+2 Represents the reward at time t +2, and so on r _t+k-1 Representing the reward at time t + k-1, and gamma is a discount factor set at a constant of 0.99.

The objective function of the policy network is:

where ω represents the weight w in the policy network ₁ And offset b ₁ ω = { w = ₁ ，b ₁ }；w ₁ Weight representing full connectivity layer in policy network, b ₁ Representing an offset of a full connectivity layer in the policy network;

r _t (ω) represents the ratio between the improved strategy and the old strategy,

the clip is a function of the clip,

e is a shearing parameter for restricting the updating amplitude of the strategy network;

N _s is the capacity of the experience pool;

representing a merit function derived based on the old policy generation reward value;

the objective function of the evaluation network is

Where ξ denotes the weight w in the evaluation network ₂ And offset b ₂ Set of xi = { w = ₂ ，b ₂ }

A _t (s _t ，a _t ) Representing a merit function in the evaluation network;

when the number of interactions N = N _s When, indicating that the empirical pool is saturated, ω and ξ are updated according to the following equation:

wherein alpha is _ω ，α _ξ Respectively representing the parameter update rates of the policy network and the evaluation network,

representing a gradient of the function;

ω _new indicating updated omega, omega after saturation of the empirical pool _old Represents ω at saturation of the empirical pool;

ξ _new xi, xi representing the update after saturation of the experience pool _old Denotes ξ when the empirical pool saturates.

Step 3 comprises the following substeps:

step 3-1, the aircraft obtains a flight state s;

step 3-2, inputting the flight state s into the depth reinforcement learning model passing the test, and outputting the bias item a by the depth reinforcement learning model passing the test _b ；

3-3, obtaining a new guidance instruction a based on the form of bias proportion guidance _m Finally according to the guidance instruction a _m And controlling the aircraft control system.

The invention has the advantages that:

(1) According to the aircraft time collaborative guidance method based on the deep reinforcement learning, the selected input states are the current speed, the current speed direction, the current position and the residual flight time error, the mapping relation is reasonable, and the feasibility of fitting the mapping relation by using the deep reinforcement learning is high;

(2) The aircraft time cooperative guidance method based on the deep reinforcement learning can use a deep reinforcement learning model to fit the relation between the guidance instruction and the residual flight time error, and is a feasible method for realizing the aircraft time cooperative guidance;

(3) Compared with the traditional cooperative guidance algorithm, the aircraft time cooperative guidance method based on the deep reinforcement learning provided by the invention uses the simulation conditions which are more consistent with the real environment during training, overcomes the defect of dependence on the derivation of a constant speed hypothesis, ensures the dynamic stability of the environment to the aircraft during the training process, enables the distributed execution to be more consistent with the actual application scene, has a good control effect, and can be applied to an online guidance control scene.

Drawings

FIG. 1 is a diagram illustrating the operation of a deep reinforcement learning model according to a preferred embodiment of the present invention;

FIG. 2 is a diagram illustrating deep reinforcement learning model training in accordance with a preferred embodiment of the present invention;

FIG. 3 is a schematic diagram illustrating the operation of a near-end policy optimization algorithm in accordance with a preferred embodiment of the present invention;

FIG. 4 illustrates a deep reinforcement learning model learning reward curve in accordance with a preferred embodiment of the present invention;

FIGS. 5a-f are graphs showing test results of a flight trajectory curve, a residual time-of-flight error curve, a flight speed curve, a guidance command curve, and a bias term curve for a deep reinforcement learning model in an embodiment of the present invention;

Detailed Description

The present invention will be described in further detail below with reference to the accompanying drawings and embodiments. The features and advantages of the present invention will become more apparent from the description. The word "exemplary" is used exclusively herein to mean "serving as an example, embodiment, or illustration. Any embodiment described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments. In which, although various aspects of the embodiments are shown in the drawings, the drawings are not necessarily drawn to scale unless specifically indicated.

The invention provides an aircraft time collaborative guidance method based on deep reinforcement learning, which outputs a bias term a according to the flight state of an aircraft through a deep reinforcement learning model _t Obtaining new guidance instruction a based on the form of bias proportion guidance _m Finally according to the guidance instruction a _m Controlling an aircraft control system;

the guidance instruction a _m Obtained by the following formula (one):

The bias term a _b Obtained by the following steps:

step 2, testing the deep reinforcement learning model;

step 3, when the aircraft flies, the bias item a is obtained by using the depth reinforcement learning model passing the test _b Obtaining a new guidance instruction a based on the form of bias proportion guidance _m Finally according to the guidance instruction a _m And controlling the aircraft control system.

The aircraft time collaborative guidance method based on deep reinforcement learning is further described as follows:

step 1, designing a simulated flight test, and training to obtain a deep reinforcement learning model.

In step 1, the deep reinforcement learning model is preferably learned by a near-end strategy optimization method (PPO), as shown in fig. 2;

preferably, said step 1 comprises the following sub-steps:

step 1-1, designing a simulated flight test according to an aircraft model;

The step 1-1 comprises the following substeps:

the aerodynamic parameters comprise a lift coefficient, an induced drag coefficient and a zero lift drag coefficient. In the present invention, when the structure of the aircraft is determined, the aerodynamic parameters of the aircraft can be basically determined. In actual flight, the aerodynamic parameters are generally related to the mach number, angle of attack, and rudder deflection angle of the aircraft.

Preferably, the mach number is related to the sonic speed of the aircraft at the current altitude and is obtained from the current speed information/sonic speed of the aircraft;

the method comprises the steps that a program in the aircraft comprises a navigation module, a guidance module and a control module, and altitude information and current speed information of the aircraft are obtained by the navigation module;

the angle of attack represents the incoming flow direction of the air and is obtained by a navigation module of the aircraft;

the rudder deflection angle is obtained by a control module of the aircraft.

The sound velocity is obtained by interpolation of air data measured in advance, and further Mach number is obtained.

More preferably, the aerodynamic parameters corresponding to the current mach number, angle of attack, and rudder deflection angle are obtained by wind tunnel experiments and interpolation calculation.

In the invention, the state of the aircraft at the next moment is obtained according to the following aerodynamic differential equation of the aircraft:

wherein v represents the magnitude of the velocity, θ represents the angle between the aircraft velocity vector and the horizontal plane, X represents the lateral spatial position of the aircraft, y represents the longitudinal spatial position of the aircraft, m represents the aircraft weight, P represents the engine thrust, α represents the aircraft angle of attack, X represents the drag, L represents the lift, m represents the thrust, and _c represents the fuel consumption per unit time;

the relationship between the lift force, the resistance force and the aerodynamic parameters is as follows:

X＝(c _d0 +c _d )qS

L＝c _L qS

wherein, c _d0 Denotes the coefficient of zero lift resistance, c _d Denotes the coefficient of induced resistance, c _L Representing the lift coefficient, q the dynamic pressure and S the reference area of the aircraft.

1-1-2, designing an aircraft simulation model according to a motion differential equation set of an aircraft to obtain the flight state of the aircraft;

1-1-3, taking an offset proportion guidance law as a guidance law, deploying interfaces of a deep reinforcement learning model and an aircraft simulation program, wherein the interfaces comprise an interface from an aircraft state to the deep reinforcement learning model, an interface from the deep reinforcement learning model to an offset term guided by the offset proportion, and an incentive value interface given by the aircraft during training of the deep reinforcement learning model.

The step 1-2 comprises the following substeps:

step 1-2-1, the deep reinforcement learning model outputs a bias item a according to the flight state of the aircraft _t To an aircraft simulation model;

step 1-2-2, collecting interactive data of a deep reinforcement learning model and an aircraft simulation model, and storing the interactive data into an experience pool;

step 1-2-3, improving the output bias term a of the deep reinforcement learning model by using data in the experience pool _b 。

in the present invention, the simulation model may adopt a semi-physical simulation platform, that is, the flight control system of the aircraft is a physical object, and includes: flight control computers, inertial measurement units (accelerometers, gyros, and magnetometers), while the GPS and object detection sensors (e.g., photoelectric pods, radar) of the aircraft and the flight environment (i.e., atmosphere, terrain, etc.) are completely virtual. Therefore, with lower cost, the training environment is close to the reality to the maximum extent, and the aircraft can utilize the data fed back by the virtual environment and the physical measurement to carry out artificial intelligence training.

The simulation model can also be in a complete virtual state, namely, the flight environment and the flight control system of the aircraft are both virtual.

In the invention, the closer the simulation model is to the real environment, the better the effect of the trained strategy model of the aircraft is.

According to a preferred embodiment of the invention, the current flight state s of the aircraft comprises the position, the velocity vector and the remaining time-of-flight error of the aircraft at the current moment in time, and the current observed state s of the aircraft is represented by the following equation (two).

s＝(v，θ，x，y，t _d - τ) (two)

Where s represents the observed state of the aircraft, v represents the absolute velocity of the aircraft, θ represents the velocity direction, and x represents the velocity directionThe transverse spatial position of the aircraft, y representing the longitudinal spatial position of the aircraft, t _d - τ represents the residual time-of-flight error, t _d Representing the desired remaining time of flight and tau the actual remaining time of flight.

In a further preferred embodiment, the own position of the aircraft is obtained by a GPS positioning system, said own position of the aircraft comprising the altitude and the lateral position of the aircraft at the current moment;

the self speed vector of the aircraft is obtained by an inertial measurement unit and a magnetometer, and the speed vector of the aircraft comprises the speed and the speed direction at the current moment;

the residual flight time error is the difference between the expected residual flight time and the actual residual flight time, the expected residual flight time is obtained by artificial setting, and the actual residual flight time is obtained by a prediction function

Calculating to obtain;

the prediction function is

Where θ represents the velocity direction, λ represents the line-of-sight angle, x represents the lateral spatial position of the aircraft, and y represents the longitudinal spatial position of the aircraft.

In the invention, the simulation model can record the flight state, the bias item and the reward of the aircraft, and can feed back the flight state, the bias item and the reward to the deep reinforcement learning model for storage to be used as a training data set.

Preferably, the process of interaction between the deep reinforcement learning model and the aircraft and simulation model is as follows: the deep reinforcement learning model outputs a bias item according to the current flight state information, and the aircraft executes a control instruction according to the bias item and then switches to a relay state (the flight state at the next moment) and gives rewards.

And 1-2-2, collecting interactive data of the deep reinforcement learning model and the aircraft simulation model, and storing the interactive data into an experience pool.

According toIn a preferred embodiment of the invention, the data interacted between the deep reinforcement learning model and the aircraft simulation model is an element group(s) _t ，a _t ，r _t )，

Wherein s is _t Representing the flight state of the aircraft at the time t; a is _t A bias term representing the output of the deep reinforcement learning model at the time t; r is _t Representing that the aircraft executes the offset term a at the moment t _t The reward obtained later.

In a further preferred embodiment, the data of the interaction is stored in an experience pool of each deep reinforcement learning model for improving the generation strategy of the bias term.

And after the interactive data is stored in the experience pool, the aircraft updates the current state to be a succession state.

According to the invention, the aircraft gives a reward r _t And the improved parameters for calculating the bias item generation strategy comprise two constraints of expected time and target hitting, flight time rewards are set according to the expected time, and bullet distance rewards are set according to the target hitting.

According to a preferred embodiment of the invention, in said time-of-flight reward, the closer the actual time-of-flight is to the desired time-of-flight, the greater the reward, the time-of-flight reward being designed to be- (t) _d -t _f ) ² ；

Wherein, t _d To the desired time of flight, t _f Is the actual time of flight.

The expected flight time is artificially set flight time in the actual application process and is different according to the actual situation; the actual flight time is the actual flight time of the aircraft in the actual application process and is obtained by prediction of a prediction function;

according to another preferred embodiment of the invention, in the aforementioned shot-to-shot distance reward, the aircraft should shorten the shot-to-shot distance as soon as possible, the smaller the shot-to-shot distance, the greater the reward, and the reward is designed to be-R ² Wherein R represents the projectile distance.

The shot-to-eye distance is obtained by adopting an absolute position according to the following formula

Where x and y are measured in real time by GPS.

In a further preferred embodiment, in order to avoid one of the rewards from masking the other reward, the two sets of rewards are normalized, and an exponential function normalization method is adopted in the application to obtain the reward r given by the environment after the aircraft performs the action at the time t +1 _t Obtained according to the following formula (III)

Wherein, c ₁ A normalized parameter for time-of-flight rewards set at a constant of 100; c. C ₂ The normalized parameter awarded for the bullet distance is set to a constant 10000.

Step 1-2-3, improving the bias term a output by the deep reinforcement learning model by using the data in the experience pool _b 。

The deep reinforcement learning model adopting the near-end strategy optimization algorithm comprises two different neural networks: a policy network and an evaluation network;

the strategy network takes a flight state s as input and biases a term a _b Is an output;

the evaluation network takes a flight state s as input, and a state value function V of the state s _π (s) is an output; a frame diagram of the near-end policy optimization algorithm is shown in fig. 3.

Wherein the function of state values V _π (s) is used to represent the potential value of state s. The aim of improving the bias term generation strategy is to find a strategy pi to enable the deep reinforcement learning model to obtain the maximum total reward value in an unknown environment, but the total reward value comprises future reward values and cannot be directly calculated, so that a state value function V is used _π (s) approximately calculating a total prize value;

the strategies are denoted differentlyBias term a under state s _b Due to the trial-and-error nature of reinforcement learning, the bias term a _b Typically not a determined value. The form of strategy pi is normal distribution pi-N (mu, sigma), and bias term a _b The probability density function of (a) is:

wherein x represents a randomly sampled value in the probability distribution, μ represents a mean value of the probability density function, and σ represents a standard deviation of the probability density function;

according to a preferred embodiment of the application, the strategy network is a neural network comprising two identical fully-connected layers as hidden layers, intermediate variables mu and sigma are output according to an input flight state s, then normal distribution N to (mu, sigma) is constructed, and after sampling randomly, a sampling result is output as a bias term a _b ；

To improve the policy network, a merit function is defined as

When the advantage function is positive, increasing the probability of the current behavior in the current state; when the dominance function is negative, reducing the probability of the current behavior in the current state;

the merit function is obtained by:

wherein k is the number of awards, V _π Representing a function of state values, r _t Indicating the reward at time t, r _t+1 Denotes the reward at time t +1, r _t+2 Represents the reward at time t +2, and so on r _t+k-1 Representing the reward at time t + k-1, and gamma is a discount factor set at a constant of 0.99.

The objective function of the policy network is:

where ω represents the weight w in the policy network ₁ And offset b ₁ ω = { w = ₁ ，b ₁ }；

r _t (ω) represents the ratio between the improved strategy and the old strategy, clip being the clip function, N _s Is the capacity of the experience pool;

wherein, the epsilon is a shearing parameter of the update amplitude of the constraint strategy network;

w ₁ weight representing the full connectivity layer in a policy network, b ₁ Representing an offset of a full connectivity layer in the policy network;

representing a merit function derived based on the old policy generation reward value.

According to the application, the fully connected layer in the policy network has the following form:

l _j ＝ReLU(∑ _i (w ₁ u+b ₁ ) ) = max (0,x)

Wherein l _j Represents the output of the fully-connected layer and u represents the input of the fully-connected layer.

According to a further preferred embodiment of the present application, the evaluation network is also a neural network comprising two identical fully-connected layers as hidden layers for obtaining the merit function

Two state value functions V(s) in _t ) And V: (s _t+k ) The fully connected layer has the following form:

l _j ＝ReLU(∑ _i (w ₂ u+b ₂ ) ) = max (0,x)

Wherein l _j Denotes the output of the fully-connected layer, u denotes the input of the fully-connected layer, w ₂ Representing the weight of the fully connected layer in the evaluation network, b ₂ Representing the offset of the full connection layer in the evaluation network;

defining the set of weights and offsets in the evaluation network as xi, xi = { w = ₂ ，b ₂ Evaluate the objective function of the network as

When N = N _s When, indicating that the experience pool is saturated, ω and ξ in the policy network and the evaluation network are updated according to the following equations:

wherein alpha is _ω ，α _ξ Respectively representing the parameter updating rate of the strategy network and the evaluation network, obtained by manual setting,

represents a gradient of the function;

The sampling process of the deep reinforcement learning is interactive, a new sample needs to be generated through a simulated flight test while learning, and the learning is carried out while sampling. In the course of learningDeep reinforcement learning model and old strategy for aircraft _old Interaction N _s And secondly, storing the interaction time sequence generated by the interaction process in a buffer area. When updating the policy network, the estimated merit function is first used

Then, calculating the probability pi of the executed behavior of the experience pool in the old strategy according to the probability density function of the normal distribution _old (a _t |s _t ). Calculating pi after strategy network generates new strategy pi _ω (a _t |s _t ) Then, an objective function is calculated, the gradient of the objective function to omega is obtained by using a gradient descent method, and a strategy network is updated, so that the objective function is maximized.

When the evaluation network is updated, the advantage function in the objective function is obtained in the stage of updating the strategy network and can be directly calculated. And optimizing a loss function of the evaluation network by using a gradient descent method, and updating a parameter xi of the evaluation network to minimize the loss function. After the two networks are updated, emptying the experience pool, and then using the learned new strategy to interact N _s This learning process is repeated until the simulation test is completed.

More preferably, when r _t When the change rate of the average value is less than 2%, determining that the average value is convergent, finishing the training of the multi-aircraft group, storing the obtained deep reinforcement learning model, and obtaining a learning curve of the deep reinforcement learning model after 100 times of training as shown in fig. 4.

And 2, testing the deep reinforcement learning model.

When the fluctuation amplitude of the reward value is less than 2%, the model is saved and a simulation test is carried out, and the test result is shown in fig. 5.

In the figure, a flight path curve is shown in fig. 5a, a residual flight time curve is shown in fig. 5b, a residual flight time error curve is shown in fig. 5c, a flight speed curve is shown in fig. 5d, a guidance instruction curve is shown in fig. 5e, and a bias term curve is shown in fig. 5 f;

according to a preferred embodiment of the invention, the aircraft can arrive at the target position with different desired flight times after departing from the same initial flight conditions.

According to a preferred embodiment of the present invention, the control effect of the deep reinforcement learning model is determined according to the difference between the actual remaining flight time and the expected remaining flight time.

Preferably, in the experimental stage, when the difference between the actual remaining flight time and the expected remaining flight time is less than 1s, the performance of the neural network model is considered to be basically satisfied with the application, and the neural network model can be used for actually executing the task process.

Step 3, when the aircraft flies, the bias item a is obtained by using the depth reinforcement learning model passing the test _b 。

Wherein, step 3 comprises the following substeps:

and 3-1, acquiring a flight state by the aircraft.

Wherein the flight state of the aircraft comprises the position and velocity vector of the aircraft and the residual time-of-flight error.

Step 3-2, inputting the flight state into the depth reinforcement learning model passing the test, and outputting the bias item a by the depth reinforcement learning model passing the test _t 。

In the invention, because the deep reinforcement learning model in the training stage learns to obtain the optimal behavior strategy and has a stable execution strategy model, in the task execution stage, the deep reinforcement learning model can output the bias item a only according to the flight state _b 。

Wherein,

a _m representing a guidance command, v representing the absolute velocity of the aircraft, λ representing the line-of-sight angle of the projectile,

According to the aircraft time collaborative guidance method based on the deep reinforcement learning, in the training phase, the aircraft is launched under certain initial conditions, and different target times are set, so that the aircraft can learn under the conditions as much as possible, and the actual combat effect is good.

Examples of the experiments

Carrying out simulation test on the deep reinforcement learning model, wherein in the embodiment, the selected aircraft is a missile;

the fixed step length used by a simulation program in the simulation flight test is 0.1s;

in the simulated flight test, the simulation program runs 1000 times, and the deep reinforcement learning model is trained 30 times, and the training time is about 30000 times in total.

The kinetic model of the missile is

the built deep reinforcement learning model comprises a strategy network and an evaluation network, wherein the strategy network and the evaluation network both use two same full connection layers as hidden layers, and the function of the full connection layers in the strategy network is l _j ＝ReLU(∑ _i (w ₁ u+b ₁ ) ReLU (x) = max (0,x);

the objective function of the policy network is

Evaluating the objective function of the network as

Wherein

ξ denotes the weight w in the evaluation network ₂ And offset b ₂ Set of xi = { w = ₂ ，b ₂ }

V _π (s _t ) And V _π (s _t+k ) Obtained by evaluating network estimation;

the function evaluating the full connection layer in the network is l _j ＝ReLU(∑ _i (w ₂ u+b ₂ ) ) = max (0,x)

representing a gradient of the function;

ω _new indicating updated omega, omega after saturation of the empirical pool _old Denotes ω at empirical pool saturation；

After the training is finished, testing the converged depth-enhanced learning model, selecting 5 aircrafts to launch at a speed of 200m/s, setting the initial transverse position to be-20 km, the height to be 20km and the initial launching angle to be 0 degrees, and respectively setting the expected flight time to be 100s, 120s, 140s, 160s, 180s and 200s, wherein the result is shown in fig. 5, as can be seen from fig. 5, the residual flight time controlled by the depth-enhanced learning model trained in the embodiment converges to the expected residual flight time, and the maximum error is not more than 1s, which indicates that the depth-enhanced learning model can well fit the mapping relation between the missile flight state and the residual flight time.

The present invention has been described above in connection with preferred embodiments, but these embodiments are merely exemplary and merely illustrative. On the basis of the above, the invention can be subjected to various substitutions and modifications, and the substitutions and the modifications are all within the protection scope of the invention.

Claims

1. An aircraft time collaborative guidance method based on deep reinforcement learning is disclosed, wherein a bias term a is output through a deep reinforcement learning model _b Obtaining new guidance instruction a based on the form of bias proportion guidance _m Finally according to the guidance instruction a _m Controlling an aircraft control system;

the guidance instruction a _m Obtained by the following formula (one):

representing the rate of change of the viewing angle of the bullet, a _b Representing a bias term;

the bias term a _b Obtained by the following steps:

step 2, testing the deep reinforcement learning model;

step 3, when the aircraft flies, the bias item a is obtained by using the depth reinforcement learning model passing the test _b Obtaining new guidance instruction a based on the form of bias proportion guidance _m Finally according to the guidance instruction a _m Controlling an aircraft control system;

in step 1, the deep reinforcement learning model learns through a near-end policy optimization (PPO);

the step 1 comprises the following substeps:

step 1-1, designing a simulated flight test according to an aircraft model;

step 1-2, designing the structure and parameters of a deep reinforcement learning model, and training to obtain the deep reinforcement learning model;

the step 1-1 comprises the following substeps:

1-1-3, taking an offset proportion guidance law as a guidance law, deploying interfaces of a deep reinforcement learning model and an aircraft simulation model, wherein the interfaces comprise an interface from an aircraft state to the deep reinforcement learning model, an interface from the deep reinforcement learning model to an offset item guided by the offset proportion, and an incentive value interface given by the aircraft during training of the deep reinforcement learning model.

2. The method of claim 1,

the step 1-2 comprises the following substeps:

step 1-2-1, the deep reinforcement learning model outputs a bias item a according to the flight state of the aircraft _b To an aircraft simulation model;

3. The method of claim 2,

in step 1-2-2, the interaction data of the deep reinforcement learning model and the aircraft simulation model is element group(s) _t ,a _t ,r _t )；

Wherein s is _t Representing the flight state of the aircraft at the time t; a is _t A bias term representing the output of the deep reinforcement learning model at the time t; r is _t Representing that the aircraft executes the offset term a at the moment t _t The reward given by the back environment.

4. The method of claim 3,

said r _t Obtained according to the following formula:

5. The method of claim 2,

the strategy network takes a flight state s as input and biases a term a _b Is an output；

using merit function

To improve the policy network, the merit function is obtained by:

wherein k is the number of awards, V _π Representing a function of state values, r _t Indicating the reward at time t, r _t+1 Denotes the reward at time t +1, r _t+2 Indicating the reward at time t +2, and so on r _t+k-1 Representing the reward at time t + k-1, and gamma is a discount factor set at a constant of 0.99.

6. The method of claim 5,

the objective function of the policy network is:

where ω represents the weight w in the policy network ₁ And offset b ₁ ω = { w = ₁ ,1}；w ₁ Weight representing the full connectivity layer in a policy network, b ₁ Representing an offset of a full connectivity layer in the policy network;

the clip is a function of the clip,

N _s is the capacity of the experience pool;

the objective function of the evaluation network is

Where ξ represents the weight w in the evaluation network ₂ And offset b ₂ Set of xi = { w = ₂ ,b ₂ }；

A _t (s _t ,a _t ) Representing a merit function in the evaluation network;

wherein alpha is _ω ,a _ξ Respectively representing the parameter update rates of the policy network and the evaluation network,

representing a gradient of the function; />

7. The method of claim 1,

step 3 comprises the following substeps:

step 3-1, the aircraft obtains a flight state s;