CN115046433B - Aircraft time collaborative guidance method based on deep reinforcement learning - Google Patents

Aircraft time collaborative guidance method based on deep reinforcement learning

Info

Publication number
CN115046433B
CN115046433B (Application CN202110256808.6A)
Authority
CN
China
Prior art keywords
aircraft
reinforcement learning
representing
learning model
deep reinforcement
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110256808.6A
Other languages
Chinese (zh)
Other versions
CN115046433A (en)
Inventor
王江
刘子超
何绍溟
侯淼
王鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Institute of Technology BIT
Original Assignee
Beijing Institute of Technology BIT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Institute of Technology BIT filed Critical Beijing Institute of Technology BIT
Priority to CN202110256808.6A priority Critical patent/CN115046433B/en
Publication of CN115046433A publication Critical patent/CN115046433A/en
Application granted granted Critical
Publication of CN115046433B publication Critical patent/CN115046433B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • F - MECHANICAL ENGINEERING; LIGHTING; HEATING; WEAPONS; BLASTING
    • F42 - AMMUNITION; BLASTING
    • F42B - EXPLOSIVE CHARGES, e.g. FOR BLASTING, FIREWORKS, AMMUNITION
    • F42B15/00 - Self-propelled projectiles or missiles, e.g. rockets; Guided missiles
    • F42B15/01 - Arrangements thereon for guidance or control
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F30/00 - Computer-aided design [CAD]
    • G06F30/20 - Design optimisation, verification or simulation
    • G06F30/27 - Design optimisation, verification or simulation using machine learning, e.g. artificial intelligence, neural networks, support vector machines [SVM] or training a model
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T - CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T90/00 - Enabling technologies or technologies with a potential or indirect contribution to GHG emissions mitigation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Computer Hardware Design (AREA)
  • Geometry (AREA)
  • Chemical & Material Sciences (AREA)
  • Aviation & Aerospace Engineering (AREA)
  • Combustion & Propulsion (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Traffic Control Systems (AREA)

Abstract

The invention discloses an aircraft time-collaborative guidance method based on deep reinforcement learning. According to the flight state of the aircraft, a deep reinforcement learning model outputs a bias term a_b; a new guidance instruction a_m is then obtained in the form of biased proportional guidance, and the aircraft control system is finally controlled according to the guidance instruction a_m. In this method, the selected input states are the current speed, the current speed direction, the current position and the remaining flight-time error, so the mapping relation is reasonable and the feasibility of fitting it with deep reinforcement learning is high.

Description

Aircraft time collaborative guidance method based on deep reinforcement learning
Technical Field
The invention relates to the technical field of aircraft, in particular to flight-time cooperation, and specifically to an aircraft time-cooperative guidance method based on deep reinforcement learning.
Background
Aircraft (such as missiles) are the main force for striking important strategic targets, but in modern warfare the enemy's defensive countermeasures are diverse; in particular, ground or ship-based platforms carry long-range interception weapons and close-in defense weapons, all of which pose a great threat to the aircraft.
Multi-missile cooperative strike is an efficient penetration measure that can saturate the enemy's defense system and improve the penetration success rate. Flight-time cooperation is a feasible means of realizing multi-missile cooperative strike, and is currently achieved mainly in two ways: 1. coordinating the predicted arrival time of each missile through inter-missile communication; 2. setting equal expected arrival times for all missiles before launch. Either way, the remaining flight time of each missile must be controlled accurately. For this problem, most existing guidance laws rely on a constant-speed assumption and convert the problem into control of the remaining flight path. Although prediction accuracy can be improved by iteratively solving differential equations, the computational cost is large and online prediction is difficult to achieve.
Multi-missile cooperative countermeasure decision technology needs to establish a task model or environment model of the adversarial environment, and the uncertainty of the model cannot be fully considered; methods that establish behavior models or behavior criteria artificially limit the solution space of the behavior strategy and can hardly obtain the optimal strategy, so they cannot adapt to a dynamically changing multi-missile cooperative adversarial environment. In addition, in a complex environment the dimensions of the environment variables and decision variables increase and the problem becomes more complex, so the multi-aircraft cooperative countermeasure decision technology either cannot adapt to the complex environment or the algorithm becomes difficult to solve.
Therefore, it is necessary to provide an aircraft time-cooperative guidance method that overcomes the reliance on the constant-speed assumption and has a good control effect.
Disclosure of Invention
In order to overcome the above problems, the inventors of the present invention have made intensive studies and designed an aircraft time-collaborative guidance method based on deep reinforcement learning. The method trains a deep reinforcement learning model on the current speed, the current speed direction, the current position and the remaining flight-time error of the aircraft, and uses the model to control the remaining flight time. The method overcomes the reliance on the constant-speed assumption, has a good control effect, and can be applied to online guidance control scenarios, whereby the invention was completed.
Specifically, the invention aims to provide an aircraft time-collaborative guidance method based on deep reinforcement learning. According to the flight state of the aircraft, a deep reinforcement learning model outputs a bias term a_b; a new guidance instruction a_m is obtained in the form of biased proportional guidance, and the aircraft control system is finally controlled according to the guidance instruction a_m;
the guidance instruction a_m is obtained by the following formula (one):
a_m = N·v·λ̇ + a_b    (one)
wherein a_m represents the guidance command, N represents the navigation ratio, v represents the absolute velocity of the aircraft, λ represents the missile-target line-of-sight angle, λ̇ represents the rate of change of the line-of-sight angle, and a_b represents the bias term.
The bias term a_b is obtained by the following steps:
step 1, designing a simulated flight test and training to obtain a deep reinforcement learning model;
step 2, testing the deep reinforcement learning model;
step 3, when the aircraft flies, obtaining the bias term a_b with the tested deep reinforcement learning model, obtaining a new guidance instruction a_m in the form of biased proportional guidance, and finally controlling the aircraft control system according to the guidance instruction a_m.
In step 1, the deep reinforcement learning model is preferably trained by proximal policy optimization (PPO);
preferably, said step 1 comprises the following sub-steps:
step 1-1, designing a simulated flight test according to an aircraft model;
and step 1-2, designing the structure and parameters of the deep reinforcement learning model, and training to obtain the deep reinforcement learning model.
The step 1-1 comprises the following substeps:
1-1-1, acquiring aerodynamic parameters and reference area of the aircraft through a wind tunnel test of the aircraft;
1-1-2, designing an aircraft simulation model according to a motion differential equation set of an aircraft to obtain a flight state s of the aircraft;
1-1-3, taking the biased proportional guidance law as the guidance law, and deploying the interfaces between the deep reinforcement learning model and the aircraft simulation model, including an interface from the aircraft state to the deep reinforcement learning model, an interface from the deep reinforcement learning model to the bias term of the biased proportional guidance, and an interface for the reward value given by the aircraft during training of the deep reinforcement learning model.
Step 1-2 comprises the following substeps:
step 1-2-1, the deep reinforcement learning model outputs a bias term a_b to the aircraft simulation model according to the flight state;
step 1-2-2, collecting the interaction data between the deep reinforcement learning model and the aircraft simulation model and storing them in an experience pool;
step 1-2-3, improving the policy by which the deep reinforcement learning model outputs the bias term a_b by using the data in the experience pool.
In step 1-2-2, the interaction data between the deep reinforcement learning model and the aircraft simulation model is the tuple (s_t, a_t, r_t);
wherein s_t represents the flight state of the aircraft at time t; a_t represents the bias term output by the deep reinforcement learning model at time t; r_t represents the reward given after the aircraft executes the bias term a_t at time t;
r_t is obtained according to the following formula:
r_t = exp(-(t_d - t_f)²/c_1) + exp(-R²/c_2)
wherein t_d represents the desired flight time, t_f represents the actual flight time, and R represents the missile-target distance;
c_1 is the normalization parameter of the flight-time reward, set to the constant 100; c_2 is the normalization parameter of the missile-target distance reward, set to the constant 10000.
The deep reinforcement learning model comprises two different neural networks: a policy network and an evaluation network;
the policy network takes the flight state s as input and the bias term a_b as output;
the evaluation network takes the flight state s as input and the state value function V_π(s) of the state s as output;
wherein the advantage function A_t(s_t, a_t), used to improve the policy network, is obtained by:
A_t(s_t, a_t) = r_t + γ·r_{t+1} + γ²·r_{t+2} + … + γ^{k-1}·r_{t+k-1} + γ^k·V_π(s_{t+k}) - V_π(s_t)
where k is the number of rewards, V_π represents the state value function, r_t denotes the reward at time t, r_{t+1} the reward at time t+1, r_{t+2} the reward at time t+2, and so on up to r_{t+k-1}, the reward at time t+k-1; γ is the discount factor, set to the constant 0.99.
The objective function of the policy network is:
J(ω) = (1/N_s)·Σ_{t=1}^{N_s} min( r_t(ω)·Â_t , clip(r_t(ω), 1-ε, 1+ε)·Â_t )
where ω represents the set of the weights w_1 and offsets b_1 in the policy network, ω = {w_1, b_1}; w_1 represents the weights of the fully connected layers in the policy network and b_1 represents their offsets;
r_t(ω) represents the ratio between the improved policy and the old policy,
r_t(ω) = π_ω(a_t|s_t) / π_{ω_old}(a_t|s_t);
clip is the clipping function,
clip(r_t(ω), 1-ε, 1+ε) = 1-ε if r_t(ω) < 1-ε; r_t(ω) if 1-ε ≤ r_t(ω) ≤ 1+ε; 1+ε if r_t(ω) > 1+ε;
ε is the clipping parameter that constrains the update amplitude of the policy network;
N_s is the capacity of the experience pool;
Â_t represents the advantage function derived from the reward values generated by the old policy;
the objective function of the evaluation network is
J(ξ) = (1/N_s)·Σ_{t=1}^{N_s} A_t(s_t, a_t)²
where ξ represents the set of the weights w_2 and offsets b_2 in the evaluation network, ξ = {w_2, b_2};
A_t(s_t, a_t) represents the advantage function in the evaluation network;
when the number of interactions N = N_s, indicating that the experience pool is saturated, ω and ξ are updated according to the following equations:
ω_new = ω_old + α_ω·∇_ω J(ω)
ξ_new = ξ_old - α_ξ·∇_ξ J(ξ)
wherein α_ω and α_ξ respectively represent the parameter update rates of the policy network and the evaluation network, and ∇ represents the gradient of the function;
ω_new denotes ω updated after the experience pool saturates, and ω_old denotes ω at saturation of the experience pool;
ξ_new denotes ξ updated after the experience pool saturates, and ξ_old denotes ξ at saturation of the experience pool.
Step 3 comprises the following substeps:
step 3-1, the aircraft obtains the flight state s;
step 3-2, inputting the flight state s into the tested deep reinforcement learning model, which outputs the bias term a_b;
step 3-3, obtaining a new guidance instruction a_m in the form of biased proportional guidance, and finally controlling the aircraft control system according to the guidance instruction a_m.
The invention has the advantages that:
(1) In the aircraft time-collaborative guidance method based on deep reinforcement learning provided by the invention, the selected input states are the current speed, the current speed direction, the current position and the remaining flight-time error; the mapping relation is reasonable, and the feasibility of fitting it with deep reinforcement learning is high;
(2) The method can use a deep reinforcement learning model to fit the relation between the guidance instruction and the remaining flight-time error, which is a feasible way to realize aircraft time-cooperative guidance;
(3) Compared with traditional cooperative guidance algorithms, the method uses simulation conditions that better match the real environment during training, overcomes the reliance on derivations based on the constant-speed assumption, keeps the environment dynamically stationary for the aircraft during training so that distributed execution better matches the actual application scenario, has a good control effect, and can be applied to online guidance control scenarios.
Drawings
FIG. 1 is a diagram illustrating the operation of a deep reinforcement learning model according to a preferred embodiment of the present invention;
FIG. 2 is a diagram illustrating deep reinforcement learning model training in accordance with a preferred embodiment of the present invention;
FIG. 3 is a schematic diagram illustrating the operation of the proximal policy optimization algorithm in accordance with a preferred embodiment of the present invention;
FIG. 4 illustrates a deep reinforcement learning model learning reward curve in accordance with a preferred embodiment of the present invention;
FIGS. 5a-f are graphs showing the test results of the deep reinforcement learning model in an embodiment of the present invention: the flight trajectory curve, the remaining flight-time curve, the remaining flight-time error curve, the flight speed curve, the guidance command curve, and the bias term curve.
Detailed Description
The present invention will be described in further detail below with reference to the accompanying drawings and embodiments. The features and advantages of the present invention will become more apparent from the description. The word "exemplary" is used herein to mean "serving as an example, embodiment, or illustration"; any embodiment described as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments. Although various aspects of the embodiments are shown in the drawings, the drawings are not necessarily drawn to scale unless specifically indicated.
The invention provides an aircraft time-collaborative guidance method based on deep reinforcement learning. According to the flight state of the aircraft, a deep reinforcement learning model outputs a bias term a_b; a new guidance instruction a_m is obtained in the form of biased proportional guidance, and the aircraft control system is finally controlled according to the guidance instruction a_m;
the guidance instruction a_m is obtained by the following formula (one):
a_m = N·v·λ̇ + a_b    (one)
wherein a_m represents the guidance command, N represents the navigation ratio, v represents the absolute velocity of the aircraft, λ represents the missile-target line-of-sight angle, λ̇ represents the rate of change of the line-of-sight angle, and a_b represents the bias term.
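For illustration, the guidance command of formula (one) could be computed as in the following Python sketch; the navigation gain value N = 3, the stationary-target assumption and the helper names are assumptions of the sketch rather than specifics of the invention.

```python
import math

def los_angle_and_rate(x, y, vx, vy, x_t, y_t):
    """Line-of-sight angle lambda and its rate for a target assumed stationary at (x_t, y_t)."""
    rx, ry = x_t - x, y_t - y
    lam = math.atan2(ry, rx)
    # d(lambda)/dt from the relative position and the (negated) missile velocity
    lam_rate = (rx * (-vy) - ry * (-vx)) / (rx * rx + ry * ry)
    return lam, lam_rate

def biased_png_command(v, lam_rate, a_b, nav_gain=3.0):
    """Guidance command a_m = N * v * lambda_dot + a_b (formula (one)); N = 3 is an assumed value."""
    return nav_gain * v * lam_rate + a_b
```

Here a_b is the bias term produced by the deep reinforcement learning model for the current flight state.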
The bias term a_b is obtained by the following steps:
step 1, designing a simulated flight test and training to obtain a deep reinforcement learning model;
step 2, testing the deep reinforcement learning model;
step 3, when the aircraft flies, obtaining the bias term a_b with the tested deep reinforcement learning model, obtaining a new guidance instruction a_m in the form of biased proportional guidance, and finally controlling the aircraft control system according to the guidance instruction a_m.
The aircraft time collaborative guidance method based on deep reinforcement learning is further described as follows:
step 1, designing a simulated flight test, and training to obtain a deep reinforcement learning model.
In step 1, the deep reinforcement learning model is preferably trained by proximal policy optimization (PPO), as shown in FIG. 2;
preferably, said step 1 comprises the following sub-steps:
step 1-1, designing a simulated flight test according to an aircraft model;
and step 1-2, designing the structure and parameters of the deep reinforcement learning model, and training to obtain the deep reinforcement learning model.
The step 1-1 comprises the following substeps:
1-1-1, acquiring aerodynamic parameters and reference area of the aircraft through a wind tunnel test of the aircraft;
the aerodynamic parameters comprise a lift coefficient, an induced drag coefficient and a zero lift drag coefficient. In the present invention, when the structure of the aircraft is determined, the aerodynamic parameters of the aircraft can be basically determined. In actual flight, the aerodynamic parameters are generally related to the mach number, angle of attack, and rudder deflection angle of the aircraft.
Preferably, the mach number is related to the sonic speed of the aircraft at the current altitude and is obtained from the current speed information/sonic speed of the aircraft;
the method comprises the steps that a program in the aircraft comprises a navigation module, a guidance module and a control module, and altitude information and current speed information of the aircraft are obtained by the navigation module;
the angle of attack represents the incoming flow direction of the air and is obtained by a navigation module of the aircraft;
the rudder deflection angle is obtained by a control module of the aircraft.
The sound velocity is obtained by interpolation of air data measured in advance, and further Mach number is obtained.
More preferably, the aerodynamic parameters corresponding to the current mach number, angle of attack, and rudder deflection angle are obtained by wind tunnel experiments and interpolation calculation.
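The interpolation described above could be organized as in the following sketch; the table axes, the stand-in coefficient values and the function name are placeholders invented for the example, not wind-tunnel data of the invention.

```python
import numpy as np

# Hypothetical wind-tunnel tables: a coefficient on a (Mach, angle-of-attack) grid.
mach_grid = np.array([0.4, 0.6, 0.8, 1.0, 1.2])
alpha_grid = np.deg2rad([-4.0, 0.0, 4.0, 8.0])                    # rad
cl_table = np.random.default_rng(0).uniform(0.0, 1.2, (5, 4))     # stand-in lift-coefficient data

def interp_coefficient(table, mach, alpha):
    """Bilinear interpolation of an aerodynamic coefficient at (mach, alpha)."""
    i = int(np.clip(np.searchsorted(mach_grid, mach) - 1, 0, len(mach_grid) - 2))
    j = int(np.clip(np.searchsorted(alpha_grid, alpha) - 1, 0, len(alpha_grid) - 2))
    tm = (mach - mach_grid[i]) / (mach_grid[i + 1] - mach_grid[i])
    ta = (alpha - alpha_grid[j]) / (alpha_grid[j + 1] - alpha_grid[j])
    c00, c01 = table[i, j], table[i, j + 1]
    c10, c11 = table[i + 1, j], table[i + 1, j + 1]
    return (1 - tm) * ((1 - ta) * c00 + ta * c01) + tm * ((1 - ta) * c10 + ta * c11)
```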
In the invention, the state of the aircraft at the next moment is obtained from the following dynamic differential equations of the aircraft:
dv/dt = (P·cos α - X)/m - g·sin θ
dθ/dt = (P·sin α + L)/(m·v) - g·cos θ/v
dx/dt = v·cos θ
dy/dt = v·sin θ
dm/dt = -c
wherein v represents the magnitude of the velocity, θ represents the angle between the aircraft velocity vector and the horizontal plane, x represents the lateral spatial position of the aircraft, y represents the longitudinal spatial position of the aircraft, m represents the aircraft mass, P represents the engine thrust, α represents the angle of attack, X represents the drag, L represents the lift, c represents the fuel consumption per unit time, and g represents the gravitational acceleration;
the relationship between the lift, the drag and the aerodynamic parameters is:
X = (c_d0 + c_d)·q·S
L = c_L·q·S
wherein c_d0 denotes the zero-lift drag coefficient, c_d denotes the induced drag coefficient, c_L denotes the lift coefficient, q denotes the dynamic pressure, and S denotes the reference area of the aircraft.
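A minimal sketch of how these equations could be advanced with a fixed simulation step is given below; the gravitational constant and the simple Euler step are assumptions of the sketch.

```python
import math

G = 9.81  # gravitational acceleration, m/s^2 (assumed)

def dynamics(state, P, alpha, X, L, c):
    """Right-hand side of the point-mass equations of motion.

    state = (v, theta, x, y, m): speed, flight-path angle, lateral position,
    longitudinal position, mass; P thrust, alpha angle of attack, X drag,
    L lift, c fuel consumption per unit time."""
    v, theta, x, y, m = state
    dv = (P * math.cos(alpha) - X) / m - G * math.sin(theta)
    dtheta = (P * math.sin(alpha) + L) / (m * v) - G * math.cos(theta) / v
    dx = v * math.cos(theta)
    dy = v * math.sin(theta)
    dm = -c
    return dv, dtheta, dx, dy, dm

def euler_step(state, derivs, dt=0.1):
    """Advance the state by one fixed simulation step (the embodiment uses 0.1 s)."""
    return tuple(s + d * dt for s, d in zip(state, derivs))
```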
1-1-2, designing an aircraft simulation model according to the equations of motion of the aircraft to obtain the flight state of the aircraft;
1-1-3, taking the biased proportional guidance law as the guidance law, and deploying the interfaces between the deep reinforcement learning model and the aircraft simulation program, including an interface from the aircraft state to the deep reinforcement learning model, an interface from the deep reinforcement learning model to the bias term of the biased proportional guidance, and an interface for the reward value given by the aircraft during training of the deep reinforcement learning model (see the sketch after this list).
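These three interfaces can be arranged in the style of a standard reinforcement-learning environment, as in the following sketch; the class and method names are illustrative, and `sim` stands for the aircraft simulation model of step 1-1-2.

```python
class TimeCooperativeGuidanceEnv:
    """Wraps the aircraft simulation so the learning model sees states and rewards
    and returns bias terms; `sim` stands for the simulation model of step 1-1-2."""

    def __init__(self, sim, desired_flight_time):
        self.sim = sim
        self.t_d = desired_flight_time          # desired flight time set before launch

    def reset(self):
        self.sim.reset()
        return self.sim.observe()               # state interface: s = (v, theta, x, y, t_d - tau)

    def step(self, a_b):
        a_m = self.sim.guidance_command(a_b)    # bias-term interface into the biased PN law
        self.sim.advance(a_m)                   # integrate the equations of motion one step
        s_next = self.sim.observe()
        r = self.sim.reward(self.t_d)           # reward interface used during training
        done = self.sim.hit_or_timeout()
        return s_next, r, done
```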
Step 1-2 comprises the following substeps:
step 1-2-1, the deep reinforcement learning model outputs a bias term a_b to the aircraft simulation model according to the flight state of the aircraft;
step 1-2-2, collecting the interaction data between the deep reinforcement learning model and the aircraft simulation model and storing them in an experience pool;
step 1-2-3, improving the policy by which the deep reinforcement learning model outputs the bias term a_b by using the data in the experience pool.
Step 1-2-1, the deep reinforcement learning model outputs a bias term a_b to the aircraft simulation model according to the flight state;
in the present invention, the simulation model may adopt a semi-physical (hardware-in-the-loop) simulation platform, that is, the flight control system of the aircraft is physical hardware, comprising a flight control computer and an inertial measurement unit (accelerometers, gyroscopes and magnetometers), while the aircraft's GPS and target detection sensors (e.g., electro-optical pod, radar) and the flight environment (i.e., atmosphere, terrain, etc.) are completely virtual. In this way, at relatively low cost, the training environment approaches reality as far as possible, and the aircraft can use the data fed back by the virtual environment and the physical measurements for artificial-intelligence training.
The simulation model can also be in a complete virtual state, namely, the flight environment and the flight control system of the aircraft are both virtual.
In the invention, the closer the simulation model is to the real environment, the better the effect of the trained strategy model of the aircraft is.
According to a preferred embodiment of the invention, the current flight state s of the aircraft comprises the position and velocity vector of the aircraft at the current moment and the remaining flight-time error, and the currently observed state s of the aircraft is represented by the following equation (two):
s = (v, θ, x, y, t_d - τ)    (two)
where s represents the observed state of the aircraft, v represents the absolute velocity of the aircraft, θ represents the velocity direction, x represents the lateral spatial position of the aircraft, y represents the longitudinal spatial position of the aircraft, t_d - τ represents the remaining flight-time error, t_d represents the desired remaining flight time, and τ represents the actual remaining flight time.
In a further preferred embodiment, the aircraft's own position is obtained by a GPS positioning system and comprises the altitude and lateral position of the aircraft at the current moment;
the aircraft's own velocity vector is obtained by the inertial measurement unit and the magnetometer and comprises the speed and the velocity direction at the current moment;
the remaining flight-time error is the difference between the desired remaining flight time and the actual remaining flight time; the desired remaining flight time is set manually, and the actual remaining flight time τ is calculated by the following prediction function:
τ = (R/v)·(1 + (θ - λ)²/(2·(2N - 1)))
where θ represents the velocity direction, λ represents the line-of-sight angle, R represents the missile-target distance determined from the lateral position x and the longitudinal position y of the aircraft, v represents the speed, and N represents the navigation ratio.
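A sketch of how the observed state and the remaining flight time could be assembled is given below, under the assumptions that the target sits at the origin of the coordinate frame and that the standard proportional-navigation time-to-go estimate with an assumed navigation gain is used.

```python
import math

def time_to_go(x, y, v, theta, nav_gain=3.0):
    """Estimated remaining flight time tau for a target assumed fixed at the origin."""
    r = math.hypot(x, y)                      # missile-target distance
    lam = math.atan2(-y, -x)                  # line-of-sight angle from the aircraft to the origin
    lead = theta - lam                        # lead angle between velocity direction and line of sight
    return (r / v) * (1.0 + lead * lead / (2.0 * (2.0 * nav_gain - 1.0)))

def observation(v, theta, x, y, t_d):
    """Observed state s = (v, theta, x, y, t_d - tau) fed to the policy network."""
    return (v, theta, x, y, t_d - time_to_go(x, y, v, theta))
```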
In the invention, the simulation model can record the flight state, the bias term and the reward of the aircraft, and can feed them back to the deep reinforcement learning model for storage as a training data set.
Preferably, the interaction between the deep reinforcement learning model and the aircraft simulation model proceeds as follows: the deep reinforcement learning model outputs a bias term according to the current flight-state information; the aircraft executes the control instruction derived from the bias term, then transitions to the successor state (the flight state at the next moment) and gives a reward.
Step 1-2-2, collecting the interaction data between the deep reinforcement learning model and the aircraft simulation model and storing them in the experience pool.
According to a preferred embodiment of the invention, the data exchanged between the deep reinforcement learning model and the aircraft simulation model is the tuple (s_t, a_t, r_t),
wherein s_t represents the flight state of the aircraft at time t, a_t represents the bias term output by the deep reinforcement learning model at time t, and r_t represents the reward obtained after the aircraft executes the bias term a_t at time t.
In a further preferred embodiment, the interaction data are stored in the experience pool of each deep reinforcement learning model and used to improve the bias-term generation policy.
After the interaction data are stored in the experience pool, the aircraft updates its current state to the successor state.
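A minimal experience pool holding the (s_t, a_t, r_t) tuples could look like the following sketch; the class name and capacity handling are illustrative.

```python
from collections import deque

class ExperiencePool:
    """Fixed-capacity buffer for the (s_t, a_t, r_t) interaction tuples."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)   # capacity = N_s

    def add(self, s_t, a_t, r_t):
        self.buffer.append((s_t, a_t, r_t))

    def is_saturated(self):
        return len(self.buffer) == self.buffer.maxlen

    def clear(self):
        self.buffer.clear()                    # emptied after each network update
```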
According to the invention, the reward r_t given by the aircraft, which is used to compute the parameters for improving the bias-term generation policy, includes two constraints: the desired flight time and hitting the target. A flight-time reward is set according to the desired time, and a missile-target distance reward is set according to target hitting.
According to a preferred embodiment of the invention, in the flight-time reward, the closer the actual flight time is to the desired flight time, the greater the reward; the flight-time reward is designed as -(t_d - t_f)²,
wherein t_d is the desired flight time and t_f is the actual flight time.
The desired flight time is the flight time set manually in the actual application and differs according to the actual situation; the actual flight time is the real flight time of the aircraft in the actual application, obtained from the prediction function.
According to another preferred embodiment of the invention, in the missile-target distance reward, the aircraft should shorten the missile-target distance as quickly as possible: the smaller the distance, the greater the reward, which is designed as -R², where R represents the missile-target distance.
The missile-target distance is obtained from the absolute position according to the following formula:
R = sqrt(x² + y²)
where x and y are measured in real time by GPS.
In a further preferred embodiment, in order to prevent one reward from masking the other, the two rewards are normalized. In this application an exponential normalization is adopted, and the reward r_t given by the environment after the aircraft performs the action at time t is obtained according to the following formula (three):
r_t = exp(-(t_d - t_f)²/c_1) + exp(-R²/c_2)    (three)
wherein c_1 is the normalization parameter of the flight-time reward, set to the constant 100, and c_2 is the normalization parameter of the missile-target distance reward, set to the constant 10000.
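Under the assumption that the exponential normalization maps each raw reward x through exp(x/c), the reward could be computed as in the following sketch.

```python
import math

C1 = 100.0     # normalization parameter of the flight-time reward
C2 = 10000.0   # normalization parameter of the missile-target distance reward

def reward(t_d, t_f, r_distance):
    """Exponentially normalized reward combining the flight-time and distance terms.

    Assumes the raw rewards -(t_d - t_f)^2 and -R^2 are each mapped through
    exp(x / c) so that neither term masks the other."""
    time_term = math.exp(-((t_d - t_f) ** 2) / C1)
    dist_term = math.exp(-(r_distance ** 2) / C2)
    return time_term + dist_term
```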
Step 1-2-3, improving the policy by which the deep reinforcement learning model outputs the bias term a_b by using the data in the experience pool.
The deep reinforcement learning model adopting the proximal policy optimization algorithm comprises two different neural networks: a policy network and an evaluation network;
the policy network takes the flight state s as input and the bias term a_b as output;
the evaluation network takes the flight state s as input and the state value function V_π(s) of the state s as output; a framework diagram of the proximal policy optimization algorithm is shown in FIG. 3.
The state value function V_π(s) represents the potential value of the state s. The aim of improving the bias-term generation policy is to find a policy π that allows the deep reinforcement learning model to obtain the maximum total reward in an unknown environment; however, the total reward includes future reward values and cannot be computed directly, so the state value function V_π(s) is used to approximate it.
The policy π denotes the distribution of the bias term a_b in a given state s. Owing to the trial-and-error nature of reinforcement learning, the bias term a_b is usually not a deterministic value: the policy π takes the form of a normal distribution, π ~ N(μ, σ), and the probability density function of the bias term a_b is:
f(x) = (1/(σ·sqrt(2π)))·exp(-(x - μ)²/(2σ²))
wherein x represents a value randomly sampled from the probability distribution, μ represents the mean of the probability density function, and σ represents its standard deviation.
According to a preferred embodiment of the application, the policy network is a neural network comprising two identical fully connected layers as hidden layers. It outputs the intermediate variables μ and σ from the input flight state s, constructs the normal distribution N(μ, σ), samples from it randomly, and outputs the sampling result as the bias term a_b.
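A PyTorch sketch of such a policy network is given below; the hidden-layer width, the state dimension of 5 and the use of a softplus to keep σ positive are assumptions not fixed by the text.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PolicyNetwork(nn.Module):
    """Policy network: two identical fully connected hidden layers; outputs the
    mean mu and standard deviation sigma of the normal distribution of the bias term a_b."""

    def __init__(self, state_dim=5, hidden=64):
        super().__init__()
        self.hidden = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.mu_head = nn.Linear(hidden, 1)
        self.sigma_head = nn.Linear(hidden, 1)

    def forward(self, s):
        h = self.hidden(s)
        mu = self.mu_head(h)
        sigma = F.softplus(self.sigma_head(h)) + 1e-5   # keep sigma strictly positive
        return torch.distributions.Normal(mu, sigma)

# Sampling a bias term for one observed state s (a 5-element tuple):
# dist = policy(torch.as_tensor(s, dtype=torch.float32)); a_b = dist.sample()
```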
To improve the policy network, an advantage function A_t(s_t, a_t) is defined. When the advantage function is positive, the probability of the current behavior in the current state is increased; when the advantage function is negative, the probability of the current behavior in the current state is reduced.
The advantage function is obtained by:
A_t(s_t, a_t) = r_t + γ·r_{t+1} + γ²·r_{t+2} + … + γ^{k-1}·r_{t+k-1} + γ^k·V_π(s_{t+k}) - V_π(s_t)
wherein k is the number of rewards, V_π represents the state value function, r_t denotes the reward at time t, r_{t+1} the reward at time t+1, r_{t+2} the reward at time t+2, and so on up to r_{t+k-1}, the reward at time t+k-1; γ is the discount factor, set to the constant 0.99.
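The k-step advantage estimate above could be computed as in the following sketch.

```python
def k_step_advantage(rewards, v_s_t, v_s_t_plus_k, gamma=0.99):
    """A_t = r_t + gamma*r_{t+1} + ... + gamma^{k-1}*r_{t+k-1} + gamma^k * V(s_{t+k}) - V(s_t)."""
    ret = 0.0
    for i, r in enumerate(rewards):          # rewards = [r_t, ..., r_{t+k-1}]
        ret += (gamma ** i) * r
    ret += (gamma ** len(rewards)) * v_s_t_plus_k
    return ret - v_s_t
```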
The objective function of the policy network is:
J(ω) = (1/N_s)·Σ_{t=1}^{N_s} min( r_t(ω)·Â_t , clip(r_t(ω), 1-ε, 1+ε)·Â_t )
where ω represents the set of the weights w_1 and offsets b_1 in the policy network, ω = {w_1, b_1};
r_t(ω) represents the ratio between the improved policy and the old policy, clip is the clipping function, and N_s is the capacity of the experience pool;
r_t(ω) = π_ω(a_t|s_t) / π_{ω_old}(a_t|s_t)
clip(r_t(ω), 1-ε, 1+ε) = 1-ε if r_t(ω) < 1-ε; r_t(ω) if 1-ε ≤ r_t(ω) ≤ 1+ε; 1+ε if r_t(ω) > 1+ε
wherein ε is the clipping parameter that constrains the update amplitude of the policy network;
w_1 represents the weights of the fully connected layers in the policy network and b_1 represents their offsets;
Â_t represents the advantage function derived from the reward values generated by the old policy.
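A sketch of this clipped surrogate objective is given below; the clipping parameter value ε = 0.2 is an assumption (the text only states that ε constrains the update amplitude), and the loss is negated so that a gradient-descent optimizer maximizes the objective.

```python
import torch

def ppo_policy_loss(policy, old_log_probs, states, actions, advantages, eps=0.2):
    """Clipped surrogate objective, negated so a gradient-descent optimizer maximizes it."""
    dist = policy(states)                              # Normal(mu, sigma) for each state
    new_log_probs = dist.log_prob(actions).squeeze(-1)
    ratio = torch.exp(new_log_probs - old_log_probs)   # r_t(omega) = pi_new / pi_old
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return -torch.min(unclipped, clipped).mean()
```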
According to the application, the fully connected layer in the policy network has the following form:
l_j = ReLU(Σ_i(w_1·u + b_1)), with ReLU(x) = max(0, x)
wherein l_j represents the output of the fully connected layer and u represents its input.
According to a further preferred embodiment of the present application, the evaluation network is also a neural network comprising two identical fully connected layers as hidden layers. It is used to obtain the two state value functions V_π(s_t) and V_π(s_{t+k}) appearing in the advantage function A_t(s_t, a_t); its fully connected layer has the form:
l_j = ReLU(Σ_i(w_2·u + b_2)), with ReLU(x) = max(0, x)
wherein l_j denotes the output of the fully connected layer, u denotes its input, w_2 represents the weights of the fully connected layers in the evaluation network, and b_2 represents their offsets.
The set of weights and offsets in the evaluation network is defined as ξ, ξ = {w_2, b_2}, and the objective function of the evaluation network is
J(ξ) = (1/N_s)·Σ_{t=1}^{N_s} A_t(s_t, a_t)²
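A matching sketch of the evaluation network and its mean-squared-advantage objective is given below; the hidden-layer width and state dimension are the same assumptions as in the policy-network sketch.

```python
import torch.nn as nn

class ValueNetwork(nn.Module):
    """Evaluation network: two identical fully connected hidden layers, outputs V_pi(s)."""

    def __init__(self, state_dim=5, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, s):
        return self.net(s)

def critic_loss(value_net, states, targets):
    """Evaluation-network objective: mean squared advantage, A_t = target - V(s_t)."""
    advantages = targets - value_net(states).squeeze(-1)
    return (advantages ** 2).mean()
```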
When N = N_s, indicating that the experience pool is saturated, ω and ξ in the policy network and the evaluation network are updated according to the following equations:
ω_new = ω_old + α_ω·∇_ω J(ω)
ξ_new = ξ_old - α_ξ·∇_ξ J(ξ)
wherein α_ω and α_ξ respectively represent the parameter update rates of the policy network and the evaluation network, set manually, and ∇ represents the gradient of the function;
ω_new denotes ω updated after the experience pool saturates, and ω_old denotes ω at saturation of the experience pool;
ξ_new denotes ξ updated after the experience pool saturates, and ξ_old denotes ξ at saturation of the experience pool.
The sampling process of deep reinforcement learning is interactive: new samples must be generated through the simulated flight test while learning, i.e., sampling and learning proceed alternately. In the course of learning, the deep reinforcement learning model interacts with the aircraft for N_s steps using the old policy π_old, and the interaction time sequence generated by this process is stored in a buffer. When updating the policy network, the advantage function Â_t is estimated first; then the probability π_old(a_t|s_t) of the behaviors stored in the experience pool under the old policy is computed from the probability density function of the normal distribution. After the policy network generates the new policy, π_ω(a_t|s_t) is computed, the objective function is evaluated, its gradient with respect to ω is obtained by the gradient method, and the policy network is updated so that the objective function is maximized.
When the evaluation network is updated, the advantage function in its objective has already been obtained in the policy-network update stage and can be used directly. The loss function of the evaluation network is optimized by gradient descent, and the parameter ξ of the evaluation network is updated so that the loss function is minimized. After the two networks have been updated, the experience pool is emptied, and the learned new policy is then used to interact another N_s times; this learning process is repeated until the simulation test is completed.
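Putting the pieces together, a PPO-style training loop consistent with this description could look like the following sketch; it reuses the environment, network and loss sketches above, replaces the k-step target with a plain discounted return for brevity, and the batch size, learning rates and iteration count are assumed values.

```python
import torch

def train(env, policy, value_net, n_s=2048, iterations=1000, gamma=0.99,
          lr_policy=3e-4, lr_value=1e-3):
    """Interact n_s steps with the frozen old policy, update both networks, empty the pool, repeat."""
    opt_pi = torch.optim.Adam(policy.parameters(), lr=lr_policy)
    opt_v = torch.optim.Adam(value_net.parameters(), lr=lr_value)
    for _ in range(iterations):
        states, actions, rewards, old_logps = [], [], [], []
        s = env.reset()
        for _ in range(n_s):                                   # fill the experience pool
            st = torch.as_tensor(s, dtype=torch.float32)
            with torch.no_grad():
                dist = policy(st)
                a_b = dist.sample()
                logp = dist.log_prob(a_b).sum()
            s_next, r, done = env.step(float(a_b))
            states.append(st); actions.append(a_b); rewards.append(r); old_logps.append(logp)
            s = env.reset() if done else s_next
        S, A, LP = torch.stack(states), torch.stack(actions), torch.stack(old_logps)
        # discounted return as a stand-in for the k-step target (episode boundaries ignored for brevity)
        G, ret = [], 0.0
        for r in reversed(rewards):
            ret = r + gamma * ret
            G.insert(0, ret)
        G = torch.as_tensor(G, dtype=torch.float32)
        with torch.no_grad():
            adv = G - value_net(S).squeeze(-1)                 # advantage under the old value estimate
        opt_pi.zero_grad(); ppo_policy_loss(policy, LP, S, A, adv).backward(); opt_pi.step()
        opt_v.zero_grad(); critic_loss(value_net, S, G).backward(); opt_v.step()
```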
More preferably, when the rate of change of the moving average of r_t is less than 2%, the training is considered to have converged, the training of the multi-aircraft group is finished, and the obtained deep reinforcement learning model is saved. The learning curve of the deep reinforcement learning model after 100 training runs is shown in FIG. 4.
And 2, testing the deep reinforcement learning model.
When the fluctuation amplitude of the reward value is less than 2%, the model is saved and a simulation test is carried out; the test results are shown in FIG. 5.
In the figures, FIG. 5a shows the flight trajectory curve, FIG. 5b the remaining flight-time curve, FIG. 5c the remaining flight-time error curve, FIG. 5d the flight speed curve, FIG. 5e the guidance command curve, and FIG. 5f the bias term curve;
according to a preferred embodiment of the invention, the aircraft can arrive at the target position with different desired flight times after departing from the same initial flight conditions.
According to a preferred embodiment of the present invention, the control effect of the deep reinforcement learning model is determined according to the difference between the actual remaining flight time and the expected remaining flight time.
Preferably, in the experimental stage, when the difference between the actual remaining flight time and the desired remaining flight time is less than 1 s, the performance of the neural network model is considered to basically satisfy the application requirements, and the model can be used in the actual task execution process.
Step 3, when the aircraft flies, the bias term a_b is obtained with the tested deep reinforcement learning model.
Step 3 comprises the following substeps:
Step 3-1, the aircraft obtains the flight state,
wherein the flight state of the aircraft comprises the position and velocity vector of the aircraft and the remaining flight-time error.
Step 3-2, inputting the flight state into the tested deep reinforcement learning model, which outputs the bias term a_b.
In the invention, because the deep reinforcement learning model has learned the optimal behavior policy in the training stage and possesses a stable execution policy model, in the task execution stage it can output the bias term a_b from the flight state alone.
Step 3-3, obtaining a new guidance instruction a_m in the form of biased proportional guidance, and finally controlling the aircraft control system according to the guidance instruction a_m,
wherein
a_m = N·v·λ̇ + a_b
a_m represents the guidance command, N represents the navigation ratio, v represents the absolute velocity of the aircraft, λ represents the missile-target line-of-sight angle, λ̇ represents the rate of change of the line-of-sight angle, and a_b represents the bias term.
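At execution time the trained model can be queried deterministically, as in the following sketch; it reuses the `observation` helper and policy-network sketch above, and the navigation gain is again an assumed value.

```python
import torch

def guidance_step(policy, v, theta, x, y, t_d, lam_rate, nav_gain=3.0):
    """Online use of the trained model: observed state -> bias term a_b -> command a_m."""
    s = torch.as_tensor([observation(v, theta, x, y, t_d)], dtype=torch.float32)
    with torch.no_grad():
        a_b = float(policy(s).mean)          # deterministic bias term at execution time
    return nav_gain * v * lam_rate + a_b     # biased proportional-navigation command a_m
```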
In the aircraft time-collaborative guidance method based on deep reinforcement learning provided by the invention, the aircraft is launched under certain initial conditions in the training phase and different target times are set, so that the aircraft learns under as many of these conditions as possible and performs well in actual combat.
Examples of the experiments
A simulation test is carried out on the deep reinforcement learning model; in this embodiment the selected aircraft is a missile.
The fixed step size used by the simulation program in the simulated flight test is 0.1 s.
In the simulated flight test, the simulation program is run 1000 times and the deep reinforcement learning model is trained 30 times per run, for a total of about 30000 training iterations.
The dynamic model of the missile is
dv/dt = (P·cos α - X)/m - g·sin θ
dθ/dt = (P·sin α + L)/(m·v) - g·cos θ/v
dx/dt = v·cos θ
dy/dt = v·sin θ
dm/dt = -c
wherein v represents the magnitude of the velocity, θ represents the angle between the aircraft velocity vector and the horizontal plane, x represents the lateral spatial position of the aircraft, y represents the longitudinal spatial position of the aircraft, m represents the aircraft mass, P represents the engine thrust, α represents the angle of attack, X represents the drag, L represents the lift, c represents the fuel consumption per unit time, and g represents the gravitational acceleration.
the built deep reinforcement learning model comprises a strategy network and an evaluation network, wherein the strategy network and the evaluation network both use two same full connection layers as hidden layers, and the function of the full connection layers in the strategy network is l j =ReLU(∑ i (w 1 u+b 1 ) ReLU (x) = max (0,x);
w 1 weight representing the full connectivity layer in a policy network, b 1 Representing an offset of a full connectivity layer in the policy network;
the objective function of the policy network is
Figure BDA0002968482720000182
Evaluating the objective function of the network as
Figure BDA0002968482720000191
Wherein
Figure BDA0002968482720000192
Figure BDA0002968482720000193
ξ denotes the weight w in the evaluation network 2 And offset b 2 Set of xi = { w = 2 ,b 2 }
V π (s t ) And V π (s t+k ) Obtained by evaluating network estimation;
the function evaluating the full connection layer in the network is l j =ReLU(∑ i (w 2 u+b 2 ) ) = max (0,x)
When N = N_s, indicating that the experience pool is saturated, ω and ξ in the policy network and the evaluation network are updated according to the following equations:
ω_new = ω_old + α_ω·∇_ω J(ω)
ξ_new = ξ_old - α_ξ·∇_ξ J(ξ)
wherein α_ω and α_ξ respectively represent the parameter update rates of the policy network and the evaluation network, set manually, and ∇ represents the gradient of the function;
ω_new denotes ω updated after the experience pool saturates, and ω_old denotes ω at saturation of the experience pool;
ξ_new denotes ξ updated after the experience pool saturates, and ξ_old denotes ξ at saturation of the experience pool.
After training is finished, the converged deep reinforcement learning model is tested: 5 aircraft are launched at a speed of 200 m/s, with an initial lateral position of -20 km, an altitude of 20 km and an initial launch angle of 0°, and the desired flight times are set to 100 s, 120 s, 140 s, 160 s, 180 s and 200 s respectively. The results are shown in FIG. 5. As can be seen from FIG. 5, the remaining flight time controlled by the deep reinforcement learning model trained in this embodiment converges to the desired remaining flight time with a maximum error of no more than 1 s, which indicates that the deep reinforcement learning model fits the mapping relation between the missile flight state and the remaining flight time well.
The present invention has been described above in connection with preferred embodiments, but these embodiments are merely exemplary and illustrative. On this basis, various substitutions and modifications can be made to the invention, and they all fall within the protection scope of the invention.

Claims (7)

1. An aircraft time-collaborative guidance method based on deep reinforcement learning, wherein a deep reinforcement learning model outputs a bias term a_b, a new guidance instruction a_m is obtained in the form of biased proportional guidance, and the aircraft control system is finally controlled according to the guidance instruction a_m;
the guidance instruction a_m is obtained by the following formula (one):
a_m = N·v·λ̇ + a_b    (one)
wherein a_m represents the guidance command, N represents the navigation ratio, v represents the absolute velocity of the aircraft, λ represents the missile-target line-of-sight angle, λ̇ represents the rate of change of the line-of-sight angle, and a_b represents the bias term;
the bias term a b Obtained by the following steps:
step 1, designing a simulated flight test, and training to obtain a deep reinforcement learning model;
step 2, testing the deep reinforcement learning model;
step 3, when the aircraft flies, the bias item a is obtained by using the depth reinforcement learning model passing the test b Obtaining new guidance instruction a based on the form of bias proportion guidance m Finally according to the guidance instruction a m Controlling an aircraft control system;
in step 1, the deep reinforcement learning model learns through a near-end policy optimization (PPO);
the step 1 comprises the following substeps:
step 1-1, designing a simulated flight test according to an aircraft model;
step 1-2, designing the structure and parameters of a deep reinforcement learning model, and training to obtain the deep reinforcement learning model;
the step 1-1 comprises the following substeps:
1-1-1, acquiring aerodynamic parameters and reference area of the aircraft through a wind tunnel test of the aircraft;
1-1-2, designing an aircraft simulation model according to a motion differential equation set of an aircraft to obtain a flight state s of the aircraft;
1-1-3, taking an offset proportion guidance law as a guidance law, deploying interfaces of a deep reinforcement learning model and an aircraft simulation model, wherein the interfaces comprise an interface from an aircraft state to the deep reinforcement learning model, an interface from the deep reinforcement learning model to an offset item guided by the offset proportion, and an incentive value interface given by the aircraft during training of the deep reinforcement learning model.
2. The method of claim 1, wherein step 1-2 comprises the following substeps:
step 1-2-1, the deep reinforcement learning model outputs a bias term a_b to the aircraft simulation model according to the flight state of the aircraft;
step 1-2-2, collecting the interaction data between the deep reinforcement learning model and the aircraft simulation model and storing them in an experience pool;
step 1-2-3, improving the policy by which the deep reinforcement learning model outputs the bias term a_b by using the data in the experience pool.
3. The method of claim 2, wherein, in step 1-2-2, the interaction data between the deep reinforcement learning model and the aircraft simulation model is the tuple (s_t, a_t, r_t);
wherein s_t represents the flight state of the aircraft at time t; a_t represents the bias term output by the deep reinforcement learning model at time t; r_t represents the reward given by the environment after the aircraft executes the bias term a_t at time t.
4. The method of claim 3, wherein r_t is obtained according to the following formula:
r_t = exp(-(t_d - t_f)²/c_1) + exp(-R²/c_2)
wherein t_d represents the desired flight time, t_f represents the actual flight time, and R represents the missile-target distance;
c_1 is the normalization parameter of the flight-time reward, set to the constant 100; c_2 is the normalization parameter of the missile-target distance reward, set to the constant 10000.
5. The method of claim 2, wherein the deep reinforcement learning model comprises two different neural networks: a policy network and an evaluation network;
the policy network takes the flight state s as input and the bias term a_b as output;
the evaluation network takes the flight state s as input and the state value function V_π(s) of the state s as output;
the advantage function A_t(s_t, a_t) is used to improve the policy network and is obtained by:
A_t(s_t, a_t) = r_t + γ·r_{t+1} + γ²·r_{t+2} + … + γ^{k-1}·r_{t+k-1} + γ^k·V_π(s_{t+k}) - V_π(s_t)
wherein k is the number of rewards, V_π represents the state value function, r_t denotes the reward at time t, r_{t+1} the reward at time t+1, r_{t+2} the reward at time t+2, and so on up to r_{t+k-1}, the reward at time t+k-1; γ is the discount factor, set to the constant 0.99.
6. The method of claim 5, wherein the objective function of the policy network is:
J(ω) = (1/N_s)·Σ_{t=1}^{N_s} min( r_t(ω)·Â_t , clip(r_t(ω), 1-ε, 1+ε)·Â_t )
where ω represents the set of the weights w_1 and offsets b_1 in the policy network, ω = {w_1, b_1}; w_1 represents the weights of the fully connected layers in the policy network and b_1 represents their offsets;
r_t(ω) represents the ratio between the improved policy and the old policy,
r_t(ω) = π_ω(a_t|s_t) / π_{ω_old}(a_t|s_t);
clip is the clipping function,
clip(r_t(ω), 1-ε, 1+ε) = 1-ε if r_t(ω) < 1-ε; r_t(ω) if 1-ε ≤ r_t(ω) ≤ 1+ε; 1+ε if r_t(ω) > 1+ε;
ε is the clipping parameter that constrains the update amplitude of the policy network;
N_s is the capacity of the experience pool;
Â_t represents the advantage function derived from the reward values generated by the old policy;
the objective function of the evaluation network is
J(ξ) = (1/N_s)·Σ_{t=1}^{N_s} A_t(s_t, a_t)²
where ξ represents the set of the weights w_2 and offsets b_2 in the evaluation network, ξ = {w_2, b_2};
A_t(s_t, a_t) represents the advantage function in the evaluation network;
when the number of interactions N = N_s, indicating that the experience pool is saturated, ω and ξ are updated according to the following equations:
ω_new = ω_old + α_ω·∇_ω J(ω)
ξ_new = ξ_old - α_ξ·∇_ξ J(ξ)
wherein α_ω and α_ξ respectively represent the parameter update rates of the policy network and the evaluation network, and ∇ represents the gradient of the function;
ω_new denotes ω updated after the experience pool saturates, and ω_old denotes ω at saturation of the experience pool;
ξ_new denotes ξ updated after the experience pool saturates, and ξ_old denotes ξ at saturation of the experience pool.
7. The method of claim 1, wherein step 3 comprises the following substeps:
step 3-1, the aircraft obtains the flight state s;
step 3-2, inputting the flight state s into the tested deep reinforcement learning model, which outputs the bias term a_b;
step 3-3, obtaining a new guidance instruction a_m in the form of biased proportional guidance, and finally controlling the aircraft control system according to the guidance instruction a_m.
CN202110256808.6A 2021-03-09 2021-03-09 Aircraft time collaborative guidance method based on deep reinforcement learning Active CN115046433B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110256808.6A CN115046433B (en) 2021-03-09 2021-03-09 Aircraft time collaborative guidance method based on deep reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110256808.6A CN115046433B (en) 2021-03-09 2021-03-09 Aircraft time collaborative guidance method based on deep reinforcement learning

Publications (2)

Publication Number Publication Date
CN115046433A CN115046433A (en) 2022-09-13
CN115046433B true CN115046433B (en) 2023-04-07

Family

ID=83156606

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110256808.6A Active CN115046433B (en) 2021-03-09 2021-03-09 Aircraft time collaborative guidance method based on deep reinforcement learning

Country Status (1)

Country Link
CN (1) CN115046433B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117311374A (en) * 2023-09-08 2023-12-29 厦门渊亭信息科技有限公司 Aircraft control method based on reinforcement learning, terminal equipment and medium

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104007665A (en) * 2014-05-30 2014-08-27 北京航空航天大学 Flight simulation test system for solid-liquid power aircraft
CN108168381B (en) * 2018-01-04 2019-10-08 北京理工大学 A kind of control method of more pieces of guided missile cooperations
CN110488861B (en) * 2019-07-30 2020-08-28 北京邮电大学 Unmanned aerial vehicle track optimization method and device based on deep reinforcement learning and unmanned aerial vehicle
US11676064B2 (en) * 2019-08-16 2023-06-13 Mitsubishi Electric Research Laboratories, Inc. Constraint adaptor for reinforcement learning control
CN112100834A (en) * 2020-09-06 2020-12-18 西北工业大学 Underwater glider attitude control method based on deep reinforcement learning
CN112069605B (en) * 2020-11-10 2021-01-29 中国人民解放军国防科技大学 Proportional guidance law design method with attack time constraint
CN112198890B (en) * 2020-12-03 2021-04-13 中国科学院自动化研究所 Aircraft attitude control method, system and device based on reinforcement learning

Also Published As

Publication number Publication date
CN115046433A (en) 2022-09-13

Similar Documents

Publication Publication Date Title
CN111667513B (en) Unmanned aerial vehicle maneuvering target tracking method based on DDPG transfer learning
CN110806759B (en) Aircraft route tracking method based on deep reinforcement learning
CN113791634B (en) Multi-agent reinforcement learning-based multi-machine air combat decision method
CN113467508B (en) Multi-unmanned aerial vehicle intelligent cooperative decision-making method for trapping task
CN111538241B (en) Intelligent control method for horizontal track of stratospheric airship
CN112286218B (en) Aircraft large-attack-angle rock-and-roll suppression method based on depth certainty strategy gradient
CN113377121B (en) Aircraft intelligent disturbance rejection control method based on deep reinforcement learning
CN110187713A (en) A kind of longitudinally controlled method of hypersonic aircraft based on aerodynamic parameter on-line identification
CN111461294B (en) Intelligent aircraft brain cognitive learning method facing dynamic game
CN110673488A (en) Double DQN unmanned aerial vehicle concealed access method based on priority random sampling strategy
CN115033022A (en) DDPG unmanned aerial vehicle landing method based on expert experience and oriented to mobile platform
CN116627157B (en) Carrier rocket operation control method, device and equipment
CN113139331A (en) Air-to-air missile situation perception and decision method based on Bayesian network
CN115046433B (en) Aircraft time collaborative guidance method based on deep reinforcement learning
Kong et al. Hierarchical multi‐agent reinforcement learning for multi‐aircraft close‐range air combat
CN116697829A (en) Rocket landing guidance method and system based on deep reinforcement learning
CN115857530A (en) Decoupling-free attitude control method of aircraft based on TD3 multi-experience pool reinforcement learning
Zhu et al. Mastering air combat game with deep reinforcement learning
CN116432539A (en) Time consistency collaborative guidance method, system, equipment and medium
CN116611160A (en) Online real-time characteristic parameter identification and trajectory prediction method for uncontrolled aircraft based on measured trajectory parameters
de Celis et al. Neural network-based controller for terminal guidance applied in short-range rockets
CN115186378A (en) Real-time solution method for tactical control distance in air combat simulation environment
CN112278334A (en) Method for controlling the landing process of a rocket
CN113093803B (en) Unmanned aerial vehicle air combat motion control method based on E-SAC algorithm
CN117970952B (en) Unmanned aerial vehicle maneuver strategy offline modeling method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant