CN115046433B - Aircraft time collaborative guidance method based on deep reinforcement learning - Google Patents
Aircraft time collaborative guidance method based on deep reinforcement learning
- Publication number: CN115046433B (application CN202110256808.6A)
- Authority: CN (China)
- Prior art keywords: aircraft, deep reinforcement learning, learning model
- Legal status: Active
Classifications
- F42B15/00: Self-propelled projectiles or missiles, e.g. rockets; guided missiles
- F42B15/01: Arrangements thereon for guidance or control
- G06F30/20: Design optimisation, verification or simulation
- G06F30/27: Design optimisation, verification or simulation using machine learning, e.g. artificial intelligence, neural networks, support vector machines [SVM] or training a model
- G06N3/02: Neural networks
- G06N3/08: Learning methods
- Y02T90/00: Enabling technologies or technologies with a potential or indirect contribution to GHG emissions mitigation
Abstract
The invention discloses an aircraft time collaborative guidance method based on deep reinforcement learning. According to the flight state of the aircraft, a deep reinforcement learning model outputs a bias term a_b; a new guidance command a_m is obtained in the form of biased proportional navigation guidance, and the aircraft control system is finally controlled according to the guidance command a_m. The selected input states are the current speed, speed direction, position, and remaining-flight-time error, so the mapping relation is reasonable and fitting it with deep reinforcement learning is highly feasible.
Description
Technical Field
The invention relates to the technical field of aircraft, in particular to flight-time cooperation, and more particularly to an aircraft time-cooperative guidance method based on deep reinforcement learning.
Background
Aircraft (such as missiles) are the principal means of striking important strategic targets. In modern warfare, however, an adversary's defensive countermeasures are varied; in particular, ground- or ship-based platforms carry long-range interception weapons and close-in defense weapons, all of which pose a great threat to the aircraft.
Multi-missile cooperative strike is an efficient penetration measure: it can saturate the enemy's defense system and improve the penetration success rate. Flight-time cooperation is a feasible means of realizing a multi-missile cooperative strike and is currently achieved mainly in two ways: 1. coordinating the predicted arrival time of each missile through inter-missile communication; 2. setting equal expected arrival times for the missiles before launch. In either way, the remaining flight time of each missile must be controlled accurately. For this problem, most existing guidance laws are based on a constant-speed assumption and convert the problem into control of the remaining flight path. Although prediction accuracy can be improved by iterative calculation with differential equations, the computational burden is large and online prediction is difficult to achieve.
Multi-missile cooperative engagement decision-making requires a task model or environment model of the engagement environment, yet model uncertainty cannot be fully accounted for, and methods that build behavior models or behavior criteria artificially restrict the solution space of the behavior strategy, making it difficult to obtain the optimal strategy; such methods therefore cannot adapt to a dynamically changing multi-missile cooperative engagement environment. In addition, in a complex environment the dimensions of the environment and decision variables grow and the problem becomes more complex, so that multi-aircraft cooperative engagement decision-making either cannot adapt to the complex environment or becomes difficult to solve algorithmically.
Therefore, it is necessary to provide an aircraft time-cooperative guidance method that does not rely on the constant-velocity assumption and offers good control performance.
Disclosure of Invention
To overcome these problems, the inventors conducted intensive research and designed an aircraft time collaborative guidance method based on deep reinforcement learning. The method trains a deep reinforcement learning model on the aircraft's current speed, speed direction, position, and remaining-flight-time error, and uses the trained model to control the remaining flight time. The method removes the dependence on the constant-velocity assumption, provides good control performance, and can be applied to online guidance control, whereupon the present invention was completed.
Specifically, the invention provides an aircraft time collaborative guidance method based on deep reinforcement learning. According to the flight state of the aircraft, a deep reinforcement learning model outputs a bias term a_b; a new guidance command a_m is derived in the form of biased proportional navigation guidance; finally the aircraft control system is controlled according to the guidance command a_m.
The guidance command a_m is obtained by the following formula (1):
where a_m denotes the guidance command, v the absolute velocity of the aircraft, λ the missile-target line-of-sight angle, dλ/dt the line-of-sight angular rate, and a_b the bias term.
The bias term a_b is obtained by the following steps:
Step 1, designing a simulated flight test and training a deep reinforcement learning model;
Step 2, testing the deep reinforcement learning model;
Step 3, during flight, obtaining the bias term a_b with the tested deep reinforcement learning model, deriving the new guidance command a_m in the form of biased proportional navigation guidance, and finally controlling the aircraft control system according to the guidance command a_m.
In step 1, the deep reinforcement learning model is preferably trained by proximal policy optimization (PPO);
Preferably, step 1 comprises the following substeps:
Step 1-1, designing a simulated flight test according to an aircraft model;
Step 1-2, designing the structure and parameters of the deep reinforcement learning model and training it to obtain the deep reinforcement learning model.
Step 1-1 comprises the following substeps:
Step 1-1-1, acquiring the aerodynamic parameters and reference area of the aircraft through a wind-tunnel test of the aircraft;
Step 1-1-2, designing an aircraft simulation model according to the aircraft's system of differential equations of motion to obtain the flight state s of the aircraft;
Step 1-1-3, taking the biased proportional navigation guidance law as the guidance law, deploying interfaces between the deep reinforcement learning model and the aircraft simulation model, including an interface from the aircraft state to the deep reinforcement learning model, an interface from the deep reinforcement learning model to the bias term of the biased proportional navigation guidance, and an interface for the reward value given by the aircraft during training of the deep reinforcement learning model.
Step 1-2 comprises the following substeps:
Step 1-2-1, the deep reinforcement learning model outputs a bias term a_b to the aircraft simulation model according to the flight state;
Step 1-2-2, collecting the data of the interaction between the deep reinforcement learning model and the aircraft and storing them in an experience pool;
Step 1-2-3, improving the policy by which the deep reinforcement learning model outputs the bias term a_b, using the data in the experience pool.
In step 1-2-2, the interaction data between the deep reinforcement learning model and the aircraft simulation model is the tuple (s_t, a_t, r_t);
where s_t denotes the flight state of the aircraft at time t, a_t the bias term output by the deep reinforcement learning model at time t, and r_t the reward given after the aircraft executes the bias term a_t at time t;
r_t is obtained according to the following formula:
where t_d denotes the desired flight time, t_f the actual flight time, and R the missile-target distance;
c_1 is the normalization parameter of the flight-time reward, set to the constant 100; c_2 is the normalization parameter of the missile-target-distance reward, set to the constant 10000.
The deep reinforcement learning model comprises two different neural networks: a policy network and an evaluation network;
the policy network takes the flight state s as input and the bias term a_b as output;
the evaluation network takes the flight state s as input and the state-value function V^π(s) of state s as output;
where k is the number of rewards, V^π denotes the state-value function, r_t the reward at time t, r_{t+1} the reward at time t+1, r_{t+2} the reward at time t+2, and so on up to r_{t+k-1}, the reward at time t+k-1; γ is the discount factor, set to the constant 0.99.
The objective function of the policy network is:
where ω denotes the set of the weights w_1 and offsets b_1 in the policy network, ω = {w_1, b_1}; w_1 denotes the weights of the fully connected layers in the policy network and b_1 their offsets;
clip is the clipping function;
N_s is the capacity of the experience pool;
the objective function of the evaluation network is:
where ξ denotes the set of the weights w_2 and offsets b_2 in the evaluation network, ξ = {w_2, b_2};
A_t(s_t, a_t) denotes the advantage function in the evaluation network;
when the number of interactions N = N_s, indicating that the experience pool is full, ω and ξ are updated according to the following equations:
where α_ω and α_ξ denote the parameter update rates of the policy network and the evaluation network respectively, and ∇ denotes the gradient of a function;
ω_new denotes the updated ω after the experience pool is full, and ω_old denotes ω when the experience pool is full;
ξ_new denotes the updated ξ after the experience pool is full, and ξ_old denotes ξ when the experience pool is full.
Step 3 comprises the following substeps:
Step 3-1, the aircraft obtains the flight state s;
Step 3-2, the flight state s is input into the tested deep reinforcement learning model, which outputs the bias term a_b;
Step 3-3, the new guidance command a_m is obtained in the form of biased proportional navigation guidance, and finally the aircraft control system is controlled according to the guidance command a_m.
The invention has the following advantages:
(1) In the aircraft time collaborative guidance method based on deep reinforcement learning provided by the invention, the selected input states are the current speed, speed direction, position, and remaining-flight-time error; the mapping relation is reasonable, and fitting it with deep reinforcement learning is highly feasible;
(2) The method can use a deep reinforcement learning model to fit the relation between the guidance command and the remaining-flight-time error, and is a feasible way of realizing aircraft time-cooperative guidance;
(3) Compared with traditional cooperative guidance algorithms, the method uses simulation conditions closer to the real environment during training, removes the dependence on derivations based on the constant-velocity assumption, keeps the environment dynamically stationary for the aircraft during training so that distributed execution better matches the practical application scenario, achieves good control performance, and can be applied to online guidance control.
Drawings
FIG. 1 is a diagram illustrating the operation of a deep reinforcement learning model according to a preferred embodiment of the present invention;
FIG. 2 is a diagram illustrating deep reinforcement learning model training in accordance with a preferred embodiment of the present invention;
FIG. 3 is a schematic diagram illustrating the operation of a near-end policy optimization algorithm in accordance with a preferred embodiment of the present invention;
FIG. 4 illustrates a deep reinforcement learning model learning reward curve in accordance with a preferred embodiment of the present invention;
FIGS. 5a-f show test results of the deep reinforcement learning model in an embodiment of the present invention: the flight trajectory curve, the remaining-flight-time curve, the remaining-flight-time-error curve, the flight-speed curve, the guidance-command curve, and the bias-term curve;
Detailed Description
The present invention is described in further detail below with reference to the accompanying drawings and embodiments, from which its features and advantages will become more apparent. The word "exemplary" is used herein to mean "serving as an example, embodiment, or illustration"; any embodiment described as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments. Although various aspects of the embodiments are shown in the drawings, the drawings are not necessarily drawn to scale unless specifically indicated.
The invention provides an aircraft time collaborative guidance method based on deep reinforcement learning. According to the flight state of the aircraft, a deep reinforcement learning model outputs a bias term a_b; a new guidance command a_m is obtained in the form of biased proportional navigation guidance; finally the aircraft control system is controlled according to the guidance command a_m.
The guidance command a_m is obtained by the following formula (1):
where a_m denotes the guidance command, v the absolute velocity of the aircraft, λ the missile-target line-of-sight angle, dλ/dt the line-of-sight angular rate, and a_b the bias term.
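By way of illustration, a minimal sketch of assembling the guidance command from the model output is given below; the navigation gain N and the function name are assumptions, since formula (1) only names v, the line-of-sight rate, and the bias term a_b.

```python
# Hedged sketch of a biased proportional navigation command in the spirit of formula (1).
# The navigation gain N is an assumption; the patent names only v, the line-of-sight
# angular rate, and the bias term a_b produced by the deep RL model.

def guidance_command(v, lam_dot, a_b, nav_gain=3.0):
    """Return the lateral acceleration command a_m.

    v        : absolute velocity of the aircraft [m/s]
    lam_dot  : missile-target line-of-sight angular rate [rad/s]
    a_b      : bias term output by the deep reinforcement learning model [m/s^2]
    nav_gain : assumed navigation gain N of the proportional-navigation part
    """
    return nav_gain * v * lam_dot + a_b


if __name__ == "__main__":
    # Example: 300 m/s speed, small LOS rate, model-supplied bias of -2 m/s^2.
    print(guidance_command(v=300.0, lam_dot=0.01, a_b=-2.0))
```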
The bias term a_b is obtained by the following steps:
Step 1, designing a simulated flight test and training a deep reinforcement learning model;
Step 2, testing the deep reinforcement learning model;
Step 3, during flight, obtaining the bias term a_b with the tested deep reinforcement learning model, deriving the new guidance command a_m in the form of biased proportional navigation guidance, and finally controlling the aircraft control system according to the guidance command a_m.
The aircraft time collaborative guidance method based on deep reinforcement learning is further described as follows:
In step 1, the deep reinforcement learning model is preferably trained by proximal policy optimization (PPO), as shown in FIG. 2;
Preferably, step 1 comprises the following substeps:
Step 1-1, designing a simulated flight test according to an aircraft model;
Step 1-2, designing the structure and parameters of the deep reinforcement learning model and training it to obtain the deep reinforcement learning model.
Step 1-1 comprises the following substeps:
Step 1-1-1, acquiring the aerodynamic parameters and reference area of the aircraft through a wind-tunnel test of the aircraft.
The aerodynamic parameters comprise the lift coefficient, the induced drag coefficient, and the zero-lift drag coefficient. In the present invention, once the structure of the aircraft is fixed, its aerodynamic parameters are essentially determined. In actual flight, the aerodynamic parameters generally depend on the Mach number, angle of attack, and rudder deflection angle of the aircraft.
Preferably, the Mach number depends on the speed of sound at the aircraft's current altitude and is obtained as the aircraft's current speed divided by the speed of sound;
the onboard software of the aircraft comprises a navigation module, a guidance module, and a control module, and the altitude and current-speed information of the aircraft are obtained from the navigation module;
the angle of attack characterizes the incoming-flow direction of the air and is obtained from the navigation module of the aircraft;
the rudder deflection angle is obtained from the control module of the aircraft.
The speed of sound is obtained by interpolation of atmospheric data measured in advance, from which the Mach number is then obtained.
More preferably, the aerodynamic parameters corresponding to the current Mach number, angle of attack, and rudder deflection angle are obtained from the wind-tunnel experiments by interpolation.
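As an illustration of this interpolation step, the sketch below looks up an aerodynamic coefficient in a wind-tunnel table by linear interpolation over Mach number and angle of attack; the grid points and table values are invented placeholders, not data from the patent.

```python
# Hedged sketch: interpolation of a wind-tunnel coefficient table over (Mach, alpha).
# The grids and table values below are placeholders.
import numpy as np
from scipy.interpolate import RegularGridInterpolator

mach_grid = np.array([0.4, 0.8, 1.2, 2.0])                # Mach numbers tested in the tunnel
alpha_grid = np.deg2rad([-10.0, -5.0, 0.0, 5.0, 10.0])    # angles of attack [rad]

# Placeholder lift-coefficient table, shape (len(mach_grid), len(alpha_grid)).
cl_table = np.array([
    [-0.60, -0.30, 0.00, 0.30, 0.60],
    [-0.70, -0.35, 0.00, 0.35, 0.70],
    [-0.55, -0.28, 0.00, 0.28, 0.55],
    [-0.40, -0.20, 0.00, 0.20, 0.40],
])

cl_interp = RegularGridInterpolator((mach_grid, alpha_grid), cl_table,
                                    bounds_error=False, fill_value=None)

def lift_coefficient(mach, alpha_rad):
    """Interpolated c_L at the current Mach number and angle of attack."""
    return float(cl_interp([[mach, alpha_rad]])[0])

print(lift_coefficient(0.9, np.deg2rad(3.0)))
```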
In the invention, the state of the aircraft at the next moment is obtained from the following aerodynamic differential equations of the aircraft:
where v denotes the speed, θ the angle between the aircraft velocity vector and the horizontal plane, x the lateral spatial position of the aircraft, y the longitudinal spatial position of the aircraft, m the aircraft mass, P the engine thrust, α the angle of attack, X the drag, L the lift, and m_c the fuel consumption per unit time;
the lift and drag are related to the aerodynamic parameters as follows:
X = (c_{d0} + c_d) q S
L = c_L q S
where c_{d0} denotes the zero-lift drag coefficient, c_d the induced drag coefficient, c_L the lift coefficient, q the dynamic pressure, and S the reference area of the aircraft.
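Since the differential equations themselves are not reproduced here, the sketch below integrates a generic planar point-mass model using the variables named above (v, θ, x, y, m, P, α, drag X, lift L); the exact right-hand sides, including the gravity terms, are assumptions.

```python
# Hedged sketch of one fixed-step Euler integration of a planar point-mass model.
# The equation forms and the gravity terms are assumptions, not the patent's equations.
import math

G = 9.81  # gravitational acceleration [m/s^2], assumed

def step(state, thrust, alpha, drag, lift, fuel_rate, dt=0.1):
    """Advance (v, theta, x, y, m) by one step of dt seconds."""
    v, theta, x, y, m = state
    v_dot = (thrust * math.cos(alpha) - drag) / m - G * math.sin(theta)
    theta_dot = (thrust * math.sin(alpha) + lift) / (m * v) - G * math.cos(theta) / v
    x_dot = v * math.cos(theta)
    y_dot = v * math.sin(theta)
    m_dot = -fuel_rate
    return (v + v_dot * dt,
            theta + theta_dot * dt,
            x + x_dot * dt,
            y + y_dot * dt,
            m + m_dot * dt)

state = (200.0, 0.0, -20_000.0, 20_000.0, 500.0)   # example v, theta, x, y, m
state = step(state, thrust=0.0, alpha=0.05, drag=800.0, lift=4500.0, fuel_rate=0.0)
print(state)
```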
Step 1-1-2, designing an aircraft simulation model according to the aircraft's system of differential equations of motion, so as to obtain the flight state of the aircraft;
Step 1-1-3, taking the biased proportional navigation guidance law as the guidance law, deploying the interfaces between the deep reinforcement learning model and the aircraft simulation program, including an interface from the aircraft state to the deep reinforcement learning model, an interface from the deep reinforcement learning model to the bias term of the biased proportional navigation guidance, and an interface for the reward value given by the aircraft during training of the deep reinforcement learning model.
Step 1-2 comprises the following substeps:
Step 1-2-1, the deep reinforcement learning model outputs a bias term a_b to the aircraft simulation model according to the flight state of the aircraft;
Step 1-2-2, collecting the data of the interaction between the deep reinforcement learning model and the aircraft simulation model and storing them in an experience pool;
Step 1-2-3, improving the bias term a_b output by the deep reinforcement learning model using the data in the experience pool.
In step 1-2-1, the deep reinforcement learning model outputs the bias term a_b to the aircraft simulation model according to the flight state.
In the present invention, the simulation model may be a hardware-in-the-loop simulation platform: the flight control system of the aircraft is real hardware, comprising the flight control computer and the inertial measurement unit (accelerometers, gyroscopes, and magnetometers), while the aircraft's GPS, its target-detection sensors (e.g., electro-optical pod, radar), and the flight environment (atmosphere, terrain, etc.) are entirely virtual. In this way, at relatively low cost, the training environment is brought as close to reality as possible, and the aircraft can be trained using the data fed back by the virtual environment together with the physical measurements.
The simulation model can also be entirely virtual, i.e., both the flight environment and the flight control system of the aircraft are virtual.
In the invention, the closer the simulation model is to the real environment, the better the trained policy model of the aircraft performs.
According to a preferred embodiment of the invention, the current flight state s of the aircraft comprises the aircraft's position, velocity vector, and remaining-flight-time error at the current moment, and is represented by the following formula (2):
s = (v, θ, x, y, t_d - τ)    (2)
where s denotes the observed state of the aircraft, v the absolute velocity of the aircraft, θ the velocity direction, x the lateral spatial position of the aircraft, y the longitudinal spatial position of the aircraft, and t_d - τ the remaining-flight-time error, with t_d the desired remaining flight time and τ the actual remaining flight time.
In a further preferred embodiment, the own position of the aircraft is obtained by a GPS positioning system, said own position of the aircraft comprising the altitude and the lateral position of the aircraft at the current moment;
the self speed vector of the aircraft is obtained by an inertial measurement unit and a magnetometer, and the speed vector of the aircraft comprises the speed and the speed direction at the current moment;
the residual flight time error is the difference between the expected residual flight time and the actual residual flight time, the expected residual flight time is obtained by artificial setting, and the actual residual flight time is obtained by a prediction functionCalculating to obtain;
Where θ represents the velocity direction, λ represents the line-of-sight angle, x represents the lateral spatial position of the aircraft, and y represents the longitudinal spatial position of the aircraft.
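A small sketch of assembling the observation defined in formula (2) is given below; the time-to-go predictor used here (range over closing speed) is a placeholder, since the patent's prediction function is not reproduced.

```python
# Sketch of building the observation s = (v, theta, x, y, t_d - tau) from formula (2).
# The time-to-go predictor below is a placeholder for the patent's prediction function.
import math

def predicted_time_to_go(v, theta, x, y, x_target=0.0, y_target=0.0):
    rng = math.hypot(x_target - x, y_target - y)
    los = math.atan2(y_target - y, x_target - x)          # line-of-sight angle lambda
    closing_speed = max(v * math.cos(theta - los), 1e-3)  # avoid division by zero
    return rng / closing_speed

def observation(v, theta, x, y, t_desired):
    tau = predicted_time_to_go(v, theta, x, y)
    return (v, theta, x, y, t_desired - tau)

print(observation(v=200.0, theta=0.0, x=-20_000.0, y=20_000.0, t_desired=120.0))
```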
In the invention, the simulation model records the flight state, bias term, and reward of the aircraft and feeds them back to the deep reinforcement learning model for storage as the training data set.
Preferably, the interaction between the deep reinforcement learning model and the aircraft simulation model proceeds as follows: the deep reinforcement learning model outputs a bias term according to the current flight-state information; the aircraft executes the control command derived from the bias term, then transitions to the successor state (the flight state at the next moment) and a reward is given.
In step 1-2-2, the data of the interaction between the deep reinforcement learning model and the aircraft simulation model are collected and stored in an experience pool.
According to a preferred embodiment of the invention, the interaction data between the deep reinforcement learning model and the aircraft simulation model is the tuple (s_t, a_t, r_t),
where s_t denotes the flight state of the aircraft at time t, a_t the bias term output by the deep reinforcement learning model at time t, and r_t the reward obtained after the aircraft executes the bias term a_t at time t.
In a further preferred embodiment, the interaction data are stored in the experience pool of each deep reinforcement learning model and used to improve the bias-term generation policy.
After the interaction data are stored in the experience pool, the aircraft updates its current state to the successor state.
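The experience pool can be sketched as a simple buffer of (s_t, a_t, r_t) tuples that is drained once N_s interactions have been collected; the class and method names below are illustrative, not from the patent.

```python
# Illustrative experience pool holding (s_t, a_t, r_t) tuples until it is full.
# Names and structure are assumptions; the patent only specifies the tuple contents
# and that learning starts once the pool holds N_s interactions.
class ExperiencePool:
    def __init__(self, capacity):
        self.capacity = capacity            # N_s, the pool capacity
        self.transitions = []

    def store(self, state, bias_term, reward):
        self.transitions.append((state, bias_term, reward))

    def is_full(self):
        return len(self.transitions) >= self.capacity

    def drain(self):
        """Return all stored transitions and empty the pool for the next policy."""
        batch, self.transitions = self.transitions, []
        return batch

pool = ExperiencePool(capacity=2048)
pool.store((200.0, 0.0, -20e3, 20e3, 5.0), -1.2, -0.03)
print(pool.is_full(), len(pool.transitions))
```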
According to the invention, the reward r_t given by the aircraft, used to improve the parameters of the bias-term generation policy, contains two constraints, namely the desired time and hitting the target: a flight-time reward is set according to the desired time, and a missile-target-distance reward is set according to hitting the target.
According to a preferred embodiment of the invention, in the flight-time reward, the closer the actual flight time is to the desired flight time, the greater the reward; the flight-time reward is designed as -(t_d - t_f)^2,
where t_d is the desired flight time and t_f the actual flight time.
The desired flight time is set manually in practical applications and differs from case to case; the actual flight time is the time the aircraft actually flies in the practical application, obtained from the prediction function.
According to another preferred embodiment of the invention, in the missile-target-distance reward, the aircraft should shorten the missile-target distance as quickly as possible; the smaller the missile-target distance, the greater the reward, and the reward is designed as -R^2, where R denotes the missile-target distance.
The missile-target distance is obtained from the absolute positions according to the following formula,
where x and y are measured in real time by GPS.
In a further preferred embodiment, to prevent one reward from masking the other, the two rewards are normalized. This application adopts an exponential normalization, and the reward r_t given by the environment after the aircraft executes its action at time t is obtained according to the following formula (3),
where c_1 is the normalization parameter of the flight-time reward, set to the constant 100, and c_2 the normalization parameter of the missile-target-distance reward, set to the constant 10000.
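A hedged sketch of the normalized two-part reward follows: the time penalty -(t_d - t_f)^2 and the range penalty -R^2 are scaled by c_1 = 100 and c_2 = 10000 as stated above, but the exact exponential-normalization form of formula (3) is an assumption.

```python
# Hedged sketch of the per-step reward combining the flight-time term -(t_d - t_f)^2
# and the miss-distance term -R^2, normalized with c1 = 100 and c2 = 10000.
# The exponential normalization below is an assumed form of formula (3).
import math

C1 = 100.0      # normalization parameter of the flight-time reward
C2 = 10000.0    # normalization parameter of the missile-target-distance reward

def step_reward(t_desired, t_actual, target_range):
    time_term = math.exp(-((t_desired - t_actual) ** 2) / C1)
    range_term = math.exp(-(target_range ** 2) / C2)
    return time_term + range_term

print(step_reward(t_desired=120.0, t_actual=121.5, target_range=50.0))
```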
In step 1-2-3, the bias term a_b output by the deep reinforcement learning model is improved using the data in the experience pool.
The deep reinforcement learning model using the proximal policy optimization algorithm comprises two different neural networks: a policy network and an evaluation network;
the policy network takes the flight state s as input and the bias term a_b as output;
the evaluation network takes the flight state s as input and the state-value function V^π(s) of state s as output; a framework diagram of the proximal policy optimization algorithm is shown in FIG. 3.
The state-value function V^π(s) represents the potential value of the state s. The aim of improving the bias-term generation policy is to find a policy π that lets the deep reinforcement learning model obtain the maximum total reward in an unknown environment; however, the total reward includes future reward values and cannot be computed directly, so the state-value function V^π(s) is used to approximate it.
The policy denotes the bias term a_b taken in each state s. Owing to the trial-and-error nature of reinforcement learning, the bias term a_b is generally not a deterministic value. The policy π takes the form of a normal distribution, π ~ N(μ, σ), and the probability density function of the bias term a_b is:
f(x) = (1 / (σ·sqrt(2π))) · exp(-(x - μ)^2 / (2σ^2))
where x denotes a value randomly sampled from the probability distribution, μ the mean of the probability density function, and σ its standard deviation.
According to a preferred embodiment of the application, the policy network is a neural network with two identical fully connected layers as hidden layers; from the input flight state s it outputs the intermediate variables μ and σ, a normal distribution N(μ, σ) is then constructed, and a random sample from it is output as the bias term a_b.
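A minimal PyTorch-style sketch of this policy network is given below: two identical fully connected hidden layers producing μ and σ, from which the bias term is sampled. The hidden width (64) and the softplus used to keep σ positive are assumptions.

```python
# Sketch of the Gaussian policy network: two identical fully connected hidden layers,
# outputs mu and sigma, and the bias term a_b is a random sample from N(mu, sigma).
# The hidden width and the softplus that keeps sigma positive are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PolicyNetwork(nn.Module):
    def __init__(self, state_dim=5, hidden=64):
        super().__init__()
        self.fc1 = nn.Linear(state_dim, hidden)   # first hidden fully connected layer
        self.fc2 = nn.Linear(hidden, hidden)      # second identical hidden layer
        self.mu_head = nn.Linear(hidden, 1)
        self.sigma_head = nn.Linear(hidden, 1)

    def forward(self, s):
        h = F.relu(self.fc1(s))
        h = F.relu(self.fc2(h))
        mu = self.mu_head(h)
        sigma = F.softplus(self.sigma_head(h)) + 1e-5   # keep the std strictly positive
        return mu, sigma

    def sample_bias(self, s):
        mu, sigma = self.forward(s)
        dist = torch.distributions.Normal(mu, sigma)
        a_b = dist.sample()
        return a_b, dist.log_prob(a_b)

policy = PolicyNetwork()
state = torch.tensor([[200.0, 0.0, -20e3, 20e3, 5.0]])   # (v, theta, x, y, t_d - tau)
bias, logp = policy.sample_bias(state)
print(bias.item(), logp.item())
```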
To improve the policy network, an advantage function A_t(s_t, a_t) is defined. When the advantage function is positive, the probability of the current behavior in the current state is increased; when it is negative, that probability is decreased.
The advantage function is obtained by:
A_t(s_t, a_t) = r_t + γ·r_{t+1} + γ^2·r_{t+2} + … + γ^{k-1}·r_{t+k-1} + γ^k·V^π(s_{t+k}) - V^π(s_t)
where k is the number of rewards, V^π denotes the state-value function, r_t the reward at time t, r_{t+1} the reward at time t+1, r_{t+2} the reward at time t+2, and so on up to r_{t+k-1}, the reward at time t+k-1; γ is the discount factor, set to the constant 0.99.
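The k-step advantage described above can be computed as in the sketch below from a stored reward sequence and the evaluation network's value estimates, with γ = 0.99 as stated; the function name is illustrative.

```python
# Sketch of the k-step advantage
# A_t = r_t + gamma*r_{t+1} + ... + gamma^(k-1)*r_{t+k-1} + gamma^k * V(s_{t+k}) - V(s_t).
GAMMA = 0.99

def k_step_advantage(rewards, v_start, v_end):
    """rewards: list [r_t, ..., r_{t+k-1}]; v_start = V(s_t); v_end = V(s_{t+k})."""
    ret = 0.0
    for i, r in enumerate(rewards):
        ret += (GAMMA ** i) * r
    ret += (GAMMA ** len(rewards)) * v_end
    return ret - v_start

print(k_step_advantage([0.5, 0.4, 0.6], v_start=10.0, v_end=9.5))
```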
The objective function of the policy network is:
J(ω) = (1/N_s) · Σ_t min( r_t(ω)·A_t(s_t, a_t), clip(r_t(ω), 1-ε, 1+ε)·A_t(s_t, a_t) )
where ω denotes the set of the weights w_1 and offsets b_1 in the policy network, ω = {w_1, b_1};
r_t(ω) = π_ω(a_t|s_t) / π_old(a_t|s_t) denotes the ratio between the improved policy and the old policy, clip is the clipping function, and N_s is the capacity of the experience pool;
ε is a clipping parameter that constrains the update amplitude of the policy network;
w_1 denotes the weights of the fully connected layers in the policy network and b_1 their offsets.
According to the application, the fully connected layers in the policy network have the form:
l_j = ReLU( Σ_i (w_1 u + b_1) ),  ReLU(x) = max(0, x)
where l_j denotes the output of the fully connected layer and u its input.
According to a further preferred embodiment of the present application, the evaluation network is likewise a neural network with two identical fully connected layers as hidden layers, used to obtain the two state-value functions V^π(s_t) and V^π(s_{t+k}) needed by the advantage function; its fully connected layers have the form:
l_j = ReLU( Σ_i (w_2 u + b_2) ),  ReLU(x) = max(0, x)
where l_j denotes the output of the fully connected layer, u its input, w_2 the weights of the fully connected layers in the evaluation network, and b_2 their offsets.
The set of weights and offsets in the evaluation network is defined as ξ, ξ = {w_2, b_2}, and the objective function of the evaluation network is:
When N = N_s, indicating that the experience pool is full, ω and ξ in the policy network and the evaluation network are updated according to the following equations:
where α_ω and α_ξ denote the parameter update rates of the policy network and the evaluation network respectively, set manually, and ∇ denotes the gradient of a function;
ω_new denotes the updated ω after the experience pool is full, and ω_old denotes ω when the experience pool is full;
ξ_new denotes the updated ξ after the experience pool is full, and ξ_old denotes ξ when the experience pool is full.
The sampling process of deep reinforcement learning is interactive: new samples must be generated through the simulated flight test while learning, i.e., sampling and learning proceed together. During learning, the deep reinforcement learning model interacts with the aircraft N_s times under the old policy π_old, and the interaction sequences generated in this process are stored in a buffer. When the policy network is updated, the advantage function is first estimated; then the probability π_old(a_t|s_t) of the behaviors in the experience pool under the old policy is calculated from the probability density function of the normal distribution. After the policy network generates the new policy π, π_ω(a_t|s_t) is calculated, the objective function is evaluated, its gradient with respect to ω is obtained, and the policy network is updated with a gradient step so that the objective function is maximized.
When the evaluation network is updated, the advantage function in its objective has already been obtained in the policy-network update stage and can be used directly. The loss function of the evaluation network is optimized by the gradient descent method, and the parameter ξ of the evaluation network is updated so that the loss is minimized. After both networks are updated, the experience pool is emptied, and the learned new policy is then used to interact another N_s times; this learning process is repeated until the simulation test is completed.
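The update described in the two paragraphs above can be sketched as follows, reusing the PolicyNetwork from the earlier sketch and assuming a separate value network module and two optimizers (e.g., Adam); the clip parameter ε = 0.2 and the squared-error critic loss are assumptions beyond what the patent states.

```python
# Hedged sketch of one PPO-style update over the filled experience pool.
# Assumptions beyond the patent text: Adam-style optimizers, eps = 0.2, and a
# squared-error loss against the k-step returns for the evaluation network.
import torch

def ppo_update(policy_net, value_net, policy_opt, value_opt,
               states, actions, old_logp, k_step_returns, eps=0.2):
    """states, actions, old_logp, k_step_returns: tensors from the N_s stored interactions."""
    # Advantage A_t = k-step return - V(s_t); detached so it acts as a fixed weight.
    values = value_net(states).squeeze(-1)
    advantages = (k_step_returns - values).detach()

    # Clipped surrogate objective for the policy network (maximized, so minimize its negative).
    mu, sigma = policy_net(states)
    dist = torch.distributions.Normal(mu.squeeze(-1), sigma.squeeze(-1))
    new_logp = dist.log_prob(actions)
    ratio = torch.exp(new_logp - old_logp)              # pi_omega(a_t|s_t) / pi_old(a_t|s_t)
    surrogate = torch.min(ratio * advantages,
                          torch.clamp(ratio, 1 - eps, 1 + eps) * advantages)
    policy_loss = -surrogate.mean()
    policy_opt.zero_grad()
    policy_loss.backward()
    policy_opt.step()

    # Evaluation network: assumed squared-error loss against the k-step returns.
    value_loss = ((k_step_returns - value_net(states).squeeze(-1)) ** 2).mean()
    value_opt.zero_grad()
    value_loss.backward()
    value_opt.step()
    return policy_loss.item(), value_loss.item()
```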
More preferably, when the rate of change of the mean of r_t falls below 2%, convergence is assumed, training of the multi-aircraft group is finished, and the obtained deep reinforcement learning model is saved; the learning curve of the deep reinforcement learning model after 100 training runs is shown in FIG. 4.
In step 2, the deep reinforcement learning model is tested.
When the fluctuation amplitude of the reward value is less than 2%, the model is saved and a simulation test is carried out, and the test result is shown in fig. 5.
In the figure, a flight path curve is shown in fig. 5a, a residual flight time curve is shown in fig. 5b, a residual flight time error curve is shown in fig. 5c, a flight speed curve is shown in fig. 5d, a guidance instruction curve is shown in fig. 5e, and a bias term curve is shown in fig. 5 f;
according to a preferred embodiment of the invention, the aircraft can arrive at the target position with different desired flight times after departing from the same initial flight conditions.
According to a preferred embodiment of the present invention, the control effect of the deep reinforcement learning model is determined according to the difference between the actual remaining flight time and the expected remaining flight time.
Preferably, in the experimental stage, when the difference between the actual remaining flight time and the desired remaining flight time is less than 1 s, the performance of the neural network model is considered to basically meet the application requirements, and the model can be used for actual task execution.
In step 3, when the aircraft is in flight, the bias term a_b is obtained using the tested deep reinforcement learning model.
Step 3 comprises the following substeps:
Step 3-1, the aircraft obtains the flight state.
The flight state of the aircraft comprises the aircraft's position, velocity vector, and remaining-flight-time error.
Step 3-2, the flight state is input into the tested deep reinforcement learning model, which outputs the bias term a_b.
In the invention, because the deep reinforcement learning model has learned the optimal behavior policy in the training stage and possesses a stable execution policy model, in the task-execution stage it can output the bias term a_b from the flight state alone.
Step 3-3, the new guidance command a_m is obtained in the form of biased proportional navigation guidance, and finally the aircraft control system is controlled according to the guidance command a_m,
where a_m denotes the guidance command, v the absolute velocity of the aircraft, λ the missile-target line-of-sight angle, dλ/dt the line-of-sight angular rate, and a_b the bias term.
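In the task-execution stage, steps 3-1 to 3-3 reduce to a short loop: read the flight state, query the trained policy for a_b, and form a_m. The sketch below reuses the observation(), PolicyNetwork, and guidance_command() sketches introduced earlier; using the policy mean μ instead of a random sample at execution time is an assumption.

```python
# Sketch of the online guidance loop for steps 3-1 to 3-3, reusing the observation(),
# PolicyNetwork and guidance_command() sketches above.
import torch

def guidance_step(policy, nav_state, t_desired, v, lam_dot):
    """nav_state = (v, theta, x, y) from the navigation module."""
    s = observation(*nav_state, t_desired)                      # step 3-1: flight state s
    with torch.no_grad():
        mu, _sigma = policy(torch.tensor([s], dtype=torch.float32))
    a_b = mu.item()                                             # step 3-2: bias term a_b
    return guidance_command(v, lam_dot, a_b)                    # step 3-3: command a_m
```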
In the aircraft time collaborative guidance method based on deep reinforcement learning provided by the invention, the aircraft is launched in the training phase under given initial conditions with different target times, so that it learns under as many of these conditions as possible and performs well in actual combat.
Experimental Example
A simulation test is carried out on the deep reinforcement learning model; in this embodiment the selected aircraft is a missile.
The fixed step size used by the simulation program in the simulated flight test is 0.1 s.
In the simulated flight test, the simulation program is run 1000 times per training round and the deep reinforcement learning model is trained for 30 rounds, about 30000 runs in total.
The dynamics model of the missile is:
where v denotes the speed, θ the angle between the aircraft velocity vector and the horizontal plane, x the lateral spatial position of the aircraft, y the longitudinal spatial position of the aircraft, m the aircraft mass, P the engine thrust, α the angle of attack, X the drag, L the lift, and m_c the fuel consumption per unit time.
The constructed deep reinforcement learning model comprises a policy network and an evaluation network, each using two identical fully connected layers as hidden layers. The fully connected layers in the policy network have the form l_j = ReLU( Σ_i (w_1 u + b_1) ), with ReLU(x) = max(0, x);
w_1 denotes the weights of the fully connected layers in the policy network and b_1 their offsets.
The objective function of the policy network is J(ω), and the objective function of the evaluation network is J(ξ),
where ξ denotes the set of the weights w_2 and offsets b_2 in the evaluation network, ξ = {w_2, b_2};
V^π(s_t) and V^π(s_{t+k}) are obtained by evaluation-network estimation;
the fully connected layers in the evaluation network have the form l_j = ReLU( Σ_i (w_2 u + b_2) ), with ReLU(x) = max(0, x).
When N = N_s, indicating that the experience pool is full, ω and ξ in the policy network and the evaluation network are updated according to the following equations:
where α_ω and α_ξ denote the parameter update rates of the policy network and the evaluation network respectively, set manually, and ∇ denotes the gradient of a function;
ω_new denotes the updated ω after the experience pool is full, and ω_old denotes ω when the experience pool is full;
ξ_new denotes the updated ξ after the experience pool is full, and ξ_old denotes ξ when the experience pool is full.
After training is finished, the converged deep reinforcement learning model is tested: 5 aircraft are launched at a speed of 200 m/s, with an initial lateral position of -20 km, an altitude of 20 km, and an initial launch angle of 0°, and the desired flight times are set to 100 s, 120 s, 140 s, 160 s, 180 s, and 200 s respectively. The results are shown in FIG. 5: the remaining flight time controlled by the deep reinforcement learning model trained in this embodiment converges to the desired remaining flight time with a maximum error of no more than 1 s, indicating that the deep reinforcement learning model fits the mapping between the missile flight state and the remaining flight time well.
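For reference, the test sweep of this embodiment can be outlined as below; simulate_flight is a placeholder for the combination of the aircraft simulation model and the trained policy, not an interface defined by the patent.

```python
# Outline of the embodiment's test sweep: same initial conditions, several desired
# flight times, and a check that the terminal time error stays under 1 s.
# simulate_flight() is a placeholder for the aircraft simulation model.
def run_time_cooperation_test(policy, simulate_flight):
    initial_state = dict(v=200.0, theta=0.0, x=-20_000.0, y=20_000.0)  # 200 m/s, -20 km, 20 km, 0 deg
    for t_desired in (100.0, 120.0, 140.0, 160.0, 180.0, 200.0):
        t_actual = simulate_flight(policy, initial_state, t_desired)
        error = abs(t_actual - t_desired)
        print(f"t_d = {t_desired:5.1f} s  ->  |t_f - t_d| = {error:4.2f} s "
              f"({'ok' if error < 1.0 else 'exceeds 1 s'})")
```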
The present invention has been described above with reference to preferred embodiments, but these embodiments are merely exemplary and illustrative. Various substitutions and modifications may be made on this basis, and all such substitutions and modifications fall within the scope of protection of the invention.
Claims (7)
1. An aircraft time collaborative guidance method based on deep reinforcement learning, wherein a deep reinforcement learning model outputs a bias term a_b, a new guidance command a_m is obtained in the form of biased proportional navigation guidance, and finally the aircraft control system is controlled according to the guidance command a_m;
the guidance command a_m is obtained by the following formula (1):
where a_m denotes the guidance command, v the absolute velocity of the aircraft, λ the missile-target line-of-sight angle, dλ/dt the line-of-sight angular rate, and a_b the bias term;
the bias term a_b is obtained by the following steps:
step 1, designing a simulated flight test and training a deep reinforcement learning model;
step 2, testing the deep reinforcement learning model;
step 3, when the aircraft is in flight, obtaining the bias term a_b with the tested deep reinforcement learning model, obtaining the new guidance command a_m in the form of biased proportional navigation guidance, and finally controlling the aircraft control system according to the guidance command a_m;
in step 1, the deep reinforcement learning model is trained by proximal policy optimization (PPO);
step 1 comprises the following substeps:
step 1-1, designing a simulated flight test according to an aircraft model;
step 1-2, designing the structure and parameters of the deep reinforcement learning model and training it to obtain the deep reinforcement learning model;
step 1-1 comprises the following substeps:
step 1-1-1, acquiring the aerodynamic parameters and reference area of the aircraft through a wind-tunnel test of the aircraft;
step 1-1-2, designing an aircraft simulation model according to the aircraft's system of differential equations of motion to obtain the flight state s of the aircraft;
step 1-1-3, taking the biased proportional navigation guidance law as the guidance law, deploying interfaces between the deep reinforcement learning model and the aircraft simulation model, including an interface from the aircraft state to the deep reinforcement learning model, an interface from the deep reinforcement learning model to the bias term of the biased proportional navigation guidance, and an interface for the reward value given by the aircraft during training of the deep reinforcement learning model.
2. The method according to claim 1, wherein step 1-2 comprises the following substeps:
step 1-2-1, the deep reinforcement learning model outputs a bias term a_b to the aircraft simulation model according to the flight state of the aircraft;
step 1-2-2, collecting the data of the interaction between the deep reinforcement learning model and the aircraft simulation model and storing them in an experience pool;
step 1-2-3, improving the bias term a_b output by the deep reinforcement learning model using the data in the experience pool.
3. The method according to claim 2, wherein in step 1-2-2, the interaction data between the deep reinforcement learning model and the aircraft simulation model is the tuple (s_t, a_t, r_t);
where s_t denotes the flight state of the aircraft at time t, a_t the bias term output by the deep reinforcement learning model at time t, and r_t the reward given by the environment after the aircraft executes the bias term a_t at time t.
4. The method according to claim 3, wherein r_t is obtained according to the following formula:
where t_d denotes the desired flight time, t_f the actual flight time, and R the missile-target distance;
c_1 is the normalization parameter of the flight-time reward, set to the constant 100; c_2 is the normalization parameter of the missile-target-distance reward, set to the constant 10000.
5. The method according to claim 2, wherein the deep reinforcement learning model comprises two different neural networks: a policy network and an evaluation network;
the policy network takes the flight state s as input and the bias term a_b as output;
the evaluation network takes the flight state s as input and the state-value function V^π(s) of state s as output;
where k is the number of rewards, V^π denotes the state-value function, r_t the reward at time t, r_{t+1} the reward at time t+1, r_{t+2} the reward at time t+2, and so on up to r_{t+k-1}, the reward at time t+k-1; γ is the discount factor, set to the constant 0.99.
6. The method according to claim 5, wherein the objective function of the policy network is:
where ω denotes the set of the weights w_1 and offsets b_1 in the policy network, ω = {w_1, b_1}; w_1 denotes the weights of the fully connected layers in the policy network and b_1 their offsets;
r_t(ω) denotes the ratio between the improved policy and the old policy, and clip is the clipping function;
N_s is the capacity of the experience pool;
the objective function of the evaluation network is:
where ξ denotes the set of the weights w_2 and offsets b_2 in the evaluation network, ξ = {w_2, b_2};
A_t(s_t, a_t) denotes the advantage function in the evaluation network;
when the number of interactions N = N_s, indicating that the experience pool is full, ω and ξ are updated according to the following equations:
where α_ω and α_ξ denote the parameter update rates of the policy network and the evaluation network respectively, and ∇ denotes the gradient of a function;
ω_new denotes the updated ω after the experience pool is full, and ω_old denotes ω when the experience pool is full;
ξ_new denotes the updated ξ after the experience pool is full, and ξ_old denotes ξ when the experience pool is full.
7. The method according to claim 1, wherein step 3 comprises the following substeps:
step 3-1, the aircraft obtains the flight state s;
step 3-2, the flight state s is input into the tested deep reinforcement learning model, which outputs the bias term a_b;
step 3-3, the new guidance command a_m is obtained in the form of biased proportional navigation guidance, and finally the aircraft control system is controlled according to the guidance command a_m.
Priority Applications (1)

| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202110256808.6A | 2021-03-09 | 2021-03-09 | Aircraft time collaborative guidance method based on deep reinforcement learning (granted as CN115046433B) |

Applications Claiming Priority (1)

| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202110256808.6A | 2021-03-09 | 2021-03-09 | Aircraft time collaborative guidance method based on deep reinforcement learning (granted as CN115046433B) |
Publications (2)

| Publication Number | Publication Date |
|---|---|
| CN115046433A | 2022-09-13 |
| CN115046433B | 2023-04-07 |
Family ID: 83156606

Family Applications (1)

| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202110256808.6A (Active) | CN115046433B | 2021-03-09 | 2021-03-09 |

Country Status (1)

| Country | Link |
|---|---|
| CN | CN115046433B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117311374A (en) * | 2023-09-08 | 2023-12-29 | 厦门渊亭信息科技有限公司 | Aircraft control method based on reinforcement learning, terminal equipment and medium |
Family Cites Families (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104007665A (en) * | 2014-05-30 | 2014-08-27 | 北京航空航天大学 | Flight simulation test system for solid-liquid power aircraft |
CN108168381B (en) * | 2018-01-04 | 2019-10-08 | 北京理工大学 | A kind of control method of more pieces of guided missile cooperations |
CN110488861B (en) * | 2019-07-30 | 2020-08-28 | 北京邮电大学 | Unmanned aerial vehicle track optimization method and device based on deep reinforcement learning and unmanned aerial vehicle |
US11676064B2 (en) * | 2019-08-16 | 2023-06-13 | Mitsubishi Electric Research Laboratories, Inc. | Constraint adaptor for reinforcement learning control |
CN112100834A (en) * | 2020-09-06 | 2020-12-18 | 西北工业大学 | Underwater glider attitude control method based on deep reinforcement learning |
CN112069605B (en) * | 2020-11-10 | 2021-01-29 | 中国人民解放军国防科技大学 | Proportional guidance law design method with attack time constraint |
CN112198890B (en) * | 2020-12-03 | 2021-04-13 | 中国科学院自动化研究所 | Aircraft attitude control method, system and device based on reinforcement learning |
2021-03-09: application CN202110256808.6A filed; granted as CN115046433B (Active).
Also Published As

| Publication number | Publication date |
|---|---|
| CN115046433A (en) | 2022-09-13 |
Legal Events

| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | GR01 | Patent grant | |