CN114489107B - Aircraft twin-delayed deep deterministic policy gradient attitude control method - Google Patents

Aircraft twin-delayed deep deterministic policy gradient attitude control method

Info

Publication number
CN114489107B
CN114489107B (application CN202210113006.4A)
Authority
CN
China
Prior art keywords
network
aircraft
reinforcement learning
target
representing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210113006.4A
Other languages
Chinese (zh)
Other versions
CN114489107A (en)
Inventor
韦常柱
朱光楠
刘哲
浦甲伦
徐世昊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin Zhuyu Aerospace Technology Co ltd
Original Assignee
Harbin Zhuyu Aerospace Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin Zhuyu Aerospace Technology Co ltd filed Critical Harbin Zhuyu Aerospace Technology Co ltd
Priority to CN202210113006.4A
Publication of CN114489107A
Application granted
Publication of CN114489107B
Active legal status
Anticipated expiration

Classifications

    • G - PHYSICS
    • G05 - CONTROLLING; REGULATING
    • G05D - SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D1/00 - Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
    • G05D1/08 - Control of attitude, i.e. control of roll, pitch, or yaw
    • G05D1/0808 - Control of attitude specially adapted for aircraft
    • G05D1/0816 - Control of attitude specially adapted for aircraft to ensure stability
    • G05D1/0833 - Control of attitude specially adapted for aircraft to ensure stability using limited authority control
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T - CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T90/00 - Enabling technologies or technologies with a potential or indirect contribution to GHG emissions mitigation

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Security & Cryptography (AREA)
  • Aviation & Aerospace Engineering (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Remote Sensing (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Automation & Control Theory (AREA)
  • Feedback Control In General (AREA)

Abstract

An aircraft twin-delayed deep deterministic policy gradient (TD3) attitude control method, belonging to the technical field of aircraft control. The method comprises the following steps: establishing an aircraft dynamics model and encapsulating it to form a reinforcement learning environment; initializing the reinforcement learning interaction environment, the agent and the maximum number of steps; obtaining the aircraft control quantities as the action; calculating the reward value and the next observation corresponding to the action, combining them into an experience sample and recording it in the experience replay buffer; adjusting the agent parameters to complete one round of reinforcement learning; and outputting the aircraft control quantities, namely the fuel-air mixture ratio and the elevator deflection angle. The invention is a high-precision adaptive intelligent aircraft control method: reinforcement learning with the twin-delayed deep deterministic policy gradient method realizes the design of an optimal attitude controller that depends only weakly on the model; only a basic model of the aircraft is required, and the parameters of the model need not all be given accurately, thereby reducing the dependence of the control-system design on the model.

Description

Aircraft twin-delayed deep deterministic policy gradient attitude control method
Technical Field
The invention relates to an aircraft twin-delayed deep deterministic policy gradient (TD3) attitude control method, belonging to the technical field of aircraft control.
Background
Because of strong parameter uncertainty and coupling, strong model nonlinearity, complex disturbances and other problems, it is difficult to establish an accurate control model of the aircraft, whereas traditional controller design methods rely on a relatively accurate control model. A control method whose design process depends only weakly on an accurate aircraft model therefore needs to be developed.
Disclosure of Invention
In order to solve the problems in the background art, the invention provides an aircraft twin-delayed deep deterministic policy gradient attitude control method.
The invention adopts the following technical scheme: an aircraft twin-delayed deep deterministic policy gradient attitude control method, the method comprising the following steps:
S1: establishing an aircraft dynamics model and encapsulating it to form a reinforcement learning environment;
S2: initializing the reinforcement learning interaction environment, the agent and the maximum number of steps;
S3: obtaining the aircraft control quantities as the action; calculating the reward value and the next observation corresponding to the action, and storing the experience data in the experience replay buffer;
S4: randomly sampling experience data from the experience replay buffer and adjusting the agent parameters based on the twin-delayed deep deterministic policy gradient algorithm to complete one round of reinforcement learning;
if the accumulated number of reinforcement learning rounds has not reached the maximum number of steps defined in S2, returning to S3; otherwise, ending the reinforcement learning;
S5: after the reinforcement learning ends, saving the agent and storing the Actor network for use as an adaptive controller; given the stacked observation as input, the adaptive controller outputs the aircraft control quantities, namely the fuel-air mixture ratio and the elevator deflection angle.
Compared with the prior art, the invention has the following beneficial effects:
the invention is a high-precision adaptive intelligent aircraft control method: reinforcement learning with the twin-delayed deep deterministic policy gradient method realizes the design of an optimal attitude controller that depends only weakly on the model; only a basic model of the aircraft is required, and the parameters of the model need not all be given accurately, thereby reducing the dependence of the control-system design on the model.
Drawings
FIG. 1 is a design flow diagram of the present invention;
FIG. 2 is a flow chart of reinforcement learning according to the present invention.
Detailed Description
The technical solutions of the present invention will be described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, rather than all, of the embodiments of the invention; all other embodiments obtained by those skilled in the art without creative work on the basis of these embodiments fall within the protection scope of the present invention.
An aircraft twin-delayed deep deterministic policy gradient attitude control method, the method comprising the following steps:
S1: establishing an aircraft dynamics model based on the aircraft design parameters and wind-tunnel data, and encapsulating it to form a reinforcement learning environment;
S101: the aircraft dynamics model is established as:
dV/dt = (T cos α − D)/m − g sin γ + d_V
dγ/dt = (T sin α + L)/(mV) − (g/V) cos γ + d_γ
dθ/dt = ω_z
dω_z/dt = M/I_yy + d_Q          (1)
in formula (1):
V represents the speed;
α represents the angle of attack (α = θ − γ);
g represents the gravitational acceleration;
γ represents the flight-path angle (track inclination angle);
θ represents the pitch angle;
ω_z represents the pitch angular velocity;
m represents the aircraft mass;
I_yy represents the pitch-channel moment of inertia;
T = C_{T,φ}(α)φ + C_T(α) represents the thrust, wherein: φ denotes the fuel-air mixture ratio, C_{T,φ}(α) denotes the coefficient between T and φ, and C_T(α) denotes the coefficient between T and α, both obtained from wind-tunnel tests;
D = qSC_D represents the drag, wherein: q denotes the dynamic pressure, S the aircraft reference area, and C_D the drag coefficient, obtained from wind-tunnel tests;
L = qSC_L represents the lift, wherein: C_L denotes the lift coefficient, obtained from wind-tunnel tests;
M = z_T T + qScC_M represents the pitching moment, wherein: z_T denotes the thrust moment-arm length, c the mean aerodynamic chord length, and C_M the pitching-moment coefficient, obtained from wind-tunnel tests;
d_V, d_γ, d_Q represent the uncertainties in the model caused by parameter uncertainty;
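Purely as an illustration (not part of the patent disclosure), a minimal Python sketch of one integration step of the longitudinal model (1) is given below; the function name, the explicit Euler scheme and the coeffs dictionary of wind-tunnel-derived coefficient functions are assumptions made for this example, and the uncertainty terms d_V, d_γ, d_Q are omitted.

    import numpy as np

    def longitudinal_step(state, action, coeffs, dt=0.01):
        """One explicit-Euler step of the longitudinal model (1).

        state  : [V, gamma, theta, omega_z]  (speed, flight-path angle, pitch angle, pitch rate)
        action : [phi, delta_e]              (fuel-air mixture ratio, elevator deflection)
        coeffs : dictionary of vehicle data and coefficient functions standing in for the
                 wind-tunnel tables C_T_phi, C_T, C_D, C_L, C_M (hypothetical placeholders).
        """
        V, gamma, theta, omega_z = state
        phi, delta_e = action
        alpha = theta - gamma                                   # angle of attack
        m, I_yy, S, c, z_T, g, rho = (coeffs[k] for k in ("m", "I_yy", "S", "c", "z_T", "g", "rho"))

        q = 0.5 * rho * V ** 2                                  # dynamic pressure
        T = coeffs["C_T_phi"](alpha) * phi + coeffs["C_T"](alpha)        # thrust
        D = q * S * coeffs["C_D"](alpha, delta_e)                        # drag
        L = q * S * coeffs["C_L"](alpha, delta_e)                        # lift
        M = z_T * T + q * S * c * coeffs["C_M"](alpha, delta_e)          # pitching moment

        dV = (T * np.cos(alpha) - D) / m - g * np.sin(gamma)
        dgamma = (T * np.sin(alpha) + L) / (m * V) - g * np.cos(gamma) / V
        dtheta = omega_z
        domega = M / I_yy
        return np.asarray(state, dtype=float) + dt * np.array([dV, dgamma, dtheta, domega])

In practice the entries of coeffs would be interpolators built from the wind-tunnel tables.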
s102: and compiling the aircraft dynamics model and a resolving program thereof into a C language program, compiling to form a dynamic link library file, and forming a reinforcement learning environment.
S2: initializing the reinforcement learning interaction environment, the agent and the maximum number of steps;
S201: the reinforcement learning interaction environment is defined by three items: the stacked observation o_T, the action a_T and the reward function, defined as follows:
the observation at each simulation time step t is o_t = {V, γ, θ, Q}, wherein: V represents the speed; γ represents the flight-path angle; θ represents the pitch angle; Q represents the attitude angular rate;
the stacked observation o_T = {o_{t-3}, o_{t-2}, o_{t-1}, o_t} is the superposition of four consecutive time-step observations, where t-3, t-2 and t-1 denote the three, two and one time steps preceding the simulation time step t;
the action is a_T = {φ, δ_e}, wherein: φ represents the fuel-air mixture ratio; δ_e represents the elevator deflection angle;
the reward function is r_T = r_1 + r_2, wherein: r_1 = λ_1(V − V_r)² + λ_2(γ − γ_r)² is the reward term related to the speed and flight-path-angle control errors, V_r is the speed command, γ_r is the flight-path-angle command, and λ_1, λ_2 are set negative so as to penalize the speed and flight-path-angle control errors; the term r_2 is designed to give a reward when the speed and flight-path-angle control errors are small:
if |V − V_r| < ε_1 and |γ − γ_r| < ε_2, where ε_1, ε_2 represent the desired control accuracies, then r_2 = P, with P > 0 being the reward value for reaching the desired speed and flight-path-angle control accuracy; otherwise r_2 = 0;
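As a hedged illustration of S201, the stacked observation and the reward could be computed as follows in Python; the concrete values of λ_1, λ_2, ε_1, ε_2 and P are not specified in the patent and are chosen here only for the example.

    from collections import deque
    import numpy as np

    N_STACK = 4                                   # o_T is the superposition of four observations
    history = deque(maxlen=N_STACK)

    def stacked_observation(o_t):
        """o_t = [V, gamma, theta, Q]; returns o_T = {o_{t-3}, o_{t-2}, o_{t-1}, o_t}."""
        if not history:                           # pad with the first observation at start-up
            history.extend([o_t] * N_STACK)
        history.append(o_t)
        return np.concatenate(list(history))

    def reward(V, gamma, V_r, gamma_r,
               lam1=-1.0, lam2=-100.0, eps1=1.0, eps2=0.01, P=10.0):
        """r_T = r_1 + r_2 of S201; all numerical parameters here are illustrative."""
        r1 = lam1 * (V - V_r) ** 2 + lam2 * (gamma - gamma_r) ** 2
        r2 = P if (abs(V - V_r) < eps1 and abs(gamma - gamma_r) < eps2) else 0.0
        return r1 + r2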
S202: the reinforcement learning agent comprises six neural networks, namely: the Actor network μ(o_T), the target Actor network μ_t(o_T), Critic network 1 Q_1(o_T, a_T; θ^{Q_1}), Critic network 2 Q_2(o_T, a_T; θ^{Q_2}), target Critic network 1 Q_1'(o_T, a_T; θ^{Q_1'}) and target Critic network 2 Q_2'(o_T, a_T; θ^{Q_2'}), wherein:
the input of the Actor network is the stacked observation o_T and its output is the action a_T;
the inputs of Critic network 1 and Critic network 2 are the stacked observation o_T and the action a_T, and their output is the expected value of the cumulative reward obtained after the agent takes that action;
the Actor network has the same structure as the target Actor network, Critic network 1 has the same structure as target Critic network 1, and Critic network 2 has the same structure as target Critic network 2; the parameters of each neural network are initialized randomly, and each target network is initialized with the same parameters as its corresponding network, i.e.
θ^{μ'} = θ^μ, θ^{Q_1'} = θ^{Q_1}, θ^{Q_2'} = θ^{Q_2},
wherein:
θ^μ is the parameter set of the Actor network;
θ^{μ'} is the parameter set of the target Actor network;
θ^{Q_1} is the parameter set of Critic network 1;
θ^{Q_1'} is the parameter set of target Critic network 1;
θ^{Q_2} is the parameter set of Critic network 2;
θ^{Q_2'} is the parameter set of target Critic network 2;
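A possible PyTorch realization of the six networks of S202 is sketched below; the hidden-layer sizes, activations and the tanh output scaling are assumptions, since the patent does not fix a network architecture.

    import copy
    import torch
    import torch.nn as nn

    OBS_DIM, ACT_DIM = 4 * 4, 2        # four stacked 4-dimensional observations; a_T = {phi, delta_e}

    class Actor(nn.Module):
        def __init__(self):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(OBS_DIM, 128), nn.ReLU(),
                                     nn.Linear(128, 128), nn.ReLU(),
                                     nn.Linear(128, ACT_DIM), nn.Tanh())
        def forward(self, o):
            return self.net(o)

    class Critic(nn.Module):
        def __init__(self):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(OBS_DIM + ACT_DIM, 128), nn.ReLU(),
                                     nn.Linear(128, 128), nn.ReLU(),
                                     nn.Linear(128, 1))      # expected cumulative reward
        def forward(self, o, a):
            return self.net(torch.cat([o, a], dim=-1))

    actor, critic1, critic2 = Actor(), Critic(), Critic()
    # Each target network starts from the same (randomly initialized) parameters, as in S202.
    actor_t, critic1_t, critic2_t = map(copy.deepcopy, (actor, critic1, critic2))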
S203: setting the maximum number of reinforcement learning steps to N_step.
S3: in each simulation time step, the collected aircraft speed, flight-path angle, attitude angle and attitude angular rate are taken as the observation and input to the agent to obtain the aircraft control quantities as the action; the reward value and the next observation corresponding to the action are calculated, and the observation, action, reward value and next observation of the simulation time step are combined into an experience sample and stored in the experience replay buffer;
S301: from the aircraft speed V_t, flight-path angle γ_t, attitude angle θ_t and attitude angular rate Q_t collected at each simulation time step t, the stacked observation o_T is formed as defined in S201;
S302: the stacked observation o_T obtained in S301 is input to the Actor network, and random noise N is superimposed on the network output to obtain the action a_T = {φ, δ_e};
according to the physical limits of the aircraft, φ is clipped to φ_min ≤ φ ≤ φ_max and δ_e is clipped to δ_emin ≤ δ_e ≤ δ_emax, wherein: φ_min is the minimum fuel-air mixture ratio; φ_max is the maximum fuel-air mixture ratio; δ_emin is the minimum elevator deflection angle; δ_emax is the maximum elevator deflection angle;
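The exploration step of S302 (Actor output plus random noise, clipped to the physical limits) might look as follows; the noise scale and the numerical limits are illustrative assumptions, and actor is any network with the interface of the earlier sketch.

    import numpy as np
    import torch

    PHI_MIN, PHI_MAX = 0.05, 1.0            # illustrative fuel-air mixture ratio limits
    DE_MIN, DE_MAX = -0.35, 0.35            # illustrative elevator deflection limits (rad)

    def select_action(actor, o_T, noise_std=0.1):
        """Actor output plus exploration noise N, clipped to the physical limits (S302)."""
        with torch.no_grad():
            a = actor(torch.as_tensor(o_T, dtype=torch.float32)).numpy()
        a = a + np.random.normal(0.0, noise_std, size=a.shape)
        return np.array([np.clip(a[0], PHI_MIN, PHI_MAX),
                         np.clip(a[1], DE_MIN, DE_MAX)])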
S303: the action a_T is input to the environment of S102 to obtain the observation o_{t+1} of the next simulation time step, and the reward r_T and the next stacked observation o_{T+1} are computed according to the definitions in S201;
S304: the quadruple formed by the stacked observation o_T, the action a_T, the next stacked observation o_{T+1} and the reward r_T is stored in the experience replay buffer as one experience sample.
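One straightforward way to realize the experience replay buffer of S303-S304 is sketched below; the capacity and batch size M are illustrative choices.

    import random
    from collections import deque

    class ReplayBuffer:
        """Stores the quadruples (o_T, a_T, r_T, o_{T+1}) of S304 and samples mini-batches."""
        def __init__(self, capacity=100_000):
            self.buf = deque(maxlen=capacity)
        def store(self, o_T, a_T, r_T, o_T1):
            self.buf.append((o_T, a_T, r_T, o_T1))
        def sample(self, M=64):
            return random.sample(self.buf, M)        # the batch B of M quadruples used in S4
        def __len__(self):
            return len(self.buf)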
S4: when the experience replay buffer contains the specified number of samples, experience data are randomly sampled from it and the agent parameters are adjusted based on the twin-delayed deep deterministic policy gradient algorithm, completing one round of reinforcement learning;
if the accumulated number of reinforcement learning rounds has not reached the maximum number of steps defined in S2, return to S3; otherwise, end the reinforcement learning;
S401: M quadruples are randomly sampled from the experience replay buffer; the batch is denoted B and its i-th quadruple B_i, 1 ≤ i ≤ M;
S402: the next stacked observation o_{T+1} of B_i is input to the target Actor network, and clipped random noise is superimposed on its output to obtain the target action ã_i; the components of ã_i are clipped to φ_min ≤ φ ≤ φ_max and δ_emin ≤ δ_e ≤ δ_emax;
S403: the target action ã_i and the observation o_{T+1} are input to target Critic network 1 and target Critic network 2, which output Q_1i and Q_2i respectively;
S404: the value target is calculated as
y_i = r_T + η·min(Q_1i, Q_2i),
wherein: η represents the discount factor and min(Q_1i, Q_2i) is the smaller of Q_1i and Q_2i;
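Steps S402-S404 (target-policy smoothing and the clipped double-Q target) could be written in PyTorch roughly as follows; the noise parameters, the discount value and the tensor-valued action limits act_low/act_high are assumptions for the example, and actor_t, critic1_t, critic2_t denote the target networks from the earlier sketch.

    import torch

    def td3_targets(o_next, r, actor_t, critic1_t, critic2_t,
                    act_low, act_high, discount=0.99, noise_std=0.2, noise_clip=0.5):
        """Clipped double-Q target of S402-S404: y = r + discount * min(Q'_1, Q'_2)."""
        with torch.no_grad():
            mu = actor_t(o_next)                                            # target Actor action (S402)
            noise = (torch.randn_like(mu) * noise_std).clamp(-noise_clip, noise_clip)
            a_tilde = torch.max(torch.min(mu + noise, act_high), act_low)   # clip to the action limits
            q1 = critic1_t(o_next, a_tilde)                                 # target Critic 1 (S403)
            q2 = critic2_t(o_next, a_tilde)                                 # target Critic 2 (S403)
            return r + discount * torch.min(q1, q2)                         # value target (S404)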
S405: S402-S404 are repeated to obtain the outputs and value targets corresponding to all quadruples in B;
s406: calculating a loss function of a Critic network one
Figure BDA00034954306100000610
Loss function of Critic network two
Figure BDA00034954306100000611
Using a gradient descent method to minimize L 1 And L 2 Updating the parameters of the critical network I and the critical network II for the target
Figure BDA00034954306100000612
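The critic update of S406 amounts to one gradient-descent step on each mean-squared error; a sketch follows, where opt1 and opt2 are assumed to be optimizers (e.g. Adam) over the respective critic parameters, the optimizer choice not being specified in the patent.

    import torch
    import torch.nn.functional as F

    def update_critics(critic1, critic2, opt1, opt2, o, a, y):
        """One gradient-descent step on L_1 and L_2 (S406)."""
        loss1 = F.mse_loss(critic1(o, a), y)          # L_1
        loss2 = F.mse_loss(critic2(o, a), y)          # L_2
        opt1.zero_grad(); loss1.backward(); opt1.step()
        opt2.zero_grad(); loss2.backward(); opt2.step()
        return loss1.item(), loss2.item()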
S407: with the objective of maximizing
J(θ^μ) = (1/M) Σ_{i=1}^{M} Q_1(o_T, μ(o_T); θ^{Q_1}),
the parameter θ^μ of the Actor network is updated by gradient ascent;
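S407 performs gradient ascent on the expected value of Critic network 1 under the current policy; in practice this is usually implemented by descending the negated objective, as in the following sketch (the optimizer is again an assumption).

    import torch

    def update_actor(actor, critic1, actor_opt, o):
        """Gradient ascent on J = mean Q_1(o_T, mu(o_T)) by minimizing -J (S407)."""
        loss = -critic1(o, actor(o)).mean()
        actor_opt.zero_grad()
        loss.backward()
        actor_opt.step()
        return -loss.item()          # current value of the objective J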
S408: the parameters of the target Actor network, target Critic network 1 and target Critic network 2 are updated with the following formula:
θ^{μ'} ← τθ^μ + (1 − τ)θ^{μ'}
θ^{Q_1'} ← τθ^{Q_1} + (1 − τ)θ^{Q_1'}
θ^{Q_2'} ← τθ^{Q_2} + (1 − τ)θ^{Q_2'}          (2)
in formula (2): 0 < τ < 1 is the soft-update factor;
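The soft update of formula (2) can be written in a few lines of PyTorch; the value of τ below is illustrative.

    import torch

    @torch.no_grad()
    def soft_update(target_net, net, tau=0.005):
        """Formula (2): theta' <- tau * theta + (1 - tau) * theta'."""
        for p_t, p in zip(target_net.parameters(), net.parameters()):
            p_t.data.mul_(1.0 - tau).add_(tau * p.data)

    # Applied to all three target networks of the earlier sketch:
    # soft_update(actor_t, actor); soft_update(critic1_t, critic1); soft_update(critic2_t, critic2)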
thus, a round of reinforcement learning is completed;
If the accumulated number of reinforcement learning rounds has not reached the maximum number of steps defined in S203, return to S3; otherwise, end the reinforcement learning.
S5: after the reinforcement learning ends, the agent is saved and the Actor network is stored for use as an adaptive controller; given the stacked observation as input, the adaptive controller outputs the aircraft control quantities, namely the fuel-air mixture ratio and the elevator deflection angle.
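Once training is finished, the stored Actor network alone closes the loop: at every control step the stacked observation is pushed through it and the clipped output is applied to the aircraft. A minimal deployment sketch follows; the checkpoint file name, the layer sizes and the clipping limits are assumptions for the example, and the checkpoint is assumed to match this architecture.

    import numpy as np
    import torch
    import torch.nn as nn

    OBS_DIM, ACT_DIM = 16, 2
    actor = nn.Sequential(nn.Linear(OBS_DIM, 128), nn.ReLU(),
                          nn.Linear(128, 128), nn.ReLU(),
                          nn.Linear(128, ACT_DIM), nn.Tanh())
    actor.load_state_dict(torch.load("trained_actor.pt"))    # hypothetical checkpoint of the stored Actor
    actor.eval()

    def controller(o_T,
                   act_low=np.array([0.05, -0.35]),           # illustrative physical limits
                   act_high=np.array([1.0, 0.35])):
        """Map the stacked observation o_T to the control quantities a_T = {phi, delta_e}."""
        with torch.no_grad():
            a = actor(torch.as_tensor(o_T, dtype=torch.float32)).numpy()
        return np.clip(a, act_low, act_high)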
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential attributes thereof. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned.
Furthermore, it should be understood that although this description is organized by embodiments, each embodiment does not necessarily contain only a single independent technical solution; this manner of description is merely for clarity, and those skilled in the art should take the description as a whole, the technical solutions of the embodiments also being combinable as appropriate to form further embodiments understandable to those skilled in the art.

Claims (3)

1. An aircraft twin-delayed deep deterministic policy gradient attitude control method, characterized in that the method comprises the following steps:
S1: establishing an aircraft dynamics model and encapsulating it to form a reinforcement learning environment;
S2: initializing a reinforcement learning interaction environment, an agent and a maximum number of steps;
S201: the reinforcement learning interaction environment is defined by three items: the stacked observation o_T, the action a_T and the reward function, defined as follows:
the observation at each simulation time step t is o_t = {V, γ, θ, Q}, wherein: V represents the speed; γ represents the flight-path angle; θ represents the pitch angle; Q represents the attitude angular rate;
the stacked observation o_T = {o_{t-3}, o_{t-2}, o_{t-1}, o_t} is the superposition of four consecutive time-step observations, where t-3, t-2 and t-1 denote the three, two and one time steps preceding the simulation time step t;
the action is a_T = {φ, δ_e}, wherein: φ represents the fuel-air mixture ratio; δ_e represents the elevator deflection angle;
the reward function is r_T = r_1 + r_2, wherein: r_1 = λ_1(V − V_r)² + λ_2(γ − γ_r)² is the reward term related to the speed and flight-path-angle control errors, V_r is the speed command, γ_r is the flight-path-angle command, and λ_1, λ_2 are set negative so as to penalize the speed and flight-path-angle control errors; the term r_2 is designed to give a reward when the speed and flight-path-angle control errors are small:
if |V − V_r| < ε_1 and |γ − γ_r| < ε_2, where ε_1, ε_2 represent the desired control accuracies, then r_2 = P, with P > 0 being the reward value for reaching the desired speed and flight-path-angle control accuracy; otherwise r_2 = 0;
S202: the reinforcement learning agent comprises six neural networks, namely: the Actor network μ(o_T), the target Actor network μ_t(o_T), Critic network 1 Q_1(o_T, a_T; θ^{Q_1}), Critic network 2 Q_2(o_T, a_T; θ^{Q_2}), target Critic network 1 Q_1'(o_T, a_T; θ^{Q_1'}) and target Critic network 2 Q_2'(o_T, a_T; θ^{Q_2'}), wherein:
the input of the Actor network is the stacked observation o_T and its output is the action a_T;
the inputs of Critic network 1 and Critic network 2 are the stacked observation o_T and the action a_T, and their output is the expected value of the cumulative reward obtained after the agent takes that action;
the Actor network has the same structure as the target Actor network, Critic network 1 has the same structure as target Critic network 1, and Critic network 2 has the same structure as target Critic network 2; the parameters of each neural network are initialized randomly, and each target network is initialized with the same parameters as its corresponding network, i.e.
θ^{μ'} = θ^μ, θ^{Q_1'} = θ^{Q_1}, θ^{Q_2'} = θ^{Q_2},
wherein:
θ^μ is the parameter set of the Actor network;
θ^{μ'} is the parameter set of the target Actor network;
θ^{Q_1} is the parameter set of Critic network 1;
θ^{Q_1'} is the parameter set of target Critic network 1;
θ^{Q_2} is the parameter set of Critic network 2;
θ^{Q_2'} is the parameter set of target Critic network 2;
S203: setting the maximum number of reinforcement learning steps to N_step;
S3: obtaining the aircraft control quantities as the action; calculating the reward value and the next observation corresponding to the action, and storing the experience data in the experience replay buffer;
S4: randomly sampling experience data from the experience replay buffer and adjusting the agent parameters based on the twin-delayed deep deterministic policy gradient algorithm to complete one round of reinforcement learning;
S401: randomly sampling M quadruples from the experience replay buffer, denoting the batch B and its i-th quadruple B_i, 1 ≤ i ≤ M;
S402: inputting the next stacked observation o_{T+1} of B_i into the target Actor network and superimposing clipped random noise on its output to obtain the target action ã_i, whose components are clipped to φ_min ≤ φ ≤ φ_max and δ_emin ≤ δ_e ≤ δ_emax;
S403: inputting the target action ã_i and the observation o_{T+1} into target Critic network 1 and target Critic network 2, which output Q_1i and Q_2i respectively;
S404: calculating the value target
y_i = r_T + η·min(Q_1i, Q_2i),
wherein: η represents the discount factor and min(Q_1i, Q_2i) is the smaller of Q_1i and Q_2i;
S405: repeating S402-S404 to obtain the outputs and value targets corresponding to all quadruples in B;
S406: calculating the loss function of Critic network 1, L_1 = (1/M) Σ_{i=1}^{M} (y_i − Q_1(o_T, a_T; θ^{Q_1}))², and the loss function of Critic network 2, L_2 = (1/M) Σ_{i=1}^{M} (y_i − Q_2(o_T, a_T; θ^{Q_2}))², and updating the parameters of Critic network 1 and Critic network 2 by gradient descent with the objective of minimizing L_1 and L_2;
S407: updating the parameter θ^μ of the Actor network by gradient ascent with the objective of maximizing J(θ^μ) = (1/M) Σ_{i=1}^{M} Q_1(o_T, μ(o_T); θ^{Q_1});
S408: updating the parameters of the target Actor network, target Critic network 1 and target Critic network 2 with the following formula:
θ^{μ'} ← τθ^μ + (1 − τ)θ^{μ'}
θ^{Q_1'} ← τθ^{Q_1} + (1 − τ)θ^{Q_1'}
θ^{Q_2'} ← τθ^{Q_2} + (1 − τ)θ^{Q_2'}          (2)
in formula (2): 0 < τ < 1 is the soft-update factor;
thus one round of reinforcement learning is completed;
if the accumulated number of reinforcement learning rounds has not reached the maximum number of steps defined in S203, returning to S3; otherwise, ending the reinforcement learning;
S5: after the reinforcement learning ends, saving the agent and storing the Actor network for use as an adaptive controller; given the stacked observation as input, the adaptive controller outputs the aircraft control quantities, namely the fuel-air mixture ratio and the elevator deflection angle.
2. The aircraft twin-delayed deep deterministic policy gradient attitude control method according to claim 1, characterized in that S1 comprises the following steps:
S101: the aircraft dynamics model is established as:
dV/dt = (T cos α − D)/m − g sin γ + d_V
dγ/dt = (T sin α + L)/(mV) − (g/V) cos γ + d_γ
dθ/dt = ω_z
dω_z/dt = M/I_yy + d_Q          (1)
in formula (1):
V represents the speed;
α represents the angle of attack (α = θ − γ);
g represents the gravitational acceleration;
γ represents the flight-path angle (track inclination angle);
θ represents the pitch angle;
ω_z represents the pitch angular velocity;
m represents the aircraft mass;
I_yy represents the pitch-channel moment of inertia;
T = C_{T,φ}(α)φ + C_T(α) represents the thrust, wherein: φ denotes the fuel-air mixture ratio, C_{T,φ}(α) denotes the coefficient between T and φ, and C_T(α) denotes the coefficient between T and α, both obtained from wind-tunnel tests;
D = qSC_D represents the drag, wherein: q denotes the dynamic pressure, S the aircraft reference area, and C_D the drag coefficient, obtained from wind-tunnel tests;
L = qSC_L represents the lift, wherein: C_L denotes the lift coefficient, obtained from wind-tunnel tests;
M = z_T T + qScC_M represents the pitching moment, wherein: z_T denotes the thrust moment-arm length, c the mean aerodynamic chord length, and C_M the pitching-moment coefficient, obtained from wind-tunnel tests;
d_V, d_γ, d_Q represent the uncertainties in the model caused by parameter uncertainty;
S102: the aircraft dynamics model and its solver are implemented as a C program, compiled into a dynamic link library file, and wrapped to form the reinforcement learning environment.
3. The aircraft twin-delayed deep deterministic policy gradient attitude control method according to claim 2, characterized in that S3 comprises the following steps:
S301: from the aircraft speed V_t, flight-path angle γ_t, attitude angle θ_t and attitude angular rate Q_t collected at each simulation time step t, forming the stacked observation o_T as defined in S201;
S302: inputting the stacked observation o_T obtained in S301 into the Actor network and superimposing random noise N on the network output to obtain the action a_T = {φ, δ_e};
according to the physical limits of the aircraft, clipping φ to φ_min ≤ φ ≤ φ_max and δ_e to δ_emin ≤ δ_e ≤ δ_emax, wherein: φ_min is the minimum fuel-air mixture ratio; φ_max is the maximum fuel-air mixture ratio; δ_emin is the minimum elevator deflection angle; δ_emax is the maximum elevator deflection angle;
S303: inputting the action a_T into the environment of S102 to obtain the observation o_{t+1} of the next simulation time step, and computing the reward r_T and the next stacked observation o_{T+1} according to the definitions in S201;
S304: storing the quadruple formed by the stacked observation o_T, the action a_T, the next stacked observation o_{T+1} and the reward r_T in the experience replay buffer as one experience sample.
CN202210113006.4A 2022-01-29 2022-01-29 Aircraft twin-delayed deep deterministic policy gradient attitude control method Active CN114489107B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210113006.4A CN114489107B (en) 2022-01-29 2022-01-29 Aircraft double-delay depth certainty strategy gradient attitude control method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210113006.4A CN114489107B (en) 2022-01-29 2022-01-29 Aircraft double-delay depth certainty strategy gradient attitude control method

Publications (2)

Publication Number Publication Date
CN114489107A CN114489107A (en) 2022-05-13
CN114489107B (en) 2022-10-25

Family

ID=81479574

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210113006.4A Active CN114489107B (en) 2022-01-29 2022-01-29 Aircraft double-delay depth certainty strategy gradient attitude control method

Country Status (1)

Country Link
CN (1) CN114489107B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117289709A (en) * 2023-09-12 2023-12-26 中南大学 High-ultrasonic-speed appearance-changing aircraft attitude control method based on deep reinforcement learning
CN117518836B (en) * 2024-01-04 2024-04-09 中南大学 Robust deep reinforcement learning guidance control integrated method for variant aircraft

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112286218A (en) * 2020-12-29 2021-01-29 南京理工大学 Aircraft large-attack-angle rock-and-roll suppression method based on depth certainty strategy gradient
CN113268074A (en) * 2021-06-07 2021-08-17 哈尔滨工程大学 Unmanned aerial vehicle flight path planning method based on joint optimization
CN113377121A (en) * 2020-07-02 2021-09-10 北京航空航天大学 Aircraft intelligent disturbance rejection control method based on deep reinforcement learning
CN113392935A (en) * 2021-07-09 2021-09-14 浙江工业大学 Multi-agent deep reinforcement learning strategy optimization method based on attention mechanism

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113377121A (en) * 2020-07-02 2021-09-10 北京航空航天大学 Aircraft intelligent disturbance rejection control method based on deep reinforcement learning
CN112286218A (en) * 2020-12-29 2021-01-29 南京理工大学 Aircraft large-attack-angle rock-and-roll suppression method based on depth certainty strategy gradient
CN113268074A (en) * 2021-06-07 2021-08-17 哈尔滨工程大学 Unmanned aerial vehicle flight path planning method based on joint optimization
CN113392935A (en) * 2021-07-09 2021-09-14 浙江工业大学 Multi-agent deep reinforcement learning strategy optimization method based on attention mechanism

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Low-Level Control of a Quadrotor using Twin Delayed Deep Deterministic Policy Gradient (TD3); Mazen Shehab et al.; 2021 18th International Conference on Electrical Engineering, Computing Science and Automatic Control (CCE); IEEE; 2021-12-14; pp. 1-6 *
UAV counter-pursuit maneuver decision-making based on an improved twin delayed deep deterministic policy gradient method; 郭万春 et al.; Journal of Air Force Engineering University (Natural Science Edition); 2021-08-31; Vol. 22, No. 4; pp. 15-21 *
Backstepping control of hypersonic vehicle based on tracking differentiator; 路遥; Acta Aeronautica et Astronautica Sinica; 2021-11-25; Vol. 42, No. 11; pp. 524737-1 to 524737-12 *

Also Published As

Publication number Publication date
CN114489107A (en) 2022-05-13

Similar Documents

Publication Publication Date Title
CN114489107B (en) Aircraft twin-delayed deep deterministic policy gradient attitude control method
CN110377045B (en) Aircraft full-profile control method based on anti-interference technology
CN110806759B (en) Aircraft route tracking method based on deep reinforcement learning
CN109144084B (en) A kind of VTOL Reusable Launch Vehicles Attitude tracking control method based on set time Convergence monitoring device
Klein Estimation of aircraft aerodynamic parameters from flight data
CN109270947B (en) Tilt rotor unmanned aerial vehicle flight control system
CN106896722B (en) The hypersonic vehicle composite control method of adoption status feedback and neural network
CN109635494B (en) Flight test and ground simulation aerodynamic force data comprehensive modeling method
CN112286218B (en) Aircraft large-attack-angle rock-and-roll suppression method based on depth certainty strategy gradient
CN113485304B (en) Aircraft hierarchical fault-tolerant control method based on deep learning fault diagnosis
CN112859889B (en) Autonomous underwater robot control method and system based on self-adaptive dynamic planning
CN112286217A (en) Automatic pilot based on radial basis function neural network and decoupling control method thereof
CN115437406A (en) Aircraft reentry tracking guidance method based on reinforcement learning algorithm
CN112880704A (en) Intelligent calibration method for fiber optic gyroscope strapdown inertial navigation system
CN113377121A (en) Aircraft intelligent disturbance rejection control method based on deep reinforcement learning
CN115857530A (en) Decoupling-free attitude control method of aircraft based on TD3 multi-experience pool reinforcement learning
CN117289709A (en) High-ultrasonic-speed appearance-changing aircraft attitude control method based on deep reinforcement learning
CN112560343B (en) J2 perturbation Lambert problem solving method based on deep neural network and targeting algorithm
CN111273056B (en) Attack angle observation method of high-speed aircraft without adopting altitude measurement
CN112683261A (en) Unmanned aerial vehicle robustness navigation method based on speed prediction
CN113821057B (en) Planetary soft landing control method and system based on reinforcement learning and storage medium
CN116907503A (en) Remote sensing formation satellite positioning method and system based on robust positioning algorithm of outlier
CN114462149B (en) Aircraft pneumatic parameter identification method based on pre-training and incremental learning
CN115048724A (en) B-type spline-based method for online identification of aerodynamic coefficient of variant aerospace vehicle
CN117784616B (en) High-speed aircraft fault reconstruction method based on intelligent observer group

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant