CN113985924B - Aircraft control method, device, equipment and computer readable storage medium


Info

Publication number
CN113985924B
CN113985924B
Authority
CN
China
Prior art keywords
network
aircraft
data
parameters
critic
Prior art date
Legal status
Active
Application number
CN202111608105.1A
Other languages
Chinese (zh)
Other versions
CN113985924A (en)
Inventor
周志明
刘振
蒲志强
易建强
Current Assignee
Institute of Automation of Chinese Academy of Science
Original Assignee
Institute of Automation of Chinese Academy of Science
Priority date
Filing date
Publication date
Application filed by Institute of Automation of Chinese Academy of Science filed Critical Institute of Automation of Chinese Academy of Science
Priority to CN202111608105.1A
Publication of CN113985924A
Application granted
Publication of CN113985924B

Classifications

    • G: PHYSICS
    • G05: CONTROLLING; REGULATING
    • G05D: SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D1/00: Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
    • G05D1/10: Simultaneous control of position or course in three dimensions
    • G05D1/101: Simultaneous control of position or course in three dimensions specially adapted for aircraft

Landscapes

  • Engineering & Computer Science (AREA)
  • Aviation & Aerospace Engineering (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Remote Sensing (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Automation & Control Theory (AREA)
  • Feedback Control In General (AREA)

Abstract

The application provides an aircraft control method, an aircraft control device, aircraft control equipment and a computer program product, wherein the method comprises the following steps: determining observation data and agent action data of the aircraft according to model parameters in an aircraft model; training an actor network and a critic network based on the observation data and/or the agent action data, and outputting a deterministic policy and an action value function; updating the actor network parameters and the critic network parameters through a deep deterministic policy gradient algorithm to obtain an optimal actor network and an optimal critic network; and constructing an online controller based on the optimal actor network and the optimal critic network, and controlling the aircraft through the online controller. According to the aircraft control method provided by the embodiments of the application, the online controller is trained offline through the deep deterministic policy gradient algorithm, so that the online controller has good adaptability and robustness and accurate control of the aircraft is realized.

Description

Aircraft control method, device, equipment and computer readable storage medium
Technical Field
The present application relates to the field of aircraft control technologies, and in particular, to an aircraft control method, apparatus, device, and computer-readable storage medium.
Background
An aircraft generally has a large flight envelope, a high speed and a wide range of flight altitudes, so the dynamic coefficients that depend on the flight state change drastically during flight. In engineering practice, aircraft control is generally realized with a classical frequency-domain controller. Although this approach can meet the requirements of practical applications to a certain extent, it has several obvious defects: first, a good controller can be designed only if the aircraft is modeled accurately, and the ability to guarantee system stability and dynamic performance is poor under model uncertainty; second, when the state of the aircraft changes over the flight time, the classical frequency-domain controller needs to carry out a large amount of interpolation calculation, which places high demands on the on-board storage space. In addition, the expansion of the flight envelope makes the flight environment during flight more complicated and changeable, with numerous uncertain factors, which brings great difficulty to the precise control of the aircraft.
Disclosure of Invention
The application provides an aircraft control method, an aircraft control device, aircraft control equipment and a computer program product, and aims to realize accurate control of an aircraft.
In a first aspect, the present application provides an aircraft control method comprising:
determining observation data and agent action data of the aircraft according to model parameters in the aircraft model;
training an actor network and a critic network based on the observation data and/or the agent action data, and outputting a deterministic policy and an action value function;
updating an actor network parameter in the deterministic policy and a critic network parameter in the action value function through a deep deterministic policy gradient algorithm to obtain an optimal actor network and an optimal critic network;
and constructing an online controller based on the optimal actor network and the optimal critic network, and controlling the aircraft through the online controller.
In one embodiment, the deep deterministic policy gradient algorithm includes an actor current network and an actor target network,
and updating the actor network parameters in the deterministic policy through the deep deterministic policy gradient algorithm to obtain an optimal actor network includes the following steps:
updating initial actor network parameters in the deterministic policy through the actor current network, selecting a current action according to the current state to interact with the current environment, generating a next state, and transmitting the next state to an experience replay pool;
acquiring the next state from the experience replay pool through the actor target network, selecting an optimal next action according to the next state, and calculating target actor network parameters;
and determining the latest actor network parameters through the actor target network according to the initial actor network parameters, the target actor network parameters and the inertia update rate, and updating the target actor network parameters with the latest actor network parameters to obtain the optimal actor network.
The deep deterministic policy gradient algorithm includes a critic current network and a critic target network,
and updating the critic network parameters in the action value function through the deep deterministic policy gradient algorithm to obtain an optimal critic network includes the following steps:
updating initial critic network parameters in the action value function through the critic current network, calculating a function value of the current action, and constructing a to-be-processed action value function according to the function value of the current action;
estimating the to-be-processed action value function through the critic target network to obtain target critic network parameters;
and determining the latest critic network parameters through the critic target network according to the initial critic network parameters, the target critic network parameters and the inertia update rate, and updating the target critic network parameters with the latest critic network parameters to obtain the optimal critic network.
The actor network includes a first input layer and a first intermediate fully-connected layer,
and training the actor network based on the observation data and outputting a deterministic policy includes the following steps:
determining the observation data as input data for the first input layer;
processing the observation data through the first intermediate fully-connected layer, outputting a rudder deflection angle action, and determining the rudder deflection angle action as the deterministic policy;
wherein the first input layer is a single input layer of 9 neurons, and the first intermediate fully-connected layer consists of two fully-connected layers of 49 neurons each.
The critic network includes a second input layer, a second intermediate fully-connected layer and a third intermediate fully-connected layer,
and training the critic network based on the observation data and the agent action data and outputting an action value function includes the following steps:
determining the observation data and the agent action data as input data of the second input layer;
processing the observation data through the second intermediate fully-connected layer to obtain first data to be processed;
processing the agent action data through the third intermediate fully-connected layer to obtain second data to be processed;
summing the first data to be processed and the second data to be processed to obtain target data, processing the target data through the third intermediate fully-connected layer, and outputting the action value function;
wherein the second input layer is a single input layer of 10 neurons, the second intermediate fully-connected layer consists of two fully-connected layers of 49 neurons each, and the third intermediate fully-connected layer is a single fully-connected layer of 49 neurons.
Determining the observation data and agent action data of the aircraft according to the model parameters in the aircraft model includes the following steps:
determining, with the step as the period, preset frames of data of the deviation between the actual overload and the command, the pitch angular velocity and the pseudo angle of attack in the model parameters as the observation data;
and determining the pitch rudder deflection angle in the model parameters as the agent action data.
Before determining the observation data and agent action data of the aircraft according to the model parameters in the aircraft model, the method further includes:
determining a transfer function from normal overload to pitch rudder deflection angle through transfer-function combination with the normal overload, the pitch rudder deflection angle and the characteristic parameters of the aircraft;
determining a transfer function from pitch angular velocity to pitch rudder deflection angle through transfer-function combination with the pitch angular velocity, the pitch rudder deflection angle and the characteristic parameters of the aircraft;
determining a transfer function from angle of attack to pitch rudder deflection angle through transfer-function combination with the angle of attack, the pitch rudder deflection angle and the characteristic parameters of the aircraft;
and constructing the aircraft model from the transfer function from normal overload to pitch rudder deflection angle, the transfer function from pitch angular velocity to pitch rudder deflection angle and the transfer function from angle of attack to pitch rudder deflection angle.
In a second aspect, the present application also provides an aircraft control device comprising:
a determining module, used for determining observation data and agent action data of the aircraft according to model parameters in the aircraft model;
a training module, used for training an actor network and a critic network based on the observation data and/or the agent action data and outputting a deterministic policy and an action value function;
an updating module, used for updating the actor network parameters in the deterministic policy and the critic network parameters in the action value function through a deep deterministic policy gradient algorithm to obtain an optimal actor network and an optimal critic network;
and a control module, used for constructing an online controller based on the optimal actor network and the optimal critic network and controlling the aircraft through the online controller.
In a third aspect, the present application further provides an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps of the aircraft control method of the first aspect when executing the program.
In a fourth aspect, the present application also provides a computer program product comprising a computer program which, when executed by a processor, carries out the steps of the aircraft control method of the first aspect.
In a fifth aspect, the present application also provides a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, carries out the steps of the aircraft control method of the first aspect.
According to the aircraft control method, the aircraft control device, the aircraft control equipment and the computer program product, the online controller is trained offline through a deep deterministic policy gradient algorithm, and aircraft control is realized through the online controller. Because the online controller is obtained by training with the deep deterministic policy gradient algorithm, it adapts well to model characteristic changes caused by unknown parameter disturbances, fault inputs and model uncertainty. Meanwhile, the online controller realizes good command following of the aircraft under parameter deviations, disturbances and faults of a certain degree, has strong robustness and generalization performance, and realizes accurate control of the aircraft through its good adaptability and robustness.
Drawings
In order to more clearly illustrate the technical solutions in the present application or in the prior art, the drawings needed for describing the embodiments or the prior art are briefly introduced below. It is apparent that the drawings in the following description show some embodiments of the present application, and that those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a schematic flow diagram of an aircraft control method provided herein;
FIG. 2 is a control framework diagram of an aircraft control method provided herein;
FIG. 3 is a schematic structural diagram of an aircraft control device provided herein;
fig. 4 is a schematic structural diagram of an electronic device provided in the present application.
Detailed Description
To make the purpose, technical solutions and advantages of the present application clearer, the technical solutions in the present application will be clearly and completely described below with reference to the drawings in the present application, and it is obvious that the described embodiments are some, but not all embodiments of the present application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The aircraft control methods, apparatus, devices and computer program products provided herein are described below in conjunction with fig. 1-4.
The present application provides an aircraft control method, with reference to fig. 1 to 4, fig. 1 is a schematic flow chart of the aircraft control method provided herein; FIG. 2 is a control framework diagram of an aircraft control method provided herein; FIG. 3 is a schematic structural diagram of an aircraft control device provided herein; fig. 4 is a schematic structural diagram of an electronic device provided in the present application.
While a logical order is shown in the flow chart, in some cases, the steps shown or described may be performed in a different order than presented herein.
The embodiments of the present application are described by taking an electronic device as the execution subject, and the aircraft control system is taken as the electronic device in the embodiments of the application; this does not limit the electronic device.
Referring to fig. 1, an aircraft control method provided in an embodiment of the present application includes:
Step S50, determining observation data and agent action data of the aircraft according to the model parameters in the aircraft model.
It should be noted that, before this step, the aircraft control system needs to build an aircraft model, with the aid of which the online controller of the aircraft is constructed. In this embodiment, the aircraft model is constructed with the aid of transfer functions, and the specific construction process is described in steps S10 to S40.
Further, the description of steps S10 to S40 is as follows:
Step S10, determining a transfer function from normal overload to pitch rudder deflection angle through transfer-function combination with the normal overload, the pitch rudder deflection angle and the characteristic parameters of the aircraft;
Step S20, determining a transfer function from pitch angular velocity to pitch rudder deflection angle through transfer-function combination with the pitch angular velocity, the pitch rudder deflection angle and the characteristic parameters of the aircraft;
Step S30, determining a transfer function from angle of attack to pitch rudder deflection angle through transfer-function combination with the angle of attack, the pitch rudder deflection angle and the characteristic parameters of the aircraft;
Step S40, constructing the aircraft model from the transfer function from normal overload to pitch rudder deflection angle, the transfer function from pitch angular velocity to pitch rudder deflection angle and the transfer function from angle of attack to pitch rudder deflection angle.
Specifically, the aircraft control system determines the normal overload, pitch angular velocity, angle of attack, pitch rudder deflection angle and characteristic parameters of the aircraft: the normal overload of the aircraft is denoted $n_y$, the pitch angular velocity of the aircraft is denoted $\omega_z$, the angle of attack of the aircraft is denoted $\alpha$, the pitch rudder deflection angle of the aircraft is denoted $\delta_z$, and the characteristic parameters of the aircraft are denoted $a_i$. The aircraft control system then combines, through transfer-function combination, the normal overload $n_y$, the pitch rudder deflection angle $\delta_z$ and the characteristic parameters $a_i$ to determine the transfer function from normal overload to pitch rudder deflection angle; this transfer function is recorded as the first transfer function and can be expressed as $G_{n_y \delta_z}(s)$. Then, the aircraft control system combines, through transfer-function combination, the pitch angular velocity $\omega_z$, the pitch rudder deflection angle $\delta_z$ and the characteristic parameters $a_i$ to determine the transfer function from pitch angular velocity to pitch rudder deflection angle; this transfer function is recorded as the second transfer function and can be expressed as $G_{\omega_z \delta_z}(s)$. Then, the aircraft control system combines, through transfer-function combination, the angle of attack $\alpha$, the pitch rudder deflection angle $\delta_z$ and the characteristic parameters $a_i$ to determine the transfer function from angle of attack to pitch rudder deflection angle; this transfer function is recorded as the third transfer function and can be expressed as $G_{\alpha \delta_z}(s)$. Finally, the aircraft control system combines the transfer function $G_{n_y \delta_z}(s)$ from normal overload to pitch rudder deflection angle, the transfer function $G_{\omega_z \delta_z}(s)$ from pitch angular velocity to pitch rudder deflection angle and the transfer function $G_{\alpha \delta_z}(s)$ from angle of attack to pitch rudder deflection angle to construct the aircraft model.
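As an illustration of this model structure, the following sketch assembles a pitch channel of this form with scipy.signal. The patent gives the three transfer functions only as equation images, so the numerator and denominator coefficients below, as well as the sampling period, are hypothetical placeholders rather than the patent's values:

    # Sketch: pitch-channel aircraft model built from the three transfer
    # functions described above. All numeric coefficients are assumptions;
    # only the structure (rudder deflection delta_z as input, n_y, omega_z
    # and alpha as outputs) comes from the text.
    import numpy as np
    from scipy import signal

    den = [1.0, 0.8, 4.0]                               # shared short-period dynamics (assumed)
    G_ny = signal.TransferFunction([2.0, 1.5], den)     # first transfer function
    G_wz = signal.TransferFunction([3.0, 0.5], den)     # second transfer function
    G_alpha = signal.TransferFunction([1.2], den)       # third transfer function

    dt = 0.01                                           # control step / sampling period (assumed)
    G_ny_d = G_ny.to_discrete(dt)                       # discretize for step-wise simulation

    t = np.arange(0.0, 5.0, dt)
    delta_z = 0.1 * np.ones_like(t)                     # constant rudder deflection input
    tout, n_y = signal.dlsim(G_ny_d, delta_z, t=t)[:2]  # simulated normal overload response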
Further, the aircraft control system needs to set a reward function through the normal overload, the pitch angular velocity, the angle of attack, the pitch rudder deflection angle and the characteristic parameters of the aircraft. The reward function can be expressed as $r = r_1 + r_2 + r_3 + r_4$. The immediate reward $r_1$ is a function of the deviation $e$ between the actual overload and the command: when the deviation between the control effect and the actual effect is large, a large penalty is output; when the deviation between the control effect and the actual effect is small, a penalty of almost zero is output. The term $r_2$ is used for constraining the energy of the control input. The sparse reward $r_3$ means that if the overload deviation is greater than 0.1 and less than 0.5, $r_3$ takes 1; if the overload deviation is less than 0.1, $r_3$ takes 5; in the remaining cases $r_3$ takes 0. The terminal reward $r_4$ means that when the overload deviation of the system is greater than 100 or less than -100, the exploration is ended and $r_4 = -500$ is output.
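A minimal sketch of such a reward, assuming a quadratic penalty for the immediate reward $r_1$ and a quadratic control-energy penalty for $r_2$ (the patent shows both terms only as images, so those analytic forms and the weights k1, k2 are assumptions; the $r_3$ and $r_4$ thresholds follow the text):

    def reward(e, delta_z, k1=1.0, k2=0.01):
        """Reward r = r1 + r2 + r3 + r4 for one control step.

        e       -- deviation between the actual overload and the command
        delta_z -- pitch rudder deflection angle (the control input)
        k1, k2  -- hypothetical weights for the assumed r1 and r2 forms
        """
        r1 = -k1 * e ** 2            # assumed: large deviation -> large penalty,
                                     # small deviation -> penalty of almost zero
        r2 = -k2 * delta_z ** 2      # assumed quadratic constraint on control energy
        abs_e = abs(e)               # interpreting "overload deviation" as |e|
        if abs_e < 0.1:
            r3 = 5.0                 # sparse reward: deviation below 0.1
        elif abs_e < 0.5:
            r3 = 1.0                 # sparse reward: deviation in (0.1, 0.5)
        else:
            r3 = 0.0                 # all remaining cases
        done = e > 100.0 or e < -100.0
        r4 = -500.0 if done else 0.0 # terminal penalty ends the exploration
        return r1 + r2 + r3 + r4, done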
According to the embodiment of the application, the performance of the online controller is further generalized by constructing the aircraft model and using the aircraft model to assist in constructing the online controller of the aircraft.
After the aircraft model is constructed, the aircraft control system needs to acquire model parameters from the aircraft model, where the model parameters include but are not limited to the deviation between the actual overload and the command and the pitch angular velocity; this embodiment also needs to acquire a pseudo angle of attack because the actual angle of attack is not measurable. Next, the aircraft control system determines the observation data and agent action data of the aircraft according to the deviation between the actual overload and the command, the pitch angular velocity and the pseudo angle of attack in the model parameters, as described in steps S501 to S502.
Further, the description of steps S501 to S502 is as follows:
Step S501, determining, with the step as the period, preset frames of data of the deviation between the actual overload and the command, the pitch angular velocity and the pseudo angle of attack in the model parameters as the observation data;
Step S502, determining the pitch rudder deflection angle in the model parameters as the agent action data.
It should be noted that the preset number of frames in this embodiment is set according to actual conditions, and in this embodiment is preferably 3 frames of data. Specifically, the aircraft control system, taking the step as the period, determines 3 frames of data of the deviation $e$ between the actual overload and the command, the pitch angular velocity $\omega_z$ and the pseudo angle of attack $\hat{\alpha}$ as the observation data.
It can be further understood that, with the step as the period, one available frame of data of the deviation $e$ between the actual overload and the command, the pitch angular velocity $\omega_z$ and the pseudo angle of attack $\hat{\alpha}$ is $(e, \omega_z, \hat{\alpha})$, so the 3 frames of data are the values of $(e, \omega_z, \hat{\alpha})$ at the current step and the two preceding steps; that is, the observation data can be expressed as $o = (e_k, \omega_{z,k}, \hat{\alpha}_k, e_{k-1}, \omega_{z,k-1}, \hat{\alpha}_{k-1}, e_{k-2}, \omega_{z,k-2}, \hat{\alpha}_{k-2})$. Meanwhile, the aircraft control system determines the pitch rudder deflection angle in the model parameters as the agent action data, i.e., the agent action data can be expressed as $a = \delta_z$.
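A sketch of assembling this 9-dimensional observation from a rolling 3-frame history (the frame ordering and the zero-padding used before three frames have accumulated are conventions assumed here, not specified in the text):

    from collections import deque
    import numpy as np

    history = deque(maxlen=3)   # rolling 3-frame buffer, one frame per control step

    def observe(e, omega_z, alpha_hat):
        """Append the newest frame (e, omega_z, pseudo alpha) and return
        the flattened 9-dimensional observation vector."""
        history.append((e, omega_z, alpha_hat))
        frames = list(history)
        while len(frames) < 3:                    # zero-pad at episode start (assumption)
            frames.insert(0, (0.0, 0.0, 0.0))
        return np.asarray(frames, dtype=np.float32).ravel()   # shape (9,)

    obs = observe(e=0.2, omega_z=0.01, alpha_hat=0.05)        # observation data o
    action = np.array([0.0], dtype=np.float32)                # agent action a = delta_z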
According to the method and the device, the preset frames of data of the deviation between the actual overload and the command, the pitch angular velocity and the pseudo angle of attack are determined as the observation data with the step as the period, and the pitch rudder deflection angle is determined as the agent action data, thereby guaranteeing the accuracy of both the observation data and the agent action data.
Step S60, training an actor network and a critic network based on the observation data and/or the agent action data, and outputting a deterministic policy and an action value function.
The aircraft control system takes the observation data as the input data of the actor network, trains the actor network and outputs a deterministic policy, wherein the actor network comprises an input layer of 9 neurons, two intermediate fully-connected layers of 49 neurons each and an output layer of 1 neuron; the specific training process is described in steps S601 to S602. Meanwhile, the aircraft control system takes the observation data and the agent action data as the input data of the critic network, trains the critic network and outputs an action value function, wherein the critic network comprises an input layer of 10 neurons, two intermediate fully-connected layers of 49 neurons each, one further intermediate fully-connected layer of 49 neurons and an output layer of 1 neuron; the specific training process is described in steps S603 to S606. Note that the input layers, intermediate fully-connected layers and output layers of the actor network and the critic network are all BP (back-propagation) feed-forward neural networks.
Further, this embodiment also provides a control framework to assist the training of the actor network and the critic network so as to achieve fast convergence. Taking the actor network as an example, its control framework is shown in fig. 2; fig. 2 is a control framework diagram of the aircraft control method provided by the application.
As shown in fig. 2, the auxiliary controller in the control framework diagram is a proportional controller, and the gain coefficient $K$ of the proportional controller is selected to match the steady-state gain of the system. Therefore, when the system determines, through the command, the auxiliary controller, the pitch rudder deflection angle $\delta_z$, the normal overload $n_y$, the flight action and the current environment, that a steady state has been reached, the pitch rudder deflection angle output by the actor network is zero.
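A sketch of this control framework, assuming the actor's output is simply added to the auxiliary proportional controller's rudder command (fig. 2 is not reproduced here, so the additive combination and the helper names are assumptions):

    def auxiliary_rudder(command, n_y, K):
        """Proportional auxiliary controller: K is the gain chosen to match
        the steady-state gain of the system (its value depends on the
        transfer-function coefficients, which the patent does not list)."""
        return K * (command - n_y)

    def total_rudder(command, n_y, obs, actor, K):
        # actor is any callable mapping the observation to a rudder deflection.
        # It only needs to learn the residual around the auxiliary controller,
        # so at steady state its contribution goes to zero, as stated above.
        return auxiliary_rudder(command, n_y, K) + float(actor(obs))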
Further, the description of steps S601 to S602 is as follows:
Step S601, determining the observation data as input data for the first input layer;
Step S602, processing the observation data through the first intermediate fully-connected layer, outputting a rudder deflection angle action, and determining the rudder deflection angle action as the deterministic policy.
Specifically, the aircraft control system takes the observation data as the input data of the actor network; since the input layer of the actor network is a single input layer of 9 neurons, the 9 observation values are correspondingly taken as the input data of the input layer. Then, the aircraft control system processes the 9 input observation values through the two intermediate fully-connected layers of 49 neurons each, outputs the processed rudder deflection angle action through the output layer of 1 neuron, and determines the rudder deflection angle action as the deterministic policy; the rudder deflection angle action can be understood as one concrete expression of the deterministic policy. Further, this embodiment may activate the deterministic policy with the activation function tanh, and additionally add a scaling layer in the actor network for scaling the output amplitude.
According to the method and the device, the observation data are used as input data, and the deterministic policy is output through the single input layer of 9 neurons and the two intermediate fully-connected layers of 49 neurons each in the actor network, so that the deterministic policy is more accurate.
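A minimal PyTorch sketch of this actor network (the text specifies the 9-49-49-1 layout, the tanh output activation and a scaling layer; the hidden-layer activation and the output amplitude are assumptions):

    import torch
    import torch.nn as nn

    class Actor(nn.Module):
        """Actor: 9-neuron input layer, two 49-neuron fully-connected layers,
        1-neuron output with tanh, followed by a scaling layer."""

        def __init__(self, max_rudder=0.35):       # output amplitude in rad (assumed)
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(9, 49), nn.ReLU(),       # hidden activation assumed ReLU
                nn.Linear(49, 49), nn.ReLU(),
                nn.Linear(49, 1), nn.Tanh(),       # tanh activates the policy output
            )
            self.max_rudder = max_rudder           # the added scaling layer

        def forward(self, obs):
            return self.max_rudder * self.net(obs)

    actor = Actor()
    delta_z = actor(torch.zeros(1, 9))             # one 9-dimensional observation in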
Further, the description of steps S603 to S606 is as follows:
Step S603, determining the observation data and the agent action data as input data of the second input layer;
Step S604, processing the observation data through the second intermediate fully-connected layer to obtain first data to be processed;
Step S605, processing the agent action data through the third intermediate fully-connected layer to obtain second data to be processed;
Step S606, summing the first data to be processed and the second data to be processed to obtain target data, processing the target data through the third intermediate fully-connected layer, and outputting the action value function.
Specifically, the aircraft control system takes the observation data and the agent action data as the input data of the critic network; since the input layer of the critic network is a single input layer of 10 neurons, the 9 observation values and 1 agent action value are taken as the input data of the input layer. Then, the aircraft control system processes the 9 observation values through the two intermediate fully-connected layers of 49 neurons each to obtain the processed first data to be processed, and simultaneously processes the 1 agent action value through the single intermediate fully-connected layer of 49 neurons to obtain the processed second data to be processed. Next, the aircraft control system sums the first data to be processed from the two fully-connected layers of 49 neurons and the second data to be processed from the fully-connected layer of 49 neurons to obtain the target data. Finally, the aircraft control system processes the target data through one more fully-connected layer of 49 neurons and outputs the resulting action value function through the output layer of 1 neuron; the action value function can be understood as the state-action value estimate for the current state and action. Further, this embodiment may use the activation function ReLU in the critic network.
According to the embodiment of the application, the observation data and the agent action data are used as input data, and the action value function is output through the input layer of 10 neurons, the two intermediate fully-connected layers of 49 neurons each and the single intermediate fully-connected layer of 49 neurons of the critic network, so that the action value function is more accurate.
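A matching PyTorch sketch of the critic (two-layer observation branch, one-layer action branch, branch sum, one more 49-neuron layer, scalar output; ReLU activations per the text):

    import torch
    import torch.nn as nn

    class Critic(nn.Module):
        """Critic: 10 inputs in total (9 observations + 1 action),
        branch outputs summed and mapped to a single Q value."""

        def __init__(self):
            super().__init__()
            self.obs_branch = nn.Sequential(       # two 49-neuron layers for observations
                nn.Linear(9, 49), nn.ReLU(),
                nn.Linear(49, 49), nn.ReLU(),
            )
            self.act_branch = nn.Sequential(       # one 49-neuron layer for the action
                nn.Linear(1, 49), nn.ReLU(),
            )
            self.head = nn.Sequential(             # one more 49-neuron layer, then output
                nn.Linear(49, 49), nn.ReLU(),
                nn.Linear(49, 1),
            )

        def forward(self, obs, act):
            return self.head(self.obs_branch(obs) + self.act_branch(act))

    critic = Critic()
    q = critic(torch.zeros(1, 9), torch.zeros(1, 1))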
Step S70, updating the actor network parameters in the deterministic policy and the critic network parameters in the action value function through the deep deterministic policy gradient algorithm to obtain the optimal actor network and the optimal critic network.
It should be noted that the deep deterministic policy gradient algorithm includes an actor current network, an actor target network, a critic current network, and a critic target network.
Specifically, the aircraft control system expresses the deterministic policy as $a = \mu(s \mid \theta^{\mu})$, so the actor network parameter in the deterministic policy is $\theta^{\mu}$. In this embodiment, the actor network parameter $\theta^{\mu}$ is determined by the gradient of the objective function, which can be expressed as
$\nabla_{\theta^{\mu}} J = \mathbb{E}_{s \sim \rho^{\mu}} \left[ \nabla_{a} Q(s, a \mid \theta^{Q}) \big|_{a = \mu(s)} \, \nabla_{\theta^{\mu}} \mu(s \mid \theta^{\mu}) \right]$,
where $\rho^{\mu}$ is the state distribution of the deterministic policy. The aircraft control system then updates the actor network parameter $\theta^{\mu}$ through the actor current network and the actor target network; the specific updating process is described in steps S701 to S703. The aircraft control system expresses the action value function as $Q(s, a \mid \theta^{Q})$, so the critic network parameter in the action value function is $\theta^{Q}$. In this embodiment, the critic network parameter $\theta^{Q}$ is updated according to the temporal-difference method, and the updating formulas are
$y_t = r_t + \gamma Q'(s_{t+1}, \mu'(s_{t+1} \mid \theta^{\mu'}) \mid \theta^{Q'})$,
$\theta^{Q} \leftarrow \theta^{Q} + \alpha_c \left( y_t - Q(s_t, a_t \mid \theta^{Q}) \right) \nabla_{\theta^{Q}} Q(s_t, a_t \mid \theta^{Q})$,
$\theta^{\mu} \leftarrow \theta^{\mu} + \alpha_a \nabla_{\theta^{\mu}} J$,
where $\gamma$ is the discount factor that converts the future reward into its current proportion and is generally taken as 0.99; $\alpha_c$ is the learning rate of the critic network, generally 0.001; and $\alpha_a$ is the learning rate of the actor network, generally 0.0001. The aircraft control system then updates the critic network parameter $\theta^{Q}$ through the critic current network and the critic target network; the specific updating process is described in steps S704 to S706.
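A minimal sketch of one such update step in PyTorch, assuming actor and critic are the networks sketched above, actor_target and critic_target are structural copies of them, and replay batches arrive as tensors (s, a, r, s_next) with the reward shaped (batch, 1); episode-termination masking of the TD target and exploration noise are omitted for brevity:

    import torch
    import torch.nn.functional as F

    GAMMA, LR_CRITIC, LR_ACTOR = 0.99, 1e-3, 1e-4      # values stated in the text

    critic_opt = torch.optim.Adam(critic.parameters(), lr=LR_CRITIC)
    actor_opt = torch.optim.Adam(actor.parameters(), lr=LR_ACTOR)

    def ddpg_update(s, a, r, s_next):
        """One temporal-difference update on a replay batch."""
        with torch.no_grad():                          # TD target from the target networks
            y = r + GAMMA * critic_target(s_next, actor_target(s_next))
        critic_loss = F.mse_loss(critic(s, a), y)      # squared TD error
        critic_opt.zero_grad()
        critic_loss.backward()
        critic_opt.step()
        actor_loss = -critic(s, actor(s)).mean()       # ascend the policy gradient
        actor_opt.zero_grad()
        actor_loss.backward()
        actor_opt.step()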
Further, the description of steps S701 to S703 is as follows:
Step S701, updating the initial actor network parameters in the deterministic policy through the actor current network, selecting a current action according to the current state to interact with the current environment, generating the next state, and transmitting the next state to the experience replay pool;
Step S702, acquiring the next state from the experience replay pool through the actor target network, selecting an optimal next action according to the next state, and calculating the target actor network parameters;
Step S703, determining the latest actor network parameters through the actor target network according to the initial actor network parameters, the target actor network parameters and the inertia update rate, and updating the target actor network parameters with the latest actor network parameters to obtain the optimal actor network.
Specifically, the aircraft control system updates the initial actor network parameters $\theta^{\mu}$ in the deterministic policy through the actor current network, selects a current action $a_t$ according to the current state $s_t$ to interact with the current environment, generating the next state $s_{t+1}$ and a reward, and transmits the next state $s_{t+1}$ and the reward to the experience replay pool. Then, the aircraft control system acquires the next state $s_{t+1}$ from the experience replay pool through the actor target network, selects an optimal next action $a_{t+1}$ according to the next state $s_{t+1}$, and calculates the target actor network parameters $\theta^{\mu'}$. Then, the aircraft control system determines the latest actor network parameters through the actor target network according to the initial actor network parameters $\theta^{\mu}$, the target actor network parameters $\theta^{\mu'}$ and the inertia update rate $\tau$, where the inertia update rate $\tau$ is typically set to 0.001. Finally, the aircraft control system periodically copies the latest actor network parameters to the target actor network parameters $\theta^{\mu'}$ through the actor target network, so as to update the target actor network parameters and obtain the optimal actor network. Thus, the update by which the aircraft control system copies the latest actor network parameters to the target actor network parameters can be expressed as:
$\theta^{\mu'} \leftarrow \tau \theta^{\mu} + (1 - \tau) \theta^{\mu'}$
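A sketch of this soft (inertia) update, applied identically to the actor here and to the critic in step S706 below (again assuming the network objects from the sketches above):

    TAU = 0.001   # inertia update rate stated in the text

    def soft_update(target_net, current_net, tau=TAU):
        """theta_target <- tau * theta_current + (1 - tau) * theta_target."""
        with torch.no_grad():
            for p_t, p in zip(target_net.parameters(), current_net.parameters()):
                p_t.mul_(1.0 - tau).add_(tau * p)

    soft_update(actor_target, actor)       # actor target network update
    soft_update(critic_target, critic)     # critic target network update (step S706)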
the method and the device for controlling the aircraft in the embodiment of the application update the parameters of the actor network through the actor current network and the actor target network in the depth certainty strategy gradient algorithm to obtain the optimal actor network, so that the online controller which enables the online controller to have good adaptability and robustness is constructed through the optimal actor network, and the accurate control of the aircraft is further realized.
Further, the description of steps S704 to S706 is as follows:
step S704, updating the initial critic network parameters in the action value function through the critic current network, calculating the function value of the current action, and constructing a to-be-processed action value function according to the function value of the current action;
step S705, estimating the to-be-processed action value function through the critic target network to obtain a target critic network parameter;
step S706, determining the latest critic network parameters through the critic target network according to the initial critic network parameters, the target critic network parameters and the inertia updating rate, and updating the target critic network parameters through the latest critic network parameters to obtain the optimal critic network.
Specifically, the aircraft control system updates the initial critic network parameters $\theta^{Q}$ in the action value function through the critic current network, calculates the function value of the current action, and constructs a to-be-processed action value function according to the function value of the current action. Then, the aircraft control system estimates the to-be-processed action value function through the critic target network to obtain the target critic network parameters $\theta^{Q'}$. Then, the aircraft control system determines the latest critic network parameters through the critic target network according to the initial critic network parameters $\theta^{Q}$, the target critic network parameters $\theta^{Q'}$ and the inertia update rate $\tau$, where the inertia update rate $\tau$ is typically set to 0.001. Finally, the aircraft control system periodically copies the latest critic network parameters to the target critic network parameters $\theta^{Q'}$ through the critic target network, so as to update the target critic network parameters and obtain the optimal critic network. Thus, copying the latest critic network parameters to the target critic network parameters can be expressed as:
$\theta^{Q'} \leftarrow \tau \theta^{Q} + (1 - \tau) \theta^{Q'}$
according to the method and the device, the critic network parameters are updated through the critic current network and the critic target network in the depth certainty strategy gradient algorithm to obtain the optimal critic network, so that the online controller which has good adaptability and robustness is constructed through the optimal critic network, and the accurate control of the aircraft is further realized.
And step S80, constructing an online controller based on the optimal actor network and the optimal critic network, and controlling the aircraft through the online controller.
The aircraft control system constructs the online controller from the optimal actor network and the optimal critic network, completing the offline training of the online controller. It can be understood that when the aircraft control system detects an input state, the input state is evaluated by the online controller to complete the control of the aircraft. In one embodiment, the online controller determines the action for the input state through the optimal actor network, judges through the optimal critic network whether the action determined by the optimal actor network is appropriate, and controls the aircraft through the cooperation of the optimal actor network and the optimal critic network.
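A sketch of deploying the trained actor as the online controller, building on the Actor sketch above (the checkpoint file name is an illustrative assumption):

    # Load the offline-trained optimal actor and use it online.
    actor.load_state_dict(torch.load("optimal_actor.pt"))
    actor.eval()

    def online_controller(obs):
        """Map the current 9-dimensional observation to a pitch rudder command."""
        with torch.no_grad():
            obs_t = torch.as_tensor(obs, dtype=torch.float32).unsqueeze(0)
            return float(actor(obs_t).squeeze())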
This embodiment provides an aircraft control method in which the online controller is trained offline through the deep deterministic policy gradient algorithm and aircraft control is realized through the online controller. Because the online controller is obtained by training with the deep deterministic policy gradient algorithm, it adapts well to model characteristic changes caused by unknown parameter disturbances, fault inputs and model uncertainty. Meanwhile, the online controller realizes good command following of the aircraft under parameter deviations, disturbances and faults of a certain degree, has strong robustness and generalization performance, and realizes accurate control of the aircraft through its good adaptability and robustness.
Further, the aircraft control device provided by the present application is described below; the aircraft control device described below and the aircraft control method described above correspond to each other and may be referred to jointly.
As shown in fig. 3, fig. 3 is a schematic structural diagram of an aircraft control device provided in the present application, where the aircraft control device includes:
a determining module 301, configured to determine observation data and agent action data of an aircraft according to model parameters in an aircraft model;
a training module 302, configured to train an actor network and a critic network based on the observation data and/or the agent action data, and output a deterministic policy and an action value function;
an updating module 303, configured to update an actor network parameter in the deterministic policy and a critic network parameter in the action value function through a deep deterministic policy gradient algorithm to obtain an optimal actor network and an optimal critic network;
and a control module 304, configured to construct an online controller based on the optimal actor network and the optimal critic network, and control the aircraft through the online controller.
Further, the updating module 303 is further configured to:
updating initial actor network parameters in the deterministic policy through the actor current network, selecting a current action according to the current state to interact with the current environment, generating a next state, and transmitting the next state to an experience replay pool;
acquiring the next state from the experience replay pool through the actor target network, selecting an optimal next action according to the next state, and calculating target actor network parameters;
and determining the latest actor network parameters through the actor target network according to the initial actor network parameters, the target actor network parameters and the inertia update rate, and updating the target actor network parameters with the latest actor network parameters to obtain the optimal actor network.
Further, the updating module 303 is further configured to:
updating initial critic network parameters in the action value function through the critic current network, calculating a function value of the current action, and constructing a to-be-processed action value function according to the function value of the current action;
estimating the to-be-processed action value function through the critic target network to obtain target critic network parameters;
and determining the latest critic network parameters through the critic target network according to the initial critic network parameters, the target critic network parameters and the inertia update rate, and updating the target critic network parameters with the latest critic network parameters to obtain the optimal critic network.
Further, the training module 302 is further configured to:
determining the observation data as input data for the first input layer;
processing the observation data through the first intermediate fully-connected layer, outputting a rudder deflection angle action, and determining the rudder deflection angle action as the deterministic policy;
wherein the first input layer is a single input layer of 9 neurons, and the first intermediate fully-connected layer consists of two fully-connected layers of 49 neurons each.
Further, the training module 302 is further configured to:
determining the observation data and the agent action data as input data of the second input layer;
processing the observation data through the second intermediate fully-connected layer to obtain first data to be processed;
processing the agent action data through the third intermediate fully-connected layer to obtain second data to be processed;
summing the first data to be processed and the second data to be processed to obtain target data, processing the target data through the third intermediate fully-connected layer, and outputting the action value function;
wherein the second input layer is a single input layer of 10 neurons, the second intermediate fully-connected layer consists of two fully-connected layers of 49 neurons each, and the third intermediate fully-connected layer is a single fully-connected layer of 49 neurons.
Further, the determining module 301 is further configured to:
determining, with the step as the period, preset frames of data of the deviation between the actual overload and the command, the pitch angular velocity and the pseudo angle of attack in the model parameters as the observation data;
and determining the pitch rudder deflection angle in the model parameters as the agent action data.
Further, the aircraft control device further comprises a building module for:
determining a transfer function from normal overload to pitch rudder deflection angle through transfer-function combination with the normal overload, the pitch rudder deflection angle and the characteristic parameters of the aircraft;
determining a transfer function from pitch angular velocity to pitch rudder deflection angle through transfer-function combination with the pitch angular velocity, the pitch rudder deflection angle and the characteristic parameters of the aircraft;
determining a transfer function from angle of attack to pitch rudder deflection angle through transfer-function combination with the angle of attack, the pitch rudder deflection angle and the characteristic parameters of the aircraft;
and constructing the aircraft model from the transfer function from normal overload to pitch rudder deflection angle, the transfer function from pitch angular velocity to pitch rudder deflection angle and the transfer function from angle of attack to pitch rudder deflection angle.
The specific embodiment of the aircraft control device provided in the present application is substantially the same as the embodiments of the aircraft control method described above, and details are not described here.
Fig. 4 illustrates a physical structure diagram of an electronic device, which may include, as shown in fig. 4: a processor (processor) 410, a communication Interface 420, a memory (memory) 430 and a communication bus 440, wherein the processor 410, the communication Interface 420 and the memory 430 are communicated with each other via the communication bus 440. The processor 410 may invoke logic instructions in the memory 430 to perform an aircraft control method comprising:
determining observation data and agent action data of the aircraft according to model parameters in the aircraft model;
training an actor network and a critic network based on the observation data and/or the agent action data, and outputting a deterministic policy and an action value function;
updating an actor network parameter in the deterministic policy and a critic network parameter in the action value function through a deep deterministic policy gradient algorithm to obtain an optimal actor network and an optimal critic network;
and constructing an online controller based on the optimal actor network and the optimal critic network, and controlling the aircraft through the online controller.
In addition, the logic instructions in the memory 430 may be implemented in the form of software functional units and stored in a computer-readable storage medium when sold or used as independent products. Based on such understanding, the technical solution of the present application, or the portion thereof that substantially contributes to the prior art, may be embodied in the form of a software product stored in a storage medium, including several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the methods according to the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash disk, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disk.
In another aspect, the present application also provides a computer program product comprising a computer program stored on a non-transitory computer readable storage medium, the computer program comprising program instructions which, when executed by a computer, enable the computer to perform the aircraft control method provided by the above-mentioned methods, the method comprising:
determining observation data and agent action data of the aircraft according to model parameters in the aircraft model;
training an actor network and a critic network based on the observation data and/or the agent action data, and outputting a deterministic policy and an action value function;
updating an actor network parameter in the deterministic policy and a critic network parameter in the action value function through a deep deterministic policy gradient algorithm to obtain an optimal actor network and an optimal critic network;
and constructing an online controller based on the optimal actor network and the optimal critic network, and controlling the aircraft through the online controller.
In yet another aspect, the present application also provides a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the aircraft control method provided above, the method comprising:
determining observation data and agent action data of the aircraft according to model parameters in the aircraft model;
training an actor network and a critic network based on the observation data and/or the agent action data, and outputting a deterministic policy and an action value function;
updating an actor network parameter in the deterministic policy and a critic network parameter in the action value function through a deep deterministic policy gradient algorithm to obtain an optimal actor network and an optimal critic network;
and constructing an online controller based on the optimal actor network and the optimal critic network, and controlling the aircraft through the online controller.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solutions of the present application, and not to limit the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions in the embodiments of the present application.

Claims (9)

1. An aircraft control method, comprising:
combining normal overload of aircraft by transfer function
Figure 445244DEST_PATH_IMAGE001
Pitching rudder deflection angle
Figure 17170DEST_PATH_IMAGE002
And characteristic parameters
Figure 114439DEST_PATH_IMAGE003
Determining a first transfer function of normal overload to pitch rudder deflection angle, said first transfer function being expressed as
Figure 71900DEST_PATH_IMAGE004
Combining the pitch angle velocity of the aircraft by the transfer function
Figure DEST_PATH_IMAGE005
Pitching rudder deflection angle
Figure 370157DEST_PATH_IMAGE006
And characteristic parameters
Figure 554014DEST_PATH_IMAGE003
Determining a second transfer function of pitch angle velocity to pitch rudder deflection angle, said second transfer function being expressed as
Figure DEST_PATH_IMAGE007
Combining the angle of attack of the aircraft by the transfer function
Figure 612231DEST_PATH_IMAGE008
Pitching rudder deflection angle
Figure 909351DEST_PATH_IMAGE009
And characteristic parameters
Figure 237564DEST_PATH_IMAGE003
Determining a third transfer function of the angle of attack to the pitch rudder deflection angle, said third transfer function being expressed as
Figure 502193DEST_PATH_IMAGE010
Combining the first transfer function, the second transfer function and the third transfer function to construct the aircraft model;
setting a reward function through the normal overload, the pitch angle rate, the angle of attack, the pitch rudder deflection angle and the characteristic parameters of the aircraft, the reward function being expressed as (equation image), wherein:
the immediate reward (equation image) depends on the deviation (equation image) of the actual overload from the command, outputting a large penalty when the deviation between the control effect and the actual effect is large and a penalty of almost zero when the deviation is small;
the term (equation image) constrains the energy of the control input;
the sparse reward (equation image) takes 1 if the overload deviation is greater than 0.1 and less than 0.5, takes 5 if the overload deviation is less than 0.1, and takes 0 in all other cases; and
the termination reward (equation image) ends the exploration and outputs (equation image) = -500 when the overload deviation of the system is greater than 100 or less than -100 (a sketch of this reward logic is given after this claim);
determining observation data and agent action data of the aircraft according to model parameters in the aircraft model;
training an actor network and a critic network based on the observation data and/or the agent action data, and outputting a deterministic policy and an action value function;
updating the actor network parameters in the deterministic policy and the critic network parameters in the action value function through a deep deterministic policy gradient algorithm to obtain an optimal actor network and an optimal critic network;
constructing an online controller based on the optimal actor network and the optimal critic network, and controlling the aircraft through the online controller; and
assisting the training of the actor network and the critic network through an auxiliary controller, wherein the auxiliary controller is a proportional controller whose gain coefficient is (equation image), and the pitch rudder deflection angle output by the actor network is zero when a steady state is determined to have been reached from the command, the auxiliary controller, the pitch rudder deflection angle, the normal overload, the flight action and the current environment.
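For concreteness, here is a minimal Python sketch of the reward logic this claim describes. Only the sparse-reward thresholds (0.1, 0.5), their values (1, 5, 0) and the termination rule (end the episode and output -500 when the overload deviation exceeds ±100) are taken from the claim text; the functional forms of the immediate reward and the energy term appear only as equation images in the original, so the quadratic penalties and the weight w_energy below are assumptions.

```python
def reward(overload_dev: float, rudder_cmd: float, w_energy: float = 0.01):
    """Hedged sketch of the claim's reward: immediate penalty on overload
    deviation, an energy term on the control input, a sparse bonus, and a
    termination penalty. Forms of r1 and r2 are assumed, not from the patent."""
    # Immediate reward: large penalty for large deviation, near zero for small
    # (assumed quadratic; the patent gives this only as an equation image).
    r1 = -overload_dev ** 2
    # Energy term constraining the control input (assumed quadratic form).
    r2 = -w_energy * rudder_cmd ** 2
    # Sparse reward r3: thresholds and values taken from the claim text.
    abs_dev = abs(overload_dev)
    if abs_dev < 0.1:
        r3 = 5.0
    elif abs_dev < 0.5:
        r3 = 1.0
    else:
        r3 = 0.0
    # Termination penalty r4: exploration ends when the deviation diverges.
    done = overload_dev > 100 or overload_dev < -100
    r4 = -500.0 if done else 0.0
    return r1 + r2 + r3 + r4, done
```

In use, the environment would call this once per control step and feed the returned done flag back into the exploration loop.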
2. The aircraft control method of claim 1, wherein the deep deterministic policy gradient algorithm comprises an actor current network and an actor target network, and
updating the actor network parameters in the deterministic policy through the deep deterministic policy gradient algorithm to obtain the optimal actor network comprises:
updating initial actor network parameters in the deterministic policy through the actor current network, selecting a current action according to the current state to interact with the current environment, generating a next state, and transmitting the next state to an experience replay pool;
acquiring the next state from the experience replay pool through the actor target network, selecting an optimal next action according to the next state, and calculating target actor network parameters; and
determining the latest actor network parameters through the actor target network according to the initial actor network parameters, the target actor network parameters and the inertia update rate, and updating the target actor network parameters with the latest actor network parameters to obtain the optimal actor network.
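The "inertia update rate" step of this claim is, in standard DDPG terms, a soft (Polyak) blend of current-network parameters into the target network. A minimal PyTorch sketch follows; the rate value tau = 0.005 is a conventional assumption, not a figure from the patent.

```python
import torch

def soft_update(target_net: torch.nn.Module,
                current_net: torch.nn.Module,
                tau: float = 0.005):
    """Blend current-network parameters into the target network at the
    inertia update rate tau; the target keeps (1 - tau) of its old values."""
    with torch.no_grad():
        for tgt, cur in zip(target_net.parameters(), current_net.parameters()):
            tgt.mul_(1.0 - tau).add_(tau * cur)
```

The same routine serves both the actor pair described here and the critic pair of claim 3.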
3. The aircraft control method of claim 1, wherein the deep deterministic policy gradient algorithm comprises a critic current network and a critic target network, and
updating the critic network parameters in the action value function through the deep deterministic policy gradient algorithm to obtain the optimal critic network comprises:
updating initial critic network parameters in the action value function through the critic current network, calculating a function value of the current action, and constructing a to-be-processed action value function according to the function value of the current action;
estimating the to-be-processed action value function through the critic target network to obtain target critic network parameters; and
determining the latest critic network parameters through the critic target network according to the initial critic network parameters, the target critic network parameters and the inertia update rate, and updating the target critic network parameters with the latest critic network parameters to obtain the optimal critic network.
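A hedged sketch of the critic update this claim outlines: the critic target network evaluates the actor target network's next action to form a temporal-difference target for the critic current network. The discount factor gamma and the mean-squared-error loss are conventional DDPG choices assumed here, not stated in the patent.

```python
import torch
import torch.nn.functional as F

def critic_loss(critic, critic_target, actor_target, batch, gamma=0.99):
    """One critic update step on a minibatch sampled from the replay pool.
    batch = (obs, act, rew, next_obs, done), all torch tensors."""
    obs, act, rew, next_obs, done = batch
    with torch.no_grad():
        next_act = actor_target(next_obs)            # optimal next action
        target_q = rew + gamma * (1.0 - done) * critic_target(next_obs, next_act)
    current_q = critic(obs, act)                     # current action value
    return F.mse_loss(current_q, target_q)           # drives the parameter update
```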
4. The aircraft control method of claim 1, wherein the actor network comprises a first input layer and a first intermediate fully connected layer, and
training the actor network based on the observation data and outputting the deterministic policy comprises:
determining the observation data as the input data of the first input layer; and
processing the observation data through the first intermediate fully connected layer, outputting a rudder deflection angle action, and determining the rudder deflection angle action as the deterministic policy,
wherein the first input layer is a single input layer of 9 neurons, and the first intermediate fully connected layer consists of two fully connected layers of 49 neurons each.
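Per this claim, the actor maps a 9-dimensional observation through two 49-neuron fully connected layers to a single rudder deflection action. A PyTorch sketch; the ReLU/tanh activations and the output scaling max_deflection are assumptions, since the patent does not state them.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Actor network of claim 4: 9 inputs, two 49-unit hidden layers,
    one bounded rudder-deflection output."""
    def __init__(self, max_deflection: float = 30.0):  # degrees, assumed
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(9, 49), nn.ReLU(),
            nn.Linear(49, 49), nn.ReLU(),
            nn.Linear(49, 1), nn.Tanh(),  # squash the action to [-1, 1]
        )
        self.max_deflection = max_deflection

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.max_deflection * self.net(obs)
```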
5. The aircraft control method of claim 1, wherein the critic network comprises a second input layer, a second intermediate fully connected layer and a third intermediate fully connected layer, and
training the critic network based on the observation data and the agent action data and outputting the action value function comprises:
determining the observation data and the agent action data as the input data of the second input layer;
processing the observation data through the second intermediate fully connected layer to obtain first to-be-processed data;
processing the agent action data through the third intermediate fully connected layer to obtain second to-be-processed data; and
summing the first to-be-processed data and the second to-be-processed data to obtain target data, processing the target data through the third intermediate fully connected layer, and outputting the action value function,
wherein the second input layer is an input layer of 10 neurons, the second intermediate fully connected layer is a fully connected layer of 49 neurons, and the third intermediate fully connected layer is a fully connected layer of 49 neurons.
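A sketch of the two-branch critic this claim describes: one 49-unit layer for the 9-dimensional observation and one 49-unit layer for the 1-dimensional action (10 inputs in total), summed and mapped to a scalar action value. The ReLU activation and the final linear head are assumptions; the claim's reuse of the third layer on the summed data is simplified here to a single head.

```python
import torch
import torch.nn as nn

class Critic(nn.Module):
    """Critic network of claim 5: observation and action branches are
    summed before producing the scalar action value Q(s, a)."""
    def __init__(self):
        super().__init__()
        self.obs_branch = nn.Linear(9, 49)  # second intermediate FC layer
        self.act_branch = nn.Linear(1, 49)  # third intermediate FC layer
        self.head = nn.Linear(49, 1)        # scalar Q-value head (assumed)

    def forward(self, obs: torch.Tensor, act: torch.Tensor) -> torch.Tensor:
        h = torch.relu(self.obs_branch(obs) + self.act_branch(act))
        return self.head(h)
```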
6. The aircraft control method of claim 1, wherein determining the observation data and the agent action data of the aircraft according to the model parameters in the aircraft model comprises:
determining, with the step length as the period, a preset number of frames of the deviation of the actual overload from the command, the pitch angle rate and the pseudo angle of attack in the model parameters as the observation data; and
determining the pitch rudder deflection angle in the model parameters as the agent action data.
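A sketch of the observation construction in this claim: the most recent frames of overload deviation, pitch angle rate and pseudo angle of attack are stacked once per control step. The frame count of 3 is an inference (3 frames of 3 signals matches the actor's 9-neuron input layer in claim 4), not a figure stated in this claim.

```python
from collections import deque
import numpy as np

N_FRAMES = 3  # inferred from the 9-neuron input layer of claim 4

class ObservationStack:
    """Hold the most recent frames of (overload deviation, pitch angle rate,
    pseudo angle of attack), sampled once per step period."""
    def __init__(self):
        self.frames = deque([(0.0, 0.0, 0.0)] * N_FRAMES, maxlen=N_FRAMES)

    def step(self, overload_dev: float, pitch_rate: float, pseudo_aoa: float):
        self.frames.append((overload_dev, pitch_rate, pseudo_aoa))
        # Flatten to the 9-dimensional observation vector fed to the actor.
        return np.asarray(list(self.frames), dtype=np.float32).ravel()
```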
7. An aircraft control device, comprising:
a construction module, configured to: combine, through a transfer function, the normal overload (equation image), the pitch rudder deflection angle (equation image) and the characteristic parameters (equation image) of the aircraft, and determine a first transfer function of normal overload to pitch rudder deflection angle, the first transfer function being expressed as (equation image); combine, through a transfer function, the pitch angle rate (equation image), the pitch rudder deflection angle (equation image) and the characteristic parameters (equation image) of the aircraft, and determine a second transfer function of pitch angle rate to pitch rudder deflection angle, the second transfer function being expressed as (equation image); combine, through a transfer function, the angle of attack (equation image), the pitch rudder deflection angle (equation image) and the characteristic parameters (equation image) of the aircraft, and determine a third transfer function of angle of attack to pitch rudder deflection angle, the third transfer function being expressed as (equation image); and combine the first transfer function, the second transfer function and the third transfer function to construct the aircraft model;
the construction module being further configured to set a reward function through the normal overload, the pitch angle rate, the angle of attack, the pitch rudder deflection angle and the characteristic parameters of the aircraft, the reward function being expressed as (equation image), wherein: the immediate reward (equation image) depends on the deviation (equation image) of the actual overload from the command, outputting a large penalty when the deviation between the control effect and the actual effect is large and a penalty of almost zero when the deviation is small; the term (equation image) constrains the energy of the control input; the sparse reward (equation image) takes 1 if the overload deviation is greater than 0.1 and less than 0.5, takes 5 if the overload deviation is less than 0.1, and takes 0 in all other cases; and the termination reward (equation image) ends the exploration and outputs (equation image) = -500 when the overload deviation of the system is greater than 100 or less than -100;
a determination module, configured to determine observation data and agent action data of the aircraft according to model parameters in the aircraft model;
a training module, configured to train an actor network and a critic network based on the observation data and/or the agent action data, and to output a deterministic policy and an action value function;
an updating module, configured to update the actor network parameters in the deterministic policy and the critic network parameters in the action value function through a deep deterministic policy gradient algorithm to obtain an optimal actor network and an optimal critic network;
a control module, configured to construct an online controller based on the optimal actor network and the optimal critic network and to control the aircraft through the online controller; and
an auxiliary training module, configured to assist the training of the actor network and the critic network through an auxiliary controller, wherein the auxiliary controller is a proportional controller whose gain coefficient is (equation image), and the pitch rudder deflection angle output by the actor network is zero when a steady state is determined to have been reached from the command, the auxiliary controller, the pitch rudder deflection angle, the normal overload, the flight action and the current environment.
8. An electronic device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the steps of the aircraft control method according to any one of claims 1 to 6.
9. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the steps of the aircraft control method according to any one of claims 1 to 6.
CN202111608105.1A 2021-12-27 2021-12-27 Aircraft control method, device, equipment and computer readable storage medium Active CN113985924B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111608105.1A CN113985924B (en) 2021-12-27 2021-12-27 Aircraft control method, device, equipment and computer readable storage medium


Publications (2)

Publication Number Publication Date
CN113985924A (en) 2022-01-28
CN113985924B (en) 2022-04-08

Family

ID=79734400

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111608105.1A Active CN113985924B (en) 2021-12-27 2021-12-27 Aircraft control method, device, equipment and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN113985924B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112286218A (en) * 2020-12-29 2021-01-29 南京理工大学 Aircraft large-attack-angle rock-and-roll suppression method based on depth certainty strategy gradient
CN112363519A (en) * 2020-10-20 2021-02-12 天津大学 Four-rotor unmanned aerial vehicle reinforcement learning nonlinear attitude control method
CN112558470A (en) * 2020-11-24 2021-03-26 中国科学技术大学 Optimal consistency control method and device for actuator saturated multi-agent system
CN112861442A (en) * 2021-03-10 2021-05-28 中国人民解放军国防科技大学 Multi-machine collaborative air combat planning method and system based on deep reinforcement learning
CN112966816A (en) * 2021-03-31 2021-06-15 东南大学 Multi-agent reinforcement learning method surrounded by formation
CN113391556A (en) * 2021-08-12 2021-09-14 中国科学院自动化研究所 Group distributed control method and device based on role distribution

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
FR3101703B1 (en) * 2019-10-03 2021-11-26 Thales Sa AUTOMATIC LEARNING FOR MISSION SYSTEM
CN111595210A (en) * 2020-04-30 2020-08-28 南京理工大学 Precise vertical recovery control method for large-airspace high-dynamic rocket sublevel landing area


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Intelligent cooperative attack-defense confrontation of multiple UAVs with asymmetric maneuvering capability; Chen Can, et al.; Acta Aeronautica et Astronautica Sinica; 2020-12-31; Vol. 41, No. 12; 342-354 *

Also Published As

Publication number Publication date
CN113985924A (en) 2022-01-28

Similar Documents

Publication Publication Date Title
JP7003355B2 (en) Autonomous navigation with a spiking neuromorphic computer
JP7014368B2 (en) Programs, methods, devices, and computer-readable storage media
EP3485432B1 (en) Training machine learning models on multiple machine learning tasks
EP3480741B1 (en) Reinforcement and imitation learning for a task
EP3459021B1 (en) Training neural networks using synthetic gradients
CN107209872B (en) Systems, methods, and storage media for training a reinforcement learning system
CN108051999B (en) Accelerator beam orbit control method and system based on deep reinforcement learning
CN111856925B (en) State trajectory-based confrontation type imitation learning method and device
EP3586277A1 (en) Training policy neural networks using path consistency learning
KR101706367B1 (en) Neural network-based fault-tolerant control method of underactuated autonomous vehicle
CN110088775B (en) Environmental prediction using reinforcement learning
CN112119404A (en) Sample efficient reinforcement learning
CN110692066A (en) Selecting actions using multimodal input
WO2018189404A1 (en) Distributional reinforcement learning
JP6446126B2 (en) Processing system and program
CN107797454A (en) Multi-agent system collaboration fault tolerant control method based on finite-time control
US20220366246A1 (en) Controlling agents using causally correct environment models
CN114253296B (en) Hypersonic aircraft airborne track planning method and device, aircraft and medium
Farivar et al. Continuous reinforcement learning to robust fault tolerant control for a class of unknown nonlinear systems
Baker A learning-boosted quasi-newton method for ac optimal power flow
CN111914069A (en) Training method and device, dialogue processing method and system and medium
CN112947505B (en) Multi-AUV formation distributed control method based on reinforcement learning algorithm and unknown disturbance observer
Meng et al. Adaptive fault tolerant control for a class of switched nonlinear systems with unknown control directions
CN111487992A (en) Unmanned aerial vehicle sensing and obstacle avoidance integrated method and device based on deep reinforcement learning
Zhang et al. Adaptive asymptotic tracking control for autonomous underwater vehicles with non-vanishing uncertainties and input saturation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant