CN111190429A - Unmanned aerial vehicle active fault-tolerant control method based on reinforcement learning - Google Patents
Unmanned aerial vehicle active fault-tolerant control method based on reinforcement learning
- Publication number
- CN111190429A (application CN202010030358.4A)
- Authority
- CN
- China
- Prior art keywords
- unmanned aerial
- aerial vehicle
- fault
- current
- reinforcement learning
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Links
- 230000002787 reinforcement Effects 0.000 title claims abstract description 58
- 238000000034 method Methods 0.000 title claims abstract description 54
- 238000012549 training Methods 0.000 claims abstract description 53
- 238000011156 evaluation Methods 0.000 claims abstract description 35
- 238000004422 calculation algorithm Methods 0.000 claims abstract description 31
- 230000008569 process Effects 0.000 claims abstract description 21
- 230000002068 genetic effect Effects 0.000 claims abstract description 20
- 230000009471 action Effects 0.000 claims description 31
- 238000005070 sampling Methods 0.000 claims description 24
- 230000006870 function Effects 0.000 claims description 21
- 230000007704 transition Effects 0.000 claims description 15
- 238000005457 optimization Methods 0.000 claims description 5
- 230000008092 positive effect Effects 0.000 claims description 5
- 230000035772 mutation Effects 0.000 claims description 4
- 238000002347 injection Methods 0.000 claims description 3
- 239000007924 injection Substances 0.000 claims description 3
- 238000004364 calculation method Methods 0.000 claims description 2
- 238000013459 approach Methods 0.000 abstract description 3
- 238000011217 control strategy Methods 0.000 abstract description 3
- 230000036544 posture Effects 0.000 abstract description 2
- 230000000694 effects Effects 0.000 description 7
- 238000011160 research Methods 0.000 description 5
- 238000013528 artificial neural network Methods 0.000 description 3
- 238000003745 diagnosis Methods 0.000 description 3
- 230000008859 change Effects 0.000 description 2
- 238000013461 design Methods 0.000 description 2
- 238000011161 development Methods 0.000 description 2
- 238000010586 diagram Methods 0.000 description 2
- 238000005516 engineering process Methods 0.000 description 2
- 239000000284 extract Substances 0.000 description 2
- 230000036541 health Effects 0.000 description 2
- 239000011159 matrix material Substances 0.000 description 2
- 230000005540 biological transmission Effects 0.000 description 1
- 230000015556 catabolic process Effects 0.000 description 1
- 238000004891 communication Methods 0.000 description 1
- 238000001514 detection method Methods 0.000 description 1
- 230000008030 elimination Effects 0.000 description 1
- 238000003379 elimination reaction Methods 0.000 description 1
- 238000002474 experimental method Methods 0.000 description 1
- 238000013178 mathematical model Methods 0.000 description 1
- 238000012986 modification Methods 0.000 description 1
- 230000004048 modification Effects 0.000 description 1
- 230000000750 progressive effect Effects 0.000 description 1
- 238000004088 simulation Methods 0.000 description 1
- 238000012795 verification Methods 0.000 description 1
Images
Classifications
-
- G—PHYSICS
- G05—CONTROLLING; REGULATING
- G05D—SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
- G05D1/00—Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
- G05D1/08—Control of attitude, i.e. control of roll, pitch, or yaw
- G05D1/0808—Control of attitude, i.e. control of roll, pitch, or yaw specially adapted for aircraft
-
- G—PHYSICS
- G05—CONTROLLING; REGULATING
- G05D—SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
- G05D1/00—Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
- G05D1/10—Simultaneous control of position or course in three dimensions
- G05D1/101—Simultaneous control of position or course in three dimensions specially adapted for aircraft
Landscapes
- Engineering & Computer Science (AREA)
- Aviation & Aerospace Engineering (AREA)
- Radar, Positioning & Navigation (AREA)
- Remote Sensing (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Automation & Control Theory (AREA)
- Feedback Control In General (AREA)
- Control Of Position, Course, Altitude, Or Attitude Of Moving Bodies (AREA)
Abstract
The invention discloses an unmanned aerial vehicle active fault-tolerant control method based on reinforcement learning, which comprises two stages. Early off-line training stage: the evaluation network of the reinforcement learning fault-tolerant controller is trained and updated using historical attitude data generated during unmanned aerial vehicle operation together with the controller output data; the evaluation network is an extreme learning machine optimized by a genetic algorithm, which improves the training speed and accuracy. System operation and on-line training stage: during operation of the unmanned aerial vehicle, the reinforcement learning evaluation network is updated on line in real time, so that the reinforcement learning fault-tolerant controller self-learns and self-improves during active fault-tolerant control, and real-time on-line updating of the extreme learning machine is achieved through a dynamic capacity-expansion updating algorithm. The invention optimizes the reinforcement learning method with an incremental strategy, achieves asymptotic approach to the optimal fault-tolerant control strategy, and thus better realizes fault-tolerant control of the unmanned aerial vehicle.
Description
Technical Field
The invention relates to an unmanned aerial vehicle active fault-tolerant control method based on reinforcement learning, in particular to an unmanned aerial vehicle active fault-tolerant control method based on extreme learning machine and incremental strategy reinforcement learning, and belongs to the technical field of unmanned aerial vehicle active fault-tolerant control.
Background
With the continuous development of aerospace technology, flight control systems have grown ever larger and more complex. As flight control systems advance, guaranteeing system stability also becomes a significant challenge: any type of fault can degrade or even break down system performance, destabilizing the control system and causing major losses. How to reduce or even eliminate the risk caused by system faults is therefore a problem worth studying, and to cope with failures of sensors, actuators and other components, many scholars at home and abroad have devoted great effort to the research directions of fault diagnosis and fault-tolerant control.
Most research work in recent years has focused on the design of the system controller, which is usually reconfigured using model-based methods. Advances in science and technology have made flight control systems increasingly complex, which poses great challenges to their mathematical modeling. Because data-based methods have high engineering application value, they have attracted more and more attention from industry in recent years, and reinforcement learning, as a data-based control method, has high research value.
At present, reinforcement learning is mainly applied in the field of optimal control theory, and there are still few research results applying reinforcement learning algorithms to active fault-tolerant control of unmanned aerial vehicles.
Disclosure of Invention
The technical problem to be solved by the invention is as follows: the unmanned aerial vehicle active fault-tolerant control method based on reinforcement learning overcomes problems of the prior art, such as the strong dependence of the fault-tolerant effect on the accuracy of the mathematical model and the poor fault-tolerant performance of traditional deterministic-strategy reinforcement learning, and offers high real-time performance and adaptability.
The invention adopts the following technical scheme for solving the technical problems:
an active fault-tolerant control method of an unmanned aerial vehicle based on reinforcement learning comprises the following steps:
step 1, establishing an unmanned aerial vehicle dynamic model, and performing fault injection on the unmanned aerial vehicle to obtain an unmanned aerial vehicle aircraft fault model under the fault condition;
step 2, defining five different incremental strategies, namely a non-compensation action, a positive action for compensating actuator faults, a negative action for compensating actuator faults, a positive action for compensating sensor faults and a negative action for compensating sensor faults, traversing the unmanned aerial vehicle fault model with each incremental strategy in turn, and acquiring unmanned aerial vehicle attitude data under each incremental strategy through a sensor;
step 3, training a reinforcement learning evaluation network based on a genetic algorithm-extreme learning machine with the unmanned aerial vehicle attitude data to obtain a trained reinforcement learning evaluation network;
step 4, using the unmanned aerial vehicle attitude data acquired while traversing the unmanned aerial vehicle aircraft fault model with the non-compensation action strategy of step 2 to train the state transition prediction network, obtaining a trained state transition prediction network;
step 5, setting the training data set to be empty; during operation of the unmanned aerial vehicle flight control system, acquiring attitude angle data S_k once per sampling period, combining each of the five different incremental strategies with the attitude angle data S_k to form input data for the current reinforcement learning evaluation network, and obtaining the reward values corresponding to the different incremental strategies under the current attitude angle;
step 6, selecting, according to the reward values corresponding to the different incremental strategies and in combination with the ε-greedy strategy, the optimal incremental strategy under the current attitude angle and executing it, to obtain the system instant return value Q(S_current, A_current);
step 7, predicting the attitude angle of the next sampling period from the current attitude angle data and the current state transition prediction network to obtain the attitude angle predicted value of the next sampling period;
step 8, repeating step 5 and step 6 on the attitude angle predicted value of the next sampling period to obtain the optimal incremental strategy corresponding to the next sampling period and the system instant return value Q(S_next, A_next), and calculating the reward value to be updated Q̃(S_current, A_current);
step 9, taking the current attitude angle data S_k, the optimal incremental strategy under the current attitude angle and the reward value to be updated Q̃(S_current, A_current) as a new data sample to expand the capacity of the current training data set, and updating the current reinforcement learning evaluation network with the current training data set;
step 10, repeating steps 5-9 for each sampling period until the flight mission is completed.
As a preferred scheme of the present invention, the fault model of the unmanned aerial vehicle under the fault condition in step 1 is specifically:
ẋ(t) = Ax(t) + Bu(t) + φ(t−t₁)f_a(t)
y(t) = Cx(t) + Du(t) + φ(t−t₂)Ff_s(t)
wherein x ∈ R^{4×1} is the state variable of the system, consisting of the pitch angle θ, the roll angle and their time derivatives; u is the control input; A, B, C, D are the system matrices; y is the output of the control system; φ(t−t₁)f_a(t) and φ(t−t₂)Ff_s(t) denote an actuator fault and a sensor fault in the flight control system, respectively, f_a(t) being the unknown actuator fault offset value and Ff_s(t) the unknown sensor fault offset value; φ(t−t_f) is the fault occurrence time function, with
φ(t−t_f) = 0 for t < t_f and φ(t−t_f) = 1 for t ≥ t_f,
where t_f is the time at which an unknown fault occurs in the flight control system and t denotes time.
As a preferred embodiment of the present invention, the specific process of step 3 is:
step 31, sorting the unmanned aerial vehicle attitude data acquired in the step 2 according to a time sequence order to form a training sample set;
and step 32, the reinforcement learning evaluation network based on the genetic algorithm-extreme learning machine contains a single hidden layer; a random parameter population of the extreme learning machine hidden-layer parameters is created by the genetic algorithm, individuals are eliminated according to a fitness function, and the remaining individuals undergo the inheritance, crossover and mutation operations of the genetic algorithm; the elimination-inheritance-crossover-mutation process is repeated until the fitness function reaches its optimal value, yielding the trained reinforcement learning evaluation network.
As a preferred scheme of the invention, the updated reward value Q̃(S_current, A_current) of step 8 is calculated as:
Q̃(S_current, A_current) = Q(S_current, A_current) + λ·Q(S_next, A_next)
wherein Q(S_current, A_current) denotes the system instant return value obtained by executing the optimal incremental strategy under the current attitude angle S_k, λ denotes a discount factor with 0 < λ < 1, and Q(S_next, A_next) denotes the system instant return value obtained by executing the optimal incremental strategy under the next attitude angle S_next.
As a preferred embodiment of the present invention, the specific updating method in step 9 is: the current reinforcement learning evaluation network is updated by the training algorithm of the genetic-algorithm-optimized extreme learning machine, which solves the output weights through the Moore-Penrose generalized inverse.
As a preferred scheme of the present invention, the current state transition prediction network described in step 7 is updated once every 10 sampling periods; if the state transition prediction network is to be updated in the current sampling period, the training data used for the update are the attitude angle data acquired in the current sampling period together with the attitude angle data acquired in the 9 sampling periods preceding it.
Compared with the prior art, the invention adopting the technical scheme has the following technical effects:
1. The invention uses a reinforcement learning controller whose evaluation network extracts features from the real-time data generated by the system, thereby acquiring fault information and adjusting the system controller on that basis; compared with traditional model-based fault-tolerant control methods, this data-based active fault-tolerant control method breaks through the limitation that complex systems are difficult to model, and extracting data features in place of a fault detection subsystem simplifies the controller design.
2. The invention provides an incremental-strategy reinforcement learning controller for the premise of uncertain faults, overcoming the limitation of the deterministic fixed strategies adopted in traditional reinforcement learning algorithms, and thereby approaching the optimal fault-tolerant strategy of the current faulty system.
3. The invention estimates the next state through a state transition prediction network, realizing real-time strategy network updating for the continuous control system.
4. The invention optimizes the reinforcement learning evaluation network with a genetic algorithm-extreme learning machine model; compared with traditional reinforcement learning methods, the optimized reinforcement learning model has a greatly enhanced ability to extract features from the data.
5. The invention provides a dynamic capacity-expansion updating algorithm for on-line updating of the extreme learning machine model, exploiting the rapidity of extreme learning machine training to achieve fast on-line updating of the reinforcement learning evaluation network.
Drawings
FIG. 1 is a flow chart of a control method of the present invention.
FIG. 2 is a block diagram of an active fault-tolerant controller for reinforcement learning according to the present invention.
FIG. 3 is a flow chart of the reinforcement learning evaluation network training process of the present invention.
FIG. 4 is a schematic diagram of a dynamic capacity-expansion updating algorithm of the extreme learning network model according to the present invention.
FIG. 5 illustrates the effect of fault-tolerant control of an active fault-tolerant controller in the event of actuator failure, in accordance with an embodiment of the present invention.
FIG. 6 illustrates the effect of fault-tolerant control of an active fault-tolerant controller in the event of a sensor failure in accordance with an embodiment of the present invention.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings. The embodiments described below with reference to the accompanying drawings are illustrative only for the purpose of explaining the present invention, and are not to be construed as limiting the present invention.
As shown in fig. 1 and fig. 2, the present invention provides an active fault-tolerant control method for an unmanned aerial vehicle based on reinforcement learning, which includes the steps of:
Step S1, early off-line training stage: an unmanned aerial vehicle dynamics model is established, and the evaluation network of the reinforcement learning fault-tolerant controller is trained and updated using historical attitude data generated during unmanned aerial vehicle operation together with the controller output data.
Step S2, system operation and on-line training stage: during operation of the unmanned aerial vehicle, the reinforcement learning evaluation network is updated on line in real time, so that the reinforcement learning fault-tolerant controller self-learns and self-improves during active fault-tolerant control, and real-time on-line updating of the extreme learning machine is achieved through a dynamic capacity-expansion updating algorithm. The invention optimizes the reinforcement learning method with an incremental strategy, achieves asymptotic approach to the optimal fault-tolerant control strategy, and thus better realizes fault-tolerant control of the unmanned aerial vehicle.
The specific implementation steps of the early off-line training stage of step S1 are as follows:
Step S11, establishing a dynamics model of the unmanned aerial vehicle. Considering that the drone flies at high altitude at constant speed, the dynamics model is described using a simplified three-degree-of-freedom model. The embodiment of the invention adopts the aircraft fault diagnosis experimental platform of the Key Laboratory of Advanced Aircraft Navigation, Control and Health Management (Ministry of Industry and Information Technology) of Nanjing University of Aeronautics and Astronautics; the fault model of the unmanned aerial vehicle under the established fault condition is as follows:
ẋ(t) = Ax(t) + Bu(t) + φ(t−t₁)f_a(t)
y(t) = Cx(t) + Du(t) + φ(t−t₂)Ff_s(t)
wherein x ∈ R^{4×1} is the state of the system, comprising the pitch angle θ, the roll angle and their time derivatives; u = [u₁ u₂ u₃ u₄]ᵀ is the control input; A ∈ R^{4×4}, B ∈ R^{4×1}, C ∈ R^{1×4}, D ∈ R^{1×1} are the system matrices; y ∈ R is the output of the control system; φ(t−t₁)f_a(t) and φ(t−t₂)Ff_s(t) denote actuator and sensor faults in the flight control system, respectively, where f_a(t) is the unknown actuator fault offset value and Ff_s(t) the bias value of the unknown sensor fault, with F ∈ R^{1×4} and f_s(t) ∈ R^{4×1}; φ(t−t_f) is the fault occurrence time function, defined as:
φ(t−t_f) = 0 for t < t_f, φ(t−t_f) = 1 for t ≥ t_f
where t_f is the time at which an unknown fault occurs in the flight control system; in the model built, the φ(t−t_f) function represents a sudden system fault (a fault occurring after time t_f). The system matrices are specifically represented as follows:
C = [0 1 0 0], D = 0
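The faulty state-space model above lends itself to a direct simulation. The following sketch is a minimal, assumption-laden illustration rather than the patent's implementation: it integrates the dynamics with the step fault function φ, but the numerical A and B matrices, the input signal and the fault offsets are placeholders, since only C and D are specified here.

```python
import numpy as np

def phi(t, t_f):
    """Sudden-fault time function: 0 before the fault time t_f, 1 afterwards."""
    return 1.0 if t >= t_f else 0.0

def simulate(A, B, C, D, u, f_a, F, f_s, t1, t2, dt=0.01, steps=1000):
    """Euler integration of  x' = A x + B u + phi(t - t1) f_a,
                             y  = C x + D u + phi(t - t2) F f_s."""
    x = np.zeros((A.shape[0], 1))
    outputs = []
    for k in range(steps):
        t = k * dt
        x = x + dt * (A @ x + B * u(t) + phi(t, t1) * f_a)
        y = (C @ x).item() + D * u(t) + phi(t, t2) * (F @ f_s).item()
        outputs.append(y)
    return np.array(outputs)

# Placeholder example -- A, B, the input and the fault biases are assumptions, not the patent's values.
A = np.array([[0, 1, 0, 0], [-2, -1, 0, 0], [0, 0, 0, 1], [0, 0, -2, -1]], dtype=float)
B = np.array([[0], [1], [0], [1]], dtype=float)
C = np.array([[0, 1, 0, 0]], dtype=float)   # per the embodiment: C = [0 1 0 0]
D = 0.0                                      # per the embodiment: D = 0
f_a = np.array([[0], [0.05], [0], [0]])      # assumed actuator fault offset
F = np.array([[0, 1, 0, 0]], dtype=float)    # F in R^{1x4}
f_s = np.array([[0], [0.1], [0], [0]])       # assumed sensor fault offset
y_trace = simulate(A, B, C, D, u=lambda t: 0.1, f_a=f_a, F=F, f_s=f_s, t1=3.0, t2=1e9)
```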
Step S12, operation data of the unmanned aerial vehicle control system are acquired through the established mathematical model: data are collected by the sensors both when the unmanned aerial vehicle operates normally and under fault conditions produced by fault injection. The specific data labels are the attitude Euler angle data, the serial number of the fault-tolerant strategy and the actual output of the control system, and the acquired data serve as the training data of the evaluation network. The selected data are variables of the control system that reflect its current operating state; the fault-tolerant controller extracts useful features from the system state and uses them as an important basis for its decisions. The fault-tolerant control system state S is defined as the attitude angles of the flight control system (pitch angle and roll angle); S is the label attribute of the data set, n is the serial number of the strategy action, and the acquired data are used as the training data of the evaluation network.
Step S13, the reinforcement learning Q-learning algorithm is optimized with the extreme learning machine: the evaluation network of the reinforcement learning fault-tolerant controller is a three-layer extreme learning machine network containing a single hidden layer, whose specific structure is shown in FIG. 3.
And step S14, performing off-line training and updating on the constructed extreme learning machine network according to the collected operation data, and optimizing the extreme learning machine network through a genetic algorithm. The process is as follows:
and step S141, forming a training data sample set by the acquired training data samples according to a time sequence.
S142, establishing a random parameter population of hidden layer parameters of the extreme learning machine through a genetic algorithm, and passing through a fitness function f through heredity, intersection and variation of the genetic algorithm processfitnessAnd (4) optimizing the population, and after a certain number of iterations, training to obtain an evaluation network model with the highest accuracy after the fitness function reaches the optimal value and does not change any more. Wherein the fitness function ffitnessIs represented as follows:
in the formula, yiIndicates the i-th sample expected output value, yi' represents an actual output value after the i-th sample is input into the model. After a certain number of iterations, after the fitness function reaches an optimal value and does not change any more, training to obtain an evaluation network model with the highest accuracy.
In the early off-line training stage, the training data are constructed as in step S141. For the extreme learning machine algorithm, the output-layer weights are updated by solving a linear equation through the Moore-Penrose generalized inverse. For the training process of the genetic-algorithm-optimized extreme learning machine model, a population of hidden-layer random parameter samples of a certain scale is first initialized randomly; all individuals in the population are then trained, and the error of each individual is computed to serve as its genetic-algorithm fitness; individuals are eliminated according to their fitness, and the surviving individuals undergo crossover, mutation and related operations; training of the next generation then continues, and the iteration proceeds in this manner. The specific training process is shown in FIG. 3.
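To make this training loop concrete, here is a minimal sketch of a genetic-algorithm-optimized extreme learning machine along the lines just described: each individual encodes random hidden-layer weights and biases, the output weights are solved in one shot by the Moore-Penrose generalized inverse, and the fitness is the reciprocal of the sum of squared training errors. Function names, the tanh activation and the population/generation sizes below are illustrative assumptions (the embodiment cites a population of 2000, 200 iterations and 128 hidden nodes).

```python
import numpy as np

def elm_output_weights(X, y, W, b):
    """Solve ELM output weights by the Moore-Penrose generalized inverse."""
    H = np.tanh(X @ W + b)            # hidden-layer output matrix
    return np.linalg.pinv(H) @ y

def fitness(X, y, W, b):
    """Reciprocal of the sum of squared training errors (to be maximized)."""
    beta = elm_output_weights(X, y, W, b)
    err = y - np.tanh(X @ W + b) @ beta
    return 1.0 / (np.sum(err ** 2) + 1e-12)

def ga_elm_train(X, y, n_hidden=128, pop_size=40, n_gen=30, seed=0):
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    pop = [(rng.standard_normal((d, n_hidden)), rng.standard_normal(n_hidden))
           for _ in range(pop_size)]
    for _ in range(n_gen):
        scores = np.array([fitness(X, y, W, b) for W, b in pop])
        keep = np.argsort(scores)[::-1][:pop_size // 2]
        survivors = [pop[i] for i in keep]                      # elimination
        children = []
        while len(survivors) + len(children) < pop_size:
            i, j = rng.choice(len(survivors), size=2, replace=False)
            (W1, b1), (W2, b2) = survivors[i], survivors[j]
            mask = rng.random(W1.shape) < 0.5                   # crossover
            W = np.where(mask, W1, W2) + 0.01 * rng.standard_normal(W1.shape)  # mutation
            b = np.where(rng.random(b1.shape) < 0.5, b1, b2)
            children.append((W, b))
        pop = survivors + children
    W, b = max(pop, key=lambda ind: fitness(X, y, *ind))
    return W, b, elm_output_weights(X, y, W, b)
```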
A historical experience quadruple (S_k, A_k, R_k, S_{k+1}) is defined, where S_k is the current state value of the unmanned aerial vehicle flight control system, A_k is the fault-tolerant strategy action made by the flight control system in the current state, R_k is the reward value obtained by taking action A_k in the current state, and S_{k+1} is the next state value reached by the flight control system after taking action A_k in the current state. During the training update of the reinforcement learning evaluation network, S_{k+1} must be obtained from S_k and A_k in order to update the Q function Q(S_k, A_k); the invention realizes the prediction of S_{k+1} through a state transition prediction network.
The training process of the extreme learning machine is iteratively optimized by the genetic algorithm: the initial population size of the genetic algorithm is 2000 and the number of iterations is 200; the fitness function is the reciprocal of the sum of squared model training errors, and since the algorithm maximizes the fitness function, the error is minimized; the number of hidden-layer nodes of the extreme learning machine network is 128.
The specific implementation steps of the system operation and on-line training stage of the step S2 are as follows:
Step S21, the unmanned aerial vehicle data are sorted in time order, with S_k as the input and S_{k+1} as the output; the training data sample set is formed from the training samples in time-progressive order.
And step S22, training the training data sample set obtained in the step S21 through a BP neural network.
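As a rough illustration of this step, the sketch below implements a one-hidden-layer BP network in numpy that maps the current attitude state S_k to the predicted next state S_{k+1}; the layer size, learning rate, epoch count and the two-dimensional [pitch, roll] state layout are assumptions.

```python
import numpy as np

class StateTransitionNet:
    """One-hidden-layer BP network:  S_k -> predicted S_{k+1}."""
    def __init__(self, n_in, n_out, n_hidden=32, lr=0.01, seed=1):
        rng = np.random.default_rng(seed)
        self.W1 = rng.standard_normal((n_in, n_hidden)) * 0.1
        self.b1 = np.zeros(n_hidden)
        self.W2 = rng.standard_normal((n_hidden, n_out)) * 0.1
        self.b2 = np.zeros(n_out)
        self.lr = lr

    def predict(self, X):
        self.H = np.tanh(X @ self.W1 + self.b1)
        return self.H @ self.W2 + self.b2

    def train(self, X, Y, epochs=2000):
        for _ in range(epochs):
            err = self.predict(X) - Y                       # mean-squared-error gradient
            dW2 = self.H.T @ err / len(X)
            dH = err @ self.W2.T * (1 - self.H ** 2)        # backprop through tanh
            dW1 = X.T @ dH / len(X)
            self.W2 -= self.lr * dW2
            self.b2 -= self.lr * err.mean(axis=0)
            self.W1 -= self.lr * dW1
            self.b1 -= self.lr * dH.mean(axis=0)

# Example usage: time-ordered attitude samples, input S_k, target S_{k+1}
# states = np.asarray(attitude_log)     # assumed shape (T, 2): [pitch, roll] per sample
# net = StateTransitionNet(n_in=2, n_out=2)
# net.train(states[:-1], states[1:])
```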
The operation state of the unmanned aerial vehicle flight control system is predicted; once training of the state transition prediction network is finished, the control system performs fault-tolerant tracking control with the incremental-strategy-based reinforcement learning method. In this process the evaluation network is updated on line through the decision of each step and the instant reward value. The real-time reward evaluation criterion is the absolute value of the error between the expected output and the actual output of the control system, and a reward function J(S_t) is defined in the following specific form:
J(S_t) = Σ_j γ^j · U(S_{t−j}, A_{t−j})
wherein γ is the discount factor, satisfying 0 < γ ≤ 1, and U(S_{t−j}, A_{t−j}) is the utility function of the reinforcement learning algorithm, whose specific form is:
U(S_t, A_t) = Q(S_t, A_t)
and the Q(S_t, A_t) function has the mathematical form:
Q(S_t, A_t) = |y(t, A_t) − y_d(t)|
where t is the system running time, y(t, A_t) is the actual output obtained by the control system after the system makes decision A_t at the current time, and y_d(t) is the desired control system output at the current time.
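A small sketch of these two quantities follows, with `y_actual`, `y_desired` and the discount factor value as illustrative placeholders; the discounted sum follows the reconstructed form of J(S_t) given above.

```python
def immediate_return(y_actual, y_desired):
    """U(S_t, A_t) = Q(S_t, A_t) = |y(t, A_t) - y_d(t)|."""
    return abs(y_actual - y_desired)

def discounted_reward(utilities, gamma=0.95):
    """J(S_t): discounted sum of the utilities U(S_{t-j}, A_{t-j}), 0 < gamma <= 1.
    `utilities` is a time-ordered list, most recent sample last."""
    return sum(gamma ** j * u for j, u in enumerate(reversed(utilities)))
```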
Step S23, whether the current system has a fault is judged from the attitude angle, actuator current and voltage and other data acquired by the sensors during operation of the unmanned aerial vehicle control system; if a fault has occurred, the evaluation network changes the reward value corresponding to each action of the strategy action set in the current state. The strategy action set is mathematically expressed as:
Ω = {Λ₁, Λ₂, Λ₃, Λ₄, Λ₅}
wherein Λ_a is the a-th optional configuration in the system, a = 1, 2, 3, 4, 5. In the specific application embodiment, an incremental strategy is adopted to achieve asymptotic approximation of the optimal fault-tolerant control strategy: the strategy made by the fault-tolerant controller at each moment is superposed onto the current strategy signal. For this application embodiment of the invention, the following five incremental strategies are defined:
1. Action taken when the system is normal: Λ₁ = [0 0 0 0]
2. Positive action compensating an actuator fault: Λ₂ = +[0 0.0002 0 0]
3. Negative action compensating an actuator fault: Λ₃ = −[0 0.0002 0 0]
4. Positive action compensating a sensor fault: Λ₄ = +[0 0 0.0002 0]
5. Negative action compensating a sensor fault: Λ₅ = −[0 0 0.0002 0]
Step S24, the evaluation network takes the current operating state value S_k of the control system together with each incremental strategy action in the strategy action set as the model input; action selection is performed on the model output in combination with the ε-greedy strategy, and the incremental strategy action decided by the evaluation network is then superposed onto the existing action and applied to the current control signal to achieve fault tolerance.
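The sketch below shows how the five incremental actions and the ε-greedy selection of step S24 could look in code. `evaluate_q(state, action_id)` stands in for the trained GA-ELM evaluation network and is an assumed interface; since the evaluation value Q is defined above as the tracking error |y − y_d|, this sketch treats the smallest predicted value as the greedy choice.

```python
import numpy as np

# The five incremental strategy actions from the description (control-signal increments)
ACTIONS = {
    1: np.array([0.0, 0.0,     0.0,    0.0]),   # no compensation (system normal)
    2: np.array([0.0, 0.0002,  0.0,    0.0]),   # + compensation for an actuator fault
    3: np.array([0.0, -0.0002, 0.0,    0.0]),   # - compensation for an actuator fault
    4: np.array([0.0, 0.0,     0.0002, 0.0]),   # + compensation for a sensor fault
    5: np.array([0.0, 0.0,    -0.0002, 0.0]),   # - compensation for a sensor fault
}

def select_action(state, evaluate_q, epsilon=0.1, rng=np.random.default_rng(2)):
    """epsilon-greedy choice over the evaluation network's predicted values.

    `evaluate_q(state, action_id)` is an assumed interface to the trained
    evaluation network; because Q is the tracking error here, the smallest
    predicted value is taken as the greedy (exploiting) choice.
    """
    if rng.random() < epsilon:
        return int(rng.choice(list(ACTIONS)))          # explore
    values = {a: evaluate_q(state, a) for a in ACTIONS}
    return min(values, key=values.get)                 # exploit

def apply_increment(current_control, action_id):
    """Superpose the chosen incremental action onto the existing control signal."""
    return current_control + ACTIONS[action_id]
```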
On-line updating of the extreme learning machine network is achieved through a dynamic capacity-expansion updating algorithm: the method does not need to propagate the current sample error through a gradient-descent-like algorithm; instead, fast on-line updating is achieved by directly expanding the training data and exploiting the rapidity of the extreme learning machine updating algorithm. The specific steps are shown in FIG. 4.
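A minimal sketch of the dynamic capacity-expansion idea: each new sample is appended to the stored training set and the ELM output weights are re-solved directly by the Moore-Penrose pseudoinverse, with no gradient-descent-style error propagation. The class name and the convention that the GA-fixed hidden parameters W, b are passed in are assumptions carried over from the earlier GA-ELM sketch.

```python
import numpy as np

class OnlineELM:
    """Evaluation network with dynamic capacity-expansion updating."""
    def __init__(self, W, b):
        self.W, self.b = W, b                   # hidden-layer parameters fixed by the GA stage
        self.X = np.empty((0, W.shape[0]))      # stored training inputs
        self.y = np.empty((0,))                 # stored training targets
        self.beta = np.zeros(W.shape[1])        # output weights

    def predict(self, x):
        return np.tanh(np.atleast_2d(x) @ self.W + self.b) @ self.beta

    def expand_and_update(self, x_new, q_new):
        """Append the new (state + action, updated reward) sample and re-solve beta."""
        self.X = np.vstack([self.X, np.atleast_2d(x_new)])
        self.y = np.append(self.y, q_new)
        H = np.tanh(self.X @ self.W + self.b)
        self.beta = np.linalg.pinv(H) @ self.y
```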
For the reinforcement learning update process, the Q-learning evaluation network is first initialized and the neural network parameters are initialized randomly; the network input is the state of the system and the serial number of the action currently taken, and the output is the reward value U(S_current, A_current) obtained by taking that action in the current state.
Next, the current state S_t is collected: with probability ε an action is chosen at random from the full action set, and with probability (1 − ε) the action A_t = argmax Q(S_t) that maximizes the reward value (which in this context is the error between the actual output and the desired output of the system) is chosen; the current state S_t and action A_t are recorded, with reward value U(S_current, A_current).
Step S25, after the decision module of the reinforcement learning active fault-tolerant controller gives the fault-tolerant strategy, the reward value function of the current state and the given strategy is solved: the cumulative discounted return value is obtained from the current instant return value Q(S_current, A_current) plus the discounted historical value, and the updated reward value is mathematically expressed as:
Q̃(S_current, A_current) = Q(S_current, A_current) + λ·Q(S_next, A_next)
where Q(S_next, A_next) is the return value output by the current evaluation network for the predicted next state.
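In code, the training target of step S25 is a one-liner; `q_current` and `q_next_pred` (the evaluation network's output for the predicted next state) are assumed to have been computed already, and the value of λ is illustrative.

```python
def updated_reward(q_current, q_next_pred, lam=0.9):
    """Q~(S_current, A_current) = Q(S_current, A_current) + lambda * Q(S_next, A_next), 0 < lambda < 1."""
    return q_current + lam * q_next_pred
```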
Step S26, the current state value, the strategy action taken and the updated reward value Q̃(S_current, A_current) obtained in step S25 are combined into a new data sample and added to the existing training data set.
Step S27, the latest training model is obtained through the training algorithm of the genetic-algorithm-optimized extreme learning machine, which solves the output weights via the Moore-Penrose generalized inverse.
And step S28, repeating the above process for each sampling period until the flight mission is completed.
The state transition prediction network is updated at intervals using the historical operation data of the control system, and the next state value is predicted from the current state and action values. In order to reduce the processor load and guarantee the rapidity of the system, and on the premise of not affecting the accurate judgment of the fault-tolerant controller, the state transition prediction network is updated once every 10 sampling periods.
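A sketch of this every-10-sampling-periods refresh is given below; `StateTransitionNet` refers to the earlier sketch, and the buffer handling is an assumption.

```python
import numpy as np

UPDATE_INTERVAL = 10        # sampling periods between state-transition-network updates
attitude_buffer = []        # most recent attitude-angle samples

def on_sampling_period(k, attitude, net):
    """Collect the current attitude sample; every 10th period retrain the
    state transition prediction network on the last 10 samples."""
    attitude_buffer.append(attitude)
    if (k + 1) % UPDATE_INTERVAL == 0 and len(attitude_buffer) >= UPDATE_INTERVAL:
        recent = np.asarray(attitude_buffer[-UPDATE_INTERVAL:])
        net.train(recent[:-1], recent[1:])          # inputs S_k, targets S_{k+1}
```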
In order to verify the fault-tolerant control effect, verification experiments were carried out on the aircraft fault diagnosis experimental platform of the Key Laboratory of Advanced Aircraft Navigation, Control and Health Management (Ministry of Industry and Information Technology) of Nanjing University of Aeronautics and Astronautics. When an actuator fault is injected into the experimental platform, under the fault-tolerant control of the active fault-tolerant controller based on the extreme learning machine and incremental-strategy reinforcement learning method, the system attitude deviates and then continues to track the expected signal; the output residual of the unmanned aerial vehicle aircraft is shown in FIG. 5. When a sensor fault is injected into the experimental platform, the output residual of the unmanned aerial vehicle is shown in FIG. 6.
According to the simulation results, when an actuator fault or a sensor fault occurs during flight of the unmanned aerial vehicle aircraft, the unmanned aerial vehicle active fault-tolerant control method based on the extreme learning machine and incremental-strategy reinforcement learning achieves a good fault-tolerant effect without depending on a system model during operation, and realizes on-line self-learning and updating. The method has important reference value for fault-tolerant control of unmanned aerial vehicle aircraft with faults.
The above embodiments are only for illustrating the technical idea of the present invention, and the protection scope of the present invention is not limited thereby, and any modifications made on the basis of the technical scheme according to the technical idea of the present invention fall within the protection scope of the present invention.
Claims (6)
1. An active fault-tolerant control method of an unmanned aerial vehicle based on reinforcement learning is characterized by comprising the following steps:
step 1, establishing an unmanned aerial vehicle dynamic model, and performing fault injection on an unmanned aerial vehicle to obtain an unmanned aerial vehicle aircraft fault model under the fault condition;
step 2, defining five different incremental strategies, including a non-compensation action, a positive action for compensating actuator faults, a negative action for compensating actuator faults, a positive action for compensating sensor faults and a negative action for compensating sensor faults, traversing an unmanned aerial vehicle fault model by one incremental strategy in sequence, and acquiring unmanned aerial vehicle attitude data under each incremental strategy through a sensor;
step 3, training a reinforcement learning evaluation network based on a genetic algorithm-extreme learning machine by using the attitude data of the unmanned aerial vehicle to obtain a trained reinforcement learning evaluation network;
step 4, when the unmanned aerial vehicle aircraft fault model is traversed according to the uncompensated action strategy in the step 2, the acquired unmanned aerial vehicle attitude data is used for training the state transition prediction network to obtain a trained state transition prediction network;
step 5, setting the training data set to be empty, and acquiring attitude angle data S_k once in each sampling period during operation of the unmanned aerial vehicle flight control system; combining each of the five different incremental strategies with the attitude angle data S_k to form input data that are input to the current reinforcement learning evaluation network, and obtaining the reward values corresponding to the different incremental strategies under the current attitude angle;
step 6, selecting, according to the reward values corresponding to the different incremental strategies and in combination with the ε-greedy strategy, the optimal incremental strategy under the current attitude angle and executing it, to obtain the system instant return value Q(S_current, A_current);
Step 7, predicting the attitude angle of the next sampling period according to the current attitude angle data and the current state transition prediction network to obtain the attitude angle predicted value of the next sampling period;
step 8, repeating step 5 and step 6 on the attitude angle predicted value of the next sampling period to obtain the optimal incremental strategy corresponding to the next sampling period and the system instant return value Q(S_next, A_next), and calculating the reward value to be updated Q̃(S_current, A_current);
step 9, taking the current attitude angle data S_k, the optimal incremental strategy under the current attitude angle and the reward value to be updated Q̃(S_current, A_current) as a new data sample to expand the capacity of the current training data set, and updating the current reinforcement learning evaluation network with the current training data set;
and step 10, repeating the steps 5-9 for each sampling period until the flight mission is completed.
2. The active fault-tolerant control method for the unmanned aerial vehicle based on reinforcement learning of claim 1, wherein the fault model of the unmanned aerial vehicle under the fault condition in step 1 is specifically:
ẋ(t) = Ax(t) + Bu(t) + φ(t−t₁)f_a(t)
y(t) = Cx(t) + Du(t) + φ(t−t₂)Ff_s(t)
wherein x ∈ R^{4×1} is the state variable of the system, consisting of the pitch angle θ, the roll angle and their time derivatives; u is the control input; A, B, C, D are the system matrices; y is the output of the control system; φ(t−t₁)f_a(t) and φ(t−t₂)Ff_s(t) denote an actuator fault and a sensor fault in the flight control system, respectively, f_a(t) being the unknown actuator fault offset value and Ff_s(t) the unknown sensor fault offset value; φ(t−t_f) is the fault occurrence time function, with
φ(t−t_f) = 0 for t < t_f and φ(t−t_f) = 1 for t ≥ t_f,
where t_f is the time at which an unknown fault occurs in the flight control system and t denotes time.
3. The active fault-tolerant control method for the unmanned aerial vehicle based on reinforcement learning of claim 1, wherein the specific process of the step 3 is as follows:
step 31, sorting the unmanned aerial vehicle attitude data acquired in the step 2 according to a time sequence order to form a training sample set;
and step 32, the reinforcement learning evaluation network based on the genetic algorithm-extreme learning machine contains a single hidden layer; a random parameter population of the extreme learning machine hidden-layer parameters is created by the genetic algorithm, individuals are eliminated according to a fitness function, and the remaining individuals undergo the inheritance, crossover and mutation operations of the genetic algorithm; the elimination-inheritance-crossover-mutation process is repeated until the fitness function reaches its optimal value, yielding the trained reinforcement learning evaluation network.
4. The active fault-tolerant control method for unmanned aerial vehicle based on reinforcement learning of claim 1, wherein the updated reward value Q̃(S_current, A_current) of step 8 is calculated as:
Q̃(S_current, A_current) = Q(S_current, A_current) + λ·Q(S_next, A_next)
wherein Q(S_current, A_current) denotes the system instant return value obtained by executing the optimal incremental strategy under the current attitude angle S_k, λ denotes a discount factor with 0 < λ < 1, and Q(S_next, A_next) denotes the system instant return value obtained by executing the optimal incremental strategy under the next attitude angle S_next.
5. The active fault-tolerant control method for the unmanned aerial vehicle based on reinforcement learning of claim 1, wherein the specific updating method in step 9 is: the current reinforcement learning evaluation network is updated by the training algorithm of the genetic-algorithm-optimized extreme learning machine, which solves the output weights through the Moore-Penrose generalized inverse.
6. The active fault-tolerant control method for unmanned aerial vehicles based on reinforcement learning of claim 1, wherein the current state transition prediction network in step 7 is updated once every 10 sampling periods; if the state transition prediction network is to be updated in the current sampling period, the training data used for the update are the attitude angle data acquired in the current sampling period together with the attitude angle data acquired in the 9 sampling periods preceding it.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010030358.4A CN111190429B (en) | 2020-01-13 | 2020-01-13 | Unmanned aerial vehicle active fault-tolerant control method based on reinforcement learning |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010030358.4A CN111190429B (en) | 2020-01-13 | 2020-01-13 | Unmanned aerial vehicle active fault-tolerant control method based on reinforcement learning |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111190429A true CN111190429A (en) | 2020-05-22 |
CN111190429B CN111190429B (en) | 2022-03-18 |
Family
ID=70708146
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010030358.4A Active CN111190429B (en) | 2020-01-13 | 2020-01-13 | Unmanned aerial vehicle active fault-tolerant control method based on reinforcement learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111190429B (en) |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111679579A (en) * | 2020-06-10 | 2020-09-18 | 南京航空航天大学 | Sliding mode prediction fault-tolerant control method for fault system of sensor and actuator |
CN111783250A (en) * | 2020-07-03 | 2020-10-16 | 上海航天控制技术研究所 | Flexible robot end arrival control method, electronic device, and storage medium |
CN112180960A (en) * | 2020-09-29 | 2021-01-05 | 西北工业大学 | Unmanned aerial vehicle fault-tolerant flight method and flight system for actuator faults |
CN113467248A (en) * | 2021-07-22 | 2021-10-01 | 南京大学 | Fault-tolerant control method for unmanned aerial vehicle sensor during fault based on reinforcement learning |
CN113919495A (en) * | 2021-10-11 | 2022-01-11 | 浙江理工大学 | Multi-agent fault-tolerant consistency method and system based on reinforcement learning |
CN114153640A (en) * | 2021-11-26 | 2022-03-08 | 哈尔滨工程大学 | System fault-tolerant strategy method based on deep reinforcement learning |
CN114756038A (en) * | 2022-03-23 | 2022-07-15 | 北京理工大学 | Data-driven unmanned aerial vehicle wind disturbance model online wind disturbance estimation method |
Citations (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104914851A (en) * | 2015-05-21 | 2015-09-16 | 北京航空航天大学 | Adaptive fault detection method for airplane rotation actuator driving device based on deep learning |
CN105915294A (en) * | 2016-06-20 | 2016-08-31 | 中国人民解放军军械工程学院 | Unmanned aerial vehicle airborne transmitter fault forecasting method and system |
CN107315892A (en) * | 2017-08-10 | 2017-11-03 | 北京交通大学 | A kind of Method for Bearing Fault Diagnosis based on extreme learning machine |
CN107316046A (en) * | 2017-03-09 | 2017-11-03 | 河北工业大学 | A kind of method for diagnosing faults that Dynamic adaptiveenhancement is compensated based on increment |
US20180129958A1 (en) * | 2016-11-09 | 2018-05-10 | Cognitive Scale, Inc. | Cognitive Session Graphs Including Blockchains |
CN108256173A (en) * | 2017-12-27 | 2018-07-06 | 南京航空航天大学 | A kind of Gas path fault diagnosis method and system of aero-engine dynamic process |
CN109001982A (en) * | 2018-10-19 | 2018-12-14 | 西安交通大学 | A kind of nonlinear system adaptive neural network fault tolerant control method |
CN109408552A (en) * | 2018-08-08 | 2019-03-01 | 南京航空航天大学 | The monitoring of the civil aircraft system failure and recognition methods based on LSTM-AE deep learning frame |
CN109799802A (en) * | 2018-12-06 | 2019-05-24 | 郑州大学 | Sensor fault diagnosis and fault tolerant control method in a kind of control of molecular weight distribution |
KR20190064111A (en) * | 2017-11-30 | 2019-06-10 | 한국에너지기술연구원 | Energy management system and energy management method including fault tolerant function |
CN110244689A (en) * | 2019-06-11 | 2019-09-17 | 哈尔滨工程大学 | A kind of AUV adaptive failure diagnostic method based on identification feature learning method |
CN110413000A (en) * | 2019-05-28 | 2019-11-05 | 北京航空航天大学 | A kind of hypersonic aircraft based on deep learning reenters prediction and corrects fault-tolerant method of guidance |
Patent Citations (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104914851A (en) * | 2015-05-21 | 2015-09-16 | 北京航空航天大学 | Adaptive fault detection method for airplane rotation actuator driving device based on deep learning |
CN105915294A (en) * | 2016-06-20 | 2016-08-31 | 中国人民解放军军械工程学院 | Unmanned aerial vehicle airborne transmitter fault forecasting method and system |
US20180129958A1 (en) * | 2016-11-09 | 2018-05-10 | Cognitive Scale, Inc. | Cognitive Session Graphs Including Blockchains |
CN107316046A (en) * | 2017-03-09 | 2017-11-03 | 河北工业大学 | A kind of method for diagnosing faults that Dynamic adaptiveenhancement is compensated based on increment |
CN107315892A (en) * | 2017-08-10 | 2017-11-03 | 北京交通大学 | A kind of Method for Bearing Fault Diagnosis based on extreme learning machine |
KR20190064111A (en) * | 2017-11-30 | 2019-06-10 | 한국에너지기술연구원 | Energy management system and energy management method including fault tolerant function |
CN108256173A (en) * | 2017-12-27 | 2018-07-06 | 南京航空航天大学 | A kind of Gas path fault diagnosis method and system of aero-engine dynamic process |
CN109408552A (en) * | 2018-08-08 | 2019-03-01 | 南京航空航天大学 | The monitoring of the civil aircraft system failure and recognition methods based on LSTM-AE deep learning frame |
CN109001982A (en) * | 2018-10-19 | 2018-12-14 | 西安交通大学 | A kind of nonlinear system adaptive neural network fault tolerant control method |
CN109799802A (en) * | 2018-12-06 | 2019-05-24 | 郑州大学 | Sensor fault diagnosis and fault tolerant control method in a kind of control of molecular weight distribution |
CN110413000A (en) * | 2019-05-28 | 2019-11-05 | 北京航空航天大学 | A kind of hypersonic aircraft based on deep learning reenters prediction and corrects fault-tolerant method of guidance |
CN110244689A (en) * | 2019-06-11 | 2019-09-17 | 哈尔滨工程大学 | A kind of AUV adaptive failure diagnostic method based on identification feature learning method |
Non-Patent Citations (3)
Title |
---|
CHANGSHENG HUA et al.: "A New Method for Fault Tolerant Control through Q-Learning", IFAC-PapersOnLine *
CHANG Feng et al.: "WSN node fault diagnosis based on reinforcement learning and ant colony algorithm", Computer Measurement & Control *
JIANG Yinhang et al.: "Active fault-tolerant control of quadrotor UAV based on gain-scheduled PID", Journal of Shandong University of Science and Technology (Natural Science Edition) *
Cited By (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111679579A (en) * | 2020-06-10 | 2020-09-18 | 南京航空航天大学 | Sliding mode prediction fault-tolerant control method for fault system of sensor and actuator |
CN111783250A (en) * | 2020-07-03 | 2020-10-16 | 上海航天控制技术研究所 | Flexible robot end arrival control method, electronic device, and storage medium |
CN112180960A (en) * | 2020-09-29 | 2021-01-05 | 西北工业大学 | Unmanned aerial vehicle fault-tolerant flight method and flight system for actuator faults |
CN112180960B (en) * | 2020-09-29 | 2021-09-14 | 西北工业大学 | Unmanned aerial vehicle fault-tolerant flight method and flight system for actuator faults |
CN113467248A (en) * | 2021-07-22 | 2021-10-01 | 南京大学 | Fault-tolerant control method for unmanned aerial vehicle sensor during fault based on reinforcement learning |
CN113919495A (en) * | 2021-10-11 | 2022-01-11 | 浙江理工大学 | Multi-agent fault-tolerant consistency method and system based on reinforcement learning |
CN113919495B (en) * | 2021-10-11 | 2024-09-17 | 浙江理工大学 | Multi-agent fault tolerance consistency method and system based on reinforcement learning |
CN114153640A (en) * | 2021-11-26 | 2022-03-08 | 哈尔滨工程大学 | System fault-tolerant strategy method based on deep reinforcement learning |
CN114153640B (en) * | 2021-11-26 | 2024-05-31 | 哈尔滨工程大学 | System fault-tolerant strategy method based on deep reinforcement learning |
CN114756038A (en) * | 2022-03-23 | 2022-07-15 | 北京理工大学 | Data-driven unmanned aerial vehicle wind disturbance model online wind disturbance estimation method |
CN114756038B (en) * | 2022-03-23 | 2024-09-17 | 北京理工大学 | Data-driven unmanned aerial vehicle wind disturbance model online wind disturbance estimation method |
Also Published As
Publication number | Publication date |
---|---|
CN111190429B (en) | 2022-03-18 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111190429B (en) | Unmanned aerial vehicle active fault-tolerant control method based on reinforcement learning | |
CN112149316B (en) | Aero-engine residual life prediction method based on improved CNN model | |
CN110222371B (en) | Bayes and neural network-based engine residual life online prediction method | |
CN111241952A (en) | Reinforced learning reward self-learning method in discrete manufacturing scene | |
Wang et al. | Neural-network-based fault-tolerant control of unknown nonlinear systems | |
CN112439794B (en) | Hot rolling bending force prediction method based on LSTM | |
CN112947385B (en) | Aircraft fault diagnosis method and system based on improved Transformer model | |
CN109885916B (en) | Mixed test online model updating method based on LSSVM | |
CN114692310B (en) | Dueling DQN-based virtual-real fusion primary separation model parameter optimization method | |
Xie et al. | A novel deep belief network and extreme learning machine based performance degradation prediction method for proton exchange membrane fuel cell | |
Cen et al. | A gray-box neural network-based model identification and fault estimation scheme for nonlinear dynamic systems | |
Nasser et al. | A hybrid of convolutional neural network and long short-term memory network approach to predictive maintenance | |
Ma et al. | Deep auto-encoder observer multiple-model fast aircraft actuator fault diagnosis algorithm | |
Precup et al. | A survey on fuzzy control for mechatronics applications | |
CN115972211A (en) | Control strategy offline training method based on model uncertainty and behavior prior | |
CN112146879A (en) | Rolling bearing fault intelligent diagnosis method and system | |
Yin et al. | Dynamic behavioral assessment model based on Hebb learning rule | |
CN116432359A (en) | Variable topology network tide calculation method based on meta transfer learning | |
CN114880767B (en) | Aero-engine residual service life prediction method based on attention mechanism Dense-GRU network | |
Long et al. | A data fusion fault diagnosis method based on LSTM and DWT for satellite reaction flywheel | |
Liu et al. | Aero-Engines Remaining Useful Life Prognostics Based on Multi-Hierarchical Gated Recurrent Graph Convolutional Network | |
CN114491790A (en) | MAML-based pneumatic modeling method and system | |
Zhou et al. | A health status estimation method based on interpretable neural network observer for HVs | |
Ji et al. | Data preprocessing method and fault diagnosis based on evaluation function of information contribution degree | |
Vladov et al. | Control and diagnostics of TV3-117 aircraft engine technical state in flight modes using the matrix method for calculating dynamic recurrent neural networks |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |