CN114296350A - A fault-tolerant control method for unmanned ships based on model reference reinforcement learning - Google Patents

A fault-tolerant control method for unmanned ships based on model reference reinforcement learning

Info

Publication number
CN114296350A
Authority
CN
China
Prior art keywords
model
unmanned ship
fault
reinforcement learning
tolerant
Prior art date
Legal status
Granted
Application number
CN202111631716.8A
Other languages
Chinese (zh)
Other versions
CN114296350B (en)
Inventor
张清瑞
熊培轩
张雷
朱波
胡天江
Current Assignee
Sun Yat Sen University
Original Assignee
Sun Yat Sen University
Priority date
Filing date
Publication date
Application filed by Sun Yat Sen University filed Critical Sun Yat Sen University
Priority to CN202111631716.8A
Publication of CN114296350A
Application granted
Publication of CN114296350B
Legal status: Active

Classifications

    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T90/00Enabling technologies or technologies with a potential or indirect contribution to GHG emissions mitigation

Landscapes

  • Feedback Control In General (AREA)

Abstract

The invention discloses a fault-tolerant control method for unmanned ships based on model reference reinforcement learning. The method includes: analyzing the uncertainty factors of the unmanned ship and constructing a nominal dynamics model of the unmanned ship; designing a nominal controller for the unmanned ship based on the nominal dynamics model; using the maximum-entropy Actor-Critic method to construct a fault-tolerant controller based on model reference reinforcement learning, taking as inputs the difference between the state variables of the actual unmanned ship system and the nominal dynamics model together with the output of the nominal controller; and, according to the control task requirements, building a reinforcement learning evaluation function and a control policy model and training the fault-tolerant controller to obtain the trained control policy. By using the present invention, the safety and reliability of the unmanned ship system can be significantly improved. As a fault-tolerant control method for unmanned ships based on model reference reinforcement learning, the present invention can be widely applied in the field of unmanned ship control.

Description

A fault-tolerant control method for unmanned ships based on model reference reinforcement learning

Technical Field

The present invention relates to the field of unmanned ship control, and in particular to a fault-tolerant control method for unmanned ships based on model reference reinforcement learning.

Background

With significant advances in guidance, navigation, and control technology, autonomous surface vehicles (ASVs, unmanned ships) have come to play a pivotal role in maritime applications. In most applications, unmanned ships are expected to operate safely for long periods without human intervention. They are therefore required to have sufficient safety and reliability to provide normal operation and to avoid catastrophic consequences. However, unmanned ships are prone to actuator faults, component degradation, sensor failures, and similar problems, and may consequently suffer performance degradation, instability, or even catastrophic losses.

Summary of the Invention

To solve the above technical problems, the purpose of the present invention is to provide a fault-tolerant control method for unmanned ships based on model reference reinforcement learning that can restore system performance or keep the system running after a fault occurs, thereby significantly improving the safety and reliability of the system.

The first technical solution adopted by the present invention is a fault-tolerant control method for unmanned ships based on model reference reinforcement learning, comprising the following steps:

S1. Analyze the uncertainty factors of the unmanned ship and construct a nominal dynamics model of the unmanned ship;

S2. Design a nominal controller for the unmanned ship based on the nominal dynamics model;

S3. Using the maximum-entropy Actor-Critic method, construct a fault-tolerant controller based on model reference reinforcement learning from the difference between the state variables of the actual unmanned ship system and the nominal dynamics model and from the output of the nominal controller;

S4. According to the control task requirements, build a reinforcement learning evaluation function and a control policy model, and train the fault-tolerant controller to obtain the trained control policy.

Further, the nominal dynamics model of the unmanned ship is expressed as follows:

[Formula — shown as an image in the original]

In the above formula, η denotes the generalized coordinate vector, v the generalized velocity vector, u the control forces and moments, M the inertia matrix, C(v) the Coriolis and centripetal terms, D(v) the damping matrix, G(v) the unmodeled dynamics due to gravity, buoyancy, and moments, and B a preset input matrix.

Further, the nominal controller of the unmanned ship is expressed as follows:

[Formula — shown as an image in the original]

In the above formula, N_m and H_m contain all known constant parameters of the unmanned ship dynamics model, η_m denotes the generalized coordinate vector of the nominal model, u_m denotes the control law, and x_m denotes the state of the reference model.

Further, the fault-tolerant controller is expressed as follows:

[Formula — shown as an image in the original]

In the above formula, H_m − L denotes a Hurwitz matrix, u_l denotes the control policy from the deep learning module, β(v) denotes the set of all model uncertainties in the inner-loop dynamics, n_v denotes the noise vector on the generalized velocity measurement, and f_v denotes the sensor fault acting on the generalized velocity vector.

Further, the reinforcement learning evaluation function satisfies:

Q^π(s_t, u_{l,t}) = T^π Q^π(s_t, u_{l,t})

[Expansion of the Bellman operator — shown as an image in the original]

In the above formula, u_{l,t} denotes the control excitation from RL, s_t denotes the state signal at time step t, T^π denotes the Bellman operator under the fixed policy π, E_π denotes the expectation operator, γ denotes the discount factor, α denotes the temperature coefficient, and Q^π(s_t, u_{l,t}) denotes the reinforcement learning evaluation function.

Further, the control policy model is expressed as follows:

[Policy-update formula — shown as an image in the original]

In the above formula, Π denotes the policy set, π_old denotes the previously updated policy, Q^{π_old} denotes the Q value of π_old, D_KL denotes the KL divergence, Z^{π_old}(s_t) denotes the normalization factor, and π(·|s_t) denotes the control policy, with the dot indicating that the action argument is omitted.

Further, the step of building a reinforcement learning evaluation function and a control policy model according to the control task requirements and training the fault-tolerant controller to obtain the trained control policy specifically includes:

S41. According to the control task requirements, build a reinforcement learning evaluation function and a control policy model for the fault-tolerant controller based on model reference reinforcement learning;

S42. Train the fault-tolerant controller based on model reference reinforcement learning to obtain an initial control policy;

S43. Inject faults into the unmanned ship system, retrain the initial control policy, and return to step S41 until the reinforcement learning evaluation function network model and the control policy model converge.

Further, the method also includes:

introducing a double evaluation function model and adding the entropy of the policy to the expected return function of the control policy, where R_t is the reward function, R_t = R(s_t, u_{l,t}).

The beneficial effects of the method of the present invention are as follows: for unmanned ship systems subject to model uncertainty and sensor faults, the present invention proposes a reinforcement-learning-based fault-tolerant control algorithm that combines model reference reinforcement learning with a fault diagnosis and estimation mechanism. Considering the low efficiency of Monte Carlo sampling, an Actor-Critic model is used and the cumulative return is replaced by a Q function. Through this new reinforcement-learning-based fault-tolerant control, the unmanned ship can learn to adapt to different sensor faults and recover its trajectory tracking performance under fault conditions.

Brief Description of the Drawings

Fig. 1 is a flowchart of the steps of the fault-tolerant control method for unmanned ships based on model reference reinforcement learning according to the present invention;

Fig. 2 is a structural block diagram of the Actor-Critic network according to a specific embodiment of the present invention.

Detailed Description

The present invention is further described in detail below with reference to the accompanying drawings and specific embodiments. The step numbers in the following embodiments are set only for convenience of description and do not limit the order of the steps in any way; the execution order of the steps in the embodiments can be adjusted according to the understanding of those skilled in the art.

As shown in Fig. 1, the present invention provides a fault-tolerant control method for unmanned ships based on model reference reinforcement learning (RL), which includes the following steps.

S1. Analyze the inherent uncertainty factors of the unmanned ship, ignore all nonlinear terms in the inner-loop dynamics to obtain a linear and decoupled model of the dynamics equation of the generalized velocity vector, and establish the nominal dynamics model of the unmanned ship.

The dynamics model is specifically:

[ASV dynamics model (1) — shown as an image in the original]

where η ∈ R^3 is the generalized coordinate vector, composed of the horizontal coordinates x_p and y_p of the ASV in the inertial frame and the heading angle; v = [u_p, v_p, r_p]^T ∈ R^3 is the generalized velocity vector, in which u_p and v_p are the linear velocities along the x-axis and y-axis, respectively, and r_p is the heading angular rate; u ∈ R^3 contains the control forces and moments; G(v) = [g_1(v), g_2(v), g_3(v)]^T ∈ R^3 is the unmodeled dynamics produced by gravity, buoyancy, and moments; and M ∈ R^{3×3} is the inertia matrix with M = M^T > 0, given by:

[Inertia matrix M — shown as an image in the original]

where the entries of M are model parameters (their expressions are shown as an image in the original). The matrix C(v) = −C^T(v) contains the Coriolis and centripetal forces and is given by:

[Matrix C(v) — shown as an image in the original]

where C_13(v) = −M_22 v − M_23 r and C_23(v) = M_11 u. The damping matrix is:

[Damping matrix D(v) — shown as an image in the original]

where D_11(v) = −X_u − X_{|u|u}|u| − X_{uuu}u², D_22(v) = −Y_v − Y_{|v|v}|v| − Y_{|r|v}|r|, D_23(v) = −Y_r − Y_{|v|r}|v| − Y_{|r|r}|r|, D_32(v) = −N_v − N_{|v|v}|v| − N_{|r|v}|r|, D_33(v) = −N_r − N_{|v|r}|v| − N_{|r|r}|r|, and X_(·), Y_(·), N_(·) are hydrodynamic coefficients whose definitions can be found in the handbook of marine craft hydrodynamics and motion control. The rotation matrix and the input matrix B are given by expressions shown as images in the original.

Defining x = [η^T v^T]^T, we have:

[State-space model (5) — shown as an image in the original]

where H(v) = −M^{−1}(C(v) + D(v)) and N = −M^{−1}B.
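To make the structure of this 3-DOF model concrete, the following is a minimal simulation sketch. It is not taken from the patent: the numerical values of M, the damping coefficients, and the input matrix B are placeholder assumptions, the unmodeled term G(v) and the nonlinear damping terms are omitted for brevity, and the expressions follow the standard marine-craft form described in the text.

```python
import numpy as np

# Placeholder model parameters (illustrative only, not from the patent).
M = np.diag([25.8, 33.8, 2.76])          # inertia matrix, M = M^T > 0
B = np.eye(3)                             # preset input matrix (assumed identity)
Xu, Yv, Nr = -0.72, -0.86, -0.50          # linear hydrodynamic coefficients (assumed values)

def C_matrix(v):
    """Coriolis/centripetal matrix C(v) = -C(v)^T with the sparsity used in the text."""
    u_p, v_p, r_p = v
    C13 = -M[1, 1] * v_p - M[1, 2] * r_p
    C23 = M[0, 0] * u_p
    return np.array([[0.0, 0.0, C13],
                     [0.0, 0.0, C23],
                     [-C13, -C23, 0.0]])

def D_matrix(v):
    """Linear part of the damping matrix (the |.| terms are omitted in this sketch)."""
    return np.diag([-Xu, -Yv, -Nr])

def J_matrix(psi):
    """Rotation from body-fixed velocities to inertial-frame rates."""
    c, s = np.cos(psi), np.sin(psi)
    return np.array([[c, -s, 0.0],
                     [s,  c, 0.0],
                     [0.0, 0.0, 1.0]])

def asv_step(eta, v, u, dt=0.01):
    """One Euler step of eta_dot = J(psi) v and M v_dot = -C(v) v - D(v) v + B u."""
    eta_dot = J_matrix(eta[2]) @ v
    v_dot = np.linalg.solve(M, -C_matrix(v) @ v - D_matrix(v) @ v + B @ u)
    return eta + dt * eta_dot, v + dt * v_dot
```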

The state measurements of the ASV system (1) are corrupted by noise and possible sensor faults and are therefore written as y = x + n + f(t), where n ∈ R^6 is the measurement noise vector and f(t) ∈ R^6 is the possible sensor fault vector. In the present invention, only sensor faults on the measurement of the heading angular rate r_p are considered, so f(t) = [0, 0, 0, 0, 0, f_r(t)]^T. The sensor fault f_r(t) is given by:

f_r(t) = β(t − T_f) φ(t − T_f),

where φ(t − T_f) is the unknown fault function occurring at the instant T_f, and β(t − T_f) is a time profile that is zero for t < T_f and, for t > T_f, takes the form shown as an image in the original (k is the evolution rate of the fault). Note that if the sensor fault occurs abruptly, for example a bias fault, then k → ∞. The objective of the present invention is to design a controller that allows the state x to track a reference state trajectory x_r in the presence of model uncertainty, possible sensor faults, and measurement noise.

S2. Based on the nominal dynamics model, design a nominal controller for the unmanned ship to guarantee the basic stability of the unmanned ship system in the fault-free case, and analyze the nominal model of the unmanned ship.

The nominal controller design process is as follows:

The proposed RL-based FTC algorithm follows a model reference control structure. For most ASV systems, an accurate nonlinear dynamics model is rarely available; the main uncertainties come from M, C(v), and D(v), which arise from the hydrodynamics, and from G(v), which arises from gravity, buoyancy, and moments. Despite the uncertainties in the ASV dynamics, a nominal model can still be built from the known information about the ASV dynamics (5). The nominal model of the uncertain ASV model (5) is as follows:

[Nominal model (6) — shown as an image in the original]

where N_m and H_m contain all known constant parameters of the ASV dynamics (5), and η_m is the generalized coordinate vector of the nominal model. In the present invention, M_m is taken as M_m = diag{M_11, M_22, M_33}, H_m = M_m^{−1} D_m with D_m = diag{−X_u, −Y_v, −N_r}, and N_m = M_m^{−1} B. The nominal model therefore ignores all nonlinear terms in the inner-loop dynamics, yielding a linear, decoupled model of the dynamics equation of the generalized velocity vector v. Since the dynamics of the nominal model (6) are known, a control law u_m can be designed so that the state of the nominal system (6) converges to the reference signal x_r, i.e., ||x_m − x_r||_2 → 0 as t → ∞. This control law u_m can also be used as the nominal controller for the full ASV dynamics (5).

In the model reference control structure, the goal is to design a control law that allows the state of (5) to track the state trajectory of the nominal model (6). The overall control law of the ASV system (5) has the following form:

u = u_b + u_l    (7)

where u_b is the nominal control from the model-based method and u_l is the control policy from the deep learning module. The baseline control u_b is used to guarantee some basic performance (i.e., local stability), while u_l is used to compensate for all system uncertainties and sensor faults.
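A minimal sketch of the model-reference structure in (7): the baseline term u_b stabilizes the nominal model and the learned policy adds a corrective term u_l. The feedback gain K and the interface of the policy function are placeholder assumptions, not the patent's implementation.

```python
import numpy as np

K = np.hstack([np.eye(3) * 1.0, np.eye(3) * 0.5])   # illustrative 3x6 feedback gain

def baseline_control(x_m, x_r):
    """u_b: state feedback on the nominal-model tracking error (the gain K is a placeholder)."""
    return -K @ (x_m - x_r)

def total_control(x_m, x_r, policy, s):
    """Overall control law u = u_b + u_l of equation (7).

    `policy` is assumed to map the RL state s to the corrective action u_l."""
    u_b = baseline_control(x_m, x_r)
    u_l = policy(s)
    return u_b + u_l
```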

S3. Using the maximum-entropy Actor-Critic method, construct a fault-tolerant controller based on model reference reinforcement learning, taking as inputs the difference between the state variables of the actual unmanned ship system and the nominal model together with the output of the nominal controller.

The network block diagram of the Actor-Critic structure is shown in Fig. 2. The derivation of the fault-tolerant controller is as follows:

The RL formulation is based on a Markov decision process (MDP) represented by the tuple MDP := <S, U, P, R, γ>, where S is the state space, U specifies the action/input space, P: S × U × S → R defines the transition probability, R: S × U → R is a reward function, and γ ∈ [0, 1) is a discount factor. In the MDP, the state vector s ∈ S contains all available signals that affect the RL control u_l ∈ U. For the tracking control of the ASV system considered in the present invention, the transition probability is characterized by the ASV dynamics in (1) and the reference signal x_r. In RL, the control policy is learned from data samples collected in the discrete-time domain. Let s_t be the state signal s at time step t and, correspondingly, u_{l,t} the input of the RL-based control at time step t. The RL algorithm in the present invention aims to maximize an action-value function, also called the Q function, given as follows:

[Q function (8) — shown as an image in the original]

where R_t is the reward function, R_t = R(s_t, u_{l,t}), and V^π(s_{t+1}) is the state-value function of s_{t+1} under the policy π:

[State-value function (9) — shown as an image in the original]

where π(u_{l,t}|s_t) is the control policy, the additional term (shown as an image in the original) is the entropy of the policy, and α is the temperature parameter. The control policy π(u_{l,t}|s_t) in RL is the probability of selecting the action u_{l,t} ∈ U in the state s_t ∈ S. In the present invention, a Gaussian control policy is adopted, i.e.,

π(u_l|s) = N(u_l(s), σ)    (10)

where N(·,·) denotes a Gaussian distribution, u_l(s) is its mean, and σ is its covariance matrix. The covariance matrix σ controls the exploration behavior during the learning phase.

The goal of RL is to find an optimal control policy π* that maximizes Q^π(s_t, u_{l,t}) in (8), i.e.,

π* = argmax_π Q^π(s_t, u_{l,t})    (11)

Note that the variance σ* converges to 0; once the optimal policy π*(u_l*|s) = N(u_l*(s), σ*) is obtained, the mean function u_l*(s) is the learned optimal control law. The deep neural network Q_θ(s_t, u_{l,t}) is called the critic, and the control policy π_φ(u_{l,t}|s_t) is called the actor. The uncertain inner-loop dynamics of the ASV model (5) are rewritten as:

[Rewritten inner-loop dynamics (12) — shown as an image in the original]

where β(v) is the collection of all model uncertainties in the inner-loop dynamics; the uncertainty term β(v) is assumed to be bounded. Letting e_v = v − v_m, the error dynamics follow from (6) and (12) as:

[Error dynamics (13) — shown as an image in the original]

Under healthy conditions, the model uncertainty term β(v) can be fully compensated by the learning-based control u_l. This means that ||e_v(t)||_2 ≤ ε as t → ∞, where ε is some small positive constant. If a sensor fault occurs, the error signal e_v becomes larger than ε. A naive idea for learning-based fault-tolerant control (FTC) is to treat sensor faults as part of the external disturbances. However, treating sensor faults as disturbances would lead to conservative learning-based control, such as robust control. Therefore, a fault diagnosis and estimation mechanism is introduced that allows the learning-based control to adapt to different scenarios: healthy and faulty conditions.

Let y_v = v + n_v + f_v, where n_v denotes the noise vector on the generalized velocity measurements and, correspondingly, f_v is the sensor fault acting on the generalized velocity vector. In addition, a fault tracking error vector is defined (its definition is shown as an image in the original). In practical applications, this fault tracking error is measurable, whereas e_v is not. Finally, the following fault diagnosis and estimation mechanism is introduced:

[Fault diagnosis and estimation mechanism (14) — shown as an image in the original]

where L is chosen such that H_m − L is Hurwitz. The signal produced by this mechanism (shown as an image in the original) serves as an indicator of the occurrence and intensity of sensor faults. Substituting the definitions yields:

[Resulting error dynamics — shown as an image in the original]

In the above formula, H_m − L denotes a Hurwitz matrix, u_l denotes the control policy from the deep learning module, β(v) denotes the set of all model uncertainties in the inner-loop dynamics, n_v denotes the noise vector on the generalized velocity measurement, and f_v denotes the sensor fault acting on the generalized velocity vector.

S4. According to the control task requirements, design the corresponding reward function, and use fully connected networks to build the reinforcement learning evaluation function model (Q-value) and the control policy model.

The reward function, the learning evaluation function, and the control policy model are derived as follows:

The RL-based fault-tolerant control is obtained using the output of the fault diagnosis and estimation mechanism. RL learns the control policy at discrete time steps from data samples (including input and state data). The sampling time step is assumed to be fixed and is denoted by δt. Without loss of generality, let y_t, u_{b,t}, u_{l,t}, and the output of the fault diagnosis and estimation mechanism denote, respectively, the ASV state, the nominal controller excitation, the control excitation from RL, and the fault estimate at time step t. The state signal s at time step t is then assembled from these quantities (its exact composition is shown as an image in the original). The training process of RL repeatedly performs policy evaluation and policy improvement. In the policy evaluation, the Q-value is obtained through the Bellman operation Q^π(s_t, u_{l,t}) = T^π Q^π(s_t, u_{l,t}), where:

[Bellman operator — shown as an image in the original]

In the above formula, u_{l,t} denotes the control excitation from RL, s_t denotes the state signal at time step t, T^π denotes the Bellman operator under the fixed policy π, E_π denotes the expectation operator, γ denotes the discount factor, α denotes the temperature coefficient, and Q^π(s_t, u_{l,t}) denotes the reinforcement learning evaluation function.
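The Bellman operator itself appears only as an image above; the sketch below assumes the standard maximum-entropy (soft) policy-evaluation backup, in which the next-state value is the Q value of a freshly sampled action minus α times its log-probability. The `policy.sample` interface returning an action and its log-probability is an assumed convention, not the patent's code.

```python
import torch

def soft_bellman_target(reward, next_state, done, critic_target, policy,
                        gamma=0.99, alpha=0.2):
    """Soft policy-evaluation backup (assumed standard max-entropy form):
        y = R_t + gamma * ( Q_target(s_{t+1}, u') - alpha * log pi(u'|s_{t+1}) ),
    with u' sampled from the current policy."""
    with torch.no_grad():
        next_action, next_logp = policy.sample(next_state)      # u' ~ pi(.|s_{t+1})
        next_q = critic_target(next_state, next_action)
        soft_value = next_q - alpha * next_logp
        return reward + gamma * (1.0 - done) * soft_value
```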

In the policy improvement, the policy is updated by:

[Policy-update rule — shown as an image in the original]

where Π denotes the policy set, π_old denotes the previously updated policy, Q^{π_old} denotes the Q value of π_old, D_KL denotes the Kullback-Leibler (KL) divergence, and Z^{π_old}(s_t) (shown as an image in the original) denotes the normalization factor. Through mathematical manipulation, this objective is transformed into:

[Transformed policy objective — shown as an image in the original]

S5. Introduce the idea of a double evaluation function (double critic) model into the evaluation function training architecture, and add the entropy of the policy to the expected return function of the control policy, to improve the efficiency of reinforcement learning training.

The derivation of the double evaluation function model is as follows:

The Q function is parameterized by θ and denoted Q_θ(s_t, u_{l,t}). The parameterized policy is denoted π_φ(u_{l,t}|s_t), where φ is the set of parameters to be trained. Note that both θ and φ are parameter sets whose dimensions are determined by the configuration of the deep neural networks. For example, if Q_θ is represented by an MLP with K hidden layers and L neurons per hidden layer, then the parameter set θ is θ = {θ_0, θ_1, ..., θ_K}, with θ_i ∈ R^{L×(L+1)} for 1 ≤ i ≤ K−1, θ_K ∈ R^{1×(L+1)}, and θ_0 of a size determined by dim_s and dim_u (shown as an image in the original), where dim_s denotes the dimension of the state s and dim_u denotes the dimension of the input u_l.

The entire training process is performed offline. At each time step t+1, data samples are collected, namely the input u_{l,t} from the previous time step, the previous state s_t, the reward R_t, and the current state s_{t+1}. These historical data are stored in a memory pool D as tuples (s_t, u_{l,t}, R_t, s_{t+1}). At each policy evaluation or improvement step, a batch B of historical data is randomly sampled from the memory pool D and used to train the parameters θ and φ. At the beginning of training, the nominal control policy u_b is applied to the ASV system to collect initial data D_0, as shown in Algorithm 1. The initial dataset D_0 is used for the initial fitting of the Q function. After initialization, u_b and the most recently updated reinforcement learning policy π_φ(u_{l,t}|s_t) are executed to run the ASV system.
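A minimal sketch of the memory pool D described above, storing transition tuples (s_t, u_{l,t}, R_t, s_{t+1}) and returning uniformly sampled mini-batches B; the capacity and batch size are illustrative choices, not values from the patent.

```python
import random
from collections import deque

import numpy as np

class ReplayPool:
    """Memory pool D holding (s_t, u_l_t, R_t, s_t1) tuples."""

    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def add(self, s, u_l, r, s_next):
        self.buffer.append((s, u_l, r, s_next))

    def sample(self, batch_size=256):
        batch = random.sample(list(self.buffer), batch_size)
        s, u_l, r, s_next = map(np.asarray, zip(*batch))
        return s, u_l, r, s_next

    def __len__(self):
        return len(self.buffer)
```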

The parameters θ of the Q function are trained to minimize the Bellman residual:

[Bellman residual objective J_Q(θ) (15) — shown as an image in the original]

where (s_t, u_{l,t}) ~ D means that the sample (s_t, u_{l,t}) is drawn at random from the memory pool D, and the target value is computed using slowly updated target parameters (shown as an image in the original). The DNN parameters θ are obtained by applying stochastic gradient descent to (15) over the sampled data batch B, whose size is denoted by |B|. Two critics, parameterized by θ_1 and θ_2 respectively, are used in the present invention; they are introduced to reduce the overestimation problem in the training of the critic neural network. Under the double evaluation function, the target value Y_target is:

[Target value Y_target — shown as an image in the original]
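The expression for Y_target appears only as an image; the sketch below assumes the usual clipped double-critic form, taking the minimum of the two slowly updated target critics before subtracting the entropy term, and combines it with the Bellman-residual loss for the two critics. The network interfaces are assumed conventions.

```python
import torch
import torch.nn.functional as F

def critic_loss(batch, critic_1, critic_2, target_critic_1, target_critic_2,
                policy, gamma=0.99, alpha=0.2):
    """Bellman-residual loss for both critics against the clipped double-Q target
    (assumed form: Y = R + gamma * (min_j Q_target_j(s', u') - alpha * log pi(u'|s')))."""
    s, u_l, r, s_next, done = batch
    with torch.no_grad():
        u_next, logp_next = policy.sample(s_next)
        q_next = torch.min(target_critic_1(s_next, u_next),
                           target_critic_2(s_next, u_next))
        y = r + gamma * (1.0 - done) * (q_next - alpha * logp_next)
    loss_1 = F.mse_loss(critic_1(s, u_l), y)
    loss_2 = F.mse_loss(critic_2(s, u_l), y)
    return loss_1 + loss_2
```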

The policy improvement step uses the data samples in the memory pool D to minimize the following parameterized objective function:

[Policy objective J_π(φ) — shown as an image in the original]

The parameter φ is trained by stochastic gradient descent to minimize this objective. In the training phase, the actor neural network is expressed as:

[Actor network with exploration noise — shown as an image in the original]

where the parameterized control law to be learned is u_{l,φ}(s) (shown as an image in the original), the standard deviation of the exploration noise is likewise shown as an image, ξ ~ N(0, I) is the exploration noise, and "⊙" denotes the Hadamard product. Note that the exploration noise ξ is only used in the training phase; once training is complete, only the deterministic part u_{l,φ}(s) is needed in operation. Therefore, u_l in the training phase is equivalent to u_{l,φ}. Once training is finished, the learned control law is obtained as the trained deterministic part u_{l,φ}(s).
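A sketch of the reparameterized Gaussian actor described above: the network outputs a mean action and a log standard deviation, the exploration noise ξ ~ N(0, I) is combined through an element-wise (Hadamard) product during training, and only the mean is used after training. The network sizes and the form of the actor loss (the standard maximum-entropy objective) are assumptions, not the patent's exact implementation.

```python
import torch
import torch.nn as nn

class GaussianActor(nn.Module):
    """pi_phi(u_l | s): outputs the mean u_l_phi(s) and the exploration std."""

    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU(),
                                  nn.Linear(hidden, hidden), nn.ReLU())
        self.mean = nn.Linear(hidden, action_dim)
        self.log_std = nn.Linear(hidden, action_dim)

    def sample(self, s):
        h = self.body(s)
        mean, log_std = self.mean(h), self.log_std(h).clamp(-20, 2)
        std = log_std.exp()
        xi = torch.randn_like(mean)            # exploration noise xi ~ N(0, I)
        action = mean + std * xi               # Hadamard product std ⊙ xi, training only
        dist = torch.distributions.Normal(mean, std)
        logp = dist.log_prob(action).sum(dim=-1, keepdim=True)
        return action, logp

    def deterministic(self, s):
        """After training, only the mean u_l_phi(s) is used."""
        return self.mean(self.body(s))

def actor_loss(s, actor, critic_1, critic_2, alpha=0.2):
    """Assumed standard policy-improvement objective: E[ alpha * log pi - min_j Q_j ]."""
    u_l, logp = actor.sample(s)
    q = torch.min(critic_1(s, u_l), critic_2(s, u_l))
    return (alpha * logp - q).mean()
```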

The temperature parameter α is also updated during the training phase. Its update is obtained by minimizing the following objective function:

[Temperature objective J_α(α) — shown as an image in the original]

where the term shown as an image in the original is the entropy of the policy. In the present invention, the corresponding target is set according to the action dimension (the setting is shown as an image in the original), where "2" denotes the action dimension.

S6. In the fault-free case, train the controller based on model reference reinforcement learning to obtain an initial control policy, ensuring the robustness of the overall controller to model uncertainty.

S7. Inject faults into the unmanned ship system and retrain the previously obtained initial control policy based on model reference reinforcement learning, so that the overall controller adapts to partial sensor faults.

S8. Under different initial-state conditions, repeat steps S6 and S7 until the reinforcement learning evaluation function network model and the control policy model converge.

Specifically, the training process of steps S6–S8 is as follows:
1) Initialize the parameters θ_1 and θ_2 of the two critics, and denote the actor network by φ;
2) Assign values to the target parameters θ̄_1 and θ̄_2 (the assignment is shown as an image in the original);
3) Run the system (5) under the baseline control u_b with u_l = 0 to obtain the dataset D_0;
4) End the exploration of this initial learning phase and use the dataset D_0 to train the initial critic parameters θ_1^0 and θ_2^0;
5) Initialize the memory pool D ← D_0;
6) Assign initial values to the critic parameters and their targets: θ_1 ← θ_1^0 and θ_2 ← θ_2^0, with the corresponding target assignments shown as an image in the original;
7) Repeat;
8) For each data collection step:
9) Select an action u_{l,t} according to π_φ(u_{l,t}|s_t);
10) Run the nominal system (6), the full system (5), and the fault diagnosis and estimation mechanism (14), and collect s_{t+1} = {x_{t+1}, x_{m,t+1}, u_{b,t+1}};
11) D ← D ∪ {s_t, u_{l,t}, R(s_t, u_{l,t}), s_{t+1}};
12) End the data collection loop;
13) For each gradient update step:
14) Sample a batch of data B from D;
15) θ_j ← θ_j − ι_Q ∇_θ J_Q(θ_j), j = 1, 2;
16) φ ← φ − ι_π ∇_φ J_π(φ);
17) α ← α − ι_α ∇_α J_α(α);
18) Softly update the target parameters θ̄_j using the constant κ, j = 1, 2;
19) End the gradient update loop;
20) Until convergence (e.g., J_Q(θ) < a small threshold).
In this algorithm, ι_Q, ι_π, and ι_α are positive learning rates (scalars), and κ > 0 is a constant scalar.

A fault-tolerant control system for unmanned ships based on model reference reinforcement learning includes:

a dynamics model construction module, configured to analyze the uncertainty factors of the unmanned ship and construct the nominal dynamics model of the unmanned ship;

a controller design module, configured to design the nominal controller of the unmanned ship based on the nominal dynamics model;

a fault-tolerant controller construction module, configured to construct, using the maximum-entropy Actor-Critic method, a fault-tolerant controller based on model reference reinforcement learning from the difference between the state variables of the actual unmanned ship system and the nominal dynamics model and from the output of the nominal controller;

a training module, configured to build a reinforcement learning evaluation function and a control policy model according to the control task requirements and to train the fault-tolerant controller to obtain the trained control policy.

The contents of the above method embodiments are all applicable to this system embodiment; the functions implemented by this system embodiment are the same as those of the above method embodiments, and the beneficial effects achieved are also the same.

A fault-tolerant control device for unmanned ships based on model reference reinforcement learning includes:

at least one processor; and

at least one memory for storing at least one program;

wherein, when the at least one program is executed by the at least one processor, the at least one processor implements the fault-tolerant control method for unmanned ships based on model reference reinforcement learning described above.

The contents of the above method embodiments are all applicable to this device embodiment; the functions implemented by this device embodiment are the same as those of the above method embodiments, and the beneficial effects achieved are also the same.

A storage medium stores processor-executable instructions which, when executed by a processor, implement the fault-tolerant control method for unmanned ships based on model reference reinforcement learning described above.

The contents of the above method embodiments are all applicable to this storage medium embodiment; the functions implemented by this storage medium embodiment are the same as those of the above method embodiments, and the beneficial effects achieved are also the same.

The above is a detailed description of the preferred implementation of the present invention, but the invention is not limited to the described embodiments. Those skilled in the art can make various equivalent modifications or substitutions without departing from the spirit of the present invention, and such equivalent modifications or substitutions are all included within the scope defined by the claims of the present application.

Claims (8)

1. A fault-tolerant control method for an unmanned ship based on model reference reinforcement learning, characterized by comprising the following steps:
S1, analyzing uncertainty factors of the unmanned ship and constructing a nominal dynamics model of the unmanned ship;
S2, designing a nominal controller of the unmanned ship based on the nominal dynamics model of the unmanned ship;
S3, constructing a fault-tolerant controller based on model reference reinforcement learning through a maximum-entropy Actor-Critic method, according to the difference between the state variables of the actual unmanned ship system and the nominal dynamics model of the unmanned ship and the output of the nominal controller of the unmanned ship;
S4, building a reinforcement learning evaluation function and a control policy model according to control task requirements, and training the fault-tolerant controller to obtain a trained control policy.
2. The unmanned ship fault-tolerant control method based on model reference reinforcement learning of claim 1, wherein the nominal dynamics model of the unmanned ship is expressed as follows:
[Formula — shown as an image in the original]
wherein η represents a generalized coordinate vector, v represents a generalized velocity vector, u represents control forces and moments, M represents an inertia matrix, C(v) comprises Coriolis and centripetal forces, D(v) represents a damping matrix, G(v) represents unmodeled dynamics due to gravity, buoyancy, and moments, and B represents a preset input matrix.
3. The unmanned ship fault-tolerant control method based on model reference reinforcement learning of claim 2, wherein the nominal controller of the unmanned ship is expressed as follows:
[Formula — shown as an image in the original]
wherein N_m and H_m comprise all known constant parameters of the unmanned ship dynamics model, η_m represents a generalized coordinate vector of the nominal model, u_m represents the control law, and x_m represents the state of the reference model.
4. The unmanned ship fault-tolerant control method based on model reference reinforcement learning of claim 3, wherein the fault-tolerant controller is expressed as follows:
[Formula — shown as an image in the original]
wherein H_m − L represents a Hurwitz matrix, u_l represents the control policy from the deep learning module, β(v) represents the set of all model uncertainties in the inner-loop dynamics, n_v represents a noise vector on the generalized velocity measurement, and f_v represents a sensor fault acting on the generalized velocity vector.
5. The unmanned ship fault-tolerant control method based on model reference reinforcement learning of claim 4, wherein the reinforcement learning evaluation function is expressed as follows:
Q^π(s_t, u_{l,t}) = T^π Q^π(s_t, u_{l,t})
[Expansion of the Bellman operator — shown as an image in the original]
wherein u_{l,t} represents the control excitation from RL, s_t represents the state signal at time step t, T^π represents the Bellman operator under the fixed policy π, E_π represents the expectation operator, γ represents the discount factor, α represents the temperature coefficient, and Q^π(s_t, u_{l,t}) represents the reinforcement learning evaluation function.
6. The unmanned ship fault-tolerant control method based on model reference reinforcement learning of claim 4, wherein the control policy model is expressed as follows:
[Formula — shown as an image in the original]
wherein Π represents the policy set, π_old represents the previously updated policy, Q^{π_old} represents the Q value of π_old, D_KL represents the KL divergence, Z^{π_old}(s_t) represents the normalization factor, and π(·|s_t) represents the control policy.
7. The unmanned ship fault-tolerant control method based on model reference reinforcement learning of claim 1, wherein the step of building a reinforcement learning evaluation function and a control policy model and training the fault-tolerant controller according to the control task requirements to obtain the trained control policy specifically comprises:
S41, building a reinforcement learning evaluation function and a control policy model for the fault-tolerant controller based on model reference reinforcement learning according to the control task requirements;
S42, training the fault-tolerant controller based on model reference reinforcement learning to obtain an initial control policy;
S43, injecting faults into the unmanned ship system, retraining the initial control policy, and returning to step S41 until the reinforcement learning evaluation function network model and the control policy model converge.
8. The unmanned ship fault-tolerant control method based on model reference reinforcement learning of claim 7, further comprising:
introducing a double evaluation function model, and adding an entropy value of the policy into the expected return function of the control policy.
CN202111631716.8A 2021-12-28 2021-12-28 A fault-tolerant control method for unmanned ships based on model reference reinforcement learning Active CN114296350B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111631716.8A CN114296350B (en) 2021-12-28 2021-12-28 A fault-tolerant control method for unmanned ships based on model reference reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111631716.8A CN114296350B (en) 2021-12-28 2021-12-28 A fault-tolerant control method for unmanned ships based on model reference reinforcement learning

Publications (2)

Publication Number Publication Date
CN114296350A true CN114296350A (en) 2022-04-08
CN114296350B CN114296350B (en) 2023-11-03

Family

ID=80972328

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111631716.8A Active CN114296350B (en) 2021-12-28 2021-12-28 A fault-tolerant control method for unmanned ships based on model reference reinforcement learning

Country Status (1)

Country Link
CN (1) CN114296350B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110109355A (en) * 2019-04-29 2019-08-09 山东科技大学 A kind of unmanned boat unusual service condition self-healing control method based on intensified learning
CN111694365A (en) * 2020-07-01 2020-09-22 武汉理工大学 Unmanned ship formation path tracking method based on deep reinforcement learning

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110109355A (en) * 2019-04-29 2019-08-09 山东科技大学 A kind of unmanned boat unusual service condition self-healing control method based on intensified learning
CN111694365A (en) * 2020-07-01 2020-09-22 武汉理工大学 Unmanned ship formation path tracking method based on deep reinforcement learning

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
ZHANG QINGRUI et al., "Fault Tolerant Control for Autonomous Surface Vehicles via Model Reference Reinforcement Learning", 2021 60th IEEE Conference on Decision and Control (CDC) *
ZHANG QINGRUI et al., "Model-Reference Reinforcement Learning Control of Autonomous Surface Vehicles", 2020 59th IEEE Conference on Decision and Control (CDC) *

Also Published As

Publication number Publication date
CN114296350B (en) 2023-11-03


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant