CN114296350A - A fault-tolerant control method for unmanned ships based on model reference reinforcement learning - Google Patents

A fault-tolerant control method for unmanned ships based on model reference reinforcement learning

Info

Publication number
CN114296350A
Authority
CN
China
Prior art keywords
model
unmanned ship
fault
reinforcement learning
tolerant
Prior art date
Legal status
Granted
Application number
CN202111631716.8A
Other languages
Chinese (zh)
Other versions
CN114296350B (en)
Inventor
张清瑞
熊培轩
张雷
朱波
胡天江
Current Assignee
Sun Yat Sen University
Original Assignee
Sun Yat Sen University
Priority date
Filing date
Publication date
Application filed by Sun Yat Sen University filed Critical Sun Yat Sen University
Priority to CN202111631716.8A
Publication of CN114296350A
Application granted
Publication of CN114296350B
Legal status: Active

Classifications

    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T90/00Enabling technologies or technologies with a potential or indirect contribution to GHG emissions mitigation

Landscapes

  • Feedback Control In General (AREA)

Abstract

The invention discloses a fault-tolerant control method for unmanned ships based on model reference reinforcement learning. The method includes: analyzing the uncertainty factors of the unmanned ship and constructing a nominal dynamics model of the unmanned ship; designing a nominal controller for the unmanned ship based on the nominal dynamics model; using the maximum-entropy Actor-Critic method to construct a fault-tolerant controller based on model reference reinforcement learning, taking as inputs the difference between the state variables of the actual unmanned ship system and the nominal dynamics model together with the output of the nominal controller; and, according to the control task requirements, building a reinforcement learning evaluation function and a control policy model and training the fault-tolerant controller to obtain the trained control policy. By using the present invention, the safety and reliability of the unmanned ship system can be significantly improved. As a fault-tolerant control method for unmanned ships based on model reference reinforcement learning, the present invention can be widely applied in the field of unmanned ship control.

Description

A fault-tolerant control method for unmanned ships based on model reference reinforcement learning

Technical Field

The present invention relates to the field of unmanned ship control, and in particular to a fault-tolerant control method for unmanned ships based on model reference reinforcement learning.

Background

With significant advances in guidance, navigation, and control technology, autonomous surface vehicles (ASVs, unmanned ships) have come to play a pivotal role in maritime applications. In most applications, unmanned ships are expected to operate safely for long periods without human intervention. They are therefore required to have sufficient safety and reliability to provide normal operation and to avoid catastrophic consequences. However, unmanned ships are prone to actuator faults, component degradation, sensor failures, and similar problems, and may consequently suffer performance degradation, instability, or even catastrophic losses.

Summary of the Invention

To solve the above technical problems, the purpose of the present invention is to provide a fault-tolerant control method for unmanned ships based on model reference reinforcement learning that can restore system performance or keep the system running after a fault occurs, thereby significantly improving the safety and reliability of the system.

The first technical solution adopted by the present invention is a fault-tolerant control method for unmanned ships based on model reference reinforcement learning, comprising the following steps:

S1. Analyze the uncertainty factors of the unmanned ship and construct a nominal dynamics model of the unmanned ship;

S2. Design a nominal controller for the unmanned ship based on the nominal dynamics model;

S3. Using the maximum-entropy Actor-Critic method, construct a fault-tolerant controller based on model reference reinforcement learning from the difference between the state variables of the actual unmanned ship system and the nominal dynamics model and from the output of the nominal controller;

S4. According to the control task requirements, build a reinforcement learning evaluation function and a control policy model, and train the fault-tolerant controller to obtain the trained control policy.

Further, the nominal dynamics model of the unmanned ship is expressed as follows:

[Formula — shown as an image in the original]

In the above formula, η denotes the generalized coordinate vector, v the generalized velocity vector, u the control forces and moments, M the inertia matrix, C(v) the Coriolis and centripetal terms, D(v) the damping matrix, G(v) the unmodeled dynamics due to gravity, buoyancy, and moments, and B a preset input matrix.

Further, the nominal controller of the unmanned ship is expressed as follows:

[Formula — shown as an image in the original]

In the above formula, N_m and H_m contain all known constant parameters of the unmanned ship dynamics model, η_m denotes the generalized coordinate vector of the nominal model, u_m denotes the control law, and x_m denotes the state of the reference model.

Further, the fault-tolerant controller is expressed as follows:

[Formula — shown as an image in the original]

In the above formula, H_m − L denotes a Hurwitz matrix, u_l denotes the control policy from the deep learning module, β(v) denotes the set of all model uncertainties in the inner-loop dynamics, n_v denotes the noise vector on the generalized velocity measurement, and f_v denotes the sensor fault acting on the generalized velocity vector.

Further, the reinforcement learning evaluation function satisfies:

Q^π(s_t, u_{l,t}) = T^π Q^π(s_t, u_{l,t})

[Expansion of the Bellman operator — shown as an image in the original]

In the above formula, u_{l,t} denotes the control excitation from RL, s_t denotes the state signal at time step t, T^π denotes the Bellman operator under the fixed policy π, E_π denotes the expectation operator, γ denotes the discount factor, α denotes the temperature coefficient, and Q^π(s_t, u_{l,t}) denotes the reinforcement learning evaluation function.

Further, the control policy model is expressed as follows:

[Policy-update formula — shown as an image in the original]

In the above formula, Π denotes the policy set, π_old denotes the previously updated policy, Q^{π_old} denotes the Q value of π_old, D_KL denotes the KL divergence, Z^{π_old}(s_t) denotes the normalization factor, and π(·|s_t) denotes the control policy, with the dot indicating that the action argument is omitted.

Further, the step of building a reinforcement learning evaluation function and a control policy model according to the control task requirements and training the fault-tolerant controller to obtain the trained control policy specifically includes:

S41. According to the control task requirements, build a reinforcement learning evaluation function and a control policy model for the fault-tolerant controller based on model reference reinforcement learning;

S42. Train the fault-tolerant controller based on model reference reinforcement learning to obtain an initial control policy;

S43. Inject faults into the unmanned ship system, retrain the initial control policy, and return to step S41 until the reinforcement learning evaluation function network model and the control policy model converge.

Further, the method also includes:

introducing a double evaluation function model and adding the entropy of the policy to the expected return function of the control policy, where R_t is the reward function, R_t = R(s_t, u_{l,t}).

The beneficial effects of the method of the present invention are as follows: for unmanned ship systems subject to model uncertainty and sensor faults, the present invention proposes a reinforcement-learning-based fault-tolerant control algorithm that combines model reference reinforcement learning with a fault diagnosis and estimation mechanism. Considering the low efficiency of Monte Carlo sampling, an Actor-Critic model is used and the cumulative return is replaced by a Q function. Through this new reinforcement-learning-based fault-tolerant control, the unmanned ship can learn to adapt to different sensor faults and recover its trajectory tracking performance under fault conditions.

Brief Description of the Drawings

Fig. 1 is a flowchart of the steps of the fault-tolerant control method for unmanned ships based on model reference reinforcement learning according to the present invention;

Fig. 2 is a structural block diagram of the Actor-Critic network according to a specific embodiment of the present invention.

Detailed Description

The present invention is further described in detail below with reference to the accompanying drawings and specific embodiments. The step numbers in the following embodiments are set only for convenience of description and do not limit the order of the steps in any way; the execution order of the steps in the embodiments can be adjusted according to the understanding of those skilled in the art.

As shown in Fig. 1, the present invention provides a fault-tolerant control method for unmanned ships based on model reference reinforcement learning (RL), which includes the following steps.

S1. Analyze the inherent uncertainty factors of the unmanned ship, ignore all nonlinear terms in the inner-loop dynamics to obtain a linear and decoupled model of the dynamics equation of the generalized velocity vector, and establish the nominal dynamics model of the unmanned ship.

The dynamics model is specifically:

[ASV dynamics model (1) — shown as an image in the original]

where η ∈ R^3 is the generalized coordinate vector, composed of the horizontal coordinates x_p and y_p of the ASV in the inertial frame and the heading angle; v = [u_p, v_p, r_p]^T ∈ R^3 is the generalized velocity vector, in which u_p and v_p are the linear velocities along the x-axis and y-axis, respectively, and r_p is the heading angular rate; u ∈ R^3 contains the control forces and moments; G(v) = [g_1(v), g_2(v), g_3(v)]^T ∈ R^3 is the unmodeled dynamics produced by gravity, buoyancy, and moments; and M ∈ R^{3×3} is the inertia matrix with M = M^T > 0, given by:

[Inertia matrix M — shown as an image in the original]

where the entries of M are model parameters (their expressions are shown as an image in the original). The matrix C(v) = −C^T(v) contains the Coriolis and centripetal forces and is given by:

[Matrix C(v) — shown as an image in the original]

where C_13(v) = −M_22 v − M_23 r and C_23(v) = M_11 u. The damping matrix is:

[Damping matrix D(v) — shown as an image in the original]

where D_11(v) = −X_u − X_{|u|u}|u| − X_{uuu}u², D_22(v) = −Y_v − Y_{|v|v}|v| − Y_{|r|v}|r|, D_23(v) = −Y_r − Y_{|v|r}|v| − Y_{|r|r}|r|, D_32(v) = −N_v − N_{|v|v}|v| − N_{|r|v}|r|, D_33(v) = −N_r − N_{|v|r}|v| − N_{|r|r}|r|, and X_(·), Y_(·), N_(·) are hydrodynamic coefficients whose definitions can be found in the handbook of marine craft hydrodynamics and motion control. The rotation matrix and the input matrix B are given by expressions shown as images in the original.

Defining x = [η^T v^T]^T, we have:

[State-space model (5) — shown as an image in the original]

where H(v) = −M^{−1}(C(v) + D(v)) and N = −M^{−1}B.
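To make the structure of this 3-DOF model concrete, the following is a minimal simulation sketch. It is not taken from the patent: the numerical values of M, the damping coefficients, and the input matrix B are placeholder assumptions, the unmodeled term G(v) and the nonlinear damping terms are omitted for brevity, and the expressions follow the standard marine-craft form described in the text.

```python
import numpy as np

# Placeholder model parameters (illustrative only, not from the patent).
M = np.diag([25.8, 33.8, 2.76])          # inertia matrix, M = M^T > 0
B = np.eye(3)                             # preset input matrix (assumed identity)
Xu, Yv, Nr = -0.72, -0.86, -0.50          # linear hydrodynamic coefficients (assumed values)

def C_matrix(v):
    """Coriolis/centripetal matrix C(v) = -C(v)^T with the sparsity used in the text."""
    u_p, v_p, r_p = v
    C13 = -M[1, 1] * v_p - M[1, 2] * r_p
    C23 = M[0, 0] * u_p
    return np.array([[0.0, 0.0, C13],
                     [0.0, 0.0, C23],
                     [-C13, -C23, 0.0]])

def D_matrix(v):
    """Linear part of the damping matrix (the |.| terms are omitted in this sketch)."""
    return np.diag([-Xu, -Yv, -Nr])

def J_matrix(psi):
    """Rotation from body-fixed velocities to inertial-frame rates."""
    c, s = np.cos(psi), np.sin(psi)
    return np.array([[c, -s, 0.0],
                     [s,  c, 0.0],
                     [0.0, 0.0, 1.0]])

def asv_step(eta, v, u, dt=0.01):
    """One Euler step of eta_dot = J(psi) v and M v_dot = -C(v) v - D(v) v + B u."""
    eta_dot = J_matrix(eta[2]) @ v
    v_dot = np.linalg.solve(M, -C_matrix(v) @ v - D_matrix(v) @ v + B @ u)
    return eta + dt * eta_dot, v + dt * v_dot
```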

The state measurements of the ASV system (1) are corrupted by noise and possible sensor faults and are therefore written as y = x + n + f(t), where n ∈ R^6 is the measurement noise vector and f(t) ∈ R^6 is the possible sensor fault vector. In the present invention, only sensor faults on the measurement of the heading angular rate r_p are considered, so f(t) = [0, 0, 0, 0, 0, f_r(t)]^T. The sensor fault f_r(t) is given by:

f_r(t) = β(t − T_f) φ(t − T_f),

where φ(t − T_f) is the unknown fault function occurring at the instant T_f, and β(t − T_f) is a time profile that is zero for t < T_f and, for t > T_f, takes the form shown as an image in the original (k is the evolution rate of the fault). Note that if the sensor fault occurs abruptly, for example a bias fault, then k → ∞. The objective of the present invention is to design a controller that allows the state x to track a reference state trajectory x_r in the presence of model uncertainty, possible sensor faults, and measurement noise.

S2. Based on the nominal dynamics model, design a nominal controller for the unmanned ship to guarantee the basic stability of the unmanned ship system in the fault-free case, and analyze the nominal model of the unmanned ship.

The nominal controller design process is as follows:

The proposed RL-based FTC algorithm follows a model reference control structure. For most ASV systems, an accurate nonlinear dynamics model is rarely available; the main uncertainties come from M, C(v), and D(v), which arise from the hydrodynamics, and from G(v), which arises from gravity, buoyancy, and moments. Despite the uncertainties in the ASV dynamics, a nominal model can still be built from the known information about the ASV dynamics (5). The nominal model of the uncertain ASV model (5) is as follows:

[Nominal model (6) — shown as an image in the original]

where N_m and H_m contain all known constant parameters of the ASV dynamics (5), and η_m is the generalized coordinate vector of the nominal model. In the present invention, M_m is taken as M_m = diag{M_11, M_22, M_33}, H_m = M_m^{−1} D_m with D_m = diag{−X_u, −Y_v, −N_r}, and N_m = M_m^{−1} B. The nominal model therefore ignores all nonlinear terms in the inner-loop dynamics, yielding a linear, decoupled model of the dynamics equation of the generalized velocity vector v. Since the dynamics of the nominal model (6) are known, a control law u_m can be designed so that the state of the nominal system (6) converges to the reference signal x_r, i.e., ||x_m − x_r||_2 → 0 as t → ∞. This control law u_m can also be used as the nominal controller for the full ASV dynamics (5).

In the model reference control structure, the goal is to design a control law that allows the state of (5) to track the state trajectory of the nominal model (6). The overall control law of the ASV system (5) has the following form:

u = u_b + u_l    (7)

where u_b is the nominal control from the model-based method and u_l is the control policy from the deep learning module. The baseline control u_b is used to guarantee some basic performance (i.e., local stability), while u_l is used to compensate for all system uncertainties and sensor faults.
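A minimal sketch of the model-reference structure in (7): the baseline term u_b stabilizes the nominal model and the learned policy adds a corrective term u_l. The feedback gain K and the interface of the policy function are placeholder assumptions, not the patent's implementation.

```python
import numpy as np

K = np.hstack([np.eye(3) * 1.0, np.eye(3) * 0.5])   # illustrative 3x6 feedback gain

def baseline_control(x_m, x_r):
    """u_b: state feedback on the nominal-model tracking error (the gain K is a placeholder)."""
    return -K @ (x_m - x_r)

def total_control(x_m, x_r, policy, s):
    """Overall control law u = u_b + u_l of equation (7).

    `policy` is assumed to map the RL state s to the corrective action u_l."""
    u_b = baseline_control(x_m, x_r)
    u_l = policy(s)
    return u_b + u_l
```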

S3. Using the maximum-entropy Actor-Critic method, construct a fault-tolerant controller based on model reference reinforcement learning, taking as inputs the difference between the state variables of the actual unmanned ship system and the nominal model together with the output of the nominal controller.

The network block diagram of the Actor-Critic structure is shown in Fig. 2. The derivation of the fault-tolerant controller is as follows:

The RL formulation is based on a Markov decision process (MDP) represented by the tuple MDP := <S, U, P, R, γ>, where S is the state space, U specifies the action/input space, P: S × U × S → R defines the transition probability, R: S × U → R is a reward function, and γ ∈ [0, 1) is a discount factor. In the MDP, the state vector s ∈ S contains all available signals that affect the RL control u_l ∈ U. For the tracking control of the ASV system considered in the present invention, the transition probability is characterized by the ASV dynamics in (1) and the reference signal x_r. In RL, the control policy is learned from data samples collected in the discrete-time domain. Let s_t be the state signal s at time step t and, correspondingly, u_{l,t} the input of the RL-based control at time step t. The RL algorithm in the present invention aims to maximize an action-value function, also called the Q function, given as follows:

[Q function (8) — shown as an image in the original]

where R_t is the reward function, R_t = R(s_t, u_{l,t}), and V^π(s_{t+1}) is the state-value function of s_{t+1} under the policy π:

[State-value function (9) — shown as an image in the original]

where π(u_{l,t}|s_t) is the control policy, the additional term (shown as an image in the original) is the entropy of the policy, and α is the temperature parameter. The control policy π(u_{l,t}|s_t) in RL is the probability of selecting the action u_{l,t} ∈ U in the state s_t ∈ S. In the present invention, a Gaussian control policy is adopted, i.e.,

π(u_l|s) = N(u_l(s), σ)    (10)

where N(·,·) denotes a Gaussian distribution, u_l(s) is its mean, and σ is its covariance matrix. The covariance matrix σ controls the exploration behavior during the learning phase.

The goal of RL is to find an optimal control policy π* that maximizes Q^π(s_t, u_{l,t}) in (8), i.e.,

π* = argmax_π Q^π(s_t, u_{l,t})    (11)

Note that the variance σ* converges to 0; once the optimal policy π*(u_l*|s) = N(u_l*(s), σ*) is obtained, the mean function u_l*(s) is the learned optimal control law. The deep neural network Q_θ(s_t, u_{l,t}) is called the critic, and the control policy π_φ(u_{l,t}|s_t) is called the actor. The uncertain inner-loop dynamics of the ASV model (5) are rewritten as:

[Rewritten inner-loop dynamics (12) — shown as an image in the original]

where β(v) is the collection of all model uncertainties in the inner-loop dynamics; the uncertainty term β(v) is assumed to be bounded. Letting e_v = v − v_m, the error dynamics follow from (6) and (12) as:

[Error dynamics (13) — shown as an image in the original]

Under healthy conditions, the model uncertainty term β(v) can be fully compensated by the learning-based control u_l. This means that ||e_v(t)||_2 ≤ ε as t → ∞, where ε is some small positive constant. If a sensor fault occurs, the error signal e_v becomes larger than ε. A naive idea for learning-based fault-tolerant control (FTC) is to treat sensor faults as part of the external disturbances. However, treating sensor faults as disturbances would lead to conservative learning-based control, such as robust control. Therefore, a fault diagnosis and estimation mechanism is introduced that allows the learning-based control to adapt to different scenarios: healthy and faulty conditions.

Let y_v = v + n_v + f_v, where n_v denotes the noise vector on the generalized velocity measurements and, correspondingly, f_v is the sensor fault acting on the generalized velocity vector. In addition, a fault tracking error vector is defined (its definition is shown as an image in the original). In practical applications, this fault tracking error is measurable, whereas e_v is not. Finally, the following fault diagnosis and estimation mechanism is introduced:

[Fault diagnosis and estimation mechanism (14) — shown as an image in the original]

where L is chosen such that H_m − L is Hurwitz. The signal produced by this mechanism (shown as an image in the original) serves as an indicator of the occurrence and intensity of sensor faults. Substituting the definitions yields:

[Resulting error dynamics — shown as an image in the original]

In the above formula, H_m − L denotes a Hurwitz matrix, u_l denotes the control policy from the deep learning module, β(v) denotes the set of all model uncertainties in the inner-loop dynamics, n_v denotes the noise vector on the generalized velocity measurement, and f_v denotes the sensor fault acting on the generalized velocity vector.

S4. According to the control task requirements, design the corresponding reward function, and use fully connected networks to build the reinforcement learning evaluation function model (Q-value) and the control policy model.

The reward function, the learning evaluation function, and the control policy model are derived as follows:

The RL-based fault-tolerant control is obtained using the output of the fault diagnosis and estimation mechanism. RL learns the control policy at discrete time steps from data samples (including input and state data). The sampling time step is assumed to be fixed and is denoted by δt. Without loss of generality, let y_t, u_{b,t}, u_{l,t}, and the output of the fault diagnosis and estimation mechanism denote, respectively, the ASV state, the nominal controller excitation, the control excitation from RL, and the fault estimate at time step t. The state signal s at time step t is then assembled from these quantities (its exact composition is shown as an image in the original). The training process of RL repeatedly performs policy evaluation and policy improvement. In the policy evaluation, the Q-value is obtained through the Bellman operation Q^π(s_t, u_{l,t}) = T^π Q^π(s_t, u_{l,t}), where:

[Bellman operator — shown as an image in the original]

In the above formula, u_{l,t} denotes the control excitation from RL, s_t denotes the state signal at time step t, T^π denotes the Bellman operator under the fixed policy π, E_π denotes the expectation operator, γ denotes the discount factor, α denotes the temperature coefficient, and Q^π(s_t, u_{l,t}) denotes the reinforcement learning evaluation function.
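The Bellman operator itself appears only as an image above; the sketch below assumes the standard maximum-entropy (soft) policy-evaluation backup, in which the next-state value is the Q value of a freshly sampled action minus α times its log-probability. The `policy.sample` interface returning an action and its log-probability is an assumed convention, not the patent's code.

```python
import torch

def soft_bellman_target(reward, next_state, done, critic_target, policy,
                        gamma=0.99, alpha=0.2):
    """Soft policy-evaluation backup (assumed standard max-entropy form):
        y = R_t + gamma * ( Q_target(s_{t+1}, u') - alpha * log pi(u'|s_{t+1}) ),
    with u' sampled from the current policy."""
    with torch.no_grad():
        next_action, next_logp = policy.sample(next_state)      # u' ~ pi(.|s_{t+1})
        next_q = critic_target(next_state, next_action)
        soft_value = next_q - alpha * next_logp
        return reward + gamma * (1.0 - done) * soft_value
```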

In the policy improvement, the policy is updated by:

[Policy-update rule — shown as an image in the original]

where Π denotes the policy set, π_old denotes the previously updated policy, Q^{π_old} denotes the Q value of π_old, D_KL denotes the Kullback-Leibler (KL) divergence, and Z^{π_old}(s_t) (shown as an image in the original) denotes the normalization factor. Through mathematical manipulation, this objective is transformed into:

[Transformed policy objective — shown as an image in the original]

S5. Introduce the idea of a double evaluation function (double critic) model into the evaluation function training architecture, and add the entropy of the policy to the expected return function of the control policy, to improve the efficiency of reinforcement learning training.

The derivation of the double evaluation function model is as follows:

The Q function is parameterized by θ and denoted Q_θ(s_t, u_{l,t}). The parameterized policy is denoted π_φ(u_{l,t}|s_t), where φ is the set of parameters to be trained. Note that both θ and φ are parameter sets whose dimensions are determined by the configuration of the deep neural networks. For example, if Q_θ is represented by an MLP with K hidden layers and L neurons per hidden layer, then the parameter set θ is θ = {θ_0, θ_1, ..., θ_K}, with θ_i ∈ R^{L×(L+1)} for 1 ≤ i ≤ K−1, θ_K ∈ R^{1×(L+1)}, and θ_0 of a size determined by dim_s and dim_u (shown as an image in the original), where dim_s denotes the dimension of the state s and dim_u denotes the dimension of the input u_l.

The entire training process is performed offline. At each time step t+1, data samples are collected, namely the input u_{l,t} from the previous time step, the previous state s_t, the reward R_t, and the current state s_{t+1}. These historical data are stored in a memory pool D as tuples (s_t, u_{l,t}, R_t, s_{t+1}). At each policy evaluation or improvement step, a batch B of historical data is randomly sampled from the memory pool D and used to train the parameters θ and φ. At the beginning of training, the nominal control policy u_b is applied to the ASV system to collect initial data D_0, as shown in Algorithm 1. The initial dataset D_0 is used for the initial fitting of the Q function. After initialization, u_b and the most recently updated reinforcement learning policy π_φ(u_{l,t}|s_t) are executed to run the ASV system.
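A minimal sketch of the memory pool D described above, storing transition tuples (s_t, u_{l,t}, R_t, s_{t+1}) and returning uniformly sampled mini-batches B; the capacity and batch size are illustrative choices, not values from the patent.

```python
import random
from collections import deque

import numpy as np

class ReplayPool:
    """Memory pool D holding (s_t, u_l_t, R_t, s_t1) tuples."""

    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def add(self, s, u_l, r, s_next):
        self.buffer.append((s, u_l, r, s_next))

    def sample(self, batch_size=256):
        batch = random.sample(list(self.buffer), batch_size)
        s, u_l, r, s_next = map(np.asarray, zip(*batch))
        return s, u_l, r, s_next

    def __len__(self):
        return len(self.buffer)
```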

The parameters θ of the Q function are trained to minimize the Bellman residual:

[Bellman residual objective J_Q(θ) (15) — shown as an image in the original]

where (s_t, u_{l,t}) ~ D means that the sample (s_t, u_{l,t}) is drawn at random from the memory pool D, and the target value is computed using slowly updated target parameters (shown as an image in the original). The DNN parameters θ are obtained by applying stochastic gradient descent to (15) over the sampled data batch B, whose size is denoted by |B|. Two critics, parameterized by θ_1 and θ_2 respectively, are used in the present invention; they are introduced to reduce the overestimation problem in the training of the critic neural network. Under the double evaluation function, the target value Y_target is:

[Target value Y_target — shown as an image in the original]
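The expression for Y_target appears only as an image; the sketch below assumes the usual clipped double-critic form, taking the minimum of the two slowly updated target critics before subtracting the entropy term, and combines it with the Bellman-residual loss for the two critics. The network interfaces are assumed conventions.

```python
import torch
import torch.nn.functional as F

def critic_loss(batch, critic_1, critic_2, target_critic_1, target_critic_2,
                policy, gamma=0.99, alpha=0.2):
    """Bellman-residual loss for both critics against the clipped double-Q target
    (assumed form: Y = R + gamma * (min_j Q_target_j(s', u') - alpha * log pi(u'|s')))."""
    s, u_l, r, s_next, done = batch
    with torch.no_grad():
        u_next, logp_next = policy.sample(s_next)
        q_next = torch.min(target_critic_1(s_next, u_next),
                           target_critic_2(s_next, u_next))
        y = r + gamma * (1.0 - done) * (q_next - alpha * logp_next)
    loss_1 = F.mse_loss(critic_1(s, u_l), y)
    loss_2 = F.mse_loss(critic_2(s, u_l), y)
    return loss_1 + loss_2
```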

The policy improvement step uses the data samples in the memory pool D to minimize the following parameterized objective function:

[Policy objective J_π(φ) — shown as an image in the original]

The parameter φ is trained by stochastic gradient descent to minimize this objective. In the training phase, the actor neural network is expressed as:

[Actor network with exploration noise — shown as an image in the original]

where the parameterized control law to be learned is u_{l,φ}(s) (shown as an image in the original), the standard deviation of the exploration noise is likewise shown as an image, ξ ~ N(0, I) is the exploration noise, and "⊙" denotes the Hadamard product. Note that the exploration noise ξ is only used in the training phase; once training is complete, only the deterministic part u_{l,φ}(s) is needed in operation. Therefore, u_l in the training phase is equivalent to u_{l,φ}. Once training is finished, the learned control law is obtained as the trained deterministic part u_{l,φ}(s).
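A sketch of the reparameterized Gaussian actor described above: the network outputs a mean action and a log standard deviation, the exploration noise ξ ~ N(0, I) is combined through an element-wise (Hadamard) product during training, and only the mean is used after training. The network sizes and the form of the actor loss (the standard maximum-entropy objective) are assumptions, not the patent's exact implementation.

```python
import torch
import torch.nn as nn

class GaussianActor(nn.Module):
    """pi_phi(u_l | s): outputs the mean u_l_phi(s) and the exploration std."""

    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU(),
                                  nn.Linear(hidden, hidden), nn.ReLU())
        self.mean = nn.Linear(hidden, action_dim)
        self.log_std = nn.Linear(hidden, action_dim)

    def sample(self, s):
        h = self.body(s)
        mean, log_std = self.mean(h), self.log_std(h).clamp(-20, 2)
        std = log_std.exp()
        xi = torch.randn_like(mean)            # exploration noise xi ~ N(0, I)
        action = mean + std * xi               # Hadamard product std ⊙ xi, training only
        dist = torch.distributions.Normal(mean, std)
        logp = dist.log_prob(action).sum(dim=-1, keepdim=True)
        return action, logp

    def deterministic(self, s):
        """After training, only the mean u_l_phi(s) is used."""
        return self.mean(self.body(s))

def actor_loss(s, actor, critic_1, critic_2, alpha=0.2):
    """Assumed standard policy-improvement objective: E[ alpha * log pi - min_j Q_j ]."""
    u_l, logp = actor.sample(s)
    q = torch.min(critic_1(s, u_l), critic_2(s, u_l))
    return (alpha * logp - q).mean()
```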

The temperature parameter α is also updated during the training phase. Its update is obtained by minimizing the following objective function:

[Temperature objective J_α(α) — shown as an image in the original]

where the term shown as an image in the original is the entropy of the policy. In the present invention, the corresponding target is set according to the action dimension (the setting is shown as an image in the original), where "2" denotes the action dimension.

S6. In the fault-free case, train the controller based on model reference reinforcement learning to obtain an initial control policy, ensuring the robustness of the overall controller to model uncertainty.

S7. Inject faults into the unmanned ship system and retrain the previously obtained initial control policy based on model reference reinforcement learning, so that the overall controller adapts to partial sensor faults.

S8. Under different initial-state conditions, repeat steps S6 and S7 until the reinforcement learning evaluation function network model and the control policy model converge.

Specifically, the training process of steps S6–S8 is as follows:
1) Initialize the parameters θ_1 and θ_2 of the two critics, and denote the actor network by φ;
2) Assign values to the target parameters θ̄_1 and θ̄_2 (the assignment is shown as an image in the original);
3) Run the system (5) under the baseline control u_b with u_l = 0 to obtain the dataset D_0;
4) End the exploration of this initial learning phase and use the dataset D_0 to train the initial critic parameters θ_1^0 and θ_2^0;
5) Initialize the memory pool D ← D_0;
6) Assign initial values to the critic parameters and their targets: θ_1 ← θ_1^0 and θ_2 ← θ_2^0, with the corresponding target assignments shown as an image in the original;
7) Repeat;
8) For each data collection step:
9) Select an action u_{l,t} according to π_φ(u_{l,t}|s_t);
10) Run the nominal system (6), the full system (5), and the fault diagnosis and estimation mechanism (14), and collect s_{t+1} = {x_{t+1}, x_{m,t+1}, u_{b,t+1}};
11) D ← D ∪ {s_t, u_{l,t}, R(s_t, u_{l,t}), s_{t+1}};
12) End the data collection loop;
13) For each gradient update step:
14) Sample a batch of data B from D;
15) θ_j ← θ_j − ι_Q ∇_θ J_Q(θ_j), j = 1, 2;
16) φ ← φ − ι_π ∇_φ J_π(φ);
17) α ← α − ι_α ∇_α J_α(α);
18) Softly update the target parameters θ̄_j using the constant κ, j = 1, 2;
19) End the gradient update loop;
20) Until convergence (e.g., J_Q(θ) < a small threshold).
In this algorithm, ι_Q, ι_π, and ι_α are positive learning rates (scalars), and κ > 0 is a constant scalar.

A fault-tolerant control system for unmanned ships based on model reference reinforcement learning includes:

a dynamics model construction module, configured to analyze the uncertainty factors of the unmanned ship and construct the nominal dynamics model of the unmanned ship;

a controller design module, configured to design the nominal controller of the unmanned ship based on the nominal dynamics model;

a fault-tolerant controller construction module, configured to construct, using the maximum-entropy Actor-Critic method, a fault-tolerant controller based on model reference reinforcement learning from the difference between the state variables of the actual unmanned ship system and the nominal dynamics model and from the output of the nominal controller;

a training module, configured to build a reinforcement learning evaluation function and a control policy model according to the control task requirements and to train the fault-tolerant controller to obtain the trained control policy.

The contents of the above method embodiments are all applicable to this system embodiment; the functions implemented by this system embodiment are the same as those of the above method embodiments, and the beneficial effects achieved are also the same.

A fault-tolerant control device for unmanned ships based on model reference reinforcement learning includes:

at least one processor; and

at least one memory for storing at least one program;

wherein, when the at least one program is executed by the at least one processor, the at least one processor implements the fault-tolerant control method for unmanned ships based on model reference reinforcement learning described above.

The contents of the above method embodiments are all applicable to this device embodiment; the functions implemented by this device embodiment are the same as those of the above method embodiments, and the beneficial effects achieved are also the same.

A storage medium stores processor-executable instructions which, when executed by a processor, implement the fault-tolerant control method for unmanned ships based on model reference reinforcement learning described above.

The contents of the above method embodiments are all applicable to this storage medium embodiment; the functions implemented by this storage medium embodiment are the same as those of the above method embodiments, and the beneficial effects achieved are also the same.

The above is a detailed description of the preferred implementation of the present invention, but the invention is not limited to the described embodiments. Those skilled in the art can make various equivalent modifications or substitutions without departing from the spirit of the present invention, and such equivalent modifications or substitutions are all included within the scope defined by the claims of the present application.

Claims (8)

1. A fault-tolerant control method for an unmanned ship based on model reference reinforcement learning, characterized by comprising the following steps:
S1, analyzing uncertainty factors of the unmanned ship and constructing a nominal dynamics model of the unmanned ship;
S2, designing a nominal controller of the unmanned ship based on the nominal dynamics model of the unmanned ship;
S3, constructing a fault-tolerant controller based on model reference reinforcement learning through a maximum-entropy Actor-Critic method, according to the difference between the state variables of the actual unmanned ship system and the nominal dynamics model of the unmanned ship and the output of the nominal controller of the unmanned ship;
S4, building a reinforcement learning evaluation function and a control policy model according to control task requirements, and training the fault-tolerant controller to obtain a trained control policy.
2. The unmanned ship fault-tolerant control method based on model reference reinforcement learning of claim 1, wherein the nominal dynamics model of the unmanned ship is expressed as follows:
[Formula — shown as an image in the original]
wherein η represents a generalized coordinate vector, v represents a generalized velocity vector, u represents control forces and moments, M represents an inertia matrix, C(v) comprises Coriolis and centripetal forces, D(v) represents a damping matrix, G(v) represents unmodeled dynamics due to gravity, buoyancy, and moments, and B represents a preset input matrix.
3. The unmanned ship fault-tolerant control method based on model reference reinforcement learning of claim 2, wherein the nominal controller of the unmanned ship is expressed as follows:
[Formula — shown as an image in the original]
wherein N_m and H_m comprise all known constant parameters of the unmanned ship dynamics model, η_m represents a generalized coordinate vector of the nominal model, u_m represents the control law, and x_m represents the state of the reference model.
4. The unmanned ship fault-tolerant control method based on model reference reinforcement learning of claim 3, wherein the fault-tolerant controller is expressed as follows:
[Formula — shown as an image in the original]
wherein H_m − L represents a Hurwitz matrix, u_l represents the control policy from the deep learning module, β(v) represents the set of all model uncertainties in the inner-loop dynamics, n_v represents a noise vector on the generalized velocity measurement, and f_v represents a sensor fault acting on the generalized velocity vector.
5. The unmanned ship fault-tolerant control method based on model reference reinforcement learning of claim 4, wherein the reinforcement learning evaluation function is expressed as follows:
Q^π(s_t, u_{l,t}) = T^π Q^π(s_t, u_{l,t})
[Expansion of the Bellman operator — shown as an image in the original]
wherein u_{l,t} represents the control excitation from RL, s_t represents the state signal at time step t, T^π represents the Bellman operator under the fixed policy π, E_π represents the expectation operator, γ represents the discount factor, α represents the temperature coefficient, and Q^π(s_t, u_{l,t}) represents the reinforcement learning evaluation function.
6. The unmanned ship fault-tolerant control method based on model reference reinforcement learning of claim 4, wherein the control policy model is expressed as follows:
[Formula — shown as an image in the original]
wherein Π represents the policy set, π_old represents the previously updated policy, Q^{π_old} represents the Q value of π_old, D_KL represents the KL divergence, Z^{π_old}(s_t) represents the normalization factor, and π(·|s_t) represents the control policy.
7. The unmanned ship fault-tolerant control method based on model reference reinforcement learning of claim 1, wherein the step of building a reinforcement learning evaluation function and a control policy model and training the fault-tolerant controller according to the control task requirements to obtain the trained control policy specifically comprises:
S41, building a reinforcement learning evaluation function and a control policy model for the fault-tolerant controller based on model reference reinforcement learning according to the control task requirements;
S42, training the fault-tolerant controller based on model reference reinforcement learning to obtain an initial control policy;
S43, injecting faults into the unmanned ship system, retraining the initial control policy, and returning to step S41 until the reinforcement learning evaluation function network model and the control policy model converge.
8. The unmanned ship fault-tolerant control method based on model reference reinforcement learning of claim 7, further comprising:
introducing a double evaluation function model, and adding an entropy value of the policy into the expected return function of the control policy.
CN202111631716.8A 2021-12-28 2021-12-28 A fault-tolerant control method for unmanned ships based on model reference reinforcement learning Active CN114296350B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111631716.8A CN114296350B (en) 2021-12-28 2021-12-28 A fault-tolerant control method for unmanned ships based on model reference reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111631716.8A CN114296350B (en) 2021-12-28 2021-12-28 A fault-tolerant control method for unmanned ships based on model reference reinforcement learning

Publications (2)

Publication Number Publication Date
CN114296350A true CN114296350A (en) 2022-04-08
CN114296350B CN114296350B (en) 2023-11-03

Family

ID=80972328

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111631716.8A Active CN114296350B (en) 2021-12-28 2021-12-28 A fault-tolerant control method for unmanned ships based on model reference reinforcement learning

Country Status (1)

Country Link
CN (1) CN114296350B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110109355A (en) * 2019-04-29 2019-08-09 山东科技大学 A kind of unmanned boat unusual service condition self-healing control method based on intensified learning
CN111694365A (en) * 2020-07-01 2020-09-22 武汉理工大学 Unmanned ship formation path tracking method based on deep reinforcement learning

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110109355A (en) * 2019-04-29 2019-08-09 山东科技大学 A kind of unmanned boat unusual service condition self-healing control method based on intensified learning
CN111694365A (en) * 2020-07-01 2020-09-22 武汉理工大学 Unmanned ship formation path tracking method based on deep reinforcement learning

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
ZHANG QINGRUI et al., "Fault Tolerant Control for Autonomous Surface Vehicles via Model Reference Reinforcement Learning", 2021 60th IEEE Conference on Decision and Control (CDC) *
ZHANG QINGRUI et al., "Model-Reference Reinforcement Learning Control of Autonomous Surface Vehicles", 2020 59th IEEE Conference on Decision and Control (CDC) *

Also Published As

Publication number Publication date
CN114296350B (en) 2023-11-03


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant