CN112100834A - Underwater glider attitude control method based on deep reinforcement learning - Google Patents

Underwater glider attitude control method based on deep reinforcement learning

Info

Publication number
CN112100834A
Authority
CN
China
Prior art keywords
neural network
current
evaluation
target
decision
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010925225.3A
Other languages
Chinese (zh)
Inventor
高剑
宋保维
潘光
张福斌
王鹏
曹永辉
杜晓旭
彭星光
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Northwestern Polytechnical University
Original Assignee
Northwestern Polytechnical University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Northwestern Polytechnical University filed Critical Northwestern Polytechnical University
Priority to CN202010925225.3A priority Critical patent/CN112100834A/en
Publication of CN112100834A publication Critical patent/CN112100834A/en
Pending legal-status Critical Current



Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 30/00 - Computer-aided design [CAD]
    • G06F 30/20 - Design optimisation, verification or simulation
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/044 - Recurrent networks, e.g. Hopfield networks
    • G06N 3/045 - Combinations of networks
    • G06N 3/049 - Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G06N 3/06 - Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N 3/061 - Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons, using biological neurons, e.g. biological neurons connected to an integrated circuit
    • G06N 3/08 - Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Neurology (AREA)
  • Microelectronics & Electronic Packaging (AREA)
  • Computer Hardware Design (AREA)
  • Geometry (AREA)
  • Feedback Control In General (AREA)

Abstract

The invention proposes an underwater glider attitude control method based on deep reinforcement learning, comprising a learning stage and an application stage. In the learning stage, the motion of the underwater glider is simulated while the real-time motion data are recorded, and the parameters of the current decision network, current evaluation network, target decision network and target evaluation network are updated from these data. Once the deep reinforcement learning neural network model is trained, it is applied to an actual underwater glider gliding in the vertical plane: given a target pitch angle θ_d, the state of the glider is sampled and fed to the model, which outputs the control quantity that realizes attitude control. The method learns from simulation model data or manual experiment data, so the learning procedure is simple; it does not require an accurate mathematical model of the underwater glider and remains applicable in complex environments.

Description

Attitude Control Method for an Underwater Glider Based on Deep Reinforcement Learning

Technical Field

The invention relates to control technology for underwater robots, and in particular to an attitude control method for an underwater glider based on deep reinforcement learning.

Background

An underwater glider is a new type of underwater vehicle that combines buoy and mooring-buoy technology with underwater robot technology; it carries no external propulsion and is driven by its own gravity. Its main characteristic is that motion control does not rely on a propeller propulsion system: by adjusting the net buoyancy the glider rises and sinks, and the horizontal wings attached to the hull generate obliquely upward or downward lift that drives the glider forward. The underwater glider overcomes the high power consumption and short endurance of conventional underwater vehicles, greatly reduces operating and manufacturing costs, and extends mission duration, which makes it highly valuable for military use and for ocean exploration and research.

The motion attitude of an underwater glider is easily disturbed by currents and waves. At the same time, the glider body is structurally complex, its actuation is limited, and its dynamics are strongly nonlinear; accurate model parameters are difficult to obtain, and a model built for one water environment lacks generality in others. Although many traditional control methods can realize attitude control of an underwater glider with a certain accuracy, they still cannot meet high-precision requirements, and the control procedure is relatively complicated.

Summary of the Invention

Technical Problem to Be Solved

The purpose of the invention is to overcome the shortcomings of the prior art by providing an underwater glider attitude control method based on deep reinforcement learning: a deep reinforcement learning neural network model is built and trained on simulation model data or manual experiment data, so that precise attitude control of the underwater glider can be achieved.

Technical Solution

The underwater glider attitude control method based on deep reinforcement learning proposed by the invention comprises a learning stage and an application stage. In the learning stage, the motion of the underwater glider is simulated while the real-time motion data are recorded, and the parameters of the current decision network, current evaluation network, target decision network and target evaluation network are updated from these data. The specific steps are as follows:

Step 1: Build four BP neural networks: the current decision network, the current evaluation network, the target decision network and the target evaluation network. The current and target decision networks are collectively called decision networks; the current and target evaluation networks are collectively called evaluation networks. A decision network takes the state of the underwater glider as input and outputs the control quantity a as its action. An evaluation network takes the state and the control quantity of the underwater glider as inputs and outputs an evaluation value.

After building the networks, initialize the parameters of the four networks, the memory bank and the size of the data buffer.

Step 2: Obtain the state s_t of the underwater glider at the current time, feed it to the current decision network to compute the attitude controller output action a_t, and apply a_t to the underwater glider simulator to obtain the state s_{t+1} at the next time. The reward r_t at the current time is computed from the current state s_t, the current action a_t, the target pitch angle θ_d and the next state s_{t+1}.

Preferably, r_t is taken as:

$r_t = r_1 + r_2 + r_3$

where r_1 = -λ_1(d_t - d_{t-1}) is the reward obtained as the current pitch angle moves away from or toward the desired angle; r_2 = -λ_2(w_t - w_{t-1}) is the reward obtained from the change in angular velocity; r_3 is set according to the current pitch angle: if |d_t - d_{t-1}| < 0.1°, r_3 is a positive reward, and if θ < -90° or θ > 90°, r_3 is a negative value representing a penalty. Here d_t is the difference between the current pitch angle θ and the target pitch angle θ_d at time t, w_t is the pitch angular rate at time t, and λ_1 and λ_2 are preset coefficients.
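
For illustration only, the reward above can be written as the following Python sketch; the coefficient values, the bonus/penalty magnitudes and the use of absolute values for d_t are assumptions chosen for the example, since the text does not fix them:

```python
def reward(theta, theta_prev, w, w_prev, theta_d,
           lam1=1.0, lam2=0.1, bonus=1.0, penalty=-10.0):
    """Reward r_t = r1 + r2 + r3 for pitch-angle tracking (angles in degrees)."""
    d_t = abs(theta - theta_d)          # distance to target pitch at time t (assumed |.|)
    d_prev = abs(theta_prev - theta_d)  # distance at time t-1
    r1 = -lam1 * (d_t - d_prev)         # approaching / leaving the desired angle
    r2 = -lam2 * (w - w_prev)           # change in pitch angular rate
    r3 = 0.0
    if abs(d_t - d_prev) < 0.1:         # pitch error has essentially stopped changing
        r3 = bonus                      # positive "reward"
    if theta < -90.0 or theta > 90.0:   # pitched past vertical
        r3 = penalty                    # negative value, i.e. punishment
    return r1 + r2 + r3
```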

Step 3: Store the tuple (s_t, a_t, r_t, s_{t+1}) obtained in step 2 as one experience data unit in the memory bank and increment t by 1. Compare t with the preset memory bank size n: if t < n, return to step 2 with the updated s_t; once the number of experience units stored in the memory bank reaches n, go to step 4.

Step 4: Sample a specified number N of experience data units from the memory bank and place them in the buffer.

Preferably, the N experience units in the buffer are prioritized by a priority-based experience replay mechanism and stored in a SumTree data structure; the value of the root node of the SumTree is the sum of the priorities of the experience units represented by all nodes, denoted p_total.

Step 5: From the N experience units stored in the SumTree obtained in step 4, sample m experience units. To ensure that every kind of experience has a chance of being selected, the m units are sampled by the following procedure:

Divide [0, p_total] into m equal sub-intervals, draw one sample uniformly at random in each sub-interval, and retrieve the experience unit whose priority corresponds to each sampled value, giving m experience units in total.

The m sampled experience units are processed one by one as follows to obtain the gradient ∇_{σ^Q}L of the current evaluation network.

For an experience unit (s_t, a_t, r_t, s_{t+1}), feed the state s_t and the action signal a_t into the current evaluation network to obtain its evaluation value Q; feed the next state s_{t+1} into the target decision network to obtain the actuator action signal μ' output by the target decision network; feed the next state s_{t+1} together with the action μ' output by the target decision network into the target evaluation network to obtain its evaluation value Q'.

Using the evaluation value Q of the current evaluation network, the evaluation value Q' of the target evaluation network and the loss function L of the evaluation network, compute the gradient ∇_{σ^Q}L of the current evaluation network.

Specifically, the loss function L of the current evaluation network is:

$L = \frac{1}{m}\sum_{i=1}^{m} \omega_i\,\delta_i^2$

where ω_i is the importance-sampling weight of the i-th experience unit and δ_i is:

$\delta_i = r_i + \gamma\, Q'\big(s_{i+1}, \mu'(s_{i+1}|\sigma^{\mu'})\,\big|\,\sigma^{Q'}\big) - Q(s_i, a_i|\sigma^Q)$

Here r_i is the reward in the i-th experience unit and s_{i+1} is the next-time state in the i-th experience unit; σ^{μ'} denotes the parameters of the target decision network, and μ'(s_{i+1}|σ^{μ'}) is the actuator action output by the target decision network for s_{i+1}. σ^{Q'} denotes the parameters of the target evaluation network, and Q'(s_{i+1}, μ'(s_{i+1}|σ^{μ'})|σ^{Q'}) is the evaluation value obtained by feeding μ'(s_{i+1}|σ^{μ'}) and s_{i+1} into the target evaluation network. s_i is the current state in the i-th experience unit, a_i is the action taken in that state, σ^Q denotes the parameters of the current evaluation network, and Q(s_i, a_i|σ^Q) is the evaluation value of the current evaluation network. γ is the discount factor, taken as 0.99.

The gradient of the loss function L of the current evaluation network is:

$\nabla_{\sigma^Q} L = -\frac{2}{m}\sum_{i=1}^{m} \omega_i\,\delta_i\,\nabla_{\sigma^Q} Q(s_i, a_i|\sigma^Q)$

Step 6: Update the current evaluation network. Using the gradient ∇_{σ^Q}L, the current evaluation network parameters σ^Q are updated by the gradient step

$\sigma^Q \leftarrow \sigma^Q - \alpha\,\nabla_{\sigma^Q} L$

where α is the learning rate of the evaluation network, taken as 0.001.
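
For concreteness, a PyTorch sketch of this evaluation (critic) update, including the importance-sampling weights ω_i, is given below; the network classes, optimizer and tensor shapes are assumptions (any critic mapping a state-action pair to a scalar works), not the patent's exact implementation:

```python
import torch

def critic_update(critic, critic_target, actor_target, critic_opt,
                  batch, weights, gamma=0.99):
    """One update of the current evaluation (critic) network, steps 5-6.

    batch   : tuple of tensors (s, a, r, s_next), each with m rows
    weights : importance-sampling weights ω_i, shape (m, 1)
    """
    s, a, r, s_next = batch
    with torch.no_grad():
        a_next = actor_target(s_next)              # μ'(s_{i+1} | σ^{μ'})
        q_next = critic_target(s_next, a_next)     # Q'(s_{i+1}, μ'(...) | σ^{Q'})
        y = r + gamma * q_next                     # TD target
    q = critic(s, a)                               # Q(s_i, a_i | σ^Q)
    delta = y - q                                  # TD error δ_i
    loss = (weights * delta.pow(2)).mean()         # L = (1/m) Σ ω_i δ_i²
    critic_opt.zero_grad()
    loss.backward()                                # ∇_{σ^Q} L
    critic_opt.step()                              # σ^Q update with learning rate α
    return delta.detach().abs()                    # |δ_i|, usable as new priorities
```

The returned |δ_i| can be written back into the SumTree as the refreshed priorities p_i = |δ_i| + ε mentioned in connection with step 4.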

Step 7: Compute the gradient of the current decision network, which is calculated as

$\nabla_{\sigma^\mu} J = \frac{1}{m}\sum_{i=1}^{m}\nabla_a Q(s_i, a|\sigma^Q)\big|_{a=\mu(s_i)}\;\nabla_{\sigma^\mu}\mu(s_i|\sigma^\mu)$

where ∇_a Q(s_i, a|σ^Q)|_{a=μ(s_i)} is the gradient of the evaluation value of the current evaluation network with respect to the action a, μ(s_i) is the actuator action signal output by the current decision network for s_i, and ∇_{σ^μ} μ(s_i|σ^μ) is the gradient of the current decision network action with respect to the current decision network parameters σ^μ.

Step 8: Update the current decision network. Using the gradient ∇_{σ^μ}J, the current decision network parameters σ^μ are updated by the gradient step

$\sigma^\mu \leftarrow \sigma^\mu + \beta\,\nabla_{\sigma^\mu} J$

where β is the learning rate of the decision network, taken as 0.001.

Step 9: Update the target evaluation network and the target decision network. The target evaluation network parameters are updated from the updated current evaluation network parameters, and the target decision network parameters from the updated current decision network parameters, according to

$\sigma^{Q'}_{t+1} = \tau_1\,\sigma^{Q}_{t+1} + (1-\tau_1)\,\sigma^{Q'}_{t}, \qquad \sigma^{\mu'}_{t+1} = \tau_2\,\sigma^{\mu}_{t+1} + (1-\tau_2)\,\sigma^{\mu'}_{t}$

where σ^Q_{t+1} denotes the updated current evaluation network parameters, σ^{Q'}_t the target evaluation network parameters before the update and σ^{Q'}_{t+1} after the update; σ^μ_{t+1} denotes the updated current decision network parameters, σ^{μ'}_t the target decision network parameters before the update and σ^{μ'}_{t+1} after the update; τ_1 = 0.001 is the update rate of the target evaluation network and τ_2 = 0.0001 is the update rate of the target decision network.
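
This soft update can be written in a few lines of PyTorch; the sketch assumes that the current and target networks share the same architecture (as they do here, since the targets mirror the current networks):

```python
import torch

def soft_update(target_net, current_net, tau):
    """σ'_{t+1} = τ σ_{t+1} + (1 - τ) σ'_t, applied parameter by parameter."""
    with torch.no_grad():
        for p_target, p_current in zip(target_net.parameters(), current_net.parameters()):
            p_target.mul_(1.0 - tau).add_(tau * p_current)

# Usage with the rates given in step 9:
# soft_update(critic_target, critic, tau=0.001)    # τ1, target evaluation network
# soft_update(actor_target, actor, tau=0.0001)     # τ2, target decision network
```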

Step 10: Check whether the number of training iterations exceeds the preset number. If it does, stop training and save the parameter values of the four networks; otherwise return to step 4 and again sample N experience units from the memory bank into the buffer.

Step 11: After the trained deep reinforcement learning model is obtained, it is applied to an actual underwater glider gliding in the vertical plane: given a target pitch angle θ_d, the state of the glider is sampled and fed to the model, which outputs the control quantity that realizes attitude control of the underwater glider.

Beneficial Effects

Compared with the prior art, the technical solution of the invention has the following beneficial effects:

1. The method learns from simulation model data or manual experiment data to realize attitude control of the underwater glider, and the learning procedure is simple.

2. There is no need to obtain an accurate mathematical model of the underwater glider, and the method is also applicable in complex environments.

Additional aspects and advantages of the invention will be given in part in the following description; in part they will become apparent from the description, or may be learned by practice of the invention.

Brief Description of the Drawings

The above and/or additional aspects and advantages of the invention will become apparent and easy to understand from the following description of embodiments taken in conjunction with the accompanying drawings, in which:

Figure 1: Schematic diagram of the deep reinforcement learning algorithm framework;

Figure 2: Control diagram of the longitudinal gliding motion of the underwater glider;

Figure 3: Schematic diagram of the SumTree structure;

Figure 4: Episode reward during the underwater glider attitude control training process;

Figure 5: Evolution of the pitch angle under the attitude adjustment unit in the application stage, with the desired gliding pitch angle set to -25°.

Detailed Description of the Embodiments

An example of the invention is described in detail below. In this example, control of the pitch angle of an actual underwater glider in the vertical plane is taken as the case study. First, the basic characteristic parameters of the glider are determined and the mathematical model of its gliding motion in the vertical plane is obtained. The control input is the velocity command a_t of the movable mass along the x-axis, and the glider state is s: {v_1, v_3, ω_2, θ}, where v_1 and v_3 are the velocities along the x- and z-axes of the body frame (the x-axis is the forward axis of the body frame and the z-axis is perpendicular to the body plane), ω_2 is the pitch angular velocity, and θ is the pitch angle. The block diagram of longitudinal gliding motion control with the mass velocity as input is shown in Figure 2.
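
To make the later training sketches concrete, the simulator can be wrapped behind a small reset/step interface as below; the first-order dynamics inside `step` are only a placeholder so the code runs, not the vertical-plane model referred to above (which is not reproduced in this text):

```python
import numpy as np

class GliderSim:
    """Toy stand-in for the vertical-plane glider simulator.

    State s = [v1, v3, w2, theta]; action a = velocity command of the movable
    mass on the x-axis. The dynamics below are a crude placeholder.
    """

    def __init__(self, theta_d=-25.0, dt=0.1):
        self.theta_d = theta_d                    # target pitch angle (degrees)
        self.dt = dt
        self.s = np.zeros(4)

    def reset(self):
        self.s = np.array([0.3, 0.0, 0.0, 0.0])   # arbitrary initial state
        return self.s.copy()

    def step(self, a):
        v1, v3, w2, theta = self.s
        w2 = w2 + self.dt * (0.5 * float(a) - 0.1 * w2)   # placeholder pitch dynamics
        theta = theta + self.dt * w2                      # w2 treated as deg/s here
        self.s = np.array([v1, v3, w2, theta])
        return self.s.copy()
```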

The principle of this example is to use a deep reinforcement learning method with prioritized sampling to control the pitch angle of the underwater glider in the vertical plane. The specific steps are as follows:

Step 1: Build four BP neural networks: the current decision network, the current evaluation network, the target decision network and the target evaluation network. The current and target decision networks are collectively called decision networks; the current and target evaluation networks are collectively called evaluation networks. A decision network has four inputs, namely the glider state values v_1, v_3, ω_2, θ at the current time, and one output, the action value a. An evaluation network has five inputs, v_1, v_3, ω_2, θ, a, and one output, the evaluation value. In this embodiment each of the four BP networks has two hidden layers, the first with 400 nodes and the second with 300 nodes, and each has a single output node.

The mapping of the BP neural network proceeds as follows:

Input layer → hidden-layer activation function → output of the first hidden layer:

$z_{1k} = f_1\!\left(\sum_{i=1}^{\varepsilon} v_{ki}\,x_i + b_k\right), \quad k = 1, \dots, h_1$

where x_i is the i-th input and ε is the total number of inputs; v_{ki} is the weight from the input layer to the first hidden layer, b_k is the bias, z_{1k} is the k-th node of the first hidden layer, h_1 is taken as 400, and f_1(·) is the hidden-layer activation function, chosen as the ReLU function.

Output of the first hidden layer → hidden-layer activation function → output of the second hidden layer:

$z_{2j} = f_2\!\left(\sum_{k=1}^{h_1} w_{jk}\,z_{1k} + b_j\right), \quad j = 1, \dots, h_2$

where w_{jk} is the weight between the first and second hidden layers, b_j is the bias, z_{2j} is the j-th output of the second hidden layer, h_2 is taken as 300, and f_2(·) is the activation function of the second hidden layer, chosen as the ReLU function.

Output of the second hidden layer → output-layer activation function → output of the output layer:

$y = f_3\!\left(\sum_{j=1}^{h_2} w_j\,z_{2j} + b_l\right)$

where y is the (scalar) output, w_j is the weight between the second hidden layer and the output layer, b_l is the bias, and f_3(·) is the output-layer activation function: the Tanh function for the decision networks and the ReLU function for the evaluation networks.
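
A PyTorch sketch of these two network types with the stated layer sizes and activations is shown below; the action-scaling constant `max_action` is an assumption, since the text leaves the output range of the decision network implicit:

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Decision network: state (v1, v3, w2, theta) -> action a, Tanh output."""
    def __init__(self, state_dim=4, max_action=1.0):
        super().__init__()
        self.max_action = max_action
        self.net = nn.Sequential(
            nn.Linear(state_dim, 400), nn.ReLU(),
            nn.Linear(400, 300), nn.ReLU(),
            nn.Linear(300, 1), nn.Tanh(),
        )

    def forward(self, s):
        return self.max_action * self.net(s)

class Critic(nn.Module):
    """Evaluation network: (state, action) -> evaluation value, ReLU output."""
    def __init__(self, state_dim=4, action_dim=1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, 400), nn.ReLU(),
            nn.Linear(400, 300), nn.ReLU(),
            nn.Linear(300, 1), nn.ReLU(),
        )

    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1))
```

The four networks of step 1 are then two `Actor` instances (current and target decision networks) and two `Critic` instances (current and target evaluation networks), with the targets typically initialized as copies of the current networks.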

The parameters of the four networks are initialized; the memory bank size is initialized to n = 10000 and the data buffer size N to 64.
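
Wiring these pieces together with the hyperparameters stated in this embodiment might look as follows; `Actor` and `Critic` refer to the sketch above, while the use of `copy.deepcopy` for the target copies and Adam as the optimizer are implementation assumptions, not requirements of the method:

```python
import copy
import torch

# Hyperparameter values stated in this embodiment
ALPHA = 1e-3              # learning rate α of the evaluation (critic) networks
BETA = 1e-3               # learning rate β of the decision (actor) networks
GAMMA = 0.99              # discount factor γ
TAU1, TAU2 = 1e-3, 1e-4   # soft-update rates τ1 (critic target) and τ2 (actor target)
MEMORY_SIZE = 10000       # memory bank size n
BUFFER_SIZE = 64          # data buffer size N

actor = Actor()                        # current decision network
critic = Critic()                      # current evaluation network
actor_target = copy.deepcopy(actor)    # target decision network (initial copy)
critic_target = copy.deepcopy(critic)  # target evaluation network (initial copy)

actor_opt = torch.optim.Adam(actor.parameters(), lr=BETA)
critic_opt = torch.optim.Adam(critic.parameters(), lr=ALPHA)
```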

Step 2: Obtain the state s_t: {v_1, v_3, ω_2, θ} of the underwater glider at the current time, feed it to the current decision network to compute the attitude controller output action a_t, and apply a_t to the simulator built from the mathematical model of the glider's vertical-plane gliding motion to obtain the next state s_{t+1}. The reward r_t at the current time is computed from s_t, a_t, the target pitch angle θ_d and s_{t+1}. In this embodiment r_t is taken as:

$r_t = r_1 + r_2 + r_3$

where r_1 = -λ_1(d_t - d_{t-1}) is the reward obtained as the current pitch angle moves away from or toward the desired angle; r_2 = -λ_2(w_t - w_{t-1}) is the reward obtained from the change in angular velocity; r_3 is set according to the current pitch angle: if |d_t - d_{t-1}| < 0.1°, r_3 is a positive reward, and if θ < -90° or θ > 90°, r_3 is a negative value representing a penalty. Here d_t is the difference between the current pitch angle θ and the target pitch angle θ_d at time t, w_t is the pitch angular rate at time t, and λ_1 and λ_2 are preset coefficients.

Step 3: Store the tuple (s_t, a_t, r_t, s_{t+1}) obtained in step 2 as one experience data unit in the memory bank and increment t by 1. Compare t with the preset memory bank size n: if t < n, return to step 2 with the updated s_t; once the number of experience units stored in the memory bank reaches n, go to step 4.

Step 4: Sample a specified number N of experience data units from the memory bank and place them in the buffer. The N units in the buffer are prioritized by the priority-based experience replay mechanism and stored in a SumTree data structure; the value of the root node of the SumTree is the sum of the priorities of the experience units represented by all nodes, denoted p_total.

The priority-based experience replay works as follows. Define P(i) as the sampling probability of the i-th experience; P(i) is computed as:

$P(i) = \frac{p_i^{\eta}}{\sum_k p_k^{\eta}}$

where p_i^η represents the priority of the i-th experience and η sets how strongly priorities are used; when η = 0 the sampling is uniform. The priority is p_i = |δ_i| + ε, where δ_i is the TD error and ε is a small positive constant. So that the complexity of the algorithm does not depend on the size of the experience pool, the resulting P(i) values are organized in a SumTree data structure. A SumTree is a special binary tree in which every leaf node stores the priority of the corresponding experience and every parent node stores the sum of its children, so the root node stores the sum of all priorities, denoted p_total. The SumTree principle is illustrated in Figure 3: the leaf nodes hold the priorities of the individual experiences, and the index of a leaf corresponds one-to-one to the index of the experience in the memory bank, so when a priority is selected, the corresponding experience is fetched from the buffer by that index.
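
A compact Python sketch of such a SumTree (array-based, as commonly implemented for prioritized replay) is shown below; the fixed-capacity array layout is an implementation choice, not something the text prescribes:

```python
import numpy as np

class SumTree:
    """Binary tree whose leaves hold priorities and whose internal nodes hold
    the sum of their children; the root therefore holds p_total."""

    def __init__(self, capacity):
        self.capacity = capacity                 # number of leaves = memory size
        self.tree = np.zeros(2 * capacity - 1)   # internal nodes + leaves
        self.data = [None] * capacity            # experience units (s, a, r, s_next)
        self.write = 0                           # next leaf to overwrite

    def total(self):
        return self.tree[0]                      # p_total stored at the root

    def add(self, priority, experience):
        leaf = self.write + self.capacity - 1
        self.data[self.write] = experience
        self.update(leaf, priority)
        self.write = (self.write + 1) % self.capacity

    def update(self, leaf, priority):
        change = priority - self.tree[leaf]
        self.tree[leaf] = priority
        while leaf != 0:                         # propagate the change up to the root
            leaf = (leaf - 1) // 2
            self.tree[leaf] += change

    def get(self, value):
        """Return (leaf index, priority, experience) for a value in [0, p_total]."""
        idx = 0
        while 2 * idx + 1 < len(self.tree):      # descend until a leaf is reached
            left, right = 2 * idx + 1, 2 * idx + 2
            if value <= self.tree[left]:
                idx = left
            else:
                value -= self.tree[left]
                idx = right
        return idx, self.tree[idx], self.data[idx - self.capacity + 1]
```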

Step 5: From the N experience units stored in the SumTree obtained in step 4, sample m experience units. In this embodiment, to ensure that every kind of experience has a chance of being selected, the m units are sampled by the following procedure:

Divide [0, p_total] into m equal sub-intervals, draw one sample uniformly at random in each sub-interval, and retrieve the experience unit whose priority corresponds to each sampled value, giving m experience units in total.
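
Using the SumTree sketched above, this stratified sampling over m equal segments of [0, p_total] can be written as follows; the helper is illustrative only, and the importance-sampling weights ω_i are omitted here for brevity:

```python
import numpy as np

def sample_by_segments(tree, m):
    """Draw m experiences, one from each equal segment of [0, p_total]."""
    segment = tree.total() / m
    leaves, priorities, batch = [], [], []
    for k in range(m):
        low, high = k * segment, (k + 1) * segment
        value = np.random.uniform(low, high)       # uniform draw inside segment k
        leaf, priority, experience = tree.get(value)
        leaves.append(leaf)
        priorities.append(priority)
        batch.append(experience)
    return leaves, priorities, batch
```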

The m sampled experience units are processed one by one as follows to obtain the gradient ∇_{σ^Q}L of the current evaluation network.

For an experience unit (s_t, a_t, r_t, s_{t+1}), feed the state s_t and the action signal a_t into the current evaluation network to obtain its evaluation value Q; feed the next state s_{t+1} into the target decision network to obtain the actuator action signal μ' output by the target decision network; feed the next state s_{t+1} together with the action μ' output by the target decision network into the target evaluation network to obtain its evaluation value Q'.

Using the evaluation value Q of the current evaluation network, the evaluation value Q' of the target evaluation network and the loss function L of the evaluation network, compute the gradient ∇_{σ^Q}L of the current evaluation network.

The loss function L of the current evaluation network is:

$L = \frac{1}{m}\sum_{i=1}^{m} \omega_i\,\delta_i^2$

where ω_i is the importance-sampling weight of the i-th experience unit and δ_i is:

$\delta_i = r_i + \gamma\, Q'\big(s_{i+1}, \mu'(s_{i+1}|\sigma^{\mu'})\,\big|\,\sigma^{Q'}\big) - Q(s_i, a_i|\sigma^Q)$

Here r_i is the reward in the i-th experience unit and s_{i+1} is the next-time state in the i-th experience unit; σ^{μ'} denotes the parameters of the target decision network, and μ'(s_{i+1}|σ^{μ'}) is the actuator action output by the target decision network for s_{i+1}. σ^{Q'} denotes the parameters of the target evaluation network, and Q'(s_{i+1}, μ'(s_{i+1}|σ^{μ'})|σ^{Q'}) is the evaluation value obtained by feeding μ'(s_{i+1}|σ^{μ'}) and s_{i+1} into the target evaluation network. s_i is the current state in the i-th experience unit, a_i is the action taken in that state, σ^Q denotes the parameters of the current evaluation network, and Q(s_i, a_i|σ^Q) is the evaluation value of the current evaluation network. γ is the discount factor, taken as 0.99.

The gradient of the loss function L of the current evaluation network is:

$\nabla_{\sigma^Q} L = -\frac{2}{m}\sum_{i=1}^{m} \omega_i\,\delta_i\,\nabla_{\sigma^Q} Q(s_i, a_i|\sigma^Q)$

Step 6: Update the current evaluation network. Using the gradient ∇_{σ^Q}L, the current evaluation network parameters σ^Q are updated by the gradient step

$\sigma^Q \leftarrow \sigma^Q - \alpha\,\nabla_{\sigma^Q} L$

where α is the learning rate of the evaluation network, taken as 0.001.

Step 7: Compute the gradient of the current decision network; it is calculated as

$\nabla_{\sigma^\mu} J = \frac{1}{m}\sum_{i=1}^{m}\nabla_a Q(s_i, a|\sigma^Q)\big|_{a=\mu(s_i)}\;\nabla_{\sigma^\mu}\mu(s_i|\sigma^\mu)$

where ∇_a Q(s_i, a|σ^Q)|_{a=μ(s_i)} is the gradient of the evaluation value of the current evaluation network with respect to the action a, μ(s_i) is the actuator action signal output by the current decision network for s_i, and ∇_{σ^μ} μ(s_i|σ^μ) is the gradient of the current decision network action with respect to the current decision network parameters σ^μ.

Step 8: Update the current decision network. Using the gradient ∇_{σ^μ}J, the current decision network parameters σ^μ are updated by the gradient step

$\sigma^\mu \leftarrow \sigma^\mu + \beta\,\nabla_{\sigma^\mu} J$

where β is the learning rate of the decision network, taken as 0.001.

Step 9: Update the target evaluation network and the target decision network. The target evaluation network parameters are updated from the updated current evaluation network parameters, and the target decision network parameters from the updated current decision network parameters, according to

$\sigma^{Q'}_{t+1} = \tau_1\,\sigma^{Q}_{t+1} + (1-\tau_1)\,\sigma^{Q'}_{t}, \qquad \sigma^{\mu'}_{t+1} = \tau_2\,\sigma^{\mu}_{t+1} + (1-\tau_2)\,\sigma^{\mu'}_{t}$

where σ^Q_{t+1} denotes the updated current evaluation network parameters, σ^{Q'}_t the target evaluation network parameters before the update and σ^{Q'}_{t+1} after the update; σ^μ_{t+1} denotes the updated current decision network parameters, σ^{μ'}_t the target decision network parameters before the update and σ^{μ'}_{t+1} after the update; τ_1 = 0.001 is the update rate of the target evaluation network and τ_2 = 0.0001 is the update rate of the target decision network.

Step 10: Check whether the number of training iterations exceeds the preset number. If it does, stop training and save the parameter values of the four networks; otherwise return to step 4 and again sample N experience units from the memory bank into the buffer.

Step 11: After the trained deep reinforcement learning model is obtained, it is applied to an actual underwater glider gliding in the vertical plane: given a target pitch angle θ_d, the glider state {v_1, v_3, ω_2, θ} is sampled and fed to the model, which outputs the control quantity (the velocity command a_t of the movable mass on the x-axis) that realizes attitude control of the underwater glider.
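
In the application stage only the trained current decision network is required; a minimal PyTorch sketch of this control loop is shown below, where the file name, sampling period and the `read_state`/`apply_command` hooks are assumptions standing in for the glider's real I/O:

```python
import time
import torch

def control_loop(actor, read_state, apply_command, dt=1.0):
    """Application stage: map the measured state to the movable-mass velocity command."""
    actor.load_state_dict(torch.load("actor_trained.pt"))  # parameters saved in step 10
    actor.eval()
    while True:
        v1, v3, w2, theta = read_state()                    # measured glider state
        s = torch.tensor([[v1, v3, w2, theta]], dtype=torch.float32)
        with torch.no_grad():
            a_t = actor(s).item()                           # control quantity a_t
        apply_command(a_t)                                  # drive the movable mass
        time.sleep(dt)
```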

Figure 4 shows the episode reward during training of the attitude controller in this example. The training effect is evaluated mainly through the episode reward: after a given training period, a larger average reward indicates better training. As shown in Figure 4, the episode reward keeps rising; after 900 episodes it stabilizes at around 850, indicating that the controller has learned a good policy.

Figure 5 shows an example of the application stage, namely the evolution of the pitch angle during a descent. The initial pitch angle is set to 0° and the desired pitch angle to -25°. The pitch angle reaches the desired value after 16 s, and the steady-state error during stable gliding is 0.06°. The error between the pitch angle and the desired angle during stable gliding is thus small, and the glider can be considered able to glide continuously along the desired trajectory.

Although embodiments of the invention have been shown and described above, it should be understood that they are exemplary and are not to be construed as limiting the invention; those of ordinary skill in the art may make changes, modifications, substitutions and variations to the above embodiments within the scope of the invention without departing from its principles and spirit.

Claims (8)

1. An underwater glider attitude control method based on deep reinforcement learning, characterized in that it comprises the following steps:

Step 1: build a current decision neural network, a current evaluation neural network, a target decision neural network and a target evaluation neural network; a decision network takes the state of the underwater glider as input and outputs the control quantity a as its action; an evaluation network takes the state and the control quantity of the underwater glider as inputs and outputs an evaluation value; initialize the parameters of the four networks, the memory bank and the data buffer;

Step 2: obtain the state s_t of the underwater glider at the current time, feed it to the current decision network to compute the attitude controller output action a_t, apply a_t to the underwater glider simulator to obtain the next state s_{t+1}, and compute the reward r_t at the current time from s_t, a_t, the target pitch angle θ_d and s_{t+1};

Step 3: store the tuple (s_t, a_t, r_t, s_{t+1}) obtained in step 2 as one experience data unit in the memory bank and increment t by 1; compare t with the preset memory bank size n: if t < n, return to step 2 with the updated s_t; once the number of experience units stored in the memory bank reaches n, go to step 4;

Step 4: sample a specified number N of experience data units from the memory bank and place them in the buffer;

Step 5: sample m experience units from the N units in the buffer and process them one by one as follows: for an experience unit (s_t, a_t, r_t, s_{t+1}), feed the state s_t and the action signal a_t into the current evaluation network to obtain its evaluation value Q; feed the next state s_{t+1} into the target decision network to obtain the actuator action signal μ' output by the target decision network; feed s_{t+1} and μ' into the target evaluation network to obtain its evaluation value Q'; using Q, Q' and the loss function L of the evaluation network, compute the gradient ∇_{σ^Q}L of the current evaluation network;

Step 6: update the current evaluation network: using the gradient ∇_{σ^Q}L, update the current evaluation network parameters σ^Q by a gradient step with learning rate α, where α is the learning rate of the evaluation network;

Step 7: compute the gradient ∇_{σ^μ}J of the current decision network;

Step 8: update the current decision network: using the gradient ∇_{σ^μ}J, update the current decision network parameters σ^μ by a gradient step with learning rate β, where β is the learning rate of the decision network;

Step 9: update the target evaluation network and the target decision network: update the target evaluation network parameters from the updated current evaluation network parameters, and the target decision network parameters from the updated current decision network parameters;

Step 10: check whether the number of training iterations exceeds the preset number; if it does, stop training and save the parameter values of the four networks; otherwise return to step 4 and again sample N experience units from the memory bank into the buffer;

Step 11: after the trained deep reinforcement learning model is obtained, apply it to an actual underwater glider gliding in the vertical plane: given a target pitch angle, sample the glider state, feed it to the model, and use the resulting control quantity to realize attitude control of the underwater glider.
2. The underwater glider attitude control method based on deep reinforcement learning according to claim 1, characterized in that the reward value r_t is taken as

$r_t = r_1 + r_2 + r_3$

where r_1 = -λ_1(d_t - d_{t-1}) is the reward obtained as the current pitch angle moves away from or toward the target pitch angle; r_2 = -λ_2(w_t - w_{t-1}) is the reward obtained from the change in angular velocity; r_3 is set according to the current pitch angle: if |d_t - d_{t-1}| < 0.1°, r_3 is a positive reward, and if the pitch angle at the current time t satisfies θ < -90° or θ > 90°, r_3 is a negative value representing a penalty; d_t is the difference between the pitch angle θ and the target pitch angle θ_d at time t, w_t is the pitch angular rate at time t, and λ_1 and λ_2 are preset coefficients.
Figure FDA0002668239160000031
的过程为:
5. a kind of underwater glider attitude control method based on deep reinforcement learning according to claim 1 is characterized in that: in step 5, the gradient value of current evaluation neural network is obtained
Figure FDA0002668239160000031
The process is:
当前评价神经网络的损失函数L为:The loss function L of the current evaluation neural network is:
Figure FDA0002668239160000032
Figure FDA0002668239160000032
其中,ωi为第i个经验数据单元的重要性采样权重,δi为:Among them, ω i is the importance sampling weight of the i-th empirical data unit, and δ i is: δi=ri+γQ'(si+1,μ'(si+1μ'Q')-Q(si,aiQ)δ i =r i +γQ'(s i+1 ,μ'(s i+1μ'Q' )-Q(s i ,a iQ ) ri是第i个经验数据单元中的奖励值,si+1是第i个经验数据单元中的下一时刻状态,σμ'是目标决策神经网络的参数,μ'(si+1μ')是si+1通过目标决策神经网络输出的执行机构的动作信号;σQ'是目标评价神经网络的参数,Q'(si+1,μ'(si+1μ'Q')是将μ'(si+1μ')和si+1通过目标评价神经网络得到的评价值大小;si是第i个经验数据单元中的当前时刻状态,ai是第i个经验数据单元中的当前时刻状态下的动作信号,σQ为当前评价神经网络的参数,Q(si,aiQ)是当前评价神经网络的评价值大小;γ是折扣系数;ri is the reward value in the i -th empirical data unit, s i+1 is the next moment state in the i-th empirical data unit, σ μ' is the parameter of the target decision-making neural network, μ'(s i+1μ' ) is the action signal of the actuator output by s i+1 through the target decision-making neural network; σ Q' is the parameter of the target evaluation neural network, Q'(s i+1 ,μ'(s i+1 | σ μ'Q' ) is the size of the evaluation value obtained by passing μ'(s i+1μ' ) and s i+1 through the target evaluation neural network; s i is the current value in the i-th empirical data unit. Time state, a i is the action signal at the current time state in the i-th empirical data unit, σ Q is the parameter of the current evaluation neural network, Q(s i , a iQ ) is the evaluation of the current evaluation neural network. value; γ is the discount factor; 得到当前评价神经网络的损失函数L的梯度为:The gradient of the loss function L of the current evaluation neural network is obtained as:
Figure FDA0002668239160000033
Figure FDA0002668239160000033
6.根据权利要求1所述一种基于深度强化学习的水下滑翔机姿态控制方法,其特征在于:步骤7中,当前决策神经网络的梯度
Figure FDA0002668239160000034
计算公式为:
6. a kind of underwater glider attitude control method based on deep reinforcement learning according to claim 1 is characterized in that: in step 7, the gradient of the current decision-making neural network
Figure FDA0002668239160000034
The calculation formula is:
Figure FDA0002668239160000035
Figure FDA0002668239160000035
其中
Figure FDA0002668239160000036
表示当前评价神经网络的评价值
Figure FDA0002668239160000037
对参数α的梯度,μ(si)是si通过当前决策神经网络输出的执行机构的动作信号;
Figure FDA0002668239160000041
表示当前决策神经网络动作
Figure FDA0002668239160000042
对当前决策神经网络参数σμ的梯度。
in
Figure FDA0002668239160000036
Indicates the evaluation value of the current evaluation neural network
Figure FDA0002668239160000037
For the gradient of the parameter α, μ(s i ) is the action signal of the actuator output by s i through the current decision-making neural network;
Figure FDA0002668239160000041
Represents the current decision neural network action
Figure FDA0002668239160000042
Gradient of the current decision neural network parameter σ μ .
7.根据权利要求1所述一种基于深度强化学习的水下滑翔机姿态控制方法,其特征在于:步骤9中目标评价神经网络与目标决策神经网络更新公式为:7. a kind of underwater glider attitude control method based on deep reinforcement learning according to claim 1 is characterized in that: in step 9, target evaluation neural network and target decision neural network update formula are:
Figure FDA0002668239160000043
Figure FDA0002668239160000043
其中,σt+1 Q表示更新后的当前评价神经网络参数,σt Q'表示待更新的目标评价神经网络参数,σt+1 Q'表示更新后的目标评价神经网络参数;σt+1 μ表示更新后的当前决策神经网络参数,σt μ'表示待更新的目标决策神经网络参数,σt+1 μ'表示更新后的目标决策神经网络参数;τ1为目标评价神经网络的更新率,τ2为目标决策神经网络的更新率。Among them, σ t+1 Q represents the updated current evaluation neural network parameters, σ t Q' represents the target evaluation neural network parameters to be updated, σ t+1 Q' represents the updated target evaluation neural network parameters; σ t+ 1 μ represents the updated current decision-making neural network parameters, σ t μ' represents the target decision-making neural network parameters to be updated, σ t+1 μ' represents the updated target decision-making neural network parameters; τ 1 is the target evaluation neural network parameter. update rate, τ 2 is the update rate of the target decision-making neural network.
8.根据权利要求1所述一种基于深度强化学习的水下滑翔机姿态控制方法,其特征在于:决策神经网络采用水下滑翔机的状态值v1,v32,θ作为输入量,v1,v32,θ分别是水下滑翔机x,z轴方向速度、俯仰角速度以及俯仰角,其中x轴为机体坐标系前向轴,z轴为垂直于机体平面的坐标轴;水下滑翔机的控制量a为移动质量块在x轴上的速度指令。8. A kind of underwater glider attitude control method based on deep reinforcement learning according to claim 1, is characterized in that: decision-making neural network adopts the state values v 1 , v 3 , ω 2 , θ of the underwater glider as input, v 1 , v 3 , ω 2 , θ are the speed of the underwater glider in the x and z-axis directions, the pitch angular velocity and the pitch angle, respectively, where the x-axis is the forward axis of the body coordinate system, and the z-axis is the coordinate axis perpendicular to the body plane; The control quantity a of the underwater glider is the speed command of the moving mass on the x-axis.
CN202010925225.3A 2020-09-06 2020-09-06 Underwater glider attitude control method based on deep reinforcement learning Pending CN112100834A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010925225.3A CN112100834A (en) 2020-09-06 2020-09-06 Underwater glider attitude control method based on deep reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010925225.3A CN112100834A (en) 2020-09-06 2020-09-06 Underwater glider attitude control method based on deep reinforcement learning

Publications (1)

Publication Number Publication Date
CN112100834A true CN112100834A (en) 2020-12-18

Family

ID=73758469

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010925225.3A Pending CN112100834A (en) 2020-09-06 2020-09-06 Underwater glider attitude control method based on deep reinforcement learning

Country Status (1)

Country Link
CN (1) CN112100834A (en)



Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107102644A (en) * 2017-06-22 2017-08-29 华南师范大学 The underwater robot method for controlling trajectory and control system learnt based on deeply
CN108803321A (en) * 2018-05-30 2018-11-13 清华大学 Autonomous Underwater Vehicle Trajectory Tracking Control method based on deeply study
CN110597058A (en) * 2019-08-28 2019-12-20 浙江工业大学 A three-DOF autonomous underwater vehicle control method based on reinforcement learning
CN110879595A (en) * 2019-11-29 2020-03-13 江苏徐工工程机械研究院有限公司 An unmanned mine card tracking control system and method based on deep reinforcement learning

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
刘西等: "DDPG优化基于动态逆的飞行器姿态控制", 《计算机仿真》 *
葛东旭: "《数据挖掘原理与应用》", 31 March 2020 *
韦鹏程等: "《大数据巨量分析与机器学习的整合与开发》", 31 May 2017 *

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115046433A (en) * 2021-03-09 2022-09-13 北京理工大学 Aircraft time collaborative guidance method based on deep reinforcement learning
CN113420513A (en) * 2021-07-01 2021-09-21 西北工业大学 Underwater cylinder turbulent flow partition flow field prediction method based on deep learning
CN113420513B (en) * 2021-07-01 2023-03-07 西北工业大学 A deep learning-based method for predicting the flow field of underwater cylinder turbulence partitions
CN113879495A (en) * 2021-10-26 2022-01-04 西北工业大学 A dynamic motion planning method for underwater glider based on ocean current prediction
CN113879495B (en) * 2021-10-26 2024-04-19 西北工业大学 Dynamic motion planning method for underwater glider based on ocean current prediction
CN114355777A (en) * 2022-01-06 2022-04-15 浙江大学 A dynamic gliding method and system based on distributed pressure sensors and segmented attitude control
CN114355777B (en) * 2022-01-06 2023-10-10 浙江大学 A dynamic gliding method and system based on distributed pressure sensors and segmented attitude control
CN118466221A (en) * 2024-07-11 2024-08-09 中国海洋大学 A deep reinforcement learning decision method for underwater glider attack angle
CN118466221B (en) * 2024-07-11 2024-09-17 中国海洋大学 Deep reinforcement learning decision-making method for attack angle of underwater glider

Similar Documents

Publication Publication Date Title
CN112100834A (en) Underwater glider attitude control method based on deep reinforcement learning
CN108803321B (en) Autonomous underwater vehicle track tracking control method based on deep reinforcement learning
CN110806759B (en) An aircraft route tracking method based on deep reinforcement learning
CN109946975B (en) Reinforced learning optimal tracking control method of unknown servo system
CN111351488A (en) Intelligent trajectory reconstruction reentry guidance method for aircraft
CN108490965A (en) Rotor craft attitude control method based on Genetic Algorithm Optimized Neural Network
CN115826621B (en) A UAV motion planning method and system based on deep reinforcement learning
CN113052372A (en) Dynamic AUV tracking path planning method based on deep reinforcement learning
CN111445498A (en) Target tracking method adopting Bi-L STM neural network
CN103019099A (en) Parameter optimization method for satellite attitude fuzzy controller
CN111240344A (en) Autonomous underwater robot model-free control method based on double neural network reinforcement learning technology
CN113722980A (en) Ocean wave height prediction method, system, computer equipment, storage medium and terminal
CN117289709A (en) High-ultrasonic-speed appearance-changing aircraft attitude control method based on deep reinforcement learning
CN117555352A (en) An ocean current assisted path planning method based on discrete SAC
CN112215412B (en) Dissolved oxygen prediction method and device
CN108805253A (en) A kind of PM2.5 concentration predictions method
CN114967713B (en) Underwater vehicle buoyancy discrete change control method based on reinforcement learning
Dong et al. Gliding motion optimization for a biomimetic gliding robotic fish
CN117818706B (en) Method, system, equipment and medium for predicting speed of medium-low speed maglev train
CN103336887A (en) Method for identifying water power coefficient based on bee colony algorithm
CN119397195A (en) A trajectory prediction method based on deep learning
CN110889531A (en) Wind power prediction method and prediction system based on improved GSA-BP neural network
CN114118371A (en) A kind of agent deep reinforcement learning method and computer readable medium
CN115617060B (en) A hovering control method for quadrotor drone based on deep reinforcement learning
CN111695195A (en) Spatial physical moving body modeling method based on long-time memory network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20201218