CN112100834A - Underwater glider attitude control method based on deep reinforcement learning - Google Patents

Underwater glider attitude control method based on deep reinforcement learning

Info

Publication number
CN112100834A
Authority
CN
China
Prior art keywords
neural network
current
evaluation
target
decision
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010925225.3A
Other languages
Chinese (zh)
Inventor
高剑
宋保维
潘光
张福斌
王鹏
曹永辉
杜晓旭
彭星光
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Northwestern Polytechnical University
Original Assignee
Northwestern Polytechnical University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Northwestern Polytechnical University filed Critical Northwestern Polytechnical University
Priority to CN202010925225.3A priority Critical patent/CN112100834A/en
Publication of CN112100834A publication Critical patent/CN112100834A/en
Pending legal-status Critical Current



Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 30/00 - Computer-aided design [CAD]
    • G06F 30/20 - Design optimisation, verification or simulation
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/044 - Recurrent networks, e.g. Hopfield networks
    • G06N 3/045 - Combinations of networks
    • G06N 3/049 - Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G06N 3/06 - Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N 3/061 - Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons, using biological neurons, e.g. biological neurons connected to an integrated circuit
    • G06N 3/08 - Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Neurology (AREA)
  • Microelectronics & Electronic Packaging (AREA)
  • Computer Hardware Design (AREA)
  • Geometry (AREA)
  • Feedback Control In General (AREA)

Abstract

The invention proposes an underwater glider attitude control method based on deep reinforcement learning, comprising a learning stage and an application stage. In the learning stage, the motion of the underwater glider is simulated while the real-time motion data are recorded, and the parameters of the current decision network, current evaluation network, target decision network and target evaluation network are updated from these data. Once the deep reinforcement learning neural network model is trained, it is applied to an actual underwater glider gliding in the vertical plane: given a target pitch angle θ_d, the state of the glider is sampled and fed to the model, which outputs the control quantity that realizes attitude control. The method learns from simulation model data or manual experiment data, so the learning procedure is simple; it does not require an accurate mathematical model of the underwater glider and remains applicable in complex environments.

Description

Attitude Control Method for an Underwater Glider Based on Deep Reinforcement Learning

Technical Field

The invention relates to control technology for underwater robots, and in particular to an attitude control method for an underwater glider based on deep reinforcement learning.

Background

An underwater glider is a new type of underwater vehicle that combines buoy and mooring-buoy technology with underwater robot technology; it carries no external propulsion and is driven by its own gravity. Its main characteristic is that motion control does not rely on a propeller propulsion system: by adjusting the net buoyancy the glider rises and sinks, and the horizontal wings attached to the hull generate obliquely upward or downward lift that drives the glider forward. The underwater glider overcomes the high power consumption and short endurance of conventional underwater vehicles, greatly reduces operating and manufacturing costs, and extends mission duration, which makes it highly valuable for military use and for ocean exploration and research.

The motion attitude of an underwater glider is easily disturbed by currents and waves. At the same time, the glider body is structurally complex, its actuation is limited, and its dynamics are strongly nonlinear; accurate model parameters are difficult to obtain, and a model built for one water environment lacks generality in others. Although many traditional control methods can realize attitude control of an underwater glider with a certain accuracy, they still cannot meet high-precision requirements, and the control procedure is relatively complicated.

Summary of the Invention

Technical Problem to Be Solved

The purpose of the invention is to overcome the shortcomings of the prior art by providing an underwater glider attitude control method based on deep reinforcement learning: a deep reinforcement learning neural network model is built and trained on simulation model data or manual experiment data, so that precise attitude control of the underwater glider can be achieved.

Technical Solution

The underwater glider attitude control method based on deep reinforcement learning proposed by the invention comprises a learning stage and an application stage. In the learning stage, the motion of the underwater glider is simulated while the real-time motion data are recorded, and the parameters of the current decision network, current evaluation network, target decision network and target evaluation network are updated from these data. The specific steps are as follows:

Step 1: Build four BP neural networks: the current decision network, the current evaluation network, the target decision network and the target evaluation network. The current and target decision networks are collectively called decision networks; the current and target evaluation networks are collectively called evaluation networks. A decision network takes the state of the underwater glider as input and outputs the control quantity a as its action. An evaluation network takes the state and the control quantity of the underwater glider as inputs and outputs an evaluation value.

After building the networks, initialize the parameters of the four networks, the memory bank and the size of the data buffer.

Step 2: Obtain the state s_t of the underwater glider at the current time, feed it to the current decision network to compute the attitude controller output action a_t, and apply a_t to the underwater glider simulator to obtain the state s_{t+1} at the next time. The reward r_t at the current time is computed from the current state s_t, the current action a_t, the target pitch angle θ_d and the next state s_{t+1}.

Preferably, r_t is taken as:

$r_t = r_1 + r_2 + r_3$

where r_1 = -λ_1(d_t - d_{t-1}) is the reward obtained as the current pitch angle moves away from or toward the desired angle; r_2 = -λ_2(w_t - w_{t-1}) is the reward obtained from the change in angular velocity; r_3 is set according to the current pitch angle: if |d_t - d_{t-1}| < 0.1°, r_3 is a positive reward, and if θ < -90° or θ > 90°, r_3 is a negative value representing a penalty. Here d_t is the difference between the current pitch angle θ and the target pitch angle θ_d at time t, w_t is the pitch angular rate at time t, and λ_1 and λ_2 are preset coefficients.
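
For illustration only, the reward above can be written as the following Python sketch; the coefficient values, the bonus/penalty magnitudes and the use of absolute values for d_t are assumptions chosen for the example, since the text does not fix them:

```python
def reward(theta, theta_prev, w, w_prev, theta_d,
           lam1=1.0, lam2=0.1, bonus=1.0, penalty=-10.0):
    """Reward r_t = r1 + r2 + r3 for pitch-angle tracking (angles in degrees)."""
    d_t = abs(theta - theta_d)          # distance to target pitch at time t (assumed |.|)
    d_prev = abs(theta_prev - theta_d)  # distance at time t-1
    r1 = -lam1 * (d_t - d_prev)         # approaching / leaving the desired angle
    r2 = -lam2 * (w - w_prev)           # change in pitch angular rate
    r3 = 0.0
    if abs(d_t - d_prev) < 0.1:         # pitch error has essentially stopped changing
        r3 = bonus                      # positive "reward"
    if theta < -90.0 or theta > 90.0:   # pitched past vertical
        r3 = penalty                    # negative value, i.e. punishment
    return r1 + r2 + r3
```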

Step 3: Store the tuple (s_t, a_t, r_t, s_{t+1}) obtained in step 2 as one experience data unit in the memory bank and increment t by 1. Compare t with the preset memory bank size n: if t < n, return to step 2 with the updated s_t; once the number of experience units stored in the memory bank reaches n, go to step 4.

Step 4: Sample a specified number N of experience data units from the memory bank and place them in the buffer.

Preferably, the N experience units in the buffer are prioritized by a priority-based experience replay mechanism and stored in a SumTree data structure; the value of the root node of the SumTree is the sum of the priorities of the experience units represented by all nodes, denoted p_total.

Step 5: From the N experience units stored in the SumTree obtained in step 4, sample m experience units. To ensure that every kind of experience has a chance of being selected, the m units are sampled by the following procedure:

Divide [0, p_total] into m equal sub-intervals, draw one sample uniformly at random in each sub-interval, and retrieve the experience unit whose priority corresponds to each sampled value, giving m experience units in total.

The m sampled experience units are processed one by one as follows to obtain the gradient ∇_{σ^Q}L of the current evaluation network.

For an experience unit (s_t, a_t, r_t, s_{t+1}), feed the state s_t and the action signal a_t into the current evaluation network to obtain its evaluation value Q; feed the next state s_{t+1} into the target decision network to obtain the actuator action signal μ' output by the target decision network; feed the next state s_{t+1} together with the action μ' output by the target decision network into the target evaluation network to obtain its evaluation value Q'.

Using the evaluation value Q of the current evaluation network, the evaluation value Q' of the target evaluation network and the loss function L of the evaluation network, compute the gradient ∇_{σ^Q}L of the current evaluation network.

Specifically, the loss function L of the current evaluation network is:

$L = \frac{1}{m}\sum_{i=1}^{m} \omega_i\,\delta_i^2$

where ω_i is the importance-sampling weight of the i-th experience unit and δ_i is:

$\delta_i = r_i + \gamma\, Q'\big(s_{i+1}, \mu'(s_{i+1}|\sigma^{\mu'})\,\big|\,\sigma^{Q'}\big) - Q(s_i, a_i|\sigma^Q)$

Here r_i is the reward in the i-th experience unit and s_{i+1} is the next-time state in the i-th experience unit; σ^{μ'} denotes the parameters of the target decision network, and μ'(s_{i+1}|σ^{μ'}) is the actuator action output by the target decision network for s_{i+1}. σ^{Q'} denotes the parameters of the target evaluation network, and Q'(s_{i+1}, μ'(s_{i+1}|σ^{μ'})|σ^{Q'}) is the evaluation value obtained by feeding μ'(s_{i+1}|σ^{μ'}) and s_{i+1} into the target evaluation network. s_i is the current state in the i-th experience unit, a_i is the action taken in that state, σ^Q denotes the parameters of the current evaluation network, and Q(s_i, a_i|σ^Q) is the evaluation value of the current evaluation network. γ is the discount factor, taken as 0.99.

The gradient of the loss function L of the current evaluation network is:

$\nabla_{\sigma^Q} L = -\frac{2}{m}\sum_{i=1}^{m} \omega_i\,\delta_i\,\nabla_{\sigma^Q} Q(s_i, a_i|\sigma^Q)$

Step 6: Update the current evaluation network. Using the gradient ∇_{σ^Q}L, the current evaluation network parameters σ^Q are updated by the gradient step

$\sigma^Q \leftarrow \sigma^Q - \alpha\,\nabla_{\sigma^Q} L$

where α is the learning rate of the evaluation network, taken as 0.001.
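
For concreteness, a PyTorch sketch of this evaluation (critic) update, including the importance-sampling weights ω_i, is given below; the network classes, optimizer and tensor shapes are assumptions (any critic mapping a state-action pair to a scalar works), not the patent's exact implementation:

```python
import torch

def critic_update(critic, critic_target, actor_target, critic_opt,
                  batch, weights, gamma=0.99):
    """One update of the current evaluation (critic) network, steps 5-6.

    batch   : tuple of tensors (s, a, r, s_next), each with m rows
    weights : importance-sampling weights ω_i, shape (m, 1)
    """
    s, a, r, s_next = batch
    with torch.no_grad():
        a_next = actor_target(s_next)              # μ'(s_{i+1} | σ^{μ'})
        q_next = critic_target(s_next, a_next)     # Q'(s_{i+1}, μ'(...) | σ^{Q'})
        y = r + gamma * q_next                     # TD target
    q = critic(s, a)                               # Q(s_i, a_i | σ^Q)
    delta = y - q                                  # TD error δ_i
    loss = (weights * delta.pow(2)).mean()         # L = (1/m) Σ ω_i δ_i²
    critic_opt.zero_grad()
    loss.backward()                                # ∇_{σ^Q} L
    critic_opt.step()                              # σ^Q update with learning rate α
    return delta.detach().abs()                    # |δ_i|, usable as new priorities
```

The returned |δ_i| can be written back into the SumTree as the refreshed priorities p_i = |δ_i| + ε mentioned in connection with step 4.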

Step 7: Compute the gradient of the current decision network, which is calculated as

$\nabla_{\sigma^\mu} J = \frac{1}{m}\sum_{i=1}^{m}\nabla_a Q(s_i, a|\sigma^Q)\big|_{a=\mu(s_i)}\;\nabla_{\sigma^\mu}\mu(s_i|\sigma^\mu)$

where ∇_a Q(s_i, a|σ^Q)|_{a=μ(s_i)} is the gradient of the evaluation value of the current evaluation network with respect to the action a, μ(s_i) is the actuator action signal output by the current decision network for s_i, and ∇_{σ^μ} μ(s_i|σ^μ) is the gradient of the current decision network action with respect to the current decision network parameters σ^μ.

Step 8: Update the current decision network. Using the gradient ∇_{σ^μ}J, the current decision network parameters σ^μ are updated by the gradient step

$\sigma^\mu \leftarrow \sigma^\mu + \beta\,\nabla_{\sigma^\mu} J$

where β is the learning rate of the decision network, taken as 0.001.

Step 9: Update the target evaluation network and the target decision network. The target evaluation network parameters are updated from the updated current evaluation network parameters, and the target decision network parameters from the updated current decision network parameters, according to

$\sigma^{Q'}_{t+1} = \tau_1\,\sigma^{Q}_{t+1} + (1-\tau_1)\,\sigma^{Q'}_{t}, \qquad \sigma^{\mu'}_{t+1} = \tau_2\,\sigma^{\mu}_{t+1} + (1-\tau_2)\,\sigma^{\mu'}_{t}$

where σ^Q_{t+1} denotes the updated current evaluation network parameters, σ^{Q'}_t the target evaluation network parameters before the update and σ^{Q'}_{t+1} after the update; σ^μ_{t+1} denotes the updated current decision network parameters, σ^{μ'}_t the target decision network parameters before the update and σ^{μ'}_{t+1} after the update; τ_1 = 0.001 is the update rate of the target evaluation network and τ_2 = 0.0001 is the update rate of the target decision network.
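
This soft update can be written in a few lines of PyTorch; the sketch assumes that the current and target networks share the same architecture (as they do here, since the targets mirror the current networks):

```python
import torch

def soft_update(target_net, current_net, tau):
    """σ'_{t+1} = τ σ_{t+1} + (1 - τ) σ'_t, applied parameter by parameter."""
    with torch.no_grad():
        for p_target, p_current in zip(target_net.parameters(), current_net.parameters()):
            p_target.mul_(1.0 - tau).add_(tau * p_current)

# Usage with the rates given in step 9:
# soft_update(critic_target, critic, tau=0.001)    # τ1, target evaluation network
# soft_update(actor_target, actor, tau=0.0001)     # τ2, target decision network
```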

Step 10: Check whether the number of training iterations exceeds the preset number. If it does, stop training and save the parameter values of the four networks; otherwise return to step 4 and again sample N experience units from the memory bank into the buffer.

Step 11: After the trained deep reinforcement learning model is obtained, it is applied to an actual underwater glider gliding in the vertical plane: given a target pitch angle θ_d, the state of the glider is sampled and fed to the model, which outputs the control quantity that realizes attitude control of the underwater glider.

Beneficial Effects

Compared with the prior art, the technical solution of the invention has the following beneficial effects:

1. The method learns from simulation model data or manual experiment data to realize attitude control of the underwater glider, and the learning procedure is simple.

2. There is no need to obtain an accurate mathematical model of the underwater glider, and the method is also applicable in complex environments.

Additional aspects and advantages of the invention will be given in part in the following description; in part they will become apparent from the description, or may be learned by practice of the invention.

Brief Description of the Drawings

The above and/or additional aspects and advantages of the invention will become apparent and easy to understand from the following description of embodiments taken in conjunction with the accompanying drawings, in which:

Figure 1: Schematic diagram of the deep reinforcement learning algorithm framework;

Figure 2: Control diagram of the longitudinal gliding motion of the underwater glider;

Figure 3: Schematic diagram of the SumTree structure;

Figure 4: Episode reward during the underwater glider attitude control training process;

Figure 5: Evolution of the pitch angle under the attitude adjustment unit in the application stage, with the desired gliding pitch angle set to -25°.

Detailed Description of the Embodiments

An example of the invention is described in detail below. In this example, control of the pitch angle of an actual underwater glider in the vertical plane is taken as the case study. First, the basic characteristic parameters of the glider are determined and the mathematical model of its gliding motion in the vertical plane is obtained. The control input is the velocity command a_t of the movable mass along the x-axis, and the glider state is s: {v_1, v_3, ω_2, θ}, where v_1 and v_3 are the velocities along the x- and z-axes of the body frame (the x-axis is the forward axis of the body frame and the z-axis is perpendicular to the body plane), ω_2 is the pitch angular velocity, and θ is the pitch angle. The block diagram of longitudinal gliding motion control with the mass velocity as input is shown in Figure 2.
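
To make the later training sketches concrete, the simulator can be wrapped behind a small reset/step interface as below; the first-order dynamics inside `step` are only a placeholder so the code runs, not the vertical-plane model referred to above (which is not reproduced in this text):

```python
import numpy as np

class GliderSim:
    """Toy stand-in for the vertical-plane glider simulator.

    State s = [v1, v3, w2, theta]; action a = velocity command of the movable
    mass on the x-axis. The dynamics below are a crude placeholder.
    """

    def __init__(self, theta_d=-25.0, dt=0.1):
        self.theta_d = theta_d                    # target pitch angle (degrees)
        self.dt = dt
        self.s = np.zeros(4)

    def reset(self):
        self.s = np.array([0.3, 0.0, 0.0, 0.0])   # arbitrary initial state
        return self.s.copy()

    def step(self, a):
        v1, v3, w2, theta = self.s
        w2 = w2 + self.dt * (0.5 * float(a) - 0.1 * w2)   # placeholder pitch dynamics
        theta = theta + self.dt * w2                      # w2 treated as deg/s here
        self.s = np.array([v1, v3, w2, theta])
        return self.s.copy()
```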

The principle of this example is to use a deep reinforcement learning method with prioritized sampling to control the pitch angle of the underwater glider in the vertical plane. The specific steps are as follows:

Step 1: Build four BP neural networks: the current decision network, the current evaluation network, the target decision network and the target evaluation network. The current and target decision networks are collectively called decision networks; the current and target evaluation networks are collectively called evaluation networks. A decision network has four inputs, namely the glider state values v_1, v_3, ω_2, θ at the current time, and one output, the action value a. An evaluation network has five inputs, v_1, v_3, ω_2, θ, a, and one output, the evaluation value. In this embodiment each of the four BP networks has two hidden layers, the first with 400 nodes and the second with 300 nodes, and each has a single output node.

The mapping of the BP neural network proceeds as follows:

Input layer → hidden-layer activation function → output of the first hidden layer:

$z_{1k} = f_1\!\left(\sum_{i=1}^{\varepsilon} v_{ki}\,x_i + b_k\right), \quad k = 1, \dots, h_1$

where x_i is the i-th input and ε is the total number of inputs; v_{ki} is the weight from the input layer to the first hidden layer, b_k is the bias, z_{1k} is the k-th node of the first hidden layer, h_1 is taken as 400, and f_1(·) is the hidden-layer activation function, chosen as the ReLU function.

Output of the first hidden layer → hidden-layer activation function → output of the second hidden layer:

$z_{2j} = f_2\!\left(\sum_{k=1}^{h_1} w_{jk}\,z_{1k} + b_j\right), \quad j = 1, \dots, h_2$

where w_{jk} is the weight between the first and second hidden layers, b_j is the bias, z_{2j} is the j-th output of the second hidden layer, h_2 is taken as 300, and f_2(·) is the activation function of the second hidden layer, chosen as the ReLU function.

Output of the second hidden layer → output-layer activation function → output of the output layer:

$y = f_3\!\left(\sum_{j=1}^{h_2} w_j\,z_{2j} + b_l\right)$

where y is the (scalar) output, w_j is the weight between the second hidden layer and the output layer, b_l is the bias, and f_3(·) is the output-layer activation function: the Tanh function for the decision networks and the ReLU function for the evaluation networks.
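
A PyTorch sketch of these two network types with the stated layer sizes and activations is shown below; the action-scaling constant `max_action` is an assumption, since the text leaves the output range of the decision network implicit:

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Decision network: state (v1, v3, w2, theta) -> action a, Tanh output."""
    def __init__(self, state_dim=4, max_action=1.0):
        super().__init__()
        self.max_action = max_action
        self.net = nn.Sequential(
            nn.Linear(state_dim, 400), nn.ReLU(),
            nn.Linear(400, 300), nn.ReLU(),
            nn.Linear(300, 1), nn.Tanh(),
        )

    def forward(self, s):
        return self.max_action * self.net(s)

class Critic(nn.Module):
    """Evaluation network: (state, action) -> evaluation value, ReLU output."""
    def __init__(self, state_dim=4, action_dim=1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, 400), nn.ReLU(),
            nn.Linear(400, 300), nn.ReLU(),
            nn.Linear(300, 1), nn.ReLU(),
        )

    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1))
```

The four networks of step 1 are then two `Actor` instances (current and target decision networks) and two `Critic` instances (current and target evaluation networks), with the targets typically initialized as copies of the current networks.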

The parameters of the four networks are initialized; the memory bank size is initialized to n = 10000 and the data buffer size N to 64.
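
Wiring these pieces together with the hyperparameters stated in this embodiment might look as follows; `Actor` and `Critic` refer to the sketch above, while the use of `copy.deepcopy` for the target copies and Adam as the optimizer are implementation assumptions, not requirements of the method:

```python
import copy
import torch

# Hyperparameter values stated in this embodiment
ALPHA = 1e-3              # learning rate α of the evaluation (critic) networks
BETA = 1e-3               # learning rate β of the decision (actor) networks
GAMMA = 0.99              # discount factor γ
TAU1, TAU2 = 1e-3, 1e-4   # soft-update rates τ1 (critic target) and τ2 (actor target)
MEMORY_SIZE = 10000       # memory bank size n
BUFFER_SIZE = 64          # data buffer size N

actor = Actor()                        # current decision network
critic = Critic()                      # current evaluation network
actor_target = copy.deepcopy(actor)    # target decision network (initial copy)
critic_target = copy.deepcopy(critic)  # target evaluation network (initial copy)

actor_opt = torch.optim.Adam(actor.parameters(), lr=BETA)
critic_opt = torch.optim.Adam(critic.parameters(), lr=ALPHA)
```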

Step 2: Obtain the state s_t: {v_1, v_3, ω_2, θ} of the underwater glider at the current time, feed it to the current decision network to compute the attitude controller output action a_t, and apply a_t to the simulator built from the mathematical model of the glider's vertical-plane gliding motion to obtain the next state s_{t+1}. The reward r_t at the current time is computed from s_t, a_t, the target pitch angle θ_d and s_{t+1}. In this embodiment r_t is taken as:

$r_t = r_1 + r_2 + r_3$

where r_1 = -λ_1(d_t - d_{t-1}) is the reward obtained as the current pitch angle moves away from or toward the desired angle; r_2 = -λ_2(w_t - w_{t-1}) is the reward obtained from the change in angular velocity; r_3 is set according to the current pitch angle: if |d_t - d_{t-1}| < 0.1°, r_3 is a positive reward, and if θ < -90° or θ > 90°, r_3 is a negative value representing a penalty. Here d_t is the difference between the current pitch angle θ and the target pitch angle θ_d at time t, w_t is the pitch angular rate at time t, and λ_1 and λ_2 are preset coefficients.

Step 3: Store the tuple (s_t, a_t, r_t, s_{t+1}) obtained in step 2 as one experience data unit in the memory bank and increment t by 1. Compare t with the preset memory bank size n: if t < n, return to step 2 with the updated s_t; once the number of experience units stored in the memory bank reaches n, go to step 4.

Step 4: Sample a specified number N of experience data units from the memory bank and place them in the buffer. The N units in the buffer are prioritized by the priority-based experience replay mechanism and stored in a SumTree data structure; the value of the root node of the SumTree is the sum of the priorities of the experience units represented by all nodes, denoted p_total.

The priority-based experience replay works as follows. Define P(i) as the sampling probability of the i-th experience; P(i) is computed as:

$P(i) = \frac{p_i^{\eta}}{\sum_k p_k^{\eta}}$

where p_i^η represents the priority of the i-th experience and η sets how strongly priorities are used; when η = 0 the sampling is uniform. The priority is p_i = |δ_i| + ε, where δ_i is the TD error and ε is a small positive constant. So that the complexity of the algorithm does not depend on the size of the experience pool, the resulting P(i) values are organized in a SumTree data structure. A SumTree is a special binary tree in which every leaf node stores the priority of the corresponding experience and every parent node stores the sum of its children, so the root node stores the sum of all priorities, denoted p_total. The SumTree principle is illustrated in Figure 3: the leaf nodes hold the priorities of the individual experiences, and the index of a leaf corresponds one-to-one to the index of the experience in the memory bank, so when a priority is selected, the corresponding experience is fetched from the buffer by that index.
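
A compact Python sketch of such a SumTree (array-based, as commonly implemented for prioritized replay) is shown below; the fixed-capacity array layout is an implementation choice, not something the text prescribes:

```python
import numpy as np

class SumTree:
    """Binary tree whose leaves hold priorities and whose internal nodes hold
    the sum of their children; the root therefore holds p_total."""

    def __init__(self, capacity):
        self.capacity = capacity                 # number of leaves = memory size
        self.tree = np.zeros(2 * capacity - 1)   # internal nodes + leaves
        self.data = [None] * capacity            # experience units (s, a, r, s_next)
        self.write = 0                           # next leaf to overwrite

    def total(self):
        return self.tree[0]                      # p_total stored at the root

    def add(self, priority, experience):
        leaf = self.write + self.capacity - 1
        self.data[self.write] = experience
        self.update(leaf, priority)
        self.write = (self.write + 1) % self.capacity

    def update(self, leaf, priority):
        change = priority - self.tree[leaf]
        self.tree[leaf] = priority
        while leaf != 0:                         # propagate the change up to the root
            leaf = (leaf - 1) // 2
            self.tree[leaf] += change

    def get(self, value):
        """Return (leaf index, priority, experience) for a value in [0, p_total]."""
        idx = 0
        while 2 * idx + 1 < len(self.tree):      # descend until a leaf is reached
            left, right = 2 * idx + 1, 2 * idx + 2
            if value <= self.tree[left]:
                idx = left
            else:
                value -= self.tree[left]
                idx = right
        return idx, self.tree[idx], self.data[idx - self.capacity + 1]
```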

Step 5: From the N experience units stored in the SumTree obtained in step 4, sample m experience units. In this embodiment, to ensure that every kind of experience has a chance of being selected, the m units are sampled by the following procedure:

Divide [0, p_total] into m equal sub-intervals, draw one sample uniformly at random in each sub-interval, and retrieve the experience unit whose priority corresponds to each sampled value, giving m experience units in total.
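
Using the SumTree sketched above, this stratified sampling over m equal segments of [0, p_total] can be written as follows; the helper is illustrative only, and the importance-sampling weights ω_i are omitted here for brevity:

```python
import numpy as np

def sample_by_segments(tree, m):
    """Draw m experiences, one from each equal segment of [0, p_total]."""
    segment = tree.total() / m
    leaves, priorities, batch = [], [], []
    for k in range(m):
        low, high = k * segment, (k + 1) * segment
        value = np.random.uniform(low, high)       # uniform draw inside segment k
        leaf, priority, experience = tree.get(value)
        leaves.append(leaf)
        priorities.append(priority)
        batch.append(experience)
    return leaves, priorities, batch
```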

The m sampled experience units are processed one by one as follows to obtain the gradient ∇_{σ^Q}L of the current evaluation network.

For an experience unit (s_t, a_t, r_t, s_{t+1}), feed the state s_t and the action signal a_t into the current evaluation network to obtain its evaluation value Q; feed the next state s_{t+1} into the target decision network to obtain the actuator action signal μ' output by the target decision network; feed the next state s_{t+1} together with the action μ' output by the target decision network into the target evaluation network to obtain its evaluation value Q'.

Using the evaluation value Q of the current evaluation network, the evaluation value Q' of the target evaluation network and the loss function L of the evaluation network, compute the gradient ∇_{σ^Q}L of the current evaluation network.

The loss function L of the current evaluation network is:

$L = \frac{1}{m}\sum_{i=1}^{m} \omega_i\,\delta_i^2$

where ω_i is the importance-sampling weight of the i-th experience unit and δ_i is:

$\delta_i = r_i + \gamma\, Q'\big(s_{i+1}, \mu'(s_{i+1}|\sigma^{\mu'})\,\big|\,\sigma^{Q'}\big) - Q(s_i, a_i|\sigma^Q)$

Here r_i is the reward in the i-th experience unit and s_{i+1} is the next-time state in the i-th experience unit; σ^{μ'} denotes the parameters of the target decision network, and μ'(s_{i+1}|σ^{μ'}) is the actuator action output by the target decision network for s_{i+1}. σ^{Q'} denotes the parameters of the target evaluation network, and Q'(s_{i+1}, μ'(s_{i+1}|σ^{μ'})|σ^{Q'}) is the evaluation value obtained by feeding μ'(s_{i+1}|σ^{μ'}) and s_{i+1} into the target evaluation network. s_i is the current state in the i-th experience unit, a_i is the action taken in that state, σ^Q denotes the parameters of the current evaluation network, and Q(s_i, a_i|σ^Q) is the evaluation value of the current evaluation network. γ is the discount factor, taken as 0.99.

The gradient of the loss function L of the current evaluation network is:

$\nabla_{\sigma^Q} L = -\frac{2}{m}\sum_{i=1}^{m} \omega_i\,\delta_i\,\nabla_{\sigma^Q} Q(s_i, a_i|\sigma^Q)$

Step 6: Update the current evaluation network. Using the gradient ∇_{σ^Q}L, the current evaluation network parameters σ^Q are updated by the gradient step

$\sigma^Q \leftarrow \sigma^Q - \alpha\,\nabla_{\sigma^Q} L$

where α is the learning rate of the evaluation network, taken as 0.001.

Step 7: Compute the gradient of the current decision network; it is calculated as

$\nabla_{\sigma^\mu} J = \frac{1}{m}\sum_{i=1}^{m}\nabla_a Q(s_i, a|\sigma^Q)\big|_{a=\mu(s_i)}\;\nabla_{\sigma^\mu}\mu(s_i|\sigma^\mu)$

where ∇_a Q(s_i, a|σ^Q)|_{a=μ(s_i)} is the gradient of the evaluation value of the current evaluation network with respect to the action a, μ(s_i) is the actuator action signal output by the current decision network for s_i, and ∇_{σ^μ} μ(s_i|σ^μ) is the gradient of the current decision network action with respect to the current decision network parameters σ^μ.

Step 8: Update the current decision network. Using the gradient ∇_{σ^μ}J, the current decision network parameters σ^μ are updated by the gradient step

$\sigma^\mu \leftarrow \sigma^\mu + \beta\,\nabla_{\sigma^\mu} J$

where β is the learning rate of the decision network, taken as 0.001.

Step 9: Update the target evaluation network and the target decision network. The target evaluation network parameters are updated from the updated current evaluation network parameters, and the target decision network parameters from the updated current decision network parameters, according to

$\sigma^{Q'}_{t+1} = \tau_1\,\sigma^{Q}_{t+1} + (1-\tau_1)\,\sigma^{Q'}_{t}, \qquad \sigma^{\mu'}_{t+1} = \tau_2\,\sigma^{\mu}_{t+1} + (1-\tau_2)\,\sigma^{\mu'}_{t}$

where σ^Q_{t+1} denotes the updated current evaluation network parameters, σ^{Q'}_t the target evaluation network parameters before the update and σ^{Q'}_{t+1} after the update; σ^μ_{t+1} denotes the updated current decision network parameters, σ^{μ'}_t the target decision network parameters before the update and σ^{μ'}_{t+1} after the update; τ_1 = 0.001 is the update rate of the target evaluation network and τ_2 = 0.0001 is the update rate of the target decision network.

Step 10: Check whether the number of training iterations exceeds the preset number. If it does, stop training and save the parameter values of the four networks; otherwise return to step 4 and again sample N experience units from the memory bank into the buffer.

Step 11: After the trained deep reinforcement learning model is obtained, it is applied to an actual underwater glider gliding in the vertical plane: given a target pitch angle θ_d, the glider state {v_1, v_3, ω_2, θ} is sampled and fed to the model, which outputs the control quantity (the velocity command a_t of the movable mass on the x-axis) that realizes attitude control of the underwater glider.
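
In the application stage only the trained current decision network is required; a minimal PyTorch sketch of this control loop is shown below, where the file name, sampling period and the `read_state`/`apply_command` hooks are assumptions standing in for the glider's real I/O:

```python
import time
import torch

def control_loop(actor, read_state, apply_command, dt=1.0):
    """Application stage: map the measured state to the movable-mass velocity command."""
    actor.load_state_dict(torch.load("actor_trained.pt"))  # parameters saved in step 10
    actor.eval()
    while True:
        v1, v3, w2, theta = read_state()                    # measured glider state
        s = torch.tensor([[v1, v3, w2, theta]], dtype=torch.float32)
        with torch.no_grad():
            a_t = actor(s).item()                           # control quantity a_t
        apply_command(a_t)                                  # drive the movable mass
        time.sleep(dt)
```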

Figure 4 shows the episode reward during training of the attitude controller in this example. The training effect is evaluated mainly through the episode reward: after a given training period, a larger average reward indicates better training. As shown in Figure 4, the episode reward keeps rising; after 900 episodes it stabilizes at around 850, indicating that the controller has learned a good policy.

Figure 5 shows an example of the application stage, namely the evolution of the pitch angle during a descent. The initial pitch angle is set to 0° and the desired pitch angle to -25°. The pitch angle reaches the desired value after 16 s, and the steady-state error during stable gliding is 0.06°. The error between the pitch angle and the desired angle during stable gliding is thus small, and the glider can be considered able to glide continuously along the desired trajectory.

Although embodiments of the invention have been shown and described above, it should be understood that they are exemplary and are not to be construed as limiting the invention; those of ordinary skill in the art may make changes, modifications, substitutions and variations to the above embodiments within the scope of the invention without departing from its principles and spirit.

Claims (8)

1. An underwater glider attitude control method based on deep reinforcement learning, characterized in that it comprises the following steps:

Step 1: build a current decision neural network, a current evaluation neural network, a target decision neural network and a target evaluation neural network; a decision network takes the state of the underwater glider as input and outputs the control quantity a as its action; an evaluation network takes the state and the control quantity of the underwater glider as inputs and outputs an evaluation value; initialize the parameters of the four networks, the memory bank and the data buffer;

Step 2: obtain the state s_t of the underwater glider at the current time, feed it to the current decision network to compute the attitude controller output action a_t, apply a_t to the underwater glider simulator to obtain the next state s_{t+1}, and compute the reward r_t at the current time from s_t, a_t, the target pitch angle θ_d and s_{t+1};

Step 3: store the tuple (s_t, a_t, r_t, s_{t+1}) obtained in step 2 as one experience data unit in the memory bank and increment t by 1; compare t with the preset memory bank size n: if t < n, return to step 2 with the updated s_t; once the number of experience units stored in the memory bank reaches n, go to step 4;

Step 4: sample a specified number N of experience data units from the memory bank and place them in the buffer;

Step 5: sample m experience units from the N units in the buffer and process them one by one as follows: for an experience unit (s_t, a_t, r_t, s_{t+1}), feed the state s_t and the action signal a_t into the current evaluation network to obtain its evaluation value Q; feed the next state s_{t+1} into the target decision network to obtain the actuator action signal μ' output by the target decision network; feed s_{t+1} and μ' into the target evaluation network to obtain its evaluation value Q'; using Q, Q' and the loss function L of the evaluation network, compute the gradient ∇_{σ^Q}L of the current evaluation network;

Step 6: update the current evaluation network: using the gradient ∇_{σ^Q}L, update the current evaluation network parameters σ^Q by a gradient step with learning rate α, where α is the learning rate of the evaluation network;

Step 7: compute the gradient ∇_{σ^μ}J of the current decision network;

Step 8: update the current decision network: using the gradient ∇_{σ^μ}J, update the current decision network parameters σ^μ by a gradient step with learning rate β, where β is the learning rate of the decision network;

Step 9: update the target evaluation network and the target decision network: update the target evaluation network parameters from the updated current evaluation network parameters, and the target decision network parameters from the updated current decision network parameters;

Step 10: check whether the number of training iterations exceeds the preset number; if it does, stop training and save the parameter values of the four networks; otherwise return to step 4 and again sample N experience units from the memory bank into the buffer;

Step 11: after the trained deep reinforcement learning model is obtained, apply it to an actual underwater glider gliding in the vertical plane: given a target pitch angle, sample the glider state, feed it to the model, and use the resulting control quantity to realize attitude control of the underwater glider.
2. The underwater glider attitude control method based on deep reinforcement learning according to claim 1, characterized in that the reward value r_t is taken as

$r_t = r_1 + r_2 + r_3$

where r_1 = -λ_1(d_t - d_{t-1}) is the reward obtained as the current pitch angle moves away from or toward the target pitch angle; r_2 = -λ_2(w_t - w_{t-1}) is the reward obtained from the change in angular velocity; r_3 is set according to the current pitch angle: if |d_t - d_{t-1}| < 0.1°, r_3 is a positive reward, and if the pitch angle at the current time t satisfies θ < -90° or θ > 90°, r_3 is a negative value representing a penalty; d_t is the difference between the pitch angle θ and the target pitch angle θ_d at time t, w_t is the pitch angular rate at time t, and λ_1 and λ_2 are preset coefficients.
Figure FDA0002668239160000031
的过程为:
5. a kind of underwater glider attitude control method based on deep reinforcement learning according to claim 1 is characterized in that: in step 5, the gradient value of current evaluation neural network is obtained
Figure FDA0002668239160000031
The process is:
当前评价神经网络的损失函数L为:The loss function L of the current evaluation neural network is:
Figure FDA0002668239160000032
Figure FDA0002668239160000032
其中,ωi为第i个经验数据单元的重要性采样权重,δi为:Among them, ω i is the importance sampling weight of the i-th empirical data unit, and δ i is: δi=ri+γQ'(si+1,μ'(si+1μ'Q')-Q(si,aiQ)δ i =r i +γQ'(s i+1 ,μ'(s i+1μ'Q' )-Q(s i ,a iQ ) ri是第i个经验数据单元中的奖励值,si+1是第i个经验数据单元中的下一时刻状态,σμ'是目标决策神经网络的参数,μ'(si+1μ')是si+1通过目标决策神经网络输出的执行机构的动作信号;σQ'是目标评价神经网络的参数,Q'(si+1,μ'(si+1μ'Q')是将μ'(si+1μ')和si+1通过目标评价神经网络得到的评价值大小;si是第i个经验数据单元中的当前时刻状态,ai是第i个经验数据单元中的当前时刻状态下的动作信号,σQ为当前评价神经网络的参数,Q(si,aiQ)是当前评价神经网络的评价值大小;γ是折扣系数;ri is the reward value in the i -th empirical data unit, s i+1 is the next moment state in the i-th empirical data unit, σ μ' is the parameter of the target decision-making neural network, μ'(s i+1μ' ) is the action signal of the actuator output by s i+1 through the target decision-making neural network; σ Q' is the parameter of the target evaluation neural network, Q'(s i+1 ,μ'(s i+1 | σ μ'Q' ) is the size of the evaluation value obtained by passing μ'(s i+1μ' ) and s i+1 through the target evaluation neural network; s i is the current value in the i-th empirical data unit. Time state, a i is the action signal at the current time state in the i-th empirical data unit, σ Q is the parameter of the current evaluation neural network, Q(s i , a iQ ) is the evaluation of the current evaluation neural network. value; γ is the discount factor; 得到当前评价神经网络的损失函数L的梯度为:The gradient of the loss function L of the current evaluation neural network is obtained as:
Figure FDA0002668239160000033
Figure FDA0002668239160000033
6.根据权利要求1所述一种基于深度强化学习的水下滑翔机姿态控制方法,其特征在于:步骤7中,当前决策神经网络的梯度
Figure FDA0002668239160000034
计算公式为:
6. a kind of underwater glider attitude control method based on deep reinforcement learning according to claim 1 is characterized in that: in step 7, the gradient of the current decision-making neural network
Figure FDA0002668239160000034
The calculation formula is:
Figure FDA0002668239160000035
Figure FDA0002668239160000035
其中
Figure FDA0002668239160000036
表示当前评价神经网络的评价值
Figure FDA0002668239160000037
对参数α的梯度,μ(si)是si通过当前决策神经网络输出的执行机构的动作信号;
Figure FDA0002668239160000041
表示当前决策神经网络动作
Figure FDA0002668239160000042
对当前决策神经网络参数σμ的梯度。
in
Figure FDA0002668239160000036
Indicates the evaluation value of the current evaluation neural network
Figure FDA0002668239160000037
For the gradient of the parameter α, μ(s i ) is the action signal of the actuator output by s i through the current decision-making neural network;
Figure FDA0002668239160000041
Represents the current decision neural network action
Figure FDA0002668239160000042
Gradient of the current decision neural network parameter σ μ .
7.根据权利要求1所述一种基于深度强化学习的水下滑翔机姿态控制方法,其特征在于:步骤9中目标评价神经网络与目标决策神经网络更新公式为:7. a kind of underwater glider attitude control method based on deep reinforcement learning according to claim 1 is characterized in that: in step 9, target evaluation neural network and target decision neural network update formula are:
Figure FDA0002668239160000043
Figure FDA0002668239160000043
其中,σt+1 Q表示更新后的当前评价神经网络参数,σt Q'表示待更新的目标评价神经网络参数,σt+1 Q'表示更新后的目标评价神经网络参数;σt+1 μ表示更新后的当前决策神经网络参数,σt μ'表示待更新的目标决策神经网络参数,σt+1 μ'表示更新后的目标决策神经网络参数;τ1为目标评价神经网络的更新率,τ2为目标决策神经网络的更新率。Among them, σ t+1 Q represents the updated current evaluation neural network parameters, σ t Q' represents the target evaluation neural network parameters to be updated, σ t+1 Q' represents the updated target evaluation neural network parameters; σ t+ 1 μ represents the updated current decision-making neural network parameters, σ t μ' represents the target decision-making neural network parameters to be updated, σ t+1 μ' represents the updated target decision-making neural network parameters; τ 1 is the target evaluation neural network parameter. update rate, τ 2 is the update rate of the target decision-making neural network.
8.根据权利要求1所述一种基于深度强化学习的水下滑翔机姿态控制方法,其特征在于:决策神经网络采用水下滑翔机的状态值v1,v32,θ作为输入量,v1,v32,θ分别是水下滑翔机x,z轴方向速度、俯仰角速度以及俯仰角,其中x轴为机体坐标系前向轴,z轴为垂直于机体平面的坐标轴;水下滑翔机的控制量a为移动质量块在x轴上的速度指令。8. A kind of underwater glider attitude control method based on deep reinforcement learning according to claim 1, is characterized in that: decision-making neural network adopts the state values v 1 , v 3 , ω 2 , θ of the underwater glider as input, v 1 , v 3 , ω 2 , θ are the speed of the underwater glider in the x and z-axis directions, the pitch angular velocity and the pitch angle, respectively, where the x-axis is the forward axis of the body coordinate system, and the z-axis is the coordinate axis perpendicular to the body plane; The control quantity a of the underwater glider is the speed command of the moving mass on the x-axis.
CN202010925225.3A 2020-09-06 2020-09-06 Underwater glider attitude control method based on deep reinforcement learning Pending CN112100834A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010925225.3A CN112100834A (en) 2020-09-06 2020-09-06 Underwater glider attitude control method based on deep reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010925225.3A CN112100834A (en) 2020-09-06 2020-09-06 Underwater glider attitude control method based on deep reinforcement learning

Publications (1)

Publication Number Publication Date
CN112100834A true CN112100834A (en) 2020-12-18

Family

ID=73758469

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010925225.3A Pending CN112100834A (en) 2020-09-06 2020-09-06 Underwater glider attitude control method based on deep reinforcement learning

Country Status (1)

Country Link
CN (1) CN112100834A (en)



Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107102644A (en) * 2017-06-22 2017-08-29 华南师范大学 The underwater robot method for controlling trajectory and control system learnt based on deeply
CN108803321A (en) * 2018-05-30 2018-11-13 清华大学 Autonomous Underwater Vehicle Trajectory Tracking Control method based on deeply study
CN110597058A (en) * 2019-08-28 2019-12-20 浙江工业大学 A three-DOF autonomous underwater vehicle control method based on reinforcement learning
CN110879595A (en) * 2019-11-29 2020-03-13 江苏徐工工程机械研究院有限公司 An unmanned mine card tracking control system and method based on deep reinforcement learning

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
刘西等: "DDPG优化基于动态逆的飞行器姿态控制", 《计算机仿真》 *
葛东旭: "《数据挖掘原理与应用》", 31 March 2020 *
韦鹏程等: "《大数据巨量分析与机器学习的整合与开发》", 31 May 2017 *

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115046433A (en) * 2021-03-09 2022-09-13 北京理工大学 Aircraft time collaborative guidance method based on deep reinforcement learning
CN113420513A (en) * 2021-07-01 2021-09-21 西北工业大学 Underwater cylinder turbulent flow partition flow field prediction method based on deep learning
CN113420513B (en) * 2021-07-01 2023-03-07 西北工业大学 A deep learning-based method for predicting the flow field of underwater cylinder turbulence partitions
CN113879495A (en) * 2021-10-26 2022-01-04 西北工业大学 A dynamic motion planning method for underwater glider based on ocean current prediction
CN113879495B (en) * 2021-10-26 2024-04-19 西北工业大学 Dynamic motion planning method for underwater glider based on ocean current prediction
CN114355777A (en) * 2022-01-06 2022-04-15 浙江大学 A dynamic gliding method and system based on distributed pressure sensors and segmented attitude control
CN114355777B (en) * 2022-01-06 2023-10-10 浙江大学 A dynamic gliding method and system based on distributed pressure sensors and segmented attitude control
CN118466221A (en) * 2024-07-11 2024-08-09 中国海洋大学 A deep reinforcement learning decision method for underwater glider attack angle
CN118466221B (en) * 2024-07-11 2024-09-17 中国海洋大学 Deep reinforcement learning decision-making method for attack angle of underwater glider

Similar Documents

Publication Publication Date Title
CN112100834A (en) Underwater glider attitude control method based on deep reinforcement learning
CN108803321B (en) Autonomous underwater vehicle track tracking control method based on deep reinforcement learning
CN110806759B (en) An aircraft route tracking method based on deep reinforcement learning
CN109946975B (en) Reinforced learning optimal tracking control method of unknown servo system
CN111351488A (en) Intelligent trajectory reconstruction reentry guidance method for aircraft
CN108490965A (en) Rotor craft attitude control method based on Genetic Algorithm Optimized Neural Network
CN115826621B (en) A UAV motion planning method and system based on deep reinforcement learning
CN113052372A (en) Dynamic AUV tracking path planning method based on deep reinforcement learning
CN111445498A (en) Target tracking method adopting Bi-L STM neural network
CN103019099A (en) Parameter optimization method for satellite attitude fuzzy controller
CN111240344A (en) Autonomous underwater robot model-free control method based on double neural network reinforcement learning technology
CN113722980A (en) Ocean wave height prediction method, system, computer equipment, storage medium and terminal
CN117289709A (en) High-ultrasonic-speed appearance-changing aircraft attitude control method based on deep reinforcement learning
CN117555352A (en) An ocean current assisted path planning method based on discrete SAC
CN112215412B (en) Dissolved oxygen prediction method and device
CN108805253A (en) A kind of PM2.5 concentration predictions method
CN114967713B (en) Underwater vehicle buoyancy discrete change control method based on reinforcement learning
Dong et al. Gliding motion optimization for a biomimetic gliding robotic fish
CN117818706B (en) Method, system, equipment and medium for predicting speed of medium-low speed maglev train
CN103336887A (en) Method for identifying water power coefficient based on bee colony algorithm
CN119397195A (en) A trajectory prediction method based on deep learning
CN110889531A (en) Wind power prediction method and prediction system based on improved GSA-BP neural network
CN114118371A (en) A kind of agent deep reinforcement learning method and computer readable medium
CN115617060B (en) A hovering control method for quadrotor drone based on deep reinforcement learning
CN111695195A (en) Spatial physical moving body modeling method based on long-time memory network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20201218