CN111079936A - Wave fin propulsion underwater operation robot tracking control method based on reinforcement learning - Google Patents

Wave fin propulsion underwater operation robot tracking control method based on reinforcement learning

Info

Publication number
CN111079936A
CN111079936A
Authority
CN
China
Prior art keywords
wave
underwater operation
reinforcement learning
fin
operation robot
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911077089.0A
Other languages
Chinese (zh)
Other versions
CN111079936B (en)
Inventor
王宇
唐冲
王睿
王硕
谭民
马睿宸
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Automation of Chinese Academy of Science
Original Assignee
Institute of Automation of Chinese Academy of Science
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Automation of Chinese Academy of Science filed Critical Institute of Automation of Chinese Academy of Science
Priority to CN201911077089.0A priority Critical patent/CN111079936B/en
Publication of CN111079936A publication Critical patent/CN111079936A/en
Application granted granted Critical
Publication of CN111079936B publication Critical patent/CN111079936B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 - Machine learning
    • B - PERFORMING OPERATIONS; TRANSPORTING
    • B63 - SHIPS OR OTHER WATERBORNE VESSELS; RELATED EQUIPMENT
    • B63C - LAUNCHING, HAULING-OUT, OR DRY-DOCKING OF VESSELS; LIFE-SAVING IN WATER; EQUIPMENT FOR DWELLING OR WORKING UNDER WATER; MEANS FOR SALVAGING OR SEARCHING FOR UNDERWATER OBJECTS
    • B63C11/00 - Equipment for dwelling or working underwater; Means for searching for underwater objects
    • B63C11/52 - Tools specially adapted for working underwater, not otherwise provided for

Abstract

The invention belongs to the field of autonomous control of underwater operation robots, and particularly relates to a method, a system and a device for tracking control of a wave-fin-propelled underwater operation robot based on reinforcement learning, aiming at solving the problem of low target tracking accuracy caused by the poor convergence and stability of the Actor network during training. The method comprises: obtaining system state information of the underwater operation robot at time t and pose information of the target to be tracked in the robot's body-fixed coordinate system, and constructing the state information s_t of a Markov decision process; based on s_t, obtaining the wave frequency a_t of the wave fins through the Actor network in an Actor-Critic reinforcement learning model; based on a_t, controlling the wave fins of the underwater operation robot, and looping with t = t + 1. The invention supervises the Actor network training through a PID controller, which improves the stability and convergence of the network and improves the accuracy of target tracking.

Description

Wave fin propulsion underwater operation robot tracking control method based on reinforcement learning
Technical Field
The invention belongs to the field of autonomous control of underwater operation robots, and particularly relates to a method, a system and a device for tracking and controlling an underwater operation robot propelled by a wave fin based on reinforcement learning.
Background
Autonomous control of underwater robots is a hotspot and a difficulty of current research. As mankind moves from ocean exploration to ocean development, the autonomous control and autonomous operation of underwater work robots face new challenges. Autonomous operation of underwater operation robots is of great significance for underwater archaeology, underwater salvage, underwater rescue, underwater engineering and the like: such a robot can operate in place of a diver or a remotely operated vehicle (ROV), work underwater continuously for long periods, and improve the efficiency of underwater operations.
Generally, due to the irregular shape of the underwater operation robot and the complexity of the underwater environment, it is difficult to establish an accurate hydrodynamic model for the robot, so model-based control methods adapt poorly. Reinforcement learning, which gives an action to execute based on the current state of the system and then transitions to the next state, is a typical model-free control method with strong adaptability to complex underwater environments and unknown disturbances. However, reinforcement learning training is experience learning based on data, and successful experience is important for training. In the initial training stage of reinforcement learning, because the output is poor and successfully explored control behaviors are to some extent accidental, the successful experience in the database is insufficient; as a result, the Actor network in the reinforcement learning model converges slowly, the learning efficiency is low, and the accuracy of subsequent tracking control is directly affected.
Disclosure of Invention
In order to solve the above problem in the prior art, namely the low target tracking accuracy of conventional reinforcement-learning-based tracking control methods caused by the poor convergence and stability of the Actor network in the reinforcement learning model during training, a first aspect of the present invention provides a tracking control method for a wave fin propulsion underwater work robot based on reinforcement learning, the method comprising:
s100, obtaining system state information of the underwater operation robot at the time t and pose information of a target to be tracked under an underwater operation robot random coordinate system, and constructing state information S of a Markov decision processt
Step S200, based on StAcquiring the fluctuation frequency a of the fluctuation fin through an Actor network in an Actor-critical reinforcement learning modelt
Step S300, based on atControlling a wave fin of the underwater operation robot, and jumping to a construction module when t is t + 1;
the Actor-criticic reinforcement learning model comprises an Actor network and a criticic network, and is obtained through offline training, and the training method comprises the following steps:
a100, acquiring a training data set; and constructing state information s of Markov decision process based on system state information of the underwater operation robot at the time t in the training data set and pose information of the target to be tracked in the underwater operation robot random coordinate systemt
Step A200, acquiring a supervised training probability and a random supervised probability at the moment t, and if the supervised training probability is greater than the random supervised probability, basing on stObtaining the fluctuation frequency a of the fluctuation fin through a PID controllertOtherwise based on stAcquiring the fluctuation frequency a of the fluctuation fin through an Actor networkt
Step A300, based on stAnd atRespectively obtaining the state-action evaluation value Q through a Critic network and a preset reward function in the Actor-Critic reinforcement learning model*(st,at) Prize value rt
Step A400, based on Q*(st,at) By deterministic policy gradientsUpdating the parameters of the Actor network by an algorithm; and based on Q*(st,at)、rtUpdating the parameters of the Critic network;
and step a500, circularly executing steps a100 to a400 by making t equal to t +1 until t is greater than a preset training time, so as to obtain a trained Actor network.
In some preferred embodiments, the Actor network comprises four convolutional layers; the number of neurons in the first convolutional layer and the second convolutional layer is 200, the activation function is the Relu6 function, the number of neurons in the third convolutional layer is 10, the activation function is the Relu function, the number of neurons in the fourth convolutional layer is 2, and the activation function is the tanh function.
In some preferred embodiments, the Critic network comprises five convolutional layers, the number of neurons in the first, second and third convolutional layers is 200, the activation function is the Relu6 function, the number of neurons in the fourth convolutional layer is 10, the activation function is the Relu function, the number of neurons in the fifth convolutional layer is 1, and the activation function is a linear activation function.
In some preferred embodiments, in step A200, the supervised training probability and the random supervised probability at time t are obtained as:

$$\mathrm{PRO}_t = \mathrm{PRO}_0 \times 0.999^t$$
$$\mathrm{PRO}_r = \max(\mathrm{rand}(1),\ 0.01)$$

where PRO_t is the supervised training probability at time t, PRO_r is the random supervised probability, and PRO_0 is a preset initial supervised training probability.
In some preferred embodiments, in step A200, "obtaining the wave frequency a_t of the wave fins through a PID controller" proceeds as follows: the wave frequency of the wave fins comprises a left-fin wave frequency and a right-fin wave frequency, calculated as

$$f_{r1} = P_r\,\Delta\psi + I_r \int \Delta\psi\,dt + D_r\,\Delta\dot{\psi}$$
$$f_{r2} = -f_{r1}$$
$$f_{x1} = P_x\,\Delta x + I_x \int \Delta x\,dt + D_x\,\Delta\dot{x}$$
$$f_{x2} = f_{x1}$$
$$f_1 = \rho_{r1} f_{r1} + \rho_{r2} f_{x1}$$
$$f_2 = \rho_{x1} f_{r2} + \rho_{x2} f_{x2}$$

where f_1, f_2 are the final wave frequencies of the left and right fins; f_{r1}, f_{r2} are the wave frequencies for yaw rotation of the left and right fins; f_{x1}, f_{x2} are the wave frequencies for forward motion of the left and right fins; ρ_{r1}, ρ_{r2}, ρ_{x1}, ρ_{x2} are the weight coefficients of the rotation and forward wave frequencies for the left and right fins; P_i, I_i, D_i with i ∈ {r, x} are the PID parameters; Δψ is the relative bearing error and Δψ̇ its differential; Δx is the lateral error of the relative position and Δẋ its differential.
In some preferred embodiments, in step A200, "obtaining the wave frequency a_t of the wave fins through the Actor network" proceeds as follows: the wave frequency a_t is a wave frequency with superimposed random noise, satisfying

$$a_t \leftarrow \mu(t), \qquad \mu(t) \sim \mathcal{N}(a_t,\ c_t^2)$$
$$c_t = \max(c_0 \times 0.999^t,\ 0.01)$$

where c_0 is a preset initial noise standard deviation, c_t is the noise standard deviation at time t, and μ(t) is random Gaussian noise with mean a_t and standard deviation c_t, 𝒩(·,·) denoting the noise distribution.
In some preferred embodiments, the preset reward function is:

$$r = r_0 - \rho_1\,\|\Delta\psi\|_2 - \rho_2\,\|\Delta P_o\|_2 - \nu^T \rho_3\,\nu$$

where r is the reward value, r_0 is a preset constant reward, ‖Δψ‖_2 is the 2-norm of the relative orientation error, ‖ΔP_o‖_2 is the 2-norm of the relative position error, ν is the velocity vector, ν^T ρ_3 ν is a regularization term, and ρ_1, ρ_2, ρ_3 are weight coefficients.
A second aspect of the invention provides a wave fin propulsion underwater operation robot tracking control system based on reinforcement learning, which comprises a construction module, a wave frequency acquisition module and a circulation module;
the construction module is configured to obtain system state information of the underwater operation robot at time t and pose information of the target to be tracked in the robot's body-fixed coordinate system, and to construct the state information s_t of a Markov decision process;
the wave frequency acquisition module is configured to obtain, based on s_t, the wave frequency a_t of the wave fins through the Actor network in an Actor-Critic reinforcement learning model;
the circulation module is configured to control the wave fins of the underwater operation robot based on a_t, set t = t + 1, and jump to the construction module.
In a third aspect of the present invention, a storage device is provided, in which a plurality of programs are stored, the programs being adapted to be loaded and executed by a processor to implement the wave fin propulsion underwater operation robot tracking control method based on reinforcement learning described above.
In a fourth aspect of the present invention, a processing apparatus is provided, comprising a processor adapted to execute programs and a storage device adapted to store a plurality of programs, the programs being adapted to be loaded and executed by the processor to implement the wave fin propulsion underwater operation robot tracking control method based on reinforcement learning described above.
The invention has the beneficial effects that:
according to the invention, the PID controller is used for monitoring the Actor network training, so that the stability and convergence of the reinforcement learning model are improved, and the target tracking precision is improved. In the initial stage of reinforcement learning, the fluctuation frequency of the fluctuation fin is generated through the PID controller, a large amount of control experience is generated, effective supervision on an Actor network is achieved, and the fluctuation frequency of the fluctuation fin is generated through the Actor network in the later stage. A large amount of control experience is effectively evaluated according to the Critic network, and then the Actor network is updated through a deterministic strategy gradient algorithm, so that the convergence of the reinforcement learning model is accelerated, and the stability of the reinforcement learning model is improved.
Meanwhile, the invention combines several different strategies to train the Actor network, which improves the generalization ability of the reinforcement learning model and the accuracy of target tracking.
Drawings
Other features, objects and advantages of the present application will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings.
FIG. 1 is a schematic flow chart of a wave fin propulsion underwater operation robot tracking control method based on reinforcement learning according to an embodiment of the invention;
FIG. 2 is a schematic flow chart of a training method of an Actor-Critic reinforcement learning model according to an embodiment of the present invention;
FIG. 3 is a frame diagram of a wave fin propulsion underwater operation robot tracking control system based on reinforcement learning according to an embodiment of the invention;
FIG. 4 is an exemplary diagram of a wave fin propelled underwater work robot target tracking control in accordance with one embodiment of the present invention;
FIG. 5 is an exemplary diagram of the PID-controller-based Actor-Critic network update according to an embodiment of the invention;
FIG. 6 is an exemplary diagram of the architecture of an Actor network and a Critic network in accordance with one embodiment of the invention;
FIG. 7 is an exemplary diagram comparing different training strategies according to one embodiment of the invention;
FIG. 8 is an exemplary diagram of a stable tracking of a moving object according to one embodiment of the invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The present application will be described in further detail with reference to the following drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the relevant invention and not restrictive of the invention. It should be noted that, for convenience of description, only the portions related to the related invention are shown in the drawings.
It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict.
The invention discloses a wave fin propulsion underwater operation robot tracking control method based on reinforcement learning, which, as shown in FIGS. 1 and 2, comprises the following steps:
s100, obtaining system state information of the underwater operation robot at the time t and pose information of a target to be tracked under an underwater operation robot random coordinate system, and constructing state information S of a Markov decision processt
Step S200, based on StAcquiring the fluctuation frequency a of the fluctuation fin through an Actor network in an Actor-critical reinforcement learning modelt
Step S300, based on atControlling a wave fin of the underwater operation robot, and jumping to a construction module when t is t + 1;
the Actor-criticic reinforcement learning model comprises an Actor network and a criticic network, and is obtained through offline training, and the training method comprises the following steps:
a100, acquiring a training data set; and constructing state information s of Markov decision process based on system state information of the underwater operation robot at the time t in the training data set and pose information of the target to be tracked in the underwater operation robot random coordinate systemt
Step A200, acquiring a supervised training probability and a random supervised probability at the moment t, and if the supervised training probability is greater than the random supervised probability, basing on stObtaining the fluctuation frequency a of the fluctuation fin through a PID controllertOtherwise based on stAcquiring the fluctuation frequency a of the fluctuation fin through an Actor networkt
Step A300, based on stAnd atRespectively obtaining the state-action evaluation value Q through a Critic network and a preset reward function in the Actor-Critic reinforcement learning model*(st,at) Prize value rt
Step A400, based on Q*(st,at) Updating parameters of the Actor network through a deterministic strategy gradient algorithm; and based on Q*(st,at)、rtUpdating the parameters of the Critic network;
and step a500, circularly executing steps a100 to a400 by making t equal to t +1 until t is greater than a preset training time, so as to obtain a trained Actor network.
In order to more clearly describe the tracking control method of the wave fin propulsion underwater operation robot based on reinforcement learning, the following describes the steps of an embodiment of the method in detail with reference to the attached drawings.
In the following, the training method of the Actor-Critic reinforcement learning model is introduced first, and then the reinforcement-learning-based tracking control method for the wave fin propulsion underwater operation robot, which obtains the wave frequency of the wave fins through the trained Actor network, is described.
1. Training method of the Actor-Critic reinforcement learning model
Step A100, obtaining the system state information of the underwater operation robot at time t in a preset training data set and the pose information of the target to be tracked in the robot's body-fixed coordinate system, and constructing the state information s_t of a Markov decision process.
Since the robot has self-stability in the roll and pitch directions, the body is assumed to have 4 degrees of freedom of motion, and the motion state can be defined as χ = [x, y, z, ψ]^T, where x, y, z are the three-dimensional coordinates of the underwater operation robot, ψ is the yaw angle, and T denotes transposition. The corresponding velocity vector is ν = [u, v, w, r_z]^T, where u is the forward velocity, v the lateral velocity, w the vertical velocity, and r_z the rotational velocity. The dynamic equation of the underwater operation robot system is then

$$\dot{\chi} = f(\chi, U) + \xi \qquad (1)$$

where U is the system control quantity, ξ is the unknown disturbance, and χ̇ is the system state differential.
The target object to be tracked is detected and localized through onboard binocular vision, so its visible range is limited in the body-fixed coordinate system. In FIG. 4, the tetrahedron O_{c1}-ABCD is the effective detection range of binocular vision, the front and rear cabins are the front and rear hull bodies of the robot, O_B X_B Y_B Z_B is the body-fixed coordinate system, and O_{c1} X_{c1} Y_{c1} Z_{c1} is the camera coordinate system. P_c, the center position of the workspace, is the target of the body control, which keeps the target object in the central area of the field of view, within the workspace A_1B_1C_1D_1-A_2B_2C_2D_2 shown, to facilitate the grasping operation of the robotic arm. Because the operation target is small, its attitude need not be considered at present. The above target-approaching tracking control problem is therefore

$$|g(x_o, y_o, z_o, \psi_o) - g(\chi)| < \varepsilon \qquad (2)$$

where x_o, y_o, z_o is the position of the target to be tracked, ψ_o is the azimuth of the target to be tracked, and ε is a preset fault tolerance range.
In this embodiment, the objective of the body control is to make the underwater operation robot stably track the target object so that the target is stabilized within a specified area relative to the robot. Considering the motion control of the robot in the two-dimensional plane, the position and deflection angle of the target to be tracked in the robot's body-fixed coordinate system are P_o = [x_o, y_o, ψ_o]^T, and the center position of the workspace is P_c = [x_c, y_c, ψ_c]^T. In addition, the state needs to include u, r_z and g, where g ∈ {0, 1} indicates whether the target has been approached. The robot generally does not need lateral motion during tracking, so the state information of the Markov decision process (MDP) can be defined as

$$s = [g,\ x_o - x_c,\ y_o - y_c,\ \psi_o - \psi_c,\ u,\ w,\ r_z]^T \qquad (3)$$

which can be abbreviated as

$$s = [g,\ \Delta X,\ \Delta Y,\ \Delta\psi,\ \nu^T]^T \qquad (4)$$

where ΔX, ΔY, Δψ are the X-axis position difference, the Y-axis position difference, and the deflection angle difference, (ΔX, ΔY) is the position difference, and ν is the velocity vector.
Meanwhile, in order to eliminate the correlation of the training data in the preset training data set, an experience pool needs to be constructed for Experience Replay, which improves the training effect. At its current system state s_t, the underwater operation robot takes the action a_t, i.e. the wave frequency of the wave fins, then observes the state information s_{t+1} at the next instant and obtains the reward value r_t at the current moment; the state-transition tuple {s_t, a_t, s_{t+1}, r_t} denotes one experience. A certain amount of data is then drawn from the historical experience base for training, as sketched below.
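As an illustration only, a minimal experience pool of this kind might be sketched in Python as follows; the class name and the store/sample interface are our assumptions, while the capacity 10000 and batch size 32 follow the experimental settings reported later in this description.

```python
import random
from collections import deque

class ReplayBuffer:
    """Experience pool of state-transition tuples {s_t, a_t, s_{t+1}, r_t}."""

    def __init__(self, capacity=10000):  # 10000 matches the experience-base size used in the experiments
        self.buffer = deque(maxlen=capacity)  # oldest experiences are discarded automatically

    def store(self, s_t, a_t, s_next, r_t):
        # One interaction with the robot yields one experience tuple.
        self.buffer.append((s_t, a_t, s_next, r_t))

    def sample(self, batch_size=32):
        # Uniform random sampling breaks the temporal correlation of the training data.
        return random.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)
```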
Step A200, acquiring the supervised training probability and the random supervised probability at time t; if the supervised training probability is greater than the random supervised probability, obtaining the wave frequency a_t of the wave fins through a PID controller based on s_t, otherwise obtaining a_t through the Actor network based on s_t.
Reinforcement learning training is experience learning based on data, and successful experience is important for training the network. In the initial training stage of reinforcement learning, because the outputs of the evaluation (Critic) network and the policy (Actor) network are poor and successfully explored control behaviors are to some extent accidental, the successful experience in the database is insufficient, leading to slow convergence, low learning efficiency, and even training divergence. As the old saying goes, "the master leads you through the door, but the practice is up to you." The invention constructs a supervisory controller to play the role of the master: it dominates the initial stage of reinforcement learning, generates a large amount of successful control experience, and effectively supervises and guides the policy network so that it produces successful control experience of its own, then gradually hands over to a self-learning stage. During self-learning, the supervisory controller interferes with the reinforcement learning with a small probability, perturbing the action generated by the policy network (the wave frequency of the wave fins) to further optimize the policy network, until "the blue surpasses the indigo from which it came", after which the supervisory controller no longer supervises.
The supervisory controller employed here is a PID controller based on state observation, constructed as shown in equations (5)-(10):

$$f_{r1} = P_r\,\Delta\psi + I_r \int \Delta\psi\,dt + D_r\,\Delta\dot{\psi} \qquad (5)$$
$$f_{r2} = -f_{r1} \qquad (6)$$
$$f_{x1} = P_x\,\Delta x + I_x \int \Delta x\,dt + D_x\,\Delta\dot{x} \qquad (7)$$
$$f_{x2} = f_{x1} \qquad (8)$$
$$f_1 = \rho_{r1} f_{r1} + \rho_{r2} f_{x1} \qquad (9)$$
$$f_2 = \rho_{x1} f_{r2} + \rho_{x2} f_{x2} \qquad (10)$$

where P_i, I_i, D_i with i ∈ {r, x} are the PID parameters, f_{r1}, f_{r2} are the wave frequencies for yaw rotation of the left and right fins, f_{x1}, f_{x2} are the wave frequencies for forward motion of the left and right fins, f_1, f_2 are the final wave frequencies of the left and right fins, and ρ_{r1}, ρ_{r2}, ρ_{x1}, ρ_{x2} are the weight coefficients of the rotation and forward wave frequencies for the left and right fins.
Following the idea that "the master leads you through the door, but the practice is up to you", the supervision strategy designed by the invention is shown in equations (11)-(13):

$$\mathrm{PRO}_t = \mathrm{PRO}_0 \times 0.999^t \qquad (11)$$
$$\mathrm{PRO}_r = \max(\mathrm{rand}(1),\ 0.01) \qquad (12)$$
$$a_t = \begin{cases} \mu_\theta(s_t), & \mathrm{PRO}_r > \mathrm{PRO}_t \\ a_t^{\mathrm{PID}}, & \text{otherwise} \end{cases} \qquad (13)$$

where PRO_0 is a preset initial supervised training probability, and PRO_t is the supervised training probability at time t, which decreases exponentially as the number of training steps increases. PRO_r is the random supervised probability: when its value is greater than the supervised training probability PRO_t, the Actor network outputs the wave frequency of the wave fins; otherwise the supervisory PID controller outputs the wave frequency of the wave fins.
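A minimal sketch of this supervision schedule in Python (the initial probability value and the function interface are assumptions; np.random.rand() stands in for rand(1)):

```python
import numpy as np

def select_action(s_t, t, actor, pid, pro_0=1.0):
    """Choose between the supervisory PID controller and the Actor network, eqs. (11)-(13)."""
    pro_t = pro_0 * 0.999 ** t            # supervised training probability, decays exponentially (11)
    pro_r = max(np.random.rand(), 0.01)   # random supervised probability (12)
    if pro_r > pro_t:
        return actor(s_t)                 # self-learning: Actor network outputs the fin frequencies
    return pid(s_t)                       # supervision: PID controller outputs the fin frequencies
```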
A schematic diagram of the training process of the PID-controller-based Actor-Critic reinforcement learning model, i.e. steps A100 to A500, is shown in FIG. 5, where n(a_t, ε) is the random noise of a_t and z^{-1} is a time delay.
One of the key points of reinforcement learning is the trade-off between exploration and exploitation. The control strategy output by the Actor network is built on empirical data, and only sufficient data can give the Actor network sufficient generalization capability. The invention therefore superimposes random noise on the output of the Actor network to train its generalization capability, as shown in equations (14) and (15):

$$c_t = \max(c_0 \times 0.999^t,\ 0.01) \qquad (14)$$
$$a_t \leftarrow \mu(t), \qquad \mu(t) \sim \mathcal{N}(a_t,\ c_t^2) \qquad (15)$$

where c_0 is the initial noise standard deviation, c_t is the noise standard deviation at time t, which decreases exponentially with the number of training steps, and μ(t) is random Gaussian noise with mean a_t and standard deviation c_t, 𝒩(·,·) denoting the noise distribution.
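A sketch of this exploration-noise schedule in Python; the clipping to [-1, 1] is our assumption, based on the tanh-normalized Actor output described below.

```python
import numpy as np

def noisy_action(a_t, t, c_0=1.0):
    """Superimpose decaying Gaussian exploration noise on the Actor output, eqs. (14)-(15)."""
    c_t = max(c_0 * 0.999 ** t, 0.01)             # noise standard deviation, decays exponentially (14)
    noisy = np.random.normal(loc=a_t, scale=c_t)  # Gaussian noise with mean a_t, std c_t (15)
    return np.clip(noisy, -1.0, 1.0)              # keep within the tanh-normalized action range (assumed)
```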
Step A300, based on s_t and a_t, obtaining the state-action evaluation value Q*(s_t, a_t) and the reward value r_t through the Critic network in the Actor-Critic reinforcement learning model and a preset reward function, respectively.
The Markov decision process of reinforcement learning comprises four parts: a state space S, an action space A, a reward function r(s, a): S × A → R, and a state transition probability distribution p. Under the Markov property, the state of the system at the next moment is determined only by the current state and action, i.e. p(x_t | x_{t-1}, a_{t-1}). The cumulative reward function is defined as the discount-weighted combination of future rewards:

$$R_t = \sum_{i=t}^{N} \gamma^{\,i-t}\, r(s_i, a_i) \qquad (16)$$

where γ ∈ (0, 1] is the discount factor, R_t is the cumulative reward starting at time t, r(s_i, a_i) is the reward value, i is the index, and N is the total number of time steps.
In the learning process, the network trained by reinforcement learning interacts with the robot: it selects a_t ∈ A in the action space, the system transitions from state x_t ∈ X to x_{t+1} ∈ X, and a reward r_t is obtained at time t to evaluate the merit of sampling a_t in state x_t; the reward function guides whether the goal is completed and whether the behavior is optimal. The goal of the control problem is therefore to obtain the optimal policy π* that receives the maximum reward J*:

$$J^* = \max_{\pi \in P} J(\pi) = \max_{\pi \in P} \mathbb{E}_\pi[R_t] \qquad (17)$$

where P is the policy space, E_π denotes the average expectation under a policy, J(π) is the reward value corresponding to the policy, π is the policy, and the control quantity of the final system is u = a.
The key to the above problem is therefore how to define the state of the MDP and the single-step reward function. The state information of the MDP is defined as in equations (3) and (4). The reward function needs to comprehensively consider the position error and the deflection angle error, expecting the robot to adjust its heading first and then approach the target; the power consumption of the robot must also be considered. Combining these indices, the reward function is defined as equation (18):
$$r = r_0 - \rho_1\,\|\Delta\psi\|_2 - \rho_2\,\|\Delta P_o\|_2 - \nu^T \rho_3\,\nu \qquad (18)$$

where r is the reward value; r_0 is a preset constant reward, equal to 1 when the target task is completed and 0 otherwise; ‖Δψ‖_2 is the 2-norm of the relative orientation error; ‖ΔP_o‖_2 is the 2-norm of the relative position error; ν^T ρ_3 ν is a regularization term used to reduce the energy consumption of the system; and ρ_1, ρ_2, ρ_3 are weight coefficients.
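A sketch of this reward computation in Python; the weight values are placeholders rather than values from the patent, and ρ_3 is treated as a scalar (an assumption, since ν^T ρ_3 ν could equally use a weight matrix).

```python
import numpy as np

def reward(d_psi, d_pos, nu, done, r_0=1.0, rho_1=0.1, rho_2=0.1, rho_3=0.01):
    """Single-step reward of eq. (18): orientation error, position error and an energy penalty."""
    r_goal = r_0 if done else 0.0             # constant reward when the target task is completed
    r_psi = rho_1 * np.linalg.norm(d_psi)     # relative orientation error, 2-norm
    r_pos = rho_2 * np.linalg.norm(d_pos)     # relative position error, 2-norm
    r_energy = rho_3 * float(np.dot(nu, nu))  # nu^T rho_3 nu with scalar rho_3 (assumed)
    return r_goal - r_psi - r_pos - r_energy
```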
In solving the MDP problem, the classical value function is the state-action value function (also called the Q function), defined as

$$Q^\pi(s, a) = \mathbb{E}_\pi\!\left[\sum_{k=0}^{K} \gamma^{\,k}\, r(s_{t+k}, a_{t+k}) \,\middle|\, s_t = s,\ a_t = a\right] \qquad (19)$$

where K is the total number of steps and k is the current step.
Starting from state s and executing the optimal policy π*, Q*(s, a) represents the maximum cumulative reward obtained, so the optimal value function satisfies the Bellman optimality principle:

$$Q^*(s_t, a_t) = \mathbb{E}\!\left[r(s_t, a_t) + \gamma \max_{a_{t+1}} Q^*(s_{t+1}, a_{t+1})\right] \qquad (20)$$

with a_t = π*(s_t). Once the optimal Q* is obtained through the reinforcement learning iterative training, the optimal policy follows as π*(s) = arg max_a Q*(s, a).
In the present invention, the structures of the Actor network and the Critic network are shown in FIG. 6. The Actor network is shown in FIG. 6(a): the numbers of neurons in hidden layer 1 and hidden layer 2 are 200 with the Relu6 activation function, the number of neurons in hidden layer 3 is 10 with the Relu activation function, and the final output layer has 2 neurons, with the tanh activation function normalizing the output to [-1, 1]; the hidden layers are convolutional layers. The Critic network is shown in FIG. 6(b); its input comprises the state s and the action a. The numbers of neurons in hidden layers 0 and 1 are 200; the state s and the action a pass through convolutional layer 0 and convolutional layer 1 respectively, are summed and fused, pass through a Relu6 activation function, and are then input into hidden layer 2, which has 200 neurons and the Relu6 activation function. Hidden layer 3 has 10 neurons with the Relu activation function, and the final output layer has 1 neuron with a linear activation function, outputting the state-action evaluation value Q. In FIG. 6(b), hidden layers 0, 1, 2, 3, 4 are the first, second, third, fourth and fifth convolutional layers, respectively.
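The patent describes the hidden layers as convolutional; purely for illustration, the sketch below uses fully connected layers of the same widths and activations (an assumption on our part), with the state dimension taken from equation (3).

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Policy network per FIG. 6(a): widths 200-200-10-2, activations Relu6/Relu6/Relu/tanh."""
    def __init__(self, state_dim=7, action_dim=2):  # state_dim follows eq. (3); assumed here
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 200), nn.ReLU6(),
            nn.Linear(200, 200), nn.ReLU6(),
            nn.Linear(200, 10), nn.ReLU(),
            nn.Linear(10, action_dim), nn.Tanh(),   # output normalized to [-1, 1]
        )

    def forward(self, s):
        return self.net(s)

class Critic(nn.Module):
    """Evaluation network per FIG. 6(b): separate state/action branches, summed, then 200-10-1."""
    def __init__(self, state_dim=7, action_dim=2):
        super().__init__()
        self.s_branch = nn.Linear(state_dim, 200)   # hidden layer 0
        self.a_branch = nn.Linear(action_dim, 200)  # hidden layer 1
        self.head = nn.Sequential(
            nn.ReLU6(),                             # activation after the sum fusion
            nn.Linear(200, 200), nn.ReLU6(),        # hidden layer 2
            nn.Linear(200, 10), nn.ReLU(),          # hidden layer 3
            nn.Linear(10, 1),                       # linear output: state-action value Q
        )

    def forward(self, s, a):
        return self.head(self.s_branch(s) + self.a_branch(a))  # sum-fuse the two branches
```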
Step A400, updating the parameters of the Actor network through a deterministic policy gradient algorithm based on Q*(s_t, a_t), and updating the parameters of the Critic network based on Q*(s_t, a_t) and r_t.
In the tracking control of the underwater operation robot, the system state and control quantities of the robot are continuous variables, and the MDP state is also continuous, so the MDP state space and action space are continuous domains. For reinforcement learning over a continuous domain, the policy gradient (PG) method is widely applied; its core idea is to change the parameters of the policy function in the direction that maximizes the reward function so as to improve the performance of the control policy. The PG method adopts an independent approximate function and is a stochastic policy scheme with low computational efficiency. To solve this problem, a deterministic state-action mapping function a = μ_θ(s) is adopted, i.e. the deterministic policy gradient algorithm (DPG), where θ is the parameter of the policy function. The cumulative reward is maximized by updating the parameter θ along the positive gradient of the cumulative reward function:

$$\theta_{t+1} = \theta_t + \alpha_\theta\, \nabla_\theta J(\mu_\theta) \qquad (21)$$

where α_θ is the weight coefficient and ∇_θ J(μ_θ) is the gradient variable.
The DPG algorithm derives ∇_θ J(μ_θ) in the form

$$\nabla_\theta J(\mu_\theta) \approx \frac{1}{M} \sum_{i=1}^{M} \nabla_\theta\, \mu_\theta(s_i)\, \nabla_a Q^{\mu}(s_i, a)\big|_{a=\mu_\theta(s_i)} \qquad (22)$$

where M is the number of iterations and Q^μ denotes the Q-value function under the policy function μ(a|s); it is further derived that

$$\nabla_\theta J(\mu_\theta) = \mathbb{E}\!\left[\nabla_\theta\, \mu_\theta(s)\, \nabla_a Q^{\mu}(s, a)\big|_{a=\mu_\theta(s)}\right] \qquad (23)$$
both the strategy function and the Q value evaluation function have strong nonlinearity, and the fitting of the nonlinear function by adopting the deep neural network is an effective method. In this section, an Actor network and a Critic network are respectively designed to approximate the policy function muθ(as) and an evaluation function QW(s, a), the parameters of the two networks are represented by θ and W, respectively, and the parameter update equation is shown as the following equation (24) (25) for the Q value function:
δt=rt+γQW(st+1,at+1)-QW(st,at) (24)
Wt+1=WtWδtWQW(st,at))(25)
wherein, deltatIn order to reward the intermediate variable(s),
Figure BDA0002262819460000151
is the gradient of Q versus W.
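A sketch of one combined update in PyTorch; the mean-squared TD loss, the optimizers passed in by the caller and the target-network bootstrapping are assumptions, while γ = 0.9 matches the experiments reported below.

```python
import torch

def update_step(batch, actor, critic, target_actor, target_critic,
                actor_opt, critic_opt, gamma=0.9):
    """One Actor-Critic update from a minibatch of (s_t, a_t, s_{t+1}, r_t) tuples."""
    s, a, s_next, r = batch

    # Critic: minimize the TD error of eqs. (24)-(25).
    with torch.no_grad():
        a_next = target_actor(s_next)
        q_target = r + gamma * target_critic(s_next, a_next)
    td_error = q_target - critic(s, a)
    critic_loss = td_error.pow(2).mean()
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor: ascend the deterministic policy gradient of eqs. (21)-(23),
    # implemented as descent on the negated Q value.
    actor_loss = -critic(s, actor(s)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()
```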
In this embodiment, the training of the Actor network depends on the Critic network, so an accurate evaluation benefits the Actor network training. Therefore, during the training of the Actor network and the Critic network, the target Critic network is updated faster than the target Actor network, so that the Actor network is trained and updated with a better evaluation. The asynchronous update strategy of the Actor network and the Critic network is shown in equations (26) and (27):
$$\theta' = \tau\theta + (1 - \tau)\,\theta', \quad \text{if } \operatorname{mod}(t, F_A) = 0 \qquad (26)$$
$$W' = \tau W + (1 - \tau)\,W', \quad \text{if } \operatorname{mod}(t, F_Q) = 0 \qquad (27)$$

where θ' is the target Actor network parameter, W' is the target Critic network parameter, τ is the update coefficient, F_A and F_Q are the update periods of the target Actor network and the target Critic network respectively (in general F_Q < F_A), and mod(·,·) is the remainder function.
Step A500, setting t = t + 1 and executing steps A100 to A400 in a loop until t is greater than the preset number of training steps, obtaining the trained Actor network.
In this embodiment, data in a preset training data set is sequentially selected to train the Actor network.
For the Experience Replay strategy, the PID-controller-based training strategy of the Actor-Critic reinforcement learning model, the asynchronous update strategy of the Actor network and the Critic network, and the random noise strategy, the invention selects five combined training strategies for training and testing, as shown in Table 1.
TABLE 1
[Table 1 is rendered as an image in the original; it lists the five combined training strategies, with the √ sign marking the strategies selected in each combination.]
The √ sign in Table 1 denotes a selected strategy. Network training is performed with the Actor-Critic reinforcement learning model under the 5 different training strategies selected in Table 1. The number of training rounds is 2000 and the number of training steps per round is 600; the size of the experience base is 10000, the batch size is 32, the learning rates of the Actor network and the Critic network are both 0.001, the weight forgetting (discount) factor γ is 0.9, and the initial position of the target in the body-fixed coordinate system is set to p = [-82, 220] mm. The results are shown in FIG. 7. Adding random noise is beneficial to the generalization capability of the model, but strategy 4 shows a slight decrease in performance relative to strategy 5: in the middle and later stages of model training, the supervisory controller has, from the Actor network's point of view, become random noise, so while it can provide the model with sufficient generalization capability, the random noise becomes an uncertain factor affecting model accuracy.
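For reference, the experimental settings just listed can be collected in a configuration dictionary; the key names are our own.

```python
TRAIN_CONFIG = {
    "rounds": 2000,                 # training rounds
    "steps_per_round": 600,         # training steps per round
    "replay_size": 10000,           # experience-base size
    "batch_size": 32,
    "actor_lr": 0.001,
    "critic_lr": 0.001,
    "gamma": 0.9,                   # weight forgetting (discount) factor
    "target_init_mm": (-82, 220),   # initial target position in the body-fixed frame, mm
}
```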
Stably keeping a relatively fixed position with respect to the operation target is an important precondition for the underwater operation robot to achieve accurate manipulation. The operation target is detected and localized through binocular vision, converted into the state representation and input into the policy network, which finally generates the wave frequencies that control the robot to track the target, as shown in FIG. 8, where the underwater view is from an underwater camera, the onshore view is from an onshore camera, and the left and right views are from the onboard binocular left and right cameras.
2. Wave fin propulsion underwater operation robot tracking control method based on reinforcement learning
The invention discloses a wave fin propulsion underwater operation robot tracking control method based on reinforcement learning, comprising the following steps:
Step S100, obtaining system state information of the underwater operation robot at time t and pose information of the target to be tracked in the robot's body-fixed coordinate system, and constructing the state information s_t of a Markov decision process.
An Actor-Critic reinforcement learning model is a dynamic-programming-style solution of a Markov decision process (MDP), so the actual problem must first be converted into an MDP problem and then solved by a reinforcement learning method. Therefore, in this embodiment, the state information s_t of the Markov decision process is constructed from the system state information of the underwater operation robot at time t and the pose information of the target to be tracked in the robot's body-fixed coordinate system.
Step S200, based on s_t, obtaining the wave frequency a_t of the wave fins through the Actor network in the Actor-Critic reinforcement learning model.
In this embodiment, based on the constructed state information s_t of the Markov decision process, the wave frequency of the wave fins is obtained through the offline-trained Actor network, which controls the propulsion and tracking of the underwater robot.
Step S300, based on a_t, controlling the wave fins of the underwater operation robot, setting t = t + 1, and jumping to step S100.
In this embodiment, the wave fins of the underwater operation robot are controlled based on the obtained wave frequency, t is set to t + 1, and the target tracking continues, as sketched below.
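A sketch of this online tracking loop in Python; the sensing and actuation interfaces get_state and set_fin_frequencies are hypothetical placeholders, not the patent's API.

```python
def tracking_loop(actor, robot, n_steps=1000):
    """Online tracking control, steps S100-S300: sense, decide, actuate, repeat."""
    for t in range(n_steps):
        s_t = robot.get_state()         # S100: MDP state from robot state + target pose (hypothetical API)
        a_t = actor(s_t)                # S200: fin wave frequencies from the trained Actor network
        robot.set_fin_frequencies(a_t)  # S300: drive the wave fins (hypothetical API)
```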
A wave fin propulsion underwater operation robot tracking control system based on reinforcement learning according to a second embodiment of the invention, as shown in FIG. 3, comprises: a construction module 100, a wave frequency acquisition module 200 and a circulation module 300;
the construction module 100 is configured to acquire system state information of the underwater operation robot at time t and pose information of the target to be tracked in the robot's body-fixed coordinate system, and to construct the state information s_t of a Markov decision process;
the wave frequency acquisition module 200 is configured to obtain, based on s_t, the wave frequency a_t of the wave fins through the Actor network in an Actor-Critic reinforcement learning model;
the circulation module 300 is configured to control the wave fins of the underwater operation robot based on a_t, set t = t + 1, and jump to the construction module.
It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working process and related description of the system described above may refer to the corresponding process in the foregoing method embodiment, and details are not described herein again.
It should be noted that the above division into functional modules is merely an example to illustrate the wave fin propulsion underwater operation robot tracking control system based on reinforcement learning provided in the above embodiment. In practical applications, the above functions may be allocated to different functional modules as needed; that is, the modules or steps in the embodiments of the present invention may be further decomposed or combined. For example, the modules of the above embodiment may be combined into one module, or further split into multiple sub-modules, to complete all or part of the functions described above. The names of the modules and steps involved in the embodiments of the present invention are only for distinguishing the modules or steps and are not to be construed as unduly limiting the present invention.
A storage device according to a third embodiment of the present invention stores a plurality of programs adapted to be loaded and executed by a processor to implement the above-described wave fin propulsion underwater operation robot tracking control method based on reinforcement learning.
A processing apparatus according to a fourth embodiment of the present invention comprises a processor adapted to execute programs and a storage device adapted to store a plurality of programs, the programs being adapted to be loaded and executed by the processor to implement the wave fin propulsion underwater operation robot tracking control method based on reinforcement learning.
It can be clearly understood by those skilled in the art that, for convenience and brevity, the specific working processes and related descriptions of the storage device and the processing device described above may refer to the corresponding processes in the foregoing method examples, and are not described herein again.
Those of skill in the art will appreciate that the various illustrative modules and method steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that programs corresponding to the software modules and method steps may be located in random access memory (RAM), memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. To clearly illustrate this interchangeability of electronic hardware and software, various illustrative components and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as electronic hardware or software depends upon the particular application and the design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
The terms "first," "second," and the like are used for distinguishing between similar elements and not necessarily for describing or implying a particular order or sequence.
The terms "comprises," "comprising," or any other similar term are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.
So far, the technical solutions of the present invention have been described in connection with the preferred embodiments shown in the drawings, but it is easily understood by those skilled in the art that the scope of the present invention is obviously not limited to these specific embodiments. Equivalent changes or substitutions of related technical features can be made by those skilled in the art without departing from the principle of the invention, and the technical scheme after the changes or substitutions can fall into the protection scope of the invention.

Claims (10)

1. A wave fin propulsion underwater operation robot tracking control method based on reinforcement learning, characterized by comprising:
Step S100, obtaining system state information of the underwater operation robot at time t and pose information of the target to be tracked in the robot's body-fixed coordinate system, and constructing the state information s_t of a Markov decision process;
Step S200, based on s_t, obtaining the wave frequency a_t of the wave fins through the Actor network in an Actor-Critic reinforcement learning model;
Step S300, based on a_t, controlling the wave fins of the underwater operation robot, setting t = t + 1, and jumping to step S100;
wherein the Actor-Critic reinforcement learning model comprises an Actor network and a Critic network and is obtained through offline training, the training method comprising:
Step A100, obtaining the system state information of the underwater operation robot at time t in a preset training data set and the pose information of the target to be tracked in the robot's body-fixed coordinate system, and constructing the state information s_t of a Markov decision process;
Step A200, acquiring the supervised training probability and the random supervised probability at time t; if the supervised training probability is greater than the random supervised probability, obtaining the wave frequency a_t of the wave fins through a PID controller based on s_t, otherwise obtaining a_t through the Actor network based on s_t;
Step A300, based on s_t and a_t, obtaining the state-action evaluation value Q*(s_t, a_t) and the reward value r_t through the Critic network in the Actor-Critic reinforcement learning model and a preset reward function, respectively;
Step A400, updating the parameters of the Actor network through a deterministic policy gradient algorithm based on Q*(s_t, a_t), and updating the parameters of the Critic network based on Q*(s_t, a_t) and r_t;
Step A500, setting t = t + 1 and executing steps A100 to A400 in a loop until t is greater than a preset number of training steps, obtaining the trained Actor network.
2. The wave fin propulsion underwater operation robot tracking control method based on reinforcement learning of claim 1, wherein the Actor network comprises four convolutional layers; the number of neurons in the first convolutional layer and the second convolutional layer is 200, the activation function is the Relu6 function, the number of neurons in the third convolutional layer is 10, the activation function is the Relu function, the number of neurons in the fourth convolutional layer is 2, and the activation function is the tanh function.
3. The wave fin propulsion underwater operation robot tracking control method based on reinforcement learning of claim 1, wherein the Critic network comprises five convolutional layers; the numbers of neurons in the first, second and third convolutional layers are 200 with the Relu6 activation function, the number of neurons in the fourth convolutional layer is 10 with the Relu activation function, and the number of neurons in the fifth convolutional layer is 1 with a linear activation function.
4. The wave fin propulsion underwater operation robot tracking control method based on reinforcement learning of claim 3, wherein in step A200 the supervised training probability and the random supervised probability at time t are obtained as:

$$\mathrm{PRO}_t = \mathrm{PRO}_0 \times 0.999^t$$
$$\mathrm{PRO}_r = \max(\mathrm{rand}(1),\ 0.01)$$

where PRO_t is the supervised training probability at time t, PRO_r is the random supervised probability, and PRO_0 is a preset initial supervised training probability.
5. The wave fin propulsion underwater operation robot tracking control method based on reinforcement learning of claim 1, wherein in step A200 "obtaining the wave frequency a_t of the wave fins through a PID controller" proceeds as follows: the wave frequency of the wave fins comprises a left-fin wave frequency and a right-fin wave frequency, calculated as

$$f_{r1} = P_r\,\Delta\psi + I_r \int \Delta\psi\,dt + D_r\,\Delta\dot{\psi}$$
$$f_{r2} = -f_{r1}$$
$$f_{x1} = P_x\,\Delta x + I_x \int \Delta x\,dt + D_x\,\Delta\dot{x}$$
$$f_{x2} = f_{x1}$$
$$f_1 = \rho_{r1} f_{r1} + \rho_{r2} f_{x1}$$
$$f_2 = \rho_{x1} f_{r2} + \rho_{x2} f_{x2}$$

where f_1, f_2 are the final wave frequencies of the left and right fins; f_{r1}, f_{r2} are the wave frequencies for yaw rotation of the left and right fins; f_{x1}, f_{x2} are the wave frequencies for forward motion of the left and right fins; ρ_{r1}, ρ_{r2}, ρ_{x1}, ρ_{x2} are the weight coefficients of the rotation and forward wave frequencies for the left and right fins; P_i, I_i, D_i with i ∈ {r, x} are the PID parameters; Δψ is the relative bearing error and Δψ̇ its differential; Δx is the lateral error of the relative position and Δẋ its differential.
6. The wave fin propulsion underwater operation robot tracking control method based on reinforcement learning of claim 1, wherein in step A200 "obtaining the wave frequency a_t of the wave fins through the Actor network" proceeds as follows: the wave frequency a_t is a wave frequency with superimposed random noise, satisfying

$$a_t \leftarrow \mu(t), \qquad \mu(t) \sim \mathcal{N}(a_t,\ c_t^2)$$
$$c_t = \max(c_0 \times 0.999^t,\ 0.01)$$

where c_0 is a preset initial noise standard deviation, c_t is the noise standard deviation at time t, and μ(t) is random Gaussian noise with mean a_t and standard deviation c_t, 𝒩(·,·) denoting the noise distribution.
7. The wave fin propulsion underwater operation robot tracking control method based on reinforcement learning of claim 1, wherein the preset reward function is:

$$r = r_0 - \rho_1\,\|\Delta\psi\|_2 - \rho_2\,\|\Delta P_o\|_2 - \nu^T \rho_3\,\nu$$

where r is the reward value, r_0 is a preset constant reward, ‖Δψ‖_2 is the 2-norm of the relative orientation error, ‖ΔP_o‖_2 is the 2-norm of the relative position error, ν is the velocity vector, ν^T ρ_3 ν is a regularization term, and ρ_1, ρ_2, ρ_3 are weight coefficients.
8. A wave fin propulsion underwater operation robot tracking control system based on reinforcement learning, characterized by comprising a construction module, a wave frequency acquisition module and a circulation module;
the construction module is configured to obtain system state information of the underwater operation robot at time t and pose information of the target to be tracked in the robot's body-fixed coordinate system, and to construct the state information s_t of a Markov decision process;
the wave frequency acquisition module is configured to obtain, based on s_t, the wave frequency a_t of the wave fins through the Actor network in an Actor-Critic reinforcement learning model;
the circulation module is configured to control the wave fins of the underwater operation robot based on a_t, set t = t + 1, and jump to the construction module.
9. A storage device having a plurality of programs stored therein, wherein the programs are adapted to be loaded and executed by a processor to implement the wave fin propulsion underwater operation robot tracking control method based on reinforcement learning of any one of claims 1 to 7.
10. A processing device comprising a processor adapted to execute programs and a storage device adapted to store a plurality of programs, characterized in that the programs are adapted to be loaded and executed by the processor to implement the wave fin propulsion underwater operation robot tracking control method based on reinforcement learning of any one of claims 1 to 7.
CN201911077089.0A 2019-11-06 2019-11-06 Wave fin propulsion underwater operation robot tracking control method based on reinforcement learning Active CN111079936B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911077089.0A CN111079936B (en) 2019-11-06 2019-11-06 Wave fin propulsion underwater operation robot tracking control method based on reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911077089.0A CN111079936B (en) 2019-11-06 2019-11-06 Wave fin propulsion underwater operation robot tracking control method based on reinforcement learning

Publications (2)

Publication Number Publication Date
CN111079936A true CN111079936A (en) 2020-04-28
CN111079936B CN111079936B (en) 2023-03-14

Family

ID=70310691

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911077089.0A Active CN111079936B (en) 2019-11-06 2019-11-06 Wave fin propulsion underwater operation robot tracking control method based on reinforcement learning

Country Status (1)

Country Link
CN (1) CN111079936B (en)

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2008305064A * 2007-06-06 2008-12-18 Japan Science & Technology Agency Learning-type control device and method thereof
CN105549384A * 2015-09-01 2016-05-04 中国矿业大学 Inverted pendulum control method based on neural networks and reinforcement learning
CN108008627A * 2017-12-13 2018-05-08 中国石油大学(华东) Parallel-optimization reinforcement learning adaptive PID control method
CN109068391A * 2018-09-27 2018-12-21 青岛智能产业技术研究院 Internet-of-Vehicles communication optimization algorithm based on edge computing and the Actor-Critic algorithm
CN109769119A * 2018-12-18 2019-05-17 中国科学院深圳先进技术研究院 Low-complexity video signal coding processing method
CN109760046A * 2018-12-27 2019-05-17 西北工业大学 Reinforcement learning-based motion planning method for a space robot capturing a tumbling target
CN110322017A * 2019-08-13 2019-10-11 吉林大学 Trajectory tracking control strategy for autonomous intelligent vehicles based on deep reinforcement learning
CN110333739A * 2019-08-21 2019-10-15 哈尔滨工程大学 Reinforcement learning-based AUV behavior planning and motion control method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
ZHANG Fashuai et al.: "Navigation control of unmanned surface vehicles based on deep reinforcement learning", Metrology & Measurement Technology *
HAN Xiangmin et al.: "An adaptive cruise control algorithm based on deep reinforcement learning", Computer Engineering *

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111856936A * 2020-07-21 2020-10-30 天津蓝鳍海洋工程有限公司 Control method for a cabled underwater high-flexibility operation platform
CN111856936B * 2020-07-21 2023-06-02 天津蓝鳍海洋工程有限公司 Control method for a cabled underwater high-flexibility operation platform
CN112124537A * 2020-09-23 2020-12-25 哈尔滨工程大学 Intelligent control method for an underwater robot for autonomous suction and capture of benthos
CN112124537B * 2020-09-23 2021-07-13 哈尔滨工程大学 Intelligent control method for an underwater robot for autonomous suction and capture of benthos
CN112597693A * 2020-11-19 2021-04-02 沈阳航盛科技有限责任公司 Adaptive control method based on deep deterministic policy gradient
CN112462792A * 2020-12-09 2021-03-09 哈尔滨工程大学 Underwater robot motion control method based on the Actor-Critic algorithm
CN112819379A * 2021-02-28 2021-05-18 广东电力交易中心有限责任公司 Risk preference information acquisition method and system, electronic device and storage medium
CN114114896A * 2021-11-08 2022-03-01 北京机电工程研究所 PID parameter design method based on path integrals
CN114114896B * 2021-11-08 2024-01-05 北京机电工程研究所 PID parameter design method based on path integrals
CN113977583A * 2021-11-16 2022-01-28 山东大学 Robot rapid assembly method and system based on the proximal policy optimization algorithm
CN114995468A * 2022-06-06 2022-09-02 南通大学 Intelligent control method of an underwater robot based on Bayesian deep reinforcement learning
CN114995468B * 2022-06-06 2023-03-31 南通大学 Intelligent control method of an underwater robot based on Bayesian deep reinforcement learning

Also Published As

Publication number Publication date
CN111079936B (en) 2023-03-14

Similar Documents

Publication Publication Date Title
CN111079936B (en) Wave fin propulsion underwater operation robot tracking control method based on reinforcement learning
CN108803321B (en) Autonomous underwater vehicle trajectory tracking control method based on deep reinforcement learning
Zhang et al. MPC-based 3-D trajectory tracking for an autonomous underwater vehicle with constraints in complex ocean environments
Li et al. Robust time-varying formation control for underactuated autonomous underwater vehicles with disturbances under input saturation
Wang et al. Dynamic tanker steering control using generalized ellipsoidal-basis-function-based fuzzy neural networks
Li et al. Visual servo regulation of wheeled mobile robots with simultaneous depth identification
CN111240345B (en) Underwater robot trajectory tracking method based on double BP network reinforcement learning framework
Precup et al. Grey wolf optimizer-based approaches to path planning and fuzzy logic-based tracking control for mobile robots
CN110928189A (en) Robust control method based on reinforcement learning and Lyapunov function
CN115016496A (en) Water surface unmanned ship path tracking method based on deep reinforcement learning
Wang et al. Adaptive and extendable control of unmanned surface vehicle formations using distributed deep reinforcement learning
Song et al. Guidance and control of autonomous surface underwater vehicles for target tracking in ocean environment by deep reinforcement learning
Khalaji et al. Lyapunov-based formation control of underwater robots
CN114115262B (en) Multi-AUV actuator saturation cooperative formation control system and method based on azimuth information
Fang et al. Autonomous underwater vehicle formation control and obstacle avoidance using multi-agent generative adversarial imitation learning
Zhuang et al. Motion control and collision avoidance algorithms for unmanned surface vehicle swarm in practical maritime environment
Hu et al. Trajectory tracking and re-planning with model predictive control of autonomous underwater vehicles
Meng et al. A Fully-Autonomous Framework of Unmanned Surface Vehicles in Maritime Environments Using Gaussian Process Motion Planning
CN113485323B (en) Flexible formation method for cascading multiple mobile robots
Yao et al. Multi-USV cooperative path planning by window update based self-organizing map and spectral clustering
Wang et al. Data-driven path-following control of underactuated ships based on antenna mutation beetle swarm predictive reinforcement learning
Hadi et al. Adaptive formation motion planning and control of autonomous underwater vehicles using deep reinforcement learning
Song et al. Surface path tracking method of autonomous surface underwater vehicle based on deep reinforcement learning
Cui et al. Anti-disturbance cooperative formation containment control for multiple autonomous underwater vehicles with actuator saturation
Jiang et al. Learning decentralized control policies for multi-robot formation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant