CN111079936B - Wave fin propulsion underwater operation robot tracking control method based on reinforcement learning - Google Patents

Wave fin propulsion underwater operation robot tracking control method based on reinforcement learning

Info

Publication number
CN111079936B
CN111079936B (application number CN201911077089.0A)
Authority
CN
China
Prior art keywords
wave
fin
underwater operation
reinforcement learning
operation robot
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911077089.0A
Other languages
Chinese (zh)
Other versions
CN111079936A (en)
Inventor
王宇
唐冲
王睿
王硕
谭民
马睿宸
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Automation of Chinese Academy of Science
Original Assignee
Institute of Automation of Chinese Academy of Science
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Automation of Chinese Academy of Science
Priority to CN201911077089.0A
Publication of CN111079936A
Application granted
Publication of CN111079936B
Legal status: Active (current)
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 Machine learning
    • B PERFORMING OPERATIONS; TRANSPORTING
    • B63 SHIPS OR OTHER WATERBORNE VESSELS; RELATED EQUIPMENT
    • B63C LAUNCHING, HAULING-OUT, OR DRY-DOCKING OF VESSELS; LIFE-SAVING IN WATER; EQUIPMENT FOR DWELLING OR WORKING UNDER WATER; MEANS FOR SALVAGING OR SEARCHING FOR UNDERWATER OBJECTS
    • B63C 11/00 Equipment for dwelling or working underwater; Means for searching for underwater objects
    • B63C 11/52 Tools specially adapted for working underwater, not otherwise provided for

Abstract

The invention belongs to the field of autonomous control of underwater operation robots, and particularly relates to a reinforcement-learning-based tracking control method, system and device for a wave fin propulsion underwater operation robot, aiming at solving the problem of low target-tracking precision caused by the poor convergence and stability of the Actor network during training. The method comprises: obtaining the system state information of the underwater operation robot at time t and the pose information of the target to be tracked in the body-fixed coordinate system of the underwater operation robot, and constructing the state information s_t of a Markov decision process; based on s_t, obtaining the wave frequency a_t of the wave fins through the Actor network of an Actor-Critic reinforcement learning model; based on a_t, controlling the wave fins of the underwater operation robot, letting t = t + 1, and looping. In the invention, a PID controller supervises the training of the Actor network, which improves the stability and convergence of the network and thereby improves the target-tracking precision.

Description

Wave fin propulsion underwater operation robot tracking control method based on reinforcement learning
Technical Field
The invention belongs to the field of autonomous control of underwater operation robots, and particularly relates to a reinforcement-learning-based tracking control method, system and device for a wave fin propulsion underwater operation robot.
Background
Autonomous control of underwater robots is a hotspot and a difficulty of current research. As mankind moves from marine exploration to marine development, new challenges are posed to the autonomous control and autonomous operation of underwater operation robots. Autonomous operation of an underwater operation robot is of great significance for underwater archaeology, underwater salvage, underwater rescue, underwater engineering and the like. Such a robot can replace a diver or a remotely operated vehicle (ROV), realize long-duration continuous underwater operation, and improve the efficiency of underwater work.
Generally, because of the irregular shape of the underwater operation robot and the complexity of the underwater environment, it is difficult to establish an accurate hydrodynamic model for the robot, so model-based robot control methods adapt poorly. Reinforcement learning gives the action to be executed according to the current state of the system and then transitions to the next state; it is a typical model-free control method and adapts well to complex underwater environments and unknown disturbances. However, the training of reinforcement learning is experience-based learning from data, and successful experience is important for training. In the initial training stage of reinforcement learning, because the output is still poor and successfully explored control behaviours are to some extent accidental, successful experience in the database is insufficient, so the Actor network in the reinforcement learning model converges slowly, learning efficiency is low, and the accuracy of subsequent tracking control is directly affected.
Disclosure of Invention
In order to solve the above problem in the prior art, that is, the problem that the target-tracking accuracy of the conventional reinforcement-learning-based tracking control method is low because of the poor convergence and stability of the Actor network of the reinforcement learning model during training, a first aspect of the present invention provides a tracking control method for a wave fin propulsion underwater operation robot based on reinforcement learning, the method comprising:
Step S100, obtaining the system state information of the underwater operation robot at time t and the pose information of the target to be tracked in the body-fixed coordinate system of the underwater operation robot, and constructing the state information s_t of a Markov decision process;
Step S200, based on s_t, obtaining the wave frequency a_t of the wave fins through the Actor network of an Actor-Critic reinforcement learning model;
Step S300, based on a_t, controlling the wave fins of the underwater operation robot, letting t = t + 1, and returning to step S100;
wherein the Actor-Critic reinforcement learning model comprises an Actor network and a Critic network and is obtained through offline training, the training method comprising:
Step A100, acquiring a training data set, and constructing the state information s_t of the Markov decision process based on the system state information of the underwater operation robot at time t in the training data set and the pose information of the target to be tracked in the body-fixed coordinate system of the underwater operation robot;
Step A200, acquiring the supervised training probability and the random supervision probability at time t; if the supervised training probability is greater than the random supervision probability, obtaining the wave frequency a_t of the wave fins through a PID controller based on s_t; otherwise obtaining the wave frequency a_t of the wave fins through the Actor network based on s_t;
Step A300, based on s_t and a_t, obtaining the state-action evaluation value Q*(s_t, a_t) and the reward value r_t through the Critic network of the Actor-Critic reinforcement learning model and a preset reward function, respectively;
Step A400, updating the parameters of the Actor network through a deterministic policy gradient algorithm based on Q*(s_t, a_t); and updating the parameters of the Critic network based on Q*(s_t, a_t) and r_t;
Step A500, letting t = t + 1 and cyclically executing steps A100-A400 until t is greater than a preset number of training steps, obtaining the trained Actor network.
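As a reading aid, steps A100 to A500 can be summarised in the following Python skeleton. Every callable here is a hypothetical stand-in, not an interface defined by the patent; it is a minimal sketch of the training flow rather than the actual implementation.

```python
import numpy as np

def train(select_action, env_step, replay_push, update_networks,
          total_steps=1000, state_dim=7):
    """Skeleton of training steps A100-A500 (all callables are stand-ins).

    select_action(t, s): wave frequencies from the PID controller or the
        Actor network, chosen by the supervision strategy of step A200.
    env_step(s, a) -> (s_next, r): robot/simulator transition plus the
        reward of the preset reward function (steps A100/A300).
    replay_push(s, a, s_next, r): store the transition in the experience pool.
    update_networks(s, a, r, s_next): Critic and Actor updates (steps A300-A400).
    """
    s = np.zeros(state_dim, dtype=np.float32)   # MDP state, see equation (3) below
    for t in range(total_steps):                # step A500: loop over the budget
        a = select_action(t, s)                 # step A200
        s_next, r = env_step(s, a)              # steps A100/A300
        replay_push(s, a, s_next, r)            # experience replay storage
        update_networks(s, a, r, s_next)        # step A400
        s = s_next

if __name__ == "__main__":
    # Smoke test with dummy stand-ins.
    train(select_action=lambda t, s: np.zeros(2),
          env_step=lambda s, a: (np.zeros(7, dtype=np.float32), 0.0),
          replay_push=lambda *args: None,
          update_networks=lambda *args: None,
          total_steps=10)
```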
In some preferred embodiments, the Actor network comprises four convolutional layers: the first and second convolutional layers have 200 neurons each with the ReLU6 activation function, the third convolutional layer has 10 neurons with the ReLU activation function, and the fourth convolutional layer has 2 neurons with the tanh activation function.
In some preferred embodiments, the Critic network comprises five convolutional layers: the first, second and third convolutional layers have 200 neurons each with the ReLU6 activation function, the fourth convolutional layer has 10 neurons with the ReLU activation function, and the fifth convolutional layer has 1 neuron with a linear activation function.
In some preferred embodiments, in step A200, the supervised training probability and the random supervision probability at time t are obtained as:
PRO_t = PRO_0 * 0.999^t
PRO_r = max(rand(1), 0.01)
where PRO_t is the supervised training probability at time t, PRO_r is the random supervision probability, and PRO_0 is a preset initial supervised training probability.
In some preferred embodiments, in step A200, the wave frequency a_t of the wave fins is obtained by the PID controller as follows: the wave frequency of the wave fins comprises a left-fin wave frequency and a right-fin wave frequency, calculated as:
f_1 = ρ_r1 f_r1 + ρ_r2 f_x1
f_2 = ρ_x1 f_r2 + ρ_x2 f_x2
f_r2 = -f_r1
f_r1 = P_r Δψ + I_r ∫Δψ dt + D_r (dΔψ/dt)
f_x2 = f_x1
f_x1 = P_x Δx + I_x ∫Δx dt + D_x (dΔx/dt)
where f_1, f_2 are the final wave frequencies of the left and right wave fins, f_r1, f_r2 are the yaw-rotation wave frequencies of the left and right wave fins, f_x1, f_x2 are the forward wave frequencies of the left and right wave fins, ρ_r1, ρ_r2, ρ_x1, ρ_x2 are the weighting coefficients of the left and right wave fins for the rotation and forward wave frequencies, P_i, I_i, D_i, i ∈ {r, x} are the PID parameters, Δψ is the relative bearing error, dΔψ/dt is the derivative of the relative bearing error, Δx is the lateral error of the relative position, and dΔx/dt is the derivative of the lateral error of the relative position.
In some preferred embodiments, in step A200, the wave frequency a_t of the wave fins is obtained through the Actor network as follows: the wave frequency a_t is the wave frequency with superimposed random noise and satisfies:
a_t ~ N(μ(t), c_t^2)
c_t = max(c_0 * 0.999^t, 0.01)
where c_0 is a preset initial noise standard deviation, c_t is the noise standard deviation at time t, and a_t is random Gaussian noise with the Actor network output μ(t) as its mean and c_t as its standard deviation.
In some preferred embodiments, the preset reward function is:
r = r_0 + ρ_1 ||Δψ||_2 + ρ_2 ||ΔP_o||_2 + ν^T ρ_3 ν
where r is the reward value, r_0 is a preset constant-value reward, ||Δψ||_2 is the 2-norm of the relative orientation error, ||ΔP_o||_2 is the 2-norm of the relative position error, ν is the velocity vector, ν^T ρ_3 ν is a regularisation term, and ρ_1, ρ_2, ρ_3 are weight coefficients.
A second aspect of the invention provides a wave fin propulsion underwater operation robot tracking control system based on reinforcement learning, which comprises a construction module, a wave-frequency acquisition module and a circulation module;
the construction module is configured to acquire the system state information of the underwater operation robot at time t and the pose information of the target to be tracked in the body-fixed coordinate system of the underwater operation robot, and to construct the state information s_t of a Markov decision process;
the wave-frequency acquisition module is configured to obtain, based on s_t, the wave frequency a_t of the wave fins through the Actor network of the Actor-Critic reinforcement learning model;
the circulation module is configured to control the wave fins of the underwater operation robot based on a_t, let t = t + 1, and jump back to the construction module.
In a third aspect of the present invention, a storage device is provided, in which a plurality of programs are stored, the programs being adapted to be loaded and executed by a processor to implement the above wave fin propulsion underwater operation robot tracking control method based on reinforcement learning.
In a fourth aspect of the present invention, a processing apparatus is provided, which includes a processor and a storage device; the processor is adapted to execute programs; the storage device is adapted to store a plurality of programs; and the programs are adapted to be loaded and executed by the processor to implement the above wave fin propulsion underwater operation robot tracking control method based on reinforcement learning.
The invention has the beneficial effects that:
according to the invention, the PID controller is used for monitoring the Actor network training, so that the stability and convergence of the reinforcement learning model are improved, and the target tracking precision is improved. In the initial stage of reinforcement learning, the fluctuation frequency of the fluctuation fin is generated through the PID controller, a large amount of control experience is generated, effective supervision on an Actor network is achieved, and the fluctuation frequency of the fluctuation fin is generated through the Actor network in the later stage. A large amount of control experience is effectively evaluated according to the Critic network, and then the Actor network is updated through a deterministic strategy gradient algorithm, so that the convergence of the reinforcement learning model is accelerated, and the stability of the reinforcement learning model is improved.
Meanwhile, the invention combines several different strategies to train the Actor network, thereby improving the generalization ability of the reinforcement learning model and improving the target tracking precision.
Drawings
Other features, objects and advantages of the present application will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings.
FIG. 1 is a schematic flow chart of a wave fin propulsion underwater operation robot tracking control method based on reinforcement learning according to an embodiment of the invention;
FIG. 2 is a schematic flow chart of a training method of an Actor-Critic reinforcement learning model according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a frame of a wave fin propulsion underwater operation robot tracking control system based on reinforcement learning according to an embodiment of the invention;
FIG. 4 is an exemplary illustration of target tracking control for a wave fin propulsion underwater operation robot in accordance with one embodiment of the present invention;
FIG. 5 is an exemplary diagram of the PID-controller-based Actor-Critic network update according to an embodiment of the invention;
FIG. 6 is an exemplary diagram of the structures of the Actor network and the Critic network according to an embodiment of the present invention;
FIG. 7 is an exemplary diagram comparing different training strategies according to one embodiment of the invention;
FIG. 8 is an exemplary diagram of a stable tracking of a moving object according to one embodiment of the invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be obtained by a person skilled in the art without making any creative effort based on the embodiments in the present invention, belong to the protection scope of the present invention.
The present application will be described in further detail with reference to the following drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the relevant invention and not restrictive of the invention. It should be noted that, for convenience of description, only the portions related to the related invention are shown in the drawings.
It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict.
The wave fin propulsion underwater operation robot tracking control method based on reinforcement learning according to a first embodiment of the invention, as shown in FIG. 1 and FIG. 2, comprises the following steps:
Step S100, obtaining the system state information of the underwater operation robot at time t and the pose information of the target to be tracked in the body-fixed coordinate system of the underwater operation robot, and constructing the state information s_t of a Markov decision process;
Step S200, based on s_t, obtaining the wave frequency a_t of the wave fins through the Actor network of an Actor-Critic reinforcement learning model;
Step S300, based on a_t, controlling the wave fins of the underwater operation robot, letting t = t + 1, and returning to step S100;
wherein the Actor-Critic reinforcement learning model comprises an Actor network and a Critic network and is obtained through offline training, the training method comprising:
Step A100, acquiring a training data set, and constructing the state information s_t of the Markov decision process based on the system state information of the underwater operation robot at time t in the training data set and the pose information of the target to be tracked in the body-fixed coordinate system of the underwater operation robot;
Step A200, acquiring the supervised training probability and the random supervision probability at time t; if the supervised training probability is greater than the random supervision probability, obtaining the wave frequency a_t of the wave fins through a PID controller based on s_t; otherwise obtaining the wave frequency a_t of the wave fins through the Actor network based on s_t;
Step A300, based on s_t and a_t, obtaining the state-action evaluation value Q*(s_t, a_t) and the reward value r_t through the Critic network of the Actor-Critic reinforcement learning model and a preset reward function, respectively;
Step A400, updating the parameters of the Actor network through a deterministic policy gradient algorithm based on Q*(s_t, a_t); and updating the parameters of the Critic network based on Q*(s_t, a_t) and r_t;
Step A500, letting t = t + 1 and cyclically executing steps A100-A400 until t is greater than a preset number of training steps, obtaining the trained Actor network.
In order to more clearly describe the tracking control method of the wave fin propulsion underwater operation robot based on reinforcement learning, the following describes the steps of an embodiment of the method in detail with reference to the attached drawings.
In the following embodiments, the training method of the Actor-Critic reinforcement learning model is introduced first, and then the reinforcement-learning-based wave fin propulsion underwater operation robot tracking control method, which acquires the wave frequency of the wave fins using the trained Actor network, is described.
1. Training method of the Actor-Critic reinforcement learning model
Step A100, obtaining the system state information of the underwater operation robot at time t in a preset training data set and the pose information of the target to be tracked in the body-fixed coordinate system of the underwater operation robot, and constructing the state information s_t of a Markov decision process.
Since the wave fin propulsion underwater operation robot is self-stable in the roll and pitch directions, the body is assumed to have 4 degrees of freedom, and the motion state of the robot can be defined as χ = [x, y, z, ψ]^T, where x, y, z are the three-dimensional coordinates of the underwater operation robot, ψ is the yaw angle and T denotes the transpose. The corresponding velocity vector is ν = [u, v, w, r_z]^T, where u is the forward velocity, v is the lateral velocity, w is the vertical velocity and r_z is the yaw rate. The dynamic equation of the underwater operation robot system is therefore shown in formula (1):
dχ/dt = f(χ, ν, U, ξ)   (1)
where U is the system control quantity, ξ is an unknown disturbance and dχ/dt is the differential of the system state.
The target object to be tracked is detected and localized through onboard binocular vision, and its visible range in the body-fixed coordinate system is limited. In FIG. 4, the tetrahedron O_c1-ABCD is the effective detection range of the binocular vision, the front cabin and the rear cabin are the front and rear hull sections of the robot, O_B X_B Y_B Z_B is the body-fixed coordinate system, O_c1 X_c1 Y_c1 Z_c1 is the camera coordinate system, and P_c is the central position of the working space. The goal of body control is to keep the target object in the central area of the field of view, within the working space A_1B_1C_1D_1-A_2B_2C_2D_2 shown, so as to facilitate the grasping operation of the robotic arm. Since the operation target is small, its attitude does not need to be considered at present. The tracking control problem of approaching the target object is therefore shown in formula (2):
|g(x_o, y_o, z_o, ψ_o) - g(χ)| < ε   (2)
where x_o, y_o, z_o is the position of the target to be tracked, ψ_o is the bearing of the target to be tracked, and ε is a preset fault-tolerance range.
In this embodiment, the goal of body control is to control the underwater operation robot to track the target object stably, so that the target remains within a specified region relative to the robot. Considering the motion control of the robot in the two-dimensional plane, the pose of the target to be tracked in the body-fixed coordinate system of the robot is P_o = [x_o, y_o, ψ_o]^T, and the central position of the working space is P_c = [x_c, y_c, ψ_c]^T. In addition, u, r_z and g need to be included in the state, where g ∈ {0, 1} indicates whether the target has been approached. The robot generally does not need a side-shifting motion during tracking, so the state information of the Markov decision process (MDP) can be defined as shown in formula (3):
s = [g, x_o - x_c, y_o - y_c, ψ_o - ψ_c, u, w, r_z]^T   (3)
which can be abbreviated as formula (4):
s = [g, ΔX, ΔY, Δψ, u, w, r_z]^T = [g, ΔP_o^T, ν^T]^T   (4)
where ΔX, ΔY and Δψ are the X-axis position difference, the Y-axis position difference and the deflection-angle difference, ΔP_o = [ΔX, ΔY, Δψ]^T is the pose difference, and ν^T = [u, w, r_z] is the velocity vector.
Meanwhile, in order to remove the correlation of the training data in the preset training data set, an experience pool needs to be constructed for experience replay, which improves the training effect. From its current system state information s_t, the underwater operation robot takes action a_t, i.e. the wave frequency of the wave fins, then observes the state information s_{t+1} at the next instant and obtains the reward value r_t at the current instant; the state transition tuple {s_t, a_t, s_{t+1}, r_t} denotes one experience. A certain amount of data is then extracted from the historical experience pool for training.
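A minimal sketch of the state construction of equation (3) and of an experience pool holding the transition tuples {s_t, a_t, s_{t+1}, r_t} follows; class and argument names are illustrative and not taken from the patent.

```python
import random
from collections import deque

import numpy as np

def build_state(g, target_pose, center_pose, u, w, r_z):
    """State of equation (3): s = [g, x_o-x_c, y_o-y_c, psi_o-psi_c, u, w, r_z]^T.

    target_pose, center_pose: (x, y, psi) of the tracked target and of the
    workspace centre, both expressed in the body-fixed frame.
    """
    dx = target_pose[0] - center_pose[0]
    dy = target_pose[1] - center_pose[1]
    dpsi = target_pose[2] - center_pose[2]
    return np.array([g, dx, dy, dpsi, u, w, r_z], dtype=np.float32)

class ReplayBuffer:
    """Experience pool for experience replay of tuples (s_t, a_t, s_{t+1}, r_t)."""

    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)    # oldest experiences are discarded

    def push(self, s, a, s_next, r):
        self.buffer.append((s, a, s_next, r))

    def sample(self, batch_size=32):
        # Caller should ensure len(self.buffer) >= batch_size.
        batch = random.sample(self.buffer, batch_size)
        s, a, s_next, r = map(np.array, zip(*batch))
        return s, a, s_next, r
```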
Step A200, acquiring the supervised training probability and the random supervision probability at time t; if the supervised training probability is greater than the random supervision probability, obtaining the wave frequency a_t of the wave fins through a PID controller based on s_t; otherwise obtaining the wave frequency a_t of the wave fins through the Actor network based on s_t.
The training of reinforcement learning is experience-based learning from data, and successful experience is very important for training the network. In the initial stage of reinforcement-learning training, because the outputs of the evaluation (Critic) network and the policy (Actor) network are still poor and successfully explored control behaviours are to some extent accidental, successful experience in the database is insufficient, so convergence is slow, learning efficiency is low, and training may even diverge. As the old saying goes, the master leads the apprentice to the door, but the practice depends on the apprentice. The invention therefore constructs a supervisory controller to play the role of the master: it plays the main role in the initial stage of reinforcement learning, generates a large amount of successful control experience, and effectively supervises and guides the policy network so that it, too, produces successful control experience, before gradually handing over to the self-learning stage. During self-learning, the supervisory controller interferes with reinforcement learning only with a small probability, perturbing the action generated by the policy network (the wave frequency of the wave fins) and further optimizing the policy network until the student surpasses the master; after that, the supervisory controller no longer supervises.
The supervisory controller adopted here is a PID controller based on state observation, constructed as shown in equations (5)-(10):
f_r1 = P_r Δψ + I_r ∫Δψ dt + D_r (dΔψ/dt)   (5)
f_r2 = -f_r1   (6)
f_x1 = P_x Δx + I_x ∫Δx dt + D_x (dΔx/dt)   (7)
f_x2 = f_x1   (8)
f_1 = ρ_r1 f_r1 + ρ_r2 f_x1   (9)
f_2 = ρ_x1 f_r2 + ρ_x2 f_x2   (10)
where P_i, I_i, D_i, i ∈ {r, x} are the PID parameters, Δψ is the relative bearing error, Δx is the lateral error of the relative position, f_r1, f_r2 are the yaw-rotation wave frequencies of the left and right wave fins, f_x1, f_x2 are the forward wave frequencies of the left and right wave fins, f_1, f_2 are the final wave frequencies of the left and right wave fins, and ρ_r1, ρ_r2, ρ_x1, ρ_x2 are the weighting coefficients of the left and right wave fins for the rotation and forward wave frequencies.
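A sketch of the supervisory PID controller of equations (5)-(10) is given below. The PID laws for f_r1 and f_x1 use the textbook proportional-integral-derivative form assumed in the reconstruction above; the gains, weighting coefficients and sample time are placeholders.

```python
import numpy as np

class WaveFinPID:
    """State-observation-based supervisory PID controller, equations (5)-(10)."""

    def __init__(self, pid_r=(1.0, 0.0, 0.1), pid_x=(1.0, 0.0, 0.1),
                 rho=(0.5, 0.5, 0.5, 0.5), dt=0.1):
        self.pid_r, self.pid_x = pid_r, pid_x   # (P, I, D) for i = r and i = x
        self.rho = rho                          # (rho_r1, rho_r2, rho_x1, rho_x2)
        self.dt = dt
        self.int_psi = self.int_x = 0.0         # integral terms
        self.prev_psi = self.prev_x = 0.0       # previous errors for the derivative

    def step(self, d_psi, d_x):
        """d_psi: relative bearing error; d_x: lateral error of the relative position."""
        self.int_psi += d_psi * self.dt
        self.int_x += d_x * self.dt
        dpsi_dot = (d_psi - self.prev_psi) / self.dt
        dx_dot = (d_x - self.prev_x) / self.dt
        self.prev_psi, self.prev_x = d_psi, d_x

        p_r, i_r, kd_r = self.pid_r
        p_x, i_x, kd_x = self.pid_x
        f_r1 = p_r * d_psi + i_r * self.int_psi + kd_r * dpsi_dot   # eq. (5)
        f_r2 = -f_r1                                                # eq. (6)
        f_x1 = p_x * d_x + i_x * self.int_x + kd_x * dx_dot         # eq. (7)
        f_x2 = f_x1                                                 # eq. (8)

        rho_r1, rho_r2, rho_x1, rho_x2 = self.rho
        f1 = rho_r1 * f_r1 + rho_r2 * f_x1                          # eq. (9)
        f2 = rho_x1 * f_r2 + rho_x2 * f_x2                          # eq. (10)
        return np.array([f1, f2])    # left and right fin wave frequencies
```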
Based on the idea that the master leads the apprentice to the door while mastery depends on the apprentice, the supervision strategy designed by the invention is shown in equations (11), (12) and (13):
PRO_t = PRO_0 * 0.999^t   (11)
PRO_r = max(rand(1), 0.01)   (12)
a_t = { output of the supervising PID controller, if PRO_t > PRO_r; output of the Actor network, otherwise }   (13)
where PRO_0 is a preset initial supervised training probability and PRO_t is the supervised training probability at time t, which decays exponentially as the number of training steps increases. PRO_r is the random supervision probability: when PRO_t is greater than PRO_r, the supervising PID controller outputs the wave frequency of the wave fins; otherwise, the Actor network outputs the wave frequency of the wave fins.
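To see roughly where the supervised phase of equations (11)-(12) hands over to self-learning, the decay schedule can be evaluated directly; this small snippet is purely illustrative and not part of the patent.

```python
pro0 = 1.0   # preset initial supervised training probability PRO_0
for t in (0, 500, 1000, 2000, 4000):
    print(t, round(pro0 * 0.999 ** t, 4))
# prints approximately: 0 1.0, 500 0.6064, 1000 0.3677, 2000 0.1352, 4000 0.0183
```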
A schematic diagram of the training process of the PID-controller-based Actor-Critic reinforcement learning model, i.e. steps A100 to A500, is shown in FIG. 5, where n(a_t, ε) is the random noise added to a_t and z^(-1) is a time delay.
One of the key issues in reinforcement learning is the trade-off between exploration and exploitation. The control strategy output by the Actor network is built on empirical data, and only sufficient data can give the Actor network sufficient generalization ability. In the invention, random noise is superimposed on the output of the Actor network to train its generalization ability, as shown in equations (14) and (15):
c_t = max(c_0 * 0.999^t, 0.01)   (14)
a_t ~ N(μ(t), c_t^2)   (15)
where c_0 is the initial noise standard deviation and c_t is the noise standard deviation at time t, whose value decays exponentially with the number of training steps; the executed action a_t is random Gaussian noise with the Actor network output μ(t) as its mean and c_t as its standard deviation.
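A sketch of the exploration noise of equations (14)-(15): the executed action is a Gaussian sample centred on the Actor output with an exponentially decaying standard deviation. Clipping the result to the tanh output range [-1, 1] is an added assumption, not stated in the patent.

```python
import numpy as np

def noisy_action(actor_output, t, c0=1.0):
    """Superimpose decaying Gaussian exploration noise on the Actor output."""
    c_t = max(c0 * 0.999 ** t, 0.01)                      # equation (14)
    a_t = np.random.normal(loc=actor_output, scale=c_t)   # equation (15)
    return np.clip(a_t, -1.0, 1.0)   # assumption: keep within the tanh range
```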
Step A300, based on s_t and a_t, obtaining the state-action evaluation value Q*(s_t, a_t) and the reward value r_t through the Critic network of the Actor-Critic reinforcement learning model and a preset reward function, respectively.
The Markov decision process of reinforcement learning comprises four parts: a state space S, an action space A, a reward function r: S × A → R and a state-transition probability distribution p. Under the Markov property, the state of the system at the next moment is determined only by the current state and action, i.e. p(x_t | x_{t-1}, a_{t-1}). The cumulative reward function is defined as the discount-factor-weighted combination of future rewards, as shown in equation (16):
R_t = Σ_{i=t}^{N} γ^{i-t} r(s_i, a_i)   (16)
where γ ∈ (0, 1] is the discount factor, R_t is the cumulative reward starting at time t, r(s_i, a_i) is the reward value, i is the time index and N is the total number of time steps.
In the learning process, the network trained by reinforcement learning interacts with the robot: it selects an action a_t ∈ A in the action space, the system then moves from state x_t ∈ X to x_{t+1} ∈ X, and a reward r_t is obtained at time t to evaluate taking a_t in state x_t; the reward function is used to indicate whether the goal has been completed, whether the behaviour is optimal, and so on. The goal of the control problem is therefore to obtain an optimal strategy π* that receives the maximum reward J*, computed as shown in equation (17):
J* = max_{π∈P} J(π) = max_{π∈P} E_π[R]   (17)
where P is the policy space, E_π denotes the expectation under strategy π, the final system control quantity is u = a, J(π) is the reward value corresponding to strategy π, and π is the strategy.
The key to the above problem is therefore how to define the state of the MDP and the single-step reward function. The state information of the Markov decision process is defined in formulas (3) and (4). The reward function comprehensively considers the position error and the deflection-angle error, expects the robot to adjust its heading first and then approach the target, and also considers the power consumption of the robot. Combining these indices, the reward function is defined as shown in formula (18):
r = r_0 + ρ_1 ||Δψ||_2 + ρ_2 ||ΔP_o||_2 + ν^T ρ_3 ν   (18)
where r is the reward value; r_0 is a preset constant-value reward, equal to 1 when the target task is completed and 0 otherwise; ||Δψ||_2 is the 2-norm of the relative orientation error; ||ΔP_o||_2 is the 2-norm of the relative position error; ν^T ρ_3 ν is a regularisation term used to reduce the energy consumption of the system; and ρ_1, ρ_2, ρ_3 are the weight coefficients.
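A sketch of the reward of equation (18) follows. The patent only states that ρ_1, ρ_2, ρ_3 are weight coefficients, so the negative values chosen here (so that smaller errors and lower speeds yield a larger reward) and the diagonal form of ρ_3 are assumptions.

```python
import numpy as np

def reward(done, d_psi, d_p, nu, r0=1.0, rho1=-1.0, rho2=-1.0, rho3=None):
    """r = r0 + rho1*||d_psi||_2 + rho2*||dP_o||_2 + nu^T rho3 nu  (equation (18)).

    done: True once the target task is completed (constant reward r0, else 0).
    d_psi: relative orientation error; d_p: relative position error (vector);
    nu: velocity vector as a numpy array.
    """
    if rho3 is None:
        rho3 = -0.01 * np.eye(len(nu))      # regularisation penalising energy use
    r_const = r0 if done else 0.0
    return (r_const
            + rho1 * np.linalg.norm(np.atleast_1d(d_psi), 2)
            + rho2 * np.linalg.norm(np.atleast_1d(d_p), 2)
            + nu @ rho3 @ nu)
```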
In solving the MDP problem, the classical value function is the state-action value function (also called the Q function), defined as shown in equation (19):
Q^π(s, a) = E_π[ Σ_{i=k}^{K} γ^{i-k} r(s_i, a_i) | s_k = s, a_k = a ]   (19)
where K is the total number of steps and k is the current step number.
Starting from state s and executing an optimal strategy π*, the robot obtains the maximum cumulative reward value Q*(s, a); the optimal value function therefore satisfies the Bellman optimality principle, as shown in equation (20):
Q*(s_t, a_t) = E[ r(s_t, a_t) + γ max_{a_{t+1}} Q*(s_{t+1}, a_{t+1}) ]   (20)
where a_t = π*(s_t). Once the optimal Q* is obtained through iterative reinforcement-learning training, the optimal strategy can be obtained as π*(s) = arg max_a Q*(s, a).
In the present invention, the structures of the Actor network and the Critic network are shown in FIG. 6. The Actor network is shown in FIG. 6(a): hidden layers 1 and 2 have 200 neurons each with the ReLU6 activation function, hidden layer 3 has 10 neurons with the ReLU activation function, and the final output layer has 2 neurons with the tanh activation function, which normalizes the output to [-1, 1]; the hidden layers are convolutional layers. The Critic network is shown in FIG. 6(b); its inputs are the state s and the action a. Hidden layers 0 and 1 have 200 neurons each; the state s and the action a are passed through convolutional layer 0 and convolutional layer 1 respectively, summed and fused, passed through a ReLU6 activation function, and then input to hidden layer 2, which has 200 neurons and the ReLU6 activation function. Hidden layer 3 has 10 neurons with the ReLU activation function, and the final output layer has 1 neuron with a linear activation function that outputs the state-action evaluation value Q. In FIG. 6(b), hidden layers 0, 1, 2, 3 and 4 are the first, second, third, fourth and fifth convolutional layers, respectively.
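A sketch of the two networks of FIG. 6 in PyTorch follows. The patent labels the layers convolutional, but the stated neuron counts read like fully connected layers, so nn.Linear is used here as an assumption; the 7-dimensional state of equation (3) and the two fin wave frequencies fix the input and output sizes.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Actor of FIG. 6(a): 200-200 (ReLU6), 10 (ReLU), 2 (tanh)."""

    def __init__(self, state_dim=7, action_dim=2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 200), nn.ReLU6(),
            nn.Linear(200, 200), nn.ReLU6(),
            nn.Linear(200, 10), nn.ReLU(),
            nn.Linear(10, action_dim), nn.Tanh(),   # output normalised to [-1, 1]
        )

    def forward(self, s):
        return self.net(s)

class Critic(nn.Module):
    """Critic of FIG. 6(b): 200-unit branches for s and a, summed and fused,
    then 200 (ReLU6), 10 (ReLU) and a linear unit producing Q(s, a)."""

    def __init__(self, state_dim=7, action_dim=2):
        super().__init__()
        self.s_branch = nn.Linear(state_dim, 200)   # hidden layer 0
        self.a_branch = nn.Linear(action_dim, 200)  # hidden layer 1
        self.fuse_act = nn.ReLU6()
        self.head = nn.Sequential(
            nn.Linear(200, 200), nn.ReLU6(),        # hidden layer 2
            nn.Linear(200, 10), nn.ReLU(),          # hidden layer 3
            nn.Linear(10, 1),                       # linear output: Q value
        )

    def forward(self, s, a):
        x = self.fuse_act(self.s_branch(s) + self.a_branch(a))
        return self.head(x)
```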
Step A400, updating the parameters of the Actor network through a deterministic policy gradient algorithm based on Q*(s_t, a_t); and updating the parameters of the Critic network based on Q*(s_t, a_t) and r_t.
In the tracking control of the target object by the underwater operation robot, the system state and the control quantity of the robot are continuous variables, and the MDP state is also continuous, so the MDP state space and action space are continuous domains. For reinforcement-learning problems over continuous domains, the policy gradient (PG) method is widely applied; its core idea is to change the parameters of the policy function in the direction that maximizes the reward function so as to improve the control policy. However, this method adopts an independent approximation function, is a stochastic-policy scheme and has low computational efficiency. To solve this problem, a deterministic state-action mapping function a = μ_θ(s) is adopted, i.e. the deterministic policy gradient (DPG) algorithm, where θ is the parameter of the policy function. The cumulative reward is maximized by updating the parameter θ along the positive gradient of the cumulative reward function, as shown in formula (21):
θ_{t+1} = θ_t + α_θ ∇_θ J(μ_θ)   (21)
where α_θ is the weighting coefficient and ∇_θ J(μ_θ) is the gradient variable.
The DPG algorithm derives the gradient ∇_θ J(μ_θ) in the form shown in equation (22):
∇_θ J(μ_θ) ≈ (1/M) Σ_{i=1}^{M} ∇_θ Q^μ(s_i, μ_θ(s_i))   (22)
where M is the number of iterations and Q^μ denotes the Q-value function under the policy function μ(a|s). Applying the chain rule gives equation (23):
∇_θ J(μ_θ) ≈ (1/M) Σ_{i=1}^{M} ∇_θ μ_θ(s_i) ∇_a Q^μ(s_i, a)|_{a=μ_θ(s_i)}   (23)
both the strategy function and the Q value evaluation function have strong nonlinearity, and the adoption of the deep neural network for fitting the nonlinear function is an effective method. In this section, an Actor network and a Critic network are respectively designed to approximate the policy function mu θ (as) and an evaluation function Q W (s, a), the parameters of the two networks are respectively represented by theta and W, and the parameter updating equation is shown as the formula (24) (25) for the Q value function:
δ_t = r_t + γ Q^W(s_{t+1}, a_{t+1}) - Q^W(s_t, a_t)   (24)
W_{t+1} = W_t + α_W δ_t ∇_W Q^W(s_t, a_t)   (25)
where δ_t is the temporal-difference error of the reward and ∇_W Q^W(s_t, a_t) is the gradient of Q with respect to W.
In this embodiment, the training of the Actor network depends on the Critic network, so accurate evaluation is beneficial to training the Actor network. Therefore, during the training of the Actor network and the Critic network, the target Critic network is updated faster than the target Actor network, so that the Actor network is trained and updated against a better evaluation. The asynchronous update strategy of the Actor network and the Critic network is shown in formulas (26) and (27):
θ′ = τθ + (1 - τ)θ′,  if mod(t, FA) = 0   (26)
W′ = τW + (1 - τ)W′,  if mod(t, FQ) = 0   (27)
where θ′ is the target Actor network parameter, W′ is the target Critic network parameter, τ is the update coefficient, FA and FQ are the update periods of the target Actor network and the target Critic network respectively (in general FQ < FA), and mod(·,·) is the remainder function.
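A sketch of the asynchronous target update of equations (26)-(27) for the PyTorch modules above; τ, FA and FQ are placeholder values chosen so that FQ < FA, i.e. the target Critic is refreshed more often than the target Actor.

```python
def soft_update(target, source, tau=0.01):
    """theta' <- tau*theta + (1 - tau)*theta' (same form for W')."""
    for tp, sp in zip(target.parameters(), source.parameters()):
        tp.data.copy_(tau * sp.data + (1.0 - tau) * tp.data)

def maybe_update_targets(t, actor, critic, target_actor, target_critic,
                         fa=100, fq=10, tau=0.01):
    """Apply eq. (26) every FA steps and eq. (27) every FQ steps (FQ < FA)."""
    if t % fq == 0:
        soft_update(target_critic, critic, tau)   # eq. (27)
    if t % fa == 0:
        soft_update(target_actor, actor, tau)     # eq. (26)
```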
Step A500, letting t = t + 1 and cyclically executing steps A100-A400 until t is greater than the preset number of training steps, obtaining the trained Actor network.
In this embodiment, the data in the preset training data set are selected in sequence to train the Actor network.
Combining the experience replay strategy, the PID-controller-based training strategy of the Actor-Critic reinforcement learning model, the asynchronous update strategy of the Actor and Critic networks, and the random-noise strategy, the invention selects five combinations of training strategies for training and testing, as shown in Table 1.
TABLE 1
[Table 1 is reproduced as images in the original publication; it lists the five training-strategy combinations, with a √ marking the strategies selected in each combination.]
In Table 1, the √ symbol represents a selected strategy. Network training was performed with the Actor-Critic reinforcement learning model under the five training strategies of Table 1, with 2000 training rounds, 600 training steps per round, an experience-pool size of 10000, a batch size of 32, a learning rate of 0.001 for both the Actor and Critic networks, a forgetting (discount) factor γ = 0.9, and the initial target position in the body-fixed coordinate system set to p = [-82, 220] mm. The results are shown in FIG. 7: the network trained with strategy 5 obtains the highest cumulative reward and the shortest control steps, which shows that the supervised training strategy proposed in this section can effectively improve the training effect. Adding random noise is beneficial for improving the generalization ability of the model, but strategy 4 shows slightly degraded performance relative to strategy 5. In the middle and later stages of model training, the supervisory controller already behaves, with respect to the Actor network, like random noise, so it can provide the model with sufficient generalization ability, and the additional random noise becomes an uncertain factor that affects the accuracy of the model.
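For reference, the training settings reported above can be collected into a single configuration that could drive the training-loop skeleton given earlier; the field names are illustrative only.

```python
# Training settings reported for the comparison experiment (names illustrative).
TRAIN_CONFIG = {
    "episodes": 2000,            # training rounds
    "steps_per_episode": 600,    # training steps per round
    "replay_capacity": 10000,    # experience-pool size
    "batch_size": 32,
    "actor_lr": 0.001,
    "critic_lr": 0.001,
    "gamma": 0.9,                # discount / forgetting factor
    "init_target_pos_mm": (-82, 220),   # initial target position, body-fixed frame
}
```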
For the underwater operation robot, stably maintaining a relatively fixed position with respect to the operation target is an important precondition for accurate operation. The operation target is detected and localized by binocular vision, converted into the state representation and input to the policy network, which finally generates the wave frequencies that control the robot to track the target. This is shown in FIG. 8, where the underwater view is taken from an underwater camera, the onshore view from an onshore camera, and the left and right views from the onboard binocular left and right cameras.
2. Wave fin propulsion underwater operation robot tracking control method based on reinforcement learning
The invention discloses a wave fin propulsion underwater operation robot tracking control method based on reinforcement learning, which comprises the following steps:
Step S100, obtaining the system state information of the underwater operation robot at time t and the pose information of the target to be tracked in the body-fixed coordinate system of the underwater operation robot, and constructing the state information s_t of a Markov decision process.
The Actor-Critic reinforcement learning model is a dynamic-programming-style solution of a Markov decision process (MDP), so the practical problem first needs to be converted into an MDP problem and then solved by the reinforcement learning method. Therefore, in this embodiment, the state information s_t of the Markov decision process is constructed from the system state information of the underwater operation robot at time t and the pose information of the target to be tracked in the body-fixed coordinate system of the underwater operation robot.
Step S200, based on s_t, obtaining the wave frequency a_t of the wave fins through the Actor network of the Actor-Critic reinforcement learning model.
In this embodiment, based on the constructed state information s_t of the Markov decision process, the wave frequency of the wave fins is obtained through the offline-trained Actor network, and the propulsion of the underwater robot is controlled to track the target.
Step S300, based on a_t, controlling the wave fins of the underwater operation robot, letting t = t + 1, and returning to step S100.
In this embodiment, the wave fins of the underwater operation robot are controlled based on the wave frequency a_t of the wave fins; letting t = t + 1, the tracking of the target continues.
A wave fin propulsion underwater operation robot tracking control system based on reinforcement learning according to a second embodiment of the invention, as shown in FIG. 3, comprises: a construction module 100, a wave-frequency acquisition module 200 and a circulation module 300;
the construction module 100 is configured to acquire the system state information of the underwater operation robot at time t and the pose information of the target to be tracked in the body-fixed coordinate system of the underwater operation robot, and to construct the state information s_t of a Markov decision process;
the wave-frequency acquisition module 200 is configured to obtain, based on s_t, the wave frequency a_t of the wave fins through the Actor network of the Actor-Critic reinforcement learning model;
the circulation module 300 is configured to control the wave fins of the underwater operation robot based on a_t, let t = t + 1, and jump back to the construction module.
It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working process and related description of the system described above may refer to the corresponding process in the foregoing method embodiment, and details are not described herein again.
It should be noted that the wave fin propulsion underwater operation robot tracking control system based on reinforcement learning provided in the above embodiment is illustrated only with the above division of functional modules. In practical applications, the above functions may be allocated to different functional modules as needed; that is, the modules or steps in the embodiments of the present invention may be further decomposed or combined. For example, the modules of the above embodiment may be combined into one module or further split into a plurality of sub-modules to complete all or part of the functions described above. The names of the modules and steps involved in the embodiments of the present invention are only used to distinguish the modules or steps and are not to be construed as unduly limiting the present invention.
A storage device according to a third embodiment of the present invention stores a plurality of programs adapted to be loaded and executed by a processor to implement the above wave fin propulsion underwater operation robot tracking control method based on reinforcement learning.
A processing apparatus according to a fourth embodiment of the present invention includes a processor and a storage device; the processor is adapted to execute programs; the storage device is adapted to store a plurality of programs; and the programs are adapted to be loaded and executed by the processor to implement the above wave fin propulsion underwater operation robot tracking control method based on reinforcement learning.
It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working processes and related descriptions of the storage device and the processing apparatus described above may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
Those of skill in the art will appreciate that the various illustrative modules and method steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or a combination of both, and that programs corresponding to the software modules and method steps may be located in random access memory (RAM), memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. To clearly illustrate this interchangeability of electronic hardware and software, various illustrative components and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as electronic hardware or software depends upon the particular application and the design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
The terms "first," "second," and the like are used for distinguishing between similar elements and not necessarily for describing or implying a particular order or sequence.
The terms "comprises," "comprising," or any other similar term are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.
So far, the technical solutions of the present invention have been described in connection with the preferred embodiments shown in the drawings, but it is easily understood by those skilled in the art that the scope of the present invention is obviously not limited to these specific embodiments. Equivalent changes or substitutions of related technical features can be made by those skilled in the art without departing from the principle of the invention, and the technical scheme after the changes or substitutions can fall into the protection scope of the invention.

Claims (10)

1. A wave fin propulsion underwater operation robot tracking control method based on reinforcement learning, characterized by comprising the following steps:
Step S100, obtaining the system state information of the underwater operation robot at time t and the pose information of the target to be tracked in the body-fixed coordinate system of the underwater operation robot, and constructing the state information s_t of a Markov decision process;
Step S200, based on s_t, obtaining the wave frequency a_t of the wave fins through an Actor network of an Actor-Critic reinforcement learning model;
Step S300, based on a_t, controlling the wave fins of the underwater operation robot, letting t = t + 1, and returning to step S100;
wherein the Actor-Critic reinforcement learning model comprises the Actor network and a Critic network and is obtained through offline training, the training method comprising:
Step A100, obtaining the system state information of the underwater operation robot at time t in a preset training data set and the pose information of the target to be tracked in the body-fixed coordinate system of the underwater operation robot, and constructing the state information s_t of the Markov decision process;
Step A200, acquiring the supervised training probability and the random supervision probability at time t; if the supervised training probability is greater than the random supervision probability, obtaining the wave frequency a_t of the wave fins through a PID controller based on s_t; otherwise obtaining the wave frequency a_t of the wave fins through the Actor network based on s_t;
Step A300, based on s_t and a_t, obtaining the state-action evaluation value Q*(s_t, a_t) and the reward value r_t through the Critic network of the Actor-Critic reinforcement learning model and a preset reward function, respectively;
Step A400, updating the parameters of the Actor network through a deterministic policy gradient algorithm based on Q*(s_t, a_t); and updating the parameters of the Critic network based on Q*(s_t, a_t) and r_t;
Step A500, letting t = t + 1 and cyclically executing steps A100-A400 until t is greater than a preset number of training steps, obtaining the trained Actor network.
2. The wave fin propulsion underwater operation robot tracking control method based on reinforcement learning according to claim 1, wherein the Actor network comprises four convolutional layers: the first and second convolutional layers have 200 neurons each with the ReLU6 activation function, the third convolutional layer has 10 neurons with the ReLU activation function, and the fourth convolutional layer has 2 neurons with the tanh activation function.
3. The wave fin propulsion underwater operation robot tracking control method based on reinforcement learning according to claim 1, wherein the Critic network comprises five convolutional layers: the first, second and third convolutional layers have 200 neurons each with the ReLU6 activation function, the fourth convolutional layer has 10 neurons with the ReLU activation function, and the fifth convolutional layer has 1 neuron with a linear activation function.
4. The wave fin propulsion underwater operation robot tracking control method based on reinforcement learning according to claim 3, wherein in step A200 the supervised training probability and the random supervision probability at time t are obtained as:
PRO_t = PRO_0 * 0.999^t
PRO_r = max(rand(1), 0.01)
where PRO_t is the supervised training probability at time t, PRO_r is the random supervision probability, and PRO_0 is a preset initial supervised training probability.
5. The wave fin propulsion underwater operation robot tracking control method based on reinforcement learning according to claim 1, wherein in step A200 the wave frequency a_t of the wave fins is obtained by the PID controller as follows: the wave frequency of the wave fins comprises a left-fin wave frequency and a right-fin wave frequency, calculated as:
f_1 = ρ_r1 f_r1 + ρ_r2 f_x1
f_2 = ρ_x1 f_r2 + ρ_x2 f_x2
f_r2 = -f_r1
f_r1 = P_r Δψ + I_r ∫Δψ dt + D_r (dΔψ/dt)
f_x2 = f_x1
f_x1 = P_x Δx + I_x ∫Δx dt + D_x (dΔx/dt)
where f_1, f_2 are the final wave frequencies of the left and right wave fins, f_r1, f_r2 are the yaw-rotation wave frequencies of the left and right wave fins, f_x1, f_x2 are the forward wave frequencies of the left and right wave fins, ρ_r1, ρ_r2, ρ_x1, ρ_x2 are the weighting coefficients of the left and right wave fins for the rotation and forward wave frequencies, P_i, I_i, D_i, i ∈ {r, x} are the PID parameters, Δψ is the relative bearing error, dΔψ/dt is the derivative of the relative bearing error, Δx is the lateral error of the relative position, and dΔx/dt is the derivative of the lateral error of the relative position.
6. The wave fin propulsion underwater operation robot tracking control method based on reinforcement learning according to claim 1, wherein in step A200 the wave frequency a_t of the wave fins is obtained through the Actor network as follows: the wave frequency a_t is the wave frequency with superimposed random noise and satisfies:
a_t ~ N(μ(t), c_t^2)
c_t = max(c_0 * 0.999^t, 0.01)
where c_0 is a preset initial noise standard deviation, c_t is the noise standard deviation at time t, and a_t is random Gaussian noise with the Actor network output μ(t) as its mean and c_t as its standard deviation.
7. The wave fin propulsion underwater operation robot tracking control method based on reinforcement learning according to claim 1, wherein the preset reward function is:
r = r_0 + ρ_1 ||Δψ||_2 + ρ_2 ||ΔP_o||_2 + ν^T ρ_3 ν
where r is the reward value, r_0 is a preset constant-value reward, ||Δψ||_2 is the 2-norm of the relative orientation error, ||ΔP_o||_2 is the 2-norm of the relative position error, ν is the velocity vector, ν^T ρ_3 ν is a regularisation term, and ρ_1, ρ_2, ρ_3 are weight coefficients.
8. A wave fin propulsion underwater operation robot tracking control system based on reinforcement learning, characterized by comprising a construction module, a wave-frequency acquisition module and a circulation module;
the construction module is configured to acquire the system state information of the underwater operation robot at time t and the pose information of the target to be tracked in the body-fixed coordinate system of the underwater operation robot, and to construct the state information s_t of a Markov decision process;
the wave-frequency acquisition module is configured to obtain, based on s_t, the wave frequency a_t of the wave fins through the Actor network of the Actor-Critic reinforcement learning model;
the circulation module is configured to control the wave fins of the underwater operation robot based on a_t, let t = t + 1, and jump back to the construction module.
9. A storage device having a plurality of programs stored therein, wherein the programs are adapted to be loaded and executed by a processor to implement the wave fin propulsion underwater operation robot tracking control method based on reinforcement learning according to any one of claims 1 to 7.
10. A processing device comprising a processor and a storage device, the processor being adapted to execute programs and the storage device being adapted to store a plurality of programs, characterized in that the programs are adapted to be loaded and executed by the processor to implement the wave fin propulsion underwater operation robot tracking control method based on reinforcement learning according to any one of claims 1 to 7.
CN201911077089.0A 2019-11-06 2019-11-06 Wave fin propulsion underwater operation robot tracking control method based on reinforcement learning Active CN111079936B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911077089.0A CN111079936B (en) 2019-11-06 2019-11-06 Wave fin propulsion underwater operation robot tracking control method based on reinforcement learning


Publications (2)

Publication Number Publication Date
CN111079936A CN111079936A (en) 2020-04-28
CN111079936B true CN111079936B (en) 2023-03-14

Family

ID=70310691

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911077089.0A Active CN111079936B (en) 2019-11-06 2019-11-06 Wave fin propulsion underwater operation robot tracking control method based on reinforcement learning

Country Status (1)

Country Link
CN (1) CN111079936B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111856936B (en) * 2020-07-21 2023-06-02 天津蓝鳍海洋工程有限公司 Control method for cabled underwater high-flexibility operation platform
CN112124537B (en) * 2020-09-23 2021-07-13 哈尔滨工程大学 Intelligent control method for underwater robot for autonomous absorption and fishing of benthos
CN112597693A (en) * 2020-11-19 2021-04-02 沈阳航盛科技有限责任公司 Self-adaptive control method based on depth deterministic strategy gradient
CN112462792B (en) * 2020-12-09 2022-08-09 哈尔滨工程大学 Actor-Critic algorithm-based underwater robot motion control method
CN112819379A (en) * 2021-02-28 2021-05-18 广东电力交易中心有限责任公司 Risk preference information acquisition method and system, electronic device and storage medium
CN114114896B (en) * 2021-11-08 2024-01-05 北京机电工程研究所 PID parameter design method based on path integration
CN113977583B (en) * 2021-11-16 2023-05-09 山东大学 Robot rapid assembly method and system based on near-end strategy optimization algorithm
CN114995468B (en) * 2022-06-06 2023-03-31 南通大学 Intelligent control method of underwater robot based on Bayesian depth reinforcement learning

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2008305064A (en) * 2007-06-06 2008-12-18 Japan Science & Technology Agency Learning type control device and method thereof
CN105549384A (en) * 2015-09-01 2016-05-04 中国矿业大学 Inverted pendulum control method based on neural network and reinforced learning
CN108008627A (en) * 2017-12-13 2018-05-08 中国石油大学(华东) A kind of reinforcement learning adaptive PID control method of parallel optimization
CN109068391A (en) * 2018-09-27 2018-12-21 青岛智能产业技术研究院 Car networking communication optimization algorithm based on edge calculations and Actor-Critic algorithm
CN109760046A (en) * 2018-12-27 2019-05-17 西北工业大学 Robot for space based on intensified learning captures Tum bling Target motion planning method
CN109769119A (en) * 2018-12-18 2019-05-17 中国科学院深圳先进技术研究院 A kind of low complex degree vision signal code processing method
CN110322017A (en) * 2019-08-13 2019-10-11 吉林大学 Automatic Pilot intelligent vehicle Trajectory Tracking Control strategy based on deeply study
CN110333739A (en) * 2019-08-21 2019-10-15 哈尔滨工程大学 A kind of AUV conduct programming and method of controlling operation based on intensified learning


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
An adaptive cruise control algorithm based on deep reinforcement learning; Han Xiangmin et al.; Computer Engineering; full text *
Navigation control of an unmanned surface vehicle based on deep reinforcement learning; Zhang Fashuai et al.; Metrology & Measurement Technology; full text *

Also Published As

Publication number Publication date
CN111079936A (en) 2020-04-28

Similar Documents

Publication Publication Date Title
CN111079936B (en) Wave fin propulsion underwater operation robot tracking control method based on reinforcement learning
CN108803321B (en) Autonomous underwater vehicle track tracking control method based on deep reinforcement learning
Long et al. Deep-learned collision avoidance policy for distributed multiagent navigation
CN107748566B (en) Underwater autonomous robot fixed depth control method based on reinforcement learning
WO2021103392A1 (en) Confrontation structured control-based bionic robotic fish motion control method and system
Wang et al. Dynamic tanker steering control using generalized ellipsoidal-basis-function-based fuzzy neural networks
CN110928189B (en) Robust control method based on reinforcement learning and Lyapunov function
Li et al. Visual servo regulation of wheeled mobile robots with simultaneous depth identification
CN109655066A (en) One kind being based on the unmanned plane paths planning method of Q (λ) algorithm
CN115016496A (en) Water surface unmanned ship path tracking method based on deep reinforcement learning
Precup et al. Grey wolf optimizer-based approaches to path planning and fuzzy logic-based tracking control for mobile robots
Wang et al. Adaptive and extendable control of unmanned surface vehicle formations using distributed deep reinforcement learning
Scorsoglio et al. Image-based deep reinforcement learning for autonomous lunar landing
Song et al. Guidance and control of autonomous surface underwater vehicles for target tracking in ocean environment by deep reinforcement learning
Khalaji et al. Lyapunov-based formation control of underwater robots
Fang et al. Autonomous underwater vehicle formation control and obstacle avoidance using multi-agent generative adversarial imitation learning
Zhuang et al. Motion control and collision avoidance algorithms for unmanned surface vehicle swarm in practical maritime environment
Meng et al. A Fully-Autonomous Framework of Unmanned Surface Vehicles in Maritime Environments Using Gaussian Process Motion Planning
Yao et al. Multi-USV cooperative path planning by window update based self-organizing map and spectral clustering
Hadi et al. Adaptive formation motion planning and control of autonomous underwater vehicles using deep reinforcement learning
Song et al. Surface path tracking method of autonomous surface underwater vehicle based on deep reinforcement learning
Wang et al. Adversarial deep reinforcement learning based robust depth tracking control for underactuated autonomous underwater vehicle
Jiang et al. Learning decentralized control policies for multi-robot formation
Lee et al. PSO-FastSLAM: An improved FastSLAM framework using particle swarm optimization
CN115755603A (en) Intelligent ash box identification method for ship motion model parameters and ship motion control method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant