CN111079936B - Wave fin propulsion underwater operation robot tracking control method based on reinforcement learning - Google Patents
- Publication number: CN111079936B (application CN201911077089.0A)
- Authority: CN (China)
- Prior art keywords: wave fin, underwater operation robot, reinforcement learning
- Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to its accuracy)
Classifications
- G06N20/00 — Machine learning (G: Physics; G06: Computing, calculating or counting; G06N: Computing arrangements based on specific computational models)
- B63C11/52 — Tools specially adapted for working underwater, not otherwise provided for (B: Performing operations, transporting; B63: Ships or other waterborne vessels, related equipment; B63C11/00: Equipment for dwelling or working underwater, means for searching for underwater objects)
Abstract
The invention belongs to the field of autonomous control of underwater operation robots, and in particular relates to a reinforcement-learning-based tracking control method, system, and device for an underwater operation robot propelled by wave fins, aimed at solving the low target-tracking accuracy caused by the poor convergence and stability of the Actor network during training. The method comprises: obtaining the system state information of the underwater operation robot at time t and the pose of the target to be tracked in the robot's body-fixed coordinate system, and constructing the state information s_t of a Markov decision process; based on s_t, obtaining the wave frequency a_t of the wave fins through the Actor network of an Actor-Critic reinforcement learning model; based on a_t, controlling the wave fins of the robot, setting t = t + 1, and looping. By using a PID controller to supervise the training of the Actor network, the invention improves the stability and convergence of the network and thereby the target-tracking accuracy.
Description
Technical Field
The invention belongs to the field of autonomous control of underwater operation robots, and in particular relates to a reinforcement-learning-based tracking control method, system, and device for an underwater operation robot propelled by wave fins.
Background
Autonomous control of underwater robots is a hotspot and a difficulty of current research. As mankind moves from marine exploration to marine development, the autonomous control and autonomous operation of underwater work robots face new challenges. Autonomous operation of underwater operation robots is of great significance for underwater archaeology, salvage, rescue, engineering, and the like. Such robots can replace divers or remotely operated vehicles (ROVs), enabling long-duration continuous underwater operation and improving its efficiency.
Generally, owing to the irregular shape of the underwater operation robot and the complexity of the underwater environment, it is difficult to establish an accurate hydrodynamic model, so model-based robot control methods adapt poorly. Reinforcement learning, which selects an action from the current system state and then transitions to the next state, is a typical model-free control method with strong adaptability to complex underwater environments and unknown disturbances. However, reinforcement learning trains on experience data, and successful experience is essential to training. In the initial stage of training, the network outputs are poor and successful exploratory control actions are largely accidental, so successful experience in the database is scarce; as a result, the Actor network of the reinforcement learning model converges slowly, learning is inefficient, and the accuracy of subsequent tracking control suffers directly.
Disclosure of Invention
To solve the above problem in the prior art, namely the low target-tracking accuracy of conventional reinforcement-learning-based tracking control caused by the poor convergence and stability of the Actor network during training, a first aspect of the present invention provides a reinforcement-learning-based tracking control method for a wave fin propelled underwater operation robot, the method comprising:
Step S100, obtaining the system state information of the underwater operation robot at time t and the pose of the target to be tracked in the robot's body-fixed coordinate system, and constructing the state information s_t of a Markov decision process;
Step S200, based on s_t, obtaining the wave frequency a_t of the wave fins through the Actor network of an Actor-Critic reinforcement learning model;
Step S300, based on a_t, controlling the wave fins of the underwater operation robot, setting t = t + 1, and returning to step S100;
wherein the Actor-Critic reinforcement learning model comprises an Actor network and a Critic network and is obtained through offline training, the training method comprising:
Step A100, acquiring a training data set; based on the system state information of the underwater operation robot at time t in the training data set and the pose of the target to be tracked in the robot's body-fixed coordinate system, constructing the state information s_t of a Markov decision process;
Step A200, obtaining the supervised training probability and the random supervision probability at time t; if the supervised training probability is greater than the random supervision probability, obtaining the wave frequency a_t of the wave fins from s_t through a PID controller, otherwise obtaining a_t from s_t through the Actor network;
Step A300, based on s_t and a_t, obtaining the state-action evaluation value Q*(s_t, a_t) through the Critic network of the Actor-Critic reinforcement learning model and the reward value r_t through a preset reward function;
Step A400, based on Q*(s_t, a_t), updating the parameters of the Actor network through a deterministic policy gradient algorithm; and based on Q*(s_t, a_t) and r_t, updating the parameters of the Critic network;
Step A500, setting t = t + 1 and repeating steps A100 to A400 until t exceeds the preset number of training steps, obtaining the trained Actor network.
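The supervision gate of step A200 can be sketched as follows. This is a minimal illustration: the `pid` and `actor` callables are placeholders for the PID controller and the Actor network, and the 0.999 decay base follows the supervised-training probability formula PRO_t = PRO_0 × 0.999^t given below:

```python
import random

def supervised_gate(t, pro_0=1.0):
    """True when the PID supervisor should act at training step t (step A200)."""
    pro_t = pro_0 * 0.999 ** t          # supervised training probability, decays with t
    pro_r = max(random.random(), 0.01)  # random supervision probability, floored at 0.01
    return pro_t > pro_r

def choose_action(t, s_t, pid, actor):
    """Pick the wave-fin frequency a_t from the PID supervisor or the Actor network."""
    return pid(s_t) if supervised_gate(t) else actor(s_t)
```

Early in training PRO_t ≈ 1, so the PID supervisor almost always acts; after several thousand steps PRO_t falls below the 0.01 floor of PRO_r and the Actor network always acts.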
In some preferred embodiments, the Actor network comprises four convolutional layers; the number of neurons in the first convolutional layer and the second convolutional layer is 200, the activation function is a Relu6 function, the number of neurons in the third convolutional layer is 10, the activation function is a Relu function, the number of neurons in the fourth convolutional layer is 2, and the activation function is a tanh function.
In some preferred embodiments, the Critic network comprises five convolutional layers, the number of neurons in the first, second and third convolutional layers is 200, the activation function is the Relu6 function, the number of neurons in the fourth convolutional layer is 10, the activation function is the Relu function, the number of neurons in the fifth convolutional layer is 1, and the activation function is a linear activation function.
In some preferred embodiments, in step A200, "obtaining the supervised training probability and the random supervision probability at time t" comprises:

PRO_t = PRO_0 × 0.999^t
PRO_r = max(rand(1), 0.01)

where PRO_t is the supervised training probability at time t, PRO_r is the random supervision probability, and PRO_0 is a preset initial supervised training probability.
In some preferred embodiments, in step A200, the wave frequency a_t of the wave fins is obtained through the PID controller as follows. The wave frequency comprises a left-fin wave frequency and a right-fin wave frequency, calculated as:

f_r1 = P_r·Δψ + I_r·∫Δψ dt + D_r·d(Δψ)/dt
f_x1 = P_x·Δx + I_x·∫Δx dt + D_x·d(Δx)/dt
f_r2 = −f_r1
f_x2 = f_x1
f_1 = ρ_r1·f_r1 + ρ_r2·f_x1
f_2 = ρ_x1·f_r2 + ρ_x2·f_x2

where f_1, f_2 are the final wave frequencies of the left and right fins; f_r1, f_r2 are the yaw-rotation wave frequencies of the left and right fins; f_x1, f_x2 are the forward wave frequencies of the left and right fins; ρ_r1, ρ_r2, ρ_x1, ρ_x2 are the weighting coefficients of the rotation and forward wave frequencies for the left and right fins; P_i, I_i, D_i, i ∈ {r, x}, are the PID parameters; Δψ is the relative bearing error and d(Δψ)/dt its differential; Δx is the lateral error of the relative position and d(Δx)/dt its differential.
In some preferred embodiments, in step A200, the wave frequency a_t of the wave fins is obtained through the Actor network as follows. The applied frequency is the Actor output with random noise superimposed:

a_t^ε ~ N(a_t, c_t²)
c_t = max(c_0 × 0.999^t, 0.01)

where c_0 is a preset initial noise standard deviation, c_t is the noise standard deviation at time t, and a_t^ε is the noise variable: random Gaussian noise with mean a_t and standard deviation c_t.
In some preferred embodiments, the preset reward function is:

r = r_0 − ρ_1·‖Δψ‖₂ − ρ_2·‖ΔP_o‖₂ − ν^T·ρ_3·ν

where r is the reward value, r_0 is a preset constant-value reward, ‖Δψ‖₂ is the 2-norm of the relative bearing error, ‖ΔP_o‖₂ is the 2-norm of the relative position error, ν is the velocity vector, ν^T·ρ_3·ν is a regularization term, and ρ_1, ρ_2, ρ_3 are weight coefficients.
A second aspect of the invention provides a reinforcement-learning-based tracking control system for a wave fin propelled underwater operation robot, comprising a construction module, a wave-frequency acquisition module, and a circulation module;
the construction module is configured to acquire the system state information of the underwater operation robot at time t and the pose of the target to be tracked in the robot's body-fixed coordinate system, and to construct the state information s_t of a Markov decision process;
the wave-frequency acquisition module is configured to obtain, based on s_t, the wave frequency a_t of the wave fins through the Actor network of an Actor-Critic reinforcement learning model;
the circulation module is configured to control the wave fins of the underwater operation robot based on a_t, set t = t + 1, and jump back to the construction module.
In a third aspect of the present invention, a storage device is provided in which a plurality of programs are stored, the programs being adapted to be loaded and executed by a processor to implement the reinforcement-learning-based tracking control method for a wave fin propelled underwater operation robot described above.
In a fourth aspect of the present invention, a processing apparatus is provided, comprising a processor adapted to execute programs and a storage device adapted to store a plurality of programs, the programs being adapted to be loaded and executed by the processor to implement the reinforcement-learning-based tracking control method for a wave fin propelled underwater operation robot described above.
The invention has the beneficial effects that:
according to the invention, the PID controller is used for monitoring the Actor network training, so that the stability and convergence of the reinforcement learning model are improved, and the target tracking precision is improved. In the initial stage of reinforcement learning, the fluctuation frequency of the fluctuation fin is generated through the PID controller, a large amount of control experience is generated, effective supervision on an Actor network is achieved, and the fluctuation frequency of the fluctuation fin is generated through the Actor network in the later stage. A large amount of control experience is effectively evaluated according to the Critic network, and then the Actor network is updated through a deterministic strategy gradient algorithm, so that the convergence of the reinforcement learning model is accelerated, and the stability of the reinforcement learning model is improved.
Meanwhile, the invention combines several different strategies to train the Actor network, thereby improving the generalization ability of the reinforcement learning model and improving the target tracking precision.
Drawings
Other features, objects and advantages of the present application will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings.
FIG. 1 is a schematic flow chart of a wave fin propulsion underwater operation robot tracking control method based on reinforcement learning according to an embodiment of the invention;
FIG. 2 is a schematic flow chart of a training method of an Actor-Critic reinforcement learning model according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a frame of a wave fin propulsion underwater operation robot tracking control system based on reinforcement learning according to an embodiment of the invention;
FIG. 4 is an exemplary diagram of target tracking control of a wave fin propelled underwater operation robot according to one embodiment of the invention;
FIG. 5 is an exemplary diagram of the PID-supervised Actor-Critic network update according to an embodiment of the invention;
FIG. 6 is an exemplary diagram of the structure of the Actor network and the Critic network according to an embodiment of the invention;
FIG. 7 is an exemplary diagram comparing different training strategies according to one embodiment of the invention;
FIG. 8 is an exemplary diagram of a stable tracking of a moving object according to one embodiment of the invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be obtained by a person skilled in the art without making any creative effort based on the embodiments in the present invention, belong to the protection scope of the present invention.
The present application will be described in further detail with reference to the following drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the relevant invention and not restrictive of the invention. It should be noted that, for convenience of description, only the portions related to the related invention are shown in the drawings.
It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict.
The invention discloses a reinforcement-learning-based tracking control method for a wave fin propelled underwater operation robot, which, as shown in FIGS. 1 and 2, comprises the following steps:
Step S100, obtaining the system state information of the underwater operation robot at time t and the pose of the target to be tracked in the robot's body-fixed coordinate system, and constructing the state information s_t of a Markov decision process;
Step S200, based on s_t, obtaining the wave frequency a_t of the wave fins through the Actor network of an Actor-Critic reinforcement learning model;
Step S300, based on a_t, controlling the wave fins of the underwater operation robot, setting t = t + 1, and returning to step S100;
wherein the Actor-Critic reinforcement learning model comprises an Actor network and a Critic network and is obtained through offline training, the training method comprising:
Step A100, acquiring a training data set; based on the system state information of the underwater operation robot at time t in the training data set and the pose of the target to be tracked in the robot's body-fixed coordinate system, constructing the state information s_t of a Markov decision process;
Step A200, obtaining the supervised training probability and the random supervision probability at time t; if the supervised training probability is greater than the random supervision probability, obtaining the wave frequency a_t of the wave fins from s_t through a PID controller, otherwise obtaining a_t from s_t through the Actor network;
Step A300, based on s_t and a_t, obtaining the state-action evaluation value Q*(s_t, a_t) through the Critic network of the Actor-Critic reinforcement learning model and the reward value r_t through a preset reward function;
Step A400, based on Q*(s_t, a_t), updating the parameters of the Actor network through a deterministic policy gradient algorithm; and based on Q*(s_t, a_t) and r_t, updating the parameters of the Critic network;
Step A500, setting t = t + 1 and repeating steps A100 to A400 until t exceeds the preset number of training steps, obtaining the trained Actor network.
In order to more clearly describe the tracking control method of the wave fin propulsion underwater operation robot based on reinforcement learning, the following describes the steps of an embodiment of the method in detail with reference to the attached drawings.
In the following embodiments, the training method of the Actor-Critic reinforcement learning model is introduced first, and then the reinforcement-learning-based tracking control method, which uses the trained Actor network to acquire the wave frequency of the wave fins, is described.
1. Training method of the Actor-Critic reinforcement learning model
Step A100, obtaining the system state information of the underwater operation robot at time t from a preset training data set and the pose of the target to be tracked in the robot's body-fixed coordinate system, and constructing the state information s_t of a Markov decision process.
Since a wave fin propelled underwater operation robot is self-stable in the roll and pitch directions, the body is assumed to have 4 degrees of freedom, and its motion state can be defined as χ = [x, y, z, ψ]^T, where x, y, z are the three-dimensional coordinates of the robot, ψ is the yaw angle, and T denotes transposition. The corresponding velocity vector is ν = [u, v, w, r_z]^T, where u is the forward speed, v the lateral speed, w the vertical speed, and r_z the yaw rate. The dynamic equation of the underwater operation robot system is therefore as shown in formula (1):

dχ/dt = f(χ, ν, U) + ξ   (1)

where U is the system control quantity, ξ is the unknown disturbance, and dχ/dt is the system state differential.
The target object to be tracked is detected and localized through onboard binocular vision, whose visible range is limited in the body-fixed coordinate system. In FIG. 4, the tetrahedron O_c1-ABCD is the effective detection range of the binocular vision; the front and rear cabins are the front and rear hull bodies of the machine; O_B-X_B Y_B Z_B is the body-fixed coordinate system; O_c1-X_c1 Y_c1 Z_c1 is the camera coordinate system; and P_c is the center position of the working space. The goal of body control is to keep the target object in the central area of the field of view, within the working space A_1B_1C_1D_1-A_2B_2C_2D_2 shown, so as to facilitate the grasping operation of the robotic arm. Because the operation target is small, its attitude does not need to be considered. The tracking control problem of approaching the target object is therefore as shown in formula (2):

|g(x_o, y_o, z_o, ψ_o) − g(χ)| < ε   (2)

where x_o, y_o, z_o is the position of the target to be tracked, ψ_o is the bearing of the target to be tracked, and ε is a preset fault-tolerance range.
In this embodiment, the object of body control is to drive the underwater operation robot to track the target object stably, so that the target settles within a specified region relative to the robot. Considering motion control of the robot in the two-dimensional plane, the pose of the target to be tracked in the robot's body-fixed coordinate system is P_o = [x_o, y_o, ψ_o]^T, and the center position of the working space is P_c = [x_c, y_c, ψ_c]^T. In addition, the state must include u, r_z, and g, where g ∈ {0, 1} indicates whether the target has been approached. Since the robot generally does not need lateral motion, the state information of the Markov decision process (MDP) can be defined as shown in formula (3):

s = [g, x_o − x_c, y_o − y_c, ψ_o − ψ_c, u, w, r_z]^T   (3)

which can be abbreviated as formula (4):

s = [g, ΔP^T, ν^T]^T,  ΔP = [ΔX, ΔY, Δψ]^T   (4)

where ΔX, ΔY, Δψ are the X-axis position difference, Y-axis position difference, and yaw-angle difference, ΔP is the position difference, and ν is the velocity vector.
Meanwhile, to eliminate the correlation of the training data in the preset training data set, an experience pool must be constructed for experience replay, which improves the training effect. From the current system state information s_t, the underwater operation robot takes action a_t (the wave frequency of the wave fins), then observes the state information s_{t+1} at the next instant and obtains the reward value r_t at the current instant; the state-transition tuple {s_t, a_t, s_{t+1}, r_t} denotes one experience. A certain amount of data is then drawn from this historical experience pool for training.
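The experience pool described above can be sketched as a simple replay buffer; the capacity and the sampling API here are illustrative choices, not specified by the patent:

```python
import random
from collections import deque

class ReplayBuffer:
    """Stores state-transition tuples {s_t, a_t, s_t1, r_t} and samples
    minibatches for training (experience replay)."""
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)  # oldest experiences are discarded first

    def store(self, s_t, a_t, s_t1, r_t):
        self.buffer.append((s_t, a_t, s_t1, r_t))

    def sample(self, batch_size):
        # Random sampling breaks the temporal correlation of consecutive steps
        return random.sample(list(self.buffer), min(batch_size, len(self.buffer)))
```

A bounded deque keeps the pool at a fixed size, and uniform random sampling decorrelates the minibatch from the trajectory order.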
Step A200, obtaining the supervised training probability and the random supervision probability at time t; if the supervised training probability is greater than the random supervision probability, obtaining the wave frequency a_t of the wave fins from s_t through a PID controller, otherwise obtaining a_t from s_t through the Actor network.

Reinforcement learning trains on experience data, and successful experience is essential to network training. In the initial stage of training, the outputs of the evaluation (Critic) network and the policy (Actor) network are poor and successful exploratory control actions are largely accidental, so successful experience in the database is scarce; convergence is slow, learning is inefficient, and training may even diverge. As the old saying goes, the master leads the apprentice through the door, but mastery depends on the apprentice. The invention constructs a supervisory controller to play the role of the master: it dominates in the initial stage of reinforcement learning, generating a large amount of successful control experience and effectively supervising and guiding the policy network so that it too produces successful control experience, before gradually handing over to a self-learning stage. During self-learning, the supervisory controller interferes with reinforcement learning only with small probability, perturbing the actions (the wave-fin frequencies) generated by the policy network and further refining it, until the apprentice surpasses the master and supervision ceases.
The supervisory controller adopted here is a PID controller based on state observation, constructed as shown in equations (5)-(10):

f_r1 = P_r·Δψ + I_r·∫Δψ dt + D_r·d(Δψ)/dt   (5)
f_r2 = −f_r1   (6)
f_x1 = P_x·Δx + I_x·∫Δx dt + D_x·d(Δx)/dt   (7)
f_x2 = f_x1   (8)
f_1 = ρ_r1·f_r1 + ρ_r2·f_x1   (9)
f_2 = ρ_x1·f_r2 + ρ_x2·f_x2   (10)

where P_i, I_i, D_i, i ∈ {r, x}, are the PID parameters; f_r1, f_r2 are the yaw-rotation wave frequencies of the left and right fins; f_x1, f_x2 are the forward wave frequencies of the left and right fins; f_1, f_2 are the final wave frequencies of the left and right fins; and ρ_r1, ρ_r2, ρ_x1, ρ_x2 are the weighting coefficients of the rotation and forward wave frequencies for the left and right fins.
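The supervisor's fin-frequency mixing can be sketched as below. The integral terms are dropped here for brevity (a PD simplification of the PID law), and all gains and weighting coefficients are illustrative values, not taken from the patent:

```python
def supervisor_frequencies(d_psi, d_psi_dot, d_x, d_x_dot,
                           P_r=1.0, D_r=0.1, P_x=1.0, D_x=0.1,
                           rho=(0.5, 0.5, 0.5, 0.5)):
    """PD approximation of the PID supervisor: map the bearing error (d_psi)
    and lateral error (d_x) to the left/right wave-fin frequencies f1, f2."""
    f_r1 = P_r * d_psi + D_r * d_psi_dot  # yaw-rotation component, left fin
    f_x1 = P_x * d_x + D_x * d_x_dot      # forward component, left fin
    f_r2 = -f_r1                          # fins oppose each other for yaw
    f_x2 = f_x1                           # fins agree for forward motion
    rho_r1, rho_r2, rho_x1, rho_x2 = rho
    f1 = rho_r1 * f_r1 + rho_r2 * f_x1    # final left-fin frequency
    f2 = rho_x1 * f_r2 + rho_x2 * f_x2    # final right-fin frequency
    return f1, f2
```

With zero bearing error the two fins receive the same frequency (pure forward motion); a nonzero Δψ drives them apart, turning the robot toward the target.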
Following the idea that the master leads the apprentice through the door but mastery depends on the apprentice, the supervision strategy designed by the invention is as shown in equations (11)-(13):

PRO_t = PRO_0 × 0.999^t   (11)
PRO_r = max(rand(1), 0.01)   (12)
a_t = PID(s_t) if PRO_t > PRO_r, otherwise a_t = Actor(s_t)   (13)

where PRO_0 is the preset initial supervised training probability and PRO_t is the supervised training probability at time t, which decays exponentially as the number of training steps increases. PRO_r is the random supervision probability; when its value is greater than the supervised training probability PRO_t at time t, the wave frequency of the wave fins is output by the Actor network, otherwise it is output by the supervising PID controller.
A schematic diagram of the training process of the PID-supervised Actor-Critic reinforcement learning model, i.e., steps A100-A500, is shown in FIG. 5, where n(a_t, ε) is the random noise added to a_t and z^{-1} is a time delay.
One of the key trade-offs in reinforcement learning is between exploration and exploitation. The control strategy output by the Actor network is built on empirical data, and only sufficient data can give the Actor network sufficient generalization capability. The invention therefore superimposes random noise on the Actor network output to train its generalization capability, as shown in equations (14) and (15):

c_t = max(c_0 × 0.999^t, 0.01)   (14)
a_t^ε ~ N(a_t, c_t²)   (15)

where c_0 is the initial noise standard deviation and c_t is the noise standard deviation at time t, whose value decays exponentially with the number of training steps; a_t^ε is the noise variable: random Gaussian noise with mean a_t and standard deviation c_t.
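Equations (14)-(15) can be sketched as follows; clipping the noisy action to the Actor's tanh output range [-1, 1] is our assumption:

```python
import numpy as np

def noise_std(t, c_0=1.0):
    """Noise standard deviation c_t = max(c_0 * 0.999^t, 0.01) (eq. 14)."""
    return max(c_0 * 0.999 ** t, 0.01)

def noisy_action(a_t, t, c_0=1.0, rng=None):
    """Superimpose exploration noise on the Actor output a_t (eq. 15),
    keeping the result inside the tanh range [-1, 1]."""
    rng = rng if rng is not None else np.random.default_rng()
    noise = rng.normal(0.0, noise_std(t, c_0), size=np.shape(a_t))
    return np.clip(np.asarray(a_t, dtype=float) + noise, -1.0, 1.0)
```

The 0.01 floor keeps a small amount of exploration alive even late in training, when the exponential decay has otherwise driven the noise to zero.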
Step A300, based on s_t and a_t, obtaining the state-action evaluation value Q*(s_t, a_t) through the Critic network of the Actor-Critic reinforcement learning model and the reward value r_t through a preset reward function.
The Markov decision process of reinforcement learning comprises four parts: a state space S, an action space A, a reward function S × A → R(s, a), and a state-transition probability distribution p. Under the Markov property, the state of the system at the next instant is determined only by the current state and action, i.e., p(x_t | x_{t-1}, a_{t-1}). The cumulative reward function is defined as the discount-weighted combination of future rewards, as shown in equation (16):

R_t = Σ_{i=t}^{N} γ^{i−t} · r(s_i, a_i)   (16)

where γ ∈ (0, 1] is the discount factor, R_t is the cumulative reward starting at time t, r(s_i, a_i) is the reward value, i is the index, and N is the total number of time steps.
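Equation (16) is the standard discounted return; a direct sketch, accumulating from the final step so that each reward is discounted exactly once:

```python
def discounted_return(rewards, gamma=0.99):
    """R_t = sum_{i=t}^{N} gamma^(i-t) * r_i, evaluated at t = 0 (eq. 16)."""
    R = 0.0
    for r in reversed(rewards):  # backward pass: R <- r_i + gamma * R
        R = r + gamma * R
    return R
```

The backward recursion R_i = r_i + γ·R_{i+1} is both numerically simpler and O(N), avoiding the explicit powers of γ.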
During learning, the network trained by reinforcement learning interacts with the robot: it selects a_t ∈ A in the action space, the system then moves from state x_t ∈ X to x_{t+1} ∈ X, and at time t a reward r_t is obtained that evaluates taking a_t in state x_t; the reward function guides whether the goal has been completed, whether the behavior is optimal, and so on. The goal of the control problem is thus to obtain the optimal strategy π* that receives the maximum reward J*, computed as shown in equation (17):

J* = max_{π ∈ P} J(π),  J(π) = E_π[R_t]   (17)

where P is the policy space, E_π denotes the expectation under strategy π, J(π) is the reward value corresponding to strategy π, and the final system control quantity is U = a.
The key to the above problem is therefore how to define the state of the MDP process and the single step reward function. Defining the state information MDP of the Markov decision process as a formula (3) (4), wherein a reward function comprehensively considers position error and deflection angle error, expects the robot to adjust the movement direction preferentially and then approach to a target, and considers the power consumption problem of the robot, and the reward function is defined by combining the indexes and is shown as a formula (18):
r = r_0 − ρ_1 ‖Δψ‖² − ρ_2 ‖ΔP_o‖² − ν^T ρ_3 ν    (18)
where r is the reward value, r_0 is a preset constant reward equal to 1 when the target task is completed and 0 otherwise, ‖Δψ‖_2 is the 2-norm of the relative orientation error, ‖ΔP_o‖_2 is the 2-norm of the relative position error, ν^T ρ_3 ν is a regularization term on the velocity ν used to reduce the energy consumption of the system, and ρ_1, ρ_2, ρ_3 are weight coefficients.
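A minimal sketch of the single-step reward of equation (18); the weight values and the scalar treatment of ρ_3 are illustrative assumptions, not the patent's tuned parameters:

```python
import numpy as np

def step_reward(dpsi, dP, v, task_done=False, rho1=0.1, rho2=0.01, rho3=0.001):
    """r = r0 - rho1*||dpsi||^2 - rho2*||dP||^2 - v^T*rho3*v (equation (18)),
    with r0 = 1 when the target task is completed and 0 otherwise."""
    r0 = 1.0 if task_done else 0.0
    dpsi = np.atleast_1d(np.asarray(dpsi, dtype=float))
    dP = np.atleast_1d(np.asarray(dP, dtype=float))
    v = np.atleast_1d(np.asarray(v, dtype=float))
    return r0 - rho1 * dpsi @ dpsi - rho2 * dP @ dP - rho3 * v @ v
```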
In solving the MDP problem, the classical value function is the state-action value function (also called the Q function), defined as shown in equation (19):

Q^π(s, a) = E_π[ Σ_{k=0}^{K} γ^k r(s_{t+k}, a_{t+k}) | s_t = s, a_t = a ]    (19)

where K is the total number of steps and k is the current step index.
Starting from state s and executing the optimal strategy π*, the robot achieves the maximum cumulative reward, represented by Q*(s, a); the optimal value function therefore satisfies the Bellman optimality principle, as shown in equation (20):

Q*(s_t, a_t) = E[ r_t + γ max_{a_{t+1}} Q*(s_{t+1}, a_{t+1}) ]    (20)
where a_t = π*(s_t). Once the optimal Q* is obtained through reinforcement-learning iterative training, the optimal strategy follows as π*(s) = argmax_a Q*(s, a).
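The Bellman backup of equation (20) and the greedy policy extraction π*(s) = argmax_a Q*(s, a) can be illustrated on a toy deterministic two-state MDP (purely illustrative numbers; the patent's state and action spaces are continuous):

```python
import numpy as np

# Hypothetical toy MDP: 2 states, 2 actions, deterministic transitions.
next_state = [[0, 1], [0, 1]]          # next_state[s][a]
rewards = [[0.0, 1.0], [0.0, 0.0]]     # r(s, a)
gamma = 0.9

Q = np.zeros((2, 2))
for _ in range(300):
    # Bellman optimality backup: Q*(s,a) = r(s,a) + gamma * max_a' Q*(s',a')
    Q = np.array([[rewards[s][a] + gamma * Q[next_state[s][a]].max()
                   for a in range(2)] for s in range(2)])

policy = Q.argmax(axis=1)  # greedy policy: pi*(s) = argmax_a Q*(s,a)
```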
In the present invention, the structures of the Actor network and the Critic network are shown in fig. 6. The Actor network is shown in fig. 6 (a): hidden layers 1 and 2 each have 200 neurons with the Relu6 activation function; hidden layer 3 has 10 neurons with the Relu activation function; the final output layer has 2 neurons and uses the tanh activation function to normalize the output to [−1, 1]; each hidden layer is a convolutional layer. The Critic network is shown in fig. 6 (b); its inputs are the state s and the action a. Hidden layers 0 and 1 each have 200 neurons. The state s and the action a pass through convolutional layer 0 and convolutional layer 1 respectively, are summed and fused, pass through a Relu6 activation function, and are then fed into hidden layer 2, which has 200 neurons and the Relu6 activation function. Hidden layer 3 has 10 neurons with the Relu activation function, and the final output layer has 1 neuron with a linear activation function that outputs the state-action evaluation value Q. In fig. 6 (b), hidden layers 0, 1, 2, 3, 4 are the first through fifth convolutional layers, respectively.
Step A400: based on Q*(s_t, a_t), update the parameters of the Actor network through a deterministic policy gradient algorithm; and based on Q*(s_t, a_t) and r_t, update the parameters of the Critic network.
In the tracking control of the underwater operation robot with respect to the target object, the system state and control quantity of the robot are continuous variables, and the MDP state is likewise continuous; the MDP state space and action space are therefore continuous domains. For reinforcement learning on continuous domains, the policy gradient (PG) method is widely applied; its core idea is to change the parameters of the policy function in the direction that maximizes the reward function so as to improve the performance of the control policy. However, PG adopts an independent approximation function and is a stochastic-policy scheme with low computational efficiency. To solve this problem, a deterministic state-action mapping function a = μ_θ(s) is used, i.e. the deterministic policy gradient (DPG) algorithm, where θ is the parameter of the policy function. The cumulative reward is maximized by updating θ along the positive gradient of the cumulative reward function; the update process is shown in equation (21):

θ_{m+1} = θ_m + α_θ ∇_θ J(μ_θ),  m = 0, 1, …, M    (21)
wherein M is the number of iterations.
Taking the gradient of the cumulative reward with respect to θ gives equation (22):

∇_θ J(μ_θ) = E_s[ ∇_θ Q^μ(s, μ_θ(s)) ]    (22)

where Q^μ denotes the Q-value function under the policy function μ_θ(s); applying the chain rule then yields equation (23):

∇_θ J(μ_θ) = E_s[ ∇_θ μ_θ(s) · ∇_a Q^μ(s, a)|_{a=μ_θ(s)} ]    (23)
Both the policy function and the Q-value evaluation function are strongly nonlinear, and fitting such nonlinear functions with deep neural networks is an effective method. In this section, an Actor network and a Critic network are designed to approximate the policy function μ_θ(s) and the evaluation function Q_W(s, a), respectively; the parameters of the two networks are denoted θ and W. For the Q-value function, the parameter update equations are shown in equations (24) and (25):
δ_t = r_t + γ Q_W(s_{t+1}, a_{t+1}) − Q_W(s_t, a_t)    (24)
W_{t+1} = W_t + α_W δ_t ∇_W Q_W(s_t, a_t)    (25)
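Equations (24)-(25) made concrete for a critic that is linear in a feature vector φ(s, a) (a deliberate simplification of the patent's deep Critic, shown only to make the update rule explicit):

```python
import numpy as np

def td_update(W, phi_t, phi_next, r_t, gamma=0.9, alpha=0.01):
    """One TD step for a linear critic Q_W(s,a) = W . phi(s,a):
    delta_t = r_t + gamma*Q_W(s',a') - Q_W(s,a)   (24)
    W      <- W + alpha * delta_t * grad_W Q_W    (25), with grad_W Q_W = phi_t
    """
    delta = r_t + gamma * W @ phi_next - W @ phi_t
    return W + alpha * delta * phi_t
```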
In this embodiment, the training of the Actor network depends on the Critic network, so accurate evaluation benefits the training of the Actor network. Therefore, in the training process of the Actor and Critic networks, the target Critic network is updated faster than the target Actor network, so that the Actor network is trained and updated against a more accurate evaluation. The asynchronous update strategy of the Actor and Critic networks is shown in equations (26) and (27):
θ′ = τθ + (1 − τ)θ′,  if mod(t, FA) = 0    (26)
W′ = τW + (1 − τ)W′,  if mod(t, FQ) = 0    (27)
where θ′ is the target Actor network parameter, W′ is the target Critic network parameter, τ is the update coefficient, FA and FQ are the update periods of the target Actor network and the target Critic network, respectively (in general, FQ < FA), and mod(·, ·) is the remainder function.
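A sketch of the asynchronous soft updates (26)-(27); network parameters are represented as plain lists of floats purely for clarity:

```python
def soft_update(target, online, tau=0.01):
    """Polyak averaging: theta' <- tau*theta + (1 - tau)*theta'."""
    return [tau * w + (1.0 - tau) * wt for w, wt in zip(online, target)]

def periodic_soft_update(t, target, online, tau, period):
    """Apply equation (26)/(27) only when mod(t, period) == 0; choosing a
    smaller period for the Critic (FQ < FA) makes its target update faster."""
    if t % period == 0:
        return soft_update(target, online, tau)
    return target
```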
Step A500: let t = t + 1 and cyclically execute steps A100 to A400 until t exceeds the preset number of training iterations, obtaining the trained Actor network.
In this embodiment, data in a preset training data set is sequentially selected to train the Actor network.
Combining the Experience Replay strategy, the PID-controller-based supervised training strategy for the Actor-Critic reinforcement learning model, the asynchronous Actor/Critic network update strategy, and the random-noise strategy, the present invention selects five combined training strategies for training and testing, as shown in Table 1.
TABLE 1
In Table 1, the √ symbol denotes a selected strategy. Network training is performed based on the Actor-Critic reinforcement learning model and the 5 strategy combinations selected in Table 1, with 2000 training rounds, 600 training steps per round, an experience buffer of size 10000, a batch size of 32, a learning rate of 0.001 for both the Actor and Critic networks, a discount factor γ = 0.9, and the initial target position in the body-fixed coordinate system set to p = [−82, 220] mm. As shown in fig. 7, the network trained with strategy 5 attains the highest cumulative reward and the shortest control step, which shows that the supervised training strategy proposed in this section can effectively improve the training effect. Adding random noise helps improve the generalization ability of the model, yet strategy 4 shows slightly degraded performance relative to strategy 5: in the middle and later stages of model training, the supervisory controller acts, for the Actor network, like random noise, so while it still provides the model with sufficient generalization ability, the random noise becomes an uncertainty factor limiting the accuracy of the model.
Stably maintaining a relatively fixed position with respect to the operation target is an important premise for the underwater operation robot to realize accurate operation. The operation target is detected and localized through binocular vision, converted into the MDP state and input into the policy network, which finally generates the fluctuation frequencies that control the robot to track the target. As shown in fig. 8, the underwater view is taken by an underwater camera, the onshore view by an onshore camera, and the left and right views by the onboard binocular left and right cameras.
2. Wave fin propulsion underwater operation robot tracking control method based on reinforcement learning
The invention discloses a wave fin propulsion underwater operation robot tracking control method based on reinforcement learning, which comprises the following steps:
s100, obtaining system state information of the underwater operation robot at the time t and pose information of a target to be tracked under an underwater operation robot random coordinate system, and constructing state information S of a Markov decision process t 。
The Actor-Critic reinforcement learning model is a dynamic-programming solution process for a Markov decision process (MDP), so the practical problem must first be converted into an MDP problem and then solved by a reinforcement learning method. Therefore, in this embodiment, the state information s_t of the Markov decision process is constructed from the system state information of the underwater operation robot at time t and the pose information of the target to be tracked in the body-fixed coordinate system of the robot.
Step S200: based on s_t, obtain the fluctuation frequency a_t of the wave fins through the Actor network in the Actor-Critic reinforcement learning model.
In this embodiment, based on the constructed state information s_t of the Markov decision process, the fluctuation frequency of the wave fins is obtained through the offline-trained Actor network to control the propulsion and tracking of the underwater robot.
Step S300: based on a_t, control the wave fins of the underwater operation robot, let t = t + 1, and jump back to step S100.
In this embodiment, the wave fins of the underwater operation robot are controlled based on the fluctuation frequency of the wave fins, t = t + 1 is set, and tracking of the target continues.
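Steps S100 to S300 form a closed tracking loop; the following sketch shows its shape (the function names `get_state`, `actor`, and `send_fin_command` are hypothetical stand-ins for the perception, policy, and actuation interfaces, which the patent does not name):

```python
def tracking_loop(get_state, actor, send_fin_command, n_steps=600):
    """S100: build the MDP state s_t; S200: query the Actor for fin
    frequencies a_t; S300: drive the wave fins, advance t, and repeat."""
    history = []
    for t in range(n_steps):
        s_t = get_state(t)        # S100: robot state + target pose in body frame
        a_t = actor(s_t)          # S200: Actor network output
        send_fin_command(a_t)     # S300: apply fluctuation frequencies
        history.append((s_t, a_t))
    return history
```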
A wave fin propulsion underwater operation robot tracking control system based on reinforcement learning according to a second embodiment of the invention, as shown in fig. 3, includes a construction module 100, a fluctuation-frequency acquisition module 200, and a circulation module 300;
the construction module 100 is configured to acquire the system state information of the underwater operation robot at time t and the pose information of the target to be tracked in the body-fixed coordinate system of the underwater operation robot, and construct the state information s_t of the Markov decision process;
the fluctuation-frequency acquisition module 200 is configured to obtain, based on s_t, the fluctuation frequency a_t of the wave fins through the Actor network in the Actor-Critic reinforcement learning model;
the circulation module 300 is configured to control the wave fins of the underwater operation robot based on a_t, set t = t + 1, and jump back to the construction module.
It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working process and related description of the system described above may refer to the corresponding process in the foregoing method embodiment, and details are not described herein again.
It should be noted that the above-mentioned division of the functional modules is merely used as an example to illustrate the wave fin propulsion underwater operation robot tracking control system based on reinforcement learning provided in the above-mentioned embodiment, in practical applications, the above-mentioned function allocation may be completed by different functional modules according to needs, that is, the modules or steps in the embodiment of the present invention are further decomposed or combined, for example, the modules in the above-mentioned embodiment may be combined into one module, or may be further split into a plurality of sub-modules, so as to complete all or part of the functions described above. Names of the modules and steps related in the embodiments of the present invention are only for distinguishing the modules or steps, and are not to be construed as unduly limiting the present invention.
A storage device according to a third embodiment of the present invention stores a plurality of programs adapted to be loaded by a processor to implement the above wave fin propulsion underwater operation robot tracking control method based on reinforcement learning.
A processing apparatus according to a fourth embodiment of the present invention includes a processor, a storage device; a processor adapted to execute various programs; a storage device adapted to store a plurality of programs; the program is suitable for being loaded and executed by a processor to realize the wave fin propulsion underwater operation robot tracking control method based on reinforcement learning.
It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working processes and related descriptions of the storage device and the processing device described above may refer to the corresponding processes in the foregoing method examples, and are not described herein again.
Those of skill in the art will appreciate that the various illustrative modules and method steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that programs corresponding to the software modules and method steps may be located in Random Access Memory (RAM), memory, Read-Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. To clearly illustrate this interchangeability of electronic hardware and software, various illustrative components and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as electronic hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
The terms "first," "second," and the like are used for distinguishing between similar elements and not necessarily for describing or implying a particular order or sequence.
The terms "comprises," "comprising," or any other similar term are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.
So far, the technical solutions of the present invention have been described in connection with the preferred embodiments shown in the drawings, but it is easily understood by those skilled in the art that the scope of the present invention is obviously not limited to these specific embodiments. Equivalent changes or substitutions of related technical features can be made by those skilled in the art without departing from the principle of the invention, and the technical scheme after the changes or substitutions can fall into the protection scope of the invention.
Claims (10)
1. A wave fin propulsion underwater operation robot tracking control method based on reinforcement learning is characterized by comprising the following steps:
step S100, obtaining system state information of the underwater operation robot at time t and pose information of a target to be tracked in a body-fixed coordinate system of the underwater operation robot, and constructing state information s_t of a Markov decision process;
step S200, based on s_t, acquiring the fluctuation frequency a_t of the wave fin through an Actor network in an Actor-Critic reinforcement learning model;
step S300, based on a_t, controlling a wave fin of the underwater operation robot, letting t = t + 1, and skipping to step S100;
wherein the Actor-Critic reinforcement learning model comprises an Actor network and a Critic network and is obtained through offline training, the training method comprising:
step A100, obtaining system state information of the underwater operation robot at time t in a preset training data set and pose information of the target to be tracked in the body-fixed coordinate system of the underwater operation robot, and constructing state information s_t of a Markov decision process;
step A200, acquiring the supervised training probability and the random supervised probability at time t; if the supervised training probability is greater than the random supervised probability, obtaining the fluctuation frequency a_t of the wave fin through a PID controller based on s_t; otherwise, obtaining the fluctuation frequency a_t of the wave fin through the Actor network based on s_t;
step A300, based on s_t and a_t, respectively obtaining the state-action evaluation value Q*(s_t, a_t) and the reward value r_t through the Critic network in the Actor-Critic reinforcement learning model and a preset reward function;
step A400, based on Q*(s_t, a_t), updating parameters of the Actor network through a deterministic policy gradient algorithm, and based on Q*(s_t, a_t) and r_t, updating parameters of the Critic network;
and step A500, letting t = t + 1, and cyclically executing steps A100 to A400 until t is greater than a preset number of training iterations, obtaining the trained Actor network.
2. The wave fin propulsion underwater operation robot tracking control method based on reinforcement learning of claim 1, wherein the Actor network comprises four convolutional layers; the number of neurons in the first convolutional layer and the second convolutional layer is 200, the activation function is a Relu6 function, the number of neurons in the third convolutional layer is 10, the activation function is a Relu function, the number of neurons in the fourth convolutional layer is 2, and the activation function is a tanh function.
3. The wave fin propulsion underwater operation robot tracking control method based on reinforcement learning of claim 1, wherein the Critic network comprises five convolutional layers; the number of neurons in the first, second, and third convolutional layers is 200 with the Relu6 activation function, the number of neurons in the fourth convolutional layer is 10 with the Relu activation function, and the number of neurons in the fifth convolutional layer is 1 with a linear activation function.
4. The wave fin propulsion underwater operation robot tracking control method based on reinforcement learning of claim 3, wherein in step A200, the supervised training probability and the random supervised probability at time t are obtained by:

PRO_t = PRO_0 × 0.999^t
PRO_r = max(rand(1), 0.01)

where PRO_t is the supervised training probability at time t, PRO_r is the random supervised probability, and PRO_0 is a preset initial supervised training probability.
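The schedule of claim 4 can be sketched directly (PRO_0 = 1 is an assumed initial value, not specified here):

```python
import random

def supervised_probs(t, pro0=1.0):
    """PRO_t = PRO_0 * 0.999^t (decaying supervision) and
    PRO_r = max(rand(1), 0.01), redrawn at every step."""
    pro_t = pro0 * 0.999 ** t
    pro_r = max(random.random(), 0.01)
    return pro_t, pro_r

def use_pid_supervisor(t, pro0=1.0):
    """Step A200: take the PID action when PRO_t > PRO_r, else the Actor action."""
    pro_t, pro_r = supervised_probs(t, pro0)
    return pro_t > pro_r
```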
5. The wave fin propulsion underwater operation robot tracking control method based on reinforcement learning of claim 1, wherein in step A200, "obtaining the fluctuation frequency a_t of the wave fin through the PID controller" is carried out as follows: the fluctuation frequency of the wave fins comprises a left-fin fluctuation frequency and a right-fin fluctuation frequency, calculated by:

f_1 = ρ_r1 f_r1 + ρ_r2 f_x1
f_2 = ρ_x1 f_r2 + ρ_x2 f_x2
f_r2 = −f_r1
f_x2 = f_x1

where f_1, f_2 are the final fluctuation frequencies of the left and right wave fins; f_r1, f_r2 are the yaw-rotation fluctuation frequencies of the left and right fins; f_x1, f_x2 are the forward fluctuation frequencies of the left and right fins; ρ_r1, ρ_r2, ρ_x1, ρ_x2 are the weight coefficients of the rotation and forward fluctuation frequencies of the left and right wave fins; P_i, I_i, D_i, i ∈ {r, x} are the PID parameters; Δψ is the relative orientation error and its derivative is the relative orientation error differential; Δx is the lateral error of the relative position and its derivative is the lateral-error differential.
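The frequency-mixing relations of claim 5 can be sketched as follows, taking the PID-computed yaw and surge components f_r1, f_x1 as inputs; the unit weight coefficients are illustrative assumptions:

```python
def fin_frequencies(f_r1, f_x1, rho_r1=1.0, rho_r2=1.0, rho_x1=1.0, rho_x2=1.0):
    """Combine the yaw-rotation and forward fluctuation components into the
    final left/right fin frequencies f1, f2, following claim 5:
    f_r2 = -f_r1 (differential component steers), f_x2 = f_x1 (common drive)."""
    f_r2 = -f_r1
    f_x2 = f_x1
    f1 = rho_r1 * f_r1 + rho_r2 * f_x1
    f2 = rho_x1 * f_r2 + rho_x2 * f_x2
    return f1, f2
```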
6. The wave fin propulsion underwater operation robot tracking control method based on reinforcement learning of claim 1, wherein in step A200, "obtaining the fluctuation frequency a_t of the wave fin through the Actor network" is carried out as follows: the fluctuation frequency a_t is the fluctuation frequency with random noise superimposed, and the noise scale satisfies:

c_t = max(c_0 × 0.999^t, 0.01)

where c_t is the noise scale at time t and c_0 is its preset initial value.
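The decaying exploration noise of claim 6 can be sketched as follows; Gaussian noise is an assumption, since the claim only fixes the scale schedule c_t:

```python
import random

def noisy_frequency(mu, t, c0=1.0):
    """Superimpose exploration noise of scale c_t = max(c0 * 0.999^t, 0.01)
    on the Actor output mu; the 0.01 floor preserves some exploration forever."""
    c_t = max(c0 * 0.999 ** t, 0.01)
    return mu + c_t * random.gauss(0.0, 1.0), c_t
```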
7. The wave fin propulsion underwater operation robot tracking control method based on reinforcement learning of claim 1, wherein the preset reward function is:

r = r_0 − ρ_1 ‖Δψ‖² − ρ_2 ‖ΔP_o‖² − ν^T ρ_3 ν

where r is the reward value, r_0 is a preset constant reward, ‖Δψ‖_2 is the 2-norm of the relative orientation error, ‖ΔP_o‖_2 is the 2-norm of the relative position error, ν is the velocity vector, ν^T ρ_3 ν is a regularization term, and ρ_1, ρ_2, ρ_3 are weight coefficients.
8. A wave fin propulsion underwater operation robot tracking control system based on reinforcement learning, comprising a construction module, a fluctuation-frequency acquisition module, and a circulation module;
the construction module is configured to acquire system state information of the underwater operation robot at time t and pose information of a target to be tracked in a body-fixed coordinate system of the underwater operation robot, and construct state information s_t of a Markov decision process;
the fluctuation-frequency acquisition module is configured to obtain, based on s_t, the fluctuation frequency a_t of the wave fin through an Actor network in an Actor-Critic reinforcement learning model;
the circulation module is configured to control the wave fin of the underwater operation robot based on a_t, set t = t + 1, and jump back to the construction module.
9. A storage device having stored therein a plurality of programs, wherein the programs are adapted to be loaded and executed by a processor to implement the wave fin propulsion underwater operation robot tracking control method based on reinforcement learning according to any one of claims 1-7.
10. A processing device comprising a processor, a storage device; a processor adapted to execute various programs; a storage device adapted to store a plurality of programs; characterized in that the program is adapted to be loaded and executed by a processor to implement the wave fin propulsion underwater operation robot tracking control method based on reinforcement learning according to any one of claims 1-7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911077089.0A CN111079936B (en) | 2019-11-06 | 2019-11-06 | Wave fin propulsion underwater operation robot tracking control method based on reinforcement learning |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111079936A CN111079936A (en) | 2020-04-28 |
CN111079936B true CN111079936B (en) | 2023-03-14 |
Family
ID=70310691
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201911077089.0A Active CN111079936B (en) | 2019-11-06 | 2019-11-06 | Wave fin propulsion underwater operation robot tracking control method based on reinforcement learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111079936B (en) |
Families Citing this family (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111856936B (en) * | 2020-07-21 | 2023-06-02 | 天津蓝鳍海洋工程有限公司 | Control method for cabled underwater high-flexibility operation platform |
CN112124537B (en) * | 2020-09-23 | 2021-07-13 | 哈尔滨工程大学 | Intelligent control method for underwater robot for autonomous absorption and fishing of benthos |
CN112597693A (en) * | 2020-11-19 | 2021-04-02 | 沈阳航盛科技有限责任公司 | Self-adaptive control method based on depth deterministic strategy gradient |
CN112462792B (en) * | 2020-12-09 | 2022-08-09 | 哈尔滨工程大学 | Actor-Critic algorithm-based underwater robot motion control method |
CN112819379A (en) * | 2021-02-28 | 2021-05-18 | 广东电力交易中心有限责任公司 | Risk preference information acquisition method and system, electronic device and storage medium |
CN114114896B (en) * | 2021-11-08 | 2024-01-05 | 北京机电工程研究所 | PID parameter design method based on path integration |
CN113977583B (en) * | 2021-11-16 | 2023-05-09 | 山东大学 | Robot rapid assembly method and system based on near-end strategy optimization algorithm |
CN114995468B (en) * | 2022-06-06 | 2023-03-31 | 南通大学 | Intelligent control method of underwater robot based on Bayesian depth reinforcement learning |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2008305064A (en) * | 2007-06-06 | 2008-12-18 | Japan Science & Technology Agency | Learning type control device and method thereof |
CN105549384A (en) * | 2015-09-01 | 2016-05-04 | 中国矿业大学 | Inverted pendulum control method based on neural network and reinforced learning |
CN108008627A (en) * | 2017-12-13 | 2018-05-08 | 中国石油大学(华东) | A kind of reinforcement learning adaptive PID control method of parallel optimization |
CN109068391A (en) * | 2018-09-27 | 2018-12-21 | 青岛智能产业技术研究院 | Car networking communication optimization algorithm based on edge calculations and Actor-Critic algorithm |
CN109760046A (en) * | 2018-12-27 | 2019-05-17 | 西北工业大学 | Robot for space based on intensified learning captures Tum bling Target motion planning method |
CN109769119A (en) * | 2018-12-18 | 2019-05-17 | 中国科学院深圳先进技术研究院 | A kind of low complex degree vision signal code processing method |
CN110322017A (en) * | 2019-08-13 | 2019-10-11 | 吉林大学 | Automatic Pilot intelligent vehicle Trajectory Tracking Control strategy based on deeply study |
CN110333739A (en) * | 2019-08-21 | 2019-10-15 | 哈尔滨工程大学 | A kind of AUV conduct programming and method of controlling operation based on intensified learning |
Non-Patent Citations (2)
Title |
---|
An adaptive cruise control algorithm based on deep reinforcement learning; Han Xiangmin et al.; Computer Engineering; full text *
Navigation control of unmanned surface vehicles based on deep reinforcement learning; Zhang Fashuai et al.; Metrology & Measurement Technology; full text *
Also Published As
Publication number | Publication date |
---|---|
CN111079936A (en) | 2020-04-28 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111079936B (en) | Wave fin propulsion underwater operation robot tracking control method based on reinforcement learning | |
CN108803321B (en) | Autonomous underwater vehicle track tracking control method based on deep reinforcement learning | |
Long et al. | Deep-learned collision avoidance policy for distributed multiagent navigation | |
CN107748566B (en) | Underwater autonomous robot fixed depth control method based on reinforcement learning | |
WO2021103392A1 (en) | Confrontation structured control-based bionic robotic fish motion control method and system | |
Wang et al. | Dynamic tanker steering control using generalized ellipsoidal-basis-function-based fuzzy neural networks | |
CN110928189B (en) | Robust control method based on reinforcement learning and Lyapunov function | |
Li et al. | Visual servo regulation of wheeled mobile robots with simultaneous depth identification | |
CN109655066A (en) | One kind being based on the unmanned plane paths planning method of Q (λ) algorithm | |
CN115016496A (en) | Water surface unmanned ship path tracking method based on deep reinforcement learning | |
Precup et al. | Grey wolf optimizer-based approaches to path planning and fuzzy logic-based tracking control for mobile robots | |
Wang et al. | Adaptive and extendable control of unmanned surface vehicle formations using distributed deep reinforcement learning | |
Scorsoglio et al. | Image-based deep reinforcement learning for autonomous lunar landing | |
Song et al. | Guidance and control of autonomous surface underwater vehicles for target tracking in ocean environment by deep reinforcement learning | |
Khalaji et al. | Lyapunov-based formation control of underwater robots | |
Fang et al. | Autonomous underwater vehicle formation control and obstacle avoidance using multi-agent generative adversarial imitation learning | |
Zhuang et al. | Motion control and collision avoidance algorithms for unmanned surface vehicle swarm in practical maritime environment | |
Meng et al. | A Fully-Autonomous Framework of Unmanned Surface Vehicles in Maritime Environments Using Gaussian Process Motion Planning | |
Yao et al. | Multi-USV cooperative path planning by window update based self-organizing map and spectral clustering | |
Hadi et al. | Adaptive formation motion planning and control of autonomous underwater vehicles using deep reinforcement learning | |
Song et al. | Surface path tracking method of autonomous surface underwater vehicle based on deep reinforcement learning | |
Wang et al. | Adversarial deep reinforcement learning based robust depth tracking control for underactuated autonomous underwater vehicle | |
Jiang et al. | Learning decentralized control policies for multi-robot formation | |
Lee et al. | PSO-FastSLAM: An improved FastSLAM framework using particle swarm optimization | |
CN115755603A (en) | Intelligent ash box identification method for ship motion model parameters and ship motion control method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||