CN111079936B - Wave fin propulsion underwater operation robot tracking control method based on reinforcement learning - Google Patents

Wave fin propulsion underwater operation robot tracking control method based on reinforcement learning

Info

Publication number
CN111079936B
CN111079936B (application number CN201911077089.0A)
Authority
CN
China
Prior art keywords
wave
fin
underwater operation
reinforcement learning
operation robot
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911077089.0A
Other languages
Chinese (zh)
Other versions
CN111079936A (en)
Inventor
王宇
唐冲
王睿
王硕
谭民
马睿宸
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Automation of Chinese Academy of Science
Original Assignee
Institute of Automation of Chinese Academy of Science
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Automation of Chinese Academy of Science
Priority to CN201911077089.0A
Publication of CN111079936A
Application granted
Publication of CN111079936B
Legal status: Active (current)
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 Machine learning
    • B PERFORMING OPERATIONS; TRANSPORTING
    • B63 SHIPS OR OTHER WATERBORNE VESSELS; RELATED EQUIPMENT
    • B63C LAUNCHING, HAULING-OUT, OR DRY-DOCKING OF VESSELS; LIFE-SAVING IN WATER; EQUIPMENT FOR DWELLING OR WORKING UNDER WATER; MEANS FOR SALVAGING OR SEARCHING FOR UNDERWATER OBJECTS
    • B63C 11/00 Equipment for dwelling or working underwater; Means for searching for underwater objects
    • B63C 11/52 Tools specially adapted for working underwater, not otherwise provided for

Abstract

The invention belongs to the field of autonomous control of underwater operation robots, and particularly relates to a reinforcement-learning-based tracking control method, system and device for a wave fin propulsion underwater operation robot, aiming at solving the problem of low target-tracking precision caused by the poor convergence and stability of the Actor network during training. The method comprises: obtaining the system state information of the underwater operation robot at time t and the pose information of the target to be tracked in the body-fixed coordinate system of the underwater operation robot, and constructing the state information s_t of a Markov decision process; based on s_t, obtaining the wave frequency a_t of the wave fins through the Actor network of an Actor-Critic reinforcement learning model; based on a_t, controlling the wave fins of the underwater operation robot, letting t = t + 1, and looping. In the invention, a PID controller supervises the training of the Actor network, which improves the stability and convergence of the network and thereby improves the target-tracking precision.

Description

Wave fin propulsion underwater operation robot tracking control method based on reinforcement learning
Technical Field
The invention belongs to the field of autonomous control of underwater operation robots, and particularly relates to a reinforcement-learning-based tracking control method, system and device for a wave fin propulsion underwater operation robot.
Background
Autonomous control of underwater robots is a hotspot and a difficulty of current research. As mankind moves from marine exploration to marine development, new challenges are posed to the autonomous control and autonomous operation of underwater operation robots. Autonomous operation of an underwater operation robot is of great significance for underwater archaeology, underwater salvage, underwater rescue, underwater engineering and the like. Such a robot can replace a diver or a remotely operated vehicle (ROV), realize long-duration continuous underwater operation, and improve the efficiency of underwater work.
Generally, because of the irregular shape of the underwater operation robot and the complexity of the underwater environment, it is difficult to establish an accurate hydrodynamic model for the robot, so model-based robot control methods adapt poorly. Reinforcement learning gives the action to be executed according to the current state of the system and then transitions to the next state; it is a typical model-free control method and adapts well to complex underwater environments and unknown disturbances. However, the training of reinforcement learning is experience-based learning from data, and successful experience is important for training. In the initial training stage of reinforcement learning, because the output is still poor and successfully explored control behaviours are to some extent accidental, successful experience in the database is insufficient, so the Actor network in the reinforcement learning model converges slowly, learning efficiency is low, and the accuracy of subsequent tracking control is directly affected.
Disclosure of Invention
In order to solve the above problem in the prior art, that is, the problem that the target-tracking accuracy of the conventional reinforcement-learning-based tracking control method is low because of the poor convergence and stability of the Actor network of the reinforcement learning model during training, a first aspect of the present invention provides a tracking control method for a wave fin propulsion underwater operation robot based on reinforcement learning, the method comprising:
Step S100, obtaining the system state information of the underwater operation robot at time t and the pose information of the target to be tracked in the body-fixed coordinate system of the underwater operation robot, and constructing the state information s_t of a Markov decision process;
Step S200, based on s_t, obtaining the wave frequency a_t of the wave fins through the Actor network of an Actor-Critic reinforcement learning model;
Step S300, based on a_t, controlling the wave fins of the underwater operation robot, letting t = t + 1, and returning to step S100;
wherein the Actor-Critic reinforcement learning model comprises an Actor network and a Critic network and is obtained through offline training, the training method comprising:
Step A100, acquiring a training data set, and constructing the state information s_t of the Markov decision process based on the system state information of the underwater operation robot at time t in the training data set and the pose information of the target to be tracked in the body-fixed coordinate system of the underwater operation robot;
Step A200, acquiring the supervised training probability and the random supervision probability at time t; if the supervised training probability is greater than the random supervision probability, obtaining the wave frequency a_t of the wave fins through a PID controller based on s_t; otherwise obtaining the wave frequency a_t of the wave fins through the Actor network based on s_t;
Step A300, based on s_t and a_t, obtaining the state-action evaluation value Q*(s_t, a_t) and the reward value r_t through the Critic network of the Actor-Critic reinforcement learning model and a preset reward function, respectively;
Step A400, updating the parameters of the Actor network through a deterministic policy gradient algorithm based on Q*(s_t, a_t); and updating the parameters of the Critic network based on Q*(s_t, a_t) and r_t;
Step A500, letting t = t + 1 and cyclically executing steps A100-A400 until t is greater than a preset number of training steps, obtaining the trained Actor network.
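As a reading aid, steps A100 to A500 can be summarised in the following Python skeleton. Every callable here is a hypothetical stand-in, not an interface defined by the patent; it is a minimal sketch of the training flow rather than the actual implementation.

```python
import numpy as np

def train(select_action, env_step, replay_push, update_networks,
          total_steps=1000, state_dim=7):
    """Skeleton of training steps A100-A500 (all callables are stand-ins).

    select_action(t, s): wave frequencies from the PID controller or the
        Actor network, chosen by the supervision strategy of step A200.
    env_step(s, a) -> (s_next, r): robot/simulator transition plus the
        reward of the preset reward function (steps A100/A300).
    replay_push(s, a, s_next, r): store the transition in the experience pool.
    update_networks(s, a, r, s_next): Critic and Actor updates (steps A300-A400).
    """
    s = np.zeros(state_dim, dtype=np.float32)   # MDP state, see equation (3) below
    for t in range(total_steps):                # step A500: loop over the budget
        a = select_action(t, s)                 # step A200
        s_next, r = env_step(s, a)              # steps A100/A300
        replay_push(s, a, s_next, r)            # experience replay storage
        update_networks(s, a, r, s_next)        # step A400
        s = s_next

if __name__ == "__main__":
    # Smoke test with dummy stand-ins.
    train(select_action=lambda t, s: np.zeros(2),
          env_step=lambda s, a: (np.zeros(7, dtype=np.float32), 0.0),
          replay_push=lambda *args: None,
          update_networks=lambda *args: None,
          total_steps=10)
```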
In some preferred embodiments, the Actor network comprises four convolutional layers: the first and second convolutional layers have 200 neurons each with the ReLU6 activation function, the third convolutional layer has 10 neurons with the ReLU activation function, and the fourth convolutional layer has 2 neurons with the tanh activation function.
In some preferred embodiments, the Critic network comprises five convolutional layers: the first, second and third convolutional layers have 200 neurons each with the ReLU6 activation function, the fourth convolutional layer has 10 neurons with the ReLU activation function, and the fifth convolutional layer has 1 neuron with a linear activation function.
In some preferred embodiments, in step A200, the supervised training probability and the random supervision probability at time t are obtained as:
PRO_t = PRO_0 * 0.999^t
PRO_r = max(rand(1), 0.01)
where PRO_t is the supervised training probability at time t, PRO_r is the random supervision probability, and PRO_0 is a preset initial supervised training probability.
In some preferred embodiments, in step A200, the wave frequency a_t of the wave fins is obtained by the PID controller as follows: the wave frequency of the wave fins comprises a left-fin wave frequency and a right-fin wave frequency, calculated as:
f_1 = ρ_r1 f_r1 + ρ_r2 f_x1
f_2 = ρ_x1 f_r2 + ρ_x2 f_x2
f_r2 = -f_r1
f_r1 = P_r Δψ + I_r ∫Δψ dt + D_r (dΔψ/dt)
f_x2 = f_x1
f_x1 = P_x Δx + I_x ∫Δx dt + D_x (dΔx/dt)
where f_1, f_2 are the final wave frequencies of the left and right wave fins, f_r1, f_r2 are the yaw-rotation wave frequencies of the left and right wave fins, f_x1, f_x2 are the forward wave frequencies of the left and right wave fins, ρ_r1, ρ_r2, ρ_x1, ρ_x2 are the weighting coefficients of the left and right wave fins for the rotation and forward wave frequencies, P_i, I_i, D_i, i ∈ {r, x} are the PID parameters, Δψ is the relative bearing error, dΔψ/dt is the derivative of the relative bearing error, Δx is the lateral error of the relative position, and dΔx/dt is the derivative of the lateral error of the relative position.
In some preferred embodiments, in step A200, the wave frequency a_t of the wave fins is obtained through the Actor network as follows: the wave frequency a_t is the wave frequency with superimposed random noise and satisfies:
a_t ~ N(μ(t), c_t^2)
c_t = max(c_0 * 0.999^t, 0.01)
where c_0 is a preset initial noise standard deviation, c_t is the noise standard deviation at time t, and a_t is random Gaussian noise with the Actor network output μ(t) as its mean and c_t as its standard deviation.
In some preferred embodiments, the preset reward function is:
r = r_0 + ρ_1 ||Δψ||_2 + ρ_2 ||ΔP_o||_2 + ν^T ρ_3 ν
where r is the reward value, r_0 is a preset constant-value reward, ||Δψ||_2 is the 2-norm of the relative orientation error, ||ΔP_o||_2 is the 2-norm of the relative position error, ν is the velocity vector, ν^T ρ_3 ν is a regularisation term, and ρ_1, ρ_2, ρ_3 are weight coefficients.
A second aspect of the invention provides a wave fin propulsion underwater operation robot tracking control system based on reinforcement learning, which comprises a construction module, a wave-frequency acquisition module and a circulation module;
the construction module is configured to acquire the system state information of the underwater operation robot at time t and the pose information of the target to be tracked in the body-fixed coordinate system of the underwater operation robot, and to construct the state information s_t of a Markov decision process;
the wave-frequency acquisition module is configured to obtain, based on s_t, the wave frequency a_t of the wave fins through the Actor network of the Actor-Critic reinforcement learning model;
the circulation module is configured to control the wave fins of the underwater operation robot based on a_t, let t = t + 1, and jump back to the construction module.
In a third aspect of the present invention, a storage device is provided, in which a plurality of programs are stored, the programs being adapted to be loaded and executed by a processor to implement the above wave fin propulsion underwater operation robot tracking control method based on reinforcement learning.
In a fourth aspect of the present invention, a processing apparatus is provided, which includes a processor and a storage device; the processor is adapted to execute programs; the storage device is adapted to store a plurality of programs; and the programs are adapted to be loaded and executed by the processor to implement the above wave fin propulsion underwater operation robot tracking control method based on reinforcement learning.
The invention has the beneficial effects that:
according to the invention, the PID controller is used for monitoring the Actor network training, so that the stability and convergence of the reinforcement learning model are improved, and the target tracking precision is improved. In the initial stage of reinforcement learning, the fluctuation frequency of the fluctuation fin is generated through the PID controller, a large amount of control experience is generated, effective supervision on an Actor network is achieved, and the fluctuation frequency of the fluctuation fin is generated through the Actor network in the later stage. A large amount of control experience is effectively evaluated according to the Critic network, and then the Actor network is updated through a deterministic strategy gradient algorithm, so that the convergence of the reinforcement learning model is accelerated, and the stability of the reinforcement learning model is improved.
Meanwhile, the invention combines several different strategies to train the Actor network, thereby improving the generalization ability of the reinforcement learning model and improving the target tracking precision.
Drawings
Other features, objects and advantages of the present application will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings.
FIG. 1 is a schematic flow chart of a wave fin propulsion underwater operation robot tracking control method based on reinforcement learning according to an embodiment of the invention;
FIG. 2 is a schematic flow chart of a training method of an Actor-Critic reinforcement learning model according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a frame of a wave fin propulsion underwater operation robot tracking control system based on reinforcement learning according to an embodiment of the invention;
FIG. 4 is an exemplary illustration of target tracking control for a wave fin propulsion underwater operation robot in accordance with one embodiment of the present invention;
FIG. 5 is an exemplary diagram of the PID-controller-based Actor-Critic network update according to an embodiment of the invention;
FIG. 6 is an exemplary diagram of the structures of the Actor network and the Critic network according to an embodiment of the present invention;
FIG. 7 is an exemplary diagram comparing different training strategies according to one embodiment of the invention;
FIG. 8 is an exemplary diagram of a stable tracking of a moving object according to one embodiment of the invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be obtained by a person skilled in the art without making any creative effort based on the embodiments in the present invention, belong to the protection scope of the present invention.
The present application will be described in further detail with reference to the following drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the relevant invention and not restrictive of the invention. It should be noted that, for convenience of description, only the portions related to the related invention are shown in the drawings.
It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict.
The wave fin propulsion underwater operation robot tracking control method based on reinforcement learning according to a first embodiment of the invention, as shown in FIG. 1 and FIG. 2, comprises the following steps:
Step S100, obtaining the system state information of the underwater operation robot at time t and the pose information of the target to be tracked in the body-fixed coordinate system of the underwater operation robot, and constructing the state information s_t of a Markov decision process;
Step S200, based on s_t, obtaining the wave frequency a_t of the wave fins through the Actor network of an Actor-Critic reinforcement learning model;
Step S300, based on a_t, controlling the wave fins of the underwater operation robot, letting t = t + 1, and returning to step S100;
wherein the Actor-Critic reinforcement learning model comprises an Actor network and a Critic network and is obtained through offline training, the training method comprising:
Step A100, acquiring a training data set, and constructing the state information s_t of the Markov decision process based on the system state information of the underwater operation robot at time t in the training data set and the pose information of the target to be tracked in the body-fixed coordinate system of the underwater operation robot;
Step A200, acquiring the supervised training probability and the random supervision probability at time t; if the supervised training probability is greater than the random supervision probability, obtaining the wave frequency a_t of the wave fins through a PID controller based on s_t; otherwise obtaining the wave frequency a_t of the wave fins through the Actor network based on s_t;
Step A300, based on s_t and a_t, obtaining the state-action evaluation value Q*(s_t, a_t) and the reward value r_t through the Critic network of the Actor-Critic reinforcement learning model and a preset reward function, respectively;
Step A400, updating the parameters of the Actor network through a deterministic policy gradient algorithm based on Q*(s_t, a_t); and updating the parameters of the Critic network based on Q*(s_t, a_t) and r_t;
Step A500, letting t = t + 1 and cyclically executing steps A100-A400 until t is greater than a preset number of training steps, obtaining the trained Actor network.
In order to more clearly describe the tracking control method of the wave fin propulsion underwater operation robot based on reinforcement learning, the following describes the steps of an embodiment of the method in detail with reference to the attached drawings.
In the following embodiments, the training method of the Actor-Critic reinforcement learning model is introduced first, and then the reinforcement-learning-based wave fin propulsion underwater operation robot tracking control method, which acquires the wave frequency of the wave fins using the trained Actor network, is described.
1. Training method of the Actor-Critic reinforcement learning model
Step A100, obtaining the system state information of the underwater operation robot at time t in a preset training data set and the pose information of the target to be tracked in the body-fixed coordinate system of the underwater operation robot, and constructing the state information s_t of a Markov decision process.
Since the wave fin propulsion underwater operation robot is self-stable in the roll and pitch directions, the body is assumed to have 4 degrees of freedom, and the motion state of the robot can be defined as χ = [x, y, z, ψ]^T, where x, y, z are the three-dimensional coordinates of the underwater operation robot, ψ is the yaw angle and T denotes the transpose. The corresponding velocity vector is ν = [u, v, w, r_z]^T, where u is the forward velocity, v is the lateral velocity, w is the vertical velocity and r_z is the yaw rate. The dynamic equation of the underwater operation robot system is therefore shown in formula (1):
dχ/dt = f(χ, ν, U, ξ)   (1)
where U is the system control quantity, ξ is an unknown disturbance and dχ/dt is the differential of the system state.
The target object to be tracked is detected and localized through onboard binocular vision, and its visible range in the body-fixed coordinate system is limited. In FIG. 4, the tetrahedron O_c1-ABCD is the effective detection range of the binocular vision, the front cabin and the rear cabin are the front and rear hull sections of the robot, O_B X_B Y_B Z_B is the body-fixed coordinate system, O_c1 X_c1 Y_c1 Z_c1 is the camera coordinate system, and P_c is the central position of the working space. The goal of body control is to keep the target object in the central area of the field of view, within the working space A_1B_1C_1D_1-A_2B_2C_2D_2 shown, so as to facilitate the grasping operation of the robotic arm. Since the operation target is small, its attitude does not need to be considered at present. The tracking control problem of approaching the target object is therefore shown in formula (2):
|g(x_o, y_o, z_o, ψ_o) - g(χ)| < ε   (2)
where x_o, y_o, z_o is the position of the target to be tracked, ψ_o is the bearing of the target to be tracked, and ε is a preset fault-tolerance range.
In this embodiment, the goal of body control is to control the underwater operation robot to track the target object stably, so that the target remains within a specified region relative to the robot. Considering the motion control of the robot in the two-dimensional plane, the pose of the target to be tracked in the body-fixed coordinate system of the robot is P_o = [x_o, y_o, ψ_o]^T, and the central position of the working space is P_c = [x_c, y_c, ψ_c]^T. In addition, u, r_z and g need to be included in the state, where g ∈ {0, 1} indicates whether the target has been approached. The robot generally does not need a side-shifting motion during tracking, so the state information of the Markov decision process (MDP) can be defined as shown in formula (3):
s = [g, x_o - x_c, y_o - y_c, ψ_o - ψ_c, u, w, r_z]^T   (3)
which can be abbreviated as formula (4):
s = [g, ΔX, ΔY, Δψ, u, w, r_z]^T = [g, ΔP_o^T, ν^T]^T   (4)
where ΔX, ΔY and Δψ are the X-axis position difference, the Y-axis position difference and the deflection-angle difference, ΔP_o = [ΔX, ΔY, Δψ]^T is the pose difference, and ν^T = [u, w, r_z] is the velocity vector.
Meanwhile, in order to remove the correlation of the training data in the preset training data set, an experience pool needs to be constructed for experience replay, which improves the training effect. From its current system state information s_t, the underwater operation robot takes action a_t, i.e. the wave frequency of the wave fins, then observes the state information s_{t+1} at the next instant and obtains the reward value r_t at the current instant; the state transition tuple {s_t, a_t, s_{t+1}, r_t} denotes one experience. A certain amount of data is then extracted from the historical experience pool for training.
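A minimal sketch of the state construction of equation (3) and of an experience pool holding the transition tuples {s_t, a_t, s_{t+1}, r_t} follows; class and argument names are illustrative and not taken from the patent.

```python
import random
from collections import deque

import numpy as np

def build_state(g, target_pose, center_pose, u, w, r_z):
    """State of equation (3): s = [g, x_o-x_c, y_o-y_c, psi_o-psi_c, u, w, r_z]^T.

    target_pose, center_pose: (x, y, psi) of the tracked target and of the
    workspace centre, both expressed in the body-fixed frame.
    """
    dx = target_pose[0] - center_pose[0]
    dy = target_pose[1] - center_pose[1]
    dpsi = target_pose[2] - center_pose[2]
    return np.array([g, dx, dy, dpsi, u, w, r_z], dtype=np.float32)

class ReplayBuffer:
    """Experience pool for experience replay of tuples (s_t, a_t, s_{t+1}, r_t)."""

    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)    # oldest experiences are discarded

    def push(self, s, a, s_next, r):
        self.buffer.append((s, a, s_next, r))

    def sample(self, batch_size=32):
        # Caller should ensure len(self.buffer) >= batch_size.
        batch = random.sample(self.buffer, batch_size)
        s, a, s_next, r = map(np.array, zip(*batch))
        return s, a, s_next, r
```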
Step A200, acquiring the supervised training probability and the random supervision probability at time t; if the supervised training probability is greater than the random supervision probability, obtaining the wave frequency a_t of the wave fins through a PID controller based on s_t; otherwise obtaining the wave frequency a_t of the wave fins through the Actor network based on s_t.
The training of reinforcement learning is experience-based learning from data, and successful experience is very important for training the network. In the initial stage of reinforcement-learning training, because the outputs of the evaluation (Critic) network and the policy (Actor) network are still poor and successfully explored control behaviours are to some extent accidental, successful experience in the database is insufficient, so convergence is slow, learning efficiency is low, and training may even diverge. As the old saying goes, the master leads the apprentice to the door, but the practice depends on the apprentice. The invention therefore constructs a supervisory controller to play the role of the master: it plays the main role in the initial stage of reinforcement learning, generates a large amount of successful control experience, and effectively supervises and guides the policy network so that it, too, produces successful control experience, before gradually handing over to the self-learning stage. During self-learning, the supervisory controller interferes with reinforcement learning only with a small probability, perturbing the action generated by the policy network (the wave frequency of the wave fins) and further optimizing the policy network until the student surpasses the master; after that, the supervisory controller no longer supervises.
The supervisory controller adopted here is a PID controller based on state observation, constructed as shown in equations (5)-(10):
f_r1 = P_r Δψ + I_r ∫Δψ dt + D_r (dΔψ/dt)   (5)
f_r2 = -f_r1   (6)
f_x1 = P_x Δx + I_x ∫Δx dt + D_x (dΔx/dt)   (7)
f_x2 = f_x1   (8)
f_1 = ρ_r1 f_r1 + ρ_r2 f_x1   (9)
f_2 = ρ_x1 f_r2 + ρ_x2 f_x2   (10)
where P_i, I_i, D_i, i ∈ {r, x} are the PID parameters, Δψ is the relative bearing error, Δx is the lateral error of the relative position, f_r1, f_r2 are the yaw-rotation wave frequencies of the left and right wave fins, f_x1, f_x2 are the forward wave frequencies of the left and right wave fins, f_1, f_2 are the final wave frequencies of the left and right wave fins, and ρ_r1, ρ_r2, ρ_x1, ρ_x2 are the weighting coefficients of the left and right wave fins for the rotation and forward wave frequencies.
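A sketch of the supervisory PID controller of equations (5)-(10) is given below. The PID laws for f_r1 and f_x1 use the textbook proportional-integral-derivative form assumed in the reconstruction above; the gains, weighting coefficients and sample time are placeholders.

```python
import numpy as np

class WaveFinPID:
    """State-observation-based supervisory PID controller, equations (5)-(10)."""

    def __init__(self, pid_r=(1.0, 0.0, 0.1), pid_x=(1.0, 0.0, 0.1),
                 rho=(0.5, 0.5, 0.5, 0.5), dt=0.1):
        self.pid_r, self.pid_x = pid_r, pid_x   # (P, I, D) for i = r and i = x
        self.rho = rho                          # (rho_r1, rho_r2, rho_x1, rho_x2)
        self.dt = dt
        self.int_psi = self.int_x = 0.0         # integral terms
        self.prev_psi = self.prev_x = 0.0       # previous errors for the derivative

    def step(self, d_psi, d_x):
        """d_psi: relative bearing error; d_x: lateral error of the relative position."""
        self.int_psi += d_psi * self.dt
        self.int_x += d_x * self.dt
        dpsi_dot = (d_psi - self.prev_psi) / self.dt
        dx_dot = (d_x - self.prev_x) / self.dt
        self.prev_psi, self.prev_x = d_psi, d_x

        p_r, i_r, kd_r = self.pid_r
        p_x, i_x, kd_x = self.pid_x
        f_r1 = p_r * d_psi + i_r * self.int_psi + kd_r * dpsi_dot   # eq. (5)
        f_r2 = -f_r1                                                # eq. (6)
        f_x1 = p_x * d_x + i_x * self.int_x + kd_x * dx_dot         # eq. (7)
        f_x2 = f_x1                                                 # eq. (8)

        rho_r1, rho_r2, rho_x1, rho_x2 = self.rho
        f1 = rho_r1 * f_r1 + rho_r2 * f_x1                          # eq. (9)
        f2 = rho_x1 * f_r2 + rho_x2 * f_x2                          # eq. (10)
        return np.array([f1, f2])    # left and right fin wave frequencies
```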
Based on the idea that the master leads the apprentice to the door while mastery depends on the apprentice, the supervision strategy designed by the invention is shown in equations (11), (12) and (13):
PRO_t = PRO_0 * 0.999^t   (11)
PRO_r = max(rand(1), 0.01)   (12)
a_t = { output of the supervising PID controller, if PRO_t > PRO_r; output of the Actor network, otherwise }   (13)
where PRO_0 is a preset initial supervised training probability and PRO_t is the supervised training probability at time t, which decays exponentially as the number of training steps increases. PRO_r is the random supervision probability: when PRO_t is greater than PRO_r, the supervising PID controller outputs the wave frequency of the wave fins; otherwise, the Actor network outputs the wave frequency of the wave fins.
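To see roughly where the supervised phase of equations (11)-(12) hands over to self-learning, the decay schedule can be evaluated directly; this small snippet is purely illustrative and not part of the patent.

```python
pro0 = 1.0   # preset initial supervised training probability PRO_0
for t in (0, 500, 1000, 2000, 4000):
    print(t, round(pro0 * 0.999 ** t, 4))
# prints approximately: 0 1.0, 500 0.6064, 1000 0.3677, 2000 0.1352, 4000 0.0183
```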
A schematic diagram of the training process of the PID-controller-based Actor-Critic reinforcement learning model, i.e. steps A100 to A500, is shown in FIG. 5, where n(a_t, ε) is the random noise added to a_t and z^(-1) is a time delay.
One of the key issues in reinforcement learning is the trade-off between exploration and exploitation. The control strategy output by the Actor network is built on empirical data, and only sufficient data can give the Actor network sufficient generalization ability. In the invention, random noise is superimposed on the output of the Actor network to train its generalization ability, as shown in equations (14) and (15):
c_t = max(c_0 * 0.999^t, 0.01)   (14)
a_t ~ N(μ(t), c_t^2)   (15)
where c_0 is the initial noise standard deviation and c_t is the noise standard deviation at time t, whose value decays exponentially with the number of training steps; the executed action a_t is random Gaussian noise with the Actor network output μ(t) as its mean and c_t as its standard deviation.
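A sketch of the exploration noise of equations (14)-(15): the executed action is a Gaussian sample centred on the Actor output with an exponentially decaying standard deviation. Clipping the result to the tanh output range [-1, 1] is an added assumption, not stated in the patent.

```python
import numpy as np

def noisy_action(actor_output, t, c0=1.0):
    """Superimpose decaying Gaussian exploration noise on the Actor output."""
    c_t = max(c0 * 0.999 ** t, 0.01)                      # equation (14)
    a_t = np.random.normal(loc=actor_output, scale=c_t)   # equation (15)
    return np.clip(a_t, -1.0, 1.0)   # assumption: keep within the tanh range
```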
Step A300, based on s_t and a_t, obtaining the state-action evaluation value Q*(s_t, a_t) and the reward value r_t through the Critic network of the Actor-Critic reinforcement learning model and a preset reward function, respectively.
The Markov decision process of reinforcement learning comprises four parts: a state space S, an action space A, a reward function r: S × A → R and a state-transition probability distribution p. Under the Markov property, the state of the system at the next moment is determined only by the current state and action, i.e. p(x_t | x_{t-1}, a_{t-1}). The cumulative reward function is defined as the discount-factor-weighted combination of future rewards, as shown in equation (16):
R_t = Σ_{i=t}^{N} γ^{i-t} r(s_i, a_i)   (16)
where γ ∈ (0, 1] is the discount factor, R_t is the cumulative reward starting at time t, r(s_i, a_i) is the reward value, i is the time index and N is the total number of time steps.
In the learning process, the network trained by reinforcement learning interacts with the robot: it selects an action a_t ∈ A in the action space, the system then moves from state x_t ∈ X to x_{t+1} ∈ X, and a reward r_t is obtained at time t to evaluate taking a_t in state x_t; the reward function is used to indicate whether the goal has been completed, whether the behaviour is optimal, and so on. The goal of the control problem is therefore to obtain an optimal strategy π* that receives the maximum reward J*, computed as shown in equation (17):
J* = max_{π∈P} J(π) = max_{π∈P} E_π[R]   (17)
where P is the policy space, E_π denotes the expectation under strategy π, the final system control quantity is u = a, J(π) is the reward value corresponding to strategy π, and π is the strategy.
The key to the above problem is therefore how to define the state of the MDP and the single-step reward function. The state information of the Markov decision process is defined in formulas (3) and (4). The reward function comprehensively considers the position error and the deflection-angle error, expects the robot to adjust its heading first and then approach the target, and also considers the power consumption of the robot. Combining these indices, the reward function is defined as shown in formula (18):
r = r_0 + ρ_1 ||Δψ||_2 + ρ_2 ||ΔP_o||_2 + ν^T ρ_3 ν   (18)
where r is the reward value; r_0 is a preset constant-value reward, equal to 1 when the target task is completed and 0 otherwise; ||Δψ||_2 is the 2-norm of the relative orientation error; ||ΔP_o||_2 is the 2-norm of the relative position error; ν^T ρ_3 ν is a regularisation term used to reduce the energy consumption of the system; and ρ_1, ρ_2, ρ_3 are the weight coefficients.
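A sketch of the reward of equation (18) follows. The patent only states that ρ_1, ρ_2, ρ_3 are weight coefficients, so the negative values chosen here (so that smaller errors and lower speeds yield a larger reward) and the diagonal form of ρ_3 are assumptions.

```python
import numpy as np

def reward(done, d_psi, d_p, nu, r0=1.0, rho1=-1.0, rho2=-1.0, rho3=None):
    """r = r0 + rho1*||d_psi||_2 + rho2*||dP_o||_2 + nu^T rho3 nu  (equation (18)).

    done: True once the target task is completed (constant reward r0, else 0).
    d_psi: relative orientation error; d_p: relative position error (vector);
    nu: velocity vector as a numpy array.
    """
    if rho3 is None:
        rho3 = -0.01 * np.eye(len(nu))      # regularisation penalising energy use
    r_const = r0 if done else 0.0
    return (r_const
            + rho1 * np.linalg.norm(np.atleast_1d(d_psi), 2)
            + rho2 * np.linalg.norm(np.atleast_1d(d_p), 2)
            + nu @ rho3 @ nu)
```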
In solving the MDP problem, the classical value function is the state-action value function (also called the Q function), defined as shown in equation (19):
Q^π(s, a) = E_π[ Σ_{i=k}^{K} γ^{i-k} r(s_i, a_i) | s_k = s, a_k = a ]   (19)
where K is the total number of steps and k is the current step number.
Starting from state s and executing an optimal strategy π*, the robot obtains the maximum cumulative reward value Q*(s, a); the optimal value function therefore satisfies the Bellman optimality principle, as shown in equation (20):
Q*(s_t, a_t) = E[ r(s_t, a_t) + γ max_{a_{t+1}} Q*(s_{t+1}, a_{t+1}) ]   (20)
where a_t = π*(s_t). Once the optimal Q* is obtained through iterative reinforcement-learning training, the optimal strategy can be obtained as π*(s) = arg max_a Q*(s, a).
In the present invention, the structures of the Actor network and the Critic network are shown in FIG. 6. The Actor network is shown in FIG. 6(a): hidden layers 1 and 2 have 200 neurons each with the ReLU6 activation function, hidden layer 3 has 10 neurons with the ReLU activation function, and the final output layer has 2 neurons with the tanh activation function, which normalizes the output to [-1, 1]; the hidden layers are convolutional layers. The Critic network is shown in FIG. 6(b); its inputs are the state s and the action a. Hidden layers 0 and 1 have 200 neurons each; the state s and the action a are passed through convolutional layer 0 and convolutional layer 1 respectively, summed and fused, passed through a ReLU6 activation function, and then input to hidden layer 2, which has 200 neurons and the ReLU6 activation function. Hidden layer 3 has 10 neurons with the ReLU activation function, and the final output layer has 1 neuron with a linear activation function that outputs the state-action evaluation value Q. In FIG. 6(b), hidden layers 0, 1, 2, 3 and 4 are the first, second, third, fourth and fifth convolutional layers, respectively.
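A sketch of the two networks of FIG. 6 in PyTorch follows. The patent labels the layers convolutional, but the stated neuron counts read like fully connected layers, so nn.Linear is used here as an assumption; the 7-dimensional state of equation (3) and the two fin wave frequencies fix the input and output sizes.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Actor of FIG. 6(a): 200-200 (ReLU6), 10 (ReLU), 2 (tanh)."""

    def __init__(self, state_dim=7, action_dim=2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 200), nn.ReLU6(),
            nn.Linear(200, 200), nn.ReLU6(),
            nn.Linear(200, 10), nn.ReLU(),
            nn.Linear(10, action_dim), nn.Tanh(),   # output normalised to [-1, 1]
        )

    def forward(self, s):
        return self.net(s)

class Critic(nn.Module):
    """Critic of FIG. 6(b): 200-unit branches for s and a, summed and fused,
    then 200 (ReLU6), 10 (ReLU) and a linear unit producing Q(s, a)."""

    def __init__(self, state_dim=7, action_dim=2):
        super().__init__()
        self.s_branch = nn.Linear(state_dim, 200)   # hidden layer 0
        self.a_branch = nn.Linear(action_dim, 200)  # hidden layer 1
        self.fuse_act = nn.ReLU6()
        self.head = nn.Sequential(
            nn.Linear(200, 200), nn.ReLU6(),        # hidden layer 2
            nn.Linear(200, 10), nn.ReLU(),          # hidden layer 3
            nn.Linear(10, 1),                       # linear output: Q value
        )

    def forward(self, s, a):
        x = self.fuse_act(self.s_branch(s) + self.a_branch(a))
        return self.head(x)
```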
Step A400, updating the parameters of the Actor network through a deterministic policy gradient algorithm based on Q*(s_t, a_t); and updating the parameters of the Critic network based on Q*(s_t, a_t) and r_t.
In the tracking control of the target object by the underwater operation robot, the system state and the control quantity of the robot are continuous variables, and the MDP state is also continuous, so the MDP state space and action space are continuous domains. For reinforcement-learning problems over continuous domains, the policy gradient (PG) method is widely applied; its core idea is to change the parameters of the policy function in the direction that maximizes the reward function so as to improve the control policy. However, this method adopts an independent approximation function, is a stochastic-policy scheme and has low computational efficiency. To solve this problem, a deterministic state-action mapping function a = μ_θ(s) is adopted, i.e. the deterministic policy gradient (DPG) algorithm, where θ is the parameter of the policy function. The cumulative reward is maximized by updating the parameter θ along the positive gradient of the cumulative reward function, as shown in formula (21):
θ_{t+1} = θ_t + α_θ ∇_θ J(μ_θ)   (21)
where α_θ is the weighting coefficient and ∇_θ J(μ_θ) is the gradient variable.
The DPG algorithm derives the gradient ∇_θ J(μ_θ) in the form shown in equation (22):
∇_θ J(μ_θ) ≈ (1/M) Σ_{i=1}^{M} ∇_θ Q^μ(s_i, μ_θ(s_i))   (22)
where M is the number of iterations and Q^μ denotes the Q-value function under the policy function μ(a|s). Applying the chain rule gives equation (23):
∇_θ J(μ_θ) ≈ (1/M) Σ_{i=1}^{M} ∇_θ μ_θ(s_i) ∇_a Q^μ(s_i, a)|_{a=μ_θ(s_i)}   (23)
both the strategy function and the Q value evaluation function have strong nonlinearity, and the adoption of the deep neural network for fitting the nonlinear function is an effective method. In this section, an Actor network and a Critic network are respectively designed to approximate the policy function mu θ (as) and an evaluation function Q W (s, a), the parameters of the two networks are respectively represented by theta and W, and the parameter updating equation is shown as the formula (24) (25) for the Q value function:
δ_t = r_t + γ Q^W(s_{t+1}, a_{t+1}) - Q^W(s_t, a_t)   (24)
W_{t+1} = W_t + α_W δ_t ∇_W Q^W(s_t, a_t)   (25)
where δ_t is the temporal-difference error of the reward and ∇_W Q^W(s_t, a_t) is the gradient of Q with respect to W.
In this embodiment, the training of the Actor network depends on the Critic network, so accurate evaluation is beneficial to training the Actor network. Therefore, during the training of the Actor network and the Critic network, the target Critic network is updated faster than the target Actor network, so that the Actor network is trained and updated against a better evaluation. The asynchronous update strategy of the Actor network and the Critic network is shown in formulas (26) and (27):
θ′ = τθ + (1 - τ)θ′,  if mod(t, FA) = 0   (26)
W′ = τW + (1 - τ)W′,  if mod(t, FQ) = 0   (27)
where θ′ is the target Actor network parameter, W′ is the target Critic network parameter, τ is the update coefficient, FA and FQ are the update periods of the target Actor network and the target Critic network respectively (in general FQ < FA), and mod(·,·) is the remainder function.
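A sketch of the asynchronous target update of equations (26)-(27) for the PyTorch modules above; τ, FA and FQ are placeholder values chosen so that FQ < FA, i.e. the target Critic is refreshed more often than the target Actor.

```python
def soft_update(target, source, tau=0.01):
    """theta' <- tau*theta + (1 - tau)*theta' (same form for W')."""
    for tp, sp in zip(target.parameters(), source.parameters()):
        tp.data.copy_(tau * sp.data + (1.0 - tau) * tp.data)

def maybe_update_targets(t, actor, critic, target_actor, target_critic,
                         fa=100, fq=10, tau=0.01):
    """Apply eq. (26) every FA steps and eq. (27) every FQ steps (FQ < FA)."""
    if t % fq == 0:
        soft_update(target_critic, critic, tau)   # eq. (27)
    if t % fa == 0:
        soft_update(target_actor, actor, tau)     # eq. (26)
```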
Step A500, letting t = t + 1 and cyclically executing steps A100-A400 until t is greater than the preset number of training steps, obtaining the trained Actor network.
In this embodiment, the data in the preset training data set are selected in sequence to train the Actor network.
Combining the experience replay strategy, the PID-controller-based training strategy of the Actor-Critic reinforcement learning model, the asynchronous update strategy of the Actor and Critic networks, and the random-noise strategy, the invention selects five combinations of training strategies for training and testing, as shown in Table 1.
TABLE 1
[Table 1 is reproduced as images in the original publication; it lists the five training-strategy combinations, with a √ marking the strategies selected in each combination.]
In Table 1, the √ symbol represents a selected strategy. Network training was performed with the Actor-Critic reinforcement learning model under the five training strategies of Table 1, with 2000 training rounds, 600 training steps per round, an experience-pool size of 10000, a batch size of 32, a learning rate of 0.001 for both the Actor and Critic networks, a forgetting (discount) factor γ = 0.9, and the initial target position in the body-fixed coordinate system set to p = [-82, 220] mm. The results are shown in FIG. 7: the network trained with strategy 5 obtains the highest cumulative reward and the shortest control steps, which shows that the supervised training strategy proposed in this section can effectively improve the training effect. Adding random noise is beneficial for improving the generalization ability of the model, but strategy 4 shows slightly degraded performance relative to strategy 5. In the middle and later stages of model training, the supervisory controller already behaves, with respect to the Actor network, like random noise, so it can provide the model with sufficient generalization ability, and the additional random noise becomes an uncertain factor that affects the accuracy of the model.
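For reference, the training settings reported above can be collected into a single configuration that could drive the training-loop skeleton given earlier; the field names are illustrative only.

```python
# Training settings reported for the comparison experiment (names illustrative).
TRAIN_CONFIG = {
    "episodes": 2000,            # training rounds
    "steps_per_episode": 600,    # training steps per round
    "replay_capacity": 10000,    # experience-pool size
    "batch_size": 32,
    "actor_lr": 0.001,
    "critic_lr": 0.001,
    "gamma": 0.9,                # discount / forgetting factor
    "init_target_pos_mm": (-82, 220),   # initial target position, body-fixed frame
}
```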
For the underwater operation robot, stably maintaining a relatively fixed position with respect to the operation target is an important precondition for accurate operation. The operation target is detected and localized by binocular vision, converted into the state representation and input to the policy network, which finally generates the wave frequencies that control the robot to track the target. This is shown in FIG. 8, where the underwater view is taken from an underwater camera, the onshore view from an onshore camera, and the left and right views from the onboard binocular left and right cameras.
2. Wave fin propulsion underwater operation robot tracking control method based on reinforcement learning
The invention discloses a wave fin propulsion underwater operation robot tracking control method based on reinforcement learning, which comprises the following steps:
Step S100, obtaining the system state information of the underwater operation robot at time t and the pose information of the target to be tracked in the body-fixed coordinate system of the underwater operation robot, and constructing the state information s_t of a Markov decision process.
The Actor-Critic reinforcement learning model is a dynamic-programming-style solution of a Markov decision process (MDP), so the practical problem first needs to be converted into an MDP problem and then solved by the reinforcement learning method. Therefore, in this embodiment, the state information s_t of the Markov decision process is constructed from the system state information of the underwater operation robot at time t and the pose information of the target to be tracked in the body-fixed coordinate system of the underwater operation robot.
Step S200, based on s_t, obtaining the wave frequency a_t of the wave fins through the Actor network of the Actor-Critic reinforcement learning model.
In this embodiment, based on the constructed state information s_t of the Markov decision process, the wave frequency of the wave fins is obtained through the offline-trained Actor network, and the propulsion of the underwater robot is controlled to track the target.
Step S300, based on a_t, controlling the wave fins of the underwater operation robot, letting t = t + 1, and returning to step S100.
In this embodiment, the wave fins of the underwater operation robot are controlled based on the wave frequency a_t of the wave fins; letting t = t + 1, the tracking of the target continues.
A wave fin propulsion underwater operation robot tracking control system based on reinforcement learning according to a second embodiment of the invention, as shown in FIG. 3, comprises: a construction module 100, a wave-frequency acquisition module 200 and a circulation module 300;
the construction module 100 is configured to acquire the system state information of the underwater operation robot at time t and the pose information of the target to be tracked in the body-fixed coordinate system of the underwater operation robot, and to construct the state information s_t of a Markov decision process;
the wave-frequency acquisition module 200 is configured to obtain, based on s_t, the wave frequency a_t of the wave fins through the Actor network of the Actor-Critic reinforcement learning model;
the circulation module 300 is configured to control the wave fins of the underwater operation robot based on a_t, let t = t + 1, and jump back to the construction module.
It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working process and related description of the system described above may refer to the corresponding process in the foregoing method embodiment, and details are not described herein again.
It should be noted that the wave fin propulsion underwater operation robot tracking control system based on reinforcement learning provided in the above embodiment is illustrated only with the above division of functional modules. In practical applications, the above functions may be allocated to different functional modules as needed; that is, the modules or steps in the embodiments of the present invention may be further decomposed or combined. For example, the modules of the above embodiment may be combined into one module or further split into a plurality of sub-modules to complete all or part of the functions described above. The names of the modules and steps involved in the embodiments of the present invention are only used to distinguish the modules or steps and are not to be construed as unduly limiting the present invention.
A storage device according to a third embodiment of the present invention stores a plurality of programs adapted to be loaded and executed by a processor to implement the above wave fin propulsion underwater operation robot tracking control method based on reinforcement learning.
A processing apparatus according to a fourth embodiment of the present invention includes a processor and a storage device; the processor is adapted to execute programs; the storage device is adapted to store a plurality of programs; and the programs are adapted to be loaded and executed by the processor to implement the above wave fin propulsion underwater operation robot tracking control method based on reinforcement learning.
It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working processes and related descriptions of the storage device and the processing apparatus described above may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
Those of skill in the art will appreciate that the various illustrative modules and method steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or a combination of both, and that programs corresponding to the software modules and method steps may be located in random access memory (RAM), memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. To clearly illustrate this interchangeability of electronic hardware and software, various illustrative components and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as electronic hardware or software depends upon the particular application and the design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
The terms "first," "second," and the like are used for distinguishing between similar elements and not necessarily for describing or implying a particular order or sequence.
The terms "comprises," "comprising," or any other similar term are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.
So far, the technical solutions of the present invention have been described in connection with the preferred embodiments shown in the drawings, but it is easily understood by those skilled in the art that the scope of the present invention is obviously not limited to these specific embodiments. Equivalent changes or substitutions of related technical features can be made by those skilled in the art without departing from the principle of the invention, and the technical scheme after the changes or substitutions can fall into the protection scope of the invention.

Claims (10)

1. A wave fin propulsion underwater operation robot tracking control method based on reinforcement learning, characterized by comprising the following steps:
Step S100, obtaining the system state information of the underwater operation robot at time t and the pose information of the target to be tracked in the body-fixed coordinate system of the underwater operation robot, and constructing the state information s_t of a Markov decision process;
Step S200, based on s_t, obtaining the wave frequency a_t of the wave fins through an Actor network of an Actor-Critic reinforcement learning model;
Step S300, based on a_t, controlling the wave fins of the underwater operation robot, letting t = t + 1, and returning to step S100;
wherein the Actor-Critic reinforcement learning model comprises the Actor network and a Critic network and is obtained through offline training, the training method comprising:
Step A100, obtaining the system state information of the underwater operation robot at time t in a preset training data set and the pose information of the target to be tracked in the body-fixed coordinate system of the underwater operation robot, and constructing the state information s_t of the Markov decision process;
Step A200, acquiring the supervised training probability and the random supervision probability at time t; if the supervised training probability is greater than the random supervision probability, obtaining the wave frequency a_t of the wave fins through a PID controller based on s_t; otherwise obtaining the wave frequency a_t of the wave fins through the Actor network based on s_t;
Step A300, based on s_t and a_t, obtaining the state-action evaluation value Q*(s_t, a_t) and the reward value r_t through the Critic network of the Actor-Critic reinforcement learning model and a preset reward function, respectively;
Step A400, updating the parameters of the Actor network through a deterministic policy gradient algorithm based on Q*(s_t, a_t); and updating the parameters of the Critic network based on Q*(s_t, a_t) and r_t;
Step A500, letting t = t + 1 and cyclically executing steps A100-A400 until t is greater than a preset number of training steps, obtaining the trained Actor network.
2. The wave fin propulsion underwater operation robot tracking control method based on reinforcement learning according to claim 1, wherein the Actor network comprises four convolutional layers: the first and second convolutional layers have 200 neurons each with the ReLU6 activation function, the third convolutional layer has 10 neurons with the ReLU activation function, and the fourth convolutional layer has 2 neurons with the tanh activation function.
3. The wave fin propulsion underwater operation robot tracking control method based on reinforcement learning according to claim 1, wherein the Critic network comprises five convolutional layers: the first, second and third convolutional layers have 200 neurons each with the ReLU6 activation function, the fourth convolutional layer has 10 neurons with the ReLU activation function, and the fifth convolutional layer has 1 neuron with a linear activation function.
4. The wave fin propulsion underwater operation robot tracking control method based on reinforcement learning according to claim 3, wherein in step A200 the supervised training probability and the random supervision probability at time t are obtained as:
PRO_t = PRO_0 * 0.999^t
PRO_r = max(rand(1), 0.01)
where PRO_t is the supervised training probability at time t, PRO_r is the random supervision probability, and PRO_0 is a preset initial supervised training probability.
5. The wave fin propulsion underwater operation robot tracking control method based on reinforcement learning according to claim 1, wherein in step A200 the wave frequency a_t of the wave fins is obtained by the PID controller as follows: the wave frequency of the wave fins comprises a left-fin wave frequency and a right-fin wave frequency, calculated as:
f_1 = ρ_r1 f_r1 + ρ_r2 f_x1
f_2 = ρ_x1 f_r2 + ρ_x2 f_x2
f_r2 = -f_r1
f_r1 = P_r Δψ + I_r ∫Δψ dt + D_r (dΔψ/dt)
f_x2 = f_x1
f_x1 = P_x Δx + I_x ∫Δx dt + D_x (dΔx/dt)
where f_1, f_2 are the final wave frequencies of the left and right wave fins, f_r1, f_r2 are the yaw-rotation wave frequencies of the left and right wave fins, f_x1, f_x2 are the forward wave frequencies of the left and right wave fins, ρ_r1, ρ_r2, ρ_x1, ρ_x2 are the weighting coefficients of the left and right wave fins for the rotation and forward wave frequencies, P_i, I_i, D_i, i ∈ {r, x} are the PID parameters, Δψ is the relative bearing error, dΔψ/dt is the derivative of the relative bearing error, Δx is the lateral error of the relative position, and dΔx/dt is the derivative of the lateral error of the relative position.
6. The wave fin propulsion underwater operation robot tracking control method based on reinforcement learning according to claim 1, wherein in step A200 the wave frequency a_t of the wave fins is obtained through the Actor network as follows: the wave frequency a_t is the wave frequency with superimposed random noise and satisfies:
a_t ~ N(μ(t), c_t^2)
c_t = max(c_0 * 0.999^t, 0.01)
where c_0 is a preset initial noise standard deviation, c_t is the noise standard deviation at time t, and a_t is random Gaussian noise with the Actor network output μ(t) as its mean and c_t as its standard deviation.
7. The wave fin propulsion underwater operation robot tracking control method based on reinforcement learning according to claim 1, wherein the preset reward function is:
r = r_0 + ρ_1 ||Δψ||_2 + ρ_2 ||ΔP_o||_2 + ν^T ρ_3 ν
where r is the reward value, r_0 is a preset constant-value reward, ||Δψ||_2 is the 2-norm of the relative orientation error, ||ΔP_o||_2 is the 2-norm of the relative position error, ν is the velocity vector, ν^T ρ_3 ν is a regularisation term, and ρ_1, ρ_2, ρ_3 are weight coefficients.
8. A wave fin propulsion underwater operation robot tracking control system based on reinforcement learning, characterized by comprising a construction module, a wave-frequency acquisition module and a circulation module;
the construction module is configured to acquire the system state information of the underwater operation robot at time t and the pose information of the target to be tracked in the body-fixed coordinate system of the underwater operation robot, and to construct the state information s_t of a Markov decision process;
the wave-frequency acquisition module is configured to obtain, based on s_t, the wave frequency a_t of the wave fins through the Actor network of the Actor-Critic reinforcement learning model;
the circulation module is configured to control the wave fins of the underwater operation robot based on a_t, let t = t + 1, and jump back to the construction module.
9. A storage device having a plurality of programs stored therein, wherein the programs are adapted to be loaded and executed by a processor to implement the wave fin propulsion underwater operation robot tracking control method based on reinforcement learning according to any one of claims 1 to 7.
10. A processing device comprising a processor and a storage device, the processor being adapted to execute programs and the storage device being adapted to store a plurality of programs, characterized in that the programs are adapted to be loaded and executed by the processor to implement the wave fin propulsion underwater operation robot tracking control method based on reinforcement learning according to any one of claims 1 to 7.
CN201911077089.0A 2019-11-06 2019-11-06 Wave fin propulsion underwater operation robot tracking control method based on reinforcement learning Active CN111079936B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911077089.0A CN111079936B (en) 2019-11-06 2019-11-06 Wave fin propulsion underwater operation robot tracking control method based on reinforcement learning


Publications (2)

Publication Number Publication Date
CN111079936A CN111079936A (en) 2020-04-28
CN111079936B true CN111079936B (en) 2023-03-14

Family

ID=70310691

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911077089.0A Active CN111079936B (en) 2019-11-06 2019-11-06 Wave fin propulsion underwater operation robot tracking control method based on reinforcement learning

Country Status (1)

Country Link
CN (1) CN111079936B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111856936B (en) * 2020-07-21 2023-06-02 天津蓝鳍海洋工程有限公司 Control method for cabled underwater high-flexibility operation platform
CN112124537B (en) * 2020-09-23 2021-07-13 哈尔滨工程大学 Intelligent control method for underwater robot for autonomous absorption and fishing of benthos
CN112597693A (en) * 2020-11-19 2021-04-02 沈阳航盛科技有限责任公司 Self-adaptive control method based on depth deterministic strategy gradient
CN112462792B (en) * 2020-12-09 2022-08-09 哈尔滨工程大学 Actor-Critic algorithm-based underwater robot motion control method
CN112819379A (en) * 2021-02-28 2021-05-18 广东电力交易中心有限责任公司 Risk preference information acquisition method and system, electronic device and storage medium
CN114114896B (en) * 2021-11-08 2024-01-05 北京机电工程研究所 PID parameter design method based on path integration
CN113977583B (en) * 2021-11-16 2023-05-09 山东大学 Robot rapid assembly method and system based on near-end strategy optimization algorithm
CN114995468B (en) * 2022-06-06 2023-03-31 南通大学 Intelligent control method of underwater robot based on Bayesian depth reinforcement learning

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2008305064A (en) * 2007-06-06 2008-12-18 Japan Science & Technology Agency Learning type control device and method thereof
CN105549384A (en) * 2015-09-01 2016-05-04 中国矿业大学 Inverted pendulum control method based on neural network and reinforced learning
CN108008627A (en) * 2017-12-13 2018-05-08 中国石油大学(华东) A kind of reinforcement learning adaptive PID control method of parallel optimization
CN109068391A (en) * 2018-09-27 2018-12-21 青岛智能产业技术研究院 Car networking communication optimization algorithm based on edge calculations and Actor-Critic algorithm
CN109760046A (en) * 2018-12-27 2019-05-17 西北工业大学 Robot for space based on intensified learning captures Tum bling Target motion planning method
CN109769119A (en) * 2018-12-18 2019-05-17 中国科学院深圳先进技术研究院 A kind of low complex degree vision signal code processing method
CN110322017A (en) * 2019-08-13 2019-10-11 吉林大学 Automatic Pilot intelligent vehicle Trajectory Tracking Control strategy based on deeply study
CN110333739A (en) * 2019-08-21 2019-10-15 哈尔滨工程大学 A kind of AUV conduct programming and method of controlling operation based on intensified learning


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
An adaptive cruise control algorithm based on deep reinforcement learning; Han Xiangmin et al.; Computer Engineering; full text *
Navigation control of an unmanned surface vehicle based on deep reinforcement learning; Zhang Fashuai et al.; Metrology & Measurement Technology; full text *

Also Published As

Publication number Publication date
CN111079936A (en) 2020-04-28

Similar Documents

Publication Publication Date Title
CN111079936B (en) Wave fin propulsion underwater operation robot tracking control method based on reinforcement learning
CN108803321B (en) Autonomous underwater vehicle track tracking control method based on deep reinforcement learning
Long et al. Deep-learned collision avoidance policy for distributed multiagent navigation
CN107748566B (en) Underwater autonomous robot fixed depth control method based on reinforcement learning
WO2021103392A1 (en) Confrontation structured control-based bionic robotic fish motion control method and system
Wang et al. Dynamic tanker steering control using generalized ellipsoidal-basis-function-based fuzzy neural networks
CN110928189B (en) Robust control method based on reinforcement learning and Lyapunov function
Li et al. Visual servo regulation of wheeled mobile robots with simultaneous depth identification
CN109655066A (en) One kind being based on the unmanned plane paths planning method of Q (λ) algorithm
CN115016496A (en) Water surface unmanned ship path tracking method based on deep reinforcement learning
Precup et al. Grey wolf optimizer-based approaches to path planning and fuzzy logic-based tracking control for mobile robots
Wang et al. Adaptive and extendable control of unmanned surface vehicle formations using distributed deep reinforcement learning
Scorsoglio et al. Image-based deep reinforcement learning for autonomous lunar landing
Song et al. Guidance and control of autonomous surface underwater vehicles for target tracking in ocean environment by deep reinforcement learning
Khalaji et al. Lyapunov-based formation control of underwater robots
Fang et al. Autonomous underwater vehicle formation control and obstacle avoidance using multi-agent generative adversarial imitation learning
Zhuang et al. Motion control and collision avoidance algorithms for unmanned surface vehicle swarm in practical maritime environment
Meng et al. A Fully-Autonomous Framework of Unmanned Surface Vehicles in Maritime Environments Using Gaussian Process Motion Planning
Yao et al. Multi-USV cooperative path planning by window update based self-organizing map and spectral clustering
Hadi et al. Adaptive formation motion planning and control of autonomous underwater vehicles using deep reinforcement learning
Song et al. Surface path tracking method of autonomous surface underwater vehicle based on deep reinforcement learning
Wang et al. Adversarial deep reinforcement learning based robust depth tracking control for underactuated autonomous underwater vehicle
Jiang et al. Learning decentralized control policies for multi-robot formation
Lee et al. PSO-FastSLAM: An improved FastSLAM framework using particle swarm optimization
CN115755603A (en) Intelligent ash box identification method for ship motion model parameters and ship motion control method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant