CN111079936A - Wave fin propulsion underwater operation robot tracking control method based on reinforcement learning - Google Patents

Wave fin propulsion underwater operation robot tracking control method based on reinforcement learning

Info

Publication number
CN111079936A
CN111079936A
Authority
CN
China
Prior art keywords
wave
underwater operation
reinforcement learning
fin
operation robot
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911077089.0A
Other languages
Chinese (zh)
Other versions
CN111079936B (en)
Inventor
王宇
唐冲
王睿
王硕
谭民
马睿宸
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Automation of Chinese Academy of Science
Original Assignee
Institute of Automation of Chinese Academy of Science
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Automation of Chinese Academy of Science filed Critical Institute of Automation of Chinese Academy of Science
Priority to CN201911077089.0A priority Critical patent/CN111079936B/en
Publication of CN111079936A publication Critical patent/CN111079936A/en
Application granted granted Critical
Publication of CN111079936B publication Critical patent/CN111079936B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 - Machine learning
    • B - PERFORMING OPERATIONS; TRANSPORTING
    • B63 - SHIPS OR OTHER WATERBORNE VESSELS; RELATED EQUIPMENT
    • B63C - LAUNCHING, HAULING-OUT, OR DRY-DOCKING OF VESSELS; LIFE-SAVING IN WATER; EQUIPMENT FOR DWELLING OR WORKING UNDER WATER; MEANS FOR SALVAGING OR SEARCHING FOR UNDERWATER OBJECTS
    • B63C11/00 - Equipment for dwelling or working underwater; Means for searching for underwater objects
    • B63C11/52 - Tools specially adapted for working underwater, not otherwise provided for

Abstract

The invention belongs to the field of autonomous control of underwater operation robots, and particularly relates to a method, a system and a device for tracking control of a wave-fin-propelled underwater operation robot based on reinforcement learning, aiming at solving the problem of low target tracking accuracy caused by the poor convergence and stability of the Actor network during training. The method comprises: obtaining system state information of the underwater operation robot at time t and pose information of the target to be tracked in the robot's body-fixed coordinate system, and constructing the state information s_t of a Markov decision process; based on s_t, obtaining the wave frequency a_t of the wave fins through the Actor network in an Actor-Critic reinforcement learning model; based on a_t, controlling the wave fins of the underwater operation robot, and looping with t = t + 1. The invention supervises the Actor network training through a PID controller, which improves the stability and convergence of the network and improves the accuracy of target tracking.

Description

Wave fin propulsion underwater operation robot tracking control method based on reinforcement learning
Technical Field
The invention belongs to the field of autonomous control of underwater operation robots, and particularly relates to a method, a system and a device for tracking and controlling an underwater operation robot propelled by a wave fin based on reinforcement learning.
Background
Autonomous control of underwater robots is a hotspot and a difficulty of current research. As mankind moves from ocean exploration to ocean development, the autonomous control and autonomous operation of underwater work robots face new challenges. Autonomous operation of underwater operation robots is of great significance for underwater archaeology, underwater salvage, underwater rescue, underwater engineering and the like: such a robot can operate in place of a diver or a remotely operated vehicle (ROV), work underwater continuously for long periods, and improve the efficiency of underwater operations.
Generally, due to the irregular shape of the underwater operation robot and the complexity of the underwater environment, it is difficult to establish an accurate hydrodynamic model for the robot, so model-based control methods adapt poorly. Reinforcement learning, which gives an action to execute based on the current state of the system and then transitions to the next state, is a typical model-free control method with strong adaptability to complex underwater environments and unknown disturbances. However, reinforcement learning training is experience learning based on data, and successful experience is important for training. In the initial training stage of reinforcement learning, because the output is poor and successfully explored control behaviors are to some extent accidental, the successful experience in the database is insufficient; as a result, the Actor network in the reinforcement learning model converges slowly, the learning efficiency is low, and the accuracy of subsequent tracking control is directly affected.
Disclosure of Invention
In order to solve the above problem in the prior art, namely the low target tracking accuracy of conventional reinforcement-learning-based tracking control methods caused by the poor convergence and stability of the Actor network in the reinforcement learning model during training, a first aspect of the present invention provides a tracking control method for a wave fin propulsion underwater work robot based on reinforcement learning, the method comprising:
s100, obtaining system state information of the underwater operation robot at the time t and pose information of a target to be tracked under an underwater operation robot random coordinate system, and constructing state information S of a Markov decision processt
Step S200, based on StAcquiring the fluctuation frequency a of the fluctuation fin through an Actor network in an Actor-critical reinforcement learning modelt
Step S300, based on atControlling a wave fin of the underwater operation robot, and jumping to a construction module when t is t + 1;
the Actor-criticic reinforcement learning model comprises an Actor network and a criticic network, and is obtained through offline training, and the training method comprises the following steps:
a100, acquiring a training data set; and constructing state information s of Markov decision process based on system state information of the underwater operation robot at the time t in the training data set and pose information of the target to be tracked in the underwater operation robot random coordinate systemt
Step A200, acquiring a supervised training probability and a random supervised probability at the moment t, and if the supervised training probability is greater than the random supervised probability, basing on stObtaining the fluctuation frequency a of the fluctuation fin through a PID controllertOtherwise based on stAcquiring the fluctuation frequency a of the fluctuation fin through an Actor networkt
Step A300, based on stAnd atRespectively obtaining the state-action evaluation value Q through a Critic network and a preset reward function in the Actor-Critic reinforcement learning model*(st,at) Prize value rt
Step A400, based on Q*(st,at) By deterministic policy gradientsUpdating the parameters of the Actor network by an algorithm; and based on Q*(st,at)、rtUpdating the parameters of the Critic network;
and step a500, circularly executing steps a100 to a400 by making t equal to t +1 until t is greater than a preset training time, so as to obtain a trained Actor network.
In some preferred embodiments, the Actor network comprises four convolutional layers; the number of neurons in the first convolutional layer and the second convolutional layer is 200, the activation function is the Relu6 function, the number of neurons in the third convolutional layer is 10, the activation function is the Relu function, the number of neurons in the fourth convolutional layer is 2, and the activation function is the tanh function.
In some preferred embodiments, the Critic network comprises five convolutional layers, the number of neurons in the first, second and third convolutional layers is 200, the activation function is the Relu6 function, the number of neurons in the fourth convolutional layer is 10, the activation function is the Relu function, the number of neurons in the fifth convolutional layer is 1, and the activation function is a linear activation function.
In some preferred embodiments, in step A200, the supervised training probability and the random supervised probability at time t are obtained as:

$$\mathrm{PRO}_t = \mathrm{PRO}_0 \times 0.999^t$$
$$\mathrm{PRO}_r = \max(\mathrm{rand}(1),\ 0.01)$$

where PRO_t is the supervised training probability at time t, PRO_r is the random supervised probability, and PRO_0 is a preset initial supervised training probability.
In some preferred embodiments, in step A200, "obtaining the wave frequency a_t of the wave fins through a PID controller" proceeds as follows: the wave frequency of the wave fins comprises a left-fin wave frequency and a right-fin wave frequency, calculated as

$$f_{r1} = P_r\,\Delta\psi + I_r \int \Delta\psi\,dt + D_r\,\Delta\dot{\psi}$$
$$f_{r2} = -f_{r1}$$
$$f_{x1} = P_x\,\Delta x + I_x \int \Delta x\,dt + D_x\,\Delta\dot{x}$$
$$f_{x2} = f_{x1}$$
$$f_1 = \rho_{r1} f_{r1} + \rho_{r2} f_{x1}$$
$$f_2 = \rho_{x1} f_{r2} + \rho_{x2} f_{x2}$$

where f_1, f_2 are the final wave frequencies of the left and right fins; f_{r1}, f_{r2} are the wave frequencies for yaw rotation of the left and right fins; f_{x1}, f_{x2} are the wave frequencies for forward motion of the left and right fins; ρ_{r1}, ρ_{r2}, ρ_{x1}, ρ_{x2} are the weight coefficients of the rotation and forward wave frequencies for the left and right fins; P_i, I_i, D_i with i ∈ {r, x} are the PID parameters; Δψ is the relative bearing error and Δψ̇ its differential; Δx is the lateral error of the relative position and Δẋ its differential.
In some preferred embodiments, in step A200, "obtaining the wave frequency a_t of the wave fins through the Actor network" proceeds as follows: the wave frequency a_t is a wave frequency with superimposed random noise, satisfying

$$a_t \leftarrow \mu(t), \qquad \mu(t) \sim \mathcal{N}(a_t,\ c_t^2)$$
$$c_t = \max(c_0 \times 0.999^t,\ 0.01)$$

where c_0 is a preset initial noise standard deviation, c_t is the noise standard deviation at time t, and μ(t) is random Gaussian noise with mean a_t and standard deviation c_t, 𝒩(·,·) denoting the noise distribution.
In some preferred embodiments, the preset reward function is:

$$r = r_0 - \rho_1\,\|\Delta\psi\|_2 - \rho_2\,\|\Delta P_o\|_2 - \nu^T \rho_3\,\nu$$

where r is the reward value, r_0 is a preset constant reward, ‖Δψ‖_2 is the 2-norm of the relative orientation error, ‖ΔP_o‖_2 is the 2-norm of the relative position error, ν is the velocity vector, ν^T ρ_3 ν is a regularization term, and ρ_1, ρ_2, ρ_3 are weight coefficients.
A second aspect of the invention provides a wave fin propulsion underwater operation robot tracking control system based on reinforcement learning, which comprises a construction module, a wave frequency acquisition module and a circulation module;
the construction module is configured to obtain system state information of the underwater operation robot at time t and pose information of the target to be tracked in the robot's body-fixed coordinate system, and to construct the state information s_t of a Markov decision process;
the wave frequency acquisition module is configured to obtain, based on s_t, the wave frequency a_t of the wave fins through the Actor network in an Actor-Critic reinforcement learning model;
the circulation module is configured to control the wave fins of the underwater operation robot based on a_t, set t = t + 1, and jump to the construction module.
In a third aspect of the present invention, a storage device is provided, in which a plurality of programs are stored, the programs being adapted to be loaded and executed by a processor to implement the wave fin propulsion underwater operation robot tracking control method based on reinforcement learning described above.
In a fourth aspect of the present invention, a processing apparatus is provided, comprising a processor adapted to execute programs and a storage device adapted to store a plurality of programs, the programs being adapted to be loaded and executed by the processor to implement the wave fin propulsion underwater operation robot tracking control method based on reinforcement learning described above.
The invention has the beneficial effects that:
according to the invention, the PID controller is used for monitoring the Actor network training, so that the stability and convergence of the reinforcement learning model are improved, and the target tracking precision is improved. In the initial stage of reinforcement learning, the fluctuation frequency of the fluctuation fin is generated through the PID controller, a large amount of control experience is generated, effective supervision on an Actor network is achieved, and the fluctuation frequency of the fluctuation fin is generated through the Actor network in the later stage. A large amount of control experience is effectively evaluated according to the Critic network, and then the Actor network is updated through a deterministic strategy gradient algorithm, so that the convergence of the reinforcement learning model is accelerated, and the stability of the reinforcement learning model is improved.
Meanwhile, the invention combines several different strategies to train the Actor network, which improves the generalization ability of the reinforcement learning model and the accuracy of target tracking.
Drawings
Other features, objects and advantages of the present application will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings.
FIG. 1 is a schematic flow chart of a wave fin propulsion underwater operation robot tracking control method based on reinforcement learning according to an embodiment of the invention;
FIG. 2 is a schematic flow chart of a training method of an Actor-Critic reinforcement learning model according to an embodiment of the present invention;
FIG. 3 is a frame diagram of a wave fin propulsion underwater operation robot tracking control system based on reinforcement learning according to an embodiment of the invention;
FIG. 4 is an exemplary diagram of a wave fin propelled underwater work robot target tracking control in accordance with one embodiment of the present invention;
FIG. 5 is an exemplary diagram of the PID-controller-based Actor-Critic network update according to an embodiment of the invention;
FIG. 6 is an exemplary diagram of the architecture of an Actor network and a Critic network in accordance with one embodiment of the invention;
FIG. 7 is an exemplary diagram comparing different training strategies according to one embodiment of the invention;
FIG. 8 is an exemplary diagram of a stable tracking of a moving object according to one embodiment of the invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The present application will be described in further detail with reference to the following drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the relevant invention and not restrictive of the invention. It should be noted that, for convenience of description, only the portions related to the related invention are shown in the drawings.
It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict.
The invention discloses a wave fin propulsion underwater operation robot tracking control method based on reinforcement learning, which, as shown in FIGS. 1 and 2, comprises the following steps:
s100, obtaining system state information of the underwater operation robot at the time t and pose information of a target to be tracked under an underwater operation robot random coordinate system, and constructing state information S of a Markov decision processt
Step S200, based on StAcquiring the fluctuation frequency a of the fluctuation fin through an Actor network in an Actor-critical reinforcement learning modelt
Step S300, based on atControlling a wave fin of the underwater operation robot, and jumping to a construction module when t is t + 1;
the Actor-criticic reinforcement learning model comprises an Actor network and a criticic network, and is obtained through offline training, and the training method comprises the following steps:
a100, acquiring a training data set; and constructing state information s of Markov decision process based on system state information of the underwater operation robot at the time t in the training data set and pose information of the target to be tracked in the underwater operation robot random coordinate systemt
Step A200, acquiring a supervised training probability and a random supervised probability at the moment t, and if the supervised training probability is greater than the random supervised probability, basing on stObtaining the fluctuation frequency a of the fluctuation fin through a PID controllertOtherwise based on stAcquiring the fluctuation frequency a of the fluctuation fin through an Actor networkt
Step A300, based on stAnd atRespectively obtaining the state-action evaluation value Q through a Critic network and a preset reward function in the Actor-Critic reinforcement learning model*(st,at) Prize value rt
Step A400, based on Q*(st,at) Updating parameters of the Actor network through a deterministic strategy gradient algorithm; and based on Q*(st,at)、rtUpdating the parameters of the Critic network;
and step a500, circularly executing steps a100 to a400 by making t equal to t +1 until t is greater than a preset training time, so as to obtain a trained Actor network.
In order to more clearly describe the tracking control method of the wave fin propulsion underwater operation robot based on reinforcement learning, the following describes the steps of an embodiment of the method in detail with reference to the attached drawings.
In the following, the training method of the Actor-Critic reinforcement learning model is introduced first, and then the reinforcement-learning-based tracking control method for the wave fin propulsion underwater operation robot, which obtains the wave frequency of the wave fins through the trained Actor network, is described.
1. Training method of the Actor-Critic reinforcement learning model
Step A100, obtaining the system state information of the underwater operation robot at time t in a preset training data set and the pose information of the target to be tracked in the robot's body-fixed coordinate system, and constructing the state information s_t of a Markov decision process.
Since the robot has self-stability in the roll and pitch directions, the body is assumed to have 4 degrees of freedom of motion, and the motion state can be defined as χ = [x, y, z, ψ]^T, where x, y, z are the three-dimensional coordinates of the underwater operation robot, ψ is the yaw angle, and T denotes transposition. The corresponding velocity vector is ν = [u, v, w, r_z]^T, where u is the forward velocity, v the lateral velocity, w the vertical velocity, and r_z the rotational velocity. The dynamic equation of the underwater operation robot system is then

$$\dot{\chi} = f(\chi, U) + \xi \qquad (1)$$

where U is the system control quantity, ξ is the unknown disturbance, and χ̇ is the system state differential.
The target object to be tracked is detected and localized through onboard binocular vision, so its visible range is limited in the body-fixed coordinate system. In FIG. 4, the tetrahedron O_{c1}-ABCD is the effective detection range of binocular vision, the front and rear cabins are the front and rear hull bodies of the robot, O_B X_B Y_B Z_B is the body-fixed coordinate system, and O_{c1} X_{c1} Y_{c1} Z_{c1} is the camera coordinate system. P_c, the center position of the workspace, is the target of the body control, which keeps the target object in the central area of the field of view, within the workspace A_1B_1C_1D_1-A_2B_2C_2D_2 shown, to facilitate the grasping operation of the robotic arm. Because the operation target is small, its attitude need not be considered at present. The above target-approaching tracking control problem is therefore

$$|g(x_o, y_o, z_o, \psi_o) - g(\chi)| < \varepsilon \qquad (2)$$

where x_o, y_o, z_o is the position of the target to be tracked, ψ_o is the azimuth of the target to be tracked, and ε is a preset fault tolerance range.
In this embodiment, the objective of the body control is to make the underwater operation robot stably track the target object so that the target is stabilized within a specified area relative to the robot. Considering the motion control of the robot in the two-dimensional plane, the position and deflection angle of the target to be tracked in the robot's body-fixed coordinate system are P_o = [x_o, y_o, ψ_o]^T, and the center position of the workspace is P_c = [x_c, y_c, ψ_c]^T. In addition, the state needs to include u, r_z and g, where g ∈ {0, 1} indicates whether the target has been approached. The robot generally does not need lateral motion during tracking, so the state information of the Markov decision process (MDP) can be defined as

$$s = [g,\ x_o - x_c,\ y_o - y_c,\ \psi_o - \psi_c,\ u,\ w,\ r_z]^T \qquad (3)$$

which can be abbreviated as

$$s = [g,\ \Delta X,\ \Delta Y,\ \Delta\psi,\ \nu^T]^T \qquad (4)$$

where ΔX, ΔY, Δψ are the X-axis position difference, the Y-axis position difference, and the deflection angle difference, (ΔX, ΔY) is the position difference, and ν is the velocity vector.
Meanwhile, in order to eliminate the correlation of the training data in the preset training data set, an experience pool needs to be constructed for Experience Replay, which improves the training effect. At its current system state s_t, the underwater operation robot takes the action a_t, i.e. the wave frequency of the wave fins, then observes the state information s_{t+1} at the next instant and obtains the reward value r_t at the current moment; the state-transition tuple {s_t, a_t, s_{t+1}, r_t} denotes one experience. A certain amount of data is then drawn from the historical experience base for training, as sketched below.
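As an illustration only, a minimal experience pool of this kind might be sketched in Python as follows; the class name and the store/sample interface are our assumptions, while the capacity 10000 and batch size 32 follow the experimental settings reported later in this description.

```python
import random
from collections import deque

class ReplayBuffer:
    """Experience pool of state-transition tuples {s_t, a_t, s_{t+1}, r_t}."""

    def __init__(self, capacity=10000):  # 10000 matches the experience-base size used in the experiments
        self.buffer = deque(maxlen=capacity)  # oldest experiences are discarded automatically

    def store(self, s_t, a_t, s_next, r_t):
        # One interaction with the robot yields one experience tuple.
        self.buffer.append((s_t, a_t, s_next, r_t))

    def sample(self, batch_size=32):
        # Uniform random sampling breaks the temporal correlation of the training data.
        return random.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)
```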
Step A200, acquiring the supervised training probability and the random supervised probability at time t; if the supervised training probability is greater than the random supervised probability, obtaining the wave frequency a_t of the wave fins through a PID controller based on s_t, otherwise obtaining a_t through the Actor network based on s_t.
Reinforcement learning training is experience learning based on data, and successful experience is important for training the network. In the initial training stage of reinforcement learning, because the outputs of the evaluation (Critic) network and the policy (Actor) network are poor and successfully explored control behaviors are to some extent accidental, the successful experience in the database is insufficient, leading to slow convergence, low learning efficiency, and even training divergence. As the old saying goes, "the master leads you through the door, but the practice is up to you." The invention constructs a supervisory controller to play the role of the master: it dominates the initial stage of reinforcement learning, generates a large amount of successful control experience, and effectively supervises and guides the policy network so that it produces successful control experience of its own, then gradually hands over to a self-learning stage. During self-learning, the supervisory controller interferes with the reinforcement learning with a small probability, perturbing the action generated by the policy network (the wave frequency of the wave fins) to further optimize the policy network, until "the blue surpasses the indigo from which it came", after which the supervisory controller no longer supervises.
The supervisory controller employed here is a PID controller based on state observation, constructed as shown in equations (5)-(10):

$$f_{r1} = P_r\,\Delta\psi + I_r \int \Delta\psi\,dt + D_r\,\Delta\dot{\psi} \qquad (5)$$
$$f_{r2} = -f_{r1} \qquad (6)$$
$$f_{x1} = P_x\,\Delta x + I_x \int \Delta x\,dt + D_x\,\Delta\dot{x} \qquad (7)$$
$$f_{x2} = f_{x1} \qquad (8)$$
$$f_1 = \rho_{r1} f_{r1} + \rho_{r2} f_{x1} \qquad (9)$$
$$f_2 = \rho_{x1} f_{r2} + \rho_{x2} f_{x2} \qquad (10)$$

where P_i, I_i, D_i with i ∈ {r, x} are the PID parameters, f_{r1}, f_{r2} are the wave frequencies for yaw rotation of the left and right fins, f_{x1}, f_{x2} are the wave frequencies for forward motion of the left and right fins, f_1, f_2 are the final wave frequencies of the left and right fins, and ρ_{r1}, ρ_{r2}, ρ_{x1}, ρ_{x2} are the weight coefficients of the rotation and forward wave frequencies for the left and right fins.
Following the idea that "the master leads you through the door, but the practice is up to you", the supervision strategy designed by the invention is shown in equations (11)-(13):

$$\mathrm{PRO}_t = \mathrm{PRO}_0 \times 0.999^t \qquad (11)$$
$$\mathrm{PRO}_r = \max(\mathrm{rand}(1),\ 0.01) \qquad (12)$$
$$a_t = \begin{cases} \mu_\theta(s_t), & \mathrm{PRO}_r > \mathrm{PRO}_t \\ a_t^{\mathrm{PID}}, & \text{otherwise} \end{cases} \qquad (13)$$

where PRO_0 is a preset initial supervised training probability, and PRO_t is the supervised training probability at time t, which decreases exponentially as the number of training steps increases. PRO_r is the random supervised probability: when its value is greater than the supervised training probability PRO_t, the Actor network outputs the wave frequency of the wave fins; otherwise the supervisory PID controller outputs the wave frequency of the wave fins.
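A minimal sketch of this supervision schedule in Python (the initial probability value and the function interface are assumptions; np.random.rand() stands in for rand(1)):

```python
import numpy as np

def select_action(s_t, t, actor, pid, pro_0=1.0):
    """Choose between the supervisory PID controller and the Actor network, eqs. (11)-(13)."""
    pro_t = pro_0 * 0.999 ** t            # supervised training probability, decays exponentially (11)
    pro_r = max(np.random.rand(), 0.01)   # random supervised probability (12)
    if pro_r > pro_t:
        return actor(s_t)                 # self-learning: Actor network outputs the fin frequencies
    return pid(s_t)                       # supervision: PID controller outputs the fin frequencies
```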
A schematic diagram of the training process of the PID-controller-based Actor-Critic reinforcement learning model, i.e. steps A100 to A500, is shown in FIG. 5, where n(a_t, ε) is the random noise of a_t and z^{-1} is a time delay.
One of the key points of reinforcement learning is the trade-off between exploration and exploitation. The control strategy output by the Actor network is built on empirical data, and only sufficient data can give the Actor network sufficient generalization capability. The invention therefore superimposes random noise on the output of the Actor network to train its generalization capability, as shown in equations (14) and (15):

$$c_t = \max(c_0 \times 0.999^t,\ 0.01) \qquad (14)$$
$$a_t \leftarrow \mu(t), \qquad \mu(t) \sim \mathcal{N}(a_t,\ c_t^2) \qquad (15)$$

where c_0 is the initial noise standard deviation, c_t is the noise standard deviation at time t, which decreases exponentially with the number of training steps, and μ(t) is random Gaussian noise with mean a_t and standard deviation c_t, 𝒩(·,·) denoting the noise distribution.
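A sketch of this exploration-noise schedule in Python; the clipping to [-1, 1] is our assumption, based on the tanh-normalized Actor output described below.

```python
import numpy as np

def noisy_action(a_t, t, c_0=1.0):
    """Superimpose decaying Gaussian exploration noise on the Actor output, eqs. (14)-(15)."""
    c_t = max(c_0 * 0.999 ** t, 0.01)             # noise standard deviation, decays exponentially (14)
    noisy = np.random.normal(loc=a_t, scale=c_t)  # Gaussian noise with mean a_t, std c_t (15)
    return np.clip(noisy, -1.0, 1.0)              # keep within the tanh-normalized action range (assumed)
```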
Step A300, based on s_t and a_t, obtaining the state-action evaluation value Q*(s_t, a_t) and the reward value r_t through the Critic network in the Actor-Critic reinforcement learning model and a preset reward function, respectively.
The Markov decision process of reinforcement learning comprises four parts: a state space S, an action space A, a reward function r(s, a): S × A → R, and a state transition probability distribution p. Under the Markov property, the state of the system at the next moment is determined only by the current state and action, i.e. p(x_t | x_{t-1}, a_{t-1}). The cumulative reward function is defined as the discount-weighted combination of future rewards:

$$R_t = \sum_{i=t}^{N} \gamma^{\,i-t}\, r(s_i, a_i) \qquad (16)$$

where γ ∈ (0, 1] is the discount factor, R_t is the cumulative reward starting at time t, r(s_i, a_i) is the reward value, i is the index, and N is the total number of time steps.
In the learning process, the network trained by reinforcement learning interacts with the robot: it selects a_t ∈ A in the action space, the system transitions from state x_t ∈ X to x_{t+1} ∈ X, and a reward r_t is obtained at time t to evaluate the merit of sampling a_t in state x_t; the reward function guides whether the goal is completed and whether the behavior is optimal. The goal of the control problem is therefore to obtain the optimal policy π* that receives the maximum reward J*:

$$J^* = \max_{\pi \in P} J(\pi) = \max_{\pi \in P} \mathbb{E}_\pi[R_t] \qquad (17)$$

where P is the policy space, E_π denotes the average expectation under a policy, J(π) is the reward value corresponding to the policy, π is the policy, and the control quantity of the final system is u = a.
The key to the above problem is therefore how to define the state of the MDP and the single-step reward function. The state information of the MDP is defined as in equations (3) and (4). The reward function needs to comprehensively consider the position error and the deflection angle error, expecting the robot to adjust its heading first and then approach the target; the power consumption of the robot must also be considered. Combining these indices, the reward function is defined as equation (18):
$$r = r_0 - \rho_1\,\|\Delta\psi\|_2 - \rho_2\,\|\Delta P_o\|_2 - \nu^T \rho_3\,\nu \qquad (18)$$

where r is the reward value; r_0 is a preset constant reward, equal to 1 when the target task is completed and 0 otherwise; ‖Δψ‖_2 is the 2-norm of the relative orientation error; ‖ΔP_o‖_2 is the 2-norm of the relative position error; ν^T ρ_3 ν is a regularization term used to reduce the energy consumption of the system; and ρ_1, ρ_2, ρ_3 are weight coefficients.
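A sketch of this reward computation in Python; the weight values are placeholders rather than values from the patent, and ρ_3 is treated as a scalar (an assumption, since ν^T ρ_3 ν could equally use a weight matrix).

```python
import numpy as np

def reward(d_psi, d_pos, nu, done, r_0=1.0, rho_1=0.1, rho_2=0.1, rho_3=0.01):
    """Single-step reward of eq. (18): orientation error, position error and an energy penalty."""
    r_goal = r_0 if done else 0.0             # constant reward when the target task is completed
    r_psi = rho_1 * np.linalg.norm(d_psi)     # relative orientation error, 2-norm
    r_pos = rho_2 * np.linalg.norm(d_pos)     # relative position error, 2-norm
    r_energy = rho_3 * float(np.dot(nu, nu))  # nu^T rho_3 nu with scalar rho_3 (assumed)
    return r_goal - r_psi - r_pos - r_energy
```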
In solving the MDP problem, the classical value function is the state-action value function (also called the Q function), defined as

$$Q^\pi(s, a) = \mathbb{E}_\pi\!\left[\sum_{k=0}^{K} \gamma^{\,k}\, r(s_{t+k}, a_{t+k}) \,\middle|\, s_t = s,\ a_t = a\right] \qquad (19)$$

where K is the total number of steps and k is the current step.
Starting from state s and executing the optimal policy π*, Q*(s, a) represents the maximum cumulative reward obtained, so the optimal value function satisfies the Bellman optimality principle:

$$Q^*(s_t, a_t) = \mathbb{E}\!\left[r(s_t, a_t) + \gamma \max_{a_{t+1}} Q^*(s_{t+1}, a_{t+1})\right] \qquad (20)$$

with a_t = π*(s_t). Once the optimal Q* is obtained through the reinforcement learning iterative training, the optimal policy follows as π*(s) = arg max_a Q*(s, a).
In the present invention, the structures of the Actor network and the Critic network are shown in FIG. 6. The Actor network is shown in FIG. 6(a): the numbers of neurons in hidden layer 1 and hidden layer 2 are 200 with the Relu6 activation function, the number of neurons in hidden layer 3 is 10 with the Relu activation function, and the final output layer has 2 neurons, with the tanh activation function normalizing the output to [-1, 1]; the hidden layers are convolutional layers. The Critic network is shown in FIG. 6(b); its input comprises the state s and the action a. The numbers of neurons in hidden layers 0 and 1 are 200; the state s and the action a pass through convolutional layer 0 and convolutional layer 1 respectively, are summed and fused, pass through a Relu6 activation function, and are then input into hidden layer 2, which has 200 neurons and the Relu6 activation function. Hidden layer 3 has 10 neurons with the Relu activation function, and the final output layer has 1 neuron with a linear activation function, outputting the state-action evaluation value Q. In FIG. 6(b), hidden layers 0, 1, 2, 3, 4 are the first, second, third, fourth and fifth convolutional layers, respectively.
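The patent describes the hidden layers as convolutional; purely for illustration, the sketch below uses fully connected layers of the same widths and activations (an assumption on our part), with the state dimension taken from equation (3).

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Policy network per FIG. 6(a): widths 200-200-10-2, activations Relu6/Relu6/Relu/tanh."""
    def __init__(self, state_dim=7, action_dim=2):  # state_dim follows eq. (3); assumed here
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 200), nn.ReLU6(),
            nn.Linear(200, 200), nn.ReLU6(),
            nn.Linear(200, 10), nn.ReLU(),
            nn.Linear(10, action_dim), nn.Tanh(),   # output normalized to [-1, 1]
        )

    def forward(self, s):
        return self.net(s)

class Critic(nn.Module):
    """Evaluation network per FIG. 6(b): separate state/action branches, summed, then 200-10-1."""
    def __init__(self, state_dim=7, action_dim=2):
        super().__init__()
        self.s_branch = nn.Linear(state_dim, 200)   # hidden layer 0
        self.a_branch = nn.Linear(action_dim, 200)  # hidden layer 1
        self.head = nn.Sequential(
            nn.ReLU6(),                             # activation after the sum fusion
            nn.Linear(200, 200), nn.ReLU6(),        # hidden layer 2
            nn.Linear(200, 10), nn.ReLU(),          # hidden layer 3
            nn.Linear(10, 1),                       # linear output: state-action value Q
        )

    def forward(self, s, a):
        return self.head(self.s_branch(s) + self.a_branch(a))  # sum-fuse the two branches
```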
Step A400, updating the parameters of the Actor network through a deterministic policy gradient algorithm based on Q*(s_t, a_t), and updating the parameters of the Critic network based on Q*(s_t, a_t) and r_t.
In the tracking control of the underwater operation robot, the system state and control quantities of the robot are continuous variables, and the MDP state is also continuous, so the MDP state space and action space are continuous domains. For reinforcement learning over a continuous domain, the policy gradient (PG) method is widely applied; its core idea is to change the parameters of the policy function in the direction that maximizes the reward function so as to improve the performance of the control policy. The PG method adopts an independent approximate function and is a stochastic policy scheme with low computational efficiency. To solve this problem, a deterministic state-action mapping function a = μ_θ(s) is adopted, i.e. the deterministic policy gradient algorithm (DPG), where θ is the parameter of the policy function. The cumulative reward is maximized by updating the parameter θ along the positive gradient of the cumulative reward function:

$$\theta_{t+1} = \theta_t + \alpha_\theta\, \nabla_\theta J(\mu_\theta) \qquad (21)$$

where α_θ is the weight coefficient and ∇_θ J(μ_θ) is the gradient variable.
The DPG algorithm derives ∇_θ J(μ_θ) in the form

$$\nabla_\theta J(\mu_\theta) \approx \frac{1}{M} \sum_{i=1}^{M} \nabla_\theta\, \mu_\theta(s_i)\, \nabla_a Q^{\mu}(s_i, a)\big|_{a=\mu_\theta(s_i)} \qquad (22)$$

where M is the number of iterations and Q^μ denotes the Q-value function under the policy function μ(a|s); it is further derived that

$$\nabla_\theta J(\mu_\theta) = \mathbb{E}\!\left[\nabla_\theta\, \mu_\theta(s)\, \nabla_a Q^{\mu}(s, a)\big|_{a=\mu_\theta(s)}\right] \qquad (23)$$
both the strategy function and the Q value evaluation function have strong nonlinearity, and the fitting of the nonlinear function by adopting the deep neural network is an effective method. In this section, an Actor network and a Critic network are respectively designed to approximate the policy function muθ(as) and an evaluation function QW(s, a), the parameters of the two networks are represented by θ and W, respectively, and the parameter update equation is shown as the following equation (24) (25) for the Q value function:
δt=rt+γQW(st+1,at+1)-QW(st,at) (24)
Wt+1=WtWδtWQW(st,at))(25)
wherein, deltatIn order to reward the intermediate variable(s),
Figure BDA0002262819460000151
is the gradient of Q versus W.
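A sketch of one combined update in PyTorch; the mean-squared TD loss, the optimizers passed in by the caller and the target-network bootstrapping are assumptions, while γ = 0.9 matches the experiments reported below.

```python
import torch

def update_step(batch, actor, critic, target_actor, target_critic,
                actor_opt, critic_opt, gamma=0.9):
    """One Actor-Critic update from a minibatch of (s_t, a_t, s_{t+1}, r_t) tuples."""
    s, a, s_next, r = batch

    # Critic: minimize the TD error of eqs. (24)-(25).
    with torch.no_grad():
        a_next = target_actor(s_next)
        q_target = r + gamma * target_critic(s_next, a_next)
    td_error = q_target - critic(s, a)
    critic_loss = td_error.pow(2).mean()
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor: ascend the deterministic policy gradient of eqs. (21)-(23),
    # implemented as descent on the negated Q value.
    actor_loss = -critic(s, actor(s)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()
```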
In this embodiment, the training of the Actor network depends on the Critic network, so an accurate evaluation benefits the Actor network training. Therefore, during the training of the Actor network and the Critic network, the target Critic network is updated faster than the target Actor network, so that the Actor network is trained and updated with a better evaluation. The asynchronous update strategy of the Actor network and the Critic network is shown in equations (26) and (27):
$$\theta' = \tau\theta + (1 - \tau)\,\theta', \quad \text{if } \operatorname{mod}(t, F_A) = 0 \qquad (26)$$
$$W' = \tau W + (1 - \tau)\,W', \quad \text{if } \operatorname{mod}(t, F_Q) = 0 \qquad (27)$$

where θ' is the target Actor network parameter, W' is the target Critic network parameter, τ is the update coefficient, F_A and F_Q are the update periods of the target Actor network and the target Critic network respectively (in general F_Q < F_A), and mod(·,·) is the remainder function.
Step A500, setting t = t + 1 and executing steps A100 to A400 in a loop until t is greater than the preset number of training steps, obtaining the trained Actor network.
In this embodiment, data in a preset training data set is sequentially selected to train the Actor network.
For the Experience Replay strategy, the PID-controller-based training strategy of the Actor-Critic reinforcement learning model, the asynchronous update strategy of the Actor network and the Critic network, and the random noise strategy, the invention selects five combined training strategies for training and testing, as shown in Table 1.
TABLE 1
[Table 1 is rendered as an image in the original; it lists the five combined training strategies, with the √ sign marking the strategies selected in each combination.]
The √ sign in Table 1 denotes a selected strategy. Network training is performed with the Actor-Critic reinforcement learning model under the 5 different training strategies selected in Table 1. The number of training rounds is 2000 and the number of training steps per round is 600; the size of the experience base is 10000, the batch size is 32, the learning rates of the Actor network and the Critic network are both 0.001, the weight forgetting (discount) factor γ is 0.9, and the initial position of the target in the body-fixed coordinate system is set to p = [-82, 220] mm. The results are shown in FIG. 7. Adding random noise is beneficial to the generalization capability of the model, but strategy 4 shows a slight decrease in performance relative to strategy 5: in the middle and later stages of model training, the supervisory controller has, from the Actor network's point of view, become random noise, so while it can provide the model with sufficient generalization capability, the random noise becomes an uncertain factor affecting model accuracy.
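For reference, the experimental settings just listed can be collected in a configuration dictionary; the key names are our own.

```python
TRAIN_CONFIG = {
    "rounds": 2000,                 # training rounds
    "steps_per_round": 600,         # training steps per round
    "replay_size": 10000,           # experience-base size
    "batch_size": 32,
    "actor_lr": 0.001,
    "critic_lr": 0.001,
    "gamma": 0.9,                   # weight forgetting (discount) factor
    "target_init_mm": (-82, 220),   # initial target position in the body-fixed frame, mm
}
```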
Stably keeping a relatively fixed position with respect to the operation target is an important precondition for the underwater operation robot to achieve accurate manipulation. The operation target is detected and localized through binocular vision, converted into the state representation and input into the policy network, which finally generates the wave frequencies that control the robot to track the target, as shown in FIG. 8, where the underwater view is from an underwater camera, the onshore view is from an onshore camera, and the left and right views are from the onboard binocular left and right cameras.
2. Wave fin propulsion underwater operation robot tracking control method based on reinforcement learning
The invention discloses a wave fin propulsion underwater operation robot tracking control method based on reinforcement learning, comprising the following steps:
Step S100, obtaining system state information of the underwater operation robot at time t and pose information of the target to be tracked in the robot's body-fixed coordinate system, and constructing the state information s_t of a Markov decision process.
An Actor-Critic reinforcement learning model is a dynamic-programming-style solution of a Markov decision process (MDP), so the actual problem must first be converted into an MDP problem and then solved by a reinforcement learning method. Therefore, in this embodiment, the state information s_t of the Markov decision process is constructed from the system state information of the underwater operation robot at time t and the pose information of the target to be tracked in the robot's body-fixed coordinate system.
Step S200, based on s_t, obtaining the wave frequency a_t of the wave fins through the Actor network in the Actor-Critic reinforcement learning model.
In this embodiment, based on the constructed state information s_t of the Markov decision process, the wave frequency of the wave fins is obtained through the offline-trained Actor network, which controls the propulsion and tracking of the underwater robot.
Step S300, based on a_t, controlling the wave fins of the underwater operation robot, setting t = t + 1, and jumping to step S100.
In this embodiment, the wave fins of the underwater operation robot are controlled based on the obtained wave frequency, t is set to t + 1, and the target tracking continues, as sketched below.
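A sketch of this online tracking loop in Python; the sensing and actuation interfaces get_state and set_fin_frequencies are hypothetical placeholders, not the patent's API.

```python
def tracking_loop(actor, robot, n_steps=1000):
    """Online tracking control, steps S100-S300: sense, decide, actuate, repeat."""
    for t in range(n_steps):
        s_t = robot.get_state()         # S100: MDP state from robot state + target pose (hypothetical API)
        a_t = actor(s_t)                # S200: fin wave frequencies from the trained Actor network
        robot.set_fin_frequencies(a_t)  # S300: drive the wave fins (hypothetical API)
```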
A wave fin propulsion underwater operation robot tracking control system based on reinforcement learning according to a second embodiment of the invention, as shown in FIG. 3, comprises: a construction module 100, a wave frequency acquisition module 200 and a circulation module 300;
the construction module 100 is configured to acquire system state information of the underwater operation robot at time t and pose information of the target to be tracked in the robot's body-fixed coordinate system, and to construct the state information s_t of a Markov decision process;
the wave frequency acquisition module 200 is configured to obtain, based on s_t, the wave frequency a_t of the wave fins through the Actor network in an Actor-Critic reinforcement learning model;
the circulation module 300 is configured to control the wave fins of the underwater operation robot based on a_t, set t = t + 1, and jump to the construction module.
It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working process and related description of the system described above may refer to the corresponding process in the foregoing method embodiment, and details are not described herein again.
It should be noted that the above division into functional modules is merely an example to illustrate the wave fin propulsion underwater operation robot tracking control system based on reinforcement learning provided in the above embodiment. In practical applications, the above functions may be allocated to different functional modules as needed; that is, the modules or steps in the embodiments of the present invention may be further decomposed or combined. For example, the modules of the above embodiment may be combined into one module, or further split into multiple sub-modules, to complete all or part of the functions described above. The names of the modules and steps involved in the embodiments of the present invention are only for distinguishing the modules or steps and are not to be construed as unduly limiting the present invention.
A storage device according to a third embodiment of the present invention stores a plurality of programs adapted to be loaded and executed by a processor to implement the above-described wave fin propulsion underwater operation robot tracking control method based on reinforcement learning.
A processing apparatus according to a fourth embodiment of the present invention comprises a processor adapted to execute programs and a storage device adapted to store a plurality of programs, the programs being adapted to be loaded and executed by the processor to implement the wave fin propulsion underwater operation robot tracking control method based on reinforcement learning.
It can be clearly understood by those skilled in the art that, for convenience and brevity, the specific working processes and related descriptions of the storage device and the processing device described above may refer to the corresponding processes in the foregoing method examples, and are not described herein again.
Those of skill in the art will appreciate that the various illustrative modules and method steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that programs corresponding to the software modules and method steps may be located in random access memory (RAM), memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. To clearly illustrate this interchangeability of electronic hardware and software, various illustrative components and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as electronic hardware or software depends upon the particular application and the design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
The terms "first," "second," and the like are used for distinguishing between similar elements and not necessarily for describing or implying a particular order or sequence.
The terms "comprises," "comprising," or any other similar term are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.
So far, the technical solutions of the present invention have been described in connection with the preferred embodiments shown in the drawings, but it is easily understood by those skilled in the art that the scope of the present invention is obviously not limited to these specific embodiments. Equivalent changes or substitutions of related technical features can be made by those skilled in the art without departing from the principle of the invention, and the technical scheme after the changes or substitutions can fall into the protection scope of the invention.

Claims (10)

1. A wave fin propulsion underwater operation robot tracking control method based on reinforcement learning, characterized by comprising:
Step S100, obtaining system state information of the underwater operation robot at time t and pose information of the target to be tracked in the robot's body-fixed coordinate system, and constructing the state information s_t of a Markov decision process;
Step S200, based on s_t, obtaining the wave frequency a_t of the wave fins through the Actor network in an Actor-Critic reinforcement learning model;
Step S300, based on a_t, controlling the wave fins of the underwater operation robot, setting t = t + 1, and jumping to step S100;
wherein the Actor-Critic reinforcement learning model comprises an Actor network and a Critic network and is obtained through offline training, the training method comprising:
Step A100, obtaining the system state information of the underwater operation robot at time t in a preset training data set and the pose information of the target to be tracked in the robot's body-fixed coordinate system, and constructing the state information s_t of a Markov decision process;
Step A200, acquiring the supervised training probability and the random supervised probability at time t; if the supervised training probability is greater than the random supervised probability, obtaining the wave frequency a_t of the wave fins through a PID controller based on s_t, otherwise obtaining a_t through the Actor network based on s_t;
Step A300, based on s_t and a_t, obtaining the state-action evaluation value Q*(s_t, a_t) and the reward value r_t through the Critic network in the Actor-Critic reinforcement learning model and a preset reward function, respectively;
Step A400, updating the parameters of the Actor network through a deterministic policy gradient algorithm based on Q*(s_t, a_t), and updating the parameters of the Critic network based on Q*(s_t, a_t) and r_t;
Step A500, setting t = t + 1 and executing steps A100 to A400 in a loop until t is greater than a preset number of training steps, obtaining the trained Actor network.
2. The wave fin propulsion underwater operation robot tracking control method based on reinforcement learning of claim 1, wherein the Actor network comprises four convolutional layers; the number of neurons in the first convolutional layer and the second convolutional layer is 200, the activation function is the Relu6 function, the number of neurons in the third convolutional layer is 10, the activation function is the Relu function, the number of neurons in the fourth convolutional layer is 2, and the activation function is the tanh function.
3. The wave fin propulsion underwater operation robot tracking control method based on reinforcement learning of claim 1, wherein the Critic network comprises five convolutional layers; the numbers of neurons in the first, second and third convolutional layers are 200 with the Relu6 activation function, the number of neurons in the fourth convolutional layer is 10 with the Relu activation function, and the number of neurons in the fifth convolutional layer is 1 with a linear activation function.
4. The wave fin propulsion underwater operation robot tracking control method based on reinforcement learning of claim 3, wherein in step A200 the supervised training probability and the random supervised probability at time t are obtained as:

$$\mathrm{PRO}_t = \mathrm{PRO}_0 \times 0.999^t$$
$$\mathrm{PRO}_r = \max(\mathrm{rand}(1),\ 0.01)$$

where PRO_t is the supervised training probability at time t, PRO_r is the random supervised probability, and PRO_0 is a preset initial supervised training probability.
5. The wave fin propulsion underwater operation robot tracking control method based on reinforcement learning of claim 1, wherein in step A200 "obtaining the wave frequency a_t of the wave fins through a PID controller" proceeds as follows: the wave frequency of the wave fins comprises a left-fin wave frequency and a right-fin wave frequency, calculated as

$$f_{r1} = P_r\,\Delta\psi + I_r \int \Delta\psi\,dt + D_r\,\Delta\dot{\psi}$$
$$f_{r2} = -f_{r1}$$
$$f_{x1} = P_x\,\Delta x + I_x \int \Delta x\,dt + D_x\,\Delta\dot{x}$$
$$f_{x2} = f_{x1}$$
$$f_1 = \rho_{r1} f_{r1} + \rho_{r2} f_{x1}$$
$$f_2 = \rho_{x1} f_{r2} + \rho_{x2} f_{x2}$$

where f_1, f_2 are the final wave frequencies of the left and right fins; f_{r1}, f_{r2} are the wave frequencies for yaw rotation of the left and right fins; f_{x1}, f_{x2} are the wave frequencies for forward motion of the left and right fins; ρ_{r1}, ρ_{r2}, ρ_{x1}, ρ_{x2} are the weight coefficients of the rotation and forward wave frequencies for the left and right fins; P_i, I_i, D_i with i ∈ {r, x} are the PID parameters; Δψ is the relative bearing error and Δψ̇ its differential; Δx is the lateral error of the relative position and Δẋ its differential.
6. The wave fin propulsion underwater operation robot tracking control method based on reinforcement learning of claim 1, wherein in step A200 "obtaining the wave frequency a_t of the wave fins through the Actor network" proceeds as follows: the wave frequency a_t is a wave frequency with superimposed random noise, satisfying

$$a_t \leftarrow \mu(t), \qquad \mu(t) \sim \mathcal{N}(a_t,\ c_t^2)$$
$$c_t = \max(c_0 \times 0.999^t,\ 0.01)$$

where c_0 is a preset initial noise standard deviation, c_t is the noise standard deviation at time t, and μ(t) is random Gaussian noise with mean a_t and standard deviation c_t, 𝒩(·,·) denoting the noise distribution.
7. The wave fin propulsion underwater operation robot tracking control method based on reinforcement learning of claim 1, wherein the preset reward function is:

$$r = r_0 - \rho_1\,\|\Delta\psi\|_2 - \rho_2\,\|\Delta P_o\|_2 - \nu^T \rho_3\,\nu$$

where r is the reward value, r_0 is a preset constant reward, ‖Δψ‖_2 is the 2-norm of the relative orientation error, ‖ΔP_o‖_2 is the 2-norm of the relative position error, ν is the velocity vector, ν^T ρ_3 ν is a regularization term, and ρ_1, ρ_2, ρ_3 are weight coefficients.
8. A wave fin propulsion underwater operation robot tracking control system based on reinforcement learning, characterized by comprising a construction module, a wave frequency acquisition module and a circulation module;
the construction module is configured to obtain system state information of the underwater operation robot at time t and pose information of the target to be tracked in the robot's body-fixed coordinate system, and to construct the state information s_t of a Markov decision process;
the wave frequency acquisition module is configured to obtain, based on s_t, the wave frequency a_t of the wave fins through the Actor network in an Actor-Critic reinforcement learning model;
the circulation module is configured to control the wave fins of the underwater operation robot based on a_t, set t = t + 1, and jump to the construction module.
9. A storage device having a plurality of programs stored therein, wherein the programs are adapted to be loaded and executed by a processor to implement the wave fin propulsion underwater operation robot tracking control method based on reinforcement learning of any one of claims 1 to 7.
10. A processing device comprising a processor adapted to execute programs and a storage device adapted to store a plurality of programs, characterized in that the programs are adapted to be loaded and executed by the processor to implement the wave fin propulsion underwater operation robot tracking control method based on reinforcement learning of any one of claims 1 to 7.
CN201911077089.0A 2019-11-06 2019-11-06 Wave fin propulsion underwater operation robot tracking control method based on reinforcement learning Active CN111079936B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911077089.0A CN111079936B (en) 2019-11-06 2019-11-06 Wave fin propulsion underwater operation robot tracking control method based on reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911077089.0A CN111079936B (en) 2019-11-06 2019-11-06 Wave fin propulsion underwater operation robot tracking control method based on reinforcement learning

Publications (2)

Publication Number Publication Date
CN111079936A true CN111079936A (en) 2020-04-28
CN111079936B CN111079936B (en) 2023-03-14

Family

ID=70310691

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911077089.0A Active CN111079936B (en) 2019-11-06 2019-11-06 Wave fin propulsion underwater operation robot tracking control method based on reinforcement learning

Country Status (1)

Country Link
CN (1) CN111079936B (en)

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2008305064A * 2007-06-06 2008-12-18 Japan Science & Technology Agency Learning-type control device and method thereof
CN105549384A * 2015-09-01 2016-05-04 中国矿业大学 Inverted pendulum control method based on neural networks and reinforcement learning
CN108008627A * 2017-12-13 2018-05-08 中国石油大学(华东) Parallel-optimization reinforcement learning adaptive PID control method
CN109068391A * 2018-09-27 2018-12-21 青岛智能产业技术研究院 Internet-of-Vehicles communication optimization algorithm based on edge computing and the Actor-Critic algorithm
CN109769119A * 2018-12-18 2019-05-17 中国科学院深圳先进技术研究院 Low-complexity video signal coding processing method
CN109760046A * 2018-12-27 2019-05-17 西北工业大学 Reinforcement learning-based motion planning method for a space robot capturing a tumbling target
CN110322017A * 2019-08-13 2019-10-11 吉林大学 Trajectory tracking control strategy for autonomous intelligent vehicles based on deep reinforcement learning
CN110333739A * 2019-08-21 2019-10-15 哈尔滨工程大学 Reinforcement learning-based AUV behavior planning and motion control method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
ZHANG Fashuai et al.: "Navigation control of unmanned surface vehicles based on deep reinforcement learning", Metrology & Measurement Technology *
HAN Xiangmin et al.: "An adaptive cruise control algorithm based on deep reinforcement learning", Computer Engineering *

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111856936A * 2020-07-21 2020-10-30 天津蓝鳍海洋工程有限公司 Control method for a cabled underwater high-flexibility operation platform
CN111856936B * 2020-07-21 2023-06-02 天津蓝鳍海洋工程有限公司 Control method for a cabled underwater high-flexibility operation platform
CN112124537A * 2020-09-23 2020-12-25 哈尔滨工程大学 Intelligent control method for an underwater robot for autonomous suction and capture of benthos
CN112124537B * 2020-09-23 2021-07-13 哈尔滨工程大学 Intelligent control method for an underwater robot for autonomous suction and capture of benthos
CN112597693A * 2020-11-19 2021-04-02 沈阳航盛科技有限责任公司 Adaptive control method based on deep deterministic policy gradient
CN112462792A * 2020-12-09 2021-03-09 哈尔滨工程大学 Underwater robot motion control method based on the Actor-Critic algorithm
CN112819379A * 2021-02-28 2021-05-18 广东电力交易中心有限责任公司 Risk preference information acquisition method and system, electronic device and storage medium
CN114114896A * 2021-11-08 2022-03-01 北京机电工程研究所 PID parameter design method based on path integrals
CN114114896B * 2021-11-08 2024-01-05 北京机电工程研究所 PID parameter design method based on path integrals
CN113977583A * 2021-11-16 2022-01-28 山东大学 Robot rapid assembly method and system based on the proximal policy optimization algorithm
CN114995468A * 2022-06-06 2022-09-02 南通大学 Intelligent control method of an underwater robot based on Bayesian deep reinforcement learning
CN114995468B * 2022-06-06 2023-03-31 南通大学 Intelligent control method of an underwater robot based on Bayesian deep reinforcement learning

Also Published As

Publication number Publication date
CN111079936B (en) 2023-03-14

Similar Documents

Publication Publication Date Title
CN111079936B (en) Wave fin propulsion underwater operation robot tracking control method based on reinforcement learning
CN108803321B (en) Autonomous underwater vehicle trajectory tracking control method based on deep reinforcement learning
Zhang et al. MPC-based 3-D trajectory tracking for an autonomous underwater vehicle with constraints in complex ocean environments
Li et al. Robust time-varying formation control for underactuated autonomous underwater vehicles with disturbances under input saturation
Wang et al. Dynamic tanker steering control using generalized ellipsoidal-basis-function-based fuzzy neural networks
Li et al. Visual servo regulation of wheeled mobile robots with simultaneous depth identification
CN111240345B (en) Underwater robot trajectory tracking method based on double BP network reinforcement learning framework
Precup et al. Grey wolf optimizer-based approaches to path planning and fuzzy logic-based tracking control for mobile robots
CN110928189A (en) Robust control method based on reinforcement learning and Lyapunov function
CN115016496A (en) Water surface unmanned ship path tracking method based on deep reinforcement learning
Wang et al. Adaptive and extendable control of unmanned surface vehicle formations using distributed deep reinforcement learning
Song et al. Guidance and control of autonomous surface underwater vehicles for target tracking in ocean environment by deep reinforcement learning
Khalaji et al. Lyapunov-based formation control of underwater robots
CN114115262B (en) Multi-AUV actuator saturation cooperative formation control system and method based on azimuth information
Fang et al. Autonomous underwater vehicle formation control and obstacle avoidance using multi-agent generative adversarial imitation learning
Zhuang et al. Motion control and collision avoidance algorithms for unmanned surface vehicle swarm in practical maritime environment
Hu et al. Trajectory tracking and re-planning with model predictive control of autonomous underwater vehicles
Meng et al. A Fully-Autonomous Framework of Unmanned Surface Vehicles in Maritime Environments Using Gaussian Process Motion Planning
CN113485323B (en) Flexible formation method for cascading multiple mobile robots
Yao et al. Multi-USV cooperative path planning by window update based self-organizing map and spectral clustering
Wang et al. Data-driven path-following control of underactuated ships based on antenna mutation beetle swarm predictive reinforcement learning
Hadi et al. Adaptive formation motion planning and control of autonomous underwater vehicles using deep reinforcement learning
Song et al. Surface path tracking method of autonomous surface underwater vehicle based on deep reinforcement learning
Cui et al. Anti-disturbance cooperative formation containment control for multiple autonomous underwater vehicles with actuator saturation
Jiang et al. Learning decentralized control policies for multi-robot formation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant