CN111290270A - Underwater robot backstepping speed and heading control method based on Q-learning parameter adaptive technology - Google Patents

Underwater robot backstepping speed and heading control method based on Q-learning parameter adaptive technology

Info

Publication number
CN111290270A
CN111290270A (application number CN202010087509.XA)
Authority
CN
China
Prior art keywords
speed
controller
heading
learning
underwater robot
Prior art date
Legal status
Granted
Application number
CN202010087509.XA
Other languages
Chinese (zh)
Other versions
CN111290270B (en)
Inventor
王卓
张佩
孙延超
秦洪德
朱仲本
张宇昂
曹禹
景锐洁
Current Assignee
Harbin Engineering University
Original Assignee
Harbin Engineering University
Priority date
Filing date
Publication date
Application filed by Harbin Engineering University filed Critical Harbin Engineering University
Priority to CN202010087509.XA priority Critical patent/CN111290270B/en
Publication of CN111290270A publication Critical patent/CN111290270A/en
Application granted granted Critical
Publication of CN111290270B publication Critical patent/CN111290270B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G05 CONTROLLING; REGULATING
    • G05B CONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
    • G05B13/00 Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion
    • G05B13/02 Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion, electric
    • G05B13/04 Adaptive control systems, electric, involving the use of models or simulators
    • G05B13/042 Adaptive control systems, electric, involving the use of models or simulators in which a parameter or coefficient is automatically adjusted to optimise the performance
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T90/00 Enabling technologies or technologies with a potential or indirect contribution to GHG emissions mitigation

Abstract

A method for controlling the backstepping speed and heading of an underwater robot based on a Q-learning parameter-adaptive technology, belonging to the technical field of robot control. The method aims to solve the problems that existing underwater robot control methods require prior knowledge and cannot adjust the controller parameters in real time. The invention designs a parameter-adaptive backstepping speed and heading controller based on the Q-learning algorithm: the deviation and the deviation change rate are taken as the inputs of Q-learning, adjustment parameters are output, and the control parameters determined from these adjustment parameters are combined with a controller designed by the backstepping method to realize speed and heading control. The method is mainly used for controlling the speed and heading of an underwater robot.

Description

Underwater robot backstepping speed and heading control method based on Q-learning parameter adaptive technology
Technical Field
The invention relates to a speed and heading control method for an underwater robot, and belongs to the technical field of robot control.
Background
As the premise of and guarantee for an underwater robot completing its expected tasks and underwater operations, the motion control system is receiving attention from more and more researchers. As a complex nonlinear system, an underwater robot has uncertain and time-varying model characteristics and is easily disturbed by external environments such as wind, waves and currents when moving in the complex and changeable ocean, which greatly affects its motion control performance. Therefore, for a controller designed in advance, there is room for improvement in the selection of the control parameters.
At present, for the problem of automatic optimization of controller parameters, there are many algorithms that combine neural networks, fuzzy control and evolutionary algorithms with traditional control, and all have a certain adaptive capacity, but they still have significant disadvantages. For example, neural-network adaptive control needs a large number of teacher signals, which are difficult to obtain in practical applications; fuzzy adaptive control needs expert prior knowledge, which is not conducive to large-scale popularization and use; and genetic algorithms cannot learn online and therefore cannot adjust the controller parameters in real time.
Disclosure of Invention
The invention aims to solve the problems that existing underwater robot control methods require prior knowledge and cannot adjust the controller parameters in real time.
A method for controlling the backstepping speed and the heading of an underwater robot based on a Q-learning parameter adaptive technology is characterized by comprising the following steps of:
based on a kinematic model and a dynamic model of the underwater robot, a speed controller and a heading controller are established by utilizing a backstepping method;
τ_u = (speed control law designed by the backstepping method, formula (11) in the detailed description; given as a formula image in the original)
τ_r = (heading control law designed by the backstepping method, formula (21) in the detailed description; given as a formula image in the original)
wherein k_u, k_ψ1 and k_ψ2 are all positive numbers and are the control parameters to be designed; τ_u is the longitudinal thrust of the propeller; τ_r is the bow-turning moment; x, y, z are the positions along the three axes of the underwater robot coordinate system; φ, θ and ψ are respectively the roll angle, the pitch angle and the heading angle; u, v and w are respectively the longitudinal, transverse and vertical linear velocities, and p, q and r are respectively the roll, pitch and yaw angular velocities; |·| denotes the absolute value; X_u, X_{\dot u}, X_{u|u|}, Y_v, Y_{\dot v}, Y_{v|v|}, N_r, N_{\dot r}, N_{r|r|} are all dimensionless hydrodynamic coefficients; I_z is the moment of inertia of the underwater robot about the z axis of the body coordinate system, and m is the mass of the underwater robot;
and the control parameters of the backstepping method are optimized and adjusted by using Q learning:
for the speed controller, the input state vector is S_u = {s_1u, s_2u}; s_1u and s_2u are respectively the space-transformed values corresponding to e_u and \dot{e}_u; the output of Q-learning is the speed-controller parameter k'_u, whose range is the action space to be divided, from which the Q-value table of the speed controller is established;
for the heading controller, the input state vector is S_ψ = {s_1ψ, s_2ψ, s_3ψ}; s_1ψ and s_2ψ are respectively the space-transformed values corresponding to e_ψ and \dot{e}_ψ, and s_3ψ is the space-transformed value corresponding to the longitudinal velocity u; the output of Q-learning is the two heading-controller control parameters k'_ψ1 and k'_ψ2, whose ranges are the action space to be divided, from which the Q-value table of the heading controller is established;
the input of the Q-learning speed controller is the speed deviation and the speed-deviation change rate, and the output, obtained through the Q-learning algorithm, is the adjustment parameter k'_u of the speed controller; similarly, the inputs of the Q-learning heading controller are the heading-angle deviation, the heading-angle-deviation change rate and the real-time speed of the underwater robot, and the outputs, obtained through the Q-learning algorithm, are the two adjustment parameters k'_ψ1 and k'_ψ2 of the heading controller;
then k_u, k_ψ1 and k_ψ2 are determined from k'_u, k'_ψ1 and k'_ψ2:
k_u = k_u0 + k'_u
k_ψ1 = k_ψ1,0 + k'_ψ1
k_ψ2 = k_ψ2,0 + k'_ψ2
wherein k_ψ1,0 and k_ψ2,0 are the initial values of the two parameters of the heading controller;
the control parameters k_u, k_ψ1 and k_ψ2 are then substituted into the speed controller and the heading controller to realize control of the underwater robot.
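As an illustration of how the Q-learning outputs are combined with the backstepping gains at each control step, the following Python sketch is provided. It is only a hedged outline: the function and variable names are hypothetical, the greedy table look-up stands in for the full ε-greedy logic described later, and the interpretation that a single heading action adjusts one of the two parameters at a time (actions 1 to 16 for k'_ψ1, 17 to 32 for k'_ψ2) is an assumption based on the action-space layout.

```python
import numpy as np

def control_step_gains(Q_u, Q_psi, s_u, s_psi,
                       k_u0, k_psi1_0, k_psi2_0,
                       actions_u, actions_psi1, actions_psi2):
    """Look up the adjustment parameters in the Q tables and add them to the
    initial backstepping gains (k_u = k_u0 + k'_u, etc.).  The returned gains
    are then used in the backstepping control laws for tau_u and tau_r."""
    a_u = int(np.argmax(Q_u[s_u]))                 # greedy speed-controller action
    k_u = k_u0 + actions_u[a_u]                    # k_u = k_u0 + k'_u

    a_psi = int(np.argmax(Q_psi[s_psi]))           # greedy heading-controller action
    if a_psi < len(actions_psi1):                  # assumed: actions 1..16 set k'_psi1
        k_psi1 = k_psi1_0 + actions_psi1[a_psi]
        k_psi2 = k_psi2_0
    else:                                          # assumed: actions 17..32 set k'_psi2
        k_psi1 = k_psi1_0
        k_psi2 = k_psi2_0 + actions_psi2[a_psi - len(actions_psi1)]
    return k_u, k_psi1, k_psi2
```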
Advantageous effects:
when the underwater robot is controlled by using the method, a large amount of prior knowledge is not needed, and the parameters of the controller can be adjusted in real time through a control scheme, so that the method has stronger practicability.
Simulation experiments show that, without ocean current interference, the invention achieves a shorter rise time than the traditional speed and heading controller. This is mainly because the parameters of the adaptive backstepping controller are variable rather than fixed over the whole reinforcement-learning process; according to the characteristics of reinforcement learning, different optimal actions can be selected in different states, and the changes of the three parameter values of the speed and heading controllers reflect this. Similarly, under ocean current interference, the controller achieves a better control effect than the traditional controller, with smaller overshoot and better anti-interference capability.
The deviations of the speed and heading controllers of the invention decrease continuously as the number of training episodes increases, and finally stabilize at a fixed value or wander among a few small values, at which point the reinforcement-learning training has converged. The embodiment shows that, without ocean current interference, the speed controller of the invention is basically stable by the 150th episode and the heading controller by the 100th; similarly, under ocean current interference, the speed and heading controllers stabilize by the 400th and 350th episodes respectively.
Drawings
FIG. 1 is a Q learning parameter adaptive backstepping controller of an underwater robot speed;
FIG. 2 is a Q learning parameter adaptive backstepping controller of the heading of an underwater robot;
FIG. 3 is a Q-learning based adaptive back-stepping speed controller;
FIG. 4 is a Q-learning based adaptive backstepping heading controller;
FIG. 5 is a graph of adaptive backstepping speed controller longitudinal thrust based on Q learning;
FIG. 6 adaptive backstepping heading controller yaw moment based on Q learning;
FIG. 7 adaptive backstepping speed controller parameter value variation based on Q-learning;
FIG. 8 adaptive backstepping heading controller parameter value 1 change based on Q-learning;
FIG. 9 adaptive backstepping heading controller parameter value 2 changes based on Q-learning;
FIG. 10 is a graph of a Q-learning based adaptive back-stepping speed controller speed offset sum;
FIG. 11 is a diagram of a Q-learning based adaptive back-stepping heading controller heading angle deviation sum;
FIG. 12 is a Q-learning based adaptive back-stepping speed controller;
FIG. 13 is a Q-learning based adaptive backstepping heading controller;
FIG. 14 is a graph of adaptive backstepping speed controller longitudinal thrust based on Q learning;
FIG. 15 adaptive backstepping heading controller yaw moment based on Q learning;
FIG. 16 adaptive backstepping speed controller parameter value variation based on Q-learning;
FIG. 17 adaptive backstepping heading controller parameter value 1 change based on Q-learning;
FIG. 18 adaptive backstepping heading controller parameter value 2 changes based on Q-learning;
FIG. 19 is a graph of the adaptive backstepping speed controller speed offset sum based on Q learning;
FIG. 20 is a graph of an adaptive backstepping heading controller heading angle deviation sum based on Q learning.
Detailed Description
The purpose of the following embodiments is to improve the autonomy and intelligence of the motion control of the underwater robot and to ensure that the parameters of the backstepping controller can be adjusted online in real time, so that the motion control performance of the underwater robot under the interference of wind, waves and currents is improved.
Before describing the embodiments, the parameters used in the embodiments are first explained: M_RB - inertial force matrix; C_RB - Coriolis centripetal force matrix; M_A - additional mass force matrix; C_A - additional damping force matrix; D - damping force matrix; g - gravity-buoyancy (restoring) force; τ_u - propeller longitudinal thrust; τ_r - bow-turning moment; η = [x y z φ θ ψ]^T - six-degree-of-freedom position and attitude of the underwater robot in the inertial frame; v = [u v w p q r]^T - six-degree-of-freedom linear and angular velocities of the underwater robot in the body-fixed frame; S_u = {s_1u, s_2u} - input vector of the reinforcement-learning-based speed controller; k'_u - output of the Q-learning parameter-adaptive backstepping speed controller; s_1u - deviation of the actual speed from the desired value; s_2u - change rate of the deviation of the actual speed from the desired value; S_ψ = {s_1ψ, s_2ψ, s_3ψ} - input vector of the Q-learning-based heading controller; {k'_ψ1, k'_ψ2} - output vector of the Q-learning parameter-adaptive backstepping heading controller; e_ψ - deviation of the actual heading angle from the desired value; \dot{e}_ψ - change rate of the deviation of the actual heading angle from the desired value; u - real-time speed of the autonomous underwater robot; ψ_d - desired value of the heading angle; u_d - desired value of the speed; a_t - optimal output action of the Q-learning controller; r_t - reward and punishment return value of the Q-learning controller; α - learning rate; γ - discount rate; ε - greedy rate.
The first embodiment is as follows: the structure of the underwater robot speed and heading Q learning parameter adaptive backstepping method controller is respectively shown in fig. 1 and fig. 2;
the method for controlling the backstepping speed and the heading of the underwater robot based on the Q-learning parameter adaptive technology comprises the following steps:
based on a kinematic model and a dynamic model of the underwater robot, a speed controller and a heading controller are established by utilizing a backstepping method;
τ_u = (speed control law designed by the backstepping method, formula (11) below; given as a formula image in the original)
τ_r = (heading control law designed by the backstepping method, formula (21) below; given as a formula image in the original)
wherein k_u, k_ψ1 and k_ψ2 are all positive numbers and are the control parameters to be designed; τ_u is the longitudinal thrust of the propeller; τ_r is the bow-turning moment; x, y, z are the positions along the three axes of the underwater robot coordinate system; φ, θ and ψ are respectively the roll angle, the pitch angle and the heading angle; u, v and w are respectively the longitudinal, transverse and vertical linear velocities, and p, q and r are respectively the roll, pitch and yaw angular velocities; |·| denotes the absolute value; X_u, X_{\dot u}, X_{u|u|}, Y_v, Y_{\dot v}, Y_{v|v|}, N_r, N_{\dot r}, N_{r|r|} are all dimensionless hydrodynamic coefficients; I_z is the moment of inertia of the underwater robot about the z axis of the body coordinate system, and m is the mass of the underwater robot.
The control parameters of the backstepping method are then optimized and adjusted by Q-learning; using the learning characteristics of Q-learning, the optimal decision is found through continuous trial-and-error learning, which satisfies the requirement of adjusting the controller parameters in real time. The input of the Q-learning parameter-adaptive backstepping speed controller is the speed deviation and the speed-deviation change rate, and the output, obtained through the Q-learning algorithm, is the adjustment parameter k'_u of the speed controller; similarly, the inputs of the Q-learning parameter-adaptive backstepping heading controller are the heading-angle deviation, its change rate and the real-time speed of the underwater robot, and the outputs, obtained through the Q-learning algorithm, are the two adjustment parameters k'_ψ1 and k'_ψ2 of the heading controller.
the parameter adaptive backstepping speed and heading controller based on Q learning are designed based on the form of a Q value table, and meanwhile, the state space division size has great influence on the learning speed of Q learning, so that the states of the two controllers are divided into 21 equal parts:
For the speed controller, the input state vector is S_u = {s_1u, s_2u}; s_1u and s_2u are respectively the transformed values corresponding to e_u and \dot{e}_u; e_u ∈ [-2, 2] is the speed deviation and \dot{e}_u is the speed-deviation change rate. On the interval [-2, 2] the speed deviation is divided evenly every 0.2 into 21 state values, i.e. [-2, -1.8, ..., 1.8, 2]; s_1u is the value of e_u in the corresponding new universe of discourse, i.e. the transformed value. On the interval [-1, 1] the speed-deviation change rate is divided evenly every 0.1 into 21 states, i.e. [-1, -0.9, ..., 0.9, 1]; s_2u is the value of \dot{e}_u in the corresponding new universe of discourse, i.e. the transformed value. The output of reinforcement learning is the speed-controller parameter k'_u; k'_u ∈ [-1, 2] is the action space to be divided, and it is divided evenly every 0.2 into 16 action values, i.e. [-1, -0.8, ..., 1.8, 2]. In summary, a 21 × 21 × 16 Q-value table is established for the speed controller.
For a heading controller, the input state vector is
Figure BDA0002382564880000056
Are respectively as
Figure BDA0002382564880000057
A corresponding conversion value;
Figure BDA0002382564880000058
in order to be able to determine the deviation of the yaw angle,
Figure BDA0002382564880000059
is the rate of change of deviation of yaw angle; will [ - π, π]Is expressed as [ -3.14,3.14 [ -3.14 [ ]]And is approximately [ -3,3 [)]In the interval [ -3,3 [)]The deviation from the yaw angle is divided on average every 0.3 into 21 state values, namely [ -3, -2.7.., 2.7,3],
Figure BDA00023825648800000510
Is composed of
Figure BDA00023825648800000511
The value in the corresponding new theoretical domain, i.e. the conversion value; in the interval [ -1,1 [)]The rate of change of deviation of the yaw angle is divided on average every 0.1 into 21 states, namely [ -1, -0.9.., 0.9,1 [ ]],
Figure BDA00023825648800000512
Is composed of
Figure BDA00023825648800000513
The value in the corresponding new theoretical domain, i.e. the conversion value;
Figure BDA00023825648800000514
new theory for longitudinal speed u correspondenceA value in the domain; the longitudinal speed interval [ -2,2 [ ]]Dividing the real-time speed into 21 states on average every 0.2 pairs; the output of reinforcement learning is two control parameters of the heading controller
Figure BDA00023825648800000515
And
Figure BDA00023825648800000516
and
Figure BDA00023825648800000517
is the action space to be divided, and selects
Figure BDA00023825648800000518
And
Figure BDA00023825648800000519
divide it into 16 action values on average every 0.2 and 0.1, respectively; actions 1 to 16 mean
Figure BDA00023825648800000520
The motions 17 to 32 represent
Figure BDA0002382564880000061
So far, a 21 × 21 × 21 × 32Q value table is established for the heading controller.
Values falling between the defined states are rounded into the corresponding nearest defined state, as illustrated in the sketch below;
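A minimal Python sketch of this state discretization and Q-table construction follows; the helper name to_state_index and the use of clipping for out-of-range values are illustrative assumptions, while the grid bounds, step sizes and table dimensions are taken from the text.

```python
import numpy as np

def to_state_index(value, low, high, n_states=21):
    """Map a continuous value to the index of the nearest of n_states evenly
    spaced grid points on [low, high]; out-of-range values are clipped first."""
    grid = np.linspace(low, high, n_states)               # e.g. [-2, -1.8, ..., 2]
    return int(np.argmin(np.abs(grid - np.clip(value, low, high))))

# Q-value tables with the dimensions stated in the text:
Q_u   = np.zeros((21, 21, 16))        # speed controller: 21 x 21 states, 16 actions
Q_psi = np.zeros((21, 21, 21, 32))    # heading controller: 21 x 21 x 21 states, 32 actions

# example: a speed deviation of 0.37 m/s and a deviation change rate of -0.12
s_u = (to_state_index(0.37, -2.0, 2.0), to_state_index(-0.12, -1.0, 1.0))
print(Q_u[s_u].shape)                 # (16,): one Q value per candidate k'_u action
```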
Then k_u, k_ψ1 and k_ψ2 are determined from k'_u, k'_ψ1 and k'_ψ2:
k_u = k_u0 + k'_u
k_ψ1 = k_ψ1,0 + k'_ψ1
k_ψ2 = k_ψ2,0 + k'_ψ2
wherein k_ψ1,0 and k_ψ2,0 are the initial values of the two parameters of the heading controller;
the control parameters k_u, k_ψ1 and k_ψ2 are then substituted into the speed controller and the heading controller to realize control of the underwater robot.
Reward and punishment function design for the Q-learning-based parameter-adaptive backstepping controller: the reward and punishment function needs a relatively clear target for evaluating the performance of the controller. In general, the quality of a controller is judged by its stability, accuracy and rapidity; it is expected that the controller reaches the desired value quickly and accurately, which is reflected in a response curve with a faster rise and less overshoot and oscillation. Therefore, the invention defines the reward function as a function of the control error and the error change rate, as follows:
r = f(σ; λ, Λ, a)    (reward-function expression, given as a formula image in the original)
wherein λ is a coefficient controlling the magnitude of the reward and punishment function, λ > 0; for the speed controller σ = [e_u, \dot{e}_u]^T, and for the heading controller σ = [e_ψ, \dot{e}_ψ]^T; e_u is the speed deviation, \dot{e}_u is the speed-deviation change rate, e_ψ is the heading deviation, and \dot{e}_ψ is the heading-deviation change rate; Λ is a second-order diagonal matrix representing the influence factor of each component of σ on the reward and punishment function; a is a magnitude control parameter of the reward function, and from the expression it can be seen that when a decreases, the interval of the reward function becomes smaller.
Determining an objective function: the control objectives of the speed and heading of the underwater robot are such that the underwater robot reaches and maintains the desired speed and desired heading, i.e. maximizes the desired cumulative reward function, so the objective function of the markov decision process model is as follows:
max E[ Σ_t γ^t r_{t+1}(s_{t+1}) ]
wherein r_{t+1}(s_{t+1}) is the reward value corresponding to the state s_{t+1} at time t+1; γ is the discount rate; and E[·] denotes the expectation;
the parameter selection mechanism of the speed controller and the heading controller is as follows: the invention relates to a speed controller and a heading controller based on Q learning, wherein the parameter selection mode of the speed controller and the heading controller is an epsilon greedy strategy, epsilon is an epsilon (0,1), the epsilon greedy strategy that random probability epsilon is continuously attenuated along with the increase of iteration round number is adopted, when the value of epsilon is closer to 0, the training is shown to be in the last stage, and a Q learning system is more biased to utilize the learned experience; when the value of epsilon is closer to 1, the Q learning system is more inclined to explore the epsilon greedy strategy when the training is started; searching with the probability of epsilon and utilizing with the probability of 1-epsilon (if there are a plurality of actions with the same Q value, selecting one action at random) each time the action is selected; the concrete form is as follows:
π(a|s) = argmax_a Q(s, a) with probability 1-ε; a random action with probability ε
ε = ε_0 · e^{(μ-step)/ξ}
wherein Q(s, a) is the value of the Q function, and π(a|s) is the strategy of taking action a in state s; ε_0 is the initial value, μ is the attenuation factor, ξ is a control factor keeping ε within the (0, 1) interval, and step represents the number of control rounds;
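The following Python sketch illustrates the decaying ε-greedy selection just described; the function names are hypothetical and the numeric arguments in the usage line are only illustrative.

```python
import numpy as np

def epsilon(step, eps0, mu, xi):
    """Decaying exploration rate: eps = eps0 * exp((mu - step) / xi), kept inside (0, 1)."""
    return float(np.clip(eps0 * np.exp((mu - step) / xi), 1e-6, 1.0 - 1e-6))

def select_action(q_row, step, eps0, mu, xi, rng=None):
    """epsilon-greedy selection: explore with probability eps, otherwise exploit;
    ties between equal Q values are broken at random."""
    rng = rng or np.random.default_rng()
    if rng.random() < epsilon(step, eps0, mu, xi):
        return int(rng.integers(len(q_row)))           # explore
    best = np.flatnonzero(q_row == q_row.max())        # all greedy actions
    return int(rng.choice(best))                       # exploit, random tie-break

# illustrative call: 16 candidate actions, training round 120
a = select_action(np.zeros(16), step=120, eps0=0.4, mu=200.0, xi=400.0)
```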
the learning updating process of the parameter self-adaptive backstepping controller based on Q learning comprises the following steps: initializing a Q value table of divided states and actions, initializing all Q values in the table to 0, setting an initial speed and an initial heading angle, and obtaining an initial state s through state conversiontThen selects action a via an epsilon greedy strategytTo the next state st+1And obtaining the real-time report given by the environmentt+1Based on these information, the Q-value table can be updated, and the specific update formula is as follows:
Q(s_t, a_t) ← Q(s_t, a_t) + α [ r_{t+1} + γ max_a Q(s_{t+1}, a) - Q(s_t, a_t) ]
where α is the learning rate and γ is the discount rate.
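A direct Python transcription of this update rule is sketched below; the tuple-indexing convention for the multi-dimensional Q table is an assumption.

```python
def q_update(Q, s, a, r_next, s_next, alpha=0.9, gamma=0.9):
    """One-step Q-learning update:
    Q(s,a) <- Q(s,a) + alpha * [ r_{t+1} + gamma * max_a' Q(s',a') - Q(s,a) ].
    `s` and `s_next` are tuples of discretized state indices, `a` an action index;
    alpha and gamma default to the values used in the embodiment (0.9, 0.9)."""
    td_target = r_next + gamma * Q[s_next].max()
    Q[(*s, a)] += alpha * (td_target - Q[(*s, a)])
    return Q
```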
Basic flow of the Q-learning algorithm: a tabular algorithm is used. When learning starts, the Q-value table Q(s, a) is initialized arbitrarily; then, in state s_t, the agent determines an action a_t according to the ε-greedy policy and obtains the experience and training sample (s_t, a_t, s_{t+1}, r_{t+1}); the Q value is then updated. When the agent reaches the target state, the algorithm terminates one cycle. Finally, the algorithm starts a new iteration loop from the initial state until the end of the learning period; note that the Q table used at the start of each loop is already the Q table updated in the previous loop. The specific procedure of applying the Q-learning algorithm to the control problem is given in Table 1 (single-step Q-learning algorithm procedure; the table is provided as images in the original).
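As a rough Python rendering of that single-step flow, reusing the helpers sketched above: the `env` object, its reset/step interface and the episode-termination test are assumptions, not part of the patent.

```python
def train(env, Q, n_episodes, alpha=0.9, gamma=0.9):
    """Tabular Q-learning training loop: for each episode, start from the
    initial state, act epsilon-greedily, observe reward and next state,
    update the Q table, and continue until the episode (one control period) ends."""
    for episode in range(n_episodes):
        s = env.reset()                        # initial speed / heading -> state tuple
        done = False
        while not done:
            a = select_action(Q[s], episode,
                              eps0=0.4, mu=200.0, xi=400.0)  # see epsilon-greedy sketch
            s_next, r_next, done = env.step(a)
            Q = q_update(Q, s, a, r_next, s_next, alpha, gamma)
            s = s_next
    return Q
```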
The second embodiment is as follows:
in the method for controlling backstepping speed and heading of an underwater robot based on a Q-learning parameter adaptive technique according to the embodiment, the process of establishing a speed controller and a heading controller by using a backstepping method based on a kinematic model and a dynamic model of the underwater robot comprises the following steps:
underwater robot coordinate system: two right-hand coordinate systems, namely an inertial coordinate system and a satellite coordinate system, are adopted.
The inertial coordinate system is used to describe the position and attitude of the underwater robot; it is fixed to the earth and denoted E-ξηζ. The origin E is generally chosen as a fixed point on the sea surface; the ξ axis and the η axis are perpendicular to each other, with the ξ axis defined as positive towards geographic north and the η axis as positive towards geographic east; the ζ axis is positive towards the centre of the earth; the ξ, η and ζ axes conform to the right-hand rule.
The body-fixed coordinate system is used to describe the motion information of the underwater robot; it is fixed to the underwater robot and denoted O-xyz, where the origin O is usually chosen as the centre of gravity of the underwater robot. The x axis and the y axis are perpendicular to each other, with the x axis defined as positive towards the heading of the underwater robot and the y axis as positive towards its starboard; the z axis is positive towards the bottom of the underwater robot; the x, y, z axes also conform to the right-hand rule.
The underwater robot kinematic model: the kinematic model reflects the transformation between the pose and the (angular) velocities; in compact form it can be written as \dot{η} = J(η) v (the six scalar transformation equations are given as formula images in the original).
In the formulas, η = [x y z φ θ ψ]^T represents the position and attitude of the underwater robot, where x, y, z are the positions along the three axes of the underwater robot coordinate system and φ, θ, ψ are respectively the roll, pitch and heading angles; v = [u v w p q r]^T represents the linear and angular velocities of the underwater robot, where u, v, w are respectively the longitudinal, transverse and vertical linear velocities, and p, q, r are respectively the roll, pitch and yaw angular velocities.
The underwater robot dynamic model: the six-degree-of-freedom dynamic model of an underwater robot proposed by Fossen in the Handbook of Marine Craft Hydrodynamics and Motion Control is adopted:
M_RB \dot{v} + C_RB(v) v + M_A \dot{v} + C_A(v) v + D(v) v + g(η) = τ
wherein M_RB \dot{v} + C_RB(v) v are the inertial force and the Coriolis centripetal force, M_RB being the inertial force matrix and C_RB the Coriolis centripetal force matrix; M_A \dot{v} + C_A(v) v are the additional mass force and the additional damping force, M_A being the additional mass force matrix and C_A the additional damping force matrix; D(v) v is the damping force, D being the damping force matrix; g(η) is the gravity-buoyancy (restoring) force; and τ is the thruster force.
The six-degree-of-freedom dynamic model of the underwater robot is then simplified into horizontal-plane kinematic and dynamic models. It is assumed that the underwater robot is symmetrical fore-aft, port-starboard and top-bottom, that the centre of gravity and the centre of buoyancy are on the same vertical line, and that gravity and buoyancy are balanced; the horizontal-plane kinematic and dynamic models are then formulas (1) and (2) (given as formula images in the original),
wherein τ_u is the longitudinal thrust, generated by the main propeller; τ_r is the yaw moment, generated by a group of vertical rudders; because the underwater robot is underactuated and has no transverse propeller, there is no transverse thrust; |·| denotes the absolute value; X_u, X_{\dot u}, X_{u|u|}, Y_v, Y_{\dot v}, Y_{v|v|}, N_r, N_{\dot r}, N_{r|r|} are dimensionless hydrodynamic coefficients; I_z is the moment of inertia of the underwater robot about the z axis of the body coordinate system, and m is the mass of the underwater robot.
According to formula (2), the speed control system of the underwater robot is obtained (the formula is given as an image in the original).
The speed deviation is defined as e = u_d - u; taking the derivative of e gives \dot{e} = \dot{u}_d - \dot{u}.
At the same time, consider the following Lyapunov function:
V = (1/2) e^2    (9)
Differentiating equation (9) gives:
\dot{V} = e \dot{e} = e (\dot{u}_d - \dot{u})    (10)
In order to ensure convergence, \dot{V} must be negative definite, so according to equation (10) the longitudinal thrust τ_u is designed as formula (11) (given as a formula image in the original).
Substituting formula (11) into formula (10) gives:
\dot{V} = -k_u e^2 ≤ 0
It can be seen that as long as the design parameter k_u is a positive number, the Lyapunov stability theory is satisfied, \dot{V} is guaranteed to be negative, and the asymptotic stability of the speed controller is finally ensured.
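For reference, the generic Lyapunov argument behind the speed loop can be written compactly as follows. This is a sketch of the standard backstepping step; the patent's exact formulas are given only as images, so the closed-loop form assumed in the comment is illustrative.

```latex
% Generic sketch of the speed-loop Lyapunov argument
\begin{aligned}
  e &= u_d - u, \qquad V = \tfrac{1}{2}e^{2},\\
  \dot V &= e\,\dot e = e\,(\dot u_d - \dot u).
\end{aligned}
% If \tau_u is chosen so that the closed-loop surge dynamics give
% \dot u = \dot u_d + k_u e, then
\dot V = -k_u e^{2} \le 0 \qquad (k_u > 0),
% and the speed error converges asymptotically by Lyapunov's direct method.
```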
Similarly, according to formula (2), the heading control system of the underwater robot can be obtained (the formula is given as an image in the original).
Likewise, define the heading deviations z_1 = ψ_d - ψ and z_2 = α - r, where α is the intermediate virtual control quantity; taking the derivative of z_1 gives \dot{z}_1 = \dot{ψ}_d - \dot{ψ} = \dot{ψ}_d - r.
At the same time, consider the following Lyapunov function:
V_1 = (1/2) z_1^2    (13)
Differentiating equation (13) gives:
\dot{V}_1 = z_1 \dot{z}_1 = z_1 (\dot{ψ}_d - r)    (14)
To ensure that V_1 converges asymptotically, \dot{V}_1 must be negative definite, so the intermediate virtual control quantity is designed as α = \dot{ψ}_d + k_ψ1 z_1. Substituting α into formula (14) (with r = α - z_2) gives:
\dot{V}_1 = -k_ψ1 z_1^2 + z_1 z_2    (15)
As can be seen from equation (15), as long as the design parameter k_ψ1 is a positive number, the Lyapunov stability condition is satisfied, so that the z_1 subsystem is stabilized. For the z_2 subsystem, the Lyapunov function is defined as follows:
V_2 = V_1 + (1/2) z_2^2    (16)
Differentiating equation (16) gives:
\dot{V}_2 = \dot{V}_1 + z_2 \dot{z}_2    (17)
Substituting formula (14) together with z_1, \dot{z}_1, z_2 and \dot{z}_2 into (17) gives:
\dot{V}_2 = -k_ψ1 z_1^2 + z_1 z_2 + z_2 \dot{z}_2    (with \dot{z}_2 expanded using the heading dynamics; formula (18) in the original)
To make \dot{V}_2 negative definite, the control moment τ_r must be designed accordingly; the specific form is formula (19) (given as a formula image in the original).
Finally, substituting formula (19) into formula (18) gives:
\dot{V}_2 = -k_ψ1 z_1^2 - k_ψ2 z_2^2    (20)
As can be seen from equation (20), as long as the design parameters k_ψ1 and k_ψ2 are positive numbers, the stability of the heading controller is ensured; substituting z_1 and z_2 back, the final control moment of the heading controller is obtained as formula (21) (given as a formula image in the original).
In summary, as long as the speed and heading control laws are designed according to formulas (11) and (21), and the controller parameters k_u, k_ψ1 and k_ψ2 for speed and heading are positive numbers, the speed and heading of the autonomous underwater robot can be well controlled.
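Analogously, the heading-loop argument can be summarized in the following sketch. It shows the generic backstepping structure under the planar assumption that \dot{ψ} = r; the patent's formulas (13)-(21) appear only as images, so the specific choice of \dot{z}_2 in the comment is illustrative.

```latex
% Generic sketch of the heading-loop Lyapunov argument
\begin{aligned}
  z_1 &= \psi_d - \psi, \qquad z_2 = \alpha - r, \qquad
  \alpha = \dot\psi_d + k_{\psi 1} z_1,\\
  V_1 &= \tfrac{1}{2}z_1^{2}, \qquad
  \dot V_1 = z_1(\dot\psi_d - r) = -k_{\psi 1} z_1^{2} + z_1 z_2,\\
  V_2 &= V_1 + \tfrac{1}{2}z_2^{2}, \qquad
  \dot V_2 = -k_{\psi 1} z_1^{2} + z_1 z_2 + z_2\dot z_2 .
\end{aligned}
% If \tau_r is chosen so that \dot z_2 = -k_{\psi 2} z_2 - z_1, then
\dot V_2 = -k_{\psi 1} z_1^{2} - k_{\psi 2} z_2^{2} \le 0 ,
% so z_1 and z_2 converge for any positive k_{\psi 1}, k_{\psi 2}.
```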
Other steps and parameters are the same as in the first embodiment.
Examples
The invention is applicable to virtually any form of autonomous underwater robot, i.e. any corresponding autonomous underwater robot can be modeled. Because the content of the invention needs to be simulated in a simulation environment to verify the control effect of the speed and heading controllers, the autonomous underwater robot must first be modeled mathematically for the simulation experiments.
Setting the simulation parameters: in order to observe the effect of the Q-learning-based adaptive backstepping speed and heading controllers, corresponding simulation tests are carried out on the speed controller and the heading controller using the kinematic and dynamic models of an autonomous underwater robot. The desired value of the speed is set to u_d = 1 m/s and the desired value of the heading to ψ_d = 1 rad, and the control effect on speed and heading is observed. The return functions of the speed and heading controllers and the parameters related to the ε-greedy strategy are set as follows: λ = 10, Λ = diag([0.8, 0.2]), ε_0 = 0.4, μ = 200, a = 800, ξ = 400; the single control step is T_s = 0.5 s, the simulation time of a single control period is M = 50 s, the discount rate is γ = 0.9, the learning rate is α = 0.9, k_u0 = 3, and the initial heading-controller parameters k_ψ1,0 and k_ψ2,0 are set (their values are given as a formula image in the original). The speed and heading are initialized as u_0 = 0 m/s, v_0 = 0 m/s, ψ_0 = 0 rad, r_0 = 0 rad/s.
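Collected as a Python configuration sketch: the dictionary layout is illustrative, the initial heading-controller gains are left as placeholders because their values appear only as a formula image in the original, and the assignment of 200 and 800 to μ and a follows the reading adopted above.

```python
sim_config = {
    "u_d": 1.0,             # desired speed [m/s]
    "psi_d": 1.0,           # desired heading [rad]
    "lambda": 10.0,         # reward magnitude coefficient
    "Lambda": [0.8, 0.2],   # diagonal of the second-order weight matrix
    "eps0": 0.4, "mu": 200.0, "xi": 400.0,  # epsilon-greedy decay constants (see text)
    "a": 800.0,             # reward magnitude control parameter (assumed mapping)
    "T_s": 0.5,             # control step [s]
    "M": 50.0,              # simulation time of one control period [s]
    "gamma": 0.9, "alpha": 0.9,
    "k_u0": 3.0,            # initial speed-controller gain
    "k_psi1_0": None, "k_psi2_0": None,     # given only as a formula image in the original
    "init": {"u0": 0.0, "v0": 0.0, "psi0": 0.0, "r0": 0.0},
}
```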
Meanwhile, in order to better observe the control effect of the speed and heading controllers based on the reinforcement-learning adaptive backstepping method, two groups of simulation experiments are carried out under different conditions: the first group simulates the speed and heading controllers without ocean current interference, and the second group simulates them with ocean current interference; the parameter settings are as described above.
Simulation results and analysis:
the simulation effect of the speed and heading controller based on the Q learning adaptive backstepping method under the condition of no ocean current interference is shown in the following graphs, fig. 3 and 4 are respectively comparison graphs of the speed and heading controller based on the Q learning adaptive backstepping method and a traditional backstepping controller under the condition of no ocean current interference, fig. 5 and 6 are respectively graphs of longitudinal thrust and yaw moment based on the speed and heading controller based on the Q learning adaptive backstepping method, fig. 7, 8 and 9 are respectively comparison graphs of parameter value changes of the speed controller and the heading controller after 800 times of training and parameter value changes of the speed controller and the heading controller when the training is started, and fig. 10 and 11 are respectively graphs of speed deviation, deviation of yaw angle and change along with the increase of the training times of the speed controller and the heading controller based on the Q learning adaptive backstepping method.
The speed and heading controller simulation effect based on the Q learning adaptive backstepping method under the condition of ocean current interference is shown in the following graph, figures 12 and 13 are graphs comparing a Q-learning based adaptive back-stepping speed and heading controller with a conventional back-stepping controller in the presence of ocean current disturbances, figures 14 and 15 show respectively the longitudinal thrust and yaw moment diagrams of the adaptive backstepping speed and heading controller based on Q-learning in the presence of a sea current disturbance, FIGS. 16, 17 and 18 are graphs comparing the variation of the parameter values of the speed controller and the heading controller after 800 times of training with the variation of the parameter values of the speed controller and the heading controller at the beginning of training, fig. 19 and 20 show the speed deviation and the yaw angle deviation of the adaptive backstepping method speed controller and the heading controller based on Q learning under the condition of ocean current interference and change graphs along with the increase of training times respectively.
It can be seen from figs. 3, 4, 12 and 13 that the control effect of the Q-learning-based controller is better than that of the traditional controller both with and without ocean current. Without ocean current interference, the Q-learning-based speed and heading controllers have a shorter rise time than the traditional ones, because the parameters of the adaptive backstepping controller are variable rather than fixed over the whole reinforcement-learning process; according to the characteristics of reinforcement learning, the Q-learning-based controller can select different optimal actions in different states, and the changes of the three parameter values of the speed and heading controllers reflect this. Similarly, under ocean current interference, the Q-learning-based speed and heading controllers achieve a better control effect than the traditional controllers, with smaller overshoot and better anti-interference capability.
It can be seen from figs. 7, 8, 9, 16, 17 and 18 that the parameter values of the speed and heading controllers have changed from their initial values over the course of the reinforcement-learning training, which shows that the effect of combining Q-learning with the traditional controller is obvious. In addition, in the environment without ocean current interference, the speed-controller parameter k_u finally stabilizes at 4.2 and the heading-controller parameters k_ψ1 and k_ψ2 stabilize at 2 and 1.7 respectively; under ocean current interference, k_u finally stabilizes at 4.6 and k_ψ1 and k_ψ2 stabilize at 4.6 and 0.9 respectively. By contrast, in the two different cases the stabilized parameters of both the speed controller and the heading controller are different, which indicates that reinforcement learning does have the ability to self-learn and adapt in different environments.
The deviations of the speed and heading controllers based on reinforcement learning decrease continuously as the number of training episodes increases and finally stabilize at a fixed value or wander among a few small values, at which point the reinforcement-learning training has converged. Therefore, without ocean current interference, although 800 training episodes were performed, the reinforcement-learning-based speed controller is basically stable by the 150th episode and the heading controller by the 100th; similarly, under ocean current interference, as shown in figs. 19 and 20, the reinforcement-learning-based speed and heading controllers reach stability by the 400th and 350th episodes respectively.
Compared with the prior art:
the invention aims to improve the autonomy and intelligence of the motion control of the underwater robot, and combines a reinforcement learning method with a traditional controller, so that the parameters of the traditional controller can be adjusted on line in real time. There are many methods for adjusting parameters, but most of them are performed in a constant and structured environment, and typical methods for adjusting parameters mainly include a fuzzy algorithm, a genetic algorithm, and the like. These two schemes are briefly described below and compared to the algorithm of the present invention.
1. Fuzzy algorithm adjusting parameter
Mohammad Hedayati Khodayari et al., in "Modeling and control of autonomous underwater vehicle (AUV) in heading and depth attitude via self-adaptive fuzzy PID controller", designed an adaptive fuzzy PID controller applied to the two-channel tracking control of the heading and depth of an underwater robot and obtained good robustness, dynamics and stability. "Design of a fuzzy-like PD controller for an underwater robot" applies fuzzy PD control to the heading and depth control of an underwater robot; the fuzzy part of the controller is optimized in structure and parameters, and the scale factors of the PD part are optimized based on a minimum number of experiments in a real environment. "Application of adaptive fuzzy PID control in AUV control" adopts an adaptive fuzzy PID control method and uses fuzzy reasoning to tune the PID parameters online. However, parameter adjustment based on fuzzy algorithms needs a large amount of expert prior knowledge, which is not conducive to application and popularization; in contrast, the reinforcement-learning-based parameter-adjustment method adopted by the invention can adjust parameters online without any prior knowledge, obtaining knowledge only through continuous interaction between its sensors and the environment and thereby selecting actions autonomously.
2. Biological intelligent algorithm adjusting parameter
Biologically inspired intelligent algorithms are a series of optimization algorithms derived by imitating biological evolution, animal foraging, plant growth, natural phenomena, ecological balance and so on, and mainly include the genetic algorithm, particle swarm optimization, ant colony optimization, bacterial foraging optimization and the like. "Improved particle swarm optimization of the S-surface controller of an underwater robot" proposes an improved particle swarm optimization algorithm to optimize the parameters of an S-surface controller: dynamic compression factors are adopted to accelerate particle convergence and an annealing algorithm is introduced to improve the local search capability; simulation and pool experiments on an underwater robot show that the algorithm works well for parameter optimization of the nonlinear controller of an underwater robot. "Research on fuzzy control technology for underwater robot path tracking based on genetic algorithm optimization" uses a genetic algorithm to optimize the parameters of a path-tracking controller of an underwater robot and compares the tracking effect under time-varying continuous ocean current interference and constant discontinuous ocean current interference; simulation experiments prove the effectiveness of the controller optimized by the genetic algorithm. However, such optimization algorithms converge slowly, search for a long time, easily fall into local optima, and mainly adjust parameters in an offline environment, so the parameters need to be adjusted and corrected on site according to the real-time environment and requirements. The reinforcement-learning-based adaptive backstepping speed and heading control method proposed by the invention can adjust the parameters online in real time according to changes in the environment.
It should be noted that the detailed description is only for illustrating and explaining the technical solution of the invention, and the scope of protection of the claims is not limited thereby. All modifications and variations falling within the scope of the following claims and the description are intended to be included within the scope of the invention.

Claims (7)

1. A method for controlling the backstepping speed and the heading of an underwater robot based on a Q-learning parameter adaptive technology is characterized by comprising the following steps of:
based on a kinematic model and a dynamic model of the underwater robot, a speed controller and a heading controller are established by utilizing a backstepping method;
τ_u = (speed control law designed by the backstepping method; given as a formula image in the original)
τ_r = (heading control law designed by the backstepping method; given as a formula image in the original)
wherein k_u, k_ψ1 and k_ψ2 are all positive numbers and are the control parameters to be designed; τ_u is the longitudinal thrust of the propeller; τ_r is the bow-turning moment; x, y, z are the positions along the three axes of the underwater robot coordinate system; φ, θ and ψ are respectively the roll angle, the pitch angle and the heading angle; u, v and w are respectively the longitudinal, transverse and vertical linear velocities, and p, q and r are respectively the roll, pitch and yaw angular velocities; |·| denotes the absolute value; X_u, X_{\dot u}, X_{u|u|}, Y_v, Y_{\dot v}, Y_{v|v|}, N_r, N_{\dot r}, N_{r|r|} are all dimensionless hydrodynamic coefficients; I_z is the moment of inertia of the underwater robot about the z axis of the body coordinate system, and m is the mass of the underwater robot;
and the control parameters of the backstepping method are optimized and adjusted by using Q learning:
for the speed controller, the input state vector is S_u = {s_1u, s_2u}; s_1u and s_2u are respectively the space-transformed values corresponding to e_u and \dot{e}_u; the output of Q-learning is the speed-controller parameter k'_u, whose range is the action space to be divided, from which the Q-value table of the speed controller is established;
for the heading controller, the input state vector is S_ψ = {s_1ψ, s_2ψ, s_3ψ}; s_1ψ and s_2ψ are respectively the space-transformed values corresponding to e_ψ and \dot{e}_ψ, and s_3ψ is the space-transformed value corresponding to the longitudinal velocity u; the output of Q-learning is the two heading-controller control parameters k'_ψ1 and k'_ψ2, whose ranges are the action space to be divided, from which the Q-value table of the heading controller is established;
the input of the Q-learning speed controller is the speed deviation and the speed-deviation change rate, and the output, obtained through the Q-learning algorithm, is the adjustment parameter k'_u of the speed controller; similarly, the inputs of the Q-learning heading controller are the heading-angle deviation, the heading-angle-deviation change rate and the real-time speed of the underwater robot, and the outputs, obtained through the Q-learning algorithm, are the two adjustment parameters k'_ψ1 and k'_ψ2 of the heading controller;
then k_u, k_ψ1 and k_ψ2 are determined from k'_u, k'_ψ1 and k'_ψ2:
k_u = k_u0 + k'_u
k_ψ1 = k_ψ1,0 + k'_ψ1
k_ψ2 = k_ψ2,0 + k'_ψ2
wherein k_ψ1,0 and k_ψ2,0 are the initial values of the two parameters of the heading controller;
the control parameters k_u, k_ψ1 and k_ψ2 are then substituted into the speed controller and the heading controller to realize control of the underwater robot.
2. The method for controlling the backstepping speed and the heading of the underwater robot based on the Q-learning parameter adaptive technology as claimed in claim 1, wherein the specific process of establishing the Q value table of the speed controller and the Q value table of the heading controller comprises the following steps:
for the speed controller, the input state vector is S_u = {s_1u, s_2u}; s_1u and s_2u are respectively the transformed values corresponding to e_u and \dot{e}_u; e_u ∈ [-2, 2] is the speed deviation and \dot{e}_u is the speed-deviation change rate; on the interval [-2, 2] the speed deviation is divided evenly every 0.2 into 21 state values, i.e. [-2, -1.8, ..., 1.8, 2], and s_1u is the value of e_u in the corresponding new universe of discourse, i.e. the transformed value; on the interval [-1, 1] the speed-deviation change rate is divided evenly every 0.1 into 21 states, i.e. [-1, -0.9, ..., 0.9, 1], and s_2u is the value of \dot{e}_u in the corresponding new universe of discourse, i.e. the transformed value; the output of reinforcement learning is the speed-controller parameter k'_u; k'_u ∈ [-1, 2] is the action space to be divided, and it is divided evenly every 0.2 into 16 action values, i.e. [-1, -0.8, ..., 1.8, 2]; in summary, a 21 × 21 × 16 Q-value table is established for the speed controller;
for the heading controller, the input state vector is S_ψ = {s_1ψ, s_2ψ, s_3ψ}; s_1ψ and s_2ψ are respectively the transformed values corresponding to e_ψ and \dot{e}_ψ; e_ψ is the heading-angle deviation and \dot{e}_ψ is its change rate; the interval [-π, π] is written as [-3.14, 3.14] and approximated as [-3, 3]; on [-3, 3] the heading-angle deviation is divided evenly every 0.3 into 21 state values, i.e. [-3, -2.7, ..., 2.7, 3], and s_1ψ is the value of e_ψ in the corresponding new universe of discourse, i.e. the transformed value; on the interval [-1, 1] the heading-angle-deviation change rate is divided evenly every 0.1 into 21 states, i.e. [-1, -0.9, ..., 0.9, 1], and s_2ψ is the value of \dot{e}_ψ in the corresponding new universe of discourse, i.e. the transformed value; s_3ψ is the value of the longitudinal velocity u in the corresponding new universe of discourse, the longitudinal-velocity interval [-2, 2] being divided evenly every 0.2 into 21 states; the output of reinforcement learning is the two heading-controller control parameters k'_ψ1 and k'_ψ2, whose ranges (given as formula images in the original) form the action space to be divided; they are divided evenly every 0.2 and every 0.1 respectively into 16 action values each, actions 1 to 16 corresponding to k'_ψ1 and actions 17 to 32 corresponding to k'_ψ2; a 21 × 21 × 21 × 32 Q-value table is thus established for the heading controller.
3. The method for controlling the backstepping speed and the heading of the underwater robot based on the Q-learning parameter adaptive technology as claimed in claim 2, wherein in the Q learning process, a reward function is as follows:
r = f(σ; λ, Λ, a)    (reward-function expression, given as a formula image in the original)
wherein λ is a coefficient controlling the magnitude of the reward and punishment function, λ > 0; for the speed controller σ = [e_u, \dot{e}_u]^T, and for the heading controller σ = [e_ψ, \dot{e}_ψ]^T; e_u is the speed deviation, \dot{e}_u is the speed-deviation change rate, e_ψ is the heading deviation, and \dot{e}_ψ is the heading-deviation change rate; Λ is a second-order diagonal matrix representing the influence factor of each component of σ on the reward and punishment function; a is a magnitude control parameter of the reward function.
4. The method for controlling the backstepping speed and the heading of the underwater robot based on the Q-learning parameter adaptive technology as claimed in claim 3, wherein in the Q learning process, the objective function is as follows:
the control objectives of the speed and heading of the underwater robot are such that the underwater robot reaches and maintains the desired speed and desired heading, i.e. maximizes the desired cumulative reward function, so the objective function of the markov decision process model is as follows:
max E[ Σ_t γ^t r_{t+1}(s_{t+1}) ]
wherein r_{t+1}(s_{t+1}) is the reward value corresponding to the state s_{t+1} at time t+1; γ is the discount rate; and E[·] denotes the expectation.
5. The method for controlling the backstepping speed and the heading of the underwater robot based on the Q-learning parameter adaptive technology as claimed in claim 4, wherein in the Q learning process, the parameter selection process of the speed controller and the heading controller is as follows:
the invention relates to a speed controller and a heading controller based on Q learning, wherein the parameter selection mode of the speed controller and the heading controller is an epsilon greedy strategy, epsilon is an epsilon (0,1), the epsilon greedy strategy that random probability epsilon is continuously attenuated along with the increase of iteration round number is adopted, when the value of epsilon is closer to 0, the training is shown to be in the last stage, and a Q learning system is more biased to utilize the learned experience; when the value of epsilon is closer to 1, the Q learning system is more inclined to explore the epsilon greedy strategy when the training is started; searching with the probability of epsilon and utilizing with the probability of 1-epsilon (if there are a plurality of actions with the same Q value, selecting one action at random) each time the action is selected; the concrete form is as follows:
π(a|s) = argmax_a Q(s, a) with probability 1 − ε (exploitation); a random action a ∈ A with probability ε (exploration)
ε = ε_0 · e^{(μ − step)/ξ}
wherein Q(s, a) is the value of the Q function and π(a|s) is the policy of taking action a in state s; ε_0 is the initial value of ε, μ is the decay factor, ξ is the control factor that keeps ε within the (0, 1) interval, and step represents the current iteration round number.
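A small sketch of the decaying ε-greedy selection described above; ε_0, μ, ξ and the random seed are illustrative values, not those of the patent.

```python
import numpy as np

rng = np.random.default_rng(0)
eps0, mu, xi = 1.0, 0.0, 200.0      # illustrative decay parameters

def epsilon(step):
    """Decay epsilon towards 0 as the iteration round number grows, kept inside (0, 1)."""
    return float(np.clip(eps0 * np.exp((mu - step) / xi), 1e-3, 1.0 - 1e-3))

def select_action(q_row, step):
    """Explore with probability epsilon, otherwise exploit (ties broken at random)."""
    if rng.random() < epsilon(step):
        return int(rng.integers(len(q_row)))
    best = np.flatnonzero(q_row == q_row.max())
    return int(rng.choice(best))

q_row = np.zeros(32)                       # Q values of the current state
print(select_action(q_row, step=10))       # early rounds: mostly exploration
```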
6. The method for controlling the backstepping speed and heading of the underwater robot based on the Q-learning parameter adaptive technology as claimed in claim 5, wherein in the Q learning process, the learning and updating process of the parameter adaptive backstepping controller based on Q learning comprises the following steps:
initializing the Q-value table over the divided states and actions, setting all Q values in the table to 0, setting an initial speed and an initial heading angle, and obtaining the initial state s_t through state conversion; an action a_t is then selected via the ε-greedy strategy, the system transitions to the next state s_{t+1}, and the real-time reward r_{t+1} given by the environment is obtained; based on this information the Q-value table can be updated, and the specific update formula is as follows:
Q(s_t, a_t) ← Q(s_t, a_t) + α[ r_{t+1} + γ · max_a Q(s_{t+1}, a) − Q(s_t, a_t) ]
where α is the learning rate and γ is the discount rate.
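A sketch of this tabular Q-value update, assuming the 21 × 21 × 21 × 32 table of claim 2 and an illustrative transition; α and γ are as defined in the claim, the state and action indices are placeholders.

```python
import numpy as np

alpha, gamma = 0.1, 0.9
q_table = np.zeros((21, 21, 21, 32))

def q_update(s, a, r, s_next):
    """One-step Q-learning update: move Q(s,a) towards r + gamma * max_a' Q(s',a')."""
    td_target = r + gamma * q_table[s_next].max()
    q_table[s + (a,)] += alpha * (td_target - q_table[s + (a,)])

s, a, r, s_next = (10, 10, 10), 5, -0.4, (10, 11, 10)   # illustrative transition
q_update(s, a, r, s_next)
print(q_table[s + (a,)])                                 # updated Q value
```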
7. The method for controlling the backstepping speed and the heading of the underwater robot based on the Q-learning parameter adaptive technology as claimed in claim 6, wherein the kinematic model and the dynamic model of the underwater robot are as follows:
the underwater robot kinematics model: the kinematics model reflects the conversion relation between the pose and the velocity:

η̇ = J(η) · v

in the formula, η = [x y z φ θ ψ]^Τ represents the position and attitude of the underwater robot, v = [u v w p q r]^Τ represents the linear and angular velocities of the underwater robot, and J(η) is the Euler-angle transformation matrix from the body-fixed frame to the earth-fixed frame;
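A sketch of the pose/velocity conversion η̇ = J(η)·v using the standard Euler-angle transformation; this reproduces the usual form of the 6-DOF kinematics rather than the exact matrices behind the claim's formula images, and the numbers are illustrative.

```python
import numpy as np

def J(phi, theta, psi):
    """6x6 transformation from body-frame velocities v to earth-frame rates eta_dot."""
    c, s, t = np.cos, np.sin, np.tan
    R = np.array([  # linear-velocity rotation matrix
        [c(psi)*c(theta), c(psi)*s(theta)*s(phi) - s(psi)*c(phi), c(psi)*s(theta)*c(phi) + s(psi)*s(phi)],
        [s(psi)*c(theta), s(psi)*s(theta)*s(phi) + c(psi)*c(phi), s(psi)*s(theta)*c(phi) - c(psi)*s(phi)],
        [-s(theta),       c(theta)*s(phi),                        c(theta)*c(phi)],
    ])
    T = np.array([  # angular-velocity transformation (singular at theta = +-90 deg)
        [1.0, s(phi)*t(theta),  c(phi)*t(theta)],
        [0.0, c(phi),          -s(phi)],
        [0.0, s(phi)/c(theta),  c(phi)/c(theta)],
    ])
    out = np.zeros((6, 6))
    out[:3, :3], out[3:, 3:] = R, T
    return out

v = np.array([1.5, 0.0, 0.0, 0.0, 0.0, 0.1])   # [u v w p q r], illustrative
eta_dot = J(0.0, 0.0, 0.3) @ v                  # pose rates [x_dot ... psi_dot]
print(eta_dot)
```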
the underwater robot dynamics model: a six-degree-of-freedom dynamic model of the underwater robot is adopted:

M_RB · v̇ + C_RB(v) · v + M_A · v̇ + C_A(v) · v + D(v) · v + g(η) = τ

wherein M_RB · v̇ + C_RB(v) · v is the inertial force and the Coriolis centripetal force, M_RB is the rigid-body inertia matrix and C_RB is the Coriolis centripetal force matrix; M_A · v̇ + C_A(v) · v is the added mass force, M_A is the added mass matrix and C_A is the added-mass Coriolis centripetal matrix; D(v) · v is the damping force and D is the damping force matrix; g(η) is the gravity and buoyancy (restoring) force; τ is the thruster thrust;
the six-degree-of-freedom dynamic model of the underwater robot is simplified into a horizontal-plane kinematics and dynamics model: the underwater robot is assumed to be symmetric fore-and-aft, port-and-starboard and top-to-bottom, the centre of gravity and the centre of buoyancy lie on the same vertical line, and gravity and buoyancy are balanced; the horizontal-plane kinematics and dynamics model is then as follows:
ẋ = u·cos ψ − v·sin ψ,  ẏ = u·sin ψ + v·cos ψ,  ψ̇ = r

[surge, sway and yaw dynamic equations: formula image not recoverable; the longitudinal thrust τ_u enters the surge equation and the yaw moment τ_r enters the yaw equation, while the sway motion is unactuated]
wherein τ_u is the longitudinal thrust, generated by the main propeller; τ_r is the yaw moment, generated by a group of vertical rudders; the underwater robot is underactuated.
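An Euler-integration sketch of a horizontal-plane model of this kind, with τ_u driving surge and τ_r driving yaw while sway is unactuated; the mass and damping coefficients and the linear-damping form are assumptions for illustration, not the patent's identified values.

```python
import numpy as np

m11, m22, m33 = 25.0, 30.0, 3.0       # inertia terms including added mass (illustrative)
d11, d22, d33 = 8.0, 10.0, 2.0        # linear damping terms (illustrative)

def step(state, tau_u, tau_r, dt=0.05):
    """Advance [x, y, psi, u, v, r] by one Euler step of an assumed 3-DOF model."""
    x, y, psi, u, v, r = state
    u_dot = ( m22 * v * r - d11 * u + tau_u) / m11           # surge, actuated by tau_u
    v_dot = (-m11 * u * r - d22 * v) / m22                   # sway, unactuated
    r_dot = ((m11 - m22) * u * v - d33 * r + tau_r) / m33    # yaw, actuated by tau_r
    x_dot = u * np.cos(psi) - v * np.sin(psi)
    y_dot = u * np.sin(psi) + v * np.cos(psi)
    return np.array([x + dt*x_dot, y + dt*y_dot, psi + dt*r,
                     u + dt*u_dot, v + dt*v_dot, r + dt*r_dot])

state = np.zeros(6)
for _ in range(100):                   # constant thrust and rudder moment, illustrative
    state = step(state, tau_u=20.0, tau_r=1.0)
print(state)                           # final [x, y, psi, u, v, r]
```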
CN202010087509.XA 2020-02-11 2020-02-11 Underwater robot backstepping speed and heading control method based on Q-learning parameter adaptive technology Active CN111290270B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010087509.XA CN111290270B (en) 2020-02-11 2020-02-11 Underwater robot backstepping speed and heading control method based on Q-learning parameter adaptive technology

Publications (2)

Publication Number Publication Date
CN111290270A true CN111290270A (en) 2020-06-16
CN111290270B CN111290270B (en) 2022-06-03

Family

ID=71021357

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010087509.XA Active CN111290270B (en) 2020-02-11 2020-02-11 Underwater robot backstepping speed and heading control method based on Q-learning parameter adaptive technology

Country Status (1)

Country Link
CN (1) CN111290270B (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107544256A (en) * 2017-10-17 2018-01-05 西北工业大学 Underwater robot sliding-mode control based on adaptive Backstepping
CN108762281A (en) * 2018-06-08 2018-11-06 哈尔滨工程大学 It is a kind of that intelligent robot decision-making technique under the embedded Real-time Water of intensified learning is associated with based on memory
CN108873687A (en) * 2018-07-11 2018-11-23 哈尔滨工程大学 A kind of Intelligent Underwater Robot behavior system knot planing method based on depth Q study
CN109739090A (en) * 2019-01-15 2019-05-10 哈尔滨工程大学 A kind of autonomous type underwater robot neural network intensified learning control method
CN110333739A (en) * 2019-08-21 2019-10-15 哈尔滨工程大学 A kind of AUV conduct programming and method of controlling operation based on intensified learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
马琼雄 et al.: "基于深度强化学习的水下机器人最优轨迹控制" [Optimal trajectory control of underwater robots based on deep reinforcement learning], 《华南师范大学学报(自然科学版)》 [Journal of South China Normal University (Natural Science Edition)], vol. 50, no. 1, 25 February 2018 (2018-02-25), pages 118-123 *

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111966118A (en) * 2020-08-14 2020-11-20 哈尔滨工程大学 ROV thrust distribution and reinforcement learning-based motion control method
CN113033815A (en) * 2021-02-07 2021-06-25 广州杰赛科技股份有限公司 Intelligent valve cooperation control method, device, equipment and storage medium
CN112947054A (en) * 2021-02-22 2021-06-11 武汉理工大学 Ship PID control parameter setting method and system based on Q-learning and storage medium
CN112947505A (en) * 2021-03-22 2021-06-11 哈尔滨工程大学 Multi-AUV formation distributed control method based on reinforcement learning algorithm and unknown disturbance observer
CN113325857A (en) * 2021-06-08 2021-08-31 西北工业大学 Simulated bat ray underwater vehicle depth control method based on centroid and buoyancy system
CN113325857B (en) * 2021-06-08 2022-08-05 西北工业大学 Simulated bat ray underwater vehicle depth control method based on centroid and buoyancy system
CN113639755A (en) * 2021-08-20 2021-11-12 江苏科技大学苏州理工学院 Fire scene escape-rescue combined system based on deep reinforcement learning
CN117079118A (en) * 2023-10-16 2023-11-17 广州华夏汇海科技有限公司 Underwater walking detection method and system based on visual detection
CN117079118B (en) * 2023-10-16 2024-01-16 广州华夏汇海科技有限公司 Underwater walking detection method and system based on visual detection

Also Published As

Publication number Publication date
CN111290270B (en) 2022-06-03

Similar Documents

Publication Publication Date Title
CN111290270B (en) Underwater robot backstepping speed and heading control method based on Q-learning parameter adaptive technology
CN108803321B (en) Autonomous underwater vehicle track tracking control method based on deep reinforcement learning
CN110806756B (en) Unmanned aerial vehicle autonomous guidance control method based on DDPG
CN111240345B (en) Underwater robot trajectory tracking method based on double BP network reinforcement learning framework
CN111240344B (en) Autonomous underwater robot model-free control method based on reinforcement learning technology
CN112462792B (en) Actor-Critic algorithm-based underwater robot motion control method
CN111176122B (en) Underwater robot parameter self-adaptive backstepping control method based on double BP neural network Q learning technology
CN108255060B (en) Ship dynamic positioning active disturbance rejection control method based on extreme learning machine
CN111966118A (en) ROV thrust distribution and reinforcement learning-based motion control method
CN114815626B (en) Prediction active disturbance rejection and stabilization reduction control method of rudder fin system
CN111580387B (en) Time-lag fractional order based ship motion adaptive sliding mode control method and system
CN112947505B (en) Multi-AUV formation distributed control method based on reinforcement learning algorithm and unknown disturbance observer
Stevšić et al. Sample efficient learning of path following and obstacle avoidance behavior for quadrotors
CN113885534A (en) Intelligent prediction control-based water surface unmanned ship path tracking method
CN111273677B (en) Autonomous underwater robot speed and heading control method based on reinforcement learning technology
CN112650233A (en) Unmanned ship trajectory tracking optimal control method based on backstepping method and self-adaptive dynamic programming under dead zone limitation
Gao et al. Online optimal control for dynamic positioning of vessels via time-based adaptive dynamic programming
CN115903820A (en) Multi-unmanned-boat pursuit and escape game control method
Deng et al. Data-driven unmanned surface vessel path following control method based on reinforcement learning
Yuan et al. Deep reinforcement learning-based controller for dynamic positioning of an unmanned surface vehicle
CN110703792B (en) Underwater robot attitude control method based on reinforcement learning
CN114578819A (en) Control method for multi-surface ship distributed formation based on artificial potential field method
Cao et al. A realtime Q-Learning method for unmanned surface vehicle target tracking
Song et al. Enhanced fireworks algorithm-auto disturbance rejection control algorithm for robot fish path tracking
Wang et al. Course tracking control for smart ships based on a deep deterministic policy gradient-based algorithm

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant