CN112925319A

CN112925319A - Underwater autonomous vehicle dynamic obstacle avoidance method based on deep reinforcement learning

Info

Publication number: CN112925319A
Application number: CN202110098934.3A
Authority: CN
Inventors: 孙玉山; 罗孝坤; 张国成; 李岳明; 薛源; 于鑫; 张红星
Original assignee: Harbin Engineering University
Current assignee: Harbin Engineering University
Priority date: 2021-01-25
Filing date: 2021-01-25
Publication date: 2021-06-08
Anticipated expiration: 2041-01-25
Also published as: CN112925319B

Abstract

An underwater autonomous vehicle dynamic obstacle avoidance method based on deep reinforcement learning relates to the technical field of underwater robot obstacle avoidance. The invention aims to solve the problem that the obstacle avoidance research of an underwater autonomous vehicle on dynamic obstacles is lacked at present. The method comprises the steps of establishing an underwater autonomous vehicle model and a kinematics model, and acquiring information of surrounding obstacles; acquiring the motion state information of maneuvering obstacles around the underwater autonomous vehicle, and constructing a dynamic obstacle state equation; predicting a dynamic obstacle kinematics model according to a dynamic obstacle state equation; according to the information of obstacles around the underwater autonomous vehicle and the dynamic obstacle kinematics model, a multi-dynamic obstacle avoiding method is fused to generate an obstacle avoiding strategy and the obstacle avoiding strategy is converted into an MDP model; training the MDP model by combining a deterministic depth strategy gradient algorithm until the underwater autonomous vehicle can reach a target area without collision; and guiding the underwater autonomous vehicle to navigate by using the trained MDP model.

Description

Underwater autonomous vehicle dynamic obstacle avoidance method based on deep reinforcement learning

Technical Field

The invention belongs to the technical field of obstacle avoidance of underwater robots.

Background

In recent years, with the continuous progress of new materials, new energy sources, artificial intelligence and other technologies in accordance with the needs of ocean development and ocean military applications, the research pace of Autonomous Underwater Vehicles (AUV) is accelerated and made to progress by ocean major countries of various countries. Compared with manned underwater vehicles, the AUV has focused attention of various scholars by virtue of the advantages of strong maneuverability, wide action range, no casualty risk, high adaptability and viability, low manufacturing and maintenance cost and the like. The autonomous underwater vehicle is not limited to the use in the marine environment, is gradually applied to various water areas such as channel water areas, water delivery tunnels, port water areas and the like, and becomes key equipment for exploration, underwater environment detection, underwater rescue and the like of the underwater world.

The underwater environment is complex and changeable, when the underwater autonomous vehicle reaches the underwater navigation, the underwater autonomous vehicle faces large and small obstacles, is static and also moves, and seriously threatens the running safety of the underwater autonomous vehicle. Most researchers at present make great and small progress in the aspect of obstacle avoidance of static obstacles of an underwater autonomous vehicle, but the obstacle avoidance research on dynamic obstacles is rarely carried out. Various dynamic obstacles such as underwater floaters, navigation ships and the like exist underwater, and an underwater autonomous vehicle can complete a specified task only by having high autonomous obstacle avoidance capability and safely return to the home. Therefore, the autonomous obstacle avoidance research of the underwater autonomous vehicle in the environment with a plurality of dynamic obstacles is one of important technologies in the field of underwater autonomous vehicles.

Disclosure of Invention

The invention provides a dynamic obstacle avoidance method of an underwater autonomous vehicle based on deep reinforcement learning, aiming at solving the problem that the obstacle avoidance research of the underwater autonomous vehicle on dynamic obstacles is lacked at present.

An underwater autonomous vehicle dynamic obstacle avoidance method based on deep reinforcement learning comprises the following steps:

the method comprises the following steps: establishing an underwater autonomous vehicle model and a kinematics model so as to obtain the information of obstacles around the underwater autonomous vehicle;

step two: acquiring motion state information of maneuvering obstacles around an underwater autonomous vehicle, and constructing a dynamic obstacle state equation, wherein the motion state information comprises: motion state vectors, state transition matrices, process noise and input control matrices;

step three: predicting a dynamic obstacle kinematics model according to a dynamic obstacle state equation by utilizing a particle filtering method associated with probability data;

step four: establishing an online training environment of multiple dynamic obstacles in a Cartesian coordinate system according to the information of the obstacles around the underwater autonomous vehicle obtained in the step one and the dynamic obstacle kinematics model obtained in the step three, and fusing a multi-dynamic obstacle avoiding method to generate an obstacle avoiding strategy;

step five: converting the obstacle avoidance strategy generated in the step four into an MDP model, and establishing a state set and an action set of the MDP model when the underwater autonomous vehicle faces a plurality of dynamic obstacles;

step six: taking the state set as the input of an MDP model and the action set as the output of the MDP model, and training the MDP model by combining a deterministic depth strategy gradient algorithm until the underwater autonomous aircraft under the MDP model can reach a target area without collision;

step seven: and guiding the underwater autonomous vehicle to navigate by using the trained MDP model.

Further, the underwater autonomous vehicle model of the first step comprises: one tail propeller, two side propellers and 7 obstacle avoidance sonars, the range finding sonar sampling frequency of the underwater autonomous vehicle model is 2Hz, the detection distance is 150 m-200 m, and the distribution angles under the coordinate system are as follows: 90 degrees, 60 degrees, 30 degrees, 0 degrees, 30 degrees, 60 degrees and 90 degrees;

the kinematic model is a kinematic model with 3 degrees of freedom on the horizontal plane, and the equation is as follows:

wherein the content of the first and second substances,

the method comprises the steps that a horizontal plane position vector of the underwater autonomous vehicle under a geodetic coordinate system is defined, upsilon is a horizontal plane velocity vector of the underwater autonomous vehicle under a carrier, R (psi) is a conversion matrix, psi is a yaw angle of the underwater autonomous vehicle, and R is a yaw angular velocity of the underwater autonomous vehicle under a satellite coordinate system.

Further, the dynamic obstacle state equation in the second step includes: a discrete time state equation of the uniform motion model when the sampling interval is T and a discrete time state equation of the uniform acceleration motion model when the sampling interval is T,

the expression of the discrete time state equation of the uniform motion model when the sampling interval is T is as follows:

X_k+1＝F_CVX_k+ω_k+1，

wherein, X_k+1And X_kThe states of the uniform motion model at the time k +1 and k, respectively, F_CVIs a state transition matrix, omega, of a uniform motion model_k+1In order to realize the process noise of the uniform motion model in discrete time,

the expression of the discrete time state equation of the uniform acceleration motion model when the sampling interval is T is as follows:

wherein the content of the first and second substances,

and

states of the uniform acceleration motion model at times k +1 and k, respectively, F_CAIs a state transition matrix of the uniform acceleration motion model,

is the process noise of the uniform acceleration motion model in discrete time.

Further, a state transition matrix F of the uniform motion model_CVThe expression of (a) is:

wherein the content of the first and second substances,

state transition matrix F of uniform acceleration motion model_CAThe expression of (a) is:

wherein the content of the first and second substances,

further, in the fourth step, a training environment map model is constructed by combining terrain information of the water area environment where the underwater autonomous vehicle is located, and then a plurality of dynamic obstacles are loaded in the training environment map model according to the dynamic obstacle kinematics model, so that the online training environment of the plurality of dynamic obstacles under a Cartesian coordinate system is obtained.

Further, in the fourth step, the behavior of the underwater autonomous vehicle towards the target is taken as a gravitational potential field function, the behavior of the underwater autonomous vehicle for avoiding dynamic obstacles is taken as a repulsive potential field function of the underwater autonomous vehicle,

the obstacle avoidance strategy is as follows:

when the sonar of the underwater autonomous vehicle detects the dynamic barrier, whether the dynamic barrier enters the action area of the repulsive potential field of the underwater autonomous vehicle is judged,

if so, the obstacle avoidance subtask priority is greater than the target tendency subtask priority, the course angle is continuously changed until the dynamic obstacle is separated from the repulsive force field action domain of the underwater autonomous vehicle,

and if not, the target trend subtask priority is greater than the obstacle avoidance subtask priority, and the heading is adjusted to be a pointing target, so that the underwater autonomous vehicle drives to a target area.

Further, the gravitational potential field function expression is as follows:

wherein k is₁Is the gravitational potential energy gain coefficient, x_tAnd y_tRespectively an abscissa and an ordinate of the position of the autonomous aircraft under water in a Cartesian coordinate system at time t, x_goalAnd y_goalRespectively are horizontal and vertical coordinates of the central position of the target area under a Cartesian coordinate system;

the repulsive force potential field function expression is as follows:

wherein k is₂Is a repulsive force potential energy gain coefficient, x'_tAnd y'_tRespectively, the abscissa and ordinate of the position of the dynamic obstacle at time t in a Cartesian coordinate system, d (q)_t,q’_t) Is the distance between the underwater autonomous vehicle and the dynamic barrier at the moment t, q_t＝(x_t,y_t)，q’_t＝(x’_t，y’_t)，d₀The distance L is large for the influence of the repulsive force potential field action domain energy of the underwater autonomous vehicle₁And L₂The length of the major axis and the length of the minor axis of the ellipse are respectively expanded into the ellipse by the autonomous underwater vehicle.

Further, in the fifth step, the MDP model expression is:

MDP＝(S,A,P_sa,R)，

wherein S is a state set, A is an action set, and P_saFor state transition probabilities, R is the reward function.

Further, in the fifth step, when facing a plurality of dynamic obstacles, the state set S of the MDP model is { S ═ S₁,S₂,...,S_t,...,S_T}，

Signals collected by 7 obstacle avoidance sonars of the underwater autonomous vehicle at the time t,

in step five, when facing a plurality of dynamic obstacles, the action set A of the MDP model is equal to { a ═ a }₁,a₂,...,a_t,...,a_T}，a_tThe yaw rate and the horizontal rate of the underwater autonomous vehicle at the moment t are { ω (t), v (t) }, ω (t) and v (t), respectively.

Further, the reward value R of the MDP model reward function R at the time t_tComprises the following steps:

r_t＝τ1r₁(s_t,a_t,s_t+1)+τ₂r₂(s_t,a_t,s_t+1)+τ₃r₃(s_t,a_t,s_t+1)，

wherein τ 1 is the proportionality coefficient of the target module, τ 2 is the proportionality coefficient of the safety module, τ 3 is the proportionality coefficient of the stability module, r₁(s_t,a_t,s_t+1) The prize value, r, for the time t of the goal module₂(s_t,a_t,s_t+1) The prize value r for the moment t of the security module₃(s_t,a_t,s_t+1) The prize value at time t of the stability module.

The dynamic obstacle avoidance method of the underwater autonomous vehicle based on the depth reinforcement learning can improve the dynamic obstacle avoidance capability of the underwater autonomous vehicle, effectively solve the problem that the underwater autonomous vehicle is difficult to avoid when the underwater autonomous vehicle encounters a plurality of dynamic obstacles to threaten the safety, and improve the safety of the underwater autonomous vehicle in a complex dynamic water area environment.

Meanwhile, the obstacle avoidance training of the underwater autonomous vehicle avoids collision damage of the underwater autonomous vehicle, and the dynamic obstacle avoidance strategy obtained by combining the kinematics model of the underwater autonomous vehicle can be directly applied to the actual underwater autonomous vehicle without secondary action planning, and compared with a traditional path planning and action planning separated mode, the method can save certain manpower and material resources.

Drawings

Fig. 1 is a flowchart of a dynamic obstacle avoidance method for an autonomous underwater vehicle based on depth reinforcement learning according to an embodiment;

FIG. 2 is a flow chart of path planning obstacle avoidance for an autonomous underwater vehicle in a multiple dynamic obstacle environment;

FIG. 3 is a schematic illustration of an underwater autonomous vehicle in a multiple dynamic obstacle environment;

fig. 4 is a schematic view of an autonomous underwater vehicle puffing process;

fig. 5 is a schematic block diagram of an obstacle avoidance controller for an underwater autonomous vehicle multi-dynamic obstacle environment based on a DDPG algorithm.

Detailed Description

The deep deterministic strategy gradient algorithm has good online adaptivity and learning capacity on a nonlinear system, and is widely researched in the fields of artificial intelligence, machine learning and automatic control. Therefore, the algorithm can be applied to a control system of the underwater autonomous vehicle to realize the autonomous obstacle avoidance function of the underwater autonomous vehicle so as to improve the adaptability of the environment. In addition, the depth certainty strategy gradient algorithm can also improve the problems of dimension disaster, long planning time, low precision and the like of other planning methods, and has important practical significance for the research of a method for avoiding a plurality of dynamic obstacles by an autonomous underwater vehicle.

The first embodiment is as follows: specifically describing the embodiment with reference to fig. 1 to 5, the method for dynamically avoiding obstacles of an autonomous underwater vehicle based on deep reinforcement learning in the embodiment includes the following steps:

the method comprises the following steps: and establishing an underwater autonomous vehicle model and a kinematics model so as to obtain the information of obstacles around the underwater autonomous vehicle.

In this embodiment, the underwater autonomous vehicle model includes: one tail propeller, two side propellers and 7 obstacle avoidance sonars. One tail propeller and two side propellers can realize the turning, advancing and retreating of the underwater autonomous vehicle, and the obstacle information around the underwater autonomous vehicle is obtained through 7 obstacle avoidance sonars. The range finding sonar sampling frequency of the underwater autonomous vehicle model is 2Hz, the detection distance is 200m, and the distribution angles under the coordinate system are as follows: 90 degrees, 60 degrees, 30 degrees, 0 degrees, 30 degrees, 60 degrees and 90 degrees.

wherein the content of the first and second substances,

the method is characterized in that a horizontal plane position vector of the underwater autonomous vehicle under a geodetic coordinate system comprises a horizontal plane position coordinate and a yaw angle, and upsilon is [ u, v, r ═]^T∈R³Is a horizontal plane velocity vector of the underwater autonomous vehicle under a carrier, u, v and R are an X axial component, a Y axial component and a yaw angle velocity of the horizontal velocity vector of the underwater autonomous vehicle under a satellite coordinate system respectively, R (psi) is a conversion matrix, psi is a yaw angle of the underwater autonomous vehicle,

in order to facilitate on-line training, the underwater autonomous vehicle is simplified into a rectangle in the embodiment, and the fact that the actual navigation of the underwater autonomous vehicle can be directly guided by a strategy trained by an on-line training system is guaranteed by combining a kinematics model of the underwater autonomous vehicle.

Step two: the precondition basis that the underwater autonomous vehicle can successfully avoid a plurality of dynamic obstacles is as follows: after the underwater autonomous vehicle accurately acquires the motion state of the dynamic barrier at some time through a sensor carried by the underwater autonomous vehicle, the sensor continuously observes the barrier, analyzes and processes observation data to obtain a kinematic model for accurately predicting and estimating the dynamic barrier, and leads the kinematic model into an obstacle avoidance planning system to avoid the obstacle of the underwater autonomous vehicle. The method specifically comprises the following steps:

acquiring motion state information of maneuvering obstacles around the underwater autonomous vehicle by using a sensor carried by the underwater autonomous vehicle, wherein the motion state information comprises: motion state vectors, state transition matrices, process noise, and input control matrices. The implementation mode aims at the movement characteristics of the dynamic barrier, combines the influence of random disturbance and system noise on the maneuvering motion of the barrier in the actual operation process, adds random noise on the basis of a linear Gaussian uniform motion (CV) model and a uniform accelerated motion (CA) model, and establishes a kinematics model for predicting and estimating the future movement state of the dynamic barrier. Specifically, a dynamic obstacle state equation is constructed according to the motion state information, and the dynamic obstacle state equation comprises the following steps: the discrete time state equation of the uniform motion model when the sampling interval is T and the discrete time state equation of the uniform acceleration motion model when the sampling interval is T.

The dynamic obstacle state equation is expressed by the position and the speed of the dynamic obstacle, and the acceleration term of white noise which obeys Gaussian distribution is added to express the slight change of the speed of the obstacle caused by external disturbance in the underwater environment. If the state vector of the dynamic obstacle at the time t is as follows:

wherein x (t) and y (t) are respectively the horizontal and vertical coordinates of the position of the obstacle,

and

respectively the lateral and longitudinal speed of the obstacle,

and

respectively the lateral and longitudinal acceleration of the obstacle.

wherein, X_k+1And X_kThe states of the uniform motion model at the time k +1 and k, respectively, F_CVIs a state transition matrix of the uniform motion model,

ω_k+1the process noise of the uniform motion model in discrete time is obtained.

wherein the content of the first and second substances,

and

is modeled as a uniform acceleration motionThe process noise in a discrete time is,

step three: and predicting a dynamic obstacle kinematics model according to a dynamic obstacle state equation by utilizing a particle filtering method associated with probability data. Namely: after the motion state of the dynamic barrier at some moments is accurately acquired through a sensor carried by the underwater autonomous vehicle, the sensor continuously observes the barrier, observation data is analyzed and processed, and a kinematics model describing the kinematics characteristic of the dynamic barrier is established and estimating the dynamics characteristic of the dynamic barrier is established by combining the established uniform velocity motion (CV) model and uniform acceleration motion (CA) model of the dynamic barrier. And leading the kinematic model for predicting and estimating the dynamic barrier into an obstacle avoidance planning system to carry out obstacle avoidance control on the underwater autonomous vehicle.

Step four: and establishing an on-line training environment of multiple dynamic obstacles in a Cartesian coordinate system according to the information of the obstacles around the underwater autonomous vehicle obtained in the step one and the dynamic obstacle kinematics model obtained in the step three, and fusing a multi-dynamic obstacle avoiding method to generate an obstacle avoiding strategy. The method comprises the following specific steps:

and constructing a training environment map model by combining terrain information of the water area environment where the underwater autonomous vehicle is located, and then loading a plurality of dynamic obstacles in the training environment map model according to the dynamic obstacle kinematics model to obtain the online training environment of the plurality of dynamic obstacles in a Cartesian coordinate system.

The method is combined with the idea of an artificial potential field method of a local path planning algorithm to design a multi-dynamic-barrier obstacle avoidance strategy. Taking the behavior of the underwater autonomous vehicle towards the target as a gravitational potential field function, wherein the expression of the gravitational potential field function is as follows:

wherein k is₁For gravitational potential energy gain coefficient, practically 0.01, x can be taken_tAnd y_tRespectively at time t in CartesianHorizontal and vertical coordinates, x, of the position of the autonomous underwater vehicle under the coordinate system_goalAnd y_goalRespectively are horizontal and vertical coordinates of the central position of the target area under a Cartesian coordinate system;

taking the behavior of the underwater autonomous vehicle for avoiding the dynamic barrier as a repulsive force potential field function of the underwater autonomous vehicle, wherein the repulsive force potential field function expression is as follows:

wherein k is₂For the repulsive force potential energy gain coefficient, practically 0.1, x 'can be taken'_tAnd y'_tRespectively, the abscissa and ordinate of the position of the dynamic obstacle at time t in a Cartesian coordinate system, d (q)_t,q’_t) Is the distance between the underwater autonomous vehicle and the dynamic barrier at the moment t, q_t＝(x_t,y_t)，q’_t＝(x’_t，y’_t)，d₀The distance can be increased for the influence of the action field of the repulsive force potential field of the underwater autonomous vehicle.

Because the shape of the dynamic obstacle is uncertain, but the shape of the underwater autonomous vehicle is determined, the embodiment performs bulking processing on the underwater autonomous vehicle to give a certain safety space:

wherein alpha is a constant with a value greater than 1; l and B are respectively the maximum length and the width of the underwater vehicle; l is₁And L₂The length of the major axis and the length of the minor axis of the ellipse are respectively expanded into the ellipse by the autonomous underwater vehicle.

The method is different from the action domain of the repulsive force field of the barrier in the manual potential field method, the action domain of the repulsive force field of the underwater autonomous vehicle is established, and when the dynamic barrier enters the action domain of the repulsive force field of the underwater autonomous vehicle, the smaller the distance between the dynamic barrier and the repulsive force field of the underwater autonomous vehicle is, the larger the repulsive force borne by the underwater autonomous vehicle is; conversely, the smaller the repulsive force experienced by the autonomous underwater vehicle. When the underwater autonomous vehicle makes a heading action to enable the barrier to be separated from the action field of the repulsive force potential field of the underwater autonomous vehicle, the repulsive force borne by the underwater autonomous vehicle is zero. The obstacle avoidance strategy is as follows:

Step five: in order to perform deep reinforcement learning training, the obstacle avoidance strategy generated in the fourth step is converted into an MDP (Markov Decision Process) model, wherein the MDP model is composed of quadruplets and has the expression:

MDP＝(S,A,P_sa,R)，

The autonomous underwater vehicle in this embodiment is fully driven, whereby the range of heading angle is [ - π, + π]The unit: and (7) rad. Considering the limitation of self mobility, the motion space of the final MDP model is defined as the yaw rate and the horizontal velocity, and the motion set a at the time t_tThe expression is as follows:

a_t＝{ω(t),V(t)}，

wherein, a_tAnd ω (t) and V (t) are respectively the yaw rate and the horizontal speed of the underwater autonomous vehicle at the moment t, which is the action set at the moment t.

The direct equipment for interaction between the underwater autonomous vehicle and the environment is a sonar sensor, a state set is defined as signals acquired by 7 obstacle avoidance sonars of the underwater autonomous vehicle at t moment, the limit of the detection capability of detection equipment of the underwater autonomous vehicle is considered, and the detection distance D (t) range is [0, 2%00](unit: m), set of states S at time t_tThe expression is as follows:

in the embodiment, the path planning and obstacle avoidance control method in the multiple dynamic obstacle environments proposed in the step two is fused into the specific setting of the reward function of the deep reinforcement learning MDP model, and the steps are mainly considered as follows;

setting the tendency target behavior of the autonomous underwater vehicle as the reward value r of the target module at the moment t₁(s_t,a_t,s_t+1) Combining the gravitational potential field function in step 2, r₁(s_t,a_t,s_t+1) The method specifically comprises the following steps:

when the underwater autonomous vehicle arrives at the target area, updating the reward value of the target module:

r₁(s_t,a_t,s_t+1)←r₁(s_t,a_t,s_t+1)+R

wherein R is a normal number.

Setting barrier avoiding behavior of the autonomous underwater vehicle as an award value r of a safety module at the moment t₂(s_t,a_t,s_t+1) When the next state of the autonomous underwater vehicle heading from a safe area is not a safe area, namely: enabling the dynamic barrier to enter the repulsive force field action domain of the underwater autonomous vehicle, combining the repulsive force field function of the underwater autonomous vehicle in the step 2 and the reward value r of the underwater autonomous vehicle₂(s_t,a_t,s_t+1) Comprises the following steps:

when the distance between the underwater autonomous vehicle and the obstacle is smaller than the minimum safe distance, the collision obstacle is represented, and the reward value of the safety module is updated:

r₂(s_t,a_t,s_t+1)←r₂(s_t,a_t,s_t+1)-R(d(q_t,q₀)≤r_s)

wherein r is_sIs the minimum safe distance of the underwater autonomous vehicle.

In order to avoid large fluctuation of speed and course of the underwater vehicle when the underwater vehicle navigates in a safe area and reduce ocean current interference, the invention sets the reward value r of the stability module at the moment t₃(s_t,a_t,s_t+1)：

r₃(s_t,a_t,s_t+1)＝-0.01×(|v_t+1-v_t|+|ω_t+1-ω_t|+|sin(ψ_t-φ)|)

Wherein v is_tAnd v_t+1Yaw angular velocity, omega, of the autonomous underwater vehicle at time t and time t +1, respectively_tAnd ω_t+1Yaw angular velocity, psi, of the autonomous underwater vehicle at time t and at time t +1, respectively_tAnd phi is the yaw angle and the water flow direction angle of the underwater autonomous vehicle at the moment t under the Cartesian coordinate system respectively.

The final prize value R of the prize function R in the MDP model at the moment t_tComprises the following steps:

r_t＝τ₁r₁(s_t,a_t,s_t+1)+τ₂r₂(s_t,a_t,s_t+1)+τ₃r₃(s_t,a_t,s_t+1)，

wherein, tau₁Is the scale factor, τ, of the target module₂For the proportionality coefficient of the security module, τ₃Is the scaling factor of the stability module.

Setting the MDP model: the motion planning task is horizontal and has three degrees of freedom; at the same time, the time is discretized, and the sampling rate of the obstacle avoidance system is T_SThe output is periodically made at an interval of 0.5s, whereby the autonomous underwater vehicle is at time tAfter receiving the state information, outputting the action mu_tE.g. A, generated prize value R at time t_t＝f(s_t) The state is changed to s_t+1. I.e. the output action u_tDetermined by the strategy pi, which is the state s_tProbability of mapping to each action: s → P (A).

And then establishing a state set and an action set of the MDP model when the underwater autonomous vehicle faces a plurality of dynamic obstacles. State set S ═ S for MDP model in the face of multiple dynamic obstacles₁,S₂,...,S_t,...,S_T}，

Signals collected by 7 obstacle avoidance sonars of the underwater autonomous vehicle at the moment t are processed, and in the fifth step, when the MDP model faces a plurality of dynamic obstacles, an action set A is { a ═ of the MDP model₁,a₂,...,a_t,...,a_T}。

Step six: set of states S ═ S₁,S₂,...,S_t,...,S_TAs the input of the MDP model, the action set a ═ a₁,a₂,...,a_t,...,a_TAnd (4) as the output of the MDP model, training the MDP model by combining with a deterministic depth strategy gradient algorithm until the underwater autonomous vehicle under the MDP model can reach a target area without collision.

A plurality of dynamic obstacle avoidance MDP models of the underwater autonomous vehicle are fused into an online training environment with a plurality of dynamic obstacles, so that a theoretical framework of a simulation training platform is established, and then the simulation environment is built by utilizing a Pyglet library in a python compiling environment. On the basis of the simulation environment module, a deep reinforcement learning training module is compiled, an obstacle avoidance controller for realizing the DDPG-based underwater autonomous vehicle multi-dynamic obstacle environment is imported by utilizing Python language compiling, and as shown in fig. 3, initial parameters of the underwater autonomous vehicle, initial parameters of dynamic obstacles and neural network training hyper-parameters are set.

Training is carried out: the underwater autonomous vehicle moves according to initial speed and initial yaw angle in a multi-dynamic-obstacle environment, environmental data detected by 7 sonars of the underwater autonomous vehicle is used as a deep reinforcement learning state, when no obstacle exists in the detection range of the 7 sonars, the underwater autonomous vehicle allows continuous learning and exploration, a target module continuously rewards and updates, the underwater autonomous vehicle continuously trends to a target until the underwater autonomous vehicle reaches a target area, and the round learning is finished.

When the 7 sonar detection ranges have obstacles and the obstacles are detected to enter the action areas of the repulsive force field of the underwater autonomous vehicle to intersect, the underwater autonomous vehicle is indicated to collide with the obstacles, the underwater autonomous vehicle obtains continuous negative reward values, the underwater autonomous vehicle changes the heading and the speed, the dynamic obstacles are continuously tried to be separated from the action area of the repulsive force field, if the underwater autonomous vehicle collides with the obstacles in the trying exploration process, the turn is finished, the turn is returned to the starting point to restart learning, and if the underwater autonomous vehicle successfully changes to avoid the obstacles and returns to the safe area, the underwater autonomous vehicle continues to learn and explore the target area.

And continuously and circularly running the operations until each round is triggered to end when no collision reaches the target area, the training is shown to be convergent at this time, and after 10000 rounds are run, the training is ended, and the learned strategy is stored. And operating the test module, calling a well deep reinforcement learning and training strategy, and generating a collision-free path.

Initializing the maximum round number 10000 of the training round number Ep to be Ep-1; time step tmax time step 2000 in Ep round is initialization t 1. The online Actor policy network is according to the current state s_tThe strategy selects an action set comprising the yaw rate and the horizontal speed of the underwater autonomous vehicle, and the action set in the current state is represented by the following formula: a is_t＝μ(s_t|θ^μ)+N_t。

According to the output action a_tCombining a kinematics model of 3 degrees of freedom of the horizontal plane of the underwater autonomous vehicle to obtain a differential expression, wherein the differential expression is represented by the following formula:

wherein the content of the first and second substances,

the method comprises the steps that a horizontal plane position vector under a geodetic coordinate system of the underwater autonomous vehicle comprises a horizontal plane position coordinate and a yaw angle; upsilon (t) is a horizontal plane velocity vector of the underwater autonomous vehicle under the carrier, and comprises a horizontal velocity and a yaw angular velocity; r [ psi (t)]Is a transformation matrix; psi (t) is the yaw angle of the autonomous underwater vehicle at time step t; u (t), v (t) and r (t) respectively represent the X axial component, the Y axial component and the yaw rate of the horizontal velocity vector of the underwater autonomous vehicle under the coordinate system of the satellite at the time step t. Solving a differential expression according to a fourth-order Runge-Kutta method to obtain a new position vector eta (t +1) after the action is executed, and expressing the vector by the following formula:

η(t+1)＝[x(t+1),y(t+1),ψ(t+1)]^T∈R³

from the new position vector after the execution of the action to the next state s_t+₁。

When the experience sample amount stored in the experience pool is more than or equal to the maximum data storage capacity len of the memory bank_MaxWhen (Data) is M, sampling a small batch of N experience samples

Wherein the content of the first and second substances,

and (3) expressing the kth experience sample of the time step t, wherein k is 1,2, …, N and N are the total number of the small batch samples, forming a data set, and sending the data set to the online strategy network, the target strategy network, the online evaluation network and the target evaluation network. From the sampled data set, the target policy network is based on state s_t+1Output action a'_t+1Calculating a target Q value, denoted as y_i：

y_i＝r_i+γQ'(s_i+1,μ'(s_i+1|θ^μ')|θ^Q')；

Target evaluation network according to state s_t+1Object strategyAction a 'of network output'_t+1And y of the target Q value_iUpdating the online evaluation network parameter theta of the critic by updating the loss function, and performing online evaluation according to the following formula:

where L is a loss function.

Combining the small batch of N experience samples with a random gradient descent method, updating the strategy of the actor network and the online strategy network parameter delta, and updating by the following formula:

wherein the content of the first and second substances,

a sampling strategy gradient is used.

Updating theta 'and delta' in the form of soft updates according to online network parameters theta and delta:

wherein τ is the weight of the online network parameter;

when t is less than or equal to 2000, the underwater autonomous vehicle collides with an obstacle or reaches a target area in the exploration process, the step is carried out by 5.2.5, and the number of turns Ep is Ep + 1; and when Ep is 10000, completing the training of the underwater autonomous vehicle in the large-scale continuous obstacle environment, and storing the learned obstacle avoidance strategy.

The above description is only a preferred embodiment of the two-dimensional obstacle avoidance control method of the underwater autonomous vehicle in the complex multi-dynamic obstacle environment, the protection range of the two-dimensional obstacle avoidance control method of the underwater autonomous vehicle in the complex multi-dynamic obstacle environment is not limited to the above embodiments, and all technical schemes belonging to the idea belong to the protection range of the invention. It should be noted that modifications and variations which do not depart from the gist of the invention will be those skilled in the art to which the invention pertains and which are intended to be within the scope of the invention.

Claims

1. An underwater autonomous vehicle dynamic obstacle avoidance method based on deep reinforcement learning is characterized by comprising the following steps:

2. The method for dynamically avoiding obstacles of the underwater autonomous vehicle based on the depth reinforcement learning as claimed in claim 1, wherein the step one of the underwater autonomous vehicle model comprises: one tail propeller, two side propellers and 7 obstacle avoidance sonars, the range finding sonar sampling frequency of the underwater autonomous vehicle model is 2Hz, the detection distance is 150 m-200 m, and the distribution angles under the coordinate system are as follows: 90 degrees, 60 degrees, 30 degrees, 0 degrees, 30 degrees, 60 degrees and 90 degrees;

wherein the content of the first and second substances,

3. The method for dynamically avoiding obstacles of the underwater autonomous vehicle based on the deep reinforcement learning of claim 1, wherein the dynamic obstacle state equation in the second step comprises: a discrete time state equation of the uniform motion model when the sampling interval is T and a discrete time state equation of the uniform acceleration motion model when the sampling interval is T,

X_k+1＝F_CVX_k+ω_k+1，

wherein the content of the first and second substances,

and

is the process noise of the uniform acceleration motion model in discrete time.

4. The underwater autonomous vehicle dynamic obstacle avoidance method based on deep reinforcement learning as claimed in claim 3, characterized in that a state transition matrix F of a uniform velocity motion model_CVThe expression of (a) is:

wherein the content of the first and second substances,

wherein the content of the first and second substances,

5. the method for dynamically avoiding the obstacle of the underwater autonomous vehicle based on the deep reinforcement learning is characterized in that in the fourth step, a training environment map model is constructed by combining terrain information of a water area environment where the underwater autonomous vehicle is located, and then a plurality of dynamic obstacles are loaded in the training environment map model according to a dynamic obstacle kinematics model to obtain an online training environment of the plurality of dynamic obstacles in a Cartesian coordinate system.

6. The dynamic obstacle avoidance method of the underwater autonomous vehicle based on the depth reinforcement learning according to claim 1 or 2, characterized in that in the fourth step, the behavior of the underwater autonomous vehicle toward the target is taken as a gravitational potential field function, the behavior of the underwater autonomous vehicle for avoiding the dynamic obstacle is taken as a repulsive potential field function of the underwater autonomous vehicle,

the obstacle avoidance strategy is as follows:

7. The underwater autonomous vehicle dynamic obstacle avoidance method based on the depth reinforcement learning of claim 6 is characterized in that the gravitational potential field function expression is as follows:

the repulsive force potential field function expression is as follows:

wherein k is₂Is a repulsive force potential energy gain coefficient, x'_tAnd y'_tRespectively, the abscissa and ordinate of the position of the dynamic obstacle at time t in a Cartesian coordinate system, d (q)_t,q′_t) Is the distance between the underwater autonomous vehicle and the dynamic barrier at the moment t, q_t＝(x_t,y_t)，q′_t＝(x′_t，y′_t)，d₀The distance L is large for the influence of the repulsive force potential field action domain energy of the underwater autonomous vehicle₁And L₂The length of the major axis and the length of the minor axis of the ellipse are respectively expanded into the ellipse by the autonomous underwater vehicle.

8. The underwater autonomous vehicle dynamic obstacle avoidance method based on the depth reinforcement learning of claim 1 is characterized in that in the fifth step, the MDP model expression is as follows:

MDP＝(S,A,P_sa,R)，

9. The method for dynamically avoiding obstacles of the autonomous underwater vehicle based on the deep reinforcement learning as claimed in claim 2, wherein the state set S-S of the MDP model when facing a plurality of dynamic obstacles in the fifth step is { S ═ S }₁,S₂,...,S_t,...,S_T}，

10. The method for dynamically avoiding obstacles of the autonomous underwater vehicle based on the depth reinforcement learning as claimed in claim 9, wherein the reward value R of the reward function R at the time t in the MDP model is_tComprises the following steps:

wherein, tau₁Is the scale factor, τ, of the target module₂For the proportionality coefficient of the security module, τ₃Is the proportionality coefficient of the stability module, r₁(s_t,a_t,s_t+1) The prize value, r, for the time t of the goal module₂(s_t,a_t,s_t+1) The prize value r for the moment t of the security module₃(s_t,a_t,s_t+1) The prize value at time t of the stability module.