CN112179367A - Intelligent autonomous navigation method based on deep reinforcement learning - Google Patents


Info

Publication number
CN112179367A
Authority
CN
China
Prior art keywords
value
state
current
action
neural network
Legal status
Granted
Application number
CN202011023274.4A
Other languages
Chinese (zh)
Other versions
CN112179367B (en)
Inventor
彭小红
陈亮
陈荣发
张军
梁子祥
史文杰
黄文�
陈剑勇
黄曾祺
余应淮
Current Assignee
Guangdong Ocean University
Original Assignee
Guangdong Ocean University
Application filed by Guangdong Ocean University
Priority to CN202011023274.4A
Publication of CN112179367A
Application granted
Publication of CN112179367B

Classifications

    • G: PHYSICS
    • G01: MEASURING; TESTING
    • G01C: MEASURING DISTANCES, LEVELS OR BEARINGS; SURVEYING; NAVIGATION; GYROSCOPIC INSTRUMENTS; PHOTOGRAMMETRY OR VIDEOGRAMMETRY
    • G01C21/00: Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00
    • G01C21/26: Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00 specially adapted for navigation in a road network
    • G01C21/34: Route searching; Route guidance
    • G01C21/3407: Route searching; Route guidance specially adapted for specific applications
    • G01C21/343: Calculating itineraries, i.e. routes leading from a starting point to a series of categorical destinations using a global route restraint, round trips, touristic trips
    • B: PERFORMING OPERATIONS; TRANSPORTING
    • B63: SHIPS OR OTHER WATERBORNE VESSELS; RELATED EQUIPMENT
    • B63C: LAUNCHING, HAULING-OUT, OR DRY-DOCKING OF VESSELS; LIFE-SAVING IN WATER; EQUIPMENT FOR DWELLING OR WORKING UNDER WATER; MEANS FOR SALVAGING OR SEARCHING FOR UNDERWATER OBJECTS
    • B63C11/00: Equipment for dwelling or working underwater; Means for searching for underwater objects
    • B63C11/52: Tools specially adapted for working underwater, not otherwise provided for
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T: CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00: Road transport of goods or passengers
    • Y02T10/10: Internal combustion engine [ICE] based vehicles
    • Y02T10/40: Engine management systems


Abstract

The invention relates to the technical field of intelligent autonomous navigation, and in particular to an intelligent autonomous navigation method based on deep reinforcement learning. The method solves the problem that existing algorithms only calculate the reward values of two adjacent states, so that the agent cannot perceive the development of several future states in advance and its obstacle-avoidance and navigation capabilities are insufficient. The intelligent-agent autonomous navigation method based on deep reinforcement learning comprises the following steps: constructing an intelligent autonomous navigation system that adopts the MS-DDQN algorithm, i.e. a multi-step-mechanism-oriented DDQN algorithm; building a simulation environment; training the autonomous navigation system in the simulation environment; and loading the trained autonomous navigation system onto the agent, which thereby acquires autonomous navigation capability. Through this technical scheme, the agent can perceive the future distribution of obstacles and take evasive action in advance.

Description

Intelligent autonomous navigation method based on deep reinforcement learning
Technical Field
The invention relates to the technical field of intelligent agent autonomous navigation, in particular to an intelligent agent autonomous navigation method based on deep reinforcement learning.
Background
Because of humanity's excessive exploitation of land resources, the reserves of mineral resources, biological resources and the like are decreasing rapidly. The area of the ocean is more than twice that of the land, and the mineral, energy and fishery resources it holds are far more abundant than those on land. Given the unknown and complex marine environment, intelligent agents can replace humans in exploring and developing marine resources, so in recent years many countries have attached great importance to research on intelligent agents. Autonomous navigation is one of the key technologies in the study of intelligent mobile agents. Autonomous navigation means that an agent, knowing its own pose information, finds an optimal or sub-optimal path from a starting point to a target point in an environment containing complex obstacles, subject to one or more given constraints such as shortest path length, minimum energy consumption or minimum travel time. The autonomous navigation problem of an agent is equivalent to its autonomous path-planning problem: both aim to control the mobile agent so that it moves away from obstacles and towards a target position. The goal of the path-planning task is to find, by means of a specific algorithm, one or more paths in a known or unknown environment that start from the starting point, avoid the various obstacles and safely reach the target position. In essence this is a constrained optimization problem, and the optimization objective differs somewhat for different requirements. Navigation algorithms can be roughly divided into two categories according to the degree of intelligence of the agent: non-intelligent navigation algorithms and intelligent navigation algorithms. By designing a modular deep neural network architecture, the learning task handled by each module's neural network becomes more definite, and a double-neural-network structure improves the stability of the algorithm; in addition, the output method of the target value network, the loss function, the reward function and the data stored in the experience pool of the MS-DDQN algorithm are improved, so that the rewards obtained by the agent during training can be propagated to the state-value estimates of states a multi-step interval away. In this way the underwater agent is guided to learn quickly and can perceive changes of future states in advance, which gives it the ability to perceive the distribution of future obstacles and helps it take evasive action in advance.
Deep Q Network (DQN) is a deep reinforcement learning algorithm. The key techniques of the DQN algorithm are a double-neural-network structure and experience replay. One innovation of DQN is to use a neural network to approximate the optimal state-action value function Q, instead of the Q-learning approach of maintaining a table that records the mapping between states and actions. This overcomes the limitation that Q-learning cannot be applied to high-dimensional state spaces, while exploiting the capability of deep neural networks to process high-dimensional information. The second innovation is to establish two neural networks, a current value network and a target value network; the double-network structure improves the stability of the algorithm. The third innovation is the experience-replay mechanism: sample data from the interaction between the agent and the environment are stored in an experience pool, and each sample is labelled by its reward value, which overcomes the drawback of deep learning methods that require large numbers of manually labelled samples. The training method of the DQN algorithm is shown in FIG. 1. The network structure of DQN contains two deep neural networks, the current value network and the target value network. The current value network has two roles. The first is to process the input state information and evaluate the value of each output action during training, and then, by an ε-greedy method, either execute a random action or execute the action with the maximum value output by the current value network. The second is to process the training samples drawn from the experience pool during network training, output the value of each action, compare these values with the action values output by the target network, and compute an error that guides the update of the current network's weights. The target value network is mainly used, during training, to process the training samples drawn from the experience pool, output the value of each action, and assist the iterative update of the current network's weights. The weights of the target value network are not updated during training; every N steps they are copied from the current value network. An experience-replay mechanism is used during DQN training: through it, the agent can learn not only from the current state's experience data but also repeatedly from earlier experience data. Each time the agent completes an interaction with the environment, the information is stored in the experience pool; each sample contains the current state s_t, the executed action a_t, the obtained reward value r_t and the next state s_{t+1}, combined into one unit [s_t, a_t, r_t, s_{t+1}] and stored in the experience pool D. Because the stored experience data are strongly correlated, the DQN algorithm draws small mini-batches of training samples from the experience pool by random sampling, which ensures independence among training samples during learning and improves the convergence speed of the algorithm.
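A minimal Python sketch of such an experience pool, assuming a fixed capacity and uniform random mini-batch sampling (the capacity and batch size below are illustrative, not taken from the patent):

    import random
    from collections import deque

    class ExperiencePool:
        """Stores [s_t, a_t, r_t, s_t1] samples and serves random mini-batches."""
        def __init__(self, capacity=10000):
            self.buffer = deque(maxlen=capacity)   # oldest samples are discarded when full

        def store(self, s_t, a_t, r_t, s_t1):
            self.buffer.append((s_t, a_t, r_t, s_t1))

        def sample(self, batch_size=32):
            # uniform random sampling breaks the correlation between consecutive samples
            return random.sample(self.buffer, min(batch_size, len(self.buffer)))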
Because the DQN algorithm uses a neural network in place of a Q-value table, the current value network represents the learned policy π. Assuming the weight parameter of the current value network is θ, the output of the current value network is
y_DQN = Q(s, a, θ)   (2-26)
The output of the target network is:
y_target = r + γ·max_{a'} Q_target(s', a', θ')   (2-27)
the loss function of the DQN network is then:
L(θ) = E[(y_target − Q(s, a, θ))²]   (2-28)
updating the weights θ in the current value network by calculating the gradient of the loss function:
∇_θ L(θ) = E[(y_target − Q(s, a, θ))·∇_θ Q(s, a, θ)]   (2-29)
the parameters in the current network can be updated by adopting a gradient descent method, so that an optimal strategy is obtained. Due to Q used in DQN algorithmtarget(s ', a ', theta ') to approximately represent the optimization target, and the selection actions are actions corresponding to the maximum Q value, and the selection and evaluation of the actions are based on the target value network, which results in the overfitting problem. To solve this problem, a Deep Double Q Network algorithm (DDQN) is proposed. The training process of the DDQN algorithm is almost the same as the DQN algorithm, the only difference being that the DDQN separates the target value network selection action from the evaluation action. Just by using the DQN algorithm, the network structure has two sets of different weight parameters, namely a weight parameter theta in the current network and a weight parameter theta' in the target value network. Wherein the action is selected by using the parameters in the current network, and the selected action is evaluated by using the parameters in the target value network, so the output of the target value network of the DDQN is:
y_target^DDQN = r + γ·Q_target(s', argmax_{a'} Q(s', a', θ), θ')   (2-30)
the output of the DDQN current value network is:
y_DDQN = Q(s, a, θ)   (2-31)
where γ is the discount factor; γ^i represents the degree to which the reward value r_{t+i} obtained in the (t+i)-th state influences the current state t, and γ is a value greater than 0 and less than 1; γ^λ represents the degree to which the reward value r_{t+λ} obtained in the (t+λ)-th state influences the current state t; Q is the state-action value estimate; λ is the number of interval steps; s_t is the current state; a_t is the action executed in the current state; r_t denotes the immediate reward value obtained by the agent at time t; r_{t+λ} is the reward value obtained in the (t+λ)-th state; s_{t+λ} is the state λ steps later; θ is the weight parameter of the current value network and θ' is the weight parameter of the target value network; i is the index of the reward value obtained in each state after state t; Q(s_{t+λ}, a, θ) denotes the estimate of each action output by the current value network for the input (s_{t+λ}, a); Q_target(s_{t+λ}, argmax_a Q(s_{t+λ}, a, θ), θ') means that the action corresponding to the maximum of the estimates output by the current value network is selected first and, together with s_{t+λ}, is used as the input of the target value network, which then outputs an estimate for each action. The loss function of the DDQN network is:
L(θ) = E[(y_target^DDQN − Q(s, a, θ))²]   (2-32)
where E denotes the expectation of the neural-network error, s is the state, a is the executed action, θ is the weight parameter of the current value network, Q is the state-action value estimate, and Q(s, a, θ) denotes the estimate of each action output by the current value network. The weight parameters of the current value network are updated as follows:
∇_θ L(θ) = E[(y_target^DDQN − Q(s, a, θ))·∇_θ Q(s, a, θ)]   (2-33)
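A small numerical sketch contrasting the DQN target (2-27) and the DDQN target (2-30), with the two networks stubbed out as plain arrays whose values are made up for illustration:

    import numpy as np

    gamma = 0.99
    r = 1.0                                    # reward of the transition (illustrative)
    q_current = np.array([0.2, 0.8, 0.5])      # Q(s', a, θ) for each action (illustrative)
    q_target = np.array([0.3, 0.6, 0.7])       # Q_target(s', a, θ') for each action (illustrative)

    # DQN target: the target network both selects and evaluates the action
    y_dqn = r + gamma * q_target.max()

    # DDQN target: the current network selects, the target network evaluates
    a_star = int(q_current.argmax())
    y_ddqn = r + gamma * q_target[a_star]

    print(y_dqn, y_ddqn)   # DDQN usually gives the smaller, less over-estimated value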
since deep reinforcement learning is to process high-dimensional original input information by using a neural network and to approximate a state-action cost function by using the neural network, deep reinforcement learning is more suitable for problems in a larger state space than the conventional reinforcement learning method. Therefore, the MS-DDQN is provided by correspondingly improving the DDQN algorithm, so that the underwater intelligent body is improved to have higher obstacle avoidance and navigation capabilities. As is evident from the above description of the Q (λ) algorithm, Q (λ) enables the agent to obtain the ability to reward the condition. In the navigation of the underwater robot, the obstacle avoidance function is an important precondition for completing tasks, and the influence on the state-action Q value of a longer remote step state is very important, namely, the intelligent body is endowed with the function of perceiving the future improvement of the obstacle avoidance capability of the underwater robot. If the underwater robot is in a certain state, the reward value obtained in the future can be perceived in advance, namely the development condition of the future state can be perceived in advance, and the underwater robot is quite helpful for avoiding obstacles and reaching a target point.
Disclosure of Invention
The invention aims to overcome at least one defect of the prior art by providing an intelligent-agent autonomous navigation method based on deep reinforcement learning, which solves the problem that existing algorithms only calculate the reward values of two adjacent states, so that the agent cannot perceive the development of several future states in advance and its obstacle-avoidance and navigation capabilities are insufficient. With the method, the agent can perceive the development of several future states and the distribution of obstacles, thereby achieving the technical effect of taking evasive action in advance.
The technical scheme adopted by the invention is an intelligent-agent autonomous navigation method based on deep reinforcement learning, which comprises the following steps. An intelligent autonomous navigation system is constructed; the system adopts the MS-DDQN algorithm, i.e. a multi-step-mechanism-oriented DDQN algorithm obtained by improving the DDQN algorithm. The MS-DDQN algorithm adopts a modular neural network comprising a local obstacle-avoidance deep neural network module, a global navigation deep neural network module and an instruction selection module; the local obstacle-avoidance deep neural network module guides the agent away from obstacles, the global navigation deep neural network module guides the agent towards the target position along a shorter path, and the instruction selection module decides which network's output action instruction is executed. A simulation environment is built, including an obstacle environment model and a simulated agent. The autonomous navigation system is then trained in the simulation environment, i.e. the agent trains and learns in the simulation environment with the MS-DDQN algorithm; there are multiple simulation environments, and each simulation environment is trained multiple times. Finally, the trained autonomous navigation system is loaded onto the agent, which thereby acquires autonomous navigation capability.
Further, the MS-DDQN algorithm comprises a current value network for selecting actions, a target value network for evaluating actions, an error function for updating the weights, a reward function giving the reward value obtained when the agent takes an action in the current state and reaches the next state, and an experience pool for storing the sample data generated at each step. The current value network, target value network, error function, reward function and experience pool cooperate so that the MS-DDQN algorithm gives the agent the ability to know the future distribution of obstacles and to take evasive action in advance.
Further, the output of the target value network is:
y_target^MS-DDQN = Σ_{i=0}^{λ−1} γ^i·r_{t+i} + γ^λ·Q_target(s_{t+λ}, argmax_a Q(s_{t+λ}, a, θ), θ')
where γ is the discount factor; γ^i represents the degree to which the reward value r_{t+i} obtained in the (t+i)-th state influences the current state t, and γ is a value greater than 0 and less than 1; γ^λ represents the degree to which the reward value r_{t+λ} obtained in the (t+λ)-th state influences the current state t; Q is the state-action value estimate; λ is the number of interval steps; s_t is the current state; a_t is the action executed in the current state; r_t denotes the immediate reward value obtained by the agent at time t; r_{t+λ} is the reward value obtained in the (t+λ)-th state; s_{t+λ} is the state λ steps later; θ is the weight parameter of the current value network and θ' is the weight parameter of the target value network; i is the index of the reward value obtained in each state after state t; Q(s_{t+λ}, a, θ) denotes the estimate of each action output by the current value network for the input (s_{t+λ}, a); Q_target(s_{t+λ}, argmax_a Q(s_{t+λ}, a, θ), θ') means that the action corresponding to the maximum of the estimates output by the current value network is selected first and, together with s_{t+λ}, is used as the input of the target value network, which then outputs an estimate for each action. The loss function is:
L(θ) = E[(y_target^MS-DDQN − Q(s, a, θ))²]
where E denotes the expectation of the neural-network error, s is the state, a is the executed action, θ is the weight parameter of the current value network, Q is the state-action value estimate, and Q(s, a, θ) denotes the estimate of each action output by the current value network. The data stored in the experience pool take the form:
(s_t, a_t, Σ_{i=0}^{λ−1} γ^i·r_{t+i}, s_{t+λ})
where t is a given moment, s is a state, and i is the index of the reward value obtained in each state after state t; s_t is the current state, a is the executed action, a_t denotes the action executed at time t, and r_t denotes the immediate reward value obtained by the agent at time t; γ is the discount factor, γ^i represents the degree to which the reward value r_{t+i} obtained in the (t+i)-th state influences the current state t, and γ is a value greater than 0 and less than 1; λ is the number of interval steps, and s_{t+λ} is the state λ steps later.
the target value network is according to a function
Figure BDA0002701350050000061
Outputting the value of each group of actions, updating the weight theta of the current value network according to the loss function, and then sampling the samples after each action is executedAnd storing the data into an experience pool.
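A sketch of how the multi-step return and the stored tuple described above could be formed for a single state, assuming the λ subsequent rewards are already available (all numeric values are illustrative):

    import numpy as np

    gamma, lam = 0.99, 4                        # discount factor and interval steps (illustrative)
    rewards = [0.1, 0.0, -0.2, 0.5]             # r_t ... r_{t+λ-1}

    # multi-step reward: sum of gamma^i * r_{t+i} for i = 0 .. λ-1
    r_multi = sum(gamma**i * r for i, r in enumerate(rewards))

    q_current_next = np.array([0.4, 0.9, 0.1])  # Q(s_{t+λ}, a, θ), illustrative
    q_target_next = np.array([0.5, 0.7, 0.2])   # Q_target(s_{t+λ}, a, θ'), illustrative

    # MS-DDQN target: current network selects, target network evaluates, discounted by gamma^λ
    a_star = int(q_current_next.argmax())
    y = r_multi + gamma**lam * q_target_next[a_star]

    # the tuple stored in the experience pool: (s_t, a_t, r_multi, s_{t+λ})
    sample = ("s_t", "a_t", r_multi, "s_t_plus_lambda")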
Further, the training method of the MS-DDQN algorithm comprises the following steps:
Randomly initialize the current value network Q(s_t, a; θ) with weights θ and the target value network Q_target(s_t, a; θ') with weights θ'; Q(s_t, a; θ) denotes the estimate of each action output by the current value network
Initialize the experience pool D and set the hyper-parameter λ
For episode = 1, M do
    Reset the simulation environment, obtain the initial observation state s_t, set T ← ∞, and initialize four empty arrays S_t, A, R, S_{t+1}; S_t stores the state information of the current state, A stores the action executed in the current state, R stores the reward value obtained in the current state, and S_{t+1} stores the next-state information; T is used to judge, when the episode ends, whether all the data collected in this episode have been stored in the experience pool
    For t = 1, 2, … do
        If t < T then
            Select action a_t = argmax_a Q(s_t, a; θ) according to the current policy, execute a_t, and obtain the reward value r_t and the new state s_{t+1}; store s_t in S_t, r_t in R, a_t in A and s_{t+1} in S_{t+1}; t indexes the environment states obtained by the agent in this episode
            If s_{t+1} is the terminal state then T ← t + 1
        τ ← t − λ + 1
        If τ ≥ 0 then
            If τ + λ < T then
                r_τ = Σ_{i=τ}^{τ+λ−1} γ^{i−τ}·r_i,  r_i ∈ R
            else
                r_τ = Σ_{i=τ}^{T−1} γ^{i−τ}·r_i,  r_i ∈ R
            Store (s_τ, a_τ, r_τ, s_{τ+λ}) in the experience pool D
            Randomly draw mini-batch sample data from D
            Set y_i = r_τ + γ^λ·Q_target(s_{τ+λ}, argmax_a Q(s_{τ+λ}, a, θ), θ')
            Perform gradient descent on the current value network weights θ with the loss function L(θ) = E[(y_i − Q(s, a, θ))²]
    Until τ = T − 1
Here τ is mainly used to judge whether the number of actions the mobile robot has executed exceeds the set number of steps λ. If it does, the value of τ is greater than or equal to 0, indicating that the agent has obtained at least λ environment states; at this moment the influence r_τ of the reward values obtained in the λ future states on state τ can be calculated, the state information s_τ, the action a_τ and the state information s_{τ+λ} are extracted from the arrays S_t, A and S_{t+1}, and these three items together with r_τ form a training-sample tuple that is stored in the experience pool D, so that training samples can later be drawn from the experience pool for training. When the agent starts an episode of training, T ← ∞ is set first; since t < T, the program enters the second for-loop. When a collision occurs or the target position is reached, T ← t + 1; when τ = T − 1, all the environment state information obtained by the agent in this episode has been stored as training samples according to the multi-step method. When τ + λ < T, the agent has not collided and the number of environment states to be stored is greater than or equal to the set λ, so the influence r_τ of the reward values obtained in the λ future states on state τ is calculated as r_τ = Σ_{i=τ}^{τ+λ−1} γ^{i−τ}·r_i with r_i ∈ R; otherwise the agent has collided and the number of environment states to be stored is smaller than the set λ, so r_τ = Σ_{i=τ}^{T−1} γ^{i−τ}·r_i with r_i ∈ R is used to calculate the influence of the remaining λ−1 (λ−2, λ−3, …, 1) future reward values on states τ (τ+1, τ+2, …, T−1). y_i denotes the i-th sample in the mini-batch drawn from the experience pool; the estimate given by the target value network plus the actual return r_i obtained in that state is a comprehensive reward for the current state and is used in the gradient-descent operation together with the current value network. The MS-DDQN algorithm thus allows the rewards obtained by the underwater agent to be propagated back to the state-value estimates of states a multi-step interval away. In this way the underwater agent is better guided to learn quickly and can perceive changes of future states in advance.
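A condensed Python sketch of the τ/T bookkeeping described above for a single episode; the environment, action selection and storage are reduced to stand-ins (env, select_action and store are assumed interfaces, not the patent's implementation):

    import math

    def run_episode(env, select_action, store, lam=4, gamma=0.99):
        """Collect lambda-step samples (s_tau, a_tau, r_tau, s_{tau+lam}) for one episode."""
        S, A, R, S_next = [], [], [], []     # per-episode buffers, as in the algorithm above
        s = env.reset()
        T = math.inf                         # episode length, unknown until a terminal state
        t = 0
        while True:
            if t < T:
                a = select_action(s)         # argmax_a Q(s, a; theta) with exploration
                s_next, r, done = env.step(a)
                S.append(s); A.append(a); R.append(r); S_next.append(s_next)
                if done:
                    T = t + 1
                s = s_next
            tau = t - lam + 1                # index of the state whose sample is now complete
            if tau >= 0:
                last = tau + lam if tau + lam < T else int(T)
                r_tau = sum(gamma**(i - tau) * R[i] for i in range(tau, last))
                tail = S_next[last - 1]      # s_{tau+lam}, or the terminal state if reached earlier
                store(S[tau], A[tau], r_tau, tail)
            if tau == T - 1:
                break
            t += 1

After each stored tuple, a mini-batch would be drawn from the experience pool and a gradient-descent step taken on the current value network, exactly as in the listing above.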
Further, the training step in S3 specifically includes:
S31, the simulation environment acquires the current state information of the agent, including the distance information between the agent and the obstacles in the environment and the positional-relation information between the agent and the target point; the positional-relation information comprises the relative coordinate relation between the agent's current position coordinates and the target position coordinates (obtained by subtracting the current coordinates of the underwater agent from the target position coordinates), the Euclidean distance from the agent's current position to the target position, and the included angle between the agent's advancing-direction vector and the vector from the agent's current coordinates to the target position;
S32, the acquired current state information is input into the modular deep neural network; specifically, the distance information between the agent and the obstacles and the positional-relation information between the agent and the target point are input into the local obstacle-avoidance deep neural network module, and the positional-relation information between the agent and the target point is input into the global navigation deep neural network module;
S33, the local obstacle-avoidance deep neural network module and the global navigation deep neural network module output their respective control instructions according to the input current state information;
S34, the instruction selection module decides whether to use the action instruction output by the local obstacle-avoidance deep neural network module or by the global navigation deep neural network module, by judging the distance between the agent and the nearest obstacle;
S35, the agent executes the instruction selected by the instruction selection module, receives the reward value output by the reward function, and enters the next state;
S36, the sample data of this interaction are stored in the experience pool; the sample data comprise the current state s_t, the executed action a_t, the obtained reward value r_t and the next state s_{t+1}, and are stored in the experience pool in the form
(s_t, a_t, Σ_{i=0}^{λ−1} γ^i·r_{t+i}, s_{t+λ}).
In step S31, the positional-relation information between the underwater agent and the target point is added to the input state information, which helps the underwater agent reach the target position along a shorter path; at the same time, this positional-relation information is used directly as the input state of the global navigation neural network, so that the underwater agent can learn from the position information in which direction it should advance in order to reach the target position as quickly as possible; moreover, the modular deep neural network makes the strategy each neural network needs to learn more definite, so that the underwater agent can better avoid obstacles and reach the target position along a shorter path. Step S36 adopts the experience-replay method: the sample data of the interaction between the agent and the environment are stored in the experience pool, and each sample is labelled by its reward value, which overcomes the drawback of deep learning methods that require a large number of manually labelled samples.
Further, the specific steps of the local obstacle-avoidance deep neural network module and the global navigation deep neural network module are as follows: S331, the current value network receives the input current state information, processes it and outputs a group of actions, where the processing includes drawing training samples from the experience pool according to the current state information; the output group of actions is transmitted to the target value network;
S332, the target value network processes the transmitted group of actions, calculates the value of each action, and transmits the values of the group of actions back to the current value network;
S333, the current value network selects the next action to be executed according to the maximum among the values of the group of actions returned by the target value network, and outputs a control instruction to the simulation environment according to that next action;
S334, the current value network obtains the next-step state information resulting from the next action selected in S333, processes it and outputs a group of actions, where the processing includes drawing training samples from the experience pool according to the next-step state information; the output group of actions is transmitted to the target value network, and the target value network repeats step S332, i.e. it processes the transmitted group of actions, calculates the value of each action, and transmits the values back to the current value network;
S335, the current value network compares the values of the group of actions returned by the target value network in S334 with those returned in S332, and calculates an error according to the error function;
S336, the weights of the current value network are updated according to the error obtained in S335, and every N steps the weights of the target value network are updated from the weights of the current value network;
S337, in parallel with step S334, the current value network outputs a control instruction to the simulation environment according to the selected next action. Actions are selected with the parameters of the current value network and the selected actions are evaluated with the parameters of the target value network; by using the two sets of weight parameters of the current value network and the target value network, the action selection of the target value network is separated from the action evaluation.
Further, the reward function is a continuous combined reward function comprising terminal rewards and non-terminal rewards. The terminal rewards are as follows: a positive reward is obtained when the agent reaches the target point, expressed as r_arr = 100 if d_{r-t} ≤ d_win, where d_{r-t} is the Euclidean distance from the agent to the target point and d_win is the threshold for reaching the target point; when d_{r-t} is smaller than the set d_win the target point is considered reached, otherwise it is not reached; r_arr is the positive reward value given when the agent reaches the target position.
A negative reward is obtained when the agent collides with an obstacle, expressed as r_col = −100 if d_{r-o} ≤ d_col, where d_{r-o} is the Euclidean distance from the agent to the nearest obstacle and d_col is the threshold for a collision between the agent and an obstacle; when d_{r-o} is smaller than or equal to d_col a collision is considered to have occurred, otherwise no collision has occurred; r_col is the punitive negative reward value given when the agent collides.
The non-terminal rewards are as follows: a positive reward is obtained when the agent progresses towards the target point, expressed as r_t_goal = c_r·[d_{r-t}(t) − d_{r-t}(t−1)], where c_r ∈ (0, 1] is a coefficient set to 1 and d_{r-t}(t) denotes the Euclidean distance between the agent's position coordinates at time t and the target position coordinates. A danger reward r_dang ∈ [0, 1] is obtained, and it decreases as the minimum distance between the agent and the obstacles keeps decreasing; its expression (a function of d_min and the coefficient β, not reproduced here) satisfies 0 ≤ r_dang ≤ 1, where d_min is the minimum distance between the agent and the obstacles and β is a coefficient chosen so that the value range of r_dang is (0, 1).
When the included angle between the agent's advancing-direction vector and the vector from the agent's current coordinates to the target position is smaller than ±18°, a reward of 1 is obtained; when the angle is greater than ±18° and smaller than ±72°, a reward of 0.3 is obtained; in other cases the reward is 0. The expression is:
r_ori = 1 if |a_ori| ≤ 18°;  r_ori = 0.3 if 18° < |a_ori| ≤ 72°;  r_ori = 0 otherwise,
where a_ori is the included angle between the agent's advancing-direction vector and the vector from the agent's current coordinates to the target position. The continuous combined reward function improves the convergence speed of the algorithm, so that the underwater agent can better avoid obstacles and reach the target position along a shorter path; with the continuous combined reward function the underwater agent obtains a corresponding reward at every learning step, which is more conducive to guiding it to move towards the target point and away from obstacles.
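A sketch of the continuous combined reward function under the thresholds above. The terms are simply summed here, and the danger term is written as 1 − exp(−β·d_min) purely as an assumed stand-in with range (0, 1), since the patent's exact expression is not reproduced; d_win, d_col and β are illustrative values:

    import math

    def combined_reward(d_rt, d_rt_prev, d_ro, d_min, a_ori,
                        d_win=5.0, d_col=2.0, c_r=1.0, beta=0.05):
        """Continuous combined reward: terminal rewards plus non-terminal shaping terms."""
        # terminal rewards
        if d_rt <= d_win:
            return 100.0                        # reached the target point
        if d_ro <= d_col:
            return -100.0                       # collided with an obstacle

        # non-terminal rewards
        r_goal = c_r * (d_rt_prev - d_rt)       # sign chosen so progress towards the target is positive
        r_dang = 1.0 - math.exp(-beta * d_min)  # assumed stand-in: shrinks as the obstacle gets closer
        if abs(a_ori) <= 18:
            r_ori = 1.0
        elif abs(a_ori) <= 72:
            r_ori = 0.3
        else:
            r_ori = 0.0
        return r_goal + r_dang + r_ori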
Further, the local obstacle-avoidance deep neural network module and the global navigation deep neural network module use ReLU6 and ReLU as activation functions: the ReLU6 activation function is applied at the front end of the neural network and the ReLU activation function at the back end. To improve the learning ability of the neural network and suppress problems such as vanishing gradients, ReLU6 and ReLU are combined as the activation functions in the network framework. Both ReLU and ReLU6 avoid the vanishing-gradient phenomenon; using the ReLU6 function as the activation at the front end of the network helps the network learn the sparse features of the data samples quickly, while the ReLU function is used as the activation at the back end of the network.
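A toy forward pass illustrating the split use of ReLU6 at the front of the network and ReLU at the back; the layer sizes and random weights are placeholders, not the patent's architecture:

    import numpy as np

    rng = np.random.default_rng(0)

    def relu(x):
        return np.maximum(x, 0.0)

    def relu6(x):
        return np.minimum(np.maximum(x, 0.0), 6.0)   # ReLU clipped at 6

    # illustrative fully connected layers: 36 sonar inputs -> 64 -> 64 -> 5 steering actions
    W1 = rng.normal(size=(36, 64))
    W2 = rng.normal(size=(64, 64))
    W3 = rng.normal(size=(64, 5))

    def forward(state):
        h1 = relu6(state @ W1)    # front end: ReLU6 helps learn sparse features quickly
        h2 = relu(h1 @ W2)        # back end: ReLU
        return h2 @ W3            # one value estimate per discrete action

    q_values = forward(rng.normal(size=36))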
Furthermore, the local obstacle avoidance deep neural network module and the global navigation deep neural network module both adopt a full-connection structure, the number of hidden layers of the local obstacle avoidance deep neural network module is more than three, and the number of hidden layers of the global navigation deep neural network module is one.
Further, the instruction selection module is provided with a threshold and selects the control instruction by comparing the distance between the agent and the nearest obstacle with this threshold: when the distance is smaller than 40, the control instruction output by the local obstacle-avoidance deep neural network module is executed; when it is greater than or equal to 40, the control instruction output by the global navigation neural network module is executed. A distance smaller than 40 means the underwater agent is close to an obstacle, so the instruction output by the local obstacle-avoidance deep neural network is executed; a distance greater than or equal to 40 means the underwater agent is some way from the obstacles, so the global navigation neural network's instruction is executed in order to reach the target position faster.
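A sketch of the instruction selection rule, where the value compared with 40 is taken to be the distance between the agent and the nearest obstacle as described above:

    def select_instruction(nearest_obstacle_distance, local_action, global_action, threshold=40):
        """Execute the local obstacle-avoidance action near obstacles, otherwise the global one."""
        if nearest_obstacle_distance < threshold:
            return local_action     # close to an obstacle: avoidance has priority
        return global_action        # far from obstacles: head for the target position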
Compared with the prior art, the invention has the following beneficial effects. By adopting a modular deep neural network architecture, the learning task of each module's neural network becomes more definite, and the double-neural-network structure improves the stability of the algorithm. The output method of the target value network, the loss function, the reward function and the data stored in the experience pool of the MS-DDQN algorithm are improved, so that the rewards obtained by the agent during training can be propagated to the state-value estimates of states a multi-step interval away. In this way the underwater agent is better guided to learn quickly and can perceive changes of future states in advance, which gives it the ability to perceive the future distribution of obstacles and helps it take evasive action in advance. The improved continuous combined reward function increases the convergence speed of the algorithm, so that the underwater agent can better avoid obstacles and reach the target position along a shorter path. The invention improves the conventional DDQN algorithm with the Q(λ) algorithm theory; the improved algorithm is the multi-step-mechanism-oriented Multi-step DDQN (MS-DDQN) algorithm. MS-DDQN allows the rewards obtained by the underwater robot during training to be further propagated to the state-value estimates of multi-step-interval states. In this way the underwater robot is better guided to learn quickly and can perceive changes of future states in advance. Improving the DDQN algorithm with the Q(λ) method is equivalent to giving the underwater robot the ability to perceive the future distribution of obstacles, helping it take evasive action in advance.
Drawings
Fig. 1 shows a training method of the DQN algorithm.
Fig. 2 is a system architecture of an embodiment of the present invention.
Fig. 3 is a schematic structural diagram of a neural network according to an embodiment of the present invention.
FIG. 4 is a geometric environment model diagram according to an embodiment of the present invention.
Fig. 5 is a sonar detector diagram according to an embodiment of the present invention.
Fig. 6 is a diagram of a forward-looking sonar detector simulator according to an embodiment of the present invention.
FIG. 7 is a diagram of a simulation environment model according to an embodiment of the present invention.
FIG. 8 is a diagram of a coordinate transformation code according to an embodiment of the present invention.
Fig. 9 is a graph of a ReLU function according to an embodiment of the present invention.
Fig. 10 is a graph of the ReLU6 function according to an embodiment of the present invention.
Fig. 11 is a navigation track diagram of the underwater robot according to the embodiment of the present invention.
FIG. 12 is a graph of training results of an embodiment of the present invention, wherein 12(a) is a success rate curve, 12(b) is a reward value curve per round, and 12(c) is a reward average.
Fig. 13 is a navigation track diagram of the underwater robot in different test environments, where 13(a) is environment 2, 13(b) is environment 3, 13(c) is environment 4, and 13(d) is environment 5.
FIG. 14 is a diagram of the hardware and software configuration of a computer according to an embodiment of the present invention.
FIG. 15 is a diagram of a hyper-parameter setting according to an embodiment of the present invention.
Fig. 16 is a graph of test results in environment 1 of an embodiment of the present invention.
FIG. 17 is a graph of test results for different environments according to an embodiment of the present invention.
Description of reference numerals: 1, the underwater robot after mass-point simplification; 2, the advancing direction of the underwater robot; 3, the target position of the underwater robot; 31, the target area of the underwater robot; 4, the starting position of the underwater robot; 41, the starting area of the underwater robot.
Detailed Description
The drawings are only for purposes of illustration and are not to be construed as limiting the invention. For a better understanding of the following embodiments, certain features of the drawings may be omitted, enlarged or reduced, and do not represent the size of an actual product; it will be understood by those skilled in the art that certain well-known structures in the drawings and descriptions thereof may be omitted.
Example 1
The embodiment provides an intelligent autonomous navigation method based on deep reinforcement learning, in particular to an underwater robot autonomous navigation method based on deep reinforcement learning, which comprises the following steps:
s1, constructing an intelligent autonomous navigation system, wherein the autonomous navigation system adopts an MS-DDQN algorithm, namely a multi-step mechanism-oriented DDQN algorithm, and the MS-DDQN algorithm is a deep reinforcement learning algorithm; the MS-DDQN algorithm adopts a modularized neural network, and the modularized neural network comprises a local obstacle avoidance depth neural network module, a global navigation depth neural network module and an instruction selection module; the local obstacle avoidance depth neural network module is used for guiding the intelligent agent to be far away from an obstacle, the global navigation depth neural network module is used for guiding the intelligent agent to move towards a closer path to a target position, and the instruction selection module is used for determining a finally executed action instruction. The MS-DDQN algorithm comprises a current value network, a target value network, an error function, a reward function and an experience pool, wherein the current value network is used for selecting an action, the target value network is used for evaluating the action, the error function is used for updating the weight, the reward function refers to a reward value obtained when an agent takes a certain action in a current state and reaches a next state, and the experience pool is used for storing sample data generated in each step.
Based on the modularized neural network framework, the underwater robot can adopt different strategies to cope with different environmental states. When the underwater robot approaches to the obstacle, the main task of the underwater robot is to avoid the obstacle, and the global navigation task becomes a secondary task. When the underwater robot is far away from the obstacle, the global navigation task becomes a main task, so that the underwater robot can reach the target position in a shorter path. The embodiment provides a modularized neural network structure aiming at two subtasks of underwater robot navigation, namely a local obstacle avoidance task and a global navigation task. Aiming at the two subtasks, a local obstacle avoidance neural network and a global navigation neural network are designed respectively, and the underwater robot can clearly determine what action the underwater robot should take under each condition to better avoid the obstacle and reach the target position by a closer path through a double neural network structure.
The training method of the MS-DDQN algorithm comprises the following steps:
Randomly initialize the current value network Q(s_t, a; θ) with weights θ and the target value network Q_target(s_t, a; θ') with weights θ'; Q(s_t, a; θ) denotes the estimate of each action output by the current value network
Initialize the experience pool D and set the hyper-parameter λ
For episode = 1, M do
    Reset the simulation environment, obtain the initial observation state s_t, set T ← ∞, and initialize four empty arrays S_t, A, R, S_{t+1}; S_t stores the state information of the current state, A stores the action executed in the current state, R stores the reward value obtained in the current state, and S_{t+1} stores the next-state information; T is used to judge, when the episode ends, whether all the data collected in this episode have been stored in the experience pool
    For t = 1, 2, … do
        If t < T then
            Select action a_t = argmax_a Q(s_t, a; θ) according to the current policy, execute a_t, and obtain the reward value r_t and the new state s_{t+1}; store s_t in S_t, r_t in R, a_t in A and s_{t+1} in S_{t+1}; t indexes the environment states obtained by the agent in this episode
            If s_{t+1} is the terminal state then T ← t + 1
        τ ← t − λ + 1
        If τ ≥ 0 then
            If τ + λ < T then
                r_τ = Σ_{i=τ}^{τ+λ−1} γ^{i−τ}·r_i,  r_i ∈ R
            else
                r_τ = Σ_{i=τ}^{T−1} γ^{i−τ}·r_i,  r_i ∈ R
            Store (s_τ, a_τ, r_τ, s_{τ+λ}) in the experience pool D
            Randomly draw mini-batch sample data from D
            Set y_i = r_τ + γ^λ·Q_target(s_{τ+λ}, argmax_a Q(s_{τ+λ}, a, θ), θ')
            Perform gradient descent on the current value network weights θ with the loss function L(θ) = E[(y_i − Q(s, a, θ))²]
    Until τ = T − 1
Here τ is mainly used to judge whether the number of actions the mobile robot has executed exceeds the set number of steps λ. If it does, the value of τ is greater than or equal to 0, indicating that the agent has obtained at least λ environment states; at this moment the influence r_τ of the reward values obtained in the λ future states on state τ can be calculated, the state information s_τ, the action a_τ and the state information s_{τ+λ} are extracted from the arrays S_t, A and S_{t+1}, and these three items together with r_τ form a training-sample tuple that is stored in the experience pool D, so that training samples can later be drawn from the experience pool for training.
When the agent starts an episode of training, T ← ∞ is set first; since t < T, the program enters the second for-loop. When a collision occurs or the target position is reached, T ← t + 1; when τ = T − 1, all the environment state information obtained by the agent in this episode has been stored as training samples according to the multi-step method. When τ + λ < T, the agent has not collided and the number of environment states to be stored is greater than or equal to the set λ, so the influence r_τ of the reward values obtained in the λ future states on state τ is calculated as r_τ = Σ_{i=τ}^{τ+λ−1} γ^{i−τ}·r_i with r_i ∈ R; otherwise the agent has collided and the number of environment states to be stored is smaller than the set λ, so r_τ = Σ_{i=τ}^{T−1} γ^{i−τ}·r_i with r_i ∈ R is used to calculate the influence of the remaining λ−1 (λ−2, λ−3, …, 1) future reward values on states τ (τ+1, τ+2, …, T−1). y_i denotes the i-th sample in the mini-batch drawn from the experience pool; the estimate given by the target value network plus the actual return r_i obtained in that state is a comprehensive reward for the current state and is used in the gradient-descent operation together with the current value network.
And S2, building a simulation environment, including building an obstacle environment model and building a simulation intelligent body.
The software platform and hardware components used to construct the simulation environment model and implement the related algorithms are shown in FIG. 14.
Constructing the obstacle environment model: the obstacle environment model is the description of the obstacles in the environment, and the way the environment model is described directly influences the state information input to the deep reinforcement learning algorithm and the obstacle-avoidance strategy finally learned. The grid method and the geometric method are the two more common ways of describing an environment model. FIG. 4 shows a graphical representation of an environment model described by the geometric method. The geometric method does not require dividing the environment; instead it describes the obstacle information in the environment with the points, lines and surfaces of the obstacles. Therefore, when the environment state information is built with the geometric method, the amount of data describing the environment state does not increase sharply as the environment becomes more complex. The working environment of the underwater robot is large and its obstacles are not very dense, so this embodiment chooses the geometric method to construct the environment model.
In this embodiment obstacles are divided into two types, elliptical obstacles and polygonal obstacles, with circular obstacles also treated as elliptical; for an ellipse the geometric environment model only needs to record the coordinates of the top-left vertex of the rectangle bounding the ellipse and the lengths of its major and minor axes. For polygons, into which triangles, rectangles and squares are classified, the geometric model needs to record every vertex coordinate of the polygon. Building the simulated agent: a real underwater robot is not a particle but a solid body with geometric size. In this embodiment the underwater robot is treated as a mass point, and to guarantee the navigation safety of the underwater robot in a real environment the obstacles are expanded outwards, i.e. correspondingly inflated. The underwater robot is projected as a dot with a radius of 1 pixel. In underwater environments, acoustic devices are usually used as sensing instruments to detect the environment. Since the simulated underwater robot sails at a fixed depth, a multi-beam forward-looking sonar can be used as the environment-detecting instrument in this embodiment, for example a multi-beam forward-looking detection sonar sensor. As shown in FIG. 5, the sonar detector has a vertical opening angle of 17 degrees, a horizontal opening angle of 120 degrees, a maximum effective detection distance of 200 m, 240 beams in total, and an opening angle of 0.5 degrees between adjacent beams. A simulated detector is designed in this embodiment to imitate this forward-looking sonar detector. Because the underwater robot only performs motion-control actions in the horizontal plane, a forward-looking detector simulator is designed with a horizontal opening angle of 180 degrees, a maximum detection range of 90, 36 beams, and an opening angle of 5 degrees between adjacent beams. FIG. 6 shows the simulated forward-looking sonar detector simulator. The black dot 1 represents the underwater robot after mass-point simplification, line segment 2 is the advancing direction of the underwater robot, and the line segments on both sides of line segment 2 are the acoustic lines emitted by the robot's forward sonar detector.
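A sketch of the geometric obstacle records described above: an ellipse stored by the top-left corner of its bounding rectangle plus its two axes, and a polygon stored by its vertices (the field names are illustrative):

    from dataclasses import dataclass
    from typing import List, Tuple

    @dataclass
    class EllipseObstacle:
        top_left: Tuple[float, float]   # top-left vertex of the rectangle bounding the ellipse
        major_axis: float
        minor_axis: float               # a circle simply has major_axis == minor_axis

    @dataclass
    class PolygonObstacle:
        vertices: List[Tuple[float, float]]   # triangles, rectangles and squares all fit here

    environment = [
        EllipseObstacle((120.0, 80.0), 60.0, 40.0),
        PolygonObstacle([(300.0, 300.0), (360.0, 300.0), (330.0, 360.0)]),
    ]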
To make the observation data detected and collected by the underwater robot's sonar simulator uniform, the first acoustic line on the left of the robot's advancing direction is defined as 0 degrees and the first line on the right as 180 degrees. The information detected by the sonar detector at time t is stored, in angular order (0°, 5°, 10°, …, 180°), into a row vector s_t. If no obstacle is detected, the sonar detector returns the maximum detection distance of that acoustic line; otherwise it returns the distance between the acoustic line and the obstacle. Finally, the detected information is normalized, i.e. the row vector s_t is divided by the maximum effective detection range. For the motion model of the underwater robot, the robot is set to advance at a constant speed and can only perform discrete steering actions, namely turning left 15 degrees, turning left 30 degrees, keeping the original direction, turning right 15 degrees or turning right 30 degrees relative to its advancing direction.
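A sketch of the simulated sonar observation vector and the discrete steering actions described above; the ray casting itself is stubbed out, and the beam spacing is taken as 5° across the 180° horizontal field of view:

    import numpy as np

    MAX_RANGE = 90.0                              # maximum effective detection range (as assumed above)
    ANGLES = np.arange(0.0, 181.0, 5.0)           # acoustic lines at 0°, 5°, ..., 180°
    ACTIONS = [-30.0, -15.0, 0.0, 15.0, 30.0]     # steering actions in degrees (left negative)

    def observation(distances):
        """distances[i] is the range returned by beam i (MAX_RANGE when nothing is detected)."""
        s_t = np.asarray(distances, dtype=float)
        return s_t / MAX_RANGE                    # normalize into [0, 1]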
The Pygame library is used to build the environment model: first a 500 × 500 window is defined as the simulation environment, and different obstacles, boundary walls and target positions are added to the window. The created environment model is shown in FIG. 7, in which grey ellipses, circles and polygons represent the various types of obstacles. The black dot 1 in the middle represents the underwater robot after mass-point simplification, and the ray lines around it represent the forward-looking sonar detector. In the robot's start area 41 and target area 31, i.e. the shaded rectangular areas 41 and 31, the start position and the target point of the underwater robot are randomly initialized. Black dot 4 represents the start position and black dot 3 the target position.
Implementation of the dynamic environment: after the environment model and the simulated underwater robot model are built, the underwater robot is moved in the simulation environment by the relevant methods, which constitutes the dynamic realization of the simulation environment; these include the method for underwater robot motion and the method by which the forward-looking sonar ranging simulator detects the distance to an obstacle. At the same time, a rule for judging whether the underwater robot collides and a rule for judging whether the underwater robot has reached the target position are set. The coordinate transformation of the underwater robot in the simulation environment after it performs a movement at its current position is calculated as follows. First, assume the initial position of the underwater robot at the lower left corner of the simulation environment is P_start with coordinates (x_start, y_start), the motion speed is v = 0.5, and the current heading of the underwater robot is angle. In the current state the underwater robot selects a steering action to execute, with steering angle angle_tran ∈ {15° turn_left, 30° turn_left, 0°, 15° turn_right, 30° turn_right}. Equation 5-1 gives the heading of the underwater robot after performing the action:
angle ← angle + angle_tran    (5-1)
Combining formula 5-1, the coordinates of the underwater robot become:
x_next = x_start + cos(angle) * v    (5-2)
y_next = y_start + sin(angle) * v    (5-3)
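The update above can be written as a short Python sketch; this is a minimal illustration assuming angles are held in degrees and converted to radians for the trigonometric functions, and the variable names are ours:

```python
import math

TURN_ACTIONS = [-30.0, -15.0, 0.0, 15.0, 30.0]   # discrete steering actions in degrees

def step_motion(x, y, angle_deg, action_index, v=0.5):
    """Apply formulas (5-1)-(5-3): update the heading, then advance at constant speed v."""
    angle_deg = angle_deg + TURN_ACTIONS[action_index]        # (5-1)
    rad = math.radians(angle_deg)
    x_next = x + math.cos(rad) * v                            # (5-2)
    y_next = y + math.sin(rad) * v                            # (5-3)
    return x_next, y_next, angle_deg
```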
Before solving the distance between the forward-looking sonar ranging simulator and an obstacle, the projected lengths of the end-point coordinates of each sound wave line segment on the x axis and y axis in the coordinate system centered on the underwater robot are calculated, and then the projections of each sound wave line segment in the coordinate system centered on the environment model are calculated. The transformation from the robot-centered coordinate system to the environment-model-centered coordinate system is a transformation between two two-dimensional coordinate systems. Assuming the coordinates of the robot are (center_x, center_y), part of the code for solving the end-point coordinates of a sound wave line segment in the environment-model-centered coordinate system is shown in fig. 8. Once the coordinate projections of the sound wave line segments in the environment-model-centered coordinate system are obtained, the distance between the underwater robot and the obstacle detected by each sound wave line segment can be solved. The edge vectors of each obstacle, the sound wave line segment vectors, and the vectors from the robot position to each vertex of the obstacle are then constructed; by solving the relative relations among these vectors in real time, the positional information between the underwater robot and the obstacles is obtained, and the detected length of each sound wave line segment of the forward-looking sonar detector can be solved.
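Since the code of fig. 8 is not reproduced in the text, the following hedged sketch shows one plausible way to compute the beam end points in the environment-model-centered frame; the heading offset and rotation convention are assumptions:

```python
import math

def beam_endpoints(center_x, center_y, heading_deg, max_range=90.0, step_deg=5):
    """Possible form of the fig. 8 computation: end point of every sound wave line
    segment, expressed in the environment-model-centered coordinate system."""
    endpoints = []
    for beam_deg in range(0, 181, step_deg):
        # Beam direction relative to the robot heading: 0° is the leftmost beam,
        # 180° the rightmost (assumed convention).
        world_deg = heading_deg + 90.0 - beam_deg
        rad = math.radians(world_deg)
        # Projection in the robot-centered frame, then translation to the
        # environment-model-centered frame.
        end_x = center_x + max_range * math.cos(rad)
        end_y = center_y + max_range * math.sin(rad)
        endpoints.append((end_x, end_y))
    return endpoints
```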
Judging the collision rule of the underwater robot: first, d_min is set as the minimum safe distance between the underwater robot and an obstacle. If the shortest sound wave line segment detected by the current forward-looking sonar detector is smaller than the set d_min, a collision is judged to have occurred, the training of this round ends, and a new initial position is reassigned to the underwater robot; otherwise no collision occurs and the underwater robot selects an action to execute according to the relevant strategy. Judging the rule that the underwater robot reaches the target position: first, d_arrival is defined as the maximum distance at which the underwater robot is considered to have reached the target position. While the underwater robot is running, the Euclidean distance between its current position coordinate and the target position coordinate is calculated; if this distance is less than d_arrival, the underwater robot is considered to have reached the target position.
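A compact sketch of these two rules, assuming the sonar readings and positions are already available; the threshold values below are placeholders, not the patent's values:

```python
import math

D_MIN = 3.0        # minimum safe distance to an obstacle (placeholder value)
D_ARRIVAL = 5.0    # maximum distance at which the target counts as reached (placeholder)

def has_collided(sonar_readings):
    """Collision rule: the shortest detected sound wave segment is below d_min."""
    return min(sonar_readings) < D_MIN

def has_arrived(robot_pos, target_pos):
    """Arrival rule: Euclidean distance to the target is below d_arrival."""
    dist = math.hypot(robot_pos[0] - target_pos[0], robot_pos[1] - target_pos[1])
    return dist < D_ARRIVAL
```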
S3, placing the autonomous navigation system in the simulation environment for training, namely the intelligent agent trains and learns in the simulation environment with the MS-DDQN algorithm; multiple simulation environments are used, and training is run multiple times in each simulation environment.
S31, acquiring current state information of the intelligent agent by the simulation environment, wherein the current state information comprises distance information between the intelligent agent and an obstacle in the environment and position relation information between the intelligent agent and a target point; the information of the position relation between the intelligent body and the target point comprises the relative coordinate relation between the current coordinate of the intelligent body and the coordinate of the target position, the Euclidean distance from the current position of the intelligent body to the target position and the included angle between the vector of the advancing direction of the intelligent body and the vector of the direction from the current coordinate position of the intelligent body to the target position, wherein the relative coordinate relation is obtained by subtracting the current coordinate of the underwater intelligent body from the coordinate of the target position.
The underwater robot mainly detects obstacle information in the environment through the ranging sensor, and the information acquired by the ranging sensor is the distance between the underwater robot and the obstacles in the environment. In order to improve the learning efficiency of the underwater robot and learn a better strategy, the positional relation information between the underwater robot and the target point is added to the state information fed into the deep neural network, so that the underwater robot tends to reach the target point along a shorter path. The positional relation comprises three pieces of information. The first is the relative coordinate relation between the current coordinate of the underwater robot and the target position coordinate, obtained by subtracting the target position coordinate from the current coordinate of the underwater robot. The second is the Euclidean distance from the current position of the underwater robot to the target position, which covers how far the robot is from the target. The third is the included angle between the advancing-direction vector of the underwater robot and the direction vector from the current robot coordinate to the target position; this information indicates in which direction the underwater robot should head to reach the target position most directly.
The positional relation information between the underwater robot and the target point is added into the input state information to help the underwater robot reach the target position along a shorter path. At the same time, the positional relation information is used directly as the input state information of the global navigation neural network, so that through this position information the underwater robot can learn in which direction it should advance in order to reach the target position fastest.
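A small Python sketch of how these three pieces of position information could be appended to the sonar vector; a minimal illustration with names of our own choosing, not the patent's:

```python
import math

def position_features(robot_xy, target_xy, heading_deg):
    """Relative coordinates, Euclidean distance, and heading-to-target angle."""
    dx = robot_xy[0] - target_xy[0]          # relative coordinate relation
    dy = robot_xy[1] - target_xy[1]
    dist = math.hypot(dx, dy)                # Euclidean distance to the target

    # Angle between the advancing-direction vector and the robot-to-target vector
    heading = (math.cos(math.radians(heading_deg)), math.sin(math.radians(heading_deg)))
    to_target = (target_xy[0] - robot_xy[0], target_xy[1] - robot_xy[1])
    dot = heading[0] * to_target[0] + heading[1] * to_target[1]
    norm = math.hypot(*to_target) or 1.0
    angle = math.degrees(math.acos(max(-1.0, min(1.0, dot / norm))))

    return [dx, dy, dist, angle]

def build_full_state(sonar_vector, robot_xy, target_xy, heading_deg):
    """State fed to the local obstacle-avoidance network: sonar readings + position features."""
    return list(sonar_vector) + position_features(robot_xy, target_xy, heading_deg)
```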
And S32, inputting the acquired current state information into the modular deep neural network, specifically, inputting distance information between the intelligent body and an obstacle in the environment and position relation information between the intelligent body and a target point into the local obstacle avoidance deep neural network module, and inputting position relation information between the intelligent body and the target point into the global navigation deep neural network module.
And S33, the local obstacle avoidance depth neural network module and the global navigation depth neural network module output respective control instructions according to the input current state information.
And S34, the instruction selection module determines to use the action instruction output by the local obstacle avoidance depth neural network module or the global navigation depth neural network module by judging the distance value between the intelligent body and the nearest obstacle.
The system architecture of the autonomous navigation system is shown in fig. 2. The input state information of the local obstacle avoidance neural network comprises the environment information detected by the ranging sensor and the relative position information of the underwater robot; after forward propagation through the local obstacle avoidance deep neural network, the network directly outputs a control instruction for the underwater robot. The input state information of the global navigation neural network is only the relative position information of the underwater robot, and after this input is processed the network outputs a control instruction for the underwater robot. Because both the local obstacle avoidance deep neural network and the global navigation neural network output instructions for controlling the underwater robot, an instruction selection module is designed to determine which network's output action instruction is executed. The instruction selection module decides which network's action instruction to use by judging the distance between the underwater robot and the nearest obstacle. In this embodiment the threshold d_to_obs is set to 40 to determine which module's instruction is used: when the distance is less than 40, the underwater robot is close to an obstacle, so the instruction output by the local obstacle avoidance deep neural network is executed; when the distance is greater than or equal to 40, the underwater robot is some distance away from the obstacles, and the global navigation neural network is used so that the target position is reached more quickly. The components internal to the deep reinforcement learning system are the local obstacle avoidance module, the global navigation module, the instruction selection module and the action, while the components related to the external environment are the ranging sensor information, the relative position information and the environment. The internal neural network structures of the local obstacle avoidance module and of the global navigation module are shown in fig. 3; both adopt fully connected structures, because the system senses the environment with a forward-looking sonar detector whose returned information has low dimensionality and small data volume, so no complex convolutional layers need to be built. The local obstacle avoidance neural network contains three hidden layers with 256, 138 and 32 neurons respectively. The global navigation neural network contains only one hidden layer with 32 neurons. The neural network structure of the global navigation module is simple because its input state information is only the relative position information of the underwater robot, and one hidden layer is enough to solve the global navigation problem well.
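As a hedged sketch, assuming PyTorch, the two fully connected modules and the threshold-based instruction selection might look like this; the layer sizes follow the text, while the input dimensions and the exact placement of ReLU6 versus ReLU are illustrative assumptions:

```python
import torch
import torch.nn as nn

N_ACTIONS = 5          # turn left 30/15, keep heading, turn right 15/30
SONAR_DIM = 37         # number of sonar beams (illustrative)
POS_DIM = 4            # relative x/y, distance, heading-to-target angle (illustrative)
D_TO_OBS = 40.0        # instruction selection threshold

local_net = nn.Sequential(            # local obstacle avoidance module: three hidden layers
    nn.Linear(SONAR_DIM + POS_DIM, 256), nn.ReLU6(),
    nn.Linear(256, 138), nn.ReLU6(),
    nn.Linear(138, 32), nn.ReLU(),
    nn.Linear(32, N_ACTIONS),
)

global_net = nn.Sequential(           # global navigation module: one hidden layer
    nn.Linear(POS_DIM, 32), nn.ReLU(),
    nn.Linear(32, N_ACTIONS),
)

def select_action(sonar, pos, nearest_obstacle_dist):
    """Instruction selection: local module when close to an obstacle, global module otherwise."""
    if nearest_obstacle_dist < D_TO_OBS:
        q = local_net(torch.cat([sonar, pos], dim=-1))
    else:
        q = global_net(pos)
    return int(q.argmax(dim=-1))
```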
In order to optimize the input information of the state space, the Euclidean distance from the underwater robot to the target position and the included angle between the advancing-direction vector of the underwater robot and the direction vector from the current robot coordinate to the target position are added to the original state space. By adding these two pieces of information, the underwater robot knows its current position state better and can reach the target position along a shorter path.
And S35, the intelligent agent executes the instruction selected by the instruction selection module, receives the reward value output by the reward function and enters the next state.
In order to improve the convergence speed of the algorithm, and so that the underwater robot can better avoid obstacles and reach the target position along a shorter path, this embodiment employs a new continuous combined reward function. With the continuous combined reward function, the underwater robot obtains a corresponding reward at every learning step, which is more favorable for guiding it to move towards the target point and away from the obstacles. The continuous combined reward function comprises terminal rewards and non-terminal rewards; the terminal rewards specifically include:
a positive reward is obtained when the agent reaches the target point, expressed as r_arr = 100, if d_{r-t} ≤ d_win; where d_{r-t} is the Euclidean distance from the agent to the target point and d_win is the threshold for the agent to reach the target point. When d_{r-t} is smaller than the set d_win, the target point is considered reached; otherwise the target point has not been reached. r_arr is the positive reward value given when the agent reaches the target position;
a negative reward is obtained when the agent collides with an obstacle, expressed as r_col = −100, if d_{r-o} ≤ d_col; where d_{r-o} is the Euclidean distance from the agent to the nearest obstacle and d_col is the threshold at which the agent is considered to collide with an obstacle. When d_{r-o} is less than or equal to d_col, a collision has occurred; otherwise no collision has occurred. r_col is the punitive negative reward value given when the agent collides;
the non-terminal award specifically includes:
a positive reward is obtained when the agent progresses towards the target point, expressed as r_t_goal = c_r[d_{r-t}(t) − d_{r-t}(t−1)]; where c_r ∈ (0, 1] is a coefficient, set to 1 in this embodiment;
the danger reward r_dang ∈ [0, 1] is obtained when the minimum distance between the agent and the obstacles keeps decreasing, and it decreases accordingly; its expression is a function of d_min and the coefficient β (the formula appears only as an image in the source text), with 0 ≤ r_dang ≤ 1; where d_min is the minimum distance between the agent and the obstacles, and β is a coefficient such that the value space of r_dang is (0, 1); d_{r-t}(t) denotes the Euclidean distance between the current position coordinate of the agent and the target position coordinate at time t;
when the included angle between the advancing-direction vector of the agent and the direction vector from the agent's current coordinate to the target position is less than ±18 degrees, a reward of 1 is obtained; when the included angle is greater than ±18 degrees and less than ±72 degrees, a reward of 0.3 is obtained; in all other cases the reward is 0. The expression is:
reward = 1, if |a_ori| < 18°; reward = 0.3, if 18° ≤ |a_ori| < 72°; reward = 0, otherwise;
where a_ori is the included angle between the advancing-direction vector of the agent and the direction vector from the agent's current coordinate to the target position.
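Under the definitions above, a hedged Python sketch of the continuous combined reward follows. The exact danger-reward formula is not reproduced in the text, so 1 − exp(−β·d_min) is used purely as a stand-in, summing the non-terminal terms is our assumption, the threshold values are placeholders, and the sign convention of the progress term follows the text verbatim:

```python
import math

def combined_reward(d_rt, d_rt_prev, d_ro, d_min, angle_deg,
                    d_win=5.0, d_col=3.0, c_r=1.0, beta=0.05):
    """Continuous combined reward: terminal rewards plus non-terminal shaping terms."""
    # Terminal rewards
    if d_rt <= d_win:
        return 100.0                          # reached the target point
    if d_ro <= d_col:
        return -100.0                         # collided with an obstacle

    # Non-terminal rewards
    r_goal = c_r * (d_rt - d_rt_prev)         # progress term as written in the text
    r_dang = 1.0 - math.exp(-beta * d_min)    # stand-in for the danger reward in (0, 1)
    if abs(angle_deg) < 18.0:
        r_angle = 1.0
    elif abs(angle_deg) < 72.0:
        r_angle = 0.3
    else:
        r_angle = 0.0
    return r_goal + r_dang + r_angle          # combination by summation is an assumption
```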
S36, storing the sample data of this interaction into the experience pool, wherein the sample data comprises the current state s_t, the executed action a_t, the obtained reward value r_t and the next state s_{t+1}; the sample data is stored in the experience pool in the form
(s_t, a_t, r_t, s_{t+1}).
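A minimal experience-pool sketch consistent with this storage format, using a fixed-capacity buffer with uniform random sampling; the capacity and batch size are illustrative:

```python
import random
from collections import deque

class ExperiencePool:
    """Stores interaction tuples (s_t, a_t, r_t, s_next) and samples mini-batches."""

    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def store(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size=64):
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states = zip(*batch)
        return states, actions, rewards, next_states
```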
And S4, loading the trained autonomous navigation system onto the intelligent agent, and enabling the intelligent agent to obtain autonomous navigation capability.
The specific steps of the local obstacle avoidance depth neural network module and the global navigation depth neural network module in the S33 are as follows:
s331, the current value network receives the input current state information, processes the current state information according to the input current state information and outputs a group of actions, wherein the processing process comprises the step of extracting training samples from an experience pool according to the current state information; and passing the output set of actions to the target value network.
The output of the target value network is:
Q_target = Σ_{i=0}^{λ-1} γ^i r_{t+i} + γ^λ Q_target(s_{t+λ}, argmax_a Q(s_{t+λ}, a, θ), θ')
the loss function is:
L(θ) = E[(Q_target − Q(s, a, θ))²]
the data stored in the experience pool are as follows:
(s_t, a_t, Σ_{i=0}^{λ-1} γ^i r_{t+i}, s_{t+λ})
wherein t is a certain moment, s is a state, and i is the index of the reward value obtained in each state after state t; s_t is the current state, a is the executed action, a_t denotes the action executed at time t, and r_t denotes the immediate reward value obtained by the agent at time t; γ is the discount factor, γ^i expresses the degree of influence of the reward value r_{t+i} obtained in the (t+i)-th state on the current state t, and γ is a value less than 1 and greater than 0; λ is the number of interval steps, and s_{t+λ} is the state after λ steps.
In single-step reinforcement learning, a training sample tuple contains the current state s_t, the action a_t taken in the current state, the obtained reward value r_t and the next state s_{t+1}; thus a training sample tuple is (s_t, a_t, r_t, s_{t+1}). In multi-step reinforcement learning, a training sample tuple contains the current state s_t, the action a_t taken in the current state, the obtained reward value r_t, the reward value r_{t+1} obtained in the next state, the reward value r_{t+2} obtained in the state after that, and so on until the λ-th state, in which the obtained reward value is r_{t+λ} and whose state is s_{t+λ}; thus a training sample tuple is (s_t, a_t, r_t, r_{t+1}, r_{t+2}, …, r_{t+i}, …, r_{t+λ}, s_{t+λ}). Therefore i is the index of the reward value obtained in each state after state t.
γ is the discount factor, and γ^i expresses the degree of influence of the reward value r_{t+i} obtained in the (t+i)-th state on the current state t; γ is a value less than 1 and greater than 0. For example, letting γ equal 0.5, the target value expands to
Q_target = r_t + 0.5 r_{t+1} + 0.25 r_{t+2} + … + 0.5^{λ-1} r_{t+λ-1} + 0.5^λ Q_target(s_{t+λ}, argmax_a Q(s_{t+λ}, a, θ), θ'),
so that rewards obtained further in the future have a geometrically smaller influence on the current state t.
γ^λ expresses the degree of influence of the reward value r_{t+λ} obtained in the (t+λ)-th state on the current state t. Q(s_{t+λ}, a, θ) denotes the estimated value of each action output by the current value network given the input information (s_{t+λ}, a). Q_target(s_{t+λ}, argmax_a Q(s_{t+λ}, a, θ)) means that the action instruction corresponding to the maximum estimated value output by the current value network, together with s_{t+λ}, is used as the input of the target value network, which then outputs the estimated value of each action. Q(s, a, θ) denotes the estimated value of each action output by the current value network.
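A hedged sketch of how this multi-step double-Q target could be computed, assuming PyTorch and the hypothetical networks current_net / target_net with the same output layout; the index conventions of the patent are paraphrased:

```python
import torch

def multi_step_target(rewards, s_t_lambda, gamma, current_net, target_net, done):
    """Multi-step DDQN target: discounted reward sum plus a double-Q bootstrap.

    rewards     -- list [r_t, r_{t+1}, ..., r_{t+lambda-1}] collected over lambda steps
    s_t_lambda  -- state reached after lambda steps (tensor)
    done        -- True if the episode terminated before lambda steps completed
    """
    lam = len(rewards)
    # Accumulated discounted reward over the lambda interval steps
    g = sum((gamma ** i) * r for i, r in enumerate(rewards))
    if done:
        return g
    with torch.no_grad():
        # Action selected by the current value network ...
        a_star = current_net(s_t_lambda).argmax(dim=-1, keepdim=True)
        # ... evaluated by the target value network (double Q-learning)
        bootstrap = target_net(s_t_lambda).gather(-1, a_star).squeeze(-1)
    return g + (gamma ** lam) * bootstrap
```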
And S332, the target value network processes according to the transmitted group of actions, calculates the value of each action, and transmits the value of the group of actions back to the current value network.
And S333, selecting the next action to be executed by the current value network according to the maximum value in the value of the group of actions returned by the target value network, and outputting a control instruction to the simulation environment according to the next action.
S334, the current value network calculates to obtain next step state information according to the next step action in S333, processes the next step state information and outputs a group of actions, wherein the processing process comprises the step of extracting training samples from an experience pool by the next step state information; and transmitting the output set of actions to the target value network; and the target value network repeats the step S332, that is, the target value network processes according to the transmitted group of actions, calculates the value of each action, and transmits the value of the group of actions back to the current value network.
S335, the current value network compares the value of the group of actions returned by the target value network in S334 with the value of the group of actions returned by the target value network in S332, and calculates an error according to an error function.
And S336, updating the weight of the current value network according to the error obtained in the S335, and updating the weight of the target value network according to the weight of the current value network after every N steps.
S337, at the same time as step S334 is carried out, the current value network outputs a control instruction to the simulation environment according to the selected next action.
In order to improve the learning ability of the neural network and inhibit gradient disappearance in the neural network, the present embodiment adopts the combination of ReLU6 and ReLU as the activation functions in the neural network framework, that is, the activation functions adopted by the local obstacle avoidance deep neural network module and the global navigation deep neural network module are ReLU6 and ReLU; the activation function ReLU6 is applied to the front end of the neural network, and the activation function ReLU is applied to the back end of the neural network.
The formula of the ReLU function is as follows.
ReLU(x) = max(0, x)
A graph of the ReLU function plotted with Python is shown in fig. 9: when the input value is negative or 0, the output of ReLU is 0, but when the input value is greater than 0, ReLU outputs the input value itself. This unilateral activation characteristic of ReLU gives the neurons of the neural network a sparse activation behaviour. ReLU alleviates the gradient-vanishing problem that easily occurs with the Sigmoid and Tanh functions, so that the convergence of the neural network is more stable.
The ReLU6 is an improved activation function obtained by improving the ReLU, and the formula is as follows:
ReLU6(x) = min(max(0, x), 6)
A graph of the ReLU6 function plotted with Python is shown in fig. 10. ReLU6 mainly modifies the positive part of the ReLU input: when the input value is greater than 6, the output of ReLU6 is always 6; if the input value is a real number greater than 0 and less than 6, the input itself is output; otherwise 0 is output. ReLU6 encourages the neural network model to learn the sparse characteristics of the input data earlier. Both the ReLU and ReLU6 activation functions avoid the gradient-vanishing phenomenon. The ReLU6 function is used as the activation function at the front end of the network, which is beneficial for quickly learning the sparse characteristics of the data samples. The network finally outputs an evaluation value for each behavior action, and the action corresponding to the highest evaluation value is selected as the action to be executed by the underwater robot.
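The two activation functions as plain Python/NumPy expressions, a direct transcription of the formulas above:

```python
import numpy as np

def relu(x):
    """ReLU(x) = max(0, x)."""
    return np.maximum(0.0, x)

def relu6(x):
    """ReLU6(x) = min(max(0, x), 6): the positive part is clipped at 6."""
    return np.minimum(np.maximum(0.0, x), 6.0)
```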
Underwater robot navigation models based on different algorithms are trained to verify the effectiveness of the MS-DDQN algorithm. First, as shown in fig. 7, the established simulation environment model is used as the training environment of the underwater robot; this environment is denoted training environment 1 (Env-1). To verify the effectiveness of the MS-DDQN method, we tested the navigation capability of the underwater robot in Env-1, comparing the MS-DDQN algorithm with the DDQN, prioritized DQN and prioritized DDQN algorithms. To ensure the fairness of the experiment, the same network structure and the same software and hardware platform are used for model training. Before training, the relevant hyper-parameters of deep reinforcement learning are set as shown in fig. 15. In order to quantitatively evaluate the performance of each algorithm, we use three indicators to assess the quality of the navigation model. The first is the success rate, which represents the proportion of runs in which the underwater robot successfully reaches the target position out of the total number of training runs after training starts. The second is the reward value curve, which represents the sum of the reward values obtained in each round of training; to smooth the reward curve we process it with a sliding average with a window size of 300. The third is the average obtained reward, i.e. the total reward obtained by the underwater robot during training divided by the number of training rounds. The autonomous navigation capability of the underwater robot trained in environment 1 with the MS-DDQN, DDQN, prioritized DQN and prioritized DDQN algorithms is shown in fig. 11.
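The sliding-average smoothing of the reward curve (window size 300) can be reproduced with a few lines of NumPy; this is a generic moving average, not code from the patent:

```python
import numpy as np

def moving_average(rewards, window=300):
    """Smooth a per-episode reward curve with a sliding-window average."""
    rewards = np.asarray(rewards, dtype=np.float64)
    if len(rewards) < window:
        return rewards
    kernel = np.ones(window) / window
    return np.convolve(rewards, kernel, mode="valid")
```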
As shown in fig. 12(a), the success rate curve of MS-DDQN rises faster than those of the other three methods, which indicates that the learning efficiency of the MS-DDQN algorithm is higher; this is also demonstrated by the reward curves in fig. 12(b). After 3000 training rounds, the success rate of reaching the target position is 80.133% for MS-DDQN, 61.7% for DDQN, 63.633% for prioritized DQN and 53.366% for prioritized DDQN, so the success rate of MS-DDQN is much higher than that of the other algorithms. This shows that the MS-DDQN-based underwater robot completes more collision-free, target-reaching rounds during training and has stronger obstacle avoidance and navigation ability. In fig. 12(b) it can be seen that after 500 training rounds the reward curve obtained by MS-DDQN stabilizes above 200, while the curves of the other three algorithms fluctuate more strongly, which indicates that the navigation model based on MS-DDQN has higher stability. In fig. 12(c), the average reward value is 185.072 for MS-DDQN, 130.064 for DDQN, 132.067 for prioritized DQN and 101.650 for prioritized DDQN, which also demonstrates that the MS-DDQN-based underwater robot has stronger navigation capability, since a lower reward value means many negative rewards, i.e. more collisions of the underwater robot. By analyzing the success rate curves, the per-round reward value curves and the average rewards of the navigation models based on the different algorithms during training, it can be seen that the underwater robot based on the MS-DDQN algorithm has higher learning efficiency during training than the other three algorithms, and that the trained navigation model is more stable.
Testing the navigation capability and the generalization capability of the navigation model: after 3000 rounds of training in environment 1, navigation models based on the MS-DDQN, DDQN, prioritized DQN and prioritized DDQN algorithms were obtained. These navigation models were first tested 200 times in environment 1 and the proportion of successful arrivals at the target position was analyzed. In the 200 tests, the start and target positions of the underwater robot were randomly assigned. The success rate of reaching the target position over the 200 tests and the average obtained reward are compared to measure the quality of the navigation models based on the different algorithms: the higher the success rate and the higher the average reward, the better the navigation strategy. The results are shown in fig. 16. After 3000 rounds of training, the underwater robots trained with the four algorithms have all basically learned how to avoid obstacles and reach the target position in environment 1. According to the test results, the MS-DDQN algorithm performs best, with a success rate of 100% and the highest average reward; this shows that the underwater robot based on the MS-DDQN algorithm has stronger obstacle avoidance capability and a better navigation strategy. The navigation trajectory of the MS-DDQN-trained underwater robot in Env-1 is shown in fig. 11. In order to fully evaluate the generalization capability of the navigation models based on the different algorithms, four test environments different from the training environment were additionally designed, with sizes of 500 × 500, 600 × 600, 700 × 700 and 800 × 800, denoted environment 2, environment 3, environment 4 and environment 5 respectively. As in the training environment, the start position and the target position of the underwater robot in each test environment are randomly initialized in the start area 41 and the target area 31, i.e. the rectangular shaded areas 41 and 31. The navigation models trained with MS-DDQN, DDQN, prioritized DQN and prioritized DDQN were each tested 200 times in the four test environments. As shown in fig. 13, the navigation trajectories of the MS-DDQN-based underwater robot in the four unknown complex test environments show that the navigation model trained with MS-DDQN has strong generalization capability and can adapt to new unknown environments without retraining.
The results of the 200 test rounds in the four different test environments are shown in fig. 17: the success rate of the MS-DDQN-trained navigation model is 97% in Env-2, 91% in Env-3, 94% in Env-4 and 96% in Env-5. In contrast, the navigation models trained with the other three algorithms do not reach a 90% success rate in any test environment, and the success rate of DDQN in Env-3 is only 46%. These results show that the navigation model based on MS-DDQN has strong generalization capability, so the underwater robot can navigate in new unknown environments without retraining. The test results also confirm the conclusion drawn from fig. 12(b), namely that the navigation strategy trained with MS-DDQN is more stable than those trained with DDQN, prioritized DQN and prioritized DDQN. The generalization capability of prioritized DQN and prioritized DDQN is better than that of DDQN, because they perform targeted training and learning on collision samples during training and therefore acquire stronger navigation capability. The above experiments show that the generalization capability of the MS-DDQN-trained navigation model is better than that of DDQN, prioritized DQN and prioritized DDQN. The reason is that MS-DDQN spreads the reward values obtained during training across states several steps apart, so the underwater robot learns autonomous navigation more quickly; it also helps the underwater robot to perceive the positions of obstacles and the target in advance and to make evasive actions, or actions tending towards the target point, ahead of time.
This embodiment employs a geometric approach to simulate a 2-dimensional underwater environment containing many types of dense obstacles. The effectiveness of the MS-DDQN algorithm is verified by comparing the navigation capability of the underwater robot trained with different algorithms in the simulated training environment. Meanwhile, navigation tests are carried out in four test environments completely different from the training environment, and the experiments prove that the underwater robot trained with the MS-DDQN algorithm has stronger generalization capability and can adapt to new obstacle environments without retraining.
It should be understood that the above-mentioned embodiments of the present invention are only examples for clearly illustrating the technical solutions of the present invention, and are not intended to limit the specific embodiments of the present invention. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention claims should be included in the protection scope of the present invention claims.

Claims (10)

1. An intelligent agent autonomous navigation method based on deep reinforcement learning is characterized by comprising the following steps:
s1, constructing an intelligent autonomous navigation system, wherein the intelligent autonomous navigation system adopts an MS-DDQN algorithm, namely a DDQN algorithm facing a multi-step mechanism; the MS-DDQN algorithm adopts a modularized neural network, and the modularized neural network comprises a local obstacle avoidance depth neural network module, a global navigation depth neural network module and an instruction selection module; the local obstacle avoidance depth neural network module is used for guiding the intelligent agent to be far away from an obstacle, the global navigation depth neural network module is used for guiding the intelligent agent to move to a target position towards a closer path, and the instruction selection module is used for determining a finally executed action instruction;
s2, building a simulation environment, including building an obstacle environment model and building a simulation intelligent agent;
s3, placing the autonomous navigation system in the simulation environment for training, namely, the intelligent agent adopts the MS-DDQN algorithm to train and learn in the simulation environment; the simulation environment is multiple, and the training times of each simulation environment are multiple;
and S4, loading the trained autonomous navigation system onto the intelligent agent, and enabling the intelligent agent to obtain autonomous navigation capability.
2. The method according to claim 1, wherein the MS-DDQN algorithm comprises a current value network for selecting an action, a target value network for evaluating the action, an error function for updating the weights, a reward function giving the reward value obtained when the agent takes an action in the current state and reaches the next state, and an experience pool for storing the sample data generated at each step.
3. The intelligent agent autonomous navigation method based on deep reinforcement learning according to claim 2, wherein the output function of the target value network is:
Q_target = Σ_{i=0}^{λ-1} γ^i r_{t+i} + γ^λ Q_target(s_{t+λ}, argmax_a Q(s_{t+λ}, a, θ), θ')
where γ is the discount factor, γ^i expresses the degree of influence of the reward value r_{t+i} obtained in the (t+i)-th state on the current state t, γ is a value smaller than 1 and larger than 0, and γ^λ expresses the degree of influence of the reward value r_{t+λ} obtained in the (t+λ)-th state on the current state t; Q is the state-action value estimate, λ is the number of interval steps, s_t is the current state, a_t is the action performed in the current state, r_t denotes the immediate reward value obtained by the agent at time t, r_{t+λ} is the reward value obtained in the λ-th state, and s_{t+λ} is the state after the λ interval steps; θ is the weight parameter of the current value network and θ' is the weight parameter of the target value network; i is the index of the reward value obtained in each state after state t; Q(s_{t+λ}, a, θ) denotes the estimated value of each action output by the current value network given the input information (s_{t+λ}, a); Q_target(s_{t+λ}, argmax Q(s_{t+λ}, a, θ)) means that the action instruction corresponding to the maximum estimated value output by the current value network, together with s_{t+λ}, is first taken as the input of the target value network, which then outputs the estimated value of each action;
the loss function is:
L(θ) = E[(Q_target − Q(s, a, θ))²]
wherein E is the neural network error, s is the state, a is the executed action, θ is the weight parameter of the current value network, Q is the state-action value estimate, and Q(s, a, θ) represents the estimated value of each action output by the current value network;
the data stored in the experience pool are as follows:
(s_t, a_t, Σ_{i=0}^{λ-1} γ^i r_{t+i}, s_{t+λ})
wherein t is a certain moment, s is a state, and i is the index of the reward value obtained in each state after state t; s_t is the current state, a is the executed action, a_t denotes the action executed at time t, and r_t denotes the immediate reward value obtained by the agent at time t; γ is the discount factor, γ^i expresses the degree of influence of the reward value r_{t+i} obtained in the (t+i)-th state on the current state t, and γ is a value less than 1 and greater than 0; λ is the number of interval steps, and s_{t+λ} is the state after the λ interval steps.
4. The intelligent agent autonomous navigation method based on deep reinforcement learning of claim 3, wherein the training method of the MS-DDQN algorithm is as follows:
randomly initializing the weight θ of the current value network Q(s_t, a; θ) and the weight θ' of the target value network Q_target(s_t, a; θ'), where Q(s_t, a; θ) represents the estimated value of each action output by the current value network;
initializing the experience pool D and setting the hyperparameter lambda,
For episode=1,M do
resetting the simulation environment, obtaining the initial observation state s_t, setting T ← ∞, and initializing four empty arrays S_t, A, R, S_{t+1}; wherein the array S_t is declared for storing the state information of the current state; the array A is declared for storing the action executed in the current state; the array R is declared for storing the reward value obtained in the current state; the array S_{t+1} is declared for storing the next-state information; T is mainly used, when the training of the current round ends, for judging whether the data acquired in the current round has been stored into the experience pool;
For t=1,2…do
If t<T then
selecting action a_t = argmax_a Q(s_t, a; θ) according to the current policy, performing action a_t, returning the reward value r_t and the new state s_{t+1}; storing s_t in S_t, r_t in R, a_t in A, and s_{t+1} in S_{t+1}; where t indexes the environment state data obtained by the agent in the current round, a_t represents the action executed at time t, and Q(s, a, θ) represents the estimated value of each action output by the current value network;
If s_{t+1} is the terminal state then
T←t+1
τ←t-λ+1
If τ ≥ 0 then
If τ + λ < T then
r_τ ← Σ_{i=0}^{λ-1} γ^i r_{τ+i}
else
r_τ ← Σ_{i=0}^{T-τ-1} γ^i r_{τ+i}
Storing (s_τ, a_τ, r_τ, s_{τ+λ}) in the experience pool D,
randomly extracting mini-batch sample data from D,
setting:
y_i = r_i + γ^λ Q_target(s_{i+λ}, argmax_a Q(s_{i+λ}, a, θ), θ')
using the loss function L(θ) = E[(y_i − Q(s, a, θ))²] to update the current value network weight θ by gradient descent
Until τ = T − 1;
τ is mainly used for judging whether the number of actions executed by the underwater robot exceeds the set step number λ. If it does, the value of τ is greater than or equal to 0, indicating that the agent has obtained environment state data for at least λ steps; at this moment the influence r_τ of the reward values obtained over the future λ steps on the τ state can be calculated, the state information s_τ, the action a_τ and the state information s_{τ+λ} are extracted from the arrays S_t, A and S_{t+1}, and these three pieces of information together with r_τ form a training sample tuple that is stored in the experience pool D. When the agent starts a round of training it first sets T ← ∞, so that t < T and the second for loop is entered; when a collision occurs or the target position is reached, T ← t + 1. If τ = T − 1, it indicates that all the environment state information obtained by the agent in this round has been stored as training sample data according to the multi-step method. When τ + λ < T, the agent has not collided and the number of environment state records to be stored is greater than or equal to the set value of λ, so r_τ is calculated through
r_τ = Σ_{i=0}^{λ-1} γ^i r_{τ+i},
i.e. the influence on the τ state of the reward values obtained in the future λ states. Otherwise, the agent has collided and the number of environment state records to be stored is smaller than the set value of λ, so r_τ is calculated through
r_τ = Σ_{i=0}^{T-τ-1} γ^i r_{τ+i},
i.e. the influence r_τ on the states τ (τ+1, τ+2, …, T−1) of the remaining future λ−1 (λ−2, λ−3, …, 1) reward values. y_i denotes the i-th sample in the mini-batch extracted from the experience pool: the estimated value of each action estimated by the target value network is added to the actual r_i obtained in that state, giving a comprehensive reward for the current state that is used for the gradient descent operation together with the current value network.
5. The method for intelligent agent autonomous navigation based on deep reinforcement learning of claim 1, wherein the training step in S3 specifically comprises:
s31, acquiring current state information of the intelligent agent by the simulation environment, wherein the current state information comprises distance information between the intelligent agent and an obstacle in the environment and position relation information between the intelligent agent and a target point; the position relation information of the intelligent body and the target point comprises a relative coordinate relation between the current coordinate of the intelligent body and the coordinate of the target position, an Euclidean distance from the current position of the intelligent body to the target position and an included angle between a vector of the advancing direction of the intelligent body and a vector of the direction from the current coordinate position of the intelligent body to the target position, wherein the relative coordinate relation is obtained by subtracting the current coordinate of the underwater intelligent body from the coordinate of the target position;
s32, inputting the acquired current state information into the modular deep neural network, specifically inputting distance information between an intelligent body and an obstacle in the environment and position relation information between the intelligent body and a target point into the local obstacle avoidance deep neural network module, and inputting position relation information between the intelligent body and the target point into the global navigation deep neural network module;
s33, the local obstacle avoidance depth neural network module and the global navigation depth neural network module output respective control instructions according to the input current state information;
s34, the instruction selection module determines to use the action instruction output by the local obstacle avoidance depth neural network module or the global navigation depth neural network module by judging the distance value between the intelligent body and the nearest obstacle;
s35, the intelligent agent executes the instruction selected by the instruction selection module, receives the reward value output by the reward function and enters the next state;
s36, storing the sample data of this interaction into the experience pool, wherein the sample data comprises the current state s_t, the executed action a_t, the obtained reward value r_t and the next state s_{t+1}; the sample data is stored in the experience pool in the form
(s_t, a_t, r_t, s_{t+1}).
6. The intelligent autonomous navigation method based on deep reinforcement learning of claim 5, wherein the local obstacle avoidance deep neural network module and the global navigation deep neural network module comprise the following specific steps:
s331, the current value network receives the input current state information, processes the current state information according to the input current state information and outputs a group of actions, wherein the processing process comprises the step of extracting training samples from an experience pool according to the current state information; and transmitting the output set of actions to the target value network;
s332, the target value network processes the group of actions according to the input actions, calculates the value of each action, and transmits the value of the group of actions back to the current value network;
s333, the current value network selects the next action to be executed according to the maximum value in the value of the group of actions returned by the target value network, and outputs a control instruction to the simulation environment according to the next action;
s334, the current value network calculates to obtain next step state information according to the next step action in S333, processes the next step state information and outputs a group of actions, wherein the processing process comprises the step of extracting training samples from an experience pool by the next step state information; and transmitting the output set of actions to the target value network; the target value network repeats step S332, that is, the target value network processes according to the transmitted group of actions, calculates the value of each action, and then transmits the value of the group of actions back to the current value network;
s335, the current value network compares the value of a group of actions returned by the target value network in the S334 with the value of a group of actions returned by the target value network in the S332, and calculates an error according to an error function;
s336, updating the weight of the current value network according to the error obtained in S335, and updating the weight of the target value network according to the weight of the current value network after every N steps;
s337, at the same time as step S334 is carried out, the current value network outputs a control instruction to the simulation environment according to the selected next action.
7. The intelligent agent autonomous navigation method based on deep reinforcement learning of claim 5, characterized in that the reward function is a continuous combined reward function, and the continuous combined reward function comprises terminal reward and non-terminal reward; the terminal reward specifically comprises:
a positive reward is obtained when the agent reaches the target point, expressed as r_arr = 100, if d_{r-t} ≤ d_win; where d_{r-t} is the Euclidean distance from the agent to the target point and d_win is the threshold for the agent to reach the target point; when d_{r-t} is smaller than the set d_win, the target point is considered reached, otherwise the target point has not been reached; r_arr is the positive reward value given when the agent reaches the target position;
a negative reward is obtained when the agent collides with an obstacle, expressed as r_col = −100, if d_{r-o} ≤ d_col; where d_{r-o} is the Euclidean distance from the agent to the nearest obstacle and d_col is the threshold at which the agent is considered to collide with an obstacle; when d_{r-o} is less than or equal to d_col, a collision has occurred, otherwise no collision has occurred; r_col is the punitive negative reward value given when the agent collides;
the non-terminal award specifically includes:
a positive reward is obtained when the agent progresses towards the target point, expressed as r_t_goal = c_r[d_{r-t}(t) − d_{r-t}(t−1)]; where c_r ∈ (0, 1] is a coefficient, set to 1;
the danger reward r_dang ∈ [0, 1] is obtained when the minimum distance between the agent and the obstacles keeps decreasing, and it decreases accordingly; its expression is a function of d_min and the coefficient β (the formula appears only as an image in the source text);
wherein d_min is the minimum distance between the agent and the obstacles, and β is a coefficient such that the value space of r_dang is (0, 1); d_{r-t}(t) represents the Euclidean distance between the current position coordinate of the agent and the target position coordinate at time t;
when the included angle between the advancing-direction vector of the agent and the direction vector from the agent's current coordinate to the target position is less than ±18 degrees, a reward of 1 is obtained; when the included angle is greater than ±18 degrees and less than ±72 degrees, a reward of 0.3 is obtained; in all other cases the reward is 0. The expression is:
reward = 1, if |a_ori| < 18°; reward = 0.3, if 18° ≤ |a_ori| < 72°; reward = 0, otherwise;
wherein a_ori is the included angle between the advancing-direction vector of the agent and the direction vector from the agent's current coordinate to the target position.
8. The intelligent agent autonomous navigation method based on deep reinforcement learning according to any one of claims 1 to 7, characterized in that the local obstacle avoidance depth neural network module and the global navigation depth neural network module use activation functions of ReLU6 and ReLU; the activation function ReLU6 is applied to the front end of the neural network, and the activation function ReLU is applied to the back end of the neural network.
9. The intelligent autonomous navigation method based on deep reinforcement learning according to any one of claims 1 to 7, wherein the local obstacle avoidance deep neural network module and the global navigation deep neural network module both adopt a fully connected structure, the number of hidden layers of the local obstacle avoidance neural network module is more than three, and the number of hidden layers of the global navigation neural network module is one.
10. The intelligent agent autonomous navigation method based on deep reinforcement learning according to any one of claims 1 to 7, characterized in that the instruction selection module is provided with a threshold value and selects the control instruction according to this threshold value; when the distance value between the agent and the nearest obstacle is smaller than the threshold value of 40, the control instruction output by the local obstacle avoidance deep neural network module is selected; when the distance value is greater than or equal to 40, the control instruction output by the global navigation neural network module is selected.
CN202011023274.4A 2020-09-25 2020-09-25 Intelligent autonomous navigation method based on deep reinforcement learning Active CN112179367B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011023274.4A CN112179367B (en) 2020-09-25 2020-09-25 Intelligent autonomous navigation method based on deep reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011023274.4A CN112179367B (en) 2020-09-25 2020-09-25 Intelligent autonomous navigation method based on deep reinforcement learning

Publications (2)

Publication Number Publication Date
CN112179367A true CN112179367A (en) 2021-01-05
CN112179367B CN112179367B (en) 2023-07-04

Family

ID=73943509

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011023274.4A Active CN112179367B (en) 2020-09-25 2020-09-25 Intelligent autonomous navigation method based on deep reinforcement learning

Country Status (1)

Country Link
CN (1) CN112179367B (en)

Cited By (25)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112904848A (en) * 2021-01-18 2021-06-04 长沙理工大学 Mobile robot path planning method based on deep reinforcement learning
CN112925307A (en) * 2021-01-20 2021-06-08 中国科学院重庆绿色智能技术研究院 Distributed multi-robot path planning method for intelligent warehousing robot system
CN112926729A (en) * 2021-05-06 2021-06-08 中国科学院自动化研究所 Man-machine confrontation intelligent agent strategy making method
CN112947421A (en) * 2021-01-28 2021-06-11 西北工业大学 AUV autonomous obstacle avoidance method based on reinforcement learning
CN113033118A (en) * 2021-03-10 2021-06-25 山东大学 Autonomous floating control method of underwater vehicle based on demonstration data reinforcement learning technology
CN113064422A (en) * 2021-03-09 2021-07-02 河海大学 Autonomous underwater vehicle path planning method based on double neural network reinforcement learning
CN113146624A (en) * 2021-03-25 2021-07-23 重庆大学 Multi-agent control method based on maximum angle aggregation strategy
CN113156980A (en) * 2021-05-28 2021-07-23 山东大学 Tower crane path planning method and system based on deep reinforcement learning
CN113218399A (en) * 2021-05-12 2021-08-06 天津大学 Maze navigation method and device based on multi-agent layered reinforcement learning
CN113269315A (en) * 2021-06-29 2021-08-17 安徽寒武纪信息科技有限公司 Apparatus, method and readable storage medium for performing task using deep reinforcement learning
CN113312874A (en) * 2021-06-04 2021-08-27 福州大学 Overall wiring method based on improved deep reinforcement learning
CN113359717A (en) * 2021-05-26 2021-09-07 浙江工业大学 Mobile robot navigation obstacle avoidance method based on deep reinforcement learning
CN113485367A (en) * 2021-08-06 2021-10-08 浙江工业大学 Path planning method of multifunctional stage mobile robot
CN113691334A (en) * 2021-08-23 2021-11-23 广东工业大学 Cognitive radio dynamic power distribution method based on secondary user group cooperation
CN113759901A (en) * 2021-08-12 2021-12-07 杭州电子科技大学 Mobile robot autonomous obstacle avoidance method based on deep reinforcement learning
CN113805597A (en) * 2021-09-28 2021-12-17 福州大学 Obstacle self-protection artificial potential field method local path planning method based on particle swarm optimization
CN114355915A (en) * 2021-12-27 2022-04-15 杭州电子科技大学 AGV path planning based on deep reinforcement learning
CN114354082A (en) * 2022-03-18 2022-04-15 山东科技大学 Intelligent tracking system and method for submarine pipeline based on imitated sturgeon whiskers
CN114415663A (en) * 2021-12-15 2022-04-29 北京工业大学 Path planning method and system based on deep reinforcement learning
CN114603564A (en) * 2022-04-28 2022-06-10 中国电力科学研究院有限公司 Mechanical arm navigation obstacle avoidance method and system, computer equipment and storage medium
CN114964268A (en) * 2022-07-29 2022-08-30 白杨时代(北京)科技有限公司 Unmanned aerial vehicle navigation method and device
CN114995468A (en) * 2022-06-06 2022-09-02 南通大学 Intelligent control method of underwater robot based on Bayesian depth reinforcement learning
CN116443217A (en) * 2023-06-16 2023-07-18 中交一航局第一工程有限公司 Piling ship parking control method and device, piling ship and storage medium
TWI815613B (en) * 2022-08-16 2023-09-11 和碩聯合科技股份有限公司 Navigation method for robot and robot thereof
CN116755329A (en) * 2023-05-12 2023-09-15 江南大学 Multi-agent danger avoiding and escaping method and device based on deep reinforcement learning

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107102644A (en) * 2017-06-22 2017-08-29 华南师范大学 The underwater robot method for controlling trajectory and control system learnt based on deeply
CN108803321A (en) * 2018-05-30 2018-11-13 清华大学 Autonomous Underwater Vehicle Trajectory Tracking Control method based on deeply study
CN109540151A (en) * 2018-03-25 2019-03-29 哈尔滨工程大学 A kind of AUV three-dimensional path planning method based on intensified learning
CN110333739A (en) * 2019-08-21 2019-10-15 哈尔滨工程大学 A kind of AUV conduct programming and method of controlling operation based on intensified learning
CN110515303A (en) * 2019-09-17 2019-11-29 余姚市浙江大学机器人研究中心 A kind of adaptive dynamic path planning method based on DDQN

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107102644A (en) * 2017-06-22 2017-08-29 华南师范大学 The underwater robot method for controlling trajectory and control system learnt based on deeply
CN109540151A (en) * 2018-03-25 2019-03-29 哈尔滨工程大学 A kind of AUV three-dimensional path planning method based on intensified learning
CN108803321A (en) * 2018-05-30 2018-11-13 清华大学 Autonomous Underwater Vehicle Trajectory Tracking Control method based on deeply study
CN110333739A (en) * 2019-08-21 2019-10-15 哈尔滨工程大学 A kind of AUV conduct programming and method of controlling operation based on intensified learning
CN110515303A (en) * 2019-09-17 2019-11-29 余姚市浙江大学机器人研究中心 A kind of adaptive dynamic path planning method based on DDQN

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
LEI XIAOYUN等: "Dynamic Path Planning of Unknown Environment Based on Deep Reinforcement Learning", 《JOURNAL OF ROBOTICS》 *
YINLONG YUAN等: "A novel multi-step Q-learning method to improve data efficiency for deep reinforcement learning", 《KNOWLEDGE-BASED SYSTEMS》 *
李志航: "基于深度递归强化学习的无人自主驾驶策略研究", 《工业控制计算机》 *

Cited By (37)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112904848A (en) * 2021-01-18 2021-06-04 长沙理工大学 Mobile robot path planning method based on deep reinforcement learning
CN112904848B (en) * 2021-01-18 2022-08-12 长沙理工大学 Mobile robot path planning method based on deep reinforcement learning
CN112925307A (en) * 2021-01-20 2021-06-08 中国科学院重庆绿色智能技术研究院 Distributed multi-robot path planning method for intelligent warehousing robot system
CN112947421A (en) * 2021-01-28 2021-06-11 西北工业大学 AUV autonomous obstacle avoidance method based on reinforcement learning
CN113064422A (en) * 2021-03-09 2021-07-02 河海大学 Autonomous underwater vehicle path planning method based on double neural network reinforcement learning
CN113064422B (en) * 2021-03-09 2022-06-28 河海大学 Autonomous underwater vehicle path planning method based on double neural network reinforcement learning
CN113033118A (en) * 2021-03-10 2021-06-25 山东大学 Autonomous floating control method of underwater vehicle based on demonstration data reinforcement learning technology
CN113146624B (en) * 2021-03-25 2022-04-29 重庆大学 Multi-agent control method based on maximum angle aggregation strategy
CN113146624A (en) * 2021-03-25 2021-07-23 重庆大学 Multi-agent control method based on maximum angle aggregation strategy
CN112926729A (en) * 2021-05-06 2021-06-08 中国科学院自动化研究所 Man-machine confrontation intelligent agent strategy making method
CN112926729B (en) * 2021-05-06 2021-08-03 中国科学院自动化研究所 Man-machine confrontation intelligent agent strategy making method
CN113218399B (en) * 2021-05-12 2022-10-04 天津大学 Maze navigation method and device based on multi-agent layered reinforcement learning
CN113218399A (en) * 2021-05-12 2021-08-06 天津大学 Maze navigation method and device based on multi-agent layered reinforcement learning
CN113359717A (en) * 2021-05-26 2021-09-07 浙江工业大学 Mobile robot navigation obstacle avoidance method based on deep reinforcement learning
CN113156980A (en) * 2021-05-28 2021-07-23 山东大学 Tower crane path planning method and system based on deep reinforcement learning
CN113312874A (en) * 2021-06-04 2021-08-27 福州大学 Overall wiring method based on improved deep reinforcement learning
CN113269315A (en) * 2021-06-29 2021-08-17 安徽寒武纪信息科技有限公司 Apparatus, method and readable storage medium for performing task using deep reinforcement learning
CN113269315B (en) * 2021-06-29 2024-04-02 安徽寒武纪信息科技有限公司 Apparatus, method and readable storage medium for performing tasks using deep reinforcement learning
CN113485367A (en) * 2021-08-06 2021-10-08 浙江工业大学 Path planning method of multifunctional stage mobile robot
CN113485367B (en) * 2021-08-06 2023-11-21 浙江工业大学 Path planning method for stage multifunctional mobile robot
CN113759901A (en) * 2021-08-12 2021-12-07 杭州电子科技大学 Mobile robot autonomous obstacle avoidance method based on deep reinforcement learning
CN113691334A (en) * 2021-08-23 2021-11-23 广东工业大学 Cognitive radio dynamic power distribution method based on secondary user group cooperation
CN113805597A (en) * 2021-09-28 2021-12-17 福州大学 Obstacle self-protection artificial potential field local path planning method based on particle swarm optimization
CN114415663A (en) * 2021-12-15 2022-04-29 北京工业大学 Path planning method and system based on deep reinforcement learning
CN114355915B (en) * 2021-12-27 2024-04-02 杭州电子科技大学 AGV path planning based on deep reinforcement learning
CN114355915A (en) * 2021-12-27 2022-04-15 杭州电子科技大学 AGV path planning based on deep reinforcement learning
CN114354082A (en) * 2022-03-18 2022-04-15 山东科技大学 Intelligent tracking system and method for submarine pipeline based on imitated sturgeon whiskers
CN114354082B (en) * 2022-03-18 2022-05-31 山东科技大学 Intelligent tracking system and method for submarine pipeline based on imitated sturgeon whisker
CN114603564A (en) * 2022-04-28 2022-06-10 中国电力科学研究院有限公司 Mechanical arm navigation obstacle avoidance method and system, computer equipment and storage medium
CN114603564B (en) * 2022-04-28 2024-04-12 中国电力科学研究院有限公司 Mechanical arm navigation obstacle avoidance method, system, computer equipment and storage medium
CN114995468A (en) * 2022-06-06 2022-09-02 南通大学 Intelligent control method of underwater robot based on Bayesian deep reinforcement learning
CN114964268A (en) * 2022-07-29 2022-08-30 白杨时代(北京)科技有限公司 Unmanned aerial vehicle navigation method and device
TWI815613B (en) * 2022-08-16 2023-09-11 和碩聯合科技股份有限公司 Navigation method for robot and robot thereof
CN116755329A (en) * 2023-05-12 2023-09-15 江南大学 Multi-agent danger avoiding and escaping method and device based on deep reinforcement learning
CN116755329B (en) * 2023-05-12 2024-05-24 江南大学 Multi-agent danger avoiding and escaping method and device based on deep reinforcement learning
CN116443217A (en) * 2023-06-16 2023-07-18 中交一航局第一工程有限公司 Piling ship parking control method and device, piling ship and storage medium
CN116443217B (en) * 2023-06-16 2023-08-22 中交一航局第一工程有限公司 Piling ship parking control method and device, piling ship and storage medium

Also Published As

Publication number Publication date
CN112179367B (en) 2023-07-04

Similar Documents

Publication Publication Date Title
CN112179367B (en) Intelligent autonomous navigation method based on deep reinforcement learning
Lauri et al. Planning for robotic exploration based on forward simulation
Liu et al. Robot navigation in crowded environments using deep reinforcement learning
Zhang et al. Autonomous navigation of UAV in multi-obstacle environments based on a deep reinforcement learning approach
Jesus et al. Deep deterministic policy gradient for navigation of mobile robots in simulated environments
Cao et al. Target search control of AUV in underwater environment with deep reinforcement learning
CN109540151A (en) AUV three-dimensional path planning method based on reinforcement learning
CN111880549B (en) Deep reinforcement learning reward function optimization method for unmanned ship path planning
CN113848974B (en) Aircraft trajectory planning method and system based on deep reinforcement learning
Wang et al. Cooperative collision avoidance for unmanned surface vehicles based on improved genetic algorithm
CN117590867B (en) Underwater autonomous vehicle connection control method and system based on deep reinforcement learning
Contreras et al. Using deep learning for exploration and recognition of objects based on images
Hien et al. Goal-oriented navigation with avoiding obstacle based on deep reinforcement learning in continuous action space
Mendonça et al. Reinforcement learning with optimized reward function for stealth applications
Cashmore et al. Planning inspection tasks for AUVs
Meyer On course towards model-free guidance: A self-learning approach to dynamic collision avoidance for autonomous surface vehicles
Keong et al. Reinforcement learning for autonomous aircraft avoidance
De Villiers et al. Learning fine-grained control for mapless navigation
Conforth et al. Reinforcement learning for neural networks using swarm intelligence
Mete et al. Coordinated Multi-Robot Exploration using Reinforcement Learning
Senthilkumar et al. Hybrid genetic-fuzzy approach to autonomous mobile robot
Aronsen Path planning and obstacle avoidance for marine vessels using the deep deterministic policy gradient method
Qin et al. An environment information-driven online Bi-level path planning algorithm for underwater search and rescue AUV
Kim et al. Transformable Gaussian Reward Function for Socially-Aware Navigation with Deep Reinforcement Learning
Gridnev et al. The Framework for robotic navigation algorithms evaluation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant