CN112179367A - Intelligent autonomous navigation method based on deep reinforcement learning - Google Patents


Info

Publication number
CN112179367A
Authority
CN
China
Prior art keywords
value
state
current
action
neural network
Legal status
Granted
Application number
CN202011023274.4A
Other languages
Chinese (zh)
Other versions
CN112179367B (en)
Inventor
彭小红
陈亮
陈荣发
张军
梁子祥
史文杰
黄文�
陈剑勇
黄曾祺
余应淮
Current Assignee
Guangdong Ocean University
Original Assignee
Guangdong Ocean University
Application filed by Guangdong Ocean University
Priority to CN202011023274.4A
Publication of CN112179367A
Application granted
Publication of CN112179367B

Classifications

    • G: PHYSICS
    • G01: MEASURING; TESTING
    • G01C: MEASURING DISTANCES, LEVELS OR BEARINGS; SURVEYING; NAVIGATION; GYROSCOPIC INSTRUMENTS; PHOTOGRAMMETRY OR VIDEOGRAMMETRY
    • G01C21/00: Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00
    • G01C21/26: Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00 specially adapted for navigation in a road network
    • G01C21/34: Route searching; Route guidance
    • G01C21/3407: Route searching; Route guidance specially adapted for specific applications
    • G01C21/343: Calculating itineraries, i.e. routes leading from a starting point to a series of categorical destinations using a global route restraint, round trips, touristic trips
    • B: PERFORMING OPERATIONS; TRANSPORTING
    • B63: SHIPS OR OTHER WATERBORNE VESSELS; RELATED EQUIPMENT
    • B63C: LAUNCHING, HAULING-OUT, OR DRY-DOCKING OF VESSELS; LIFE-SAVING IN WATER; EQUIPMENT FOR DWELLING OR WORKING UNDER WATER; MEANS FOR SALVAGING OR SEARCHING FOR UNDERWATER OBJECTS
    • B63C11/00: Equipment for dwelling or working underwater; Means for searching for underwater objects
    • B63C11/52: Tools specially adapted for working underwater, not otherwise provided for
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T: CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00: Road transport of goods or passengers
    • Y02T10/10: Internal combustion engine [ICE] based vehicles
    • Y02T10/40: Engine management systems


Abstract

The invention relates to the technical field of intelligent autonomous navigation, and in particular to an intelligent autonomous navigation method based on deep reinforcement learning. The method solves the problem that existing algorithms only calculate the reward values of two adjacent states, so that the agent cannot perceive the development of several future states in advance and its obstacle-avoidance and navigation capabilities are insufficient. The intelligent-agent autonomous navigation method based on deep reinforcement learning comprises the following steps: constructing an intelligent autonomous navigation system that adopts the MS-DDQN algorithm, i.e. a multi-step-mechanism-oriented DDQN algorithm; building a simulation environment; training the autonomous navigation system in the simulation environment; and loading the trained autonomous navigation system onto the agent, which thereby acquires autonomous navigation capability. Through this technical scheme, the agent can perceive the future distribution of obstacles and take evasive action in advance.

Description

Intelligent autonomous navigation method based on deep reinforcement learning
Technical Field
The invention relates to the technical field of intelligent agent autonomous navigation, in particular to an intelligent agent autonomous navigation method based on deep reinforcement learning.
Background
Because of humanity's excessive exploitation of land resources, the reserves of mineral resources, biological resources and the like are decreasing rapidly. The area of the ocean is more than twice that of the land, and the mineral, energy and fishery resources it holds are far more abundant than those on land. Given the unknown and complex marine environment, intelligent agents can replace humans in exploring and developing marine resources, so in recent years many countries have attached great importance to research on intelligent agents. Autonomous navigation is one of the key technologies in the study of intelligent mobile agents. Autonomous navigation means that an agent, knowing its own pose information, finds an optimal or sub-optimal path from a starting point to a target point in an environment containing complex obstacles, subject to one or more given constraints such as shortest path length, minimum energy consumption or minimum travel time. The autonomous navigation problem of an agent is equivalent to its autonomous path-planning problem: both aim to control the mobile agent so that it moves away from obstacles and towards a target position. The goal of the path-planning task is to find, by means of a specific algorithm, one or more paths in a known or unknown environment that start from the starting point, avoid the various obstacles and safely reach the target position. In essence this is a constrained optimization problem, and the optimization objective differs somewhat for different requirements. Navigation algorithms can be roughly divided into two categories according to the degree of intelligence of the agent: non-intelligent navigation algorithms and intelligent navigation algorithms. By designing a modular deep neural network architecture, the learning task handled by each module's neural network becomes more definite, and a double-neural-network structure improves the stability of the algorithm; in addition, the output method of the target value network, the loss function, the reward function and the data stored in the experience pool of the MS-DDQN algorithm are improved, so that the rewards obtained by the agent during training can be propagated to the state-value estimates of states a multi-step interval away. In this way the underwater agent is guided to learn quickly and can perceive changes of future states in advance, which gives it the ability to perceive the distribution of future obstacles and helps it take evasive action in advance.
Deep Q Network (DQN) is a deep reinforcement learning algorithm. The key techniques of the DQN algorithm are a double-neural-network structure and experience replay. One innovation of DQN is to use a neural network to approximate the optimal state-action value function Q, instead of the Q-learning approach of maintaining a table that records the mapping between states and actions. This overcomes the limitation that Q-learning cannot be applied to high-dimensional state spaces, while exploiting the capability of deep neural networks to process high-dimensional information. The second innovation is to establish two neural networks, a current value network and a target value network; the double-network structure improves the stability of the algorithm. The third innovation is the experience-replay mechanism: sample data from the interaction between the agent and the environment are stored in an experience pool, and each sample is labelled by its reward value, which overcomes the drawback of deep learning methods that require large numbers of manually labelled samples. The training method of the DQN algorithm is shown in FIG. 1. The network structure of DQN contains two deep neural networks, the current value network and the target value network. The current value network has two roles. The first is to process the input state information and evaluate the value of each output action during training, and then, by an ε-greedy method, either execute a random action or execute the action with the maximum value output by the current value network. The second is to process the training samples drawn from the experience pool during network training, output the value of each action, compare these values with the action values output by the target network, and compute an error that guides the update of the current network's weights. The target value network is mainly used, during training, to process the training samples drawn from the experience pool, output the value of each action, and assist the iterative update of the current network's weights. The weights of the target value network are not updated during training; every N steps they are copied from the current value network. An experience-replay mechanism is used during DQN training: through it, the agent can learn not only from the current state's experience data but also repeatedly from earlier experience data. Each time the agent completes an interaction with the environment, the information is stored in the experience pool; each sample contains the current state s_t, the executed action a_t, the obtained reward value r_t and the next state s_{t+1}, combined into one unit [s_t, a_t, r_t, s_{t+1}] and stored in the experience pool D. Because the stored experience data are strongly correlated, the DQN algorithm draws small mini-batches of training samples from the experience pool by random sampling, which ensures independence among training samples during learning and improves the convergence speed of the algorithm.
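A minimal Python sketch of such an experience pool, assuming a fixed capacity and uniform random mini-batch sampling (the capacity and batch size below are illustrative, not taken from the patent):

    import random
    from collections import deque

    class ExperiencePool:
        """Stores [s_t, a_t, r_t, s_t1] samples and serves random mini-batches."""
        def __init__(self, capacity=10000):
            self.buffer = deque(maxlen=capacity)   # oldest samples are discarded when full

        def store(self, s_t, a_t, r_t, s_t1):
            self.buffer.append((s_t, a_t, r_t, s_t1))

        def sample(self, batch_size=32):
            # uniform random sampling breaks the correlation between consecutive samples
            return random.sample(self.buffer, min(batch_size, len(self.buffer)))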
Because the DQN algorithm uses a neural network in place of a Q-value table, the current value network represents the learned policy π. Assuming the weight parameter of the current value network is θ, the output of the current value network is
y_DQN = Q(s, a, θ)   (2-26)
The output of the target network is:
y_target = r + γ·max_{a'} Q_target(s', a', θ')   (2-27)
the loss function of the DQN network is then:
L(θ) = E[(y_target − Q(s, a, θ))²]   (2-28)
updating the weights θ in the current value network by calculating the gradient of the loss function:
∇_θ L(θ) = E[(y_target − Q(s, a, θ))·∇_θ Q(s, a, θ)]   (2-29)
the parameters in the current network can be updated by adopting a gradient descent method, so that an optimal strategy is obtained. Due to Q used in DQN algorithmtarget(s ', a ', theta ') to approximately represent the optimization target, and the selection actions are actions corresponding to the maximum Q value, and the selection and evaluation of the actions are based on the target value network, which results in the overfitting problem. To solve this problem, a Deep Double Q Network algorithm (DDQN) is proposed. The training process of the DDQN algorithm is almost the same as the DQN algorithm, the only difference being that the DDQN separates the target value network selection action from the evaluation action. Just by using the DQN algorithm, the network structure has two sets of different weight parameters, namely a weight parameter theta in the current network and a weight parameter theta' in the target value network. Wherein the action is selected by using the parameters in the current network, and the selected action is evaluated by using the parameters in the target value network, so the output of the target value network of the DDQN is:
y_target^DDQN = r + γ·Q_target(s', argmax_{a'} Q(s', a', θ), θ')   (2-30)
the output of the DDQN current value network is:
y_DDQN = Q(s, a, θ)   (2-31)
where γ is the discount factor; γ^i represents the degree to which the reward value r_{t+i} obtained in the (t+i)-th state influences the current state t, and γ is a value greater than 0 and less than 1; γ^λ represents the degree to which the reward value r_{t+λ} obtained in the (t+λ)-th state influences the current state t; Q is the state-action value estimate; λ is the number of interval steps; s_t is the current state; a_t is the action executed in the current state; r_t denotes the immediate reward value obtained by the agent at time t; r_{t+λ} is the reward value obtained in the (t+λ)-th state; s_{t+λ} is the state λ steps later; θ is the weight parameter of the current value network and θ' is the weight parameter of the target value network; i is the index of the reward value obtained in each state after state t; Q(s_{t+λ}, a, θ) denotes the estimate of each action output by the current value network for the input (s_{t+λ}, a); Q_target(s_{t+λ}, argmax_a Q(s_{t+λ}, a, θ), θ') means that the action corresponding to the maximum of the estimates output by the current value network is selected first and, together with s_{t+λ}, is used as the input of the target value network, which then outputs an estimate for each action. The loss function of the DDQN network is:
L(θ) = E[(y_target^DDQN − Q(s, a, θ))²]   (2-32)
where E denotes the expectation of the neural-network error, s is the state, a is the executed action, θ is the weight parameter of the current value network, Q is the state-action value estimate, and Q(s, a, θ) denotes the estimate of each action output by the current value network. The weight parameters of the current value network are updated as follows:
∇_θ L(θ) = E[(y_target^DDQN − Q(s, a, θ))·∇_θ Q(s, a, θ)]   (2-33)
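A small numerical sketch contrasting the DQN target (2-27) and the DDQN target (2-30), with the two networks stubbed out as plain arrays whose values are made up for illustration:

    import numpy as np

    gamma = 0.99
    r = 1.0                                    # reward of the transition (illustrative)
    q_current = np.array([0.2, 0.8, 0.5])      # Q(s', a, θ) for each action (illustrative)
    q_target = np.array([0.3, 0.6, 0.7])       # Q_target(s', a, θ') for each action (illustrative)

    # DQN target: the target network both selects and evaluates the action
    y_dqn = r + gamma * q_target.max()

    # DDQN target: the current network selects, the target network evaluates
    a_star = int(q_current.argmax())
    y_ddqn = r + gamma * q_target[a_star]

    print(y_dqn, y_ddqn)   # DDQN usually gives the smaller, less over-estimated value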
since deep reinforcement learning is to process high-dimensional original input information by using a neural network and to approximate a state-action cost function by using the neural network, deep reinforcement learning is more suitable for problems in a larger state space than the conventional reinforcement learning method. Therefore, the MS-DDQN is provided by correspondingly improving the DDQN algorithm, so that the underwater intelligent body is improved to have higher obstacle avoidance and navigation capabilities. As is evident from the above description of the Q (λ) algorithm, Q (λ) enables the agent to obtain the ability to reward the condition. In the navigation of the underwater robot, the obstacle avoidance function is an important precondition for completing tasks, and the influence on the state-action Q value of a longer remote step state is very important, namely, the intelligent body is endowed with the function of perceiving the future improvement of the obstacle avoidance capability of the underwater robot. If the underwater robot is in a certain state, the reward value obtained in the future can be perceived in advance, namely the development condition of the future state can be perceived in advance, and the underwater robot is quite helpful for avoiding obstacles and reaching a target point.
Disclosure of Invention
The invention aims to overcome at least one defect of the prior art by providing an intelligent-agent autonomous navigation method based on deep reinforcement learning, which solves the problem that existing algorithms only calculate the reward values of two adjacent states, so that the agent cannot perceive the development of several future states in advance and its obstacle-avoidance and navigation capabilities are insufficient. With the method, the agent can perceive the development of several future states and the distribution of obstacles, thereby achieving the technical effect of taking evasive action in advance.
The technical scheme adopted by the invention is an intelligent-agent autonomous navigation method based on deep reinforcement learning, which comprises the following steps. An intelligent autonomous navigation system is constructed; the system adopts the MS-DDQN algorithm, i.e. a multi-step-mechanism-oriented DDQN algorithm obtained by improving the DDQN algorithm. The MS-DDQN algorithm adopts a modular neural network comprising a local obstacle-avoidance deep neural network module, a global navigation deep neural network module and an instruction selection module; the local obstacle-avoidance deep neural network module guides the agent away from obstacles, the global navigation deep neural network module guides the agent towards the target position along a shorter path, and the instruction selection module decides which network's output action instruction is executed. A simulation environment is built, including an obstacle environment model and a simulated agent. The autonomous navigation system is then trained in the simulation environment, i.e. the agent trains and learns in the simulation environment with the MS-DDQN algorithm; there are multiple simulation environments, and each simulation environment is trained multiple times. Finally, the trained autonomous navigation system is loaded onto the agent, which thereby acquires autonomous navigation capability.
Further, the MS-DDQN algorithm comprises a current value network for selecting actions, a target value network for evaluating actions, an error function for updating the weights, a reward function giving the reward value obtained when the agent takes an action in the current state and reaches the next state, and an experience pool for storing the sample data generated at each step. The current value network, target value network, error function, reward function and experience pool cooperate so that the MS-DDQN algorithm gives the agent the ability to know the future distribution of obstacles and to take evasive action in advance.
Further, the output of the target value network is:
y_target^MS-DDQN = Σ_{i=0}^{λ−1} γ^i·r_{t+i} + γ^λ·Q_target(s_{t+λ}, argmax_a Q(s_{t+λ}, a, θ), θ')
where γ is the discount factor; γ^i represents the degree to which the reward value r_{t+i} obtained in the (t+i)-th state influences the current state t, and γ is a value greater than 0 and less than 1; γ^λ represents the degree to which the reward value r_{t+λ} obtained in the (t+λ)-th state influences the current state t; Q is the state-action value estimate; λ is the number of interval steps; s_t is the current state; a_t is the action executed in the current state; r_t denotes the immediate reward value obtained by the agent at time t; r_{t+λ} is the reward value obtained in the (t+λ)-th state; s_{t+λ} is the state λ steps later; θ is the weight parameter of the current value network and θ' is the weight parameter of the target value network; i is the index of the reward value obtained in each state after state t; Q(s_{t+λ}, a, θ) denotes the estimate of each action output by the current value network for the input (s_{t+λ}, a); Q_target(s_{t+λ}, argmax_a Q(s_{t+λ}, a, θ), θ') means that the action corresponding to the maximum of the estimates output by the current value network is selected first and, together with s_{t+λ}, is used as the input of the target value network, which then outputs an estimate for each action. The loss function is:
L(θ) = E[(y_target^MS-DDQN − Q(s, a, θ))²]
where E denotes the expectation of the neural-network error, s is the state, a is the executed action, θ is the weight parameter of the current value network, Q is the state-action value estimate, and Q(s, a, θ) denotes the estimate of each action output by the current value network. The data stored in the experience pool take the form:
(s_t, a_t, Σ_{i=0}^{λ−1} γ^i·r_{t+i}, s_{t+λ})
where t is a given moment, s is a state, and i is the index of the reward value obtained in each state after state t; s_t is the current state, a is the executed action, a_t denotes the action executed at time t, and r_t denotes the immediate reward value obtained by the agent at time t; γ is the discount factor, γ^i represents the degree to which the reward value r_{t+i} obtained in the (t+i)-th state influences the current state t, and γ is a value greater than 0 and less than 1; λ is the number of interval steps, and s_{t+λ} is the state λ steps later.
the target value network is according to a function
Figure BDA0002701350050000061
Outputting the value of each group of actions, updating the weight theta of the current value network according to the loss function, and then sampling the samples after each action is executedAnd storing the data into an experience pool.
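A sketch of how the multi-step return and the stored tuple described above could be formed for a single state, assuming the λ subsequent rewards are already available (all numeric values are illustrative):

    import numpy as np

    gamma, lam = 0.99, 4                        # discount factor and interval steps (illustrative)
    rewards = [0.1, 0.0, -0.2, 0.5]             # r_t ... r_{t+λ-1}

    # multi-step reward: sum of gamma^i * r_{t+i} for i = 0 .. λ-1
    r_multi = sum(gamma**i * r for i, r in enumerate(rewards))

    q_current_next = np.array([0.4, 0.9, 0.1])  # Q(s_{t+λ}, a, θ), illustrative
    q_target_next = np.array([0.5, 0.7, 0.2])   # Q_target(s_{t+λ}, a, θ'), illustrative

    # MS-DDQN target: current network selects, target network evaluates, discounted by gamma^λ
    a_star = int(q_current_next.argmax())
    y = r_multi + gamma**lam * q_target_next[a_star]

    # the tuple stored in the experience pool: (s_t, a_t, r_multi, s_{t+λ})
    sample = ("s_t", "a_t", r_multi, "s_t_plus_lambda")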
Further, the training method of the MS-DDQN algorithm comprises the following steps:
Randomly initialize the current value network Q(s_t, a; θ) with weights θ and the target value network Q_target(s_t, a; θ') with weights θ'; Q(s_t, a; θ) denotes the estimate of each action output by the current value network
Initialize the experience pool D and set the hyper-parameter λ
For episode = 1, M do
    Reset the simulation environment, obtain the initial observation state s_t, set T ← ∞, and initialize four empty arrays S_t, A, R, S_{t+1}; S_t stores the state information of the current state, A stores the action executed in the current state, R stores the reward value obtained in the current state, and S_{t+1} stores the next-state information; T is used to judge, when the episode ends, whether all the data collected in this episode have been stored in the experience pool
    For t = 1, 2, … do
        If t < T then
            Select action a_t = argmax_a Q(s_t, a; θ) according to the current policy, execute a_t, and obtain the reward value r_t and the new state s_{t+1}; store s_t in S_t, r_t in R, a_t in A and s_{t+1} in S_{t+1}; t indexes the environment states obtained by the agent in this episode
            If s_{t+1} is the terminal state then T ← t + 1
        τ ← t − λ + 1
        If τ ≥ 0 then
            If τ + λ < T then
                r_τ = Σ_{i=τ}^{τ+λ−1} γ^{i−τ}·r_i,  r_i ∈ R
            else
                r_τ = Σ_{i=τ}^{T−1} γ^{i−τ}·r_i,  r_i ∈ R
            Store (s_τ, a_τ, r_τ, s_{τ+λ}) in the experience pool D
            Randomly draw mini-batch sample data from D
            Set y_i = r_τ + γ^λ·Q_target(s_{τ+λ}, argmax_a Q(s_{τ+λ}, a, θ), θ')
            Perform gradient descent on the current value network weights θ with the loss function L(θ) = E[(y_i − Q(s, a, θ))²]
    Until τ = T − 1
Here τ is mainly used to judge whether the number of actions the mobile robot has executed exceeds the set number of steps λ. If it does, the value of τ is greater than or equal to 0, indicating that the agent has obtained at least λ environment states; at this moment the influence r_τ of the reward values obtained in the λ future states on state τ can be calculated, the state information s_τ, the action a_τ and the state information s_{τ+λ} are extracted from the arrays S_t, A and S_{t+1}, and these three items together with r_τ form a training-sample tuple that is stored in the experience pool D, so that training samples can later be drawn from the experience pool for training. When the agent starts an episode of training, T ← ∞ is set first; since t < T, the program enters the second for-loop. When a collision occurs or the target position is reached, T ← t + 1; when τ = T − 1, all the environment state information obtained by the agent in this episode has been stored as training samples according to the multi-step method. When τ + λ < T, the agent has not collided and the number of environment states to be stored is greater than or equal to the set λ, so the influence r_τ of the reward values obtained in the λ future states on state τ is calculated as r_τ = Σ_{i=τ}^{τ+λ−1} γ^{i−τ}·r_i with r_i ∈ R; otherwise the agent has collided and the number of environment states to be stored is smaller than the set λ, so r_τ = Σ_{i=τ}^{T−1} γ^{i−τ}·r_i with r_i ∈ R is used to calculate the influence of the remaining λ−1 (λ−2, λ−3, …, 1) future reward values on states τ (τ+1, τ+2, …, T−1). y_i denotes the i-th sample in the mini-batch drawn from the experience pool; the estimate given by the target value network plus the actual return r_i obtained in that state is a comprehensive reward for the current state and is used in the gradient-descent operation together with the current value network. The MS-DDQN algorithm thus allows the rewards obtained by the underwater agent to be propagated back to the state-value estimates of states a multi-step interval away. In this way the underwater agent is better guided to learn quickly and can perceive changes of future states in advance.
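A condensed Python sketch of the τ/T bookkeeping described above for a single episode; the environment, action selection and storage are reduced to stand-ins (env, select_action and store are assumed interfaces, not the patent's implementation):

    import math

    def run_episode(env, select_action, store, lam=4, gamma=0.99):
        """Collect lambda-step samples (s_tau, a_tau, r_tau, s_{tau+lam}) for one episode."""
        S, A, R, S_next = [], [], [], []     # per-episode buffers, as in the algorithm above
        s = env.reset()
        T = math.inf                         # episode length, unknown until a terminal state
        t = 0
        while True:
            if t < T:
                a = select_action(s)         # argmax_a Q(s, a; theta) with exploration
                s_next, r, done = env.step(a)
                S.append(s); A.append(a); R.append(r); S_next.append(s_next)
                if done:
                    T = t + 1
                s = s_next
            tau = t - lam + 1                # index of the state whose sample is now complete
            if tau >= 0:
                last = tau + lam if tau + lam < T else int(T)
                r_tau = sum(gamma**(i - tau) * R[i] for i in range(tau, last))
                tail = S_next[last - 1]      # s_{tau+lam}, or the terminal state if reached earlier
                store(S[tau], A[tau], r_tau, tail)
            if tau == T - 1:
                break
            t += 1

After each stored tuple, a mini-batch would be drawn from the experience pool and a gradient-descent step taken on the current value network, exactly as in the listing above.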
Further, the training step in S3 specifically includes:
S31, the simulation environment acquires the current state information of the agent, including the distance information between the agent and the obstacles in the environment and the positional-relation information between the agent and the target point; the positional-relation information comprises the relative coordinate relation between the agent's current position coordinates and the target position coordinates (obtained by subtracting the current coordinates of the underwater agent from the target position coordinates), the Euclidean distance from the agent's current position to the target position, and the included angle between the agent's advancing-direction vector and the vector from the agent's current coordinates to the target position;
S32, the acquired current state information is input into the modular deep neural network; specifically, the distance information between the agent and the obstacles and the positional-relation information between the agent and the target point are input into the local obstacle-avoidance deep neural network module, and the positional-relation information between the agent and the target point is input into the global navigation deep neural network module;
S33, the local obstacle-avoidance deep neural network module and the global navigation deep neural network module output their respective control instructions according to the input current state information;
S34, the instruction selection module decides whether to use the action instruction output by the local obstacle-avoidance deep neural network module or by the global navigation deep neural network module, by judging the distance between the agent and the nearest obstacle;
S35, the agent executes the instruction selected by the instruction selection module, receives the reward value output by the reward function, and enters the next state;
S36, the sample data of this interaction are stored in the experience pool; the sample data comprise the current state s_t, the executed action a_t, the obtained reward value r_t and the next state s_{t+1}, and are stored in the experience pool in the form
(s_t, a_t, Σ_{i=0}^{λ−1} γ^i·r_{t+i}, s_{t+λ}).
In step S31, the positional-relation information between the underwater agent and the target point is added to the input state information, which helps the underwater agent reach the target position along a shorter path; at the same time, this positional-relation information is used directly as the input state of the global navigation neural network, so that the underwater agent can learn from the position information in which direction it should advance in order to reach the target position as quickly as possible; moreover, the modular deep neural network makes the strategy each neural network needs to learn more definite, so that the underwater agent can better avoid obstacles and reach the target position along a shorter path. Step S36 adopts the experience-replay method: the sample data of the interaction between the agent and the environment are stored in the experience pool, and each sample is labelled by its reward value, which overcomes the drawback of deep learning methods that require a large number of manually labelled samples.
Further, the specific steps of the local obstacle-avoidance deep neural network module and the global navigation deep neural network module are as follows: S331, the current value network receives the input current state information, processes it and outputs a group of actions, where the processing includes drawing training samples from the experience pool according to the current state information; the output group of actions is transmitted to the target value network;
S332, the target value network processes the transmitted group of actions, calculates the value of each action, and transmits the values of the group of actions back to the current value network;
S333, the current value network selects the next action to be executed according to the maximum among the values of the group of actions returned by the target value network, and outputs a control instruction to the simulation environment according to that next action;
S334, the current value network obtains the next-step state information resulting from the next action selected in S333, processes it and outputs a group of actions, where the processing includes drawing training samples from the experience pool according to the next-step state information; the output group of actions is transmitted to the target value network, and the target value network repeats step S332, i.e. it processes the transmitted group of actions, calculates the value of each action, and transmits the values back to the current value network;
S335, the current value network compares the values of the group of actions returned by the target value network in S334 with those returned in S332, and calculates an error according to the error function;
S336, the weights of the current value network are updated according to the error obtained in S335, and every N steps the weights of the target value network are updated from the weights of the current value network;
S337, in parallel with step S334, the current value network outputs a control instruction to the simulation environment according to the selected next action. Actions are selected with the parameters of the current value network and the selected actions are evaluated with the parameters of the target value network; by using the two sets of weight parameters of the current value network and the target value network, the action selection of the target value network is separated from the action evaluation.
Further, the reward function is a continuous combined reward function comprising terminal rewards and non-terminal rewards. The terminal rewards are as follows: a positive reward is obtained when the agent reaches the target point, expressed as r_arr = 100 if d_{r-t} ≤ d_win, where d_{r-t} is the Euclidean distance from the agent to the target point and d_win is the threshold for reaching the target point; when d_{r-t} is smaller than the set d_win the target point is considered reached, otherwise it is not reached; r_arr is the positive reward value given when the agent reaches the target position.
A negative reward is obtained when the agent collides with an obstacle, expressed as r_col = −100 if d_{r-o} ≤ d_col, where d_{r-o} is the Euclidean distance from the agent to the nearest obstacle and d_col is the threshold for a collision between the agent and an obstacle; when d_{r-o} is smaller than or equal to d_col a collision is considered to have occurred, otherwise no collision has occurred; r_col is the punitive negative reward value given when the agent collides.
The non-terminal rewards are as follows: a positive reward is obtained when the agent progresses towards the target point, expressed as r_t_goal = c_r·[d_{r-t}(t) − d_{r-t}(t−1)], where c_r ∈ (0, 1] is a coefficient set to 1 and d_{r-t}(t) denotes the Euclidean distance between the agent's position coordinates at time t and the target position coordinates. A danger reward r_dang ∈ [0, 1] is obtained, and it decreases as the minimum distance between the agent and the obstacles keeps decreasing; its expression (a function of d_min and the coefficient β, not reproduced here) satisfies 0 ≤ r_dang ≤ 1, where d_min is the minimum distance between the agent and the obstacles and β is a coefficient chosen so that the value range of r_dang is (0, 1).
When the included angle between the agent's advancing-direction vector and the vector from the agent's current coordinates to the target position is smaller than ±18°, a reward of 1 is obtained; when the angle is greater than ±18° and smaller than ±72°, a reward of 0.3 is obtained; in other cases the reward is 0. The expression is:
r_ori = 1 if |a_ori| ≤ 18°;  r_ori = 0.3 if 18° < |a_ori| ≤ 72°;  r_ori = 0 otherwise,
where a_ori is the included angle between the agent's advancing-direction vector and the vector from the agent's current coordinates to the target position. The continuous combined reward function improves the convergence speed of the algorithm, so that the underwater agent can better avoid obstacles and reach the target position along a shorter path; with the continuous combined reward function the underwater agent obtains a corresponding reward at every learning step, which is more conducive to guiding it to move towards the target point and away from obstacles.
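A sketch of the continuous combined reward function under the thresholds above. The terms are simply summed here, and the danger term is written as 1 − exp(−β·d_min) purely as an assumed stand-in with range (0, 1), since the patent's exact expression is not reproduced; d_win, d_col and β are illustrative values:

    import math

    def combined_reward(d_rt, d_rt_prev, d_ro, d_min, a_ori,
                        d_win=5.0, d_col=2.0, c_r=1.0, beta=0.05):
        """Continuous combined reward: terminal rewards plus non-terminal shaping terms."""
        # terminal rewards
        if d_rt <= d_win:
            return 100.0                        # reached the target point
        if d_ro <= d_col:
            return -100.0                       # collided with an obstacle

        # non-terminal rewards
        r_goal = c_r * (d_rt_prev - d_rt)       # sign chosen so progress towards the target is positive
        r_dang = 1.0 - math.exp(-beta * d_min)  # assumed stand-in: shrinks as the obstacle gets closer
        if abs(a_ori) <= 18:
            r_ori = 1.0
        elif abs(a_ori) <= 72:
            r_ori = 0.3
        else:
            r_ori = 0.0
        return r_goal + r_dang + r_ori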
Further, the local obstacle-avoidance deep neural network module and the global navigation deep neural network module use ReLU6 and ReLU as activation functions: the ReLU6 activation function is applied at the front end of the neural network and the ReLU activation function at the back end. To improve the learning ability of the neural network and suppress problems such as vanishing gradients, ReLU6 and ReLU are combined as the activation functions in the network framework. Both ReLU and ReLU6 avoid the vanishing-gradient phenomenon; using the ReLU6 function as the activation at the front end of the network helps the network learn the sparse features of the data samples quickly, while the ReLU function is used as the activation at the back end of the network.
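A toy forward pass illustrating the split use of ReLU6 at the front of the network and ReLU at the back; the layer sizes and random weights are placeholders, not the patent's architecture:

    import numpy as np

    rng = np.random.default_rng(0)

    def relu(x):
        return np.maximum(x, 0.0)

    def relu6(x):
        return np.minimum(np.maximum(x, 0.0), 6.0)   # ReLU clipped at 6

    # illustrative fully connected layers: 36 sonar inputs -> 64 -> 64 -> 5 steering actions
    W1 = rng.normal(size=(36, 64))
    W2 = rng.normal(size=(64, 64))
    W3 = rng.normal(size=(64, 5))

    def forward(state):
        h1 = relu6(state @ W1)    # front end: ReLU6 helps learn sparse features quickly
        h2 = relu(h1 @ W2)        # back end: ReLU
        return h2 @ W3            # one value estimate per discrete action

    q_values = forward(rng.normal(size=36))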
Furthermore, the local obstacle avoidance deep neural network module and the global navigation deep neural network module both adopt a full-connection structure, the number of hidden layers of the local obstacle avoidance deep neural network module is more than three, and the number of hidden layers of the global navigation deep neural network module is one.
Further, the instruction selection module is provided with a threshold and selects the control instruction by comparing the distance between the agent and the nearest obstacle with this threshold: when the distance is smaller than 40, the control instruction output by the local obstacle-avoidance deep neural network module is executed; when it is greater than or equal to 40, the control instruction output by the global navigation neural network module is executed. A distance smaller than 40 means the underwater agent is close to an obstacle, so the instruction output by the local obstacle-avoidance deep neural network is executed; a distance greater than or equal to 40 means the underwater agent is some way from the obstacles, so the global navigation neural network's instruction is executed in order to reach the target position faster.
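A sketch of the instruction selection rule, where the value compared with 40 is taken to be the distance between the agent and the nearest obstacle as described above:

    def select_instruction(nearest_obstacle_distance, local_action, global_action, threshold=40):
        """Execute the local obstacle-avoidance action near obstacles, otherwise the global one."""
        if nearest_obstacle_distance < threshold:
            return local_action     # close to an obstacle: avoidance has priority
        return global_action        # far from obstacles: head for the target position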
Compared with the prior art, the invention has the following beneficial effects. By adopting a modular deep neural network architecture, the learning task of each module's neural network becomes more definite, and the double-neural-network structure improves the stability of the algorithm. The output method of the target value network, the loss function, the reward function and the data stored in the experience pool of the MS-DDQN algorithm are improved, so that the rewards obtained by the agent during training can be propagated to the state-value estimates of states a multi-step interval away. In this way the underwater agent is better guided to learn quickly and can perceive changes of future states in advance, which gives it the ability to perceive the future distribution of obstacles and helps it take evasive action in advance. The improved continuous combined reward function increases the convergence speed of the algorithm, so that the underwater agent can better avoid obstacles and reach the target position along a shorter path. The invention improves the conventional DDQN algorithm with the Q(λ) algorithm theory; the improved algorithm is the multi-step-mechanism-oriented Multi-step DDQN (MS-DDQN) algorithm. MS-DDQN allows the rewards obtained by the underwater robot during training to be further propagated to the state-value estimates of multi-step-interval states. In this way the underwater robot is better guided to learn quickly and can perceive changes of future states in advance. Improving the DDQN algorithm with the Q(λ) method is equivalent to giving the underwater robot the ability to perceive the future distribution of obstacles, helping it take evasive action in advance.
Drawings
Fig. 1 shows a training method of the DQN algorithm.
Fig. 2 is a system architecture of an embodiment of the present invention.
Fig. 3 is a schematic structural diagram of a neural network according to an embodiment of the present invention.
FIG. 4 is a geometric environment model diagram according to an embodiment of the present invention.
Fig. 5 is a sonar detector diagram according to an embodiment of the present invention.
Fig. 6 is a diagram of a forward-looking sonar detector simulator according to an embodiment of the present invention.
FIG. 7 is a diagram of a simulation environment model according to an embodiment of the present invention.
FIG. 8 is a diagram of a coordinate transformation code according to an embodiment of the present invention.
Fig. 9 is a graph of a ReLU function according to an embodiment of the present invention.
Fig. 10 is a graph of the ReLU6 function according to an embodiment of the present invention.
Fig. 11 is a navigation track diagram of the underwater robot according to the embodiment of the present invention.
FIG. 12 is a graph of training results of an embodiment of the present invention, wherein 12(a) is a success rate curve, 12(b) is a reward value curve per round, and 12(c) is a reward average.
Fig. 13 is a navigation track diagram of the underwater robot in different test environments, where 13(a) is environment 2, 13(b) is environment 3, 13(c) is environment 4, and 13(d) is environment 5.
FIG. 14 is a diagram of the hardware and software configuration of a computer according to an embodiment of the present invention.
FIG. 15 is a diagram of a hyper-parameter setting according to an embodiment of the present invention.
Fig. 16 is a graph of test results in environment 1 of an embodiment of the present invention.
FIG. 17 is a graph of test results for different environments according to an embodiment of the present invention.
Description of reference numerals: 1, the underwater robot after mass-point simplification; 2, the advancing direction of the underwater robot; 3, the target position of the underwater robot; 31, the target area of the underwater robot; 4, the starting position of the underwater robot; 41, the starting area of the underwater robot.
Detailed Description
The drawings are only for purposes of illustration and are not to be construed as limiting the invention. For a better understanding of the following embodiments, certain features of the drawings may be omitted, enlarged or reduced, and do not represent the size of an actual product; it will be understood by those skilled in the art that certain well-known structures in the drawings and descriptions thereof may be omitted.
Example 1
The embodiment provides an intelligent autonomous navigation method based on deep reinforcement learning, in particular to an underwater robot autonomous navigation method based on deep reinforcement learning, which comprises the following steps:
s1, constructing an intelligent autonomous navigation system, wherein the autonomous navigation system adopts an MS-DDQN algorithm, namely a multi-step mechanism-oriented DDQN algorithm, and the MS-DDQN algorithm is a deep reinforcement learning algorithm; the MS-DDQN algorithm adopts a modularized neural network, and the modularized neural network comprises a local obstacle avoidance depth neural network module, a global navigation depth neural network module and an instruction selection module; the local obstacle avoidance depth neural network module is used for guiding the intelligent agent to be far away from an obstacle, the global navigation depth neural network module is used for guiding the intelligent agent to move towards a closer path to a target position, and the instruction selection module is used for determining a finally executed action instruction. The MS-DDQN algorithm comprises a current value network, a target value network, an error function, a reward function and an experience pool, wherein the current value network is used for selecting an action, the target value network is used for evaluating the action, the error function is used for updating the weight, the reward function refers to a reward value obtained when an agent takes a certain action in a current state and reaches a next state, and the experience pool is used for storing sample data generated in each step.
Based on the modularized neural network framework, the underwater robot can adopt different strategies to cope with different environmental states. When the underwater robot approaches to the obstacle, the main task of the underwater robot is to avoid the obstacle, and the global navigation task becomes a secondary task. When the underwater robot is far away from the obstacle, the global navigation task becomes a main task, so that the underwater robot can reach the target position in a shorter path. The embodiment provides a modularized neural network structure aiming at two subtasks of underwater robot navigation, namely a local obstacle avoidance task and a global navigation task. Aiming at the two subtasks, a local obstacle avoidance neural network and a global navigation neural network are designed respectively, and the underwater robot can clearly determine what action the underwater robot should take under each condition to better avoid the obstacle and reach the target position by a closer path through a double neural network structure.
The training method of the MS-DDQN algorithm comprises the following steps:
Randomly initialize the current value network Q(s_t, a; θ) with weights θ and the target value network Q_target(s_t, a; θ') with weights θ'; Q(s_t, a; θ) denotes the estimate of each action output by the current value network
Initialize the experience pool D and set the hyper-parameter λ
For episode = 1, M do
    Reset the simulation environment, obtain the initial observation state s_t, set T ← ∞, and initialize four empty arrays S_t, A, R, S_{t+1}; S_t stores the state information of the current state, A stores the action executed in the current state, R stores the reward value obtained in the current state, and S_{t+1} stores the next-state information; T is used to judge, when the episode ends, whether all the data collected in this episode have been stored in the experience pool
    For t = 1, 2, … do
        If t < T then
            Select action a_t = argmax_a Q(s_t, a; θ) according to the current policy, execute a_t, and obtain the reward value r_t and the new state s_{t+1}; store s_t in S_t, r_t in R, a_t in A and s_{t+1} in S_{t+1}; t indexes the environment states obtained by the agent in this episode
            If s_{t+1} is the terminal state then T ← t + 1
        τ ← t − λ + 1
        If τ ≥ 0 then
            If τ + λ < T then
                r_τ = Σ_{i=τ}^{τ+λ−1} γ^{i−τ}·r_i,  r_i ∈ R
            else
                r_τ = Σ_{i=τ}^{T−1} γ^{i−τ}·r_i,  r_i ∈ R
            Store (s_τ, a_τ, r_τ, s_{τ+λ}) in the experience pool D
            Randomly draw mini-batch sample data from D
            Set y_i = r_τ + γ^λ·Q_target(s_{τ+λ}, argmax_a Q(s_{τ+λ}, a, θ), θ')
            Perform gradient descent on the current value network weights θ with the loss function L(θ) = E[(y_i − Q(s, a, θ))²]
    Until τ = T − 1
Here τ is mainly used to judge whether the number of actions the mobile robot has executed exceeds the set number of steps λ. If it does, the value of τ is greater than or equal to 0, indicating that the agent has obtained at least λ environment states; at this moment the influence r_τ of the reward values obtained in the λ future states on state τ can be calculated, the state information s_τ, the action a_τ and the state information s_{τ+λ} are extracted from the arrays S_t, A and S_{t+1}, and these three items together with r_τ form a training-sample tuple that is stored in the experience pool D, so that training samples can later be drawn from the experience pool for training.
When the agent starts an episode of training, T ← ∞ is set first; since t < T, the program enters the second for-loop. When a collision occurs or the target position is reached, T ← t + 1; when τ = T − 1, all the environment state information obtained by the agent in this episode has been stored as training samples according to the multi-step method. When τ + λ < T, the agent has not collided and the number of environment states to be stored is greater than or equal to the set λ, so the influence r_τ of the reward values obtained in the λ future states on state τ is calculated as r_τ = Σ_{i=τ}^{τ+λ−1} γ^{i−τ}·r_i with r_i ∈ R; otherwise the agent has collided and the number of environment states to be stored is smaller than the set λ, so r_τ = Σ_{i=τ}^{T−1} γ^{i−τ}·r_i with r_i ∈ R is used to calculate the influence of the remaining λ−1 (λ−2, λ−3, …, 1) future reward values on states τ (τ+1, τ+2, …, T−1). y_i denotes the i-th sample in the mini-batch drawn from the experience pool; the estimate given by the target value network plus the actual return r_i obtained in that state is a comprehensive reward for the current state and is used in the gradient-descent operation together with the current value network.
And S2, building a simulation environment, including building an obstacle environment model and building a simulation intelligent body.
The software platform and hardware components used to construct the simulation environment model and implement the related algorithms are shown in FIG. 14.
Constructing the obstacle environment model: the obstacle environment model is the description of the obstacles in the environment, and the way the environment model is described directly influences the state information input to the deep reinforcement learning algorithm and the obstacle-avoidance strategy finally learned. The grid method and the geometric method are the two more common ways of describing an environment model. FIG. 4 shows a graphical representation of an environment model described by the geometric method. The geometric method does not require dividing the environment; instead it describes the obstacle information in the environment with the points, lines and surfaces of the obstacles. Therefore, when the environment state information is built with the geometric method, the amount of data describing the environment state does not increase sharply as the environment becomes more complex. The working environment of the underwater robot is large and its obstacles are not very dense, so this embodiment chooses the geometric method to construct the environment model.
In this embodiment obstacles are divided into two types, elliptical obstacles and polygonal obstacles, with circular obstacles also treated as elliptical; for an ellipse the geometric environment model only needs to record the coordinates of the top-left vertex of the rectangle bounding the ellipse and the lengths of its major and minor axes. For polygons, into which triangles, rectangles and squares are classified, the geometric model needs to record every vertex coordinate of the polygon. Building the simulated agent: a real underwater robot is not a particle but a solid body with geometric size. In this embodiment the underwater robot is treated as a mass point, and to guarantee the navigation safety of the underwater robot in a real environment the obstacles are expanded outwards, i.e. correspondingly inflated. The underwater robot is projected as a dot with a radius of 1 pixel. In underwater environments, acoustic devices are usually used as sensing instruments to detect the environment. Since the simulated underwater robot sails at a fixed depth, a multi-beam forward-looking sonar can be used as the environment-detecting instrument in this embodiment, for example a multi-beam forward-looking detection sonar sensor. As shown in FIG. 5, the sonar detector has a vertical opening angle of 17 degrees, a horizontal opening angle of 120 degrees, a maximum effective detection distance of 200 m, 240 beams in total, and an opening angle of 0.5 degrees between adjacent beams. A simulated detector is designed in this embodiment to imitate this forward-looking sonar detector. Because the underwater robot only performs motion-control actions in the horizontal plane, a forward-looking detector simulator is designed with a horizontal opening angle of 180 degrees, a maximum detection range of 90, 36 beams, and an opening angle of 5 degrees between adjacent beams. FIG. 6 shows the simulated forward-looking sonar detector simulator. The black dot 1 represents the underwater robot after mass-point simplification, line segment 2 is the advancing direction of the underwater robot, and the line segments on both sides of line segment 2 are the acoustic lines emitted by the robot's forward sonar detector.
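A sketch of the geometric obstacle records described above: an ellipse stored by the top-left corner of its bounding rectangle plus its two axes, and a polygon stored by its vertices (the field names are illustrative):

    from dataclasses import dataclass
    from typing import List, Tuple

    @dataclass
    class EllipseObstacle:
        top_left: Tuple[float, float]   # top-left vertex of the rectangle bounding the ellipse
        major_axis: float
        minor_axis: float               # a circle simply has major_axis == minor_axis

    @dataclass
    class PolygonObstacle:
        vertices: List[Tuple[float, float]]   # triangles, rectangles and squares all fit here

    environment = [
        EllipseObstacle((120.0, 80.0), 60.0, 40.0),
        PolygonObstacle([(300.0, 300.0), (360.0, 300.0), (330.0, 360.0)]),
    ]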
To make the observation data detected and collected by the underwater robot's sonar simulator uniform, the first acoustic line on the left of the robot's advancing direction is defined as 0 degrees and the first line on the right as 180 degrees. The information detected by the sonar detector at time t is stored, in angular order (0°, 5°, 10°, …, 180°), into a row vector s_t. If no obstacle is detected, the sonar detector returns the maximum detection distance of that acoustic line; otherwise it returns the distance between the acoustic line and the obstacle. Finally, the detected information is normalized, i.e. the row vector s_t is divided by the maximum effective detection range. For the motion model of the underwater robot, the robot is set to advance at a constant speed and can only perform discrete steering actions, namely turning left 15 degrees, turning left 30 degrees, keeping the original direction, turning right 15 degrees or turning right 30 degrees relative to its advancing direction.
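A sketch of the simulated sonar observation vector and the discrete steering actions described above; the ray casting itself is stubbed out, and the beam spacing is taken as 5° across the 180° horizontal field of view:

    import numpy as np

    MAX_RANGE = 90.0                              # maximum effective detection range (as assumed above)
    ANGLES = np.arange(0.0, 181.0, 5.0)           # acoustic lines at 0°, 5°, ..., 180°
    ACTIONS = [-30.0, -15.0, 0.0, 15.0, 30.0]     # steering actions in degrees (left negative)

    def observation(distances):
        """distances[i] is the range returned by beam i (MAX_RANGE when nothing is detected)."""
        s_t = np.asarray(distances, dtype=float)
        return s_t / MAX_RANGE                    # normalize into [0, 1]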
The Pygame library is used to build the environment model: first a 500 × 500 window is defined as the simulation environment, and different obstacles, boundary walls and target positions are added to the window. The created environment model is shown in FIG. 7, in which grey ellipses, circles and polygons represent the various types of obstacles. The black dot 1 in the middle represents the underwater robot after mass-point simplification, and the ray lines around it represent the forward-looking sonar detector. In the robot's start area 41 and target area 31, i.e. the shaded rectangular areas 41 and 31, the start position and the target point of the underwater robot are randomly initialized. Black dot 4 represents the start position and black dot 3 the target position.
Implementation of the dynamic environment: after the environment model and the simulated underwater robot model are built, the underwater robot is moved in the simulation environment by the relevant methods, which constitutes the dynamic realization of the simulation environment; these include the method for underwater robot motion and the method by which the forward-looking sonar ranging simulator detects the distance to an obstacle. At the same time, a rule for judging whether the underwater robot collides and a rule for judging whether the underwater robot has reached the target position are set. The coordinate transformation of the underwater robot in the simulation environment after it performs a movement at its current position is calculated as follows. First, assume the initial position of the underwater robot at the lower left corner of the simulation environment is P_start with coordinates (x_start, y_start), the motion speed is v = 0.5, and the current heading of the underwater robot is angle. In the current state the underwater robot selects a steering action to execute, with steering angle angle_tran ∈ {15° turn_left, 30° turn_left, 0°, 15° turn_right, 30° turn_right}. Equation 5-1 gives the heading of the underwater robot after performing the action:
angle ← angle + angle_tran    (5-1)
Combining formula 5-1, the coordinates of the underwater robot become:
x_next = x_start + cos(angle) * v    (5-2)
y_next = y_start + sin(angle) * v    (5-3)
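The update above can be written as a short Python sketch; this is a minimal illustration assuming angles are held in degrees and converted to radians for the trigonometric functions, and the variable names are ours:

```python
import math

TURN_ACTIONS = [-30.0, -15.0, 0.0, 15.0, 30.0]   # discrete steering actions in degrees

def step_motion(x, y, angle_deg, action_index, v=0.5):
    """Apply formulas (5-1)-(5-3): update the heading, then advance at constant speed v."""
    angle_deg = angle_deg + TURN_ACTIONS[action_index]        # (5-1)
    rad = math.radians(angle_deg)
    x_next = x + math.cos(rad) * v                            # (5-2)
    y_next = y + math.sin(rad) * v                            # (5-3)
    return x_next, y_next, angle_deg
```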
Before solving the distance between the forward-looking sonar ranging simulator and an obstacle, the projected lengths of the end-point coordinates of each sound wave line segment on the x axis and y axis in the coordinate system centered on the underwater robot are calculated, and then the projections of each sound wave line segment in the coordinate system centered on the environment model are calculated. The transformation from the robot-centered coordinate system to the environment-model-centered coordinate system is a transformation between two two-dimensional coordinate systems. Assuming the coordinates of the robot are (center_x, center_y), part of the code for solving the end-point coordinates of a sound wave line segment in the environment-model-centered coordinate system is shown in fig. 8. Once the coordinate projections of the sound wave line segments in the environment-model-centered coordinate system are obtained, the distance between the underwater robot and the obstacle detected by each sound wave line segment can be solved. The edge vectors of each obstacle, the sound wave line segment vectors, and the vectors from the robot position to each vertex of the obstacle are then constructed; by solving the relative relations among these vectors in real time, the positional information between the underwater robot and the obstacles is obtained, and the detected length of each sound wave line segment of the forward-looking sonar detector can be solved.
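Since the code of fig. 8 is not reproduced in the text, the following hedged sketch shows one plausible way to compute the beam end points in the environment-model-centered frame; the heading offset and rotation convention are assumptions:

```python
import math

def beam_endpoints(center_x, center_y, heading_deg, max_range=90.0, step_deg=5):
    """Possible form of the fig. 8 computation: end point of every sound wave line
    segment, expressed in the environment-model-centered coordinate system."""
    endpoints = []
    for beam_deg in range(0, 181, step_deg):
        # Beam direction relative to the robot heading: 0° is the leftmost beam,
        # 180° the rightmost (assumed convention).
        world_deg = heading_deg + 90.0 - beam_deg
        rad = math.radians(world_deg)
        # Projection in the robot-centered frame, then translation to the
        # environment-model-centered frame.
        end_x = center_x + max_range * math.cos(rad)
        end_y = center_y + max_range * math.sin(rad)
        endpoints.append((end_x, end_y))
    return endpoints
```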
Judging the collision rule of the underwater robot: first, d_min is set as the minimum safe distance between the underwater robot and an obstacle. If the shortest sound wave line segment detected by the current forward-looking sonar detector is smaller than the set d_min, a collision is judged to have occurred, the training of this round ends, and a new initial position is reassigned to the underwater robot; otherwise no collision occurs and the underwater robot selects an action to execute according to the relevant strategy. Judging the rule that the underwater robot reaches the target position: first, d_arrival is defined as the maximum distance at which the underwater robot is considered to have reached the target position. While the underwater robot is running, the Euclidean distance between its current position coordinate and the target position coordinate is calculated; if this distance is less than d_arrival, the underwater robot is considered to have reached the target position.
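A compact sketch of these two rules, assuming the sonar readings and positions are already available; the threshold values below are placeholders, not the patent's values:

```python
import math

D_MIN = 3.0        # minimum safe distance to an obstacle (placeholder value)
D_ARRIVAL = 5.0    # maximum distance at which the target counts as reached (placeholder)

def has_collided(sonar_readings):
    """Collision rule: the shortest detected sound wave segment is below d_min."""
    return min(sonar_readings) < D_MIN

def has_arrived(robot_pos, target_pos):
    """Arrival rule: Euclidean distance to the target is below d_arrival."""
    dist = math.hypot(robot_pos[0] - target_pos[0], robot_pos[1] - target_pos[1])
    return dist < D_ARRIVAL
```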
S3, placing the autonomous navigation system in the simulation environment for training, namely the intelligent agent trains and learns in the simulation environment with the MS-DDQN algorithm; multiple simulation environments are used, and training is run multiple times in each simulation environment.
S31, acquiring current state information of the intelligent agent by the simulation environment, wherein the current state information comprises distance information between the intelligent agent and an obstacle in the environment and position relation information between the intelligent agent and a target point; the information of the position relation between the intelligent body and the target point comprises the relative coordinate relation between the current coordinate of the intelligent body and the coordinate of the target position, the Euclidean distance from the current position of the intelligent body to the target position and the included angle between the vector of the advancing direction of the intelligent body and the vector of the direction from the current coordinate position of the intelligent body to the target position, wherein the relative coordinate relation is obtained by subtracting the current coordinate of the underwater intelligent body from the coordinate of the target position.
The underwater robot mainly detects obstacle information in the environment through the ranging sensor, and the information acquired by the ranging sensor is the distance between the underwater robot and the obstacles in the environment. In order to improve the learning efficiency of the underwater robot and learn a better strategy, the positional relation information between the underwater robot and the target point is added to the state information fed into the deep neural network, so that the underwater robot tends to reach the target point along a shorter path. The positional relation comprises three pieces of information. The first is the relative coordinate relation between the current coordinate of the underwater robot and the target position coordinate, obtained by subtracting the target position coordinate from the current coordinate of the underwater robot. The second is the Euclidean distance from the current position of the underwater robot to the target position, which covers how far the robot is from the target. The third is the included angle between the advancing-direction vector of the underwater robot and the direction vector from the current robot coordinate to the target position; this information indicates in which direction the underwater robot should head to reach the target position most directly.
The positional relation information between the underwater robot and the target point is added into the input state information to help the underwater robot reach the target position along a shorter path. At the same time, the positional relation information is used directly as the input state information of the global navigation neural network, so that through this position information the underwater robot can learn in which direction it should advance in order to reach the target position fastest.
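A small Python sketch of how these three pieces of position information could be appended to the sonar vector; a minimal illustration with names of our own choosing, not the patent's:

```python
import math

def position_features(robot_xy, target_xy, heading_deg):
    """Relative coordinates, Euclidean distance, and heading-to-target angle."""
    dx = robot_xy[0] - target_xy[0]          # relative coordinate relation
    dy = robot_xy[1] - target_xy[1]
    dist = math.hypot(dx, dy)                # Euclidean distance to the target

    # Angle between the advancing-direction vector and the robot-to-target vector
    heading = (math.cos(math.radians(heading_deg)), math.sin(math.radians(heading_deg)))
    to_target = (target_xy[0] - robot_xy[0], target_xy[1] - robot_xy[1])
    dot = heading[0] * to_target[0] + heading[1] * to_target[1]
    norm = math.hypot(*to_target) or 1.0
    angle = math.degrees(math.acos(max(-1.0, min(1.0, dot / norm))))

    return [dx, dy, dist, angle]

def build_full_state(sonar_vector, robot_xy, target_xy, heading_deg):
    """State fed to the local obstacle-avoidance network: sonar readings + position features."""
    return list(sonar_vector) + position_features(robot_xy, target_xy, heading_deg)
```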
And S32, inputting the acquired current state information into the modular deep neural network, specifically, inputting distance information between the intelligent body and an obstacle in the environment and position relation information between the intelligent body and a target point into the local obstacle avoidance deep neural network module, and inputting position relation information between the intelligent body and the target point into the global navigation deep neural network module.
And S33, the local obstacle avoidance depth neural network module and the global navigation depth neural network module output respective control instructions according to the input current state information.
And S34, the instruction selection module determines to use the action instruction output by the local obstacle avoidance depth neural network module or the global navigation depth neural network module by judging the distance value between the intelligent body and the nearest obstacle.
The system architecture of the autonomous navigation system is shown in fig. 2. The input state information of the local obstacle avoidance neural network comprises the environment information detected by the ranging sensor and the relative position information of the underwater robot; after forward propagation through the local obstacle avoidance deep neural network, the network directly outputs a control instruction for the underwater robot. The input state information of the global navigation neural network is only the relative position information of the underwater robot, and after this input is processed the network outputs a control instruction for the underwater robot. Because both the local obstacle avoidance deep neural network and the global navigation neural network output instructions for controlling the underwater robot, an instruction selection module is designed to determine which network's output action instruction is executed. The instruction selection module decides which network's action instruction to use by judging the distance between the underwater robot and the nearest obstacle. In this embodiment the threshold d_to_obs is set to 40 to determine which module's instruction is used: when the distance is less than 40, the underwater robot is close to an obstacle, so the instruction output by the local obstacle avoidance deep neural network is executed; when the distance is greater than or equal to 40, the underwater robot is some distance away from the obstacles, and the global navigation neural network is used so that the target position is reached more quickly. The components internal to the deep reinforcement learning system are the local obstacle avoidance module, the global navigation module, the instruction selection module and the action, while the components related to the external environment are the ranging sensor information, the relative position information and the environment. The internal neural network structures of the local obstacle avoidance module and of the global navigation module are shown in fig. 3; both adopt fully connected structures, because the system senses the environment with a forward-looking sonar detector whose returned information has low dimensionality and small data volume, so no complex convolutional layers need to be built. The local obstacle avoidance neural network contains three hidden layers with 256, 138 and 32 neurons respectively. The global navigation neural network contains only one hidden layer with 32 neurons. The neural network structure of the global navigation module is simple because its input state information is only the relative position information of the underwater robot, and one hidden layer is enough to solve the global navigation problem well.
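As a hedged sketch, assuming PyTorch, the two fully connected modules and the threshold-based instruction selection might look like this; the layer sizes follow the text, while the input dimensions and the exact placement of ReLU6 versus ReLU are illustrative assumptions:

```python
import torch
import torch.nn as nn

N_ACTIONS = 5          # turn left 30/15, keep heading, turn right 15/30
SONAR_DIM = 37         # number of sonar beams (illustrative)
POS_DIM = 4            # relative x/y, distance, heading-to-target angle (illustrative)
D_TO_OBS = 40.0        # instruction selection threshold

local_net = nn.Sequential(            # local obstacle avoidance module: three hidden layers
    nn.Linear(SONAR_DIM + POS_DIM, 256), nn.ReLU6(),
    nn.Linear(256, 138), nn.ReLU6(),
    nn.Linear(138, 32), nn.ReLU(),
    nn.Linear(32, N_ACTIONS),
)

global_net = nn.Sequential(           # global navigation module: one hidden layer
    nn.Linear(POS_DIM, 32), nn.ReLU(),
    nn.Linear(32, N_ACTIONS),
)

def select_action(sonar, pos, nearest_obstacle_dist):
    """Instruction selection: local module when close to an obstacle, global module otherwise."""
    if nearest_obstacle_dist < D_TO_OBS:
        q = local_net(torch.cat([sonar, pos], dim=-1))
    else:
        q = global_net(pos)
    return int(q.argmax(dim=-1))
```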
In order to optimize the input information of the state space, the Euclidean distance from the underwater robot to the target position and the included angle between the advancing-direction vector of the underwater robot and the direction vector from the current robot coordinate to the target position are added to the original state space. By adding these two pieces of information, the underwater robot knows its current position state better and can reach the target position along a shorter path.
And S35, the intelligent agent executes the instruction selected by the instruction selection module, receives the reward value output by the reward function and enters the next state.
In order to improve the convergence speed of the algorithm, and so that the underwater robot can better avoid obstacles and reach the target position along a shorter path, this embodiment employs a new continuous combined reward function. With the continuous combined reward function, the underwater robot obtains a corresponding reward at every learning step, which is more favorable for guiding it to move towards the target point and away from the obstacles. The continuous combined reward function comprises terminal rewards and non-terminal rewards; the terminal rewards specifically include:
a positive reward is obtained when the agent reaches the target point, expressed as r_arr = 100, if d_{r-t} ≤ d_win; where d_{r-t} is the Euclidean distance from the agent to the target point and d_win is the threshold for the agent to reach the target point. When d_{r-t} is smaller than the set d_win, the target point is considered reached; otherwise the target point has not been reached. r_arr is the positive reward value given when the agent reaches the target position;
a negative reward is obtained when the agent collides with an obstacle, expressed as r_col = −100, if d_{r-o} ≤ d_col; where d_{r-o} is the Euclidean distance from the agent to the nearest obstacle and d_col is the threshold at which the agent is considered to collide with an obstacle. When d_{r-o} is less than or equal to d_col, a collision has occurred; otherwise no collision has occurred. r_col is the punitive negative reward value given when the agent collides;
the non-terminal award specifically includes:
a positive reward is obtained when the agent progresses towards the target point, expressed as r_t_goal = c_r[d_{r-t}(t) − d_{r-t}(t−1)]; where c_r ∈ (0, 1] is a coefficient, set to 1 in this embodiment;
the danger reward r_dang ∈ [0, 1] is obtained when the minimum distance between the agent and the obstacles keeps decreasing, and it decreases accordingly; its expression is a function of d_min and the coefficient β (the formula appears only as an image in the source text), with 0 ≤ r_dang ≤ 1; where d_min is the minimum distance between the agent and the obstacles, and β is a coefficient such that the value space of r_dang is (0, 1); d_{r-t}(t) denotes the Euclidean distance between the current position coordinate of the agent and the target position coordinate at time t;
when the included angle between the advancing-direction vector of the agent and the direction vector from the agent's current coordinate to the target position is less than ±18 degrees, a reward of 1 is obtained; when the included angle is greater than ±18 degrees and less than ±72 degrees, a reward of 0.3 is obtained; in all other cases the reward is 0. The expression is:
reward = 1, if |a_ori| < 18°; reward = 0.3, if 18° ≤ |a_ori| < 72°; reward = 0, otherwise;
where a_ori is the included angle between the advancing-direction vector of the agent and the direction vector from the agent's current coordinate to the target position.
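Under the definitions above, a hedged Python sketch of the continuous combined reward follows. The exact danger-reward formula is not reproduced in the text, so 1 − exp(−β·d_min) is used purely as a stand-in, summing the non-terminal terms is our assumption, the threshold values are placeholders, and the sign convention of the progress term follows the text verbatim:

```python
import math

def combined_reward(d_rt, d_rt_prev, d_ro, d_min, angle_deg,
                    d_win=5.0, d_col=3.0, c_r=1.0, beta=0.05):
    """Continuous combined reward: terminal rewards plus non-terminal shaping terms."""
    # Terminal rewards
    if d_rt <= d_win:
        return 100.0                          # reached the target point
    if d_ro <= d_col:
        return -100.0                         # collided with an obstacle

    # Non-terminal rewards
    r_goal = c_r * (d_rt - d_rt_prev)         # progress term as written in the text
    r_dang = 1.0 - math.exp(-beta * d_min)    # stand-in for the danger reward in (0, 1)
    if abs(angle_deg) < 18.0:
        r_angle = 1.0
    elif abs(angle_deg) < 72.0:
        r_angle = 0.3
    else:
        r_angle = 0.0
    return r_goal + r_dang + r_angle          # combination by summation is an assumption
```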
S36, storing the sample data of this interaction into the experience pool, wherein the sample data comprises the current state s_t, the executed action a_t, the obtained reward value r_t and the next state s_{t+1}; the sample data is stored in the experience pool in the form
(s_t, a_t, r_t, s_{t+1}).
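A minimal experience-pool sketch consistent with this storage format, using a fixed-capacity buffer with uniform random sampling; the capacity and batch size are illustrative:

```python
import random
from collections import deque

class ExperiencePool:
    """Stores interaction tuples (s_t, a_t, r_t, s_next) and samples mini-batches."""

    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def store(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size=64):
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states = zip(*batch)
        return states, actions, rewards, next_states
```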
And S4, loading the trained autonomous navigation system onto the intelligent agent, and enabling the intelligent agent to obtain autonomous navigation capability.
The specific steps of the local obstacle avoidance depth neural network module and the global navigation depth neural network module in the S33 are as follows:
s331, the current value network receives the input current state information, processes the current state information according to the input current state information and outputs a group of actions, wherein the processing process comprises the step of extracting training samples from an experience pool according to the current state information; and passing the output set of actions to the target value network.
The output of the target value network is:
Q_target = Σ_{i=0}^{λ-1} γ^i r_{t+i} + γ^λ Q_target(s_{t+λ}, argmax_a Q(s_{t+λ}, a, θ), θ')
the loss function is:
L(θ) = E[(Q_target − Q(s, a, θ))²]
the data stored in the experience pool are as follows:
(s_t, a_t, Σ_{i=0}^{λ-1} γ^i r_{t+i}, s_{t+λ})
wherein t is a certain moment, s is a state, and i is the index of the reward value obtained in each state after state t; s_t is the current state, a is the executed action, a_t denotes the action executed at time t, and r_t denotes the immediate reward value obtained by the agent at time t; γ is the discount factor, γ^i expresses the degree of influence of the reward value r_{t+i} obtained in the (t+i)-th state on the current state t, and γ is a value less than 1 and greater than 0; λ is the number of interval steps, and s_{t+λ} is the state after λ steps.
In single-step reinforcement learning, a training sample tuple contains the current state s_t, the action a_t taken in the current state, the obtained reward value r_t and the next state s_{t+1}; thus a training sample tuple is (s_t, a_t, r_t, s_{t+1}). In multi-step reinforcement learning, a training sample tuple contains the current state s_t, the action a_t taken in the current state, the obtained reward value r_t, the reward value r_{t+1} obtained in the next state, the reward value r_{t+2} obtained in the state after that, and so on until the λ-th state, in which the obtained reward value is r_{t+λ} and whose state is s_{t+λ}; thus a training sample tuple is (s_t, a_t, r_t, r_{t+1}, r_{t+2}, …, r_{t+i}, …, r_{t+λ}, s_{t+λ}). Therefore i is the index of the reward value obtained in each state after state t.
γ is the discount factor, and γ^i expresses the degree of influence of the reward value r_{t+i} obtained in the (t+i)-th state on the current state t; γ is a value less than 1 and greater than 0. For example, letting γ equal 0.5, the target value expands to
Q_target = r_t + 0.5 r_{t+1} + 0.25 r_{t+2} + … + 0.5^{λ-1} r_{t+λ-1} + 0.5^λ Q_target(s_{t+λ}, argmax_a Q(s_{t+λ}, a, θ), θ'),
so that rewards obtained further in the future have a geometrically smaller influence on the current state t.
γ^λ expresses the degree of influence of the reward value r_{t+λ} obtained in the (t+λ)-th state on the current state t. Q(s_{t+λ}, a, θ) denotes the estimated value of each action output by the current value network given the input information (s_{t+λ}, a). Q_target(s_{t+λ}, argmax_a Q(s_{t+λ}, a, θ)) means that the action instruction corresponding to the maximum estimated value output by the current value network, together with s_{t+λ}, is used as the input of the target value network, which then outputs the estimated value of each action. Q(s, a, θ) denotes the estimated value of each action output by the current value network.
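A hedged sketch of how this multi-step double-Q target could be computed, assuming PyTorch and the hypothetical networks current_net / target_net with the same output layout; the index conventions of the patent are paraphrased:

```python
import torch

def multi_step_target(rewards, s_t_lambda, gamma, current_net, target_net, done):
    """Multi-step DDQN target: discounted reward sum plus a double-Q bootstrap.

    rewards     -- list [r_t, r_{t+1}, ..., r_{t+lambda-1}] collected over lambda steps
    s_t_lambda  -- state reached after lambda steps (tensor)
    done        -- True if the episode terminated before lambda steps completed
    """
    lam = len(rewards)
    # Accumulated discounted reward over the lambda interval steps
    g = sum((gamma ** i) * r for i, r in enumerate(rewards))
    if done:
        return g
    with torch.no_grad():
        # Action selected by the current value network ...
        a_star = current_net(s_t_lambda).argmax(dim=-1, keepdim=True)
        # ... evaluated by the target value network (double Q-learning)
        bootstrap = target_net(s_t_lambda).gather(-1, a_star).squeeze(-1)
    return g + (gamma ** lam) * bootstrap
```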
And S332, the target value network processes according to the transmitted group of actions, calculates the value of each action, and transmits the value of the group of actions back to the current value network.
And S333, selecting the next action to be executed by the current value network according to the maximum value in the value of the group of actions returned by the target value network, and outputting a control instruction to the simulation environment according to the next action.
S334, the current value network calculates to obtain next step state information according to the next step action in S333, processes the next step state information and outputs a group of actions, wherein the processing process comprises the step of extracting training samples from an experience pool by the next step state information; and transmitting the output set of actions to the target value network; and the target value network repeats the step S332, that is, the target value network processes according to the transmitted group of actions, calculates the value of each action, and transmits the value of the group of actions back to the current value network.
S335, the current value network compares the value of the group of actions returned by the target value network in S334 with the value of the group of actions returned by the target value network in S332, and calculates an error according to an error function.
And S336, updating the weight of the current value network according to the error obtained in the S335, and updating the weight of the target value network according to the weight of the current value network after every N steps.
S337, at the same time as step S334 is carried out, the current value network outputs a control instruction to the simulation environment according to the selected next action.
In order to improve the learning ability of the neural network and inhibit gradient disappearance in the neural network, the present embodiment adopts the combination of ReLU6 and ReLU as the activation functions in the neural network framework, that is, the activation functions adopted by the local obstacle avoidance deep neural network module and the global navigation deep neural network module are ReLU6 and ReLU; the activation function ReLU6 is applied to the front end of the neural network, and the activation function ReLU is applied to the back end of the neural network.
The formula of the ReLU function is as follows.
ReLU(x) = max(0, x)
A graph of the ReLU function plotted with Python is shown in fig. 9: when the input value is negative or 0, the output of ReLU is 0, but when the input value is greater than 0, ReLU outputs the input value itself. This unilateral activation characteristic of ReLU gives the neurons of the neural network a sparse activation behaviour. ReLU alleviates the gradient-vanishing problem that easily occurs with the Sigmoid and Tanh functions, so that the convergence of the neural network is more stable.
The ReLU6 is an improved activation function obtained by improving the ReLU, and the formula is as follows:
ReLU6(x) = min(max(0, x), 6)
A graph of the ReLU6 function plotted with Python is shown in fig. 10. ReLU6 mainly modifies the positive part of the ReLU input: when the input value is greater than 6, the output of ReLU6 is always 6; if the input value is a real number greater than 0 and less than 6, the input itself is output; otherwise 0 is output. ReLU6 encourages the neural network model to learn the sparse characteristics of the input data earlier. Both the ReLU and ReLU6 activation functions avoid the gradient-vanishing phenomenon. The ReLU6 function is used as the activation function at the front end of the network, which is beneficial for quickly learning the sparse characteristics of the data samples. The network finally outputs an evaluation value for each behavior action, and the action corresponding to the highest evaluation value is selected as the action to be executed by the underwater robot.
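The two activation functions as plain Python/NumPy expressions, a direct transcription of the formulas above:

```python
import numpy as np

def relu(x):
    """ReLU(x) = max(0, x)."""
    return np.maximum(0.0, x)

def relu6(x):
    """ReLU6(x) = min(max(0, x), 6): the positive part is clipped at 6."""
    return np.minimum(np.maximum(0.0, x), 6.0)
```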
Underwater robot navigation models based on different algorithms are trained to verify the effectiveness of the MS-DDQN algorithm. First, as shown in fig. 7, the established simulation environment model is used as the training environment of the underwater robot; this environment is denoted training environment 1 (Env-1). To verify the effectiveness of the MS-DDQN method, we tested the navigation capability of the underwater robot in Env-1, comparing the MS-DDQN algorithm with the DDQN, prioritized DQN and prioritized DDQN algorithms. To ensure the fairness of the experiment, the same network structure and the same software and hardware platform are used for model training. Before training, the relevant hyper-parameters of deep reinforcement learning are set as shown in fig. 15. In order to quantitatively evaluate the performance of each algorithm, we use three indicators to assess the quality of the navigation model. The first is the success rate, which represents the proportion of runs in which the underwater robot successfully reaches the target position out of the total number of training runs after training starts. The second is the reward value curve, which represents the sum of the reward values obtained in each round of training; to smooth the reward curve we process it with a sliding average with a window size of 300. The third is the average obtained reward, i.e. the total reward obtained by the underwater robot during training divided by the number of training rounds. The autonomous navigation capability of the underwater robot trained in environment 1 with the MS-DDQN, DDQN, prioritized DQN and prioritized DDQN algorithms is shown in fig. 11.
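The sliding-average smoothing of the reward curve (window size 300) can be reproduced with a few lines of NumPy; this is a generic moving average, not code from the patent:

```python
import numpy as np

def moving_average(rewards, window=300):
    """Smooth a per-episode reward curve with a sliding-window average."""
    rewards = np.asarray(rewards, dtype=np.float64)
    if len(rewards) < window:
        return rewards
    kernel = np.ones(window) / window
    return np.convolve(rewards, kernel, mode="valid")
```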
As shown in fig. 12(a), the success rate curve of MS-DDQN rises faster than those of the other three methods, which indicates that the learning efficiency of the MS-DDQN algorithm is higher; this is also demonstrated by the reward curves in fig. 12(b). After 3000 training rounds, the success rate of reaching the target position is 80.133% for MS-DDQN, 61.7% for DDQN, 63.633% for prioritized DQN and 53.366% for prioritized DDQN, so the success rate of MS-DDQN is much higher than that of the other algorithms. This shows that the MS-DDQN-based underwater robot completes more collision-free, target-reaching rounds during training and has stronger obstacle avoidance and navigation ability. In fig. 12(b) it can be seen that after 500 training rounds the reward curve obtained by MS-DDQN stabilizes above 200, while the curves of the other three algorithms fluctuate more strongly, which indicates that the navigation model based on MS-DDQN has higher stability. In fig. 12(c), the average reward value is 185.072 for MS-DDQN, 130.064 for DDQN, 132.067 for prioritized DQN and 101.650 for prioritized DDQN, which also demonstrates that the MS-DDQN-based underwater robot has stronger navigation capability, since a lower reward value means many negative rewards, i.e. more collisions of the underwater robot. By analyzing the success rate curves, the per-round reward value curves and the average rewards of the navigation models based on the different algorithms during training, it can be seen that the underwater robot based on the MS-DDQN algorithm has higher learning efficiency during training than the other three algorithms, and that the trained navigation model is more stable.
Testing the navigation capability and the generalization capability of the navigation model: after 3000 rounds of training in environment 1, navigation models based on the MS-DDQN, DDQN, prioritized DQN and prioritized DDQN algorithms were obtained. These navigation models were first tested 200 times in environment 1 and the proportion of successful arrivals at the target position was analyzed. In the 200 tests, the start and target positions of the underwater robot were randomly assigned. The success rate of reaching the target position over the 200 tests and the average obtained reward are compared to measure the quality of the navigation models based on the different algorithms: the higher the success rate and the higher the average reward, the better the navigation strategy. The results are shown in fig. 16. After 3000 rounds of training, the underwater robots trained with the four algorithms have all basically learned how to avoid obstacles and reach the target position in environment 1. According to the test results, the MS-DDQN algorithm performs best, with a success rate of 100% and the highest average reward; this shows that the underwater robot based on the MS-DDQN algorithm has stronger obstacle avoidance capability and a better navigation strategy. The navigation trajectory of the MS-DDQN-trained underwater robot in Env-1 is shown in fig. 11. In order to fully evaluate the generalization capability of the navigation models based on the different algorithms, four test environments different from the training environment were additionally designed, with sizes of 500 × 500, 600 × 600, 700 × 700 and 800 × 800, denoted environment 2, environment 3, environment 4 and environment 5 respectively. As in the training environment, the start position and the target position of the underwater robot in each test environment are randomly initialized in the start area 41 and the target area 31, i.e. the rectangular shaded areas 41 and 31. The navigation models trained with MS-DDQN, DDQN, prioritized DQN and prioritized DDQN were each tested 200 times in the four test environments. As shown in fig. 13, the navigation trajectories of the MS-DDQN-based underwater robot in the four unknown complex test environments show that the navigation model trained with MS-DDQN has strong generalization capability and can adapt to new unknown environments without retraining.
The results of the 200 test rounds in the four different test environments are shown in fig. 17: the success rate of the MS-DDQN-trained navigation model is 97% in Env-2, 91% in Env-3, 94% in Env-4 and 96% in Env-5. In contrast, the navigation models trained with the other three algorithms do not reach a 90% success rate in any test environment, and the success rate of DDQN in Env-3 is only 46%. These results show that the navigation model based on MS-DDQN has strong generalization capability, so the underwater robot can navigate in new unknown environments without retraining. The test results also confirm the conclusion drawn from fig. 12(b), namely that the navigation strategy trained with MS-DDQN is more stable than those trained with DDQN, prioritized DQN and prioritized DDQN. The generalization capability of prioritized DQN and prioritized DDQN is better than that of DDQN, because they perform targeted training and learning on collision samples during training and therefore acquire stronger navigation capability. The above experiments show that the generalization capability of the MS-DDQN-trained navigation model is better than that of DDQN, prioritized DQN and prioritized DDQN. The reason is that MS-DDQN spreads the reward values obtained during training across states several steps apart, so the underwater robot learns autonomous navigation more quickly; it also helps the underwater robot to perceive the positions of obstacles and the target in advance and to make evasive actions, or actions tending towards the target point, ahead of time.
This embodiment employs a geometric approach to simulate a 2-dimensional underwater environment containing many types of dense obstacles. The effectiveness of the MS-DDQN algorithm is verified by comparing the navigation capability of the underwater robot trained with different algorithms in the simulated training environment. Meanwhile, navigation tests are carried out in four test environments completely different from the training environment, and the experiments prove that the underwater robot trained with the MS-DDQN algorithm has stronger generalization capability and can adapt to new obstacle environments without retraining.
It should be understood that the above-mentioned embodiments of the present invention are only examples for clearly illustrating the technical solutions of the present invention, and are not intended to limit the specific embodiments of the present invention. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention claims should be included in the protection scope of the present invention claims.

Claims (10)

1. An intelligent agent autonomous navigation method based on deep reinforcement learning is characterized by comprising the following steps:
s1, constructing an intelligent autonomous navigation system, wherein the intelligent autonomous navigation system adopts an MS-DDQN algorithm, namely a DDQN algorithm facing a multi-step mechanism; the MS-DDQN algorithm adopts a modularized neural network, and the modularized neural network comprises a local obstacle avoidance depth neural network module, a global navigation depth neural network module and an instruction selection module; the local obstacle avoidance depth neural network module is used for guiding the intelligent agent to be far away from an obstacle, the global navigation depth neural network module is used for guiding the intelligent agent to move to a target position towards a closer path, and the instruction selection module is used for determining a finally executed action instruction;
s2, building a simulation environment, including building an obstacle environment model and building a simulation intelligent agent;
s3, placing the autonomous navigation system in the simulation environment for training, namely, the intelligent agent adopts the MS-DDQN algorithm to train and learn in the simulation environment; the simulation environment is multiple, and the training times of each simulation environment are multiple;
and S4, loading the trained autonomous navigation system onto the intelligent agent, and enabling the intelligent agent to obtain autonomous navigation capability.
2. The method according to claim 1, wherein the MS-DDQN algorithm comprises a current value network for selecting an action, a target value network for evaluating the action, an error function for updating the weights, a reward function giving the reward value obtained when the agent takes an action in the current state and reaches the next state, and an experience pool for storing the sample data generated at each step.
3. The intelligent agent autonomous navigation method based on deep reinforcement learning according to claim 2, wherein the output function of the target value network is:
Q_target = Σ_{i=0}^{λ-1} γ^i r_{t+i} + γ^λ Q_target(s_{t+λ}, argmax_a Q(s_{t+λ}, a, θ), θ')
where γ is the discount factor, γ^i expresses the degree of influence of the reward value r_{t+i} obtained in the (t+i)-th state on the current state t, γ is a value smaller than 1 and larger than 0, and γ^λ expresses the degree of influence of the reward value r_{t+λ} obtained in the (t+λ)-th state on the current state t; Q is the state-action value estimate, λ is the number of interval steps, s_t is the current state, a_t is the action performed in the current state, r_t denotes the immediate reward value obtained by the agent at time t, r_{t+λ} is the reward value obtained in the λ-th state, and s_{t+λ} is the state after the λ interval steps; θ is the weight parameter of the current value network and θ' is the weight parameter of the target value network; i is the index of the reward value obtained in each state after state t; Q(s_{t+λ}, a, θ) denotes the estimated value of each action output by the current value network given the input information (s_{t+λ}, a); Q_target(s_{t+λ}, argmax Q(s_{t+λ}, a, θ)) means that the action instruction corresponding to the maximum estimated value output by the current value network, together with s_{t+λ}, is first taken as the input of the target value network, which then outputs the estimated value of each action;
the loss function is:
L(θ) = E[(Q_target − Q(s, a, θ))²]
wherein E is the neural network error, s is the state, a is the executed action, θ is the weight parameter of the current value network, Q is the state-action value estimate, and Q(s, a, θ) represents the estimated value of each action output by the current value network;
the data stored in the experience pool are as follows:
(s_t, a_t, Σ_{i=0}^{λ-1} γ^i r_{t+i}, s_{t+λ})
wherein t is a certain moment, s is a state, and i is the index of the reward value obtained in each state after state t; s_t is the current state, a is the executed action, a_t denotes the action executed at time t, and r_t denotes the immediate reward value obtained by the agent at time t; γ is the discount factor, γ^i expresses the degree of influence of the reward value r_{t+i} obtained in the (t+i)-th state on the current state t, and γ is a value less than 1 and greater than 0; λ is the number of interval steps, and s_{t+λ} is the state after the λ interval steps.
4. The intelligent agent autonomous navigation method based on deep reinforcement learning of claim 3, wherein the training method of the MS-DDQN algorithm is as follows:
randomly initializing the weight θ of the current value network Q(s_t, a; θ) and the weight θ' of the target value network Q_target(s_t, a; θ'), where Q(s_t, a; θ) represents the estimated value of each action output by the current value network;
initializing the experience pool D and setting the hyperparameter lambda,
For episode=1,M do
resetting the simulation environment, obtaining the initial observation state s_t, setting T ← ∞, and initializing four empty arrays S_t, A, R, S_{t+1}; wherein the array S_t is declared for storing the state information of the current state; the array A is declared for storing the action executed in the current state; the array R is declared for storing the reward value obtained in the current state; the array S_{t+1} is declared for storing the next-state information; T is mainly used, when the training of the current round ends, for judging whether the data acquired in the current round has been stored into the experience pool;
For t=1,2…do
If t<T then
selecting action a_t = argmax_a Q(s_t, a; θ) according to the current policy, performing action a_t, returning the reward value r_t and the new state s_{t+1}; storing s_t in S_t, r_t in R, a_t in A, and s_{t+1} in S_{t+1}; where t indexes the environment state data obtained by the agent in the current round, a_t represents the action executed at time t, and Q(s, a, θ) represents the estimated value of each action output by the current value network;
If s_{t+1} is the terminal state then
T←t+1
τ←t-λ+1
If τ ≥ 0 then
If τ + λ < T then
r_τ ← Σ_{i=0}^{λ-1} γ^i r_{τ+i}
else
r_τ ← Σ_{i=0}^{T-τ-1} γ^i r_{τ+i}
Storing (s_τ, a_τ, r_τ, s_{τ+λ}) in the experience pool D,
randomly extracting mini-batch sample data from D,
setting:
y_i = r_i + γ^λ Q_target(s_{i+λ}, argmax_a Q(s_{i+λ}, a, θ), θ')
using the loss function L(θ) = E[(y_i − Q(s, a, θ))²] to update the current value network weight θ by gradient descent
Until τ = T − 1;
τ is mainly used for judging whether the number of actions executed by the underwater robot exceeds the set step number λ. If it does, the value of τ is greater than or equal to 0, indicating that the agent has obtained environment state data for at least λ steps; at this moment the influence r_τ of the reward values obtained over the future λ steps on the τ state can be calculated, the state information s_τ, the action a_τ and the state information s_{τ+λ} are extracted from the arrays S_t, A and S_{t+1}, and these three pieces of information together with r_τ form a training sample tuple that is stored in the experience pool D. When the agent starts a round of training it first sets T ← ∞, so that t < T and the second for loop is entered; when a collision occurs or the target position is reached, T ← t + 1. If τ = T − 1, it indicates that all the environment state information obtained by the agent in this round has been stored as training sample data according to the multi-step method. When τ + λ < T, the agent has not collided and the number of environment state records to be stored is greater than or equal to the set value of λ, so r_τ is calculated through
r_τ = Σ_{i=0}^{λ-1} γ^i r_{τ+i},
i.e. the influence on the τ state of the reward values obtained in the future λ states. Otherwise, the agent has collided and the number of environment state records to be stored is smaller than the set value of λ, so r_τ is calculated through
r_τ = Σ_{i=0}^{T-τ-1} γ^i r_{τ+i},
i.e. the influence r_τ on the states τ (τ+1, τ+2, …, T−1) of the remaining future λ−1 (λ−2, λ−3, …, 1) reward values. y_i denotes the i-th sample in the mini-batch extracted from the experience pool: the estimated value of each action estimated by the target value network is added to the actual r_i obtained in that state, giving a comprehensive reward for the current state that is used for the gradient descent operation together with the current value network.
5. The method for intelligent agent autonomous navigation based on deep reinforcement learning of claim 1, wherein the training step in S3 specifically comprises:
s31, acquiring current state information of the intelligent agent by the simulation environment, wherein the current state information comprises distance information between the intelligent agent and an obstacle in the environment and position relation information between the intelligent agent and a target point; the position relation information of the intelligent body and the target point comprises a relative coordinate relation between the current coordinate of the intelligent body and the coordinate of the target position, an Euclidean distance from the current position of the intelligent body to the target position and an included angle between a vector of the advancing direction of the intelligent body and a vector of the direction from the current coordinate position of the intelligent body to the target position, wherein the relative coordinate relation is obtained by subtracting the current coordinate of the underwater intelligent body from the coordinate of the target position;
s32, inputting the acquired current state information into the modular deep neural network, specifically inputting distance information between an intelligent body and an obstacle in the environment and position relation information between the intelligent body and a target point into the local obstacle avoidance deep neural network module, and inputting position relation information between the intelligent body and the target point into the global navigation deep neural network module;
s33, the local obstacle avoidance depth neural network module and the global navigation depth neural network module output respective control instructions according to the input current state information;
s34, the instruction selection module determines to use the action instruction output by the local obstacle avoidance depth neural network module or the global navigation depth neural network module by judging the distance value between the intelligent body and the nearest obstacle;
s35, the intelligent agent executes the instruction selected by the instruction selection module, receives the reward value output by the reward function and enters the next state;
s36, storing the sample data of this interaction into the experience pool, wherein the sample data comprises the current state s_t, the executed action a_t, the obtained reward value r_t and the next state s_{t+1}; the sample data is stored in the experience pool in the form
(s_t, a_t, r_t, s_{t+1}).
6. The intelligent autonomous navigation method based on deep reinforcement learning of claim 5, wherein the local obstacle avoidance deep neural network module and the global navigation deep neural network module comprise the following specific steps:
s331, the current value network receives the input current state information, processes the current state information according to the input current state information and outputs a group of actions, wherein the processing process comprises the step of extracting training samples from an experience pool according to the current state information; and transmitting the output set of actions to the target value network;
s332, the target value network processes the group of actions according to the input actions, calculates the value of each action, and transmits the value of the group of actions back to the current value network;
s333, the current value network selects the next action to be executed according to the maximum value in the value of the group of actions returned by the target value network, and outputs a control instruction to the simulation environment according to the next action;
s334, the current value network calculates to obtain next step state information according to the next step action in S333, processes the next step state information and outputs a group of actions, wherein the processing process comprises the step of extracting training samples from an experience pool by the next step state information; and transmitting the output set of actions to the target value network; the target value network repeats step S332, that is, the target value network processes according to the transmitted group of actions, calculates the value of each action, and then transmits the value of the group of actions back to the current value network;
s335, the current value network compares the value of a group of actions returned by the target value network in the S334 with the value of a group of actions returned by the target value network in the S332, and calculates an error according to an error function;
s336, updating the weight of the current value network according to the error obtained in S335, and updating the weight of the target value network according to the weight of the current value network after every N steps;
s337, at the same time as step S334 is carried out, the current value network outputs a control instruction to the simulation environment according to the selected next action.
7. The intelligent agent autonomous navigation method based on deep reinforcement learning of claim 5, characterized in that the reward function is a continuous combined reward function, and the continuous combined reward function comprises terminal reward and non-terminal reward; the terminal reward specifically comprises:
a positive reward is obtained when the agent reaches the target point, expressed as r_arr = 100, if d_{r-t} ≤ d_win; where d_{r-t} is the Euclidean distance from the agent to the target point and d_win is the threshold for the agent to reach the target point; when d_{r-t} is smaller than the set d_win, the target point is considered reached, otherwise the target point has not been reached; r_arr is the positive reward value given when the agent reaches the target position;
a negative reward is obtained when the agent collides with an obstacle, expressed as r_col = −100, if d_{r-o} ≤ d_col; where d_{r-o} is the Euclidean distance from the agent to the nearest obstacle and d_col is the threshold at which the agent is considered to collide with an obstacle; when d_{r-o} is less than or equal to d_col, a collision has occurred, otherwise no collision has occurred; r_col is the punitive negative reward value given when the agent collides;
the non-terminal award specifically includes:
a positive reward is obtained when the agent progresses towards the target point, expressed as r_t_goal = c_r[d_{r-t}(t) − d_{r-t}(t−1)]; where c_r ∈ (0, 1] is a coefficient, set to 1;
the danger reward r_dang ∈ [0, 1] is obtained when the minimum distance between the agent and the obstacles keeps decreasing, and it decreases accordingly; its expression is a function of d_min and the coefficient β (the formula appears only as an image in the source text);
wherein d_min is the minimum distance between the agent and the obstacles, and β is a coefficient such that the value space of r_dang is (0, 1); d_{r-t}(t) represents the Euclidean distance between the current position coordinate of the agent and the target position coordinate at time t;
when the included angle between the advancing-direction vector of the agent and the direction vector from the agent's current coordinate to the target position is less than ±18 degrees, a reward of 1 is obtained; when the included angle is greater than ±18 degrees and less than ±72 degrees, a reward of 0.3 is obtained; in all other cases the reward is 0. The expression is:
reward = 1, if |a_ori| < 18°; reward = 0.3, if 18° ≤ |a_ori| < 72°; reward = 0, otherwise;
wherein a_ori is the included angle between the advancing-direction vector of the agent and the direction vector from the agent's current coordinate to the target position.
8. The intelligent agent autonomous navigation method based on deep reinforcement learning according to any one of claims 1 to 7, characterized in that the local obstacle avoidance depth neural network module and the global navigation depth neural network module use activation functions of ReLU6 and ReLU; the activation function ReLU6 is applied to the front end of the neural network, and the activation function ReLU is applied to the back end of the neural network.
9. The intelligent autonomous navigation method based on deep reinforcement learning according to any one of claims 1 to 7, wherein the local obstacle avoidance deep neural network module and the global navigation deep neural network module both adopt a fully connected structure, the number of hidden layers of the local obstacle avoidance neural network module is more than three, and the number of hidden layers of the global navigation neural network module is one.
10. The intelligent agent autonomous navigation method based on deep reinforcement learning according to any one of claims 1 to 7, characterized in that the instruction selection module is provided with a threshold value and selects the control instruction according to this threshold value; when the distance value between the agent and the nearest obstacle is smaller than the threshold value of 40, the control instruction output by the local obstacle avoidance deep neural network module is selected; when the distance value is greater than or equal to 40, the control instruction output by the global navigation neural network module is selected.
CN202011023274.4A 2020-09-25 2020-09-25 Intelligent autonomous navigation method based on deep reinforcement learning Active CN112179367B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011023274.4A CN112179367B (en) 2020-09-25 2020-09-25 Intelligent autonomous navigation method based on deep reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011023274.4A CN112179367B (en) 2020-09-25 2020-09-25 Intelligent autonomous navigation method based on deep reinforcement learning

Publications (2)

Publication Number Publication Date
CN112179367A true CN112179367A (en) 2021-01-05
CN112179367B CN112179367B (en) 2023-07-04

Family

ID=73943509

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011023274.4A Active CN112179367B (en) 2020-09-25 2020-09-25 Intelligent autonomous navigation method based on deep reinforcement learning

Country Status (1)

Country Link
CN (1) CN112179367B (en)

Cited By (25)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112904848A (en) * 2021-01-18 2021-06-04 长沙理工大学 Mobile robot path planning method based on deep reinforcement learning
CN112925307A (en) * 2021-01-20 2021-06-08 中国科学院重庆绿色智能技术研究院 Distributed multi-robot path planning method for intelligent warehousing robot system
CN112926729A (en) * 2021-05-06 2021-06-08 中国科学院自动化研究所 Man-machine confrontation intelligent agent strategy making method
CN112947421A (en) * 2021-01-28 2021-06-11 西北工业大学 AUV autonomous obstacle avoidance method based on reinforcement learning
CN113033118A (en) * 2021-03-10 2021-06-25 山东大学 Autonomous floating control method of underwater vehicle based on demonstration data reinforcement learning technology
CN113064422A (en) * 2021-03-09 2021-07-02 河海大学 Autonomous underwater vehicle path planning method based on double neural network reinforcement learning
CN113146624A (en) * 2021-03-25 2021-07-23 重庆大学 Multi-agent control method based on maximum angle aggregation strategy
CN113156980A (en) * 2021-05-28 2021-07-23 山东大学 Tower crane path planning method and system based on deep reinforcement learning
CN113218399A (en) * 2021-05-12 2021-08-06 天津大学 Maze navigation method and device based on multi-agent layered reinforcement learning
CN113269315A (en) * 2021-06-29 2021-08-17 安徽寒武纪信息科技有限公司 Apparatus, method and readable storage medium for performing task using deep reinforcement learning
CN113312874A (en) * 2021-06-04 2021-08-27 福州大学 Overall wiring method based on improved deep reinforcement learning
CN113359717A (en) * 2021-05-26 2021-09-07 浙江工业大学 Mobile robot navigation obstacle avoidance method based on deep reinforcement learning
CN113485367A (en) * 2021-08-06 2021-10-08 浙江工业大学 Path planning method of multifunctional stage mobile robot
CN113691334A (en) * 2021-08-23 2021-11-23 广东工业大学 Cognitive radio dynamic power distribution method based on secondary user group cooperation
CN113759901A (en) * 2021-08-12 2021-12-07 杭州电子科技大学 Mobile robot autonomous obstacle avoidance method based on deep reinforcement learning
CN113805597A (en) * 2021-09-28 2021-12-17 福州大学 Obstacle self-protection artificial potential field method local path planning method based on particle swarm optimization
CN114355915A (en) * 2021-12-27 2022-04-15 杭州电子科技大学 AGV path planning based on deep reinforcement learning
CN114354082A (en) * 2022-03-18 2022-04-15 山东科技大学 Intelligent tracking system and method for submarine pipeline based on imitated sturgeon whiskers
CN114415663A (en) * 2021-12-15 2022-04-29 北京工业大学 Path planning method and system based on deep reinforcement learning
CN114603564A (en) * 2022-04-28 2022-06-10 中国电力科学研究院有限公司 Mechanical arm navigation obstacle avoidance method and system, computer equipment and storage medium
CN114964268A (en) * 2022-07-29 2022-08-30 白杨时代(北京)科技有限公司 Unmanned aerial vehicle navigation method and device
CN114995468A (en) * 2022-06-06 2022-09-02 南通大学 Intelligent control method of underwater robot based on Bayesian depth reinforcement learning
CN116443217A (en) * 2023-06-16 2023-07-18 中交一航局第一工程有限公司 Piling ship parking control method and device, piling ship and storage medium
TWI815613B (en) * 2022-08-16 2023-09-11 和碩聯合科技股份有限公司 Navigation method for robot and robot thereof
CN116755329A (en) * 2023-05-12 2023-09-15 江南大学 Multi-agent danger avoiding and escaping method and device based on deep reinforcement learning

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107102644A (en) * 2017-06-22 2017-08-29 华南师范大学 The underwater robot method for controlling trajectory and control system learnt based on deeply
CN108803321A (en) * 2018-05-30 2018-11-13 清华大学 Autonomous Underwater Vehicle Trajectory Tracking Control method based on deeply study
CN109540151A (en) * 2018-03-25 2019-03-29 哈尔滨工程大学 A kind of AUV three-dimensional path planning method based on intensified learning
CN110333739A (en) * 2019-08-21 2019-10-15 哈尔滨工程大学 A kind of AUV conduct programming and method of controlling operation based on intensified learning
CN110515303A (en) * 2019-09-17 2019-11-29 余姚市浙江大学机器人研究中心 A kind of adaptive dynamic path planning method based on DDQN

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107102644A (en) * 2017-06-22 2017-08-29 华南师范大学 The underwater robot method for controlling trajectory and control system learnt based on deeply
CN109540151A (en) * 2018-03-25 2019-03-29 哈尔滨工程大学 A kind of AUV three-dimensional path planning method based on intensified learning
CN108803321A (en) * 2018-05-30 2018-11-13 清华大学 Autonomous Underwater Vehicle Trajectory Tracking Control method based on deeply study
CN110333739A (en) * 2019-08-21 2019-10-15 哈尔滨工程大学 A kind of AUV conduct programming and method of controlling operation based on intensified learning
CN110515303A (en) * 2019-09-17 2019-11-29 余姚市浙江大学机器人研究中心 A kind of adaptive dynamic path planning method based on DDQN

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
LEI XIAOYUN等: "Dynamic Path Planning of Unknown Environment Based on Deep Reinforcement Learning", 《JOURNAL OF ROBOTICS》 *
YINLONG YUAN等: "A novel multi-step Q-learning method to improve data efficiency for deep reinforcement learning", 《KNOWLEDGE-BASED SYSTEMS》 *
李志航: "基于深度递归强化学习的无人自主驾驶策略研究", 《工业控制计算机》 *

Cited By (37)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112904848A (en) * 2021-01-18 2021-06-04 长沙理工大学 Mobile robot path planning method based on deep reinforcement learning
CN112904848B (en) * 2021-01-18 2022-08-12 长沙理工大学 Mobile robot path planning method based on deep reinforcement learning
CN112925307A (en) * 2021-01-20 2021-06-08 中国科学院重庆绿色智能技术研究院 Distributed multi-robot path planning method for intelligent warehousing robot system
CN112947421A (en) * 2021-01-28 2021-06-11 西北工业大学 AUV autonomous obstacle avoidance method based on reinforcement learning
CN113064422A (en) * 2021-03-09 2021-07-02 河海大学 Autonomous underwater vehicle path planning method based on double neural network reinforcement learning
CN113064422B (en) * 2021-03-09 2022-06-28 河海大学 Autonomous underwater vehicle path planning method based on double neural network reinforcement learning
CN113033118A (en) * 2021-03-10 2021-06-25 山东大学 Autonomous floating control method of underwater vehicle based on demonstration data reinforcement learning technology
CN113146624B (en) * 2021-03-25 2022-04-29 重庆大学 Multi-agent control method based on maximum angle aggregation strategy
CN113146624A (en) * 2021-03-25 2021-07-23 重庆大学 Multi-agent control method based on maximum angle aggregation strategy
CN112926729A (en) * 2021-05-06 2021-06-08 中国科学院自动化研究所 Man-machine confrontation intelligent agent strategy making method
CN112926729B (en) * 2021-05-06 2021-08-03 中国科学院自动化研究所 Man-machine confrontation intelligent agent strategy making method
CN113218399B (en) * 2021-05-12 2022-10-04 天津大学 Maze navigation method and device based on multi-agent layered reinforcement learning
CN113218399A (en) * 2021-05-12 2021-08-06 天津大学 Maze navigation method and device based on multi-agent layered reinforcement learning
CN113359717A (en) * 2021-05-26 2021-09-07 浙江工业大学 Mobile robot navigation obstacle avoidance method based on deep reinforcement learning
CN113156980A (en) * 2021-05-28 2021-07-23 山东大学 Tower crane path planning method and system based on deep reinforcement learning
CN113312874A (en) * 2021-06-04 2021-08-27 福州大学 Overall wiring method based on improved deep reinforcement learning
CN113269315A (en) * 2021-06-29 2021-08-17 安徽寒武纪信息科技有限公司 Apparatus, method and readable storage medium for performing task using deep reinforcement learning
CN113269315B (en) * 2021-06-29 2024-04-02 安徽寒武纪信息科技有限公司 Apparatus, method and readable storage medium for performing tasks using deep reinforcement learning
CN113485367A (en) * 2021-08-06 2021-10-08 浙江工业大学 Path planning method of multifunctional stage mobile robot
CN113485367B (en) * 2021-08-06 2023-11-21 浙江工业大学 Path planning method for stage multifunctional mobile robot
CN113759901A (en) * 2021-08-12 2021-12-07 杭州电子科技大学 Mobile robot autonomous obstacle avoidance method based on deep reinforcement learning
CN113691334A (en) * 2021-08-23 2021-11-23 广东工业大学 Cognitive radio dynamic power distribution method based on secondary user group cooperation
CN113805597A (en) * 2021-09-28 2021-12-17 福州大学 Obstacle self-protection artificial potential field local path planning method based on particle swarm optimization
CN114415663A (en) * 2021-12-15 2022-04-29 北京工业大学 Path planning method and system based on deep reinforcement learning
CN114355915B (en) * 2021-12-27 2024-04-02 杭州电子科技大学 AGV path planning based on deep reinforcement learning
CN114355915A (en) * 2021-12-27 2022-04-15 杭州电子科技大学 AGV path planning based on deep reinforcement learning
CN114354082A (en) * 2022-03-18 2022-04-15 山东科技大学 Intelligent tracking system and method for submarine pipeline based on imitated sturgeon whiskers
CN114354082B (en) * 2022-03-18 2022-05-31 山东科技大学 Intelligent tracking system and method for submarine pipeline based on imitated sturgeon whisker
CN114603564A (en) * 2022-04-28 2022-06-10 中国电力科学研究院有限公司 Mechanical arm navigation obstacle avoidance method and system, computer equipment and storage medium
CN114603564B (en) * 2022-04-28 2024-04-12 中国电力科学研究院有限公司 Mechanical arm navigation obstacle avoidance method, system, computer equipment and storage medium
CN114995468A (en) * 2022-06-06 2022-09-02 南通大学 Intelligent control method of underwater robot based on Bayesian deep reinforcement learning
CN114964268A (en) * 2022-07-29 2022-08-30 白杨时代(北京)科技有限公司 Unmanned aerial vehicle navigation method and device
TWI815613B (en) * 2022-08-16 2023-09-11 和碩聯合科技股份有限公司 Navigation method for robot and robot thereof
CN116755329A (en) * 2023-05-12 2023-09-15 江南大学 Multi-agent danger avoiding and escaping method and device based on deep reinforcement learning
CN116755329B (en) * 2023-05-12 2024-05-24 江南大学 Multi-agent danger avoiding and escaping method and device based on deep reinforcement learning
CN116443217A (en) * 2023-06-16 2023-07-18 中交一航局第一工程有限公司 Piling ship parking control method and device, piling ship and storage medium
CN116443217B (en) * 2023-06-16 2023-08-22 中交一航局第一工程有限公司 Piling ship parking control method and device, piling ship and storage medium

Also Published As

Publication number Publication date
CN112179367B (en) 2023-07-04

Similar Documents

Publication Publication Date Title
CN112179367B (en) Intelligent autonomous navigation method based on deep reinforcement learning
Lauri et al. Planning for robotic exploration based on forward simulation
Liu et al. Robot navigation in crowded environments using deep reinforcement learning
Zhang et al. Autonomous navigation of UAV in multi-obstacle environments based on a deep reinforcement learning approach
Jesus et al. Deep deterministic policy gradient for navigation of mobile robots in simulated environments
Cao et al. Target search control of AUV in underwater environment with deep reinforcement learning
CN109540151A (en) AUV three-dimensional path planning method based on reinforcement learning
CN111880549B (en) Deep reinforcement learning reward function optimization method for unmanned ship path planning
CN113848974B (en) Aircraft trajectory planning method and system based on deep reinforcement learning
Wang et al. Cooperative collision avoidance for unmanned surface vehicles based on improved genetic algorithm
CN117590867B (en) Underwater autonomous vehicle connection control method and system based on deep reinforcement learning
Contreras et al. Using deep learning for exploration and recognition of objects based on images
Hien et al. Goal-oriented navigation with avoiding obstacle based on deep reinforcement learning in continuous action space
Mendonça et al. Reinforcement learning with optimized reward function for stealth applications
Cashmore et al. Planning inspection tasks for AUVs
Meyer On course towards model-free guidance: A self-learning approach to dynamic collision avoidance for autonomous surface vehicles
Keong et al. Reinforcement learning for autonomous aircraft avoidance
De Villiers et al. Learning fine-grained control for mapless navigation
Conforth et al. Reinforcement learning for neural networks using swarm intelligence
Mete et al. Coordinated Multi-Robot Exploration using Reinforcement Learning
Senthilkumar et al. Hybrid genetic-fuzzy approach to autonomous mobile robot
Aronsen Path planning and obstacle avoidance for marine vessels using the deep deterministic policy gradient method
Qin et al. An environment information-driven online Bi-level path planning algorithm for underwater search and rescue AUV
Kim et al. Transformable Gaussian Reward Function for Socially-Aware Navigation with Deep Reinforcement Learning
Gridnev et al. The Framework for robotic navigation algorithms evaluation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant