CN112800545B - Unmanned ship self-adaptive path planning method, equipment and storage medium based on D3QN - Google Patents

Unmanned ship self-adaptive path planning method, equipment and storage medium based on D3QN

Info

Publication number
CN112800545B
Authority
CN
China
Prior art keywords
unmanned ship
network
path planning
adaptive path
current state
Prior art date
Legal status
Expired - Fee Related
Application number
CN202110118727.XA
Other languages
Chinese (zh)
Other versions
CN112800545A (en)
Inventor
胡潇文
刘峰
陈畅
杨茜
Current Assignee
China University of Geosciences
Original Assignee
China University of Geosciences
Priority date
Filing date
Publication date
Application filed by China University of Geosciences filed Critical China University of Geosciences
Priority to CN202110118727.XA priority Critical patent/CN112800545B/en
Publication of CN112800545A publication Critical patent/CN112800545A/en
Application granted granted Critical
Publication of CN112800545B publication Critical patent/CN112800545B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
        • G06 - COMPUTING; CALCULATING OR COUNTING
            • G06F - ELECTRIC DIGITAL DATA PROCESSING
                • G06F 30/00 - Computer-aided design [CAD]
                    • G06F 30/10 - Geometric CAD
                        • G06F 30/15 - Vehicle, aircraft or watercraft design
                    • G06F 30/20 - Design optimisation, verification or simulation
                        • G06F 30/27 - Design optimisation, verification or simulation using machine learning, e.g. artificial intelligence, neural networks, support vector machines [SVM] or training a model
            • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
                • G06N 3/00 - Computing arrangements based on biological models
                    • G06N 3/02 - Neural networks
                        • G06N 3/04 - Architecture, e.g. interconnection topology
                            • G06N 3/045 - Combinations of networks
                            • G06N 3/047 - Probabilistic or stochastic networks
                        • G06N 3/08 - Learning methods
            • G06Q - INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
                • G06Q 10/00 - Administration; Management
                    • G06Q 10/04 - Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"
                        • G06Q 10/047 - Optimisation of routes or paths, e.g. travelling salesman problem
                • G06Q 50/00 - Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
                    • G06Q 50/40 - Business processes related to the transportation industry

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Business, Economics & Management (AREA)
  • General Engineering & Computer Science (AREA)
  • Human Resources & Organizations (AREA)
  • Geometry (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Economics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Strategic Management (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Marketing (AREA)
  • General Business, Economics & Management (AREA)
  • Tourism & Hospitality (AREA)
  • Computer Hardware Design (AREA)
  • Aviation & Aerospace Engineering (AREA)
  • Computational Mathematics (AREA)
  • Development Economics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Game Theory and Decision Science (AREA)
  • Automation & Control Theory (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Mathematical Analysis (AREA)
  • Operations Research (AREA)
  • Quality & Reliability (AREA)
  • Pure & Applied Mathematics (AREA)
  • Mathematical Optimization (AREA)
  • Computer Vision & Pattern Recognition (AREA)

Abstract

The invention belongs to the field of unmanned ship path planning and provides a method that enables an unmanned ship to perform adaptive path planning through learning. The method mainly comprises the following steps: constructing an unmanned ship model and placing it in a simulation environment for navigation; letting the unmanned ship explore randomly using behaviors from its behavior space; acquiring environment image information through the unmanned ship's depth camera and the unmanned ship's position information through a positioning system, and storing the data obtained by exploration in a priority experience playback pool; extracting data from the playback pool to train the D3QN network; and loading the trained network model onto the actual unmanned ship to plan paths in the real environment. The method achieves high path planning precision, a low collision rate and strong adaptive capability of the unmanned ship without requiring prior information.

Description

Unmanned ship self-adaptive path planning method, equipment and storage medium based on D3QN
Technical Field
The invention relates to the technical field of unmanned ship path planning, and in particular to an unmanned ship self-adaptive path planning method, equipment and storage medium based on D3QN.
Background
With the rise of the artificial intelligence era, unmanned ship technology has developed rapidly. China has many regions with severe marine environments, domestic unmanned ships adapt poorly to their environment, and various external interference factors exist, so domestic unmanned ship technology still does not meet the expected requirements. A path planning algorithm with strong adaptive capability that can cope with emergencies is urgently needed to break through the current bottleneck.
The design principle of the traditional unmanned ship path planning method is to plan an optimized obstacle-free path from a prior map, with the unmanned ship simply following the algorithm's instructions; once the environment changes, the algorithm cannot give optimal guidance. Traditional methods can be quite stable in simple environments. In future research, however, people will probe increasingly complex deep-sea areas that contain complex dynamic and static obstacles and dangerous conditions and in which the environment can change suddenly; only with an adaptive autonomous decision-making system that does not rely on a pre-surveyed map can the unmanned ship adapt to such changes.
To improve the adaptive capability of the unmanned ship, its control system must have good cognitive and recognition abilities with respect to the ship's spatial information and the state of the surrounding environment. According to the existing literature, methods such as the genetic algorithm, the ant colony algorithm and the A* algorithm can converge and perform well in simple environments, but they lack the adaptive capability to react in time when an emergency occurs; under strong interference the path planning effect is greatly degraded, and collisions with serious consequences may even occur.
Disclosure of Invention
The invention aims to overcome the defects of the prior art so that the path planning algorithm can react in time and retain good adaptive capability when an emergency occurs. An unmanned ship self-adaptive path planning method based on D3QN is provided, so that the unmanned ship can avoid collisions in time and has a high safety factor.
In order to achieve the purpose, the unmanned ship self-adaptive path planning method based on D3QN provided by the invention comprises the following steps:
s1, constructing an unmanned ship model and an underwater simulation environment, designing a D3QN network, and putting the unmanned ship model in the underwater simulation environment for autonomous navigation;
s2, selecting a behavior A from the current state S according to an epsilon-greedy algorithm;
s3, enabling the unmanned ship to reach the next state S ' by adopting a PID position and speed error control algorithm according to the behavior A, acquiring a first position relation between the position of the next state S ' and the obstacle, acquiring a second position relation between the position of the next state S ' and a terminal point, and acquiring a return R by utilizing a reward and punishment mechanism according to the first position relation and the second position relation;
s4, acquiring environment information and position information of the current state S and merging them into current state data s, acquiring environment information and position information of the next state S' and merging them into next state data s', storing the current state data s, the behavior A, the next state data s' and the return R in a priority experience playback pool in the form of an array D, and calculating the sampling probability of the array D in the priority experience playback pool through TD-error (the difference between the current state-action value and the target value, calculated by the temporal-difference method);
s5, extracting the array D in the experience playback pool to a D3QN network according to the sampling probability, carrying out gradient descent error training of the D3QN network, judging whether a termination condition is met, if so, obtaining a trained unmanned ship self-adaptive path planning model, and executing a step S6, otherwise, taking the next state S' as the current state S, and returning to the step S2;
and S6, importing the trained unmanned ship self-adaptive path planning model into an unmanned ship path planning system, planning the unmanned ship path in a real environment, and obtaining the unmanned ship path.
Further, the step of constructing the unmanned ship model and the underwater simulation environment and designing the D3QN network comprises the following steps:
building the unmanned ship model and the underwater simulation environment through the ROS and the Gazebo;
forming a main network and a target network, each composed of an LSTM network, a convolutional neural network and a dueling fully-connected network;
and forming the D3QN network by the main network, the target network and the experience playback pool.
Further, a depth camera and a positioning system are arranged on the unmanned ship model;
the depth camera is used for acquiring current environment information;
the positioning system is used for acquiring the position information of the unmanned ship.
Further, step S5 specifically includes:
dividing the space of the whole priority experience playback pool into M small ranges according to the minimum sample size M;
randomly extracting one sample from each small range according to the sampling probability;
obtaining current state data s and next state data s' according to the sample data;
processing the environmental information in the current state data s through the convolutional neural network of the main network to obtain first environmental information;
processing the position information in the current state data s through the LSTM network of the main network to obtain first position information;
combining the first environment information and the first position information and inputting the combined features into the dueling fully-connected network of the main network to obtain an output Q of the main network;
processing the environmental information in the next state data s' through the convolutional neural network of the target network to obtain second environmental information;
processing the position information in the next state data s' through the LSTM network of the target network to obtain second position information;
combining the second environment information and the second position information and inputting the combined features into the dueling fully-connected network of the target network to obtain an output Q1 of the target network;
calculating to obtain a target output Qt according to the Q1 and the Q;
calculating to obtain an error function according to the Q and the Qt;
and training the D3QN network by adopting a gradient descent method based on the error function, judging whether the error function meets a termination condition, if so, obtaining a trained unmanned ship self-adaptive path planning model, and executing a step S6, otherwise, taking the next state S' as the current state S, returning to the step S2, and retraining.
Further, the epsilon-greedy algorithm is:
A = a behavior selected at random from the behavior space, with probability ε
A = argmax_a Q(S, a), the behavior with the maximum main-network output Q, with probability 1 - ε
That is, with probability ε a behavior is selected at random from the behavior space, and with probability 1 - ε the behavior that maximizes the main-network output Q is selected.
Further, the reward and punishment mechanism is as follows:
R = large reward, when the unmanned ship reaches the end point
R = large punishment, when the unmanned ship approaches an obstacle
R = small reward, when dt < do (the unmanned ship moves closer to the end point)
R = small punishment, when dt > do (the unmanned ship moves away from the end point)
wherein R is the return, do represents the distance between the unmanned ship and the terminal in the current state S, and dt represents the distance between the unmanned ship and the terminal in the next state S'.
Further, the PID position and velocity error control algorithm is:
Ep=[P(x′,y′,z′)-P(x,y,z),O(r′,p′,y′)-O(r,p,y)]
Ev=[v(x′,y′,z′)-v(x,y,z),ω(x′,y′,z′)-ω(x,y,z)]
where Ep is the position and attitude deviation, Ev is the velocity deviation, r, p and y are the angles by which the unmanned ship deviates about the x, y and z axes respectively, P(x′,y′,z′) and O(r′,p′,y′) are the position and deviation angles of the unmanned ship in the next state S′, v(x′,y′,z′) and ω(x′,y′,z′) are the linear velocity and angular velocity of the given target in the behavior A, P(x,y,z) and O(r,p,y) are the position and deviation angles of the unmanned ship in the current state S, and v(x,y,z) and ω(x,y,z) are the linear velocity and angular velocity of the unmanned ship in the current state S.
In addition, in order to achieve the above object, the present invention further provides an unmanned ship adaptive path planning apparatus based on D3QN, which includes a memory, a processor, and an unmanned ship adaptive path planning program stored in the memory and operable on the processor, and when executed by the processor, the unmanned ship adaptive path planning program implements the steps of any of the unmanned ship adaptive path planning methods.
In addition, in order to achieve the above object, the present invention further provides a storage medium having an unmanned ship adaptive path planning program stored thereon, wherein the unmanned ship adaptive path planning program, when executed by a processor, implements the steps of any one of the unmanned ship adaptive path planning methods.
The invention has the beneficial effects that: according to the method, a D3QN algorithm is adopted, sample information does not need to be given in advance, the network can be trained autonomously through experience obtained by autonomous exploration, and an optimal solution is obtained until training is finished; the main network based on the fusion of the LSTM and the convolutional neural network can realize the feature fusion of the unmanned ship environment, and the unmanned ship has the self-adaptive capacity to the environment change by adopting a learning mode, thereby conforming to the more intelligent development direction of the unmanned ship in the future.
Drawings
FIG. 1 is a flow chart of the implementation of the unmanned ship adaptive path planning method based on D3 QN;
FIG. 2 is a flow chart of a specific algorithm corresponding to FIG. 1;
fig. 3 is a D3QN network processing drone position and image information framework diagram.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention will be further described with reference to the accompanying drawings.
Referring to fig. 1 and fig. 2, fig. 1 is a flowchart illustrating an implementation of the unmanned ship adaptive path planning method based on D3QN according to the present invention, and fig. 2 is a flowchart illustrating a specific algorithm corresponding to fig. 1.
The embodiment of the invention provides a D3 QN-based unmanned ship self-adaptive path planning method, which comprises the following steps:
s1, constructing an unmanned ship model and an underwater simulation environment, designing a D3QN network, and putting the unmanned ship model in the underwater simulation environment for autonomous navigation;
The unmanned ship and the underwater environment are built with ROS and Gazebo; a depth camera and a positioning system are arranged on the unmanned ship, and ROS provides the Topic communication function;
the method comprises the steps of combining an LSTM network, a convolutional neural network, a full-connection network with a Dueling structure, a main network and a target network with parameters lagging behind the main network by a certain number of steps, and preferentially playing back an experience pool.
Acquiring image information of the underwater simulation environment through a depth camera of the unmanned ship;
acquiring the position information of the unmanned ship through a positioning system;
and transmitting the position information of the unmanned ship and the image information of the unmanned ship from the Gazebo to an adaptive path planning algorithm for storage by adopting a Topic function of the ROS.
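The data acquisition described above can be illustrated with the following Python sketch. It is an illustration only, not part of the original disclosure: the topic names, message types and node layout are assumptions, and a real system would subscribe to whatever topics the Gazebo model actually publishes.

```python
import rospy
import numpy as np
from sensor_msgs.msg import Image
from nav_msgs.msg import Odometry
from cv_bridge import CvBridge

class StateCollector:
    """Gathers the environment image and position information published by the Gazebo simulation."""
    def __init__(self):
        self.bridge = CvBridge()
        self.depth_image = None     # latest environment image from the depth camera
        self.position = None        # latest unmanned-ship position from the positioning system
        rospy.Subscriber("/usv/depth_camera/image_raw", Image, self.image_cb)  # assumed topic name
        rospy.Subscriber("/usv/odom", Odometry, self.odom_cb)                   # assumed topic name

    def image_cb(self, msg):
        # convert the ROS image message into a numpy array for the convolutional network
        self.depth_image = self.bridge.imgmsg_to_cv2(msg, desired_encoding="passthrough")

    def odom_cb(self, msg):
        p = msg.pose.pose.position
        self.position = np.array([p.x, p.y, p.z], dtype=np.float32)

    def current_state(self):
        # the state data s merges the environment image and the position information
        return self.depth_image, self.position

if __name__ == "__main__":
    rospy.init_node("usv_state_collector")
    collector = StateCollector()
    rospy.spin()
```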
S2, selecting behavior A from the current state S according to an epsilon-greedy algorithm, wherein the epsilon-greedy algorithm is as follows:
A = a behavior selected at random from the behavior space, with probability ε
A = argmax_a Q(S, a), the behavior with the maximum main-network output Q, with probability 1 - ε
That is, with probability ε a behavior is selected at random from the behavior space, and with probability 1 - ε the behavior that maximizes the main-network output Q is selected.
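A minimal sketch of this selection rule is shown below, assuming a discrete behavior space indexed from 0 and a vector of main-network Q values for the current state S; it is an illustration, not the patent's code.

```python
import random
import numpy as np

def epsilon_greedy(q_values, epsilon):
    """Select behavior A: a random behavior with probability epsilon,
    otherwise the behavior with the maximum main-network output Q."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))   # explore the behavior space
    return int(np.argmax(q_values))              # exploit: argmax_a Q(S, a)

# example: epsilon_greedy(np.array([0.1, 0.7, 0.3]), epsilon=0.1) usually returns 1
```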
S3, enabling the unmanned ship to reach the next state S ' by adopting a PID position and speed error control algorithm according to the behavior A, obtaining a first position relation between the position of the next state S ' and the obstacle, obtaining a second position relation between the position of the next state S ' and the terminal point, and obtaining a return R by utilizing a reward and punishment mechanism according to the first position relation and the second position relation;
the PID position and speed error control algorithm is as follows:
Ep=[P(x′,y′,z′)-P(x,y,z),O(r′,p′,y′)-O(r,p,y)]
Ev=[v(x′,y′,z′)-v(x,y,z),ω(x′,y′,z′)-ω(x,y,z)]
where Ep is the position and attitude deviation, Ev is the velocity deviation, r, p and y are the angles by which the unmanned ship deviates about the x, y and z axes respectively, P(x′,y′,z′) and O(r′,p′,y′) are the position and deviation angles of the unmanned ship in the next state S′, v(x′,y′,z′) and ω(x′,y′,z′) are the linear velocity and angular velocity of the given target in the behavior A, P(x,y,z) and O(r,p,y) are the position and deviation angles of the unmanned ship in the current state S, and v(x,y,z) and ω(x,y,z) are the linear velocity and angular velocity of the unmanned ship in the current state S.
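The error terms Ep and Ev, and a generic PID loop driving them toward zero, can be sketched as follows. This is only an illustrative reading of the formulas above; the gains and time step are placeholder values, not the patent's tuning.

```python
import numpy as np

def pose_velocity_errors(P_next, O_next, v_target, w_target, P_cur, O_cur, v_cur, w_cur):
    """Ep and Ev exactly as defined above: Ep stacks the position and attitude deviations,
    Ev stacks the linear- and angular-velocity deviations."""
    Ep = np.concatenate([np.asarray(P_next, float) - np.asarray(P_cur, float),
                         np.asarray(O_next, float) - np.asarray(O_cur, float)])
    Ev = np.concatenate([np.asarray(v_target, float) - np.asarray(v_cur, float),
                         np.asarray(w_target, float) - np.asarray(w_cur, float)])
    return Ep, Ev

class PID:
    """Generic PID loop that drives an error vector toward zero (gains are placeholders)."""
    def __init__(self, kp=1.0, ki=0.0, kd=0.1, dt=0.05):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral = 0.0
        self.prev_error = None

    def step(self, error):
        error = np.asarray(error, float)
        self.integral = self.integral + error * self.dt
        derivative = np.zeros_like(error) if self.prev_error is None else (error - self.prev_error) / self.dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative
```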
The reward calculation by utilizing the reward and punishment mechanism is specifically as follows:
when the unmanned ship approaches the terminal, a small amount of reward is obtained;
when the unmanned ship is far away from the terminal point, a small amount of punishment is obtained;
when the unmanned ship arrives at the terminal, a large amount of rewards are obtained;
when the unmanned ship approaches the obstacle, a large amount of punishments are obtained;
the reward and punishment mechanism calculates the reward according to the formula:
R = large reward, when the unmanned ship reaches the end point
R = large punishment, when the unmanned ship approaches an obstacle
R = small reward, when dt < do (the unmanned ship moves closer to the end point)
R = small punishment, when dt > do (the unmanned ship moves away from the end point)
wherein R is the return, do represents the distance between the unmanned ship and the terminal in the current state S, and dt represents the distance between the unmanned ship and the terminal in the next state S'.
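A sketch of this reward and punishment mechanism is given below. The branches follow the qualitative description above; the numeric magnitudes are illustrative assumptions, not the values of the original formula.

```python
def compute_reward(d_o, d_t, reached_end_point, near_obstacle,
                   r_big=10.0, r_small=1.0):
    """Return R per the reward and punishment mechanism above (magnitudes are assumptions)."""
    if reached_end_point:
        return r_big          # large reward on reaching the end point
    if near_obstacle:
        return -r_big         # large punishment when approaching an obstacle
    if d_t < d_o:
        return r_small        # small reward: the ship moved closer to the end point
    return -r_small           # small punishment: the ship moved away from the end point
```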
S4, acquiring environment information and position information of the current state S and merging them into current state data s, acquiring environment information and position information of the next state S' and merging them into next state data s', storing the current state data s, the behavior A, the next state data s' and the return R as an array D in the priority experience playback pool, and calculating the sampling probability of the array D in the priority experience playback pool through TD-error;
The priority (sampling probability) is calculated from the TD-error (temporal-difference error) value δi of the i-th sample, whose calculation formula is:
δi = Ri + γ·Qt(si, argmax_a Q(si, a)) - Q(si-1, ai-1)
where argmax_a Q(si, a) denotes the behavior A that obtains the maximum main-network output Q value Q(si, a) under the state data si of sample i, Qt(si, argmax_a Q(si, a)) denotes the target-network output Q value obtained by selecting that behavior under the state data si, Q(si-1, ai-1) denotes the main-network output Q value obtained by selecting behavior ai-1 under the state data si-1 of the (i-1)-th sample, and γ is the attenuation coefficient, with a value of 0.8. The probability of random sampling according to priority is:
P(i) = pi^α / Σk pk^α
where the exponent α represents the degree to which sampling follows priority (α = 0 reduces to uniform random sampling) and pi denotes the priority of the i-th sample. When proportional sampling is used, pi is:
pi = |δi| + ε
where ε is a quantity greater than 0, which prevents samples whose TD-error is 0 from never getting a playback opportunity.
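The priority and sampling-probability computation can be sketched in Python as follows; the values of α and the small constant ε are illustrative assumptions.

```python
import numpy as np

def priorities_from_td_error(td_errors, eps=1e-3):
    """p_i = |delta_i| + eps, so a sample with zero TD-error can still be replayed."""
    return np.abs(np.asarray(td_errors, float)) + eps

def sampling_probabilities(priorities, alpha=0.6):
    """P(i) = p_i^alpha / sum_k p_k^alpha; alpha = 0 reduces to uniform random sampling."""
    scaled = np.power(np.asarray(priorities, float), alpha)
    return scaled / scaled.sum()
```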
S5, extracting the array D in the experience playback pool to a D3QN network according to the sampling probability, carrying out gradient descent error training on the D3QN network, judging whether a termination condition is met, if so, obtaining a trained unmanned ship self-adaptive path planning model, and executing a step S6, otherwise, taking the next state S' as the current state S, and returning to the step S2; the method comprises the following specific steps:
dividing the space of the whole priority experience playback pool into M small ranges according to the minimum sample size M;
randomly extracting one sample from each small range according to the sampling probability;
obtaining current state data s and next state data s' according to the sample data;
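A sketch of this range-wise extraction is given below, under the assumptions that the pool holds at least M samples stored in insertion order and that the probabilities come from the function sketched above.

```python
import numpy as np

def stratified_sample(probabilities, m):
    """Divide the replay pool (indexed 0..N-1) into m equal ranges and draw one sample
    from each range according to the priority-based sampling probabilities."""
    probabilities = np.asarray(probabilities, float)
    n = len(probabilities)
    bounds = np.linspace(0, n, m + 1, dtype=int)
    indices = []
    for lo, hi in zip(bounds[:-1], bounds[1:]):
        segment = probabilities[lo:hi]
        segment = segment / segment.sum()            # renormalise within the small range
        indices.append(lo + int(np.random.choice(hi - lo, p=segment)))
    return indices
```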
referring to fig. 3, fig. 3 is a diagram of a D3QN network processing unmanned ship position and image information framework;
processing the environmental information in the current state data s through the convolutional neural network of the main network to obtain first environmental information;
processing the position information in the current state data s through the LSTM network of the main network to obtain first position information;
combining the first environment information and the first position information and inputting the combined features into the dueling fully-connected network of the main network to obtain an output Q of the main network;
processing the environmental information in the next state data s' through the convolutional neural network of the target network to obtain second environmental information;
processing the position information in the next state data s' through the LSTM network of the target network to obtain second position information;
combining the second environment information and the second position information and inputting the combined features into the dueling fully-connected network of the target network to obtain an output Q1 of the target network;
calculating to obtain a target output Qt according to the Q1 and the Q;
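The structure of the main network and the target network described above (a convolutional branch for the environment image, an LSTM branch for the position information, and a dueling fully-connected head) can be sketched with PyTorch as follows. The layer sizes, the 84x84 single-channel image and the three-dimensional position input are illustrative assumptions, not the configuration disclosed in the patent.

```python
import torch
import torch.nn as nn

class D3QNNetwork(nn.Module):
    """Sketch of the main/target network: CNN for the environment image,
    LSTM for the position information, and a dueling fully-connected head."""
    def __init__(self, n_actions, pos_dim=3, hidden=128):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2), nn.ReLU(),
            nn.Flatten(),                                   # 32 * 9 * 9 features for an 84x84 input
        )
        self.lstm = nn.LSTM(input_size=pos_dim, hidden_size=hidden, batch_first=True)
        self.fuse = nn.Linear(32 * 9 * 9 + hidden, hidden)  # merge image and position features
        self.value = nn.Linear(hidden, 1)                   # state-value branch V(s)
        self.advantage = nn.Linear(hidden, n_actions)       # advantage branch A(s, a)

    def forward(self, image, positions):
        # image: (batch, 1, 84, 84); positions: (batch, seq_len, pos_dim)
        img_feat = self.cnn(image)                          # environment information
        _, (h, _) = self.lstm(positions)                    # position information
        fused = torch.relu(self.fuse(torch.cat([img_feat, h[-1]], dim=1)))
        v = self.value(fused)
        a = self.advantage(fused)
        return v + a - a.mean(dim=1, keepdim=True)          # dueling combination -> Q(s, a)
```

The target network is simply a second instance of the same module whose parameters are periodically copied from the main network, as described in step S5.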
calculating an error function L according to the Q and the Qt;
the calculation formula of the error function L is as follows:
L(θ) = E[(R + γ·Qt(s′, argmax_a′ Q(s′, a′; θ); θ⁻) - Q(s, a; θ))²]
training network weight parameters by adopting a gradient descent method according to an error function; the implementation formula is as follows:
θ ← θ - η·∂L(θ)/∂θ, where η is the gradient-descent learning rate
where θ is the main-network weight parameter, θ⁻ is the target-network weight parameter, γ is the attenuation coefficient with a value of 0.8, Q(s, a; θ) denotes the main-network Q value obtained by selecting behavior A in state s when the main-network weight parameter is θ, argmax_a′ Q(s′, a′; θ) is the behavior A′ that obtains the maximum main-network output Q value under the state data s′, and Qt(s′, argmax_a′ Q(s′, a′; θ); θ⁻) denotes the target-network output Q value obtained by selecting behavior A′ under the state data s′;
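One gradient-descent update on L(θ) can be sketched as follows, reusing the network module sketched above. The double-DQN target Qt follows the formula; the choice of optimizer, the mean-squared-error form of the expectation and the masking of terminal transitions are implementation assumptions.

```python
import torch
import torch.nn.functional as F

def d3qn_update(main_net, target_net, optimizer, batch, gamma=0.8):
    """One gradient-descent step on
    L(theta) = E[(R + gamma * Qt(s', argmax_a' Q(s', a'; theta); theta^-) - Q(s, a; theta))^2]."""
    img, pos, actions, rewards, next_img, next_pos, done = batch   # tensors; actions is int64

    q = main_net(img, pos).gather(1, actions.unsqueeze(1)).squeeze(1)        # Q(s, a; theta)

    with torch.no_grad():
        best_next = main_net(next_img, next_pos).argmax(dim=1)               # argmax_a' Q(s', a'; theta)
        q_next = target_net(next_img, next_pos).gather(1, best_next.unsqueeze(1)).squeeze(1)
        qt = rewards + gamma * (1.0 - done) * q_next                         # target output Qt
        # masking with (1 - done) on terminal transitions is a standard detail assumed here

    loss = F.mse_loss(q, qt)          # expectation in L(theta) taken as the batch mean
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return (qt - q).detach()          # TD-errors, usable to refresh the priorities p_i = |delta_i| + eps
```

The returned TD-errors can be fed back into the priority experience playback pool to update the sampling probabilities of the extracted arrays D.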
judging whether a collision is imminent; if so, returning to the previous state S and reselecting the behavior A, otherwise continuing to execute the training step;
judging whether the end point is reached, if so, resetting to the starting point, and continuing training, otherwise, continuing to execute the training step;
judging whether the target network weights should be updated (the condition being that they are updated once every 500 steps); if so, copying all the main-network weight parameters to the target network, and otherwise leaving the target-network weight parameters unchanged;
and judging whether the iteration times are reached, if so, terminating the training, obtaining a trained unmanned ship self-adaptive path planning model, and executing the step S6, otherwise, taking the next state S' as the current state S, and continuing to retrain from S2.
In the training process, if the collision is approached, returning to the past state S, and reselecting the behavior A; if the end point is reached, resetting to the starting point, and continuing training; if the target network weight is updated (the judgment condition is that the updating is performed once every 500 steps), all the main network weight parameters are copied to the target network; and judging whether the iteration times are reached, if so, terminating the training, obtaining a trained unmanned ship self-adaptive path planning model, executing the step S6, otherwise, taking the next state S' as the current state S, and continuing to retrain from S2.
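The periodic synchronisation and the outer training loop can be outlined as follows; the loop body is shown only as comments because the environment bindings, replay-pool implementation and episode bookkeeping are assumptions outside the patent text.

```python
def sync_target(main_net, target_net):
    """Copy all main-network weight parameters to the target network (done once every 500 steps)."""
    target_net.load_state_dict(main_net.state_dict())

# Outline of the training loop described above (steps S2 to S5); names are illustrative:
# for step in range(max_iterations):
#     A = epsilon_greedy(main_net_q_values(s), epsilon)            # step S2
#     s_next, R = execute_behavior_with_pid(A)                     # step S3
#     replay_pool.add((s, A, s_next, R), priority=abs(td) + eps)   # step S4
#     td = d3qn_update(main_net, target_net, optimizer,
#                      replay_pool.stratified_sample(m))           # step S5
#     if step % 500 == 0:
#         sync_target(main_net, target_net)                        # periodic target update
#     s = s_next  # or reset to the start point after reaching the end point / a collision
```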
And S6, importing the trained unmanned ship self-adaptive path planning model into an unmanned ship path planning system, planning the unmanned ship path in a real environment, and obtaining the unmanned ship path.
In addition, the embodiment of the invention also provides unmanned ship adaptive path planning equipment based on D3QN, which comprises a memory, a processor and an unmanned ship adaptive path planning program stored on the memory and capable of running on the processor, wherein the unmanned ship adaptive path planning program realizes the steps of the unmanned ship adaptive path planning method when executed by the processor.
In addition, the embodiment of the invention also provides a storage medium, wherein the storage medium is stored with an unmanned ship self-adaptive path planning program, and the unmanned ship self-adaptive path planning program realizes the steps of the unmanned ship self-adaptive path planning method when being executed by a processor.
The invention has the beneficial effects that: according to the method, a D3QN algorithm is adopted, sample information does not need to be given in advance, the network can be trained autonomously through experience obtained by autonomous exploration, and an optimal solution is obtained until training is finished; the main network based on the fusion of the LSTM and the convolutional neural network can realize the feature fusion of the unmanned ship environment, and the unmanned ship has the self-adaptive capacity to the environment change by adopting a learning mode, thereby conforming to the more intelligent development direction of the unmanned ship in the future.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims (10)

1. An unmanned ship adaptive path planning method based on D3QN is characterized by comprising the following steps:
s1, constructing an unmanned ship model and an underwater simulation environment, designing a D3QN network, and putting the unmanned ship model in the underwater simulation environment for autonomous navigation;
s2, selecting a behavior A from the current state S according to an epsilon-greedy algorithm;
s3, enabling the unmanned ship to reach the next state S ' by adopting a PID position and speed error control algorithm according to the behavior A, obtaining a first position relation between the position of the next state S ' and an obstacle, obtaining a second position relation between the position of the next state S ' and a terminal point, and obtaining a return R by utilizing a reward and punishment mechanism according to the first position relation and the second position relation;
s4, obtaining environment information and position information of the current state S and merging them into current state data s, obtaining environment information and position information of the next state S′ and merging them into next state data s′, storing the current state data s, the behavior A, the next state data s′ and the return R into a priority experience playback pool in the form of an array D, and calculating the sampling probability of the array D in the priority experience playback pool through TD-error;
s5, extracting the array D in the experience playback pool to a D3QN network according to the sampling probability, carrying out gradient descent error training of the D3QN network, judging whether a termination condition is met, if so, obtaining a trained unmanned ship self-adaptive path planning model, and executing a step S6, otherwise, taking the next state S' as the current state S, and returning to the step S2;
and S6, importing the trained unmanned ship self-adaptive path planning model into an unmanned ship path planning system, planning the unmanned ship path in a real environment, and obtaining the unmanned ship path.
2. The unmanned ship adaptive path planning method according to claim 1, wherein the step of constructing the unmanned ship model and the underwater simulation environment and designing the D3QN network comprises:
building the unmanned ship model and the underwater simulation environment through the ROS and the Gazebo;
forming a main network and a target network, each composed of an LSTM network, a convolutional neural network and a dueling fully-connected network;
and forming the D3QN network by the main network, the target network and the experience playback pool.
3. The unmanned ship self-adaptive path planning method according to claim 1, wherein a depth camera and a positioning system are arranged on the unmanned ship model;
the depth camera is used for acquiring current environment information;
the positioning system is used for acquiring the position information of the unmanned ship.
4. The unmanned ship adaptive path planning method according to claim 2, wherein the step S5 specifically includes:
dividing the space of the whole priority experience playback pool into M small ranges according to the minimum sample size M;
randomly extracting one sample from each small range according to the sampling probability;
obtaining current state data s and next state data s' according to the sample data;
respectively processing the current state data s and the next state data s' through the main network and the target network to obtain an output Q of the main network and an output Q1 of the target network;
calculating to obtain a target output Qt according to the Q1 and the Q;
calculating to obtain an error function according to the Q and the Qt;
and training the D3QN network by adopting a gradient descent method based on the error function, judging whether the error function meets a termination condition, if so, obtaining a trained unmanned ship self-adaptive path planning model, and executing a step S6, otherwise, taking the next state S' as the current state S, returning to the step S2, and retraining.
5. The unmanned ship adaptive path planning method according to claim 4, wherein the step of processing the current state data s and the next state data s' through the main network and the target network respectively to obtain the output Q of the main network and the output Q1 of the target network comprises:
processing the environmental information in the current state data s through the convolutional neural network of the main network to obtain first environmental information;
processing the position information in the current state data s through the LSTM network of the main network to obtain first position information;
combining the first environment information and the first position information and inputting the combined features into the dueling fully-connected network of the main network to obtain an output Q of the main network;
processing the environmental information in the next state data s' through the convolutional neural network of the target network to obtain second environmental information;
processing the position information in the next state data s' through the LSTM network of the target network to obtain second position information;
and combining the second environment information and the second position information and inputting the combined features into the dueling fully-connected network of the target network to obtain an output Q1 of the target network.
6. The unmanned ship adaptive path planning method according to claim 2, wherein the epsilon-greedy algorithm is:
A = a behavior selected at random from the behavior space, with probability ε
A = argmax_a Q(S, a), the behavior with the maximum main-network output Q, with probability 1 - ε
wherein with probability ε a behavior is selected at random from the behavior space, and with probability 1 - ε the behavior with the maximum main-network output Q is selected.
7. The unmanned ship adaptive path planning method according to claim 1, wherein the reward and punishment mechanism is:
R = large reward, when the unmanned ship reaches the end point
R = large punishment, when the unmanned ship approaches an obstacle
R = small reward, when dt < do (the unmanned ship moves closer to the end point)
R = small punishment, when dt > do (the unmanned ship moves away from the end point)
wherein R is the return, do represents the distance between the unmanned ship and the terminal in the current state S, and dt represents the distance between the unmanned ship and the terminal in the next state S'.
8. The unmanned ship adaptive path planning method according to claim 1, wherein the PID position and velocity error control algorithm is:
Ep=[P(x′,y′,z′)-P(x,y,z),O(r′,p′,y′)-O(r,p,y)]
Ev=[v(x′,y′,z′)-v(x,y,z),ω(x′,y′,z′)-ω(x,y,z)]
where Ep is the position and attitude deviation, Ev is the velocity deviation, r, p and y are the angles by which the unmanned ship deviates about the x, y and z axes respectively, P(x′,y′,z′) and O(r′,p′,y′) are the position and deviation angles of the unmanned ship in the next state S′, v(x′,y′,z′) and ω(x′,y′,z′) are the linear velocity and angular velocity of the given target in the behavior A, P(x,y,z) and O(r,p,y) are the position and deviation angles of the unmanned ship in the current state S, and v(x,y,z) and ω(x,y,z) are the linear velocity and angular velocity of the unmanned ship in the current state S.
9. An unmanned ship adaptive path planning device based on D3QN, characterized in that the unmanned ship adaptive path planning device comprises a memory, a processor and an unmanned ship adaptive path planning program stored on the memory and operable on the processor, wherein the unmanned ship adaptive path planning program, when executed by the processor, implements the steps of the unmanned ship adaptive path planning method according to any one of claims 1 to 8.
10. A storage medium having stored thereon an unmanned ship adaptive path planning program, which when executed by a processor implements the steps of the unmanned ship adaptive path planning method according to any one of claims 1 to 8.
CN202110118727.XA 2021-01-28 2021-01-28 Unmanned ship self-adaptive path planning method, equipment and storage medium based on D3QN Expired - Fee Related CN112800545B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110118727.XA CN112800545B (en) 2021-01-28 2021-01-28 Unmanned ship self-adaptive path planning method, equipment and storage medium based on D3QN

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110118727.XA CN112800545B (en) 2021-01-28 2021-01-28 Unmanned ship self-adaptive path planning method, equipment and storage medium based on D3QN

Publications (2)

Publication Number Publication Date
CN112800545A CN112800545A (en) 2021-05-14
CN112800545B true CN112800545B (en) 2022-06-24

Family

ID=75812443

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110118727.XA Expired - Fee Related CN112800545B (en) 2021-01-28 2021-01-28 Unmanned ship self-adaptive path planning method, equipment and storage medium based on D3QN

Country Status (1)

Country Link
CN (1) CN112800545B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113411099B (en) * 2021-05-28 2022-04-29 杭州电子科技大学 Double-change frequency hopping pattern intelligent decision method based on PPER-DQN
CN113503878B (en) * 2021-07-07 2023-04-07 大连海事大学 Unmanned ship path planning method and system

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110362089A (en) * 2019-08-02 2019-10-22 大连海事大学 A method of the unmanned boat independent navigation based on deeply study and genetic algorithm
CN110472738A (en) * 2019-08-16 2019-11-19 北京理工大学 A kind of unmanned boat Real Time Obstacle Avoiding algorithm based on deeply study
CN110488872A (en) * 2019-09-04 2019-11-22 中国人民解放军国防科技大学 A kind of unmanned plane real-time route planing method based on deeply study
WO2019241022A1 (en) * 2018-06-13 2019-12-19 Nvidia Corporation Path detection for autonomous machines using deep neural networks
CN110703766A (en) * 2019-11-07 2020-01-17 南京航空航天大学 Unmanned aerial vehicle path planning method based on transfer learning strategy deep Q network
CN111829527A (en) * 2020-07-23 2020-10-27 中国石油大学(华东) Unmanned ship path planning method based on deep reinforcement learning and considering marine environment elements
CN111880549A (en) * 2020-09-14 2020-11-03 大连海事大学 Unmanned ship path planning-oriented deep reinforcement learning reward function optimization method

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190184561A1 (en) * 2017-12-15 2019-06-20 The Regents Of The University Of California Machine Learning based Fixed-Time Optimal Path Generation

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019241022A1 (en) * 2018-06-13 2019-12-19 Nvidia Corporation Path detection for autonomous machines using deep neural networks
CN110362089A (en) * 2019-08-02 2019-10-22 大连海事大学 A method of the unmanned boat independent navigation based on deeply study and genetic algorithm
CN110472738A (en) * 2019-08-16 2019-11-19 北京理工大学 A kind of unmanned boat Real Time Obstacle Avoiding algorithm based on deeply study
CN110488872A (en) * 2019-09-04 2019-11-22 中国人民解放军国防科技大学 A kind of unmanned plane real-time route planing method based on deeply study
CN110703766A (en) * 2019-11-07 2020-01-17 南京航空航天大学 Unmanned aerial vehicle path planning method based on transfer learning strategy deep Q network
CN111829527A (en) * 2020-07-23 2020-10-27 中国石油大学(华东) Unmanned ship path planning method based on deep reinforcement learning and considering marine environment elements
CN111880549A (en) * 2020-09-14 2020-11-03 大连海事大学 Unmanned ship path planning-oriented deep reinforcement learning reward function optimization method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Path planning of an unmanned surface vehicle in a dynamic environment based on an improved Q-learning algorithm; Wang Meng et al.; Instrument Technique; 2020-04-15 (No. 04); pp. 17-21 *

Also Published As

Publication number Publication date
CN112800545A (en) 2021-05-14

Similar Documents

Publication Publication Date Title
CN112241176B (en) Path planning and obstacle avoidance control method of underwater autonomous vehicle in large-scale continuous obstacle environment
CN111694365B (en) Unmanned ship formation path tracking method based on deep reinforcement learning
CN108803321B (en) Autonomous underwater vehicle track tracking control method based on deep reinforcement learning
Xiaofei et al. Global path planning algorithm based on double DQN for multi-tasks amphibious unmanned surface vehicle
JP2021034050A (en) Auv action plan and operation control method based on reinforcement learning
CN112800545B (en) Unmanned ship self-adaptive path planning method, equipment and storage medium based on D3QN
CN113176776B (en) Unmanned ship weather self-adaptive obstacle avoidance method based on deep reinforcement learning
Cao et al. Hunting algorithm for multi-auv based on dynamic prediction of target trajectory in 3d underwater environment
CN113010963B (en) Variable-quality underwater vehicle obstacle avoidance method and system based on deep reinforcement learning
CN109784201A (en) AUV dynamic obstacle avoidance method based on four-dimensional risk assessment
CN115016496A (en) Water surface unmanned ship path tracking method based on deep reinforcement learning
CN113848974B (en) Aircraft trajectory planning method and system based on deep reinforcement learning
Yao et al. A hierarchical architecture using biased min-consensus for USV path planning
CN113848984B (en) Unmanned aerial vehicle cluster control method and system
CN113190037A (en) Unmanned aerial vehicle optimal path searching method based on improved fluid disturbance and sparrow algorithm
CN114879671A (en) Unmanned ship trajectory tracking control method based on reinforcement learning MPC
CN117590867B (en) Underwater autonomous vehicle connection control method and system based on deep reinforcement learning
Wang et al. Path-following optimal control of autonomous underwater vehicle based on deep reinforcement learning
Wang et al. A greedy navigation and subtle obstacle avoidance algorithm for USV using reinforcement learning
CN116679711A (en) Robot obstacle avoidance method based on model-based reinforcement learning and model-free reinforcement learning
Amendola et al. Navigation in restricted channels under environmental conditions: Fast-time simulation by asynchronous deep reinforcement learning
CN114910072A (en) Unmanned aerial vehicle navigation method, device, equipment and medium based on deep reinforcement learning
Gao et al. An optimized path planning method for container ships in Bohai bay based on improved deep Q-learning
CN117406716A (en) Unmanned ship collision avoidance control method, unmanned ship collision avoidance control device, terminal equipment and medium
Pereira et al. Reinforcement learning based robot navigation using illegal actions for autonomous docking of surface vehicles in unknown environments

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB03 Change of inventor or designer information

Inventor after: Hu Xiaowen

Inventor after: Liu Feng

Inventor after: Chen Chang

Inventor after: Yang Qian

Inventor before: Liu Feng

Inventor before: Hu Xiaowen

Inventor before: Chen Chang

Inventor before: Yang Qian

GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20220624
