CN115344046A - Mobile robot path planning based on improved deep Q network algorithm


Info

Publication number
CN115344046A
Authority
CN
China
Prior art keywords
mobile robot
value
target
network
sample
Prior art date
Legal status
Pending
Application number
CN202211002713.2A
Other languages
Chinese (zh)
Inventor
臧强
徐博文
李宁
郭镜虹
Current Assignee
Nanjing University of Information Science and Technology
Original Assignee
Nanjing University of Information Science and Technology
Priority date
Filing date
Publication date
Application filed by Nanjing University of Information Science and Technology filed Critical Nanjing University of Information Science and Technology
Priority to CN202211002713.2A priority Critical patent/CN115344046A/en
Publication of CN115344046A publication Critical patent/CN115344046A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G05 CONTROLLING; REGULATING
    • G05D SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D1/00 Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
    • G05D1/02 Control of position or course in two dimensions
    • G05D1/021 Control of position or course in two dimensions specially adapted to land vehicles
    • G05D1/0212 Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory
    • G05D1/0223 Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory involving speed control of the vehicle
    • G PHYSICS
    • G05 CONTROLLING; REGULATING
    • G05D SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D1/00 Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
    • G05D1/02 Control of position or course in two dimensions
    • G05D1/021 Control of position or course in two dimensions specially adapted to land vehicles
    • G05D1/0212 Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory
    • G05D1/0221 Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory involving a learning process
    • G PHYSICS
    • G05 CONTROLLING; REGULATING
    • G05D SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D1/00 Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
    • G05D1/02 Control of position or course in two dimensions
    • G05D1/021 Control of position or course in two dimensions specially adapted to land vehicles
    • G05D1/0287 Control of position or course in two dimensions specially adapted to land vehicles involving a plurality of land vehicles, e.g. fleet or convoy travelling
    • G05D1/0289 Control of position or course in two dimensions specially adapted to land vehicles involving a plurality of land vehicles, e.g. fleet or convoy travelling with means for avoiding collisions between vehicles

Landscapes

  • Engineering & Computer Science (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Remote Sensing (AREA)
  • Aviation & Aerospace Engineering (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Automation & Control Theory (AREA)
  • Feedback Control In General (AREA)

Abstract

The invention discloses mobile robot path planning based on an improved deep Q network algorithm, aimed at the problems in the prior art. The path planning introduces an improved artificial potential field attraction function and a target guiding action function into the action guiding strategy of the deep Q network algorithm, which effectively reduces the number of collisions during algorithm training and improves the convergence speed of the algorithm. The path planning designs a segmented reward function that gives a discounted reward value to the neighboring point of the current position that is closest to the target point, prompting the mobile robot to explore toward the target point so that the planned path is closer to optimal. The path planning also improves the sampling strategy: through the priorities of all samples stored in the sample experience pools, the sampling probability of a sample decreases as its sampling count increases, which raises the sample utilization rate and effectively alleviates the problems of loss value divergence and neural network overfitting during algorithm training.

Description

Mobile robot path planning based on improved deep Q network algorithm
Technical Field
The invention relates to the technical field of path planning, in particular to a mobile robot path planning method based on an improved deep Q network algorithm.
Background
The robot is the jewel in the crown of manufacturing. The research, development, manufacture and application of robots are important marks of a country's technological innovation and high-end manufacturing level, and developing robot technology and the robot industry has become a major strategic demand of China. Mobile robots are widely used in fields such as service, rescue and surveying. "Made in China 2025" and the "New Generation Artificial Intelligence Development Plan" place particular emphasis on and promote the development of mobile robots. Mobile robots have become an important direction of artificial intelligence research.
Path planning technology is one of the important topics in mobile robot research, and an effective path planning technique enables the mobile robot to complete operations and tasks efficiently. According to the characteristics of the algorithms, path planning methods are divided into traditional-algorithm path planning and intelligent-algorithm path planning. Most traditional path planning algorithms require a prior model of the environment and cannot explore and plan effectively in an unknown environment. Intelligent-algorithm path planning, which imitates the experience and characteristics of natural biological groups, is able to explore unknown environments and can be better applied to path planning in real environments; deep reinforcement learning is one such intelligent algorithm.
A deep reinforcement learning algorithm combines the perception capability of deep learning with the decision-making capability of reinforcement learning: facing an unknown environment, the mobile robot interacts with the environment through the reinforcement learning algorithm to learn a mapping between states and behaviors, maximizes the numerical return, obtains an optimal action sequence and plans an optimal path; a deep neural network is then used to handle high-dimensional state-space tasks, which overcomes the curse of dimensionality that reinforcement learning faces when processing high-dimensional data. The Deep Q-Network (DQN) is a classic value-function-based deep reinforcement learning algorithm that combines a convolutional neural network with the Q-Learning algorithm of reinforcement learning, so that the convolutional neural network replaces the Q table of the Q-Learning algorithm.
Google DeepMind proposed the DQN algorithm in 2013; its basic framework is the Q-Learning algorithm. Its action guiding strategy adopts a greedy strategy: to obtain the current action value, the mobile robot has to traverse the current state and the reward value of the next state after the action is executed, so it traverses many invalid environment states, a large amount of computation time is consumed, and the algorithm converges slowly.
Whether the path planned by the DQN algorithm is optimal depends on whether the action guiding strategy of the Q-Learning algorithm is optimal. After interacting with the environment and obtaining rewards in the Q-Learning algorithm, the mobile robot tends to execute the actions with larger reward values. In the traditional reward function, apart from the positive and negative reward values at the target point and at collision points with obstacles, the reward value of every other point is 0; as a result the mobile robot forms redundant paths with a large number of inflection points and cannot be prompted to plan an optimal path.
The DQN algorithm uses an experience replay mechanism to process samples during network model training: each time, a small batch of samples is drawn with equal probability from the stored experience sample pool to train the model, so the importance of different samples cannot be effectively distinguished. The DQN algorithm therefore has a low sample utilization rate, and the problems of loss value divergence and neural network overfitting occur during training.
Disclosure of Invention
The invention aims to provide a mobile robot path planning method based on an improved deep Q network algorithm. By improving the traditional DQN algorithm, the method addresses the slow convergence, non-optimal planned paths and low sample utilization rate of the deep Q network algorithm, effectively reduces the number of collisions during algorithm training, improves the convergence speed of the algorithm, and prompts the mobile robot to explore toward the target point, so that the planned path is closer to optimal.
The purpose of the invention can be realized by the following technical scheme:
A mobile robot path planning method based on an improved deep Q network algorithm, the method comprising the following steps:
Step 1: determine the position of the mobile robot and the position of the target point, and initialize the relevant parameters.
Step 2: initialize the prediction network parameter ω and the target network parameter ω', and determine the current state S of the mobile robot.
Step 3: with probability ε, the mobile robot selects the action a to execute using the improved artificial potential field attraction function as the action strategy, and with probability 1-ε it selects and executes the action a with the maximum Q value.
Step 4: because the artificial potential field attraction is large in the later stage of the algorithm, the mobile robot easily falls into a local minimum point; the target guiding action function is used to solve the problem of the mobile robot falling into a local minimum point.
Step 5: after the mobile robot executes the action a selected in steps 3 to 4, obtain the current-state reward value r, the next state S' and the next-state reward value r' according to the Q value function and the improved reward function, and store (S, a, r, r', S') as an experience sample in the corresponding experience pool.
Step 6: sample from experience pool D_1 with probability ρ and from experience pool D_2 with probability 1-ρ; compute and store the priority p of all samples in the sample experience pools D_1 and D_2 with the sample priority formula, then compute the probability of each sample being drawn with the sampling probability formula, and finally draw samples from the experience pools according to these probabilities.
Step 7: if the next state of a drawn sample is the target point state, the target Q value is the reward value of the target point state; otherwise the target Q value is the reward value plus the discounted maximum Q value of the next state.
Step 8: repeat steps 6 to 7 for the n draws of the sampling sequence to obtain the final target Q values.
Step 9: calculate the loss value with the mean square error loss function formula, and update the network parameter w of the prediction network with the mini-batch semi-gradient descent method.
Step 10: every fixed number of training steps, assign the network parameter w of the prediction network to the network parameter w' of the target network, and plan the path from the starting point to the end point according to the action selected by the mobile robot at each step.
Step 11: if the position state of the mobile robot at step t has not reached the target point state, repeat steps 2 to 10; otherwise, judge whether the number of training episodes has reached the total number of episodes of network training. If not, repeat steps 1 to 10; if it has, end the training, draw the path from the starting point to the end point according to the maximum-Q-value actions selected by the mobile robot, and terminate the algorithm.
Further, the relevant parameters initialized in step 1 are specifically as follows:
Determine the current position of the mobile robot X_t = [x_t, y_t], where t is the number of steps the mobile robot has searched and x_t, y_t are the horizontal and vertical coordinates of its position at step t; the target point position is X_tar = [x_tar, y_tar], whose horizontal and vertical coordinates are x_tar, y_tar. Initialize the capacities of the positive- and negative-value sample experience pools D_1 and D_2, the greedy strategy parameter ε, the total number of network training episodes N, the sample priority p and the sampling sequence number n.
Further, the action a to be executed next by the mobile robot in step 3 is selected as follows:
The magnitude of the improved artificial potential field attraction function F_att(X) is defined piecewise [formula given as an image in the original], where m is a threshold, L is the diagonal distance of the map environment, the direction of the attraction is the direction from the mobile robot toward the target point, and dis(X_tar - X_t) is the distance between the mobile robot at its current position X_t = [x_t, y_t] and the target point X_tar = [x_tar, y_tar], obtained with the Euclidean distance formula:
dis(X_tar - X_t) = sqrt((x_tar - x_t)^2 + (y_tar - y_t)^2)
From the attraction function F_att(X) it can be seen that the larger the distance between the current position X_t = [x_t, y_t] and the target point X_tar = [x_tar, y_tar], the smaller the attraction value, and that when this distance is less than the threshold m the attraction value remains unchanged.
The selection of the maximum-Q action is determined by the Q value function, whose update iteration is:
Q(S, a) ← Q(S, a) + α[r' + γ·max_a' Q(S', a') - Q(S, a)]
where Q(S, a) is the Q value of the current state, S is the current state, a is the executed action, α (0 ≤ α ≤ 1) is the learning rate, r' is the reward value given by the environment for the next state, γ (0 ≤ γ ≤ 1) is the discount coefficient, S' is the next state and a' is the action executed in the next step.
Further, the improved reward function in step 5 is specifically as follows:
The improved reward function is a segmented (piecewise) function [formula given as an image in the original], where λ (0 < λ < 1) is the discount factor of the reward function, X_ne = [x_ne, y_ne] is the neighboring point of the current position that is closest to the target point, and x_ne, y_ne are the horizontal and vertical coordinates of that neighboring point. Two further quantities [given as equation images in the original] denote the set of neighboring points of the current position and the set of distances between the neighboring points and the target point, where i is the direction of a neighboring point of the current position; each such distance is obtained with the Euclidean distance formula:
dis(X_tar - X_ne) = sqrt((x_tar - x_ne)^2 + (y_tar - y_ne)^2)
further, the sampling process in step 6 is specifically as follows:
priority p of sample j j The formula is as follows:
Figure BDA0003805860060000056
in the formula, u j The number of times each sample is sampled. From the formula, the priority p of the sample j Will gradually decrease as the number of times sampled increases, so according to the priority of the sample, the probability P (j) that each sample is sampled is defined as:
Figure BDA0003805860060000057
wherein, size (D) 1 or D 2 ) The cell capacity is the sample experience. From the formula, the algorithm is derived from D according to the sample sampled probability P (j) 1 And D 2 The method has the advantages that samples are extracted, the probability that high-value samples are repeatedly sampled is reduced, the probability that low-value samples are collected is improved, and therefore the problems of loss value divergence and neural network overfitting can be effectively solved.
Further, the mean square error loss function formula and the mini-batch semi-gradient descent formula in step 9 are specifically as follows:
The mean square error loss function is
Loss(w) = E[(r' + γ·max_a' Q(S', a', w') - Q(S, a, w))^2]
where E() is the mean square error function, w' is the target network parameter, w is the prediction network parameter, Q(S, a, w) is the prediction network output and Q(S', a', w') is the target network output; max_a' Q(S', a', w') is the maximum Q value of the next state, and r' + γ·max_a' Q(S', a', w') is the target Q value.
The mini-batch semi-gradient descent update is
w ← w + α[r' + γ·max_a' Q(S', a', w') - Q(S, a, w)]·∇_w Q(S, a, w)
where ∇_w Q(S, a, w) is the gradient of Q(S, a, w) with respect to w.
Further, the assignment formula of the network parameter w' of the target network in step 10 is as follows:
w'←τw+(1-τ)w'
where τ satisfies τ ≤ 1; through this assignment formula for w', the update amplitude of the target Q value at each step can be effectively controlled, which improves the stability of the mobile robot's learning.
The invention has the beneficial effects that:
1. In the action guiding strategy of the deep Q network algorithm, the path planning introduces an improved artificial potential field attraction function and a target guiding action function, which effectively reduces the number of collisions during algorithm training and improves the convergence speed of the algorithm;
2. The path planning designs a segmented reward function that gives a discounted reward value at the neighboring point of the current position that is closest to the target point, prompting the mobile robot to explore toward the target point so that the planned path is closer to optimal;
3. The path planning improves the sampling strategy: through the priorities of all samples stored in the sample experience pools, the sampling probability of a sample decreases as its sampling count increases, which raises the sample utilization rate and effectively alleviates the problems of loss value divergence and neural network overfitting during algorithm training.
Drawings
The invention will be further described with reference to the accompanying drawings.
FIG. 1 is a flow chart of a path planning method of the present invention;
FIG. 2 shows the 8 directions in which the robot can move;
FIG. 3 is a diagram of the path planning results of the present invention;
fig. 4 is a graph of the performance of the path planning algorithm training process of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
A mobile robot path planning method based on an improved deep Q network algorithm, as shown in fig. 1, comprises the following steps:
Step 1: determine the position of the mobile robot and the position of the target point, and initialize the relevant parameters.
First, determine the current position of the mobile robot X_t = [x_t, y_t], where t is the number of steps the mobile robot has searched and x_t, y_t are the horizontal and vertical coordinates of its position at step t; the target point position is X_tar = [x_tar, y_tar], whose horizontal and vertical coordinates are x_tar, y_tar. Initialize the capacities of the positive- and negative-value sample experience pools D_1 and D_2, the greedy strategy parameter ε, the total number of network training episodes N, the sample priority p and the sampling sequence number n.
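For illustration only, this initialization may be sketched in Python as follows; the concrete pool capacities and parameter values shown are assumptions for the sketch and are not taken from the filing.

```python
# Illustrative initialization of the quantities named in step 1; the concrete
# capacities and hyperparameter values below are assumed, not taken from the filing.
from collections import deque
import numpy as np

t = 0                                   # number of search steps taken so far
X_t = np.array([0, 0])                  # current position X_t = [x_t, y_t]
X_tar = np.array([19, 19])              # target point position X_tar = [x_tar, y_tar]

POOL_CAPACITY = 10000                   # capacity of each sample experience pool
D1 = deque(maxlen=POOL_CAPACITY)        # positive-value sample experience pool D_1
D2 = deque(maxlen=POOL_CAPACITY)        # negative-value sample experience pool D_2

epsilon = 0.1                           # greedy strategy parameter ε
N = 500                                 # total number of network training episodes N
n = 32                                  # sampling sequence number n (mini-batch size)
priorities = {}                         # sample priority p for each stored sample
```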
Step 2: initialize the prediction network parameter ω and the target network parameter ω', and determine the current state S of the mobile robot.
Step 3: with probability ε, select the action a to execute using the improved artificial potential field attraction function as the action strategy, and with probability 1-ε select and execute the action a with the maximum Q value. The specific method is as follows:
The magnitude of the improved artificial potential field attraction function F_att(X) is defined piecewise [formula given as an image in the original], where m is a threshold, L is the diagonal distance of the map environment, the direction of the attraction is the direction from the mobile robot toward the target point, and dis(X_tar - X_t) is the distance between the mobile robot at its current position X_t = [x_t, y_t] and the target point X_tar = [x_tar, y_tar], obtained with the Euclidean distance formula:
dis(X_tar - X_t) = sqrt((x_tar - x_t)^2 + (y_tar - y_t)^2)
From the attraction function F_att(X) it can be seen that the larger the distance between the current position X_t = [x_t, y_t] and the target point X_tar = [x_tar, y_tar], the smaller the attraction value, and that when this distance is less than the threshold m the attraction value remains unchanged.
The selection of the maximum-Q action is determined by the Q value function, whose update iteration is:
Q(S, a) ← Q(S, a) + α[r' + γ·max_a' Q(S', a') - Q(S, a)]
where Q(S, a) is the Q value of the current state, S is the current state, a is the executed action, α (0 ≤ α ≤ 1) is the learning rate, r' is the reward value given by the environment for the next state, γ (0 ≤ γ ≤ 1) is the discount coefficient, S' is the next state and a' is the action executed in the next step.
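Because the piecewise formula of F_att(X) is given only as an image, the following Python sketch assumes a simple linear form that reproduces the stated properties (the attraction shrinks as the robot-target distance grows and stays constant once that distance drops below m); the function names, the linear form and the default values of m and L are illustrative assumptions rather than the patented formula.

```python
# Illustrative ε-probability action selection for step 3. The linear form of f_att
# below is an assumption that only reproduces the properties stated in the text.
import math
import random

MOVES = [(0, 1), (1, 1), (1, 0), (1, -1), (0, -1), (-1, -1), (-1, 0), (-1, 1)]  # 8 move directions

def dis(a, b):
    """Euclidean distance between two grid positions."""
    return math.sqrt((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2)

def f_att(pos, target, m=2.0, L=28.0):
    """Assumed attraction magnitude: decreases with distance, constant below threshold m."""
    return L - max(dis(pos, target), m)

def select_action(pos, target, q_values, epsilon=0.1):
    """With probability ε follow the attraction, otherwise take the max-Q action."""
    if random.random() < epsilon:
        # attraction-guided choice: step into the neighboring cell with the largest attraction
        return max(range(8), key=lambda i: f_att((pos[0] + MOVES[i][0],
                                                  pos[1] + MOVES[i][1]), target))
    return max(range(8), key=lambda i: q_values[i])  # greedy choice on the predicted Q values
```

In this sketch the ε branch steers the robot toward the target point through the assumed attraction, while the 1-ε branch exploits the prediction network's Q values.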
Step 4: because the artificial potential field attraction is large in the later stage of the algorithm, the mobile robot easily falls into a local minimum point; the target guiding action function is used to solve this problem.
If the mobile robot falls into a local minimum point while exploring the environment, it executes an action using the target guiding action function as the action guiding strategy.
As shown in fig. 2, the moving direction of the mobile robot is divided into eight directions, namely up, upper right, right, lower right, down, lower left, left and upper left, represented by 1, 2, 3, 4, 5, 6, 7 and 8, and the target guiding action function is defined over these directions [formula given as an image in the original].
When the mobile robot falls into a local minimum point, an action set A is screened out and the mobile robot executes a random exploratory action from A; this prevents it from repeatedly executing the attraction-guided action that collides with obstacles and allows it to escape from the local minimum point.
Step 5: after the mobile robot executes the action a selected in steps 3 to 4, obtain the current-state reward value r, the next state S' and the next-state reward value r' according to the Q value function and the improved reward function, and store (S, a, r, r', S') as an experience sample in the corresponding experience pool.
The improved reward function is a segmented (piecewise) function [formula given as an image in the original], where λ (0 < λ < 1) is the discount factor of the reward function, X_ne = [x_ne, y_ne] is the neighboring point of the current position that is closest to the target point, and x_ne, y_ne are the horizontal and vertical coordinates of that neighboring point. Two further quantities [given as equation images in the original] denote the set of neighboring points of the current position and the set of distances between the neighboring points and the target point, where i is the direction of a neighboring point of the current position; each such distance is obtained with the Euclidean distance formula:
dis(X_tar - X_ne) = sqrt((x_tar - x_ne)^2 + (y_tar - y_ne)^2)
Because the improved reward function gives X_ne = [x_ne, y_ne] a discounted reward value, the Q value of X_ne is far larger than that of the other neighboring points. The mobile robot executes the action with the maximum Q value with probability 1-ε, so when it pursues the policy with the maximum expected cumulative reward it is guided to preferentially select actions that move toward the target point. The reward function thus guides the mobile robot to find the shortest path with the fewest steering angles and redundant points.
If the current-state reward value r in the experience sample is greater than or equal to 0, the sample is stored in the positive-value sample experience pool; otherwise it is stored in the negative-value sample experience pool.
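The segmented reward function is likewise given only as an image, so the reward magnitudes in the sketch below (+1 at the target point, -1 on collision, a λ-discounted bonus at the neighboring point nearest the target and 0 elsewhere) are assumptions; only the segmentation and the routing of samples into the positive- and negative-value pools follow the description.

```python
# Illustrative segmented reward (step 5) and experience storage; the magnitudes are assumed.
import math

MOVES = [(0, 1), (1, 1), (1, 0), (1, -1), (0, -1), (-1, -1), (-1, 0), (-1, 1)]

def nearest_neighbor_to_target(pos, target):
    """Neighboring point of the current position that is closest to the target point."""
    neighbors = [(pos[0] + dx, pos[1] + dy) for dx, dy in MOVES]
    return min(neighbors, key=lambda p: math.dist(p, target))

def reward(next_pos, pos, target, obstacles, lam=0.9):
    if next_pos == target:
        return 1.0                                    # positive reward at the target point
    if next_pos in obstacles:
        return -1.0                                   # negative reward on collision
    if next_pos == nearest_neighbor_to_target(pos, target):
        return lam * 1.0                              # discounted reward at the nearest neighboring point
    return 0.0                                        # reward of every other point is 0

def store(sample, D1, D2):
    """Route the experience sample by the sign of its current-state reward value r."""
    S, a, r, r_next, S_next = sample
    (D1 if r >= 0 else D2).append(sample)
```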
Step 6: using the greedy sampling strategy, sample from experience pool D_1 with probability ρ and from experience pool D_2 with probability 1-ρ; compute and store the priority p of all samples in the sample experience pools D_1 and D_2 with the sample priority formula, then compute the probability of each sample being drawn with the sampling probability formula, and finally draw samples from the experience pools according to these probabilities.
The priority p_j of sample j [formula given as an image in the original] depends on u_j, the number of times the sample has been drawn, and gradually decreases as that count increases. Therefore, according to the sample priorities, the probability P(j) that each sample is drawn is defined as
P(j) = p_j / Σ_k p_k, k = 1, …, size(D_1 or D_2)
where size(D_1 or D_2) is the capacity of the sample experience pool. Drawing samples from D_1 and D_2 according to the probability P(j) reduces the probability that high-value samples are repeatedly drawn and increases the probability that low-value samples are collected, which effectively alleviates the problems of loss value divergence and neural network overfitting.
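The exact priority formula is also given only as an image; the sketch below assumes p_j = 1/(1 + u_j), which has the stated property that the priority decays as the sampling count u_j grows, and lets the drawing probabilities P(j) arise from normalizing these priorities.

```python
# Illustrative dual-pool prioritized sampling for step 6. The priority form 1/(1+u_j)
# is an assumption; only its monotone decay with the sampling count follows the text.
import random

def priority(u_j):
    return 1.0 / (1.0 + u_j)                    # decays as the sample is drawn more often

def sample_batch(D1, D2, counts, rho=0.5, n=32):
    """Draw n samples, choosing pool D1 with probability rho and D2 with probability 1-rho."""
    batch = []
    for _ in range(n):
        pool = D1 if random.random() < rho else D2
        if not pool:                            # fall back if the chosen pool is still empty
            pool = D2 if pool is D1 else D1
        weights = [priority(counts.get(id(s), 0)) for s in pool]       # p_j for every sample
        j = random.choices(range(len(pool)), weights=weights, k=1)[0]  # P(j) = p_j / sum_k p_k
        counts[id(pool[j])] = counts.get(id(pool[j]), 0) + 1           # update u_j
        batch.append(pool[j])
    return batch
```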
Step 7: if the next state of a drawn sample is the target point state, the target Q value is the reward value of the target point state; otherwise the target Q value is the reward value plus the discounted maximum Q value of the next state.
Step 8: repeat steps 6 to 7 for the n draws of the sampling sequence to obtain the final target Q values.
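A minimal sketch of steps 7 to 8 follows; q_target is an assumed callable returning the target network's Q values for a state, and gamma stands for the discount coefficient γ.

```python
# Illustrative target Q values for the n drawn samples (steps 7-8). q_target is an
# assumed callable mapping a state to the target network's Q values for the 8 actions.
def compute_target_q(batch, q_target, target_point_state, gamma=0.9):
    targets = []
    for S, a, r, r_next, S_next in batch:
        if S_next == target_point_state:
            targets.append(r_next)                                   # reward value of the target point state
        else:
            targets.append(r_next + gamma * max(q_target(S_next)))   # reward plus discounted max Q of next state
    return targets
```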
Step 9: calculate the loss value with the mean square error loss function formula, and update the network parameter w of the prediction network with the mini-batch semi-gradient descent method.
The mean square error loss function is
Loss(w) = E[(r' + γ·max_a' Q(S', a', w') - Q(S, a, w))^2]
where E() is the mean square error function, w' is the target network parameter, w is the prediction network parameter, Q(S, a, w) is the prediction network output and Q(S', a', w') is the target network output; max_a' Q(S', a', w') is the maximum Q value of the next state, and r' + γ·max_a' Q(S', a', w') is the target Q value.
The mini-batch semi-gradient descent update is
w ← w + α[r' + γ·max_a' Q(S', a', w') - Q(S, a, w)]·∇_w Q(S, a, w)
where ∇_w Q(S, a, w) is the gradient of Q(S, a, w) with respect to w.
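A hedged PyTorch-style sketch of step 9 is given below; detaching the target term so that the gradient flows only through Q(S, a, w) is what makes the update a semi-gradient step, and the network class, optimizer and tensor shapes are assumptions.

```python
# Illustrative mini-batch semi-gradient step (step 9), assuming a PyTorch prediction
# network whose forward pass maps a batch of states to Q values for the 8 actions.
import torch
import torch.nn.functional as F

def semi_gradient_step(pred_net, optimizer, states, actions, targets):
    """states: (B, state_dim) float tensor, actions: (B,) long tensor, targets: (B,) float tensor."""
    q_sa = pred_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)  # Q(S, a, w)
    loss = F.mse_loss(q_sa, targets.detach())   # E[(r' + γ·max_a' Q(S',a',w') - Q(S,a,w))^2]
    optimizer.zero_grad()
    loss.backward()                             # gradient taken only through Q(S, a, w)
    optimizer.step()                            # move w along the semi-gradient direction
    return loss.item()
```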
Step 10: every fixed number of training steps, assign the network parameter w of the prediction network to the network parameter w' of the target network, and plan the path from the starting point to the end point according to the action selected by the mobile robot at each step.
The assignment formula of the network parameter w' of the target network is as follows:
w'←τw+(1-τ)w'
where τ satisfies τ ≤ 1; through this assignment formula for w', the update amplitude of the target Q value at each step can be effectively controlled, which improves the stability of the mobile robot's learning.
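The soft-update assignment w' ← τw + (1-τ)w' can be applied directly to the parameter tensors; the sketch below assumes PyTorch modules for the prediction and target networks.

```python
# Illustrative soft update of the target network parameters (step 10): w' ← τ·w + (1-τ)·w'.
import torch

@torch.no_grad()
def soft_update(target_net, pred_net, tau=0.01):
    for w_target, w in zip(target_net.parameters(), pred_net.parameters()):
        w_target.mul_(1.0 - tau).add_(tau * w)   # in-place: w' = (1-τ)·w' + τ·w
```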
Step 11: if the position state of the mobile robot at step t has not reached the target point state, repeat steps 2 to 10; otherwise, judge whether the number of training episodes has reached the total number of episodes of network training. If not, repeat steps 1 to 10; if it has, end the training, draw the path from the starting point to the end point according to the maximum-Q-value actions selected by the mobile robot, and terminate the algorithm.
As shown in fig. 3, the plotted trajectory is the path from the starting point to the end point planned by the mobile robot; the path is short and highly smooth, and the steering angles during the robot's motion are reduced, so a good path is obtained and mobile robot path planning can be better realized in a real environment.
As shown in fig. 4, during training of the heuristic deep Q network path planning algorithm for the mobile robot, the peak number of training steps does not reach 2000 in the first 100 episodes; after 100 episodes the algorithm converges rapidly and flattens out with almost no oscillation of the network model, which shows that the problems of loss value divergence and neural network overfitting are effectively alleviated.
In the description herein, references to the description of "one embodiment," "an example," "a specific example," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
The foregoing shows and describes the general principles, principal features, and advantages of the invention. It will be understood by those skilled in the art that the present invention is not limited to the embodiments described above, which are given by way of illustration of the principles of the present invention, but that various changes and modifications may be made without departing from the spirit and scope of the invention, and such changes and modifications are within the scope of the invention as claimed.

Claims (7)

1. A mobile robot path planning method based on an improved deep Q network algorithm, characterized by comprising the following steps:
step 1: determining the position of the mobile robot and the position of a target point, and initializing relevant parameters;
step 2: initializing a prediction network parameter ω and a target network parameter ω', and determining the current state S of the mobile robot;
step 3: the mobile robot selecting, with probability ε, the action a to execute by taking the improved artificial potential field attraction function as the action strategy, and selecting and executing, with probability 1-ε, the action a with the maximum Q value;
step 4: the artificial potential field attraction being large in the later stage of the algorithm, the mobile robot easily falls into a local minimum point, and the problem of the mobile robot falling into the local minimum point is solved through a target guiding action function;
step 5: after the mobile robot executes the action a selected in steps 3 to 4, obtaining the current-state reward value r, the next state S' and the next-state reward value r' according to the Q value function and the improved reward function, and storing (S, a, r, r', S') as an experience sample in the corresponding experience pool;
step 6: sampling from experience pool D_1 with probability ρ and from experience pool D_2 with probability 1-ρ, computing the priority p of all samples stored in the sample experience pools D_1 and D_2 with the sample priority formula, then computing the probability of each sample being drawn with the sampling probability formula, and finally drawing samples from the experience pools according to these probabilities;
step 7: if the next state of a drawn sample is the target point state, the target Q value is the reward value of the target point state; otherwise the target Q value is the reward value plus the discounted maximum Q value of the next state;
step 8: repeating steps 6 to 7 for the n draws of the sampling sequence to obtain the final target Q values;
step 9: calculating the loss value according to the mean square error loss function formula, and updating the network parameter w of the prediction network by using the mini-batch semi-gradient descent method;
step 10: assigning the network parameter w of the prediction network to the network parameter w' of the target network every fixed number of training steps, and planning the path from the starting point to the end point according to the action selected by the mobile robot at each step;
step 11: if the position state of the mobile robot at step t has not reached the target point state, repeating steps 2 to 10; otherwise, judging whether the number of training episodes has reached the total number of episodes of network training; if not, repeating steps 1 to 10; if so, ending the training, drawing the path from the starting point to the end point according to the maximum-Q-value actions selected by the mobile robot, and ending the algorithm.
2. The mobile robot path planning method based on the improved deep Q network algorithm according to claim 1, wherein the relevant parameters initialized in step 1 are as follows:
determining the current position of the mobile robot X_t = [x_t, y_t], where t is the number of steps the mobile robot has searched and x_t, y_t are the horizontal and vertical coordinates of its position at step t; the target point position is X_tar = [x_tar, y_tar], whose horizontal and vertical coordinates are x_tar, y_tar; and initializing the capacities of the positive- and negative-value sample experience pools D_1 and D_2, the greedy strategy parameter ε, the total number of network training episodes N, the sample priority p and the sampling sequence number n.
3. The mobile robot path planning method based on the improved deep Q network algorithm according to claim 1, wherein the action a to be executed by the mobile robot in step 3 is determined as follows:
the magnitude of the improved artificial potential field attraction function F_att(X) is defined piecewise [formula given as an image in the original], where m is a threshold, L is the diagonal distance of the map environment, the direction of the attraction is the direction from the mobile robot toward the target point, and dis(X_tar - X_t) is the distance between the mobile robot at its current position X_t = [x_t, y_t] and the target point X_tar = [x_tar, y_tar], obtained with the Euclidean distance formula:
dis(X_tar - X_t) = sqrt((x_tar - x_t)^2 + (y_tar - y_t)^2)
from the attraction function F_att(X) it can be seen that the larger the distance between the current position X_t = [x_t, y_t] and the target point X_tar = [x_tar, y_tar], the smaller the attraction value, and when this distance is less than the threshold m the attraction value remains unchanged;
the selection of the maximum-Q action is determined by the Q value function, whose update iteration is:
Q(S, a) ← Q(S, a) + α[r' + γ·max_a' Q(S', a') - Q(S, a)]
where Q(S, a) is the Q value of the current state, S is the current state, a is the executed action, α (0 ≤ α ≤ 1) is the learning rate, r' is the reward value given by the environment for the next state, γ (0 ≤ γ ≤ 1) is the discount coefficient, S' is the next state and a' is the action executed in the next step.
4. The mobile robot path planning method based on the improved deep Q network algorithm according to claim 1, wherein the reward function improved in step 5 is specifically as follows:
the improved reward function is a segmented (piecewise) function [formula given as an image in the original], where λ (0 < λ < 1) is the discount factor of the reward function, X_ne = [x_ne, y_ne] is the neighboring point of the current position that is closest to the target point, and x_ne, y_ne are the horizontal and vertical coordinates of that neighboring point; two further quantities [given as equation images in the original] denote the set of neighboring points of the current position and the set of distances between the neighboring points and the target point, where i is the direction of a neighboring point of the current position; each such distance is obtained with the Euclidean distance formula:
dis(X_tar - X_ne) = sqrt((x_tar - x_ne)^2 + (y_tar - y_ne)^2)
5. the mobile robot path planning method based on the improved deep Q-network algorithm according to claim 1, wherein the sampling sample process in step 6 is as follows:
priority p of sample j j The formula is as follows:
Figure FDA0003805860050000037
in the formula, u j The number of times each sample is sampled. From the formula, the priority p of the sample j Will gradually decrease as the number of times sampled increases, so according to the priority of the sample, the probability P (j) that each sample is sampled is defined as:
Figure FDA0003805860050000041
wherein, size (D) 1 or D 2 ) The pool capacity is experienced for the sample. From the formula, the algorithm is derived from D by the probability that the sample is sampled P (j) 1 And D 2 And the samples are extracted, so that the probability of repeated sampling of the high-value samples is reduced, and the probability of acquiring the low-value samples is improved, thereby effectively relieving the problems of loss value divergence and neural network overfitting.
6. The mobile robot path planning method based on the improved deep Q network algorithm according to claim 1, wherein the mean square error loss function formula and the mini-batch semi-gradient descent formula in step 9 are specifically as follows:
the mean square error loss function is
Loss(w) = E[(r' + γ·max_a' Q(S', a', w') - Q(S, a, w))^2]
where E() is the mean square error function, w' is the target network parameter, w is the prediction network parameter, Q(S, a, w) is the prediction network output and Q(S', a', w') is the target network output; max_a' Q(S', a', w') is the maximum Q value of the next state, and r' + γ·max_a' Q(S', a', w') is the target Q value;
the mini-batch semi-gradient descent update is
w ← w + α[r' + γ·max_a' Q(S', a', w') - Q(S, a, w)]·∇_w Q(S, a, w)
where ∇_w Q(S, a, w) is the gradient of Q(S, a, w) with respect to w.
7. The mobile robot path planning method based on the improved deep Q network algorithm according to claim 1, wherein the assignment formula of the network parameter w' of the target network in step 10 is as follows:
w'←τw+(1-τ)w'
where τ satisfies τ ≤ 1; through this assignment formula for w', the update amplitude of the target Q value at each step can be effectively controlled, which improves the stability of the mobile robot's learning.
CN202211002713.2A 2022-08-19 2022-08-19 Mobile robot path planning based on improved deep Q network algorithm Pending CN115344046A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211002713.2A CN115344046A (en) 2022-08-19 2022-08-19 Mobile robot path planning based on improved deep Q network algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211002713.2A CN115344046A (en) 2022-08-19 2022-08-19 Mobile robot path planning based on improved deep Q network algorithm

Publications (1)

Publication Number Publication Date
CN115344046A true CN115344046A (en) 2022-11-15

Family

ID=83954272

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211002713.2A Pending CN115344046A (en) 2022-08-19 2022-08-19 Mobile robot path planning based on improved deep Q network algorithm

Country Status (1)

Country Link
CN (1) CN115344046A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116700258A (en) * 2023-06-13 2023-09-05 重庆市荣冠科技有限公司 Intelligent vehicle path planning method based on artificial potential field method and reinforcement learning
CN116700258B (en) * 2023-06-13 2024-05-03 万基泰科工集团数字城市科技有限公司 Intelligent vehicle path planning method based on artificial potential field method and reinforcement learning

Similar Documents

Publication Publication Date Title
Jiang et al. Path planning for intelligent robots based on deep Q-learning with experience replay and heuristic knowledge
Zhu et al. Deep reinforcement learning supervised autonomous exploration in office environments
CN109945881B (en) Mobile robot path planning method based on ant colony algorithm
CN107272679B (en) Path planning method based on improved ant colony algorithm
CN114489059B (en) Path planning method based on D3QN-PER mobile robot
CN112325897B (en) Path planning method based on heuristic deep reinforcement learning
CN113110592A (en) Unmanned aerial vehicle obstacle avoidance and path planning method
CN111982125A (en) Path planning method based on improved ant colony algorithm
CN109597425B (en) Unmanned aerial vehicle navigation and obstacle avoidance method based on reinforcement learning
CN111664852B (en) Unmanned aerial vehicle path planning method and device
CN112362066A (en) Path planning method based on improved deep reinforcement learning
CN111880561A (en) Unmanned aerial vehicle three-dimensional path planning method based on improved whale algorithm in urban environment
CN115344046A (en) Mobile robot path planning based on improved deep Q network algorithm
CN112484732A (en) IB-ABC algorithm-based unmanned aerial vehicle flight path planning method
CN116494247A (en) Mechanical arm path planning method and system based on depth deterministic strategy gradient
CN115454067A (en) Path planning method based on fusion algorithm
CN114326726B (en) Formation path planning control method based on A and improved artificial potential field method
CN113218400B (en) Multi-agent navigation algorithm based on deep reinforcement learning
CN114815801A (en) Adaptive environment path planning method based on strategy-value network and MCTS
CN113722980A (en) Ocean wave height prediction method, system, computer equipment, storage medium and terminal
CN116164753B (en) Mine unmanned vehicle path navigation method and device, computer equipment and storage medium
CN117434950A (en) Mobile robot dynamic path planning method based on Harris eagle heuristic hybrid algorithm
CN115454070B (en) K-Means ant colony algorithm multi-robot path planning method
CN111174794A (en) Unmanned ship path planning method and system based on improved particle swarm algorithm
CN116225046A (en) Unmanned aerial vehicle autonomous path planning method based on deep reinforcement learning under unknown environment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination