CN115344046A - Mobile robot path planning based on improved deep Q network algorithm


Info

Publication number
CN115344046A
Authority
CN
China
Prior art keywords
mobile robot
value
target
network
sample
Prior art date
Legal status
Pending
Application number
CN202211002713.2A
Other languages
Chinese (zh)
Inventor
臧强
徐博文
李宁
郭镜虹
Current Assignee
Nanjing University of Information Science and Technology
Original Assignee
Nanjing University of Information Science and Technology
Priority date
Filing date
Publication date
Application filed by Nanjing University of Information Science and Technology filed Critical Nanjing University of Information Science and Technology
Priority to CN202211002713.2A priority Critical patent/CN115344046A/en
Publication of CN115344046A publication Critical patent/CN115344046A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G05 CONTROLLING; REGULATING
    • G05D SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D1/00 Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
    • G05D1/02 Control of position or course in two dimensions
    • G05D1/021 Control of position or course in two dimensions specially adapted to land vehicles
    • G05D1/0212 Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory
    • G05D1/0223 Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory involving speed control of the vehicle
    • G PHYSICS
    • G05 CONTROLLING; REGULATING
    • G05D SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D1/00 Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
    • G05D1/02 Control of position or course in two dimensions
    • G05D1/021 Control of position or course in two dimensions specially adapted to land vehicles
    • G05D1/0212 Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory
    • G05D1/0221 Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory involving a learning process
    • G PHYSICS
    • G05 CONTROLLING; REGULATING
    • G05D SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D1/00 Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
    • G05D1/02 Control of position or course in two dimensions
    • G05D1/021 Control of position or course in two dimensions specially adapted to land vehicles
    • G05D1/0287 Control of position or course in two dimensions specially adapted to land vehicles involving a plurality of land vehicles, e.g. fleet or convoy travelling
    • G05D1/0289 Control of position or course in two dimensions specially adapted to land vehicles involving a plurality of land vehicles, e.g. fleet or convoy travelling with means for avoiding collisions between vehicles

Landscapes

  • Engineering & Computer Science (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Remote Sensing (AREA)
  • Aviation & Aerospace Engineering (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Automation & Control Theory (AREA)
  • Feedback Control In General (AREA)

Abstract

The invention discloses mobile robot path planning based on an improved deep Q network algorithm, aimed at the problems in the prior art. The path planning introduces an improved artificial potential field attraction function and a target guiding action function into the action guiding strategy of the deep Q network algorithm, which effectively reduces the number of collisions during algorithm training and improves the convergence speed of the algorithm. The path planning designs a segmented reward function that gives a discounted reward value to the neighboring point of the current position that is closest to the target point, prompting the mobile robot to explore toward the target point so that the planned path is closer to optimal. The path planning also improves the sampling strategy: through the priorities of all samples stored in the sample experience pools, the sampling probability of a sample decreases as its sampling count increases, which raises the sample utilization rate and effectively alleviates the problems of loss value divergence and neural network overfitting during algorithm training.

Description

Mobile robot path planning based on improved deep Q network algorithm
Technical Field
The invention relates to the technical field of path planning, in particular to a mobile robot path planning method based on an improved deep Q network algorithm.
Background
The robot is the jewel in the crown of manufacturing. The research, development, manufacture and application of robots are important marks of a country's technological innovation and high-end manufacturing level, and developing robot technology and the robot industry has become a major strategic demand of China. Mobile robots are widely used in fields such as service, rescue and surveying. "Made in China 2025" and the "New Generation Artificial Intelligence Development Plan" place particular emphasis on and promote the development of mobile robots. Mobile robots have become an important direction of artificial intelligence research.
Path planning technology is one of the important topics in mobile robot research, and an effective path planning technique enables the mobile robot to complete operations and tasks efficiently. According to the characteristics of the algorithms, path planning methods are divided into traditional-algorithm path planning and intelligent-algorithm path planning. Most traditional path planning algorithms require a prior model of the environment and cannot explore and plan effectively in an unknown environment. Intelligent-algorithm path planning, which imitates the experience and characteristics of natural biological groups, is able to explore unknown environments and can be better applied to path planning in real environments; deep reinforcement learning is one such intelligent algorithm.
A deep reinforcement learning algorithm combines the perception capability of deep learning with the decision-making capability of reinforcement learning: facing an unknown environment, the mobile robot interacts with the environment through the reinforcement learning algorithm to learn a mapping between states and behaviors, maximizes the numerical return, obtains an optimal action sequence and plans an optimal path; a deep neural network is then used to handle high-dimensional state-space tasks, which overcomes the curse of dimensionality that reinforcement learning faces when processing high-dimensional data. The Deep Q-Network (DQN) is a classic value-function-based deep reinforcement learning algorithm that combines a convolutional neural network with the Q-Learning algorithm of reinforcement learning, so that the convolutional neural network replaces the Q table of the Q-Learning algorithm.
Google DeepMind proposed the DQN algorithm in 2013; its basic framework is the Q-Learning algorithm. Its action guiding strategy adopts a greedy strategy: to obtain the current action value, the mobile robot has to traverse the current state and the reward value of the next state after the action is executed, so it traverses many invalid environment states, a large amount of computation time is consumed, and the algorithm converges slowly.
Whether the path planned by the DQN algorithm is optimal depends on whether the action guiding strategy of the Q-Learning algorithm is optimal. After interacting with the environment and obtaining rewards in the Q-Learning algorithm, the mobile robot tends to execute the actions with larger reward values. In the traditional reward function, apart from the positive and negative reward values at the target point and at collision points with obstacles, the reward value of every other point is 0; as a result the mobile robot forms redundant paths with a large number of inflection points and cannot be prompted to plan an optimal path.
The DQN algorithm uses an experience replay mechanism to process samples during network model training: each time, a small batch of samples is drawn with equal probability from the stored experience sample pool to train the model, so the importance of different samples cannot be effectively distinguished. The DQN algorithm therefore has a low sample utilization rate, and the problems of loss value divergence and neural network overfitting occur during training.
Disclosure of Invention
The invention aims to provide a mobile robot path planning method based on an improved deep Q network algorithm. By improving the traditional DQN algorithm, the method addresses the slow convergence, non-optimal planned paths and low sample utilization rate of the deep Q network algorithm, effectively reduces the number of collisions during algorithm training, improves the convergence speed of the algorithm, and prompts the mobile robot to explore toward the target point, so that the planned path is closer to optimal.
The purpose of the invention can be realized by the following technical scheme:
A mobile robot path planning method based on an improved deep Q network algorithm, the method comprising the following steps:
Step 1: determine the position of the mobile robot and the position of the target point, and initialize the relevant parameters.
Step 2: initialize the prediction network parameter ω and the target network parameter ω', and determine the current state S of the mobile robot.
Step 3: with probability ε, the mobile robot selects the action a to execute using the improved artificial potential field attraction function as the action strategy, and with probability 1-ε it selects and executes the action a with the maximum Q value.
Step 4: because the artificial potential field attraction is large in the later stage of the algorithm, the mobile robot easily falls into a local minimum point; the target guiding action function is used to solve the problem of the mobile robot falling into a local minimum point.
Step 5: after the mobile robot executes the action a selected in steps 3 to 4, obtain the current-state reward value r, the next state S' and the next-state reward value r' according to the Q value function and the improved reward function, and store (S, a, r, r', S') as an experience sample in the corresponding experience pool.
Step 6: sample from experience pool D_1 with probability ρ and from experience pool D_2 with probability 1-ρ; compute and store the priority p of all samples in the sample experience pools D_1 and D_2 with the sample priority formula, then compute the probability of each sample being drawn with the sampling probability formula, and finally draw samples from the experience pools according to these probabilities.
Step 7: if the next state of a drawn sample is the target point state, the target Q value is the reward value of the target point state; otherwise the target Q value is the reward value plus the discounted maximum Q value of the next state.
Step 8: repeat steps 6 to 7 for the n draws of the sampling sequence to obtain the final target Q values.
Step 9: calculate the loss value with the mean square error loss function formula, and update the network parameter w of the prediction network with the mini-batch semi-gradient descent method.
Step 10: every fixed number of training steps, assign the network parameter w of the prediction network to the network parameter w' of the target network, and plan the path from the starting point to the end point according to the action selected by the mobile robot at each step.
Step 11: if the position state of the mobile robot at step t has not reached the target point state, repeat steps 2 to 10; otherwise, judge whether the number of training episodes has reached the total number of episodes of network training. If not, repeat steps 1 to 10; if it has, end the training, draw the path from the starting point to the end point according to the maximum-Q-value actions selected by the mobile robot, and terminate the algorithm.
Further, the relevant parameters initialized in step 1 are specifically as follows:
Determine the current position of the mobile robot X_t = [x_t, y_t], where t is the number of steps the mobile robot has searched and x_t, y_t are the horizontal and vertical coordinates of its position at step t; the target point position is X_tar = [x_tar, y_tar], whose horizontal and vertical coordinates are x_tar, y_tar. Initialize the capacities of the positive- and negative-value sample experience pools D_1 and D_2, the greedy strategy parameter ε, the total number of network training episodes N, the sample priority p and the sampling sequence number n.
Further, the action a to be executed next by the mobile robot in step 3 is selected as follows:
The magnitude of the improved artificial potential field attraction function F_att(X) is defined piecewise [formula given as an image in the original], where m is a threshold, L is the diagonal distance of the map environment, the direction of the attraction is the direction from the mobile robot toward the target point, and dis(X_tar - X_t) is the distance between the mobile robot at its current position X_t = [x_t, y_t] and the target point X_tar = [x_tar, y_tar], obtained with the Euclidean distance formula:
dis(X_tar - X_t) = sqrt((x_tar - x_t)^2 + (y_tar - y_t)^2)
From the attraction function F_att(X) it can be seen that the larger the distance between the current position X_t = [x_t, y_t] and the target point X_tar = [x_tar, y_tar], the smaller the attraction value, and that when this distance is less than the threshold m the attraction value remains unchanged.
The selection of the maximum-Q action is determined by the Q value function, whose update iteration is:
Q(S, a) ← Q(S, a) + α[r' + γ·max_a' Q(S', a') - Q(S, a)]
where Q(S, a) is the Q value of the current state, S is the current state, a is the executed action, α (0 ≤ α ≤ 1) is the learning rate, r' is the reward value given by the environment for the next state, γ (0 ≤ γ ≤ 1) is the discount coefficient, S' is the next state and a' is the action executed in the next step.
Further, the improved reward function in step 5 is specifically as follows:
The improved reward function is a segmented (piecewise) function [formula given as an image in the original], where λ (0 < λ < 1) is the discount factor of the reward function, X_ne = [x_ne, y_ne] is the neighboring point of the current position that is closest to the target point, and x_ne, y_ne are the horizontal and vertical coordinates of that neighboring point. Two further quantities [given as equation images in the original] denote the set of neighboring points of the current position and the set of distances between the neighboring points and the target point, where i is the direction of a neighboring point of the current position; each such distance is obtained with the Euclidean distance formula:
dis(X_tar - X_ne) = sqrt((x_tar - x_ne)^2 + (y_tar - y_ne)^2)
further, the sampling process in step 6 is specifically as follows:
priority p of sample j j The formula is as follows:
Figure BDA0003805860060000056
in the formula, u j The number of times each sample is sampled. From the formula, the priority p of the sample j Will gradually decrease as the number of times sampled increases, so according to the priority of the sample, the probability P (j) that each sample is sampled is defined as:
Figure BDA0003805860060000057
wherein, size (D) 1 or D 2 ) The cell capacity is the sample experience. From the formula, the algorithm is derived from D according to the sample sampled probability P (j) 1 And D 2 The method has the advantages that samples are extracted, the probability that high-value samples are repeatedly sampled is reduced, the probability that low-value samples are collected is improved, and therefore the problems of loss value divergence and neural network overfitting can be effectively solved.
Further, the mean square error loss function formula and the mini-batch semi-gradient descent formula in step 9 are specifically as follows:
The mean square error loss function is
Loss(w) = E[(r' + γ·max_a' Q(S', a', w') - Q(S, a, w))^2]
where E() is the mean square error function, w' is the target network parameter, w is the prediction network parameter, Q(S, a, w) is the prediction network output and Q(S', a', w') is the target network output; max_a' Q(S', a', w') is the maximum Q value of the next state, and r' + γ·max_a' Q(S', a', w') is the target Q value.
The mini-batch semi-gradient descent update is
w ← w + α[r' + γ·max_a' Q(S', a', w') - Q(S, a, w)]·∇_w Q(S, a, w)
where ∇_w Q(S, a, w) is the gradient of Q(S, a, w) with respect to w.
Further, the assignment formula of the network parameter w' of the target network in step 10 is as follows:
w'←τw+(1-τ)w'
where τ satisfies τ ≤ 1; through this assignment formula for w', the update amplitude of the target Q value at each step can be effectively controlled, which improves the stability of the mobile robot's learning.
The invention has the beneficial effects that:
1. In the action guiding strategy of the deep Q network algorithm, the path planning introduces an improved artificial potential field attraction function and a target guiding action function, which effectively reduces the number of collisions during algorithm training and improves the convergence speed of the algorithm;
2. The path planning designs a segmented reward function that gives a discounted reward value at the neighboring point of the current position that is closest to the target point, prompting the mobile robot to explore toward the target point so that the planned path is closer to optimal;
3. The path planning improves the sampling strategy: through the priorities of all samples stored in the sample experience pools, the sampling probability of a sample decreases as its sampling count increases, which raises the sample utilization rate and effectively alleviates the problems of loss value divergence and neural network overfitting during algorithm training.
Drawings
The invention will be further described with reference to the accompanying drawings.
FIG. 1 is a flow chart of a path planning method of the present invention;
FIG. 2 shows the 8 directions in which the robot can move;
FIG. 3 is a diagram of the path planning results of the present invention;
fig. 4 is a graph of the performance of the path planning algorithm training process of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
A mobile robot path planning method based on an improved deep Q network algorithm, as shown in fig. 1, comprises the following steps:
Step 1: determine the position of the mobile robot and the position of the target point, and initialize the relevant parameters.
First, determine the current position of the mobile robot X_t = [x_t, y_t], where t is the number of steps the mobile robot has searched and x_t, y_t are the horizontal and vertical coordinates of its position at step t; the target point position is X_tar = [x_tar, y_tar], whose horizontal and vertical coordinates are x_tar, y_tar. Initialize the capacities of the positive- and negative-value sample experience pools D_1 and D_2, the greedy strategy parameter ε, the total number of network training episodes N, the sample priority p and the sampling sequence number n.
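For illustration only, this initialization may be sketched in Python as follows; the concrete pool capacities and parameter values shown are assumptions for the sketch and are not taken from the filing.

```python
# Illustrative initialization of the quantities named in step 1; the concrete
# capacities and hyperparameter values below are assumed, not taken from the filing.
from collections import deque
import numpy as np

t = 0                                   # number of search steps taken so far
X_t = np.array([0, 0])                  # current position X_t = [x_t, y_t]
X_tar = np.array([19, 19])              # target point position X_tar = [x_tar, y_tar]

POOL_CAPACITY = 10000                   # capacity of each sample experience pool
D1 = deque(maxlen=POOL_CAPACITY)        # positive-value sample experience pool D_1
D2 = deque(maxlen=POOL_CAPACITY)        # negative-value sample experience pool D_2

epsilon = 0.1                           # greedy strategy parameter ε
N = 500                                 # total number of network training episodes N
n = 32                                  # sampling sequence number n (mini-batch size)
priorities = {}                         # sample priority p for each stored sample
```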
Step 2: initialize the prediction network parameter ω and the target network parameter ω', and determine the current state S of the mobile robot.
Step 3: with probability ε, select the action a to execute using the improved artificial potential field attraction function as the action strategy, and with probability 1-ε select and execute the action a with the maximum Q value. The specific method is as follows:
The magnitude of the improved artificial potential field attraction function F_att(X) is defined piecewise [formula given as an image in the original], where m is a threshold, L is the diagonal distance of the map environment, the direction of the attraction is the direction from the mobile robot toward the target point, and dis(X_tar - X_t) is the distance between the mobile robot at its current position X_t = [x_t, y_t] and the target point X_tar = [x_tar, y_tar], obtained with the Euclidean distance formula:
dis(X_tar - X_t) = sqrt((x_tar - x_t)^2 + (y_tar - y_t)^2)
From the attraction function F_att(X) it can be seen that the larger the distance between the current position X_t = [x_t, y_t] and the target point X_tar = [x_tar, y_tar], the smaller the attraction value, and that when this distance is less than the threshold m the attraction value remains unchanged.
The selection of the maximum-Q action is determined by the Q value function, whose update iteration is:
Q(S, a) ← Q(S, a) + α[r' + γ·max_a' Q(S', a') - Q(S, a)]
where Q(S, a) is the Q value of the current state, S is the current state, a is the executed action, α (0 ≤ α ≤ 1) is the learning rate, r' is the reward value given by the environment for the next state, γ (0 ≤ γ ≤ 1) is the discount coefficient, S' is the next state and a' is the action executed in the next step.
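Because the piecewise formula of F_att(X) is given only as an image, the following Python sketch assumes a simple linear form that reproduces the stated properties (the attraction shrinks as the robot-target distance grows and stays constant once that distance drops below m); the function names, the linear form and the default values of m and L are illustrative assumptions rather than the patented formula.

```python
# Illustrative ε-probability action selection for step 3. The linear form of f_att
# below is an assumption that only reproduces the properties stated in the text.
import math
import random

MOVES = [(0, 1), (1, 1), (1, 0), (1, -1), (0, -1), (-1, -1), (-1, 0), (-1, 1)]  # 8 move directions

def dis(a, b):
    """Euclidean distance between two grid positions."""
    return math.sqrt((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2)

def f_att(pos, target, m=2.0, L=28.0):
    """Assumed attraction magnitude: decreases with distance, constant below threshold m."""
    return L - max(dis(pos, target), m)

def select_action(pos, target, q_values, epsilon=0.1):
    """With probability ε follow the attraction, otherwise take the max-Q action."""
    if random.random() < epsilon:
        # attraction-guided choice: step into the neighboring cell with the largest attraction
        return max(range(8), key=lambda i: f_att((pos[0] + MOVES[i][0],
                                                  pos[1] + MOVES[i][1]), target))
    return max(range(8), key=lambda i: q_values[i])  # greedy choice on the predicted Q values
```

In this sketch the ε branch steers the robot toward the target point through the assumed attraction, while the 1-ε branch exploits the prediction network's Q values.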
Step 4: because the artificial potential field attraction is large in the later stage of the algorithm, the mobile robot easily falls into a local minimum point; the target guiding action function is used to solve this problem.
If the mobile robot falls into a local minimum point while exploring the environment, it executes an action using the target guiding action function as the action guiding strategy.
As shown in fig. 2, the moving direction of the mobile robot is divided into eight directions, namely up, upper right, right, lower right, down, lower left, left and upper left, represented by 1, 2, 3, 4, 5, 6, 7 and 8, and the target guiding action function is defined over these directions [formula given as an image in the original].
When the mobile robot falls into a local minimum point, an action set A is screened out and the mobile robot executes a random exploratory action from A; this prevents it from repeatedly executing the attraction-guided action that collides with obstacles and allows it to escape from the local minimum point.
Step 5: after the mobile robot executes the action a selected in steps 3 to 4, obtain the current-state reward value r, the next state S' and the next-state reward value r' according to the Q value function and the improved reward function, and store (S, a, r, r', S') as an experience sample in the corresponding experience pool.
The improved reward function is a segmented (piecewise) function [formula given as an image in the original], where λ (0 < λ < 1) is the discount factor of the reward function, X_ne = [x_ne, y_ne] is the neighboring point of the current position that is closest to the target point, and x_ne, y_ne are the horizontal and vertical coordinates of that neighboring point. Two further quantities [given as equation images in the original] denote the set of neighboring points of the current position and the set of distances between the neighboring points and the target point, where i is the direction of a neighboring point of the current position; each such distance is obtained with the Euclidean distance formula:
dis(X_tar - X_ne) = sqrt((x_tar - x_ne)^2 + (y_tar - y_ne)^2)
Because the improved reward function gives X_ne = [x_ne, y_ne] a discounted reward value, the Q value of X_ne is far larger than that of the other neighboring points. The mobile robot executes the action with the maximum Q value with probability 1-ε, so when it pursues the policy with the maximum expected cumulative reward it is guided to preferentially select actions that move toward the target point. The reward function thus guides the mobile robot to find the shortest path with the fewest steering angles and redundant points.
If the current-state reward value r in the experience sample is greater than or equal to 0, the sample is stored in the positive-value sample experience pool; otherwise it is stored in the negative-value sample experience pool.
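The segmented reward function is likewise given only as an image, so the reward magnitudes in the sketch below (+1 at the target point, -1 on collision, a λ-discounted bonus at the neighboring point nearest the target and 0 elsewhere) are assumptions; only the segmentation and the routing of samples into the positive- and negative-value pools follow the description.

```python
# Illustrative segmented reward (step 5) and experience storage; the magnitudes are assumed.
import math

MOVES = [(0, 1), (1, 1), (1, 0), (1, -1), (0, -1), (-1, -1), (-1, 0), (-1, 1)]

def nearest_neighbor_to_target(pos, target):
    """Neighboring point of the current position that is closest to the target point."""
    neighbors = [(pos[0] + dx, pos[1] + dy) for dx, dy in MOVES]
    return min(neighbors, key=lambda p: math.dist(p, target))

def reward(next_pos, pos, target, obstacles, lam=0.9):
    if next_pos == target:
        return 1.0                                    # positive reward at the target point
    if next_pos in obstacles:
        return -1.0                                   # negative reward on collision
    if next_pos == nearest_neighbor_to_target(pos, target):
        return lam * 1.0                              # discounted reward at the nearest neighboring point
    return 0.0                                        # reward of every other point is 0

def store(sample, D1, D2):
    """Route the experience sample by the sign of its current-state reward value r."""
    S, a, r, r_next, S_next = sample
    (D1 if r >= 0 else D2).append(sample)
```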
Step 6: using the greedy sampling strategy, sample from experience pool D_1 with probability ρ and from experience pool D_2 with probability 1-ρ; compute and store the priority p of all samples in the sample experience pools D_1 and D_2 with the sample priority formula, then compute the probability of each sample being drawn with the sampling probability formula, and finally draw samples from the experience pools according to these probabilities.
The priority p_j of sample j [formula given as an image in the original] depends on u_j, the number of times the sample has been drawn, and gradually decreases as that count increases. Therefore, according to the sample priorities, the probability P(j) that each sample is drawn is defined as
P(j) = p_j / Σ_k p_k, k = 1, …, size(D_1 or D_2)
where size(D_1 or D_2) is the capacity of the sample experience pool. Drawing samples from D_1 and D_2 according to the probability P(j) reduces the probability that high-value samples are repeatedly drawn and increases the probability that low-value samples are collected, which effectively alleviates the problems of loss value divergence and neural network overfitting.
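The exact priority formula is also given only as an image; the sketch below assumes p_j = 1/(1 + u_j), which has the stated property that the priority decays as the sampling count u_j grows, and lets the drawing probabilities P(j) arise from normalizing these priorities.

```python
# Illustrative dual-pool prioritized sampling for step 6. The priority form 1/(1+u_j)
# is an assumption; only its monotone decay with the sampling count follows the text.
import random

def priority(u_j):
    return 1.0 / (1.0 + u_j)                    # decays as the sample is drawn more often

def sample_batch(D1, D2, counts, rho=0.5, n=32):
    """Draw n samples, choosing pool D1 with probability rho and D2 with probability 1-rho."""
    batch = []
    for _ in range(n):
        pool = D1 if random.random() < rho else D2
        if not pool:                            # fall back if the chosen pool is still empty
            pool = D2 if pool is D1 else D1
        weights = [priority(counts.get(id(s), 0)) for s in pool]       # p_j for every sample
        j = random.choices(range(len(pool)), weights=weights, k=1)[0]  # P(j) = p_j / sum_k p_k
        counts[id(pool[j])] = counts.get(id(pool[j]), 0) + 1           # update u_j
        batch.append(pool[j])
    return batch
```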
Step 7: if the next state of a drawn sample is the target point state, the target Q value is the reward value of the target point state; otherwise the target Q value is the reward value plus the discounted maximum Q value of the next state.
Step 8: repeat steps 6 to 7 for the n draws of the sampling sequence to obtain the final target Q values.
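A minimal sketch of steps 7 to 8 follows; q_target is an assumed callable returning the target network's Q values for a state, and gamma stands for the discount coefficient γ.

```python
# Illustrative target Q values for the n drawn samples (steps 7-8). q_target is an
# assumed callable mapping a state to the target network's Q values for the 8 actions.
def compute_target_q(batch, q_target, target_point_state, gamma=0.9):
    targets = []
    for S, a, r, r_next, S_next in batch:
        if S_next == target_point_state:
            targets.append(r_next)                                   # reward value of the target point state
        else:
            targets.append(r_next + gamma * max(q_target(S_next)))   # reward plus discounted max Q of next state
    return targets
```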
Step 9: calculate the loss value with the mean square error loss function formula, and update the network parameter w of the prediction network with the mini-batch semi-gradient descent method.
The mean square error loss function is
Loss(w) = E[(r' + γ·max_a' Q(S', a', w') - Q(S, a, w))^2]
where E() is the mean square error function, w' is the target network parameter, w is the prediction network parameter, Q(S, a, w) is the prediction network output and Q(S', a', w') is the target network output; max_a' Q(S', a', w') is the maximum Q value of the next state, and r' + γ·max_a' Q(S', a', w') is the target Q value.
The mini-batch semi-gradient descent update is
w ← w + α[r' + γ·max_a' Q(S', a', w') - Q(S, a, w)]·∇_w Q(S, a, w)
where ∇_w Q(S, a, w) is the gradient of Q(S, a, w) with respect to w.
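A hedged PyTorch-style sketch of step 9 is given below; detaching the target term so that the gradient flows only through Q(S, a, w) is what makes the update a semi-gradient step, and the network class, optimizer and tensor shapes are assumptions.

```python
# Illustrative mini-batch semi-gradient step (step 9), assuming a PyTorch prediction
# network whose forward pass maps a batch of states to Q values for the 8 actions.
import torch
import torch.nn.functional as F

def semi_gradient_step(pred_net, optimizer, states, actions, targets):
    """states: (B, state_dim) float tensor, actions: (B,) long tensor, targets: (B,) float tensor."""
    q_sa = pred_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)  # Q(S, a, w)
    loss = F.mse_loss(q_sa, targets.detach())   # E[(r' + γ·max_a' Q(S',a',w') - Q(S,a,w))^2]
    optimizer.zero_grad()
    loss.backward()                             # gradient taken only through Q(S, a, w)
    optimizer.step()                            # move w along the semi-gradient direction
    return loss.item()
```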
Step 10: every fixed number of training steps, assign the network parameter w of the prediction network to the network parameter w' of the target network, and plan the path from the starting point to the end point according to the action selected by the mobile robot at each step.
The assignment formula of the network parameter w' of the target network is as follows:
w'←τw+(1-τ)w'
where τ satisfies τ ≤ 1; through this assignment formula for w', the update amplitude of the target Q value at each step can be effectively controlled, which improves the stability of the mobile robot's learning.
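The soft-update assignment w' ← τw + (1-τ)w' can be applied directly to the parameter tensors; the sketch below assumes PyTorch modules for the prediction and target networks.

```python
# Illustrative soft update of the target network parameters (step 10): w' ← τ·w + (1-τ)·w'.
import torch

@torch.no_grad()
def soft_update(target_net, pred_net, tau=0.01):
    for w_target, w in zip(target_net.parameters(), pred_net.parameters()):
        w_target.mul_(1.0 - tau).add_(tau * w)   # in-place: w' = (1-τ)·w' + τ·w
```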
Step 11: if the position state of the mobile robot at step t has not reached the target point state, repeat steps 2 to 10; otherwise, judge whether the number of training episodes has reached the total number of episodes of network training. If not, repeat steps 1 to 10; if it has, end the training, draw the path from the starting point to the end point according to the maximum-Q-value actions selected by the mobile robot, and terminate the algorithm.
As shown in fig. 3, the plotted trajectory is the path from the starting point to the end point planned by the mobile robot; the path is short and highly smooth, and the steering angles during the robot's motion are reduced, so a good path is obtained and mobile robot path planning can be better realized in a real environment.
As shown in fig. 4, during training of the heuristic deep Q network path planning algorithm for the mobile robot, the peak number of training steps does not reach 2000 in the first 100 episodes; after 100 episodes the algorithm converges rapidly and flattens out with almost no oscillation of the network model, which shows that the problems of loss value divergence and neural network overfitting are effectively alleviated.
In the description herein, references to the description of "one embodiment," "an example," "a specific example," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
The foregoing shows and describes the general principles, principal features, and advantages of the invention. It will be understood by those skilled in the art that the present invention is not limited to the embodiments described above, which are given by way of illustration of the principles of the present invention, but that various changes and modifications may be made without departing from the spirit and scope of the invention, and such changes and modifications are within the scope of the invention as claimed.

Claims (7)

1. A mobile robot path planning method based on an improved deep Q network algorithm, characterized by comprising the following steps:
step 1: determining the position of the mobile robot and the position of a target point, and initializing relevant parameters;
step 2: initializing a prediction network parameter ω and a target network parameter ω', and determining the current state S of the mobile robot;
step 3: the mobile robot selecting, with probability ε, the action a to execute by taking the improved artificial potential field attraction function as the action strategy, and selecting and executing, with probability 1-ε, the action a with the maximum Q value;
step 4: the artificial potential field attraction being large in the later stage of the algorithm, the mobile robot easily falls into a local minimum point, and the problem of the mobile robot falling into the local minimum point is solved through a target guiding action function;
step 5: after the mobile robot executes the action a selected in steps 3 to 4, obtaining the current-state reward value r, the next state S' and the next-state reward value r' according to the Q value function and the improved reward function, and storing (S, a, r, r', S') as an experience sample in the corresponding experience pool;
step 6: sampling from experience pool D_1 with probability ρ and from experience pool D_2 with probability 1-ρ, computing the priority p of all samples stored in the sample experience pools D_1 and D_2 with the sample priority formula, then computing the probability of each sample being drawn with the sampling probability formula, and finally drawing samples from the experience pools according to these probabilities;
step 7: if the next state of a drawn sample is the target point state, the target Q value is the reward value of the target point state; otherwise the target Q value is the reward value plus the discounted maximum Q value of the next state;
step 8: repeating steps 6 to 7 for the n draws of the sampling sequence to obtain the final target Q values;
step 9: calculating the loss value according to the mean square error loss function formula, and updating the network parameter w of the prediction network by using the mini-batch semi-gradient descent method;
step 10: assigning the network parameter w of the prediction network to the network parameter w' of the target network every fixed number of training steps, and planning the path from the starting point to the end point according to the action selected by the mobile robot at each step;
step 11: if the position state of the mobile robot at step t has not reached the target point state, repeating steps 2 to 10; otherwise, judging whether the number of training episodes has reached the total number of episodes of network training; if not, repeating steps 1 to 10; if so, ending the training, drawing the path from the starting point to the end point according to the maximum-Q-value actions selected by the mobile robot, and ending the algorithm.
2. The mobile robot path planning method based on the improved deep Q network algorithm according to claim 1, wherein the relevant parameters initialized in step 1 are as follows:
determining the current position of the mobile robot X_t = [x_t, y_t], where t is the number of steps the mobile robot has searched and x_t, y_t are the horizontal and vertical coordinates of its position at step t; the target point position is X_tar = [x_tar, y_tar], whose horizontal and vertical coordinates are x_tar, y_tar; and initializing the capacities of the positive- and negative-value sample experience pools D_1 and D_2, the greedy strategy parameter ε, the total number of network training episodes N, the sample priority p and the sampling sequence number n.
3. The mobile robot path planning method based on the improved deep Q network algorithm according to claim 1, wherein the action a to be executed by the mobile robot in step 3 is determined as follows:
the magnitude of the improved artificial potential field attraction function F_att(X) is defined piecewise [formula given as an image in the original], where m is a threshold, L is the diagonal distance of the map environment, the direction of the attraction is the direction from the mobile robot toward the target point, and dis(X_tar - X_t) is the distance between the mobile robot at its current position X_t = [x_t, y_t] and the target point X_tar = [x_tar, y_tar], obtained with the Euclidean distance formula:
dis(X_tar - X_t) = sqrt((x_tar - x_t)^2 + (y_tar - y_t)^2)
from the attraction function F_att(X) it can be seen that the larger the distance between the current position X_t = [x_t, y_t] and the target point X_tar = [x_tar, y_tar], the smaller the attraction value, and when this distance is less than the threshold m the attraction value remains unchanged;
the selection of the maximum-Q action is determined by the Q value function, whose update iteration is:
Q(S, a) ← Q(S, a) + α[r' + γ·max_a' Q(S', a') - Q(S, a)]
where Q(S, a) is the Q value of the current state, S is the current state, a is the executed action, α (0 ≤ α ≤ 1) is the learning rate, r' is the reward value given by the environment for the next state, γ (0 ≤ γ ≤ 1) is the discount coefficient, S' is the next state and a' is the action executed in the next step.
4. The mobile robot path planning method based on the improved deep Q network algorithm according to claim 1, wherein the reward function improved in step 5 is specifically as follows:
the improved reward function is a segmented (piecewise) function [formula given as an image in the original], where λ (0 < λ < 1) is the discount factor of the reward function, X_ne = [x_ne, y_ne] is the neighboring point of the current position that is closest to the target point, and x_ne, y_ne are the horizontal and vertical coordinates of that neighboring point; two further quantities [given as equation images in the original] denote the set of neighboring points of the current position and the set of distances between the neighboring points and the target point, where i is the direction of a neighboring point of the current position; each such distance is obtained with the Euclidean distance formula:
dis(X_tar - X_ne) = sqrt((x_tar - x_ne)^2 + (y_tar - y_ne)^2)
5. the mobile robot path planning method based on the improved deep Q-network algorithm according to claim 1, wherein the sampling sample process in step 6 is as follows:
priority p of sample j j The formula is as follows:
Figure FDA0003805860050000037
in the formula, u j The number of times each sample is sampled. From the formula, the priority p of the sample j Will gradually decrease as the number of times sampled increases, so according to the priority of the sample, the probability P (j) that each sample is sampled is defined as:
Figure FDA0003805860050000041
wherein, size (D) 1 or D 2 ) The pool capacity is experienced for the sample. From the formula, the algorithm is derived from D by the probability that the sample is sampled P (j) 1 And D 2 And the samples are extracted, so that the probability of repeated sampling of the high-value samples is reduced, and the probability of acquiring the low-value samples is improved, thereby effectively relieving the problems of loss value divergence and neural network overfitting.
6. The mobile robot path planning method based on the improved deep Q network algorithm according to claim 1, wherein the mean square error loss function formula and the mini-batch semi-gradient descent formula in step 9 are specifically as follows:
the mean square error loss function is
Loss(w) = E[(r' + γ·max_a' Q(S', a', w') - Q(S, a, w))^2]
where E() is the mean square error function, w' is the target network parameter, w is the prediction network parameter, Q(S, a, w) is the prediction network output and Q(S', a', w') is the target network output; max_a' Q(S', a', w') is the maximum Q value of the next state, and r' + γ·max_a' Q(S', a', w') is the target Q value;
the mini-batch semi-gradient descent update is
w ← w + α[r' + γ·max_a' Q(S', a', w') - Q(S, a, w)]·∇_w Q(S, a, w)
where ∇_w Q(S, a, w) is the gradient of Q(S, a, w) with respect to w.
7. The mobile robot path planning method based on the improved deep Q network algorithm according to claim 1, wherein the assignment formula of the network parameter w' of the target network in step 10 is as follows:
w'←τw+(1-τ)w'
where τ satisfies τ ≤ 1; through this assignment formula for w', the update amplitude of the target Q value at each step can be effectively controlled, which improves the stability of the mobile robot's learning.
CN202211002713.2A 2022-08-19 2022-08-19 Mobile robot path planning based on improved deep Q network algorithm Pending CN115344046A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211002713.2A CN115344046A (en) 2022-08-19 2022-08-19 Mobile robot path planning based on improved deep Q network algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211002713.2A CN115344046A (en) 2022-08-19 2022-08-19 Mobile robot path planning based on improved deep Q network algorithm

Publications (1)

Publication Number Publication Date
CN115344046A true CN115344046A (en) 2022-11-15

Family

ID=83954272

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211002713.2A Pending CN115344046A (en) 2022-08-19 2022-08-19 Mobile robot path planning based on improved deep Q network algorithm

Country Status (1)

Country Link
CN (1) CN115344046A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116700258A (en) * 2023-06-13 2023-09-05 重庆市荣冠科技有限公司 Intelligent vehicle path planning method based on artificial potential field method and reinforcement learning
CN116700258B (en) * 2023-06-13 2024-05-03 万基泰科工集团数字城市科技有限公司 Intelligent vehicle path planning method based on artificial potential field method and reinforcement learning

Similar Documents

Publication Publication Date Title
Jiang et al. Path planning for intelligent robots based on deep Q-learning with experience replay and heuristic knowledge
Zhu et al. Deep reinforcement learning supervised autonomous exploration in office environments
CN109945881B (en) Mobile robot path planning method based on ant colony algorithm
CN107272679B (en) Path planning method based on improved ant colony algorithm
CN114489059B (en) Path planning method based on D3QN-PER mobile robot
CN112325897B (en) Path planning method based on heuristic deep reinforcement learning
CN113110592A (en) Unmanned aerial vehicle obstacle avoidance and path planning method
CN111982125A (en) Path planning method based on improved ant colony algorithm
CN109597425B (en) Unmanned aerial vehicle navigation and obstacle avoidance method based on reinforcement learning
CN111664852B (en) Unmanned aerial vehicle path planning method and device
CN112362066A (en) Path planning method based on improved deep reinforcement learning
CN111880561A (en) Unmanned aerial vehicle three-dimensional path planning method based on improved whale algorithm in urban environment
CN115344046A (en) Mobile robot path planning based on improved deep Q network algorithm
CN112484732A (en) IB-ABC algorithm-based unmanned aerial vehicle flight path planning method
CN116494247A (en) Mechanical arm path planning method and system based on depth deterministic strategy gradient
CN115454067A (en) Path planning method based on fusion algorithm
CN114326726B (en) Formation path planning control method based on A and improved artificial potential field method
CN113218400B (en) Multi-agent navigation algorithm based on deep reinforcement learning
CN114815801A (en) Adaptive environment path planning method based on strategy-value network and MCTS
CN113722980A (en) Ocean wave height prediction method, system, computer equipment, storage medium and terminal
CN116164753B (en) Mine unmanned vehicle path navigation method and device, computer equipment and storage medium
CN117434950A (en) Mobile robot dynamic path planning method based on Harris eagle heuristic hybrid algorithm
CN115454070B (en) K-Means ant colony algorithm multi-robot path planning method
CN111174794A (en) Unmanned ship path planning method and system based on improved particle swarm algorithm
CN116225046A (en) Unmanned aerial vehicle autonomous path planning method based on deep reinforcement learning under unknown environment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination