CN112344944B - Reinforced learning path planning method introducing artificial potential field - Google Patents

Reinforced learning path planning method introducing artificial potential field Download PDF

Info

Publication number
CN112344944B
CN112344944B (granted publication of application CN202011327198.6A; application published as CN112344944A)
Authority
CN
China
Prior art keywords
value
action
algorithm
path planning
state
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011327198.6A
Other languages
Chinese (zh)
Other versions
CN112344944A (en)
Inventor
王科银
石振
张建辉
杨正才
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hubei University of Automotive Technology
Original Assignee
Hubei University of Automotive Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hubei University of Automotive Technology filed Critical Hubei University of Automotive Technology
Priority to CN202011327198.6A priority Critical patent/CN112344944B/en
Publication of CN112344944A publication Critical patent/CN112344944A/en
Application granted granted Critical
Publication of CN112344944B publication Critical patent/CN112344944B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • G - PHYSICS
    • G01 - MEASURING; TESTING
    • G01C - MEASURING DISTANCES, LEVELS OR BEARINGS; SURVEYING; NAVIGATION; GYROSCOPIC INSTRUMENTS; PHOTOGRAMMETRY OR VIDEOGRAMMETRY
    • G01C21/00 - Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00
    • G01C21/20 - Instruments for performing navigational calculations
    • G - PHYSICS
    • G05 - CONTROLLING; REGULATING
    • G05D - SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D1/00 - Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
    • G05D1/0088 - Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots, characterized by the autonomous decision making process, e.g. artificial intelligence, predefined behaviours
    • G - PHYSICS
    • G05 - CONTROLLING; REGULATING
    • G05D - SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D1/00 - Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
    • G05D1/02 - Control of position or course in two dimensions
    • G05D1/021 - Control of position or course in two dimensions specially adapted to land vehicles
    • G05D1/0212 - Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory
    • G05D1/0221 - Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory involving a learning process

Landscapes

  • Engineering & Computer Science (AREA)
  • Remote Sensing (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Automation & Control Theory (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Aviation & Aerospace Engineering (AREA)
  • Artificial Intelligence (AREA)
  • Medical Informatics (AREA)
  • Game Theory and Decision Science (AREA)
  • Evolutionary Computation (AREA)
  • Business, Economics & Management (AREA)
  • Health & Medical Sciences (AREA)
  • Feedback Control In General (AREA)
  • Manipulator (AREA)

Abstract

The invention discloses a reinforcement learning path planning method introducing an artificial potential field, which comprises the following steps: S1, establishing a grid map, initializing the state values with a gravitational (attractive) field function, and obtaining a simulation environment for training the reinforcement learning agent; S2, initializing the algorithm parameters; S3, selecting actions with a dynamic greedy factor adjustment strategy; S4, executing the action and updating the Q value; S5, repeating steps S3 and S4 until a set number of steps or a convergence condition is reached; S6, selecting the action with the largest Q value at each step to obtain the optimal path; and S7, sending the optimal path to the controller of the mobile robot and controlling the mobile robot to walk along it. Compared with the traditional algorithm, the improved Q-learning algorithm shortens the path planning time by 85.1%, reduces the number of iterations before convergence by 74.7%, and improves the stability of the convergence result.

Description

Reinforced learning path planning method introducing artificial potential field
Technical Field
The invention relates to the technical field of robot path planning, in particular to a reinforcement learning path planning method introducing an artificial potential field.
Background
With the development of science and technology, mobile robots are entering more and more of people's daily lives, and the path planning problem for mobile robots is becoming increasingly important. Path planning helps the robot avoid obstacles and plan an optimal route from a start point to a target point with respect to a given index. According to how much is known about the environment, path planning can be divided into global path planning and local path planning. Widely used global path planning algorithms include the A* algorithm, Dijkstra's algorithm, the visibility graph method, and the free space method; local path planning algorithms include the artificial potential field method, genetic algorithms, neural network algorithms, and reinforcement learning algorithms. Reinforcement learning is comparatively adaptable and can find an optimal path by continual trial and error in a completely unknown environment, so it is receiving more and more attention in the field of mobile robot path planning.
The most widely used reinforcement learning algorithm in mobile robot path planning is Q-learning. The conventional Q-learning algorithm has the following problems: (1) all Q values are initialized to 0 or to random values, so the agent can only search blindly in the early stage and the algorithm wastes many invalid iterations at the start; (2) an ε-greedy strategy is used for action selection, and too large a value of ε makes the agent explore the environment so much that it is difficult to converge, while too small a value leads to insufficient exploration and a sub-optimal solution, so the relationship between exploration and exploitation is hard to balance.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a reinforcement learning path planning method introducing an artificial potential field. The gravitational (attractive) field function of the artificial potential field is introduced when initializing the Q values, so that states closer to the target position have larger values; the agent therefore searches toward the target position in the early stage, invalid iterations at the start of the algorithm are reduced, and the path planning time of the reinforcement-learning-based mobile robot is shortened.
A reinforcement learning path planning method introducing an artificial potential field comprises the following steps:
s1, establishing a grid map, introducing a gravitational field function initialization state value, and obtaining a simulation environment for training the reinforcement learning agent;
s2, initializing algorithm parameters;
s3, selecting actions by adopting a dynamic factor adjustment strategy;
s4, executing the action and updating the Q value;
s5, repeating steps S3 and S4 until a set number of steps or a convergence condition is reached;
s6, selecting the action with the maximum Q value in each step to obtain an optimal path;
and S7, sending the optimal path to a controller of the mobile robot, and controlling the mobile robot to walk according to the optimal path.
Preferably, the specific process of step S1 is as follows: the environment image obtained by the mobile robot is segmented into 20 x 20 grids and an environment model is established with the grid method; if an obstacle is found in a grid, that grid is defined as an obstacle position through which the robot cannot pass; if the target point is found in a grid, that grid is defined as the target position, i.e. the position the mobile robot must finally reach; the remaining grids are defined as obstacle-free grids through which the robot can pass, and the attraction value of each grid is calculated according to formula (1);
U_att = ζ / (|d| + η)    (1)
where ζ is a scale factor greater than 0 used to adjust the magnitude of the attraction; |d| is the distance between the current position and the target point; and η is a positive constant that prevents the attraction value at the target point from becoming infinite.
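As an illustration only, the following sketch computes the attraction value of formula (1) for every cell of the grid map, assuming the reconstructed form U_att = ζ / (|d| + η); the goal coordinates and the helper name attraction_field are assumptions of this example, while ζ = 0.6 and η = 1 follow the parameter settings of the embodiment.

```python
import numpy as np

def attraction_field(goal, size=20, zeta=0.6, eta=1.0):
    """Attraction value U_att = zeta / (|d| + eta) for every cell of a size x size grid.

    |d| is the Euclidean distance from the cell to the target cell; eta > 0 keeps the
    value at the target cell finite, and zeta scales the magnitude of the attraction.
    """
    ys, xs = np.mgrid[0:size, 0:size]
    d = np.hypot(xs - goal[1], ys - goal[0])      # distance |d| to the target cell
    return zeta / (d + eta)

# Example: target at the bottom-right corner of the 20 x 20 grid map.
U_att = attraction_field(goal=(19, 19))
print(U_att[19, 19], U_att[0, 0])                 # largest at the target, decaying with distance
```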
Preferably, in step S2, the parameters include: the learning rate α ∈ (0, 1), the discount factor γ ∈ (0, 1), the maximum number of iterations, the reward function r, and the greedy factor dynamic adjustment strategy parameters ε_max, ε_min, T, n;
Initializing the Q function using equation (2)
Q(s, a) = R(s, a) + γ Σ_{s′∈S} P(s′|s, a) V(s′)    (2)
where P(s′|s, a) is the probability of transitioning to the next state s′ given the current state s and action a, and V(s′) is the state value of the next state, with V(s′) = U_att.
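A minimal sketch of this initialization on the grid is given below; it assumes deterministic transitions, so P(s′|s, a) = 1 for the single cell reached by action a, and a four-action move set; the function name init_q and the zero reward table in the usage lines are assumptions of this example.

```python
import numpy as np

ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]   # assumed move set: up, down, left, right

def init_q(U_att, R, gamma=0.9):
    """Initialise Q(s, a) = R(s, a) + gamma * sum over s' of P(s'|s, a) * V(s'), V(s') = U_att(s').

    Transitions are assumed deterministic: each action reaches one neighbouring cell
    (the agent stays in place at the border), so the sum collapses to a single term.
    """
    size = U_att.shape[0]
    Q = np.zeros((size, size, len(ACTIONS)))
    for y in range(size):
        for x in range(size):
            for k, (dy, dx) in enumerate(ACTIONS):
                ny = min(max(y + dy, 0), size - 1)
                nx = min(max(x + dx, 0), size - 1)
                Q[y, x, k] = R[y, x, k] + gamma * U_att[ny, nx]
    return Q

# Example with a zero reward table and the attraction field of formula (1).
ys, xs = np.mgrid[0:20, 0:20]
U_att = 0.6 / (np.hypot(xs - 19, ys - 19) + 1.0)
Q0 = init_q(U_att, np.zeros((20, 20, 4)))
```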
Preferably, in step S3, the greedy factor dynamic adjustment strategy is as follows:
ε = ε_min + (ε_max - ε_min) · tanh(std_n / T)    (3)
wherein the specific form of the tanh function is as follows:
tanh(t) = (e^t - e^(-t)) / (e^t + e^(-t))    (4)
where e is the base of the natural logarithm, and for an argument t greater than 0, tanh(t) takes values in (0, 1); std_n is the standard deviation of the step counts over the last n consecutive iterations; T is a coefficient whose effect is opposite to that of the temperature in the simulated annealing algorithm: the larger T is, the smaller the randomness; and ε_max and ε_min are the set maximum and minimum values of the exploration rate.
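The sketch below illustrates this strategy, assuming the reconstructed form ε = ε_min + (ε_max - ε_min) · tanh(std_n / T); the default parameter values follow the embodiment, while the function name dynamic_epsilon and the fallback to ε_max before n episodes of history exist are assumptions of this example.

```python
import math
import statistics

def dynamic_epsilon(recent_steps, eps_max=0.5, eps_min=0.01, T=500, n=10):
    """Greedy factor epsilon = eps_min + (eps_max - eps_min) * tanh(std_n / T).

    std_n is the standard deviation of the episode step counts over the last n
    iterations: while the step counts still fluctuate, epsilon stays near eps_max
    (more exploration); as the algorithm converges, std_n shrinks and epsilon
    approaches eps_min. A larger T shrinks the tanh argument and hence the
    randomness, opposite to the temperature in simulated annealing.
    """
    if len(recent_steps) < n:
        return eps_max                              # not enough history yet: keep exploring
    std_n = statistics.pstdev(recent_steps[-n:])
    return eps_min + (eps_max - eps_min) * math.tanh(std_n / T)

print(dynamic_epsilon([400, 380, 350, 500, 320, 300, 290, 310, 305, 295]))
```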
Preferably, in step S4, the action a selected in step S3 is executed, the next state s′ is reached and the instant reward R(s, a) is obtained; the Q-value function is then updated with the Q-learning algorithm that introduces the artificial potential field, with the update rule given by formula (5):
Q(s, a) ← Q(s, a) + α [ R(s, a) + γ max_{a′} Q(s′, a′) - Q(s, a) ]    (5)
where (s, a) is the current state-action pair, (s′, a′) is the state-action pair at the next time step, and R(s, a) is the instant reward for performing action a in state s.
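For reference, here is a minimal sketch of one such update on a tabular Q function; the tuple-indexed NumPy table and the function name q_update are assumptions of this example, and the attraction-field initialization is not repeated here.

```python
import numpy as np

def q_update(Q, s, a, r, s_next, alpha=0.01, gamma=0.9):
    """One Q-learning update: Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    td_target = r + gamma * Q[s_next].max()        # bootstrap from the best action in s'
    Q[s][a] += alpha * (td_target - Q[s][a])       # move Q(s, a) toward the TD target
    return Q

# Example: a 20 x 20 x 4 table, one step from cell (0, 0) taking action 3 ("right") to (0, 1).
Q = np.zeros((20, 20, 4))
q_update(Q, s=(0, 0), a=3, r=0.0, s_next=(0, 1))
```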
The invention has the beneficial effects that:
In order to solve the problems of slow convergence, a large number of iterations, and unstable convergence results when the traditional reinforcement learning algorithm is applied to path planning for a mobile robot in an unknown environment, an improved Q-learning algorithm is provided. The artificial potential field method is introduced when initializing the state values, so that states closer to the target position have larger values and the agent is guided to move toward the target; the ε-greedy strategy is improved for action selection, and the greedy factor ε is dynamically adjusted according to the convergence degree of the algorithm, so that the relationship between exploration and exploitation is well balanced. Simulation results on the grid map show that, compared with the traditional algorithm, the improved Q-learning algorithm shortens the path planning time by 85.1% and reduces the number of iterations before convergence by 74.7%, while improving the stability of the convergence result.
Drawings
FIG. 1 is a schematic view of the general flow of the process of the present invention.
Fig. 2 is a grid map of the operation of the mobile robot according to the embodiment of the present invention.
FIG. 3 is a diagram of conventional Q-learning convergence.
FIG. 4 is a diagram of improved Q-learning convergence according to an embodiment of the invention.
FIG. 5 is a diagram of an optimized path drawn by the improved Q-learning scheme according to the embodiment of the invention.
Detailed Description
Embodiments of the present invention will be described in detail below with reference to the accompanying drawings. The following examples are only for illustrating the technical solutions of the present invention more clearly, and therefore are only used as examples, and the protection scope of the present invention is not limited thereby.
Referring to fig. 1, the method for planning a reinforcement learning path by introducing an artificial potential field according to the present invention includes the following steps:
The first step: the environment image obtained by the mobile robot is segmented into 20 x 20 grids and an environment model is established with the grid method; if an obstacle is found in a grid, that grid is defined as an obstacle position through which the robot cannot pass; if the target point is found in a grid, that grid is defined as the target position, i.e. the position the mobile robot must finally reach; the remaining grids are defined as obstacle-free grids through which the robot can pass. The attraction value of each grid is calculated according to formula (1).
U_att = ζ / (|d| + η)    (1)
where ζ is a scale factor greater than 0 used to adjust the magnitude of the attraction, |d| is the distance between the current position and the target point, and η is a positive constant that prevents the attraction value at the target point from becoming infinite.
Through the above steps, a simulation environment for training the reinforcement learning agent can be obtained, and the grid map for the mobile robot in the embodiment is shown in fig. 2.
The second step: the algorithm parameters are initialized; the parameters include the learning rate α ∈ (0, 1), the discount factor γ ∈ (0, 1), the maximum number of iterations, the reward function r, and the greedy factor dynamic adjustment strategy parameters ε_max, ε_min, T, n.
The Q-value function is initialized using equation (2):
Q(s, a) = R(s, a) + γ Σ_{s′∈S} P(s′|s, a) V(s′)    (2)
where P(s′|s, a) is the probability of transitioning to the next state s′ given the current state s and action a, and V(s′) is the state value of the next state, with V(s′) = U_att.
The third step: the action is selected with the greedy factor dynamic adjustment strategy, which is as follows:
ε = ε_min + (ε_max - ε_min) · tanh(std_n / T)    (3)
wherein the specific form of the tanh function is as follows:
tanh(t) = (e^t - e^(-t)) / (e^t + e^(-t))    (4)
where e is the base of the natural logarithm, and for an argument t greater than 0, tanh(t) takes values in (0, 1); std_n is the standard deviation of the step counts over the last n consecutive iterations; T is a coefficient whose effect is opposite to that of the temperature in the simulated annealing algorithm: the larger T is, the smaller the randomness; and ε_max and ε_min are the set maximum and minimum values of the exploration rate.
The fourth step: the action a selected in the third step is executed, the next state s′ is reached and the instant reward R(s, a) is obtained; the Q-value function is updated with the Q-learning algorithm that introduces the artificial potential field, with the update rule given by formula (5):
Q(s, a) ← Q(s, a) + α [ R(s, a) + γ max_{a′} Q(s′, a′) - Q(s, a) ]    (5)
where (s, a) is the current state-action pair, (s′, a′) is the state-action pair at the next time step, and R(s, a) is the instant reward for performing action a in state s.
The third and fourth steps are repeated until a set number of steps or a convergence condition is reached.
The fifth step: the action with the largest Q value is selected at each step to obtain the optimal path.
The sixth step: the optimal path is sent to the controller of the mobile robot, and the mobile robot is controlled to walk along the optimal path.
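To show how the third to sixth steps fit together, the following self-contained sketch runs the whole loop on a small example. The obstacle layout, the +1/-1/0 reward values, the episode and step limits, and the direct seeding of the Q table from the attraction field are assumptions of this illustration rather than the settings of this embodiment; the learning rate, discount factor, ζ, η, ε_max, ε_min, T and n do follow the embodiment's values.

```python
import math
import statistics
import numpy as np

SIZE, START, GOAL = 20, (0, 0), (19, 19)
OBSTACLES = {(5, 5), (5, 6), (10, 12)}                 # assumed obstacle layout
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]           # up, down, left, right

def step(s, a):
    """Deterministic grid transition with an assumed reward: +1 at the goal, -1 at an obstacle, 0 otherwise."""
    y, x = s
    dy, dx = ACTIONS[a]
    ny = min(max(y + dy, 0), SIZE - 1)
    nx = min(max(x + dx, 0), SIZE - 1)
    if (ny, nx) in OBSTACLES:
        return s, -1.0, False                          # blocked cell: stay in place
    return (ny, nx), (1.0 if (ny, nx) == GOAL else 0.0), (ny, nx) == GOAL

# Seed the Q table from the attraction field of formula (1) (a simplification of formula (2)).
ys, xs = np.mgrid[0:SIZE, 0:SIZE]
U_att = 0.6 / (np.hypot(xs - GOAL[1], ys - GOAL[0]) + 1.0)
Q = np.repeat(U_att[:, :, None], len(ACTIONS), axis=2)

steps_hist = []
for episode in range(800):
    # Dynamic greedy factor, formulas (3)-(4), assuming the reconstructed tanh form.
    if len(steps_hist) >= 10:
        eps = 0.01 + (0.5 - 0.01) * math.tanh(statistics.pstdev(steps_hist[-10:]) / 500)
    else:
        eps = 0.5
    s, done, n_steps = START, False, 0
    while not done and n_steps < 300:
        a = np.random.randint(len(ACTIONS)) if np.random.rand() < eps else int(Q[s].argmax())
        s_next, r, done = step(s, a)
        Q[s][a] += 0.01 * (r + 0.9 * Q[s_next].max() - Q[s][a])   # formula (5)
        s, n_steps = s_next, n_steps + 1
    steps_hist.append(n_steps)

# Greedy path extraction: follow the action with the largest Q value in each state.
path, s = [START], START
while s != GOAL and len(path) < SIZE * SIZE:
    s, _, _ = step(s, int(Q[s].argmax()))
    path.append(s)
print("path length:", len(path), "reached goal:", path[-1] == GOAL)
```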
The parameter settings in this embodiment are as follows: learning rate α = 0.01, discount factor γ = 0.9, maximum number of iterations 20000, scale factor ζ = 0.6, constant η = 1, ε_max = 0.5, ε_min = 0.01, T = 500, n = 10; the reward function is set as in formula (6), which appears only as an image in the original publication and is not reproduced here.
With the above method and the above parameter settings, the optimal path obtained in this embodiment is shown in fig. 5.
Comparing fig. 3 and fig. 4 shows that, relative to the conventional Q-learning algorithm, the improved algorithm shortens the convergence time by 85.1%, reduces the number of iterations by 74.7%, and improves the stability of the convergence result.
It is to be noted that, unless otherwise specified, technical or scientific terms used herein shall have the ordinary meaning as understood by those skilled in the art to which the invention pertains.

Claims (4)

1. A reinforcement learning path planning method introducing an artificial potential field, characterized by comprising the following steps:
s1, establishing a grid map, introducing a gravitational field function initialization state value, and obtaining a simulation environment for training the reinforcement learning agent;
s2, initializing algorithm parameters;
s3, selecting actions by adopting a dynamic factor adjustment strategy;
s4, executing the action and updating the Q value;
s5, repeating steps S3 and S4 until a set number of steps or a convergence condition is reached;
s6, selecting the action with the maximum Q value in each step to obtain an optimal path;
s7, sending the optimal path to a controller of the mobile robot, and controlling the mobile robot to walk according to the optimal path;
the specific process of step S1 is as follows: the environment image obtained by the mobile robot is segmented into 20 x 20 grids and an environment model is established with the grid method; if an obstacle is found in a grid, that grid is defined as an obstacle position through which the robot cannot pass; if the target point is found in a grid, that grid is defined as the target position, i.e. the position the mobile robot must finally reach; the remaining grids are defined as obstacle-free grids through which the robot can pass, and the attraction value of each grid is calculated according to formula (1);
U_att = ζ / (|d| + η)    (1)
where ζ is a scale factor greater than 0 used to adjust the magnitude of the attraction; |d| is the distance between the current position and the target point; and η is a positive constant that prevents the attraction value at the target point from becoming infinite;
in step S2, the parameters include: the learning rate α ∈ (0, 1), the discount factor γ ∈ (0, 1), the maximum number of iterations, the reward function r, and the greedy factor dynamic adjustment strategy parameters ε_max, ε_min, T, n;
Initializing the Q function using equation (2)
Q(s, a) = R(s, a) + γ Σ_{s′∈S} P(s′|s, a) V(s′)    (2)
where P(s′|s, a) is the probability of transitioning to the next state s′ given the current state s and action a; V(s′) is the state value of the next state, with V(s′) = U_att; R(s, a) is the reward value obtained by taking action a in the current state s; S is the state set; and U_att is the attraction value of the current position;
in step S3, the greedy factor adjustment strategy is as follows:
ε = ε_min + (ε_max - ε_min) · tanh(std_n / T)    (3)
wherein the specific form of the tanh function is as follows:
tanh(t) = (e^t - e^(-t)) / (e^t + e^(-t))    (4)
where e is the base of the natural logarithm, and for an argument t greater than 0, tanh(t) takes values in (0, 1); std_n is the standard deviation of the step counts over the last n consecutive iterations; T is a coefficient whose effect is opposite to that of the temperature in the simulated annealing algorithm: the larger T is, the smaller the randomness; and ε_max and ε_min are the set maximum and minimum values of the exploration rate.
2. The method of reinforcement learning path planning incorporating an artificial potential field of claim 1, further comprising:
in step S4, the action a selected in step S3 is executed, the next state s′ is reached and the instant reward R(s, a) is obtained; the Q-value function is updated with the Q-learning algorithm that introduces the artificial potential field, with the update rule given by formula (5):
Q(s, a) ← Q(s, a) + α [ R(s, a) + γ max_{a′} Q(s′, a′) - Q(s, a) ]    (5)
where (s, a) is the current state-action pair; (s′, a′) is the state-action pair at the next time step; R(s, a) is the instant reward for executing action a in state s; α is the learning rate; and γ is the discount factor.
3. The method of reinforcement learning path planning with introduction of an artificial potential field according to claim 1, characterized by: the scale factor ζ is set to 0.6 and the constant η is set to 1.
4. The method of reinforcement learning path planning incorporating an artificial potential field of claim 1, further comprising: the learning rate α = 0.01, the discount factor γ = 0.9, the maximum number of iterations is set to 20000, ε_max = 0.5, ε_min = 0.01, T = 500, n = 10, and the reward function is set to:
(formula (6): the reward function, given only as an image in the original publication)
CN202011327198.6A 2020-11-24 2020-11-24 Reinforced learning path planning method introducing artificial potential field Active CN112344944B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011327198.6A CN112344944B (en) 2020-11-24 2020-11-24 Reinforced learning path planning method introducing artificial potential field

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011327198.6A CN112344944B (en) 2020-11-24 2020-11-24 Reinforced learning path planning method introducing artificial potential field

Publications (2)

Publication Number Publication Date
CN112344944A CN112344944A (en) 2021-02-09
CN112344944B true CN112344944B (en) 2022-08-05

Family

ID=74365572

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011327198.6A Active CN112344944B (en) 2020-11-24 2020-11-24 Reinforced learning path planning method introducing artificial potential field

Country Status (1)

Country Link
CN (1) CN112344944B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112964272A (en) * 2021-03-16 2021-06-15 湖北汽车工业学院 Improved Dyna-Q learning path planning algorithm
CN113534819B (en) * 2021-08-26 2024-03-15 鲁东大学 Method and storage medium for pilot following type multi-agent formation path planning
CN113848911B (en) * 2021-09-28 2023-06-27 华东理工大学 Mobile robot global path planning method based on Q-learning and RRT
CN114296440B (en) * 2021-09-30 2024-04-09 中国航空工业集团公司北京长城航空测控技术研究所 AGV real-time scheduling method integrating online learning
CN113790729B (en) * 2021-11-16 2022-04-08 北京科技大学 Unmanned overhead traveling crane path planning method and device based on reinforcement learning algorithm
CN114518758B (en) * 2022-02-08 2023-12-12 中建八局第三建设有限公司 Indoor measurement robot multi-target point moving path planning method based on Q learning
CN115542912B (en) * 2022-09-29 2024-06-07 福州大学 Mobile robot path planning method based on improved Q-learning algorithm
CN116700258B (en) * 2023-06-13 2024-05-03 万基泰科工集团数字城市科技有限公司 Intelligent vehicle path planning method based on artificial potential field method and reinforcement learning

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102799179A (en) * 2012-07-06 2012-11-28 山东大学 Mobile robot path planning algorithm based on single-chain sequential backtracking Q-learning
WO2018120739A1 (en) * 2016-12-30 2018-07-05 深圳光启合众科技有限公司 Path planning method, apparatus and robot
CN110132296A (en) * 2019-05-22 2019-08-16 山东师范大学 Multiple agent sub-goal based on dissolution potential field divides paths planning method and system
CN110307848A (en) * 2019-07-04 2019-10-08 南京大学 A kind of Mobile Robotics Navigation method
CN110726416A (en) * 2019-10-23 2020-01-24 西安工程大学 Reinforced learning path planning method based on obstacle area expansion strategy
CN110794842A (en) * 2019-11-15 2020-02-14 北京邮电大学 Reinforced learning path planning algorithm based on potential field
CN111896006A (en) * 2020-08-11 2020-11-06 燕山大学 Path planning method and system based on reinforcement learning and heuristic search

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8626565B2 (en) * 2008-06-30 2014-01-07 Autonomous Solutions, Inc. Vehicle dispatching method and system
US10839302B2 (en) * 2015-11-24 2020-11-17 The Research Foundation For The State University Of New York Approximate value iteration with complex returns by bounding
CN110462544A (en) * 2017-03-20 2019-11-15 御眼视觉技术有限公司 The track of autonomous vehicle selects

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102799179A (en) * 2012-07-06 2012-11-28 山东大学 Mobile robot path planning algorithm based on single-chain sequential backtracking Q-learning
WO2018120739A1 (en) * 2016-12-30 2018-07-05 深圳光启合众科技有限公司 Path planning method, apparatus and robot
CN110132296A (en) * 2019-05-22 2019-08-16 山东师范大学 Multiple agent sub-goal based on dissolution potential field divides paths planning method and system
CN110307848A (en) * 2019-07-04 2019-10-08 南京大学 A kind of Mobile Robotics Navigation method
CN110726416A (en) * 2019-10-23 2020-01-24 西安工程大学 Reinforced learning path planning method based on obstacle area expansion strategy
CN110794842A (en) * 2019-11-15 2020-02-14 北京邮电大学 Reinforced learning path planning algorithm based on potential field
CN111896006A (en) * 2020-08-11 2020-11-06 燕山大学 Path planning method and system based on reinforcement learning and heuristic search

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Yukiyasu Noguchi et al. Path Planning Method Based on Artificial Potential Field and Reinforcement Learning for Intervention AUVs. 2019 IEEE Underwater Technology (UT). 2019, pp. 1-6. *
Song Yong et al. Initialization of reinforcement learning for mobile robot path planning. Control Theory & Applications. 2012, Vol. 29, No. 12, pp. 1623-1628. *
Xu Xiaosu et al. Mobile robot path planning method based on improved reinforcement learning. Journal of Chinese Inertial Technology. 2019, Vol. 27, No. 3, pp. 314-320. *

Also Published As

Publication number Publication date
CN112344944A (en) 2021-02-09

Similar Documents

Publication Publication Date Title
CN112344944B (en) Reinforced learning path planning method introducing artificial potential field
CN111896006B (en) Path planning method and system based on reinforcement learning and heuristic search
CN109765893A (en) Method for planning path for mobile robot based on whale optimization algorithm
CN111381600B (en) UUV path planning method based on particle swarm optimization
CN107703751A (en) PID controller optimization method based on dragonfly algorithm
CN113867369B (en) Robot path planning method based on alternating current learning seagull algorithm
CN114460941B (en) Robot path planning method and system based on improved sparrow search algorithm
CN108594803B (en) Path planning method based on Q-learning algorithm
CN115629607A (en) Reinforced learning path planning method integrating historical information
CN113885536A (en) Mobile robot path planning method based on global gull algorithm
CN115115284B (en) Energy consumption analysis method based on neural network
CN114967713B (en) Underwater vehicle buoyancy discrete change control method based on reinforcement learning
CN116859903A (en) Robot smooth path planning method based on improved Harris eagle optimization algorithm
CN114742231A (en) Multi-objective reinforcement learning method and device based on pareto optimization
Khasanov et al. Gradient descent in machine learning
CN108955689A (en) It is looked for food the RBPF-SLAM method of optimization algorithm based on adaptive bacterium
CN108121206A (en) Compound self-adaptive model generation optimization method based on efficient modified differential evolution algorithm
CN110889531A (en) Wind power prediction method and prediction system based on improved GSA-BP neural network
CN115167419A (en) Robot path planning method based on DQN algorithm
CN114548497B (en) Crowd motion path planning method and system for realizing scene self-adaption
CN115655279A (en) Marine unmanned rescue airship path planning method based on improved whale algorithm
CN115344046A (en) Mobile robot path planning based on improved deep Q network algorithm
CN114995105A (en) Water turbine regulating system PID parameter optimization method based on improved genetic algorithm
CN113807505A (en) Method for improving cyclic variation learning rate through neural network
Hewlett et al. Optimization using a modified second-order approach with evolutionary enhancement

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant