CN115542912A - Mobile robot path planning method based on improved Q-learning algorithm - Google Patents

Mobile robot path planning method based on improved Q-learning algorithm

Info

Publication number
CN115542912A
CN115542912A
Authority
CN
China
Prior art keywords
value
action
improved
reward
function
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202211213330.XA
Other languages
Chinese (zh)
Other versions
CN115542912B (en)
Inventor
涂俊翔
张立
李凡
钟礼阳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fuzhou University
Original Assignee
Fuzhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fuzhou University filed Critical Fuzhou University
Priority to CN202211213330.XA priority Critical patent/CN115542912B/en
Priority claimed from CN202211213330.XA external-priority patent/CN115542912B/en
Publication of CN115542912A publication Critical patent/CN115542912A/en
Application granted granted Critical
Publication of CN115542912B publication Critical patent/CN115542912B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • G PHYSICS
    • G05 CONTROLLING; REGULATING
    • G05D SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D1/00 Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
    • G05D1/02 Control of position or course in two dimensions
    • G05D1/021 Control of position or course in two dimensions specially adapted to land vehicles
    • G05D1/0212 Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory
    • G05D1/0214 Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory in accordance with safety or protection criteria, e.g. avoiding hazardous areas
    • G PHYSICS
    • G05 CONTROLLING; REGULATING
    • G05D SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D1/00 Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
    • G05D1/02 Control of position or course in two dimensions
    • G05D1/021 Control of position or course in two dimensions specially adapted to land vehicles
    • G05D1/0212 Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory
    • G05D1/0221 Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory involving a learning process

Landscapes

  • Engineering & Computer Science (AREA)
  • Aviation & Aerospace Engineering (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Remote Sensing (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Automation & Control Theory (AREA)
  • Feedback Control In General (AREA)

Abstract

The invention relates to a mobile robot path planning method based on an improved Q-learning algorithm, which comprises the following steps: (1) A new potential energy field function is designed by combining the artificial potential field principle with the simulation environment, and the environmental potential energy value is introduced as heuristic information to initialize the Q value table, so that the potential energy value grows as the agent approaches the target point. This guides the agent to search toward the target direction earlier, accelerates convergence of the algorithm, improves planning efficiency, and reduces the blindness of the early exploration of the Q-learning algorithm. (2) A behavior utility function is added to the ε-greedy strategy of the traditional Q-learning algorithm; actions are evaluated according to the path segments produced after they are executed, and the probability of each action of the agent being selected is dynamically adjusted, which improves search efficiency and path smoothness. By applying the technical scheme, the shortest path can be obtained while the convergence speed of the algorithm and the smoothness of the path are improved.

Description

Mobile robot path planning method based on improved Q-learning algorithm
Technical Field
The invention relates to the technical field of robot navigation planning, in particular to a mobile robot path planning method based on an improved Q-learning algorithm.
Background
With the adoption of the goods-to-person picking mode, mobile robots have been widely applied in intelligent warehouses, and their introduction has improved warehouse picking efficiency; path planning, as one of the core technologies of mobile robots, has therefore attracted increasing attention. Path planning means planning a collision-free optimal path according to the environment of the mobile robot, in combination with evaluation criteria such as the shortest path, the shortest planning time and path smoothness.
Path planning originated in the 1960s; commonly used methods include Dijkstra's algorithm, the A* algorithm, the artificial potential field method, and heuristic intelligent search methods such as the ant colony algorithm and the particle swarm algorithm. However, the traditional methods are complex to operate and solve problems inefficiently, and heuristic algorithms are difficult to design and understand. With the progress of reinforcement learning in recent years, some researchers have begun to apply reinforcement learning to path planning. The most widely used reinforcement learning algorithm in mobile robot path planning is the Q-learning algorithm. Q-learning is a temporal-difference reinforcement learning algorithm that proceeds as follows: in state s the mobile robot selects and executes an action a among all possible actions, and then evaluates the outcome of that action according to the immediate reward received for action a and the estimate of the current state-action value. By repeatedly visiting all actions in all states, the mobile robot can learn the overall optimal behavior by judging the long-term discounted return. As a model-free learning method, the traditional Q-learning algorithm enables the mobile robot to plan a good collision-free path through a learning mechanism based on real-time interaction with the environment, requires no environment model, and performs well in complex environments. However, it still suffers from problems such as highly blind early exploration, long learning time, slow convergence, and poor path smoothness.
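For concreteness, the update loop described above can be sketched in a few lines of Python; this is our own minimal illustration rather than code from the patent, and the environment interface (reset/step) and the hyperparameter values are assumptions:

```python
import numpy as np

def q_learning_episode(env, Q, alpha=0.1, gamma=0.9, epsilon=0.1, max_steps=500):
    """One episode of plain tabular Q-learning with an epsilon-greedy policy.

    Q is an (n_states, n_actions) array; env is assumed to provide
    reset() -> state and step(state, action) -> (next_state, reward, done).
    """
    state = env.reset()
    for _ in range(max_steps):
        # epsilon-greedy: explore with probability epsilon, otherwise exploit
        if np.random.rand() < epsilon:
            action = np.random.randint(Q.shape[1])
        else:
            action = int(np.argmax(Q[state]))
        next_state, reward, done = env.step(state, action)
        # temporal-difference update toward the greedy value of the next state
        Q[state, action] += alpha * (reward + gamma * np.max(Q[next_state]) - Q[state, action])
        state = next_state
        if done:
            break
    return Q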
Disclosure of Invention
In view of the above, an object of the present invention is to provide a mobile robot path planning method based on an improved Q-learning algorithm, which can improve the convergence speed of the algorithm and the smoothness of the path while obtaining the shortest path.
In order to achieve the purpose, the invention adopts the following technical scheme: a mobile robot path planning method based on an improved Q-learning algorithm comprises the following steps:
step 1, modeling an environment map by adopting a grid method, and establishing an obstacle matrix;
step 2, the mobile robot is treated as a particle in the two-dimensional environment and can only search and move in 4 directions: up, down, left and right; each grid cell has unit side length, and the robot moves one step length (one cell) per move;
step 3, designing a reward function and establishing a reward matrix R and a Q value table Q;
step 4, initializing the Q value table using the improved potential energy field function, and initializing the parameters of the algorithm, including the number of iterations, the exploration probability ε, the learning rate α, the reward attenuation factor γ, the behavior utility attenuation coefficient p_1, the exploration excitation coefficient p_2, and the weight coefficient β;
step 5, initializing a starting point and a target point;
step 6, starting exploration: selecting an action to execute according to the improved ε exploration strategy, obtaining the immediate reward value after executing the action, calculating the distance function, and updating the behavior utility function;
step 7, updating the Q value table and the action execution probabilities according to the executed action, wherein the Q value table update formula is:
Q(s,a) = Q(s,a) + α[R_t + γ·max_a' Q(s',a') − Q(s,a)]
where α is the learning rate (α ∈ [0,1]), γ is the discount factor (γ ∈ [0,1]), R_t is the immediate reward value, and s' and a' are the next state and the next action;
the update formula of the action execution probability is as follows:
(The update formula is given only as an image in the original document and is not reproduced here.) In the formula, n is the total number of executed actions and E is the behavior utility function;
step 8, taking the position reached after executing the action as the current position; if the current position is the target position or the maximum number of steps has been exceeded, proceeding to step 9, otherwise returning to step 6;
step 9, recording the path learned in each episode and judging whether the maximum number of iterations has been reached; if so, outputting the optimal path, otherwise returning to step 5.
In a preferred embodiment, the reward function designed in step 3 is:
r = −r_1 if the agent collides with an obstacle; r = r_2 if the agent reaches the target point; r = 0 otherwise,
where r_1 and r_2 are both positive numbers: the agent obtains the reward value −r_1 when it hits an obstacle, the reward value r_2 when it reaches the target point, and a reward value of 0 when it reaches any other location.
In a preferred embodiment, the potential energy field function modified in step 4 is:
(The improved potential energy field function is given only as an image in the original document and is not reproduced here.) In the formula, C = L = X + Y, where X is the horizontal length of the environment and Y is the vertical length of the environment; d_1(s) and d_2(s) denote, respectively, the vertical and horizontal distances from the agent's current position to the target point; d(s) is the Euclidean distance from the agent's current position to the line connecting the start point and the target point;
the initialized Q value table function is:
Q(s,a)=R+γV(S)
where R is the initial reward function matrix, γ is the reward attenuation factor, and V(S) is the state-value function obtained by initializing all states with the attractive potential field function; in the Q value table initialized by this method, the Q values become larger the closer a state is to the target point, the target point has the largest Q value, and the Q value at an obstacle is 0.
In a preferred embodiment, the improved ε exploration strategy in step 6 is as follows: when a random value (between 0 and 1) is smaller than the greedy factor, the action with the highest execution probability is selected; when the random value is larger than the greedy factor, the Q values for moving from the current state to the next state are updated according to the execution probability of each action, and the action with the highest updated Q value is executed; the update formula is:
T_Q = Q + β × (P_1, P_2, …, P_i), i ∈ (1, n)
where T_Q is the Q value after being updated with the action execution probabilities, β is the weight coefficient, and P_1, P_2, …, P_i are the probabilities with which each action is executed; the improved behavior selection strategy lets the agent, when selecting an action in each state, use the environment information it has already explored to perform multi-step exploration, comprehensively consider the distance information between the states before and after the move and the target point as well as the multi-step action information, and select the optimal action to execute;
the distance function in step 6 is:
(The distance function is given only as an image in the original document and is not reproduced here.) In the formula, the two quantities are, respectively, the distances from the previous state and from the current state to the target point;
the action utility function and the calculation rule thereof are as follows:
(The behavior utility function and its calculation rule are given only as an image in the original document and are not reproduced here.) In the formula, p_1 is the attenuation coefficient, p_2 is the exploration excitation coefficient, r_t is the immediate reward value, and a_i denotes the different actions; the E value of each action is updated according to the magnitude of the immediate reward and whether consecutively executed actions are the same: when the immediate reward is positive and the same action has been executed three times in succession, E is updated by a first formula (given as an image); when the immediate reward is positive and the same action has been executed twice in succession, E is updated by a second formula (given as an image); otherwise the value of E is zero.
In a preferred embodiment, the Q value table is initialized by introducing the environment potential value as the heuristic information.
In a preferred embodiment, an epsilon greedy policy is improved by using a behavior utility function as a criterion for evaluating actions to be performed, and by combining environmental information that has been explored by an agent and the impact of the performed actions on path segment smoothness, the probability that each action of the agent is selected is dynamically adjusted.
Compared with the prior art, the invention has the following beneficial effects: the invention provides a mobile robot path planning method based on an improved Q-learning algorithm, which addresses the shortcomings of the traditional Q-learning algorithm, such as highly blind exploration, long learning time, slow convergence and poor path smoothness. It introduces a potential energy field function and a behavior utility function: the potential energy field guides the agent to explore toward the target direction in the early stage, while the behavior utility function lets the agent, when selecting an action in each state, use the environment information it has already explored, comprehensively consider the distance information between the states before and after the move and the target point as well as the multi-step action information, and select the best action to execute. This improves the operating efficiency of the algorithm, accelerates its convergence, and improves the smoothness of the planned path.
Drawings
FIG. 1 is a flow chart of a method implementation of the preferred embodiment of the present invention.
Fig. 2 is a graph of the convergence of a conventional method, a prior art method, and the present method in a preferred embodiment of the present invention.
Fig. 3 is a diagram comparing the path search results of the conventional method, the existing method, and the present method in the preferred embodiment of the present invention.
Detailed Description
The invention is further explained below with reference to the drawings and the embodiments.
It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the disclosure. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of example embodiments according to the present application; as used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, and it should be understood that when the terms "comprises" and/or "comprising" are used in this specification, they specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof, unless the context clearly indicates otherwise.
A mobile robot path planning method based on an improved Q-learning algorithm: the method improves on the Q value table and the ε exploration strategy. A potential energy field function is introduced as heuristic information to initialize the Q value table, guiding the agent to explore toward the target direction in the early stage and alleviating the blindness of the algorithm's early exploration; at the same time, a behavior utility function is introduced to improve the ε exploration strategy, and the probability of each action of the agent being selected is dynamically adjusted by combining the environment information the agent has already explored with the influence of the executed action on the smoothness of the path segment.
As shown in fig. 1, the method of the embodiment of the present invention specifically includes the following steps:
step 1, modeling an environment map by adopting a grid method, and establishing an obstacle matrix;
step 2, the mobile robot is treated as a particle in the two-dimensional environment and can only search and move in 4 directions: up, down, left and right; each grid cell has unit side length, and the robot moves one step length (one cell) per move;
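As an illustration of steps 1 and 2 (our own sketch with an arbitrary map size, not taken from the patent), the grid map, the obstacle matrix and the four-connected action set could be encoded as follows:

```python
import numpy as np

X, Y = 10, 10                              # horizontal and vertical grid size (example values)
obstacle = np.zeros((Y, X), dtype=int)     # obstacle matrix: 1 marks an occupied cell
obstacle[3, 2:7] = 1                       # an example wall

# Four-connected action set: up, down, left, right; one grid cell per step
ACTIONS = {0: (-1, 0), 1: (1, 0), 2: (0, -1), 3: (0, 1)}

def move(cell, action):
    """Apply an action to a (row, col) cell, clamping the result to the map bounds."""
    dr, dc = ACTIONS[action]
    r = min(max(cell[0] + dr, 0), Y - 1)
    c = min(max(cell[1] + dc, 0), X - 1)
    return (r, c)
```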
step 3, designing a reward function and establishing a reward matrix R and a Q value table Q, wherein the reward function formula is as follows:
r = −r_1 if the agent collides with an obstacle; r = r_2 if the agent reaches the target point; r = 0 otherwise,
where r_1 and r_2 are both positive numbers: the agent obtains the reward value −r_1 when it hits an obstacle, the reward value r_2 when it reaches the target point, and a reward value of 0 when it reaches any other location.
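The piecewise reward can be coded directly from this description; the numeric magnitudes of r_1 and r_2 below are placeholders, since the text only requires them to be positive:

```python
R1, R2 = 10.0, 100.0   # placeholder magnitudes for r_1 (collision penalty) and r_2 (goal reward)

def reward(cell, goal, obstacle):
    """Return -r_1 on collision with an obstacle, r_2 at the target point, 0 elsewhere."""
    if obstacle[cell] == 1:
        return -R1
    if cell == goal:
        return R2
    return 0.0
```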
step 4, initializing the Q value table using the improved potential energy field function, and initializing the parameters of the algorithm, including the number of iterations, the exploration probability ε, the learning rate α, the reward attenuation factor γ, the behavior utility attenuation coefficient p_1, the exploration excitation coefficient p_2, the weight coefficient β, and so on. The traditional Q value table is generally initialized to 0 or to identical values, which makes early exploration highly blind and convergence slow; the invention therefore designs a new potential energy field function based on the artificial potential field principle and introduces it as heuristic information. The potential energy field function is:
(The potential energy field function is given only as an image in the original document and is not reproduced here.) In the formula, C = L = X + Y, where X is the horizontal length of the environment and Y is the vertical length of the environment; d_1(s) and d_2(s) denote, respectively, the vertical and horizontal distances from the agent's current position to the target point; d(s) is the Euclidean distance from the agent's current position to the line connecting the start point and the target point.
The initialized Q-value table function is:
Q(s,a)=R+γV(S)
where R is the initial reward function matrix, γ is the reward attenuation factor, and V(S) is the state-value function obtained by initializing all states with the attractive potential field function. In the Q value table initialized in this way, the Q values become larger the closer a state is to the target point, the target point has the largest Q value, and the Q value at an obstacle is 0, so the agent is drawn toward the target point in the early stage, which accelerates convergence and improves planning efficiency.
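The concrete potential energy field function is given only as an image in the original text, so the sketch below substitutes a simple inverse-Manhattan-distance potential as a stand-in for V(S); it is meant only to illustrate the initialization rule Q(s,a) = R + γ·V(S) and the resulting property that values grow toward the target and stay at zero on obstacles. The reward matrix R is assumed here to be an (n_states, n_actions) array, and init_q_table is our own name:

```python
import numpy as np

ACTIONS = {0: (-1, 0), 1: (1, 0), 2: (0, -1), 3: (0, 1)}  # up, down, left, right

def init_q_table(obstacle, goal, R, gamma=0.9):
    """Seed the Q table with heuristic state values (stand-in potential, not the
    patented potential field function, which uses C = X + Y, d1(s), d2(s), d(s))."""
    Y, X = obstacle.shape
    V = np.zeros((Y, X))
    for r in range(Y):
        for c in range(X):
            if obstacle[r, c] == 0:
                # larger value the closer the cell is to the goal; obstacles stay at 0
                V[r, c] = 1.0 / (1.0 + abs(r - goal[0]) + abs(c - goal[1]))
    Q = np.zeros((Y * X, len(ACTIONS)))
    for r in range(Y):
        for c in range(X):
            s = r * X + c
            for a, (dr, dc) in ACTIONS.items():
                nr = min(max(r + dr, 0), Y - 1)
                nc = min(max(c + dc, 0), X - 1)
                # Q(s,a) = R + gamma * V(S); here V is read at the state reached by action a
                Q[s, a] = R[s, a] + gamma * V[nr, nc]
    return Q
```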
step 5, initializing a starting point and a target point;
step 6, starting exploration: selecting an action to execute according to the improved ε exploration strategy, obtaining the immediate reward value after executing the action, calculating the distance function, and updating the behavior utility function;
In the traditional ε exploration strategy, action selection is highly random, which slows convergence and degrades path smoothness. The concept of a behavior utility function is therefore introduced to evaluate how well actions are executed and to dynamically adjust the probability that each action of the agent is selected. The behavior utility function is:
(The behavior utility function is given only as an image in the original document and is not reproduced here.) In the formula, p_1 is the attenuation coefficient, p_2 is the exploration excitation coefficient, r_t is the immediate reward value, and a_i denotes the different actions. The E value of each action is updated according to the magnitude of the immediate reward and whether consecutively executed actions are the same: when the immediate reward is positive and the same action has been executed three times in succession, E is updated by a first formula (given as an image); when the immediate reward is positive and the same action has been executed twice in succession, E is updated by a second formula (given as an image); otherwise the value of E is zero.
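The two boost formulas for E appear only as images in the original, so the sketch below keeps just the documented structure (a positive immediate reward together with a run of two or three identical actions raises E, anything else resets it to zero) and fills in an assumed decaying-boost form using p_1 and p_2; the exact expressions and the function name are our assumptions, not the patented formulas:

```python
def update_behavior_utility(E, action, last_actions, r_t, p1=0.9, p2=0.5):
    """Update the behavior utility value E[action] (assumed form, see lead-in).

    E is a per-action array, last_actions the list of previously executed actions,
    r_t the immediate reward, p1 the attenuation coefficient, p2 the exploration
    excitation coefficient.
    """
    # length of the run of identical actions ending with the current one
    run = 1
    for prev in reversed(last_actions):
        if prev == action:
            run += 1
        else:
            break
    if r_t > 0 and run >= 3:
        E[action] = p1 * E[action] + p2 * r_t        # assumed stronger boost
    elif r_t > 0 and run == 2:
        E[action] = p1 * E[action] + 0.5 * p2 * r_t  # assumed weaker boost
    else:
        E[action] = 0.0                              # documented: otherwise E is zero
    return E
```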
Wherein the distance function is as follows:
(The distance function is given only as an image in the original document and is not reproduced here.) In the formula, the two quantities are, respectively, the distances from the previous state and from the current state to the target point.
The improved ε exploration strategy is as follows: when a random value (between 0 and 1) is smaller than the greedy factor, the action with the highest execution probability is selected; when the random value is larger than the greedy factor, the Q values for moving from the current state to the next state are updated according to the execution probability of each action, and the action with the highest updated Q value is executed. The update formula is as follows:
T_Q = Q + β × (P_1, P_2, …, P_i), i ∈ (1, n)
where T_Q is the Q value after being updated with the action execution probabilities, β is the weight coefficient, and P_1, P_2, …, P_i are the probabilities with which each action is executed. The improved behavior selection strategy lets the agent, when selecting an action in each state, use the environment information it has already explored to perform multi-step exploration, comprehensively consider the distance information between the states before and after the move and the target point as well as the multi-step action information, and select the "best" action to execute.
step 7, updating the Q value table, updating the action execution probabilities, and updating the position state. The Q value table update formula is as follows:
Q(s,a) = Q(s,a) + α[R_t + γ·max_a' Q(s',a') − Q(s,a)]
where α is the learning rate (α ∈ [0,1]), γ is the discount factor (γ ∈ [0,1]), R_t is the immediate reward value, and s' and a' are the next state and the next action.
The action execution probabilities are then updated; the update formula is given only as an image in the original document and is not reproduced here. In the formula, n is the total number of executed actions and E is the behavior utility function.
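Since the probability update formula itself is only available as an image, the helper below simply normalizes the behavior utility values over the actions as a plausible stand-in; this normalization is our assumption, not the patented rule:

```python
import numpy as np

def update_action_probabilities(E):
    """Turn the per-action behavior utility values E into execution probabilities
    by shifting them to be non-negative and normalizing (stand-in rule)."""
    E = np.asarray(E, dtype=float)
    shifted = E - E.min()
    total = shifted.sum()
    if total == 0.0:
        return np.full(E.shape, 1.0 / E.size)   # fall back to a uniform distribution
    return shifted / total
```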
step 8, taking the position reached after executing the action as the current position; if the current position is the target position or the maximum number of steps has been exceeded, proceeding to step 9, otherwise returning to step 6;
and step nine, recording the path learned every time, judging whether the maximum iteration times is reached, if so, outputting the optimal path, and if not, jumping to the step five.
Fig. 2 and fig. 3 show the difference in path planning performance between the method of the present invention, the traditional Q-learning algorithm, and the existing Q-learning algorithm. In figs. 2 and 3, (a) corresponds to the traditional Q-learning algorithm, (b) to the existing Q-learning algorithm, and (c) to the method of the invention. It can be seen from the figures that, compared with the traditional algorithm and the existing Q-learning algorithm, the method of the present invention effectively improves path smoothness and accelerates algorithm convergence.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and so forth) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While the invention has been described with reference to a preferred embodiment, it will be understood by those skilled in the art that various changes may be made and equivalents may be substituted for elements thereof without departing from the scope of the invention. However, any simple modification, equivalent change and modification of the above embodiments according to the technical essence of the present invention are within the protection scope of the technical solution of the present invention.

Claims (6)

1. A mobile robot path planning method based on an improved Q-learning algorithm is characterized by comprising the following steps:
step 1, modeling an environment map by adopting a grid method, and establishing an obstacle matrix;
step 2, the mobile robot is treated as a particle in the two-dimensional environment and can only search and move in 4 directions: up, down, left and right; each grid cell has unit side length, and the robot moves one step length (one cell) per move;
step 3, designing a reward function and establishing a reward matrix R and a Q value table Q;
step 4, initializing the Q value table using the improved potential energy field function, and initializing the parameters of the algorithm, including the number of iterations, the exploration probability ε, the learning rate α, the reward attenuation factor γ, the behavior utility attenuation coefficient p_1, the exploration excitation coefficient p_2, and the weight coefficient β;
step 5, initializing a starting point and a target point;
step 6, starting exploration: selecting an action to execute according to the improved ε exploration strategy, obtaining the immediate reward value after executing the action, calculating the distance function, and updating the behavior utility function;
step 7, updating the Q value table and the action execution probabilities according to the executed action, wherein the Q value table update formula is:
Q(s,a) = Q(s,a) + α[R_t + γ·max_a' Q(s',a') − Q(s,a)]
where α is the learning rate (α ∈ [0,1]), γ is the discount factor (γ ∈ [0,1]), R_t is the immediate reward value, and s' and a' are the next state and the next action;
the update formula of the action execution probability is as follows:
(The formula is given only as an image in the original document and is not reproduced here.) In the formula, n is the total number of executed actions and E is the behavior utility function;
step 8, taking the position reached after executing the action as the current position; if the current position is the target position or the maximum number of steps has been exceeded, proceeding to step 9, otherwise returning to step 6;
step 9, recording the path learned in each episode and judging whether the maximum number of iterations has been reached; if so, outputting the optimal path, otherwise returning to step 5.
2. The method for planning the path of the mobile robot based on the improved Q-learning algorithm as claimed in claim 1, wherein:
the reward function designed in the step 3 is as follows:
r = −r_1 if the agent collides with an obstacle; r = r_2 if the agent reaches the target point; r = 0 otherwise,
where r_1 and r_2 are both positive numbers: the agent obtains the reward value −r_1 when it hits an obstacle, the reward value r_2 when it reaches the target point, and a reward value of 0 when it reaches any other location.
3. The method for planning the path of a mobile robot based on the improved Q-learning algorithm as claimed in claim 1, wherein:
the improved potential energy field function in the step 4 is as follows:
(The improved potential energy field function is given only as an image in the original document and is not reproduced here.) In the formula, C = L = X + Y, where X is the horizontal length of the environment and Y is the vertical length of the environment; d_1(s) and d_2(s) denote, respectively, the vertical and horizontal distances from the agent's current position to the target point; d(s) is the Euclidean distance from the agent's current position to the line connecting the start point and the target point;
the initialized Q value table function is:
Q(s,a)=R+γV(S)
where R is the initial reward function matrix, γ is the reward attenuation factor, and V(S) is the state-value function obtained by initializing all states with the attractive potential field function; in the Q value table initialized by this method, the Q values become larger the closer a state is to the target point, the target point has the largest Q value, and the Q value at an obstacle is 0.
4. The method for planning the path of a mobile robot based on the improved Q-learning algorithm as claimed in claim 1, wherein:
the improved ε exploration strategy in step 6 is as follows: when a random value (between 0 and 1) is smaller than the greedy factor, the action with the highest execution probability is selected; when the random value is larger than the greedy factor, the Q values for moving from the current state to the next state are updated according to the execution probability of each action, and the action with the highest updated Q value is executed; the update formula is:
T_Q = Q + β × (P_1, P_2, …, P_i), i ∈ (1, n)
where T_Q is the Q value after being updated with the action execution probabilities, β is the weight coefficient, and P_1, P_2, …, P_i are the probabilities with which each action is executed; the improved behavior selection strategy lets the agent, when selecting an action in each state, use the environment information it has already explored to perform multi-step exploration, comprehensively consider the distance information between the states before and after the move and the target point as well as the multi-step action information, and select the optimal action to execute;
the distance function in step 6 is:
(The distance function is given only as an image in the original document and is not reproduced here.) In the formula, the two quantities are, respectively, the distances from the previous state and from the current state to the target point;
the action utility function and its calculation rule are:
(The behavior utility function and its calculation rule are given only as an image in the original document and are not reproduced here.) In the formula, p_1 is the attenuation coefficient, p_2 is the exploration excitation coefficient, r_t is the immediate reward value, and a_i denotes the different actions; the E value of each action is updated according to the magnitude of the immediate reward and whether consecutively executed actions are the same: when the immediate reward is positive and the same action has been executed three times in succession, E is updated by a first formula (given as an image); when the immediate reward is positive and the same action has been executed twice in succession, E is updated by a second formula (given as an image); otherwise the value of E is zero.
5. The method as claimed in claim 1, wherein the Q-value table is initialized by introducing environment potential values as heuristic information.
6. The method for mobile robot path planning based on the improved Q-learning algorithm as claimed in claim 1, wherein a behavior utility function is used as a standard for evaluating the executed actions, so as to improve an epsilon greedy strategy, and the probability of each action being selected by the agent is dynamically adjusted in combination with the environmental information already explored by the agent and the influence of the executed actions on the smoothness of the path segment.
CN202211213330.XA 2022-09-29 Mobile robot path planning method based on improved Q-learning algorithm Active CN115542912B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211213330.XA CN115542912B (en) 2022-09-29 Mobile robot path planning method based on improved Q-learning algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211213330.XA CN115542912B (en) 2022-09-29 Mobile robot path planning method based on improved Q-learning algorithm

Publications (2)

Publication Number Publication Date
CN115542912A true CN115542912A (en) 2022-12-30
CN115542912B CN115542912B (en) 2024-06-07


Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116822765A (en) * 2023-06-02 2023-09-29 东南大学 Q-learning-based agent time sequence task path planning method

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102819264A (en) * 2012-07-30 2012-12-12 山东大学 Path planning Q-learning initial method of mobile robot
WO2017161632A1 (en) * 2016-03-24 2017-09-28 苏州大学张家港工业技术研究院 Cleaning robot optimal target path planning method based on model learning
CN112344944A (en) * 2020-11-24 2021-02-09 湖北汽车工业学院 Reinforced learning path planning method introducing artificial potential field

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102819264A (en) * 2012-07-30 2012-12-12 山东大学 Path planning Q-learning initial method of mobile robot
WO2017161632A1 (en) * 2016-03-24 2017-09-28 苏州大学张家港工业技术研究院 Cleaning robot optimal target path planning method based on model learning
CN112344944A (en) * 2020-11-24 2021-02-09 湖北汽车工业学院 Reinforced learning path planning method introducing artificial potential field

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
董培方; 张志安; 梅新虎; 朱朔: "Reinforcement learning path planning algorithm introducing potential field and trap search" (引入势场及陷阱搜索的强化学习路径规划算法), Computer Engineering and Applications (计算机工程与应用), no. 16, 8 September 2017 (2017-09-08), pages 135-140 *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116822765A (en) * 2023-06-02 2023-09-29 东南大学 Q-learning-based agent time sequence task path planning method

Similar Documents

Publication Publication Date Title
CN106990792B (en) Multi-unmanned aerial vehicle collaborative time sequence coupling task allocation method based on hybrid gravity search algorithm
CN113110592A (en) Unmanned aerial vehicle obstacle avoidance and path planning method
CN111260027B (en) Intelligent agent automatic decision-making method based on reinforcement learning
CN110442129B (en) Control method and system for multi-agent formation
CN109690576A (en) The training machine learning model in multiple machine learning tasks
CN109523029A (en) For the adaptive double from driving depth deterministic policy Gradient Reinforcement Learning method of training smart body
CN105893694A (en) Complex system designing method based on resampling particle swarm optimization algorithm
CN114460943B (en) Self-adaptive target navigation method and system for service robot
CN111008449A (en) Acceleration method for deep reinforcement learning deduction decision training in battlefield simulation environment
CN110135584A (en) Extensive Symbolic Regression method and system based on self-adaptive parallel genetic algorithm
JP2014502393A (en) Determination method and determination apparatus
CN111352419B (en) Path planning method and system for updating experience playback cache based on time sequence difference
CN113561986A (en) Decision-making method and device for automatically driving automobile
CN111950735A (en) Reinforced learning method based on bidirectional model
CN113268854A (en) Reinforced learning method and system for double evaluators and single actuator
CN114815891A (en) PER-IDQN-based multi-unmanned aerial vehicle enclosure capture tactical method
CN113110101B (en) Production line mobile robot gathering type recovery and warehousing simulation method and system
CN106650028A (en) Optimization method and system based on agile satellite design parameters
Gao et al. An adaptive framework to select the coordinate systems for evolutionary algorithms
CN115542912A (en) Mobile robot path planning method based on improved Q-learning algorithm
CN115542912B (en) Mobile robot path planning method based on improved Q-learning algorithm
CN116339349A (en) Path planning method, path planning device, electronic equipment and storage medium
CN116382299A (en) Path planning method, path planning device, electronic equipment and storage medium
CN114861368B (en) Construction method of railway longitudinal section design learning model based on near-end strategy
CN116090549A (en) Knowledge-driven multi-agent reinforcement learning decision-making method, system and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant