CN115542912A - Mobile robot path planning method based on improved Q-learning algorithm - Google Patents

Mobile robot path planning method based on improved Q-learning algorithm

Info

Publication number
CN115542912A
CN115542912A
Authority
CN
China
Prior art keywords
value
action
improved
reward
function
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202211213330.XA
Other languages
Chinese (zh)
Other versions
CN115542912B (en)
Inventor
涂俊翔
张立
李凡
钟礼阳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fuzhou University
Original Assignee
Fuzhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fuzhou University filed Critical Fuzhou University
Priority to CN202211213330.XA priority Critical patent/CN115542912B/en
Priority claimed from CN202211213330.XA external-priority patent/CN115542912B/en
Publication of CN115542912A publication Critical patent/CN115542912A/en
Application granted granted Critical
Publication of CN115542912B publication Critical patent/CN115542912B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • G PHYSICS
    • G05 CONTROLLING; REGULATING
    • G05D SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D1/00 Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
    • G05D1/02 Control of position or course in two dimensions
    • G05D1/021 Control of position or course in two dimensions specially adapted to land vehicles
    • G05D1/0212 Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory
    • G05D1/0214 Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory in accordance with safety or protection criteria, e.g. avoiding hazardous areas
    • G PHYSICS
    • G05 CONTROLLING; REGULATING
    • G05D SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D1/00 Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
    • G05D1/02 Control of position or course in two dimensions
    • G05D1/021 Control of position or course in two dimensions specially adapted to land vehicles
    • G05D1/0212 Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory
    • G05D1/0221 Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory involving a learning process

Landscapes

  • Engineering & Computer Science (AREA)
  • Aviation & Aerospace Engineering (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Remote Sensing (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Automation & Control Theory (AREA)
  • Feedback Control In General (AREA)

Abstract

The invention relates to a mobile robot path planning method based on an improved Q-learning algorithm, which comprises the following steps: (1) A new potential energy field function is designed by combining the artificial potential field principle with the simulation environment, and the environmental potential energy value is introduced as heuristic information to initialize the Q value table, so that the potential energy value grows as the agent approaches the target point. This guides the agent to search toward the target direction earlier, accelerates convergence of the algorithm, improves planning efficiency, and reduces the blindness of the early exploration of the Q-learning algorithm. (2) A behavior utility function is added to the ε-greedy strategy of the traditional Q-learning algorithm; actions are evaluated according to the path segments produced after they are executed, and the probability of each action of the agent being selected is dynamically adjusted, which improves search efficiency and path smoothness. By applying the technical scheme, the shortest path can be obtained while the convergence speed of the algorithm and the smoothness of the path are improved.

Description

Mobile robot path planning method based on improved Q-learning algorithm
Technical Field
The invention relates to the technical field of robot navigation planning, in particular to a mobile robot path planning method based on an improved Q-learning algorithm.
Background
With the adoption of the goods-to-person picking mode, mobile robots have been widely applied in intelligent warehouses, and their introduction has improved warehouse picking efficiency; path planning, as one of the core technologies of mobile robots, has therefore attracted increasing attention. Path planning means planning a collision-free optimal path according to the environment of the mobile robot, in combination with evaluation criteria such as the shortest path, the shortest planning time and path smoothness.
Path planning originated in the 1960s; commonly used methods include Dijkstra's algorithm, the A* algorithm, the artificial potential field method, and heuristic intelligent search methods such as the ant colony algorithm and the particle swarm algorithm. However, the traditional methods are complex to operate and solve problems inefficiently, and heuristic algorithms are difficult to design and understand. With the progress of reinforcement learning in recent years, some researchers have begun to apply reinforcement learning to path planning. The most widely used reinforcement learning algorithm in mobile robot path planning is the Q-learning algorithm. Q-learning is a temporal-difference reinforcement learning algorithm that proceeds as follows: in state s the mobile robot selects and executes an action a among all possible actions, and then evaluates the outcome of that action according to the immediate reward received for action a and the estimate of the current state-action value. By repeatedly visiting all actions in all states, the mobile robot can learn the overall optimal behavior by judging the long-term discounted return. As a model-free learning method, the traditional Q-learning algorithm enables the mobile robot to plan a good collision-free path through a learning mechanism based on real-time interaction with the environment, requires no environment model, and performs well in complex environments. However, it still suffers from problems such as highly blind early exploration, long learning time, slow convergence, and poor path smoothness.
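For concreteness, the update loop described above can be sketched in a few lines of Python; this is our own minimal illustration rather than code from the patent, and the environment interface (reset/step) and the hyperparameter values are assumptions:

```python
import numpy as np

def q_learning_episode(env, Q, alpha=0.1, gamma=0.9, epsilon=0.1, max_steps=500):
    """One episode of plain tabular Q-learning with an epsilon-greedy policy.

    Q is an (n_states, n_actions) array; env is assumed to provide
    reset() -> state and step(state, action) -> (next_state, reward, done).
    """
    state = env.reset()
    for _ in range(max_steps):
        # epsilon-greedy: explore with probability epsilon, otherwise exploit
        if np.random.rand() < epsilon:
            action = np.random.randint(Q.shape[1])
        else:
            action = int(np.argmax(Q[state]))
        next_state, reward, done = env.step(state, action)
        # temporal-difference update toward the greedy value of the next state
        Q[state, action] += alpha * (reward + gamma * np.max(Q[next_state]) - Q[state, action])
        state = next_state
        if done:
            break
    return Q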
Disclosure of Invention
In view of the above, an object of the present invention is to provide a mobile robot path planning method based on an improved Q-learning algorithm, which can improve the convergence speed of the algorithm and the smoothness of the path while obtaining the shortest path.
In order to achieve the purpose, the invention adopts the following technical scheme: a mobile robot path planning method based on an improved Q-learning algorithm comprises the following steps:
step 1, modeling an environment map by adopting a grid method, and establishing an obstacle matrix;
step 2, the mobile robot is treated as a particle in the two-dimensional environment and can only search and move in 4 directions: up, down, left and right; each grid cell has unit side length, and the robot moves one step length (one cell) per move;
step 3, designing a reward function and establishing a reward matrix R and a Q value table Q;
step 4, initializing the Q value table using the improved potential energy field function, and initializing the parameters of the algorithm, including the number of iterations, the exploration probability ε, the learning rate α, the reward attenuation factor γ, the behavior utility attenuation coefficient p_1, the exploration excitation coefficient p_2, and the weight coefficient β;
step 5, initializing a starting point and a target point;
step 6, starting exploration: selecting an action to execute according to the improved ε exploration strategy, obtaining the immediate reward value after executing the action, calculating the distance function, and updating the behavior utility function;
step 7, updating the Q value table and the action execution probabilities according to the executed action, wherein the Q value table update formula is:
Q(s,a) = Q(s,a) + α[R_t + γ·max_a' Q(s',a') − Q(s,a)]
where α is the learning rate (α ∈ [0,1]), γ is the discount factor (γ ∈ [0,1]), R_t is the immediate reward value, and s' and a' are the next state and the next action;
the update formula of the action execution probability is as follows:
(The update formula is given only as an image in the original document and is not reproduced here.) In the formula, n is the total number of executed actions and E is the behavior utility function;
step 8, taking the position reached after executing the action as the current position; if the current position is the target position or the maximum number of steps has been exceeded, proceeding to step 9, otherwise returning to step 6;
step 9, recording the path learned in each episode and judging whether the maximum number of iterations has been reached; if so, outputting the optimal path, otherwise returning to step 5.
In a preferred embodiment, the reward function designed in step 3 is:
r = −r_1 if the agent collides with an obstacle; r = r_2 if the agent reaches the target point; r = 0 otherwise,
where r_1 and r_2 are both positive numbers: the agent obtains the reward value −r_1 when it hits an obstacle, the reward value r_2 when it reaches the target point, and a reward value of 0 when it reaches any other location.
In a preferred embodiment, the potential energy field function modified in step 4 is:
(The improved potential energy field function is given only as an image in the original document and is not reproduced here.) In the formula, C = L = X + Y, where X is the horizontal length of the environment and Y is the vertical length of the environment; d_1(s) and d_2(s) denote, respectively, the vertical and horizontal distances from the agent's current position to the target point; d(s) is the Euclidean distance from the agent's current position to the line connecting the start point and the target point;
the initialized Q value table function is:
Q(s,a)=R+γV(S)
where R is the initial reward function matrix, γ is the reward attenuation factor, and V(S) is the state-value function obtained by initializing all states with the attractive potential field function; in the Q value table initialized by this method, the Q values become larger the closer a state is to the target point, the target point has the largest Q value, and the Q value at an obstacle is 0.
In a preferred embodiment, the improved ε exploration strategy in step 6 is as follows: when a random value (between 0 and 1) is smaller than the greedy factor, the action with the highest execution probability is selected; when the random value is larger than the greedy factor, the Q values for moving from the current state to the next state are updated according to the execution probability of each action, and the action with the highest updated Q value is executed; the update formula is:
T_Q = Q + β × (P_1, P_2, …, P_i), i ∈ (1, n)
where T_Q is the Q value after being updated with the action execution probabilities, β is the weight coefficient, and P_1, P_2, …, P_i are the probabilities with which each action is executed; the improved behavior selection strategy lets the agent, when selecting an action in each state, use the environment information it has already explored to perform multi-step exploration, comprehensively consider the distance information between the states before and after the move and the target point as well as the multi-step action information, and select the optimal action to execute;
the distance function in step 6 is:
(The distance function is given only as an image in the original document and is not reproduced here.) In the formula, the two quantities are, respectively, the distances from the previous state and from the current state to the target point;
the action utility function and the calculation rule thereof are as follows:
(The behavior utility function and its calculation rule are given only as an image in the original document and are not reproduced here.) In the formula, p_1 is the attenuation coefficient, p_2 is the exploration excitation coefficient, r_t is the immediate reward value, and a_i denotes the different actions; the E value of each action is updated according to the magnitude of the immediate reward and whether consecutively executed actions are the same: when the immediate reward is positive and the same action has been executed three times in succession, E is updated by a first formula (given as an image); when the immediate reward is positive and the same action has been executed twice in succession, E is updated by a second formula (given as an image); otherwise the value of E is zero.
In a preferred embodiment, the Q value table is initialized by introducing the environment potential value as the heuristic information.
In a preferred embodiment, an epsilon greedy policy is improved by using a behavior utility function as a criterion for evaluating actions to be performed, and by combining environmental information that has been explored by an agent and the impact of the performed actions on path segment smoothness, the probability that each action of the agent is selected is dynamically adjusted.
Compared with the prior art, the invention has the following beneficial effects: the invention provides a mobile robot path planning method based on an improved Q-learning algorithm, which addresses the shortcomings of the traditional Q-learning algorithm, such as highly blind exploration, long learning time, slow convergence and poor path smoothness. It introduces a potential energy field function and a behavior utility function: the potential energy field guides the agent to explore toward the target direction in the early stage, while the behavior utility function lets the agent, when selecting an action in each state, use the environment information it has already explored, comprehensively consider the distance information between the states before and after the move and the target point as well as the multi-step action information, and select the best action to execute. This improves the operating efficiency of the algorithm, accelerates its convergence, and improves the smoothness of the planned path.
Drawings
FIG. 1 is a flow chart of a method implementation of the preferred embodiment of the present invention.
Fig. 2 is a graph of the convergence of a conventional method, a prior art method, and the present method in a preferred embodiment of the present invention.
Fig. 3 is a diagram comparing the path search results of the conventional method, the existing method, and the present method in the preferred embodiment of the present invention.
Detailed Description
The invention is further explained below with reference to the drawings and the embodiments.
It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the disclosure. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of example embodiments according to the present application; as used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, and it should be understood that when the terms "comprises" and/or "comprising" are used in this specification, they specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof, unless the context clearly indicates otherwise.
A mobile robot path planning method based on an improved Q-learning algorithm: the method improves on the Q value table and the ε exploration strategy. A potential energy field function is introduced as heuristic information to initialize the Q value table, guiding the agent to explore toward the target direction in the early stage and alleviating the blindness of the algorithm's early exploration; at the same time, a behavior utility function is introduced to improve the ε exploration strategy, and the probability of each action of the agent being selected is dynamically adjusted by combining the environment information the agent has already explored with the influence of the executed action on the smoothness of the path segment.
As shown in fig. 1, the method of the embodiment of the present invention specifically includes the following steps:
step 1, modeling an environment map by adopting a grid method, and establishing an obstacle matrix;
step 2, the mobile robot is treated as a particle in the two-dimensional environment and can only search and move in 4 directions: up, down, left and right; each grid cell has unit side length, and the robot moves one step length (one cell) per move;
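As an illustration of steps 1 and 2 (our own sketch with an arbitrary map size, not taken from the patent), the grid map, the obstacle matrix and the four-connected action set could be encoded as follows:

```python
import numpy as np

X, Y = 10, 10                              # horizontal and vertical grid size (example values)
obstacle = np.zeros((Y, X), dtype=int)     # obstacle matrix: 1 marks an occupied cell
obstacle[3, 2:7] = 1                       # an example wall

# Four-connected action set: up, down, left, right; one grid cell per step
ACTIONS = {0: (-1, 0), 1: (1, 0), 2: (0, -1), 3: (0, 1)}

def move(cell, action):
    """Apply an action to a (row, col) cell, clamping the result to the map bounds."""
    dr, dc = ACTIONS[action]
    r = min(max(cell[0] + dr, 0), Y - 1)
    c = min(max(cell[1] + dc, 0), X - 1)
    return (r, c)
```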
step 3, designing a reward function and establishing a reward matrix R and a Q value table Q, wherein the reward function formula is as follows:
r = −r_1 if the agent collides with an obstacle; r = r_2 if the agent reaches the target point; r = 0 otherwise,
where r_1 and r_2 are both positive numbers: the agent obtains the reward value −r_1 when it hits an obstacle, the reward value r_2 when it reaches the target point, and a reward value of 0 when it reaches any other location.
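The piecewise reward can be coded directly from this description; the numeric magnitudes of r_1 and r_2 below are placeholders, since the text only requires them to be positive:

```python
R1, R2 = 10.0, 100.0   # placeholder magnitudes for r_1 (collision penalty) and r_2 (goal reward)

def reward(cell, goal, obstacle):
    """Return -r_1 on collision with an obstacle, r_2 at the target point, 0 elsewhere."""
    if obstacle[cell] == 1:
        return -R1
    if cell == goal:
        return R2
    return 0.0
```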
step 4, initializing the Q value table using the improved potential energy field function, and initializing the parameters of the algorithm, including the number of iterations, the exploration probability ε, the learning rate α, the reward attenuation factor γ, the behavior utility attenuation coefficient p_1, the exploration excitation coefficient p_2, the weight coefficient β, and so on. The traditional Q value table is generally initialized to 0 or to identical values, which makes early exploration highly blind and convergence slow; the invention therefore designs a new potential energy field function based on the artificial potential field principle and introduces it as heuristic information. The potential energy field function is:
(The potential energy field function is given only as an image in the original document and is not reproduced here.) In the formula, C = L = X + Y, where X is the horizontal length of the environment and Y is the vertical length of the environment; d_1(s) and d_2(s) denote, respectively, the vertical and horizontal distances from the agent's current position to the target point; d(s) is the Euclidean distance from the agent's current position to the line connecting the start point and the target point.
The initialized Q-value table function is:
Q(s,a)=R+γV(S)
where R is the initial reward function matrix, γ is the reward attenuation factor, and V(S) is the state-value function obtained by initializing all states with the attractive potential field function. In the Q value table initialized in this way, the Q values become larger the closer a state is to the target point, the target point has the largest Q value, and the Q value at an obstacle is 0, so the agent is drawn toward the target point in the early stage, which accelerates convergence and improves planning efficiency.
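The concrete potential energy field function is given only as an image in the original text, so the sketch below substitutes a simple inverse-Manhattan-distance potential as a stand-in for V(S); it is meant only to illustrate the initialization rule Q(s,a) = R + γ·V(S) and the resulting property that values grow toward the target and stay at zero on obstacles. The reward matrix R is assumed here to be an (n_states, n_actions) array, and init_q_table is our own name:

```python
import numpy as np

ACTIONS = {0: (-1, 0), 1: (1, 0), 2: (0, -1), 3: (0, 1)}  # up, down, left, right

def init_q_table(obstacle, goal, R, gamma=0.9):
    """Seed the Q table with heuristic state values (stand-in potential, not the
    patented potential field function, which uses C = X + Y, d1(s), d2(s), d(s))."""
    Y, X = obstacle.shape
    V = np.zeros((Y, X))
    for r in range(Y):
        for c in range(X):
            if obstacle[r, c] == 0:
                # larger value the closer the cell is to the goal; obstacles stay at 0
                V[r, c] = 1.0 / (1.0 + abs(r - goal[0]) + abs(c - goal[1]))
    Q = np.zeros((Y * X, len(ACTIONS)))
    for r in range(Y):
        for c in range(X):
            s = r * X + c
            for a, (dr, dc) in ACTIONS.items():
                nr = min(max(r + dr, 0), Y - 1)
                nc = min(max(c + dc, 0), X - 1)
                # Q(s,a) = R + gamma * V(S); here V is read at the state reached by action a
                Q[s, a] = R[s, a] + gamma * V[nr, nc]
    return Q
```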
step 5, initializing a starting point and a target point;
step 6, starting exploration: selecting an action to execute according to the improved ε exploration strategy, obtaining the immediate reward value after executing the action, calculating the distance function, and updating the behavior utility function;
In the traditional ε exploration strategy, action selection is highly random, which slows convergence and degrades path smoothness. The concept of a behavior utility function is therefore introduced to evaluate how well actions are executed and to dynamically adjust the probability that each action of the agent is selected. The behavior utility function is:
(The behavior utility function is given only as an image in the original document and is not reproduced here.) In the formula, p_1 is the attenuation coefficient, p_2 is the exploration excitation coefficient, r_t is the immediate reward value, and a_i denotes the different actions. The E value of each action is updated according to the magnitude of the immediate reward and whether consecutively executed actions are the same: when the immediate reward is positive and the same action has been executed three times in succession, E is updated by a first formula (given as an image); when the immediate reward is positive and the same action has been executed twice in succession, E is updated by a second formula (given as an image); otherwise the value of E is zero.
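The two boost formulas for E appear only as images in the original, so the sketch below keeps just the documented structure (a positive immediate reward together with a run of two or three identical actions raises E, anything else resets it to zero) and fills in an assumed decaying-boost form using p_1 and p_2; the exact expressions and the function name are our assumptions, not the patented formulas:

```python
def update_behavior_utility(E, action, last_actions, r_t, p1=0.9, p2=0.5):
    """Update the behavior utility value E[action] (assumed form, see lead-in).

    E is a per-action array, last_actions the list of previously executed actions,
    r_t the immediate reward, p1 the attenuation coefficient, p2 the exploration
    excitation coefficient.
    """
    # length of the run of identical actions ending with the current one
    run = 1
    for prev in reversed(last_actions):
        if prev == action:
            run += 1
        else:
            break
    if r_t > 0 and run >= 3:
        E[action] = p1 * E[action] + p2 * r_t        # assumed stronger boost
    elif r_t > 0 and run == 2:
        E[action] = p1 * E[action] + 0.5 * p2 * r_t  # assumed weaker boost
    else:
        E[action] = 0.0                              # documented: otherwise E is zero
    return E
```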
Wherein the distance function is as follows:
(The distance function is given only as an image in the original document and is not reproduced here.) In the formula, the two quantities are, respectively, the distances from the previous state and from the current state to the target point.
The improved ε exploration strategy is as follows: when a random value (between 0 and 1) is smaller than the greedy factor, the action with the highest execution probability is selected; when the random value is larger than the greedy factor, the Q values for moving from the current state to the next state are updated according to the execution probability of each action, and the action with the highest updated Q value is executed. The update formula is as follows:
T_Q = Q + β × (P_1, P_2, …, P_i), i ∈ (1, n)
where T_Q is the Q value after being updated with the action execution probabilities, β is the weight coefficient, and P_1, P_2, …, P_i are the probabilities with which each action is executed. The improved behavior selection strategy lets the agent, when selecting an action in each state, use the environment information it has already explored to perform multi-step exploration, comprehensively consider the distance information between the states before and after the move and the target point as well as the multi-step action information, and select the "best" action to execute.
step 7, updating the Q value table, updating the action execution probabilities, and updating the position state. The Q value table update formula is as follows:
Q(s,a) = Q(s,a) + α[R_t + γ·max_a' Q(s',a') − Q(s,a)]
where α is the learning rate (α ∈ [0,1]), γ is the discount factor (γ ∈ [0,1]), R_t is the immediate reward value, and s' and a' are the next state and the next action.
The action execution probabilities are then updated; the update formula is given only as an image in the original document and is not reproduced here. In the formula, n is the total number of executed actions and E is the behavior utility function.
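Since the probability update formula itself is only available as an image, the helper below simply normalizes the behavior utility values over the actions as a plausible stand-in; this normalization is our assumption, not the patented rule:

```python
import numpy as np

def update_action_probabilities(E):
    """Turn the per-action behavior utility values E into execution probabilities
    by shifting them to be non-negative and normalizing (stand-in rule)."""
    E = np.asarray(E, dtype=float)
    shifted = E - E.min()
    total = shifted.sum()
    if total == 0.0:
        return np.full(E.shape, 1.0 / E.size)   # fall back to a uniform distribution
    return shifted / total
```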
step 8, taking the position reached after executing the action as the current position; if the current position is the target position or the maximum number of steps has been exceeded, proceeding to step 9, otherwise returning to step 6;
and step nine, recording the path learned every time, judging whether the maximum iteration times is reached, if so, outputting the optimal path, and if not, jumping to the step five.
Fig. 2 and fig. 3 show the difference in path planning performance between the method of the present invention, the traditional Q-learning algorithm, and the existing Q-learning algorithm. In figs. 2 and 3, (a) corresponds to the traditional Q-learning algorithm, (b) to the existing Q-learning algorithm, and (c) to the method of the invention. It can be seen from the figures that, compared with the traditional algorithm and the existing Q-learning algorithm, the method of the present invention effectively improves path smoothness and accelerates algorithm convergence.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and so forth) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While the invention has been described with reference to a preferred embodiment, it will be understood by those skilled in the art that various changes may be made and equivalents may be substituted for elements thereof without departing from the scope of the invention. However, any simple modification, equivalent change and modification of the above embodiments according to the technical essence of the present invention are within the protection scope of the technical solution of the present invention.

Claims (6)

1. A mobile robot path planning method based on an improved Q-learning algorithm is characterized by comprising the following steps:
step 1, modeling an environment map by adopting a grid method, and establishing an obstacle matrix;
step 2, the mobile robot is treated as a particle in the two-dimensional environment and can only search and move in 4 directions: up, down, left and right; each grid cell has unit side length, and the robot moves one step length (one cell) per move;
step 3, designing a reward function and establishing a reward matrix R and a Q value table Q;
step 4, initializing the Q value table using the improved potential energy field function, and initializing the parameters of the algorithm, including the number of iterations, the exploration probability ε, the learning rate α, the reward attenuation factor γ, the behavior utility attenuation coefficient p_1, the exploration excitation coefficient p_2, and the weight coefficient β;
step 5, initializing a starting point and a target point;
step 6, starting exploration: selecting an action to execute according to the improved ε exploration strategy, obtaining the immediate reward value after executing the action, calculating the distance function, and updating the behavior utility function;
step 7, updating the Q value table and the action execution probabilities according to the executed action, wherein the Q value table update formula is:
Q(s,a) = Q(s,a) + α[R_t + γ·max_a' Q(s',a') − Q(s,a)]
where α is the learning rate (α ∈ [0,1]), γ is the discount factor (γ ∈ [0,1]), R_t is the immediate reward value, and s' and a' are the next state and the next action;
the update formula of the action execution probability is as follows:
(The formula is given only as an image in the original document and is not reproduced here.) In the formula, n is the total number of executed actions and E is the behavior utility function;
step 8, taking the position reached after executing the action as the current position; if the current position is the target position or the maximum number of steps has been exceeded, proceeding to step 9, otherwise returning to step 6;
step 9, recording the path learned in each episode and judging whether the maximum number of iterations has been reached; if so, outputting the optimal path, otherwise returning to step 5.
2. The method for planning the path of the mobile robot based on the improved Q-learning algorithm as claimed in claim 1, wherein:
the reward function designed in the step 3 is as follows:
r = −r_1 if the agent collides with an obstacle; r = r_2 if the agent reaches the target point; r = 0 otherwise,
where r_1 and r_2 are both positive numbers: the agent obtains the reward value −r_1 when it hits an obstacle, the reward value r_2 when it reaches the target point, and a reward value of 0 when it reaches any other location.
3. The method for planning the path of a mobile robot based on the improved Q-learning algorithm as claimed in claim 1, wherein:
the improved potential energy field function in the step 4 is as follows:
(The improved potential energy field function is given only as an image in the original document and is not reproduced here.) In the formula, C = L = X + Y, where X is the horizontal length of the environment and Y is the vertical length of the environment; d_1(s) and d_2(s) denote, respectively, the vertical and horizontal distances from the agent's current position to the target point; d(s) is the Euclidean distance from the agent's current position to the line connecting the start point and the target point;
the initialized Q value table function is:
Q(s,a)=R+γV(S)
where R is the initial reward function matrix, γ is the reward attenuation factor, and V(S) is the state-value function obtained by initializing all states with the attractive potential field function; in the Q value table initialized by this method, the Q values become larger the closer a state is to the target point, the target point has the largest Q value, and the Q value at an obstacle is 0.
4. The method for planning the path of a mobile robot based on the improved Q-learning algorithm as claimed in claim 1, wherein:
the improved ε exploration strategy in step 6 is as follows: when a random value (between 0 and 1) is smaller than the greedy factor, the action with the highest execution probability is selected; when the random value is larger than the greedy factor, the Q values for moving from the current state to the next state are updated according to the execution probability of each action, and the action with the highest updated Q value is executed; the update formula is:
T_Q = Q + β × (P_1, P_2, …, P_i), i ∈ (1, n)
where T_Q is the Q value after being updated with the action execution probabilities, β is the weight coefficient, and P_1, P_2, …, P_i are the probabilities with which each action is executed; the improved behavior selection strategy lets the agent, when selecting an action in each state, use the environment information it has already explored to perform multi-step exploration, comprehensively consider the distance information between the states before and after the move and the target point as well as the multi-step action information, and select the optimal action to execute;
the distance function in step 6 is:
(The distance function is given only as an image in the original document and is not reproduced here.) In the formula, the two quantities are, respectively, the distances from the previous state and from the current state to the target point;
the action utility function and its calculation rule are:
(The behavior utility function and its calculation rule are given only as an image in the original document and are not reproduced here.) In the formula, p_1 is the attenuation coefficient, p_2 is the exploration excitation coefficient, r_t is the immediate reward value, and a_i denotes the different actions; the E value of each action is updated according to the magnitude of the immediate reward and whether consecutively executed actions are the same: when the immediate reward is positive and the same action has been executed three times in succession, E is updated by a first formula (given as an image); when the immediate reward is positive and the same action has been executed twice in succession, E is updated by a second formula (given as an image); otherwise the value of E is zero.
5. The method as claimed in claim 1, wherein the Q-value table is initialized by introducing environment potential values as heuristic information.
6. The method for mobile robot path planning based on the improved Q-learning algorithm as claimed in claim 1, wherein a behavior utility function is used as a standard for evaluating the executed actions, so as to improve an epsilon greedy strategy, and the probability of each action being selected by the agent is dynamically adjusted in combination with the environmental information already explored by the agent and the influence of the executed actions on the smoothness of the path segment.
CN202211213330.XA 2022-09-29 Mobile robot path planning method based on improved Q-learning algorithm Active CN115542912B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211213330.XA CN115542912B (en) 2022-09-29 Mobile robot path planning method based on improved Q-learning algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211213330.XA CN115542912B (en) 2022-09-29 Mobile robot path planning method based on improved Q-learning algorithm

Publications (2)

Publication Number Publication Date
CN115542912A true CN115542912A (en) 2022-12-30
CN115542912B CN115542912B (en) 2024-06-07


Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116822765A (en) * 2023-06-02 2023-09-29 东南大学 Q-learning-based agent time sequence task path planning method

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102819264A (en) * 2012-07-30 2012-12-12 山东大学 Path planning Q-learning initial method of mobile robot
WO2017161632A1 (en) * 2016-03-24 2017-09-28 苏州大学张家港工业技术研究院 Cleaning robot optimal target path planning method based on model learning
CN112344944A (en) * 2020-11-24 2021-02-09 湖北汽车工业学院 Reinforced learning path planning method introducing artificial potential field

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102819264A (en) * 2012-07-30 2012-12-12 山东大学 Path planning Q-learning initial method of mobile robot
WO2017161632A1 (en) * 2016-03-24 2017-09-28 苏州大学张家港工业技术研究院 Cleaning robot optimal target path planning method based on model learning
CN112344944A (en) * 2020-11-24 2021-02-09 湖北汽车工业学院 Reinforced learning path planning method introducing artificial potential field

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
董培方; 张志安; 梅新虎; 朱朔: "Reinforcement learning path planning algorithm introducing potential field and trap search" (引入势场及陷阱搜索的强化学习路径规划算法), Computer Engineering and Applications (计算机工程与应用), no. 16, 8 September 2017 (2017-09-08), pages 135-140 *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116822765A (en) * 2023-06-02 2023-09-29 东南大学 Q-learning-based agent time sequence task path planning method

Similar Documents

Publication Publication Date Title
CN106990792B (en) Multi-unmanned aerial vehicle collaborative time sequence coupling task allocation method based on hybrid gravity search algorithm
CN113110592A (en) Unmanned aerial vehicle obstacle avoidance and path planning method
CN111260027B (en) Intelligent agent automatic decision-making method based on reinforcement learning
CN110442129B (en) Control method and system for multi-agent formation
CN109690576A (en) The training machine learning model in multiple machine learning tasks
CN109523029A (en) For the adaptive double from driving depth deterministic policy Gradient Reinforcement Learning method of training smart body
CN105893694A (en) Complex system designing method based on resampling particle swarm optimization algorithm
CN114460943B (en) Self-adaptive target navigation method and system for service robot
CN111008449A (en) Acceleration method for deep reinforcement learning deduction decision training in battlefield simulation environment
CN110135584A (en) Extensive Symbolic Regression method and system based on self-adaptive parallel genetic algorithm
JP2014502393A (en) Determination method and determination apparatus
CN111352419B (en) Path planning method and system for updating experience playback cache based on time sequence difference
CN113561986A (en) Decision-making method and device for automatically driving automobile
CN111950735A (en) Reinforced learning method based on bidirectional model
CN113268854A (en) Reinforced learning method and system for double evaluators and single actuator
CN114815891A (en) PER-IDQN-based multi-unmanned aerial vehicle enclosure capture tactical method
CN113110101B (en) Production line mobile robot gathering type recovery and warehousing simulation method and system
CN106650028A (en) Optimization method and system based on agile satellite design parameters
Gao et al. An adaptive framework to select the coordinate systems for evolutionary algorithms
CN115542912A (en) Mobile robot path planning method based on improved Q-learning algorithm
CN115542912B (en) Mobile robot path planning method based on improved Q-learning algorithm
CN116339349A (en) Path planning method, path planning device, electronic equipment and storage medium
CN116382299A (en) Path planning method, path planning device, electronic equipment and storage medium
CN114861368B (en) Construction method of railway longitudinal section design learning model based on near-end strategy
CN116090549A (en) Knowledge-driven multi-agent reinforcement learning decision-making method, system and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant