CN115542912A - Mobile robot path planning method based on improved Q-learning algorithm
- Publication number
- CN115542912A (application CN202211213330.XA)
- Authority
- CN
- China
- Prior art keywords: value, action, improved, reward, function
- Prior art date: 2022-09-29
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G05—CONTROLLING; REGULATING
- G05D—SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
- G05D1/00—Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
- G05D1/02—Control of position or course in two dimensions
- G05D1/021—Control of position or course in two dimensions specially adapted to land vehicles
- G05D1/0212—Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory
- G05D1/0214—Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory in accordance with safety or protection criteria, e.g. avoiding hazardous areas
-
- G—PHYSICS
- G05—CONTROLLING; REGULATING
- G05D—SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
- G05D1/00—Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
- G05D1/02—Control of position or course in two dimensions
- G05D1/021—Control of position or course in two dimensions specially adapted to land vehicles
- G05D1/0212—Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory
- G05D1/0221—Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory involving a learning process
Abstract
The invention relates to a mobile robot path planning method based on an improved Q-learning algorithm, which comprises the following steps: (1) a new potential energy field function is designed by combining the artificial potential field principle with the simulation environment, and the environmental potential energy value is introduced as heuristic information to initialize the Q-value table, so that the potential energy value grows as the target point is approached; this guides the agent to search toward the target direction from the early stage, accelerates algorithm convergence, improves planning efficiency, and reduces the blindness of the early exploration of the Q-learning algorithm; (2) a behavior utility function is added to the ε-greedy strategy of the traditional Q-learning algorithm, actions are evaluated according to the path segments produced after they are executed, and the probability of each of the agent's actions being selected is dynamically adjusted, which improves search efficiency and path smoothness. By applying this technical scheme, the shortest path can be obtained while the convergence speed of the algorithm and the smoothness of the path are improved.
Description
Technical Field
The invention relates to the technical field of robot navigation planning, in particular to a mobile robot path planning method based on an improved Q-learning algorithm.
Background
With the emergence of the goods-to-person picking mode, mobile robots have been widely applied in intelligent warehouses, and their introduction has improved warehouse picking efficiency; path planning, as one of the core technologies of the mobile robot, has accordingly attracted more and more attention. Path planning means planning a collision-free optimal path according to the environment in which the mobile robot operates, combined with evaluation criteria such as the shortest path, the shortest planning time, and path smoothness.
Path planning originated in the 1960s; commonly used methods include Dijkstra's algorithm, the A* algorithm, and the artificial potential field method, as well as heuristic intelligent search methods such as the ant colony algorithm and the particle swarm algorithm. However, the traditional methods are complex to operate and solve problems inefficiently, and the heuristic algorithms are difficult to design and understand. With the progress of reinforcement learning in recent years, some researchers have begun to apply reinforcement learning to path planning. The most widely used reinforcement learning algorithm in mobile robot path planning is the Q-learning algorithm. Q-learning is a temporal-difference reinforcement learning algorithm whose process is as follows: in state s, the mobile robot first selects and executes an action a among all possible actions, and then evaluates the consequence of that action according to the immediate reward received for action a and the estimate of the current state-action value. By repeating all actions in all states, the mobile robot can learn the overall best behavior by judging the long-term discounted return. As a model-free learning method, the traditional Q-learning algorithm enables the mobile robot to plan a good collision-free path through its learning mechanism and real-time interaction with the environment, needs no environment model, and performs excellently in complex environments. However, the method still suffers from problems such as highly blind early exploration, long learning time, slow convergence, and poor path smoothness.
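For illustration, the following Python sketch shows the standard tabular Q-learning update and ε-greedy selection described above; it is not part of the claimed method, and all identifiers and parameter values (grid size, α, γ, ε) are assumptions rather than values taken from the invention.

```python
import numpy as np

# Minimal sketch of the tabular Q-learning loop described above. All names
# and parameter values are illustrative assumptions, not the patent's.
n_states, n_actions = 100, 4              # e.g. a 10x10 grid with 4 moves
alpha, gamma, epsilon = 0.1, 0.9, 0.1     # learning rate, discount, exploration
Q = np.zeros((n_states, n_actions))
rng = np.random.default_rng(0)

def select_action(s):
    """Classic epsilon-greedy: explore at random with probability epsilon."""
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))
    return int(np.argmax(Q[s]))

def q_update(s, a, r, s_next):
    """Temporal-difference update: move Q(s,a) toward r + gamma*max Q(s',.)."""
    Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
```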
Disclosure of Invention
In view of the above, an object of the present invention is to provide a mobile robot path planning method based on an improved Q-learning algorithm, which can improve the convergence speed of the algorithm and the smoothness of the path while obtaining the shortest path.
In order to achieve the purpose, the invention adopts the following technical scheme: a mobile robot path planning method based on an improved Q-learning algorithm comprises the following steps:
step 1, modeling an environment map by adopting a grid method, and establishing an obstacle matrix;
step 2, the mobile robot is treated as a particle in a two-dimensional environment and can only search and move in 4 directions: up, down, left, and right; the side length of each grid is one unit, and the single-step moving distance of the mobile robot is 1 step length;
step 3, designing a reward function and establishing a reward matrix R and a Q value table Q;
step 4, initializing the Q-value table using the improved potential energy field function, and initializing the parameters of the algorithm, including the number of iterations, the exploration probability ε, the learning rate α, the reward attenuation factor γ, the behavior utility attenuation coefficient p_1, the exploration excitation coefficient p_2, and the weight coefficient β;
step 5, initializing a starting point and a target point;
step 6, starting exploration: selecting an action to execute according to the improved ε exploration strategy, obtaining the immediate reward value after executing the action, calculating the distance function, and updating the behavior utility function;
step 7, updating the Q-value table and the action execution probabilities according to the executed action, wherein the Q-value table update formula is:
Q(s,a) = Q(s,a) + α[R_t + γ max_a' Q(s',a') - Q(s,a)]
in the formula, α denotes the learning rate, α ∈ [0,1]; γ denotes the discount factor, γ ∈ [0,1]; R_t is the immediate reward value; and s' and a' are the next state and the next action;
the update formula of the action execution probability is as follows:
step 8, updating the position state after the action is executed to be the current position state, and judging whether the current position is the end position or whether the maximum step length is exceeded; if neither, jumping to step 6; otherwise, entering step 9;
step 9, recording the path of each learning episode and judging whether the maximum number of iterations has been reached; if so, outputting the optimal path; if not, jumping to step 5.
In a preferred embodiment, the reward function designed in step 3 is:
R = -r_1 when the agent hits an obstacle, R = r_2 when the agent reaches the target point, and R = 0 elsewhere;
in the formula, r_1 and r_2 are both positive numbers: when the agent hits an obstacle, it obtains the reward value -r_1; when the agent reaches the target point, it obtains the reward value r_2; and the reward value obtained when the agent arrives elsewhere is 0.
In a preferred embodiment, the potential energy field function modified in step 4 is:
wherein C = L = X + Y, X being the horizontal length of the environment and Y the vertical length of the environment; d_1(s) and d_2(s) respectively denote the vertical distance and the horizontal distance between the agent's current position and the target point; and d(s) is the Euclidean distance from the agent's current position to the line connecting the target point and the starting point;
the initialized Q value table function is:
Q(s,a)=R+γV(S)
wherein R is the initial reward function matrix, γ is the reward attenuation factor, and V(S) is the value function that initializes all states through the gravitational field function; in the Q-value table initialized by this method, the closer to the target point, the larger the Q value: the target point has the maximum Q value, and the Q value at an obstacle is 0.
In a preferred embodiment, the improved ε search strategy in step 6 is as follows: when the random value is smaller than the greedy factor, the action with the highest execution probability is selected; when the random value is larger than the greedy factor, the Q values for transferring from the current state to the next state are updated according to the probability of each action being executed, and the action with the highest Q value is selected for execution; the random value lies between 0 and 1, and the update formula is:
T_Q = Q + β × (P_1, P_2, …, P_i), i ∈ (1, n)
in the formula, T_Q is the Q value after being updated according to the action execution probabilities, β is the weight coefficient, and P_1, P_2, …, P_i are the probabilities of each action being executed; the improved behavior selection strategy enables the agent, when selecting an action to execute in each state, to use the already-explored environment information for multi-step exploration, to comprehensively consider the distance information between the agent's previous and current states and the target point together with the multi-step action information, and to select the optimal action to execute;
the distance function in step 6 is:
in the formula, the two distances respectively denote the distance from the previous state to the target point and the distance from the current state to the target point;
the action utility function and the calculation rule thereof are as follows:
in the formula, p_1 is the attenuation coefficient, p_2 is the exploration excitation coefficient, r_t is the immediate reward value, and a_i denotes the different actions; the E values of the different actions are updated according to the magnitude of the immediate reward value and whether consecutively executed actions are the same: when the immediate reward value is positive and the same action has been executed three times in succession, E takes the larger incentive value; when the immediate reward value is positive and the same action has been executed twice in succession, E takes the smaller incentive value; otherwise the value of E is zero.
In a preferred embodiment, the Q value table is initialized by introducing the environment potential value as the heuristic information.
In a preferred embodiment, the behavior utility function is used as the criterion for evaluating executed actions to improve the ε-greedy strategy, and the probability of each of the agent's actions being selected is dynamically adjusted by combining the environment information already explored by the agent with the influence of the executed action on path-segment smoothness.
Compared with the prior art, the invention has the following beneficial effects: the invention provides a mobile robot path planning method based on an improved Q-learning algorithm, improving on the defects of the traditional Q-learning algorithm such as highly blind exploration, long learning time, slow convergence, and poor path smoothness. It introduces a potential energy field function and a behavior utility function: the former guides the agent to explore toward the target direction in the early stage; the latter lets the agent, when selecting an action to execute in each state, use the already-explored environment information, comprehensively consider the distance information between the agent's previous and current states and the target point together with the multi-step action information, and select the optimal action to execute. This improves the operating efficiency of the algorithm, accelerates its convergence, and improves the smoothness of the planned path.
Drawings
FIG. 1 is a flow chart of a method implementation of the preferred embodiment of the present invention.
Fig. 2 is a graph of the convergence of a conventional method, a prior art method, and the present method in a preferred embodiment of the present invention.
Fig. 3 is a diagram comparing the path search effects of the traditional method, the prior-art method, and the present method in the preferred embodiment of the present invention.
Detailed Description
The invention is further explained below with reference to the drawings and the embodiments.
It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the disclosure. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit example embodiments according to the present application. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise; it should further be understood that the terms "comprises" and/or "comprising", when used in this specification, specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof.
The designed mobile robot path planning method based on an improved Q-learning algorithm improves on the traditional algorithm around the Q-value table and the ε exploration strategy. A potential energy field function is introduced as heuristic information to initialize the Q-value table, guiding the agent to explore toward the target direction in the early stage and alleviating the blindness of the algorithm's early exploration; meanwhile, a behavior utility function is introduced to improve the ε exploration strategy, and the probability of each of the agent's actions being selected is dynamically adjusted by combining the environment information already explored by the agent with the influence of the executed action on path-segment smoothness.
As shown in fig. 1, the method of the embodiment of the present invention specifically includes the following steps:
step 1, modeling an environment map by adopting a grid method, and establishing an obstacle matrix;
Step 2, the mobile robot is treated as a particle in a two-dimensional environment and can only search and move in 4 directions: up, down, left, and right; the side length of each grid is one unit, and the single-step moving distance of the mobile robot is 1 step length;
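By way of illustration only, the sketch below models steps 1 and 2: a grid map with an obstacle matrix and the four admissible single-cell moves. The map size and obstacle layout are invented placeholders, not the embodiment's test environment.

```python
import numpy as np

# Illustrative sketch of the grid-method environment of steps 1-2. The map
# size and obstacle layout are invented examples, not the patent's test map.
X, Y = 10, 10                              # horizontal and vertical lengths
obstacles = np.zeros((Y, X), dtype=bool)   # obstacle matrix from step 1
obstacles[3, 2:6] = True                   # an arbitrary wall of blocked cells

# 4 admissible moves: up, down, left, right (one grid cell per step)
ACTIONS = {0: (-1, 0), 1: (1, 0), 2: (0, -1), 3: (0, 1)}

def step(pos, action):
    """Move the particle one cell; stay put if the move leaves the map."""
    dy, dx = ACTIONS[action]
    y, x = pos[0] + dy, pos[1] + dx
    return (y, x) if 0 <= y < Y and 0 <= x < X else pos
```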
Step 3, designing the reward function and establishing the reward matrix R and the Q-value table Q, wherein the reward function is:
R = -r_1 when the agent hits an obstacle, R = r_2 when the agent reaches the target point, and R = 0 elsewhere;
in the formula, r_1 and r_2 are both positive numbers: when the agent hits an obstacle, it obtains the reward value -r_1; when the agent reaches the target point, it obtains the reward value r_2; and the reward value obtained when the agent arrives elsewhere is 0.
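A direct transcription of this reward design might look as follows; the concrete values of r_1 and r_2 are placeholders, since the text only requires them to be positive.

```python
def reward(pos, goal, obstacles, r1=10.0, r2=100.0):
    """Reward function of step 3: -r1 on collision, r2 at the target, 0
    elsewhere. The values 10 and 100 are placeholders; the patent only
    requires r1 and r2 to be positive."""
    if obstacles[pos]:
        return -r1
    if pos == goal:
        return r2
    return 0.0
```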
Step 4, initializing the Q-value table using the improved potential energy field function, and initializing the parameters of the algorithm, including the number of iterations, the exploration probability ε, the learning rate α, the reward attenuation factor γ, the behavior utility attenuation coefficient p_1, the exploration excitation coefficient p_2, the weight coefficient β, etc. A traditional Q-value table is generally initialized to 0 or to identical values, which makes early exploration highly blind and convergence slow; the invention therefore designs a new potential energy field function by combining the artificial potential field principle, and introduces it as heuristic information. The potential energy field function is:
wherein C = L = X + Y, X being the horizontal length of the environment and Y the vertical length of the environment; d_1(s) and d_2(s) respectively denote the vertical distance and the horizontal distance between the agent's current position and the target point; and d(s) is the Euclidean distance from the agent's current position to the line connecting the target point and the starting point.
The initialized Q-value table function is:
Q(s,a)=R+γV(S)
where R is the initial reward function matrix, γ is the reward attenuation factor, and V(S) is the value function that initializes all states through the gravitational field function. In the Q-value table initialized by this method, the closer to the target point, the larger the Q value; the target point has the maximum Q value, and the Q value at an obstacle is 0. The agent is thereby drawn to explore toward the target point in the early stage, which accelerates algorithm convergence and improves planning efficiency.
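The heuristic initialization can be sketched as follows. The exact potential energy field function appears in the original only as a displayed formula that is not reproduced here, so the potential below is a simple distance-based stand-in that merely satisfies the stated property (larger values closer to the target); only the relation Q(s,a) = R + γV(S) is taken from the text.

```python
import numpy as np

def potential(pos, goal, C):
    """Stand-in potential value: grows as the agent nears the target. The
    patent's exact field function (using d1(s), d2(s), d(s) and C = X + Y)
    is not reproduced here, so this form is only an assumption that
    satisfies the stated property."""
    d1 = abs(goal[0] - pos[0])             # vertical distance to target
    d2 = abs(goal[1] - pos[1])             # horizontal distance to target
    return C - (d1 + d2)

def init_q_table(R, V, gamma):
    """Q(s,a) = R + gamma * V(S): heuristic initialization of the Q table.
    R is the (n_states, n_actions) reward matrix, V the per-state potential."""
    return R + gamma * V[:, None]          # broadcast state values over actions
```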
Step 5, initializing a starting point and a target point;
Step 6, starting exploration: an action to execute is selected according to the improved ε exploration strategy, the immediate reward value is obtained after executing the action, the distance function is calculated, and the behavior utility function is updated.
Under the traditional ε exploration strategy, action selection is highly random, which makes algorithm convergence slow and path smoothness poor; the concept of a behavior utility equation is therefore introduced. The behavior utility equation is designed to evaluate how well actions are executed and to dynamically adjust the probability of each of the agent's actions being selected. The behavior utility equation is:
in the formula, p_1 is the attenuation coefficient, p_2 is the exploration excitation coefficient, r_t is the immediate reward value, and a_i denotes the different actions; the E values of the different actions are updated according to the magnitude of the immediate reward value and whether consecutively executed actions are the same: when the immediate reward value is positive and the same action has been executed three times in succession, E takes the larger incentive value; when the immediate reward value is positive and the same action has been executed twice in succession, E takes the smaller incentive value; otherwise the value of E is zero.
Wherein the distance function is as follows:
in the formula, the two distances respectively denote the distance from the previous state to the target point and the distance from the current state to the target point.
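One plausible reading of the behavior utility and distance computations is sketched below. Because the exact assignment expressions are not reproduced in the text, the update formulas (the p_1 decay, the p_2·r_t term, and the doubled bonus for a three-in-a-row repeat) are assumptions, not the embodiment's actual equations.

```python
def update_utility(E, a, r_t, run_length, p1=0.9, p2=1.0):
    """One plausible reading of the behavior utility update of step 6; the
    patent gives the exact assignments only as images, so the expressions
    below are assumptions. A positive reward combined with a repeated action
    raises E, favoring smooth (straight) path segments."""
    if r_t > 0 and run_length >= 3:        # same action three times in a row
        E[a] = p1 * E[a] + 2.0 * p2 * r_t
    elif r_t > 0 and run_length == 2:      # same action twice in a row
        E[a] = p1 * E[a] + p2 * r_t
    else:
        E[a] = 0.0
    return E

def distance_gain(d_prev, d_curr):
    """Distance function of step 6: positive when the executed action moved
    the agent closer to the target point."""
    return d_prev - d_curr
```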
The improved ε exploration strategy is as follows: when the random value (between 0 and 1) is smaller than the greedy factor, the action with the highest execution probability is selected; when the random value is larger than the greedy factor, the Q values for transferring from the current state to the next state are updated according to the probability of each action being executed, and the action with the highest Q value is selected for execution. The update formula is as follows:
T_Q = Q + β × (P_1, P_2, …, P_i), i ∈ (1, n)
in the formula: t is Q Is the Q value after performing probability update according to the new action, beta is the weight coefficient, P 1 ,P 2 ,P i Is the probability that each action is performed. The improved behavior selection strategy enables the intelligent agent to perform multi-step exploration by utilizing the explored environment information when the intelligent agent selects to perform actions in each state, comprehensively considers the distance information between the front and back states of the intelligent agent and a target point and the multi-step execution action information, and selects the 'optimal' action to perform.
Step 7, updating the Q-value table, the action execution probabilities, and the position state. The Q-value table update formula is as follows:
Q(s,a) = Q(s,a) + α[R_t + γ max_a' Q(s',a') - Q(s,a)]
where α denotes the learning rate, α ∈ [0,1]; γ denotes the discount factor, γ ∈ [0,1]; R_t is the immediate reward value; and s' and a' are the next state and the next action.
The updated formula of the action execution probability is as follows:
in the formula, n is the total number of executed actions, and E is the behavior utility function.
Step eight, updating the position state after the action is executed into a current position state, judging whether the current position is an end position and judging whether the maximum step length is exceeded, otherwise, jumping to the step six, and entering the step nine;
Step 9, recording the path learned in each iteration and judging whether the maximum number of iterations has been reached; if so, output the optimal path; if not, jump to step 5.
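Putting steps 5 through 9 together, a schematic planning loop could look as follows; the select and execute callbacks stand in for the improved ε strategy and the combined reward, Q-value, utility, and probability updates described above, and the episode and step budgets are placeholders.

```python
# End-to-end sketch of steps 5-9: episodes run until the target is reached or
# the step budget is exhausted, and the shortest successful path is kept. The
# select/execute callbacks and all budget values are illustrative assumptions.
def plan(start, goal, select, execute, max_episodes=500, max_steps=400):
    best_path = None
    for _ in range(max_episodes):          # step 9: iterate until the limit
        pos, path = start, [start]
        for _ in range(max_steps):         # step 8: bound the episode length
            a = select(pos)                # step 6: improved epsilon strategy
            pos = execute(pos, a)          # step 7: also updates Q, E and P
            path.append(pos)
            if pos == goal:
                break
        if pos == goal and (best_path is None or len(path) < len(best_path)):
            best_path = path               # record the shortest path found
    return best_path
```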
Fig. 2 and Fig. 3 show the differences in path planning effect between the method of the present invention, the traditional Q-learning algorithm, and an existing Q-learning algorithm. In Figs. 2 and 3, (a) shows the traditional Q-learning algorithm, (b) shows the existing Q-learning algorithm, and (c) shows the method of the present invention. It can be seen from the figures that, compared with the traditional algorithm and the existing Q-learning algorithm, the method of the present invention effectively improves path smoothness and accelerates algorithm convergence.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and so forth) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While the invention has been described with reference to a preferred embodiment, it will be understood by those skilled in the art that various changes may be made and equivalents may be substituted for elements thereof without departing from the scope of the invention. However, any simple modification, equivalent change and modification of the above embodiments according to the technical essence of the present invention are within the protection scope of the technical solution of the present invention.
Claims (6)
1. A mobile robot path planning method based on an improved Q-learning algorithm is characterized by comprising the following steps:
step 1, modeling an environment map by adopting a grid method, and establishing an obstacle matrix;
step 2, the mobile robot is treated as a particle in a two-dimensional environment and can only search and move in 4 directions: up, down, left, and right; the side length of each grid is one unit, and the single-step moving distance of the mobile robot is 1 step length;
step 3, designing a reward function and establishing a reward matrix R and a Q value table Q;
step 4, initializing the Q-value table using the improved potential energy field function, and initializing the parameters of the algorithm, including the number of iterations, the exploration probability ε, the learning rate α, the reward attenuation factor γ, the behavior utility attenuation coefficient p_1, the exploration excitation coefficient p_2, and the weight coefficient β;
step 5, initializing a starting point and a target point;
step 6, starting exploration: selecting an action to execute according to the improved ε exploration strategy, obtaining the immediate reward value after executing the action, calculating the distance function, and updating the behavior utility function;
step 7, updating the Q-value table and the action execution probabilities according to the executed action, wherein the Q-value table update formula is:
Q(s,a) = Q(s,a) + α[R_t + γ max_a' Q(s',a') - Q(s,a)]
in the formula, α denotes the learning rate, α ∈ [0,1]; γ denotes the discount factor, γ ∈ [0,1]; R_t is the immediate reward value; and s' and a' are the next state and the next action;
the update formula of the action execution probability is as follows:
step 8, updating the position state after the action is executed to be the current position state, and judging whether the current position is the end position or whether the maximum step length is exceeded; if neither, jumping to step 6; otherwise, entering step 9;
step 9, recording the path of each learning episode and judging whether the maximum number of iterations has been reached; if so, outputting the optimal path; if not, jumping to step 5.
2. The method for planning the path of the mobile robot based on the improved Q-learning algorithm as claimed in claim 1, wherein:
the reward function designed in the step 3 is as follows:
R = -r_1 when the agent hits an obstacle, R = r_2 when the agent reaches the target point, and R = 0 elsewhere;
in the formula, r_1 and r_2 are both positive numbers: when the agent hits an obstacle, it obtains the reward value -r_1; when the agent reaches the target point, it obtains the reward value r_2; and the reward value obtained when the agent arrives elsewhere is 0.
3. The method for planning the path of a mobile robot based on the improved Q-learning algorithm as claimed in claim 1, wherein:
the improved potential energy field function in the step 4 is as follows:
wherein C = L = X + Y, X being the horizontal length of the environment and Y the vertical length of the environment; d_1(s) and d_2(s) respectively denote the vertical distance and the horizontal distance between the agent's current position and the target point; and d(s) is the Euclidean distance from the agent's current position to the line connecting the target point and the starting point;
the initialized Q value table function is:
Q(s,a)=R+γV(S)
wherein R is the initial reward function matrix, γ is the reward attenuation factor, and V(S) is the value function that initializes all states through the gravitational field function; in the Q-value table initialized by this method, the closer to the target point, the larger the Q value: the target point has the maximum Q value, and the Q value at an obstacle is 0.
4. The method for planning the path of a mobile robot based on the improved Q-learning algorithm as claimed in claim 1, wherein:
the improved ε exploration strategy in step 6 is as follows: when the random value is smaller than the greedy factor, the action with the highest execution probability is selected; when the random value is larger than the greedy factor, the Q values for transferring from the current state to the next state are updated according to the probability of each action being executed, and the action with the highest Q value is selected for execution; the random value lies between 0 and 1, and the update formula is:
T_Q = Q + β × (P_1, P_2, …, P_i), i ∈ (1, n)
in the formula: t is Q Is the Q value after performing probability update according to the new action, beta is the weight coefficient, P 1 ,P 2 ,P i Is the probability that each action was performed; the improved behavior selection strategy enables the intelligent agent to perform multi-step exploration by utilizing the explored environment information when selecting to perform actions in each state, comprehensively considers the distance information between the front and back states of the intelligent agent and a target point and the multi-step execution action information, and selects the optimal action to perform;
the distance function in step 6 is:
in the formula, the two distances respectively denote the distance from the previous state to the target point and the distance from the current state to the target point;
the action utility function and its calculation rule are:
in the formula, p_1 is the attenuation coefficient, p_2 is the exploration excitation coefficient, r_t is the immediate reward value, and a_i denotes the different actions; the E values of the different actions are updated according to the magnitude of the immediate reward value and whether consecutively executed actions are the same: when the immediate reward value is positive and the same action has been executed three times in succession, E takes the larger incentive value; when the immediate reward value is positive and the same action has been executed twice in succession, E takes the smaller incentive value; otherwise the value of E is zero.
5. The method as claimed in claim 1, wherein the Q-value table is initialized by introducing environment potential values as heuristic information.
6. The method for mobile robot path planning based on the improved Q-learning algorithm as claimed in claim 1, wherein the behavior utility function is used as the criterion for evaluating executed actions to improve the ε-greedy strategy, and the probability of each of the agent's actions being selected is dynamically adjusted in combination with the environment information already explored by the agent and the influence of the executed actions on path-segment smoothness.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211213330.XA (granted as CN115542912B) | 2022-09-29 | | Mobile robot path planning method based on improved Q-learning algorithm
Publications (2)
Publication Number | Publication Date |
---|---|
CN115542912A true CN115542912A (en) | 2022-12-30 |
CN115542912B CN115542912B (en) | 2024-06-07 |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116822765A (en) * | 2023-06-02 | 2023-09-29 | 东南大学 | Q-learning-based agent time sequence task path planning method |
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102819264A (en) * | 2012-07-30 | 2012-12-12 | 山东大学 | Path planning Q-learning initial method of mobile robot |
WO2017161632A1 (en) * | 2016-03-24 | 2017-09-28 | 苏州大学张家港工业技术研究院 | Cleaning robot optimal target path planning method based on model learning |
CN112344944A (en) * | 2020-11-24 | 2021-02-09 | 湖北汽车工业学院 | Reinforced learning path planning method introducing artificial potential field |
Non-Patent Citations (1)
Title |
---|
DONG Peifang; ZHANG Zhi'an; MEI Xinhu; ZHU Shuo: "Reinforcement learning path planning algorithm introducing potential field and trap search", Computer Engineering and Applications, No. 16, 8 September 2017 (2017-09-08), pages 135-140 *
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106990792B (en) | Multi-unmanned aerial vehicle collaborative time sequence coupling task allocation method based on hybrid gravity search algorithm | |
CN113110592A (en) | Unmanned aerial vehicle obstacle avoidance and path planning method | |
CN111260027B (en) | Intelligent agent automatic decision-making method based on reinforcement learning | |
CN110442129B (en) | Control method and system for multi-agent formation | |
CN109690576A (en) | The training machine learning model in multiple machine learning tasks | |
CN109523029A (en) | For the adaptive double from driving depth deterministic policy Gradient Reinforcement Learning method of training smart body | |
CN105893694A (en) | Complex system designing method based on resampling particle swarm optimization algorithm | |
CN114460943B (en) | Self-adaptive target navigation method and system for service robot | |
CN111008449A (en) | Acceleration method for deep reinforcement learning deduction decision training in battlefield simulation environment | |
CN110135584A (en) | Extensive Symbolic Regression method and system based on self-adaptive parallel genetic algorithm | |
JP2014502393A (en) | Determination method and determination apparatus | |
CN111352419B (en) | Path planning method and system for updating experience playback cache based on time sequence difference | |
CN113561986A (en) | Decision-making method and device for automatically driving automobile | |
CN111950735A (en) | Reinforced learning method based on bidirectional model | |
CN113268854A (en) | Reinforced learning method and system for double evaluators and single actuator | |
CN114815891A (en) | PER-IDQN-based multi-unmanned aerial vehicle enclosure capture tactical method | |
CN113110101B (en) | Production line mobile robot gathering type recovery and warehousing simulation method and system | |
CN106650028A (en) | Optimization method and system based on agile satellite design parameters | |
Gao et al. | An adaptive framework to select the coordinate systems for evolutionary algorithms | |
CN115542912A (en) | Mobile robot path planning method based on improved Q-learning algorithm | |
CN115542912B (en) | Mobile robot path planning method based on improved Q-learning algorithm | |
CN116339349A (en) | Path planning method, path planning device, electronic equipment and storage medium | |
CN116382299A (en) | Path planning method, path planning device, electronic equipment and storage medium | |
CN114861368B (en) | Construction method of railway longitudinal section design learning model based on near-end strategy | |
CN116090549A (en) | Knowledge-driven multi-agent reinforcement learning decision-making method, system and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant |