CN113467481A - Path planning method based on improved Sarsa algorithm - Google Patents

Path planning method based on improved Sarsa algorithm

Info

Publication number
CN113467481A
Authority
CN
China
Prior art keywords
path planning
action
agent
path
improved
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110918358.2A
Other languages
Chinese (zh)
Other versions
CN113467481B (en)
Inventor
徐丽
娄茹珍
许逸茗
申林山
钱婧捷
贾我欢
闫鑫
张立国
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin Engineering University
Original Assignee
Harbin Engineering University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin Engineering University filed Critical Harbin Engineering University
Priority to CN202110918358.2A priority Critical patent/CN113467481B/en
Publication of CN113467481A publication Critical patent/CN113467481A/en
Application granted granted Critical
Publication of CN113467481B publication Critical patent/CN113467481B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G05 CONTROLLING; REGULATING
    • G05D SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D1/00 Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
    • G05D1/02 Control of position or course in two dimensions
    • G05D1/021 Control of position or course in two dimensions specially adapted to land vehicles
    • G05D1/0212 Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory
    • G05D1/0223 Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory involving speed control of the vehicle
    • G PHYSICS
    • G05 CONTROLLING; REGULATING
    • G05D SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D1/00 Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
    • G05D1/02 Control of position or course in two dimensions
    • G05D1/021 Control of position or course in two dimensions specially adapted to land vehicles
    • G05D1/0212 Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory
    • G05D1/0221 Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory involving a learning process
    • G PHYSICS
    • G05 CONTROLLING; REGULATING
    • G05D SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D1/00 Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
    • G05D1/02 Control of position or course in two dimensions
    • G05D1/021 Control of position or course in two dimensions specially adapted to land vehicles
    • G05D1/0276 Control of position or course in two dimensions specially adapted to land vehicles using signals provided by a source external to the vehicle

Landscapes

  • Engineering & Computer Science (AREA)
  • Aviation & Aerospace Engineering (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Remote Sensing (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Automation & Control Theory (AREA)
  • Feedback Control In General (AREA)

Abstract

A path planning method based on an improved Sarsa algorithm, belonging to the field of reinforcement learning and path planning. The method aims to solve the problems of slow convergence and low planning efficiency in path planning based on the traditional Sarsa algorithm. A map model is established for the area to be path-planned, a path matrix P(s, a) is introduced, and the greedy factor ε is dynamically adjusted during the agent's search; an ε-greedy strategy is adopted for action selection, and after the agent takes an action a, the environment feeds back a reward R and the agent enters the next state s′. The Q value table is then updated based on the path matrix, thereby realizing path planning based on the improved Sarsa algorithm. The method is mainly used for robot path planning.

Description

Path planning method based on improved Sarsa algorithm
Technical Field
The invention belongs to the field of reinforcement learning and path planning, and particularly relates to a path planning method based on reinforcement learning.
Background
With the development of technologies such as artificial intelligence and big data, intelligent robots are becoming increasingly important in people's daily life. An intelligent robot can autonomously explore, plan paths, and avoid obstacles, and can continuously learn from the environment until it has a complete and clear grasp of it. Path planning for intelligent robots is therefore increasingly important and is a research topic worth deep investigation. Traditional path planning methods include the artificial potential field method, simulated annealing, the rapidly-exploring random tree method, and fuzzy logic; these classical methods suffer from problems such as failing to reach the target and falling into local optima. They were followed by the A* algorithm, which has low search efficiency and is difficult to apply in practice. More recent path planning methods based on artificial intelligence include the genetic algorithm, particle swarm optimization, the ant colony algorithm, and neural network algorithms; these are more intelligent and search more efficiently, but converge slowly. In recent years, reinforcement learning for path planning has become a popular research field.
In reinforcement learning, an agent learns by trial and error through interaction with the environment and has the ability to adapt and explore autonomously. However, path planning in an unknown environment with a traditional reinforcement learning algorithm has problems: when the agent is placed in a completely unknown environment, its early exploration is blind and requires continuous trial and error, so training takes too long and convergence is slow. Moreover, in a more complex unknown environment the state dimension keeps growing and the number of training parameters grows exponentially, consuming a great deal of training time and storage space and ultimately causing the curse of dimensionality. Reinforcement learning algorithms currently applied to path planning include Q-Learning, Sarsa, PPO, DDPG, and DQN. Among them, the Sarsa algorithm is a classic on-policy (online) reinforcement learning method with the following problems: because it uses single-step updates, useless Q values remain in the Q table, causing invalid iterations, overly long learning time, training strategies that easily fail, and slow convergence.
Disclosure of Invention
The method aims to solve the problems of low planning convergence speed and low planning efficiency in the path planning process based on the traditional Sarsa algorithm.
A path planning method based on an improved Sarsa algorithm comprises the following steps:
establishing a map model aiming at an area to be subjected to path planning, namely establishing a two-dimensional simulation environment on a coordinate axis, and setting a trap and a target position on the map according to an actual environment;
the coordinate of the agent in the environment is (x, y), corresponding to the state s of the agent; the action space of the agent in the map comprises four actions, namely up, down, left and right, and a Q value table is established from the environment coordinates and the actions; the agent realizes path planning based on the Sarsa algorithm;
the method is characterized in that the process of realizing path planning by the agent based on the Sarsa algorithm comprises the following steps:
s1, initializing a Q value table, a state S, a path matrix P (S, a) and a greedy factor epsilon; the path matrix P (s, a) is used for saving the state s and the action a of the path taken by the agent in the current round in the environment;
s2, the agent starts to explore, and the following steps are executed for each iteration:
in the state s, the intelligent agent selects actions by adopting an epsilon-greedy strategy; after the agent takes action a, the environment will feed back a reward R and enter the next state s'; updating the Q value table:
Q(s,a)←Q(s,a)+α[R+γQ(s′,a′)-Q(s,a)][P(s,a)]
where α is the learning rate, s′ and a′ denote the state and action of the next step, and γ denotes the discount factor.
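For illustration only, a minimal sketch of this update rule follows, assuming the Q value table and the path matrix are stored as Python dictionaries keyed by (state, action) pairs; the function name and storage layout are assumptions, not taken from the patent text:

```python
# Hypothetical sketch of the modified Sarsa update:
# Q(s,a) <- Q(s,a) + alpha * [R + gamma * Q(s',a') - Q(s,a)] * P(s,a)
def sarsa_update(Q, P, s, a, R, s_next, a_next, alpha, gamma):
    td_error = R + gamma * Q[(s_next, a_next)] - Q[(s, a)]
    Q[(s, a)] += alpha * td_error * P[(s, a)]   # path matrix weights the update
```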
Further, the process of initializing the path matrix P (S, a) in step S1 is as follows:
the initial value of the path matrix P (s, a) is 0; for each step in the round, an increment plus is added to the corresponding position in the path matrix P (s, a):
P(s,a)←P(s,a)+plus
where plus is a constant.
Further, plus is 1.
Further, the process of initializing ε in step S1 is as follows:
each time an action selection is made with the ε -greedy policy, for each step in the round, ε is updated as follows:
ε ← ε − ∑a∈action_space P(s, a)/N
where ∑a∈action_space P(s, a)/N denotes the quantity obtained by accumulating all the values recorded in the path matrix of the current round and dividing by the number of accumulations N, and action_space denotes the action space.
Further, after the Q value table is updated in step S2, P(s, a) is also updated: P(s, a) ← γP(s, a).
Further, during action selection the agent uses a boundary detection function to detect the boundary and then selects an action according to its position, as follows:
the boundary detection function is used to judge whether the agent is located at a boundary of the map; the map parameters are passed into the boundary detection function, which then gives the coordinates and names of the various boundaries, and the agent adjusts its action selection according to the boundary it is located on.
Further, the reward R is defined as: R = 0 when the agent is in a common area, R = −100 when the agent falls into a trap, and R = 100 when the agent reaches the target position.
Further, in the process of establishing the map model for the area to be path-planned, a grid method is adopted to establish a gridded map model.
Further, the value range of the discount factor gamma is 0 < gamma < 1.
The invention has the following beneficial technical effects:
aiming at the problems of too long learning time, easy failure of a training strategy, low convergence rate and the like of the Sarsa algorithm, the Sarsa algorithm is improved, and a path planning method based on the improved Sarsa algorithm is provided. Compared with the original Sarsa algorithm, the improved Sarsa algorithm has the advantages that the convergence speed is high, the operation efficiency of the algorithm is high, the total return is increased, the total step number and the average step number of each round of the algorithm can be effectively reduced, and the total step number of the improved algorithm is reduced by 23.3% in the labyrinth environment with the same difficulty, so that the performance of the algorithm is improved.
Drawings
FIG. 1 is a flow chart of a path planning method of the present invention;
FIG. 2 is a diagram of an experimental environment of the present invention;
FIG. 3 is a schematic diagram showing a comparison of the loss function of the path planning method of the embodiment of the present invention and the conventional Sarsa algorithm;
FIG. 4 is a schematic diagram illustrating a comparison between the operation time of the path planning method of the embodiment of the present invention and the operation time of the conventional Sarsa algorithm;
fig. 5 is a diagram illustrating a comparison of total returns of a path planning method according to an embodiment of the present invention and a conventional Sarsa algorithm.
Detailed Description
The invention aims to provide a path planning method based on an improved Sarsa algorithm, which enables an intelligent robot to avoid collisions and traps in a complex maze environment, perform reasonable path planning, and reach the target position. On the basis of the traditional Sarsa algorithm, a path matrix P(s, a) is introduced to store the path taken by the agent in the current round; all values in the path matrix are reduced in proportion at each update, so that paths closer to the target position carry more weight. A dynamically adjusted greedy factor ε is also introduced to improve the search capability of the algorithm: if the agent is closer to the target position, 1−ε is increased, using the path matrix P(s, a) as a heuristic, to make the search more purposeful; otherwise ε is increased to make the search more random. The performance of the algorithm is evaluated by indicators such as the loss function, running time, total return, and step count. Compared with the traditional Sarsa algorithm, in a maze environment of the same difficulty the method improves the convergence speed and total return of the Sarsa algorithm for the same path planning task, effectively reduces the number of iteration steps, and runs more efficiently. Here the convergence speed refers to the number of exploration iteration steps required for the agent to reach the optimal Q value, and each process in which the agent explores until it reaches the end point or a trap is one round of exploration iteration.
In order to make those skilled in the art better understand the technical solution of the present invention, the following will clearly and completely describe the technical solution in the embodiments of the present invention with reference to the drawings and the specific implementation examples in the embodiments of the present invention:
the invention provides a path planning method based on an improved Sarsa algorithm, which is characterized in that a path matrix P (s, a) is introduced on the basis of the traditional Sarsa algorithm, the size of the path matrix P (s, a) is the same as that of a Q value table, all values of the matrix are initialized to be 0, a plus value is added at a corresponding position of the path matrix P (s, a) after a certain action is executed in a certain state, the effect of storing the path traveled by an intelligent agent in the current round in the environment is further realized, all values in the path matrix P (s, a) are reduced in proportion by updating each time, the more important effect of the path closer to a target position can be realized, the path matrix P (s, a) participates in the updating of the Q value table, and the effective path value is strengthened; and a dynamic adjustment greedy factor epsilon is introduced to improve the exploration capability of the algorithm, if the intelligent agent is closer to the target position, the 1-epsilon is used as an enlightenment to improve the exploration purpose of the algorithm, otherwise, the epsilon is increased to improve the exploration randomness, and the dynamic adjustment greedy factor epsilon can obviously improve the path exploration efficiency through experimental verification. According to the principle of reinforcement learning, fig. 1 shows a flow chart of a path planning method, which specifically includes the following steps:
step 1: initializing various data information, and assigning values to related variables:
step 1.1: initializing an environment model: establishing an experimental model with grids in a coordinate axis by adopting a grid method, and establishing a two-dimensional simulation environment on the coordinate axis;
the environment is an N × N grid map on which random traps and a target position are set to form a maze environment; by setting random traps and a target position, the agent learns to avoid traps and move toward the target, and after falling into a trap the agent restarts exploration and learning from the starting point. The position of the agent is represented by (x, y), corresponding to a state in the Q table, and the starting point of the agent is fixed at the upper left corner, namely (0, 0). Fig. 2 depicts the specific experimental environment, where the red square is the agent, the black squares are traps, the yellow circle is the target position, and the white area is the common area in which the agent walks;
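As an illustration only, a minimal environment-setup sketch under the assumptions of step 1.1; the grid size, trap count, and function names here are hypothetical:

```python
import random

# Build an N x N grid maze: start fixed at (0, 0), randomly placed traps,
# and one target cell chosen from the remaining cells.
def build_maze(N=8, n_traps=10, seed=0):
    rng = random.Random(seed)
    cells = [(x, y) for x in range(N) for y in range(N) if (x, y) != (0, 0)]
    rng.shuffle(cells)
    traps = set(cells[:n_traps])
    target = cells[n_traps]          # the next free cell becomes the target
    return traps, target
```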
step 1.2: setting the action space: the action space contains all actions the agent can take in the map model environment. The agent's actions are defined as four actions, up, down, left and right, represented by the list [0, 1, 2, 3], where 0 represents up, 1 represents down, 2 represents right, and 3 represents left; the agent is approximated as a particle;
step 1.3: setting the reward function: the reward function evaluates, through feedback, how good the action taken by the agent is when it changes from the current state to the next state. It is represented by R and covers the reward settings for the target position, the traps, and all other positions, and the agent uses the reward function to select the optimal strategy. The reward function is designed as follows:
R = 0 when the agent is in a common area; R = −100 when the agent falls into a trap; R = 100 when the agent reaches the target position
that is, when the agent explores a common area of the environment it obtains no reward; when it falls into a trap it receives a penalty of −100; when it reaches the target position it obtains a reward of 100. Whether a strategy is optimal is finally judged from the total reward value;
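A minimal sketch of this reward function (the function and variable names are illustrative assumptions):

```python
# Reward settings from step 1.3: -100 for a trap, +100 for the target, 0 elsewhere.
def reward(state, traps, target):
    if state in traps:
        return -100
    if state == target:
        return 100
    return 0
```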
step 1.4: initializing the Q value table: the key to the reinforcement learning algorithm is the Q value table. Since the state s of the agent is represented by environment coordinates, the Q value table is built from the environment coordinates and the actions; for example, the coordinate of the upper left corner is (0, 0). In this environment model the agent can choose among four actions in each grid cell, and each action corresponds to one Q value. For example, fig. 2 uses an 8 × 8 map in which 50 states are reachable (the starting point, end point and traps are removed), giving 200 Q values, so the Q table is a 50 × 4 matrix. The structure of the Q table is as follows; table 1 shows a Q table with three states and two actions:
Table 1. Q table structure (three states, two actions)
State | Action a1 | Action a2
s1 | Q(s1, a1) | Q(s1, a2)
s2 | Q(s2, a1) | Q(s2, a2)
s3 | Q(s3, a1) | Q(s3, a2)
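Purely as an illustration, the Q value table and the path matrix can be represented as dictionaries keyed by (coordinate, action) pairs, all initialized to zero; the names and storage layout are assumptions:

```python
ACTIONS = [0, 1, 2, 3]   # 0 = up, 1 = down, 2 = right, 3 = left (step 1.2)

# Q value table and path matrix of steps 1.4 / 1.5: one zero entry per
# (coordinate, action) pair; the path matrix has the same shape as the Q table.
def init_tables(N=8):
    Q = {((x, y), a): 0.0 for x in range(N) for y in range(N) for a in ACTIONS}
    P = {key: 0.0 for key in Q}
    return Q, P
```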
Step 1.5: initializing the state s and the path matrix P(s, a): the path matrix stores the states s and actions a of the path taken by the agent in the current round in the environment. The path matrix P(s, a) has the same size as the Q value table and its initial value is 0; for each step in the round, an increment plus is added to the corresponding position in the path matrix P(s, a), according to the following formula:
P(s,a)←P(s,a)+plus
where plus is a constant that can be set to different values for different tasks; it is set to 1 in the experiment. The path matrix participates in updating the Q table, and its values are gradually reduced, so that paths closer to the target position become more important. That is, when the agent selects a certain action at a certain time, this can be understood as making a mark on that state-action pair; the mark gradually fades as time passes. After a number of steps the agent finally reaches the end point, and at that moment the marks of steps closer to the end point are clearer while those of steps farther from the end point are fainter, so paths closer to the target position are given more weight.
Each time a certain action is executed in a certain state, a plus value is added at that position of the path matrix P(s, a), so the path taken by the agent is recorded; the most recently recorded values in the path matrix correspond to the points closest to the end point. At every step the values in the path matrix are multiplied by γ, so values stored earlier become smaller, which again realizes the effect that points closer to the end point are more important.
Step 1.6: initializing ε: the path matrix P(s, a) is used as a heuristic. ∑a∈action_space P(s, a) denotes the sum of the values of the path matrix P(s, a), and action_space denotes the action space. The larger ∑a∈action_space P(s, a) is, the closer the current position is to the target point, so ε should approach 0, i.e. 1−ε should grow, which makes the search more purposeful. Each time an action is selected with the ε-greedy strategy, for each step in the round, ∑a∈action_space P(s, a)/N is subtracted from ε, according to the following formula:
ε ← ε − ∑a∈action_space P(s, a)/N
The adjustment takes P(s, a) as its argument, and the term ∑a∈action_space P(s, a)/N should be smaller than ε. This formula dynamically adjusts ε: the closer the agent is to the end point, the larger the term becomes, appropriately reducing the value of ε and thereby increasing the purposefulness of the search.
Here ∑a∈action_space P(s, a)/N accumulates all the values in the path matrix of the current round and divides by the number of accumulations N. Following the same reasoning, the closer to the end point, the larger this value, and letting it participate in adjusting the greedy factor realizes the effect of dynamically adjusting the greedy value; experiments verify that this gives a better result.
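An illustrative sketch of this dynamic adjustment of ε; treating N as the number of accumulated steps in the round and clamping ε at 0 are assumptions, not stated in the patent:

```python
# Dynamic greedy-factor update of step 1.6:
# epsilon <- epsilon - (sum of the round's path-matrix values) / N
def update_epsilon(epsilon, P, visited_keys, N):
    total = sum(P[key] for key in visited_keys)   # values recorded this round
    return max(0.0, epsilon - total / N)
```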
Step 2: the intelligent agent starts to explore, and the specific exploration steps are as follows:
step 2.1: in state s, the Q values corresponding to the four actions of the current coordinate are obtained, and the agent selects an action using the ε-greedy strategy, where ε is a preset hyperparameter smaller than 1: with probability ε the agent selects an action at random from the four actions, and with probability 1−ε it selects the action with the maximum Q value among the existing actions in the Q table;
In action selection, a boundary detection function is additionally added to the action selection module, considering that the agent's choice of action is not completely free at certain positions; for example, when the agent is at the upper left corner, its next action can only be down or right. The boundary detection function judges whether the agent is located at a boundary of the map, and when the agent is found to be at a boundary the set of selectable actions is reduced, which improves the agent's exploration efficiency. The specific implementation is as follows: all boundaries are stored in the form of a list, the boundary detection function gives the coordinates and names of the various boundaries, and the coordinate parameters of the map are passed into the boundary detection function. When the action selection module is used, for each state it is judged whether the state lies in the boundary list; if it does, the action selection mode is adjusted according to the particular boundary so as to narrow the action selection range at the boundary;
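A possible boundary-detection helper, sketched under the coordinate and action conventions assumed above (x as column, y as row, action codes 0=up, 1=down, 2=right, 3=left); the function name and return form are not from the patent:

```python
# Return the actions that would take the agent off an N x N map.
def blocked_actions(state, N):
    x, y = state
    blocked = set()
    if y == 0:
        blocked.add(0)       # top edge: cannot move up
    if y == N - 1:
        blocked.add(1)       # bottom edge: cannot move down
    if x == N - 1:
        blocked.add(2)       # right edge: cannot move right
    if x == 0:
        blocked.add(3)       # left edge: cannot move left
    return blocked
```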
step 2.2: after the agent takes an action, the environment feeds back a reward R and the agent enters the next state s′; within each iteration (episode) the agent trains in a continuous loop, and the Q value table is continuously updated by the following formula:
Q(s,a)←Q(s,a)+α[R+γQ(s′,a′)-Q(s,a)][P(s,a)]
where s′ and a′ denote the state and action of the next step, and γ denotes the discount factor with 0 < γ < 1; setting the discount factor prevents the Q values from growing without bound. The variable P(s, a) denotes the path matrix, which keeps changing as actions are executed;
at the same time, the value of P(s, a) is updated according to the following formula, so that all values in the P(s, a) matrix are reduced by the factor γ. Values written into the path matrix P(s, a) earlier become smaller while later values stay relatively larger, and since later actions and states are clearly closer to the end point, this achieves the effect that values closer to the end point are more important:
P(s,a)←γP(s,a)
step 3: judge whether the current position is the target position or a trap. If it is, the agent's exploration in this round ends; at the same time it is judged whether the convergence condition has been reached, and if not, the method returns to step 1.4 to start a new round of exploration. If the current position is neither the target nor a trap, the method returns to step 2 to continue exploring. When the agent obtains the optimal solution, the exploration ends.
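Tying the steps together, the following sketch runs one round of exploration. It reuses the illustrative helpers sketched above (ACTIONS, build_maze, init_tables, reward, blocked_actions, sarsa_update, update_epsilon); all names, hyperparameter values, and the exact placement of the P(s, a) decay and ε update inside the loop are assumptions rather than the patent's reference implementation.

```python
import random

MOVES = {0: (0, -1), 1: (0, 1), 2: (1, 0), 3: (-1, 0)}   # up, down, right, left

def choose_action(Q, state, epsilon, N, rng):
    allowed = [a for a in ACTIONS if a not in blocked_actions(state, N)]
    if rng.random() < epsilon:
        return rng.choice(allowed)                         # explore
    return max(allowed, key=lambda a: Q[(state, a)])       # exploit

def run_episode(Q, P, traps, target, N=8, alpha=0.1, gamma=0.9,
                epsilon=0.1, plus=1.0, rng=None):
    rng = rng or random.Random(0)
    s = (0, 0)
    a = choose_action(Q, s, epsilon, N, rng)
    steps, visited = 0, []
    while s not in traps and s != target:
        steps += 1
        P[(s, a)] += plus                                  # mark the path (step 1.5)
        visited.append((s, a))
        dx, dy = MOVES[a]
        s_next = (s[0] + dx, s[1] + dy)
        R = reward(s_next, traps, target)
        a_next = choose_action(Q, s_next, epsilon, N, rng)
        sarsa_update(Q, P, s, a, R, s_next, a_next, alpha, gamma)
        for key in P:                                      # P(s, a) <- gamma * P(s, a)
            P[key] *= gamma
        epsilon = update_epsilon(epsilon, P, visited, steps)
        s, a = s_next, a_next
    return steps
```

A caller would, for example, build the maze and tables once (build_maze, init_tables) and then call run_episode repeatedly until the convergence condition of step 3 is met.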
In this embodiment, experiments are performed in an 8 × 8 maze environment in which the traps and the target position are set randomly; the performance difference between the improved Sarsa algorithm and the original algorithm is compared, and the performance of the algorithms is evaluated by indicators such as the loss function, running time, total return, and step count.
Fig. 3 is a schematic diagram comparing the loss functions of the path planning method of the embodiment of the present invention and the conventional Sarsa algorithm: the loss function of the algorithm is used for measuring the prediction capability of the algorithm model and reflecting the similarity between the path formed by the algorithm and the optimal path, so that the smaller the loss function is, the better the loss function is, and the expression of the loss function of the algorithm is given by the following formula:
[loss function formula shown as an image in the original publication; it is computed from the total return y of the actual path, the total return f(x) of the optimal path, and the number of rounds Turn]
where y is the total return of the actual path, f(x) is the total return of the optimal path, and Turn is the number of rounds; to make the results more intuitive and clear, the specific reward values are scaled down by a factor of one hundred. Analysis of fig. 3 shows that the loss function of the improved Sarsa algorithm is clearly smaller than that of the original Sarsa algorithm.
Fig. 4 is a schematic diagram comparing the running time of the path planning method of the embodiment of the present invention with that of the traditional Sarsa algorithm: when the number of iterations n is less than 25, the running time of the improved Sarsa algorithm is almost the same as that of the original Sarsa algorithm; as the agent continues to learn in depth, when n is greater than 25 the learning efficiency keeps improving and the exploration capability becomes stronger, so the improved Sarsa algorithm runs more efficiently.
Fig. 5 is a schematic diagram comparing the total returns of the path planning method of the embodiment of the present invention and the traditional Sarsa algorithm: it can be seen that the improved algorithm achieves a higher total return than the original algorithm in almost all rounds, which means that in the early stage the improved algorithm not only reaches the end point faster than the original algorithm but also converges faster after reaching the end point.
The results in terms of the total steps and the average steps per round are shown in table 2, and it can be seen that the total steps and the average steps per round of the improved algorithm are much smaller than those of the original algorithm, the total steps of the improved algorithm are reduced by 23.3%, which indicates that the improved algorithm has higher efficiency in path planning.
Table 2. Comparison of step counts between the two algorithms (unit: steps)
[Table 2 data shown as an image in the original publication; it lists the total steps and the average steps per round of the original Sarsa algorithm and the improved algorithm]

Claims (9)

1. A path planning method based on an improved Sarsa algorithm comprises the following steps:
establishing a map model aiming at an area to be subjected to path planning, namely establishing a two-dimensional simulation environment on a coordinate axis, and setting a trap and a target position on the map according to an actual environment;
the coordinate of the agent in the environment is (x, y), corresponding to the state s of the agent; the action space of the agent in the map comprises four actions, namely up, down, left and right, and a Q value table is established from the environment coordinates and the actions; the agent realizes path planning based on the Sarsa algorithm;
the method is characterized in that the process of realizing path planning by the agent based on the Sarsa algorithm comprises the following steps:
s1, initializing a Q value table, a state S, a path matrix P (S, a) and a greedy factor epsilon; the path matrix P (s, a) is used for saving the state s and the action a of the path taken by the agent in the current round in the environment;
s2, the agent starts to explore, and the following steps are executed for each iteration:
in the state s, the intelligent agent selects actions by adopting an epsilon-greedy strategy; after the agent takes action a, the environment will feed back a reward R and enter the next state s'; updating the Q value table:
Q(s,a)←Q(s,a)+α[R+γQ(s′,a′)-Q(s,a)][P(s,a)]
where s′ and a′ represent the state and action of the next step and γ represents the discount factor.
2. The improved Sarsa algorithm based path planning method as claimed in claim 1, wherein the initialization of the path matrix P (S, a) in step S1 is as follows:
the initial value of the path matrix P (s, a) is 0; for each step in the round, an increment plus is added to the corresponding position in the path matrix P (s, a):
P(s,a)←P(s,a)+plus
where plus is a constant.
3. The improved Sarsa algorithm-based path planning method according to claim 2, wherein plus is 1.
4. The improved Sarsa algorithm-based path planning method as claimed in claim 2, wherein the process of initializing ε in step S1 is as follows:
each time an action selection is made with the ε -greedy policy, for each step in the round, ε is updated as follows:
ε ← ε − ∑a∈action_space P(s, a)/N
where ∑a∈action_space P(s, a)/N denotes the quantity obtained by accumulating all the values recorded in the path matrix of the current round and dividing by the number of accumulations N, and action_space denotes the action space.
5. The improved Sarsa algorithm based path planning method as claimed in claim 4, wherein after the Q value table is updated in step S2, P(s, a) is also updated: P(s, a) ← γP(s, a).
6. The improved Sarsa algorithm-based path planning method according to any one of claims 1 to 5, wherein during action selection the agent uses a boundary detection function to detect the boundary and then selects an action based on its position, comprising the following steps:
the boundary detection function is used to judge whether the agent is located at a boundary of the map; the map parameters are passed into the boundary detection function, which then gives the coordinates and names of the various boundaries, and the agent adjusts its action selection according to the boundary it is located on.
7. The improved Sarsa algorithm-based path planning method as claimed in claim 6, wherein the reward R = 0 when the agent is in a common area, R = −100 when the agent falls into a trap, and R = 100 when the agent reaches the target position.
8. The improved Sarsa algorithm-based path planning method as claimed in claim 7, wherein the grid map model is created by using a grid method in the process of creating the map model for the area to be subjected to path planning.
9. The improved Sarsa algorithm-based path planning method as claimed in claim 8, wherein the range of the discount factor γ is 0 < γ < 1.
CN202110918358.2A 2021-08-11 2021-08-11 Path planning method based on improved Sarsa algorithm Active CN113467481B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110918358.2A CN113467481B (en) 2021-08-11 2021-08-11 Path planning method based on improved Sarsa algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110918358.2A CN113467481B (en) 2021-08-11 2021-08-11 Path planning method based on improved Sarsa algorithm

Publications (2)

Publication Number Publication Date
CN113467481A true CN113467481A (en) 2021-10-01
CN113467481B CN113467481B (en) 2022-10-25

Family

ID=77866277

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110918358.2A Active CN113467481B (en) 2021-08-11 2021-08-11 Path planning method based on improved Sarsa algorithm

Country Status (1)

Country Link
CN (1) CN113467481B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115993831A (en) * 2023-03-23 2023-04-21 安徽大学 Method for planning path of robot non-target network based on deep reinforcement learning
CN116822765A (en) * 2023-06-02 2023-09-29 东南大学 Q-learning-based agent time sequence task path planning method

Citations (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102929281A (en) * 2012-11-05 2013-02-13 西南科技大学 Robot k-nearest-neighbor (kNN) path planning method under incomplete perception environment
CN103204193A (en) * 2013-04-08 2013-07-17 浙江大学 Under-actuated biped robot walking control method
CN103517309A (en) * 2013-10-11 2014-01-15 清华大学 Cell interruption compensation method based on asymptotic greedy behavior exploration
CN108319286A (en) * 2018-03-12 2018-07-24 西北工业大学 A kind of unmanned plane Air Combat Maneuvering Decision Method based on intensified learning
CN108563112A (en) * 2018-03-30 2018-09-21 南京邮电大学 Control method for emulating Soccer robot ball-handling
US20190025917A1 (en) * 2014-12-12 2019-01-24 The Research Foundation For The State University Of New York Autonomous brain-machine interface
CN109669452A (en) * 2018-11-02 2019-04-23 北京物资学院 A kind of cloud robot task dispatching method and system based on parallel intensified learning
US20190147355A1 (en) * 2017-11-14 2019-05-16 International Business Machines Corporation Self-critical sequence training of multimodal systems
CN109794937A (en) * 2019-01-29 2019-05-24 南京邮电大学 A kind of Soccer robot collaboration method based on intensified learning
CN109948054A (en) * 2019-03-11 2019-06-28 北京航空航天大学 A kind of adaptive learning path planning system based on intensified learning
US20190261566A1 (en) * 2016-11-08 2019-08-29 Dogtooth Technologies Limited Robotic fruit picking system
CN110488859A (en) * 2019-07-15 2019-11-22 北京航空航天大学 A kind of Path Planning for UAV based on improvement Q-learning algorithm
CN110515303A (en) * 2019-09-17 2019-11-29 余姚市浙江大学机器人研究中心 A kind of adaptive dynamic path planning method based on DDQN
CN110632931A (en) * 2019-10-09 2019-12-31 哈尔滨工程大学 Mobile robot collision avoidance planning method based on deep reinforcement learning in dynamic environment
CN111079305A (en) * 2019-12-27 2020-04-28 南京航空航天大学 Different-strategy multi-agent reinforcement learning cooperation method based on lambda-reward
US20200193333A1 (en) * 2018-12-14 2020-06-18 Fujitsu Limited Efficient reinforcement learning based on merging of trained learners
US10726059B1 (en) * 2016-11-10 2020-07-28 Snap Inc. Deep reinforcement learning-based captioning with embedding reward
CN111619624A (en) * 2020-06-01 2020-09-04 北京全路通信信号研究设计院集团有限公司 Tramcar operation control method and system based on deep reinforcement learning
CN111898728A (en) * 2020-06-02 2020-11-06 东南大学 Team robot decision-making method based on multi-Agent reinforcement learning
EP3805062A1 (en) * 2018-06-29 2021-04-14 Huawei Technologies Co., Ltd. Method and device for determining automatic parking strategy

Patent Citations (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102929281A (en) * 2012-11-05 2013-02-13 西南科技大学 Robot k-nearest-neighbor (kNN) path planning method under incomplete perception environment
CN103204193A (en) * 2013-04-08 2013-07-17 浙江大学 Under-actuated biped robot walking control method
CN103517309A (en) * 2013-10-11 2014-01-15 清华大学 Cell interruption compensation method based on asymptotic greedy behavior exploration
US20190025917A1 (en) * 2014-12-12 2019-01-24 The Research Foundation For The State University Of New York Autonomous brain-machine interface
US20190261566A1 (en) * 2016-11-08 2019-08-29 Dogtooth Technologies Limited Robotic fruit picking system
US10726059B1 (en) * 2016-11-10 2020-07-28 Snap Inc. Deep reinforcement learning-based captioning with embedding reward
US20190147355A1 (en) * 2017-11-14 2019-05-16 International Business Machines Corporation Self-critical sequence training of multimodal systems
CN108319286A (en) * 2018-03-12 2018-07-24 西北工业大学 A kind of unmanned plane Air Combat Maneuvering Decision Method based on intensified learning
CN108563112A (en) * 2018-03-30 2018-09-21 南京邮电大学 Control method for emulating Soccer robot ball-handling
EP3805062A1 (en) * 2018-06-29 2021-04-14 Huawei Technologies Co., Ltd. Method and device for determining automatic parking strategy
CN109669452A (en) * 2018-11-02 2019-04-23 北京物资学院 A kind of cloud robot task dispatching method and system based on parallel intensified learning
US20200193333A1 (en) * 2018-12-14 2020-06-18 Fujitsu Limited Efficient reinforcement learning based on merging of trained learners
CN109794937A (en) * 2019-01-29 2019-05-24 南京邮电大学 A kind of Soccer robot collaboration method based on intensified learning
CN109948054A (en) * 2019-03-11 2019-06-28 北京航空航天大学 A kind of adaptive learning path planning system based on intensified learning
CN110488859A (en) * 2019-07-15 2019-11-22 北京航空航天大学 A kind of Path Planning for UAV based on improvement Q-learning algorithm
CN110515303A (en) * 2019-09-17 2019-11-29 余姚市浙江大学机器人研究中心 A kind of adaptive dynamic path planning method based on DDQN
CN110632931A (en) * 2019-10-09 2019-12-31 哈尔滨工程大学 Mobile robot collision avoidance planning method based on deep reinforcement learning in dynamic environment
CN111079305A (en) * 2019-12-27 2020-04-28 南京航空航天大学 Different-strategy multi-agent reinforcement learning cooperation method based on lambda-reward
CN111619624A (en) * 2020-06-01 2020-09-04 北京全路通信信号研究设计院集团有限公司 Tramcar operation control method and system based on deep reinforcement learning
CN111898728A (en) * 2020-06-02 2020-11-06 东南大学 Team robot decision-making method based on multi-Agent reinforcement learning

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
DONG XU et al.: "Path Planning Method Combining Depth Learning and Sarsa Algorithm", 2017 10th International Symposium on Computational Intelligence and Design *
ENBO LI et al.: "Model learning for two-wheeled robot self-balance control", 2019 IEEE International Conference on Robotics and Biomimetics *
张汝波: "Reinforcement Learning Theory and Applications", Harbin Engineering University Press, 15 April 2001 *
权浩: "Multi-task-oriented path planning and scheduling of warehouse mobile robots", China Master's Theses Full-text Database, Information Science and Technology *
王作为: "Research on behavior learning methods for intelligent robots with cognitive ability", China Doctoral Dissertations Full-text Database, Information Science and Technology *
袁银龙: "Research on deep reinforcement learning algorithms and applications", China Doctoral Dissertations Full-text Database, Information Science and Technology *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115993831A (en) * 2023-03-23 2023-04-21 安徽大学 Method for planning path of robot non-target network based on deep reinforcement learning
CN115993831B (en) * 2023-03-23 2023-06-09 安徽大学 Method for planning path of robot non-target network based on deep reinforcement learning
CN116822765A (en) * 2023-06-02 2023-09-29 东南大学 Q-learning-based agent time sequence task path planning method
CN116822765B (en) * 2023-06-02 2024-08-16 东南大学 Q-learning-based agent time sequence task path planning method

Also Published As

Publication number Publication date
CN113467481B (en) 2022-10-25

Similar Documents

Publication Publication Date Title
CN109945881B (en) Mobile robot path planning method based on ant colony algorithm
CN107272679A (en) Paths planning method based on improved ant group algorithm
CN112329348A (en) Intelligent decision-making method for military countermeasure game under incomplete information condition
CN113467481B (en) Path planning method based on improved Sarsa algorithm
CN111982125A (en) Path planning method based on improved ant colony algorithm
CN113741508B (en) Unmanned aerial vehicle task allocation method based on improved wolf pack algorithm
CN116242383B (en) Unmanned vehicle path planning method based on reinforced Harris eagle algorithm
Bai et al. Adversarial examples construction towards white-box q table variation in dqn pathfinding training
CN110327624A (en) A kind of game follower method and system based on course intensified learning
Bai et al. Variational dynamic for self-supervised exploration in deep reinforcement learning
CN115933693A (en) Robot path planning method based on adaptive chaotic particle swarm algorithm
CN116090549A (en) Knowledge-driven multi-agent reinforcement learning decision-making method, system and storage medium
CN112613608A (en) Reinforced learning method and related device
CN113848911B (en) Mobile robot global path planning method based on Q-learning and RRT
CN116128060A (en) Chess game method based on opponent modeling and Monte Carlo reinforcement learning
CN114815801A (en) Adaptive environment path planning method based on strategy-value network and MCTS
CN114492715A (en) Improved sparrow searching method based on chaotic reverse learning and self-adaptive spiral searching
CN116700258B (en) Intelligent vehicle path planning method based on artificial potential field method and reinforcement learning
CN117523359A (en) Image comparison and identification method and device based on reinforcement learning
CN117522078A (en) Method and system for planning transferable tasks under unmanned system cluster environment coupling
CN117420832A (en) Robot path planning method based on improved GTO
CN116795098A (en) Spherical amphibious robot path planning method based on improved sparrow search algorithm
CN105956680A (en) Frame for generating and managing adaptive rule based on reinforcement learning
CN115759199A (en) Multi-robot environment exploration method and system based on hierarchical graph neural network
Korkmaz A Survey Analyzing Generalization in Deep Reinforcement Learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant