CN113467481A - Path planning method based on improved Sarsa algorithm - Google Patents

Path planning method based on improved Sarsa algorithm

Info

Publication number
CN113467481A
Authority
CN
China
Prior art keywords
path planning
action
agent
path
improved
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110918358.2A
Other languages
Chinese (zh)
Other versions
CN113467481B (en)
Inventor
徐丽
娄茹珍
许逸茗
申林山
钱婧捷
贾我欢
闫鑫
张立国
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin Engineering University
Original Assignee
Harbin Engineering University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin Engineering University filed Critical Harbin Engineering University
Priority to CN202110918358.2A priority Critical patent/CN113467481B/en
Publication of CN113467481A publication Critical patent/CN113467481A/en
Application granted granted Critical
Publication of CN113467481B publication Critical patent/CN113467481B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G05 CONTROLLING; REGULATING
    • G05D SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D1/00 Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
    • G05D1/02 Control of position or course in two dimensions
    • G05D1/021 Control of position or course in two dimensions specially adapted to land vehicles
    • G05D1/0212 Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory
    • G05D1/0223 Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory involving speed control of the vehicle
    • G PHYSICS
    • G05 CONTROLLING; REGULATING
    • G05D SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D1/00 Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
    • G05D1/02 Control of position or course in two dimensions
    • G05D1/021 Control of position or course in two dimensions specially adapted to land vehicles
    • G05D1/0212 Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory
    • G05D1/0221 Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory involving a learning process
    • G PHYSICS
    • G05 CONTROLLING; REGULATING
    • G05D SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D1/00 Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
    • G05D1/02 Control of position or course in two dimensions
    • G05D1/021 Control of position or course in two dimensions specially adapted to land vehicles
    • G05D1/0276 Control of position or course in two dimensions specially adapted to land vehicles using signals provided by a source external to the vehicle

Landscapes

  • Engineering & Computer Science (AREA)
  • Aviation & Aerospace Engineering (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Remote Sensing (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Automation & Control Theory (AREA)
  • Feedback Control In General (AREA)

Abstract

A path planning method based on an improved Sarsa algorithm, belonging to the field of reinforcement learning and path planning. The method aims to solve the problems of slow convergence and low planning efficiency in path planning based on the traditional Sarsa algorithm. A map model is established for the area to be path-planned, a path matrix P(s, a) is introduced, and the greedy factor ε is dynamically adjusted during the agent's search; an ε-greedy strategy is adopted for action selection, and after the agent takes an action a, the environment feeds back a reward R and the agent enters the next state s′. The Q value table is then updated based on the path matrix, thereby realizing path planning based on the improved Sarsa algorithm. The method is mainly used for robot path planning.

Description

Path planning method based on improved Sarsa algorithm
Technical Field
The invention belongs to the field of reinforcement learning and path planning, and particularly relates to a path planning method based on reinforcement learning.
Background
With the development of technologies such as artificial intelligence and big data, intelligent robots are becoming increasingly important in people's daily life. An intelligent robot can autonomously explore, plan paths, and avoid obstacles, and can continuously learn from the environment until it has a complete and clear grasp of it. Path planning for intelligent robots is therefore increasingly important and is a research topic worth deep investigation. Traditional path planning methods include the artificial potential field method, simulated annealing, the rapidly-exploring random tree method, and fuzzy logic; these classical methods suffer from problems such as failing to reach the target and falling into local optima. They were followed by the A* algorithm, which has low search efficiency and is difficult to apply in practice. More recent path planning methods based on artificial intelligence include the genetic algorithm, particle swarm optimization, the ant colony algorithm, and neural network algorithms; these are more intelligent and search more efficiently, but converge slowly. In recent years, reinforcement learning for path planning has become a popular research field.
In reinforcement learning, an agent learns by trial and error through interaction with the environment and has the ability to adapt and explore autonomously. However, path planning in an unknown environment with a traditional reinforcement learning algorithm has problems: when the agent is placed in a completely unknown environment, its early exploration is blind and requires continuous trial and error, so training takes too long and convergence is slow. Moreover, in a more complex unknown environment the state dimension keeps growing and the number of training parameters grows exponentially, consuming a great deal of training time and storage space and ultimately causing the curse of dimensionality. Reinforcement learning algorithms currently applied to path planning include Q-Learning, Sarsa, PPO, DDPG, and DQN. Among them, the Sarsa algorithm is a classic on-policy (online) reinforcement learning method with the following problems: because it uses single-step updates, useless Q values remain in the Q table, causing invalid iterations, overly long learning time, training strategies that easily fail, and slow convergence.
Disclosure of Invention
The method aims to solve the problems of low planning convergence speed and low planning efficiency in the path planning process based on the traditional Sarsa algorithm.
A path planning method based on an improved Sarsa algorithm comprises the following steps:
establishing a map model aiming at an area to be subjected to path planning, namely establishing a two-dimensional simulation environment on a coordinate axis, and setting a trap and a target position on the map according to an actual environment;
the coordinate of the agent in the environment is (x, y), corresponding to the state s of the agent; the action space of the agent in the map comprises four actions, namely up, down, left and right, and a Q value table is established from the environment coordinates and the actions; the agent realizes path planning based on the Sarsa algorithm;
the method is characterized in that the process of realizing path planning by the agent based on the Sarsa algorithm comprises the following steps:
s1, initializing a Q value table, a state S, a path matrix P (S, a) and a greedy factor epsilon; the path matrix P (s, a) is used for saving the state s and the action a of the path taken by the agent in the current round in the environment;
s2, the agent starts to explore, and the following steps are executed for each iteration:
in the state s, the intelligent agent selects actions by adopting an epsilon-greedy strategy; after the agent takes action a, the environment will feed back a reward R and enter the next state s'; updating the Q value table:
Q(s,a)←Q(s,a)+α[R+γQ(s′,a′)-Q(s,a)][P(s,a)]
where α is the learning rate, s′ and a′ denote the state and action of the next step, and γ denotes the discount factor.
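For illustration only, a minimal sketch of this update rule follows, assuming the Q value table and the path matrix are stored as Python dictionaries keyed by (state, action) pairs; the function name and storage layout are assumptions, not taken from the patent text:

```python
# Hypothetical sketch of the modified Sarsa update:
# Q(s,a) <- Q(s,a) + alpha * [R + gamma * Q(s',a') - Q(s,a)] * P(s,a)
def sarsa_update(Q, P, s, a, R, s_next, a_next, alpha, gamma):
    td_error = R + gamma * Q[(s_next, a_next)] - Q[(s, a)]
    Q[(s, a)] += alpha * td_error * P[(s, a)]   # path matrix weights the update
```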
Further, the process of initializing the path matrix P (S, a) in step S1 is as follows:
the initial value of the path matrix P (s, a) is 0; for each step in the round, an increment plus is added to the corresponding position in the path matrix P (s, a):
P(s,a)←P(s,a)+plus
where plus is a constant.
Further, plus is 1.
Further, the process of initializing ε in step S1 is as follows:
each time an action selection is made with the ε -greedy policy, for each step in the round, ε is updated as follows:
ε ← ε − ∑a∈action_space P(s, a)/N
where ∑a∈action_space P(s, a)/N denotes the quantity obtained by accumulating all the values recorded in the path matrix of the current round and dividing by the number of accumulations N, and action_space denotes the action space.
Further, after the Q value table is updated in step S2, P(s, a) is also updated: P(s, a) ← γP(s, a).
Further, during action selection the agent uses a boundary detection function to detect the boundary and then selects an action according to its position, as follows:
the boundary detection function is used to judge whether the agent is located at a boundary of the map; the map parameters are passed into the boundary detection function, which then gives the coordinates and names of the various boundaries, and the agent adjusts its action selection according to the boundary it is located on.
Further, the reward R is defined as: R = 0 when the agent is in a common area, R = −100 when the agent falls into a trap, and R = 100 when the agent reaches the target position.
Further, in the process of establishing the map model for the area to be path-planned, a grid method is adopted to establish a gridded map model.
Further, the value range of the discount factor gamma is 0 < gamma < 1.
The invention has the following beneficial technical effects:
aiming at the problems of too long learning time, easy failure of a training strategy, low convergence rate and the like of the Sarsa algorithm, the Sarsa algorithm is improved, and a path planning method based on the improved Sarsa algorithm is provided. Compared with the original Sarsa algorithm, the improved Sarsa algorithm has the advantages that the convergence speed is high, the operation efficiency of the algorithm is high, the total return is increased, the total step number and the average step number of each round of the algorithm can be effectively reduced, and the total step number of the improved algorithm is reduced by 23.3% in the labyrinth environment with the same difficulty, so that the performance of the algorithm is improved.
Drawings
FIG. 1 is a flow chart of a path planning method of the present invention;
FIG. 2 is a diagram of an experimental environment of the present invention;
FIG. 3 is a schematic diagram showing a comparison of the loss function of the path planning method of the embodiment of the present invention and the conventional Sarsa algorithm;
FIG. 4 is a schematic diagram illustrating a comparison between the operation time of the path planning method of the embodiment of the present invention and the operation time of the conventional Sarsa algorithm;
fig. 5 is a diagram illustrating a comparison of total returns of a path planning method according to an embodiment of the present invention and a conventional Sarsa algorithm.
Detailed Description
The invention aims to provide a path planning method based on an improved Sarsa algorithm, which enables an intelligent robot to avoid collisions and traps in a complex maze environment, perform reasonable path planning, and reach the target position. On the basis of the traditional Sarsa algorithm, a path matrix P(s, a) is introduced to store the path taken by the agent in the current round; all values in the path matrix are reduced in proportion at each update, so that paths closer to the target position carry more weight. A dynamically adjusted greedy factor ε is also introduced to improve the search capability of the algorithm: if the agent is closer to the target position, 1−ε is increased, using the path matrix P(s, a) as a heuristic, to make the search more purposeful; otherwise ε is increased to make the search more random. The performance of the algorithm is evaluated by indicators such as the loss function, running time, total return, and step count. Compared with the traditional Sarsa algorithm, in a maze environment of the same difficulty the method improves the convergence speed and total return of the Sarsa algorithm for the same path planning task, effectively reduces the number of iteration steps, and runs more efficiently. Here the convergence speed refers to the number of exploration iteration steps required for the agent to reach the optimal Q value, and each process in which the agent explores until it reaches the end point or a trap is one round of exploration iteration.
In order to make those skilled in the art better understand the technical solution of the present invention, the following will clearly and completely describe the technical solution in the embodiments of the present invention with reference to the drawings and the specific implementation examples in the embodiments of the present invention:
the invention provides a path planning method based on an improved Sarsa algorithm, which is characterized in that a path matrix P (s, a) is introduced on the basis of the traditional Sarsa algorithm, the size of the path matrix P (s, a) is the same as that of a Q value table, all values of the matrix are initialized to be 0, a plus value is added at a corresponding position of the path matrix P (s, a) after a certain action is executed in a certain state, the effect of storing the path traveled by an intelligent agent in the current round in the environment is further realized, all values in the path matrix P (s, a) are reduced in proportion by updating each time, the more important effect of the path closer to a target position can be realized, the path matrix P (s, a) participates in the updating of the Q value table, and the effective path value is strengthened; and a dynamic adjustment greedy factor epsilon is introduced to improve the exploration capability of the algorithm, if the intelligent agent is closer to the target position, the 1-epsilon is used as an enlightenment to improve the exploration purpose of the algorithm, otherwise, the epsilon is increased to improve the exploration randomness, and the dynamic adjustment greedy factor epsilon can obviously improve the path exploration efficiency through experimental verification. According to the principle of reinforcement learning, fig. 1 shows a flow chart of a path planning method, which specifically includes the following steps:
step 1: initializing various data information, and assigning values to related variables:
step 1.1: initializing an environment model: establishing an experimental model with grids in a coordinate axis by adopting a grid method, and establishing a two-dimensional simulation environment on the coordinate axis;
the environment is an N × N grid map on which random traps and a target position are set to form a maze environment; by setting random traps and a target position, the agent learns to avoid traps and move toward the target, and after falling into a trap the agent restarts exploration and learning from the starting point. The position of the agent is represented by (x, y), corresponding to a state in the Q table, and the starting point of the agent is fixed at the upper left corner, namely (0, 0). Fig. 2 depicts the specific experimental environment, where the red square is the agent, the black squares are traps, the yellow circle is the target position, and the white area is the common area in which the agent walks;
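As an illustration only, a minimal environment-setup sketch under the assumptions of step 1.1; the grid size, trap count, and function names here are hypothetical:

```python
import random

# Build an N x N grid maze: start fixed at (0, 0), randomly placed traps,
# and one target cell chosen from the remaining cells.
def build_maze(N=8, n_traps=10, seed=0):
    rng = random.Random(seed)
    cells = [(x, y) for x in range(N) for y in range(N) if (x, y) != (0, 0)]
    rng.shuffle(cells)
    traps = set(cells[:n_traps])
    target = cells[n_traps]          # the next free cell becomes the target
    return traps, target
```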
step 1.2: setting the action space: the action space contains all actions the agent can take in the map model environment. The agent's actions are defined as four actions, up, down, left and right, represented by the list [0, 1, 2, 3], where 0 represents up, 1 represents down, 2 represents right, and 3 represents left; the agent is approximated as a particle;
step 1.3: setting the reward function: the reward function evaluates, through feedback, how good the action taken by the agent is when it changes from the current state to the next state. It is represented by R and covers the reward settings for the target position, the traps, and all other positions, and the agent uses the reward function to select the optimal strategy. The reward function is designed as follows:
R = 0 when the agent is in a common area; R = −100 when the agent falls into a trap; R = 100 when the agent reaches the target position
that is, when the agent explores a common area of the environment it obtains no reward; when it falls into a trap it receives a penalty of −100; when it reaches the target position it obtains a reward of 100. Whether a strategy is optimal is finally judged from the total reward value;
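A minimal sketch of this reward function (the function and variable names are illustrative assumptions):

```python
# Reward settings from step 1.3: -100 for a trap, +100 for the target, 0 elsewhere.
def reward(state, traps, target):
    if state in traps:
        return -100
    if state == target:
        return 100
    return 0
```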
step 1.4: initializing the Q value table: the key to the reinforcement learning algorithm is the Q value table. Since the state s of the agent is represented by environment coordinates, the Q value table is built from the environment coordinates and the actions; for example, the coordinate of the upper left corner is (0, 0). In this environment model the agent can choose among four actions in each grid cell, and each action corresponds to one Q value. For example, fig. 2 uses an 8 × 8 map in which 50 states are reachable (the starting point, end point and traps are removed), giving 200 Q values, so the Q table is a 50 × 4 matrix. The structure of the Q table is as follows; table 1 shows a Q table with three states and two actions:
Table 1. Q table structure (three states, two actions)
State | Action a1 | Action a2
s1 | Q(s1, a1) | Q(s1, a2)
s2 | Q(s2, a1) | Q(s2, a2)
s3 | Q(s3, a1) | Q(s3, a2)
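Purely as an illustration, the Q value table and the path matrix can be represented as dictionaries keyed by (coordinate, action) pairs, all initialized to zero; the names and storage layout are assumptions:

```python
ACTIONS = [0, 1, 2, 3]   # 0 = up, 1 = down, 2 = right, 3 = left (step 1.2)

# Q value table and path matrix of steps 1.4 / 1.5: one zero entry per
# (coordinate, action) pair; the path matrix has the same shape as the Q table.
def init_tables(N=8):
    Q = {((x, y), a): 0.0 for x in range(N) for y in range(N) for a in ACTIONS}
    P = {key: 0.0 for key in Q}
    return Q, P
```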
Step 1.5: initializing the state s and the path matrix P(s, a): the path matrix stores the states s and actions a of the path taken by the agent in the current round in the environment. The path matrix P(s, a) has the same size as the Q value table and its initial value is 0; for each step in the round, an increment plus is added to the corresponding position in the path matrix P(s, a), according to the following formula:
P(s,a)←P(s,a)+plus
where plus is a constant that can be set to different values for different tasks; it is set to 1 in the experiment. The path matrix participates in updating the Q table, and its values are gradually reduced, so that paths closer to the target position become more important. That is, when the agent selects a certain action at a certain time, this can be understood as making a mark on that state-action pair; the mark gradually fades as time passes. After a number of steps the agent finally reaches the end point, and at that moment the marks of steps closer to the end point are clearer while those of steps farther from the end point are fainter, so paths closer to the target position are given more weight.
Each time a certain action is executed in a certain state, a plus value is added at that position of the path matrix P(s, a), so the path taken by the agent is recorded; the most recently recorded values in the path matrix correspond to the points closest to the end point. At every step the values in the path matrix are multiplied by γ, so values stored earlier become smaller, which again realizes the effect that points closer to the end point are more important.
Step 1.6: initializing ε: the path matrix P(s, a) is used as a heuristic. ∑a∈action_space P(s, a) denotes the sum of the values of the path matrix P(s, a), and action_space denotes the action space. The larger ∑a∈action_space P(s, a) is, the closer the current position is to the target point, so ε should approach 0, i.e. 1−ε should grow, which makes the search more purposeful. Each time an action is selected with the ε-greedy strategy, for each step in the round, ∑a∈action_space P(s, a)/N is subtracted from ε, according to the following formula:
ε ← ε − ∑a∈action_space P(s, a)/N
The adjustment takes P(s, a) as its argument, and the term ∑a∈action_space P(s, a)/N should be smaller than ε. This formula dynamically adjusts ε: the closer the agent is to the end point, the larger the term becomes, appropriately reducing the value of ε and thereby increasing the purposefulness of the search.
Here ∑a∈action_space P(s, a)/N accumulates all the values in the path matrix of the current round and divides by the number of accumulations N. Following the same reasoning, the closer to the end point, the larger this value, and letting it participate in adjusting the greedy factor realizes the effect of dynamically adjusting the greedy value; experiments verify that this gives a better result.
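An illustrative sketch of this dynamic adjustment of ε; treating N as the number of accumulated steps in the round and clamping ε at 0 are assumptions, not stated in the patent:

```python
# Dynamic greedy-factor update of step 1.6:
# epsilon <- epsilon - (sum of the round's path-matrix values) / N
def update_epsilon(epsilon, P, visited_keys, N):
    total = sum(P[key] for key in visited_keys)   # values recorded this round
    return max(0.0, epsilon - total / N)
```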
Step 2: the intelligent agent starts to explore, and the specific exploration steps are as follows:
step 2.1: in state s, the Q values corresponding to the four actions of the current coordinate are obtained, and the agent selects an action using the ε-greedy strategy, where ε is a preset hyperparameter smaller than 1: with probability ε the agent selects an action at random from the four actions, and with probability 1−ε it selects the action with the maximum Q value among the existing actions in the Q table;
In action selection, a boundary detection function is additionally added to the action selection module, considering that the agent's choice of action is not completely free at certain positions; for example, when the agent is at the upper left corner, its next action can only be down or right. The boundary detection function judges whether the agent is located at a boundary of the map, and when the agent is found to be at a boundary the set of selectable actions is reduced, which improves the agent's exploration efficiency. The specific implementation is as follows: all boundaries are stored in the form of a list, the boundary detection function gives the coordinates and names of the various boundaries, and the coordinate parameters of the map are passed into the boundary detection function. When the action selection module is used, for each state it is judged whether the state lies in the boundary list; if it does, the action selection mode is adjusted according to the particular boundary so as to narrow the action selection range at the boundary;
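A possible boundary-detection helper, sketched under the coordinate and action conventions assumed above (x as column, y as row, action codes 0=up, 1=down, 2=right, 3=left); the function name and return form are not from the patent:

```python
# Return the actions that would take the agent off an N x N map.
def blocked_actions(state, N):
    x, y = state
    blocked = set()
    if y == 0:
        blocked.add(0)       # top edge: cannot move up
    if y == N - 1:
        blocked.add(1)       # bottom edge: cannot move down
    if x == N - 1:
        blocked.add(2)       # right edge: cannot move right
    if x == 0:
        blocked.add(3)       # left edge: cannot move left
    return blocked
```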
step 2.2: after the agent takes an action, the environment feeds back a reward R and the agent enters the next state s′; within each iteration (episode) the agent trains in a continuous loop, and the Q value table is continuously updated by the following formula:
Q(s,a)←Q(s,a)+α[R+γQ(s′,a′)-Q(s,a)][P(s,a)]
where s′ and a′ denote the state and action of the next step, and γ denotes the discount factor with 0 < γ < 1; setting the discount factor prevents the Q values from growing without bound. The variable P(s, a) denotes the path matrix, which keeps changing as actions are executed;
at the same time, the value of P(s, a) is updated according to the following formula, so that all values in the P(s, a) matrix are reduced by the factor γ. Values written into the path matrix P(s, a) earlier become smaller while later values stay relatively larger, and since later actions and states are clearly closer to the end point, this achieves the effect that values closer to the end point are more important:
P(s,a)←γP(s,a)
step 3: judge whether the current position is the target position or a trap. If it is, the agent's exploration in this round ends; at the same time it is judged whether the convergence condition has been reached, and if not, the method returns to step 1.4 to start a new round of exploration. If the current position is neither the target nor a trap, the method returns to step 2 to continue exploring. When the agent obtains the optimal solution, the exploration ends.
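Tying the steps together, the following sketch runs one round of exploration. It reuses the illustrative helpers sketched above (ACTIONS, build_maze, init_tables, reward, blocked_actions, sarsa_update, update_epsilon); all names, hyperparameter values, and the exact placement of the P(s, a) decay and ε update inside the loop are assumptions rather than the patent's reference implementation.

```python
import random

MOVES = {0: (0, -1), 1: (0, 1), 2: (1, 0), 3: (-1, 0)}   # up, down, right, left

def choose_action(Q, state, epsilon, N, rng):
    allowed = [a for a in ACTIONS if a not in blocked_actions(state, N)]
    if rng.random() < epsilon:
        return rng.choice(allowed)                         # explore
    return max(allowed, key=lambda a: Q[(state, a)])       # exploit

def run_episode(Q, P, traps, target, N=8, alpha=0.1, gamma=0.9,
                epsilon=0.1, plus=1.0, rng=None):
    rng = rng or random.Random(0)
    s = (0, 0)
    a = choose_action(Q, s, epsilon, N, rng)
    steps, visited = 0, []
    while s not in traps and s != target:
        steps += 1
        P[(s, a)] += plus                                  # mark the path (step 1.5)
        visited.append((s, a))
        dx, dy = MOVES[a]
        s_next = (s[0] + dx, s[1] + dy)
        R = reward(s_next, traps, target)
        a_next = choose_action(Q, s_next, epsilon, N, rng)
        sarsa_update(Q, P, s, a, R, s_next, a_next, alpha, gamma)
        for key in P:                                      # P(s, a) <- gamma * P(s, a)
            P[key] *= gamma
        epsilon = update_epsilon(epsilon, P, visited, steps)
        s, a = s_next, a_next
    return steps
```

A caller would, for example, build the maze and tables once (build_maze, init_tables) and then call run_episode repeatedly until the convergence condition of step 3 is met.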
In this embodiment, experiments are performed in an 8 × 8 maze environment in which the traps and the target position are set randomly; the performance difference between the improved Sarsa algorithm and the original algorithm is compared, and the performance of the algorithms is evaluated by indicators such as the loss function, running time, total return, and step count.
Fig. 3 is a schematic diagram comparing the loss functions of the path planning method of the embodiment of the present invention and the conventional Sarsa algorithm: the loss function of the algorithm is used for measuring the prediction capability of the algorithm model and reflecting the similarity between the path formed by the algorithm and the optimal path, so that the smaller the loss function is, the better the loss function is, and the expression of the loss function of the algorithm is given by the following formula:
[loss function formula shown as an image in the original publication; it is computed from the total return y of the actual path, the total return f(x) of the optimal path, and the number of rounds Turn]
where y is the total return of the actual path, f(x) is the total return of the optimal path, and Turn is the number of rounds; to make the results more intuitive and clear, the specific reward values are scaled down by a factor of one hundred. Analysis of fig. 3 shows that the loss function of the improved Sarsa algorithm is clearly smaller than that of the original Sarsa algorithm.
Fig. 4 is a schematic diagram comparing the running time of the path planning method of the embodiment of the present invention with that of the traditional Sarsa algorithm: when the number of iterations n is less than 25, the running time of the improved Sarsa algorithm is almost the same as that of the original Sarsa algorithm; as the agent continues to learn in depth, when n is greater than 25 the learning efficiency keeps improving and the exploration capability becomes stronger, so the improved Sarsa algorithm runs more efficiently.
Fig. 5 is a schematic diagram comparing the total returns of the path planning method of the embodiment of the present invention and the traditional Sarsa algorithm: it can be seen that the improved algorithm achieves a higher total return than the original algorithm in almost all rounds, which means that in the early stage the improved algorithm not only reaches the end point faster than the original algorithm but also converges faster after reaching the end point.
The results in terms of the total steps and the average steps per round are shown in table 2, and it can be seen that the total steps and the average steps per round of the improved algorithm are much smaller than those of the original algorithm, the total steps of the improved algorithm are reduced by 23.3%, which indicates that the improved algorithm has higher efficiency in path planning.
Table 2. Comparison of step counts between the two algorithms (unit: steps)
[Table 2 data shown as an image in the original publication; it lists the total steps and the average steps per round of the original Sarsa algorithm and the improved algorithm]

Claims (9)

1. A path planning method based on an improved Sarsa algorithm comprises the following steps:
establishing a map model aiming at an area to be subjected to path planning, namely establishing a two-dimensional simulation environment on a coordinate axis, and setting a trap and a target position on the map according to an actual environment;
the coordinate of the agent in the environment is (x, y), corresponding to the state s of the agent; the action space of the agent in the map comprises four actions, namely up, down, left and right, and a Q value table is established from the environment coordinates and the actions; the agent realizes path planning based on the Sarsa algorithm;
the method is characterized in that the process of realizing path planning by the agent based on the Sarsa algorithm comprises the following steps:
s1, initializing a Q value table, a state S, a path matrix P (S, a) and a greedy factor epsilon; the path matrix P (s, a) is used for saving the state s and the action a of the path taken by the agent in the current round in the environment;
s2, the agent starts to explore, and the following steps are executed for each iteration:
in the state s, the intelligent agent selects actions by adopting an epsilon-greedy strategy; after the agent takes action a, the environment will feed back a reward R and enter the next state s'; updating the Q value table:
Q(s,a)←Q(s,a)+α[R+γQ(s′,a′)-Q(s,a)][P(s,a)]
where s′ and a′ represent the state and action of the next step and γ represents the discount factor.
2. The improved Sarsa algorithm based path planning method as claimed in claim 1, wherein the initialization of the path matrix P (S, a) in step S1 is as follows:
the initial value of the path matrix P (s, a) is 0; for each step in the round, an increment plus is added to the corresponding position in the path matrix P (s, a):
P(s,a)←P(s,a)+plus
where plus is a constant.
3. The improved Sarsa algorithm-based path planning method according to claim 2, wherein plus is 1.
4. The improved Sarsa algorithm-based path planning method as claimed in claim 2, wherein the process of initializing ε in step S1 is as follows:
each time an action selection is made with the ε -greedy policy, for each step in the round, ε is updated as follows:
ε ← ε − ∑a∈action_space P(s, a)/N
where ∑a∈action_space P(s, a)/N denotes the quantity obtained by accumulating all the values recorded in the path matrix of the current round and dividing by the number of accumulations N, and action_space denotes the action space.
5. The improved Sarsa algorithm based path planning method as claimed in claim 4, wherein after the Q value table is updated in step S2, P(s, a) is also updated: P(s, a) ← γP(s, a).
6. The improved Sarsa algorithm-based path planning method according to any one of claims 1 to 5, wherein during action selection the agent uses a boundary detection function to detect the boundary and then selects an action based on its position, comprising the following steps:
the boundary detection function is used to judge whether the agent is located at a boundary of the map; the map parameters are passed into the boundary detection function, which then gives the coordinates and names of the various boundaries, and the agent adjusts its action selection according to the boundary it is located on.
7. The improved Sarsa algorithm-based path planning method as claimed in claim 6, wherein the reward R = 0 when the agent is in a common area, R = −100 when the agent falls into a trap, and R = 100 when the agent reaches the target position.
8. The improved Sarsa algorithm-based path planning method as claimed in claim 7, wherein the grid map model is created by using a grid method in the process of creating the map model for the area to be subjected to path planning.
9. The improved Sarsa algorithm-based path planning method as claimed in claim 8, wherein the range of the discount factor γ is 0 < γ < 1.
CN202110918358.2A 2021-08-11 2021-08-11 Path planning method based on improved Sarsa algorithm Active CN113467481B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110918358.2A CN113467481B (en) 2021-08-11 2021-08-11 Path planning method based on improved Sarsa algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110918358.2A CN113467481B (en) 2021-08-11 2021-08-11 Path planning method based on improved Sarsa algorithm

Publications (2)

Publication Number Publication Date
CN113467481A true CN113467481A (en) 2021-10-01
CN113467481B CN113467481B (en) 2022-10-25

Family

ID=77866277

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110918358.2A Active CN113467481B (en) 2021-08-11 2021-08-11 Path planning method based on improved Sarsa algorithm

Country Status (1)

Country Link
CN (1) CN113467481B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115993831A (en) * 2023-03-23 2023-04-21 安徽大学 Method for planning path of robot non-target network based on deep reinforcement learning
CN116822765A (en) * 2023-06-02 2023-09-29 东南大学 Q-learning-based agent time sequence task path planning method

Citations (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102929281A (en) * 2012-11-05 2013-02-13 西南科技大学 Robot k-nearest-neighbor (kNN) path planning method under incomplete perception environment
CN103204193A (en) * 2013-04-08 2013-07-17 浙江大学 Under-actuated biped robot walking control method
CN103517309A (en) * 2013-10-11 2014-01-15 清华大学 Cell interruption compensation method based on asymptotic greedy behavior exploration
CN108319286A (en) * 2018-03-12 2018-07-24 西北工业大学 A kind of unmanned plane Air Combat Maneuvering Decision Method based on intensified learning
CN108563112A (en) * 2018-03-30 2018-09-21 南京邮电大学 Control method for emulating Soccer robot ball-handling
US20190025917A1 (en) * 2014-12-12 2019-01-24 The Research Foundation For The State University Of New York Autonomous brain-machine interface
CN109669452A (en) * 2018-11-02 2019-04-23 北京物资学院 A kind of cloud robot task dispatching method and system based on parallel intensified learning
US20190147355A1 (en) * 2017-11-14 2019-05-16 International Business Machines Corporation Self-critical sequence training of multimodal systems
CN109794937A (en) * 2019-01-29 2019-05-24 南京邮电大学 A kind of Soccer robot collaboration method based on intensified learning
CN109948054A (en) * 2019-03-11 2019-06-28 北京航空航天大学 A kind of adaptive learning path planning system based on intensified learning
US20190261566A1 (en) * 2016-11-08 2019-08-29 Dogtooth Technologies Limited Robotic fruit picking system
CN110488859A (en) * 2019-07-15 2019-11-22 北京航空航天大学 A kind of Path Planning for UAV based on improvement Q-learning algorithm
CN110515303A (en) * 2019-09-17 2019-11-29 余姚市浙江大学机器人研究中心 A kind of adaptive dynamic path planning method based on DDQN
CN110632931A (en) * 2019-10-09 2019-12-31 哈尔滨工程大学 Mobile robot collision avoidance planning method based on deep reinforcement learning in dynamic environment
CN111079305A (en) * 2019-12-27 2020-04-28 南京航空航天大学 Different-strategy multi-agent reinforcement learning cooperation method based on lambda-reward
US20200193333A1 (en) * 2018-12-14 2020-06-18 Fujitsu Limited Efficient reinforcement learning based on merging of trained learners
US10726059B1 (en) * 2016-11-10 2020-07-28 Snap Inc. Deep reinforcement learning-based captioning with embedding reward
CN111619624A (en) * 2020-06-01 2020-09-04 北京全路通信信号研究设计院集团有限公司 Tramcar operation control method and system based on deep reinforcement learning
CN111898728A (en) * 2020-06-02 2020-11-06 东南大学 Team robot decision-making method based on multi-Agent reinforcement learning
EP3805062A1 (en) * 2018-06-29 2021-04-14 Huawei Technologies Co., Ltd. Method and device for determining automatic parking strategy

Patent Citations (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102929281A (en) * 2012-11-05 2013-02-13 西南科技大学 Robot k-nearest-neighbor (kNN) path planning method under incomplete perception environment
CN103204193A (en) * 2013-04-08 2013-07-17 浙江大学 Under-actuated biped robot walking control method
CN103517309A (en) * 2013-10-11 2014-01-15 清华大学 Cell interruption compensation method based on asymptotic greedy behavior exploration
US20190025917A1 (en) * 2014-12-12 2019-01-24 The Research Foundation For The State University Of New York Autonomous brain-machine interface
US20190261566A1 (en) * 2016-11-08 2019-08-29 Dogtooth Technologies Limited Robotic fruit picking system
US10726059B1 (en) * 2016-11-10 2020-07-28 Snap Inc. Deep reinforcement learning-based captioning with embedding reward
US20190147355A1 (en) * 2017-11-14 2019-05-16 International Business Machines Corporation Self-critical sequence training of multimodal systems
CN108319286A (en) * 2018-03-12 2018-07-24 西北工业大学 A kind of unmanned plane Air Combat Maneuvering Decision Method based on intensified learning
CN108563112A (en) * 2018-03-30 2018-09-21 南京邮电大学 Control method for emulating Soccer robot ball-handling
EP3805062A1 (en) * 2018-06-29 2021-04-14 Huawei Technologies Co., Ltd. Method and device for determining automatic parking strategy
CN109669452A (en) * 2018-11-02 2019-04-23 北京物资学院 A kind of cloud robot task dispatching method and system based on parallel intensified learning
US20200193333A1 (en) * 2018-12-14 2020-06-18 Fujitsu Limited Efficient reinforcement learning based on merging of trained learners
CN109794937A (en) * 2019-01-29 2019-05-24 南京邮电大学 A kind of Soccer robot collaboration method based on intensified learning
CN109948054A (en) * 2019-03-11 2019-06-28 北京航空航天大学 A kind of adaptive learning path planning system based on intensified learning
CN110488859A (en) * 2019-07-15 2019-11-22 北京航空航天大学 A kind of Path Planning for UAV based on improvement Q-learning algorithm
CN110515303A (en) * 2019-09-17 2019-11-29 余姚市浙江大学机器人研究中心 A kind of adaptive dynamic path planning method based on DDQN
CN110632931A (en) * 2019-10-09 2019-12-31 哈尔滨工程大学 Mobile robot collision avoidance planning method based on deep reinforcement learning in dynamic environment
CN111079305A (en) * 2019-12-27 2020-04-28 南京航空航天大学 Different-strategy multi-agent reinforcement learning cooperation method based on lambda-reward
CN111619624A (en) * 2020-06-01 2020-09-04 北京全路通信信号研究设计院集团有限公司 Tramcar operation control method and system based on deep reinforcement learning
CN111898728A (en) * 2020-06-02 2020-11-06 东南大学 Team robot decision-making method based on multi-Agent reinforcement learning

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
DONG XU et al.: "Path Planning Method Combining Depth Learning and Sarsa Algorithm", 2017 10th International Symposium on Computational Intelligence and Design *
ENBO LI et al.: "Model learning for two-wheeled robot self-balance control", 2019 IEEE International Conference on Robotics and Biomimetics *
张汝波: "Reinforcement Learning Theory and Applications", Harbin Engineering University Press, 15 April 2001 *
权浩: "Multi-task-oriented path planning and scheduling of warehouse mobile robots", China Master's Theses Full-text Database, Information Science and Technology *
王作为: "Research on behavior learning methods for intelligent robots with cognitive ability", China Doctoral Dissertations Full-text Database, Information Science and Technology *
袁银龙: "Research on deep reinforcement learning algorithms and applications", China Doctoral Dissertations Full-text Database, Information Science and Technology *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115993831A (en) * 2023-03-23 2023-04-21 安徽大学 Method for planning path of robot non-target network based on deep reinforcement learning
CN115993831B (en) * 2023-03-23 2023-06-09 安徽大学 Method for planning path of robot non-target network based on deep reinforcement learning
CN116822765A (en) * 2023-06-02 2023-09-29 东南大学 Q-learning-based agent time sequence task path planning method
CN116822765B (en) * 2023-06-02 2024-08-16 东南大学 Q-learning-based agent time sequence task path planning method

Also Published As

Publication number Publication date
CN113467481B (en) 2022-10-25

Similar Documents

Publication Publication Date Title
CN109945881B (en) Mobile robot path planning method based on ant colony algorithm
CN107272679A (en) Paths planning method based on improved ant group algorithm
CN112329348A (en) Intelligent decision-making method for military countermeasure game under incomplete information condition
CN113467481B (en) Path planning method based on improved Sarsa algorithm
CN111982125A (en) Path planning method based on improved ant colony algorithm
CN113741508B (en) Unmanned aerial vehicle task allocation method based on improved wolf pack algorithm
CN116242383B (en) Unmanned vehicle path planning method based on reinforced Harris eagle algorithm
Bai et al. Adversarial examples construction towards white-box q table variation in dqn pathfinding training
CN110327624A (en) A kind of game follower method and system based on course intensified learning
Bai et al. Variational dynamic for self-supervised exploration in deep reinforcement learning
CN115933693A (en) Robot path planning method based on adaptive chaotic particle swarm algorithm
CN116090549A (en) Knowledge-driven multi-agent reinforcement learning decision-making method, system and storage medium
CN112613608A (en) Reinforced learning method and related device
CN113848911B (en) Mobile robot global path planning method based on Q-learning and RRT
CN116128060A (en) Chess game method based on opponent modeling and Monte Carlo reinforcement learning
CN114815801A (en) Adaptive environment path planning method based on strategy-value network and MCTS
CN114492715A (en) Improved sparrow searching method based on chaotic reverse learning and self-adaptive spiral searching
CN116700258B (en) Intelligent vehicle path planning method based on artificial potential field method and reinforcement learning
CN117523359A (en) Image comparison and identification method and device based on reinforcement learning
CN117522078A (en) Method and system for planning transferable tasks under unmanned system cluster environment coupling
CN117420832A (en) Robot path planning method based on improved GTO
CN116795098A (en) Spherical amphibious robot path planning method based on improved sparrow search algorithm
CN105956680A (en) Frame for generating and managing adaptive rule based on reinforcement learning
CN115759199A (en) Multi-robot environment exploration method and system based on hierarchical graph neural network
Korkmaz A Survey Analyzing Generalization in Deep Reinforcement Learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant