WO2019148645A1 - Partially observable markov decision process-based optimal robot path planning method - Google Patents

Partially observable markov decision process-based optimal robot path planning method

Info

Publication number
WO2019148645A1
WO2019148645A1 · PCT/CN2018/082104 · CN2018082104W
Authority
WO
WIPO (PCT)
Prior art keywords
value
belief state
state
optimal
lower bound
Prior art date
Application number
PCT/CN2018/082104
Other languages
French (fr)
Chinese (zh)
Inventor
刘全
朱斐
钱炜晟
章宗长
Original Assignee
苏州大学张家港工业技术研究院
苏州大学
Priority date
Filing date
Publication date
Application filed by 苏州大学张家港工业技术研究院 and 苏州大学
Publication of WO2019148645A1 publication Critical patent/WO2019148645A1/en

Classifications

    • G: PHYSICS
      • G01: MEASURING; TESTING
        • G01C: MEASURING DISTANCES, LEVELS OR BEARINGS; SURVEYING; NAVIGATION; GYROSCOPIC INSTRUMENTS; PHOTOGRAMMETRY OR VIDEOGRAMMETRY
          • G01C 21/00: Navigation; Navigational instruments not provided for in groups G01C 1/00 - G01C 19/00
            • G01C 21/20: Instruments for performing navigational calculations
              • G01C 21/206: specially adapted for indoor navigation
            • G01C 21/26: specially adapted for navigation in a road network
              • G01C 21/34: Route searching; Route guidance
                • G01C 21/3446: Details of route searching algorithms, e.g. Dijkstra, A*, arc-flags, using precalculated routes
      • G05: CONTROLLING; REGULATING
        • G05D: SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
          • G05D 1/00: Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
            • G05D 1/02: Control of position or course in two dimensions
              • G05D 1/021: specially adapted to land vehicles
                • G05D 1/0212: with means for defining a desired trajectory
                  • G05D 1/0217: in accordance with energy consumption, time reduction or distance reduction criteria
                  • G05D 1/0219: ensuring the processing of the whole working surface
                  • G05D 1/0221: involving a learning process

Definitions

  • The invention relates to the field of robot control, and in particular to an optimal robot path planning method based on a partially observable Markov decision process.
  • Machine Learning (ML) is the discipline that studies how to simulate or implement human learning behaviour and how to continually reorganize and improve an existing knowledge structure.
  • Reinforcement learning is an important research branch of machine learning. It is a machine learning method in which an agent, by interacting with the environment, learns a mapping from states to actions that maximizes the long-term cumulative discounted reward.
  • Reinforcement learning usually adopts Markov decision processes (MDPs) as its model, i.e. the environment is assumed to be fully observable; in the real world, however, uncertainty is ubiquitous.
  • The agent's sensors have their own limitations: (1) a sensor can only probe a limited local portion of the environment, so the agent cannot accurately distinguish different states outside the detection range; (2) sensors are imperfect and their readings are noisy, so probing the same state may yield different results.
  • In the RoboCup robot soccer competition, for example, the agent's vision system distinguishes a viewing angle, an accurate viewing distance, a blurred viewing distance and an invisible range. Only when the observed target lies within the viewing angle and the accurate viewing distance can the agent obtain its exact state; in all other cases only a fuzzy observation is available. When making decisions in complex environments (such as driverless driving), an agent should account for uncertainties such as imperfect control, sensor noise and incomplete environmental knowledge.
  • Partially observable Markov decision processes (POMDPs) provide a general model for the planning (or sequential decision-making) and learning problems of an agent in partially observable stochastic environments.
  • The object of the present invention is to provide an optimal robot path planning method based on a partially observable Markov decision process that reduces the number of updates the algorithm performs along similar search paths in problems with continuous states and large-scale observation spaces, thereby saving computation time and increasing the efficiency of the algorithm. The improved efficiency allows the robot to obtain a better path in the same amount of time.
  • During the search, the agent selects an action and an observation at the current belief state according to different heuristic conditions and obtains the corresponding next belief state. Combining the paths of all searches forms a reachable belief tree. This allows the agent to search only in the reachable belief space, which approximates the infinite belief space, so that continuous-state problems become solvable. Choosing a better heuristic condition therefore makes the searched reachable belief space closer to the true belief space and yields better performance.
  • The heuristic condition used by HSVI2 is to obtain, through simulation, as representative a reachable belief tree as possible. SARSOP builds on HSVI2's reachable belief tree and chooses better heuristic conditions, so that its simulation follows the optimal policy more closely and produces a more representative optimal reachable belief tree.
  • To address the limitations of trial-based search in partially observable Markov decision problems with continuous states and large-scale observation spaces, the Gingko Leaf Search (GLS) algorithm is used in the forward-search phase: it searches not only the most valuable belief state but also, adaptively, belief states that are very close in value to the most valuable one. Without affecting the effect of the value-function update, GLS reduces the number of belief-state updates, reduces the update time and improves the efficiency of the algorithm.
  • To this end, the present invention provides the following technical solution: an optimal robot path planning method based on a partially observable Markov decision process, comprising the steps listed below.
  • The lower bound value of the initial belief state b0 is calculated with a blind policy, and the upper bound value of the initial belief state b0 is calculated with the fast informed bound method.
  • The standard critical values of the lower and upper bounds of the current belief state b are calculated from Q, where Q represents the maximum of the lower-bound values of the action-value function over all actions.
  • The standard critical value of the upper bound of the next belief state and the standard critical value of the lower bound of the next belief state are calculated from U′ and L′, where U′ and L′ are the upper-bound and lower-bound standard critical values of the current belief state, respectively.
  • Owing to the above technical solution, the present invention has the following advantages over the prior art: the invention is based on a partially observable Markov decision process, in which the robot finds an optimal path to the target position; it takes the POMDP model and the SARSOP algorithm as its basis and uses the GLS search method as the heuristic condition during the search.
  • In problems with continuous states and large-scale observation spaces, the invention avoids the repeated updates of the upper and lower bounds of belief states along multiple similar paths that early classical algorithms, which use trials as the heuristic condition, perform, without affecting the final optimal policy. This improves the efficiency of the algorithm, so that in the same amount of time the robot can obtain a better policy and find a better path.
  • Figure 1 is a schematic view showing the layout of the environment of the present invention.
  • FIG. 2 is a search tree formed by a search path obtained by a certain search in the present invention.
  • Figure 3 is a flow chart of the operation of the present invention.
  • The sweeping robot is in the living room on the right; its task is to clean the bedroom on the left. Given the layout of the room, it has to go around the dining table and pass through the door in the middle to enter the bedroom.
  • Distance sensors are mounted evenly on the robot's head; each sensor detects whether there is an obstacle within 1 unit length directly in front of it. There are 256 possible joint detection results. Each sensor returns the correct reading with probability 0.9 and an erroneous reading with probability 0.1.
  • The sweeping robot starts at a random position in the room. Its goal is to reach the bedroom on the left as quickly as possible, and the reward for reaching the target position is +10.
  • The search path includes not only the nodes searched by the early SARSOP algorithm (black solid circles) but also nodes with large search value (open circles).
  • An optimal robot path planning method based on a partially observable Markov decision process includes the steps listed below.
  • The lower bound value of the initial belief state b0 is calculated with a blind policy.
  • The upper bound value of the initial belief state b0 is calculated with the fast informed bound method.
  • The standard critical values of the lower and upper bounds of the current belief state b are calculated from Q, where Q represents the maximum of the lower-bound values of the action-value function over all actions.
  • The optimal action and the optimal observation are calculated by formulas in which the weighting term is the difference between the upper and lower bounds at the corresponding belief state.
  • The standard critical value of the upper bound of the next belief state and the standard critical value of the lower bound of the next belief state are calculated from U′ and L′, where U′ and L′ are the upper-bound and lower-bound standard critical values of the current belief state, respectively.

Landscapes

  • Engineering & Computer Science (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Remote Sensing (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Automation & Control Theory (AREA)
  • Aviation & Aerospace Engineering (AREA)
  • Feedback Control In General (AREA)
  • Manipulator (AREA)

Abstract

Disclosed is a partially observable Markov decision process (POMDP)-based optimal robot path planning method in which a robot searches for an optimal path to a target position. The method takes a POMDP model and the SARSOP algorithm as its basis and uses the GLS search method as the heuristic condition during the search. For problems with continuous states and massive observation spaces, the method avoids the repeated updates of the upper and lower bounds of belief states along multiple similar paths that early classical algorithms, which use trials as the heuristic condition, perform. The final optimal policy is not affected, the efficiency of the algorithm is improved, and the robot can obtain a better policy and find a better path in the same amount of time.

Description

Robot optimal path planning method based on a partially observable Markov decision process
Technical field
The invention relates to the field of robot control, and in particular to an optimal robot path planning method based on a partially observable Markov decision process.
Background art
Machine Learning (ML) is the discipline that studies how to simulate or implement human learning behaviour and how to continually reorganize and improve an existing knowledge structure. Reinforcement learning is an important research branch of machine learning: it is a machine learning method in which an agent, by interacting with the environment, learns a mapping from states to actions that maximizes the long-term cumulative discounted reward. Reinforcement learning usually adopts Markov decision processes (MDPs) as its model, i.e. the environment is assumed to be fully observable. In the real world, however, uncertainty is ubiquitous. The agent's sensors, for example, have their own limitations: (1) a sensor can only probe a limited local portion of the environment, so the agent cannot accurately distinguish different states outside the detection range; (2) sensors are imperfect and their readings are noisy, so probing the same state may yield different results. In the RoboCup robot soccer competition, for example, the agent's vision system distinguishes a viewing angle, an accurate viewing distance, a blurred viewing distance and an invisible range. Only when the observed target lies within the viewing angle and the accurate viewing distance can the agent obtain its exact state; in all other cases only a fuzzy observation is available. When making decisions in complex environments (such as driverless driving), an agent should account for uncertainties such as imperfect control, sensor noise and incomplete environmental knowledge.
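For reference, the "maximum long-term cumulative discounted reward" mentioned above is the standard reinforcement-learning objective, written here in standard notation (not reproduced from the patent):

```latex
\max_{\pi}\ \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\, r_{t}\right], \qquad 0 \le \gamma < 1,
```

where π maps states (or, in a POMDP, belief states) to actions, r_t is the reward received at step t, and γ is the discount rate (set to 0.95 in step S1 below).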
Partially observable Markov decision processes (POMDPs) provide a general model for the planning (or sequential decision-making) and learning problems of an agent in partially observable stochastic environments. Over the past decade, research on POMDP planning has achieved remarkable results. These results all use trial-based asynchronous value iteration algorithms in the heuristic search phase, for example HSVI2, SARSOP and FSVI. When searching forward, these algorithms expand only the node with the greatest value. Trial-based search, however, selects the optimal action and observation at every step and ignores other observations that are very close to the optimal one and have an important influence on the future performance of the algorithm. On problems with large-scale observation spaces, the performance of such algorithms is therefore poor.
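To make the model used in step S1 concrete, the following Python sketch defines a discrete POMDP with the signatures used below, together with the standard belief update; the class and function names are illustrative assumptions, not code from the patent.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class POMDP:
    """Discrete POMDP (X, U, Z, f, rho, Omega, gamma) with the signatures used in S1."""
    n_states: int            # |X|
    n_actions: int           # |U|
    n_observations: int      # |Z|
    f: np.ndarray            # f[a, s, s']     = P(s' | s, a)
    rho: np.ndarray          # rho[s, a]       = immediate reward
    omega: np.ndarray        # omega[a, s', o] = P(o | s', a)
    gamma: float = 0.95      # discount rate, as set in S1


def belief_update(model: POMDP, b: np.ndarray, a: int, o: int) -> np.ndarray:
    """Standard belief update b' = tau(b, a, o): predict with f, correct with omega."""
    predicted = b @ model.f[a]                      # P(s' | b, a)
    unnormalized = predicted * model.omega[a, :, o]
    norm = unnormalized.sum()
    return unnormalized / norm if norm > 0.0 else predicted
```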
Summary of the invention
The object of the present invention is to provide an optimal robot path planning method based on a partially observable Markov decision process that reduces the number of updates the algorithm performs along similar search paths in problems with continuous states and large-scale observation spaces, thereby saving computation time and increasing the efficiency of the algorithm. The improved efficiency allows the robot to obtain a better path in the same amount of time.
In problems with large-scale observation spaces, certain nodes whose value is very close to the maximum also play a very important role in the future performance of the algorithm. If the value at some belief states can be updated more often than at others, the update scheme is called asynchronous value iteration. In the reachable belief space R(b0), the accuracy of the value function at b0 is usually more important than at other belief states, so asynchronous value iteration can be used for the POMDP problem. Trial-based search is a classic asynchronous value iteration method: each trial starts from the initial belief state b0, searches down to a leaf belief state and obtains a path without branches. During the search, the agent selects an action and an observation at the current belief state according to different heuristic conditions and obtains the corresponding next belief state. Combining the paths of all trials forms a reachable belief tree. This allows the agent to search only in the reachable belief space, which approximates the infinite belief space, so that continuous-state problems become solvable. Choosing a better heuristic condition therefore makes the searched reachable belief space closer to the true belief space and yields better performance. The heuristic condition used by HSVI2 is to obtain, through simulation, as representative a reachable belief tree as possible. SARSOP builds on HSVI2's reachable belief tree and chooses better heuristic conditions, so that its simulation follows the optimal policy more closely and produces a more representative optimal reachable belief tree.
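The trial-based search described above can be summarised by the following skeleton of a single trial in the HSVI2/SARSOP style; the helper callables are assumed to be supplied elsewhere and are named for illustration only.

```python
def trial_based_search(b0, select_action, select_observation, next_belief,
                       update_bounds, should_stop, max_depth=200):
    """One trial of trial-based (asynchronous) value iteration: walk from the initial
    belief b0 to a leaf along heuristically chosen actions and observations, then
    back up the value bounds along the visited path."""
    path, b, depth = [], b0, 0
    while not should_stop(b, depth) and depth < max_depth:
        a = select_action(b)            # heuristic choice, e.g. the action with max upper bound
        o = select_observation(b, a)    # heuristic choice among the observations
        path.append(b)
        b = next_belief(b, a, o)        # tau(b, a, o): the next belief state on the path
        depth += 1
    for visited in reversed(path):      # combining trials builds the reachable belief tree
        update_bounds(visited)          # tighten the bounds at each visited belief
```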
To address the limitations of trial-based search in partially observable Markov decision problems with continuous states and large-scale observation spaces, the present invention adopts the Gingko Leaf Search (GLS) algorithm: in the forward-search phase it searches not only the most valuable belief state but also, adaptively, belief states that are very close in value to the most valuable one. Without affecting the effect of the value-function update, GLS reduces the number of belief-state updates, reduces the update time and improves the efficiency of the algorithm.
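The distinguishing idea of GLS, expanding not only the single most valuable successor but also those whose heuristic score is very close to it, can be sketched as follows; the scoring input and the closeness tolerance are illustrative assumptions, not values from the patent.

```python
def gls_candidate_observations(scores, tolerance=0.05):
    """Return all observations whose heuristic score is within `tolerance` of the best
    one, so that branches very close to the most valuable one are also expanded.

    scores: mapping observation index -> heuristic score, e.g. P(o | b, a) weighted
    by the bound gap at the successor belief."""
    best = max(scores.values())
    return [o for o, s in scores.items() if s >= best - tolerance]


# Example: observation 2 scores best; observation 0 is close enough to be expanded too.
print(gls_candidate_observations({0: 0.58, 1: 0.20, 2: 0.61}))   # [0, 2]
```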
To achieve the above object, the present invention provides the following technical solution: an optimal robot path planning method based on a partially observable Markov decision process, comprising the following steps:
S1. Initialize the model and the environment: set the state transition function f: X×U×X→[0,1], the reward function ρ: X×U→ℝ and the observation function Ω: U×X×Z→[0,1] of the environment, where X is the state set, U is the action set and Z is the observation set; set the discount rate γ to 0.95; set the position of the robot; set the initial values of the initial belief state b0, namely the standard critical value L of the lower bound and the standard critical value U = L + ε of the upper bound, where ε is a pre-specified threshold criterion; compute the upper bound value and the lower bound value of the initial belief state b0; go to S2;
S2. Set the initial belief state b0 as the current belief state b; go to S3;
S3. Predict the optimal value of the current belief state b; go to S4;
S4. Determine whether the current belief state b satisfies the two termination conditions given by the corresponding formulas, where d_b is the depth of the current belief state b; if they are satisfied, go to S13, otherwise go to S5;
S5. Calculate, under the current belief state b, the lower bound value of the value function for each action and take the maximum Q of these lower bound values; update the standard critical value U′ of the upper bound and the standard critical value L′ of the lower bound of the current belief state b; go to S6;
S6. Calculate the optimal action and the observation that contributes most to the initial belief state b0, and record the total number of observations count; go to S7;
S7. Select the observations in the observation set in order; if count is not 0, go to S8, otherwise go to S11;
S8. Decrease count by 1; go to S9;
S9. Determine whether the currently selected observation is worth exploring; if so, go to S10, otherwise go to S7;
S10. Calculate the standard critical value of the upper bound and the standard critical value of the lower bound of the next belief state, and obtain the upper bound value and the lower bound value of the next belief state; go to S7;
S11. Update the upper bound value and the lower bound value of the current belief state; go to S12;
S12. Select the optimal action to enter the next belief state, set the next belief state as the current belief state, and go to S3;
S13. Obtain the optimal policy, and obtain the optimal path of the robot according to the optimal policy.
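Read as pseudocode, steps S1 to S13 describe one search routine. The following Python skeleton mirrors that control flow under the assumption that the belief and bound operations named below are provided elsewhere; it is a sketch, not the patent's implementation.

```python
def plan(b0, model, bounds, eps, max_depth=200):
    """Skeleton of the S1-S13 loop. `model` supplies belief updates; `bounds`
    maintains upper/lower value bounds and the per-belief critical values."""
    b, depth = b0, 0                                                   # S2
    while depth < max_depth:
        v_hat = bounds.predict_optimal_value(b)                        # S3
        if bounds.termination_conditions_met(b, v_hat, depth, eps):    # S4
            break
        bounds.update_critical_values(b)                               # S5: uses the max lower-bound action value
        action, observations = bounds.optimal_action_and_observations(b)   # S6
        for o in observations:                                         # S7 / S8: walk the observation set
            if bounds.worth_exploring(b, action, o):                   # S9
                bounds.init_next_critical_values(b, action, o)         # S10
        bounds.backup(b)                                               # S11: update upper and lower bound at b
        b = model.next_belief(b, action, bounds.chosen_observation(b, action))  # S12
        depth += 1
    return bounds.greedy_policy()                                      # S13: optimal policy -> optimal path
```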
In the above technical solution, in S2, the lower bound value of the initial belief state b0 is calculated with a blind policy, and the upper bound value of the initial belief state b0 is calculated with the fast informed bound method.
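The blind-policy lower bound and the fast informed bound upper bound are standard POMDP initialisations; the sketch below computes both for a discrete model with arrays T (transitions), R (rewards) and O (observations), following the conventions of the model sketch given earlier and not the patent's own code.

```python
import numpy as np


def blind_lower_bound(T, R, gamma, iters=200):
    """Blind-policy lower bound: value of repeating one fixed action forever.
    T: (A, S, S), R: (S, A). Returns alpha vectors of shape (A, S)."""
    A, S, _ = T.shape
    alpha = np.zeros((A, S))
    for _ in range(iters):
        for a in range(A):
            alpha[a] = R[:, a] + gamma * T[a] @ alpha[a]
    return alpha                       # lower bound at belief b: max_a b @ alpha[a]


def fib_upper_bound(T, R, O, gamma, iters=200):
    """Fast informed bound upper bound. O: (A, S', Z). Returns Q of shape (S, A)."""
    A, S, _ = T.shape
    Z = O.shape[2]
    Q = R.astype(float).copy()
    for _ in range(iters):
        newQ = np.empty_like(Q)
        for a in range(A):
            acc = np.zeros(S)
            for o in range(Z):
                # weight[s, s'] = P(s'|s,a) * P(o|s',a); pick the best follow-up action
                weight = T[a] * O[a, :, o][None, :]
                acc += np.max(weight @ Q, axis=1)
            newQ[:, a] = R[:, a] + gamma * acc
        Q = newQ
    return Q                           # upper bound at belief b: max_a b @ Q[:, a]
```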
In the above technical solution, in S5, the standard critical values of the lower and upper bounds of the current belief state b are calculated by the corresponding formulas, where Q represents the maximum of the lower-bound values of the action-value function over all actions.
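The critical-value formulas themselves appear only as equation images. In SARSOP-style samplers the corresponding updates are commonly L′ = max(L, Q) and U′ = max(U, Q + ε·γ^(−d_b)); the sketch below follows that convention as an assumption, with Q the maximum lower-bound action value and all model accessors assumed to exist.

```python
def q_lower(b, a, model, v_lower):
    """Lower-bound action value: rho(b, a) + gamma * sum_o P(o | b, a) * V_lower(tau(b, a, o))."""
    value = model.expected_reward(b, a)
    for o in model.observations:
        p_o = model.observation_prob(b, a, o)
        if p_o > 0.0:
            value += model.gamma * p_o * v_lower(model.next_belief(b, a, o))
    return value


def critical_values(b, depth, L, U, eps, model, v_lower):
    """Assumed SARSOP-style S5 update: Q = max_a q_lower(b, a), then
    L' = max(L, Q) and U' = max(U, Q + eps * gamma ** (-depth))."""
    q_best = max(q_lower(b, a, model, v_lower) for a in model.actions)
    return max(L, q_best), max(U, q_best + eps * model.gamma ** (-depth))
```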
In the above technical solution, in S6, the optimal action is calculated by the corresponding formula, and the optimal observation is calculated by a formula whose weighting term is the difference between the upper and lower bounds at the successor belief state.
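The action and observation selection formulas are likewise shown only as images. A common HSVI2/SARSOP-style reading, consistent with the statement that the weighting term is the bound gap at the successor belief, is sketched below as an assumption.

```python
def select_action_and_observation(b, model, q_upper, gap):
    """Assumed HSVI2/SARSOP-style reading of S6: take the action with the largest
    upper-bound action value, then the observation whose probability-weighted bound
    gap at the successor belief is largest.

    q_upper(b, a) -> upper-bound action value; gap(b) -> upper bound minus lower bound at b."""
    a_star = max(model.actions, key=lambda a: q_upper(b, a))

    def weighted_gap(o):
        successor = model.next_belief(b, a_star, o)
        return model.observation_prob(b, a_star, o) * gap(successor)

    o_star = max(model.observations, key=weighted_gap)
    return a_star, o_star
```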
In the above technical solution, in S9, the criterion for deciding whether the currently selected observation is worth exploring is given by the corresponding formula, where ζ is a threshold function.
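The exploration criterion is shown only as an equation image; a natural reading, sketched below as an assumption, is that an observation is explored when its probability-weighted bound gap at the successor belief exceeds the threshold function ζ evaluated at the current depth.

```python
def worth_exploring(b, a, o, model, gap, zeta, depth):
    """Assumed reading of the S9 criterion (the exact formula is given only as an image):
    explore observation o when the probability-weighted bound gap of the successor belief
    exceeds the threshold function zeta at the current depth."""
    successor = model.next_belief(b, a, o)
    score = model.observation_prob(b, a, o) * gap(successor)
    return score > zeta(depth)
```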
In the above technical solution, in S10, the standard critical value of the upper bound of the next belief state and the standard critical value of the lower bound of the next belief state are calculated by the corresponding formulas, where U′ and L′ are the upper-bound and lower-bound standard critical values of the current belief state, respectively.
Owing to the above technical solution, the present invention has the following advantages over the prior art: the invention is based on a partially observable Markov decision process, in which the robot finds an optimal path to the target position; it takes the POMDP model and the SARSOP algorithm as its basis and uses the GLS search method as the heuristic condition during the search. In problems with continuous states and large-scale observation spaces, the invention avoids the repeated updates of the upper and lower bounds of belief states along multiple similar paths that early classical algorithms, which use trials as the heuristic condition, perform, without affecting the final optimal policy. This improves the efficiency of the algorithm, so that in the same amount of time the robot can obtain a better policy and find a better path.
Brief description of the drawings
Figure 1 is a schematic view of the layout of the environment of the present invention.
Figure 2 is the search tree formed by the search paths obtained in one search of the present invention.
Figure 3 is a flow chart of the operation of the present invention.
Detailed description of the embodiments
The present invention is further described below in conjunction with its principles, the drawings and the embodiments.
Referring to Figure 1, the sweeping robot is in the living room on the right and its task is to clean the bedroom on the left. Given the layout of the room, it has to go around the dining table and pass through the door in the middle to enter the bedroom. Distance sensors are mounted evenly on the robot's head; each sensor detects whether there is an obstacle within 1 unit length directly in front of it, and there are 256 possible joint detection results. Each sensor returns the correct reading with probability 0.9 and an erroneous reading with probability 0.1. The sweeping robot starts at a random position in the room; its goal is to reach the bedroom on the left as quickly as possible, and the reward for reaching the target position is +10.
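The observation model of this example can be written down directly: the 256 joint detection results correspond to 8 binary obstacle sensors (2^8 = 256), each of which reports correctly with probability 0.9. The following sketch computes the probability of one joint reading; the function and argument names are illustrative.

```python
import numpy as np


def observation_probability(true_bits, observed_bits, p_correct=0.9):
    """Probability of one joint sensor reading in the sweeping-robot example:
    each of the 8 binary sensors independently reports the true bit with
    probability 0.9 and flips it with probability 0.1."""
    true_bits = np.asarray(true_bits)
    observed_bits = np.asarray(observed_bits)
    matches = (true_bits == observed_bits)
    return float(np.prod(np.where(matches, p_correct, 1.0 - p_correct)))


# Example: one of the eight sensors misreads, giving 0.9**7 * 0.1.
print(observation_probability([1, 0, 0, 1, 0, 0, 0, 1],
                              [1, 0, 0, 1, 0, 0, 0, 0]))
```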
Referring to Figure 2, in one search the search path contains not only the nodes expanded by the early SARSOP algorithm (black solid circles) but also nodes with large search value (open circles).
Referring to Figure 3, an optimal robot path planning method based on a partially observable Markov decision process includes the following steps:
S1. Initialize the model and the environment: set the state transition function f: X×U×X→[0,1], the reward function ρ: X×U→ℝ and the observation function Ω: U×X×Z→[0,1] of the environment, where X is the state set, U is the action set and Z is the observation set; set the discount rate γ to 0.95; set the position of the robot; set the initial values of the initial belief state b0, namely the standard critical value L of the lower bound and the standard critical value U = L + ε of the upper bound, where ε is a pre-specified threshold criterion; compute the upper bound value and the lower bound value of the initial belief state b0; go to S2;
S2. Set the initial belief state b0 as the current belief state b; go to S3;
S3. Predict the optimal value of the current belief state b; go to S4;
S4. Determine whether the current belief state b satisfies the two termination conditions given by the corresponding formulas, where d_b is the depth of the current belief state b; if they are satisfied, go to S13, otherwise go to S5;
S5. Calculate, under the current belief state b, the lower bound value of the value function for each action and take the maximum Q of these lower bound values; update the standard critical value U′ of the upper bound and the standard critical value L′ of the lower bound of the current belief state b; go to S6;
S6. Calculate the optimal action and the observation that contributes most to the initial belief state b0, and record the total number of observations count; go to S7;
S7. Select the observations in the observation set in order; if count is not 0, go to S8, otherwise go to S11;
S8. Decrease count by 1; go to S9;
S9. Determine whether the currently selected observation is worth exploring; if so, go to S10, otherwise go to S7;
S10. Calculate the standard critical value of the upper bound and the standard critical value of the lower bound of the next belief state, and obtain the upper bound value and the lower bound value of the next belief state; go to S7;
S11. Update the upper bound value and the lower bound value of the current belief state; go to S12;
S12. Select the optimal action to enter the next belief state, set the next belief state as the current belief state, and go to S3;
S13. Obtain the optimal policy, and obtain the optimal path of the robot according to the optimal policy.
In S2, the lower bound value of the initial belief state b0 is calculated with a blind policy, and the upper bound value of the initial belief state b0 is calculated with the fast informed bound method.
In S5, the standard critical values of the lower and upper bounds of the current belief state b are calculated by the corresponding formulas, where Q represents the maximum of the lower-bound values of the action-value function over all actions.
In S6, the optimal action is calculated by the corresponding formula, and the optimal observation is calculated by a formula whose weighting term is the difference between the upper and lower bounds at the successor belief state.
In S9, the criterion for deciding whether the currently selected observation is worth exploring is given by the corresponding formula, where ζ is a threshold function.
In S10, the standard critical value of the upper bound of the next belief state and the standard critical value of the lower bound of the next belief state are calculated by the corresponding formulas, where U′ and L′ are the upper-bound and lower-bound standard critical values of the current belief state, respectively.
The above description of the disclosed embodiments enables those skilled in the art to make or use the present invention. Various modifications to these embodiments will be apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the invention. Therefore, the present invention is not limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (6)

  1. An optimal robot path planning method based on a partially observable Markov decision process, characterized by comprising the following steps:
    S1. Initialize the model and the environment: set the state transition function f: X×U×X→[0,1], the reward function ρ: X×U→ℝ and the observation function Ω: U×X×Z→[0,1] of the environment, where X is the state set, U is the action set and Z is the observation set; set the discount rate γ to 0.95; set the position of the robot; set the initial values of the initial belief state b0, namely the standard critical value L of the lower bound and the standard critical value U = L + ε of the upper bound, where ε is a pre-specified threshold criterion; compute the upper bound value and the lower bound value of the initial belief state b0; go to S2;
    S2. Set the initial belief state b0 as the current belief state b; go to S3;
    S3. Predict the optimal value of the current belief state b; go to S4;
    S4. Determine whether the current belief state b satisfies the two termination conditions given by the corresponding formulas, where d_b is the depth of the current belief state b; if they are satisfied, go to S13, otherwise go to S5;
    S5. Calculate, under the current belief state b, the lower bound value of the value function for each action and take the maximum Q of these lower bound values; update the standard critical value U′ of the upper bound and the standard critical value L′ of the lower bound of the current belief state b; go to S6;
    S6. Calculate the optimal action and the observation that contributes most to the initial belief state b0, and record the total number of observations count; go to S7;
    S7. Select the observations in the observation set in order; if count is not 0, go to S8, otherwise go to S11;
    S8. Decrease count by 1; go to S9;
    S9. Determine whether the currently selected observation is worth exploring; if so, go to S10, otherwise go to S7;
    S10. Calculate the standard critical value of the upper bound and the standard critical value of the lower bound of the next belief state, and obtain the upper bound value and the lower bound value of the next belief state; go to S7;
    S11. Update the upper bound value and the lower bound value of the current belief state; go to S12;
    S12. Select the optimal action to enter the next belief state, set the next belief state as the current belief state, and go to S3;
    S13. Obtain the optimal policy, and obtain the optimal path of the robot according to the optimal policy.
  2. The optimal robot path planning method based on a partially observable Markov decision process according to claim 1, characterized in that in S2 the lower bound value of the initial belief state b0 is calculated with a blind policy and the upper bound value of the initial belief state b0 is calculated with the fast informed bound method.
  3. The optimal robot path planning method based on a partially observable Markov decision process according to claim 1, characterized in that in S5 the standard critical values of the lower and upper bounds of the current belief state b are calculated by the corresponding formulas, where Q represents the maximum of the lower-bound values of the action-value function over all actions.
  4. The optimal robot path planning method based on a partially observable Markov decision process according to claim 1, characterized in that in S6 the optimal action is calculated by the corresponding formula and the optimal observation is calculated by a formula whose weighting term is the difference between the upper and lower bounds at the corresponding belief state.
  5. The optimal robot path planning method based on a partially observable Markov decision process according to claim 1, characterized in that in S9 the criterion for deciding whether the currently selected observation is worth exploring is given by the corresponding formula, where ζ is a threshold function.
  6. The optimal robot path planning method based on a partially observable Markov decision process according to claim 1, characterized in that in S10 the standard critical value of the upper bound of the next belief state and the standard critical value of the lower bound of the next belief state are calculated by the corresponding formulas, where U′ and L′ are the upper-bound and lower-bound standard critical values of the current belief state, respectively.
PCT/CN2018/082104 2018-02-01 2018-04-08 Partially observable markov decision process-based optimal robot path planning method WO2019148645A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201810102240.0 2018-02-01
CN201810102240.0A CN108680155B (en) 2018-02-01 2018-02-01 Robot optimal path planning method based on partial perception Markov decision process

Publications (1)

Publication Number Publication Date
WO2019148645A1 (en) 2019-08-08

Family

ID=63800139

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/082104 WO2019148645A1 (en) 2018-02-01 2018-04-08 Partially observable markov decision process-based optimal robot path planning method

Country Status (2)

Country Link
CN (1) CN108680155B (en)
WO (1) WO2019148645A1 (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110716554B (en) * 2019-11-12 2020-08-14 华育昌(肇庆)智能科技研究有限公司 Vision-based household robot
CN111026110B (en) * 2019-11-20 2021-04-30 北京理工大学 Uncertain action planning method for linear time sequence logic containing soft and hard constraints
CN110989602B (en) * 2019-12-12 2023-12-26 齐鲁工业大学 Autonomous guided vehicle path planning method and system in medical pathology inspection laboratory
CN111311384A (en) * 2020-05-15 2020-06-19 支付宝(杭州)信息技术有限公司 Method and system for training recommendation model
CN111896006B (en) * 2020-08-11 2022-10-04 燕山大学 Path planning method and system based on reinforcement learning and heuristic search
CN112356031B (en) * 2020-11-11 2022-04-01 福州大学 On-line planning method based on Kernel sampling strategy under uncertain environment
US12118441B2 (en) 2021-12-13 2024-10-15 International Business Machines Corporation Knowledge augmented sequential decision-making under uncertainty

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102929281A (en) * 2012-11-05 2013-02-13 西南科技大学 Robot k-nearest-neighbor (kNN) path planning method under incomplete perception environment
US20170364831A1 (en) * 2016-06-21 2017-12-21 Sri International Systems and methods for machine learning using a trusted model
CN107368076A (en) * 2017-07-31 2017-11-21 中南大学 Robot motion's pathdepth learns controlling planning method under a kind of intelligent environment

Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111007858A (en) * 2019-12-23 2020-04-14 北京三快在线科技有限公司 Training method of vehicle driving decision model, and driving decision determining method and device
CN111367317A (en) * 2020-03-27 2020-07-03 中国人民解放军国防科技大学 Unmanned aerial vehicle cluster online task planning method based on Bayesian learning
CN112131754A (en) * 2020-09-30 2020-12-25 中国人民解放军国防科技大学 Extended POMDP planning method and system based on robot accompanying behavior model
CN113189986B (en) * 2021-04-16 2023-03-14 中国人民解放军国防科技大学 Two-stage self-adaptive behavior planning method and system for autonomous robot
CN113189986A (en) * 2021-04-16 2021-07-30 中国人民解放军国防科技大学 Two-stage self-adaptive behavior planning method and system for autonomous robot
CN113433937A (en) * 2021-06-08 2021-09-24 杭州未名信科科技有限公司 Heuristic exploration-based layered navigation obstacle avoidance system and layered navigation obstacle avoidance method
CN114118441A (en) * 2021-11-24 2022-03-01 福州大学 Online planning method based on efficient search strategy under uncertain environment
CN114118441B (en) * 2021-11-24 2024-10-22 福州大学 Online planning method based on efficient search strategy in uncertainty environment
CN114253265B (en) * 2021-12-17 2023-10-20 成都朴为科技有限公司 On-time arrival probability maximum path planning algorithm and system based on fourth-order moment
CN114253265A (en) * 2021-12-17 2022-03-29 成都朴为科技有限公司 On-time arrival probability maximum path planning algorithm and system based on fourth-order moment
CN114355973A (en) * 2021-12-28 2022-04-15 哈尔滨工程大学 Multi-agent hierarchical reinforcement learning-based unmanned cluster cooperation method under weak observation condition
CN114355973B (en) * 2021-12-28 2023-12-08 哈尔滨工程大学 Unmanned cluster cooperation method based on multi-agent layered reinforcement learning under weak observation condition
CN114676935A (en) * 2022-04-26 2022-06-28 中国人民解放军国防科技大学 Multi-Agent system mixed task planning algorithm under uncertain environment
CN114676935B (en) * 2022-04-26 2024-08-16 中国人民解放军国防科技大学 Multi-Agent system mixed task planning algorithm under uncertain environment
CN114880946A (en) * 2022-05-31 2022-08-09 苏州大学 Intelligent agent random exploration method based on flight strategy
CN115952892A (en) * 2022-12-13 2023-04-11 广东工业大学 Branch iteration method and system for defective plate stock layout

Also Published As

Publication number Publication date
CN108680155A (en) 2018-10-19
CN108680155B (en) 2020-09-08

Similar Documents

Publication Publication Date Title
WO2019148645A1 (en) Partially observable markov decision process-based optimal robot path planning method
CN104571113B (en) The paths planning method of mobile robot
Zhu et al. Online minimax Q network learning for two-player zero-sum Markov games
CN105740644B (en) Cleaning robot optimal target path planning method based on model learning
Zheng et al. Robust bayesian inverse reinforcement learning with sparse behavior noise
CN106444769B (en) A kind of optimum path planning method of indoor mobile robot increment type environmental information sampling
CN104298239B (en) A kind of indoor mobile robot strengthens map study paths planning method
Bode Neural networks for cost estimation: simulations and pilot application
Chatterjee et al. A Geese PSO tuned fuzzy supervisor for EKF based solutions of simultaneous localization and mapping (SLAM) problems in mobile robots
CN110378439A (en) Single robot path planning method based on Q-Learning algorithm
CN108469827A (en) A kind of automatic guided vehicle global path planning method suitable for logistic storage system
CN109241291A (en) Knowledge mapping optimal path inquiry system and method based on deeply study
CN110174118A (en) Robot multiple-objective search-path layout method and apparatus based on intensified learning
Zhang et al. Learning novel policies for tasks
Kollar et al. Efficient Optimization of Information-Theoretic Exploration in SLAM.
CN109782270B (en) Data association method under multi-sensor multi-target tracking condition
Wang et al. On the convergence of the monte carlo exploring starts algorithm for reinforcement learning
CN112356031A (en) On-line planning method based on Kernel sampling strategy under uncertain environment
Janmohamed et al. Improving the Data Efficiency of Multi-Objective Quality-Diversity through Gradient Assistance and Crowding Exploration
CN114139675B (en) Method for improving selection reliability and action accuracy in intelligent agent control
CN113741416B (en) Multi-robot full-coverage path planning method based on improved predator prey model and DMPC
CN111814908B (en) Abnormal data detection model updating method and device based on data flow
CN114118441A (en) Online planning method based on efficient search strategy under uncertain environment
Zhou et al. Switching deep reinforcement learning based intelligent online decision making for autonomous systems under uncertain environment
Gao et al. Partial Consistency for Stabilizing Undiscounted Reinforcement Learning

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18903071

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 18903071

Country of ref document: EP

Kind code of ref document: A1

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 26/01/2021)