WO2019148645A1 - Partially observable Markov decision process-based optimal robot path planning method - Google Patents
Partially observable Markov decision process-based optimal robot path planning method
- Publication number
- WO2019148645A1 (PCT/CN2018/082104)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- value
- belief state
- state
- optimal
- lower bound
- Prior art date
Classifications
-
- G—PHYSICS
- G01—MEASURING; TESTING
- G01C—MEASURING DISTANCES, LEVELS OR BEARINGS; SURVEYING; NAVIGATION; GYROSCOPIC INSTRUMENTS; PHOTOGRAMMETRY OR VIDEOGRAMMETRY
- G01C21/00—Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00
- G01C21/20—Instruments for performing navigational calculations
- G01C21/206—Instruments for performing navigational calculations specially adapted for indoor navigation
-
- G—PHYSICS
- G05—CONTROLLING; REGULATING
- G05D—SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
- G05D1/00—Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
- G05D1/02—Control of position or course in two dimensions
- G05D1/021—Control of position or course in two dimensions specially adapted to land vehicles
- G05D1/0212—Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory
- G05D1/0217—Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory in accordance with energy consumption, time reduction or distance reduction criteria
-
- G—PHYSICS
- G05—CONTROLLING; REGULATING
- G05D—SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
- G05D1/00—Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
- G05D1/02—Control of position or course in two dimensions
- G05D1/021—Control of position or course in two dimensions specially adapted to land vehicles
- G05D1/0212—Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory
- G05D1/0219—Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory ensuring the processing of the whole working surface
-
- G—PHYSICS
- G05—CONTROLLING; REGULATING
- G05D—SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
- G05D1/00—Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
- G05D1/02—Control of position or course in two dimensions
- G05D1/021—Control of position or course in two dimensions specially adapted to land vehicles
- G05D1/0212—Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory
- G05D1/0221—Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory involving a learning process
-
- G—PHYSICS
- G01—MEASURING; TESTING
- G01C—MEASURING DISTANCES, LEVELS OR BEARINGS; SURVEYING; NAVIGATION; GYROSCOPIC INSTRUMENTS; PHOTOGRAMMETRY OR VIDEOGRAMMETRY
- G01C21/00—Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00
- G01C21/26—Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00 specially adapted for navigation in a road network
- G01C21/34—Route searching; Route guidance
- G01C21/3446—Details of route searching algorithms, e.g. Dijkstra, A*, arc-flags, using precalculated routes
Landscapes
- Engineering & Computer Science (AREA)
- Radar, Positioning & Navigation (AREA)
- Remote Sensing (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Automation & Control Theory (AREA)
- Aviation & Aerospace Engineering (AREA)
- Feedback Control In General (AREA)
- Manipulator (AREA)
Abstract
Disclosed is an optimal robot path planning method based on a partially observable Markov decision process (POMDP). The robot searches for an optimal path to a target position, taking the POMDP model and the SARSOP algorithm as a basis and using the GLS search method as the heuristic condition during the search. For continuous-state, large-scale observation space problems, the method reduces the number of upper- and lower-bound updates of the belief state along the many similar paths that earlier classical algorithms, which use trials as the heuristic condition, update repeatedly. The final optimal policy is unaffected, algorithm efficiency is improved, and the robot obtains a better policy and finds a better path in the same amount of time.
Description
The invention relates to the field of robot control, and in particular to an optimal robot path planning method based on a partially observable Markov decision process.
Machine Learning (ML) is a discipline that studies how to simulate or implement human learning behavior and to continuously reorganize and improve an existing knowledge structure. Reinforcement learning is an important research branch of machine learning: through interaction between an agent and its environment, it maps states to actions so as to maximize the long-term cumulative discounted reward. Reinforcement learning usually adopts Markov Decision Processes (MDPs) as the model, i.e., the environment is assumed to be fully observable. In the real world, however, uncertainty is ubiquitous. For example, an agent's sensors have their own limitations: (1) a sensor can only probe a limited local environment, so the agent cannot accurately distinguish between different states outside its detection range; (2) sensors are imperfect and their readings are noisy, so the agent may obtain different results when probing the same state. In the robot soccer competition RoboCup, for instance, an agent's vision system has a viewing angle, an accurate visual range, a blurred visual range and an invisible range. Only when the observed target is within the viewing angle and the accurate visual range can the agent obtain its exact state; in all other cases only a fuzzy observation is available. When making decisions in complex environments (such as autonomous driving), an agent should therefore take into account uncertainties such as imperfect control, sensor noise and incomplete environmental knowledge.
Partially Observable Markov Decision Processes (POMDPs) provide a general model for the planning (or sequential decision-making) and learning problems of an agent in partially observable stochastic environments. Over the past decade, research on POMDP planning has produced remarkable results. These approaches all adopt trial-based asynchronous value iteration algorithms in the heuristic search phase, for example HSVI2, SARSOP and FSVI. When searching forward, these algorithms only expand the node with the greatest value. However, a trial-based search selects the single optimal action and observation at each step, without considering other observations that are very close to the optimal one and have an important influence on the future performance of the algorithm. On problems with large-scale observation spaces, the performance of such algorithms is therefore poor.
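For reference, the discrete POMDP assumed throughout this document is described by the transition, reward and observation functions that step S1 below initializes (f: X×U×X→[0,1], ρ: X×U→ℝ, Ω: U×X×Z→[0,1], discount rate γ). A minimal tabular container is sketched here purely as an illustration; the array names and layout are assumptions of this sketch, not part of the patented method.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class TabularPOMDP:
    """Tabular POMDP with |X| states, |U| actions and |Z| observations."""
    T: np.ndarray        # T[a, s, s'] = f(s, a, s'), state-transition probabilities
    R: np.ndarray        # R[s, a]     = rho(s, a),   immediate reward
    O: np.ndarray        # O[a, s', z] = Omega(a, s', z), observation probabilities
    gamma: float = 0.95  # discount rate, as set in step S1

    @property
    def n_states(self):
        return self.T.shape[1]

    def validate(self):
        # Each T[a, s, :] and O[a, s', :] must be a probability distribution.
        assert np.allclose(self.T.sum(axis=2), 1.0)
        assert np.allclose(self.O.sum(axis=2), 1.0)
```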
Summary of the Invention
The object of the present invention is to provide an optimal robot path planning method based on a partially observable Markov decision process that reduces the number of updates the algorithm performs along similar search paths in continuous-state, large-scale observation space problems, thereby saving computation time and increasing the efficiency of the algorithm. The improved efficiency allows the robot to obtain a better path in the same amount of time.
In large-scale observation space problems, certain nodes whose value is very close to the maximum also play an important role in the future performance of the algorithm. If some belief states can be updated more often than others, the update scheme is called asynchronous value iteration. In the reachable belief space R(b_0), the accuracy of the value function at b_0 is usually more important than at other belief states, so asynchronous value iteration can be used for POMDP problems. Trial-based search is a classic asynchronous value iteration method: each trial starts from the initial belief state b_0, searches down to a leaf belief state, and yields a single path without branches. During the search, the agent selects an action and an observation at the current belief state according to some heuristic condition and obtains the corresponding next belief state. The paths of all trials are combined into a reachable belief tree. In this way the agent searches only within the reachable belief space, which approximates the infinite belief space and makes continuous-state problems solvable. Choosing a better heuristic condition therefore makes the explored reachable belief space closer to the true belief space and yields better performance. The heuristic condition used by HSVI2 is to obtain, through simulation, as large and as representative a reachable belief tree as possible. SARSOP builds on HSVI2's reachable belief tree and chooses better heuristic conditions, so that its simulations stay closer to the optimal policy and it obtains a more representative optimal reachable belief tree.
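A single trial of such a search repeatedly applies the standard POMDP belief update, b'(s') ∝ Ω(a, s', z) Σ_s f(s, a, s') b(s), and descends the reachable belief tree. The sketch below illustrates this mechanism on tabular arrays like those shown earlier; the heuristic functions `choose_action` and `choose_observation` are placeholders for whatever heuristic condition (HSVI2, SARSOP or GLS style) is plugged in, and are assumptions of this illustration.

```python
import numpy as np

def belief_update(T, O, b, a, z):
    """Standard POMDP belief update tau(b, a, z).

    T[a, s, s'] : transition probabilities
    O[a, s', z] : observation probabilities
    Returns (b', P(z | b, a)); b' is None if the observation has zero probability.
    """
    pred = b @ T[a]                # predicted state distribution after action a
    unnorm = pred * O[a, :, z]     # weight by the likelihood of observing z
    p_z = unnorm.sum()
    return (unnorm / p_z if p_z > 0 else None), p_z

def trial(T, O, b0, choose_action, choose_observation, max_depth=50):
    """One trial-based forward search: a single branch-free path from b0 to a leaf."""
    path, b = [b0], b0
    for _ in range(max_depth):
        a = choose_action(b)               # heuristic action selection
        z = choose_observation(b, a)       # heuristic observation selection
        b, p_z = belief_update(T, O, b, a, z)
        if b is None:
            break
        path.append(b)                     # this path contributes to the reachable belief tree
    return path
```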
In view of the limitations of trial-based search algorithms on partially observable Markov decision problems with continuous states and large-scale observation spaces, the present invention adopts the Gingko Leaf Search (GLS) algorithm: in the forward search phase it searches not only the most valuable belief state but also, adaptively, belief states whose value is very close to that of the most valuable one. Without affecting the effect of the value function updates, GLS reduces the number of belief state updates, shortens update time and improves algorithm efficiency.
To achieve the above object, the present invention provides the following technical solution: an optimal robot path planning method based on a partially observable Markov decision process, comprising the following steps:
S1. Initialize the model and the environment: set the state transition function f: X×U×X→[0,1], the reward function ρ: X×U→ℝ and the observation function Ω: U×X×Z→[0,1], where X is the state set, U is the action set and Z is the observation set; set the discount rate γ to 0.95; set the position of the robot; set the initial values of the initial belief state b_0, namely the standard critical value L of the lower bound and the standard critical value U = L + ε of the upper bound, where ε is a pre-specified threshold criterion; calculate the upper bound value and the lower bound value V(b_0) of the initial belief state b_0; go to S2.
S2. Set the initial belief state b_0 as the current belief state b; go to S3.
S3. Predict the optimal value of the current belief state b; go to S4.
S4. Judge whether the current belief state b satisfies two conditions (the formulas are given only as images in the original; they involve the bounds at b and the depth d_b of the current belief state b); if they are satisfied, go to S13, otherwise go to S5.
S5. Under the current belief state b, calculate the lower bound value of the value function for each action and take the maximum Q of these lower bound values; update the standard critical value U′ of the upper bound and the standard critical value L′ of the lower bound of the current belief state b; go to S6.
S6. Calculate the optimal action and the observation that contributes most to the initial belief state b_0, and record the total number of observations, count; go to S7.
S7. Select the observations of the observation set in order; if count is not 0, go to S8, otherwise go to S11.
S8. Decrease count by 1; go to S9.
S9. Judge whether the currently selected observation is worth exploring; if yes, go to S10, otherwise go to S7.
S10. Calculate the standard critical values of the upper and lower bounds of the next belief state and obtain the upper bound value and the lower bound value of the next belief state; go to S7.
S11. Update the upper bound value and the lower bound value of the current belief state; go to S12.
S12. Select the optimal action to enter the next belief state, set the next belief state as the current belief state, and go to S3.
S13. Obtain the optimal policy and, from it, the optimal path of the robot.
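Read as pseudocode, steps S2-S12 form an anytime search loop in which, unlike a pure trial-based search, several observations of the chosen action may be expanded in one pass. The sketch below is one possible reading of that loop; the `model` interface, the stand-in termination test, the action choice by upper bound and the exploration test are assumptions of this sketch, not the patented formulas (which are given only as images in the original).

```python
def gls_search(model, b0, epsilon, zeta, max_depth=100):
    """Schematic reading of steps S2-S12.  `model` is an assumed interface exposing:
    actions, observations, lower_q(b, a), upper_q(b, a), upper_v(b), lower_v(b),
    next_belief(b, a, z), obs_prob(b, a, z) and update_bounds(b)."""
    b, depth = b0, 0                                               # S2: start from b0
    while depth < max_depth:
        # S3/S4: predict the optimal value of b and test termination.  The patented
        # conditions (involving the depth d_b) are elided, so a plain bound-gap
        # test stands in for them here.
        if model.upper_v(b) - model.lower_v(b) <= epsilon:
            break
        q_lower = {a: model.lower_q(b, a) for a in model.actions}  # S5: per-action lower bounds
        Q = max(q_lower.values())   # maximum lower bound; feeds the patented threshold update (elided)
        # S6: optimal action -- the patented formula is elided; an HSVI/SARSOP-style
        # choice by the upper bound of the action value is used as a stand-in.
        a_star = max(model.actions, key=lambda a: model.upper_q(b, a))
        count = len(model.observations)                            # S6: number of observations
        for z in model.observations:                               # S7: scan observations in order
            if count == 0:
                break
            count -= 1                                             # S8
            child = model.next_belief(b, a_star, z)
            gap = model.upper_v(child) - model.lower_v(child)
            # S9: exploration-value test; the patent's threshold function zeta is
            # replaced by a scalar threshold on the probability-weighted bound gap.
            if model.obs_prob(b, a_star, z) * gap <= zeta:
                continue
            model.update_bounds(child)                             # S10: refresh the child's bounds
        model.update_bounds(b)                                     # S11: back up the bounds at b
        # S12: follow the optimal action; choosing the most likely observation here
        # is an assumption of this sketch.
        z_star = max(model.observations, key=lambda z: model.obs_prob(b, a_star, z))
        b = model.next_belief(b, a_star, z_star)
        depth += 1
    return b                                                       # S13: policy is read off the bounds
```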
In the above technical solution, in S2, the lower bound value of the initial belief state b_0 is calculated with a blind policy, and the upper bound value of the initial belief state b_0 is calculated with the fast informed bound (FIB) method.
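Both initializations are standard in point-based POMDP solvers. The sketch below shows textbook versions of the blind-policy lower bound and the fast informed bound on tabular arrays like those introduced earlier; it is an illustration of these generic bounds, not code taken from the patent.

```python
import numpy as np

def blind_policy_lower_bound(T, R, gamma, iters=500):
    """Blind-policy lower bound: one alpha-vector per action, assuming that action
    is repeated forever.  V_lower(b) = max_a  b . alpha[a]."""
    n_states, n_actions = R.shape
    alpha = np.tile(R.min() / (1.0 - gamma), (n_actions, n_states))
    for _ in range(iters):
        for a in range(n_actions):
            alpha[a] = R[:, a] + gamma * T[a] @ alpha[a]
    return alpha            # lower-bound alpha-vectors

def fast_informed_bound(T, R, O, gamma, iters=500):
    """Fast informed bound (FIB) upper bound on the action-value function:
    Q[s,a] <- R[s,a] + gamma * sum_z max_a' sum_s' O[a,s',z] T[a,s,s'] Q[s',a']."""
    n_states, n_actions = R.shape
    Q = np.tile(R.max() / (1.0 - gamma), (n_states, n_actions))
    for _ in range(iters):
        for a in range(n_actions):
            # weighted[z, s, s'] = O[a, s', z] * T[a, s, s']
            weighted = T[a][None, :, :] * O[a].T[:, None, :]
            Q[:, a] = R[:, a] + gamma * sum(
                (weighted[z] @ Q).max(axis=1) for z in range(O.shape[2]))
    return Q                # upper bound: V_upper(b) = max_a  b . Q[:, a]
```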
In the above technical solution, in S5, the standard critical value L′ of the lower bound and the standard critical value U′ of the upper bound of the current belief state b are calculated by two formulas (given only as images in the original), in which Q denotes the maximum of the per-action lower bound values of the value function.
In the above technical solution, in S6, the optimal action and the optimal observation are calculated by formulas (given only as images in the original); the formula for the optimal observation uses the difference between the upper and lower bounds at the corresponding belief state.
In the above technical solution, in S9, the criterion for judging whether the currently selected observation is worth exploring is given by a formula (shown only as an image in the original), in which ζ is a threshold function.
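The original formulas for the optimal action, the optimal observation and the exploration criterion are not reproduced in this text, but heuristics of this family typically pick the action with the best upper bound and score each observation by its probability weighted by the bound gap of the resulting belief. The sketch below illustrates that generic pattern; it is an assumption of this illustration, not the patented formulas.

```python
def select_action_and_observations(model, b, zeta):
    """Generic HSVI/SARSOP-style selection: the action with the best upper bound,
    and observations scored by P(z | b, a) * (upper(b') - lower(b')).  Illustrative
    only; the patented formulas of S6 and S9 are given only as images."""
    a_star = max(model.actions, key=lambda a: model.upper_q(b, a))   # stand-in for S6
    scored = []
    for z in model.observations:
        p_z = model.obs_prob(b, a_star, z)
        if p_z == 0.0:
            continue
        child = model.next_belief(b, a_star, z)
        gap = model.upper_v(child) - model.lower_v(child)            # bound gap at the child
        score = p_z * gap                                            # probability-weighted uncertainty
        if score > zeta:                                             # stand-in for the S9 test
            scored.append((score, z, child))
    scored.sort(key=lambda item: item[0], reverse=True)              # most valuable first
    return a_star, scored
```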
In the above technical solution, in S10, the standard critical value of the upper bound of the next belief state and the standard critical value U_db of the lower bound of the next belief state are calculated by formulas (given only as images in the original), where L′ and U′ are the standard critical values of the bounds of the current belief state defined in S5.
Owing to the above technical solutions, the present invention has the following advantages over the prior art. Based on a partially observable Markov decision process, the robot finds an optimal path to the target position; taking the POMDP model and the SARSOP algorithm as a basis, the GLS search method is used as the heuristic condition during the search. In continuous-state, large-scale observation space problems, the invention avoids the repeated updating of the upper and lower bounds of belief states along many similar paths performed by earlier classical algorithms that use trials as the heuristic condition, without affecting the final optimal policy. Algorithm efficiency is improved, so that in the same amount of time the robot obtains a better policy and finds a better path.
Figure 1 is a schematic diagram of the layout of the environment of the present invention.
Figure 2 is the search tree formed by the search paths obtained in one search of the present invention.
Figure 3 is a flow chart of the operation of the present invention.
The present invention is further described below in conjunction with its principles, the accompanying drawings and the embodiments.
Referring to Figure 1, the sweeping robot is in the living room on the right; its task is to clean the bedroom on the left. According to the layout of the room, it must go around the dining table and pass through the middle door to enter the bedroom. Distance sensors are evenly mounted around the robot's head; each sensor can detect whether there is an obstacle within 1 unit length directly in front of it, and there are 256 possible sensor readings in total. The probability that each sensor returns a correct reading is 0.9, and the probability of an erroneous reading is 0.1. The sweeping robot starts at a random position in the room; its goal is to reach the bedroom on the left as quickly as possible, and the reward for reaching the target position is +10.
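With 256 possible readings, the robot evidently carries eight binary distance sensors (2^8 = 256), each correct with probability 0.9. A small sketch of such an observation model follows; the eight-sensor interpretation and the independence of sensor errors are inferences from the numbers above, not explicit statements of the patent.

```python
import numpy as np

N_SENSORS = 8          # 2**8 = 256 possible readings, as stated in the embodiment
P_CORRECT = 0.9        # probability that a single sensor reads correctly

def observation_probability(true_occupancy, reading):
    """P(reading | true occupancy of the cells in front of the eight sensors),
    assuming the sensors fail independently of one another."""
    prob = 1.0
    for i in range(N_SENSORS):
        true_bit = (true_occupancy >> i) & 1
        read_bit = (reading >> i) & 1
        prob *= P_CORRECT if read_bit == true_bit else 1.0 - P_CORRECT
    return prob

def sample_reading(true_occupancy, rng=None):
    """Draw one noisy 8-bit reading for a given true occupancy pattern."""
    rng = rng or np.random.default_rng()
    reading = 0
    for i in range(N_SENSORS):
        true_bit = (true_occupancy >> i) & 1
        bit = true_bit if rng.random() < P_CORRECT else 1 - true_bit
        reading |= bit << i
    return reading

# Sanity check: the 256 observation probabilities for any one state sum to 1.
assert abs(sum(observation_probability(0b10110001, z) for z in range(256)) - 1.0) < 1e-9
```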
Referring to Figure 2, in one search the search path contains not only the nodes that the earlier SARSOP algorithm would search (black solid circles) but also nodes with large search value (open circles).
Referring to Figure 3, an optimal robot path planning method based on a partially observable Markov decision process comprises the following steps:
S1. Initialize the model and the environment: set the state transition function f: X×U×X→[0,1], the reward function ρ: X×U→ℝ and the observation function Ω: U×X×Z→[0,1], where X is the state set, U is the action set and Z is the observation set; set the discount rate γ to 0.95; set the position of the robot; set the initial values of the initial belief state b_0, namely the standard critical value L of the lower bound and the standard critical value U = L + ε of the upper bound, where ε is a pre-specified threshold criterion; calculate the upper bound value and the lower bound value V(b_0) of the initial belief state b_0; go to S2.
S2. Set the initial belief state b_0 as the current belief state b; go to S3.
S3. Predict the optimal value of the current belief state b; go to S4.
S4. Judge whether the current belief state b satisfies two conditions (the formulas are given only as images in the original; they involve the bounds at b and the depth d_b of the current belief state b); if they are satisfied, go to S13, otherwise go to S5.
S5. Under the current belief state b, calculate the lower bound value of the value function for each action and take the maximum Q of these lower bound values; update the standard critical value U′ of the upper bound and the standard critical value L′ of the lower bound of the current belief state b; go to S6.
S6. Calculate the optimal action and the observation that contributes most to the initial belief state b_0, and record the total number of observations, count; go to S7.
S7. Select the observations of the observation set in order; if count is not 0, go to S8, otherwise go to S11.
S8. Decrease count by 1; go to S9.
S9. Judge whether the currently selected observation is worth exploring; if yes, go to S10, otherwise go to S7.
S10. Calculate the standard critical values of the upper and lower bounds of the next belief state and obtain the upper bound value and the lower bound value of the next belief state; go to S7.
S11. Update the upper bound value and the lower bound value of the current belief state; go to S12.
S12. Select the optimal action to enter the next belief state, set the next belief state as the current belief state, and go to S3.
S13. Obtain the optimal policy and, from it, the optimal path of the robot.
In S2, the lower bound value of the initial belief state b_0 is calculated with a blind policy, and the upper bound value of the initial belief state b_0 is calculated with the fast informed bound (FIB) method.
In S5, the standard critical value L′ of the lower bound and the standard critical value U′ of the upper bound of the current belief state b are calculated by two formulas (given only as images in the original), in which Q denotes the maximum of the per-action lower bound values of the value function.
In S6, the optimal action and the optimal observation are calculated by formulas (given only as images in the original); the formula for the optimal observation uses the difference between the upper and lower bounds at the corresponding belief state.
In S9, the criterion for judging whether the currently selected observation is worth exploring is given by a formula (shown only as an image in the original), in which ζ is a threshold function.
In S10, the standard critical value of the upper bound of the next belief state and the standard critical value U_db of the lower bound of the next belief state are calculated by formulas (given only as images in the original), where L′ and U′ are the standard critical values of the bounds of the current belief state.
The above description of the disclosed embodiments enables those skilled in the art to make or use the invention. Various modifications to these embodiments will be obvious to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the invention. Therefore, the present invention is not limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
Claims (6)
- An optimal robot path planning method based on a partially observable Markov decision process, characterized by comprising the following steps:
S1. Initialize the model and the environment: set the state transition function f: X×U×X→[0,1], the reward function ρ: X×U→ℝ and the observation function Ω: U×X×Z→[0,1], where X is the state set, U is the action set and Z is the observation set; set the discount rate γ to 0.95; set the position of the robot; set the initial values of the initial belief state b_0, namely the standard critical value L of the lower bound and the standard critical value U = L + ε of the upper bound, where ε is a pre-specified threshold criterion; calculate the upper bound value and the lower bound value V(b_0) of the initial belief state b_0; go to S2.
S2. Set the initial belief state b_0 as the current belief state b; go to S3.
S3. Predict the optimal value of the current belief state b; go to S4.
S4. Judge whether the current belief state b satisfies two conditions (the formulas are given only as images in the original; they involve the bounds at b and the depth d_b of the current belief state b); if they are satisfied, go to S13, otherwise go to S5.
S5. Under the current belief state b, calculate the lower bound value of the value function for each action and take the maximum Q of these lower bound values; update the standard critical value U′ of the upper bound and the standard critical value L′ of the lower bound of the current belief state b; go to S6.
S6. Calculate the optimal action and the observation that contributes most to the initial belief state b_0, and record the total number of observations, count; go to S7.
S7. Select the observations of the observation set in order; if count is not 0, go to S8, otherwise go to S11.
S8. Decrease count by 1; go to S9.
S9. Judge whether the currently selected observation is worth exploring; if yes, go to S10, otherwise go to S7.
S10. Calculate the standard critical values of the upper and lower bounds of the next belief state and obtain the upper bound value and the lower bound value of the next belief state; go to S7.
S11. Update the upper bound value and the lower bound value of the current belief state; go to S12.
S12. Select the optimal action to enter the next belief state, set the next belief state as the current belief state, and go to S3.
S13. Obtain the optimal policy and, from it, the optimal path of the robot.
- The optimal robot path planning method based on a partially observable Markov decision process according to claim 1, characterized in that, in S2, the lower bound value of the initial belief state b_0 is calculated with a blind policy and the upper bound value of the initial belief state b_0 is calculated with the fast informed bound (FIB) method.
- The optimal robot path planning method based on a partially observable Markov decision process according to claim 1, characterized in that, in S5, the standard critical value L′ of the lower bound and the standard critical value U′ of the upper bound of the current belief state b are calculated by two formulas (given only as images in the original), in which Q denotes the maximum of the per-action lower bound values of the value function.
- The optimal robot path planning method based on a partially observable Markov decision process according to claim 1, characterized in that, in S6, the optimal action and the optimal observation are calculated by formulas (given only as images in the original); the formula for the optimal observation uses the difference between the upper and lower bounds at the corresponding belief state.
- The optimal robot path planning method based on a partially observable Markov decision process according to claim 1, characterized in that, in S9, the criterion for judging whether the currently selected observation is worth exploring is given by a formula (shown only as an image in the original), in which ζ is a threshold function.
- The optimal robot path planning method based on a partially observable Markov decision process according to claim 1, characterized in that, in S10, the standard critical value of the upper bound of the next belief state and the standard critical value U_db of the lower bound of the next belief state are calculated by formulas (given only as images in the original), where L′ and U′ are the standard critical values of the bounds of the current belief state.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810102240.0 | 2018-02-01 | ||
CN201810102240.0A CN108680155B (en) | 2018-02-01 | 2018-02-01 | Robot optimal path planning method based on partial perception Markov decision process |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2019148645A1 true WO2019148645A1 (en) | 2019-08-08 |
Family
ID=63800139
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2018/082104 WO2019148645A1 (en) | 2018-02-01 | 2018-04-08 | Partially observable markov decision process-based optimal robot path planning method |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN108680155B (en) |
WO (1) | WO2019148645A1 (en) |
Families Citing this family (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110716554B (en) * | 2019-11-12 | 2020-08-14 | 华育昌(肇庆)智能科技研究有限公司 | Vision-based household robot |
CN111026110B (en) * | 2019-11-20 | 2021-04-30 | 北京理工大学 | Uncertain action planning method for linear time sequence logic containing soft and hard constraints |
CN110989602B (en) * | 2019-12-12 | 2023-12-26 | 齐鲁工业大学 | Autonomous guided vehicle path planning method and system in medical pathology inspection laboratory |
CN111311384A (en) * | 2020-05-15 | 2020-06-19 | 支付宝(杭州)信息技术有限公司 | Method and system for training recommendation model |
CN111896006B (en) * | 2020-08-11 | 2022-10-04 | 燕山大学 | Path planning method and system based on reinforcement learning and heuristic search |
CN112356031B (en) * | 2020-11-11 | 2022-04-01 | 福州大学 | On-line planning method based on Kernel sampling strategy under uncertain environment |
US12118441B2 (en) | 2021-12-13 | 2024-10-15 | International Business Machines Corporation | Knowledge augmented sequential decision-making under uncertainty |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102929281A (en) * | 2012-11-05 | 2013-02-13 | 西南科技大学 | Robot k-nearest-neighbor (kNN) path planning method under incomplete perception environment |
CN107368076A (en) * | 2017-07-31 | 2017-11-21 | 中南大学 | Robot motion's pathdepth learns controlling planning method under a kind of intelligent environment |
US20170364831A1 (en) * | 2016-06-21 | 2017-12-21 | Sri International | Systems and methods for machine learning using a trusted model |
2018
- 2018-02-01: CN application CN201810102240.0A → patent CN108680155B (en), active
- 2018-04-08: WO application PCT/CN2018/082104 → patent WO2019148645A1 (en), application filing
Cited By (16)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111007858A (en) * | 2019-12-23 | 2020-04-14 | 北京三快在线科技有限公司 | Training method of vehicle driving decision model, and driving decision determining method and device |
CN111367317A (en) * | 2020-03-27 | 2020-07-03 | 中国人民解放军国防科技大学 | Unmanned aerial vehicle cluster online task planning method based on Bayesian learning |
CN112131754A (en) * | 2020-09-30 | 2020-12-25 | 中国人民解放军国防科技大学 | Extended POMDP planning method and system based on robot accompanying behavior model |
CN113189986B (en) * | 2021-04-16 | 2023-03-14 | 中国人民解放军国防科技大学 | Two-stage self-adaptive behavior planning method and system for autonomous robot |
CN113189986A (en) * | 2021-04-16 | 2021-07-30 | 中国人民解放军国防科技大学 | Two-stage self-adaptive behavior planning method and system for autonomous robot |
CN113433937A (en) * | 2021-06-08 | 2021-09-24 | 杭州未名信科科技有限公司 | Heuristic exploration-based layered navigation obstacle avoidance system and layered navigation obstacle avoidance method |
CN114118441A (en) * | 2021-11-24 | 2022-03-01 | 福州大学 | Online planning method based on efficient search strategy under uncertain environment |
CN114118441B (en) * | 2021-11-24 | 2024-10-22 | 福州大学 | Online planning method based on efficient search strategy in uncertainty environment |
CN114253265B (en) * | 2021-12-17 | 2023-10-20 | 成都朴为科技有限公司 | On-time arrival probability maximum path planning algorithm and system based on fourth-order moment |
CN114253265A (en) * | 2021-12-17 | 2022-03-29 | 成都朴为科技有限公司 | On-time arrival probability maximum path planning algorithm and system based on fourth-order moment |
CN114355973A (en) * | 2021-12-28 | 2022-04-15 | 哈尔滨工程大学 | Multi-agent hierarchical reinforcement learning-based unmanned cluster cooperation method under weak observation condition |
CN114355973B (en) * | 2021-12-28 | 2023-12-08 | 哈尔滨工程大学 | Unmanned cluster cooperation method based on multi-agent layered reinforcement learning under weak observation condition |
CN114676935A (en) * | 2022-04-26 | 2022-06-28 | 中国人民解放军国防科技大学 | Multi-Agent system mixed task planning algorithm under uncertain environment |
CN114676935B (en) * | 2022-04-26 | 2024-08-16 | 中国人民解放军国防科技大学 | Multi-Agent system mixed task planning algorithm under uncertain environment |
CN114880946A (en) * | 2022-05-31 | 2022-08-09 | 苏州大学 | Intelligent agent random exploration method based on flight strategy |
CN115952892A (en) * | 2022-12-13 | 2023-04-11 | 广东工业大学 | Branch iteration method and system for defective plate stock layout |
Also Published As
Publication number | Publication date |
---|---|
CN108680155A (en) | 2018-10-19 |
CN108680155B (en) | 2020-09-08 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2019148645A1 (en) | Partially observable markov decision process-based optimal robot path planning method | |
CN104571113B (en) | The paths planning method of mobile robot | |
Zhu et al. | Online minimax Q network learning for two-player zero-sum Markov games | |
CN105740644B (en) | Cleaning robot optimal target path planning method based on model learning | |
Zheng et al. | Robust bayesian inverse reinforcement learning with sparse behavior noise | |
CN106444769B (en) | A kind of optimum path planning method of indoor mobile robot increment type environmental information sampling | |
CN104298239B (en) | A kind of indoor mobile robot strengthens map study paths planning method | |
Bode | Neural networks for cost estimation: simulations and pilot application | |
Chatterjee et al. | A Geese PSO tuned fuzzy supervisor for EKF based solutions of simultaneous localization and mapping (SLAM) problems in mobile robots | |
CN110378439A (en) | Single robot path planning method based on Q-Learning algorithm | |
CN108469827A (en) | A kind of automatic guided vehicle global path planning method suitable for logistic storage system | |
CN109241291A (en) | Knowledge mapping optimal path inquiry system and method based on deeply study | |
CN110174118A (en) | Robot multiple-objective search-path layout method and apparatus based on intensified learning | |
Zhang et al. | Learning novel policies for tasks | |
Kollar et al. | Efficient Optimization of Information-Theoretic Exploration in SLAM. | |
CN109782270B (en) | Data association method under multi-sensor multi-target tracking condition | |
Wang et al. | On the convergence of the monte carlo exploring starts algorithm for reinforcement learning | |
CN112356031A (en) | On-line planning method based on Kernel sampling strategy under uncertain environment | |
Janmohamed et al. | Improving the Data Efficiency of Multi-Objective Quality-Diversity through Gradient Assistance and Crowding Exploration | |
CN114139675B (en) | Method for improving selection reliability and action accuracy in intelligent agent control | |
CN113741416B (en) | Multi-robot full-coverage path planning method based on improved predator prey model and DMPC | |
CN111814908B (en) | Abnormal data detection model updating method and device based on data flow | |
CN114118441A (en) | Online planning method based on efficient search strategy under uncertain environment | |
Zhou et al. | Switching deep reinforcement learning based intelligent online decision making for autonomous systems under uncertain environment | |
Gao et al. | Partial Consistency for Stabilizing Undiscounted Reinforcement Learning |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 18903071 Country of ref document: EP Kind code of ref document: A1 |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
122 | Ep: pct application non-entry in european phase |
Ref document number: 18903071 Country of ref document: EP Kind code of ref document: A1 |
|
32PN | Ep: public notification in the ep bulletin as address of the adressee cannot be established |
Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 26/01/2021) |