WO2019148645A1 - Partially observable markov decision process-based optimal robot path planning method - Google Patents

Partially observable markov decision process-based optimal robot path planning method

Info

Publication number
WO2019148645A1
WO2019148645A1 · PCT/CN2018/082104 · CN2018082104W
Authority
WO
WIPO (PCT)
Prior art keywords
value
belief state
state
optimal
lower bound
Prior art date
Application number
PCT/CN2018/082104
Other languages
French (fr)
Chinese (zh)
Inventor
刘全
朱斐
钱炜晟
章宗长
Original Assignee
苏州大学张家港工业技术研究院
苏州大学
Priority date
Filing date
Publication date
Application filed by 苏州大学张家港工业技术研究院 and 苏州大学
Publication of WO2019148645A1 publication Critical patent/WO2019148645A1/en

Classifications

    • G: PHYSICS
      • G01: MEASURING; TESTING
        • G01C: MEASURING DISTANCES, LEVELS OR BEARINGS; SURVEYING; NAVIGATION; GYROSCOPIC INSTRUMENTS; PHOTOGRAMMETRY OR VIDEOGRAMMETRY
          • G01C 21/00: Navigation; Navigational instruments not provided for in groups G01C 1/00 - G01C 19/00
            • G01C 21/20: Instruments for performing navigational calculations
              • G01C 21/206: specially adapted for indoor navigation
            • G01C 21/26: specially adapted for navigation in a road network
              • G01C 21/34: Route searching; Route guidance
                • G01C 21/3446: Details of route searching algorithms, e.g. Dijkstra, A*, arc-flags, using precalculated routes
      • G05: CONTROLLING; REGULATING
        • G05D: SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
          • G05D 1/00: Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
            • G05D 1/02: Control of position or course in two dimensions
              • G05D 1/021: specially adapted to land vehicles
                • G05D 1/0212: with means for defining a desired trajectory
                  • G05D 1/0217: in accordance with energy consumption, time reduction or distance reduction criteria
                  • G05D 1/0219: ensuring the processing of the whole working surface
                  • G05D 1/0221: involving a learning process

Definitions

  • The invention relates to the field of robot control, and in particular to an optimal robot path planning method based on a partially observable Markov decision process.
  • Machine Learning (ML) is the discipline that studies how to simulate or implement human learning behaviour and how to continually reorganize and improve an existing knowledge structure.
  • Reinforcement learning is an important research branch of machine learning. It is a machine learning method in which an agent, by interacting with the environment, learns a mapping from states to actions that maximizes the long-term cumulative discounted reward.
  • Reinforcement learning usually adopts Markov decision processes (MDPs) as its model, i.e. the environment is assumed to be fully observable; in the real world, however, uncertainty is ubiquitous.
  • The agent's sensors have their own limitations: (1) a sensor can only probe a limited local portion of the environment, so the agent cannot accurately distinguish different states outside the detection range; (2) sensors are imperfect and their readings are noisy, so probing the same state may yield different results.
  • In the RoboCup robot soccer competition, for example, the agent's vision system distinguishes a viewing angle, an accurate viewing distance, a blurred viewing distance and an invisible range. Only when the observed target lies within the viewing angle and the accurate viewing distance can the agent obtain its exact state; in all other cases only a fuzzy observation is available. When making decisions in complex environments (such as driverless driving), an agent should account for uncertainties such as imperfect control, sensor noise and incomplete environmental knowledge.
  • Partially observable Markov decision processes (POMDPs) provide a general model for the planning (or sequential decision-making) and learning problems of an agent in partially observable stochastic environments.
  • The object of the present invention is to provide an optimal robot path planning method based on a partially observable Markov decision process that reduces the number of updates the algorithm performs along similar search paths in problems with continuous states and large-scale observation spaces, thereby saving computation time and increasing the efficiency of the algorithm. The improved efficiency allows the robot to obtain a better path in the same amount of time.
  • During the search, the agent selects an action and an observation at the current belief state according to different heuristic conditions and obtains the corresponding next belief state. Combining the paths of all searches forms a reachable belief tree. This allows the agent to search only in the reachable belief space, which approximates the infinite belief space, so that continuous-state problems become solvable. Choosing a better heuristic condition therefore makes the searched reachable belief space closer to the true belief space and yields better performance.
  • The heuristic condition used by HSVI2 is to obtain, through simulation, as representative a reachable belief tree as possible. SARSOP builds on HSVI2's reachable belief tree and chooses better heuristic conditions, so that its simulation follows the optimal policy more closely and produces a more representative optimal reachable belief tree.
  • To address the limitations of trial-based search in partially observable Markov decision problems with continuous states and large-scale observation spaces, the Gingko Leaf Search (GLS) algorithm is used in the forward-search phase: it searches not only the most valuable belief state but also, adaptively, belief states that are very close in value to the most valuable one. Without affecting the effect of the value-function update, GLS reduces the number of belief-state updates, reduces the update time and improves the efficiency of the algorithm.
  • To this end, the present invention provides the following technical solution: an optimal robot path planning method based on a partially observable Markov decision process, comprising the steps listed below.
  • The lower bound value of the initial belief state b0 is calculated with a blind policy, and the upper bound value of the initial belief state b0 is calculated with the fast informed bound method.
  • The standard critical values of the lower and upper bounds of the current belief state b are calculated from Q, where Q represents the maximum of the lower-bound values of the action-value function over all actions.
  • The standard critical value of the upper bound of the next belief state and the standard critical value of the lower bound of the next belief state are calculated from U′ and L′, where U′ and L′ are the upper-bound and lower-bound standard critical values of the current belief state, respectively.
  • Owing to the above technical solution, the present invention has the following advantages over the prior art: the invention is based on a partially observable Markov decision process, in which the robot finds an optimal path to the target position; it takes the POMDP model and the SARSOP algorithm as its basis and uses the GLS search method as the heuristic condition during the search.
  • In problems with continuous states and large-scale observation spaces, the invention avoids the repeated updates of the upper and lower bounds of belief states along multiple similar paths that early classical algorithms, which use trials as the heuristic condition, perform, without affecting the final optimal policy. This improves the efficiency of the algorithm, so that in the same amount of time the robot can obtain a better policy and find a better path.
  • Figure 1 is a schematic view showing the layout of the environment of the present invention.
  • FIG. 2 is a search tree formed by a search path obtained by a certain search in the present invention.
  • Figure 3 is a flow chart of the operation of the present invention.
  • The sweeping robot is in the living room on the right; its task is to clean the bedroom on the left. Given the layout of the room, it has to go around the dining table and pass through the door in the middle to enter the bedroom.
  • Distance sensors are mounted evenly on the robot's head; each sensor detects whether there is an obstacle within 1 unit length directly in front of it. There are 256 possible joint detection results. Each sensor returns the correct reading with probability 0.9 and an erroneous reading with probability 0.1.
  • The sweeping robot starts at a random position in the room. Its goal is to reach the bedroom on the left as quickly as possible, and the reward for reaching the target position is +10.
  • The search path includes not only the nodes searched by the early SARSOP algorithm (black solid circles) but also nodes with large search value (open circles).
  • An optimal robot path planning method based on a partially observable Markov decision process includes the steps listed below.
  • The lower bound value of the initial belief state b0 is calculated with a blind policy.
  • The upper bound value of the initial belief state b0 is calculated with the fast informed bound method.
  • The standard critical values of the lower and upper bounds of the current belief state b are calculated from Q, where Q represents the maximum of the lower-bound values of the action-value function over all actions.
  • The optimal action and the optimal observation are calculated by formulas in which the weighting term is the difference between the upper and lower bounds at the corresponding belief state.
  • The standard critical value of the upper bound of the next belief state and the standard critical value of the lower bound of the next belief state are calculated from U′ and L′, where U′ and L′ are the upper-bound and lower-bound standard critical values of the current belief state, respectively.

Landscapes

  • Engineering & Computer Science (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Remote Sensing (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Automation & Control Theory (AREA)
  • Aviation & Aerospace Engineering (AREA)
  • Feedback Control In General (AREA)
  • Manipulator (AREA)

Abstract

Disclosed is a partially observable Markov decision process (POMDP)-based optimal robot path planning method in which a robot searches for an optimal path to a target position. The method takes a POMDP model and the SARSOP algorithm as its basis and uses the GLS search method as the heuristic condition during the search. For problems with continuous states and massive observation spaces, the method avoids the repeated updates of the upper and lower bounds of belief states along multiple similar paths that early classical algorithms, which use trials as the heuristic condition, perform. The final optimal policy is not affected, the efficiency of the algorithm is improved, and the robot can obtain a better policy and find a better path in the same amount of time.

Description

Robot optimal path planning method based on a partially observable Markov decision process
Technical field
The invention relates to the field of robot control, and in particular to an optimal robot path planning method based on a partially observable Markov decision process.
Background art
Machine Learning (ML) is the discipline that studies how to simulate or implement human learning behaviour and how to continually reorganize and improve an existing knowledge structure. Reinforcement learning is an important research branch of machine learning: it is a machine learning method in which an agent, by interacting with the environment, learns a mapping from states to actions that maximizes the long-term cumulative discounted reward. Reinforcement learning usually adopts Markov decision processes (MDPs) as its model, i.e. the environment is assumed to be fully observable. In the real world, however, uncertainty is ubiquitous. The agent's sensors, for example, have their own limitations: (1) a sensor can only probe a limited local portion of the environment, so the agent cannot accurately distinguish different states outside the detection range; (2) sensors are imperfect and their readings are noisy, so probing the same state may yield different results. In the RoboCup robot soccer competition, for example, the agent's vision system distinguishes a viewing angle, an accurate viewing distance, a blurred viewing distance and an invisible range. Only when the observed target lies within the viewing angle and the accurate viewing distance can the agent obtain its exact state; in all other cases only a fuzzy observation is available. When making decisions in complex environments (such as driverless driving), an agent should account for uncertainties such as imperfect control, sensor noise and incomplete environmental knowledge.
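For reference, the "maximum long-term cumulative discounted reward" mentioned above is the standard reinforcement-learning objective, written here in standard notation (not reproduced from the patent):

```latex
\max_{\pi}\ \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\, r_{t}\right], \qquad 0 \le \gamma < 1,
```

where π maps states (or, in a POMDP, belief states) to actions, r_t is the reward received at step t, and γ is the discount rate (set to 0.95 in step S1 below).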
Partially observable Markov decision processes (POMDPs) provide a general model for the planning (or sequential decision-making) and learning problems of an agent in partially observable stochastic environments. Over the past decade, research on POMDP planning has achieved remarkable results. These results all use trial-based asynchronous value iteration algorithms in the heuristic search phase, for example HSVI2, SARSOP and FSVI. When searching forward, these algorithms expand only the node with the greatest value. Trial-based search, however, selects the optimal action and observation at every step and ignores other observations that are very close to the optimal one and have an important influence on the future performance of the algorithm. On problems with large-scale observation spaces, the performance of such algorithms is therefore poor.
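To make the model used in step S1 concrete, the following Python sketch defines a discrete POMDP with the signatures used below, together with the standard belief update; the class and function names are illustrative assumptions, not code from the patent.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class POMDP:
    """Discrete POMDP (X, U, Z, f, rho, Omega, gamma) with the signatures used in S1."""
    n_states: int            # |X|
    n_actions: int           # |U|
    n_observations: int      # |Z|
    f: np.ndarray            # f[a, s, s']     = P(s' | s, a)
    rho: np.ndarray          # rho[s, a]       = immediate reward
    omega: np.ndarray        # omega[a, s', o] = P(o | s', a)
    gamma: float = 0.95      # discount rate, as set in S1


def belief_update(model: POMDP, b: np.ndarray, a: int, o: int) -> np.ndarray:
    """Standard belief update b' = tau(b, a, o): predict with f, correct with omega."""
    predicted = b @ model.f[a]                      # P(s' | b, a)
    unnormalized = predicted * model.omega[a, :, o]
    norm = unnormalized.sum()
    return unnormalized / norm if norm > 0.0 else predicted
```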
Summary of the invention
The object of the present invention is to provide an optimal robot path planning method based on a partially observable Markov decision process that reduces the number of updates the algorithm performs along similar search paths in problems with continuous states and large-scale observation spaces, thereby saving computation time and increasing the efficiency of the algorithm. The improved efficiency allows the robot to obtain a better path in the same amount of time.
In problems with large-scale observation spaces, certain nodes whose value is very close to the maximum also play a very important role in the future performance of the algorithm. If the value at some belief states can be updated more often than at others, the update scheme is called asynchronous value iteration. In the reachable belief space R(b0), the accuracy of the value function at b0 is usually more important than at other belief states, so asynchronous value iteration can be used for the POMDP problem. Trial-based search is a classic asynchronous value iteration method: each trial starts from the initial belief state b0, searches down to a leaf belief state and obtains a path without branches. During the search, the agent selects an action and an observation at the current belief state according to different heuristic conditions and obtains the corresponding next belief state. Combining the paths of all trials forms a reachable belief tree. This allows the agent to search only in the reachable belief space, which approximates the infinite belief space, so that continuous-state problems become solvable. Choosing a better heuristic condition therefore makes the searched reachable belief space closer to the true belief space and yields better performance. The heuristic condition used by HSVI2 is to obtain, through simulation, as representative a reachable belief tree as possible. SARSOP builds on HSVI2's reachable belief tree and chooses better heuristic conditions, so that its simulation follows the optimal policy more closely and produces a more representative optimal reachable belief tree.
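The trial-based search described above can be summarised by the following skeleton of a single trial in the HSVI2/SARSOP style; the helper callables are assumed to be supplied elsewhere and are named for illustration only.

```python
def trial_based_search(b0, select_action, select_observation, next_belief,
                       update_bounds, should_stop, max_depth=200):
    """One trial of trial-based (asynchronous) value iteration: walk from the initial
    belief b0 to a leaf along heuristically chosen actions and observations, then
    back up the value bounds along the visited path."""
    path, b, depth = [], b0, 0
    while not should_stop(b, depth) and depth < max_depth:
        a = select_action(b)            # heuristic choice, e.g. the action with max upper bound
        o = select_observation(b, a)    # heuristic choice among the observations
        path.append(b)
        b = next_belief(b, a, o)        # tau(b, a, o): the next belief state on the path
        depth += 1
    for visited in reversed(path):      # combining trials builds the reachable belief tree
        update_bounds(visited)          # tighten the bounds at each visited belief
```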
To address the limitations of trial-based search in partially observable Markov decision problems with continuous states and large-scale observation spaces, the present invention adopts the Gingko Leaf Search (GLS) algorithm: in the forward-search phase it searches not only the most valuable belief state but also, adaptively, belief states that are very close in value to the most valuable one. Without affecting the effect of the value-function update, GLS reduces the number of belief-state updates, reduces the update time and improves the efficiency of the algorithm.
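The distinguishing idea of GLS, expanding not only the single most valuable successor but also those whose heuristic score is very close to it, can be sketched as follows; the scoring input and the closeness tolerance are illustrative assumptions, not values from the patent.

```python
def gls_candidate_observations(scores, tolerance=0.05):
    """Return all observations whose heuristic score is within `tolerance` of the best
    one, so that branches very close to the most valuable one are also expanded.

    scores: mapping observation index -> heuristic score, e.g. P(o | b, a) weighted
    by the bound gap at the successor belief."""
    best = max(scores.values())
    return [o for o, s in scores.items() if s >= best - tolerance]


# Example: observation 2 scores best; observation 0 is close enough to be expanded too.
print(gls_candidate_observations({0: 0.58, 1: 0.20, 2: 0.61}))   # [0, 2]
```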
To achieve the above object, the present invention provides the following technical solution: an optimal robot path planning method based on a partially observable Markov decision process, comprising the following steps:
S1. Initialize the model and the environment: set the state transition function f: X×U×X→[0,1], the reward function ρ: X×U→ℝ and the observation function Ω: U×X×Z→[0,1] of the environment, where X is the state set, U is the action set and Z is the observation set; set the discount rate γ to 0.95; set the position of the robot; set the initial values of the initial belief state b0, namely the standard critical value L of the lower bound and the standard critical value U = L + ε of the upper bound, where ε is a pre-specified threshold criterion; compute the upper bound value and the lower bound value of the initial belief state b0; go to S2;
S2. Set the initial belief state b0 as the current belief state b; go to S3;
S3. Predict the optimal value of the current belief state b; go to S4;
S4. Determine whether the current belief state b satisfies the two termination conditions given by the corresponding formulas, where d_b is the depth of the current belief state b; if they are satisfied, go to S13, otherwise go to S5;
S5. Calculate, under the current belief state b, the lower bound value of the value function for each action and take the maximum Q of these lower bound values; update the standard critical value U′ of the upper bound and the standard critical value L′ of the lower bound of the current belief state b; go to S6;
S6. Calculate the optimal action and the observation that contributes most to the initial belief state b0, and record the total number of observations count; go to S7;
S7. Select the observations in the observation set in order; if count is not 0, go to S8, otherwise go to S11;
S8. Decrease count by 1; go to S9;
S9. Determine whether the currently selected observation is worth exploring; if so, go to S10, otherwise go to S7;
S10. Calculate the standard critical value of the upper bound and the standard critical value of the lower bound of the next belief state, and obtain the upper bound value and the lower bound value of the next belief state; go to S7;
S11. Update the upper bound value and the lower bound value of the current belief state; go to S12;
S12. Select the optimal action to enter the next belief state, set the next belief state as the current belief state, and go to S3;
S13. Obtain the optimal policy, and obtain the optimal path of the robot according to the optimal policy.
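Read as pseudocode, steps S1 to S13 describe one search routine. The following Python skeleton mirrors that control flow under the assumption that the belief and bound operations named below are provided elsewhere; it is a sketch, not the patent's implementation.

```python
def plan(b0, model, bounds, eps, max_depth=200):
    """Skeleton of the S1-S13 loop. `model` supplies belief updates; `bounds`
    maintains upper/lower value bounds and the per-belief critical values."""
    b, depth = b0, 0                                                   # S2
    while depth < max_depth:
        v_hat = bounds.predict_optimal_value(b)                        # S3
        if bounds.termination_conditions_met(b, v_hat, depth, eps):    # S4
            break
        bounds.update_critical_values(b)                               # S5: uses the max lower-bound action value
        action, observations = bounds.optimal_action_and_observations(b)   # S6
        for o in observations:                                         # S7 / S8: walk the observation set
            if bounds.worth_exploring(b, action, o):                   # S9
                bounds.init_next_critical_values(b, action, o)         # S10
        bounds.backup(b)                                               # S11: update upper and lower bound at b
        b = model.next_belief(b, action, bounds.chosen_observation(b, action))  # S12
        depth += 1
    return bounds.greedy_policy()                                      # S13: optimal policy -> optimal path
```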
In the above technical solution, in S2, the lower bound value of the initial belief state b0 is calculated with a blind policy, and the upper bound value of the initial belief state b0 is calculated with the fast informed bound method.
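The blind-policy lower bound and the fast informed bound upper bound are standard POMDP initialisations; the sketch below computes both for a discrete model with arrays T (transitions), R (rewards) and O (observations), following the conventions of the model sketch given earlier and not the patent's own code.

```python
import numpy as np


def blind_lower_bound(T, R, gamma, iters=200):
    """Blind-policy lower bound: value of repeating one fixed action forever.
    T: (A, S, S), R: (S, A). Returns alpha vectors of shape (A, S)."""
    A, S, _ = T.shape
    alpha = np.zeros((A, S))
    for _ in range(iters):
        for a in range(A):
            alpha[a] = R[:, a] + gamma * T[a] @ alpha[a]
    return alpha                       # lower bound at belief b: max_a b @ alpha[a]


def fib_upper_bound(T, R, O, gamma, iters=200):
    """Fast informed bound upper bound. O: (A, S', Z). Returns Q of shape (S, A)."""
    A, S, _ = T.shape
    Z = O.shape[2]
    Q = R.astype(float).copy()
    for _ in range(iters):
        newQ = np.empty_like(Q)
        for a in range(A):
            acc = np.zeros(S)
            for o in range(Z):
                # weight[s, s'] = P(s'|s,a) * P(o|s',a); pick the best follow-up action
                weight = T[a] * O[a, :, o][None, :]
                acc += np.max(weight @ Q, axis=1)
            newQ[:, a] = R[:, a] + gamma * acc
        Q = newQ
    return Q                           # upper bound at belief b: max_a b @ Q[:, a]
```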
In the above technical solution, in S5, the standard critical values of the lower and upper bounds of the current belief state b are calculated by the corresponding formulas, where Q represents the maximum of the lower-bound values of the action-value function over all actions.
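The critical-value formulas themselves appear only as equation images. In SARSOP-style samplers the corresponding updates are commonly L′ = max(L, Q) and U′ = max(U, Q + ε·γ^(−d_b)); the sketch below follows that convention as an assumption, with Q the maximum lower-bound action value and all model accessors assumed to exist.

```python
def q_lower(b, a, model, v_lower):
    """Lower-bound action value: rho(b, a) + gamma * sum_o P(o | b, a) * V_lower(tau(b, a, o))."""
    value = model.expected_reward(b, a)
    for o in model.observations:
        p_o = model.observation_prob(b, a, o)
        if p_o > 0.0:
            value += model.gamma * p_o * v_lower(model.next_belief(b, a, o))
    return value


def critical_values(b, depth, L, U, eps, model, v_lower):
    """Assumed SARSOP-style S5 update: Q = max_a q_lower(b, a), then
    L' = max(L, Q) and U' = max(U, Q + eps * gamma ** (-depth))."""
    q_best = max(q_lower(b, a, model, v_lower) for a in model.actions)
    return max(L, q_best), max(U, q_best + eps * model.gamma ** (-depth))
```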
In the above technical solution, in S6, the optimal action is calculated by the corresponding formula, and the optimal observation is calculated by a formula whose weighting term is the difference between the upper and lower bounds at the successor belief state.
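The action and observation selection formulas are likewise shown only as images. A common HSVI2/SARSOP-style reading, consistent with the statement that the weighting term is the bound gap at the successor belief, is sketched below as an assumption.

```python
def select_action_and_observation(b, model, q_upper, gap):
    """Assumed HSVI2/SARSOP-style reading of S6: take the action with the largest
    upper-bound action value, then the observation whose probability-weighted bound
    gap at the successor belief is largest.

    q_upper(b, a) -> upper-bound action value; gap(b) -> upper bound minus lower bound at b."""
    a_star = max(model.actions, key=lambda a: q_upper(b, a))

    def weighted_gap(o):
        successor = model.next_belief(b, a_star, o)
        return model.observation_prob(b, a_star, o) * gap(successor)

    o_star = max(model.observations, key=weighted_gap)
    return a_star, o_star
```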
In the above technical solution, in S9, the criterion for deciding whether the currently selected observation is worth exploring is given by the corresponding formula, where ζ is a threshold function.
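The exploration criterion is shown only as an equation image; a natural reading, sketched below as an assumption, is that an observation is explored when its probability-weighted bound gap at the successor belief exceeds the threshold function ζ evaluated at the current depth.

```python
def worth_exploring(b, a, o, model, gap, zeta, depth):
    """Assumed reading of the S9 criterion (the exact formula is given only as an image):
    explore observation o when the probability-weighted bound gap of the successor belief
    exceeds the threshold function zeta at the current depth."""
    successor = model.next_belief(b, a, o)
    score = model.observation_prob(b, a, o) * gap(successor)
    return score > zeta(depth)
```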
In the above technical solution, in S10, the standard critical value of the upper bound of the next belief state and the standard critical value of the lower bound of the next belief state are calculated by the corresponding formulas, where U′ and L′ are the upper-bound and lower-bound standard critical values of the current belief state, respectively.
Owing to the above technical solution, the present invention has the following advantages over the prior art: the invention is based on a partially observable Markov decision process, in which the robot finds an optimal path to the target position; it takes the POMDP model and the SARSOP algorithm as its basis and uses the GLS search method as the heuristic condition during the search. In problems with continuous states and large-scale observation spaces, the invention avoids the repeated updates of the upper and lower bounds of belief states along multiple similar paths that early classical algorithms, which use trials as the heuristic condition, perform, without affecting the final optimal policy. This improves the efficiency of the algorithm, so that in the same amount of time the robot can obtain a better policy and find a better path.
Brief description of the drawings
Figure 1 is a schematic view of the layout of the environment of the present invention.
Figure 2 is the search tree formed by the search paths obtained in one search of the present invention.
Figure 3 is a flow chart of the operation of the present invention.
Detailed description of the embodiments
The present invention is further described below in conjunction with its principles, the drawings and the embodiments.
Referring to Figure 1, the sweeping robot is in the living room on the right and its task is to clean the bedroom on the left. Given the layout of the room, it has to go around the dining table and pass through the door in the middle to enter the bedroom. Distance sensors are mounted evenly on the robot's head; each sensor detects whether there is an obstacle within 1 unit length directly in front of it, and there are 256 possible joint detection results. Each sensor returns the correct reading with probability 0.9 and an erroneous reading with probability 0.1. The sweeping robot starts at a random position in the room; its goal is to reach the bedroom on the left as quickly as possible, and the reward for reaching the target position is +10.
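The observation model of this example can be written down directly: the 256 joint detection results correspond to 8 binary obstacle sensors (2^8 = 256), each of which reports correctly with probability 0.9. The following sketch computes the probability of one joint reading; the function and argument names are illustrative.

```python
import numpy as np


def observation_probability(true_bits, observed_bits, p_correct=0.9):
    """Probability of one joint sensor reading in the sweeping-robot example:
    each of the 8 binary sensors independently reports the true bit with
    probability 0.9 and flips it with probability 0.1."""
    true_bits = np.asarray(true_bits)
    observed_bits = np.asarray(observed_bits)
    matches = (true_bits == observed_bits)
    return float(np.prod(np.where(matches, p_correct, 1.0 - p_correct)))


# Example: one of the eight sensors misreads, giving 0.9**7 * 0.1.
print(observation_probability([1, 0, 0, 1, 0, 0, 0, 1],
                              [1, 0, 0, 1, 0, 0, 0, 0]))
```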
Referring to Figure 2, in one search the search path contains not only the nodes expanded by the early SARSOP algorithm (black solid circles) but also nodes with large search value (open circles).
Referring to Figure 3, an optimal robot path planning method based on a partially observable Markov decision process includes the following steps:
S1. Initialize the model and the environment: set the state transition function f: X×U×X→[0,1], the reward function ρ: X×U→ℝ and the observation function Ω: U×X×Z→[0,1] of the environment, where X is the state set, U is the action set and Z is the observation set; set the discount rate γ to 0.95; set the position of the robot; set the initial values of the initial belief state b0, namely the standard critical value L of the lower bound and the standard critical value U = L + ε of the upper bound, where ε is a pre-specified threshold criterion; compute the upper bound value and the lower bound value of the initial belief state b0; go to S2;
S2. Set the initial belief state b0 as the current belief state b; go to S3;
S3. Predict the optimal value of the current belief state b; go to S4;
S4. Determine whether the current belief state b satisfies the two termination conditions given by the corresponding formulas, where d_b is the depth of the current belief state b; if they are satisfied, go to S13, otherwise go to S5;
S5. Calculate, under the current belief state b, the lower bound value of the value function for each action and take the maximum Q of these lower bound values; update the standard critical value U′ of the upper bound and the standard critical value L′ of the lower bound of the current belief state b; go to S6;
S6. Calculate the optimal action and the observation that contributes most to the initial belief state b0, and record the total number of observations count; go to S7;
S7. Select the observations in the observation set in order; if count is not 0, go to S8, otherwise go to S11;
S8. Decrease count by 1; go to S9;
S9. Determine whether the currently selected observation is worth exploring; if so, go to S10, otherwise go to S7;
S10. Calculate the standard critical value of the upper bound and the standard critical value of the lower bound of the next belief state, and obtain the upper bound value and the lower bound value of the next belief state; go to S7;
S11. Update the upper bound value and the lower bound value of the current belief state; go to S12;
S12. Select the optimal action to enter the next belief state, set the next belief state as the current belief state, and go to S3;
S13. Obtain the optimal policy, and obtain the optimal path of the robot according to the optimal policy.
In S2, the lower bound value of the initial belief state b0 is calculated with a blind policy, and the upper bound value of the initial belief state b0 is calculated with the fast informed bound method.
In S5, the standard critical values of the lower and upper bounds of the current belief state b are calculated by the corresponding formulas, where Q represents the maximum of the lower-bound values of the action-value function over all actions.
In S6, the optimal action is calculated by the corresponding formula, and the optimal observation is calculated by a formula whose weighting term is the difference between the upper and lower bounds at the successor belief state.
In S9, the criterion for deciding whether the currently selected observation is worth exploring is given by the corresponding formula, where ζ is a threshold function.
In S10, the standard critical value of the upper bound of the next belief state and the standard critical value of the lower bound of the next belief state are calculated by the corresponding formulas, where U′ and L′ are the upper-bound and lower-bound standard critical values of the current belief state, respectively.
The above description of the disclosed embodiments enables those skilled in the art to make or use the present invention. Various modifications to these embodiments will be apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the invention. Therefore, the present invention is not limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (6)

  1. An optimal robot path planning method based on a partially observable Markov decision process, characterized by comprising the following steps:
    S1. Initialize the model and the environment: set the state transition function f: X×U×X→[0,1], the reward function ρ: X×U→ℝ and the observation function Ω: U×X×Z→[0,1] of the environment, where X is the state set, U is the action set and Z is the observation set; set the discount rate γ to 0.95; set the position of the robot; set the initial values of the initial belief state b0, namely the standard critical value L of the lower bound and the standard critical value U = L + ε of the upper bound, where ε is a pre-specified threshold criterion; compute the upper bound value and the lower bound value of the initial belief state b0; go to S2;
    S2. Set the initial belief state b0 as the current belief state b; go to S3;
    S3. Predict the optimal value of the current belief state b; go to S4;
    S4. Determine whether the current belief state b satisfies the two termination conditions given by the corresponding formulas, where d_b is the depth of the current belief state b; if they are satisfied, go to S13, otherwise go to S5;
    S5. Calculate, under the current belief state b, the lower bound value of the value function for each action and take the maximum Q of these lower bound values; update the standard critical value U′ of the upper bound and the standard critical value L′ of the lower bound of the current belief state b; go to S6;
    S6. Calculate the optimal action and the observation that contributes most to the initial belief state b0, and record the total number of observations count; go to S7;
    S7. Select the observations in the observation set in order; if count is not 0, go to S8, otherwise go to S11;
    S8. Decrease count by 1; go to S9;
    S9. Determine whether the currently selected observation is worth exploring; if so, go to S10, otherwise go to S7;
    S10. Calculate the standard critical value of the upper bound and the standard critical value of the lower bound of the next belief state, and obtain the upper bound value and the lower bound value of the next belief state; go to S7;
    S11. Update the upper bound value and the lower bound value of the current belief state; go to S12;
    S12. Select the optimal action to enter the next belief state, set the next belief state as the current belief state, and go to S3;
    S13. Obtain the optimal policy, and obtain the optimal path of the robot according to the optimal policy.
  2. The optimal robot path planning method based on a partially observable Markov decision process according to claim 1, characterized in that in S2 the lower bound value of the initial belief state b0 is calculated with a blind policy and the upper bound value of the initial belief state b0 is calculated with the fast informed bound method.
  3. The optimal robot path planning method based on a partially observable Markov decision process according to claim 1, characterized in that in S5 the standard critical values of the lower and upper bounds of the current belief state b are calculated by the corresponding formulas, where Q represents the maximum of the lower-bound values of the action-value function over all actions.
  4. The optimal robot path planning method based on a partially observable Markov decision process according to claim 1, characterized in that in S6 the optimal action is calculated by the corresponding formula and the optimal observation is calculated by a formula whose weighting term is the difference between the upper and lower bounds at the corresponding belief state.
  5. The optimal robot path planning method based on a partially observable Markov decision process according to claim 1, characterized in that in S9 the criterion for deciding whether the currently selected observation is worth exploring is given by the corresponding formula, where ζ is a threshold function.
  6. The optimal robot path planning method based on a partially observable Markov decision process according to claim 1, characterized in that in S10 the standard critical value of the upper bound of the next belief state and the standard critical value of the lower bound of the next belief state are calculated by the corresponding formulas, where U′ and L′ are the upper-bound and lower-bound standard critical values of the current belief state, respectively.
PCT/CN2018/082104 2018-02-01 2018-04-08 Partially observable markov decision process-based optimal robot path planning method WO2019148645A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201810102240.0 2018-02-01
CN201810102240.0A CN108680155B (en) 2018-02-01 2018-02-01 Robot optimal path planning method based on partial perception Markov decision process

Publications (1)

Publication Number Publication Date
WO2019148645A1 (en) 2019-08-08

Family

ID=63800139

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/082104 WO2019148645A1 (en) 2018-02-01 2018-04-08 Partially observable markov decision process-based optimal robot path planning method

Country Status (2)

Country Link
CN (1) CN108680155B (en)
WO (1) WO2019148645A1 (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110716554B (en) * 2019-11-12 2020-08-14 华育昌(肇庆)智能科技研究有限公司 Vision-based household robot
CN111026110B (en) * 2019-11-20 2021-04-30 北京理工大学 Uncertain action planning method for linear time sequence logic containing soft and hard constraints
CN110989602B (en) * 2019-12-12 2023-12-26 齐鲁工业大学 Autonomous guided vehicle path planning method and system in medical pathology inspection laboratory
CN111311384A (en) * 2020-05-15 2020-06-19 支付宝(杭州)信息技术有限公司 Method and system for training recommendation model
CN111896006B (en) * 2020-08-11 2022-10-04 燕山大学 Path planning method and system based on reinforcement learning and heuristic search
CN112356031B (en) * 2020-11-11 2022-04-01 福州大学 On-line planning method based on Kernel sampling strategy under uncertain environment
US12118441B2 (en) 2021-12-13 2024-10-15 International Business Machines Corporation Knowledge augmented sequential decision-making under uncertainty

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102929281A (en) * 2012-11-05 2013-02-13 西南科技大学 Robot k-nearest-neighbor (kNN) path planning method under incomplete perception environment
US20170364831A1 (en) * 2016-06-21 2017-12-21 Sri International Systems and methods for machine learning using a trusted model
CN107368076A (en) * 2017-07-31 2017-11-21 中南大学 Robot motion's pathdepth learns controlling planning method under a kind of intelligent environment

Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111007858A (en) * 2019-12-23 2020-04-14 北京三快在线科技有限公司 Training method of vehicle driving decision model, and driving decision determining method and device
CN111367317A (en) * 2020-03-27 2020-07-03 中国人民解放军国防科技大学 Unmanned aerial vehicle cluster online task planning method based on Bayesian learning
CN112131754A (en) * 2020-09-30 2020-12-25 中国人民解放军国防科技大学 Extended POMDP planning method and system based on robot accompanying behavior model
CN113189986B (en) * 2021-04-16 2023-03-14 中国人民解放军国防科技大学 Two-stage self-adaptive behavior planning method and system for autonomous robot
CN113189986A (en) * 2021-04-16 2021-07-30 中国人民解放军国防科技大学 Two-stage self-adaptive behavior planning method and system for autonomous robot
CN113433937A (en) * 2021-06-08 2021-09-24 杭州未名信科科技有限公司 Heuristic exploration-based layered navigation obstacle avoidance system and layered navigation obstacle avoidance method
CN114118441A (en) * 2021-11-24 2022-03-01 福州大学 Online planning method based on efficient search strategy under uncertain environment
CN114118441B (en) * 2021-11-24 2024-10-22 福州大学 Online planning method based on efficient search strategy in uncertainty environment
CN114253265B (en) * 2021-12-17 2023-10-20 成都朴为科技有限公司 On-time arrival probability maximum path planning algorithm and system based on fourth-order moment
CN114253265A (en) * 2021-12-17 2022-03-29 成都朴为科技有限公司 On-time arrival probability maximum path planning algorithm and system based on fourth-order moment
CN114355973A (en) * 2021-12-28 2022-04-15 哈尔滨工程大学 Multi-agent hierarchical reinforcement learning-based unmanned cluster cooperation method under weak observation condition
CN114355973B (en) * 2021-12-28 2023-12-08 哈尔滨工程大学 Unmanned cluster cooperation method based on multi-agent layered reinforcement learning under weak observation condition
CN114676935A (en) * 2022-04-26 2022-06-28 中国人民解放军国防科技大学 Multi-Agent system mixed task planning algorithm under uncertain environment
CN114676935B (en) * 2022-04-26 2024-08-16 中国人民解放军国防科技大学 Multi-Agent system mixed task planning algorithm under uncertain environment
CN114880946A (en) * 2022-05-31 2022-08-09 苏州大学 Intelligent agent random exploration method based on flight strategy
CN115952892A (en) * 2022-12-13 2023-04-11 广东工业大学 Branch iteration method and system for defective plate stock layout

Also Published As

Publication number Publication date
CN108680155A (en) 2018-10-19
CN108680155B (en) 2020-09-08

Similar Documents

Publication Publication Date Title
WO2019148645A1 (en) Partially observable markov decision process-based optimal robot path planning method
CN104571113B (en) The paths planning method of mobile robot
Zhu et al. Online minimax Q network learning for two-player zero-sum Markov games
CN105740644B (en) Cleaning robot optimal target path planning method based on model learning
Zheng et al. Robust bayesian inverse reinforcement learning with sparse behavior noise
CN106444769B (en) A kind of optimum path planning method of indoor mobile robot increment type environmental information sampling
CN104298239B (en) A kind of indoor mobile robot strengthens map study paths planning method
Bode Neural networks for cost estimation: simulations and pilot application
Chatterjee et al. A Geese PSO tuned fuzzy supervisor for EKF based solutions of simultaneous localization and mapping (SLAM) problems in mobile robots
CN110378439A (en) Single robot path planning method based on Q-Learning algorithm
CN108469827A (en) A kind of automatic guided vehicle global path planning method suitable for logistic storage system
CN109241291A (en) Knowledge mapping optimal path inquiry system and method based on deeply study
CN110174118A (en) Robot multiple-objective search-path layout method and apparatus based on intensified learning
Zhang et al. Learning novel policies for tasks
Kollar et al. Efficient Optimization of Information-Theoretic Exploration in SLAM.
CN109782270B (en) Data association method under multi-sensor multi-target tracking condition
Wang et al. On the convergence of the monte carlo exploring starts algorithm for reinforcement learning
CN112356031A (en) On-line planning method based on Kernel sampling strategy under uncertain environment
Janmohamed et al. Improving the Data Efficiency of Multi-Objective Quality-Diversity through Gradient Assistance and Crowding Exploration
CN114139675B (en) Method for improving selection reliability and action accuracy in intelligent agent control
CN113741416B (en) Multi-robot full-coverage path planning method based on improved predator prey model and DMPC
CN111814908B (en) Abnormal data detection model updating method and device based on data flow
CN114118441A (en) Online planning method based on efficient search strategy under uncertain environment
Zhou et al. Switching deep reinforcement learning based intelligent online decision making for autonomous systems under uncertain environment
Gao et al. Partial Consistency for Stabilizing Undiscounted Reinforcement Learning

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18903071

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 18903071

Country of ref document: EP

Kind code of ref document: A1

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 26/01/2021)