CN102929281A - Robot k-nearest-neighbor (kNN) path planning method under incomplete perception environment - Google Patents

Robot k-nearest-neighbor (kNN) path planning method under incomplete perception environment

Info

Publication number
CN102929281A
Authority
CN
China
Prior art keywords
state
robot
action
value
knn
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN 201210455666
Other languages
Chinese (zh)
Inventor
江虹
黄玉清
李强
秦明伟
李小霞
张晓琴
石繁荣
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southwest University of Science and Technology
Original Assignee
Southwest University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southwest University of Science and Technology filed Critical Southwest University of Science and Technology
Priority to CN 201210455666 priority Critical patent/CN102929281A/en
Publication of CN102929281A publication Critical patent/CN102929281A/en
Pending legal-status Critical Current

Landscapes

  • Feedback Control In General (AREA)
  • Manipulator (AREA)

Abstract

Robot path planning in unknown dynamic environments has important application value. The invention discloses a robot kNN path planning method for an incompletely perceived environment, which mainly comprises: establishing a POMDP model, solving the POMDP model, and constructing an iterative learning model. By means of the iterative model, the invention improves the robot's ability to learn and adapt to the environment during path planning and can improve path planning performance.

Description

A robot kNN path planning method for an incompletely perceived environment

Technical Field

The invention is a robot path planning method for unknown dynamic environments, belongs to the technical field of robot navigation, and in particular relates to robot path planning algorithms.

Background Art

With the development of robotics, the capabilities of robots keep improving and their fields of application keep expanding, especially in dangerous, special, or otherwise inaccessible settings such as nuclear emergency response and space operations, all of which require robot intervention. Path planning is a key part of robot navigation. The robot path planning problem is generally defined as follows: given the robot's start point and goal point, in an environment with fixed or moving obstacles, plan a collision-free path that satisfies some optimality criterion, so that the robot moves along this path to the goal. Typical optimality criteria include minimum energy consumption, minimum travel time, and minimum path length. Research on path planning methods therefore plays a vital role in finding a collision-free, optimal path.

To complete path planning safely and reliably in an unknown dynamic environment, a robot must be able to handle various kinds of uncertainty so as to improve its adaptability to the environment. Robot path planning with intelligent learning capability is therefore particularly important. Reinforcement learning is attractive for robot path planning because it is an unsupervised online learning method that does not require an accurate model of the environment, and it is receiving increasing attention in mobile robot path planning for dynamic unknown environments. For example, Mohammad Abdel Kareem Jaradat's "Reinforcement based mobile robot navigation in dynamic environment" compares reinforcement learning with the artificial potential field method, and the experimental results show that the reinforcement-learning-based path planning method has better applicability. Hoang-huu Viet's "Extended Dyna-Q Algorithm for Path Planning of Mobile Robots" builds on the Dyna-Q reinforcement learning algorithm and uses a maximum likelihood model to select actions and update the Q-value function, improving the convergence speed of the algorithm.

In these methods the robot completes path planning in a fully observable environment. In "Research on Multi-Agent System Planning Problems Based on Decision Theory", Wu Feng uses the DEC-POMDP model from a decision-theoretic perspective to solve multi-agent planning problems over large state spaces; this method accounts for the partial observability of environmental information, but the resulting model and algorithm are highly complex.

To address these problems, the invention proposes a robot kNN path planning method for incompletely perceived environments. The method adopts a local value-iteration learning model based on the idea of k-nearest-neighbor classification, accounts for the uncertainty of actions and the incompleteness of acquired environmental information in unknown environments, and improves the adaptability of robot path planning algorithms to real environments.

Summary of the Invention

The purpose of the invention is to address the partial observability of environmental information and the difficulty of solving over large state spaces in robot path planning under unknown dynamic environments, so as to effectively improve the applicability of the path planning algorithm. The method uses local point-value iteration based on k-nearest-neighbor classification in place of value iteration over all states, which effectively mitigates the curse of dimensionality in solving the POMDP model while improving the convergence of the reinforcement learning algorithm during path planning.

To achieve the above purpose, the technical solution adopted by the invention is a robot kNN path planning method for an incompletely perceived environment, comprising the following steps:

1. POMDP model establishment:

A grid map divides the robot's planning environment into small cells. The environment map is built with the grid method, and each cell corresponds to one state s in the state set S of the POMDP model. The action set A contains four actions: East, West, South, and North; at the next time step the robot can occupy one of the four adjacent obstacle-free cells. The robot receives a reward of 0 on reaching the goal state and a reward of -1 in all other cases. Because action execution is uncertain during the robot's continuous interaction with the environment, the transition probabilities are set so that the action selected by the optimal policy is executed correctly with high probability and the robot slips to either side of that action with small probability.
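As a concrete illustration of this model, the following Python sketch builds the state set, the reward, and the slip-style transition probabilities for a grid world. It is only a sketch under assumed values: the obstacle-map representation, the helper names (build_states, transition_probs), and the probabilities 0.8/0.1 are illustrative and are not specified in the text above.

```python
import numpy as np

ACTIONS = ["E", "W", "S", "N"]          # action set A: East, West, South, North
MOVES = {"E": (0, 1), "W": (0, -1), "S": (1, 0), "N": (-1, 0)}

def build_states(obstacle_map):
    """Each obstacle-free grid cell becomes one state s of the state set S."""
    rows, cols = obstacle_map.shape
    return [(r, c) for r in range(rows) for c in range(cols)
            if not obstacle_map[r, c]]

def reward(state, goal):
    """Reward 0 on reaching the goal state, -1 in all other cases."""
    return 0.0 if state == goal else -1.0

def transition_probs(state, action, states, p_correct=0.8, p_slip=0.1):
    """Execute the chosen action with high probability and slip to the two
    lateral directions with small probability (illustrative values)."""
    lateral = {"E": ("N", "S"), "W": ("N", "S"),
               "S": ("E", "W"), "N": ("E", "W")}
    probs = {}
    for a, p in [(action, p_correct),
                 (lateral[action][0], p_slip), (lateral[action][1], p_slip)]:
        dr, dc = MOVES[a]
        nxt = (state[0] + dr, state[1] + dc)
        nxt = nxt if nxt in states else state   # blocked moves stay in place
        probs[nxt] = probs.get(nxt, 0.0) + p
    return probs
```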

2. POMDP model solution:

In robot POMDP path planning, the robot's sensors cannot observe all environmental information. To solve for the optimal policy, the robot would need the complete sequence of actions taken and observed states, i.e., the history. The history can be replaced by a belief state: the belief state b(s) is a probability distribution over the state set S, i.e., an |S|-dimensional probability vector. By substituting belief states for states, the POMDP problem is converted into an MDP over belief states, and the action selection policy π becomes a mapping from belief states to actions, π(b)→a. Under the optimal policy π*, the discounted cumulative rewards of all belief states form the optimal value function Q(b, a), so the POMDP problem can be solved with the kNN-Sarsa(λ) algorithm used for solving MDP problems.

3. Construction of the iterative learning model:

After the robot's start position and goal position are set, the reinforcement-learning-based path planning method searches for a collision-free, optimal path from the start position to the goal position. In the search for the optimal path, the invention defines the cells the robot may reach as the states s of the iterative learning model, and an action a as a concrete movement direction: East, West, South, or North; the purpose of action selection is to shorten the path from the start position to the goal position as much as possible. The iterative model of the reinforcement learning algorithm defines a state-action value function Q for each pair (s, a), i.e., the discounted cumulative reward obtained when the robot selects an action in the current state and moves to the next state; the action selection policy chooses the optimal action according to this Q value so as to maximize the cumulative reward.

4. Table structures used by the iterative learning model:

To implement the method of the invention, the following table structures need to be constructed:

(1) Q Table

Robot path planning based on the iterative learning model first requires a state-action value function table, the Q Table. The Q Table is a two-dimensional matrix with |S| rows and |A| columns (|S| is the number of elements in the state set S, |A| the number of elements in the action set A); it stores the cumulative reward of every state-action pair, i.e., Q(s, a) is the maximum cumulative reward obtained when the optimal action a is selected and the state is updated to s.

(2) Transition function tables T

The transition function T: S×A→∏(S) describes the effect of an action on the environment state. The robot has four basic actions (East, West, South, North), so four transition function tables are built: T_E, T_W, T_S, and T_N, which give the probability of transitioning from state s_t to state s_{t+1} after choosing the East, West, South, or North action respectively.
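Under the same assumptions, the four tables T_E, T_W, T_S, and T_N can be stored as |S|×|S| matrices. The sketch below reuses the transition_probs helper from the previous sketch; the slip probabilities remain assumed values.

```python
import numpy as np

def build_transition_tables(states, p_correct=0.8, p_slip=0.1):
    """Build one |S| x |S| transition matrix per action; tables["E"] plays
    the role of T_E, and so on for T_W, T_S, T_N."""
    index = {s: i for i, s in enumerate(states)}
    tables = {}
    for a in ACTIONS:
        T = np.zeros((len(states), len(states)))
        for s in states:
            for s_next, p in transition_probs(s, a, states,
                                              p_correct, p_slip).items():
                T[index[s], index[s_next]] += p
        tables[a] = T
    return tables
```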

(3) Observation function table O

The robot makes decisions according to the information detected by its onboard sensors. The observation function O: S×A→∏(Z) gives the probability that the robot, after executing action a in the current state s_t and transitioning to the new state s_{t+1}, receives observation z; that is, a probability distribution table over observations is built.

(4) Reward table R

Whether the robot has completed one path search is detected by checking whether it has reached the goal position: a reward of 0 is given when the robot reaches the goal, and a reward of -1 otherwise. The reward function R: S×A→R expresses that if an action obtains a higher reward from the environment, the tendency to produce that action during subsequent path searches is strengthened; otherwise it is weakened.

(c2) Iterative learning process:

The iterative learning process consists of the following steps:

Step 1: Initialization

Initialize the state-action value function table Q Table: set Q(s, a) = 0, eligibility trace e(s, a) = 0, initial belief state b(s) = 0.001076, parameter k = 5 (select 5 nearest neighbors), learning factor α = 0.95, discount factor γ = 0.99, and λ = 0.95, where γλ is the factor by which the eligibility trace e decays; ε = 0.001 is the probability of selecting a random action.
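A minimal sketch of this initialization, using the parameter values listed above; the container layout (numpy arrays, a parameter dict) and the name initialize are illustrative choices, not part of the method as stated.

```python
import numpy as np

N_ACTIONS = 4                                 # East, West, South, North

def initialize(n_states):
    """Step 1: zero Q Table and traces, uniform-style initial belief,
    and the hyperparameters given in the text."""
    Q = np.zeros((n_states, N_ACTIONS))       # Q Table, |S| x |A|
    e = np.zeros((n_states, N_ACTIONS))       # eligibility traces e(s, a)
    b0 = np.full(n_states, 0.001076)          # initial belief state b(s)
    params = dict(k=5, alpha=0.95, gamma=0.99, lam=0.95, eps=0.001)
    return Q, e, b0, params
```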

Step 2: Obtain the belief state set B of the current state s_t and its k nearest neighbor states

1) Take the robot's start position as the current state s_t;

2) Compute the state set knn consisting of the k states in S with the smallest Euclidean distance to s_t;

3) Compute the belief state value b_t(s) of each state in knn: b_t(s) = 1/|S|.
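The neighbor search and the uniform belief assignment of Step 2 can be sketched as follows, assuming grid cells are identified by (row, column) coordinates; the function name knn_belief is illustrative.

```python
import numpy as np

def knn_belief(current, states, k=5):
    """Find the k states closest to the current cell in Euclidean distance
    and give each of them the belief value 1/|S|."""
    dists = [np.hypot(current[0] - s[0], current[1] - s[1]) for s in states]
    knn_idx = np.argsort(dists)[:k]           # indices of the k nearest states
    belief = np.zeros(len(states))
    belief[knn_idx] = 1.0 / len(states)       # b_t(s) = 1/|S| on the neighbors
    return knn_idx, belief
```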

Step 3: Obtain the belief-state value function

The value function corresponding to the belief state b_t(s) is computed by:

Q(b,a) = \sum_{i \in knn} Q(i,a)\, b(i)

that is, the sum over the k-nearest-neighbor set knn of the current state s_t of the products of the state value functions Q(i, a) in the Q(s, a) table and the belief states b(i).
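A sketch of this belief-weighted value computation, assuming Q is stored as the |S|×|A| Q Table and the belief as an |S|-dimensional vector as described above.

```python
import numpy as np

def belief_q_values(Q, knn_idx, belief):
    """Q(b, a) for every action: the belief-weighted sum of the tabular
    Q values over the neighbor set knn."""
    return np.array([np.sum(Q[knn_idx, a] * belief[knn_idx])
                     for a in range(Q.shape[1])])
```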

Step 4: Select an action

Select an action according to the ε-greedy action selection policy:

\pi(a) = \begin{cases} \arg\max_{a} \sum_{s \in S} Q(s,a)\, b(s), & U \ge \varepsilon \\ \mathrm{rand}(a), & U < \varepsilon \end{cases}

where U is a random number uniformly distributed in (0, 1), and the probability ε decays by a factor of 0.99 in each learning episode; that is, a random action is selected with larger probability in the early stage of learning to keep the algorithm from getting trapped in a local optimum, and as the useful information in the Q values increases, ε gradually decreases, which guarantees convergence of the algorithm.
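A sketch of this ε-greedy rule with the per-episode 0.99 decay; rng is an assumed numpy random generator and the function names are illustrative.

```python
import numpy as np

def select_action(q_b, eps, rng):
    """Greedy over the belief-state values Q(b, a) with probability 1 - eps,
    random action with probability eps."""
    if rng.random() < eps:                    # U < eps: random action
        return int(rng.integers(len(q_b)))
    return int(np.argmax(q_b))                # U >= eps: greedy action

def decay_epsilon(eps):
    return eps * 0.99                         # applied once per episode

# usage: rng = np.random.default_rng(); a = select_action(q_b, eps, rng)
```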

Step 5: Execute the action

After executing action a_t, the robot transitions to the new state s_{t+1} and simultaneously obtains the observation z and the reward R.

Step 6: Compute the reward R

After executing action a_t the robot arrives at a new position. Check whether this position is the goal position: if not, the reward is -1 and the method proceeds to Step 7; otherwise the reward is 0 and the method proceeds to Step 10.

Step 7: Obtain the belief state set B′ corresponding to the next state s_{t+1}

1) Compute the state set knn′ consisting of the k states in S with the smallest Euclidean distance to s_{t+1};

2) Compute the belief state value b_{t+1}(s′) of each state in knn′ (a sketch of this update follows the list):

b_{t+1}(s') = \frac{O(s',a,z) \sum_{s \in S} T(s,a,s')\, b_t(s)}{\sum_{s' \in S} O(s',a,z) \sum_{s \in S} T(s,a,s')\, b_t(s)}

3) Repeat Steps 3-4.
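The belief update in item 2) above can be sketched as follows, assuming T_a is the |S|×|S| transition table of the executed action and O_az the vector of observation probabilities O(s′, a, z) taken from the tables described earlier; restricting and renormalizing the result over the new neighbor set knn′ is one possible reading of the local update and is an assumption of this sketch.

```python
import numpy as np

def update_belief(belief, T_a, O_az, knn_idx_next):
    """Bayesian belief update b_{t+1}(s') ∝ O(s',a,z) * sum_s T(s,a,s') b_t(s),
    kept only on the new neighbor set and renormalized."""
    predicted = T_a.T @ belief                # sum_s T(s, a, s') * b_t(s)
    unnorm = O_az * predicted                 # weight by observation likelihood
    new_belief = np.zeros_like(belief)
    new_belief[knn_idx_next] = unnorm[knn_idx_next]
    total = new_belief.sum()
    if total > 0:                             # normalize over the kept states
        new_belief /= total
    return new_belief
```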

Step 8: Update

1) The eligibility trace is defined as:

e(s,j) = \begin{cases} b(s), & j = a \\ 0, & j \ne a \end{cases} \qquad (s \in knn)

2) Update the state-action value functions Q(i, a) of all k nearest neighbor states of the robot's current state (sketched after this list):

\Delta q_a(s) = \alpha \left( r + \gamma \max_{a'} Q(b',a') - Q(s,a)\, b(s) \right) e(s,a), \qquad (s \in knn)

Q_{t+1}(s,a) = Q_t(s,a) + \Delta q_a(s), \qquad (s \in knn)

3) Set s_t = s_{t+1}, a_t = a_{t+1}, knn = knn′, e_{t+1} = γλ e_t, b_t(s) = b_{t+1}(s′).
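A sketch of the Step 8 update, combining the eligibility-trace assignment, the Q-value update over the neighbor set, and the γλ trace decay; the signature and the in-place array updates are illustrative choices under the assumptions of the previous sketches.

```python
import numpy as np

def update_q(Q, e, knn_idx, belief, action, r, q_b_next, params):
    """Assign e(s, a) = b(s) on the neighbor set, move the neighbors' Q values
    toward the one-step target r + gamma * max_a' Q(b', a'), then decay traces."""
    e[:] = 0.0
    e[knn_idx, action] = belief[knn_idx]                 # eligibility of the taken action
    target = r + params["gamma"] * np.max(q_b_next)      # r + gamma * max_a' Q(b', a')
    for s in knn_idx:
        delta = target - Q[s, action] * belief[s]        # as in the update rule above
        Q[s, action] += params["alpha"] * delta * e[s, action]
    e *= params["gamma"] * params["lam"]                 # trace decay by gamma * lambda
    return Q, e
```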

Step 9: Go to Step 5.

Step 10: One iterative learning episode ends; go to Step 2 and start the next iterative learning episode, until the Q values converge to the optimal or a near-optimal solution.

Claims (1)

1. A robot kNN path planning method suitable for an incompletely perceived environment, characterized by three steps: POMDP model establishment, POMDP model solution, and iterative model construction:

(a) POMDP model establishment: a grid map divides the robot's planning environment into small cells, each cell corresponding to one state s in the state set S of the POMDP model; the action set A contains four actions, East, West, South, and North; at the next time step the robot can occupy one of the four adjacent obstacle-free cells; the robot receives a reward of 0 on reaching the goal state and a reward of -1 in all other cases; during the interaction between the robot and the environment, the transition probabilities are set so that the action selected by the optimal policy is executed correctly with high probability and the robot slips to either side of that action with small probability;

(b) POMDP model solution: to solve for the optimal policy the robot needs the history of actions taken and observed states; the history can be replaced by a belief state, the belief state b(s) being a probability distribution over the state set S; by substituting belief states for states, the POMDP problem is converted into an MDP over belief states, and the action selection policy π becomes a mapping from belief states to actions, π(b)→a; under the optimal policy π* the discounted cumulative rewards of all belief states form the optimal value function Q(b, a);

(c) Iterative model construction: after the robot's start position and goal position are set, the reinforcement-learning-based path planning method is used; the reinforcement learning algorithm defines a state-action value function Q for each pair (s, a), i.e., the discounted cumulative reward obtained when the robot selects an action in the current state and moves to the next state, and the action selection policy chooses the optimal action according to this Q value so as to maximize the cumulative reward; the specific steps of the iterative learning algorithm are as follows:

Step 1: Initialization
Initialize the state-action value function table Q Table, and assign initial values to Q(s, a), the eligibility trace e(s, a), the initial belief state b(s), the parameter k, the learning factor α, and the random-action selection probability ε.

Step 2: Obtain the belief state set B of the current state s_t and its k nearest neighbor states
1) take the robot's start position as the current state s_t;
2) compute the state set knn consisting of the k states in S with the smallest Euclidean distance to s_t;
3) compute the belief state value b_t(s) of each state in knn: b_t(s) = 1/|S|.

Step 3: Obtain the belief-state value function
The value function corresponding to the belief state b_t(s) is computed by
Q(b,a) = \sum_{i \in knn} Q(i,a)\, b(i)
that is, the sum over the k-nearest-neighbor set knn of the current state s_t of the products of the state value functions Q(i, a) in the Q(s, a) table and the belief states b(i).

Step 4: Select an action
Select an action according to the ε-greedy action selection policy:
\pi(a) = \begin{cases} \arg\max_{a} \sum_{s \in S} Q(s,a)\, b(s), & U \ge \varepsilon \\ \mathrm{rand}(a), & U < \varepsilon \end{cases}
where U is a random number uniformly distributed in (0, 1), and the probability ε decays by a factor of 0.99 in each learning episode, i.e., a random action is selected with larger probability in the early stage of learning to keep the algorithm from getting trapped in a local optimum; as the useful information in the Q values increases, ε gradually decreases, which guarantees convergence of the algorithm.

Step 5: Execute the action
After executing action a_t, the robot transitions to the new state s_{t+1} and simultaneously obtains the observation z and the reward R.

Step 6: Compute the reward R
After executing action a_t the robot arrives at a new position; if this position is not the goal position, the reward is -1 and the method proceeds to Step 7; otherwise the reward is 0 and the method proceeds to Step 10.

Step 7: Obtain the belief state set B′ corresponding to the next state s_{t+1}
1) compute the state set knn′ consisting of the k states in S with the smallest Euclidean distance to s_{t+1};
2) compute the belief state value b_{t+1}(s′) of each state in knn′:
b_{t+1}(s') = \frac{O(s',a,z) \sum_{s \in S} T(s,a,s')\, b_t(s)}{\sum_{s' \in S} O(s',a,z) \sum_{s \in S} T(s,a,s')\, b_t(s)}
3) repeat Steps 3-4.

Step 8: Update
1) the eligibility trace is defined as
e(s,j) = \begin{cases} b(s), & j = a \\ 0, & j \ne a \end{cases} \qquad (s \in knn)
2) update the state-action value functions Q(i, a) of all k nearest neighbor states of the robot's current state:
\Delta q_a(s) = \alpha \left( r + \gamma \max_{a'} Q(b',a') - Q(s,a)\, b(s) \right) e(s,a), \qquad (s \in knn)
Q_{t+1}(s,a) = Q_t(s,a) + \Delta q_a(s), \qquad (s \in knn)
3) set s_t = s_{t+1}, a_t = a_{t+1}, knn = knn′, e_{t+1} = γλ e_t, b_t(s) = b_{t+1}(s′).

Step 9: Go to Step 5.

Step 10: One iterative learning episode ends; go to Step 2 and start the next iterative learning episode, until the Q values converge to the optimal or a near-optimal solution.
CN 201210455666 2012-11-05 2012-11-05 Robot k-nearest-neighbor (kNN) path planning method under incomplete perception environment Pending CN102929281A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN 201210455666 CN102929281A (en) 2012-11-05 2012-11-05 Robot k-nearest-neighbor (kNN) path planning method under incomplete perception environment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN 201210455666 CN102929281A (en) 2012-11-05 2012-11-05 Robot k-nearest-neighbor (kNN) path planning method under incomplete perception environment

Publications (1)

Publication Number Publication Date
CN102929281A true CN102929281A (en) 2013-02-13

Family

ID=47644109

Family Applications (1)

Application Number Title Priority Date Filing Date
CN 201210455666 Pending CN102929281A (en) 2012-11-05 2012-11-05 Robot k-nearest-neighbor (kNN) path planning method under incomplete perception environment

Country Status (1)

Country Link
CN (1) CN102929281A (en)

Cited By (56)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103605368A (en) * 2013-12-04 2014-02-26 苏州大学张家港工业技术研究院 Method and device for route programming in dynamic unknown environment
CN103901888A (en) * 2014-03-18 2014-07-02 北京工业大学 Robot autonomous motion control method based on infrared and sonar sensors
CN103901888B (en) * 2014-03-18 2017-01-25 北京工业大学 Robot autonomous motion control method based on infrared and sonar sensors
CN104626206A (en) * 2014-12-17 2015-05-20 西南科技大学 Robot operation pose information measuring method under non-structural environment
CN104658297A (en) * 2015-02-04 2015-05-27 沈阳理工大学 Central type dynamic path inducing method based on Sarsa learning
CN104680264A (en) * 2015-03-27 2015-06-03 青岛大学 Transportation vehicle path optimizing method based on multi-agent reinforcement learning
CN104680264B (en) * 2015-03-27 2017-09-22 青岛大学 A kind of vehicle route optimization method based on multiple agent intensified learning
CN105549598A (en) * 2016-02-16 2016-05-04 江南大学 Iterative learning trajectory tracking control and robust optimization method for two-dimensional motion mobile robot
CN105549598B (en) * 2016-02-16 2018-04-17 江南大学 The iterative learning Trajectory Tracking Control and its robust Optimal methods of a kind of two dimensional motion mobile robot
CN105740644B (en) * 2016-03-24 2018-04-13 苏州大学 Cleaning robot optimal target path planning method based on model learning
CN105740644A (en) * 2016-03-24 2016-07-06 苏州大学 Cleaning robot optimal target path planning method based on model learning
CN107065890A (en) * 2017-06-02 2017-08-18 北京航空航天大学 A kind of unmanned vehicle intelligent barrier avoiding method and system
CN107403426A (en) * 2017-06-20 2017-11-28 北京工业大学 A kind of target object detection method and equipment
CN107403426B (en) * 2017-06-20 2020-02-21 北京工业大学 A target object detection method and device
CN107479547A (en) * 2017-08-11 2017-12-15 同济大学 Decision tree behaviour decision making algorithm based on learning from instruction
CN107479547B (en) * 2017-08-11 2020-11-24 同济大学 Behavioral Decision Algorithm of Decision Tree Based on Teaching Learning
CN107703509A (en) * 2017-11-06 2018-02-16 苏州科技大学 A kind of optimal system and method fished a little of sonar contact shoal of fish selection
CN107703509B (en) * 2017-11-06 2023-08-04 苏州科技大学 System and method for selecting optimal fishing point by detecting fish shoal through sonar
WO2019148645A1 (en) * 2018-02-01 2019-08-08 苏州大学张家港工业技术研究院 Partially observable markov decision process-based optimal robot path planning method
CN108680155B (en) * 2018-02-01 2020-09-08 苏州大学 Robot optimal path planning method based on partial perception Markov decision process
CN108680155A (en) * 2018-02-01 2018-10-19 苏州大学 The robot optimum path planning method of mahalanobis distance map process is perceived based on part
CN108319286A (en) * 2018-03-12 2018-07-24 西北工业大学 A kind of unmanned plane Air Combat Maneuvering Decision Method based on intensified learning
CN108319286B (en) * 2018-03-12 2020-09-22 西北工业大学 A Reinforcement Learning-Based UAV Air Combat Maneuvering Decision Method
CN108572654A (en) * 2018-04-25 2018-09-25 哈尔滨工程大学 Three-dimensional stabilization control and realization method of underactuated AUV virtual mooring based on Q-learning
CN108762249A (en) * 2018-04-26 2018-11-06 常熟理工学院 Clean robot optimum path planning method based on the optimization of approximate model multistep
CN109059939A (en) * 2018-06-27 2018-12-21 湖南智慧畅行交通科技有限公司 Map-matching algorithm based on Hidden Markov Model
CN109238297B (en) * 2018-08-29 2022-03-18 沈阳理工大学 Dynamic path selection method for user optimization and system optimization
CN109238297A (en) * 2018-08-29 2019-01-18 沈阳理工大学 A kind of user is optimal and the Dynamic User-Optimal Route Choice method of system optimal
CN109445437A (en) * 2018-11-30 2019-03-08 电子科技大学 A kind of paths planning method of unmanned electric vehicle
CN109579861B (en) * 2018-12-10 2020-05-19 华中科技大学 A Reinforcement Learning-Based Path Navigation Method and System
CN109579861A (en) * 2018-12-10 2019-04-05 华中科技大学 A kind of method for path navigation and system based on intensified learning
CN109857107A (en) * 2019-01-30 2019-06-07 广州大学 AGV trolley air navigation aid, device, system, medium and equipment
CN110361017A (en) * 2019-07-19 2019-10-22 西南科技大学 A kind of full traverse path planing method of sweeping robot based on Grid Method
CN110361017B (en) * 2019-07-19 2022-02-11 西南科技大学 A full traversal path planning method for sweeping robot based on grid method
CN110531620A (en) * 2019-09-02 2019-12-03 常熟理工学院 Trolley based on Gaussian process approximate model is gone up a hill system self-adaption control method
CN110941268A (en) * 2019-11-20 2020-03-31 苏州大学 A control method of unmanned automatic car based on Sarsa safety model
CN110989602B (en) * 2019-12-12 2023-12-26 齐鲁工业大学 Autonomous guided vehicle path planning method and system in medical pathology inspection laboratory
CN110989602A (en) * 2019-12-12 2020-04-10 齐鲁工业大学 Method and system for planning paths of autonomous guided vehicle in medical pathological examination laboratory
CN113111296A (en) * 2019-12-24 2021-07-13 浙江吉利汽车研究院有限公司 Vehicle path planning method and device, electronic equipment and storage medium
CN111240318A (en) * 2019-12-24 2020-06-05 华中农业大学 A Robotic Person Discovery Algorithm
CN111367317A (en) * 2020-03-27 2020-07-03 中国人民解放军国防科技大学 Unmanned aerial vehicle cluster online task planning method based on Bayesian learning
US11605026B2 (en) 2020-05-15 2023-03-14 Huawei Technologies Co. Ltd. Methods and systems for support policy learning
WO2021227536A1 (en) * 2020-05-15 2021-11-18 Huawei Technologies Co., Ltd. Methods and systems for support policy learning
CN111369181B (en) * 2020-06-01 2020-09-29 北京全路通信信号研究设计院集团有限公司 Train autonomous scheduling deep reinforcement learning method and device
CN111369181A (en) * 2020-06-01 2020-07-03 北京全路通信信号研究设计院集团有限公司 Train autonomous scheduling deep reinforcement learning method and module
CN111752274A (en) * 2020-06-17 2020-10-09 杭州电子科技大学 A path tracking control method for laser AGV based on reinforcement learning
CN111752274B (en) * 2020-06-17 2022-06-24 杭州电子科技大学 A path tracking control method for laser AGV based on reinforcement learning
CN112015174B (en) * 2020-07-10 2022-06-28 歌尔股份有限公司 Multi-AGV motion planning method, device and system
US12045061B2 (en) 2020-07-10 2024-07-23 Goertek Inc. Multi-AGV motion planning method, device and system
CN112015174A (en) * 2020-07-10 2020-12-01 歌尔股份有限公司 Multi-AGV motion planning method, device and system
CN111645079A (en) * 2020-08-04 2020-09-11 天津滨电电力工程有限公司 Device and method for planning and controlling mechanical arm path of live working robot
CN112131754B (en) * 2020-09-30 2024-07-16 中国人民解放军国防科技大学 Extension POMDP planning method and system based on robot accompanying behavior model
CN112131754A (en) * 2020-09-30 2020-12-25 中国人民解放军国防科技大学 Extended POMDP planning method and system based on robot accompanying behavior model
CN112356031A (en) * 2020-11-11 2021-02-12 福州大学 On-line planning method based on Kernel sampling strategy under uncertain environment
CN112356031B (en) * 2020-11-11 2022-04-01 福州大学 An Online Planning Method Based on Kernel Sampling Strategy in Uncertain Environment
CN113467481A (en) * 2021-08-11 2021-10-01 哈尔滨工程大学 Path planning method based on improved Sarsa algorithm

Similar Documents

Publication Publication Date Title
CN102929281A (en) Robot k-nearest-neighbor (kNN) path planning method under incomplete perception environment
Semnani et al. Multi-agent motion planning for dense and dynamic environments via deep reinforcement learning
Wang et al. A survey of learning‐based robot motion planning
Li et al. A general framework of motion planning for redundant robot manipulator based on deep reinforcement learning
Xie et al. Deep imitation learning for bimanual robotic manipulation
CN105527965A (en) Route planning method and system based on genetic ant colony algorithm
Soltero et al. Generating informative paths for persistent sensing in unknown environments
Zhang et al. Reinforcement Learning in Robot Path Optimization.
Zhang et al. Learning to cooperate: Application of deep reinforcement learning for online AGV path finding
Ma et al. State-chain sequential feedback reinforcement learning for path planning of autonomous mobile robots
Wang et al. A novel incremental learning scheme for reinforcement learning in dynamic environments
Mohanty et al. A new ecologically inspired algorithm for mobile robot navigation
Ni et al. An improved shuffled frog leaping algorithm for robot path planning
CN107422734B (en) Robot path planning method based on chaotic reverse pollination algorithm
CN104020769B (en) Robot overall path planning method based on charge system search
Mohanty et al. A new real time path planning for mobile robot navigation using invasive weed optimization algorithm
Chaudhary et al. Obstacle avoidance of a point-mass robot using feedforward neural network
Hasan et al. Improved Variable Step Length A* Search Algorithm for Path Planning of Mobile Robots
Chen et al. Timesharing-tracking framework for decentralized reinforcement learning in fully cooperative multi-agent system
Su et al. Path planning for mobile robots based on genetic algorithms
Wang et al. Fault-tolerant pattern formation by multiple robots: a learning approach
Mo et al. Bio-geography based differential evolution for robot path planning
Suh et al. Energy-efficient high-dimensional motion planning for humanoids using stochastic optimization
Loganathan et al. Robot path planning via Harris hawks optimization: A comparative assessment
Wang et al. Welding robot path optimization based on hybrid discrete PSO

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C02 Deemed withdrawal of patent application after publication (patent law 2001)
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20130213