CN102929281A - Robot k-nearest-neighbor (kNN) path planning method under incomplete perception environment - Google Patents

Robot k-nearest-neighbor (kNN) path planning method under incomplete perception environment

Info

Publication number
CN102929281A
Authority
CN
China
Prior art keywords
state
robot
action
value
knn
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN 201210455666
Other languages
Chinese (zh)
Inventor
江虹
黄玉清
李强
秦明伟
李小霞
张晓琴
石繁荣
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southwest University of Science and Technology
Original Assignee
Southwest University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southwest University of Science and Technology filed Critical Southwest University of Science and Technology
Priority to CN 201210455666 priority Critical patent/CN102929281A/en
Publication of CN102929281A publication Critical patent/CN102929281A/en
Pending legal-status Critical Current

Landscapes

  • Feedback Control In General (AREA)
  • Manipulator (AREA)

Abstract

Robot path planning in unknown dynamic environments has important application value. The invention discloses a robot kNN path planning method for an incompletely perceived environment, which mainly comprises: establishing a POMDP model, solving the POMDP model, and constructing an iterative learning model. By means of the iterative model, the invention improves the robot's ability to learn and adapt to the environment during path planning and can improve path planning performance.

Description

A robot kNN path planning method for an incompletely perceived environment

Technical Field

The invention is a robot path planning method for unknown dynamic environments, belongs to the technical field of robot navigation, and in particular relates to robot path planning algorithms.

Background Art

With the development of robotics, the capabilities of robots keep improving and their fields of application keep expanding, especially in dangerous, special, or otherwise inaccessible settings such as nuclear emergency response and space operations, all of which require robot intervention. Path planning is a key part of robot navigation. The robot path planning problem is generally defined as follows: given the robot's start point and goal point, in an environment with fixed or moving obstacles, plan a collision-free path that satisfies some optimality criterion, so that the robot moves along this path to the goal. Typical optimality criteria include minimum energy consumption, minimum travel time, and minimum path length. Research on path planning methods therefore plays a vital role in finding a collision-free, optimal path.

To complete path planning safely and reliably in an unknown dynamic environment, a robot must be able to handle various kinds of uncertainty so as to improve its adaptability to the environment. Robot path planning with intelligent learning capability is therefore particularly important. Reinforcement learning is attractive for robot path planning because it is an unsupervised online learning method that does not require an accurate model of the environment, and it is receiving increasing attention in mobile robot path planning for dynamic unknown environments. For example, Mohammad Abdel Kareem Jaradat's "Reinforcement based mobile robot navigation in dynamic environment" compares reinforcement learning with the artificial potential field method, and the experimental results show that the reinforcement-learning-based path planning method has better applicability. Hoang-huu Viet's "Extended Dyna-Q Algorithm for Path Planning of Mobile Robots" builds on the Dyna-Q reinforcement learning algorithm and uses a maximum likelihood model to select actions and update the Q-value function, improving the convergence speed of the algorithm.

In these methods the robot completes path planning in a fully observable environment. In "Research on Multi-Agent System Planning Problems Based on Decision Theory", Wu Feng uses the DEC-POMDP model from a decision-theoretic perspective to solve multi-agent planning problems over large state spaces; this method accounts for the partial observability of environmental information, but the resulting model and algorithm are highly complex.

To address these problems, the invention proposes a robot kNN path planning method for incompletely perceived environments. The method adopts a local value-iteration learning model based on the idea of k-nearest-neighbor classification, accounts for the uncertainty of actions and the incompleteness of acquired environmental information in unknown environments, and improves the adaptability of robot path planning algorithms to real environments.

Summary of the Invention

The purpose of the invention is to address the partial observability of environmental information and the difficulty of solving over large state spaces in robot path planning under unknown dynamic environments, so as to effectively improve the applicability of the path planning algorithm. The method uses local point-value iteration based on k-nearest-neighbor classification in place of value iteration over all states, which effectively mitigates the curse of dimensionality in solving the POMDP model while improving the convergence of the reinforcement learning algorithm during path planning.

To achieve the above purpose, the technical solution adopted by the invention is a robot kNN path planning method for an incompletely perceived environment, comprising the following steps:

1. POMDP model establishment:

A grid map divides the robot's planning environment into small cells. The environment map is built with the grid method, and each cell corresponds to one state s in the state set S of the POMDP model. The action set A contains four actions: East, West, South, and North; at the next time step the robot can occupy one of the four adjacent obstacle-free cells. The robot receives a reward of 0 on reaching the goal state and a reward of -1 in all other cases. Because action execution is uncertain during the robot's continuous interaction with the environment, the transition probabilities are set so that the action selected by the optimal policy is executed correctly with high probability and the robot slips to either side of that action with small probability.
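As a concrete illustration of this model, the following Python sketch builds the state set, the reward, and the slip-style transition probabilities for a grid world. It is only a sketch under assumed values: the obstacle-map representation, the helper names (build_states, transition_probs), and the probabilities 0.8/0.1 are illustrative and are not specified in the text above.

```python
import numpy as np

ACTIONS = ["E", "W", "S", "N"]          # action set A: East, West, South, North
MOVES = {"E": (0, 1), "W": (0, -1), "S": (1, 0), "N": (-1, 0)}

def build_states(obstacle_map):
    """Each obstacle-free grid cell becomes one state s of the state set S."""
    rows, cols = obstacle_map.shape
    return [(r, c) for r in range(rows) for c in range(cols)
            if not obstacle_map[r, c]]

def reward(state, goal):
    """Reward 0 on reaching the goal state, -1 in all other cases."""
    return 0.0 if state == goal else -1.0

def transition_probs(state, action, states, p_correct=0.8, p_slip=0.1):
    """Execute the chosen action with high probability and slip to the two
    lateral directions with small probability (illustrative values)."""
    lateral = {"E": ("N", "S"), "W": ("N", "S"),
               "S": ("E", "W"), "N": ("E", "W")}
    probs = {}
    for a, p in [(action, p_correct),
                 (lateral[action][0], p_slip), (lateral[action][1], p_slip)]:
        dr, dc = MOVES[a]
        nxt = (state[0] + dr, state[1] + dc)
        nxt = nxt if nxt in states else state   # blocked moves stay in place
        probs[nxt] = probs.get(nxt, 0.0) + p
    return probs
```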

2. POMDP model solution:

In robot POMDP path planning, the robot's sensors cannot observe all environmental information. To solve for the optimal policy, the robot would need the complete sequence of actions taken and observed states, i.e., the history. The history can be replaced by a belief state: the belief state b(s) is a probability distribution over the state set S, i.e., an |S|-dimensional probability vector. By substituting belief states for states, the POMDP problem is converted into an MDP over belief states, and the action selection policy π becomes a mapping from belief states to actions, π(b)→a. Under the optimal policy π*, the discounted cumulative rewards of all belief states form the optimal value function Q(b, a), so the POMDP problem can be solved with the kNN-Sarsa(λ) algorithm used for solving MDP problems.

3. Construction of the iterative learning model:

After the robot's start position and goal position are set, the reinforcement-learning-based path planning method searches for a collision-free, optimal path from the start position to the goal position. In the search for the optimal path, the invention defines the cells the robot may reach as the states s of the iterative learning model, and an action a as a concrete movement direction: East, West, South, or North; the purpose of action selection is to shorten the path from the start position to the goal position as much as possible. The iterative model of the reinforcement learning algorithm defines a state-action value function Q for each pair (s, a), i.e., the discounted cumulative reward obtained when the robot selects an action in the current state and moves to the next state; the action selection policy chooses the optimal action according to this Q value so as to maximize the cumulative reward.

4. Table structures used by the iterative learning model:

To implement the method of the invention, the following table structures need to be constructed:

(1) Q Table

Robot path planning based on the iterative learning model first requires a state-action value function table, the Q Table. The Q Table is a two-dimensional matrix with |S| rows and |A| columns (|S| is the number of elements in the state set S, |A| the number of elements in the action set A); it stores the cumulative reward of every state-action pair, i.e., Q(s, a) is the maximum cumulative reward obtained when the optimal action a is selected and the state is updated to s.

(2) Transition function tables T

The transition function T: S×A→∏(S) describes the effect of an action on the environment state. The robot has four basic actions (East, West, South, North), so four transition function tables are built: T_E, T_W, T_S, and T_N, which give the probability of transitioning from state s_t to state s_{t+1} after choosing the East, West, South, or North action respectively.
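Under the same assumptions, the four tables T_E, T_W, T_S, and T_N can be stored as |S|×|S| matrices. The sketch below reuses the transition_probs helper from the previous sketch; the slip probabilities remain assumed values.

```python
import numpy as np

def build_transition_tables(states, p_correct=0.8, p_slip=0.1):
    """Build one |S| x |S| transition matrix per action; tables["E"] plays
    the role of T_E, and so on for T_W, T_S, T_N."""
    index = {s: i for i, s in enumerate(states)}
    tables = {}
    for a in ACTIONS:
        T = np.zeros((len(states), len(states)))
        for s in states:
            for s_next, p in transition_probs(s, a, states,
                                              p_correct, p_slip).items():
                T[index[s], index[s_next]] += p
        tables[a] = T
    return tables
```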

(3) Observation function table O

The robot makes decisions according to the information detected by its onboard sensors. The observation function O: S×A→∏(Z) gives the probability that the robot, after executing action a in the current state s_t and transitioning to the new state s_{t+1}, receives observation z; that is, a probability distribution table over observations is built.

(4) Reward table R

Whether the robot has completed one path search is detected by checking whether it has reached the goal position: a reward of 0 is given when the robot reaches the goal, and a reward of -1 otherwise. The reward function R: S×A→R expresses that if an action obtains a higher reward from the environment, the tendency to produce that action during subsequent path searches is strengthened; otherwise it is weakened.

(c2) Iterative learning process:

The iterative learning process consists of the following steps:

Step 1: Initialization

Initialize the state-action value function table Q Table: set Q(s, a) = 0, eligibility trace e(s, a) = 0, initial belief state b(s) = 0.001076, parameter k = 5 (select 5 nearest neighbors), learning factor α = 0.95, discount factor γ = 0.99, and λ = 0.95, where γλ is the factor by which the eligibility trace e decays; ε = 0.001 is the probability of selecting a random action.
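A minimal sketch of this initialization, using the parameter values listed above; the container layout (numpy arrays, a parameter dict) and the name initialize are illustrative choices, not part of the method as stated.

```python
import numpy as np

N_ACTIONS = 4                                 # East, West, South, North

def initialize(n_states):
    """Step 1: zero Q Table and traces, uniform-style initial belief,
    and the hyperparameters given in the text."""
    Q = np.zeros((n_states, N_ACTIONS))       # Q Table, |S| x |A|
    e = np.zeros((n_states, N_ACTIONS))       # eligibility traces e(s, a)
    b0 = np.full(n_states, 0.001076)          # initial belief state b(s)
    params = dict(k=5, alpha=0.95, gamma=0.99, lam=0.95, eps=0.001)
    return Q, e, b0, params
```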

Step 2: Obtain the belief state set B of the current state s_t and its k nearest neighbor states

1) Take the robot's start position as the current state s_t;

2) Compute the state set knn consisting of the k states in S with the smallest Euclidean distance to s_t;

3) Compute the belief state value b_t(s) of each state in knn: b_t(s) = 1/|S|.
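The neighbor search and the uniform belief assignment of Step 2 can be sketched as follows, assuming grid cells are identified by (row, column) coordinates; the function name knn_belief is illustrative.

```python
import numpy as np

def knn_belief(current, states, k=5):
    """Find the k states closest to the current cell in Euclidean distance
    and give each of them the belief value 1/|S|."""
    dists = [np.hypot(current[0] - s[0], current[1] - s[1]) for s in states]
    knn_idx = np.argsort(dists)[:k]           # indices of the k nearest states
    belief = np.zeros(len(states))
    belief[knn_idx] = 1.0 / len(states)       # b_t(s) = 1/|S| on the neighbors
    return knn_idx, belief
```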

Step 3: Obtain the belief-state value function

The value function corresponding to the belief state b_t(s) is computed by:

Q(b,a) = \sum_{i \in knn} Q(i,a)\, b(i)

that is, the sum over the k-nearest-neighbor set knn of the current state s_t of the products of the state value functions Q(i, a) in the Q(s, a) table and the belief states b(i).
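A sketch of this belief-weighted value computation, assuming Q is stored as the |S|×|A| Q Table and the belief as an |S|-dimensional vector as described above.

```python
import numpy as np

def belief_q_values(Q, knn_idx, belief):
    """Q(b, a) for every action: the belief-weighted sum of the tabular
    Q values over the neighbor set knn."""
    return np.array([np.sum(Q[knn_idx, a] * belief[knn_idx])
                     for a in range(Q.shape[1])])
```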

Step 4: Select an action

Select an action according to the ε-greedy action selection policy:

\pi(a) = \begin{cases} \arg\max_{a} \sum_{s \in S} Q(s,a)\, b(s), & U \ge \varepsilon \\ \mathrm{rand}(a), & U < \varepsilon \end{cases}

where U is a random number uniformly distributed in (0, 1), and the probability ε decays by a factor of 0.99 in each learning episode; that is, a random action is selected with larger probability in the early stage of learning to keep the algorithm from getting trapped in a local optimum, and as the useful information in the Q values increases, ε gradually decreases, which guarantees convergence of the algorithm.
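A sketch of this ε-greedy rule with the per-episode 0.99 decay; rng is an assumed numpy random generator and the function names are illustrative.

```python
import numpy as np

def select_action(q_b, eps, rng):
    """Greedy over the belief-state values Q(b, a) with probability 1 - eps,
    random action with probability eps."""
    if rng.random() < eps:                    # U < eps: random action
        return int(rng.integers(len(q_b)))
    return int(np.argmax(q_b))                # U >= eps: greedy action

def decay_epsilon(eps):
    return eps * 0.99                         # applied once per episode

# usage: rng = np.random.default_rng(); a = select_action(q_b, eps, rng)
```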

Step 5: Execute the action

After executing action a_t, the robot transitions to the new state s_{t+1} and simultaneously obtains the observation z and the reward R.

Step 6: Compute the reward R

After executing action a_t the robot arrives at a new position. Check whether this position is the goal position: if not, the reward is -1 and the method proceeds to Step 7; otherwise the reward is 0 and the method proceeds to Step 10.

Step 7: Obtain the belief state set B′ corresponding to the next state s_{t+1}

1) Compute the state set knn′ consisting of the k states in S with the smallest Euclidean distance to s_{t+1};

2) Compute the belief state value b_{t+1}(s′) of each state in knn′ (a sketch of this update follows the list):

b_{t+1}(s') = \frac{O(s',a,z) \sum_{s \in S} T(s,a,s')\, b_t(s)}{\sum_{s' \in S} O(s',a,z) \sum_{s \in S} T(s,a,s')\, b_t(s)}

3) Repeat Steps 3-4.
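The belief update in item 2) above can be sketched as follows, assuming T_a is the |S|×|S| transition table of the executed action and O_az the vector of observation probabilities O(s′, a, z) taken from the tables described earlier; restricting and renormalizing the result over the new neighbor set knn′ is one possible reading of the local update and is an assumption of this sketch.

```python
import numpy as np

def update_belief(belief, T_a, O_az, knn_idx_next):
    """Bayesian belief update b_{t+1}(s') ∝ O(s',a,z) * sum_s T(s,a,s') b_t(s),
    kept only on the new neighbor set and renormalized."""
    predicted = T_a.T @ belief                # sum_s T(s, a, s') * b_t(s)
    unnorm = O_az * predicted                 # weight by observation likelihood
    new_belief = np.zeros_like(belief)
    new_belief[knn_idx_next] = unnorm[knn_idx_next]
    total = new_belief.sum()
    if total > 0:                             # normalize over the kept states
        new_belief /= total
    return new_belief
```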

Step 8: Update

1) The eligibility trace is defined as:

e(s,j) = \begin{cases} b(s), & j = a \\ 0, & j \ne a \end{cases} \qquad (s \in knn)

2) Update the state-action value functions Q(i, a) of all k nearest neighbor states of the robot's current state (sketched after this list):

\Delta q_a(s) = \alpha \left( r + \gamma \max_{a'} Q(b',a') - Q(s,a)\, b(s) \right) e(s,a), \qquad (s \in knn)

Q_{t+1}(s,a) = Q_t(s,a) + \Delta q_a(s), \qquad (s \in knn)

3) Set s_t = s_{t+1}, a_t = a_{t+1}, knn = knn′, e_{t+1} = γλ e_t, b_t(s) = b_{t+1}(s′).
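A sketch of the Step 8 update, combining the eligibility-trace assignment, the Q-value update over the neighbor set, and the γλ trace decay; the signature and the in-place array updates are illustrative choices under the assumptions of the previous sketches.

```python
import numpy as np

def update_q(Q, e, knn_idx, belief, action, r, q_b_next, params):
    """Assign e(s, a) = b(s) on the neighbor set, move the neighbors' Q values
    toward the one-step target r + gamma * max_a' Q(b', a'), then decay traces."""
    e[:] = 0.0
    e[knn_idx, action] = belief[knn_idx]                 # eligibility of the taken action
    target = r + params["gamma"] * np.max(q_b_next)      # r + gamma * max_a' Q(b', a')
    for s in knn_idx:
        delta = target - Q[s, action] * belief[s]        # as in the update rule above
        Q[s, action] += params["alpha"] * delta * e[s, action]
    e *= params["gamma"] * params["lam"]                 # trace decay by gamma * lambda
    return Q, e
```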

Step 9: Go to Step 5.

Step 10: One iterative learning episode ends; go to Step 2 and start the next iterative learning episode, until the Q values converge to the optimal or a near-optimal solution.

Claims (1)

1. A robot kNN path planning method suitable for an incompletely perceived environment, characterized by three steps: POMDP model establishment, POMDP model solution, and iterative model construction:

(a) POMDP model establishment: a grid map divides the robot's planning environment into small cells, each cell corresponding to one state s in the state set S of the POMDP model; the action set A contains four actions, East, West, South, and North; at the next time step the robot can occupy one of the four adjacent obstacle-free cells; the robot receives a reward of 0 on reaching the goal state and a reward of -1 in all other cases; during the interaction between the robot and the environment, the transition probabilities are set so that the action selected by the optimal policy is executed correctly with high probability and the robot slips to either side of that action with small probability;

(b) POMDP model solution: to solve for the optimal policy the robot needs the history of actions taken and observed states; the history can be replaced by a belief state, the belief state b(s) being a probability distribution over the state set S; by substituting belief states for states, the POMDP problem is converted into an MDP over belief states, and the action selection policy π becomes a mapping from belief states to actions, π(b)→a; under the optimal policy π* the discounted cumulative rewards of all belief states form the optimal value function Q(b, a);

(c) Iterative model construction: after the robot's start position and goal position are set, the reinforcement-learning-based path planning method is used; the reinforcement learning algorithm defines a state-action value function Q for each pair (s, a), i.e., the discounted cumulative reward obtained when the robot selects an action in the current state and moves to the next state, and the action selection policy chooses the optimal action according to this Q value so as to maximize the cumulative reward; the specific steps of the iterative learning algorithm are as follows:

Step 1: Initialization
Initialize the state-action value function table Q Table, and assign initial values to Q(s, a), the eligibility trace e(s, a), the initial belief state b(s), the parameter k, the learning factor α, and the random-action selection probability ε.

Step 2: Obtain the belief state set B of the current state s_t and its k nearest neighbor states
1) take the robot's start position as the current state s_t;
2) compute the state set knn consisting of the k states in S with the smallest Euclidean distance to s_t;
3) compute the belief state value b_t(s) of each state in knn: b_t(s) = 1/|S|.

Step 3: Obtain the belief-state value function
The value function corresponding to the belief state b_t(s) is computed by
Q(b,a) = \sum_{i \in knn} Q(i,a)\, b(i)
that is, the sum over the k-nearest-neighbor set knn of the current state s_t of the products of the state value functions Q(i, a) in the Q(s, a) table and the belief states b(i).

Step 4: Select an action
Select an action according to the ε-greedy action selection policy:
\pi(a) = \begin{cases} \arg\max_{a} \sum_{s \in S} Q(s,a)\, b(s), & U \ge \varepsilon \\ \mathrm{rand}(a), & U < \varepsilon \end{cases}
where U is a random number uniformly distributed in (0, 1), and the probability ε decays by a factor of 0.99 in each learning episode, i.e., a random action is selected with larger probability in the early stage of learning to keep the algorithm from getting trapped in a local optimum; as the useful information in the Q values increases, ε gradually decreases, which guarantees convergence of the algorithm.

Step 5: Execute the action
After executing action a_t, the robot transitions to the new state s_{t+1} and simultaneously obtains the observation z and the reward R.

Step 6: Compute the reward R
After executing action a_t the robot arrives at a new position; if this position is not the goal position, the reward is -1 and the method proceeds to Step 7; otherwise the reward is 0 and the method proceeds to Step 10.

Step 7: Obtain the belief state set B′ corresponding to the next state s_{t+1}
1) compute the state set knn′ consisting of the k states in S with the smallest Euclidean distance to s_{t+1};
2) compute the belief state value b_{t+1}(s′) of each state in knn′:
b_{t+1}(s') = \frac{O(s',a,z) \sum_{s \in S} T(s,a,s')\, b_t(s)}{\sum_{s' \in S} O(s',a,z) \sum_{s \in S} T(s,a,s')\, b_t(s)}
3) repeat Steps 3-4.

Step 8: Update
1) the eligibility trace is defined as
e(s,j) = \begin{cases} b(s), & j = a \\ 0, & j \ne a \end{cases} \qquad (s \in knn)
2) update the state-action value functions Q(i, a) of all k nearest neighbor states of the robot's current state:
\Delta q_a(s) = \alpha \left( r + \gamma \max_{a'} Q(b',a') - Q(s,a)\, b(s) \right) e(s,a), \qquad (s \in knn)
Q_{t+1}(s,a) = Q_t(s,a) + \Delta q_a(s), \qquad (s \in knn)
3) set s_t = s_{t+1}, a_t = a_{t+1}, knn = knn′, e_{t+1} = γλ e_t, b_t(s) = b_{t+1}(s′).

Step 9: Go to Step 5.

Step 10: One iterative learning episode ends; go to Step 2 and start the next iterative learning episode, until the Q values converge to the optimal or a near-optimal solution.
CN 201210455666 2012-11-05 2012-11-05 Robot k-nearest-neighbor (kNN) path planning method under incomplete perception environment Pending CN102929281A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN 201210455666 CN102929281A (en) 2012-11-05 2012-11-05 Robot k-nearest-neighbor (kNN) path planning method under incomplete perception environment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN 201210455666 CN102929281A (en) 2012-11-05 2012-11-05 Robot k-nearest-neighbor (kNN) path planning method under incomplete perception environment

Publications (1)

Publication Number Publication Date
CN102929281A true CN102929281A (en) 2013-02-13

Family

ID=47644109

Family Applications (1)

Application Number Title Priority Date Filing Date
CN 201210455666 Pending CN102929281A (en) 2012-11-05 2012-11-05 Robot k-nearest-neighbor (kNN) path planning method under incomplete perception environment

Country Status (1)

Country Link
CN (1) CN102929281A (en)

Cited By (56)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103605368A (en) * 2013-12-04 2014-02-26 苏州大学张家港工业技术研究院 Method and device for route programming in dynamic unknown environment
CN103901888A (en) * 2014-03-18 2014-07-02 北京工业大学 Robot autonomous motion control method based on infrared and sonar sensors
CN103901888B (en) * 2014-03-18 2017-01-25 北京工业大学 Robot autonomous motion control method based on infrared and sonar sensors
CN104626206A (en) * 2014-12-17 2015-05-20 西南科技大学 Robot operation pose information measuring method under non-structural environment
CN104658297A (en) * 2015-02-04 2015-05-27 沈阳理工大学 Central type dynamic path inducing method based on Sarsa learning
CN104680264A (en) * 2015-03-27 2015-06-03 青岛大学 Transportation vehicle path optimizing method based on multi-agent reinforcement learning
CN104680264B (en) * 2015-03-27 2017-09-22 青岛大学 A kind of vehicle route optimization method based on multiple agent intensified learning
CN105549598A (en) * 2016-02-16 2016-05-04 江南大学 Iterative learning trajectory tracking control and robust optimization method for two-dimensional motion mobile robot
CN105549598B (en) * 2016-02-16 2018-04-17 江南大学 The iterative learning Trajectory Tracking Control and its robust Optimal methods of a kind of two dimensional motion mobile robot
CN105740644B (en) * 2016-03-24 2018-04-13 苏州大学 Cleaning robot optimal target path planning method based on model learning
CN105740644A (en) * 2016-03-24 2016-07-06 苏州大学 Cleaning robot optimal target path planning method based on model learning
CN107065890A (en) * 2017-06-02 2017-08-18 北京航空航天大学 A kind of unmanned vehicle intelligent barrier avoiding method and system
CN107403426A (en) * 2017-06-20 2017-11-28 北京工业大学 A kind of target object detection method and equipment
CN107403426B (en) * 2017-06-20 2020-02-21 北京工业大学 A target object detection method and device
CN107479547A (en) * 2017-08-11 2017-12-15 同济大学 Decision tree behaviour decision making algorithm based on learning from instruction
CN107479547B (en) * 2017-08-11 2020-11-24 同济大学 Behavioral Decision Algorithm of Decision Tree Based on Teaching Learning
CN107703509A (en) * 2017-11-06 2018-02-16 苏州科技大学 A kind of optimal system and method fished a little of sonar contact shoal of fish selection
CN107703509B (en) * 2017-11-06 2023-08-04 苏州科技大学 System and method for selecting optimal fishing point by detecting fish shoal through sonar
WO2019148645A1 (en) * 2018-02-01 2019-08-08 苏州大学张家港工业技术研究院 Partially observable markov decision process-based optimal robot path planning method
CN108680155B (en) * 2018-02-01 2020-09-08 苏州大学 Robot optimal path planning method based on partial perception Markov decision process
CN108680155A (en) * 2018-02-01 2018-10-19 苏州大学 The robot optimum path planning method of mahalanobis distance map process is perceived based on part
CN108319286A (en) * 2018-03-12 2018-07-24 西北工业大学 A kind of unmanned plane Air Combat Maneuvering Decision Method based on intensified learning
CN108319286B (en) * 2018-03-12 2020-09-22 西北工业大学 A Reinforcement Learning-Based UAV Air Combat Maneuvering Decision Method
CN108572654A (en) * 2018-04-25 2018-09-25 哈尔滨工程大学 Three-dimensional stabilization control and realization method of underactuated AUV virtual mooring based on Q-learning
CN108762249A (en) * 2018-04-26 2018-11-06 常熟理工学院 Clean robot optimum path planning method based on the optimization of approximate model multistep
CN109059939A (en) * 2018-06-27 2018-12-21 湖南智慧畅行交通科技有限公司 Map-matching algorithm based on Hidden Markov Model
CN109238297B (en) * 2018-08-29 2022-03-18 沈阳理工大学 Dynamic path selection method for user optimization and system optimization
CN109238297A (en) * 2018-08-29 2019-01-18 沈阳理工大学 A kind of user is optimal and the Dynamic User-Optimal Route Choice method of system optimal
CN109445437A (en) * 2018-11-30 2019-03-08 电子科技大学 A kind of paths planning method of unmanned electric vehicle
CN109579861B (en) * 2018-12-10 2020-05-19 华中科技大学 A Reinforcement Learning-Based Path Navigation Method and System
CN109579861A (en) * 2018-12-10 2019-04-05 华中科技大学 A kind of method for path navigation and system based on intensified learning
CN109857107A (en) * 2019-01-30 2019-06-07 广州大学 AGV trolley air navigation aid, device, system, medium and equipment
CN110361017A (en) * 2019-07-19 2019-10-22 西南科技大学 A kind of full traverse path planing method of sweeping robot based on Grid Method
CN110361017B (en) * 2019-07-19 2022-02-11 西南科技大学 A full traversal path planning method for sweeping robot based on grid method
CN110531620A (en) * 2019-09-02 2019-12-03 常熟理工学院 Trolley based on Gaussian process approximate model is gone up a hill system self-adaption control method
CN110941268A (en) * 2019-11-20 2020-03-31 苏州大学 A control method of unmanned automatic car based on Sarsa safety model
CN110989602B (en) * 2019-12-12 2023-12-26 齐鲁工业大学 Autonomous guided vehicle path planning method and system in medical pathology inspection laboratory
CN110989602A (en) * 2019-12-12 2020-04-10 齐鲁工业大学 Method and system for planning paths of autonomous guided vehicle in medical pathological examination laboratory
CN113111296A (en) * 2019-12-24 2021-07-13 浙江吉利汽车研究院有限公司 Vehicle path planning method and device, electronic equipment and storage medium
CN111240318A (en) * 2019-12-24 2020-06-05 华中农业大学 A Robotic Person Discovery Algorithm
CN111367317A (en) * 2020-03-27 2020-07-03 中国人民解放军国防科技大学 Unmanned aerial vehicle cluster online task planning method based on Bayesian learning
US11605026B2 (en) 2020-05-15 2023-03-14 Huawei Technologies Co. Ltd. Methods and systems for support policy learning
WO2021227536A1 (en) * 2020-05-15 2021-11-18 Huawei Technologies Co., Ltd. Methods and systems for support policy learning
CN111369181B (en) * 2020-06-01 2020-09-29 北京全路通信信号研究设计院集团有限公司 Train autonomous scheduling deep reinforcement learning method and device
CN111369181A (en) * 2020-06-01 2020-07-03 北京全路通信信号研究设计院集团有限公司 Train autonomous scheduling deep reinforcement learning method and module
CN111752274A (en) * 2020-06-17 2020-10-09 杭州电子科技大学 A path tracking control method for laser AGV based on reinforcement learning
CN111752274B (en) * 2020-06-17 2022-06-24 杭州电子科技大学 A path tracking control method for laser AGV based on reinforcement learning
CN112015174B (en) * 2020-07-10 2022-06-28 歌尔股份有限公司 Multi-AGV motion planning method, device and system
US12045061B2 (en) 2020-07-10 2024-07-23 Goertek Inc. Multi-AGV motion planning method, device and system
CN112015174A (en) * 2020-07-10 2020-12-01 歌尔股份有限公司 Multi-AGV motion planning method, device and system
CN111645079A (en) * 2020-08-04 2020-09-11 天津滨电电力工程有限公司 Device and method for planning and controlling mechanical arm path of live working robot
CN112131754B (en) * 2020-09-30 2024-07-16 中国人民解放军国防科技大学 Extension POMDP planning method and system based on robot accompanying behavior model
CN112131754A (en) * 2020-09-30 2020-12-25 中国人民解放军国防科技大学 Extended POMDP planning method and system based on robot accompanying behavior model
CN112356031A (en) * 2020-11-11 2021-02-12 福州大学 On-line planning method based on Kernel sampling strategy under uncertain environment
CN112356031B (en) * 2020-11-11 2022-04-01 福州大学 An Online Planning Method Based on Kernel Sampling Strategy in Uncertain Environment
CN113467481A (en) * 2021-08-11 2021-10-01 哈尔滨工程大学 Path planning method based on improved Sarsa algorithm

Similar Documents

Publication Publication Date Title
CN102929281A (en) Robot k-nearest-neighbor (kNN) path planning method under incomplete perception environment
Semnani et al. Multi-agent motion planning for dense and dynamic environments via deep reinforcement learning
Wang et al. A survey of learning‐based robot motion planning
Li et al. A general framework of motion planning for redundant robot manipulator based on deep reinforcement learning
Xie et al. Deep imitation learning for bimanual robotic manipulation
CN105527965A (en) Route planning method and system based on genetic ant colony algorithm
Soltero et al. Generating informative paths for persistent sensing in unknown environments
Zhang et al. Reinforcement Learning in Robot Path Optimization.
Zhang et al. Learning to cooperate: Application of deep reinforcement learning for online AGV path finding
Ma et al. State-chain sequential feedback reinforcement learning for path planning of autonomous mobile robots
Wang et al. A novel incremental learning scheme for reinforcement learning in dynamic environments
Mohanty et al. A new ecologically inspired algorithm for mobile robot navigation
Ni et al. An improved shuffled frog leaping algorithm for robot path planning
CN107422734B (en) Robot path planning method based on chaotic reverse pollination algorithm
CN104020769B (en) Robot overall path planning method based on charge system search
Mohanty et al. A new real time path planning for mobile robot navigation using invasive weed optimization algorithm
Chaudhary et al. Obstacle avoidance of a point-mass robot using feedforward neural network
Hasan et al. Improved Variable Step Length A* Search Algorithm for Path Planning of Mobile Robots
Chen et al. Timesharing-tracking framework for decentralized reinforcement learning in fully cooperative multi-agent system
Su et al. Path planning for mobile robots based on genetic algorithms
Wang et al. Fault-tolerant pattern formation by multiple robots: a learning approach
Mo et al. Bio-geography based differential evolution for robot path planning
Suh et al. Energy-efficient high-dimensional motion planning for humanoids using stochastic optimization
Loganathan et al. Robot path planning via Harris hawks optimization: A comparative assessment
Wang et al. Welding robot path optimization based on hybrid discrete PSO

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C02 Deemed withdrawal of patent application after publication (patent law 2001)
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20130213