JP2005125475A

JP2005125475A - Architecture capable of learning best course by single successful trial

Info

Publication number: JP2005125475A
Application number: JP2003400652A
Authority: JP
Inventors: Sunao Kitamura; 直喜多村
Original assignee: Individual
Current assignee: Individual
Priority date: 2003-10-24
Filing date: 2003-10-24
Publication date: 2005-05-19

Abstract

PROBLEM TO BE SOLVED: To provide an architecture capable of learning an optimum course from a single successful trial in learning control of the body position and the arm/hand position of a mobile robot.

SOLUTION: The learning-control architecture has a hierarchical structure in which higher-order behaviors, and the programs that execute them, are placed at higher levels. Selection of a level and selection of a behavior within that level are each performed using evaluation functions based on emotional values with respect to the environment. When a behavior inhibition occurs, i.e. a moving behavior is hindered by an obstacle, a better behavior one level above the inhibited behavior is selected. By using a destination direction vector calculated from the memory of the behavior inhibitions that occurred before the success, the optimum course is learned from the single successful trial. By tracing the optimum course backward in time, an optimum return course is calculated. Further, even when sensor recognition fails, as frequently happens, the architecture operates effectively because the information is corrected through remote recognition by the user.

COPYRIGHT: (C)2005,JPO&NCIPI

Description

The present invention relates to learning control of the body position and the arm/hand position of a mobile robot.

Learning control of a mobile robot's position falls broadly into two kinds: control of the position of the robot body, and control of the robot's arm/hand system. Both aim to learn, through repeated trial and error in searching for and finding the target location from the robot's current position, the route that brings the robot body or hand to the target location and the behavior to take while moving.

Because the technique described here solves position control of the mobile robot body and position control of the hand by a common method, both are referred to below simply as robot position control, or learning control of the robot position.

Reinforcement learning and similar methods are used for learning control of robot position. Even when reaching a destination only requires a detour around an obstacle of simple shape, learning the best route takes tens of thousands of trials. Likewise, when the hand must reach a target object, hundreds of thousands of trials are needed for the hand and arm to learn the best route that avoids obstacles along the way. Using such a method in real time is therefore impractical: learning must be completed beforehand in simulation, and only the result is used for robot control.

However, when simulation results are used, the learned control law may fail to work because of unexpected changes in the environment or errors in the simulation's model of the robot or environment. For example, an obstacle that appears suddenly can disrupt the behavior. To obtain a control law adapted to the new environment, tens of thousands of simulation runs must then be performed again for that environment.

As described above, conventional methods for learning control of robot position have the problem that the learned control law does not work well because of unexpected changes in the environment or errors in the robot or environment model used in simulation.

The problems to be solved in learning robot position control are therefore: high adaptability to environmental change even before learning; acquiring the best control policy for reaching the destination within a few trials after a single successful trial; and, if the learned control law stops working after a large environmental change following the successful trial, relearning at very low cost.

An object of the present invention is to provide an architecture that learns the best route from a single successful trial and that resolves, at very low cost, problems such as the inability to cope with unexpected changes in the environment.

In short, the means of solving the problem is to build a hierarchical model of the evolutionary and developmental process of intelligence and behavior in animals and humans, and to use it for learning control of a robot. When an animal or a human reaches a destination by going around an obstacle, learning finishes after a few trials. Summarizing what people report about their awareness during trial and error, together with introspection, gives the following picture: when an action is stopped for some reason, the state of one's body and of the environment at that moment is memorized, and that memory is used effectively in the next trial. Here, the memorized information includes both what the body has learned unconsciously and what has been consciously memorized.

Accordingly, the developmental model of intelligence and behavior is a hierarchy in which the level of developmental evolution rises from bottom to top, as shown in FIG. 1. Developmental evolution here denotes the following procedure: when behavior at a given level reaches an impasse in achieving the goal, a solution is devised by consciousness one level higher and is realized by behavior at that higher level. In individual development, this corresponds to the process in which, when lower-level behavior cannot achieve the goal, thinking directed at the goal comes into play, trial and error follows, and higher-level behavior eventually develops. Viewed as phylogeny, it expresses how, over hundreds of millions of years, some device that a species adopted in order to survive made survival possible, and the species eventually evolved into a higher one.

The hierarchical model of consciousness and behavior called the "consciousness architecture" expresses this model in a form that can be ported to a robot. It is shown in FIG. 2. The consciousness-model part has an upper layer that performs information processing for time-space recognition, with a physical consciousness hierarchy of four layers below it. The behavior-model part consists of groups of behavior functions corresponding to each of the four layers of the physical consciousness hierarchy. In this consciousness architecture, layer selection and action selection are performed using functions that evaluate the strength of consciousness and how good or bad the emotion is.

Basically, only two robot behaviors are considered here: moving toward the most valued object, and moving away from the most disliked object. In actual behavior, however, object values are set differently for each of the four layers of the physical consciousness hierarchy described above, so the resulting behavior is a compromise within a level, and between different levels, lying between approach and avoidance. Action selection is based on the evaluation functions described below.

The evaluation function C_i used to select the level of the physical consciousness hierarchy is called the consciousness strength. It expresses how strongly objects are attended to, and is given by equation (1).

[Equation (1)]

Here, β_ij denotes the degree of liking or disliking at level i for the j-th of the N_E perceived objects (positive if liked, negative if disliked), and is defined, as appropriate, as a decreasing function of the distance to the object and of the azimuth angle to the object. γ_ij denotes how good or bad the condition of the j-th of the N_I perceived body parts is at level i (positive if good, negative if bad); the robot's remaining battery energy is an example, and it is defined according to the problem to be solved. α_ij denotes how good or bad the expected emotion about object j is at level i (positive if good, negative if bad), and is defined, according to the problem to be solved, as a function of β_ij, the degree of liking or disliking of the object.
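The requirement that β_ij decrease with distance and azimuth can be illustrated with a small sketch. The source does not give a functional form, so the exponential decay, the scale constants `d0` and `theta0`, and the function name `beta` below are all assumptions chosen only to satisfy the stated monotonicity.

```python
import math

def beta(base_value, distance, azimuth, d0=1.0, theta0=math.pi / 2):
    """Signed liking (+) or disliking (-) of one object at one level.

    base_value: preference when the object is adjacent and straight ahead.
    distance:   distance to the object (same unit as d0), >= 0.
    azimuth:    bearing of the object relative to the heading, in radians.
    The exponential form and the scales d0 / theta0 are hypothetical; the
    patent only asks for a decreasing function of distance and azimuth.
    """
    decay = math.exp(-distance / d0) * math.exp(-abs(azimuth) / theta0)
    return base_value * decay
```

With this form the sign of `base_value` is preserved, so a disliked object stays negative at any range while its influence fades as the robot moves away or turns aside.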

Experience so far shows that two forms of the consciousness strength, (1-1) and (1-2), have been used most often, but other forms can be devised to suit the robot and environment in use. Form (1-1) is used when it is better to take into account every object that has entered consciousness; form (1-2) is used when it is better to narrow down the objects from among those in consciousness.
Whichever is used, the level i that maximizes the function C_i is selected from among 1, 2, 3, and 4.
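The level-selection rule (choose the level i with the largest C_i) can be sketched as follows. Because equation (1) is not reproduced in this text, the two strength functions are stand-ins: `strength_all` assumes form (1-1) sums the magnitudes of every perceived item, and `strength_focused` assumes form (1-2) keeps only the strongest one. Only the argmax over levels 1 to 4 comes from the source.

```python
def strength_all(alpha, beta, gamma):
    """ASSUMED stand-in for form (1-1): every perceived item contributes."""
    return sum(abs(v) for v in (*alpha, *beta, *gamma))

def strength_focused(alpha, beta, gamma):
    """ASSUMED stand-in for form (1-2): only the strongest item counts."""
    return max((abs(v) for v in (*alpha, *beta, *gamma)), default=0.0)

def select_level(percepts, strength):
    """Return the level whose consciousness strength C_i is largest.

    percepts maps level -> (alpha_i, beta_i, gamma_i), each a list of floats.
    Ties go to the lowest level, which is a choice made for this sketch.
    """
    return max(sorted(percepts), key=lambda i: strength(*percepts[i]))
```

The same percepts can select different levels under the two forms, which is why the choice between (1-1) and (1-2) matters.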

Next, at the level selected in this way (call it level i), the evaluation function I_i used to choose the action to take is called the emotion evaluation function. It expresses how good or bad the emotion is, and is given by equation (2).

[Equation (2)]

Here, β_ij, and hence α_ij, is defined as a function of the distance and direction to the object, as described above, so its value changes depending on the h-th action B_ih at level i.
The emotion evaluation function (2-1) is used when the consciousness strength (1-1) has been chosen, and (2-2) is used when (1-2) has been chosen.
At the selected level i, the action chosen is the one that maximizes equation (2). If the function I_i can no longer be increased by any action, consciousness moves up one level, to level i+1, and the action is then selected using the emotion evaluation function I_(i+1).
Action selection takes one of three forms: (1) the action moves to a level lower than the current one; (2) the action stays at the same level, that is, the action changes within the same level; (3) no action within the same level increases I_i any further, and the level rises by one. Case 2 is called "partial inhibition of behavior", and case 3 is called "inhibition of behavior". If behavior is inhibited at level 4, processing enters the time-space recognition layer above the physical consciousness hierarchy. As described later, what is memorized when cases 2 and 3 occur is used as information useful for learning.
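The selection and escalation rule can be sketched as below. This is a minimal illustration, not the patent's program: the callback `emotion(level, action)` stands in for equation (2), `current[level]` is the present value of I_i, and the returned level `top_level + 1` is a hypothetical way of signalling that the time-space recognition layer must take over.

```python
def select_action(level, actions, emotion, current, top_level=4):
    """Pick the action that maximizes the emotion evaluation at `level`.

    actions[level]          -> candidate actions at that level (non-empty)
    emotion(level, action)  -> predicted I value after the action (assumed API)
    current[level]          -> present I value at that level
    If no action raises I at a level, behavior is inhibited there and the
    search moves one level up. Returns (level, action); action is None when
    even the top physical level is inhibited.
    """
    while level <= top_level:
        best = max(actions[level], key=lambda a: emotion(level, a))
        if emotion(level, best) > current[level]:
            return level, best
        level += 1  # inhibited here: hand over to the level above
    return top_level + 1, None

def classify(prev_level, prev_action, level, action):
    """Label a transition using the three cases in the text."""
    if level > prev_level:
        return "inhibited"            # case 3
    if level == prev_level and action != prev_action:
        return "partially inhibited"  # case 2
    return "not inhibited"            # case 1, or no change at all
```

Recording the output of `classify` alongside the robot state at each step would give the kind of inhibition log that the later learning stage relies on.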

The computation of action decisions based on the two evaluation functions C_i and I_i, and the execution of those actions, proceed continuously while the robot is operating. FIG. 3 shows the flow of computation in the four layers of the physical consciousness hierarchy.
Next, the time-space recognition layer, which lies above the four layers of the physical consciousness hierarchy, is described (FIG. 4-1). It consists of two sublayers (level 5-1 and level 5-2). The upper sublayer (level 5-2) consists of a recall-and-reflection module and a time-space recognition module; the lower sublayer (level 5-1) consists of action plan modules.
The time-space recognition module recognizes the positions of obstacles and objects, recognizes and classifies feature points, and recognizes the positions of feature points. For the action plan modules, several kinds of basic module are prepared in advance according to the tasks assigned to the robot. There is only one time-space recognition module and only one recall-and-reflection module.
As behavior proceeds, information from the time-space recognition module and information about behavior inhibition passed up from each layer of the physical consciousness hierarchy are written into the recall-and-reflection module (FIG. 4-2). Based on this information, new action plan modules beyond those initially given to level 5-1 are constructed and added as needed. Every action plan module holds action commands and, once activated, issues them to level 4 of the physical hierarchy.
The recall-and-reflection module stores the behavior-inhibition information from levels 1, 2, 3, and 4 during the first trial, together with the time-space information about the environment at that moment obtained by the space recognition module. The learning of the direction in which the robot should proceed, based on the recall-and-reflection module, is described below.
The basic action plan module, referring to the recall-and-reflection module and the time-space recognition module, has the following function: by the end of the first trial that reaches the destination, it takes the feature points of the obstacles that caused inhibition at level 4 along the route to the goal, assigns a new β value to each obstacle that should become a subgoal, and sets the location of that obstacle as a subgoal.
The basic action plan module also computes and stores the subgoal direction vector defined by equation (3), as a weighted average of the robot position vectors at the times inhibition occurred in the past.

[Equation (3)]

Here, q is the total number of subgoals and p denotes the p-th subgoal; k is the index of an inhibition event when the events are arranged in time order, and t(k) is the time at which inhibition event k occurred. With the robot's starting point as the origin, the vector x(k) is the robot's velocity vector when event k occurred, m(k) is a positive number expressing the strength of the inhibition at event k, and n is the most recent time at which inhibition occurred. The value of m depends on the degree of inhibition (partial or not): m = 1 for partial inhibition, and otherwise a value m > 1 chosen according to the change in behavior.
Each direction vector starts at one subgoal and points toward the next, and the vectors are numbered in order of time as d_p(k). As stated above, p = 1, ..., q, where q is the total number of direction vectors obtained by the end of the first trial.
The basic action plan module is used and executed from the second trial onward. The newly assigned or modified β values, together with the object information, function as virtual landmarks marking the points where the learned route should change direction sharply.
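The weighted-average idea behind the subgoal direction vector can be sketched as follows. Equation (3) itself is not reproduced in this text, so the normalized weighted mean below is an assumption about its general shape; the names `direction_vector` and `events` are hypothetical, and each event carries the weight m(k) and the recorded 2-D vector x(k).

```python
def direction_vector(events):
    """Weighted mean of the vectors recorded at inhibition events.

    events: list of (m, (x, y)) pairs for one subgoal segment, where m > 0 is
    the inhibition strength (1 for partial inhibition, larger otherwise) and
    (x, y) is the vector recorded when the event occurred.
    ASSUMPTION: a normalized weighted mean stands in for equation (3).
    """
    total = sum(m for m, _ in events)
    if total <= 0:
        raise ValueError("at least one event with positive weight is required")
    dx = sum(m * x for m, (x, _) in events) / total
    dy = sum(m * y for m, (_, y) in events) / total
    return dx, dy
```

For example, one partial inhibition recorded at (1, 0) and one full inhibition of weight 3 recorded at (0, 1) give the vector (0.25, 0.75), so the stronger inhibition dominates the learned direction.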

In the second trial, the robot starts out following d_p(k), based on the direction vectors held in the basic action plan module. The information obtained in this trial overwrites the contents of the recall-and-reflection module and, if necessary, is used in the same way in a third trial. The direction vectors formed in the second trial overwrite those in the basic action plan module.

By tracing the best route learned after the successful trial backward in time, the best route from the destination back to the starting point can be calculated. This is done by stepping backward through the recall-and-reflection module from the time of arrival at the destination to the time of departure and, in parallel, stepping backward through the basic action plan module: the q direction vectors of equation (3) computed during the successful trial are executed in the reverse of the order in which they were computed, each traversed backward in time. Moving the robot in this order returns it from the destination to the starting point along the best route.
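The time reversal described above amounts to replaying the learned segments last-to-first with each one negated. The sketch below assumes, for illustration only, that each direction vector can be treated as a 2-D displacement between consecutive subgoals; the function name `return_route` is hypothetical.

```python
def return_route(outbound):
    """Build the return route from the learned outbound segments.

    outbound: list of (dx, dy) displacements in the order they were learned.
    The return route visits the segments in reverse order, each one negated,
    which is what tracing the route backward in time means for displacements.
    """
    return [(-dx, -dy) for dx, dy in reversed(outbound)]
```

Applying the outbound segments followed by the returned list brings the robot back to its starting point, since every displacement is cancelled by its negated counterpart.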

The difference between position control of the robot body and position control of the arm/hand lies in how the evaluation functions of equations (1) and (2) are constructed.

When sensor recognition is perfect, the robot moves fully autonomously from the first trial, and once it reaches the destination, the learning described above is carried out by this architecture. Sensor recognition often fails, however. Even when the recognition and classification of obstacles in the time-space recognition module is unreliable, the architecture works effectively if the user can recognize the robot's surroundings remotely from CCD camera images and the robot is equipped with a device for conveying that recognition to it. That is, when sensor recognition is incomplete, the user, watching the CCD camera image, registers the correct recognition result in the recall-and-reflection module and the time-space recognition module of this architecture, entering the time, the robot position, the obstacle name and position, and the type of feature point in the corresponding fields.
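One way to picture the registration step is a time-keyed memory in which an entry supplied by the user takes precedence over a sensor entry for the same instant. The record fields follow the list in the text (time, robot position, obstacle name and position, feature-point type); the class name `Recognition`, the `source` field, and the precedence rule are assumptions made for this sketch.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Recognition:
    time: float
    robot_position: tuple
    obstacle_name: str
    obstacle_position: tuple
    feature_type: str          # e.g. "corner" or "end"
    source: str = "sensor"     # "sensor" or "user" (hypothetical field)

def register(memory, record):
    """Store a record keyed by time.

    A user record always replaces what is stored for that time; a sensor
    record never replaces a user record. This precedence is an assumption.
    """
    existing = memory.get(record.time)
    if existing is None or record.source == "user" or existing.source == "sensor":
        memory[record.time] = record
    return memory
```

A correction entered by the user for time 3.0 therefore survives even if the sensor pipeline later reports a different result for the same instant.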

As described above, with the robot position learning control architecture of the present invention, the best control policy for reaching the target position can be acquired in only about 2 to 5 trials. Even if the environment changes after a successful trial, for example when an obstacle suddenly appears, so that what was learned in the first trial no longer works, relearning is possible in a very small number of trials.

This invention requires that inhibition of behavior be sensed. The sensors needed for position control of the robot body are therefore a CCD camera for obstacle recognition and an obstacle-avoidance sensor for emergency avoidance. Robot hand control requires a sensor that recognizes when the arm has collided with an obstacle, or an equivalent function.
When the present invention is used to learn position control of the robot body or of the robot hand, learning is completed in one further trial (a few at worst) after a single successful trial.
These capabilities give the architecture many effective applications in the rescue and nursing-care fields. In rescue work, it is useful when a robot has traveled to a disaster site and must then return autonomously by the best route. Likewise, when an operator manipulates the arm and hand to search for and grasp an object at a disaster site, after one success by trial and error the arm and hand can be made to search and grasp autonomously along the shortest path.
In nursing care, when a user manipulates the arm and hand to search for and grasp an object on a table, after one success by trial and error the arm and hand can be made to search and grasp autonomously along the shortest path.
When sensor recognition is perfect, the learning described above is carried out by this architecture with fully autonomous movement from the first trial. Even when sensor recognition is incomplete, as often happens, the architecture works effectively provided the equipment for substitute recognition by the user and for conveying that information to the robot is in place.
As described above, the present invention provides an architecture that learns the best route from a single successful trial and that resolves, at very low cost, problems such as the inability to cope with unexpected changes in the environment.

Next, embodiments of the present invention will be described.

Example of learning position control of the robot body
An example using a small robot is described. The robot is equipped with encoders for determining its own position, a CCD camera for obstacle recognition, and an obstacle-avoidance sensor for emergency avoidance. The destination lies on the far side of an obstacle, and the robot must travel to it. In the first trial (trajectory shown in FIG. 5-1), the direction vector to point B was calculated with equation (3), and the result was used in the second trial (trajectory shown in FIG. 5-2).
The robot's destination (point G in FIG. 5-1) was set as a destination at level 2 of the physical hierarchy. Learning was carried out during the first trial, and the second trial shows that the robot reached the destination by a shorter route.
In this example, forms (1-1) and (2-1) were used as the evaluation functions, and the parameters α_ij, β_ij, and γ_ij were defined as in equation (4).

[Equation (4)]

Here, ε_ij and the forgetting function f(t) are defined by equation (5).

[Equation (5)]

In this example, a single basic action plan module and a single space recognition module were used. The space recognition module recognizes the positions of obstacles and their change points (corner points and end points). In the basic action plan module, the β values of the obstacle's change points (A and B in FIGS. 5-1 and 5-2) were newly set to positive values during the first trial (they were initially 0). Points A and B thereby functioned as intermediate goals in the basic action plan module, and as a result the robot was able to take the best detour around the AB portion of the obstacle in the second trial.
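The step of turning obstacle change points into intermediate goals can be sketched as a simple update of a table of β values. The dictionary representation, the function name `promote_to_subgoals`, and the numeric values are assumptions for illustration; the source states only that the β values of points A and B were changed from 0 to positive values during the first trial.

```python
def promote_to_subgoals(beta, points, value=1.0):
    """Return a copy of the beta table with the given points made attractive.

    beta:   dict mapping a feature-point name to its current beta value.
    points: names of change points that caused inhibition and should become
            subgoals.
    value:  the positive beta assigned to them (hypothetical magnitude).
    """
    if value <= 0:
        raise ValueError("a subgoal needs a positive beta value")
    updated = dict(beta)
    for name in points:
        updated[name] = value
    return updated
```

Starting from `{"A": 0.0, "B": 0.0, "G": 2.0}`, promoting A and B leaves the destination G untouched while giving the two corner points enough attraction to pull the route around the obstacle.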

アーム・ハンド系の位置制御の学習の例
６関節ロボットのアーム・ハンド系を使って、手探りで目的物を把持する行動の実施例を説明する。アーム・ハンド系の概略と第一回目の試行の推移を図６に示す。制御の目的は、第一試行において、リンクが壁と弱い衝突（行動抑止）を繰り返しながら、ハンドが目的物を把持可能な近傍に到達した後、それまでの抑止情報を用いて、帰路には壁と接触することなく、アーム・ハンド系を壁環境から抜け出すことである。
アーム・ハンド系は図６に示すように、４関節４リンクの直列機構で、図中Ａ点がハンド先端とし、アー厶基部ＥＤが図中上下方向きに移動するアクチュエータに直結しているものとする。リンクＡＢが穴に入りかけており、穴底には目的物（黒の四角）が置かれている。
本ロボットは、センサはアーム基部ＥＤ部の変位および各関節角度を検知するポテンショメータがあり、ハンド先端にはＣＣＤカメラ等の視覚センサを持つ。関節の指令角度と実際の関節角の差が閾値を超えたことから、壁がどのリンクと衝突したかを検知する。ＣＣＤカメラによって、目的物の認識を行う。
以下に図６の１〜１９までの第一試行時の推移例を説明する。１ではアーム・ハンドを一直線として下方に向けてＡ点が壁にぶつかるまで駆動する。２では、Ａ点のぶつかりを電流から感知しＡ点での反力方向を検出する。そして、Ｂ点のモータを反力方向に決められた角度回転する。
３では下方にアーム・ハンド系を決められた距離だけ下方に移動する。４ではロボット全部をＢ点が壁にぶつかるまで下方に移動する。そして２〜３と同様のことを行い、図中の１９のように、ハンド先端が目的物近傍に到達するまで、下方移動、関節のぶつかり、壁反力の向きの検出、上位関節の回転といった一連の動作を適宜繰り返す。例えば、もし関節の回転が大き過ぎて壁上側にリンクの一部が衝突することが起きても、このアルゴリズムは適宜対処できる。
第一試行を行うときに用いる評価関数は以下の式（６）のように設定した。β_ｉｊをレベルｉにおける２つの対象物（目的地ｊ＝１と壁ｊ＝２）に対する好悪の知覚として、

また、β_ｉｊ（ｊ＝１、２）の関数形は以下に定義される。まずβ_ｉｊ（ｉ＝２、３、４）は、ｋを抑止が起きた時間として、推定した目的物方向をθ_Ｅとして、適宜に選んだｂ_ｉ（ただし、ｂ_ｉ−１＞ｂ_ｉ０）にたいして、次の式（７）のように定義する。

ただし、過去にその関節Ｘ_ｉで起きた抑止ｒ（ｒ＝１_ｏｏｏｋ）がレベルｉで起きたときは、δ（ｒ、ｉ）＝１、ｒがレベルｉで起きなかったとき、δ（ｒ、ｉ）＝０とする。
次に、β_ｉ２（ｉ＝２、３、４）は、リンクＸ_ｉ−１Ｘ_ｉ−２の関節以外のどこかが壁と接触しているとき、適宜決めたβ＜０に対して、β_ｉ２＝β、それ以外のとき、β_ｉ２＝０とする。そして、β_１２＝０とする。
またこのときの意識アーキテクチャの説明を図７に示す。この場合、レベル５−２の時間空間認識モジュールでは、ＣＣＤカメラデータによる目的物認識を行うと共に、レベル３と４での衝突検出データに基づいて、どのリンクが壁と衝突したかを認識する。この情報に基づき、想起反省モジュールでは、壁との弱い衝突によって起きるアームの抑止情報と目的物認識情報が時系列として記憶される。
レベル５−１の基本的行動計画モジュールでは、想起反省モジュール情報を用いて、第一試行でハンドが目的物把持可能な近傍に到達してから、アーム・ハンド系を壁環境から抜き取るまでの帰路と往路の第二試行におけるアーム駆動計画を立てる。
帰路計画は、図６の１９から１の手順を、衝突なしに進む駆動工程表である。抑止情報から、衝突した位置では、衝突しない程度の関節角の設定し関節を駆動する、第二試行の往路計画は、図６の１から１９の手順を、衝突なしに進む駆動工程表である、抑止情報から、衝突した位置では、衝突しない程度の関節角の設定し関節を駆動する。
もし帰路において、壁との衝突が起きれば、今度は第二試行においてその抑止情報を用いて、上記の第二試行の往路計画に修正を加える。
また、基本的行動計画モジュールでは、など式（６）式（７）で必要な衝突後の各関節の適度な角度を決めるパラメータｂ_ｉ、β、目的物方位θ_Ｅなどの初期値を持ち、またそれらの更新値を設定する。
なお、図７が示すとおり、回転する関節の位置（Ｂ、Ｃ、Ｄ）に応じて、レベル４を３通りに分割する。ただし、本実施例ではレベル２以下を用いないで学習が可能である。Example of learning of position control of arm / hand system An embodiment of an action of grasping an object by groping using an arm / hand system of a six-joint robot will be described. The outline of the arm-hand system and the transition of the first trial are shown in FIG. The purpose of the control is to repeat the weak collision (behavior deterrence) with the wall in the first trial, and after reaching the vicinity where the hand can hold the object, use the deterrence information so far, To get out of the wall environment without touching the wall.
As shown in FIG. 6, the arm-hand system is a four-joint 4-link serial mechanism, with point A in the figure being the tip of the hand, and arm base ED being directly connected to an actuator that moves vertically in the figure. And The link AB is about to enter the hole, and the object (black square) is placed on the bottom of the hole.
This robot has a potentiometer that detects the displacement of the arm base ED portion and each joint angle, and has a visual sensor such as a CCD camera at the tip of the hand. Since the difference between the command angle of the joint and the actual joint angle exceeds the threshold, it is detected which link the wall collides with. The object is recognized by the CCD camera.
Below, the transition example at the time of the 1st trial to 1-19 of FIG. 6 is demonstrated. In 1, the arm and hand are driven in a straight line and pointed downward until point A hits the wall. In 2, the collision of the point A is sensed from the current and the reaction force direction at the point A is detected. Then, the motor at point B is rotated at an angle determined in the reaction force direction.
In 3, the arm / hand system is moved downward by a predetermined distance. At 4, the entire robot moves downward until point B hits the wall. Then, do the same as 2-3, and move downwards, collide with the joints, detect the direction of the wall reaction force, rotate the upper joints, etc. A series of operations is repeated as appropriate. For example, if the joint rotation is too large and a part of the link collides with the upper side of the wall, this algorithm can cope with it appropriately.
The evaluation function used when performing the first trial was set as in the following formula (6). _Let β _{ij be} a perception of good or bad for two objects at level i (destination j = 1 and wall j = 2)

The function form of β _ij (j = 1, 2) is defined below. First, β _ij (i = 2, 3, 4) is appropriately selected b _i (where b _i−1 > b _i 0), where k is the time when suppression occurs and the estimated object direction is θ _E. ) Is defined as the following equation (7).

However, if the suppression r (r = 1 _oo k) that occurred in the joint X _i in the past occurred at level i, δ (r, i) = 1, and if r did not occur at level i, δ ( r, i) = 0.
Next, when β _i2 (i = 2, 3, 4) is in contact with the wall anywhere other than the joint of the link X _i-1 X _i-2 , β _i2 = β, otherwise β _i2 = 0. Then, β ₁₂ = 0.
A description of the consciousness architecture at this time is shown in FIG. In this case, the time-space recognition module at level 5-2 recognizes the target object based on the CCD camera data, and recognizes which link has collided with the wall based on the collision detection data at levels 3 and 4. Based on this information, the recall module reflects the arm deterrence information and object recognition information caused by a weak collision with the wall in time series.
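One way to picture the time-series record kept by the recall module is the small data structure below. It is a sketch under assumptions: the class names `RecallEntry` and `RecallLog`, the field layout, and the link labels are hypothetical and are not taken from the specification.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class RecallEntry:
    """One time-stamped observation (hypothetical layout)."""
    time: int
    level: int                    # level that reported the observation
    collided_link: Optional[str]  # e.g. "AB"; None if no collision
    object_seen: bool             # result of CCD camera recognition


@dataclass
class RecallLog:
    """Time-ordered store of deterrence and recognition information."""
    entries: List[RecallEntry] = field(default_factory=list)

    def record(self, entry: RecallEntry) -> None:
        self.entries.append(entry)

    def suppressions(self) -> List[RecallEntry]:
        """Entries in which a link was stopped by the wall."""
        return [e for e in self.entries if e.collided_link is not None]


log = RecallLog()
log.record(RecallEntry(time=1, level=3, collided_link="AB", object_seen=False))
log.record(RecallEntry(time=2, level=4, collided_link=None, object_seen=False))
log.record(RecallEntry(time=3, level=4, collided_link="BC", object_seen=True))
print([e.collided_link for e in log.suppressions()])  # ['AB', 'BC']
```

Keeping the entries in arrival order is what later allows the planning module to read back where and when each collision occurred.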
The basic action plan module at level 5-1 uses the recall module information to plan the return path, which runs from the point in the first trial at which the hand reaches the vicinity where the object can be gripped until the arm / hand system is withdrawn from the wall environment, and to make the arm drive plan for the outward trip of the second trial.
The return route plan is a driving process table consisting of steps 19 to 1 in FIG., traced in that order. The second-trial forward plan is a driving process table consisting of steps 1 to 19 in FIG., in which, based on the inhibition information, each joint angle is set so that the joint is driven without colliding at the earlier collision positions.
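The relation between the outward table and the return table can be illustrated with a short sketch. It assumes that each row of the driving process table stores a target configuration, so the return plan is simply the same rows in reverse order; the `Step` layout, the field names, and the sample values are hypothetical.

```python
from typing import List, NamedTuple


class Step(NamedTuple):
    """One row of a driving process table (hypothetical layout)."""
    number: int
    base_mm: float      # displacement of the arm base
    joint_b_deg: float  # target angle of joint B


def make_return_plan(outward: List[Step]) -> List[Step]:
    """Return the outward rows in reverse order (last step first)."""
    return list(reversed(outward))


outward_plan = [
    Step(number=1, base_mm=0.0, joint_b_deg=0.0),
    Step(number=2, base_mm=40.0, joint_b_deg=0.0),
    Step(number=3, base_mm=40.0, joint_b_deg=25.0),
]
return_plan = make_return_plan(outward_plan)
print([s.number for s in return_plan])  # [3, 2, 1]
```

Storing configurations rather than incremental motions means no sign changes are needed when the table is read backwards; a table of incremental motions would also need each motion inverted.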
If a collision with the wall occurs on the return trip, that suppression information is also used, and the forward plan of the second trial is modified accordingly.
In addition, the basic action plan module holds the initial values of the quantities required by Equation (6) and Equation (7), such as the parameters b _i and β and the object direction θ _E, which determine an appropriate angle for each joint after a collision. It also sets their updated values.
As shown in FIG. 7, level 4 is divided into three types according to the position (B, C, D) of the rotating joint. However, in this embodiment, learning is possible without using level 2 or lower.

Because learning is completed in a few trials, the invention is suitable for robot control in the nursing care and rescue fields.

A diagram showing the development model of intelligence and behavior in tabular form
Block diagram of the consciousness architecture
Flow diagram of the four levels of the physical consciousness hierarchy
Flow diagram of the time-space recognition module
A diagram showing the recall module in tabular form
Trajectory diagram of the first successful trial in the small robot experiment example
Trajectory diagram of the second trial in the small robot experiment example
A diagram showing a transition example of the object search by the arm / hand system
A diagram showing the consciousness architecture of the arm / hand system in tabular form

Claims

The robot body position learning control architecture has a hierarchical structure in which higher-level actions and execution programs are arranged at the upper level, and the selection of the hierarchy and the selection of actions in the hierarchy are performed using evaluation functions based on emotion values related to the environment. If a behavioral deterrence occurs that a certain moving behavior is obstructed by an obstacle, a better action is selected, and the destination direction vector calculated based on the memory of the behavioral deterrence that occurred up to the success is used. An architecture that learns the best path in one successful trial.

The learning architecture of the hand tip position of the articulated robot arm / hand system has a hierarchical structure in which higher-level actions and execution programs are arranged at the upper level. When a behavioral deterrence occurs in which a movement or rotation of a joint is obstructed by an obstacle, the behavioral deterrence is detected from an increase in the joint drive motor current and the better action is selected. An architecture that learns the optimal path in one successful trial by using the object direction vector calculated based on the memory of action suppression that occurred before the success.

In the above claims 1 and 2, when returning from the destination to the departure point, the optimum return route from the destination to the departure point is calculated by tracing the optimum route learned in one successful trial in the reverse direction. In this way, the architecture realizes return on the optimal route from the destination to the departure point.

In the above claims 1 and 2, an architecture in which, when the robot sensor is malfunctioning, learning of the optimal outbound path and the optimal return path works effectively if the user registers the correct information on time, robot position, and obstacle name and position while looking at the CCD camera.