JPH0850548A

JPH0850548A - Method and device for learning path

Info

Publication number: JPH0850548A
Application number: JP6184523A
Authority: JP
Inventors: Hiroyuki Abe; 啓之阿部
Original assignee: Nikon Corp
Current assignee: Nikon Corp
Priority date: 1994-08-05
Filing date: 1994-08-05
Publication date: 1996-02-20

Abstract

PURPOSE: To make the method and device applicable to more general-purpose problems by handling input/output variables continuously while keeping the states fixed and using only a limited number of actions. CONSTITUTION: The method comprises a 1st step 101 of arranging a membership function in each of the discretely distributed states; a 2nd step 102 of selecting one of the neighboring states when the object makes a state transition; a 3rd step 103 of storing the states through which the object has passed as a history of state numbers; a 4th step 104 of erasing the looped portion of the state number history when the state number selected in the 2nd step 102 matches a state number stored in the 3rd step 103; a 5th step 105 of deciding and executing the action of the object by fuzzy arithmetic processing; a 6th step 106 of avoiding obstacles when transiting to the next state; a 7th step 107 of evaluating the object's actions up to its arrival at the goal point based on the distance traveled from the start point to the goal point; and an 8th step 108 of distributing a reward corresponding to the action evaluation result to the parameters of the membership functions. By repeating these steps, learning converges so that the object adopts the optimum action (9th step 109).

Description

Detailed Description of the Invention

[0001]

[Field of Industrial Application] The present invention can be used in fields involving applications such as combinatorial optimization problems and shortest path search. In particular, it relates to a mobile device having a learning function.

[0002]

[Prior Art] Reinforcement learning, a field of artificial intelligence, has recently been attracting attention. Reinforcement learning is a method that tries to adapt an object to its environment by using a special input called a reward as a clue. A paper on a conventional reinforcement learning method is Machine Learning, Vol. 3, pp. 225-245 (1988).

[0003] The problems targeted by reinforcement learning are classified according to the nature of the state transitions and the type of the input/output variables. In the above paper, the states are fixed and the actions that can be taken in each state are limited to a few. The input/output variables are discrete. As for reward distribution, it is an experience-based reinforcement learning method in which the reward is distributed as weights to the actions in each state.

[0004] As another typical conventional learning method, US Pat. No. 5,113,482 proposes a reinforcement learning method that incorporates a neural network capable of handling input/output variables continuously.

[0005]

[Problems to Be Solved by the Invention] The conventional experience-based reinforcement learning method described above can handle only a limited class of problems and offers few degrees of freedom. The reinforcement learning method that incorporates a neural network has the drawback that the number of neurons grows and the memory requirement becomes enormous. An object of the present invention is to provide a route learning device that, by adopting fuzzy inference in the reinforcement learning method, can handle input/output variables continuously while keeping the states fixed and the actions limited to a few, and that can therefore be applied to more general-purpose problems.

[0006]

[Means for Solving the Problems] The route learning method comprises: a first step of initialization, in which a membership function is arranged in each of the discretely distributed states; a second step of state selection, in which, when the object transits from the current state to the next state, one state is selected from among the states arranged in its vicinity; a third step of storing the states through which the object has passed as a history of state numbers; a fourth step of deleting the looped portion of the state number history, on the grounds that the route has looped, when the state number selected in the second step is judged to match a state number stored in the third step; a fifth step of determining and executing the action of the object by fuzzy arithmetic processing based on the membership functions arranged in the states and the current position of the object; a sixth step of avoiding obstacles when the object transits to the next state; a seventh step of action evaluation, in which the object's actions up to its arrival at the goal point are evaluated by the distance traveled from the start point to the goal point; an eighth step of reward distribution, in which a reward corresponding to the action evaluation result of the seventh step is distributed to the parameters of all membership functions arranged in the states passed through; and a ninth step of converging the learning so that the object takes the optimum action, by repeating the second through eighth steps in order.

[0007] The route learning device of the present invention comprises: initialization processing means for arranging a membership function in each of the discretely distributed states; a route learning processing unit that learns a route by reinforcement, determining actions by fuzzy arithmetic processing based on the membership functions and the current position of the object and distributing a reward according to the action evaluation result; and a route movement processing unit in which the object moves along the route reinforced on the basis of the result of the route learning processing unit.

[0008]

[Operation] By adopting fuzzy inference in the experience-based reinforcement learning method, the present invention makes it possible to treat continuously problems that could previously be treated only discretely. The route learning device of claim 2 incorporates the route learning method of claim 1.

[0009] The first state selection means of the route learning device of claim 2 is used in the route learning processing unit and selects a state with a weighted probability. The second state selection means, which belongs to the route movement processing unit used after route learning has finished, does not use probability. The action determination means of the route learning processing unit determines the moving speed and traveling direction of the moving body by fuzzy arithmetic processing and executes them. In addition, the route along which the moving body travels is reinforced by changing the parameters of the membership functions according to the reward. As a result, the moving body moves smoothly during route movement after route learning.

[0010] A typical application of the route learning method and device of the present invention is an autonomous mobile robot, but the invention can also be applied to optimization problems such as scheduling. The "object" in the present invention is whatever is to be operated; in this embodiment it is a moving body. The moving body in this embodiment means a moving body controlled by a programmable device such as a CPU, and the CPU or the like may be located either inside or outside the moving body. In this embodiment, the moving body can move in any direction.

[0011] In this embodiment, the membership function is conical, and its maximum height is always set to the same normalized value of 1. Although the present invention deals with route learning on a two-dimensional plane, it can be extended to route learning problems with three-dimensional obstacles, such as underwater or in outer space, by making the membership function spherical and using a fitness that falls off with the distance between the center of the sphere and the moving body (for example, 1 at the center and 0 on the surface of the sphere).

[0012]

[Embodiments] A route learning method for a moving body according to one embodiment of the present invention is described below with reference to the drawings. FIG. 1 is a basic processing flow chart of the route learning method of the present invention. The route learning method comprises: a first step 101 of initialization, in which a membership function is arranged in each of the discretely distributed states; a second step 102 of selecting, when the object transits from the current state to the next state, one state from among the states arranged in its vicinity; a third step 103 of storing the states through which the object has passed as a history of state numbers; a fourth step 104 of deleting the looped portion of the state number history, on the grounds that the route has looped, when the state number selected in the second step is judged to match a state number stored in the third step; a fifth step 105 of determining and executing the action of the object by fuzzy arithmetic processing based on the membership functions arranged in the states and the current position of the object; a sixth step 106 of avoiding obstacles when the object transits to the next state; a seventh step 107 of action evaluation, in which the object's actions up to its arrival at the goal point are evaluated by the distance traveled from the start point to the goal point; an eighth step 108 of reward distribution, in which a reward corresponding to the action evaluation result of the seventh step is distributed to the parameters of all membership functions arranged in the states passed through; and a ninth step 109 of converging the learning so that the object takes the optimum action, by repeating the second through eighth steps in order.
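The loop deletion of the third and fourth steps can be illustrated with a short Python sketch. It assumes the history is a plain list of state numbers and that "deleting the looped history" means truncating the list back to the first visit of the revisited state; the function name `record_state` is hypothetical and does not come from the patent.

```python
def record_state(history, state_number):
    """Append a state number to the history, removing any loop it closes.

    Assumption: when the newly selected state is already in the history,
    everything recorded after its first visit is treated as the loop and
    is discarded (steps 103 and 104).
    """
    if state_number in history:
        first_visit = history.index(state_number)
        del history[first_visit + 1:]   # drop the looped portion
    else:
        history.append(state_number)
    return history


if __name__ == "__main__":
    h = []
    for s in [1, 2, 3, 4, 2, 5]:
        record_state(h, s)
    print(h)  # the detour 3, 4 has been removed
```

With this truncation, the stored history never contains the same state twice, so the travel distance evaluated in the seventh step is that of a loop-free route.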

[0013] FIG. 2 is a block diagram of a route learning device that uses the route learning method of claim 1. The device consists of a route learning processing unit 201 and a route movement processing unit 202. The route learning processing unit 201 has initialization processing means 301, which arranges a membership function in each of the discretely distributed states; it determines actions by fuzzy arithmetic processing based on the membership functions and the current position of the object, and learns a route by reinforcement by distributing a reward according to the action evaluation result. In the route movement processing unit 202, the object moves along the route reinforced on the basis of the result of the route learning processing unit.

[0014] The initialization processing means 301 of the route learning processing unit in FIG. 2 arranges a conical membership function in each of the states distributed at equal intervals on a plane and initializes the parameters of the membership functions. This means also assigns the start point and the goal point to states. The route from the start point to the goal point is called an episode (the sequence of rule selections from one reward to the next). Subgoal points may also be placed along the route from the start point to the goal point. In that case the route is divided by the subgoal points, and each divided section is also called an episode.

[0015] The sections start point to subgoal 1, subgoal 1 to subgoal 2, ..., subgoal n to goal are learned in this order. That is, episode 2 is learned after the learning of episode 1 has finished. In the route movement processing performed after route learning, the route is likewise divided into episodes by the subgoals, as in the route learning processing unit. Furthermore, when routes divided by subgoals overlap, each overlapped state may be given a separate membership function for each route. Route learning processing must be performed at least once before route movement processing, but if the result of route learning is saved on a storage medium, re-learning is unnecessary as long as the environment does not change.

[0016] FIG. 3 is a block diagram of the route learning processing unit for each episode. It consists of: initialization processing means 301 for arranging a membership function in each state; first state selection means 302 for selecting, with a weighted probability, one state from among the states arranged in the vicinity when the moving body transits from the current state to the next state; state history storage means 303 for storing the states through which the object has passed as a history of state numbers; route loop deletion means 304 for deleting the looped portion of the state number history, on the grounds that the route has looped, when the state number selected by the first state selection means is judged to match a state number stored in the state history storage means; action determination means 305 for determining and executing the action of the object by fuzzy arithmetic processing based on the membership functions arranged in the states and the current position of the object; obstacle avoidance means 306 for avoiding obstacles when the object transits to the next state; action evaluation means 307 for evaluating the object's actions up to its arrival at the goal point by the distance traveled from the start point to the goal point; reward distribution means 308 for distributing a reward corresponding to the action evaluation result to the parameters of all membership functions arranged in the states passed through; and learning convergence means 309 for converging the learning so that the object takes the optimum action, by repeating the processing of the above means in order.

[0017] FIG. 4 is a block diagram of the route movement processing unit for each episode. It consists of: second state selection means 401 for selecting one state from among the states arranged in the vicinity when the moving body transits from the current state to the next state; action determination means 402 for determining and executing the action of the moving body by fuzzy arithmetic processing, which computes the fitness and the traveling direction of the moving body from the conical membership functions arranged in each of the states in which the moving body is currently located; and obstacle avoidance means 403 for avoiding obstacles when the object transits to the next state.

[0018] Next, the operation of the object (moving body) using the conical membership functions arranged in the discretely distributed states is described. FIG. 5 is a plane model diagram of the route search environment, in which a route from the start point through the subgoal point to the goal point is learned on a plane containing obstacles. The components of the embodiment are implemented mainly by software in a computer system, but they are not limited to this and may be implemented in hardware.

[0019] The two-dimensional plane is divided into an X × Y mesh, and the XY mesh points at the intersections are used as the state points for reinforcement learning. In this embodiment, X = 10 and Y = 12. The initialization processing means arranges conical membership functions at equal intervals, one in each of the 120 states, and initializes them. A start point, a goal point, and a subgoal point are then selected from among the 120 states. In FIG. 5 the state serving as the start point is at the lower left, and the state serving as the goal point, the object's destination, is slightly above the center. A point that must be visited between the start and the goal is a subgoal point; here, one subgoal is placed at the center right. The start point, goal point, and subgoal points can be chosen freely from the 120 state points in this way. Selecting a subgoal point divides the route.
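The initialization described here can be sketched in Python as follows. The grid size (10 × 12, giving 120 states) comes from the text; the `State` class, the row-major numbering, the unit mesh spacing, and the initial radius `h_min` are illustrative assumptions, since the source does not specify them.

```python
from dataclasses import dataclass


@dataclass
class State:
    number: int   # state number (row-major numbering is an assumption)
    x: float      # mesh-point coordinates
    y: float
    h: float      # base radius of the cone; the hedge width is 2 * h


def init_states(nx=10, ny=12, spacing=1.0, h_min=0.5):
    """Place one conical membership function on every mesh point.

    Every cone starts with the same minimum radius h_min (assumed value).
    """
    return [
        State(number=j * nx + i, x=i * spacing, y=j * spacing, h=h_min)
        for j in range(ny)
        for i in range(nx)
    ]


if __name__ == "__main__":
    states = init_states()
    print(len(states))  # 120
```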

[0020] FIG. 6 illustrates the relationship between a moving body located at a state on the plane and the states adjacent to it. A moving body located at a state point is adjacent to the eight surrounding state points. The moving body moves from state to state by repeatedly selecting one of the eight states surrounding the state it currently occupies. In a concrete autonomous mobile robot, the moving body has an obstacle sensor whose range reaches only as far as the vicinity of the eight surrounding states, and it moves autonomously using this sensor.
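As a small illustration of the eight-neighbor relationship, the following Python sketch enumerates the adjacent mesh points of a state given by integer grid indices. The function name `neighbors` and the clipping at the grid edge are assumptions made for the example; the source does not say how boundary states are handled.

```python
def neighbors(i, j, nx=10, ny=12):
    """Return the grid indices of the states adjacent to state (i, j).

    Interior states have eight neighbors; states on the edge of the
    nx-by-ny mesh have fewer because off-grid points are skipped (assumed).
    """
    result = []
    for dj in (-1, 0, 1):
        for di in (-1, 0, 1):
            if di == 0 and dj == 0:
                continue  # skip the state itself
            ni, nj = i + di, j + dj
            if 0 <= ni < nx and 0 <= nj < ny:
                result.append((ni, nj))
    return result


if __name__ == "__main__":
    print(len(neighbors(5, 5)))  # 8
    print(len(neighbors(0, 0)))  # 3
```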

[0021] FIG. 7 shows the conical shape of the membership function used in the present invention. The membership function arranged in each state is a three-dimensional cone whose height is always a prescribed value (1 in FIG. 7). The radius of the base of the cone (h in FIG. 7) is variable, and the behavior of the moving body is determined by controlling this value. The diameter 2h is called the "hedge width". A membership function is arranged in every state (120 in this embodiment). In the initial state the hedge width 2h is set to its minimum, and it never falls below this width during learning. The membership function has two roles.
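A minimal Python sketch of such a cone is given below, assuming a right circular cone of height 1 centered on the state point, so that the value falls linearly from 1 at the center to 0 at distance h. The function name `cone_fitness` and the linear profile are assumptions drawn from the description of the shape, not a formula stated in the source.

```python
import math


def cone_fitness(px, py, cx, cy, h):
    """Height of a unit-height cone with base radius h, centered at
    (cx, cy), evaluated above the point (px, py).

    Returns 0.0 outside the base circle.
    """
    distance = math.hypot(px - cx, py - cy)
    return max(0.0, 1.0 - distance / h)


if __name__ == "__main__":
    print(cone_fitness(0.0, 0.0, 0.0, 0.0, 1.0))  # 1.0 at the center
    print(cone_fitness(0.5, 0.0, 0.0, 0.0, 1.0))  # 0.5 halfway out
    print(cone_fitness(2.0, 0.0, 0.0, 0.0, 1.0))  # 0.0 outside the base
```

Widening the hedge width 2h raises the value at every point inside the base, which is how a larger cone exerts more influence on a moving body at the same position.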

[0022] The first role concerns route learning: when the moving body moves from one state to the next, one of the adjacent states is chosen with a weighted probability, and the base diameter 2h, the shape parameter of the membership function, is used as the weight. FIGS. 8(a) and 8(b) show the states that are candidates for the next move, depending on the moving direction, when the moving body selects a state. In practice, selecting a state that would take the moving body backward is prohibited in order to speed up the convergence of learning. Therefore, of the eight adjacent states, the five states to the front and to the left and right of the moving direction, indicated by dotted lines in the figures, are the candidates. In other words, the moving body can be thought of as traveling while rolling a five-sided die whose faces have been deformed by the weights, so that a face with a larger weight comes up more easily.
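The "deformed five-sided die" can be sketched in Python with a weighted random choice. The encoding of the heading as an index into eight compass directions, the dictionary `hedge_width` keyed by grid cell, and the function names are assumptions for illustration; only the idea of choosing among the five forward states with the hedge width as weight comes from the text.

```python
import random

# Eight compass directions, counter-clockwise, starting from +x.
DIRS = [(1, 0), (1, 1), (0, 1), (-1, 1), (-1, 0), (-1, -1), (0, -1), (1, -1)]


def forward_candidates(heading):
    """Directions within 90 degrees of the heading: front, the two
    diagonals, left and right. The three backward directions are excluded."""
    return [DIRS[(heading + k) % 8] for k in (-2, -1, 0, 1, 2)]


def select_next(heading, hedge_width, pos, rng=random):
    """Pick the next cell among the forward candidates, using each
    candidate's hedge width (2h) as its selection weight."""
    cells, weights = [], []
    for dx, dy in forward_candidates(heading):
        cell = (pos[0] + dx, pos[1] + dy)
        if cell in hedge_width:          # skip cells outside the grid
            cells.append(cell)
            weights.append(hedge_width[cell])
    if not cells:
        raise ValueError("no forward candidate available")
    return rng.choices(cells, weights=weights, k=1)[0]


if __name__ == "__main__":
    widths = {(x, y): 1.0 for x in range(10) for y in range(12)}
    widths[(6, 5)] = 3.0   # a reinforced state is chosen more often
    print(select_next(0, widths, (5, 5), random.Random(0)))
```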

[0023] The second role is to give the moving body a velocity in its moving direction. The moving speed of a moving body located inside a conical membership function is determined by the length (z in FIG. 7) of the vertical line drawn upward from its position to the point where it meets the surface of the cone. This length z is called the fitness. At the moment the moving body enters the next state from its current position, the state to visit after that is decided by rolling the deformed five-sided die. The force driving the moving body toward the next state is expressed by this fitness z. When the conical surfaces of the membership functions arranged in adjacent states overlap, fuzzy arithmetic processing is performed on the respective fitness values z to determine the moving direction. Put simply, the membership function of the present invention has two contrasting roles: an "attractive force" that decides which state the moving body goes to next, and a "repulsive force" that pushes out a moving body that has entered the state. The fuzzy arithmetic processing also has the effect that the moving body automatically travels along a smooth curve.
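The source does not spell out the fuzzy operation at this point, so the following Python sketch is only one plausible reading: each overlapping cone contributes a heading, and the headings are blended as a fitness-weighted average of unit vectors (a center-of-gravity style combination). The function name `blend_direction`, the per-cone heading angles, and the weighted-average rule are all assumptions.

```python
import math


def blend_direction(contributions):
    """Blend headings from overlapping cones.

    contributions: list of (fitness z, heading angle in radians), one per
    cone that covers the current position. Returns (angle, speed), where
    speed is the length of the fitness-weighted mean vector, or None when
    no cone covers the position. Assumed rule, not the patent's formula.
    """
    total = sum(z for z, _ in contributions)
    if total == 0:
        return None
    sx = sum(z * math.cos(a) for z, a in contributions)
    sy = sum(z * math.sin(a) for z, a in contributions)
    return math.atan2(sy, sx), math.hypot(sx, sy) / total


if __name__ == "__main__":
    # Two cones with equal fitness pulling east and north.
    angle, speed = blend_direction([(0.5, 0.0), (0.5, math.pi / 2)])
    print(math.degrees(angle))  # close to 45 degrees
```

Because the blended heading changes gradually as the fitness values change along the way, a rule of this kind produces the smooth curved motion the text describes.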

[0024] In the present invention, the reward of reinforcement learning is reflected in the shape parameter of the membership function. That is, when the moving body reaches the goal point (or a subgoal point; in the following description, goals and subgoals are collectively called goal points), that learning run is evaluated by its "travel distance", and a reward corresponding to the evaluation is distributed to all states on the route just traveled. When the action evaluation is high, a large reward is added to the shape parameter of each state's membership function (here, the hedge width 2h of the cone); when it is low, a small reward is added. When the action evaluation is extremely low, a "negative reward" is also considered. Because a moving body that transits between states with a weighted probability is more easily attracted to states with a large hedge width 2h, routes with a high action evaluation gradually become easier to follow.

[0025] Because the hedge widths 2h of the states are set uniformly small in the initial state, this learning at first performs a random walk and tries all kinds of routes; once several highly evaluated routes have been generated, it evaluates them as it goes. This forms a model close to human thinking. Next, the operation of the route learning processing unit in each episode is described. FIG. 9 is a processing flow chart for explaining this operation.

[0026] In step S1, the learning convergence means determines whether route learning has finished. In this embodiment, learning is judged to have converged when the hedge widths of the conical membership functions arranged in the states that make up the shortest route of the episode have all reached the maximum prescribed value. The maximum prescribed value is the diameter of the largest circle passing through the centers of the eight adjacent states. If that circle would overlap an obstacle, however, the maximum prescribed value is the diameter of the largest circle that does not overlap the obstacle.
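The convergence test of step S1 can be sketched in Python as below. The per-state dictionary `max_width`, which would hold the obstacle-dependent maximum prescribed value of each state, and the small tolerance `tol` are assumptions for the example.

```python
def has_converged(shortest_path, hedge_width, max_width, tol=1e-9):
    """Return True when every state on the shortest route of the episode
    has reached its maximum prescribed hedge width.

    hedge_width and max_width map state numbers to widths; max_width may
    differ per state because obstacles limit it (assumed representation).
    """
    return all(hedge_width[s] >= max_width[s] - tol for s in shortest_path)


if __name__ == "__main__":
    widths = {0: 2.0, 1: 2.0, 2: 1.4}
    limits = {0: 2.0, 1: 2.0, 2: 2.0}
    print(has_converged([0, 1], widths, limits))     # True
    print(has_converged([0, 1, 2], widths, limits))  # False
```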

[0027] When learning is judged to have finished, processing goes to step S2: if there is a next episode, route learning processing is repeated for it, and if this was the last episode, learning ends. In step S3, it is determined whether the moving body has reached the goal point. If the determination in step S3 is affirmative, processing moves to subroutine SR1. In subroutine SR1, when the moving body has reached the goal point, the route it has traveled is evaluated, and a reward is then given to the shape parameters of the membership functions arranged in the states.

[0028] FIG. 10 is a processing flow chart for the action evaluation means and the reward distribution means. In step S101, the reward to be given to the states is calculated according to the travel distance of the route just traveled. For example, the following formula is used:

reward = (1.5 - current travel distance / shortest travel distance so far) × constant

The constant in this formula is determined experimentally. In general, a larger value gives faster convergence but makes the shortest route harder to find, and a smaller value has the opposite tendency.
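The reward formula of step S101 translates directly into Python. The default value of `constant` is an arbitrary placeholder, since the source says only that it is determined by experiment.

```python
def reward(current_distance, shortest_distance, constant=1.0):
    """Reward for one run: positive when the run is less than 1.5 times
    the shortest distance found so far, negative when it is longer."""
    return (1.5 - current_distance / shortest_distance) * constant


if __name__ == "__main__":
    print(reward(10.0, 10.0))  # 0.5: the run ties the shortest so far
    print(reward(15.0, 10.0))  # 0.0: break-even at 1.5 times the shortest
    print(reward(20.0, 10.0))  # -0.5: a long detour earns a negative reward
```

The break-even point at 1.5 times the shortest distance follows from the formula itself: runs up to 50 percent longer than the best run are still reinforced, and longer ones are penalized.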

[0029] In step S102, it is determined whether the current travel distance is the shortest in this episode. If so, processing moves to step S103, where a somewhat larger reward than the normal reward described above is distributed in order to reinforce this shortest route. In step S104, the numbers of all states passed through on the shortest route are stored, together with the total travel distance, in part of the state history storage means.

[0030] After step S104, processing moves directly to step S106 without going through step S105. In step S105, the sign of the reward is checked. If the reward is positive, processing moves to step S106, where the reward is added to the hedge widths of the membership functions arranged in all states passed through in this run, and the subroutine exits. If the reward is negative, processing moves to step S107. There, the hedge widths of all states passed through, except those that make up the shortest route stored in step S104, are reduced by the amount of the reward.
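Steps S101 to S107 can be put together in a Python sketch. The `bonus` factor for a new shortest run, the treatment of the very first run (compared with itself), and the clamping of the hedge width between `w_min` and `w_max` are assumptions; the source states only that the new shortest route receives "somewhat more" reward and that the width has a minimum and a maximum prescribed value.

```python
def distribute_reward(hedge, path, distance, best,
                      constant=0.2, bonus=1.5, w_min=1.0, w_max=2.0):
    """Update hedge widths after one run that reached the goal.

    hedge: dict state number -> hedge width (2h)
    path:  loop-free list of state numbers visited in this run
    best:  dict with keys "distance" and "path" for the shortest run so far
    Returns the reward that was applied.
    """
    # S101: reward from the travel distance (first run compared with itself).
    ref = distance if best["distance"] is None else best["distance"]
    r = (1.5 - distance / ref) * constant
    if best["distance"] is None or distance < best["distance"]:
        # S102 to S104: new shortest run, larger reward, remember the route.
        r *= bonus
        best["distance"], best["path"] = distance, list(path)
        targets = path
    elif r > 0:
        # S105 to S106: positive reward goes to every state on this run.
        targets = path
    else:
        # S105 to S107: negative reward spares states on the shortest route.
        targets = [s for s in path if s not in best["path"]]
    for s in targets:
        hedge[s] = min(w_max, max(w_min, hedge[s] + r))
    return r


if __name__ == "__main__":
    hedge = {s: 1.5 for s in range(5)}
    best = {"distance": None, "path": []}
    distribute_reward(hedge, [0, 1, 2], 10.0, best)      # first, shortest run
    distribute_reward(hedge, [0, 3, 4, 2], 20.0, best)   # long detour
    print(hedge)
```

In the example, states 0, 1 and 2 keep their reinforced width after the detour because they belong to the stored shortest route, while states 3 and 4 are narrowed.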

[0031] After this subroutine ends, the process returns to step S1 of FIG. 9. In step S4 of FIG. 9, it is determined whether the moving body is located in a new state range that is not stored in the state history storage means. The state range is the interior of the circle that forms the base of the conical membership function. If the determination in step S4 is affirmative, the process proceeds to subroutine SR2.

[0032] Subroutine SR2 of FIG. 11 is the operation of the first state selecting means. This subroutine decides the state to go to next. For its action selection, the moving body selects one of the five states lying in its traveling direction with a weighted probability. In addition to using the hedge widths of the five states as weights, adding the direction of the goal to the weights can sometimes speed up discovery of the goal point. However, experiments have shown that when searching for the shortest route it is better not to add this weight. The handling of the goal-direction weight must therefore be chosen according to whether the emphasis is on finding a route early or on finding the shortest route even if it takes time. Similarly, the inertia of the moving body could be added to the weights, that is, the weight of the direction currently being traveled could be increased. Because a real moving body always has inertia, an evaluation term for minimum energy has to be added to the shortest route. It is therefore reasonable to use the inertia of the moving body as a parameter in deciding the next state.

[0033] In step S201, the current position of the moving body is confirmed. In step S202, the direction of the goal point is confirmed, for use when the goal direction is added to the probability weights as described above. In step S203, the neighboring states in five directions relative to the traveling direction of the moving body (straight ahead, diagonally ahead, and to the left and right) are selected as candidates for the next state.

[0034] In step S204, it is determined whether any of the five selected states lies inside an obstacle. If the determination is affirmative, the process proceeds to step S205, and the states inside the obstacle are removed from the candidates. In step S206, a weighted probability calculation is performed to decide the next state. An example of the weighted probability calculation is described here.

[0035] FIG. 12 is an explanatory diagram in which the moving body is located at state number n in the figure and selects one of five states. The upper part of the figure shows the case where the hedge widths of the membership functions of the five states are 10, 30, 100, 40, and 20, respectively. The lower part of the figure illustrates the weighted probability calculation. When the total of the five hedge widths, 200, is multiplied by the RND function (a function that generates a real value from 0 to 1 as a random number) and the result is 120, the figure shows that the state to which the moving body should proceed next is determined to be c. After this subroutine ends, the process proceeds to subroutine SR3 of FIG. 9.
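The weighted probability calculation of FIG. 12 amounts to picking the interval of the cumulative hedge widths into which the scaled random number falls. The sketch below assumes the five candidates are labeled a to e in the order of the listed widths; the function name `select_state` and the injectable `rnd` argument are illustrative choices, not part of the source.

```python
import random

def select_state(candidates, hedge_widths, rnd=random.random):
    """Pick one candidate with probability proportional to its hedge width."""
    total = sum(hedge_widths)
    threshold = rnd() * total            # e.g. 0.6 * 200 = 120
    cumulative = 0.0
    for state, width in zip(candidates, hedge_widths):
        cumulative += width              # 10, 40, 140, 180, 200 in the example
        if threshold < cumulative:
            return state
    return candidates[-1]                # guards against rounding at the top end
```

For the widths 10, 30, 100, 40, and 20, a scaled value of 120 lies between the cumulative sums 40 and 140, so the third candidate is chosen, which matches the result c stated in the text.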

[0036] Subroutine SR3 performs loop determination for the route. FIG. 13 is a processing flowchart of the route loop deleting means. FIGS. 14(a) and 14(b) are explanatory diagrams of its operation. FIG. 14(a) is a route diagram of the loop formed when the state number next selected by the moving body is determined to match a state number that has already been learned and stored in the state history storage. FIG. 14(b) is a route diagram after the looped route has been deleted.

[0037] In a system that learns a route containing subgoals in a single pass from the start point to the goal point, the moving body may pass through the same state again in the vicinity of a subgoal. In this embodiment, however, learning of the next episode does not begin until learning of each episode has converged, so the moving body does not return to a state it has already passed. Deleting the looped route is therefore effective.

[0038] In step S301, it is determined whether the next state decided in subroutine SR2 is a state that has already been passed. If the determination is negative, this subroutine ends. If the determination is affirmative, in step S302 the records from the state following that state up to the present are deleted from the state history list stored in the state history storage means. After this subroutine ends, the process proceeds to step S5 of FIG. 9.
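Steps S301 and S302 reduce to truncating the history list just after the earlier occurrence of the selected state. A minimal sketch, assuming the history is a plain list of state numbers and using the hypothetical name `delete_loop`:

```python
def delete_loop(history, next_state):
    """Return the history with any loop closed by next_state removed."""
    if next_state not in history:        # S301 negative: nothing to delete
        return list(history)
    keep = history.index(next_state) + 1
    return list(history[:keep])          # S302: drop everything after that state
```

For example, if the history is 1, 2, 3, 4, 5 and the next selected state is 3, the entries 4 and 5 form the loop and are removed.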

[0039] In step S5, the decided state is added to the history list for use in the evaluation after the goal point is reached and in the loop deletion determination. After that, the process returns to step S3 of FIG. 9. When the determination in step S4 of FIG. 9 is negative, the moving body is in its normal moving state, and the process proceeds to subroutine SR4, the action determining means, which decides the direction of the moving body.

[0040] In step S401 of FIG. 15, it is determined whether the moving body is currently located inside any state (hedge). If the determination is negative, the process proceeds to step S402, and the moving body takes the direction of the next state decided in subroutine SR2. In the early stage of learning, when no reward has yet been added to the parameters of the membership functions arranged in the states along the route, the hedge widths of the states are small and the process tends to take this step.

[0041] In step S403, the distance between the moving body and the center of every state in which it is located is calculated as a parameter for the subsequent fuzzy arithmetic processing. This distance is the value n in FIG. 7. In step S404, the degree of fit to each state is calculated from the position of the moving body. This is the value z in FIG. 7. In step S405, the direction of the moving body is decided by fuzzy arithmetic processing.
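Steps S403 and S404 can be illustrated with a cone-shaped membership function. The sketch assumes a cone of peak height 1 whose base radius equals the hedge width; neither value is stated in this passage, and the name `cone_membership` is hypothetical.

```python
import math

def cone_membership(position, center, radius):
    """Degree of fit z for a cone of assumed height 1 and base radius `radius`."""
    n = math.dist(position, center)      # S403: distance to the state center
    if radius <= 0 or n >= radius:
        return 0.0                       # outside the state range
    return 1.0 - n / radius              # S404: falls linearly from the center
```

Under these assumptions the degree of fit is 1 at the state center and decreases linearly to 0 at the edge of the state range.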

[0042] FIG. 16 is an explanatory diagram of the operation when the moving body decides its action by the action determining means. When the moving body enters the region of state A at point ae, subroutine SR2 selects state B and the moving body heads toward the center of state B. Until point be, where it enters the region of state B, the degree of fit af toward B is calculated from the membership function of A and the position of the moving body. When the moving body then enters the region of state B at point be, subroutine SR2 selects the next state C and the moving body heads toward the center of state C. Until the moving body enters the region of state C, the degree of fit af toward B, obtained from the membership function of A and the position of the moving body, and the degree of fit bf toward C, obtained from the membership function of B, are calculated continually. In this way, the action of the moving body is decided by fuzzy arithmetic processing from the position of the moving body and the membership functions arranged in the states.

[0043] FIG. 17 is a diagram showing the traveling direction of the moving body in terms of vector components. As shown, the traveling direction of the moving body is calculated from the two degrees of fit using normalized vector components. The weighted average is calculated by the following equation.

[0044]

[Equation 1]

[0045] This equation is the same as the one generally used in fuzzy control. As described above, in this embodiment the state to move to next is decided at the moment the moving body enters the range of a state, and the moving body heads for the next state without passing through the center of the current state. This removes the restriction that was a drawback of conventional discrete reinforcement learning methods, namely that only a finite set of actions is available and the moving body must always go to the center of one of the discrete states. As a result, the moving body traces a smooth curve automatically, without any explicit smoothing control.
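Equation 1 itself is not reproduced in this text, so the following sketch assumes the usual weighted-average form of fuzzy control: each target state contributes a unit vector pointing toward its center, weighted by the corresponding degree of fit. The function name `blended_direction` and its arguments are illustrative and are not taken from the source.

```python
import math

def blended_direction(position, targets, weights):
    """Weighted average of unit vectors from position toward each target."""
    sx = sy = total = 0.0
    for (tx, ty), w in zip(targets, weights):
        dx, dy = tx - position[0], ty - position[1]
        length = math.hypot(dx, dy)
        if length == 0.0 or w <= 0.0:
            continue                     # no direction, or no influence
        sx += w * dx / length            # normalized x component, weighted
        sy += w * dy / length            # normalized y component, weighted
        total += w
    if total == 0.0:
        return (0.0, 0.0)
    return (sx / total, sy / total)
```

Because the weights change continuously as the moving body crosses a state, the averaged heading also changes continuously, which is consistent with the smooth curve described above.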

[0046] After the direction of the moving body has been decided, this subroutine ends and the process proceeds to the obstacle avoidance subroutine. During normal movement, the moving body monitors for obstacles continuously. Subroutine SR5 of FIG. 9 is the processing of the obstacle avoiding means. In step S501 of FIG. 18, it is determined whether the obstacle sensor of the moving body has detected an obstacle in the traveling direction. If the determination is negative, this subroutine ends. If the determination is affirmative, the obstacle avoidance algorithm is started. In step S502, the sensor of the moving body measures the approximate distance to each of the left and right ends of the obstacle. FIGS. 19(a) and 19(b) are explanatory diagrams of the operation of the obstacle avoiding means of the moving body. In the figures, obstacles are shown hatched.

[0047] In FIG. 19(a), the circle centered on the moving body is the range that the sensor can measure, Lr is the distance between the right end of the obstacle and the moving body, and Ll is the distance between the left end of the obstacle and the moving body. In step S503, Lr and Ll are compared and the direction of the moving body is changed toward the nearer end. In FIG. 19(b), the left end of the obstacle is within the measurable range of the sensor but the right end is outside it, so the left end is selected as the direction. If both ends are outside the range, the end nearer to the state to be reached next is selected.
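The three cases of step S503 can be collected into one decision function. In this sketch an end outside the sensor range is represented by `None`, and ties go to the left. Both conventions, along with the name `choose_avoidance_end`, are assumptions for illustration.

```python
def choose_avoidance_end(l_left, l_right, left_to_next, right_to_next):
    """Return "left" or "right", the obstacle end to steer toward.

    l_left / l_right: measured distance to each end (Ll, Lr), or None when
    that end lies outside the sensor range.
    left_to_next / right_to_next: distance from each end to the next state,
    used only when neither end is measurable.
    """
    if l_left is not None and l_right is not None:
        return "left" if l_left <= l_right else "right"   # nearer end
    if l_left is not None:
        return "left"                    # only the left end is in range
    if l_right is not None:
        return "right"                   # only the right end is in range
    return "left" if left_to_next <= right_to_next else "right"
```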

[0048] This completes the obstacle avoidance algorithm, and the process moves to the next step. In step S6 of FIG. 9, the moving body moves a fixed distance in the direction decided in subroutine SR4. The distance is set as appropriate by experiment. The process then returns to step S4. In this way, the moving body performs a random walk, receives a reward (which may be negative) each time it reaches the goal point, and reinforces the shortest route.

[0049] Next, the route movement processing unit, by which the moving body moves after route learning has finished, is described. FIG. 4 is a block diagram of the route movement processing unit after route learning has finished. FIG. 20 is a processing flowchart of route movement in each episode in which the moving body moves in the route movement processing unit. Many of the steps and subroutines in the block diagram of FIG. 4 are shared with the operation of the route learning processing unit. For example, the action determining means 402 and the obstacle avoiding means 403 operate in the same way as the action determining means 305 and the obstacle avoiding means 306 of the route learning processing unit. Only the points of difference are therefore described here.

[0050] In step S7 of FIG. 20, it is determined whether the current position of the moving body is in the state of the goal point. If the determination is affirmative, the process proceeds to step S2. Unlike the first state selecting means of subroutine SR2 used in route learning, the second state selecting means of step S8 does not use probability to decide the next state. This state selection is explained with reference to FIG. 8(a). The candidate states are the five states shown inside the dotted line. Of these five states, the one whose conical membership function has the largest hedge width is selected. In the figure, therefore, state n+1 is selected.
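In contrast to the weighted probability used during learning, the selection in step S8 is a plain maximum over the candidate states. A minimal sketch, assuming the hedge widths are held in a dictionary keyed by state number (the name `select_widest_state` is hypothetical):

```python
def select_widest_state(candidates, hedge):
    """Return the candidate whose membership function has the largest hedge width."""
    return max(candidates, key=lambda s: hedge[s])
```

Because no random number is involved, the moving body follows the same reinforced route on every run once learning has converged.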

[0051] FIG. 21 is a simulation diagram of the moving body partway through learning a route in the first episode on a plane model containing obstacles. FIG. 22 is a simulation diagram showing the route from the start point to the goal point that was reinforced once learning had converged. Although not shown in the figures, states may overlap between different episodes, for example in the vicinity of a subgoal. Such overlap would interfere with the movement of the moving body, so the same state must be given a separate membership function for each episode.

[0052]

[Effects of the Invention] As described in detail above, according to the invention of claim 1, a membership function is arranged in each state of reinforcement learning and the reward of reinforcement learning is reflected in the size of the membership function, so that learning can be made to converge. In addition, by performing fuzzy processing with the membership functions arranged in the states, learning close to that with continuous input and output variables can be achieved while the state positions remain fixed and the available actions remain finite, so that a compact system suited to practical use can be built.

[0053] Also, for a moving body that moves on a plane containing obstacles, the user need only arrange a finite number of states at equal intervals on the plane, unconditionally, and the moving body can learn the route by itself. Furthermore, whereas movement of a moving body in conventional reinforcement learning was a zigzag path that hopped between discrete, fixed states, in the present invention each state widens its range and fuzzy arithmetic processing is performed with adjacent states, which achieves smooth travel.

[0054] Furthermore, because the size of the membership function itself serves as the range of the state, the result after learning has converged consists only of the position information and the range information of the states, which has the effect that the storage area of the device can be small.

[Brief description of drawings]

FIG. 1 is a processing flowchart of the route learning method of the present invention.

FIG. 2 is a block diagram of the route learning device of the present invention.

FIG. 3 is a block diagram of each episode in the route learning processing unit of the present invention.

FIG. 4 is a block diagram of each episode in the route processing unit of the present invention.

FIG. 5 is a plane model diagram for route learning in which obstacles are present.

FIG. 6 is a diagram showing the range that the obstacle sensor of the moving body can detect.

FIG. 7 is an explanatory diagram of the conical membership function.

FIG. 8 is a diagram showing the states that are candidates when the moving body selects a state.

FIG. 9 is a processing flowchart of each episode in the route learning processing unit of the present invention.

FIG. 10 is a processing flowchart of the action evaluation means and the reward distribution means of the route learning processing unit of the present invention.

FIG. 11 is a processing flowchart of the first state selecting means of the route learning processing unit of the present invention.

FIG. 12 is an example of the weighted probability calculation performed by the first state selecting means.

FIG. 13 is a processing flowchart of the route loop deleting means of the route learning processing unit of the present invention.

FIG. 14 is an explanatory diagram of the operation of the route loop deleting means.

FIG. 15 is a processing flowchart of the action determining means of the route learning processing unit of the present invention.

FIG. 16 is an explanatory diagram of the operation of the action determining means.

FIG. 17 is an explanatory diagram of the movement direction vector of the moving body in the action determining means.

FIG. 18 is a processing flowchart of the obstacle avoiding means of the route learning processing unit of the present invention.

FIG. 19 is an explanatory diagram of the operation of the obstacle avoiding means.

FIG. 20 is a processing flowchart of the route movement processing unit of the route learning device of the present invention.

FIG. 21 is a simulation diagram of a moving body partway through route learning using the route learning device of the present invention.

FIG. 22 is a simulation diagram showing the result of route learning by a moving body and its route movement using the route learning device of the present invention.

Claims

[Claims]

1. A route learning method in which an object moves from a start point to a goal point, comprising: a first step of arranging a membership function in each of discretely distributed states; a second step of selecting one of the states arranged in the vicinity when the object transitions from its current state to the next state; a third step of storing the states through which the object has passed as a history of state numbers; a fourth step of deleting, as a route loop, the looped portion of the state number history when the state number selected in the second step is determined to match a state number stored in the third step; a fifth step of determining and executing the action of the object by performing fuzzy arithmetic processing from the membership functions arranged in the states and the current position of the object; a sixth step of avoiding an obstacle when the object transitions to the next state; a seventh step of action evaluation, in which the result of the actions of the object up to its arrival at the goal point is evaluated by the traveling distance from the start point to the goal point; an eighth step of reward distribution, in which a reward according to the action evaluation result of the seventh step is distributed to the parameters of all the membership functions arranged in the states passed through; and a ninth step of converging learning so that the object takes the optimum action by sequentially repeating the second to eighth steps.

2. A route learning device in which an object moves from a start point to a goal point, comprising: initialization processing means for initializing a membership function in each of discretely distributed states; a route learning processing unit that performs reinforcement learning of a route by determining actions through fuzzy arithmetic processing from the membership functions and the current position of the object and by distributing rewards according to the action evaluation result; and a route movement processing unit that moves the object along the route reinforced as a result of the route learning processing unit.

3. The route learning device according to claim 2, wherein the route learning processing unit comprises: first state selecting means for selecting one of the states arranged in the vicinity when the object transitions from its current state to the next state; state history storage means for storing the states through which the object has passed as a history of state numbers; route loop deleting means for deleting, as a route loop, the looped portion of the state number history when the state number selected by the first state selecting means is determined to match a state number stored in the state history storage means; action determining means for determining and executing the action of the object by performing fuzzy arithmetic processing from the membership functions arranged in the states and the current position of the object; obstacle avoiding means for avoiding an obstacle when the object transitions to the next state; action evaluation means for evaluating the result of the actions of the object up to its arrival at the goal point by the traveling distance from the start point to the goal point; reward distribution means for distributing a reward according to the action evaluation result to the parameters of all the membership functions arranged in the states passed through; and learning convergence means for converging learning so that the object takes the optimum action by sequentially repeating the processing of the above means.

4. The route learning device according to claim 2, wherein the route movement processing unit comprises: second state selecting means for selecting one of the states arranged in the vicinity when the object transitions from its current state to the next state; action determining means for determining and executing the action of the object by performing fuzzy arithmetic processing from the membership functions arranged in the states and the current position of the object; and obstacle avoiding means for avoiding an obstacle when the object transitions to the next state.

5. The route learning device according to claim 2, wherein the shape of the membership function in the initialization processing means is a conical shape.

6. The route learning device according to claim 2, wherein the first state selecting means selects a state to move to next using a weighted probability.

7. The route learning device according to claim 6, wherein, in selecting the one state by the first state selecting means, a weight of the weighted probability is added to the parameters of the membership functions arranged in the states lying in the direction of the goal point or in the moving direction of the object.

8. The route learning device according to claim 2, wherein the obstacle avoiding means sets the direction of the object toward the nearer of the left and right ends of the obstacle in order to avoid the obstacle.

9. The route learning device according to claim 2, wherein the learning convergence means converges the learning when all of the states passed through have reached the specified maximum value.