JPH11175493A

JPH11175493A - Experience enhancement type enhanced learning method using behavior selection network and record medium recorded with experience enhancement type enhanced learning program

Info

Publication number: JPH11175493A
Application number: JP9346743A
Authority: JP
Inventors: Satoshi Kurihara; 聡栗原
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 1997-12-16
Filing date: 1997-12-16
Publication date: 1999-07-02

Abstract

PROBLEM TO BE SOLVED: To make learning flexible and to learn shortest routes by varying the attenuation rate of the reinforcement value according to whether a storage module has experience of constituting an episode, and by ending activity propagation when the reinforcement value falls below a prescribed threshold. SOLUTION: When a reinforcement value is propagated to a storage module from an adjacent storage module, the module spontaneously propagates a reinforcement value to the other storage modules adjacent to it. As shown in the figure, after an overall attenuation is applied, the reinforcement value R is propagated with different attenuation rates to storage modules that have experience of constituting an episode and to those that do not. Activity propagation ends when the reinforcement value to be propagated falls below a certain threshold. Learning in L-ANA means storing how much reinforcement value is propagated to each storage module. The characteristics of this activity propagation can easily be controlled with two parameters: the magnitude of the reinforcement value used when performing activity propagation, and the attenuation rate used during propagation.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Technical Field of the Invention] The present invention relates to a novel experience-reinforcement-type reinforcement learning method using an action selection network, in which the framework of the action selection network, devised as an action planning method that lets an autonomous agent operating in a complex, dynamically changing environment adapt effectively to change, is applied to profit-sharing, a conventional experience-reinforcement-type reinforcement learning method. More specifically, it relates to an experience-reinforcement-type reinforcement learning method using an action selection network, and to a recording medium storing an experience-reinforcement-type reinforcement learning program, that perform learning so that autonomous agents, such as autonomous mobile robots that interact with people in the real world or interface agents that handle interaction between the Internet and users autonomously, can not only act according to a conventional action planning module but also adapt efficiently to individual situations in the environment.

[0002]

[Prior Art] The action selection network (see P. Maes: The Agent Network Architecture (ANA), SIGART Bulletin, Vol. 2, No. 4, pp. 115-120, 1991) is a framework in which a set of relatively single-function modules cooperate by propagating activation values to one another, so that the modules as a whole plan purposeful actions. Because no centralized control is needed and each module behaves autonomously, the framework is characterized in particular by robustness and extensibility, and it can plan appropriate actions while responding flexibly and immediately to dynamic changes in the environment. It is therefore an effective means of building action planning modules for autonomous mobile robots operating in environments such as the real world or the Internet, and for software agents (electronic secretaries) that mediate between users and the Internet.

[0003] However, for an autonomous agent to adapt better to its environment, a "learning function" that lets it deal individually with the various situations it encounters in the environment is indispensable. An approach that incorporates a learning function into the framework of the action selection network without impairing the network's characteristics is therefore desirable, but at present no framework incorporating a learning function has been proposed.

[0004] Profit-sharing (see J. J. Grefenstette: Credit Assignment in Rule Discovery Systems Based on Genetic Algorithms, Machine Learning, Vol. 3, pp. 225-245 (1988)) is an experience-reinforcement-type reinforcement learning method that, when a reward is obtained, reinforces the entire action sequence up to that point in one batch. The action sequence in question is called an "episode". Profit-sharing requires few trials for learning and, compared with methods such as Q-learning (see C. J. C. Watkins and P. Dayan: Technical Note: Q-Learning, Machine Learning, Vol. 8, pp. 55-68 (1992)), is robust to dynamic changes in the environment.

[0005] Q-learning has attracted attention in recent years as a reinforcement learning method. Q-learning is an environment-identification-type reinforcement learning method, and it has been proven that an optimal learning result is obtained if the states of the environment used to compute the Q values are identified accurately. However, problems have been pointed out: it requires far more trials than profit-sharing, and if the environment changes dynamically, all of the learning results obtained so far are affected.

[0006] Accordingly, Q-learning is suitable where detailed knowledge of the agent's entire environment can be acquired. However, for the autonomous agents considered here, which operate in a dynamically changing environment and can therefore only ever hold incomplete knowledge of it, a learning method such as profit-sharing is more suitable. Even with profit-sharing, though, learning for the parts affected by a change can only be invalidated, so there is a limitation: the learning effect depends heavily on the scale of the dynamic change in the environment.

[0007]

[Problems to Be Solved by the Invention] For an autonomous agent to learn, it must gather information about its environment, but obtaining all information about the real world or the Internet in detail beforehand is impossible. Such agents therefore build a model of the environment while gathering local information with sensors and the like. However, because the environment changes dynamically, the resulting model is always incomplete. In such a situation, an experience-reinforcement-type reinforcement learning method that is especially robust to dynamic changes in the environment needs to be considered.

[0008] The present invention has been made in view of the above. Its object is to provide an experience-reinforcement-type reinforcement learning method using an action selection network, and a recording medium storing an experience-reinforcement-type reinforcement learning program, that are more robust to dynamic changes in the environment than conventional profit-sharing and can adapt efficiently to individual situations in the environment.

[0009]

[Means for Solving the Problems] To achieve the above object, the invention according to claim 1 is an experience-reinforcement-type reinforcement learning method in which the framework of an action selection network, which enables an autonomous agent operating in a complex and dynamically changing environment to adapt effectively to change, is applied to profit-sharing. Its gist is that activity propagation is realized as a cooperative operation of a group of storage agents, as follows: a storage agent, itself an autonomous entity, is allocated to each state constituting an episode, which is a transition sequence of state elements; storage agents are also allocated to states outside the episode that come into view as the internal state moves; when a reinforcement value is propagated from an adjacent storage agent, the storage agent spontaneously propagates a reinforcement value toward its adjacent storage agents; after an overall attenuation is applied, the attenuation rate of the reinforcement value is varied according to whether the agent has experience of constituting an episode; and activity propagation ends when the reinforcement value falls to or below a predetermined threshold.

[0010] According to the invention of claim 1, activity propagation is also performed to storage agents that are not directly related to the episode but are adjacent to it. Learning can therefore be flexible, and efficient learning along the shortest route is also possible.

[0011] The invention according to claim 2 is a recording medium storing an experience-reinforcement-type reinforcement learning program in which the framework of an action selection network, which enables an autonomous agent operating in a complex and dynamically changing environment to adapt effectively to change, is applied to profit-sharing. Its gist is that activity propagation is realized as a cooperative operation of a group of storage agents, as follows: a storage agent, itself an autonomous entity, is allocated to each state constituting an episode, which is a transition sequence of state elements; storage agents are also allocated to states outside the episode that come into view as the internal state moves; when a reinforcement value is propagated from an adjacent storage agent, the storage agent spontaneously propagates a reinforcement value toward its adjacent storage agents; when the overall attenuation is applied, the attenuation rate of the reinforcement value is varied according to whether the agent has experience of constituting an episode; and activity propagation ends when the reinforcement value falls to or below a predetermined threshold.

[0012] According to the invention of claim 2, the recording medium stores an experience-reinforcement-type reinforcement learning program that allocates a storage agent to each state constituting an episode, also allocates storage agents to states outside the episode that come into view as the internal state moves, propagates a reinforcement value toward adjacent storage agents when a reinforcement value is propagated from an adjacent storage agent, varies the attenuation rate of the reinforcement value, after an overall attenuation is applied, according to whether the agent has experience of constituting an episode, and ends activity propagation when the reinforcement value falls to or below a predetermined threshold. The program can therefore be distributed more readily by means of the recording medium.

[0013]

[Embodiments of the Invention] First, to explain L-ANA, the experience-reinforcement-type reinforcement learning method of the present invention, consider an autonomous agent A moving in a grid-like state space S as shown in FIG. 1. The individual states constituting the state space S are denoted S1a, ..., S7f. Agent A can move one block at a time up, down, left, or right within S, and each move draws on a virtual battery carried by A. The battery level decreases with movement but can be replenished at a charging point B. Several charging points B exist in S, and the amount of energy each can supply differs. Obstacles also exist in S, and agent A cannot pass through them. The positions and number of the charging points B do not change, but the positions and number of the obstacles change dynamically. The reward given is proportional to the amount of energy replenished.

[0014] In the initial state, agent A has no information at all about the positions of the charging points B or the obstacles; it knows only that the state space S is a grid-like environment in which it can move up, down, left, and right. However, agent A carries a virtual sensor and can obtain the state of the environment within the hatched range in FIG. 1.

[0015] Agent A has an internal state space A_S that changes depending on the state space S. In this example, two internal states can be considered: A_S = {A_s1: charging is needed, A_s2: charging is not needed}.

[0016] The learning required of agent A is to learn which charging points B can supply more energy, and the routes to those charging points. It is also desirable that the agent be able to learn that replenishing at a nearby charging point B can be more efficient even though the amount of energy supplied there is small. Above all, because obstacles dynamically disappear and appear, it is important that the agent respond flexibly when it detects such a change and that learning efficiency not be degraded.

[0017] L-ANA performs experience-reinforcement-type reinforcement learning, as profit-sharing does. In profit-sharing, learning is performed in units of a state transition sequence corresponding to an episode, and the episode is stored together with the learned values allotted to it. For example, suppose that agent A happens to reach a charging point B by the state transition sequence p, as shown in FIG. 2(a). A reward of 100 is first given to the state in which the charging point B exists, and reinforcement values based on a reinforcement function are then allotted in turn to each state constituting the episode p. Storing one episode in this way constitutes one round of learning in profit-sharing.
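As a rough illustration of the episode-wise credit assignment described above, the following Python sketch distributes a reward of 100 backward along one episode. The text does not specify the reinforcement function, so the geometric form, the ratio 0.5, the state names in `episode_p`, and the function name `reinforce_episode` are all assumptions made for this example.

```python
def reinforce_episode(values, episode, reward, ratio=0.5):
    """Add credit to every state of `episode`, newest state first.

    `episode` lists states in visiting order and ends at the rewarded state.
    The geometric reinforcement function (reward * ratio**k for a state that
    lies k steps before the reward) is an assumed example.
    """
    for steps_back, state in enumerate(reversed(episode)):
        values[state] = values.get(state, 0.0) + reward * ratio ** steps_back
    return values


# Hypothetical episode p that ends at a charging point located in state "S6d".
episode_p = ["S4c", "S4d", "S5d", "S6d"]
values = reinforce_episode({}, episode_p, reward=100.0)
```

Only states that appear in the episode receive a value here, which is the property of profit-sharing that L-ANA later relaxes.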

[0018] In L-ANA, the internal state A_si that changes as a result of obtaining a reward is the object of learning. The episode in this case is the transition sequence of state elements from the initial state, or from the time the internal state A_si came to hold, until A_si ceases to hold. For example, if charging is performed while the internal state A_s1 holds, A_s1 ceases to hold and A_s2 comes to hold instead. In this case, reinforcement learning is performed for the internal state A_s1.

[0019] In contrast, L-ANA allocates one storage module to each state constituting an episode, as shown in FIG. 2(b). The reinforcement learning performed when a reward is obtained is the same as in profit-sharing: letting s_ij be the storage agent allocated to the state in which the reward was obtained, the reinforcement value is updated according to s_ij(t+1) = s_ij(t) - b s_ij(t) + b p(t), where b is the learning rate and p is the reward. Profit-sharing allots to an episode reinforcement values based on a reinforcement function, which returns a reinforcement value given how far in the past a state lies from the reward. As described above, profit-sharing allots reinforcement values only to the state sequence constituting an episode, which is effective in situations where uncertain elements other than experience must be excluded and only experience can be trusted. However, where the agent has a field of view and can gather local information, allotting reinforcement values within the visible range, at a smaller proportion than the values allotted to the episode, may give learning flexibility and robustness. Because L-ANA, like profit-sharing, is fundamentally a method that reinforces experience, states outside the episode that were not directly experienced must be treated differently from episode states.
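The update rule s_ij(t+1) = s_ij(t) - b s_ij(t) + b p(t) can be rewritten as s_ij(t) + b (p(t) - s_ij(t)), so each update moves the stored value a fraction b of the way toward the reward. The short Python sketch below applies the rule three times; the learning rate 0.5 and the reward 100 are example numbers chosen here, not values from the text.

```python
def update_reinforcement(s, p, b):
    """One application of s(t+1) = s(t) - b*s(t) + b*p(t)."""
    return s - b * s + b * p


s = 0.0            # reinforcement values start at 0
history = []
for _ in range(3):
    s = update_reinforcement(s, p=100.0, b=0.5)
    history.append(s)
```

With repeated rewards of the same size, the stored value approaches the reward without exceeding it.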

[0020] Therefore, in L-ANA, storage modules are also allocated to states outside the episode that come into view as agent A moves. For example, if the episode in FIG. 2(b) is obtained, the storage modules that agent A can allocate are those in the portion shown in FIG. 2(c). When each storage module is allocated, it records which storage agents it is adjacent to and whether it constitutes part of an episode. A module that was not part of an episode when allocated may become part of one in a later trial. Adjacent storage agents are those located directly above, below, to the left, and to the right.
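The bookkeeping described above (one module per seen state, a record of the four-neighborhood, and an episode flag that can be set later) might be sketched in Python as follows. The class name `StorageModule`, the function `allocate`, and the use of (row, column) tuples as state identifiers are assumptions for illustration only.

```python
from dataclasses import dataclass, field


@dataclass
class StorageModule:
    state: tuple                  # (row, col) grid coordinate
    in_episode: bool = False      # has this state ever been part of an episode?
    neighbors: set = field(default_factory=set)
    value: float = 0.0            # reinforcement value, initially 0


def allocate(modules, seen_states, episode_states):
    """Create modules for newly seen states, then refresh flags and adjacency."""
    for state in seen_states:
        modules.setdefault(state, StorageModule(state))
    for state in episode_states:
        modules.setdefault(state, StorageModule(state)).in_episode = True
    # Link each module to the modules directly above, below, left and right.
    for (r, c), module in modules.items():
        for nb in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if nb in modules:
                module.neighbors.add(nb)
                modules[nb].neighbors.add((r, c))
    return modules


modules = {}
allocate(modules, seen_states=[(0, 0), (0, 1), (1, 1)],
         episode_states=[(0, 0), (0, 1)])
# A later trial passes through (1, 1), so that module joins an episode as well.
allocate(modules, seen_states=[(1, 2)], episode_states=[(1, 1)])
```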

[0021] The learned reinforcement values are used in the same way as in profit-sharing: the policy is to move to a state with a larger reinforcement value.
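A minimal Python sketch of this policy is shown below. It picks the adjacent state with the largest stored reinforcement value and breaks ties at random; the function name `choose_next_state` and the dictionary representation of the values are assumptions, and unseen states are treated as having value 0.

```python
import random


def choose_next_state(neighbors, values, rng=random):
    """Return the adjacent state with the largest reinforcement value.

    States without a stored value count as 0; ties are broken at random.
    """
    best = max(values.get(n, 0.0) for n in neighbors)
    candidates = [n for n in neighbors if values.get(n, 0.0) == best]
    return rng.choice(candidates)


values = {(0, 1): 20.0, (1, 0): 35.0}
next_state = choose_next_state([(0, 1), (1, 0), (1, 2)], values)
```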

[0022] When a reinforcement value is propagated to a storage module from an adjacent storage module, the module spontaneously propagates a reinforcement value toward the other storage modules adjacent to it. As shown in FIG. 3, after a certain overall attenuation is applied, the reinforcement value R is propagated with different attenuation rates to storage modules that have experience of constituting an episode and to those that do not. Activity propagation ends when the reinforcement value to be propagated falls to or below a certain threshold. Learning in L-ANA means that each storage module stores how much reinforcement value it propagates to which adjacent storage module.
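One way to picture this propagation is the Python sketch below. It is an interpretation under stated assumptions, not the algorithm of FIG. 8: attenuation is modelled as multiplication by a factor below 1, a module keeps only the strongest value that reaches it, and the factor values, the threshold, and the names `propagate`, `overall`, `episode_rate`, and `other_rate` are all hypothetical.

```python
from collections import deque


def propagate(neighbors, in_episode, start, R,
              overall=0.9, episode_rate=0.9, other_rate=0.5, threshold=1.0):
    """Spread reinforcement value R outward from the rewarded state `start`.

    Returns (received, sent): the largest value each module received and the
    value each module passed to each neighbor. All factors are assumed numbers.
    """
    received = {start: R}
    sent = {}
    queue = deque([start])
    while queue:
        src = queue.popleft()
        base = received[src] * overall           # attenuation applied to all targets
        for dst in neighbors[src]:
            rate = episode_rate if in_episode.get(dst, False) else other_rate
            value = base * rate
            if value <= threshold:               # too weak: propagation ends here
                continue
            sent[(src, dst)] = value             # what src remembers sending to dst
            if value > received.get(dst, 0.0):   # keep only the strongest arrival
                received[dst] = value
                queue.append(dst)
    return received, sent


# States a-b-c-d form an episode ending at the rewarded state d; e was only seen.
neighbors = {"a": ["b"], "b": ["a", "c", "e"], "c": ["b", "d"],
             "d": ["c"], "e": ["b"]}
in_episode = {"a": True, "b": True, "c": True, "d": True}
received, sent = propagate(neighbors, in_episode, start="d", R=100.0)
```

In this example the off-episode state `e` ends up with a smaller value than the episode state `a`, even though both are adjacent to `b`, which reflects the different attenuation rates for experienced and non-experienced modules.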

[0023] The characteristics of activity propagation can easily be controlled through two parameters: 1. the magnitude of the reinforcement value used when performing activity propagation, and 2. the attenuation rate used during propagation. Depending on how these two parameters are set, two learning characteristics can be used selectively, as follows.

[0024] (1) To perform learning whose effect appears only when agent A comes near a state in which a reward can be obtained, the reinforcement value is made large and the attenuation rate high. That is, picturing a mountain of reinforcement values whose peak is the rewarded state, the mountain is tall and steep. For example, if a predator of agent A is considered in the state space S, the agent needs to flee only when the predator has closed in on its vicinity. Setting (1) is effective for this kind of learning.

[0025] (2) Conversely, for the learning effect to appear even when agent A is far from the rewarded state, the reinforcement value is made small and the attenuation rate low. That is, the mountain of reinforcement values whose peak is the rewarded state is low and gently sloped. For example, when learning to recharge energy as in the present case, it is better for the learning effect to spread widely, as in (2). The reinforcement value must be set smaller than in (1) so that the distribution of (1) is not completely contained within the reinforcement value distribution of (2): even while the agent is moving toward a charging point B according to the learning result of (2), the learning effect of (1) must take effect when a predator approaches.
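The difference between settings (1) and (2) can be seen by tabulating the reinforcement value against distance from the rewarded state. The Python sketch below assumes that a value shrinks by the factor (1 - rate) per step and stops at a threshold of 1; the peaks (100 and 40) and rates (0.6 and 0.1) are example numbers chosen here, and `profile` is a hypothetical helper.

```python
def profile(peak, rate, threshold=1.0):
    """Reinforcement value at distance 0, 1, 2, ... from the rewarded state."""
    values = []
    value = peak
    while value > threshold:
        values.append(value)
        value *= (1.0 - rate)     # assumed form of the per-step attenuation
    return values


steep = profile(peak=100.0, rate=0.6)    # setting (1): tall peak, fast decay
gentle = profile(peak=40.0, rate=0.1)    # setting (2): low peak, slow decay
```

Near the peak the steep profile is larger, so it is not hidden inside the gentle one, while a few steps away the gentle profile dominates and reaches much farther.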

[0026] By performing activity propagation to storage modules outside the episode as well, learning that combines greater flexibility and robustness becomes possible. For example, in profit-sharing the agent can only move at random until it encounters one of the episodes, as in FIG. 4(a). In L-ANA, as shown in FIG. 4(b), if agent A is located in a state that activity propagation has already reached, it is drawn into a nearby episode by the shortest route and can therefore move to the charging point B efficiently.

[0027] Because each storage module functions independently, even if the function of some storage module is lost, activity propagation is carried out without that module, and a route that bypasses the damaged part is selected automatically. In profit-sharing, if one of the states constituting an episode is lost, the entire episode is affected. For this reason too, L-ANA is more robust and is well suited as a learning method for autonomous agents operating in dynamic environments such as the real world.

[0028] In profit-sharing, suppressing ineffective rules when allotting reinforcement values is a problem, whereas in L-ANA even ineffective rules are actively reused as routes leading to a state in which a reward is obtained.

[0029] Learning is performed per internal state A_si, and a distinct activity-propagation pattern is learned for each A_si. Therefore, in a situation where, for example, A_s1 and A_s2 both hold, agent A can select actions using a distribution map in which both reinforcement value distributions are superimposed.

[0030] Furthermore, in a framework in which multiple agents A exist and cooperate with one another, the agents can easily make use of each other's learning results by sharing the reinforcement value distributions that the different agents have learned.

[0031] Everything so far has concerned positive reinforcement learning, but L-ANA can also easily realize negative reinforcement learning by performing a reverse activity propagation that absorbs activation values. Agent A uses the learning results by always moving to a state with a larger reinforcement value. Suppose now that pits are added to the state space S. To keep the agent away from a pit, activity propagation that absorbs activation values is performed centered on the pit. By superimposing the result on the reinforcement value distribution from the positive reinforcement learning for charging, both learning effects are easily integrated, and the agent can select the best route to a charging point while avoiding pits.
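Superimposing distributions can be pictured as a state-by-state sum, as in the Python sketch below. Representing the absorbing propagation around a pit as negative values is an assumption of this example, as are the state names and numbers and the helper name `overlay`.

```python
def overlay(*distributions):
    """Sum several reinforcement value maps state by state."""
    combined = {}
    for dist in distributions:
        for state, value in dist.items():
            combined[state] = combined.get(state, 0.0) + value
    return combined


# Positive learning toward a charging point at "goal"; "b" and "c" look equally good.
charge = {"a": 30.0, "b": 40.0, "c": 40.0, "goal": 100.0}
# Absorbing propagation around a pit next to "b" (modelled as a negative value).
pit = {"b": -60.0}

combined = overlay(charge, pit)
best = max(["b", "c"], key=lambda s: combined[s])   # next move from "a"
```

The same sum could be applied to the distributions learned for two internal states that hold at once, or to distributions shared between cooperating agents.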

[0032] FIGS. 5 and 6 show the results of a comparative evaluation of profit-sharing and L-ANA actually carried out using the state space S.

[0033] FIGS. 5(a) and 5(b) show the accuracy of the routes learned by profit-sharing and L-ANA, respectively, in a certain environment s. Specifically, they show how close to the shortest route autonomous agent A came when moving to a charging point B, that is, the ratio of the route actually travelled by A to the computed shortest route. For example, a value of 10 means that the learned route was ten times as long as the shortest route. Comparing profit-sharing in FIG. 5(a) with L-ANA in FIG. 5(b) shows that L-ANA always moves along a route close to the shortest one, whereas profit-sharing shows considerable variation, confirming that L-ANA learns routes closer to the shortest.
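The text does not say how the shortest route was computed. One common way, sketched below in Python as an assumption, is breadth-first search over the four-connected grid, after which the evaluation measure is the travelled step count divided by the shortest step count. The 7-by-6 grid size mirrors the states S1a to S7f; the obstacle positions, the travelled count of 12, and the function names are made up for the example.

```python
from collections import deque


def shortest_steps(width, height, obstacles, start, goal):
    """Breadth-first search over 4-connected grid cells; None if unreachable."""
    queue = deque([(start, 0)])
    seen = {start}
    while queue:
        (x, y), dist = queue.popleft()
        if (x, y) == goal:
            return dist
        for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            cell = (nx, ny)
            if (0 <= nx < width and 0 <= ny < height
                    and cell not in obstacles and cell not in seen):
                seen.add(cell)
                queue.append((cell, dist + 1))
    return None


def path_ratio(travelled_steps, width, height, obstacles, start, goal):
    """Ratio of the steps actually taken to the shortest possible number."""
    return travelled_steps / shortest_steps(width, height, obstacles, start, goal)


ratio = path_ratio(12, width=7, height=6, obstacles={(1, 0), (1, 1)},
                   start=(0, 0), goal=(2, 0))
```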

[0034] Furthermore, in FIGS. 5(a) and 5(b), when a dynamic change is made to the environment at the 250th step, specifically when an obstacle is made to appear dynamically, the performance of profit-sharing degrades temporarily, whereas that of L-ANA does not. This confirms that L-ANA is more robust than profit-sharing to dynamic changes in the environment.

[0035] FIGS. 6(a) and 6(b) show, for profit-sharing and L-ANA respectively, which charging point each state in the environment has been learned as a route toward. With profit-sharing in FIG. 6(a), inefficient situations arise in which a route toward charging point B3 is learned even for states near charging point B1. With L-ANA in FIG. 6(b), a route toward charging point B1 is learned near B1, confirming that the learned routes take into account both the travel distance to the charging point and the amount of energy replenished.

[0036] Next, the operation of the experience-reinforcement-type reinforcement learning method using an action selection network according to one embodiment of the present invention is described with reference to the flowcharts in FIGS. 7 and 8. FIG. 7 is a flowchart showing the overall flow of L-ANA, specifically the flow of learning for an internal condition A_si, and FIG. 8 is a flowchart showing the activity propagation algorithm used in step S21 of FIG. 7.

【００３７】The overall flow of L-ANA will be described with reference to FIG. 7. The processing shown in the figure takes reinforcement learning for the internal condition A_si as an example. First, it is checked whether the internal condition A_si is satisfied (step S11). If it is not satisfied, there is no need to perform reinforcement learning, so the autonomous action subject moves at random to a state it can reach and the process returns to the first step (step S13).

【００３８】If the internal condition A_si is satisfied, the autonomous action subject moves to a state having a larger reinforcement value; if there are several candidates, one is selected at random (step S15). The reinforcement value of every state is initialized to 0. The state moved to is registered in the episode registration table (step S17). It is then checked whether a reward has been obtained (step S19). If no reward has been obtained, the process returns to step S15, moves to a state having a larger reinforcement value, and repeats the same processing. If a reward has been obtained, activation propagation is performed (step S21). This activation propagation is described in detail with reference to the flowchart shown in FIG. 8.

【００３９】After the activation propagation, the episode registration table is initialized (step S23), the internal condition A_si is set to not satisfied, and the process returns to the first step (step S25).
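The loop of steps S15 to S21 can be sketched as follows. This is a minimal illustration, not the patented implementation: the names `run_episode`, `line_neighbors`, `rewards` and `max_steps` are hypothetical, "a state having a larger reinforcement value" is read here as "a neighbor with the largest reinforcement value", and the `propagate` callback merely stands in for the FIG. 8 procedure. Clearing the episode table (S23) and resetting the internal condition (S25) are left to the caller.

```python
import random


def run_episode(start, neighbors, value, rewards, propagate,
                rng=random, max_steps=10_000):
    """Steps S15 to S21 of FIG. 7, run once the internal condition holds.

    neighbors(state) -> list of reachable states (hypothetical helper)
    value            -> dict: state -> reinforcement value (missing means 0)
    rewards          -> dict: rewarding state -> reinforcement value R
    propagate        -> callback standing in for the FIG. 8 procedure
    """
    episode = []                      # episode registration table
    state = start
    for _ in range(max_steps):        # safety guard; not part of the flowchart
        options = neighbors(state)
        best = max(value.get(n, 0.0) for n in options)
        # S15: move to a highest-valued neighbor, ties broken at random
        state = rng.choice([n for n in options if value.get(n, 0.0) == best])
        episode.append(state)         # S17: register the state moved to
        if state in rewards:          # S19: reward obtained?
            propagate(state, rewards[state], episode)   # S21
            return episode
    return episode


def line_neighbors(state, length=4):
    """Neighbors in a one-dimensional world with states 0..length-1."""
    return [n for n in (state - 1, state + 1) if 0 <= n < length]
```

With all reinforcement values at their initial value of 0, every neighbor ties and the walk is purely random; once earlier episodes have left reinforcement values behind, the same loop follows them toward the reward.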

【００４０】Next, the activation propagation shown in FIG. 8 will be described. When activation propagation starts, the reinforcement value R is first given to the state A_si in which the reward was obtained, and R is assigned to the reference reinforcement value s for activation propagation (steps S33, S35). The following processing is then performed for each of the states A_sj adjacent to the state A_si, one at a time (step S37).

【００４１】First, it is checked whether the state A_sj is registered in the episode registration table (step S39). If it is registered, it is checked whether the reinforcement value of A_sj is smaller than the product of the reference reinforcement value s and the attenuation rate α (the attenuation rate used when propagating to a state that constitutes the episode, 0 < α < 1) (step S41). If the reinforcement value of A_sj is not smaller, the process returns to step S37 and repeats the same processing. A reinforcement value is propagated again to a state that has already received one only when the existing value is smaller than the value about to be propagated. Accordingly, if the reinforcement value of A_sj is smaller, it is checked whether the product αs is smaller than the minimum reinforcement value min (step S43). Here min denotes the smallest reinforcement value that is propagated.

【００４２】If the reinforcement value to be propagated is smaller than the preset minimum value min, activation propagation for this branch ends and the process returns to step S37 to perform activation propagation for another state. If it is not smaller than min, the reinforcement value αs is given to the state A_sj (step S45).
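The effect of the minimum value on how far a reward spreads can be illustrated with a small sketch. It assumes a simple chain of states in which every hop uses the same attenuation rate; the function name `values_along_chain` and the numbers in the usage note are hypothetical and are not taken from the specification.

```python
def values_along_chain(R, rate, minimum):
    """Reinforcement values delivered hop by hop along a chain of states.

    Assumes 0 < rate < 1 and minimum > 0, so the loop always terminates.
    """
    delivered = []
    s = R                          # reference reinforcement value s
    while rate * s >= minimum:     # stop once the value would fall below min
        s = rate * s               # the next state receives rate * s
        delivered.append(s)
    return delivered
```

For example, with R = 1.0 and min = 0.1, a rate of 0.5 delivers 0.5, 0.25 and 0.125 before stopping, whereas a rate of 0.25 delivers only 0.25. A smaller attenuation rate therefore confines propagation to a narrower neighborhood of the reward.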

【００４３】Then it is checked whether the state A_sj is registered in the activation propagation table (step S47). If it is not registered, it is registered (step S49). If it is registered and carries a check indicating that it has already performed activation propagation, that check is cleared (step S51).

【００４４】Next, it is checked whether processing has finished for all states A_sj adjacent to the state A_si (step S53). That is, the series of activation propagation for the state A_si ends when activation propagation has finished for all storage modules registered in the activation propagation table. If not all adjacent states have been processed, the process returns to step S37 and activation propagation is repeated for another state. If all of them have been processed, the state A_sj is marked with a check indicating that it has finished activation propagation (step S55). It is then checked whether any unchecked state remains among those registered in the activation propagation table. If none remains, this processing ends. If some remain, the oldest-registered unchecked state in the activation propagation table is treated as the new state A_si and the following processing is repeated (step S59). That is, its own reinforcement value is assigned to the reference reinforcement value s for activation propagation, and the process returns to step S37 and repeats the same processing (step S61).

【００４５】On the other hand, if the check in step S39 finds that the state A_sj is not registered in the episode registration table, it is checked whether the reinforcement value of A_sj is smaller than the product of the reference reinforcement value s and the attenuation rate β (the attenuation rate used when propagating to a state that does not constitute the episode, 0 < β < 1) (step S63). If the reinforcement value of A_sj is not smaller, the process returns to step S37 and repeats the same processing. If it is smaller, it is checked whether the product βs is smaller than the minimum reinforcement value min (step S65). If the reinforcement value to be propagated is smaller than the preset minimum reinforcement value min, activation propagation for this branch ends and the process returns to step S37 to perform activation propagation for another state. If it is not smaller than min, the reinforcement value βs is given to the state A_sj (step S67), and the process proceeds to step S47 and performs the processing described above.
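The procedure of FIG. 8 can be sketched in a few lines. This is an illustrative reading under stated assumptions, not the implementation of the invention: the function and parameter names are hypothetical, α is applied to states registered in the episode table and β to all other states (following the branch at step S39), the check at step S55 is placed on the state that has just finished propagating to its neighbors, and the rewarded state is put into the propagation table as a convenience.

```python
def propagate(reward_state, R, episode, neighbors, value, alpha, beta, minimum):
    """Activation propagation, steps S33 to S67 of FIG. 8 (sketch).

    neighbors(state) -> list of adjacent states (hypothetical helper)
    value            -> dict: state -> reinforcement value, updated in place
    """
    in_episode = set(episode)
    value[reward_state] = R                    # S33
    # Propagation table: state -> "finished" check, kept in registration order.
    table = {reward_state: False}
    while True:
        # S59: the oldest-registered unchecked state becomes the new source.
        source = next((st for st, done in table.items() if not done), None)
        if source is None:
            return value                       # nothing unchecked: done
        s = value[source]                      # S35 / S61: reference value s
        for nb in neighbors(source):           # S37
            rate = alpha if nb in in_episode else beta    # S39
            candidate = rate * s
            if value.get(nb, 0.0) >= candidate:  # S41 / S63: not smaller, skip
                continue
            if candidate < minimum:              # S43 / S65: below min, skip
                continue
            value[nb] = candidate                # S45 / S67
            table[nb] = False                    # S47 to S51: register or uncheck
        table[source] = True                     # S55: source has finished
```

Because a Python dict keeps its insertion order when an existing key is updated, a state whose check is cleared at S51 keeps its original registration position, which matches the "oldest-registered" rule of step S59. On a small graph with R = 1.0, α = 0.5, β = 0.25 and min = 0.1, episode states one and two hops from the reward receive 0.5 and 0.25, while a non-episode neighbor of the reward receives 0.25 and does not spread any further.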

【００４６】[0046]

[Effects of the Invention]

As described above, according to the present invention, activation propagation is also performed to adjacent storage agents that are not directly related to the episode, which enables flexible learning. Learning is efficient and follows the shortest route, and the method is more robust than conventional profit-sharing. It is therefore best suited to autonomous action subjects, such as autonomous mobile robots and software agents, that operate in complex and dynamically changing environments such as the real world and the Internet. Controlling the characteristics of activation propagation makes it easy to adjust the characteristics of learning, and the method is also easy to combine with conventional real-time reactive planning and the like.

[Brief description of the drawings]

FIG. 1 is an explanatory diagram showing an autonomous action subject moving in a grid-like state space, which serves as an example (grid world) for explaining the experience-reinforcement type reinforcement learning method L-ANA using an action selection network according to the present invention.

FIG. 2 is an explanatory diagram showing the relationship between episodes and storage modules.

FIG. 3 is an explanatory diagram showing how activation propagation is performed.

FIG. 4 is an explanatory diagram showing the effect of activation propagation for conventional profit-sharing and for L-ANA of the present invention.

FIG. 5 is a diagram showing the accuracy of the movement routes learned by conventional profit-sharing and by L-ANA of the present invention in a certain environment s.

FIG. 6 is a diagram showing, for conventional profit-sharing and for L-ANA of the present invention, which charging point each state in the environment has been learned as a route toward.

FIG. 7 is a flowchart showing the overall flow of L-ANA according to one embodiment of the present invention.

FIG. 8 is a flowchart showing the algorithm for the activation propagation in step S21 of FIG. 7.

[Explanation of symbols]

A: autonomous action subject
B: charging point
p: episode
R: reinforcement value
S: state space

Claims

[Claims]

1. An experience-reinforcement type reinforcement learning method using an action selection network, in which the framework of an action selection network is applied to profit-sharing so that an autonomous action subject operating in a complex and dynamically changing environment can adapt effectively to change, characterized in that: a storage agent, which is itself an autonomous subject, is assigned to each state constituting an episode, an episode being a transition sequence of state elements; storage agents are also assigned to states other than those of the episode that come into view when the internal state moves; and activation propagation is realized as a cooperative operation of the group of storage agents, in which a storage agent that receives a reinforcement value propagated from an adjacent storage agent spontaneously propagates the reinforcement value to its own adjacent storage agents, the reinforcement value is attenuated overall while its attenuation rate is varied according to whether or not the state has experience constituting the episode, and activation propagation ends when the reinforcement value falls below a predetermined threshold.

2. A recording medium on which is recorded an experience-reinforcement type reinforcement learning program using an action selection network, in which the framework of an action selection network is applied to profit-sharing so that an autonomous action subject operating in a complex and dynamically changing environment can adapt effectively to change, characterized in that: a storage agent, which is itself an autonomous subject, is assigned to each state constituting an episode, an episode being a transition sequence of state elements; storage agents are also assigned to states other than those of the episode that come into view when the internal state moves; and activation propagation is realized as a cooperative operation of the group of storage agents, in which a storage agent that receives a reinforcement value propagated from an adjacent storage agent spontaneously propagates the reinforcement value to its own adjacent storage agents, the reinforcement value is attenuated overall while its attenuation rate is varied according to whether or not the state has experience constituting the episode, and activation propagation ends when the reinforcement value falls below a predetermined threshold.