JP5398414B2

JP5398414B2 - Learning system and learning method

Info

Publication number: JP5398414B2
Application number: JP2009187526A
Authority: JP
Inventors: 誉羽竹内; 広司辻野
Original assignee: Honda Motor Co Ltd
Current assignee: Honda Motor Co Ltd
Priority date: 2008-09-18
Filing date: 2009-08-12
Publication date: 2014-01-29
Anticipated expiration: 2029-08-12
Also published as: JP2010073200A

Description

The present invention relates to a learning system and a learning method based on reinforcement learning.

Reinforcement learning is known as a learning method by which a machine such as a robot improves its own control rules through learning so that they suit its purpose (for example, Non-Patent Document 1). Furthermore, biological research in particular has suggested that the brain may perform reinforcement learning with an explicit environment model (for example, Non-Patent Document 2). Reinforcement learning with an explicit environment model has two advantages: it can cope with changes in the environment, which conventional reinforcement learning without an environment model handles poorly, and it can manage acquired action sequences as units.

On the other hand, reinforcement learning with an explicit environment model must search a structure that represents the environment model, such as a tree, so its computational cost is very high.

Thus, no reinforcement learning system or reinforcement learning method has been developed that has an explicit environment model and a low computational cost.

Non-Patent Document 1: N. D. Daw & K. Doya, “The computational neurobiology of learning and reward”, Current Opinion in Neurobiology, 2006, 16, pp. 199-204
Non-Patent Document 2: N. D. Daw, Y. Niv & P. Dayan, “Uncertainty-based competition between prefrontal and dorsolateral striatal systems for behavioral control”, Nature Neuroscience, 2005, 8, pp. 1704-1711

Accordingly, there is a need for a reinforcement learning system and a reinforcement learning method that have an explicit environment model, can cope with changes in the environment, can manage acquired action sequences as units, and have a low computational cost.

A learning system according to the present invention includes: an event list database that holds a plurality of event lists, each event list being a set of a series of state-action pairs leading up to a state-action pair immediately before a reward is obtained; an event list management unit that classifies state-action pairs into the plurality of event lists and stores them; and a learning control unit that updates the expected reward value of each state-action pair that is an element of an event list.

A learning method according to the present invention is performed by a learning system that includes an event list database, an event list management unit, and a learning control unit, where the event list database holds a plurality of event lists and each event list is a set of a series of state-action pairs leading up to a state-action pair immediately before a reward is obtained. The method includes a step in which the event list management unit classifies state-action pairs into the plurality of event lists and stores them, and a step in which the learning control unit updates the expected reward value of each state-action pair that is an element of an event list.

According to the learning system and learning method of the present invention, state-action pairs are classified into a plurality of event lists and stored, each event list being a set of a series of state-action pairs leading up to a state-action pair immediately before a reward is obtained. As a result, an environment model is created for each state-action pair immediately before a reward is obtained. The learning system and learning method of the present invention can therefore cope with changes in the environment and can manage acquired action sequences as units, that is, as event lists.

According to an embodiment of the present invention, the event list management unit temporarily holds a state-action pair each time an action is selected. Each time a reward is obtained, it takes those temporarily held state-action pairs that are not yet stored in the event list database and stores them in the event list database as elements of the event list of the state-action pair immediately before the reward was obtained.

According to this embodiment, state-action pairs can be efficiently classified into a plurality of event lists and stored.

According to an embodiment of the present invention, each time a reward is obtained, the learning control unit uses the value of the reward to update the expected reward values of the state-action pairs that are elements of the event list of the state-action pair immediately before the reward was obtained. It updates the expected reward values of the state-action pairs that are elements of all other event lists on the assumption that the reward value is zero.

According to this embodiment, the expected reward values of the state-action pairs that are elements of an event list can be updated efficiently, event list by event list.

FIG. 1 is a diagram showing the configuration of an apparatus including a learning system according to an embodiment of the present invention.
FIG. 2 is a diagram for explaining the data structure of the event list database.
FIG. 3 is a flowchart for explaining the operation of the event list management unit.
FIG. 4 is a flowchart for explaining the operation of the learning system.
FIG. 5 is a flowchart for explaining the operation of the action selection unit.
FIG. 6 is a diagram for explaining a Markov decision process (MDP), which is the first simulation environment.
FIG. 7 is a diagram for explaining a high order Markov decision process (HOMDP), which is the second simulation environment.
FIG. 8 is a diagram showing simulation results of the learning system according to the embodiment of the present invention and of a conventional learning system.

FIG. 1 is a diagram showing the configuration of an apparatus 200 that includes a learning system 100 according to an embodiment of the present invention. The apparatus 200 may be, for example, a robot. The apparatus 200 includes an information acquisition unit 201, an acquired information processing unit 203, an action selection unit 205, an action output unit 207, a supervisor 209, and the learning system 100.

The information acquisition unit 201 acquires input information from an environment 300 and acquires state information of the apparatus 200 itself. When the apparatus 200 is a robot, the information acquisition unit 201 may include a camera and acquire information on the environment 300 from images of the environment 300 captured by the camera. The information acquisition unit 201 may also acquire state information of the apparatus 200, including the position and orientation of the robot. The information acquisition unit 201 sends the acquired information to the acquired information processing unit 203.

Based on the state information of the environment and of the apparatus itself, the acquired information processing unit 203 classifies the state in which the apparatus 200 is placed into one of a plurality of predetermined states.

The learning system 100 stores the action that the apparatus 200 selects in its current state as a state-action pair and learns the expected reward value of the state-action pair according to the resulting reward. The reward is determined by the acquired information processing unit 203 based on the information acquired by the information acquisition unit 201. The learning system 100 includes an event list management unit 101, a temporary list storage unit 103, an event list database 105, and a learning control unit 107. The event list management unit 101 stores state-action pairs in the temporary list storage unit 103 and the event list database 105. The learning control unit 107 learns the expected reward value of each state-action pair according to the reward and stores it in the event list database 105 in association with the state-action pair. Details of the learning system 100 are described later.

The action selection unit 205 receives the state of the apparatus 200 from the acquired information processing unit 203. For that state, it selects with the highest probability the action whose expected reward value, stored in the event list database 105 in association with the state-action pair, is the largest.

The action output unit 207 outputs the action selected by the action selection unit 205. The resulting change in the environment 300 is acquired as information by the information acquisition unit 201.

The supervisor 209 teaches a series of actions by which a reward is obtained as quickly as possible from the state in which the apparatus 200 is placed. It is used to assist the learning of the learning system 100 in the early stage of learning.

A feature of the learning system 100 according to the embodiment of the present invention is that it classifies state-action pairs into sets, one for each state-action pair immediately before a reward is obtained, stores the state-action pairs set by set, and learns their expected reward values. Storing state-action pairs as sets keyed by the state-action pair immediately before a reward, and learning their expected reward values, corresponds to creating an environment model for each state-action pair immediately before a reward. The learning system 100 according to this embodiment can therefore cope with changes in the environment and can manage acquired action sequences as units. Details are described below.

Here, the expected reward value R can be expressed by the following equation.

In Equation (1), E[·|·] denotes a conditional expectation.

s_t denotes the state observed at time t. There are a plurality of states that can be observed, which can be written, for example, as s0, s1, ..., si, ..., sn. At time t, one of these is actually observed, and it is denoted s_t.

a_t denotes the action selected at time t. There are a plurality of actions that can be selected, which can be written, for example, as a0, a1, ..., ai, ..., an. At time t, one of these is actually selected, and it is denoted a_t.

r_{t+k} is the reward obtained at time t+k.

γ is a parameter called the discount rate.
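Equation (1) itself is not legible in this text. As a hedged reconstruction from the symbols defined above, and not the patent's verbatim formula, one form consistent with them is the discounted reward received when the episode ends k steps after time t. The range given for γ is likewise an assumption.

```latex
\begin{equation}
R(s_t, a_t) = E\left[\gamma^{\,k-1}\, r_{t+k} \,\middle|\, s_t, a_t\right],
\qquad 0 < \gamma \le 1 \tag{1}
\end{equation}
```

Under this reading, a reward received one step after (s_t, a_t) is not discounted (k = 1), and each additional step multiplies it by γ.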

Equation (1) can be transformed as follows.

p(k|...) is the probability that the episode reaches its end k steps after the current time. Here, an "episode" refers to one of the series of states that arise in sequence as a result of an action being selected in a certain state, and the "end" refers to the last state of that series. The last state is reached when a reward is obtained or when action selection is interrupted.

(S, A) denotes (s_{t+k−1}, a_{t+k−1}), the state-action pair immediately before the reward r_{t+k} is obtained. The state-action pair formed when the state s_t is observed at time t and the action a_t is taken is written (s_t, a_t).

E_{(S,A)}[·|·] denotes the part of the expected reward value attributed to the state-action pair (S, A) immediately before the reward is obtained, and is called a partial expected value.

Equation (2) shows that the expected reward value can be expressed as a sum of partial expected values. It also shows that all state-action pairs (si, aj) can be classified into a plurality of (S, A) groups. It is therefore possible, as described above, to classify state-action pairs into sets, one for each state-action pair immediately before a reward is obtained, to store the state-action pairs set by set, and to learn the expected reward values of those state-action pairs.
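Equation (2) is not legible in this text. The decomposition it describes can be sketched as follows. This is an assumed reconstruction from the definitions of p(k|...), (S, A), and the partial expected value, not the patent's verbatim equation, and it presupposes that R has the discounted form E[γ^{k−1} r_{t+k} | s_t, a_t]. Applying the law of total expectation over the step count k and the final pair (S, A) gives:

```latex
\begin{equation}
\begin{aligned}
R(s_t, a_t)
  &= \sum_{(S,A)} \sum_{k \ge 1} p\bigl(k, (S,A) \mid s_t, a_t\bigr)\,
     \gamma^{\,k-1}\, E\bigl[r_{t+k} \mid k, (S,A), s_t, a_t\bigr] \\
  &= \sum_{(S,A)} E_{(S,A)}\bigl[\gamma^{\,k-1}\, r_{t+k} \mid s_t, a_t\bigr]
\end{aligned} \tag{2}
\end{equation}
```

Each inner sum over k, taken for one fixed (S, A), plays the role of the partial expected value E_{(S,A)}, so every contribution to R belongs to exactly one (S, A) group.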

FIG. 2 is a diagram for explaining the data structure of the event list database 105. In FIG. 2, (S, A)n denotes a state-action pair immediately before a reward is obtained. S and A denote the state and the action immediately before the reward is obtained, and n is the index of that state-action pair. (S, A)n forms a set together with the series of state-action pairs leading up to (S, A)n. This set is called an event list. (si, aj) denotes a state-action pair contained in an event list; s and a denote a state and an action, and i and j are the indices of the state s and the action a, respectively. The expected reward value E[r]p of a state-action pair is stored in the event list in association with (si, aj), where r denotes the reward and p is the index of the expected reward value.

In this way, the event list database 105 is organized by the state-action pair 1051 immediately before a reward is obtained. An event list contains the state-action pair 1051 immediately before the reward, the series of state-action pairs 1053 leading up to the state-action pair 1051, and the expected reward value E[r]p associated with each state-action pair (si, aj) that is an element of the event list. The expected reward value E[r]p corresponds to the "partial expected value" described above.

A given state-action pair (si, aj) may be contained in the event lists of several state-action pairs (S, A) immediately before a reward. In that case, the expected reward value of the state-action pair (si, aj) is the sum of its expected reward values over the event lists of those state-action pairs (S, A).
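The data structure described above can be pictured with a minimal Python sketch. The nested-dictionary layout, the function name `total_expected_reward`, and the numeric values are illustrative assumptions, not part of the patent: each key of the outer dictionary is a pair (S, A) immediately before a reward, and each inner dictionary maps a member pair (si, aj) to its partial expected value.

```python
# Hypothetical layout: {(S, A): {(s_i, a_j): partial expected reward}}.
db = {
    ("s9", "a9"): {("s9", "a9"): 0.5, ("s8", "a8"): 0.25},
    ("s5", "a5"): {("s5", "a5"): 0.5, ("s8", "a8"): 0.125},
}


def total_expected_reward(db, pair):
    """Sum the partial expected values of `pair` over every event list containing it."""
    return sum(members[pair] for members in db.values() if pair in members)
```

Here the pair ("s8", "a8") appears in two event lists, so its expected reward value is the sum of the two stored partial values.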

FIG. 3 is a flowchart for explaining the operation of the event list management unit 101.

In step S105 of FIG. 3, the event list management unit 101 determines whether it has received a state-action pair (si, aj) from the action selection unit 205. The action selection unit 205 sends a state-action pair (si, aj) to the event list management unit 101 each time it selects an action. If a state-action pair (si, aj) has been received, the process proceeds to step S110; otherwise, it proceeds to step S115.

In step S110 of FIG. 3, the event list management unit 101 stores the state-action pair (si, aj) in the temporary list storage unit 103.

In step S115 of FIG. 3, the event list management unit 101 determines whether it has received a reward from the acquired information processing unit 203. The acquired information processing unit 203 determines the reward based on information acquired by the information acquisition unit 201 a predetermined time after the action output unit 207 outputs the action, and sends the reward to the event list management unit 101. If a reward has been received, the process proceeds to step S120. Otherwise, the process returns to step S105 after a predetermined time has elapsed.

In step S120 of FIG. 3, the event list management unit 101 takes the state-action pair (si, aj) stored last in the temporary list storage unit 103 as the state-action pair (S, A) immediately before the reward was obtained.

In step S125 of FIG. 3, the event list management unit 101 determines whether (S, A) exists in the event list database 105. If (S, A) exists, the process proceeds to step S135; otherwise, it proceeds to step S130.

In step S130 of FIG. 3, the event list management unit 101 stores (S, A) in the event list database 105.

In step S135 of FIG. 3, the event list management unit 101 determines, for each state-action pair (si, aj) stored in the temporary list storage unit 103, whether it is contained in the event list of (S, A) in the event list database 105. If it is contained in the event list of (S, A), the process proceeds to step S145; otherwise, it proceeds to step S140.

In step S140 of FIG. 3, the event list management unit 101 adds to the event list of (S, A) any state-action pair (si, aj) not yet contained in it. The number of state-action pairs added is capped at a predetermined number.

In step S145 of FIG. 3, the event list management unit 101 determines whether the processing of step S135 has been performed for all state-action pairs (si, aj) stored in the temporary list storage unit 103. If it has, the process proceeds to step S150; otherwise, it returns to step S135.

In step S150 of FIG. 3, the event list management unit 101 clears (erases) all state-action pairs (si, aj) stored in the temporary list storage unit 103.
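The flow of steps S105 to S150 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the class and method names are hypothetical, new entries start with an expected reward of 0.0, and when the cap on added pairs applies, the pairs closest to the reward are kept first (the text does not say which pairs the cap drops).

```python
class EventListManager:
    """Illustrative stand-in for event list management unit 101 (names are hypothetical)."""

    def __init__(self, max_added_per_reward=None):
        self.temporary = []   # stands in for temporary list storage unit 103
        self.db = {}          # stands in for event list database 105
        self.max_added = max_added_per_reward

    def on_action_selected(self, pair):
        # S105 -> S110: hold each selected state-action pair temporarily.
        self.temporary.append(pair)

    def on_reward(self):
        # S115 -> S120: the last held pair is (S, A), the pair just before the reward.
        if not self.temporary:
            return None
        terminal = self.temporary[-1]
        # S125 / S130: create the event list for (S, A) if it does not exist yet.
        event_list = self.db.setdefault(terminal, {terminal: 0.0})
        added = 0
        # S135 / S140 / S145: add held pairs that the event list lacks, up to the cap.
        # Iterating newest-first is an assumption of this sketch.
        for pair in reversed(self.temporary):
            if pair in event_list:
                continue
            if self.max_added is not None and added >= self.max_added:
                break
            event_list[pair] = 0.0
            added += 1
        # S150: clear the temporary list.
        self.temporary.clear()
        return terminal
```

A caller would invoke `on_action_selected` once per selected action and `on_reward` whenever a reward arrives.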

FIG. 4 is a flowchart for explaining the operation of the learning control unit 107 of the learning system 100.

In step S205 of FIG. 4, the learning control unit 107 determines whether it has received a reward or an end-of-episode notification from the acquired information processing unit 203. The acquired information processing unit 203 determines the reward based on information acquired by the information acquisition unit 201 a predetermined time after the action output unit 207 outputs the action, and sends the reward to the learning control unit 107. The acquired information processing unit 203 also sends an end-of-episode notification to the learning control unit 107 when the episode reaches its end for some reason. If a reward or an end-of-episode notification has been received, the process proceeds to step S210. Otherwise, the process returns to step S205 after a predetermined time has elapsed.

In step S210 of FIG. 4, the learning control unit 107 updates the expected reward values of the state-action pairs (si, aj) in the event list of the state-action pair (S, A) immediately before the most recent reward, using the following equation.

Here, α is a parameter called the learning constant and is a constant between 0 and 1.

Tv is given by the following equation.

Here, τ is the time at which the action aj was selected in the state si, that is, the time at which the state-action pair (si, aj) actually occurred.
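Equations (3) and (4) are not legible in this text. A reconstruction consistent with the surrounding description (a learning constant α, a target value Tv, and the time τ at which (si, aj) occurred) is shown below as an assumption, not as the patent's verbatim formulas. It takes t to be the time of the state-action pair (S, A) immediately before the reward and r_{t+1} to be that reward.

```latex
\begin{align}
E_{(S,A)}[r \mid (s_i, a_j)] &\leftarrow E_{(S,A)}[r \mid (s_i, a_j)]
  + \alpha \bigl( T_v - E_{(S,A)}[r \mid (s_i, a_j)] \bigr) \tag{3} \\
T_v &= \gamma^{\,t-\tau}\, r_{t+1} \tag{4}
\end{align}
```

With this form, the pair (S, A) itself (τ = t) receives the undiscounted reward as its target, and each earlier pair is discounted by one factor of the discount rate γ per step.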

In step S215 of FIG. 4, the learning control unit 107 updates, according to Equation (3), the expected reward values of the state-action pairs (si, aj) in the event lists of every (S, A) other than the state-action pair immediately before the most recent reward. In this case, the target value Tv is set to zero. When an end-of-episode notification has been received, the expected reward values of the state-action pairs (si, aj) in the event lists of all (S, A) are updated according to Equation (3) with the target value Tv set to zero.

In this way, the expected reward values are updated event list by event list, with the event lists grouped according to the state-action pair immediately before a reward is obtained.
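Steps S210 and S215 can be sketched in Python as follows. This is an illustration under stated assumptions rather than the patent's implementation: the function name and arguments are hypothetical, the target value is assumed to be the reward discounted by γ once per step between the pair's occurrence time τ and the time t of (S, A), and pairs in the rewarded event list that did not occur in the current episode are given a target of zero.

```python
def update_event_lists(db, terminal, reward, occurred_at, t, alpha=0.1, gamma=0.9):
    """Move every stored expected reward toward its target value.

    db:          {(S, A): {(s_i, a_j): partial expected reward}}
    terminal:    the pair (S, A) just before the reward, or None for an
                 end-of-episode notification without reward
    occurred_at: {(s_i, a_j): tau}, the time each pair occurred in this episode
    t:           the time at which the pair (S, A) occurred
    """
    for key, event_list in db.items():
        for pair in list(event_list):
            if key == terminal and pair in occurred_at:
                # S210: rewarded event list, assumed target gamma^(t - tau) * reward.
                target = (gamma ** (t - occurred_at[pair])) * reward
            else:
                # S215: all other event lists use a target of zero.
                target = 0.0
            event_list[pair] += alpha * (target - event_list[pair])
```

Calling the function with `terminal=None` corresponds to the end-of-episode case, in which every event list is updated toward zero.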

FIG. 5 is a flowchart for explaining the operation of the action selection unit 205.

In step S305 of FIG. 5, the action selection unit 205 receives the current state from the acquired information processing unit 203. The acquired information processing unit 203 may send the reward to the learning control unit 107 and send the state to the action selection unit 205 only after confirming that the learning control unit 107 has updated the expected reward values.

In step S310 of FIG. 5, the action selection unit 205 picks from the event list database 105 the state-action pairs that have the current state and, among them, selects the state-action pair with the largest expected reward value. As described above, when a state-action pair that has the current state is contained in several event lists, the sum of its expected reward values over those event lists is taken as the expected reward value of that state-action pair.

In step S315 of FIG. 5, the action selection unit 205 sends the action of the selected state-action pair to the action output unit 207 and sends the selected state-action pair to the event list management unit 101.
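The selection in step S310 can be sketched in Python as follows. The function name and the dictionary layout are hypothetical, and the sketch is purely greedy: it always returns the action with the largest summed value, whereas the description says only that this action is chosen with the highest probability, so a stochastic rule would be needed to match it fully.

```python
def select_action(db, state, actions):
    """Return the action whose summed partial expected reward is largest in `state`.

    db: {(S, A): {(s_i, a_j): partial expected reward}}
    Pairs absent from every event list count as 0; ties go to the earlier action.
    """
    def expected(action):
        pair = (state, action)
        return sum(members[pair] for members in db.values() if pair in members)

    return max(actions, key=expected)
```

For example, an action stored with 0.25 in one event list and 0.5 in another (0.75 in total) is preferred over an action stored once with 0.5.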

Next, a simulation experiment for confirming the function of the learning system 100 according to the embodiment of the present invention is described. A first simulation environment and a second simulation environment are prepared for the experiment.

FIG. 6 is a diagram for explaining a Markov decision process (MDP), which is the first simulation environment. Ten actions, a0, a1, ..., a9, can be selected. If, after the state s0 is observed, the actions are selected in order from a0 to a9, a reward r = 1 is given. Each transition is probabilistic, however. The transition from s0 to s1 occurs with probability 0.3, and every other transition occurs with probability 0.9. In addition, two signals can be observed for each state. For s0, for example, there are signals O_00 and O_01, and each is observed with probability 0.5. The number of possible combinations of observation signals up to the reward is therefore 2^10 = 1024.
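A toy version of this environment can be sketched in Python as follows. Several details are assumptions of the sketch, since the text does not state them: the class name `ChainMDP`, staying in the same state after a wrong action or a failed transition, and restarting at s0 after the reward.

```python
import random


class ChainMDP:
    """Toy chain environment: action a_i advances state s_i with some probability."""

    N = 10  # states s0..s9 and actions a0..a9

    def __init__(self, rng=None):
        self.rng = rng or random.Random()
        self.state = 0

    def observe(self):
        # Each state emits one of two signals (e.g. O_00 or O_01) with probability 0.5.
        return (self.state, self.rng.randrange(2))

    def step(self, action):
        """Apply an action index 0..9 and return the reward (0 or 1)."""
        if action != self.state:
            return 0  # assumed: a wrong action leaves the state unchanged
        p = 0.3 if self.state == 0 else 0.9
        if self.rng.random() >= p:
            return 0  # assumed: a failed transition leaves the state unchanged
        if self.state == self.N - 1:
            self.state = 0  # assumed: restart at s0 after the reward
            return 1
        self.state += 1
        return 0
```

With two possible signals in each of the ten states, the number of observation sequences along a rewarded path is `2 ** ChainMDP.N`, which matches the 1024 combinations stated above.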

FIG. 7 is a diagram for explaining a high order Markov decision process (HOMDP), which is the second simulation environment. Ten actions, a0, a1, ..., a9, can be selected, of which six, a0, a1, ..., a5, are relevant to the reward. The process includes a process A and a process B. After a reward is obtained in process A, process B must be selected to obtain the next reward, and after a reward is obtained in process B, process A must be selected to obtain the next reward. In other words, different actions must be selected in process A and process B for the same observation signal. Here too, each transition is probabilistic. In process A, the transition from s0 to s2 occurs with probability 0.3, and every other transition occurs with probability 0.9. In process B, the transition from s1 to s2 occurs with probability 0.3, and every other transition occurs with probability 0.9. In addition, two signals can be observed for each state. For s0, for example, there are signals O_00 and O_01, and each is observed with probability 0.5.

The procedure of the simulation experiment using the above simulation environments is as follows. First, the environment is set to the HOMDP simulation environment, and during the first 10 trials the supervisor 209 teaches the action selection unit 205 a series of actions by which a reward is obtained as quickly as possible. The learning system 100 learns during this period, but it cannot learn all action patterns in that time.

Next, from the 251st trial, the environment is set to the MDP simulation environment, and up to the 260th trial the supervisor 209 teaches the action selection unit 205 a series of actions by which a reward is obtained as quickly as possible. The learning system 100 learns during this period, but it cannot learn all action patterns in that time.

Next, from the 501st trial, the environment is switched back to the HOMDP simulation environment. The supervisor 209 gives no teaching. The learning system 100 therefore has to cope with a suddenly changed environment.

Next, from the 751st trial, the environment is switched back to the MDP simulation environment. The supervisor 209 gives no teaching. The learning system 100 therefore has to cope with a suddenly changed environment.
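The experimental schedule described above can be summarized as a function of the trial number. The sketch below is illustrative only: the function name and return format are assumptions, and the upper bound of 1000 trials is taken from the description of FIG. 8.

```python
def schedule(trial):
    """Return (environment, supervised) for a 1-based trial number from 1 to 1000."""
    if not 1 <= trial <= 1000:
        raise ValueError("trial must be in 1..1000")
    block = (trial - 1) // 250  # 0, 1, 2, 3: four blocks of 250 trials
    environment = "HOMDP" if block % 2 == 0 else "MDP"
    # The supervisor teaches only during the first 10 trials of the first two blocks.
    supervised = block < 2 and (trial - 1) % 250 < 10
    return environment, supervised
```

For example, `schedule(255)` gives `("MDP", True)`, while `schedule(505)` gives `("HOMDP", False)` because no teaching accompanies the third and fourth blocks.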

FIG. 8 is a diagram showing simulation results for the learning system according to the embodiment of the present invention and for a conventional learning system. The horizontal axis of the graph in FIG. 8 indicates the trial number. There are 1000 trials in total; as described above, 250 trials are run in each environment, with the simulation environment changed in the order HOMDP, MDP, HOMDP, MDP. The vertical axis of the graph in FIG. 8 indicates the average number of steps taken to reach the reward. The average is taken over 2000 sets, where one set consists of the 1000 trials described above. Here, a step refers to the selection of an action; that is, the number of steps is the number of actions selected. Each trial starts from the final state of the preceding trial and ends when the learning system obtains a reward or when the number of steps reaches 100 without a reward being obtained.
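The trial termination rule and the averaging over sets described above might be sketched as follows. This is a minimal illustration, not the patent's implementation: `step` is a hypothetical callable standing in for one action selection plus the environment's response, and treating any positive reward as "reward obtained" is an assumption.

```python
MAX_STEPS = 100  # a trial is cut off here if no reward has been obtained


def run_trial(step, state):
    """Run one trial and return (steps_taken, final_state).

    `step` is a hypothetical callable: state -> (next_state, reward).
    The returned final state is where the next trial starts.
    """
    for steps in range(1, MAX_STEPS + 1):
        state, reward = step(state)
        if reward > 0:  # assumption: a positive value signals a reward
            break
    return steps, state


def average_steps(step_counts_per_set):
    """Average step counts trial by trial across sets (rows = sets, columns = trials)."""
    n_sets = len(step_counts_per_set)
    return [sum(column) / n_sets for column in zip(*step_counts_per_set)]
```

In the experiment described here, `step_counts_per_set` would have 2000 rows of 1000 entries each, and the resulting list of 1000 averages would be plotted against the trial number.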

In the graph of FIG. 8, the thick line represents the learning system of the present invention and is labeled "present invention". The thin line represents a learning system using the conventional SARSA (State-Action-Reward-State-Action) learning rule with eligibility traces and is labeled "conventional example". The eligibility trace parameter λ is 0.7. The straight line indicates the minimum number of steps and is labeled "ideal value".

In the learning system of the present invention, the learning constant α in Equation (3) was set to 0.05 and the discount rate γ in Equation (1) was set to 0.95. In the learning system of the conventional example, using the same values degraded performance, so the learning constant α was set to 0.1 and the discount rate γ was set to 0.9.
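For reference, the conventional baseline can be illustrated with a textbook tabular SARSA(λ) update using the parameter values reported above (α = 0.1, γ = 0.9, λ = 0.7). This is a generic sketch, not the code used in the experiment: the use of accumulating traces, the dictionary-based tables, and the function name are assumptions.

```python
from collections import defaultdict

ALPHA, GAMMA, LAMBDA = 0.1, 0.9, 0.7  # values reported for the conventional example


def sarsa_lambda_update(q, traces, s, a, reward, s_next, a_next, terminal=False):
    """Apply one tabular SARSA(lambda) update in place.

    q and traces map (state, action) pairs to floats (defaultdict(float)).
    Accumulating traces are an assumption; the text does not specify the variant.
    """
    target = reward if terminal else reward + GAMMA * q[(s_next, a_next)]
    delta = target - q[(s, a)]          # temporal-difference error
    traces[(s, a)] += 1.0               # mark the pair just visited
    for key in list(traces):
        q[key] += ALPHA * delta * traces[key]
        traces[key] *= GAMMA * LAMBDA   # decay every trace
```

Because a single value table is shared across environments, values learned before an environment change persist after it, which is consistent with the interference the results below attribute to the conventional example.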

As the graph of FIG. 8 shows, in the conventional example the final average number of steps in the second HOMDP period is about 45, which is larger than the final average in the first HOMDP period (about 35). Likewise, the final average in the second MDP period is about 40, which is larger than the final average in the first MDP period (about 35). In the present invention, by contrast, the final averages in the first HOMDP, first MDP, second HOMDP, and second MDP periods are all about 30.

In the conventional example, what was learned in the environment before a change interferes with learning in the environment after the change, and learning slows down. In the present invention, by contrast, learning does not slow down when the environment changes. In addition, the average number of steps in each environment is smaller for the present invention than for the conventional example, so learning within each environment is also better than in the conventional example.

Thus, the learning system of the present invention outperforms a learning system using the conventional SARSA (State-Action-Reward-State-Action) learning rule both in learning that adapts to changes in the environment and in learning within a single environment. Moreover, because the learning system of the present invention does not use an environment model with a complicated structure, its computational cost can be kept low.

DESCRIPTION OF SYMBOLS: 100 ... learning system, 101 ... event list management unit, 103 ... temporary list storage, 105 ... event list database, 107 ... learning control unit

Claims

A learning system comprising:
an event list database that holds a plurality of event lists, each event list being the set of state-action pairs that lead to a state-action pair immediately preceding the acquisition of a reward;
an event list management unit that classifies state-action pairs into the plurality of event lists and stores them; and
a learning control unit that updates the expected reward value R(S_t, a_t) of each state-action pair that is an element of each event list, in accordance with Equation 1.

Here, S_t is the state observed at time t, a_t is the action selected at time t, E[·|·] is a conditional expectation, r_{t+k} is the reward obtained at time t+k, and γ is the discount rate.

The learning system according to claim 1, further comprising a temporary list storage unit, wherein the event list management unit stores a state-action pair in the temporary list storage unit each time an action is selected and, each time a reward is obtained, stores those state-action pairs held in the temporary list storage unit that are not yet stored in the event list database into the event list database as elements of the event list of the state-action pair immediately preceding the acquisition of the reward.

The learning system according to claim 2, wherein, each time a reward is obtained, the learning control unit updates, using the value of the reward, the expected reward values of the state-action pairs that are elements of the event list of the state-action pair immediately preceding the acquisition of the reward, and updates, with the reward value taken to be zero, the expected reward values of the state-action pairs that are elements of the event lists other than that event list.

A learning method performed by a learning system comprising an event list database that holds a plurality of event lists, each event list being the set of state-action pairs that lead to a state-action pair immediately preceding the acquisition of a reward, an event list management unit, and a learning control unit, the method comprising:
a step in which the event list management unit classifies state-action pairs into the plurality of event lists and stores them; and
a step in which the learning control unit updates the expected reward value R(S_t, a_t) of each state-action pair that is an element of each event list, in accordance with Equation 1.

The learning method according to claim 4, wherein the event list management unit temporarily holds a state-action pair each time an action is selected and, each time a reward is obtained, stores those temporarily held state-action pairs that are not yet stored in the event list database into the event list database as elements of the event list of the state-action pair immediately preceding the acquisition of the reward.

The learning method according to claim 5, wherein, each time a reward is obtained, the learning control unit updates, using the value of the reward, the expected reward values of the state-action pairs that are elements of the event list of the state-action pair immediately preceding the acquisition of the reward, and updates, with the reward value taken to be zero, the expected reward values of the state-action pairs that are elements of the event lists other than that event list.
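The bookkeeping recited in the claims above (temporary storage of selected pairs, classification into event lists when a reward arrives, and the reward-or-zero update) might be sketched in Python as follows. This is an illustrative reading, not the claimed implementation: the class and method names are hypothetical, "not yet stored in the event list database" is read literally as "not present in any event list", and the update is a plain exponential moving average with the learning constant α = 0.05 from the description. The discounting of Equation 1 is omitted because that equation is not reproduced in this text.

```python
from collections import defaultdict

ALPHA = 0.05  # learning constant reported in the description


class EventListLearner:
    """Hypothetical sketch of the event list database and its update."""

    def __init__(self):
        self.event_lists = defaultdict(set)  # final pair -> pairs leading to it
        self.expected = defaultdict(float)   # pair -> expected reward value
        self.temporary = []                  # pairs selected since the last reward

    def on_action(self, state, action):
        # Each selected state-action pair goes into temporary storage.
        self.temporary.append((state, action))

    def on_reward(self, reward):
        if not self.temporary:
            return
        final_pair = self.temporary[-1]  # pair immediately preceding the reward
        stored = set().union(*self.event_lists.values())
        for pair in self.temporary:
            if pair not in stored:  # assumption: "stored" means in any event list
                self.event_lists[final_pair].add(pair)
                stored.add(pair)
        # Elements of the rewarded list move toward the reward; all others toward zero.
        for key, members in self.event_lists.items():
            target = reward if key == final_pair else 0.0
            for pair in members:
                self.expected[pair] += ALPHA * (target - self.expected[pair])
        self.temporary.clear()
```

Under this reading, each pair belongs to exactly one event list, so a reward reached through one final pair lowers the expected values held in every other list. That is one way separate lists could keep behavior learned in different environments from overwriting each other.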