JP5528214B2 - Learning control system and learning control method

Info

Publication number: JP5528214B2
Application number: JP2010122796A
Authority: JP
Inventors: 誉羽竹内; 広司辻野
Original assignee: Honda Motor Co Ltd
Current assignee: Honda Motor Co Ltd
Priority date: 2010-05-28
Filing date: 2010-05-28
Publication date: 2014-06-25
Anticipated expiration: 2030-05-28
Also published as: JP2011248728A

Description

The present invention relates to a learning system and a learning method that use reinforcement learning.

Reinforcement learning is known as a learning method by which a machine such as a robot improves its own control rules through learning (see, for example, Non-Patent Document 1). There is also a method called supervised learning, in which the machine learns from explicit teaching by others. Combining the two can be expected to let, for example, a robot remember what a person has taught it and then, through trial and error, skillfully apply what it was taught according to the situation. However, no learning control system or learning control method has been developed that can both efficiently learn teachings from others and learn while combining the taught content with trial and error.

N. D. Daw & K. Doya, "The computational neurobiology of learning and reward", Current Opinion in Neurobiology, 2006, 16, pp. 199-204

Accordingly, there is a need for a learning control system and a learning control method that can efficiently learn teachings from others and that can learn while combining the taught content with trial and error.

A learning control system according to one aspect of the present invention comprises: an event list database that holds a plurality of event lists, each event list being a set of a series of state-action pairs leading up to the state-action pair immediately before a reward was obtained and the state when the reward was obtained; an event list management unit that classifies state-action pairs into the plurality of event lists and stores them in the event list database; an event list learning control unit that updates the reward expectation value of each state-action pair that is an element of each event list; an action planning unit that obtains a first action value function using the event lists of the event list database; a reinforcement learning unit that obtains a second action value function based on reinforcement learning; and an action selection unit that selects an action based on the first action value function received from the action planning unit and the second action value function received from the reinforcement learning unit.

With the learning control system of this aspect, the event list learning control unit updates the reward expectation values of state-action pairs separately for each event list, where the event lists are classified by the state-action pair immediately before a reward was obtained and the state when the reward was obtained. Teachings from others can therefore be learned efficiently. In addition to those teachings, the results of the trial-and-error learning performed by the reinforcement learning unit are also reflected in the event lists, so the system can learn while combining the taught content with trial and error.

In a learning control system according to one embodiment of the present invention, when a target state is given, the action planning unit uses the event lists of the event list database to search for a path from the current state to the target state and, if the path search succeeds, obtains an action value function based on the result of the path search.

In this embodiment, when a target state is given, the action planning unit performs the path search using the event lists of the event list database, so the action value function can be obtained more efficiently. The taught content can therefore be combined with trial and error more efficiently.

In a learning control system according to one embodiment of the present invention, the event list learning control unit is configured to express the reward expectation value of each state-action pair that is an element of an event list as a sum of products of a partial reward expectation value, which is the expected value of the reward, and a partial distance expectation value, which is the expected value of the distance to the state in which the reward is obtained, and to update the partial reward expectation value and the partial distance expectation value separately.

According to this embodiment, the partial reward expectation value and the partial distance expectation value are updated separately, so learning can be performed more efficiently.

A learning control system according to one embodiment of the present invention is configured to store, in the event list database, a simple moving average used to obtain the partial reward expectation value and a simple moving average used to obtain the partial distance expectation value.

According to this embodiment, simple moving averages are used to obtain the partial reward expectation value and the partial distance expectation value, so learning can be performed efficiently at low computational cost.

A learning control method according to one aspect of the present invention is a method of learning and selecting actions by means of a learning control system comprising an event list database that holds a plurality of event lists, each event list being a set of a series of state-action pairs leading up to the state-action pair immediately before a reward was obtained and the state when the reward was obtained, an event list management unit, an event list learning control unit, an action planning unit, and a reinforcement learning unit. The method includes a step in which the event list management unit classifies state-action pairs into the plurality of event lists and stores them in the event list database, and a step in which the event list learning control unit updates the reward expectation value of each state-action pair that is an element of each event list. The method further includes a step in which the action planning unit obtains a first action value function using the event lists of the event list database, a step in which the reinforcement learning unit obtains a second action value function based on reinforcement learning, and a step in which the action selection unit selects an action based on the first action value function received from the action planning unit and the second action value function received from the reinforcement learning unit.

With the learning control method of this aspect, the event list learning control unit updates the reward expectation values of state-action pairs separately for each event list, where the event lists are classified by the state-action pair immediately before a reward was obtained and the state when the reward was obtained. Teachings from others can therefore be learned efficiently. In addition to those teachings, the results of the trial-and-error learning performed by the reinforcement learning unit are also reflected in the event lists, so learning can proceed while combining the taught content with trial and error.

A diagram showing the configuration of an apparatus including a learning control system according to one embodiment of the present invention.
A diagram for explaining the data structure of the event list database.
A flowchart for explaining the operation of the event list management unit.
A flowchart for explaining the operation of the event list learning control unit.
A diagram for explaining the detailed operation of step S2030 of FIG. 4.
A diagram for explaining the detailed operation of step S2035 of FIG. 4.
A flowchart for explaining the operation of the action planning unit.
A diagram for explaining a method of searching for a path from an initial state to a target state using the information in the event list database.
A flowchart for explaining the operation of the action selection unit.
A diagram for explaining the procedure of the simulation experiments.
A diagram showing the result of the first simulation.
A diagram showing the result of the second simulation.
A diagram showing the result of the third simulation.
A diagram showing the result of the fourth simulation.
A diagram showing the result of the fifth simulation.
A diagram for explaining the High Order Markov Decision Process (HOMDP), which is the environment of the fifth simulation.

FIG. 1 is a diagram showing the configuration of an apparatus 200 that includes a learning control system 150 according to one embodiment of the present invention. The apparatus 200 may be, for example, a robot. The apparatus 200 includes an information acquisition unit 201, an action output unit 203, a teaching acquisition unit 205, a goal acquisition unit 207, and the learning control system 150.

The information acquisition unit 201 acquires input information from an environment 300 and also acquires state information of the apparatus 200 itself. When the apparatus 200 is a robot, the information acquisition unit 201 may include a camera and acquire information on the environment 300 from images of the environment 300 captured by the camera. The information acquisition unit 201 may also acquire state information of the apparatus 200, including the position and orientation of the robot. The information acquisition unit 201 sends the acquired information to the learning control system 150.

The action output unit 203 outputs the action selected by the learning control system 150. Changes in the environment 300 that result from the action are acquired as information by the information acquisition unit 201.

The teaching acquisition unit 205 acquires, from a user or the like, a teaching of a series of actions by which the reward described later is obtained most quickly from the state in which the apparatus 200 is placed, and sends the teaching to the learning control system 150. Teachings are used to assist the learning of the learning control system 150 in the early stages of learning.

The goal acquisition unit 207 receives, from a user or the like, a goal that the apparatus 200 should achieve, and sends the goal to the learning control system 150.

The learning control system 150 includes an acquired information processing unit 109, a reinforcement learning unit 111, an action selection unit 113, and an event list learning control system 100.

The acquired information processing unit 109 processes the information received from the information acquisition unit 201 and determines the "state" of the apparatus 200. It also determines the "reward", which is an evaluation of the result of an "action" of the apparatus 200.

The event list learning control system 100 and the reinforcement learning unit 111 each determine an action value function, which evaluates the value of an action, based on a series of state, action, and reward information. Details of the event list learning control system 100 are described later. The reinforcement learning unit 111 is a conventional reinforcement learning system and may be, for example, a system that uses the SARSA (State-Action-Reward-State-Action) algorithm. The SARSA algorithm is described in detail in, for example, R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, MIT Press.
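
As an illustration of the kind of update the reinforcement learning unit 111 could perform, the following is a minimal sketch of the tabular SARSA rule from the cited textbook. The function name `sarsa_update`, the learning rate `alpha`, the default parameter values, and the state and action labels are assumptions introduced for this example and do not appear in the description.

```python
from collections import defaultdict


def sarsa_update(q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """Apply one tabular SARSA update to q in place and return the new Q(s, a)."""
    # Target uses the action actually chosen in the next state (on-policy).
    td_target = r + gamma * q[(s_next, a_next)]
    q[(s, a)] += alpha * (td_target - q[(s, a)])
    return q[(s, a)]


# Hypothetical usage: unseen state-action pairs start at 0.0.
q = defaultdict(float)
sarsa_update(q, s="s0", a="right", r=1.0, s_next="s1", a_next="right")
```

Because the target uses the next action that was actually selected, the learned values reflect the policy being followed, which is what distinguishes SARSA from off-policy rules.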

The action selection unit 113 selects the action of the apparatus based on the first action value function received from the event list learning control system 100 and the second action value function received from the reinforcement learning unit 111. When an action is taught via the teaching acquisition unit 205, the action selection unit 113 selects the taught action.

The basic concept of the event list learning control system 100 is now described.

First, the action value function is described. Let S be the space of observed states (the state space) and A the space of available actions (the action space). |S| is the number of elements in the state space, and |A| is the number of elements in the action space. An element of the state space is denoted by s and an element of the action space by a. Then s_t is the element of the state space observed at time t, and a_t is defined likewise. The action value function is the expected value of the rewards r obtained from the current time t into the future when the state s_t is observed at the current time t and the action a_t is taken, and is expressed as follows.

[Equation (1)]

Here, γ is a constant called the discount rate, and E[] denotes an expected value.
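
For orientation, the description above matches the standard discounted-return definition of an action value function. A sketch of that standard form is given below. The reward indexing (starting from r_{t+1}) is an assumption chosen to agree with the tuple (s_{t+k-1}, a_{t+k-1}, s_{t+k}) used later in this description, and the patent's own Equation (1) may be written differently.

```latex
Q(s_t, a_t) = E\left[ \sum_{k=1}^{\infty} \gamma^{k-1} \, r_{t+k} \,\middle|\, s_t, a_t \right]
```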

Equation (1) can be expanded as follows.

[Equation (2)]

Here, the quantity

[Equation (3)]

is called the reward expectation value.

Equation (3) can be transformed as follows under the common assumption of a Markov decision process.

[Equation (4)]

Here, Pr(|) denotes a conditional probability. Pr(k|) is the probability that the series of states reaches its terminal k steps after the current time. The "terminal" is the last state of the series (in this case, the state in which a reward was obtained). Pr(r_{t+k}|) is the probability of obtaining the reward r at t+k.

(S,A,S') represents (s_{t+k-1}, a_{t+k-1}, s_{t+k}), a data tuple consisting of the state-action pair (s_{t+k-1}, a_{t+k-1}) immediately before the reward r_{t+k} is obtained and the state s_{t+k} when the reward r_{t+k} is obtained. Here, a state-action pair is the pair of a state and the action that the apparatus 200 selected in that state.

From the definition, the following equation holds.

[Equation (5)]

The left-hand side of Equation (5) is called the partial distance expectation value. "Partial" means that the expectation is restricted to the case of (S,A,S'). "Distance" means that it is an expected value of the distance from the current state to the state in which the reward is obtained.

Also from the definition, the following equation holds.

[Equation (6)]

The left-hand side of Equation (6) is called the partial reward expectation value. "Partial" means that the expectation is restricted to the case of (S,A,S'). "Reward" means that it is an expected value of the reward value.

Equation (4) shows that the reward expectation value can be expressed as a sum of products of the partial distance expectation value given by Equation (5) and the partial reward expectation value given by Equation (6). It also shows that a series of state-action pairs terminating in a state in which a reward is obtained can be classified into "parts", that is, groups (sets) for each (S,A,S'). Accordingly, a series of state-action pairs can be classified into sets for each (S,A,S'), the state-action pairs can be stored set by set, and the reward expectation value of each state-action pair can be learned separately for each partial distance expectation value and each partial reward expectation value.
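
The sum-of-products structure described in this paragraph can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the dictionary layout, the function name `reward_expectation`, and all labels and numeric values are hypothetical, and the per-list `"distance"` and `"reward"` entries stand in for the partial distance and partial reward expectation values.

```python
def reward_expectation(event_lists, s, a):
    """Sum, over the (S, A, S') groups containing (s, a), of distance term times reward term."""
    total = 0.0
    for ev in event_lists:
        distance_term = ev["distance"].get((s, a))
        if distance_term is None:  # (s, a) was never seen on the way to this reward
            continue
        total += distance_term * ev["reward"]
    return total


# Hypothetical data: two groups, each keyed by its (S, A, S') tuple.
event_lists = [
    {"sas": ("s2", "right", "goal"), "reward": 1.0,
     "distance": {("s0", "right"): 0.81, ("s1", "right"): 0.9, ("s2", "right"): 1.0}},
    {"sas": ("s1", "down", "pit"), "reward": -1.0,
     "distance": {("s0", "right"): 0.45, ("s1", "down"): 1.0}},
]

value = reward_expectation(event_lists, "s0", "right")
```

Keeping the two factors separate means a change in the size of a reward only touches one number per group, while a change in how far away the reward lies only touches the per-pair distance terms.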

To compute the partial distance expectation value given by Equation (5) and the partial reward expectation value given by Equation (6), a simple moving average, that is, the mean of the most recent m data, may be used. The simple moving average is expressed by the following equation.

[Equation (7)]

Here, m is preferably a relatively small value, such as 10, so that changes in the environment 300 are followed quickly. In the initial stage, when the number of data is m or less, the ordinary mean is computed.
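
The windowed averaging described above, including the fallback to an ordinary mean while fewer than m samples exist, can be sketched as below. The class name `SimpleMovingAverage` is an assumption introduced for this example, and returning 0.0 before any sample has arrived is also an assumption, since the description does not state an initial value for the averages.

```python
from collections import deque


class SimpleMovingAverage:
    """Mean of the most recent m samples; ordinary mean until m samples exist."""

    def __init__(self, m=10):
        # A bounded deque discards the oldest sample automatically.
        self.window = deque(maxlen=m)

    def update(self, x):
        """Record a new sample and return the updated average."""
        self.window.append(x)
        return self.value

    @property
    def value(self):
        if not self.window:
            return 0.0  # assumption: no data yet
        return sum(self.window) / len(self.window)


# Hypothetical usage with a window of 3 samples.
sma = SimpleMovingAverage(m=3)
history = [sma.update(x) for x in [1.0, 2.0, 3.0, 4.0]]
```

A small window makes the estimate forget old samples quickly, which is the stated reason for preferring a value such as 10.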

For the partial distance expectation value given by Equation (5), in the case of

[condition]

the following holds.

[equation]

Otherwise, the following holds.

[equation]

For the partial reward expectation value given by Equation (6), in the case of

[condition]

the following holds.

[equation]

Otherwise, the following holds.

[equation]

The simple moving average for the partial distance expectation value is denoted ma_SAS'[γ|s,a], and the simple moving average for the partial reward expectation value is denoted ma[r|SAS'].

Next, the configuration of the event list learning control system 100 is described. As shown in FIG. 1, the event list learning control system 100 includes an event list management unit 101, a temporary list storage unit 103, an event list database 105, an event list learning control unit 107, and an action planning unit 108.

The event list management unit 101 stores a series of state-action pairs in the temporary list storage unit 103 and, when a reward is received, classifies the series of state-action pairs into sets for each (S,A,S') and stores them set by set in the event list database 105. The event list database 105 also stores, for each (S,A,S') set, the simple moving average ma[r|SAS'] for the partial reward expectation value and the simple moving average ma_SAS'[γ|s,a] for the partial distance expectation value.

Each time it receives a new state-action pair, the event list learning control system 100 updates (learns) the simple moving average ma[r|SAS'] for the partial reward expectation value and the simple moving average for the partial distance expectation value of each (S,A,S') set.

When a target state is given by the goal acquisition unit 207, the action planning unit 108 uses the data of the event list database 105 to search for a path from the current state to the target state and, if the path search succeeds, obtains an action value function based on the result of the path search. Otherwise, it obtains the action value function using the data of the event list database 105.

Each component of the event list learning control system 100 is described in detail below.

FIG. 2 is a diagram for explaining the data structure of the event list database 105. In FIG. 2, (S,A,S')_n denotes a state-action pair immediately before a reward is obtained together with the state when the reward is obtained. (S,A,S')_n forms a set together with the series of state-action pairs leading up to (S,A,S')_n. This set is called an event list. (s_i,a_j) denotes a state-action pair included in the event list. Note that (S,A) is also included in this set. The simple moving average ma_SAS'[γ|s,a] of the partial distance expectation value is held in the event list in association with each (s_i,a_j). An auxiliary variable e[s_i,a_j], used to update this simple moving average of the partial distance expectation value, is also held. This auxiliary variable is described later. In addition, the simple moving average ma[r|(S,A,S')_n] of the corresponding partial reward expectation value is held in association with each (S,A,S')_n.

In this way, the data of the event list database 105 is classified into data for each (S,A,S'), that is, for each combination of the state-action pair immediately before a reward is obtained and the state when the reward is obtained. In other words, the data is classified into event lists.
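
The data structure described for FIG. 2 can be sketched as follows. This is one possible layout under stated assumptions, not the structure used in the patent: the class name `EventList`, the field names, the dictionary-based database, and the state and action labels are hypothetical. Initializing the auxiliary variables to zero follows the description, while initializing the moving averages to zero is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class EventList:
    sas: tuple                     # key (S, A, S') identifying this event list
    reward_ma: float = 0.0         # stands in for ma[r | (S,A,S')_n]
    distance_ma: dict = field(default_factory=dict)  # (s_i, a_j) -> ma_SAS'[gamma | s, a]
    aux: dict = field(default_factory=dict)          # (s_i, a_j) -> e[s_i, a_j]

    def add_pair(self, s, a):
        """Register a state-action pair without overwriting existing values."""
        self.distance_ma.setdefault((s, a), 0.0)
        self.aux.setdefault((s, a), 0.0)


# Hypothetical usage: the database maps each (S, A, S') tuple to its event list.
database = {}
key = ("s2", "right", "goal")
database[key] = EventList(sas=key)
database[key].add_pair("s2", "right")  # (S, A) itself belongs to the list
database[key].add_pair("s1", "right")
```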

FIG. 3 is a flowchart for explaining the operation of the event list management unit 101.

In step S1005 of FIG. 3, the event list management unit 101 determines whether it has received a state-action pair (s,a) from the action selection unit 113. The action selection unit 113 sends a state-action pair (s,a) to the event list management unit 101 each time it selects an action. If a state-action pair (s,a) has been received, the process proceeds to step S1010. If not, the event list management unit 101 waits.

In step S1010 of FIG. 3, the event list management unit 101 stores the state-action pair (s,a) in the temporary list storage unit 103.

In step S1015 of FIG. 3, the event list management unit 101 determines whether it has received, from the acquired information processing unit 109, a reward and the state s' when the reward was obtained. The acquired information processing unit 109 determines the reward based on information acquired by the information acquisition unit 201 after a predetermined time has elapsed since the action output unit 203 output the action, and sends the reward to the event list management unit 101. If a reward has been received, the process proceeds to step S1020. If not, the process returns to step S1005 after a predetermined time has elapsed.

In step S1020 of FIG. 3, the event list management unit 101 takes the state-action pair (s,a) stored last in the temporary list storage unit 103 as the state-action pair (S,A) immediately before the reward was obtained, takes the state s' when the reward was obtained as S', and generates the tuple (S,A,S') of the state-action pair immediately before the reward and the state when the reward was obtained.

In step S1025 of FIG. 3, the event list management unit 101 determines whether (S,A,S') exists in the event list database 105. If (S,A,S') exists, the process proceeds to step S1035. If not, the process proceeds to step S1030.

In step S1030 of FIG. 3, the event list management unit 101 stores (S,A,S') in the event list database 105.

In step S1035 of FIG. 3, the event list management unit 101 determines, for each state-action pair (s,a) stored in the temporary list storage unit 103, whether it is included in the event list of (S,A,S') in the event list database 105. If it is included, the process proceeds to step S1045. If not, the process proceeds to step S1040.

In step S1040 of FIG. 3, the event list management unit 101 adds each state-action pair (s,a) that is not included in the event list of (S,A,S') to that event list. The number of state-action pairs added is capped at a predetermined upper limit.

In step S1045 of FIG. 3, the event list management unit 101 determines whether the processing of step S1035 has been performed for all state-action pairs (s,a) stored in the temporary list storage unit 103. If so, the process proceeds to step S1050. If not, the process returns to step S1035.

In step S1050 of FIG. 3, the event list management unit 101 clears (erases) all state-action pairs (s,a) stored in the temporary list storage unit 103.
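
Steps S1020 to S1050 above can be sketched as a single function that runs when a reward arrives. This is a simplified illustration, not the patented implementation: the function name `on_reward`, the use of a plain list as the temporary list and a dict of sets as the database, and the default `max_new_pairs=50` are assumptions. Applying the upper limit per reward event is one possible reading of the cap described in step S1040.

```python
def on_reward(database, temp_list, s_next, max_new_pairs=50):
    """File the buffered state-action pairs under (S, A, S') and clear the buffer."""
    if not temp_list:
        return None
    S, A = temp_list[-1]                          # S1020: last buffered pair precedes the reward
    key = (S, A, s_next)
    event_list = database.setdefault(key, set())  # S1025/S1030: create the list if missing
    added = 0
    for pair in temp_list:                        # S1035-S1045: visit every buffered pair
        if pair not in event_list and added < max_new_pairs:
            event_list.add(pair)                  # S1040: add only pairs not yet present
            added += 1
    temp_list.clear()                             # S1050: empty the temporary list
    return key


# Hypothetical usage: two actions are buffered (S1010), then a reward arrives.
database = {}
temp_list = [("s0", "right"), ("s1", "right")]
key = on_reward(database, temp_list, "goal")
```

After the call, the buffered pairs are grouped under the tuple formed from the last pair and the rewarded state, and the temporary list is empty and ready for the next episode.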

FIG. 4 is a flowchart for explaining the operation of the event list learning control unit 107.

In step S2005 of FIG. 4, the event list management unit 101 determines whether it has received a state-action pair (s,a) from the action selection unit 113. If a state-action pair (s,a) has been received, the process proceeds to step S2010. If not, the event list management unit 101 waits.

In step S2010 of FIG. 4, the event list management unit 101 determines whether it has received the next state s' from the acquired information processing unit 109. If the next state s' has been received, the process proceeds to step S2015. If not, the event list management unit 101 waits.

In step S2015 of FIG. 4, the event list management unit 101 determines whether it has received a reward at that time. If a reward has been received, the process proceeds to step S2020. If not, the process proceeds to step S2025.

図４のステップＳ２０２０において、イベント・リスト管理部１０１は、報酬の値をｒに代入する。 In step S2020 of FIG. 4, the event list management unit 101 substitutes the value of reward into r.

図４のステップＳ２０２５において、イベント・リスト管理部１０１は、ゼロをｒに代入する。 In step S2025 of FIG. 4, the event list management unit 101 substitutes zero for r.

In step S2030 of FIG. 4, the event list management unit 101 updates the auxiliary variables e[s, a] of each event list in the event list database 105.

In step S2035 of FIG. 4, the event list management unit 101 updates the simple moving average values of each event list in the event list database 105.
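
The flow of FIG. 4 can be sketched roughly as follows. This is only an illustration, not the implementation of the embodiment: the function name `learning_step` and the two callbacks `update_traces` and `update_averages` are hypothetical stand-ins for steps S2030 and S2035.

```python
from typing import Callable, Hashable, Optional, Tuple

State = Hashable
Action = Hashable


def learning_step(
    pair: Tuple[State, Action],
    next_state: State,
    reward: Optional[float],
    update_traces: Callable[[State, Action], None],
    update_averages: Callable[[State, Action, State, float], None],
) -> float:
    """One pass through steps S2015 to S2035, once (s, a) and s' have arrived."""
    s, a = pair
    # S2015 to S2025: use the received reward, or zero when none was received.
    r = reward if reward is not None else 0.0
    # S2030: update the auxiliary variables e[s, a] (hypothetical callback).
    update_traces(s, a)
    # S2035: update the simple moving averages (hypothetical callback).
    update_averages(s, a, next_state, r)
    return r
```

Passing `reward=None` models the branch in which no reward was received, so that r is set to zero before the two updates run.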

FIG. 5 is a diagram for explaining the detailed operation of step S2030 of FIG. 4.

In step S3005 of FIG. 5, the event list management unit 101 extracts one event list (S, A, S′)n from the event list database 105.

In step S3010 of FIG. 5, the event list management unit 101 extracts one state/action pair (s″, a″) from the extracted event list (S, A, S′)n.

In step S3015 of FIG. 5, the event list management unit 101 determines whether the extracted state/action pair (s″, a″) is the same as the received state/action pair (s, a). If they are the same, the process proceeds to step S3020; otherwise, the process proceeds to step S3025.

In step S3020 of FIG. 5, the event list management unit 101 updates the auxiliary variable e[s″, a″] of the state/action pair (s″, a″) according to the following equation. The initial values of the auxiliary variables are all zero.

[Equation omitted in source.]

In step S3025 of FIG. 5, the event list management unit 101 updates the auxiliary variable e[s″, a″] of the state/action pair (s″, a″) according to the following equation.

[Equation omitted in source.]

Here, γ is a constant called the discount rate.

In step S3030 of FIG. 5, the event list management unit 101 determines whether all the state/action pairs (s″, a″) of the extracted event list (S, A, S′)n have been checked. If they have, the process proceeds to step S3035; otherwise, the process returns to step S3010.

In step S3035 of FIG. 5, the event list management unit 101 determines whether all the event lists (S, A, S′)n in the event list database 105 have been checked. If they have, the process ends; otherwise, the process returns to step S3005.
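
The double loop of FIG. 5 can be sketched as below. The update equations of steps S3020 and S3025 are not available in this text, so the sketch assumes a conventional eligibility-trace rule: the matching pair is set to 1 and every other pair is multiplied by the discount rate γ. Both rules, the function name, and the dictionary layout are assumptions made for illustration only.

```python
from typing import Dict, Hashable, Tuple

Pair = Tuple[Hashable, Hashable]                 # (state, action)
EventKey = Tuple[Hashable, Hashable, Hashable]   # (S, A, S')


def update_traces(
    traces: Dict[EventKey, Dict[Pair, float]],
    received: Pair,
    gamma: float,
) -> None:
    """Visit every pair of every event list, as in steps S3005 to S3035."""
    for pairs in traces.values():          # S3005 / S3035: each event list
        for pair in pairs:                 # S3010 / S3030: each pair in it
            if pair == received:           # S3015: same as the received pair?
                pairs[pair] = 1.0          # S3020 (ASSUMED rule, not from source)
            else:
                pairs[pair] *= gamma       # S3025 (ASSUMED rule, not from source)
```

Since all auxiliary variables start at zero, a pair that has never matched stays at zero under the assumed decay rule.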

FIG. 6 is a diagram for explaining the detailed operation of step S2035 of FIG. 4.

In step S4005 of FIG. 6, the event list management unit 101 extracts, from the event lists (S, A, S′)n in the event list database 105, only those whose state/action pairs include s.

In step S4010 of FIG. 6, the event list management unit 101 determines whether (S, A, S′) of the extracted event list (S, A, S′)n is the same as (s, a, s′). If they are the same, the process proceeds to step S4015; otherwise, the process proceeds to step S4020.

In step S4015 of FIG. 6, the event list management unit 101 updates the simple moving average ma[r|SAS′] for the partial reward expected value of the extracted event list (S, A, S′)n according to the following equation.

[Equation omitted in source.]

Here, as described above, the following holds.

[Equation omitted in source.]

In step S4020 of FIG. 6, the event list management unit 101 extracts one state/action pair (s″, a″) from the extracted event list (S, A, S′)n.

In step S4025 of FIG. 6, the event list management unit 101 determines whether the auxiliary variable e[s″, a″] of the state/action pair (s″, a″) is positive. If it is positive, the process proceeds to step S4030; otherwise, the process proceeds to step S4050.

In step S4030 of FIG. 6, the event list management unit 101 determines whether (S, A, S′) of the extracted event list (S, A, S′)n is the same as (s, a, s′). If they are the same, the process proceeds to step S4035; otherwise, the process proceeds to step S4040.

In step S4035 of FIG. 6, the event list management unit 101 updates the simple moving average ma_SAS′[γ|s, a] for the partial distance expected value of the elements of the extracted event list (S, A, S′)n according to the following equation.

[Equation omitted in source.]

Here, as described above, the following holds.

[Equation omitted in source.]

In step S4040 of FIG. 6, the event list management unit 101 updates the simple moving average ma_SAS′[γ|s, a] for the partial distance expected value of the elements of the extracted event list (S, A, S′)n according to the following equation.

[Equation omitted in source.]

Here, as described above, the following holds.

[Equation omitted in source.]

In step S4045 of FIG. 6, the event list management unit 101 updates the auxiliary variable e[s″, a″] of the state/action pair (s″, a″) according to the following equation.

[Equation omitted in source.]

In step S4050 of FIG. 6, the event list management unit 101 determines whether all the state/action pairs (s″, a″) of the extracted event list (S, A, S′)n have been checked. If they have, the process proceeds to step S4055; otherwise, the process returns to step S4020.

In step S4055 of FIG. 6, the event list management unit 101 determines whether all the event lists (S, A, S′)n in the event list database 105 have been checked. If they have, the process ends; otherwise, the process returns to step S4005.

FIG. 7A is a flowchart for explaining the operation of the action planning unit 108.

In step S5005 of FIG. 7A, the action planning unit 108 determines whether the state s′ has been received from the acquired information processing unit 109. If it has, the process proceeds to step S5010; otherwise, the process waits.

In step S5010 of FIG. 7A, the action planning unit 108 determines whether the target S′ has been received from the target acquisition unit 207. If it has, the process proceeds to step S5015; otherwise, the process proceeds to step S5030.

In step S5015 of FIG. 7A, the action planning unit 108 uses the information in the event list database 105 to search for a route from the initial state s′ to the target state S′.

FIG. 7B is a diagram for explaining a method of searching for a route from an initial state to a target state using the information in the event list database 105. FIG. 7B(a) shows the state transitions from S₀ to S₃. FIG. 7B(b) shows the event list corresponding to the state transitions of FIG. 7B(a). In the event list of FIG. 7B(b), (S, A, S′) is (S₂, a₃, S₃). FIG. 7B(c) shows a combination of event lists. As shown in FIG. 7B(c), a plurality of event lists are combined to search for a route from the initial state to the target state. The route may be searched for by, for example, a best-first search (see, for example, Japanese Society for Artificial Intelligence (ed.), Artificial Intelligence Dictionary, Kyoritsu Shuppan, 2006).
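
A route search of this kind can be sketched as follows. The sketch assumes that the contents of the event lists have been flattened into a table of known (state, action, next state) transitions, and it ranks partial routes by their number of steps; both the table layout and the ranking criterion are assumptions for illustration, since the text only names best-first search as one possible method.

```python
import heapq
from typing import Dict, Hashable, List, Optional, Tuple

State = Hashable
Action = Hashable
Step = Tuple[State, Action, State]


def find_route(
    transitions: Dict[State, List[Tuple[Action, State]]],
    start: State,
    goal: State,
) -> Optional[List[Step]]:
    """Best-first search that always expands the shortest partial route."""
    # Heap entries: (steps so far, tie-breaker, current state, route so far).
    frontier: list = [(0, 0, start, [])]
    counter = 1  # unique tie-breaker so states and routes are never compared
    visited = set()
    while frontier:
        steps, _, state, route = heapq.heappop(frontier)
        if state == goal:
            return route
        if state in visited:
            continue
        visited.add(state)
        for action, nxt in transitions.get(state, []):
            if nxt not in visited:
                new_route = route + [(state, action, nxt)]
                heapq.heappush(frontier, (steps + 1, counter, nxt, new_route))
                counter += 1
    return None  # the search failed


# Transitions of the kind shown in FIG. 7B(a): S0 -> S1 -> S2 -> S3.
known = {"S0": [("a1", "S1")], "S1": [("a2", "S2")], "S2": [("a3", "S3")]}
route = find_route(known, "S0", "S3")
```

A return value of `None` corresponds to the unsuccessful branch of the route search.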

In step S5020 of FIG. 7A, the action planning unit 108 determines whether the route search has succeeded. If it has, the process proceeds to step S5025; otherwise, the process proceeds to step S5030.

In step S5025 of FIG. 7A, based on the result of the route search, the action planning unit 108 obtains the action value function corresponding to the recommended action a according to the following equation, using the corresponding simple moving average values in the event list database 105, and outputs it.

[Equation (8) omitted in source.]

Equation (8) corresponds to the combination of event lists in FIG. 7B(c), and the initial state is s.

In step S5030 of FIG. 7A, the action planning unit 108 uses the information in the event list database 105 to obtain and output an action value function corresponding to the state s′. Specifically, from the partial reward expected values and partial distance expected values of the portions of the event lists stored in the event list database 105 that correspond to the state s′, the action planning unit 108 obtains the reward expected value for the state s′ given by Equation (4) according to the following equation.

[Equation omitted in source.]

FIG. 8 is a flowchart for explaining the operation of the action selection unit 113.

In step S6005 of FIG. 8, the action selection unit 113 determines whether the state s′ has been received from the acquired information processing unit 109. If it has, the process proceeds to step S6010; otherwise, the process waits.

In step S6010 of FIG. 8, the action selection unit 113 determines whether a teaching has been received from the teaching acquisition unit 205. If it has, the process proceeds to step S6015; otherwise, the process proceeds to step S6020.

In step S6015 of FIG. 8, the action selection unit 113 selects and outputs the taught action a, and the process ends.

In step S6020 of FIG. 8, the action selection unit 113 determines whether the action value function Q has been received from the reinforcement learning unit 111. If it has, the process proceeds to step S6025; otherwise, the process returns to step S6005.

In step S6025 of FIG. 8, the action selection unit 113 determines whether the action value function tQ has been received from the action planning unit 108. If it has, the process proceeds to step S6030; otherwise, the process proceeds to step S6035.

In step S6030 of FIG. 8, the action selection unit 113 sets tQ to the sum of tQ and Q.

In step S6035 of FIG. 8, the action selection unit 113 sets tQ to Q.

In step S6040 of FIG. 8, the action selection unit 113 selects an action a stochastically based on tQ and outputs it.
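
The selection logic of FIG. 8 can be sketched as follows. The text says only that the action is selected stochastically based on tQ; the softmax rule, the `temperature` parameter, and the dictionary representation of Q and tQ used here are assumptions for illustration.

```python
import math
import random
from typing import Dict, Hashable, Optional

Action = Hashable


def select_action(
    q: Dict[Action, float],
    tq: Optional[Dict[Action, float]] = None,
    taught: Optional[Action] = None,
    temperature: float = 1.0,
    rng: Optional[random.Random] = None,
) -> Action:
    """Choose an action from Q, the planner's tQ, and an optional teaching."""
    rng = rng or random.Random()
    if taught is not None:                 # S6010 / S6015: a teaching wins
        return taught
    if tq is not None:                     # S6025 / S6030: tQ <- tQ + Q
        combined = {a: q[a] + tq.get(a, 0.0) for a in q}
    else:                                  # S6035: tQ <- Q
        combined = dict(q)
    # S6040: stochastic choice. Softmax sampling is an ASSUMPTION here.
    top = max(combined.values())           # subtract the max for stability
    actions = list(combined)
    weights = [math.exp((combined[a] - top) / temperature) for a in actions]
    return rng.choices(actions, weights=weights, k=1)[0]
```

With this rule, a large planner value for one action makes that action almost certain to be chosen, while equal values leave the choice uniform.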

A simulation experiment of the learning control system 150 according to the present embodiment is described below.

FIG. 9 is a diagram for explaining the procedure of the simulation experiment. There are eight observed states, s₀ to s₇, and eight actions, a₀ to a₇. A "taught episode" is an episode taught to the apparatus 200 including the learning control system 150, for example via the teaching acquisition unit 205. Here, an episode is a series of states and actions that occur in succession. Hereinafter, the apparatus 200 is referred to as the agent.

For example, in episode A, the state s₀ is observed first. At this time, the agent is taught to take action a₁. When the agent selects action a₁, the observed state changes to s₁ as a result. Proceeding in the same way, when the observed state s₃ is reached, a positive reward value is given to the agent. Episode B, episode C, and episode D are likewise taught once each. A positive reward value is given at the end of episode B and of episode C. At the end of episode D, however, a negative reward value is given, so that episode D is taught as undesirable.

Next, a problem is given to the agent. In problem 1 of FIG. 9, the agent is placed in the observed state s₀ and is told that the target state is s₆. The agent is required to move from state s₀ to state s₆ in the smallest number of steps. Here, a step is one action taken in a state, so the number of steps is the number of actions taken. In problem 2 of FIG. 9, the agent is placed in the observed state s₀ and is told that the target state is s₇. The agent is required to move from state s₀ to state s₇ in the smallest number of steps.

The actual simulation consisted of 20 trials. Here, a trial is the series of actions that the agent performs, according to the state, until it reaches a terminal state. The number of steps in a trial is limited to 50; in other words, if the agent has not reached a terminal state after 50 steps, the trial is ended. The first four trials, that is, the first to fourth trials, form the episode teaching period, in which teaching is given. Specifically, episodes A to D are taught in the first to fourth trials as described above. The fifth to twentieth trials form the problem handling period, in which problem 1 and problem 2 are given to the agent alternately. Specifically, problem 1 is given in the fifth trial, problem 2 in the sixth trial, problem 1 in the seventh trial, and problem 2 in the eighth trial. Problem 1 and problem 2 continue to be given alternately in this way up to the twentieth trial.
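
The trial schedule described above can be written out as a small sketch. The function name `trial_schedule`, the label strings, and the constant `MAX_STEPS` are chosen here for illustration only.

```python
from typing import List, Tuple

MAX_STEPS = 50  # a trial is cut off after this many steps


def trial_schedule(n_trials: int = 20) -> List[Tuple[int, str]]:
    """Return (trial number, task) pairs for the simulation protocol."""
    schedule = []
    for trial in range(1, n_trials + 1):
        if trial <= 4:
            # Episode teaching period: episodes A to D, one per trial.
            label = "episode " + "ABCD"[trial - 1]
        else:
            # Problem handling period: odd trials get problem 1, even get 2.
            label = "problem 1" if trial % 2 == 1 else "problem 2"
        schedule.append((trial, label))
    return schedule
```

Under this schedule each of the two problems is presented eight times during the problem handling period.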

FIG. 10 is a diagram showing the result of the first simulation. In the graphs of FIGS. 10 to 13, the horizontal axis indicates the trial number and the vertical axis indicates the number of steps in each trial. The number of steps in each trial is an average over 1000 repetitions. In addition to the result of the learning control system 150 according to the present embodiment, FIGS. 10 to 13 show the results of the SARSA(0.1) algorithm and the SARSA(0.5) algorithm. The values 0.1 and 0.5 are the parameter λ of the SARSA algorithm (R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, MIT Press). The reinforcement learning unit 111 of the learning control system 150 according to the present embodiment uses the SARSA(0.5) algorithm.
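
For reference, one update step of tabular SARSA(λ) with accumulating eligibility traces, in the form given in the cited Sutton and Barto text, can be sketched as follows. The step size `alpha`, the discount rate `gamma`, and the dictionary-based tables are assumptions for this illustration; the source states only the value of λ.

```python
from collections import defaultdict
from typing import DefaultDict, Hashable, Tuple

Pair = Tuple[Hashable, Hashable]  # (state, action)


def sarsa_lambda_update(
    q: DefaultDict[Pair, float],
    e: DefaultDict[Pair, float],
    s: Hashable,
    a: Hashable,
    r: float,
    s_next: Hashable,
    a_next: Hashable,
    alpha: float = 0.1,
    gamma: float = 0.9,
    lam: float = 0.5,
) -> None:
    """Apply one SARSA(lambda) update to the tables q and e in place."""
    # Temporal-difference error for the transition (s, a, r, s', a').
    delta = r + gamma * q[(s_next, a_next)] - q[(s, a)]
    # Accumulating trace for the pair that was just visited.
    e[(s, a)] += 1.0
    # Every traced pair shares the error, then its trace decays.
    for pair in list(e):
        q[pair] += alpha * delta * e[pair]
        e[pair] *= gamma * lam


q: DefaultDict[Pair, float] = defaultdict(float)
e: DefaultDict[Pair, float] = defaultdict(float)
sarsa_lambda_update(q, e, "s0", "a1", 1.0, "s1", "a2", alpha=0.5)
```

A larger λ lets a reward reach state/action pairs visited further back in the trial, which is the only difference between the SARSA(0.1) and SARSA(0.5) baselines.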

In FIG. 10, the result of the SARSA(0.1) algorithm shows that no learning takes place at all, because the number of steps does not decrease during the problem handling period. The result of the SARSA(0.5) algorithm shows that learning takes place only for problem 1, because the number of steps decreases only for problem 1. The result of the learning control system 150 according to the present embodiment shows that learning is performed correctly for both problem 1 and problem 2.

FIG. 11 is a diagram showing the result of the second simulation. In this simulation, as in the first simulation, problem 1 and problem 2 are given to the agent alternately during the problem handling period. In the second simulation, however, the trials of the problem handling period are stochastic. Specifically, even if the action chosen by the agent is correct, the transition to the correct next state occurs only with probability 0.8. According to FIG. 11, the learning control system 150 according to the present embodiment learns correctly even in this case.
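
A stochastic transition of this kind can be sketched as follows. The text does not say what happens in the remaining 20% of cases; this sketch assumes that the agent simply stays in its current state, and the function name `noisy_step` is hypothetical.

```python
import random
from typing import Hashable, Optional


def noisy_step(
    state: Hashable,
    intended_next: Hashable,
    p_success: float = 0.8,
    rng: Optional[random.Random] = None,
) -> Hashable:
    """Move to the intended next state only with probability p_success."""
    rng = rng or random.Random()
    if rng.random() < p_success:
        return intended_next
    # ASSUMPTION: on failure the state does not change.
    return state
```

Because a correct action can fail, the number of steps needed to reach the target state varies from trial to trial even for a fully trained agent.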

FIG. 12 is a diagram showing the result of the third simulation. In this simulation, the environment of the problem handling period differs from that of the episode teaching period. Specifically, in episode A of the episode teaching period, which corresponds to problem 1, selecting action a₂ in state s₁ led to state s₂, whereas in the trials of problem 1 it now leads to s₇. This transition is deterministic, not stochastic. In FIG. 12, "1′" indicates that problem 1 has been changed. Problem 2 is the same as in the episode teaching period. According to FIG. 12, even in this case, the learning control system 150 according to the present embodiment learns correctly even for the untaught problem 1′ by effectively combining learning by the event list learning control system 100 with learning by the reinforcement learning unit 111.

FIG. 13 is a diagram showing the result of the fourth simulation. In this simulation, the trials of the problem handling period are stochastic, as in the second simulation, and problem 1 has been changed, as in the third simulation. Even in this case, the learning control system 150 according to the present embodiment learns correctly for problem 1′ and problem 2 by effectively combining learning by the event list learning control system 100 with learning by the reinforcement learning unit 111.

FIG. 14 is a diagram showing the result of the fifth simulation. In the fifth simulation, no teaching is given and no target is given.

FIG. 15 is a diagram for explaining a high-order Markov decision process (HOMDP), which is the environment of the fifth simulation. There are ten selectable actions, a₀, a₁, …, a₉, of which six, a₀, a₁, …, a₅, are related to the reward. The process includes process A and process B. After a reward is obtained in process A, process B must be selected to obtain the next reward, and after a reward is obtained in process B, process A must be selected to obtain the next reward. That is, different actions must be selected in process A and process B for the same observed signal. Each transition is stochastic. In process A, the transition from s₀ to s₂ occurs with probability 0.3; the other transitions occur with probability 0.9. In process B, the transition from s₁ to s₂ occurs with probability 0.3; the other transitions occur with probability 0.9. Furthermore, two observable signals exist for each state. For example, for s₀ there are the signals O₀₀ and O₀₁, each of which is observed with probability 0.5.
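
The observation model described above, in which each state emits one of two signals with equal probability, can be sketched as follows. The text names the signals only for s₀; extending the naming pattern to the other states, and the ASCII labels such as `"O00"`, are assumptions for illustration.

```python
import random
from typing import Optional


def observe(state_index: int, rng: Optional[random.Random] = None) -> str:
    """Return one of the two signals of a state, each with probability 0.5."""
    rng = rng or random.Random()
    # ASSUMPTION: state s_i emits "O{i}0" or "O{i}1", following O00 / O01.
    return f"O{state_index}{rng.randrange(2)}"
```

Since two different signals can stand for the same state, an agent that learns on raw signals has to treat them as separate observations.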

In the graph of FIG. 14, the horizontal axis indicates the trial number and the vertical axis indicates the number of steps in each trial. The number of steps in each trial is an average over 1000 repetitions. In addition to the result of the learning control system 150 according to the present embodiment, FIG. 14 shows the result of the SARSA(0.5) algorithm. In FIG. 14, the solid line indicates the average number of steps of the learning control system 150, and the one-dot chain line indicates the average number of steps of the SARSA(0.5) algorithm. The dotted line indicates the standard deviation of the number of steps of the learning control system 150, and the two-dot chain line indicates the standard deviation of the number of steps of the SARSA(0.5) algorithm. According to FIG. 14, the learning control system 150 according to the present embodiment converges with fewer steps than the SARSA(0.5) algorithm. This result shows that the event list learning control system 100 assists the learning of the reinforcement learning unit 111 even when no teaching and no target are presented at all.

Description of reference numerals: 100: event list learning control system; 101: event list management unit; 103: temporary list storage unit; 105: event list database; 107: event list learning control unit; 108: action planning unit

Claims

A learning control system comprising:
an event list database that holds a plurality of event lists, each event list being a list of a set of state/action pairs, provided for each combination of the state/action pair immediately before a reward is obtained and the state when the reward is obtained;
an event list management unit that classifies state/action pairs into the plurality of event lists and stores them in the event list database;
an event list learning control unit that updates a reward expected value of each state/action pair that is an element of each event list;
an action planning unit that obtains a first action value function using the event lists of the event list database;
a reinforcement learning unit that obtains a second action value function based on reinforcement learning; and
an action selection unit that selects an action based on the first action value function received from the action planning unit and the second action value function received from the reinforcement learning unit,
wherein the event list management unit, from the observed states, actions, and rewards, classifies as one event list a list formed by associating a state-action chain, represented by a triple consisting of the state at the time immediately before one reward is obtained, the action taken in that state, and the state when the reward is obtained as a result of that action, with the set of state/action pairs leading up to that state-action chain, and stores the one event list in the event list database in association with the one reward,
wherein the first and second action value functions represent expected values of the reward that will be obtained from the present into the future if the actions determined by the action planning unit and the reinforcement learning unit, respectively, are executed,
wherein the first action value function is defined as a weighted sum, over a plurality of other times, of the reward expected value, which is the expected value of the reward obtained at another time following the action taken at one time in the state at that one time, and
wherein the reward expected value is calculated as the sum, over the set of state-action chains constituting a route to a given target state, of the product of a partial distance expected value and a partial reward expected value, the partial distance expected value being a value obtained by weighting and adding, over the times, the probability that a reward is obtained via one state-action chain at another time following the action taken at the one time in the state at the one time, and the partial reward expected value being a value obtained by weighting and adding, over all rewards obtained via the one state-action chain, the expected value that one reward is obtained via the one state-action chain at another time following the action taken at the one time in the state at the one time.

A learning control method in which learning is performed and an action is selected by a learning control system comprising an event list database that holds a plurality of event lists, each event list being a list of a set of state/action pairs, provided for each combination of the state/action pair immediately before a reward is obtained and the state when the reward is obtained, an event list management unit, an event list learning control unit, an action planning unit, a reinforcement learning unit, and an action selection unit, the method comprising:
a step in which the event list management unit classifies state/action pairs into the plurality of event lists and stores them in the event list database;
a step in which the event list learning control unit updates a reward expected value of each state/action pair that is an element of each event list;
a step in which the action planning unit obtains a first action value function using the event lists of the event list database;
a step in which the reinforcement learning unit obtains a second action value function based on reinforcement learning; and
a step in which the action selection unit selects an action based on the first action value function received from the action planning unit and the second action value function received from the reinforcement learning unit,
wherein, in the storing step, from the observed states, actions, and rewards, a list formed by associating a state-action chain, represented by a triple consisting of the state at the time immediately before one reward is obtained, the action taken in that state, and the state when the reward is obtained as a result of that action, with the set of state/action pairs leading up to that state-action chain is classified as one event list, and the one event list is stored in the event list database in association with the one reward,
wherein the first and second action value functions represent expected values of the reward that will be obtained from the present into the future if the actions determined by the action planning unit and the reinforcement learning unit, respectively, are executed,
wherein the first action value function is defined as a weighted sum, over a plurality of other times, of the reward expected value, which is the expected value of the reward obtained at another time following the action taken at one time in the state at that one time, and
wherein the reward expected value is calculated as the sum, over the set of state-action chains constituting a route to a given target state, of the product of a partial distance expected value and a partial reward expected value, the partial distance expected value being a value obtained by weighting and adding, over the times, the probability that a reward is obtained via one state-action chain at another time following the action taken at the one time in the state at the one time, and the partial reward expected value being a value obtained by weighting and adding, over all rewards obtained via the one state-action chain, the expected value that one reward is obtained via the one state-action chain at another time following the action taken at the one time in the state at the one time.