JP2013225192A

JP2013225192A - Reward function estimation apparatus, reward function estimation method and program

Info

Publication number: JP2013225192A
Application number: JP2012096453A
Authority: JP
Inventors: Hiroaki Sugiyama; 弘晃杉山; Yasuhiro Minami; 泰浩南
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2012-04-20
Filing date: 2012-04-20
Publication date: 2013-10-31
Anticipated expiration: 2032-04-20
Also published as: JP5815458B2

Abstract

PROBLEM TO BE SOLVED: To estimate an appropriate reward function. SOLUTION: A training action sequence storage unit 11 stores a set of training action sequences, each indicating a series of states traversed through actions and the action taken in each state, together with a set of training order evaluation values indicating the appropriateness of each training action sequence. A training action sequence evaluation unit 15 uses a reward function, which gives a reward value for a state, to obtain order evaluation values corresponding to the reward values of the series of states indicated by the training action sequences. A reward function adjustment unit 16 updates the reward function on the basis of the difference between the set of order evaluation values and the set of training order evaluation values.

Description

The present invention relates to a reward function estimation device, method, and program that, in a reinforcement learning problem, estimate a reward function for reproducing highly evaluated action sequences, based on action sequences (each a series of pairs of a state and the action taken in that state) and on evaluations assigned to those action sequences (an index of how appropriate an action sequence is for a given task).

Reinforcement learning (see Non-Patent Documents 1 and 2) is a known technique for making decisions based on a reward function. Reinforcement learning is a type of machine learning that deals with the problem of an agent in an environment observing its current “state” and deciding which “action” to take. For example, a robot system acting autonomously in an environment selects, according to the “state” obtained from the environment, the “action” that maximizes the “reward”. In other words, in reinforcement learning the behavior of a system can be designed by designing the “reward” according to the “task” to be accomplished. A reward function is a function that outputs the “reward” corresponding to an input “state”. As a concrete example, consider the “task” in which an agent in a grid world with holes (Gridworld) moves to the goal along the shortest path without falling into a hole. One example of a reward function for this task outputs a “reward” of -100 for the “state” of being in a hole, +100 for the “state” of being at the goal, and -1 for the “state” of being at any other position.
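The Gridworld reward function described above can be sketched as follows. This is only an illustration: the grid size, the hole and goal coordinates, and the names `HOLES`, `GOAL`, `reward`, and `path_return` are assumptions that do not appear in the original text.

```python
# Hypothetical 4x4 Gridworld layout; the coordinates are illustrative only.
HOLES = {(1, 1), (2, 3)}
GOAL = (3, 3)

def reward(state):
    """Return the reward for a state given as a (row, col) cell."""
    if state in HOLES:
        return -100   # falling into a hole
    if state == GOAL:
        return 100    # reaching the goal
    return -1         # any other position

def path_return(path):
    """Sum the rewards of every state visited along a path."""
    return sum(reward(s) for s in path)

# A path that walks around the holes, and one that falls into a hole.
safe_path = [(0, 0), (1, 0), (2, 0), (3, 0), (3, 1), (3, 2), (3, 3)]
risky_path = [(0, 0), (1, 0), (1, 1)]
```

Because every ordinary step costs -1, a shorter safe path to the goal accumulates a larger total reward than a longer one, which is how this reward design encodes the "shortest path" part of the task.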

Conventionally, the program designer had to design this reward function appropriately in advance. However, how to set the reward function so that the system takes a specific desired behavior depends on human heuristic knowledge and has been a difficult problem (see Non-Patent Document 3).

Reward function estimation methods based on inverse reinforcement learning (see Non-Patent Documents 4 and 5) are known as a way to solve this problem by estimating a reward function that reproduces given action sequences. These methods take a tentative reward function and test by simulation whether the given action sequence is reproduced. When an action different from the given action sequence is selected, they correct the parameters of the reward function and test again, repeating this until an appropriate reward function is estimated.

Non-Patent Document 1: Jason D. Williams. Applying POMDPs to dialog systems in the troubleshooting domain. In Workshop on Bridging the Gap, pp. 1-8, Rochester, New York, 2007. Association for Computational Linguistics.
Non-Patent Document 2: Toyomi Meguro, Ryuichiro Higashinaka, Yasuhiro Minami, and Kohji Dohsaka. Controlling Listening-oriented Dialogue using Partially Observable Markov Decision Processes. In Coling, 2010.
Non-Patent Document 3: A. Boularias, H.R. Chinaei, and B. Chaib-draa. Learning the Reward Model of Dialogue POMDPs from Data. NIPS 2010 Workshop on Machine Learning for Assistive Technologies (MLAT-2010), pp. 1-9, 2010.
Non-Patent Document 4: Pieter Abbeel and Andrew Y. Ng. Apprenticeship learning via inverse reinforcement learning. Twenty-first international conference on Machine learning - ICML'04, 2004.
Non-Patent Document 5: B.D. Ziebart, Andrew Maas, J.A. Bagnell, and A.K. Dey. Maximum entropy inverse reinforcement learning. In Proc. AAAI, pp. 1433-1438, 2008.

In Non-Patent Documents 4 and 5, the reward function is estimated on the premise that all the given action sequences are correct and should be reproduced. In practice, however, action sequences differ in how appropriate they are for the task. For example, for action sequences such as a persuasion dialogue (state: the interlocutor's utterance; action: an utterance intended to persuade) or kendama control (state: the position of the kendama, the position of one's own arm, and so on; action: control signals for the arm and wrist joints), the action sequences to be reproduced may include inappropriate ones in which the task failed. If the reward function is estimated by treating all of these as equally appropriate, the estimated reward function also incorporates the reward function that generated the inappropriate action sequences. Conversely, if only the appropriate action sequences are selected by hand and used as input to the methods of Non-Patent Documents 4 and 5, the amount of training data decreases. In that case, more “states” and “actions” have no corresponding training data. As a result, when the system is placed, for whatever reason, in a “state” where it must take such an “action”, this gap in the data means that a fatal action cannot be avoided.

The present invention has been made in view of these points, and its object is to provide a technique capable of estimating a more appropriate reward function than before.

In the present invention, a set of training action sequences, each representing a series of states traversed through actions and the action in each state, and a set of training order evaluation values representing the appropriateness of each training action sequence are stored. Using a reward function that gives a reward value for a state, order evaluation values corresponding to the reward values of the series of states represented by the training action sequences are obtained, and the reward function is updated based on the difference between the set of order evaluation values and the set of training order evaluation values.

In the present invention, the reward function is estimated based on the difference between the order evaluation values of the training action sequences obtained using the reward function and the training order evaluation values of those sequences. The reward function can therefore be estimated with the appropriateness of the training action sequences taken into account, and thus more appropriately than before.

A block diagram illustrating the functional configuration of the reward function estimation device of the embodiment.
A flowchart illustrating the reward function estimation method of the embodiment.
A diagram for explaining the simulation method.
A diagram for explaining the simulation results.

Hereinafter, embodiments of the present invention will be described with reference to the drawings.
As illustrated in FIG. 1, the reward function estimation device 1 of this embodiment includes a training action sequence storage unit 11, a state transition calculation unit 12, an action evaluation unit 13, an action determination unit 14, a training action sequence evaluation unit 15, a reward function adjustment unit 16, and a control unit 17. The reward function estimation device 1 is, for example, a special device configured by loading a predetermined program into a known or dedicated computer. Alternatively, at least some of the units constituting the reward function estimation device 1 may be implemented in hardware. The reward function estimation device 1 executes each process under the control of the control unit 17. During learning, an appropriate reward function is estimated by, for example, the training action sequence storage unit 11, the state transition calculation unit 12, the action evaluation unit 13, the action determination unit 14, the training action sequence evaluation unit 15, and the reward function adjustment unit 16. During execution, an appropriate action is obtained by, for example, the state transition calculation unit 12, the action evaluation unit 13, and the action determination unit 14.

<Pre-processing>
The training action sequence storage unit 11 stores a set consisting of a plurality of input training action sequences and a set of training order evaluation values representing the appropriateness of each training action sequence. Each training action sequence is, for example, a series of pairs of a state and the action taken in that state, and represents a series of states traversed through actions together with the action in each state. A concrete example of a training action sequence consists of a sequence of values representing each of the states traversed through actions and a sequence of values representing the action taken in each of those states.

Each training order evaluation value corresponds, for example, to a pair of two training action sequences belonging to the set of training action sequences, and indicates which of the two is more appropriate. In this example, a training order evaluation value is assigned to each pair of training action sequences in the set. That is, since a set of N training action sequences (N ≥ 2) contains N(N−1)/2 pairs, N(N−1)/2 training order evaluation values are assigned, for example. Specifically, for each pair of training action sequences ζ^*_i, ζ^*_j, a training order evaluation value o^*_{i,j} indicating which of the two is more appropriate is assigned.

Hereinafter, the training action sequences are collectively denoted ζ^* and the training order evaluation values are collectively denoted o^*.

To estimate the reward function, it suffices to be able to evaluate relatively which training action sequence is more appropriate; the training action sequences need not be evaluated numerically. Using pairwise training order evaluation values as described above is easier to handle than using real-valued evaluation values (for example, -1.4 for a negative response to a certain utterance and +0.3 for a positive one). Moreover, there is no need to re-evaluate the order for each training action sequence during learning.
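As a rough illustration of the pairwise labeling above, the following sketch builds one training order evaluation value for each of the N(N−1)/2 pairs. The +1 / -1 encoding, the `prefers` callback, and the function name are assumptions made for this example; the text only requires that each value indicate which sequence of the pair is more appropriate.

```python
from itertools import combinations

def training_order_values(n, prefers):
    """Build o*[(i, j)] for every pair i < j of n training action sequences.

    prefers(i, j) is a hypothetical annotator callback that returns True when
    sequence i is judged more appropriate than sequence j.
    The +1 / -1 encoding is an assumption made for this sketch.
    """
    return {(i, j): (1 if prefers(i, j) else -1)
            for i, j in combinations(range(n), 2)}

# Toy annotator: a lower index is always judged more appropriate.
o_star = training_order_values(4, lambda i, j: i < j)
```

With N = 4 this yields 4·3/2 = 6 labels, and the annotator only ever answers relative questions about a pair, never assigns a numeric score to a single sequence.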

<Reward function estimation process>
The process of estimating the reward function using the set of training action sequences and the set of training order evaluation values is described below.
As illustrated in FIG. 2, the control unit 17 first initializes a variable step, which corresponds to the loop count, to 0 (step S10). Next, the reward function adjustment unit 16 sets the initial parameter θ_0 of the reward function (hereinafter, “reward function parameter”) (step S11). The reward function parameter of this embodiment is an M-dimensional vector with real-valued elements (M is an integer of 1 or more). The initial reward function parameter θ_0 may be an M-dimensional vector of randomly selected elements or an M-dimensional vector of predetermined elements.

The reward function estimation device 1 then updates the reward function parameter θ_step iteratively according to the following loop.
First, the training action sequence evaluation unit 15 evaluates the set of training action sequences ζ^* read from the training action sequence storage unit 11 based on the current reward function parameter θ_step, and obtains and outputs the set of order evaluation values corresponding to the set of training action sequences ζ^*. That is, the training action sequence evaluation unit 15 uses the reward function corresponding to the reward function parameter θ_step to obtain order evaluation values corresponding to the reward values of the series of states represented by each training action sequence ζ^*, and outputs the set of those order evaluation values. For example, each order evaluation value corresponds to a pair of two training action sequences ζ^*_i, ζ^*_j belonging to the set of training action sequences ζ^*, and indicates which of the two has the reward value, or expected reward value, that represents the higher evaluation. In this case, for a set of N training action sequences, N(N−1)/2 order evaluation values are obtained, for example. Hereinafter, the order evaluation values are collectively denoted o^step (step S12; details are described later).

The set of order evaluation values o^step output from the training action sequence evaluation unit 15 and the set of training order evaluation values o^* read from the training action sequence storage unit 11 are input to the reward function adjustment unit 16. The reward function adjustment unit 16 compares the set of order evaluation values o^step with the set of training order evaluation values o^* and, based on the differences between them, updates the reward function parameter θ_step to the reward function parameter θ_{step+1} (step S13; details are described later).

The control unit 17 determines whether a predetermined termination condition is satisfied (step S14). Examples of the termination condition are that the error between the set of order evaluation values o^step and the set of training order evaluation values o^* falls below a specified value, that the improvement in this error stays below a specified amount even after a predetermined number of updates, or that the number of iterations reaches a specified number. Examples of the error between the two sets are the number of pairs of training action sequences for which the corresponding order evaluation value o^step and training order evaluation value o^* differ, and the distance between the set of order evaluation values o^step and the set of training order evaluation values o^*.

If it is determined in step S14 that the termination condition is not satisfied, the control unit 17 sets step + 1 as the new value of the variable step (step S15) and returns the process to step S12. If it is determined that the termination condition is satisfied, the reward function adjustment unit 16 outputs information representing the reward function corresponding to the reward function parameter θ_{step+1} (step S16). Examples of such information are the reward function itself and the reward function parameter θ_{step+1}. In this way, the processing by the training action sequence evaluation unit 15, the reward function adjustment unit 16, and the other units (steps S12 and S13) is repeated until the termination condition is satisfied, at which point the information representing the reward function is output.
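The control flow of steps S10 to S16 can be sketched as below. This is a skeleton under stated assumptions, not the embodiment's implementation: `evaluate` and `update` are placeholder callables standing in for steps S12 and S13, the error is taken to be the number of disagreeing pairs (one of the examples given in the text), and the toy evaluation and update rules exist only to make the loop runnable.

```python
def estimate_reward_parameter(theta0, evaluate, update, o_star,
                              error_threshold=1, max_steps=100):
    """Skeleton of steps S10 to S16 with caller-supplied S12 and S13."""
    step = 0                                   # S10: loop counter
    theta = list(theta0)                       # S11: initial parameter theta_0
    while True:
        o_step = evaluate(theta)               # S12: order evaluation values
        theta = update(theta, o_step, o_star)  # S13: theta_step -> theta_{step+1}
        # Error: number of pairs whose order values disagree.
        error = sum(1 for pair in o_star if o_step[pair] != o_star[pair])
        if error < error_threshold or step + 1 >= max_steps:   # S14
            return theta                       # S16: output theta_{step+1}
        step += 1                              # S15

# Toy problem (hypothetical): two sequences whose evaluations are 2*theta and theta.
def toy_evaluate(theta):
    e = [theta[0] * 2.0, theta[0] * 1.0]
    return {(0, 1): 1 if e[0] > e[1] else -1}

def toy_update(theta, o_step, o_star):
    # Illustrative rule only: nudge theta upward while the order is wrong.
    if o_step[(0, 1)] != o_star[(0, 1)]:
        return [theta[0] + 0.5]
    return theta

theta_hat = estimate_reward_parameter([-1.0], toy_evaluate, toy_update, {(0, 1): 1})
```

Passing the evaluation and update steps in as functions keeps the loop independent of how the order evaluation values and the parameter update are actually computed.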

<Details of step S12>
An example of step S12 in detail follows. Here, a system evaluation value e(ζ^* | θ_step) is obtained for each training action sequence ζ^* based on the reward function corresponding to the reward function parameter θ_step, and information representing the magnitude relationship between the system evaluation values e(ζ^*_i | θ_step) and e(ζ^*_j | θ_step) of each pair ζ^*_i, ζ^*_j is used as the order evaluation value o^step for that pair (hereinafter written “order evaluation value o^step_{i,j}”). Each system evaluation value e(ζ^*_i | θ_step) is the evaluation value of the training action sequence ζ^*_i calculated using the reward function corresponding to the reward function parameter θ_step. It corresponds to the sum of the reward values, or of their expected values, of the series of states represented by ζ^*_i. The reward values or expected values are obtained using the reward function corresponding to the reward function parameter θ_step. Any value of this kind may be used as the system evaluation value e(ζ^*_i | θ_step).

One example of the system evaluation value e(ζ^*_i | θ_step) is the difference between the reward value R(ζ^*_i | θ_step) of the series of states represented by the training action sequence ζ^*_i and the reward value R(ζ^A_i | θ_step) of the series of states represented by the optimal action sequence ζ^A_i corresponding to that training action sequence.
e(ζ^*_i | θ_step) = R(ζ^*_i | θ_step) − R(ζ^A_i | θ_step)  (1)
The reward values R(ζ^*_i | θ_step) and R(ζ^A_i | θ_step) are values obtained using the reward function corresponding to the reward function parameter θ_step. Specific examples of the reward values R(ζ^*_i | θ_step) and R(ζ^A_i | θ_step) are sums of the reward over the states of each sequence:
R(ζ^*_i | θ_step) = Σ_t θ_step^T · f(s^*_{i,t})
R(ζ^A_i | θ_step) = Σ_t θ_step^T · f(s^A_{i,t})
Here, s^*_{i,t} and s^A_{i,t} denote the t-th state (the state at time t) represented by the training action sequence ζ^*_i and the optimal action sequence ζ^A_i, respectively. f(s^*_{i,t}) and f(s^A_{i,t}) are M-dimensional vectors representing the features corresponding to the states s^*_{i,t} and s^A_{i,t}, respectively. Each element of f(s^*_{i,t}) and f(s^A_{i,t}) is a positive real number. β^T denotes the transpose of β. θ_step^T · f(s^*_{i,t}) and θ_step^T · f(s^A_{i,t}) are inner products of M-dimensional vectors and correspond to the reward function associated with the reward function parameter θ_step. These reward values and system evaluation values are merely examples and may be changed without departing from the gist of the invention.
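The following sketch shows one way to turn equation (1) into order evaluation values. It assumes that the reward value R of a sequence is the sum over its states of the inner product θ^T · f(s), and that an order evaluation value is encoded as +1 when e_i > e_j and -1 otherwise; the feature function, the parameter values, and the state sequences are all invented for illustration.

```python
from itertools import combinations

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def sequence_reward(theta, states, features):
    """R(zeta | theta): sum over the states of theta^T f(s) (assumed form)."""
    return sum(dot(theta, features(s)) for s in states)

def system_evaluation(theta, train_states, optimal_states, features):
    """e = R(training sequence) - R(optimal sequence), as in equation (1)."""
    return (sequence_reward(theta, train_states, features)
            - sequence_reward(theta, optimal_states, features))

def order_values(evals):
    """o^step[(i, j)]: +1 if e_i > e_j, else -1 (encoding assumed)."""
    return {(i, j): (1 if evals[i] > evals[j] else -1)
            for i, j in combinations(range(len(evals)), 2)}

# Toy data (hypothetical): integer states with a 2-dimensional feature vector.
features = lambda s: [1.0, s + 1.0]
theta = [-1.0, 0.5]
train = [[0, 1, 2], [0, 2, 3]]      # states of two training sequences
optimal = [[0, 2, 3], [0, 2, 3]]    # states of their optimal sequences
evals = [system_evaluation(theta, t, a, features) for t, a in zip(train, optimal)]
o_step = order_values(evals)
```

Since the optimal sequence maximizes R among the candidates considered, e is at most 0 in this construction, and a value closer to 0 means the training sequence is closer to what the current reward function regards as best.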

The optimal action sequence ζ^A_i is defined as follows. Consider the action sequences ζ_i that take as their initial state s_{i,0} the same state as the initial state s^*_{i,0} of the series of states represented by the training action sequence ζ^*_i, and that represent the series of states traversed from that initial state through each action together with the action in each state. Among these, ζ^A_i is the action sequence for which the total reward value of the series of states, obtained by the reward function corresponding to the reward function parameter θ_step, is largest. For example, among the action sequences ζ_i whose initial state s_{i,0} is the initial state s^*_{i,0} represented by the training action sequence ζ^*_i, the action sequence ζ_i that maximizes the following reward value R(ζ_i | θ_step) is the optimal action sequence ζ^A_i.
R(ζ_i | θ_step) = Σ_t θ_step^T · f(s_{i,t})
Here, s_{i,t} denotes the t-th state represented by the action sequence ζ_i, and f(s_{i,t}) is the M-dimensional vector representing the features corresponding to the state s_{i,t}.

«Search for the optimal action sequence ζ^A_i»
To calculate the system evaluation values by the above method, an optimal action sequence ζ^A_i must be determined for every training action sequence ζ^*_i stored in the training action sequence storage unit 11. Here, to determine the optimal action sequence ζ^A_i, an exploratory approach is used: many action sequences ζ_i are generated for each training action sequence ζ^*_i, and the action sequence ζ_i with the largest reward value R(ζ_i | θ_step) is searched for.

Each action sequence ζ_i is generated by taking the initial state s^*_{i,0} of the series of states represented by the training action sequence ζ^*_i as the initial state s_{i,0}, and then repeating the process of determining the action a_{i,t} in each state s_{i,t} and determining the next state s_{i,t+1} based on that action a_{i,t}. If the problem space is small enough to search exhaustively, all action sequences ζ_i reachable from the initial state s_{i,0} are generated for each training action sequence ζ^*_i, and the action sequence ζ_i with the largest reward value R(ζ_i | θ_step) among them is taken as the optimal action sequence ζ^A_i (exhaustive search method). If the space is large and exhaustive search is difficult, action sequences ζ_i are generated stochastically, and the action sequence ζ_i with the largest reward value R(ζ_i | θ_step) among them is taken as the optimal action sequence ζ^A_i (partial search method).
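A minimal sketch of the exhaustive search method follows, assuming a small deterministic toy world. The functions `actions`, `step`, and `reward`, the fixed horizon, and the choice to track only the visited states (not the actions) are simplifications introduced for this example.

```python
def all_sequences(state, actions, step, horizon):
    """Enumerate every state sequence with `horizon` transitions from `state`.

    actions(s) lists the actions available in s and step(s, a) returns the
    next state; both are hypothetical, deterministic toy models.
    """
    if horizon == 0:
        return [[state]]
    result = []
    for a in actions(state):
        for tail in all_sequences(step(state, a), actions, step, horizon - 1):
            result.append([state] + tail)
    return result

def optimal_sequence(s0, actions, step, horizon, reward):
    """Exhaustive search: the sequence from s0 with the largest total reward."""
    return max(all_sequences(s0, actions, step, horizon),
               key=lambda seq: sum(reward(s) for s in seq))

# Toy world: positions 0..4 on a line, moves of -1 or +1, reward equal to position.
actions = lambda s: (-1, 1)
step = lambda s, a: min(4, max(0, s + a))
best = optimal_sequence(1, actions, step, 3, reward=lambda s: s)
```

The number of enumerated sequences grows exponentially with the horizon (here 2^3 = 8), which is why the text falls back to stochastic generation when the space is large.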

[Example of a method for generating the optimal action sequence ζ^A_i]
An example of a method for generating the optimal action sequence ζ^A_i follows.
1-1-1: A training action sequence ζ^*_i is read from the training action sequence storage unit 11 and input to the state transition calculation unit 12.

1-1-2: The state transition calculation unit 12 reads the initial state s^*_{i,0} of the series of states represented by the training action sequence ζ^*_i and inputs the initial state s_{i,0} = s^*_{i,0} to the action evaluation unit 13.

1-1-3: Based on the current reward function parameter θ_step, the action evaluation unit 13 calculates the expected reward value r_{i,t}(s_{i,t}, a_{i,t}) for every action a_{i,t} that can be taken in the state s_{i,t}. In the first pass through the loop, t = 0. The expected reward value is calculated using the conditional probability P_T(s_{i,t+1} | s_{i,t}, a_{i,t}) of transitioning to the state s_{i,t+1} when the action a_{i,t} is taken in the state s_{i,t}. The conditional probability P_T(s_{i,t+1} | s_{i,t}, a_{i,t}) is assumed to be given in advance.
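One plausible reading of the expected reward value in step 1-1-3 is the reward of the next state averaged under the transition probability, that is, r(s, a) = Σ_{s'} P_T(s' | s, a) · θ^T · f(s'). The sketch below implements that reading; the formula itself, the `transition` dictionary format, and the toy numbers are assumptions, since the defining equation is not reproduced in this text.

```python
def expected_reward(theta, state, action, transition, features):
    """r(s, a) = sum over s' of P_T(s' | s, a) * theta^T f(s') (assumed form).

    transition(state, action) returns a dict {next_state: probability}.
    """
    return sum(p * sum(w * x for w, x in zip(theta, features(s_next)))
               for s_next, p in transition(state, action).items())

# Toy model (hypothetical): moving "right" succeeds with probability 0.75.
def transition(state, action):
    if action == "right":
        return {state + 1: 0.75, state: 0.25}
    return {state: 1.0}

features = lambda s: [1.0, s + 1.0]
theta = [-1.0, 0.5]
r_right = expected_reward(theta, 1, "right", transition, features)
r_stay = expected_reward(theta, 1, "stay", transition, features)
```

In this toy model the reward of state s is 0.5·s − 0.5, so moving right from state 1 has expected reward 0.75·0.5 + 0.25·0 = 0.375 while staying has expected reward 0.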

1-1-4: The expected reward values r_{i,t}(s_{i,t}, a_{i,t}) obtained by the action evaluation unit 13 for each state s_{i,t} and action a_{i,t} are input, together with information representing the corresponding state s_{i,t} and action a_{i,t}, to the action determination unit 14 and the training action sequence evaluation unit 15.

1-1-5: The action determination unit 14 determines the action a_{i,t} in the state s_{i,t} based on the expected reward values r_{i,t}(s_{i,t}, a_{i,t}), and inputs information representing the determined action a_{i,t} and the state s_{i,t} to the state transition calculation unit 12. In the exhaustive search method, the actions that can be taken in the state s_{i,t} are selected in turn as a_{i,t}. In the partial search method, the action a_{i,t} is determined stochastically. The stochastic determination method is described later.

1-1-6: The state transition calculation unit 12 determines the next state s_{i,t+1} based on the action a_{i,t} obtained by the action determination unit 14, and inputs information representing the determined state s_{i,t+1} to the action evaluation unit 13.

1-1-7: If the predetermined termination condition is not satisfied, t + 1 is set as the new t and the process returns to 1-1-3. If the termination condition is satisfied, the loop stops, and the action evaluation unit 13 inputs the action sequence ζ_i consisting of the series of states s_{i,t} and actions a_{i,t} obtained so far to the training action sequence evaluation unit 15. Examples of this termination condition are that the loop of 1-1-3 to 1-1-7 has been repeated a predetermined number of times, or that a predetermined state has been reached.

The training action sequence evaluation unit 15 repeats the above loop of 1-1-1 to 1-1-7 a plurality of times to obtain a plurality of action sequences ζ_i for each training action sequence ζ^*_i. It calculates the reward value R(ζ_i|θ_step) for each of the obtained action sequences ζ_i, finds for each training action sequence ζ^*_i the action sequence ζ_i that maximizes R(ζ_i|θ_step), and takes it as the optimal action sequence ζ^A_i corresponding to ζ^*_i. That is, for each i, the training action sequence evaluation unit 15 generates an optimal action sequence ζ^A_i whose initial state s_{i,0} is the initial state s^*_{i,0} of the series of states represented by the training action sequence ζ^*_i.
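The search for the optimal action sequence described above can be sketched as follows. This is a minimal illustration, not the implementation in the publication: the function names (`rollout`, `sequence_reward`, `optimal_sequence`), the fixed-length termination condition, and the toy one-dimensional environment are all assumptions made for the example.

```python
import random

def rollout(s0, actions, transition, choose, max_len):
    """Generate one action sequence by repeating the loop 1-1-3 to 1-1-7."""
    seq, s = [], s0
    for _ in range(max_len):          # assumed termination: fixed number of iterations
        a = choose(s, actions(s))     # 1-1-5: decide the action in state s
        seq.append((s, a))
        s = transition(s, a)          # 1-1-6: compute the next state
    return seq

def sequence_reward(seq, reward):
    """R(zeta | theta): sum of the reward values of the states in the sequence."""
    return sum(reward(s) for s, _ in seq)

def optimal_sequence(s0, n_rollouts, actions, transition, choose, reward, max_len):
    """Keep the sampled sequence with the largest total reward."""
    candidates = [rollout(s0, actions, transition, choose, max_len)
                  for _ in range(n_rollouts)]
    return max(candidates, key=lambda seq: sequence_reward(seq, reward))

# Toy environment (hypothetical): states are integers, reward(s) = s.
rng = random.Random(0)
best = optimal_sequence(
    s0=0, n_rollouts=200,
    actions=lambda s: (-1, +1),
    transition=lambda s, a: s + a,
    choose=lambda s, acts: rng.choice(acts),
    reward=lambda s: float(s),
    max_len=3,
)
```

Every rollout starts from the same initial state `s0`, which plays the role of the initial state s^*_{i,0} of the training action sequence.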

[Examples of probabilistic action determination in 1-1-5]
When the space is so large that a full search is difficult, the action a_{i,t} must be determined probabilistically. Two methods are given as examples.

Example 1: Probabilistic selection based on the expected reward value r_{i,t}(s_{i,t}, a_{i,t})
The action determination unit 14 of Example 1 samples the action a_{i,t} according to, for example, the probability p(a_{i,t}) ∝ exp(r_{i,t}(s_{i,t}, a_{i,t})) or p(a_{i,t}) ∝ r_{i,t}(s_{i,t}, a_{i,t}), and thereby determines the action a_{i,t} in the state s_{i,t}. Here β1 ∝ β2 denotes that β1 is proportional to β2, and exp denotes the exponential function.
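The first sampling rule of Example 1, p(a) ∝ exp(r), can be sketched as below. The helper names `softmax_probs` and `sample_action` are hypothetical, and subtracting the maximum before exponentiating is a numerical-stability choice added here, not something stated in the source.

```python
import math
import random

def softmax_probs(expected_rewards):
    """Normalize exp(r) over the candidate actions so the values sum to 1."""
    m = max(expected_rewards)                      # shift for numerical stability
    weights = [math.exp(r - m) for r in expected_rewards]
    z = sum(weights)
    return [w / z for w in weights]

def sample_action(actions, expected_rewards, rng):
    """Draw one action with probability proportional to exp(expected reward)."""
    probs = softmax_probs(expected_rewards)
    return rng.choices(actions, weights=probs, k=1)[0]

# Usage: three candidate actions with expected rewards 1, 2 and 3.
action = sample_action(["left", "up", "right"], [1.0, 2.0, 3.0], random.Random(0))
```

The second rule, p(a) ∝ r, only defines a valid distribution when all expected reward values are non-negative and at least one is positive, which is presumably why the exponential form is listed first.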

Example 2: Random selection
The action determination unit 14 of Example 2 samples the action a_{i,t} according to a uniform distribution or a search algorithm such as the UCB algorithm of Non-Patent Document 6, regardless of the expected reward value r_{i,t}(s_{i,t}, a_{i,t}), and thereby determines the action a_{i,t} in the state s_{i,t}. In this case, the expected reward value r_{i,t}(s_{i,t}, a_{i,t}) need not be calculated.
[Non-Patent Document 6] P. Auer, N. Cesa-Bianchi, P. Fischer. Finite-time analysis of the multiarmed bandit problem, Machine Learning, pp. 235-256, 2002.
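For orientation, the UCB1 rule from the cited paper picks the candidate with the largest value of (empirical mean return) + sqrt(2 ln n / n_j), where n is the total number of trials and n_j the number of trials of candidate j. The sketch below shows that rule only; how the publication keeps the counts and mean returns per state is not described, so the `counts` and `means` arguments are assumptions.

```python
import math

def ucb1_choice(counts, means):
    """Return the index of the action chosen by the UCB1 rule.

    counts[j]: how many times action j has been tried so far (assumed bookkeeping)
    means[j]:  average return observed after trying action j (assumed bookkeeping)
    """
    # Try every action once before applying the confidence bound.
    for j, n_j in enumerate(counts):
        if n_j == 0:
            return j
    total = sum(counts)
    scores = [m + math.sqrt(2.0 * math.log(total) / n_j)
              for m, n_j in zip(means, counts)]
    return max(range(len(scores)), key=scores.__getitem__)
```

Rarely tried actions receive a large exploration bonus, so the rule keeps sampling them until their estimates become reliable.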

≪Calculation of system evaluation value≫
The training action sequence evaluation unit 15 uses each optimal action sequence ζ^A_i described above and each training action sequence ζ^*_i read from the training action sequence storage unit 11 to obtain the system evaluation value e(ζ^*_i|θ_step) of each training action sequence ζ^*_i. For example, it obtains e(ζ^*_i|θ_step) according to equation (1) given earlier.

The system evaluation value e(ζ^*_i|θ_step) in this example corresponds to the difference between the sum of the reward values of the series of states represented by the training action sequence ζ^*_i and the sum of the reward values of the series of states represented by the optimal action sequence ζ^A_i corresponding to ζ^*_i, both obtained using the reward function corresponding to the reward function parameter θ_step.

Alternatively, instead of obtaining the system evaluation value e(ζ^*_i|θ_step) according to equation (1), the training action sequence evaluation unit 15 may obtain it from the expected reward values r_{i,t}(s_{i,t}, a_{i,t}) (see equation (2)) corresponding to the optimal action sequence ζ^A_i and the training action sequence ζ^*_i.

The system evaluation value e(ζ^*_i|θ_step) in this example corresponds to the difference between the sum of the expected reward values of the series of states represented by the training action sequence ζ^*_i and the sum of the expected reward values of the series of states represented by the optimal action sequence ζ^A_i corresponding to ζ^*_i, both obtained using the reward function corresponding to the reward function parameter θ_step.
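The first form of the system evaluation value can be sketched as follows. The sign convention (training sequence minus optimal sequence) is an assumption, chosen so that a larger value means the training sequence is closer to optimal. The color names and reward values are illustrative only.

```python
def sequence_reward(states, reward):
    """Sum of the reward values of a series of states."""
    return sum(reward(s) for s in states)

def system_evaluation(training_states, optimal_states, reward):
    """e(zeta*_i | theta): reward of the training sequence minus that of the
    optimal sequence (assumed sign convention; the value is <= 0 when the
    optimal sequence really maximizes the reward)."""
    return sequence_reward(training_states, reward) - sequence_reward(optimal_states, reward)

# Illustrative reward function over colored states (hypothetical values).
reward_table = {"green": -3.0, "blue": 1.0}
e = system_evaluation(["green", "blue"], ["blue", "blue"], reward_table.__getitem__)
```

The second form has the same shape, with the per-state reward replaced by the expected reward value r_{i,t}(s_{i,t}, a_{i,t}).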

≪Calculation of order evaluation value≫
For each pair of training action sequences ζ^*_i, ζ^*_j, the training action sequence evaluation unit 15 uses the corresponding system evaluation values e(ζ^*_i|θ_step) and e(ζ^*_j|θ_step) to obtain and output the order evaluation value o^step_{i,j} representing their magnitude relationship. For example, it compares e(ζ^*_i|θ_step) and e(ζ^*_j|θ_step) for each pair ζ^*_i, ζ^*_j and outputs o^step_{i,j} = 1 when e(ζ^*_i|θ_step) is larger, o^step_{i,j} = -1 when it is smaller, and o^step_{i,j} = 0 when the two are equal.
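A minimal sketch of the order evaluation follows, assuming the values {-1, 0, 1} mentioned later in the modifications. The function names are hypothetical.

```python
from itertools import combinations

def order_evaluation(e_i, e_j):
    """o^step_{i,j}: 1 if e_i > e_j, -1 if e_i < e_j, 0 if they are equal."""
    if e_i > e_j:
        return 1
    if e_i < e_j:
        return -1
    return 0

def all_order_evaluations(system_values):
    """Order evaluation value for every pair (i, j) with i < j."""
    return {(i, j): order_evaluation(system_values[i], system_values[j])
            for i, j in combinations(range(len(system_values)), 2)}

pairs = all_order_evaluations([-4.0, -1.0, -1.0])
```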

<Details of step S13>
Details of step S13 are given by way of example.
The reward function adjustment unit 16 takes as input the training order evaluation value o^*_{i,j} and the order evaluation value o^step_{i,j} corresponding to each pair ζ^*_i, ζ^*_j, compares o^*_{i,j} with o^step_{i,j} for each pair, and selects the pairs ζ^*_i, ζ^*_j for which the signs of o^*_{i,j} and o^step_{i,j} differ.

The reward function adjustment unit 16 extracts from the training action sequence storage unit 11 the pairs ζ^*_i, ζ^*_j for which the signs of the training order evaluation value o^*_{i,j} and the order evaluation value o^step_{i,j} differ, and updates the reward function parameter θ_step to the reward function parameter θ_{step+1} according to equation (3).

Here γ in equation (3) is a predetermined weighting coefficient, for example a real number satisfying 0 < γ < 1. γ acts as the learning rate, and a value of roughly 0.01 < γ < 0.1 is desirable. γ may be adjusted according to the behavior observed after the reward function parameter is updated.

For example, when o^*_{i,j} = 1 and e(ζ^*_i|θ_step) < e(ζ^*_j|θ_step) (that is, o^step_{i,j} ≠ 1), obtaining o^{step+1}_{i,j} = 1 requires updating θ_step to θ_{step+1} in the direction that increases e(ζ^*_i|θ_step) - e(ζ^*_j|θ_step). It therefore suffices to move θ_step by γ times the partial derivative of e(ζ^*_i|θ_step) - e(ζ^*_j|θ_step) with respect to θ_step.

Conversely, when o^*_{i,j} = -1 and e(ζ^*_i|θ_step) > e(ζ^*_j|θ_step) (that is, o^step_{i,j} = 1), obtaining o^{step+1}_{i,j} = -1 requires updating θ_step to θ_{step+1} in the direction that decreases e(ζ^*_i|θ_step) - e(ζ^*_j|θ_step). It therefore suffices to move θ_step by γ times the same partial derivative in the opposite direction.

Combining these two cases gives equation (3).
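Read together with the summary of the update vector that follows, the two cases suggest the form below for equation (3). This is a reconstruction from the surrounding text, not the equation as printed in the publication. The index set M_step, denoting the pairs whose signs of o^*_{i,j} and o^step_{i,j} differ, is notation introduced here.

```latex
\theta_{step+1} = \theta_{step} + \gamma \, \nabla\theta_{step}^{pref},
\qquad
\nabla\theta_{step}^{pref}
  = \sum_{(i,j) \in M_{step}} o^{*}_{i,j} \,
    \frac{\partial}{\partial \theta_{step}}
    \Bigl( e(\zeta^{*}_{i} \mid \theta_{step}) - e(\zeta^{*}_{j} \mid \theta_{step}) \Bigr)
```

With o^*_{i,j} = 1 the pair contributes a step that increases e(ζ^*_i|θ_step) - e(ζ^*_j|θ_step), and with o^*_{i,j} = -1 a step that decreases it, which matches the two cases above.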

As described above, the reward function adjustment unit 16 of this example considers each pair of two training action sequences ζ^*_i, ζ^*_j for which the signs of the training order evaluation value o^*_{i,j} and the order evaluation value o^step_{i,j} differ. For each such pair it multiplies the partial derivative, with respect to the reward function parameter θ_step, of the value e(ζ^*_i|θ_step) - e(ζ^*_j|θ_step) obtained by subtracting the system evaluation value of one sequence from that of the other, by the training order evaluation value o^*_{i,j} of the pair. It obtains the update vector γ∇θ_step^pref corresponding to the sum of these products, and updates the reward function by updating the reward function parameter θ_step with the update vector γ∇θ_step^pref.
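The update described above can be sketched in plain Python as follows. The sketch takes the partial derivatives of each system evaluation value as given. For a reward that is linear in per-state feature vectors (an assumption, not confirmed by this passage) that derivative would be the feature sum of the training sequence minus the feature sum of its optimal sequence. Treating a zero order evaluation value as a sign of its own is also an assumption.

```python
def sign(x):
    """Return -1, 0 or 1 according to the sign of x."""
    return (x > 0) - (x < 0)

def preference_update(theta, grads, o_train, o_step, gamma=0.05):
    """One preference-based update of the reward function parameter.

    theta:   current parameter vector (list of floats)
    grads:   grads[i] is the partial derivative of e(zeta*_i | theta) w.r.t. theta
    o_train: {(i, j): training order evaluation value o*_{i,j}}
    o_step:  {(i, j): order evaluation value o^step_{i,j}}
    gamma:   learning rate (the text suggests roughly 0.01 to 0.1)
    """
    update = [0.0] * len(theta)
    for (i, j), o_star in o_train.items():
        if sign(o_star) == sign(o_step[(i, j)]):
            continue                      # pair already ordered correctly
        for k in range(len(theta)):
            update[k] += o_star * (grads[i][k] - grads[j][k])
    return [t + gamma * u for t, u in zip(theta, update)]

# Usage: one mis-ordered pair (0, 1) with two reward parameters.
new_theta = preference_update(
    theta=[0.0, 0.0],
    grads={0: [1.0, 0.0], 1: [0.0, 1.0]},
    o_train={(0, 1): 1},
    o_step={(0, 1): -1},
    gamma=0.1,
)
```

Pairs whose order is already reproduced contribute nothing, so the parameter stops changing once every training order evaluation value is matched.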

In addition to the above method, the reward function parameter θ_step may also be updated by inverse reinforcement learning using highly evaluated action sequences. In that case, in order to update θ_step so that the highly evaluated action sequences are reproduced, the reward function adjustment unit 16 updates the reward function parameter θ_step according to equation (4).

Here γ and α in equation (4) are both predetermined weighting coefficients, for example real numbers satisfying 0 < γ < 1 and 0 < α < 1. γ is the learning rate, and a value of roughly 0.01 < γ < 0.1 is desirable. γ may be adjusted according to the behavior observed after the reward function parameter is updated. Increasing α places more weight on the highly evaluated action sequences, so the behavior approaches that of inverse reinforcement learning (IRL). Decreasing α makes the behavior of the method of the present invention (preference-based IRL) more dominant.

As described above, the reward function adjustment unit 16 of this example obtains the update vector γ∇θ_step^pref. This vector corresponds to the sum, over the pairs of training action sequences ζ^*_i, ζ^*_j for which the signs of the training order evaluation value o^*_{i,j} and the order evaluation value o^step_{i,j} differ, of the product of the partial derivative of e(ζ^*_i|θ_step) - e(ζ^*_j|θ_step) with respect to the reward function parameter θ_step and the training order evaluation value o^*_{i,j} of the pair. The reward function adjustment unit 16 further obtains the inverse reinforcement learning vector γα∇θ_step^clip. This vector corresponds to the sum of the vectors obtained by subtracting the sum of the vectors corresponding to each of the series of states represented by the optimal action sequence ζ^A_i from the sum of the vectors corresponding to each of the series of states represented by the corresponding training action sequence ζ^*_i.
The reward function adjustment unit 16 of this example updates the reward function by updating the reward function parameter θ_step with the update vector γ∇θ_step^pref and the inverse reinforcement learning vector γα∇θ_step^clip.
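From this description, equation (4) plausibly has the form below. This is a reconstruction, not the printed equation. The symbol φ(s) for the vector corresponding to a state s is notation introduced here. The range of i in the second sum (for example, only the highly evaluated training action sequences) cannot be recovered from the text and is left open.

```latex
\theta_{step+1}
  = \theta_{step}
  + \gamma \, \nabla\theta_{step}^{pref}
  + \gamma \alpha \, \nabla\theta_{step}^{clip},
\qquad
\nabla\theta_{step}^{clip}
  = \sum_{i} \Bigl( \sum_{t} \phi(s^{*}_{i,t}) - \sum_{t} \phi(s^{A}_{i,t}) \Bigr)
```

Setting α close to 0 leaves only the preference-based term, while a larger α pulls θ toward parameters under which the training action sequences themselves look optimal, as in conventional IRL.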

<Simulation>
[Color-gridworld]
To compare the above method with the conventional method, a color-gridworld is introduced. This extends the gridworld task frequently used in conventional RL/IRL so that it admits a variety of reward functions. The outline of the color-gridworld is explained with reference to FIG. 3. In the color-gridworld, at each time t the agent can move to any of the four neighboring squares (state s_t). Each time the agent transitions to a state s_t, it receives the evaluation value corresponding to the feature associated with s_t. Each state s_t of the color-gridworld has a specific color (color feature), and each agent (agents 1 to 3) has its own reward function (mapping between colors and evaluation values), as shown in FIG. 3B. For example, in FIG. 3B, for agent 1 green (horizontal hatching) corresponds to the evaluation value -3 and blue (diagonal hatching) corresponds to the evaluation value +1. In this example, a dedicated feature (visited feature) is assigned to any state the agent has already visited (white, no hatching, in FIG. 3B). This rule discourages the agent from remaining in highly evaluated states, and also makes it possible to examine what reward each method assigns to such a feature. If the history of visited states is retained under this rule, the total number of state sequences is the number of combinations in which each state takes the value 0 or 1, so for n states it grows as O(2^n) and explodes. The agent's policy is therefore determined with a Monte Carlo method, which works reasonably well even when the number of state sequences is enormous. The training data consists of action sequences generated by training agents whose reward functions and initial states are set at random, with order evaluations assigned by an evaluation agent. The estimated reward function is evaluated on a color-gridworld different from the one used for learning.
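A color-gridworld of this kind can be sketched as below. The class and its names are hypothetical. Clamping moves at the border and treating the start square as already visited are assumptions, since the text does not specify either. The reward table uses the agent 1 values quoted for FIG. 3B and the visited-feature reward of -1 from the simulation settings.

```python
class ColorGridworld:
    """Grid of colored squares; a visited square switches to the 'visited' feature."""

    MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

    def __init__(self, colors, start):
        self.colors = colors          # 2-D list of color names
        self.pos = start              # (row, column)
        self.visited = {start}        # assumption: the start square counts as visited

    def feature(self, cell):
        """Feature of a square: its color, or 'visited' once the agent has been there."""
        if cell in self.visited:
            return "visited"
        return self.colors[cell[0]][cell[1]]

    def step(self, move):
        """Move one square, clamped at the border (assumption); return (cell, feature)."""
        dr, dc = self.MOVES[move]
        row = min(max(self.pos[0] + dr, 0), len(self.colors) - 1)
        col = min(max(self.pos[1] + dc, 0), len(self.colors[0]) - 1)
        cell = (row, col)
        feat = self.feature(cell)     # feature observed on arrival
        self.visited.add(cell)
        self.pos = cell
        return cell, feat

# Reward function of one agent as a mapping from feature to evaluation value.
AGENT1_REWARD = {"green": -3, "blue": 1, "visited": -1}

env = ColorGridworld([["green", "blue"], ["blue", "green"]], start=(0, 0))
total = sum(AGENT1_REWARD[env.step(m)[1]] for m in ["right", "left"])
```

Because the feature of a square depends on the visit history, the effective state is the position plus the set of visited squares, which is the source of the O(2^n) growth mentioned above.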

[Simulation results]
Using the proposed method described above, the reward function of the evaluation agent that assigned the evaluations is estimated, taking training action sequences with order evaluations as input. The method of Non-Patent Document 5 (MaxEnt IRL) is used as the conventional method for comparison. Because this conventional method cannot handle order evaluation values, it estimates the reward function of the evaluation agent from action sequences generated by the evaluation agent itself. The simulation is run five times for each method, and the methods are compared using the average of their evaluations. The parameters are set as follows. The maximum length T of each action sequence is 7, the number of Monte Carlo samples generated by each agent is 3000, the number of samples generated by the evaluator is 5000, and the number of IRL iterations is at most 50. The reward function parameters range from -4 to +4, and for all agents the reward corresponding to the feature assigned to visited states is -1. The learning agent is assumed not to know this setting.

Table 1 shows the average error of the estimated reward function relative to the correct parameters (reward function distance). The evaluation was performed for 5 colors and for 15 colors, where the number of colors equals the number of features. Table 1 shows that the proposed method, which estimates the reward function from data of varying appropriateness using order evaluations, achieves a smaller error than the conventional method, which estimates it using only the evaluation agent.

<Features of this embodiment>
Using the method of this embodiment, the reward function can be estimated from training action sequences more appropriately than before. As a result, for example, a robot can learn how to converse from dialogues between people and thereby carry on natural dialogues. In addition, by estimating for each individual the evaluation function that describes what a user interacting with the system wants, more appropriate recommendations and information presentation become possible. In this embodiment, the reward function of the evaluation agent that assigned the training order evaluations is estimated based on training action sequences of differing appropriateness and their training order evaluations. The only data required besides the action sequences themselves is a training order evaluation for each pair of training action sequences. This is easier to provide than assigning an absolute score to each sequence or a total order over all of the data.

<Modifications>
The present invention is not limited to the embodiment described above. For example, the reward function is not limited to the one described above, and any other function of a state (for example, a vector corresponding to the state) and a reward function parameter may be used as the reward function.

The system evaluation value is not limited to those described above, and other system evaluation values may be used.

The training order evaluation values and the order evaluation values need not take values in {-1, 0, 1}. A predetermined negative value may be used in place of -1 and a predetermined positive value in place of 1. The method of updating the reward function parameter and the reward function is also not limited to the one described above. It suffices that the reward function parameter or the reward function is updated so that the difference between the set of order evaluation values obtained with the updated reward function and the set of training order evaluation values is smaller than the difference between the set of order evaluation values obtained with the reward function before the update and the set of training order evaluation values.

The various processes described above need not be executed in time series in the order described. They may be executed in parallel or individually, depending on the processing capability of the device that executes them or as required. Needless to say, other changes may be made as appropriate without departing from the spirit of the present invention.

When the above configuration is implemented by a computer, the processing content of the functions that each device should have is described by a program. Executing this program on a computer realizes the above processing functions on the computer. The program describing this processing content can be recorded on a computer-readable recording medium. An example of a computer-readable recording medium is a non-transitory recording medium. Examples of such a recording medium include a magnetic recording device, an optical disc, a magneto-optical recording medium, and a semiconductor memory.

The program is distributed, for example, by selling, transferring, or lending a portable recording medium such as a DVD or CD-ROM on which the program is recorded. The program may also be distributed by storing it in a storage device of a server computer and transferring it from the server computer to other computers via a network.

A computer that executes such a program first stores, for example, the program recorded on the portable recording medium or the program transferred from the server computer in its own storage device. When executing a process, the computer reads the program stored in its own storage device and executes the process according to the program it has read. As another form of execution, the computer may read the program directly from the portable recording medium and execute processes according to it. The computer may also execute processes according to the received program each time the program is transferred to it from the server computer.

In the above embodiment, the processing functions of the device are realized by executing a predetermined program on a computer, but at least some of these processing functions may be realized by hardware.

DESCRIPTION OF SYMBOLS
1 Reward function estimation device
11 Training action sequence storage unit
15 Training action sequence evaluation unit
16 Reward function adjustment unit

Claims

A training action sequence storage unit that stores a set of training action sequences, each representing a series of states that transition according to actions and the action in each state, and a set of training order evaluation values representing the appropriateness of each of the training action sequences,
A training action sequence evaluation unit that obtains, using a reward function for obtaining a reward value for a state, order evaluation values corresponding to the reward values of the series of states represented by the training action sequences,
A reward function adjustment unit that updates the reward function based on a difference between the set of order evaluation values and the set of training order evaluation values,
A reward function estimation device comprising the above units.

The reward function estimation device according to claim 1,
Until the predetermined end condition is satisfied, the process by the training action sequence evaluation unit and the process by the reward function adjustment unit are repeated,
The reward function adjusting unit outputs information representing the reward function when the predetermined termination condition is satisfied,
A reward function estimation device characterized by the above.

The reward function estimation device according to claim 1 or 2,
Each of the training order evaluation values corresponds to a pair of two training action sequences belonging to the set of training action sequences, and represents which of the two training action sequences forming the pair is more appropriate,
Each of the order evaluation values corresponds to a pair of two training action sequences belonging to the set of training action sequences, and represents which of the two training action sequences forming the pair has a reward value, or an expected reward value, indicating a higher evaluation,
A reward function estimation device characterized by the above.

The reward function estimation device according to claim 3,
The training action sequence evaluation unit
selects, from a set of action sequences each representing a series of states reached by actions from the same initial state as the series of states represented by the training action sequence and the action in each state, the action sequence that maximizes the sum of the reward values of the series of states obtained by the reward function, and takes the selected action sequence as the optimal action sequence corresponding to the training action sequence,
obtains, as the system evaluation value of the training action sequence, either the difference between the sum of the reward values of the series of states represented by the training action sequence and the sum of the reward values of the series of states represented by the optimal action sequence corresponding to the training action sequence, both obtained using the reward function, or the difference between the sum of the expected reward values of the series of states represented by the training action sequence and the sum of the expected reward values of the series of states represented by the optimal action sequence corresponding to the training action sequence,
and obtains the order evaluation value representing the magnitude relationship between the two system evaluation values corresponding to the two training action sequences forming the pair,
A reward function estimation device characterized by the above.

The reward function estimation device according to claim 4,
The reward function corresponds to the state and a reward function parameter that is a vector;
Each of the training order evaluation values is a positive value when one of the two training action sequences forming the pair is more appropriate than the other, a negative value when the other is more appropriate than the one, and zero when the one and the other are equally appropriate,
Each of the order evaluation values is a positive value when the system evaluation value corresponding to the one of the two training action sequences forming the pair is larger than the system evaluation value corresponding to the other, a negative value when the system evaluation value corresponding to the one is smaller than the system evaluation value corresponding to the other, and zero when the system evaluation value corresponding to the one and the system evaluation value corresponding to the other are equal,
The reward function adjustment unit obtains an update vector corresponding to the sum, over the pairs of two training action sequences for which the signs of the training order evaluation value and the order evaluation value differ, of the product of the partial derivative, with respect to the reward function parameter, of the value obtained by subtracting the system evaluation value corresponding to the other from the system evaluation value corresponding to the one, and the training order evaluation value corresponding to that pair, and updates the reward function by updating the reward function parameter with the update vector,
A reward function estimation device characterized by the above.

The reward function estimation device according to claim 4,
The reward function corresponds to a vector corresponding to the state and a reward function parameter that is a vector,
Each of the training order evaluation values is a positive value when one of the two training action sequences forming the pair is more appropriate than the other, a negative value when the other is more appropriate than the one, and zero when the one and the other are equally appropriate,
Each of the order evaluation values is a positive value when the system evaluation value corresponding to the one of the two training action sequences forming the pair is larger than the system evaluation value corresponding to the other, a negative value when the system evaluation value corresponding to the one is smaller than the system evaluation value corresponding to the other, and zero when the system evaluation value corresponding to the one and the system evaluation value corresponding to the other are equal,
The reward function adjustment unit obtains an update vector corresponding to the sum, over the pairs of two training action sequences for which the signs of the training order evaluation value and the order evaluation value differ, of the product of the partial derivative, with respect to the reward function parameter, of the value obtained by subtracting the system evaluation value corresponding to the other from the system evaluation value corresponding to the one, and the training order evaluation value corresponding to that pair,
obtains an inverse reinforcement learning vector corresponding to the sum of the vectors obtained by subtracting the sum of the vectors corresponding to each of the series of states represented by the optimal action sequence corresponding to the training action sequence from the sum of the vectors corresponding to each of the series of states represented by the training action sequence,
and updates the reward function by updating the reward function parameter with the update vector and the inverse reinforcement learning vector,
A reward function estimation device characterized by the above.

A reward function estimation method executed by a reward function estimation device having a training action sequence storage unit, a training action sequence evaluation unit, and a reward function adjustment unit,
A set of training action sequences, each representing a series of states that transition according to actions and the action in each state, and a set of training order evaluation values representing the appropriateness of each of the training action sequences are stored in the training action sequence storage unit,
The training action sequence evaluation unit obtains, using a reward function for obtaining a reward value for a state, order evaluation values corresponding to the reward values of the series of states represented by the training action sequences,
The reward function adjustment unit updates the reward function based on a difference between the set of order evaluation values and the set of training order evaluation values,
A reward function estimation method characterized by the above.

A program for causing a computer to function as each unit of the reward function estimation device according to any one of claims 1 to 6.