JP2022182593A

JP2022182593A - Inverse reinforcement learning device, method and program

Info

Publication number: JP2022182593A
Application number: JP2021090234A
Authority: JP
Inventors: 研一中里; Kenichi Nakazato
Original assignee: Bosch Corp
Current assignee: Bosch Corp
Priority date: 2021-05-28
Filing date: 2021-05-28
Publication date: 2022-12-08

Abstract

To provide an inverse reinforcement learning device, method, and program that make learning of a reward system more efficient. SOLUTION: An inverse reinforcement learning device (10) comprises: a policy determination unit (111) that determines a policy (π) of action on the basis of a given trajectory (Sg) of actions; and a reward determination unit (112) that determines a reward (r) given from an environment for an action of an agent, so as to maximize an expected reward (J) calculated from the policy (π) and a state value (V). The reward determination unit (112) calculates the state value (V) by mixing a state value (V_t) with an added value (R_t) that is calculated using the reward (r) so as to approximate the state value (V_t), and updates the reward (r) so as to maximize the expected reward (J) with respect to the calculated state value (V). SELECTED DRAWING: Figure 1

Description

The present invention relates to an inverse reinforcement learning device, an inverse reinforcement learning method, and a program.

Conventionally, reinforcement learning has been used to accomplish a given task. Reinforcement learning evaluates the actions of an agent, in an environment where a task is given, by the reward given from the environment, and learns a policy that maximizes the cumulative reward of a series of actions. For example, reinforcement learning is applied to games, motor control, and automated driving control of vehicles (see Patent Documents 1 and 2).

Patent Document 1: Japanese Patent Application Laid-Open No. 2018-63602
Patent Document 2: Japanese Patent Application Laid-Open No. 2020-144483

Inverse reinforcement learning, on the other hand, is used to model the skill of an expert. From the trajectory of the expert's actions, it learns the reward system, that is, which actions are valued and rewarded. Reward systems, however, are generally complex. A method has been needed that can efficiently learn a reward system approximating that of the expert.

An object of the present invention is to make learning of a reward system more efficient.

One aspect of the present invention is an inverse reinforcement learning device (10) that determines a reward (r) given from an environment for an action of an agent, based on a given trajectory (Sg) of actions. The inverse reinforcement learning device (10) includes a policy determination unit (111) that determines a policy (π) of action from the trajectory (Sg), and a reward determination unit (112) that determines the reward (r) so as to maximize an expected reward (J) calculated from the policy (π) and a state value (V) of the environment, as shown in equation (1) below. As shown in equation (2) below, the reward determination unit (112) calculates the state value (V) by mixing, into a state value (V_t) that evaluates the state (s) at time t, an added value (R_t) calculated using the reward (r) so as to approximate the state value (V_t), and updates the reward (r) so as to maximize the expected reward (J) with respect to the calculated state value (V).

[E represents a function that outputs an expected value. π(s|a) represents the policy (π) that selects action (a) in state (s). V(s) represents the state value (V) that evaluates state (s). V_t represents the state value (V) that evaluates the state (s) at time t. R_t represents the added value at time t. τ represents the ratio at which R_t is mixed and satisfies 0≦τ≦1.]

Another aspect of the present invention is an inverse reinforcement learning method that determines a reward (r) given from an environment for an action of an agent, based on a given trajectory (Sg) of actions. The inverse reinforcement learning method includes a step of determining a policy (π) of action from the trajectory (Sg), and a step of determining the reward (r) so as to maximize an expected reward (J) calculated from the policy (π) and a state value (V) of the environment, as shown in equation (1) above. The step of determining the reward (r) includes a step of calculating the state value (V) by mixing, into a state value (V_t) that evaluates the state (s) at time t, an added value (R_t) calculated using the reward (r) so as to approximate the state value (V_t), as shown in equation (2) above, and a step of updating the reward (r) so as to maximize the expected reward (J) with respect to the calculated state value (V).

Another aspect of the present invention is a program for causing a computer to execute an inverse reinforcement learning method that determines a reward (r) given from an environment for an action of an agent, based on a given trajectory (Sg) of actions. The inverse reinforcement learning method includes a step of determining a policy (π) of action from the trajectory (Sg), and a step of determining the reward (r) so as to maximize an expected reward (J) calculated from the policy (π) and a state value (V) of the environment, as shown in equation (1) above. The step of determining the reward (r) includes a step of calculating the state value (V) by mixing, into a state value (V_t) that evaluates the state (s) at time t, an added value (R_t) calculated using the reward (r) so as to approximate the state value (V_t), as shown in equation (2) above, and a step of updating the reward (r) so as to maximize the expected reward (J) with respect to the calculated state value (V).

According to the present invention, learning of a reward system can be made more efficient.

FIG. 1 is a block diagram showing the configuration of the inverse reinforcement learning device of the present embodiment.
FIG. 2 is a flowchart showing the inverse reinforcement learning processing.
FIG. 3 is a diagram showing an example of the trajectory of an expert's actions.
FIG. 4 is a diagram explaining the state transition model.

An embodiment of the inverse reinforcement learning device, inverse reinforcement learning method, and program of the present invention is described below with reference to the drawings. The following description is one (representative) example of the present invention, and the present invention is not limited to it.

FIG. 1 shows the configuration of an inverse reinforcement learning device 10 according to one embodiment of the present invention.
The inverse reinforcement learning device 10 includes a CPU (Central Processing Unit) 11 and a storage unit 12. The inverse reinforcement learning device 10 may further include an operation unit 13, a display unit 14, and a communication unit 15.

The CPU 11 reads a program from the storage unit 12 and executes it, thereby performing the inverse reinforcement learning processing described later. In the inverse reinforcement learning processing, the CPU 11 functions as a policy determination unit 111 and a reward determination unit 112.

The policy determination unit 111 determines a policy of action from the given trajectory of the expert's actions. The reward determination unit 112 determines the reward, from the policy determined by the policy determination unit 111 and the state value of the environment, so as to maximize the expected reward of a series of actions.

The storage unit 12 stores programs readable by the CPU 11, data used in executing the programs, and the like. A recording medium such as a hard disk can be used as the storage unit 12.

The operation unit 13 is a keyboard, a mouse, or the like. The operation unit 13 accepts a user's operation and outputs the content of the operation to the CPU 11.

The display unit 14 is a display or the like. The display unit 14 displays an operation screen, processing results of the CPU 11, and the like in accordance with display instructions from the CPU 11.

The communication unit 15 is an interface that communicates with an external computer via a network.

The inverse reinforcement learning device 10 can determine, by the inverse reinforcement learning processing, the reward (r) given from the environment from the trajectory of the actions of the expert to be imitated.
In this embodiment, the reward (r) is defined as a neural network having parameters (θ), as shown in equation (4). The parameters are the weights, biases, and the like set in the neural network.
(4) r = r(θ)
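As an illustration of equation (4), the following is a minimal sketch of a reward defined as a small neural network with parameters θ. The network shape (one hidden layer with tanh activation), the use of a feature vector per state, and the names `init_params` and `reward` are assumptions made for this example; the text does not specify the architecture.

```python
import numpy as np

def init_params(n_features, n_hidden, seed=0):
    """Create the parameters theta (weights and biases) of a one-hidden-layer network."""
    rng = np.random.default_rng(seed)
    return {
        "W1": rng.normal(0.0, 0.1, size=(n_features, n_hidden)),
        "b1": np.zeros(n_hidden),
        "W2": rng.normal(0.0, 0.1, size=(n_hidden,)),
        "b2": 0.0,
    }

def reward(theta, state_features):
    """Evaluate r(theta) for one state, given a feature vector for that state (assumed encoding)."""
    hidden = np.tanh(state_features @ theta["W1"] + theta["b1"])
    return float(hidden @ theta["W2"] + theta["b2"])
```

Updating the reward then amounts to changing the entries of `theta`; how the update direction is obtained from the expected reward (J) is outside this sketch.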

FIG. 2 is a flowchart of the inverse reinforcement learning processing.
First, the policy determination unit 111 acquires a group of trajectories (Sg) of the expert's actions, given together with the environment (step S1). The policy determination unit 111 may acquire the group of trajectories (Sg) from the storage unit 12 or from an external device on the network. A trajectory (Sg) is expressed as the set of states (s) of the environment traversed by a series of actions, as shown in equation (5).
(5) Sg = {(s_0, s_1, ..., s_n)}

FIG. 3 shows an example of a trajectory (Sg) of actions.
The trajectory L1 illustrated in FIG. 3 is the route taken when the expert moves through a maze from a start point Ps to a goal point Pg. The maze consists of an area 30 of multiple blocks, in which movement is possible one block at a time. Movement may be blocked by walls placed between blocks. Here, the area 30 is the given environment, and each block corresponds to a state (s) of the environment.

The policy determination unit 111 generates a state transition model from the trajectory (Sg). The state transition model is the distribution of the probabilities of transition from one state (s) of the environment to the next state (s). For example, the state transition model is generated as a state transition matrix in which the transition probabilities are tabulated. The policy determination unit 111 determines the policy (π) based on this state transition model (step S2). The policy (π) is the probability distribution of the action (a) selected in each state (s).
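One simple way to obtain such a state transition matrix is to count the transitions observed in the trajectories and normalize each row. The sketch below assumes that states are encoded as integer indices and that unvisited states fall back to a uniform distribution; both choices, and the name `transition_matrix`, are assumptions for illustration and are not stated in the text.

```python
import numpy as np

def transition_matrix(trajectories, n_states):
    """Estimate P[s, s_next] from trajectories given as lists of integer state indices."""
    counts = np.zeros((n_states, n_states))
    for trajectory in trajectories:
        for s, s_next in zip(trajectory[:-1], trajectory[1:]):
            counts[s, s_next] += 1.0
    row_sums = counts.sum(axis=1, keepdims=True)
    # Avoid dividing by zero for states that never appear as a source state.
    safe_sums = np.where(row_sums > 0, row_sums, 1.0)
    # Assumption: unvisited states get a uniform distribution over next states.
    return np.where(row_sums > 0, counts / safe_sums, 1.0 / n_states)
```

For a maze such as the one in FIG. 3, each block would be one index, and rows for blocks on the trajectory L1 would concentrate probability on the next block of the route.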

FIG. 4 is a diagram explaining the state transition model.
In the maze described above, the blocks on the trajectory L1 have a high state (s) value. In FIG. 4, the circle placed in each block represents the value of the state (s), and the darker the circle, the higher the value of the state (s). The policy determination unit 111 can determine the transition probability from each block (state) to the next block (state) so that the probability of transitioning to a block on the trajectory L1 is high.

Next, the reward determination unit 112 determines the reward (r) from the determined policy (π) so as to maximize the expected reward (J). The expected reward (J) is the cumulative reward that can be expected to be obtained in one episode. An episode is a series of actions that takes the environment from the initial state (s_0) to the final state (s_e). The expected reward (J) is calculated from the policy (π) and the state value (V), as shown in equation (1).

E[] above represents a function that outputs the expected value of the expression in []. π(s|a) represents the policy (π) that selects action (a) in state (s). V(s) represents the state value (V) that evaluates state (s).

In this embodiment, the state value (V) is defined as shown in equation (2). As shown in equation (2), the state value (V) is calculated by mixing the added value (R_t) into the state value (V_t) that evaluates the state (s) at time t. The added value (R_t) is calculated using the reward (r) so as to approximate the state value (V_t) at time t, as shown in equation (3).

V_t above represents the state value (V) that evaluates the state (s) at time t. R_t represents the added value at time t. τ, τ_P, τ_D, and τ_I each represent a coefficient of 0 or more and 1 or less. γ_E represents the discount rate of the reward (r) given for each action (a) and satisfies 0<γ_E≦1. t_e represents the time of the final state (s_e). In equation (3), the term containing the coefficient τ_P is called the proportional term, the term containing the coefficient τ_D the differential term, and the term containing the coefficient τ_I the integral term.
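The formula images for equations (2) and (3) are not reproduced in this text. For orientation only, one form that is consistent with the symbol definitions given here is sketched below. It assumes that "mixing at ratio τ" means a convex combination, and that the integral term is a discounted sum over the steps k of one episode; the index k and the notation r_k are introduced for this sketch and do not appear in the text. The published formulas may differ.

```latex
% Assumed form of equation (2): mixing V_t and R_t at ratio tau
V = (1 - \tau)\, V_t + \tau\, R_t, \qquad 0 \le \tau \le 1

% Assumed form of equation (3): proportional, differential, and integral terms
R_t = \tau_P\, r^{*} + \tau_D\, \frac{dr}{dt}
      + \tau_I \sum_{k=0}^{t_e} \gamma_E^{\,t_e - k}\, r_k
```

Under this reading, τ = 0 leaves the state value (V_t) unchanged and τ = 0.5 weights the two contributions equally.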

In the proportional term, r^*, which is multiplied by the coefficient τ_P, represents the reward (r) given for the action (a) in the state (s) at time t in one episode. For example, when t=5, the reward determination unit 112 can use, in the proportional term, the reward (r) given for the action (a) in the state (s_5).

In the differential term, dr/dt, which is multiplied by the coefficient τ_D, represents the differential value of the reward (r) given within a fixed period from a state before time t to the state (s) at time t in one episode. Adding the differential term allows the state value (V) to be determined with the change of the reward (r) over time taken into account. For example, when t=5, the reward determination unit 112 can use, in the differential term, the differential value of the reward (r) given over steps 3 to 5, from the state (s_3) to the state (s_5).

In the integral term, the integral of r, which is multiplied by the coefficient τ_I, represents the cumulative value of the reward (r) given during one episode. In this cumulative value, the reward (r) for the action (a) in each state is discounted by the discount rate γ_E. The discount rate γ_E is raised to the power (t_e − t), so the closer a state is to the final state (s_e), the less its reward (r) is discounted.
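The three terms can be illustrated with a short sketch. It assumes one reward value per step of an episode, approximates the differential value by a finite difference over a window of steps (a window of 2 matches the s_3 to s_5 example for t=5), and assumes that mixing at ratio τ is a convex combination. The function names, the `window` parameter, and these choices are assumptions for illustration, since the formula images are not reproduced in this text.

```python
def added_value(rewards, t, tau_p=1.0, tau_d=1.0, tau_i=1.0, gamma_e=0.9, window=2):
    """Sketch of the added value R_t; rewards[k] is the reward at step k of one episode."""
    t_e = len(rewards) - 1  # time of the final state
    # Proportional term: reward at time t.
    proportional = rewards[t]
    # Differential term: finite difference over the last `window` steps (assumed approximation).
    t0 = max(t - window, 0)
    differential = (rewards[t] - rewards[t0]) / (t - t0) if t > t0 else 0.0
    # Integral term: rewards of the episode, each discounted by gamma_e ** (t_e - k).
    integral = sum(gamma_e ** (t_e - k) * r for k, r in enumerate(rewards))
    return tau_p * proportional + tau_d * differential + tau_i * integral

def mixed_state_value(v_t, r_t, tau):
    """Mix the added value into the state value at ratio tau (assumed convex combination)."""
    return (1.0 - tau) * v_t + tau * r_t
```

Setting `tau_p=0, tau_d=0, tau_i=1` keeps only the cumulative reward, which corresponds to the integral-only configuration described later in the text.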

To determine the reward (r), the reward determination unit 112 first optimizes the added value (R_t) in equation (2) (step S3). Specifically, the reward determination unit 112 updates the reward (r) shown in equation (4) by updating the parameters (θ) so as to maximize the expected reward (J). The optimized added value (R_t) is obtained by calculating the added value (R_t) using the updated reward (r).

Next, the reward determination unit 112 optimizes the state value (V_t) in equation (2) with respect to the optimized added value (R_t) (step S4). This optimization is performed by updating the state value (V_t) using the updated reward (r), as shown in equation (6).

s_t represents the state (s) of the environment at time t. s_{t+1} represents the state (s) one step after the transition from the state (s_t). r_{t+1} represents the reward (r) given from the environment in response to the action (a) in the state (s_t). α represents the learning rate and satisfies 0<α≦1. γ represents the discount rate and satisfies 0<γ≦1. max represents a function that outputs the maximum of the state values (V) of the next states reachable from the state (s_{t+1}).
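The symbols defined above (learning rate α, discount rate γ, reward r_{t+1}, and a maximum over candidate state values) match the shape of a temporal-difference style update. Since the image of equation (6) is not reproduced in this text, the sketch below is only one plausible reading: it moves V(s_t) toward r_{t+1} plus γ times the largest state value among a set of candidate successor states. The function name, the dictionary representation of V, and the `successors` argument are assumptions for illustration.

```python
def update_state_value(V, s_t, r_next, successors, alpha=0.1, gamma=0.9):
    """One temporal-difference style update of V[s_t] (assumed form of the update rule)."""
    # Largest state value among the candidate successor states.
    best_next = max(V[s] for s in successors)
    # Move V[s_t] a fraction alpha toward the target r_next + gamma * best_next.
    V[s_t] += alpha * (r_next + gamma * best_next - V[s_t])
    return V[s_t]
```

For example, with `V = {0: 0.0, 1: 1.0, 2: 2.0}`, calling `update_state_value(V, 0, 1.0, [1, 2], alpha=0.5, gamma=0.5)` moves `V[0]` halfway toward the target 1.0 + 0.5 × 2.0.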

If the expected reward (J) shown in equation (1) has not converged after the state value (V_t) is optimized, the reward (r) has not converged either. In this case (step S5: NO), the reward determination unit 112 alternately repeats the optimization of the added value (R_t) (step S3) and the optimization of the state value (V_t) (step S4). The reward (r) is thereby progressively optimized so that the expected reward (J) is maximized. When the expected reward (J) converges, the reward (r) converges as well. In this case (step S5: YES), the inverse reinforcement learning processing ends.
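The control flow of steps S3 to S5 can be sketched as a loop that alternates the two optimizations and stops when the expected reward stops changing. The three callables, the tolerance `tol`, and the iteration cap `max_iters` are hypothetical placeholders introduced here; the text does not specify a convergence criterion, so the absolute-difference test is an assumption.

```python
def alternate_until_converged(update_reward, update_value, expected_reward,
                              tol=1e-6, max_iters=1000):
    """Alternate steps S3 and S4 until the expected reward J changes by less than tol."""
    previous = expected_reward()
    for iteration in range(1, max_iters + 1):
        update_reward()  # step S3: update the reward, and with it the added value R_t
        update_value()   # step S4: update the state value V_t with the updated reward
        current = expected_reward()
        if abs(current - previous) < tol:  # step S5: YES (assumed convergence test)
            return current, iteration
        previous = current
    # Iteration cap reached without meeting the tolerance.
    return previous, max_iters
```

The function returns the last expected reward and the number of iterations performed, which makes it easy to check whether the cap was hit.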

In this way, the reward determination unit 112 optimizes the reward (r) by iterating the update of the added value (R_t), that is, the update of the reward (r), and the update of the state value (V_t) using the updated reward (r). This iteration is possible because the state value (V) is defined, as shown in equation (2), as two separate terms: the state value (V_t) at time t, and the added value (R_t) that approximates the state value (V_t) using the reward (r). With equations (2) and (3), the reward (r) that maximizes the expected reward (J) can be calculated easily, which makes inverse reinforcement learning more efficient.

In the inverse reinforcement learning processing described above, the reward determination unit 112 can adjust the ratio (τ) at which the added value (R_t) is mixed. For example, to mix the two in equal proportion, τ can be set to 0.5. To make the proportion of the added value (R_t) larger than that of the original state value (V_t), τ can be set to a value such as 0.7.

The reward determination unit 112 preferably reduces the ratio (τ) at which the added value (R_t) is mixed as the number of updates of the reward (r) increases. The reward determination unit 112 can eventually reduce the ratio (τ) to 0. Reducing the ratio (τ) shortens the learning time of the reward (r) while converging to the same result as when the added value (R_t) is not used.

The reward determination unit 112 may decrease the ratio (τ) monotonically, or may determine freely how much the ratio (τ) is decreased relative to the number of updates. The reward determination unit 112 may also temporarily increase the ratio (τ) in the course of decreasing it.
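A monotonically decreasing schedule for the ratio (τ) could look like the following sketch, which decays τ linearly from a starting value to 0 over a fixed number of reward updates. The linear shape and the parameters `tau_start` and `decay_steps` are assumptions for illustration; the text leaves the shape of the schedule open.

```python
def mixing_ratio(update_count, tau_start=0.7, decay_steps=100):
    """Linearly decay the mixing ratio tau from tau_start to 0 over decay_steps updates."""
    remaining = max(0.0, 1.0 - update_count / decay_steps)
    return tau_start * remaining
```

Once `update_count` reaches `decay_steps`, the ratio stays at 0, so the state value is then computed without the added value (R_t).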

The reward determination unit 112 can also adjust the proportions of the proportional term, the differential term, and the integral term by adjusting the coefficients τ_P, τ_D, and τ_I, and can set a proportion to 0. For example, setting τ_P=0, τ_D=0, and τ_I=1 mixes only the integral term, that is, the cumulative reward, into the state value (V_t). To emphasize the state (s) at time t, the proportional term can be added by setting τ_P=1; to take the change over time into account, the differential term can be added by setting τ_D=1.

As described above, according to this embodiment, the policy (π) is determined from the given trajectory (Sg), and the state value (V) is updated so that the expected reward (J) is maximized for this policy (π). In doing so, as shown in equation (2), the state value (V) is calculated by mixing, at a predetermined ratio (τ), the added value (R_t) calculated using the reward (r) into the state value (V_t) at time t.

This makes it possible to iterate the update of the added value (R_t) and the update of the state value (V_t) so that the expected reward (J) is maximized. Because updating the added value (R_t) updates the reward (r), the reward (r) can be optimized by maximizing the expected reward (J). Equation (2) simplifies the calculation, so even a complex reward (r) can be optimized easily, which makes inverse reinforcement learning more efficient.

Preferred embodiments of the present invention have been described above, but the present invention is not limited to these embodiments. Various modifications are possible within the scope of the present invention.
For example, although the added value (R_t) is defined by equation (3), it is not limited to this definition as long as the state value (V_t) can be approximated using the reward (r).

Because the state value (V) is calculated from the reward (r), dV/dt or V(s_t)−V(s) may be used in equation (3) in place of dr/dt. dV/dt represents the differential value of the state value (V) within a fixed period from a state before time t to the state (s) at time t. V(s_t)−V(s) represents the change in the state value (V) from a state before time t to the state (s) at time t.

Although the reward (r) is defined above by a neural network, it may instead be defined as a reward function approximated by a linear function.

The inverse reinforcement learning device 10 can be used in a variety of technical fields, and the field is not particularly limited. For example, the inverse reinforcement learning device 10 can be used for automated driving control that determines the travel route of a vehicle while avoiding hazards, for motor drive control, for control of game characters, and the like.

A recording medium on which a program for causing a computer to execute the inverse reinforcement learning method of the present invention is recorded may also be provided. The recording medium is not particularly limited as long as it is readable by a computer such as a CPU; a semiconductor memory, a magnetic disk, an optical disk, or the like can be used.

10: inverse reinforcement learning device, 11: CPU, 111: policy determination unit, 112: reward determination unit, 12: storage unit

Claims

An inverse reinforcement learning device (10) that determines a reward (r) given from an environment for an action of an agent, based on a given trajectory (Sg) of actions, the device comprising:
a policy determination unit (111) that determines a policy (π) of action from the trajectory (Sg); and
a reward determination unit (112) that determines the reward (r) so as to maximize an expected reward (J) calculated from the policy (π) and a state value (V) of the environment, as shown in equation (1) below,
wherein the reward determination unit (112)
calculates the state value (V) by mixing, into a state value (V_t) that evaluates the state (s) at time t, an added value (R_t) calculated using the reward (r) so as to approximate the state value (V_t), as shown in equation (2) below, and
updates the reward (r) so as to maximize the expected reward (J) with respect to the calculated state value (V).

The inverse reinforcement learning device (10) according to claim 1, wherein the reward determination unit (112) calculates the added value (R_t), as shown in equation (3) below, by adding together the reward (r) given for the action (a) in the state (s) at time t in one episode, the differential value of the reward (r) given from a state before the time t to the state (s) at the time t in the one episode, and the cumulative value of the reward (r) given during the one episode.

[τ_P, τ_D, and τ_I represent coefficients of 0 or more and 1 or less. r^* represents the reward (r) given for the action (a) in the state (s) at time t. dr/dt represents the differential value of the reward (r) given from a state before time t to the state (s) at time t. γ_E represents a discount rate and satisfies 0<γ_E≦1. t_e represents the time of the final state of one episode.]

The inverse reinforcement learning device (10) according to claim 1 or 2, wherein the reward determination unit (112)
optimizes the added value (R_t) by updating the reward (r) so as to maximize the expected reward (J),
optimizes the state value (V_t) by updating the state value (V_t) using the updated reward (r), and
repeats the updating of the added value (R_t) and the updating of the state value (V_t) until the reward (r) converges.

The inverse reinforcement learning device (10) according to any one of claims 1 to 3, wherein the reward determination unit (112) adjusts the ratio (τ) at which the added value (R_t) is mixed.

The inverse reinforcement learning device (10), wherein the reward determination unit (112) reduces the ratio (τ) at which the added value (R_t) is mixed as the number of updates of the reward (r) increases.

The inverse reinforcement learning device (10) according to any one of claims 1 to 5, wherein the reward determination unit (112)
defines the reward (r) as a neural network having parameters (θ), and
updates the reward (r) by updating the parameters (θ) so as to maximize the expected reward (J).

An inverse reinforcement learning method that determines a reward (r) given from an environment for an action of an agent, based on a given trajectory (Sg) of actions, the method comprising:
a step of determining a policy (π) of action from the trajectory (Sg); and
a step of determining the reward (r) so as to maximize an expected reward (J) calculated from the policy (π) and a state value (V) of the environment, as shown in equation (1) below,
wherein the step of determining the reward (r) comprises:
a step of calculating the state value (V) by mixing, into a state value (V_t) that evaluates the state (s) at time t, an added value (R_t) calculated using the reward (r) so as to approximate the state value (V_t), as shown in equation (2) below; and
a step of updating the reward (r) so as to maximize the expected reward (J) with respect to the calculated state value (V).

A program for causing a computer to execute an inverse reinforcement learning method that determines a reward (r) given from an environment for an action of an agent, based on a given trajectory (Sg) of actions,
wherein the inverse reinforcement learning method comprises:
a step of determining a policy (π) of action from the trajectory (Sg); and
a step of determining the reward (r) so as to maximize an expected reward (J) calculated from the policy (π) and a state value (V) of the environment, as shown in equation (1) below,
and wherein the step of determining the reward (r) comprises:
a step of calculating the state value (V) by mixing, into a state value (V_t) that evaluates the state (s) at time t, an added value (R_t) calculated using the reward (r) so as to approximate the state value (V_t), as shown in equation (2) below; and
a step of updating the reward (r) so as to maximize the expected reward (J) with respect to the calculated state value (V).