JP2021082014A

JP2021082014A - Estimation device, training device, estimation method, training method, program, and non-transitory computer readable medium

Info

Publication number: JP2021082014A
Application number: JP2019209036A
Authority: JP
Inventors: 新一前田; Shinichi Maeda; バラドワジホマンガ; Bharadhwaj Homanga
Original assignee: Preferred Networks Inc
Current assignee: Preferred Networks Inc
Priority date: 2019-11-19
Filing date: 2019-11-19
Publication date: 2021-05-27

Abstract

PROBLEM TO BE SOLVED: To adapt from a simulation environment to a real environment. SOLUTION: An estimation device comprises one or more memories and one or more processors. The memory stores a trained model that can estimate a dynamics parameter. The one or more processors encode a state, encode a dynamics parameter of the environment on the basis of the dynamics parameter estimated by the trained model, and estimate an action on the basis of the encoded state and the encoded dynamics parameter. SELECTED DRAWING: Figure 1

Description

The present disclosure relates to an estimation device, a training device, an estimation method, a training method, a program, and a non-transitory computer-readable medium.

Attempts to use machine learning, in particular reinforcement learning, to optimize the control with which a control device or the like automatically performs a specific task, or to optimize how tasks are allocated, are now widespread.

Bernd Waschneck, et al., "Optimization of global production scheduling with deep reinforcement learning," Procedia CIRP, Vol. 72, pp. 1264-1269, 2018.

The present disclosure therefore provides an estimation device capable of appropriate estimation, and a training device that easily trains such an estimation device.

An estimation device according to one embodiment comprises one or more memories and one or more processors. The memory stores a trained model capable of estimating a dynamics parameter. The one or more processors encode a state, encode a dynamics parameter of an environment based on the dynamics parameter estimated by the trained model, and estimate an action based on the encoded state and the encoded dynamics parameter.

Block diagram of an estimation device according to an embodiment.
Flowchart showing the processing of the estimation device according to an embodiment.
Block diagram of a training device according to an embodiment.
Flowchart showing the processing of the training device according to an embodiment.
Flowchart showing the processing of the training device according to an embodiment.
Diagram showing results according to an embodiment.
Diagram showing results according to an embodiment.
Diagram showing results according to an embodiment.
Diagram showing an example hardware implementation according to an embodiment.

Embodiments of the present invention are described below with reference to the drawings. The drawings and the description of the embodiments are given as examples and do not limit the present invention. The model (trained model) provided in each component below may be a neural network such as an MLP (Multi-Layer Perceptron) or a CNN (Convolutional Neural Network), but is not limited to these; any model that appropriately yields output data for input data may be used, including models other than neural networks.

This embodiment uses an estimator that estimates the dynamics of a simulation environment and an estimator that estimates actions in the simulation environment governed by those dynamics. The aim is to make reinforcement learning, which requires many trials, easier to apply in a real environment. Running reinforcement learning, whose training needs a considerable number of trials, directly in a real environment takes too much time in practice, and selecting actions with a randomly initialized policy reduces safety. This embodiment performs training and estimation so as to address these problems.

(Estimation device)
FIG. 1 is a block diagram showing the functions of an estimation device according to one embodiment. The estimation device 1 of the present disclosure includes a first estimation unit 10 and a second estimation unit 12. The first estimation unit 10 estimates an action based on the dynamical conditions. The second estimation unit 12 estimates the dynamical situation.

The first estimation unit 10 includes a first encoder 100, a second encoder 102, and a decoder 104. In addition, the training device 2 may include an interface for inputting and outputting data and the like, and a storage unit for storing data, parameters, and the like.

For example, the first estimation unit 10 takes as input a state o of some object and a dynamics parameter η, and outputs an action a. These quantities are represented, for example, as vectors, but are not limited to this and may be represented as tensors of any dimension.

A state is, for example, the observed state of an object and of the controlled object in a given environment. It is, for example, information acquired by various sensors, and may include information such as the state of the controlled object and the state of an object on which the controlled object acts.

Dynamics refers, for example, to the dynamics of the environment of the space in which the controlled object and the object exist, and a parameter that characterizes these dynamics is called a dynamics parameter. A dynamics parameter may include, for example, the coefficient of friction between an object and the area on which it is placed, the inertial frame in which the object is placed, or the conditions under which a sensor senses the object and the controlled object. It may also include parameters indicating how the environment affects the controlled object and the object.

An action is, for example, the desirable action for the controlled object to take, or the desirable control to apply to it, given the input state in the space described by the input dynamics parameter. In other words, an action is, for example, an action conditioned on the dynamics parameter.

A policy is, for example, a function that outputs the desirable action for the controlled object to take, or the desirable control to apply to it, given the input state. A policy is generally expressed as a conditional probability distribution over actions or controls, conditioned on the state. Here, for example, the policy is expressed as a conditional probability distribution conditioned not only on the state but also on the dynamics parameter.

The first encoder 100 takes as input the state o_t^(e) in environment e at time t, encodes this state, and outputs its features.

The second encoder 102 takes as input the dynamics parameter η_e of environment e, encodes this dynamics parameter, and outputs its features.

Writing the transformation of the first encoder 100 as f_φ() and that of the second encoder 102 as M_ζ(), the latent state in environment e at time t can be written as Z_t = [f_φ(o_t^(e)), M_ζ(η_e)]. As described above, the first encoder 100 and the second encoder 102 encode the state and the dynamics parameter, respectively.
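The construction of the latent state Z_t can be illustrated with a minimal sketch. The single tanh layers, the weight values, and the function names below are hypothetical stand-ins chosen for illustration; the embodiment leaves the form of f_φ and M_ζ open.

```python
import math

def encode_state(obs, weights):
    """Hypothetical stand-in for f_phi: one tanh layer over the observation."""
    return [math.tanh(sum(w * x for w, x in zip(row, obs))) for row in weights]

def encode_dynamics(eta, weights):
    """Hypothetical stand-in for M_zeta: one tanh layer over the dynamics parameter."""
    return [math.tanh(sum(w * x for w, x in zip(row, eta))) for row in weights]

def latent_state(obs, eta, w_state, w_dyn):
    """Z_t = [f_phi(o_t), M_zeta(eta_e)]: concatenate the two feature vectors."""
    return encode_state(obs, w_state) + encode_dynamics(eta, w_dyn)

# Toy sizes (assumed): 3-d observation -> 2 features, 2-d dynamics parameter -> 1 feature.
w_state = [[0.1, -0.2, 0.3], [0.0, 0.5, -0.1]]
w_dyn = [[0.4, -0.3]]
obs, eta = [1.0, 0.5, -1.0], [0.8, 0.2]
z = latent_state(obs, eta, w_state, w_dyn)
print(len(z))  # number of state features plus number of dynamics features
```

The two encoders share no weights in this sketch, which mirrors the separation of state and dynamics parameter described above.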

Encoding the state separately from the dynamics parameter, rather than encoding them together, makes it possible to distinguish the state, which changes from moment to moment, from the dynamics parameter, which changes only slowly if at all. This allows efficient handling: in the same environment with unchanged dynamics parameters, for example, the same encoded dynamics parameter can be reused even if the lighting or the camera changes. The state changes from moment to moment, for example based on data sensed every unit time, whereas the dynamics parameter changes less than the state; it need not change every unit time and may be treated as fixed for a certain period. For this reason the state and the dynamics parameter are encoded separately.

That is, the state and the dynamics parameter need not be re-encoded at the same time. For example, an encoded value may continue to be used for a while, and the re-encoding frequency may differ between the state and the dynamics. When an encoded value already exists, the state may be encoded at a different interval from the dynamics parameter, for example a shorter one. In this embodiment, therefore, the first encoder 100 and the second encoder 102 are used to encode the state and the dynamics parameter.
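One way to realize the different re-encoding intervals is to cache the encoded dynamics parameter and re-encode only the state at every step. The class name, its methods, and the toy encoders below are assumptions made for this sketch and do not appear in the disclosure.

```python
class LatentBuilder:
    """Caches the encoded dynamics parameter; re-encodes the state on every call."""

    def __init__(self, encode_state, encode_dynamics):
        self.encode_state = encode_state
        self.encode_dynamics = encode_dynamics
        self._dyn_features = None
        self.dyn_encode_calls = 0  # counts how often the dynamics were encoded

    def set_dynamics(self, eta):
        # Called only when the dynamics parameter is (re-)estimated.
        self._dyn_features = self.encode_dynamics(eta)
        self.dyn_encode_calls += 1

    def latent(self, obs):
        if self._dyn_features is None:
            raise ValueError("set_dynamics must be called before latent")
        # The state is encoded at every step; the cached dynamics code is reused.
        return self.encode_state(obs) + self._dyn_features

# Toy encoders (assumed): double each state entry, sum the dynamics entries.
builder = LatentBuilder(lambda o: [2.0 * x for x in o], lambda e: [sum(e)])
builder.set_dynamics([0.25, 0.25])
latents = [builder.latent([x]) for x in (1.0, 2.0, 3.0)]
```

Here three latent states are built from a single dynamics encoding, which corresponds to using a shorter interval for the state than for the dynamics parameter.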

The decoder 104 takes the latent state Z_t as input and outputs an action a_t^(e). The action is output, for example, as a distribution over predicted actions or controls. As one example, the decoder 104 outputs the action a_t^(e) stochastically from a probability distribution conditioned on the latent state Z_t. This distribution may be one that generates action samples deterministically, like a δ function, or it may be a normal distribution. When outputting the action as a normal distribution, for example, the decoder 104 may determine and output the mean m_t and the variance Σ_t of the action based on the latent state Z_t. With these values the action can be written, for example, as a_t^(e) ~ N(m_t, Σ_t), where N(m, Σ) denotes a normal distribution with mean m and variance vector Σ. The decoder 104 outputs, for example, m_t and Σ_t, and generates the action a_t^(e) as a sample from this distribution.

In other words, the decoder 104 outputs information such as what action the controlled object should take. More specifically, the decoder 104 outputs a stochastic action for the controlled object. Writing the transformation of the decoder 104 as g_θ(), the distribution of the action in the normal-distribution case, for example, is given by (m_t, Σ_t) = g_θ([f_φ(o_t^(e)), M_ζ(η_e)]).
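The normal-distribution case can be sketched as follows. The linear mean head, the softplus variance head, and all weight values are hypothetical; the only element taken from the description is that the decoder yields (m_t, Σ_t) and that the action is sampled from N(m_t, Σ_t) with Σ_t treated as a variance vector.

```python
import math
import random

def decode(latent, w_mean, w_var):
    """Hypothetical stand-in for g_theta: returns (m_t, Sigma_t) for a latent Z_t."""
    mean = [sum(w * z for w, z in zip(row, latent)) for row in w_mean]
    # softplus keeps each variance entry positive
    variance = [math.log1p(math.exp(sum(w * z for w, z in zip(row, latent))))
                for row in w_var]
    return mean, variance

def sample_action(mean, variance, rng):
    """Draw a_t ~ N(m_t, diag(Sigma_t)); zero variance gives the deterministic case."""
    return [rng.gauss(m, math.sqrt(v)) for m, v in zip(mean, variance)]

rng = random.Random(0)
latent = [0.2, -0.4, 0.7]
mean, variance = decode(latent, w_mean=[[1.0, 0.0, 0.5]], w_var=[[0.0, 0.0, -1.0]])
action = sample_action(mean, variance, rng)
```

Passing a variance of zero returns the mean itself, which corresponds to the δ-function-like distribution mentioned above.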

The distribution of the action is not limited to a normal distribution. Any distribution that can appropriately represent the probability distribution of the action may be used; for another distribution, the decoder 104 outputs the action stochastically based on parameters suited to representing that distribution.

Although the first encoder 100, the second encoder 102, and the decoder 104 are described above as having separate models, this is not a limitation, and all or at least two of them may be provided in the same model. For example, when the next action is to be estimated whenever the state is updated, the first encoder 100 and the decoder 104 may form one continuous model. In this case, the model may be configured to accept the output of the second encoder 102 partway through its processing. As yet another example, all of the above components may be provided within a single model. In this case, the dynamics parameter and the state may be made inputtable at separate times.

The first encoder 100, the second encoder 102, and the decoder 104 may each be, for example, a trained model of any configuration trained by any machine learning method.

In this way, the first estimation unit 10 estimates an action based on the state and the dynamics parameter.

The second estimation unit 12 of this embodiment includes an estimator 120 and an aggregator 122. The second estimation unit 12 estimates the dynamics parameter corresponding to each of various environments, for input to the second encoder 102. Given a time series of states and a time series of actions, the second estimation unit 12 outputs a dynamics parameter. The output dynamics parameter may be represented by a probability distribution.

The state o_t and the action a_t at a given time are related in that taking an action in a situation leads to the next situation. That is, o_{t+1} should be generated from o_t and a_t at time t according to the dynamics parameter. The processing of the second estimation unit 12 is therefore carried out by a function that takes the triple (o_t, a_t, o_{t+1}) as input.

Suppose there are inputs {(o_t, a_t, o_{t+1})}_{t=1}^{kT} from t = 1 to kT. This input is divided into chunks of, for example, T elements each. That is, the time-series input data is divided into {(o_t, a_t, o_{t+1})}_{t=1}^{T}, {(o_t, a_t, o_{t+1})}_{t=T+1}^{2T}, ..., {(o_t, a_t, o_{t+1})}_{t=(k-1)T+1}^{kT}, and processing is executed for each chunk. If the length of the time series is not a multiple of T, the last chunk may, for example, simply run to the last data point, or it may be padded with appropriate data.
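The chunking of the transition sequence can be sketched as follows. The function names are hypothetical, and the sketch takes the first option described above for a sequence whose length is not a multiple of T: the last chunk simply ends at the last data point, without padding.

```python
def to_transitions(states, actions):
    """Build (o_t, a_t, o_{t+1}) triples; states holds one more element than actions."""
    if len(states) != len(actions) + 1:
        raise ValueError("expected len(states) == len(actions) + 1")
    return [(states[t], actions[t], states[t + 1]) for t in range(len(actions))]

def split_into_chunks(transitions, chunk_len):
    """Split the time series into consecutive chunks of chunk_len (T) triples each."""
    if chunk_len <= 0:
        raise ValueError("chunk_len must be positive")
    return [transitions[i:i + chunk_len]
            for i in range(0, len(transitions), chunk_len)]

# Toy data (assumed): 8 observed states and the 7 actions taken between them.
states = [0, 1, 2, 3, 4, 5, 6, 7]
actions = ["a", "b", "c", "d", "e", "f", "g"]
transitions = to_transitions(states, actions)
chunks = split_into_chunks(transitions, chunk_len=3)
print([len(c) for c in chunks])
```

Each resulting chunk can then be passed to its own estimator call, independently of the others.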

The estimator 120 obtains, from the input of each chunk, an estimate of the dynamics parameter for that chunk. The estimate may be expressed in the form of a probability distribution, for example as a mean μ and a variance σ². That is, an estimate of the dynamics parameter is obtained from T samples of (o_t, a_t, o_{t+1}). This is done for each of the k chunks. From the data of k chunks the estimator 120 obtains, for example, the estimates μ^(1), σ^(1), μ^(2), σ^(2), ..., μ^(k), σ^(k).

These computations may be performed in parallel; for example, a separate estimator 120 may be provided for each chunk. Although this embodiment obtains a mean and a variance as an example of an estimate, this is not a limitation, and any appropriate estimate may be used.

The aggregator 122 combines the estimates output by the estimator 120, for example using a trained model, and estimates and outputs the dynamics parameter of the environment in which the time-series data was observed.
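The description leaves the form of the aggregator open and mentions a trained model as one option. Purely as an illustration of combining k per-chunk estimates into one, the sketch below swaps in a fixed rule, precision-weighted averaging of independent Gaussian estimates of a scalar dynamics parameter. This rule is an assumption of the sketch and is not stated in the disclosure.

```python
def aggregate(means, variances):
    """Fuse per-chunk estimates (mu^(i), sigma^(i)^2) by precision weighting.

    Assumed stand-in for the aggregator 122: each chunk's estimate is treated
    as an independent Gaussian observation of the same scalar parameter.
    """
    if len(means) != len(variances) or not means:
        raise ValueError("need one variance per mean and at least one chunk")
    precisions = [1.0 / v for v in variances]
    total = sum(precisions)
    mean = sum(p * m for p, m in zip(precisions, means)) / total
    return mean, 1.0 / total

# Two equally confident chunks: the fused mean is their midpoint.
fused_mean, fused_var = aggregate([1.0, 3.0], [1.0, 1.0])
print(fused_mean, fused_var)
```

Chunks with smaller variance pull the fused estimate more strongly, and the fused variance shrinks as more chunks are added.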

The estimator 120 and the aggregator 122 may each be, for example, a trained model of any configuration trained by any machine learning method.

In this way, the second estimation unit 12 estimates the dynamics parameter based on states and actions.

In the estimation device 1, for example, the second estimation unit 12 estimates the dynamics parameter based on state and action data that have already been acquired. The first estimation unit 10 then estimates an action based on the state at the current time and the estimated dynamics parameter.

FIG. 2 is a flowchart showing the processing of the estimation device 1 according to this embodiment.

First, the acquired states and actions are input to the second estimation unit 12 to estimate the dynamics parameter (S100).

Next, the estimated dynamics parameter is encoded by the second encoder 102 of the first estimation unit 10 (S102).

Next, the observed state is encoded by the first encoder 100 of the first estimation unit 10 (S104).

Next, the decoder 104 of the first estimation unit 10 estimates an action based on the encoded estimated dynamics parameter and the encoded state (S106).

Next, it is determined whether to end the processing (S108).

If the processing is not to end (S108: NO), the processing from S104 is repeated. In this case, the state may be, for example, the new state reached as a result of the estimated action.

If the processing is to end (S108: YES), any necessary termination processing is executed and the processing ends.

The above describes the case in which the dynamics parameter is not updated, but the dynamics parameter may be updated at a predetermined timing. For example, if the processing is not to end (S108: NO), it is determined whether to update the dynamics parameter (S110).

If the dynamics parameter is not to be updated (S110: NO), the processing from S104 is repeated.

If the dynamics parameter is to be updated (S110: YES), the processing from S100 is repeated. In this case, the states observed and the actions executed up to that time may be used as the states and actions for estimating the dynamics parameter.

In this way, the dynamics parameter may be updated at an appropriate time.
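The flow S100 to S110 can be sketched as a control loop. Everything in the toy setup is an assumption made for illustration: a one-dimensional system o_{t+1} = o_t + gain * a_t, a ratio-based dynamics estimate, identity-like encoders, a decoder that steers the state to a target of 1.0, a fixed step budget as the termination test of S108, and a periodic update as the test of S110.

```python
def run_estimation_loop(env_step, estimate_dynamics, encode_dynamics, encode_state,
                        decode_action, history, obs, num_steps, update_every=None):
    """Sketch of S100-S110; all callables are supplied by the caller."""
    history = list(history)                      # keep the caller's data unchanged
    dyn_feat = encode_dynamics(estimate_dynamics(history))   # S100, S102
    actions = []
    for step in range(1, num_steps + 1):
        state_feat = encode_state(obs)                       # S104
        action = decode_action(state_feat + dyn_feat)        # S106
        next_obs = env_step(obs, action)
        history.append((obs, action, next_obs))
        actions.append(action)
        obs = next_obs
        if step == num_steps:                                # S108: YES
            break
        if update_every and step % update_every == 0:        # S110: YES
            dyn_feat = encode_dynamics(estimate_dynamics(history))
    return actions, history

true_gain = 0.5  # unknown to the controller in this toy setup

def estimate_gain(history):
    # Average observed displacement per unit action over past transitions.
    ratios = [(o2 - o1) / a for o1, a, o2 in history if a != 0]
    return sum(ratios) / len(ratios)

actions, history = run_estimation_loop(
    env_step=lambda o, a: o + true_gain * a,
    estimate_dynamics=estimate_gain,
    encode_dynamics=lambda eta: [eta],
    encode_state=lambda o: [o],
    decode_action=lambda z: (1.0 - z[0]) / z[1],  # steer toward the target 1.0
    history=[(0.0, 1.0, 0.5)], obs=0.5, num_steps=3, update_every=2)
```

Setting update_every to None corresponds to the case in which the dynamics parameter is never updated after S100.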

As described above, by estimating the dynamics parameter from already acquired state and action data, and then estimating the action from the acquired state and the estimated dynamics parameter, a policy that yields desirable actions can be obtained quickly without interacting with the actual environment. As a result, for example, the accuracy of process automation at a production site or the like can be improved. Although the first estimation unit 10 and the second estimation unit 12 are described above as separate components, this is not a limitation; they may be implemented as a single estimation unit.

The estimation device 1 implemented in this way can ensure greater safety than trial and training in a real environment starting from random initial values. Furthermore, according to the embodiment described above, for a policy trained in a simulation environment, it is possible to obtain a policy that can adapt to the differences that arise with respect to the real environment. This result can be used, for example, for robotic automation.

For example, the dynamics parameter may change when the environment differs, so a second encoder 102 that can output appropriate results for a variety of dynamics parameters may be needed. The first estimation unit 10 therefore uses, for example, a second encoder 102 trained in a simulator environment. Using a second encoder 102 trained in a simulator environment enables online action estimation for a variety of environments.

(Training device)
The training device according to this embodiment trains the first encoder 100, the second encoder 102, the decoder 104, the estimator 120, and the aggregator 122 of the estimation device 1. Training is performed, for example, by machine learning techniques. The training of this embodiment relies on the fact that, when a simulator is used, the dynamics parameters can be obtained in advance. That is, training is performed in the simulator environment so that actions can be obtained from states using known dynamics parameters. Training that improves the accuracy of action estimation in the real-machine space is then carried out by adapting this simulation environment to the real space.

FIG. 3 is a block diagram of the training device 2 according to this embodiment. The training device 2 includes a forward propagation unit 20, an error calculation unit 22, and an update unit 24. In addition, the training device 2 may include an interface for inputting and outputting data and the like, and a storage unit for storing data, parameters, and the like. The training device 2 trains the models in each part of the estimation device 1 described above.

A single training device 2 may train the first encoder 100, the second encoder 102, the decoder 104, the estimator 120, and the aggregator 122 shown on the right side of the figure. As another example, one training device 2 may be provided for each of the first estimation unit 10 and the second estimation unit 12. As yet another example, a training device 2 may be provided that trains some subset of all the models. When one training device 2 trains multiple models, the steps of training, such as the loss calculation method and the update method, may be switched for each model.

As noted above, the models of the components of the first estimation unit 10 are preferably trained in a simulation environment. In general, it is difficult to obtain accurate dynamics parameters for the space in which the device to be controlled is located. By preparing multiple simulation environments in which each element of the dynamics parameter is given and training within them, a model capable of estimating actions for the actual device to be controlled can be obtained easily.

The second estimation unit 12 estimates the dynamics parameter from states and actions, and the dynamics parameter it estimates can be used as the dynamics parameter for constructing the simulation environment in which the first estimation unit 10 is trained. As described above, the second estimation unit 12 is used not only for training the first estimation unit 10 but also for estimating the dynamics parameter used to estimate actions in the real environment. The training device 2 thus trains the model that estimates the dynamics parameter used for training the first estimation unit 10 or for performing estimation.

The forward propagation unit 20 executes forward propagation of the model to be trained. For example, it inputs predetermined data to each model and obtains the output data.

The error calculation unit 22 calculates an error between the data obtained by the forward propagation unit 20 and teacher data or the like, or a loss described by the cumulative cost obtained from the system by executing the policy. The way this loss is calculated may differ from model to model. As described later, the unit is not necessarily used only for error calculation in supervised learning; it may also calculate a loss described by the cost of the reward (cumulative reward) in reinforcement learning. Thus, in a broad sense, the error calculation unit 22 calculates, based on an objective function, the loss, reward, or other values needed to update the parameters.

The update unit 24 updates the network based on the loss calculated by the error calculation unit 22. The network is updated, for example, by backpropagating the error. When the error is backpropagated, the processing of the error calculation unit 22 and the update unit 24 may be executed for each layer of the neural network, for example, as indicated by the dotted arrows in the figure.
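The division of labor among the forward propagation unit 20, the error calculation unit 22, and the update unit 24 can be illustrated with the simplest supervised case. The linear model, the squared-error loss, the learning rate, and the toy data below are assumptions of this sketch; the reinforcement-learning losses mentioned above are not covered.

```python
def forward(weights, x):
    """Forward propagation: output of a linear model for input x."""
    return sum(w * xi for w, xi in zip(weights, x))

def loss(pred, target):
    """Error calculation: squared error against the teacher value."""
    return 0.5 * (pred - target) ** 2

def update(weights, x, pred, target, lr):
    """Update: one gradient step; d loss / d w_i = (pred - target) * x_i."""
    return [w - lr * (pred - target) * xi for w, xi in zip(weights, x)]

def train(weights, data, lr=0.1, epochs=200):
    for _ in range(epochs):
        for x, target in data:
            pred = forward(weights, x)
            weights = update(weights, x, pred, target, lr)
    return weights

def total_loss(weights, data):
    return sum(loss(forward(weights, x), y) for x, y in data)

# Toy teacher data generated by y = 2*x1 - x2 (assumed for illustration).
data = [([1.0, 0.0], 2.0), ([0.0, 1.0], -1.0), ([1.0, 1.0], 1.0)]
initial = [0.0, 0.0]
trained = train(initial, data)
```

For a multi-layer network the update step would propagate the error backward layer by layer, as the description notes; the single-layer model here needs only one gradient expression.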

以下、各部に備えられるモデルの訓練について詳しく説明する。 The training of the model provided in each part will be described in detail below.

まず、第１推定部１０に備えられるモデルから説明する。第１推定部１０の第１エンコーダ１００、第２エンコーダ１０２、デコーダ１０４に備えられるモデルは、同じタイミングで訓練されてもよい。 First, a model provided in the first estimation unit 10 will be described. The models provided in the first encoder 100, the second encoder 102, and the decoder 104 of the first estimation unit 10 may be trained at the same timing.

FIG. 4 is a flowchart showing the training process for the first estimation unit 10 according to the present embodiment. The training process of the training device 2 for the first estimation unit 10 is described with reference to FIG. 4.

First, the training device 2 initializes the parameters (S200). The parameters are those of the models f_φ, M_η, and g_θ provided in the first encoder 100, the second encoder 102, and the decoder 104, and those of the models f_rec and g_inv used to train the first encoder 100 and the decoder 104.

Next, n environments are randomly generated (S202). These environments are formed, for example, on a simulator. Based on the states and actions obtained from this simulator, the second estimation unit 12 can estimate and obtain n dynamics parameters, described later, one for each environment.

Next, the parameters are updated in a loop. First, one environment is randomly selected from the n generated environments (S204). The processing that starts from this selection may be treated as an episode, and the parameter update may be repeated within the episode until a predetermined condition is satisfied. Any algorithm may be used to train each model.

Next, the dynamics parameter corresponding to the selected environment is obtained (S206). The dynamics parameter may be, for example, the one estimated by the second estimation unit 12 in S202.

Next, an initial state is randomly sampled (S208). For example, a state to serve as the initial state is randomly obtained from various data provided as test data.

Next, the parameters are updated (S210). More specifically, the initial state obtained in S208 is input to the first encoder 100 and the dynamics parameter obtained in S206 is input to the second encoder 102 to obtain the latent state Z_t. Z_t is then input to the decoder 104 to obtain the action a_t^(e). This may be done, for example, by the forward propagation unit 20 forward-propagating the input data through each model. Based on the results, the error calculation unit 22 calculates an appropriate error or the like for each model and backpropagates it, and the update unit 24 updates the parameters. The training of each component is described later.

Next, it is determined whether to end the process (S212). The end may be determined, for example, by whether a predetermined number of episodes has been computed for a predetermined number of epochs, whether the accuracy has reached a predetermined value or higher, or whether a predetermined condition has been satisfied in validation. The predetermined numbers for episodes and epochs need not be the same and may be set separately.

If it is determined that the process is not to end (S212: NO), the processing from S204 is repeated. In this case, while the number of episode repetitions is within the predetermined number, the episode may be repeated. When the episode repetitions are finished, it is checked whether the number of epochs has reached the predetermined number; if it has not, the processing of the next epoch may be executed.

On the other hand, if it is determined that the process is to end (S212: YES), the training device 2 executes any necessary processing, such as outputting and storing the optimized parameters, and then ends the process.
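The flow of steps S200 to S212 can be sketched as a simple loop. The following is only a minimal illustration under stated assumptions: the function names, the scalar stand-ins for environments and states, and the toy update rule are all hypothetical and are not taken from the embodiment, which leaves the training algorithm open.

```python
import random


def toy_update(params, o0, eta):
    """Hypothetical stand-in for S210: one gradient step on a toy squared error."""
    err = params["theta"] * o0 - eta * o0
    params["theta"] -= 0.1 * err * o0
    return err * err


def train_first_estimator(n_envs, n_epochs, episodes_per_epoch, update_fn, seed=0):
    """Skeleton of S200 to S212; every name here is an assumption."""
    rng = random.Random(seed)
    params = {"theta": 0.0}  # S200: initialize (one scalar stands in for all models)
    # S202: generate n environments, each represented only by a dynamics parameter
    envs = [{"eta": rng.uniform(0.5, 1.5)} for _ in range(n_envs)]
    history = []
    for _epoch in range(n_epochs):
        for _episode in range(episodes_per_epoch):
            env = rng.choice(envs)             # S204: select one environment at random
            eta = env["eta"]                   # S206: obtain its dynamics parameter
            o0 = rng.uniform(-1.0, 1.0)        # S208: sample an initial state
            loss = update_fn(params, o0, eta)  # S210: forward pass, loss, update
            history.append(loss)
    # S212: here the process ends after the predetermined number of epochs
    return params, history


params, history = train_first_estimator(3, 2, 5, toy_update)
```

In this sketch the end condition is only the epoch count; an accuracy or validation condition, as mentioned for S212, could replace it.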

The first estimation unit 10 may be trained in a simulation environment in this way. Randomness may be imparted to the simulation environment, and the unit may be optimized by learning a policy that can adapt to it. Because a simulation environment is used, the training device 2 can, for example, obtain dynamics parameters for various environments. The training device 2 uses these dynamics parameters to train the first estimation unit 10.

The first estimation unit 10 can be trained by reinforcement learning. Reinforcement learning generally requires a very large number of trials, so performing it in multiple different environments in order to estimate appropriately across environments is difficult in practice. Training in a real environment may also be dangerous. In the present embodiment, however, training the first estimation unit in a simulation environment makes reinforcement learning easy to perform, so a first estimation unit that estimates appropriately even in different environments can be obtained.

When reinforcement learning is performed, for example, the probability distribution that generates the action a may be written π(a|o, η), a reward r_k may be set, and reinforcement learning may be performed using the following objective function.

Here, γ is a constant satisfying γ ∈ [0, 1], and E_π denotes the expectation with respect to the policy distribution π, the initial state o₀, and the state transition and reward probabilities of the system. Reinforcement learning may be performed by maximizing this objective function with respect to the parameters of the distribution π. T is a constant representing the length of an episode and is a positive integer.
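As a hedged illustration of the objective described above, a discounted cumulative reward consistent with the symbols γ, E_π, r_k, and T may be written as follows. The summation limits and the name J are assumptions made here for illustration.

```latex
J(\pi) = \mathbb{E}_{\pi}\!\left[ \sum_{k=0}^{T} \gamma^{k} \, r_{k} \right]
```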

When reinforcement learning is performed as well, the parameters of the simulation environment may be obtained in advance within the simulation environment, as described above. Performing reinforcement learning as in the present embodiment makes it possible to apply the results of training in the simulation environment to the real environment while limiting the loss of accuracy.

The first encoder 100 may, for example, be trained by additionally using the following loss L_rec^(t).

Here, f_rec() is a model that restores the state o_t converted by the first encoder 100. For example, if the compression ratio of the state o_t in the first encoder 100 is too high, it may not be possible to invert the compressed state f_φ(o_t) back to the original state o_t. In such a case, the state o_t, or part of it, may not be reflected in the action at the decoder 104. To avoid this problem, the first encoder 100 is updated using the loss of [Equation 2] so that the compressed state can, to some extent, be inverted back to the state.

The objective function is not limited to this; any objective function that can appropriately compare the encoded state with the original state may be used.
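As a hedged illustration, a reconstruction loss consistent with the description of f_rec and f_φ may be written as follows. The squared Euclidean norm is an assumption; any measure that compares the restored state with the original state would fit the text.

```latex
L_{\mathrm{rec}}^{(t)} = \left\lVert\, o_t - f_{\mathrm{rec}}\!\left( f_{\phi}(o_t) \right) \right\rVert_2^{2}
```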

The second encoder 102 and the decoder 104 may, for example, be trained by additionally using the following loss L_inv^(t).

Here, g_inv() is a model that estimates what action was executed between the first latent state Z_t and the second latent state Z_{t+1}. As described above, Z_t is expressed using the state o_t^(e) and the dynamics parameter η_e as Z_t = [f_φ(o_t^(e)), M_ζ(η_e)]. Z_{t+1}, in turn, is expressed using the dynamics parameter η_e and the state o_{t+1}^(e) at time t+1, which results when the action a_t estimated by the decoder 104 at time t (called the first action) is applied, as Z_{t+1} = [f_φ(o_{t+1}^(e)), M_ζ(η_e)]. g_inv(Z_{t+1}, Z_t) is a model that, given Z_{t+1} and Z_t, estimates what action a_t' (called the second action) was executed on Z_t in the target environment.

Defining L_rec^(t) and L_inv^(t) in this way and training the models makes it possible to train so that the estimate is less affected by information in the state or environment that is unnecessary for estimating the action, such as changes in light intensity, the direction of illumination, and the position of shadows.

L_inv^(t) calculates the difference between the first action a_t that was actually applied and the second action a_t' that is inferred to have been applied, given the state produced by the first action. This loss function is used to update the parameters of the model M_ζ of the second encoder 102, the model g_θ of the decoder 104, and the model g_inv, which obtains an action from latent states.
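The comparison between the first action and the second action can be sketched as below. This is a minimal illustration under assumptions: the squared-error form, the list representation of latent states, and the toy inverse model `toy_g_inv` are hypothetical and stand in for the trained model g_inv.

```python
def toy_g_inv(z_next, z_t):
    """Hypothetical inverse model: infers the action as the change in latent state."""
    return [zn - zt for zn, zt in zip(z_next, z_t)]


def inverse_dynamics_loss(a_first, z_t, z_next, g_inv):
    """Squared difference between the applied action and the inferred action.

    a_first: first action a_t that was actually applied
    z_t, z_next: latent states Z_t and Z_{t+1}
    g_inv: callable returning the second action a_t' from (Z_{t+1}, Z_t)
    """
    a_second = g_inv(z_next, z_t)
    return sum((x - y) ** 2 for x, y in zip(a_first, a_second))


# The inferred action matches the applied one, so the loss is zero.
loss_match = inverse_dynamics_loss([0.5, 0.0], [0.0, 1.0], [0.5, 1.0], toy_g_inv)
# The applied action differs from the inferred one by 0.5 in the first component.
loss_mismatch = inverse_dynamics_loss([1.0, 0.0], [0.0, 1.0], [0.5, 1.0], toy_g_inv)
```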

Next, the training of the second estimation unit 12, which estimates the dynamics parameters input to the first estimation unit 10, is described.

Training the first estimation unit 10 requires the dynamics parameters of n random environments. To that end, actions may first be sampled at random from within the range of permitted actions, and a policy may be set so that several episodes or the like can be executed in each environment. Here, a permitted action is, for example, an action allowed in the environment, or any action under a safe policy trained in advance.

The state transition data are then collected in the form {(o_t^(e), a_t^(e), o_{t+1}^(e))}, as described above, and the second estimation unit 12 is trained. Here, F() is defined as the forward dynamics-parameter model of the simulator. When the true value of the dynamics parameter η_e is available, the true next state and the next state given by the simulator are expressed, respectively, as follows.

Here, R_t is a noise term which, for simplicity, may be assumed to follow a normal distribution with mean 0 and variance v². This is equivalent to defining a likelihood model Q(o_{t+1}^(e) | o_t^(e), a_t^(e), η_e) for the state transition. Therefore, with D_e = {(o_t^(e), a_t^(e), o_{t+1}^(e))_t}_{t=1}^{N_e}, the problem reduces to estimating η_e through the posterior distribution p(η_e | D_e). To estimate the dynamics parameters correctly, the state transition data within an episode may be assumed to be correlated.
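The likelihood model implied by the Gaussian noise term can be sketched as below. This is a minimal illustration assuming scalar states and actions and an additive noise model, that is, o_{t+1} = F(o_t, a_t, η_e) + R_t; the toy dynamics `toy_F` and all function names are hypothetical.

```python
import math


def toy_F(o, a, eta):
    """Hypothetical simulator dynamics: the action is scaled by the dynamics parameter."""
    return o + eta * a


def transition_log_likelihood(transitions, eta, F, v):
    """Sum of log N(o_next | F(o, a, eta), v^2) over scalar transitions (o, a, o_next)."""
    total = 0.0
    for o, a, o_next in transitions:
        resid = o_next - F(o, a, eta)
        total += -0.5 * math.log(2.0 * math.pi * v * v) - resid * resid / (2.0 * v * v)
    return total


# Noise-free transitions generated with eta = 2.0 under the toy dynamics.
data = [(0.0, 1.0, 2.0), (2.0, -1.0, 0.0)]
ll_true = transition_log_likelihood(data, 2.0, toy_F, 1.0)
ll_wrong = transition_log_likelihood(data, 1.0, toy_F, 1.0)
```

A dynamics parameter that explains the observed transitions better yields a higher log-likelihood, which is what makes estimation of η_e through the posterior possible.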

As described above, D_e is divided into k chunks of T tuples (o_t^(e), a_t^(e), o_{t+1}^(e)) each, and the model of the estimator 120 is trained so that μ⁽ⁱ⁾ and σ⁽ⁱ⁾ can be estimated for each chunk i ∈ {1, ..., k}. Writing the time series of actions in chunk i of environment e as x_i^(e) and the time series of states as y_i^(e), the training may be carried out by estimating the following posterior distribution p.

The chunk length may be chosen appropriately. For example, T may be made sufficiently large without being too large. When T is sufficiently large and the central limit theorem holds, the posterior distribution approaches a normal distribution, so the approximation becomes accurate. On the other hand, if T is too large, the method becomes difficult to apply as is when the amount of data is small relative to T. In addition, if dividing the time-series data by T leaves a remainder, the leftover data cannot be used effectively, which may be wasteful.
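The division into chunks, including the leftover data mentioned above, can be sketched as follows. This is a minimal illustration; the function name and the choice to represent the state series as (o_t, o_{t+1}) pairs are assumptions.

```python
def split_into_chunks(transitions, chunk_len):
    """Split a list of (o_t, a_t, o_next) tuples into chunks of chunk_len tuples.

    Returns (chunks, leftover), where each chunk is a pair (x_i, y_i):
    x_i is the action series and y_i is the state series of that chunk.
    leftover is the number of trailing tuples that do not fill a chunk.
    """
    k = len(transitions) // chunk_len
    chunks = []
    for i in range(k):
        part = transitions[i * chunk_len:(i + 1) * chunk_len]
        x_i = [a for (_o, a, _o_next) in part]
        y_i = [(o, o_next) for (o, _a, o_next) in part]
        chunks.append((x_i, y_i))
    leftover = len(transitions) - k * chunk_len
    return chunks, leftover


# Ten transitions with chunk length 4 give two chunks and two unused tuples.
transitions = [(float(t), 1.0, float(t + 1)) for t in range(10)]
chunks, leftover = split_into_chunks(transitions, 4)
```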

To aggregate the k estimates of η_e calculated from the individual chunks i, the relationship that should hold between the posterior distribution p(η_e | x_i^(e), y_i^(e)) of the dynamics parameter η_e conditioned on a single pair of data points and the posterior distribution of η_e conditioned on the entire data set may be used. The posterior distribution of η_e conditioned on the entire data set is expressed as follows.

This can be rewritten as follows.

Assuming that the posterior distribution p(η_e | x_i^(e), y_i^(e)) and the prior distribution p(η_e) are independent normal distributions, the posterior distribution p(η_e | D_e) can also be rewritten as follows.

Here, the subscript j denotes the j-th element of a vector. Using μ_j⁽ⁱ⁾() and σ_j⁽ⁱ⁾() realized by neural networks (including deep neural networks), p(η_e | D_e) can be parameterized by a parameter θ as the posterior distribution p_θ(η_e | D_e). The parameter θ may include scalars such as f_{0,j} and g_{0,j} (j = 1, ..., d, where d is the dimension of η). The parameter θ may be optimized so as to approximate the true posterior distribution.
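One way to aggregate per-chunk Gaussian estimates is sketched below for a single vector element. It is an illustration under assumptions, not the equations of the embodiment: the chunks are assumed to be conditionally independent given η_e, in which case the full posterior is proportional to p(η_e)^(1-k) multiplied by the product of the per-chunk posteriors, and a product of Gaussians is again Gaussian. The function name and arguments are hypothetical.

```python
def aggregate_gaussian_posteriors(mus, sigmas, prior_mu, prior_sigma):
    """Combine k per-chunk Gaussian posteriors of one scalar parameter.

    Assumes p(eta | D) is proportional to p(eta)^(1-k) * prod_i p(eta | x_i, y_i),
    with every factor Gaussian. Returns the mean and standard deviation.
    """
    k = len(mus)
    precision = (1 - k) / prior_sigma ** 2 + sum(1.0 / s ** 2 for s in sigmas)
    if precision <= 0.0:
        # Happens when the prior is narrower than the chunk posteriors allow.
        raise ValueError("combined precision is not positive")
    weighted = (1 - k) * prior_mu / prior_sigma ** 2
    weighted += sum(m / s ** 2 for m, s in zip(mus, sigmas))
    return weighted / precision, precision ** -0.5


# With one chunk, the aggregate equals that chunk's posterior.
single = aggregate_gaussian_posteriors([2.0], [0.5], 0.0, 10.0)
# With two equally confident chunks and a broad prior, the mean is near the average.
combined = aggregate_gaussian_posteriors([1.0, 3.0], [1.0, 1.0], 0.0, 100.0)
```

Under this assumption, adding chunks narrows the combined distribution, which matches the intent of aggregating k estimates into a single estimate of η_e.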

KL (Kullback-Leibler) divergence may be used to approximate the posterior distribution. For example, the approximation can be made so that the KL divergence KL[p_θ(η_e | D_e) | p_true(η_e | D_e)] between the true posterior distribution p_true(η_e | D_e) and the parameterized posterior distribution p_θ(η_e | D_e) is minimized with respect to the parameter θ. Following lower bound optimization, the posterior distribution can be approximated for each environment e without explicitly evaluating the true posterior distribution p_true(η_e | D_e). For example, the optimization may be performed using the following equation.

In addition, to find a normal distribution whose KL divergence from the true posterior distribution p_true(η_e | D_e) is small, the reparameterization trick often used in VAEs (Variational Auto Encoders) may be used. That is, with ε_j ~ N(ε_j | 0, 1), η_e is expressed in terms of the sampled ε_j according to the following equation, so that the expectation over the parameterized posterior distribution p_θ(η_e | D_e) may be replaced with an expectation over the standard normal distribution ε ~ N(ε | 0, 1).
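The reparameterization trick can be sketched as follows. This is a minimal illustration assuming the usual form η_j = μ_j + σ_j ε_j with ε_j drawn from a standard normal distribution; the function name is hypothetical, and the per-element μ_j and σ_j would come from the parameterized posterior.

```python
import random


def sample_eta(mu, sigma, rng):
    """Draw one sample eta_j = mu_j + sigma_j * eps_j with eps_j ~ N(0, 1)."""
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]


rng = random.Random(0)
# With zero standard deviation the sample equals the mean exactly.
deterministic = sample_eta([1.0, 2.0], [0.0, 0.0], rng)
# Otherwise the randomness comes only from eps, not from mu or sigma.
draws = [sample_eta([1.0], [2.0], rng)[0] for _ in range(20000)]
sample_mean = sum(draws) / len(draws)
```

Because the randomness is isolated in ε, an expectation over the parameterized posterior becomes an expectation over the standard normal distribution, and μ and σ remain ordinary inputs that can be optimized.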

As described above, considering time-series chunks instead of standalone tuples (o_t^(e), a_t^(e), o_{t+1}^(e)) makes it possible to approximate effectively the posterior distribution of complex dynamics parameters such as friction and gravity. In general, the posterior distribution of the dynamics parameters can be a complex multimodal distribution. In cases where, for example, the central limit theorem applies, however, the posterior distribution of η conditioned on the samples in a time-series chunk can be approximated by a normal distribution as the number of those samples increases.

FIG. 5 is a flowchart showing the training process of the training device 2 for the second estimation unit 12. Because the details have been given above, it is described only briefly. First, the parameters of the estimator 120 and the aggregator 122 are initialized (S300). Next, the dynamics parameters of the environment e are set at random, states and actions are sampled, and the whole time series is generated (S302). Next, the time series is divided to generate the multiple chunks used for parameter generation (S304). The parameters of the estimator 120 and the aggregator 122 are then updated according to [Equation 4] to [Equation 13] above (S308). If the process is to end (S308: YES), the training device 2 ends the training process of the second estimation unit 12. The end condition may follow the conditions described for the training of the first estimation unit 10 or may be different. If the process is not to end (S308: NO), the processing may be repeated from the sampling of states and actions in a different environment (S302). The processing may also be repeated from the sampling of states and actions (S302) as needed, for example when the training environment is changed.

Given state transition data in this way, the trained model can be used to estimate the dynamics parameters of a given environment, including a real environment such as a test environment. According to the above, the dynamics parameters may be estimated using only off-policy data. Such off-policy data can be collected simply by executing a random policy or an existing policy, for example a rule-based one, that is known to be safe to execute. The data for training a model that obtains dynamics parameters in various environments can therefore be collected at low cost.

As described above, according to the present embodiment, the model of the second estimation unit 12 that obtains the dynamics parameters can be trained, and a simulator can be generated using the dynamics parameters obtained by this model. By using the simulator to train in various environments for the observed states, a model trained in the simulator environment can be applied, and further trained, in the real environment with high accuracy.

The second estimation unit 12 can be trained by supervised learning, as described above, by exploiting the fact that the dynamics parameters are known in the simulation environment. Furthermore, even when the dynamics parameters are unknown, the dynamics parameters and the state transition parameters representing the dynamics can be learned so that future states can be predicted appropriately.

With the second estimation unit 12 trained in advance in such a simulation environment, the dynamics parameters of a real environment can be estimated simply by observing the sequence of states and actions that arises under an existing method, without actually selecting actions and interacting with the environment. By using the first estimation unit 10 as the policy together with the dynamics parameters estimated by the second estimation unit 12, a good policy for the real environment can be obtained.

Using the first estimation unit 10, trained with the estimates of the dynamics parameters of the n environments produced by the second estimation unit 12 trained in this way, the action in an (n+1)-th environment having an unknown dynamics parameter η_{n+1}, for example the environment of a real machine, may be estimated according to the flowchart shown in FIG. 2.

For example, the models f_φ, M_η, and g_θ (and optionally f_rec and g_inv) are first generated from the trained parameters. The environment is assumed to be the (n+1)-th environment, for example a test environment. An off-policy time series of states and actions (D_{n+1}) is then obtained. This may be obtained, for example, by operating in the real environment according to actions obtained by the first estimation unit 10. Using D_{n+1}, the second estimation unit 12 estimates the dynamics parameter η_{n+1} (S100). The action is then estimated by executing the policy based on the dynamics parameter with the models f_φ, M_η, and g_θ (S106). Thus, according to the present embodiment, an appropriate action can be estimated from the state based on the policy trained in the simulator environment, without trials in the environment.
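The estimation flow described above can be sketched as a composition of the trained models. This is a minimal illustration under assumptions: the toy stand-ins for the second estimation unit, f_φ, M_η, and g_θ are hypothetical, states and actions are scalars, and the latent state is formed by concatenating the two encodings.

```python
def toy_second_estimator(d_new):
    """Hypothetical estimator: assumes toy dynamics o_next = o + eta * a."""
    ratios = [(o_next - o) / a for (o, a, o_next) in d_new]
    return sum(ratios) / len(ratios)


def estimate_action(o_t, d_new, second_estimator, f_phi, m_eta, g_theta):
    """Estimate an action for the (n+1)-th environment from off-policy data."""
    eta_new = second_estimator(d_new)   # S100: estimate the dynamics parameter
    z_t = f_phi(o_t) + m_eta(eta_new)   # concatenate the two encodings (lists)
    return g_theta(z_t)                 # S106: decode the latent state into an action


# Toy models: identity encoders and a decoder that drives the state to zero.
f_phi = lambda o: [o]
m_eta = lambda eta: [eta]
g_theta = lambda z: -z[0] / z[1]

d_new = [(0.0, 1.0, 2.0), (2.0, 1.0, 4.0)]  # observed (o_t, a_t, o_next) tuples
action = estimate_action(4.0, d_new, toy_second_estimator, f_phi, m_eta, g_theta)
```

Only observed transitions are needed to obtain the dynamics parameter; no trial actions are issued by the policy before the action is estimated.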

In the present embodiment, fine-tuning may additionally be performed. For example, the real machine is operated with the estimated actions using the dynamics parameters generated in the real environment. When a mistake occurs, a time series of states and actions is obtained by, for example, a human performing a more desirable action. The first estimation unit 10 may be fine-tuned using this time series. In this way, the first estimation unit 10 can be tuned to be more accurate.

For example, before the obtained policy is actually applied in the real environment, a step of evaluating whether the policy can achieve good performance may be included. This evaluation may cover whether the estimated simulation environment imitates the real environment well and whether the policy performs well in that simulation environment. That is, the policy may be applied to the real environment when it is judged to be better than the existing method.

On that basis, an A/B test may be run using result A, obtained with the estimated policy, and result B, obtained with an existing policy such as a rule-based one. Based on the A/B test, operation may be switched to the policy judged to be better. If result A is inferior to result B, that result may be fed back directly to improve the policy. If necessary, the simulated environment may also be improved, and the policy further improved in the improved simulation environment. Fine-tuning in this way can raise the performance of the policy in the real environment.

When the estimation by the estimation device 1 is inferior, there are two main possible causes. One is that the simulation environment deviates from the real environment. The other is that the simulation environment reproduces the real environment with high accuracy but the policy itself does not perform well in the simulation environment.

Whether the simulation environment deviates from the real environment can be judged by how well the state transitions are reproduced. In this case, the policy may be trained so as to reduce the deviation. The performance of the policy in the simulation environment can be checked by actually testing the policy there. In this case, the policy may be trained further in the simulation environment. After the performance has improved, the estimation device 1 may be tuned by running the A/B test again.

Note that during training in a real environment (during fine-tuning), the estimation by the estimation device may be inappropriate and cause a problem; for example, on an automated factory line, the policy may be inappropriate and the device may be controlled incorrectly. In such a case, a human or another device may act as a backup. That is, a human worker or another device may correct or handle the device's mistakes. For example, when a person and a robot under training (a control device) work together, the human can cover for the robot when its task (for example, picking) fails, and training can proceed so that the picking success rate rises.

When learning on a factory line or the like, the line speed may be set in the early stage of training so that the robot works at a speed at which a human can cover for its mistakes.

The line speed may be raised as training progresses.

In the above case, the speed may be adjusted based on the success rate, may be adjusted by a human, or may be adjusted based on a success rate calculated from information acquired by a sensor, such as an imaging device, provided in the environment.

This allows the model to be trained while maintaining the productivity of the line.

Besides changing the line speed, another way to balance model training against line productivity is to change the ratio between the current policy, which is known to be appropriate, and a policy newly under test. The policy newly under test may be the one currently believed to be best, such as a policy that was appropriate in the simulation environment, or it may be a random policy, for example to test whether a better policy exists.

For example, the ε-greedy method may be used to change this ratio. ε-greedy exploration in reinforcement learning stochastically tries a random policy with probability ε and the policy currently believed to be best with probability 1-ε.
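The ε-greedy choice between policies can be sketched as below. This is a minimal illustration, assuming that a policy is simply a callable from a state to an action; the names `epsilon_greedy` and `act` are hypothetical and do not come from the disclosure.

```python
import random


def epsilon_greedy(best_policy, random_policy, epsilon, rng=random):
    """Return the random policy with probability epsilon, else the best one."""
    # rng.random() is uniform on [0, 1), so the test below succeeds
    # with probability epsilon.
    return random_policy if rng.random() < epsilon else best_policy


def act(state, best_policy, random_policy, epsilon, rng=random):
    """Choose a policy by epsilon-greedy and apply it to the state."""
    policy = epsilon_greedy(best_policy, random_policy, epsilon, rng)
    return policy(state)
```

Lowering `epsilon` shifts the mix toward the policy known to work, which favors productivity; raising it shifts the mix toward exploration, which favors training.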

(Modification)
The estimation device 1 trained by the training device 2 as described above can be made even more robust. The training device 2 performs the training in environments with different dynamics. The dynamics define the state transition probability, that is, which state is reached from the current state and the selected action. Different dynamics may be generated not only by varying the parameters of a predetermined state transition probability, but also by introducing variation into the action in this state transition probability, which expresses still more diverse dynamics.

The training device 2 may also apply other information added to the environment or state, for example, variations in motor noise. For example, the action may be interpreted in terms of domain randomization, with a deviation that depends on the particular state being added to the action. In this case, the deviation added to the action may be implemented by adding a disturbance to the action output of the policy model.

To estimate the disturbance ε_t using a model, assume, for example, that ε_t is given by the inner product of an environment-dependent parameter vector ω_e with a vector-valued function of the current state o_t^(e). For example, let Φ be a nonlinear mapping, specifically a feedforward neural network whose parameters τ are known or randomly assigned; the disturbance may then be written as ε_t = ω_e Φ_τ(o_t^(e)). Here, for example, ω_e is a row vector and Φ_τ(o_t^(e)) is a column vector, so that ω_e Φ_τ(o_t^(e)) is their inner product. In this case, the perturbation caused by motor noise in environment e is identified through the estimation of ω_e.
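The inner-product model of the disturbance can be sketched in a few lines of Python. This is a minimal illustration, not the implementation of the disclosure: the one-layer tanh network standing in for Φ_τ, the Gaussian initialization of τ, the dimensions, and the names `make_feature_map` and `disturbance` are all assumptions made for the example.

```python
import math
import random


def make_feature_map(obs_dim, feat_dim, seed=0):
    """Return Phi_tau: a one-layer tanh network with fixed random weights tau.

    The architecture is an assumption; the text only requires a
    feedforward network with known or randomly assigned parameters.
    """
    rng = random.Random(seed)
    weights = [[rng.gauss(0.0, 1.0) for _ in range(obs_dim)]
               for _ in range(feat_dim)]
    biases = [rng.gauss(0.0, 1.0) for _ in range(feat_dim)]

    def phi(obs):
        # One feature per row of the weight matrix.
        return [math.tanh(sum(w * o for w, o in zip(row, obs)) + b)
                for row, b in zip(weights, biases)]

    return phi


def disturbance(omega_e, phi, obs):
    """epsilon_t = omega_e . Phi_tau(o_t): inner product, a scalar."""
    return sum(w * f for w, f in zip(omega_e, phi(obs)))
```

Because τ is fixed, the disturbance is linear in ω_e, so identifying the perturbation for an environment reduces to estimating the single vector ω_e.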

Let â_t^(e) denote the original predicted action at time step t. With the variation in the action, the action that is input to the simulator is expressed, for example, as follows.

Here, K is a scalar coefficient on the noise. Like the dynamics parameter η_e, ω_e is an environment-dependent parameter. Therefore, by defining an extended dynamics parameter η_e' = (η_e, ω_e) and processing it in the same way as in the estimation model for the dynamics parameter η_e described above, the estimation of ω_e can also be trained.
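The roles of K and of the extended dynamics parameter can be illustrated with a short sketch. The expression for the perturbed action is not reproduced in this text, so the additive form used in `perturbed_action` is purely an assumption for illustration; both function names are hypothetical as well.

```python
def perturbed_action(predicted_action, epsilon_t, k):
    """Perturb the predicted action by the scaled disturbance.

    ASSUMPTION: an additive form a = a_hat + K * epsilon_t, applied to
    every action component. This form is not taken from the source.
    """
    return [a + k * epsilon_t for a in predicted_action]


def extend_dynamics_params(eta_e, omega_e):
    """eta_e' = (eta_e, omega_e): concatenate the two parameter vectors."""
    return list(eta_e) + list(omega_e)
```

With the parameters concatenated in this way, a model that estimates η_e can be trained to estimate η_e' instead without changing its structure, only the size of its output.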

As described above, by extending the dynamics parameter η_e for environment e, the estimation device 1 can also estimate information added to the environment or state, for example, the influence of motor noise. Such an estimation device 1 can of course be realized by having the training device 2 modify the dynamics parameter η_e based on the above equation.

Some results obtained with the method of the present disclosure are described below.

FIG. 6 is a graph showing the rewards obtained by the estimation device 1 trained by the method described above. In the figure, (1) is the result of training without the f_rec and g_inv models, (2) is the result of training with the g_inv model, (3) is the result of training with the f_rec model, and (4) is the result of training with both the f_rec and g_inv models. (5) is a comparative example trained using the correct dynamics parameters of the environment; in theory, no better reward than this can be expected. (6) is a comparative example trained using MAML (Model-Agnostic Meta-Learning), a meta-learning method.

The graph shows that training with both f_rec and g_inv yields better rewards than training with only one of them or with neither. It also yields better results than MAML.

FIG. 7 is a graph showing how the reward evolves during training with the method described above. The horizontal axis is proportional to the number of training episodes, and the vertical axis shows the reward. (4) and (5) are as described for FIG. 6. The result obtained using domain randomization alone is shown as a comparative example. As the graph shows, the present embodiment obtains better results than the other training methods from an early stage of training.

FIG. 8 is a graph showing the effect of the disturbance described above. The horizontal axis shows the disturbance coefficient K, and the vertical axis shows the reward. τ was selected at random. The solid line (with disturbance) is the result when the disturbance was set during training and the action was estimated with ω_e estimated by the method described above. The broken line (without disturbance) is the result when the disturbance was taken into account during training, but the action was estimated without encoding the disturbance ω_e into the latent state Z and without estimating ω_e. As the graph shows, according to the present embodiment, taking the disturbance into account and also reflecting it in the latent state Z gives better results.

Some or all of each device (estimation device 1 or training device 2) in the above-described embodiments may be configured in hardware, or by information processing of software (a program) executed by a CPU (Central Processing Unit), a GPU (Graphics Processing Unit), or the like. When configured by information processing of software, software that implements at least some of the functions of each device in the above-described embodiments may be stored on a non-transitory storage medium (non-transitory computer-readable medium) such as a flexible disk, a CD-ROM (Compact Disc-Read Only Memory), or a USB (Universal Serial Bus) memory, and read into a computer to execute the information processing of the software. The software may also be downloaded via a communication network. Furthermore, the information processing may be executed by hardware by implementing the software in a circuit such as an ASIC (Application Specific Integrated Circuit) or an FPGA (Field Programmable Gate Array).

ソフトウェアを収納する記憶媒体の種類は限定されるものではない。記憶媒体は、磁気ディスク、又は光ディスク等の着脱可能なものに限定されず、ハードディスク、又はメモリ等の固定型の記憶媒体であってもよい。また、記憶媒体は、コンピュータ内部に備えられてもよいし、コンピュータ外部に備えられてもよい。 The type of storage medium that stores the software is not limited. The storage medium is not limited to a removable one such as a magnetic disk or an optical disk, and may be a fixed storage medium such as a hard disk or a memory. Further, the storage medium may be provided inside the computer or may be provided outside the computer.

図９は、前述した実施形態における各装置（推定装置１又は訓練装置２）のハードウェア構成の一例を示すブロック図である。各装置は、一例として、プロセッサ７１と、主記憶装置７２（メモリ）と、補助記憶装置７３（メモリ）と、ネットワークインタフェース７４と、デバイスインタフェース７５と、を備え、これらがバス７６を介して接続されたコンピュータ７として実現されてもよい。 FIG. 9 is a block diagram showing an example of the hardware configuration of each device (estimating device 1 or training device 2) in the above-described embodiment. As an example, each device includes a processor 71, a main storage device 72 (memory), an auxiliary storage device 73 (memory), a network interface 74, and a device interface 75, which are connected via a bus 76. It may be realized as a computer 7.

The computer 7 of FIG. 9 includes one of each component, but may include more than one of the same component. Although FIG. 9 shows a single computer 7, the software may be installed on a plurality of computers, each of which executes the same or a different part of the software's processing. In this case, the computers may form a distributed computing system in which they communicate via the network interface 74 or the like to execute the processing. That is, each device (estimation device 1 or training device 2) in the above-described embodiments may be configured as a system that realizes its functions by one or more computers executing instructions stored in one or more storage devices. Alternatively, information transmitted from a terminal may be processed by one or more computers provided on the cloud, and the processing result transmitted to the terminal.

The various operations of each device (estimation device 1 or training device 2) in the above-described embodiments may be executed in parallel using one or more processors, or using a plurality of computers connected via a network. The operations may also be distributed across a plurality of arithmetic cores within a processor and executed in parallel. Some or all of the processes, means, and the like of the present disclosure may be executed by at least one of a processor and a storage device provided on a cloud that can communicate with the computer 7 via a network. Thus, each device in the above-described embodiments may take the form of parallel computing by one or more computers.

プロセッサ７１は、コンピュータの制御装置及び演算装置を含む電子回路（処理回路、Processing circuit、Processing circuitry、CPU、GPU、FPGA又はASIC等）であってもよい。また、プロセッサ７１は、専用の処理回路を含む半導体装置等であってもよい。プロセッサ７１は、電子論理素子を用いた電子回路に限定されるものではなく、光論理素子を用いた光回路により実現されてもよい。また、プロセッサ７１は、量子コンピューティングに基づく演算機能を含むものであってもよい。 The processor 71 may be an electronic circuit (processing circuit, Processing circuit, Processing circuitry, CPU, GPU, FPGA, ASIC, etc.) including a control device and an arithmetic unit of a computer. Further, the processor 71 may be a semiconductor device or the like including a dedicated processing circuit. The processor 71 is not limited to an electronic circuit using an electronic logic element, and may be realized by an optical circuit using an optical logic element. Further, the processor 71 may include an arithmetic function based on quantum computing.

プロセッサ７１は、コンピュータ７の内部構成の各装置等から入力されたデータやソフトウェア（プログラム）に基づいて演算処理を行い、演算結果や制御信号を各装置等に出力することができる。プロセッサ７１は、コンピュータ７のOS（Operating System）や、アプリケーション等を実行することにより、コンピュータ７を構成する各構成要素を制御してもよい。 The processor 71 can perform arithmetic processing based on data and software (programs) input from each device or the like having an internal configuration of the computer 7, and output the arithmetic result or control signal to each device or the like. The processor 71 may control each component constituting the computer 7 by executing an OS (Operating System) of the computer 7, an application, or the like.

Each device (estimation device 1 and/or training device 2) in the above-described embodiments may be realized by one or more processors 71. Here, the processor 71 may refer to one or more electronic circuits arranged on a single chip, or to one or more electronic circuits arranged on two or more chips or two or more devices. When a plurality of electronic circuits is used, the circuits may communicate by wire or wirelessly.

主記憶装置７２は、プロセッサ７１が実行する命令及び各種データ等を記憶する記憶装置であり、主記憶装置７２に記憶された情報がプロセッサ７１により読み出される。補助記憶装置７３は、主記憶装置７２以外の記憶装置である。なお、これらの記憶装置は、電子情報を格納可能な任意の電子部品を意味するものとし、半導体のメモリでもよい。半導体のメモリは、揮発性メモリ、不揮発性メモリのいずれでもよい。前述した実施形態における各装置（推定装置１又は訓練装置２）において各種データを保存するための記憶装置は、主記憶装置７２又は補助記憶装置７３により実現されてもよく、プロセッサ７１に内蔵される内蔵メモリにより実現されてもよい。例えば、前述した実施形態における記憶部は、主記憶装置７２又は補助記憶装置７３により実現されてもよい。 The main storage device 72 is a storage device that stores instructions executed by the processor 71, various data, and the like, and the information stored in the main storage device 72 is read out by the processor 71. The auxiliary storage device 73 is a storage device other than the main storage device 72. Note that these storage devices mean arbitrary electronic components capable of storing electronic information, and may be semiconductor memories. The semiconductor memory may be either a volatile memory or a non-volatile memory. The storage device for storing various data in each device (estimation device 1 or training device 2) in the above-described embodiment may be realized by the main storage device 72 or the auxiliary storage device 73, and is built in the processor 71. It may be realized by the built-in memory. For example, the storage unit in the above-described embodiment may be realized by the main storage device 72 or the auxiliary storage device 73.

記憶装置（メモリ）１つに対して、複数のプロセッサが接続（結合）されてもよいし、単数のプロセッサが接続されてもよい。プロセッサ１つに対して、複数の記憶装置（メモリ）が接続（結合）されてもよい。前述した実施形態における各装置（推定装置１又は訓練装置２）が、少なくとも１つの記憶装置（メモリ）とこの少なくとも１つの記憶装置（メモリ）に接続（結合）される複数のプロセッサで構成される場合、複数のプロセッサのうち少なくとも１つのプロセッサが、少なくとも１つの記憶装置（メモリ）に接続（結合）される構成を含んでもよい。また、複数台のコンピュータに含まれる記憶装置（メモリ））とプロセッサによって、この構成が実現されてもよい。さらに、記憶装置（メモリ）がプロセッサと一体になっている構成（例えば、L1キャッシュ、L2キャッシュを含むキャッシュメモリ）を含んでもよい。 A plurality of processors may be connected (combined) or a single processor may be connected to one storage device (memory). A plurality of storage devices (memory) may be connected (combined) to one processor. Each device (estimation device 1 or training device 2) in the above-described embodiment is composed of at least one storage device (memory) and a plurality of processors connected (combined) to the at least one storage device (memory). In the case, a configuration in which at least one of a plurality of processors is connected (combined) to at least one storage device (memory) may be included. Further, this configuration may be realized by a storage device (memory) and a processor included in a plurality of computers. Further, a configuration in which the storage device (memory) is integrated with the processor (for example, a cache memory including an L1 cache and an L2 cache) may be included.

The network interface 74 is an interface for connecting to the communication network 8 wirelessly or by wire. Any suitable interface, such as one conforming to an existing communication standard, may be used as the network interface 74. The network interface 74 may exchange information with the external device 9A connected via the communication network 8. The communication network 8 may be any of a WAN (Wide Area Network), a LAN (Local Area Network), a PAN (Personal Area Network), or the like, or a combination thereof, as long as information can be exchanged between the computer 7 and the external device 9A. Examples of a WAN include the Internet; examples of a LAN include IEEE 802.11 and Ethernet (registered trademark); and examples of a PAN include Bluetooth (registered trademark) and NFC (Near Field Communication).

デバイスインタフェース７５は、外部装置９Ｂと直接接続するUSB等のインタフェースである。 The device interface 75 is an interface such as USB that directly connects to the external device 9B.

外部装置９Ａは、コンピュータ７とネットワークを介して接続されている装置である。外部装置９Ｂは、コンピュータ７と直接接続されている装置である。 The external device 9A is a device connected to the computer 7 via a network. The external device 9B is a device that is directly connected to the computer 7.

外部装置９Ａ又は外部装置９Ｂは、一例として、入力装置であってもよい。入力装置は、例えば、カメラ、マイクロフォン、モーションキャプチャ、各種センサ等、キーボード、マウス、又は、タッチパネル等のデバイスであり、取得した情報をコンピュータ７に与える。また、パーソナルコンピュータ、タブレット端末、又は、スマートフォン等の入力部とメモリとプロセッサを備えるデバイスであってもよい。 The external device 9A or the external device 9B may be an input device as an example. The input device is, for example, a device such as a camera, a microphone, a motion capture, various sensors, a keyboard, a mouse, or a touch panel, and gives the acquired information to the computer 7. Further, it may be a personal computer, a tablet terminal, or a device having an input unit such as a smartphone, a memory, and a processor.

また、外部装置９Ａ又は外部装置９Ｂは、一例として、出力装置でもよい。出力装置は、例えば、LCD（Liquid Crystal Display）、CRT（Cathode Ray Tube）、PDP（Plasma Display Panel）、又は、有機EL（Electro Luminescence）パネル等の表示装置であってもよいし、音声等を出力するスピーカ等であってもよい。また、パーソナルコンピュータ、タブレット端末、又は、スマートフォン等の出力部とメモリとプロセッサを備えるデバイスであってもよい。 Further, the external device 9A or the external device 9B may be an output device as an example. The output device may be, for example, a display device such as an LCD (Liquid Crystal Display), a CRT (Cathode Ray Tube), a PDP (Plasma Display Panel), or an organic EL (Electro Luminescence) panel, or may output audio or the like. It may be an output speaker or the like. Further, it may be a personal computer, a tablet terminal, or a device having an output unit such as a smartphone, a memory, and a processor.

また、外部装置９Ａ又は外部装置９Ｂは、記憶装置（メモリ）であってもよい。例えば、外部装置９Ａは、ネットワークストレージ等であってもよく、外部装置９Ｂは、HDD等のストレージであってもよい。 Further, the external device 9A or the external device 9B may be a storage device (memory). For example, the external device 9A may be a network storage or the like, and the external device 9B may be a storage such as an HDD.

また、外部装置９Ａ又は外部装置９Ｂは、前述した実施形態における各装置（推定装置１又は訓練装置２）の構成要素の一部の機能を有する装置でもよい。つまり、コンピュータ７は、外部装置９Ａ又は外部装置９Ｂの処理結果の一部又は全部を送信又は受信してもよい。 Further, the external device 9A or the external device 9B may be a device having some functions of the components of each device (estimating device 1 or training device 2) in the above-described embodiment. That is, the computer 7 may transmit or receive a part or all of the processing result of the external device 9A or the external device 9B.

When the expression "at least one of a, b, and c" or "at least one of a, b, or c" (including similar expressions) is used in this specification (including the claims), it includes any of a, b, c, a-b, a-c, b-c, or a-b-c. It may also include multiple instances of any element, such as a-a, a-b-b, or a-a-b-b-c-c. It further includes adding elements other than the listed elements (a, b, and c), such as a-b-c-d, which has d.

When expressions such as "with data as input", "based on data", "according to data", or "in response to data" (including similar expressions) are used in this specification (including the claims), then unless otherwise specified they include the case where the data itself is used as input and the case where data that has undergone some processing (for example, noise-added data, normalized data, or an intermediate representation of the data) is used as input. When it is stated that some result is obtained "based on", "according to", or "in response to" data, this includes the case where the result is obtained based on that data alone, and may also include the case where the result is influenced by other data, factors, conditions, and/or states as well. When it is stated that "data is output", then unless otherwise specified this includes the case where the data itself is used as output and the case where data that has undergone some processing (for example, noise-added data, normalized data, or an intermediate representation of the data) is used as output.

When the terms "connected" and "coupled" are used in this specification (including the claims), they are intended as non-limiting terms that include direct connection/coupling, indirect connection/coupling, electrical connection/coupling, communicative connection/coupling, operative connection/coupling, physical connection/coupling, and the like. The terms should be interpreted as appropriate to the context in which they are used, but any form of connection/coupling that is not intentionally or inherently excluded should be construed, without limitation, as included in the terms.

When the expression "A configured to B" is used in this specification (including the claims), it may include that the physical structure of element A has a configuration capable of executing operation B, and that a permanent or temporary setting (setting/configuration) of element A is configured/set so as to actually execute operation B. For example, when element A is a general-purpose processor, it suffices that the processor has a hardware configuration capable of executing operation B and is configured, by setting a permanent or temporary program (instructions), to actually execute operation B. When element A is a dedicated processor, a dedicated arithmetic circuit, or the like, it suffices that the circuit structure of the processor is implemented so as to actually execute operation B, regardless of whether control instructions and data are actually attached.

本明細書（請求項を含む）において、含有又は所有を意味する用語（例えば、「含む（comprising/including）」及び有する「（having）等）」が用いられる場合は、当該用語の目的語により示される対象物以外の物を含有又は所有する場合を含む、open-endedな用語として意図される。これらの含有又は所有を意味する用語の目的語が数量を指定しない又は単数を示唆する表現（a又はanを冠詞とする表現）である場合は、当該表現は特定の数に限定されないものとして解釈されるべきである。 In the present specification (including claims), when a term meaning inclusion or possession (for example, "comprising / including" and "having", etc.) is used, the object of the term is used. It is intended as an open-ended term, including the case of containing or owning an object other than the indicated object. If the object of these terms that mean inclusion or possession is an expression that does not specify a quantity or suggests a singular (an expression with a or an as an article), the expression is interpreted as not being limited to a specific number. It should be.

本明細書（請求項を含む）において、ある箇所において「１つ又は複数（one or more）」又は「少なくとも１つ（at least one）」等の表現が用いられ、他の箇所において数量を指定しない又は単数を示唆する表現（a又はanを冠詞とする表現）が用いられているとしても、後者の表現が「１つ」を意味することを意図しない。一般に、数量を指定しない又は単数を示唆する表現（a又はanを冠詞とする表現）は、必ずしも特定の数に限定されないものとして解釈されるべきである。 In the present specification (including claims), expressions such as "one or more" or "at least one" are used in some places, and the quantity is specified in other places. Even if expressions that do not or suggest the singular (expressions with a or an as an article) are used, the latter expression is not intended to mean "one". In general, expressions that do not specify a quantity or suggest a singular (expressions with a or an as an article) should be interpreted as not necessarily limited to a particular number.

本明細書において、ある実施例の有する特定の構成について特定の効果（advantage/result）が得られる旨が記載されている場合、別段の理由がない限り、当該構成を有する他の１つ又は複数の実施例についても当該効果が得られると理解されるべきである。但し当該効果の有無は、一般に種々の要因、条件、及び／又は状態等に依存し、当該構成により必ず当該効果が得られるものではないと理解されるべきである。当該効果は、種々の要因、条件、及び／又は状態等が満たされたときに実施例に記載の当該構成により得られるものに過ぎず、当該構成又は類似の構成を規定したクレームに係る発明において、当該効果が必ずしも得られるものではない。 In the present specification, when it is stated that a specific effect (advantage / result) can be obtained for a specific configuration of an embodiment, unless there is a specific reason, one or more of the other configurations having the configuration. It should be understood that the effect can also be obtained in the examples of. However, it should be understood that the presence or absence of the effect generally depends on various factors, conditions, and / or states, etc., and that the effect cannot always be obtained by the configuration. The effect is merely obtained by the configuration described in the examples when various factors, conditions, and / or conditions are satisfied, and in the invention relating to the claim that defines the configuration or a similar configuration. , The effect is not always obtained.

本明細書（請求項を含む）において、「最大化（maximize）」等の用語が用いられる場合は、グローバルな最大値を求めること、グローバルな最大値の近似値を求めること、ローカルな最大値を求めること、及びローカルな最大値の近似値を求めることを含み、当該用語が用いられた文脈に応じて適宜解釈されるべきである。また、これら最大値の近似値を確率的又はヒューリスティックに求めることを含む。同様に、「最小化（minimize）」等の用語が用いられる場合は、グローバルな最小値を求めること、グローバルな最小値の近似値を求めること、ローカルな最小値を求めること、及びローカルな最小値の近似値を求めることを含み、当該用語が用いられた文脈に応じて適宜解釈されるべきである。また、これら最小値の近似値を確率的又はヒューリスティックに求めることを含む。同様に、「最適化（optimize）」等の用語が用いられる場合は、グローバルな最適値を求めること、グローバルな最適値の近似値を求めること、ローカルな最適値を求めること、及びローカルな最適値の近似値を求めることを含み、当該用語が用いられた文脈に応じて適宜解釈されるべきである。また、これら最適値の近似値を確率的又はヒューリスティックに求めることを含む。 In the present specification (including claims), when terms such as "maximize" are used, the global maximum value is obtained, the approximate value of the global maximum value is obtained, and the local maximum value is obtained. Should be interpreted as appropriate according to the context in which the term was used, including finding an approximation of the local maximum. It also includes probabilistically or heuristically finding approximate values of these maximum values. Similarly, when terms such as "minimize" are used, find the global minimum, find the approximation of the global minimum, find the local minimum, and find the local minimum. It should be interpreted as appropriate according to the context in which the term was used, including finding an approximation of the value. It also includes probabilistically or heuristically finding approximate values of these minimum values. Similarly, when terms such as "optimize" are used, finding a global optimal value, finding an approximation of a global optimal value, finding a local optimal value, and local optimization It should be interpreted as appropriate according to the context in which the term was used, including finding an approximation of the value. It also includes probabilistically or heuristically finding approximate values of these optimal values.

本明細書（請求項を含む）において、複数のハードウェアが所定の処理を行う場合、各ハードウェアが協働して所定の処理を行ってもよいし、一部のハードウェアが所定の処理の全てを行ってもよい。また、一部のハードウェアが所定の処理の一部を行い、別のハードウェアが所定の処理の残りを行ってもよい。本明細書（請求項を含む）において、「１又は複数のハードウェアが第１の処理を行い、前記１又は複数のハードウェアが第２の処理を行う」等の表現が用いられている場合、第１の処理を行うハードウェアと第２の処理を行うハードウェアは同じものであってもよいし、異なるものであってもよい。つまり、第１の処理を行うハードウェア及び第２の処理を行うハードウェアが、前記１又は複数のハードウェアに含まれていればよい。なお、ハードウェアは、電子回路、又は、電子回路を含む装置等を含んでもよい。 In the present specification (including claims), when a plurality of hardware performs a predetermined process, the respective hardware may cooperate to perform the predetermined process, or some hardware may perform the predetermined process. You may do all of the above. Further, some hardware may perform a part of a predetermined process, and another hardware may perform the rest of the predetermined process. In the present specification (including claims), when expressions such as "one or more hardware performs the first process and the one or more hardware performs the second process" are used. , The hardware that performs the first process and the hardware that performs the second process may be the same or different. That is, the hardware that performs the first process and the hardware that performs the second process may be included in the one or more hardware. The hardware may include an electronic circuit, a device including the electronic circuit, or the like.

以上、本開示の実施形態について詳述したが、本開示は上記した個々の実施形態に限定されるものではない。特許請求の範囲に規定された内容及びその均等物から導き出される本発明の概念的な思想と趣旨を逸脱しない範囲において種々の追加、変更、置き換え及び部分的削除等が可能である。例えば、前述した全ての実施形態において、数値又は数式を説明に用いている場合は、一例として示したものであり、これらに限られるものではない。また、実施形態における各動作の順序は、一例として示したものであり、これらに限られるものではない。 Although the embodiments of the present disclosure have been described in detail above, the present disclosure is not limited to the individual embodiments described above. Various additions, changes, replacements, partial deletions, etc. are possible without departing from the conceptual idea and purpose of the present invention derived from the contents defined in the claims and their equivalents. For example, in all the above-described embodiments, when numerical values or mathematical formulas are used for explanation, they are shown as examples, and the present invention is not limited thereto. Further, the order of each operation in the embodiment is shown as an example, and is not limited to these.

1: Estimation device,
10: First estimation unit,
100: First encoder,
102: Second encoder,
104: Decoder,
12: Second estimation unit,
120: Estimator,
122: Aggregator,
2: Training device,
20: Forward propagation unit,
22: Error calculation unit,
24: Update unit

Claims

An estimation device comprising:
one or more memories; and
one or more processors, wherein
the one or more memories store a trained model capable of estimating dynamics parameters, and
the one or more processors:
encode a state;
encode dynamics parameters of an environment based on the dynamics parameters estimated by the trained model; and
estimate an action based on the encoded state and the encoded dynamics parameters.
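
A minimal sketch of the data flow recited above, assuming small dense layers with random weights in place of trained encoders and a trained decoder. The class name `Policy`, the helper functions, and all dimensions are hypothetical and chosen only for illustration; the estimated dynamics parameters are passed in as a plain list rather than produced by a trained model.

```python
import math
import random


def dense(weights, vec):
    """Multiply a weight matrix (list of rows) by a vector, then apply tanh."""
    return [math.tanh(sum(w * x for w, x in zip(row, vec))) for row in weights]


def random_matrix(rows, cols, rng):
    """Random weights stand in for parameters that training would produce."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


class Policy:
    """Hypothetical stand-in for the two encoders and the decoder."""

    def __init__(self, state_dim, dyn_dim, latent_dim, action_dim, seed=0):
        rng = random.Random(seed)
        self.state_encoder = random_matrix(latent_dim, state_dim, rng)
        self.dynamics_encoder = random_matrix(latent_dim, dyn_dim, rng)
        self.decoder = random_matrix(action_dim, 2 * latent_dim, rng)

    def estimate_action(self, state, estimated_dynamics):
        encoded_state = dense(self.state_encoder, state)
        encoded_dynamics = dense(self.dynamics_encoder, estimated_dynamics)
        # The decoder sees both codes concatenated into one vector.
        return dense(self.decoder, encoded_state + encoded_dynamics)


policy = Policy(state_dim=3, dyn_dim=2, latent_dim=4, action_dim=2)
# In the device, `estimated_dynamics` would come from the trained model
# that estimates dynamics parameters; here it is a made-up value.
action = policy.estimate_action([0.1, -0.2, 0.3], [1.2, 0.4])
```

Keeping the state encoder and the dynamics encoder as separate objects means each could be invoked at its own rate, for example re-encoding the dynamics parameters less often than the state.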

The estimation device according to claim 1, wherein
the one or more processors encode the state and encode the dynamics parameters using different trained models.

The estimation device according to claim 1 or 2, wherein
the frequency at which the one or more processors encode the state differs from the frequency at which the one or more processors encode the dynamics parameters.

The estimation device according to any one of claims 1 to 3, wherein
the one or more processors output a probability distribution over the action.

The estimation device according to claim 4, wherein
the one or more processors output the probability distribution over the action using a trained model.

The estimation device according to any one of claims 1 to 5, wherein
the one or more processors estimate the dynamics parameters of the environment based on time-series data of the state and of the state that results from acting on that state in the environment.

The estimation device according to claim 6, wherein
the one or more processors:
divide the time-series data into a plurality of chunks;
evaluate a distribution of the dynamics parameters corresponding to each of the chunks; and
estimate the dynamics parameters based on the plurality of evaluated distributions of the dynamics parameters.

The estimation device according to claim 7, wherein
the one or more processors:
obtain an estimate of the dynamics parameters based on the time-series states and the time-series actions in each chunk; and
estimate the dynamics parameters by combining the estimates of the dynamics parameters obtained from the plurality of chunks.

The estimation device according to claim 8, wherein
the one or more processors obtain the estimate of the dynamics parameters using a trained model.

The estimation device according to claim 8 or 9, wherein
the one or more processors combine the estimates of the dynamics parameters using a trained model.

A training device comprising:
one or more memories; and
one or more processors, wherein
the one or more processors:
encode a state;
encode dynamics parameters of a randomly selected environment;
output a first action based on the encoded state and the encoded dynamics parameters;
estimate, as a second action, what action was performed, based on the state and the state after the first action is performed; and
train a model that estimates the first action by comparing the first action with the second action.
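
One training step of the kind recited above might look like the following sketch. Everything concrete here is an assumption made for illustration: the identity "encoders", the linear policy, the toy transition `next_state = state + theta * first_action`, the fixed `gain` of the inverse-dynamics model, the squared-difference comparison, and the choice to hold the second action fixed as a target while updating only the policy weights.

```python
def policy(weights, enc_state, enc_dyn):
    """First action: linear in the encoded state and encoded dynamics."""
    return weights[0] * enc_state + weights[1] * enc_dyn


def inverse_dynamics(gain, state, next_state):
    """Second action: guess of the action that caused state -> next_state."""
    return gain * (next_state - state)


def comparison_loss(first_action, second_action):
    """Squared difference between the first and second actions."""
    return (first_action - second_action) ** 2


def train_step(weights, gain, state, theta, lr=0.05):
    """One gradient step on the policy weights; returns (weights, loss, target)."""
    # Identity "encoders" keep the sketch short; real encoders would be learned.
    enc_state, enc_dyn = state, theta
    first = policy(weights, enc_state, enc_dyn)
    next_state = state + theta * first  # toy environment transition
    second = inverse_dynamics(gain, state, next_state)
    loss = comparison_loss(first, second)
    # Gradient w.r.t. the policy weights, treating `second` as a fixed target.
    diff = 2.0 * (first - second)
    new_weights = [weights[0] - lr * diff * enc_state,
                   weights[1] - lr * diff * enc_dyn]
    return new_weights, loss, second


# `theta` plays the role of the dynamics parameter of a selected environment.
weights, loss, target = train_step([0.5, -0.3], gain=1.0, state=1.0, theta=0.5)
```

In a fuller setup `theta` would be drawn at random from a set of simulated environments on every step, and the inverse-dynamics model could be updated alongside the policy.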

The training device according to claim 11, wherein
the one or more processors train a model that estimates the second action together with the model that estimates the first action.

The training device according to claim 11 or 12, wherein
the one or more processors:
reconstruct the state from the encoded state; and
train a model that encodes the state by comparing the state with the reconstructed state.

The training device according to claim 13, wherein
the one or more processors train a model that reconstructs the state from the encoded state together with the model that encodes the state.
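
Training an encoder together with the model that restores its output can be sketched as a tiny linear autoencoder. The one-dimensional code, the hand-written gradients, the learning rate, and the synthetic two-dimensional states are assumptions made for illustration only.

```python
def encode(enc_w, state):
    """Encoder: project the state onto a single latent value."""
    return sum(w * x for w, x in zip(enc_w, state))


def decode(dec_w, code):
    """Decoder: restore a state vector from the latent value."""
    return [w * code for w in dec_w]


def reconstruction_loss(enc_w, dec_w, state):
    """Squared error between the state and its restored version."""
    restored = decode(dec_w, encode(enc_w, state))
    return sum((r - x) ** 2 for r, x in zip(restored, state))


def train_step(enc_w, dec_w, state, lr=0.02):
    """One gradient step on the reconstruction error, updating both models."""
    code = encode(enc_w, state)
    errors = [r - x for r, x in zip(decode(dec_w, code), state)]
    # d(loss)/d(code), needed to push the gradient back into the encoder.
    grad_code = sum(2.0 * e * w for e, w in zip(errors, dec_w))
    new_dec = [w - lr * 2.0 * e * code for w, e in zip(dec_w, errors)]
    new_enc = [w - lr * grad_code * x for w, x in zip(enc_w, state)]
    return new_enc, new_dec


# Synthetic states that lie on a line, so a 1-D code can represent them.
states = [[t, 2.0 * t] for t in (-1.0, -0.5, 0.5, 1.0)]
enc_w, dec_w = [0.1, 0.1], [0.1, 0.1]
initial = sum(reconstruction_loss(enc_w, dec_w, s) for s in states)
for _ in range(200):
    for s in states:
        enc_w, dec_w = train_step(enc_w, dec_w, s)
final = sum(reconstruction_loss(enc_w, dec_w, s) for s in states)
```

Comparing `initial` with `final` shows whether the encoder and decoder have learned to preserve the information in the state.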

The training device according to claim 13 or 14, wherein
the one or more processors add noise to the first action and train the model that encodes the state.

The training device according to claim 15, wherein
the one or more processors generate the noise based on the selected environment.

The training device according to any one of claims 11 to 16, wherein
the one or more processors further train the model that estimates the first action by reinforcement learning.

The training device according to any one of claims 11 to 17, wherein
the one or more processors:
generate the dynamics parameters for a plurality of environments; and
select the dynamics parameters corresponding to a randomly selected environment.
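
Generating dynamics parameters for several environments and then picking one at random can be sketched as below. The parameter names `mass` and `friction`, their ranges, and the uniform sampling are hypothetical; the source does not specify which physical quantities the dynamics parameters represent.

```python
import random


def generate_environments(n_envs, rng):
    """Draw one set of dynamics parameters per simulated environment."""
    return [
        {"mass": rng.uniform(0.5, 2.0), "friction": rng.uniform(0.1, 1.0)}
        for _ in range(n_envs)
    ]


def select_dynamics(environments, rng):
    """Pick one environment uniformly at random and return its parameters."""
    return rng.choice(environments)


rng = random.Random(0)
environments = generate_environments(8, rng)
dynamics = select_dynamics(environments, rng)
```

Calling `select_dynamics` once per training episode exposes the model to a different simulated environment each time.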

The training device according to claim 18, wherein
the one or more processors perform fine-tuning using the plurality of dynamics parameters.

The training device according to claim 18 or 19, wherein
the one or more processors estimate the dynamics parameters based on time-series data of the state, the first action performed on the state, and the state after the first action is performed.

The training device according to claim 20, wherein
the one or more processors train a model that estimates the dynamics parameters based on an assumed prior distribution and an assumed posterior distribution.

An estimation method comprising:
encoding, by one or more processors, a state;
encoding, by the one or more processors, dynamics parameters of an environment based on the dynamics parameters estimated by a trained model stored in one or more memories; and
estimating, by the one or more processors, an action based on the encoded state and the encoded dynamics parameters.

A training method comprising:
encoding, by one or more processors, a state;
encoding, by the one or more processors, dynamics parameters of a randomly selected environment;
outputting, by the one or more processors, a first action based on the encoded state and the encoded dynamics parameters;
estimating, by the one or more processors, as a second action, what action was performed, based on the state and the state after the first action is performed; and
training, by the one or more processors, a model that estimates the first action by comparing the first action with the second action.

A program that, when executed by one or more processors, causes the one or more processors to:
encode a state;
encode dynamics parameters of an environment based on the dynamics parameters estimated by a trained model stored in one or more memories; and
estimate an action based on the encoded state and the encoded dynamics parameters.

A program that, when executed by one or more processors, causes the one or more processors to:
encode dynamics parameters of a randomly selected environment;
output a first action based on the encoded state and the encoded dynamics parameters;
estimate, as a second action, what action was performed, based on the state and the state after the first action is performed; and
train a model that estimates the first action by comparing the first action with the second action.

A non-transitory computer-readable medium storing a program that, when executed by one or more processors, causes the one or more processors to:
encode a state;
encode dynamics parameters of an environment based on the dynamics parameters estimated by a trained model stored in one or more memories; and
estimate an action based on the encoded state and the encoded dynamics parameters.

A non-transitory computer-readable medium storing a program that, when executed by one or more processors, causes the one or more processors to:
encode dynamics parameters of a randomly selected environment;
output a first action based on the encoded state and the encoded dynamics parameters;
estimate, as a second action, what action was performed, based on the state and the state after the first action is performed; and
train a model that estimates the first action by comparing the first action with the second action.