JP2021527289A

JP2021527289A - Sum Stochastic Gradient Estimating Methods, Devices, and Computer Programs

Info

Publication number: JP2021527289A
Application number: JP2021518295A
Authority: JP
Inventors: パラマス，パーヴォ
Original assignee: Individual
Current assignee: Individual
Priority date: 2018-06-05
Filing date: 2019-06-05
Publication date: 2021-10-11
Anticipated expiration: 2039-06-05
Also published as: WO2019235551A1; JP7378836B2

Abstract

Problem to be solved: To provide a gradient estimation method, a gradient estimation device, and a computer program.
Solution: The gradient estimation method involves a computational graph and estimates the gradient of one variable in the graph with respect to another. At some nodes in the graph, two or more different estimates of the same gradient are computed using different gradient estimators. The different estimates are combined so that their number is smaller than the initial number of estimates, the combined estimates are passed to different nodes in the graph, and the gradient estimates are used in further computation.

Description

The present invention relates to a method for estimating the gradients of variables defined in a computational graph, a device that performs the estimation, and a computer program.

Most machine learning problems involve optimizing the expected value of an objective function J(x; θ) over some data-generating distribution p_Data(x), but this distribution is accessible only through sample data points {x_i}.

The most common optimization method is gradient descent using pathwise derivatives computed by backpropagation.

Bengio, Y., Simard, P., and Frasconi, P. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2):157-166, 1994.
Deisenroth, Marc Peter, Neumann, Gerhard, Peters, Jan, et al. A survey on policy search for robotics. Foundations and Trends in Robotics, 2(1-2):1-142, 2013.
Deisenroth, Marc Peter, Fox, Dieter, and Rasmussen, Carl Edward. Gaussian processes for data-efficient learning in robotics and control. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(2):408-423, 2015.

In some situations (particularly with very long or recurrent computational graphs), this approach can degenerate into a random walk because the gradient variance explodes. This phenomenon is usually regarded as a numerical problem that leads to large steps and unstable learning (see Non-Patent Document 1).

An object of the present invention is to solve the problems associated with gradient estimation. The present invention is a general-purpose gradient estimation method that can be used on any computational graph as an alternative to the backpropagation algorithm.

The gradient estimation method involves a computational graph and estimates the gradient of one variable in the graph with respect to other variables. At some nodes in the graph, two or more distinct estimates of the same gradient are computed using distinct gradient estimators. The distinct estimates are combined so that their number is smaller than the initial number of estimates, the combined estimates are passed to different nodes in the graph, and the gradient estimates are used in further computation.

According to the present application, it is possible to provide an alternative, flexible framework for gradient evaluation that is more accurate and does not suffer from exploding gradients.

A block diagram showing the hardware configuration of the computing device 1 according to the present embodiment.
A diagram illustrating the policy gradient evaluation algorithm used by PILCO.
A diagram illustrating the total propagation algorithm.
A flowchart illustrating the procedure executed by the computing device 1 according to the present embodiment.
Six diagrams showing experimental results.
Two graphs of variance.
Four diagrams showing experimental results.
Two diagrams showing examples of paths in Equation 11.
Two diagrams showing the stochastic computational graphs for model-based and model-free LR gradient estimation.
A diagram showing Algorithm 3, which explains in detail how the method fits with total propagation.
A diagram showing the computational paths in the Gaussian shaping gradient.
Four diagrams showing experimental results.
A diagram showing the general form of the algorithm.
A diagram showing the backpropagation algorithm, which is used in all neural network applications in machine learning as well as in many other applications.
A diagram showing the total propagation algorithm when gradient estimation is performed by combining the likelihood ratio and reparameterization gradient estimators into a single gradient estimator.

(Embodiment 1)
FIG. 1 is a block diagram showing the hardware configuration of a computing device 1 according to the present embodiment. The computing device 1 is an information processing device such as a personal computer or a server. It includes a control unit 11, a storage unit 12, an input unit 13, a communication unit 14, an operation unit 15, and a display unit 16. The computing device 1 implements the methods disclosed in "PIPPS: Flexible Model-Based Policy Search Robust to the Curse of Chaos", "Total Propagation Algorithm: Supplementary notes", and "Total stochastic gradient algorithms and applications in reinforcement learning" by the present inventors.

The control unit 11 includes a CPU (Central Processing Unit), a ROM (Read Only Memory), a RAM (Random Access Memory), and the like. The ROM of the control unit 11 stores a control program and the like for controlling the operation of each hardware component. The CPU of the control unit 11 executes the control program stored in the ROM and the various programs stored in the storage unit 12, described later, and controls the operation of the hardware in accordance with the methods disclosed in the above papers. The RAM of the control unit 11 stores data used temporarily while the various programs are executed.

The control unit 11 is not limited to this configuration and may be one or more processing circuits or arithmetic circuits, including a single-core CPU, a multi-core CPU, a GPU (Graphics Processing Unit), a microcomputer, and volatile or non-volatile memory. The control unit 11 may also include functions such as a clock that outputs date and time information, a timer that measures the time elapsed from a measurement start command to a measurement end command, and a counter for counting.

The storage unit 12 includes a storage device using SRAM (Static Random Access Memory), flash memory, a hard disk, or the like. The storage unit 12 stores the various programs executed by the control unit 11 and the data needed to execute them. The programs stored in the storage unit 12 include, for example, a computer program that implements the techniques disclosed in the above papers.

The programs stored in the storage unit 12 may be provided on a recording medium M on which the program is recorded in readable form. The recording medium M is a portable memory such as an SD (Secure Digital) card, a micro SD card, or a CompactFlash (registered trademark) card. In this case, the control unit 11 can read the program from the recording medium M using a reading device (not shown) and install it in the storage unit 12. The programs stored in the storage unit 12 may also be provided by communication via the communication unit 14. In this case, the control unit 11 can acquire the program through the communication unit 14 and install it in the storage unit 12.

The input unit 13 has an input interface for inputting various data into the device. The control unit 11 acquires the data to be processed through the input unit 13.

The communication unit 14 includes a communication interface for connecting to a communication network (not shown) such as the Internet. It transmits various kinds of information to be reported externally and receives various kinds of information sent from outside. In the present embodiment the data to be processed is acquired through the input unit 13, but it may instead be acquired through the communication unit 14.

The operation unit 15 includes user interfaces such as a keyboard and a touch panel and accepts various operation and setting information. The control unit 11 performs appropriate control based on the operation information input from the operation unit 15 and, as needed, stores the setting information in the storage unit 12.

The display unit 16 includes a display device such as a liquid crystal display panel or an organic EL (Electro Luminescence) display panel, and displays information to be presented to the user based on control signals output from the control unit 11.

In the present embodiment, the configuration disclosed in the above papers is realized by software processing executed by the control unit 11, but hardware such as an LSI (Large-Scale Integration), an ASIC (Application-Specific Integrated Circuit), or an FPGA (Field-Programmable Gate Array) may be mounted separately from the control unit 11. In this case, the control unit 11 sends the data to be processed, input from the input unit 13, to that hardware, and the methods disclosed in the above papers are carried out within the hardware.

Further, although the computing device 1 is described as a single device for simplicity, it may consist of multiple computing devices or of one or more virtual machines.

In the present embodiment the computing device 1 includes the operation unit 15 and the display unit 16, but these are not essential. For example, the computing device 1 may accept operations through an externally connected computer and output the information to be reported to that external computer.

The gradient estimation method of the present invention is described below. In the equations, lowercase letters denote scalars and bold letters denote vectors or matrices. In the following text, however, lowercase and bold letters are not distinguished. In addition, "C_^" denotes a character with a hat and "C_~" denotes a character with a tilde.

(2.1 Policy search)
See Non-Patent Document 2 for an overview of policy search methods. Policy search is only one application of the algorithm, which is not limited to a particular computational graph and can be applied to any computational graph. Consider a discrete-time system described by a state vector x_t (for example, the position and velocity of a robot) and an applied action/control vector u_t (for example, motor torques). An episode starts by sampling a state from a fixed initial state distribution, x_0 ~ p(x_0). The policy π_θ determines the applied action, u_t ~ p(u_t) = π(x_t; θ). Applying the action causes the state to transition according to an unknown dynamics function, x_{t+1} ~ p(x_{t+1}) = f(x_t, u_t). Both the policy and the dynamics may be stochastic and nonlinear. Actions and state transitions are repeated for up to T time steps, producing a trajectory τ: (x_0, u_0, x_1, u_1, ..., x_T). Each episode is scored according to a return function G(τ). The return often decomposes into a sum of per-time-step costs, G(τ) = Σ_{t=0}^{T} c(x_t), where c(x) is the cost function. The goal is to optimize the policy parameters θ so as to minimize the expected return J(θ) = E_{τ~p(τ;θ)}[G(τ)]. The value is defined as V_h(x) = E[Σ_{t=h}^{T} c(x_t)].
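The episode loop described above can be sketched in Python as follows. This is a minimal illustration, not the patent's implementation: the scalar dynamics, the noisy linear policy with parameter `theta`, the quadratic cost, and all function names are assumptions chosen only to make the example runnable.

```python
import random

def rollout(policy, dynamics, cost, x0, T, rng):
    """Run one episode; return the trajectory (x_0, u_0, ..., x_T) and its return G."""
    x = x0
    trajectory = [x]
    G = cost(x)                      # c(x_0)
    for _ in range(T):
        u = policy(x, rng)           # u_t ~ pi(x_t; theta)
        x = dynamics(x, u, rng)      # x_{t+1} ~ f(x_t, u_t)
        trajectory += [u, x]
        G += cost(x)                 # accumulate c(x_{t+1})
    return trajectory, G

def make_policy(theta, noise=0.1):
    # Hypothetical stochastic linear policy (an assumption for illustration).
    return lambda x, rng: theta * x + noise * rng.gauss(0.0, 1.0)

def dynamics(x, u, rng):
    # Hypothetical stochastic dynamics standing in for the unknown f.
    return x + u + 0.05 * rng.gauss(0.0, 1.0)

def cost(x):
    # Hypothetical quadratic cost c(x).
    return x * x

def expected_return(theta, episodes=200, T=10, seed=0):
    """Monte Carlo estimate of J(theta) = E[G(tau)]."""
    rng = random.Random(seed)
    policy = make_policy(theta)
    total = 0.0
    for _ in range(episodes):
        x0 = rng.gauss(1.0, 0.1)     # x_0 ~ p(x_0)
        total += rollout(policy, dynamics, cost, x0, T, rng)[1]
    return total / episodes
```

With these toy choices, a stabilizing parameter such as `theta = -0.5` should give a much lower estimated return than a destabilizing one such as `theta = 0.5`.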

Learning alternates between executing the policy on the system and updating θ to improve performance on subsequent trials. Policy gradient methods directly estimate the gradient of the objective, dJ(θ)/dθ, and use it for optimization. Some model-based policy search methods use all of the data to learn a model of f, denoted f_^, and use it for "mental rehearsal" between trials to optimize the policy. Hundreds of simulated trials can be run for each real trial, which greatly improves data efficiency. The method exploits the fact that differentiating f_^ can yield better gradient estimators than model-free algorithms. The model in this case is probabilistic and predicts state distributions.

(Stochastic gradient estimation)
This section describes how to compute the gradient of the expected value of an arbitrary function φ(x) with respect to the parameters of the sampling distribution, d/dθ E_{x~p(x;θ)}[φ(x)] (for example, the gradient of the expected return with respect to the policy parameters).

(Reparameterization gradient (RP))
Consider sampling from a univariate Gaussian distribution. One approach is to sample with zero mean and unit variance, ε ~ N(0, 1), and then map this point to reproduce a sample from the desired distribution, x = μ + σε. It is then straightforward to differentiate the output with respect to the distribution parameters: dx/dμ = 1 and dx/dσ = ε. Averaging dφ/dx · dx/dθ over the samples gives an unbiased estimate of the gradient of the expectation. This is the RP gradient for the normal distribution. For a multivariate Gaussian, the Cholesky factor L of the covariance matrix (such that Σ = LL^T) can be used in place of σ.
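The univariate RP estimator above can be sketched as follows. The test function φ(x) = x^2 and the function name `rp_gradient` are assumptions for illustration; with this φ, E[φ(x)] = μ^2 + σ^2, so the exact gradient (2μ, 2σ) is available for comparison.

```python
import random

def rp_gradient(dphi_dx, mu, sigma, num_samples, rng):
    """RP estimate of (d/dmu, d/dsigma) of E[phi(x)] for x ~ N(mu, sigma^2)."""
    g_mu = 0.0
    g_sigma = 0.0
    for _ in range(num_samples):
        eps = rng.gauss(0.0, 1.0)    # eps ~ N(0, 1)
        x = mu + sigma * eps         # reparameterized sample
        slope = dphi_dx(x)           # dphi/dx at the sample
        g_mu += slope * 1.0          # dx/dmu = 1
        g_sigma += slope * eps       # dx/dsigma = eps
    return g_mu / num_samples, g_sigma / num_samples

# phi(x) = x^2, so dphi/dx = 2x; the exact gradient at (mu, sigma) = (1.0, 0.5) is (2.0, 1.0).
g_mu, g_sigma = rp_gradient(lambda x: 2.0 * x, 1.0, 0.5, 20000, random.Random(0))
```

The estimator needs the derivative of φ, which is why the RP gradient applies when a differentiable model of the computation is available.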

(Likelihood ratio gradient (LR))
The desired gradient can be written as d/dθ E_{x~p(x;θ)}[φ(x)] = ∫ (dp(x;θ)/dθ) φ(x) dx. In general, any function can be integrated by sampling from a distribution q(x), using ∫ φ(x) dx = ∫ q(x) φ(x)/q(x) dx = E_{x~q}[φ(x)/q(x)]. The likelihood ratio gradient chooses q(x) = p(x) and integrates directly, as follows.
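Spelled out, the identity referred to here is the standard log-derivative form below. It is a reconstruction from the definitions in this section, assuming p(x; θ) is differentiable in θ and positive wherever φ(x) is nonzero.

```latex
\frac{d}{d\theta}\,\mathbb{E}_{x\sim p(x;\theta)}[\phi(x)]
= \int p(x;\theta)\,\frac{1}{p(x;\theta)}\frac{dp(x;\theta)}{d\theta}\,\phi(x)\,dx
= \mathbb{E}_{x\sim p(x;\theta)}\!\left[\frac{d\log p(x;\theta)}{d\theta}\,\phi(x)\right]
```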

LR gradients often have high variance and need to be combined with variance reduction techniques known as control variates (Greensmith et al., 2004). A common approach is to subtract a constant baseline b from the function value, giving the estimator E_{x~p}[(d log p(x;θ)/dθ)(φ(x) - b)]. If b is independent of the samples, this can reduce the variance substantially without introducing bias. In practice, the sample mean is a good choice (b = E[φ(x)]). When estimating the gradient from a batch, an unbiased gradient estimator can be obtained by estimating a leave-one-out baseline for each point: b_i = Σ_{j≠i}^{P} φ(x_j)/(P - 1).
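The baseline-corrected LR estimator can be sketched for a univariate Gaussian as follows. The function name `lr_gradient_mu` and the test function φ(x) = x^2 are illustrative assumptions; only the gradient with respect to μ is estimated, using the Gaussian score d log p/dμ = (x - μ)/σ^2.

```python
import random

def lr_gradient_mu(phi, mu, sigma, num_samples, rng):
    """LR estimate of d/dmu E[phi(x)], x ~ N(mu, sigma^2), with leave-one-out baselines."""
    xs = [rng.gauss(mu, sigma) for _ in range(num_samples)]
    values = [phi(x) for x in xs]
    total = sum(values)
    grad = 0.0
    for x, value in zip(xs, values):
        baseline = (total - value) / (num_samples - 1)  # b_i: mean of the other samples
        score = (x - mu) / sigma ** 2                   # d log N(x; mu, sigma^2) / d mu
        grad += score * (value - baseline)
    return grad / num_samples

# phi(x) = x^2, so the exact value of d/dmu E[phi] at mu = 1.0 is 2.0.
g_mu_lr = lr_gradient_mu(lambda x: x * x, 1.0, 0.5, 20000, random.Random(0))
```

Unlike the RP estimator, this one uses only function values of φ and never its derivative.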

(Trajectory gradient estimation)
The probability density of observing a particular trajectory, p(τ) = p(x_0, u_0, x_1, u_1, ..., x_T), can be written as p(x_0) π(u_0 | x_0) p(x_1 | x_0, u_0) ... p(x_T | x_{T-1}, u_{T-1}).

To use the RP gradient, the dynamics p(x_{t+1} | x_t, u_t) must be known or estimated. In other words, the RP gradient is applicable in the model-based case. With such a model, the predicted trajectory can be differentiated using the chain rule.

To use the LR gradient, note that p(τ) is a product, so log p(τ) can be converted into a sum. Let G_h(τ) = Σ_{t=h}^{T} c(x_t). Noting that (1) only the action distributions depend on the policy parameters and (2) an action does not affect costs incurred at earlier time steps, the following gradient estimator is obtained.
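Written out from points (1) and (2) above, and stated here without a baseline term, the estimator takes the following form. This is a reconstruction from the surrounding definitions rather than a copy of the patent's equation; a baseline that does not depend on u_t may be subtracted from G_{t+1}(τ), as in the LR section.

```latex
\frac{d}{d\theta}\,\mathbb{E}_{\tau\sim p(\tau;\theta)}[G(\tau)]
= \mathbb{E}_{\tau\sim p(\tau;\theta)}\!\left[\sum_{t=0}^{T-1}
  \frac{d\log\pi(u_t\mid x_t;\theta)}{d\theta}\,G_{t+1}(\tau)\right]
```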

(PILCO)
FIG. 2 illustrates the policy gradient evaluation algorithm used by PILCO. The original PILCO is followed here. It uses Gaussian process dynamics models to predict the change in state from one time step to the next, i.e., p(Δx_{t+1}^a) = GP(x_t, u_t), where x ∈ R^D, u ∈ R^F, and Δx_{t+1}^a = x_{t+1}^a - x_t^a. A separate Gaussian process is learned for each dimension a. The squared exponential covariance function k_a(x_~, x'_~) = s_a^2 exp(-(x_~ - x'_~)^T Λ_a^{-1} (x_~ - x'_~)) is used, where s_a and Λ_a = diag([l_{a1}, l_{a2}, ..., l_{a,D+F}]) are the function variance and length-scale hyperparameters, respectively. A Gaussian likelihood function with noise hyperparameter σ_n is also used. The hyperparameters are trained to maximize the marginal likelihood. When sampling from these models, predictions have the form y = f_^(x) + ε, where ε ~ N(0, σ_f^2(x) + σ_n^2). Here σ_f^2 represents the model uncertainty and is due to a lack of data in the region, whereas σ_n^2 is the learned inherent model noise. The learned model noise is not necessarily the same as the actual observation noise σ_o^2 in the system. In fact, the latent state is not modeled, and the system is approximated by predicting the next observation given the current observation. In addition, the trajectories have a further source of variance: different starting positions give different trajectories.
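The covariance function above can be sketched as follows. The function name `se_kernel` and the example inputs are assumptions; the code follows the formula exactly as written in the text, with the entries of Λ_a on the diagonal and no factor of 1/2, whereas other references often use squared length scales and a factor of 1/2.

```python
import math

def se_kernel(x, x_prime, s, lengthscales):
    """k(x, x') = s^2 * exp(-(x - x')^T Lambda^{-1} (x - x')), Lambda = diag(lengthscales)."""
    quad = sum((a - b) ** 2 / l for a, b, l in zip(x, x_prime, lengthscales))
    return s ** 2 * math.exp(-quad)

# Hypothetical state-action inputs of dimension D + F = 2.
k_same = se_kernel([0.0, 0.0], [0.0, 0.0], 2.0, [1.0, 1.0])
k_near = se_kernel([0.0, 0.0], [1.0, 0.0], 2.0, [1.0, 1.0])
k_far = se_kernel([0.0, 0.0], [3.0, 0.0], 2.0, [1.0, 1.0])
```

The kernel equals s^2 for identical inputs and decays toward zero as the inputs move apart, which is what makes the predictive uncertainty grow away from the training data.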

(Moment matching prediction)
In general, when a Gaussian distribution is mapped through a nonlinear function, the output is an intractable, non-Gaussian distribution. In some cases, however, the moments of the output distribution can be evaluated analytically. Moment matching (MM) approximates the output distribution as a Gaussian by matching its mean and variance to the true moments. Note that even though the state dimensions are modeled by separate functions f_^_a, MM is performed jointly, and the state distribution can include covariances.

(Particle prediction)
In general, particle trajectory prediction is simple: predict at all particle locations, sample from the output distributions, and repeat. A comparison is also made, however, with a scheme based on Gaussian resampling (GR), which is used when applying neural network dynamics models to PILCO.

(Gaussian resampling (GR))
MM can be replicated stochastically. At each time step, the particle mean μ_^ = Σ_{i=1}^{P} x_i / P and covariance Σ_^ = Σ_{i=1}^{P} (x_i - μ_^)(x_i - μ_^)^T / (P - 1) are estimated. The particles are then resampled from the fitted distribution, x'_i = μ_^ + L z_i with z_i ~ N(0, I), where L is the Cholesky factor of Σ_^. Obtaining the gradient dL/dΣ_^ is not straightforward. A given symbolic expression is used here.
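The forward part of Gaussian resampling can be sketched as follows, using only the standard library. The function names and the toy particle set are assumptions, and the sketch covers only the sampling step; it does not compute the gradient dL/dΣ_^ mentioned above.

```python
import math
import random

def cholesky(a):
    """Lower-triangular L with L L^T = a, for a symmetric positive-definite matrix a."""
    n = len(a)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(a[i][i] - s)
            else:
                L[i][j] = (a[i][j] - s) / L[j][j]
    return L

def gaussian_resample(particles, rng):
    """Fit a Gaussian to the particles and draw the same number of new particles from it."""
    P, D = len(particles), len(particles[0])
    mean = [sum(p[d] for p in particles) / P for d in range(D)]
    cov = [[sum((p[r] - mean[r]) * (p[c] - mean[c]) for p in particles) / (P - 1)
            for c in range(D)] for r in range(D)]
    L = cholesky(cov)
    resampled = []
    for _ in range(P):
        z = [rng.gauss(0.0, 1.0) for _ in range(D)]           # z_i ~ N(0, I)
        resampled.append([mean[r] + sum(L[r][k] * z[k] for k in range(r + 1))
                          for r in range(D)])                  # x'_i = mean + L z_i
    return resampled, mean, cov

# Toy 2-D particle set (made-up data for illustration).
rng = random.Random(0)
particles = [[rng.gauss(0.0, 1.0), rng.gauss(2.0, 0.5)] for _ in range(50)]
new_particles, mean_hat, cov_hat = gaussian_resample(particles, rng)
```

After resampling, the particle set is Gaussian by construction, which mirrors the approximation that moment matching makes analytically.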

(Hybrid gradient estimation techniques)
In the case of the present invention, RP gradients can be used. Surprisingly, however, the RP gradient is hopelessly inaccurate (see FIG. 5D). To solve this problem, a new gradient estimator was obtained that combines model derivatives with the LR gradient. In particular, the approach of the present invention improves sampling efficiency through importance sampling within the batch.

(Model-based LR)
The distribution over predicted trajectories can be written as p(τ) = p(x_0) π(u_0 | x_0) f_^(x_1 | x_0, u_0) ... f_^(x_T | x_{T-1}, u_{T-1}). With a deterministic policy, the model and the policy can be combined as p(x_{t+1} | x_t) = f_^(x_{t+1} | x_t, π(x_t; θ)), which is differentiable: dp_{t+1}/dθ = (dp_{t+1}/du_t)(du_t/dθ). The model-based gradient is derived as follows.

(Batch importance weighted LR (BIW-LR))
Here, parallel computation is used to sample multiple particles simultaneously. The state distribution is represented as the mixture distribution q(x_{t+1}) = Σ_{i=1}^{P} p(x_{t+1} | x_{i,t}; θ)/P. As in the derivation of LR, a lower-variance estimator can be derived for each time step by importance sampling within the batch, as follows.

The following equation estimates a leave-one-out mean of the returns by normalized importance sampling.

Here c_{j,t+1} = p(x_{j,t+1} | x_{i,t}) / Σ_{k=1}^{P} p(x_{j,t+1} | x_{k,t}). Without normalization, the high variance of the baseline estimate makes the LR gradient poor. Note that while P baselines are computed per time step, the gradient estimator has P^2 components. To obtain a truly unbiased gradient, P^2 leave-one-out baselines (one for each particle for each mixture component of the distribution) would have to be computed. This specification includes evaluations using only the baselines presented here, which were found to already remove most of the bias.
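The normalized weights defined above can be sketched as follows. The linear-Gaussian transition density and the particle values are assumptions made up for the example; `c[i][j]` holds the weight of next-state particle j under the mixture component generated by current particle i.

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of N(mean, std^2) evaluated at x."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2.0 * math.pi))

def normalized_weights(x_now, x_next, transition_pdf):
    """c[i][j] = p(x_next[j] | x_now[i]) / sum_k p(x_next[j] | x_now[k])."""
    P = len(x_now)
    dens = [[transition_pdf(x_next[j], x_now[i]) for j in range(P)] for i in range(P)]
    col_sums = [sum(dens[k][j] for k in range(P)) for j in range(P)]
    return [[dens[i][j] / col_sums[j] for j in range(P)] for i in range(P)]

def toy_transition_pdf(x_next, x):
    # Hypothetical transition model: x_{t+1} ~ N(0.9 * x_t, 0.2^2).
    return gaussian_pdf(x_next, 0.9 * x, 0.2)

x_now = [0.0, 0.5, 1.0]     # particles at time t (made-up values)
x_next = [0.1, 0.4, 0.8]    # particles at time t + 1 (made-up values)
c = normalized_weights(x_now, x_next, toy_transition_pdf)
```

For each next-state particle j, the weights over the mixture components sum to one, which is the normalization the text relies on to keep the baseline estimate well behaved.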

(RP/LR weighted average)
Most of the computation is spent on the dp(x_{t+1} | x_t; θ)/dθ terms. Because these terms are needed for both the LR and the RP gradients, combining the two estimators incurs no penalty. A well-known statistical result states that, for independent estimators, the optimal weighted-average estimate is obtained when the weights are proportional to the inverse variances: μ = μ_LR k_LR + μ_RP k_RP, where k_LR = σ_^_LR^{-2} / (σ_^_LR^{-2} + σ_^_RP^{-2}) and k_RP = 1 - k_LR.
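The inverse-variance weighting above can be sketched as follows. The function name and the toy numbers are illustrative assumptions. The per-sample variances stand in for σ_^_LR^2 and σ_^_RP^2, which gives the same weights as using the variances of the two means when both estimators use the same number of samples.

```python
import statistics

def combine_inverse_variance(lr_samples, rp_samples):
    """Combine LR and RP gradient estimates with weights proportional to 1/variance."""
    mu_lr = statistics.mean(lr_samples)
    mu_rp = statistics.mean(rp_samples)
    var_lr = statistics.variance(lr_samples)   # sample variance of the LR estimates
    var_rp = statistics.variance(rp_samples)   # sample variance of the RP estimates
    if var_lr + var_rp == 0.0:
        k_lr = 0.5                              # degenerate case: fall back to equal weights
    else:
        # Algebraically equal to var_lr**-1 / (var_lr**-1 + var_rp**-1),
        # but well defined when one of the variances is zero.
        k_lr = var_rp / (var_lr + var_rp)
    k_rp = 1.0 - k_lr
    return k_lr * mu_lr + k_rp * mu_rp, k_lr, k_rp

# Toy per-particle gradient estimates (made-up numbers): LR is noisy, RP is tight.
lr = [0.0, 4.0, 1.0, 3.0]
rp = [2.9, 3.1, 3.0, 3.0]
combined, k_lr, k_rp = combine_inverse_variance(lr, rp)
```

In this toy case almost all of the weight goes to the RP estimate. If the RP estimates were the noisy ones, the weights would flip toward LR without any change to the code.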

A naive combination scheme would compute the gradients over whole trajectories separately for both estimators and then combine them. That approach, however, misses the opportunity to obtain better gradient estimates by using reparameterization gradients over shorter sections of the trajectory. The new total propagation algorithm (TP) of the present invention improves on this naive method. TP uses a single backward pass to compute a union over all possible RP depths, automatically giving more weight to estimators with lower variance.

FIG. 3 illustrates the total propagation algorithm. In Algorithm 2, at each backward step, the gradient with respect to the policy parameters is evaluated using both the LR and RP methods. The ratio is evaluated based on the variances in policy parameter space, which are proportional to the variance of the policy gradient estimator. The gradients are combined, and the best estimate in distribution parameter space is passed on to the previous time step. In this algorithm, the V operator takes the sample variance of the gradient estimates from the different particles, but other variance estimation schemes are also possible: for example, the variance may be estimated from a moving average of the gradient magnitudes, a different statistical estimator of the variance may be used, or only a subset of the policy parameters may be used. The algorithm is not limited to RL problems; it is applicable to general stochastic computation graphs and can be used to train probabilistic models, stochastic neural networks, and the like. In a general computation graph setting, gradients may be propagated backward through the graph so that multiple gradient estimators are combined at some of the nodes. In that case, decreasing the time step parameter t by 1 corresponds to moving one node backward in the graph while propagating the gradient. The statistics, such as the variance, used to decide how the gradient estimators are combined may be obtained from any other node in the computation graph.

FIG. 4 is a flowchart illustrating the procedure executed by the computing device 1 according to the present embodiment. The computing device 1 executes the following process in accordance with Algorithm 2.

The control unit 11 initializes various parameters (step S101). Specifically, the control unit 11 sets dG_{T+1}/dζ_{T+1} = 0, dJ/dθ = 0, and G_{T+1} = 0, where ζ denotes the distribution parameters (e.g., μ and σ).

The control unit 11 sets the time (time step) t to T (step S102) and executes the following calculation for each particle i (step S103), where c_t is the cost at time t.

The control unit 11 executes the following calculation using the result of Equation 6 (step S104).

Further, the control unit 11 executes the following calculation for each particle i using the result of Equation 6 (step S105).

Next, the control unit 11 determines whether the time t has reached the predetermined time 1 (step S106). If the time t has not reached 1 (S106: NO), the control unit 11 decrements t by 1 (step S107) and returns the process to step S103.

(Policy optimization)
Any gradient-based optimization procedure may be used, but the present embodiment uses a stochastic gradient descent method in the style of RMSprop (an algorithm derived from RMSprop). RMSprop normalizes its SGD step using a moving average of the squared gradient. In the present invention, because the batch size is large, the expected square is estimated directly from the batch as z = E[g^2] = E[g]^2 + V[g], where g is the gradient. The variance of the mean is used; that is, V[g] is the sample variance divided by the number of particles P. The gradient step becomes g / z^{1/2}. Momentum with parameter γ is also used. The full update equations are as follows.
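The update equations themselves are not reproduced in this text. The following is only a sketch of the normalization described above for a single scalar parameter, assuming a conventional momentum form (m ← γm + g/√z, θ ← θ − αm); that form, the small `eps` guard, and the function names are assumptions. The default values α = 5 × 10^{-4} and γ = 0.9 are the ones reported for the learning experiments.

```python
import math
import statistics
from typing import List, Tuple


def normalized_step(per_particle_grads: List[float], eps: float = 1e-12) -> float:
    """Return g / sqrt(z) with z = E[g]^2 + V[g].

    V[g] is the variance of the mean: sample variance divided by P.
    eps is an assumed numerical guard, not part of the source text.
    """
    p = len(per_particle_grads)
    g = statistics.fmean(per_particle_grads)
    var_of_mean = statistics.variance(per_particle_grads) / p
    z = g * g + var_of_mean
    return g / math.sqrt(z + eps)


def update(theta: float, momentum: float, per_particle_grads: List[float],
           alpha: float = 5e-4, gamma: float = 0.9) -> Tuple[float, float]:
    """One optimizer step under the assumed momentum form."""
    momentum = gamma * momentum + normalized_step(per_particle_grads)
    theta = theta - alpha * momentum
    return theta, momentum


# A consistent batch gives a step close to 1; a noisy batch gives a smaller step.
print(normalized_step([1.0, 1.0, 1.0, 1.0]))
print(normalized_step([3.0, -1.0, 3.0, -1.0]))
```

Because z ≥ E[g]^2, the magnitude of the normalized step never exceeds 1, and it shrinks as the particles disagree more about the gradient.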

Fixing the random number seed turns a stochastic problem into a deterministic one; in the RL community this is known as the PEGASUS trick. With a fixed seed, the RP gradient is the exact gradient of the (sampled) objective, and a deterministic quasi-Newton optimizer such as BFGS can be used.
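A minimal sketch of the fixed-seed idea follows. The quadratic toy cost and the function names are hypothetical; the point is only that, with a fixed seed, the sampled objective is an ordinary deterministic function of θ whose RP gradient matches a finite difference of that same function.

```python
import random


def sampled_objective(theta: float, num_particles: int, seed: int) -> float:
    """Average toy cost over particles drawn with a fixed seed."""
    rng = random.Random(seed)          # same seed -> same noise on every call
    total = 0.0
    for _ in range(num_particles):
        eps = rng.gauss(0.0, 1.0)
        x = theta + eps                # reparameterized sample, dx/dtheta = 1
        total += (x - 2.0) ** 2        # hypothetical cost
    return total / num_particles


def sampled_rp_gradient(theta: float, num_particles: int, seed: int) -> float:
    """RP gradient of the same sampled objective (derivative of the toy cost)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_particles):
        eps = rng.gauss(0.0, 1.0)
        total += 2.0 * (theta + eps - 2.0)
    return total / num_particles


h = 1e-3
fd = (sampled_objective(0.3 + h, 100, 7) - sampled_objective(0.3 - h, 100, 7)) / (2 * h)
print(fd, sampled_rp_gradient(0.3, 100, 7))
```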

(Experiments)
Experiments were conducted for two purposes: (1) to explain why the RP gradient alone is not sufficient, and (2) to show that the newly developed method of the present invention can match PILCO in terms of learning efficiency.

(Plotting the value landscape)
FIGS. 5A to 5F illustrate the experimental results. The policy parameters θ are perturbed in a randomly chosen fixed direction, and the objective function and the magnitude of the projected gradient are plotted as functions of Δθ. The results of this experiment are perhaps the most novel part of this specification and led to the coining of the term "the curse of chaos".

The plots were generated on the nonlinear cart-pole task. 1000 particles were used, and the random number seed was kept fixed in order to demonstrate that the high variance in FIG. 5D is caused not by randomness but by the chaos-like properties of the system. The confidence intervals were estimated from Var/P, where Var is the sample variance and P is the number of particles. As described later, a more principled approach is used to plot the dependence of the variance on P.

FIG. 5D contains a peculiar result: in one region the RP gradient is well behaved, but as the policy parameters are perturbed, the variance explodes in a change resembling a phase transition. The variance at Δθ = 1.5 is about 4 × 10^5 times the variance at Δθ = 0, which means that 4 × 10^8 particles would be needed for the RP gradient to become accurate in this region. In practice, optimizing with the RP gradient in such a region amounts to a simple random walk.
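The particle count quoted above follows from the Var/P scaling of the variance of the mean. A short arithmetic sketch (the function name is hypothetical):

```python
def particles_for_same_accuracy(baseline_particles: int, variance_ratio: float) -> float:
    """Particles needed so that Var/P stays the same when Var grows by variance_ratio."""
    return baseline_particles * variance_ratio


# 1000 particles at delta_theta = 0, variance about 4e5 times larger at delta_theta = 1.5.
required = particles_for_same_accuracy(1000, 4e5)
print(required)
```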

Because the seed is fixed, the RP gradient in FIG. 5D is the exact gradient of the value in FIG. 5A. Hence there is infinitesimally fine deterministic "noise" on the right-hand side of FIG. 5A. However, the value averaged over 1000 particles is not the true objective; the true objective requires averaging over an infinite number of particles. If an infinite number of particles were averaged, would the "noise" remain, or would the function become smooth?

The new gradient estimators in FIGS. 5E and 5F suggest that the true objective is indeed smooth. To provide further evidence, the gradient magnitude was estimated from finite differences of the value in FIG. 5A, using a perturbation in θ large enough for the "noise" to be negligible. The fact that two separate approaches agree (one varies the policy parameters θ; the other keeps θ fixed and estimates the gradient from the trajectories) is convincing evidence that the true objective is smooth.

FIGS. 5B and 5C explain why the variance explodes when the RP gradient is used. FIG. 5B corresponds to the leftmost parameter setting, and FIG. 5C to the rightmost. The plots show how the value V(x; θ) (the remaining cumulative cost) changes as a function of the position x. Because the random number seed is fixed, the value V is identical to the remaining return G. The figures were created by predicting trajectories from each point for 4 particles with different fixed seeds and averaging the trajectory costs. Four particles were used after one particle had been tried: with a single particle the value appeared to contain staircase-like sections but was otherwise less interesting than the present figures. Because the value averaged over 4 particles is erratic, at least one of the 4 particles must have been highly erratic within the region shown.

The square is centered on the mean prediction from the center of the initial state distribution. The axes of the squares differ slightly because the predicted location p(x_1; θ) changes as θ changes. The side length corresponds to 4 standard deviations of the Gaussian distribution p(x_1; θ). The velocities were kept fixed at their mean values.

RP estimates d/dθ ∫ p(x_1; θ) V(x_1) dx_1. It samples points inside the square, computes the gradient dV/dθ = (dV/dx)(dx/dθ) at each, and averages over the samples. In FIG. 5C, finding the gradient of the expectation by differentiating V is completely hopeless. In contrast, the LR gradient (FIG. 5E) uses only the value V, not its derivative, and does not suffer from this problem. TP (FIG. 5F) combines both estimators effectively.
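The contrast between the two estimators can be reproduced on a toy problem. The sketch below is not the cart-pole experiment: the rapidly oscillating value function, the constant `OMEGA`, and all names are assumptions chosen to mimic a value landscape that is bounded but has a wildly varying derivative.

```python
import math
import random
import statistics

OMEGA = 50.0  # hypothetical frequency; larger means a more erratic derivative


def value(x: float) -> float:
    """Toy value: a gentle trend plus fine deterministic 'noise'."""
    return math.sin(OMEGA * x) + 0.5 * x


def dvalue_dx(x: float) -> float:
    return OMEGA * math.cos(OMEGA * x) + 0.5


def rp_sample(theta: float, sigma: float, rng: random.Random) -> float:
    """Reparameterization sample: dV/dx * dx/dtheta with x = theta + sigma * eps."""
    x = theta + sigma * rng.gauss(0.0, 1.0)
    return dvalue_dx(x) * 1.0


def lr_sample(theta: float, sigma: float, rng: random.Random) -> float:
    """Likelihood-ratio sample: V(x) * d log p(x; theta) / dtheta (no baseline)."""
    x = rng.gauss(theta, sigma)
    score = (x - theta) / sigma ** 2
    return value(x) * score


rng = random.Random(0)
N = 20000
rp = [rp_sample(0.0, 1.0, rng) for _ in range(N)]
lr = [lr_sample(0.0, 1.0, rng) for _ in range(N)]

rp_var = statistics.variance(rp)
lr_var = statistics.variance(lr)
lr_mean = statistics.fmean(lr)
print(rp_var, lr_var, lr_mean)
```

In this toy setting the gradient of the expectation is approximately 0.5, since the oscillating term averages out under the Gaussian. The per-sample RP variance is dominated by the OMEGA^2 factor from differentiating V, whereas the LR variance depends only on the magnitude of V.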

The value and gradient plots for the Gaussian resampling case are not shown, but both turned out to be smooth functions for a fixed random number seed. Resampling is therefore also effective against the "curse of chaos".

FIGS. 6A and 6B are graphs of the variance. They plot how the variance of the gradient estimators at Δθ = 0 and Δθ = 1.5 depends on the number of particles P. The variance was computed by repeatedly sampling each estimator a large number of times and calculating the variance over the set of evaluations. The RP, TP, and LR gradients are compared, both with and without batch importance weighting (BIW), to show that the importance sampling scheme of the present invention reduces the variance. The importance-sampled baseline was used throughout; a conventional LR gradient would use a simpler baseline and have much higher variance. The RP gradient is omitted from FIG. 6B because its variance was between 10^8 and 10^15. The TP gradient combined the BIW-LR and RP gradients.

The results confirm that BIW reduces the variance significantly. Moreover, the TP algorithm of the present invention performed best. Importantly, in FIG. 6B the variance of the RP gradient over full trajectories is 10^6 times larger than that of the other estimators, yet TP exploits RP gradients over short path lengths to obtain a 10 to 50% reduction in variance for fewer than 250 particles. This result is remarkable because, when gradient estimators are computed separately, the highest possible precision of the combined estimator is the sum of the precisions of the separate estimators. The total propagation algorithm of the present invention exploits the graph structure of the computation and thereby achieves a precision higher than that sum.

(Learning experiments)
PILCO is compared on episodic learning tasks with the following particle-based methods: RP, RP with a fixed seed (RPFS), Gaussian resampling (GR), GR with a fixed seed (GRFS), the model-based batch importance weighted likelihood ratio (LR), and total propagation (TP). In addition, two variations of the particle prediction are evaluated: (1) TP that ignores model uncertainty and adds only the noise at each time step (TP-σ_f), and (2) TP with increased prediction noise (TP+σ_n). 300 particles were used in all cases.

The learning tasks from a recent PILCO paper (Non-Patent Document 3) were performed: cart-pole swing-up and balancing, and unicycle balancing. The simulation dynamics were set to be identical, and other aspects were kept similar to the original PILCO. FIGS. 7A, 7B, 8, and 9 illustrate the experimental results.

The optimizer was run for 600 policy evaluations between each trial. The SGD learning rate and momentum parameter were α = 5 × 10^{-4} and γ = 0.9. The episode length was 3 s for the cart-pole and 2 s for the unicycle. For the unicycle task, 2 s is not sufficient for the policy to generalize to long trials, but a comparison with PILCO is still possible. The control frequency was 10 Hz. The costs were of the type 1 - exp(-(x - t)^T Q (x - t)), where t is the target. The output of the policy π(x) is constrained by the saturation function sat(u) = 9 sin(u)/8 + sin(3u)/8, where u = π~(x). One experiment consisted of (1; 5) random trials followed by (15; 30) learned trials for the cart and unicycle tasks, respectively. Each experiment was repeated 100 times and the results were averaged. Each trial was evaluated by running the policy 30 times and averaging; note that this was done for evaluation purposes only (the algorithm has access to only one trial). Success was determined by whether the return of the final trial was below a threshold.
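The cost and saturation functions given above can be written out directly. This is a plain-Python sketch; the function names and the list-of-lists representation of Q are assumptions made for illustration.

```python
import math
from typing import Sequence


def sat(u: float) -> float:
    """Saturation function sat(u) = 9 sin(u)/8 + sin(3u)/8; its values lie in [-1, 1]."""
    return 9.0 * math.sin(u) / 8.0 + math.sin(3.0 * u) / 8.0


def saturating_cost(x: Sequence[float], target: Sequence[float],
                    q: Sequence[Sequence[float]]) -> float:
    """Cost of the type 1 - exp(-(x - t)^T Q (x - t))."""
    d = [xi - ti for xi, ti in zip(x, target)]
    n = len(d)
    quad = sum(d[i] * q[i][j] * d[j] for i in range(n) for j in range(n))
    return 1.0 - math.exp(-quad)


identity = [[1.0, 0.0], [0.0, 1.0]]
print(sat(math.pi / 2))                                    # peak of the saturation
print(saturating_cost([1.0, 2.0], [1.0, 2.0], identity))   # zero cost at the target
print(saturating_cost([1.0, 0.0], [0.0, 0.0], identity))   # 1 - exp(-1)
```

The cost is 0 at the target and approaches 1 far from it, so a single bad time step contributes at most 1 to the return.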

(Cart-pole swing-up and balancing)
This is a standard benchmark problem in control theory. The task consists of pushing a cart back and forth to swing an attached pendulum up to the upright position and then keeping it balanced. The state space is represented as x = [s, β, ds/dt, dβ/dt], where s is the cart position and β is the pole angle. The base noise levels are σ_s = 0.01 m, σ_β = 1 deg, σ_{ds/dt} = 0.1 m/s, and σ_{dβ/dt} = 10 deg/s. In the different experiments the noise is modified by a multiplier k: σ^2 = k σ_base^2. The original paper considers direct access to the true state. To obtain a similar setting, k = 10^{-2} was used, and k ∈ {1, 4, 9, 16} were also tested. The policy π~ is a radial basis function network (a sum of Gaussians) with 50 basis functions. Two cost functions are considered. The first is the same as in the original PILCO: x contains the sine and cosine of the angle, and the cost depends on the distance between the tip of the pendulum and the position of the tip when the pendulum is balanced (Tip Cost). The second uses the raw angle, with Q = diag([1, 1, 0, 0]) (Angle Cost). This cost differs conceptually from Tip Cost because there is only one correct direction in which to swing up the pendulum.

(Unicycle balancing)
The task consists of balancing a unicycle robot, with state dimension D = 12 and control dimension F = 2. The noise was set to a low value. The controller π~ is linear.

(Learning experiments)
PILCO performs well in noise-free scenarios, but its results deteriorate when noise is added. This deterioration is most likely caused by the accumulation of errors in the MM approximation, as previously observed by Vinogradska et al. (2016), who used quadrature for prediction. Particles do not suffer from this problem, and TP gradients consistently outperform PILCO under high noise.

At low noise levels, on the other hand, the performance of TP and LR degrades. If all particles are sampled from a small region, it becomes difficult to estimate the gradient from changes in the return (in the limit of a delta distribution, the LR gradient cannot even be evaluated). The TP gradient suffers less from this problem because it incorporates information from RP. Finally, when the prediction uncertainty is very low (e.g., k = 10^{-2}), the model noise can be regarded as a parameter that affects learning and can be increased to obtain more accurate gradients; see TP+σ_n, where the model noise variance was multiplied by 100.

Notably, approaches that use MM, such as PILCO, as well as GR, outperform the others when Tip Cost is used. A possible reason is the multimodality of the objective: with Tip Cost the pendulum may be swung up from either direction to solve the task, whereas with Angle Cost there is only one correct direction. Performing MM forces the algorithm along a unimodal path, whereas the particle approach may attempt a bimodal swing-up in which some particles come from one side and the rest from the other. MM may therefore be performing a kind of "distributional reward shaping" that simplifies the optimization problem. Such an explanation was previously given by Gal et al. (2016).

Finally, the surprising TP-σ_f experiment is pointed out. The predictions ignore model uncertainty, yet the method achieves a 93% success rate. It is difficult to explain why learning succeeded, but the hypothesis is that the success may be related to the zero prior mean of the GP. In regions with no data, the mean of the GP dynamics model goes to 0, which means that the input control signal has no effect on the particles. For policy optimization to succeed, the particles must therefore be controlled so that they stay in regions where data exists. A similar result was found by Chatzilygeroudis et al. (2017), who used an evolutionary algorithm and achieved an 85 to 90% success rate on the cart-pole task even when ignoring model uncertainty.

Most machine learning problems involve optimizing the expectation of an objective function J(x; θ) over some data generating distribution p_Data(x), which is accessible only through sample data points {x_i}. The predictive framework of the present invention is analogous to a deep model: p(x_0) is the data generating distribution, and p(x_t; θ) is obtained by passing p_Data(x) through the model layers. The most common optimization method is SGD using the pathwise derivative computed by backpropagation. The results of the present invention suggest that in some situations (particularly for very deep or recurrent models) this approach may degenerate into a random walk because of an explosion of the gradient variance.

Exploding gradients have long been observed in deep learning research (Doya, 1993; Bengio et al., 1994). The phenomenon is usually regarded as a numerical issue that leads to large steps and unstable learning. Common countermeasures include gradient clipping, the ReLU activation function (Nair & Hinton, 2010), and smart initialization. The explanation offered by the present invention is different: the gradient does not merely become large; the gradient variance explodes, which means that any sample x_i ~ p_Data provides essentially no information about how the model parameters θ should be changed to increase the expected objective over the whole distribution, E_{p_Data}[J(x)]. Choosing a good initialization is one way of addressing the problem, but it appears difficult to guarantee that the system will not become chaotic during learning. In econometrics, for example, the optimal policy can even lead to chaotic dynamics (Deneckere & Pelikan, 1986). Gradient clipping can prevent large parameter steps, but it does not solve the underlying problem once the gradient has become random. Considering that chaos does not occur in linear systems (Alligood et al., 1996), the analysis of the present invention suggests why piecewise linear activations such as ReLU, which are less susceptible to chaos, work well in deep learning.

The deep learning hypothesis of the present invention still has to be confirmed computationally, and several studies have investigated chaos in neural networks (Kolen & Pollack, 1991; Sompolinsky et al., 1988); nevertheless, the present invention is believed to be the first to suggest that chaos may cause gradients to degenerate when they are computed using backpropagation. In particular, Poole et al. (2016) suggested that such properties lead to "exponential expressivity", whereas it is believed here that this phenomenon may instead be a curse.

(Conclusion and future work)
The limitations of optimizing expectations using pathwise derivatives, such as those computed by backpropagation, have been explained. In addition, it has been shown how to counteract this curse by injecting noise into the computation and using the likelihood ratio trick. The total propagation algorithm of the present invention provides an efficient method for combining the reparameterization gradient of an arbitrary stochastic computation graph with any number of other gradient estimators (even gradients computed using a value function can be used). There are countless ways to extend the work of the present invention, such as better optimization and the incorporation of natural gradients. The flexible nature of the method should make these extensions easy.

(Embodiment 2)
A definition of the probabilistic computation graph (PCG) is provided. The PCG concept differs from the computation graph concept used above to explain the total propagation algorithm; instead, it provides a framework for reasoning about gradient estimators. The definition is exactly equivalent to the definition of a standard directed graphical model, but it brings out the methods of the present invention more clearly and emphasizes the interest in computing gradients rather than performing inference. The main difference is the explicit inclusion of the distribution parameters ζ, for example the mean μ and the covariance Σ for a Gaussian.

Definition 1 (Probabilistic computation graph (PCG))
An acyclic graph with nodes/vertices V and edges E that satisfies the following properties:
1. Each node i ∈ V corresponds to a collection of random variables with marginal joint probability density p(x_i; ζ_i), where ζ_i are the possibly infinite parameters of the distribution. The parameterization is not unique, and any parameterization is acceptable.
2. The probability density at each node is conditionally dependent on the parent nodes, p(x_i | Pa_i), where Pa_i are the random variables at the direct parents of node i.
3. The joint probability density satisfies p(x_1, ..., x_n) = Π_{i=1}^{n} p(x_i | Pa_i).
4. Each ζ_i is a function of its parents, ζ_i = f(Pz_i), where Pz_i are the distribution parameters at the parents of node i. In particular, p(x_i; ζ_i) = ∫ p(x_i | Pa_i) p(Pa_i; Pz_i) dPa_i.

It should be emphasized that nothing in this formulation is stochastic. Each computation may be analytically intractable, but it is deterministic. It should also be emphasized that the definition does not exclude deterministic nodes; that is, the distribution at a node may be a Dirac delta distribution (a point mass). This formulation is used later to derive stochastic estimates of the gradient.

(Derivation of the theorem)
The quantity of interest is the total derivative of the distribution parameters ζ_i at one node with respect to the parameters ζ_j at another node, dζ_i/dζ_j. Iterating the total derivative rule yields a sum over the paths from node j to node i, as follows.

This equation holds for any deterministic computation graph and is well known, for example, in the OJA community. It leads trivially to the total stochastic gradient theorem of the present invention, which states that the sum over paths from A to B can be written as a sum over paths from A to a set of intermediate nodes and from the intermediate nodes to B. FIGS. 10A and 10B illustrate examples of the paths in Equation 11.

Theorem 1 (Total stochastic gradient theorem)
Let i and j be distinct nodes in a probabilistic computation graph, and let IN be any set of intermediate nodes that blocks the paths from j to i; that is, IN is such that there is no path from j to i that does not pass through a node in IN. Denote by {a → b} the set of paths from a to b, and by {a → b}/c the set of paths from a to b in which no node along the path, except b, is allowed to be in the set c. Then the total derivative dζ_i/dζ_j can be written by the following equation.
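The equation itself is not reproduced in this text. The sketch below therefore only checks the idea numerically on a small hand-made DAG, under the assumption that the decomposition takes the form "for each m in IN: (sum over paths in {j → m}/IN) times (sum over paths in {m → i})". The graph, the edge values (standing in for the partial derivatives along each edge), and the function names are all hypothetical.

```python
from typing import Dict, FrozenSet, Tuple

Edge = Tuple[str, str]

# Hypothetical scalar graph; each value plays the role of d(zeta_child)/d(zeta_parent).
EDGES: Dict[Edge, float] = {
    ("j", "a"): 2.0, ("j", "b"): 3.0,
    ("a", "b"): 0.5, ("a", "c"): -1.0,
    ("b", "c"): 4.0, ("b", "i"): 1.5,
    ("c", "i"): 2.0,
}


def path_sum(edges: Dict[Edge, float], src: str, dst: str,
             forbidden: FrozenSet[str] = frozenset()) -> float:
    """Sum over paths src -> dst of the product of edge values.

    Paths that visit a node in `forbidden` before reaching dst are skipped,
    which mirrors the {a -> b}/c notation of the theorem.
    """
    if src == dst:
        return 1.0
    total = 0.0
    for (parent, child), weight in edges.items():
        if parent != src:
            continue
        if child in forbidden and child != dst:
            continue
        total += weight * path_sum(edges, child, dst, forbidden)
    return total


IN = frozenset({"b", "c"})          # every path from j to i passes through b or c

full = path_sum(EDGES, "j", "i")    # all paths from j to i
decomposed = sum(
    path_sum(EDGES, "j", m, IN) * path_sum(EDGES, m, "i")
    for m in sorted(IN)
)
print(full, decomposed)
```

Each path from j to i is counted exactly once in the decomposition because it is assigned to the first node of IN that it visits.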

数式１０および数式１１を結合して次を与えることができる。 Formula 10 and Formula 11 can be combined to give:

なお、ｒ∈｛ｊ→ｍ｝／ＩＮとｓ∈｛ｊ→ｍ｝／ＩＮとをそれぞれスワップすることにより、類似の定理を導くことができる。これは次の等式を導く。 A similar theorem can be derived by swapping r ∈ {j → m} / IN and s ∈ {j → m} / IN, respectively. This leads to the following equation.

Equations 12 and 13 are referred to as the second-half and first-half total gradient equations, respectively.
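The display equations for this derivation did not survive in the text above. The following LaTeX is a reconstruction from the surrounding definitions, not a copy of the original formulas: the path notation {a→b}/c and the index pairs (k,l) and (p,t) follow the text, while the assignment of equation numbers is inferred from how Equations 10 to 13 are referenced.

```latex
% Eq. 10 (assumed numbering): total derivative as a sum over paths
\frac{d\zeta_i}{d\zeta_j}
  = \sum_{s \in \{j \to i\}} \prod_{(k,l) \in s} \frac{\partial \zeta_l}{\partial \zeta_k}

% Eq. 11 (assumed numbering): Theorem 1
\frac{d\zeta_i}{d\zeta_j}
  = \sum_{m \in IN}
    \Big( \sum_{s \in \{m \to i\}} \prod_{(k,l) \in s} \frac{\partial \zeta_l}{\partial \zeta_k} \Big)
    \Big( \sum_{r \in \{j \to m\}/IN} \prod_{(p,t) \in r} \frac{\partial \zeta_t}{\partial \zeta_p} \Big)

% Eq. 12 (assumed numbering): second-half total gradient equation
\frac{d\zeta_i}{d\zeta_j}
  = \sum_{m \in IN} \frac{d\zeta_i}{d\zeta_m}
    \Big( \sum_{r \in \{j \to m\}/IN} \prod_{(p,t) \in r} \frac{\partial \zeta_t}{\partial \zeta_p} \Big)

% Eq. 13 (assumed numbering): first-half total gradient equation
\frac{d\zeta_i}{d\zeta_j}
  = \sum_{m \in IN}
    \Big( \sum_{s \in \{m \to i\}/IN} \prod_{(k,l) \in s} \frac{\partial \zeta_l}{\partial \zeta_k} \Big)
    \frac{d\zeta_m}{d\zeta_j}
```

Under this reading, Equation 12 follows from Equation 11 by recognizing the first bracket as the path-sum form of dζ_i/dζ_m, and Equation 13 is the mirror image in which the unrestricted factor is dζ_m/dζ_j.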

(Gradient estimation on the graph)
The previous section gave a way to decompose the gradient computation over the whole graph into gradient computations over smaller subgraphs, together with methods for estimating gradients on a subgraph. This section shows how the gradients on subgraphs can be combined into an estimator for the gradient over the whole graph. The task is to estimate the derivative of the expectation at a distal node i with respect to the parameters at node j: d/dζ_j E_{x_i∼p(x_i; ζ_i)}[x_i]. Because the true ζ is intractable, a sampling-based estimate is used. Consider sampling from sub-distributions of p(x; ζ); that is, sample ζ̂ such that p(x; ζ) = ∫ p(x; ζ̂) p(ζ̂) dζ̂. The derivative can then be written as follows.

ζ̂ arises naturally in an ancestral sampling procedure. To simplify the exposition, further assume that the sampling is reparameterizable, that is, p(ζ̂_m; ζ_j) = f(ζ̂_m; ζ_j, z_m) p(z_m). The derivative can then be written as follows.

The term dζ̂_m/dζ_j is estimated with the pathwise derivative estimator. The remaining term, d/dζ̂_m E_{x_i∼p(x_i; ζ̂_i)}[x_i], can be estimated with any other estimator; for example, a jump estimator can be used. If this second estimator is also unbiased, the whole estimator is unbiased.

In summary, the procedure for constructing a gradient estimator from j to i over the whole graph is as follows:
1. Select a set of intermediate nodes IN that blocks the paths from j to i.
2. Construct a pathwise derivative estimator from j to the intermediate nodes IN.
3. Construct a total derivative estimator from IN to i, and apply the chain rule from i to j.
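The three steps above can be sketched on a one-dimensional toy problem. Everything in the sketch is an assumption made for illustration: the model (x_m = θ + z, then x_i ∼ N(x_m, 1)), the objective E[x_i^2] (used instead of E[x_i] so that the gradient depends on θ), the function name, and the analytically chosen baseline. The intermediate node is the sampled mean x_m; the pathwise derivative covers θ to x_m, and a likelihood ratio (LR) jump covers x_m to the objective.

```python
import random


def combined_gradient(theta, n_samples=200_000, seed=0):
    """Estimate d/dtheta E[x_i^2] for x_m = theta + z, x_i ~ N(x_m, 1)."""
    rng = random.Random(seed)
    # E[x_i^2] for this toy model; used only as a variance-reducing baseline.
    baseline = theta ** 2 + 2.0
    total = 0.0
    for _ in range(n_samples):
        # Steps 1-2: intermediate node x_m, reached by a reparameterized sample,
        # so the pathwise derivative d x_m / d theta is exactly 1.
        z = rng.gauss(0.0, 1.0)
        x_m = theta + z
        pathwise = 1.0
        # Step 3: LR jump from the intermediate node to the objective.
        x_i = rng.gauss(x_m, 1.0)
        score = x_i - x_m  # d log N(x_i; x_m, 1) / d x_m
        jump = (x_i ** 2 - baseline) * score
        # Chain rule: combine the two factors for this sample.
        total += jump * pathwise
    return total / n_samples
```

For this model E[x_i^2] = θ^2 + 2, so the estimate should approach 2θ as the number of samples grows; both factors are unbiased, which matches the remark that the whole estimator is then unbiased.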

(Relationship to the policy gradient theorem)
In a typical model-free RL problem, an agent performs actions u ∼ π(u_t | x_t; θ) according to a stochastic policy π, transitions through states x_t, and incurs costs c_t (or, conversely, obtains rewards). The agent's goal is to find policy parameters θ that optimize the expected return G = Σ_{t=0}^{H} c_t of each episode. FIGS. 11A and 11B illustrate the stochastic computation graphs for model-based and model-free LR gradient estimation. In the literature, two "gradient theorems" are commonly applied: the policy gradient theorem and the deterministic policy gradient theorem.

Q̂_t corresponds to an estimator of the remaining return Σ_{h=t}^{H−1} c_{h+1} from a particular state x when action u is chosen. For Equation 16, any estimator is acceptable, and even a sampling-based estimate can be used. For Equation 17, Q̂ is usually a differentiable surrogate model. Importantly, for the equations above to be valid, Q̂ must be an estimator and not the true Q. That is, when estimating the gradient, one must assume that the policy parameters are changed only for the current time step and are kept fixed for subsequent time steps. FIG. 11A shows how these two theorems correspond to the same stochastic computation graph. The intermediate nodes are the actions selected at each time step. The theorems differ in the choice of jump estimator used to estimate the total derivative following the intermediate nodes: the policy gradient theorem uses the LR gradient, whereas the deterministic policy gradient theorem uses the pathwise derivative with respect to a surrogate model.

(Novel algorithms)
When estimating a gradient on a PCG, one typically performs ancestral sampling through the whole graph to obtain one sample; for an RL problem, for example, one samples a trajectory. Such a sample is called a particle. A batch of such samples can be used to obtain a gradient estimator. The estimated distribution parameters at a node are given by the set of distribution parameters for each sampled particle, ζ̂ = {ζ̂_i}_i^P, where P is the number of particles. For example, if the PCG consists of sequential sampling from Gaussian distributions, ζ̂_i corresponds to the mean and covariance of the Gaussian from which the particle was sampled at that node. The following sections exploit the option of using the set of particles to directly estimate a different set of distribution parameters Γ for the marginal distribution.

(Density estimation LR (DEL))
One can attempt to estimate the distribution parameters Γ from the set of sampled particles and then apply the LR gradient using the estimated distribution. In particular, the density is approximated as a Gaussian by estimating the mean μ̂ = Σ_i^P x_i / P and the variance Σ̂ = Σ_i^P (x_i − μ̂)^2 / (P − 1). The gradient Σ_i^P dlog q(x_i)/dθ (G_i − b) can then be estimated with the standard LR trick, where q(x) = N(μ̂, Σ̂). To use this method, the derivatives of μ̂ and Σ̂ with respect to the particles x_i must be computed and the gradient carried back to the policy parameters with the chain rule, which is straightforward. This new method of the present invention is called the DEL estimator. Importantly, q(x) is used only to estimate the gradient and is not used in any way to modify the trajectory sampling. This contrasts with Gaussian resampling, in which particles are resampled from the fitted Gaussian distribution and the trajectory distribution is thereby modified.
Advantage of DEL: the LR gradient can be used without injecting noise into the computation.
Disadvantage of DEL: the estimator is biased, and density estimation can be difficult.
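A minimal one-dimensional sketch of the DEL idea follows. It is an illustration under strong assumptions rather than the method as implemented by the inventor: the particles are taken to be x_i = θ + noise, the return G_i = x_i^2 is a hypothetical stand-in, the sum over particles is averaged, and only the mean of the fitted Gaussian depends on θ in this toy case, so the variance branch of the chain rule contributes nothing.

```python
import random


def del_gradient(theta, n_particles=20_000, seed=0):
    """DEL-style gradient of E[G] with respect to theta on a 1-D toy problem."""
    rng = random.Random(seed)
    # Particles x_i = theta + noise, so d x_i / d theta = 1 for every particle.
    xs = [theta + rng.gauss(0.0, 1.0) for _ in range(n_particles)]
    returns = [x ** 2 for x in xs]  # hypothetical return G_i
    P = len(xs)

    # Fit a Gaussian q(x) = N(mu_hat, var_hat) to the particles.
    mu_hat = sum(xs) / P
    var_hat = sum((x - mu_hat) ** 2 for x in xs) / (P - 1)

    # LR gradient with respect to the fitted mean, using a mean-return baseline b.
    b = sum(returns) / P
    dlogq_dmu = [(x - mu_hat) / var_hat for x in xs]
    dJ_dmu = sum(s * (g - b) for s, g in zip(dlogq_dmu, returns)) / P

    # Chain rule through the fit: d mu_hat / d theta = (1/P) * sum_i d x_i / d theta = 1.
    # var_hat is unchanged by a shift of all particles, so d var_hat / d theta = 0 here.
    dmu_dtheta = 1.0
    return dJ_dmu * dmu_dtheta
```

For Gaussian particles the result approaches Cov(x, x^2)/Var(x) = 2θ, the exact gradient of E[x^2]. With non-Gaussian particles the fitted density no longer matches the true one, which is the source of the bias listed as a disadvantage.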

(Gaussian shaping gradient (GS))
So far, all RL methods have used the second-half total gradient equation (Equation 12). Can an estimator be constructed that uses the first-half equation (Equation 13)? FIG. 13 illustrates the computational paths in the Gaussian shaping gradient and gives an example of how this can be done. The proposal is to estimate the density at x_m by fitting a Gaussian to the particles. Then dE[c_m]/dΓ_m (gray edge) is estimated by resampling particles from this distribution (or by any other method of integration). This leaves the question of how to estimate dΓ_m/dθ (dotted and bold edges). Using the RP method is straightforward. To use the LR method, first apply the second-half total gradient equation to dΓ_m/dθ to obtain the terms Σ_{r∈{θ→k}/IN} Π_{(p,t)∈r} ∂ζ_t/∂ζ_p (dotted edges) and dΓ_m/dζ_k (bold edges). In the scenario considered, the first of these terms is a single path and is estimated using RP. The second term is the more interesting one, and it is estimated using the LR method. Because a Gaussian approximation is used, the distribution parameters Γ_m are the mean and variance of x_m, which can be estimated as μ_m = E[x_m] and Σ_m = E[x_m x_m^T] − μ_m μ_m^T. The LR gradient estimators for these terms can be obtained as follows.

In practice, a sampling-based estimate ζ̂_k is used, and one might be concerned that the estimator is conditional on the sample ζ̂_k, whereas the quantity of interest is the unconditional estimate. The following explains why the conditional estimate is equivalent. For the variance, note that μ_m is an estimate of the unconditional mean, so the whole estimate corresponds directly to an estimate of the unconditional variance. For the mean, the rule of iterated expectations is applied as follows.

This shows that the conditional gradient estimator is an unbiased estimator of the gradient of the unconditional mean.
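As a rough illustration of LR estimators for the gradients of a downstream mean and variance, consider a scalar toy model. The model is an assumption (x_k ∼ N(ζ_k, 1) and x_m = a·x_k + noise), as are the function name and the use of sample means as baselines. The variance gradient is obtained by differentiating Σ_m = E[x_m^2] − μ_m^2, which is the scalar form of the definition given in the text.

```python
import random


def lr_moment_gradients(zeta_k, a=0.7, n=100_000, seed=0):
    """LR estimates of d mu_m / d zeta_k and d Sigma_m / d zeta_k (scalar toy case)."""
    rng = random.Random(seed)
    x_k = [rng.gauss(zeta_k, 1.0) for _ in range(n)]
    x_m = [a * xk + rng.gauss(0.0, 0.5) for xk in x_k]  # downstream node

    # Score function: d log N(x_k; zeta_k, 1) / d zeta_k.
    score = [xk - zeta_k for xk in x_k]

    # Sample moments, reused as baselines (introduces only an O(1/n) bias).
    mu_m = sum(x_m) / n
    second = sum(x * x for x in x_m) / n

    # LR gradients of the first and second moments.
    d_mu = sum((x - mu_m) * s for x, s in zip(x_m, score)) / n
    d_second = sum((x * x - second) * s for x, s in zip(x_m, score)) / n

    # Sigma_m = E[x_m^2] - mu_m^2, hence dSigma = dE[x_m^2] - 2 * mu_m * dmu.
    d_var = d_second - 2.0 * mu_m * d_mu
    return d_mu, d_var
```

In this toy model μ_m = a·ζ_k and Σ_m does not depend on ζ_k, so the two estimates should approach a and 0 respectively.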

(Efficient algorithm for accumulating gradients)
As a concrete example, consider a model-based policy gradient method whose PCG is given in FIG. 13. In the inventor's previous work, this algorithm was the first one considered, and it relies crucially on access to a differentiable stochastic model of the dynamics. The following explains how the GS gradient is applied in this setting. For each x_k node, the aim is to perform an LR jump to every x_m node after k and to compute the gradient through the Gaussian approximation of the distribution at node m. All the nodes are accumulated during the backward pass, in a backpropagation-like manner. Note that, for each k and each path, the gradient can be written as dE[c_m]/dΓ_m · dΓ_m/dζ_k · (dζ_k/du_{k−1} · du_{k−1}/dθ). The term dE[c_m]/dΓ_m · dΓ_m/dζ_k is estimated as dE[c_m]/dΓ_m · z_m · dlog p(x_k; ζ_k)/dζ_k, where z_m is a vector summarizing terms such as x_m − b_μ above. Note that dE[c_m]/dΓ_m · z_m is simply a scalar quantity g_m. The algorithm therefore accumulates the sum of all the g values during the backward pass, summing over all m nodes at each k node. FIG. 12 illustrates Algorithm 3, which details how this fits together with total propagation. The final algorithm essentially just replaces the usual costs/rewards with modified values, and such an approach could also be applied to model-free policy gradient algorithms that use a stochastic policy and LR gradients.
GS has two interpretations:
1. A Gaussian approximation of the marginal distribution is made at a node.
2. A type of reward shaping is performed based on the distribution of the particles. In particular, it essentially encourages the trajectory distribution to stay unimodal, so that all of the particles concentrate on one "island" of reward rather than splitting across multiple reward regions. This may simplify the optimization.
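The accumulation of the scalars g_m during the backward pass amounts to a reverse cumulative sum. The sketch below shows only that bookkeeping for a single particle; the function name is hypothetical, the g values are assumed to be already computed, and it does not reproduce Algorithm 3.

```python
def accumulate_shaped_returns(g):
    """Return acc with acc[k] = sum of g[m] over all m > k, using one backward pass."""
    acc = [0.0] * len(g)
    running = 0.0
    for k in range(len(g) - 1, -1, -1):
        acc[k] = running   # everything strictly after node k
        running += g[k]    # make g[k] available to earlier nodes
    return acc


# Example: four time steps with hypothetical scalars g_m.
shaped = accumulate_shaped_returns([1.0, 2.0, 3.0, 4.0])
```

Each acc[k] then multiplies dlog p(x_k; ζ_k)/dζ_k in the same way that a return multiplies the score function in an ordinary LR gradient, which is why the approach can be read as replacing the usual costs with modified values. The single pass costs O(H) instead of the O(H^2) of summing over every (k, m) pair.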

(Experiments)
Model-based RL simulation experiments were performed following the PILCO papers. To test the GS approach of the present invention and its combination with total propagation, the cart-pole swing-up and balancing task was used. In addition, to demonstrate the feasibility of the idea, the DEL approach was tested on the simpler cart-pole balancing-only task. Particle-based gradients with the new estimators of the present invention were compared with PILCO. In the inventor's previous work, the cost function had to be changed to obtain reliable results with particles; one of the main motivations of the present experiments was to match PILCO's results using the same cost as the original PILCO (this is detailed further below).

(Background on model-based policy search)
Consider the model-based analogue of the model-free policy search methods. The corresponding stochastic computation graph is given in FIG. 11B. The notation follows the inventor's previous work. After each episode, all of the data is used to learn a separate Gaussian process model for each dimension of the dynamics, such that p(Δx_{t+1}^a) = GP(x̃_t), where x̃ = [x_t^T, u_t^T], x ∈ R^D and u ∈ R^F. This model is then used to run "mental simulations" between episodes in order to optimize the policy by gradient descent. A squared exponential covariance function k_a(x̃, x̃′) = s_a^2 exp(−(x̃ − x̃′)^T Λ_a^{−1} (x̃ − x̃′)) is used, together with a Gaussian likelihood function with noise hyperparameter σ_n^2. The hyperparameters {s, Λ, σ_n} are trained by maximizing the marginal likelihood. Predictions take the form p(x_{t+1}^a) = N(μ(x̃_t), σ_f^2(x̃_t) + σ_n^2), where σ_f^2(x̃_t) is the uncertainty about the model, which depends on the availability of data in that region of the state space. In FIG. 11B, the partial derivatives from θ to the intermediate nodes are estimated with the pathwise derivative, and the total derivatives following the intermediate nodes are estimated with a jump estimator.
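The squared exponential covariance function can be written out directly. The sketch assumes a diagonal Λ_a, passed as a list of its diagonal entries, and follows the formula exactly as it appears in the text; many references place an additional factor of 1/2 in the exponent, which is not added here. The function and argument names are hypothetical.

```python
import math


def se_kernel(x, x_prime, signal_std, lambda_diag):
    """k(x, x') = s^2 * exp(-(x - x')^T Lambda^{-1} (x - x')) for diagonal Lambda."""
    quad = sum(
        (xi - yi) ** 2 / lam
        for xi, yi, lam in zip(x, x_prime, lambda_diag)
    )
    return signal_std ** 2 * math.exp(-quad)


# Identical inputs give the signal variance s^2.
k_same = se_kernel([0.3, -1.0], [0.3, -1.0], signal_std=2.0, lambda_diag=[1.0, 4.0])
# Inputs one unit apart along a dimension with Lambda = 1 give s^2 * exp(-1).
k_apart = se_kernel([1.0, 0.0], [0.0, 0.0], signal_std=1.0, lambda_diag=[1.0, 1.0])
```

Larger entries of Λ_a make the corresponding input dimension less influential, so the kernel decays more slowly along it.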

(Setup)
The cart-pole consists of a cart that can be pushed back and forth and a pole attached to it. The state space is [s, β, ds/dt, dβ/dt], where s is the cart position and β is the pole angle. The control is a horizontal force applied to the cart. The dynamics were the same as in the PILCO papers. The setup follows the inventor's previous work.

(Properties common to the tasks)
The experiments consisted of 1 random episode followed by 15 episodes with a learned policy, with the policy optimized between episodes. Each episode lasted 3 s, and the control frequency was 10 Hz. Each task was evaluated separately 100 times with different random seeds to test reproducibility. The random seeds were shared across the different algorithms. Each episode was evaluated 30 times and the cost was averaged; note that this was done for evaluation purposes only, and the algorithm had access to only 1 episode. The policy was optimized with an RMSprop-like learning rule from the inventor's previous work, which normalizes the gradients using the sample variance of the gradients from the different particles. In the model-based policy optimization, 600 gradient steps were performed using 300 particles for each policy gradient evaluation. The learning rate and momentum parameters were α = 5 × 10^−4 and γ = 0.9, respectively, the same as in the previous work. The output from the policy was saturated by sat(u) = 9 sin(u)/8 + sin(3u)/8, where u = π̃(x). The policy π̃ is a radial basis function network (a sum of Gaussians) with 50 basis functions and a total of 254 parameters. The cost functions are of the type 1 − exp(−(x − t)^T Q (x − t)), where t is the target. Two types of cost function were considered: 1) AngleCost, a cost in which Q = diag([1, 1, 0, 0]) is a diagonal matrix; and 2) TipCost, the cost from the original PILCO papers, which depends on the distance from the tip of the pendulum to the position of the tip when the pendulum is balanced. These cost functions are conceptually different: with TipCost the pendulum can be swung up from either direction, whereas with AngleCost there is only one correct direction. The base observation noise levels are σ_s = 0.01 m, σ_β = 1 deg, σ_{ds/dt} = 0.1 m/s and σ_{dβ/dt} = 10 deg/s, and these are modified by a multiplier k ∈ {10^−2, 1} such that σ^2 = kσ_base^2.
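The saturation function and the AngleCost-style cost quoted above are simple enough to write out. The sketch uses the formulas as stated in the text, with a diagonal Q passed as a list; the function names and the constant for the number of control steps are hypothetical.

```python
import math


def sat(u):
    """Saturate a raw policy output: 9 sin(u)/8 + sin(3u)/8, bounded in [-1, 1]."""
    return 9.0 * math.sin(u) / 8.0 + math.sin(3.0 * u) / 8.0


def saturating_cost(x, target, q_diag):
    """Cost 1 - exp(-(x - t)^T Q (x - t)) for a diagonal Q."""
    quad = sum(q * (xi - ti) ** 2 for xi, ti, q in zip(x, target, q_diag))
    return 1.0 - math.exp(-quad)


# AngleCost weighting: only the first two state dimensions are penalized.
ANGLE_COST_Q = [1.0, 1.0, 0.0, 0.0]

# A 3 s episode at a 10 Hz control frequency gives 30 control steps.
STEPS_PER_EPISODE = round(3.0 * 10.0)
```

Writing s = sin(u), the saturation equals (3s − s^3)/2, which is monotone in s and reaches ±1 at s = ±1, so the output never leaves [−1, 1]. The cost lies in [0, 1) and is zero exactly when the penalized dimensions match the target.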

(Cart-pole swing-up and balancing)
In this task, the pendulum initially hangs downward and must be swung up and balanced. Some results were taken from the inventor's previous work: 1) PILCO; 2) the reparameterization gradient (RP); 3) Gaussian resampling (GR); 4) batch importance weighted LR with a batch importance weighted baseline (LR); and 5) total propagation (TP), which combines LR and RP. These were compared with the new methods: 6) the Gaussian shaping gradient using only the LR component (GLR); and 7) the Gaussian shaping gradient combining both the LR and RP variants using total propagation (GTP). For a description of the total propagation algorithm, which is a method for efficiently combining multiple gradient estimators on a computation graph, see the inventor's previous work. In addition, GTP was tested with the model noise variance multiplied by 25 (GTP+σ_n).

(Cart-pole balancing with the DEL estimator)
This task is much simpler: the pole is initially upright and must be kept balanced. The experiment was devised to show that DEL is feasible and may be useful if developed further. AngleCost and the base noise level were used.

(Results)
FIGS. 14 and 15 show the experimental results. As in the inventor's previous work, methods that include an LR component do not work well when the noise is low. However, the GTP+σ_n experiment shows that injecting noise into the model predictions can solve this problem. The main result is that GTP matches PILCO in the TipCost scenario. In the previous work, one concern was that TP did not match PILCO in this scenario. Looking only at the costs in FIGS. 15B and 15C does not properly show the difference. In contrast, the success rates show that TP did not perform as well. The success rate was measured both by a threshold calibrated in the previous work (final loss below 15) and by visually classifying all experimental runs; the two methods agreed. The losses of the peak performers in the final episode were TP: 11.14 ± 1.73, GTP: 9.78 ± 0.40 and PILCO: 9.10 ± 0.22, which again shows that TP was significantly worse. The remaining experiments had converged while the peak performers were still improving. PILCO still appears slightly more data-efficient, but because the amount of data required is small, the difference has little practical significance. Note also that the variance of TP is smaller in FIG. 15B. The large variances of GTP and PILCO are caused by outliers with large losses. These outliers converged to local minima that exploit the tails of the Gaussian approximation of the state distribution, which contrasts with the earlier suggestion that PILCO uses the tails of the Gaussian approximation to perform exploration.

(Embodiment 3)
The total propagation algorithm, like backpropagation, is a general-purpose gradient estimation algorithm for computation graphs, but it overcomes the problem of exploding gradients. The key idea of the algorithm is to combine multiple gradient estimation methods during the backward pass of the gradient computation. Importantly, the multiple gradient estimates are aggregated into a smaller set of gradient estimators (for example, all gradient estimators are combined into a single best estimate of the gradient), and this smaller set of gradient estimators, rather than all of the gradient estimators separately, is passed backward. In this way, many gradient estimation techniques can be combined to improve the accuracy of gradient estimation on a computation graph without a proliferation of the gradient estimators passed backward, thereby achieving good computational efficiency.

(Description of the framework and algorithm)
A computation graph is a set of nodes (vertices) V and directed edges E that defines the computational relationships between the variables at the vertices. Each node i receives the variables from its parent nodes Pa_i as input and computes an output x_i = f(Pa_i), where the function f may also be stochastic. Pa_i and x_i each represent a set of one or more variables and may therefore be vector-valued or tensor-valued. The variable x_i is passed to the child nodes of node i, denoted Ch_i. FIG. 16 illustrates the general form of the algorithm. The general form is presented in Algorithm 4, where the key novelty is the combination comprising steps 5 and 6. Total propagation is similar to the backpropagation algorithm, which computes gradients over the whole graph by sending gradients computed with the chain rule backward through the graph. Standard backpropagation is illustrated in FIG. 17. Total propagation modifies this procedure by performing multiple gradient estimations at some nodes, combining the gradient estimators, and sending the combined estimators backward (FIG. 18).

FIG. 17 illustrates the backpropagation algorithm, which is used in all neural network applications in machine learning as well as in many other applications. The total propagation algorithm modifies this procedure by obtaining multiple estimates of dL/dz_2 using different gradient estimation techniques (for example, the reparameterization method and the likelihood ratio method), combining these estimates into a smaller set of gradient estimators, and passing these backward through the computation graph.

FIG. 18 illustrates the total propagation algorithm for the case in which gradient estimation is performed by combining the likelihood ratio and reparameterization gradient estimators into a single gradient estimator. This generalizes straightforwardly to combining three or more gradient estimators into a number smaller than the total number of gradient estimators and sending the combined gradient estimators backward.
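One simple way to merge a likelihood ratio estimate and a reparameterization estimate into a single value is a weighted average with weights proportional to the reciprocal of each estimator's variance, which is the weighting recited in the claims. The sketch below is a scalar illustration under assumptions: the per-particle gradient samples are hypothetical inputs, the variances are estimated from those samples, and zero-variance inputs are not handled.

```python
from statistics import mean, variance


def combine_inverse_variance(lr_samples, rp_samples):
    """Combine two per-particle gradient estimates into one, weighting by 1/variance."""
    lr_mean, rp_mean = mean(lr_samples), mean(rp_samples)
    lr_var, rp_var = variance(lr_samples), variance(rp_samples)
    # Weight on the LR estimate; the RP estimate receives the remainder.
    k_lr = (1.0 / lr_var) / (1.0 / lr_var + 1.0 / rp_var)
    combined = k_lr * lr_mean + (1.0 - k_lr) * rp_mean
    return combined, k_lr


# Hypothetical per-particle gradients at one node: LR is noisier than RP here.
combined, k_lr = combine_inverse_variance([0.0, 2.0], [3.5, 4.5])
```

Only the single combined value would then be passed to the preceding node, so the number of gradients sent backward does not grow with the number of estimators used. In a vector-valued setting, the same weighting could be applied per node or per parameter; that choice is not fixed by the sketch.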

Claims

A gradient estimation method that involves a calculation graph and estimates the gradient of another variable with respect to one variable in the calculation graph, wherein,
at some nodes in the calculation graph, two or more different estimates of the same gradient are computed with different gradient estimators, the different estimates are combined into fewer estimates than the initial number of estimates, the combined estimates are passed to different nodes in the calculation graph, and the gradient estimates are used for further calculations.

The gradient estimation method according to claim 1, wherein the different estimates of the gradient are combined based on a weighted average, and the weights of the weighted average are calculated based on an explicit or implicit estimate of the variance of the gradient estimates of some other variables in the calculation graph with respect to some variables in the calculation graph.

The gradient estimation method according to claim 2, wherein the weight is set in proportion to the magnitude of the reciprocal of the variance.

The gradient estimation method according to any one of claims 1 to 3, wherein the gradient estimator is a likelihood ratio and a reparameterized gradient estimator.

The gradient estimation method according to any one of claims 1 to 4, wherein the gradient is used for optimizing parameters in the calculation graph.

A gradient estimation method that involves a computational graph and estimates the gradient of a variable with respect to another variable in the computational graph, wherein, at some nodes in the computational graph, the gradient of an objective function is estimated with both the likelihood ratio method and the reparameterization method, and both estimates are used to optimize parameters in the computational graph.

The gradient estimation method according to claim 6, wherein the likelihood ratio and reparameterization gradient estimators are combined based on a weighted average and the weights are proportional to the reciprocal of the variance of each gradient estimator.

The gradient estimation method according to any one of claims 1 to 7, wherein the calculation graph corresponds to a policy search, reinforcement learning, machine learning, or neural network calculation graph.

The gradient estimation method according to any one of claims 1 to 8, wherein the combined estimated value is passed to the preceding node in the calculation graph.

The gradient estimation method according to any one of claims 6 to 9, wherein the parameter optimization method is a gradient descent or ascending optimization method.

The gradient estimation method according to any one of claims 1 to 10, wherein the further calculation is a further gradient estimation of some other variable with respect to some variable.

The gradient estimation method according to any one of claims 5 to 11, wherein the combination of the gradient estimates is determined based on the gradient by the previous optimization step.

A gradient estimation method that involves a calculation graph and estimates the gradient of another variable with respect to one variable in the calculation graph, wherein, at some nodes in the calculation graph, a parametric form of the probability density is assumed, the parameters of the probability density are estimated from sampled computations in the calculation graph, the gradient of the expected value of a variable at a node that depends on the current variable is estimated, the expected value being taken over the entire estimated distribution, the gradient is multiplied by some statistics at the node to obtain a scalar variable, and the scalar variable is used to obtain a likelihood ratio gradient estimator.

The gradient estimation method according to claim 13, wherein the parametric form of the probability distribution is a Gaussian distribution.

The gradient estimation method according to claim 13 or 14, wherein the order of multiplying the gradient with the statistical values and obtaining the likelihood ratio gradient estimate is interchanged, such that the likelihood ratio gradient estimation is performed prior to the multiplication with the gradient of the estimated parametric probability distribution.

A gradient estimation method that combines the gradient estimation method according to any one of claims 1 to 12 with the gradient estimation method according to claim 13, 14, or 15.

An apparatus that executes the gradient estimation method according to any one of claims 1 to 16.

A computer program that executes the gradient estimation method according to any one of claims 1 to 16.

A policy search method in reinforcement learning, comprising:
estimating the gradient of the average total reward with respect to the policy parameters by combining the reparameterization method and the likelihood ratio method at each step of gradient backpropagation, which runs opposite to the direction in which state transitions occur according to the policy and dynamics; and
updating the policy parameters according to the evaluation result.

The policy search method according to claim 19, wherein, further, the weights of the weighted average used for the combination are set based on the variance of the desired gradient with respect to the policy parameters.

The policy search method according to claim 20, wherein the weights assigned to the gradient estimators of the reparameterization method and the likelihood ratio method are set in proportion to the reciprocal of the variance of each gradient estimator.

A policy search device for reinforcement learning, which:
computes states in a discrete-time system;
estimates the gradient of the average total reward with respect to the policy parameters by combining the reparameterization method and the likelihood ratio method at each step of gradient backpropagation, which runs opposite to the direction in which state transitions occur according to the policy and dynamics; and
updates the policy parameters according to the evaluation result.

The policy search device according to claim 22, wherein, further, the weights of the weighted average used for the combination are set based on the variance of the desired gradient with respect to the policy parameters.

The policy search device according to claim 23, wherein the weights assigned to the gradient estimators of the reparameterization method and the likelihood ratio method are set in proportion to the reciprocal of the variance of each gradient estimator.

A computer program that causes a computer to execute processing of:
computing states in a discrete-time system;
estimating the gradient of the average total reward with respect to the policy parameters by combining the reparameterization method and the likelihood ratio method at each step of gradient backpropagation, which runs opposite to the direction in which state transitions occur according to the policy and dynamics; and
updating the policy parameters according to the evaluation result.