JP7378836B2

JP7378836B2 - Total stochastic gradient estimation method, apparatus, and computer program

Info

Publication number: JP7378836B2
Application number: JP2021518295A
Authority: JP
Inventors: Paavo Parmas (パラマス，パーヴォ)
Original assignee: Individual
Current assignee: Individual
Priority date: 2018-06-05
Filing date: 2019-06-05
Publication date: 2023-11-14
Anticipated expiration: 2039-06-05
Also published as: WO2019235551A1; JP2021527289A

Description

The present invention relates to a method for estimating the gradient of a variable defined in a computational graph, an apparatus that performs the estimation, and a computer program.

Most machine learning problems involve optimizing the expected value of an objective function J(x; θ) over some data-generating distribution p_Data(x), where this distribution is accessible only through sample data points {x_i}.

The most common optimization method is gradient descent using pathwise derivatives computed by backpropagation.

Non-Patent Document 1: Bengio, Y., Simard, P., and Frasconi, P. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2):157-166, 1994.
Non-Patent Document 2: Deisenroth, Marc Peter, Neumann, Gerhard, Peters, Jan, et al. A survey on policy search for robotics. Foundations and Trends in Robotics, 2(1-2):1-142, 2013.
Non-Patent Document 3: Deisenroth, Marc Peter, Fox, Dieter, and Rasmussen, Carl Edward. Gaussian processes for data-efficient learning in robotics and control. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(2):408-423, 2015.

In some situations (particularly with very long or recurrent computational graphs), this approach can degenerate into a random walk because the gradient variance explodes. This phenomenon is usually regarded as a numerical problem that leads to large steps and unstable learning (see Non-Patent Document 1).

An object of the present invention is to solve the problems associated with gradient estimation. The present invention is a general-purpose gradient estimation method that can be used on arbitrary computational graphs as an alternative to the backpropagation algorithm.

The gradient estimation method involves a computational graph and estimates the gradient of one variable with respect to other variables in the graph. At some nodes in the graph, two or more separate estimates of the same gradient are computed using distinct gradient estimators; the separate estimates are combined so that fewer estimates remain than were initially computed; the combined estimates are passed to different nodes in the graph; and the gradient estimates are used for further computation.

According to the present application, it is possible to provide an alternative, flexible framework for gradient evaluation that is more accurate and does not suffer from exploding gradients.

FIG. 1 is a block diagram showing the hardware configuration of the computing device 1 according to the present embodiment.
FIG. 2 is a diagram illustrating the policy gradient evaluation algorithm of PILCO.
FIG. 3 is a diagram illustrating the total propagation algorithm.
FIG. 4 is a flowchart illustrating the procedure executed by the computing device 1 according to the present embodiment.
The remaining drawings are, in order: six diagrams showing experimental results; two graphs of variance; four diagrams showing experimental results; two diagrams showing examples of paths in Equation 11; two diagrams showing stochastic computation graphs for model-based and model-free LR gradient estimation; a diagram showing Algorithm 3, which explains in detail how it fits with total propagation; a diagram showing the computational paths in the Gaussian shaping gradient; four diagrams showing experimental results; a diagram showing the general form of the algorithm; a diagram showing the backpropagation algorithm used in all neural network applications in machine learning as well as in many other applications; and a diagram showing the total propagation algorithm in the case where gradient estimation is performed by combining the likelihood ratio and reparameterization gradient estimators into a single gradient estimator.

(Embodiment 1)
FIG. 1 is a block diagram showing the hardware configuration of the computing device 1 according to the present embodiment. The computing device 1 according to the present embodiment is an information processing device such as a personal computer or a server device. The computing device 1 comprises a control unit 11, a storage unit 12, an input unit 13, a communication unit 14, an operation unit 15, and a display unit 16. The computing device 1 implements the methods disclosed in "PIPPS: Flexible Model-Based Policy Search Robust to the Curse of Chaos", "Total Propagation Algorithm: Supplementary notes", and "Total stochastic gradient algorithms and applications in reinforcement learning" by the present inventors.

The control unit 11 includes a CPU (Central Processing Unit), a ROM (Read Only Memory), a RAM (Random Access Memory), and the like. The ROM of the control unit 11 stores control programs and the like that control the operation of each part of the hardware. The CPU of the control unit 11 executes the control program stored in the ROM and the various programs stored in the storage unit 12, described later, to control the operation of the hardware in accordance with the methods disclosed in the above-mentioned papers. The RAM of the control unit 11 stores data used temporarily during the execution of the various programs.

Note that the control unit 11 is not limited to the above configuration and may be one or more processing circuits or arithmetic circuits including a single-core CPU, a multi-core CPU, a GPU (Graphics Processing Unit), a microcomputer, and volatile or non-volatile memory. The control unit 11 may also include functions such as a clock that outputs data and time information, a timer that measures the time elapsed from when a measurement start command is issued until a measurement end command is issued, and a counter for counting.

The storage unit 12 includes a storage device using an SRAM (Static Random Access Memory), a flash memory, a hard disk, or the like. The storage unit 12 stores the various programs executed by the control unit 11, the data needed to execute those programs, and the like. The programs stored in the storage unit 12 include, for example, a computer program implementing the techniques disclosed in the above-mentioned papers.

The programs stored in the storage unit 12 may be provided on a recording medium M on which the programs are readably recorded. The recording medium M is a portable memory such as an SD (Secure Digital) card, a micro SD card, or a CompactFlash (registered trademark) card. In this case, the control unit 11 can read a program from the recording medium M using a reading device (not shown) and install the read program in the storage unit 12. Alternatively, the programs stored in the storage unit 12 may be provided by communication via the communication unit 14. In this case, the control unit 11 can acquire a program through the communication unit 14 and install the acquired program in the storage unit 12.

The input unit 13 has an input interface for inputting various data into the device. The control unit 11 acquires the data to be processed through the input unit 13.

The communication unit 14 includes a communication interface for connecting to a communication network (not shown) such as the Internet; it transmits various types of information to be reported externally and receives various types of information sent from outside. In the present embodiment, the data to be processed is acquired through the input unit 13, but it may instead be acquired through the communication unit 14.

The operation unit 15 includes a user interface such as a keyboard and a touch panel and accepts various operation information and setting information. The control unit 11 performs appropriate control based on the operation information input from the operation unit 15 and, as necessary, stores the setting information in the storage unit 12.

The display unit 16 includes a display device such as a liquid crystal display panel or an organic EL (Electro Luminescence) display panel and, based on control signals output from the control unit 11, displays information to be presented to the user.

In the present embodiment, the configuration disclosed in the above-mentioned papers is realized by software processing executed by the control unit 11, but an LSI (Large Scale Integration), an ASIC (Application Specific Integrated Circuit), an FPGA (Field-Programmable Gate Array), or the like may be mounted separately from the control unit 11. In this case, the control unit 11 sends the data to be processed, input from the input unit 13, to that hardware, so that the methods disclosed in the above-mentioned papers are carried out in hardware.

Furthermore, although the computing device 1 is described in the present embodiment as a single device for simplicity, it may be composed of multiple computing devices or of one or more virtual machines.

In the present embodiment, the computing device 1 includes the operation unit 15 and the display unit 16, but these are not essential. For example, the computing device 1 may accept operations through an externally connected computer and output the information to be reported to that external computer.

The gradient estimation method of the present invention is described below. In the formulas, lowercase letters denote scalars and bold letters denote vectors or matrices; in the following description, however, lowercase and bold letters are shown without distinction. In addition, "C_^" denotes a character with a hat and "C_~" denotes a character with a tilde.

(2.1 Policy search)
For an overview of policy search methods, see Non-Patent Document 2. Note that policy search is just one application of the algorithm, which is not limited to a particular computational graph and can be applied to any computational graph. Consider a discrete-time system described by a state vector x_t (e.g., the position and velocity of a robot) and an applied action/control vector u_t (e.g., motor torques). An episode begins by sampling a state from a fixed initial state distribution x_0 ~ p(x_0). The policy π_θ determines the applied action u_t ~ p(u_t) = π(x_t; θ). Applying the action causes the state to transition according to an unknown dynamics function x_{t+1} ~ p(x_{t+1}) = f(x_t, u_t). Both the policy and the dynamics may be stochastic and nonlinear. Actions and state transitions are repeated for up to T time steps, producing a trajectory τ: (x_0, u_0, x_1, u_1, ..., x_T). Each episode is scored according to a return function G(τ). The return is often decomposed into a sum of per-time-step costs G(τ) = Σ_{t=0}^{T} c(x_t), where c(x) is the cost function. The goal is to optimize the policy parameters θ so as to minimize the expected return J(θ) = E_{τ~p(τ;θ)}[G(τ)]. Here, the value is defined as V_h(x) = E[Σ_{t=h}^{T} c(x_t) | x_h = x].

Learning alternates between executing the policy on the system and then updating θ to improve performance on subsequent trials. Policy gradient methods directly estimate the gradient of the objective function, d/dθ J(θ), and use it for optimization. Some model-based policy search methods use all of the data to learn a model of f, denoted f_^, and use it for "mental rehearsal" between trials to optimize the policy. Hundreds of simulated trials can be run for each real trial, greatly improving data efficiency. Here we exploit the fact that differentiating f_^ can yield a better gradient estimator than model-free algorithms. The model in this case is probabilistic and predicts state distributions.

(Stochastic gradient estimation)
This section explains how to compute the gradient of the expectation of an arbitrary function φ(x) with respect to the parameters of the sampling distribution, d/dθ E_{x~p(x;θ)}[φ(x)] (for example, the gradient of the expected return with respect to the policy parameters).

(Reparameterization gradient (RP))
Consider sampling from a univariate Gaussian distribution. One approach first samples with zero mean and unit variance, ε ~ N(0, 1), and then maps this point to reproduce a sample from the desired distribution (x = μ + σε). It is now straightforward to differentiate the output with respect to the distribution parameters: dx/dμ = 1 and dx/dσ = ε. Averaging dφ/dx · dx/dθ over the samples gives an unbiased estimate of the gradient of the expectation. This is the RP gradient for the normal distribution. For a multivariate Gaussian distribution, the Cholesky factor of the covariance matrix (L such that Σ = LL^T) can be used in place of σ.
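The univariate case above can be sketched in a few lines of standard-library Python. This is only an illustration under stated assumptions: the function name `rp_gradient`, the test function φ(x) = x^2, and the sample count are not from the source. For this φ, E[φ(x)] = μ^2 + σ^2, so the exact gradients are 2μ and 2σ, which gives something to compare the estimate against.

```python
import random


def rp_gradient(mu, sigma, dphi_dx, num_samples=10000, seed=0):
    """Estimate d/dmu and d/dsigma of E[phi(x)], x ~ N(mu, sigma^2), by reparameterization."""
    rng = random.Random(seed)
    grad_mu = 0.0
    grad_sigma = 0.0
    for _ in range(num_samples):
        eps = rng.gauss(0.0, 1.0)   # eps ~ N(0, 1)
        x = mu + sigma * eps        # reparameterized sample from N(mu, sigma^2)
        slope = dphi_dx(x)          # dphi/dx evaluated at the sample
        grad_mu += slope * 1.0      # chain rule with dx/dmu = 1
        grad_sigma += slope * eps   # chain rule with dx/dsigma = eps
    return grad_mu / num_samples, grad_sigma / num_samples


# Illustrative choice (an assumption): phi(x) = x^2, so dphi/dx = 2x and
# the exact gradients are 2*mu = 3.0 and 2*sigma = 1.0.
g_mu, g_sigma = rp_gradient(1.5, 0.5, lambda x: 2.0 * x)
```

The estimator needs dφ/dx, which is why RP gradients are tied to settings where the function being averaged is differentiable.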

(Likelihood ratio gradient (LR))
The desired gradient can be written as d/dθ E_{x~p(x;θ)}[φ(x)] = ∫ dp(x; θ)/dθ φ(x) dx. In general, any function can be integrated by sampling from a distribution q(x), using ∫ φ(x) dx = ∫ q(x) φ(x)/q(x) dx = E_{x~q}[φ(x)/q(x)]. The likelihood ratio gradient chooses q(x) = p(x) and integrates directly, as follows.
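The equation referred to here is not reproduced in this text. Based on the definitions just given and on the log-probability form of the estimator used in the next paragraph, it presumably takes the standard form below; this is a reconstruction, not a quotation from the patent.

```latex
\frac{d}{d\theta}\,\mathbb{E}_{x\sim p(x;\theta)}\bigl[\phi(x)\bigr]
  = \int \frac{dp(x;\theta)}{d\theta}\,\phi(x)\,dx
  = \int p(x;\theta)\,\frac{\frac{dp(x;\theta)}{d\theta}}{p(x;\theta)}\,\phi(x)\,dx
  = \mathbb{E}_{x\sim p(x;\theta)}\!\left[\frac{d\log p(x;\theta)}{d\theta}\,\phi(x)\right]
```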

LR gradients often have high variance and need to be combined with variance reduction techniques known as control variates (Greensmith et al., 2004). A common approach subtracts a constant baseline b from the function values, giving the estimator E_{x~p}[d/dθ (log p(x; θ)) (φ(x) - b)]. If b is independent of the samples, this can greatly reduce the variance without introducing bias. In practice, the sample mean is a good choice (b = E[φ(x)]). When estimating the gradient from a batch, an unbiased gradient estimator can be obtained by estimating a leave-one-out baseline for each point, i.e., b_i = Σ_{j≠i}^{P} φ(x_j)/(P - 1).
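A minimal sketch of the LR estimator with leave-one-out baselines, again for a univariate Gaussian and in standard-library Python. The function name `lr_gradient_mu`, the choice φ(x) = x^2, and the restriction to the gradient with respect to μ are assumptions made for illustration. For a Gaussian, d log p(x; μ, σ)/dμ = (x - μ)/σ^2.

```python
import random


def lr_gradient_mu(mu, sigma, phi, num_samples=10000, seed=0):
    """LR estimate of d/dmu E[phi(x)], x ~ N(mu, sigma^2), with leave-one-out baselines."""
    rng = random.Random(seed)
    xs = [rng.gauss(mu, sigma) for _ in range(num_samples)]
    values = [phi(x) for x in xs]
    total = sum(values)
    grad = 0.0
    for x, value in zip(xs, values):
        # b_i = sum_{j != i} phi(x_j) / (P - 1): excludes the sample's own value,
        # so the baseline is independent of x_i and adds no bias.
        baseline = (total - value) / (num_samples - 1)
        score = (x - mu) / sigma ** 2   # d log p(x; mu, sigma) / d mu
        grad += score * (value - baseline)
    return grad / num_samples


# Illustrative choice (an assumption): phi(x) = x^2, exact gradient 2*mu = 3.0.
g_mu = lr_gradient_mu(1.5, 0.5, lambda x: x * x)
```

Unlike the RP sketch, this estimator only evaluates φ itself and never its derivative.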

(Trajectory gradient estimation)
The probability density of observing a particular trajectory, p(τ) = p(x_0, u_0, x_1, u_1, ..., x_T), can be written as p(x_0) π(u_0 | x_0) p(x_1 | x_0, u_0) ... p(x_T | x_{T-1}, u_{T-1}).

Using the RP gradient requires knowing or estimating the dynamics p(x_{t+1} | x_t, u_t). In other words, it is applicable in the model-based setting. Given such a model, the predicted trajectory can be differentiated using the chain rule.

To use the LR gradient, note that because p(τ) is a product, log p(τ) can be converted into a sum. Let G_h(τ) = Σ_{t=h}^{T} c(x_t). Noting that (1) only the action distributions depend on the policy parameters and (2) an action does not affect the costs incurred at earlier time steps, the following gradient estimator is obtained.
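The estimator itself is not reproduced in this text. A plausible reconstruction from the two observations above is sketched below; the baseline term b_{t+1} is an assumption carried over from the control variate discussion, and the exact indexing in the patent may differ.

```latex
\log p(\tau) = \log p(x_0)
  + \sum_{t=0}^{T-1} \Bigl[ \log \pi(u_t \mid x_t;\theta) + \log p(x_{t+1} \mid x_t, u_t) \Bigr]

\frac{d}{d\theta} J(\theta)
  = \mathbb{E}_{\tau\sim p(\tau;\theta)}\!\left[
      \sum_{t=0}^{T-1} \frac{d\log \pi(u_t \mid x_t;\theta)}{d\theta}
      \bigl( G_{t+1}(\tau) - b_{t+1} \bigr)
    \right]
```

Only the policy terms of log p(τ) survive differentiation with respect to θ, and each is paired with G_{t+1}(τ) because the action u_t can influence costs only from x_{t+1} onward.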

(PILCO)
FIG. 2 is a diagram illustrating the policy gradient evaluation algorithm of PILCO. We follow the original PILCO, which uses a Gaussian process dynamics model to predict the change in state from one time step to the next, i.e., p(Δx_{t+1}^a) = GP(x_t, u_t), where x ∈ R^D, u ∈ R^F, and Δx_{t+1}^a = x_{t+1}^a - x_t^a. A separate Gaussian process is learned for each dimension a. The squared exponential covariance function k_a(x_~, x'_~) = s_a^2 exp(-(x_~ - x'_~)^T Λ_a^{-1} (x_~ - x'_~)) is used, where s_a and Λ_a = diag([l_a1, l_a2, ..., l_a(D+F)]) are the function variance and length scale hyperparameters, respectively. A Gaussian likelihood function with noise hyperparameter σ_n is also used. The hyperparameters are trained to maximize the marginal likelihood. When sampling from these models, predictions have the form y = f_^(x) + ε, where ε ~ N(0, σ_f^2(x) + σ_n^2). Here, σ_f^2 represents the model uncertainty, which arises from a lack of data in a region, while σ_n^2 is the learned inherent model noise. The learned model noise is not necessarily the same as the real observation noise σ_o^2 in the system. In fact, the latent state is not modeled, and the system is approximated by predicting the next observation given the current observation. In addition, the trajectories have a further source of variance: different starting positions lead to different trajectories.
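As a small illustration, the covariance function can be evaluated as below. The sketch follows the expression exactly as it is written in this text, with a diagonal Λ_a; the function and argument names are hypothetical, and the inputs stand for the vectors written with a tilde above.

```python
import math


def squared_exponential(x, x_prime, signal_std, lambda_diag):
    """k_a(x, x') = s_a^2 * exp(-(x - x')^T Lambda_a^{-1} (x - x')) with diagonal Lambda_a.

    x, x_prime  : input vectors of equal length (lists of floats)
    signal_std  : s_a
    lambda_diag : diagonal entries of Lambda_a, one per input dimension
    """
    quad = sum((a - b) ** 2 / l for a, b, l in zip(x, x_prime, lambda_diag))
    return signal_std ** 2 * math.exp(-quad)


k_same = squared_exponential([0.3, -1.0], [0.3, -1.0], 2.0, [1.0, 1.0])
k_near = squared_exponential([1.0, 0.0], [0.0, 0.0], 1.0, [2.0, 1.0])
```

Identical inputs give the maximum value s_a^2, and the covariance decays as the inputs move apart, faster along dimensions with smaller entries in Λ_a.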

(Moment matching prediction)
In general, when a Gaussian distribution is mapped through a nonlinear function, the output is intractable and non-Gaussian. In some cases, however, the moments of the output distribution can be evaluated analytically. Moment matching (MM) approximates the output distribution as a Gaussian by matching its mean and variance to the true moments. Note that even though each state dimension is modeled by a separate function f_^_a, MM is performed jointly, so the state distribution can include covariances.

(Particle prediction)
In general, particle trajectory prediction is simple: predict at every particle position, sample from the output distributions, and repeat. However, through comparison with a scheme based on Gaussian resampling (GR), a neural network dynamics model is also applied to PILCO.

(Gaussian resampling (GR))
MM can be replicated stochastically. At each time step, the mean μ_^ = Σ_{i=1}^{P} x_i/P and the variance Σ_^ = Σ_{i=1}^{P} (x_i - μ_^)(x_i - μ_^)^T/(P - 1) of the particles are estimated. The particles are then resampled from the fitted distribution, x'_i = μ_^ + L z_i with z_i ~ N(0, I), where L is the Cholesky factor of Σ_^. Obtaining the gradient dL/dΣ_^ is not straightforward. Here, a given symbolic expression is used.
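The resampling step can be sketched for a scalar state, where the Cholesky factor of the 1x1 matrix Σ_^ is simply its square root. The function name and the uniform test particles are assumptions chosen for illustration; the multivariate case would need a real Cholesky factorization.

```python
import math
import random


def gaussian_resample(particles, rng):
    """Fit a Gaussian to scalar particles and draw the same number of new particles."""
    count = len(particles)
    mean = sum(particles) / count                                 # mu_hat
    var = sum((x - mean) ** 2 for x in particles) / (count - 1)   # Sigma_hat, with P - 1
    chol = math.sqrt(var)                                         # Cholesky factor in 1-D
    return [mean + chol * rng.gauss(0.0, 1.0) for _ in range(count)]


rng = random.Random(0)
particles = [rng.uniform(-1.0, 3.0) for _ in range(2000)]   # deliberately non-Gaussian
resampled = gaussian_resample(particles, rng)
```

After resampling, the particle set keeps only the first two moments of the original set; any skew or multimodality is discarded, which is what makes this a stochastic counterpart of moment matching.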

(Hybrid gradient estimation techniques)
In the present invention, RP gradients can be used. Surprisingly, however, the RP gradient turned out to be hopelessly inaccurate (see FIG. 5D). To solve this problem, new gradient estimators were derived that combine the model derivatives with the LR gradient. In particular, the approach of the present invention improves sampling efficiency through importance sampling within the batch.

(Model-based LR)
The distribution over predicted trajectories can be written as p(τ) = p(x_0) π(u_0 | x_0) f_^(x_1 | x_0, u_0) ... f_^(x_T | x_{T-1}, u_{T-1}). With a deterministic policy, the model and the policy can be combined as p(x_{t+1} | x_t) = f_^(x_{t+1} | x_t, π(x_t; θ)), which is differentiable (dp_{t+1}/dθ = dp_{t+1}/du_t · du_t/dθ). The model-based gradient is derived as follows.

(Batch importance weighted LR (BIW-LR))
Multiple particles are sampled simultaneously using parallel computation. The state distribution is represented as the mixture distribution q(x_{t+1}) = Σ_{i=1}^{P} p(x_{t+1} | x_{i,t}; θ)/P. As in the derivation of LR, a lower-variance estimator can be derived for each time step by importance sampling within the batch, as follows.

The following equation is used to estimate a leave-one-out mean of the returns by normalized importance sampling.

Here, c_{j,t+1} = p(x_{j,t+1} | x_{i,t}) / Σ_{k=1}^{P} p(x_{j,t+1} | x_{k,t}). Without normalization, the high variance of the baseline estimate leads to a poor LR gradient. Note that while P baselines are computed at each time step, the gradient estimator contains P^2 components. To obtain a truly unbiased gradient, P^2 leave-one-out baselines (one for each particle for each mixture component of the distribution) would have to be computed. The evaluations in this specification use only the baselines presented here (which were found to have already removed most of the bias).
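The weights c_{j,t+1} can be sketched as follows for a scalar state with Gaussian transition densities of a shared standard deviation, which is an assumption made purely for illustration. The baseline equation is not reproduced in this text, so `leave_one_out_baseline` shows one assumed form (a weight-normalized mean of the other particles' returns) rather than the patent's exact expression. All names are hypothetical.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of N(mean, std^2) at x."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2.0 * math.pi))


def normalized_weights(next_states, pred_means, pred_std, i):
    """c_j = p(x_{j,t+1} | x_{i,t}) / sum_k p(x_{j,t+1} | x_{k,t}) for every particle j.

    pred_means[k] is the predicted mean of the next state given particle k at time t.
    """
    weights = []
    for x_next in next_states:
        numer = gaussian_pdf(x_next, pred_means[i], pred_std)
        denom = sum(gaussian_pdf(x_next, m, pred_std) for m in pred_means)
        weights.append(numer / denom)
    return weights


def leave_one_out_baseline(weights, returns, i):
    """Assumed form: normalized importance-weighted mean of returns, excluding particle i."""
    numer = sum(w * g for j, (w, g) in enumerate(zip(weights, returns)) if j != i)
    denom = sum(w for j, w in enumerate(weights) if j != i)
    return numer / denom


next_states = [0.1, 0.9]
pred_means = [0.0, 1.0]
w0 = normalized_weights(next_states, pred_means, 0.5, 0)
w1 = normalized_weights(next_states, pred_means, 0.5, 1)
b0 = leave_one_out_baseline([0.5, 0.25, 0.25], [1.0, 2.0, 4.0], 0)
```

For a fixed next-state sample, the weights over all mixture components sum to one, which is the normalization that keeps the baseline estimate from blowing up.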

(RP/LR weighted average)
Most of the computation is spent on the dp(x_{t+1} | x_t; θ)/dθ terms. Since these terms are needed for both the LR and RP gradients, there is no penalty for combining the two estimators. A well-known statistical result states that, for independent estimators, the optimal weighted average is obtained when the weights are proportional to the inverse variances, i.e., μ = μ_LR k_LR + μ_RP k_RP, where k_LR = (σ_LR_^)^-2 / ((σ_LR_^)^-2 + (σ_RP_^)^-2) and k_RP = 1 - k_LR.
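A short worked example may make the weighting concrete. The variances 4 and 1 are made-up numbers, not values from the patent, and the estimators are assumed independent as stated above.

```latex
\hat\sigma_{LR}^{2} = 4,\quad \hat\sigma_{RP}^{2} = 1
\;\Longrightarrow\;
k_{LR} = \frac{1/4}{1/4 + 1} = 0.2,\qquad k_{RP} = 1 - k_{LR} = 0.8

\operatorname{Var}[\mu]
  = k_{LR}^{2}\,\hat\sigma_{LR}^{2} + k_{RP}^{2}\,\hat\sigma_{RP}^{2}
  = 0.04 \cdot 4 + 0.64 \cdot 1
  = 0.8
```

The combined variance of 0.8 is lower than that of either estimator alone, even though the LR estimate is four times noisier than the RP estimate.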

A naive combination scheme would compute the gradients of the whole trajectory separately for both estimators and then combine them. However, this approach ignores the opportunity to obtain better gradient estimates by using reparameterization gradients over shorter parts of the trajectory. The new total propagation algorithm (TP) of the present invention improves on this naive scheme. TP computes a union over all possible RP depths in a single backward pass, automatically giving greater weight to the lower-variance estimators.

FIG. 3 is a diagram illustrating the total propagation algorithm. In Algorithm 2, at each backward step, the gradient with respect to the policy parameters is evaluated using both the LR and the RP methods. The mixing ratio is evaluated from the variances in the policy parameter space, since these are proportional to the variance of the policy gradient estimator. The gradients are then combined, and the best estimate in the distribution parameter space is passed back to the previous time step. In this algorithm, the V operator takes the sample variance of the gradient estimates across the different particles, but other variance estimation schemes are possible: for example, the variance may be estimated from a moving average of the gradient magnitudes, a different statistical estimator of the variance may be used, or only a subset of the policy parameters may be used. The algorithm is not limited to RL problems; it is applicable to general stochastic computation graphs and may be used for training probabilistic models, stochastic neural networks, and the like. In a general computation graph setting, multiple gradient estimators may be combined at some nodes of the graph by propagating gradients backward through the graph. In this case, decrementing the time step parameter t by 1 corresponds to moving one node backward in the graph while propagating the gradient. Statistics such as the variance that are used to decide how the gradient estimators are combined may be obtained from any other node of the computation graph.
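The combination step described above can be sketched as follows. This is a minimal illustration only, assuming an inverse-variance style weight k_LR = 1/(1 + var_LR/var_RP) with each variance taken as the trace of the per-particle sample covariance; the names `trace_variance` and `combine_gradients` are hypothetical, and the exact rule used in Algorithm 2 may differ.

```python
from statistics import fmean, variance


def trace_variance(grads):
    """Sum of per-dimension sample variances across particles (trace of the covariance)."""
    dims = range(len(grads[0]))
    return sum(variance([g[d] for g in grads]) for d in dims)


def combine_gradients(grads_lr, grads_rp):
    """Blend per-particle LR and RP gradients so the lower-variance one dominates.

    Assumption for illustration: k_lr = 1 / (1 + var_lr / var_rp).
    """
    var_lr = trace_variance(grads_lr)
    var_rp = trace_variance(grads_rp)
    if var_lr == 0.0 and var_rp == 0.0:
        k_lr = 0.5  # no information to prefer either estimator
    elif var_rp == 0.0:
        k_lr = 0.0  # RP is exact, use it alone
    else:
        k_lr = 1.0 / (1.0 + var_lr / var_rp)
    dims = range(len(grads_lr[0]))
    mean_lr = [fmean([g[d] for g in grads_lr]) for d in dims]
    mean_rp = [fmean([g[d] for g in grads_rp]) for d in dims]
    combined = [k_lr * a + (1.0 - k_lr) * b for a, b in zip(mean_lr, mean_rp)]
    return k_lr, combined
```

With this weighting, an RP estimate whose variance has exploded receives a weight close to zero, which matches the behaviour the text attributes to TP.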

FIG. 4 is a flowchart illustrating the procedure executed by the computing device 1 according to the present embodiment. The computing device 1 executes the following process in accordance with Algorithm 2.

The control unit 11 initializes the various parameters (step S101). Specifically, the control unit 11 sets dG_{T+1}/dζ_{T+1} = 0, dJ/dθ = 0, and G_{T+1} = 0, where ζ denotes the distribution parameters (for example, μ and σ).

The control unit 11 sets the time (time step) t to T (step S102) and executes the following calculation for each particle i (step S103), where c_t is the cost at time t.

The control unit 11 executes the following calculation using the result of Equation 6 (step S104).

Further, the control unit 11 executes the following calculation for each particle i using the result of Equation 6 (step S105).

Next, the control unit 11 determines whether the time t has reached the predetermined time 1 (step S106). If the time t has not reached 1 (S106: NO), the control unit 11 decrements t by 1 (step S107) and returns the process to step S103.

(Policy optimization)
Any gradient-based optimization procedure may be used, but the present embodiment uses a stochastic gradient descent method in the style of RMSprop (an algorithm derived from RMSprop). RMSprop normalizes its SGD steps using a moving average of the squared gradients. In the present invention the batch size is large, so the expectation of the square is estimated directly from the batch as z = E[g^2] = E[g]^2 + V[g], where g is the gradient. The variance of the mean is used; that is, V[g] is the sample variance divided by the number of particles P. The gradient step becomes g/z^{1/2}. Momentum with parameter γ is also used. The full update equations are as follows.
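As a rough sketch of the normalized step described above, the following shows one update for a single scalar parameter. The way momentum is combined with the normalized step, and the function name `normalized_momentum_step`, are assumptions made for illustration; the defaults α = 5×10^-4 and γ = 0.9 are the values reported for the learning experiments.

```python
from math import sqrt
from statistics import fmean, variance


def normalized_momentum_step(theta, momentum, particle_grads, alpha=5e-4, gamma=0.9):
    """One update of a scalar parameter from per-particle gradient samples.

    Assumed form (illustration only):
        g     = mean of the particle gradients
        z     = g**2 + Var/P            # batch estimate of E[g^2]
        m     = gamma * m + g / sqrt(z)
        theta = theta - alpha * m
    """
    p = len(particle_grads)
    g = fmean(particle_grads)
    z = g * g + variance(particle_grads) / p  # variance of the mean, not of a sample
    step = g / sqrt(z) if z > 0.0 else 0.0
    momentum = gamma * momentum + step
    return theta - alpha * momentum, momentum
```

Because z is at least g^2, the normalized step g/z^{1/2} never exceeds 1 in magnitude, and it shrinks toward zero when the batch variance dominates the mean gradient.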

Fixing the random number seed turns a stochastic problem into a deterministic one; in the RL community this is also known as the PEGASUS trick. With a fixed seed, the RP gradient is the exact gradient of the seeded objective, and a deterministic quasi-Newton optimizer such as BFGS can be used.
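The effect of fixing the seed can be seen in a toy example. The objective below is hypothetical and unrelated to the cart-pole model; it only shows that re-seeding the generator on every call turns a Monte Carlo estimate into a deterministic function of θ.

```python
import random


def seeded_objective(theta, seed=0, particles=100):
    """Monte Carlo estimate of E[(theta + noise)^2] with a fixed noise seed."""
    rng = random.Random(seed)  # re-seeding gives identical noise on every call
    total = 0.0
    for _ in range(particles):
        total += (theta + rng.gauss(0.0, 1.0)) ** 2
    return total / particles
```

Two calls with the same θ and seed return exactly the same number, so a deterministic optimizer sees an ordinary smooth function; changing the seed gives a different, but again deterministic, function.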

(Experiments)
Experiments were conducted for two purposes: (1) to explain why the RP gradient alone is not sufficient, and (2) to show that the newly developed method of the present invention can match PILCO in terms of learning efficiency.

(Plotting the value landscape)
FIGS. 5A to 5F illustrate the experimental results. The policy parameters θ were perturbed in a randomly chosen fixed direction, and the objective function and the magnitude of the projected gradient were plotted as functions of Δθ. The results of this experiment are perhaps the most novel part of this specification and led to the term "the curse of chaos".

The plots were generated on a nonlinear cart-pole task. 1000 particles were used, and the random number seed was kept fixed to demonstrate that the high variance in FIG. 5D is not caused by randomness but by a chaos-like property of the system. Confidence intervals were estimated from Var/P, where Var is the sample variance and P is the number of particles. As described later, a more principled approach is used to plot the dependence of the variance on P.

FIG. 5D contains a peculiar result: in one region the RP gradient is well behaved, but as the policy parameters are perturbed, the variance explodes in a phase-transition-like change. The variance at Δθ = 1.5 is about 4×10^5 times that at Δθ = 0, which means that about 4×10^8 particles (4×10^5 times the 1000 used here) would be needed for the RP gradient to become accurate in this region. In practice, optimizing with the RP gradient leads to a simple random walk.

Because the seed is fixed, the RP gradient in FIG. 5D is the exact gradient of the value in FIG. 5A. Hence there is infinitesimal deterministic "noise" on the right-hand side of FIG. 5A. However, the value averaged over 1000 particles is not the true objective; the true objective requires averaging over an infinite number of particles. If an infinite number of particles were averaged, would the "noise" still be there, or would the function become smooth?

The new gradient estimators in FIGS. 5E and 5F suggest that the true objective is indeed smooth. To provide further evidence, the gradient magnitude was also estimated from finite differences of the values in FIG. 5A, using a perturbation in θ large enough for the "noise" to be negligible. The fact that two separate approaches agree (one that varies the policy parameters θ, and another that keeps θ fixed but estimates the gradient from the trajectories) provides convincing evidence that the true objective is smooth.

FIGS. 5B and 5C explain the reason for the explosion of the variance when RP gradients are used. FIG. 5B corresponds to the leftmost parameter setting and FIG. 5C to the rightmost. The plots show how the value V(x; θ) (the remaining cumulative cost) changes as a function of the position x. Because the random number seed is fixed, the value V is identical to the remaining return G. The figures were created by predicting trajectories for 4 particles at each point with different fixed seeds and averaging the costs of the trajectories. One particle was tried first; in that case the value appeared to contain staircase-like parts but was otherwise less interesting than the present figures, so 4 particles were used. Since the average over the 4 particles is erratic, at least one of the 4 particles must have been highly erratic within the region shown.

Each square is centered at the mean prediction obtained from the center of the initial state distribution. The axes of the squares differ slightly because the predicted position distribution p(x_1; θ) changes when θ changes. The side length corresponds to 4 standard deviations of the Gaussian distribution p(x_1; θ). The velocities were kept fixed at their mean values.

RP estimates d/dθ ∫ p(x_1; θ) V(x_1) dx_1. It samples points inside the square, computes the gradient dV/dθ = dV/dx · dx/dθ at each sample, and averages over the samples. In FIG. 5C, finding the gradient of the expectation by differentiating V is completely hopeless. In contrast, the LR gradient (FIG. 5E) uses only the value V rather than its derivative and does not suffer from this problem. TP (FIG. 5F) effectively combines both estimators.
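The contrast between the two estimators can be reproduced on a one-dimensional toy problem. The sketch below is not the cart-pole experiment: the value functions, the oscillation constant `K`, and the batch-mean baseline are assumptions chosen only to illustrate why differentiating a rapidly varying V inflates the RP variance while LR, which uses only the values of V, is unaffected.

```python
import math
import random
from statistics import fmean, variance


def rp_and_lr_samples(value, dvalue_dx, mu, sigma, n, seed=0):
    """Per-sample RP and LR estimates of d/dmu E[value(x)] for x ~ N(mu, sigma^2)."""
    rng = random.Random(seed)
    xs = [mu + sigma * rng.gauss(0.0, 1.0) for _ in range(n)]
    values = [value(x) for x in xs]
    baseline = fmean(values)  # simple batch-mean baseline (adds an O(1/n) bias)
    rp = [dvalue_dx(x) for x in xs]  # dV/dx * dx/dmu, with dx/dmu = 1
    lr = [(v - baseline) * (x - mu) / sigma ** 2 for v, x in zip(values, xs)]
    return rp, lr


# Smooth value: both estimators target d/dmu E[x^2] = 2 * mu.
rp_smooth, lr_smooth = rp_and_lr_samples(lambda x: x * x, lambda x: 2.0 * x, 1.0, 0.5, 20000)

# Rapidly oscillating value, standing in for a chaotic value landscape.
K = 200.0
rp_rough, lr_rough = rp_and_lr_samples(
    lambda x: math.sin(K * x), lambda x: K * math.cos(K * x), 1.0, 0.5, 20000
)
```

On the smooth value both sample means land near 2μ. On the oscillating value the RP samples scale with K, so their variance grows like K^2, whereas the LR samples stay bounded by the range of V.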

The values and gradients for the Gaussian resampling case are not plotted, but in the end both were smooth functions for a fixed random seed. Resampling is therefore also effective against the "curse of chaos".

FIGS. 6A and 6B are graphs of the variance. They plot how the variance of the gradient estimators at Δθ = 0 and Δθ = 1.5 depends on the number of particles P. The variance was computed by sampling each estimator a large number of times and calculating the variance of the resulting set of evaluations. The RP, TP, and LR gradients are compared, the latter both with and without batch importance weighting (BIW), to show that the importance sampling scheme of the present invention reduces the variance. The importance-sampled baseline was used; in practice, an ordinary LR gradient would use a simpler baseline and have a much higher variance. The RP gradient is omitted from FIG. 6B because its variance was between 10^8 and 10^15. The TP gradient combined the BIW-LR and RP gradients.

The results confirm that BIW significantly reduces the variance. Moreover, the TP algorithm of the present invention performed best. Importantly, in FIG. 6B the variance of the RP gradient over the full trajectory is 10^6 times larger than that of the other estimators, yet TP exploits RP gradients over short path lengths to obtain a variance reduced by 10 to 50% for fewer than 250 particles. This is a remarkable result because, if the gradient estimators were computed separately, the best possible precision of the combined estimator would be the sum of the precisions of the separate estimators. The total propagation algorithm of the present invention exploits the graph structure of the computation and therefore achieves a precision higher than this sum.
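The statement about the sum of precisions follows from a short calculation, given here as a sketch under the added assumption that the two separately computed estimators a and b are unbiased and uncorrelated, with variances σ_a^2 and σ_b^2 (these symbols are introduced only for this illustration).

```latex
\begin{align*}
\operatorname{Var}\bigl[k\,a + (1-k)\,b\bigr] &= k^{2}\sigma_a^{2} + (1-k)^{2}\sigma_b^{2}, \\
\frac{d}{dk}\operatorname{Var} = 0 \;\Rightarrow\; k^{*} &= \frac{\sigma_b^{2}}{\sigma_a^{2} + \sigma_b^{2}}, \\
\operatorname{Var}\bigl[k^{*}a + (1-k^{*})\,b\bigr] &= \frac{\sigma_a^{2}\sigma_b^{2}}{\sigma_a^{2} + \sigma_b^{2}}
\quad\Longleftrightarrow\quad
\frac{1}{\sigma_{*}^{2}} = \frac{1}{\sigma_a^{2}} + \frac{1}{\sigma_b^{2}}.
\end{align*}
```

If σ_a^2 is 10^6 times larger than σ_b^2, estimator a adds almost nothing to this sum, which is why a 10 to 50% variance reduction cannot come from combining full-trajectory estimators alone.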

(Learning experiments)
PILCO is compared on episodic learning tasks with the following particle-based methods: RP, RP with a fixed seed (RPFS), Gaussian resampling (GR), GR with a fixed seed (GRFS), the model-based batch importance weighted likelihood ratio (LR), and total propagation (TP). In addition, two variations of the particle predictions are evaluated: (1) TP that ignores model uncertainty and adds only the noise at each time step (TP-σ_f), and (2) TP with increased prediction noise (TP+σ_n). 300 particles were used in all cases.

The learning tasks were taken from a recent PILCO paper (Non-Patent Document 3): cart-pole swing-up and balancing, and unicycle balancing. The simulation dynamics were set to be the same, and the other aspects were kept similar to the original PILCO. FIGS. 7A, 7B, 8, and 9 illustrate the experimental results.

The optimizer was run for 600 policy evaluations between each trial. The SGD learning rate and momentum parameters were α = 5×10^-4 and γ = 0.9. The episode lengths were 3 s for the cart-pole and 2 s for the unicycle. Note that for the unicycle task 2 s is not sufficient for the policy to generalize to long trials, but a comparison with PILCO is still possible. The control frequency was 10 Hz. The costs were of the type 1 - exp(-(x - t)^T Q (x - t)), where t is the target. The output of the policy π(x) was constrained by the saturation function sat(u) = 9 sin(u)/8 + sin(3u)/8, where u = π̃(x). One experiment consisted of (1; 5) random trials followed by (15; 30) learned trials for the cart-pole and unicycle tasks, respectively. Each experiment was repeated 100 times and the results were averaged. Each trial was evaluated by running the policy 30 times and averaging; note that this was done for evaluation purposes only (the algorithm had access to only one trial). Success was determined by whether the return of the final trial was below a threshold.
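The cost and the saturation function given above are simple enough to write out directly. The sketch below uses plain lists for vectors and a nested list for Q; the function names are hypothetical.

```python
import math


def sat(u):
    """Saturation sat(u) = 9 sin(u)/8 + sin(3u)/8, which stays within [-1, 1]."""
    return 9.0 * math.sin(u) / 8.0 + math.sin(3.0 * u) / 8.0


def saturating_cost(x, target, q):
    """Cost 1 - exp(-(x - t)^T Q (x - t)) for list vectors and a nested-list Q."""
    d = [xi - ti for xi, ti in zip(x, target)]
    n = len(d)
    quad = sum(d[i] * q[i][j] * d[j] for i in range(n) for j in range(n))
    return 1.0 - math.exp(-quad)
```

Using sin(3u) = 3 sin(u) - 4 sin(u)^3, the saturation reduces to (3s - s^3)/2 with s = sin(u), which is monotone on [-1, 1] and reaches exactly ±1 at s = ±1. The cost is 0 at the target and approaches 1 far from it.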

(Cart-pole swing-up and balancing)
This is a standard control theory benchmark problem. The task consists of pushing a cart back and forth to swing up the pendulum attached to it and balance it in the upright position. The state space is represented as x = [s, β, ds/dt, dβ/dt], where s is the cart position and β is the pole angle. The base noise levels are σ_s = 0.01 m, σ_β = 1 deg, σ_{ds/dt} = 0.1 m/s, and σ_{dβ/dt} = 10 deg/s. In the different experiments the noise is modified by a multiplier k: σ^2 = k σ_base^2. The original paper considered direct access to the true state. To obtain a similar setting, k = 10^-2 was used, but k ∈ {1, 4, 9, 16} was also tested. The policy π̃ is a radial basis function network (a sum of Gaussians) with 50 basis functions. Two cost functions are considered. One is the same as in the original PILCO: x includes the sine and cosine of the angle, and the cost depends on the distance between the tip of the pendulum and the position of the tip when the pendulum is balanced (Tip Cost). The other cost uses the raw angle, with Q = diag([1, 1, 0, 0]) (Angle Cost). This cost is conceptually different from the Tip Cost because there is only one correct direction in which to swing up the pendulum.

(Unicycle balancing)
The task consists of balancing a unicycle robot, with state dimension D = 12 and control dimension F = 2. The noise was set to a low value. The controller π̃ is linear.

(Learning experiments)
PILCO performs well in scenarios with no noise, but its results deteriorate when noise is added. This deterioration is most likely caused by an accumulation of errors in the MM approximation, and was previously observed by Vinogradska et al. (2016), who used quadrature for the predictions. Particles do not suffer from this problem, and using TP gradients consistently outperforms PILCO in the high-noise conditions.

At low noise levels, on the other hand, the performance of TP and LR deteriorates. When all of the particles are sampled from a small region, it becomes difficult to estimate the gradient from changes in the return (in the limit of a delta distribution, the LR gradient cannot even be evaluated). The TP gradient suffers less from this problem because it incorporates information from RP. Finally, when the prediction uncertainty is very low (for example k = 10^-2), the model noise can be treated as a parameter that affects learning and increased to obtain more accurate gradients. See TP+σ_n, for which the model noise variance was multiplied by 100.

Notably, the approaches that use MM, such as PILCO and GR, outperform the others when the Tip Cost is used. A possible reason is the multimodality of the objective: with the Tip Cost, the pendulum can be swung up from either direction to solve the task, whereas with the Angle Cost there is only one correct direction. Performing MM forces the algorithm to follow a unimodal path, whereas the particle approach may attempt a bimodal swing-up in which some particles arrive from one side and the rest from the other. MM may therefore be performing a kind of "distributional reward shaping" that simplifies the optimization problem. Such an explanation was previously given by Gal et al. (2016).

Finally, we point out the surprising TP-σ_f experiment. Although the predictions ignore model uncertainty, the method achieves a 93% success rate. It is difficult to explain why learning worked so well, but we hypothesize that the success may be related to the zero prior mean of the GP. In regions with no data, the mean of the GP dynamics model tends to zero, which means that the input control signals have no effect on the particles. For policy optimization to succeed, the particles therefore have to be controlled so that they stay in regions where data exist. Similar results were found by Chatzilygeroudis et al. (2017), who used an evolutionary algorithm and achieved an 85 to 90% success rate on the cart-pole task even when ignoring model uncertainty.

Most machine learning problems involve optimizing the expectation of an objective function J(x; θ) over some data-generating distribution p_Data(x), which is accessible only through sample data points {x_i}. The predictive framework of the present invention is analogous to a deep model: p(x_0) is the data-generating distribution, and p(x_t; θ) is obtained by passing p_Data(x) through the model layers. The most common optimization method is SGD using pathwise derivatives computed by backpropagation. The results of the present invention suggest that in some situations, particularly with very deep or recurrent models, this approach may degenerate into a random walk because of an explosion of the gradient variance.

Exploding gradients have long been observed in deep learning research (Doya, 1993; Bengio et al., 1994). The phenomenon is usually regarded as a numerical problem that leads to large steps and unstable learning. Common countermeasures include gradient clipping, ReLU activation functions (Nair & Hinton, 2010), and smart initialization. The present invention explains the problem differently: it is not just that the gradient becomes large; the gradient variance explodes, which means that any sample x_i ~ p_Data gives essentially no information about how to change the model parameters θ so as to increase the expectation of the objective over the whole distribution, E_{p_Data}[J(x)]. Choosing a good initialization is one way to address the problem, but it appears difficult to guarantee that the system will not become chaotic during learning. In econometrics, for example, there are even cases in which the optimal policy leads to chaotic dynamics (Deneckere & Pelikan, 1986). Gradient clipping can stop large parameter steps, but it does not fundamentally solve the problem once the gradient has become random. Given that chaos does not occur in linear systems (Alligood et al., 1996), this analysis suggests why piecewise linear activations such as ReLU, which are less susceptible to chaos, work well in deep learning.

This deep learning hypothesis still has to be verified computationally, and several studies have investigated chaos in neural networks (Kolen & Pollack, 1991; Sompolinsky et al., 1988); nevertheless, we believe the present invention is the first to suggest that chaos can cause gradients to degenerate when they are computed by backpropagation. Notably, Poole et al. (2016) suggested that such properties lead to "exponential expressivity", but we believe that the phenomenon may instead be a curse.

(Conclusion and future work)
The limitations of optimizing expectations using pathwise derivatives, such as those computed by backpropagation, have been explained. In addition, it has been shown how to counteract this curse by injecting noise into the computation and using the likelihood ratio trick. The total propagation algorithm of the present invention provides an efficient way to combine reparameterization gradients on arbitrary stochastic computation graphs with any number of other gradient estimators (even gradients computed using a value function can be used). There are countless ways to extend this work, such as better optimization and the incorporation of natural gradients. The flexible nature of the method of the present invention should make these extensions easy.

(Embodiment 2)
A definition of a probabilistic computation graph (PCG) is provided. Note that the concept of a PCG differs from the concept of the computation graph used to describe the total propagation algorithm; it instead provides a framework for reasoning about gradient estimators. The definition is exactly equivalent to the standard definition of a directed graphical model, but it is tailored to the method of the present invention and emphasizes the interest in computing gradients rather than performing inference. The main difference is the explicit inclusion of the distribution parameters ζ, for example the mean μ and covariance Σ of a Gaussian.

Definition 1 (Probabilistic computation graph (PCG))
An acyclic graph with nodes/vertices V and edges E that satisfies the following properties:
1. Each node i ∈ V corresponds to a collection of random variables with marginal joint probability density p(x_i; ζ_i), where ζ_i are the possibly infinite parameters of the distribution. Note that the parameterization is not unique, and any parameterization is acceptable.
2. The probability density at each node depends conditionally on the parent nodes, p(x_i | Pa_i), where Pa_i are the random variables at the direct parents of node i.
3. The joint probability density satisfies p(x_1, ..., x_n) = Π_{i=1}^{n} p(x_i | Pa_i).
4. Each ζ_i is a function of its parents, ζ_i = f(Pz_i), where Pz_i are the distribution parameters at the parents of node i. In particular, p(x_i; ζ_i) = ∫ p(x_i | Pa_i) p(Pa_i; Pz_i) dPa_i.

We emphasize that nothing in this formulation is stochastic. Each computation may be analytically intractable, but it is deterministic. We also emphasize that the definition does not exclude deterministic nodes; that is, the distribution at a node may be a Dirac delta distribution (a point mass). This formulation is used later to derive stochastic estimates of the gradient.

(Derivation of the theorem)
The quantity of interest is the total derivative dζ_i/dζ_j of the distribution parameters ζ_i at one node with respect to the parameters ζ_j at another node. Iterating the total derivative rule yields a sum over the paths from node j to node i, as follows.

This equation holds for any deterministic computation graph and is well known, for example, in the OJA community. It leads trivially to the total stochastic gradient theorem of the present invention, which states that the sum over the paths from A to B can be written as a sum over the paths from A to the intermediate nodes and from the intermediate nodes to B. FIGS. 10A and 10B illustrate examples of the paths in Equation 11.
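The path-sum form of the total derivative can be illustrated on a small scalar graph. The sketch below assumes each edge carries a single number (the partial derivative of the child with respect to the parent); the graph `EDGES` and the function name are hypothetical and chosen only so that the three paths are easy to enumerate by hand.

```python
def total_derivative(edges, src, dst):
    """Sum over all paths src -> dst of the product of edge partial derivatives.

    `edges` maps a node to {child: d child / d node}. The graph must be acyclic.
    """
    if src == dst:
        return 1.0
    return sum(w * total_derivative(edges, child, dst)
               for child, w in edges.get(src, {}).items())


# Hypothetical scalar graph with edges j -> a, j -> b, a -> i, a -> b, b -> i.
EDGES = {"j": {"a": 2.0, "b": 3.0}, "a": {"i": 5.0, "b": 7.0}, "b": {"i": 11.0}}
```

Here the paths from j to i are j-a-i (2·5), j-b-i (3·11), and j-a-b-i (2·7·11), so `total_derivative(EDGES, "j", "i")` evaluates their sum, 197.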

Theorem 1 (Total stochastic gradient theorem)
Let i and j be distinct nodes in a probabilistic computation graph, and let IN be any set of intermediate nodes that blocks the paths from j to i; that is, IN is such that there is no path from j to i that does not pass through a node in IN. Denote by {a→b} the set of paths from a to b, and by {a→b}/c the set of paths from a to b in which no node along the path, except for b, is allowed to be in the set c. Then the total derivative dζ_i/dζ_j can be written with the following equation.

Equations 10 and 11 can be combined to give the following.

Note that an analogous theorem can be derived by swapping r ∈ {j→m}/IN and s ∈ {j→m}/IN, respectively. This leads to the following equation.

Equations 12 and 13 are referred to as the second-half and first-half total gradient equations, respectively.
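The decomposition stated in the theorem can be checked numerically on a small scalar graph. This is a sketch under assumptions: edge weights are scalar partial derivatives, the graph `EDGES` is hypothetical, and the split shown restricts the paths from j to each intermediate node m (no earlier node of IN allowed) while leaving the paths from m to i unrestricted.

```python
def path_sum(edges, src, dst, blocked=frozenset()):
    """Sum over paths src -> dst of products of edge derivatives.

    Paths may not pass through a node in `blocked`; dst itself is always allowed.
    """
    if src == dst:
        return 1.0
    total = 0.0
    for child, w in edges.get(src, {}).items():
        if child in blocked and child != dst:
            continue  # this path would cross another intermediate node first
        total += w * path_sum(edges, child, dst, blocked)
    return total


def split_at_intermediate_nodes(edges, j, i, intermediate):
    """Sum over m in IN of (paths j -> m avoiding IN) times (all paths m -> i)."""
    block = frozenset(intermediate)
    return sum(path_sum(edges, j, m, block) * path_sum(edges, m, i) for m in intermediate)


# Hypothetical scalar graph; IN = {a, b} blocks every path from j to i.
EDGES = {"j": {"a": 2.0, "b": 3.0}, "a": {"i": 5.0, "b": 7.0}, "b": {"i": 11.0}}
```

For this graph the restricted sums are 2 (j to a) and 3 (j to b), the unrestricted sums are 82 (a to i) and 11 (b to i), and 2·82 + 3·11 equals the full path sum. Each path is counted once, at the first node of IN it reaches.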

(Gradient estimation on a graph)
The previous section provided a means to decompose the gradient computation on the whole graph into gradient computations on smaller graphs, together with methods for estimating gradients on subgraphs. Here it is shown how the gradients on subgraphs can be combined into an estimator for the gradient on the whole graph. The task is to estimate the derivative of the expectation at a distal node i with respect to the parameters at node j: d/dζ_j E_{x_i ~ p(x_i; ζ_i)}[x_i]. Because the true ζ are intractable, a sampling-based estimate is performed. Consider sampling sub-distributions of p(x; ζ); that is, sample ζ̂ such that p(x; ζ) = ∫ p(x; ζ̂) p(ζ̂) dζ̂. This can be written as follows.

Such ζ̂ arise naturally in an ancestral sampling procedure. For simplicity of explanation, it is further assumed that the sampling is reparameterizable, that is, p(ζ̂_m; ζ_j) = f(ζ̂_m; ζ_j, z_m) p(z_m). This can be written as follows.

The term dζ̂_m/dζ_j is estimated with a pathwise derivative estimator. The remaining term d/dζ̂_m E_{x_i ~ p(x_i; ζ̂_i)}[x_i] may be estimated with any other estimator; for example, a jump estimator may be used. If this second estimator is also unbiased, the whole estimator is unbiased.

In summary, the procedure for constructing a gradient estimator from j to i over the whole graph is as follows:
1. Select a set of intermediate nodes IN that block the paths from j to i.
2. Construct a pathwise derivative estimator from j to the intermediate nodes IN.
3. Construct a total derivative estimator from IN to i, and apply the chain rule from i back to j.
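The three steps above can be sketched on a toy chain. This is a minimal illustration, not the patent's implementation: the chain θ → ζ̂_m → x_i, the quadratic cost, the noise scales, the analytic baseline, and the name `estimate_gradient` are all assumptions, chosen so that the exact answer (4θ³) is known.

```python
import random


def estimate_gradient(theta, n_particles=100_000, sigma=0.5, s=1.0, seed=0):
    """Estimate d/dtheta E[x_i^2] for the toy chain theta -> zeta_hat_m -> x_i.

    zeta_hat_m = theta^2 + sigma * z   (reparameterized intermediate node)
    x_i ~ N(zeta_hat_m, s^2)           (distal node)
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_particles):
        # Step 1: the intermediate node IN is the sampled mean zeta_hat_m.
        z = rng.gauss(0.0, 1.0)
        zeta_hat = theta ** 2 + sigma * z

        # Step 2: pathwise derivative d zeta_hat_m / d theta.
        dzeta_dtheta = 2.0 * theta

        # Step 3: LR jump estimator of d/d zeta_hat_m E[cost(x_i)].
        x = rng.gauss(zeta_hat, s)
        cost = x ** 2
        # Baseline (assumption): E[cost | zeta_hat]; it depends only on
        # zeta_hat, so it reduces variance without adding bias.
        baseline = zeta_hat ** 2 + s ** 2
        score = (x - zeta_hat) / s ** 2  # d log N(x; zeta_hat, s^2) / d zeta_hat
        jump = (cost - baseline) * score

        # Chain rule from i back to j.
        total += dzeta_dtheta * jump
    return total / n_particles
```

With `theta = 1.0` the estimate should lie close to the exact gradient 4θ³ = 4, up to Monte Carlo error.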

(Relationship to the policy gradient theorems)
In a typical model-free RL problem, an agent performs actions $u \sim \pi(u_t \mid x_t; \theta)$ according to a stochastic policy $\pi$, transitions through states $x_t$, and incurs costs $c_t$ (or, conversely, receives rewards). The agent's goal is to find the policy parameters $\theta$ that optimize the expected return $G = \sum_{t=0}^{H} c_t$ of each episode. FIGS. 11A and 11B illustrate the probabilistic computation graphs for model-based and model-free LR gradient estimation. In the literature, two "gradient theorems" are commonly applied: the policy gradient theorem and the deterministic policy gradient theorem.

$\hat Q_t$ corresponds to an estimator of the remaining return $\sum_{h=t}^{H-1} c_{h+1}$ from a particular state $x$ when action $u$ is chosen. For Equation 16, any estimator is acceptable; even a sampling-based estimate can be used. For Equation 17, $\hat Q$ is usually a differentiable surrogate model. Importantly, for the above equations to be valid, $\hat Q$ must be an estimator and not the true $Q$. That is, when estimating the gradient, one must assume that the policy parameters are changed only for the current time step and remain fixed for subsequent time steps. FIG. 11A shows how these two theorems correspond to the same probabilistic computation graph. The intermediate nodes are the actions selected at each time step. The two theorems differ in the choice of jump estimator used to estimate the total derivative following the intermediate nodes: the policy gradient theorem uses the LR gradient, whereas the deterministic policy gradient theorem uses the pathwise derivative with respect to a surrogate model.
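For orientation, the textbook forms of the two theorems are given below. It is an assumption that Equations 16 and 17 correspond to these forms; the summation limits are chosen to agree with the remaining return $\sum_{h=t}^{H-1} c_{h+1}$ defined in the text, and $du_t/d\theta$ denotes the derivative of the policy output with the state $x_t$ held fixed.

```latex
\begin{align}
\frac{d}{d\theta}\,\mathbb{E}[G]
  &= \mathbb{E}\!\left[\sum_{t=0}^{H-1}
     \frac{d \log \pi(u_t \mid x_t;\theta)}{d\theta}\,
     \hat{Q}_t(x_t,u_t)\right]
  && \text{(policy gradient theorem)} \\
\frac{d}{d\theta}\,\mathbb{E}[G]
  &= \mathbb{E}\!\left[\sum_{t=0}^{H-1}
     \frac{d u_t}{d\theta}\,
     \frac{d \hat{Q}_t(x_t,u_t)}{d u_t}\right]
  && \text{(deterministic policy gradient theorem)}
\end{align}
```

In both cases the action $u_t$ is the intermediate node: the first form jumps from it with an LR estimator, the second differentiates through the surrogate $\hat Q_t$.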

(Novel algorithms)
Typically, when estimating gradients on a PCG, one obtains a sample by performing ancestral sampling through the whole graph; for an RL problem, for example, one samples a trajectory. Such a sample is called a particle. A batch of such samples can be used to obtain a gradient estimator. The estimated distribution parameters at a node are given by the set of distribution parameters for the sampled particles, $\hat\zeta = \{\hat\zeta_i\}_i^P$, where $P$ is the number of particles. For example, if the PCG consists of sequential sampling from Gaussian distributions, $\hat\zeta_i$ corresponds to the mean and covariance of the Gaussian from which the particle was sampled at that node. In the following sections, we exploit the option of using the set of particles to directly estimate a different set of distribution parameters $\Gamma$ for the marginal distribution.

(Density estimation LR (DEL))
One can attempt to estimate the distribution parameters $\Gamma$ from the set of sampled particles and apply the LR gradient using the estimated distribution $\hat\zeta$. In particular, we approximate the density as a Gaussian by estimating the mean $\hat\mu = \sum_i^P x_i / P$ and the variance $\hat\Sigma = \sum_i^P (x_i - \hat\mu)^2 / (P - 1)$. The standard LR trick can then be used to estimate the gradient $\sum_i^P \frac{d \log q(x_i)}{d\theta}(G_i - b)$, where $q(x) = N(\hat\mu, \hat\Sigma)$. To use this method, one must compute the derivatives of $\hat\mu$ and $\hat\Sigma$ with respect to the particles $x_i$ and carry the gradient back to the policy parameters with the chain rule, which is straightforward. We call this new method the DEL estimator. Importantly, $q(x)$ is used to estimate the gradient but is not used in any way to modify the trajectory sampling. This contrasts with Gaussian resampling, where particles are resampled from the fitted Gaussian, which modifies the trajectory distribution.
Advantage of DEL: the LR gradient can be used without injecting noise into the computation.
Disadvantage of DEL: the estimator is biased, and density estimation can be difficult.
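A minimal sketch of a DEL-style estimate for scalar particles follows. It is an illustration under assumptions rather than the patent's implementation: the function name `del_gradient`, the mean baseline, averaging over particles instead of summing, and the requirement that the caller supplies the pathwise derivatives `dxs_dtheta` are all choices made here.

```python
def del_gradient(xs, dxs_dtheta, returns):
    """DEL-style gradient estimate for scalar particles.

    xs          -- particle values x_i at the node
    dxs_dtheta  -- pathwise derivatives dx_i/dtheta of each particle
    returns     -- return G_i associated with each particle
    """
    P = len(xs)
    # Fit a Gaussian q(x) = N(mu, var) to the particles.
    mu = sum(xs) / P
    var = sum((x - mu) ** 2 for x in xs) / (P - 1)

    # Chain rule: derivatives of the fitted parameters w.r.t. theta.
    dmu = sum(dxs_dtheta) / P
    dvar = 2.0 * sum((x - mu) * (dx - dmu)
                     for x, dx in zip(xs, dxs_dtheta)) / (P - 1)

    b = sum(returns) / P  # simple mean baseline (assumption)

    grad = 0.0
    for x, G in zip(xs, returns):
        # Score of q with respect to its parameters, particle held fixed.
        dlogq_dmu = (x - mu) / var
        dlogq_dvar = ((x - mu) ** 2 - var) / (2.0 * var ** 2)
        grad += (G - b) * (dlogq_dmu * dmu + dlogq_dvar * dvar)
    return grad / P
```

For particles of the form x_i = θ + σε_i with G_i = x_i², the estimate should approach 2θ as the number of particles grows, even though no noise is injected after the initial sample.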

(Gaussian shaping gradient (GS))
Until now, all RL methods have used the second half of the summation gradient equation (Equation 12). Can an estimator be created that uses the first half of the equation (Equation 13)? FIG. 13 illustrates the computational paths in the Gaussian shaping gradient and gives an example of how this can be done. We propose to estimate the density at $x_m$ by fitting a Gaussian to the particles. Then $dE[c_m]/d\Gamma_m$ (gray edge) is estimated by resampling particles from this distribution (or by any other method of integration). This leaves the question of how to estimate $d\Gamma_m/d\theta$ (dotted and bold edges). Using the RP method is straightforward. To use the LR method, first apply the second half of the summation gradient equation to $d\Gamma_m/d\theta$ to obtain the terms $\sum_{r \in \{\theta \to k\}/IN} \prod_{(p,t) \in r} \partial\zeta_t/\partial\zeta_p$ (dotted edges) and $d\Gamma_m/d\zeta_k$ (bold edges). In the scenario considered, the first of these terms is a single path and is estimated using RP. The second term is more interesting, and we estimate it using the LR method. Because we use a Gaussian approximation, the distribution parameters $\Gamma_m$ are the mean and variance of $x_m$, which can be estimated as $\mu_m = E[x_m]$ and $\Sigma_m = E[x_m x_m^T] - \mu_m \mu_m^T$. The LR gradient estimators for these terms can be obtained as follows.

In practice, we make a sampling-based estimate $\hat\zeta_k$, and one may be concerned that the estimator is conditional on the sample $\hat\zeta_k$, whereas the quantity of interest is the unconditional estimate. We explain why the conditional estimate is equivalent. For the variance, note that because $\mu_m$ is an estimate of the unconditional mean, the whole estimate corresponds directly to an estimate of the unconditional variance. For the mean, we apply the law of iterated expectations as follows.

This shows that the conditional gradient estimator is an unbiased estimator of the gradient of the unconditional mean.
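The LR estimators of the mean and variance gradients can be illustrated numerically. The sketch below assumes the common form in which a baseline-subtracted statistic is multiplied by the score $d\log p(x_k; \zeta_k)/d\zeta_k$; the linear-Gaussian chain, the use of the sample mean and variance as baselines, and the name `lr_moment_gradients` are assumptions made for the example.

```python
import random


def lr_moment_gradients(zeta_k, a=2.0, s_k=1.0, s_m=0.5, n=100_000, seed=0):
    """LR estimates of d mu_m / d zeta_k and d Sigma_m / d zeta_k.

    Toy chain (assumption): x_k ~ N(zeta_k, s_k^2), x_m = a * x_k + noise.
    Exact values: d mu_m / d zeta_k = a and d Sigma_m / d zeta_k = 0.
    """
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        x_k = rng.gauss(zeta_k, s_k)
        x_m = a * x_k + rng.gauss(0.0, s_m)
        score = (x_k - zeta_k) / s_k ** 2  # d log p(x_k; zeta_k) / d zeta_k
        samples.append((x_m, score))

    # Sample moments of x_m, also used as baselines b_mu and b_Sigma.
    mu_m = sum(x for x, _ in samples) / n
    var_m = sum((x - mu_m) ** 2 for x, _ in samples) / (n - 1)

    d_mu = sum((x - mu_m) * sc for x, sc in samples) / n
    d_var = sum(((x - mu_m) ** 2 - var_m) * sc for x, sc in samples) / n
    return d_mu, d_var
```

Each particle contributes a gradient conditioned on its own sampled path, yet the average targets the gradient of the unconditional moments, which is the point made by the iterated-expectation argument.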

(Efficient algorithm for accumulating gradients)
As a concrete example, consider a model-based policy gradient method whose PCG is given in FIG. 13. This was the algorithm considered first in our previous work, and it relies critically on access to a differentiable probabilistic model of the dynamics. We explain how the GS gradient is applied in this setting. For each $x_k$ node, we want to perform an LR jump to every $x_m$ node after $k$ and compute the gradient through the Gaussian approximation of the distribution at node $m$. We accumulate over all nodes during a backward pass, in a backpropagation-like manner. For each $k$ and each path, the gradient can be written as $\frac{dE[c_m]}{d\Gamma_m}\frac{d\Gamma_m}{d\zeta_k}\left(\frac{d\zeta_k}{du_{k-1}}\frac{du_{k-1}}{d\theta}\right)$. The term $\frac{dE[c_m]}{d\Gamma_m}\frac{d\Gamma_m}{d\zeta_k}$ is estimated as $\frac{dE[c_m]}{d\Gamma_m} z_m \frac{d\log p(x_k; \zeta_k)}{d\zeta_k}$, where $z_m$ is a vector summarizing the terms above, such as $x_m - b_\mu$. Note that $\frac{dE[c_m]}{d\Gamma_m} z_m$ is just a scalar quantity $g_m$. We therefore use an algorithm that accumulates the sum of all $g$ during the backward pass, summing over all $m$ nodes at each $k$ node. FIG. 12 shows Algorithm 3, which details how this fits together with summation propagation. The final algorithm essentially just replaces the usual costs/rewards with modified values, and the approach is therefore also applicable to model-free policy gradient algorithms that use a stochastic policy and LR gradients.
Two interpretations of GS:
1. A Gaussian approximation of the marginal distribution is made at a node.
2. A type of reward shaping is performed based on the distribution of the particles. In particular, it essentially promotes keeping the trajectory distribution unimodal, so that all particles concentrate on one "island" of reward rather than splitting among multiple reward regions; this may simplify the optimization.
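The accumulation over the backward pass can be sketched as a running sum. This is a simplified illustration, not Algorithm 3 itself: it treats each score as a scalar, assumes the sum at node k runs over all m ≥ k, and the name `accumulate_lr_gradients` is hypothetical.

```python
def accumulate_lr_gradients(scores, g):
    """Per-node LR gradient terms for one particle, computed in one backward pass.

    scores[k] -- d log p(x_k; zeta_k) / d zeta_k (scalar here for simplicity)
    g[m]      -- scalar g_m = dE[c_m]/dGamma_m * z_m for node m
    Returns a list whose k-th entry is scores[k] * sum(g[m] for m >= k).
    """
    if len(scores) != len(g):
        raise ValueError("scores and g must have the same length")
    grads = [0.0] * len(g)
    running = 0.0
    # Walk backward so each g_m is added exactly once: O(H) instead of O(H^2).
    for k in reversed(range(len(g))):
        running += g[k]
        grads[k] = running * scores[k]
    return grads
```

The running sum plays the role that the reward-to-go plays in an ordinary LR policy gradient, which is why the method amounts to replacing the usual costs with modified values.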

(Experiments)
We performed model-based RL simulation experiments following the PILCO paper. To test our GS method and its combination with summation propagation, we used the cart-pole swing-up and balance task. In addition, to demonstrate the feasibility of the idea, we tested the DEL method on a simpler cart-pole balance-only task. We compared particle-based gradients using our new estimators against PILCO. In our previous work, the cost function had to be changed to obtain reliable results with particles; one of the main motivations for the current experiments is to match PILCO's results using the same cost that the original PILCO used (explained in more detail below).

(Background on model-based policy search)
Consider a model-based analogue of model-free policy search methods. The corresponding probabilistic computation graph is given in FIG. 11B. The notation follows our previous work. After each episode, all of the data is used to learn a separate Gaussian process model for each dimension of the dynamics, such that $p(\Delta x_{t+1}^a) = \mathcal{GP}(\tilde x_t)$, where $\tilde x_t = [x_t^T, u_t^T]$, $x \in R^D$, and $u \in R^F$. This model is then used to run "mental simulations" between episodes in order to optimize the policy by gradient descent. We used the squared exponential covariance function $k_a(\tilde x, \tilde x') = s_a^2 \exp(-(\tilde x - \tilde x')^T \Lambda_a^{-1} (\tilde x - \tilde x'))$ and a Gaussian likelihood function with noise hyperparameter $\sigma_n^2$. The hyperparameters $\{s, \Lambda, \sigma_n\}$ are trained by maximizing the marginal likelihood. Predictions take the form $p(x_{t+1}^a) = N(\mu(\tilde x_t), \sigma_f^2(\tilde x_t) + \sigma_n^2)$, where $\sigma_f^2(\tilde x_t)$ is the uncertainty about the model, which depends on the availability of data in that region of the state space. In FIG. 11B, the partial derivatives from $\theta$ to the intermediate nodes are estimated with pathwise derivatives, and the total derivatives following the intermediate nodes are estimated with jump estimators.
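The covariance function quoted in this passage can be written out directly. The sketch below follows the formula as stated in the text (with no factor of 1/2 in the exponent) and assumes a diagonal $\Lambda_a$, passed as its inverse diagonal; the function and argument names are hypothetical.

```python
import math


def se_kernel(x, x_prime, signal_std, inv_lengthscales):
    """Squared exponential covariance for one output dimension a.

    Computes s_a^2 * exp(-(x - x')^T Lambda_a^{-1} (x - x')) for a diagonal
    Lambda_a, where inv_lengthscales holds the diagonal of Lambda_a^{-1}.
    """
    if not (len(x) == len(x_prime) == len(inv_lengthscales)):
        raise ValueError("inputs must have the same length")
    quad = sum(w * (a - b) ** 2
               for a, b, w in zip(x, x_prime, inv_lengthscales))
    return signal_std ** 2 * math.exp(-quad)
```

Here `x` and `x_prime` would be concatenated state-action vectors; the kernel equals `signal_std ** 2` when the two inputs coincide and decays toward zero as they move apart.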

(Setup)
The cart-pole consists of a cart that can be pushed back and forth and a pole attached to it. The state space is $[s, \beta, ds/dt, d\beta/dt]$, where $s$ is the cart position and $\beta$ is the angle. The control is a horizontal force on the cart. The dynamics were as in the PILCO paper. The setup follows our previous work.

(Common properties of the tasks)
Each experiment consists of 1 random episode followed by 15 episodes with a learned policy, with the policy optimized between episodes. Each episode lasts 3 s, and the control frequency is 10 Hz. Each task was evaluated separately 100 times with different random seeds to test reproducibility; the seeds were shared across algorithms. Each episode was evaluated 30 times and the costs were averaged, but this was done for evaluation purposes only: the algorithm had access to just 1 episode. The policy was optimized with the RMSprop-like learning rule from our previous work, which normalizes the gradient using the sampling variance of the gradients from different particles. In model-based policy optimization, we performed 600 gradient steps using 300 particles per policy gradient evaluation. The learning rate and momentum parameters were $\alpha = 5 \times 10^{-4}$ and $\gamma = 0.9$, respectively, the same as in our previous work. The output of the policy is saturated by $\mathrm{sat}(u) = 9\sin(u)/8 + \sin(3u)/8$, where $u = \tilde\pi(x)$. The policy $\tilde\pi$ is a radial basis function network (a sum of Gaussians) with 50 basis functions and a total of 254 parameters. The cost functions are of the form $1 - \exp(-(x - t)^T Q (x - t))$, where $t$ is the target. We consider two types of cost function: 1) AngleCost, a cost in which $Q = \mathrm{diag}([1, 1, 0, 0])$ is a diagonal matrix; 2) TipCost, the cost from the original PILCO paper, which depends on the distance from the tip of the pendulum to the position of the tip when balanced. These cost functions are conceptually different: with TipCost the pendulum can be swung up from either direction, whereas with AngleCost there is only one correct direction. The base observation noise levels are $\sigma_s = 0.01$ m, $\sigma_\beta = 1$ deg, $\sigma_{ds/dt} = 0.1$ m/s, and $\sigma_{d\beta/dt} = 10$ deg/s, and these are modified by a multiplier $k \in \{10^{-2}, 1\}$ such that $\sigma^2 = k\sigma_{\mathrm{base}}^2$.
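The saturation function and the AngleCost form given in this passage are simple enough to write out. The sketch below uses the formulas as stated; the function names and the representation of the state as a plain 4-element sequence are assumptions.

```python
import math


def sat(u):
    """Saturating squashing function 9 sin(u)/8 + sin(3u)/8, bounded in [-1, 1]."""
    return 9.0 * math.sin(u) / 8.0 + math.sin(3.0 * u) / 8.0


def angle_cost(x, target, q_diag=(1.0, 1.0, 0.0, 0.0)):
    """Cost 1 - exp(-(x - t)^T Q (x - t)) with diagonal Q = diag(q_diag)."""
    quad = sum(q * (xi - ti) ** 2 for q, xi, ti in zip(q_diag, x, target))
    return 1.0 - math.exp(-quad)
```

Because sat(u) equals 1.5 sin(u) - 0.5 sin³(u), it reaches exactly ±1 at u = ±π/2 and never exceeds that range; the cost is 0 at the target and approaches 1 far from it, and with this Q it ignores the two velocity components of the state.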

(Cart-pole swing-up and balance)
In this task, the pendulum initially hangs downward and must be swung up and balanced. Some results were taken from our previous work: 1) PILCO; 2) reparameterization gradient (RP); 3) Gaussian resampling (GR); 4) batch importance-weighted LR with a batch importance-weighted baseline (LR); 5) summation propagation (TP), which combines LR and RP. These were compared with the new methods: 6) the Gaussian shaping gradient using only the LR component (GLR); 7) the Gaussian shaping gradient using summation propagation to combine both the LR and RP variants (GTP). For a description of the summation propagation algorithm, a method for efficiently combining multiple gradient estimators on a computational graph, see our previous work. We also tested GTP with the model noise variance multiplied by 25 (GTP+$\sigma_n$).

(Cart-pole balancing with the DEL estimator)
This task is much simpler: the pole starts upright and must be balanced. The experiment was devised to show that DEL is feasible and may be useful if developed further. AngleCost and the base noise level were used.

(Results)
FIGS. 14 and 15 show the experimental results. As in our previous work, methods that include an LR component do not work well when the noise is low. However, the GTP+$\sigma_n$ experiment shows that injecting noise into the model predictions can solve the problem. The main result is that GTP matches PILCO in the TipCost scenario. In our previous work, one concern was that TP did not match PILCO in this scenario. Looking only at the costs in FIGS. 15B and 15C does not adequately show the difference. In contrast, the success rates show that TP did not perform as well. The success rate was measured both with a threshold calibrated in our previous work (final loss below 15) and by visually classifying all experimental runs; the two methods agreed. The losses of the peak performers in the final episode were TP: 11.14 ± 1.73, GTP: 9.78 ± 0.40, PILCO: 9.10 ± 0.22, which again shows that TP was significantly worse. The remaining experiments had converged while the peak performers were still improving. PILCO still appears to be slightly more data-efficient, but because so little data is required, the difference has little practical significance. Note also that the variance of TP is smaller in FIG. 15B. The large variance of GTP and PILCO is caused by outliers with large losses. These outliers converged to local minima that exploit the tails of the Gaussian approximation of the state distribution, in contrast to earlier suggestions that PILCO uses the tails of the Gaussian approximation to explore.

(Embodiment 3)
The summation propagation algorithm is, like backpropagation, a general-purpose gradient estimation algorithm for computational graphs, but it overcomes the problem of exploding gradients. The key idea of the algorithm is to combine multiple gradient estimation methods during the backward pass of the gradient computation. Importantly, the multiple gradient estimates are aggregated into a smaller set of gradient estimators (for example, all gradient estimators are combined into a single best gradient estimate), and this small set, rather than all of the gradient estimators separately, is passed backward. In this way, many gradient estimation techniques can be combined to improve the accuracy of gradient estimation on a computational graph without a proliferation of the gradient estimators passed backward, which yields good computational efficiency.

(Description of the framework and algorithm)
A computational graph is a set of nodes (vertices) $V$ and directed edges $E$ that defines the computational relationships between the variables at the vertices. Each node $i$ takes the variables from its parent nodes $Pa_i$ as input and computes an output $x_i = f(Pa_i)$, where the function $f$ may also be stochastic. Because $Pa_i$ and $x_i$ represent sets of one or more variables, they may be vector-valued or tensor-valued. The variable $x_i$ is passed to the child nodes of node $i$, denoted $Ch_i$. FIG. 16 shows the general form of the algorithm, presented as Algorithm 4, whose key novelty is the combination comprising steps 5 and 6. Summation propagation resembles the backpropagation algorithm, which computes gradients over the whole graph by applying the chain rule and sending the computed gradients backward through the graph. Standard backpropagation is illustrated in FIG. 17. Summation propagation modifies this procedure, as shown in FIG. 18, by performing multiple gradient estimates at some nodes, combining the gradient estimators, and sending the combined estimator backward.

FIG. 17 illustrates the backpropagation algorithm, which is used in all neural network applications in machine learning as well as in many other applications. The summation propagation algorithm modifies this procedure by obtaining multiple estimates of $dL/dz_2$ using different gradient estimation techniques (for example, the reparameterization method and the likelihood ratio method), combining these estimates into a smaller set of gradient estimators, and passing them backward through the computational graph.

FIG. 18 illustrates the summation propagation algorithm in the case where gradient estimation is performed by combining the likelihood ratio and reparameterization gradient estimators into a single gradient estimator. This generalizes readily to combining three or more gradient estimators into fewer estimators than the total number and sending the combined gradient estimators backward.
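One combination rule, the one named in the claims, is a weighted average with weights proportional to the inverse of each estimator's variance. The sketch below shows that rule for two sets of per-particle scalar gradient estimates; estimating the variance of each mean from the particle spread, the small `eps` guard, and the name `combine_estimators` are assumptions, and this is not presented as the patent's exact procedure.

```python
from statistics import mean, variance


def combine_estimators(lr_samples, rp_samples, eps=1e-12):
    """Inverse-variance weighted combination of two gradient estimators.

    lr_samples, rp_samples -- per-particle scalar gradient estimates from the
    likelihood ratio and reparameterization methods, respectively.
    Returns (combined_estimate, weight_on_lr).
    """
    lr_mean, rp_mean = mean(lr_samples), mean(rp_samples)
    # Variance of each sample mean, estimated from the particle spread.
    lr_var = variance(lr_samples) / len(lr_samples)
    rp_var = variance(rp_samples) / len(rp_samples)
    # Weights proportional to inverse variance, normalized to sum to one.
    w_lr = 1.0 / (lr_var + eps)
    w_rp = 1.0 / (rp_var + eps)
    k_lr = w_lr / (w_lr + w_rp)
    return k_lr * lr_mean + (1.0 - k_lr) * rp_mean, k_lr
```

Only the single combined value would then be passed backward to the preceding node, so the number of estimators in flight does not grow with the depth of the graph.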

Claims

1. A computer-based gradient estimation method that includes a computational graph and estimates the gradient of one variable with respect to another variable in the computational graph, wherein the computer executes processing of:
performing two or more different estimates of the same gradient using different gradient estimators at some nodes in the computational graph;
combining the different estimates so that their number is smaller than the initial number of estimates; and
passing the combined estimates to different nodes in the computational graph,
wherein the gradient estimate is used for further calculations.

2. The gradient estimation method according to claim 1, wherein the different estimates of the gradient are combined based on a weighted average, the weights of the weighted average being calculated based on an explicit or implicit estimate of the variance of gradient estimates of some variables in the computational graph with respect to some other variables in the computational graph.

3. The gradient estimation method according to claim 2, wherein the weights are set in proportion to the magnitude of the reciprocal of the variance.

4. The gradient estimation method according to any one of claims 1 to 3, wherein the gradient estimators are a likelihood ratio gradient estimator and a reparameterization gradient estimator.

5. The gradient estimation method according to any one of claims 1 to 4, wherein the gradient is used for optimizing parameters in the computational graph.

6. The gradient estimation method according to any one of claims 1 to 5, wherein the combined estimate is passed to a preceding node in the computational graph.

7. A computer-based gradient estimation method that includes a computational graph and estimates the gradient of a variable with respect to a variable in the computational graph, wherein the computer executes processing of:
estimating the gradient of an objective function with both the likelihood ratio method and the reparameterization method at some nodes in the computational graph; and
optimizing parameters in the computational graph using both estimators.

8. The gradient estimation method according to claim 7, wherein the likelihood ratio and reparameterization gradient estimators are combined based on a weighted average, the weights being proportional to the inverse of the variance of the respective gradient estimators.

9. The gradient estimation method according to claim 5, wherein the parameter optimization method is a gradient descent or gradient ascent optimization method.

10. The gradient estimation method according to any one of claims 1 to 6, wherein the further calculation is a further gradient estimation of some variables with respect to some other variables.

11. The gradient estimation method according to any one of claims 1 to 6, wherein the combination of gradient estimates is determined based on gradients from previous optimization steps.

12. A computer-based gradient estimation method that includes a computational graph and estimates the gradient of one variable with respect to another variable in the computational graph, wherein the computer executes processing of:
assuming, at some nodes in the computational graph, a parametric form of the probability density at the nodes;
estimating parameters of the probability density from sampled computations in the computational graph;
estimating the gradient of the expected value of a variable at a node that depends on the current variable, the expected value being taken over the entire estimated distribution;
further multiplying the gradient by some statistic at the node to obtain a scalar variable; and
obtaining a likelihood ratio gradient estimator using the scalar variable.

13. The gradient estimation method according to claim 12, wherein the parametric form of the probability distribution is a Gaussian distribution.

14. The gradient estimation method according to claim 12 or claim 13, wherein the order of multiplying the gradient by the statistic and obtaining the likelihood ratio gradient estimator is swapped, such that the likelihood ratio gradient estimation is performed before the multiplication by the gradient for the estimated parametric probability distribution.

15. A gradient estimation method performed by combining the gradient estimation method according to any one of claims 1 to 11 with the gradient estimation method according to claim 12, 13, or 14.

16. The gradient estimation method according to any one of claims 1 to 15, wherein the computational graph corresponds to a computational graph of policy search, reinforcement learning, machine learning, or a neural network.

17. An apparatus for carrying out the gradient estimation method according to any one of claims 1 to 16.

18. A computer program for executing the gradient estimation method according to any one of claims 1 to 16.

19. A policy search device for reinforcement learning, which:
computes states in a discrete-time system;
estimates, in a gradient backpropagation step running opposite to the direction in which state transitions occur according to the policy and dynamics, the gradient of the average total reward with respect to the policy parameters by combining the reparameterization method and the likelihood ratio method; and
updates the policy parameters according to the evaluation results.

20. The policy search device according to claim 19, which further sets weights for a weighted average based on the variance of the desired gradient with respect to the policy parameters.

21. The policy search device according to claim 20, wherein the weights assigned to the gradient estimators of the reparameterization method and the likelihood ratio method are set in proportion to the magnitude of the inverse of the variance of the respective gradient estimator.

22. A computer program that causes a computer to execute processing of:
computing states in a discrete-time system;
estimating, in a gradient backpropagation step running opposite to the direction in which state transitions occur according to the policy and dynamics, the gradient of the average total reward with respect to the policy parameters by combining the reparameterization method and the likelihood ratio method; and
updating the policy parameters according to the evaluation results.