JP5150371B2 - Controller, control method and control program

Info

Publication number: JP5150371B2
Application number: JP2008143586A
Authority: JP
Inventors: Tetsuro Morimura; Eiji Uchibe; Junichiro Yoshimoto; Kenji Doya
Original assignee: Okinawa Institute of Science and Technology Graduate University
Current assignee: Okinawa Institute of Science and Technology Graduate University
Priority date: 2008-05-30
Filing date: 2008-05-30
Publication date: 2013-02-20
Anticipated expiration: 2028-05-30
Also published as: JP2009289199A

Description

The present invention relates to the configuration of a controller, a control method, and a control program for controlling a controlled object by a policy gradient method.

Control problems formulated as a “Markov decision process” are important because they have a wide range of applications as autonomous control problems for robots, plants, and mobile machines (trains, automobiles), among others.

So-called “reinforcement learning” is a conventional technique for optimal control of a Markov decision process.

“Reinforcement learning” is a theoretical framework in which an agent learns, by trial and error through interaction with its environment, a behavioral rule called a “policy” (a “control rule” when applied to a control problem) that maximizes the cumulative reward obtained. This learning method has attracted attention from various fields because it requires almost no a priori knowledge about the environment or the agent itself.

Reinforcement learning can be roughly classified into two types. The first is the “value function update method”, which expresses the policy indirectly through a value function, so that updating the value function also updates the policy. The second is the “direct policy update method (policy gradient method)”, which holds the policy explicitly and updates it according to the gradient of an objective function.

The policy gradient method has attracted particular attention because it can acquire a stochastic policy by including, among the policy parameters, parameters that control the randomness of actions, and because it applies readily to continuous systems. When applied to real tasks, however, the time needed to acquire an appropriate behavioral rule can become unrealistic. Research on shortening the learning time by adding auxiliary mechanisms, such as the simultaneous use of multiple learners, the use of models, and the use of teaching signals, has therefore been active and has produced remarkable results.

Here, policy gradient reinforcement learning (PGRL) is a general reinforcement learning (RL) algorithm that improves the policy parameters so as to maximize the average reward, using the partial derivatives of the average reward with respect to the policy parameters. These partial derivatives of the average reward are called the policy gradient (PG).

That is, policy gradient reinforcement learning is a policy search method that takes as its objective function the time average of the rewards obtained when the agent interacts with the environment, and aims to acquire a policy (behavioral rule) that locally maximizes this objective function. It is realized by successively updating the policy parameters along the gradient of the objective function.
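
The objective and the update described above can be written compactly. The notation is an assumption introduced here for illustration: θ denotes the policy parameters, r_t the reward at time step t, η(θ) the average reward, and α > 0 a step size.

```latex
% Average reward (objective) and the plain policy gradient update.
% Symbols \theta, r_t, \eta, \alpha are illustrative notation, not taken from the source text.
\eta(\theta) = \lim_{T \to \infty} \frac{1}{T}\,
  \mathbb{E}\!\left[ \sum_{t=1}^{T} r_t \right],
\qquad
\theta \leftarrow \theta + \alpha \, \frac{\partial \eta(\theta)}{\partial \theta}
```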

As long as the policy is appropriately parameterized, the method can be implemented for a Markov decision process (MDP) without requiring knowledge about the agent or the environment. Moreover, because a stochastic policy can be acquired by including parameters that control the randomness of actions among the policy parameters, the parameterized policy can be optimized within its representable range even for a partially observable Markov decision process (POMDP), in which the best policy among those that do not use history may be stochastic [Non-Patent Document 3] to [Non-Patent Document 7].

On the other hand, depending on the task, the time needed to acquire a good policy can become enormous. Research has therefore aimed to shorten the learning time with auxiliary mechanisms such as the introduction of supplementary rewards [Non-Patent Document 8], the use of models [Non-Patent Document 9], and the use of teaching signals [Non-Patent Document 10]. These approaches, however, generally assume a specific task and require prior knowledge about it, so they lack generality. It is therefore desirable to improve the policy gradient algorithm itself without modifying the standard reinforcement learning framework, that is, in a way that does not depend on the task.

Prior art documents related to the policy gradient learning method that are cited in the text are listed below.
[Non-Patent Document 1] S. Amari: “Natural gradient works efficiently in learning”, Neural Computation, 10, 2, pp. 251-276 (1998).
[Non-Patent Document 2] S. Kakade: “A natural policy gradient”, Advances in Neural Information Processing Systems, Vol. 14, MIT Press (2002).
[Non-Patent Document 3] R. J. Williams: “Simple statistical gradient-following algorithms for connectionist reinforcement learning”, Machine Learning, 8, pp. 229-256 (1992).
[Non-Patent Document 4] D. P. Bertsekas and J. N. Tsitsiklis: “Neuro-Dynamic Programming”, Athena Scientific (1996).
[Non-Patent Document 5] H. Kimura and S. Kobayashi: “An analysis of actor/critic algorithms using eligibility traces: Reinforcement learning with imperfect value function”, International Conference on Machine Learning, pp. 278-286 (1998).
[Non-Patent Document 6] J. Baxter and P. Bartlett: “Infinite-horizon policy-gradient estimation”, Journal of Artificial Intelligence Research, 15, pp. 319-350 (2001).
[Non-Patent Document 7] J. Baxter, P. Bartlett and L. Weaver: “Experiments with infinite-horizon policy-gradient estimation”, Journal of Artificial Intelligence Research, 15, pp. 351-381 (2001).
[Non-Patent Document 8] A. Y. Ng, D. Harada and S. Russell: “Policy invariance under reward transformations: theory and application to reward shaping”, International Conference on Machine Learning, pp. 278-287 (1999).
[Non-Patent Document 9] D. Bagnell, S. Kakade, A. Ng and J. Schneider: “Policy search by dynamic programming”, Advances in Neural Information Processing Systems (2004).
[Non-Patent Document 10] M. T. Rosenstein and A. G. Barto: “Supervised actor-critic reinforcement learning”, Learning and Approximate Dynamic Programming: Scaling Up to the Real World, John Wiley & Sons, Inc., pp. 359-380 (2004).
[Non-Patent Document 11] K. Fukumizu and S. Amari: “Local minima and plateaus in hierarchical structures of multilayer perceptrons”, Neural Networks, 13, 3, pp. 317-327 (2000).
[Non-Patent Document 12] S. Amari, A. Cichocki and H. H. Yang: “A new learning algorithm for blind signal separation”, Advances in Neural Information Processing Systems, Vol. 8, MIT Press (1996).
[Non-Patent Document 13] D. MacKay: “Maximum likelihood and covariant algorithms for independent component analysis”, Technical report, University of Cambridge (1999).
[Non-Patent Document 14] S. Amari, H. Park and K. Fukumizu: “Adaptive method of realizing natural gradient learning for multilayer perceptrons”, Neural Computation, 12, 6, pp. 1399-1409 (2000).
[Non-Patent Document 15] J. Peters, S. Vijayakumar and S. Schaal: “Reinforcement learning for humanoid robotics”, IEEE-RAS International Conference on Humanoid Robots (2003).
[Non-Patent Document 16] D. Bagnell and J. Schneider: “Covariant policy search”, Proceedings of the International Joint Conference on Artificial Intelligence (2003).
[Non-Patent Document 17] J. Peters, S. Vijayakumar and S. Schaal: “Natural actor-critic”, European Conference on Machine Learning (2005).
[Non-Patent Document 18] Y. Nakamura, T. Mori and S. Ishii: “Natural policy gradient reinforcement learning for a CPG control of a biped robot”, International Conference on Parallel Problem Solving from Nature, pp. 972-981 (2004).
[Non-Patent Document 19] T. Morimura, E. Uchibe and K. Doya: “Utilizing natural gradient in temporal difference reinforcement learning with eligibility traces”, International Symposium on Information Geometry and its Applications (2005).
[Non-Patent Document 20] S. Richter, D. Aberdeen and J. Yu: “Natural actor-critic for road traffic optimisation”, Advances in Neural Information Processing Systems, MIT Press (2007).
[Non-Patent Document 21] D. P. Bertsekas: “Dynamic Programming and Optimal Control, Volumes 1 and 2”, Athena Scientific (1995).
[Non-Patent Document 22] R. S. Sutton and A. G. Barto: “Reinforcement Learning”, MIT Press (1998).
[Non-Patent Document 23] R. S. Sutton, D. McAllester, S. Singh and Y. Mansour: “Policy gradient methods for reinforcement learning with function approximation”, Advances in Neural Information Processing Systems, Vol. 12, MIT Press (2000).
[Non-Patent Document 24] S. Amari and H. Nagaoka: “Methods of Information Geometry”, Oxford University Press (2000).
[Non-Patent Document 25] R. Fletcher: “Practical Methods of Optimization”, Wiley (1987).
[Non-Patent Document 26] T. Morimura, E. Uchibe, J. Yoshimoto and K. Doya: “Reinforcement learning with log stationary distribution gradient”, Technical report, Nara Institute of Science and Technology (2007).
[Non-Patent Document 27] C. M. Bishop: “Neural Networks for Pattern Recognition”, Oxford University Press (1995).
[Non-Patent Document 28] Fukumizu, Kuriki, Takeuchi and Akahira: “Statistics of Singular Models” (特異モデルの統計学), Iwanami Shoten.
[Non-Patent Document 29] M. Rattray and D. Saad: “Analysis of natural gradient descent for multilayer neural networks”, Physical Review E, 59, 4, pp. 4523-4532 (1999).
[Non-Patent Document 30] S. J. Bradtke and A. G. Barto: “Linear least-squares algorithms for temporal difference learning”, Machine Learning, 22, 1-3, pp. 33-57 (1996).
[Non-Patent Document 31] J. A. Boyan: “Least-squares temporal difference learning”, Machine Learning, 49, 2-3, pp. 233-246 (2002).

However, when the reasons for slow convergence to the optimal policy are considered in terms of the structure of the parameter space to be learned, it has not previously been clear how the situation should be improved.

What, then, makes the learning time so long? Of course, as the task becomes more complex, the search space of the policy parameters generally becomes enormous and learning becomes difficult. Beyond that, however, conventional methods ignore the differences in how sensitive the probability distributions of the MDP are to each policy parameter, as well as the correlations among those sensitivities. In other words, the set of probability distributions representable by an MDP, with the parameters as a coordinate system, is not a Euclidean space but a Riemannian manifold. The parameter update direction of the conventional gradient method (the partial derivatives) therefore differs from the steepest direction in the Riemannian space, and learning stagnates. Indeed, it has been reported that even an extremely simple model such as a two-state MDP falls into a serious plateau (a period in which learning stagnates) [Non-Patent Document 2]. This is because the geometric structure of the objective function, with the parameters as a coordinate system, is locally flat in places, and the gradient becomes extremely small there [Non-Patent Document 11].

The “natural gradient method”, which is the steepest gradient method on a Riemannian space, is known as a solution to this problem [Non-Patent Document 1]. It has been well studied in other areas of machine learning [Non-Patent Document 1], [Non-Patent Document 12] to [Non-Patent Document 14], but theoretical research on it in reinforcement learning has made little progress [Non-Patent Document 2], [Non-Patent Document 15] to [Non-Patent Document 17].

In particular, because the gradient direction in the natural gradient method depends on which Riemannian metric matrix is used, the design of the Riemannian metric is an important issue.
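
For reference, the dependence of the natural gradient on the metric can be stated in the standard form due to Amari [Non-Patent Document 1]. The symbols are assumptions introduced here: G(θ) is the Riemannian metric matrix, η(θ) the average reward, and Δθ a small parameter change.

```latex
% Squared length of a small parameter change under the metric G, and the natural gradient.
% Choosing G = I (identity) recovers the ordinary gradient; a different G gives a different direction.
\| \Delta\theta \|_{G}^{2} = \Delta\theta^{\top} G(\theta)\, \Delta\theta,
\qquad
\tilde{\nabla}_{\theta}\, \eta(\theta) = G(\theta)^{-1}\, \nabla_{\theta}\, \eta(\theta)
```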

However, the applications of the natural gradient to reinforcement learning [Non-Patent Document 15] to [Non-Patent Document 20] all use the Riemannian metric on the probability distribution of actions proposed by Kakade [Non-Patent Document 2]; other Riemannian metrics have been neither discussed nor proposed.

The present invention therefore proposes a new Riemannian metric on the joint distribution of states and actions with respect to the policy parameters, and proposes the natural stationary policy gradient (NSG: Natural Stationary policy Gradient) method, a new natural policy gradient method based on the natural gradient of the average reward under that metric.

The present invention has thus been made to solve the problems described above. Its object is to provide a controller, a control method, and a control program, based on reinforcement learning with a natural gradient, that can avoid learning plateaus and reduce the number of trials (the learning time).

As will become apparent from the following description, the Riemannian metric we propose is a reasonable choice when compared with Kakade's Riemannian metric and with the Hessian matrix of the average reward. Furthermore, the natural gradient we propose coincides with the linear combination coefficients obtained when the immediate reward is approximated, in the least-squares sense, by a linear combination of basis functions defined by the policy parameters.
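
The least-squares property stated above can be checked numerically on a toy problem. The following is a minimal sketch, assuming a hypothetical two-state, two-action MDP with a sigmoid policy; the transition table `P`, reward table `R`, and all function names are illustrative assumptions and do not come from the patent. The basis functions are taken to be the partial derivatives of the log of the joint stationary distribution d(s)π(a|s), obtained here by finite differences on a known model rather than estimated from data.

```python
import math

# Hypothetical 2-state, 2-action MDP; every number below is an illustrative assumption.
P = [[[0.9, 0.1], [0.2, 0.8]],   # P[s][a][s2]: transition probability s -> s2 under action a
     [[0.7, 0.3], [0.1, 0.9]]]
R = [[0.0, 1.0], [2.0, 0.5]]     # R[s][a]: immediate reward


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def policy(theta):
    """pi[s][a], with pi(a=1|s) = sigmoid(theta[s])."""
    return [[1.0 - sigmoid(t), sigmoid(t)] for t in theta]


def stationary(theta):
    """Stationary state distribution d(s) of the 2-state chain induced by the policy."""
    pi = policy(theta)
    t01 = sum(pi[0][a] * P[0][a][1] for a in range(2))  # Pr(next = 1 | s = 0)
    t10 = sum(pi[1][a] * P[1][a][0] for a in range(2))  # Pr(next = 0 | s = 1)
    return [t10 / (t01 + t10), t01 / (t01 + t10)]


def average_reward(theta):
    pi, d = policy(theta), stationary(theta)
    return sum(d[s] * pi[s][a] * R[s][a] for s in range(2) for a in range(2))


def central_diff(f, theta, k, h=1e-5):
    """Central finite difference of f with respect to theta[k]."""
    up, dn = list(theta), list(theta)
    up[k] += h
    dn[k] -= h
    return (f(up) - f(dn)) / (2.0 * h)


def solve2(A, b):
    """Solve the 2x2 linear system A x = b by Cramer's rule."""
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    return [(A[1][1] * b[0] - A[0][1] * b[1]) / det,
            (A[0][0] * b[1] - A[1][0] * b[0]) / det]


def nsg_and_regression(theta):
    """Return (F^-1 grad eta, least-squares coefficients of r on psi)."""
    pi, d = policy(theta), stationary(theta)
    F = [[0.0, 0.0], [0.0, 0.0]]   # Fisher matrix of the joint distribution d(s) pi(a|s)
    c = [0.0, 0.0]                 # weighted correlation E[psi * r]
    for s in range(2):
        for a in range(2):
            psi = []
            for k in range(2):
                dlog_d = central_diff(lambda th: math.log(stationary(th)[s]), theta, k)
                dlog_pi = (a - pi[s][1]) if k == s else 0.0  # d/dtheta_k log pi(a|s)
                psi.append(dlog_d + dlog_pi)
            w = d[s] * pi[s][a]
            for i in range(2):
                c[i] += w * psi[i] * R[s][a]
                for j in range(2):
                    F[i][j] += w * psi[i] * psi[j]
    grad = [central_diff(average_reward, theta, k) for k in range(2)]
    nsg = solve2(F, grad)    # natural gradient under the joint-distribution metric
    omega = solve2(F, c)     # normal equations of the weighted least-squares fit
    return nsg, omega


if __name__ == "__main__":
    nsg, omega = nsg_and_regression([0.3, -0.5])
    print("natural gradient :", nsg)
    print("regression coeffs:", omega)
```

In this sketch the two vectors agree up to finite-difference error, because the gradient of the average reward equals the expectation of the basis vector times the reward under the joint stationary distribution. An intercept term is omitted, which is harmless here since the basis functions have zero mean under that distribution.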

To achieve this object, the controller of the present invention is a controller that, when the time evolution of a target system is described as a Markov process, learns a policy, which is a control law for the state of the system, by observing state quantities of the system. The controller comprises: control signal generating means for generating, based on the policy, a control signal for controlling the system; state quantity detecting means for observing the state quantities of the system; natural stationary policy gradient estimating means for estimating a natural stationary policy gradient, which is the natural gradient of the average reward when the Fisher information matrix of the joint distribution of the state specified by the state quantities and the control signal value is used as the Riemannian metric matrix; and policy updating means for updating the policy by updating the policy parameters that define the policy, based on the estimation result of the natural stationary policy gradient estimating means.
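
The four means listed above can be arranged as a simple loop. The sketch below only illustrates that structure, under strong assumptions: the plant is a hypothetical two-state system with two control values, the policy is a sigmoid per state, and the gradient estimator is a stand-in that computes the natural stationary policy gradient from the known model by finite differences, whereas the controller described here estimates it from observed data. All names and numbers are illustrative.

```python
import math
import random

# Hypothetical 2-state plant with 2 control values; every number is an illustrative assumption.
P = [[[0.9, 0.1], [0.2, 0.8]],   # P[s][a][s2]: probability of moving s -> s2 under control a
     [[0.7, 0.3], [0.1, 0.9]]]
R = [[0.0, 1.0], [2.0, 0.5]]     # R[s][a]: immediate reward


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def joint(theta):
    """Joint stationary distribution q[s][a] = d(s) * pi(a|s), pi(a=1|s) = sigmoid(theta[s])."""
    pi = [[1.0 - sigmoid(t), sigmoid(t)] for t in theta]
    t01 = pi[0][0] * P[0][0][1] + pi[0][1] * P[0][1][1]
    t10 = pi[1][0] * P[1][0][0] + pi[1][1] * P[1][1][0]
    d = [t10 / (t01 + t10), t01 / (t01 + t10)]
    return [[d[s] * pi[s][a] for a in range(2)] for s in range(2)]


def average_reward(theta):
    q = joint(theta)
    return sum(q[s][a] * R[s][a] for s in range(2) for a in range(2))


def estimate_nsg(theta, h=1e-5):
    """Stand-in estimator: exact, model-based NSG (a real estimator would use samples)."""
    q = joint(theta)
    q_up = [joint([t + h * (k == i) for i, t in enumerate(theta)]) for k in range(2)]
    q_dn = [joint([t - h * (k == i) for i, t in enumerate(theta)]) for k in range(2)]
    F = [[0.0, 0.0], [0.0, 0.0]]
    g = [0.0, 0.0]
    for s in range(2):
        for a in range(2):
            # psi[k] = d/dtheta_k log q(s, a), by central differences
            psi = [(math.log(q_up[k][s][a]) - math.log(q_dn[k][s][a])) / (2 * h)
                   for k in range(2)]
            for i in range(2):
                g[i] += q[s][a] * psi[i] * R[s][a]
                for j in range(2):
                    F[i][j] += q[s][a] * psi[i] * psi[j]
    det = F[0][0] * F[1][1] - F[0][1] * F[1][0]
    return [(F[1][1] * g[0] - F[0][1] * g[1]) / det,
            (F[0][0] * g[1] - F[1][0] * g[0]) / det]


def run_controller(theta, steps=200, update_every=10, alpha=0.1, seed=0):
    """Control loop: generate control signal, observe state, estimate NSG, update policy."""
    rng = random.Random(seed)
    theta = list(theta)
    s = 0
    for t in range(1, steps + 1):
        a = 1 if rng.random() < sigmoid(theta[s]) else 0   # control signal generation
        s = 0 if rng.random() < P[s][a][0] else 1          # state quantity observation
        if t % update_every == 0:
            nsg = estimate_nsg(theta)                      # NSG estimation (stand-in)
            theta = [th + alpha * n for th, n in zip(theta, nsg)]  # policy update
    return theta


if __name__ == "__main__":
    theta0 = [0.3, -0.5]
    theta1 = run_controller(theta0)
    print(average_reward(theta0), "->", average_reward(theta1))
```

Because the stand-in estimator does not consume the sampled states and control signals, the sampling lines only mark where observation and actuation would occur in a data-driven implementation.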

Preferably, the natural stationary policy gradient estimating means includes: reward value acquiring means for acquiring a reward value that depends on the state and the control signal through a predetermined relationship; and estimating means for estimating the partial derivatives of the logarithm of the stationary distribution based on the state quantities and the control signal at each time step, and for estimating the natural stationary policy gradient by regressing the reward value with a linear function approximator whose basis functions are functions of the state and the control signal specified by the estimated partial derivatives.

According to another aspect of the present invention, there is provided a control method for learning a policy, which is a control law for the state of a system, by observing state quantities of the system when the time evolution of the target system is described as a Markov process. The method comprises: a control signal generating step of generating, based on the policy, a control signal for controlling the system; a state quantity detecting step of observing the state quantities of the system; a natural stationary policy gradient estimating step of estimating a natural stationary policy gradient, which is the natural gradient of the average reward when the Fisher information matrix of the joint distribution of the state specified by the state quantities and the control signal value is used as the Riemannian metric matrix; and a policy updating step of updating the policy by updating the policy parameters that define the policy, based on the estimation result of the natural stationary policy gradient estimating step.

According to still another aspect of the present invention, there is provided a program for causing a computer to execute a control method for learning a policy, which is a control law for the state of a system, by observing state quantities of the system when the time evolution of the target system is described as a Markov process. The program causes the computer to execute control processing comprising: a control signal generating step of generating, based on the policy, a control signal for controlling the system; a state quantity detecting step of observing the state quantities of the system; a natural stationary policy gradient estimating step of estimating a natural stationary policy gradient, which is the natural gradient of the average reward when the Fisher information matrix of the joint distribution of the state specified by the state quantities and the control signal value is used as the Riemannian metric matrix; and a policy updating step of updating the policy by updating the policy parameters that define the policy, based on the estimation result of the natural stationary policy gradient estimating step.

(Summary of the present invention)
In general, the parameter space of a statistical model or a machine learning model has, with respect to changes in the model's output, the properties of a Riemannian space rather than a Euclidean one, and its steepest direction does not necessarily coincide with the conventional gradient, that is, the partial derivatives of the output.

To address this problem, Amari [Non-Patent Document 1] proposed the natural gradient method, and Kakade [Non-Patent Document 2] applied the natural gradient to policy gradient reinforcement learning, which is one of the optimization methods for Markov decision processes.

Because the natural gradient direction is determined by the Riemannian metric that defines the Riemannian structure, the choice of metric is an important issue. However, the Riemannian metric matrix used by Kakade accounts only for the change in the probability distribution of actions caused by perturbing the policy parameters; it ignores the change in the probability distribution of states, which is likewise affected by the policy. The present invention therefore proposes a new Riemannian metric that also takes the probability distribution of states into account, and derives from it a new natural policy gradient, the natural stationary policy gradient. It is further proved that this natural stationary policy gradient coincides with the parameters learned when the immediate reward is approximated by a linear function approximator whose basis functions are defined by the policy parameters. In addition, numerical experiments on Markov decision problems with various numbers of states show that the proposed method works more effectively than conventional methods, particularly when the number of states is large.
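
The contrast between the two metrics can be made explicit. The notation below is an assumption introduced for illustration: d_θ(s) is the stationary state distribution, π_θ(a|s) the policy, F_a the action-only metric attributed above to Kakade, and F_{s,a} the Fisher information matrix of the joint distribution of states and actions. The decomposition in the last line is the standard additivity of Fisher information for a joint distribution; the cross terms vanish because the policy's log-derivative has zero mean under π_θ.

```latex
% Action-only metric: Fisher information of the action distribution, averaged over states.
F_{a}(\theta) = \sum_{s} d_{\theta}(s) \sum_{a} \pi_{\theta}(a \mid s)\,
  \nabla_{\theta} \ln \pi_{\theta}(a \mid s)\,
  \nabla_{\theta} \ln \pi_{\theta}(a \mid s)^{\top}

% Joint metric: Fisher information of the joint distribution d_\theta(s)\,\pi_\theta(a|s).
F_{s,a}(\theta) = \sum_{s} \sum_{a} d_{\theta}(s)\, \pi_{\theta}(a \mid s)\,
  \nabla_{\theta} \ln\!\big( d_{\theta}(s)\, \pi_{\theta}(a \mid s) \big)\,
  \nabla_{\theta} \ln\!\big( d_{\theta}(s)\, \pi_{\theta}(a \mid s) \big)^{\top}

% Decomposition into a state term and the action-only term.
F_{s,a}(\theta) = F_{s}(\theta) + F_{a}(\theta),
\qquad
F_{s}(\theta) = \sum_{s} d_{\theta}(s)\,
  \nabla_{\theta} \ln d_{\theta}(s)\, \nabla_{\theta} \ln d_{\theta}(s)^{\top}
```

Under this notation the natural stationary policy gradient is the natural gradient of the average reward with F_{s,a} as the metric, so it differs from the action-only natural gradient exactly by the contribution of the state term F_s.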

The following description is organized as follows. Section 1 (Overview of the present invention) explains the overall configuration of the present invention. Section 2.1 explains the policy gradient method, which is the gradient method in reinforcement learning. Section 2.2 explains the natural gradient method proposed by Amari [Non-Patent Document 1] and introduces the natural policy gradient method, its application to reinforcement learning. Section 2.3 points out open questions about that method.

(1. Overview of the present invention)
As described later, the present invention derives the “natural stationary policy gradient”, updates the “policy” with it, and controls the controlled object. In deriving the natural stationary policy gradient, the log stationary distribution gradient (LSDG), that is, the partial derivatives of the logarithm of the stationary distribution, is obtained by means of a backward Markov chain and a TD (temporal difference) learning algorithm, as described later, and can be used as the gradient of the stationary distribution.

Embodiments of the present invention are described below with reference to the drawings.
As will become apparent from the following description, the present invention has a wide range of applications to control problems for robots, plants, mobile machines (trains, automobiles), and the like.

In the following, however, a particularly simple automatic robot control problem is used as a specific application example of the present invention, and the numerical results show comparisons on even simpler models. The present invention is not limited to such applications; more generally, it can be applied to the control of a target system whose time evolution is complex. Examples include large plants (blast furnaces, nuclear power plants), multi-link robots (humanoid robots), non-holonomic systems (space stations), and the flow of people on subway platforms. All of these are important control targets that are difficult to control with classical control methods.

(2. System configuration of the present invention)
FIG. 1 is a conceptual diagram showing an example of a system 1000 that uses a controller to which the control method and control program of the present invention are applied.

Referring to FIG. 1, the system 1000 includes a controlled device 200 to be controlled and a computer 100 that supplies control signals to the controlled device 200.

Referring to FIG. 1, the computer 100 includes: a computer main body 102 equipped with a CD-ROM drive 108 for reading information on a CD-ROM (Compact Disc Read-Only Memory) and an FD drive 106 for reading and writing information on a flexible disk (hereinafter, FD) 116; a display 104 connected to the computer main body 102 as a display device; and a keyboard 110 and a mouse 112, also connected to the computer main body 102, as input devices.

FIG. 2 is a block diagram showing the configuration of the computer 100.
As shown in FIG. 2, in addition to the CD-ROM drive 108 and the FD drive 106, the computer main body 102 of the computer 100 includes the following, each connected to a bus BS: a CPU (Central Processing Unit) 120; a memory 122 including a ROM (Read Only Memory) and a RAM (Random Access Memory); a direct access memory device, for example a hard disk 124; and a communication interface 128 for exchanging data with the controlled device 200. A CD-ROM 118 is loaded in the CD-ROM drive 108, and an FD 116 is loaded in the FD drive 106.

The controlled device 200 supplies the computer 100 with information on parameters (state quantities) that indicate the state of the controlled device 200, for example the position, velocity, acceleration, angle, and angular velocity of its movable parts. The computer 100, in turn, supplies the controlled device 200 with control information for controlling these state quantities as a control signal (corresponding to the “action” described below).

The CD-ROM 118 may be replaced by any other medium capable of recording information such as a program to be installed in the computer main body, for example a DVD-ROM (Digital Versatile Disc) or a memory card. In that case, the computer main body 102 is provided with a drive device capable of reading such a medium.

The main part of the controller of the present invention consists of computer hardware and software executed by the CPU 120. Such software is generally distributed on a storage medium such as the CD-ROM 118 or the FD 116, read from the medium by the CD-ROM drive 108, the FD drive 106, or the like, and first stored on the hard disk 124. Alternatively, when the device is connected to a network, the software is first copied from a server on the network to the hard disk 124. It is then read from the hard disk 124 into the RAM in the memory 122 and executed by the CPU 120. When the device is connected to a network, the software may also be loaded directly into the RAM and executed without being stored on the hard disk 124.

The computer hardware shown in FIGS. 1 and 2 and its operating principle are conventional. The most essential part of the present invention is therefore the software stored on a storage medium such as the FD 116, the CD-ROM 118, or the hard disk 124.

As a general trend, various program modules are provided as part of the operating system of a computer, and an application program proceeds by calling these modules in a prescribed sequence as needed. In that case, the software that implements the controller does not itself contain such modules, and the controller is realized only when the software cooperates with the operating system on the computer. As long as a common platform is used, however, there is no need to distribute software that includes such modules. The software without those modules, a recording medium on which that software is recorded, and the data signal by which that software is distributed over a network can each be regarded as constituting an embodiment.

[General description of the control method]
The theoretical basis of the configuration of the present invention is described below. In the equations, lightface symbols denote scalars and boldface symbols denote vectors (or matrices). Note, however, that the running text does not distinguish boldface from lightface.
(2.1 Policy gradient reinforcement learning)
As the reinforcement learning problem, consider a discrete-time Markov decision process (MDP) defined on a finite state set S∋s and a finite action set A∋a [Non-Patent Document 21], [Non-Patent Document 22].

At each time step t, the agent selects an action a_t according to the following stochastic policy, which is specified by the policy parameter θ∈R^d.

The agent then moves to a new state s_{t+1} according to the following state transition probability and receives an immediate reward r_{t+1} determined by a bounded, predetermined reward function r(s_t, a_t, s_{t+1}). It is assumed throughout that the policy π(a|s;θ) is smooth with respect to the parameter θ, and that the following Markov chain, defined by the state transition probability and the policy, is ergodic.

Consequently, once the policy parameter is fixed, there exists a unique stationary distribution (over states) d_θ, given below, that does not depend on the initial state s_0, and it coincides with the limiting distribution as defined below:

The goal of reinforcement learning (policy search) is to identify the policy parameter θ^* that maximizes the average reward, that is, the time average of the immediate reward, given by the following equation (3).

Here E{…} denotes the expectation. Note that, by the ergodicity assumption, the right-hand side does not depend on the initial state s_0. Since the time average and the space average coincide (equation (2)), the average reward can be rewritten as the following equation (4) [Non-Patent Document 21]:
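The rewriting of the average reward as a space average can be illustrated numerically. The following is a minimal sketch, not the patented implementation: it assumes the average reward takes the standard form R(θ) = Σ d_θ(s) π(a|s;θ) p(s'|s,a) r(s,a,s'), and the array layout `P[s, a, s2]`, the two-action sigmoid policy with one parameter per state, and all function names are assumptions introduced for illustration.

```python
import numpy as np

def sigmoid_policy(theta):
    """Assumed policy form: pi(a=0|s) = 1 / (1 + exp(-theta[s])), two actions."""
    p0 = 1.0 / (1.0 + np.exp(-theta))
    return np.stack([p0, 1.0 - p0], axis=1)  # pi[s, a]

def stationary_distribution(P, pi):
    """Stationary distribution d of the Markov chain induced by the policy.

    P[s, a, s2] is the state transition probability and pi[s, a] the policy.
    """
    # State-to-state transition matrix under the policy.
    P_pi = np.einsum("sa,sat->st", pi, P)
    n = P_pi.shape[0]
    # Solve d^T P_pi = d^T together with sum(d) = 1 in the least-squares sense.
    A = np.vstack([P_pi.T - np.eye(n), np.ones((1, n))])
    b = np.append(np.zeros(n), 1.0)
    return np.linalg.lstsq(A, b, rcond=None)[0]

def average_reward(P, r, pi):
    """Space-average form: sum over (s, a, s2) of d(s) pi(a|s) p(s2|s,a) r(s,a,s2)."""
    d = stationary_distribution(P, pi)
    return float(np.einsum("s,sa,sat,sat->", d, pi, P, r))
```

Because the chain is assumed ergodic, the linear system has a unique solution, so no initial state enters the computation.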

Therefore, the gradient of the average reward with respect to the policy parameter can be expressed as the following equation (5).

Here, a^T denotes the transpose of a vector a.
Therefore, updating the policy parameter according to the following equation increases the average reward R(θ), which is the objective function:

Here, the symbol connecting the left-hand and right-hand sides of the above equation is the assignment operator, which assigns the right-hand value to the left-hand variable, and η is a sufficiently small constant called the learning rate. This framework is generally called the policy gradient (reinforcement learning) method [Non-Patent Document 6], [Non-Patent Document 23].
(2.2 Natural gradient [Non-Patent Document 1])
A parameter space is said to be a Riemannian space when the parameter θ∈R^d lies on a Riemannian manifold defined by a Riemannian metric matrix G(θ)∈R^{d×d} (a positive definite matrix). That is, the squared distance between θ and the vector (θ + dθ) obtained by an infinitesimal change of θ is defined by the following equation.

Here, g_{i,j} is the (i, j)-th element of the matrix G.
Let ε be a sufficiently small positive constant. The direction of dθ that maximizes the function R(θ) under the constraint |dθ| = ε on the distance moved, that is, the steepest ascent direction in the Riemannian space, is given by the following equation (6) and is called the “natural gradient”.

In reinforcement learning in particular, that is, when θ is the policy parameter and R(θ) is the average reward, it is called the natural policy gradient [Non-Patent Document 2].

The natural (policy) gradient method (locally) maximizes the objective function by successively updating the parameter according to the following equation (7).
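The equation images for (6) and (7) are not reproduced in this text. As a hedged reconstruction, the standard forms of the natural gradient and of the natural gradient update are shown below. They follow from maximizing the first-order change of R(θ) under the constraint dθ^T G(θ) dθ = ε² with a Lagrange multiplier. The notation with a tilde over the gradient symbol is an assumption, and the assignment symbol := is taken from the learning-rate rule quoted in Section 5.2.2.

```latex
\begin{align}
  \tilde{\nabla}_{G}\, R(\theta) &= G(\theta)^{-1}\, \nabla_{\theta} R(\theta), \tag{6} \\
  \theta &:= \theta + \eta\, G(\theta)^{-1}\, \nabla_{\theta} R(\theta). \tag{7}
\end{align}
```

With G(θ) equal to the identity matrix I, equation (7) reduces to the ordinary policy gradient update.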

For a statistical model Pr(x|θ), in which the probability of a variable x is specified by a parameter θ, the Fisher information matrix F_x(θ) of the following equation (8) is commonly used as the Riemannian metric matrix G(θ):

The reason is that the Fisher information matrix corresponds to a local approximation of the Kullback-Leibler (KL) divergence, a pseudo-distance between probability distributions. That is, the KL divergence between the two probability distributions Pr(x|θ) and Pr(x|θ+Δθ) can be expressed as follows [Non-Patent Document 24].
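The correspondence between the Fisher information matrix and the KL divergence can be checked numerically. The sketch below assumes the usual second-order relation, KL ≈ ½ Δθ^T F(θ) Δθ, and uses a softmax-parameterized categorical distribution as a stand-in statistical model. The model and all names are illustrative assumptions, not part of the patent.

```python
import numpy as np

def softmax(theta):
    """Categorical distribution Pr(x | theta) with logits theta."""
    z = np.exp(theta - np.max(theta))
    return z / z.sum()

def fisher_softmax(theta):
    """Fisher information E[grad log p * grad log p^T] = diag(p) - p p^T."""
    p = softmax(theta)
    return np.diag(p) - np.outer(p, p)

def kl_divergence(p, q):
    """KL divergence between two strictly positive discrete distributions."""
    return float(np.sum(p * (np.log(p) - np.log(q))))

theta = np.array([0.2, -0.5, 1.0])
delta = 1e-3 * np.array([1.0, -2.0, 0.5])

exact = kl_divergence(softmax(theta), softmax(theta + delta))
approx = 0.5 * float(delta @ fisher_softmax(theta) @ delta)
print(exact, approx)  # the two values agree up to third-order terms in delta
```

The agreement degrades as `delta` grows, which is why the metric is described as a local approximation.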

(2.3 Open question in the natural policy gradient method)
The policy gradient reinforcement learning method can be regarded as optimizing the policy parameter θ on a statistical model defined by the stochastic policy π(a|s;θ) and the state transition probability P(s'|s,a). Hence, if the Riemannian metric matrix G(θ) is designed from the Fisher information matrix of that statistical model, the natural policy gradient follows directly from equation (6). Because the natural policy gradient method is a gradient method not in the parameter space of an arbitrarily parameterized policy but in the Riemannian space defined by G(θ), the natural policy gradient becomes a very effective method provided that a Riemannian metric for the reinforcement learning problem can be designed.

However, as Kakade [Non-Patent Document 2] also points out, the Riemannian metric in reinforcement learning is not necessarily unique and various choices are conceivable. As is clear from equation (6), the natural policy gradient direction depends on the Riemannian metric matrix G(θ) that is used.

The question of what constitutes an appropriate Riemannian metric therefore deserves discussion. Previous studies of the natural policy gradient [Non-Patent Document 15] to [Non-Patent Document 20], however, do not address this point and use the Riemannian metric matrix based on the Fisher information of the action distribution proposed by Kakade [Non-Patent Document 2]. In other words, there has been little discussion of which statistical models are conceivable for policy gradient reinforcement learning and which Riemannian metric is effective for solving reinforcement learning problems. The present invention therefore discusses statistical models and Riemannian metrics in reinforcement learning and proposes a new Riemannian metric.
(3. Riemannian metric matrices for the policy gradient)
Section 3.1 proposes a new Riemannian metric matrix for reinforcement learning. Sections 3.2 and 3.3 compare it with the Riemannian metric proposed by Kakade [Non-Patent Document 2] and with the Hessian matrix of the average reward, and discuss its validity.
(3.1 Proposed Riemannian metric matrix on state-action pairs and proposed natural stationary policy gradient)
The function that reinforcement learning can adjust directly is the policy π(a|s;θ), so previous studies of the natural policy gradient focused only on the policy function, that is, on the statistical model Pr(a|s,M(θ)). In practice, however, a change in the policy also changes the state distribution Pr(s|M(θ)). Moreover, by equation (4) the average reward R(θ) is determined by the joint distribution of state-action pairs (s,a)∈S×A. The appropriate statistical model for the (local) maximization of the average reward, which is the goal of reinforcement learning, is therefore considered to be Pr(s,a|M(θ)).

This is because, if the Fisher information matrix of Pr(s,a|M(θ)) is used as the Riemannian metric matrix, the resulting natural policy gradient coincides with the steepest ascent direction of the average reward under the constraint that the infinitesimal change in the following KL divergence D_KL of the joint state-action distribution with respect to the policy parameter θ is constant.

The Fisher information matrix F_{s,a}(θ) corresponding to this statistical model is obtained from equation (8) as the following equation (9):

The proposed natural policy gradient, which takes the Fisher information matrix F_{s,a}(θ) of the joint state-action distribution as the Riemannian metric matrix G(θ) and is called the natural stationary policy gradient (NSG: Natural Stationary policy Gradient), is obtained from equation (6) as follows:
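The structure of F_{s,a}(θ) can be explored numerically on a small MDP. The sketch below is an illustration under assumptions, not the patented procedure: it computes the Fisher information of the state distribution, the policy Fisher information weighted by the stationary distribution (the metric attributed to Kakade in Section 3.2), and the Fisher information of the joint distribution d_θ(s)π(a|s;θ), using central finite differences in place of the analytic derivatives of Appendix 3. The sigmoid policy, the array layout, and all names are hypothetical.

```python
import numpy as np

def policy(theta):
    """Assumed sigmoid policy over two actions, one parameter per state."""
    p0 = 1.0 / (1.0 + np.exp(-theta))
    return np.stack([p0, 1.0 - p0], axis=1)  # pi[s, a]

def stationary(P, pi):
    """Stationary state distribution of the chain induced by pi."""
    P_pi = np.einsum("sa,sat->st", pi, P)
    n = P_pi.shape[0]
    A = np.vstack([P_pi.T - np.eye(n), np.ones((1, n))])
    b = np.append(np.zeros(n), 1.0)
    return np.linalg.lstsq(A, b, rcond=None)[0]

def jacobian(f, theta, eps=1e-6):
    """Central finite-difference Jacobian; the last axis indexes theta."""
    cols = []
    for i in range(theta.size):
        step = np.zeros_like(theta)
        step[i] = eps
        cols.append((f(theta + step) - f(theta - step)) / (2.0 * eps))
    return np.stack(cols, axis=-1)

def fisher_matrices(P, theta):
    """Return (F_s, F_kakade, F_sa) for the joint model d(s) * pi(a|s)."""
    pi = policy(theta)
    d = stationary(P, pi)
    pr = d[:, None] * pi                                   # Pr(s, a)
    dlog_d = jacobian(lambda t: np.log(stationary(P, policy(t))), theta)
    dlog_pi = jacobian(lambda t: np.log(policy(t)), theta)
    dlog_pr = dlog_d[:, None, :] + dlog_pi                 # grad log Pr(s, a)
    F_s = np.einsum("s,si,sj->ij", d, dlog_d, dlog_d)
    F_kakade = np.einsum("sa,sai,saj->ij", pr, dlog_pi, dlog_pi)
    F_sa = np.einsum("sa,sai,saj->ij", pr, dlog_pr, dlog_pr)
    return F_s, F_kakade, F_sa
```

On random dense MDPs the joint matrix equals the sum of the other two up to finite-difference error, because the cross term vanishes when the policy probabilities sum to one in every state. The natural stationary policy gradient is then `np.linalg.solve(F_sa, grad_R)` for a gradient vector `grad_R` of the average reward.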

(3.2 Comparison with the conventional Riemannian metric matrix)
The only Riemannian metric matrix used so far for the natural policy gradient is the one proposed ad hoc by Kakade [Non-Patent Document 2]: the Fisher information matrix of the policy, F_a(s,θ), weighted by the stationary distribution and summed, as in the following equation (12).

This is exactly the second term of equation (9) for F_{s,a}(θ). Under the assumption that the stationary distribution does not change when the policy parameter changes, the following relation holds in equation (10).

In general, however, the assumption in the above equation does not hold. In other words, within the statistical model Pr(s,a|M(θ)) of the joint state-action distribution, Kakade's Riemannian metric matrix is one that ignores the change in the stationary state distribution, expressed by the following equation, caused by perturbing the policy parameter θ.

Meanwhile, Bagnell et al. [Non-Patent Document 16] and Peters et al. [Non-Patent Document 15] showed the following.

Bagnell et al. [Non-Patent Document 16] and Peters et al. [Non-Patent Document 15] further argue that, since by equation (3) the reinforcement learning problem, that is, the average reward maximization problem, ultimately reduces to optimizing the system trajectory, Kakade's Riemannian metric matrix can be a valid Riemannian metric matrix.

However, the following can be said.

This is because the statistical model of the system trajectory takes the time evolution of the system into account, whereas the statistical model of state-action pairs does not. This point is clarified below.

Which Fisher information matrix, then, is suited to average reward maximization, the reinforcement learning problem? As stated above, the average reward (equation (4)) is determined by the joint state-action distribution (the system trajectory of one time step) and does not depend on the evolution of the system from the second time step onward. Kakade's Fisher information matrix therefore assumes a redundant statistical model, and the Fisher information matrix F_{s,a}(θ) of the joint state-action distribution is considered the most natural Riemannian metric. Section 5 presents numerical experiments comparing natural gradients whose Riemannian metrics are the identity matrix I, the Fisher information matrix F_{s,a}(θ), Kakade's Riemannian metric matrix, and others.

Furthermore, the following can be said.

(3.3 Similarity between the Fisher information matrix and the Hessian matrix)
The Fisher information matrix F_{s,a}(θ) and Kakade's Riemannian metric matrix are compared with the Hessian matrix, that is, the matrix of second-order partial derivatives of the average reward with respect to the policy parameter θ.

Comparing equation (12) for Kakade's Riemannian metric matrix (the normalized Fisher information matrix of the infinite-horizon system trajectory) with equation (16) for the Hessian matrix H(θ) shows that, as Kakade also notes [Non-Patent Document 2], Kakade's Riemannian metric matrix carries no information about any term other than the first two terms inside the braces {…} of equation (16) for H(θ).

In contrast, equations (9) and (15) show that the Fisher information matrix F_{s,a}(θ) on state-action pairs clearly carries some information about every term of H(θ). The comparison with the Hessian matrix thus also suggests that the proposed Fisher information matrix F_{s,a}(θ) can be a valid Riemannian metric.

It is also worth noting that the average reward is generally not a quadratic form in the policy parameter θ, and the Hessian matrix tends to be indefinite, particularly when θ is far from the optimal parameter θ^*. The Fisher information matrix, in contrast, is guaranteed by its definition (equation (8)) to be positive semidefinite and never indefinite. Particularly in reinforcement learning problems, the natural gradient method is therefore considered a more widely applicable covariant gradient method than methods that use the Hessian matrix H(θ), such as the Newton-Raphson method [Non-Patent Document 25]. Section 5 presents numerical experiments comparing these methods.

When a certain special reward function that also depends on θ is used, F_{s,a}(θ) coincides with the Hessian matrix (Appendix 2).
(4. Natural stationary policy gradient with the Fisher information matrix F_{s,a}(θ))
Consider estimating the natural stationary policy gradient, that is, the natural gradient of the average reward whose Riemannian metric matrix is the Fisher information matrix F_{s,a}(θ) of the joint distribution Pr(s,a|M(θ)) of states and actions. This estimation is shown to reduce to a regression problem for the reward function r(s,a,s').

The following [Theorem 1] then holds.

Based on the above, consider a linear function approximator whose basis function is the function given by the following equation.

If this linear function approximator is fitted to the immediate reward r_{t+1} by least squares, its learned parameter ω turns out to be an unbiased estimator of the natural stationary policy gradient, as expressed by the following equation.
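The reduction to regression can be sanity-checked on a small MDP. The sketch below makes two assumptions that the text does not confirm because the equation images are missing: that the basis function is the gradient of log(d_θ(s)π(a|s;θ)) with respect to θ, and that the fit has no intercept term. It solves the least-squares problem in expectation (weighting each state-action pair by its probability instead of sampling) and compares the result with F_{s,a}(θ)^{-1}∇R(θ). The sigmoid policy, finite-difference derivatives, and all names are hypothetical.

```python
import numpy as np

def policy(theta):
    """Assumed sigmoid policy over two actions, one parameter per state."""
    p0 = 1.0 / (1.0 + np.exp(-theta))
    return np.stack([p0, 1.0 - p0], axis=1)

def stationary(P, pi):
    P_pi = np.einsum("sa,sat->st", pi, P)
    n = P_pi.shape[0]
    A = np.vstack([P_pi.T - np.eye(n), np.ones((1, n))])
    b = np.append(np.zeros(n), 1.0)
    return np.linalg.lstsq(A, b, rcond=None)[0]

def joint(P, theta):
    """Joint state-action distribution Pr(s, a) = d(s) * pi(a|s)."""
    pi = policy(theta)
    return stationary(P, pi)[:, None] * pi

def jacobian(f, theta, eps=1e-6):
    """Central finite-difference Jacobian; the last axis indexes theta."""
    cols = []
    for i in range(theta.size):
        step = np.zeros_like(theta)
        step[i] = eps
        cols.append((f(theta + step) - f(theta - step)) / (2.0 * eps))
    return np.stack(cols, axis=-1)

def nsg_by_regression(P, r, theta):
    """Weighted least-squares fit of the expected reward on grad log Pr(s, a)."""
    pr = joint(P, theta)
    r_bar = np.einsum("sat,sat->sa", P, r)        # expected reward per (s, a)
    psi = jacobian(lambda t: np.log(joint(P, t)), theta)
    w = np.sqrt(pr).reshape(-1)                   # square-root weights
    X = psi.reshape(-1, theta.size) * w[:, None]
    y = r_bar.reshape(-1) * w
    return np.linalg.lstsq(X, y, rcond=None)[0]

def nsg_direct(P, r, theta):
    """Reference value: solve F_sa * x = grad R directly."""
    pr = joint(P, theta)
    psi = jacobian(lambda t: np.log(joint(P, t)), theta)
    F_sa = np.einsum("sa,sai,saj->ij", pr, psi, psi)
    R = lambda t: np.array(np.einsum("sa,sat,sat->", joint(P, t), P, r))
    return np.linalg.solve(F_sa, jacobian(R, theta))
```

The two routes agree because the normal equations of the weighted fit read E[ψψ^T]ω = E[ψr], and the right-hand side equals ∇R(θ) when ψ is the gradient of the log joint probability.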

Since estimating the natural stationary policy gradient thus reduces to a reward function regression problem, it can be estimated simply by the least-squares method, or by a gradient method that does not explicitly require the Fisher information matrix, obtained by extending the method of Morimura et al. [Non-Patent Document 19].

In an actual implementation, the following points require attention.

The implementation method described in Non-Patent Document 26 is detailed in that document, so only a brief summary is given later. The content of Non-Patent Document 26 has also been filed by the same applicant as the present application, as Japanese Patent Application No. 2008-077671.
(5. Numerical experiments)
This section applies the proposed natural stationary policy gradient method and conventional gradient methods to arbitrarily constructed Markov decision processes with various numbers of states and compares them.
(5.1 Comparison of metrics)
FIG. 3 shows how the Riemannian metric matrices differ on the phase plane.

In FIG. 3, (i) shows the proposed Fisher information matrix F_{s,a}(θ), (ii) shows Kakade's Riemannian metric matrix, and (iii) shows the identity matrix I.

The example is a two-state Markov decision process (MDP) in which each state s∈{1, 2} has actions A∈{l, m} for the transition to itself and the transition to the other state, and every state transition is deterministic.

The policy π is a sigmoid function, as in the Markov decision process described later with reference to FIG. 4.

In the figure, the grayscale of the background represents, on a log scale, the ratio of the stationary distributions of state 1 and state 2. Each ellipse corresponds to the set of Δθ satisfying the constant distance Δθ^T G(θ) Δθ = ε², and the natural stationary policy gradient indicates the steepest ascent direction of the average reward.

With the proposed method, the minor axis of each ellipse lies along the direction in which the stationary distribution changes, which shows that the method handles well the change in the stationary distribution caused by perturbing the policy parameter θ.
(5.2 Comparison of learning)
(5.2.1 Experimental setup)
Markov decision processes (MDPs) with |S|∈{3, 10, 20, 35, 50, 65, 80, 100} states and |A| = 2 actions were set up by the following procedure.

The state transition probability p(s'|s,a) was designed so that the ergodicity of the Markov chain M(θ) assumed in the theory is not violated and, as in typical reinforcement learning tasks [Non-Patent Document 22], so that the states are sparsely connected; in other words, so that the mixing time of M(θ) increases with the number of states.

The specific construction is as follows.

For the reward function r(s,a,s'), the return value for each combination of arguments was first drawn from the standard normal distribution N(μ=0, σ²=1). Then, to equalize the scale of the average reward across MDP settings, the reward function was normalized by the following equation so that the maximum average reward is max R(θ) = 1 and the minimum is min R(θ) = 0:
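The normalization equation itself is not reproduced in this text. One form consistent with the stated goal is the affine rescaling below. It is offered as an assumption, not as the equation of the patent. It works because R(θ) is an expectation of r under a probability distribution, so an affine map applied to every reward value applies the same affine map to R(θ).

```latex
r(s,a,s') := \frac{r(s,a,s') - \min_{\theta} R(\theta)}
                  {\max_{\theta} R(\theta) - \min_{\theta} R(\theta)}
```

Here the maximum and minimum are taken over the average reward computed with the unnormalized reward function.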

The policy π(a|s;θ) was set as follows:

(5.2.2 Settings of each gradient method)
Four policy gradient methods, the proposed method and three conventional methods, were applied to the above MDPs. They differ in the Riemannian metric matrix G(θ) (equation (6)) used to obtain the gradient. The four policy gradient methods are as follows.

Apart from the fact that the degree of adjustment varies with λ_max, this Hessian adjustment is a standard method that interpolates between the partial-derivative direction and the Newton direction [Non-Patent Document 27].

When the degree of adjustment is large, the gradient direction becomes indistinguishable from the following partial-derivative direction. In this experiment the degree of adjustment was therefore made variable, so that only the minimum necessary correction is applied and the properties of the Hessian matrix are preserved as far as possible.

The present invention does not discuss the data sampling problem for gradient estimation and focuses only on the gradient direction, so each gradient was computed analytically when implementing the gradient methods. Among the quantities required, the derivation of the stationary distribution and its partial derivatives is not straightforward and is given in Appendix 3. The total number of learning steps T was 300, and the learning rate η of each gradient method in the policy parameter update (equation (7)) was set by the following equation.

If the learning rate η is appropriate, R(θ) always increases under the policy update of equation (7). Therefore, whenever a policy update decreased R(θ), the learning rate was adjusted as “η := η/2” and the update was attempted again at the same time step. This adjustment was applied so that ΔR(θ) ≥ 0 was maintained. Conversely, when η_0 > η still held at the next step, the learning rate was adjusted as “η := 2η” to prevent learning from stalling.
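The halving and doubling rule can be sketched as a small driver loop. This is a minimal sketch under assumptions: the initial learning rate `eta0` is passed in because its defining equation is not reproduced here, the cap `max_halvings` is added only to guarantee termination, and the function and argument names are hypothetical.

```python
import numpy as np

def adaptive_ascent(R, direction, theta, eta0, n_steps, max_halvings=50):
    """Gradient ascent with the eta := eta/2 retry and eta := 2*eta recovery rule.

    R(theta) returns the objective value and direction(theta) the ascent
    direction (for example a natural gradient). Returns the final parameter
    and the objective value after every time step.
    """
    eta = eta0
    history = [R(theta)]
    for _ in range(n_steps):
        step = direction(theta)
        for _ in range(max_halvings):
            candidate = theta + eta * step
            if R(candidate) >= history[-1]:
                theta = candidate      # accept: the objective did not decrease
                break
            eta *= 0.5                 # reject: halve and retry at the same step
        history.append(R(theta))
        if eta < eta0:
            eta *= 2.0                 # recover toward eta0 to avoid stalling
    return theta, history

# Usage on a toy concave objective with a deliberately large initial rate.
theta_final, curve = adaptive_ascent(
    R=lambda th: -float(th @ th),
    direction=lambda th: -2.0 * th,
    theta=np.array([1.0, 1.0]),
    eta0=1.5,
    n_steps=20,
)
```

Since halving starts from `eta0` and doubling is applied only while `eta < eta0`, the learning rate never exceeds its initial value in this sketch.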

(5.3 Experimental results and discussion)
FIG. 4 shows the learning curves of 10 episodes, each with different initial values, for the MDP with |S| = 100 states set up according to Section 5.2.1. Similar results were obtained with the other MDP settings.

FIG. 4 shows that the proposed natural stationary policy gradient method optimized the policy parameter more consistently than the other gradient methods. For example, in some episodes only the proposed method learned successfully while the other gradient methods fell into a learning plateau (stagnation). In certain other episodes, however, all gradient methods learned successfully and the proposed method was the slowest both in learning and in the initial rise of the average reward. The following reasons are conceivable. By equation (4), the average reward is a linear combination of the reward function weighted by the joint state-action distribution Pr(s,a|M(θ)), so in general a large change in Pr(s,a|M(θ)) causes a large change in its value. Because the proposed gradient method uses the Fisher information matrix F_{s,a}(θ) of Pr(s,a|M(θ)) as the Riemannian metric, the KL divergence caused by a single policy parameter update Δθ is constant for a sufficiently small learning rate η. Pr(s,a|M(θ)) therefore never changes greatly, and neither does the average reward R(θ) change abruptly.

The other gradient methods, in contrast, use other Riemannian metrics, so Pr(s,a|M(θ)) can change abruptly and the average reward can be increased quickly in the short term. In exchange, those methods become more prone to learning plateaus, which is thought to explain the results shown in FIG. 5. One specific cause may be that, when Pr(s,a|M(θ)) changes abruptly, the partial derivative (equation (5)) also tends to change greatly and may come extremely close to 0, where the geometry of the objective function R(θ) over the parameter space of each gradient method tends to become flat.

FIG. 4 is also consistent with the results of Amari et al. [Non-Patent Document 14], who used the natural gradient to train multilayer perceptrons.

FIG. 5 shows the learning success rate of each gradient method over 300 episodes for the MDPs with |S|∈{3, 10, 20, 35, 50, 65, 80, 100} states.

Because the maximum of the average reward R(θ) was set to 1, a learning run was counted as successful if R(θ_T) > 0.95.

FIG. 5 shows that for MDPs with few states, the proposed natural gradient method and Kakade's natural gradient method learned appropriately without falling into plateaus, unlike the other gradient methods. This holds by construction for the proposed method, and is thought to hold for Kakade's method because its Riemannian metric matrix is also related to the Fisher information matrix of the statistical model Pr(s,a|M(θ)).

When the number of states is large, on the other hand, Kakade's natural gradient method fails to learn where the proposed natural gradient method succeeds. As discussed theoretically in Section 3.2, this is thought to be because Kakade's method ignores the Fisher information F_s(θ) of the state distribution.

The figure also confirms that Kakade's method was inferior even to the pseudo-Newton gradient method when the number of states was large. The likely reason is that Kakade's method captures no information about F_s(θ), whereas the pseudo-Newton gradient method uses the relation "the metric matrix G(θ) equals −(pseudo-Hessian matrix)" (Equation (16)), which, although not in the form of F_s(θ), does carry information about the derivative of the stationary distribution with respect to θ.

FIG. 6 shows the extent to which each gradient method falls into learning plateaus. In FIG. 6, the vertical axis is the plateau index: the accumulated absolute value of the smoothness (discrete curvature) Δ²R(θ_t) of the learning curve.
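A minimal sketch of such a plateau index follows. It assumes that Δ²R(θ_t) is the ordinary second-order difference R(θ_{t+1}) − 2R(θ_t) + R(θ_{t−1}); the source does not spell out the definition, and the function name `plateau_index` is hypothetical.

```python
import numpy as np

def plateau_index(avg_rewards):
    """Sum of |second-order differences| of a learning curve R(theta_0), R(theta_1), ..."""
    r = np.asarray(avg_rewards, dtype=float)
    if r.size < 3:
        raise ValueError("need at least three points to form a second difference")
    # np.diff with n=2 gives R[t+2] - 2 R[t+1] + R[t] for each t.
    second_diff = np.diff(r, n=2)
    return float(np.abs(second_diff).sum())

smooth = plateau_index([0.0, 1.0, 2.0, 3.0])   # straight line: no curvature
stepped = plateau_index([0.0, 0.0, 1.0, 1.0])  # flat, jump, flat: curvature at both corners
```

A curve that stalls and then jumps accumulates curvature at each corner, so a larger value indicates less smooth learning.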

FIG. 6 confirms that the proposed natural gradient method learned most smoothly for every number of states. Consistent with the preceding results, this suggests that the proposed method is the least prone to plateaus.

These numerical experiments confirm that the proposed natural gradient method avoids learning plateaus and learns appropriately regardless of the MDP environment settings (p, r, ψ) and the initial policy parameter θ₀ and, in particular, regardless of the number of states even when the stationary distribution is strongly correlated with the policy parameters. The present gradient method is therefore considered a more natural natural policy gradient method than the one proposed by Kakade.

As Appendix 1, Equation (14) can be derived as follows.

As Appendix 2, the agreement between the Fisher information matrix and the Hessian matrix is shown as follows.

As Appendix 3, the stationary distribution and its partial derivatives can be derived as follows.

From Equation (A-2), the stationary distribution vector is obtained as follows.
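Equation (A-2) and the resulting closed form are not reproduced in this text. As a hedged illustration only, one standard numerical route to the stationary distribution vector is to solve the balance equations together with the normalization condition. The function name `stationary_distribution` and the example matrix are assumptions, and this is not claimed to be the appendix's formula.

```python
import numpy as np

def stationary_distribution(P):
    """Stationary distribution d of a row-stochastic matrix P (P[i, j] = Pr(j | i)).

    Solves d^T P = d^T together with sum(d) = 1 as one least-squares system.
    Assumes the chain is ergodic, so the solution is unique.
    """
    P = np.asarray(P, dtype=float)
    n = P.shape[0]
    # Rows 0..n-1: (P^T - I) d = 0.  Last row: 1^T d = 1.
    A = np.vstack([P.T - np.eye(n), np.ones((1, n))])
    b = np.zeros(n + 1)
    b[-1] = 1.0
    d, *_ = np.linalg.lstsq(A, b, rcond=None)
    return d

# Two-state example: the balance equation gives d = (5/6, 1/6).
d = stationary_distribution([[0.9, 0.1], [0.5, 0.5]])
```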

Similarly, the second-order partial derivative is obtained by the following equation.

FIG. 7 is a conceptual diagram showing an example of the configuration of the controller of the present invention.
In the example of FIG. 7, the controller performs an action, that is, applies a control signal to the controlled object, and observes the state quantity of the controlled object with an observer (for example, a position sensor, angle sensor, acceleration sensor, or angular acceleration sensor). From the observations it estimates the "partial derivative of the logarithm of the stationary distribution" (LSDG), uses this to estimate the "natural stationary policy gradient", and updates the policy parameters, thereby updating the policy. The updated policy then generates the control signal (corresponding to the action in the description above), and the controlled object is controlled further.
(6. Overview of the implementation for deriving the gradient of the stationary distribution shown in Non-Patent Document 26)
The implementation for deriving the policy gradient and the gradient of the stationary distribution is described in detail in Non-Patent Document 26; its outline is summarized below for reference.

Some of the notation is redefined in the following description.

In the following, a Markov decision process (MDP) is considered, and the controlled object (the environment of the controller) is characterized by a state transition probability (the probability that the state becomes x_{t+1} when action (control) u_t is executed in state x_t at time t) and a reward function r_{t+1} = r(x_t, u_t) (the case r_{t+1} = r(x_{t+1}, x_t, u_t) can be treated in the same way). The reward function is assumed to be determined in advance according to the control objective for the controlled object. A mapping from a state input x ∈ X to an action output u ∈ U is called a policy and is expressed stochastically as described below. The policy is specified by a parameter θ.

A discrete-time MDP with a finite set of states X ∋ x and a finite set of actions U ∋ u is defined by the following state transition probability p and reward function r_{+1}.

Here, for simplicity of notation, x_{+1} is the state that follows state x under action u, and r_{+1} is the immediate reward observed at x_{+1}. x_{+k} and u_{+k} are the state and action k time steps after state x, and a subscript −k denotes k time steps before. Decisions in the MDP are made according to the following stochastic policy π parameterized by θ ∈ R^d.

The policy π is assumed to be differentiable with respect to θ for all x ∈ X and u ∈ U.
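As a concrete instance of a stochastic policy that is differentiable in θ, the sketch below uses a tabular softmax policy. The softmax form, the parameter layout (one row of θ per state), and the function names are assumptions for illustration; the source does not fix a particular parameterization.

```python
import numpy as np

def softmax_policy(theta, x):
    """Action probabilities pi(. | x; theta) for theta of shape (num_states, num_actions)."""
    prefs = theta[x] - theta[x].max()  # subtract the max for numerical stability
    p = np.exp(prefs)
    return p / p.sum()

def grad_log_policy(theta, x, u):
    """Gradient of log pi(u | x; theta) with respect to theta (same shape as theta)."""
    g = np.zeros_like(theta, dtype=float)
    # For a softmax, d/d theta[x, a] of log pi(u | x) is 1{a == u} - pi(a | x).
    g[x] = -softmax_policy(theta, x)
    g[x, u] += 1.0
    return g

theta = np.zeros((2, 3))           # two states, three actions, uniform policy
g = grad_log_policy(theta, 0, 1)   # nonzero only in the row of state 0
```

The gradient of the log policy is the quantity that the later sections accumulate along the backward Markov chain.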

In addition, the following assumption is made.
(Assumption 1)
The Markov chain M(θ) with the following state transition probability p and stochastic policy π is ergodic (irreducible and aperiodic) for all policy parameters.

Therefore, the following unique stationary distribution exists.

This stationary distribution is independent of the initial state and satisfies the following equation.

Here, the following equation holds.
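The equations referred to in the preceding sentences are not reproduced in this text. Under Assumption 1 they presumably take the standard form sketched here; the symbols d_θ and p_{M(θ)} are notational assumptions rather than the source's own.

```latex
% Unique stationary distribution of the ergodic chain M(\theta)
d_\theta(x) \equiv \lim_{t \to \infty} \Pr\bigl(x_t = x \mid M(\theta)\bigr), \qquad \forall x \in X

% Balance equation, independent of the initial state
d_\theta(x_{+1}) = \sum_{x \in X} p_{M(\theta)}(x_{+1} \mid x)\, d_\theta(x)

% State-to-state transition probability induced by the policy
p_{M(\theta)}(x_{+1} \mid x) \equiv \sum_{u \in U} \pi(u \mid x; \theta)\, p(x_{+1} \mid x, u)
```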

The goal of PGRL is to find the policy parameter θ* that maximizes the average of the immediate rewards, called the "average reward" below.

Under Assumption 1, the average reward is independent of the initial state x and can be shown to equal the following:
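The average reward expressions are missing from this text. A hedged reconstruction, assuming the reward form r_{+1} = r(x, u) stated earlier and writing d_θ for the stationary distribution (an assumed symbol), is:

```latex
R(\theta) \equiv \lim_{T \to \infty} \frac{1}{T}\,
  \mathbb{E}\!\left[ \sum_{t=1}^{T} r_t \,\middle|\, x_0 = x,\ M(\theta) \right]
  = \sum_{x \in X} \sum_{u \in U} d_\theta(x)\, \pi(u \mid x; \theta)\, r(x, u)
```

The right-hand side is the reward function weighted by the joint state-action distribution, which agrees with the description of Equation (4) given earlier in the document.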

Policy gradient RL algorithms update the policy parameter θ in the direction of the gradient of the average reward R(θ) with respect to θ, given by the following expression.

This gradient is often simply called the policy gradient (PG). The policy gradient is given as follows.
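The policy gradient expression itself is not reproduced here. Differentiating the average reward as a weighted sum over d_θ(x)π(u|x;θ), and using ∇(dπ) = dπ(∇ln d + ∇ln π), gives the form sketched below. The later references suggest that this is Equation (22), but that numbering is an inference, and d_θ is an assumed symbol.

```latex
\nabla_\theta R(\theta)
  = \sum_{x \in X} \sum_{u \in U} d_\theta(x)\, \pi(u \mid x; \theta)\,
    \bigl\{ \nabla_\theta \ln d_\theta(x) + \nabla_\theta \ln \pi(u \mid x; \theta) \bigr\}\, r(x, u)
```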

Deriving the partial derivative of the logarithm of the stationary distribution, shown in the following expression, is not straightforward.

Conventional PG algorithms therefore use another expression for the PG.

Here, with discount rate γ ∈ [0, 1), the action value function Q and the state value function V are given as follows.

The contribution of the second term of Equation (23) shrinks as γ approaches 1, so conventional algorithms set γ ≈ 1 and approximate the PG from the first term alone. The bias caused by this omission becomes smaller as the discount rate γ approaches 1, but the variance of the estimate becomes larger.
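Equation (23) and the value function definitions are missing from this text. The reconstruction below is offered as a plausible reading only: it is consistent with the statement that the second term vanishes as γ → 1, but the exact form in the source may differ, and the symbols Q^γ, V^γ, and d_θ are assumptions.

```latex
% Conventional expression of the policy gradient (presumably Equation (23))
\nabla_\theta R(\theta)
  = \sum_{x} \sum_{u} d_\theta(x)\, \pi(u \mid x; \theta)\,
      \nabla_\theta \ln \pi(u \mid x; \theta)\, Q^{\gamma}(x, u)
  + (1 - \gamma) \sum_{x} d_\theta(x)\, \nabla_\theta \ln d_\theta(x)\, V^{\gamma}(x)

% Discounted value functions
Q^{\gamma}(x, u) \equiv \lim_{K \to \infty}
  \mathbb{E}\!\left[ \sum_{k=1}^{K} \gamma^{k-1} r_{+k} \,\middle|\, x, u, M(\theta) \right],
\qquad
V^{\gamma}(x) \equiv \sum_{u} \pi(u \mid x; \theta)\, Q^{\gamma}(x, u)
```

As γ → 1, (1 − γ)V^γ(x) tends to the state-independent value R(θ), and the sum of d_θ(x)∇ln d_θ(x) over x is zero, which is why the second term disappears in the limit.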

Here, another approach is proposed: estimate the partial derivative of the logarithm of the stationary distribution (LSDG), shown in the following expression, and use Equation (22) to derive the PG.

A notable feature is that no value function needs to be learned, so the algorithm is free of the bias-variance trade-off involved in choosing the discount rate γ.
(6.1 Estimation of the partial derivative of the logarithm of the stationary distribution)
In the following, an LSDG estimation algorithm based on the least squares method, LSDG(λ), is proposed. To this end, the reverse process of the ergodic Markov chain M(θ) is formulated, and it is shown that the LSDG can be estimated within the framework of the TD method.
(6.1.1 Properties of forward and backward Markov chains)
By Bayes' theorem, the backward probability from the current state to the preceding state-action pair is given by the following equation.

The following posterior probability q depends on the prior distribution p.

When the prior distribution follows the stationary distribution d and the policy π, as below, the posterior distribution q is called the stationary backward probability and is given the subscript B(θ).

The two Markov chains, M(θ) and B(θ), are closely related, as described in the following two theorems.
(Theorem 1)

(Proof)
Multiply both sides of Equation (24) by the following stationary distribution.

Then, summing over all possible actions u_{-1} ∈ U gives the following equation:

Summing both sides of Equation (26) over all possible states x ∈ X then gives the following equation.

This establishes the following two points: (i) B(θ) has the same stationary distribution as M(θ), and (ii) B(θ) has the same irreducibility property as M(θ).

This implies that (iii) B(θ) has the same aperiodicity property as M(θ). Equation (25) follows directly from (i)-(iii).
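The statement that the backward chain shares the stationary distribution of the forward chain can be checked numerically on a small random MDP. The sketch assumes that the stationary backward probability has the Bayes form q(x_{-1}, u_{-1} | x) = p(x | x_{-1}, u_{-1}) π(u_{-1} | x_{-1}) d(x_{-1}) / d(x). The random MDP, the array layouts, and the use of power iteration are illustrative choices, not taken from the source.

```python
import numpy as np

rng = np.random.default_rng(0)
nx, nu = 4, 2

# Random ergodic MDP (all probabilities strictly positive). p[x, u, y] = Pr(y | x, u).
p = rng.random((nx, nu, nx))
p /= p.sum(axis=2, keepdims=True)
# Random stochastic policy. pi[x, u] = Pr(u | x).
pi = rng.random((nx, nu))
pi /= pi.sum(axis=1, keepdims=True)

# Forward chain M(theta): P_fwd[x, y] = sum_u pi[x, u] p[x, u, y].
P_fwd = np.einsum("xu,xuy->xy", pi, p)

# Stationary distribution by power iteration.
d = np.full(nx, 1.0 / nx)
for _ in range(10_000):
    d = d @ P_fwd

# Stationary backward probability: q[y, x, u] = p[x, u, y] pi[x, u] d[x] / d[y].
q = np.einsum("xuy,xu,x->yxu", p, pi, d) / d[:, None, None]

# Backward chain B(theta): P_bwd[y, x] = Pr(previous state = x | current state = y).
P_bwd = q.sum(axis=2)

# d is stationary for the backward chain as well.
d_after_backward_step = d @ P_bwd
```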
(Theorem 2)

(Proof)
Using the Markov property of the chain and substituting Equation (24) gives the following relation.

This proves that Equation (27) holds for finite K. Since the following equation follows from Theorem 1, it is immediately proved that Equation (27) also holds in the limit K → ∞.

Theorems 1 and 2 are important because they mean that, under a state distribution that converges to the stationary distribution, samples from the forward Markov chain M(θ) can be used directly for estimation concerning the backward Markov chain B(θ). They are used later.
(6.1.2 TD (temporal difference) learning for the LSDG, from the backward to the forward Markov chain)
Using Equation (24), the LSDG is decomposed as follows.

Equation (29) implies that the LSDG of state x is the infinite-horizon accumulation, along the backward Markov chain B(θ) starting from state x, of the partial derivative of the logarithm of the policy given by the following expression.

From Equations (28) and (29), the LSDG can be estimated by TD learning on the backward Markov chain B(θ), rather than on M(θ), using the following backward TD error δ.

Here, the first two terms are the actual observation of the partial derivative of the logarithm of the policy one step earlier in B(θ) and the LSDG one step earlier, and the last term is the LSDG of the current state.
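The three terms described above suggest a backward TD error of the form δ = ∇ln π(u_{-1}|x_{-1}) + ∇ln d(x_{-1}) − ∇ln d(x). At the exact LSDG its expectation under B(θ) should vanish for every current state x. The sketch below checks this numerically with finite differences on a small random MDP; the softmax policy, the helper names, and the choice of a single differentiated parameter are assumptions for illustration.

```python
import numpy as np

def stationary(P, iters=5000):
    """Stationary distribution of a positive row-stochastic matrix by power iteration."""
    d = np.full(P.shape[0], 1.0 / P.shape[0])
    for _ in range(iters):
        d = d @ P
    return d

def policy(theta):
    """Tabular softmax policy, pi[x, u] (an assumed parameterization)."""
    e = np.exp(theta - theta.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

rng = np.random.default_rng(1)
nx, nu = 3, 2
p = rng.random((nx, nu, nx))
p /= p.sum(axis=2, keepdims=True)          # p[x, u, y] = Pr(y | x, u)
theta = rng.normal(size=(nx, nu))

def log_d(th):
    """Log stationary distribution of the chain induced by parameters th."""
    return np.log(stationary(np.einsum("xu,xuy->xy", policy(th), p)))

# Central finite differences with respect to one parameter theta[i, j].
i, j, eps = 0, 1, 1e-6
tp, tm = theta.copy(), theta.copy()
tp[i, j] += eps
tm[i, j] -= eps
dlogd = (log_d(tp) - log_d(tm)) / (2 * eps)                       # LSDG, shape (nx,)
dlogpi = (np.log(policy(tp)) - np.log(policy(tm))) / (2 * eps)    # shape (nx, nu)

# Stationary backward probability q[y, x, u] = Pr(previous = (x, u) | current = y).
pi = policy(theta)
d = stationary(np.einsum("xu,xuy->xy", pi, p))
q = np.einsum("xuy,xu,x->yxu", p, pi, d) / d[:, None, None]

# Expected one-step backward target for each current state y.
rhs = np.einsum("yxu,xu->y", q, dlogpi + dlogd[:, None])
expected_td_error = rhs - dlogd   # should be numerically zero
```

The check works because differentiating the balance equation d(y) = Σ p π d and dividing by d(y) gives exactly this one-step backward recursion.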

Using the eligibility decay rate λ ∈ [0, 1] and the number of backward trace steps K ∈ N, Equation (29) is generalized as follows.

Even when the above setting does not hold, if a large λ or K is used, this minimization is less sensitive to non-Markovian effects, as in conventional TD(λ) learning of value functions.

To bridge the gap between the theoretical assumptions and practical application, one of the following two strategies is needed: (i) if λ ≈ 1, K is not set to a very large integer; (ii) if K ≈ t, λ is not set to 1, where t is the current time step of the actual forward Markov chain.
(6.1.3 LSDG estimation algorithm: least squares on the constrained backward TD(λ))

Section 6.1.3 proposes LSDG(λ), an LSDG estimation algorithm based on the least squares method. It aims to reduce the mean squared error and satisfy the constraint at the same time.
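The constraint is not reproduced in this text. It is presumably the normalization identity of the stationary distribution, which follows in one line; d_θ is an assumed symbol.

```latex
\sum_{x \in X} d_\theta(x) = 1
\;\Longrightarrow\;
\sum_{x \in X} \nabla_\theta d_\theta(x)
  = \sum_{x \in X} d_\theta(x)\, \nabla_\theta \ln d_\theta(x) = 0
```

In words, the LSDG has zero mean under the stationary distribution, which pins down the additive constant that the backward TD recursion alone leaves undetermined.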

Accordingly, the function to be minimized is Equation (32) below.

Here, the second term on the right-hand side accounts for the constraint of Equation (31). The partial derivative of Equation (32) is therefore as follows.

Such correlation exists in general RL problems, so the instrumental variable method is applied to remove the resulting bias.

In the following, the notation is changed to x_t to indicate the state at time step t of the actual Markov chain M(θ). The proposed LSDG estimation algorithm, LSDG(λ), uses an eligibility decay rate λ ∈ [0, 1) and sets the number of backward steps K equal to the time step t of the current state x_t. That is, the following holds.

FIG. 8 shows the procedure for obtaining LSDG(λ) as Algorithm 1.

FIG. 9 is a flowchart of Algorithm 1.
Referring to FIG. 9, first, in step S100, the following settings are made as preconditions for the processing.

Next, the following initialization is performed (step S102).

The time step t is set to t = 0 (step S104), and the following processing is repeated from t = 0 to t = T−1 (steps S106 to S116).

First, if t = 0 in step S106, the initial state is observed (step S108), and the following settings are made (step S110).

If t is not 0 in step S106, the following processing is performed (step S112).

After step S110 or S112, the following calculation is performed (step S114).

In step S116, if t is smaller than T, the processing returns to step S106; if t is equal to or greater than T, the processing proceeds to step S118 and the following calculation is performed.

The estimate of the LSDG is then obtained by the following calculation.
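The update formulas of Algorithm 1 are not reproduced in this text, so the sketch below is not the patent's LSDG(λ). It is a deliberately simplified stand-in under stated assumptions: tabular (one-hot) state features, λ = 0, no forgetting factor, and a plain least-squares solve of the backward TD(0) fixed-point equations followed by centering to satisfy the zero-mean constraint. All names are hypothetical.

```python
import numpy as np

def lsdg_tabular(states, grad_log_pi, num_states):
    """Simplified least-squares backward TD(0) estimate of the LSDG.

    states       : state indices x_0, ..., x_T observed along the forward chain
    grad_log_pi  : array of shape (T, dim); row t is grad log pi(u_t | x_t)
    Returns an array of shape (num_states, dim) estimating grad log d(x).
    """
    states = np.asarray(states)
    g = np.asarray(grad_log_pi, dtype=float)
    eye = np.eye(num_states)
    A = np.zeros((num_states, num_states))
    B = np.zeros((num_states, g.shape[1]))
    for t in range(1, len(states)):
        phi_now, phi_prev = eye[states[t]], eye[states[t - 1]]
        # Backward TD(0): f(x_t) should match grad log pi(u_{t-1}|x_{t-1}) + f(x_{t-1}).
        A += np.outer(phi_now, phi_now - phi_prev)
        B += np.outer(phi_now, g[t - 1])
    # A is singular (constant shifts are invisible to it), so use least squares.
    omega, *_ = np.linalg.lstsq(A, B, rcond=None)
    # Enforce sum_x d(x) grad log d(x) = 0 using the empirical state distribution.
    d_hat = np.bincount(states, minlength=num_states) / len(states)
    omega -= d_hat @ omega
    return omega

# Tiny hypothetical trajectory with two states and a two-dimensional parameter.
states = [0, 1, 0, 1, 1, 0]
grads = np.array([[0.5, -0.5], [0.2, 0.1], [-0.3, 0.3], [0.0, 0.4], [0.1, -0.2]])
estimate = lsdg_tabular(states, grads, num_states=2)
```

Forward samples are used to fit a recursion on the backward chain, which is what Theorems 1 and 2 are stated to justify. A faithful implementation would add the eligibility decay rate λ and the instrumental variable treatment described in the text.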

The embodiments disclosed herein should be considered illustrative in all respects and not restrictive. The scope of the present invention is defined by the claims rather than by the description above, and is intended to include all modifications within the meaning and scope equivalent to the claims.

FIG. 1 is a conceptual diagram showing an example of a system 1000 using a controller to which the control method and control program of the present invention are applied.
FIG. 2 is a block diagram showing the configuration of the computer 100.
FIG. 3 is a diagram showing the differences between the Riemannian metric matrices.
FIG. 4 is a diagram showing learning curves.
FIG. 5 is a diagram showing the learning success rate of each gradient method.
FIG. 6 is a diagram showing the extent to which each gradient method falls into learning plateaus.
FIG. 7 is a conceptual diagram showing an example of the configuration of the controller of the present invention.
FIG. 8 is a diagram showing the procedure for obtaining LSDG(λ) as Algorithm 1.
FIG. 9 is a flowchart of Algorithm 1.

Explanation of symbols

100 computer, 102 computer main body, 104 display, 106 FD drive, 108 CD-ROM drive, 110 keyboard, 112 mouse, 118 CD-ROM, 120 CPU, 122 memory, 124 hard disk, 128 communication interface, 200 controlled device, 1000 system.

Claims

A controller that, when the time evolution of a target system is described as a Markov process, learns a policy that is a control law for the state of the system by observing a state quantity of the system, the controller comprising:
control signal generating means for generating a control signal for controlling the system based on the policy;
state quantity detection means for observing the state quantity of the system;
natural stationary policy gradient estimation means for estimating a natural stationary policy gradient, which is the natural gradient of the average reward, using as the Riemannian metric matrix the Fisher information matrix of the joint distribution of the state specified by the state quantity and the control signal value; and
policy update means for updating the policy by updating a policy parameter that defines the policy, based on an estimation result of the policy gradient estimation means.

The controller according to claim 1, wherein the natural stationary policy gradient estimation means includes:
reward value acquisition means for acquiring a reward value that depends on the state and the control signal in a predetermined relationship; and
estimation means for estimating the partial derivative of the logarithm of the stationary distribution based on the state quantity and the control signal at each time step, and for estimating the natural stationary policy gradient by regressing the reward value using a linear function approximator whose basis function over the state and the control signal is specified by the estimated partial derivative.

A control method that, when the time evolution of a target system is described as a Markov process, learns a policy that is a control law for the state of the system by observing a state quantity of the system, the control method comprising:
a control signal generation step of generating a control signal for controlling the system based on the policy;
a state quantity detection step of observing the state quantity of the system;
a natural stationary policy gradient estimation step of estimating a natural stationary policy gradient, which is the natural gradient of the average reward, using as the Riemannian metric matrix the Fisher information matrix of the joint distribution of the state specified by the state quantity and the control signal value; and
a policy update step of updating the policy by updating a policy parameter that defines the policy, based on an estimation result of the policy gradient estimation step.

A control program for causing a computer to execute a control method that, when the time evolution of a target system is described as a Markov process, learns a policy that is a control law for the state of the system by observing a state quantity of the system, the control program causing the computer to execute control processing comprising:
a control signal generation step of generating a control signal for controlling the system based on the policy;
a state quantity detection step of observing the state quantity of the system;
a natural stationary policy gradient estimation step of estimating a natural stationary policy gradient, which is the natural gradient of the average reward, using as the Riemannian metric matrix the Fisher information matrix of the joint distribution of the state specified by the state quantity and the control signal value; and
a policy update step of updating the policy by updating a policy parameter that defines the policy, based on an estimation result of the policy gradient estimation step.