JP2009064216A

JP2009064216A - Function approximation device, reinforcement learning system, function approximation system, and function approximation program

Info

Publication number: JP2009064216A
Application number: JP2007231085A
Authority: JP
Inventors: Masahiko Morita (森田 昌彦)
Original assignee: University of Tsukuba NUC
Current assignee: University of Tsukuba NUC
Priority date: 2007-09-06
Filing date: 2007-09-06
Publication date: 2009-03-26
Anticipated expiration: 2027-09-06
Also published as: JP5152780B2

Abstract

PROBLEM TO BE SOLVED: To approximate a function accurately even when the number of input variables is large, and to reduce the learning cost required for the approximation.

SOLUTION: The function approximation device (C) comprises: input variable input means (C7A) for inputting the respective values of three or more input variables (x, v, θ, ω) into an input layer (Na) made up of input elements (s_X1, s_X2, ..., s_Xn; s_v1, s_v2, ..., s_vn; s_θ1, s_θ2, ..., s_θn; s_ω1, s_ω2, ..., s_ωn) that receive those values; and intermediate variable computing means (C7C) in which input variable sets are defined, each set pairing two of the three or more input variables (x, v, θ, ω), and which computes the value of each intermediate variable y from the values of one input variable of each set and the first output sensitivities computed from the values of the other input variable of that set.

COPYRIGHT: (C)2009, JPO&INPIT

Description

The present invention relates to a function approximation device, a function approximation system, and a function approximation program that approximate a function using a neural network. The present invention also relates to a reinforcement learning system in which the function approximation device is applied in order to control a control device appropriately.

Techniques for approximating a function that represents the relationship between an input and an output have long been known and are widely used in fields of intelligent information processing such as machine learning, pattern recognition, and robot control. Because such techniques would also be useful in various engineering fields, such as weather forecasting and blast furnace control at steelworks, provided the function can be approximated accurately, they are the subject of active research and development.

Known techniques for approximating such a function include, for example, the linear model (general linear model, GLM), which approximates the function by a linear polynomial or the like, and the table reference method (table look-up), which approximates the function by referring to a table stored in advance.
Also known is the so-called layered neural network, which approximates the function by representing the inputs, the outputs, and the function in a distributed manner using the elements (neurons) of an input layer, an intermediate layer, and an output layer. Layered neural networks include algorithms that differ in characteristics such as learning ability (generalization ability), for example the multilayer perceptron (MLP), the RBF (radial basis function) network, and the GRBF (Gaussian radial basis function) network.
The linear model is described in, for example, Non-Patent Document 1, and the layered neural network in, for example, Non-Patent Document 2; both are publicly known.

(Selective desensitization method)
The present inventors have also proposed a new layered neural network to which the selective desensitization method is applied. The selective desensitization method is described in Non-Patent Document 3 and elsewhere and is publicly known. The layered neural network to which the selective desensitization method is applied is described below on the basis of Non-Patent Document 3.
For this network, consider the task of recalling a target pattern T by integrating the information of two input patterns S and C, that is, the task of approximating a function whose output is T when the inputs are S and C.

FIG. 12 is an explanatory diagram of the minimal configuration of a layered neural network to which the selective desensitization method is applied.
In FIG. 12, the layered neural network 01 to which the selective desensitization method is applied has an input layer 01a composed of 2n elements, an intermediate layer 01b composed of n elements, and an output layer 01c composed of m elements. The 2n elements of the input layer 01a are divided between the input patterns S and C, n elements each; that is, each of the input patterns S and C consists of n elements. The n elements of each of the input patterns S and C are coupled to the n elements of the intermediate layer 01b: the i-th element of the intermediate layer 01b (element x_i, described later) is coupled to the i-th elements of the input patterns S and C (elements s_i and c_i, described later) (i = 1, 2, ..., n).

Let the input patterns be S = (s_1, s_2, ..., s_n) and C = (c_1, c_2, ..., c_n), so that the i-th elements of S and C are s_i and c_i. Let X = (x_1, x_2, ..., x_n) be the intermediate pattern output by the intermediate layer 01b. The i-th element x_i of the intermediate pattern X is then given by equation (1).
x_i = g_i (s_i − ave(x_i)) + ave(x_i) ... (1)
Here, g_i is the gain (output sensitivity, or simply sensitivity) of the element x_i, and ave(x_i) is the average output level of the element x_i, that is, the average of the output (element x_i) over all inputs (elements s_i, c_i).

When there is no correlation between the gain g_i and the element s_i, the average output level ave(x_i) can be replaced by the average value of the element s_i, ave(s_i). It matters little if the two, ave(x_i) and ave(s_i), do not agree exactly. Thus, for example, when the element s_i takes the values 1 and −1 with equal probability and the gain g_i and the element s_i are determined independently (no correlation), the average output level can be set to 0 (ave(x_i) = 0). In this case the element x_i is given by equation (1)′.
x_i = g_i × s_i ... (1)′

Let G = (g_1, g_2, ..., g_n) be the gain vector whose i-th element is the gain g_i. The gain g_i is normally 1; setting it to 0 is called "desensitization", and desensitizing only some of the n elements s_i (i = 1, 2, ..., n) of the input pattern S in the output of the intermediate pattern X is called "selective desensitization".
The method of changing which combination of elements s_i (i = 1, 2, ..., n) is desensitized according to the elements c_i (i = 1, 2, ..., n) of the input pattern (context pattern) C is a context modification method based on selective desensitization, referred to as "product-type context modification". The input pattern S after product-type context modification by the input pattern (context pattern) C is written as the product-type modification S(C).

In this case, the i-th element (gain) g_i of the gain vector G is determined from the i-th element c_i of the input pattern (context pattern) C. In the simplest case, when the element c_i, like the element s_i, takes the values 1 and −1 with equal probability, the gain g_i is given by equation (2).
g_i = (1 + c_i) / 2 ... (2)
Thus g_i = 1 when c_i = 1 and g_i = 0 when c_i = −1, so the input pattern (context pattern) C can be identified with the gain vector G. When the gain g_i (element c_i) and the element s_i are correlated, that is, when the gain vector G (input pattern (context pattern) C) and the input pattern S are correlated, it is desirable to remove the correlation, for example by suitably shuffling the components (elements g_i) of the gain vector G, so that ave(x_i) = 0.
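As a non-limiting illustration, equations (1)′ and (2) can be sketched in Python as follows. The function names gain_from_context and desensitize and the sample patterns are hypothetical and do not appear in the specification; the sketch assumes ave(x_i) = 0 and elements taking only the values 1 and −1.

```python
import numpy as np

def gain_from_context(c):
    """Equation (2): g_i = (1 + c_i) / 2, mapping c_i = +1 to 1 and c_i = -1 to 0."""
    c = np.asarray(c, dtype=float)
    return (1.0 + c) / 2.0

def desensitize(s, c):
    """Equation (1)': x_i = g_i * s_i, assuming ave(x_i) = 0."""
    s = np.asarray(s, dtype=float)
    return gain_from_context(c) * s

# Hypothetical six-element patterns (illustrative values only).
s = np.array([1, -1, 1, -1, 1, -1])
c = np.array([1, 1, -1, -1, 1, -1])

x = desensitize(s, c)   # elements where c_i = -1 are desensitized (set to 0)
```

Elements of S at positions where the context pattern C is −1 carry no information to the intermediate layer, which is the sense in which C selects the part of S that remains sensitive.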

In FIG. 12, a plurality of elements of the intermediate layer 01b are coupled to each of the m elements of the output layer 01c. Let Y = (y_1, y_2, ..., y_m) be the output pattern output by the output layer 01c. The j-th element y_j of the output pattern Y is given by equation (3) of Expression 1.

Here, w_ji is the connection weight from the element x_i to the element y_j (0 ≤ w_ji ≤ 1), and sgn(u) is the sign function: sgn(u) = 1 when u > 0, and sgn(u) = −1 otherwise (u ≤ 0).
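The image of equation (3) is not reproduced in this text. A form consistent with the definitions of the connection weight and of sgn(u) given here, and with equation (4)′, would be the following; the absence of a threshold term and the summation range are assumptions.

```latex
y_j = \operatorname{sgn}\!\left( \sum_{i=1}^{n} w_{ji}\, x_i \right) \qquad (3)
```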

Because there is no correlation between the input patterns S and C and the orthogonality between the input patterns S and C is low, orthogonal learning, which permits associative memory of input patterns S and C with low orthogonality, can be used as the feedback-type learning rule for the connection weights w_ji. Specifically, let T = (t_1, t_2, ..., t_m) be the target pattern, with j-th element t_j. Each time the input patterns S and C are presented, the connection weight w_ji is updated by adding the update value Δw_ji given by equation (4) of Expression 2.

Here, ε is a positive constant (for example, 0.3). Using equation (3), equation (4) can be written as equation (4)′.
Δw_ji = ε (t_j − y_j) x_i ... (4)′
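As a non-limiting illustration, the output computation and the update of equation (4)′ can be sketched in Python as follows. The sketch assumes that the output is y_j = sgn(Σ_i w_ji x_i), a form inferred from the surrounding definitions rather than copied from the specification. The function names and sample values are hypothetical, and the range 0 ≤ w_ji ≤ 1 stated in the text is not enforced here.

```python
import numpy as np

def sgn(u):
    """Sign function as defined in the text: +1 if u > 0, otherwise -1."""
    return np.where(u > 0, 1.0, -1.0)

def forward(W, x):
    """Assumed output rule: y_j = sgn(sum_i w_ji * x_i)."""
    return sgn(W @ x)

def update(W, x, t, eps=0.3):
    """Equation (4)': add eps * (t_j - y_j) * x_i to each w_ji."""
    y = forward(W, x)
    return W + eps * np.outer(t - y, x)

x = np.array([1.0, -1.0, 0.0, 0.0, 1.0, 0.0])  # intermediate pattern X (n = 6)
t = np.array([1.0, -1.0])                      # target pattern T (m = 2)
W = np.zeros((2, 6))                           # connection weights w_ji

for _ in range(5):                             # present the same pattern repeatedly
    W = update(W, x, t)

y = forward(W, x)
```

Because the update is proportional to x_i, the weights attached to desensitized elements (x_i = 0) are left unchanged, and the update vanishes once y_j equals t_j.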

(Product-type model and mutual modification model)
FIG. 13 is an explanatory diagram of layered neural networks to which the selective desensitization method is applied: FIG. 13A shows the product-type model and FIG. 13B the mutual modification model.
Layered neural networks to which the selective desensitization method is applied include the product-type model 01′ shown in FIG. 13A and the mutual modification model 02 shown in FIG. 13B.
In FIG. 13A, the product-type model 01′ has an input layer 01a′ composed of 2n elements, an intermediate layer 01b′ composed of n elements, and an output layer 01c′ composed of m elements. The product-type model 01′ is merely a simplified drawing in which the input pattern S of the input layer 01a′ is replaced by the arrow S; it has the same configuration as the layered neural network 01 shown in FIG. 12.

In FIG. 13B, the mutual modification model 02 has an input layer 02a composed of 2n elements, an intermediate layer 02b composed of 2n elements, and an output layer 02c composed of m elements. Whereas the intermediate layers 01b and 01b′ of the product-type models 01 and 01′ have only the elements (x_i (i = 1, 2, ..., n)) that output the product-type modification S(C) of the input pattern S, the intermediate layer 02b of the mutual modification model 02 has both the elements (x_i (i = 1, 2, ..., n)) that output the product-type modification S(C) of the input pattern S and the elements (x_i′ (i = 1, 2, ..., n)) that output the product-type modification C(S) of the input pattern C. That is, the intermediate pattern X′ of the intermediate layer 02b is X′ = (x_1, x_2, ..., x_n, x_1′, x_2′, ..., x_n′).
In the mutual modification model 02, the input patterns S and C modify each other by product-type context modification: the input pattern S is modified by the input pattern C to give the product-type modification S(C), and the input pattern C is likewise modified by the input pattern S to give the product-type modification C(S) (see equations (1) to (4)). As a result, the mutual modification model 02 achieves a further improvement in learning ability (generalization ability) over the product-type model 01′ (see Non-Patent Document 3).
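As a non-limiting illustration, the intermediate pattern X′ of the mutual modification model can be sketched in Python as follows. The sketch assumes the simplest gain rule g = (1 + c) / 2 and ave(x_i) = 0; the function names and sample patterns are hypothetical.

```python
import numpy as np

def gain(pattern):
    """Gain vector derived from a +1/-1 pattern: +1 -> 1, -1 -> 0."""
    return (1.0 + np.asarray(pattern, dtype=float)) / 2.0

def product_modification(s, c):
    """S(C): pattern s selectively desensitized by context pattern c."""
    return gain(c) * np.asarray(s, dtype=float)

def mutual_modification(s, c):
    """X' = (S(C), C(S)): 2n elements for two n-element input patterns."""
    return np.concatenate([product_modification(s, c),
                           product_modification(c, s)])

s = np.array([1, -1, 1, -1])
c = np.array([1, 1, -1, -1])
x_prime = mutual_modification(s, c)
```

The first n elements correspond to the product-type model, and the second n elements add the modification in the opposite direction, so that each pattern serves in turn as the context for the other.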

Non-Patent Document 1: Etsushi Kato (加藤悦史), "On General Linear Models", [online], September 2002, Hokkaido University, [retrieved August 6, 2007], Internet <URL: http://www.neurosci.aist.go.jp/~kurita/lecture/prnn/prnn.html>
Non-Patent Document 2: Takio Awata (粟田多喜男), "Pattern Recognition and Neural Networks", [online], February 2001, National Institute of Advanced Industrial Science and Technology, [retrieved August 7, 2007], Internet <URL: http://hosho.ees.hokudai.ac.jp/~kato/seminar/020909/slide_bold.pdf>
Non-Patent Document 3: Masahiko Morita (森田昌彦) and three others, "Information Integration Ability of Layered Neural Networks with the Selective Desensitization Method", Transactions of the Institute of Electronics, Information and Communication Engineers, December 2004, Vol. J87-D-II, No. 12, pp. 2242-2252
Non-Patent Document 4: Koichiro Mori (森紘一郎), Hayato Yamana (山名早人), "Speeding Up Learning by Parallelizing Reinforcement Learning", [online], March 2004, Waseda University, [retrieved August 8, 2007], Internet <URL: http://www.yama.info.waseda.ac.jp/publications/2003/0403_mori.pdf>
Non-Patent Document 5: Shinichi Asakawa (浅川伸一), "Reinforcement Learning", [online], July 2002, Tokyo Woman's Christian University, [retrieved August 22, 2007], Internet <URL: http://www.twcu.ac.jp/~asakawa/chiba2002/lect8-Reinforcement/ReinforcementLearning.pdf>

(Problems of the prior art)

However, when a function is approximated by a linear model, only functions known in advance to be linearizable can be approximated accurately, and accurate approximation requires raising the order of the linear polynomial and using as much input and output sample data as possible. When a function is approximated by the table reference method, the size of the table grows exponentially with the number of variables. For example, if there are n state variables each of which can take 100 values, a table with 100^n entries is needed to represent all states. Accurate approximation therefore requires a number of samples as close as possible to 100^n, so building the table takes an enormous amount of time and storage.
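The growth of the table can be made concrete with a short calculation. The following Python sketch simply evaluates 100^n for a few values of n; the function name table_entries and the chosen values of n are illustrative only.

```python
def table_entries(values_per_variable: int, num_variables: int) -> int:
    """Number of table entries needed to cover every combination of states."""
    return values_per_variable ** num_variables

# Table size for n state variables with 100 possible values each.
sizes = {n: table_entries(100, n) for n in (1, 2, 4, 8)}
for n, size in sizes.items():
    print(f"n = {n}: {size} entries")
```

With only four state variables the table already has 10^8 entries, and each additional variable multiplies the size by a further factor of 100.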

Likewise, when a function is approximated by the multilayer perceptron or by the RBF (GRBF) network, the number of samples needed for learning grows with the number of elements in the input layer, as with the table reference method, so accurate approximation requires an enormous learning time.
The multilayer perceptron in particular loses its expected learning ability (generalization ability) as the number of independent input patterns increases and the task becomes more complex (see Non-Patent Document 3). The same problem is considered to exist in the RBF (GRBF) network, which differs from the multilayer perceptron only in its algorithm.
Consequently, these known techniques cannot accurately approximate strongly nonlinear functions, and unless the number of input patterns is small they require an enormous amount of sample data and an enormous learning time.

The product-type models 01 and 01′ (see FIGS. 12 and 13A) and the mutual modification model 02 (see FIG. 13B), to which the selective desensitization method is applied, have higher learning ability (generalization ability) than the multilayer perceptron and can learn accurately from a small number of training iterations (see Non-Patent Document 3). However, the layered neural networks 01, 01′, and 02 assume only the case in which the intermediate patterns X and X′ (X = S(C); X′ = S(C), C(S)) are output for two input patterns S and C. When the number of input patterns increases to three or more, these networks cannot be applied, and the function can only be approximated by the known techniques (linear model, table reference method, layered neural networks such as the multilayer perceptron), so the problems described above remain unsolved. Moreover, the known layered neural networks 01, 01′, and 02 have been tested only on simple linear functions (for example, f(x, y) = x − y); their effectiveness for approximating complex nonlinear functions has not been verified, and no practical application has been proposed.

In view of the circumstances described above, the technical problem addressed by the present invention is the following (O01).
(O01) To approximate a function accurately even when the number of input variables is large, and to reduce the learning cost required for the approximation.

To solve the above technical problem, the function approximation device of the invention according to claim 1 is
a function approximation device that approximates a function, namely the relationship between an input variable and an output variable, by means of a layered neural network having: an input layer composed of input elements to which values of the input variable are input; an intermediate layer composed of intermediate elements that are coupled to the input elements and that output values of an intermediate variable computed from the values input to the input elements; and an output layer composed of output elements that are coupled to the intermediate elements and that output a value of the output variable computed from the values input to the intermediate elements,
the function approximation device comprising:
the input layer, composed of input elements to which the respective values of three or more input variables are input;
input variable input means for inputting the respective values of the three or more input variables;
intermediate variable computing means in which an input variable set is defined by pairing any two of the three or more input variables, and which computes each value of the intermediate variable from the values of one input variable of the input variable set and the first output sensitivities computed from the values of the other input variable of the input variable set;
output variable computing means for computing the value of the output variable from the intermediate variable and connection weights set according to the degree of importance given to the values of the intermediate variable; and
connection weight learning means for learning the connection weights by updating them on the basis of the difference between the value of the output variable and a prestored actual value of the function.
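As a non-limiting illustration, the pairing of three or more input variables described for claim 1 can be sketched in Python as follows. Several points are assumptions made only for this sketch and are not taken from the specification: each input variable is taken to be already encoded as a pattern of 1 and −1 values, every unordered pair of variables is used as an input variable set, the output is a plain weighted sum, and the weights are updated by a normalized delta rule. All names and numerical values are hypothetical.

```python
import numpy as np
from itertools import combinations

def gain(pattern):
    """Output sensitivity per element: 1 where the pattern is +1, 0 where it is -1."""
    return (1.0 + pattern) / 2.0

def intermediate_values(patterns, pairs):
    """For each set (a, b), desensitize pattern a using the gains from pattern b."""
    return np.concatenate([gain(patterns[b]) * patterns[a] for a, b in pairs])

# Hypothetical, hand-picked +1/-1 code patterns for four input variables.
patterns = {
    "x":     np.array([ 1.0, -1.0,  1.0, -1.0]),
    "v":     np.array([ 1.0,  1.0, -1.0, -1.0]),
    "theta": np.array([ 1.0, -1.0, -1.0,  1.0]),
    "omega": np.array([-1.0,  1.0,  1.0,  1.0]),
}
pairs = list(combinations(patterns, 2))       # six input variable sets
xi = intermediate_values(patterns, pairs)     # 6 * 4 = 24 intermediate values

w = np.zeros(xi.size)                         # connection weights
target = 0.5                                  # stored "actual" value of the function
errors = []
for _ in range(10):
    y = w @ xi                                # value of the output variable
    errors.append(abs(target - y))
    # Normalized delta rule (an assumption): step proportional to the difference.
    w += 0.3 * (target - y) * xi / (xi @ xi)
```

The size of the intermediate layer grows with the number of input variable sets rather than exponentially with the number of input variables, which is the point of forming pairs.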

The invention according to claim 2 is the function approximation device according to claim 1, comprising:
the intermediate variable computing means, which has first intermediate variable computing means for computing each value of a first intermediate variable from the values of one input variable of the input variable set and the first output sensitivities computed from the values of the other input variable of the input variable set, and second intermediate variable computing means for computing each value of a second intermediate variable from the values of the other input variable of the input variable set and second output sensitivities computed from the values of the one input variable of the input variable set; and
the output variable computing means, which computes the value of the output variable from the first intermediate variable, the second intermediate variable, and connection weights set according to the degree of importance given to the values of the first intermediate variable and the second intermediate variable.

The invention according to claim 3 is the function approximation device according to claim 1 or 2, wherein the intermediate layer is composed of intermediate elements that output the respective values of a plurality of intermediate variables, and a plurality of the input variable sets are configured such that every one of the three or more input variables is set as the one or the other input variable of at least one input variable set.
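As a non-limiting illustration, the condition that every input variable belongs to at least one input variable set can be checked as follows in Python. The function name covers_all and the example sets are hypothetical.

```python
from itertools import combinations

def covers_all(variables, pairs):
    """True if every variable appears in at least one input variable set."""
    used = {name for pair in pairs for name in pair}
    return set(variables) <= used

variables = ["x", "v", "theta", "omega"]

all_pairs = list(combinations(variables, 2))                  # every pair of variables
chain = [("x", "v"), ("v", "theta"), ("theta", "omega")]      # a smaller covering choice
partial = [("x", "v"), ("x", "theta")]                        # omega is never used

results = (covers_all(variables, all_pairs),
           covers_all(variables, chain),
           covers_all(variables, partial))
```

The condition does not require all possible pairs; any selection of sets in which each variable appears at least once satisfies it.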

To solve the above technical problem, the reinforcement learning system of the invention according to claim 4 comprises:
a control device as the object whose actions are controlled;
state measuring means for measuring the state of the control device;
reward acquiring means for acquiring a reward for the action;
action value function computing means for computing, on the basis of a prediction of the rewards obtainable in the future, an action value function, which is an evaluation value for evaluating all actions in the measured state;
the function approximation device according to any one of claims 1 to 3, which approximates the action value function in the measured state by treating the measurement values obtained in that state as the values of the input variables;
action selecting means for selecting the action on the basis of the approximated action value function;
action executing means for executing the selected action; and
control end determining means for determining whether to end control of the control device by determining, on the basis of the reward, whether control of the control device has failed.
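As a non-limiting illustration, the order in which the means of claim 4 interact (measure the state, evaluate the actions, select, execute, acquire the reward, decide whether to end) can be sketched in Python as follows. This is only a control-flow skeleton under stated assumptions: the controlled device ToyPlant is a hypothetical toy, the function q_value is a fixed hand-written stand-in for the approximated action value function, and no learning is performed.

```python
class ToyPlant:
    """Hypothetical controlled device: a position that must stay within [-2, 2]."""

    def __init__(self):
        self.pos = 0

    def measure_state(self):
        return self.pos

    def execute(self, action):
        # action is -1 (move left) or +1 (move right)
        self.pos += action

    def reward(self):
        # A negative reward signals that control has failed.
        return -1.0 if abs(self.pos) > 2 else 0.0


def q_value(state, action):
    """Stand-in for the approximated action value: prefers moves toward 0."""
    return -abs(state + action)


def run_episode(plant, max_steps=20):
    steps = 0
    for _ in range(max_steps):
        state = plant.measure_state()                              # state measurement
        action = max((-1, +1), key=lambda a: q_value(state, a))    # action selection
        plant.execute(action)                                      # action execution
        r = plant.reward()                                         # reward acquisition
        steps += 1
        if r < 0:                                                  # control end determination
            break
    return steps


plant = ToyPlant()
steps = run_episode(plant)
```

In a full system the stand-in q_value would be replaced by the output of the function approximation device, with its inputs taken from the measured state.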

The invention according to claim 5 is the reinforcement learning system according to claim 4, comprising the function approximation device according to any one of claims 1 to 3 that has the connection weight learning means, which learns the connection weights when it is determined that control of the control device has failed.

To solve the above technical problem, the function approximation system of the invention according to claim 6 is
a function approximation system that approximates a function, namely the relationship between an input variable and an output variable, by means of a layered neural network having: an input layer composed of input elements to which values of the input variable are input; an intermediate layer composed of intermediate elements that are coupled to the input elements and that output values of an intermediate variable computed from the values input to the input elements; and an output layer composed of output elements that are coupled to the intermediate elements and that output a value of the output variable computed from the values input to the intermediate elements,
the function approximation system comprising:
the input layer, composed of input elements to which the respective values of three or more input variables are input;
input variable input means for inputting the respective values of the three or more input variables;
intermediate variable computing means in which an input variable set is defined by pairing any two of the three or more input variables, and which computes each value of the intermediate variable from the values of one input variable of the input variable set and the first output sensitivities computed from the values of the other input variable of the input variable set;
output variable computing means for computing the value of the output variable from the intermediate variable and connection weights set according to the degree of importance given to the values of the intermediate variable; and
connection weight learning means for learning the connection weights by updating them on the basis of the difference between the value of the output variable and a prestored actual value of the function.

In order to solve the technical problem, the function approximation program of the invention according to claim 7 approximates a function, namely the relationship between input variables and an output variable, by causing a computer to function as:
input variable input means for inputting the respective values of three or more input variables into a layered neural network having an input layer composed of input elements to which the respective values of the three or more input variables are input, an intermediate layer composed of intermediate elements that are coupled to the input elements and that output the values of intermediate variables computed from the values input to the input elements, and an output layer composed of output elements that are coupled to the intermediate elements and that output the value of an output variable computed from the values input to the intermediate elements;
intermediate variable computing means in which input variable sets, each consisting of any two of the three or more input variables, are set, and which computes each value of the intermediate variables based on the values of one input variable of each input variable set and on the first output sensitivities computed from the values of the other input variable of that set;
output variable computing means for computing the value of the output variable based on the intermediate variables and on connection weights set according to the degree of importance of the values of the intermediate variables; and
connection weight learning means for learning the connection weights by updating them based on the difference between the value of the output variable and a prestored actual value of the function.

According to the invention of claim 1, each value of the intermediate variables is computed based on the values of one input variable of an input variable set and on the first output sensitivities computed from the values of the other input variable of that set. A layered neural network based on the selective desensitization method, which has high learning ability (generalization ability), can therefore be applied even to functions having three or more input variables. Such functions were outside the scope of conventionally known layered neural networks using the selective desensitization method, which assumed only the case of outputting intermediate variables and an output variable for two input variables. As a result, even when the function has many input variables, learning ability (generalization ability) is improved over conventionally known techniques (linear models, table lookup methods, layered neural networks such as multilayer perceptrons, and the like). The function can therefore be approximated accurately, and the computational cost of the learning needed to approximate it, such as the amount of computation and the computation time, can be reduced. Further, according to the invention of claim 1, when the values of a plurality of intermediate variables are computed from a plurality of input variable sets, they can be computed in parallel, for example on a plurality of pieces of hardware. In this case, the intermediate variables can be computed at high speed, and the computation time for approximating the function can be reduced.

According to the invention of claim 2, the first intermediate variable, in which one input variable of the input variable set is selectively desensitized, can be computed, and the second intermediate variable, in which the other input variable of the set is selectively desensitized, can also be computed. That is, two intermediate variables in which the input variables of the set are mutually selectively desensitized can be computed. As a result, learning ability (generalization ability) is improved compared with the case where the input variables of the set are not mutually selectively desensitized. The function can therefore be approximated accurately, and the computational cost of the learning needed to approximate it, such as the amount of computation and the computation time, can be reduced.
According to the invention of claim 3, a plurality of selectively desensitized intermediate variables can be computed based on all of the three or more input variables. As a result, learning ability (generalization ability) is improved compared with the case where the output variable is computed without some of the input variables taking part in the computation of the intermediate variables. The function can therefore be approximated accurately, and the computational cost of the learning needed to approximate it, such as the amount of computation and the computation time, can be reduced.
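The pairing described for claims 2 and 3 can be pictured with a minimal sketch. Everything concrete in it is an assumption made for illustration and is not taken from this section: each input variable is assumed to be coded as a pattern of +1/-1 elements, the output sensitivity (gain) is assumed to be 1 where the other variable's element is +1 and 0 where it is -1, and the names `desensitize` and `intermediate_variables` are hypothetical. The sketch only shows that, with four input variables, mutual desensitization over every pair yields twelve intermediate patterns, each of which can be computed independently of the others.

```python
from itertools import permutations


def desensitize(pattern, modifier):
    """Zero out elements of `pattern` wherever `modifier` is -1 (assumed rule)."""
    if len(pattern) != len(modifier):
        raise ValueError("pattern and modifier must have the same length")
    # Assumed gain: 1 where the modifier element is +1, 0 where it is -1.
    gains = [(1 + m) // 2 for m in modifier]
    return [gain * p for gain, p in zip(gains, pattern)]


def intermediate_variables(codes):
    """Return one desensitized pattern per ordered pair of input variables."""
    return {
        (target, other): desensitize(codes[target], codes[other])
        for target, other in permutations(codes, 2)
    }


# Hypothetical 4-element codes for the four input variables x, v, theta, omega.
codes = {
    "x": [+1, -1, +1, -1],
    "v": [+1, +1, -1, -1],
    "theta": [-1, +1, +1, -1],
    "omega": [-1, -1, +1, +1],
}
layers = intermediate_variables(codes)
print(len(layers))          # number of ordered pairs
print(layers[("x", "v")])   # x desensitized by v
print(layers[("v", "x")])   # v desensitized by x
```

Because each dictionary entry depends only on its own pair of codes, the entries could be evaluated in parallel, which is the property the effect of claim 1 relies on.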

According to the invention of claim 4, the function approximation device approximates the action value function, so that an appropriate action can be selected and executed even in an unknown state. Compared with conventionally known reinforcement learning systems that do not approximate the action value function for unknown states, learning ability (generalization ability) is therefore higher, as is the ability to cope with unknown environments that include disturbances, observation noise, and the like.
According to the invention of claim 5, the connection weights are learned when it is determined that control of the control device has failed and that the control is to be ended. The connection weights can therefore be learned based on the action value function computed so that the largest number of states can be evaluated. As a result, the computational cost of the learning needed to approximate the function, such as the amount of computation and the computation time, can be reduced compared with learning the connection weights every time the action value function in each state is computed.

According to the invention of claim 6, each value of the intermediate variables is computed based on the values of one input variable of an input variable set and on the first output sensitivities computed from the values of the other input variable of that set. A layered neural network based on the selective desensitization method, which has high learning ability (generalization ability), can therefore be applied even to functions having three or more input variables. Such functions were outside the scope of conventionally known layered neural networks using the selective desensitization method, which assumed only the case of outputting intermediate variables and an output variable for two input variables. As a result, even when the function has many input variables, learning ability (generalization ability) is improved over conventionally known techniques (linear models, table lookup methods, layered neural networks such as multilayer perceptrons, and the like). The function can therefore be approximated accurately, and the computational cost of the learning needed to approximate it, such as the amount of computation and the computation time, can be reduced. Further, according to the invention of claim 6, when the values of a plurality of intermediate variables are computed from a plurality of input variable sets, they can be computed in parallel, for example on a plurality of pieces of hardware. In this case, the intermediate variables can be computed at high speed, and the computation time for approximating the function can be reduced.

According to the invention of claim 7, each value of the intermediate variables is computed based on the values of one input variable of an input variable set and on the first output sensitivities computed from the values of the other input variable of that set. A layered neural network based on the selective desensitization method, which has high learning ability (generalization ability), can therefore be applied even to functions having three or more input variables. Such functions were outside the scope of conventionally known layered neural networks using the selective desensitization method, which assumed only the case of outputting intermediate variables and an output variable for two input variables. As a result, even when the function has many input variables, learning ability (generalization ability) is improved over conventionally known techniques (linear models, table lookup methods, layered neural networks such as multilayer perceptrons, and the like). The function can therefore be approximated accurately, and the computational cost of the learning needed to approximate it, such as the amount of computation and the computation time, can be reduced. Further, according to the invention of claim 7, when the values of a plurality of intermediate variables are computed from a plurality of input variable sets, they can be computed in parallel, for example on a plurality of pieces of hardware. In this case, the intermediate variables can be computed at high speed, and the computation time for approximating the function can be reduced.

Next, specific examples of embodiments of the present invention (hereinafter referred to as examples) are described with reference to the drawings, but the present invention is not limited to the following examples.
To make the following description easier to follow, in the drawings the front-rear direction is the X-axis direction, the left-right direction is the Y-axis direction, and the up-down direction is the Z-axis direction. The directions or sides indicated by the arrows X, -X, Y, -Y, Z, and -Z are referred to as front, rear, right, left, up, and down, or as the front side, rear side, right side, left side, upper side, and lower side, respectively.
In the drawings, a "○" containing a "・" denotes an arrow pointing from the back of the page toward the front, and a "○" containing a "×" denotes an arrow pointing from the front of the page toward the back.
In the following description with reference to the drawings, members other than those needed for the description are omitted from the illustrations as appropriate for ease of understanding.

FIG. 1 is an overall explanatory diagram of the function approximation system of Example 1 of the present invention.
In FIG. 1, the function approximation system (reinforcement learning system) S of Example 1 is, as one example, a control system configured as a simulator for solving the task of keeping a so-called inverted pendulum from falling over. The function approximation system S has a cart (control device) 1 as an example of a moving body that can move in the left-right direction (Y-axis direction). The cart 1 supports a rod 2 serving as the inverted pendulum. The base end of the rod 2 is rotatably supported at the center of the upper surface of the cart 1, and the tip of the rod 2 is free to swing in the left-right direction (Y-axis direction).
The cart 1 is provided with drive wheels 1a, by which the cart 1 can move over a horizontal floor surface 3. The drive wheels 1a are driven by a drive unit (drive motor) 1b built into the cart 1, and the drive of the drive unit 1b is controlled by a control unit (function approximation device) C built into the cart 1. A right end wall 3a and a left end wall 3b are provided at the two ends of the floor surface 3 in the left-right direction (Y-axis direction).

(Equations of motion of the inverted pendulum)
Here, as shown in FIG. 1, let x [m] be the position of the cart 1, that is, the distance by which the center of gravity of the cart 1 is displaced to the right (+Y direction) from the center of the floor surface 3 in the left-right direction (Y-axis direction). Let θ [deg] be the angle by which the rod 2 is tilted to the right (+Y direction) from the vertical direction (Z-axis direction), and let F [N] be the force applied to the cart 1 in the rightward direction (+Y direction) when the control unit C drives the drive unit 1b.
Further, let M [kg] be the mass of the cart 1; v [m/s] the velocity at which the cart 1 moves to the right (+Y direction) (v = (d/dt)x); m [kg] the mass of the rod 2; L [m] the length from the base end of the rod 2 to its center of gravity, that is, half the length of the rod 2; ω [deg/s] the angular velocity at which the rod 2 rotates clockwise with respect to the up-down direction (Z-axis direction) (ω = (d/dt)θ); g [m/s^2] the gravitational acceleration; a the acceleration of the cart 1 when it moves to the right (+Y direction) (a = (d/dt)v [m/s^2]); and b the angular acceleration of the rod 2 when it rotates clockwise with respect to the up-down direction (Z-axis direction) (b = (d/dt)ω [deg/s^2]). If the friction between the cart 1 and the rod 2 and the friction between the drive wheels 1a of the cart 1 and the floor surface 3 are negligible, the equations of motion of the inverted pendulum problem, which is the task here, are given by the following equations (5-1) and (5-2).

$$(M+m)\,a + (mL\cos\theta)\,b - mL\omega^{2}\sin\theta = F \tag{5-1}$$
$$(mL\cos\theta)\,a + \left(\tfrac{4}{3}mL^{2}\right)b - mgL\sin\theta = 0 \tag{5-2}$$
Solving equations (5-1) and (5-2) as simultaneous equations for the variables a and b gives the following equations (6-1) and (6-2).
$$a = \frac{\tfrac{4}{3}\left(F + mL\omega^{2}\sin\theta\right) - mg\sin\theta\cos\theta}{\tfrac{4}{3}(M+m) - m\cos^{2}\theta} \tag{6-1}$$
$$b = \frac{(M+m)\,g\sin\theta - \left(F + mL\omega^{2}\sin\theta\right)\cos\theta}{\left\{\tfrac{4}{3}(M+m) - m\cos^{2}\theta\right\}L} \tag{6-2}$$
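Equations (6-1) and (6-2) can be checked numerically by substituting their results back into equations (5-1) and (5-2). The following is a minimal sketch of that check, not the implementation used in Example 1. The function name `accelerations` and the sample values are hypothetical, and the angle and angular velocity are assumed to be passed in radians and radians per second so that the trigonometric functions apply directly.

```python
import math


def accelerations(theta, omega, force, M, m, L, g):
    """Return (a, b) from equations (6-1) and (6-2).

    Assumption: theta is in rad and omega in rad/s.
    """
    sin_t, cos_t = math.sin(theta), math.cos(theta)
    drive = force + m * L * omega**2 * sin_t          # F + m L w^2 sin(theta)
    denom = 4.0 / 3.0 * (M + m) - m * cos_t**2        # shared denominator
    a = (4.0 / 3.0 * drive - m * g * sin_t * cos_t) / denom
    b = ((M + m) * g * sin_t - drive * cos_t) / (denom * L)
    return a, b


# Hypothetical sample point used only for the consistency check.
M, m, L, g = 1.0, 0.1, 1.0, 9.8
theta, omega, F = 0.3, -0.5, 20.0
a, b = accelerations(theta, omega, F, M, m, L, g)

# Residuals of (5-1) and (5-2); both should vanish up to rounding error.
res1 = (M + m) * a + m * L * math.cos(theta) * b - m * L * omega**2 * math.sin(theta) - F
res2 = m * L * math.cos(theta) * a + 4.0 / 3.0 * m * L**2 * b - m * g * L * math.sin(theta)
print(abs(res1) < 1e-9, abs(res2) < 1e-9)
```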

Further, let k be a natural number (k = 0, 1, 2, ...) and let t be a small time step. At the times k×t and (k+1)×t, let the position of the cart 1 be x(k) and x(k+1), its velocity v(k) and v(k+1), the angle of the rod 2 θ(k) and θ(k+1), and its angular velocity ω(k) and ω(k+1). Let a(k) be the acceleration of the cart 1 and b(k) the angular acceleration of the rod 2 at the time k×t. The variables x(k+1), v(k+1), θ(k+1), and ω(k+1) at the time (k+1)×t are then given by the following equations (7-1) to (7-4), using the conventionally known Runge-Kutta method (RK method). The Runge-Kutta method is an algorithm for solving ordinary differential equations by so-called successive computation: starting from preset initial values, it computes the increment of the target ordinary differential equation and from it the value at the next point, step by step.

$$x(k+1) = x(k) + t\,v(k) \tag{7-1}$$
$$v(k+1) = v(k) + t\,a(k) \tag{7-2}$$
$$\theta(k+1) = \theta(k) + t\,\omega(k) \tag{7-3}$$
$$\omega(k+1) = \omega(k) + t\,b(k) \tag{7-4}$$

In Example 1, the following values are preset: the mass M of the cart 1 is 1.0 [kg] (M = 1.0), the mass m of the rod 2 is 0.1 [kg] (m = 0.1), the length L of the rod 2 is 1.0 [m] (L = 1.0), the gravitational acceleration g is 9.8 [m/s^2] (g = 9.8), and the time step t is 0.02 [s] (t = 0.02). The distance from the right end wall 3a to the left end wall 3b on the floor surface 3 is preset to 4.8 [m]. That is, the position x of the cart 1 is restricted in advance to the range -2.4 to 2.4 [m] (x = -2.4 to 2.4).
Further, letting F(0) be the force first applied to the cart 1, the force F(0) is preset to 20.0 [N] (F(0) = 20.0). Letting x(0) be the position and v(0) the velocity of the cart 1 in the initial state, and θ(0) the angle and ω(0) the angular velocity of the rod 2 in the initial state, the variables x(0), v(0), θ(0), and ω(0) are given by the following equations (7-1)′ to (7-4)′.

$$x(0) = 0.0 \tag{7-1$'$}$$
$$v(0) = 0.0 \tag{7-2$'$}$$
$$-3.0 \le \theta(0) \le 3.0 \quad \text{(a random number set in the range } -3 \text{ to } 3\text{)} \tag{7-3$'$}$$
$$\omega(0) = 0.0 \tag{7-4$'$}$$

Therefore, in the function approximation system S of Example 1, the variables x, v, θ, and ω at each time step t can be computed from equations (6-1), (6-2), (7-1) to (7-4), and (7-1)′ to (7-4)′, the constants M, m, L, g, and t, and the variables F, a, b, and k. In the function approximation system S of Example 1, the control unit C controls the drive of the drive unit 1b, based on the computed variables x, v, θ, and ω, so that an appropriate force F is applied to the cart 1.
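The per-step computation summarized above can be sketched as a short simulation loop. This is an illustration under stated assumptions, not the simulator of Example 1. The update in equations (7-1) to (7-4) is applied exactly as written, which is the first-order (forward Euler) form of a Runge-Kutta step. The initial angle is drawn in degrees and converted to radians for the trigonometric functions, the run is stopped when the cart leaves the range of -2.4 to 2.4 m, and the policy `push_toward_tilt` is a hypothetical stand-in for the controller; none of these three choices is taken from this section.

```python
import math
import random

# Constants given for Example 1.
M, m, L, g, dt = 1.0, 0.1, 1.0, 9.8, 0.02
X_LIMIT = 2.4  # cart position range is -2.4 to 2.4 m


def accelerations(theta, omega, force):
    """Equations (6-1) and (6-2); theta in rad, omega in rad/s (assumed units)."""
    sin_t, cos_t = math.sin(theta), math.cos(theta)
    drive = force + m * L * omega**2 * sin_t
    denom = 4.0 / 3.0 * (M + m) - m * cos_t**2
    a = (4.0 / 3.0 * drive - m * g * sin_t * cos_t) / denom
    b = ((M + m) * g * sin_t - drive * cos_t) / (denom * L)
    return a, b


def rollout(policy, max_steps, rng):
    """Simulate one run and return the visited states (x, v, theta, omega)."""
    # Initial state per (7-1)' to (7-4)': only the angle is random.
    state = (0.0, 0.0, math.radians(rng.uniform(-3.0, 3.0)), 0.0)
    states = [state]
    for _ in range(max_steps):
        x, v, theta, omega = state
        a, b = accelerations(theta, omega, policy(state))
        # Equations (7-1) to (7-4), applied with the old state on the right.
        state = (x + dt * v, v + dt * a, theta + dt * omega, omega + dt * b)
        states.append(state)
        if abs(state[0]) > X_LIMIT:
            break  # assumed stop condition: cart left the allowed range
    return states


def push_toward_tilt(state):
    """Hypothetical policy: push with 20 N in the direction the rod leans."""
    return 20.0 if state[2] >= 0.0 else -20.0


states = rollout(push_toward_tilt, max_steps=50, rng=random.Random(0))
print(len(states), states[-1])
```

Passing a seeded `random.Random` instance keeps a run reproducible while still drawing θ(0) at random.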

(Description of the control unit C of Example 1)
FIG. 2 is a block diagram (functional block diagram) showing the functions of the control unit of the cart of Example 1 of the present invention.
The control unit C is a microcomputer having a ROM (read-only memory) that stores the programs, data, and the like for the necessary processing, a RAM (random access memory) for temporarily storing necessary data, a CPU (central processing unit) that performs processing according to the programs stored in the ROM, a clock oscillator, and the like. It can realize various functions by executing the programs stored in the ROM.
A function approximation program AP1 is stored in the ROM of the control unit C of Example 1.

(Function approximation program AP1)
FIG. 3 is a brief explanatory diagram of reinforcement learning.
The function approximation program AP1 has the following functional means (program modules).
C1: Inverted pendulum control start determining means
The inverted pendulum control start determining means C1 has start state storage means C1A and determines whether to start the inverted pendulum control process, which controls the movement of the cart 1 in the left-right direction (Y-axis direction) so that the rod 2, serving as the inverted pendulum, does not fall over. In the inverted pendulum control process of Example 1, the drive of the drive unit 1b is controlled based on the reinforcement learning shown in FIG. 3 so that the rod 2 does not fall over.

(Reinforcement learning)
Reinforcement learning is a learning scheme in which an agent, the acting entity, adapts to the environment to be controlled through trial and error, using rewards as a guide. Here, as shown in FIG. 3, let A be the agent (control device), E the environment, t a certain time, and t+1 the time following t. Let a_t and a_{t+1} be the actions at the times t and t+1, s_t and s_{t+1} the states of the environment E at the times t and t+1, and r_t and r_{t+1} the rewards obtained from the environment E at the times t and t+1.
In reinforcement learning, when the current state of the environment E to be controlled is s_t, the agent A takes some action a_t and acts directly on the environment E. As a result, the state of the environment E changes to s_{t+1}, and the agent A obtains a reward r_t from the environment E. The aim of the agent A is to maximize the reward r_t. That is, in reinforcement learning there is no teacher or the like that explicitly indicates the correct action a_t for an input, as there is in ordinary machine learning. Instead, learning is based on the reward r_t given by the environment E as a result of the action a_t.

The agent A holds a predicted reward (r_t), that is, the reward it can expect to obtain from each action a_t in the current state s_t of the environment E. The predicted reward (r_t) thus serves as the priority of the action a_t, that is, as its evaluation value. The agent A then acts by trial and error so as to maximize the reward r_t obtained.
However, the reward r_t may be subject to delay, noise, and the like. For example, an action a_t may affect not only the immediate state s_t but also the subsequent state (s_{t+1}), and hence all subsequent rewards (r_{t+1}). For this reason, reinforcement learning offers a choice, depending on the task, between experience-reinforcement-type algorithms, which evaluate in units of episodes, and environment-identification-type algorithms, which evaluate in units of steps. Here, with s_0 as the initial state, a step is the unit that runs until the agent A actually executes an action a_t based on a state s_t observed in the environment E. An episode is the set of steps from the initial state s_0 until a state (s_t) in which the goal has been achieved, or a state (s_t) of failure, is reached.

Experience-reinforcement-type reinforcement learning preferentially selects actions a_t that earn a high reward r_t during learning. Learning therefore starts quickly, but is likely to fall into a local solution (local minimum). Environment-identification-type reinforcement learning, by contrast, finds the optimal solution by trying all actions a_t in all states s_t of the environment E. Example 1 therefore performs reinforcement learning by Q-learning, a representative environment-identification-type algorithm.

(Q-learning)
Let Q(s_t, a_t) denote the evaluation value of the action a_t executed in the state s_t at time step t. Q-learning is an algorithm that predicts and stores the evaluation value (Q(s_t, a_t)) of each action (a_t) in each state (s_t), and selects the action (a_t) in a given state (s_t) based on the stored evaluation values (Q(s_t, a_t)). A specific algorithm for one episode of Q-learning is illustrated by the following steps (1) to (5).

(Specific example of the Q-learning algorithm)
(1) The agent A observes the state s_t of the environment E.
(2) The agent A executes an action a_t according to an arbitrary action selection method.
(3) The agent A obtains a reward r_t from the environment E and observes the state s_{t+1} after the state transition.
(4) The agent A updates the action value function Q(s_t, a_t), which is the evaluation value, according to the following equation (8).
$$Q(s_t, a_t) = (1-\alpha)\,Q(s_t, a_t) + \alpha\left[r_t + \gamma \max_{a_{t+1}} Q(s_{t+1}, a_{t+1})\right] \tag{8}$$
Here, α is the learning rate, a parameter expressing how much the action value function Q(s_t, a_t) changes in one update (0 < α ≤ 1), and γ is the discount rate, a parameter expressing how much the reward (r_{t+1}) expected in the future is discounted before being reflected in the current evaluation (0 < γ ≤ 1). The term max_{a_{t+1}} Q(s_{t+1}, a_{t+1}) is the maximum of the evaluation values Q(s_{t+1}, a_{t+1}) over all actions a_{t+1} in the state s_{t+1}.
(5) It is determined whether a state in which the goal has been achieved (success state) or a state of failure (failure state) has been reached. If so, the episode ends. Otherwise, the time is advanced from t to t+1 and processing returns to step (2).
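The update in step (4) can be written out as a short tabular sketch. It stores Q in a dictionary keyed by (state, action) pairs, which is the plain table form of Q-learning rather than the neural network approximation that is the subject of the invention. The function name `q_update`, the state and action labels, and the numbers in the usage lines are hypothetical and chosen only to make the arithmetic of equation (8) easy to follow.

```python
from collections import defaultdict


def q_update(q, state, action, reward, next_state, actions, alpha, gamma):
    """Apply equation (8) to one observed transition and return the new value."""
    # Largest evaluation value over all actions available in the next state.
    best_next = max(q[(next_state, a)] for a in actions)
    q[(state, action)] = (1 - alpha) * q[(state, action)] + alpha * (
        reward + gamma * best_next
    )
    return q[(state, action)]


# Hypothetical two-action example; unseen entries default to 0.0.
q = defaultdict(float)
actions = ["left", "right"]
q[("s1", "right")] = 2.0

new_value = q_update(q, "s0", "left", reward=1.0, next_state="s1",
                     actions=actions, alpha=0.1, gamma=0.95)
# (1 - 0.1) * 0.0 + 0.1 * (1.0 + 0.95 * 2.0) = 0.29 up to rounding
print(new_value)
```

Steps (1) to (3) and (5) would wrap this call in a loop that observes the state, selects an action, and stops when the episode ends.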

なお、実施例１では、処理（２）における前記行動選択法として、常に最も高い価値（Ｑ（ｓ_ｔ，ａ_ｔ））が期待できる行動を選択してゆくｇｒｅｅｄｙ選択法を適用するが、これに限定されず、例えば、最大価値の行動を選択するε−ｇｒｅｅｄｙ選択法等を適用することも可能である。また、実施例１では、強化学習定数としての学習率αおよび割引率γについて、学習率αが０．１（α＝０．１）、割引率γが０．９５（γ＝０．９５）に予め設定されている。
なお、強化学習、Ｑ−ｌｅａｒｎｉｎｇ、ｇｒｅｅｄｙ選択法等の行動選択法については、非特許文献４，５等に記載されており、公知である。 In the first embodiment, a greedy selection method, which always selects the action for which the highest value (Q(s_t, a_t)) can be expected, is applied as the action selection method in process (2). However, the method is not limited to this; for example, an ε-greedy selection method, which selects the maximum-value action except for a small fraction ε of random choices, or the like can also be applied. In the first embodiment, the reinforcement learning constants are preset to a learning rate α of 0.1 (α = 0.1) and a discount rate γ of 0.95 (γ = 0.95).
Note that reinforcement learning, Q-learning, and action selection methods such as the greedy selection method are described in Non-Patent Documents 4 and 5 and elsewhere, and are well known.
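The two action selection methods mentioned for process (2) can be sketched as follows. This is an illustrative outline under the usual textbook definition of ε-greedy, not code from the cited documents; the function names and the dictionary mapping actions to evaluation values are assumptions.

```python
import random


def greedy_action(q_values):
    """Return the action with the highest stored evaluation value."""
    return max(q_values, key=q_values.get)


def epsilon_greedy_action(q_values, epsilon, rng=random):
    """With probability epsilon explore at random, otherwise act greedily."""
    if rng.random() < epsilon:
        return rng.choice(list(q_values))
    return greedy_action(q_values)


# Hypothetical evaluation values for the two movement directions.
values = {"left": 0.2, "right": 0.5}
chosen = greedy_action(values)
```

With ε = 0 the second function reduces to the greedy selection method applied in the first embodiment.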

したがって、Ｑ−ｌｅａｒｎｉｎｇの強化学習では、前記環境Ｅについて前記エージェントＡの行動によらない時間変化等がなければ、前記エージェントＡは、全状態（ｓ_ｔ）における全行動（ａ_ｔ）を試行し、前記行動価値関数Ｑ（ｓ_ｔ，ａ_ｔ）を学習することにより、最適な行動（ａ_ｔ）を選択できるようになる。しかしながら、最適な行動（ａ_ｔ）を選択するために、前記行動価値関数Ｑ（ｓ_ｔ，ａ_ｔ）についての学習を収束させるには莫大な時間が必要となる。このため、例えば、課題が複雑である場合には、全試行が有限時間で終了しない可能性があった。 Therefore, in Q-learning reinforcement learning, provided that the environment E undergoes no temporal change or the like that is independent of the actions of agent A, agent A becomes able to select the optimum action (a_t) by trying all actions (a_t) in all states (s_t) and learning the action value function Q(s_t, a_t). However, an enormous amount of time is required for the learning of the action value function Q(s_t, a_t) to converge so that the optimum action (a_t) can be selected. For this reason, when the task is complex, for example, there was a possibility that the full set of trials would not be completed in a finite time.

（選択的不感化法が適用された層状ニューラルネットが適用された強化学習について）
図４は本発明の実施例１の選択的不感化法が適用された層状ニューラルネットの説明図である。
そこで、実施例１では、前記行動価値関数Ｑ（ｓ_ｔ，ａ_ｔ）の学習効率を向上させるために、図４に示す選択的不感化法が適用された層状ニューラルネットの一例である多変数相互修飾モデルＮによって、前記行動価値関数Ｑ（ｓ_ｔ，ａ_ｔ）が近似される新たなＱ−ｌｅａｒｎｉｎｇの強化学習が適用されている。
図４において、前記多変数相互修飾モデルＮは、４ｎ個の素子（入力素子）により構成された入力層Ｎａと、１２ｎ個の素子（中間素子）により構成された中間層Ｎｂと、ｍ個の素子（出力素子）により構成された出力層Ｎｃとを有する。 (Reinforcement learning with a layered neural network with selective desensitization applied)
FIG. 4 is an explanatory diagram of a layered neural network to which the selective desensitization method according to the first embodiment of the present invention is applied.
Therefore, in the first embodiment, in order to improve the learning efficiency of the action value function Q(s_t, a_t), a new form of Q-learning reinforcement learning is applied in which the action value function Q(s_t, a_t) is approximated by the multivariable mutual modification model N, which is an example of a layered neural network to which the selective desensitization method shown in FIG. 4 is applied.
In FIG. 4, the multivariable mutual modification model N has an input layer Na composed of 4n elements (input elements), an intermediate layer Nb composed of 12n elements (intermediate elements), and an output layer Nc composed of m elements (output elements).

前記多変数相互修飾モデルＮでは、従来公知の前記相互修飾モデル０２の入力層０２ａが、２つの入力パターンＳ，Ｃによって構成されているのに対し（図１３Ｂ参照）、実施例１の前記入力層Ｎａは、４つの入力変数（入力パターン）ｘ，ｖ，θ，ω、すなわち、前記台車１の前記位置ｘと、前記速度ｖと、前記棒２の前記角度θと、前記角速度ωとによって構成されている。なお、実施例１では、４つの前記入力変数ｘ，ｖ，θ，ωは、それぞれｎ個の素子によって構成されている（４×ｎ＝４ｎ）。 In the multivariable mutual modification model N, whereas the input layer 02a of the conventionally known mutual modification model 02 is composed of two input patterns S and C (see FIG. 13B), the input layer Na of the first embodiment is composed of four input variables (input patterns) x, v, θ, ω, that is, the position x and the speed v of the carriage 1 and the angle θ and the angular velocity ω of the rod 2. In the first embodiment, each of the four input variables x, v, θ, ω is composed of n elements (4 × n = 4n).

また、前記多変数相互修飾モデルＮでは、従来公知の前記相互修飾モデル０２の中間層０２ｂが、前記入力パターンＳの積型修飾Ｓ（Ｃ）と、前記入力パターンＣの積型修飾Ｃ（Ｓ）とを出力するのに対し（図１１Ｂ、式（１），（１）′，（２）参照）、実施例１の前記中間層Ｎｂは、前記入力変数ｘの積型修飾ｘ（ｖ），ｘ（θ），ｘ（ω）と、前記入力変数ｖの積型修飾ｖ（θ），ｖ（ω），ｖ（ｘ）と、前記入力変数θの積型修飾θ（ω），θ（ｘ），θ（ｖ）と、前記入力変数ωの積型修飾ω（ｘ），ω（ｖ），ω（θ）とを出力する。すなわち、実施例１の前記中間層Ｎｂでは、前記入力変数ｘ，ｖ，θ，ωが、それぞれ２つずつを１組（入力変数組）として相互に積型文脈修飾され、積型文脈修飾された１２個の前記積型修飾ｘ（ｖ），ｖ（ｘ），ｘ（θ），θ（ｘ），ｘ（ω），ω（ｘ），ｖ（θ），θ（ｖ），ｖ（ω），ω（ｖ），θ（ω），ω（θ）が中間変数（中間パターン）として出力される。なお、実施例１では、中間変数である１２個の前記積型修飾ｘ（ｖ），ｖ（ｘ），ｘ（θ），θ（ｘ），ｘ（ω），ω（ｘ），ｖ（θ），θ（ｖ），ｖ（ω），ω（ｖ），θ（ω），ω（θ）は、それぞれｎ個の素子によって構成されている（１２×ｎ＝１２ｎ）。 In the multivariable mutual modification model N, whereas the intermediate layer 02b of the conventionally known mutual modification model 02 outputs the product-type modification S(C) of the input pattern S and the product-type modification C(S) of the input pattern C (see FIG. 11B and formulas (1), (1)′, (2)), the intermediate layer Nb of the first embodiment outputs the product-type modifications x(v), x(θ), x(ω) of the input variable x, the product-type modifications v(θ), v(ω), v(x) of the input variable v, the product-type modifications θ(ω), θ(x), θ(v) of the input variable θ, and the product-type modifications ω(x), ω(v), ω(θ) of the input variable ω. That is, in the intermediate layer Nb of the first embodiment, the input variables x, v, θ, ω are taken two at a time as one set (input variable set) and subjected to mutual product-type context modification, and the resulting twelve product-type modifications x(v), v(x), x(θ), θ(x), x(ω), ω(x), v(θ), θ(v), v(ω), ω(v), θ(ω), ω(θ) are output as intermediate variables (intermediate patterns). In the first embodiment, each of these twelve product-type modifications, which are the intermediate variables, is composed of n elements (12 × n = 12n).
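The count of twelve follows from taking every ordered pair of the four input variables. The short sketch below only checks that bookkeeping; the tuple representation (target, modifier) for a modification such as x(v) and the helper name `intermediate_layer_size` are assumptions made for illustration.

```python
from itertools import permutations

INPUT_VARIABLES = ("x", "v", "theta", "omega")

# Each ordered pair (a, b) stands for the product-type modification a(b):
# variable a modified by variable b. Four variables give 4 * 3 = 12 pairs.
MODIFICATIONS = list(permutations(INPUT_VARIABLES, 2))


def intermediate_layer_size(n):
    """Number of intermediate elements when each pattern has n elements."""
    return len(MODIFICATIONS) * n
```

For example, with n = 10 elements per input variable the input layer would have 40 elements and the intermediate layer 120.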

そして、前記多変数相互修飾モデルＮでは、従来公知の前記相互修飾モデル０２の出力層０２ｃにおいて、ｍ個の各素子に、前記中間層０２ｂの２ｎ個のうちの複数の素子が、それぞれ前記結合荷重ｗ_ｊｉにより重み付けされた状態で結合されており、且つ、符号関数の演算が行われて出力パターンＹが出力されるのと同様に（図１１Ｂ、式（３）参照）、実施例１の前記出力層Ｎｃについては、ｍ個の各素子に、前記中間層Ｎｂの１２ｎ個のうちの複数の素子が、それぞれ結合荷重ｗ_ｊｉ（ｗ_ｊｉ′）により重み付けされた状態で結合されており、且つ、符号関数の演算が行われて出力変数（出力パターン）としての前記行動評価関数Ｑ（ｓ_ｔ，ａ_ｔ）が出力される。なお、実施例１では、前記出力変数Ｑ（ｓ_ｔ，ａ_ｔ）は、ｍ個の素子によって構成されている。 In the output layer 02c of the conventionally known mutual modification model 02, a plurality of the 2n elements of the intermediate layer 02b are connected to each of the m elements, each weighted by a connection weight w_ji, and a sign function is computed to output the output pattern Y (see FIG. 11B and formula (3)). Similarly, in the output layer Nc of the multivariable mutual modification model N of the first embodiment, a plurality of the 12n elements of the intermediate layer Nb are connected to each of the m elements, each weighted by a connection weight w_ji (w_ji′), and a sign function is computed to output the action evaluation function Q(s_t, a_t) as an output variable (output pattern). In the first embodiment, the output variable Q(s_t, a_t) is composed of m elements.

（強化学習の各要素と倒立振子問題の各要素との対応関係について）
したがって、実施例１では、前記関数近似システムＳに対して、前記Ｑ−ｌｅａｒｎｉｎｇの強化学習における各要素（Ａ，Ｅ，ｓ_ｔ，ａ_ｔ，ｒ_ｔ）を、以下の（１）〜（５）のように対応付けることにより、前記倒立振子制御処理を実行することができる。 (About the correspondence between each element of reinforcement learning and each element of the inverted pendulum problem)
Thus, in the first embodiment, the inverted pendulum control process can be executed by associating each element of Q-learning reinforcement learning (A, E, s_t, a_t, r_t) with the function approximation system S as in (1) to (5) below.

（１）前記エージェントＡは、前記台車１の制御部Ｃに対応付けられる。
（２）前記環境Ｅは、実施例１の前記関数近似システムＳの構成としての前記各要素１〜３に対応付けられる。すなわち、前記台車１および前記棒２についての定数Ｍ，ｍ，Ｌ，ｔ，ｇおよび変数ａ，ｂ，ｋ等や、倒立振子問題についての関係式（式（５−１），（５−２），（６−１），（６−２），（７−１）〜（７−４）参照）、前記床面３（右端壁３ａと左端壁３ｂ）等と対応付けられる。
（３）前記状態ｓ_ｔ，ｓ_ｔ＋１は、前記入力変数ｘ，ｖ，θ，ωに対応付けられる。ここで、例えば、前記状態ｓ_ｔが、ｓ_ｔ＝ｘ（ｋ），ｖ（ｋ），θ（ｋ），ω（ｋ）で示される場合、次の状態ｓ_ｔ＋１を、ｓ_ｔ＋１＝ｘ（ｋ＋１），ｖ（ｋ＋１），θ（ｋ＋１），ω（ｋ＋１）で示すことができる。
（４）前記行動ａ_ｔは、前記力Ｆに対応付けられる。すなわち、前記棒２が倒れないように前記台車１を移動させるために前記駆動部１ｂを制御する値に応じた前記力Ｆに対応付けられる。ここで、前記時間ｋ×ｔにおける前記力ＦをＦ（ｋ）とした場合（ｋ＝０，１，２，…）、前記行動ａ_ｔは、例えば、ａ_ｔ＝Ｆ（ｋ）で示すことができる。
（５）前記報酬ｒ_ｔ，ｒ_ｔ＋１は、前記棒２が倒れているか否かを確認する値とすることができる。例えば、前記棒２が倒れて前記角度θが、θ≦−１８０°、又は、θ≧１８０°、となった場合には、前記報酬ｒ_ｔ，ｒ_ｔ＋１を負の値とし（ｒ_ｔ＜０，ｒ_ｔ＋１＜０）、それ以外の場合には、前記報酬ｒ_ｔ，ｒ_ｔ＋１を０以上の値とする（ｒ_ｔ≧０，ｒ_ｔ＋１≧０）。 (1) The agent A is associated with the control unit C of the carriage 1.
(2) The environment E is associated with the elements 1 to 3 as the configuration of the function approximation system S of the first embodiment. That is, constants M, m, L, t, g and variables a, b, k, etc. for the carriage 1 and the rod 2 and relational expressions for the inverted pendulum problem (formulas (5-1), (5-2) ), (6-1), (6-2), (7-1) to (7-4)), the floor surface 3 (the right end wall 3a and the left end wall 3b), and the like.
(3) The states s_t and s_{t+1} are associated with the input variables x, v, θ, ω. Here, for example, when the state s_t is expressed as s_t = x(k), v(k), θ(k), ω(k), the next state s_{t+1} can be expressed as s_{t+1} = x(k+1), v(k+1), θ(k+1), ω(k+1).
(4) The action a_t is associated with the force F, that is, the force F corresponding to the value used to control the drive unit 1b so as to move the carriage 1 such that the rod 2 does not fall. Here, when the force F at time k × t is denoted F(k) (k = 0, 1, 2, …), the action a_t can be expressed, for example, as a_t = F(k).
(5) The rewards r_t and r_{t+1} can be values that indicate whether or not the rod 2 has fallen. For example, when the rod 2 falls and the angle θ becomes θ ≦ −180° or θ ≧ 180°, the rewards r_t and r_{t+1} are set to negative values (r_t < 0, r_{t+1} < 0); otherwise, they are set to values of 0 or more (r_t ≧ 0, r_{t+1} ≧ 0).
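Item (5) fixes only the sign of the reward, so any concrete reward function is a design choice. The sketch below is one hypothetical choice that satisfies the stated conditions: the failure value of −1.0 and the linear shaping toward θ = 0 are assumptions, not values taken from the description.

```python
def reward(theta_deg, fail_reward=-1.0):
    """Illustrative reward: negative once the rod has fallen, else >= 0.

    theta_deg is the rod angle θ in degrees. The magnitudes used here are
    assumptions; the text only requires r < 0 when θ <= -180 or θ >= 180
    and r >= 0 otherwise.
    """
    if theta_deg <= -180.0 or theta_deg >= 180.0:
        return fail_reward
    # Larger reward the closer the rod is to upright (θ = 0).
    return 1.0 - abs(theta_deg) / 180.0
```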

なお、実施例１の前記倒立振子制御開始判別手段Ｃ１は、前記台車１および前記棒２についての各定数Ｍ，ｍ，Ｌ，ｔ，ｇ，α，γの設定値および各変数ｘ，ｖ，θ，ω，ａ，ｂ，ｋ，Ｆ，ｗ_ｊｉ，ｗ_ｊｉ′の初期値が設定されたか否かを判別することにより、前記倒立振子制御処理を開始するか否かを判別する。
Ｃ１Ａ：開始状態記憶手段
開始状態記憶手段Ｃ１Ａは、前記倒立振子制御処理を開始する状態を記憶する。実施例１の開始状態記憶手段Ｃ１Ａは、前記倒立振子制御処理を開始する状態として、前記台車１および前記棒２についての各定数Ｍ，ｍ，Ｌ，ｔ，ｇの設定値（Ｍ＝１．０，ｍ＝０．１，Ｌ＝１．０，ｔ＝０．０２，ｇ＝９．８）、強化学習定数α，γの設定値（α＝０．１，γ＝０．９５）、各変数ｘ，ｖ，θ，ω，ａ，ｂ，ｋ，Ｆ，ｗ_ｊｉ，ｗ_ｊｉ′の初期値（ｘ（０）＝ｖ（０）＝ω（０）＝ａ（０）＝ｂ（０）＝ｗ_ｊｉ（０）＝ｗ_ｊｉ′（０）＝０．０，θ（０）＝−３．０〜３．０，ｋ＝０，Ｆ（０）＝２０．０）とを記憶する。 The inverted pendulum control start determining means C1 of the first embodiment determines whether or not to start the inverted pendulum control process by determining whether or not the set values of the constants M, m, L, t, g, α, γ for the carriage 1 and the rod 2 and the initial values of the variables x, v, θ, ω, a, b, k, F, w_ji, w_ji′ have been set.
C1A: Start state storage means
The start state storage means C1A stores the state in which the inverted pendulum control process is started. The start state storage means C1A of the first embodiment stores, as the state in which the inverted pendulum control process is started, the set values of the constants M, m, L, t, g for the carriage 1 and the rod 2 (M = 1.0, m = 0.1, L = 1.0, t = 0.02, g = 9.8), the set values of the reinforcement learning constants α, γ (α = 0.1, γ = 0.95), and the initial values of the variables x, v, θ, ω, a, b, k, F, w_ji, w_ji′ (x(0) = v(0) = ω(0) = a(0) = b(0) = w_ji(0) = w_ji′(0) = 0.0, θ(0) = −3.0 to 3.0, k = 0, F(0) = 20.0).
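The stored start state can be pictured as a small configuration record. The sketch below collects the values listed above; the class name `PendulumConstants`, the function `initial_state`, and the choice to draw θ(0) uniformly from the stated range −3.0 to 3.0 are assumptions made for illustration.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class PendulumConstants:
    """Set values listed for the first embodiment (symbols as in the text)."""
    M: float = 1.0
    m: float = 0.1
    L: float = 1.0
    t: float = 0.02      # duration of one time step
    g: float = 9.8
    alpha: float = 0.1   # learning rate α
    gamma: float = 0.95  # discount rate γ


def initial_state(rng=random):
    """Initial variable values; θ(0) is sampled from [-3.0, 3.0] (assumed uniform)."""
    return {
        "x": 0.0, "v": 0.0, "omega": 0.0, "a": 0.0, "b": 0.0,
        "theta": rng.uniform(-3.0, 3.0),
        "k": 0,
        "F": 20.0,
    }


constants = PendulumConstants()
state0 = initial_state()
```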

なお、実施例１では、前記行動ａ_ｔが、前記台車１の左右方向（Ｙ軸方向）への移動の制御であるため、右方向（＋Ｙ方向）への移動の評価用の荷重結合をｗ_ｊｉとし、左方向（−Ｙ方向）への移動の評価用の荷重結合をｗ_ｊｉ′として別々に荷重結合を設けることにより、前記各荷重結合ｗ_ｊｉ，ｗ_ｊｉ′の学習が互いに干渉しないように予め設定されている。また、実施例１では、未学習の場合の前記各荷重結合ｗ_ｊｉ，ｗ_ｊｉ′の初期値をｗ_ｊｉ（０），ｗ_ｊｉ′（０）とした場合に、いかなる状態（ｓ_ｔ＝ｘ（０），ｖ（０），θ（０），ω（０））であっても、前記各荷重結合ｗ_ｊｉ（０），ｗ_ｊｉ′（０）が、ｗ_ｊｉ（０）＝ｗ_ｊｉ′（０）＝０．０で示されるように予め設定されている。 In the first embodiment, because the action a_t is control of the movement of the carriage 1 in the left-right direction (Y-axis direction), separate connection weights are provided: w_ji for evaluating movement in the right direction (+Y direction) and w_ji′ for evaluating movement in the left direction (−Y direction). This is preset so that the learning of the connection weights w_ji and the learning of the connection weights w_ji′ do not interfere with each other. Further, in the first embodiment, when the initial values of the connection weights w_ji, w_ji′ in the unlearned case are denoted w_ji(0), w_ji′(0), they are preset so that w_ji(0) = w_ji′(0) = 0.0 regardless of the state (s_t = x(0), v(0), θ(0), ω(0)).

Ｃ２：状態測定手段
状態測定手段Ｃ２は、前記状態ｓ_ｔ，ｓ_ｔ＋１を測定する。実施例１の前記状態測定手段Ｃ２は、前記式（６−１），（６−２），（７−１）〜（７−４），（７−１）′〜（７−４）′と前記各定数Ｍ，ｍ，Ｌ，ｇ，ｔおよび前記各変数ａ，ｂ、ｋ，Ｆとに基づいて、前記時間ステップｔ毎の前記入力変数ｘ，ｖ，θ，ωの値（入力値ｘ（ｋ），ｖ（ｋ），θ（ｋ），ω（ｋ）（ｋ＝０，１，２，…））を演算することにより、前記状態ｓ_ｔ，ｓ_ｔ＋１を測定する（ｓ_ｔ＝ｘ（ｋ），ｖ（ｋ），θ（ｋ），ω（ｋ），ｓ_ｔ＋１＝ｘ（ｋ＋１），ｖ（ｋ＋１），θ（ｋ＋１），ω（ｋ＋１））。 C2: State measuring means
The state measuring means C2 measures the states s_t and s_{t+1}. The state measuring means C2 of the first embodiment measures the states s_t and s_{t+1} (s_t = x(k), v(k), θ(k), ω(k), s_{t+1} = x(k+1), v(k+1), θ(k+1), ω(k+1)) by calculating the values of the input variables x, v, θ, ω at each time step t (input values x(k), v(k), θ(k), ω(k) (k = 0, 1, 2, …)) based on formulas (6-1), (6-2), (7-1) to (7-4), (7-1)′ to (7-4)′, the constants M, m, L, g, t, and the variables a, b, k, F.

Ｃ３：行動実行手段
行動実行手段Ｃ３は、行動選択手段Ｃ３Ａと、台車移動制御手段Ｃ３Ｂとを有し、前記行動ａ_ｔを実行する、すなわち、前記行動実行手段Ｃ３は、前記時間ステップｔ毎の前記力Ｆ（ｋ）に応じて前記台車１を左右方向（Ｙ軸方向）に移動させる。
Ｃ３Ａ：行動選択手段
行動選択手段Ｃ３Ａは、任意の行動選択方法に従って行動ａ_ｔを選択する。実施例１の行動選択手段Ｃ３Ａは、前記ｇｒｅｅｄｙ選択法に基づいて、評価値である前記行動価値関数Ｑ（ｓ_ｔ，ａ_ｔ）が最大となる前記行動ａ_ｔを選択する。
Ｃ３Ｂ：台車移動制御手段
台車移動制御手段Ｃ３Ｂは、前記駆動部１ｂの駆動を制御して前記駆動輪１ａを回転駆動させることにより、前記台車１の左右方向（Ｙ軸方向）への移動を制御する。 C3: Action executing means
The action executing means C3 has an action selection means C3A and a carriage movement control means C3B, and executes the action a_t; that is, the action executing means C3 moves the carriage 1 in the left-right direction (Y-axis direction) according to the force F(k) at each time step t.
C3A: Action selection means
The action selection means C3A selects the action a_t in accordance with an arbitrary action selection method. The action selection means C3A of the first embodiment selects, based on the greedy selection method, the action a_t that maximizes the action value function Q(s_t, a_t), which is the evaluation value.
C3B: Carriage movement control means
The carriage movement control means C3B controls the movement of the carriage 1 in the left-right direction (Y-axis direction) by controlling the drive of the drive unit 1b to rotate the drive wheels 1a.

Ｃ４：報酬取得手段
報酬取得手段Ｃ４は、前記報酬ｒ_ｔ，ｒ_ｔ＋１を取得する。実施例１の前記報酬取得手段Ｃ４では、前記時間ステップｔ毎の前記報酬ｒ_ｔ，ｒ_ｔ＋１は、前記棒２が倒れて前記角度θが、θ≦−１８０°、又は、θ≧１８０°、となった場合には、負の値として取得され（ｒ_ｔ＜０，ｒ_ｔ＋１＜０）、それ以外の場合には、０以上の値として取得される（ｒ_ｔ≧０，ｒ_ｔ＋１≧０）。なお、前記報酬取得手段Ｃ４では、前記角度θが０［ｄｅｇ］に近いほど前記報酬ｒ_ｔ，ｒ_ｔ＋１の値が大きくなるように予め設定されている。
Ｃ５：行動価値関数演算手段
行動価値関数演算手段Ｃ５は、前記行動価値関数Ｑ（ｓ_ｔ，ａ_ｔ），Ｑ（ｓ_ｔ＋１，ａ_ｔ＋１）を演算する。実施例１の前記行動価値関数演算手段Ｃ５は、前記式（８）に基づいて、前記時間ステップｔ毎の、すなわち、時間ｋ×ｔ，（ｋ＋１）×ｔの前記行動価値関数Ｑ（ｓ_ｔ，ａ_ｔ），Ｑ（ｓ_ｔ＋１，ａ_ｔ＋１）を演算する（ｋ＝０，１，２，…）。 C4: Reward acquisition means
The reward acquisition means C4 acquires the rewards r_t and r_{t+1}. In the reward acquisition means C4 of the first embodiment, the rewards r_t and r_{t+1} at each time step t are acquired as negative values (r_t < 0, r_{t+1} < 0) when the rod 2 falls and the angle θ becomes θ ≦ −180° or θ ≧ 180°, and as values of 0 or more (r_t ≧ 0, r_{t+1} ≧ 0) otherwise. The reward acquisition means C4 is preset so that the values of the rewards r_t and r_{t+1} increase as the angle θ approaches 0 [deg].
C5: Action value function calculating means
The action value function calculating means C5 calculates the action value functions Q(s_t, a_t), Q(s_{t+1}, a_{t+1}). The action value function calculating means C5 of the first embodiment calculates, based on formula (8), the action value functions Q(s_t, a_t), Q(s_{t+1}, a_{t+1}) at each time step t, that is, at times k × t and (k + 1) × t (k = 0, 1, 2, …).

Ｃ６：行動価値関数記憶手段
行動価値関数記憶手段Ｃ６は、前記行動価値関数演算手段Ｃ５で演算された、前記時間ステップｔ毎の前記行動価値関数Ｑ（ｓ_ｔ，ａ_ｔ），Ｑ（ｓ_ｔ＋１，ａ_ｔ＋１）を記憶する。実施例１の前記行動価値関数記憶手段Ｃ６は、前記時間ｋ×ｔ，（ｋ＋１）×ｔにおける全ての状態ｓ_ｔ，ｓ_ｔ＋１（ｓ_ｔ＝ｘ（ｋ），ｖ（ｋ），θ（ｋ），ω（ｋ），ｓ_ｔ＋１＝ｘ（ｋ＋１），ｖ（ｋ＋１），θ（ｋ＋１），ω（ｋ＋１））についての前記行動価値関数Ｑ（ｓ_ｔ，ａ_ｔ），Ｑ（ｓ_ｔ＋１，ａ_ｔ＋１）を記憶する（ｋ＝０，１，２，…）。
Ｃ７：行動価値関数近似手段
行動価値関数近似手段Ｃ７は、入力変数生成手段Ｃ７Ａと、入力変数記憶手段Ｃ７Ｂと、中間変数演算手段Ｃ７Ｃと、中間変数記憶手段Ｃ７Ｄと、出力変数演算手段Ｃ７Ｅと、出力変数記憶手段Ｃ７Ｆと、行動価値関数学習手段Ｃ７Ｇとを有し、前記多変数相互修飾モデルＮによって、前記行動価値関数Ｑ（ｓ_ｔ，ａ_ｔ）を近似する行動価値関数近似処理を実行する。すなわち、前記状態ｓ_ｔにおける前記行動価値関数Ｑ（ｓ_ｔ，ａ_ｔ）の近似値を演算する前記行動価値関数近似処理を実行する。 C6: Action value function storage means
The action value function storage means C6 stores the action value functions Q(s_t, a_t), Q(s_{t+1}, a_{t+1}) for each time step t calculated by the action value function calculating means C5. The action value function storage means C6 of the first embodiment stores the action value functions Q(s_t, a_t), Q(s_{t+1}, a_{t+1}) for all states s_t, s_{t+1} (s_t = x(k), v(k), θ(k), ω(k), s_{t+1} = x(k+1), v(k+1), θ(k+1), ω(k+1)) at times k × t and (k + 1) × t (k = 0, 1, 2, …).
C7: Action value function approximating means
The action value function approximating means C7 has an input variable generating means C7A, an input variable storage means C7B, an intermediate variable calculation means C7C, an intermediate variable storage means C7D, an output variable calculation means C7E, an output variable storage means C7F, and an action value function learning means C7G, and executes an action value function approximation process that approximates the action value function Q(s_t, a_t) by the multivariable mutual modification model N. That is, it executes the action value function approximation process, which calculates an approximate value of the action value function Q(s_t, a_t) in the state s_t.

Ｃ７Ａ：入力変数生成手段（入力変数入力手段）
入力変数生成手段Ｃ７Ａは、前記時間ステップｔ毎に連続的に変化する前記入力変数ｘ，ｖ，θ，ωの入力値（ｘ（ｋ），ｖ（ｋ），θ（ｋ），ω（ｋ），ｘ（ｋ＋１），ｖ（ｋ＋１），θ（ｋ＋１），ω（ｋ＋１））を生成する。実施例１の入力変数生成手段Ｃ７Ａは、前記多変数相互修飾モデルＮの入力層Ｎａ（図４参照）において、前記時間ステップｔ毎に、状態測定手段Ｃ２で演算された前記入力変数ｘ，ｖ，θ，ωの入力値に対応する、前記入力変数ｘ，ｖ，θ，ωの素子の値の配列、いわゆる、状態パターンを設定する。 C7A: Input variable generation means (input variable input means)
The input variable generating means C7A generates the input values (x(k), v(k), θ(k), ω(k), x(k+1), v(k+1), θ(k+1), ω(k+1)) of the input variables x, v, θ, ω, which change continuously at each time step t. The input variable generating means C7A of the first embodiment sets, in the input layer Na (see FIG. 4) of the multivariable mutual modification model N and at each time step t, the array of element values of the input variables x, v, θ, ω, the so-called state pattern, corresponding to the input values of the input variables x, v, θ, ω calculated by the state measuring means C2.

図５は台車の位置を示す入力変数の状態パターンの一例を説明するための説明図である。
実施例１の前記入力変数生成手段Ｃ７Ａでは、前記入力層Ｎａにおける前記入力変数ｘ，ｖ，θ，ωの全４ｎ個の素子は、それぞれ−１または＋１のいずれかの値となるように予め設定されている。したがって、例えば、前記入力変数ｘ（台車１の位置ｘ）の状態パターンをＳ_ｘとし、Ｓ_ｘ＝（ｓ_ｘ１，ｓ_ｘ２，…，ｓ_ｘｎ）で示される場合、前記入力変数ｘのｎ個の素子ｓ_ｘｉ（ｉ＝１，２，…，ｎ）は、ｓ_ｘｉ＝＋１またはｓ_ｘｉ＝−１で示すことができる。
また、前記入力変数生成手段Ｃ７Ａでは、前記入力変数ｘ，ｖ，θ，ωは、ｎ／２個の素子が＋１となり、残りのｎ／２個の素子が−１となるように予め設定されている。よって、例えば、図５に示すように、前記入力変数ｘ（台車１の位置ｘ）の状態パターンＳ_ｘについて、前記台車１の重心が実際に存在する位置に応じた範囲のｎ／２個の素子ｓ_ｘｉが＋１を出力し、残りのｎ／２個の素子ｓ_ｘｉが−１を出力するように予め設定されている。 FIG. 5 is an explanatory diagram for explaining an example of an input variable state pattern indicating the position of the carriage.
In the input variable generating means C7A of the first embodiment, all 4n elements of the input variables x, v, θ, ω in the input layer Na are preset to take a value of either −1 or +1. Therefore, for example, when the state pattern of the input variable x (the position x of the carriage 1) is denoted S_x and expressed as S_x = (s_x1, s_x2, …, s_xn), each of the n elements s_xi (i = 1, 2, …, n) of the input variable x can be expressed as s_xi = +1 or s_xi = −1.
In the input variable generating means C7A, each of the input variables x, v, θ, ω is preset so that n/2 elements are +1 and the remaining n/2 elements are −1. Thus, for example, as shown in FIG. 5, the state pattern S_x of the input variable x (the position x of the carriage 1) is preset so that the n/2 elements s_xi in the range corresponding to the position where the center of gravity of the carriage 1 actually exists output +1 and the remaining n/2 elements s_xi output −1.
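One way to picture such a state pattern is a window of n/2 consecutive +1 elements whose position slides with the input value. The sketch below is only an illustration under that assumption: FIG. 5 is not reproduced here, so the linear mapping from value to window position, the clipping to a range [low, high], and the function name `encode_pattern` are all hypothetical.

```python
def encode_pattern(value, low, high, n):
    """Encode a scalar as n elements of +1/-1 with exactly n/2 set to +1.

    The +1 elements form a contiguous window whose start index moves
    linearly from 0 (value == low) to n/2 (value == high). This mapping
    is an assumption made for illustration.
    """
    if n % 2 != 0:
        raise ValueError("n must be even so that n/2 elements can be +1")
    if high <= low:
        raise ValueError("high must be greater than low")
    half = n // 2
    clipped = min(max(value, low), high)
    start = round((clipped - low) / (high - low) * half)
    return [1 if start <= i < start + half else -1 for i in range(n)]


# Hypothetical example: a position in the middle of the range, n = 8.
s_x = encode_pattern(0.0, -1.0, 1.0, 8)
```

Under this scheme nearby input values share most of their +1 elements, while distant values share few or none.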

さらに、前記入力変数生成手段Ｃ７Ａでは、前記入力変数θ（棒２の角度θ）の状態パターンは、学習の際の影響が特に大きいと思われるため、前記入力変数θの主要な値、例えば、θが０．０［ｄｅｇ］の付近の値である場合には（θ≒０）、僅かな値の変化によって、状態パターンが大きく変化するように予め設定されている。すなわち、前記入力変数θの状態パターンをＳ_θとし、前記時間ｋ×ｔ，（ｋ＋１）×ｔの前記入力変数θの入力値θ（ｋ），θ（ｋ＋１）に応じた各状態パターンをＳ_θ（ｋ），Ｓ_{θ（ｋ＋１）}とした場合、前記入力変数θは、入力値θ（ｋ），θ（ｋ＋１）が０．０［ｄｅｇ］の付近であれば（θ（ｋ）≒０，θ（ｋ＋１）≒０）、前記時間がｋ×ｔから（ｋ＋１）×ｔに変化したときの僅かな入力値の変化で（θ（ｋ＋１）−θ（ｋ）≒０）、変化後の状態パターンＳ_{θ（ｋ＋１）}が、変化前の状態パターンＳ_θ（ｋ）に比べて大きく変化し、それ以外の値であれば、僅かな入力値の変化では、変化後の状態パターンＳ_{θ（ｋ＋１）}が、変化前の状態パターンＳ_θ（ｋ）比べて大きく変化しないように予め設定されている。 Further, in the input variable generating means C7A, because the state pattern of the input variable θ (the angle θ of the rod 2) is considered to have a particularly large influence on learning, it is preset so that, at the principal values of the input variable θ, for example when θ is near 0.0 [deg] (θ ≈ 0), a slight change in value changes the state pattern greatly. That is, let the state pattern of the input variable θ be S_θ, and let the state patterns corresponding to the input values θ(k), θ(k+1) of the input variable θ at times k × t, (k + 1) × t be S_{θ(k)}, S_{θ(k+1)}. The input variable θ is then preset so that, if the input values θ(k), θ(k+1) are near 0.0 [deg] (θ(k) ≈ 0, θ(k+1) ≈ 0), a slight change in the input value when the time changes from k × t to (k + 1) × t (θ(k+1) − θ(k) ≈ 0) changes the state pattern S_{θ(k+1)} after the change greatly relative to the state pattern S_{θ(k)} before the change, whereas for other values a slight change in the input value does not change S_{θ(k+1)} greatly relative to S_{θ(k)}.

Ｃ７Ｂ：入力変数記憶手段
入力変数記憶手段Ｃ７Ｂは、前記入力変数生成手段Ｃ７Ａで生成された前記入力変数ｘ，ｖ，θ，ωの入力値を記憶する。実施例１の前記入力変数記憶手段Ｃ７Ｂは、前記時間ステップｔ毎に前記入力変数ｘ，ｖ，θ，ωの各状態パターンＳ_ｘ，Ｓ_ｖ，Ｓ_θ，Ｓ_ω（図４参照）を記憶する。 C7B: Input variable storage means
The input variable storage means C7B stores the input values of the input variables x, v, θ, ω generated by the input variable generating means C7A. The input variable storage means C7B of the first embodiment stores the state patterns S_x, S_v, S_θ, S_ω (see FIG. 4) of the input variables x, v, θ, ω at each time step t.

Ｃ７Ｃ：中間変数演算手段
中間変数演算手段Ｃ７Ｃは、選択的不感化手段Ｃ７C1を有し、前記時間ステップｔ毎に前記入力変数ｘ，ｖ，θ，ωに基づく前記中間変数を演算する。図４において、実施例１の前記中間変数演算手段Ｃ７Ｃは、前記中間層Ｎｂに示すように、前記時間ステップｔ毎に、前記入力変数ｘ，ｖ，θ，ωが、それぞれ２つずつを１組として相互に積型文脈修飾された１２個の前記積型修飾ｘ（ｖ），ｘ（θ），ｘ（ω），ｖ（θ），ｖ（ω），ｖ（ｘ），θ（ω），θ（ｘ），θ（ｖ），ω（ｘ），ω（ｖ），ω（θ）を中間変数（ｙ）として演算する（出力する）。すなわち、前記中間変数をｙとし（ｙ＝（ｘ（ｖ），ｖ（ｘ），ｘ（θ），θ（ｘ），ｘ（ω），ω（ｘ），ｖ（θ），θ（ｖ），ｖ（ω），ω（ｖ），θ（ω），ω（θ）））、前記中間変数の状態パターンをＳ_ｙとした場合に、前記入力変数ｘ，ｖ，θ，ωの状態パターンＳ_ｘ，Ｓ_ｖ，Ｓ_θ，Ｓ_ωから、前記中間変数ｙの状態パターンＳ_ｙ（Ｓ_ｙ＝（Ｓ_ｘ（Ｓ_ｖ），Ｓ_ｖ（Ｓ_ｘ），Ｓ_ｘ（Ｓ_θ），Ｓ_θ（Ｓ_ｘ），Ｓ_ｘ（Ｓ_ω），Ｓ_ω（Ｓ_ｘ），Ｓ_ｖ（Ｓ_θ），Ｓ_θ（Ｓ_ｖ），Ｓ_ｖ（Ｓ_ω），Ｓ_ω（Ｓ_ｖ），Ｓ_θ（Ｓ_ω），Ｓ_ω（Ｓ_θ）））を演算する。 C7C: Intermediate variable calculation means The intermediate variable calculation means C7C has selective desensitization means C7C1, and calculates the intermediate variables based on the input variables x, v, θ, ω at each time step t. As shown in the intermediate layer Nb, the intermediate variable calculation means C7C according to the first embodiment in FIG. 4 sets the input variables x, v, θ, and ω to 2 each at the time step t. Twelve product type modifications x (v), x (θ), x (ω), v (θ), v (ω), v (x), θ (ω that are mutually product type context modified as a set ), Θ (x), θ (v), ω (x), ω (v), ω (θ) are calculated (output) as intermediate variables (y). That is, the intermediate variable is y (y = (x (v), v (x), x (θ), θ (x), x (ω), ω (x), v (θ), θ (v ), V (ω), ω (v), θ (ω), ω (θ))), the state of the input variables x, v, θ, ω when the intermediate variable state pattern is S _y. From the patterns S _x , S _v , S _θ , S _ω , the state pattern S _y (S _y = (S _x (S _v ), S _v (S _x ), S _x (S _θ ), S)) of the intermediate variable y is obtained. _θ (S _x ), S _x (S _ω ), S _ω (S _x ), S _v (S _θ ), S _θ (S _v ), S _v (S _ω ), S _ω (S _v ), S _θ (S _ω ), S _ω (S _θ ))) is calculated.

Ｃ７C1：選択的不感化手段
選択的不感化手段Ｃ７C1は、第１不感化手段Ｃ７C1aと、第１出力感度演算手段Ｃ７C1bと、第２不感化手段Ｃ７C1cと、第２出力感度演算手段Ｃ７C1dとを有し、前記中間変数ｙの出力において、前記入力変数ｘ，ｖ，θ，ωに対して選択的不感化する選択的不感化処理を実行する。実施例１の前記選択的不感化手段Ｃ７C1は、前記入力変数ｘ，ｖ，θ，ωを、それぞれ２つずつを１組として相互に選択的不感化処理を実行する。例えば、４つの入力変数ｘ，ｖ，θ，ωから選択された２つの入力変数ｘ，ｖ（台車１の位置ｘおよび速度ｖ）を１組として、前記入力変数ｘを第１選択変数とし、前記入力変数ｖを第２選択変数とした場合、前記第２選択変数ｖが前記第１選択変数ｘを選択的不感化すると共に、前記第１選択変数ｘが前記第２選択変数ｖを選択的不感化する。 C7C1: Selective desensitizing means The selective desensitizing means C7C1 includes first desensitizing means C7C1a, first output sensitivity calculating means C7C1b, second desensitizing means C7C1c, and second output sensitivity calculating means C7C1d. Then, at the output of the intermediate variable y, a selective desensitization process for selectively desensitizing the input variables x, v, θ, and ω is executed. The selective desensitizing means C7C1 according to the first embodiment executes the selective desensitizing process with each of the input variables x, v, θ, ω as two sets. For example, two input variables x, v selected from four input variables x, v, θ, ω (position x and speed v of the carriage 1) are set as one set, and the input variable x is set as a first selection variable, When the input variable v is a second selection variable, the second selection variable v selectively desensitizes the first selection variable x and the first selection variable x selectively selects the second selection variable v. Desensitize.

具体的には、図４において、まず、前記第１選択変数ｘの状態パターンＳ_ｘ（Ｓ_ｘ＝（ｓ_ｘ１，ｓ_ｘ２，…，ｓ_ｘｎ）＝ｓ_ｘｉ（ｉ＝１，２，…，ｎ））に対応して、前記第２選択変数ｖの状態パターンをＳ_ｖ（Ｓ_ｖ＝（ｓ_ｖ１，ｓ_ｖ２，…，ｓ_ｖｎ）＝ｓ_ｖｉ（ｉ＝１，２，…，ｎ））とし、前記選択変数ｘ，ｖの状態パターンＳ_ｘ，Ｓ_ｖに基づくゲインベクトルをＧ_ｘ（Ｇ_ｘ＝（ｇ_ｘ１，ｇ_ｘ２，…，ｇ_ｘｎ）＝ｇ_ｘｉ（ｉ＝１，２，…，ｎ）），Ｇ_ｖ（Ｇ_ｖ＝（ｇ_ｖ１，ｇ_ｖ２，…，ｇ_ｖｎ）＝ｇ_ｖｉ（ｉ＝１，２，…，ｎ））とし、前記積型修飾ｘ（ｖ），ｖ（ｘ）の状態パターンをＳ_ｘ（Ｓ_ｖ）（Ｓ_ｘ（Ｓ_ｖ）＝（ｙ_ｘｖ１，ｙ_ｘｖ２，…，ｙ_ｘｖｎ）＝ｙ_ｘｖｉ（ｉ＝１，２，…，ｎ）），Ｓ_ｖ（Ｓ_ｘ）（Ｓ_ｖ（Ｓ_ｘ）＝（ｙ_ｖｘ１，ｙ_ｖｘ２，…，ｙ_ｖｘｎ）＝ｙ_ｖｘｉ（ｉ＝１，２，…，ｎ））とする。 Specifically, in FIG. 4, in correspondence with the state pattern S_x of the first selection variable x (S_x = (s_x1, s_x2, …, s_xn) = s_xi (i = 1, 2, …, n)), let the state pattern of the second selection variable v be S_v (S_v = (s_v1, s_v2, …, s_vn) = s_vi (i = 1, 2, …, n)); let the gain vectors based on the state patterns S_x, S_v of the selection variables x, v be G_x (G_x = (g_x1, g_x2, …, g_xn) = g_xi (i = 1, 2, …, n)) and G_v (G_v = (g_v1, g_v2, …, g_vn) = g_vi (i = 1, 2, …, n)); and let the state patterns of the product-type modifications x(v), v(x) be S_x(S_v) (S_x(S_v) = (y_xv1, y_xv2, …, y_xvn) = y_xvi (i = 1, 2, …, n)) and S_v(S_x) (S_v(S_x) = (y_vx1, y_vx2, …, y_vxn) = y_vxi (i = 1, 2, …, n)).

そして、前記選択変数ｘ，ｖの状態パターンＳ_ｘ，Ｓ_ｖについて、前記式（１），（２）を適用する。なお、実施例１では、前記状態パターンＳ_ｘ，Ｓ_ｖの素子ｓ_ｘｉ，ｓ_ｖｉ（ｉ＝１，２，…，ｎ）の値（＋１または−１）が存在する確率が等しく、且つ、独立に決定される（相関がない）ように予め設定されている。このため、前記式（１）の替わりに前記式（１）′が適用できる。ここで、前記選択変数ｘ，ｖの状態パターンＳ_ｘ，Ｓ_ｖについて、前記式（１）′，（２）が適用された式を、以下の式（９−１），（９−２），（１０−１），（１０−２）に示す。
ｙ_ｘｖｉ＝ｇ_ｖｉ×ｓ_ｘｉ …（９−１）
ｇ_ｖｉ＝（１＋ｓ_ｖｉ）／２ …（９−２）
ｙ_ｖｘｉ＝ｇ_ｘｉ×ｓ_ｖｉ …（１０−１）
ｇ_ｘｉ＝（１＋ｓ_ｘｉ）／２ …（１０−２） Then, the expressions (1) and (2) are applied to the state patterns S _x and S _v of the selection variables x and v. In Example 1, the probabilities that the values (+1 or −1) of the elements s _xi , s _vi (i = 1, 2,..., N) of the state patterns S _x , S _v exist are equal, and It is set in advance so as to be determined independently (no correlation). Therefore, the formula (1) ′ can be applied instead of the formula (1). Here, with respect to the state patterns S _x , S _v of the selection variables x, v, equations obtained by applying the equations (1) ′, (2) are represented by the following equations (9-1), (9-2) , (10-1), (10-2).
y _xvi = g _vi × s _xi (9-1)
g _vi = (1 + s _vi ) / 2 (9-2)
y _vxi = g _xi × s _vi (10-1)
g _xi = (1 + s _xi ) / 2 (10-2)
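Formulas (9-1) to (10-2) can be checked on a small example. The sketch below implements them element by element for patterns of +1/−1 values; the function names `gain` and `modify` and the four-element sample patterns are assumptions made for illustration.

```python
def gain(pattern):
    """Formulas (9-2) and (10-2): g_i = (1 + s_i) / 2.

    For s_i in {+1, -1} this gives 1 or 0, so integer division is exact.
    """
    return [(1 + s) // 2 for s in pattern]


def modify(target, modifier):
    """Formulas (9-1) and (10-1): y_i = g_i * s_i.

    Returns the product-type modification target(modifier). Elements of
    `target` are zeroed (desensitized) wherever the modifier element is -1.
    """
    if len(target) != len(modifier):
        raise ValueError("patterns must have the same number of elements")
    return [g * s for g, s in zip(gain(modifier), target)]


# Hypothetical patterns with n = 4 elements.
s_x = [1, -1, 1, -1]
s_v = [1, 1, -1, -1]

y_xv = modify(s_x, s_v)  # x(v): [1, -1, 0, 0]
y_vx = modify(s_v, s_x)  # v(x): [1, 0, -1, 0]
```

Each output element takes one of three values, +1, −1, or 0, and because each pattern has n/2 elements equal to −1, half of the elements of every modified pattern are desensitized to 0.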

Ｃ７C1a：第１不感化手段
第１不感化手段Ｃ７C1aは、前記第１選択変数の各素子の値を、前記中間変数の各素子の値に反映しないようにする（不感化する）。実施例１の前記第１不感化手段Ｃ７C1aでは、例えば、前記入力変数ｘが第１選択変数、前記入力変数ｖが第２選択変数として選択されている場合、前記式（９−１）によって、前記第１選択変数ｘの各素子ｓ_ｘｉ（ｉ＝１，２，…，ｎ）から前記積型修飾ｘ（ｖ）の各素子ｙ_ｘｖｉ（ｉ＝１，２，…，ｎ）を演算する際に、前記ゲインｇ_ｖｉが０になる場合（ｇ_ｖｉ＝０）、対応する前記素子ｙ_ｘｖｉが、ｙ_ｘｖｉ＝０となる。すなわち、前記第１選択変数ｘの各素子ｓ_ｘｉの値（＋１または−１）が、前記積型修飾ｘ（ｖ）の各素子ｙ_ｘｖｉ（ｉ＝１，２，…，ｎ）の値に反映されず０となる。本願明細書では、この状況を「不感化される」と呼ぶ。
Ｃ７C1b：第１出力感度演算手段
第１出力感度演算手段Ｃ７C1bは、前記第１選択変数を不感化するためのゲイン（第１出力感度）のゲインベクトルを演算する。実施例１の前記第１出力感度演算手段Ｃ７C1bは、例えば、前記入力変数ｘが第１選択変数、前記入力変数ｖが第２選択変数として選択されている場合、第２選択変数ｖの状態パターンＳ_ｖに基づいて、第１選択変数ｘを不感化するための前記ゲインベクトルＧ_ｖ＝ｇ_ｖｉ（ｉ＝１，２，…，ｎ）を演算する（式（９−２）参照）。
なお、前記第１不感化手段Ｃ７C1aおよび前記第１出力感度演算手段Ｃ７C1bにより、実施例１の第１中間変数演算手段（Ｃ７C1a＋Ｃ７C1b）が構成されている。 C7C1a: First desensitizing means The first desensitizing means C7C1a does not reflect (desensitize) the value of each element of the first selection variable to the value of each element of the intermediate variable. In the first desensitizing means C7C1a of the first embodiment, for example, when the input variable x is selected as the first selection variable and the input variable v is selected as the second selection variable, according to the equation (9-1), Each element y _xvi (i = 1, 2,..., N) of the product type modification x (v) is calculated from each element s _xi (i = 1, 2,..., N) of the first selection variable x. At this time, when the gain g _vi becomes 0 (g _vi = 0), the corresponding element y _xvi becomes y _xvi = 0. That is, the value (+1 or −1) of each element s _xi of the first selection variable x becomes the value of each element y _xvi (i = 1, 2,..., N) of the product type modification x (v). It is not reflected and becomes 0. In this specification, this situation is referred to as “desensitized”.
C7C1b: First output sensitivity calculation means The first output sensitivity calculation means C7C1b calculates a gain vector of a gain (first output sensitivity) for desensitizing the first selection variable. For example, when the input variable x is selected as the first selection variable and the input variable v is selected as the second selection variable, the first output sensitivity calculation means C7C1b of the first embodiment is the state pattern of the second selection variable v. Based on S _v , the gain vector G _v = g _vi (i = 1, 2,..., N) for desensitizing the first selection variable x is calculated (see Expression (9-2)).
The first desensitizing means C7C1a and the first output sensitivity calculating means C7C1b constitute first intermediate variable calculating means (C7C1a + C7C1b) of the first embodiment.

C7C1c: Second desensitizing means
The second desensitizing means C7C1c prevents the value of each element of the second selection variable from being reflected in the value of the corresponding element of the intermediate variable (that is, it desensitizes the element). For example, when the input variable x is selected as the first selection variable and the input variable v as the second selection variable, the second desensitizing means C7C1c of the first embodiment calculates each element y_vxi (i = 1, 2, ..., n) of the product-type modification v(x) from each element s_vi (i = 1, 2, ..., n) of the second selection variable v according to Expression (10-1); when the gain g_xi is 0 (g_xi = 0), the corresponding element becomes y_vxi = 0. In other words, the value (+1 or −1) of the element s_vi of the second selection variable v is not reflected in the element y_vxi of the product-type modification v(x), which becomes 0. The element is thus desensitized.
C7C1d: Second output sensitivity calculation means
The second output sensitivity calculation means C7C1d calculates a gain vector of gains (second output sensitivities) for desensitizing the second selection variable. For example, when the input variable x is selected as the first selection variable and the input variable v as the second selection variable, the second output sensitivity calculation means C7C1d of the first embodiment calculates, based on the state pattern S_x of the first selection variable x, the gain vector G_x = g_xi (i = 1, 2, ..., n) for desensitizing the second selection variable v (see Expression (10-2)).
The second desensitizing means C7C1c and the second output sensitivity calculation means C7C1d constitute the second intermediate variable calculation means (C7C1c + C7C1d) of the first embodiment.
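The mutual desensitization of one variable pair can be sketched as follows. This is a minimal illustration, assuming that Expressions (9-1) and (10-1) take the element-wise product form y_xvi = g_vi · s_xi and y_vxi = g_xi · s_vi with gains of 0 or 1, as the description of the zero-gain case suggests. How the gain vectors are derived from S_v and S_x (Expressions (9-2) and (10-2)) is not reproduced here, so the gains are passed in directly. The function name `desensitize` and all numeric values are hypothetical.

```python
from typing import List, Sequence


def desensitize(s: Sequence[int], g: Sequence[int]) -> List[int]:
    """Return y with y_i = g_i * s_i (assumed form of Expressions (9-1)/(10-1))."""
    if len(s) != len(g):
        raise ValueError("state pattern and gain vector must have the same length")
    return [g_i * s_i for s_i, g_i in zip(s, g)]


# Illustrative values only (n = 4); real patterns come from the input layer Na.
S_x = [+1, -1, +1, -1]   # state pattern of the first selection variable x
S_v = [-1, -1, +1, +1]   # state pattern of the second selection variable v
G_v = [1, 0, 1, 0]       # gains derived from S_v (derivation not shown here)
G_x = [0, 1, 1, 0]       # gains derived from S_x (derivation not shown here)

y_xv = desensitize(S_x, G_v)   # product-type modification x(v)
y_vx = desensitize(S_v, G_x)   # product-type modification v(x)
```

Elements whose gain is 0 come out as 0 regardless of whether the underlying element was +1 or −1, which is the desensitized state described above.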

C7D: Intermediate variable storage means
The intermediate variable storage means C7D stores the values (intermediate values) of the intermediate variable y calculated by the intermediate variable calculation means C7C (y = (x(v), v(x), x(θ), θ(x), x(ω), ω(x), v(θ), θ(v), v(ω), ω(v), θ(ω), ω(θ))). The intermediate variable storage means C7D of the first embodiment stores, at each time step t, the state pattern S_y of the intermediate variable y (S_y = (S_x(S_v), S_v(S_x), S_x(S_θ), S_θ(S_x), S_x(S_ω), S_ω(S_x), S_v(S_θ), S_θ(S_v), S_v(S_ω), S_ω(S_v), S_θ(S_ω), S_ω(S_θ))).

C7E: Output variable calculation means
The output variable calculation means C7E calculates, at each time step t, the output variable Q(s_t, a_t) based on the intermediate variable y (y = (x(v), v(x), x(θ), θ(x), x(ω), ω(x), v(θ), θ(v), v(ω), ω(v), θ(ω), ω(θ))). In FIG. 4, as shown in the output layer Nc, the output variable calculation means C7E of the first embodiment denotes the state pattern of the output variable Q(s_t, a_t) as S_Q (S_Q = (q_1, q_2, ..., q_m), (q_1′, q_2′, ..., q_m′) = q_j, q_j′ (j = 1, 2, ..., m)). With the state pattern S_y of the intermediate variable y expressed by the following equation (11), it calculates the state pattern S_Q of the output variable from the state pattern S_y (S_y = (S_x(S_v), S_v(S_x), S_x(S_θ), S_θ(S_x), S_x(S_ω), S_ω(S_x), S_v(S_θ), S_θ(S_v), S_v(S_ω), S_ω(S_v), S_θ(S_ω), S_ω(S_θ))) by the following equations (12-1) and (12-2), which are based on equation (3) using the sign function sgn(u).

(State pattern S_y of the intermediate variable y)
S_y
= (y_1, y_2, ..., y_{12n})
= (y_xv1, y_xv2, ..., y_xvn, y_vx1, y_vx2, ..., y_vxn,
y_xθ1, y_xθ2, ..., y_xθn, y_θx1, y_θx2, ..., y_θxn,
y_xω1, y_xω2, ..., y_xωn, y_ωx1, y_ωx2, ..., y_ωxn,
y_vθ1, y_vθ2, ..., y_vθn, y_θv1, y_θv2, ..., y_θvn,
y_vω1, y_vω2, ..., y_vωn, y_ωv1, y_ωv2, ..., y_ωvn,
y_θω1, y_θω2, ..., y_θωn, y_ωθ1, y_ωθ2, ..., y_ωθn)
... (11)
(When evaluating movement in the right direction (+Y direction))
q_j = sgn(Σ_i w_ji y_i) (i = 1, 2, ..., 12n) ... (12-1)
(When evaluating movement in the left direction (−Y direction))
q_j′ = sgn(Σ_i w_ji′ y_i) (i = 1, 2, ..., 12n) ... (12-2)
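Equations (12-1) and (12-2) amount to a thresholded weighted sum per output element. The following is a minimal sketch, not the patented implementation. The value of sgn at u = 0 is governed by equation (3), which is not reproduced in this section, so treating sgn(0) as +1 is an assumption made here. The function names and the small weight matrix are hypothetical.

```python
from typing import List, Sequence


def sgn(u: float) -> int:
    """Sign function; sgn(0) = +1 is an assumption (see equation (3))."""
    return 1 if u >= 0 else -1


def output_pattern(w: Sequence[Sequence[float]], y: Sequence[int]) -> List[int]:
    """q_j = sgn(sum_i w_ji * y_i) for each output element j."""
    return [sgn(sum(w_ji * y_i for w_ji, y_i in zip(row, y))) for row in w]


# Illustrative values: m = 2 output elements, 3 intermediate elements.
w_right = [[0.5, -1.0, 0.0],
           [-0.2, 0.1, 0.3]]
y = [1, 0, -1]            # desensitized elements contribute nothing (y_i = 0)

q = output_pattern(w_right, y)
```

The same function applies unchanged to the left-direction weights w_ji′ to obtain q_j′.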

C7F: Output variable storage means
The output variable storage means C7F stores the value (output value) of the output variable Q(s_t, a_t) calculated by the output variable calculation means C7E. The output variable storage means C7F of the first embodiment stores, at each time step t, the state pattern S_Q of the output variable Q(s_t, a_t) (S_Q = (q_1, q_2, ..., q_m), (q_1′, q_2′, ..., q_m′) = q_j, q_j′ (j = 1, 2, ..., m)).
C7G: Action value function learning means (connection weight learning means)
The action value function learning means C7G has a learning determination means C7G1, a connection weight calculation means C7G2, and a connection weight storage means C7G3, and executes an action value function learning process for learning the action value function Q(s_t, a_t). When the inverted pendulum control process has ended in the failure state, the action value function learning means C7G of the first embodiment takes the state pattern of the action value function Q(s_t, a_t) stored in the action value function storage means C6 after the end of the inverted pendulum control process as T_Q (T_Q = (t_1, t_2, ..., t_m), (t_1′, t_2′, ..., t_m′) = t_j, t_j′ (j = 1, 2, ..., m)). It calculates the update values Δw_ji, Δw_ji′ of the connection weights w_ji, w_ji′ by the following equations (13-1) and (13-2), which are based on equations (4) and (4)′, and repeats the process of updating the connection weights w_ji, w_ji′, thereby executing the action value function learning process for learning the approximated action value function Q(s_t, a_t). In the first embodiment, the positive constant ε is preset to 0.3 (ε = 0.3).

(When evaluating movement in the right direction (+Y direction))
Δw_ji = ε(t_j − q_j)y_i ... (13-1)
(When evaluating movement in the left direction (−Y direction))
Δw_ji′ = ε(t_j′ − q_j′)y_i ... (13-2)
That is, denoting the connection weights w_ji, w_ji′ at times k×t and (k+1)×t as w_ji(k), w_ji(k+1), w_ji′(k), w_ji′(k+1), the action value function learning means C7G repeats the process of updating the connection weights w_ji, w_ji′ from w_ji(k), w_ji′(k) to w_ji(k+1), w_ji′(k+1) by the following equations (14-1) and (14-2) (k = 0, 1, ...).
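A single update step according to equations (13-1) and (14-1) can be sketched as below. This is an illustrative sketch under the assumption that the weights are held as an m × 12n nested list. The function name `update_weights` and the example values are hypothetical, and ε = 0.3 is the value stated for the first embodiment.

```python
from typing import List, Sequence


def update_weights(
    w: List[List[float]],
    t: Sequence[int],
    q: Sequence[int],
    y: Sequence[int],
    eps: float = 0.3,
) -> List[List[float]]:
    """Apply w_ji <- w_ji + eps * (t_j - q_j) * y_i in place and return w."""
    for j, (t_j, q_j) in enumerate(zip(t, q)):
        err = t_j - q_j
        if err == 0:
            continue  # delta is zero for every i, nothing to change in row j
        for i, y_i in enumerate(y):
            w[j][i] += eps * err * y_i
    return w


# Illustrative values: m = 2 outputs, 3 intermediate elements, zero initial weights.
w = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
t = [1, -1]      # target pattern elements t_j
q = [-1, -1]     # current output pattern elements q_j
y = [1, 0, -1]   # intermediate pattern elements y_i

update_weights(w, t, q, y)
```

Only rows whose output element disagrees with the target are changed, and desensitized elements (y_i = 0) leave their weights untouched. The left-direction weights w_ji′ follow the same rule with t_j′ and q_j′.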

w_ji(k+1) = w_ji(k) + Δw_ji ... (14-1)
w_ji′(k+1) = w_ji′(k) + Δw_ji′ ... (14-2)
When the state patterns S_x, S_v, S_θ, S_ω of the input variables x, v, θ, ω are similar before and after the time changes from k×t to (k+1)×t, the state pattern T_Q of the output variable Q(s_t, a_t) may also show no significant change and may, for example, be the same state pattern (T_Q). Therefore, when the time changes from k×t to (k+1)×t, the action value function learning means C7G of the first embodiment omits the update of the connection weights w_ji, w_ji′ if the state patterns S_x, S_v, S_θ, S_ω of the input variables x, v, θ, ω are similar (for example, 90% or more of the state patterns S_x, S_v, S_θ, S_ω are unchanged) and the state pattern T_Q of the output variable Q(s_t, a_t) has not changed (Q(s_t, a_t) = Q(s_{t+1}, a_{t+1})). This reduces the number of calculation steps (the amount of computation) in the process of learning the action value function Q(s_t, a_t).

C7G1: Learning determination means
When the time changes from k×t to (k+1)×t, the learning determination means C7G1 determines whether the state patterns S_x, S_v, S_θ, S_ω of the input variables x, v, θ, ω are similar (for example, 90% or more of the state patterns S_x, S_v, S_θ, S_ω are unchanged) and whether the state pattern T_Q of the output variables Q(s_t, a_t), Q(s_{t+1}, a_{t+1}) has changed, thereby determining whether to omit the update of the connection weights w_ji, w_ji′ at times k×t, (k+1)×t (k = 0, 1, 2, ...).
C7G2: Connection weight calculation means
The connection weight calculation means C7G2 calculates (updates) the connection weights w_ji, w_ji′ by equations (13-1), (13-2), (14-1), and (14-2).
C7G3: Connection weight storage means
The connection weight storage means C7G3 stores the connection weights w_ji, w_ji′ calculated by the connection weight calculation means C7G2.
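The determination made by the learning determination means C7G1 can be sketched as follows. The text gives "90% or more unchanged" only as an example criterion and does not say how the fraction is measured, so counting unchanged elements over the concatenated input patterns is an assumption, as are the function names.

```python
from typing import Sequence


def fraction_unchanged(prev: Sequence[int], curr: Sequence[int]) -> float:
    """Fraction of elements that are identical in two equally long patterns."""
    if len(prev) != len(curr) or not prev:
        raise ValueError("patterns must be non-empty and of equal length")
    return sum(p == c for p, c in zip(prev, curr)) / len(prev)


def should_skip_update(
    s_prev: Sequence[int],
    s_curr: Sequence[int],
    t_prev: Sequence[int],
    t_curr: Sequence[int],
    threshold: float = 0.9,
) -> bool:
    """True when the inputs are similar AND the target pattern T_Q is unchanged."""
    similar = fraction_unchanged(s_prev, s_curr) >= threshold
    return similar and list(t_prev) == list(t_curr)


# Illustrative input patterns (S_x, S_v, S_theta, S_omega concatenated), 10 elements.
s_k = [1] * 10
s_k1_close = [1] * 9 + [-1]        # 9 of 10 elements unchanged
s_k1_far = [1] * 8 + [-1, -1]      # 8 of 10 elements unchanged
t_q = [1, -1, 1]

skip_close = should_skip_update(s_k, s_k1_close, t_q, t_q)
skip_far = should_skip_update(s_k, s_k1_far, t_q, t_q)
skip_changed_target = should_skip_update(s_k, s_k1_close, t_q, [1, 1, 1])
```

Both conditions must hold for the update to be omitted; a changed target pattern forces an update even when the inputs are nearly identical.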

C8: Inverted pendulum control end determination means (control end determination means)
The inverted pendulum control end determination means C8 has an inverted pendulum control time measuring means C8A, a success state determination means C8B, and a failure state determination means C8C, and determines whether to end the inverted pendulum control process. The inverted pendulum control end determination means C8 of the first embodiment determines whether to end the inverted pendulum control process by determining whether the rod 2 has remained inverted for 180 seconds since the start of the inverted pendulum control process (success state) or a negative reward r_t has been obtained (failure state).
C8A: Inverted pendulum control time measuring means
When the inverted pendulum control process is started, the inverted pendulum control time measuring means C8A starts the timer TM, described later, and thereby measures the elapsed time k×t (k = 0, 1, ..., k_max (9000)) in the inverted pendulum control process.

C8B: Success state determination means
The success state determination means C8B determines whether the success state has been reached by determining, using the inverted pendulum control time measuring means C8A, whether the rod 2 has remained inverted for 180 seconds (k = k_max = 9000) since the start of the inverted pendulum control process without the reward r_t ever taking a negative value (r_t ≥ 0). Here, k_max is the maximum value of the natural number k (k_max = 180/0.02 = 9000).
C8C: Failure state determination means
The failure state determination means C8C determines whether the failure state has been reached by determining whether the reward r_t is a negative value (r_t < 0).
TM: Timer
The timer TM measures each time k×t (k = 0, 1, ..., k_max (9000)) during execution of the inverted pendulum control process.

(Description of the flowcharts of the first embodiment)
Next, the processing flow of the function approximation program AP1 of the control unit C of the first embodiment will be described with reference to flowcharts.
(Description of the flowchart of the main process of the first embodiment)
FIG. 6 is a flowchart of the main process of the function approximation program according to the first embodiment of the present invention.
The processing of each ST (step) in the flowchart of FIG. 6 is performed according to a program stored in the ROM or the like of the control unit C. This processing is executed by multitasking in parallel with the other various processes of the control unit C.

The flowchart shown in FIG. 6 starts when the function approximation program AP1 is started after the cart 1 is powered on.
In ST1 of FIG. 6, the following processes (1) to (6) are executed, and the process proceeds to ST2.
(1) 0.0 is set as the initial value of the position x, velocity v, and acceleration a of the cart 1 and of the angular velocity ω and angular acceleration b of the rod 2 (x(0) = v(0) = ω(0) = a(0) = b(0) = 0.0).
(2) A random number in the range of −3.0° to 3.0° is set as the initial value of the angle θ of the rod 2 (θ(0) = −3.0° to 3.0°).
(3) 20.0 is set as the initial value of the force F applied to the cart 1 in the right direction (+Y direction) (F(0) = 20.0).
(4) 0.0 is set as the initial value of the connection weights w_ji, w_ji′ (w_ji(0) = w_ji′(0) = 0.0).
(5) The set values of the constants M, m, L, t, g for the cart 1 and the rod 2 and the set values of the reinforcement learning constants α, γ are set (M = 1.0, m = 0.1, L = 1.0, g = 9.8, t = 0.02, α = 0.1, γ = 0.95).
(6) The natural number k is set to 1 (k = 1).
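The initialization in ST1 can be written down as a small configuration sketch. The numeric values are those listed in (1) to (6); the class and function names, the use of `random.Random`, and the parameter `m_out` (the number m of output elements, renamed here to avoid clashing with the constant m) are assumptions made for illustration. The values of n and m_out are not given in this section and are passed in as parameters.

```python
import random
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Constants:
    M: float = 1.0       # constant M from ST1 (5)
    m: float = 0.1       # constant m from ST1 (5)
    L: float = 1.0       # constant L from ST1 (5)
    g: float = 9.8       # constant g from ST1 (5)
    t: float = 0.02      # time step
    alpha: float = 0.1   # reinforcement learning constant alpha
    gamma: float = 0.95  # reinforcement learning constant gamma


@dataclass
class State:
    x: float = 0.0          # cart position
    v: float = 0.0          # cart velocity
    a: float = 0.0          # cart acceleration
    theta_deg: float = 0.0  # rod angle in degrees
    omega: float = 0.0      # rod angular velocity
    b: float = 0.0          # rod angular acceleration
    F: float = 20.0         # initial force in the +Y direction
    k: int = 1              # step counter


def initialize(rng: random.Random, m_out: int, n: int) -> Tuple[Constants, State, List[List[float]], List[List[float]]]:
    """Return constants, initial state, and zeroed weights w_ji and w_ji'."""
    state = State(theta_deg=rng.uniform(-3.0, 3.0))
    w = [[0.0] * (12 * n) for _ in range(m_out)]
    w_prime = [[0.0] * (12 * n) for _ in range(m_out)]
    return Constants(), state, w, w_prime


constants, state, w, w_prime = initialize(random.Random(0), m_out=3, n=5)
```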

In ST2, it is determined whether learned connection weights w_ji, w_ji′ are stored in the connection weight storage means C7G3. If yes (Y), the process proceeds to ST3; if no (N), the process proceeds to ST4.
In ST3, the learned connection weights w_ji, w_ji′ stored in the connection weight storage means C7G3 are set. Then, the process proceeds to ST4.
In ST4, timing by the timer TM is started. Then, the process proceeds to ST5.

In ST5, the state s_t is measured by calculating the values of the input variables x, v, θ, ω at time t (k = 1, t = 0.02 [s]) (input values x(1), v(1), θ(1), ω(1)) based on equations (6-1), (6-2), (7-1) to (7-4), (7-1)′ to (7-4)′, the constants M, m, L, g, t, and the variables a(0), b(0), k, F(0) (s_t = x(1), v(1), θ(1), ω(1)). Then, the process proceeds to ST6.
In ST6, an action value function approximation process (see the flowchart of FIG. 7 described later) for approximating the action value function Q(s_t, a_t) by the multivariable mutual modification model N (see FIG. 4) is executed. That is, the action value function approximation process calculates the approximate value of the action value function Q(s_t, a_t) at time k×t (k = 1, 2, ..., k_max) in the state s_t. Then, the process proceeds to ST7.
In ST7, based on the greedy selection method, the action a_t that maximizes the action value function Q(s_t, a_t), which serves as the evaluation value, is selected and executed. That is, the cart 1 is moved in the left-right direction (Y-axis direction) according to the force F(k) at time k×t (k = 1, 2, ..., k_max). Then, the process proceeds to ST8.
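The greedy selection in ST7 reduces to picking the action with the largest evaluation value. The sketch below assumes that the output patterns for the right and left directions have already been decoded into scalar Q values; that decoding is not described in this section. The function name and the action labels are hypothetical.

```python
from typing import Dict


def greedy_action(q_values: Dict[str, float]) -> str:
    """Return the action whose approximated Q(s_t, a_t) is largest."""
    if not q_values:
        raise ValueError("at least one action is required")
    # On ties, max() returns the first maximal key in insertion order.
    return max(q_values, key=lambda action: q_values[action])


# Illustrative Q values for the two movement directions.
chosen = greedy_action({"right": 0.4, "left": 0.7})
```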

In ST8, the following processes (1) and (2) are executed, and the process proceeds to ST9.
(1) The reward r_t corresponding to the action a_t is obtained. The reward r_t of the first embodiment is obtained as a negative value (r_t < 0) when the rod 2 has fallen and the angle θ satisfies θ ≤ −180° or θ ≥ 180°, and as a value of 0 or more (r_t ≥ 0) otherwise. The reward r_t is preset so that its value increases as the angle θ of the rod 2 approaches 0 [deg].
(2) The next state s_{t+1} after the action a_t is measured by calculating the values of the input variables x, v, θ, ω at time (k+1)×t (input values x(k+1), v(k+1), θ(k+1), ω(k+1)) based on equations (6-1), (6-2), (7-1) to (7-4), the constants M, m, L, g, t, and the variables x(k), v(k), θ(k), ω(k), a(k), b(k), k, F(k) (s_{t+1} = x(k+1), v(k+1), θ(k+1), ω(k+1) (k = 1, 2, ..., k_max)).
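The reward described in ST8 (1) is specified only by its sign and by being larger the closer θ is to 0. One function with these properties is sketched below; the linear shape and the values −1.0 and 1.0 are purely hypothetical choices and are not taken from the patent.

```python
def reward(theta_deg: float) -> float:
    """Hypothetical reward: negative once |theta| >= 180 deg, otherwise in (0, 1]."""
    if abs(theta_deg) >= 180.0:
        return -1.0                        # failure: rod has fallen
    return 1.0 - abs(theta_deg) / 180.0    # larger as theta approaches 0 deg


r_upright = reward(0.0)
r_tilted = reward(90.0)
r_fallen = reward(-180.0)
```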

In ST9, an action value function update process is executed that calculates (updates) the action value function Q(s_t, a_t) at time k×t (k = 0, 1, 2, ..., k_max) based on equation (8). Then, the process proceeds to ST10.
In ST10, it is determined whether the success state has been reached by determining whether 180 seconds (k = k_max = 9000) have been measured since the timer TM started without a negative reward r_t ever being obtained (r_t ≥ 0), that is, whether the rod 2 has remained inverted for 180 seconds. If no (N), the process proceeds to ST11; if yes (Y), the process returns to ST1.

In ST11, it is determined whether the failure state has been reached by determining whether a negative reward r_t (r_t < 0) has been obtained. If no (N), the process proceeds to ST12; if yes (Y), the process proceeds to ST13.
In ST12, 1 is added to the natural number k (k = k + 1). Then, the process returns to ST6.
In ST13, an action value function learning process (see the flowchart of FIG. 8 described later) for learning the action value function Q(s_t, a_t) is executed. Then, the process returns to ST1.
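The control flow of ST6 to ST12 for one episode can be sketched as a loop. This is a structural sketch only: the callables `choose_action` and `apply_action` are hypothetical stand-ins for the approximation and greedy selection (ST6, ST7) and for the reward acquisition and state measurement (ST8), and the Q-learning update of ST9 is omitted.

```python
from typing import Callable, Tuple


def run_episode(
    choose_action: Callable[[int], str],
    apply_action: Callable[[int, str], float],
    k_max: int,
) -> Tuple[str, int]:
    """Run one episode and return ("success" | "failure", final step k)."""
    k = 1                                  # ST1 (6)
    while True:
        action = choose_action(k)          # ST6, ST7
        r = apply_action(k, action)        # ST8: returns the reward r_t
        # ST9 (action value function update) would go here.
        if r >= 0 and k >= k_max:          # ST10: success state
            return "success", k
        if r < 0:                          # ST11: failure state, ST13 follows
            return "failure", k
        k += 1                             # ST12


k_max = round(180 / 0.02)  # 180 s at t = 0.02 s per step

ok = run_episode(lambda k: "right", lambda k, a: 0.0, k_max)
failed = run_episode(lambda k: "right", lambda k, a: -1.0 if k == 5 else 0.5, k_max)
```

Because the loop exits on the first negative reward, reaching k_max with a non-negative reward implies that no negative reward was obtained during the episode.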

(Description of the flowchart of the action value function approximation process of the first embodiment)
FIG. 7 is a flowchart of the action value function approximation process of the function approximation program according to the first embodiment of the present invention, and is an explanatory diagram of the subroutine of ST6 in FIG. 6.
In ST101 of FIG. 7, the input variables x, v, θ, ω at time k×t (k = 0, 1, 2, ..., k_max) are generated. That is, the state patterns S_x, S_v, S_θ, S_ω of the input variables x, v, θ, ω at time k×t in the input layer Na (see FIG. 4) of the multivariable mutual modification model N are set (generated). Then, the process proceeds to ST102.

In ST102, the intermediate variable y (y = (x(v), v(x), x(θ), θ(x), x(ω), ω(x), v(θ), θ(v), v(ω), ω(v), θ(ω), ω(θ))) is calculated from the generated input variables x, v, θ, ω. That is, the state pattern S_y of the intermediate variable y in the intermediate layer Nb (see FIG. 4) of the multivariable mutual modification model N (S_y = (S_x(S_v), S_v(S_x), S_x(S_θ), S_θ(S_x), S_x(S_ω), S_ω(S_x), S_v(S_θ), S_θ(S_v), S_v(S_ω), S_ω(S_v), S_θ(S_ω), S_ω(S_θ))) is calculated from the state patterns S_x, S_v, S_θ, S_ω of the input variables x, v, θ, ω (see equations (9-1), (9-2), (10-1), and (10-2)). In ST102, as shown in FIG. 7, the twelve steps ST102a to ST102m, which calculate the twelve product-type modifications x(v), v(x), x(θ), θ(x), x(ω), ω(x), v(θ), θ(v), v(ω), ω(v), θ(ω), ω(θ) constituting the intermediate variable y, are executed in parallel.
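The twelve product-type modifications are the ordered pairs of the four input variables. A short sketch that enumerates them in the order used in the text is given below; the list and variable names are illustrative, and n = 5 is an arbitrary example value.

```python
from itertools import combinations

variables = ["x", "v", "θ", "ω"]

# For each unordered pair (a, b), emit a(b) and then b(a),
# which reproduces the ordering x(v), v(x), x(θ), θ(x), ...
modifications = []
for a, b in combinations(variables, 2):
    modifications.append(f"{a}({b})")
    modifications.append(f"{b}({a})")

n = 5                                         # elements per state pattern (example value)
intermediate_length = len(modifications) * n  # length 12n of the pattern S_y
```

Each modification contributes n elements, so the intermediate pattern S_y has 12n elements, matching the index range i = 1, 2, ..., 12n in equations (12-1) and (12-2).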

In ST103, the output variable Q(s_t, a_t) is calculated from the calculated intermediate variable y. That is, the state pattern S_Q of the output variable Q(s_t, a_t) in the output layer Nc (see FIG. 4) of the multivariable mutual modification model N is calculated from the state pattern S_y of the intermediate variable y (see equations (11), (12-1), and (12-2)). Then, the action value function approximation process ends, and the process returns to the main process of FIG. 6.

(Description of the flowchart of the action value function learning process of the first embodiment)
FIG. 8 is a flowchart of the action value function learning process of the function approximation program according to the first embodiment of the present invention, and is an explanatory diagram of the subroutine of ST13 in FIG. 6.
In ST201 of FIG. 8, the variables i and j are set to 1, and the natural number k and the update values Δw_ji, Δw_ji′ of the connection weights w_ji, w_ji′ are set to 0 (i = j = 1, k = 0, Δw_ji = Δw_ji′ = 0). Then, the process proceeds to ST202.

In ST202, it is determined whether, when the time changes from k×t to (k+1)×t (k = 0, 1, ..., k_max), the state patterns S_x, S_v, S_θ, S_ω of the input variables x, v, θ, ω are similar (for example, 90% or more of the state patterns S_x, S_v, S_θ, S_ω are unchanged) and the state pattern T_Q of the output variable Q(s_t, a_t) has not changed (s_t ≈ s_{t+1} and Q(s_t, a_t) = Q(s_{t+1}, a_{t+1})). If no (N), the process proceeds to ST203; if yes (Y), the process proceeds to ST211.

In ST203, the update values (amounts of change) Δw_ji, Δw_ji′ of the connection weights w_ji, w_ji′ are calculated from the state pattern T_Q of the action value function Q(s_t, a_t) stored in the action value function storage means C6 (after the end of one episode) and the state pattern S_Q of the output variable Q(s_t, a_t) at time k×t (k = 0, 1, ..., k_max) stored in the output variable storage means C7F (see equations (13-1) and (13-2)). Then, the process proceeds to ST204.
In ST204, the connection weights w_ji, w_ji′ are updated by equations (14-1) and (14-2). Then, the process proceeds to ST205.

In ST205, it is determined whether the variable i is 12n or more (i ≥ 12n). If no (N), the process proceeds to ST206; if yes (Y), the process proceeds to ST207.
In ST206, 1 is added to the variable i (i = i + 1). Then, the process returns to ST203.
In ST207, the variable i is reset to 1 (i = 1). Then, the process proceeds to ST208.

In ST208, it is determined whether the variable j is m or more (j ≥ m). If no (N), the process proceeds to ST209; if yes (Y), the process proceeds to ST210.
In ST209, 1 is added to the variable j (j = j + 1). Then, the process returns to ST203.
In ST210, the variable j is reset to 1 (j = 1). Then, the process proceeds to ST211.
In ST211, it is determined whether the natural number k is smaller than k_max − 1 (k < k_max − 1). If yes (Y), the process proceeds to ST212; if no (N), the action value function learning process ends, and the process returns to the main process of FIG. 6.
In ST212, 1 is added to the natural number k (k = k + 1). Then, the process returns to ST202.
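The loop structure of ST201 to ST212 can be sketched as three nested loops over the time index k, the output index j, and the intermediate index i. This sketch makes several assumptions: the stored target patterns, output patterns, and intermediate patterns are indexed per time step; the ST202 decision is supplied as a precomputed list of booleans; only the right-direction weights w_ji are shown (w_ji′ would be handled identically); and zero-based indices replace the one-based counters of the flowchart. All names are hypothetical.

```python
from typing import List, Sequence


def learn_episode(
    w: List[List[float]],
    targets: Sequence[Sequence[int]],        # t_j per time step k (pattern T_Q)
    outputs: Sequence[Sequence[int]],        # stored q_j per time step k (pattern S_Q)
    intermediates: Sequence[Sequence[int]],  # stored y_i per time step k (pattern S_y)
    skip: Sequence[bool],                    # result of the ST202 decision per step
    eps: float = 0.3,
) -> List[List[float]]:
    """Update w in place over all stored time steps and return it."""
    for k in range(len(outputs)):                 # ST211 / ST212
        if skip[k]:                               # ST202: yes -> ST211
            continue
        for j in range(len(w)):                   # ST208 / ST209
            for i in range(len(w[j])):            # ST205 / ST206
                delta = eps * (targets[k][j] - outputs[k][j]) * intermediates[k][i]  # ST203
                w[j][i] += delta                  # ST204
    return w


# Illustrative episode with two stored time steps; the second one is skipped.
w_learned = learn_episode(
    w=[[0.0, 0.0]],
    targets=[[1], [1]],
    outputs=[[-1], [-1]],
    intermediates=[[1, -1], [1, 1]],
    skip=[False, True],
)
```

Skipped time steps cost only the ST202 check, which is where the reduction in computation described for the learning determination means comes from.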

(Operation of the first embodiment)
In the function approximation system S of the first embodiment having the above configuration, the action value function approximation process based on the selective desensitization method (see ST6 in FIG. 6 and ST101 to ST103 in FIG. 7) is executed in the inverted pendulum control process (see ST1 to ST12 in FIG. 6) in order to approximate the action value function Q(s_t, a_t) used to select the action a_t (the force F applied to the cart 1 in the right direction (+Y direction)). Therefore, the action value function Q(s_t, a_t) can be calculated even for a state (s_t) that has not been learned by the action value function update process based on Q-learning (see ST9 in FIG. 6 and equation (8)). That is, an approximate value of the action value function Q(s_t, a_t) can be calculated, and an appropriate action a_t in the state s_t can be selected based on the approximated action value function Q(s_t, a_t) (see ST6 and ST7 in FIG. 6).

Further, in the function approximation system S of the first embodiment, when the inverted pendulum control process ends in the failure state, the action value function learning process (see ST13 in FIG. 6 and ST201 to ST212 in FIG. 8) for learning the connection weights w_ji, w_ji′ used to approximate the action value function Q(s_t, a_t) is executed. Therefore, when the inverted pendulum control process fails, the action value function Q(s_t, a_t) can be approximated with the updated connection weights w_ji, w_ji′ in the next episode, that is, in the next inverted pendulum control process. As a result, the action value function Q(s_t, a_t) can be approximated more accurately and the action a_t can be selected more appropriately (see ST2 and ST3 in FIG. 6).

Here, in the multivariable mutual modification model N used in the action value function approximation process and the action value function learning process, the intermediate layer Nb (see FIG. 4) takes every ordered pair of two input variables (first selection variable, second selection variable) from the four input variables x, v, θ, ω and applies product-type context modification to each pair mutually. The twelve resulting product-type modifications (₄P₂ = 4 × 3 = 12), x(v), v(x), x(θ), θ(x), x(ω), ω(x), v(θ), θ(v), v(ω), ω(v), θ(ω), ω(θ), are calculated as the intermediate variable y (see ST102 in FIG. 7 and equations (9-1), (9-2), (10-1), and (10-2)). The layered neural networks 01, 01′, 02 to which the conventionally known selective desensitization method is applied (see FIGS. 12 and 13) assume only the case of outputting the intermediate patterns X, X′ and the output pattern Y for two input patterns S, C, so a function (Q(s_t, a_t)) having three or more input variables (x, v, θ, ω) was outside their scope. In the function approximation system S of the first embodiment, by contrast, the multivariable mutual modification model N based on the selective desensitization method, which has high learning ability (generalization ability), can be applied to such a function as well.

As a result, the function approximation system S of Embodiment 1 can reduce the time required for learning to converge, the so-called learning time, by executing the action value function approximation process and the action value function learning process based on the multivariable mutual modification model N with its high learning ability (generalization ability), without having to execute all trials (every action (a_t) in every state (s_t)) as conventionally known Q-learning reinforcement learning does.

(Experimental Examples)
Here, to examine how the learning efficiency changes when the function approximation system S of Embodiment 1 executes the action value function approximation process and the action value function learning process, the following Experimental Examples 1 to 3 and Comparative Examples 1 to 3 were prepared.

(Experimental Example 1)
The function approximation system S of Experimental Example 1 is built with the same configuration as the function approximation system S of Embodiment 1, and executes the action value function approximation process and the action value function learning process in the main process of Embodiment 1 (see ST1 to ST13 in FIG. 6). That is, the action value function Q(s_t, a_t) is approximated not only for states s_t learned through Q-learning reinforcement learning but also for states s_t that have not been learned, and the action a_t based on the approximated Q(s_t, a_t) is executed; in addition, when the inverted pendulum control process fails, the connection weights w_ji, w_ji′ for approximating Q(s_t, a_t) are learned.
The function approximation system S of Experimental Example 1 was tested in an environment E free of disturbance and observation noise (see FIGS. 1 and 3).

(Comparative Example 1)
The function approximation system S′ of Comparative Example 1 applies the table lookup method to conventionally known Q-learning reinforcement learning; compared with the function approximation system S of Experimental Example 1, the action value function approximation process and the action value function learning process are omitted from the main process. That is, the function approximation system S′ of Comparative Example 1 stores in a table the action value function Q(s_t, a_t) for each learned state s_t. Consequently, for a learned state s_t, an appropriate action a_t based on Q(s_t, a_t) in that state is executed, but for an unlearned state s_t no Q(s_t, a_t) exists, and an appropriate action a_t cannot be executed until that state has been tried and learned.

In the function approximation system S′ of Comparative Example 1, each of the input variables x, v, θ, ω is divided into ten equal parts. That is, the input values that each of x, v, θ, ω can take are preset to ten kinds. Also, just as the function approximation system S of Embodiment 1 provides separate connection weights w_ji, w_ji′ for the rightward (+Y direction) and leftward (−Y direction) movements (actions a_t), the function approximation system S′ of Comparative Example 1 provides separate evaluation tables for the rightward (+Y direction) and leftward (−Y direction) movements (actions a_t). The size of the table stored in Comparative Example 1 is therefore 2 × 10^4 (10 × 10 × 10 × 10 × 2 = 2 × 10^4).
Like Experimental Example 1, the function approximation system S′ of Comparative Example 1 was tested in the environment E.
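The table size of Comparative Example 1 can be checked with a short sketch of a discretized Q-table. The value ranges in `RANGES` and the helper names `bin_index` and `table_index` are assumptions made for illustration; the text here gives only the number of divisions (ten per variable) and the number of actions (two).

```python
# Assumed value ranges, for illustration only (not given in this passage).
RANGES = {"x": (-2.4, 2.4), "v": (-3.0, 3.0), "θ": (-12.0, 12.0), "ω": (-180.0, 180.0)}
N_BINS = 10     # each input variable is divided into ten equal parts
N_ACTIONS = 2   # rightward (+Y) and leftward (-Y) movement

def bin_index(value, low, high, n_bins=N_BINS):
    """Map a value to one of n_bins equal-width bins, clamping out-of-range values."""
    ratio = (value - low) / (high - low)
    return min(n_bins - 1, max(0, int(ratio * n_bins)))

def table_index(state, action):
    """Flatten (x, v, θ, ω, action) into a single index into the table."""
    index = 0
    for name in ("x", "v", "θ", "ω"):
        low, high = RANGES[name]
        index = index * N_BINS + bin_index(state[name], low, high)
    return index * N_ACTIONS + action

TABLE_SIZE = N_BINS ** len(RANGES) * N_ACTIONS  # 10 * 10 * 10 * 10 * 2
q_table = [0.0] * TABLE_SIZE
print(TABLE_SIZE)
```

Every entry must be visited and updated before it holds a useful value, which is consistent with the large number of trials the table-based system needs in the experiments.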

(Experimental Example 2)
The function approximation system S of Experimental Example 2 has the same configuration as the function approximation system S of Experimental Example 1 and was tested in an environment E′ in which an angular velocity ω_f is added as a disturbance to the environment E of Experimental Example 1. Here, the disturbance (angular velocity) ω_f corresponds, in the case of controlling an actual inverted pendulum rather than a simulator, to a user flicking the rod 2 with a finger in the left-right direction (Y-axis direction).

In the environment E′ of Experimental Example 2, the disturbance ω_f is applied each time the angle θ and the angular velocity ω of the rod 2 remain within a predetermined range for 3 seconds. That is, a state that has remained within the predetermined range for 3 seconds is called the stable state. Letting θ_min, ω_min be the minimum values and θ_max, ω_max the maximum values of the angle θ and the angular velocity ω used to determine whether the stable state has been reached, the system is determined to be in the stable state, and the disturbance ω_f is applied, when θ and ω have stayed within θ_min ≤ θ ≤ θ_max and ω_min ≤ ω ≤ ω_max for 3 seconds. In Experimental Example 2, these values are preset to θ_min = −2.5 [deg], ω_min = −45 [deg/s], θ_max = 2.5 [deg], ω_max = 45 [deg/s] (−2.5° ≤ θ ≤ 2.5°, −45 ≤ ω ≤ 45).

Therefore, in the function approximation system S of Experimental Example 2, the stable state is determined, and the disturbance ω_f applied, when the angle θ and the angular velocity ω have stayed within −2.5° ≤ θ ≤ 2.5° and −45 ≤ ω ≤ 45 for 3 seconds.
Also, in the environment E′ of Experimental Example 2, the disturbance ω_f is an angular velocity whose magnitude increases by 1 [deg/s] each time a disturbance is applied, and its direction is chosen at random between left and right (Y-axis direction). That is, letting N_ω be the ordinal number of the disturbance (the number of disturbances applied so far), ω_f = N_ω [deg/s] (ω_f = −N_ω [deg/s] when applied in the leftward (−Y) direction), and ω_f is added to the value (input value) of the angular velocity ω of the rod 2.
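The disturbance rule of the environment E′ can be sketched as a small scheduler. The class name `DisturbanceScheduler`, the step width `dt = 0.02` s (inferred from k_max = 9000 steps for 180 seconds), the symmetric range check, and the resetting of the stability timer after each disturbance are assumptions for illustration; the text specifies only the range, the 3-second hold, the random direction, and ω_f = ±N_ω [deg/s].

```python
import random

class DisturbanceScheduler:
    """Emit ω_f = ±N_ω [deg/s] each time θ and ω stay in range for hold_s seconds."""

    def __init__(self, theta_max=2.5, omega_max=45.0, hold_s=3.0, dt=0.02, rng=None):
        self.theta_max = theta_max            # [deg], stable range is ±theta_max
        self.omega_max = omega_max            # [deg/s], stable range is ±omega_max
        self.hold_steps = round(hold_s / dt)  # steps needed to count as stable
        self.rng = rng or random.Random()
        self.stable_steps = 0
        self.count = 0                        # N_ω, disturbances applied so far

    def step(self, theta, omega):
        """Return the disturbance to add to ω at this step (0.0 if none)."""
        if abs(theta) <= self.theta_max and abs(omega) <= self.omega_max:
            self.stable_steps += 1
        else:
            self.stable_steps = 0
        if self.stable_steps < self.hold_steps:
            return 0.0
        # Stable state reached: restart the timer (assumed) and flick the rod.
        self.stable_steps = 0
        self.count += 1
        sign = self.rng.choice((1.0, -1.0))   # right (+Y) or left (-Y) at random
        return sign * self.count
```

Passing `theta_max=6.0, omega_max=150.0` gives the looser stability condition used for Comparative Example 2.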

(Comparative Example 2)
The function approximation system S′ of Comparative Example 2 has the same configuration as the function approximation system S′ of Comparative Example 1 and was tested in the same environment E′ as Experimental Example 2. Compared with the function approximation system S of Experimental Example 2, the function approximation system S′ of Comparative Example 2 partitions the state s_t coarsely to make learning converge more easily (ten kinds of input values for each of x, v, θ, ω), so it takes longer to stabilize, and even when stable the input values of x, v, θ, ω vary by larger amounts. For this reason, to loosen the stability condition relative to Experimental Example 2, the environment E′ of Comparative Example 2 presets the minimum values θ_min, ω_min and the maximum values θ_max, ω_max of the angle θ and the angular velocity ω to θ_min = −6.0 [deg], ω_min = −150 [deg/s], θ_max = 6.0 [deg], ω_max = 150 [deg/s] (−6° ≤ θ ≤ 6°, −150 ≤ ω ≤ 150). That is, in the function approximation system S′ of Comparative Example 2, the stable state is determined, and the disturbance ω_f applied, when θ and ω have stayed within −6° ≤ θ ≤ 6° and −150 ≤ ω ≤ 150 for 3 seconds. Furthermore, because the function approximation system S′ of Comparative Example 2 needs a long time to reach the stable state, the 180-second upper limit on the inversion time of the rod 2, which defines the success state (the rod 2 kept inverted for 180 seconds), was not imposed.

(Experimental Example 3)
The function approximation system S of Experimental Example 3 has the same configuration as the function approximation system S of Experimental Example 1 and was tested in an environment E″ in which errors Δx, Δθ are added as observation noise to the position x of the cart 1 and the angle θ of the rod 2 in the environment E of Experimental Example 1. Here, the observation noise Δx, Δθ corresponds, in the case of controlling an actual inverted pendulum rather than a simulator, to the noise of the sensors provided on the cart 1 (position sensor, tilt angle sensor, and the like), so-called sensor noise. In the environment E″ of Experimental Example 3, random numbers are generated at each time k × t (k = 0, 1, 2, …, k_max), and the observation noise is applied as an error Δx within ±0.05 [m] on the position x and an error Δθ within ±0.5 [deg] on the angle θ.

(Comparative Example 3)
The function approximation system S′ of Comparative Example 3 has the same configuration as the function approximation system S′ of Comparative Example 1 and was tested in the same environment E″ as Experimental Example 3.
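The observation noise of the environment E″ used in Experimental Example 3 and Comparative Example 3 can be sketched as follows. The text states only the bounds (±0.05 [m] on x, ±0.5 [deg] on θ) and that random numbers are drawn at every time step; the uniform distribution and the function name `observe` are assumptions for illustration.

```python
import random

DX_MAX = 0.05      # [m], bound on the position error Δx (from the text)
DTHETA_MAX = 0.5   # [deg], bound on the angle error Δθ (from the text)

def observe(x, theta, rng=random):
    """Return the noisy observation (x + Δx, θ + Δθ) for one time step.

    Assumption: Δx and Δθ are drawn independently and uniformly within
    their bounds; the text does not name the distribution.
    """
    dx = rng.uniform(-DX_MAX, DX_MAX)
    dtheta = rng.uniform(-DTHETA_MAX, DTHETA_MAX)
    return x + dx, theta + dtheta

# Usage: the controller sees the noisy values, the simulator keeps the true ones.
noisy_x, noisy_theta = observe(0.10, 1.0, random.Random(42))
```

The velocity v and angular velocity ω are left untouched here because the text adds noise only to x and θ.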

(Experimental Results)
The experimental results of Experimental Examples 1 to 3 and Comparative Examples 1 to 3 are shown below.

(Experimental Results of Experimental Example 1 and Comparative Example 1)
FIG. 9 is an explanatory diagram of the experimental results: graphs comparing the learning efficiency of Experimental Example 1 with that of Comparative Example 1, with the number of trials (episodes) of the inverted pendulum control process on the horizontal axis and the trial time (the time the rod stayed inverted) on the vertical axis. FIG. 9A shows the results of Experimental Example 1, and FIG. 9B shows the results of Comparative Example 1.
As shown in FIGS. 9A and 9B, the function approximation system S of Experimental Example 1 reached the success state (the rod 2 kept inverted for 180 seconds) in about 10 trials (episodes), whereas the function approximation system S′ of Comparative Example 1 needed about 1300 trials to reach the success state. That is, the function approximation system S of Experimental Example 1 learns faster at the outset and has higher learning ability (generalization ability) than the function approximation system S′ of Comparative Example 1.

In FIG. 9A, the function approximation system S of Experimental Example 1 frequently shows trials that end in the failure state (the rod 2 having fallen) even after first reaching the success state. This result can be explained as follows. In the function approximation system S of Experimental Example 1, the input variable (angle) θ is preset so that, when the input values θ(k), θ(k+1) are near 0.0 [deg] (θ(k) ≈ 0, θ(k+1) ≈ 0), even a slight change in the input value as time advances from k × t to (k+1) × t (θ(k+1) − θ(k) ≈ 0) makes the state pattern S_θ(k+1) after the change differ greatly from the state pattern S_θ(k) before the change, whereas for other input values θ(k), θ(k+1) a slight change in the input value does not make S_θ(k+1) differ greatly from S_θ(k).

Also, to decide whether to learn the connection weights w_ji, w_ji′, the system determines whether the state patterns S_x, S_v, S_θ, S_ω of the input variables x, v, θ, ω are similar before and after time advances from k × t to (k+1) × t (k = 0, 1, …, k_max) (for example, 90% or more of each of the state patterns S_x, S_v, S_θ, S_ω is unchanged) and whether the state pattern T_Q of the output variable Q(s_t, a_t) is unchanged (see ST202 in FIG. 8).
Therefore, in the function approximation system S of Experimental Example 1, learning of the connection weights w_ji, w_ji′ has many opportunities to learn how to maintain the stable state (−1° ≤ θ ≤ 1°) and few opportunities to learn how to get from a non-stable state to the stable state. It is therefore presumed that the probability of the failure state increases when a trial starts with an initial value θ(0) of the angle θ larger than the stable range (−1° ≤ θ ≤ 1°) (for example, θ(0) = 3 [deg]).

In FIG. 9B, by contrast, once the function approximation system S′ of Comparative Example 1 first reaches the success state, trials reaching the success state are repeated almost stably. This result shows that the function approximation system S′ of Comparative Example 1 needed as many as about 1300 trials for learning to converge, that is, for appropriate values of the action value function Q(s_t, a_t) to be set in the 2 × 10^4 table entries, but that once appropriate values of Q(s_t, a_t) are set in the table, it can stably repeat successful trials from then on.

(Experimental Results of Experimental Example 2 and Comparative Example 2)
FIG. 10 is an explanatory diagram of the experimental results: graphs comparing the learning efficiency of Experimental Example 2 with that of Comparative Example 2, with the number of trials (episodes) of the inverted pendulum control process on the horizontal axis and, on the vertical axis, the trial time (the time the rod stayed inverted) and the number of disturbances applied (the number of times the rod was flicked with a finger). FIG. 10A shows the results of Experimental Example 2, and FIG. 10B shows the results of Comparative Example 2.

As shown in FIG. 10A, in the function approximation system S of Experimental Example 2, after about 50 to 100 trials the waveform of the trial time k × t (k = 0, 1, …, k_max (9000)) versus the number of trials (episodes) (solid line in FIG. 10A) comes to correspond to the waveform of the number N_ω of disturbances ω_f applied versus the number of trials (dotted line in FIG. 10A). That is, the two waveforms become similar.
In contrast, as shown in FIG. 10B, the function approximation system S′ of Comparative Example 2 needed about 6000 trials before the waveform of the trial time k × t (k = 0, 1, …, k_max (9000)) versus the number of trials (solid line in FIG. 10B) came to correspond to (became similar to) the waveform of the number N_ω of disturbances ω_f applied versus the number of trials (dotted line in FIG. 10B).

In FIG. 10B, the function approximation system S′ of Comparative Example 2 appears to achieve sufficient trial time (inversion time) by about 4000 trials; however, the number N_ω of disturbances ω_f applied in the trials up to about 4000 is very small, which shows that the system was merely taking a very long time to reach the stable state. Moreover, in the trials from about 4000 to about 6000, the trial time is very short even though almost no disturbance ω_f was applied. It follows that the function approximation system S′ of Comparative Example 2 needed about 6000 trials for learning to converge.
As a result, the function approximation system S of Experimental Example 2 is about 60 to 120 times (6000/100 = 60, 6000/50 = 120) more adaptable to the disturbance ω_f than the function approximation system S′ of Comparative Example 2.

(Experimental Results of Experimental Example 3 and Comparative Example 3)
FIG. 11 is an explanatory diagram of the experimental results: graphs comparing the learning efficiency of Experimental Example 3 with that of Comparative Example 3, with the number of trials (episodes) of the inverted pendulum control process on the horizontal axis and the trial time (the time the rod stayed inverted) on the vertical axis. FIG. 11A shows the results of Experimental Example 3, and FIG. 11B shows the results of Comparative Example 3.

As shown in FIGS. 11A and 11B, the function approximation system S of Experimental Example 3 reached the success state in about 10 trials (episodes), whereas the function approximation system S′ of Comparative Example 3 needed about 1300 trials to reach the success state. Furthermore, after about 70 trials the function approximation system S of Experimental Example 3 came to reach the success state stably without being affected by the observation noise Δx, Δθ, whereas the function approximation system S′ of Comparative Example 3 still did not reach the success state stably even after about 1500 trials. That is, the function approximation system S of Experimental Example 3 is more adaptable to the observation noise Δx, Δθ than the function approximation system S′ of Comparative Example 3.

Also, in FIG. 11A, after the function approximation system S of Experimental Example 3 came to reach the success state stably at about 70 trials, trials ending in the failure state (the rod 2 having fallen) were fewer than in Experimental Example 1. This result can be explained as follows. Because of the observation noise Δx, Δθ, the range of input values x(k), v(k), θ(k), ω(k) (k = 0, 1, …, k_max (9000)) that the input variables x, v, θ, ω can take is wider than in the function approximation system S of Experimental Example 1. It is therefore presumed that the function approximation system S of Experimental Example 3 had more opportunities than that of Experimental Example 1 to learn how to get from a non-stable state to the stable state, and so became able to repeat successful trials more stably.

In FIG. 11B, unlike the function approximation system S′ of Comparative Example 1, the function approximation system S′ of Comparative Example 3 could no longer stably repeat successful trials even after first reaching the success state. This result can be explained as follows. In the function approximation systems S′ of Comparative Examples 1 to 3, the state s_t is partitioned coarsely to make learning converge more easily (ten kinds of input values for each of the input variables x, v, θ, ω), so even in the stable state the input values x(k), v(k), θ(k), ω(k) (k = 0, 1, …, k_max (9000)) range widely, and the cart 1 and the rod 2 move more than in the function approximation system S of Experimental Example 3. When the observation noise Δx, Δθ is added on top of this, it becomes harder to select an appropriate action a_t that maintains the stable state or brings a non-stable state to the stable state. It is presumed that learning consequently became hard to converge and successful trials could no longer be repeated stably.

These experimental results show that the function approximation system S of Embodiment 1 can greatly increase learning ability (generalization ability) compared with the conventionally known function approximation system S′, which does not execute the action value function approximation process and the action value function learning process, and can also greatly increase applicability to unknown environments E′, E″ that include the disturbance ω_f, the observation noise Δx, Δθ, and the like. That is, even when applied not to a simulator but to a device (function approximation device) that controls an actual inverted pendulum, the function approximation system S of Embodiment 1 can control the inverted pendulum more appropriately than the conventionally known function approximation system S′.

Also, in the function approximation system S of Embodiment 1, when the state patterns S_x, S_v, S_θ, S_ω of the input variables x, v, θ, ω are supplied from the input layer Na of the multivariable mutual modification model N (see FIG. 4), the twelve product-type modifications x(v), v(x), x(θ), θ(x), x(ω), ω(x), v(θ), θ(v), v(ω), ω(v), θ(ω), ω(θ) serving as the intermediate variables y are each computed in parallel (see ST102a to ST102m in FIG. 7). Therefore, for example, when the product-type modifications are computed in parallel by hardware such as multiple LSIs (large-scale integrated circuits), the intermediate variables y can be computed at high speed (about 12 times faster with twelve LSIs, about 6 times faster with six LSIs), and the computation time of the action value function approximation process can be reduced.
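Because each product-type modification depends only on its own pair of state patterns, the twelve computations are independent and can be dispatched to separate workers. The sketch below illustrates only that independence. The gating rule in `modify` (a 0/1 gain derived from ±1 modifier elements) is a simplified stand-in and an assumption, since equations (9-1) to (10-2) are not reproduced in this passage, and a thread pool in Python merely models the dispatch; it does not reproduce the speedup of dedicated LSIs.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import permutations

def modify(modified, modifier):
    """Assumed stand-in for one product-type modification.

    Each element of `modified` (+1 or -1) is multiplied by a gain of 1 or 0
    derived from the matching element of `modifier` (+1 -> 1, -1 -> 0).
    """
    return [s * ((1 + c) // 2) for s, c in zip(modified, modifier)]

def intermediate_pattern(patterns, workers=12):
    """Compute every ordered-pair modification concurrently and concatenate them.

    `patterns` maps a variable name to its state pattern of n elements;
    the result has 12n elements when four variables are given.
    """
    pairs = list(permutations(patterns, 2))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = list(pool.map(lambda p: modify(patterns[p[0]], patterns[p[1]]), pairs))
    return [y for part in parts for y in part]

# Usage with toy patterns of n = 4 elements per variable.
patterns = {
    "x": [1, -1, 1, -1],
    "v": [1, 1, -1, -1],
    "θ": [-1, 1, 1, -1],
    "ω": [-1, -1, 1, 1],
}
s_y = intermediate_pattern(patterns)
print(len(s_y))
```

Setting `workers=6` models the six-unit case mentioned in the text, where each unit handles two of the twelve modifications.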

(Modifications)
Although an embodiment of the present invention has been described in detail above, the present invention is not limited to that embodiment, and various modifications can be made within the scope of the gist of the invention described in the claims. Modifications (H01) to (H08) of the present invention are illustrated below.
(H01) The function approximation system S of Embodiment 1 is configured as a simulator for solving the so-called inverted pendulum problem, but it is not limited to a simulator and can also be applied to the control unit of a real control device. Nor is the task limited to the inverted pendulum problem; the system can also be applied to other tasks, for example robots that control bipedal walking and pattern recognition devices for images, speech, and the like.

(H02) In the embodiment of the present invention, Q-learning reinforcement learning is applied to solve the inverted pendulum problem, but the invention is not limited to this, and algorithms other than Q-learning can also be applied. That is, functions other than the action value function Q(s_t, a_t) can also be approximated. Depending on the task, it is also possible, without applying reinforcement learning, to extract only the function of approximating a function with the multivariable mutual modification model N (see ST6 in FIG. 6 and ST101 to ST103 in FIG. 7) and configure a function approximation device that solves the task. That is, the function approximation device can be configured to approximate the function in question (a nonlinear function or the like) using a layered neural network (the multivariable mutual modification model N or the like) to which the selective desensitization method of the present invention is applied.

(H03) In the embodiment of the present invention, the input layer Na of the multivariable mutual modification model N is composed of four input variables x, v, θ, ω, but the invention is not limited to this, and configurations with three input variables or with five or more input variables are also possible.
In the embodiment, the intermediate variable y of the intermediate layer Nb is composed of the twelve product-type modifications x(v), v(x), x(θ), θ(x), x(ω), ω(x), v(θ), θ(v), v(ω), ω(v), θ(ω), ω(θ), obtained by taking the four input variables x, v, θ, ω two at a time and applying product-type context modification mutually within each pair, so that the state pattern S_y of the intermediate variable y is composed of 12n elements (see FIG. 4 and equation (11)). However, the invention is not limited to this, and the mutual product-type context modification may be omitted.
For example, as the minimum configuration, the four input variables x, v, θ, ω may be grouped into the two pairs (x, v) and (θ, ω), and only the two product-type modifications x(v) and θ(ω) may be used as the intermediate variable y, so that the state pattern S_y of the intermediate variable y is composed of 2n elements. In this case, S_y = (y_1, y_2, …, y_{2n}) = (y_{xv1}, y_{xv2}, …, y_{xvn}, y_{θω1}, y_{θω2}, …, y_{θωn}) holds. In this specification, the layered neural network having this minimum configuration of the present invention is called the "multivariable product-type model", in contrast to the "multivariable mutual modification model N" of the first embodiment.
Furthermore, in the embodiment, the output layer Nc is composed of one output variable Q(s_t, a_t), but the invention is not limited to this, and the output layer may be composed of a plurality of output variables.
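The two layouts described in (H03) can be enumerated with a short sketch. This is only an illustration of the counting, not the patented implementation: the function names `mutual_modifications` and `product_modifications`, the string notation `a(b)`, and the value chosen for `n` are all assumptions introduced here.

```python
from itertools import permutations


def mutual_modifications(variables):
    """Every ordered pair a(b): variable a modified by variable b."""
    return [f"{a}({b})" for a, b in permutations(variables, 2)]


def product_modifications(variables):
    """Disjoint pairs only: the first variable of each pair modified by the second."""
    if len(variables) % 2 != 0:
        raise ValueError("this simple pairing needs an even number of variables")
    return [f"{variables[i]}({variables[i + 1]})" for i in range(0, len(variables), 2)]


variables = ["x", "v", "θ", "ω"]
n = 5  # hypothetical number of elements per variable

full = mutual_modifications(variables)      # mutual modification layout
minimal = product_modifications(variables)  # minimum (product-type) layout

print(len(full), len(full) * n)        # number of modifications, elements in S_y
print(minimal, len(minimal) * n)
```

With four variables the first layout yields twelve modifications (12n elements) and the second yields only x(v) and θ(ω) (2n elements), matching the counts stated in (H03).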

(H04) In the multivariable mutual modification model N of the embodiment of the present invention (see FIG. 4), the number of elements n of each of the input variables x, v, θ, ω and of each of the product-type modifications x(v), v(x), x(θ), θ(x), x(ω), ω(x), v(θ), θ(v), v(ω), ω(v), θ(ω), ω(θ) serving as the intermediate variable y can be set arbitrarily; for example, n = 1 is possible. In this case, the number of input variables x, v, θ, ω equals the number of elements of the input layer Na (4n = 4 × 1 = 4).
Therefore, in a layered neural network to which the selective desensitization method of the present invention is applied, when the number of elements of the input layer (Na) is fixed at n, the maximum number of elements of the intermediate layer (Nb) is n(n-1), which is attained by the multivariable mutual modification model (N) as the maximum configuration. That is, when there are n input variables each having one element, _nP_2 = n(n-1) holds for the maximum. In the multivariable product-type model, which is the minimum configuration, the minimum number of elements of the intermediate layer is n/2. That is, when there are four input variables each having n/4 elements, (n/4) × 2 = n/2 holds for the minimum.
As a result, when the input layer has n elements, a layered neural network to which the selective desensitization method of the present invention is applied can be configured with an intermediate layer of between n/2 and n(n-1) elements.
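The element counts in (H04) can be checked numerically. The following is a minimal sketch under the two cases the text names (n one-element variables for the maximum, four variables of n/4 elements for the minimum); the function names and the requirement that n be a multiple of 4 are assumptions made for this illustration.

```python
def intermediate_elements(elements_per_variable, num_modifications):
    """Each product-type modification contributes one group of elements."""
    return elements_per_variable * num_modifications


def size_range(n):
    """Return (minimum, maximum) intermediate-layer sizes for n input elements."""
    if n < 4 or n % 4 != 0:
        raise ValueError("this sketch assumes n is a positive multiple of 4")
    # Maximum: n variables of one element each, all ordered pairs, nP2 = n(n-1).
    maximum = intermediate_elements(1, n * (n - 1))
    # Minimum: four variables of n/4 elements each, two modifications.
    minimum = intermediate_elements(n // 4, 2)
    return minimum, maximum


for n in (4, 8, 12):
    print(n, size_range(n))
```

For n = 4 this gives the range 2 to 12, and for n = 8 the range 4 to 56.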

(H05) The set values of the constants M, m, L, t, g, α, γ and the initial values, ranges of possible values, and the like of the variables x, v, θ, ω, a, b, k, F, w_{ji}, w_{ji}′ in the first embodiment of the present invention can be changed arbitrarily.
(H06) In the embodiment of the present invention, the action value function learning process (see ST13 in FIG. 6 and ST201 to ST212 in FIG. 8) is executed only when the inverted pendulum control process (see ST1 to ST12 in FIG. 6) ends in a failure state (see ST11 in FIG. 6), but the invention is not limited to this, and the learning process can also be executed when the control process ends in a success state (see ST10 in FIG. 6). In this case, the connection weights w_{ji}, w_{ji}′ are learned based on the latest action value function Q(s_t, a_t) updated by Q-learning reinforcement learning (see ST9 in FIG. 6 and equation (8)), so the number of computation steps increases compared with the function approximation system S of the first embodiment, but learning of the action value function Q(s_t, a_t) can converge faster.
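The interplay in (H06) between the Q-learning update and the learning of the connection weights can be sketched as follows. Equation (8) and the weight update of the embodiment are not reproduced in this section, so the code uses the standard textbook Q-learning update and a plain delta rule as stand-ins; it also assumes that α and γ play their usual roles as learning rate and discount factor. The names `q_learning_update`, `delta_rule_update`, and `rate` are hypothetical.

```python
def q_learning_update(q_sa, reward, next_q_values, alpha, gamma):
    """Standard Q-learning update (textbook form, assumed here)."""
    return q_sa + alpha * (reward + gamma * max(next_q_values) - q_sa)


def delta_rule_update(weights, intermediate, target, rate):
    """Move the weights so the weighted sum of `intermediate` approaches `target`.

    A generic delta rule, standing in for the embodiment's weight learning.
    """
    output = sum(w * y for w, y in zip(weights, intermediate))
    error = target - output
    return [w + rate * error * y for w, y in zip(weights, intermediate)]


# Update the stored value for one (state, action) pair ...
new_q = q_learning_update(q_sa=0.0, reward=1.0, next_q_values=[0.5, 0.2],
                          alpha=0.5, gamma=0.9)

# ... then use that latest value as the target when learning the weights.
weights = delta_rule_update([0.0, 0.0], [1, 0], target=new_q, rate=0.5)
print(new_q, weights)
```

Running the weight update after every episode, as (H06) allows, simply means calling the second function more often with the most recent targets.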

(H07) In the embodiment of the present invention, the multivariable mutual modification model N has one intermediate layer Nb, but the invention is not limited to this, and a plurality of intermediate layers Nb may be provided.
(H08) In the embodiment of the present invention, so-called selective desensitization is performed when the intermediate variable y (y = (x(v), v(x), x(θ), θ(x), x(ω), ω(x), v(θ), θ(v), v(ω), ω(v), θ(ω), ω(θ))) is calculated from the input variables x, v, θ, ω (see equations (9-1), (9-2), (10-1), and (10-2)), whereas the output variable Q(s_t, a_t) is calculated from the intermediate variable y in the same manner as a conventionally known multilayer perceptron, without selective desensitization. However, the invention is not limited to this, and selective desensitization can also be applied when calculating the output variable Q(s_t, a_t) from the intermediate variable y. For example, when y = (x(v), θ(ω)) is calculated as the intermediate variable y, that is, when the input variable x selectively desensitized by the input variable v (product-type modification x(v)) and the input variable θ selectively desensitized by the input variable ω (product-type modification θ(ω)) are used as the intermediate variable y, the product-type modification x(v) may be selectively desensitized by the product-type modification θ(ω) (that is, product-type context modified), after which the output variable Q(s_t, a_t) is calculated based on the connection weights w_{ji}, w_{ji}′ (see equations (11), (12-1), and (12-2)). Furthermore, it is also possible to calculate the intermediate variable y from the input variables x, v, θ, ω in the same manner as a conventionally known multilayer perceptron without selective desensitization, and to perform selective desensitization when the output variable Q(s_t, a_t) is calculated from the intermediate variable y.
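The arrangement described at the start of (H08), desensitization between the input and intermediate layers followed by an ordinary weighted sum, can be sketched as below. Equations (9-1) to (12-2) are not reproduced in this section, so the coding is an assumption: each variable is a pattern of +1/-1 elements, and the output sensitivity of an element is 1 where the modifying variable's element is +1 and 0 where it is -1. That gain rule, the function names, and the sample patterns and weights are all hypothetical.

```python
def output_sensitivity(modifier):
    """Hypothetical gain rule: +1 -> gain 1, -1 -> gain 0 (element desensitized)."""
    return [(1 + m) // 2 for m in modifier]


def desensitize(pattern, modifier):
    """Product-type modification pattern(modifier): gain times element value."""
    return [g * p for g, p in zip(output_sensitivity(modifier), pattern)]


def weighted_sum(weights, values):
    """Perceptron-style output element, with no desensitization at this stage."""
    return sum(w * y for w, y in zip(weights, values))


# Illustrative +1/-1 state patterns with n = 4 elements per variable.
x, v = [1, -1, 1, -1], [1, 1, -1, -1]
theta, omega = [-1, -1, 1, 1], [1, -1, 1, -1]

x_v = desensitize(x, v)                    # x(v)
theta_omega = desensitize(theta, omega)    # θ(ω)
y = x_v + theta_omega                      # intermediate pattern of 2n elements

weights = [0.1 * (i + 1) for i in range(len(y))]  # made-up connection weights
q = weighted_sum(weights, y)
print(x_v, theta_omega, q)
```

Desensitized elements output 0 regardless of their own value, so they contribute nothing to the weighted sum; the other variants in (H08) differ only in the stage at which such a gain is applied.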

By producing a function approximation device, a function approximation system, and a function approximation program that approximate a task as a function (a nonlinear function or the like) using a layered neural network (the multivariable mutual modification model N, the multivariable product-type model, or the like) to which the selective desensitization method of the present invention described above is applied, and by applying them to the control of learning control equipment that handles tasks learnable in real time, for example autonomous robots or pattern recognition devices for images, sounds, and the like, it becomes possible to greatly reduce learning time and to improve adaptability to unknown environments.

FIG. 1 is an overall explanatory diagram of the function approximation system according to the first embodiment of the present invention.
FIG. 2 is a functional block diagram showing the functions of the control unit of the cart according to the first embodiment of the present invention.
FIG. 3 is a brief explanatory diagram of reinforcement learning.
FIG. 4 is an explanatory diagram of the layered neural network to which the selective desensitization method according to the first embodiment of the present invention is applied.
FIG. 5 is an explanatory diagram of an example of the state pattern of the input variable indicating the position of the cart.
FIG. 6 is a flowchart of the main process of the function approximation program according to the first embodiment of the present invention.
FIG. 7 is a flowchart of the action value function approximation process of the function approximation program according to the first embodiment of the present invention, illustrating the subroutine of ST7 in FIG. 6.
FIG. 8 is a flowchart of the action value function learning process of the function approximation program according to the first embodiment of the present invention, illustrating the subroutine of ST13 in FIG. 6.
FIG. 9 is an explanatory diagram of experimental results: graphs comparing the learning efficiency of Experimental Example 1 with that of Comparative Example 1, with the number of trials (episodes) of the inverted pendulum control process on the horizontal axis and the trial time (the time for which the pole remained inverted) on the vertical axis. FIG. 9A shows the results of Experimental Example 1, and FIG. 9B shows the results of Comparative Example 1.
FIG. 10 is an explanatory diagram of experimental results: graphs comparing the learning efficiency of Experimental Example 2 with that of Comparative Example 2, with the number of trials (episodes) of the inverted pendulum control process on the horizontal axis and, on the vertical axis, the trial time (the time for which the pole remained inverted) and the number of disturbances applied (the number of times the pole was flicked with a finger). FIG. 10A shows the results of Experimental Example 2, and FIG. 10B shows the results of Comparative Example 2.
FIG. 11 is an explanatory diagram of experimental results: graphs comparing the learning efficiency of Experimental Example 3 with that of Comparative Example 3, with the number of trials (episodes) of the inverted pendulum control process on the horizontal axis and the trial time (the time for which the pole remained inverted) on the vertical axis. FIG. 11A shows the results of Experimental Example 3, and FIG. 11B shows the results of Comparative Example 3.
FIG. 12 is an explanatory diagram of the minimum configuration of a layered neural network to which the selective desensitization method is applied.
FIG. 13 is an explanatory diagram of layered neural networks to which the selective desensitization method is applied; FIG. 13A illustrates the product-type model, and FIG. 13B illustrates the mutual modification model.

Explanation of symbols

A, 1 ... control device,
AP1 ... function approximation program,
a_t ... action,
C ... function approximation device,
C2 ... state measuring means,
C3 ... action execution means,
C3A ... action selection means,
C4 ... reward acquisition means,
C5 ... action value function calculation means,
C7A ... input variable input means,
C7C ... intermediate variable calculation means,
C7C1a + C7C1b ... first intermediate variable calculation means,
C7C1c + C7C1d ... second intermediate variable calculation means,
C7E ... output variable calculation means,
C7G ... connection weight learning means,
C8 ... control end determination means,
E ... environment,
N ... layered neural network,
Na ... input layer,
Nb ... intermediate layer,
Nc ... output layer,
Q(a_t, s_t), Q(a_{t+1}, s_{t+1}) ... action value function,
Q(a_t, s_t) ... function, output variable,
r_t, r_{t+1} ... reward,
S ... reinforcement learning system, function approximation system,
(q_1, q_2, …, q_m), (q_1′, q_2′, …, q_m′) ... output elements,
s_t, s_{t+1} ... state,
(s_{x1}, s_{x2}, …, s_{xn}, s_{v1}, s_{v2}, …, s_{vn}, s_{θ1}, s_{θ2}, …, s_{θn}, s_{ω1}, s_{ω2}, …, s_{ωn}) ... input elements,
(t_1, t_2, …, t_m), (t_1′, t_2′, …, t_m′) ... actual values of the function,
w_{ji}, w_{ji}′ ... connection weights,
x, v, θ, ω ... input variables,
y ... intermediate variable,
(y_1, y_2, …, y_{12n}) ... intermediate elements.

Claims

A function approximation device that approximates a function, which is a relationship between an input variable and an output variable, by a layered neural network having an input layer composed of input elements to which values of the input variable are input, an intermediate layer composed of intermediate elements that are coupled to the input elements and output values of an intermediate variable calculated based on the values input to the input elements, and an output layer composed of output elements that are coupled to the intermediate elements and output values of the output variable calculated based on the values input to the intermediate elements, the function approximation device comprising:
the input layer composed of input elements to which respective values of three or more input variables are input;
input variable input means for inputting the respective values of the three or more input variables;
intermediate variable calculation means in which input variable sets, each consisting of any two of the three or more input variables, are set, and which calculates each value of the intermediate variable based on each value of one input variable of an input variable set and each first output sensitivity calculated based on each value of the other input variable of that input variable set;
output variable calculation means for calculating the value of the output variable based on the intermediate variable and a connection weight set according to the importance of the value of the intermediate variable; and
connection weight learning means for learning the connection weight by updating the connection weight based on the difference between the value of the output variable and an actual value of the function stored in advance.

The function approximation device according to claim 1, comprising:
the intermediate variable calculation means having first intermediate variable calculation means for calculating each value of a first intermediate variable based on each value of one input variable of the input variable set and each first output sensitivity calculated based on each value of the other input variable of the input variable set, and second intermediate variable calculation means for calculating each value of a second intermediate variable based on each value of the other input variable of the input variable set and each second output sensitivity calculated based on each value of the one input variable of the input variable set; and
the output variable calculation means for calculating the value of the output variable based on the first intermediate variable, the second intermediate variable, and connection weights set according to the importance of each value of the first intermediate variable and the second intermediate variable.

The function approximation device according to claim 1, comprising: the intermediate layer composed of intermediate elements that output respective values of a plurality of the intermediate variables; and a plurality of the input variable sets, configured such that every one of the three or more input variables is set as the one input variable or the other input variable of at least one of the input variable sets.

A reinforcement learning system comprising:
a control device whose actions are to be controlled;
state measuring means for measuring the state of the control device;
reward acquisition means for acquiring a reward for the action;
action value function calculation means for calculating an action value function, which is an evaluation value for evaluating every action in the measured state based on a prediction of the reward obtainable in the future;
the function approximation device according to any one of claims 1 to 3, which approximates the action value function in the measured state by treating the measured values of the state as the values of the input variables;
action selection means for selecting the action based on the approximated action value function;
action execution means for executing the selected action; and
control end determination means for determining whether to end control of the control device by determining, based on the reward, whether control of the control device has failed.

The reinforcement learning system according to claim 4, comprising the function approximation device according to any one of claims 1 to 3 having the connection weight learning means that learns the connection weight when it is determined that control of the control device has failed.

A function approximation system that approximates a function, which is a relationship between an input variable and an output variable, by a layered neural network having an input layer composed of input elements to which values of the input variable are input, an intermediate layer composed of intermediate elements that are coupled to the input elements and output values of an intermediate variable calculated based on the values input to the input elements, and an output layer composed of output elements that are coupled to the intermediate elements and output values of the output variable calculated based on the values input to the intermediate elements, the function approximation system comprising:
the input layer composed of input elements to which respective values of three or more input variables are input;
input variable input means for inputting the respective values of the three or more input variables;
intermediate variable calculation means in which input variable sets, each consisting of any two of the three or more input variables, are set, and which calculates each value of the intermediate variable based on each value of one input variable of an input variable set and each first output sensitivity calculated based on each value of the other input variable of that input variable set;
output variable calculation means for calculating the value of the output variable based on the intermediate variable and a connection weight set according to the importance of the value of the intermediate variable; and
connection weight learning means for learning the connection weight by updating the connection weight based on the difference between the value of the output variable and an actual value of the function stored in advance.

A function approximation program that approximates a function, which is a relationship between an input variable and an output variable, by causing a computer to function as:
input variable input means for inputting respective values of three or more input variables to the input layer of a layered neural network having an input layer composed of input elements to which values of the input variable are input, the input layer being composed of input elements to which the respective values of the three or more input variables are input, an intermediate layer composed of intermediate elements that are coupled to the input elements and output values of an intermediate variable calculated based on the values input to the input elements, and an output layer composed of output elements that are coupled to the intermediate elements and output values of the output variable calculated based on the values input to the intermediate elements;
intermediate variable calculation means in which input variable sets, each consisting of any two of the three or more input variables, are set, and which calculates each value of the intermediate variable based on each value of one input variable of an input variable set and each first output sensitivity calculated based on each value of the other input variable of that input variable set;
output variable calculation means for calculating the value of the output variable based on the intermediate variable and a connection weight set according to the importance of the value of the intermediate variable; and
connection weight learning means for learning the connection weight by updating the connection weight based on the difference between the value of the output variable and an actual value of the function stored in advance.