JPH052600A - System for learning probability distribution and probabilistic rule

Info

Publication number: JPH052600A
Application number: JP18027291A
Authority: JP
Inventors: Naoki Abe; 直樹安倍
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 1991-06-24
Filing date: 1991-06-24
Publication date: 1993-01-08

Abstract

PURPOSE: To learn a reliable, explanatory probability distribution or probabilistic rule by outputting a specific representation as the best representation. CONSTITUTION: The representation that minimizes the sum of (i) the product of the number of bits of the representation and the a-th power (0 < a < 1) of the number of input data and (ii) the negative log-likelihood of the input data determined by the probability distribution is output as the best representation. That is, a sample consisting of a finite number of examples and the parameter a in the NIC definition are received as input, and a representation of a probability distribution is output from a fixed class H of representations of probability distributions (the hypothesis space). Specifically, for an input sample S = (x1, ..., xm), the hypothesis in the hypothesis space H that minimizes the value of expression I is output as the best representation. In expression I, l(h) is the description length of hypothesis h and m is the number of input samples (number of examples). When many examples are given in a noisy environment, a probability distribution or probabilistic rule that explains the examples with high reliability can be learned.

Description

Detailed Description of the Invention

[0001]

[Industrial Field of Application] The present invention relates to a method for learning probabilistic rules or probability distributions that takes a large number of examples as input and outputs a representation of a probabilistic rule or of a probability distribution.

[0002]

[Prior Art] Consider the statistical estimation problem in which data emitted by an unknown stochastic information source (hereinafter, the true model) are taken as input, and the model that best approximates the source is selected from a given hypothesis space and output. Known criteria for selecting a hypothesis in this problem include Rissanen's minimum description length principle (or Schwarz's Bayesian information criterion), Akaike's information criterion, Takeuchi's information criterion, and the criterion of Hannan and Quinn. Of these, consistency, that is, asymptotic convergence to the true model, has been proved for Rissanen's minimum description length principle and for the criterion of Hannan and Quinn. Of those two, rapid convergence with respect to some distance function has been proved only for Rissanen's minimum description length principle. This criterion is described in Rissanen's paper "Modeling by shortest data description", published in Automatica, vol. 14, 1978. It takes as best the hypothesis that minimizes the sum of one half of the product of the number of real-valued parameters of the hypothesis and the logarithm of the number of data, and the negative log-likelihood of the data determined by the hypothesis. Rapid convergence of the minimum description length principle with respect to the Hellinger distance is proved in Yamanishi's paper "A Learning Criterion for Stochastic Rules", published in the Proceedings of the Third Annual Workshop on Computational Learning Theory (United States, 1990).
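For reference, the minimum description length criterion described in words above can be written as a formula. The symbols k(h) (number of real-valued parameters of hypothesis h), m (number of data), and P(S | h) (likelihood of the data S under h) are notation assumed here for illustration; they do not appear in the source.

```latex
\hat{h} = \arg\min_{h \in H} \left[ \frac{k(h)}{2} \log m \;-\; \log P(S \mid h) \right]
```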

[0003]

[Problem to Be Solved by the Invention] As noted above, rapid convergence of the minimum description length principle has been proved with respect to the Hellinger distance, but not yet with respect to the KL information (Kullback-Leibler divergence), the distance function most widely used in statistics. In coding-theoretic terms, minimizing the KL information amounts to minimizing the average code length, which makes it a more explanatory criterion for evaluating statistical validity. Moreover, the KL information is a stricter criterion than many distance functions, including the Hellinger distance: it is known that the KL information bounds many distance functions, the Hellinger distance among them, from above, whereas the converse does not necessarily hold. There has therefore been a demand either for a proof of rapid convergence with respect to the KL information or for the discovery of an information criterion for which such a proof is possible. The present invention achieves exactly this.
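The relationship between the two distances can be checked numerically. The sketch below assumes natural logarithms and the unnormalized squared Hellinger distance, the sum of (sqrt(p_i) - sqrt(q_i))^2; the patent does not state which normalization it uses, and the function names are illustrative. Under these conventions the squared Hellinger distance never exceeds the KL information, while the KL information can be infinite even when the Hellinger distance stays bounded.

```python
import math

def kl(p, q):
    """KL information D(p || q) in nats for discrete distributions given as lists."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0.0:
            continue            # terms with p_i = 0 contribute nothing
        if qi == 0.0:
            return math.inf     # p puts mass where q puts none
        total += pi * math.log(pi / qi)
    return total

def hellinger_sq(p, q):
    """Unnormalized squared Hellinger distance (assumed convention)."""
    return sum((math.sqrt(pi) - math.sqrt(qi)) ** 2 for pi, qi in zip(p, q))

p = [0.5, 0.5]
for q in ([0.9, 0.1], [1.0, 0.0]):
    # The second q shows the failing converse: Hellinger stays finite, KL does not.
    print(q, hellinger_sq(p, q), kl(p, q))
```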

[0004]

[Means for Solving the Problem] The first invention is a probability distribution learning method that, for a large number of input data, selects and outputs a representation of a probability distribution explaining the distribution of the data from a given class containing a plurality of representations of probability distributions, wherein the representation that minimizes the sum of (i) the product of the number of bits of the representation and the a-th power (0 < a < 1) of the number of input data and (ii) the negative log-likelihood of the input data determined by the probability distribution is output as the best representation.

[0005] The second invention is a probabilistic rule learning method that, for input consisting of a large number of data and teacher signals, selects and outputs a probabilistic rule that classifies the data from a given class containing a plurality of representations of probabilistic rules, wherein the representation that minimizes the sum of (i) the product of the number of bits of the representation and the a-th power (0 < a < 1) of the number of input data and (ii) the negative log-likelihood of the input data and teacher signals determined by said probability distribution is output as the best representation.

[0006] The third invention is a probability distribution learning method that, for a large number of input data and an input accuracy parameter, selects and outputs a representation of a probability distribution explaining the distribution of the data from a given class containing a plurality of representations of probability distributions, wherein the search is restricted to a subclass of that class having the following properties: for any representation in the class, the subclass contains a representation that approximates it to within the accuracy parameter; the probability that any representation in the subclass assigns to any element of the domain is bounded from below by 1/2 raised to a polynomial in the reciprocal of the accuracy parameter and the upper bound on the number of bits of an individual datum; and the number of elements of the subclass is bounded from above by 2 raised to a polynomial in the same two quantities. From this subclass, the representation that minimizes the sum of (i) the product of the number of bits of the representation and the a-th power (0 < a < 1) of the number of input data and (ii) the negative log-likelihood of the input data determined by the probability distribution is output as the best representation.

[0007] The fourth invention is a probabilistic rule learning method that, for input consisting of a large number of data and teacher signals together with an input accuracy parameter, selects and outputs a probabilistic rule that classifies the data from a given class containing a plurality of representations of probabilistic rules, wherein the search is restricted to a subclass of that class having the following properties: for any representation in the class, the subclass contains a representation that approximates it well to within the accuracy parameter; the conditional probability that any representation in the subclass assigns to any pair of a domain element and a teacher signal is bounded from below by 1/2 raised to a polynomial in the reciprocal of the accuracy parameter and the upper bound on the number of bits of an individual datum; and the number of elements of the subclass is bounded from above by 2 raised to a polynomial in the same two quantities. From this subclass, the representation that minimizes the sum of (i) the product of the number of bits of the representation and the a-th power (0 < a < 1) of the number of input data and (ii) the negative log-likelihood of the input data and teacher signals determined by the probabilistic rule is output as the best representation.

[0008] The fifth invention is a probability distribution learning method that, for a large number of input data and an input accuracy parameter, selects and outputs a representation of a probability distribution explaining the distribution of the data from a given class containing a plurality of parametric representations of probability distributions, each consisting of a discrete model and real-valued parameters, wherein an upper bound on the description length of the discrete models under consideration is computed dynamically during the search, and the search is restricted to a subclass of those representations in the class whose discrete model has description length at most that bound, the subclass having the following properties: for any representation in the class whose discrete model has description length at most the bound, the subclass contains a representation that approximates it to within the accuracy parameter; the probability that any representation in the subclass assigns to any element of the domain is bounded from below by 1/2 raised to a polynomial in the reciprocal of the accuracy parameter, the description-length bound, and the upper bound on the number of bits of an individual datum; and the number of elements of the subclass is bounded from above by 2 raised to a polynomial in the same three quantities. From this subclass, the representation that minimizes the sum of (i) the product of the number of bits of the representation and the a-th power (0 < a < 1) of the number of input data and (ii) the negative log-likelihood of the input data determined by the probability distribution is output as the best representation.

[0009] The sixth invention is a probabilistic rule learning method that, for input consisting of a large number of data and teacher signals together with an input accuracy parameter, selects and outputs a probabilistic rule that classifies the data from a given class containing a plurality of parametric representations of probabilistic rules, each consisting of a discrete model and real-valued parameters, wherein an upper bound on the description length of the discrete models under consideration is computed dynamically during the search, and the search is restricted to a subclass of those representations in the class whose discrete model has description length at most that bound, the subclass having the following properties: for any representation in the class whose discrete model has description length at most the bound, the subclass contains a representation that approximates it well to within the accuracy parameter; the conditional probability that any representation in the subclass assigns to any pair of a domain element and a teacher signal is bounded from below by 1/2 raised to a polynomial in the reciprocal of the accuracy parameter, the description-length bound, and the upper bound on the number of bits of an individual datum; and the number of elements of the subclass is bounded from above by 2 raised to a polynomial in the same three quantities. From this subclass, the representation that minimizes the sum of (i) the product of the number of bits of the representation and the a-th power (0 < a < 1) of the number of input data and (ii) the negative log-likelihood of the input data and teacher signals determined by the probabilistic rule is output as the best representation.

[0010]

[Operation] What is called here a probability distribution over a domain X is what probability and statistics generally call a probability density function; for simplicity, however, the domain X is assumed to be discrete, and the term probability distribution is used. It is further assumed that X is Σ^n or ∪_{i=1,...,n} Σ^i, where Σ is some alphabet. Since n is regarded as a variable, Σ^n or ∪_{i=1,...,n} Σ^i is written X_n, and the family of domains X = {X_n : n ∈ N} is considered.

[0011] A probabilistic rule is a function that, given an attribute value x, gives the conditional probability that the class y of x takes a particular value. Here x is an element of a domain X satisfying the same assumptions as above, and y is an element of a finite range Y. The conditional probability that class y ∈ Y is assigned when an arbitrary x ∈ X is given is written p(y/x).

[0012] A representation of a probability distribution or of a probabilistic rule is assumed only to be describable over a binary alphabet under some fixed coding scheme. For example, the commonly used parametric representations consist of two parts, a discrete model and real-valued parameters; but as long as a representation is output by a computer, the real-valued parameters too are written over the binary alphabet to some precision, so no special distinction is made here. The description length of a representation h of a probability distribution or probabilistic rule is defined as its number of bits when written under the above coding scheme.

[0013] Using the above definitions, the methods of the first and second inventions are described in more detail. The probability distribution learning method of the first invention takes as input a sample consisting of a finite number of examples and the parameter a in the NIC definition, and outputs a representation of a probability distribution from a fixed class H of representations of probability distributions (hereinafter, the hypothesis space). Specifically, for an input sample S = ⟨x_1, ..., x_m⟩, the method takes as best, and outputs, the hypothesis in the hypothesis space H that minimizes the following quantity.
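Read from the description of its two terms later in this paragraph, the quantity to be minimized can be written as below. This is a reconstruction under assumptions, not a copy of the patent's formula: the symbol h(x_i) for the probability that h assigns to x_i and the unspecified base of the logarithm are assumptions, while nic(S/h), nic(h), l(h), m and a are the patent's own symbols.

```latex
\mathrm{nic}(S, h) = \mathrm{nic}(S/h) + \mathrm{nic}(h)
  = -\sum_{i=1}^{m} \log h(x_i) \;+\; l(h)\, m^{a}
```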

[Equation 1]

Here a is assumed to be given as input; if it is not, setting for example a = 3/4 makes no essential difference. Also, m is taken to be the number of input samples (number of examples), but in general one may fix an upper bound m on the number of samples used for learning and, when the number of samples exceeds m, learn from the last m data. This approach is particularly effective when the information source changes over time and the memory available to the learning method is limited. The first term above, nic(S/h), is the "negative log-likelihood of the sample determined by h" mentioned earlier, that is, the sum of the negative log-likelihoods of the individual data determined by h; the second term, nic(h), is the product of the description length of h and the a-th power of the number of samples. The notation nic used above stands for "new information criterion".
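The selection rule described above can be sketched in a few lines. This is a minimal illustration under assumptions: logarithms are taken base 2 so that both terms are in bits, each hypothesis is a hypothetical (probability function, description length in bits) pair, and the two example hypotheses and their bit lengths are made up for the demonstration.

```python
import math

def nic(sample, prob, length_bits, a=0.75):
    """Negative log-likelihood of the sample plus l(h) * m**a (a = 3/4 by default)."""
    m = len(sample)
    neg_log_lik = -sum(math.log2(prob(x)) for x in sample)
    return neg_log_lik + length_bits * m ** a

def select_best(sample, hypotheses, a=0.75):
    """Return the name of the hypothesis with the smallest nic value."""
    return min(hypotheses, key=lambda name: nic(sample, *hypotheses[name], a))

# Hypothetical hypothesis space over the alphabet {a, b, c, d}.
skewed_table = {"a": 0.7, "b": 0.1, "c": 0.1, "d": 0.1}
hypotheses = {
    "uniform": (lambda x: 0.25, 2),             # assumed 2-bit description
    "skewed": (lambda x: skewed_table[x], 4),   # assumed 4-bit description
}

small = list("aab")
large = list("a" * 700 + "b" * 100 + "c" * 100 + "d" * 100)
print(select_best(small, hypotheses), select_best(large, hypotheses))
```

With a = 3/4 the penalty l(h) * m^a grows much faster than a penalty proportional to log m, so the longer hypothesis is chosen only once the sample is large enough for its likelihood gain to outweigh the penalty.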

[0014] The probabilistic rule learning method of the second invention takes as input a sample consisting of a finite number of examples, that is, elements of X × Y, and the parameter a in the NIC definition, and outputs a representation of a probabilistic rule from a fixed class H of representations of probabilistic rules (hereinafter, the hypothesis space). Specifically, for an input sample S = ⟨⟨x_1, y_1⟩, ..., ⟨x_m, y_m⟩⟩, the method takes as best, and outputs, the hypothesis in the hypothesis space H that minimizes the following quantity.

[Equation 2]

Here l(h) is the description length of hypothesis h, and m is, as above, the number of input samples (number of examples). Also, a is assumed to be given as input; if it is not, setting for example a = 3/4 makes no essential difference.
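For probabilistic rules the likelihood term is computed from the conditional probabilities p(y/x) of the observed pairs. The sketch below illustrates this under assumptions: base-2 logarithms, binary x and y, and two hypothetical rules with made-up description lengths; none of these names come from the patent.

```python
import math

def nic_rule(sample, rule, length_bits, a=0.75):
    """-sum(log2 p(y/x)) over the pairs in the sample, plus l(h) * m**a."""
    m = len(sample)
    neg_log_lik = -sum(math.log2(rule(y, x)) for x, y in sample)
    return neg_log_lik + length_bits * m ** a

def select_rule(sample, rules, a=0.75):
    """Return the name of the rule with the smallest criterion value."""
    return min(rules, key=lambda name: nic_rule(sample, *rules[name], a))

# Hypothetical rules, each given as (p(y/x), assumed description length in bits).
rules = {
    "ignore_x": (lambda y, x: 0.5, 1),
    "noisy_copy": (lambda y, x: 0.9 if y == x else 0.1, 3),
}

# 360 pairs where the teacher signal equals x, 40 pairs where it is flipped.
agree = [(i % 2, i % 2) for i in range(360)]
flipped = [(i % 2, 1 - i % 2) for i in range(40)]
print(select_rule(agree + flipped, rules))
```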

[0015] Next, some definitions concerning representations of probability distributions and probabilistic rules, needed to describe the third and fourth inventions, are introduced. (In what follows, a probability distribution is regarded as a special case of a probabilistic rule, and definitions are given only for "probabilistic rules".)

Definition 1. Let H and H(ε) be classes containing a finite number of representations of probabilistic rules. If there exists a polynomial q such that, for any n ∈ N and any h ∈ H, there exists h′ ∈ H(ε) satisfying the following two conditions, then H(ε) is said to be an ε-lower-bound approximation of H.

[Equation 3]

[Equation 4]

Definition 2. If there exists an algorithm that, for any ε > 0, constructs a subclass H(ε) of H that is an ε-lower-bound approximation of H and has cardinality at most 2 to the power poly(1/ε, n) (where n is the upper bound on the length of strings in the domain and poly is some polynomial), in time polynomial in the same parameters 1/ε and n, then H is said to be closed under ε-lower-bound approximation.
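As a rough illustration of the three properties such a subclass is required to have (approximation to within the accuracy parameter, probabilities bounded away from zero, and bounded size), the sketch below builds a finite grid of Bernoulli parameters. It is only an analogy under stated assumptions: the precise conditions are those of Equations 3 and 4, which are not reproduced in this text, and closeness is measured here simply as the absolute difference between parameters. The function name and the choice of a Bernoulli family are hypothetical.

```python
import math

def bernoulli_cover(eps):
    """Grid of Bernoulli parameters: spacing at most eps, none closer than 1/(2k) to 0 or 1."""
    k = math.ceil(1.0 / eps)
    return [(j + 0.5) / k for j in range(k)]

def nearest(cover, theta):
    """Grid point closest to an arbitrary parameter theta in [0, 1]."""
    return min(cover, key=lambda g: abs(g - theta))

eps = 0.125
cover = bernoulli_cover(eps)
# Every theta in [0, 1] lies within eps / 2 of some grid point,
# and the grid has only ceil(1 / eps) elements.
print(len(cover), min(cover), nearest(cover, 0.0), nearest(cover, 0.33))
```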

[0016] Using the above definitions, the methods of the third and fourth inventions are described in more detail. The probability distribution learning method of the third invention takes as input a sample consisting of a finite number of examples, the parameter a in the NIC definition, and an accuracy parameter ε > 0, and outputs a representation of a probability distribution from a fixed class H of representations of probability distributions that is closed under ε-lower-bound approximation (hereinafter, the hypothesis space). The accuracy parameter ε is assumed here to be given; if it is not, it may be set as a decreasing function of the number of samples m, for example ε = 1/m, with no essential difference. Likewise, a is assumed to be given as input; if it is not, one may set for example a = 3/4. Specifically, for an input sample S = ⟨x_1, ..., x_m⟩ and ε > 0, the method takes as best, and outputs, the hypothesis that minimizes the following quantity within H(ε), an efficiently constructible ε-lower-bound approximation of H that is a subclass of the hypothesis space H.

[Equation 5]

Here m is the number of input samples (number of examples).

[0017] The probabilistic rule learning method of the fourth invention takes as input a sample consisting of a finite number of examples, that is, elements of X × Y, the parameter a in the NIC definition, and an accuracy parameter ε > 0, and outputs a representation of a probabilistic rule from a fixed class H of representations of probabilistic rules that is closed under ε-lower-bound approximation (hereinafter, the hypothesis space). As above, if the accuracy parameter is not given, it is set as a decreasing function of the number of samples m, for example ε = 1/m, with no essential difference. Likewise, a is assumed to be given as input; if it is not, one may set for example a = 3/4. Specifically, for an input sample S = ⟨⟨x_1, y_1⟩, ..., ⟨x_m, y_m⟩⟩ and ε > 0, the method takes as best, and outputs, the hypothesis that minimizes the following quantity within H(ε), an efficiently constructible ε-lower-bound approximation of H that is a subclass of the hypothesis space H.

[Equation 6]

Here l(h) is the description length of hypothesis h, and m is, as above, the number of input samples (number of examples).

[0018] Next, some definitions concerning parametric representations of probability distributions and probabilistic rules, needed to describe the fifth and sixth inventions, are introduced. (In what follows, a probability distribution is regarded as a special case of a probabilistic rule, and definitions are given only for "probabilistic rules".) A parametric representation of a probabilistic rule is a representation consisting of a pair of a discrete model and (a vector of) real-valued parameters. Here a discrete model is, in general, something written over the binary alphabet with finite description length, and the real-valued parameters are likewise written over the binary alphabet to some finite precision. For a probabilistic automaton, for example, the discrete model is the description of the number of states, and the real-valued parameters are all the transition probabilities. In what follows, when H is a class of parametric representations of probabilistic rules, the subclass of H determined by an upper bound s on the description length of the discrete model is written H_s.

Definition 3. Let H and H(ε) be classes of parametric representations of probabilistic rules. If there exists a polynomial q such that, for any s, n ∈ N and any h ∈ H_s, there exists h′ ∈ H(ε)_s satisfying the following two conditions, then H(ε) is said to be an ε-lower-bound approximation of H.

[Equation 7]

[Equation 8]

Definition 4. If there exists an algorithm that, for any n, s ∈ N and ε > 0, constructs a subclass H(ε)_s of H_s that is an ε-lower-bound approximation of H_s and has cardinality at most 2 to the power poly(1/ε, n, s) (where n is the upper bound on the length of strings in the domain, s is the upper bound on the description length of the discrete model, and poly is some polynomial), in time polynomial in the same parameters 1/ε, n, and s, then H is said to be closed under ε-lower-bound approximation.

[0019] The probability distribution learning method of the fifth invention takes as input a sample consisting of a finite number of examples, the parameter a in the NIC definition, and an accuracy parameter ε > 0, and outputs a representation of a probability distribution from a fixed class H (hereinafter, the hypothesis space) that is closed under ε-lower-bound approximation and contains a plurality of parametric representations of probability distributions, each consisting of a discrete model and real-valued parameters. The accuracy parameter ε is assumed here to be given; if it is not, it may be set as a decreasing function of the number of samples m, for example ε = 1/m, with no essential difference. Likewise, a is assumed to be given as input; if it is not, one may set for example a = 3/4. For an input sample S = ⟨x_1, ..., x_m⟩ and ε > 0, the method dynamically computes, during the search, the upper bound s on the description length of the discrete models under consideration, and takes as best, and outputs, the hypothesis that minimizes the following quantity within H(ε), an efficiently constructible ε-lower-bound approximation of H that is a subclass of the hypothesis space H.

[Equation 9]
$$\mathrm{nic}(S, h) = -\sum_{j=1}^{m} \log h(x_j) + l(h)\, m^{a}$$

Here, h(x_j) is the probability that hypothesis h assigns to x_j, l(h) is the description length of hypothesis h, and m is, as above, the number of input samples (the number of examples). The upper bound s on the description length of the hypotheses that need to be considered in the hypothesis space is computed dynamically, as stated above; in detail, this can be realized by, for example, the following procedure. The initial value of the upper bound s on the description length of the discrete model is set to an arbitrary value (for example, log m). The hypothesis that minimizes the above quantity among those with description length at most s is taken as the best hypothesis so far, h^*. If the product of the current upper bound s and m^a exceeds nic(S, h^*), the new information quantity of h^* with respect to the input sample S, then the description length of the discrete model of the NIC-minimal hypothesis is known to be at most s (any hypothesis whose description length exceeds s has an NIC of at least l(h)·m^a > s·m^a), so h^* is output as the best hypothesis. Otherwise, s := 2s is set and the above process is repeated.
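The doubling procedure of this paragraph can be sketched as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the hypothesis representation (a dict with the hypothetical keys `length` and `prob`), the helper names `nic`, `doubling_search`, and `bernoulli`, the base-2 logarithm, and the toy Bernoulli hypotheses with their description lengths are all invented for the example. Every hypothesis is assumed to give each sample element a strictly positive probability.

```python
import math


def nic(sample, hyp, a):
    """Negative log-likelihood of `sample` under `hyp` plus l(h) * m**a."""
    m = len(sample)
    neg_log_lik = -sum(math.log2(hyp["prob"](x)) for x in sample)
    return neg_log_lik + hyp["length"] * m ** a


def doubling_search(sample, hypotheses, a=0.75):
    """Return (best hypothesis, final bound s) using the doubling rule."""
    if not sample or not hypotheses:
        raise ValueError("sample and hypothesis list must be non-empty")
    m = len(sample)
    s = max(1.0, math.log2(m))  # arbitrary initial bound, e.g. log m
    while True:
        # Hypotheses whose description length is within the current bound.
        candidates = [h for h in hypotheses if h["length"] <= s]
        if candidates:
            best = min(candidates, key=lambda h: nic(sample, h, a))
            # Stop once s * m**a exceeds the best NIC found so far.
            if s * m ** a > nic(sample, best, a):
                return best, s
        s *= 2  # otherwise double the bound and search again


def bernoulli(p, length):
    """Toy hypothesis: P(x = 1) = p, with an assumed description length."""
    return {"p": p, "length": length,
            "prob": lambda x, p=p: p if x == 1 else 1.0 - p}


sample = [1, 1, 1, 0, 1, 1, 1, 1]
hypotheses = [bernoulli(0.5, 4), bernoulli(0.75, 8), bernoulli(0.875, 16)]
best, s = doubling_search(sample, hypotheses)
```

In this toy run the initial bound log m = 3 admits no hypothesis, so the bound is doubled once before the stopping test succeeds. The sketch filters a fixed finite list; in the method described here the candidates for each s would instead come from the ε-lower-bound approximation of H_s.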

[0020] The probabilistic rule learning method of the sixth invention receives as input a sample consisting of a finite number of examples, that is, elements of X × Y, the parameter a in the NIC definition, and an accuracy parameter ε > 0. It outputs a representation of a probabilistic rule from a class H (hereinafter called the hypothesis space) that contains a plurality of parametric representations of probabilistic rules, each consisting of a discrete model and real-valued parameters, and that is closed under a given ε-lower-bound approximation. As above, when the accuracy parameter is not given, it may be set as a decreasing function of the sample size m, for example ε = 1/m, with no essential difference. Likewise, a is assumed to be given as input; when it is not, it may be set to, for example, a = 3/4. For an input sample S = 〈〈x₁, y₁〉, ..., 〈x_m, y_m〉〉 and ε > 0, the learning method dynamically computes the upper bound s on the description length of the discrete models considered during the search, and outputs as the best hypothesis the one that minimizes the following quantity over H(ε), a subclass of the hypothesis space H that is an efficiently constructible ε-lower-bound approximation of H.

[Equation 10]
$$\mathrm{nic}(S, h) = -\sum_{j=1}^{m} \log h(y_j \mid x_j) + l(h)\, m^{a}$$

Here, h(y_j | x_j) is the conditional probability that hypothesis h assigns to the teacher signal y_j given x_j, l(h) is the description length of hypothesis h, and m is, as above, the number of input samples (the number of examples). The upper bound s on the description length of the discrete models that need to be considered in the hypothesis space is computed dynamically, as stated above; it can be computed as explained for the fifth invention.

[0021]

[Embodiments] Next, embodiments of the present invention are described in detail with reference to the drawings. Here, the probability distribution learning method of the fifth invention is described by way of an embodiment in which probabilistic automata form the hypothesis space. A probabilistic automaton here means a probabilistic automaton used as a stochastic generator, as defined and explained in the paper by Abe and Warmuth cited above. In the following, the class of probabilistic automata with at most s states is written H_s. For any constant ε > 0, we assume the existence of a procedure ε-BA that efficiently constructs an ε-lower-bound approximation of the subclass of H given by the description-length upper bound s, and we write H(ε)_s for the ε-lower-bound approximation of H_s obtained by ε-BA(H, ε, s). For simplicity, the i-th automaton obtained by ε-BA(H, ε, s) is written ε-BA(H, ε, s, i). We also assume the existence of a procedure ε-BAC(H, ε, s) that computes the cardinality of H(ε)_s.

[0022] An embodiment of the probability distribution learning method of the fifth invention with H_s as the hypothesis space is described below with reference to FIGS. 1 to 5. Embodiments of the other five inventions are obtained, as explained later, by partially modifying the embodiment of the fifth invention described next. First, the overall configuration of the learning method of the fifth invention is described with reference to FIG. 1. The inputs are the accuracy parameter ε, the parameter a in the NIC definition, and the sample S = 〈x₁, ..., x_m〉. Here a is assumed to be given as input; when it is not, it may be set to, for example, a = 3/4, with no essential difference. First, the input and initialization means (step 0) initializes the description-length upper bound to s := log m. Then the ε-lower-bound approximation calculation means (step 1) constructs the ε-lower-bound approximation H(ε)_s of H_s and stores it as the array H(ε). In the present context, where H_s is the class of probabilistic automata with at most s states, H(ε)_s may be any array of all probabilistic automata with at most s states in which each parameter is a power of {1 − ε/(2n)} and is at least ε/n. (This is explained in the paper by Abe and Warmuth cited above.) There are several possible ways to realize ε-BA and ε-BAC in this case, but since they are straightforward, the details are omitted here.
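The admissible parameter values described in this paragraph (powers of 1 − ε/(2n) that are at least ε/n) can be listed as in the following sketch. The function name `parameter_grid` is hypothetical, and including the zeroth power (the value 1) is an assumption. The sketch only enumerates single parameter values; assembling them into valid probabilistic automata, which is what ε-BA would have to do, is not shown.

```python
def parameter_grid(eps, n):
    """List the values (1 - eps/(2n))**k, k = 0, 1, ..., that are >= eps/n."""
    if not (0 < eps < 2 * n):
        raise ValueError("need 0 < eps < 2n so that the base lies in (0, 1)")
    base = 1.0 - eps / (2.0 * n)
    floor = eps / n
    values = []
    v = 1.0  # k = 0 (assumed to be admissible)
    while v >= floor:
        values.append(v)
        v *= base  # next power of the base
    return values


grid = parameter_grid(eps=0.5, n=2)
```

Because the grid is geometric, the number of values grows roughly like (2n/ε)·ln(n/ε), which keeps the number of candidate parameter values polynomial in 1/ε and n.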

[0023] Next, the NIC calculation means (step 2) computes, for each hypothesis h in H(ε)_s, the NIC information of S with respect to h, namely

[Equation 11]
$$\mathrm{nic}(S, h) = -\sum_{j=1}^{m} \log h(x_j) + l(h)\, m^{a}$$

Next, the NIC-minimal hypothesis finding means (step 3) selects one hypothesis h in H(ε)_s that minimizes the above quantity nic(S, h) and calls it h^*. Ties may in general be broken arbitrarily. Finally, the termination judgment means (step 4) updates s to s := 2s and then tests whether nic(S, h^*) < s·m^a; if so, h^* is output as the best hypothesis, and otherwise the process returns to step 1.

[0024] Next, example implementations of each of the above means are described in detail with reference to FIGS. 2 to 5. First, the ε-lower-bound approximation calculation means (step 1) can be realized, for example, as in the flowchart shown in FIG. 2. At 101, ε-BAC(H_s, ε) is used to compute the cardinality of H(ε)_s, which is set to k, and the index is initialized to i := 1. At 102, it is tested whether the index i equals k; if so, this means terminates and the process proceeds to step 2. Otherwise, at 103, the i-th element of the array H(ε)_s is set to the i-th hypothesis computed by ε-BA(H_s, ε, i). Finally, 1 is added to the index i, and the process returns to 102.

[0025] Next, the NIC calculation means (step 2) can be realized, for example, as in the flowchart shown in FIG. 3. At 201, the index is initialized to i := 1. At 202, it is tested whether the index i equals k; if so, this means terminates and the process proceeds to step 3. Otherwise, at 203, the first term of the NIC for the i-th hypothesis h of H(ε)_s, namely

[Equation 12]
$$-\sum_{j=1}^{m} \log h(x_j)$$

is computed and stored as NIC1(i). Next, at 204, the second term of the NIC for the same i-th hypothesis h of H(ε)_s, namely

[Equation 13]
$$l(h)\, m^{a}$$

is computed and stored as NIC2(i). Next, at 205, the sum of NIC1(i) and NIC2(i) obtained above is stored as NIC(i). Finally, at 206, 1 is added to the index i, and the process returns to 202.
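A small worked example of the NIC1, NIC2, and NIC arrays may help. In this sketch the name `nic_terms`, the representation of a hypothesis as a pair (probability table, description length), the base-2 logarithm, and the toy numbers are assumptions made for illustration. Unlike the flowchart, the loop simply visits every hypothesis in the list.

```python
import math


def nic_terms(sample, hypotheses, a):
    """Return the lists NIC1, NIC2, NIC, with one entry per hypothesis."""
    m = len(sample)
    nic1, nic2, nic = [], [], []
    for probs, length in hypotheses:
        first = -sum(math.log2(probs[x]) for x in sample)  # NIC1(i)
        second = length * m ** a                           # NIC2(i)
        nic1.append(first)
        nic2.append(second)
        nic.append(first + second)                         # NIC(i)
    return nic1, nic2, nic


sample = ["a", "a", "b", "a"]
hypotheses = [
    ({"a": 0.5, "b": 0.5}, 1),    # uniform model, assumed length 1
    ({"a": 0.75, "b": 0.25}, 3),  # better fit, assumed length 3
]
nic1, nic2, nic = nic_terms(sample, hypotheses, a=0.5)
```

With m = 4 and a = 0.5 the penalty factor m^a is 2, so the second hypothesis fits the sample better (smaller NIC1) but pays a larger penalty (NIC2 = 6 against 2) and ends up with the larger total.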

[0026] Next, the NIC-minimal hypothesis finding means (step 3) can be realized, for example, as in the flowchart shown in FIG. 4. At 301, the variables are initialized as i := 1, MINVAL := 0, and MININD := 1. At 302, it is tested whether the index i equals k; if so, this means terminates, and H(ε)(MININD) is output at step 4. Otherwise, at 303, it is tested whether MINVAL = 0 or MINVAL > NIC(i). If neither holds, the process proceeds to 305; if either holds, a new NIC-minimal index has been found, so at 304 the variables are updated as MINVAL := NIC(i) and MININD := i. Then, at 305, 1 is added to the index i, and the process returns to 302.
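The minimum search of step 3 can be sketched as below. The flowchart uses MINVAL = 0 as a marker for "no value stored yet", which relies on NIC values being positive; this sketch uses `None` for that marker instead, and it uses 0-based indices. Both choices, and the name `find_min_index`, are assumptions of the example rather than part of the described method.

```python
def find_min_index(nic_values):
    """Return (index, value) of the smallest NIC; the first minimum wins ties."""
    min_val = None  # plays the role of the MINVAL = 0 "not yet set" state
    min_ind = None
    for i, value in enumerate(nic_values):
        # Same test as 303: nothing stored yet, or a smaller NIC was found.
        if min_val is None or min_val > value:
            min_val, min_ind = value, i
    return min_ind, min_val
```

Because the comparison is strict, a later hypothesis with an equal NIC does not replace the stored one, which is one admissible way of breaking ties.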

[0027] Finally, the termination judgment means (step 4) can be realized, for example, as in the flowchart shown in FIG. 5. At 401, s is updated as s := 2s. At 402, it is tested whether MINVAL < s·m^a; if so, the process proceeds to output (step 5), and otherwise it returns to step 1.

[0028] An embodiment of the sixth invention is obtained, for example, by replacing the probabilistic automata that form the hypothesis space in the above embodiment with "probabilistic acceptors", which act as stochastic recognizers. The "probabilistic acceptor" as a probabilistic rule is explained and defined in, for example, the paper "Polynomial Learnability of Probabilistic Concepts with respect to the Kullback-Leibler Divergence" by Abe, Takeuchi, and Warmuth, published in the Proceedings of the Fourth Annual Workshop on Computational Learning Theory (United States, 1991). The embodiment of the sixth invention differs from the above embodiment in that the input sample S is a sequence of pairs of a domain element and a teacher signal, such as 〈〈x₁, y₁〉, ..., 〈x_m, y_m〉〉. Accordingly, the first term of the NIC for the i-th hypothesis h of H(ε)_s, computed at 203, is given by the following equation.

[Equation 14]
$$-\sum_{j=1}^{m} \log h(y_j \mid x_j)$$
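For a sample of pairs, the first NIC term is the negative log-likelihood of the teacher signals given the domain elements. The following sketch illustrates this under assumptions: the names `conditional_neg_log_lik` and `toy_rule`, binary teacher signals, the base-2 logarithm, and the rule itself are hypothetical and only serve the example.

```python
import math


def conditional_neg_log_lik(pairs, cond_prob):
    """Sum of -log2 h(y | x) over the labelled sample."""
    return -sum(math.log2(cond_prob(y, x)) for x, y in pairs)


def toy_rule(y, x):
    """Hypothetical rule: P(y = 1 | x) is 0.75 if x starts with 'a', else 0.5."""
    p_one = 0.75 if x.startswith("a") else 0.5
    return p_one if y == 1 else 1.0 - p_one


pairs = [("ab", 1), ("ba", 0), ("aa", 0)]
first_term = conditional_neg_log_lik(pairs, toy_rule)
```

Only the conditional probabilities of the teacher signals enter the sum; the distribution that generates the domain elements themselves plays no role in this term.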

[0029] An embodiment of the third invention is obtained by replacing the hypothesis space of the embodiment of the fifth invention with the class of probabilistic automata with at most s states. In this case, the ε-lower-bound approximation is computed in the same way. The termination judgment means (step 4) is unnecessary.

[0030] An embodiment of the fourth invention is obtained by replacing the probabilistic automata that form the hypothesis space in the embodiment of the third invention with probabilistic acceptors, which act as stochastic recognizers.

[0031] An embodiment of the first invention is obtained by replacing the hypothesis space of the embodiment of the fifth invention with a subclass that is an ε-lower-bound approximation of the class of probabilistic automata with at most s states. The ε-lower-bound approximation calculation means (step 1) and the termination judgment means (step 4) are unnecessary.

[0032] An embodiment of the second invention is obtained by replacing the probabilistic automata that form the hypothesis space in the embodiment of the first invention with probabilistic acceptors, which act as stochastic recognizers.

[0033]

[Effects of the Invention] The probability distribution learning method of the third invention selects and outputs, for a large number of data, a representation of a probability distribution explaining the distribution of the data from a class of representations of probability distributions given as the hypothesis space. When the input data are generated independently according to some probability distribution over the domain, this method makes it possible to guarantee, for any hypothesis space closed under ε-lower-bound approximation, that the KL information between the probability distribution output as the hypothesis and the true probability distribution converges rapidly, as the number of data increases, to the KL information between the optimal distribution in the hypothesis space and the true distribution.

[0034] The probabilistic rule learning method of the fourth invention selects and outputs, for a large number of input data and teacher signals, a probabilistic rule for classifying the data from a class of probabilistic rules given as the hypothesis space. When the input data and teacher signals are generated independently according to the joint distribution of a probability distribution over the domain and the true probabilistic rule over that domain, this method makes it possible to guarantee, for any hypothesis space closed under ε-lower-bound approximation, that the KL information between the probabilistic rule output as the hypothesis and the true probabilistic rule converges rapidly, as the number of data increases, to the KL information between the optimal rule in the hypothesis space and the true rule.

[0035] The probability distribution learning method of the fifth invention selects and outputs, for a large number of data, a representation of a probability distribution explaining the distribution of the data from a class of parametric representations of probability distributions, each consisting of a discrete model and real-valued parameters, given as the hypothesis space. When the input data are generated independently according to some probability distribution over the domain, this method makes it possible to guarantee, for any parametric hypothesis space closed under ε-lower-bound approximation, that the KL information between the probability distribution output as the hypothesis and the true probability distribution converges rapidly, as the number of data increases, to the KL information between the optimal distribution in the hypothesis space and the true distribution.

[0036] The probabilistic rule learning method of the sixth invention selects and outputs, for a large number of input data and teacher signals, a probabilistic rule for classifying the data from a class of parametric representations of probabilistic rules, each consisting of a discrete model and real-valued parameters, given as the hypothesis space. When the input data and teacher signals are generated independently according to the joint distribution of a probability distribution over the domain and the true probabilistic rule over that domain, this method makes it possible to guarantee, for any parametric hypothesis space closed under ε-lower-bound approximation, that the KL information between the probabilistic rule output as the hypothesis and the true probabilistic rule converges rapidly, as the number of data increases, to the KL information between the optimal rule in the hypothesis space and the true rule.

[0037] Performance guarantees for the learning methods of the third, fourth, fifth, and sixth inventions are given in the paper "The Minimum Description Length Principle and the Uniform Convergence Method" by Abe, published in the proceedings of the 1991 National Conference on Artificial Intelligence; an outline is given here.

Theorem 1: Let H be a class of parametric representations of probability distributions or probabilistic rules that is closed under ε-lower-bound approximation, and let H_ε be an ε-lower-bound approximation of H whose cardinality is at most 2 to the power p(1/ε, n, s) and whose probabilities are bounded below by (1/2) to the power q(1/ε, n, s), where p and q are polynomials. Then, for any true model, the third, fourth, fifth, and sixth learning methods with H_ε as the hypothesis space behave as follows: when the number of data in the input sample exceeds a bound of the order given by

[Equation 15]

they output, with probability at least 1 − δ, a hypothesis whose KL information with respect to the true model is at most ε larger than that of the optimal hypothesis in H.

The KL information used above as the distance between probability distributions is defined as follows. Let D be the true probability distribution and h a probability distribution taken as the hypothesis. The KL information d(D, h) of h with respect to D (the Kullback-Leibler divergence) is given by the following equation.

[Equation 16]
$$d(D, h) = \sum_{x} D(x) \log \frac{D(x)}{h(x)}$$

The KL information as a distance between probabilistic rules is defined in almost the same way. Let D be a probability distribution over the domain, p the true probabilistic rule, and h a probabilistic rule taken as the hypothesis. The KL information d(p, h) of h with respect to p (the Kullback-Leibler divergence) is given by the following equation.

[Equation 17]
$$d(p, h) = \sum_{x} D(x) \sum_{y} p(y \mid x) \log \frac{p(y \mid x)}{h(y \mid x)}$$
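For finite domains, the two KL quantities defined above can be computed as in the following sketch. It uses the standard Kullback-Leibler divergence; the function names `kl_distribution` and `kl_rule`, the dict-based representation, the base-2 logarithm, and the restriction to binary teacher signals are assumptions of the example. Hypotheses are assumed to assign strictly positive probabilities, which the lower bound on probabilities in an ε-lower-bound approximation would ensure.

```python
import math


def kl_distribution(true_dist, hyp_dist):
    """d(D, h) = sum_x D(x) * log2(D(x) / h(x)); terms with D(x) = 0 are skipped."""
    return sum(d * math.log2(d / hyp_dist[x])
               for x, d in true_dist.items() if d > 0)


def kl_rule(domain_dist, true_rule, hyp_rule):
    """Expected KL between binary conditional distributions, weighted by D(x).

    true_rule[x] and hyp_rule[x] are the probabilities of label 1 given x.
    """
    total = 0.0
    for x, d in domain_dist.items():
        # One term for label 1 and one for label 0.
        for p, h in ((true_rule[x], hyp_rule[x]),
                     (1.0 - true_rule[x], 1.0 - hyp_rule[x])):
            if p > 0:
                total += d * p * math.log2(p / h)
    return total


D = {"a": 0.5, "b": 0.5}
h = {"a": 0.25, "b": 0.75}
divergence = kl_distribution(D, h)
```

Both quantities are zero when the hypothesis coincides with the true model and positive otherwise, which is why the guarantees in this section are stated as an excess of at most ε over the optimal hypothesis.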

[0038] For example, the paper by Abe and Warmuth cited above shows that the class of probabilistic automata, as parametric representations of probability distributions, is closed under ε-lower-bound approximation. Likewise, the paper by Abe, Takeuchi, and Warmuth cited above shows that the class of probabilistic automata (probabilistic acceptors), as parametric representations of probabilistic rules, is closed under ε-lower-bound approximation. The following two corollaries are therefore obtained.

Corollary 1: When probabilistic automata form the hypothesis space, there is a polynomial poly such that, whenever the number of examples exceeds poly(1/ε, n, s) (where n is the upper bound on the length of strings over the domain and s is the description length of the discrete model of the optimal model), the probability distribution learning method of the fifth invention outputs, with probability at least 1 − δ, a probabilistic automaton whose KL information with respect to the true model is at most ε larger than that of the optimal automaton.

Corollary 2: When probabilistic acceptors form the hypothesis space, there is a polynomial poly such that, whenever the number of examples exceeds poly(1/ε, n, s) (where n is the upper bound on the length of strings over the domain and s is the description length of the discrete model of the optimal model), the probabilistic rule learning method of the sixth invention outputs, with probability at least 1 − δ, a probabilistic acceptor whose KL information with respect to the true model is at most ε larger than that of the optimal automaton.

[Brief description of drawings]

FIG. 1 is a configuration diagram of the learning method for probability distributions and probabilistic rules according to the present invention.

FIG. 2 is a flowchart showing the ε-lower-bound approximation calculation means of step 1.

FIG. 3 is a flowchart showing the NIC calculation means of step 2.

FIG. 4 is a flowchart showing the NIC-minimal hypothesis finding means of step 3.

FIG. 5 is a flowchart showing the termination judgment means of step 4.

Claims

[Claims]

1. A probability distribution learning method that, for a large number of input data, selects and outputs a representation of a probability distribution explaining the distribution of the data from a class containing a plurality of given representations of probability distributions, characterized in that the representation that minimizes the sum of the product of the number of bits of the representation and the a-th power (0 < a < 1) of the number of input data and the negative log-likelihood of the input data determined by the probability distribution is output as the best representation.

2. A probabilistic rule learning method that, for a large number of input data and teacher signals, selects and outputs a probabilistic rule for classifying the data from a class containing a plurality of given representations of probabilistic rules, characterized in that the representation that minimizes the sum of the product of the number of bits of the representation and the a-th power (0 < a < 1) of the number of input data and the negative log-likelihood of the input data and teacher signals determined by the probability distribution is output as the best representation.

3. A probability distribution learning method that, for a large number of input data and an input accuracy parameter, selects and outputs a representation of a probability distribution explaining the distribution of the data from a class containing a plurality of given representations of probability distributions, characterized in that, from within a subclass of the class of representations, the subclass having the properties that it contains, for any representation in the class of representations, a representation approximating that representation within the accuracy parameter, that the probability that any representation in the subclass assigns to any element of the domain is bounded below by 1/2 raised to a polynomial in the reciprocal of the accuracy parameter and the upper bound on the number of bits of an individual datum, and that the number of elements of the subclass is bounded above by 2 raised to a polynomial in the reciprocal of the accuracy parameter and the upper bound on the number of bits of an individual datum, the representation that minimizes the sum of the product of the number of bits of the representation and the a-th power (0 < a < 1) of the number of input data and the negative log-likelihood of the input data determined by the probability distribution is output as the best representation.

4. A probabilistic rule learning method that, for a large number of input data and teacher signals and an input accuracy parameter, selects and outputs a probabilistic rule for classifying the data from a class containing a plurality of given representations of probabilistic rules, characterized in that, from within a subclass of the class of representations, the subclass having the properties that it contains, for any representation in the class of representations, a representation approximating that representation within the accuracy parameter, that the conditional probability that any representation in the subclass assigns to any pair of a domain element and a teacher signal is bounded below by 1/2 raised to a polynomial in the reciprocal of the accuracy parameter and the upper bound on the number of bits of an individual datum, and that the number of elements of the subclass is bounded above by 2 raised to a polynomial in the reciprocal of the accuracy parameter and the upper bound on the number of bits of an individual datum, the representation that minimizes the sum of the product of the number of bits of the representation and the a-th power (0 < a < 1) of the number of input data and the negative log-likelihood of the input data and teacher signals determined by the probabilistic rule is output as the best representation.

5. A probability distribution learning method that, for a large number of input data and an input accuracy parameter, selects and outputs a representation of a probability distribution explaining the distribution of the data from a class containing a plurality of given parametric representations of probability distributions, each consisting of a discrete model and real-valued parameters, characterized in that the upper bound on the description length of the discrete models considered during the search is computed dynamically, and that, from within a subclass of the representations whose discrete model has a description length at most the upper bound, the subclass having the properties that it contains, for any representation in the class of representations whose description length is at most the upper bound, a representation approximating that representation within the accuracy parameter, that the probability that any representation in the subclass assigns to any element of the domain is bounded below by 1/2 raised to a polynomial in the reciprocal of the accuracy parameter, the upper bound on the description length of the representation, and the upper bound on the number of bits of an individual datum, and that the number of elements of the subclass is bounded above by 2 raised to a polynomial in the reciprocal of the accuracy parameter, the upper bound on the description length of the representation, and the upper bound on the number of bits of an individual datum, the representation that minimizes the sum of the product of the number of bits of the representation and the a-th power (0 < a < 1) of the number of input data and the negative log-likelihood of the input data determined by the probability distribution is output as the best representation.

6. A probabilistic rule learning method that, for a large number of input data and teacher signals and an input accuracy parameter, selects and outputs a probabilistic rule for classifying the data from a class containing a plurality of given parametric representations of probabilistic rules, each consisting of a discrete model and real-valued parameters, characterized in that the upper bound on the description length of the discrete models considered during the search is computed dynamically, and that, from within a subclass of the representations whose discrete model has a description length at most the upper bound, the subclass having the properties that it contains, for any representation whose discrete model has a description length at most the upper bound, a representation approximating that representation within the accuracy parameter, that the conditional probability that any representation in the subclass assigns to any pair of a domain element and a teacher signal is bounded below by 1/2 raised to a polynomial in the reciprocal of the accuracy parameter, the upper bound on the description length of the representation, and the upper bound on the number of bits of an individual datum, and that the number of elements of the subclass is bounded above by 2 raised to a polynomial in the reciprocal of the accuracy parameter, the upper bound on the description length of the representation, and the upper bound on the number of bits of an individual datum, the representation that minimizes the sum of the product of the number of bits of the representation and the a-th power (0 < a < 1) of the number of input data and the negative log-likelihood of the input data and teacher signals determined by the probabilistic rule is output as the best representation.