JP7148445B2

JP7148445B2 - Information estimation device and information estimation method

Info

Publication number: JP7148445B2
Application number: JP2019051627A
Authority: JP
Inventors: 仁吾安達
Original assignee: Denso IT Laboratory Inc
Current assignee: Denso IT Laboratory Inc
Priority date: 2019-03-19
Filing date: 2019-03-19
Publication date: 2022-10-05
Anticipated expiration: 2039-03-19
Also published as: JP2020154615A

Description

The present invention relates to an information estimation device and an information estimation method that perform estimation processing using a neural network (NN).

Compared with other estimators, estimators based on neural networks can take large amounts of information, such as images and sensor signal data, as input data and perform estimation on it, and are therefore expected to find application in a wide range of fields. In the learning phase, training data is input to the neural network and the weights of the network are optimized so that the expected output is produced. In the test phase, when unknown input data is input to the neural network, the result estimated by the network is output as an output value.

The data used by a neural network, both training data and test data, is point data consisting of an n_xin-dimensional vector x_in. "Point data" here means data that takes a fixed value. When such data is input to the neural network, linear and/or nonlinear processing is applied to it as neuron values in each layer, and the data input to each layer is likewise point data consisting of a vector of fixed values with a certain number of dimensions. Whether the task is regression or classification, the data output as the final result is also point data consisting of fixed values, namely a vector or scalar with a certain number of dimensions.

As conventional techniques related to estimation using neural networks, the methods described in Non-Patent Documents 1 and 2, for example, are known. Non-Patent Document 1 describes a method that treats the learned weight parameters as distributions rather than fixed values and, at estimation time, computes the output for the input data as a distribution based on the learned parameter distributions. Non-Patent Document 2 describes, as a method that computes ideal input data by backpropagation, a technique for visualizing which neurons in which layers of a neural network play an important role in the estimation result.

Non-Patent Document 1: 2016, “Uncertainty in Deep Learning”, Yarin Gal, Thesis http://mlg.eng.cam.ac.uk/yarin/thesis/thesis.pdf
Non-Patent Document 2: 2009, “Visualizing Higher-Layer Features of a Deep Network”, Yoshua Bengio, University de Montreal
Non-Patent Document 3: 2009, “Fast Dropout”, Sida Wang, Christopher Manning; Proceedings of the 30th International Conference on Machine Learning, PMLR 28(2):118-126, 2013.
Non-Patent Document 4: Jean Daunizeau, "Semi-analytical approximations to statistical moments of sigmoid and softmax mappings of normal variables", 2017, arXiv:1703.00091 [stat.ML]
Non-Patent Document 5: I. J. Goodfellow, J. Shlens, and C. Szegedy. “Explaining and harnessing adversarial examples”, 2015.

However, conventional neural network processing as described above operates on point data: for an input of point data in the form of a vector, it can only compute point data in the form of vectors, and the estimation result it outputs is also point data.

Input data that contains errors, such as observation data (for example, distribution data in which the error around a value, the value plus or minus some amount, follows a Gaussian distribution), cannot be handled directly by a conventional neural network. If distribution data is nevertheless to be handled with a conventional neural network, this can be done in a Monte Carlo manner, that is, by sampling a large number of point data from the distribution and feeding them to the neural network, but the processing takes a long time. In addition, the larger the number of dimensions of the point-data vector, the more variations are possible, so the required number of samples also grows.
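The Monte Carlo workaround described above can be illustrated with a minimal sketch. The one-layer network, its weights, and the function names below are hypothetical and chosen only for illustration; they are not taken from the patent. The sketch samples point data from independent Gaussian inputs, passes each sample through the network, and summarizes the outputs.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def tiny_network(x, weights, bias):
    """Hypothetical one-layer network: weighted sum followed by a sigmoid."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)

def monte_carlo_output(mean, std, weights, bias, n_samples, seed=0):
    """Estimate the mean and variance of the network output when each input
    element is an independent Gaussian with the given mean and std."""
    rng = random.Random(seed)
    outputs = []
    for _ in range(n_samples):
        # Draw one point-data vector from the input distribution.
        x = [rng.gauss(m, s) for m, s in zip(mean, std)]
        outputs.append(tiny_network(x, weights, bias))
    avg = sum(outputs) / n_samples
    var = sum((o - avg) ** 2 for o in outputs) / n_samples
    return avg, var

avg, var = monte_carlo_output(
    mean=[0.5, -1.0], std=[0.1, 0.3],
    weights=[2.0, 1.0], bias=0.0, n_samples=10_000,
)
print(f"output mean ~ {avg:.3f}, output variance ~ {var:.5f}")
```

Every sample requires a full forward pass, which is the processing-time cost the text refers to.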

Furthermore, when the output estimation result is point data, that is, a single computed fixed value rather than a distribution over possible values, it is not possible to evaluate how confidently the value was estimated, in other words the probability or certainty of the value. That is, neither the variance nor the confidence interval of the estimate can be obtained.

As a countermeasure to this problem, a method such as the one described in Non-Patent Document 1 has been proposed. As noted above, the method of Non-Patent Document 1 treats the learned weight parameters as distributions rather than fixed values and, at estimation time, computes the output for the input data as a distribution based on the learned parameter distributions. In particular, it proposes to approximate, by applying dropout, the integral needed to compute the output distribution from the weight parameter distribution, an integral that cannot be solved directly, and thereby to obtain an equivalent result. However, the method of Non-Patent Document 1 treats the learned weight parameters as distributions rather than fixed values, and in this respect differs from the method according to the present invention.

In view of the above problems, an object of the present invention is to provide an information estimation device and an information estimation method capable of calculating the feature distribution of the data in each layer of a trained neural network that is needed for the network to output a given estimation result.

To achieve the above object, the present invention provides an information estimation device that performs estimation processing using a neural network, the device comprising:
a weight learning unit that, using training data consisting of vectors of fixed values, learns the weights of the neural network, which includes a dropout layer that performs dropout to drop part of the data;
a propagation calculation unit that treats the data propagating through each layer of the neural network as a random variable having a multivariate Gaussian distribution defined by distribution parameters including a mean value, a variance value, and a covariance value, and performs propagation calculation on input data while applying the dropout in the neural network;
a cost function calculation unit that calculates a cost function representing the difference between output data from the neural network and correct data;
a distribution parameter updating unit that performs backpropagation calculation on the small error obtained by differentiating the cost function while applying the dropout in the neural network, and updates the distribution parameters in each layer of the neural network; and
an optimization parameter output unit that outputs, as optimized distribution parameters, the distribution parameters at the time the cost function becomes equal to or less than a predetermined threshold.

To achieve the above object, the present invention also provides an information estimation method executed by an information estimation device that performs estimation processing using a neural network, the method comprising:
a weight learning step of learning, using training data consisting of vectors of fixed values, the weights of the neural network, which includes a dropout layer that performs dropout to drop part of the data;
a propagation calculation step of treating the data propagating through each layer of the neural network as a random variable having a multivariate Gaussian distribution defined by distribution parameters including a mean value, a variance value, and a covariance value, and performing propagation calculation on input data while applying the dropout in the neural network;
a cost function calculation step of calculating a cost function representing the difference between output data from the neural network and correct data;
a distribution parameter updating step of performing backpropagation calculation on the small error obtained by differentiating the cost function while applying the dropout in the neural network, and updating the distribution parameters in each layer of the neural network; and
an optimization parameter output step of outputting, as optimized distribution parameters, the distribution parameters at the time the cost function becomes equal to or less than a predetermined threshold.

According to the present invention, in a trained neural network, the distribution parameters can be optimized by repeatedly performing propagation and backpropagation calculations, with dropout applied, on the errors of the distribution parameters that represent a random variable having a multivariate Gaussian distribution. This makes it possible to calculate the feature distribution of the data in each layer of the neural network that is needed to output a given estimation result.

FIG. 1 is a diagram schematically showing the relationship between input and output data in an arbitrary layer l of the neural network in the first to third embodiments of the present invention, in which (a) shows the propagation from the Gaussian distribution parameters of the input data to the Gaussian distribution parameters of the output data in layer l, and (b) shows the backpropagation of the error to be updated from the Gaussian distribution parameters of the output data to the Gaussian distribution parameters of the input data in layer l.
FIG. 2 is a diagram showing an example of the configuration of the neural network in the first embodiment of the present invention.
FIG. 3 is a flowchart showing an example of the processing for calculating the feature distribution of data in the first embodiment of the present invention.
FIG. 4 is a block diagram showing an example of the configuration of the information estimation device in the first to third embodiments of the present invention.
FIG. 5 is a set of graphs showing the relationship between the value of the cost function and the number of updates, in which (a) is for a comparative example and (b) is for the first embodiment of the present invention.
FIG. 6 is a diagram showing the optimized feature distribution for the class of the digit 9 in the first embodiment of the present invention, in which (a) shows the feature distribution of the input image at the input layer of the neural network of FIG. 2, and (b) shows the feature distribution for the 10 × 10 = 100 neurons in the first FC layer of the neural network of FIG. 2.
FIG. 7 is a diagram showing an example of an input image and softmax output values in the first embodiment of the present invention, in which (a) shows an example of an input image sampled from the Gaussian distribution for the digit 9 in FIG. 6(a), and (b) shows the softmax output values when the input image of FIG. 7(a) is propagated through the neural network.
FIG. 8 is a diagram showing several examples of augmented MNIST images input to the neural network in the first embodiment of the present invention.
FIG. 9 is a diagram showing an example of the feature distribution of the input image calculated for the correct label (digit 0), in relation to the augmented MNIST images shown in FIG. 8.
FIG. 10 is a diagram showing an example of an input image and softmax output values in the first embodiment of the present invention, in which (a) shows an example of an input image sampled from the feature distribution of FIG. 9, and (b) shows the softmax output values when the input image of FIG. 10(a) is propagated through the neural network.
FIG. 11 is a set of graphs showing the posterior probability distribution and input data in the second embodiment of the present invention, in which (a) shows the posterior probability distribution and input point data, and (b) shows the posterior probability distribution and input distribution data.
FIG. 12 is a flowchart showing an example of the processing for calculating an approximate expression of the posterior probability distribution in the second embodiment of the present invention.
FIG. 13 is a set of histograms showing the posterior probability value of each class, in which (a) is for a comparative example in which the correction constants are provisional (not optimized), and (b) is for the second embodiment of the present invention in which the correction constants are optimized.
FIG. 14 is a diagram showing an example of data related to an experiment in the second embodiment of the present invention, in which (a) shows an image of the digit 7 used as test data, and (b) shows the result of calculating, for each class, the logarithm of the posterior probability with optimized correction constants, using the image of the digit 7 in (a) as input data.
FIG. 15 is a diagram showing an example of data related to an experiment in the second embodiment of the present invention, in which the left column shows the accuracy rate of each class when the posterior probability values are calculated before the correction constants are obtained (using provisional correction constants), and the right column shows the accuracy rate of each class when the posterior probability values are calculated after the correction constants are obtained.
FIG. 16 is a diagram showing an example of data related to an experiment in the third embodiment of the present invention, in which (a) shows an image of the digit 4 used as test data, (b) shows test data in which part of the image of the digit 4 is painted gray to represent missing data, (c) shows the softmax values calculated by the conventional method for the test data of (a), (d) shows the softmax values calculated by the conventional method for the test data of (b), (e) shows the logarithm of the posterior probability calculated by the method according to the present invention for the test data of (a), and (f) shows the logarithm of the posterior probability calculated by the method according to the present invention for the test data of (b).

The first to third embodiments of the present invention are described below with reference to the drawings.

First, an overview of the present invention is given.

The present invention proposes a method for calculating, for a neural network, the distribution that the vector values serving as the input point data of each layer must follow in order for the network to produce a given output (a given output value for a regression problem, or a given score value for each class for a classification problem). In the present invention, a distribution is calculated that expresses what distribution the vector values input to each layer should follow, that is, from what distribution they should have been sampled.

The shape of this distribution reveals which neurons in each layer are important for producing the estimation result. For example, if the distribution for a neuron value in some layer is wide, the desired output is obtained whatever value that neuron takes; the neuron carries little feature information and does not affect the output. Such neuron values can, for example, be judged not to require calculation, which simplifies the computation.

Conversely, if the distribution for a neuron value in some layer is narrow, the desired output is obtained only if that neuron takes a value within this narrow range. In other words, the processing in that layer is important. This makes it possible to identify which layers make the important decisions behind a result estimated by the neural network, and provides an important clue for explaining, and being accountable for, the estimation result.
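As a rough illustration of this reading of distribution width, the sketch below splits the neurons of one layer into "important" (narrow distribution) and "skippable" (wide distribution) groups by comparing each neuron's variance with a threshold. The function name, the variance values, and the threshold are all hypothetical; the patent text does not specify how such a cut-off would be chosen.

```python
def split_by_width(variances, threshold):
    """Return (important, skippable) neuron indices for one layer.

    A neuron whose optimized variance is below the threshold has a narrow
    distribution and is treated as important; the rest are treated as
    having little influence on the output.
    """
    important = [i for i, v in enumerate(variances) if v < threshold]
    skippable = [i for i, v in enumerate(variances) if v >= threshold]
    return important, skippable

# Hypothetical per-neuron variances for a four-neuron layer.
important, skippable = split_by_width([0.01, 2.5, 0.2, 9.0], threshold=1.0)
print(important, skippable)  # [0, 2] [1, 3]
```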

The present invention differs significantly from the method of Non-Patent Document 1. As described above, the method of Non-Patent Document 1 treats the learned weight parameters as distributions rather than fixed values and, at estimation time, computes the output for the input data as a distribution based on the learned weight parameter distributions.

In contrast, the present invention keeps the learned weight parameters at fixed values and uses backpropagation to calculate the distribution of the input data, that is, what feature distribution the values of the input data vector should follow in order to maximize the desired output.

The present invention also differs significantly from the method of Non-Patent Document 2. That method uses backpropagation to calculate what input data is needed to activate the later layers of the neural network, but the input data so calculated is strictly point data with fixed values.

The present invention belongs to the same family of visualization techniques as the method of Non-Patent Document 2, which ask what a neural network looks at in the input data when making its decision. However, the present invention calculates from what distribution data must be sampled in order to activate the later layers of the neural network and produce the desired output; that is, it calculates and visualizes the distribution of the input data. The method of Non-Patent Document 2 calculates the input data (point data) itself that yields the desired output, whereas the present invention calculates the distribution (feature distribution) that such point data can follow, and the two differ significantly in this respect.

The present invention assumes that the feature distribution that the input data can follow is a single-mode multivariate Gaussian distribution, and uses backpropagation to obtain the parameters that determine its shape, namely the mean vector μ of a given dimension and the variance-covariance matrix Σ (variance values Var and covariance values Cov).

The actual data distribution is not a single-mode Gaussian distribution but a complex distribution, which propagates while being mapped from layer to layer. Because of this complexity, calculating the backpropagation of a distribution has been considered difficult. As the breakthrough of the present invention, however, adding noise by dropout makes it possible to treat the distribution as a single-mode Gaussian distribution and to calculate the backpropagation of its parameters.

As described above, the present invention proposes an information estimation method that includes an approximation technique for calculating the parameters of a distribution regarded as a single-mode Gaussian distribution, and an information estimation device that includes an estimator using this approximation technique. With conventional techniques it is difficult to perform backpropagation under the assumption of a single-mode Gaussian distribution, but the inventor found that adding noise by dropout makes this calculation possible, and on that basis proposes an estimator that exploits this effect and property.

In relation to the technique of applying dropout, Non-Patent Document 3, for example, describes a technique called analytical dropout. Instead of performing dropout many times in a Monte Carlo manner, analytical dropout calculates the variance that dropout would produce analytically and adds that variance to the distribution of the neuron values, thereby obtaining an effect equivalent to dropout. In Non-Patent Document 3, this technique is used to calculate the weights during training of a neural network. In contrast, in the present invention the weights are fixed, already-learned values, and applying dropout when learning and calculating the optimal distribution of the input data makes the calculation tractable even when the data distribution is assumed to be a single-mode Gaussian distribution. Neither point is disclosed in Non-Patent Document 3.
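The idea of computing the dropout-induced variance analytically rather than by repeated sampling can be sketched as follows. This is an illustrative example under stated assumptions, not the formulation of Non-Patent Document 3 or of the patent: a single neuron with fixed inputs x and weights w, each input kept independently with probability 1 - p. The weighted sum then has mean (1 - p) Σ w_j x_j and variance p (1 - p) Σ (w_j x_j)^2, which the code compares with a Monte Carlo estimate. All names and values are hypothetical.

```python
import random

def analytic_dropout_moments(x, w, p_drop):
    """Closed-form mean and variance of sum_j w_j * z_j * x_j,
    where each z_j is an independent Bernoulli(1 - p_drop) mask."""
    keep = 1.0 - p_drop
    mean = keep * sum(wj * xj for wj, xj in zip(w, x))
    var = keep * p_drop * sum((wj * xj) ** 2 for wj, xj in zip(w, x))
    return mean, var

def sampled_dropout_moments(x, w, p_drop, n_samples, seed=0):
    """Monte Carlo estimate of the same two moments."""
    rng = random.Random(seed)
    sums = []
    for _ in range(n_samples):
        # Keep each term with probability 1 - p_drop.
        s = sum(wj * xj for wj, xj in zip(w, x) if rng.random() >= p_drop)
        sums.append(s)
    mean = sum(sums) / n_samples
    var = sum((s - mean) ** 2 for s in sums) / n_samples
    return mean, var

x = [1.0, 2.0, -1.0]
w = [0.5, 0.25, 1.0]
a_mean, a_var = analytic_dropout_moments(x, w, p_drop=0.5)
s_mean, s_var = sampled_dropout_moments(x, w, p_drop=0.5, n_samples=20_000)
print("analytic:", a_mean, a_var)   # analytic: 0.0 0.375
print("sampled: ", s_mean, s_var)
```

The analytic version needs one pass over the inputs, whereas the sampled version needs one pass per sample to approach the same values.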

To perform backpropagation in a conventional neural network that handles point data, forward propagation (also called propagation in the forward direction, or simply propagation) is calculated first, and its result is then used to calculate backpropagation (propagation in the reverse direction). In the present invention the data handled is not point data but distribution parameters, yet both the forward and backward calculations are required, as in the conventional case.

The calculation formulas used in the first to third embodiments of the present invention are described in detail below.

(Calculation during propagation)
First, for the forward propagation calculation, the following describes how input data in the form of a Gaussian distribution is input to and output from each layer of the neural network.

As shown in FIG. 1(a), an arbitrary layer is denoted by l, and the n_in^l-dimensional multivariate Gaussian distribution that is the input data to layer l is defined as follows.

Here, μ_in^l is the n_in^l-dimensional mean vector, Var_in^l is the n_in^l-dimensional variance vector, and Cov_in^l is the n_in^l-dimensional covariance matrix; these are the parameters of the Gaussian distribution.

Likewise, the n_out^l-dimensional multivariate Gaussian distribution that is the output data from layer l is defined as follows.

Here, μ_out^l is the n_out^l-dimensional mean vector, Var_out^l is the n_out^l-dimensional variance vector, and Cov_out^l is the n_out^l-dimensional covariance matrix; these are the parameters of the Gaussian distribution.

As shown in FIG. 1(a), for a given layer l, the mapping from the Gaussian distribution parameters of the input data to the Gaussian distribution parameters of the output data is calculated according to the following equation.
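The numbered equations of the original publication are not reproduced in this text. As an aid to reading, the notation defined in the surrounding paragraphs can be summarized as below. This is a reconstruction from those definitions, not the patent's own equations; it assumes that the variance vector forms the diagonal and the covariance values form the off-diagonal entries of the variance-covariance matrix Σ, and that X_in^l and X_out^l denote the input and output random variables of layer l.

```latex
% Input and output of layer l as multivariate Gaussian random variables
X_{in}^{l} \sim \mathcal{N}\!\left(\mu_{in}^{l},\, \Sigma_{in}^{l}\right),
\qquad
X_{out}^{l} \sim \mathcal{N}\!\left(\mu_{out}^{l},\, \Sigma_{out}^{l}\right)

% Assumed structure of the variance-covariance matrix
\left(\Sigma^{l}\right)_{ii} = \mathrm{Var}^{l}_{i},
\qquad
\left(\Sigma^{l}\right)_{ij} = \mathrm{Cov}^{l}_{ij} \quad (i \neq j)

% Forward mapping of the distribution parameters through layer l
\left(\mu_{in}^{l},\, \mathrm{Var}_{in}^{l},\, \mathrm{Cov}_{in}^{l}\right)
\;\longmapsto\;
\left(\mu_{out}^{l},\, \mathrm{Var}_{out}^{l},\, \mathrm{Cov}_{out}^{l}\right)
```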

In the first to third embodiments of the present invention, calculation is performed on a neural network constructed with a structure such as that shown in FIG. 2. The neural network shown in FIG. 2 includes a fully connected layer (hereinafter, FC layer), a dropout layer, a sigmoid layer, and a softmax layer. A neural network with the same structure is used in the experiments described later, but the neural network according to the present invention is not limited to this structure.

The calculation of the mapping during forward propagation in each layer l of the neural network is described below, one layer type at a time.

(DF layer: calculation during propagation)
When the layer l consists of a dropout layer and an FC layer, these layers are collectively referred to in this specification as the DF layer, written l = DF.

The dropout layer performs the dropout calculation. Dropout means multiplying the value of each neuron in a layer, or each connection between neurons of two layers, independently by a value sampled from a distribution such as a Bernoulli or Gaussian distribution. Sampling is performed independently for every propagation and every backpropagation.

When dropout is performed by multiplying each neuron in the DF layer by a value sampled from a Bernoulli distribution, each of the $n_{in}^{DF}$ elements of the input data vector is independently set to zero with a probability $p_{dropout}^{DF}$ chosen by the designer ($0 < p_{dropout}^{DF} < 1$). The number of remaining non-zero values, M out of $n_{in}^{DF}$ (M is written as M-bar in Equation 4), is therefore expressed by the following equation.
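As an illustration of the Bernoulli dropout described above, the following is a minimal sketch in Python with NumPy. It is not taken from the specification: the function name `bernoulli_dropout` and the choice to return the count of surviving elements are assumptions made here for illustration only.

```python
import numpy as np

def bernoulli_dropout(x, p_dropout, rng):
    """Zero each element of x independently with probability p_dropout."""
    if not 0.0 < p_dropout < 1.0:
        raise ValueError("p_dropout must lie strictly between 0 and 1")
    # keep_mask[j] is True with probability 1 - p_dropout.
    keep_mask = rng.random(x.shape) >= p_dropout
    # Return the masked vector and the number M of surviving elements.
    return x * keep_mask, int(keep_mask.sum())

rng = np.random.default_rng(0)
x = np.ones(1000)
dropped, m_kept = bernoulli_dropout(x, p_dropout=0.3, rng=rng)
```

With a vector of ones, the number of surviving elements fluctuates around (1 - 0.3) × 1000 from one draw to the next, which is the randomness that the DF-layer calculation has to account for.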

The FC layer, on the other hand, holds a weight matrix $W_{i,j}$ of dimension $n_{out}^{l} \times n_{in}^{l}$ and a bias vector $b_i$ of dimension $n_{out}^{l}$; it multiplies the input data by the weights and adds the bias term.

The following describes what Gaussian distribution, in the form of Equation 2, is output when the Gaussian distribution of Equation 1 is input to the DF layer, that is, how the Gaussian distribution parameters (mean, variance, and covariance) are transformed as shown in Equation 5 below.

出力される平均μ_out ^DFは以下のように計算される。 The output average μ _out ^DF is calculated as follows.

出力される分散Ｖａｒ_out ^DFは以下のように計算される。 The output variance Var _out ^DF is calculated as follows.

出力される共分散は以下のように計算される。 The output covariance is computed as follows.

数式１１の後半の２項は、前述の数式から計算できる。また、数式１１の最初の項は以下のように計算できる。 The latter two terms of Equation 11 can be calculated from the above equations. Also, the first term of Equation 11 can be calculated as follows.
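The equations of the specification are not reproduced in this text. As a reading aid only, the sketch below computes the first two moments of a dropout layer followed by an FC layer using textbook moment algebra. It assumes an independent Bernoulli keep-mask with keep probability 1 - p_dropout and no rescaling of the surviving values. The function name `df_forward_moments` is hypothetical, and the result is a moment-matched Gaussian, not necessarily the same expressions as Equations 6 to 13.

```python
import numpy as np

def df_forward_moments(mu_in, cov_in, W, b, p_dropout):
    """First two moments of y = W (z * x) + b, with x ~ N(mu_in, cov_in)
    and z an independent Bernoulli keep-mask (keep probability 1 - p_dropout)."""
    q = 1.0 - p_dropout
    # Moments of the masked input z * x:
    #   E[z_j x_j] = q mu_j
    #   Cov[z_i x_i, z_j x_j] = q^2 Sigma_ij                  (i != j)
    #   Var[z_j x_j] = q (Sigma_jj + mu_j^2) - q^2 mu_j^2
    mu_masked = q * mu_in
    second_moment = np.diag(cov_in) + mu_in ** 2
    cov_masked = q ** 2 * cov_in + np.diag(q * (1.0 - q) * second_moment)
    # An affine map sends (mean, covariance) to (W mean + b, W cov W^T).
    mu_out = W @ mu_masked + b
    cov_out = W @ cov_masked @ W.T
    return mu_out, cov_out

mu = np.array([2.0, 0.0])
cov = np.diag([1.0, 4.0])
mu_out, cov_out = df_forward_moments(mu, cov, np.eye(2), np.zeros(2), p_dropout=0.5)
```

Even when the input covariance is zero, the dropout mask contributes a variance term proportional to the squared input mean, so the output distribution never collapses to a point.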

(Sigmoid layer: calculation during propagation)
Next, the case where the layer l is a nonlinear layer, in particular a sigmoid layer, is described. Here, as an approximation, covariance is ignored: a multivariate Gaussian distribution $X_{in}$ parameterized by a single-mode mean vector $\mu_{in}$ and variance vector $Var_{in}$ is input to the sigmoid layer, and the output $X_{out}$ is approximated as a multivariate Gaussian distribution parameterized by a single-mode mean vector $\mu_{out}$ and variance vector $Var_{out}$. In this case, the distributions are defined as follows.

この場合、入力Ｘ_inのシングルモードの多変量ガウス分布のパラメータを使って、出力Ｘ_outのシングルモードの多変量ガウス分布のパラメータをどのように計算するのかについて説明する。 In this case, we describe how the parameters of the single-mode multivariate Gaussian distribution of the input X _in are used to compute the parameters of the single-mode multivariate Gaussian distribution of the output X _out .

出力の平均値は以下のように計算される。 The average value of the output is calculated as follows.

出力の分散値は以下のように計算される。 The variance of the output is calculated as follows.
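The closed-form expressions used in the specification for the sigmoid layer are not reproduced in this text. One way to obtain the output mean and variance numerically is Gauss-Hermite quadrature, sketched below. This is a substitute technique chosen here for illustration, not the method of the specification; the function name `sigmoid_moments` and the quadrature order are assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_moments(mu_in, var_in, order=40):
    """Element-wise mean and variance of sigmoid(X) for X ~ N(mu_in, var_in)."""
    nodes, weights = np.polynomial.hermite.hermgauss(order)
    # The change of variables x = mu + sqrt(2 var) t turns the Gaussian
    # expectation into an integral against the weight exp(-t^2).
    x = mu_in[:, None] + np.sqrt(2.0 * var_in)[:, None] * nodes[None, :]
    s = sigmoid(x)
    mean = (s * weights).sum(axis=1) / np.sqrt(np.pi)
    second = (s ** 2 * weights).sum(axis=1) / np.sqrt(np.pi)
    return mean, second - mean ** 2

mean_out, var_out = sigmoid_moments(np.array([0.0, 1.0]), np.array([1.0, 1e-12]))
```

The second input has a negligible variance, so its output mean is close to the ordinary sigmoid of 1 and its output variance is close to zero, which matches the point-data limit.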

(Final output layer: calculation during propagation)
The case where the layer l is the final output layer is described. Let π denote the function that maps input to output in the final output layer.

In the case of a regression problem, there is no special layer for producing the final output value. As shown in Equations 22 and 23 below, the FC layer described above (a layer computed with the weight W and bias term b) or a nonlinear layer (a layer computed with an arbitrary nonlinear function Nonlinear) serves directly as the final output layer, and the output value of this layer is the estimate.

一方、分類問題の場合には、最終出力層はソフトマックス層であり（ｌ＝Ｓｏｆｔｍａｘ）、関数πはソフトマックス関数で表される。 On the other hand, for the classification problem, the final output layer is the softmax layer (l=Softmax) and the function π is represented by the softmax function.

Conventionally, the input and output are point data: the input is a K-dimensional vector X, the output is also a K-dimensional vector π(X), and the output is calculated as in Equation 24 below. The number of dimensions K equals the number of classes $c_{max}$.
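For reference, the conventional point-data softmax can be sketched as follows. This is the standard definition, with the usual subtraction of the maximum for numerical stability; the function and variable names are chosen here for illustration.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    shifted = x - np.max(x, axis=-1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=-1, keepdims=True)

scores = np.array([2.0, 1.0, 0.1])   # a K = 3 dimensional point input
probs = softmax(scores)               # a K = 3 dimensional point output
```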

In the present invention, by contrast, the input and output are Gaussian distributions rather than point data, and the calculation approximates the output for a single Gaussian input as a single Gaussian distribution as well. In particular, the forward computation calculates only the mean of the Gaussian distribution output from the softmax layer. Because this layer is the final output layer, its output is used only for comparison with the correct value during learning, or as the estimate during estimation, so the variance and covariance values are not needed. According to Non-Patent Document 4, the mean of the Gaussian distribution output from the softmax layer can be calculated as follows.

(Calculation during backpropagation)
Next, the calculation during backpropagation of the input data is described. Working backward from the final output layer, the following explains how the parameters of the Gaussian distributions taken by the neuron values are back-propagated, and how the Gaussian distribution parameters of the resulting input data at the input layer are calculated.

Non-Patent Document 2 describes a method of obtaining the error in the values of input data by back-propagating point data. Analogously, the present invention back-propagates the errors of the parameters of the Gaussian distribution that the input data can take, that is, the error δμ of the mean, the error δVar of the variance, and the error δCov of the covariance.

図１（ｂ）に示すように、ある層ｌにおいて、その出力ｏｕｔのガウス分布のパラメータの誤差から、入力ｉｎのガウス分布のパラメータの誤差を逆向きに計算する。 As shown in FIG. 1(b), in a certain layer l, the parameter error of the Gaussian distribution of the input in is calculated inversely from the parameter error of the Gaussian distribution of the output out.

この入力ｉｎのガウス分布のパラメータの誤差が求められれば、従来の学習法のように、入力ｉｎのガウス分布のパラメータを、設計者が定義した学習率ＬｅａｒｎｉｇＲａｔｅの値で更新することができる。 Once the error in the parameters of the Gaussian distribution of input in is determined, the parameters of the Gaussian distribution of input in can be updated with the value of the designer-defined learning rate LearningRate, as in conventional learning methods.

以下に、後方の層（出力側）の誤差δ_outを前方の層（入力側）の誤差δ_inに逆伝搬させる計算について示す。 The calculation for back-propagating the error δ _out of the rear layer (output side) to the error δ _in of the front layer (input side) will be described below.

上の式を解くためには、誤差δのほかに、それぞれのパラメータ間の微分、すなわち、ヤコビアンＪ^lを計算する必要がある。ある層ｌにおけるヤコビアンを以下の数式３３に示す。 In order to solve the above equation, besides the error δ, it is necessary to calculate the derivative between each parameter, ie the Jacobian J ^l . The Jacobian at some layer l is shown in Equation 33 below.
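As a reading aid, the relationship between the errors and the Jacobian can be written with the usual chain rule. The block below is a sketch in generic notation and is not a reproduction of the equations of the specification; the stacked parameter vector $\theta$ and the cost symbol $C$ are notation introduced here.

```latex
\theta^{l} =
\begin{pmatrix} \mu^{l} \\ \mathrm{Var}^{l} \\ \mathrm{Cov}^{l} \end{pmatrix},
\qquad
J^{l} = \frac{\partial \theta^{l}_{out}}{\partial \theta^{l}_{in}},
\qquad
\delta_{in} = \frac{\partial C}{\partial \theta^{l}_{in}}
            = \left( J^{l} \right)^{\top} \delta_{out},
\quad
\delta_{out} = \frac{\partial C}{\partial \theta^{l}_{out}} .
```

Under this reading, each block of $J^{l}$ is one of the partial derivatives between output and input parameters (mean, variance, covariance) that the following subsections calculate layer by layer.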

以下、ニューラルネットワークに含まれる各層ｌの種類別に、ヤコビアンを構成する微分計算方法について説明する。 A differential calculation method for constructing the Jacobian will be described below for each type of each layer l included in the neural network.

(DF layer: calculation during backpropagation)
The derivative calculations in the DF layer, which consists of the dropout layer and FC layer described above, are explained here. By introducing dropout, the present invention makes the backpropagation calculation possible even for the propagation of a single-mode Gaussian distribution.

出力平均値の入力平均値による微分は以下のようになる。 The derivative of the average output value with respect to the average input value is as follows.

出力平均値の入力分散値による微分は以下のようになる。 The derivative of the output mean value with respect to the input variance value is as follows.

出力平均値の入力共分散値による微分は以下のようになる。 The derivative of the output mean value with respect to the input covariance value is as follows.

出力分散値の入力平均値による微分は以下のようになる。 Differentiation of the output variance value with respect to the input mean value is as follows.

上記の数式３７のＶａｒ（ＬｉｓｔＷμ_in ^DF _i）の部分は、以下のように表される。 The Var(ListWμ _in ^DF _i ) portion of Equation 37 above is expressed as follows.

この微分は以下のように計算することができる。 This derivative can be calculated as follows.

The second and third terms of Equation 37 above do not depend on the input mean and are therefore zero, which finally gives the following equation.

出力の分散値の入力の分散値による微分は以下のようになる。 The derivative of the variance of the output with respect to the variance of the input is as follows.

出力の分散値の入力の共分散値による微分は以下のようになる。 The derivative of the variance of the output with respect to the covariance of the input is as follows.

出力の共分散値の入力の平均値による微分は以下のようになる。 The derivative of the covariance value of the output with respect to the average value of the input is as follows.

出力の共分散値の入力の分散値による微分は以下のようになる。 The derivative of the covariance value of the output with respect to the variance value of the input is as follows.

出力の共分散値の入力の共分散値による微分は以下のようになる。 The derivative of the output covariance value with respect to the input covariance value is:
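The derivative expressions of the specification are not reproduced in this text. The sketch below illustrates the same kind of calculation for a simplified DF layer in which covariances are ignored and the dropout mask is an independent Bernoulli keep-mask with keep probability q. Both the simplification and the names `df_forward_diag` and `df_backward_diag` are assumptions made here; the derivatives shown follow from the simplified forward map and are not claimed to match the numbered equations of the specification.

```python
import numpy as np

def df_forward_diag(mu_in, var_in, W, b, q):
    """Mean and variance of y = W (z * x) + b for independent inputs."""
    mu_out = q * (W @ mu_in) + b
    var_out = (W ** 2) @ (q * var_in + q * (1.0 - q) * mu_in ** 2)
    return mu_out, var_out

def df_backward_diag(d_mu_out, d_var_out, mu_in, W, q):
    """Back-propagate output errors to input errors with the transposed Jacobian."""
    # d mu_out / d mu_in  = q W               d mu_out / d var_in  = 0
    # d var_out / d mu_in = 2 q (1 - q) W^2 mu_in
    # d var_out / d var_in = q W^2
    W2 = W ** 2
    d_mu_in = q * (W.T @ d_mu_out) + 2.0 * q * (1.0 - q) * mu_in * (W2.T @ d_var_out)
    d_var_in = q * (W2.T @ d_var_out)
    return d_mu_in, d_var_in
```

In this simplified map the output variance depends on the input mean only through the dropout term q(1 - q); with q = 1 (no dropout) that dependence vanishes.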

(Sigmoid layer: calculation during backpropagation)
Next, the case of a nonlinear layer, in particular a sigmoid layer, is described.

The derivative of the output variance with respect to the input variance is as follows.

(Final output layer: calculation during backpropagation)
The derivative of the function π (see Equation 21) for the case where the layer l of the neural network is the final output layer is described.

In the case of a regression problem, the derivatives of the mean and variance-covariance values of the Gaussian distribution in the FC layer and nonlinear layer described above are used for backpropagation.

一方、分類問題の場合には、最終出力層はソフトマックス層である。以下、ソフトマックス層の入力がガウス分布の場合の微分、すなわちヤコビアンの計算方法について説明する。最終出力層からの出力は正解ラベルであるスカラー値との差分によりコスト関数を計算するために用いられる。したがって、分散値や共分散の微分には依存せず、最終出力層からの出力の分布の平均値のみに関する微分だけを考慮すればよい。 On the other hand, for classification problems, the final output layer is the softmax layer. A method of calculating the differentiation, that is, the Jacobian, when the input of the softmax layer is a Gaussian distribution will be described below. The output from the final output layer is used to calculate the cost function by the difference from the correct label scalar value. Therefore, it is only necessary to consider only the derivative of the average value of the distribution of the output from the final output layer without depending on the derivative of the variance or the covariance.

The above has described, for a neural network of arbitrary structure, how to calculate the elements of the Jacobian of the parameters of a single Gaussian distribution for the DF layer, the sigmoid layer, and the softmax layer, that is, how to calculate the small errors of the Gaussian distribution parameters in backpropagation.

(Cost function)
Finally, the starting point of backpropagation, that is, how the cost function is set at the final output layer, is described.

ｎ個のニューロンを持つｎ次元の最終出力層では、最終出力層からの出力の分布の平均値のみを考慮する。最終出力層からの分布の平均値と学習正解データの所望の値とを比較し、その差分をコスト関数として設定する。以下、二乗誤差によりコスト関数を設定した場合を示すが、当該差分を表すことができる計算であれば、二乗誤差に限定されるものではない。 For an n-dimensional final output layer with n neurons, we only consider the average value of the distribution of outputs from the final output layer. The average value of the distribution from the final output layer is compared with the desired value of the learned correct data, and the difference is set as the cost function. Although the case where the cost function is set by the squared error will be described below, the calculation is not limited to the squared error as long as the calculation can express the difference.

回帰問題の場合には、以下のように出力値がスカラーの場合にはｎ＝１となり、正解値ｙとの差分からコスト関数を定義する。 In the case of a regression problem, n=1 when the output value is a scalar as shown below, and the cost function is defined from the difference from the correct value y.

In the case of a classification problem with a total of $c_{max}$ classes, the cost function is defined as follows from the difference between the one-hot vector h of the correct label and the output vector π(X) of the softmax layer.

このコスト関数の値を小さくするために、コスト関数の微分をとることで、入力のガウス分布のパラメータの平均値の微小誤差δμ_inと分散値の微小誤差δＶａｒ_inを以下のように計算できる。 In order to reduce the value of this cost function, the cost function is differentiated, and the minute error δμ _in of the mean value of the parameters of the input Gaussian distribution and the minute error δVar _in of the variance value can be calculated as follows.
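A squared-error cost on the output mean, together with its derivative with respect to that mean, can be sketched as follows. The factor 0.5 and the function name `squared_error_cost` are assumptions made here; the exact normalisation used in the specification is not visible in this text.

```python
import numpy as np

def squared_error_cost(mean_out, target):
    """Return 0.5 * ||mean_out - target||^2 and its gradient w.r.t. mean_out."""
    diff = mean_out - target
    return 0.5 * float(diff @ diff), diff

c_max = 10
h = np.zeros(c_max)
h[3] = 1.0                          # one-hot vector for correct class 3
pi_mean = np.full(c_max, 0.1)       # an uninformative softmax output mean
cost, d_mean = squared_error_cost(pi_mean, h)
```

The gradient `d_mean` is the error at the final output layer from which backpropagation of the Gaussian distribution parameters starts.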

Using the Jacobian described above, the small error δ in each layer can be calculated in turn by backpropagation, and by going back as far as the input layer, the parameters of the Gaussian distribution at the input layer and at each layer can be calculated.

As regularization, adding an arbitrary regularization term R to the cost function (see Equations 52 and 53) makes it possible to control the solution found by the optimization. For example, to keep the mean $\mu^{l}$ of the input data distribution at a layer l between 0 and 1, a regularization term R such as the following may be added to the cost function, which depends on $\mu^{l}$.

正則項Ｒを設けた場合には、正則項Ｒのμ及びＶａｒに関する微分をそれぞれ計算し、それぞれの計算結果を数式５４及び数式５５のそれぞれに追加する必要がある。 When the regular term R is provided, it is necessary to calculate the differential of the regular term R with respect to μ and Var, and add the respective calculation results to Equations 54 and 55, respectively.
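The concrete form of R used in the specification is not reproduced in this text. One possible regularization term that keeps a mean between 0 and 1 is a quadratic penalty that is zero inside the interval, sketched below. The penalty form, the weight `lam`, and the function name `box_penalty` are all assumptions made here for illustration.

```python
import numpy as np

def box_penalty(mu, lam=1.0):
    """Quadratic penalty that is zero for 0 <= mu <= 1 and grows outside."""
    below = np.minimum(mu, 0.0)         # negative part, zero inside the interval
    above = np.maximum(mu - 1.0, 0.0)   # excess over 1, zero inside the interval
    value = lam * float(np.sum(below ** 2 + above ** 2))
    grad_mu = 2.0 * lam * (below + above)
    return value, grad_mu

r_value, r_grad = box_penalty(np.array([-0.5, 0.3, 1.5]))
```

This particular penalty depends only on the mean, so its derivative with respect to the variance is zero and only the mean update receives an extra term.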

以上、本明細書で提案される計算方法であって、バックプロパゲーションさせることで所望の出力を最大に出すための各層における分布パラメータを求める計算方法について説明した。 The above is the calculation method proposed in the present specification for obtaining the distribution parameter in each layer for maximizing the desired output through back propagation.

In this specification, the distribution given by these optimized distribution parameters is called the "feature quantity distribution". The feature quantity distribution can be calculated for the input data values at the neurons of any layer of the neural network. Let x be the input data value of the neurons in a layer l; the feature Gaussian distribution function $P_l(x)$ that this value should follow is defined by the following equation, using the Gaussian distribution parameters, namely the mean vector μ and the variance-covariance matrix Σ.

回帰問題の場合には、ある値ｙが出力されるための特徴ガウス分布Ｐ_l,y（ｘ）は以下のように表される。ある値ｙは、ユーザが有限又は無限の範囲から選んだ値である。 In the case of a regression problem, the feature Gaussian distribution P _l,y (x) for outputting a certain value y is expressed as follows. A value y is a value chosen by the user from a finite or infinite range.

分類問題の場合には、あるクラスｃに分類されるための特徴ガウス分布Ｐ_l,c（ｘ）は以下のように表される。あるクラスｃは総クラスの数ｃ_max分だけ存在する。 In the case of a classification problem, the feature Gaussian distribution P _l,c (x) for classification into a certain class c is expressed as follows. There are as many classes c as the total number of classes c _max .

ここで注目すべきは、前述の数式５８や数式５９において、分母は入力データｘに依存せず、全てのニューロンに共通のあるスカラー値、すなわち定数であるということである。 It should be noted here that the denominator in Equations 58 and 59 is a scalar value common to all neurons, that is, a constant, independent of the input data x.

ニューラルネットワークの試験者は、各ニューロンにおける特徴量分布Ｐ_l（ｘ）の幅である分散値から可視化されるニューロンの重要度を評価することができる。数式５８や数式５９のある層ｌにおけるｎ次元の特徴量分布の分散共分散行列Σの分散値は、ｎ個のニューロンのそれぞれの分散値を表している。分散値が大きいということは、どの値でもよいということであり、したがって、そのニューロンは最終出力決定に及ぼす影響が低いと考えることができる。 A neural network tester can evaluate the importance of a visualized neuron from the variance, which is the width of the feature distribution P _l (x) in each neuron. The variance value of the variance-covariance matrix Σ of the n-dimensional feature quantity distribution in a certain layer l in Equations 58 and 59 represents the variance value of each of n neurons. A large variance value means that any value is acceptable, and therefore the neuron can be considered to have a low influence on the final output decision.

Conversely, if the variance of a neuron is small, the value of that neuron must stay within that narrow range to produce the desired final result, which shows that the neuron has an important influence on the estimation decision. In terms of accountability for the estimation results of a neural network, the magnitudes of the variances therefore indicate which neurons play an important role in the estimation result. It also becomes possible to simplify the computation, for example by omitting calculations for neurons judged to be unimportant.
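The importance reading described above can be sketched by ordering the neurons of a layer by the diagonal of the variance-covariance matrix. The function name `rank_neurons_by_variance` and the example matrix are hypothetical.

```python
import numpy as np

def rank_neurons_by_variance(cov):
    """Return neuron indices ordered from smallest to largest variance."""
    variances = np.diag(cov)
    return np.argsort(variances), variances

cov = np.diag([4.0, 0.01, 1.0])     # illustrative optimized covariance
order, variances = rank_neurons_by_variance(cov)
```

Neurons early in `order` have small variance and would be read as important; neurons late in `order` are candidates for omission when simplifying the computation.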

(First embodiment)
The processing for calculating the feature quantity distribution of data in the first embodiment of the present invention is described. FIG. 3 is a flowchart showing an example of this processing.

図３において、まず、従来と同様に点データである学習データを用いて、ニューラルネットワークでドロップアウトを与えて学習させ、重みとバイアス項を最適化する（ステップＳ１０１）。なお、ドロップアウトを与えるとは、ニューラルネットワーク内のドロップアウト層において確率ｐ_dropout ^DFを設定し、ドロップアウト層を通過するデータを一定の確率ｐ_dropout ^DFでゼロにすることを意味する。ステップＳ１０１における学習が完了すると、この学習により決定された重みやバイアス項は固定値として扱われる。 In FIG. 3, first, learning data, which is point data, is used as in the conventional art, and dropouts are given to the neural network for learning, thereby optimizing weights and bias terms (step S101). Note that providing a dropout means setting a probability p _dropout ^DF in the dropout layer in the neural network and setting the data passing through the dropout layer to zero with a certain probability p _dropout ^DF . When the learning in step S101 is completed, the weights and bias terms determined by this learning are treated as fixed values.

次に、入力データを最適化する処理を行う。入力データを最適化する処理では、学習済みの重みとバイアス項を固定したまま、入力データに対する伝搬と、伝搬から得られた計算結果（ガウス分布のパラメータ）の逆伝搬とが繰り返し実行され、入力データに係るガウス分布のパラメータを最適化する処理、すなわち、どのようなガウス分布に基づく入力データが入力されればよいのかを最適化する処理が行われる。 Next, a process of optimizing the input data is performed. In the process of optimizing the input data, propagation of the input data and back propagation of the calculation results (Gaussian distribution parameters) obtained from the propagation are repeatedly executed while the learned weights and bias terms are fixed. A process of optimizing the parameters of the Gaussian distribution related to the data, that is, a process of optimizing the input data based on what kind of Gaussian distribution should be input is performed.

まず、ニューラルネットワークで伝搬される入力データを多変量ガウス分布として、この入力データの多変量ガウス分布のパラメータ（平均ベクトルと分散共分散行列）の初期値を設定する（ステップＳ１０２）。この初期設定では、例えば乱数などの任意の値が初期値として設定される。平均値については、例えば入力が画像の場合には、ピクセルの画素の値がとり得る範囲の中心値を初期値として設定してもよい。また、分散値については、例えば１以上の適切な乱数としてもよい。さらに、共分散値はゼロとしてもよい。ただし、分散共分散値は正定値行列とする必要がある。 First, the input data propagated by the neural network is assumed to have a multivariate Gaussian distribution, and the initial values of the parameters (mean vector and variance-covariance matrix) of the multivariate Gaussian distribution of this input data are set (step S102). In this initial setting, an arbitrary value such as a random number is set as an initial value. For the average value, for example, when the input is an image, the initial value may be set to the central value of the range of possible pixel values. Also, the variance value may be, for example, an appropriate random number of 1 or more. Additionally, the covariance value may be zero. However, the variance-covariance value must be a positive definite matrix.
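The initialization in step S102 can be sketched as follows. The pixel range 0 to 255, the variance range, and the function name `init_gaussian_params` are assumptions made here; the text only requires a mean at the centre of the value range, variances of 1 or more, zero covariances, and a positive definite matrix.

```python
import numpy as np

def init_gaussian_params(n, low, high, rng):
    """Initial mean vector and variance-covariance matrix for n inputs."""
    mu = np.full(n, 0.5 * (low + high))   # centre of the possible value range
    var = 1.0 + rng.random(n)             # random variances in [1, 2)
    cov = np.diag(var)                    # all covariances set to zero
    np.linalg.cholesky(cov)               # raises LinAlgError if not positive definite
    return mu, cov

mu0, cov0 = init_gaussian_params(784, 0.0, 255.0, np.random.default_rng(0))
```

A diagonal matrix with strictly positive entries is always positive definite, so the Cholesky call here only serves as a guard in case the initialization is later changed to include non-zero covariances.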

Next, the parameters of the Gaussian distribution of the input data are propagated with dropout from the input layer to the final output layer, the cost function is calculated from the result obtained at the final output layer (step S103), and it is determined whether the value of the cost function is sufficiently close to zero (step S104).

なお、コスト関数の値が十分にゼロに近づいたかどうかの判断は、例えば、ユーザがあらかじめ定めた閾値（ゼロに近い値）とコスト関数の値とを比較することにより行われる。コスト関数の値と所定の閾値とを比較して、例えばコスト関数の値が所定の閾値以下となった場合には、コスト関数の値が十分にゼロに近づいたと判断することができる。コスト関数の値が十分にゼロに近づいた状態は、回帰問題の場合にはユーザが選んだある望みの値を最大限出力する状態、分類問題の場合にはユーザが選んだある望みのクラスのソフトマックスの値を最大限出力する状態となったことを意味する。 Whether or not the value of the cost function has sufficiently approached zero is determined, for example, by comparing the value of the cost function with a threshold (a value close to zero) predetermined by the user. By comparing the value of the cost function with a predetermined threshold, for example, when the value of the cost function is equal to or less than the predetermined threshold, it can be determined that the value of the cost function has sufficiently approached zero. The state in which the value of the cost function is sufficiently close to zero is the state in which a user-selected desired value is maximally output in the case of a regression problem, and the state in which a user-selected desired class is output in the case of a classification problem. It means that the maximum softmax value is output.

コスト関数の値が十分にゼロに近づいたと判断されなかった場合には（ステップＳ１０４で「いいえ」）、コスト関数からガウス分布のパラメータの微小誤差を計算する（ステップＳ１０５）。具体的には、上述した数式５４及び数式５５に示すように、コスト関数の微分を計算することで、微小誤差δを計算することができる。 If it is not determined that the value of the cost function is sufficiently close to zero ("No" in step S104), the minute error of the Gaussian distribution parameter is calculated from the cost function (step S105). Specifically, as shown in Equations 54 and 55 described above, the minute error δ can be calculated by calculating the differentiation of the cost function.

次に、例えば前述したヤコビアンを使うことで、微小誤差をドロップアウトを与えて逆伝搬させて、各層でのガウス分布のパラメータの微小誤差を順々に計算する（ステップＳ１０６）。その結果、入力層まで遡って微小誤差を計算することができる。そして、入力データのガウス分布のパラメータの微小誤差を用いて、各層でのガウス分布のパラメータを更新する（ステップＳ１０７）。このとき、パラメータは、例えば微小誤差に学習率ＬｅａｒｎｉｎｇＲａｔｅを掛けた分だけ更新される。 Next, for example, by using the above-mentioned Jacobian, the minute errors are back-propagated with dropouts, and the minute errors of the parameters of the Gaussian distribution in each layer are sequentially calculated (step S106). As a result, it is possible to trace back to the input layer and calculate the minute error. Then, the Gaussian distribution parameters in each layer are updated using the minute error of the Gaussian distribution parameters of the input data (step S107). At this time, the parameter is updated by, for example, the minute error multiplied by the learning rate LearningRate.

When the parameters of the Gaussian distribution of the input data have been updated in step S107, the process returns to step S103, the updated parameters are propagated with dropout from the input layer to the final output layer, and the processing from step S103 onward is repeated. When it is determined that the value of the cost function is sufficiently close to zero ("Yes" in step S104), the optimization of the Gaussian distribution parameters of the input data ends.

The Gaussian distribution parameters of the input data at the end of the optimization are the optimized parameters: when input data sampled from this Gaussian distribution is input, the neural network can output the desired result (the desired output value for a regression problem, or the softmax value of the desired class for a classification problem). Here, the Gaussian distribution of the input data means not only the data distribution at the input layer, the first layer of the neural network, but also the data distribution at each neuron of each layer. In other words, the processing shown in FIG. 3 leaves the Gaussian distribution parameters of every layer in an optimized state.
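The loop of steps S102 to S107 can be illustrated on a deliberately small case: a single DF layer with fixed weights and a scalar regression target, where only the mean is propagated. This toy setting, the weight values, and the function name `optimise_input_mean` are assumptions made here; the sketch shows the control flow of the flowchart, not the full network of FIG. 2.

```python
import numpy as np

def optimise_input_mean(w, b, y, p_dropout, learning_rate=0.5,
                        threshold=1e-8, max_updates=1000):
    """Gradient descent on the input mean of one DF layer with fixed weights."""
    q = 1.0 - p_dropout                         # keep probability of the dropout mask
    mu = np.zeros_like(w)                       # S102: initial value of the mean
    for n_updates in range(max_updates):
        mu_out = q * (w @ mu) + b               # S103: propagate the mean forward
        cost = 0.5 * (mu_out - y) ** 2          # S103: squared-error cost
        if cost <= threshold:                   # S104: close enough to zero?
            return mu, cost, n_updates
        d_mu_out = mu_out - y                   # S105: error at the output
        d_mu_in = q * w * d_mu_out              # S106: back-propagate the error
        mu = mu - learning_rate * d_mu_in       # S107: update with the learning rate
    return mu, cost, max_updates

mu_opt, final_cost, n = optimise_input_mean(
    w=np.array([0.5, -1.0, 2.0]), b=0.1, y=1.0, p_dropout=0.5)
```

In this toy case the cost depends only on the output mean, so the input variance receives no gradient; in the full method the variance and covariance are updated in the same loop through their own Jacobian blocks.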

Next, the configuration of the information estimation device according to the first embodiment of the present invention is described. FIG. 4 is a block diagram showing an example of this configuration. The information estimation device 1 shown in FIG. 4 has a weight learning unit 10, a data feature quantity distribution calculation unit 20, a posterior probability distribution calculation unit 30, and a posterior probability value estimation unit 40. FIG. 4 also shows constituent elements that are used in the second and third embodiments of the present invention, described later.

The block diagram shown in FIG. 4 merely represents the functions related to the present invention; an actual implementation may be realized in hardware, software, firmware, or any combination thereof. Functions implemented in software are stored as one or more instructions or code on any computer-readable medium, and these instructions or code can be executed by a hardware-based processing unit such as a CPU (Central Processing Unit). The functions related to the present invention may also be realized by various devices, including an IC (Integrated Circuit) or an IC chipset.

The first embodiment of the present invention uses the weight learning unit 10 and the data feature quantity distribution calculation unit 20, which together execute the processing of the flowchart shown in FIG. 3 described above.

重み学習部１０は、固定値のベクトルからなる学習データを用いて、データの一部を欠損させるドロップアウトを行うドロップアウト層を備えた前記ニューラルネットワークにおける重みを学習する機能を有する。 The weight learning unit 10 has a function of learning weights in the above-described neural network having a dropout layer that performs dropout to drop out part of data using learning data consisting of vectors of fixed values.

データ特徴量分布計算部２０は、伝搬計算部２１、コスト関数計算部２２、分布パラメータ更新部２３、最適化パラメータ出力部２４を含む。 The data feature quantity distribution calculator 20 includes a propagation calculator 21 , a cost function calculator 22 , a distribution parameter updater 23 and an optimization parameter outputter 24 .

The propagation calculation unit 21 treats the data propagated through each layer of the neural network as a random variable with a multivariate Gaussian distribution defined by distribution parameters that include the mean, variance, and covariance values, and has a function of performing the propagation calculation on the input data with dropout applied in the neural network.

コスト関数計算部２２は、ニューラルネットワークからの出力データと正解データとの差分を表すコスト関数を計算する機能を有する。 The cost function calculator 22 has a function of calculating a cost function representing the difference between the output data from the neural network and the correct data.

分布パラメータ更新部２３は、コスト関数を微分して得られる微小誤差に対してニューラルネットワークでドロップアウトを与えて逆伝搬計算を行い、ニューラルネットワークの各層における分布パラメータを更新する機能を有する。 The distribution parameter updating unit 23 has a function of giving dropouts to minute errors obtained by differentiating the cost function in the neural network, performing back propagation calculation, and updating the distribution parameters in each layer of the neural network.

最適化パラメータ出力部２４は、コスト関数が所定の閾値以下となった場合における分布パラメータを、最適化された分布パラメータとして出力する機能を有する。 The optimization parameter output unit 24 has a function of outputting a distribution parameter when the cost function is equal to or less than a predetermined threshold as an optimized distribution parameter.

以下、本発明の第１の実施の形態に関連する実験について説明する。この実験では、図２に示すニューラルネットワークが構築され、手書き数字データＭＮＩＳＴが使用されている。図２のニューラルネットワークは、ＦＣ層、ドロップアウト層、シグモイド層、ソフトマックス層を含んで構築されている。入力データ（入力画像）は、２８×２８＝７８４ピクセルの画像であり、７８４次元のベクトルデータである。また、最終出力層のソフトマックス層からは、０～９のいずれかの数字を示すスコアが推定結果として出力される。各層に付記されている数字は各層の次元数であり、例えば次元数が変換される場合には「７８４→１００」などのように、変換前及び変換後の次元数が記載されている。 Experiments related to the first embodiment of the present invention will be described below. In this experiment, the neural network shown in FIG. 2 was constructed and handwritten digit data MNIST was used. The neural network of FIG. 2 is constructed including an FC layer, a dropout layer, a sigmoid layer, and a softmax layer. Input data (input image) is an image of 28×28=784 pixels and is vector data of 784 dimensions. Also, a score representing any number from 0 to 9 is output as an estimation result from the softmax layer of the final output layer. The number attached to each layer is the number of dimensions of each layer. For example, when the number of dimensions is converted, the numbers of dimensions before and after conversion are described, such as "784→100".

この実験では、ニューラルネットワークにおいて、６万枚の手書き数字データＭＮＩＳＴの画像に関してドロップアウトを与えて学習させて、重みを計算している。なお、ＭＮＩＳＴの入力画像は２８×２８＝７８４ピクセルの画像であり、１枚の画像が７８４次元のベクトルである１つの点データに相当する。 In this experiment, in the neural network, 60,000 handwritten digit data MNIST images are given dropouts for learning, and weights are calculated. An input image of MNIST is an image of 28×28=784 pixels, and one image corresponds to one point data that is a 784-dimensional vector.

学習が完了すると、重みやバイアス項を固定値とし、今度は入力データをある未知のガウス分布からサンプリングされる確率変数として、ドロップアウトを与えて伝搬させる。なお、初期段階においてガウス分布のパラメータ（すなわち、平均ベクトルμ、分散共分散行列Σ）が未知の場合には、適当な初期値が設定される。 After learning is completed, the weights and bias terms are set to fixed values, and the input data are now propagated as random variables sampled from some unknown Gaussian distribution with dropout. If the parameters of the Gaussian distribution (that is, mean vector μ, variance-covariance matrix Σ) are unknown at the initial stage, appropriate initial values are set.

At the softmax layer, which is the final output layer, a small error in the parameters of the Gaussian distribution is back-propagated so that the output matches the correct label, and the parameters are updated. During this process, dropout is always applied in the DF layers. The parameters of the Gaussian distribution are updated by carrying out the propagation and back-propagation calculations in this way.
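The update loop described above can be illustrated with a deliberately small stand-in. This is a sketch under stated assumptions, not the procedure of the invention: it replaces the analytic propagation of the distribution with a single-sample estimate (x = μ + σ·ε), restricts Σ to a diagonal, and uses one hypothetical FC + softmax layer with random fixed weights `W` and `b` in place of the trained network of FIG. 2. The names `keep`, `lr`, and `update_step` are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_cls, target = 20, 10, 9

# Hypothetical fixed weights standing in for a trained network (assumption).
W = rng.normal(size=(n_cls, n_in))
b = np.zeros(n_cls)

mu = np.zeros(n_in)         # initial mean vector of the input Gaussian
log_sigma = np.zeros(n_in)  # log std-dev; diagonal covariance only (assumption)
keep, lr = 0.8, 0.05        # dropout keep probability and step size (assumptions)


def update_step(mu, log_sigma):
    """One forward/backward pass that updates the Gaussian parameters."""
    eps = rng.normal(size=n_in)
    sigma = np.exp(log_sigma)
    x = mu + sigma * eps                      # sample from N(mu, diag(sigma^2))
    mask = (rng.random(n_in) < keep) / keep   # dropout is applied on every pass
    z = W @ (mask * x) + b
    z = z - z.max()
    log_p = z - np.log(np.exp(z).sum())       # log-softmax
    loss = -log_p[target]                     # cross-entropy against the target label
    g_z = np.exp(log_p)
    g_z[target] -= 1.0                        # d loss / d logits
    g_x = mask * (W.T @ g_z)                  # d loss / d x
    mu = mu - lr * g_x                        # dx/dmu = 1
    log_sigma = log_sigma - lr * g_x * eps * sigma  # dx/dlog_sigma = sigma * eps
    return mu, log_sigma, loss


losses = []
for _ in range(300):
    mu, log_sigma, loss = update_step(mu, log_sigma)
    losses.append(loss)

# Class predicted for the mean of the optimized input distribution.
predicted = int(np.argmax(W @ mu + b))
print(predicted, float(np.mean(losses[:20])), float(np.mean(losses[-20:])))
```

The sketch only shows the shape of the loop (sample, apply dropout, forward, back-propagate to μ and σ); it makes no claim about reproducing the convergence behaviour reported for FIG. 5.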

The updated Gaussian parameters are evaluated in Monte Carlo fashion as follows. A Gaussian distribution is set up from the updated parameters, and 5000 items of point data are sampled from it. The samples are fed forward through the neural network as input data, and a cost function is computed that evaluates whether the output of the softmax layer agrees with the correct label. A decreasing cost-function value is taken to indicate that the optimization is progressing.
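A minimal sketch of this Monte Carlo evaluation follows. The text does not name the cost function, so cross-entropy against the correct label is assumed here; the toy network `forward_toy`, its weights `W_toy`, and the function name `monte_carlo_cost` are hypothetical stand-ins for the trained network of FIG. 2.

```python
import numpy as np


def softmax_rows(z):
    """Row-wise softmax for a batch of logits."""
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def monte_carlo_cost(mu, cov, forward, target, n_samples=5000, seed=0):
    """Sample point data from N(mu, cov), run it forward, and score it.

    Returns the mean cross-entropy (assumed cost) and the fraction of
    samples whose largest softmax value falls on the target class.
    """
    rng = np.random.default_rng(seed)
    xs = rng.multivariate_normal(mu, cov, size=n_samples)
    probs = forward(xs)                                   # (n_samples, n_classes)
    cost = -np.log(np.clip(probs[:, target], 1e-12, None)).mean()
    hit_rate = np.mean(probs.argmax(axis=1) == target)
    return float(cost), float(hit_rate)


# Toy stand-in network: 3 inputs, 3 classes, hypothetical weights.
W_toy = 10.0 * np.eye(3)


def forward_toy(xs):
    return softmax_rows(xs @ W_toy.T)


cost, hit_rate = monte_carlo_cost(
    mu=np.array([0.0, 0.0, 1.0]),
    cov=0.01 * np.eye(3),
    forward=forward_toy,
    target=2,
)
print(cost, hit_rate)
```

Tracking `cost` after each parameter update would give a curve of the kind plotted against the number of updates.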

FIGS. 5(a) and 5(b) plot the value of the cost function (vertical axis) against the number of updates (horizontal axis). FIG. 5(a) shows the case in which no dropout was applied during back-propagation of the Gaussian parameters (a comparative example), and FIG. 5(b) shows the case in which dropout was applied during back-propagation of the Gaussian parameters (an experimental example according to the present invention).

As FIG. 5(a) shows, without dropout the cost function does not converge to the vicinity of zero however many times the optimization is repeated, and the parameters of the Gaussian distribution of the input data cannot be optimized. In contrast, as FIG. 5(b) shows, in the experiment according to the present invention, supplying dropout as a source of noise allows the cost function to converge to the vicinity of zero, so that a Gaussian distribution of correct-answer data can be produced.

Next, the results obtained after computing the parameters of the feature quantity distribution with dropout applied are described. FIGS. 6(a) and 6(b) show the optimized feature quantity distributions for the class of the digit 9. These feature quantity distributions represent the distribution of grayscale values at each pixel. FIGS. 6(a) and 6(b) show only the distributions that can be expressed by the mean and variance values; the covariance values, which describe the correlation between the distributions, are omitted from the figures.

FIG. 6(a) shows the feature quantity distribution of the input image at the input layer of the neural network of FIG. 2. The input image has 28×28 = 784 pixels and a feature quantity distribution is obtained for each of the 784 pixels, but to simplify the drawing FIG. 6(a) shows the distributions for only 14×14 = 196 pixels (every other pixel is displayed).
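As a small check of the pixel counts, displaying every other pixel of a 28×28 grid in both directions leaves a 14×14 grid, that is, 196 entries. The array `variances` below is a placeholder, not data from the experiment.

```python
import numpy as np

# Placeholder per-pixel values arranged as a 28x28 image (not experimental data).
variances = np.arange(28 * 28, dtype=float).reshape(28, 28)

# Keep every other pixel along both axes, as done for the figure.
shown = variances[::2, ::2]
print(shown.shape, shown.size)  # (14, 14) 196
```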

As the distributions of pixel values in FIG. 6(a) show, the central part of the image, which is thought to be important in determining which of the digits 0 to 9 is present, has narrow distributions and small variance. In contrast, near the four corners and edges of the image the distributions are wide and the variance is large. This clearly indicates that pixels near the corners and edges of the image are not important in determining the estimation result.

FIG. 6(b) shows the feature quantity distributions for the 10×10 = 100 neurons in the first FC layer of the neural network of FIG. 2. Because this FC layer is the layer following the input layer, the input image has already been abstracted; even so, the neurons divide into those with narrow distributions and those with wide distributions. Neurons with narrow distributions are considered to influence the final decision of the estimation more strongly than neurons with wide distributions.

FIG. 7(a) shows an example of an input image sampled from the Gaussian distribution for the digit 9 in FIG. 6(a). FIG. 7(b) shows a bar graph of the softmax values (horizontal axis: class; vertical axis: softmax value) obtained when the image of FIG. 7(a) is fed to the neural network as input data. Although the image of FIG. 7(a), sampled from the Gaussian distribution for the digit 9, does not look like a 9 to the eye, the neural network outputs softmax values (the estimation result) that are largest for the class of the digit 9 alone.

Whatever data is sampled from the feature quantity distribution of FIG. 6(a), the neural network always outputs softmax values that estimate the input image to be the digit 9. In this sense, the method according to the present invention can also be described as a method of computing one of the distributions from which input data producing a desired estimation result, the "Adversarial Example" referred to in Non-Patent Document 5, can be sampled.

The ability to judge, from the variance values of such distributions, which pixel values or neuron values the neural network treats as important features can be observed even more clearly when the network is trained on an extended version of the MNIST images (MNIST extended images) such as that shown in FIG. 8. The MNIST extended image of FIG. 8 consists of the true data related to the correct label at the upper right, unrelated dummy data at the upper left and lower right, and an empty black region at the lower left. In these images, the data that determines the correct label is always the digit at the upper right alone; the remaining areas are dummy data that do not depend on the correct label. Even for such data, a conventional neural network trained on point data can estimate the correct answer.

FIG. 9 shows the feature quantity distribution of the input image computed by the method according to the present invention for the correct label (digit 0). The input image has (28×2)×(28×2) = 3136 pixels and a feature quantity distribution is obtained for each of the 3136 pixels, but to simplify the drawing FIG. 9 shows the distributions for 28×28 = 784 pixels (every other pixel is displayed). In the feature quantity distribution of FIG. 9, the distributions are narrow only in the upper-right area, indicating that the feature quantity distribution in this area is the important one.

FIG. 10(a) shows an example of an input image sampled from the feature quantity distribution of FIG. 9. FIG. 10(b) shows a bar graph of the softmax values (horizontal axis: class; vertical axis: softmax value) obtained when the image of FIG. 10(a) is fed to the neural network as input data. For the image of FIG. 10(a), the neural network outputs, as shown in FIG. 10(b), an estimation result whose softmax value is largest for the class of the digit 0 alone. As FIGS. 9, 10(a), and 10(b) show, the feature quantity distribution obtained by the method of the present invention reveals that the neural network bases its decision only on the upper right of the input image; the feature quantity distribution thus makes it possible to visualize the neural network.

(Second Embodiment)
Next, a second embodiment of the present invention will be described.

The first embodiment of the present invention, described above, explained a method of computing the feature quantity distribution of data. In the second embodiment, a posterior probability distribution is computed using training data, treating the feature quantity distribution obtained in the first embodiment as a likelihood distribution. Once the posterior probability distribution has been computed, evaluating it at new input point data (new, unknown test data), as shown in the graph of FIG. 11(a) (horizontal axis: input value; vertical axis: posterior probability value), gives the probability value of the output estimation result, that is, the certainty with which the estimation is being made.

The posterior probability distribution can be computed for the neuron values of every layer in the neural network. A posterior probability value can therefore be computed at an arbitrary layer for new input data, and the eventual output can be anticipated without computing through to the final output layer, which is expected to speed up and simplify the computation.

As explained below, the output value of the feature quantity distribution described earlier (see Equations 58 and 59) can be regarded as the likelihood when given input point data x_in is input.

For a regression problem, the likelihood L_l(x_in|y) of input data x_in that yields the output value y is defined by Equation 60 below. Here the output value y is compared across the multiple output values y_1, …, y_max to determine which value it takes.

For a classification problem, the likelihood L_l(x_in|c) of input data that falls into a class c is defined by Equation 61 below.

The denominator of the likelihood expressions in Equations 58 and 59 is the determinant of the variance-covariance matrix. Computing the determinant of a matrix demands very high numerical precision when the matrix has many dimensions. Depending on the layer, the number of neurons can reach 1000 or more, and computing the determinant can then be difficult. Furthermore, to estimate which output value y or class c unknown data x_in corresponds to, the posterior probability Post_l must be computed from the likelihood L_l and the prior probability Prior_l.
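The numerical difficulty with high-dimensional determinants is easy to reproduce. The sketch below uses a hypothetical 1000-dimensional diagonal covariance matrix `cov` (not taken from the experiment): its determinant, 0.1 to the power 1000, is far below the smallest representable double and underflows to zero, whereas NumPy's `np.linalg.slogdet` returns the logarithm of the determinant without trouble.

```python
import numpy as np

n = 1000
cov = 0.1 * np.eye(n)  # hypothetical covariance matrix with 1000 dimensions

det = np.linalg.det(cov)               # underflows: 0.1**1000 is not representable
sign, logdet = np.linalg.slogdet(cov)  # sign and log|det| computed stably

print(det)            # 0.0
print(sign, logdet)   # 1.0 and approximately -2302.585
```

This is one motivation for avoiding an explicit evaluation of the likelihood's denominator.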

For a regression problem, the posterior probability Post_l(y|x_in) for an output value y, computed from the likelihood L_l(x_in|y) and the prior probability Prior_l(y), is defined by the following equation.

For a classification problem, the posterior probability Post_l(c|x_in) for a class c, computed from the likelihood L_l(x_in|c) and the prior probability Prior_l(c), is defined by the following equation.

Because the denominators of Equations 62 and 63 are constant across the output values y or classes c, only the numerators need to be computed. Moreover, even without computing the prior probability Prior_l in Equations 62 and 63, a value proportional to the posterior probability Post_l can be obtained by computing only the numerator of the likelihood L_l. This is explained below.
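Equations 62 and 63 are not reproduced in this text. Assuming they take the usual form of Bayes' rule, which is consistent with the description of a denominator that is constant across classes, the classification case would read as follows (the regression case is the same with y in place of c). This is a hedged reconstruction from the surrounding definitions, not the equation as printed in the specification.

```latex
\mathrm{Post}_l(c \mid x_{in})
  = \frac{L_l(x_{in} \mid c)\,\mathrm{Prior}_l(c)}
         {\sum_{c'} L_l(x_{in} \mid c')\,\mathrm{Prior}_l(c')}
  \;\propto\; L_l(x_{in} \mid c)\,\mathrm{Prior}_l(c)
```

Since the sum in the denominator does not depend on c, dropping it leaves the ranking of the classes unchanged.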

For a regression problem, the training data Data_train contains training data for each of the multiple output values y_1, …, y_max; for a classification problem, it contains training data for each of the multiple classes c_1, …, c_max.

The present invention proposes that the posterior probability Post_l of Equations 62 and 63 can be expressed approximately in a simple form obtained by applying certain correction constants to the numerator of the likelihood L_l of Equations 60 and 61.

In a regression problem, the posterior probability Post_l is expressed by the following equation using correction constants a_y and b_y.

In a classification problem, the posterior probability Post_l is expressed by the following equation using correction constants a_c and b_c.

The correction constant a is a bias value and the correction constant b is a scale value; together they represent the denominator of the likelihood L_l in Equations 60 and 61, the prior distribution, and calculation error. The correction constants can be computed by a calibration that exploits macroscopic features of the data, as explained below.

Regarding the prior probability of each output value y or class c, let N_y and N_c denote the numbers of items with output value y and with class c, respectively, that are expected to occur in the test phase. Because these counts stand in the same ratio as the prior probability values, they can be expressed by Equation 66 below for a regression problem and by Equation 67 below for a classification problem.

Accordingly, data are prepared in numbers that reflect the prior probabilities of the test phase. The data counts N_y and N_c should be generous; for example, it is desirable to prepare at least 1000 items. These data are taken as the training data Data_train.

For all of the training data Data_train, the value of the posterior probability Post_l is computed for each output value y or class c, and the resulting set of values is written TotalP_l(x_in) (where x_in is an element of Data_train). This set has some distribution, but because it is reasonably large it can be regarded as Gaussian owing to the whitening effect.

For a regression problem, the set of posterior probability values for output value y is expressed by the following equation.

For a classification problem, the set of posterior probability values for class c is expressed by the following equation.

The right-hand sides of Equations 68 and 69 denote Gaussian distributions whose parameters are the mean Totalμ and the variance TotalΣ for each output value y or class c.

The mean Totalμ depends on the output value y or class c, and should be proportional to the number of training items N_y or N_c. This is because a posterior probability value is the estimated probability of a particular output value y or class c, so the mean of the set of such values over the training data should be proportional to the number of known correct labels in the training data.

Given the assumption above that the amount of data is sufficiently large, the variance TotalΣ is considered to be the same for every output value or class, because the variance, which measures the spread of the data, should converge to a constant value when there are enough data.

From these conditions on the mean Totalμ and the variance TotalΣ, the correction constants a and b of Equations 64 and 65 can be determined.

FIG. 12 shows the processing from defining the approximate expression of the posterior probability distribution as above through to computing its correction constants a and b. FIG. 12 is a flowchart showing an example of the processing for completing the approximate expression of the posterior probability distribution in the second embodiment of the present invention.

In this processing, as shown in FIG. 12, the feature quantity distribution of the data is first computed by the processing according to the first embodiment described above (step S201).

Next, training data Data_train is prepared in which the output values y or classes c occur in the same ratio as their prior probability values at estimation time (step S202). The feature quantity distribution computed in step S201 is then regarded as the expression for computing the likelihood, and the expression obtained by subtracting the correction constant a from the numerator of the likelihood and dividing the result by the correction constant b is taken as the approximate expression of the posterior probability distribution (step S203).

Next, using the posterior probability expression defined with the correction constants a and b in step S203, the set of posterior probability values over the prepared training data Data_train is computed for each output value y or class c (step S204). The mean and variance of each such set are then computed (step S205).

Finally, the correction constants a and b are computed so that, across the output values y or classes c, the means of the sets of posterior probability values stand in the same ratio as the numbers of training items for each output value y or class c, and the variances of the sets are equal (step S206). The correction constants a and b obtained by this calculation complete the approximate expression of the posterior probability distribution.
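Steps S203 to S206 can be sketched as follows for the special case of equal class counts, in which both the mean and the variance of the per-class sets should come out equal. The sketch assumes the corrected value has the form (g − a)/b, following the wording of step S203, where g is the uncorrected value obtained with a = 0 and b = 1. The function names, the choice of the common target mean and spread, and the synthetic scores are all illustrative assumptions.

```python
import numpy as np


def fit_correction_constants(raw_scores):
    """Fit per-class bias a and scale b from uncorrected scores.

    raw_scores: (n_samples, n_classes) values computed with a=0, b=1 over
    the calibration data. Assumes equal class counts, so every class is
    mapped to the same mean and the same standard deviation.
    """
    mean_c = raw_scores.mean(axis=0)   # step S205: per-class mean
    std_c = raw_scores.std(axis=0)     # step S205: per-class spread
    target_mean = mean_c.mean()        # common mean (illustrative choice)
    target_std = std_c.mean()          # common spread (illustrative choice)
    b = std_c / target_std             # step S206: scale value
    a = mean_c - b * target_mean       # step S206: bias value
    return a, b


def apply_correction(raw_scores, a, b):
    """Step S203: subtract the bias value, then divide by the scale value."""
    return (raw_scores - a) / b


rng = np.random.default_rng(0)
# Synthetic uncorrected scores whose location and spread differ by class.
raw = rng.normal(loc=[-50.0, -5.0, -200.0], scale=[3.0, 0.5, 20.0], size=(6000, 3))

a, b = fit_correction_constants(raw)
corrected = apply_correction(raw, a, b)
print(corrected.mean(axis=0), corrected.std(axis=0))
```

For unequal class counts the target means would instead be set in proportion to the counts, which this sketch does not attempt.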

Next, the configuration of the information estimation device in the second embodiment of the present invention is described. The second embodiment uses the weight learning unit 10, the data feature quantity distribution calculation unit 20, the posterior probability distribution calculation unit 30, and the posterior probability value estimation unit 40 included in the information estimation device 1 shown in FIG. 4. The weight learning unit 10 and the data feature quantity distribution calculation unit 20 are the same as in the first embodiment and output the feature quantity distribution at an arbitrary layer l of the neural network, that is, the optimized distribution parameters.

The posterior probability distribution calculation unit 30 takes the distribution given by the optimized distribution parameters output from the data feature quantity distribution calculation unit 20 as the likelihood distribution, and takes the likelihood distribution multiplied by the prior probability as the expression of the posterior probability distribution. As the approximate expression of the posterior probability distribution, it takes the expression obtained by subtracting a bias value from the numerator of the likelihood distribution, which is expressed as a fraction, and dividing the result by a scale value. It computes the posterior probability values of the training data with this approximate expression and optimizes the correction constants a and b (the bias value and the scale value) so that the mean and variance of the distribution formed by the set of posterior probability values are the same across all classes in a classification problem, or across the multiple output values in a regression problem. The posterior probability distribution calculation unit 30 executes the processing of the flowchart of FIG. 12 described above.

The posterior probability value estimation unit 40 computes a posterior probability value for input data consisting of a vector of fixed values, using the approximate expression of the posterior probability distribution that contains the correction constants a and b (the bias value and the scale value) optimized by the posterior probability distribution calculation unit 30.

Experiments relating to the second embodiment of the present invention are described below. Continuing from the first embodiment described above, this experiment uses the MNIST handwritten digit data.

In MNIST, the 10 classes from 0 to 9 (max = 10) are provided in equal numbers in both the training data and the test data. In an environment where estimation is performed with MNIST, the prior distribution can therefore be regarded as the same for every class.

A matching quantity of training data is then prepared. Because the MNIST training data are evenly distributed over the classes, they can be used as they are as the calibration data Data_train.

The training data comprise 60,000 items, which is sufficiently many. For the distribution of the set of posterior probability values Post_l(c|x_in) (where x_in is an element of Data_train), the mean Totalμ_c and the variance TotalΣ_c should therefore simply take the same values in every class.

Using the point data of the training data Data_train, the set of values of the posterior probability Post_l(c|x_in) defined by Equation 65 (where x_in is an element of Data_train) was computed for each class c at the first FC layer of the neural network of FIG. 2. The correction constants were set to the provisional values a_c = 0 and b_c = 1.

FIG. 13(a) shows the posterior probability values of these sets as histograms, one for each of the 10 classes. Because the correction constants are set to provisional values rather than correct ones, as noted above, the histograms show that the distributions differ widely between classes.

In contrast, FIG. 13(b) shows histograms of the values of the posterior probability Post_l(c|x_in) defined by Equation 65 after the correction constants a_c and b_c were determined so that the mean and the variance of the distributions of these sets are each the same across classes. In this case the posterior probability distributions (strictly, the distributions of values proportional to the posterior probability) have similar shapes in all classes and are well aligned.

As described above, once the correction constants a_c and b_c that make up the posterior probability distribution have been computed from the training data, then even when unknown test data arrives at estimation time, the posterior probability value (that is, the certainty of the estimation) can be computed for each class from the posterior probability distribution, and the class with the largest value can be judged to be the estimated class.

In the experiment, the class with the highest posterior probability value was computed for each of the 10,000 MNIST test images, and the result was checked against the correct label.

As an example from this experiment, FIG. 14(a) shows an MNIST image of the digit 7 used as test data, and FIG. 14(b) shows the logarithm of the posterior probability for each class when the image of FIG. 14(a) is the input data. As FIG. 14(b) shows, the logarithm of the posterior probability is largest for the class of the digit 7; this class has the highest probability, and the estimate is correct.

FIG. 15 shows the accuracy for each class obtained by applying the same judgment to the 10,000 test images. The left side of FIG. 15 (before correction) gives the accuracy when the correction constants were not determined from the training data (the provisional values a_c = 0 and b_c = 1 were used), and the right side (after correction) gives the accuracy when the posterior probability values were computed after determining the correction constants by the method described above. The mean accuracy was 0.759 with the provisional correction constants and rose to 0.907 once the correction constants were determined. For comparison, the accuracy is 0.92 when no posterior probability distribution is computed from the training data and the test data are simply fed forward as point data in the conventional way and estimated at the softmax layer. The accuracy of estimation from the posterior probability values in the second embodiment is thus at roughly the same level as that of the estimation result output from the softmax layer, which is the final layer, showing that estimation from the distributions achieves high accuracy.
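The per-class accuracy reported in FIG. 15 corresponds to the following kind of computation: take the class with the largest posterior-like value for each test item and compare it with the correct label, class by class. The function name and the four-item toy input are illustrative; the actual scores would come from the corrected posterior expression.

```python
import numpy as np


def per_class_accuracy(scores, labels):
    """Predict by largest score and report accuracy separately per class.

    scores: (n_samples, n_classes) posterior-like values.
    labels: (n_samples,) correct class indices.
    A class with no samples in `labels` would yield NaN.
    """
    predicted = scores.argmax(axis=1)
    n_classes = scores.shape[1]
    accuracy = np.array(
        [np.mean(predicted[labels == c] == c) for c in range(n_classes)]
    )
    return predicted, accuracy


# Toy input (not experimental data): 4 items, 2 classes.
scores = np.array([[0.9, 0.1], [0.2, 0.8], [0.6, 0.4], [0.3, 0.7]])
labels = np.array([0, 1, 1, 1])

predicted, accuracy = per_class_accuracy(scores, labels)
print(predicted, accuracy)
```

Averaging `accuracy` over the classes gives a single mean figure comparable to the values quoted above.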

Moreover, because the posterior probability values yield not only the estimated class but also the certainty or probability of the estimate, the method of the second embodiment can also be applied to neural network estimators that output probability values.

(Third Embodiment)
Next, a third embodiment of the present invention will be described.

The second embodiment of the present invention, described above, explained a method of computing the posterior probability distribution. The third embodiment further shows that the computation remains possible when the input data is not point data but distribution data consisting of random variables with some distribution. Specifically, the third embodiment proposes a method of computing the posterior probability value analytically by computing the degree of overlap between distributions. Input data in the form of a distribution means that the data values carry uncertainty or error; even such data can be handled by the neural network. In other words, according to the third embodiment, a neural network estimator can be applied to input data whose values have uncertainty expressed as a distribution.

Here, the input data is assumed to be a vector of Gaussian-distributed random variables characterized by a mean vector μ_in and a variance-covariance matrix Σ_in, as expressed by the following equation.
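For reference, this assumption can be written in the standard multivariate Gaussian form, using the symbols of this paragraph and n_xin for the dimension of x_in as introduced in the background description. This is the textbook density and is not a reproduction of the numbered equation in the original.

```latex
x_{in} \sim \mathcal{N}(\mu_{in}, \Sigma_{in}), \qquad
p(x_{in}) = \frac{1}{\sqrt{(2\pi)^{n_{x_{in}}} \lvert \Sigma_{in} \rvert}}
\exp\!\left( -\tfrac{1}{2} (x_{in} - \mu_{in})^{\top} \Sigma_{in}^{-1} (x_{in} - \mu_{in}) \right)
```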

In the expressions for the posterior probability Post_l(x_in) in Equations 64 and 65, the input data x_in being a distribution (that is, being input distribution data) means, as shown in the graph of FIG. 11(b) (horizontal axis: input value; vertical axis: posterior probability value), calculating the expected value of the output obtained by sampling from the Gaussian distribution, that is, the expected value of the posterior probability value. This is expressed by the following equation.
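The expected value described here can be approximated by sampling, as in the rough sketch below. `post_fn` is a hypothetical stand-in for Post_l from Equations 64 and 65, whose exact form is not reproduced in this sketch; the toy function, the prototype vector, and the function name `expected_posterior` are likewise assumptions for illustration.

```python
import numpy as np


def expected_posterior(post_fn, mu_in, sigma_in, n_samples=10_000, seed=0):
    """Monte Carlo estimate of E[post_fn(x)] for x ~ N(mu_in, sigma_in)."""
    rng = np.random.default_rng(seed)
    samples = rng.multivariate_normal(mu_in, sigma_in, size=n_samples)
    values = np.array([post_fn(x) for x in samples])
    return values.mean(axis=0)


# Toy stand-in for Post_l (an assumption, not the patent's formula):
# an unnormalized Gaussian bump centred on a class prototype.
prototype = np.array([0.2, 0.8])


def toy_post(x):
    return np.exp(-0.5 * np.sum((x - prototype) ** 2))


mu_in = np.array([0.2, 0.8])
# Nearly point-like input: tiny variance, so the expectation stays near toy_post(mu_in).
point_like = expected_posterior(toy_post, mu_in, 1e-6 * np.eye(2))
# Uncertain input: unit variance spreads the samples and lowers the expected value.
uncertain = expected_posterior(toy_post, mu_in, 1.0 * np.eye(2))
```

With this toy function the nearly point-like input gives a value close to 1, while the uncertain input gives a value close to 0.5, which shows how input variance is reflected in the expected posterior value.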

The method of calculating the expected value in Equation 73 is shown below.
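One standard identity that makes such an expectation analytic is the overlap of two Gaussian densities: the integral of N(x; μ_a, Σ_a) N(x; μ_b, Σ_b) over x equals N(μ_a; μ_b, Σ_a + Σ_b). The sketch below assumes that the term being averaged is Gaussian in x; whether the derivation of Equation 73 takes exactly this form is not confirmed here, and `gaussian_pdf` and `gaussian_overlap` are illustrative names.

```python
import numpy as np


def gaussian_pdf(x, mu, cov):
    """Density of N(mu, cov) at x (x, mu: length-d vectors; cov: d x d)."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    mu = np.atleast_1d(np.asarray(mu, dtype=float))
    d = mu.size
    cov = np.asarray(cov, dtype=float).reshape(d, d)
    diff = x - mu
    _, logdet = np.linalg.slogdet(cov)
    quad = diff @ np.linalg.solve(cov, diff)
    return float(np.exp(-0.5 * (d * np.log(2.0 * np.pi) + logdet + quad)))


def gaussian_overlap(mu_a, cov_a, mu_b, cov_b):
    """Integral over x of N(x; mu_a, cov_a) * N(x; mu_b, cov_b).

    Uses the identity: integral = N(mu_a; mu_b, cov_a + cov_b).
    """
    d = np.atleast_1d(np.asarray(mu_a, dtype=float)).size
    cov_sum = (np.asarray(cov_a, dtype=float).reshape(d, d)
               + np.asarray(cov_b, dtype=float).reshape(d, d))
    return gaussian_pdf(mu_a, mu_b, cov_sum)


# 1-D check of the identity against a brute-force Riemann sum.
grid = np.linspace(-15.0, 15.0, 6001)
dx = grid[1] - grid[0]
numeric = sum(gaussian_pdf(x, 0.0, 1.0) * gaussian_pdf(x, 1.0, 2.0) for x in grid) * dx
analytic = gaussian_overlap(0.0, 1.0, 1.0, 2.0)
```

Because the overlap depends only on the difference of the means and the sum of the covariances, a wider input variance flattens the overlap across candidate distributions instead of committing to a single point value.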

As a result, even when the input data is distribution data, that is, a random variable following a certain distribution, estimation by the neural network becomes possible and the certainty of the estimation can be calculated. As an application, the neural network can handle input data such as a signal with a measurement error distribution, or data with a variance value calculated by a conventional probabilistic technique such as a Kalman filter. Furthermore, missing data whose value is unknown need not be replaced with an arbitrary provisional fixed value; instead, it can be set as a random variable with a suitably wide variance, and the estimation calculation can be performed on that.

Moreover, if conventional point data is regarded as distribution data with a very small variance, like a delta function, then all data can be treated as distribution data. This makes it possible to process partially missing data in which point data and distribution data are mixed. As an experiment on processing such data, FIGS. 16(a) to 16(f) show calculation results using MNIST images.

FIG. 16(a) shows an ordinary MNIST image of the digit 4, and FIG. 16(b) shows the same image in a missing-data state in which part of the digit is painted gray. In the image of FIG. 16(b), the gray region is the missing portion, where the grayscale values of the pixels are unknown.

Taking the images of FIGS. 16(a) and 16(b) as input images, FIGS. 16(c) and 16(d) show the softmax values estimated for each input image using the conventional technique. In contrast, FIGS. 16(e) and 16(f) show the posterior probability values calculated for the same two input images using the calculation technique according to the present invention. In FIGS. 16(c) to 16(f), the estimation result for each class is shown as a bar graph.

The input image is a grayscale image whose grayscale values range from 0 to 1. In this experiment, when the input image was treated as point data, the intermediate value 0.5 was used for the pixels in the gray missing portion of FIG. 16(b). When the input image was treated as distribution data, on the other hand, the pixels in the gray missing portion of FIG. 16(b) were regarded as following a Gaussian distribution with mean 0.5 and variance 10. The non-missing pixels were also treated as distribution data, as described above, with the mean set to the pixel's grayscale value and the variance set to 0.001 (a provisional small variance), so that point data was represented as delta-function-like distribution data in the calculation.
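The input encoding used in this experiment can be sketched as follows, using the values stated above (mean 0.5 and variance 10 for missing pixels, variance 0.001 for observed pixels). The function name `to_distribution_input`, the boolean `missing_mask`, and the choice of a diagonal covariance (pixels treated as independent) are assumptions made for illustration; the text does not state the covariance structure.

```python
import numpy as np


def to_distribution_input(image, missing_mask,
                          observed_var=0.001, missing_mean=0.5, missing_var=10.0):
    """Encode a grayscale image with missing pixels as (mean vector, covariance).

    image: array of grayscale values in [0, 1].
    missing_mask: boolean array of the same shape, True where the value is unknown.
    Assumption: pixels are independent, so the covariance matrix is diagonal.
    """
    image = np.asarray(image, dtype=float)
    missing_mask = np.asarray(missing_mask, dtype=bool)
    # Observed pixels keep their value as the mean; missing pixels get 0.5.
    mean = np.where(missing_mask, missing_mean, image).ravel()
    # Observed pixels get a small variance; missing pixels get a wide one.
    var = np.where(missing_mask, missing_var, observed_var).ravel()
    return mean, np.diag(var)


# Tiny 2 x 2 example with one missing pixel (top right).
image = np.array([[0.0, 1.0],
                  [0.3, 0.7]])
mask = np.array([[False, True],
                 [False, False]])
mu_in, sigma_in = to_distribution_input(image, mask)
```

The same function covers the point-data case: with an all-False mask every pixel becomes a narrow, delta-function-like distribution around its observed value.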

As shown in FIG. 16(c), when the input image is treated as point data, the softmax values for the image of the digit 4 without missing data (the image of FIG. 16(a)) give a large score for the digit 4. However, as shown in FIG. 16(d), the softmax values for the image with missing data (the image of FIG. 16(b)) give raised scores for the digits 8 and 9 and no score for the digit 4. That is, because the conventional technique fixes the gray pixels at the value 0.5, it fails to recognize the possibility that this input image is an image of the digit 4.

FIG. 16(e), on the other hand, shows the case in which the input image is treated as distribution data. From the posterior probability values, the image of the digit 4 without missing data (the image of FIG. 16(a)) is recognized as the digit 4. Furthermore, as shown in FIG. 16(f), for the image with missing data (the image of FIG. 16(b)), the fact that the gray portion may take various values in the range 0 to 1 is taken into account, so the possibilities that the image shows the digit 0, 4, 8, or 9 are recognized and output as scores.

In general, when a neural network in a sensor network makes a judgment by combining data obtained from many different sensors, it is by no means common for complete data to be available from all of them. Rather, judgments must often be made with data from some sensors missing or unknown. With a conventional neural network, it is necessary either to substitute some arbitrary value or to train the network in advance for each case of missing data. With the technique of the present invention, by contrast, distribution data consisting of random variables can be handled simply by calculating in advance, from the learning data, the posterior probability distribution obtained in the second embodiment described above. Missing or unknown data can thus be dealt with and an effective estimation result obtained.

(Industrial Applicability)
The present invention makes it possible to calculate the feature distribution of data in each layer of a neural network that leads a trained neural network to output a given estimation result, and is applicable to neural network technology in general.

1 Information estimation device
10 Weight learning unit
20 Data feature distribution calculation unit
21 Propagation calculation unit
22 Cost function calculation unit
23 Distribution parameter update unit
24 Optimization parameter output unit
30 Posterior probability distribution calculation unit
40 Posterior probability value estimation unit

Claims

1. An information estimation device that performs estimation processing using a neural network, comprising:
a weight learning unit that learns weights of the neural network, which includes a dropout layer that performs dropout to drop a part of data, using learning data consisting of vectors of fixed values;
a propagation calculation unit that treats data propagated through each layer of the neural network as a random variable having a multivariate Gaussian distribution defined by distribution parameters including a mean value, a variance value, and a covariance value, and performs propagation calculation by applying the dropout to input data in the neural network;
a cost function calculation unit that calculates a cost function representing the difference between output data from the neural network and correct data;
a distribution parameter update unit that performs back-propagation calculation by applying the dropout in the neural network to the minute error obtained by differentiating the cost function, and updates the distribution parameters in each layer of the neural network; and
an optimization parameter output unit that outputs the distribution parameters as optimized distribution parameters when the cost function is equal to or less than a predetermined threshold.

2. The information estimation device according to claim 1, wherein the variance value included in the optimized distribution parameter output from the optimization parameter output unit represents the importance of each neuron in each layer of the neural network.

3. The information estimation device according to claim 2, wherein calculation in neurons determined to have low importance, based on the importance of each neuron in each layer of the neural network, is simplified.

4. The information estimation device according to any one of claims 1 to 3, further comprising a posterior probability distribution calculation unit that: takes the distribution defined by the optimized distribution parameters output from the optimization parameter output unit as a likelihood distribution; takes the expression obtained by multiplying the likelihood distribution by a prior probability as the expression of a posterior probability distribution; takes, as an approximate expression of the posterior probability distribution, the expression obtained by subtracting a bias value from the numerator of the likelihood distribution expressed as a fraction and dividing the result by a scale value; calculates posterior probability values of the learning data using the approximate expression of the posterior probability distribution; and optimizes the bias value and the scale value so that the mean value and the variance value of the distribution formed by the set of posterior probability values are the same across all classes in the case of a classification problem, and across the plurality of output values in the case of a regression problem.

5. The information estimation device according to claim 4, further comprising a posterior probability value estimation unit that calculates a posterior probability value for input data consisting of a vector of fixed values, using the approximate expression of the posterior probability distribution including the optimized bias value and scale value.

6. The information estimation device according to claim 5, wherein the input data for which the posterior probability value is to be calculated is a random variable having a predetermined distribution, and the posterior probability value for the input data is calculated by calculating a plurality of posterior probability values using a plurality of values sampled from the distribution of the input data and calculating their expected value.

7. An information estimation method executed by an information estimation device that performs estimation processing using a neural network, the method comprising:
a weight learning step of learning weights of the neural network, which includes a dropout layer that performs dropout to drop a part of data, using learning data consisting of vectors of fixed values;
a propagation calculation step of treating data propagated through each layer of the neural network as a random variable having a multivariate Gaussian distribution defined by distribution parameters including a mean value, a variance value, and a covariance value, and performing propagation calculation by applying the dropout to input data in the neural network;
a cost function calculation step of calculating a cost function representing the difference between output data from the neural network and correct data;
a distribution parameter update step of performing back-propagation calculation by applying the dropout in the neural network to the minute error obtained by differentiating the cost function, and updating the distribution parameters in each layer of the neural network; and
an optimization parameter output step of outputting the distribution parameters as optimized distribution parameters when the cost function is equal to or less than a predetermined threshold.

8. The information estimation method according to claim 7, wherein the variance value included in the optimized distribution parameter output in the optimization parameter output step represents the importance of each neuron in each layer of the neural network.

9. The information estimation method according to claim 8, wherein calculation in neurons determined to have low importance, based on the importance of each neuron in each layer of the neural network, is simplified.

10. The information estimation method according to any one of claims 7 to 9, further comprising a posterior probability distribution calculation step of: taking the distribution defined by the optimized distribution parameters as a likelihood distribution; taking the expression obtained by multiplying the likelihood distribution by a prior probability as the expression of a posterior probability distribution; taking, as an approximate expression of the posterior probability distribution, the expression obtained by subtracting a bias value from the numerator of the likelihood distribution expressed as a fraction and dividing the result by a scale value; calculating posterior probability values of the learning data using the approximate expression of the posterior probability distribution; and optimizing the bias value and the scale value so that the mean value and the variance value of the distribution formed by the set of posterior probability values are the same across all classes in the case of a classification problem, and across the plurality of output values in the case of a regression problem.

11. The information estimation method according to claim 10, further comprising calculating a posterior probability value for input data consisting of a vector of fixed values, using the approximate expression of the posterior probability distribution including the optimized bias value and scale value.

12. The information estimation method according to claim 11, wherein the input data for which the posterior probability value is to be calculated is a random variable having a predetermined distribution, and the posterior probability value for the input data is calculated by calculating a plurality of posterior probability values using a plurality of values sampled from the distribution of the input data and calculating their expected value.