JP2016218513A

JP2016218513A - Neural network and computer program therefor

Info

Publication number: JP2016218513A
Application number: JP2015099137A
Authority: JP
Inventors: 駿平窪澤; Shunpei Kubosawa
Original assignee: National Institute of Information and Communications Technology
Current assignee: National Institute of Information and Communications Technology
Priority date: 2015-05-14
Filing date: 2015-05-14
Publication date: 2016-12-22
Anticipated expiration: 2035-05-14
Also published as: JP6521440B2

Abstract

PROBLEM TO BE SOLVED: To provide a neural network with high generalization performance that can make predictions in a short time. SOLUTION: A neural network 80 comprises an input layer 90 and a hidden layer 92 that receives output from the input layer 90. The hidden layer 92 includes a plurality of hidden layer units 100 and 102, each of which is connected to a plurality of input nodes of the input layer 90. Each of the hidden layer units 100 and 102 includes: a plurality of COS elements 120-128, each of which has inputs connected with weights to the plurality of input nodes and uses a cosine function as its activation function; and a Σ element 140, which is connected so as to receive, with weights, the outputs of the COS elements 120-128 in the same hidden layer unit and uses a subdifferentiable function as its activation function. The neural network 80 may further include an output layer 94 connected so as to receive the outputs of the hidden layer units 100 and 102. SELECTED DRAWING: Figure 3

Description

The present invention relates to machine learning, and more particularly to a neural network that can perform inference accurately even on a computer with limited computational resources.

Machine learning is a learning method based on inductive inference that aims to acquire general rules or trends from a limited number of examples. Mathematically, it is formulated as the global estimation of the underlying data distribution, on the assumption that the example (training) data were drawn as samples from the true data distribution. The data distribution can be regarded as a continuous function whose domain is the data space and whose range is a space expressing how plausible a data point is, such as a probability. A neural network is one way of approximating this continuous function from training data. Neural networks are widely used as multi-layered perceptrons (MLPs), that is, feed-forward neural networks (FFNNs), in a variety of classification tasks such as language modeling, speech recognition, and image recognition.

Any continuous function, and hence any data distribution, can be represented by a two-layer neural network that uses the logistic sigmoid function as its activation function. However, a network with as few as two layers tends to overfit: it learns only local structure, such as individual training data points, and fails to capture the global distribution. FIG. 1 illustrates this. FIG. 1 assumes that two types of data exist in a two-dimensional data space and that training data for them (indicated by × and ○) have been obtained. When the task of the neural network is to discriminate between these two types of data, a discrimination curve separating them in the original data space must be found. However, if, for example, the number of parameters is too large, the discrimination curve often overfits the training data. Suppose that discrimination curve 30 in FIG. 1 was obtained by overfitting. Discrimination curve 30 discriminates the training data extremely well, sometimes perfectly, but its discrimination performance on data other than the training data (generalization performance) is correspondingly worse. In contrast, discrimination curve 32, obtained by appropriate training, does not necessarily give a perfect result for all test data, but it discriminates relatively accurately even when given data other than the training data.

To suppress overfitting, a technique called regularization (Non-Patent Document 1, Non-Patent Document 2) is widely used throughout machine learning. Regularization reduces the number of parameters that affect the approximation, or shrinks the parameter values, while the discrimination model learns to approximate the training data. The technique rests on the assumption that the distribution from which the training data are drawn has a smooth shape, that is, that the original distribution is a function made up of a small number of parameters with small values. In general, the L2 norm or L1 norm of the parameters is added to the objective function being minimized so that many parameter values become 0 or close to 0. Regularization thus imposes a constraint that effectively limits the number of parameters used for inference, which improves generalization performance.
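The penalty described above can be sketched as follows. This is a minimal illustration only: the function names and the weighting coefficient `lam` are hypothetical, and the squared L2 norm is assumed for the L2 case, which is a common choice but is not stated in the text.

```python
def l2_penalty(params, lam):
    """Squared L2 norm of the parameters, scaled by lam (assumed form)."""
    return lam * sum(p * p for p in params)


def l1_penalty(params, lam):
    """L1 norm of the parameters, scaled by lam."""
    return lam * sum(abs(p) for p in params)


def regularized_objective(data_error, params, lam, norm="l2"):
    """Objective to minimize: data error plus a norm penalty on the parameters."""
    if norm == "l2":
        return data_error + l2_penalty(params, lam)
    return data_error + l1_penalty(params, lam)
```

Larger values of `lam` push more parameters toward 0 at the cost of a larger data error.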

Alongside regularization, there is the approach of improving generalization performance through the computational structure (function definition) of the model. One such technique is the multilayered neural network known as deep learning. A neural network with many layers is said to capture local data distributions in the layers close to the input and, moving toward the output layer, to capture the global distribution as a combination of those local distributions (Non-Patent Document 3). In some classification tasks, such as image recognition, deep learning has achieved generalization performance that far surpasses conventional techniques (Non-Patent Document 4).

FIG. 2 shows a configuration example of a multilayered neural network (deep neural network: DNN) 50. Referring to FIG. 2, DNN 50 includes an input layer 52, an intermediate layer 54 consisting of a plurality of hidden layers, and an output layer 56. In this example, intermediate layer 54 includes five hidden layers. DNN 50 is trained by error backpropagation. That is, training data are given to input layer 52, and a predicted output for the training data is obtained from output layer 56 via intermediate layer 54. The error between that output and the correct data is then propagated back from the output side, and the weights and biases of the connections between nodes are updated so as to minimize a predetermined error function.

Conventionally, the logistic sigmoid function, the hyperbolic tangent, and similar functions have been widely used as the activation function of each layer. Recently, however, various activation functions premised on multilayering have been proposed (Non-Patent Document 5, Non-Patent Document 6), and they have been shown to improve generalization performance.

Non-Patent Document 1: Evgeniou, T., Pontil, M., & Poggio, T. (2000). Regularization networks and support vector machines. Advances in Computational Mathematics, 13(1), 1-50.
Non-Patent Document 2: Girosi, F., Jones, M., & Poggio, T. (1995). Regularization theory and neural networks architectures. Neural Computation, 7(2), 219-269.
Non-Patent Document 3: Montufar, G., Pascanu, R., Cho, K., & Bengio, Y. (2014). On the Number of Linear Regions of Deep Neural Networks. http://arxiv.org/abs/1402.1869
Non-Patent Document 4: Schmidhuber, J. (2015). Deep Learning in Neural Networks: An Overview. Neural Networks, Vol. 61, pp. 85-117. (http://arxiv.org/abs/1404.7828)
Non-Patent Document 5: Glorot, X., Bordes, A., & Bengio, Y. (2011). Deep sparse rectifier networks. Proceedings of the 14th International Conference on Artificial Intelligence and Statistics. JMLR W&CP, Vol. 15.
Non-Patent Document 6: Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., & Bengio, Y. (2013). Maxout Networks. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), pages 1319-1327.

As described above, the prior art shows that multilayering improves generalization performance. At the same time, however, the amount of computation has increased, so a large-scale computing device is still required to improve performance. In addition, although the computational efficiency of each layer has improved with the progress of massively parallel computer hardware, such as the spread of general-purpose GPU computers, a multilayer model has the constraint that the computation of a layer cannot start until the computation of the layer on its input side has finished. Waiting time therefore inevitably increases in proportion to the number of layers, and computational efficiency falls not only during training but also during prediction.

Accordingly, an object of the present invention is to provide a neural network with high generalization performance that can make predictions in a short time.

A neural network according to a first aspect of the present invention includes an input layer having a plurality of input nodes, and a nonlinear function layer having inputs connected to receive the output of the input layer. The nonlinear function layer includes a plurality of hidden layers, each connected to receive the outputs of the plurality of input nodes of the input layer and each capable of approximating an arbitrary independent component. Each of the plurality of hidden layers includes: a plurality of neuron elements, each of which has inputs connected with weights to the plurality of input nodes and uses a subdifferentiable periodic function as its activation function; and an output aggregation element that is connected to receive, with weights, the outputs of the plurality of neuron elements in the same hidden layer and uses a subdifferentiable function as its activation function.

Preferably, the periodic function is a cosine function, a sine function, a combination of the two, or a piecewise-linear approximation of a cosine function or sine function.

More preferably, the neural network has been trained using L2 regularization on the input-side weight parameters of each of the plurality of hidden layers.

The neural network may be one that has been trained using a quasi-Newton method.

More preferably, the activation function of the output aggregation element may be the identity function, that is, a function that outputs the same value as its input.

The neural network may further include an output layer connected to receive the outputs of the output aggregation elements of the plurality of hidden layers. The output layer includes a plurality of output neuron elements, each connected to receive, with weights, the outputs of the output aggregation elements of the plurality of hidden layers.

The activation function of the output aggregation element may be the softmax function.

A computer program according to a second aspect of the present invention causes a computer to function as any of the neural networks described above.

FIG. 1 is a diagram showing the relationship between a data distribution in a two-dimensional space and discrimination functions, for explaining the problem of overfitting in machine learning.
FIG. 2 is a schematic diagram of a neural network for explaining the problems of a conventional deep neural network.
FIG. 3 is a diagram schematically showing the configuration of a neural network according to one embodiment of the present invention.
FIG. 4 is a graph showing an example of the objective function when no regularization term is added to the error.
FIG. 5 is a graph showing an example of the objective function when a regularization term is added to the error.
FIG. 6 is a block diagram of a speech recognition apparatus using an acoustic model based on a neural network according to one embodiment of the present invention.
FIG. 7 is a graph showing the improvement in generalization performance in handwritten character recognition when regularization is applied in one embodiment of the present invention.
FIG. 8 is a graph showing generalization performance in handwritten character recognition when regularization is not applied in one embodiment of the present invention.
FIG. 9 is an external view of a computer system that implements a character recognition device according to one embodiment of the present invention.
FIG. 10 is a block diagram showing the hardware configuration of the computer shown in FIG. 9.

In the following description and drawings, identical parts are given identical reference numbers, and detailed descriptions of them are not repeated. In the following description of the embodiment, a neural network is referred to as an "NN".

[Basic concept]
In machine learning, as a framework for obtaining a continuous function that approximates the training data, the embodiment of the present invention described below uses a network structure that represents the frequency domain of the discrete Fourier transform of the continuous function.

FIG. 3 shows the structure of the NN used in this embodiment, taking as an example a function that approximates two label distributions. Referring to FIG. 3, NN 80 includes an input layer 90, a hidden layer 92 connected to receive the output of input layer 90, and an output layer 94 connected to receive the output of hidden layer 92.

In this example, input layer 90 includes three nodes. Each node is called a neuron element (neuron). Each neuron outputs the value of a predetermined activation function applied to the weighted sum of its inputs. In general, any monotonically increasing function may serve as the activation function; the activation functions used in this embodiment are described later.

Hidden layer 92 includes two hidden layer units 100 and 102. Hidden layer units 100 and 102 have the same structure, so only hidden layer unit 100 is described below. In terms of depth, hidden layer 92 is a single layer. FIG. 3 shows two hidden layer units for the sake of explanation, but this embodiment uses about ten.

Hidden layer unit 100 includes COS elements 120, 122, 124, 126, and 128, each of which has inputs coupled to the outputs of the three input elements of input layer 90 and uses a cosine function as its activation function, and a Σ element 140 that is coupled to COS elements 120, 122, 124, 126, and 128 so as to receive their outputs and uses the identity map as its activation function. All of these elements are kinds of neuron element.

In this embodiment, output layer 94 includes nodes, each of which is coupled to hidden layer units 100 and 102 so as to receive their outputs.

Weights and biases are set for the connections between the elements. The output of each Σ element is expressed by the following equation.

In the above equation, vector x represents the input vector, matrix a_d represents the weight matrix of the connections between the elements of input layer 90 and hidden layer units 100 and 102, vector b_d represents the bias vector added to the output of each element of hidden layer units 100 and 102, and vector c_d represents the weight vector of the connections between COS elements 120, 122, 124, 126, and 128 and Σ element 140 in each of hidden layer units 100 and 102. Learning is the process of determining these weights and biases from the training data. As a result of learning, each Σ element can approximate an arbitrary continuous function of the input to input layer 90. In the frequency domain of the continuous function approximated by each Σ element, the input-side weights of a COS element correspond to the frequency component, its input-side bias corresponds to the phase component, and its output-side weight corresponds to the amplitude component. Because the Σ elements do not share COS element outputs with one another, each Σ element can independently approximate an arbitrary continuous function on the input space.
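The equation referred to above is not reproduced in this text. From the definitions of a_d, b_d, and c_d, and from the statement that the input-side bias corresponds to the phase component, one plausible reading is y_d(x) = Σ_k c_{d,k} cos(a_{d,k} · x + b_{d,k}), with an identity activation on the Σ element. The sketch below assumes that form; the function names and the list-based data layout are hypothetical.

```python
import math


def hidden_unit_output(x, a, b, c):
    """Output of one Σ element with identity activation (assumed form).

    x: input vector (length N)
    a: K rows of N input-side weights (frequency components)
    b: K input-side biases (phase components)
    c: K output-side weights (amplitude components)
    """
    total = 0.0
    for a_k, b_k, c_k in zip(a, b, c):
        phase = sum(w * x_i for w, x_i in zip(a_k, x)) + b_k
        total += c_k * math.cos(phase)  # one COS element, weighted by its amplitude
    return total


def hidden_layer_output(x, units):
    """Outputs of all hidden layer units; each unit has its own (a, b, c).

    The units share the input x but not their COS elements.
    """
    return [hidden_unit_output(x, a, b, c) for (a, b, c) in units]
```

Because no unit depends on another unit's output, the units can be evaluated in parallel, and a prediction requires only one such hidden stage.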

Output layer 94 can be omitted by providing as many hidden layer units 100, 102, and so on as there are labels. In addition, if the softmax function is used as the activation function of the Σ element of each hidden layer unit, its output can be used as a probability value. The softmax function is the function (i = 1, ..., n) given by the following expression, where z_i denotes the input to the i-th Σ element:

$$\mathrm{softmax}(z)_i = \frac{\exp(z_i)}{\sum_{j=1}^{n} \exp(z_j)}$$
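As a small illustration of the softmax function mentioned above, the sketch below uses its standard definition, softmax(z)_i = exp(z_i) / Σ_j exp(z_j). Subtracting the maximum before exponentiating is a common numerical safeguard added here; it is not taken from the text.

```python
import math


def softmax(z):
    """Standard softmax over a list of n real-valued inputs."""
    m = max(z)                            # shift for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]
```

The outputs are positive and sum to 1, which is why they can be read as probability values over the labels.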

Providing output layer 94 as shown in FIG. 3 makes it possible to use more or fewer Σ elements than there are labels. If a sine function is used in place of the cosine function in the COS elements, the approximation capability of the network is equivalent. Other differentiable periodic functions, or combinations of them, may also be used.
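The equivalence of sine and cosine elements follows from a phase shift. Writing a for the input-side weights and b for the input-side bias of one element (symbols chosen here for illustration):

```latex
\sin(a \cdot x + b) = \cos\!\left(a \cdot x + b - \tfrac{\pi}{2}\right)
```

A sine element with bias b therefore computes the same function as a cosine element with bias b - π/2, and since the bias is a learned parameter, either choice can represent the same set of functions.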

In this embodiment, as in a conventional FFNN, learning is performed by error backpropagation. However, the COS elements of the hidden layer units use a cosine function as their activation function. As explained below, this makes it difficult to optimize hidden layer units 100 and 102 against the training data with the conventional approach. This embodiment therefore uses two techniques during learning: the first is regularization, and the second is the adoption of a quasi-Newton method.
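As an illustration of error backpropagation through COS elements, the sketch below computes the gradients for a single hidden layer unit. It assumes the unit computes y = Σ_k c_k cos(a_k · x + b_k) and assumes a squared-error loss 0.5 (y - t)^2 for one training pair (x, t); the loss and all names are hypothetical choices for illustration.

```python
import math


def forward(x, a, b, c):
    """Return the unit output y and the phase of each COS element."""
    phases = [sum(w * x_i for w, x_i in zip(a_k, x)) + b_k
              for a_k, b_k in zip(a, b)]
    y = sum(c_k * math.cos(p) for c_k, p in zip(c, phases))
    return y, phases


def squared_error(x, t, a, b, c):
    """Assumed loss for one training pair: 0.5 * (y - t)^2."""
    y, _ = forward(x, a, b, c)
    return 0.5 * (y - t) ** 2


def gradients(x, t, a, b, c):
    """Gradients of the squared error with respect to a, b, and c."""
    y, phases = forward(x, a, b, c)
    delta = y - t                                   # dL/dy
    grad_c = [delta * math.cos(p) for p in phases]  # amplitude gradients
    # d cos(p)/dp = -sin(p), so the error reaches each phase with a sign flip.
    grad_b = [-delta * c_k * math.sin(p) for c_k, p in zip(c, phases)]
    grad_a = [[g * x_i for x_i in x] for g in grad_b]  # frequency gradients
    return grad_a, grad_b, grad_c
```

The gradients with respect to the frequency parameters oscillate in sign as the phase changes, which is one way to see why the error surface has many local minima.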

The network structure used in this embodiment uses a frequency-domain representation of the data distribution. In NN 80, therefore, the frequency, phase, and amplitude components, which can be interpreted as frequency-domain properties of the function being approximated, can each be controlled individually. In general, the global trend of a function is expressed by its low-frequency components, and capturing the global data distribution improves generalization performance. In this embodiment, therefore, regularization is applied to the frequency components so that the data distribution is approximated by the low-frequency region. The regularization term is a convex function of the parameters representing the frequency components, with its vertex at the origin. If the objective function for learning is the error between the approximating function and the training data, adding the regularization term to the error gives an objective function that takes its minimum near the origin and is convex on a global scale. FIGS. 4 and 5 show examples of the shape of the objective function with respect to the value of a frequency parameter in the network used in this embodiment. FIG. 4 shows an example without regularization. FIG. 5 shows an example in which the L2 norm of the parameters representing the frequency components is added to the objective function to be minimized (L2 regularization).
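The effect of the regularization term on the shape of the objective can be explored with a toy one-dimensional example. The sketch below is not the setting of FIGS. 4 and 5: it assumes a single COS element cos(w x) fitted to samples of cos(3 x), with a penalty lam * w^2 on the frequency parameter w. The target frequency, sample grid, and names are all hypothetical.

```python
import math

XS = [i / 10.0 for i in range(-30, 31)]   # sample points in [-3, 3]
TARGET_FREQ = 3.0                         # frequency of the toy target


def data_error(w):
    """Mean squared error between cos(w x) and the target cos(3 x)."""
    return sum((math.cos(w * x) - math.cos(TARGET_FREQ * x)) ** 2
               for x in XS) / len(XS)


def objective(w, lam=0.0):
    """Data error plus an L2 penalty on the frequency parameter w."""
    return data_error(w) + lam * w * w


def local_minima(lam, lo=-10.0, hi=10.0, steps=2001):
    """Grid points whose objective is below both neighbours."""
    ws = [lo + (hi - lo) * i / (steps - 1) for i in range(steps)]
    vals = [objective(w, lam) for w in ws]
    return [ws[i] for i in range(1, steps - 1)
            if vals[i] < vals[i - 1] and vals[i] < vals[i + 1]]
```

Without the penalty the error is symmetric in w, so w = 3 and w = -3 fit equally well. Calling `local_minima` with different values of `lam` shows how the penalty tilts the surface toward the origin.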

Comparing FIGS. 4 and 5, FIG. 4 has multiple minima, whereas FIG. 5 has only one global minimum, located near the origin. Each parameter is therefore learned so that the error function converges to this global minimum. As FIG. 5 makes clear, however, the graph also has multiple local minima, and in an ordinary learning process there is a risk that the parameters converge to one of them. That outcome must be avoided. In this embodiment, therefore, a global optimization technique such as a quasi-Newton method, including the L-BFGS method, is additionally used for the frequency parameters. Quasi-Newton methods are characterized by being less prone to converging to a local optimum. Among them, the L-BFGS method is known to require little memory, so it can be used even on a device with limited computational resources.
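Training one hidden layer unit with a quasi-Newton method might look like the sketch below. It is an assumption-laden illustration, not the implementation used in the embodiment: it relies on NumPy and SciPy, uses SciPy's `L-BFGS-B` solver (a bound-constrained variant of L-BFGS, run here without bounds), assumes a mean squared-error loss, and fits synthetic one-dimensional data. All names and values are hypothetical.

```python
import numpy as np
from scipy.optimize import minimize


def make_objective(X, t, n_cos, lam):
    """Return f(theta) -> (loss, gradient) for one hidden layer unit."""
    n_samples, n_inputs = X.shape

    def objective(theta):
        # Unpack the flat vector into frequencies a, phases b, amplitudes c.
        a = theta[:n_cos * n_inputs].reshape(n_cos, n_inputs)
        b = theta[n_cos * n_inputs:n_cos * n_inputs + n_cos]
        c = theta[n_cos * n_inputs + n_cos:]
        phases = X @ a.T + b                      # shape (n_samples, n_cos)
        cos_p, sin_p = np.cos(phases), np.sin(phases)
        residual = cos_p @ c - t                  # shape (n_samples,)
        # Mean squared error plus L2 penalty on the input-side weights only.
        loss = 0.5 * np.mean(residual ** 2) + lam * np.sum(a ** 2)
        grad_c = cos_p.T @ residual / n_samples
        grad_phase = -(residual[:, None] * sin_p) * c / n_samples
        grad_b = grad_phase.sum(axis=0)
        grad_a = grad_phase.T @ X + 2.0 * lam * a
        return loss, np.concatenate([grad_a.ravel(), grad_b, grad_c])

    return objective


rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(200, 1))         # synthetic inputs
t = np.cos(2.0 * X[:, 0])                         # synthetic targets
n_cos = 3
theta0 = rng.normal(scale=0.5, size=n_cos * X.shape[1] + 2 * n_cos)
objective = make_objective(X, t, n_cos, lam=1e-3)
initial_loss, _ = objective(theta0)
# jac=True tells SciPy that objective returns both the loss and its gradient.
result = minimize(objective, theta0, jac=True, method="L-BFGS-B")
```

The limited-memory update keeps only a short history of gradient differences rather than a full Hessian approximation, which is the property that makes the method usable with little memory.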

[Configuration]
FIG. 6 shows, in block diagram form, a character recognition device 180 as an example that employs an NN configured as described above, together with a character recognition NN learning device 186 that trains the NN of character recognition device 180.

Referring to FIG. 6, character recognition device 180 receives an input image 182 containing handwritten characters, performs handwritten character recognition, and outputs character recognition text 184. In this example, character recognition device 180 receives input image 182 from a separate scanner, storage device, or the like (not shown). Character recognition device 180 includes: an input image storage device 200 that receives and temporarily stores input image 182; an image correction unit 202 that applies corrections such as skew correction to the input image stored in input image storage device 200, separates it into individual character images, and normalizes each character image; an image storage unit 204 that stores the character images character by character; a trained character recognition NN 208 consisting of the NN according to this embodiment; and a decoder 210 that reads the character image of each character from image storage unit 204, inputs it to character recognition NN 208, and outputs character recognition text 184 based on the result.

It is assumed that character recognition NN 208 has been trained in advance by character recognition NN learning device 186. Character recognition NN learning device 186 includes: a handwritten character database (DB) 240 that stores images of handwritten characters together with their character labels; a learning data generation unit 242 that corrects the character images of the handwritten characters stored in handwritten character DB 240 by the same method as image correction unit 202 and generates learning data by pairing each image with its character label; a learning data storage device 244 that stores the learning data generated by learning data generation unit 242; and a learning processing unit 246 that trains character recognition NN 208 by the method described above using the learning data stored in learning data storage device 244. The number of nodes in the input layer of character recognition NN 208 equals the number of elements of a learning data item excluding the character label. The character label serves as teacher data during training and is used to generate the correct-answer vector used for error backpropagation from the output layer side of character recognition NN 208.

[Operation]
Character recognition NN learning device 186 and character recognition device 180 described above operate as follows. First, character recognition NN learning device 186 trains character recognition NN 208. After that, character recognition device 180 performs character recognition on input image 182.

In training, learning data generation unit 242 reads the handwritten character images stored in handwritten character DB 240, corrects and normalizes them in the same way as image correction unit 202, and combines them with their character labels to generate learning data. The generated learning data are accumulated in learning data storage device 244. Learning processing unit 246 trains character recognition NN 208 by the method described above using the learning data accumulated in learning data storage device 244. Once learning processing unit 246 has finished training character recognition NN 208, character recognition device 180 can perform character recognition on input image 182.

In character recognition, input image 182 is generated by a scanner or the like (not shown) and stored in input image storage device 200. Image correction unit 202 corrects the image by performing skew correction, character region extraction, character region normalization, and so on, and stores the result in image storage unit 204. Decoder 210 reads the image from image storage unit 204 and gives it to character recognition NN 208 as input. As a result of its training, character recognition NN 208 returns to decoder 210 an output that identifies the character label corresponding to the input character image. Decoder 210 generates character recognition text 184 from this character label and outputs it.

[Experimental results]
To evaluate the performance of the character recognition NN 208, an experiment on handwritten character recognition was conducted using the MNIST handwritten digit data set. Table 1 shows the performance of the present embodiment together with that of two NN classification models: the maxout model described in Non-Patent Document 6 and the rectifier model described in Non-Patent Document 5.

In Table 1, for example, the "30" in "Embodiment (30/10)" denotes the number of hidden layer units, and the "10" denotes the number of COS elements in each hidden layer unit. The other entries follow the same notation.

As Table 1 shows, the classification performance of the NN according to the present embodiment falls short of the state-of-the-art methods, but it is not far behind. What is noteworthy is the small number of parameters. Embodiment (30/10) has 472K parameters, whereas maxout has 1,233K and rectifier has 3,798K. The present embodiment thus achieves relatively high accuracy with few parameters. Moreover, the parameter count drops much further, to 157K in embodiment (20/10) and to 118K in embodiment (10/15), yet the classification error rate rises only slightly.
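The parameter counts quoted above can be checked with simple arithmetic. The following is a minimal sketch, assuming a 784-pixel MNIST input, 10 output classes, and one bias term per COS element, Σ element, and output node; the bias terms and the function name `count_parameters` are assumptions for illustration and are not stated in the source. Under these assumptions the formula gives roughly 157K for (20/10) and 118K for (10/15). It gives about 236K for a 30/10 configuration, so the 472K figure quoted for embodiment (30/10) is not reproduced by this sketch and may rest on bookkeeping that is not modeled here.

```python
def count_parameters(n_inputs, n_hidden_units, n_cos_per_unit, n_outputs):
    """Count weights and biases of a single-hidden-layer network of COS/sigma units.

    Assumption (not from the source): every COS element, sigma element,
    and output node has exactly one bias term.
    """
    n_cos = n_hidden_units * n_cos_per_unit
    # Each COS element is connected to every input node.
    cos_params = n_cos * (n_inputs + 1)
    # Each sigma element sees only the COS elements of its own hidden unit.
    sigma_params = n_hidden_units * (n_cos_per_unit + 1)
    # Each output node is connected to every sigma element.
    output_params = n_outputs * (n_hidden_units + 1)
    return cos_params + sigma_params + output_params


# MNIST images are 28 x 28 = 784 pixels with 10 digit classes.
for units, cos_elems in [(20, 10), (10, 15)]:
    print(units, cos_elems, count_parameters(784, units, cos_elems, 10))
```

Almost all parameters sit on the input-to-COS connections, which is why the total scales with the product of the two numbers in the configuration label.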

Furthermore, to confirm how much regularization contributes to generalization performance, the results of training the character recognition NN 208 with regularization were compared with those of training it without regularization. FIG. 7 shows the case with regularization and FIG. 8 the case without. In both figures, the solid lines 260 and 280 show the classification error rate on the training data, and the broken lines 264 and 284 show the classification error rate on the test data. The horizontal axis of each graph is the number of training iterations, and the vertical axis is the classification error rate at that iteration.

As the solid line 280 in FIG. 8 shows, without regularization the classification error on the training data falls sharply as training proceeds. However, as the broken line 284 shows, accuracy on the test data hardly improves. In contrast, as FIG. 7 shows, with regularization the error rate on the training data (solid line 260) does not fall as far as it does without regularization, but the error rate on the test data (broken line 264) falls as training proceeds. In other words, there are fewer classification errors and classification accuracy is higher. Generalization performance is therefore higher than in the case without regularization.
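The claims describe the regularization as L2 regularization applied to the input-side weight parameters of each hidden layer. The following is a minimal sketch of such a penalty, assuming the common form 0.5 * lam * sum(w^2); the coefficient `lam`, the factor 0.5, and the function names are assumptions for illustration, not values taken from the source.

```python
def l2_penalty(weights, lam):
    """Return (penalty, gradient) for 0.5 * lam * sum of squared weights.

    `weights` is a list of rows, one row of input-side weights per COS element.
    """
    penalty = 0.5 * lam * sum(w * w for row in weights for w in row)
    # The gradient of the penalty with respect to each weight is lam * w.
    grad = [[lam * w for w in row] for row in weights]
    return penalty, grad


def regularized_loss(data_loss, input_weights, lam):
    """Add the L2 penalty on the input-side weights to a given data loss."""
    penalty, _ = l2_penalty(input_weights, lam)
    return data_loss + penalty
```

During training, the penalty gradient would be added to the backpropagated gradient of the data loss before each parameter update. The penalty pulls the input-side weights toward zero, which lowers the frequency of the cosine response along each input direction.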

[Effects of the embodiment]
As described above, the embodiment improves generalization performance without requiring a multi-layer structure. Parallel computation therefore accounts for a larger share of the overall computation, which improves computational efficiency on a parallel computer. In addition, reducing the number of layers reduces the waiting time at each layer, which cuts both the computation time and the computing resources needed at prediction time.

The Σ elements, which are the computational units of the intermediate layer, do not share the outputs of the COS elements that feed them, so the number of paths in error backpropagation during training is reduced. In other words, the credit assignment problem is eased. Compared with a fully connected structure, the NN of the above embodiment can therefore be optimized efficiently, which shortens training time.
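The structure described here, in which each Σ element reads only its own group of COS elements, can be sketched as follows. This is a minimal illustration under stated assumptions: `tanh` is assumed as the Σ activation (the source requires only a subdifferentiable function), and all function and parameter names are hypothetical.

```python
import math


def hidden_unit_forward(x, cos_weights, cos_biases, sigma_weights, sigma_bias):
    """One hidden layer unit: private COS elements feeding a single sigma element."""
    cos_out = [
        math.cos(sum(w * xi for w, xi in zip(ws, x)) + b)
        for ws, b in zip(cos_weights, cos_biases)
    ]
    z = sum(v * c for v, c in zip(sigma_weights, cos_out)) + sigma_bias
    return math.tanh(z)  # assumed sigma activation, not specified in the source


def hidden_layer_forward(x, units):
    """Evaluate every hidden unit independently; no COS output is shared."""
    return [hidden_unit_forward(x, **u) for u in units]


def cos_to_sigma_connections(n_units, n_cos_per_unit, shared=False):
    """Count COS-to-sigma connections, which are also backpropagation paths."""
    n_cos = n_units * n_cos_per_unit
    # Shared (fully connected): every COS element feeds every sigma element.
    return n_cos * n_units if shared else n_cos
```

For 30 units with 10 COS elements each, the unshared structure has 300 COS-to-Σ connections, whereas a fully connected layer of the same size would have 9,000. Because the units do not depend on one another, they can also be evaluated in parallel.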

Conventional regularization aimed to reduce the number of parameters representing the distribution, so the parameters that took part in classification were confined to a portion of the network. In contrast, the above embodiment uses as its activation function a periodic function, cosine or sine, whose value keeps changing over the whole domain. Every element therefore affects the classification result. Since this is expected to reduce the number of parameters that must be prepared in advance, the overall scale of computation shrinks, enabling high-performance classification on a small computer. The above embodiment uses the cosine function as the periodic function, but as noted above a sine function, or a combination of cosine and sine functions, may be used instead. A piecewise-linear approximation of the cosine or sine function is likewise periodic and can be used in place of the cosine or sine function. Such a function is not differentiable at the points where its line segments join, so training with a quasi-Newton method cannot be applied in the usual way. Even so, training with a quasi-Newton method is possible provided that a subderivative can be defined at the non-differentiable points (that is, the function is subdifferentiable), as is the case for a piecewise-linear approximation. In short, the activation function used in the neuron elements of a hidden unit, such as the COS elements, need only be subdifferentiable.
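To make the idea of a subdifferentiable periodic activation concrete, the sketch below uses a triangle wave as one possible piecewise-linear approximation of the cosine function. The choice of a triangle wave (knots only at multiples of π), the choice of 0 as the subgradient at the kinks, and the function names are assumptions for illustration; the source does not specify how many line segments are used.

```python
import math


def pl_cos(x):
    """Triangle-wave approximation of cos(x), linear between extrema at multiples of pi."""
    # Reduce x to r in [-pi, pi); the approximation is periodic with period 2*pi.
    r = (x + math.pi) % (2 * math.pi) - math.pi
    return 1.0 - 2.0 * abs(r) / math.pi


def pl_cos_subgrad(x):
    """Return one element of the (generalized) subdifferential of pl_cos at x."""
    r = (x + math.pi) % (2 * math.pi) - math.pi
    if r == 0.0 or r == -math.pi:
        # At a kink any slope in [-2/pi, 2/pi] is admissible; 0 is chosen here.
        return 0.0
    return -2.0 / math.pi if r > 0 else 2.0 / math.pi
```

A gradient-based optimizer can call `pl_cos_subgrad` wherever it would otherwise call the derivative of the activation. Away from the kinks the two coincide, and at a kink the returned value is still a valid slope choice.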

According to the above embodiment, generalization performance is obtained with less computation than before. Consequently, when a classification model is used on a mobile device with limited computing power, applications and services that run entirely within the mobile device can be provided.

[Implementation by computer]
The character recognition device 180 and the character recognition NN learning device 186 according to the embodiment of the present invention can both be realized by computer hardware and a computer program executed on that hardware. FIG. 9 shows the external appearance of such a computer system 330, and FIG. 10 shows its internal configuration.

Referring to FIG. 9, the computer system 330 includes a computer 340 having a memory port 352 and a DVD (Digital Versatile Disc) drive 350, a keyboard 346, a mouse 348, and a monitor 342.

Referring to FIG. 10, in addition to the memory port 352 and the DVD drive 350, the computer 340 includes a CPU (Central Processing Unit) 356; a bus 366 connected to the CPU 356, the memory port 352, and the DVD drive 350; a read-only memory (ROM) 358 that stores a boot program and the like; a random access memory (RAM) 360 that is connected to the bus 366 and stores program instructions, system programs, work data, and the like; and a hard disk 354. The computer system 330 further includes a network interface (I/F) 344 that provides a connection to a network 368 enabling communication with other terminals.

A computer program for causing the computer system 330 to function as each functional unit of the character recognition device 180 and the character recognition NN learning device 186 according to the above embodiment is stored on a DVD 362 or a removable memory 364 that is loaded into the DVD drive 350 or the memory port 352, and is then transferred to the hard disk 354. Alternatively, the program may be transmitted to the computer 340 through the network 368 and stored on the hard disk 354. The program is loaded into the RAM 360 when it is executed. The program may also be loaded directly into the RAM 360 from the DVD 362, from the removable memory 364, or via the network 368.

This program includes an instruction sequence made up of a plurality of instructions for causing the computer 340 to function as each functional unit of the character recognition device 180 and the character recognition NN learning device 186 according to the above embodiment. Some of the basic functions needed for the computer 340 to perform these operations are provided by the operating system or third-party programs running on the computer 340, or by various dynamically linkable programming toolkits or program libraries installed on the computer 340. For example, software tools provided by commercially available statistical processing libraries can be used for the error backpropagation method and the L-BFGS method. The program itself therefore need not include all of the functions required to realize the system and method of this embodiment. It need only include those instructions that realize the functions of the system or device described above by dynamically calling, at run time and in a controlled manner that yields the desired result, the appropriate functions or the appropriate programs in a programming toolkit or program library. Of course, the program alone may provide all of the necessary functions.

[Application examples]
The above embodiment relates to a handwritten character recognition device. However, the present invention is not limited to such an embodiment. For example, it can also be used for the acoustic models and statistical language models used in speech recognition devices. It can also be applied to the translation models used in automatic translation devices and to the recognition of image patterns other than handwritten characters.

In the case of an acoustic model, the input is not the speech itself but commonly used features, for example MFCC-based features obtained from the speech signal, LPC-based features, mel filter bank outputs, and their first-order and second-order differences, all of which can be used as they are. The number of nodes in the input layer is set equal to the number of elements in the feature vector. The number of nodes in the output layer is set equal to the assumed number of phonemes.

In the case of a language model, one approach is as follows. For a trigram language model, consider a vector with one element for every word in the assumed vocabulary. A first vector is generated in which the element corresponding to the word two positions back is 1 and all other elements are 0, and a second vector is generated in which the element corresponding to the immediately preceding word is 1 and all other elements are 0. The two vectors are concatenated into a single feature vector. The input layer therefore has twice as many nodes as the assumed vocabulary size, and the output layer has as many nodes as the assumed vocabulary size.
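The trigram input encoding described above can be sketched as follows. This is a minimal illustration; the toy vocabulary and the function name `trigram_input` are hypothetical and are not taken from the source.

```python
def trigram_input(vocab, word_two_back, word_one_back):
    """Concatenate two one-hot vectors: the word two back, then the word one back."""
    index = {word: i for i, word in enumerate(vocab)}
    v = len(vocab)
    features = [0] * (2 * v)
    features[index[word_two_back]] = 1      # first one-hot vector
    features[v + index[word_one_back]] = 1  # second one-hot vector
    return features


# Hypothetical four-word vocabulary, for illustration only.
vocab = ["the", "cat", "sat", "mat"]
x = trigram_input(vocab, "the", "cat")
print(len(x), x)
```

The resulting vector has 2 × 4 = 8 elements, matching the rule that the input layer has twice as many nodes as the vocabulary size. The output layer for this toy vocabulary would have 4 nodes, one per candidate next word.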

In addition, the present invention can be applied to any machine learning and prediction device that uses an NN.

The embodiment disclosed here is merely an example, and the present invention is not limited to the embodiment described above. The scope of the present invention is indicated by each of the claims, read in light of the detailed description of the invention, and includes all modifications within the meaning and scope equivalent to the wording of the claims.

180 Character recognition device
182 Input image
184 Character recognition text
186 Character recognition NN learning device
208 Character recognition NN
210 Decoder

Claims

1. A neural network comprising:
an input layer having a plurality of input nodes; and
a nonlinear function layer having an input connected to receive the output from the input layer,
wherein the nonlinear function layer includes a plurality of hidden layers, each connected to receive outputs from the plurality of input nodes of the input layer and each capable of approximating an arbitrary independent component, and
each of the plurality of hidden layers includes:
a plurality of neuron elements, each having inputs connected with weights to the plurality of input nodes and each using a subdifferentiable periodic function as an activation function; and
an output aggregation element connected to receive, with weights, the outputs of the plurality of neuron elements in the same hidden layer and using a subdifferentiable function as an activation function.

2. The neural network according to claim 1, wherein the periodic function is a cosine function, a sine function, a combination thereof, or a piecewise-linear approximation of a cosine function or a sine function.

3. The neural network according to claim 1, wherein the neural network is trained using L2 regularization on the weight parameters on the input side of each of the plurality of hidden layers.

4. The neural network according to claim 1, wherein the neural network is trained using a quasi-Newton method.

5. The neural network according to claim 1, further comprising an output layer connected to receive the outputs of the output aggregation elements of the plurality of hidden layers,
wherein the output layer includes a plurality of output neuron elements, each connected to receive, with weights, the outputs of the output aggregation elements of the plurality of hidden layers.

6. A computer program for causing a computer to function as the neural network according to any one of claims 1 to 5.