JP6704585B2

JP6704585B2 - Information processing equipment

Info

Publication number: JP6704585B2
Application number: JP2018206370A
Authority: JP
Inventors: 可直佐藤; 成満池田; 真人藤野
Original assignee: Fairy Devices Inc
Current assignee: Fairy Devices Inc
Priority date: 2018-11-01
Filing date: 2018-11-01
Publication date: 2020-06-03
Anticipated expiration: 2038-11-01
Also published as: JP2020071755A

Description

The present invention relates to a neural-network-based information processing apparatus capable of processing various kinds of information, sequence information, and time-series information, and of performing prediction, identification, and execution. In particular, it relates to an information processing apparatus suited to processing time-series data such as text, speech, music, and video.

Deep learning is an algorithm modeled on the nerve cells (neurons) of the biological brain, obtained by deepening the layers of a neural network, which is one type of machine learning. It has a long history, with research dating back to the 1940s. The basic structure of a neural network comprises an input layer, a plurality of hidden layers, and an output layer, with the nodes (units) in each layer connected by edges; networks with many hidden layers are called deep learning. Each layer has an activation function and each edge has a connection weight. The value of each node is calculated from the values of the nodes in the preceding layer that connect to it, the connection weights of the edges, and the activation function of the layer. Various node connection schemes and calculation methods have been developed, and the field has advanced rapidly in recent years, reaching practical use in areas such as image recognition and speech recognition.
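The node-value calculation described above can be illustrated with a minimal sketch. The function name `layer_forward`, the choice of `tanh` as activation function, and the example weights are assumptions made for illustration; the patent does not specify them.

```python
import math

def layer_forward(prev_values, weights, activation=math.tanh):
    """Compute the node values of one layer from the preceding layer.

    weights[j][i] is the connection weight on the edge from node i of the
    preceding layer to node j of this layer.
    """
    return [
        activation(sum(w * x for w, x in zip(row, prev_values)))
        for row in weights
    ]

# Two input nodes -> three hidden nodes -> one output node (illustrative weights).
x = [0.5, -1.0]
hidden = layer_forward(x, [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
output = layer_forward(hidden, [[1.0, 1.0, 1.0]])
```

Stacking more calls to `layer_forward` between input and output corresponds to adding hidden layers.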

Within deep learning, the convolutional neural network (CNN) and the generative adversarial network (GAN) have a proven record in image processing (Non-Patent Document 1). A CNN abstracts an input image in its hidden layers by reducing it while preserving its features, and uses the abstracted image to classify and recognize the input. More recently, a network structure that learns from teacher images to produce generated images approximating them has been developed. Attention is now focused on the effectiveness of the GAN, a structure composed of two neural networks: this generator, and a discriminator that distinguishes teacher images from generated images.

Whereas image processing handles only fixed-length sequence data in the form of two-dimensional rectangular data, the recurrent neural network (RNN) was developed as a network structure capable of handling variable-length time-series data such as speech (Non-Patent Document 2). Its distinguishing feature is that the values of the hidden layer are fed back into the hidden layer. Because training usually applies backpropagation through time (BPTT), all time-series data going back into the past is required for learning. When long sequences are processed, vanishing gradients, overfitting, and similar problems arise as the number of hidden layers grows, and the amount of computation becomes enormous, so that only short sequences can be processed. Learning methods such as RPROP (Resilient Backpropagation) and RTRL (Real-Time Recurrent Learning) have therefore been studied, but they have not solved these problems.

As a network structure that addresses these problems of the RNN, long short-term memory (LSTM), developed in 1997, has attracted attention; it has a data storage unit in which data from long before is recorded in association with later data. However, LSTM is a technique for eliminating vanishing gradients and still basically applies BPTT as its learning method. It continues to require a large amount of supervised learning, the considerable time and effort needed for computation are difficult to reduce, and it therefore carries the inherent problem of extremely high cost (Non-Patent Document 1).

In recent years, reservoir computing (RC) has been proposed as a new neural network structure that solves the problems of RNNs and LSTMs in handling such time-series data (Non-Patent Documents 3 and 4). RC is a type of RNN composed of three layers: an input layer, a reservoir layer (hidden layer), and an output layer. Its distinguishing feature is that the connection weights on the edges between the input layer and the reservoir layer, and on the edges within the reservoir layer, remain fixed at their initial values; learning adjusts only the connection weights on the edges connecting the reservoir layer to the output layer. In the reservoir layer the nodes are connected by edges without regularity, and the layer is considered to have the function of learning incoming information by unsupervised learning while accumulating the learned information.
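The split between fixed and trainable weights can be sketched as follows. This is a dependency-free illustration, not the patent's implementation: the reservoir size, the weight ranges, the one-step-ahead sine prediction task, and the delta-rule readout training are all assumptions chosen to keep the example short.

```python
import math
import random

random.seed(0)
N_IN, N_RES = 1, 20

# Input->reservoir and reservoir->reservoir weights: random, then frozen.
W_in = [[random.uniform(-0.5, 0.5) for _ in range(N_IN)] for _ in range(N_RES)]
W_res = [[random.uniform(-0.2, 0.2) for _ in range(N_RES)] for _ in range(N_RES)]
# Reservoir->output weights: the only trainable parameters.
w_out = [0.0] * N_RES

def step(state, u):
    """One reservoir update: x(t+1) = tanh(W_in u(t) + W_res x(t))."""
    return [
        math.tanh(sum(wi * ui for wi, ui in zip(W_in[j], u))
                  + sum(wr * xs for wr, xs in zip(W_res[j], state)))
        for j in range(N_RES)
    ]

def run_epoch(signal, lr):
    """Drive the reservoir with the signal and return the one-step-ahead MSE.

    With lr > 0 the readout w_out is adapted by the delta rule;
    W_in and W_res are never modified.
    """
    state, sq_err = [0.0] * N_RES, 0.0
    for t in range(len(signal) - 1):
        state = step(state, [signal[t]])
        pred = sum(w * x for w, x in zip(w_out, state))
        err = signal[t + 1] - pred
        sq_err += err * err
        for j in range(N_RES):
            w_out[j] += lr * err * state[j]
    return sq_err / (len(signal) - 1)

signal = [math.sin(0.2 * t) for t in range(300)]
mse_before = run_epoch(signal, lr=0.0)   # untrained readout
for _ in range(20):
    run_epoch(signal, lr=0.02)
mse_after = run_epoch(signal, lr=0.0)    # trained readout, evaluated without updates
```

Because only `w_out` changes, training reduces to fitting a linear map on top of the reservoir states, which is where the low computational cost of RC comes from.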

Algorithms in the RC category include the echo state network (ESN) and the liquid state machine (LSM). Both impose a small computational burden, can handle time-series data, and can obtain learning results comparable to those of RNNs and similar networks (Non-Patent Documents 5 and 6). As a representative example, the structure of an ESN is shown in FIG. 1. In addition, FORCE (First-Order Reduced and Controlled Error) and BPDC (Backpropagation-Decorrelation) have been proposed as learning methods characteristic of the connection between the reservoir layer and the output layer in RC (Non-Patent Documents 7 and 8). However, there is a problem: to give RC high performance, the group of activation functions required to execute the task must be present in the reservoir layer.
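As a rough illustration of readout learning in the style of FORCE, the sketch below shows a recursive-least-squares (RLS) update of the reservoir-to-output weights. It is a simplification under stated assumptions: the feedback of the output into the reservoir used in FORCE is omitted, the state vectors `r` are random stand-ins for reservoir states, and `w_true` is a hypothetical target mapping used only to check convergence.

```python
import random

def rls_step(w, P, r, target):
    """One RLS update of readout weights w given state r; returns the prior error."""
    n = len(w)
    Pr = [sum(P[i][j] * r[j] for j in range(n)) for i in range(n)]
    denom = 1.0 + sum(r[i] * Pr[i] for i in range(n))
    # P <- P - (P r r^T P) / (1 + r^T P r); P stays symmetric.
    for i in range(n):
        for j in range(n):
            P[i][j] -= Pr[i] * Pr[j] / denom
    err = sum(w[i] * r[i] for i in range(n)) - target  # error before the update
    for i in range(n):
        w[i] -= err * Pr[i] / denom                    # Pr / denom equals (updated P) r
    return err

random.seed(0)
N = 3
w_true = [0.5, -1.0, 2.0]   # hypothetical target mapping
w = [0.0] * N
# P(0) = I / alpha with alpha = 0.01 (an assumed regularization constant).
P = [[100.0 if i == j else 0.0 for j in range(N)] for i in range(N)]
for _ in range(200):
    r = [random.uniform(-1.0, 1.0) for _ in range(N)]
    rls_step(w, P, r, sum(a * b for a, b in zip(w_true, r)))
```

After the loop, `w` is close to `w_true`; in an actual RC system `r` would be the reservoir state at each time step and `target` the teacher signal.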

Meanwhile, a representative example of unsupervised learning that, like RC, is attracting attention as deep learning with a small computational burden is the self-organizing map (SOM), proposed by Kohonen in 1982, in which input information is classified in a self-organizing manner (Non-Patent Documents 9 and 10). An SOM has an input layer and a competitive layer; every node of the input layer is connected by an edge to every node of the competitive layer, which has more nodes than the input layer. The connection weights of the edges are initially assigned arbitrarily, but Kohonen's algorithm updates them at each learning step so that the input information is classified accurately. Because an SOM can handle multidimensional data, requires no complex computation, and yields visual results, it is expected to find application in gene analysis, speech recognition, image analysis, robot control, and other fields. Algorithms that differ little from it include ART (Adaptive Resonance Theory Model) and LVQ (Learning Vector Quantization). As a representative example, the structure of an SOM is shown in FIG. 2.
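The weight update performed by Kohonen's algorithm can be sketched as below. The grid size, the Gaussian neighborhood function, the linear decay schedules, and the two-cluster toy data are assumptions for illustration and are not taken from the patent.

```python
import math
import random

random.seed(0)
GRID, DIM = 4, 2   # 4x4 competitive layer, 2-dimensional inputs

# One weight vector per competitive-layer node, initialized at random.
weights = {(i, j): [random.random() for _ in range(DIM)]
           for i in range(GRID) for j in range(GRID)}

def best_matching_unit(x):
    """Return the grid position of the node whose weights are closest to x."""
    return min(weights,
               key=lambda node: sum((w - xi) ** 2 for w, xi in zip(weights[node], x)))

def train_step(x, lr, radius):
    """Pull the winner and its grid neighbors toward the input x."""
    bmu = best_matching_unit(x)
    for node, w in weights.items():
        grid_d2 = (node[0] - bmu[0]) ** 2 + (node[1] - bmu[1]) ** 2
        h = math.exp(-grid_d2 / (2.0 * radius ** 2))   # neighborhood strength
        for k in range(DIM):
            w[k] += lr * h * (x[k] - w[k])
    return bmu

STEPS = 400
for t in range(STEPS):
    frac = 1.0 - t / STEPS                       # decays linearly toward 0
    center = 0.1 if t % 2 == 0 else 0.9          # alternate between two clusters
    x = [center + random.uniform(-0.05, 0.05) for _ in range(DIM)]
    train_step(x, lr=0.5 * frac, radius=0.5 + 1.5 * frac)

bmu_low = best_matching_unit([0.1, 0.1])
bmu_high = best_matching_unit([0.9, 0.9])
```

After training, inputs from the two clusters are mapped to different nodes of the competitive layer, which is the classification behavior the paragraph describes.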

However, such data-clustering neural network structures require iterative learning, and when the data set is large the amount of computation becomes enormous, growing in proportion to the number of learning iterations and the number of data items. A further problem is that stable performance cannot be obtained when the initial connection weights or the number of learning iterations are inappropriate.

In particular, when the time-series data is speech, machine learning is recognized as effective for speech recognition systems. Such a system consists mainly of extraction of features from the speech information, modeling of the extracted features, an evaluation criterion for estimating the model parameters, and an optimization algorithm. The method of modeling the speech features is especially important, and generative models, discriminative models, factor analysis models, and others have been proposed. Examples include GMM-UBM (Gaussian Mixture Model - Universal Background Model) and GMM-SV (supervector) as generative models, the SVM (Support Vector Machine) as a discriminative model, and the i-vector as a factor analysis model (Non-Patent Documents 11 and 12). This line of work has led to i-vector/PLDA (Probabilistic Linear Discriminant Analysis), the current state-of-the-art model. Even with this state-of-the-art model, performance degrades markedly when little training or identification data is available.

Japanese Patent No. 4093858

Non-Patent Document 1: "Easy Machine Learning", http://gagbot.net/machine-learning.
Non-Patent Document 2: "Predicting time-series data with a neural network", https://qiita.com/icoxfog417/items/2791ee878deee0d0fd9c.
Non-Patent Document 3: "A slightly unusual neural network: Reservoir Computing", https://qiita.com/kazoo04/items/71b659ced9dc0342a2b0.
Non-Patent Document 4: B. Schrauwen, D. Verstraeten, and J. V. Campenhout, "An overview of reservoir computing: theory, applications and implementations", ESANN'2007 proceedings - European Symposium on Artificial Neural Networks, Bruges (Belgium), 25-27 April 2007, d-side publi., ISBN 2-930307-07-2, pp. 471-482.
Non-Patent Document 5: H. Jaeger, "Echo state network", Scholarpedia, 2(9):2330 (2007), http://www.scholarpedia.org/article/Echo_state_network.
Non-Patent Document 6: S. Kok, "Liquid State Machine Optimization", https://pdfs.semanticscholar.org/379d/135c7ac1a5bded34100b98d04712e2473ec4.pdf.
Non-Patent Document 7: D. Sussillo and L. F. Abbott, "Generating Coherent Patterns of Activity from Chaotic Neural Networks", Neuron 63, 544-557, August 27, 2009.
Non-Patent Document 8: J. J. Steil, "Backpropagation-Decorrelation: online recurrent learning with O(N) complexity", http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.161.9279&rep=rep1&type=pdf.
Non-Patent Document 9: T. Kohonen, "Self-Organized Formation of Topologically Correct Feature Maps", Biol. Cybern., 43, 59-69 (1982).
Non-Patent Document 10: A. K. Jain, M. N. Murty, and P. J. Flynn, "Data Clustering: A Review", ACM Computing Surveys, Vol. 31, No. 3, September 1999, pp. 264-323.
Non-Patent Document 11: Takafumi Koshinaka and Koichi Shinoda, "International Trends in Speaker Recognition", Journal of the Acoustical Society of Japan, Vol. 69, No. 7 (2013), pp. 342-348.
Non-Patent Document 12: Tetsuji Ogawa and Tomoko Matsui, "Machine Learning Used for Speaker Recognition", Journal of the Acoustical Society of Japan, Vol. 69, No. 7 (2013), pp. 349-356.

Neural-network-based information processing technology has the following problems. Supervised approaches such as the RNN require a large amount of supervised learning and use BPTT as the learning method, so they need an enormous amount of computation and suffer from vanishing gradients and overfitting. RC approaches such as the ESN, which tune only the connection weights from the reservoir layer to the output layer, keep the amount of computation down, but when high performance is required the group of functions needed to execute the task must be present in the reservoir layer. Clustering-type unsupervised learning such as the SOM can, by its structure, handle only static information; it requires iterative learning, so the amount of computation becomes enormous in proportion to the number of iterations and the number of data items; and stable performance cannot be obtained when the initial connection weights or the number of iterations are inappropriate.

In speech processing in particular, even a speech identification system using i-vector/PLDA modeling, the current state of the art, suffers marked performance degradation when little training or identification data is available. Application of neural-network-based information processing technology to speech recognition systems is therefore being studied. Naturally, however, applying an RNN, an RC such as the ESN or LSM, or clustering-type unsupervised learning such as the SOM, ART, or LVQ to a speech recognition system gives rise to the respective problems described above.

An object of the present invention is to provide an information processing apparatus and an information identification apparatus that solve the above problems with a simple neural network structure and deliver excellent performance even though the computation is easy and small in amount, and in particular to provide such apparatuses that can also handle time-series data.

As a result of detailed study of neural network structures such as the ESN, LSM, and SOM and of the algorithms that execute them, the present inventors found that the above problems can be solved in RC by extracting in advance the features of the information to be input to the reservoir layer and pre-learning them. They further found that performance improves still more when the learning method at the connection between the reservoir layer and the output layer is optimized, and thereby completed the present invention.

That is, the present invention provides a neural-network-based information processing apparatus comprising at least: an information input unit; a feature extraction unit that performs unsupervised structure learning in which the information input to the information input unit is embedded in spatial information; an information storage unit that introduces the information structure-learned by the feature extraction unit into a network that learns it further by unsupervised learning, and accumulates that information while learning it; and an information reading unit that extracts an answer, by supervised learning, from the information accumulated in the information storage unit. Connecting these units directly in this order is efficient and preferable.

Furthermore, depending on the type, quality, quantity, and so on of the information, it is preferable to provide an information collection unit and an information processing unit upstream of the information processing apparatus of the present invention. Downstream of the apparatus, it is preferable to provide an output unit serving as a man-machine interface, in any of various forms, according to how the result of the information processing is to be used.

The information input unit, feature extraction unit, information storage unit, and information reading unit constituting the information processing apparatus of the present invention are not particularly limited. However, the feature extraction unit must be a data-clustering neural network structure that can also handle time-series data and can perform unsupervised structure learning in which the input information is embedded in spatial information. Likewise, the information storage unit must be an RC-type neural network structure that can also handle time-series data and can accumulate the learned information while learning the input information by unsupervised learning.

Specifically, as the feature extraction unit, all or part of a newer neural network structure such as the SOM, ART, or LVQ can be used, in which the information input unit and the feature extraction unit are connected as two layers and unsupervised iterative competitive learning is performed between them. Conventional methods that mainly perform dimensionality reduction by unsupervised learning, such as PCA (Principal Component Analysis), the autoencoder, and GTM (Generative Topographic Map), can also be used. As the information storage unit, all or part of an existing RC structure such as the ESN (Echo State Network) or LSM (Liquid State Machine) can be used, in which an information storage unit that performs unsupervised learning is connected to an information reading unit that reads out the learned information by supervised learning. Furthermore, FORCE or BPDC, learning methods applicable to RC, can be applied in the information reading unit.

In particular, connecting any one of the SOM, ART, or LVQ, in which the information input unit and the feature extraction unit are connected, to either the ESN or the LSM, in which the information storage unit and the information reading unit are connected, requires no construction of an unprecedented neural network structure. Although the method is simple, it makes it possible to provide, at low cost, an information processing apparatus with a small amount of computation and excellent performance. The combination of the SOM and the ESN is most preferable.

Furthermore, it is also possible to apply any one of the SOM, ART, or LVQ (Learning Vector Quantization) to the information input unit and the feature extraction unit and to use the reservoir layer of the ESN or LSM as the information storage unit, while applying in the information reading unit a learning method different from that of the ESN or LSM, namely FORCE or BPDC. In this case, it is preferable to choose among these options according to the type, quality, quantity, and so on of the information, such as speech, video, or text.

It is also possible to apply, as the feature extraction unit, a neural network structure executed by any one of the PCA, autoencoder, or GTM algorithms, and to apply the ESN or LSM to the information storage unit and the information reading unit. In this case as well, the reservoir layer of the ESN or LSM can be used as the information storage unit, and FORCE or BPDC can be applied as the information reading unit.

Thus, a major feature of the information processing apparatus of the present invention is that it applies an entirely new neural network structure, created by recombining all or part of conventional neural network structures into a wide variety of structures.

The information processing apparatus of the present invention is built on a neural network structure that can handle time-series data, but it is not limited to such data and can also process sequence data. In particular, because it can perform highly accurate information processing on enormous amounts of information without increasing the amount of computation, it is suited to identification tasks such as extraction, recognition, judgment, and diagnosis on video, speech, text, language, and the like, and to the execution of expressions, actions, and operations, as typified by the automated driving of automobiles. It is especially suited to speech information processing, including speech recognition, speaker identification, speech synthesis, emotion recognition, and information judgment.

The information processing apparatus of the present invention is characterized by comprising: an information input unit; a feature extraction unit that performs unsupervised structure learning in which the information input to the information input unit is embedded in spatial information; an information storage unit that introduces the information structure-learned by the feature extraction unit into a network that learns it further by unsupervised learning, and accumulates the structure-learned information while learning it; and an information reading unit that extracts an answer, by supervised learning, from the information accumulated in the information storage unit. More specifically, the present invention concerns an RC such as the ESN or LSM, which consists of an input layer, a reservoir layer, and an output layer and performs information processing through unsupervised learning in the reservoir layer and supervised learning that adjusts only the connection weights between the reservoir layer and the output layer. The apparatus is characterized in that the information supplied to the input layer of this RC is information that has undergone unsupervised structure learning in a clustering-type neural network structure. As a result, even with an enormous amount of information, the apparatus can deliver performance higher than that of the conventional art without an increase in the amount of computation, greatly reducing computation cost. Moreover, the apparatus does not require construction of an unprecedented neural network structure. It applies an entirely new structure created by recombining existing neural network structures into various structures according to the type, quality, quantity, and so on of the information, so it can be manufactured easily with a simple structure and the apparatus cost can also be greatly reduced. In a speech identification system in particular, the information processing apparatus of the present invention has higher identification ability than a system using i-vector/PLDA, the current state-of-the-art model. Furthermore, the apparatus can obtain output results in real time, with only minimal delay, for time-series data.

FIG. 1 is a schematic diagram showing the neural network structure of an ESN, a representative example of RC.
FIG. 2 is a schematic diagram showing the neural network structure of an SOM, a representative example of data-clustering deep learning.
FIG. 3 is a schematic diagram showing the concept of an information processing apparatus having the neural network structure of the present invention.
FIG. 4 is a schematic diagram showing the structure of a ROM (Reservoir with self-organized Mapping) according to an embodiment of the present invention.

The information processing apparatus of the present invention is described in detail below through an embodiment directed to speaker identification. The embodiment assumes use in a speech recognition apparatus and uses speaker voice information in which voices having a plurality of unique characteristics are emitted from a plurality of sources. The speech information that the apparatus can handle, and the speech recognition apparatuses to which it can be applied, are not limited to this embodiment. Although an embodiment handling speech information is taken up here as one example of the present technology, the information the apparatus can handle is not limited to speech; it can handle a wide range of information, including still images, video, and text, regardless of whether it is sequence data or time-series data. Nor is the configuration of the apparatus limited to this embodiment: it can be modified in various ways without departing from the gist of the present invention and is limited only by the technical idea set forth in the claims.

As shown in FIG. 4, the speech recognition apparatus according to one embodiment of the present invention comprises an information input unit 4-1, a feature extraction unit 4-2, an information storage unit 4-3, and an information reading unit 4-4, to which the input layer of an SOM, the competitive layer of the SOM, the reservoir layer of an ESN, and the output layer of the ESN are applied, respectively. The input layer 4-1 and the competitive layer 4-2 (4-12) of the SOM (2-1 and 2-2 in FIG. 2) are incorporated into the input layer of the ESN (1-1 in FIG. 1), creating a new neural network structure. This structure was named ROM (Reservoir with self-organized Mapping) and applied to a speaker identification apparatus. ROM is taken up here as one embodiment in order to explain the technical idea of the present invention concretely, and it is applied to an information processing apparatus that handles speech. The information is not limited to speech, however, and the structure can be applied to apparatuses that process any kind of information, including still images, video, music, and text.
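The way the SOM stage feeds the ESN reservoir can be sketched as follows. This is only an illustration of the data flow: the patent text here does not state how the competitive-layer output is encoded, so the one-hot winner code, the hand-written `codebook` standing in for a trained SOM, and all sizes and weight ranges are assumptions.

```python
import math
import random

random.seed(1)
N_SOM, N_RES = 4, 10

# Stand-in for a trained SOM competitive layer: one prototype vector per node.
codebook = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
# Fixed random ESN weights (SOM output -> reservoir, reservoir -> reservoir).
W_in = [[random.uniform(-0.5, 0.5) for _ in range(N_SOM)] for _ in range(N_RES)]
W_res = [[random.uniform(-0.3, 0.3) for _ in range(N_RES)] for _ in range(N_RES)]

def som_encode(x):
    """Feature extraction unit: one-hot code of the best matching SOM node."""
    dists = [sum((c - xi) ** 2 for c, xi in zip(proto, x)) for proto in codebook]
    winner = dists.index(min(dists))
    return [1.0 if k == winner else 0.0 for k in range(N_SOM)]

def reservoir_step(state, u):
    """Information storage unit: one ESN reservoir update."""
    return [
        math.tanh(sum(w * v for w, v in zip(W_in[j], u))
                  + sum(w * s for w, s in zip(W_res[j], state)))
        for j in range(N_RES)
    ]

def run_rom(sequence):
    """Pass a feature sequence through the SOM encoding, then the reservoir."""
    state = [0.0] * N_RES
    for x in sequence:
        state = reservoir_step(state, som_encode(x))
    return state   # a trained readout (information reading unit) would act on this

state_a = run_rom([[0.1, 0.0], [0.9, 0.1], [1.0, 0.9]])
state_b = run_rom([[1.0, 0.9], [0.9, 0.1], [0.1, 0.0]])
```

The two example sequences contain the same vectors in opposite order and end in different reservoir states, which shows that the reservoir retains temporal context from the SOM-encoded input.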

In this embodiment, voice information is input to the information input unit 4-1, and that voice information must be in a form suitable for speaker identification. For this reason, an information collection unit and an information processing unit based on conventional technology, as shown in FIG. 3 (not shown in FIG. 4), were provided as appropriate. Specifically, the information collection unit was a voice input device such as a microphone, and the information processing unit was a fast Fourier transform (FFT) analyzer that preprocesses the voice signal from the microphone into a form suitable for speaker identification. The processing method is not limited to this, however; any method that extracts speaker features from a voice signal can be applied. For example, a method that computes predetermined features mathematically, a method that extracts features by rule-based processing, formants, or the like may be used.

The identification result, meanwhile, is output to the information reading unit 4-4. Depending on how the speaker identification device is used, it is preferable to provide an existing output device such as a loudspeaker or a display as the output unit 3-7 shown in FIG. 3 (not shown in FIG. 4).

In the speaker identification device of this embodiment, voice information is input from the information collection unit and the information processing unit to the information input unit 4-1, passes through the feature extraction unit 4-2, the information storage unit 4-3, and the information reading unit 4-4, and the identification result is output and presented at the output unit. For example, with five speakers (speaker 1, speaker 2, speaker 3, speaker 4, speaker 5), if speaker 2 is speaking, speaker 2 is output as the identification result.

Next, the learning method and operation of the speaker identification device 4 shown in FIG. 4, which is one embodiment of the present invention, are described. As noted above, voice information suitable for speaker identification must be input, so the information processing unit applied an FFT to the voice information gathered by the information collection unit. This step is briefly described first.

A voice signal is continuous, and performing the FFT only after the entire utterance has finished is impractical. A time window that divides the voice signal into fixed intervals was therefore set, and the FFT was applied to the waveform within each window. A window function such as a rectangular window or a Hamming window is normally used for the time window; because discontinuities at both ends of the window are a problem, the Hamming window was used.
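The framing step above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `log_power_frames`, the 2048-point FFT size, and the small logarithm floor are assumptions introduced here. The 16 kHz rate, 100 ms frame width, and 25 ms frame shift are taken from the experiment described later in this document, where the input is a 1025-dimensional log power spectrum; a 2048-point real FFT happens to give 1025 bins.

```python
import numpy as np


def log_power_frames(signal, sample_rate, frame_ms, shift_ms, n_fft):
    """Cut a 1-D signal into overlapping frames, apply a Hamming window to
    each frame, and return the log power spectrum of every frame."""
    frame_len = int(round(sample_rate * frame_ms / 1000))
    shift = int(round(sample_rate * shift_ms / 1000))
    if n_fft < frame_len:
        raise ValueError("n_fft must be at least the frame length")
    if len(signal) < frame_len:
        return np.empty((0, n_fft // 2 + 1))
    n_frames = 1 + (len(signal) - frame_len) // shift
    window = np.hamming(frame_len)  # tapers both ends of each frame
    frames = np.stack([
        signal[k * shift:k * shift + frame_len] * window
        for k in range(n_frames)
    ])
    # rfft zero-pads each frame to n_fft and returns n_fft // 2 + 1 bins.
    power = np.abs(np.fft.rfft(frames, n=n_fft, axis=1)) ** 2
    return np.log(power + 1e-10)  # the floor value is an assumption


# Illustrative call: one second of noise at 16 kHz, 100 ms frames, 25 ms shift.
# The 2048-point FFT size is an assumption, chosen because it yields 1025 bins.
demo = log_power_frames(np.random.default_rng(0).standard_normal(16000),
                        sample_rate=16000, frame_ms=100, shift_ms=25, n_fft=2048)
```

Because each frame depends only on the samples inside its own window, spectra can be produced while the utterance is still in progress.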

Next, the unsupervised learning performed in the information input unit 4-1 and the feature extraction unit 4-2 is described. The information input unit 4-1 and the feature extraction unit 4-2 correspond, respectively, to the input layer 2-1 and the competitive layer 2-2 of SOM 2, the feedforward neural network shown schematically in FIG. 2. The description below uses the reference signs of the newly constructed neural network structure of FIG. 4.

The feature extraction unit 4-2 is in general either an array with nodes arranged in one dimension or a map with nodes arranged in two dimensions. Here it was a two-dimensional map, and unsupervised competitive learning was performed in which the high-dimensional information input to the information input unit 4-1 via the information collection unit and the information processing unit is output to the feature extraction unit 4-2 as a two-dimensional spatial pattern. The connection weights w_i in this unsupervised competitive learning were updated as follows. For input information x, which has been FFT-processed by the information processing unit and supplied to the information input unit 4-1, the node i^* is first obtained by equation (1) using the initialized connection weights w_i. Thereafter, for each input x, equations (1) and (2) are applied in succession: the node i^* whose connection weight is closest to x is found, and the connection weights of i^* and of the nodes in its neighborhood are updated.

Here, d is a distance function, γ(n) is a learning rate that decays with the number of learning iterations n, and N(i, j; n) is a neighborhood function that decreases with the distance D(i, j) between nodes i and j. In one embodiment of the present invention, the learning rate and the neighborhood function were obtained by equations (5), (6), and (7). γ_0 and λ are the initial learning rate and the learning decay factor, respectively. In this way all connection weights are normalized, and similar input data are projected onto nearby nodes of the feature extraction unit 4-2.
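The equations referred to above are not reproduced in this text, so the following is only a sketch of a standard self-organizing map step that fits the symbols defined here. The Euclidean distance for d, the Gaussian form of the neighborhood function, the exponential decay `gamma0 * lam ** n`, and the names `som_update`, `train_som`, and `sigma0` are all assumptions made for illustration.

```python
import numpy as np


def som_update(weights, coords, x, gamma, sigma):
    """One competitive-learning step; returns (updated weights, winner index)."""
    # Winner i*: node whose weight vector is closest to x (Euclidean d assumed).
    winner = int(np.argmin(np.linalg.norm(weights - x, axis=1)))
    # Grid distance D(i, i*) and an assumed Gaussian neighbourhood N(i, i*; n).
    grid_dist = np.linalg.norm(coords - coords[winner], axis=1)
    neighbourhood = np.exp(-grid_dist ** 2 / (2.0 * sigma ** 2))
    # Move each weight toward x in proportion to learning rate and neighbourhood.
    return weights + gamma * neighbourhood[:, None] * (x - weights), winner


def train_som(data, grid_h, grid_w, n_iter, gamma0=0.5, lam=0.99, sigma0=2.0, seed=0):
    """Train a grid_h x grid_w map; gamma0 and lam mirror the text's symbols."""
    rng = np.random.default_rng(seed)
    weights = rng.uniform(-1.0, 1.0, size=(grid_h * grid_w, data.shape[1]))
    coords = np.array([(r, c) for r in range(grid_h) for c in range(grid_w)],
                      dtype=float)
    for n in range(n_iter):
        x = data[rng.integers(len(data))]
        decay = lam ** n  # assumed exponential decay for both rate and radius
        weights, _ = som_update(weights, coords, x,
                                gamma0 * decay, max(sigma0 * decay, 1e-3))
    return weights
```

With a shrinking neighborhood, early updates pull whole regions of the map toward an input, while later updates adjust little more than the winning node, which is what places similar inputs on nearby nodes.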

The information obtained by this unsupervised competitive learning, which is in effect clustered, becomes the input to the information storage unit 4-3, where information is accumulated while further unsupervised learning is carried out. Finally, based on this accumulated information, supervised learning is performed between the information storage unit 4-3 and the information reading unit 4-4, the result identifying the speaker is output to the information reading unit, and it is presented from the output unit in a manner suited to the need.

In terms of the schematic of ESN 1 shown in FIG. 1, this corresponds to the output of the feature extraction unit 4-2 being given to the input layer 1-1, information being accumulated in the reservoir layer 1-2 while unsupervised learning is carried out, and supervised learning being performed between the reservoir layer 1-2 and the output layer 1-3. In other words, if the information input unit 4-1 and the feature extraction unit 4-2 are regarded together as the information input unit 4-12 for the information storage unit 4-3 and the information reading unit 4-4, the newly constructed neural network structure of FIG. 4 corresponds to the input layer 1-1, the reservoir layer 1-2, and the output layer 1-3 of ESN 1, a kind of RNN and the feedback neural network shown schematically in FIG. 1. The description below uses the reference signs of the newly constructed neural network structure of FIG. 4.

The ESN is a kind of RNN, but it is quite different from an RNN that feeds the hidden-layer output of the previous time step into the hidden layer of the next time step and is trained with a method such as BPTT: the ESN can learn complex time-series dynamics with a small amount of supervised learning. This is because the information storage unit 4-3 of FIG. 4 consists of a single hidden layer with the recurrent connection W, the nodes inside that layer, which correspond to the hidden layers of an RNN, are connected randomly, and each connection weight is random and fixed.

In the embodiment of the present invention, the connection from the feature extraction unit 4-2 to the information storage unit 4-3 is called the write connection W_in, and the connection from the information storage unit 4-3 to the information reading unit is called the readout connection W_out. The output y(t) is calculated as in equation (7). y(t) is a vector whose dimension equals the number of speakers, with each dimension corresponding to one speaker. At speaker enrollment, y(t) is set to the one-hot vector of the speaker in frame t (the element corresponding to the speaker who is talking is 1 and all other elements are 0), and W_out is trained to produce such outputs. At recognition, y(t) gives a score for each speaker (the likelihood that the speaker is the one talking). The information storage (reservoir) state s(t) at time step t is calculated by equation (6), where x(t) is the input vector, ε(t) is noise, and α is the input scale factor.
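Equations (6) and (7) are not reproduced in this text. As a rough sketch of how the quantities named above typically fit together in an echo state network, the code below assumes a tanh state update and a linear readout that uses only the reservoir state. Both forms, and the names `reservoir_step`, `readout`, `one_hot_target`, and `noise_std`, are assumptions for illustration rather than the patent's exact formulas.

```python
import numpy as np


def reservoir_step(s_prev, x, W, W_in, alpha, noise_std, rng):
    """Assumed update: s(t) = tanh(alpha * W_in x(t) + W s(t-1) + eps(t))."""
    eps = noise_std * rng.standard_normal(s_prev.shape)  # noise term eps(t)
    return np.tanh(alpha * (W_in @ x) + W @ s_prev + eps)


def readout(W_out, s):
    """Assumed linear readout: y(t) = W_out s(t), one score per speaker."""
    return W_out @ s


def one_hot_target(speaker_index, n_speakers):
    """Enrollment target for one frame: 1 for the active speaker, 0 elsewhere."""
    y = np.zeros(n_speakers)
    y[speaker_index] = 1.0
    return y
```

In this sketch only `W_out` would be adjusted during enrollment; `W` and `W_in` stay as initialized.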

Here, in one embodiment of the present invention, the set of nodes of the feature extraction unit 4-2 and the set of nodes of the information storage unit 4-3 are identical, and the topology of the feature extraction unit 4-2, that is, the two-dimensional spatial arrangement of its nodes, is ignored in the information storage unit 4-3.

In one embodiment of the present invention, at initialization each component of the recurrent connection W was either set to 0 with probability p_w, that is, sparsified, or drawn from the uniform distribution on [-1, 1]. All components of W were then scaled by a common factor so that the spectral radius r_w is less than 1. After initialization, all components of W were fixed, as is characteristic of this type of neural network.
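The initialization just described can be sketched as below. The function name `init_reservoir` and the specific values in the example call are hypothetical; the procedure itself (zero with probability p_w, otherwise uniform on [-1, 1], then a single scale factor to reach spectral radius r_w) follows the text.

```python
import numpy as np


def init_reservoir(n_nodes, p_w, r_w, seed=0):
    """Build a fixed recurrent matrix W with sparsity p_w and spectral radius r_w."""
    rng = np.random.default_rng(seed)
    W = rng.uniform(-1.0, 1.0, size=(n_nodes, n_nodes))
    # Each component is zeroed with probability p_w (sparsification).
    W[rng.random((n_nodes, n_nodes)) < p_w] = 0.0
    # Spectral radius = largest absolute eigenvalue of W.
    radius = np.max(np.abs(np.linalg.eigvals(W)))
    if radius == 0.0:
        raise ValueError("W has zero spectral radius; lower p_w or change the seed")
    # One common factor rescales every component so the radius becomes r_w.
    return W * (r_w / radius)


# Hypothetical sizes for illustration only; the source gives its values in Table 1.
W_demo = init_reservoir(n_nodes=50, p_w=0.8, r_w=0.95)
```

Scaling by one factor preserves the sparsity pattern and the relative sizes of the weights, and a spectral radius below 1 is the usual condition for the influence of past inputs on the reservoir state to fade over time.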

This is one example of an initialization setting, and various other options are available. For example, when training the readout W_out (speaker enrollment) and when identifying or classifying (speaker identification), the initial reservoir state may be 1) set to the zero vector, 2) set to the reservoir state reached after inputting all or part of the data (speech) used to train the SOM, 3) set to the reservoir state reached after inputting all or part of the speech used to train W_out (speaker enrollment), or 4) set to the reservoir state reached after inputting a combination of the above speech.

In this neural network, supervised learning is performed only on the readout W_out, the connection from the information storage unit 4-3 to the information reading unit. In one embodiment of the present invention this step is called enrollment (registration), and sufficiently accurate learning can be achieved by supervised learning on a small amount of data. This is because the information storage unit 4-3 has a large capacity and is able to model the dynamics of the input data. Unlike conventional i-vector systems, however, enrollment in this embodiment is performed for a group of speakers, not for each speaker individually. As the identification result, the likelihood that each enrolled speaker is the one talking is therefore obtained from the information reading unit 4-4.
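The text does not state how W_out is fitted during enrollment. One common choice for echo state networks is ridge regression on collected reservoir states, sketched below purely as an assumption; the function name `train_readout` and the `ridge` parameter are introduced here and do not come from the source.

```python
import numpy as np


def train_readout(states, targets, ridge=1e-3):
    """Fit W_out by ridge regression (an assumed method, not from the source).

    states:  (T, N) reservoir states s(t), one row per frame
    targets: (T, P) one-hot speaker targets y(t), one row per frame
    returns: (P, N) readout matrix W_out such that y(t) ~= W_out @ s(t)
    """
    n_nodes = states.shape[1]
    gram = states.T @ states + ridge * np.eye(n_nodes)  # regularised N x N system
    return np.linalg.solve(gram, states.T @ targets).T
```

Because the targets for all P speakers are fitted in one solve, a single enrollment pass covers the whole speaker group, which matches the group-level enrollment described above.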

Furthermore, because this embodiment is applied to a speaker identification device, the readout matrix is interpreted as a set of vectors that represent the speakers in the information storage (reservoir) state space. In one embodiment of the present invention, x(t) is ignored and the column vectors of W_out are used to simplify the right-hand side of equation (7), which is rewritten as equation (8). Here, P is the number of speakers and ω_out denotes the cosine similarity. This equation shows that the likelihood of being speaker p is given by the cosine similarity ω_out and the information storage (reservoir) state s(t) extracted from the utterance. The cosine similarity ω_out can therefore be regarded as expressing the speaker vector in the information storage (reservoir) state space, and the device can function as a speaker identification device.
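Equation (8) is not reproduced in this text. The sketch below shows one way to read the passage: each speaker is represented by one vector of W_out, and the score for speaker p is the cosine similarity between that vector and the reservoir state s(t). The layout (speaker vectors stored as rows of a P x N matrix, which would be columns under the transposed convention) and the name `cosine_scores` are assumptions.

```python
import numpy as np


def cosine_scores(W_out, s):
    """Cosine similarity between each speaker vector (row of W_out) and state s."""
    norms = np.linalg.norm(W_out, axis=1) * np.linalg.norm(s)
    # The floor guards against division by zero for all-zero vectors.
    return (W_out @ s) / np.maximum(norms, 1e-12)


# Two hypothetical speaker vectors in a 2-D state space.
scores_demo = cosine_scores(np.array([[1.0, 0.0], [0.0, 1.0]]), np.array([2.0, 0.0]))
```

Cosine similarity depends only on direction, so the score is unaffected by the overall magnitude of the reservoir state.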

As described above, the speaker identification device according to one embodiment of the present invention was realized, after detailed study of the structures and algorithms of the SOM and the ESN, by coupling the SOM as the input layer of the ESN as shown schematically in FIG. 4, using the unsupervised learning methods of equations (1) to (7) from the information input unit 4-1 to the information storage unit 4-3, and devising the supervised learning of equation (8) suited to speaker identification.

To clarify the performance of the speaker identification device according to one embodiment of the present invention, experiments were conducted on text-independent speaker identification with short utterances, in which the content spoken at recognition does not depend on the content spoken at enrollment. The results were compared with the identification accuracy of a speaker identification device using i-vector/PLDA, currently the state of the art for modeling features extracted from speech. In these experiments, all utterances used for enrollment and identification were clear and short, and the focus was on closed-set speaker identification, in which all speakers are known. In other words, no utterances from unenrolled persons were used.

Two corpora were used in these experiments: the Corpus of Spontaneous Japanese (CSJ) and the ATR multi-speaker speech database, specifically the portion in which phonetically balanced sentences are read aloud (ATR/APP-BLA).

CSJ is a collection of spontaneous Japanese speech sampled at 16 kHz, containing 661 hours of speech from 1,395 speakers. About 90% of it is monologue, and the remaining 10% or so is dialogue, reading, and re-reading. This corpus was used to train the i-vector extractor, and a randomly selected part of it was used for pre-training in the information input unit 4-1 and the feature extraction unit 4-2 of ROM, that is, to generate the structure-learned input to the information storage unit 4-3.

ATR/APP-BLA, like CSJ, is a collection of speech data. It contains about 100,000 readings of phonetically balanced sentences by 3,700 speakers, with a total duration of 128 hours and an average utterance length of 4 seconds; each speaker reads a given sentence aloud only once. This corpus, too, consists of clear, short read speech from many speakers. The utterances for enrollment and identification in the speaker identification device of this embodiment were selected from this corpus by the method described below.

Numerous trials were run to identify speakers within a group of p speakers, with p = 6, 50, or 100. First, a speaker group Gp of p speakers was selected at random from the 1,596 speakers who had read a set of 50 sentences. Next, four sentences were selected at random for each combination of the group Gp, the enrollment duration de (0.5, 1, 2, or 5 seconds), and the identification duration dr (0.5, 1, 2, or 5 seconds). Then, for each pair of sentence and speaker, utterances of duration de and dr were extracted, where necessary by cutting and concatenating sentences other than the four selected ones. Finally, one sentence, that is, one utterance per speaker, was chosen for enrollment for all speakers in Gp, and the remaining utterances were used for identification. This procedure was repeated four times while changing the sentence chosen for enrollment; in other words, each of the four utterances was provided once as the enrollment utterance. Enrollment is therefore needed only once for each combination of Gp, de, dr, and i (which of the four utterances is provided for enrollment). The random selection of the group was repeated N_Gp = 150, 20, and 10 times for p = 6, 50, and 100, respectively. Consequently, under each condition determined by p, de, and dr, p × N_Gp × 3 (utterance contents) × 4 (one of the four sentences provided for enrollment) trials were performed.
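As a worked check of the trial count p × N_Gp × 3 × 4 stated above, the short sketch below evaluates it for the three group sizes. The variable and function names are illustrative only; the numbers are those given in the text.

```python
# Group size p -> number of random group selections N_Gp, as given in the text.
GROUP_REPEATS = {6: 150, 50: 20, 100: 10}


def trials_per_condition(p, n_gp, n_identification=3, n_enrollment_choices=4):
    """Trials for one (p, de, dr) condition: p * N_Gp * 3 * 4."""
    return p * n_gp * n_identification * n_enrollment_choices


trial_counts = {p: trials_per_condition(p, n) for p, n in GROUP_REPEATS.items()}
# Four enrollment durations times four identification durations.
n_duration_conditions = 4 * 4
```

The repeat counts are chosen so that each group size yields a comparable number of trials per condition: 10,800 for p = 6 and 12,000 each for p = 50 and p = 100.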

As described above, a system using i-vector/PLDA trained on CSJ was used as the baseline. Sixty-dimensional acoustic features were formed from 20-dimensional MFCCs (Mel-Frequency Cepstral Coefficients), a representative feature representation widely used in speech recognition, with delta and delta-delta features appended. The frame width and frame shift of the time window for the FFT were 20 ms and 10 ms, respectively. From these acoustic features, 100-dimensional i-vectors, which express a speaker's acoustic characteristics as the difference from those of a general speaker (the UBM), were extracted using a 256-mixture, full-covariance GMM-UBM (Gaussian Mixture Model-Universal Background Model) trained in advance as the prior distribution. Within-speaker variability was then reduced as follows: the extracted i-vectors were whitened and length-normalized, compressed to 50 dimensions by LDA (Linear Discriminant Analysis), and processed with WCCN (Within-Class Covariance Normalization). Speaker scores were then computed with a PLDA model.

The speaker identification device according to one embodiment of the present invention performed speaker identification under the following conditions. The input acoustic feature was a 1025-dimensional log power spectrum. The frame width and frame shift of the time window in the information processing unit 3-6, which performs the FFT, were 100 ms and 25 ms, respectively. In the experiments, the parameters shown in Table 1 were set in ROM 4. These parameters were determined using ATR/APP-BLA data that had not been selected for the experiments and that were not used for evaluation. For pre-training in the information input unit 4-1 and the feature extraction unit 4-2, 10,000 frames of speech from CSJ were used.

To compare the results obtained in this way with those of i-vector/PLDA, the speaker identification result for a whole utterance is determined as follows. The softmax function is applied to the output vector y(t), which holds the speaker scores for each frame, the results are summed over the whole utterance to be identified, and the speaker with the largest sum is taken as the identification result.
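The utterance-level decision rule above can be sketched as follows. The function name `identify_speaker` and the array layout (one row per frame, one column per speaker) are assumptions made for illustration.

```python
import numpy as np


def identify_speaker(frame_scores):
    """Pick the speaker whose softmax probability, summed over frames, is largest.

    frame_scores: (T, P) array, row t holding the output vector y(t).
    """
    # Subtracting the row maximum keeps the exponentials numerically stable.
    shifted = frame_scores - frame_scores.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    probs = exp / exp.sum(axis=1, keepdims=True)  # softmax per frame
    return int(np.argmax(probs.sum(axis=0)))      # sum over frames, then argmax


# Three frames, three speakers: speaker 1 leads in two frames, speaker 0 in one.
decision = identify_speaker(np.array([[0.0, 2.0, 0.0],
                                      [0.0, 1.0, 0.0],
                                      [3.0, 0.0, 0.0]]))
```

Applying the softmax per frame bounds each frame's contribution to the sum, so a single frame with an unusually large raw score cannot dominate the decision on its own.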

Table 2 shows the resulting speaker identification accuracy for the system using ROM 4, which is one embodiment of the present invention, and for the system using i-vector/PLDA. As the table shows, when the enrollment duration de and the identification duration dr are sufficiently long and the number of speakers in the group Gp is small, no significant difference is observed between the two systems. As de and dr become shorter, however, and as the number of speakers in Gp increases, the speaker identification accuracy of the system using ROM 4 becomes higher than that of the system using i-vector/PLDA. The system using ROM 4, an embodiment of the present invention, was thus shown to have speaker identification accuracy at the world's highest level.

These results show that enrollment and identification can be performed with short utterances, that the burden on the speaker is very light, that a highly accurate speech recognition device can be built, and that, because the computational cost is low, a low-priced speech recognition device can be provided. Furthermore, as the speaker identification device of the above embodiment shows, the information processing apparatus of the present invention can give an output result for every frame. This means an output result can be given at every time step, so the present invention is an information processing apparatus that can obtain output results in real time, with only minimal delay, for time-series data.

The information processing apparatus of the present invention can process enormous amounts of information with high accuracy without increasing the amount of computation, and the example showed that it delivers excellent performance in speech recognition. It can of course be applied to prediction, the field in which neural networks are most widely used in practice, such as forecasting sales demand, product trends, and recommendations, and it is also suited to the fields of identification and execution, which require more advanced information processing. In the field of identification, it can be applied to judging, sorting, and searching language, images, music, and the like; to identification, authentication, and emotion recognition for voice, images, moving images, and the like; and to predicting, detecting, and discovering failures, anomalies, potential customers, and the like. In the field of execution, it can be applied to automating tasks such as self-driving cars, Q&A handling, and complaint handling; to generating expressions such as summarizing, composing, and translating text; and to optimizing actions such as game play and delivery routes. It can therefore be used in a wide range of industrial fields. It is particularly suited to information processing apparatuses that must obtain output results in real time, with only minimal delay, for time-series data.

1 ESN
1-1 Input layer
1-2 Reservoir layer
1-3 Output layer
2 SOM
2-1 Input layer
2-2 Output layer (competitive layer)
3 Information processing apparatus
3-1 Information input unit
3-2 Feature extraction unit
3-3 Information storage unit
3-4 Information reading unit
3-5 Information collection unit
3-6 Information processing unit
3-7 Output unit
4 ROM (Reservoir with self-organized Mapping)
4-1 Information input unit / SOM input layer
4-2 Feature extraction unit / SOM output layer (competitive layer)
4-12 SOM (corresponds to the ESN input layer)
4-3 Information storage unit / ESN reservoir layer
4-4 Information reading unit / ESN output layer

Claims

A neural network system information processing apparatus, comprising:
an information input unit;
a feature extraction unit that performs unsupervised structure learning in which the information input to the information input unit is embedded in spatial information as a spatial pattern;
a reservoir layer that introduces the information, structure-learned without supervision as a spatial pattern by the feature extraction unit, into a network and accumulates the structure-learned information while further learning it without supervision; and
an information reading unit that extracts an answer, by supervised learning, from the information accumulated in the reservoir layer.

The neural network system information processing apparatus according to claim 1, wherein
the information input unit and the feature extraction unit are executed by any one of the SOM (Self-Organizing Map), ART (Adaptive Resonance Theory Model), and LVQ (Learning Vector Quantization) algorithms, and
the reservoir layer and the information reading unit are executed by an ESN (Echo State Network) or LSM (Liquid State Machine) algorithm.

The neural network system information processing apparatus according to claim 1, wherein
the information input unit and the feature extraction unit are executed by any one of the SOM (Self-Organizing Map), ART (Adaptive Resonance Theory Model), and LVQ (Learning Vector Quantization) algorithms,
the reservoir layer is executed by an ESN (Echo State Network) or LSM (Liquid State Machine) algorithm, and
the information reading unit is executed by a FORCE (First Order Reduced and Controlled Error) or BPDC (Backpropagation-Decorrelation) algorithm.

The neural network system information processing apparatus according to claim 1, wherein
the feature extraction unit is executed by any one of PCA (Principal Component Analysis), an auto-encoder, and GTM (Generative Topographic Map), and
the reservoir layer and the information reading unit are executed by an ESN (Echo State Network) or LSM (Liquid State Machine) algorithm.

The neural network system information processing apparatus according to claim 1, wherein
the feature extraction unit is executed by any one of PCA (Principal Component Analysis), an auto-encoder, and GTM (Generative Topographic Map),
the reservoir layer is executed by an ESN (Echo State Network) or LSM (Liquid State Machine) algorithm, and
the information reading unit is executed by a FORCE (First Order Reduced and Controlled Error) or BPDC (Backpropagation-Decorrelation) algorithm.

The neural network system information processing apparatus according to claim 1, wherein the information is sequence information.

The neural network system information processing apparatus according to claim 1, wherein the information is time-series information.

The neural network system information identification device according to claim 1, wherein the information is time series data.

The neural network system voice identification device according to claim 1, wherein the information is voice.