JPH09212197A - Neural network

Info

Publication number: JPH09212197A
Application number: JP8037292A
Authority: JP
Inventors: Hideto Tomabechi; 英人苫米地
Original assignee: JustSystems Corp
Current assignee: JustSystems Corp
Priority date: 1996-01-31
Filing date: 1996-01-31
Publication date: 1997-08-15

Abstract

PROBLEM TO BE SOLVED: To provide a neural network that is easy to train and achieves a high recognition rate with a simple configuration. SOLUTION: A neuron element network 22 consists of self-associative neural networks ANN1 to ANNn. Each ANN is assigned to one phoneme and learns only that phoneme. That is, each ANN performs self-associative learning on the vector sequence of its phoneme obtained through spectrum analysis by an FFT device 21. For speech recognition, the vector sequence of the spectrum-analyzed speech is input to all of ANN1 to ANNn. The similarity between the input vector sequence and the output vector sequence is then calculated for each ANN, and the phoneme corresponding to the ANN with the highest similarity is recognized as the phoneme constituting the input speech.

Description

Detailed Description of the Invention

Field of the Invention

[0001] The present invention relates to a neural network, for example a neural network used for shape recognition, speech recognition, and the like.

Description of the Related Art

[0002] Neural networks, which attempt to process information by reproducing the mechanisms of the human cerebral nervous system through engineering, have been attracting attention. Such a neural network consists of a neuron element network, made up of a plurality of neuron elements that propagate data, and a learning control unit that controls its learning. The neuron element network generally consists of an input layer to which data is input, an output layer from which data is output in response to the input data, and one or more intermediate layers arranged between the two. Each neuron element is connected to neuron elements of the adjacent layers with a predetermined strength (connection weight), and the output signal changes depending on the values of these connection weights.

[0003] In a conventional neural network with such a hierarchical structure, a process called "learning" is carried out by having the learning control unit change the connection weights between neuron elements. Learning uses analog or binary data (patterns) supplied in accordance with the number of inputs of the input layer and the number of outputs of the output layer. Suppose that data g1 to g6 are given and that, when g1 to g3 are input to the input layer as a learning pattern, output signals p1 to p3 are produced by the output layer. If the correct output for this input is g4 to g6, then g4 to g6 are generally called the teacher signal. Learning is performed by executing, for a plurality of learning patterns, a process that corrects the connection weights of the neuron elements so that the error between the output signals p1 to p3 and the teacher signals g4 to g6 is minimized, or so that the two coincide.

[0004] The error back-propagation method (hereinafter, the BP method) has often been used as a concrete way to correct the connection weights between neuron elements so that the output signal matches the teacher signal. To minimize the error between the output values of the output layer and the teacher signal, the BP method corrects the connection weights between neuron elements across all layers of the neural network. That is, the error at the output layer is regarded as the accumulation of the individual errors arising at the neuron elements of each intermediate layer, and the connection weights are corrected so as to minimize not only the output-layer error but also the errors of the intermediate-layer neuron elements that cause it. For this purpose, the error of every neuron element in the output layer and in each intermediate layer is calculated.

[0005] This calculation takes the individual error values of the output-layer neuron elements as initial conditions and proceeds in the reverse direction: the error values of the neuron elements in the n-th intermediate layer, then those of the (n-1)-th intermediate layer, and so on. A correction to each connection weight is calculated from the error value obtained for each neuron element and the connection weight at that point. Learning ends when this process has been repeated for all learning patterns until the error with respect to the teacher signal falls to or below a fixed value, or for a predetermined number of iterations. Research is under way on using such neural networks for pattern recognition of characters and figures in various kinds of data, speech analysis and synthesis, prediction of time-series patterns of motion, and the like.

Problems to Be Solved by the Invention

[0006] In such conventional neural networks, a single network was made to handle all cases. In speech recognition, for example, a single neural network was trained on all of the words or phonemes to be learned. Learning each phoneme, however, requires training on multiple samples. For the phoneme "a", for example, it is necessary to learn not only the vowel "a" by itself but also the "a" contained in consonant-initial syllables such as "ma", "sa", and "ta"; and even for the same "ma", the word-initial case as in matu, the word-medial case as in hamaya, and the word-final case as in sima. When the network is applied to speaker-independent recognition, each of these phonemes as uttered by multiple speakers must also be used as training data. Because a single neural network was conventionally used to recognize such a large amount of data, raising the recognition rate through sufficient learning on all of the training data required enlarging the network by increasing the number of intermediate layers or the number of neuron elements in each layer, which made the network very expensive. A larger network also took correspondingly longer to train. Conversely, when speech recognition is performed on a personal computer or similar device, the device is small and its processing capacity limits the size of the intermediate layer (the number of neuron elements); the network sometimes cannot be trained sufficiently, and the recognition rate falls as a result.

[0007] In addition, the input spectrum must be aligned in advance with the first position of the speech-recognition neural network. Existing methods therefore could not handle continuous speech recognition, in which the start timing of each phoneme varies freely. Furthermore, in speech recognition with conventional neural networks, the spectrum of each phoneme is presented in isolation. In continuous speech, however, the state of each phoneme is affected by the state of the phoneme that precedes it, so an existing neural network that recognizes each phoneme in isolation cannot use information about previously presented phonemes and is unsuited to continuous speech recognition. These problems exist not only in speech recognition but also in figure recognition, character recognition, prediction of time-series patterns of motion, and the like.

[0008] Accordingly, a first object of the present invention is to provide a neural network that achieves a high recognition rate and the like with a simple configuration. A second object of the present invention is to provide a neural network that can additionally be trained easily and in a short time.

Means for Solving the Problems

[0009] In the invention according to claim 1, the first object is achieved by providing a neural network with: a plurality of bottleneck neuron element networks, each having an input layer with a plurality of neuron elements, an intermediate layer with fewer neuron elements than the input layer, and an output layer with the same number of neuron elements as the input layer, each network being associated with a different specific meaning; input means for inputting a vector sequence to the input layer of each bottleneck neuron element network; similarity calculation means for calculating the similarity between the input vector sequence and the output vector sequence that each bottleneck neuron element network produces in response to that input; and output means for outputting the specific meaning associated with the bottleneck neuron element network having the highest calculated similarity as the meaning of the vector sequence supplied to the input means.

[0010] In the invention according to claim 2, the plurality of bottleneck neuron element networks in the neural network of claim 1 are networks that have undergone self-associative learning in which the vector sequence of the corresponding specific meaning serves as both input data and teacher signal. In the invention according to claim 3, the second object is achieved by providing the neural network of claim 1 with learning means that performs, for each bottleneck neuron element network, self-associative learning in which the vector sequence of the corresponding specific meaning serves as both input data and teacher signal. In the invention according to claim 4, the self-associative learning in the neural network of claim 2 or claim 3 is performed by the back-propagation rule. In the invention according to claim 5, in the neural network of any one of claims 1 to 4, the specific meaning is a phoneme constituting speech, and the vector sequence input to the input layer is a vector sequence representing feature values of the phoneme analyzed in time series. In the invention according to claim 6, the neural network of claim 5 uses at least one of spectrum data and cepstrum data of the speech.

Embodiments of the Invention

[0011] An embodiment of the neural network of the present invention is described in detail below with reference to FIGS. 1 to 10, taking speech recognition as an example. FIG. 1 shows the system configuration of a speech recognition apparatus that uses the neural network. The apparatus includes a CPU 11 that performs various kinds of processing and control: inputting vector sequences to the neuron element network for learning, inputting teacher signals (vector sequences) to the output layer, changing the connection weights between neuron elements through learning, and recognizing speech from the output signals of the neuron element network. Connected to the CPU 11 through a bus line 12, such as a data bus, are a ROM 13, a RAM 14, a communication control device 15, a printer 16, a display device 17, a keyboard 18, an FFT (fast Fourier transform) device 21, a neuron element network 22 having n self-associative NNs (neural networks), and a graphic reading device 24.

[0012] The ROM 13 is a read-only memory that stores the programs and data the CPU 11 uses for processing and control such as speech recognition and learning of the neuron element network. The ROM 13 stores, for example, a program for training the neuron element network by the back-propagation rule, a program for recognizing phonemes one after another from an input signal, and a Japanese-language conversion system program that recognizes speech from the recognized phonemes and converts the recognized speech into written text.

[0013] The RAM 14 is a random-access memory into which predetermined programs stored in the ROM 13 are loaded and which also serves as working memory for the CPU 11. The RAM 14 reserves a vector sequence storage area for temporarily holding the power at each time and each frequency of the speech data analyzed by the FFT device 21 or received through the communication control device 15. These per-frequency power values form the vector sequence that is input to the input layer I of each self-associative NN in the neuron element network 22. When characters, figures, and the like are to be recognized by the neural network, the RAM 14 also stores the image data read by the graphic reading device 24.

[0014] The communication control device 15 exchanges various data, such as recognized speech data, with other communication control devices over communication networks 2 such as telephone networks, LANs, and personal-computer communication networks. The printer 16 comprises a laser printer, a dot-matrix printer, or the like and prints input data, the content of recognized speech, and so on. The display device 17 has an image display unit, such as a CRT or liquid crystal display, and a display control unit, and it shows input data, the content of recognized speech, and instructions for the operations needed for speech recognition. The keyboard 18 is an input device for changing parameters of the FFT device 21, entering setting conditions, and entering text; it has numeric keys for entering numbers, character keys for entering characters, function keys for invoking various functions, and so on. A mouse 19 serving as a pointing device is connected to the keyboard 18.

[0015] A speech input device 23, such as a microphone, is connected to the FFT device 21. The FFT device 21 converts the analog speech data supplied by the speech input device 23 into digital form and performs spectrum analysis by discrete Fourier transform. This analysis outputs, for each time step, a vector sequence consisting of the power at each frequency, and the vector sequence for each time step is stored in the vector sequence storage area of the RAM 14. The graphic reading device 24 has elements such as a CCD (charge-coupled device) and reads images of characters, figures, and the like recorded on paper or other media; the image data it reads is stored in the RAM 14.

[0016] FIG. 2 shows the configuration of the neuron element network 22. As shown in FIG. 2, the neuron element network 22 includes n self-associative neural networks (hereinafter, ANNs) 1 to n. The number n of ANNs equals the number of specific meanings that the neuron element network 22 is to distinguish in the input data. In the speech recognition of this embodiment, for example, n is the number of phonemes to be recognized in the input speech data; for recognition of 80 phonemes, n = 80. Each of ANN1 to ANNn is associated with one phoneme and, during learning of the neural network, is trained on that phoneme. That is, ANN1 performs self-associative learning on the phoneme "a", ANN2 on the phoneme "i", and ANN3 on the phoneme "u", and the remaining ANN4 to ANNn likewise perform self-associative learning on their respective phonemes.

[0017] Each ANN has three layers: an input layer I, an intermediate layer H, and an output layer O. The input layer I has p neuron elements I1 to Ip, where p is the number of input data items, chosen freely to suit the task, such as speech recognition or figure recognition. The intermediate layer H has q neuron elements H1 to Hq, fewer than the p neuron elements of the input layer I (q < p). The output layer O has p neuron elements O1 to Op, the same number as the input layer I. The self-associative NN is thus a so-called bottleneck neural network. The reason is that if the intermediate layer had the same number of neuron elements as the input and output layers, self-associative learning would drive all the connections between layers to 1, and proper learning could not take place.

[0018] Each of the neuron elements H1 to Hq of the intermediate layer H is fully connected to all neuron elements of the input layer I through connection weights W11 to Wpq (hereinafter collectively W) that can be changed during learning. Each of H1 to Hq also has a threshold, θ1 to θq, that can be changed during learning. Each of H1 to Hq outputs a value by forward propagation based on the input data supplied to the input layer I, the connection weights W, and its threshold. Likewise, each of the neuron elements O1 to Op of the output layer O is fully connected to all neuron elements H1 to Hq of the intermediate layer H through connection weights w11 to wqp (hereinafter collectively w) that can be changed during learning. Each of O1 to Op produces the output value of the self-associative NN from the output values of the intermediate layer H and the connection weights w.
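The forward propagation through one bottleneck ANN can be sketched roughly as follows. This is only an illustration: the text does not name the activation function or the sign convention of the threshold, so the sigmoid and the "weighted sum minus θ" form are assumptions, and the class name `BottleneckANN` is hypothetical.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple


def sigmoid(x: float) -> float:
    """Assumed activation function; the source does not specify one."""
    return 1.0 / (1.0 + math.exp(-x))


@dataclass
class BottleneckANN:
    W: List[List[float]]   # W[i][k]: input element i -> intermediate element k (p x q)
    theta: List[float]     # theta[k]: threshold of intermediate element k (length q)
    w: List[List[float]]   # w[k][j]: intermediate element k -> output element j (q x p)

    def forward(self, x: List[float]) -> Tuple[List[float], List[float]]:
        p, q = len(self.W), len(self.theta)
        # Intermediate layer: weighted sum of the inputs minus the threshold (assumed sign).
        h = [sigmoid(sum(x[i] * self.W[i][k] for i in range(p)) - self.theta[k])
             for k in range(q)]
        # Output layer: weighted sum of intermediate outputs, with no threshold,
        # matching the description of O1..Op above.
        o = [sigmoid(sum(h[k] * self.w[k][j] for k in range(q))) for j in range(p)]
        return h, o


if __name__ == "__main__":
    # p = 4 inputs/outputs, q = 2 intermediate elements, all weights zero.
    ann = BottleneckANN(W=[[0.0, 0.0] for _ in range(4)],
                        theta=[0.0, 0.0],
                        w=[[0.0] * 4 for _ in range(2)])
    print(ann.forward([0.2, 0.4, 0.6, 0.8]))
```

Keeping W, θ, and w as separate fields mirrors how the description treats them as distinct, independently adjustable quantities.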

[0019] The neuron element network 22 has a memory (not shown) that stores a connection weight table for each self-associative NN. Each table holds the connection weights W between the input layer I and the intermediate layer H, the thresholds θ of the intermediate layer, and the connection weights w between the intermediate layer H and the output layer O. Thresholds θ may also be set for the neuron elements of the input and output layers. The neuron element network 22 is trained by having the CPU 11 change these connection weights and thresholds according to a predetermined back-propagation rule.

[0020] The operation of the embodiment configured as described above is explained next.

Outline of Operation

Each of the self-associative neural networks ANN1 to ANNn is associated with one phoneme to be recognized and is trained exclusively on that phoneme. During learning, each ANN performs self-associative learning on the vector sequences of its phoneme obtained through spectrum analysis by the FFT device 21, independently of the learning of the other ANNs. During speech recognition, the vector sequence of the speech analyzed by the FFT device 21 is input to all of ANN1 to ANNn, and the similarity between the input vector sequence and the output vector sequence is calculated for each ANN. If, for example, a vector sequence of the phoneme "a" is input for recognition, the similarity is highest for ANN1, because only ANN1 has been trained on "a"; the other ANN2 to ANNn have not learned "a", so their similarity is very low. The phoneme associated with the ANN showing the highest similarity can therefore be recognized as the phoneme constituting the input speech.
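The recognition step outlined above might look like the following sketch. The similarity measure is not defined in this part of the description, so negative squared Euclidean distance is used purely as an assumption; `similarity` and `recognize` are hypothetical names, and each ANN is represented simply as a callable that maps an input vector to its output vector.

```python
from typing import Callable, Dict, List, Sequence

Vector = Sequence[float]


def similarity(x: Vector, y: Vector) -> float:
    """Assumed measure: negative squared Euclidean distance (higher means more similar)."""
    return -sum((a - b) ** 2 for a, b in zip(x, y))


def recognize(frames: List[Vector], anns: Dict[str, Callable[[Vector], Vector]]) -> str:
    """Feed every frame to every ANN and return the label of the best-reproducing ANN."""
    scores = {
        label: sum(similarity(frame, ann(frame)) for frame in frames)
        for label, ann in anns.items()
    }
    return max(scores, key=scores.get)


if __name__ == "__main__":
    # Toy stand-ins for trained ANNs: each always reproduces the pattern it "learned".
    toy_anns = {
        "a": lambda v: [1.0, 0.0],
        "i": lambda v: [0.0, 1.0],
    }
    print(recognize([[0.9, 0.1], [0.8, 0.2]], toy_anns))
```

Because every ANN sees the same input, the comparison reduces to asking which network reconstructs that input with the least distortion.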

Details of Neural Network Learning

[0021] As described above, each of ANN1 to ANNn in the neuron element network 22 corresponds to a specific phoneme. ANN1, for example, independently performs self-associative learning on the phoneme "a" only, and ANN2 on the phoneme "i" only. The other ANN3 to ANNn likewise each independently learn only their own phoneme. Because each ANN needs to learn only its own phoneme, the input, output, and intermediate layers can be made smaller (with fewer neuron elements), which makes learning and recognition practical on a personal computer or similar device.
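Self-associative learning of a single ANN by back-propagation, with each input vector doubling as its own teacher signal, can be sketched as follows. The sigmoid activation, learning rate, weight initialization, squared-error criterion, and the class name `AutoAssociativeANN` are all assumptions made for illustration; the description only states that a back-propagation rule is used.

```python
import math
import random
from typing import List


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


class AutoAssociativeANN:
    """p inputs, q < p intermediate elements, p outputs (hypothetical sketch)."""

    def __init__(self, p: int, q: int, seed: int = 0):
        rng = random.Random(seed)
        self.p, self.q = p, q
        self.W = [[rng.uniform(-0.5, 0.5) for _ in range(q)] for _ in range(p)]
        self.theta = [0.0] * q
        self.w = [[rng.uniform(-0.5, 0.5) for _ in range(p)] for _ in range(q)]

    def forward(self, x: List[float]):
        h = [sigmoid(sum(x[i] * self.W[i][k] for i in range(self.p)) - self.theta[k])
             for k in range(self.q)]
        o = [sigmoid(sum(h[k] * self.w[k][j] for k in range(self.q)))
             for j in range(self.p)]
        return h, o

    def error(self, patterns: List[List[float]]) -> float:
        """Sum of squared differences between each pattern and its reconstruction."""
        return sum((o - t) ** 2
                   for x in patterns
                   for o, t in zip(self.forward(x)[1], x))

    def train_step(self, x: List[float], lr: float = 0.5) -> None:
        h, o = self.forward(x)
        # Output-layer error terms; the teacher signal is the input x itself.
        d_out = [(o[j] - x[j]) * o[j] * (1.0 - o[j]) for j in range(self.p)]
        # Errors propagated back to the intermediate layer.
        d_hid = [h[k] * (1.0 - h[k]) * sum(d_out[j] * self.w[k][j] for j in range(self.p))
                 for k in range(self.q)]
        for k in range(self.q):
            for j in range(self.p):
                self.w[k][j] -= lr * d_out[j] * h[k]
            for i in range(self.p):
                self.W[i][k] -= lr * d_hid[k] * x[i]
            self.theta[k] += lr * d_hid[k]   # net input is (sum - theta), hence the sign


def train(ann: AutoAssociativeANN, patterns: List[List[float]],
          epochs: int = 500, lr: float = 0.5) -> None:
    for _ in range(epochs):
        for x in patterns:
            ann.train_step(x, lr)


if __name__ == "__main__":
    # Made-up power vectors standing in for three utterances of one phoneme.
    samples = [[0.9, 0.7, 0.2, 0.1], [0.8, 0.6, 0.3, 0.1], [0.85, 0.75, 0.25, 0.15]]
    net = AutoAssociativeANN(p=4, q=2)
    print("error before:", net.error(samples))
    train(net, samples)
    print("error after: ", net.error(samples))
```

Since each ANN is trained only on vectors of its own phoneme, one such object would be created and trained per phoneme, with no data shared between them.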

[0022] To train the neural network, the learning mode is first selected, either from the keyboard 18 or by using the mouse to operate a key shown on the display device 17. After the learning mode is selected, the characters corresponding to the 80 predetermined phonemes are entered one at a time from the keyboard 18, and after each entry the speech for that phoneme is input to the speech input device 23. Alternatively, the phoneme to be entered may be shown on the display device 17 so that the speaker is prompted for each phoneme in turn. When an analog signal such as that shown in FIG. 4(a) is input for, say, the phoneme "a", the speech input device 23 supplies it to the FFT device 21. The FFT device 21 samples the analog speech data at 22 kHz, A/D-converts it into 16-bit PCM data, and stores the result in a storage unit (not shown). The sampling rate is not limited to 22 kHz; any sampling interval may be used as long as it is no more than half the utterance duration of short phonemes such as "ク" (ku), "ッ" (small tsu), and "プ" (pu). Nor is the PCM data limited to 16 bits; it may be 32, 10, 8, 6, or 4 bits, for example.
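As a rough illustration of the digitization step, the sketch below samples a signal at 22 kHz and quantizes each sample to signed 16-bit PCM. The full-scale range of [-1.0, 1.0], the clipping behavior, and the names `quantize` and `sample_tone` are assumptions; a synthetic sine tone stands in for microphone input.

```python
import math
from typing import List

SAMPLE_RATE_HZ = 22_000   # sampling rate given in the text
BITS = 16                 # PCM word length given in the text


def quantize(sample: float, bits: int = BITS) -> int:
    """Map a sample in [-1.0, 1.0] to a signed integer of the given bit depth."""
    full_scale = 2 ** (bits - 1) - 1           # 32767 for 16 bits
    clipped = max(-1.0, min(1.0, sample))      # out-of-range input is clipped
    return round(clipped * full_scale)


def sample_tone(freq_hz: float, duration_s: float) -> List[int]:
    """Sample a unit-amplitude sine tone and quantize it to PCM."""
    n = int(SAMPLE_RATE_HZ * duration_s)
    return [quantize(math.sin(2 * math.pi * freq_hz * i / SAMPLE_RATE_HZ))
            for i in range(n)]


if __name__ == "__main__":
    pcm = sample_tone(440.0, 0.5)
    print(len(pcm), min(pcm), max(pcm))
```

Passing a different `bits` value to `quantize` covers the other word lengths the text mentions.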

[0023] Next, the FFT device 21 performs spectrum analysis of the digital speech data "a" by fast Fourier transform (FFT) at each time tn (n = 1, 2, ...), according to parameters such as the shape of the time window (rectangular, Hamming, Hanning, and so on) and the number of points. That is, as shown in FIG. 4(b), the FFT device 21 calculates the power P(tn) of the speech data at each frequency (F1 to F30) for each time tn. The vector sequences formed by these per-frequency powers P(tn) are stored for each time in the vector sequence storage area of the RAM 14, as shown in FIG. 5.
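The per-frame spectrum analysis can be sketched as below, under several assumptions: a Hamming window is chosen from the options listed, a plain discrete Fourier transform is used in place of an FFT for brevity (it yields the same coefficients, only more slowly), and the DFT bins are summed into equal-width bands to obtain the F1 to F30 style power vector. The function names and the band-grouping rule are hypothetical.

```python
import cmath
import math
from typing import List, Sequence


def hamming(n_points: int) -> List[float]:
    """Hamming window coefficients; n_points must be at least 2."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (n_points - 1))
            for n in range(n_points)]


def power_vector(frame: Sequence[float], n_bands: int) -> List[float]:
    """Power per frequency band for one windowed frame (one time tn)."""
    n = len(frame)
    windowed = [s * w for s, w in zip(frame, hamming(n))]
    n_bins = n // 2                      # non-negative frequencies below Nyquist
    bands = [0.0] * n_bands
    for k in range(n_bins):
        # Plain DFT coefficient for bin k.
        coeff = sum(windowed[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n))
        # Assumed grouping: consecutive bins are summed into equal-width bands.
        bands[k * n_bands // n_bins] += abs(coeff) ** 2
    return bands


def vector_sequence(samples: Sequence[float], frame_len: int, hop: int,
                    n_bands: int) -> List[List[float]]:
    """One power vector per time step, as stored in the vector sequence area."""
    return [power_vector(samples[start:start + frame_len], n_bands)
            for start in range(0, len(samples) - frame_len + 1, hop)]


if __name__ == "__main__":
    tone = [math.cos(2 * math.pi * 10 * t / 64) for t in range(192)]
    for vec in vector_sequence(tone, frame_len=64, hop=64, n_bands=8):
        print([round(v, 1) for v in vec])
```

With `n_bands` set to the number of input-layer elements, each returned vector can be fed directly to an ANN.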

[0024] Training data for each phoneme is generated as described above, and multiple sets of training data are generated so that the self-associative learning of each of ANN1 to ANNn generalizes. Their generation is described below using the phoneme "a" as an example. For the phoneme "a" to be learned, let “あ” denote the phoneme uttered at the beginning of a word (word-initial phoneme), “ア” the phoneme uttered at the end of a word (word-final phoneme), and “A” the phoneme uttered in the middle of a word (word-medial phoneme). For example, “あ” is taken from aki (autumn), “ア” from denwa (telephone), and “A” from tomari (overnight stay). The following description uses learning of the phoneme "a" from the three patterns “あ”, “ア”, and “A” as an example, but each phoneme is actually trained on 3 to 30 patterns, preferably about 100 patterns.

[0025] FIG. 6 shows the data obtained by FFT spectrum analysis in the FFT device 21 at each time t (t = 1, 2, ...) for the three types “あ”, “ア”, and “A”. As shown in FIGS. 6(a), 6(b), and 6(c) respectively, the FFT device 21 calculates, for each of the phonemes “あ”, “ア”, and “A” and for each time t, the power P of the speech data at each frequency (the frequency axis is divided into p bands, F1 to Fp, matching the number p of neuron elements in the input layer I of the ANN). The vector sequence for each time, formed by the per-frequency powers P(t), is stored for each phoneme in the vector sequence storage area of the RAM 14.

【００２６】[0026]

As shown in FIG. 6(a), for the pattern “あ” the vector sequence of the power P(1) at time t = 1 is denoted あ1, the vector sequence of the power P(2) at time t = 2 is denoted あ2, and likewise, although not shown, the vector sequence at time t = n is denoted あn. As shown in FIG. 6(b), for the pattern “ア” the vector sequence of the power P(1) at time t = 1 is denoted ア1, the vector sequence of the power P(2) at time t = 2 is denoted ア2, and likewise, although not shown, the vector sequence at time t = n is denoted アn. As shown in FIG. 6(c), for the pattern “Ａ” the vector sequence of the power P(1) at time t = 1 is denoted Ａ1, the vector sequence of the power P(2) at time t = 2 is denoted Ａ2, and likewise, although not shown, the vector sequence at time t = n is denoted Ａn.

【００２７】[0027]

ANN1 is trained at each time t with the vector sequences made up of the spectrum-analyzed power P(t) of each of these patterns. That is, the vector sequences あ1, ア1, and Ａ1 of the patterns at the same time, for example t = 1, are used both as the input data of the input layer I1 to Ip of ANN1 and as the teacher signals of the output layer O1 to Op, so that learning is performed for the vector sequences of each time t.

【００２８】[0028]

FIG. 7 shows the input data and teacher signals used in training the self-associative neural network, taking as an example learning based on the power vector sequences of the patterns shown in FIG. 6. As shown in FIG. 7, learning is performed in units of each time t (t = 1, 2, ..., n). For example, at time t1, learning is first performed with あ1 as the teacher signal for each of the input data あ1, ア1, and Ａ1; next with ア1 as the teacher signal for each of the input data あ1, ア1, and Ａ1; and then with Ａ1 as the teacher signal for each of the input data あ1, ア1, and Ａ1. Learning is then performed with every combination of input data and teacher signal among あ2, ア2, and Ａ2, and likewise for the remaining あt, アt, and Ａt.
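The pairing of input data and teacher signals at each time t can be enumerated as in the sketch below. The function name and the pattern labels ("initial", "final", "medial", standing for “あ”, “ア”, “Ａ”) are hypothetical, and the vectors in the usage example are placeholders rather than real spectra.

```python
from itertools import product


def training_pairs(patterns_by_time):
    """Enumerate every (input, teacher) combination within each time step.

    patterns_by_time[k] maps a pattern label to its vector at time t = k + 1.
    Returns tuples (t, input_label, teacher_label, input_vec, teacher_vec).
    """
    pairs = []
    for t, patterns in enumerate(patterns_by_time, start=1):
        for (in_label, in_vec), (tc_label, tc_vec) in product(patterns.items(), repeat=2):
            pairs.append((t, in_label, tc_label, in_vec, tc_vec))
    return pairs


# Three patterns at two time steps give 3 x 3 x 2 = 18 training pairs.
frames = [
    {"initial": [1.0], "final": [2.0], "medial": [3.0]},
    {"initial": [4.0], "final": [5.0], "medial": [6.0]},
]
print(len(training_pairs(frames)))
```

Restricting the inner loop to pairs with the same label would give the simpler variant in which each pattern is its own teacher signal.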

【００２９】[0029]

In this way, even for a single phoneme, multiple patterns (word-initial, word-medial, and word-final) uttered by multiple speakers are used. Self-associative learning is performed not only with “ア” as the teacher signal for the learning data “ア”, but also with the other patterns belonging to the same phoneme, such as “Ａ” and “あ”, as teacher signals. This makes it possible to learn a general representation of the phonemes that fall within the category of the phoneme "a".

【００３０】[0030]

Conversely, instead of using every combination of the patterns of each phoneme, only the same pattern may be used as both the input data of the input layer I and the teacher signal of the output layer O. That is, for the learning data “ア”, self-associative learning may be performed only with the teacher signal “ア” of the same pattern. Also, although not shown in FIG. 7, learning need not be confined to the same time t: for example, “あ2”, “ア2”, and “Ａ2” may be used as teacher signals for the learning data “あ1”, “ア1”, and “Ａ1”. In other words, the data at tn and tn+1 may be used as teacher signals for learning the data at each time tn.

【００３１】[0031]

In training ANN1, once the vector sequence has been input to the input layer I and the teacher signal has been supplied to the output layer, the CPU 11 performs learning using the connection weights W and thresholds θ between the neuron elements of the input layer I, intermediate layer H, and output layer O of ANN1 shown in FIG. 3, and updates each connection weight to its learned value. The self-associative learning of the corresponding phoneme "a" has been described above for ANN1; ANN2 to ANNn are likewise trained self-associatively on their respective phonemes.

【００３２】[0032]

In the present embodiment, learning is performed according to the backpropagation rule. The learning equation is Δw(t) = [S(t) / (S(t−1) − S(t))] × Δw(t−1). Details of the equation and of the learning algorithm (the Quickprop algorithm) are given in S. Fahlman, "An Empirical Study of Learning Speed in Back-Propagation Networks", Technical Report CMU-CS-88-162, Carnegie Mellon University, September 1988. Alternatively, learning may apply the backpropagation rule of feedforward networks to the discrete-time recurrent network described in J. L. Elman, "Finding structure in time", Cognitive Science, 14, pp. 179-211 (1990). Learning is not limited to these methods, and other learning methods may be used.
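The learning equation above can be read as a per-weight update, sketched below. Here S is taken to be the error slope ∂E/∂w, following Fahlman's report. The function name, the fallback to a plain gradient step when the formula is undefined, and the `fallback_rate` value are assumptions for illustration only.

```python
def quickprop_step(slope, prev_slope, prev_step, fallback_rate=0.1):
    """Return the weight change dw(t) = S(t) / (S(t-1) - S(t)) * dw(t-1).

    slope and prev_slope are the error slopes S(t) and S(t-1) for one weight.
    When there is no previous step, or the slopes coincide, the formula is
    undefined, so this sketch falls back to a small gradient-descent step.
    """
    denom = prev_slope - slope
    if prev_step == 0.0 or denom == 0.0:
        return -fallback_rate * slope
    return slope / denom * prev_step
```

On an exactly parabolic error curve the slope is linear in the weight, so one such step lands on the minimum: with S(t−1) = −2, S(t) = −1, and a previous step of 1, the formula returns a further step of 1.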

【００３３】[0033]

The spectrum data for each phoneme to be learned need not be generated at training time by the voice input device 23 and the FFT device 21. Data on various phonemes that has been spectrum-analyzed beforehand by another device may instead be input via the communication control device 15 and stored in the vector sequence storage area of the RAM 14.

【００３４】[0034]

Details of speech recognition

After training of ANN1 to ANNn is complete, when speech to be recognized is input from the voice input device 23, the FFT device 21 performs spectrum analysis, and the power at each frequency is stored as a vector sequence for each time t in the vector sequence storage area of the RAM 14. The CPU 11 then reads the vector sequence P(tn) at time tn of the speech data to be recognized from the RAM 14 and inputs it to the input layer I of ANN1. Using the learned connection weight table stored in the memory of the neuron element network 22, it obtains the output vector sequence O(tn) of ANN1. The CPU 11 then calculates the similarity S1(tn) between this output vector sequence O(tn) of ANN1 and the input vector sequence P(tn). For ANN2 to ANNn, the CPU 11 likewise obtains the output vector sequence produced for the input P(tn) and calculates the similarities S2(tn) to Sn(tn) between each output vector sequence and the input vector sequence. The similarity may be calculated by various methods, such as the difference between input and output, the Euclidean distance, or a least-squares value.

【００３５】[0035]

After calculating the similarities S1(tn) to Sn(tn) for all of ANN1 to ANNn, the CPU 11 recognizes the phoneme corresponding to the ANN with the highest similarity as the phoneme of the input speech at time tn and stores it in the RAM 14. For example, if the similarity S2 of ANN2 is the highest, the phoneme at time tn is recognized as "i". Because each of ANN1 to ANNn has learned only its own phoneme, the similarity of the ANN corresponding to each phoneme of the input speech data is extremely high while the similarities of the other ANNs are extremely low, so the phoneme can be identified from the value of the similarity S.
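The per-frame recognition step (feed P(tn) to every ANN, score each reconstruction, pick the best) can be sketched as follows. The negative Euclidean distance is just one of the similarity measures the text allows, and the function names and the use of plain callables as stand-ins for trained ANNs are hypothetical.

```python
import math


def similarity(input_vec, output_vec):
    """Negative Euclidean distance: larger means the ANN reproduced its input better."""
    return -math.sqrt(sum((x - y) ** 2 for x, y in zip(input_vec, output_vec)))


def recognize_frame(input_vec, networks):
    """networks maps a phoneme label to a callable returning that ANN's output vector.

    Returns the label with the highest similarity and the full score table.
    """
    scores = {label: similarity(input_vec, net(input_vec))
              for label, net in networks.items()}
    best = max(scores, key=scores.get)
    return best, scores


# Toy stand-ins: the "a" network reproduces its input, the "i" network does not.
toy_networks = {
    "a": lambda v: list(v),
    "i": lambda v: [0.0] * len(v),
}
best_label, frame_scores = recognize_frame([1.0, 2.0], toy_networks)
print(best_label)
```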

【００３６】[0036]

In the same manner, the vector sequences P from time tn+1 onward are read sequentially from the RAM 14, the phoneme at each time is recognized from the maximum of the similarities S1 to Sn of ANN1 to ANNn, and the results are stored sequentially in the RAM 14.

【００３７】[0037]

Because a phoneme is identified each time a vector sequence P(tn) is input in time series to the input layers I of ANN1 to ANNn, a sequence of phonemes accumulates in the RAM 14. For example, when the speech "iro" is input, the phoneme sequence "iiiiirrrooooo" recognized at the successive times is stored in the RAM. From the phoneme sequence stored in the RAM 14, the CPU 11 recognizes the input speech as "iro". When instructed from the keyboard 18, the CPU 11 converts the recognized speech into written text according to a Japanese conversion system. The converted text is displayed on the display device 17 and stored in the RAM 14. In addition, in response to an instruction from the keyboard 18, the data is transmitted to various communication control devices, such as those of personal computers and word processors, via the communication control device 15 and the communication network 2.

【００３８】[0038]

Even for the phoneme with the highest similarity, recognition may be erroneous if the similarity does not exceed a predetermined threshold. In that case, the vector input to the input layer I is excluded from recognition. This tends to happen with vector sequences that were spectrum-analyzed midway through the transition from one phoneme to the next. For such between-phoneme spectra, all of the similarities S of ANN1 to ANNn may be low, and the phoneme at that time t then cannot be recognized. The speech can nevertheless be recognized easily from the phonemes identified in the frames that follow. For example, suppose the output for the speech "iro" is "iii?rr?oo". Even though some vector sequences (shown as "?") are excluded from recognition because their similarity is low, the phonemes constituting the input speech are recognized before and after them, so the input speech "iro" can be recognized as a whole. Continuous speech recognition can therefore be performed easily.
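The threshold rejection and the reduction of a per-frame label sequence to a phoneme string can be sketched as below. The function names and the "?" reject marker follow the example in the text; the rule of simply dropping rejected frames and merging consecutive repeats is an assumption about how the final string is formed.

```python
from itertools import groupby


def label_frame(scores, threshold, reject="?"):
    """Pick the best-scoring phoneme, or the reject marker if it does not exceed threshold."""
    best = max(scores, key=scores.get)
    return best if scores[best] > threshold else reject


def collapse_phonemes(frame_labels, reject="?"):
    """Drop rejected frames, then merge runs of identical labels into one phoneme each."""
    kept = [label for label in frame_labels if label != reject]
    return "".join(label for label, _ in groupby(kept))


print(collapse_phonemes("iiiiirrrooooo"))  # iro
print(collapse_phonemes("iii?rr?oo"))      # iro
```

One limitation of this simple merge is that a genuinely repeated phoneme (the same label on both sides of a rejected frame) collapses into a single occurrence.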

【００３９】[0039]

The reason a phoneme cannot be identified at a transition between phonemes is thought to be that training is performed phoneme by phoneme, so the influence that neighboring phonemes have on each other's spectra is not learned.

【００４０】[0040]

According to the present embodiment, each of ANN1 to ANNn is trained exclusively on its own phoneme, so a high recognition rate can be obtained by training extensively on that phoneme (learning the phoneme as uttered by multiple speakers in multiple contexts). Speaker-independent recognition can therefore be performed.

【００４１】[0041]

When performing speech recognition phoneme by phoneme, how to accurately determine the starting point of each phoneme to be recognized has conventionally been a problem. According to the present embodiment, sampling is performed at intervals of no more than 1/2 of the duration of a short phoneme such as "ッ", so the starting point of each phoneme need not be identified in the PCM data. Furthermore, in continuous speech recognition on a phoneme basis, speech can be recognized regardless of the duration of each phoneme, which varies greatly between individuals. For example, even when the speech "haaru" is uttered with the "ha" drawn out, the result is "hhhhh...aaaaaaaaaaaaaa...rrrr...uuuuu...": the phoneme "a" is simply identified more times, and the speech is easily recognized as "haru".

【００４２】[0042]

Although speech recognition in units of phonemes has been described in the present embodiment, recognition may instead be performed in units of words. In that case, a code sequence representing the word is used as the teacher signal, as the specific meaning represented by the vector sequence.

【００４３】[0043]

In the present embodiment, the CPU 11 trains the neuron element network 22 according to the learning program stored in the ROM 13, and speech recognition is performed by the trained neuron element network 22. Because speaker-independent continuous speech recognition can be performed with a high recognition rate, there is little need for retraining. The speech recognition device therefore need not have a learning function, and may use a neuron element network made up of ANN1 to ANNn having connection weights W and thresholds θ obtained by training on another device. In this case, the neuron element network may be implemented in hardware having the learned connection weights.

【００４４】[0044]

In the embodiment described above, spectrum analysis of each phoneme during training and of the speech during recognition is performed by the fast Fourier transform in the FFT device, but the spectrum analysis may be performed by another algorithm. For example, spectrum analysis by the DCT (discrete cosine transform) or the like may be used.

【００４５】[0045]

ANN1 to ANNn of FIG. 2 have been described above as fully connected between the layers I, H, and O, but the present invention is not limited to this. For example, the connectivity may be determined according to the number of neuron elements in each layer and the learning capacity.

【００４６】[0046]

Next, a second embodiment will be described. Whereas in the first embodiment described above the vector sequence spectrum-analyzed by the FFT device 21 is the data input to the input layer I, in this second embodiment cepstrum data is input to each input layer I for learning and speech recognition. FIG. 8 shows the system configuration of the neural network according to the second embodiment. As shown in this figure, the neural network adds a cepstrum device 26 to the system of the first embodiment shown in FIG. 1. The other parts are the same as in the first embodiment, so they are given the same reference numerals and their description is omitted.

【００４７】[0047]

The cepstrum device 26 obtains cepstrum data by taking the inverse Fourier transform of the logarithm of the short-time amplitude spectrum of the waveform spectrum-analyzed by the FFT device 21. The cepstrum device 26 makes it possible to approximately separate and extract the spectral envelope and the fine structure.

【００４８】[0048]

Here, the principle of the cepstrum is explained. Let G(ω) and H(ω) denote the Fourier transforms of the sound source and of the impulse response of the vocal tract, respectively. The linear separable equivalent circuit model then gives the relation X(ω) = G(ω)H(ω). Taking the logarithm of the magnitude of both sides gives equation (1):

log|X(ω)| = log|G(ω)| + log|H(ω)| ... (1)

Taking the inverse Fourier transform of both sides of equation (1) gives equation (2), which is the cepstrum:

c(τ) = F⁻¹ log|X(ω)| = F⁻¹ log|G(ω)| + F⁻¹ log|H(ω)| ... (2)

Because it results from an inverse transform out of the frequency domain, τ has the dimension of time; it is called quefrency.

【００４９】[0049]

Next, extraction of the fundamental period and the envelope is explained. The first term on the right-hand side of equation (1) is the fine structure of the spectrum, and the second term is the spectral envelope. Their inverse Fourier transforms differ markedly: the first term appears as a peak at high quefrency, while the second is concentrated in the low-quefrency region from 0 to about 2 to 4 ms. Taking the Fourier transform of the low-quefrency part yields the log spectral envelope, and exponentiating that yields the spectral envelope. The smoothness of the resulting spectral envelope depends on how many low-quefrency components are used. The operation of separating quefrency components is called liftering.

【００５０】[0050]

FIG. 9 shows the configuration of the cepstrum device 26. The cepstrum device 26 includes a logarithmic conversion unit 261, an inverse FFT unit 262, a cepstrum window 263, a peak extraction unit 264, and an FFT unit 265. The cepstrum window 263, the peak extraction unit 264, and the FFT unit 265 are unnecessary when the cepstrum data obtained by the inverse FFT unit 262 is used as the data supplied to the voice input layer 32 of the neuron element network 22; they are needed when the spectral envelope is used as the input data of the neuron element network 22. The FFT unit 265 is not strictly required, and the FFT device 21 may be used in its place.

【００５１】[0051]

The logarithmic conversion unit 261 applies the logarithmic conversion of equation (1) to the spectrum data X(ω) supplied from the FFT device 21 to obtain log|X(ω)|, and supplies it to the inverse FFT unit 262. The inverse FFT unit 262 takes the inverse FFT of the supplied values to calculate c(τ), thereby obtaining the cepstrum data. The inverse FFT unit 262 supplies the cepstrum data to the input layers I of ANN1 to ANNn described in the first embodiment, as the input data In (vector sequence) used for learning or recognizing speech data. The number of input data In supplied to ANN1 to ANNn equals the number p of neuron elements in the input layer I, which is chosen freely to suit the speech recognition task. The quefrency (τ) axis is therefore divided into p parts, and the power value for each quefrency is supplied to each of ANN1 to ANNn as the input data of the neuron elements I1 to Ip. Supplying the cepstrum data obtained by the inverse FFT unit 262 to the input layers I of ANN1 to ANNn in this way is the first example of the second embodiment.
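The log-then-inverse-transform chain of equations (1) and (2) can be sketched as follows. The naive DFT stands in for the FFT device 21 and the inverse FFT unit 262. The function names and the small `floor` used to avoid log(0) are assumptions for illustration.

```python
import cmath
import math


def dft(values, inverse=False):
    """Naive discrete Fourier transform; the inverse includes the 1/n factor."""
    n = len(values)
    sign = 1 if inverse else -1
    out = [sum(values[i] * cmath.exp(sign * 2j * math.pi * k * i / n) for i in range(n))
           for k in range(n)]
    return [v / n for v in out] if inverse else out


def real_cepstrum(frame, floor=1e-12):
    """c(tau) = F^-1 log|X(omega)| for one frame of samples."""
    spectrum = dft(frame)                                        # X(omega)
    log_mag = [math.log(max(abs(x), floor)) for x in spectrum]   # log|X(omega)|
    return [c.real for c in dft(log_mag, inverse=True)]          # c(tau)
```

For a single impulse of height e the spectrum is flat with magnitude e, so log|X| is 1 everywhere and the cepstrum is 1 at quefrency 0 and 0 elsewhere.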

【００５２】[0052]

Next, a second example of the second embodiment is described. In this second example, the cepstrum window 263 lifters the obtained cepstrum data, separating the quefrency components into a high-quefrency part and a low-quefrency part. The FFT unit 265 Fourier-transforms the separated low-quefrency part to obtain the log spectral envelope, and exponentiates it to obtain the spectral envelope. From this spectral envelope data, the frequency axis is divided according to the number of neuron elements, and the power value for each frequency (vector sequence) is supplied to the input layers I of ANN1 to ANNn.
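The liftering and envelope recovery of the second example can be sketched as below: keep the low-quefrency part of the cepstrum, Fourier-transform it to get the log spectral envelope, then exponentiate. The function names, the symmetric rectangular window, and the `cutoff` parameter (in samples) are assumptions; the text does not specify the window shape.

```python
import cmath
import math


def dft(values):
    """Naive forward discrete Fourier transform (stand-in for the FFT unit 265)."""
    n = len(values)
    return [sum(values[i] * cmath.exp(-2j * math.pi * k * i / n) for i in range(n))
            for k in range(n)]


def spectral_envelope(cepstrum, cutoff):
    """Low-pass lifter the cepstrum, then map back to a spectral envelope.

    Coefficients with quefrency index below cutoff, and their mirror images
    at the end of the array, are kept; everything else is zeroed.
    """
    n = len(cepstrum)
    liftered = [c if (i < cutoff or i > n - cutoff) else 0.0
                for i, c in enumerate(cepstrum)]
    log_envelope = dft(liftered)                       # log spectral envelope
    return [math.exp(v.real) for v in log_envelope]    # spectral envelope
```

A larger `cutoff` keeps more low-quefrency components and gives a less smooth envelope, which matches the dependence on the number of components described in the text.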

【００５３】[0053]

The low-quefrency cepstrum data separated by the cepstrum window 263 may itself be supplied to each input layer I as input data. Alternatively, the peak extraction unit 264 may extract the fundamental period from the separated high-quefrency cepstrum data, and this may be used as one of the input data together with the spectral envelope data obtained by the FFT unit 265. In this case, since the input layer I has p neuron elements, (p − 1) input data In1 to In(p−1) are taken from the spectral envelope data and input to each input layer I, and the input data Inp is taken from the fundamental period data and input to the input layer I.
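Peak extraction and assembly of the p-element input vector can be sketched as follows. The function names, the `min_quefrency` boundary between low and high quefrency, and the representation of the fundamental period as a sample index are all assumptions for illustration.

```python
def fundamental_period(cepstrum, min_quefrency):
    """Quefrency index of the largest cepstral peak in the high-quefrency part.

    Only indices from min_quefrency up to half the frame length are searched,
    because the upper half of a real cepstrum mirrors the lower half.
    """
    half = len(cepstrum) // 2
    return max(range(min_quefrency, half + 1), key=lambda q: cepstrum[q])


def build_input(envelope_values, period, p):
    """First p - 1 inputs from the spectral envelope, last input from the period."""
    if len(envelope_values) != p - 1:
        raise ValueError("expected p - 1 envelope values")
    return list(envelope_values) + [float(period)]
```

In practice the period would probably be rescaled to the same range as the envelope values before being fed to the input layer; that scaling is left out here.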

【００５４】[0054]

As described above, according to the second embodiment, using cepstrum data of the speech data means that phonemes are learned and recognized from data that captures the characteristics of speech better than the power spectrum does, so the recognition rate improves.

【００５５】[0055]

Although speech recognition has been described in the first and second embodiments, image recognition may be performed using cepstrum data of image data. The image data in this case may be either image data read by the graphic reading device 24 or image data received by the communication control device 15.

【００５６】[0056]

The first and second embodiments have been described for speech recognition, but the present invention can also be applied to character recognition, figure recognition, prediction of the generation of time-series patterns of motion, and the like. For character recognition, as many ANNs are provided as there are characters to be recognized, and each ANN is trained self-associatively on its corresponding character. Feature points such as intersections, stroke order, stroke count, and the like are used as learning data.

【００５７】[0057]

[Effects of the Invention] The neural network of the present invention comprises: a plurality of bottleneck neuron element networks, each having an input layer with a plurality of neuron elements, an intermediate layer with fewer neuron elements than the input layer, and an output layer with the same number of neuron elements as the input layer, and each associated with a different specific meaning; input means for inputting a vector sequence to the input layer of each bottleneck neuron element network; similarity calculation means for calculating the similarity between the input vector sequence and the output vector sequence that each bottleneck neuron element network produces for that input; and output means for outputting, as the meaning of the vector sequence input by the input means, the specific meaning corresponding to the bottleneck neuron element network with the highest similarity calculated by the similarity calculation means. A high recognition rate can therefore be obtained with a simple configuration. In addition, because self-associative learning using the vector sequence of the corresponding specific meaning as both input data and teacher signal is performed for each bottleneck neuron element network, learning can be carried out easily and in a short time.

[Brief description of drawings]

【図１】FIG. 1 is a system configuration diagram of a speech recognition device using a neural network according to an embodiment of the present invention.

【図２】FIG. 2 is a configuration diagram of the neuron element network of the same speech recognition device.

【図３】FIG. 3 is an explanatory diagram showing the connection weight table that stores the connection weights and thresholds of each ANN of the neuron element network 22.

【図４】FIG. 4 is an explanatory diagram illustrating the spectrum analysis of speech by the same speech recognition device.

【図５】FIG. 5 is an explanatory diagram showing the vector sequence of speech spectrum-analyzed by the FFT device of the same speech recognition device.

【図６】FIG. 6 is an explanatory diagram showing the data obtained by spectrum analysis of the three patterns “あ”, “ア”, and “Ａ” by the FFT device of the same speech recognition device.

【図７】FIG. 7 is an explanatory diagram showing the relationship between input data and teacher signals when training ANN1 in the neuron element network of the same speech recognition device.

【図８】FIG. 8 is a system configuration diagram of a neural network according to a second embodiment of the present invention.

【図９】FIG. 9 is a configuration diagram of the cepstrum device according to the second embodiment.

[Explanation of symbols]

11 CPU
12 bus line
13 ROM
14 RAM
15 communication control device
16 printer
17 display device
18 keyboard
21 FFT device
22 neuron element network
23 voice input device
24 graphic reading device
26 cepstrum device

Claims

[Claims]

1. A neural network comprising: a plurality of bottleneck neuron element networks, each having an input layer with a plurality of neuron elements, an intermediate layer with fewer neuron elements than the input layer, and an output layer with the same number of neuron elements as the input layer, each network being associated with a different specific meaning; input means for inputting a vector sequence to the input layer of each of the bottleneck neuron element networks; similarity calculation means for calculating, for each bottleneck neuron element network, the similarity between the input vector sequence and the output vector sequence produced when the vector sequence is input by the input means; and output means for outputting, as the meaning of the vector sequence input by the input means, the specific meaning associated with the bottleneck neuron element network having the highest similarity calculated by the similarity calculation means.

2. The neural network according to claim 1, wherein each of the plurality of bottleneck neuron element networks has undergone self-associative learning using a vector sequence of its corresponding specific meaning as both the input data and the teacher signal.

3. The neural network as described, further comprising learning means for carrying out self-associative learning on each of the bottleneck neuron element networks, using a vector sequence of the corresponding specific meaning as both the input data and the teacher signal.

4. The neural network according to claim 2, wherein self-associative learning is performed according to the back propagation rule.

5. The neural network according to any one of claims 1 to 4, wherein the specific meaning is a phoneme constituting speech, and the vector sequence input to the input layer is a vector sequence representing feature quantities of the phoneme analyzed in time series.

6. The neural network according to claim 5, wherein at least one of speech spectrum data and cepstrum data is used as a vector representing a feature amount for the specific meaning.
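
The arrangement recited in claims 1 to 4 can be illustrated by the following minimal Python sketch. It is not the implementation of the embodiment; it is a toy example under stated assumptions. The names `BottleneckNet`, `recognize`, `make_sequence` and `build_demo` are hypothetical, as are the tanh intermediate units, the linear output units, the learning rate, and the two four-element "phoneme" patterns. The claims do not fix a similarity measure, so the negative mean squared reconstruction error is assumed here.

```python
import math
import random


class BottleneckNet:
    """Input layer -> smaller intermediate layer -> output layer of input size."""

    def __init__(self, n_in, n_hidden, rng):
        # Small random initial connection weights (assumed initialisation).
        self.w1 = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_hidden)]
        self.b1 = [0.0] * n_hidden
        self.w2 = [[rng.uniform(-0.1, 0.1) for _ in range(n_hidden)] for _ in range(n_in)]
        self.b2 = [0.0] * n_in

    def forward(self, x):
        h = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
             for row, b in zip(self.w1, self.b1)]
        y = [sum(w * hj for w, hj in zip(row, h)) + b
             for row, b in zip(self.w2, self.b2)]
        return h, y

    def train_step(self, x, lr):
        """One backpropagation step with x as both input data and teacher signal."""
        h, y = self.forward(x)
        err = [yi - xi for yi, xi in zip(y, x)]  # gradient of 0.5 * sum(err^2) w.r.t. y
        # Intermediate-layer deltas are computed before w2 is updated.
        dh = [(1.0 - hj * hj) * sum(err[i] * self.w2[i][j] for i in range(len(x)))
              for j, hj in enumerate(h)]
        for i, e in enumerate(err):
            for j, hj in enumerate(h):
                self.w2[i][j] -= lr * e * hj
            self.b2[i] -= lr * e
        for j, d in enumerate(dh):
            for k, xk in enumerate(x):
                self.w1[j][k] -= lr * d * xk
            self.b1[j] -= lr * d

    def similarity(self, seq):
        """Assumed measure: negative mean squared error between input and output."""
        total = 0.0
        for x in seq:
            _, y = self.forward(x)
            total += sum((yi - xi) ** 2 for yi, xi in zip(y, x))
        return -total / (len(seq) * len(seq[0]))


def recognize(nets, seq):
    """Return the meaning whose network reproduces the vector sequence best."""
    return max(nets, key=lambda label: nets[label].similarity(seq))


def make_sequence(pattern, rng, length=5, noise=0.05):
    """Toy stand-in for a spectrum-analysed vector sequence of one phoneme."""
    return [[p + rng.uniform(-noise, noise) for p in pattern] for _ in range(length)]


def build_demo(seed=0):
    rng = random.Random(seed)
    # Hypothetical feature patterns for two phonemes.
    patterns = {"a": [0.9, 0.1, 0.9, 0.1], "i": [0.1, 0.9, 0.1, 0.9]}
    nets = {}
    for label, pattern in patterns.items():
        # Each network learns only its own phoneme (self-associative learning).
        net = BottleneckNet(n_in=4, n_hidden=2, rng=rng)
        for _ in range(200):
            for x in make_sequence(pattern, rng):
                net.train_step(x, lr=0.1)
        nets[label] = net
    return nets, patterns, rng


if __name__ == "__main__":
    nets, patterns, rng = build_demo()
    print(recognize(nets, make_sequence(patterns["i"], rng)))
```

Because each network is trained only on its own meaning, adding a new meaning in this sketch requires training one additional network and leaves the existing networks unchanged.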