JPH04140800A - Voice recognition system using neural network - Google Patents

Voice recognition system using neural network

Info

Publication number
JPH04140800A
Authority
JP
Japan
Prior art keywords
layer
neural network
input
hidden
input layer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
JP2264669A
Other languages
Japanese (ja)
Other versions
JPH0642160B2 (en)
Inventor
Hidefumi Sawai
沢井 秀文
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
A T R JIDO HONYAKU DENWA KENKYUSHO KK
Original Assignee
A T R JIDO HONYAKU DENWA KENKYUSHO KK
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by A T R JIDO HONYAKU DENWA KENKYUSHO KK filed Critical A T R JIDO HONYAKU DENWA KENKYUSHO KK
Priority to JP2264669A priority Critical patent/JPH0642160B2/en
Publication of JPH04140800A publication Critical patent/JPH04140800A/en
Publication of JPH0642160B2 publication Critical patent/JPH0642160B2/en
Anticipated expiration legal-status Critical
Expired - Fee Related legal-status Critical Current


Abstract

PURPOSE: To enable optimal recognition of patterns with significant articulation variation, such as speech patterns, by extracting local features in the layers close to the input layer of the neural network and then extracting global features in the higher-order hidden layers. CONSTITUTION: After a phoneme pattern corresponding to a phoneme category is presented to the input layer 1, the output unit of the corresponding phoneme category is set to '1' and all other units are set to '0'. The input layer 1 has a time axis in the horizontal direction and a frequency axis in the vertical direction. Windowing 11 is applied locally in time and frequency. Signals are propagated from the input layer 1 to the neural unit 21b of the first hidden layer 2 through connections 11a, which correspond to nerves. In the first hidden layer 2, windowing 21 is applied in the same way to extract more global features of the speech, and signals are propagated to the unit 31b of the second hidden layer through connections 21a. Signal propagation to the higher-order layers is carried out in the same manner. In the learning stage, the actual output values produced by the neural network 30 are compared with the target values for the corresponding phoneme category, and the coupling coefficients between the layers are adjusted so that the error becomes smaller. Using the learned coupling coefficients, the recognition result is obtained as the phoneme category with the maximum output in the output layer 5, after feature extraction from the input layer 1 through the hidden layers 2 to 4.

Description

DETAILED DESCRIPTION OF THE INVENTION [Field of Industrial Application] The present invention relates to a speech recognition method using a neural network, and more particularly to a speech recognition method that recognizes speech patterns using a neural network.

[Prior Art and Problems to be Solved by the Invention] In speech recognition devices, realizing on a computer a decision mechanism close to the one humans perform in daily life is important for the practical application of speech recognition. As one approach to this end, neural networks, which realize the human neural circuitry on a computer as a simple model, are widely used.

However, conventional neural networks have the problem of degraded generalization ability: their discrimination performance deteriorates on unlearned patterns that differ from the training patterns.

This is because, in ordinary neural networks, the connections from the input layer to the hidden layers and from the hidden layers to the output layer are usually fully connected, so a structure suited to efficient feature extraction from speech and to accurate recognition processing is not necessarily realized. For this reason, speech recognition devices using neural networks have been unable to cope with the temporal and frequency variations that accompany the changes in articulation peculiar to speech phenomena, and this has been an obstacle to realizing a highly accurate speech recognition device.

Therefore, the main object of the present invention is to realize a highly accurate speech recognition method by using a network structure that is robust against temporal and frequency variations in speech production.

[Means for Solving the Problems] The present invention is a speech recognition method using a neural network, comprising speech analysis means for analyzing continuously uttered input speech, conversion means for converting the analyzed speech into a time series of feature parameters, and a neural network that includes an input layer, a plurality of hidden layers, and an output layer and receives the converted feature parameters as input. Local windowing is applied in both the time direction and the frequency direction to the input layer and the input-side hidden layers of the neural network; single-unit connections are provided from the input-side layers, including the input layer and hidden layers, to the output-side layers, including the output-side hidden layers and the output layer; and the windowing is applied over the entire input-side layers.

[Operation] In the speech recognition device using a neural network according to the present invention, local windowing is applied in both the time direction and the frequency direction to the input layer and the input-side hidden layers of the neural network, and the windowing is applied over the entire input-side layers. As a result, more global features can be extracted in the upper hidden layers, and patterns with significant articulation variation, such as speech patterns, can be recognized optimally.

[Embodiment of the Invention] FIG. 1 is a schematic block diagram of an embodiment of the present invention, and FIG. 2 is a diagram showing an example of the configuration of the neural network shown in FIG. 1.

First, referring to FIG. 1, the speech analysis device 10 analyzes continuously uttered input speech and supplies the analyzed speech to the conversion circuit 20. The conversion circuit 20 converts the speech analyzed by the speech analysis device 10 into a time series of feature parameters and supplies it to the neural network 30.

As shown in FIG. 2, the neural network 30 is composed of an input layer 1, a first hidden layer 2, a second hidden layer 3, a third hidden layer 4, and an output layer 5. In the example shown in FIG. 2 the number of hidden layers is three, but in general any number of hidden layers can be used. FIG. 2 is described here using, as an example, a neural network that discriminates the six phoneme categories /b, d, g, m, n, N/, but any number and kind of classification categories can likewise be selected. For training the neural network, the error backpropagation method of [1] McClelland, J. L., D. E. Rumelhart and the PDP Research Group: "Parallel Distributed Processing", vol. 1, chap. 8, MIT Press (1988) is used.

First, a phoneme pattern corresponding to a phoneme category is presented to the input layer 1, then the output unit of the corresponding phoneme category is set to "1" and all other units are set to "0" (in the example shown in FIG. 2, the unit corresponding to /m/ is set to "1"). The phoneme patterns presented to the neural network 30 are patterns that have undergone FFT or LPC type feature analysis. In the input layer 1, the horizontal axis represents the time axis and the vertical axis represents the frequency axis. In the input layer 1 shown in FIG. 2, a speech pattern of 15 frames (150 msec) along the horizontal (time) axis and 16 channels along the vertical (frequency) axis is input. The windowing 11 is applied locally in time and frequency, covering 3 frames (30 msec) in the time direction and 4 channels in the frequency direction, and signals are propagated from the input layer 1 to the neural unit 21b of the first hidden layer 2 via the connections 11a, which correspond to nerves.
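As an illustration of this input arrangement, the following is a minimal NumPy sketch, not taken from the patent: all names and the random stand-in pattern are illustrative. A 16-channel × 15-frame feature pattern is presented, the target vector over the six categories /b, d, g, m, n, N/ is one-hot, and the windowing 11 spans 3 frames × 4 channels.

```python
import numpy as np

PHONEMES = ["b", "d", "g", "m", "n", "N"]  # the six output categories of FIG. 2
N_FRAMES, N_CHANNELS = 15, 16              # 150 ms of speech x 16 filter-bank channels
WIN_T, WIN_F = 3, 4                        # windowing 11: 3 frames (30 ms) x 4 channels

def one_hot_target(phoneme):
    """Target vector: "1" for the presented phoneme category, "0" elsewhere."""
    t = np.zeros(len(PHONEMES))
    t[PHONEMES.index(phoneme)] = 1.0
    return t

target = one_hot_target("m")               # the /m/ example of FIG. 2
x = np.random.rand(N_CHANNELS, N_FRAMES)   # stand-in for an FFT/LPC-analyzed pattern
patch = x[0:WIN_F, 0:WIN_T]                # one local window feeding one unit 21b
```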

Similarly, in the first hidden layer 2, windowing 21 (5 frames × 5 channels) is applied to extract more global features of the speech, and signals are propagated to the unit 31b of the next, second hidden layer 3 via the connections 21a. Signal propagation to the higher layers proceeds in the same manner. The windowings 11, 21, 31 (5 frames × 5 channels), and so on, are each applied continuously over the entire area of the input layer 1, the first hidden layer 2, the second hidden layer 3, and the third hidden layer 4, respectively.
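The windowed propagation described above might be sketched as follows, continuing the names from the previous fragment. The sigmoid unit and the shared window weights are assumptions made for brevity; the patent only requires that each unit receive a local time-frequency window applied over the whole lower layer, not that the weights be tied or that each layer form a single feature map.

```python
def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def windowed_layer(act, w, b, win_f, win_t):
    """One windowed layer: each unit sees only a local win_f x win_t patch.

    act: (channels or units, frames) activation map of the layer below.
    w:   (win_f * win_t,) connection weights for the local window.
    Returns a (positions_f, positions_t) map of unit outputs, i.e. the window
    is applied continuously over the entire area of the lower layer.
    """
    n_f, n_t = act.shape
    out = np.empty((n_f - win_f + 1, n_t - win_t + 1))
    for f in range(out.shape[0]):
        for t in range(out.shape[1]):
            out[f, t] = sigmoid(w @ act[f:f + win_f, t:t + win_t].ravel() + b)
    return out

h1 = windowed_layer(x, np.random.randn(WIN_F * WIN_T) * 0.1, 0.0, WIN_F, WIN_T)  # windowing 11
h2 = windowed_layer(h1, np.random.randn(25) * 0.1, 0.0, 5, 5)                    # windowing 21
h3 = windowed_layer(h2, np.random.randn(25) * 0.1, 0.0, 5, 5)                    # windowing 31
```

Applying the 3-frame × 4-channel window to the 16 × 15 input yields a 13 × 13 map; the two 5 × 5 windowings then reduce it to 9 × 9 and 5 × 5, so the receptive fields grow from local toward global as signals move to the output side.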

In the learning stage, the actual output values produced by the neural network 30 are compared with the target values ("1" or "0") for the corresponding phoneme category, and the coupling coefficients between the layers are adjusted so that the error becomes as small as possible. The recognition result is then obtained, using the learned coupling coefficients, as the phoneme category with the maximum output in the output layer 5 after feature extraction from the input layer 1 through the hidden layers 2 to 4.

As described above, the lower layers relatively close to the input layer 1 extract local features, while the higher layers close to the output layer 5 extract and integrate global features. As a result, a network structure that is more robust against variations in the input pattern can be realized, compared with the conventional case in which all inter-layer connections are present.

[Effects of the Invention] As described above, according to the present invention, local features are extracted in the layers close to the input layer of the neural network, and more global features are then extracted in the upper hidden layers, so that patterns with significant articulation variation, such as speech patterns, can be recognized optimally.

[Brief Description of the Drawings]

FIG. 1 is a schematic block diagram of a speech recognition system to which an embodiment of the present invention is applied. FIG. 2 is a diagram showing the configuration of the neural network shown in FIG. 1. In the figures, 1 denotes the input layer, 2 the first hidden layer, 3 the second hidden layer, 4 the third hidden layer, 5 the output layer, 10 the speech analysis device, 11, 21, 31 and 41 the windowings, 20 the conversion circuit, and 30 the neural network.

Claims (1)

[Claims] A speech recognition method using a neural network, comprising: speech analysis means for analyzing continuously uttered input speech; conversion means for converting the speech analyzed by the speech analysis means into a time series of feature parameters; and a neural network that includes an input layer, a plurality of hidden layers, and an output layer and to which the feature parameters converted by the conversion means are input, wherein local windowing is applied in both the time direction and the frequency direction to the input layer and the input-side hidden layers of the neural network, single-unit connections are provided from the input-side layers, including the input layer and the hidden layers, to the output-side layers, including the output-side hidden layers and the output layer, and the windowing is applied over the entire input-side layers.
JP2264669A 1990-10-01 1990-10-01 Speech recognition device using neural network Expired - Fee Related JPH0642160B2 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP2264669A JPH0642160B2 (en) 1990-10-01 1990-10-01 Speech recognition device using neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
JP2264669A JPH0642160B2 (en) 1990-10-01 1990-10-01 Speech recognition device using neural network

Publications (2)

Publication Number Publication Date
JPH04140800A true JPH04140800A (en) 1992-05-14
JPH0642160B2 JPH0642160B2 (en) 1994-06-01

Family

ID=17406564

Family Applications (1)

Application Number Title Priority Date Filing Date
JP2264669A Expired - Fee Related JPH0642160B2 (en) 1990-10-01 1990-10-01 Speech recognition device using neural network

Country Status (1)

Country Link
JP (1) JPH0642160B2 (en)

Also Published As

Publication number Publication date
JPH0642160B2 (en) 1994-06-01

Similar Documents

Publication Publication Date Title
CN109326302B (en) Voice enhancement method based on voiceprint comparison and generation of confrontation network
CN108766419B (en) Abnormal voice distinguishing method based on deep learning
JP2764277B2 (en) Voice recognition device
CN112331216A (en) Speaker recognition system and method based on composite acoustic features and low-rank decomposition TDNN
EP0342630A2 (en) Speech recognition with speaker adaptation by learning
CN112466326A (en) Speech emotion feature extraction method based on transform model encoder
CN109313892A (en) Steady language identification method and system
TWI223791B (en) Method and system for utterance verification
CN109671423A (en) Non-parallel text compressing method under the limited situation of training data
JPH03201079A (en) Pattern recognizing device
JPH0540497A (en) Speaker adaptive voice recognizing device
Sunny et al. Feature extraction methods based on linear predictive coding and wavelet packet decomposition for recognizing spoken words in malayalam
JPH04140800A (en) Voice recognition system using neural network
Lashkari et al. NMF-based cepstral features for speech emotion recognition
Safie Spoken Digit Recognition Using Convolutional Neural Network
Alex et al. Performance analysis of SOFM based reduced complexity feature extraction methods with back propagation neural network for multilingual digit recognition
Abd El-Moneim et al. Effect of reverberation phenomena on text-independent speaker recognition based deep learning
JPH06161495A (en) Speech recognizing device
JPH0466999A (en) Device for detecting clause boundary
Chelliah et al. Robust Hearing-Impaired Speaker Recognition from Speech using Deep Learning Networks in Native
JPH05204399A (en) Unspecified speaker's phoneme recognition method
Nijhawan et al. A comparative study of two different neural models for speaker recognition systems
Giurgiu On the use of Neural Networks for automatic vowel recognition
JPH04369699A (en) Unspecified spealer voice recognition system using neural network
Thamburaj et al. Automatic Speech Recognition Based on Improved Deep Learning

Legal Events

Date Code Title Description
LAPS Cancellation because of no payment of annual fees