JPH04140800A - Voice recognition system using neural network - Google Patents

Voice recognition system using neural network

Info

Publication number
JPH04140800A
Authority
JP
Japan
Prior art keywords
layer
neural network
input
hidden
input layer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
JP2264669A
Other languages
Japanese (ja)
Other versions
JPH0642160B2 (en)
Inventor
Hidefumi Sawai
沢井 秀文
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
A T R JIDO HONYAKU DENWA KENKYUSHO KK
Original Assignee
A T R JIDO HONYAKU DENWA KENKYUSHO KK
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by A T R JIDO HONYAKU DENWA KENKYUSHO KK filed Critical A T R JIDO HONYAKU DENWA KENKYUSHO KK
Priority to JP2264669A priority Critical patent/JPH0642160B2/en
Publication of JPH04140800A publication Critical patent/JPH04140800A/en
Publication of JPH0642160B2 publication Critical patent/JPH0642160B2/en
Anticipated expiration legal-status Critical
Expired - Fee Related legal-status Critical Current


Abstract

PURPOSE: To enable optimal recognition of patterns with significant articulation variation, such as speech patterns, by extracting local features in the layers close to the input layer of the neural network and then extracting global features in the higher-order hidden layers. CONSTITUTION: After a phoneme pattern corresponding to a phoneme category is presented to the input layer 1, the output unit of the corresponding phoneme category is set to '1' and all other units are set to '0'. The input layer 1 has a time axis in the horizontal direction and a frequency axis in the vertical direction. Windowing 11 is applied locally in time and frequency. Signals are propagated from the input layer 1 to the neural unit 21b of the first hidden layer 2 through connections 11a, which correspond to nerves. In the first hidden layer 2, windowing 21 is applied in the same way to extract more global features of the speech, and signals are propagated to the unit 31b of the second hidden layer through connections 21a. Signal propagation to the higher-order layers is carried out in the same manner. In the learning stage, the actual output values produced by the neural network 30 are compared with the target values for the corresponding phoneme category, and the coupling coefficients between the layers are adjusted so that the error becomes smaller. Using the learned coupling coefficients, the recognition result is obtained as the phoneme category with the maximum output in the output layer 5, after feature extraction from the input layer 1 through the hidden layers 2 to 4.

Description

DETAILED DESCRIPTION OF THE INVENTION [Field of Industrial Application] The present invention relates to a speech recognition method using a neural network, and more particularly to a speech recognition method that recognizes speech patterns using a neural network.

[Prior Art and Problems to be Solved by the Invention] In speech recognition devices, realizing on a computer a decision mechanism close to the one humans perform in daily life is important for the practical application of speech recognition. As one approach to this end, neural networks, which realize the human neural circuitry on a computer as a simple model, are widely used.

However, conventional neural networks have the problem of degraded generalization ability: their discrimination performance deteriorates on unlearned patterns that differ from the training patterns.

This is because, in ordinary neural networks, the connections from the input layer to the hidden layers and from the hidden layers to the output layer are usually fully connected, so a structure suited to efficient feature extraction from speech and to accurate recognition processing is not necessarily realized. For this reason, speech recognition devices using neural networks have been unable to cope with the temporal and frequency variations that accompany the changes in articulation peculiar to speech phenomena, and this has been an obstacle to realizing a highly accurate speech recognition device.

Therefore, the main object of the present invention is to realize a highly accurate speech recognition method by using a network structure that is robust against temporal and frequency variations in speech production.

[Means for Solving the Problems] The present invention is a speech recognition method using a neural network, comprising speech analysis means for analyzing continuously uttered input speech, conversion means for converting the analyzed speech into a time series of feature parameters, and a neural network that includes an input layer, a plurality of hidden layers, and an output layer and receives the converted feature parameters as input. Local windowing is applied in both the time direction and the frequency direction to the input layer and the input-side hidden layers of the neural network; single-unit connections are provided from the input-side layers, including the input layer and hidden layers, to the output-side layers, including the output-side hidden layers and the output layer; and the windowing is applied over the entire input-side layers.

[Operation] In the speech recognition device using a neural network according to the present invention, local windowing is applied in both the time direction and the frequency direction to the input layer and the input-side hidden layers of the neural network, and the windowing is applied over the entire input-side layers. As a result, more global features can be extracted in the upper hidden layers, and patterns with significant articulation variation, such as speech patterns, can be recognized optimally.

[Embodiment of the Invention] FIG. 1 is a schematic block diagram of an embodiment of the present invention, and FIG. 2 is a diagram showing an example of the configuration of the neural network shown in FIG. 1.

First, referring to FIG. 1, the speech analysis device 10 analyzes continuously uttered input speech and supplies the analyzed speech to the conversion circuit 20. The conversion circuit 20 converts the speech analyzed by the speech analysis device 10 into a time series of feature parameters and supplies it to the neural network 30.

As shown in FIG. 2, the neural network 30 is composed of an input layer 1, a first hidden layer 2, a second hidden layer 3, a third hidden layer 4, and an output layer 5. In the example shown in FIG. 2 the number of hidden layers is three, but in general any number of hidden layers can be used. FIG. 2 is described here using, as an example, a neural network that discriminates the six phoneme categories /b, d, g, m, n, N/, but any number and kind of classification categories can likewise be selected. For training the neural network, the error backpropagation method of [1] McClelland, J. L., D. E. Rumelhart and the PDP Research Group: "Parallel Distributed Processing", vol. 1, chap. 8, MIT Press (1988) is used.

First, a phoneme pattern corresponding to a phoneme category is presented to the input layer 1, then the output unit of the corresponding phoneme category is set to "1" and all other units are set to "0" (in the example shown in FIG. 2, the unit corresponding to /m/ is set to "1"). The phoneme patterns presented to the neural network 30 are patterns that have undergone FFT or LPC type feature analysis. In the input layer 1, the horizontal axis represents the time axis and the vertical axis represents the frequency axis. In the input layer 1 shown in FIG. 2, a speech pattern of 15 frames (150 msec) along the horizontal (time) axis and 16 channels along the vertical (frequency) axis is input. The windowing 11 is applied locally in time and frequency, covering 3 frames (30 msec) in the time direction and 4 channels in the frequency direction, and signals are propagated from the input layer 1 to the neural unit 21b of the first hidden layer 2 via the connections 11a, which correspond to nerves.
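As an illustration of this input arrangement, the following is a minimal NumPy sketch, not taken from the patent: all names and the random stand-in pattern are illustrative. A 16-channel × 15-frame feature pattern is presented, the target vector over the six categories /b, d, g, m, n, N/ is one-hot, and the windowing 11 spans 3 frames × 4 channels.

```python
import numpy as np

PHONEMES = ["b", "d", "g", "m", "n", "N"]  # the six output categories of FIG. 2
N_FRAMES, N_CHANNELS = 15, 16              # 150 ms of speech x 16 filter-bank channels
WIN_T, WIN_F = 3, 4                        # windowing 11: 3 frames (30 ms) x 4 channels

def one_hot_target(phoneme):
    """Target vector: "1" for the presented phoneme category, "0" elsewhere."""
    t = np.zeros(len(PHONEMES))
    t[PHONEMES.index(phoneme)] = 1.0
    return t

target = one_hot_target("m")               # the /m/ example of FIG. 2
x = np.random.rand(N_CHANNELS, N_FRAMES)   # stand-in for an FFT/LPC-analyzed pattern
patch = x[0:WIN_F, 0:WIN_T]                # one local window feeding one unit 21b
```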

Similarly, in the first hidden layer 2, windowing 21 (5 frames × 5 channels) is applied to extract more global features of the speech, and signals are propagated to the unit 31b of the next, second hidden layer 3 via the connections 21a. Signal propagation to the higher layers proceeds in the same manner. The windowings 11, 21, 31 (5 frames × 5 channels), and so on, are each applied continuously over the entire area of the input layer 1, the first hidden layer 2, the second hidden layer 3, and the third hidden layer 4, respectively.
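The windowed propagation described above might be sketched as follows, continuing the names from the previous fragment. The sigmoid unit and the shared window weights are assumptions made for brevity; the patent only requires that each unit receive a local time-frequency window applied over the whole lower layer, not that the weights be tied or that each layer form a single feature map.

```python
def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def windowed_layer(act, w, b, win_f, win_t):
    """One windowed layer: each unit sees only a local win_f x win_t patch.

    act: (channels or units, frames) activation map of the layer below.
    w:   (win_f * win_t,) connection weights for the local window.
    Returns a (positions_f, positions_t) map of unit outputs, i.e. the window
    is applied continuously over the entire area of the lower layer.
    """
    n_f, n_t = act.shape
    out = np.empty((n_f - win_f + 1, n_t - win_t + 1))
    for f in range(out.shape[0]):
        for t in range(out.shape[1]):
            out[f, t] = sigmoid(w @ act[f:f + win_f, t:t + win_t].ravel() + b)
    return out

h1 = windowed_layer(x, np.random.randn(WIN_F * WIN_T) * 0.1, 0.0, WIN_F, WIN_T)  # windowing 11
h2 = windowed_layer(h1, np.random.randn(25) * 0.1, 0.0, 5, 5)                    # windowing 21
h3 = windowed_layer(h2, np.random.randn(25) * 0.1, 0.0, 5, 5)                    # windowing 31
```

Applying the 3-frame × 4-channel window to the 16 × 15 input yields a 13 × 13 map; the two 5 × 5 windowings then reduce it to 9 × 9 and 5 × 5, so the receptive fields grow from local toward global as signals move to the output side.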

In the learning stage, the actual output values produced by the neural network 30 are compared with the target values ("1" or "0") for the corresponding phoneme category, and the coupling coefficients between the layers are adjusted so that the error becomes as small as possible. The recognition result is then obtained, using the learned coupling coefficients, as the phoneme category with the maximum output in the output layer 5 after feature extraction from the input layer 1 through the hidden layers 2 to 4.

As described above, the lower layers relatively close to the input layer 1 extract local features, while the higher layers close to the output layer 5 extract and integrate global features. As a result, a network structure that is more robust against variations in the input pattern can be realized, compared with the conventional case in which all inter-layer connections are present.

[Effects of the Invention] As described above, according to the present invention, local features are extracted in the layers close to the input layer of the neural network, and more global features are then extracted in the upper hidden layers, so that patterns with significant articulation variation, such as speech patterns, can be recognized optimally.

[Brief Description of the Drawings]

FIG. 1 is a schematic block diagram of a speech recognition system to which an embodiment of the present invention is applied. FIG. 2 is a diagram showing the configuration of the neural network shown in FIG. 1. In the figures, 1 denotes the input layer, 2 the first hidden layer, 3 the second hidden layer, 4 the third hidden layer, 5 the output layer, 10 the speech analysis device, 11, 21, 31 and 41 the windowings, 20 the conversion circuit, and 30 the neural network.

Claims (1)

[Claims] A speech recognition method using a neural network, comprising: speech analysis means for analyzing continuously uttered input speech; conversion means for converting the speech analyzed by the speech analysis means into a time series of feature parameters; and a neural network that includes an input layer, a plurality of hidden layers, and an output layer and to which the feature parameters converted by the conversion means are input, wherein local windowing is applied in both the time direction and the frequency direction to the input layer and the input-side hidden layers of the neural network, single-unit connections are provided from the input-side layers, including the input layer and the hidden layers, to the output-side layers, including the output-side hidden layers and the output layer, and the windowing is applied over the entire input-side layers.
JP2264669A 1990-10-01 1990-10-01 Speech recognition device using neural network Expired - Fee Related JPH0642160B2 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP2264669A JPH0642160B2 (en) 1990-10-01 1990-10-01 Speech recognition device using neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
JP2264669A JPH0642160B2 (en) 1990-10-01 1990-10-01 Speech recognition device using neural network

Publications (2)

Publication Number Publication Date
JPH04140800A true JPH04140800A (en) 1992-05-14
JPH0642160B2 JPH0642160B2 (en) 1994-06-01

Family

ID=17406564

Family Applications (1)

Application Number Title Priority Date Filing Date
JP2264669A Expired - Fee Related JPH0642160B2 (en) 1990-10-01 1990-10-01 Speech recognition device using neural network

Country Status (1)

Country Link
JP (1) JPH0642160B2 (en)

Also Published As

Publication number Publication date
JPH0642160B2 (en) 1994-06-01

Similar Documents

Publication Publication Date Title
CN109326302B (en) Voice enhancement method based on voiceprint comparison and generation of confrontation network
CN108766419B (en) Abnormal voice distinguishing method based on deep learning
JP2764277B2 (en) Voice recognition device
CN112331216A (en) Speaker recognition system and method based on composite acoustic features and low-rank decomposition TDNN
EP0342630A2 (en) Speech recognition with speaker adaptation by learning
CN112466326A (en) Speech emotion feature extraction method based on transform model encoder
CN109313892A (en) Steady language identification method and system
TWI223791B (en) Method and system for utterance verification
CN109671423A (en) Non-parallel text compressing method under the limited situation of training data
JPH03201079A (en) Pattern recognizing device
JPH0540497A (en) Speaker adaptive voice recognizing device
Sunny et al. Feature extraction methods based on linear predictive coding and wavelet packet decomposition for recognizing spoken words in malayalam
JPH04140800A (en) Voice recognition system using neural network
Lashkari et al. NMF-based cepstral features for speech emotion recognition
Safie Spoken Digit Recognition Using Convolutional Neural Network
Alex et al. Performance analysis of SOFM based reduced complexity feature extraction methods with back propagation neural network for multilingual digit recognition
Abd El-Moneim et al. Effect of reverberation phenomena on text-independent speaker recognition based deep learning
JPH06161495A (en) Speech recognizing device
JPH0466999A (en) Device for detecting clause boundary
Chelliah et al. Robust Hearing-Impaired Speaker Recognition from Speech using Deep Learning Networks in Native
JPH05204399A (en) Unspecified speaker's phoneme recognition method
Nijhawan et al. A comparative study of two different neural models for speaker recognition systems
Giurgiu On the use of Neural Networks for automatic vowel recognition
JPH04369699A (en) Unspecified spealer voice recognition system using neural network
Thamburaj et al. Automatic Speech Recognition Based on Improved Deep Learning

Legal Events

Date Code Title Description
LAPS Cancellation because of no payment of annual fees