JP2996417B2 - Voice recognition method

Info

Publication number: JP2996417B2
Application number: JP3030434A
Authority: JP
Inventors: Kiyoaki Aikawa (相川 清明)
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 1991-02-25
Filing date: 1991-02-25
Publication date: 1999-12-27
Anticipated expiration: 2014-12-27
Also published as: JPH04269800A

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Field of Industrial Application] The present invention relates to a method for recognizing speech using a so-called feed-forward artificial neural circuit (neural network), in which an input propagates in one direction to produce an output.

[0002]

[Prior Art] Among artificial neural circuits, the multilayer perceptron type became capable of accurate learning by the backward error propagation method, so-called backpropagation, through the introduction of a differentiable nonlinear function, the so-called sigmoid, as the output function of the cells that correspond to neural elements [D. E. Rumelhart et al., "Learning Internal Representations by Error Propagation," Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 1: Foundations, MIT Press (1986)]. Multilayer perceptron neural networks are called feed-forward neural networks and have also been applied to speech recognition. The Time Delay Neural Network (TDNN) has been proposed as a way of constructing a neural network that is robust to temporal shifts in the position of speech features [A. H. Waibel et al., "Phoneme Recognition Using Time-Delay Neural Networks," IEEE Trans. ASSP, Vol. 37, No. 3, pp. 328-339 (Mar. 1989)]. The distinguishing feature of the TDNN is that connections lined up along the time direction are tied, that is, forced to have the same connection coefficients. However, in conventional feed-forward neural circuits for speech recognition, including the TDNN, the input frame length is fixed, and the recognition rate was low for unknown input speech that is stretched or compressed in time, whether linearly or nonlinearly, relative to the speech used for training. To absorb time warping in part, a time-structured neural network has been proposed in which the tied connections of the TDNN are divided into several segments in time, but its structure does not absorb nonlinear warping within a given fixed interval [Komori et al., "Phoneme Recognition Using Neural Networks That Take Time Structure into Account," Proceedings of the Acoustical Society of Japan Spring Meeting 1990, Vol. 1, pp. 157-158 (Mar. 1990)].

[0003]

[Problems to Be Solved by the Invention] Speech undergoes local stretching and compression of the time axis, that is, nonlinear time warping, depending on the speaker, the context, and the speaking rate. The present invention aims to achieve high recognition performance by using a feed-forward neural network, which has a fixed input data length but excellent pattern discrimination ability, to recognize data of various lengths while taking nonlinear warping of the time axis into account.

[0004]

[Means for Solving the Problems] In the present invention, a set of feature parameter time series is generated for a given speech segment according to a plurality of predetermined time-warping functions, and this set is used as the input to the neural network. The sets of connections from corresponding time points in these feature parameter time series to the neural cells of the first hidden layer are constrained to have the same connection coefficients, forming so-called tied connections. The network is thereby given a structure in which a higher-layer cell fires if the input fits any one of the various time-warping patterns, so that the neural network can accept speech warped in time in various ways. This neural network structure is referred to as a time-warping neural network.

[0005]

[Operation] With the method of the present invention, the time-warping neural network has a structure that can accept several kinds of time-warped speech. When unknown speech is input to this neural network, the first-hidden-layer cells corresponding to one or more of the time-warping patterns fire, and their outputs are integrated in the upper layers, so time-warped speech can be recognized.

[0006]

[Embodiment] An embodiment of the present invention is described below with reference to the drawings. FIG. 1 shows an example of a phoneme recognition system to which an embodiment of the present invention can be applied. To use this system, switches SW1 and SW2 are first both set to the b side and the neural network is trained. Standard speech for training the neural network is input from microphone 1, the output of microphone 1 is passed through filter 2, whose bandwidth is half the sampling frequency of the A/D conversion, and the output of filter 2 is converted to digital values by A/D converter 3. In this embodiment the sampling frequency is 12 kHz, but a different sampling frequency may be used. The output of A/D converter 3 is then passed through mel-scale bandpass filter 4 to obtain a plurality of feature time series. This embodiment uses a bank of 16 mel-scale bandpass filter channels, but the number of channels may differ. Besides feature extraction by mel-scale bandpass filters, other parameters that represent the spectrum, such as cepstrum coefficients, may be used. Various methods are conceivable for designing the mel-scale bandpass filters; in this embodiment, the outputs of several channels among the 128 channels obtained by a 256-point fast Fourier transform (FFT) are summed according to the mel scale, and the logarithm of each sum is used.
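The band-summing step described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the patent's filter design: the mel mapping 2595·log10(1 + f/700), the use of equal-width bands on the mel axis, and the function names `hz_to_mel`, `mel_band_bins`, and `log_band_energies` are all hypothetical choices, and the FFT itself is omitted (a 128-value power spectrum is taken as given).

```python
import math

def hz_to_mel(f):
    # A commonly used mel formula; the patent does not state which mapping it uses.
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_band_bins(n_bins=128, n_bands=16, sample_rate=12000.0):
    """Assign each FFT bin to one of n_bands bands of equal width on the mel axis."""
    nyquist = sample_rate / 2.0
    max_mel = hz_to_mel(nyquist)
    bands = [[] for _ in range(n_bands)]
    for k in range(n_bins):
        centre_hz = (k + 0.5) * nyquist / n_bins  # centre frequency of bin k
        b = min(int(hz_to_mel(centre_hz) / max_mel * n_bands), n_bands - 1)
        bands[b].append(k)
    return bands

def log_band_energies(power_spectrum, bands, floor=1e-10):
    """Sum the bins that fall in each band and take the logarithm of each sum."""
    return [math.log(max(sum(power_spectrum[k] for k in idx), floor))
            for idx in bands]

bands = mel_band_bins()
frame = log_band_energies([1.0] * 128, bands)  # flat dummy spectrum -> one 16-value frame
```

Low bands end up with few FFT bins and high bands with many, which is the point of grouping on the mel scale.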

[0007] Next, the phoneme labels corresponding to the standard speech, which are stored in phoneme label 6, are passed through phoneme position determination unit 7 and supplied to phoneme extraction unit 5, which cuts out the phoneme portions from the time series output by bandpass filter 4 on the basis of the phoneme labels. Time-warping unit 8 applies predetermined time warpings to each extracted phoneme portion to generate a set of feature parameter time series. In this embodiment, five time-warping functions are provided. With x the time axis normalized to run from 0 to 1 and T the length of the phoneme segment obtained by phoneme position determination unit 7, these functions y are given by

$$y = Tx$$
$$y = T\,(x \pm 0.3 \sin(\pi x))$$
$$y = T\,(x \pm 0.15 \sin(2\pi x))$$

These time-warping functions y are shown as lines 21 to 25 in FIG. 2. Any other monotonically increasing function can also be used as a time-warping function. The first expression (line 21) corresponds to linear warping. For the feature parameter time series, the time points y corresponding to the three points x = 1/6, 3/6, and 5/6 are found, and the three frames centered on each of these three time points, nine frames in all, are used. For line 21, for example, these are the nine hatched frames in FIG. 2; for line 25 they are the nine frames enclosed in dotted lines in groups of three, line 25 being an example in which the time axis is compressed toward the center. The number of time points and the number of frames can be increased or decreased. In this embodiment, one frame contains the outputs of the 16 channels of bandpass filter 4 and there are five time-warping functions, so the neural network has 16 × 9 × 5 = 720 inputs. Network generation unit 11 creates the structure of the time-warping neural network according to the number of input cells, the number of hidden cells, the number of output cells, the number of time warpings, the number of sample points on the time axis, and so on.
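As a rough illustration of how the five warping functions and the three sample points lead to 720 inputs, the following sketch selects frames from a dummy phoneme segment. The rounding rule that maps y to a frame index, the clamping at the segment edges, the ordering of the five functions, and all function names are assumptions made for this example; the patent does not specify them.

```python
import math

def warping_functions():
    """Five warpings of normalized time x in [0, 1]; multiply by T to get y."""
    fns = [lambda x: x]                                   # y = T x
    for s in (+1.0, -1.0):                                # y = T (x +/- 0.3 sin(pi x))
        fns.append(lambda x, s=s: x + s * 0.3 * math.sin(math.pi * x))
    for s in (+1.0, -1.0):                                # y = T (x +/- 0.15 sin(2 pi x))
        fns.append(lambda x, s=s: x + s * 0.15 * math.sin(2.0 * math.pi * x))
    return fns

def warped_frame_indices(n_frames, anchors=(1/6, 3/6, 5/6), context=1):
    """For each warping, list the frame indices around y(x) for each anchor x."""
    last = n_frames - 1
    out = []
    for fn in warping_functions():
        idx = []
        for x in anchors:
            centre = round(fn(x) * last)   # assumed mapping from y to a frame index
            for d in range(-context, context + 1):
                idx.append(min(max(centre + d, 0), last))  # clamp at segment edges
        out.append(idx)
    return out

def build_input(features):
    """Concatenate the selected frames of every warping into one flat vector."""
    vec = []
    for idx in warped_frame_indices(len(features)):
        for t in idx:
            vec.extend(features[t])
    return vec

features = [[float(t)] * 16 for t in range(21)]  # dummy 21-frame, 16-channel segment
net_input = build_input(features)
print(len(net_input))  # 720 = 16 channels x 9 frames x 5 warpings
```

Because the selection always yields 9 frames per warping, the input size stays at 720 whatever the length of the segment, which is what lets a fixed-size network accept segments of varying duration.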

[0008] FIG. 2 shows the structure of the time-warping neural network that constitutes phoneme recognition unit 9. In input layer 26, first hidden layer 27, second hidden layer 28, and output layer 29, the neural cells are arranged in a matrix. A vertical strip of the matrix, that is, a column, is called a frame here. A connection between layers indicates that all cells of a group of frames in the lower layer are connected to all cells of one frame in the upper layer. A connection in which two groups of cells are joined in every combination is called a full connection. For example, the leftmost connection between input layer 26 and first hidden layer 27 indicates that three frames of input layer 26 are fully connected to one frame of first hidden layer 27. A tied connection means that corresponding connections in the five full connections drawn as thick arrows in FIG. 2 are forced, during training, to have the same connection coefficients. That is, the connections from corresponding positions in the five time-warping patterns are tied, and these are multiplexed twelvefold. In second layer 28 they are likewise multiplexed twelvefold. The outputs of all cells of second hidden layer 28 are integrated and the output cells fire. The input-output function of each cell is a standard sigmoid. That is, if the output of a lower-layer cell i feeding a cell j is $p_i$ and the connection coefficient is $w_{ji}$, the output $q_j$ of cell j is given by

$$q_j = \frac{1}{1 + \exp\left(-\left(\sum_i w_{ji}\, p_i + \mathrm{bias}\right)\right)}$$

where $\sum_i$ denotes the sum over all cells i that feed cell j, and bias is the connection from a special input cell that supplies a DC bias. Biases to cells with tied connections are likewise tied. This embodiment uses a four-layer network, but a three-layer network in which the outputs of the first hidden layer are fully connected to all output cells can also be used.
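A minimal sketch of the sigmoid cell and of a first hidden layer with tied connections might look like this. It assumes a flat input ordered warping by warping, three input frames of 16 channels (48 values) feeding each hidden frame, and 12 cells per hidden frame; reading the "twelvefold" multiplexing as 12 cells per frame is an interpretation, and the names `cell_output` and `first_hidden_layer` are hypothetical.

```python
import math
import random

def cell_output(inputs, weights, bias):
    """Sigmoid cell: q_j = 1 / (1 + exp(-(sum_i w_ji * p_i + bias)))."""
    a = sum(w * p for w, p in zip(weights, inputs)) + bias
    return 1.0 / (1.0 + math.exp(-a))

def first_hidden_layer(x, weights, biases, n_warps=5, n_pos=3, group=48):
    """x holds n_warps * n_pos * group values in warping-major order.

    weights[pos][cell] is a single weight vector reused for every warping
    pattern; that reuse is what makes the connections tied.
    """
    hidden = []
    for w in range(n_warps):
        for pos in range(n_pos):
            start = (w * n_pos + pos) * group
            block = x[start:start + group]
            hidden.append([cell_output(block, weights[pos][c], biases[pos][c])
                           for c in range(len(weights[pos]))])
    return hidden

random.seed(0)
weights = [[[random.uniform(-0.1, 0.1) for _ in range(48)] for _ in range(12)]
           for _ in range(3)]
biases = [[0.0] * 12 for _ in range(3)]
block = [random.random() for _ in range(48)]
hidden = first_hidden_layer(block * 15, weights, biases)  # same data under every warping
```

With tied weights, identical input blocks at the same position produce identical hidden frames under every warping pattern, so a feature is detected in the same way whichever warping happens to align it.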

[0009] Neural network training unit 10 in FIG. 1 uses the training phoneme data obtained from time-warping unit 8 to determine the connection coefficients of the network in phoneme recognition unit 9 by the backward error propagation method, so-called backpropagation. Tied connections are trained by averaging the connection coefficient corrections over each set of tied connections and updating the connection coefficients with that average.
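The averaging rule for tied connections can be illustrated with a small sketch. It assumes that backpropagation has already produced one correction vector per tied copy (five copies in this embodiment) and that the corrections already include the learning rate and sign; the function name `tied_update` is hypothetical.

```python
def tied_update(shared_weights, corrections):
    """Apply the mean of the per-copy corrections to the shared weight vector.

    shared_weights: the single weight vector used by every tied copy.
    corrections:    one correction vector per tied copy, as computed by
                    backpropagation for that copy on its own.
    """
    n = len(corrections)
    mean = [sum(c[i] for c in corrections) / n for i in range(len(shared_weights))]
    return [w + d for w, d in zip(shared_weights, mean)]

# Two tied copies propose different corrections; the shared weights move by the average.
new_weights = tied_update([1.0, 2.0], [[0.1, -0.2], [0.3, 0.0]])
```

Since every copy receives the same averaged update, the copies stay identical after each training step.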

[0010] After training of phoneme recognition unit 9 has been completed in this way, an unknown phoneme is recognized as follows. Switches SW1 and SW2 are set to the a side, speech is input from microphone 1, and the output of mel-scale bandpass filter 4 is obtained by the same processing chain as in training. Phoneme position determination unit 7 determines the phoneme position by visual inspection or on the basis of loudness or the like, and phoneme extraction unit 5 cuts out the phoneme at the determined position. Time-warping unit 8 applies the same time warpings as in training to obtain a set of feature parameter time series, which is given as the input to the neural network of phoneme recognition unit 9. The phoneme is recognized according to which phoneme's cell in output layer 29 produces the largest output, that is, fires. Recognition result display unit 12 displays the recognition result.
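The final decision step, picking the output cell with the largest output, could be sketched as below. The label list uses the six phonemes of the recognition experiment reported in this patent; the activation values and the function name `recognize` are made up for illustration.

```python
PHONEMES = ["b", "d", "g", "m", "n", "N"]  # one output cell per phoneme

def recognize(output_activations, labels=PHONEMES):
    """Return the label of the output cell with the largest activation."""
    best = max(range(len(output_activations)), key=output_activations.__getitem__)
    return labels[best]

print(recognize([0.05, 0.10, 0.02, 0.81, 0.30, 0.07]))  # m
```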

[0011] This embodiment recognizes phonemes, but the same configuration can recognize speech of any length, such as syllables and words. The number of cells in each layer of the neural network, however, must be adjusted according to the set of time-warping functions and the length of the input speech.

[0012]

[Effects of the Invention] The effect of the present invention was confirmed in a recognition experiment on the six phonemes /b, d, g, m, n, N/. The phonemes used for training were extracted, 200 of each, from the even-numbered words of a set of 5240 frequently used important words. The phonemes used for testing were extracted by visual inspection from 115 sentences uttered phrase by phrase by the same speaker as the training speech. As a method in line with the conventional approach, speech in a given segment was linearly warped and resampled to a fixed number of frames and then recognized by a time-delay neural network (TDNN); this gave a phoneme recognition rate of 81.3%. With the method of the present invention, the recognition rate improved to 84.5%.

[0013] As described above, according to the present invention, a high recognition rate can be obtained by warping the input speech in time with a plurality of warping patterns before input.

[Brief description of the drawings]

FIG. 1 is a block diagram showing an example of a phoneme recognition system to which an embodiment of the present invention is applied.

FIG. 2 is a block diagram showing an example configuration of the time-warping neural network that is the main part of the present invention.

Continuation of the front page

(56) References
JP-A-2-77888 (JP, A)
JP-A-1-241668 (JP, A)
JP-A-58-115487 (JP, A)
Proceedings of the Acoustical Society of Japan 1991 Spring Meeting I, 1-5-8, "Consonant Recognition Using Time-Warping Neural Networks," pp. 19-20 (issued March 27, 1991)
IEICE Technical Report [Speech], Vol. 91, No. 95, SP91-13, "Speech Recognition Using Time-Warping Neural Networks," pp. 55-62 (issued June 20, 1991)

(58) Fields searched (Int. Cl.⁶, DB name)
G10L 3/00 531
G10L 3/00 539
JICST file (JOIS)

Claims

(57) [Claims]

1. A speech recognition method using a feed-forward neural network in which an input propagates in one direction to produce an output, wherein: a set of feature parameter time series at a fixed number of time points, obtained by applying a plurality of time-axis warpings to a given speech, is used as the input; the sets of connections from the feature parameters at corresponding time points in the set of feature parameter time series to the neural cells of the first hidden layer are tied connections constrained to have the same connection coefficients; a plurality of first-hidden-layer cells connected from the feature parameters at one time point of the input are provided; and recognition is performed on the basis of the outputs of the output cells when the set of feature parameter time series obtained by applying said time warpings to speech is input to the neural network.