JPH03120600A

JPH03120600A - Continuous voice recognizing system by neural network

Info

Publication number: JPH03120600A
Application number: JP1259359A
Authority: JP
Inventors: Hidefumi Sawai; 沢井　秀文; Masanori Miyatake; 宮武　正典; Kiyohiro Shikano; 鹿野　清宏
Original assignee: A T R JIDO HONYAKU DENWA KENKYUSHO KK
Current assignee: A T R JIDO HONYAKU DENWA KENKYUSHO KK
Priority date: 1989-10-03
Filing date: 1989-10-03
Publication date: 1991-05-22
Anticipated expiration: 2009-06-01
Also published as: JPH0642159B2

Abstract

PURPOSE: To recognize continuous speech at high speed and with high accuracy by predicting phonemes with a syntax analysis method and matching the predicted phonemes against the phoneme spotting results of a neural network by dynamic programming. CONSTITUTION: Input speech 1 is frequency-analyzed, converted into a time series of feature parameters, and supplied to a phoneme spotting part 2, which outputs spotting results for 24 phonemes. An LR table generator 5 generates an LR table 6 from the context-free grammar stored in a context-free grammar storage part 4, and an LR parser 7 predicts the phoneme sequences permitted by the grammar by referring to the LR table 6. A predicted phoneme storage part 8 stores the predicted phoneme sequences, and a phoneme recognition result verification part 3 matches them against the results obtained by the phoneme spotting part 2 using dynamic programming. Among the verified phoneme sequences, the one with the maximum likelihood is output to a recognition result output part 9 as the recognition result.

Description

[Detailed Description of the Invention] [Field of Industrial Application] This invention relates to a continuous speech recognition method using a neural network, and in particular to such a method for recognizing continuously uttered speech in a speech recognition device that uses a neural network.

[Prior Art and Problems to Be Solved by the Invention] Conventionally, continuously uttered speech has generally been recognized by first segmenting the phonemes in the continuous speech and then recognizing the segmented speech. In such conventional methods it is difficult to establish both a high-precision phoneme segmentation method and a phoneme recognition method. The recognized phonemes are therefore usually output once in the form of an ambiguous "phoneme lattice", after which the content of the utterance is identified top-down from information such as a dictionary.

However, such methods have the problem that the recognition system becomes complicated and, in addition, that it is difficult to construct a highly accurate continuous speech recognition system.

Therefore, the main object of this invention is to provide a continuous speech recognition method using a neural network in which the phoneme spotting results for continuous speech, obtained by a phoneme spotting technique using a neural network, are integrated with the phonemes predicted by an extended LR parser by means of dynamic programming (Dynamic Time-Warping Matching), so that a highly accurate continuous speech recognition system can be constructed.

[Means for Solving the Problems] This invention is a continuous speech recognition method using a neural network, comprising: analysis means for analyzing continuously uttered input speech; conversion means for converting the analyzed speech into a time series of feature parameters; normalizing means for normalizing the feature parameter time series over a fixed time domain; and means for spotting phonemes in the continuous speech with a neural network using the normalized feature parameters. Phonemes in the continuous speech are predicted using a syntactic analysis method, and the predicted phonemes are matched against the phoneme spotting results of the neural network by dynamic programming, which has the ability to time-normalize speech.

[Operation] The continuous speech recognition method using a neural network according to this invention uses a phoneme spotting method based on a time-delay neural network (TDNN: Time-Delay Neural Network) [1], which is one type of neural network, together with the extended LR parsing method, which is one type of parsing method, to predict phonemes. It integrates the predicted phonemes with the phoneme recognition results of the TDNN by dynamic programming and thereby recognizes continuous speech with high accuracy.

[Embodiment of the Invention] FIG. 1 is a block diagram showing a time-delay neural network in one embodiment of this invention. Referring to FIG. 1, continuous speech is input to an input layer 11 and is supplied to subnetworks 12 to 20, which serve as intermediate layers.

Among these subnetworks 12 to 20, subnetworks 12 to 17 and 19 spot the 24 categories that cover all Japanese phonemes (b, d, g, p, t, k, m, n, N, s, sh, h, z, ch, ts, r, w, y, a, i, u, e, o, and Q (silence)).

That is, subnetwork 12 discriminates the three phonemes b, d, g; subnetwork 13 discriminates p, t, k; subnetwork 14 discriminates m, n, N; subnetwork 15 discriminates s, sh, h, z; subnetwork 16 discriminates ch, ts; subnetwork 17 discriminates r, w, y; and subnetwork 19 discriminates a, i, u, e, o. Subnetwork 18 discriminates among the six phoneme groups handled by subnetworks 12 to 17, and subnetwork 20 discriminates whether the input is speech or silence.
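The grouping of phonemes by subnetwork described above can be written down as a small lookup table. This is only an illustrative sketch: the dictionary layout and the names `SUBNETWORK_PHONEMES`, `all_phonemes`, and `subnetwork_for` are assumptions made for this example and do not appear in the patent.

```python
# Phoneme groups per subnetwork, as listed in the description of FIG. 1.
# The data layout and all identifiers are illustrative assumptions.
SUBNETWORK_PHONEMES = {
    12: ["b", "d", "g"],
    13: ["p", "t", "k"],
    14: ["m", "n", "N"],
    15: ["s", "sh", "h", "z"],
    16: ["ch", "ts"],
    17: ["r", "w", "y"],
    19: ["a", "i", "u", "e", "o"],
}
SILENCE = "Q"  # speech versus silence is decided by subnetwork 20


def all_phonemes():
    """Return the 24 output categories: 23 phonemes plus silence Q."""
    phonemes = [p for group in SUBNETWORK_PHONEMES.values() for p in group]
    return phonemes + [SILENCE]


def subnetwork_for(phoneme):
    """Look up which subnetwork discriminates the given phoneme."""
    for net_id, group in SUBNETWORK_PHONEMES.items():
        if phoneme in group:
            return net_id
    raise KeyError(phoneme)
```

Counting the entries confirms that the seven groups plus silence give the 24 categories named in the text.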

These subnetworks 12 to 20 are integrated by an integrating network 21, and the 24 spotted phonemes are output to an output layer 22. The network is trained by the error back-propagation method (Error Back-Propagation) [2]. In this method the error, which serves as the evaluation function, is reduced step by step, locally in the feature space, on the basis of the steepest descent method.
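To make the steepest descent idea concrete, the following toy sketch trains a single sigmoid unit on a squared-error criterion, updating the weights after every sample. It is not the TDNN of the patent: the single-unit model, the learning rate `lr`, the epoch count, and the function names are all assumptions chosen to keep the example short.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def train_unit(samples, lr=0.5, epochs=2000):
    """Fit y = sigmoid(w*x + b) by per-sample steepest descent.

    samples: list of (x, target) pairs with targets in [0, 1].
    The error for one sample is E = 0.5 * (y - target)**2.
    """
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for x, target in samples:
            y = sigmoid(w * x + b)
            # dE/dz for z = w*x + b (the chain rule used by back-propagation)
            delta = (y - target) * y * (1.0 - y)
            # Move each parameter against its error gradient.
            w -= lr * delta * x
            b -= lr * delta
    return w, b
```

In a multi-layer network the same `delta` term is propagated backwards through the layers, which is what gives the method its name.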

FIG. 2 is a diagram for explaining a method of spotting phonemes in continuous speech in one embodiment of this invention.

Referring to FIG. 2, input speech 11a is supplied as input data. In FIG. 2 the vertical axis represents frequency and the horizontal axis represents time. The input speech 11a is supplied to the input layer 11 of the neural network shown in FIG. 1, and phoneme spotting is performed by scanning the network of FIG. 1 along the time axis one frame at a time. Each time the network is shifted by one frame, the spotting result for one of the 24 phonemes is output from the output layer 22. The intermediate layers 12 to 21 of the network shown in FIG. 1 are omitted from FIG. 2. Unlike conventional methods, the method shown in FIG. 2 requires no phoneme segmentation, which makes it an extremely simple and effective method.
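The frame-by-frame scan can be sketched as a sliding window over the feature frames. This is a minimal illustration under stated assumptions: the window width, the `classify` callback standing in for the trained network, and the toy energy-based classifier are all hypothetical and are not taken from the patent.

```python
def spot_phonemes(frames, window, classify):
    """Slide a fixed-width window over the frames, one frame at a time.

    frames   : time-ordered list of per-frame feature vectors
    window   : number of frames the network sees at once (assumed value)
    classify : stand-in for the trained network; maps a list of frames
               to a dict {phoneme: activation}
    Returns one activation dict per window position.
    """
    results = []
    for start in range(len(frames) - window + 1):
        results.append(classify(frames[start:start + window]))
    return results


def toy_classify(window_frames):
    # Hypothetical classifier: "a" fires when mean energy is high, else "Q".
    energy = sum(sum(f) for f in window_frames) / len(window_frames)
    if energy > 1.0:
        return {"a": 1.0, "Q": 0.0}
    return {"a": 0.0, "Q": 1.0}
```

Because every window position yields an output, no explicit phoneme boundaries have to be found before classification, which is the point the paragraph above makes about avoiding segmentation.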

FIG. 3 is a block diagram showing the configuration of a recognition system based on the TDNN-LR method. Referring to FIG. 3, input speech 1 is frequency-analyzed, converted into a time series of feature parameters such as FFT outputs, and supplied to a time-delay neural network 2. As explained with reference to FIG. 1, the time-delay neural network 2 outputs spotting results for 24 phonemes.

Meanwhile, a context-free grammar is stored in a context-free grammar storage unit 4, and an LR table generator 5 generates an LR table from this grammar. An LR parser 7 predicts the phoneme sequences permitted by the grammar while referring to the LR table. A predicted phoneme storage unit 8 stores the predicted phoneme sequences in advance, and a phoneme verification unit 3 verifies the predicted phoneme sequences stored in the predicted phoneme storage unit 8 against the phoneme spotting results obtained by the time-delay neural network 2 using DTW matching. Among the verified phoneme sequences, the sequence with the maximum likelihood is output to a recognition result output unit 9 as the recognition result.
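The verification step can be sketched as a dynamic-programming alignment between a predicted phoneme sequence and the per-frame spotting activations. The patent does not give the local score or the path constraints, so the choices below are assumptions: the score is the sum of activations along the path, each frame is assigned to exactly one predicted phoneme, and each phoneme covers at least one frame. The names `dtw_score` and `recognize` are hypothetical.

```python
def dtw_score(predicted, activations):
    """Best alignment score of a phoneme sequence against frame activations.

    predicted   : list of phoneme labels, e.g. ["k", "a", "i"]
    activations : one dict {phoneme: activation} per frame
    Returns the best summed activation, or None if no alignment exists.
    """
    n_ph, n_fr = len(predicted), len(activations)
    if n_ph == 0 or n_fr < n_ph:
        return None
    neg = float("-inf")
    # best[i][t]: best score with frame t assigned to phoneme i
    best = [[neg] * n_fr for _ in range(n_ph)]
    best[0][0] = activations[0].get(predicted[0], 0.0)
    for t in range(1, n_fr):
        for i in range(n_ph):
            stay = best[i][t - 1]                         # same phoneme continues
            advance = best[i - 1][t - 1] if i > 0 else neg  # move to next phoneme
            prev = max(stay, advance)
            if prev > neg:
                best[i][t] = prev + activations[t].get(predicted[i], 0.0)
    return best[n_ph - 1][n_fr - 1]


def recognize(candidates, activations):
    """Return the candidate sequence with the highest alignment score."""
    scored = [(dtw_score(c, activations), c) for c in candidates]
    scored = [(s, c) for s, c in scored if s is not None]
    if not scored:
        return None
    return max(scored, key=lambda sc: sc[0])[1]
```

Letting a phoneme stay on consecutive frames is what absorbs differences in duration, which corresponds to the time-normalizing property the description attributes to dynamic programming.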

The phoneme prediction method of the LR parser 7 is now briefly explained. Among context-free grammars, the LR parser 7 analyzes sentences generated by a restricted class of grammars called LR grammars. This parser can analyze the syntax deterministically, without backtracking, as it accepts the input symbols. The LR parser 7 performs the analysis by consulting two kinds of tables, an action table and a goto table. The action table gives the action the parser performs next, and the goto table gives the state the parser enters next. The parser has the following four kinds of actions.

(1) shift (2) reduce (3) accept (4) error. (1) Shift is the action of pushing a parser state onto the stack, and (2) reduce combines symbols on the stack according to a grammar rule. (3) Accept indicates that the input sentence has been successfully parsed by the LR parser, and (4) error indicates that it could not be parsed.

The parsing procedure is as follows.

[Definitions]
s: a parser state.
a: a grammar symbol (nonterminal or terminal).
Input pointer: indicates the symbol of the input string currently being processed.

State stack: stores parser states.

GOTO(s, a): gives the next state from state s and grammar symbol a.

ACTION(s, a): gives the parser action from state s and grammar symbol a.

[Algorithm]
(1) Initialization: position the input pointer at the head of the input symbol string. Push 0 onto the state stack.

(2) Look up ACTION(s, a) for the current state s and the symbol a indicated by the input pointer.

(3) If ACTION(s, a) = "shift", push GOTO(s, a) onto the state stack and advance the input pointer by one.

(4) If ACTION(s, a) = "reduce n", pop as many states off the stack as there are grammar symbols on the right-hand side of the n-th grammar rule. Let s' be the state now at the top of the stack. Obtain the next state GOTO(s', A) from s' and the grammar symbol A on the left-hand side of the n-th grammar rule, and push it onto the stack.

(5) If ACTION(s, a) = "accept", parsing is complete.

(6) If ACTION(s, a) = "error", parsing has failed.

(7) Return to (2).
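The procedure above can be sketched as a short table-driven loop. The grammar, the hand-built `ACTION` and `GOTO` tables, and the end marker `"$"` are assumptions made for this example; the patent does not give a concrete grammar. The toy grammar is rule 1: E -> E + n and rule 2: E -> n.

```python
# Hypothetical grammar used only for illustration:
#   rule 1: E -> E + n
#   rule 2: E -> n
# rule number -> (left-hand symbol, number of right-hand symbols)
RULES = {1: ("E", 3), 2: ("E", 1)}

# ACTION(s, a); missing entries mean "error".
ACTION = {
    (0, "n"): ("shift", None),
    (1, "+"): ("shift", None),
    (1, "$"): ("accept", None),
    (2, "+"): ("reduce", 2),
    (2, "$"): ("reduce", 2),
    (3, "n"): ("shift", None),
    (4, "+"): ("reduce", 1),
    (4, "$"): ("reduce", 1),
}
# GOTO(s, a) for both terminals and nonterminals, as in the definitions above.
GOTO = {(0, "n"): 2, (1, "+"): 3, (3, "n"): 4, (0, "E"): 1}


def lr_parse(tokens):
    """Return True if the token list is accepted, False on an error entry."""
    symbols = list(tokens) + ["$"]  # "$" marks the end of input (assumption)
    pos = 0                         # initialization: pointer at the head
    stack = [0]                     # initialization: push state 0
    while True:                     # each pass looks up ACTION(s, a) again
        s, a = stack[-1], symbols[pos]
        kind, rule = ACTION.get((s, a), ("error", None))
        if kind == "shift":
            stack.append(GOTO[(s, a)])
            pos += 1
        elif kind == "reduce":
            lhs, rhs_len = RULES[rule]
            del stack[-rhs_len:]                   # pop one state per RHS symbol
            stack.append(GOTO[(stack[-1], lhs)])   # GOTO(s', A)
        elif kind == "accept":
            return True
        else:
            return False
```

For example, `lr_parse(["n", "+", "n"])` shifts `n`, reduces it to `E` by rule 2, shifts `+` and `n`, reduces by rule 1, and then accepts at the end marker.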

The extended LR parser makes it possible to analyze ambiguous syntax that the LR parser could not handle. In the extended LR parser, an entry of the action table may contain multiple items. When the parser looks up an entry containing multiple items, it carries out the corresponding actions in parallel. In this way the syntax is analyzed deterministically.
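The parallel handling of multiple action entries can be sketched by keeping a list of stacks and following every entry for each of them. This is a simplified illustration, not the parser of the patent: the ambiguous toy grammar (rule 1: E -> E + E, rule 2: E -> n), the hand-built tables, the end marker `"$"`, and the function name `count_parses` are assumptions, and here each shift entry stores its target state directly.

```python
# Hypothetical ambiguous grammar used only for illustration:
#   rule 1: E -> E + E
#   rule 2: E -> n
RULES = {1: ("E", 3), 2: ("E", 1)}

# Each cell holds a list of actions; state 4 on "+" has two entries.
ACTION = {
    (0, "n"): [("shift", 2)],
    (1, "+"): [("shift", 3)],
    (1, "$"): [("accept", None)],
    (2, "+"): [("reduce", 2)],
    (2, "$"): [("reduce", 2)],
    (3, "n"): [("shift", 2)],
    (4, "+"): [("shift", 3), ("reduce", 1)],
    (4, "$"): [("reduce", 1)],
}
GOTO = {(0, "E"): 1, (3, "E"): 4}


def count_parses(tokens):
    """Follow every action entry in parallel; count the accepting paths."""
    stacks = [(0,)]
    accepted = 0
    for a in list(tokens) + ["$"]:
        pending, shifted = list(stacks), []
        while pending:
            stack = pending.pop()
            for kind, arg in ACTION.get((stack[-1], a), []):
                if kind == "shift":
                    shifted.append(stack + (arg,))
                elif kind == "reduce":
                    lhs, rhs_len = RULES[arg]
                    base = stack[:-rhs_len]
                    # A reduced stack must look at the same symbol again.
                    pending.append(base + (GOTO[(base[-1], lhs)],))
                else:  # accept
                    accepted += 1
        stacks = shifted  # stacks with no entry for this symbol are dropped
    return accepted
```

With this grammar `n + n + n` can be grouped in two ways, so two stacks survive to the end of the input and both are accepted.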

FIG. 4 is a diagram showing an example of phoneme spotting results. The example in FIG. 4 is for the utterance 「会議に」 ("kaigi ni", "to the meeting") and shows a spectrogram 11b of the input speech and the phoneme spotting results 22a. To verify the validity of the results, phoneme labels were assigned to the input speech and the phoneme spotting results beforehand by visual inspection. In FIG. 4 a black square indicates that the corresponding output was activated.

FIG. 5 is a diagram showing the operation of the phoneme recognition result verification unit 3 shown in FIG. 3. It shows the phoneme spotting results 22, a DP matching path 31, and a phoneme sequence 32 predicted by the LR parser. FIG. 5 shows that, for input speech uttered as /kaigini/, the predicted phoneme sequence 32 and the phoneme spotting results 22 are aligned by the DP matching path 31.

[Effects of the Invention] As described above, according to this invention, the results of a simple and highly accurate phoneme spotting method using a time-delay neural network (TDNN) are matched with the phoneme sequences predicted by an extended LR parser using dynamic programming (DTW), so that continuous speech can be recognized with high accuracy and at high speed.

[Brief Description of the Drawings]

FIG. 1 is a block diagram showing a time-delay neural network used in one embodiment of this invention. FIG. 2 is a diagram showing a method of spotting phonemes in continuous speech. FIG. 3 is a block diagram showing the configuration of a recognition system based on the TDNN-LR method. FIG. 4 is a diagram showing an example of phoneme spotting results according to one embodiment of this invention. FIG. 5 is a diagram showing the operation of the phoneme recognition result verification unit shown in FIG. 3. In the figures, 1 denotes input speech data, 2 a phoneme spotting unit, 3 a phoneme recognition result verification unit, 4 a context-free grammar storage unit, 5 an LR table generator, 6 an LR table, 7 an LR parser, 8 a predicted phoneme storage unit, 9 a recognition result output unit, 11 an input layer, 12 to 20 networks serving as intermediate layers, 21 an integrating network, and 22 an output layer.

Procedural Amendment, March 1, 1990 (Heisei 2). 2. Title of the invention: Continuous speech recognition method using a neural network. 3. Person making the amendment: Name: 株式会社エイ・ティ・アール自動翻訳電話研究所 (A T R JIDO HONYAKU DENWA KENKYUSHO KK); Representative: 搏松明. 4. Agent: Address: Sumitomo Bank Minamimorimachi Building, 2-1-29 Minamimorimachi, Kita-ku, Osaka. 5. Date of amendment order. 6. Subject of amendment: Drawings. 7. Contents of amendment: (1) Clean copies of FIG. 4 and FIG. 5 of the drawings are as attached (no change in content).

Claims

[Scope of Claims] A continuous speech recognition method using a neural network, comprising: analysis means for analyzing continuously uttered input speech; conversion means for converting the speech analyzed by the analysis means into a time series of feature parameters; normalizing means for normalizing the feature parameter time series converted by the conversion means over a fixed time domain; and means for spotting phonemes in the continuous speech with a neural network using the feature parameters normalized by the normalizing means; characterized in that phonemes in the continuous speech are predicted using a syntactic analysis method, and the predicted phonemes are matched against the phoneme spotting results of the neural network by dynamic programming having the ability to time-normalize speech.