JPH09244687A

JPH09244687A - Speech recognizing method

Info

Publication number: JPH09244687A
Application number: JP8049898A
Authority: JP
Inventors: Tomokazu Yamada; 智一山田; Shigeki Sagayama; 茂樹嵯峨山
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 1996-03-07
Filing date: 1996-03-07
Publication date: 1997-09-19

Abstract

PROBLEM TO BE SOLVED: To carry out highly precise speech recognition that makes good use of environment-dependent phoneme standard patterns. SOLUTION: A network is generated using an LR parser and a context-free grammar whose terminal symbols are environment-independent phonemes, standard patterns are determined, and a One-Pass search is advanced while the input speech is matched against hidden Markov model (HMM) standard patterns. Environment-dependent phoneme standard patterns are used as the HMM standard patterns. When the LR parser predicts an environment-independent phoneme P3 from tip node 3 during extension of the network, the network is traced back from the new node 4 to obtain the phoneme string P1, P2, P3, and the corresponding environment-dependent phoneme standard pattern (a triphone model) is set on the arc between nodes 3 and 4.

Description

Detailed Description of the Invention

[0001]

TECHNICAL FIELD The present invention relates to a method for continuously recognizing input speech using a hidden Markov model (for example, Seiichi Nakagawa, "Speech Recognition by Probabilistic Models," edited by the Institute of Electronics, Information and Communication Engineers (1988)), the generalized LR parsing algorithm (for example, M. Tomita, "An Efficient Context-Free Parsing Algorithm for Natural Languages and its Applications," Carnegie-Mellon University (1985)), and the One-Pass search algorithm (for example, J. Bridle et al., "An Algorithm for Connected Word Recognition," ICASSP82, pp. 899-902 (1982)).

[0002]

PRIOR ART In conventional speech recognition, there is a method that uses an LR parser (grammar analyzer) to search for recognition candidates that conform to a predetermined grammar and performs recognition while matching the standard patterns of those candidates against the input speech. Also known is the One-Pass search method, which uses a predetermined network, matches standard patterns against the input speech for part of that network, discards candidates with low scores so that the search advances with a minimal increase in processing, and finally takes the candidate with the highest score as the recognition result.

[0003] Furthermore, speech recognition methods that integrate the hidden Markov model, the generalized LR parser, and the One-Pass algorithm are known. One such integrated method creates sub-networks with an LR parser from a context-free grammar over environment-independent phonemes, takes environment-dependent phonemes into account at the junctions between those sub-networks, and uses environment-dependent phoneme standard patterns (for example, Katsunobu Ito et al., "Continuous Speech Recognition Using Extended LR Parsing," IEICE Technical Report SP90-74, pp. 49-56). Another dynamically creates a single network with an LR parser from a context-free grammar over environment-independent phonemes and uses environment-independent phoneme standard patterns (for example, Kenji Kita et al., "One-Pass Continuous Speech Recognition Algorithm Controlled by an LR Parser," IEICE Technical Report SP94-13 (1994-05), pp. 55-62).

[0004] In terms of efficiency, it is better to process with a single network rather than with sub-networks, and in terms of recognition accuracy, it is better to use environment-dependent phoneme standard patterns. However, no method could exploit both at the same time.

[0005]

PROBLEM TO BE SOLVED BY THE INVENTION An object of the present invention is to provide a speech recognition method with a high recognition rate that, in integrating the hidden Markov model, the generalized LR parser, and the One-Pass algorithm, processes efficiently using a single network while also using environment-dependent phoneme standard patterns.

[0006]

MEANS FOR SOLVING THE PROBLEM According to the present invention, environment-dependent phoneme standard patterns are used. For a node newly generated on the network, the network is traced back by the number of phonemes that the phoneme environment takes into account, which identifies the set of environment-independent phonemes predicted by the LR parser. From this set, the environment-dependent phoneme standard pattern to be set on the arc connected to that node is determined.

[0007]

DESCRIPTION OF THE PREFERRED EMBODIMENTS FIG. 1 shows the functional configuration of a recognition device to which an embodiment of the present invention is applied. Speech input from input terminal 1 is converted into a digital signal in feature extraction unit 2, subjected to LPC cepstrum analysis, and then converted into feature parameters for each frame (for example, 10 milliseconds). The feature parameters are, for example, LPC cepstrum coefficients.
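The per-frame segmentation mentioned above can be illustrated with a minimal sketch. The function name `split_into_frames`, the 8000 Hz sample rate, and the use of non-overlapping frames are assumptions made only for illustration; the source specifies only the example frame length of 10 milliseconds, and the LPC cepstrum analysis itself is not shown.

```python
def split_into_frames(samples, sample_rate_hz, frame_ms=10):
    """Split a sample sequence into consecutive, non-overlapping frames.

    Non-overlapping frames are an assumption of this sketch; trailing
    samples that do not fill a whole frame are dropped.
    """
    frame_len = sample_rate_hz * frame_ms // 1000
    if frame_len <= 0:
        raise ValueError("frame length must be at least one sample")
    n_frames = len(samples) // frame_len
    return [samples[i * frame_len:(i + 1) * frame_len] for i in range(n_frames)]


# 400 samples at an assumed 8000 Hz give five 10 ms frames of 80 samples.
frames = split_into_frames(list(range(400)), sample_rate_hz=8000)
print(len(frames), len(frames[0]))  # 5 80
```

Each frame would then be passed to the feature analysis to obtain one feature-parameter vector per frame.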

[0008] From a training speech database, environment-dependent phoneme standard patterns of hidden Markov models are created in the same format as the above feature vectors and stored in standard pattern memory 4. A table for the LR parser is created from a context-free grammar that defines the scope of recognition and whose terminal symbols are environment-independent phonemes, and is stored in LR table 5.

[0009] FIG. 2 shows a flowchart of the processing in recognition unit 3. The feature parameters of one frame of input speech are read (S₁), the network is grown (every frame or every few frames) (S₂), the input speech feature parameters are matched against the standard patterns (S₃), and the active nodes for which processing continues in the next step are determined before returning to step S₁ (S₄). This is repeated until no input speech remains. Even before the input speech ends, processing may be terminated under some condition based on the scores obtained by pattern matching.
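The loop of steps S₁ to S₄ can be sketched as follows. This is only an illustrative outline: `recognize`, `grow`, `match`, `select_active`, and `should_stop` are hypothetical names, the toy stand-ins are not HMM matching, and the sketch grows the network on every frame although the text also allows growing it every few frames.

```python
def recognize(frames, grow, match, select_active, initial_node="", should_stop=None):
    """Run the S1-S4 loop over the input frames (illustrative sketch)."""
    active = {initial_node}
    frames_used = 0
    for frame in frames:                      # S1: read one frame of features
        active = grow(active)                 # S2: grow the network
        scores = match(active, frame)         # S3: match against standard patterns
        active = select_active(scores)        # S4: choose the next active nodes
        frames_used += 1
        # Optional early termination based on the scores.
        if should_stop is not None and should_stop(scores):
            break
    return active, frames_used


# Toy stand-ins: a node is the phoneme string hypothesized so far.
def toy_grow(active):
    return {node + ph for node in active for ph in "ab"}


def toy_match(active, frame):
    return {node: (1 if node.endswith(frame) else 0) for node in active}


def toy_select(scores):
    best = max(scores.values())
    return {node for node, s in scores.items() if s == best}


result = recognize(["a", "b"], toy_grow, toy_match, toy_select)
print(result)  # ({'ab'}, 2)
```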

[0010] The network consists of nodes and arcs, and initially contains only a single initial node. Each node is a hypothesis generated using the LR parser; one node is assigned to, and associated with, each distinct stack content. Consequently, hypotheses whose subsequent processing by the LR parser would be identical are merged into the same node (their arcs connect to the same node). Such a node is called a merge node here. Based on the matching results at each step, the nodes with high scores become the active nodes for which processing continues in the next step. The other nodes may be kept in the network as they are, or discarded to make effective use of memory. In this way, the search is advanced by One-Pass search while the network is extended using the LR parser and the context-free grammar whose terminal symbols are environment-independent phonemes.
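The rule that hypotheses with identical stack contents share one node can be sketched with a dictionary keyed by the stack. The class `Network`, its method names, and the stack values below are made up for illustration and are not taken from an actual LR table.

```python
class Network:
    """Nodes are keyed by LR stack contents, so equal stacks share a node."""

    def __init__(self):
        self.node_of_stack = {}  # stack contents (tuple) -> node id
        self.arcs = []           # (source node, destination node, phoneme)

    def node_for(self, stack):
        key = tuple(stack)
        if key not in self.node_of_stack:
            self.node_of_stack[key] = len(self.node_of_stack) + 1
        return self.node_of_stack[key]

    def add_arc(self, src, stack_after, phoneme):
        dst = self.node_for(stack_after)  # reuses an existing node if stacks match
        self.arcs.append((src, dst, phoneme))
        return dst


net = Network()
n1 = net.node_for((0,))                # initial node
n2 = net.add_arc(n1, (0, 2), "k")
n3 = net.add_arc(n1, (0, 3), "s")
n4a = net.add_arc(n2, (0, 5), "a")     # hypothetical stack after the parser's actions
n4b = net.add_arc(n3, (0, 5), "a")     # same stack, so the same (merge) node
print(n4a == n4b, len(net.node_of_stack))  # True 4
```

Keying on the whole stack is the simplest way to express the rule; a practical implementation would likely use a more compact representation.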

[0011] The generation (growth) of the network, which is the characteristic part of the present invention, is described with reference to FIG. 3. Because a network of unbounded size can be created from a context-free grammar, it may not be possible to create the whole network in advance. The network generation unit therefore limits the operation of creating new nodes from each active node using the LR parser and extending arcs (growing the network) to a finite number of phonemes ahead. Here, the case of extending one phoneme at a time is described. In the figure, the symbols shown above the dotted lines are the environment-independent phonemes predicted between nodes by the LR parser. The solid lines between nodes are the arcs of the network, on which environment-dependent phoneme standard patterns are set in the present invention (to keep the figure readable, the arcs between nodes 1-2, 2-3, 5-6, and 6-3 are omitted). The description here uses the so-called triphone model, which takes into account the environment on the left (L) and right (R) of the central phoneme (C) and is written L-C-R.

[0012] First, the simple case of FIG. 3A is described. Consider growing the network at node 3, starting from a network in which nodes 1 to 3 have already been generated. Suppose the LR parser predicts the environment-independent phoneme P₃ from node 3 and finds that the stack changes to the state represented by a new node 4. Tracing back from node 4 to the node three before it yields the three-phoneme sequence P₁, P₂, P₃. Based on this, the environment-dependent phoneme standard pattern P₁-P₂-P₃ is set on the arc between nodes 3 and 4.
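The trace-back of FIG. 3A can be sketched as follows for a network in which every node has exactly one predecessor. The `incoming` mapping and the function name `trace_back_single` are illustrative assumptions; the node numbers and phoneme labels follow the figure.

```python
def trace_back_single(incoming, node, n_phonemes):
    """Collect the last n_phonemes phonemes on the single path ending at node."""
    phonemes = []
    current = node
    for _ in range(n_phonemes):
        prev, phoneme = incoming[current]  # (predecessor node, predicted phoneme)
        phonemes.append(phoneme)
        current = prev
    return tuple(reversed(phonemes))


# FIG. 3A: 1 --P1--> 2 --P2--> 3 --P3--> 4 (node 4 is newly created).
incoming = {2: (1, "P1"), 3: (2, "P2"), 4: (3, "P3")}
label = "-".join(trace_back_single(incoming, 4, 3))
print(label)  # P1-P2-P3
```

The resulting label names the environment-dependent standard pattern to be set on the arc between nodes 3 and 4.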

[0013] The case where node 3 is a merge node is described with reference to FIG. 3B. In this case, tracing back from node 4 to the nodes three before it shows that two arcs are needed between nodes 3 and 4: one on which P₁-P₂-P₃ is set and one on which P₄-P₅-P₃ is set. To ensure acoustic consistency, either node 3 is split into two, or a constraint is imposed so that data from the arc coming from node 2 is passed to the P₁-P₂-P₃ arc and data from the arc coming from node 6 is passed to the P₄-P₅-P₃ arc.
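When node 3 is a merge node, the trace-back has to follow every incoming path. The sketch below enumerates all three-phoneme contexts ending at node 4 of FIG. 3B and records, for each, which predecessor of node 3 it came through. The `incoming` mapping and the function name `trace_back_all` are illustrative assumptions.

```python
def trace_back_all(incoming, node, n_phonemes):
    """Return every (node path, phoneme sequence) of n_phonemes arcs ending at node."""
    if n_phonemes == 0:
        return [((node,), ())]
    results = []
    for prev, phoneme in incoming.get(node, []):
        for nodes, phonemes in trace_back_all(incoming, prev, n_phonemes - 1):
            results.append((nodes + (node,), phonemes + (phoneme,)))
    return results


# FIG. 3B: node 3 is reached from node 2 (via P2) and from node 6 (via P5).
incoming = {
    2: [(1, "P1")],
    6: [(5, "P4")],
    3: [(2, "P2"), (6, "P5")],
    4: [(3, "P3")],
}

arcs_3_to_4 = {}
for nodes, phonemes in trace_back_all(incoming, 4, 3):
    entering_from = nodes[-3]  # predecessor of node 3 on this path
    arcs_3_to_4[entering_from] = "-".join(phonemes)
print(arcs_3_to_4)  # {2: 'P1-P2-P3', 6: 'P4-P5-P3'}
```

Keying each pattern by the predecessor of node 3 corresponds to the constraint option in the text: scores arriving from node 2 feed only the P₁-P₂-P₃ arc, and scores arriving from node 6 feed only the P₄-P₅-P₃ arc.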

[0014] To reduce the amount of processing, a method that does not take this consistency into account is also conceivable. In that case, the results obtained contain errors. Ultimately, the hypotheses accepted on the LR parser form a single merge node. When all steps have finished, the recognition result is obtained by tracing back along the arcs that propagated the highest score to this merge node. The environment dependence of phonemes is not limited to the triphone model, but the number of phonemes in the environment-dependent phoneme model of the standard patterns is made equal to the number of phonemes traced back through the network when the network is extended.
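The final trace-back from the accepting merge node can be sketched with a table of best predecessors. The `best_prev` mapping, the node names, and the phoneme labels are hypothetical; an actual One-Pass search would also keep back-pointers per time frame, which this simplified sketch omits.

```python
def backtrace(best_prev, final_node):
    """Follow best-predecessor links from the final node back to the start."""
    phonemes = []
    node = final_node
    while node in best_prev:
        prev, phoneme = best_prev[node]  # arc that propagated the highest score
        phonemes.append(phoneme)
        node = prev
    return list(reversed(phonemes))


# Hypothetical result: each node remembers the arc that gave it its best score.
best_prev = {"accept": (3, "a"), 3: (2, "k"), 2: (1, "a")}
print(backtrace(best_prev, "accept"))  # ['a', 'k', 'a']
```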

[0015]

EFFECT OF THE INVENTION If an LR table whose terminal symbols are environment-dependent phonemes were used, there would be few merge nodes. The network would grow with grammatically identical hypotheses remaining split into several, producing many paths, that is, many candidates, and increasing the amount of computation. In the present invention, the dispersal of merge nodes that would occur with such an LR table is avoided by using an LR table whose terminal symbols are environment-independent phonemes. Moreover, the environment-dependent phoneme standard pattern is determined by tracing back the required number of arcs and nodes from a merge node and is set on the arc of the newly created node. As a result, high-accuracy recognition using environment-dependent phoneme standard patterns is possible while a network with high merge efficiency is created dynamically.

[Brief description of drawings]

FIG. 1 is a block diagram showing an example of the functional configuration of a recognition device to which the method of an embodiment of the present invention is applied.

FIG. 2 is a flowchart of the recognition processing unit showing an embodiment of the present invention.

FIG. 3 is a diagram explaining how environment-dependent phoneme standard patterns are set on the arcs of the network in an embodiment of the present invention.

Claims

[Claims]

1. A speech recognition method in which input speech is converted into a time series of feature parameters, phoneme standard patterns are determined while a network is generated using an LR parser and a context-free grammar whose terminal symbols are environment-independent phonemes, a search is advanced by One-Pass search while the feature parameter time series is matched against standard patterns of hidden Markov models, and a candidate with a high likelihood of similarity is taken as the recognition result, the method being characterized in that environment-dependent phoneme standard patterns are used as the standard patterns of the hidden Markov models, and, when the network is extended, the environment-dependent phoneme standard pattern is determined by tracing the network back from its tip by the number of phonemes taken into account for the environment dependence and is set on an arc of the network.