JPH09244688A

JPH09244688A - Speech recognition method

Info

Publication number: JPH09244688A
Application number: JP8049899A
Authority: JP
Inventors: Tomokazu Yamada; 智一山田; Shigeki Sagayama; 茂樹嵯峨山
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 1996-03-07
Filing date: 1996-03-07
Publication date: 1997-09-19

Abstract

(57) [Abstract]
[Problem] To enable highly accurate recognition using environment-dependent phoneme standard patterns.
[Solution] Standard patterns are determined while a network is generated using an LR parser and a context-free grammar whose terminal symbols are environment-independent phonemes, and the search is advanced by One-Pass search while the input speech is matched against HMM standard patterns. Environment-dependent phoneme standard patterns are used as the HMM standard patterns. When the network is extended, the phoneme of each arc is propagated to the nodes ahead, as far as the number of phonemes spanned by the environment dependence. When an arc is extended at frontier node 3, a triphone-model phoneme standard pattern is determined from the phoneme P₃ predicted by the LR parser and the phoneme sequence P₁, P₂ that has reached node 3, and is set on the arc between nodes 3 and 4.

Description

Detailed Description of the Invention

[0001]

[Technical Field of the Invention] The present invention relates to a method for continuously recognizing input speech using hidden Markov models (see, e.g., Seiichi Nakagawa, "Speech Recognition by Probabilistic Models," edited by the Institute of Electronics, Information and Communication Engineers (1988)), the generalized LR parsing algorithm (see, e.g., M. Tomita, "An Efficient Context-Free Parsing Algorithm for Natural Languages and its Applications," Carnegie-Mellon University (1985)), and the One-Pass search algorithm (see, e.g., J. Bridle et al., "An Algorithm For Connected Word Recognition," ICASSP82, pp. 899-902 (1982)).

[0002]

[Prior Art] In conventional speech recognition, one known method uses an LR parser (grammar analyzer) to search for recognition candidates that conform to a predetermined grammar, and performs recognition while matching the standard patterns of those candidates against the input speech. Also known is the One-Pass search method, which uses a predetermined network, matches standard patterns against the input speech for part of that network, prunes candidates with low scores so that the search advances with a minimal increase in processing, and finally takes the candidate with the highest score as the recognition result.

[0003] Speech recognition methods that integrate hidden Markov models, the generalized LR parser, and the One-Pass algorithm are also known. One such integrated method uses an LR parser to create sub-networks from a context-free grammar over environment-independent phonemes, takes environment-dependent phonemes into account at the junctions between those sub-networks, and uses environment-dependent phoneme standard patterns (see, e.g., Katsunobu Itou et al., "Continuous Speech Recognition Using an Extended LR Parsing Method," IEICE Technical Report, SP90-74, pp. 49-56). Another uses an LR parser to dynamically create a single network from a context-free grammar over environment-independent phonemes, and uses environment-independent phoneme standard patterns (see, e.g., Kenji Kita et al., "One-Pass Continuous Speech Recognition Algorithm Controlled by an LR Parser," IEICE Technical Report, SP94-13 (1994-05), pp. 55-62).

[0004] In terms of efficiency, it is better to process with a single network than with sub-networks, and in terms of recognition accuracy, it is better to use environment-dependent phoneme standard patterns. However, no existing method could do both at once.

[0005]

[Problem to Be Solved by the Invention] An object of the present invention is to provide a speech recognition method with a high recognition rate that, in integrating hidden Markov models, the generalized LR parser, and the One-Pass algorithm, processes efficiently using a single network and at the same time can use environment-dependent phoneme standard patterns.

[0006]

[Means for Solving the Problem] According to the present invention, environment-dependent phoneme standard patterns are used. The phoneme of each arc of the network is propagated to each node ahead, at least as far as the node that lies ahead by the number of phonemes required by the phoneme environment dependence. For a node newly generated on the network, the environment-dependent phoneme standard pattern to be set on the arc leading to that node is determined from the phoneme that the LR parser predicts for that arc and the past phoneme sequence stored at the source node of that arc.

[0007]

[Embodiments of the Invention] FIG. 1 functionally shows the configuration of a recognition device to which an embodiment of the present invention is applied. Speech input from input terminal 1 is converted into a digital signal in feature extraction unit 2, subjected to LPC cepstrum analysis, and then converted into feature parameters for each frame (for example, every 10 milliseconds). The feature parameters are, for example, LPC cepstrum coefficients.
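As a rough illustration of the per-frame processing described above, the following Python sketch cuts a sampled signal into 10 millisecond frames. It is only a sketch under stated assumptions: the function name `split_into_frames`, the sampling rate argument, and the use of non-overlapping frames are hypothetical choices, since the text specifies only the frame period. The LPC cepstrum analysis itself is not shown.

```python
from typing import List, Sequence


def split_into_frames(
    samples: Sequence[float], sample_rate_hz: int, frame_ms: int = 10
) -> List[Sequence[float]]:
    """Cut a digitized signal into consecutive, non-overlapping frames.

    Assumption: frames do not overlap and a trailing partial frame is dropped.
    """
    frame_len = sample_rate_hz * frame_ms // 1000  # samples per frame
    last_start = len(samples) - frame_len
    return [samples[i:i + frame_len] for i in range(0, last_start + 1, frame_len)]
```

Each returned frame would then be passed to the feature analysis to produce one feature parameter vector.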

[0008] Environment-dependent phoneme standard patterns of hidden Markov models are created from a training speech database in the same format as the above feature vectors, and are stored in standard pattern memory 4. A table for the LR parser is created from a context-free grammar that defines the scope of what is to be recognized and whose terminal symbols are environment-independent phonemes, and is stored in LR table 5.

[0009] FIG. 2 shows a flowchart of the processing in recognition unit 3. The feature parameters of one frame of input speech are read (S₁); the network is grown (every frame, or every few frames) (S₂); the input speech feature parameters are matched against the standard patterns (S₃); and the active nodes for which processing will continue in the next step are determined, after which the process returns to step S₁ (S₄). This is repeated until no input speech remains. Processing may also be terminated before the input speech runs out, under some condition on the scores obtained by the pattern matching.
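The loop of steps S₁ to S₄ can be sketched in Python as follows. This is a toy illustration, not the implementation of the embodiment: the names `one_pass_loop`, `grow`, `match`, and `beam_width` are hypothetical, keeping a fixed number of best-scoring nodes is only one possible way of choosing the active nodes, and the HMM state-level matching within each arc is abstracted into a single `match` call per frame.

```python
from typing import Callable, Dict, Iterable


def one_pass_loop(
    frames: Iterable[float],
    grow: Callable[[Dict[int, float]], Dict[int, float]],
    match: Callable[[int, float], float],
    beam_width: int,
) -> Dict[int, float]:
    """Toy frame-synchronous loop following steps S1 to S4."""
    # node id -> accumulated score; node 0 stands for the initial node
    active: Dict[int, float] = {0: 0.0}
    for frame in frames:                            # S1: read one frame
        candidates = grow(active)                   # S2: grow the network
        scored = {node: score + match(node, frame)  # S3: pattern matching
                  for node, score in candidates.items()}
        ranked = sorted(scored.items(), key=lambda kv: kv[1], reverse=True)
        active = dict(ranked[:beam_width])          # S4: keep high-scoring nodes
    return active
```

Here `grow` is assumed to return the candidate nodes reachable from the current active nodes together with their inherited scores, and `match` to return a per-frame score for a node.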

[0010] The network consists of nodes and arcs, and initially contains only a single initial node. One node is assigned to, and associated with, each group of hypotheses generated by the LR parser whose stack contents are identical. Hypotheses for which the LR parser's subsequent processing would be the same are therefore merged into the same node (their arcs are connected to the same node). Such a node is called a merge node here. Based on the matching results at each step, the nodes with high scores are made active nodes, for which processing continues in the next step. The other nodes may be kept in the network as they are, or may be discarded to use memory efficiently. In this way, the search is advanced by One-Pass search while the network is extended using the LR parser and the context-free grammar whose terminal symbols are environment-independent phonemes.
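The merging of hypotheses with identical stack contents can be pictured with a small Python sketch. It assumes, purely for illustration, that a parser stack is represented as a tuple of LR state numbers; the class name `Network` and its methods are hypothetical and do not come from the text.

```python
from typing import Dict, List, Tuple

Stack = Tuple[int, ...]  # assumed representation of an LR parser stack


class Network:
    """Toy network in which one node exists per distinct stack content."""

    def __init__(self, initial_stack: Stack) -> None:
        self.node_of_stack: Dict[Stack, int] = {}
        self.arcs: List[Tuple[int, str, int]] = []  # (source, phoneme, destination)
        self.initial = self.node_for(initial_stack)

    def node_for(self, stack: Stack) -> int:
        """Return the node for this stack, creating it only if it is new."""
        if stack not in self.node_of_stack:
            self.node_of_stack[stack] = len(self.node_of_stack)
        return self.node_of_stack[stack]

    def add_arc(self, source: int, phoneme: str, new_stack: Stack) -> int:
        """Add an arc; arcs leading to the same stack share one merge node."""
        destination = self.node_for(new_stack)
        self.arcs.append((source, phoneme, destination))
        return destination
```

Two arcs whose parser stacks end up identical are thus connected to the same destination node, which keeps the network from splitting into grammatically equivalent copies.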

[0011] The generation (growth) of the network, which is the characteristic part of the present invention, will be described with reference to FIG. 3. Because a network of unbounded size can be created from a context-free grammar, there may be cases where the entire network cannot be created in advance. The network generation unit therefore limits the operation of creating new nodes from each active node of the network using the LR parser and extending arcs (growing the network) to a finite number of phonemes ahead. The case where the network is extended by one phoneme at a time is described here. In the figure, the symbols shown above the dotted lines are the environment-independent phonemes predicted between nodes by the LR parser. The solid lines between nodes are the arcs of the network, on which the present invention sets environment-dependent phoneme standard patterns (for legibility, the arcs between nodes 1-2, 2-3, 5-6, and 6-3 are omitted from the figure). The description here uses the so-called triphone model, which takes into account the environment to the left (L) and right (R) of a center phoneme (C) and is written L-C-R. When a node is created, the phonemes set on the arcs preceding it are propagated to that new node. Since this example uses the triphone model, the phonemes of the arcs leading up to each node are propagated to and held at that node (two phonemes, in the examples that follow).

[0012] First, the simple case of FIG. 3A is described. Consider growing the network at node 3, starting from a network in which nodes 1 to 3 have already been generated. Suppose the LR parser predicts the environment-independent phoneme P₃ from node 3 and finds that the stack changes to the state represented by a new node 4. From the past two-phoneme sequence (P₁, P₂) that was propagated to and stored at node 3, the three-phoneme sequence P₁, P₂, P₃ is obtained. Based on this, the environment-dependent phoneme standard pattern P₁-P₂-P₃ is set on the arc between nodes 3 and 4.
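The determination of the arc pattern from the stored history and the predicted phoneme can be sketched in Python as follows. The function names `triphone_label` and `propagate_history`, the string labels, and the tuple representation of the history are hypothetical. The second function reflects an inference from the text, namely that the new node keeps the most recent two phonemes for the next extension.

```python
from typing import Tuple

History = Tuple[str, ...]


def triphone_label(history: History, predicted: str) -> str:
    """Combine the two stored phonemes with the predicted one, e.g. P1-P2-P3."""
    if len(history) != 2:
        raise ValueError("this sketch expects exactly two stored phonemes")
    return "-".join(history + (predicted,))


def propagate_history(history: History, predicted: str, keep: int = 2) -> History:
    """History assumed to be held at the new node: the last `keep` phonemes."""
    return (history + (predicted,))[-keep:]
```

For node 3 holding (P1, P2) and a predicted phoneme P3, `triphone_label` yields the label for the arc between nodes 3 and 4, and `propagate_history` yields what node 4 would hold in this sketch.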

[0013] The case where node 3 is a merge node is described with reference to FIG. 3B. In this case, two past two-phoneme sequences, (P₁, P₂) and (P₄, P₅), have been propagated to and stored at node 3. Combining them with the environment-independent phoneme P₃ newly predicted by the LR parser shows that two arcs are needed between nodes 3 and 4, one set to P₁-P₂-P₃ and the other to P₄-P₅-P₃. To ensure acoustic consistency, either node 3 is split in two, or a constraint is imposed so that data is handed over only between corresponding arcs: from the arc coming from node 2 to the P₁-P₂-P₃ arc, and from the arc coming from node 6 to the P₄-P₅-P₃ arc.
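The merge-node case can be sketched in Python by keeping one stored history per incoming source node and creating one outgoing arc label per history. This is a simplified illustration of the constraint option rather than of node splitting; the function name `extend_merge_node` and the dictionary keyed by source node number are hypothetical.

```python
from typing import Dict, Tuple


def extend_merge_node(
    histories_by_source: Dict[int, Tuple[str, str]], predicted: str
) -> Dict[int, str]:
    """Return one outgoing arc label per incoming source node.

    Keying the result by source node records which incoming arc may hand
    its data to which outgoing arc.
    """
    return {
        source: "-".join(history + (predicted,))
        for source, history in histories_by_source.items()
    }


# Node 3 of FIG. 3B: histories arriving via node 2 and via node 6.
arcs_3_to_4 = extend_merge_node({2: ("P1", "P2"), 6: ("P4", "P5")}, "P3")
```

In this sketch `arcs_3_to_4` maps source node 2 to the P1-P2-P3 arc and source node 6 to the P4-P5-P3 arc.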

[0014] To reduce the amount of processing, a method that ignores this consistency is also conceivable; in that case the results obtained will contain errors. Instead of referring to the combination of the past phoneme sequence up to node 3 and the new phoneme between nodes 3 and 4, the full phoneme sequences needed at node 4, P₁-P₂-P₃ (and P₄-P₅-P₃) in the above example, may be held and referred to. Alternatively, everything from the first node onward, including phoneme sequences that are not needed, may be propagated and held, and only the necessary part referred to.

[0015] Finally, the hypotheses accepted by the LR parser form a single merge node. When all steps have finished, the recognition result is obtained by tracing back the arcs that propagated the highest score to this merge node. The phoneme environment dependence is not limited to the triphone model. The number of phonemes propagated to and held at each node is at least the number of phonemes in the environment-dependent standard pattern minus one (two for a triphone).
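The trace-back from the final merge node can be sketched in Python as follows, assuming that each node records the source node and phoneme of the incoming arc that delivered its highest score. The name `backtrace` and the dictionary `best_incoming` are hypothetical bookkeeping, not structures named in the text, and the sketch assumes the recorded links contain no cycle.

```python
from typing import Dict, List, Tuple


def backtrace(best_incoming: Dict[int, Tuple[int, str]], final_node: int) -> List[str]:
    """Follow the best-scoring incoming arcs back to the initial node.

    best_incoming maps a node to (source node, phoneme) of its best arc;
    the initial node has no entry, which ends the loop.
    """
    phonemes: List[str] = []
    node = final_node
    while node in best_incoming:
        source, phoneme = best_incoming[node]
        phonemes.append(phoneme)
        node = source
    phonemes.reverse()  # arcs were collected from last to first
    return phonemes
```

The returned list is the phoneme sequence along the best path, from which the recognition result would be read off.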

[0016]

[Effects of the Invention] If an LR table whose terminal symbols are environment-dependent phonemes were used, there would be few merge nodes: the network would grow with grammatically identical paths remaining split apart, producing many paths, that is, many candidates, and a large amount of computation. In the present invention, the scattering of merge nodes that would result from such an LR table is avoided by using an LR table whose terminal symbols are environment-independent phonemes. The environment-dependent phoneme standard pattern is determined at the merge node from the required number of propagated phonemes and is set on the arc to the newly created node. High-accuracy recognition using environment-dependent phoneme standard patterns is therefore possible while a network with high merging efficiency is created dynamically.

[Brief description of drawings]

[FIG. 1] A block diagram showing an embodiment of the present invention.

[FIG. 2] A flowchart of the recognition processing unit in an embodiment of the present invention.

[FIG. 3] A diagram explaining how environment-dependent phoneme standard patterns are set on the arcs of the network in an embodiment of the present invention.

Claims

[Claims]

1. A speech recognition method in which input speech is converted into a feature parameter time series, standard patterns are determined while a network is generated using an LR parser and a context-free grammar whose terminal symbols are environment-independent phonemes, a search is advanced by One-Pass search while the feature parameter time series is matched against standard patterns of hidden Markov models, and a candidate with a high likelihood is taken as the recognition result, the method being characterized in that: environment-dependent phoneme standard patterns are used as the standard patterns of the hidden Markov models; the phoneme of each arc of the network is propagated to the nodes ahead, at least as far as the number of phonemes required by the phoneme environment dependence; and, when the network is extended, an environment-dependent phoneme standard pattern is determined at the frontier node from the phoneme newly predicted by the LR parser and the phoneme sequence propagated along the arcs reaching that node, and is set on an arc of the network.