JPH1097275A - Large-vocabulary speech recognition system

Info

Publication number: JPH1097275A
Application number: JP8249548A
Authority: JP
Inventors: Koichi Yamaguchi; 耕市山口; Seiji Hamaguchi; 清治濱口; Toshio Akaha; 俊夫赤羽
Original assignee: Sharp Corp
Current assignee: Sharp Corp
Priority date: 1996-09-20
Filing date: 1996-09-20
Publication date: 1998-04-14

Abstract

PROBLEM TO BE SOLVED: To provide a large-vocabulary speech recognition system that can operate in real time on inexpensive hardware even for a very large vocabulary, by building the parser from a forward operation part and a backward operation part and running a Viterbi search driven under triphone-unit constraints that take the phoneme environment into account. SOLUTION: The system has a parser 5 that includes a forward operation part 6 and a backward operation part 7. The forward operation part 6 performs a Viterbi search under phoneme-unit constraints that take the phoneme environment into account. The backward operation part 7 expands hypotheses with a Viterbi search while referring to the tree-structured dictionary of a language model 10 that takes the phoneme environment into account. The order of expansion is determined best-first by an A* algorithm that uses the sum of the forward score and the score of the backward Viterbi search run in phoneme units. Accepted hypotheses are output, in order of acceptance, as word candidates of the recognition result, and the backward operation ends once a specified number of word candidates has been found.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Technical Field of the Invention] The present invention relates to a large-vocabulary speech recognition device, and in particular to a large-vocabulary speech recognition device that reduces the amount of processing required when recognizing speech with subword-unit HMMs (Hidden Markov Models).

[0002]

[Prior Art] FIG. 6 is a block diagram of a conventional speech recognition device. In FIG. 6, the speech waveform is supplied to an acoustic analysis unit 1 and converted, by linear prediction analysis or a similar method, into acoustic parameters that characterize the waveform. These acoustic parameters are passed to a parser 2, which implements a parsing algorithm.

[0003] The parser 2 uses two models, an acoustic model 3 and a language model 4, to analyze the acoustic parameters. The acoustic model 3 determines what parameters each phoneme takes. The language model 4 holds the vocabulary and grammar information that determines in what order phonemes must be arranged to form meaningful sentences and words. The parser 2 combines the two models to find the word or sentence that best matches the input.

[0004] Subword HMMs are generally used for the acoustic model 3. Phoneme-environment-dependent continuous-density HMMs are especially common because they model speaker-independent speech accurately. For Japanese, for example, the HMnet (Hidden Markov Network) described in Japanese Laid-Open Patent Publication No. 07-175494 achieves a good recognition rate. Here, a subword is a unit of representation that can express speech accurately and efficiently, such as a phoneme or a syllable.

[0005] A speech recognition system using HMMs requires a large amount of processing. Broken down by component, the two main contributors are generally the HMM likelihood computation and the search (the parser). The HMM likelihood computation is positioned as preprocessing for the search in the parser 2. Real-time operation is desirable for online speech recognition, and a heavy processing load not only affects cost directly but also burdens other tasks, so reducing the amount of processing is an important issue. In recent years speech recognition with subword HMMs has become common, and research on speeding it up has been increasing.

[0006] In continuous speech recognition, the sequences permitted by a given grammar (language model) are matched against the input speech, and the phoneme sequence with the highest matching score is taken as the recognition result. Matching every phoneme sequence permitted by the grammar against the input speech, however, requires a great deal of computation. The key to a fast search is to reduce the number of matches as far as possible and perform only the necessary ones. As one such technique, a method has been proposed that uses the A* algorithm to search quickly for exact N-best candidates.

[0007] FIG. 7 is a diagram for explaining the A* algorithm. In FIG. 7, let n be an arbitrary node of the graph, let g*(n) be the estimated cost of the optimal path from the start node S to n, and let h*(n) be the estimated cost of the optimal path from n to a goal node. If no path exists, g*(n) or h*(n) is infinite. The estimated cost f*(n) of the optimal path through n is given by the following equation.

[0008] f*(n) = g*(n) + h*(n) ... (1)

A graph search strategy that uses the above equation as its evaluation function, and in which the estimated cost h* is a lower bound of the true cost h (h*(n) ≤ h(n)), is called the A* algorithm.

[0009] In the A* example of FIG. 7, h*(n) is shown beside each node, and f* for each node is shown in parentheses. The open list changes as follows: (S(7)) → (A(8) B(9)) → (D(8) B(9) C(10)) → (B(9) C(10) H(10) I(10)) → (D(7) C(10) E(10) H(10) I(10)) → (H(9) I(9) C(10) E(10)) → (I(9) G1(10) C(10) E(10) L(11)) → (G2(9) G1(10) C(10) E(10) L(11)). Next, G2 is taken from the open list and the search terminates. The solution is S → B → D → I → G2.
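
The open-list mechanics described above can be illustrated with a minimal sketch. The graph of FIG. 7 is not reproduced in this text, so the toy graph, edge costs, and heuristic values below are hypothetical, and the function name `a_star` is chosen only for illustration. The sketch minimizes cost, as in equation (1); the recognizer described later instead maximizes a score.

```python
import heapq

def a_star(graph, h, start, goals):
    """Return (cost, path) of a cheapest path from start to any goal node.

    graph: dict mapping node -> list of (neighbor, edge_cost)
    h:     dict mapping node -> estimated remaining cost h*(n),
           assumed never to exceed the true remaining cost
    """
    # Each open-list entry is (f, g, node, path) with f = g + h[node].
    open_list = [(h[start], 0, start, [start])]
    best_g = {start: 0}
    while open_list:
        f, g, node, path = heapq.heappop(open_list)
        if node in goals:
            return g, path
        if g > best_g.get(node, float("inf")):
            continue  # a cheaper route to this node was found after this entry was pushed
        for nxt, cost in graph.get(node, []):
            g_next = g + cost
            if g_next < best_g.get(nxt, float("inf")):
                best_g[nxt] = g_next
                heapq.heappush(open_list, (g_next + h[nxt], g_next, nxt, path + [nxt]))
    return None

# Hypothetical toy graph (not the graph of FIG. 7).
graph = {
    "S": [("A", 2), ("B", 3)],
    "A": [("G", 6)],
    "B": [("G", 4)],
}
h = {"S": 5, "A": 5, "B": 4, "G": 0}
result = a_star(graph, h, "S", {"G"})
print(result)
```

Because every value in `h` is a lower bound on the true remaining cost, the first goal node taken from the open list lies on an optimal path.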

[0010] Bundle search, on the other hand, has been proposed as a method of sharing and approximating the matching computation. Because each word needs to be matched only once, this method allows fast search even with a complex grammar. The amount of computation, however, depends on the vocabulary size.

[0011] In the conventional discrete word recognition method, a score is computed by running a Viterbi search on the input speech for each word in the recognition vocabulary, one word at a time. The processing load of the parser is therefore proportional to the vocabulary size. When the vocabulary is very large, the same problem as in continuous speech recognition applies to discrete word recognition. A method that uses a best-first Viterbi search for word-by-word matching has also been proposed. It does reduce the search space, but because the heuristic function is expensive to compute, the actual amount of processing is not reduced much. Moreover, the way the heuristic function is constructed can be regarded as a technique that is effective for discrete HMMs.

[0012] For speaker-independent recognition, a more accurate acoustic model, the continuous mixture HMM, must be used. It is difficult to design a heuristic function for continuous mixture HMMs. Methods based on preselection have also been proposed for a long time, but they are seldom adopted today because of an unavoidable drawback: preselection errors. One report states that the amount of computation was reduced to 1/24 with only a slight drop in the recognition rate. That experiment, however, was speaker-dependent; for speaker-independent recognition, preselection errors increase and the recognition rate is expected to fall.

[0013] Chen et al., in "Large Vocabulary Word Recognition Based on Tree-trellis Search," Proc. ICASSP-94, pp. II-137-II-140 (1994), propose a method that speeds up large-vocabulary word recognition for Chinese by using Soong's continuous speech recognition technique. The forward search uses a free syllable network, and the backward search uses the A* algorithm while referring to a word dictionary organized as a syllable tree, which makes fast recognition of very large vocabularies possible. Because Chinese has as many as 412 syllables, the syllables are constructed efficiently by sharing states among subsyllable HMMs.

[0014] The phoneme environment is considered only within a syllable: only the connection conditions between vowels within a syllable are environment-dependent. In other words, nothing is considered across syllable boundaries (environment-independent), the tree-structured dictionary is built with syllables as arcs, and syllables are connected directly to one another. The paper reports recognition experiments for both the phoneme-environment-independent model (Table 4) and the intra-syllable phoneme-environment-dependent model (Table 5). As expected, the intra-syllable environment-dependent model yields the higher recognition rate. This experiment, however, was also speaker-dependent. For speaker-independent recognition, recognition performance is expected to degrade because the inter-syllable environment is not considered.

[0015]

[Problems to Be Solved by the Invention] In the conventional discrete word recognition method, a score is computed by running a Viterbi search on the input speech for each word in the recognition vocabulary, one word at a time. The processing load of the parser 2 is therefore proportional to the vocabulary size. When the vocabulary is very large, the same problem as in continuous speech recognition applies to discrete word recognition. When the method is applied to information retrieval and similar tasks, the vocabulary may well exceed several thousand words. A system that takes 0.1 second for 100 words would take 1 second for 1,000 words and 10 seconds for 10,000 words. Such a method is unsuitable for online speech recognition, where real-time operation is desirable.
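
The linear scaling in the example above can be checked with a few lines of arithmetic. This is only an illustration of the proportionality argument; the function name and the default reference point (0.1 second for 100 words) are taken from the example figures in the text, not from any measured system.

```python
def exhaustive_search_seconds(vocab_size, ref_words=100, ref_seconds=0.1):
    """Search time when the cost grows in proportion to the vocabulary size."""
    return ref_seconds * vocab_size / ref_words

for size in (100, 1_000, 10_000):
    print(size, "words:", round(exhaustive_search_seconds(size), 3), "s")
```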

[0016] Chen's discrete word recognition method described above targets speaker-dependent recognition, so the acoustic model is inherently accurate and good recognition performance was obtained even without considering the inter-syllable environment. When this method is applied to speaker-independent recognition, however, the A* algorithm does not work well. Because the acoustic model is less accurate, the forward search becomes less accurate and the backward search becomes more likely to fail.

[0017] A similar experiment was attempted for Japanese and reported in "Comparison of exhaustive search, beam search, and A* search in isolated word recognition," Proceedings of the Acoustical Society of Japan, 2-5-10, pp. 77-78 (March 1996). In that work, an arbitrary syllable chain is used to estimate the heuristic function x*(t) of the A* search, and the phoneme environment is not considered. The authors ran speaker-independent experiments and measured processing time, but did not obtain better results than the conventional method.

[0018] Preselection schemes are seldom used today. The preselection stage performs coarse recognition and classification with a low-cost matching method, so errors (preselection errors) occur frequently. These errors cannot be recovered later and therefore lower the recognition rate, which is an unavoidable drawback.

[0019] The main object of the present invention is therefore to provide a large-vocabulary speech recognition device in which the parser adopts a search method that narrows the search space efficiently, so that the amount of processing is almost unaffected by the vocabulary size, and which can operate in real time on inexpensive hardware even for very large vocabularies.

[0020]

[Means for Solving the Problems] The invention according to claim 1 is a speech recognition device using phoneme-environment-dependent phoneme hidden Markov models, comprising: input means for inputting speech; feature vector extraction means for analyzing the input speech in short-time frames and extracting feature vectors; dictionary creation means for, on the basis of the extracted feature vectors, expressing a recognition vocabulary to which silence models are added before each word beginning and after each word ending as phoneme-environment-dependent phoneme sequences, and converting them into a tree-structured dictionary whose arcs are those phonemes; and parser means including a forward operation unit and a backward operation unit. The forward operation unit performs a Viterbi search driven under phoneme-unit constraints that take the phoneme environment into account. The backward operation unit expands hypotheses with a Viterbi search while referring to the tree-structured dictionary that takes the phoneme environment into account, determines the order of expansion best-first with an A* algorithm that uses the sum of the forward score and the score of the backward Viterbi search run in phoneme units, outputs accepted hypotheses in order as word candidates of the recognition result, and ends the backward operation once a predetermined number of word candidates has been found.

[0021] In the invention according to claim 2, the forward operation unit of claim 1 performs a Viterbi search driven under constraints in units of hidden Markov model states that take the phoneme environment into account.

[0022] In the invention according to claim 3, the backward operation unit of claim 1 limits the matching range of the Viterbi search for each phoneme by a predetermined method based on a predetermined duration for each phoneme.

[0023] In the invention according to claim 4, the backward operation unit of claim 1 executes the Viterbi search used in expanding phoneme-unit hypotheses with the A* algorithm.

[0024]

[Embodiments of the Invention] FIG. 1 is a block diagram of an embodiment of the present invention. In FIG. 1, speech input from a microphone (not shown) is converted into a digital signal by an A/D converter and supplied to the acoustic analysis unit 1. The acoustic analysis unit 1 analyzes the input speech frame by frame and extracts acoustic parameters such as the LPC cepstrum, the delta LPC cepstrum, and the delta power.

[0025] This embodiment is described using HMnet as the phoneme-environment-dependent HMM of the acoustic model. To make the model phoneme-environment-dependent, each phoneme is represented as a triphone, and each state may be shared with states of other triphones. Accordingly, a list of the phonemes that can connect before and after each phoneme, a list of the phonemes that can connect before and after each state, and a list of the states that can connect before and after each state are recorded, and together they form a single network. Table 1 shows an example of the information held for HMnet states. The forward and backward search operations are implemented using this connection information.

[0026]

[Table 1]

[0027] The parser 5 in FIG. 1 consists of two operation units, a forward operation unit 6 and a backward operation unit 7. The forward operation unit 6 first computes, for every frame, the likelihood of each HMnet state. The results are stored as a likelihood table that both the forward operation unit 6 and the backward operation unit 7 refer to. Next, a Viterbi search driven under triphone-unit constraints that take the phoneme environment into account is performed.

[0028] FIG. 2 shows an example of the state/phoneme connection information. In FIG. 2, one phoneme consists of three or four states. Japanese has three thousand and several hundred triphones in total, but because states are shared in the HMnet used in this embodiment, the number of triphones with distinct state sequences (called distinct triphones below) is only a few hundred. In this embodiment, the states are therefore arranged per distinct triphone as shown in FIG. 2, so that the computation is skipped for triphones that have the same HMM state sequence.
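
The saving from state sharing can be sketched as a grouping step: triphones whose HMM state sequences are identical collapse into one entry, and the forward score is computed once per entry. The triphone names (written here as left-center+right) and the state numbers below are hypothetical, as is the function name.

```python
from collections import defaultdict

def group_by_state_sequence(triphone_states):
    """Map each distinct HMM state sequence to the triphones that share it."""
    groups = defaultdict(list)
    for triphone, states in triphone_states.items():
        groups[tuple(states)].append(triphone)
    return dict(groups)

# Hypothetical triphones with made-up shared state numbers.
triphone_states = {
    "k-a+t": [12, 40, 41],
    "g-a+t": [12, 40, 41],  # same state sequence as "k-a+t"
    "k-a+s": [12, 40, 57],
}
groups = group_by_state_sequence(triphone_states)
print(len(triphone_states), "triphones,", len(groups), "distinct state sequences")
```

Only one score per key of `groups` needs to be computed per frame; every triphone in the same group reuses it.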

[0029] In FIG. 2(a), the solid arrows show, among the transitions permitted as time advances from t-1 to t, mainly those involving state a of phoneme p. The heuristic function h*_p(t) of a phoneme p at time t is computed as a cumulative score, frame by frame and for every distinct phoneme, according to equation (2) below. State j at time t in FIG. 2(a) can be reached by paths from state i and from state j at the preceding frame t-1. Among the cumulative scores obtained by following these paths from frame t-1 to state j at frame t, the largest becomes the cumulative score of state j at time t.

[0030]

[Equation 1]

[0031] Here, j is the state number of the final state of phoneme p, and for simplicity h*_p(t) ≡ h*_jp(t) is defined. b_j(t) is the symbol output probability of state j at time t and is stored in the likelihood table. a_ij is the state transition probability from state i to state j. C_p(j) denotes the set of states that can transition to state j and that belong to phoneme p; it is obtained by referring to the state/phoneme connection information of the acoustic/language model 8 shown in FIG. 1. v_ip(t-1) is the cumulative score of state i at time t-1 and is obtained recursively by the Viterbi search.
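
The image of equation (2) is not available in this text. Based only on the definitions given here, a recursion of the following form would be consistent with them. The log-domain (additive) form is an assumption; if raw probabilities are used, the sums become products.

```latex
h^*_p(t) \equiv h^*_{jp}(t)
  = \max_{i \in C_p(j)} \bigl[\, v_{ip}(t-1) + \log a_{ij} \,\bigr] + \log b_j(t)
```

Each symbol is the one defined in the surrounding paragraph; no additional terms are implied.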

[0032] FIG. 2(b) shows the case where j is the state number of the initial state of phoneme p. In this case, as equation (3) shows, the largest score among the phonemes that can connect to phoneme p and the self-loop is selected.

[0033]

[Equation 2]

[0034] Here, I(p) denotes the set of phonemes that can connect to phoneme p together with their final state numbers; it is obtained by referring to the phoneme connection information shown in Table 2 below.

[0035] It is also effective for the forward operation unit 6 to perform a Viterbi search driven under constraints in units of HMM states that take the phoneme environment into account. In this case the accuracy of h*(t) is slightly lower than with the method above, but state sharing removes even more of the computation. Since only as many computations as there are states are needed, the amount of computation required for the forward search falls to 1/10 or less. The forward operation method can be chosen to suit the application. The heuristic function h*_p(t) of a phoneme p at time t is computed frame by frame for every state according to equation (4), and is represented by h*_j(t), where j is the state number of the final state of phoneme p.

[0036]

[Equation 3]

[0037] Here, S(j) denotes the set of all states that can transition to state j; it is obtained by referring to the state connection information. v_i(t-1) is the cumulative score of state i at time t-1. When j is the state number of the initial state of phoneme p, the computation is the same as in equation (2).
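
The image of equation (4) is likewise not available. Under the same log-domain assumption, a state-level recursion consistent with the definitions of S(j), a_ij, b_j(t), and v_i(t-1) would be:

```latex
h^*_j(t) = \max_{i \in S(j)} \bigl[\, v_i(t-1) + \log a_{ij} \,\bigr] + \log b_j(t)
```

The only structural difference from the phoneme-level form is that the maximization runs over every state that can reach j, regardless of which phoneme it belongs to, which is why one computation per shared state suffices.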

[0038] Of the inter-phoneme connection information and the set I(p) of initial state numbers, Table 2 shows the phoneme connection information table.

[0039]

[Table 2]

[0040] This table is referred to for the initial state of phoneme p. Linguistic constraints are taken into account to restrict which phonemes may be concatenated. In Japanese, for example, a consonant is considered not to connect to another consonant, so no path is provided from "m" to "h". Phoneme notation follows Hepburn romanization, except that "q" denotes the geminate consonant (sokuon), "N" the moraic nasal (hatsuon), "y" a palatalized sound (yoon), and "-" silence. In Table 2, each entry means that the final state of the phoneme HMM on the left can connect to the initial state of the phoneme HMM on the right.
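
A connection table of this kind can be represented as a mapping from each phoneme to the phonemes allowed to follow it. The contents of Table 2 are not reproduced in this text, so the excerpt below is hypothetical; it only mirrors the stated rule that no path exists from "m" to "h".

```python
# Hypothetical excerpt: phoneme -> set of phonemes allowed to follow it.
ALLOWED_NEXT = {
    "-": {"a", "k", "m", "h"},
    "a": {"k", "m", "h", "N", "q", "-"},
    "k": {"a", "y"},
    "m": {"a", "y"},
    "h": {"a", "y"},
    "y": {"a"},
    "N": {"k", "m", "h", "-"},
    "q": {"k"},
}

def is_allowed(sequence, allowed_next=ALLOWED_NEXT):
    """True if every adjacent phoneme pair in the sequence is permitted."""
    return all(b in allowed_next.get(a, set()) for a, b in zip(sequence, sequence[1:]))

print(is_allowed(["-", "k", "a", "m", "a", "-"]))
print(is_allowed(["-", "a", "m", "h", "a", "-"]))
```

During search, a lookup of this kind restricts which phoneme models may feed the initial state of a given phoneme.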

[0041] The backward operation unit 7 refers to the tree-structured dictionary of the language model 10, whose arcs are phonemes. This dictionary is created in advance from the recognition vocabulary list. Before the parser 5 runs, the speech segment of each word is extracted, but the word beginning and ending are often misjudged. For this reason a certain margin, that is, a segment of ambient sound, is usually set before the beginning and after the end of the segment judged to be speech. To handle this margin, an HMM silence model is added before and after each word of the recognition vocabulary. A silence model is a model trained on ambient noise that contains no speech; it does not refer to true silence, in which the waveform is always zero.

[0042] FIG. 3 shows an example of part of the tree-structured dictionary, and FIG. 4 shows, for comparison, the tree-structured dictionary when the phoneme environment is not considered. In FIGS. 3 and 4, the numbers are dictionary node numbers and the letters are the phonemes of the arcs. The phonemes are expanded under inter-phoneme connection constraints such as those illustrated in Table 2, and the preceding and following phoneme environments are taken into account. As a result there is more branching: four arcs leave node 328, whereas without the phoneme environment there are only two arcs, as at node 76 in FIG. 4. Because the computation runs backward in time, the branches extend from the word ending toward the word beginning. To speed up lookups during the backward operation, HMnet state numbers are assigned to each arc in advance.
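
The construction of a dictionary tree that grows from the word ending toward the word beginning can be sketched as a trie over reversed phoneme sequences padded with silence. The function and field names are hypothetical, the two-word lexicon is made up, and labeling each arc with a (left, center, right) tuple is only one assumed way of encoding the phoneme environment.

```python
def build_backward_tree(lexicon, silence="-", use_context=True):
    """Build a trie whose arcs run from the end of each word to its beginning."""
    root = {"arcs": {}, "words": []}
    for word, phonemes in lexicon.items():
        # Silence is added before the word beginning and after the word ending.
        seq = [silence] + list(phonemes) + [silence]
        node = root
        for k in range(len(seq) - 1, -1, -1):  # walk backward in time
            if use_context:
                left = seq[k - 1] if k > 0 else None
                right = seq[k + 1] if k + 1 < len(seq) else None
                label = (left, seq[k], right)
            else:
                label = seq[k]
            node = node["arcs"].setdefault(label, {"arcs": {}, "words": []})
        node["words"].append(word)
    return root

def count_nodes(node):
    """Number of nodes in the trie, including the root."""
    return 1 + sum(count_nodes(child) for child in node["arcs"].values())

lexicon = {"aka": ["a", "k", "a"], "ika": ["i", "k", "a"]}
with_context = build_backward_tree(lexicon, use_context=True)
without_context = build_backward_tree(lexicon, use_context=False)
print(count_nodes(with_context), count_nodes(without_context))
```

With context labels the two words split one arc earlier (at "k", whose left neighbor differs), which matches the observation that considering the phoneme environment increases branching.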

[0043] Hypotheses are expanded along the tree-structured dictionary shown in FIG. 3, and the order of expansion is determined best-first using the A* algorithm. That is, the node of the partial hypothesis with the highest evaluation value f*_p(t) in equation (5) below is expanded next. Here, p is the phoneme of the leading arc of the hypothesis, and t is a frame number within the matching range R(p, t₀) determined by the method described later.

[0044] The evaluation value f*(t) is the sum of the forward score h*(t) and the score g_p(t) of the backward Viterbi search, which is run in phoneme units for t ∈ R(p, t₀) as shown in equation (6). Because this sum follows the tree-structured dictionary, it reflects the preceding and following phoneme environment information attached to h*(t) and g_p(t), so the phoneme environment is taken into account. To simplify processing, the connection point is restricted to the single point t′ that gives the maximum value of f*(t). In other words, hypotheses that have the same phoneme sequence and differ only in their connection points are represented by a single hypothesis, which reduces the number of hypotheses. When an arc connected to this hypothesis is expanded next, the Viterbi search starts at this t′, which is then denoted t₀.
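
Choosing the single connection point can be sketched as a maximization over the frames of the matching range. The dictionaries `h_star` and `g` below stand for the forward and backward scores at each frame; their values are made up, and the function name is hypothetical.

```python
def best_connection_point(h_star, g, frames):
    """Return (t_prime, f_max), where t_prime maximizes f*(t) = h*(t) + g(t)."""
    t_prime = max(frames, key=lambda t: h_star[t] + g[t])
    return t_prime, h_star[t_prime] + g[t_prime]

# Made-up log scores for three candidate frames.
h_star = {3: -10.0, 4: -9.0, 5: -12.0}
g = {3: -5.0, 4: -5.5, 5: -4.0}
t_prime, f_max = best_connection_point(h_star, g, frames=[3, 4, 5])
print(t_prime, f_max)
```

Keeping only `t_prime` means that hypotheses with the same phoneme sequence but different connection points are merged into one, and `t_prime` becomes the start point for the next arc.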

[0045]

[Equation 4]

[0046] Here, i is the state number of the initial state of phoneme p, and g_p(t) ≡ g_ip(t) is defined. C*_p(i) is the set obtained by reversing the direction of the state connections in C_p(j). As explained for the best-first hypothesis growing algorithm described later, a hypothesis is either a single hypothesis or a group hypothesis. For a single hypothesis, p is the one phoneme of the leading arc. For a group hypothesis, p is the phoneme of the selected arc that gives the highest score in the group. t₀ is the previous connection point t′ of the hypothesis.

【００４７】後向き演算における各音素ｐごとのビター
ビサーチの照合範囲Ｒ（ｐ，ｔ₀）は、通常のその仮説
の開始点ｔ₀（＝前回の接続ポイント）から終点、すな
わち入力音声の先頭ｔ＝１までである。したがって、語
尾付近においては照合範囲が非常に広くなり、計算量の
増加を招く。音素にはその音素固有の継続時間があり、
一般に母音や撥音，促音は長く、破裂子音は短い傾向が
ある。そこで、音素単位にラベル付けされた音声データ
を用いて、各音素ごとに平均継続時間長μ_pと分散σ²
_pを求めておく。音声データが多量にある場合は正確を
期すため、各三組音素ごとに平均継続時間長と分散を求
めてもよい。使い方としては前回の接続ポイントｔ₀か
ら“平均値μ_p±α×標準偏差σ_p”の区間を対象とす
る。たとえば、この実施形態では、次の第（７）式に示
すように、ｔ₀から“平均値μ_p＋３×標準偏差σ_p”
だけ遡った区間が照合範囲Ｒ（ｐ，ｔ₀）とされるThe matching range R(p, t₀) of the Viterbi search for each phoneme p in the backward computation normally extends from the start point t₀ of the hypothesis (= the previous connection point) to the end point, that is, the beginning of the input speech, t = 1. Near the end of a word the matching range therefore becomes very wide, which increases the amount of computation. Each phoneme has its own characteristic duration: in general, vowels, moraic nasals, and geminate consonants tend to be long, while plosive consonants tend to be short. Therefore, speech data labeled in phoneme units is used to obtain the mean duration μ_p and the variance σ²_p for each phoneme in advance. If a large amount of speech data is available, the mean duration and variance may be obtained for each triphone for greater accuracy. In use, the interval "mean μ_p ± α × standard deviation σ_p" measured from the previous connection point t₀ is searched. For example, in this embodiment, as shown in equation (7) below, the matching range R(p, t₀) is the interval reaching back from t₀ by "mean μ_p + 3 × standard deviation σ_p".
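The duration-based limit can be sketched as follows. This is an illustrative reading of the description, not equation (7) itself: the inclusive endpoints, the rounding up to whole frames, the clipping at frame 1, and all names and sample durations are assumptions.

```python
import math
from statistics import mean, pstdev


def duration_stats(durations):
    """Mean and standard deviation (in frames) of labeled durations for one phoneme."""
    return mean(durations), pstdev(durations)


def matching_range(t0, mu, sigma, alpha=3.0, first_frame=1):
    """Frames from t0 back by mu + alpha * sigma, never before the first frame."""
    span = math.ceil(mu + alpha * sigma)  # assumed rounding to whole frames
    start = max(first_frame, t0 - span)   # the backward search cannot pass t = 1
    return range(start, t0 + 1)


# Hypothetical labeled durations (in frames) for one phoneme.
mu_a, sigma_a = duration_stats([6, 8, 10])

# With mu = 8 and sigma = 2 the search reaches back 14 frames from t0 = 100.
frames = matching_range(t0=100, mu=8.0, sigma=2.0)
```

Without the limit, the same phoneme would be matched against every frame from t₀ down to 1, which is what makes the range so wide near the end of a word.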

【００４８】[0048]

【数５】 (Equation 5)

【００４９】この照合範囲制限は計算量の削減だけでな
く、ビタービサーチでしばしば生じる不必要な時間軸整
合を未然に防ぐこともできるため、認識率の向上にも貢
献することが実験により確認されている。Experiments have confirmed that this restriction of the matching range not only reduces the amount of computation but also prevents the unnecessary time-axis alignment that often occurs in Viterbi search, thereby contributing to a higher recognition rate.

【００５０】図５はｂｅｓｔ−ｆｉｒｓｔに仮説を成長
させるアルゴリズムを示すフローチャートである。この
フローチャートは前述のSoong に準拠している。図６に
おける全スコアとは第（５）式の評価値ｆ^* _p（ｔ）の
ことであり、ルートノードとはまだ全く展開をしていな
い仮説のことを意味し、単一パスとは仮説の先端アーク
が１個，グループパスとは仮説の先端アークが複数個で
ある仮説のことを示し、ＮはＮ−ｂｅｓｔの候補数Ｎを
示す。FIG. 5 is a flowchart of the algorithm that grows hypotheses best-first. The flowchart follows Soong, cited above. In FIG. 6, the total score is the evaluation value f^*_p(t) of equation (5); the root node is a hypothesis that has not yet been expanded at all; a single path is a hypothesis with exactly one leading-edge arc; a group path is a hypothesis with several leading-edge arcs; and N is the number of N-best candidates.

【００５１】仮説の展開はスタックのトップエントリ
（＝最良部分仮説）を１アーク展開し、２つの仮説（す
なわち最良単一パスと残りのグループ）に分割すること
によって進められる。展開対象は常に最良部分仮説とし
ているので展開する順番はｂｅｓｔ−ｆｉｒｓｔとな
る。Hypotheses are expanded by extending the top entry of the stack (= the best partial hypothesis) by one arc and splitting it into two hypotheses (the best single path and the remaining group). Because the expansion target is always the best partial hypothesis, expansion proceeds best-first.

【００５２】図５を参照してより具体的に説明すると、
スタックにルートノードをおき、初期化が行なわれる。
次いで、スタックのトップエントリを取出し、最良部分
仮説が単一パスであり、グループパスでないか否かが判
別される。単一パスでなければ、最良部分仮説を２つの
仮説（最良単一パスと残りのグループ）に分割され、こ
れら２つの仮説について全スコアが計算され、これら２
つの仮説がスタックに戻されて全スコアに基づいてソー
トされる。単一パスであれば最良部分仮説が終端ノード
まで到達しているか否かが判別され、終端ノードまで到
達していなければ、グループ仮説を２つの仮説に分割
し、これら２つの仮説について全スコアを計算し、これ
ら２つの仮説をスタックに戻し、全スコアに基づいてソ
ートされる。最良部分仮説が終端ノードまで到達すれ
ば、その仮説を出力し、受理数カウンタをインクリメン
トする。受理数カウンタがＮに等しくなければ、再びス
タックのトップエントリを取り、受理カウンタがＮに等
しければ終了する。More specifically, with reference to FIG. 5, the root node is placed on the stack and initialization is performed. Next, the top entry of the stack is taken out, and it is determined whether the best partial hypothesis is a single path rather than a group path. If it is not a single path, the best partial hypothesis is split into two hypotheses (the best single path and the remaining group), the total score is calculated for each, and both are returned to the stack, which is sorted by total score. If it is a single path, it is determined whether the best partial hypothesis has reached a terminal node. If it has not, the group hypothesis is split into two hypotheses, the total score is calculated for each, and both are returned to the stack, which is sorted by total score. If the best partial hypothesis has reached a terminal node, the hypothesis is output and the acceptance counter is incremented. If the acceptance counter is not equal to N, the top entry of the stack is taken again; if it is equal to N, processing ends.
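The loop described above can be sketched on a toy tree as follows. This is a simplified illustration, not the flowchart of FIG. 5: the tree, the arc scores, and the exact heuristic h_star are invented for the example, and a heapq priority queue stands in for the sorted stack.

```python
import heapq
from itertools import count

# Hypothetical tree dictionary: node -> {arc label: (arc score, child node)}.
# Nodes without an entry are terminal nodes.
TREE = {
    "root": {"a": (-1.0, "n1"), "b": (-2.0, "n2")},
    "n1": {"c": (-3.0, "w_ac"), "d": (-0.5, "w_ad")},
    "n2": {"e": (-0.25, "w_be")},
}


def h_star(node):
    """Best achievable score from node to a terminal (exact here, hence admissible)."""
    arcs = TREE.get(node)
    if not arcs:
        return 0.0
    return max(s + h_star(child) for s, child in arcs.values())


def n_best(n):
    tie = count()  # tie-breaker so the heap never compares paths or dicts
    # Entry: (-total score, tie, score so far, path, node, pending arcs or None).
    # pending is None for a single path and a dict of arcs for a group.
    stack = [(-h_star("root"), next(tie), 0.0, (), "root", None)]
    accepted = []
    while stack and len(accepted) < n:
        neg_f, _, g, path, node, pending = heapq.heappop(stack)
        if pending is None:                  # single path
            if node not in TREE:             # reached a terminal node: accept
                accepted.append(("".join(path), -neg_f))
                continue
            pending = dict(TREE[node])       # expand by one arc into a group
        # Split the group into the best single path and the remaining group.
        scored = {lab: g + s + h_star(ch) for lab, (s, ch) in pending.items()}
        best = max(scored, key=scored.get)
        s, child = pending[best]
        heapq.heappush(stack, (-scored[best], next(tie), g + s,
                               path + (best,), child, None))
        rest = {lab: arc for lab, arc in pending.items() if lab != best}
        if rest:
            rest_f = max(scored[lab] for lab in rest)
            heapq.heappush(stack, (-rest_f, next(tie), g, path, node, rest))
    return accepted


candidates = n_best(3)
```

Because h_star never underestimates the best remaining score, entries are accepted in descending order of total score, and the loop can stop as soon as N entries have been accepted.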

【００５３】このように、ｂｅｓｔ−ｆｉｒｓｔに順次
仮説を展開していくと、受理された仮説の順に認識結果
のＮ−ｂｅｓｔ単語候補が求まる。つまり、スコアの高
い候補から順に受理されるので、第１位、第２位、第３
位、…の順に単語候補が出力される。所定の個数（たと
えば１０個なら第１０位まで）の単語候補が求められれ
ば後向き演算が終了され、木構造辞書のうち、Ｎ−ｂｅ
ｓｔ単語候補にかかわるアークのみ後向き演算でビター
ビサーチを実行していることとなる。When hypotheses are expanded best-first in this way, the N-best word candidates of the recognition result are obtained in the order in which hypotheses are accepted. That is, candidates are accepted in descending order of score, so word candidates are output in the order first, second, third, and so on. Once the predetermined number of word candidates has been obtained (for example, down to tenth place if the number is ten), the backward computation ends. As a result, the backward computation runs the Viterbi search only on those arcs of the tree-structured dictionary that are involved in the N-best word candidates.

【００５４】ヒューリスティック関数ｈ^*（ｔ）の精度
は後向きサーチの探索効率（＝仮説展開回数）に大きく
影響する。もし、ｈ^*（ｔ）が真の値ｈ（ｔ）に等しい
ならば理想的に展開が進み、無駄な仮説の展開を全くし
なくて済む。このとき処理量は認識語彙数、すなわち木
構造辞書のサイズには依存しないこととなる。ｈ
^*（ｔ）はＡ^*アルゴリズムの許容可能性：ｈ^*（ｔ）
≧ｈ（ｔ）の関係が成立しているが、弱い文法を使って
いるためｈ^*（ｔ）＝ｈ（ｔ）にはならない。The accuracy of the heuristic function h^*(t) strongly affects the efficiency of the backward search (= the number of hypothesis expansions). If h^*(t) equals the true value h(t), expansion proceeds ideally and no wasted hypothesis is ever expanded. In that case the amount of processing does not depend on the number of recognition vocabulary words, that is, on the size of the tree-structured dictionary. h^*(t) satisfies the admissibility condition of the A^* algorithm, h^*(t) ≥ h(t), but because a weak grammar is used, h^*(t) = h(t) does not hold.

【００５５】したがって、実際には無駄な仮説の展開が
多少存在し、正解に近い仮説の周辺アークもサーチする
ので処理量は認識語彙数に少しは依存する。結果として
処理量が語彙数にはほとんど影響されないという特徴を
持つ。従来の１単語ずつビタービサーチを実行する方式
に比べれば、語彙が増えれば増えるほど探索空間が劇的
に削減できる。したがって、この実施形態は大語彙に適
した認識方式といえる。２０，０００単語を認識語彙
とした場合、パーザ５の処理量が１／４０に削減できる
ことが実験で確認されている。Therefore, in practice some wasted hypothesis expansion does occur, and arcs around hypotheses close to the correct answer are also searched, so the amount of processing depends slightly on the vocabulary size. As a result, the method has the property that the amount of processing is almost unaffected by vocabulary size. Compared with the conventional method of running a Viterbi search one word at a time, the search space is reduced more dramatically the larger the vocabulary becomes. This embodiment is therefore a recognition method well suited to large vocabularies. Experiments have confirmed that with a recognition vocabulary of 20,000 words, the processing load of the parser 5 is reduced to 1/40.

【００５６】なお、展開途中の仮説はスタックに積んで
おく。１回の展開操作ごとにスタックの並び換え（ソー
ト）が必要となる。スタックのサイズは理想的な環境下
では理論的にはＮ−ｂｅｓｔの候補数Ｎと同じでよい。
しかし、現実には認識語彙数や音響モデルの性能に影響
されるため、余裕を持たせた値に設定する必要がある。
実環境実験では数百程度のサイズが望まれる。たとえ
ば、この実施形態では、語彙が２０，０００語のときは
１，０００、５，０００語のときは５００とする。した
がって、スタックのソートは処理量を増大させる要因と
なる。処理の高速化のため、仮説をスタックへ戻す際に
は二分木探索を
用いて挿入する場所が決定される。これによりスタック
の全データのソートはしなくて済む。スタックの入換え
はポインタ操作で行ない、実際のスタック上のデータは
移動させないようにする。ただし、処理系によってはポ
インタ操作よりもスタックをリスト構造にした方が効果
的となることもある。Hypotheses that are still being expanded are kept on the stack. The stack must be reordered (sorted) after every expansion operation. Under ideal conditions the stack size could in theory equal the number N of N-best candidates. In practice, however, it is affected by the vocabulary size and the performance of the acoustic model, so it must be set with some margin. In real-environment experiments a size of a few hundred is desirable. For example, in this embodiment the stack size is 1,000 for a vocabulary of 20,000 words and 500 for a vocabulary of 5,000 words. Sorting the stack is therefore a factor that increases the processing load. To speed up processing, the insertion position is determined by a binary tree search when a hypothesis is returned to the stack, so the entire stack need not be sorted. Entries are reordered by pointer operations, and the actual data on the stack is not moved. Depending on the processing system, however, implementing the stack as a list structure may be more effective than pointer operations.
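One way to avoid a full sort on every expansion is sketched below, assuming that the insertion position is found by binary search over a score-ordered array and that the worst entry is dropped when the stack is full. The class name, the capacity handling, and the use of Python's bisect module are assumptions of this sketch, not details from the patent.

```python
import bisect


class HypothesisStack:
    """Score-ordered stack of bounded size; the best entry is at the front."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._keys = []    # negated scores in ascending order, so best comes first
        self._items = []   # references to hypotheses, kept parallel to _keys

    def __len__(self):
        return len(self._keys)

    def push(self, score, hypothesis):
        # Binary search for the insertion position instead of re-sorting.
        pos = bisect.bisect_right(self._keys, -score)
        self._keys.insert(pos, -score)
        self._items.insert(pos, hypothesis)
        if len(self._keys) > self.capacity:  # stack full: drop the worst entry
            self._keys.pop()
            self._items.pop()

    def pop_best(self):
        score = -self._keys.pop(0)
        return score, self._items.pop(0)


stack = HypothesisStack(capacity=2)
for score, hyp in [(-3.0, "a"), (-1.0, "b"), (-2.0, "c")]:
    stack.push(score, hyp)
```

The lists hold references only, so hypothesis data is never copied, but list insertion still shifts references in linear time. A linked list or a heap would trade that cost differently, which matches the remark that the better choice depends on the processing system.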

【００５７】なお、後向き演算における各音素ごとのビ
タービサーチの代わりにＡ^*アルゴリズムを用いて算出
することも可能である。この場合ヒューリスティック関
数は前向き演算で既に求まっているものを流用できる。
ただし、Ａ^*アルゴリズムを起動する回数が多く、スタ
ック操作などのオーバヘッドがあるため、処理速度が向
上するかどうかは、メモリのアクセススピードなど実装
する処理系の条件に依存する。The per-phoneme Viterbi search in the backward computation can also be replaced by a computation using the A^* algorithm. In that case, the heuristic function already obtained in the forward computation can be reused. However, the A^* algorithm is invoked many times and incurs overhead such as stack operations, so whether processing speed actually improves depends on the conditions of the target platform, such as memory access speed.

【００５８】この実施形態では、サブワード単位とし
て、音素を採用したが、音節でも実現可能である。日本
語の音節は約１１０種類あり、音素環境を考慮すると異
なり音節数は１０，０００以上になるため、前向き演算
の処理量が大きくなる反面、前向きサーチの精度が向上
するため、後向きサーチの探索がより効率よく行なわれ
る。In this embodiment the phoneme is used as the subword unit, but syllables can be used instead. Japanese has about 110 syllables, and the number of distinct syllables exceeds 10,000 once the phoneme environment is taken into account. The forward computation therefore becomes more expensive, but the forward search becomes more accurate, so the backward search runs more efficiently.

【００５９】また、上述の実施形態では、認識対象とし
て単語を取上げたが、辞書の語彙は単語に限定されるわ
けではなく、１文節を１単語と見なして木構造辞書を作
成すれば、文節の認識も実現可能である。日本語は助詞
などの表現で語尾の表現が木構造によって共有化できる
ので、効率よくサーチすることができる。In the embodiment described above, words are the recognition targets, but the dictionary vocabulary is not limited to words: if a tree-structured dictionary is built by treating one phrase as one word, phrase recognition is also possible. In Japanese, word-final expressions such as particles can be shared through the tree structure, so the search is efficient.
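The sharing of word-final expressions can be illustrated with a small tree built from the end of each entry, so that entries ending in the same particle share their first arcs. Rooting the tree at the word end, the romanized phoneme strings, and the end-of-entry marker "#" are all assumptions of this sketch.

```python
def build_suffix_tree(entries):
    """Build a nested-dict tree whose arcs are phonemes read from the word end."""
    root = {}
    for phonemes in entries:
        node = root
        for ph in reversed(phonemes):
            node = node.setdefault(ph, {})  # reuse an existing arc if present
        node["#"] = " ".join(phonemes)      # marker: a complete entry ends here
    return root


def count_arcs(node):
    """Number of phoneme arcs in the tree, ignoring the end-of-entry marker."""
    return sum(1 + count_arcs(child)
               for key, child in node.items() if key != "#")


# Hypothetical phrases: "eki wa", "mise wa", "eki ga".
entries = [
    ["e", "k", "i", "w", "a"],
    ["m", "i", "s", "e", "w", "a"],
    ["e", "k", "i", "g", "a"],
]
tree = build_suffix_tree(entries)
shared_arcs = count_arcs(tree)
unshared_arcs = sum(len(e) for e in entries)
```

In this toy case the three entries need 13 arcs in the tree instead of 16 when stored separately, because the final "a" and the particle "w a" are shared.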

【００６０】[0060]

【発明の効果】以上のように、この発明によれば、認識
語彙を音素環境依存型音素列で表現し、それら音素をア
ークとする木構造の辞書に変換し、前向き演算部で音素
環境を考慮した音素単位の制約条件下で駆動するビター
ビサーチを行ない、後向き演算部で音素環境を考慮した
木構造辞書を参照しながらビタービサーチを用いて仮説
を展開し、前向き演算結果のスコアと音素単位で実行し
た後向きビタービサーチの演算結果のスコアの和を利用
したＡ^*アルゴリズムを用いて展開する順番をｂｅｓｔ
−ｆｉｒｓｔに決定し、受理された仮説の順にそれを認
識結果の単語候補として出力し、所定の個数の単語候補
が求まれば後向き演算を終了して認識候補の単語を出力
するようにしたので、処理量が語彙数に比例しないとい
う特徴を活かして、超大語彙を対象としても安価なハー
ドウェア構成で実時間動作が可能な音声認識装置を実現
できる。たとえば、２０，０００単語を認識語彙とした
場合、この発明によれば、音声認識の処理量の二大要
素、ＨＭＭの尤度演算とパーザのうち、後者を約１／４
０に削減できる。As described above, according to the present invention, the recognition vocabulary is represented by phoneme-environment-dependent phoneme sequences and converted into a tree-structured dictionary whose arcs are those phonemes. The forward operation unit performs a Viterbi search driven under phoneme-unit constraints that take the phoneme environment into account. The backward operation unit expands hypotheses using Viterbi search while referring to the tree-structured dictionary that takes the phoneme environment into account, determines the expansion order best-first using the A^* algorithm based on the sum of the forward-computation score and the score of the backward Viterbi search executed phoneme by phoneme, outputs hypotheses in the order accepted as word candidates of the recognition result, and ends the backward computation once a predetermined number of word candidates has been obtained. Taking advantage of the property that the processing load is not proportional to vocabulary size, a speech recognition device can be realized that operates in real time on inexpensive hardware even for a very large vocabulary. For example, with a recognition vocabulary of 20,000 words, of the two major components of the speech-recognition processing load, the HMM likelihood computation and the parser, the present invention reduces the latter to about 1/40.

[Brief description of the drawings]

【図１】この発明の一実施形態を示すブロック図であ
る。FIG. 1 is a block diagram showing an embodiment of the present invention.

【図２】ＨＭＭの状態間の接続の制約例を示す図であ
る。FIG. 2 is a diagram illustrating an example of a restriction on a connection between states of an HMM.

【図３】音素環境を考慮した木構造辞書の一部分の例を
示す図である。FIG. 3 is a diagram showing an example of a part of a tree-structured dictionary considering a phoneme environment.

【図４】音素環境を考慮しない木構造辞書の一部分の例
を示す図である。FIG. 4 is a diagram showing an example of a part of a tree-structured dictionary without considering a phoneme environment.

【図５】ｂｅｓｔ−ｆｉｒｓｔに仮説を成長させるアル
ゴリズムを示すフローチャートである。FIG. 5 is a flowchart illustrating an algorithm for growing a hypothesis in best-first.

【図６】従来の一般的な単語音声認識装置の構成を示す
ブロック図である。FIG. 6 is a block diagram showing a configuration of a conventional general word speech recognition device.

【図７】Ａ^*アルゴリズムを説明するための図である。FIG. 7 is a diagram for explaining an A ^* algorithm.

[Explanation of symbols]

１音響分析部５パーザ６前向き演算部７後向き演算部８音響／言語モデル９音響モデル１０言語モデル Reference Signs List: 1 acoustic analysis unit, 5 parser, 6 forward operation unit, 7 backward operation unit, 8 acoustic/language model, 9 acoustic model, 10 language model

Claims

[Claims]

1. A large-vocabulary speech recognition device that recognizes speech using phoneme-environment-dependent phoneme hidden Markov models, comprising: input means for inputting speech; feature vector extraction means for analyzing the speech input from the input means in each short-time frame and extracting feature vectors; dictionary creation means for representing a recognition vocabulary, to which silence models are added before and after each word, as phoneme-environment-dependent phoneme sequences and converting it into a tree-structured dictionary whose arcs are those phonemes; and parser means that operates on the feature vectors extracted by the feature vector extraction means and includes a forward operation unit and a backward operation unit, wherein the forward operation unit performs a Viterbi search driven under phoneme-unit constraints that take the phoneme environment into account, and the backward operation unit expands hypotheses using Viterbi search while referring to the tree-structured dictionary that takes the phoneme environment into account, determines the expansion order best-first using an A^* algorithm based on the sum of the forward-operation score and the score of the backward Viterbi search executed phoneme by phoneme, outputs the hypotheses in the order accepted as word candidates of the recognition result, and ends the backward operation once a predetermined number of word candidates has been obtained.

2. The large-vocabulary speech recognition device according to claim 1, wherein the forward operation unit performs a Viterbi search driven under constraints in units of hidden Markov model states that take the phoneme environment into account.

3. The large-vocabulary speech recognition device according to claim 1, wherein the backward operation unit limits the matching range of the Viterbi search for each phoneme based on a duration predetermined for that phoneme.

4. The large vocabulary speech recognition apparatus according to claim 1, wherein said backward operation unit executes a Viterbi search in developing a hypothesis in phoneme units using an A ^* algorithm.