JP3461789B2

JP3461789B2 - Speech recognition device, speech recognition method, and program recording medium

Info

Publication number: JP3461789B2
Application number: JP2000187686A
Authority: JP
Inventors: 俊夫赤羽
Original assignee: Sharp Corp
Current assignee: Sharp Corp
Priority date: 2000-06-22
Filing date: 2000-06-22
Publication date: 2003-10-27
Anticipated expiration: 2020-06-22
Also published as: JP2002006883A

Description

Detailed Description of the Invention

[0001]

[Field of the Invention] The present invention relates to a speech recognition device and a speech recognition method using a hidden Markov model, and to a program recording medium on which a speech recognition program is recorded.

[0002]

[Description of the Related Art] The hidden Markov model (hereinafter abbreviated as HMM) is one speech recognition technique (Rabiner & Juang, translation supervised by Furui, "Fundamentals of Speech Recognition," Chapter 6, NTT Advanced Technology, 1995: Reference 1). Because an HMM achieves high recognition accuracy by statistically learning the variability inherent in speech, such as speaker and utterance variation, it has become established as a standard speech recognition method.

[0003] FIG. 3 shows a configuration example of a conventional basic speech recognition device using the HMM. The conventional HMM-based speech recognition device is described below with reference to FIG. 3. The input speech is assumed to have already been sampled and quantized.

[0004] The acoustic analysis unit 1 takes in speech sample data at a fixed period, extracts acoustic parameters, and outputs them to the likelihood calculation unit 2 and the speech section detection unit 3. The acoustic model storage unit 4 stores acoustic models in which the distribution of the acoustic parameters has been statistically learned for each small unit of speech, such as a phoneme or a syllable. The acoustic models are assumed to have been trained on a large amount of speech data.

[0005] Based on the output probability of each state constituting the acoustic models stored in the acoustic model storage unit 4, the likelihood calculation unit 2 obtains the likelihood of each state for each frame from the acoustic parameters of that input frame, and stores it in the likelihood storage unit 5. The speech section detection unit 3 detects the speech section from the acoustic analysis result of the acoustic analysis unit 1, mainly using a subset of the acoustic parameters such as short-time speech energy.

[0006] The language dictionary 6 stores each word of the recognition vocabulary in association with a representation of that word as a serial concatenation of the state sequences of the phoneme models, which are the acoustic models. For each word stored in the language dictionary 6, the matching unit 7 matches the state sequence stored in the language dictionary 6 against the state sequence of all the input frames by the Viterbi method and calculates the likelihood of each word. The local likelihood of each state at each input frame is obtained by referring to the values stored in the likelihood storage unit 5. The words are then sorted in descending order of likelihood, and the top candidates are output.

[0007] According to Section 6.4.2.2 of Reference 1, in the Viterbi algorithm that underlies HMM-based recognition, the following recursion accounts for most of the processing, where T is the length of the input observation sequence and N is the number of states of the word model.

$$\delta_t(j) = \max_{i}\left[\delta_{t-1}(i) + a_{ij}\right] + b_j(o_t) \tag{1}$$

$$\Psi_t(j) = \operatorname*{argmax}_{i}\left[\delta_{t-1}(i) + a_{ij}\right] \tag{2}$$

$$2 \le t \le T, \quad 1 \le j \le N$$

Here $a_{ij}$ and $b_j(o_t)$ are the log transition probability and the log output probability, respectively. Further, $\delta$ is the cumulative likelihood, $\Psi$ is the back pointer, and $i$ is the state number at time $(t-1)$. If the path taken during matching does not need to be known, equation (2) is unnecessary.
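The recursion of equations (1) and (2) can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation of any particular recognizer: the function name `viterbi_log`, the nested-list layout of `log_a` and `log_b`, and the convention that the path starts in state 0 and ends in state N-1 are all assumptions made for the example.

```python
import math

NEG_INF = -math.inf

def viterbi_log(log_a, log_b):
    """Log-domain Viterbi recursion following equations (1) and (2).

    log_a[i][j]: log transition probability from state i to state j.
    log_b[t][j]: log output probability of state j for frame t (0-based).
    Assumption for this sketch: the path starts in state 0 and ends in
    state N-1.
    """
    T, N = len(log_b), len(log_a)
    delta = [[NEG_INF] * N for _ in range(T)]
    psi = [[0] * N for _ in range(T)]
    delta[0][0] = log_b[0][0]
    for t in range(1, T):
        for j in range(N):
            # Equation (2): best predecessor i of state j.
            best_i = max(range(N), key=lambda i: delta[t - 1][i] + log_a[i][j])
            psi[t][j] = best_i
            # Equation (1): cumulative log likelihood of state j at frame t.
            delta[t][j] = delta[t - 1][best_i] + log_a[best_i][j] + log_b[t][j]
    # Follow the back pointers from the final state.
    path = [N - 1]
    for t in range(T - 1, 0, -1):
        path.append(psi[t][path[-1]])
    path.reverse()
    return delta[T - 1][N - 1], path

# Two-state left-to-right model, three frames.
log_a = [[math.log(0.5), math.log(0.5)], [NEG_INF, 0.0]]
log_b = [[math.log(0.9), math.log(0.1)],
         [math.log(0.8), math.log(0.2)],
         [math.log(0.1), math.log(0.9)]]
score, path = viterbi_log(log_a, log_b)
```

The two nested loops over `t` and `j`, each with an inner maximization over `i`, are where the N²·T additions and comparisons come from.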

[0008] The computational cost of equations (1) and (2) is on the order of N²·T additions and comparisons. However, if state transitions are restricted to adjacent states only, the order becomes 2·N·T. For large-vocabulary speech recognition it is difficult to collect a large amount of training data for every word, so a commonly used method is to train a model for each phoneme (phoneme model) in advance and to build arbitrary words by concatenating these phoneme models. To achieve sufficient performance as a phoneme model, it is common to use an HMM having about three to five states per phoneme.

[0009] Putting these together, let V be the number of words, P the average number of phonemes per word in the word dictionary, S the average number of states per phoneme, and T the length of the input speech. The Viterbi computation required to match the words of a large vocabulary is then on the order of 2·V·P·S·T. For example, with V = 1,000 words, P = 10 phonemes, S = 4 states, and T = 100 frames, the problem is that an enormous number of additions and comparisons is required, on the order of 2 × 1000 × 10 × 4 × 100 = 8,000,000.
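The order-of-magnitude estimate above is simple enough to check directly. The helper below is only a convenience for reproducing the arithmetic; its name and parameter names are chosen for this example.

```python
def viterbi_op_count(num_words, phonemes_per_word, states_per_phoneme, num_frames):
    """Order-of-magnitude count of additions and comparisons, 2*V*P*S*T,
    assuming transitions only between adjacent states."""
    return 2 * num_words * phonemes_per_word * states_per_phoneme * num_frames

# V = 1,000 words, P = 10 phonemes, S = 4 states, T = 100 frames
ops = viterbi_op_count(1000, 10, 4, 100)
print(ops)
```

Each factor is a lever for speeding up matching: fewer words to match (preselection), fewer states per unit, or fewer frames (thinning).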

[0010] To cope with such an enormous amount of computation, the speech recognition device disclosed in Japanese Unexamined Patent Publication No. 6-266393 (Reference 2) speeds up matching in speech recognition using standard patterns by two methods: high-speed preliminary selection, in which both the input sequence and the standard patterns are thinned out at fixed intervals by a frequency divider, and word spotting.

[0011] Another document (Reference 3), "A Fast Approximate Acoustic Match for Large Vocabulary Speech Recognition," IEEE Trans. on Speech and Audio Processing, Vol. 1, No. 1, January 1993, discloses a method for the high-speed matching that is performed to narrow down the number of candidates before detailed matching in HMM-based speech recognition. The speech recognition device of Reference 3 uses environment-dependent (context-dependent) HMMs, which take the preceding and following phoneme environment into account, as the phoneme models for detailed matching, but uses environment-independent phoneme models for high-speed matching. Specifically, let $A_u$ be the set of states in the environment-dependent HMMs belonging to phoneme $u$, and let $\Pr(f_i \mid a)$ be the probability that a state $a \in A_u$ outputs the label $f_i$; the output probability of phoneme $u$ is then defined by an equation that is not reproduced in this text. Likewise, letting $q_u(n)$ be the probability of exiting a state sequence of length $n$ belonging to phoneme $u$, the transition probability of exiting the state of phoneme $u$ is defined by a second equation, which is also not reproduced here. The probability of remaining in state $u$ is set to 1.

[0012] By using the environment-independent phoneme HMMs defined in this way and by representing the word dictionary as a tree of phonemes, the dictionary to be matched against the input sequence is reduced in size, which enables high-speed matching against a large-vocabulary dictionary.

[0013]

[Problems to Be Solved by the Invention] However, the conventional speech recognition devices that achieve high-speed matching have the following problems. In general, the more precise the HMM used as the acoustic model becomes, the greater the number of phonemes and states constituting the model, and the greater the computation required for matching. A method that coarsely selects candidates by high-speed matching, as disclosed in References 2 and 3, and then matches them in detail is therefore a good remedy for the growth in computation. However, the method of Reference 2, which thins out the standard patterns at fixed intervals in the time direction, cannot be applied to HMM state sequences, so a method using models with a small number of states, as in Reference 3, can be said to be better suited to HMM-based speech recognition devices. The reason is as follows. If the input speech is thinned out at fixed intervals as in Reference 2, the features of momentary phonemes such as plosives may be missed when the speech is uttered rapidly. If the thinning rate is instead set so that such features are not missed, a different problem arises: sufficient speed-up cannot be achieved.

[0014] In Reference 3, moreover, when the multi-state environment-dependent phoneme models are converted into one-state environment-independent phoneme models, the regions of the parameter space occupied by different phonemes overlap. Disparities therefore arise among the phoneme likelihoods, and certain phoneme errors may occur in large numbers. In that case the high-speed matching result contains many errors and the candidates cannot be limited to a small number, so the speed-up is insufficient. Reference 3 describes no method for solving these problems.

[0015] Accordingly, an object of the present invention is to provide an HMM-based speech recognition device and speech recognition method that enable high-speed matching with few omissions of momentary phonemes such as plosives and with few errors, and a program recording medium on which a speech recognition program is recorded.

[0016]

[Means for Solving the Problems] To achieve the above object, the speech recognition device of the first invention is characterized by comprising: acoustic analysis means for acoustically analyzing input speech; likelihood calculation means for calculating, on the basis of the acoustic analysis result and with reference to acoustic models stored in acoustic model storage means, the likelihood of each state for each frame, and storing the result in detailed-matching likelihood storage means as detailed-matching likelihoods; high-speed matching likelihood calculation means for obtaining high-speed matching likelihoods on the basis of the detailed-matching likelihoods; high-speed matching likelihood correction means for correcting the bias of the high-speed matching likelihoods toward the wrong side and storing them in high-speed matching likelihood storage means; high-speed matching means for matching the corrected high-speed matching likelihoods against all words registered in a high-speed matching language dictionary and calculating the likelihood of each word; candidate preliminary selection means for preliminarily selecting candidate words on the basis of the matching result of the high-speed matching means; and detailed matching means for performing, for the preliminarily selected candidate words, detailed matching between the detailed-matching likelihoods and the words registered in a detailed-matching language dictionary and calculating the likelihood of each candidate word.

[0017] With this configuration, the likelihood calculation means calculates the likelihood of each state for each frame, and the high-speed matching likelihood calculation means obtains the high-speed matching likelihoods on the basis of these detailed-matching likelihoods. The high-speed matching likelihood correction means then corrects the bias of the high-speed matching likelihoods toward the wrong side.

[0018] In this way, the bias of the likelihoods toward incorrect speech units, which arises when the high-speed matching likelihoods are expressed with a small number of states, is corrected by the high-speed matching likelihood correction means. Consequently, matching errors are reduced when high-speed matching is performed with the corrected high-speed matching likelihoods to preselect candidate words. As a result, the candidate words are accurately narrowed down to a small number, and the subsequent detailed matching by the detailed matching means is efficiently accelerated.

[0019] The speech recognition device of the first invention preferably further comprises thinning parameter calculation means for calculating a thinning parameter on the basis of the acoustic analysis result, with the high-speed matching likelihood calculation means configured to thin out the detailed-matching likelihoods in the time direction on the basis of the thinning parameter and then obtain the high-speed matching likelihoods from the remaining detailed-matching likelihoods.

[0020] With this configuration, the thinning of the detailed-matching likelihoods in the time direction by the high-speed matching likelihood calculation means is performed on the basis of the thinning parameter calculated by the thinning parameter calculation means. Therefore, by calculating the thinning parameter appropriately, sufficient speed-up can be achieved without losing momentary features, unlike thinning at fixed intervals in the time direction as in Reference 2.

[0021] In the speech recognition device of the first invention, it is preferable that the thinning parameter calculation means calculates the thinning parameter on the basis of the amount of change in the acoustic parameters obtained as the acoustic analysis result, and that the high-speed matching likelihood calculation means performs the thinning, on the basis of the thinning parameter, so that the amount of change in the acoustic parameters becomes substantially constant.

[0022] With this configuration, the thinning by the high-speed matching likelihood calculation means is performed so that the amount of change in the acoustic parameters becomes substantially constant. Consequently, more detailed-matching likelihoods remain after thinning in regions where the acoustic parameters change sharply, which prevents momentary features from being lost.
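One way to thin frames so that the acoustic change between kept frames is roughly constant is to accumulate a per-frame change measure and keep a frame each time the accumulated value reaches a threshold. The sketch below illustrates that idea only; the function name `select_frames`, the `threshold` constant, and the accumulate-and-reset rule are assumptions for this example, not details taken from the text.

```python
def select_frames(change, threshold):
    """Return the indices of the frames kept after thinning.

    change[t]: non-negative amount of acoustic change at frame t
               (for example, a thinning parameter computed per frame).
    threshold: accumulated change allowed between two kept frames
               (a hypothetical tuning constant).
    """
    kept = [0] if change else []
    accumulated = 0.0
    for t in range(1, len(change)):
        accumulated += change[t]
        if accumulated >= threshold:
            kept.append(t)       # enough has changed: keep this frame
            accumulated = 0.0    # and start accumulating again
    return kept

# Slow change, then a burst (such as a plosive), then slow change again.
change = [0.0, 0.125, 0.125, 0.125, 0.125, 2.0, 2.0, 0.125, 0.125, 0.125, 0.125]
kept = select_frames(change, threshold=0.5)
print(kept)
```

In the steady regions only one frame in four survives, while every frame of the burst is kept, which is the behavior that fixed-interval thinning cannot provide.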

[0023] In the speech recognition device of the first invention, it is preferable that the high-speed matching likelihood calculation means calculates the high-speed matching likelihoods by representing each speech unit, which is the constituent unit of the acoustic models, by a single representative likelihood.

[0024] With this configuration, the likelihoods for high-speed matching are expressed with the minimum number of states. High-speed matching using these likelihoods is therefore performed quickly.
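As a rough illustration of representing a multi-state speech unit by one representative likelihood, the sketch below collapses the per-state log likelihoods of a frame into one value per phoneme. Taking the maximum over the states is an assumption made for this example; the names `representative_likelihoods`, `state_loglik`, and `phoneme_states` are likewise hypothetical.

```python
def representative_likelihoods(state_loglik, phoneme_states):
    """Collapse per-state log likelihoods into one value per phoneme.

    state_loglik:   dict mapping state id -> log likelihood for one frame.
    phoneme_states: dict mapping phoneme -> list of its state ids.
    Assumption: the representative value is the maximum over the states.
    """
    return {
        phoneme: max(state_loglik[s] for s in states)
        for phoneme, states in phoneme_states.items()
    }

# One frame, two phonemes with three states each.
state_loglik = {"a1": -4.0, "a2": -1.5, "a3": -3.0,
                "k1": -9.0, "k2": -7.5, "k3": -8.0}
phoneme_states = {"a": ["a1", "a2", "a3"], "k": ["k1", "k2", "k3"]}
frame_repr = representative_likelihoods(state_loglik, phoneme_states)
```

With one state per phoneme, the factor S in the 2·V·P·S·T estimate drops to 1 for the high-speed pass.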

[0025] In the speech recognition device of the first invention, it is also preferable that the high-speed matching likelihood calculation means calculates the high-speed matching likelihoods by grouping the speech units, which are the constituent units of the acoustic models, into groups of easily confused speech units and representing each group by a single representative likelihood.

[0026] With this configuration, the likelihoods for high-speed matching are expressed per group of easily confused speech units. The likelihood of an incorrect speech unit therefore does not exceed that of the correct speech unit, and matching errors during high-speed matching are reduced. In addition, the grouping reduces the number of items to be matched during high-speed matching, so the high-speed matching is performed very quickly.
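The effect of grouping can be sketched as follows. The groups shown (voiced and unvoiced stop pairs) are invented for illustration, since the text does not list the actual groups, and using the maximum as the group's representative likelihood is again an assumption.

```python
# Hypothetical groups of easily confused phonemes (not taken from the text).
GROUPS = {"P": ["p", "b"], "T": ["t", "d"], "A": ["a"]}
PHONEME_TO_GROUP = {ph: g for g, members in GROUPS.items() for ph in members}

def group_likelihoods(phoneme_loglik):
    """One representative log likelihood per group (here: the maximum)."""
    return {g: max(phoneme_loglik[ph] for ph in members)
            for g, members in GROUPS.items()}

def to_group_sequence(phonemes):
    """Rewrite a phoneme sequence as a sequence of group labels."""
    return [PHONEME_TO_GROUP[ph] for ph in phonemes]

frame = {"p": -3.0, "b": -1.0, "t": -2.0, "d": -5.0, "a": -0.5}
grouped = group_likelihoods(frame)
seq_bat = to_group_sequence(["b", "a", "t"])
seq_pad = to_group_sequence(["p", "a", "d"])
```

Because "b" and "p" share a group, a frame in which "b" wrongly outscores "p" can no longer produce a mismatch in the high-speed pass, and words that differ only within a group collapse to the same group sequence, which shrinks the set of items to match.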

[0027] In this case, the correction by the high-speed matching likelihood correction means can be omitted.

[0028] In the speech recognition device of the first invention, it is preferable that the high-speed matching likelihood correction means corrects the high-speed matching likelihoods by correcting the representative likelihood of each speech unit or each group in consideration of the error patterns between the speech units or between the groups.

[0029] With this configuration, the representative likelihood of each speech unit or group is corrected in consideration of the error patterns between speech units or between groups that are known in advance, so the correction is performed quickly and accurately.

[0030] In the speech recognition device of the first invention, it is preferable that the high-speed matching means has an internal memory, that the high-speed matching language dictionary is stored in high-speed matching language dictionary storage means, and that, when executing the high-speed matching, the high-speed matching means loads into the internal memory the high-speed matching likelihoods stored in the high-speed matching likelihood storage means and the high-speed matching language dictionary stored in the high-speed matching language dictionary storage means.

[0031] With this configuration, the high-speed matching means loads the high-speed matching likelihoods and the high-speed matching language dictionary into the internal memory when executing the high-speed matching, so the high-speed matching is performed efficiently.

[0032] The speech recognition device of the first invention preferably comprises dictionary registration means that, on receiving a word, generates a state sequence for high-speed matching and a state sequence for detailed matching for that word, additionally registers the state sequence for high-speed matching in the high-speed matching language dictionary, and additionally registers the state sequence for detailed matching in the detailed-matching language dictionary.

[0033] With this configuration, simply inputting a new word to the dictionary registration means automatically adds the dictionary entry for that word to both the high-speed matching language dictionary and the detailed-matching language dictionary. New words can therefore always be made recognizable, and a high recognition rate is maintained.

[0034] In the speech recognition device of the first invention, it is preferable that, when generating the state sequence for high-speed matching, the dictionary registration means compresses consecutive identical speech units or consecutive identical speech unit groups into a single state.

[0035] With this configuration, consecutive identical speech units or consecutive identical speech unit groups are compressed into a single state. High-speed matching using the high-speed matching language dictionary is thereby accelerated.
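Compressing runs of identical units is a one-line operation with the standard library. The sketch below assumes the state sequence for high-speed matching is held as a list of unit or group labels; the function name `compress_states` is chosen for this example.

```python
from itertools import groupby

def compress_states(units):
    """Collapse each run of identical consecutive units into a single state."""
    return [unit for unit, _run in groupby(units)]

# Group labels for a hypothetical word; the repeated "A" runs become one state each.
compressed = compress_states(["A", "A", "K", "A", "A", "A"])
print(compressed)
```

Only adjacent repeats are merged, so a unit that recurs later in the word after a different unit keeps its own state.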

[0036] The speech recognition method of the second invention is characterized by comprising the steps of: acoustically analyzing input speech; calculating, on the basis of the acoustic analysis result and with reference to acoustic models, the likelihood of each state for each frame to obtain detailed-matching likelihoods; obtaining high-speed matching likelihoods on the basis of the detailed-matching likelihoods; correcting the bias of the high-speed matching likelihoods toward the wrong side; performing high-speed matching between the corrected high-speed matching likelihoods and all words registered in a high-speed matching language dictionary to calculate the likelihood of each word; preliminarily selecting candidate words on the basis of the high-speed matching result; and performing, for the preliminarily selected candidate words, detailed matching between the detailed-matching likelihoods and the words registered in a detailed-matching language dictionary to calculate the likelihood of each candidate word.

[0037] With this configuration, the likelihood of each state is calculated for each frame, and the high-speed matching likelihoods are obtained on the basis of the detailed-matching likelihoods. The bias of each likelihood toward incorrect speech units, which arises when the high-speed matching likelihoods are expressed with a small number of states, is then corrected. Consequently, matching errors are reduced when high-speed matching is performed with the corrected high-speed matching likelihoods to preselect candidate words. As a result, the candidate words are accurately narrowed down to a small number, and the subsequent detailed matching is efficiently accelerated.
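The overall control flow of the method, a cheap pass over the whole vocabulary followed by an expensive pass over a short list, can be sketched as below. The two scoring callables stand in for the high-speed and detailed matchers and are purely hypothetical, as are the function name `recognize` and the `num_candidates` parameter.

```python
def recognize(words, fast_score, detailed_score, num_candidates):
    """Two-pass matching: preselect with a cheap score, rescore with a costly one.

    fast_score, detailed_score: callables mapping a word to a log likelihood
    (hypothetical stand-ins for the high-speed and detailed matchers).
    """
    # Pass 1: score every word cheaply and keep only the top candidates.
    preselected = sorted(words, key=fast_score, reverse=True)[:num_candidates]
    # Pass 2: run the detailed matcher on the short list only.
    return sorted(preselected, key=detailed_score, reverse=True)

words = ["tokyo", "kyoto", "osaka", "nara"]
fast = {"tokyo": -10.0, "kyoto": -9.0, "osaka": -30.0, "nara": -25.0}
detailed = {"tokyo": -40.0, "kyoto": -55.0, "osaka": -20.0, "nara": -60.0}
ranked = recognize(words, fast.get, detailed.get, num_candidates=2)
print(ranked)
```

In this toy data "osaka" would have the best detailed score but never reaches the second pass, which shows why errors in the high-speed pass matter: a word dropped at preselection cannot be recovered by the detailed matcher.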

[0038] The program recording medium of the third invention is characterized in that it records a speech recognition processing program that causes a computer to function as the acoustic analysis means, the likelihood calculation means, the high-speed matching likelihood calculation means, the high-speed matching likelihood correction means, the high-speed matching means, the candidate preliminary selection means, and the detailed matching means of the first invention.

[0039] With this configuration, as in the first invention, the bias of each likelihood toward incorrect speech units, which arises when the high-speed matching likelihoods are expressed with a small number of states, is corrected. Consequently, matching errors are reduced when high-speed matching is performed with the corrected high-speed matching likelihoods to preselect candidate words. As a result, the candidate words are accurately narrowed down to a small number, and the subsequent detailed matching is efficiently accelerated.

[0040]

[Embodiments of the Invention] The present invention is described in detail below with reference to the illustrated embodiment. FIG. 1 is a block diagram of the speech recognition device of the present embodiment, which is a speech recognition device using HMMs. The speech recognition device of the present embodiment is described below with reference to FIG. 1. The input speech is assumed to have already been sampled and quantized. The following description assumes that the unit constituting the acoustic models is the phoneme (phoneme model), but the constituent unit may instead be the syllable (syllable model).

[0041] The acoustic analysis unit 11 takes in speech sample data at a fixed period, extracts acoustic parameters, and outputs them to the likelihood calculation unit 12, the speech section detection unit 13, and the thinning parameter calculation unit 14. The analysis period is typically about 5 ms to 20 ms, and the analysis window length is typically longer than the analysis period, 10 ms to 20 ms. As the analysis method, commonly used analysis parameters such as filter-bank band energies, the FFT (fast Fourier transform) cepstrum, and the LPC cepstrum based on linear predictive analysis are used in combination with the short-time speech energy and the temporal changes of these quantities.

[0042] Based on the acoustic parameters of the input frame and the output probability density distribution of each state constituting the acoustic models stored in the acoustic model storage unit 15, the likelihood calculation unit 12 obtains the likelihood of each state for each frame and stores it in the detailed-matching likelihood storage unit 16. For a continuous-density HMM, if the j-th M-dimensional output probability density distribution is represented by the mean vector $\mu_j = \{\mu_{j1}, \mu_{j2}, \dots, \mu_{ji}, \dots, \mu_{jM}\}$ and the variance $\sigma_j^2 = \{\sigma_{j1}^2, \sigma_{j2}^2, \dots, \sigma_{ji}^2, \dots, \sigma_{jM}^2\}$, and the input vector by $c = (c_1, c_2, \dots, c_i, \dots, c_M)$, the log likelihood $L_j$ is given by equation (5).
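Equation (5) itself does not survive in this text. Assuming the output density is an M-dimensional Gaussian with diagonal covariance, which is what the per-dimension variances suggest, the standard log likelihood would be

$$L_j = -\frac{1}{2}\sum_{i=1}^{M}\left[\log\left(2\pi\sigma_{ji}^{2}\right) + \frac{(c_i - \mu_{ji})^{2}}{\sigma_{ji}^{2}}\right]$$

The patent's exact expression may differ, for example by dropping constant terms. A minimal sketch of this assumed form, with a hypothetical function name:

```python
import math

def diag_gaussian_loglik(c, mean, var):
    """Log likelihood of vector c under a diagonal-covariance Gaussian.

    This is the textbook form, used here as an assumed stand-in for
    equation (5), which is not reproduced in the text.
    c, mean, var: sequences of equal length M (var holds the variances).
    """
    return -0.5 * sum(
        math.log(2.0 * math.pi * v) + (x - m) ** 2 / v
        for x, m, v in zip(c, mean, var)
    )

# At the mean of a 2-dimensional unit-variance Gaussian.
at_mean = diag_gaussian_loglik([0.0, 0.0], [0.0, 0.0], [1.0, 1.0])
# One standard deviation away in a single dimension.
one_sigma = diag_gaussian_loglik([1.0], [0.0], [1.0])
```

The cost is one multiply-add per dimension per distribution per frame, which is why these values are computed once and cached in the likelihood storage unit rather than recomputed for every word.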

[0043] When the likelihood P_k of state k is expressed as a mixture over a set D_k of several distributions, the component likelihoods should strictly be added in the linear domain and then converted to a log likelihood. To speed up processing, however, this may be replaced by an approximation that takes the maximum value. In this case, with λ_kj denoting the mixture weight of each distribution j, the likelihood is given by equation (6).
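The difference between the exact mixture likelihood and the maximum-value approximation can be illustrated with a minimal sketch. It assumes a diagonal-covariance Gaussian for each component and assumes that the mixture weight is added in the log domain before taking the maximum; equation (6) is not reproduced in this text, so the function names, the data layout, and the numbers are all hypothetical.

```python
import math

def gaussian_log_likelihood(c, mean, var):
    """Log likelihood of vector c under a diagonal-covariance Gaussian (assumed form)."""
    return -0.5 * sum(
        math.log(2.0 * math.pi * v) + (x - m) ** 2 / v
        for x, m, v in zip(c, mean, var)
    )

def state_log_likelihood_exact(c, mixture):
    """Add the weighted likelihoods in the linear domain, then take the log."""
    terms = [math.log(w) + gaussian_log_likelihood(c, m, v) for w, m, v in mixture]
    peak = max(terms)  # subtract the peak for numerical stability
    return peak + math.log(sum(math.exp(t - peak) for t in terms))

def state_log_likelihood_max(c, mixture):
    """Faster approximation: keep only the largest weighted component."""
    return max(math.log(w) + gaussian_log_likelihood(c, m, v) for w, m, v in mixture)

# Hypothetical two-component mixture: (weight, mean vector, variance vector).
mixture = [(0.6, [0.0, 0.0], [1.0, 1.0]), (0.4, [3.0, 3.0], [1.0, 1.0])]
c = [0.2, -0.1]
exact = state_log_likelihood_exact(c, mixture)
approx = state_log_likelihood_max(c, mixture)
```

The approximation never exceeds the exact value, and it falls short by at most the logarithm of the number of components, which is why it is usually acceptable when one component dominates.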

[0044] The voice section detection unit 13 detects the voice section from the acoustic analysis results of the acoustic analysis unit 11, mainly using a subset of the parameters such as the short-time speech energy. The thinning parameter calculation unit 14 calculates, for each frame, a thinning parameter that determines how the frames used for high-speed matching are thinned out, and stores the resulting thinning parameters in the thinning parameter storage unit 17. As an example, the thinning parameter B(t) at frame t is obtained by equation (7). Here, ΔC_ti is the change from the previous frame in the value of the i-th dimension of the acoustic parameter at frame t, and σ_i is the standard deviation of the i-th dimension of the acoustic parameter. The value of σ_i must be estimated from a large amount of data, and the data used to create the acoustic models can be used for this purpose. Alternatively, it may be obtained from the average of the variances of the output probability density distributions that make up the acoustic models.
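Equation (7) is not reproduced in this text. The sketch below therefore assumes one plausible form, the Euclidean norm of the per-dimension change divided by σ_i; the actual equation may combine the terms differently. The function name and the sample data are hypothetical.

```python
import math

def thinning_parameters(frames, sigma):
    """Return B(t) per frame: normalized distance from the previous frame (assumed form)."""
    b = [0.0]  # the first frame has no predecessor
    for prev, cur in zip(frames, frames[1:]):
        b.append(math.sqrt(sum(((x - p) / s) ** 2 for x, p, s in zip(cur, prev, sigma))))
    return b

# Two-dimensional acoustic parameters for four frames (made-up values).
frames = [[0.0, 0.0], [0.0, 0.0], [2.0, 0.0], [2.0, 3.0]]
sigma = [2.0, 3.0]  # per-dimension standard deviations
B = thinning_parameters(frames, sigma)  # [0.0, 0.0, 1.0, 1.0]
```

Dividing by σ_i puts dimensions with different dynamic ranges on a common scale, so that no single dimension dominates the measure of change.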

[0045] The high-speed matching likelihood calculation unit 18 reads, from the likelihood table stored in the detailed matching likelihood storage unit 16, the likelihood values that fall within the voice section detected by the voice section detection unit 13, and obtains the likelihoods for high-speed matching. The likelihood for high-speed matching is obtained by reading, from the detailed matching likelihoods, the likelihoods of all states of the same phoneme regardless of phoneme environment, that is, the likelihoods of all states belonging to phonemes that are the same phoneme but occur in different phoneme environments, and taking their maximum.
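As a minimal sketch of this reduction, the following takes the per-state log likelihoods of one frame and keeps, for each phoneme, the maximum over all of its environment-dependent states. The state naming scheme and the mapping table are hypothetical and serve only as illustration.

```python
def fast_phoneme_likelihoods(state_likelihoods, state_to_phoneme):
    """Collapse per-state log likelihoods of one frame to one value per phoneme."""
    best = {}
    for state, value in state_likelihoods.items():
        phoneme = state_to_phoneme[state]
        if phoneme not in best or value > best[phoneme]:
            best[phoneme] = value
    return best

# Hypothetical state names of the form "left-center+right/state_index".
state_to_phoneme = {"a-k+a/1": "k", "a-k+a/2": "k", "i-k+u/1": "k", "k-a+t/1": "a"}
frame = {"a-k+a/1": -12.0, "a-k+a/2": -9.5, "i-k+u/1": -10.0, "k-a+t/1": -4.0}
fast = fast_phoneme_likelihoods(frame, state_to_phoneme)  # {"k": -9.5, "a": -4.0}
```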

[0046] Before the likelihoods for high-speed matching are calculated, frames are thinned out so that the amount of change in the acoustic parameters becomes approximately constant, by integrating the thinning parameters stored in the thinning parameter storage unit 17. The likelihoods for high-speed matching therefore need to be calculated only for the small number of frames that remain after thinning, and the calculation can be performed quickly. As an example, experiments have shown that, with an analysis period of 10 ms, an average thinning rate of about 1/4 to 1/5 allows efficient matching with little loss of accuracy.
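The thinning step can be sketched as follows. The rule used here, keeping a frame each time the accumulated thinning parameter reaches a fixed step and then resetting the accumulator, is an assumption about how the integration is applied; the text does not specify the exact rule, and the threshold and sample values are made up.

```python
def select_frames(b, step):
    """Keep a frame each time the accumulated B(t) reaches `step` (assumed rule)."""
    kept = [0]          # always keep the first frame
    accumulated = 0.0
    for t in range(1, len(b)):
        accumulated += b[t]
        if accumulated >= step:
            kept.append(t)
            accumulated = 0.0
    return kept

# Made-up thinning parameters: a stationary stretch, a rapid change, another stationary stretch.
b = [0.0, 0.1, 0.1, 0.9, 1.2, 0.1, 0.1, 0.1, 0.8]
kept = select_frames(b, 1.0)  # [0, 3, 4, 8]
```

Frames are retained densely where the acoustic parameters change quickly (frames 3 and 4) and sparsely in stationary stretches, which is the behavior the paragraph describes.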

[0047] The high-speed matching likelihood correction unit 19 corrects the per-phoneme likelihoods calculated by the high-speed matching likelihood calculation unit 18 according to likelihood correction rules. For example, if, when a silent section is input, the likelihood L(/K/) of phoneme /K/ very often exceeds the likelihood L(/S/) of silence /S/, then applying equation (8)

$$L(/S/) = \max\{L(/S/),\ L(/K/)\} \qquad (8)$$

reduces the errors in which a vowel that begins after silence is mistaken for a "ka"-row sound beginning with phoneme /K/. The same approach can be applied to other errors, such as mistaking the phoneme /w/ of the sound "wa" for the vowel "u" (/u/) or "o" (/o/). By performing the correction with patterns of phoneme pairs that are known in advance to be easily confused, the correction can be carried out quickly and accurately.
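The correction of equation (8) amounts to a table-driven maximum. A minimal sketch follows, in which the rule list format and the function name are hypothetical; only the rule for /S/ and /K/ is taken from the text.

```python
def apply_correction_rules(likelihoods, rules):
    """Raise each target likelihood to at least that of its confusable rival."""
    corrected = dict(likelihoods)
    for target, rival in rules:
        corrected[target] = max(likelihoods[target], likelihoods[rival])
    return corrected

rules = [("S", "K")]  # equation (8): L(/S/) = max{L(/S/), L(/K/)}
frame = {"S": -7.0, "K": -5.0, "a": -9.0}  # made-up log likelihoods for one frame
corrected = apply_correction_rules(frame, rules)  # {"S": -5.0, "K": -5.0, "a": -9.0}
```

Further confusable pairs would be handled by appending entries to `rules`, leaving the correction loop unchanged.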

[0048] The high-speed matching likelihood storage unit 20 stores the phoneme likelihoods corrected by the high-speed matching likelihood correction unit 19. The high-speed matching language dictionary 21 stores each word of the recognition vocabulary in association with a representation of that word as a state sequence in which one phoneme corresponds to one state. The high-speed matching unit 22 matches the thinned input against each word in the high-speed matching language dictionary 21 by the Viterbi method. The local likelihood of each input frame is obtained by referring to the high-speed matching likelihood storage unit 20.
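A minimal sketch of Viterbi matching against a one-state-per-phoneme word model is given below. It assumes a left-to-right topology in which each frame either stays in the current state or advances to the next, and it ignores transition probabilities; neither detail is specified in the text, and the function name and data are hypothetical.

```python
def viterbi_word_score(word_states, frame_likelihoods):
    """Best-path log likelihood of a left-to-right state sequence over the frames."""
    neg_inf = float("-inf")
    n = len(word_states)
    score = [neg_inf] * n
    score[0] = frame_likelihoods[0][word_states[0]]  # the path must start in state 0
    for frame in frame_likelihoods[1:]:
        new = [neg_inf] * n
        for s in range(n):
            stay = score[s]
            advance = score[s - 1] if s > 0 else neg_inf
            best = max(stay, advance)
            if best > neg_inf:
                new[s] = best + frame[word_states[s]]
        score = new
    return score[-1]  # the path must end in the last state

# Made-up per-phoneme log likelihoods for three thinned frames.
frames = [
    {"k": -1.0, "a": -5.0, "s": -6.0},
    {"k": -4.0, "a": -1.0, "s": -5.0},
    {"k": -6.0, "a": -1.0, "s": -4.0},
]
score_ka = viterbi_word_score(["k", "a"], frames)  # -3.0
score_sa = viterbi_word_score(["s", "a"], frames)  # -8.0
```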

[0049] The candidate preliminary selection unit 23 selects H words in descending order of likelihood, based on the results of the Viterbi matching performed for each word by the high-speed matching unit 22. The value of H depends on the vocabulary size and is set to about 1/5 to 1/20 of the vocabulary size. The detailed matching language dictionary 24 stores each word of the recognition vocabulary in association with a representation of that word formed by connecting in series the state sequences of the environment-dependent phoneme models that serve as the acoustic models.

[0050] For the H words selected by the candidate preliminary selection unit 23, the detailed matching unit 25 matches the state sequences stored in the detailed matching language dictionary 24 against all of the input frames by the Viterbi method and recalculates the likelihoods of the H words. The local likelihood of each state in each input frame is obtained by referring to the values stored in the detailed matching likelihood storage unit 16. The H preselected candidate words are then re-sorted in descending order of the recalculated likelihood, and the top candidates are output.

[0051] For the likelihood calculation for high-speed matching and for the high-speed matching itself, instead of using the phonemes of the detailed matching phoneme models as they are, as described above, it is also possible to use separate phoneme classes. In that case it is effective to adjust the phoneme classes to the error characteristics of the acoustic models: for example, easily confused phonemes such as /u/, /o/, and /w/ are placed in the same class, while phonemes that are rarely confused, such as the /k/ in "ka, ku, ke, ko" and the /k/ in "ki", are treated as separate phonemes. The high-speed matching language dictionary 21 must then be written in terms of the phoneme classes for high-speed matching. When easily confused phonemes are placed in the same class in this way, the correction of the high-speed matching likelihoods by the high-speed matching likelihood correction unit 19 may be omitted.

[0052] As in Reference 3 above, the high-speed matching language dictionary 21 may also be organized as a tree structure in which identical phonemes are shared from the beginning of each word, in order to make high-speed matching more efficient. However, when the vocabulary is only about several hundred words, the benefit of sharing is small and the processing becomes more complex, so little speedup is obtained. The method of shortening the long vowels in the vocabulary to short vowels can slightly reduce the amount of computation even when the vocabulary is as small as several hundred words. When phoneme classes are used for the likelihood calculation for high-speed matching and for the high-speed matching, a slight speedup is also obtained by compressing into a single state any portion that is a different phoneme sequence in the original word but becomes a run of the same phoneme class when expressed in phoneme classes.
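The compression of runs of the same phoneme class into one state can be sketched as follows. The class table, in which /u/, /o/, and /w/ share a class, follows the example given earlier; the class label "U" and the sample phoneme sequence are hypothetical.

```python
from itertools import groupby

def to_fast_states(phonemes, phoneme_class):
    """Map phonemes to classes and merge consecutive repeats into one state."""
    classes = [phoneme_class.get(p, p) for p in phonemes]
    return [key for key, _ in groupby(classes)]

phoneme_class = {"u": "U", "o": "U", "w": "U"}  # hypothetical class label "U"
states = to_fast_states(["k", "o", "u", "s", "o", "k", "u"], phoneme_class)
# ["k", "U", "s", "U", "k", "U"]: the run "o", "u" collapses to a single "U" state
```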

[0053] When the voice recognition device is implemented on a DSP (digital signal processor), a general-purpose processor, or the like, it is important for speed to use the internal memory efficiently and to reduce accesses to external memory. One way to achieve this in the voice recognition device of this embodiment is to load the high-speed matching language dictionary 21 and the likelihood table for high-speed matching (the contents of the high-speed matching likelihood storage unit 20), which are needed only during high-speed matching, into the internal RAM (random access memory) of the processor so that high-speed matching runs efficiently.

[0054] Specifically, because the likelihood table for detailed matching (the contents of the detailed matching likelihood storage unit 16) generally requires a large capacity, the detailed matching likelihood storage unit 16 is placed in external memory. The likelihood calculation unit 12 calculates the likelihoods for each frame in synchronization with the voice input and stores them in the detailed matching likelihood storage unit 16 in the external memory. When the voice section detection unit 13 has cut out a voice section, voice input is stopped, the high-speed matching likelihood calculation unit 18 calculates the likelihoods for high-speed matching, and the high-speed matching likelihood correction unit 19 corrects them. The corrected likelihoods are stored in the high-speed matching likelihood storage unit 20 in the internal RAM of the processor. At the same time, the high-speed matching language dictionary 21 is loaded into the internal RAM. The high-speed matching unit 22 then performs high-speed matching using the likelihood table for high-speed matching and the high-speed matching language dictionary 21 in the internal RAM, after which the internal RAM is released. Next, the likelihood table for detailed matching and the detailed matching language dictionary 24 are loaded from the external memory into the internal RAM, and the detailed matching unit 25 performs detailed matching only on the candidates that remain after selection by the candidate preliminary selection unit 23.

[0055] When the user registers a new word or the like in the dictionary, the user inputs the word to the dictionary registration unit 26, which creates a phoneme state sequence for detailed matching and a state sequence for high-speed matching. The former is added to the detailed matching language dictionary 24, and the latter is added to the high-speed matching language dictionary 21. Because new words are automatically added to both the high-speed matching language dictionary 21 and the detailed matching language dictionary 24 in this way, new words can always be recognized and a high recognition rate can be maintained.

[0056] The algorithm of the voice recognition processing performed by the voice recognition device is described below with reference to the flowchart of FIG. 2. In step S1, the acoustic analysis unit 11 performs acoustic analysis on the input voice. Based on the analysis results, the thinning parameter calculation unit 14 calculates the thinning parameters and stores them in the thinning parameter storage unit 17. In step S2, the likelihood calculation unit 12 calculates the likelihood of each state for each frame based on the analysis results of the acoustic analysis unit 11 and stores the likelihoods in the detailed matching likelihood storage unit 16 in the external memory. In step S3, the voice section detection unit 13 detects the voice section based on the analysis results of the acoustic analysis unit 11 and outputs a detection signal.

[0057] In step S4, the high-speed matching likelihood calculation unit 18 determines, based on the detection signal, whether a voice section has been detected. If so, processing proceeds to step S5; otherwise, processing returns to step S1 and waits for detection. In step S5, the high-speed matching likelihood calculation unit 18 uses the likelihood values for the voice section stored in the detailed matching likelihood storage unit 16, thins the input based on the thinning parameters, and then calculates likelihoods for high-speed matching that do not take the phoneme environment into account. The high-speed matching likelihood correction unit 19 then corrects, for each phoneme, the bias of the likelihood toward the erroneous phoneme. The likelihoods for high-speed matching obtained in this way are stored in the high-speed matching likelihood storage unit 20.

[0058] In step S6, the high-speed matching unit 22 performs the high-speed matching described above and obtains the likelihood of each word registered in the high-speed matching language dictionary 21. In step S7, it is determined whether high-speed matching has been completed for all words. If so, processing proceeds to step S8; if not, processing returns to step S6 and high-speed matching continues. In step S8, the candidate preliminary selection unit 23 selects the top H words as candidates in descending order of likelihood.

[0059] In step S9, the detailed matching unit 25 performs detailed matching on the preselected candidate words using the detailed matching language dictionary 24 and recalculates accurate likelihoods. In step S10, it is determined whether detailed matching has been completed for all preselected candidate words. If so, processing proceeds to step S11; if not, processing returns to step S9 and detailed matching continues. In step S11, the candidate words are re-sorted in descending order of the accurate likelihoods, and the top candidates are output. The voice recognition processing then ends.

[0060] Next, the voice recognition processing is described concretely, taking as an example the case where the voice recognition device is applied to a system that recognizes 300 personal names in a telephone directory. In this case, the high-speed matching language dictionary 21 stores each of the 300 names in association with a representation of that word as a state sequence in which one phoneme corresponds to one state. The detailed matching language dictionary 24 stores each of the 300 names in association with a representation of that word formed by connecting in series the state sequences of the environment-dependent phoneme models.

[0061] So that the silent interval caused by a geminate consonant (sokuon) is not mistaken for the silence that marks the end of the voice section, the voice section detection unit 13 normally judges that the voice section has ended only when silence has continued for, for example, about 0.3 seconds after the utterance ends. Therefore, when the user utters "Sato" (佐藤), for example, steps S1 to S4 in FIG. 2 are repeated until 0.3 seconds have elapsed after the end of the utterance, and acoustic analysis, thinning parameter calculation, and likelihood calculation are performed for the voice "Sato".
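The end-of-speech rule described here can be sketched as a hangover counter over per-frame speech and silence decisions. The 0.3 second hangover is taken from the text; the 10 ms frame period, the function name, and the boolean input format are assumptions made for illustration.

```python
def find_speech_end(is_speech, frame_period_s=0.01, hangover_s=0.3):
    """Return the index of the last speech frame once enough trailing silence is seen."""
    needed = round(hangover_s / frame_period_s)  # silent frames required to confirm the end
    last_speech = None
    silence_run = 0
    for t, speech in enumerate(is_speech):
        if speech:
            last_speech = t
            silence_run = 0
        elif last_speech is not None:
            silence_run += 1
            if silence_run >= needed:
                return last_speech
    return None  # end of speech not confirmed yet

# Speech, a short pause such as a geminate consonant (0.1 s), more speech, then 0.3 s of silence.
decisions = [True] * 5 + [False] * 10 + [True] * 5 + [False] * 30
end = find_speech_end(decisions)  # 19: the short pause does not end the section
```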

[0062] When silence has continued for 0.3 seconds and the voice section has been detected, the likelihoods calculated above for the extracted voice section of "Sato" are copied to the high-speed matching likelihood storage unit 20 while being thinned in both the state direction and the time direction, creating the likelihood table for high-speed matching (step S5). After this likelihood table has been corrected, high-speed matching is performed between the corrected likelihood table and the 300 words registered in the high-speed matching language dictionary 21 (step S6). The results are sorted in descending order of likelihood, and the top 20 words are preselected as candidates (step S8).

[0063] Suppose that, as a result, the ranking for the input voice "Sato" is "Kato", "Sato", "Saito", "Goto", and so on. For these 20 candidate words, detailed matching is performed using the likelihood table for detailed matching and the detailed matching language dictionary 24, and the likelihoods are recalculated and the candidates re-sorted (step S9).

[0064] In this way, the Viterbi matching over the full vocabulary of 300 words is performed on the input frames that remain after thinning, using the simplified high-speed matching language dictionary 21 in which each phoneme is limited to one state. The far more expensive Viterbi matching over the serially connected state sequences of the environment-dependent phoneme models is limited to the 20 preselected candidate words. This both speeds up the recognition processing and improves the recognition rate.

[0065] Suppose that, as a result of the detailed matching and re-sorting of the candidate words described above, the order of the candidate words becomes "Sato", "Kato", "Saito". The candidate words are then output in this order (step S11).
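The two-pass flow, preselecting the top H words with the fast score and then re-ranking only those with the detailed score, can be sketched as below. The scores are invented purely to reproduce the ordering in the example above; in the device they would come from the two Viterbi matching stages. The function name and its parameters are hypothetical.

```python
def recognize(words, fast_score, detailed_score, h):
    """Preselect the top-h words by the fast score, then re-rank them by the detailed score."""
    preselected = sorted(words, key=fast_score, reverse=True)[:h]
    return sorted(preselected, key=detailed_score, reverse=True)

# Invented log likelihoods for a five-word vocabulary.
fast = {"kato": -10.0, "sato": -11.0, "saito": -12.0, "goto": -13.0, "abe": -30.0}
detailed = {"kato": -52.0, "sato": -40.0, "saito": -55.0, "goto": -60.0, "abe": -90.0}
ranking = recognize(list(fast), fast.get, detailed.get, h=3)  # ["sato", "kato", "saito"]
```

The detailed score is evaluated only for the h preselected words, so the expensive pass scales with h rather than with the vocabulary size.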

[0066] As described above, in this embodiment, the likelihoods of each state in each frame, calculated from the acoustic parameters of the input voice, are first thinned by the high-speed matching likelihood calculation unit 18 based on the thinning parameters. Then, among all states belonging to the same phoneme in its various phoneme environments, the one state with the maximum likelihood and its likelihood are obtained (that is, the models are converted to environment-independent phoneme models with one state per phoneme), and the likelihood table for high-speed matching is generated. The high-speed matching likelihood correction unit 19 then corrects the bias of the likelihoods in this table toward erroneous phonemes according to the likelihood correction rules.

[0067] The bias of the likelihoods toward erroneous phonemes that arises from using one state per phoneme when generating the likelihood table for high-speed matching can therefore be corrected accurately. As a result, matching errors can be eliminated when candidate words are preselected by high-speed matching with this likelihood table. The candidate words can thus be narrowed down accurately to a small number, and the detailed matching with the environment-dependent phoneme models that is performed afterwards by the detailed matching unit 25 can be sped up.

[0068] Furthermore, in this embodiment, the high-speed matching likelihood calculation unit 18 thins the input so that the amount of change in the acoustic parameters becomes approximately constant, according to the integrated value of the thinning parameters that the thinning parameter calculation unit 14 calculates from the changes in the acoustic parameters normalized by their standard deviations. Unlike thinning at fixed intervals in the time direction, this does not lose subjectively important phoneme features such as plosives in rapidly spoken speech, and it yields likelihoods for high-speed matching that represent the features of the input voice well.

[0069] As described above, in this embodiment, eliminating matching errors during high-speed matching allows the candidate words to be narrowed down accurately to a small number, which in turn speeds up detailed matching. Specifically, suppose that the high-speed matching likelihood calculation unit 18 compresses the likelihood table for detailed matching to about 1/4 in the state direction and about 1/5 in the time direction, and that word candidates amounting to about 1/20 of the vocabulary registered in the high-speed matching language dictionary 21 are preselected. Compared with the voice recognition device shown in FIG. 3, voice recognition can then be performed with high-speed matching that takes about 1/20 of the time and detailed matching that takes about 1/20 of the time, so that matching as a whole is about 10 times faster. In addition, because the internal memory can be used effectively, a further speedup is possible.
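The tenfold figure follows from simple arithmetic if the matching cost is assumed to be proportional to the number of states, frames, and words processed. Under that assumption, with T_0 denoting the matching time of the device in FIG. 3 (a symbol introduced here for illustration):

```latex
\frac{T_{\text{fast}}}{T_0} \approx \frac{1}{4} \cdot \frac{1}{5} = \frac{1}{20}, \qquad
\frac{T_{\text{detail}}}{T_0} \approx \frac{1}{20}, \qquad
\frac{T_{\text{fast}} + T_{\text{detail}}}{T_0} \approx \frac{1}{20} + \frac{1}{20} = \frac{1}{10}
```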

[0070] The high-speed matching likelihood calculation unit 18 can also calculate the likelihoods for high-speed matching by grouping the phonemes that are the constituent units of the acoustic models into groups of easily confused phonemes and representing each phoneme group by a single representative likelihood. In this case, because easily confused phonemes form a single phoneme group, matching errors during high-speed matching can be almost eliminated, and the correction by the high-speed matching likelihood correction unit 19 can be omitted. In addition, because the number of matching targets decreases, high-speed matching itself becomes faster.

[0071] Furthermore, when the voice recognition device of this embodiment is implemented on a DSP or a general-purpose processor, the likelihood table for high-speed matching, which is derived from the detailed matching likelihood storage unit 16 placed in external memory, and the high-speed matching language dictionary 21 are loaded into the internal memory for high-speed matching. High-speed matching can therefore be processed efficiently and quickly.

[0072] This embodiment also includes the dictionary registration unit 26. When a new word that is not registered in the high-speed matching language dictionary 21 and the detailed matching language dictionary 24 is input, the unit generates a state sequence for high-speed matching and a phoneme state sequence for detailed matching for the input word. The generated state sequence for high-speed matching is added to the high-speed matching language dictionary 21, and the phoneme state sequence for detailed matching is added to the detailed matching language dictionary 24. The dictionary information for matching a new word is thus obtained automatically and added to both dictionaries. New words can therefore always be recognized, and a high recognition rate can be maintained.

[0073] At that time, if identical phonemes occur consecutively, they are compressed into one state. Likewise, if a portion that is a different phoneme sequence in the original word becomes a run of the same phoneme group when viewed in terms of phoneme groups, that run is compressed into one state. This speeds up both the high-speed matching using the high-speed matching language dictionary 21 and the detailed matching using the detailed matching language dictionary 24.

[0074] The functions of the acoustic analysis means, likelihood calculation means, voice section detection means, high-speed matching likelihood calculation means, high-speed matching likelihood correction means, high-speed matching means, candidate preliminary selection means, detailed matching means, thinning parameter calculation means, and dictionary registration means in the above embodiment are realized by a voice recognition processing program recorded on a program recording medium. The program recording medium in the above embodiment is a program medium consisting of a ROM (read-only memory). Alternatively, it may be a program medium that is loaded into an external auxiliary storage device and read. In either case, the program reading means that reads the voice recognition processing program from the program medium may be configured to access the program medium directly and read it, or to download the program to a program storage area (not shown) provided in RAM and read it by accessing that area. The download program for downloading from the program medium to the program storage area in RAM is assumed to be stored in the main unit in advance.

[0075] Here, the program medium is a medium that is separable from the main unit and holds the program in a fixed manner. It includes tape media such as magnetic tape and cassette tape; disk media, namely magnetic disks such as floppy (registered trademark) disks and hard disks, and optical disks such as CD (compact disc)-ROM, MO (magneto-optical) disks, MD (MiniDisc), and DVD (digital video disc); card media such as IC (integrated circuit) cards and optical cards; and semiconductor memory such as mask ROM, EPROM (ultraviolet-erasable ROM), EEPROM (electrically erasable ROM), and flash ROM.

[0076] If the speech recognition device in each of the above embodiments has a modem and can be connected to a communication network including the Internet, the program medium may instead be a medium that holds the program in a fluid manner, for example by download from the communication network. In that case, the download program for downloading from the communication network is assumed to be stored in the main unit in advance or to be installed from another recording medium.

[0077] What is recorded on the recording medium is not limited to programs; data can also be recorded.

【００７８】[0078]

[Effects of the Invention] As is clear from the above, the speech recognition device of the first invention is a speech recognition device using HMMs in which the high-speed matching likelihood calculation means obtains a high-speed matching likelihood from the detailed matching likelihood obtained by the likelihood calculation means, and the high-speed matching likelihood correction means corrects the bias of that high-speed matching likelihood toward the wrong side. The device can therefore correct the bias of the likelihood toward incorrect speech units that arises when the high-speed matching likelihood is expressed with a small number of states. Consequently, it can correct the bias toward incorrect phonemes that arises, as in Reference 3, when an environment-dependent phoneme model consisting of multiple states is converted into a one-state environment-independent phoneme model, because the regions that the phonemes occupy in parameter space overlap.

[0079] Matching errors during high-speed matching by the high-speed matching means can therefore be reduced, and as a result the preliminary selection by the candidate preliminary selection means can accurately narrow the candidate words down to a small number. In other words, according to this invention, the speed-up of detailed matching through preliminary selection of candidate words can be achieved more efficiently.
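The relationship between high-speed matching, preliminary selection, and detailed matching can be sketched as a two-pass search. This is an illustration under stated assumptions, not the patent's implementation: the function `recognize`, selection by rank with `n_candidates`, and the score tables are all hypothetical stand-ins for the two matchers.

```python
def recognize(words, fast_score, detailed_score, n_candidates):
    """Two-pass search: cheap scoring of every word, costly scoring of a shortlist."""
    # Pass 1: score every word in the dictionary with the cheap scorer.
    ranked = sorted(words, key=fast_score, reverse=True)
    # Preliminary selection: keep the n best words (selection by rank is an assumption).
    shortlist = ranked[:n_candidates]
    # Pass 2: rescore only the shortlist with the detailed scorer.
    best = max(shortlist, key=detailed_score)
    return best, shortlist

# Made-up log-likelihood tables standing in for the two matchers.
fast = {"tokyo": -10.0, "kyoto": -11.0, "osaka": -30.0, "kobe": -28.0}
detailed = {"tokyo": -52.0, "kyoto": -47.0, "osaka": -90.0, "kobe": -85.0}

best, shortlist = recognize(fast.keys(), fast.get, detailed.get, n_candidates=2)
print(shortlist)  # ['tokyo', 'kyoto']
print(best)       # kyoto
```

The sketch also shows why errors in the first pass matter: if the cheap scores had ranked the correct word below the cutoff, the detailed pass could not recover it, so a smaller shortlist is only safe when the first pass is reliable.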

[0080] In the speech recognition device of the first invention, the thinning-out parameter calculation means may calculate a thinning-out parameter from the acoustic analysis result, and the high-speed matching likelihood calculation means may be configured to obtain the high-speed matching likelihood after thinning out the detailed matching likelihood in the time direction based on that parameter. By calculating the thinning-out parameter appropriately, the device avoids the loss of momentary features that occurs when frames are thinned out at fixed intervals in the time direction, as in Reference 2, while still achieving a sufficient speed-up.

[0081] In the speech recognition device of the first invention, the thinning-out parameter calculation means may calculate the thinning-out parameter from the amount of change in the acoustic parameters obtained as the acoustic analysis result, and the high-speed matching likelihood calculation means may perform the thinning-out process based on the thinning-out parameter so that the amount of change in the acoustic parameters becomes approximately constant. In that case, more of the detailed matching likelihoods are retained after thinning in regions where the acoustic parameters change rapidly. The momentary features of the input speech can therefore be captured accurately.
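One way to thin frames so that the change in the acoustic parameters between kept frames is roughly constant is to accumulate the frame-to-frame change and keep a frame whenever the accumulated change reaches a threshold. The sketch below assumes Euclidean distance as the change measure and a fixed `step` threshold; both choices, and the name `select_frames`, are illustrative assumptions rather than details from the patent.

```python
import math

def select_frames(features, step):
    """Return indices of frames to keep so that roughly `step` of
    accumulated parameter change separates consecutive kept frames."""
    kept = [0]                      # always keep the first frame
    accumulated = 0.0
    for t in range(1, len(features)):
        # Euclidean distance between adjacent frames as the change amount (assumption)
        accumulated += math.dist(features[t - 1], features[t])
        if accumulated >= step:
            kept.append(t)
            accumulated = 0.0
    return kept

# Toy 1-D "acoustic parameters": flat, then a rapid transition, then flat.
frames = [(0.0,), (0.1,), (0.2,), (1.2,), (2.2,), (3.2,), (3.3,), (3.4,)]
print(select_frames(frames, step=0.9))  # [0, 3, 4, 5]
```

In the toy data, every frame in the rapid transition is kept while the nearly stationary frames before and after it are dropped, which matches the behavior described for regions where the acoustic parameters change sharply.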

[0082] In the speech recognition device of the first invention, if the high-speed matching likelihood calculation means calculates the high-speed matching likelihood by representing each speech unit, which is a constituent unit of the acoustic model, by a single representative likelihood, the likelihood for high-speed matching can be expressed with the minimum number of states. High-speed matching using this likelihood can therefore be performed quickly.

[0083] In the speech recognition device of the first invention, if the high-speed matching likelihood calculation means calculates the high-speed matching likelihood by grouping the speech units, which are the constituent units of the acoustic model, into groups of mutually confusable speech units and representing each group by a single representative likelihood, the likelihood of an incorrect speech unit is prevented from exceeding the likelihood of the correct one. That is, according to this invention, matching errors during high-speed matching can be reduced. Furthermore, the grouping reduces the number of items to be matched during high-speed matching, so the high-speed matching can be performed very quickly.
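Reducing the per-state likelihoods of a frame to one representative likelihood per speech unit, or per group of confusable units, can be sketched as follows. Taking the maximum as the representative value is an assumption made for illustration, and the function name, state names, and numbers are hypothetical.

```python
def representative_likelihoods(state_loglik, members):
    """Reduce per-state log-likelihoods of one frame to one value per label.

    state_loglik: dict state name -> log-likelihood for the current frame.
    members: dict label -> list of state names; a label may stand for a single
             speech unit or for a group of confusable units.
    The maximum is used as the representative value (an assumption).
    """
    return {label: max(state_loglik[s] for s in states)
            for label, states in members.items()}

# Made-up frame: three states each for /b/ and /p/, two for /a/.
frame = {"b1": -4.0, "b2": -6.5, "b3": -9.0,
         "p1": -3.5, "p2": -7.0, "p3": -8.0,
         "a1": -12.0, "a2": -15.0}

per_unit = {"b": ["b1", "b2", "b3"], "p": ["p1", "p2", "p3"], "a": ["a1", "a2"]}
per_group = {"B/P": per_unit["b"] + per_unit["p"], "a": per_unit["a"]}

print(representative_likelihoods(frame, per_unit))   # {'b': -4.0, 'p': -3.5, 'a': -12.0}
print(representative_likelihoods(frame, per_group))  # {'B/P': -3.5, 'a': -12.0}
```

With per-unit labels, /p/ narrowly outscores /b/ in this frame; once the two are placed in one assumed group they no longer compete, and the number of values to match drops from three to two.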

[0084] In this case, the correction process by the high-speed matching likelihood correction means can be omitted.

[0085] In the speech recognition device of the first invention, if the high-speed matching likelihood correction means corrects the high-speed matching likelihood in consideration of the error patterns between speech units or between groups, the representative likelihood of each speech unit or group can be corrected using error patterns that are known in advance, so the correction can be performed quickly and accurately.

[0086] In the speech recognition device of the first invention, if the high-speed matching likelihood stored in the high-speed matching likelihood storage means and the high-speed matching language dictionary stored in the high-speed matching language dictionary storage means are loaded into the internal memory of the high-speed matching means when the high-speed matching is performed, the high-speed matching process can be carried out efficiently.

[0087] In the speech recognition device of the first invention, if the dictionary registration means generates a state sequence for high-speed matching and a state sequence for detailed matching for an input word, and additionally registers the former in the high-speed matching language dictionary and the latter in the detailed matching language dictionary, a dictionary entry for the word is automatically added to both dictionaries. New words can therefore always be recognized, and a high recognition rate can be maintained.

[0088] In the speech recognition device of the first invention, if the dictionary registration means generates the state sequence for high-speed matching by compressing consecutive identical speech units or consecutive identical speech unit groups into one state, high-speed matching using the high-speed matching language dictionary can be made faster.

[0089] The speech recognition method of the second invention is a speech recognition method using HMMs in which a high-speed matching likelihood is obtained from the detailed matching likelihood and its bias toward the wrong side is corrected. The method can therefore correct the bias of each likelihood toward incorrect speech units that arises when the high-speed matching likelihood is expressed with a small number of states. Matching errors during high-speed matching can thus be reduced, and as a result the preliminary selection of candidate words can accurately narrow the candidates down to a small number.

[0090] In other words, according to this invention, the speed-up of detailed matching through preliminary selection of candidate words can be achieved more efficiently.

[0091] The program recording medium of the third invention stores a speech recognition processing program that causes a computer to function as the acoustic analysis means, likelihood calculation means, high-speed matching likelihood calculation means, high-speed matching likelihood correction means, high-speed matching means, candidate preliminary selection means, and detailed matching means of the first invention. As with the first invention, when speech recognition using HMMs is performed, the bias of the high-speed matching likelihood toward the wrong side can be corrected, and with it the bias of each likelihood toward incorrect speech units that arises when the high-speed matching likelihood is expressed with a small number of states. Matching errors during high-speed matching can thus be reduced, and as a result the preliminary selection of candidate words can accurately narrow the candidates down to a small number.

[0092] In other words, according to this invention, the speed-up of detailed matching through preliminary selection of candidate words can be achieved more efficiently.

[Brief description of drawings]

FIG. 1 is a block diagram of the speech recognition device of this invention.

FIG. 2 is a flowchart of the speech recognition processing operation performed by the speech recognition device shown in FIG. 1.

FIG. 3 is a block diagram of a conventional speech recognition device using HMMs.

[Explanation of symbols]

11 ... acoustic analysis unit, 12 ... likelihood calculation unit, 13 ... speech section detection unit, 14 ... thinning-out parameter calculation unit, 15 ... acoustic model storage unit, 16 ... detailed matching likelihood storage unit, 17 ... thinning-out parameter storage unit, 18 ... high-speed matching likelihood calculation unit, 19 ... high-speed matching likelihood correction unit, 20 ... high-speed matching likelihood storage unit, 21 ... high-speed matching language dictionary, 22 ... high-speed matching unit, 23 ... candidate preliminary selection unit, 24 ... detailed matching language dictionary, 25 ... detailed matching unit, 26 ... dictionary registration unit.

Continuation of front page

(51) Int.Cl.⁷, identification code, FI: G10L 15/14; G10L 3/00 521C; 15/18

(56) References:
JP-A-62-220996 (JP, A)
JP-A-6-348299 (JP, A)
JP-A-9-34486 (JP, A)
JP-A-3-116100 (JP, A)
JP-A-8-123470 (JP, A)
JP-A-6-266393 (JP, A)
JP-A-59-60499 (JP, A)
JP-A-6-266396 (JP, A)
Yamaguchi and 8 others, "Compact word speech recognition, text-to-speech synthesis," Sharp Technical Report, Japan, August 10, 2000, No. 77, pp. 26-32

(58) Fields searched (Int.Cl.⁷, DB name): G10L 11/00; G10L 15/00 - 15/28; JICST file (JOIS)

Claims

(57) [Claims]

1. A speech recognition device comprising: acoustic analysis means for acoustically analyzing input speech; likelihood calculation means for calculating, based on the acoustic analysis result, the likelihood of each state for each frame with reference to an acoustic model stored in acoustic model storage means, and storing the calculation result in detailed matching likelihood storage means as a detailed matching likelihood; high-speed matching likelihood calculation means for obtaining a high-speed matching likelihood based on the detailed matching likelihood; high-speed matching likelihood correction means for correcting the bias of the high-speed matching likelihood toward the wrong side and storing the result in high-speed matching likelihood storage means; high-speed matching means for performing high-speed matching between the corrected high-speed matching likelihood and all the words registered in a high-speed matching language dictionary to calculate the likelihood of each word; candidate preliminary selection means for preliminarily selecting candidate words based on the result of the matching by the high-speed matching means; and detailed matching means for performing, for the preliminarily selected candidate words, detailed matching between the detailed matching likelihood and the words registered in a detailed matching language dictionary to calculate the likelihood of each candidate word.

2. The speech recognition device according to claim 1, further comprising thinning-out parameter calculation means for calculating a thinning-out parameter based on the acoustic analysis result, wherein the high-speed matching likelihood calculation means thins out the detailed matching likelihood in the time direction based on the thinning-out parameter and then calculates the high-speed matching likelihood based on the remaining detailed matching likelihood.

3. The speech recognition device according to claim 2, wherein the thinning-out parameter calculation means calculates the thinning-out parameter based on the amount of change in the acoustic parameters obtained as the acoustic analysis result, and the high-speed matching likelihood calculation means performs the thinning-out process based on the thinning-out parameter so that the amount of change in the acoustic parameters becomes approximately constant.

4. The speech recognition device according to claim 1, wherein the high-speed matching likelihood calculation means calculates the high-speed matching likelihood by representing each speech unit, which is a constituent unit of the acoustic model, by a single representative likelihood.

5. The speech recognition device according to claim 1, wherein the high-speed matching likelihood calculation means calculates the high-speed matching likelihood by grouping the speech units, which are constituent units of the acoustic model, into groups of mutually confusable speech units and representing each group by a single representative likelihood.

6. The speech recognition device according to claim 5, wherein the correction process by the high-speed matching likelihood correction means is omitted.

7. The speech recognition device according to claim 4 or 5, wherein the high-speed matching likelihood correction means corrects the high-speed matching likelihood by correcting the representative likelihood of each speech unit or group in consideration of the error patterns between the speech units or between the groups.

8. The speech recognition device according to claim 1, wherein the high-speed matching means has an internal memory, the high-speed matching language dictionary is stored in high-speed matching language dictionary storage means, and the high-speed matching means, when performing the high-speed matching, loads into the internal memory the high-speed matching likelihood stored in the high-speed matching likelihood storage means and the high-speed matching language dictionary stored in the high-speed matching language dictionary storage means.

9. The speech recognition device according to claim 1, further comprising dictionary registration means for receiving an input word, generating a state sequence for high-speed matching and a state sequence for detailed matching for the input word, and additionally registering the state sequence for high-speed matching in the high-speed matching language dictionary while additionally registering the state sequence for detailed matching in the detailed matching language dictionary.

10. The speech recognition device according to claim 9, wherein, when generating the state sequence for high-speed matching, the dictionary registration means compresses consecutive identical speech units or consecutive identical speech unit groups into one state.

11. A speech recognition method comprising: a step of acoustically analyzing input speech; a step of calculating, based on the acoustic analysis result, the likelihood of each state for each frame with reference to an acoustic model to obtain a detailed matching likelihood; a step of obtaining a high-speed matching likelihood based on the detailed matching likelihood; a step of correcting the bias of the high-speed matching likelihood toward the wrong side; a step of performing high-speed matching between the corrected high-speed matching likelihood and all the words registered in a high-speed matching language dictionary to calculate the likelihood of each word; a step of preliminarily selecting candidate words based on the result of the high-speed matching; and a step of performing, for the preliminarily selected candidate words, detailed matching between the detailed matching likelihood and the words registered in a detailed matching language dictionary to calculate the likelihood of each candidate word.

12. A computer-readable program recording medium on which is recorded a speech recognition processing program for causing a computer to function as the acoustic analysis means, likelihood calculation means, high-speed matching likelihood calculation means, high-speed matching likelihood correction means, high-speed matching means, candidate preliminary selection means, and detailed matching means of claim 1.