JPH08241096A

JPH08241096A - Speech recognition method

Info

Publication number: JPH08241096A
Application number: JP7041948A
Authority: JP
Inventors: Yoshiaki Noda; 喜昭野田; Shigeki Sagayama; 茂樹嵯峨山
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 1995-03-01
Filing date: 1995-03-01
Publication date: 1996-09-17
Anticipated expiration: 2015-10-30
Also published as: JP3104900B2

Abstract

PURPOSE: To discard non-grammatical utterances in the course of the search process. CONSTITUTION: Partial hypotheses are formed by appending phonemes and branching in accordance with a tree-structured grammar 41, and a score function g_i(t) is obtained by trellis calculation that collates the HMM corresponding to partial hypothesis i against the input speech. At the same time, the maximum score over partial hypotheses formed without grammatical constraints is obtained as a reference score function g₀(t). The maximum of the difference between g_i(t) and the forward heuristic function ĝ(t), and the maximum of the difference between g₀(t) and ĝ(t), are taken as the evaluation values S_i and S₀ respectively, and the search proceeds by discarding every partial hypothesis for which S_i − S₀ is below a threshold.

Description

Detailed Description of the Invention

[0001]

[Field of Industrial Application] The present invention relates to a speech recognition method that takes the many partial hypotheses that can be formed by concatenating speech units (phonemes, syllables, demisyllables, words, and the like) under the control of a given grammar, collates the acoustic model corresponding to each partial hypothesis against the input speech, and searches for candidates close to the input speech.

[0002]

[Prior Art] FIG. 3A shows the procedure of speech recognition processing that uses the phoneme as the unit of recognition. The input speech 11 is converted by the analysis processing unit 12 into a time series of feature-parameter vectors, and the search processing unit 13 collates it against the phoneme models 15 while applying the constraints of the grammar 16. The phoneme sequence with the highest evaluation value is then output as the recognition result 14.

[0003] The signal processing most often used in the analysis processing unit 12 is linear predictive analysis (Linear Predictive Coding, LPC), and the feature parameters include the LPC cepstrum, LPC delta cepstrum, mel cepstrum, and logarithmic power. For the phoneme models 15, the hidden Markov model (Hidden Markov Model, hereinafter HMM), which is modeled on probability and statistical theory, is the mainstream. Details of the HMM are disclosed, for example, in Seiichi Nakagawa, "Speech Recognition by Probabilistic Models" (確率モデルによる音声認識), edited by the Institute of Electronics, Information and Communication Engineers.

[0004] For each partial hypothesis, that is, each phoneme sequence that the grammar allows to be concatenated, the search processing unit 13 evaluates how plausibly the corresponding phoneme models match the input speech, and advances the search by extending the partial hypotheses one phoneme at a time. Here, a partial hypothesis is a phoneme sequence joined in accordance with the constraints on phoneme order given by the grammar, and extending a partial hypothesis by a phoneme means appending one more phoneme to its phoneme sequence in accordance with the grammar.

[0005] For each partial hypothesis, three pieces of information are stored: (1) the phoneme sequence; (2) the score function, which is the result of collation against the acoustic model by trellis calculation or the like; and (3) the evaluation value, which indicates the plausibility of the partial hypothesis for the input speech. With i as the identification number of a partial hypothesis and t as time, the score function is written g_i(t).

The search processing unit 13 first extends a partial hypothesis with the first phoneme allowed by the grammar, collates the HMM corresponding to that phoneme against the analyzed time series of feature-parameter vectors (the input speech), and obtains the score function g_i(t) of this partial hypothesis i at each time t. Methods for collation against an HMM include the trellis method and the Viterbi method; details are disclosed, for example, in Seiichi Nakagawa, "Speech Recognition by Probabilistic Models" (確率モデルによる音声認識), edited by the Institute of Electronics, Information and Communication Engineers. From the score function g_i(t), the evaluation value of partial hypothesis i is obtained by the method described later, and the phoneme sequence, the score function g_i(t), and the evaluation value are recorded for the partial hypothesis. Thereafter, each time a phoneme is appended, the search proceeds while the evaluation value of the resulting partial hypothesis is computed.

When the grammar allows two or more different phonemes to follow the phoneme sequence of a partial hypothesis, the original partial hypothesis is duplicated once for each such phoneme, a partial hypothesis extended by each phoneme is created, and the evaluation value of each is computed. A partial hypothesis that the grammar no longer allows to be extended is treated as a hypothesis whose phoneme sequence has been accepted by the grammar, and its extension ends. When no partial hypothesis can be extended any further, every phoneme sequence allowed by the grammar has been collated against the input speech, and the search processing 13 ends. The phoneme sequence of the hypothesis with the highest evaluation value at that point, or the corresponding word or sentence, is output as the recognition result 14.
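The bookkeeping described in this paragraph (three stored items per partial hypothesis, and duplication of a hypothesis for each phoneme the grammar allows next) can be sketched as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the `Hypothesis` class, the dictionary-based grammar format, and the `match` callback standing in for trellis collation are all hypothetical names introduced here.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

Phonemes = Tuple[str, ...]
# Hypothetical grammar format: phoneme sequence -> phonemes allowed next.
Grammar = Dict[Phonemes, List[str]]
# Hypothetical matcher: returns (score function g_i(t), evaluation value).
Matcher = Callable[[Phonemes], Tuple[Dict[int, float], float]]


@dataclass
class Hypothesis:
    phonemes: Phonemes                                      # 1. phoneme sequence
    score: Dict[int, float] = field(default_factory=dict)   # 2. g_i(t), keyed by time t
    evaluation: float = float("-inf")                       # 3. evaluation value


def expand(hyp: Hypothesis, grammar: Grammar, match: Matcher) -> List[Hypothesis]:
    """Create one extended hypothesis per phoneme the grammar allows next."""
    children = []
    for phoneme in grammar.get(hyp.phonemes, []):
        seq = hyp.phonemes + (phoneme,)
        score, evaluation = match(seq)  # stands in for trellis collation
        children.append(Hypothesis(seq, score, evaluation))
    return children


def is_complete(hyp: Hypothesis, grammar: Grammar) -> bool:
    """A hypothesis the grammar cannot extend counts as an accepted hypothesis."""
    return not grammar.get(hyp.phonemes)


# Toy usage with a dummy matcher (scores are placeholders, not real likelihoods).
toy_grammar: Grammar = {("#",): ["i"], ("#", "i"): ["k", "m"]}
dummy_match: Matcher = lambda seq: ({0: -float(len(seq))}, -float(len(seq)))
root = Hypothesis(("#",))
level2 = expand(root, toy_grammar, dummy_match)
level3 = [child for h in level2 for child in expand(h, toy_grammar, dummy_match)]
```

In this toy run `level3` holds the two hypotheses "# i k" and "# i m", both of which `is_complete` reports as finished because the toy grammar lists no continuation for them.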

[0006] A search method that, as described above, extends the partial hypotheses so that all of them (phoneme sequences) have the same number of phonemes during the search is called breadth-first search. Carried out literally, breadth-first search computes the partial hypotheses corresponding to every phoneme sequence the grammar allows, so a very large number of partial hypotheses must be computed and much processing time is required. For this reason, in the course of extending the partial hypotheses, it is common to keep only those with a prospect of becoming the final recognition result and to discard the rest. Concretely, whether to keep a partial hypothesis is decided from its evaluation value. The decision methods in use include keeping a fixed number of partial hypotheses in descending order of evaluation value, setting a threshold on the evaluation value and keeping only the partial hypotheses above it, and combining the two. Breadth-first search that, under such a fixed condition, keeps only the promising partial hypotheses and discards the others is called beam search.
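The three pruning rules named above (fixed number, threshold, and their combination) can be sketched in a few lines. The function names and the evaluation values are assumptions made for illustration; the text does not specify the order in which the two rules are combined, so applying the threshold first is a choice made here.

```python
from typing import List, Tuple

Scored = Tuple[str, float]  # (phoneme sequence, evaluation value)


def keep_top_n(hyps: List[Scored], n: int) -> List[Scored]:
    """Keep the n hypotheses with the highest evaluation values."""
    return sorted(hyps, key=lambda h: h[1], reverse=True)[:n]


def keep_above(hyps: List[Scored], threshold: float) -> List[Scored]:
    """Keep only hypotheses whose evaluation value exceeds the threshold."""
    return [h for h in hyps if h[1] > threshold]


def keep_combined(hyps: List[Scored], n: int, threshold: float) -> List[Scored]:
    """Apply the threshold first, then cap the survivors at n."""
    return keep_top_n(keep_above(hyps, threshold), n)


# Hypothetical evaluation values, for illustration only.
beam = [("# a", -3.0), ("# b", -1.0), ("# c", -7.5), ("# d", -2.0)]
```

With these made-up values, `keep_top_n(beam, 2)` and `keep_above(beam, -2.5)` happen to keep the same two hypotheses; in general the two rules differ, which is why the combination is sometimes used.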

[0007] To explain the above concretely, consider as an example search processing with HMMs on a grammar expressed as a tree structure like that shown in FIG. 3B. Suppose the search has already finished processing up to the fourth phoneme and the fifth phoneme is about to be appended. In FIG. 3B, the partial hypotheses extended from the first phoneme # to the fourth phoneme are of three kinds: "# i k a", "# i k i", and "# i m i". Here, " " is the symbol marking a phoneme boundary, and the phoneme # denotes silence.

[0008] For "# i k i", one of the partial hypotheses that starts with # as the first phoneme and has been extended to the fourth phoneme, FIG. 3B shows that three phonemes, k, o, and m, can be appended as the fifth phoneme. For "# i k a", another partial hypothesis that starts with # and has been extended to the fourth phoneme, two phonemes, m and n, can be appended as the fifth phoneme. The partial hypothesis "# i m i" is complete at the fourth phoneme and is not extended.

[0009] In beam search that discards unpromising partial hypotheses at each phoneme depth of the tree-structured grammar, the evaluation values of the partial hypotheses with the same number of phonemes are obtained, and only those with good evaluation values under a fixed condition are kept. Here the fixed condition is that only the two partial hypotheses with the highest evaluation values are kept. As stated above, there are five partial hypotheses extended to the fifth phoneme: "# i k i o", "# i k i k", "# i k i m", "# i k a m", and "# i k a n". If their evaluation values are highest in this order, only the top two, "# i k i o" and "# i k i k", are kept as partial hypotheses that can be extended by the next phoneme, and the others are discarded.
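The fifth-phoneme step of this example can be traced with a short sketch. Only the fragment of the FIG. 3B tree that the text spells out is encoded, and the numeric evaluation values are hypothetical: the text gives only their ranking, not their magnitudes.

```python
# Fragment of the FIG. 3B tree grammar as described in the text:
# fourth-phoneme hypothesis -> phonemes allowed as the fifth phoneme.
GRAMMAR = {
    ("#", "i", "k", "i"): ["o", "k", "m"],
    ("#", "i", "k", "a"): ["m", "n"],
    ("#", "i", "m", "i"): [],  # complete at the fourth phoneme
}

# Hypothetical evaluation values; only their order follows the text.
EVAL = {
    ("#", "i", "k", "i", "o"): -1.0,
    ("#", "i", "k", "i", "k"): -2.0,
    ("#", "i", "k", "i", "m"): -3.0,
    ("#", "i", "k", "a", "m"): -4.0,
    ("#", "i", "k", "a", "n"): -5.0,
}

depth4 = list(GRAMMAR)
# Extend every fourth-phoneme hypothesis by each phoneme the grammar allows.
depth5 = [seq + (ph,) for seq in depth4 for ph in GRAMMAR[seq]]
# Beam condition of the example: keep the two highest evaluation values.
kept = sorted(depth5, key=EVAL.__getitem__, reverse=True)[:2]
# Hypotheses with no continuation are finished rather than discarded.
finished = [seq for seq in depth4 if not GRAMMAR[seq]]
```

The sketch yields five fifth-phoneme hypotheses, keeps "# i k i o" and "# i k i k", and sets "# i m i" aside as a completed hypothesis.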

[0010] In this way, the partial hypotheses are extended by a phoneme, the partial hypotheses to keep are limited by a fixed condition, and the kept partial hypotheses are extended by a further phoneme; the same processing continues until no partial hypothesis can be extended any further. The evaluation values of all partial hypotheses that can no longer be extended, that is, of the hypotheses, are then compared, and the hypothesis with the highest evaluation value is output as the recognition result.

[0011] One method of obtaining the evaluation value of a partial hypothesis from its score function g_i(t) is as follows. A forward heuristic function ĝ(t), common to all partial hypotheses and estimated forward from the start of the speech, is obtained in advance; the difference between the score function g_i(t) of partial hypothesis i and ĝ(t) is computed; and the value corresponding to the maximum of that difference over time t is taken as the evaluation value S_i of partial hypothesis i. Details of this method are disclosed, for example, in Yoshiaki Noda and Shigeki Sagayama, "HMM-LR speech recognition by A* beam search using forward likelihood", IEICE Technical Report, Speech, SP94-23, 1994, and in Japanese Patent Application No. Hei 6-133339, "Acoustic recognition method".

[0012] As a concrete example of how this evaluation value is obtained, the calculation performed when the phoneme o is appended to the partial hypothesis "# i k i", which has been extended to the fourth phoneme, is explained with reference to FIG. 4. FIG. 4 shows the score functions obtained by trellis calculation, that is, by collating phoneme sequences against the input speech, as a three-dimensional diagram with three axes: phoneme sequence, input speech, and score.

Curve 31 is the score function g_i4(t) of the partial hypothesis "# i k i". Its value g_i4(t₁) at time t₁ is the score indicating plausibility under the assumption that this partial hypothesis (phoneme sequence) was uttered in the shortest possible time, ending by t₁. Its value g_i4(t₂) at time t₂ is the score indicating plausibility under the assumption that the partial hypothesis was uttered in the longest possible time, ending by t₂. A time t₃ is determined from t₁, t₂, and the duration of the phoneme o. Curve 32 traces, over that interval, the plausibility (score) under the assumption that the phoneme sequence "# i k i o" had been uttered by each time; in other words, curve 32 is the score function g_i5(t) of the partial hypothesis "# i k i o" for the input speech.

That is, the score function 31 of the partial hypothesis "# i k i" has already been computed. Taking its likelihood at each time as the initial value, the trellis calculation accumulates the score of the phoneme o at each time to obtain the score function 32 of "# i k i o".

[0013] The trellis calculation collates the HMM representing the acoustic model against the time series of feature-parameter vectors obtained by analyzing the input speech. The probability of the vector time series is computed over all HMM transitions that reach the final state of the HMM at time t, which yields the probability value at time t. Here the logarithm of that probability value is used as the score (likelihood).
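A trellis calculation of this kind can be sketched in the log domain as below. This is a simplified illustration, not the patent's procedure: it handles a single HMM entered at time 0, whereas the search described earlier seeds each appended phoneme with the previous score function at every time. The function names, the argument layout, and the toy numbers are all assumptions.

```python
import math

NEG_INF = float("-inf")


def logsumexp(values):
    """Stable log(sum(exp(v))) that tolerates -inf entries."""
    top = max(values)
    if top == NEG_INF:
        return NEG_INF
    return top + math.log(sum(math.exp(v - top) for v in values))


def trellis_scores(log_init, log_trans, log_emit):
    """Return g(t): log probability of being in the final state at time t.

    log_init[j]     log probability of starting in state j
    log_trans[i][j] log transition probability from state i to state j
    log_emit[t][j]  log output density of state j for the frame at time t
    """
    n = len(log_init)
    alpha = [log_init[j] + log_emit[0][j] for j in range(n)]
    scores = [alpha[-1]]
    for frame in log_emit[1:]:
        # Sum over every transition into state j, then add the frame's density.
        alpha = [
            logsumexp([alpha[i] + log_trans[i][j] for i in range(n)]) + frame[j]
            for j in range(n)
        ]
        scores.append(alpha[-1])
    return scores


# Toy 2-state left-to-right HMM; every output density is 1 (log 0.0) so the
# result depends only on the transitions. All numbers are made up.
LOG_HALF = math.log(0.5)
log_init = [0.0, NEG_INF]
log_trans = [[LOG_HALF, LOG_HALF], [NEG_INF, 0.0]]
log_emit = [[0.0, 0.0]] * 3
g = trellis_scores(log_init, log_trans, log_emit)
```

For the toy model the final state cannot be occupied at time 0, is reached with probability 0.5 at time 1, and with probability 0.75 at time 2, so `g` is approximately `[-inf, log 0.5, log 0.75]`.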

[0014] Next, to obtain the evaluation value of a partial hypothesis, a forward heuristic function ĝ(t), common to all partial hypotheses and estimated from the start of the speech, is obtained without grammar (that is, with no grammatical constraint, so that any phoneme may be appended). As in equation (1) below, ĝ(t) is subtracted from the score function g_i(t) of the partial hypothesis, and the maximum S_i of the difference is obtained. S_i indicates the plausibility of partial hypothesis i, and taking it as the evaluation value of partial hypothesis i gives an evaluation value that is normalized with respect to time.

[0015]

S_i = max_t { g_i(t) − ĝ(t) }   (1)

where max_t denotes the maximum of the expression in braces over all times t.

Note that a search without grammar yields evaluation values close to the correct answer, but the number of partial hypotheses becomes extremely large and many of them have nearly identical evaluation values, which makes selection difficult. The search is therefore carried out under grammatical constraints, as described above.
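Equation (1) reduces to a single maximum over the times at which the score function of the hypothesis is defined. The sketch below assumes, for illustration only, that g_i(t) is stored as a dictionary keyed by time and ĝ(t) as a list indexed by time; the numbers are made up.

```python
def evaluation_value(g_i, g_hat):
    """S_i = max over t of g_i(t) - g_hat(t), per equation (1).

    g_i   dict mapping time t -> score, defined only where the hypothesis
          can end
    g_hat list of forward heuristic scores, indexed by time
    """
    return max(score - g_hat[t] for t, score in g_i.items())


# Made-up scores for illustration.
g_hat = [-1.0, -2.0, -3.0, -4.0, -5.0]
g_i = {2: -4.5, 3: -4.5, 4: -6.5}
s_i = evaluation_value(g_i, g_hat)  # differences are -1.5, -0.5, -1.5
```

Subtracting ĝ(t) before taking the maximum is what makes scores at different end times comparable, since raw log likelihoods fall steadily as t grows.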

[0016]

[Problems to be Solved by the Invention] In speech recognition, reducing the amount of search processing shortens recognition time and makes speech recognition easier to use in practice. It also makes it feasible to run speech recognition on computers with low processing power. To reduce the amount of search processing, unpromising partial hypotheses must be discarded during the search so that fewer partial hypotheses remain to be extended. However, in conventional beam search that keeps a fixed number of partial hypotheses with high evaluation values, a kept partial hypothesis that has a small evaluation value, that is, one that cannot become a plausible recognition result, is not discarded, and processing is wasted on it. In beam search that sets a threshold and keeps the partial hypotheses whose evaluation values exceed it, partial hypotheses with small evaluation values are discarded; but the evaluation value is generally strongly affected by the size of the recognition vocabulary, the speaker, and the length of the input speech, so it is difficult to set a threshold that discards partial hypotheses effectively without dropping the correct one.

[0017] In other words, the evaluation value computed by the conventional method is useful for comparing partial hypotheses with one another, but because it is strongly affected by recognition vocabulary size, speaker, and input speech length, it is difficult to judge a partial hypothesis from the absolute value of its evaluation value.

[0018]

[Means for Solving the Problems] According to the present invention, in the course of the search, that is, at each depth of speech units (phonemes, syllables, demisyllables, words, and the like) in the tree-structured grammar, the evaluation value that would be obtained if the utterance content of the input speech were the correct answer is estimated and taken as a reference evaluation value. The evaluation value obtained in the conventional way, by concatenating speech units under the grammatical constraints and collating against the acoustic model, is normalized by this reference evaluation value, and partial hypotheses whose normalized evaluation value is at or below a threshold are discarded.

[0019] This normalization removes the influence of recognition vocabulary size, speaker, input speech length, and the like from the evaluation value of a partial hypothesis, so that unpromising partial hypotheses can be reliably discarded during the search. Search efficiency improves, and using the normalized evaluation value reduces the amount of search processing.

[0020]

[Embodiment] An embodiment of the present invention is described below. As in the conventional method, the input speech is analyzed to obtain a time series of feature-parameter vectors. The embodiment, described with reference to FIG. 1, applies the invention to a phoneme-synchronous beam search in which the speech unit appended to partial hypotheses is the phoneme and every partial hypothesis has the same number of phonemes, with HMMs as the acoustic models.

Using the constraints of the grammar 41, the phoneme extension processing unit 42 appends a phoneme to partial hypothesis i, and the trellis calculation processing unit 43 collates the HMM corresponding to the phoneme sequence against the input speech. From the resulting score function g_i(t), the evaluation value calculation processing unit 47 obtains the evaluation value S_i of partial hypothesis i. The conventional method performs a beam search that keeps a fixed number of partial hypotheses with high S_i and discards the rest. In the present invention, the score function calculation processing unit 45 obtains the score function g₀(t) for the reference evaluation value by a method described later, and the evaluation value calculation processing unit 48 obtains the reference evaluation value S₀ in the same way as above. The difference between the evaluation value S_i and the reference evaluation value S₀ (the normalized evaluation value S_i′ of partial hypothesis i) is then obtained. Partial hypotheses for which this difference is large in magnitude, that is, for which S_i′ is strongly negative, are discarded as unpromising, and the search proceeds.

[0021] To explain concretely with the example of FIG. 3B, there are five partial hypotheses obtained by appending a phoneme to the fourth-phoneme partial hypotheses: "# i k i o", "# i k i k", "# i k i m", "# i k a m", and "# i k a n". Let each of these be partial hypothesis i, let S_i be the evaluation value of partial hypothesis i, and let S₀ be the reference evaluation value. The normalized evaluation value S_i′ of partial hypothesis i is then given by equation (2) below.

[0022]

S_i′ = S_i − S₀   (2)

Suppose the input speech was actually uttered as "ikioi". The partial hypothesis "# i k i o" is then closest to the correct answer and its evaluation value is high, whereas a partial hypothesis far from the correct answer, such as "# i k a m", has a small evaluation value. The reference evaluation value is the estimated evaluation value under the assumption that the content of the input speech is the correct answer. Because it is obtained, for example, without grammar, allowing every combination of acoustic models with no grammatical constraint, the collation necessarily includes the phoneme sequence identical to the content of the input speech, or one close to it, and that phoneme sequence should be the combination with the highest evaluation value. The reference evaluation value is therefore close to the evaluation value of the partial hypothesis "# i k i o". Consequently, the normalized evaluation value S_i′ is close to 0 for partial hypotheses close to the correct answer and takes a large negative value for partial hypotheses far from it.

This tendency of S_i′ depends little on the speaker: S₀ and S_i are both derived from the same input speech, so the speaker characteristics contained in both are subtracted out of the normalized evaluation value. For the same reason, the tendency of S_i′ does not depend on the length of the input speech either. Furthermore, in beam search that keeps a fixed number of partial hypotheses, the number kept must be changed according to the size of the recognition vocabulary, but the evaluation value itself does not change when the vocabulary size changes, so S_i′ is also little affected by vocabulary size.
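The effect of equation (2) can be illustrated with made-up numbers for the five hypotheses of the example. The evaluation values, the reference value, the threshold, and the uniform offset standing in for a speaker or length effect are all hypothetical; the sketch only shows that an offset common to S_i and S₀ cancels in S_i′, so the same threshold keeps the same hypotheses.

```python
def normalize(evals, s0):
    """Equation (2): S_i' = S_i - S_0 for every hypothesis."""
    return {hyp: s - s0 for hyp, s in evals.items()}


def surviving(evals, s0, limit):
    """Keep hypotheses whose normalized evaluation value is not below limit."""
    return [hyp for hyp, s in normalize(evals, s0).items() if s >= limit]


# Hypothetical evaluation values; "# i k i o" is assumed closest to the input.
evals = {
    "# i k i o": -10.5,
    "# i k i k": -14.0,
    "# i k i m": -15.0,
    "# i k a m": -22.0,
    "# i k a n": -23.0,
}
s0 = -10.0     # hypothetical reference evaluation value
limit = -4.5   # hypothetical threshold L

# A different speaker or utterance length is modeled here, very crudely, as a
# common offset on every score, including the reference.
offset = -30.0
shifted = {hyp: s + offset for hyp, s in evals.items()}

kept = surviving(evals, s0, limit)
kept_shifted = surviving(shifted, s0 + offset, limit)
```

Both `kept` and `kept_shifted` contain "# i k i o" and "# i k i k", whereas a threshold applied to the raw S_i would have to be retuned after the offset.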

[0023] When partial hypotheses with low normalized evaluation values S_i′ are discarded in the beam search, a threshold L is set and partial hypotheses with S_i′ < L are discarded. L may be a constant, or a value that depends on the duration of the partial hypothesis; in the example above, L may be set to a more strongly negative value the longer the partial hypothesis is. When equation (1) is used as the calculation method in the evaluation value calculation processing units 47 and 48 of FIG. 1, and ĝ(t) in equation (1) is equal to the score function g₀(t) for the reference evaluation value, the normalized evaluation value S_i′ can be obtained with equation (3) below, where g_i(t) is the score function of partial hypothesis i and g₀(t) is the score function for the reference evaluation value. Equation (3) greatly reduces the computation needed for the normalized evaluation value S_i′.
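The shortcut of equation (3) follows because, when ĝ(t) = g₀(t), equation (1) gives S₀ = max_t { g₀(t) − g₀(t) } = 0, so S_i′ = S_i. The sketch below checks this numerically and adds one possible length-dependent threshold. The linear form of `threshold`, its default coefficients, and all scores are assumptions for illustration; the text only says that L may become more negative for longer hypotheses.

```python
def normalized_via_eq1_eq2(g_i, g0, g_hat):
    """S_i' = S_i - S_0, with S_i and S_0 each computed by equation (1)."""
    s_i = max(g_i[t] - g_hat[t] for t in g_i)
    s_0 = max(g0[t] - g_hat[t] for t in range(len(g0)))
    return s_i - s_0


def normalized_via_eq3(g_i, g0):
    """S_i' = max over t of g_i(t) - g_0(t); valid when g_hat equals g_0."""
    return max(g_i[t] - g0[t] for t in g_i)


def threshold(duration, base=-4.0, per_frame=-0.5):
    """Hypothetical length-dependent L: more negative for longer hypotheses."""
    return base + per_frame * duration


g0 = [-1.0, -2.0, -3.0, -4.0, -5.0]   # made-up reference scores, by time
g_i = {2: -4.5, 3: -4.5, 4: -6.5}     # made-up hypothesis scores, by time
s_prime = normalized_via_eq3(g_i, g0)
keep = s_prime >= threshold(duration=len(g_i))  # discard only if S_i' < L
```

Equation (3) needs one pass over the score function of the hypothesis, while the two-step route also needs a pass over g₀(t) for every evaluation.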

[0024]

S_i′ = max_t { g_i(t) − g₀(t) }   (3)

where max_t denotes the maximum of the expression in braces over all times t.

The ways of obtaining the score function g₀(t) for the reference evaluation value S₀ are described below.

<Calculation method 1 of the score function for the reference evaluation value> Each phoneme HMM usually has about three states, and each state has an output probability density distribution formed as a weighted sum of several probability density functions. The feature parameter of the input speech at each time is given to all the output probability density distributions, the highest output probability density value is selected, and its logarithm is taken as the maximum likelihood for that time. The cumulative value of this maximum likelihood as time advances is taken as the score function for the reference evaluation value. With O_τ the feature parameter at time τ and p_j(O_τ) the output probability density value obtained by giving that feature parameter to output probability density distribution j, g₀(t) is given by equation (4).

[0025]

g₀(t) = Σ_τ max_j p_j(O_τ)   (4)

where Σ_τ runs from τ = 0 to t and max_j is the maximum over all j of p_j(O_τ).

Normally, a transition from one HMM to another takes place under the condition that it goes from the final state of the one HMM to the initial state of the other. This score function removes that transition condition and also removes the grammatical constraints: it corresponds to the score function obtained by a Viterbi calculation in which transitions from any state of any HMM to any state of any HMM are allowed, with every transition probability set to 1. As the search progresses, most of the values p_j(O_τ) have already been computed in the trellis calculations of the search, so those results can be reused and little computation is needed.
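Calculation method 1 can be sketched as a running sum of per-frame maxima. Following the prose of the method, which takes the logarithm of the largest density at each time before accumulating, the sketch sums log densities; equation (4) as printed shows p_j(O_τ) without the logarithm, so the log here is an interpretation. The table layout and the density values are made up for illustration.

```python
import math
from itertools import accumulate


def reference_score(densities):
    """Return [g_0(0), g_0(1), ...] for calculation method 1.

    densities[t][j] is the output probability density p_j(O_t) of
    distribution j for the frame at time t (all values must be positive).
    """
    # Per frame: best density over all distributions, in the log domain.
    per_frame_max = (math.log(max(frame)) for frame in densities)
    # g_0(t) is the cumulative sum of those maxima up to time t.
    return list(accumulate(per_frame_max))


# Made-up densities for two distributions over three frames.
densities = [
    [0.5, 0.25],
    [0.125, 1.0],
    [0.25, 0.5],
]
g0 = reference_score(densities)
```

Because every frame contributes its best distribution regardless of HMM or state order, no score function of a grammar-constrained hypothesis computed from the same densities can exceed this g₀(t) by much, which is what makes it usable as a reference.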

【００２６】＜基準評価値のためのスコア関数の計算方
法２＞前記計算方法１においては、全ＨＭＭの全ての状
態の出力確率密度分布から得られる出力確率密度値の最
大値から求めたが、この計算方法２では探索処理の過程
で現在までにトレリス計算によって計算済みの全ての出
力確率密度分布に対する出力確率密度値の最大値から求
める。例えば図２に示すように、各ＨＭＭの各状態の出
力密度分布ｐ₁，ｐ₂，ｐ₃…，を縦軸に、横軸に時刻
ｔをとると、前記図３Ｂの例では先ず無音＃のＨＭＭの
各状態の出力密度分布に対する出力確率密度値が予測さ
れる無音長について、この例では時刻０から３まで計算
され（この計算値が埋められた領域を５１で示す）、最
も短い無音の終了時刻１の次の時刻２から最も長い無音
の終了時刻３の次の時刻４より次の音素ｉのＨＭＭの各
状態の出力密度分布に対する出力確率密度値がそれぞれ
計算される。その計算値が埋められた領域を５２で示
す。同様にして音素ｋのＨＭＭの各状態の出力確率密度
値が図２に領域５３として計算される。探索によりこの
ような計算が進められるが、図２中の各時刻０，１，
２，…における各計算された出力確率密度値の最大値を
求める。この最大値を順次加算してｇ₀（ｔ）とする。
このようにすると探索処理過程で文法の拘束を受けた出
力確率密度分布からスコア関数ｇ₀（ｔ）を計算するた
め、より実際の文法に近いスコア関数が得られる。しか
も、トレリス計算で既に計算された出力確率密度値しか
使わないため、スコア関数ｇ₀（ｔ）のための計算はほ
とんど必要としない。このような計算方法でも、文法で
制約されていない部分の出力確率密度値はトレリス計算
で得られているものより小さいものが大部分と考えら
れ、正しく数ｇ₀（ｔ）が推定される。
<Calculation Method 2 of the Score Function for the Reference Evaluation Value> Calculation method 1 takes the maximum of the output probability density values obtained from the output probability density distributions of all states of all HMMs. Calculation method 2 instead takes the maximum only over the output probability density distributions whose values have already been computed by the trellis calculation in the search so far. For example, FIG. 2 plots the output probability density distributions p₁, p₂, p₃, ... of the states of each HMM on the vertical axis and time t on the horizontal axis. In this example, the output probability density values for the output density distributions of the HMM states are calculated over the predicted silence length, from time 0 to time 3 (the region filled with these calculated values is indicated by 51). The output probability density values for the output density distributions of the states of the HMM of the next phoneme i are calculated from time 2, the time following end time 1 of the shortest silence, to time 4, the time following end time 3 of the longest silence, onward. The region filled with these calculated values is indicated by 52. In the same way, the output probability density values of the states of the HMM of phoneme k are calculated as region 53 in FIG. 2. As the search advances these calculations, the maximum of the output probability density values calculated so far is found at each time 0, 1, 2, ... in FIG. 2, and these maxima are accumulated in sequence to give g₀(t).
Because the score function g₀(t) is then calculated from output probability density distributions that were constrained by the grammar during the search, the resulting score function is closer to the actual grammar. Moreover, since only output probability density values already calculated in the trellis calculation are used, almost no additional computation is needed for g₀(t). Even with this calculation method, most of the output probability density values of the parts not constrained by the grammar can be expected to be smaller than those obtained in the trellis calculation, so the function g₀(t) is estimated correctly.
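
Calculation method 2 can be illustrated with a short sketch. It assumes, purely for illustration, that the trellis keeps log output probability density values in a table indexed by (distribution, time), that cells the search has not computed hold `None`, and that accumulation is addition in the log domain. The function name and table layout are hypothetical and do not come from the patent.

```python
from typing import List, Optional


def reference_score_from_trellis(
    log_density: List[List[Optional[float]]],
) -> List[float]:
    """Accumulate the per-time maximum of the cells computed so far.

    log_density[d][t] is the log output probability density of
    distribution d at time t, or None when the trellis calculation
    has not filled that cell (assumed layout).
    """
    n_frames = len(log_density[0])
    g0: List[float] = []
    total = 0.0
    for t in range(n_frames):
        computed = [row[t] for row in log_density if row[t] is not None]
        if not computed:
            break  # nothing computed at this time yet, so g0 ends here
        total += max(computed)  # maximum at time t, added to the running sum
        g0.append(total)
    return g0


# Three distributions, four frames; None marks cells outside regions 51-53.
example = [
    [-1.0, -2.0, None, None],
    [None, -0.5, -3.0, None],
    [None, None, -1.5, -0.75],
]
print(reference_score_from_trellis(example))
```

No density values are evaluated beyond those the search already produced, which is why the method adds almost no computation.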

【００２７】＜基準評価値のためのスコア関数の計算方
法３＞横型探索法の説明で述べたように部分仮説に音素
を拡張していき、トレリス計算等の照合を行うことによ
りスコア関数を得る。この場合、各部分仮説に任意の音
素の拡張を行えるような文法、つまり無文法で、音素を
拡張していき、対応する音響モデルと入力音声を照合し
て得られたスコア関数の各時刻での最大値を基準評価値
のためのスコア関数とする。この場合はＨＭＭの遷移制
約は残しておく、この方法は上記２つの方法よりも文法
的拘力が強く、これを用いることにより精度の高い正規
化評価値Ｓ_i′を求めることができるが、計算量も多く
なる。
<Calculation Method 3 of the Score Function for the Reference Evaluation Value> As described in the explanation of the breadth-first search method, partial hypotheses are extended phoneme by phoneme, and a score function is obtained by matching such as the trellis calculation. Here the phonemes are extended under a grammar that allows any phoneme to be appended to each partial hypothesis, that is, with no grammar, and the maximum at each time of the score functions obtained by matching the corresponding acoustic models against the input speech is used as the score function for the reference evaluation value. The transition constraints of the HMMs are retained in this case. This method imposes stronger grammatical constraints than the two methods above, so it gives a more accurate normalized evaluation value S_i′, but it also requires more computation.
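
A minimal sketch of calculation method 3 follows, under several assumptions that are not in the patent: each phoneme HMM is left-to-right with no skips, all transition log probabilities are taken as zero, the score of a hypothesis at time t is read from the last state of its final phoneme, and emission log densities are supplied as `emis[p][s][t]`. The function name is hypothetical.

```python
from typing import List

NEG_INF = float("-inf")


def free_phoneme_loop_score(emis: List[List[List[float]]]) -> List[float]:
    """Best score at each time when any phoneme may follow any phoneme.

    emis[p][s][t] is the log density of state s of phoneme p at frame t.
    Within a phoneme only "stay" and "advance one state" are allowed,
    so the HMM transition constraints are kept.
    """
    n_frames = len(emis[0][0])
    # score[p][s]: best log score of a path that is in state s of phoneme p
    score = [[NEG_INF] * len(states) for states in emis]
    for p, states in enumerate(emis):
        score[p][0] = states[0][0]  # every path starts in a first state
    g0 = [max(row[-1] for row in score)]
    for t in range(1, n_frames):
        # Leaving any phoneme's last state may enter any phoneme (no grammar).
        best_exit = max(row[-1] for row in score)
        new_score = []
        for p, states in enumerate(emis):
            row = []
            for s in range(len(states)):
                enter = score[p][s - 1] if s > 0 else best_exit
                row.append(max(score[p][s], enter) + states[s][t])
            new_score.append(row)
        score = new_score
        g0.append(max(row[-1] for row in score))
    return g0


# Two single-state phonemes over three frames.
print(free_phoneme_loop_score([[[-1.0, -4.0, -1.0]], [[-2.0, -0.5, -3.0]]]))
```

Every frame updates every state of every phoneme model, which reflects the higher computational cost noted for this method.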

【００２８】＜基準評価値のためのスコア関数の計算方
法４＞基準評価値のためのスコア関数の計算方法３にお
いて、任意の音素の拡張を行えるような文法ではなく、
日本語特有の音素配列構造のみを許す文法により、尤度
計算を行い、得られたスコア関数を前向きのヒューリス
ティック関数とする。日本語特有の音素配列構造を許す
音素列とは、例えば「ｏｍｏｓｈｉｒｏ
ｉ」や「ｓｕｔｏｒａｉｋｕ」という
ように一般に子音の後には子音が来ないという制約を示
している。「ｓｔｒａｉｋ」という音素の連鎖
は英語での音素配列構造を満たしているが、日本語の音
素配列構造とはなっていない。
<Calculation Method 4 of the Score Function for the Reference Evaluation Value> In calculation method 3, instead of a grammar that allows any phoneme to be appended, the likelihood is calculated under a grammar that allows only the phoneme sequence structure peculiar to Japanese, and the resulting score function is used as the forward heuristic function. A phoneme sequence that follows the phoneme sequence structure peculiar to Japanese is one that, as in "o m o sh i r o i" or "s u t o r a i k u", obeys the constraint that a consonant is generally not followed by another consonant. The phoneme chain "s t r ai k" satisfies the phoneme sequence structure of English, but it does not follow the phoneme sequence structure of Japanese.
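
The constraint used in calculation method 4 can be sketched as a simple predicate. This is a deliberately simplified illustration: it treats only a, i, u, e, o as vowels and applies the single rule "a consonant is not followed by a consonant", ignoring any exceptions a real Japanese phonotactic grammar would need. The names `VOWELS`, `may_follow` and `is_allowed` are hypothetical.

```python
from typing import Sequence

VOWELS = {"a", "i", "u", "e", "o"}


def may_follow(prev: str, nxt: str) -> bool:
    """Simplified rule: reject a consonant directly after a consonant."""
    return prev in VOWELS or nxt in VOWELS


def is_allowed(phonemes: Sequence[str]) -> bool:
    """True if every adjacent pair of phonemes satisfies may_follow."""
    return all(may_follow(a, b) for a, b in zip(phonemes, phonemes[1:]))


print(is_allowed("o m o sh i r o i".split()))   # True
print(is_allowed("s u t o r a i k u".split()))  # True
print(is_allowed("s t r ai k".split()))         # False
```

In a phoneme-loop search, a check of this kind would restrict which phoneme models may be entered after the current one, instead of allowing every phoneme.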

【００２９】計算方法３および計算方法４におけるｇ₀
（ｔ）を計算する際の音素を拡張する文法は、探索のた
めの部分仮説を作成するための文法を包含する文法と言
える。＜基準評価値のためのスコア関数の計算方法５＞最終的
な正解の部分仮説は、そのスコア関数も他の部分仮説よ
りも大きくなっている場合が多い。そこで、探索の過程
で計算された全ての部分仮説のスコア関数ｇ₁（ｔ），
ｇ₂（ｔ），ｇ₃（ｔ），…の各時間ごとの最大値をｇ
₀（ｔ）とする。式で表現すると次のようになる。
The grammar used to extend phonemes when calculating g₀(t) in calculation methods 3 and 4 can be regarded as a grammar that contains the grammar used to create the partial hypotheses for the search.
<Calculation Method 5 of the Score Function for the Reference Evaluation Value> The partial hypothesis that finally turns out to be correct often also has a larger score function than the other partial hypotheses. Therefore, the maximum at each time of the score functions g₁(t), g₂(t), g₃(t), ... of all partial hypotheses calculated during the search is taken as g₀(t). Expressed as a formula, this is as follows.

【００３０】
ｇ₀（ｔ）＝ max ｇ_i（ｔ）　（５）
maxはｇ_i（ｔ）の全てのｉ中最大のもの
この計算方法ではｇ₀（ｔ）のための計算量をほとんど
必要としない。＜基準評価値のためのスコア関数の計算方法６＞基準評
価値Ｓ_Oを求めるためのスコア関数ｇ₀（ｔ）の計算
は、音素の識別をする必要はなく、スコアを求めること
ができればよいから、各音素ごとのＨＭＭを用いる必要
がなく、図１に点線で示すように認識用の音響モデル１
５とは別の音響モデル４６を用いてもよく、この音響モ
デル４６としては、例えば一つまたは数個の音響モデル
でも、多くの状態数を設けることにより、認識対象を包
含している音響現象を全て表現できるように構成したも
のでもよく、この一つの音響モデルを繰り返し使用し、
または数個の音響モデルの場合は、これらを任意に選択
して連結して入力音声と照合してもっともらしいものを
求めてｇ₀（ｔ）を求めてもよい。
g₀(t) = max_i g_i(t)   (5)
where max_i denotes the largest value of g_i(t) over all i. This calculation method requires almost no additional computation for g₀(t).
<Calculation Method 6 of the Score Function for the Reference Evaluation Value> The score function g₀(t) used to obtain the reference evaluation value S_O only has to yield a score; it does not have to identify phonemes. It is therefore unnecessary to use an HMM for each phoneme, and an acoustic model 46 separate from the recognition acoustic model 15 may be used, as shown by the dotted line in FIG. 1. The acoustic model 46 may consist of, for example, one or a few acoustic models that are given a large number of states so that they can represent all acoustic phenomena encompassing the recognition targets. g₀(t) may then be obtained by using the single acoustic model repeatedly or, in the case of several acoustic models, by selecting and concatenating them arbitrarily, matching them against the input speech, and taking the most plausible result.
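
Equation (5) of calculation method 5 amounts to a frame-by-frame maximum over the score functions already held by the search. A minimal sketch, assuming each g_i(t) is stored as a list of equal length (the function name is hypothetical):

```python
from typing import List, Sequence


def reference_score_max(score_functions: Sequence[Sequence[float]]) -> List[float]:
    """g0(t) = max over i of g_i(t), evaluated at every time t."""
    return [max(frame) for frame in zip(*score_functions)]


g1 = [-1.0, -3.0, -6.0]
g2 = [-2.0, -2.5, -7.0]
g3 = [-4.0, -5.0, -5.5]
print(reference_score_max([g1, g2, g3]))  # [-1.0, -2.5, -5.5]
```

Only comparisons between existing scores are needed, which matches the remark that this method requires almost no extra computation.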

【００３１】一部変形の説明上述において、評価値を求めるため前向きヒューリステ
ィック関数を求めたが、例えば「南等“番号案内を対象
とした大語い連続音声認識アルゴリズム”電子情報通信
学会論文誌Ａ．vol.Ｊ７７−Ａ，No. ２，pp. １９０〜
１９７．１９９４」に示されているように、音声の終端
から後向きに推定した全ての仮説に共通な推定尤度関数
ｈ＾（ｔ）を求めておき、これをスコア関数ｇ_i（ｔ）
に加算して評価値Ｓ_iとしてもよい。さらに、この発明
は音素を単位としての音声認識のみならず、音節、半音
節、単語などを単位として認識する場合にも適用され
る。
Description of Partial Modifications
In the above, a forward heuristic function was obtained in order to obtain the evaluation values. Alternatively, as shown for example in Minami et al., "Large-vocabulary continuous speech recognition algorithm for directory assistance", IEICE Transactions A, vol. J77-A, No. 2, pp. 190-197, 1994, an estimated likelihood function ĥ(t) common to all hypotheses may be estimated backward from the end of the speech and added to the score function g_i(t) to give the evaluation value S_i. Furthermore, the present invention applies not only to speech recognition in units of phonemes but also to recognition in units of syllables, demisyllables, words, and the like.

【００３２】以下に実験例を示す。音素バランス２１６
単語の奇数番号１０８単語を対象とした単語認識におい
て、語彙内単語として奇数番号１０８単語、語彙外単語
として偶数番号１０８単語の音声データを与え認識を行
った結果で評価を行った。探索中、廃棄の性能を評価す
る値として、語彙内の単語認識での認識率を全体の認識
率、語彙内の単語認識で“認識結果なし”と判定される
割合を誤棄却率、語彙外の単語認識で“認識結果が棄却
されない”割合を誤受理率、誤棄却率と誤受理率の平均
を誤判定率とした。つまり、認識率を保った状態で誤判
定率を低く抑えられる場合に廃棄の性能が良いと考えら
れる。
An experimental example is given below. In a word recognition task whose vocabulary was the 108 odd-numbered words of a phoneme-balanced set of 216 words, speech data for the 108 odd-numbered words were given as in-vocabulary words and speech data for the 108 even-numbered words as out-of-vocabulary words, and the recognition results were evaluated. The following values were used to evaluate the discarding performance during the search: the recognition rate for in-vocabulary words was taken as the overall recognition rate; the proportion of in-vocabulary words judged as "no recognition result" as the false rejection rate; the proportion of out-of-vocabulary words for which the recognition result was not rejected as the false acceptance rate; and the average of the false rejection rate and the false acceptance rate as the misjudgment rate. In other words, the discarding performance is considered good when the misjudgment rate can be kept low while the recognition rate is maintained.
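
The four measures defined above can be computed as in the following sketch. The data layout is an assumption made for illustration: each in-vocabulary trial is a pair (spoken word, recognizer output), each out-of-vocabulary trial is just the recognizer output, and `None` stands for "no recognition result". The function name is hypothetical.

```python
from typing import Dict, List, Optional, Tuple


def discard_metrics(
    in_vocab: List[Tuple[str, Optional[str]]],
    out_vocab: List[Optional[str]],
) -> Dict[str, float]:
    """Recognition, false rejection, false acceptance and misjudgment rates."""
    n_in, n_out = len(in_vocab), len(out_vocab)
    recognition = sum(1 for truth, result in in_vocab if result == truth) / n_in
    # In-vocabulary utterances for which every hypothesis was discarded.
    false_rejection = sum(1 for _, result in in_vocab if result is None) / n_in
    # Out-of-vocabulary utterances that still produced a result.
    false_acceptance = sum(1 for result in out_vocab if result is not None) / n_out
    return {
        "recognition_rate": recognition,
        "false_rejection_rate": false_rejection,
        "false_acceptance_rate": false_acceptance,
        "misjudgment_rate": (false_rejection + false_acceptance) / 2,
    }


trials_in = [("a", "a"), ("b", "c"), ("c", None), ("d", "d")]
trials_out = [None, "a", None, None]
print(discard_metrics(trials_in, trials_out))
```

The word labels in the example are made up and are not data from the experiment.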

【００３３】以上の評価を廃棄の強さを変化させて行っ
た。これには部分仮説を棄却するためのしきい値Ｌとし
て、時刻ｔに比例したθ・ｔを用い、θの値を変えるこ
とによって廃棄の強さを変えた。θの値が大きいほど強
い廃棄となる。音声データとしてはＡＴＲの音声データ
ベースのうちＭＡＵ，ＭＨＴ，ＦＡＦ，ＦＳＵの４人の
話者を評価に用いた。また、実験システムとしてＨＭＭ
−ＬＲ音声認識サーバを用いた。ただし、音響モデル
は、状態数３，混合分布数４で音素モデル数５４個の不
特定話者用環境独立型混合連続分布ＨＭＭで、音響学会
連続音声データベース９６００文より学習したものを使
用した。今回の実験では任意の音素の組み合わせの連鎖
を基準評価値用の仮説とし、その尤度関数を前向きヒュ
ーリスティック関数とした。
The above evaluation was carried out while varying the strength of discarding. The threshold L for rejecting a partial hypothesis was set to θ·t, proportional to time t, and the strength of discarding was varied by changing the value of θ. The larger the value of θ, the stronger the discarding. The speech data of four speakers, MAU, MHT, FAF, and FSU, from the ATR speech database were used for the evaluation. The HMM-LR speech recognition server was used as the experimental system. The acoustic models were speaker-independent, context-independent continuous mixture-density HMMs with 3 states, 4 mixture components, and 54 phoneme models, trained on 9,600 sentences of the Acoustical Society of Japan continuous speech database. In this experiment, chains of arbitrary phoneme combinations were used as the hypotheses for the reference evaluation value, and their likelihood function was used as the forward heuristic function.
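
The time-proportional threshold can be sketched as follows. The sketch assumes that each active partial hypothesis already carries a normalized evaluation value (for instance the difference S_i − S_O mentioned in the abstract) and that a hypothesis is discarded when this value is equal to or less than L = θ·t, as in claim 1. The function name, the dictionary layout and the numbers are hypothetical.

```python
from typing import Dict


def discard_hypotheses(
    normalized: Dict[str, float], t: int, theta: float
) -> Dict[str, float]:
    """Keep only hypotheses whose normalized value exceeds L = theta * t."""
    threshold = theta * t
    return {hyp: value for hyp, value in normalized.items() if value > threshold}


active = {"a": -1.0, "b": -12.0, "c": -5.0}
print(discard_hypotheses(active, t=10, theta=-2.0))  # threshold -20: all kept
print(discard_hypotheses(active, t=10, theta=-0.5))  # threshold -5: only "a" kept
```

Raising θ raises the threshold at every time t, so more hypotheses fall at or below it, which corresponds to stronger discarding.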

【００３４】図５に話者ＭＨＴの場合の動的廃棄の強さ
を変化させたときの認識性能、廃棄性能の変化を示す。
図での認識処理時間、照合回数は全探索でのそれぞれの
値を用いて正規化した値を示す。図からわかるように、
例えばθ＝０付近を見るとわかるように認識率を保った
状態で廃棄の効果がある。また照合回数が抑えられてお
り、不要な部分仮説の棄却が行われていることがわか
る。しかし、今回の単語認識実験では語彙が小さいた
め、ヒューリスティック関数を求めるための計算量が相
対的に大きくなり、全体の認識処理時間は全探索を行う
場合とほとんど変わらなかった。ただし、このヒューリ
スティック関数を用いて部分仮説の個数一定のビーム探
索を行う場合、同等の認識率を得るには全探索の1.２倍
程度の認識処理時間を必要とする。よって、この条件で
の実験でも、個数一定のビーム探索に比べ、この発明方
法の方が廃棄の機能があり、しかも認識処理時間が短い
結果となった。
FIG. 5 shows how the recognition performance and the discarding performance change for speaker MHT when the strength of dynamic discarding is varied. The recognition processing time and the number of matching operations in the figure are normalized by their respective values for a full search. As the figure shows, for example around θ = 0, discarding is effective while the recognition rate is maintained. The number of matching operations is also reduced, which shows that unnecessary partial hypotheses are being rejected. However, because the vocabulary in this word recognition experiment was small, the computation needed to obtain the heuristic function was relatively large, and the overall recognition processing time was almost the same as for a full search. On the other hand, a beam search that uses this heuristic function and keeps a fixed number of partial hypotheses needs about 1.2 times the recognition processing time of a full search to reach an equivalent recognition rate. Thus, even in the experiment under these conditions, the method of the present invention, compared with the fixed-size beam search, provided the discarding function and also a shorter recognition processing time.

【００３５】[0035]

【発明の効果】従来の部分仮説の評価値の絶対値が話
者、認識語彙数、入力音声長に依存するのに対し、この
発明では、部分仮説の評価値を同一入力音声から求めた
基準評価値により正規化しているため、話者、認識語彙
数、入力音声長に依存しない正規化評価値が得られ、探
索の過程での見込みのない部分仮説の廃棄を効果的に行
うことができる。これより、正規化評価値のためのしき
い値は同じ値で様々な用途に音声認識を利用でき、利用
者の設定の負担を減らすことができる。
[Effects of the Invention] Whereas the absolute value of the evaluation value of a partial hypothesis conventionally depends on the speaker, the vocabulary size, and the input speech length, in the present invention the evaluation value of a partial hypothesis is normalized by a reference evaluation value obtained from the same input speech. This yields a normalized evaluation value that does not depend on the speaker, the vocabulary size, or the input speech length, so unpromising partial hypotheses can be discarded effectively during the search. As a result, speech recognition can be used in a variety of applications with the same threshold for the normalized evaluation value, which reduces the user's burden of configuration.

【００３６】また、入力された音声が文法の許さない内
容の場合、従来の探索では文法内のもっとも近い候補で
ある間違った結果を出力することになり、利用者の発声
ミスと音声認識の誤認識との区別を示すことができなか
った。しかし、この場合この発明では、探索の過程で全
ての部分仮説が廃棄され、認識結果なしとなり、利用者
に発声の誤りを知らせることができる。利用者の発声ミ
スを早期に発見して示すことは実用の音声認識において
重要である。
Furthermore, when the input speech has content that the grammar does not allow, a conventional search outputs an incorrect result, namely the closest candidate within the grammar, so the user's utterance mistake could not be distinguished from a recognition error of the speech recognizer. In the present invention, by contrast, all partial hypotheses are discarded during the search in such a case, no recognition result is produced, and the user can be informed of the utterance mistake. Detecting and indicating a user's utterance mistake at an early stage is important in practical speech recognition.

【００３７】この発明の方法の効果を以下に列挙する。・探索の過程での見込みのない部分仮説の廃棄を効果的
に行える。・設定しなければならないしきい値は、話者、認識語彙
数、入力音声長に依存しないので、利用者の設定の負担
を減らすことができる。・入力された音声が文法の許さない内容の場合、探索の
過程で早期に認識が行えないことを検出でき、利用者の
発声ミスを知らせることができる。
The effects of the method of the present invention are listed below.
- Unpromising partial hypotheses can be discarded effectively during the search.
- The threshold that has to be set does not depend on the speaker, the vocabulary size, or the input speech length, which reduces the user's burden of configuration.
- When the input speech has content that the grammar does not allow, the fact that it cannot be recognized is detected early in the search, and the user can be informed of the utterance mistake.

[Brief description of drawings]

【図１】この発明方法の要部である部分仮説の正規化評
価値を求める手法の例を示す図。FIG. 1 is a diagram showing an example of a method of obtaining a normalized evaluation value of a partial hypothesis, which is a main part of the method of the present invention.

【図２】基準評価値のためのスコア関数計算方法２を説
明するためのトレリス計算にてなされた出力確率密度値
の例を示す図。FIG. 2 is a diagram showing an example of an output probability density value obtained by a trellis calculation for explaining a score function calculation method 2 for a reference evaluation value.

【図３】Ａは音素を認識の単位とした音声認識方法の処
理を示す図、Ｂは木構造によって表現される文法を示す
図である。FIG. 3A is a diagram showing a process of a speech recognition method in which a phoneme is a unit of recognition, and B is a diagram showing a grammar expressed by a tree structure.

【図４】トレリス計算の結果得られるスコア関数を示す
図。FIG. 4 is a diagram showing a score function obtained as a result of trellis calculation.

【図５】この発明方法について行った実験の結果を示す
図。FIG. 5 is a diagram showing the results of experiments conducted on the method of the present invention.

Claims

[Claims]

1. A speech recognition method in which, based on a tree-structured grammar composed of speech units, one or more hypotheses about the utterance content of an input speech are generated by successively concatenating and branching speech units; at each depth of speech units in the tree structure, the plausibility of the partial hypotheses up to that point with respect to the input speech is evaluated on the basis of acoustic models to obtain partial hypothesis evaluation values; and a recognition result is obtained from the plausibility, the method being characterized in that, at each depth of speech units in the tree structure, the evaluation value that would be obtained if the utterance content of the input speech were assumed to be the correct answer is estimated as a reference evaluation value, the evaluation value of each partial hypothesis at the corresponding depth is normalized by the reference evaluation value, and partial hypotheses whose normalized evaluation value is equal to or less than a threshold are discarded.

2. The speech recognition method according to claim 1, wherein hypotheses about the utterance content of the input speech are generated by successively concatenating speech units based on a grammar that contains said grammar, and the reference evaluation value is obtained by matching the input speech against the acoustic models corresponding to those partial hypotheses to obtain a score function.

3. The speech recognition method according to claim 1, wherein the reference evaluation value is obtained by matching the input speech against partial hypotheses corresponding to at least one reference-evaluation acoustic model that represents acoustic phenomena encompassing the recognition targets, to obtain a score function.

4. The speech recognition method wherein the partial hypothesis evaluation value is obtained by matching the input speech against the acoustic model corresponding to the partial hypothesis to obtain a score function.

5. The speech recognition method according to claim 4, wherein the acoustic model is a hidden Markov model.

6. The speech recognition method according to claim 5, wherein the maximum of all output probability density values of the hidden Markov models is obtained at each time, and the maxima are accumulated to calculate a score function for obtaining the reference evaluation value.

7. The speech recognition method according to claim 5, wherein, at each time, the maximum is selected from among the output probability values of the hidden Markov models that were calculated to obtain the partial hypothesis evaluation values, and the maxima are accumulated to give a score function for obtaining the reference evaluation value.

8. The speech recognition method according to claim 2, wherein the grammar that contains said grammar allows any combination of the acoustic models corresponding to speech units.

9. The speech recognition method according to claim 8, wherein a constraint of a phoneme array structure peculiar to Japanese is used for a combination of acoustic models corresponding to the speech units.

10. The speech recognition method according to claim 1, wherein the partial hypothesis evaluation values are obtained by matching the input speech against the acoustic models corresponding to the partial hypotheses to obtain score functions, and the reference evaluation value is obtained by taking the maximum of those score functions at each time.

11. The speech recognition method according to any one of claims 4 to 10, wherein a forward heuristic function common to all partial hypotheses is obtained, the difference between the score function of each partial hypothesis and the forward heuristic function is calculated, and the maximum of the difference is taken as the corresponding evaluation value.

12. The speech recognition method according to claim 11, wherein the score function obtained for the reference evaluation value is used as the forward heuristic function.