JP3104900B2

JP3104900B2 - Voice recognition method

Info

Publication number: JP3104900B2
Application number: JP07041948A
Authority: JP
Inventors: Yoshiaki Noda (野田 喜昭); Shigeki Sagayama (嵯峨山 茂樹)
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 1995-03-01
Filing date: 1995-03-01
Publication date: 2000-10-30
Anticipated expiration: 2015-10-30
Also published as: JPH08241096A

Abstract

PURPOSE: To discard ungrammatical utterances during the course of the search process. CONSTITUTION: Partial hypotheses are formed by appending phonemes and branching in accordance with a tree-structured grammar 41, and a score function g_i(t) is obtained for each partial hypothesis i by a trellis calculation that matches the corresponding HMM against the input speech. At the same time, the maximum of the scores of partial hypotheses formed without the grammar is obtained as a reference score function g_0(t). The maxima over time of the differences between g_i(t) and the forward heuristic function ĝ(t), and between g_0(t) and ĝ(t), are taken as the evaluation values S_i and S_0 respectively, and the search proceeds while discarding partial hypotheses for which S_i - S_0 is below a threshold.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Field of Industrial Application] The present invention relates to a speech recognition method in which, for each of the many partial hypotheses that can be formed by concatenating speech units such as phonemes, syllables, demisyllables, or words under the control of a given grammar, the corresponding acoustic model is matched against the input speech in order to search for candidates close to the input speech.

[0002]

[Prior Art] FIG. 3A shows the procedure of speech recognition processing that uses phonemes as the unit of recognition. Input speech 11 is converted by an analysis processing unit 12 into a time series of feature-parameter vectors, and a search processing unit 13 matches it against phoneme models 15 while applying the constraints of a grammar 16. The phoneme sequence with the highest evaluation value is then output as the recognition result 14.

[0003] Linear predictive analysis (Linear Predictive Coding, LPC) is often used as the signal processing in the analysis processing unit 12, and the feature parameters include the LPC cepstrum, the LPC delta cepstrum, the mel cepstrum, and logarithmic power. For the phoneme models 15, the hidden Markov model (Hidden Markov Model, hereinafter HMM), which is modeled on probability and statistical theory, is the mainstream. Details of the HMM are disclosed in, for example, Seiichi Nakagawa, "Speech Recognition by Stochastic Models" (確率モデルによる音声認識), edited by the Institute of Electronics, Information and Communication Engineers.

[0004] For each partial hypothesis, that is, each phoneme sequence that the grammar allows to be concatenated, the search processing unit 13 evaluates how plausibly its phoneme models match the input speech, and advances the search by extending the partial hypotheses one phoneme at a time. Here, a partial hypothesis is a phoneme sequence concatenated in accordance with the phoneme-ordering constraints given by the grammar, and extending a partial hypothesis by a phoneme means appending one more phoneme to its phoneme sequence in accordance with the grammar.

[0005] Three pieces of information are stored for each partial hypothesis: (1) its phoneme sequence; (2) its score function, which is the result of matching against the acoustic model by trellis calculation or the like; and (3) an evaluation value indicating the plausibility of the partial hypothesis for the input speech. With i denoting the identification number of a partial hypothesis and t the time, the score function is written g_i(t). The search processing unit 13 first extends a partial hypothesis by the first phoneme permitted by the grammar, matches the HMM corresponding to that phoneme against the analyzed time series of feature-parameter vectors (the input speech), and obtains the score function g_i(t) of this partial hypothesis i at each time t. The trellis method and the Viterbi method are available for matching against an HMM; details are disclosed in, for example, Seiichi Nakagawa, "Speech Recognition by Stochastic Models", edited by the Institute of Electronics, Information and Communication Engineers. An evaluation value for partial hypothesis i is obtained from g_i(t) by a method described later, and the phoneme sequence, the score function g_i(t), and the evaluation value are recorded for the partial hypothesis. Thereafter, each time a phoneme extension is made, the search proceeds while the evaluation value of the resulting partial hypothesis is obtained. When the grammar allows two or more different phonemes to extend the phoneme sequence of a partial hypothesis, the original partial hypothesis is duplicated once for each such phoneme, a partial hypothesis extended by each phoneme is created, and the evaluation values for them are computed. A partial hypothesis that can no longer be extended under the grammar stops being extended and is treated as a hypothesis whose phoneme sequence has been accepted by the grammar. When no partial hypothesis can be extended any further, every phoneme sequence permitted by the grammar has been matched against the input speech, and the search processing 13 ends. The phoneme sequence of the hypothesis with the highest evaluation value at that point, or the word or sentence corresponding to it, is output as the recognition result 14.

[0006] A search method that, as above, extends partial hypotheses so that all partial hypotheses (phoneme sequences) have the same number of phonemes is called a breadth-first search. Carrying out a breadth-first search as it stands means computing partial hypotheses for every phoneme sequence the grammar permits, so a very large number of partial hypotheses must be computed and a great deal of processing time is required. For this reason it is common, in the course of extending partial hypotheses, to keep only those partial hypotheses that have a prospect of becoming the final recognition result and to discard the rest. Specifically, whether to keep a partial hypothesis is decided from its evaluation value. Decision methods in use include keeping a fixed number of partial hypotheses in descending order of evaluation value, setting a threshold on the evaluation value and keeping only the partial hypotheses above it, and combining the two. A breadth-first search that, under some such condition, keeps only the promising partial hypotheses and discards the others is called a beam search.

[0007] To explain this concretely, take as an example a search using HMMs over a grammar expressed as a tree structure such as that shown in FIG. 3B, and suppose that the search has already finished processing up to the fourth phoneme and is about to extend to the fifth phoneme. In FIG. 3B, the partial hypotheses extended from the first phoneme # up to the fourth phoneme are the three sequences "# i k a", "# i k i", and "# i m i". Here the symbol separating the phonemes (rendered as a space in this text) marks a phoneme boundary, and the phoneme # denotes silence.

[0008] As FIG. 3B shows, the partial hypothesis "# i k i", which starts with # as its first phoneme and has been extended to the fourth phoneme, can be extended by three different fifth phonemes: k, o, and m. Another partial hypothesis that starts with # and has been extended to the fourth phoneme, "# i k a", can be extended by two different fifth phonemes: m and n. The partial hypothesis "# i m i" is complete at the fourth phoneme and is not extended further.
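The extension step described for FIG. 3B can be sketched as follows. This is only an illustration: the dictionary encoding `GRAMMAR` and the helper names `extend` and `extend_all` are assumptions made here, and only the part of the tree that the text describes (up to the fifth phoneme) is encoded.

```python
# Hypothetical encoding of the part of the FIG. 3B tree described in the text.
# Each key is a phoneme sequence; each value lists the phonemes the grammar
# allows next. Levels beyond the fifth phoneme are not described and are omitted.
GRAMMAR = {
    ("#",): ["i"],
    ("#", "i"): ["k", "m"],
    ("#", "i", "k"): ["a", "i"],
    ("#", "i", "m"): ["i"],
    ("#", "i", "k", "a"): ["m", "n"],
    ("#", "i", "k", "i"): ["k", "o", "m"],
    ("#", "i", "m", "i"): [],  # complete at the fourth phoneme
}


def extend(hypothesis):
    """Return one extended copy of `hypothesis` per phoneme the grammar allows next."""
    return [hypothesis + (p,) for p in GRAMMAR.get(hypothesis, [])]


def extend_all(hypotheses):
    """One breadth-first step: extend every partial hypothesis by one phoneme."""
    extended = []
    for h in hypotheses:
        extended.extend(extend(h))
    return extended


fourth = [("#", "i", "k", "a"), ("#", "i", "k", "i"), ("#", "i", "m", "i")]
fifth = extend_all(fourth)
print(len(fifth))  # 5
```

The three fourth-phoneme hypotheses yield five fifth-phoneme hypotheses, and "# i m i" contributes none because it cannot be extended.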

[0009] In a beam search that discards unpromising partial hypotheses at each phoneme depth of the tree-structured grammar, evaluation values are obtained for the partial hypotheses having the same number of phonemes, and only the partial hypotheses with good evaluation values are kept under a fixed condition. Here the condition is taken to be that only the two partial hypotheses with the highest evaluation values are kept. As described above, there are five partial hypotheses extended to the fifth phoneme: "# i k i o", "# i k i k", "# i k i m", "# i k a m", and "# i k a n". If their evaluation values decrease in this order, only the top two, "# i k i o" and "# i k i k", are kept as partial hypotheses that can be extended by the next phoneme, and the others are discarded.

[0010] In this way, partial hypotheses are extended by phonemes, the partial hypotheses to be kept are limited by a fixed condition, the surviving partial hypotheses are extended by further phonemes, and the same processing continues until no partial hypothesis can be extended. The evaluation values of all partial hypotheses that can no longer be extended, that is, of the hypotheses, are then compared, and the hypothesis with the highest evaluation value is output as the recognition result.
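The two conventional pruning rules mentioned above (keep a fixed number, or keep those above a threshold) can be sketched as below. The function names and the numeric evaluation values are hypothetical; the values are chosen only so that they decrease in the order given in the text.

```python
def prune_top_k(scored, k):
    """Keep the k partial hypotheses with the highest evaluation values."""
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


def prune_threshold(scored, threshold):
    """Keep only the partial hypotheses whose evaluation value exceeds the threshold."""
    return [(h, s) for h, s in scored if s > threshold]


# Hypothetical evaluation values for the five fifth-phoneme hypotheses.
scored = [
    ("# i k i o", -1.0),
    ("# i k i k", -2.5),
    ("# i k i m", -4.0),
    ("# i k a m", -6.0),
    ("# i k a n", -7.5),
]

survivors = prune_top_k(scored, k=2)
print([h for h, _ in survivors])  # ['# i k i o', '# i k i k']
```

With k = 2 the survivors match the example in the text. The two rules can also be combined by applying one function to the output of the other.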

[0011] One method of obtaining the evaluation value of a partial hypothesis from its score function g_i(t) is to first obtain a forward heuristic function ĝ(t), which is estimated in the forward direction from the start of the speech and is common to all partial hypotheses, then take the difference between it and the score function g_i(t) of partial hypothesis i, and use the value corresponding to the maximum of that difference over time t as the evaluation value S_i of partial hypothesis i. (Details of this method are disclosed in, for example, Yoshiaki Noda and Shigeki Sagayama, "HMM-LR Speech Recognition by A* Beam Search Using Forward Likelihood", IEICE Technical Report (Speech), SP94-23, 1994, and in Japanese Patent Application No. 6-133339, "Sound Recognition Method".)

[0012] As a concrete example of how the evaluation value of a partial hypothesis is obtained, the calculation performed when the partial hypothesis "# i k i", extended to the fourth phoneme, is extended by the phoneme o is described with reference to FIG. 4. FIG. 4 shows the score functions obtained by the trellis calculation, which matches a phoneme sequence against the input speech, as a three-dimensional diagram with three axes: phoneme sequence, input speech, and score. Curve 31 is the score function g_i4(t) of the partial hypothesis "# i k i". Its value g_i4(t_1) at time t_1 is the score indicating the plausibility under the assumption that this partial hypothesis (phoneme sequence) was uttered in the shortest time, ending by time t_1, and its value g_i4(t_2) at time t_2 is the score indicating the plausibility under the assumption that this partial hypothesis was uttered in the longest time, ending by time t_2. A time t_3 is determined from t_1, t_2, and the duration of the phoneme o, and curve 32 strings together, over that interval, the plausibility (score) under the assumption that the input speech up to each time is an utterance of the phoneme sequence "# i k i o". In other words, curve 32 is the score function g_i5(t) of the input speech for the partial hypothesis "# i k i o". That is, the score function 31 of the partial hypothesis "# i k i" has already been computed; taking its likelihood at each time as the initial value, the trellis calculation accumulates the score of the phoneme o at each time to obtain the score function 32 of "# i k i o".

[0013] The trellis calculation matches the HMM representing the acoustic model against the time series of feature-parameter vectors obtained by analyzing the input speech. The probability of the vector time series is computed over all HMM transitions that reach the final state of the HMM at time t, which yields the probability value at time t. Here the logarithm of that probability value is used as the score (likelihood).
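A minimal sketch of a trellis (forward) calculation in the log domain is given below, assuming a single left-to-right HMM with one shared self-loop probability and one shared advance probability. The function names, the argument layout, and these simplifications are assumptions for illustration; the patent itself chains phoneme HMMs and seeds each one with the previous hypothesis's score function.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) for a list of log values."""
    m = max(values)
    if m == -math.inf:
        return -math.inf
    return m + math.log(sum(math.exp(v - m) for v in values))


def trellis_scores(log_emit, log_stay, log_next):
    """Score function of one left-to-right HMM.

    log_emit[t][s] is the log output probability of frame t in state s.
    Returns g, where g[t] is the log probability, summed over all state
    paths, of emitting frames 0..t and being in the last state at frame t.
    """
    n_states = len(log_emit[0])
    alpha = [-math.inf] * n_states
    alpha[0] = log_emit[0][0]  # every path starts in the first state
    g = [alpha[-1]]
    for t in range(1, len(log_emit)):
        new_alpha = []
        for s in range(n_states):
            candidates = [alpha[s] + log_stay]  # stay in state s
            if s > 0:
                candidates.append(alpha[s - 1] + log_next)  # advance from s-1
            new_alpha.append(logsumexp(candidates) + log_emit[t][s])
        alpha = new_alpha
        g.append(alpha[-1])
    return g


# Two states, three frames, all output probabilities 1, transitions 0.5 each.
g = trellis_scores([[0.0, 0.0]] * 3, math.log(0.5), math.log(0.5))
```

In this toy case the final state cannot be reached at frame 0, so `g[0]` is negative infinity, while `g[1]` and `g[2]` both equal log 0.5.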

[0014] Next, to obtain the evaluation value of a partial hypothesis, a forward heuristic function ĝ(t) is obtained. It is estimated from the start of the speech, is common to all partial hypotheses, and is computed with no grammar (there is no grammatical constraint, and extension by any phoneme is allowed). As in equation (1) below, ĝ(t) is subtracted from the score function g_i(t) of the partial hypothesis and the maximum S_i of the difference is taken. S_i indicates the plausibility of partial hypothesis i, and using it as the evaluation value of partial hypothesis i gives an evaluation value that is normalized with respect to time.

[0015]

$$S_i = \max_t \{\, g_i(t) - \hat{g}(t) \,\} \tag{1}$$

where max denotes the maximum over t of the expression in braces.

Note that a search with no grammar yields an evaluation value close to that of the correct answer, but the number of partial hypotheses becomes extremely large and many of them have almost the same evaluation value, which makes selection difficult. The search is therefore carried out under the constraints of the grammar, as described above.
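Equation (1) can be sketched directly on sampled score functions. In this illustration the score functions are plain lists indexed by frame, and negative infinity marks frames where the hypothesis has no score yet; both conventions, and the numbers, are assumptions made here.

```python
import math


def evaluation_value(g_i, g_hat):
    """Equation (1): the maximum over t of g_i(t) - g_hat(t).

    Frames where the hypothesis has no score yet (negative infinity) are skipped.
    """
    diffs = [gi - gh for gi, gh in zip(g_i, g_hat) if gi != -math.inf]
    return max(diffs) if diffs else -math.inf


# Hypothetical log-likelihood values for four frames.
g_hat = [-1.0, -2.0, -3.0, -4.0]
g_i = [-math.inf, -2.5, -3.25, -5.0]
print(evaluation_value(g_i, g_hat))  # -0.25
```

Because both functions accumulate log likelihood over the same frames, their difference stays in a comparable range regardless of how many frames have elapsed, which is the time normalization the text refers to.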

[0016]

[Problems to Be Solved by the Invention] In speech recognition, reducing the amount of search processing shortens the recognition time and makes speech recognition easier to use in practice. It also makes it possible to run speech recognition practically on computers with low processing power. To reduce the amount of search processing, it is necessary to discard unpromising partial hypotheses in the course of the search and so reduce the number of partial hypotheses to be extended. However, in a conventional beam search that keeps a fixed number of partial hypotheses with high evaluation values, even if the kept set contains partial hypotheses with small evaluation values, that is, partial hypotheses that cannot become a plausible recognition result, they are not discarded and wasted processing is performed. In a beam search that sets a threshold and keeps the partial hypotheses whose evaluation values exceed it, partial hypotheses with small evaluation values are discarded; but because evaluation values are in general strongly affected by the recognition vocabulary size, the speaker, and the input speech length, it is difficult to set a threshold that discards partial hypotheses effectively without losing the correct one.

[0017] In other words, the evaluation values computed by the conventional method are useful for comparing partial hypotheses with one another, but because they are strongly affected by the recognition vocabulary size, the speaker, and the input speech length, it is difficult to judge a partial hypothesis by the absolute value itself.

[0018]

[Means for Solving the Problems] According to this invention, in the course of the search, that is, at each depth of speech units (phonemes, syllables, demisyllables, words, etc.) in the tree-structured grammar, the evaluation value that would be obtained if the uttered content of the input speech were the correct answer is estimated and used as a reference evaluation value. The evaluation value obtained in the conventional manner, by concatenating speech units under the constraints of the grammar and matching against the acoustic model, is normalized by this reference evaluation value, and partial hypotheses whose normalized evaluation value is at or below a threshold are discarded.

[0019] This normalization removes the influence of the recognition vocabulary size, the speaker, the input speech length, and the like from the evaluation values of partial hypotheses, so that unpromising partial hypotheses can be reliably discarded in the course of the search. Using the normalized evaluation value thus raises search efficiency and reduces the amount of search processing.

[0020]

[Embodiments] Embodiments of this invention are described below. As in the prior art, the input speech is analyzed to obtain a time series of feature-parameter vectors. An embodiment is described with reference to FIG. 1 in which the invention is applied to the case where the speech unit by which partial hypotheses are extended is the phoneme, the search is a phoneme-synchronous beam search in which all partial hypotheses have the same number of phonemes, and the acoustic models are HMMs. A phoneme extension processing unit 42 extends partial hypothesis i by a phoneme using the constraints of a grammar 41, and a trellis calculation processing unit 43 matches the HMMs corresponding to the phoneme sequence against the input speech. From the resulting score function g_i(t) of partial hypothesis i, an evaluation value calculation processing unit 47 obtains the evaluation value S_i of partial hypothesis i. The conventional method performs a beam search that keeps a fixed number of partial hypotheses with high evaluation values S_i and discards the rest. In this invention, by contrast, a score function calculation processing unit 45 obtains a score function g_0(t) for the reference evaluation value by a method described later, and an evaluation value calculation processing unit 48 obtains the reference evaluation value S_0 in the same manner as above. Next, the difference between the evaluation value S_i of partial hypothesis i and the reference evaluation value S_0 (the normalized evaluation value S_i′ of partial hypothesis i) is obtained, partial hypotheses for which this difference is large in magnitude are discarded as unpromising, and the search proceeds.

[0021] To explain concretely with the example of FIG. 3B, there are five partial hypotheses obtained by extending the fourth-phoneme partial hypotheses by one phoneme: "# i k i o", "# i k i k", "# i k i m", "# i k a m", and "# i k a n". Let each of these be a partial hypothesis i, let S_i be the evaluation value of partial hypothesis i, and let S_0 be the reference evaluation value. The normalized evaluation value S_i′ of partial hypothesis i is then given by equation (2) below.

[0022]

$$S_i' = S_i - S_0 \tag{2}$$

Suppose the input speech was actually uttered as "ikioi" (いきおい). The partial hypothesis "# i k i o" is then closest to the correct answer and has a high evaluation value, whereas a partial hypothesis far from the correct answer, such as "# i k a m", has a small evaluation value. The reference evaluation value is the estimated evaluation value under the assumption that the content of the input speech is the correct answer, and is obtained, for example, with no grammar. Since it is obtained with no grammatical constraint and with every combination of acoustic models allowed, matching is always performed against the phoneme sequence identical to the content of the input speech, or one close to it, and that phoneme sequence should be the combination giving the highest evaluation value. The reference evaluation value is therefore close to the evaluation value of the partial hypothesis "# i k i o". Consequently, the normalized evaluation value S_i′ is close to 0 for partial hypotheses near the correct answer and takes a large negative value for partial hypotheses far from it. This tendency of S_i′ depends little on the speaker: since S_0 and S_i are both derived from the same input speech, the speaker characteristics contained in both are subtracted out in the normalized evaluation value. For the same reason, the tendency does not depend on the input speech length either. Furthermore, in a beam search that keeps a fixed number of partial hypotheses, the number kept must be changed according to the recognition vocabulary size; but because the evaluation values themselves do not change when the vocabulary size changes, the normalized evaluation value S_i′ is also little affected by the vocabulary size.

[0023] When partial hypotheses with low normalized evaluation values S_i′ are discarded in the beam search, a threshold L is set and partial hypotheses with S_i′ < L are discarded. L may be a constant, or a value that depends on the duration of the partial hypothesis; in the above example, for instance, the longer the partial hypothesis, the larger the negative value to which L may be set. When equation (1) is used as the calculation method in the evaluation value calculation processing units 47 and 48 in FIG. 1, and ĝ(t) in equation (1) is equal to the score function g_0(t) for the reference evaluation value, the normalized evaluation value S_i′ can be obtained using equation (3) below, where g_i(t) is the score function of partial hypothesis i and g_0(t) is the score function for the reference evaluation value. Equation (3) greatly reduces the amount of computation needed for the normalized evaluation value S_i′.
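The normalization of equation (2) and the threshold test S_i′ < L can be sketched as follows. The function names, the evaluation values, the reference value, and the threshold are all hypothetical; they are chosen so that hypotheses near "# i k i o" land close to zero.

```python
def normalized_values(evals, s0):
    """Equation (2): S_i' = S_i - S_0 for every partial hypothesis."""
    return {h: s - s0 for h, s in evals.items()}


def prune(evals, s0, threshold):
    """Discard partial hypotheses whose normalized evaluation value is below L."""
    return [h for h, s in normalized_values(evals, s0).items() if s >= threshold]


# Hypothetical evaluation values S_i and reference evaluation value S_0.
evals = {
    "# i k i o": -10.0,
    "# i k i k": -14.0,
    "# i k i m": -15.5,
    "# i k a m": -30.0,
    "# i k a n": -28.0,
}
s0 = -9.5

print(prune(evals, s0, threshold=-8.0))
# ['# i k i o', '# i k i k', '# i k i m']
```

Unlike a fixed beam width, the number of survivors here varies with how well the hypotheses match the input. When ĝ(t) is taken equal to g_0(t), as the text notes, S_0 = max_t {g_0(t) - g_0(t)} = 0, so S_i′ reduces to S_i and no separate subtraction is needed.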

[0024]

$$S_i' = \max_t \{\, g_i(t) - g_0(t) \,\} \tag{3}$$

where max denotes the maximum over t of the expression in braces.

Methods of obtaining the score function g_0(t) for the reference evaluation value S_0 are given below.

<Score function calculation method 1 for the reference evaluation value> Each phoneme HMM usually has about three states, and each state has an output probability density distribution that is a weighted sum of several probability density functions. The feature parameters of the input speech at each time are applied to all the output probability density distributions, the highest output probability density value is selected, and its logarithm is taken as the maximum likelihood for that time. The cumulative value of this maximum likelihood as time advances is obtained and used as the score function for the reference evaluation value. With O_τ the feature parameters at time τ and p_j(O_τ) the output probability density value obtained by applying those feature parameters to output probability density distribution j, g_0(t) is given by equation (4).

[0025]

$$g_0(t) = \sum_{\tau=0}^{t} \max_j \, p_j(O_\tau) \tag{4}$$

where Σ runs from τ = 0 to t and max is the maximum of p_j(O_τ) over all j.

Normally a transition from one HMM to another is made under the condition that it goes from the final state of one HMM to the initial state of the other. This score function, however, is the score function that results from a Viterbi calculation in which that transition condition and the grammatical constraints are both removed, transitions from any state of any HMM to any state of any HMM are allowed, and the transition probabilities are set to 1. As the search proceeds, most of the p_j(O_τ) values have already been computed by the trellis calculation during the search, so those results can be reused and little additional computation is needed.
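Calculation method 1 can be sketched as a running sum of the per-frame maximum log density. Equation (4) as printed does not show the logarithm, so taking the log here follows the prose description of method 1 and is an interpretation; the function name and the density values are hypothetical.

```python
import math


def reference_score_method1(densities):
    """Reference score function by method 1.

    densities[t][j] is the output probability density of frame t under
    distribution j, for every distribution of every HMM state.
    Returns g0, where g0[t] is the sum over tau <= t of log(max_j densities[tau][j]).
    """
    g0 = []
    total = 0.0
    for frame in densities:
        total += math.log(max(frame))  # best distribution for this frame
        g0.append(total)
    return g0


# Hypothetical densities: three frames, two distributions.
densities = [[0.5, 0.25], [0.125, 0.5], [1.0, 0.25]]
g0 = reference_score_method1(densities)
```

Because the maximum is taken independently at every frame, no state sequence is enforced, which corresponds to the unconstrained Viterbi calculation with transition probability 1 described in the text.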

【００２６】＜基準評価値のためのスコア関数の計算方
法２＞前記計算方法１においては、全ＨＭＭの全ての状
態の出力確率密度分布から得られる出力確率密度値の最
大値から求めたが、この計算方法２では探索処理の過程
で現在までにトレリス計算によって計算済みの全ての出
力確率密度分布に対する出力確率密度値の最大値から求
める。例えば図２に示すように、各ＨＭＭの各状態の出
力密度分布ｐ₁，ｐ₂，ｐ₃…，を縦軸に、横軸に時刻
ｔをとると、前記図３Ｂの例では先ず無音＃のＨＭＭの
各状態の出力密度分布に対する出力確率密度値が予測さ
れる無音長について、この例では時刻０から３まで計算
され（この計算値が埋められた領域を５１で示す）、最
も短い無音の終了時刻１の次の時刻２から最も長い無音
の終了時刻３の次の時刻４より次の音素ｉのＨＭＭの各
状態の出力密度分布に対する出力確率密度値がそれぞれ
計算される。その計算値が埋められた領域を５２で示
す。同様にして音素ｋのＨＭＭの各状態の出力確率密度
<Method 2 of calculating the score function for the reference evaluation value> In calculation method 1 above, the score function was obtained from the maximum of the output probability density values over the output probability density distributions of all states of all HMMs. In calculation method 2, it is obtained from the maximum of the output probability density values over only those output probability density distributions that the trellis calculation has already evaluated up to the current point in the search. For example, FIG. 2 plots the output probability density distributions p₁, p₂, p₃, ... of the states of each HMM on the vertical axis and time t on the horizontal axis. In this example, for the silence #, the output probability density values for the distributions of the states of its HMM are calculated from time 0 to time 3, the range of expected silence lengths (the region filled with calculated values is indicated by 51). For the next phoneme i, the output probability density values for the distributions of the states of its HMM are calculated from time 2, which follows end time 1 of the shortest silence, to time 4, which follows end time 3 of the longest silence (the region filled with calculated values is indicated by 52). Similarly, the output probability density values of the states of the HMM of phoneme k are calculated as region 53 in FIG. 2. As the search advances these calculations, the maximum of the output probability density values calculated at each time 0, 1, 2, ... in FIG. 2 is found, and these maxima are added up successively to give g₀(t). Because the score function g₀(t) is then calculated from output probability density distributions that were constrained by the grammar during the search, a score function closer to the actual grammar is obtained. Moreover, since only output probability density values already computed in the trellis calculation are used, almost no additional computation is needed for g₀(t). Even with this method, the output probability density values of the portions not constrained by the grammar are considered to be mostly smaller than those obtained in the trellis calculation, so the function g₀(t) is estimated correctly.
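
The accumulation in calculation method 2 can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation described in the patent: the values are assumed to be log output probability densities (so that adding them is meaningful), stored in a hypothetical array `log_density[s, t]` with `NaN` marking cells that the trellis calculation never evaluated. The function name and array layout are invented for the example.

```python
import numpy as np

def reference_score_from_trellis(log_density):
    """Running sum over time of the per-frame maximum of the computed cells.

    log_density: 2-D array [distribution, time]; NaN marks cells that the
    trellis calculation never evaluated (assumed layout, for illustration).
    """
    log_density = np.asarray(log_density, dtype=float)
    # Every frame must have at least one value computed by the search.
    if np.isnan(log_density).all(axis=0).any():
        raise ValueError("every frame needs at least one computed value")
    frame_max = np.nanmax(log_density, axis=0)  # best computed value per frame
    return np.cumsum(frame_max)                 # g0(t): maxima added successively
```

Only cells that already exist are read, which mirrors the point made above that the reference score function costs almost nothing extra.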

【００２７】[0027]

<Method 3 of calculating the score function for the reference evaluation value> As described for the breadth-first (horizontal) search method, phonemes are appended to partial hypotheses and a score function is obtained by matching such as the trellis calculation. Here, phonemes are appended under a grammar that allows any phoneme to follow each partial hypothesis, that is, with no grammar, and the maximum at each time of the score functions obtained by matching the corresponding acoustic models against the input speech is used as the score function for the reference evaluation value. The transition constraints of the HMMs are retained in this case. This method imposes stronger grammar-like constraints than the two methods above, so it yields a more accurate normalized evaluation value S_i′, but it also requires more computation.
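
One way to picture calculation method 3 is a Viterbi-style trellis over a loop of phoneme HMMs in which any phoneme may follow any other, while the left-to-right transitions inside each HMM are kept. The sketch below is a simplified illustration, not the patent's procedure: the array layout `log_b[t, s]`, the equal number of states per phoneme, and the fixed transition log probabilities `log_stay` and `log_move` are all assumptions made for the example.

```python
import numpy as np

def free_phoneme_reference_score(log_b, states_per_phone,
                                 log_stay=np.log(0.5), log_move=np.log(0.5)):
    """Best trellis score at each frame when any phoneme may follow any other.

    log_b[t, s]: log output density of state s at frame t. States are laid
    out phoneme by phoneme, each phoneme a left-to-right HMM with
    states_per_phone states (assumed layout, for illustration).
    """
    log_b = np.asarray(log_b, dtype=float)
    n_frames, n_states = log_b.shape
    if n_states % states_per_phone:
        raise ValueError("state count must be a multiple of states_per_phone")
    first = np.arange(0, n_states, states_per_phone)  # entry state of each phoneme
    last = first + states_per_phone - 1               # exit state of each phoneme
    score = np.full(n_states, -np.inf)
    score[first] = log_b[0, first]                    # any phoneme may start
    g0 = [score.max()]
    for t in range(1, n_frames):
        stay = score + log_stay                       # self-loop
        move = np.full(n_states, -np.inf)
        move[1:] = score[:-1] + log_move              # advance inside a phoneme
        move[first] = score[last].max() + log_move    # no grammar: any successor
        score = np.maximum(stay, move) + log_b[t]
        g0.append(score.max())                        # maximum over hypotheses
    return np.array(g0)
```

Keeping the within-phoneme transitions is what makes this stricter than taking a plain per-frame maximum of output densities, at the cost of a full trellis pass.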

【００２８】[0028]

<Method 4 of calculating the score function for the reference evaluation value> In calculation method 3, instead of a grammar that allows any phoneme to be appended, the likelihood is calculated under a grammar that allows only phoneme sequences conforming to the phoneme sequence structure peculiar to Japanese, and the resulting score function is used as the forward heuristic function. Phoneme sequences conforming to the Japanese phoneme sequence structure are, for example, "o m o sh i r o i" and "s u t o r a i k u", which obey the constraint that a consonant is generally not followed by another consonant. The phoneme chain "s t r ai k" satisfies the phoneme sequence structure of English but not that of Japanese.
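
The phoneme sequence constraint of calculation method 4 can be illustrated with a small check on the examples given above. This is a deliberately simplified sketch: it implements only the stated rule that a consonant is not followed by another consonant, treats any symbol made up solely of the letters a, i, u, e, o as a vowel, and ignores exceptions that the word "generally" leaves room for. The function names are hypothetical.

```python
def is_vowel(phoneme):
    """Treat a symbol as a vowel if it consists only of a, i, u, e, o."""
    return all(ch in "aiueo" for ch in phoneme)

def obeys_japanese_phonotactics(phonemes):
    """Return True if no consonant is directly followed by another consonant."""
    consonant = [not is_vowel(p) for p in phonemes]
    return not any(a and b for a, b in zip(consonant, consonant[1:]))

# Examples taken from the text above.
print(obeys_japanese_phonotactics("o m o sh i r o i".split()))
print(obeys_japanese_phonotactics("s t r ai k".split()))
```

A grammar built from such a rule admits fewer phoneme chains than the unconstrained loop of method 3, which is why it sits between the free grammar and the recognition grammar.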

【００２９】[0029]

In calculation methods 3 and 4, the grammar used to append phonemes when calculating g₀(t) can be regarded as a grammar that includes the grammar used to generate the partial hypotheses for the search. <Method 5 of calculating the score function for the reference evaluation value> The partial hypothesis that is ultimately correct often has a score function larger than those of the other partial hypotheses. Therefore, the maximum at each time of the score functions g₁(t), g₂(t), g₃(t), ... of all partial hypotheses calculated during the search is taken as g₀(t). This is expressed as follows.

【００３０】[0030]

$$g_0(t) = \max_i g_i(t) \tag{5}$$

Here the maximum is taken over all partial hypotheses i. This calculation method requires almost no additional computation for g₀(t). <Method 6 of calculating the score function for the reference evaluation value> The calculation of the score function g₀(t) for obtaining the reference evaluation value S_O does not need to discriminate phonemes; it only needs to produce a score. It is therefore unnecessary to use an HMM for each phoneme, and, as shown by the dotted line in FIG. 1, an acoustic model 46 separate from the acoustic models 15 used for recognition may be used. The acoustic model 46 may consist of, for example, one or a few acoustic models that are given a large number of states so that they can represent all acoustic phenomena encompassing the recognition targets. g₀(t) may then be obtained by using the single acoustic model repeatedly or, in the case of several acoustic models, by selecting and concatenating them freely, matching them against the input speech, and taking the most plausible result.
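
Equation (5) of calculation method 5 amounts to a per-frame maximum over the score functions that the search has already produced. A minimal sketch, assuming the score functions are stacked in a hypothetical array `scores[i, t]` with one row per partial hypothesis and all rows covering the same frames:

```python
import numpy as np

def reference_score_from_hypotheses(scores):
    """g0(t) = max over i of g_i(t), for scores[i, t] (assumed layout)."""
    return np.max(np.asarray(scores, dtype=float), axis=0)

g = [[-1.0, -3.0, -6.0],   # g_1(t)
     [-2.0, -2.5, -7.0]]   # g_2(t)
print(reference_score_from_hypotheses(g))
```

Since the rows are by-products of the search itself, no additional matching against the input speech is involved.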

【００３１】[0031]

Description of partial modifications: In the above, a forward heuristic function was obtained in order to compute the evaluation values. Alternatively, as shown in Minami et al., "Large-vocabulary continuous speech recognition algorithm for directory assistance," Transactions of the IEICE A, vol. J77-A, No. 2, pp. 190-197, 1994, an estimated likelihood function ĥ(t) common to all hypotheses, estimated backward from the end of the speech, may be obtained in advance and added to the score function g_i(t) to give the evaluation value S_i. Furthermore, the invention applies not only to speech recognition with phonemes as the unit but also to recognition with syllables, demisyllables, words, and the like as the unit.

【００３２】[0032]

An experimental example is described below. In word recognition targeting the 108 odd-numbered words of a 216-word phonetically balanced set, speech data for the 108 odd-numbered words were given as in-vocabulary words and speech data for the 108 even-numbered words as out-of-vocabulary words, and the method was evaluated on the recognition results. As measures of discarding performance during the search, the recognition rate on in-vocabulary words is defined as the overall recognition rate, the proportion of in-vocabulary inputs judged "no recognition result" as the false rejection rate, the proportion of out-of-vocabulary inputs for which "the recognition result is not rejected" as the false acceptance rate, and the mean of the false rejection rate and the false acceptance rate as the misjudgment rate. In other words, discarding performance is considered good when the misjudgment rate can be kept low while the recognition rate is maintained.
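
The four rates defined above can be computed as in the sketch below. The data layout is an assumption made for illustration: each in-vocabulary result is a pair of the spoken word and the recognizer output, each out-of-vocabulary result is just the recognizer output, and `None` stands for "no recognition result". The function name is hypothetical.

```python
def discard_metrics(in_vocab_results, out_vocab_results):
    """Compute recognition, false rejection, false acceptance, misjudgment rates.

    in_vocab_results: list of (spoken_word, recognized_word_or_None)
    out_vocab_results: list of recognized_word_or_None
    """
    n_in, n_out = len(in_vocab_results), len(out_vocab_results)
    recognition = sum(rec == word for word, rec in in_vocab_results) / n_in
    false_rejection = sum(rec is None for _, rec in in_vocab_results) / n_in
    false_acceptance = sum(rec is not None for rec in out_vocab_results) / n_out
    return {
        "recognition_rate": recognition,
        "false_rejection_rate": false_rejection,
        "false_acceptance_rate": false_acceptance,
        # Mean of the two error rates, as defined in the text.
        "misjudgment_rate": (false_rejection + false_acceptance) / 2,
    }
```

Averaging the two error rates keeps the measure from rewarding a system that simply rejects everything or accepts everything.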

【００３３】[0033]

The above evaluation was performed while varying the strength of discarding. To do this, θ·t, proportional to time t, was used as the threshold L for rejecting partial hypotheses, and the strength of discarding was changed by varying θ. The larger the value of θ, the stronger the discarding. As speech data, four speakers, MAU, MHT, FAF, and FSU, from the ATR speech database were used for evaluation. The HMM-LR speech recognition server was used as the experimental system. The acoustic models were speaker-independent, context-independent mixture continuous-density HMMs with 3 states, 4 mixture components, and 54 phoneme models, trained on the 9,600 sentences of the Acoustical Society of Japan continuous speech database. In this experiment, chains of arbitrary phoneme combinations were used as the hypotheses for the reference evaluation value, and their likelihood function was used as the forward heuristic function.
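
The time-proportional threshold can be sketched as follows. The sketch assumes, following the abstract and claim 1, that the normalized evaluation value is the difference S_i − S_O and that a partial hypothesis is discarded when this value is at or below L = θ·t. The function names, the dictionary layout, and the numbers are hypothetical.

```python
def should_discard(s_i, s_o, t, theta):
    """Discard when the normalized value S_i - S_O is at or below L = theta * t."""
    return (s_i - s_o) <= theta * t

def prune(evaluation_values, s_o, t, theta):
    """Keep only the partial hypotheses that survive the threshold."""
    return {name: s_i for name, s_i in evaluation_values.items()
            if not should_discard(s_i, s_o, t, theta)}

values = {"hyp_a": -10.0, "hyp_b": -25.0}          # illustrative S_i values
print(prune(values, s_o=-8.0, t=10, theta=-1.0))   # weaker discarding
print(prune(values, s_o=-8.0, t=10, theta=-0.1))   # stronger discarding
```

With the larger θ both hypotheses fall at or below the threshold and the result is empty, which corresponds to the "no recognition result" outcome used to flag out-of-grammar input.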

【００３４】[0034]

FIG. 5 shows how recognition performance and discarding performance change for speaker MHT as the strength of dynamic discarding is varied. The recognition processing time and the number of matching operations in the figure are normalized by their respective values for a full search. As the figure shows, for example around θ = 0, discarding is effective while the recognition rate is maintained. The number of matching operations is also reduced, which shows that unnecessary partial hypotheses are being rejected. In this word recognition experiment, however, the vocabulary was small, so the computation needed to obtain the heuristic function was relatively large and the overall recognition processing time was almost the same as for a full search. On the other hand, a beam search that uses this heuristic function with a fixed number of partial hypotheses needs about 1.2 times the recognition processing time of a full search to reach an equivalent recognition rate. Thus, even under these experimental conditions, the method of the invention provides a discarding function and a shorter recognition processing time than the beam search with a fixed number of hypotheses.

【００３５】[0035]

[Effects of the invention] Whereas the absolute value of a conventional partial hypothesis evaluation value depends on the speaker, the size of the recognition vocabulary, and the length of the input speech, in the present invention the evaluation value of a partial hypothesis is normalized by a reference evaluation value obtained from the same input speech. This gives a normalized evaluation value that does not depend on the speaker, the vocabulary size, or the input speech length, so unpromising partial hypotheses can be discarded effectively during the search. As a result, the same threshold for the normalized evaluation value can be used when speech recognition is applied to various purposes, which reduces the user's burden of configuration.

【００３６】[0036]

Furthermore, when the input speech has content not permitted by the grammar, a conventional search outputs an incorrect result, namely the closest candidate within the grammar, and cannot distinguish a user's utterance mistake from a recognition error. In the present invention, by contrast, all partial hypotheses are discarded during the search in such a case, no recognition result is produced, and the user can be informed of the utterance error. Detecting and reporting a user's utterance mistake early is important in practical speech recognition.

【００３７】[0037]

The effects of the method of the invention are listed below.
・Unpromising partial hypotheses can be discarded effectively during the search.
・The threshold that must be set does not depend on the speaker, the vocabulary size, or the input speech length, so the user's burden of configuration is reduced.
・When the input speech has content not permitted by the grammar, the fact that recognition is impossible can be detected early in the search, and the user can be informed of the utterance mistake.

[Brief description of the drawings]

FIG. 1 is a diagram showing an example of the technique for obtaining the normalized evaluation value of a partial hypothesis, which is the main part of the method of the invention.

FIG. 2 is a diagram showing an example of the output probability density values computed in the trellis calculation, for explaining score function calculation method 2 for the reference evaluation value.

FIG. 3A is a diagram showing the processing of a speech recognition method that uses phonemes as the unit of recognition, and FIG. 3B is a diagram showing a grammar represented by a tree structure.

FIG. 4 is a diagram showing the score function obtained as a result of the trellis calculation.

FIG. 5 is a diagram showing the results of an experiment performed with the method of the invention.

Continuation of the front page

(56) References: JP-A-8-6588 (JP, A); JP-A-2-300798 (JP, A); Proceedings of the Acoustical Society of Japan 1995 Spring Meeting, I, 1-Q-28, Yoshiaki Noda et al., "Beam search using a forward heuristic function with a dynamic rejection function," pp. 151-152 (issued March 14, 1995); IEICE Technical Report [Speech], Vol. 94, No. 91, SP94-23, Yoshiaki Noda et al., "HMM-LR speech recognition by A* beam search using forward likelihood," pp. 1-8 (issued June 17, 1994)

(58) Fields searched (Int. Cl.⁷, DB name): G10L 15/00 - 17/00; JICST file (JOIS)

Claims

(57) [Claims]

1. A speech recognition method in which one or more hypotheses about the utterance content of input speech are generated by successively appending speech units according to a tree-structured grammar composed of speech units; for each depth of the tree-structured speech units, the plausibility, with respect to the input speech, of the partial hypothesis up to that depth in each hypothesis is evaluated on the basis of acoustic models to obtain a partial hypothesis evaluation value; and a recognition result is obtained from the likelihood, the method comprising: estimating, as a reference evaluation value for each depth of the tree-structured speech units, the evaluation value for the case where the utterance content of the input speech is assumed correctly; normalizing the evaluation value of the partial hypothesis at the corresponding depth by the reference evaluation value; and discarding any partial hypothesis whose normalized evaluation value is equal to or smaller than a threshold.

2. The speech recognition method according to claim 1, wherein the reference evaluation value is obtained by generating hypotheses about the utterance content of the input speech by successively appending speech units according to a grammar that includes said grammar, and matching the input speech against the acoustic models corresponding to the partial hypotheses to obtain a score function.

3. The speech recognition method according to claim 1, wherein the reference evaluation value is obtained by matching the input speech against partial hypotheses formed from at least one reference evaluation value acoustic model that represents acoustic phenomena encompassing the recognition targets, to obtain a score function.

4. The speech recognition method according to claim 2, wherein the input speech is matched against the acoustic model corresponding to the partial hypothesis to obtain a score function, from which the partial hypothesis evaluation value is obtained.

5. The speech recognition method according to claim 4, wherein said acoustic model is a hidden Markov model.

6. The speech recognition method according to claim 5, wherein the score function for obtaining the reference evaluation value is calculated by finding, at each time, the maximum of all output probability density values of the hidden Markov models and accumulating those maxima.

7. The speech recognition method according to claim 5, wherein the score function for obtaining the reference evaluation value is calculated by selecting, at each time, the maximum among the hidden Markov model output probability values that were calculated to obtain the partial hypothesis evaluation values, and accumulating those maxima.

8. The speech recognition method according to claim 2, wherein the grammar that includes said grammar allows any combination of the acoustic models corresponding to speech units.

9. The speech recognition method according to claim 8, wherein a constraint of a phoneme array structure unique to Japanese is used for the combination of the acoustic models corresponding to the speech units.

10. The speech recognition method according to claim 1, wherein the input speech is matched against the acoustic models corresponding to the partial hypotheses to obtain score functions from which the partial hypothesis evaluation values are obtained, and the reference evaluation value is obtained from the maximum of those score functions at each time.

11. The speech recognition method according to claim 4, wherein the evaluation value of the partial hypothesis is obtained by finding a forward heuristic function common to all partial hypotheses, taking the difference between the score function of each partial hypothesis and the forward heuristic function, and taking the maximum of that difference as the evaluation value.

12. The speech recognition method according to claim 11, wherein a score function obtained for obtaining the reference evaluation value is used as the forward heuristic function.