JP2001296884A

JP2001296884A - Device and method for voice recognition

Info

Publication number: JP2001296884A
Application number: JP2000114269A
Authority: JP
Inventors: Sadahiro Kimura; 貞弘木村
Original assignee: Ricoh Co Ltd
Current assignee: Ricoh Co Ltd
Priority date: 2000-04-14
Filing date: 2000-04-14
Publication date: 2001-10-26

Abstract

PROBLEM TO BE SOLVED: To provide a speech recognition device and method in which the state of the system user's utterance and external factors such as environmental conditions are introduced as new knowledge sources. SOLUTION: In a speech recognition device that analyzes input speech into acoustic parameters and recognizes the speech by comparing the parameters with comparison target pattern candidates, the recognition likelihood is weighted using values set on the basis of the time of utterance and external factors, and the word hypotheses are narrowed down accordingly. A more plausible result, and hence a higher recognition rate, is thereby obtained.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Field of the Invention] The present invention relates to a speech recognition device and method that analyze input speech, compare it with a plurality of comparison patterns entered in advance, and recognize the speech using knowledge sources such as linguistic information.

[0002]

[Description of the Related Art] In recent years, speech dialogue interfaces using speech recognition have attracted attention as one type of man-machine interface. As a conventional technique, there is a method in which, in addition to the acoustic evaluation value, a linguistic evaluation value generated by a statistical language model (finite automaton, stochastic context-free grammar), typified by the N-gram, is applied as a weight, thereby constraining the search vocabulary to speed up recognition and improve the recognition rate (see, for example, JP-A-09-134192 and JP-A-09-274498). A method of changing the language model according to the state of the dialogue has also been proposed (see JP-A-07-104786).

[0003]

[Problems to Be Solved by the Invention] Conventional speech recognition devices statistically assign the probability of a transition from the several most recently uttered words to the next word. They therefore depend only on the grammar of the language and do not take the state of the speaker into account. Furthermore, because training uses static data, the state of the speaker when the system is actually in use may show temporal and psychological tendencies different from those at training time. Consequently, a higher recognition rate could not be pursued with a conventional language model alone. Speech recognition devices are attracting attention as a future man-machine interface. However, when the language model is the only knowledge source serving as a computational constraint during recognition, a higher recognition rate cannot be obtained. A new knowledge source that covers the state and environment of the speaker is therefore needed. The present invention addresses these problems. That is, an object of the present invention is to provide a speech recognition device and method in which the state of the system user's utterance and external factors other than the utterance, such as the environment, are introduced as new knowledge sources.

[0004]

[Means for Solving the Problems] To solve the above problems, the invention according to claim 1 is a speech recognition device that analyzes input speech into acoustic parameters and recognizes the speech by comparing them with a plurality of comparison target pattern candidates stored in advance, characterized by comprising: means for obtaining, with attention to the temporal bias of uttered words, a value set based on the time of utterance; and means for weighting the likelihood obtained in advance after speech recognition using the obtained value and narrowing down the likely words.

The invention according to claim 2 is the speech recognition device according to claim 1, characterized in that, with attention to an external factor, the likelihood obtained in advance after speech recognition is weighted using a value set for that external factor.

The invention according to claim 3 is a speech recognition method that analyzes input speech into acoustic parameters and recognizes the speech by comparing the analysis result with a plurality of comparison target pattern candidates stored in advance, characterized in that, with attention to the temporal bias of uttered words, a value set based on the time of utterance is obtained, the likelihood obtained in advance after speech recognition is weighted using the obtained value, and the likely words are narrowed down.

The invention according to claim 4 is the speech recognition method according to claim 3, characterized in that, with attention to an external factor, the likelihood obtained in advance after speech recognition is weighted using a value set for that external factor.

[0005]

[Embodiments of the Invention] An embodiment of the present invention is described below with reference to the drawings. FIG. 1 is a block diagram showing a speech recognition device according to one embodiment of the present invention. As shown in FIG. 1, the speech recognition device has a microphone 1, a speech feature extraction unit 2, a speech recognition unit 3, a phoneme HMM 4, a word dictionary 5, a word hypothesis narrowing unit 6, a statistical language model 7, a speaker state analysis unit 8, and an utterance knowledge base 9.

The speech feature extraction unit 2 is connected to the microphone 1. The speech recognition unit 3 is connected to the speech feature extraction unit 2. The phoneme HMM 4 and the word dictionary 5 are connected to the speech recognition unit 3. The word hypothesis narrowing unit 6 is connected to the speech recognition unit 3. The statistical language model 7 is connected to the word hypothesis narrowing unit 6. The speaker state analysis unit 8 is connected to the word hypothesis narrowing unit 6. The utterance knowledge base 9 is connected to the speaker state analysis unit 8.

The microphone 1 is the speech input means (hereinafter referred to as the microphone). The speech feature extraction unit 2 analyzes the input speech into feature parameters. The speech recognition unit 3 performs comparison and recognition using the phoneme HMM 4, which is model data for the acoustic parameters (this embodiment uses an HMM: Hidden Markov Model), and the word dictionary 5, which is model data for the grammatical parameters. The utterance knowledge base 9 stores the words that are used, the times at which each word is used more frequently, and so on. Besides information on temporal factors, the utterance knowledge base 9 likewise stores information for each of the other external factors. The speaker state analysis unit 8 analyzes the speaker's state from the circumstances at the time of utterance, such as the current time, on the basis of the utterance knowledge base 9, and calculates an evaluation value indicating how likely each word in the current recognition vocabulary is to be used. The word hypothesis narrowing unit 6 narrows down the words by comparing the result calculated by the speaker state analysis unit 8, the statistical language model 7, and the output of the speech recognition unit 3.

[0006] Next, the process from speech input to output of the recognition result in this device is described. For example, when "choushoku wa nan desu ka" ("What is breakfast?") is uttered into the microphone 1, the speech feature extraction unit 2 computes MFCCs (mel-frequency cepstral coefficients) and sends them to the speech recognition unit 3. Based on these MFCCs, the speech recognition unit 3 detects word hypotheses using the phoneme HMM 4 and the word dictionary 5 and calculates their likelihoods. For example, the recognition likelihood output by the speech recognition unit 3 is calculated over the forward paths of the HMM as the sum of the products of the output probabilities and the transition probabilities. The calculated result is output as shown in FIG. 2.

The utterance "choushoku" (breakfast) in this example is acoustically hard to distinguish from "chuushoku" (lunch). Grammatically as well, "What is breakfast?" and "What is lunch?" both consist of S + V (subject and predicate), so the two sentences are grammatically identical; the language model shows no difference between them and cannot tell them apart. Therefore, when the word hypothesis narrowing unit 6 narrows down these hypotheses, it cannot narrow them down completely even with the statistical language model 7. This is where the parameters from the speaker state analysis unit 8 are used.

The function of the speaker state analysis unit 8 is as follows. The speaker state analysis unit 8 relates the utterance knowledge base 9 to the time and other external factors, and calculates a parameter for weighting the likelihood calculated by the speech recognition unit 3. The description here assumes that the utterance knowledge base 9 has the format shown in FIG. 3. The utterance knowledge base 9 of FIG. 3 holds, in table form, each word that is used, the start time A and end time B of an interval in which the word is used more frequently, and the evaluation value (weight) for that interval. The speaker state analysis unit 8 reads the current time and, if any of the word hypotheses output by the speech recognition unit 3 is registered in the utterance knowledge base 9, transfers the evaluation value of the interval containing the current time to the word hypothesis narrowing unit 6. For words that are not registered, the lowest value in the utterance knowledge base 9 is set and transferred to the word hypothesis narrowing unit 6.

The inventions according to claims 1 and 3 introduce, as an evaluation value other than the language model, the utterance probability of a word with time as a variable. In conversation, the utterance probability of a word is biased with respect to time (date and time of day). For example, the word "breakfast" is uttered more often in the morning and evening than in the daytime, because conversations such as "today's / breakfast / ..." and "tomorrow's / breakfast / ..." become more frequent in the morning and evening, respectively. Likewise, the word "nyuugaku" (school entrance) is uttered frequently from around January to May, because conversations such as "entrance / examination" and "entrance / ceremony" become more frequent. Phenomena that vary with time in this way cannot be expressed by a language model. This bias with respect to time is therefore introduced as a new evaluation value.
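The time-based lookup performed by the speaker state analysis unit 8 could be sketched roughly as follows. This is only an illustration under stated assumptions, not the patented implementation: the dictionary layout, the function names, the half-open interval test, and every table row other than the 09:00 to 11:00 values for "breakfast" (0.3) and "lunch" (0.25) are hypothetical. The sketch also assumes that a registered word with no interval covering the current time falls back to the lowest value, which the description states only for unregistered words.

```python
from datetime import time

# Hypothetical in-memory form of the utterance knowledge base (FIG. 3 style):
# word -> list of (start time A, end time B, evaluation value).
# Only the 09:00-11:00 rows are taken from the description; the other rows
# are invented purely for illustration.
KNOWLEDGE_BASE = {
    "choushoku": [  # breakfast
        (time(6, 0), time(9, 0), 0.4),     # assumed row
        (time(9, 0), time(11, 0), 0.3),    # value given in the description
    ],
    "chuushoku": [  # lunch
        (time(9, 0), time(11, 0), 0.25),   # value given in the description
        (time(11, 0), time(14, 0), 0.4),   # assumed row
    ],
}


def lowest_value(kb):
    """Return the smallest evaluation value stored anywhere in the table."""
    return min(weight for rows in kb.values() for _, _, weight in rows)


def evaluation_value(kb, word, now):
    """Return the weight of the interval containing `now` for `word`.

    Unregistered words receive the lowest value in the table, as the
    description states. Registered words with no matching interval get the
    same fallback (an assumption of this sketch).
    """
    for start, end, weight in kb.get(word, []):
        if start <= now < end:  # half-open interval: an assumption
            return weight
    return lowest_value(kb)


now = time(9, 30)
for word in ("choushoku", "chuushoku", "nyuugaku"):
    print(word, evaluation_value(KNOWLEDGE_BASE, word, now))
```

A table keyed by another external factor (for example loudness ranges instead of time intervals) could reuse the same lookup shape with a different comparison key.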

[0007] This embodiment is described with reference to the flowchart of FIG. 6. Suppose the utterance in this example is made at around 9:30 (step S10); the applicable interval is then "09:00 to 11:00". The utterance knowledge base 9 is searched for the words "choushoku" (breakfast) and "chuushoku" (lunch) (step S20). Here, evaluation values are set in the utterance knowledge base 9 for the words "breakfast" and "lunch" (see FIG. 3), so "breakfast" is 0.3 and "lunch" is 0.25 (step S30). These evaluation values are transferred to the word hypothesis narrowing unit 6 (step S50). The word hypothesis narrowing unit 6 multiplies the hypotheses that could not be completely narrowed down with the statistical language model 7 by these evaluation values (weights) (step S60). In terms of the recognition result in FIG. 2, for the word hypothesis "What is breakfast?", the weighted likelihood is 0.105, the product of the output likelihood 0.35 and the evaluation value 0.3 output by the speaker state analysis unit 8. For "What is lunch?", the weighted likelihood is 0.0875, the product of the output likelihood 0.35 and the evaluation value 0.25 output by the speaker state analysis unit 8. The word hypothesis narrowing unit 6 can therefore output "What is breakfast?" (likelihood: 0.105), which has the highest weighted likelihood, as the recognition result (step S70).

In the embodiment of the inventions according to claims 2 and 4, the input to the speaker state analysis unit 8 is not the time but another external factor. Here, the loudness of sound is used as that external factor (see FIG. 7). Suppose, for example, that "boryuumu o sageru" ("turn the volume down") is uttered into the microphone 1 (step S100). As in the embodiment of claim 1, "turn the volume down" (sageru) and "turn the volume up" (ageru) are acoustically hard to distinguish, because the consonant s of "sageru" is sometimes dropped. They are even harder to distinguish in noise. The loudness of sound is therefore fed to the speaker state analysis unit 8, which is controlled so that the weight of "down" is increased when the sound is loud and the weight of "up" is increased when it is quiet. The words "up" and "down" are searched for in the utterance knowledge base 9 (step S200); if they are found, their evaluation values are obtained (step S300) and transferred to the word hypothesis narrowing unit 6 (step S500), multiplication is performed as in the case of claim 1 (step S600), and the hypothesis with the highest weighted likelihood is output as the recognition result (step S700).

Other external factors such as weather, temperature, and clarity can be used in the same way. For example, when clarity is the external factor, the utterance knowledge base 9 is as shown in FIG. 5. The utterance knowledge base 9 is set up before recognition is performed, but its contents can be changed, added to, and deleted as appropriate.
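The weighting and selection of steps S60 and S70 can be illustrated with the figures given above (output likelihood 0.35 for both hypotheses, evaluation values 0.3 and 0.25). This is a minimal sketch; the function name `narrow_hypotheses` and the dictionary representation of the hypotheses are assumptions made for illustration, not part of the disclosure.

```python
def narrow_hypotheses(likelihoods, weights):
    """Weight each hypothesis likelihood and pick the most likely hypothesis.

    likelihoods: hypothesis -> likelihood from the speech recognition unit
    weights:     hypothesis -> evaluation value from the speaker state analysis unit
    Returns the best hypothesis and the full table of weighted likelihoods.
    """
    # Step S60: multiply each likelihood by its evaluation value (weight).
    weighted = {hyp: likelihoods[hyp] * weights[hyp] for hyp in likelihoods}
    # Step S70: output the hypothesis with the highest weighted likelihood.
    best = max(weighted, key=weighted.get)
    return best, weighted


# Values taken from the description (FIG. 2 likelihoods, FIG. 3 weights at about 9:30).
likelihoods = {"What is breakfast?": 0.35, "What is lunch?": 0.35}
weights = {"What is breakfast?": 0.3, "What is lunch?": 0.25}

best, weighted = narrow_hypotheses(likelihoods, weights)
print(best)
for hyp, score in weighted.items():
    print(f"{hyp}: {score:.4f}")
```

The same multiplication applies to the loudness example of steps S600 and S700; only the source of the weights changes.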

[0008]

[Effects of the Invention] According to the invention of claim 1, word hypotheses that cannot be completely narrowed down even with a statistical language model are weighted with an evaluation value that takes the temporal bias of utterances into account, so that a narrowing that is more plausible in practice can be performed and a higher recognition rate can be obtained. According to the invention of claim 2, a higher recognition rate can be obtained even in an environment where recognition errors are likely to occur, by introducing an evaluation value that takes the loudness of sound into account. According to the invention of claim 3, word hypotheses that cannot be completely narrowed down even with a statistical language model are weighted with an evaluation value that takes the temporal bias of utterances into account, so that a narrowing that is more plausible in practice can be performed and a higher recognition rate can be obtained. According to the invention of claim 4, a higher recognition rate can be obtained even in an environment where recognition errors are likely to occur, by introducing an evaluation value that takes the loudness of sound into account.

[Brief description of the drawings]

FIG. 1 is a block diagram showing a speech recognition device according to one embodiment of the present invention.

FIG. 2 is a diagram for explaining the output of the speech recognition unit in the embodiment of the present invention.

FIG. 3 is a diagram for explaining the state of the utterance knowledge base in the embodiment of the present invention that uses a temporal factor.

FIG. 4 is a diagram for explaining the state of the utterance knowledge base in the embodiment of the present invention that uses the loudness of sound as an external factor.

FIG. 5 is a diagram for explaining the state of the utterance knowledge base in the embodiment of the present invention that uses clarity as an external factor.

FIG. 6 is a flowchart for explaining the embodiment of the present invention that uses a temporal factor.

FIG. 7 is a flowchart for explaining the embodiment of the present invention that uses the loudness of sound as an external factor.

[Explanation of symbols]

1 microphone, 2 speech feature extraction unit, 3 speech recognition unit, 4 phoneme HMM, 5 word dictionary, 6 word hypothesis narrowing unit, 7 statistical language model, 8 speaker state analysis unit, 9 utterance knowledge base.

Claims

[Claims]

1. A speech recognition device that analyzes input speech into acoustic parameters and recognizes the speech by comparing them with a plurality of comparison target pattern candidates stored in advance, the device comprising: means for obtaining, with attention to the temporal bias of uttered words, a value set based on the time of utterance; and means for weighting the likelihood obtained in advance after speech recognition using the obtained value and narrowing down the likely words.

2. The speech recognition device according to claim 1, wherein, with attention to an external factor, the likelihood obtained in advance after speech recognition is weighted using a value set for the external factor.

3. A speech recognition method that analyzes input speech into acoustic parameters and recognizes the speech by comparing the analysis result with a plurality of comparison target pattern candidates stored in advance, the method comprising: obtaining, with attention to the temporal bias of uttered words, a value set based on the time of utterance; weighting the likelihood obtained in advance after speech recognition using the obtained value; and narrowing down the likely words.

4. The speech recognition method according to claim 3, wherein, with attention to an external factor, the likelihood obtained in advance after speech recognition is weighted using a value set for the external factor.