JP2003316377A

JP2003316377A - Device and method for voice recognition

Info

Publication number: JP2003316377A
Application number: JP2002126939A
Authority: JP
Inventors: Soichi Toyama; 聡一外山
Original assignee: Pioneer Electronic Corp
Current assignee: Pioneer Corp
Priority date: 2002-04-26
Filing date: 2002-04-26
Publication date: 2003-11-07

Abstract

PROBLEM TO BE SOLVED: To determine whether a voice recognition result is correct with high accuracy and a small amount of processing.

SOLUTION: An acoustic model HMMsb is collated with a feature vector V(n) of the uttered speech to obtain a recognition result RCG indicating the acoustic model that yields the maximum likelihood, a first score FSCR indicating the value of the maximum likelihood, and a second score SSCR indicating the next likelihood value. An evaluation value FSCR × (FSCR − SSCR) based on the scores FSCR and SSCR is compared with a preset threshold THD to determine whether the recognition result RCG is correct. When RCG is determined to be correct, speaker adaptation is applied to the model HMMsb. When RCG is determined to be incorrect, no speaker adaptation is applied to HMMsb, which improves the accuracy of speaker adaptation and the like.

COPYRIGHT: (C)2004,JPO

Description

Detailed Description of the Invention

[0001]

BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention relates to a voice recognition device and a voice recognition method that perform voice recognition using, for example, speaker adaptation.

[0002]

2. Description of the Related Art

A main cause of the difficulty of voice recognition is generally said to be that uttered speech differs from speaker to speaker because of individual differences in vocal organs, speaking habits, and the like.

[0003] Conventionally, as voice recognition algorithms that are robust to uttered speech affected by such individual differences, speaker adaptation methods have been proposed that raise the recognition rate for a specific speaker by performing speaker adaptation using MLLR (Maximum Likelihood Linear Regression), MAP (Maximum a Posteriori) estimation, or the like.

[0004] In this speaker adaptation method, a large number of acoustic models (initial acoustic models) in units of phonemes or words are created in advance from the speech of an unspecified large number of speakers. These acoustic models are then adapted using the feature quantities of the speech of the speaker to whom the models are to be adapted.

[0005]

Problems to be Solved by the Invention

For speaker adaptation to be appropriate, it must be applied to the phoneme-level or word-level acoustic model that corresponds to the voice recognition result.

[0006] To give a concrete example, when a speaker utters "Tokyo", the acoustic model of "Tokyo" should be speaker-adapted only if the utterance was correctly recognized as "Tokyo". If the acoustic model of "Tokyo" is adapted on the basis of a misrecognized result, the speaker adaptation is erroneous.

[0007] Suppose the speaker utters "Tokyo", the utterance is misrecognized as "Kyoto", and the acoustic model of "Kyoto" is then adapted using the feature quantities of the "Tokyo" utterance. Appropriate speaker adaptation cannot be achieved, and the voice recognition rate after speaker adaptation may fall.

[0008] Therefore, as a precondition for speaker adaptation, it is important to determine reliably whether the uttered speech and the voice recognition result match, that is, whether the voice recognition result is correct.

[0009] With conventional speaker adaptation methods, however, the amount of processing needed to determine whether the voice recognition result is correct is enormous. Developing a new determination method that allows rapid and appropriate speaker adaptation without placing stress on the speaker has therefore been an important issue.

[0010] The present invention has been made in view of these problems, and its object is to provide a voice recognition device and a voice recognition method that enable rapid and appropriate speaker adaptation.

[0011]

Means for Solving the Problems

To achieve the above object, the invention according to claim 1 is a voice recognition device that applies speaker adaptation to an acoustic model using a feature vector of uttered speech, comprising: voice recognition means that collates the acoustic model with the feature vector of the uttered speech and outputs a recognition result indicating the acoustic model that yields the maximum likelihood, a first score indicating the value of the maximum likelihood, and a second score indicating the next likelihood value; determination means that compares an evaluation value based on the first score and the second score with a preset threshold and determines that the recognition result is correct when the evaluation value has a predetermined relationship with the threshold; and speaker adaptation processing means that applies speaker adaptation to the acoustic model when the determination means determines that the recognition result is correct.

[0012] The invention according to claim 5 is a voice recognition method that applies speaker adaptation to an acoustic model using a feature vector of uttered speech, comprising: a first step of collating the acoustic model with the feature vector of the uttered speech to obtain a recognition result indicating the acoustic model that yields the maximum likelihood, a first score indicating the value of the maximum likelihood, and a second score indicating the next likelihood value; a second step of comparing an evaluation value based on the first score and the second score with a preset threshold and determining that the recognition result is correct when the evaluation value has a predetermined relationship with the threshold; and a third step of applying speaker adaptation to the acoustic model when the recognition result is determined to be correct in the second step.

[0013] According to the voice recognition device of claim 1 and the voice recognition method of claim 5, during speaker adaptation processing the acoustic model is collated with the feature vector of the uttered speech to obtain a recognition result indicating the acoustic model that yields the maximum likelihood, a first score indicating the value of the maximum likelihood, and a second score indicating the next likelihood value. An evaluation value based on the first score and the second score is then compared with a preset threshold. When the evaluation value has the predetermined relationship with the threshold, the recognition result is determined to be correct and speaker adaptation is applied to the acoustic model. Appropriate speaker adaptation is thus performed on the basis of correct recognition results.

[0014] The invention according to claim 2 is the invention according to claim 1, wherein the determination means determines that the recognition result is incorrect when the evaluation value does not have the predetermined relationship with the threshold, and the speaker adaptation processing means does not apply speaker adaptation to the acoustic model when the recognition result is determined to be incorrect.

[0015] The invention according to claim 6 is the invention according to claim 5, wherein in the second step the recognition result is determined to be incorrect when the evaluation value does not have the predetermined relationship with the threshold, and in the third step speaker adaptation is not applied to the acoustic model when the recognition result is determined to be incorrect.

[0016] According to the voice recognition device of claim 2 and the voice recognition method of claim 6, speaker adaptation is not applied to the acoustic model when the recognition result is determined to be incorrect. Because no speaker adaptation is performed on the basis of an incorrect recognition result, degradation of voice recognition accuracy after speaker adaptation, and similar problems, are prevented.

[0017] The invention according to claim 3 is the voice recognition device according to claim 1, wherein the evaluation value is calculated from the difference between the first score and the second score.

[0018] The invention according to claim 7 is the voice recognition method according to claim 5, wherein the evaluation value is calculated from the difference between the first score and the second score.

[0019] According to the voice recognition device of claim 3 and the voice recognition method of claim 7, calculating the evaluation value from the difference between the first score and the second score improves the accuracy with which the recognition result is determined to be correct when the evaluation value has the predetermined relationship with the threshold, and incorrect when it does not.

[0020] The invention according to claim 4 is the invention according to claim 2, further comprising means that, when the determination means determines that the recognition result is incorrect, prohibits output of the recognition result and presents an indication that the recognition result is incorrect.

[0021] The invention according to claim 8 is the invention according to claim 6, wherein, when the recognition result is determined to be incorrect in the second step, output of the recognition result is prohibited and an indication that the recognition result is incorrect is presented.

[0022] According to the voice recognition device of claim 4 and the voice recognition method of claim 8, useful information, such as whether appropriate speaker adaptation has been performed, is presented to the user or the like.

[0023]

BEST MODE FOR CARRYING OUT THE INVENTION

Preferred embodiments of the present invention are described below with reference to the drawings.

[0024] (First Embodiment) A first embodiment of the present invention is described with reference to FIGS. 1 to 3. FIG. 1 is a block diagram showing the configuration of the voice recognition device of this embodiment.

[0025] In FIG. 1, the voice recognition device performs voice recognition using HMMs (Hidden Markov Models). It comprises an acoustic analysis unit 1, which converts a voice input signal v(t) output from a microphone (not shown) serving as sound pickup means into a feature vector sequence V(n) in the cepstrum domain and outputs it, together with a voice recognition processing unit 2, a word model generation unit 3, a word dictionary 4, an acoustic model storage unit 5, a correctness determination unit 6, and a speaker adaptation processing unit 7.

[0026] The acoustic model storage unit 5 stores acoustic models in subword units such as phonemes (standard phoneme HMMs) generated from the speech of unspecified speakers.

[0027] As will become clear from the description below, the acoustic model storage unit 5 initially stores acoustic models (standard phoneme HMMs) obtained in advance from the speech of unspecified speakers. When speaker adaptation is subsequently applied, these initial acoustic models are updated to speaker-adapted acoustic models. If speaker adaptation is continued or repeated, it is applied to the speaker-adapted acoustic models stored in the acoustic model storage unit 5, then applied again to the resulting models, and so on. The acoustic model storage unit 5 therefore holds repeatedly updated speaker-adapted acoustic models. Updating the speaker-adapted acoustic models in this way improves recognition performance for the speaker's voice.

[0028] The word dictionary 4 stores in advance dictionary data (also called text data) for a large number of words and sentences.

[0029] The word model generation unit 3 combines the phoneme acoustic models HMMsb stored in the acoustic model storage unit 5 in accordance with the data WRD of each word or sentence stored in the word dictionary 4 (hereinafter collectively called "word data"), thereby generating an acoustic model HMMw corresponding to the word data WRD (hereinafter called a "word model").
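The composition of a word model from phoneme models can be sketched as follows. This is a minimal illustration, not the patent's implementation: the pronunciation dictionary, the phoneme symbols, and the string placeholders standing in for HMM parameters are all assumptions made for the example.

```python
# Hypothetical pronunciation dictionary (word -> phoneme sequence),
# standing in for the word data WRD held in word dictionary 4.
WORD_DICTIONARY = {
    "Tokyo": ["t", "o", "k", "y", "o"],
    "Kyoto": ["k", "y", "o", "t", "o"],
}

# Placeholder phoneme models; a real system would hold HMM parameters here.
PHONEME_MODELS = {p: f"HMMsb[{p}]" for p in "tokyo"}


def build_word_model(word, dictionary, phoneme_models):
    """Concatenate the phoneme models for `word` in pronunciation order."""
    return [phoneme_models[p] for p in dictionary[word]]


print(build_word_model("Kyoto", WORD_DICTIONARY, PHONEME_MODELS))
```

Because the word model is assembled from the stored phoneme models each time, an update to a phoneme model in this sketch is reflected in every word that uses it.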

[0030] The voice recognition processing unit 2 collates, in real time, the word models HMMw generated by the word model generation unit 3 with the feature vector V(n) of the uttered speech supplied from the acoustic analysis unit 1. After collating all the word models HMMw with the feature vector V(n), it outputs a recognition result RCG indicating the word model HMMw that yielded the maximum likelihood.

[0031] In addition to the recognition result RCG, the voice recognition processing unit 2 outputs the maximum likelihood value as a first score FSCR and the next-largest likelihood value (the second-largest likelihood after the maximum) as a second score SSCR.
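Selecting RCG, FSCR, and SSCR from per-word likelihoods can be sketched as below. The function name, the dictionary input, and the numeric values are assumptions for illustration; the patent does not specify how likelihoods are represented.

```python
def top_two_scores(likelihoods):
    """Return (RCG, FSCR, SSCR) from a mapping of word -> likelihood."""
    if len(likelihoods) < 2:
        raise ValueError("at least two word models are needed")
    # Sort word models from highest to lowest likelihood.
    ranked = sorted(likelihoods.items(), key=lambda kv: kv[1], reverse=True)
    (rcg, fscr), (_, sscr) = ranked[0], ranked[1]
    return rcg, fscr, sscr


# Made-up likelihood values for three word models.
print(top_two_scores({"Tokyo": 8.0, "Kyoto": 5.0, "Osaka": 1.0}))
```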

[0032] To give a specific example, suppose a speaker utters "Tokyo" and the voice recognition processing unit 2 collates the feature vector sequence V(n) with all the word models HMMw, with the result that the likelihood for the word model of "Tokyo" is the maximum and the likelihood for the word model of another word, "Kyoto", is the second largest. The maximum likelihood value is then output as the first score FSCR and the second-largest likelihood value as the second score SSCR. In this example the uttered "Tokyo" matches the word model of "Tokyo" corresponding to the first score FSCR, so the recognition result RCG is correct.

[0033] Now consider a misrecognition. Suppose the speaker utters "Tokyo" and, as a result of recognition by the voice recognition processing unit 2, the likelihood for the word model of "Kyoto" is the maximum and the likelihood for the word model of "Tokyo" is the second largest. The maximum likelihood value is again output as the first score FSCR and the second-largest likelihood value as the second score SSCR. Here the uttered "Tokyo" does not match the word model of "Kyoto" corresponding to the first score FSCR, so the recognition result RCG is incorrect.

[0034] The correctness determination unit 6 applies the first score FSCR and the second score SSCR, which the voice recognition processing unit 2 outputs when it recognizes an utterance, to the score evaluation function expressed by equation (1) below, and compares the resulting score evaluation value G(L) with a predetermined threshold THD.

[0035]

G(L) = FSCR × (FSCR − SSCR)   …(1)

On the right-hand side of equation (1), the variable FSCR is the value of the first score FSCR (the maximum likelihood value) and the variable SSCR is the value of the second score SSCR (the second-largest likelihood value). The score evaluation value G(L) is a variable related to the likelihood L, obtained by the evaluation operation on the right-hand side.
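As a worked illustration of equation (1), the sketch below evaluates G(L) for two made-up score pairs. The numbers are assumptions chosen for easy arithmetic, and the scores are assumed to be positive; the patent does not state the numeric range of the likelihood values.

```python
def score_evaluation(fscr, sscr):
    """Equation (1): G(L) = FSCR * (FSCR - SSCR)."""
    return fscr * (fscr - sscr)


# Well-separated scores give a large evaluation value: 8 * (8 - 5) = 24.
print(score_evaluation(8.0, 5.0))
# Nearly equal scores give a small evaluation value: 8 * (8 - 7.5) = 4.
print(score_evaluation(8.0, 7.5))
```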

[0036] When the score evaluation value G(L) is greater than the threshold THD (that is, G(L) > THD), the recognition result RCG is determined to be correct. When G(L) is less than or equal to the threshold THD (that is, G(L) ≦ THD), the recognition result RCG is determined to be incorrect. In each case the determination result RSLT is output.
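The determination rule can be sketched as a single comparison. The function name and the threshold value of 10.0 are assumptions for the example; in the patent THD is fixed in advance by experiment.

```python
def judge_recognition(fscr, sscr, thd):
    """Return True (RCG judged correct) when G(L) > THD, else False."""
    g = fscr * (fscr - sscr)  # score evaluation value from equation (1)
    return g > thd


THD = 10.0  # hypothetical threshold, for illustration only
print(judge_recognition(8.0, 5.0, THD))  # G(L) = 24
print(judge_recognition(8.0, 7.5, THD))  # G(L) = 4
```

A caller would apply speaker adaptation only when this function returns True, which matches the gating described for the speaker adaptation processing unit 7.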

[0037] The score evaluation function of equation (1), the threshold THD, and the determination principle of the correctness determination unit 6 are now described in detail.

[0038] The score evaluation function of equation (1) and the threshold THD are determined experimentally in advance by the statistical method described below.

[0039] First, an arbitrary speaker is asked to utter a predetermined number N of words and sentences. The word dictionary 4, the word model generation unit 3, and the voice recognition processing unit 2 perform voice recognition processing, and the first score FSCR and the second score SSCR output for each of the N words and sentences are measured experimentally.

[0040] The words and sentences are then classified into those for which the uttered speech and the recognition result RCG matched (correctly recognized words and sentences) and those for which they did not match (misrecognized words and sentences).

[0041] For example, if the total number N of words and sentences uttered for the experiment is 500, of which the number X correctly recognized is 400 and the number Y misrecognized is 100, the 500 words and sentences are divided into groups of X and Y items respectively.
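The classification step can be sketched as a partition of measured samples. The tuple layout (spoken word, recognized word, FSCR, SSCR) and the sample values are assumptions made for this example.

```python
def partition_by_correctness(samples):
    """Split (spoken, recognized, fscr, sscr) tuples into two lists."""
    correct, misrecognized = [], []
    for sample in samples:
        spoken, recognized = sample[0], sample[1]
        if spoken == recognized:
            correct.append(sample)
        else:
            misrecognized.append(sample)
    return correct, misrecognized


samples = [
    ("Tokyo", "Tokyo", 8.0, 5.0),
    ("Tokyo", "Kyoto", 8.0, 7.5),
    ("Kyoto", "Kyoto", 6.0, 3.0),
]
correct, misrecognized = partition_by_correctness(samples)
print(len(correct), len(misrecognized))  # X and Y in the text's notation
```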

[0042] Further, as shown in FIG. 2(a), for the correctly recognized words and sentences (X = 400 in the example above), a histogram P(FSCR) showing the distribution of the number of words and sentences over the value of the first score FSCR (the maximum likelihood value) and a histogram Q(SSCR) showing the distribution of the number of words and sentences over the value of the second score SSCR (the second-largest likelihood value) are created.

[0043] That is, the histograms P(FSCR) and Q(SSCR) are both created from the 400 correctly recognized words and sentences. The value of the first score FSCR (the maximum likelihood value) and the value of the second score SSCR (the second-largest likelihood value) output from the voice recognition processing unit 2 vary with the voice recognition environment and other conditions. The histogram P(FSCR), shown by the solid line in FIG. 2(a), is therefore created by assigning the 400 words and sentences to the individual values of the first score FSCR. Likewise, the histogram Q(SSCR), shown by the dotted line in FIG. 2(a), is created by assigning the 400 words and sentences to the individual values of the second score SSCR.
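Building the two histograms can be sketched with fixed-width bins. The bin width, the helper name, and the score pairs are assumptions for illustration; the patent does not say how score values are grouped.

```python
from collections import Counter


def histogram(values, bin_width):
    """Count the values falling in each bin [k*w, (k+1)*w), keyed by k."""
    return Counter(int(v // bin_width) for v in values)


# Made-up (FSCR, SSCR) pairs for correctly recognized utterances.
correct_pairs = [(8.0, 5.0), (8.5, 4.0), (6.2, 3.1)]
p_fscr = histogram([f for f, _ in correct_pairs], 1.0)  # plays the role of P(FSCR)
q_sscr = histogram([s for _, s in correct_pairs], 1.0)  # plays the role of Q(SSCR)
print(sorted(p_fscr.items()), sorted(q_sscr.items()))
```

With these sample values the FSCR bins (6 and 8) do not overlap the SSCR bins (3, 4, and 5), which is the kind of separation FIG. 2(a) is described as showing.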

[0044] Similarly, as illustrated in FIG. 2(b), for the misrecognized words and sentences (Y = 100 in the example above), a histogram P(FSCR)" showing the distribution of the number of words and sentences over the value of the first score FSCR and a histogram Q(SSCR)" showing the distribution of the number of words and sentences over the value of the second score SSCR are created.

[0045] That is, the histograms P(FSCR)" and Q(SSCR)" shown in FIG. 2(b) are both created from the 100 misrecognized words and sentences. When the voice recognition processing unit 2 misrecognizes an utterance, the value of the first score FSCR (the maximum likelihood value) and the value of the second score SSCR (the second-largest likelihood value) likewise vary with the voice recognition environment and other conditions. The histogram P(FSCR)", shown by the solid line in FIG. 2(b), is therefore created by assigning the 100 words and sentences to the individual values of the first score FSCR, and the histogram Q(SSCR)", shown by the dotted line in FIG. 2(b), is created by assigning the 100 words and sentences to the individual values of the second score SSCR.

[0046] When the histograms are created in this way, the histograms P(FSCR) and Q(SSCR) in FIG. 2(a) are concentrated in separate ranges of likelihood values. It follows that when the recognition result RCG is correct, the statistical characteristics of the first score FSCR and those of the second score SSCR differ markedly.

[0047] The histograms P(FSCR)" and Q(SSCR)" in FIG. 2(b), by contrast, are distributed over almost the same range of likelihood values. It follows that when the recognition result RCG is incorrect, the statistical characteristics of the first score FSCR and those of the second score SSCR are almost the same.

[0048] Thus the cases in which the recognition result RCG is correct and incorrect have distinctive statistical characteristics, as shown by the relationship between the histograms P(FSCR) and Q(SSCR) and between the histograms P(FSCR)" and Q(SSCR)". The score evaluation function of equation (1) was chosen as a function capable of expressing these statistical characteristics.

[0049] According to the score evaluation function of equation (1), when the recognition result RCG is correct the first score FSCR is concentrated at larger likelihood values than the second score SSCR, as shown in FIG. 2(a), so the difference (FSCR − SSCR) on the right-hand side of equation (1) is large. Multiplying this difference by the first score FSCR yields a score evaluation value that makes the characteristic of the difference (FSCR − SSCR) more pronounced.

[0050] Accordingly, when the first score FSCR and the second score SSCR obtained under correct recognition are applied to it, the score evaluation function of equation (1) appropriately reflects the statistical characteristics of correct recognition. It can therefore serve as the basis for judging correct the recognition result RCG output when recognition was correct.

[0051] When misrecognition occurs, on the other hand, the likelihood values of the first score FSCR and the second score SSCR fall within almost the same range, as shown in FIG. 2(b), so the difference (FSCR − SSCR) on the right-hand side of equation (1) is small. Multiplying this difference by the first score FSCR again yields a score evaluation value that makes the characteristic of the difference (FSCR − SSCR) more pronounced.

[0052] Accordingly, when the first score FSCR and the second score SSCR obtained under misrecognition are applied to it, the score evaluation function of equation (1) appropriately reflects the statistical characteristics of misrecognition. It can therefore serve as the basis for judging incorrect the recognition result RCG output when misrecognition occurred.

[0053] Next, the threshold THD is set as the criterion, applied to the score evaluation value G(L) obtained from the score evaluation function of equation (1), for judging the recognition result RCG correct when it is correct and incorrect when it is incorrect.

[0054] In general, it is difficult to judge every recognition result RCG correctly as correct or incorrect. In speaker adaptation, if an incorrect result is mistakenly judged correct, the acoustic model is adapted wrongly, as described above, and recognition performance degrades. Conversely, if a correct result is mistakenly judged incorrect, speaker adaptation is simply not performed, so recognition performance neither improves nor degrades. The threshold THD is therefore selected, on the principle described next, so that incorrect results are reliably judged incorrect.
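The asymmetry between the two kinds of misjudgment can be made concrete by counting them for a candidate threshold. This is only an illustrative sketch: the function name, the G(L) values, and the candidate thresholds are assumptions, and it does not reproduce the patent's own threshold selection procedure.

```python
def count_misjudgments(correct_g, misrecognized_g, thd):
    """Return (false_accepts, false_rejects) for threshold `thd`.

    false_accepts: misrecognized results judged correct (G > THD); harmful,
                   because the model would be adapted with the wrong data.
    false_rejects: correct results judged incorrect (G <= THD); adaptation is
                   merely skipped.
    """
    false_accepts = sum(1 for g in misrecognized_g if g > thd)
    false_rejects = sum(1 for g in correct_g if g <= thd)
    return false_accepts, false_rejects


correct_g = [24.0, 18.0, 6.0]   # made-up G(L) values, correct recognitions
misrecognized_g = [4.0, 9.0]    # made-up G(L) values, misrecognitions
print(count_misjudgments(correct_g, misrecognized_g, 5.0))
print(count_misjudgments(correct_g, misrecognized_g, 10.0))
```

In this made-up data, raising the threshold from 5.0 to 10.0 removes the one false accept at the cost of one false reject, which is the trade the paragraph above argues is acceptable.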

【００５５】 First, the first score FSCR and the second score SSCR obtained for each word or sentence that was recognized correctly are applied to the score evaluation function of equation (1) above to calculate a score evaluation value G(L) for each word or sentence. A histogram R(G(L)) is then obtained that shows how the number of words or sentences is distributed over the calculated score evaluation values G(L).

【００５６】 Similarly, the first score FSCR and the second score SSCR obtained for each word or sentence that was misrecognized are applied to the score evaluation function of equation (1) above to calculate a score evaluation value G(L) for each word or sentence. A histogram T(G(L)) is then obtained that shows how the number of words or sentences is distributed over the calculated score evaluation values G(L).

【００５７】 When the histograms R(G(L)) and T(G(L)) are obtained in this way, R(G(L)) has the distribution shown by the solid line in FIG. 2(c), and T(G(L)) has the distribution shown by the dotted line in FIG. 2(c).

【００５８】 In other words, the histogram R(G(L)) represents the characteristics observed when the uttered voice of a word or sentence is recognized correctly, and the histogram T(G(L)) represents the characteristics observed when the uttered voice of a word or sentence is misrecognized.

【００５９】 Take as a boundary the score evaluation value G(L) at which the count in the histogram T(G(L)) falls to 0. The range WR in which G(L) is larger than that boundary is the range in which a correct recognition result RCG can be judged correct. The range WT in which G(L) is smaller than that boundary is the range in which an erroneous recognition result RCG can be judged erroneous.

【００６０】 Accordingly, the threshold value THD is set to a value slightly larger than the score evaluation value G(L) at which the count in the histogram T(G(L)) falls to 0.
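The threshold selection described above can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes equation (1) has the form G(L) = FSCR × (FSCR − SSCR), as the description of the first embodiment states in words, and it assumes positive-valued scores. The names `score_evaluation` and `choose_threshold` and the `margin` parameter are hypothetical.

```python
from typing import Iterable, Tuple

ScorePair = Tuple[float, float]  # (FSCR, SSCR) for one word or sentence


def score_evaluation(fscr: float, sscr: float) -> float:
    """G(L) as described in words for equation (1): FSCR * (FSCR - SSCR)."""
    return fscr * (fscr - sscr)


def choose_threshold(misrecognized: Iterable[ScorePair], margin: float = 1e-3) -> float:
    """Return THD slightly above the largest G(L) among misrecognized samples.

    In the sample data, the histogram T(G(L)) has count 0 above that value.
    `margin` is a hypothetical tuning constant, not taken from the source.
    """
    values = [score_evaluation(f, s) for f, s in misrecognized]
    if not values:
        raise ValueError("need at least one misrecognized sample")
    return max(values) + margin
```

A larger `margin` trades away some correctly recognized utterances (which are then simply not used for adaptation) in exchange for fewer wrong adaptations, which matches the priority stated in the text.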

【００６１】 Each time a recognition result RCG for a word or sentence is output, the correctness determination unit 6 compares the threshold value THD with the score evaluation value G(L) obtained from the score evaluation function of equation (1) above, determines whether THD and G(L) are in a predetermined relationship, and outputs the determination result RSLT. That is, when G(L) is larger than THD (G(L) > THD), it judges the recognition result RCG to be correct. When G(L) is not larger than THD (G(L) ≤ THD), it judges the recognition result RCG to be erroneous. In each case it outputs the corresponding determination result RSLT.
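The comparison performed by the correctness determination unit 6 can be illustrated with a short sketch. The `Judgment` enum and the `judge` function are hypothetical names, and the form of G(L) is assumed from the verbal description of equation (1) (difference of the two scores multiplied by the first score).

```python
from enum import Enum


class Judgment(Enum):
    """Possible values of the determination result RSLT (names are illustrative)."""
    CORRECT = "correct"
    ERROR = "error"


def judge(fscr: float, sscr: float, thd: float) -> Judgment:
    """Judge RCG correct only when G(L) > THD; G(L) <= THD counts as an error."""
    g = fscr * (fscr - sscr)  # assumed form of equation (1)
    return Judgment.CORRECT if g > thd else Judgment.ERROR
```

Note that the boundary case G(L) = THD falls on the error side, as in the text.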

【００６２】 The speaker adaptation processing unit 7 receives the recognition result RCG and the determination result RSLT, and performs speaker adaptation processing according to the determination result RSLT.

【００６３】 That is, when it receives a determination result RSLT indicating that voice recognition was performed correctly, it performs speaker adaptation using the feature vector V(n) of the voice uttered at that time. When it receives a determination result RSLT indicating misrecognition, it does not perform speaker adaptation.

【００６４】 Furthermore, the speaker adaptation described above is applied to all or part of the pre-adaptation phoneme models HMMsb.

【００６５】 That is, when a determination result RSLT indicating correct voice recognition is input, the utterance content of the feature vector sequence V(n) of the uttered voice is regarded as being the recognition result RCG. Speaker adaptation is then applied to the pre-adaptation phoneme model HMMsb by a speaker adaptation method that requires the utterance content to be known, such as MLLR or MAP estimation, to obtain the speaker-adapted phoneme model HMMsb". This HMMsb" is supplied to the acoustic model storage unit 5, where it replaces the pre-adaptation acoustic model HMMsb and is stored as the update.

【００６６】 Needless to say, the speaker adaptation process is performed continuously or repeatedly. The speaker-adapted phoneme model stored as an update in the acoustic model storage unit 5 therefore serves as the pre-adaptation phoneme model in the next round of speaker adaptation. When the phoneme model obtained in that round is stored as an update in the acoustic model storage unit 5, the following round of speaker adaptation is applied to that updated model, and this cycle repeats.
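The gated, repeated update described above can be sketched with a toy stand-in for the acoustic model. The sketch below is an assumption-laden illustration: it represents each phoneme model only by a mean vector, assumes the feature frames have already been aligned to phonemes using the recognition result RCG as the transcript, and uses a MAP-style mean interpolation with a hypothetical prior weight `tau`. It is not MLLR and not the adaptation procedure of the patent, which the text leaves to known methods.

```python
from typing import Dict, List, Sequence

Model = Dict[str, List[float]]  # phoneme label -> mean vector (toy stand-in for HMMsb)


def map_update_mean(prior_mean: Sequence[float],
                    frames: Sequence[Sequence[float]],
                    tau: float = 10.0) -> List[float]:
    """MAP-style mean update: (tau * prior + sum of frames) / (tau + frame count)."""
    n = len(frames)
    if n == 0:
        return list(prior_mean)
    sums = [sum(frame[d] for frame in frames) for d in range(len(prior_mean))]
    return [(tau * prior_mean[d] + sums[d]) / (tau + n) for d in range(len(prior_mean))]


def adapt_if_correct(model: Model,
                     phoneme_frames: Dict[str, List[List[float]]],
                     judged_correct: bool,
                     tau: float = 10.0) -> Model:
    """Return the adapted model (HMMsb") when RSLT is 'correct', else the model unchanged."""
    if not judged_correct:
        return model  # misrecognition: skip adaptation entirely
    updated = dict(model)
    for phoneme, frames in phoneme_frames.items():
        if phoneme in updated:  # only the phonemes in the utterance are touched
            updated[phoneme] = map_update_mean(updated[phoneme], frames, tau)
    return updated
```

Feeding the returned model back in as `model` for the next utterance reproduces the cycle in which each adapted model becomes the pre-adaptation model of the following round.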

【００６７】 Next, the operation of this voice recognition device configured as above is described with reference to the flowchart shown in FIG. 3.

【００６８】 In FIG. 3, when the speaker adaptation process is started, voice input processing begins in step S100.

【００６９】 Next, in step S102, the word dictionary 4, the word model generation unit 3, and the voice recognition processing unit 2 perform voice recognition by collating the feature vector V(n) of the uttered voice with the word models HMMw, and output the recognition result RCG, the first score FSCR, and the second score SSCR.
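As a rough illustration of the outputs of step S102, the sketch below picks the best and second-best candidates from a table of per-word likelihood scores. How those scores are computed from V(n) and the word models HMMw is outside the sketch; the `recognize` function and the dictionary input are hypothetical.

```python
from typing import Dict, Tuple


def recognize(scores: Dict[str, float]) -> Tuple[str, float, float]:
    """Return (RCG, FSCR, SSCR): best word, its score, and the next-best score."""
    if len(scores) < 2:
        raise ValueError("need at least two candidate word models")
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    (rcg, fscr), (_, sscr) = ranked[0], ranked[1]
    return rcg, fscr, sscr
```

For example, `recognize({"tokyo": 0.9, "kyoto": 0.6, "osaka": 0.2})` yields `("tokyo", 0.9, 0.6)`.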

【００７０】 Next, in step S104, the correctness determination unit 6 applies the first score FSCR and the second score SSCR to the score evaluation function of equation (1) above and calculates the score evaluation value G(L).

【００７１】 Next, in step S106, the correctness determination unit 6 compares the score evaluation value G(L) with the threshold value THD. When G(L) > THD ("YES"), it judges the recognition result RCG to be correct and the process proceeds to step S108. When G(L) ≤ THD ("NO"), it judges the recognition result RCG to be erroneous and the process ends without performing speaker adaptation.

【００７２】 When the process proceeds to step S108, the speaker adaptation processing unit 7 applies speaker adaptation to the phoneme model HMMsb in the acoustic model storage unit 5 using the feature vector V(n). Then, in step S110, the speaker-adapted phoneme model HMMsb" is stored as the update, and the process ends.

【００７３】 For convenience, FIG. 3 shows the speaker adaptation process for the uttered voice of a single word or sentence. When speaker adaptation is performed on continuous text consisting of a plurality of words or sentences, the process of FIG. 3 is repeated.

【００７４】 The uttered words or sentences are processed in sequence. For a word or sentence that was misrecognized, step S106 yields "NO" and the speaker adaptation process (steps S108 and S110) is skipped for the phoneme model HMMsb. For a word or sentence that was recognized correctly, step S106 yields "YES" and the speaker adaptation process (steps S108 and S110) is performed. Appropriate speaker adaptation is achieved in this way.
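The repeated flow of FIG. 3 can be summarized in a short sketch. Everything here is illustrative: `score_fn` stands in for the collation of step S102, `adapt_fn` for the adaptation and update of steps S108 and S110, and the form of G(L) is assumed from the verbal description of equation (1). None of these names come from the source.

```python
from typing import Any, Callable, Dict, Iterable, List, Tuple


def run_adaptation(utterances: Iterable[Any],
                   score_fn: Callable[[Any, Any], Dict[str, float]],
                   adapt_fn: Callable[[Any, Any, str], Any],
                   model: Any,
                   thd: float) -> Tuple[Any, List[Tuple[str, bool]]]:
    """Apply the FIG. 3 flow to each utterance in turn.

    Returns the final model and a log of (RCG, adapted?) pairs.
    """
    log: List[Tuple[str, bool]] = []
    for features in utterances:                      # S100: next voice input
        scores = score_fn(model, features)           # S102: collate with word models
        ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
        (rcg, fscr), (_, sscr) = ranked[0], ranked[1]
        g = fscr * (fscr - sscr)                     # S104: assumed form of equation (1)
        if g > thd:                                  # S106: "YES"
            model = adapt_fn(model, features, rcg)   # S108 and S110: adapt and store
            log.append((rcg, True))
        else:                                        # S106: "NO", adaptation skipped
            log.append((rcg, False))
    return model, log
```

Because `model` is reassigned inside the loop, each later utterance is scored against the most recently adapted model.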

【００７５】 As described above, the voice recognition device of this embodiment can judge reliably and quickly whether the recognition result RCG is correct or erroneous simply by comparing the score evaluation value G(L), obtained by applying the two scores FSCR and SSCR to the score evaluation function, with the predetermined threshold value THD. In other words, the amount of processing needed to judge whether a voice recognition result is correct is greatly reduced, while high judgment accuracy is obtained.

【００７６】 As a result, speaker adaptation can be performed quickly and appropriately without causing the speaker stress. In addition, because erroneous speaker adaptation is greatly reduced, problems such as a drop in voice recognition accuracy after speaker adaptation can be prevented before they occur.

【００７７】 Moreover, each time a user or other speaker uses this voice recognition device, appropriate speaker adaptation is applied a little further to the phoneme model HMMsb in the acoustic model storage unit 5, so the voice recognition rate improves as the number of uses increases.

【００７８】 In this embodiment, as shown in equation (1) above, the score evaluation value G(L) is calculated from the first score FSCR and the second score SSCR by taking the difference between FSCR and SSCR and multiplying that difference by FSCR. The present invention is not limited to this, however. As a modification, the difference between the first score FSCR and the second score SSCR may itself be used as the score evaluation value G(L).
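For reference, the two evaluation functions described in words above can be written as follows. This is a reconstruction from the verbal description only; the typeset form of equation (1) is not reproduced in this passage, so the notation is an assumption.

```latex
% Equation (1) as described: the score difference multiplied by the first score
G(L) = \mathrm{FSCR} \times \left( \mathrm{FSCR} - \mathrm{SSCR} \right)

% Modification: the score difference alone
G(L) = \mathrm{FSCR} - \mathrm{SSCR}
```

Since FSCR is the maximum likelihood, the difference FSCR − SSCR is never negative in either form.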

【００７９】 (Second Embodiment) Next, a second embodiment of the present invention is described with reference to FIG. 4. In FIG. 4, parts that are the same as or equivalent to those in FIG. 1 are denoted by the same reference numerals.

【００８０】 Compared with the voice recognition device of the first embodiment shown in FIG. 1, this voice recognition device has the additional feature of being provided with a misjudgment handling unit 8 and a display unit 9.

【００８１】 The misjudgment handling unit 8 receives the determination result RSLT from the correctness determination unit 6. When it receives a determination result RSLT indicating that the recognition result RCG is correct, it outputs that recognition result RCG. When it receives a determination result RSLT indicating that the recognition result RCG is a misrecognition, it prohibits output of that recognition result RCG. The recognition result RCG is thus output only when voice recognition was performed correctly.

【００８２】 Furthermore, when it receives a determination result RSLT indicating that the recognition result RCG is a misrecognition, it instructs the display unit 9, which is formed of a liquid crystal display or the like, to show a warning in text or the like stating that misrecognition occurred and that the user should speak again.
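The behavior of the misjudgment handling unit 8 can be sketched as a small gate in front of the output. The function name `handle_result`, the `show` callback standing in for the display unit 9, and the warning wording are all hypothetical.

```python
from typing import Callable, Optional

# Hypothetical wording; the source only says the warning states that
# misrecognition occurred and that the user should speak again.
WARNING = "Recognition failed. Please speak again."


def handle_result(rcg: str,
                  judged_correct: bool,
                  show: Callable[[str], None]) -> Optional[str]:
    """Pass RCG through when judged correct; otherwise suppress it and warn."""
    if judged_correct:
        return rcg
    show(WARNING)  # instruct the display unit to present the warning
    return None    # output of RCG is prohibited
```

Passing `print` as `show` is enough to try it out in a console.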

【００８３】 In this way, by using the determination result RSLT of the correctness determination unit 6, this voice recognition device can present the user with information such as whether appropriate speaker adaptation was performed and that it would be better to speak again, which improves convenience for the user.

【００８４】[0084]

[Effects of the Invention] As described above, according to the voice recognition device and voice recognition method of the present invention, in the speaker adaptation process the acoustic model is collated with the feature vector of the uttered voice to obtain a recognition result indicating the acoustic model for which the maximum likelihood is obtained, a first score indicating the value of the maximum likelihood, and a second score indicating the next-highest likelihood value. The recognition result is then judged correct or erroneous by comparing an evaluation value based on the first score and the second score with a preset threshold value. The correctness judgment can therefore be made with high accuracy and a small amount of processing.

[Brief description of drawings]

FIG. 1 is a diagram showing the configuration of the voice recognition device according to the first embodiment.

FIG. 2 is a diagram for explaining the principle of accurately judging whether a recognition result is correct or erroneous.

FIG. 3 is a flowchart showing the operation of the voice recognition device according to the first embodiment.

FIG. 4 is a diagram showing the configuration of the voice recognition device according to the second embodiment.

[Explanation of symbols]

1 ... Acoustic analysis unit
2 ... Voice recognition processing unit
3 ... Word model generation unit
4 ... Word dictionary
5 ... Acoustic model storage unit
6 ... Correctness determination unit
7 ... Speaker adaptation processing unit
8 ... Misjudgment handling unit
9 ... Display unit

Claims

[Claims]

1. A voice recognition device that applies speaker adaptation to an acoustic model using a feature vector of an uttered voice, comprising: voice recognition means for collating the acoustic model with the feature vector of the uttered voice and outputting a recognition result indicating the acoustic model for which the maximum likelihood is obtained, a first score indicating the value of the maximum likelihood, and a second score indicating the next-highest likelihood value; determination means for comparing an evaluation value based on the first score and the second score with a preset threshold value and judging the recognition result to be correct when the evaluation value is in a predetermined relationship with the threshold value; and speaker adaptation processing means for applying speaker adaptation to the acoustic model when the determination means judges the recognition result to be correct.

2. The voice recognition device according to claim 1, wherein the determination means judges the recognition result to be erroneous when the evaluation value is not in the predetermined relationship with the threshold value, and the speaker adaptation processing means does not apply speaker adaptation to the acoustic model when the recognition result is judged to be erroneous.

3. The voice recognition device according to claim 1, wherein the evaluation value is calculated from the difference between the first score and the second score.

4. The voice recognition device according to claim 2, further comprising means for prohibiting output of the recognition result and presenting information indicating that the recognition result is erroneous when the determination means judges the recognition result to be erroneous.

5. A voice recognition method that applies speaker adaptation to an acoustic model using a feature vector of an uttered voice, comprising: a first step of collating the acoustic model with the feature vector of the uttered voice to obtain a recognition result indicating the acoustic model for which the maximum likelihood is obtained, a first score indicating the value of the maximum likelihood, and a second score indicating the next-highest likelihood value; a second step of comparing an evaluation value based on the first score and the second score with a preset threshold value and judging the recognition result to be correct when the evaluation value is in a predetermined relationship with the threshold value; and a third step of applying speaker adaptation to the acoustic model when the recognition result is judged to be correct in the second step.

6. The voice recognition method according to claim 5, wherein the second step judges the recognition result to be erroneous when the evaluation value is not in the predetermined relationship with the threshold value, and the third step does not apply speaker adaptation to the acoustic model when the recognition result is judged to be erroneous.

7. The voice recognition method according to claim 5, wherein the evaluation value is calculated from the difference between the first score and the second score.

8. The voice recognition method according to claim 6, wherein, when the recognition result is judged to be erroneous in the second step, output of the recognition result is prohibited and information indicating that the recognition result is erroneous is presented.