JP3605011B2

JP3605011B2 - Voice recognition method

Info

Publication number: JP3605011B2
Application number: JP2000240283A
Authority: JP
Inventors: 計美大倉
Original assignee: Sanyo Electric Co Ltd
Current assignee: Sanyo Electric Co Ltd
Priority date: 2000-08-08
Filing date: 2000-08-08
Publication date: 2004-12-22
Anticipated expiration: 2020-08-08
Also published as: JP2002055691A

Description

[0001]
[Technical field of the invention]
The present invention relates to a speech recognition method.
[0002]
[Prior art]
Speech recognition methods include those using DP matching and those using a probabilistic model (HMM: Hidden Markov Model). In word recognition using DP matching, a template is created for each word. In word recognition using an HMM, templates are created in smaller units.
[0003]
In word recognition using DP matching, as is well known, a distance is used as the evaluation value for the degree of matching between the input speech and a template. The smaller the distance, the higher the degree of matching is judged to be.
[0004]
In word recognition using an HMM, as is well known, a likelihood is used as the evaluation value for the degree of matching between the input speech and a template. The larger the likelihood, the higher the degree of matching is judged to be.
[0005]
When speech is uttered in a noisy environment, the acoustic features of low-power sounds such as consonants are corrupted by the noise, so the likelihood (when an HMM is used) or the distance (when DP matching is used) obtained by matching against a template becomes unreliable. Likewise, if the noise pattern at the time of template creation differs from the noise pattern at the time of recognition, the distance or likelihood in the noise sections becomes unreliable.
[0006]
For this reason, a method has been proposed that improves recognition accuracy in noise sections by learning a silence model from the actual noise environment and adding the silence model to the portions of the template other than the speech section. However, this method cannot improve recognition accuracy in speech sections that are inherently low in power and are buried in noise.
[0007]
[Problems to be solved by the invention]
An object of the present invention is to provide a speech recognition method capable of improving recognition accuracy when speech recognition is performed on speech uttered in a noisy environment.
[0008]
[Means for Solving the Problems]
A first speech recognition method according to the present invention is a speech recognition method that selects an optimal standard pattern by determining the degree of matching between an input speech feature and a template on the basis of an evaluation value, the method comprising: a first step of creating a noise feature from a noise section of the input speech feature; a second step of, in a section of the template whose power is smaller than a predetermined threshold, calculating a first evaluation value by comparing the speech feature of that section of the template with the input speech feature, and calculating a second evaluation value by comparing the noise feature created in the first step with the input speech feature; and a third step of adopting, of the first evaluation value and the second evaluation value, the one with the higher degree of matching as the evaluation value for that section.
[0010]
A second speech recognition method according to the present invention is a speech recognition method that selects an optimal standard pattern by determining the degree of matching between an input speech feature and a template on the basis of an evaluation value, the method comprising: a first step of creating a noise feature from a noise section of the input speech feature; a second step of, in a section in which the power of the template is smaller than a predetermined threshold and the power of the input speech feature is also smaller than a predetermined threshold, calculating a first evaluation value by comparing the speech feature of that section of the template with the input speech feature, and calculating a second evaluation value by comparing the noise feature created in the first step with the input speech feature; and a third step of adopting, of the first evaluation value and the second evaluation value, the one with the higher degree of matching as the evaluation value for that section.
[0011]
[Embodiments of the invention]
Hereinafter, embodiments of the present invention will be described with reference to the drawings.
[0012]
[1] Description of the concept of the present invention
The concept of the present invention will be described.
[0013]
FIG. 1A shows a template, and FIG. 1B shows an input speech feature. In FIG. 1A, Th denotes a preset threshold for determining the evaluation sections. In FIGS. 1A and 1B, sections A and A′ are noise sections, sections B and B′ are speech sections, and sections C and C′ are noise sections.
[0014]
Hereinafter, the speech recognition method will be described.
[0015]
First, a noise feature is created from the noise section A′ or C′ of the input speech feature.
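The text does not say how the noise feature is computed from the noise section. The following is a minimal sketch under the assumption that each frame is a feature vector and that the noise feature is simply the per-dimension mean of the frames in the noise section. The function name `make_noise_feature` and the frame-range arguments are hypothetical and do not appear in the original.

```python
from statistics import fmean


def make_noise_feature(frames, noise_start, noise_end):
    """Average the feature vectors of frames[noise_start:noise_end].

    frames: list of equal-length feature vectors (lists of floats).
    The half-open frame range marking the noise section (for example A' or C')
    is assumed to be known; how it is detected is outside this sketch.
    """
    noise_frames = frames[noise_start:noise_end]
    if not noise_frames:
        raise ValueError("noise section is empty")
    # Transpose frames into per-dimension tuples, then average each dimension.
    return [fmean(dim) for dim in zip(*noise_frames)]


# Illustrative data only: the first two frames are treated as the noise section.
frames = [[1.0, 2.0], [3.0, 4.0], [10.0, 20.0]]
noise_feature = make_noise_feature(frames, 0, 2)
```

A statistical model of the noise (for instance a mean and variance per dimension) would serve equally well here; the mean vector is only the simplest choice.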
[0016]
For the sections U1 and U2 of the template whose power is larger than the threshold Th, evaluation is performed by comparing the speech features of that portion of the template with the input speech features. That is, an evaluation value (likelihood or distance) is calculated.
[0017]
Two types of evaluation are performed for sections A, D1, D2, D3, and C of the template, whose power is smaller than the threshold Th. In the first, evaluation is performed by comparing the speech features of that portion of the template with the input speech features. In the second, evaluation is performed by comparing the noise feature created from the noise section A′ or C′ of the input speech feature with the input speech features. The first and second evaluation results are then compared, and the evaluation value with the higher degree of matching is used as the evaluation value of the section. That is, if the evaluation value is a likelihood, the larger likelihood is used as the evaluation value of the section, and if the evaluation value is a distance, the smaller distance is used as the evaluation value of the section.
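The selection rule described above can be sketched as follows. This is an illustration only: the function names `select_evaluation` and `score_section` and the `kind` argument are hypothetical, and the two input scores are assumed to have been computed elsewhere.

```python
def select_evaluation(template_score, noise_score, kind):
    """Return the score with the higher degree of matching.

    kind: "likelihood" (larger is better) or "distance" (smaller is better).
    """
    if kind == "likelihood":
        return max(template_score, noise_score)
    if kind == "distance":
        return min(template_score, noise_score)
    raise ValueError(f"unknown evaluation kind: {kind}")


def score_section(is_low_power, template_score, noise_score, kind):
    """Score one section of the template.

    High-power sections (such as U1, U2) use the template score as is;
    low-power sections (such as A, D1, D2, D3, C) use the better of the two.
    """
    if not is_low_power:
        return template_score
    return select_evaluation(template_score, noise_score, kind)


# Illustrative numbers only.
best_likelihood = score_section(True, -12.0, -7.5, "likelihood")
best_distance = score_section(True, 4.2, 6.0, "distance")
unchanged = score_section(False, -12.0, -7.5, "likelihood")
```

Because the rule only ever replaces the template score with a better one, the score of a low-power section can never be worse than it would be without the noise feature.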
[0018]
In a section where the SNR is so small that the spectrum, which characterizes the speech, is completely masked by noise, the degree of matching is higher when the noise feature is used as the template feature. On the other hand, in a section where the original speech feature remains, the degree of matching is higher when the speech feature of that portion of the template is used as the template feature. Therefore, as in the above embodiment, speech recognition accuracy is improved by performing the two types of evaluation for sections A, D1, D2, D3, and C of the template, whose power is smaller than the threshold Th, and using the evaluation value with the higher degree of matching.
[0019]
In the above embodiment, the two types of evaluation are performed for sections A, D1, D2, D3, and C of the template, whose power is smaller than the threshold Th. Alternatively, the two types of evaluation may be performed for sections A′, D1′, D2′, D3′, and C′ of the input speech feature, whose power is smaller than the threshold Th′ (see FIG. 1B).
[0020]
The two types of evaluation may also be performed for sections in which the power of the template is smaller than the threshold Th and the power of the input speech feature is also smaller than the threshold Th′.
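The three ways of choosing where the two types of evaluation are applied (template power below Th, input power below Th′, or both) can be sketched as a per-frame mask. This is a sketch under the assumption that the template and the input have already been aligned frame by frame; the function name `dual_evaluation_mask` and the `mode` argument are hypothetical.

```python
def dual_evaluation_mask(template_power, input_power, th, th_input, mode="both"):
    """Return one boolean per aligned frame: True where both evaluations run.

    template_power[i] and input_power[i] are assumed to refer to the same
    aligned frame; producing that alignment is outside this sketch.
    mode: "template" (template power below Th),
          "input"    (input power below Th'),
          "both"     (both conditions hold).
    """
    if len(template_power) != len(input_power):
        raise ValueError("power sequences must have the same length")
    mask = []
    for tp, ip in zip(template_power, input_power):
        low_template = tp < th
        low_input = ip < th_input
        if mode == "template":
            mask.append(low_template)
        elif mode == "input":
            mask.append(low_input)
        elif mode == "both":
            mask.append(low_template and low_input)
        else:
            raise ValueError(f"unknown mode: {mode}")
    return mask


# Illustrative power values only.
template_power = [1.0, 8.0, 3.0, 9.0]
input_power = [2.0, 9.0, 7.0, 4.0]
mask_template = dual_evaluation_mask(template_power, input_power, 5.0, 5.0, "template")
mask_input = dual_evaluation_mask(template_power, input_power, 5.0, 5.0, "input")
mask_both = dual_evaluation_mask(template_power, input_power, 5.0, 5.0, "both")
```

The "both" mode is the most restrictive of the three, since it selects only frames that are low in power in the template and in the input alike.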
[0021]
[2] Description of an embodiment applied to speech recognition using an HMM
An embodiment in which the invention is applied to speech recognition using an HMM will be described.
[0022]
FIG. 2A shows a template for "SAKAI". The template consists of an HMM state sequence and a power value for each state.
[0023]
When the template is learned, it is learned from training speech data as shown in FIG. 2B. The learning may be performed in units of words or in units smaller than words, for example in units of phonemes. The power of each state is obtained by EM estimation in the same manner as the other parameters (for example, the cepstrum). Alternatively, the power of each state may be obtained as follows: a model is first learned using the other parameters (for example, the cepstrum), the training data is aligned with the learned model, and the average power of each state is calculated from the portion of the training data aligned with that state.
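The second way of obtaining the per-state power, averaging the power of the training frames aligned with each state, can be sketched as below. It assumes the alignment is already available as one state index per frame (for example from a forced alignment); the function name `average_power_per_state` is hypothetical.

```python
from collections import defaultdict


def average_power_per_state(alignment, frame_power):
    """Return {state: mean power of the frames aligned with that state}.

    alignment[i] is the state assigned to training frame i, and
    frame_power[i] is the power of that frame. How the alignment is
    produced is outside this sketch.
    """
    if len(alignment) != len(frame_power):
        raise ValueError("alignment and frame_power must have the same length")
    sums = defaultdict(float)
    counts = defaultdict(int)
    for state, power in zip(alignment, frame_power):
        sums[state] += power
        counts[state] += 1
    return {state: sums[state] / counts[state] for state in sums}


# Illustrative data only: two states, five frames.
alignment = [0, 0, 1, 1, 1]
frame_power = [1.0, 3.0, 6.0, 6.0, 9.0]
state_power = average_power_per_state(alignment, frame_power)
```

With several training utterances, the same accumulation would simply run over all aligned frames of all utterances before dividing.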
[0024]
FIG. 3 shows a template and input speech features.
[0025]
First, a noise feature is created from a noise section of an input speech feature.
[0026]
The threshold Th of the template (FIG. 2) is set to 5 in this example. For the high-power portion of the template, that is, the "A" portion of the template, the likelihood is calculated by comparing the speech feature of that portion of the template with the input speech feature.
[0027]
Two types of evaluation are performed for the portions of the template (FIG. 2) whose average power is smaller than the threshold Th, that is, the "silence", "S", and "K" portions of the template. In the first, a likelihood is obtained by comparing the speech feature of that portion of the template with the input speech feature. In the second, a likelihood is obtained by comparing the noise feature created from the noise section of the input speech feature with the input speech feature. The first and second evaluation results are then compared, and the larger likelihood is used as the likelihood of the section.
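For the HMM case, the per-frame decision can be sketched as follows. The text does not specify the output distributions, so this sketch assumes a diagonal-covariance Gaussian per state and a Gaussian noise feature, and works with log-likelihoods. The power values, means, and variances are made up for illustration, and the names `log_gaussian` and `frame_log_likelihood` are hypothetical.

```python
import math


def log_gaussian(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at feature vector x."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def frame_log_likelihood(x, state, noise, th=5.0):
    """Log-likelihood of frame x for one template state.

    state: dict with "mean", "var", and "power"; noise: dict with "mean", "var".
    States whose power is below th get the larger of the template and noise
    log-likelihoods. The text does not say what happens at exactly th; this
    sketch treats that case as high power.
    """
    ll_template = log_gaussian(x, state["mean"], state["var"])
    if state["power"] >= th:
        return ll_template
    ll_noise = log_gaussian(x, noise["mean"], noise["var"])
    return max(ll_template, ll_noise)


# Hypothetical one-dimensional example with Th = 5.
noise = {"mean": [3.0], "var": [1.0]}
state_s = {"mean": [0.0], "var": [1.0], "power": 2.0}  # low power, like "S"
state_a = {"mean": [0.0], "var": [1.0], "power": 8.0}  # high power, like "A"
x = [3.0]  # a frame that looks like noise
ll_s = frame_log_likelihood(x, state_s, noise)
ll_a = frame_log_likelihood(x, state_a, noise)
```

In this example the frame matches the noise feature better than the low-power state, so `ll_s` takes the noise log-likelihood, while `ll_a` keeps the template log-likelihood regardless.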
[0028]
Note that the two types of evaluation may instead be performed for the portions "silence", "S", "A", and "I" of the input speech feature (FIG. 3), whose power is smaller than the threshold Th′.
[0029]
The two types of evaluation may also be performed for the portions "silence" and "S", where the power of the template (FIG. 2) is smaller than the threshold Th and the power of the input speech feature (FIG. 3) is also smaller than the threshold Th′.
[0030]
The following comparative experiment was conducted between a speech recognition method that performs the two types of evaluation for portions where the power of the template is smaller than the threshold Th and the power of the input speech feature is also smaller than the threshold Th′ (referred to as the method of the present invention) and a conventional speech recognition method (referred to as the conventional method).
[0031]
In a noise environment with an S/N ratio of 5 dB, a total of five male and female speakers uttered 100 place-name words, speech recognition was performed on these utterances by the method of the present invention and by the conventional method, and the recognition rates were obtained. The recognition rate was 94.8% with the conventional method and improved to 96.0% with the method of the present invention.
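To put the reported figures in terms of errors: the error rate falls from 5.2% to 4.0%, a relative reduction of about 23%. The short sketch below only redoes this arithmetic from the two recognition rates stated above; the function name `relative_error_reduction` is hypothetical.

```python
def relative_error_reduction(baseline_rate, new_rate):
    """Relative reduction in error rate, given recognition rates in percent."""
    baseline_error = 100.0 - baseline_rate
    new_error = 100.0 - new_rate
    return (baseline_error - new_error) / baseline_error


# Recognition rates reported for the conventional method and the proposed method.
reduction = relative_error_reduction(94.8, 96.0)
print(f"relative error reduction: {reduction:.1%}")
```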
[0032]
[Effects of the invention]
According to the present invention, recognition accuracy can be improved when speech recognition is performed on speech uttered in a noisy environment.
[Brief description of the drawings]
FIG. 1 is a waveform diagram showing a template and input speech features.
FIG. 2 is a schematic diagram showing a template in speech recognition using an HMM.
FIG. 3 is a schematic diagram showing templates and input speech features in speech recognition using an HMM.

Claims

In a voice recognition method for selecting an optimal standard pattern by determining a matching degree between an input voice feature and a template based on an evaluation value,
A first step of creating a noise feature from a noise section of the input speech feature,
A second step of, in a section of the template whose power is smaller than a predetermined threshold, calculating a first evaluation value by comparing the speech feature of that section of the template with the input speech feature, and calculating a second evaluation value by comparing the noise feature created in the first step with the input speech feature; and
A third step of adopting, from the first evaluation value and the second evaluation value, the evaluation value with the higher matching degree as the evaluation value for the section;
A voice recognition method comprising:

In a voice recognition method for selecting an optimal standard pattern by determining a matching degree between an input voice feature and a template based on an evaluation value,
A first step of creating a noise feature from a noise section of the input speech feature,
A second step of, in a section in which the power of the template is smaller than a predetermined threshold and the power of the input speech feature is also smaller than a predetermined threshold, calculating a first evaluation value by comparing the speech feature of that section of the template with the input speech feature, and calculating a second evaluation value by comparing the noise feature created in the first step with the input speech feature; and
A third step of adopting, from the first evaluation value and the second evaluation value, the evaluation value with the higher matching degree as the evaluation value for the section;
A voice recognition method comprising: