JP2002055691A

JP2002055691A - Voice-recognition method

Info

Publication number: JP2002055691A
Application number: JP2000240283A
Authority: JP
Inventors: Kazuyoshi Okura; 計美大倉
Original assignee: Sanyo Electric Co Ltd
Current assignee: Sanyo Electric Co Ltd
Priority date: 2000-08-08
Filing date: 2000-08-08
Publication date: 2002-02-20
Anticipated expiration: 2020-08-08
Also published as: JP3605011B2

Abstract

PROBLEM TO BE SOLVED: To provide a voice recognition method that improves recognition accuracy when recognizing a voice uttered in an environment in which noise exists. SOLUTION: The voice recognition method includes a first step of preparing a noise characteristic from the noise section of the inputted voice characteristic; a second step of, in a section where the power of the template is smaller than a prescribed threshold, calculating a first evaluation value by comparing the voice characteristic of that section of the template with the inputted voice characteristic, and calculating a second evaluation value by comparing the noise characteristic prepared in the first step with the inputted voice characteristic; and a third step of adopting, of the first evaluation value and the second evaluation value, the one having the higher matching degree as the evaluation value of the section.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001] [Technical Field of the Invention] The present invention relates to a speech recognition method.

[0002] [Description of the Related Art] Speech recognition methods include those using DP matching and those using a probabilistic model (HMM; Hidden Markov Model). In word recognition using DP matching, a template is created for each word. In word recognition using an HMM, templates are created in smaller units.

[0003] In word recognition using DP matching, as is well known, a distance is used as the evaluation value for evaluating the degree of matching between the input speech and a template. The smaller the distance, the higher the degree of matching is judged to be.

[0004] In word recognition using an HMM, as is well known, a likelihood is used as the evaluation value for evaluating the degree of matching between the input speech and a template. The larger the likelihood, the higher the degree of matching is judged to be.

[0005] When speech is uttered in a noisy environment, the acoustic features of low-power sounds such as consonants are contaminated by the noise, and the likelihood (when an HMM is used) or the distance (when DP matching is used) obtained by matching against a template becomes unreliable. Likewise, if the noise pattern at the time of template creation differs from the noise pattern at the time of recognition, the distance or likelihood in the noise sections becomes unreliable.

[0006] A method has therefore been proposed that improves the recognition accuracy of noise sections by learning a silence model from the actual noise environment and adding the silence model to the portions of the template outside the speech section. However, this method cannot improve the recognition accuracy of speech sections that are inherently low in power and buried in noise.

[0007] [Problem to Be Solved by the Invention] An object of the present invention is to provide a speech recognition method capable of improving recognition accuracy when speech recognition is performed on speech uttered in a noisy environment.

[0008] [Means for Solving the Problem] A first speech recognition method according to the present invention is a speech recognition method that selects an optimal standard pattern by determining the degree of matching between an input speech feature and a template on the basis of an evaluation value, and is characterized by comprising: a first step of creating a noise feature from a noise section of the input speech feature; a second step of, in a section where the power of the template is smaller than a predetermined threshold, calculating a first evaluation value by comparing the speech feature of that section of the template with the input speech feature, and calculating a second evaluation value by comparing the noise feature created in the first step with the input speech feature; and a third step of adopting, of the first evaluation value and the second evaluation value, the one with the higher degree of matching as the evaluation value for that section.

[0009] A second speech recognition method according to the present invention is a speech recognition method that selects an optimal standard pattern by determining the degree of matching between an input speech feature and a template on the basis of an evaluation value, and is characterized by comprising: a first step of creating a noise feature from a noise section of the input speech feature; a second step of, in a section where the power of the input speech feature is smaller than a predetermined threshold, calculating a first evaluation value by comparing the speech feature of that section of the template with the input speech feature, and calculating a second evaluation value by comparing the noise feature created in the first step with the input speech feature; and a third step of adopting, of the first evaluation value and the second evaluation value, the one with the higher degree of matching as the evaluation value for that section.

[0010] A third speech recognition method according to the present invention is a speech recognition method that selects an optimal standard pattern by determining the degree of matching between an input speech feature and a template on the basis of an evaluation value, and is characterized by comprising: a first step of creating a noise feature from a noise section of the input speech feature; a second step of, in a section where the power of the template is smaller than a predetermined threshold and the power of the input speech feature is also smaller than a predetermined threshold, calculating a first evaluation value by comparing the speech feature of that section of the template with the input speech feature, and calculating a second evaluation value by comparing the noise feature created in the first step with the input speech feature; and a third step of adopting, of the first evaluation value and the second evaluation value, the one with the higher degree of matching as the evaluation value for that section.

[0011] [Embodiments of the Invention] Embodiments of the present invention will be described below with reference to the drawings.

[0012] [1] Description of the Concept of the Present Invention

The concept of the present invention will be described.

[0013] FIG. 1(a) shows a template, and FIG. 1(b) shows an input speech feature. In FIG. 1(a), Th denotes a preset threshold for determining the evaluation sections. In FIGS. 1(a) and 1(b), sections A and A' are noise sections, sections B and B' are speech sections, and sections C and C' are noise sections.

[0014] The speech recognition method will now be described.

[0015] First, a noise feature is created from the noise section A' or C' of the input speech feature.
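The source does not say how the noise feature is computed from the noise section. As a minimal sketch, assuming the input is a sequence of per-frame feature vectors and that the noise feature is simply the per-dimension mean and variance of the frames in A' and/or C', this first step could look as follows. The function name `make_noise_feature` and the (start, end) section format are hypothetical.

```python
from statistics import fmean, pvariance


def make_noise_feature(frames, noise_sections):
    """Estimate a noise feature from the noise sections of the input.

    frames:         list of per-frame feature vectors (lists of floats)
    noise_sections: list of (start, end) frame index pairs, end exclusive,
                    e.g. the ranges covering A' and/or C'
    Returns (mean, variance) per dimension. Using mean and variance is an
    assumption of this sketch, not something the source specifies.
    """
    noise_frames = [frames[t] for start, end in noise_sections
                    for t in range(start, end)]
    if not noise_frames:
        raise ValueError("noise sections contain no frames")
    dims = len(noise_frames[0])
    mean = [fmean(f[d] for f in noise_frames) for d in range(dims)]
    var = [pvariance([f[d] for f in noise_frames], mu=mean[d])
           for d in range(dims)]
    return mean, var


# Frames 0-1 stand in for A', frame 3 for C'; frame 2 is speech.
frames = [[1.0, 2.0], [3.0, 4.0], [10.0, 10.0], [5.0, 6.0]]
noise_mean, noise_var = make_noise_feature(frames, [(0, 2), (3, 4)])
```

Estimating the feature from the utterance itself means it reflects the noise actually present at recognition time, which is the mismatch that paragraph [0005] identifies.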

[0016] For sections U1 and U2, where the template power is larger than the threshold Th, the evaluation is performed by comparing the speech feature of that portion of the template with the input speech feature. That is, an evaluation value (likelihood or distance) is calculated.

[0017] For sections A, D1, D2, D3, and C, where the template power is smaller than the threshold Th, two kinds of evaluation are performed. The first evaluation compares the speech feature of that portion of the template with the input speech feature. The second evaluation compares the noise feature created from the noise section A' or C' of the input speech feature with the input speech feature. The first and second evaluation results are then compared, and the evaluation value with the higher degree of matching is used as the evaluation value of the section. That is, when the evaluation value is a likelihood, the larger likelihood is used as the evaluation value of the section; when the evaluation value is a distance, the smaller distance is used.
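The selection rule above can be sketched in a few lines. This is an illustrative sketch only: the function names, the per-section list layout, and the handling of power exactly equal to Th (treated here as not low power) are assumptions, since the source only describes "smaller than" and "larger than".

```python
def better_evaluation(first, second, kind):
    """Return the evaluation value with the higher degree of matching.

    kind == "likelihood": larger is better (HMM case)
    kind == "distance":   smaller is better (DP matching case)
    """
    if kind == "likelihood":
        return max(first, second)
    if kind == "distance":
        return min(first, second)
    raise ValueError(f"unknown evaluation kind: {kind!r}")


def evaluate_sections(template_power, first_values, second_values, th,
                      kind="likelihood"):
    """Adopt one evaluation value per section.

    first_values:  template feature vs. input feature
    second_values: noise feature vs. input feature
    Sections whose template power is below th take the better of the two;
    all other sections keep the template-based value.
    """
    adopted = []
    for power, first, second in zip(template_power, first_values,
                                    second_values):
        if power < th:
            adopted.append(better_evaluation(first, second, kind))
        else:
            adopted.append(first)
    return adopted


# Three sections with template power 2, 8, 3 and threshold Th = 5.
log_likelihoods = evaluate_sections([2, 8, 3], [-5.0, -1.0, -2.0],
                                    [-3.0, -0.5, -4.0], th=5)
distances = evaluate_sections([2, 8, 3], [5.0, 1.0, 2.0],
                              [3.0, 0.5, 4.0], th=5, kind="distance")
```

In the second section the noise-based value is better, but it is ignored because the template power there exceeds the threshold.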

[0018] In a section where the SNR is so low that the spectrum characterizing the speech is completely masked by noise, the degree of matching is higher when the noise feature is used as the template feature. Conversely, in a section where the original speech feature remains, the degree of matching is higher when the speech feature of that portion of the template is used as the template feature. Therefore, as in the above embodiment, performing the two kinds of evaluation for sections A, D1, D2, D3, and C, where the template power is smaller than the threshold Th, and using the evaluation value with the higher degree of matching improves speech recognition accuracy.

[0019] In the above embodiment, the two kinds of evaluation are performed for sections A, D1, D2, D3, and C, where the template power is smaller than the threshold Th. Alternatively, the two kinds of evaluation may be performed for sections A', D1', D2', D3', and C', where the power of the input speech feature is smaller than a threshold Th' (see FIG. 1(b)).

[0020] The two kinds of evaluation may also be performed for sections where the template power is smaller than the threshold Th and the power of the input speech feature is smaller than the threshold Th'.
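The three ways of choosing where the two evaluations are performed (template power only, input power only, or both) can be compared side by side. The sketch below assumes that template sections and input sections have already been put into one-to-one correspondence (for example by the matching path); the function name and the `mode` strings are hypothetical.

```python
def dual_evaluation_mask(template_power, input_power, th, th_input,
                         mode="template"):
    """Flag the sections in which both evaluations are performed.

    mode "template": template power < th             (first method)
    mode "input":    input power < th_input          (second method)
    mode "both":     both conditions hold            (third method)
    """
    if len(template_power) != len(input_power):
        raise ValueError("template and input must be aligned section by section")
    mask = []
    for tp, ip in zip(template_power, input_power):
        low_template = tp < th
        low_input = ip < th_input
        if mode == "template":
            mask.append(low_template)
        elif mode == "input":
            mask.append(low_input)
        elif mode == "both":
            mask.append(low_template and low_input)
        else:
            raise ValueError(f"unknown mode: {mode!r}")
    return mask


template_power = [1, 2, 9, 3, 8]
input_power = [1, 6, 2, 2, 9]
by_template = dual_evaluation_mask(template_power, input_power, 5, 5, "template")
by_input = dual_evaluation_mask(template_power, input_power, 5, 5, "input")
by_both = dual_evaluation_mask(template_power, input_power, 5, 5, "both")
```

The "both" variant is the most conservative: it falls back to the noise feature only where the template expects a weak sound and the input is also weak.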

[0021] [2] Description of an Example Applied to Speech Recognition Using an HMM

An example in which the invention is applied to speech recognition using an HMM will be described.

[0022] FIG. 2(a) shows the template for "SAKAI". The template consists of an HMM state sequence and a power value for each state.

[0023] When the template is trained, it is learned from training speech data, as shown in FIG. 2(b). Training may be performed per word or per unit smaller than a word, for example per phoneme. The power for each state is obtained by EM estimation in the same way as the other parameters (for example, the cepstrum). Alternatively, the power for each state may be obtained as follows: a model is trained using the other parameters (for example, the cepstrum), the training data is aligned with the resulting model, and the average power of each state is calculated from the portion of the training data aligned to that state.
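The alignment-based alternative amounts to averaging frame power per state. The sketch below assumes the alignment is already available as one state label per training frame (for instance from a Viterbi alignment, which the source does not name); `state_average_power` and the data layout are hypothetical.

```python
from collections import defaultdict


def state_average_power(frame_power, alignment):
    """Average the power of the training frames aligned to each state.

    frame_power: power value of each training frame
    alignment:   state label assigned to each frame (same length)
    Returns a dict mapping state label to average power.
    """
    if len(frame_power) != len(alignment):
        raise ValueError("one state label is required per frame")
    sums = defaultdict(float)
    counts = defaultdict(int)
    for power, state in zip(frame_power, alignment):
        sums[state] += power
        counts[state] += 1
    return {state: sums[state] / counts[state] for state in sums}


# Five training frames aligned to three states.
avg = state_average_power([1.0, 3.0, 9.0, 11.0, 2.0],
                          ["S", "S", "A", "A", "K"])
```

With several training utterances, the frames and labels of all utterances would be concatenated before averaging.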

[0024] FIG. 3 shows the template and the input speech feature.

[0025] First, a noise feature is created from the noise section of the input speech feature.

[0026] In this example, the threshold Th of the template (FIG. 2) is set to 5. For the portions where the template power is large, that is, the "A" portions of the template, the likelihood is calculated by comparing the speech feature of that portion of the template with the input speech feature.

[0027] For the portions where the average power is smaller than the threshold Th of the template (FIG. 2), that is, the "silence", "S", and "K" portions of the template, two kinds of evaluation are performed. In the first, the likelihood is obtained by comparing the speech feature of that portion of the template with the input speech feature. In the second, the likelihood is obtained by comparing the noise feature created from the noise section of the input speech feature with the input speech feature. The first and second evaluation results are then compared, and the larger likelihood is used as the likelihood of the section.
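To make the HMM case concrete, the following sketch scores a single input frame against one template state. It assumes, purely for illustration, that each state and the noise feature are modeled by a single diagonal-covariance Gaussian and that likelihoods are compared in the log domain; the source specifies neither. The dictionary layout and function names are hypothetical, and the threshold default of 5 follows the example value of Th.

```python
import math


def log_gauss(x, mean, var):
    """Log density of x under a diagonal-covariance Gaussian."""
    return sum(-0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))


def frame_log_likelihood(x, state, noise, th=5.0):
    """Log-likelihood adopted for frame x at one template state.

    state: {"mean": [...], "var": [...], "power": float}
    noise: {"mean": [...], "var": [...]} built from the input's noise section
    States with power below th take the larger of the state-based and
    noise-based log-likelihoods; other states use the state model only.
    """
    first = log_gauss(x, state["mean"], state["var"])
    if state["power"] >= th:
        return first
    second = log_gauss(x, noise["mean"], noise["var"])
    return max(first, second)


noise = {"mean": [3.0], "var": [1.0]}
state_s = {"mean": [0.0], "var": [1.0], "power": 2.0}  # low power, like "S"
state_a = {"mean": [0.0], "var": [1.0], "power": 9.0}  # high power, like "A"

masked_frame = [3.0]  # a frame that looks like the noise
ll_s = frame_log_likelihood(masked_frame, state_s, noise)
ll_a = frame_log_likelihood(masked_frame, state_a, noise)
```

For the low-power state the noise model explains the frame better, so its log-likelihood is adopted; for the high-power state the template's own model is always used.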

[0028] Alternatively, the two kinds of evaluation may be performed for the portions "silence", "S", "A", and "I", where the power is smaller than the threshold Th' of the input speech feature (FIG. 3).

[0029] The two kinds of evaluation may also be performed for the portions "silence" and "S", where the power is smaller than the threshold Th of the template (FIG. 2) and also smaller than the threshold Th' of the input speech feature (FIG. 3).

[0030] The following comparative experiment was conducted between a speech recognition method that performs the two kinds of evaluation for portions where the power is smaller than the threshold Th of the template and also smaller than the threshold Th' of the input speech feature (referred to as the method of the present invention) and a conventional speech recognition method (referred to as the conventional method).

[0031] In a noise environment with an SNR of 5 dB, a total of five male and female speakers uttered 100 place-name words, speech recognition was performed on these utterances by both the method of the present invention and the conventional method, and the recognition rates were obtained. The recognition rate was 94.8% with the conventional method and improved to 96.0% with the method of the present invention.
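The reported rates can also be read as error rates, which shows the size of the improvement more directly. The short calculation below uses only the two percentages given above; the utterance count of 500 assumes that each of the five speakers uttered each of the 100 words exactly once, which the source does not state.

```python
def error_rate_reduction(baseline_acc, new_acc):
    """Return (baseline error %, new error %, relative error reduction)."""
    base_err = 100.0 - baseline_acc
    new_err = 100.0 - new_acc
    return base_err, new_err, (base_err - new_err) / base_err


base_err, new_err, rel = error_rate_reduction(94.8, 96.0)
# Error rate falls from about 5.2% to 4.0%, a relative reduction of about 23%.

# ASSUMPTION (not in the source): 5 speakers x 100 words = 500 utterances.
assumed_utterances = 5 * 100
correct_conventional = round(assumed_utterances * 94.8 / 100)
correct_proposed = round(assumed_utterances * 96.0 / 100)
```

Under that assumed count, the two rates correspond to 474 and 480 correctly recognized utterances.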

[0032] [Effects of the Invention] According to the present invention, recognition accuracy can be improved when speech recognition is performed on speech uttered in a noisy environment.

[Brief description of the drawings]

FIG. 1 is a waveform diagram showing a template and an input speech feature.

FIG. 2 is a schematic diagram showing a template used in speech recognition with an HMM.

FIG. 3 is a schematic diagram showing a template and an input speech feature in speech recognition with an HMM.

Claims

1. A speech recognition method that selects an optimal standard pattern by determining the degree of matching between an input speech feature and a template on the basis of an evaluation value, the method comprising: a first step of creating a noise feature from a noise section of the input speech feature; a second step of, in a section where the power of the template is smaller than a predetermined threshold, calculating a first evaluation value by comparing the speech feature of that section of the template with the input speech feature, and calculating a second evaluation value by comparing the noise feature created in the first step with the input speech feature; and a third step of adopting, of the first evaluation value and the second evaluation value, the one with the higher degree of matching as the evaluation value for that section.

2. A speech recognition method that selects an optimal standard pattern by determining the degree of matching between an input speech feature and a template on the basis of an evaluation value, the method comprising: a first step of creating a noise feature from a noise section of the input speech feature; a second step of, in a section where the power of the input speech feature is smaller than a predetermined threshold, calculating a first evaluation value by comparing the speech feature of that section of the template with the input speech feature, and calculating a second evaluation value by comparing the noise feature created in the first step with the input speech feature; and a third step of adopting, of the first evaluation value and the second evaluation value, the one with the higher degree of matching as the evaluation value for that section.

3. A speech recognition method that selects an optimal standard pattern by determining the degree of matching between an input speech feature and a template on the basis of an evaluation value, the method comprising: a first step of creating a noise feature from a noise section of the input speech feature; a second step of, in a section where the power of the template is smaller than a predetermined threshold and the power of the input speech feature is also smaller than a predetermined threshold, calculating a first evaluation value by comparing the speech feature of that section of the template with the input speech feature, and calculating a second evaluation value by comparing the noise feature created in the first step with the input speech feature; and a third step of adopting, of the first evaluation value and the second evaluation value, the one with the higher degree of matching as the evaluation value for that section.