JP2001215992A

JP2001215992A - Voice recognition device

Info

Publication number: JP2001215992A
Application number: JP2000022696A
Authority: JP
Inventors: Shigeki Aoshima; 滋樹青島
Original assignee: Toyota Motor Corp
Current assignee: Toyota Motor Corp
Priority date: 2000-01-31
Filing date: 2000-01-31
Publication date: 2001-08-10

Abstract

PROBLEM TO BE SOLVED: To provide a speech recognition device that reliably recognizes input speech under various environmental conditions. SOLUTION: Input speech is supplied to a spectral subtraction section 20 through a filter 10 and a spectrum analysis section 12. The section 20 subtracts noise from the input speech and supplies the result to a feature extraction section 22. A noise difference section 14 computes the difference between the input noise and the noise that was present when a speech dictionary 26 was learned. By subtracting this difference from the input speech, the section 20 cancels the mismatch between the input noise and the noise in the dictionary 26. The subtraction factor used in the subtraction is determined from the SNR of the input speech and the difference spectrum supplied by the section 14. The factor is also determined for every analysis frame.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Technical Field of the Invention] The present invention relates to a speech recognition device, and more particularly to a technique for recognizing speech uttered in a noisy environment.

[0002]

[Description of the Related Art] Techniques are known that recognize speech even in noise by comparing features of the speech obtained by subtracting noise from the input speech with standard speech obtained in advance by learning.

[0003] For example, Japanese Patent Application Laid-Open No. H11-154000, which discloses a noise suppression device and a speech recognition system using that device, describes a technique that performs speech recognition with the influence of noise removed by subtracting, from the power spectrum calculated from the input signal of a speech section, the noise power spectrum multiplied by a predetermined subtraction coefficient.

[0004]

[Problems to Be Solved by the Invention] In general, spectral subtraction, which subtracts a noise spectrum, estimates the noise by averaging several tens of frames of the noise section preceding the utterance, and subtracts this estimated noise from the input in the speech section, frame by frame (the unit of analysis), in the frequency domain.

[0005] However, when the input speech from which the influence of noise has been removed in this way is compared with a standard pattern prepared in advance, a problem arises if the standard pattern (speech dictionary) consists of speech uttered in an environment containing some noise. (Even when the environment is controlled to be noise-free, the noise cannot be removed completely, so some noise remains.) Because the comparison target is then speech with noise attached, the two differ and the recognition rate may fall.

[0006] Furthermore, the above prior art sets the subtraction factor to a value larger than 1. This reflects the observation that, because the estimated noise is an average, tuning the factor to the high-power speech sections gives a better recognition rate overall. If the subtraction factor is made equally large in low-power sections, however, too much noise is subtracted, which causes distortion and lowers the recognition rate.

[0007] The present invention has been made in view of the above problems of the prior art. Its object is to provide a device that can reliably recognize input speech even when the standard pattern used for comparison was uttered in noise, and that can suppress a drop in the recognition rate under various environments.

[0008]

[Means for Solving the Problems] To achieve the above object, the present invention provides a speech recognition device that performs recognition by comparing features of speech, obtained by subtracting noise from input speech, with standard speech obtained by learning, the device comprising calculating means for calculating the noise to be subtracted from the input speech based on the difference between the learning noise contained in the standard speech and the input noise contained in the input speech. Because the subtraction also takes into account the noise present at learning time, a drop in the recognition rate can be effectively suppressed.

[0009] The present invention also provides a speech recognition device that performs recognition by comparing input speech with standard speech obtained by learning, the device comprising calculating means for calculating the noise to be added to the standard speech based on the difference between the learning noise contained in the standard speech and the input noise contained in the input speech. Because the addition also takes into account the noise present at learning time, a drop in the recognition rate can be effectively suppressed.

[0010] Here, the device preferably further comprises means for determining the ratio to be subtracted, or the ratio to be added, according to the SNR of the input speech. Because of the Lombard effect, in which the utterance level rises in proportion to the noise level as the noise level increases, determining the subtraction or addition ratio from the SNR, which reflects the noise level (noise power) as well as the speech level (speech power), improves the recognition rate regardless of the magnitude of the speech power. Here, the SNR is defined as the ratio of the speech power to the noise power.

[0011] The SNR of the input speech is preferably calculated based on weighting in the frequency domain; more specifically, it is desirable to apply filtering based on human auditory characteristics.

[0012] The device preferably further comprises means for determining the ratio to be subtracted, or the ratio to be added, for each spectral band of the difference, or according to the power variance of the input noise. Varying the ratio for each spectral band improves the recognition rate in all bands, and determining the ratio according to the power variance of the input noise improves the recognition rate by exploiting the Lombard effect.

[0013] It is also preferable to determine the ratio based on the SNR or the power of the input noise in each speech analysis frame. Because the noise changes from one unit of analysis (frame) to the next, varying the ratio per unit of analysis enables more accurate recognition.

[0014]

[Embodiments of the Invention] Embodiments of the present invention are described below with reference to the drawings.

[0015] FIG. 1 is a block diagram of the overall configuration of this embodiment. Input speech from a microphone is supplied to a spectrum analysis unit 12 through a filter 10. When no speech is being input, noise is supplied to the spectrum analysis unit 12 through the filter 10. The filter 10 is designed with human auditory characteristics in mind; specifically, it preferentially passes the high-frequency region. The filter 10 is not essential, and the speech or noise from the microphone may be supplied directly to the spectrum analysis unit 12.

[0016] The spectrum analysis unit 12 analyzes the spectrum of the input speech or noise by FFT or a similar method and calculates the power at each frequency. The calculated spectrum is smoothed and supplied to a noise difference unit 14.

[0017] The noise difference unit 14 receives the input noise spectrum from the spectrum analysis unit 12 (the spectrum of a section in which noise, but no speech, is input from the microphone; it is taken as an estimate of the noise contained in the speech). It also receives a learning noise spectrum from a database 18, which stores the learning noise data that was present when the learning speech dictionary used for comparison was uttered. The noise difference unit 14 calculates the difference between these two spectra, that is, between the input noise spectrum and the learning noise spectrum, and uses it as the estimated noise spectrum. Specifically, the SNR of the estimated noise spectrum, SNR(estimated noise), is calculated by

$$\mathrm{SNR}(\text{estimated noise}) = \mathrm{SNR2} - \mathrm{SNR1} \qquad (1)$$

where SNR1 is the SNR of the learning noise spectrum and SNR2 is the SNR of the input noise spectrum. Here, the SNR is defined as the ratio of the power of the speech section to the power of the noise section (Speech to Noise Ratio), specifically

$$\mathrm{SNR} = 10 \log \frac{\sum P(S_i) / \sum(i)}{\sum P(N_j) / \sum(i)} \qquad (2)$$

The SNR of the input noise spectrum can be calculated from the ratio between the speech power obtained from utterance experiments and the noise power obtained by the analysis in the spectrum analysis unit 12. The same applies to the SNR of the learning noise spectrum.
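Equations (1) and (2) can be illustrated with a minimal Python sketch. It rests on assumptions that the source does not state: the logarithm is taken as base 10 (so the result is in dB), each sum is averaged over its own number of frames, and the function names and sample powers are hypothetical.

```python
import math


def snr_db(speech_powers, noise_powers):
    """Equation (2), assuming a base-10 logarithm: 10*log10 of the mean
    speech-section power divided by the mean noise-section power."""
    mean_speech = sum(speech_powers) / len(speech_powers)
    mean_noise = sum(noise_powers) / len(noise_powers)
    return 10.0 * math.log10(mean_speech / mean_noise)


def estimated_noise_snr(snr1_learning, snr2_input):
    """Equation (1): SNR(estimated noise) = SNR2 - SNR1."""
    return snr2_input - snr1_learning


# Hypothetical per-frame powers (arbitrary linear units, illustration only).
snr1 = snr_db([100.0, 100.0], [1.0, 1.0])    # learning condition: quiet
snr2 = snr_db([100.0, 100.0], [10.0, 10.0])  # input condition: noisier
diff = estimated_noise_snr(snr1, snr2)
```

With these sample values the learning condition has the higher SNR, so the difference from equation (1) is negative.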

[0018] Once the difference between the input noise spectrum and the learning noise spectrum has been calculated in this way, the result is supplied to a spectral subtraction unit 20.

[0019] The spectral subtraction unit 20 calculates the difference between the input speech pattern supplied through the filter 10 and the spectrum analysis unit 12 (the input signal spectrum in the speech section) and the estimated noise supplied from the noise difference unit 14, extracts the speech pattern from which the influence of noise has been removed, and supplies it to a feature extraction unit 22.

[0020] The feature extraction unit 22 extracts features from the input speech pattern from which the influence of noise has been removed and supplies them to a phoneme recognition unit 24. The phoneme recognition unit 24 matches the extracted features against a speech dictionary 26 prepared in advance by learning (the speech patterns in this dictionary include the noise present at learning time) and an acoustic model 28 to determine which phoneme they correspond to, and outputs the recognized phoneme.

[0021] FIG. 2 schematically shows the difference calculation in the noise difference unit 14. In the figure, (a) shows the SNR of the learning noise spectrum (SNR1), and (b) shows the SNR of the input noise spectrum (SNR2). From these two SNRs, the noise difference unit 14 calculates the difference to be spectrally subtracted, using equation (1) above.

[0022] FIG. 3 schematically shows the subtraction in the spectral subtraction unit 20. The difference is calculated between the input speech spectrum supplied through the filter 10 and the spectrum analysis unit 12 (solid line in the figure) and the estimated noise supplied from the noise difference unit 14 (SNR2 − SNR1; dash-dot line in the figure). This cancels the mismatch between the noise at learning time and the noise at speech input time, so the input can be matched accurately against the speech data recorded in the speech dictionary 26.

[0023] The above processing can also be viewed as subtracting the input noise from the input speech, adding the learning noise to the result, and matching it against the noisy speech data recorded in the speech dictionary 26. Expressed as a formula, the processing is (input speech) − {(input noise) − (learning noise)} = (input speech) − (input noise) + (learning noise), which shows that recognition is unaffected by the learning-time noise even when that noise is included in the speech dictionary.
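The identity in the formula above can be checked with a short Python sketch that works band by band on power spectra. The function name, the sample spectra, and the `floor` clamp are illustrative assumptions; the source does not describe how negative results are handled.

```python
def subtract_noise_difference(input_speech, input_noise, learning_noise, floor=0.0):
    """Per band: (input speech) - {(input noise) - (learning noise)}.

    `floor` clamps negative power values; this clamp is an assumption added
    for the sketch and is not described in the source.
    """
    result = []
    for s, n_in, n_learn in zip(input_speech, input_noise, learning_noise):
        estimated_noise = n_in - n_learn  # output of the noise difference step
        result.append(max(s - estimated_noise, floor))
    return result


# Hypothetical three-band power spectra.
input_speech = [10.0, 8.0, 6.0]
input_noise = [3.0, 2.0, 1.0]
learning_noise = [1.0, 1.0, 1.0]

cleaned = subtract_noise_difference(input_speech, input_noise, learning_noise)
# Same values computed the other way: input speech - input noise + learning noise.
equivalent = [s - n + l for s, n, l in zip(input_speech, input_noise, learning_noise)]
```

Both routes give the same spectrum, which is the equivalence the paragraph describes.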

[0024] When the spectral subtraction unit 20 subtracts the estimated noise from the input speech with a fixed subtraction factor (the multiplier applied to the noise being subtracted), it is difficult, as noted above, to improve the recognition rate consistently across various environments. Specifically, the subtraction factor becomes too large in low-power sections, and the distortion caused by subtracting too much noise lowers the recognition rate.

[0025] This embodiment therefore also provides a subtraction factor setting unit 30, which multiplies the estimated noise output from the noise difference unit 14 by a subtraction factor α and supplies the result to the spectral subtraction unit 20.

[0026] The subtraction factor setting unit 30 basically changes the subtraction factor dynamically according to the power of the input speech. In general, however, as shown in FIG. 4, the utterance level rises roughly in proportion to the noise level as the noise level increases (the so-called Lombard effect), which makes it difficult to set an optimal subtraction factor. In this embodiment, therefore, as shown in FIG. 1, an SNR calculation unit 34 calculates the SNR of the input speech after high-frequency emphasis by the filter 10 and supplies it to the subtraction factor setting unit 30, which sets the subtraction factor based on that SNR. Specifically, the larger the SNR of the input speech, the larger the subtraction factor. Changing the subtraction factor according to the SNR of the input speech, rather than simply according to its power, enables accurate speech recognition that also accounts for the Lombard effect and, in particular, reliably prevents over-subtraction in sections where the input speech power is low.
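The source states only that a larger input SNR gives a larger subtraction factor; it does not give the mapping. The following Python sketch assumes one possible monotonic mapping, a clamped linear ramp, with hypothetical parameter names and values chosen purely for illustration.

```python
def subtraction_factor_from_snr(snr_db, snr_low=0.0, snr_high=20.0,
                                alpha_min=0.5, alpha_max=2.0):
    """Map the input-speech SNR (dB) to a subtraction factor.

    Assumed mapping (not from the source): linear between (snr_low, alpha_min)
    and (snr_high, alpha_max), clamped outside that range, so the factor never
    decreases as the SNR grows.
    """
    t = (snr_db - snr_low) / (snr_high - snr_low)
    t = min(max(t, 0.0), 1.0)
    return alpha_min + t * (alpha_max - alpha_min)


low_snr_factor = subtraction_factor_from_snr(0.0)
mid_snr_factor = subtraction_factor_from_snr(10.0)
high_snr_factor = subtraction_factor_from_snr(20.0)
```

Keeping the factor small at low SNR is what limits over-subtraction in low-power sections; the specific end points would have to be tuned experimentally.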

[0027] It is also known that, even when noise is present, there are bands in which the recognition rate falls sharply and bands in which the degradation is comparatively small; in other words, some bands are robust to noise and others are not. For example, the applicant has confirmed that a noise spectrum in the 1 kHz to 3 kHz range lowers the recognition rate more than noise in other bands. Accurate speech recognition in a noisy environment is therefore possible by using a high-pass filter, low-pass filter, or the like to extract only the signal in a specific band from the input speech pattern. However, because the spectrum and power of noise vary widely, a configuration that relies on a fixed band-pass filter or the like cannot adapt flexibly to changes in the environment, and the overall recognition rate may fall.

[0028] This embodiment therefore varies the subtraction factor for each band so as to adapt flexibly to various driving environments. To this end, as shown in FIG. 1, the estimated noise output from the noise difference unit 14 is supplied to the subtraction factor setting unit 30, which uses a noise pattern/factor conversion table 36 to determine the subtraction factor for each spectral band of the estimated noise and thereby the amount to be subtracted in the spectral subtraction unit 20. The noise pattern/factor conversion table 36 holds, in table form, predetermined noise patterns and the corresponding subtraction factor for each band; for example, it is set so that the subtraction factor for 1 kHz to 3 kHz is larger than for other bands.

[0029] FIG. 5 schematically shows the processing in the subtraction factor setting unit 30. (a) and (c) are examples of the estimated noise spectrum output from the noise difference unit 14: (a) is a comparatively flat spectrum, and (c) is a spectrum with most of its power on the low-frequency side. (b) shows the subtraction factor determined for each band when (a) is input, and (d) shows the subtraction factor determined for each band when (c) is input. The subtraction factor basically varies with the power of the estimated noise (the larger the power, the larger the factor), and it is increased further in bands where noise tends to lower the recognition rate. Determining the subtraction factor for each spectral band of the estimated noise, that is, of the difference between the input noise spectrum and the learning noise, thus enables accurate recognition in any driving environment, that is, for any noise pattern.

[0030] Specifically, the subtraction factor αi for each band can be determined by

$$\alpha_i = \beta_i \cdot P_i \qquad (3)$$

where βi is an experimentally determined coefficient for band i, Pi is the estimated noise power in band i, and i denotes the frequency band.
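Equation (3) amounts to an element-wise product over the frequency bands. The sketch below shows it in Python; the function name and the three-band coefficient and power values are hypothetical, with the middle band standing in for a noise-sensitive range such as 1 kHz to 3 kHz.

```python
def band_subtraction_factors(beta, band_power):
    """Equation (3): alpha_i = beta_i * P_i for each frequency band i."""
    if len(beta) != len(band_power):
        raise ValueError("beta and band_power must have the same number of bands")
    return [b * p for b, p in zip(beta, band_power)]


# Hypothetical values (not from the source): the middle band is given a larger
# experimental coefficient because noise there is assumed to hurt recognition more.
beta = [0.1, 0.2, 0.1]
band_power = [4.0, 4.0, 2.0]
alphas = band_subtraction_factors(beta, band_power)
```

With equal noise power in the first two bands, the larger coefficient alone makes the middle band's factor larger, and the third band's factor is smaller because its noise power is lower.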

[0031] Furthermore, in this embodiment, as shown in FIG. 1, a power calculation unit 32 calculates the average power of the input noise after high-frequency emphasis by the filter 10, together with its variance (or deviation), and supplies them to the subtraction factor setting unit 30. The subtraction factor setting unit 30 determines the subtraction factor from the deviation of the power peak value from the average value, that is, from the power variance value. The larger the variance, the larger the subtraction factor; the smaller the variance, the smaller the subtraction factor.

[0032] FIG. 6 shows the relationship between the power spectrum of the input noise and the deviation. In the figure, the dotted line is the time average of the input noise power, and σ1 and σ2 are the deviations of the peak values from the average. Since σ1 > σ2, the subtraction factor for deviation σ1 is set larger than the subtraction factor for deviation σ2. As a result, even when the Lombard effect causes the utterance level to be low because the input noise power is low, the subtraction factor does not become unnecessarily large, distortion from subtracting too much noise is avoided, and the recognition rate improves.
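The deviation-based rule can be sketched in Python as follows. The source specifies only that a larger deviation of the peak from the time average gives a larger subtraction factor; the linear mapping, its `gain` and limits, and the sample noise sequences are assumptions made for illustration.

```python
def peak_deviation(noise_power_over_time):
    """Deviation of the peak noise power from its time average."""
    mean_power = sum(noise_power_over_time) / len(noise_power_over_time)
    return max(noise_power_over_time) - mean_power


def factor_from_deviation(deviation, gain=0.1, alpha_min=0.5, alpha_max=2.0):
    """Assumed mapping (not from the source): the factor grows linearly with
    the deviation and is clamped to [alpha_min, alpha_max]."""
    return min(max(alpha_min + gain * deviation, alpha_min), alpha_max)


# Hypothetical input-noise power sequences with the same time average (5.0).
steady_noise = [5.0, 5.0, 5.0, 5.0]   # small deviation
bursty_noise = [2.0, 2.0, 2.0, 14.0]  # large deviation

sigma_steady = peak_deviation(steady_noise)
sigma_bursty = peak_deviation(bursty_noise)
alpha_steady = factor_from_deviation(sigma_steady)
alpha_bursty = factor_from_deviation(sigma_bursty)
```

The two sequences share the same average power, so the difference in the resulting factors comes only from the deviation, which is the quantity FIG. 6 is described as comparing.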

[0033] The above embodiment determines the subtraction factor over the entire utterance section, but it is also preferable to determine it for each analysis frame of the speech recognition. For example, two microphone inputs are provided, and the signal from one of them is used to calculate the SNR of each analysis frame. The subtraction factor is then determined from this per-frame SNR. This allows the subtraction factor to be set per unit of analysis and further improves the speech recognition rate. Naturally, when the subtraction factor is determined for each analysis frame, it is also preferable to calculate the difference between the input noise and the learning noise frame by frame and to determine the factor from that SNR. It is also preferable to vary the factor based on the power of each analysis frame instead of the SNR.

[0034] The embodiment described above compares the features of the speech obtained by subtracting noise from the input speech with the speech dictionary. It is also possible to calculate the difference between the input noise and the learning noise, add it to the data in the speech dictionary 26, and compare the result with the input speech; the two approaches are technically equivalent. The factor used when adding the difference data to the speech dictionary 26 can be determined from the SNR or the power in the same way as the subtraction factor.

[0035] FIG. 7 is a block diagram of this configuration. It differs from FIG. 1 in that the estimated noise calculated by the noise difference unit 14 is supplied to a spectral addition unit 21, which adds this estimated noise, that is, the difference between the input noise and the learning noise, to the learning speech data stored in the speech dictionary 26. The factor used when adding to the speech data in the speech dictionary 26, that is, the addition factor, is determined by an addition factor setting unit 31 (corresponding to the subtraction factor setting unit 30 in FIG. 1). Specifically, the addition factor setting unit 31 determines the factor from the SNR or the power variance of the input speech, or for each spectral band of the estimated noise.
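The addition variant of FIG. 7 can be sketched in Python as the mirror image of the subtraction: the noise difference is added to the dictionary pattern instead of being subtracted from the input. The function name, the sample spectra, and the default factor of 1.0 are assumptions for illustration.

```python
def add_noise_difference(dictionary_pattern, input_noise, learning_noise,
                         addition_factor=1.0):
    """Per band: dictionary pattern + factor * {(input noise) - (learning noise)}.

    `addition_factor` plays the role of the addition factor; the default of 1.0
    is an assumption made for this sketch.
    """
    return [
        d + addition_factor * (n_in - n_learn)
        for d, n_in, n_learn in zip(dictionary_pattern, input_noise, learning_noise)
    ]


# Hypothetical two-band power spectra.
dictionary_pattern = [5.0, 4.0]  # learned speech, including learning noise
input_noise = [3.0, 2.0]
learning_noise = [1.0, 1.0]

adapted = add_noise_difference(dictionary_pattern, input_noise, learning_noise)
half = add_noise_difference(dictionary_pattern, input_noise, learning_noise,
                            addition_factor=0.5)
```

After the addition, the dictionary pattern carries the input-side noise level rather than the learning-side level, so it can be compared directly with the unmodified input speech.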

[0036]

[Effects of the Invention] As described above, according to the present invention, input speech can be recognized reliably even when the standard speech was learned in a noisy environment. In addition, a drop in the recognition rate can be suppressed in any driving environment in which the noise varies in different ways.

[Brief description of the drawings]

FIG. 1 is a block diagram of the configuration of an embodiment.

FIG. 2 is a diagram explaining the noise difference processing.

FIG. 3 is a diagram explaining spectral subtraction.

FIG. 4 is a diagram illustrating the Lombard effect.

FIG. 5 is a diagram explaining how the subtraction factor is determined for each spectral band.

FIG. 6 is a graph showing the variance of the input speech power.

FIG. 7 is a block diagram of the configuration of another embodiment.

[Explanation of symbols]

10 filter, 12 spectrum analysis unit, 14 noise difference unit, 18 learning noise database, 20 spectral subtraction unit, 22 feature extraction unit, 24 phoneme recognition unit, 26 speech dictionary, 28 acoustic model database, 30 subtraction factor setting unit, 32 power calculation unit, 34 SNR calculation unit, 36 noise pattern/factor conversion table.

Claims

1. A speech recognition device that performs recognition by comparing features of speech, obtained by subtracting noise from input speech, with standard speech obtained by learning, the device comprising calculating means for calculating the noise to be subtracted from the input speech based on a difference between learning noise included in the standard speech and input noise included in the input speech.

2. A speech recognition device that performs recognition by comparing input speech with standard speech obtained by learning, the device comprising calculating means for calculating noise to be added to the standard speech based on a difference between learning noise included in the standard speech and input noise included in the input speech.

3. The speech recognition device according to claim 1, further comprising means for determining a ratio to be subtracted or a ratio to be added in accordance with the SNR of the input speech.

4. The speech recognition device according to claim 3, wherein the SNR of the input speech is calculated based on weighting in the frequency domain.

5. The speech recognition device according to claim 1, further comprising means for determining a ratio to be subtracted or a ratio to be added for each spectral band of the difference.

6. The speech recognition device according to claim 1, further comprising means for determining a ratio to be subtracted or a ratio to be added in accordance with a power variance of the input noise.

7. The speech recognition device according to claim 1, wherein a ratio to be subtracted or a ratio to be added is determined based on an SNR of the input noise for each speech analysis frame.

8. The speech recognition device according to claim 1, wherein a ratio to be subtracted or a ratio to be added is determined based on the power of the input noise for each speech analysis frame.