JP3091244B2

JP3091244B2 - Noise removal device and speech recognition device

Info

Publication number: JP3091244B2
Application number: JP03036724A
Authority: JP
Inventors: 貢松下
Original assignee: Ricoh Co Ltd
Current assignee: Ricoh Co Ltd
Priority date: 1991-02-05
Filing date: 1991-02-05
Publication date: 2000-09-25
Anticipated expiration: 2015-09-25
Also published as: JPH04249299A

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Technical Field] The present invention relates to a noise removal device and a speech recognition device using the noise removal device. More specifically, it relates to a technique for removing ambient-noise components mixed into the speech signal that is input to a speech recognition device, and to speech recognition using that technique in environments with high ambient noise, such as offices, automobiles, and factories.

[0002]

[Prior Art] In a speech recognition device, ambient noise mixed into the input speech markedly lowers the recognition rate, so removing ambient noise is an important issue in putting speech recognition devices into practical use. One known noise removal method, disclosed in JP-A-63-262695 and JP-A-1-239596, uses two acoustic input units: a speech input unit and a noise input unit. In this method, a noise component corresponding to the feature amount of the signal input from the noise input unit is removed from the feature amount of the signal input from the speech input unit. However, the method removes noise by subtracting the noise feature amount, obtained from the signal at the noise input unit, from the feature amount of the noise-containing speech input at the speech input unit. It therefore cannot remove noise appropriately when the noise picked up at the speech input unit differs greatly from the noise picked up at the noise input unit, for example when there are several noise sources and one of them is close to the noise input unit. In such cases it may even lower the speech recognition rate.

[0003] FIG. 4 shows an example in which noise removal is not performed appropriately. As shown in the figure, when the noise component (a) contained in the speech input signal differs greatly from the spectrum (b) of the noise input signal, noise removal yields the spectrum (d). As a result, the spectrum (c) before noise removal is closer to the speech spectrum (e) than (d) is.
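The failure case of FIG. 4 can be illustrated numerically. The following is a minimal sketch using made-up four-band spectra; none of the values are taken from the figure, and both the zero floor applied after subtraction and the sum-of-absolute-differences distance are assumptions chosen only for this illustration.

```python
# Hypothetical 4-band magnitude spectra; all numbers are illustrative only.
speech = [5.0, 8.0, 3.0, 1.0]           # (e) clean speech spectrum
noise_in_speech = [2.0, 1.0, 0.0, 0.0]  # (a) noise reaching the speech input
noise_ref = [0.0, 0.0, 3.0, 4.0]        # (b) spectrum at the noise input

# (c) spectrum before noise removal: speech plus its own noise component
before = [s + a for s, a in zip(speech, noise_in_speech)]

# (d) spectrum after subtracting the mismatched noise spectrum
# (flooring at zero is an assumption, not part of the source text)
after = [max(c - b, 0.0) for c, b in zip(before, noise_ref)]


def distance(u, v):
    """Sum of absolute differences between two spectra."""
    return sum(abs(a - b) for a, b in zip(u, v))


err_before = distance(before, speech)  # how far (c) is from (e)
err_after = distance(after, speech)    # how far (d) is from (e)
```

With these values, `err_after` is larger than `err_before`: subtracting a noise spectrum that does not match the noise actually present in the speech signal moves the result away from the clean speech spectrum.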

[0004]

[Object] The present invention has been made in view of the circumstances described above. In particular, its object is to avoid inappropriate noise removal when the noise at the speech input unit and the noise at the noise input unit differ greatly.

[0005]

[Configuration] To achieve the above object, the present invention is characterized by the following.

(1) In a noise removal device having a first acoustic input unit for inputting speech containing noise, a second acoustic input unit for inputting noise, a first feature extraction unit that obtains a feature amount of the noise-containing speech signal input from the first acoustic input unit, a second feature extraction unit that obtains a feature amount of the noise signal input from the second acoustic input unit, and a noise removal unit that removes the noise component from the feature amount of the noise-containing speech signal by subtracting the feature amount of the noise signal from it: the feature amount of the input signal of the first acoustic input unit is obtained in an interval in which no speech is being input, this is taken as an estimate of the noise component of the feature amount of the noise-containing speech signal, and noise removal is suspended when this noise component and the feature amount of the noise signal differ greatly.

(2) A speech recognition device having the noise removal device of (1), a pattern creation unit that creates a speech input pattern from the speech feature amount obtained by the noise removal device, and a standard pattern memory that stores previously registered standard speech patterns, in which recognition processing is performed using the input pattern obtained by the pattern creation unit and the standard patterns stored in the standard pattern memory.

The invention is described below on the basis of embodiments.

[0006] FIG. 1 is a block diagram for explaining an embodiment of the invention described in claim 1, and FIG. 2 is a per-frame flowchart for explaining its operating principle. As illustrated, the noise removal device has a first acoustic input unit 1 for inputting speech containing noise, a second acoustic input unit 2 for inputting noise, a first feature extraction unit 3 that obtains a feature amount of the noise-containing speech signal input from the first acoustic input unit, a second feature extraction unit 4 that obtains a feature amount of the noise signal input from the second acoustic input unit, a speech section detection unit 5, a noise similarity calculation unit 6, a noise removal determination unit 7, and a noise removal unit 8 that removes the noise component from the feature amount of the noise-containing speech signal by subtracting the feature amount of the noise signal from it. In this device, the feature amount of the input signal of the first acoustic input unit is obtained in an interval in which no speech is being input, this is taken as an estimate of the noise component of the feature amount of the noise-containing speech signal, and noise removal is suspended when this noise component and the feature amount of the noise signal differ greatly.

[0007] In more detail, the acoustic input unit 1 for speech input converts sound into an electric signal x(t) using an acoustic-to-electric transducer, such as a microphone, placed near the speaker's mouth. This x(t) contains not only speech but also ambient noise. The acoustic input unit 2 for noise input converts ambient noise into an electric signal n(t) using an acoustic-to-electric transducer, such as a microphone, placed where the speaker's voice is less likely to be picked up than at the acoustic input unit 1 (t is a variable denoting time). The feature extraction unit 3 for speech input uses a bank of band-pass filters, an FFT, or the like to extract a feature amount, such as the short-time frequency spectrum X(t, f) over about 10 msec, from the electric signal x(t) obtained by the acoustic input unit 1 (f is a variable denoting frequency). The feature extraction unit 4 for noise input likewise uses a bank of band-pass filters, an FFT, or the like to extract a noise feature amount, such as the short-time frequency spectrum N(t, f) over about 10 msec, from the electric signal n(t) obtained by the acoustic input unit 2. The speech section detection unit 5 detects whether speech is being input. For example, it treats an interval in which the short-time average of the absolute value of x(t) is at or above a predetermined threshold as a speech input interval. Another method is to provide a speech input switch and treat the interval during which the switch is on as a speech input interval; other methods can also be used. The noise similarity calculation unit 6 estimates the noise component contained in the feature amount X(t, f) of the speech input signal. For example, it takes the output X(t, f) of the feature extraction unit 3 during an interval (t_1 to t_2) that the speech section detection unit 5 has judged not to be a speech input interval, treats it as the noise component Y(t, f) in the speech input signal, and computes its similarity to N(t, f), extracted by the feature extraction unit 4 for noise input, by equation (1).
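The threshold-based behaviour described for the speech section detection unit 5, and the use of non-speech frames as the noise estimate Y(t, f), can be sketched as follows. This is a minimal illustration only: the function names, the non-overlapping framing, the frame length, and the threshold value are all assumptions introduced here, not details given in the description.

```python
def short_time_abs_mean(x, frame_len):
    """Mean of |x(t)| over consecutive non-overlapping frames (assumed framing)."""
    means = []
    for start in range(0, len(x) - frame_len + 1, frame_len):
        frame = x[start:start + frame_len]
        means.append(sum(abs(v) for v in frame) / frame_len)
    return means


def detect_speech_frames(x, frame_len, threshold):
    """True for frames whose short-time mean of |x(t)| is at or above threshold."""
    return [m >= threshold for m in short_time_abs_mean(x, frame_len)]


def noise_frames(spectra, speech_flags):
    """Frames of X(t, f) from non-speech intervals, used as the estimate Y(t, f)."""
    return [X for X, is_speech in zip(spectra, speech_flags) if not is_speech]


# Toy signal: two quiet frames followed by two loud frames (frame_len = 4).
x = [0.1, -0.1, 0.1, -0.1, 0.2, -0.2, 0.1, -0.1,
     1.0, -0.9, 0.8, -1.0, 0.9, -1.0, 1.0, -0.8]
flags = detect_speech_frames(x, frame_len=4, threshold=0.5)

# Hypothetical one-band spectra, one per frame, to show the selection step.
Y = noise_frames([[1.0], [2.0], [3.0], [4.0]], flags)
```

Here the first two frames fall below the threshold and are treated as non-speech, so only their spectra are kept as the noise estimate.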

[0008]

[Equation 1]

[0009] Here, k in equation (1) is a coefficient for level correction, obtained from the level ratio of the two input units in an interval in which no speech is being input. The noise removal determination unit 7 judges, from the noise similarity Z obtained by the noise similarity calculation unit 6, whether the noise feature amount extracted by the feature extraction unit 4 for noise input differs greatly from the noise component contained in the feature amount extracted by the feature extraction unit 3 for speech input. For example, when Z is at or above a predetermined threshold, the two are judged to differ greatly and noise removal is not performed. The noise removal unit 8 performs noise removal when the noise removal determination unit 7 has determined that it should be performed. For example, using the speech input feature amount X(t, f) and the noise input feature amount N(t, f) in the interval that the speech section detection unit 5 has judged to be a speech input interval, it takes S(t, f), given by S(t, f) = X(t, f) − k·N(t, f), as the noise-removed speech feature amount.
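The decision and subtraction steps of units 6 to 8 can be sketched as follows. Because equation (1) is not reproduced in this text, the form of Z used here (mean absolute difference between Y and k·N) is an assumption standing in for it, as is the way k is computed from summed levels. All function and parameter names are hypothetical; only the rule "skip removal when Z is at or above a threshold" and the formula S = X − k·N come from the description.

```python
def level_coefficient(Y, N_quiet):
    """k: ratio of total level at the speech input to that at the noise input,
    measured over non-speech frames (one possible reading of 'level ratio')."""
    total_y = sum(sum(frame) for frame in Y)
    total_n = sum(sum(frame) for frame in N_quiet)
    return total_y / total_n if total_n > 0 else 0.0


def dissimilarity(Y, N_quiet, k):
    """Assumed stand-in for Z: mean |Y(t,f) - k*N(t,f)| over frames and bands."""
    diffs = [abs(y - k * n)
             for fy, fn in zip(Y, N_quiet)
             for y, n in zip(fy, fn)]
    return sum(diffs) / len(diffs)


def remove_noise(X, N, Y, N_quiet, z_threshold):
    """Return S(t,f) = X(t,f) - k*N(t,f), or X unchanged when Z >= threshold.

    X, N:        spectra from the speech and noise inputs during speech.
    Y, N_quiet:  spectra from the same two inputs during a non-speech interval.
    """
    k = level_coefficient(Y, N_quiet)
    z = dissimilarity(Y, N_quiet, k)
    if z >= z_threshold:
        # Noise at the two inputs differs greatly: suspend noise removal.
        return [list(frame) for frame in X]
    return [[x - k * n for x, n in zip(fx, fn)] for fx, fn in zip(X, N)]


# Matched noise: Y is exactly 2 * N_quiet, so Z = 0 and subtraction is applied.
matched = remove_noise([[5.0, 9.0]], [[1.0, 2.0]],
                       [[2.0, 4.0]], [[1.0, 2.0]], z_threshold=0.5)

# Mismatched noise: energy sits in different bands, so removal is skipped.
mismatched = remove_noise([[5.0, 9.0]], [[1.0, 2.0]],
                          [[4.0, 0.0]], [[0.0, 2.0]], z_threshold=0.5)
```

In the matched case the result is X − 2·N; in the mismatched case the input spectrum is passed through unchanged, which is the behaviour the invention aims for.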

[0010]

[0011] FIG. 3 is a block diagram for explaining an embodiment of the invention described in claim 2. Units 1 to 8 in the figure operate in the same way as in the embodiment shown in FIG. 1. The speech recognition unit 9 performs speech processing using the speech feature amount S(t, f) from which noise has been removed by the noise removal unit 8. It uses the speech recognition system described in "Development of a Word Speech Recognition System Using Binary TSP" (Yasuda et al., Transactions of the Institute of Electrical Engineers of Japan C, Vol. 108, October 1988, pp. 858-865), but other known speech recognition systems can also be used.

[0012]

[0013]

[Effects] (1) Effect corresponding to claim 1: noise removal is suspended when the noise component contained in the speech signal differs greatly from the feature amount of the noise signal, so inappropriate noise removal is not performed. (2) Effect corresponding to claim 2: by combining the noise removal device with a speech recognition unit, a speech recognition device can be realized whose recognition rate does not fall even when the noise component contained in the speech signal differs greatly from the feature amount of the noise signal.

[Brief description of the drawings]

[FIG. 1] A block diagram for explaining an embodiment of the invention described in claim 1.

[FIG. 2] A per-frame flowchart for explaining the operating principle of the embodiment shown in FIG. 1.

[FIG. 3] A block diagram for explaining an embodiment of the invention described in claim 2.

[FIG. 4] A waveform diagram for explaining an example of a case in which noise removal cannot be performed appropriately.

[Explanation of symbols]

1: acoustic input unit for speech input; 2: acoustic input unit for noise input; 3: feature extraction unit for speech input; 4: feature extraction unit for noise input; 5: speech section detection unit; 6: noise similarity calculation unit; 7: noise removal determination unit; 8: noise removal unit; 9: speech recognition unit.

Continuation of front page

(58) Fields searched (Int. Cl.⁷, DB name): G10L 15/20, G10L 21/02, JICST file (JOIS)

Claims

(57) [Claims]

1. A noise removal device comprising: a first acoustic input unit for inputting speech containing noise; a second acoustic input unit for inputting noise; a first feature extraction unit that obtains a feature amount of the noise-containing speech signal input from the first acoustic input unit; a second feature extraction unit that obtains a feature amount of the noise signal input from the second acoustic input unit; and a noise removal unit that removes the noise component from the feature amount of the noise-containing speech signal by subtracting the feature amount of the noise signal from the feature amount of the noise-containing speech signal, wherein the feature amount of the input signal of the first acoustic input unit is obtained in an interval in which no speech is being input, this feature amount is taken as an estimate of the noise component of the feature amount of the noise-containing speech signal, and noise removal is suspended when this noise component and the feature amount of the noise signal differ greatly.

2. A speech recognition device comprising: the noise removal device according to claim 1; a pattern creation unit that creates a speech input pattern from the speech feature amount obtained by the noise removal device; and a standard pattern memory that stores previously registered standard speech patterns, wherein recognition processing is performed using the input pattern obtained by the pattern creation unit and the standard patterns stored in the standard pattern memory.