JPH03274099A

JPH03274099A - Voice recognizing device

Info

Publication number: JPH03274099A
Application number: JP2074690A
Authority: JP
Inventors: Takashi Ariyoshi; 有吉　敬; Junichiro Fujimoto; 潤一郎藤本
Original assignee: Ricoh Co Ltd
Current assignee: Ricoh Co Ltd
Priority date: 1990-03-23
Filing date: 1990-03-23
Publication date: 1991-12-05

Abstract

PURPOSE: To remove noise properly even when playback sound from a radio, stereo set, etc. is superimposed as noise on a voice signal, by providing, in addition to a 1st feature quantity extraction part that extracts the feature quantity of noise-containing voice input from a microphone, a 2nd feature quantity extraction part that extracts the feature quantity of the noise from the playback signal reproduced by a speaker placed near the microphone. CONSTITUTION: A 1st noise removal part 30 subtracts the time spectrum pattern of the speaker playback signal noise, extracted by the 2nd feature quantity extraction part 20, from the time spectrum pattern of the noise-containing voice, extracted by the 1st feature quantity extraction part 10, and generates a time spectrum pattern. A noise estimation part 50 estimates the time spectrum pattern of noise other than that of the speaker playback signal, and a 2nd noise removal part 60 subtracts it from the time spectrum pattern generated by the 1st noise removal part 30 in the voice section detected by a voice section detection part 40. An input pattern generation part 70 generates the input pattern of the input voice from the voice feature quantity generated by the 2nd noise removal part 60, according to the known BTSP voice recognition method.

Description

[Detailed Description of the Invention]

[Technical Field]

The present invention relates to a speech recognition device and, more particularly, to a noise removal technique for speech recognition in high-noise environments. It is well suited to speech recognition devices used in automobiles for dialing and for controlling audio equipment, air conditioners, navigation systems, and the like, and it can also be applied in homes, offices, and similar settings.

[Prior Art]

In recent years, voice has attracted attention as a means of inputting information. In automobiles as well, applying speech recognition technology to placing calls on a car phone and to controlling audio equipment, air conditioners, navigation systems, and the like is being considered.

However, speech recognition in an automobile suffers from a poor ratio of speech signal to noise (S/N). Engine noise, tire noise, and sound played back by a radio or stereo mix into the speech signal as noise, and a microphone close to the mouth, such as a close-talking microphone, cannot be worn while driving. Noise removal technology is therefore indispensable.

Conventional noise removal techniques for speech recognition include the spectral subtraction method of S. F. Boll and others, and the adaptive noise canceling of B. Widrow and others.

However, the spectral subtraction method is weak against time-nonstationary noise. When there is strong nonstationary noise in the speech band, such as sound (speech or music) played back by a radio or stereo, speech section detection itself becomes unreliable, and even if the speech section is detected, the estimated noise component differs from the actual noise component. Adaptive noise canceling, when used with two inputs, that is, two microphones, cannot cope well with noise combined from multiple noise sources. Simply increasing the number of microphones makes the amount of signal processing enormous and raises the cost, which makes practical use difficult.

For sound played back from the speakers of a radio, stereo, or the like (which is noise to the speech recognition device), one could also subtract from the microphone input the signal sent to the speaker multiplied by an appropriate gain. However, because of the delay from the speaker to the microphone and reflections from the surroundings, simply taking the difference of the two time waveforms cannot be expected to be effective.
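The effect of the speaker-to-microphone delay on waveform subtraction can be illustrated with a small sketch. It is not part of the patent: the pure tone, the 16 kHz sample rate, the unity gain, and the function name `residual_energy_ratio` are all assumptions chosen only to make the point visible.

```python
import math

def residual_energy_ratio(delay_samples, freq_hz=1000.0, rate_hz=16000, n=1600):
    """Energy of (mic - reference) divided by the energy of mic, for a pure tone.

    Assumption: the microphone signal is the reference tone delayed by
    delay_samples with gain 1 and no reflections.
    """
    ref = [math.sin(2 * math.pi * freq_hz * k / rate_hz) for k in range(n)]
    mic = [math.sin(2 * math.pi * freq_hz * (k - delay_samples) / rate_hz)
           for k in range(n)]
    diff_energy = sum((m - r) ** 2 for m, r in zip(mic, ref))
    mic_energy = sum(m * m for m in mic)
    return diff_energy / mic_energy

# No delay: the subtraction cancels the tone completely.
no_delay = residual_energy_ratio(0)
# Half-period delay (8 samples at 1 kHz / 16 kHz): the subtraction
# doubles the amplitude, so the residual energy is about four times larger.
half_period = residual_energy_ratio(8)
```

With a delay of half a period the "noise-removed" waveform is louder than the original, which is why the device described here works on spectral patterns rather than on time waveforms.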

[Object]

The present invention was made in view of the above drawbacks of the prior art. Its object is to properly remove playback sound from a radio, stereo, or the like that arises near the speech-input microphone, lies in the speech band, is strongly nonstationary in time, and is superimposed on the speech signal as noise, and thereby to achieve good speech recognition in such a noisy environment.

[Constitution]

To achieve the above object, the present invention is characterized as follows.

(1) A speech recognition device comprising: a first feature extraction unit that extracts the feature quantity of noise-containing speech input from a microphone; a second feature extraction unit that extracts the feature quantity of the noise due to a speaker reproduction signal, using the speaker reproduction signal to be reproduced from a speaker placed around the microphone; a first noise removal unit that generates a feature quantity by removing the feature quantity extracted by the second feature extraction unit from the feature quantity extracted by the first feature extraction unit; a speech section detection unit that detects a speech section from the feature quantity generated by the first noise removal unit; a second noise removal unit that obtains an estimate of the noise other than the noise due to the speaker reproduction signal from the feature quantity generated by the first noise removal unit in the non-speech section detected by the speech section detection unit, and removes that estimate from the feature quantity generated by the first noise removal unit in the speech section detected by the speech section detection unit, thereby generating the feature quantity of the speech; an input pattern generation unit that generates the input pattern of the input speech from the feature quantity of the speech generated by the second noise removal unit; a standard pattern memory that stores standard patterns of speech; and a recognition unit that performs recognition processing using the input pattern generated by the input pattern generation unit and the standard patterns stored in the standard pattern memory.

(2) The speech recognition device of (1), wherein the second feature extraction unit applies, to the speaker reproduction signal to be reproduced from the speaker placed around the microphone, processing corresponding to the transfer function between the speaker reproduction signal and the signal input from the microphone, the transfer function having been measured in advance with the speaker and the microphone in a predetermined positional relationship, and uses the feature quantity of the processed signal as the feature quantity of the noise due to the speaker reproduction signal.

(3) The speech recognition device of (1) or (2), wherein, when a plurality of speakers are placed around the microphone, each speaker reproduction signal to be reproduced from each speaker is subjected to processing corresponding to the transfer function between that speaker reproduction signal and the signal input from the microphone, each transfer function having been measured in advance with the respective speaker and the microphone in a predetermined positional relationship, all of the processed signals are added, and the feature quantity of the sum signal is used as the feature quantity of the noise due to the speaker reproduction signals.

The invention is described below on the basis of embodiments.

FIG. 1 is a block diagram for explaining one embodiment of the present invention. In the figure, 1 is a microphone; 2 is a speaker placed around the microphone 1; 10 is a first feature extraction unit that extracts the feature quantity of the noise-containing speech input from the microphone 1; 20 is a second feature extraction unit that extracts the feature quantity of the noise due to the speaker reproduction signal; 30 is a first noise removal unit; 40 is a speech section detection unit; 50 is a noise estimation unit; 60 is a second noise removal unit; 70 is an input pattern generation unit; 80 is a standard pattern memory; and 90 is a recognition unit.

The invention described in claims 1 and 2 consists of the following: the first feature extraction unit 10, which extracts the feature quantity of the noise-containing speech input from the microphone 1; the second feature extraction unit 20, which extracts the feature quantity of the noise due to the speaker reproduction signal, using the speaker reproduction signal to be reproduced from the speaker 2 placed around the microphone 1; the first noise removal unit 30, which generates a feature quantity by removing the feature quantity extracted by the second feature extraction unit 20 from the feature quantity extracted by the first feature extraction unit 10; the speech section detection unit 40, which detects the speech section from the feature quantity generated by the first noise removal unit 30; the noise estimation unit 50, which estimates the noise other than the noise due to the speaker reproduction signal from the feature quantity generated by the first noise removal unit 30 in the non-speech section detected by the speech section detection unit 40; the second noise removal unit 60, which removes that estimate from the feature quantity generated by the first noise removal unit 30 in the speech section detected by the speech section detection unit 40 and generates the feature quantity of the speech; the input pattern generation unit 70, which generates the input pattern of the input speech from the feature quantity of the speech generated by the second noise removal unit 60; the standard pattern memory 80, which stores standard patterns of speech; and the recognition unit 90, which performs recognition processing using the input pattern generated by the input pattern generation unit 70 and the standard patterns stored in the standard pattern memory 80.

In more detail, the first feature extraction unit 10 extracts the feature quantity of the noise-containing speech input from the microphone 1, which is installed in the automobile for speech input. The microphone amplifier 11 amplifies the signal, and the pre-emphasis circuit 12 emphasizes the high frequencies. The bandpass filter bank 13 consists of a group of bandpass filters whose center frequencies are 15 frequencies spaced at equal intervals on a logarithmic axis from 250 Hz to 6.35 kHz, together with a rectifier and a low-pass filter for each band, and thereby obtains the spectrum of the input speech. The multiplexer 14 switches among the data of the bands, and the A/D converter 15 digitizes the data of each band at a sampling period of 10 ms.
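As a rough illustration of the channel layout, the sketch below computes 15 center frequencies equally spaced on a logarithmic axis. It assumes that 250 Hz and 6.35 kHz are themselves the lowest and highest center frequencies, which the text does not state explicitly; the function name `log_spaced_centers` is hypothetical.

```python
def log_spaced_centers(f_low=250.0, f_high=6350.0, channels=15):
    """Center frequencies equally spaced on a logarithmic axis.

    Assumption: f_low and f_high are the first and last center frequencies.
    """
    ratio = (f_high / f_low) ** (1.0 / (channels - 1))
    return [f_low * ratio ** i for i in range(channels)]

centers = log_spaced_centers()
# Ratio between neighbouring channels; under the assumption above it comes
# out close to 2 ** (1 / 3), i.e. roughly one-third-octave spacing.
step = centers[1] / centers[0]
```

Under this reading the filter bank is close to a one-third-octave bank, but that interpretation is an inference from the numbers, not something the source says.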

Accordingly, the signal input to the first feature extraction unit 10 passes through the microphone amplifier 11, pre-emphasis circuit 12, bandpass filter bank 13, multiplexer 14, and A/D converter 15 and becomes the time spectrum pattern X(t, f) of the noise-containing speech, where t is the time axis and f is the frequency axis.

The second feature extraction unit 20 extracts the feature quantity of the noise due to the speaker reproduction signal, using the speaker reproduction signal to be reproduced from the speaker 2 placed around the microphone 1. The pre-emphasis circuit 22 emphasizes the high frequencies in the same way as the pre-emphasis circuit 12; the bandpass filter bank 23 obtains the spectrum of the speaker reproduction signal in the same way as the bandpass filter bank 13; the multiplexer 24 switches among the data of the bands in the same way as the multiplexer 14; and the A/D converter 25 digitizes the data of each band in the same way as the A/D converter 15. The signal input to the second feature extraction unit 20 thus passes through the pre-emphasis circuit 22, bandpass filter bank 23, multiplexer 24, and A/D converter 25 and becomes the time spectrum pattern N(t, f) of the speaker reproduction signal.

Then, for example, the multiplier 27 computes the product of this time spectrum pattern N(t, f) and the transfer function H(f) between the speaker 2 and the microphone 1, which has been measured in advance and stored in the transfer function memory 28. The result is the time spectrum pattern of the noise due to the speaker reproduction signal:

N1(t, f) = N(t, f)·H(f)

The transfer function H(f) between the speaker 2 and the microphone 1 can be obtained in advance by applying an impulse signal as the speaker reproduction signal input to the second feature extraction unit 20, picking up with the microphone 1 the sound reproduced through the speaker 2, and Fourier-transforming the resulting impulse response. It can also be obtained by reproducing white noise or a frequency sweep signal.
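The multiplication N1(t, f) = N(t, f)·H(f) and the idea of deriving H(f) from an impulse response can be sketched as follows. This is only an illustration under assumptions: H is treated as one real gain per filter-bank channel, the mapping from DFT bins to the 15 channels is left out because the source does not describe it, and the names `dft_magnitudes` and `speaker_noise_pattern` are hypothetical.

```python
import cmath

def dft_magnitudes(impulse_response):
    """Magnitude of the discrete Fourier transform of a measured impulse response."""
    n = len(impulse_response)
    return [
        abs(sum(h * cmath.exp(-2j * cmath.pi * k * i / n)
                for i, h in enumerate(impulse_response)))
        for k in range(n)
    ]

def speaker_noise_pattern(N, H):
    """N1(t, f) = N(t, f) * H(f), with N as a list of frames and H as per-channel gains."""
    return [[n_tf * H[f] for f, n_tf in enumerate(frame)] for frame in N]

# A delayed, attenuated impulse has a flat magnitude response of 0.5:
# the delay changes only the phase, not the magnitude.
H_example = dft_magnitudes([0.0, 0.5, 0.0, 0.0])

# Two frames, two channels, hypothetical gains 0.5 and 2.0.
N1_example = speaker_noise_pattern([[1.0, 2.0], [3.0, 4.0]], [0.5, 2.0])
```

Working with magnitudes per band is what makes this approach less sensitive to the speaker-to-microphone delay than subtracting time waveforms.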

The first noise removal unit 30 subtracts the time spectrum pattern N1(t, f) of the noise due to the speaker reproduction signal, extracted by the second feature extraction unit 20, from the time spectrum pattern X(t, f) of the noise-containing speech, extracted by the first feature extraction unit 10. It thereby generates the time spectrum pattern X1(t, f) = X(t, f) - N1(t, f), in which the noise due to the speaker reproduction signal has been removed from the noise-containing speech signal.
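A minimal sketch of this first subtraction, X1(t, f) = X(t, f) - N1(t, f), is given below. Clamping negative results to a floor of zero is an assumption added here; the source does not say how negative differences are handled. The function name `remove_speaker_noise` is hypothetical.

```python
def remove_speaker_noise(X, N1, floor=0.0):
    """X1(t, f) = X(t, f) - N1(t, f), per frame and per channel.

    Assumption (not in the source): values below `floor` are clamped to it.
    """
    return [
        [max(x - n, floor) for x, n in zip(x_frame, n_frame)]
        for x_frame, n_frame in zip(X, N1)
    ]

X1_example = remove_speaker_noise(
    [[5.0, 3.0], [2.0, 1.0]],   # X: noisy speech pattern
    [[1.0, 4.0], [0.5, 0.5]],   # N1: speaker noise pattern
)
```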

The speech section detection unit 40 detects the speech section from the time spectrum pattern X1(t, f) generated by the first noise removal unit 30, in which the noise due to the speaker reproduction signal has been removed from the noise-containing speech signal.

The speech section detection method used here takes as the speech section the interval in which the per-frame sum of the time spectrum pattern X1(t, f) over the 15 frequency channels, Σ_{f=1}^{15} X1(t, f), exceeds a predetermined threshold.
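The thresholding rule can be sketched as below. The threshold value and the helper that groups flagged frames into (start, end) sections are illustrative assumptions; the source states only that frames whose channel sum exceeds a predetermined threshold form the speech section.

```python
def detect_speech_frames(X1, threshold):
    """Flag each frame whose sum over all channels exceeds the threshold."""
    return [sum(frame) > threshold for frame in X1]

def speech_sections(flags):
    """Group consecutive flagged frames into inclusive (start, end) index pairs."""
    sections, start = [], None
    for t, flag in enumerate(flags):
        if flag and start is None:
            start = t
        elif not flag and start is not None:
            sections.append((start, t - 1))
            start = None
    if start is not None:
        sections.append((start, len(flags) - 1))
    return sections

flags = detect_speech_frames(
    [[0.1, 0.1], [2.0, 3.0], [4.0, 1.0], [0.2, 0.1], [3.0, 3.0]],
    threshold=1.0,
)
sections = speech_sections(flags)
```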

The second noise removal unit 60 takes the time spectrum pattern X1s(t, f) generated by the first noise removal unit 30 in the speech section detected by the speech section detection unit 40 (the subscript s denotes the speech section) and subtracts from it the time spectrum pattern N2(t, f) of the noise other than the noise due to the speaker reproduction signal, as estimated by the noise estimation unit 50. This generates the time spectrum pattern of the speech, S(t, f) = X1s(t, f) - N2(t, f). Here, following the known spectral subtraction method, the noise estimation unit 50 estimates N2(t, f) as the average over multiple frames of the output X1n(t, f) of the first noise removal unit 30 outside the speech section (the subscript n denotes the non-speech section).
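The estimate-then-subtract step can be sketched as follows. Two simplifications are assumptions of this sketch rather than statements of the source: all non-speech frames are averaged into a single per-channel estimate, and negative differences are clamped to zero. The function names are hypothetical.

```python
def estimate_noise(X1, speech_flags):
    """Per-channel average of X1 over the non-speech frames (estimate of N2)."""
    noise_frames = [frame for frame, is_speech in zip(X1, speech_flags)
                    if not is_speech]
    if not noise_frames:
        raise ValueError("no non-speech frames available for noise estimation")
    channels = len(noise_frames[0])
    return [sum(frame[f] for frame in noise_frames) / len(noise_frames)
            for f in range(channels)]

def remove_residual_noise(X1, speech_flags, N2, floor=0.0):
    """S(t, f) = X1s(t, f) - N2(f) for speech frames only, clamped at `floor`."""
    return [
        [max(x - n, floor) for x, n in zip(frame, N2)]
        for frame, is_speech in zip(X1, speech_flags) if is_speech
    ]

X1 = [[1.0, 2.0], [5.0, 6.0], [3.0, 2.0]]
speech_flags = [False, True, False]
N2 = estimate_noise(X1, speech_flags)
S = remove_residual_noise(X1, speech_flags, N2)
```

Because the remaining noise is assumed to be stationary, an average taken outside the speech section is a reasonable stand-in for the noise inside it.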

The input pattern generation unit 70 generates the input pattern of the input speech from the feature quantity of the speech generated by the second noise removal unit 60, following the speech pattern generation method of the known BTSP (Binary Time Spectrum Pattern) speech recognition method.

The standard pattern memory 80 stores standard patterns of speech in the standard pattern format of the known BTSP speech recognition method.

The recognition unit 90 performs recognition processing on the input pattern generated by the input pattern generation unit 70 and the standard patterns stored in the standard pattern memory 80, following the recognition processing of the known BTSP speech recognition method.

Besides the means used in the embodiment above, the present invention can also be practiced using other known methods for the speech section detection method of the speech section detection unit 40, the noise removal method of the second noise removal unit 60, the pattern generation method of the input pattern generation unit 70, the pattern format of the standard pattern memory 80, the recognition processing of the recognition unit 90, and so on.

The bandpass filter banks 13 and 23 may also be replaced with digital signal processing such as an FFT, and the A/D converters 15 and 25 may be shared by time-division processing.

FIG. 2 is a block diagram of an embodiment that extends the embodiment of FIG. 1 to handle a plurality of speakers. As illustrated, when there are, for example, two speakers, there are also two second feature extraction units 20a and 20b, which receive the speaker reproduction signals to be reproduced from the speakers 2a and 2b, respectively. The multipliers 27a and 27b compute the products of the time spectrum patterns Na(t, f) and Nb(t, f) of these speaker reproduction signals, obtained by the bandpass filter banks 23a and 23b, with the transfer functions Ha(f) and Hb(f) between the speakers 2a, 2b and the microphone 1, measured in advance. The adder 29 sums these products to give the time spectrum pattern of the noise due to the speaker reproduction signals:

N1(t, f) = Na(t, f)·Ha(f) + Nb(t, f)·Hb(f)

The subsequent processing is the same as in the embodiment of FIG. 1. Three or more speakers can be handled in the same way.
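The multi-speaker sum generalizes directly to any number of speakers. The sketch below assumes each pattern is a list of frames of per-channel values and each transfer function is a list of per-channel gains; the function name `combined_speaker_noise` is hypothetical.

```python
def combined_speaker_noise(patterns, transfer_functions):
    """N1(t, f) = sum over speakers k of N_k(t, f) * H_k(f).

    patterns: one time spectrum pattern per speaker (same shape for all).
    transfer_functions: one list of per-channel gains per speaker.
    """
    frames = len(patterns[0])
    channels = len(patterns[0][0])
    return [
        [sum(N[t][f] * H[f] for N, H in zip(patterns, transfer_functions))
         for f in range(channels)]
        for t in range(frames)
    ]

Na, Nb = [[1.0, 2.0]], [[3.0, 4.0]]   # one frame, two channels per speaker
Ha, Hb = [0.5, 1.0], [2.0, 0.5]       # hypothetical per-channel gains
N1_two_speakers = combined_speaker_noise([Na, Nb], [Ha, Hb])
```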

[Effects]

As is clear from the above description, according to the invention of claim 1, the first noise removal unit 30 removes from the feature quantity of the noise-containing speech the feature quantity of the noise due to the speaker reproduction signal, which lies in the speech band and is strongly nonstationary in time, before the speech section detection unit 40 detects the speech section. The speech section is therefore detected from a signal whose remaining noise is mainly noise other than speaker playback sound: in an automobile, engine noise and tire noise; in a home or office, noise typified by Hoth noise. All of these are low-frequency (about 100 Hz to several hundred Hz) and strongly stationary in time. Detection accuracy therefore improves, and as a result the recognition rate for speech under high noise improves.

According to the invention of claim 2, the second feature extraction unit 20 applies, to the speaker reproduction signal to be reproduced from the speaker placed around the microphone, processing corresponding to the impulse response between the speaker and the microphone, measured in advance with the speaker and the microphone in a predetermined positional relationship, and uses the feature quantity of the processed signal as the feature quantity of the noise due to the speaker reproduction signal. The first noise removal unit 30 then removes this feature quantity from the feature quantity of the noise-containing speech. The noise component due to the speaker reproduction signal can therefore be removed accurately, and as a result the recognition rate for speech under high noise improves.

According to the invention of claim 3, even when a plurality of speakers 2 are placed around the microphone 1, the second feature extraction units 20a, 20b, ... apply, in the multipliers 27a, 27b, ..., the processing corresponding to the respective transfer functions between the speakers 2a, 2b, ... and the microphone 1, each measured in advance with the respective speaker and the microphone 1 in a predetermined positional relationship. The adder 29 adds all of the processed signals, and the feature quantity of the sum signal is used as the feature quantity of the noise due to the speaker reproduction signals. The first noise removal unit 30 then removes this feature quantity from the feature quantity of the noise-containing speech. Thus, even when there are a plurality of speakers 2, the noise component due to the speaker reproduction signals can be removed accurately, and as a result the recognition rate for speech under high noise improves.

[Brief explanation of drawings]

FIG. 1 is a block diagram for explaining an embodiment of the invention described in claims 1 and 2, and FIG. 2 is a block diagram for explaining an embodiment of the invention described in claim 3.

1: microphone
2, 2a, 2b: speaker
10: first feature extraction unit
20, 20a, 20b: second feature extraction unit
11: microphone amplifier
12, 22, 22a, 22b: pre-emphasis circuit
13, 23, 23a, 23b: bandpass filter bank
14, 24, 24a, 24b: multiplexer
15, 25, 25a, 25b: A/D converter
27, 27a, 27b: multiplier
28, 28a, 28b: transfer function memory
29: adder
30: first noise removal unit
40: speech section detection unit
50: noise estimation unit
60: second noise removal unit
70: input pattern generation unit
80: standard pattern memory
90: recognition unit

Claims

[Scope of Claims]

1. A speech recognition device comprising: a first feature extraction unit that extracts the feature quantity of noise-containing speech input from a microphone; a second feature extraction unit that extracts the feature quantity of the noise due to a speaker reproduction signal, using the speaker reproduction signal to be reproduced from a speaker placed around the microphone; a first noise removal unit that generates a feature quantity by removing the feature quantity extracted by the second feature extraction unit from the feature quantity extracted by the first feature extraction unit; a speech section detection unit that detects a speech section from the feature quantity generated by the first noise removal unit; a second noise removal unit that obtains an estimate of the noise other than the noise due to the speaker reproduction signal from the feature quantity generated by the first noise removal unit in the non-speech section detected by the speech section detection unit, and removes that estimate from the feature quantity generated by the first noise removal unit in the speech section detected by the speech section detection unit, thereby generating the feature quantity of the speech; an input pattern generation unit that generates the input pattern of the input speech from the feature quantity of the speech generated by the second noise removal unit; a standard pattern memory that stores standard patterns of speech; and a recognition unit that performs recognition processing using the input pattern generated by the input pattern generation unit and the standard patterns stored in the standard pattern memory.

2. The speech recognition device according to claim 1, wherein the second feature extraction unit applies, to the speaker reproduction signal to be reproduced from the speaker placed around the microphone, processing corresponding to the transfer function between the speaker reproduction signal and the signal input from the microphone, the transfer function having been measured in advance with the speaker and the microphone in a predetermined positional relationship, and uses the feature quantity of the processed signal as the feature quantity of the noise due to the speaker reproduction signal.

3. The speech recognition device according to claim 1 or 2, wherein, when a plurality of speakers are placed around the microphone, each speaker reproduction signal to be reproduced from each speaker is subjected to processing corresponding to the transfer function between that speaker reproduction signal and the signal input from the microphone, each transfer function having been measured in advance with the respective speaker and the microphone in a predetermined positional relationship, all of the processed signals are added, and the feature quantity of the sum signal is used as the feature quantity of the noise due to the speaker reproduction signals.