JP3118023B2

JP3118023B2 - Voice section detection method and voice recognition device

Info

Publication number: JP3118023B2
Application number: JP03166391A
Authority: JP
Inventors: 敬有吉
Original assignee: Ricoh Co Ltd
Current assignee: Ricoh Co Ltd
Priority date: 1990-08-15
Filing date: 1991-06-11
Publication date: 2000-12-18
Anticipated expiration: 2015-12-18
Also published as: JPH056193A

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Technical Field] The present invention relates to a voice section detection method and to a voice recognition device using that method, and more specifically to voice section detection for voice input devices in high-noise environments. It is applicable in particular to voice recognition devices used in high-noise environments, for example in automobiles, factories, and homes, and also to noise removal in other voice input devices such as voice synthesis devices and communication equipment.

[0002]

[Prior Art] In putting voice recognition devices into practical use, countermeasures against ambient noise are an important issue. In particular, because it is not easy to detect the voice section accurately from voice on which noise is superimposed, the recognition rate drops markedly in such environments. One voice section detection method suited to voice with superimposed noise is disclosed in Japanese Examined Patent Publication No. Sho 63-29754. It uses two thresholds: when the section at or above the high threshold lasts for at least a certain time, the section at or above the low threshold is taken as the voice section. In an environment where time-nonstationary noise satisfies the high-threshold condition, however, section detection becomes difficult. A method that is effective against such time-nonstationary noise is disclosed, for example, in Japanese Unexamined Patent Publication No. Sho 58-130395. It determines the voice section by comparing the difference in power between two inputs, a voice microphone and a noise microphone, with a threshold. The noise component superimposed on the voice within the voice section, however, is not removed.

[0003]

Furthermore, as a method of removing noise components using two inputs, the spectral subtraction method has been widely used, but it has the drawback that it cannot handle strongly time-nonstationary noise. Methods that are comparatively effective even against such noise are disclosed in Japanese Unexamined Patent Publications No. Sho 58-196599, No. Sho 63-262695, No. Hei 1-115798, and No. Hei 1-239596. The inventions described in these publications are all variants of the adaptive noise cancelling method using two input means, a voice input and a noise input, and can be expressed as follows. Let X(i) be the spectrum of the voice input and N(i) the spectrum of the noise input (i denotes each frequency band). The per-band ratio k(i) of the noise obtained at the two inputs is determined in advance as

k(i) = X(i) / N(i)

and the estimate S(i) of the voice spectrum within the voice section is obtained as

S(i) = X(i) − k(i)·N(i)

With this method, if a single noise source is assumed for a given band i, the ratio k(i) of the noise obtained at the two inputs does not change even when the noise level changes, so the noise component contained in X(i) during the voice section can be estimated as k(i)·N(i). The method is therefore effective to some degree against time-nonstationary noise as well. However, if the ratio k(i) is calculated while the noise is small, its error becomes large, and if the following voice section then contains a comparatively large noise component, the voice spectrum cannot be estimated properly. The method therefore has the drawback that proper noise removal is possible only when noise is always present in every band.
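The drawback described above can be illustrated with a minimal Python sketch of the prior-art scheme. The function names and the numeric values are assumptions chosen for illustration; they do not come from the cited publications.

```python
def estimate_ratio(x_noise, n_noise):
    """Prior-art ratio k(i) = X(i) / N(i), measured in a noise-only frame."""
    return [x / n for x, n in zip(x_noise, n_noise)]


def subtract_noise(x, n, k):
    """Prior-art estimate S(i) = X(i) - k(i) * N(i)."""
    return [xi - ki * ni for xi, ki, ni in zip(x, k, n)]


# Hypothetical single band whose true noise ratio is 1.0.
# In a quiet frame the measured values are tiny, so a small measurement
# error (X = 3 instead of 1) yields k = 3.0 instead of 1.0.
k = estimate_ratio([3.0], [1.0])

# The next voice frame holds voice (50) plus loud noise (100) in the main
# input and noise (100) in the reference input. The wrong k over-subtracts.
s = subtract_noise([150.0], [100.0], k)
print(k, s)
```

With the correct ratio of 1.0 the estimate would have been 50; the ratio taken from the quiet frame drives it far below zero, which is the failure mode the paragraph attributes to the prior art.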

[0004]

[Object] The present invention has been made in view of the circumstances described above. Its first object is to provide a voice section detection method that can detect voice sections properly even in a time-nonstationary noise environment. Its second object is to provide a noise removal device that can remove noise properly even in a time-nonstationary noise environment. A further object is to provide a voice recognition device that achieves a good recognition rate even in a time-nonstationary noise environment.

[0005]

[Configuration] To achieve the above objects, the present invention is characterized as follows.

(1) It comprises: a first input means for inputting voice; a first feature extraction means for obtaining a feature quantity consisting of a plurality of elements from the first input signal obtained by the first input means; a second input means for inputting noise; a second feature extraction means for obtaining a feature quantity consisting of a plurality of elements from the second input signal obtained by the second input means; a coefficient calculation means for calculating a coefficient for each element from the first feature quantity and the second feature quantity; a noise component removal means for estimating the feature quantity of the voice by removing the noise component from the first feature quantity using the second feature quantity and the coefficients; and a voice section detection means for detecting a voice section for each element using at least the voice feature quantity estimated by the noise component removal means.

(2) Alternatively, it comprises the first and second input means, the first and second feature extraction means, the coefficient calculation means, and the noise component removal means as defined in (1), together with: a first voice section detection means that obtains the magnitude of the voice using the voice feature quantity estimated by the noise component removal means and detects a first voice section from that magnitude; and a second voice section detection means that, within a section formed by adding predetermined sections before and after the first voice section detected by the first voice section detection means, detects a second voice section for each element using at least the voice feature quantity estimated by the noise component removal means.

(3) Alternatively, it comprises the first and second input means, the first and second feature extraction means, the coefficient calculation means, and the noise component removal means as defined in (1), together with: a first voice section detection means that detects a first voice section using at least the magnitude of the first signal and the magnitude of the second signal; and a second voice section detection means that, within a section formed by adding predetermined sections before and after the first voice section detected by the first voice section detection means, detects a second voice section for each element using at least the voice feature quantity estimated by the noise component removal means.

(4) Alternatively, in the voice section detection method of (1), the noise component removal means holds a coefficient for removing the noise component for each element, and the coefficient for each element is updated based on information on the voice section of the corresponding element in the voice section detection means.

(5) Alternatively, in the voice section detection method of (2) or (3), the noise component removal means holds a coefficient for removing the noise component for each element, and the coefficient for each element is updated based on information on the voice section of the corresponding element in the second voice section detection means.

(6) Alternatively, in the voice section detection method of (1), the voice section detection means holds a threshold for detecting the voice section of each element, and the threshold for each element is updated based on information on the voice section of the corresponding element in the voice section detection means.

(7) Alternatively, in the voice section detection method of (2) or (3), the second voice section detection means holds a threshold for detecting the second voice section of each element, and the threshold for each element is updated based on information on the corresponding second voice section of that element.

(8) Alternatively, in the voice section detection method of (1), (2), or (3), when an element of the first feature quantity or the second feature quantity is large, the coefficient calculation means sets the corresponding coefficient to the ratio of the corresponding element of the first feature quantity to the corresponding element of the second feature quantity, or to a value close to that ratio; when an element of the first feature quantity or the second feature quantity is small, it sets the corresponding coefficient to a predetermined value, or to a value close to that value.

(9) Alternatively, in the voice section detection method of (8), for each element X(i) of the first feature quantity and the corresponding element N(i) of the second feature quantity, the coefficient calculation means sets the corresponding coefficient k(i) to

k(i) = [X(i) + C] / [N(i) + C]   (C: constant, C > 0)

(10) Alternatively, in the voice section detection method of (8), for each element X(i) of the first feature quantity and the corresponding element N(i) of the second feature quantity, the coefficient calculation means sets the corresponding coefficient k(i) to

k(i) = [X(i) + C₁] / [N(i) + C₂]   (C₁, C₂: constants, C₁, C₂ > 0)

(11) Alternatively, in the voice section detection method of any one of (1) to (10), the first input means for inputting voice and the second input means for inputting noise are both microphones.

(12) Alternatively, in the voice section detection method of any one of (1) to (10), the first input means for inputting voice is a microphone, and the second input means for inputting noise is a speaker placed near that microphone, the acoustic signal reproduced from the speaker being taken as the input.

(13) Alternatively, it comprises: an input pattern generation unit that creates an input pattern of the input voice from the voice feature quantity obtained using the voice section detection method of any one of (1) to (12); a standard pattern memory that stores standard patterns of voice registered in advance; and a recognition unit that performs recognition processing using the input pattern and the standard patterns.

The invention is described below based on embodiments.

[0006]

FIG. 1 is a block diagram illustrating one embodiment of the present invention. In the figure, 1 is a first microphone, 2 is a second microphone, 10 is a first feature extraction unit, 20 is a second feature extraction unit, 30 is a first voice section detection unit, 40 is a coefficient calculation unit, 50 is a noise component removal unit, and 60 is a second voice section detection unit. The first microphone 1 is the microphone for inputting voice (main input). It is placed near the speaker's mouth, and the main input signal obtained from it contains both the voice and the ambient noise. The second microphone 2 is the microphone for inputting ambient noise (reference input). It is placed away from the speaker's mouth, and the reference input signal obtained from it contains only the ambient noise and almost no voice.

[0007]

The first feature extraction unit 10 consists of a microphone amplifier 11, a 15-channel band-pass filter 12, a rectifier 13, a low-pass filter 14, a multiplexer 15, and an A/D converter 16. At every fixed frame period it obtains the spectrum X(i) (i = 1, 2, ..., 15) of the main input signal obtained by microphone 1. The second feature extraction unit 20 consists of a microphone amplifier 21, a 15-channel band-pass filter 22, a rectifier 23, a low-pass filter 24, a multiplexer 25, and an A/D converter 26. At every fixed frame period it obtains the spectrum N(i) of the reference input signal obtained by microphone 2. The gains of microphone amplifiers 11 and 21 are adjusted in advance so that the main input signal and the reference input signal have almost the same level for noise from a distant sound source. The characteristics of the remaining parts of the first and second feature extraction units are identical. The feature quantities obtained by the first and second feature extraction units may be other known feature quantities.

[0008]

The first voice section detection unit 30 detects an approximate voice section according to whether the difference between the power ΣX(i) of the main input signal and the power ΣN(i) of the reference input signal exceeds a threshold T_pwr. The power of each input signal may be obtained by other means, and another method may be used to detect the approximate voice section. The threshold T_pwr is calculated, in sections that are not an approximate voice section, from the averages over a plurality of preceding frames of the main input signal power X_pwr = ΣX(i) and the reference input signal power N_pwr = ΣN(i) (denoted Av·X_pwr and Av·N_pwr, respectively), and is updated successively. That is,

T_pwr = a_pwr (Av·X_pwr − Av·N_pwr) + b_pwr   (1)
(a_pwr, b_pwr: constants, a_pwr, b_pwr > 0)

Alternatively, the voice section detection unit 30 may obtain the voice power ΣS(i) from the voice feature quantity S(i), from which the noise component has been removed by the noise component removal unit 50 described later, and compare it with the threshold T_pwr.
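The approximate detection and the threshold update of equation (1) can be sketched in Python as follows. This is only an illustrative sketch: the class name, the constants a_pwr = 1.0 and b_pwr = 20.0, the history length of 16 frames, and the choice to start with T_pwr = b_pwr are all assumptions, since the description does not give numeric values.

```python
from collections import deque


class RoughDetector:
    """Approximate voice section detector using equation (1)."""

    def __init__(self, a_pwr=1.0, b_pwr=20.0, history=16):
        self.a_pwr = a_pwr
        self.b_pwr = b_pwr
        # Powers of recent frames judged NOT to be in a voice section.
        self.x_hist = deque(maxlen=history)
        self.n_hist = deque(maxlen=history)
        self.t_pwr = b_pwr  # assumed starting value before any history exists

    def step(self, x, n):
        """Classify one frame; x and n are the per-band spectra X(i), N(i)."""
        x_pwr, n_pwr = sum(x), sum(n)
        speech = (x_pwr - n_pwr) > self.t_pwr
        if not speech:
            # Update T_pwr only outside the approximate voice section.
            self.x_hist.append(x_pwr)
            self.n_hist.append(n_pwr)
            av_x = sum(self.x_hist) / len(self.x_hist)
            av_n = sum(self.n_hist) / len(self.n_hist)
            self.t_pwr = self.a_pwr * (av_x - av_n) + self.b_pwr
        return speech


det = RoughDetector()
print(det.step([10, 10], [10, 10]))  # balanced noise frame
print(det.step([60, 60], [10, 10]))  # main input much louder than reference
```

Keeping the threshold frozen while a frame is classified as voice is one reading of "calculated in sections that are not an approximate voice section"; other update schedules would also fit the text.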

[0009]

For each channel i, the coefficient calculation unit 40 calculates the coefficient k(i) from the spectrum X(i) of the main input signal and the spectrum N(i) of the reference input signal as follows.

k(i) = [X(i) + C₁] / [N(i) + C₂]   (2)
(C₁, C₂: constants, C₁, C₂ > 0)

The coefficient k(i) of equation (2) may be calculated from the averages of the main input spectrum X(i) and the reference input spectrum N(i) over a plurality of preceding frames, but the number of frames averaged should preferably be smaller than the number of frames used to obtain the thresholds T_pwr and Ti.
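A minimal Python sketch of equation (2) with frame averaging is shown here. The function name, the list-of-frames input layout, and the default C₁ = C₂ = 16 are assumptions made for illustration.

```python
def coefficients(x_frames, n_frames, c1=16.0, c2=16.0):
    """Per-channel k(i) = [Av.X(i) + C1] / [Av.N(i) + C2], equation (2).

    x_frames and n_frames are lists of recent non-voice frames, each frame
    being a list with one spectrum value per channel.
    """
    n_ch = len(x_frames[0])
    m = len(x_frames)
    av_x = [sum(frame[i] for frame in x_frames) / m for i in range(n_ch)]
    av_n = [sum(frame[i] for frame in n_frames) / m for i in range(n_ch)]
    return [(ax + c1) / (an + c2) for ax, an in zip(av_x, av_n)]


# Two channels over two frames: channel 0 is silent, channel 1 is loud.
k = coefficients([[0, 200], [0, 200]], [[0, 100], [0, 100]])
print(k)
```

In the silent channel the result equals C₁/C₂, and in the loud channel it moves toward the plain ratio X(i)/N(i) = 2.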

[0010]

As is clear from equation (2), when the values of X(i) and N(i) are large, k(i) approaches X(i)/N(i), the value used in the prior art; conversely, when the values of X(i) and N(i) are small, k(i) approaches C₁/C₂ (a constant). Therefore, if C₁/C₂ is set to a value appropriate for the system, the error in k(i) is small when X(i) and N(i) are small. When X(i) and N(i) are represented in 8 bits (0 to 255), values of about 8 to 32 are suitable for C₁ and C₂. The value of C₁/C₂ may be 1 if X(i) and N(i), measured experimentally in advance for noise from a distant sound source or from a fixed sound source, are at almost the same level; otherwise it is set to, for example, the measured X(i)/N(i). The values of C₁ and C₂ may also differ from channel to channel. In a system where some voice leaks into microphone 2, setting C₁/C₂ to 1 (C₁ = C₂) would cause part of the voice component to be removed as noise, so a value smaller than 1 (C₁ < C₂) is preferable. The calculation of equation (2) is performed in sections where no voice is being input, and k(i) is updated successively. The value of k(i) obtained here may be smoothed along the time axis. The relation between the coefficient k(i) and the spectra X(i) and N(i) is not limited to equation (2): an expression using a hyperbolic or exponential function may be used, and a similar effect can also be obtained with techniques such as weighting.
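The limiting behaviour described above can be checked numerically with a short Python sketch. It assumes a band whose true noise ratio is 1 and a one-count measurement error on X(i); the function names and the choice C₁ = C₂ = 16 are illustrative assumptions.

```python
def k_plain(x, n):
    """Prior-art ratio X(i) / N(i)."""
    return x / n


def k_reg(x, n, c1=16.0, c2=16.0):
    """Equation (2): [X(i) + C1] / [N(i) + C2]."""
    return (x + c1) / (n + c2)


# The true ratio is 1.0, but X is measured one count too high.
for n in (1.0, 10.0, 200.0):
    x = n + 1.0
    print(n, round(k_plain(x, n), 4), round(k_reg(x, n), 4))
```

At the lowest level the plain ratio is off by a factor of two while equation (2) stays near C₁/C₂ = 1; at the highest level the two agree closely, matching the two limits stated in the text.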

[0011]

For each channel i, the noise component removal unit 50 calculates the voice feature quantity S(i), from which the noise component has been removed, from the main input spectrum X(i), the reference input spectrum N(i), and the coefficient k(i) as follows.

S(i) = X(i) − k(i)·N(i)   (3)

Outside a voice section, S(i) may be set to 0. In equation (3), further additions or subtractions may be applied to compensate for errors and the like, giving a more accurate voice feature quantity S(i).
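Equation (3), including the option of outputting zero outside a voice section, can be sketched in Python as follows. The function name and the in_speech flag are assumptions; the additional error-compensation terms mentioned in the text are not modelled.

```python
def remove_noise(x, n, k, in_speech=True):
    """Per-channel S(i) = X(i) - k(i) * N(i), equation (3).

    Returns all zeros when the frame is outside a voice section.
    """
    if not in_speech:
        return [0.0] * len(x)
    return [xi - ki * ni for xi, ki, ni in zip(x, k, n)]


# Channel 0 holds voice (50) plus noise (100); channel 1 holds noise only.
print(remove_noise([150.0, 40.0], [100.0, 40.0], [1.0, 1.0]))
```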

[0012]

Within the section formed by adding predetermined sections before and after the approximate voice section detected by the first voice section detection unit 30, the second voice section detection unit 60 detects a per-band voice section for each channel i according to whether the voice spectrum S(i) estimated by the noise component removal unit 50 exceeds a threshold Ti. When channel i is in its per-band voice section, S(i) is output to the subsequent voice recognition device or the like; otherwise, 0 is output.
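The per-band gating can be sketched in Python as follows. The function names, the inclusive (start, end) frame indexing of the approximate section, and the margins of one frame before and after are assumptions for illustration; the description only says the added sections are predetermined.

```python
def extend_section(start, end, pre, post, n_frames):
    """Add `pre` frames before and `post` frames after, clipped to the data."""
    return max(0, start - pre), min(n_frames - 1, end + post)


def band_output(s_frames, thresholds, rough, pre=1, post=1):
    """Output S(i) where it exceeds Ti inside the extended section, else 0."""
    lo, hi = extend_section(rough[0], rough[1], pre, post, len(s_frames))
    out = []
    for t, s in enumerate(s_frames):
        if lo <= t <= hi:
            out.append([si if si > ti else 0.0 for si, ti in zip(s, thresholds)])
        else:
            out.append([0.0] * len(s))
    return out


# Five frames, two channels; the approximate voice section is frame 2 only.
s_frames = [[50, 50], [5, 20], [30, 0], [40, 5], [60, 60]]
for row in band_output(s_frames, [10, 10], rough=(2, 2)):
    print(row)
```

Frames 0 and 4 lie outside the extended section, so their large values are suppressed even though they exceed the thresholds.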

[0013]

FIG. 2 shows the relationship among the approximate voice section (A), the section (B) formed by adding predetermined sections before and after it, and the per-band voice section (C) for each channel i. The threshold Ti is expressed as

Ti = a (Av·X(i) − Av·N(i)) + b   (4)
(a, b: constants, a, b > 0)

It is calculated, in sections where channel i is not in its per-band voice section, from the averages over a plurality of preceding frames of the main input spectrum X(i) and the reference input spectrum N(i) (Av·X(i) and Av·N(i), respectively), and is updated successively.
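The per-channel update of equation (4) can be sketched in Python as follows. The function name, the constants a = 1.0 and b = 8.0, and passing the frame averages in as ready-made lists are assumptions for illustration.

```python
def update_thresholds(t, av_x, av_n, band_speech, a=1.0, b=8.0):
    """Recompute Ti = a * (Av.X(i) - Av.N(i)) + b, equation (4).

    Only channels that are NOT in their per-band voice section are updated;
    the others keep their previous threshold.
    """
    return [
        ti if speech else a * (ax - an) + b
        for ti, ax, an, speech in zip(t, av_x, av_n, band_speech)
    ]


# Channel 0 is outside its voice section and is updated; channel 1 is not.
print(update_thresholds([10.0, 10.0], [35.0, 50.0], [28.0, 20.0], [False, True]))
```

Because each channel is handled independently, a band without voice can track a change in the noise during an utterance while the bands carrying voice keep their thresholds.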

[0014]

In this embodiment, when the device is used in an environment where the noise level is not very high, the second voice section detection unit 60 may determine the per-band voice sections directly, without the first voice section detection unit 30 determining an approximate voice section (claim 1). In addition, the calculation of the threshold Ti in the second voice section detection unit 60 and the calculation of the coefficient k(i) in the coefficient calculation unit 40 may be performed not for each channel i but for each group, with several channels combined into a group.

[0015]

FIG. 3 is a block diagram showing an embodiment of the present invention for removing noise from a speaker placed near the microphone used for voice input. Parts that operate in the same way as in the embodiment of FIG. 1 carry the same reference numbers as in FIG. 1. This embodiment is the same as that of FIG. 1 except that the acoustic signal sent to the speaker 2s is input in place of the input from microphone 2 of FIG. 1. The microphone amplifiers 11 and 21 are adjusted so that the main input signal and the reference input signal are at almost the same level for that acoustic signal.

[0016]

FIG. 4 is a block diagram showing an example of a voice recognition device using the noise removal device of the present invention. The input pattern generation unit 70 creates an input pattern of the input voice from the voice spectrum obtained by the noise removal device described above, the standard pattern memory 80 stores standard patterns of voice registered in advance, and the recognition unit 90 performs recognition processing using the input pattern and the standard patterns. The configuration and operation of the input pattern generation unit 70, the standard pattern memory 80, and the recognition unit 90 follow the known BTSP voice recognition method, but other methods may be used.

[0017]

[Effects] According to the invention of claim 1, voice section detection is performed for each band using the feature quantity of the main input signal for inputting voice and the feature quantity obtained from the reference input signal for inputting noise. Even when one band is in a voice section, other bands that contain no voice component are not treated as voice sections, so the voice feature quantity can be extracted accurately and a good recognition rate is obtained in voice recognition under time-nonstationary, high-noise conditions.

According to the invention of claim 2, the magnitude of the voice is obtained using the voice feature quantity estimated by the noise component removal means, an approximate voice section is detected from that magnitude, and voice section detection is performed for each band within the section formed by adding predetermined sections before and after the approximate voice section. Noise in sections some distance away from the approximate voice section is therefore not mistaken for voice, and voice sections can be detected still more accurately.

According to the invention of claim 3, the magnitude of the voice is obtained using the power of the main input signal and the power of the reference input signal, an approximate voice section is detected from that magnitude, and voice section detection is performed for each band within the section formed by adding predetermined sections before and after the approximate voice section. Even noise as loud as the voice is therefore not mistaken for voice, and voice sections can be detected still more accurately.

According to the inventions of claims 4 and 5, the per-band coefficient for removing the noise component is updated based on the per-band voice section information. Even if the noise environment changes during a voice section, the coefficient can be updated in bands that contain no voice component, so that the change in the noise environment can be followed and voice sections can be detected still more accurately.

According to the inventions of claims 6 and 7, the threshold for detecting the voice section of each band is updated based on the per-band voice section information. Even if the noise environment changes during a voice section, the threshold can be updated in bands that contain no voice component, so that the change in the noise environment can be followed and voice sections can be detected still more accurately.

According to the invention of claim 8, as is clear from equation (2), under time-nonstationary noise, when the coefficient k(i) is determined while the ambient noise level is low and the following voice section contains a comparatively large noise component, k(i) is close to the predetermined constant and its error is small; when the noise level is high, k(i) is close to the ratio of the main input to the reference input. In either case the noise component can be removed properly, that is, the voice spectrum can be estimated properly.

According to the invention of claim 9, in a system where the main input and the reference input are at almost the same level and the reference input contains no voice, k(i) is close to 1 when the ambient noise level is low, so its error is small, and k(i) is close to the ratio of the main input to the reference input when the noise level is high. Moreover, k(i) changes continuously from low to high noise levels, so the noise component can be removed properly, that is, the voice spectrum can be estimated properly, at any noise level.

According to the invention of claim 10, in a system where the main input and the reference input are not at almost the same level, or where some voice leaks into the reference input, k(i) at low ambient noise levels is close to a constant suited to the system, and the same effect as that of the invention of claim 9 is obtained.

According to the invention of claim 11, the effects described above are obtained when two microphones are used as the main input and the reference input.

According to the invention of claim 12, the effects described above are obtained when a microphone is used as the main input and the acoustic signal sent to a speaker is used as the reference input.

According to the invention of claim 13, a good recognition rate is obtained in voice recognition under a time-nonstationary noise environment.
There is an effect similar to the effect of the ninth aspect of the present invention. According to the eleventh aspect, when two microphones are used as the main input and the reference input, the above-described effect is obtained. According to the twelfth aspect, the above-described effect is obtained when an audio signal sent to a microphone is used as a main input and an audio signal sent to a speaker is used as a reference input. According to the thirteenth aspect, a good recognition rate can be obtained in voice recognition in a noise environment where time is not stationary.

[Brief description of the drawings]

FIG. 1 is a configuration diagram for explaining an embodiment of the present invention.

FIG. 2 is a time chart for explaining the operation of FIG. 1.

FIG. 3 is a configuration diagram for explaining an embodiment of the voice section detection method according to the present invention.

FIG. 4 is a configuration diagram for explaining an embodiment of the speech recognition apparatus according to the present invention.

[Explanation of symbols]

1: first microphone, 2: second microphone, 10: first feature amount extraction unit, 20: second feature amount extraction unit, 30: first voice section detection unit, 40: coefficient calculation unit, 50: noise component removing unit, 60: second voice section detection unit, 70: input pattern generation unit, 80: standard pattern memory, 90: recognition unit.

Continuation of the front page

(51) Int. Cl.⁷ (identification symbol, FI): // G10L 101:023

(56) References cited: JP-A-1-200294 (JP, A); JP-A-60-193000 (JP, A); JP-A-62-244100 (JP, A); JP-A-62-123499 (JP, A); JP-A-62-129897 (JP, A); JP-A-58-142400 (JP, A); JP-A-3-147000 (JP, A); JP-A-63-262695 (JP, A); JP-A-55-7749 (JP, A); JP-A-2-272499 (JP, A); JP-A-58-196599 (JP, A); JP-A-1-255897 (JP, A); JP-A-2-238493 (JP, A); JP-A-60-101598 (JP, A); JP-A-2-205898 (JP, A); Patent 2897230 (JP, B2); Patent 2807457 (JP, B2); Patent 2856934 (JP, B2); Patent 2539027 (JP, B2); JP-B-7-113834 (JP, B2); JP-B-3-76475 (JP, B2); JP-B-4-49952 (JP, B2); JP-B-2-22398 (JP, B2); Examined Utility Model Publication 62-37279 (JP, Y2)

(58) Fields searched (Int. Cl.⁷, database name): G10L 15/00 - 17/00; JICST file (JOIS)

Claims

(57) [Claims]

1. A voice section detection method comprising: first input means for inputting voice; first feature amount extraction means for obtaining a first feature amount, consisting of a plurality of elements, of a first input signal obtained by the first input means; second input means for inputting noise; second feature amount extraction means for obtaining a second feature amount, consisting of a plurality of elements, of a second input signal obtained by the second input means; coefficient calculation means for calculating a coefficient for each element from the first feature amount and the second feature amount; noise component removing means for estimating a feature amount of the voice by removing a noise component from the first feature amount using the second feature amount and the coefficients; and voice section detection means for detecting a voice section for each of the elements using at least the feature amount of the voice estimated by the noise component removing means.
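The chain in claim 1 (coefficients, noise removal, per-element detection) can be illustrated with a rough Python sketch for one frame of per-band power values. The subtraction form S(i) = max(X(i) - k(i) * N(i), 0), the function names, and the fixed thresholds are assumptions made for illustration; the claim itself does not state them.

```python
def remove_noise(x, n, k):
    """Estimate per-band voice features by subtracting scaled reference noise.

    Assumed form: S(i) = max(X(i) - k(i) * N(i), 0). Not taken from the claims.
    """
    return [max(xi - ki * ni, 0.0) for xi, ni, ki in zip(x, n, k)]


def detect_bands(s, thresholds):
    """Return one voice / non-voice flag per band (element)."""
    return [si > ti for si, ti in zip(s, thresholds)]


# One frame with three bands: main input X, reference input N, coefficients k.
x = [10.0, 2.1, 8.0]
n = [2.0, 2.0, 1.0]
k = [1.0, 1.0, 1.0]
thresholds = [1.0, 1.0, 1.0]  # hypothetical per-band thresholds

s = remove_noise(x, n, k)
flags = detect_bands(s, thresholds)
print(s, flags)
```

In this made-up frame the middle band is almost entirely noise, so it is not flagged even though the bands on either side are.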

2. A voice section detection method comprising: first input means for inputting voice; first feature amount extraction means for obtaining a first feature amount, consisting of a plurality of elements, of a first input signal obtained by the first input means; second input means for inputting noise; second feature amount extraction means for obtaining a second feature amount, consisting of a plurality of elements, of a second input signal obtained by the second input means; coefficient calculation means for calculating a coefficient for each element from the first feature amount and the second feature amount; noise component removing means for estimating a feature amount of the voice by removing a noise component from the first feature amount using the second feature amount and the coefficients; first voice section detection means for obtaining the loudness of the voice using the feature amount of the voice estimated by the noise component removing means and detecting a first voice section from the loudness of the voice; and second voice section detection means for detecting a second voice section, within a section obtained by adding predetermined sections before and after the first voice section detected by the first voice section detection means, using at least the feature amount of the voice estimated by the noise component removing means.
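The two-stage detection of claim 2 might look roughly like the following Python sketch: a rough section is found from a loudness curve, widened by a margin, and the finer detection runs only inside that window. Taking loudness as the sum over bands, the frame-count margin, the per-band second stage, and all names and numbers are assumptions for illustration.

```python
def rough_section(loudness, threshold):
    """Return (start, end) frame indices of the first run above threshold, or None."""
    start = None
    for t, value in enumerate(loudness):
        if value > threshold and start is None:
            start = t
        elif value <= threshold and start is not None:
            return (start, t - 1)
    return (start, len(loudness) - 1) if start is not None else None


def expand(section, margin, n_frames):
    """Add `margin` frames before and after the section, clipped to the signal."""
    start, end = section
    return (max(start - margin, 0), min(end + margin, n_frames - 1))


def detect_per_band(frames, thresholds, window):
    """Flag each band in each frame, but only inside the expanded window."""
    lo, hi = window
    return [
        [lo <= t <= hi and s > th for s, th in zip(frame, thresholds)]
        for t, frame in enumerate(frames)
    ]


# Estimated voice features S(i) for 8 frames and 2 bands (made-up numbers).
frames = [[0, 5], [0, 0], [1, 0], [6, 0], [7, 2], [1, 0], [0, 0], [0, 5]]
loudness = [sum(f) for f in frames]  # loudness taken as the band sum (assumption)
window = expand(rough_section(loudness, 5.5), margin=1, n_frames=len(frames))
flags = detect_per_band(frames, thresholds=[0.5, 1.0], window=window)
print(window, flags)
```

The residual bursts in the second band at frames 0 and 7 exceed that band's threshold, but they lie outside the widened window and are therefore not flagged.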

3. A voice section detection method comprising: first input means for inputting voice; first feature amount extraction means for obtaining a first feature amount, consisting of a plurality of elements, of a first input signal obtained by the first input means; second input means for inputting noise; second feature amount extraction means for obtaining a second feature amount, consisting of a plurality of elements, of a second input signal obtained by the second input means; coefficient calculation means for calculating a coefficient for each element from the first feature amount and the second feature amount; noise component removing means for estimating a feature amount of the voice by removing a noise component from the first feature amount using the second feature amount and the coefficients; first voice section detection means for detecting a first voice section using at least the magnitude of the first input signal and the magnitude of the second input signal; and second voice section detection means for detecting a second voice section for each of the elements, within a section obtained by adding predetermined sections before and after the first voice section detected by the first voice section detection means, using at least the feature amount of the voice estimated by the noise component removing means.

4. The voice section detection method according to claim 1, wherein the noise component removing means holds a coefficient for removing the noise component for each of the elements, and the coefficient for each element is updated based on the voice section information for the corresponding element obtained by the voice section detection means.

5. The voice section detection method according to claim 2, wherein the noise component removing means holds a coefficient for removing the noise component for each of the elements, and the coefficient for each element is updated based on the voice section information for the corresponding element obtained by the second voice section detection means.

6. The voice section detection method according to claim 1, wherein the voice section detection means holds a threshold for detecting a voice section for each of the elements, and the threshold for each element is updated based on the voice section information for the corresponding element obtained by the voice section detection means.

7. The voice section detection method according to claim 2, wherein the second voice section detection means holds a threshold for detecting the second voice section for each of the elements, and the threshold for each element is updated based on the second voice section information for the corresponding element.
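Claims 4 to 7 share one idea: values held per element are refreshed only where the detector reports no voice. A minimal Python sketch of that gating follows. The two update rules (the ratio form for the coefficient and a smoothed residual for the threshold), the parameters `alpha` and `factor`, and all names are hypothetical; the claims state only that the updates depend on the per-element voice section information.

```python
def update_held_values(k, thresholds, x, n, voice_flags, c=1.0, alpha=0.5, factor=2.0):
    """Update per-band coefficients and thresholds only where no voice was detected.

    Assumed rules (not from the claims):
      k(i)  <- (X(i) + C) / (N(i) + C)
      th(i) <- (1 - alpha) * th(i) + alpha * factor * residual(i)
    where residual(i) = max(X(i) - k(i) * N(i), 0) uses the previous k(i).
    """
    new_k, new_th = [], []
    for ki, th, xi, ni, is_voice in zip(k, thresholds, x, n, voice_flags):
        if is_voice:
            # Voice present in this band: keep the held values untouched.
            new_k.append(ki)
            new_th.append(th)
        else:
            residual = max(xi - ki * ni, 0.0)
            new_k.append((xi + c) / (ni + c))
            new_th.append((1 - alpha) * th + alpha * factor * residual)
    return new_k, new_th


k, thresholds = [1.0, 1.0], [1.0, 1.0]
x, n = [50.0, 9.0], [4.0, 4.0]   # made-up band powers for one frame
voice_flags = [True, False]      # band 0 holds voice, band 1 is noise only
k, thresholds = update_held_values(k, thresholds, x, n, voice_flags)
print(k, thresholds)
```

Band 0 keeps its held coefficient and threshold during the voice section, while band 1 follows the changed noise level.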

8. The voice section detection method according to claim 1, wherein, when each element of the first feature amount and the second feature amount is large, the coefficient calculation means sets the value of the corresponding coefficient to the ratio between the corresponding element of the first feature amount and the corresponding element of the second feature amount, or to a value close to that ratio, and, when each element of the first feature amount or the second feature amount is small, sets the value of the corresponding coefficient to a predetermined value or a value close to it.

9. The voice section detection method according to claim 8, wherein, for each element X(i) of the first feature amount and the corresponding element N(i) of the second feature amount, the coefficient calculation means sets the value of the corresponding coefficient k(i) to k(i) = [X(i) + C] / [N(i) + C] (C: constant > 0).

10. The voice section detection method according to claim 8, wherein, for each element X(i) of the first feature amount and the corresponding element N(i) of the second feature amount, the coefficient calculation means sets the value of the corresponding coefficient k(i) to k(i) = [X(i) + C₁] / [N(i) + C₂] (C₁, C₂: constants > 0).
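The behaviour of the coefficient formulas in claims 9 and 10 can be checked numerically. The short Python sketch below evaluates k(i) for one band at a low and a high signal level; the function name and the sample values are chosen only for illustration.

```python
def coefficient(x, n, c1=1.0, c2=1.0):
    """k(i) = (X(i) + C1) / (N(i) + C2); C1 == C2 gives the claim 9 form."""
    if c1 <= 0 or c2 <= 0:
        raise ValueError("C1 and C2 must be positive")
    return (x + c1) / (n + c2)


# Low level: X and N are small compared with the constants.
k_quiet = coefficient(0.02, 0.01)                       # near 1 (claim 9 form)
k_quiet_system = coefficient(0.02, 0.01, c1=2.0, c2=1.0)  # near C1 / C2 = 2

# High level: X and N dominate the constants, so k approaches X / N.
k_loud = coefficient(300.0, 100.0)                      # near 3

print(k_quiet, k_quiet_system, k_loud)
```

Because the expression is a smooth function of X(i) and N(i), k(i) moves continuously between the constant and the ratio as the level rises.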

11. The voice section detection method according to claim 1, wherein the first input means for inputting voice and the second input means for inputting noise are each a microphone.

12. The voice section detection method according to claim 1, wherein the first input means for inputting voice is a microphone, and the second input means for inputting noise takes as its input the acoustic signal reproduced by a speaker placed near the microphone.

13. A speech recognition apparatus comprising: an input pattern generation unit that creates an input pattern of the input voice from the feature amount of the voice obtained using the voice section detection method according to claim 1; a standard pattern memory that stores pre-registered standard patterns of voice; and a recognition unit that performs recognition processing on the input pattern and the standard patterns.
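The claim leaves the matching procedure of the recognition unit open. As one possible reading, the Python sketch below picks the stored standard pattern nearest to the input pattern by Euclidean distance; the distance measure, the labels, and the pattern values are all hypothetical.

```python
import math


def recognize(input_pattern, standard_patterns):
    """Return the label of the stored standard pattern closest to the input pattern."""
    def distance(a, b):
        return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

    return min(
        standard_patterns,
        key=lambda label: distance(input_pattern, standard_patterns[label]),
    )


# Standard pattern memory: pre-registered per-band patterns (made-up values).
standard_patterns = {"yes": [8.0, 1.0, 7.0], "no": [1.0, 9.0, 2.0]}

# Input pattern built from the noise-removed voice features of one utterance.
result = recognize([7.5, 0.5, 6.0], standard_patterns)
print(result)
```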