JP4533126B2

JP4533126B2 - Proximity sound separation / collection method, proximity sound separation / collection device, proximity sound separation / collection program, recording medium

Info

Publication number: JP4533126B2
Application number: JP2004373810A
Authority: JP
Inventors: 真理子青木; 賢一古家; 章俊片岡
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2004-12-24
Filing date: 2004-12-24
Publication date: 2010-09-01
Anticipated expiration: 2024-12-24
Also published as: JP2006178333A

Description

The present invention relates to a proximity sound separation and collection method and a proximity sound separation and collection apparatus that suppress the noise signal and pick up the target sound at a high S/N ratio in an environment where a target sound source close to the microphone and a noise source far from the microphone are sounding at the same time.

For an environment in which a target sound and noise are present at the same time, a known way to suppress the noise and emphasize the target sound is the spectral subtraction method (Non-Patent Document 1). It uses a single microphone, takes speech as the target sound, assumes noise whose temporal variation is slow, such as air-conditioning noise (hereinafter, stationary noise), and exploits that stationarity to subtract the spectrum of the noise signal from the spectrum of the mixed signal.
A microphone array method (Non-Patent Document 2), which suppresses noise using a plurality of microphones, has also been proposed.
Non-Patent Document 1: S. F. Boll, “Suppression of Acoustic Noise in Speech Using Spectral Subtraction,” IEEE Trans. Acoust., Speech, Signal Process., vol. ASSP-27, no. 2, pp. 113-120, 1979.
Non-Patent Document 2: Y. Kaneda and J. Ohga, “Adaptive microphone-array system for noise reduction,” IEEE Trans. Acoust., Speech, Signal Process., vol. ASSP-34, no. 6, pp. 1391-1400, 1986.

Because the noise suppression method of Non-Patent Document 1 relies on the stationarity of the noise, it has difficulty suppressing non-stationary noise signals such as speech or music; this is the first problem. Because the noise suppression method of Non-Patent Document 2 requires at least two microphones, the apparatus becomes large; this is the second problem.

According to a first embodiment of the present invention, the apparatus comprises: band dividing means that divides each output signal of the voice input means into a plurality of band signals within the voice band; band-specific feature amount calculating means that calculates an acoustic feature amount of each band signal; band-specific signal determining means that determines, for each band, from the feature amount calculated by the band-specific feature amount calculating means, whether the band signal consists mainly of the target sound source signal or mainly of noise; band-specific weight value determining means that determines a weight value for each band from that determination result; band-specific weight value multiplying means that multiplies each band signal by the weight value so determined; and signal synthesis means that returns the weighted signals to a time waveform.

According to a second embodiment of the present invention, in the proximity sound separation and collection apparatus of the first embodiment, the band-specific feature amount calculating means calculates the power value of each band signal, and the band-specific signal determining means determines a band signal whose power value is at or above a preset threshold to be a band signal consisting mainly of the target sound signal, and a band signal whose power value is at or below the threshold to be a band signal consisting mainly of noise.
According to a third embodiment of the present invention, in the proximity sound separation and collection apparatus of the first embodiment, the band-specific feature amount calculating means calculates the sharpness of each band signal as its feature amount, and the band-specific signal determining means determines a band signal whose sharpness is at or above a preset threshold to be a band signal consisting mainly of the target sound signal, and a band signal whose sharpness is at or below the threshold to be a band signal consisting mainly of noise.

According to a fourth embodiment of the present invention, in the proximity sound separation and collection apparatus of either the second or the third embodiment, threshold calculating means that calculates the threshold from the feature amounts calculated by the band-specific feature amount calculating means is added, and the band-specific signal determining means makes its determination according to the threshold so calculated.
In each embodiment, use is limited to the condition that the target sound is speech and that the target sound source is closer to the microphone than the noise source. The target speech is estimated from the noisy signal by exploiting the sparseness of speech signals (the property that high-power frequencies are localized in particular bands).

The band dividing means divides the signal into bands fine enough that each band signal consists mainly of a single acoustic signal component (fine enough to resolve the spectrum of the target sound); a specific example is a band width of about 20 Hz. In addition, because the target sound source is closer to the microphone than the noise source, the target sound signal is assumed to be received by the microphone at a higher level than the noise signal.
The received signal is divided into bands by the band dividing means, and the band-specific feature amount calculating means calculates an acoustic feature amount for each band. From the feature amount of each band, the band-specific weight value determining means determines whether that frequency component belongs to the target sound source close to the microphone or to a noise source arriving from a distance, and sets the weight value α(ω_i) accordingly. For example, when the power of each band is used as the feature amount, the fact that the signal power of the target sound source exceeds that of the noise is exploited: a band whose power is above a predetermined threshold is judged to be target signal, and its weight is set to, for example, α(ω_i) = 1.0. A band whose power is below the threshold is judged to be a noise signal component, and its weight is set to a value α(ω_i) close to zero (0 < α(ω_i) < 1).

When the sharpness of the signal (defined in detail in the embodiments) is used as the feature amount, the property that a nearby sound source has a large sharpness and a distant sound source a small one is exploited: a band whose sharpness is at or below a certain threshold is judged to be a noise signal component and is given a weight α(ω_i) close to zero (0 < α(ω_i) < 1).
The band-specific weight value multiplying means multiplies each band signal X(ω_i) by the determined weight value α(ω_i). The signal synthesis means then returns the weighted signals to a time waveform.

With the configuration of the present invention, the target speech can be recovered from a noisy signal without relying on a property of the noise (its stationarity). The invention can therefore handle noise sources that emit non-stationary signals such as speech or music, which solves the first problem described above.
Because the present invention can be realized with a single microphone, the apparatus can also be made small, which solves the second problem described above.

The proximity sound separation and collection apparatus according to the present invention can be built entirely in hardware, but the best mode is to install on a computer a proximity sound separation and collection program written in a computer-readable programming language and have the computer function as the proximity sound separation and collection apparatus.
When a computer is made to function as the proximity sound separation and collection apparatus according to the present invention, the band dividing means, band-specific feature amount calculating means, band-specific signal determining means, band-specific weight value determining means, band-specific weight value multiplying means, and signal synthesis means are constructed on the computer.

FIG. 1 shows an embodiment of the proximity sound separation and collection apparatus proposed in claim 5 of the present invention. The input means 1 is, for example, a microphone. Let the signal of the target sound source M be s(t) and the signal of the noise source N be n(t). For simplicity a single noise source N is described here, but in general there may be several.
The band dividing means 2 divides the voice band into a plurality of bands by, for example, a fast Fourier transform. The band signals X(ω_1), X(ω_2), ..., X(ω_N) are made fine enough that each consists mainly of a single acoustic signal component. Here a single acoustic signal component means one spectral component contained in the signals s(t) and n(t); it suffices to divide finely enough that the individual spectral components can be separated (for details, see Japanese Patent No. 3355598).
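As a rough illustration of how fine the band division has to be, the sketch below estimates the FFT length that gives a band spacing of about 20 Hz, the example band width mentioned earlier in the description. The 16 kHz sampling rate, the rounding up to a power of two, and the function name are assumptions made for this example; the source does not specify them.

```python
import math


def fft_length_for_resolution(sample_rate_hz: float, resolution_hz: float) -> int:
    """Smallest power-of-two FFT length whose bin spacing is at most resolution_hz."""
    # Bin spacing of an N-point FFT is sample_rate / N, so N >= sample_rate / resolution.
    min_length = sample_rate_hz / resolution_hz
    return 1 << math.ceil(math.log2(min_length))


# Assumed sampling rate of 16 kHz (not stated in the source).
n_fft = fft_length_for_resolution(16000.0, 20.0)
bin_spacing_hz = 16000.0 / n_fft
print(n_fft, bin_spacing_hz)
```

Under these assumptions a frame of 800 samples would already give 20 Hz spacing; rounding up to a power of two is only a convenience for common FFT implementations.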

The band-specific feature amount calculating means 3 calculates an acoustic feature amount τ(ω_i) of the signal for each frequency band. The feature amount is, for example, the power or the sharpness of the signal. Here the signal power proposed in claim 6 of the present invention is used as the feature amount. The band-specific feature amount calculating means 3 therefore outputs the power values 20 log₁₀|X(ω_1)|, 20 log₁₀|X(ω_2)|, ..., 20 log₁₀|X(ω_N)| of the band signals X(ω_1), X(ω_2), ..., X(ω_N).
The band-specific signal determining means 4 determines the attribute of each band signal X(ω_1), X(ω_2), ..., X(ω_N) from its power value. Because the noise arrives from farther away than the target sound, the noise signal n(t) can be assumed to be received at a lower level than the target sound signal s(t). The band signals are therefore expected to have a spectrum like the one shown in FIG. 2. Accordingly, as FIG. 2 shows, a band whose power exceeds the threshold T is estimated to consist mainly of the target signal s(t), and a band whose power is at or below T is estimated to consist mainly of the noise signal n(t). The band-specific signal determining means 4 applies this rule to determine the attribute of each band signal and passes the result to the band-specific weight value determining means 5.

The band-specific weight value determining means 5 sets the weight value α(ω_i) of a band judged to be target sound signal s(t) to, for example, α(ω_i) = 1.0, and the weight value of a band judged to be noise signal n(t) to a value in the range 0 ≤ α(ω_i) ≤ 1 that is as close to 0 as possible. The weight given to a target band need not be exactly 1; any value larger than the weight given to the noise bands will do.
The weight values α(ω_1), α(ω_2), ..., α(ω_N) determined by the band-specific weight value determining means 5 are passed to the band-specific weight value multiplying means 6, which multiplies them into the band signals X(ω_1), X(ω_2), ..., X(ω_N). The weighted band signals α(ω_1)·X(ω_1), α(ω_2)·X(ω_2), ..., α(ω_N)·X(ω_N) are input to the signal synthesis means 7, which returns them to a time signal using, for example, an inverse Fourier transform. Because the bands judged to be noise were given weights very close to 0, little of the noise signal component remains in this time signal, and the S/N ratio of the target sound signal s(t) improves.
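The chain of means 2 to 7 for the power-based case can be sketched on a single frame as follows. This is a minimal illustration under stated assumptions, not the patented implementation: a naive DFT stands in for the fast Fourier transform to keep the example dependency-free, and the frame length of 64, the 20 dB threshold, the noise weight of 0.01, and all function names are hypothetical choices.

```python
import cmath
import math


def dft(x):
    """Naive DFT; stands in for the FFT used by the band dividing means."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spectrum):
    """Naive inverse DFT; returns the real part as the time waveform."""
    n = len(spectrum)
    return [(sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n)
                 for k in range(n)) / n).real
            for t in range(n)]


def separate_frame(frame, threshold_db, noise_weight=0.01):
    spectrum = dft(frame)                                   # band division
    power_db = [20.0 * math.log10(max(abs(c), 1e-12))       # feature: 20*log10|X|
                for c in spectrum]
    weights = [1.0 if p > threshold_db else noise_weight    # determination + weights
               for p in power_db]
    weighted = [w * c for w, c in zip(weights, spectrum)]   # weight multiplication
    return idft(weighted), weights                          # signal synthesis


# Assumed test signal: a strong "near" tone in bin 4 plus a weak "far" tone in bin 10.
n = 64
target = [math.sin(2 * math.pi * 4 * t / n) for t in range(n)]
noise = [0.05 * math.sin(2 * math.pi * 10 * t / n) for t in range(n)]
mixture = [s + v for s, v in zip(target, noise)]

output, weights = separate_frame(mixture, threshold_db=20.0)
max_error = max(abs(o - s) for o, s in zip(output, target))
print(max_error)
```

In this toy case only the two bins carrying the strong tone (bin 4 and its mirror, bin 60) keep a weight of 1.0, so the output is close to the target tone with the weak tone attenuated by the noise weight.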

FIG. 3 shows an embodiment of the proximity sound separation and collection apparatus proposed in claim 7 of the present invention. In this embodiment the feature amounts calculated by the feature amount calculating means 3 are the sharpness values J(ω_1), J(ω_2), ..., J(ω_N).
Let y(n) be the linear prediction residual signal of a signal x(n). The sharpness J(n) of the signal y(n) is defined by equation (1), where E{·} denotes the mean of the quantity in braces:
$$J(n) = \frac{E\{y^{4}(n)\}}{E^{2}\{y^{2}(n)\}} - 3 \tag{1}$$
The sharpness of y(n) is known to be large for a source signal close to the microphone and to decrease as the source moves away from the microphone. Consider applying this property to the band-divided signals X(ω_i). The sharpness of each band signal X(ω_i) is measured; a band whose sharpness exceeds a predetermined threshold T is judged to be target sound signal, and a band whose sharpness is at or below the threshold is judged to be a noise signal component. For a time waveform x(n), the signal first has to be linearly predicted, the residual signal y(n) obtained, and the sharpness measured on y(n); the purpose of the linear prediction is to remove the envelope information of the speech. The band-divided components, however, no longer carry the speech envelope information, so the present invention defines the sharpness J~(ω_i, j) of the band-divided signal X(ω_i) by equation (2) and uses it to determine the attribute of the signal component in each band.

$$\tilde{J}(\omega_i, j) = \frac{E\{x^{4}(\omega_i, j)\}}{E^{2}\{x^{2}(\omega_i, j)\}} - 3 \tag{2}$$
Here i is the band index and j is the frame index.
The band-specific feature amount calculating means 3 calculates the sharpness J~(ω_i, j) defined by equation (2) for each band. The band-specific signal determining means 4 judges a band whose sharpness is at or above a certain threshold to be a target sound signal component, and a band whose sharpness is at or below the threshold to be a noise signal.
As in the case of FIG. 1, the band-specific weight value determining means 5 sets the weight value α(ω_i) of a band judged to be a target sound signal component to α(ω_i) = 1.0, and the weight value of a band judged to be a noise signal component to a value close to zero in the range 0 ≤ α(ω_i) ≤ 1. Each band signal X(ω_i) is multiplied by the weight value α(ω_i) determined for its band, and the signal synthesis means 7 returns the weighted band signals α(ω_i)·X(ω_i) to a time signal, which yields a target sound signal from which the noise component has been removed.
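The sharpness of equation (2) has the form of an excess kurtosis computed from raw moments, and the weighting rule that follows it can be sketched as below. The sketch assumes each band is represented by a real-valued sequence of samples over frames, that a zero-energy band is treated as noise, and that equality with the threshold counts as target (the source is ambiguous at equality); the function names and sample values are hypothetical.

```python
def sharpness(samples):
    """Equation (2): E{x^4} / E^2{x^2} - 3, with E taken as the sample mean."""
    n = len(samples)
    m2 = sum(v * v for v in samples) / n
    m4 = sum(v ** 4 for v in samples) / n
    if m2 == 0.0:
        # Assumption: a band with no energy falls below any threshold.
        return float("-inf")
    return m4 / (m2 * m2) - 3.0


def sharpness_weights(band_sequences, threshold, noise_weight=0.01):
    """Weight 1.0 for bands at or above the threshold, noise_weight otherwise."""
    return [1.0 if sharpness(seq) >= threshold else noise_weight
            for seq in band_sequences]


# Assumed band sequences: one impulsive ("peaky") band and one flat band.
peaky = [0.0] * 7 + [4.0]
flat = [1.0, -1.0] * 4

print(sharpness(peaky), sharpness(flat))
print(sharpness_weights([peaky, flat], threshold=0.0))
```

The impulsive sequence scores high and the constant-magnitude sequence scores low, which mirrors the property the description relies on: a near source keeps a peaky signal, while a distant one is smoothed out.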

In the embodiments above, the band-specific signal determining means 4 determines the attribute of each band signal using a predetermined threshold T. This works when the noise signal is sufficiently small relative to the target sound signal, but as the noise signal grows, T has to be set higher. Speech signals, meanwhile, generally lose power toward higher frequencies (see FIG. 5). When the noise is large, T must therefore be raised to suppress the low-frequency components of the noise, and as a result the high-frequency components of the target sound signal are suppressed as well (see FIG. 5).

FIG. 4 shows an embodiment (corresponding to claim 8) that solves this problem. It adds threshold calculating means 8 that calculates an appropriate threshold for each of several bands, and the band-specific signal determining means 4 uses the thresholds calculated by the threshold calculating means 8 to determine the signal attributes properly.
This embodiment exploits two features of speech signals: they have several (usually about three) formant frequencies (see FIG. 5), and their power decays toward higher frequencies. The received signal s(t) + n(t) is split into several bands, for example about three, and the threshold calculating means 8 calculates a threshold suited to each band, which makes the present invention applicable even when the noise is fairly large. The condition that the power of the noise signal is smaller than the power of the target sound signal is still required.

A specific method is described below. As shown in FIG. 5, the power of a speech signal normally decays toward higher frequencies. If, when the noise is fairly large, a single threshold T is used to remove the noise components over the whole band, T ends up being set high in order to remove the low-frequency components of the noise signal, and the high-frequency part of the target signal is attenuated as well. The signal is therefore divided into several (for example, three) bands, and the threshold calculating means 8 calculates a threshold suited to each band (T1, T2, T3). One way to divide the bands is to use the formant frequencies of an average speech signal (first formant frequency f1, second formant frequency f2, third formant frequency f3): the range at or below f2 is the first band, the range from f2 up to but not including f3 is the second band, and the range at or above f3 is the third band.

The calculation of the thresholds (T1, T2, T3) for the bands is described taking T1 as an example. In the first band, the frequency component X(ω_Max1) with the largest power in the received signal is selected. This component X(ω_Max1) can be judged highly likely to belong to the target sound signal. Its power 20 log₁₀|X(ω_Max1)| is therefore calculated, and a value, for example, 20 dB below it (other margins such as 10 dB or 15 dB may also be used) is taken as the threshold T1; that is, T1 = 20 log₁₀|X(ω_Max1)| − 20. In this way, any signal component in the first band that is 20 dB or more below the component with the largest power is judged to be a noise component and suppressed.

Likewise for the threshold T2: the power 20 log₁₀|X(ω_Max2)| of the frequency component with the largest power in the second band is calculated, and the threshold is set to T2 = 20 log₁₀|X(ω_Max2)| − 20. The threshold T3 is obtained in the same way. By this method the threshold calculating means 8 obtains a threshold suited to each band and inputs the result to the band-specific signal determining means 4. Because the band-specific signal determining means 4 determines the attribute of each band signal using the threshold calculated for its band, the noise component in each band can be determined more accurately than in claim 6, even when the noise signal is fairly large. From the above description it can be understood that the target signal can be extracted from a received signal in which noise signals from a distance are mixed.
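The per-band threshold rule (largest power in the band minus a margin of, for example, 20 dB) can be sketched as follows. The formant boundaries f2 = 1500 Hz and f3 = 2500 Hz, the sample power values, the handling of a component exactly at f2, and the function names are assumptions for illustration only; the source gives no numeric formant values.

```python
def band_index(freq_hz, f2_hz, f3_hz):
    """0: below f2 (first band), 1: f2 up to f3 (second band), 2: f3 and above (third band)."""
    if freq_hz < f2_hz:
        return 0
    if freq_hz < f3_hz:
        return 1
    return 2


def band_thresholds(power_db, freqs_hz, f2_hz, f3_hz, margin_db=20.0):
    """T_k = (largest power in band k) - margin_db; None for a band with no components."""
    peaks = [None, None, None]
    for p, f in zip(power_db, freqs_hz):
        b = band_index(f, f2_hz, f3_hz)
        if peaks[b] is None or p > peaks[b]:
            peaks[b] = p
    return [None if pk is None else pk - margin_db for pk in peaks]


def is_target(power_db, freqs_hz, thresholds, f2_hz, f3_hz):
    """True where a component's power exceeds the threshold of its own band."""
    return [p > thresholds[band_index(f, f2_hz, f3_hz)]
            for p, f in zip(power_db, freqs_hz)]


# Assumed component powers (dB) and frequencies (Hz).
freqs_hz = [300.0, 800.0, 1800.0, 2200.0, 3000.0, 3500.0]
power_db = [40.0, 10.0, 25.0, 0.0, 12.0, -10.0]
F2_HZ, F3_HZ = 1500.0, 2500.0  # hypothetical formant frequencies

thresholds = band_thresholds(power_db, freqs_hz, F2_HZ, F3_HZ)
labels = is_target(power_db, freqs_hz, thresholds, F2_HZ, F3_HZ)
print(thresholds, labels)
```

With a single threshold of 20 dB the 12 dB component in the third band would be discarded; with per-band thresholds it is kept, which is the effect this embodiment aims for.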

The band dividing means 2, band-specific feature amount calculating means 3, band-specific signal determining means 4, band-specific weight value determining means 5, band-specific weight value multiplying means 6, signal synthesis means 7, and threshold calculating means 8 described in the embodiments above can each be realized on a computer by installing a proximity sound separation and collection program written in a computer-readable programming language and having the CPU of the computer interpret and execute it; as a result the computer functions as the proximity sound separation and collection apparatus. The proximity sound separation and collection program is recorded on a computer-readable recording medium such as a magnetic disk or CD-ROM, and can be installed on the computer from such a recording medium or through a communication line.

The proximity sound separation and collection apparatus according to the present invention can be used in, for example, a hands-free audio conferencing system.

FIG. 1 is a block diagram for explaining an embodiment of the proximity sound separation and collection apparatus proposed in claims 5 and 6 of the present invention.
FIG. 2 is a graph for explaining the operation of FIG. 1.
FIG. 3 is a block diagram for explaining an embodiment of the proximity sound separation and collection apparatus proposed in claim 7 of the present invention.
FIG. 4 is a block diagram for explaining an embodiment of the proximity sound separation and collection apparatus proposed in claim 8 of the present invention.
FIG. 5 is a graph for explaining the operation of FIG. 4.

Explanation of symbols

N: noise source
M: target sound source
n(t): noise signal
s(t): target sound signal
1: input means
2: band dividing means
3: band-specific feature amount calculating means
4: band-specific signal determining means
5: band-specific weight value determining means
6: band-specific weight value multiplying means
7: signal synthesis means
8: threshold calculating means

Claims

1. A proximity sound separation and collection method that, using at least one voice input means, suppresses noise and emphasizes the target sound signal when collecting sound in an environment where a target sound source and a noise source are present,
Band division processing for dividing each output signal of the voice input means into a plurality of band signals within a voice band;
A band-specific feature amount calculation process for calculating a sound source feature amount of each band signal;
Based on the feature value for each band calculated in the feature value calculation process for each band, it is determined for each band whether the signal is mainly a signal of a target sound source or a signal of a noise source. Signal determination processing;
Based on the determination result determined in the signal determination process for each band, a weight value determination process for each band for determining a weight value for each band;
A band-by-band weight value multiplication process for multiplying each band signal by the weight value determined in the band-by-band weight value determination process;
A signal synthesis process for returning a signal weighted by the weight value multiplication process for each band to a time waveform;
the method including only the above processes, wherein
The band-specific feature amount calculation process calculates, as the feature amount of each band signal, the sharpness indicating the degree of peakedness in the time domain, and the band-specific signal determination process determines a band signal whose sharpness is equal to or greater than a preset threshold to be a band signal whose main component is the target sound signal, and a band signal whose sharpness is equal to or less than the threshold to be a band signal whose main component is noise.

2. The proximity sound separation and collection method according to claim 1, further comprising a threshold calculation process for calculating the threshold from the values of the feature amounts calculated in the band-specific feature amount calculation process, wherein the band-specific signal determination process is executed according to the threshold calculated in the threshold calculation process.

3. A proximity sound separating and collecting apparatus that, using at least one voice input means, suppresses noise and emphasizes the target sound signal when collecting sound in an environment where a target sound source and a noise source are present,
Band dividing means for dividing each output signal of the voice input means into a plurality of band signals within a voice band;
A band-specific feature amount calculating means for calculating an acoustic feature amount of the band signal;
Band-specific signal determination means for determining whether the signal is a signal mainly composed of the signal of the target sound source based on the characteristic value for each band calculated by the band-specific feature value calculation means;
Based on the determination result determined by the band-specific signal determination unit, a band-specific weight value determination unit that determines a weight value for each band;
Weight-by-band weight value multiplying means for multiplying each band signal by the weight value determined by the weight-by-band weight value determining means;
Signal synthesis means for returning the signal weighted by the weight-by-band weight value multiplication means to a time waveform;
In the proximity sound separating and collecting apparatus provided with the above means,
The band-specific feature amount calculation means calculates the sharpness indicating the degree of peak in the time domain as the feature amount of each band signal, and the band-specific signal determination means determines that the sharpness of each band signal is greater than or equal to a preset threshold value. A proximity sound separating and collecting apparatus, wherein the target sound signal is determined to be a band signal having a main component, and the sharpness of each band signal is determined to be a band signal having noise as a main component if the threshold is not more than a threshold value.

4. The proximity sound separating and collecting apparatus according to claim 3, further comprising threshold calculating means for calculating the threshold from the values of the feature amounts calculated by the band-specific feature amount calculating means, wherein the determination by the band-specific signal determining means is executed according to the threshold calculated by the threshold calculating means.

5. A proximity sound separation / collection program that is described in a computer-readable program language and causes the computer to function as at least the proximity sound separation / collection device according to claim 3 or 4 .

6. A computer-readable recording medium on which the proximity sound separating and collecting program according to claim 5 is recorded.