JP2008135933A

JP2008135933A - Voice emphasizing processing system

Info

Publication number: JP2008135933A
Application number: JP2006320101A
Authority: JP
Inventors: Yoichi Suzuki; 陽一鈴木; Shuichi Sakamoto; 修一坂本; Junfeng Li; 軍鋒李; Satoru Hongo; 哲本郷
Original assignee: Tohoku University NUC; Institute of National Colleges of Technologies Japan
Current assignee: Tohoku University NUC; Institute of National Colleges of Technologies Japan
Priority date: 2006-11-28
Filing date: 2006-11-28
Publication date: 2008-06-12

Abstract

PROBLEM TO BE SOLVED: To provide a voice emphasizing processing system that extracts a target sound-source signal located in the front direction from observed mixed signals (a plurality of sound-source signals).

SOLUTION: The voice emphasizing processing system has a signal sound-obtaining means 101 that receives acoustic signals generated by a plurality of sound sources through left and right receiving sections, and a signal conversion means 102 that divides the two (left and right) input signals into frequency bands and obtains their frequency-band components. The system further has a sound-source identifying means 103 that identifies a target sound source by using a coherence component and an interaural level difference (ILD), both obtained from the frequency-band components, and a noise-source identifying means 104 that identifies a noise source from the frequency-band components. The system further has a sound-source extracting means 105 that extracts the target sound source in each frequency band on the basis of the target sound source obtained by the sound-source identifying means and the noise source obtained by the noise-source identifying means. The target sound source can be separated and emphasized better than with conventional techniques.

COPYRIGHT: (C)2008,JPO&INPIT

Description

The present invention relates to a technique for extracting a target sound source signal located in the front direction from observed mixed signals (a plurality of sound sources).

In recent years, methods such as blind source separation (BSS) and the adaptive beamformer (ABF) have been studied as sound source separation methods based on independent component analysis (ICA).

In addition, a frequency domain binaural model (FDBM) using the interaural level difference and the interaural phase difference has been proposed as a binaural model that performs sound source separation over a wide frequency band. The FDBM estimates the direction of arrival of a sound source by using the interaural phase difference for signals in the low-frequency region and the interaural level difference for signals in the high-frequency region. Because only the sound source in a specific direction is separated by filtering on the basis of the estimation result, both the accuracy of sound source direction estimation over a wide frequency band and the separation performance are improved. However, this method estimates the sound source direction only at an elevation angle of 0°, that is, in the horizontal plane. When one tries to determine the direction of a sound source with a nonzero elevation angle, several points in space share the same interaural phase difference and level difference, so the problem remained that the direction cannot be estimated.

To solve this problem, Patent Document 1 presents a system that separates a specific acoustic signal from a plurality of sound sources located above and below one another. The system comprises: means for inputting acoustic signals generated by a plurality of sound sources through left and right sound receiving units; means for dividing the left and right input signals into frequency bands; means for obtaining, for each frequency band, the interaural phase difference (IPD) from the cross spectrum of the left and right input signals and the interaural level difference (ILD) from the level difference of their power spectra; means for estimating sound source direction candidates for each frequency band by comparing the IPD and/or ILD with those in a database over all frequency bands; means for estimating, as the sound source direction, a direction that appears frequently among the directions obtained for the individual frequency bands; and means for separating the sound source by mainly extracting the frequency bands of the specific sound source direction on the basis of the estimated direction information. With this two-input system, a specific acoustic signal can be separated from a plurality of sound sources, and sound sources having both an azimuth angle and an elevation angle can be separated.

Patent Document 1: JP 2004-325284 A

However, conventional methods have the problem that the accuracy of sound source separation does not improve, particularly in environments where reflected sound is present or where there are many non-stationary or stationary noise sources in the surroundings.

An object of the present invention is to provide a system that separates and emphasizes a target sound source well by using two inputs to identify the sound source to be extracted and the noise source to be suppressed, and by comparing the S/N ratio of the two on the time-frequency plane to emphasize the extracted sound source.

To achieve the above object, the speech enhancement processing system according to claim 1 is a speech enhancement processing system that extracts a target sound source signal located in the front direction from an observed mixed signal, and is characterized by comprising: signal sound pickup means for inputting acoustic signals generated by a plurality of sound sources through left and right receiving units; signal conversion means for dividing the two (left and right) input signals into frequency bands to obtain the frequency band components of the two input signals; sound source identification means for identifying a target sound source on the basis of a coherence component obtained from the frequency band components and an interaural level difference (ILD) obtained from the frequency band components; noise source identification means for identifying a noise source from the frequency band components; and sound source extraction means for extracting the target sound source for each frequency band on the basis of the target sound source obtained by the sound source identification means and the noise source obtained by the noise source identification means.

The speech enhancement processing system according to claim 2 is characterized in that the sound source identification means comprises: means for obtaining the real part of the coherence component (RealCoh) from the frequency band components using a coherence function; means for obtaining the interaural sound pressure level difference (ILD) from the frequency band components; and means for deriving the presence or absence of the target sound source for each frequency band (SAP_CohILD) using a preset conditional expression based on the RealCoh value and the ILD value for each frequency band.

The speech enhancement processing system according to claim 3 is characterized by comprising means for calculating a noise ratio for each frequency band from the frequency band components and deriving the presence or absence of a noise source for each frequency band (SAP_NOR) by comparing that value with a preset index value.

The speech enhancement processing system according to claim 4 is characterized by comprising means for extracting the target sound source for each frequency band by deriving the presence or absence of the target sound source for each frequency band (SAP_SNR) on the basis of the SAP_CohILD value for each frequency band obtained by the sound source identification means and the SAP_NOR value obtained by the noise source identification means.

The speech enhancement processing system according to claim 5 is characterized in that the sound source extraction means comprises means for extracting the target sound source for each frequency band by calculating the S/N ratio for each frequency band and comparing that value with a preset index value to derive the presence or absence of the target sound source for each frequency band (SAP_SNR).

According to the invention of claim 1, the target sound source is identified by obtaining the real part of the coherence component (RealCoh) and the interaural level difference (ILD) from the frequency band components of the two (left and right) input signals, and the noise source is identified from the same frequency band components. The target sound source in the front direction can therefore be identified for each frequency band, and the target sound source can be separated and emphasized better than with conventional methods. The invention is effective even in environments where reflected sound is present.

According to the invention of claim 2, the presence or absence of the target sound source can be identified for each frequency band.

According to the invention of claim 3, the presence or absence of a noise source can be identified for each frequency band.

According to the invention of claim 4 or claim 5, the target sound source is extracted for each frequency band on the basis of the target sound source obtained by the sound source identification means and the noise source obtained by the noise source identification means, so the target sound source can be separated and emphasized better than with conventional methods.

Next, a speech enhancement processing system according to an embodiment of the present invention will be described with reference to the drawings. The present invention is not limited to this embodiment.

FIG. 1 is a diagram showing the configuration of the speech enhancement processing system according to the embodiment of the present invention. As shown in FIG. 1, the system extracts a target sound source signal located in the front direction from an observed mixed signal and comprises: signal sound pickup means 11 for inputting acoustic signals generated by a plurality of sound sources through left and right receiving units; signal conversion means 12 for dividing the two (left and right) input signals into frequency bands to obtain the frequency band components of the two input signals; sound source identification means 13 for identifying the target sound source using a coherence component obtained from the frequency band components and an interaural level difference (ILD) obtained from the frequency band components; noise source identification means 14 for identifying a noise source from the frequency band components; and sound source extraction means 15 for extracting the target sound source for each frequency band on the basis of the target sound source obtained by the sound source identification means and the noise source obtained by the noise source identification means.

In an acoustic space produced by a plurality of sound sources, the signal sound pickup means 11 receives, through the left and right receiving units, an acoustic signal in which the target sound source S and the noise sources N_i (i: number of noise sources) are mixed.

The signal conversion means 12 divides each of the two (left and right) input signals received by the signal sound pickup means 11 into frequency bands using the FFT (Fast Fourier Transform) and obtains the frequency band components of the two input signals.
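As an illustration of this band-splitting step, the following minimal sketch frames a signal and transforms each frame. It is not the patent's implementation: the names `dft` and `band_components`, the frame length, and the hop size are assumptions, no analysis window is applied, and a naive DFT stands in for the FFT (it computes the same transform, only more slowly).

```python
import cmath
import math

def dft(frame):
    """Naive discrete Fourier transform of one frame (O(n^2), same result as an FFT)."""
    n = len(frame)
    return [
        sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]

def band_components(signal, frame_len=8, hop=4):
    """Return X[l][k]: the component of frame l in frequency bin k.

    frame_len and hop are illustrative values, not taken from the patent.
    """
    frames = [
        signal[start:start + frame_len]
        for start in range(0, len(signal) - frame_len + 1, hop)
    ]
    return [dft(frame) for frame in frames]

# One channel: a cosine that completes one cycle per 8-sample frame.
left = [math.cos(2 * math.pi * t / 8) for t in range(16)]
X_L = band_components(left)
```

The same function would be applied to the right channel to obtain X_R, so that both channels share the frame index l and the bin index k.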

The sound source identification means 13 has: means A for obtaining the real part of the coherence component (RealCoh) from the frequency band components obtained by the signal conversion means 12, using a coherence function; means B for obtaining the interaural sound pressure level difference (ILD) from the frequency band components; and means C for deriving the presence or absence of the target sound source for each frequency band (SAP_CohILD) using a preset conditional expression based on the RealCoh value and the ILD value for each frequency band.

In means A, the coherence function is defined by equation (1), and its real part is obtained by equation (2). Here, X_L is the frequency band component of the left input signal, X_R is the frequency band component of the right input signal, X_R* is the complex conjugate of X_R, k is the frequency, and l is the time.
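Equations (1) and (2) are not reproduced in this text. The sketch below therefore uses the common textbook form of the coherence, the cross spectrum of X_L and X_R normalized by the square root of the product of their power spectra, and returns its real part. The averaging over a few frames and the function name `real_coherence` are assumptions, not details taken from the patent.

```python
import math

def real_coherence(xl_frames, xr_frames):
    """Real part of the coherence for one frequency bin k.

    xl_frames, xr_frames: complex components X_L(k, l) and X_R(k, l)
    over several frames l (the averaging span is an assumption).
    """
    cross = sum(xl * xr.conjugate() for xl, xr in zip(xl_frames, xr_frames))
    power_l = sum(abs(xl) ** 2 for xl in xl_frames)
    power_r = sum(abs(xr) ** 2 for xr in xr_frames)
    denom = math.sqrt(power_l * power_r)
    if denom == 0.0:
        return 0.0  # silent bin: treated here as incoherent
    return (cross / denom).real

x = [1 + 1j, 2 - 1j]
same = real_coherence(x, x)                        # identical channels
flipped = real_coherence(x, [-v for v in x])       # opposite phase
shifted = real_coherence(x, [1j * v for v in x])   # 90-degree phase shift
```

With identical left and right components the value reaches its maximum, and it falls as the interaural phase difference grows.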

FIG. 2 shows the real part of the coherence function (RealCoh) obtained experimentally and averaged over all frequency bands. The experimental results show that RealCoh takes a large value (for example, RealCoh ≥ 0.9) for signals from the front direction and a small value (for example, RealCoh ≤ 0.2) for signals from other directions. It follows that RealCoh can be used to distinguish a target sound source in the front direction from noise sources in other directions.

However, RealCoh is periodic, and depending on the frequency it can take a large value even for signals from directions other than the front. FIG. 3 shows the real part of the coherence function (RealCoh) obtained experimentally in all frequency bands. Means A alone therefore cannot always extract the target sound source in the front direction in every frequency band.

Next, means B will be described. In means B, the frequency band component X_L of the left input signal and the frequency band component X_R of the right input signal are defined by equation (3), and they are used to obtain the interaural sound pressure level difference (ILD). The ILD is obtained by equation (4). Here, H_L is the transfer function from the sound source to the left ear, H_R is the transfer function from the sound source to the right ear, and S is the sound source.

From equation (4), it follows that in all frequency bands the interaural sound pressure level difference (ILD) takes a small value for signals from the front direction and a large value for signals from other directions.
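Equations (3) and (4) are not reproduced in this text. As a hedged sketch, the ILD of one time-frequency bin can be written as the magnitude ratio of the two channels in decibels, a common definition that may differ in detail from the patent's equation (4). The function name `ild_db` and the small constant `eps`, which guards against division by zero, are assumptions.

```python
import math

def ild_db(x_l, x_r, eps=1e-12):
    """Interaural level difference in dB for one time-frequency bin.

    A value near 0 dB means both ears receive similar levels, as expected
    for a frontal source; eps avoids log(0) and division by zero.
    """
    return 20.0 * math.log10((abs(x_l) + eps) / (abs(x_r) + eps))

balanced = ild_db(1 + 0j, 1 + 0j)   # equal levels at both ears
left_louder = ild_db(2 + 0j, 1 + 0j)
right_louder = ild_db(1 + 0j, 2 + 0j)
```

If X_L = H_L·S and X_R = H_R·S, as the definitions of H_L, H_R, and S suggest, the source term cancels in the ratio and the ILD depends only on the two transfer functions.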

Next, means C will be described. Means C derives the presence or absence of the target sound source for each frequency band (SAP_CohILD) using a preset conditional expression, based on the RealCoh value for each frequency band obtained by means A and the ILD value for each frequency band obtained by means B. The SAP_CohILD value is obtained by equation (5)'. Using the SAP_CohILD value, the target sound sources S_L and S_R are obtained for each frequency band by equation (5). Here, S_L is obtained from the frequency band component X_L of the left input signal, and S_R is obtained from the frequency band component X_R of the right input signal.

In equation (5)', T₁, T₂, P₁, and P₂ are preset thresholds. According to equation (5)', when the RealCoh value is larger than the threshold T₁ and the ILD value is smaller than the threshold P₁, the SAP_CohILD value is set to 1. When the RealCoh value is smaller than the threshold T₂ and the ILD value is larger than the threshold P₂, the SAP_CohILD value is set to 0. In all other cases, complementary processing is performed and the SAP_CohILD value is set to a number between 0 and 1 (for example, 0.2, 0.5, or 0.8).
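The three-way rule described for equation (5)' can be sketched as follows. The defaults for T₁ and T₂ borrow the example RealCoh values 0.9 and 0.2 mentioned for FIG. 2; the ILD thresholds `p1` and `p2`, the single intermediate value `mid`, the use of the absolute ILD, and the function name are all assumptions made for illustration.

```python
def sap_coh_ild(real_coh, ild, t1=0.9, t2=0.2, p1=1.0, p2=6.0, mid=0.5):
    """Target-presence value for one band from RealCoh and ILD (dB).

    p1, p2, and mid are hypothetical; the patent only says the thresholds
    are preset and that ambiguous bands get a value between 0 and 1.
    """
    level = abs(ild)  # assumption: only the size of the level difference matters
    if real_coh > t1 and level < p1:
        return 1.0    # coherent and balanced: target present
    if real_coh < t2 and level > p2:
        return 0.0    # incoherent and unbalanced: target absent
    return mid        # everything else: intermediate value

present = sap_coh_ild(0.95, 0.3)
absent = sap_coh_ild(0.1, 8.0)
ambiguous = sap_coh_ild(0.95, -8.0)   # coherent but strongly lateralized
```

The last call shows why both cues are combined: a band can have a high RealCoh, because RealCoh is periodic, and still be rejected as a clear target band on the basis of its ILD.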

Next, the noise source identification means 14 will be described. The noise source identification means 14 calculates a noise ratio for each frequency band from the frequency band components obtained by the signal conversion means 12, and derives the presence or absence of a noise source for each frequency band (SAP_NOR) by comparing that value with a preset index value.
A conventional method is used to calculate the noise ratio. Conventional methods include a method using a delay-and-sum array and Roman's Algorithm.

The method using a delay-and-sum array emphasizes the target sound by applying appropriate delays to the microphone signals so that the target sound components are brought into phase. To extract the sound source in the front direction, the noise source identification means 14 calculates, with the delay set to 0, the difference Z (= X_L - X_R) between the frequency band component X_L of the left input signal and the frequency band component X_R of the right input signal, and obtains the noise ratio OIR(w,t) by equation (6). Here, X_L or X_R is used for Y₁.
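Equation (6) is not reproduced in this text. The sketch below assumes that the noise ratio is the power of the difference signal Z relative to the power of Y₁, expressed in decibels. That form, the function name `oir_db`, and the constant `eps` are assumptions rather than the patent's stated formula.

```python
import math

def oir_db(x_l, x_r, eps=1e-12):
    """Assumed noise ratio in dB for one bin: |Z|^2 / |Y_1|^2.

    Z = X_L - X_R is the zero-delay difference, which cancels a source
    that reaches both ears identically. Y_1 is taken to be X_L here
    (the text allows X_L or X_R).
    """
    z = x_l - x_r
    y1 = x_l
    return 10.0 * math.log10((abs(z) ** 2 + eps) / (abs(y1) ** 2 + eps))

frontal = oir_db(1 + 0j, 1 + 0j)    # identical channels: Z vanishes
lateral = oir_db(1 + 0j, -1 + 0j)   # opposite phase: Z is large
```

Under this assumption, a bin dominated by a frontal source yields a strongly negative value, while a bin dominated by an off-axis source does not.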

FIG. 4 shows Roman's Algorithm. In this method, the noise ratio OIR(w,t) is obtained by equation (6) using an adaptive filter W.

The presence or absence of a noise source for each frequency band (SAP_NOR) is derived by comparing the noise ratio OIR(w,t) obtained by equation (6) with a preset index value. Equation (7) shows an example in which the index value is set to -6 dB.

According to equation (7), the SAP_NOR value is set to 0 when the noise ratio OIR(w,t) is greater than -6 dB, and to 1 otherwise.
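The comparison described for equation (7) amounts to a single threshold test. The following sketch follows the wording of the text literally; the function name `sap_nor` and the parameter name `index_db` are assumptions.

```python
def sap_nor(oir_db_value, index_db=-6.0):
    """SAP_NOR for one band: 0 if OIR exceeds the index value, otherwise 1."""
    return 0 if oir_db_value > index_db else 1

above = sap_nor(-3.0)      # OIR greater than -6 dB
at_index = sap_nor(-6.0)   # OIR equal to the index value
below = sap_nor(-10.0)     # OIR less than -6 dB
```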

Next, the sound source extraction means 15 will be described. The sound source extraction means 15 has means D and means E. Means D extracts the target sound source for each frequency band by deriving the presence or absence of the target sound source for each frequency band (SAP_SNR) from the SAP_CohILD value and the SAP_NOR value for each frequency band. Means E extracts the target sound source for each frequency band by calculating the S/N ratio for each frequency band and comparing that value with a preset index value to derive SAP_SNR.

Means D compares the SAP_CohILD value for each frequency band obtained by the sound source identification means 13 with the SAP_NOR value for each frequency band obtained by the noise source identification means 14, and derives the presence or absence of the target sound source (SAP_SNR) for each frequency band. Here, the value is set as follows:
・ When SAP_CohILD = 1 and SAP_NOR = 0, SAP_SNR = 1
・ When SAP_CohILD = 0 and SAP_NOR = 1, SAP_SNR = 0
In all other cases, the SAP_SNR value is derived using means E.
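The decision logic of means D can be sketched as a small function that settles the two clear cases and defers everything else. Passing means E as a callback named `fallback` is an illustrative design choice, not something the patent specifies.

```python
def combine_means_d(sap_coh_ild, sap_nor, fallback):
    """Means D for one band: decide the clear cases, defer the rest.

    fallback is a hypothetical zero-argument callable standing in for
    means E; it is only invoked when neither clear case applies.
    """
    if sap_coh_ild == 1 and sap_nor == 0:
        return 1
    if sap_coh_ild == 0 and sap_nor == 1:
        return 0
    return fallback()

clear_target = combine_means_d(1, 0, lambda: -1)
clear_noise = combine_means_d(0, 1, lambda: -1)
deferred = combine_means_d(0.5, 1, lambda: -1)   # -1 marks "came from fallback"
```

Deferring through a callable avoids computing the S/N ratio for bands that means D can already decide.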

Means E calculates the S/N ratio SNR(w,t) for each frequency band by equation (8), and derives the presence or absence of the target sound source for each frequency band (SAP_SNR) by comparing that value with a preset index value T₃.

Based on the SNR(w,t) obtained by equation (8), the SAP_SNR value is obtained by equation (9).

According to equation (9), the SAP_SNR value is set to 0 when SNR(w,t) is smaller than the threshold T₃, and to 1 otherwise.
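The thresholding described for equation (9) can be sketched as below. Equation (8), which defines SNR(w,t), is not reproduced in this text, so the sketch takes the S/N ratio as an input. The default value of `t3` and the function name are assumptions.

```python
def sap_snr(snr, t3=0.0):
    """SAP_SNR for one band: 0 if the S/N ratio is below T3, otherwise 1.

    t3 = 0.0 is a placeholder; the patent only says T3 is preset.
    """
    return 0 if snr < t3 else 1

weak = sap_snr(-2.5)
at_threshold = sap_snr(0.0)
strong = sap_snr(4.0)
```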

Based on the SAP_SNR values obtained by means D and means E described above, the sound source extraction means 15 obtains the target sound sources S_L and S_R for each frequency band using equation (10). Here, S_L is obtained from the frequency band component X_L of the left input signal, and S_R is obtained from the frequency band component X_R of the right input signal.
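Equation (10) is not reproduced in this text. A natural reading, assumed here, is that each band component is weighted by its SAP_SNR value, so that S_L = SAP_SNR · X_L and S_R = SAP_SNR · X_R. The sketch below applies such weights to one frame; the function name and the list-based layout are illustrative.

```python
def extract_target(x_l, x_r, sap):
    """Weight both channels of one frame bin by bin.

    x_l, x_r: lists of complex band components X_L(k), X_R(k).
    sap: list of per-bin weights (SAP_SNR values) in [0, 1].
    """
    s_l = [w * xl for w, xl in zip(sap, x_l)]
    s_r = [w * xr for w, xr in zip(sap, x_r)]
    return s_l, s_r

# Bin 0 is kept (weight 1), bin 1 is suppressed (weight 0).
s_l, s_r = extract_target([1 + 1j, 2 + 0j], [1 - 1j, 0 + 2j], [1, 0])
```

Applying the same weight to both channels leaves the interaural cues of the retained bands unchanged.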

Next, the results of an experiment comparing the speech enhancement processing system of the present invention with systems using other conventional methods will be described. The experiment was run under the following three conditions, and the results are shown in FIG. 5 and FIG. 6.
(1) Target sound source: front direction, 40 sentences
    Noise source (1 source): 60-degree direction, 40 sentences
(2) Target sound source: front direction, 40 sentences
    Noise sources (2 sources): 60-degree and -60-degree directions, 40 sentences
(3) Target sound source: front direction, 40 sentences
    Noise sources (3 sources): 60-degree, -60-degree, and 30-degree directions, 40 sentences

FIG. 5 shows the S/N ratio of each system, and FIG. 6 shows the degree of distortion. In the legends of FIG. 5 and FIG. 6, FDBM and Roman indicate the results of the conventional methods, CohILD indicates the result obtained when only the presence or absence of the target sound source (SAP_CohILD) derived by the sound source identification means 13 of the present invention is used (an intermediate stage), and CohAndBF indicates the result of the present invention.

FIG. 5 demonstrates that, under the above three conditions, the system of the present invention obtains a higher S/N ratio than the other systems. FIG. 6 demonstrates that, under the same three conditions, the system of the present invention produces less distortion than the other systems.
These results demonstrate that the system of the present invention can separate and emphasize the target sound source better than the conventional methods.

FIG. 1 is a diagram showing the configuration of the speech enhancement processing system of the present invention.
FIG. 2 is a diagram showing the real part of the coherence function (RealCoh) obtained experimentally and averaged over all frequency bands.
FIG. 3 is a diagram showing the real part of the coherence function (RealCoh) obtained experimentally in all frequency bands.
FIG. 4 is a diagram showing Roman's Algorithm as a method of calculating the noise ratio.
FIG. 5 is a diagram showing the results of an experiment comparing the S/N ratio of the speech enhancement processing system of the present invention with that of systems using other conventional methods.
FIG. 6 is a diagram showing the results of an experiment comparing the degree of distortion of the speech enhancement processing system of the present invention with that of systems using other conventional methods.

Explanation of symbols

11 Signal sound pickup means
12 Signal conversion means
13 Sound source identification means
14 Noise source identification means
15 Sound source extraction means

Claims

A speech enhancement processing system for extracting a target sound source signal located in the front direction from an observed mixed signal, comprising: signal sound pickup means for inputting acoustic signals generated by a plurality of sound sources through left and right receiving units; signal conversion means for dividing the two (left and right) input signals into frequency bands to obtain the frequency band components of the two input signals; sound source identification means for identifying a target sound source based on a coherence component obtained from the frequency band components and an interaural level difference (ILD) obtained from the frequency band components; noise source identification means for identifying a noise source from the frequency band components; and sound source extraction means for extracting the target sound source for each frequency band based on the target sound source obtained by the sound source identification means and the noise source obtained by the noise source identification means.

The speech enhancement processing system according to claim 1, wherein the sound source identification means includes: means for obtaining the real part of a coherence component (RealCoh) from the frequency band components using a coherence function; means for obtaining the interaural sound pressure level difference (ILD) from the frequency band components; and means for deriving the presence or absence of the target sound source for each frequency band (SAP_CohILD) using a preset conditional expression based on the RealCoh value and the ILD value for each frequency band.

The speech enhancement processing system according to claim 1, wherein the noise source identification means includes means for calculating a noise ratio for each frequency band from the frequency band components and deriving the presence or absence of a noise source for each frequency band (SAP_NOR) by comparing that value with a preset index value.

The speech enhancement processing system according to claim 1, wherein the sound source extraction means includes means for extracting the target sound source for each frequency band by deriving the presence or absence of the target sound source for each frequency band (SAP_SNR) based on the SAP_CohILD value for each frequency band obtained by the sound source identification means and the SAP_NOR value obtained by the noise source identification means.

The speech enhancement processing system according to claim 1, wherein the sound source extraction means includes means for extracting the target sound source for each frequency band by calculating the S/N ratio for each frequency band and comparing that value with a preset index value to derive the presence or absence of the target sound source for each frequency band (SAP_SNR).