JP2010171585A

JP2010171585A - Method, device and program for separating sound source

Info

Publication number: JP2010171585A
Application number: JP2009010843A
Authority: JP
Inventors: Toshiharu Horiuchi; 俊治堀内
Original assignee: KDDI Corp
Current assignee: KDDI Corp
Priority date: 2009-01-21
Filing date: 2009-01-21
Publication date: 2010-08-05
Anticipated expiration: 2029-01-21
Also published as: JP5113096B2

Abstract

PROBLEM TO BE SOLVED: To separate sound sources satisfactorily even when sparseness does not hold for the received sound signals output from the main and sub microphones and there is no amplitude difference between their time-frequency components.

SOLUTION: Time-frequency analysis units 11 and 12 analyze the received sound signals x1(t) and x2(t) output from the main and sub microphones and convert them into time-frequency components X1(f, t) and X2(f, t), respectively. A gain applying unit 13 applies a gain Gf to X2(f, t). A level difference comparison unit 14 compares the amplitudes of X1(f, t) and Gf·X2(f, t) for each time-frequency component and generates a mask pattern m1(f, t). A masking processing unit 15 masks the time-frequency component X1(f, t) using the mask pattern m1(f, t). A time-frequency synthesis unit 16 synthesizes the time-frequency components Y1(f, t) output from the masking processing unit 15.

COPYRIGHT: (C)2010, JPO&INPIT

Description

The present invention relates to a sound source separation method, apparatus, and program, and more particularly to a sound source separation method, apparatus, and program that separate and extract a target sound and an interference sound based on the received sound signals output from two microphones.

In noisy environments such as streets, car interiors, and station platforms, other voices and ambient noise (interference sounds) can mix into the desired voice (the target sound) even when a microphone placed close to the mouth, as in a handset or headset, is used. To solve this problem, various interference sound suppression methods and sound source separation methods have been proposed. These methods fall broadly into those that use a single microphone and those that use a plurality of microphones. Methods that use a plurality of microphones can achieve higher interference suppression performance than methods that use a single microphone.

In methods that use a plurality of microphones, the microphones are arranged spatially so that the received sound signal output from each microphone reflects a time difference and an amplitude difference that depend on the spatial relationship between that microphone and the sound source. Statistical information on the time differences and amplitude differences of the received sound signals can then be used to pick up only the target sound selectively, or to separate the target sound from the interference sound.

As one method that uses a plurality of microphones, a technique called time-frequency masking, which exploits the sparseness of speech signals, has also been proposed. Sparseness of a speech signal is the property that the energy of the signal is concentrated in a portion of the time-frequency domain and is almost zero elsewhere.

In methods based on time-frequency masking, the directions of the target sound and the interference sound may be unknown. To extract the target sound, the amplitude difference, the time difference, or both are calculated for each time-frequency component of the received sound signals output from the plurality of microphones. Each time-frequency component is then classified based on those differences, and the target sound and the interference sound are separated. To calculate the amplitude difference and time difference of each time-frequency component, frequency analysis is performed for each segment of a predetermined time length.

Among the methods based on time-frequency masking, those that use the amplitude difference of each time-frequency component of the received sound signals output from a plurality of microphones simulate on a computer the auditory masking phenomenon, in which a stronger signal masks a weaker one. When two microphones are used, the mask pattern that masks the interference sound superimposed on the target sound is generated by comparing the amplitudes of each time-frequency component of the received sound signals output from the two microphones, and it is used to selectively extract the time-frequency components of the high-amplitude received sound signal of the sound source close to the main microphone. This processing is performed in the time-frequency domain: frequency components in which the received sound signal output from the main microphone is dominant are output as they are, and frequency components in which the received sound signal output from the other (sub) microphone is dominant are masked. The mask processing for the received sound signal of the sound source close to the main microphone is defined by the following equation (1).

This mask processing assumes that sparseness holds for the received sound signals output from the main and sub microphones and that there is an amplitude difference between their time-frequency components. This is described in Non-Patent Documents 1-4.
R. F. Lyon: "A computational model of binaural localization and separation," Proc. ICASSP, 1983.
M. Bodden: "Modeling human sound-source localization and the cocktail-party-effect," Acta Acustica, Vol. 1, pp. 43-55, 1993.
O. Yilmaz and S. Rickard: "Blind Separation of Speech Mixtures via Time-Frequency Masking," IEEE Transactions on Signal Processing, Vol. 52, No. 7, pp. 1830-1847, 2004.
S. Rickard and O. Yilmaz: "On the Approximate W-disjoint Orthogonality of Speech," Proc. ICASSP, Vol. I, pp. 529-532, 2002.

In general, however, although sparseness holds for a received sound signal whose source is a person, it does not hold for the received sound signal of an interference sound such as ambient noise. Furthermore, in the received sound signals output from the two microphones, there is often no amplitude difference between the interference sound components even when there is an amplitude difference between the target sound components. As a result, conventional interference sound suppression methods and sound source separation methods cannot achieve sufficient suppression or separation performance.

An object of the present invention is to solve the above problem and to provide a sound source separation apparatus, method, and program whose interference suppression and sound source separation performance does not deteriorate even when sparseness does not hold for the received sound signals output from the two microphones and there is no amplitude difference between their time-frequency components.

To achieve the above object, the sound source separation apparatus of the present invention separates and outputs at least one of a target sound component and an interference sound component from the received sound signals output from main and sub microphones, and comprises: conversion means, provided in the signal paths of the main and sub microphones, for converting the received sound signals output from the main and sub microphones into time-frequency components; gain applying means, provided in at least one of the signal paths of the main and sub microphones, for applying a gain to the received sound signal before conversion into time-frequency components or to the time-frequency components after conversion; level difference comparison means for comparing, for each time-frequency component, the amplitudes of the time-frequency components that have been given the gain by the gain applying means and converted by the conversion means, and for generating a mask pattern; masking processing means for masking at least one of those time-frequency components using the mask pattern generated by the level difference comparison means; and time-frequency synthesis means for synthesizing the time-frequency components output from the masking processing means.

In the sound source separation apparatus of the present invention, the gain applied by the gain applying means is set to a constant value or a frequency-dependent value such that an amplitude difference arises between the time-frequency components of the received sound signals that the main and sub microphones output for the interference sound, and such that the magnitude relationship between the amplitudes of the time-frequency components of the received sound signals that the main and sub microphones output for the target sound is not reversed.

Furthermore, in the sound source separation apparatus of the present invention, the gain applying means applies a frequency-dependent gain to the time-frequency components in the signal path of the main microphone, and the apparatus further comprises gain removing means that removes the gain by applying the inverse of the gain application to the time-frequency components of the main microphone's signal path, either as they are input to the masking processing means or as they are output from it.

The present invention is characterized not only as a sound source separation apparatus but also as a sound source separation method specified by the procedure for processing the received sound signals, and as a program that causes a computer to implement the sound source separation and interference sound suppression functions.

In the present invention, gain applying means is provided in at least one of the signal paths of the main and sub microphones, and a gain is applied to the received sound signal or the time-frequency components passing through that path. This produces an amplitude difference between the time-frequency components of the received sound signals output from the main and sub microphones before the mask pattern is generated. Consequently, even when sparseness does not hold for the received sound signals and there is no amplitude difference between their time-frequency components, the sound source separation and interference suppression performance does not deteriorate, and the sound sources can be separated satisfactorily.

The present invention is described below with reference to the drawings. FIG. 1 is a block diagram showing an embodiment of the sound source separation apparatus according to the present invention. The present invention can be realized not only as a sound source separation apparatus but also as a sound source separation method specified by the procedure for processing the received sound signals, and as a program that causes a computer to implement the sound source separation and interference sound suppression functions. Each unit of the sound source separation apparatus can be implemented in hardware or in software.

The sound source separation apparatus of FIG. 1 includes time-frequency analysis units 11 and 12, a gain applying unit 13, a level difference comparison unit 14, a masking processing unit 15, and a time-frequency synthesis unit 16. In this embodiment, the time-frequency analysis unit 11, the masking processing unit 15, and the time-frequency synthesis unit 16 constitute the signal path of the main microphone, and the time-frequency analysis unit 12 and the gain applying unit 13 constitute the signal path of the sub microphone.

The time-frequency analysis units 11 and 12 analyze the received sound signals output from the main and sub microphones, respectively, in the time-frequency domain and output the time-frequency components. The gain applying unit 13 applies a gain to each time-frequency component of the received sound signal input to it.

The level difference comparison unit 14 compares, component by component, the amplitudes (levels, i.e. absolute values) of the time-frequency components output from the time-frequency analysis unit 11 and from the gain applying unit 13, and generates a mask pattern based on the comparison result.

The masking processing unit 15 masks the time-frequency components output from the time-frequency analysis unit 11 according to the mask pattern generated by the level difference comparison unit 14. The time-frequency synthesis unit 16 synthesizes the time-frequency components output from the masking processing unit 15.

Next, the operation of the sound source separation apparatus of FIG. 1 is described.

The received sound signal x1(t) output from the main microphone is input to the time-frequency analysis unit 11. In the case of a mobile terminal (for example, a mobile phone), the target sound is the speaker's voice during a call. The main microphone is placed, for example, on the front of the mobile terminal so that it receives the target sound at a high level. The main microphone also receives interference sounds such as ambient noise, although at a lower level than the target sound. The received sound signal x1(t) is therefore the conversion of a high-level target sound and a low-level interference sound. The time-frequency analysis unit 11 converts the received sound signal x1(t) into time-frequency components X1(f,t).

The received sound signal x2(t) output from the sub microphone is input to the time-frequency analysis unit 12. The sub microphone is placed, for example, on the back of the mobile terminal so that it receives the interference sound. The sub microphone receives both the target sound and the interference sound. The target sound it receives is at a considerably lower level than the target sound received by the main microphone, whereas the interference sound it receives is at the same level as the interference sound received by the main microphone. The received sound signal x2(t) is therefore the conversion of the interference sound and a low-level target sound. The time-frequency analysis unit 12 converts the received sound signal x2(t) into time-frequency components X2(f,t).

The gain applying unit 13 applies to the time-frequency components X2(f,t) a gain Gf calculated in advance from the spatial relationship between the main and sub microphones, the properties of the interference sound, and so on, and outputs the resulting time-frequency components Gf·X2(f,t). The gain Gf is set, for example, for each frequency component, taking into account the amplitude difference between the received sound signals that the main and sub microphones output for the target sound, and also the general property that the received signal of an interference sound is at a high level in the low-frequency region and at a low level in the high-frequency region. The gain Gf is a frequency-dependent value greater than 1.

The amplitude difference between the received sound signals that the main and sub microphones output for the target sound can be obtained by measuring the impulse responses of the main and sub microphones in advance. This amplitude difference depends on the distance between the target sound source and each microphone, the mounting positions of the microphones in the housing of the mobile terminal, and so on. The properties of the received signal of the interference sound can be obtained by measuring the received sound signals of various ambient sound sources and calculating, from their frequency characteristics, the average amplitude of the ambient received signal at each frequency.

It suffices that the gain Gf applied by the gain applying unit 13 produces an amplitude difference between the time-frequency components of the received sound signals that the main and sub microphones output for the interference sound, and does not reverse the magnitude relationship between the amplitudes of the time-frequency components of the received sound signals that the main and sub microphones output for the target sound.
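The two conditions on Gf can be read as a per-frequency interval. The sketch below assumes an idealized case in which each frequency bin is characterized by two measured level ratios (main level divided by sub level): one for the target sound and one for the interference sound. Under that assumption, a sub-path gain satisfies both conditions when it lies strictly between the interference ratio and the target ratio. The function name and the sample ratios are hypothetical and are not taken from the patent.

```python
def gain_is_admissible(gain, target_ratio, noise_ratio):
    """Check a sub-path gain table against the two conditions on Gf.

    gain[f]         : candidate gain applied to X2 at frequency bin f
    target_ratio[f] : |X1| / |X2| measured for the target sound alone
    noise_ratio[f]  : |X1| / |X2| measured for the interference alone
                      (close to 1 when both microphones see the same level)

    Interference is masked when gain * |X2| > |X1|, i.e. gain > noise_ratio.
    The target is kept when |X1| > gain * |X2|, i.e. gain < target_ratio.
    """
    return all(n < g < t for g, t, n in zip(gain, target_ratio, noise_ratio))


# Illustrative values only: the target is 5x and 3x stronger at the main
# microphone in two bins, and the interference is equally strong at both.
target_ratio = [5.0, 3.0]
noise_ratio = [1.0, 1.0]

print(gain_is_admissible([2.0, 1.5], target_ratio, noise_ratio))  # inside both intervals
print(gain_is_admissible([6.0, 1.5], target_ratio, noise_ratio))  # would reverse the target relation
print(gain_is_admissible([1.0, 1.5], target_ratio, noise_ratio))  # no difference created for interference
```

The interval also suggests why a frequency-dependent gain can be useful: the admissible range differs from bin to bin whenever the measured ratios do.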

The level difference comparison unit 14 compares the level |X1(f,t)| of the time-frequency component X1(f,t) output from the time-frequency analysis unit 11 with the level |Gf·X2(f,t)| of the time-frequency component Gf·X2(f,t) output from the gain applying unit 13, and generates a mask pattern m1(f,t) using the following equation (2). Equation (2) yields a mask pattern m1(f,t) that masks every component of X1(f,t), the time-frequency representation of the main microphone's received sound signal x1(t), except those in which x1(t) is dominant. The mask pattern m1(f,t) generated by the level difference comparison unit 14 is output to the masking processing unit 15.
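Equation (2) itself is not reproduced in this text. The following sketch therefore assumes a binary mask, which is what the description of the comparison implies: a component is kept (1) when the main-path level exceeds the gained sub-path level, and masked (0) otherwise. Treating ties as masked is also an assumption, and the function name and sample values are hypothetical.

```python
def level_difference_mask(X1, X2, gain):
    """Binary mask m1[t][f]: 1 where |X1| > |gain[f] * X2|, else 0.

    X1, X2 : lists of frames, each frame a list of complex bins
    gain   : one gain value per frequency bin (Gf in the text)
    """
    return [
        [1 if abs(a) > abs(g * b) else 0 for a, b, g in zip(frame1, frame2, gain)]
        for frame1, frame2 in zip(X1, X2)
    ]


# One frame, three bins (illustrative values).
X1 = [[4 + 0j, 1 + 0j, 0.5j]]
X2 = [[1 + 0j, 1 + 0j, 0.25j]]
Gf = [2.0, 2.0, 1.5]

m1 = level_difference_mask(X1, X2, Gf)
print(m1)
```

In the second bin both microphones report the same level, so without the gain the comparison would be a tie; with Gf = 2 the bin is masked.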

The masking processing unit 15 masks the time-frequency components X1(f,t) input from the time-frequency analysis unit 11 with the mask pattern m1(f,t). Consequently, of the time-frequency components X1(f,t) of the main microphone's received sound signal x1(t), the masking processing unit 15 outputs only the components Y1(f,t) in which x1(t) is dominant.

The time-frequency synthesis unit 16 synthesizes only the components Y1(f,t), that is, those time-frequency components X1(f,t) of the main microphone's received sound signal x1(t) in which x1(t) is dominant, and outputs the output signal y1(t).
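The whole flow of FIG. 1 can be sketched end to end. This is a minimal sketch under several assumptions that the text does not specify: the time-frequency analysis is a plain DFT over non-overlapping rectangular frames (a practical system would more likely use an overlapping, windowed transform), the mask is binary, and the test signals are pure tones chosen so that the target and the interference occupy different bins. All function names and signal values are hypothetical.

```python
import cmath
import math


def dft(frame):
    """Naive discrete Fourier transform of one frame."""
    n = len(frame)
    return [sum(frame[k] * cmath.exp(-2j * cmath.pi * f * k / n) for k in range(n))
            for f in range(n)]


def idft(spec):
    """Naive inverse DFT, returning the real part of each sample."""
    n = len(spec)
    return [(sum(spec[f] * cmath.exp(2j * cmath.pi * f * k / n) for f in range(n)) / n).real
            for k in range(n)]


def separate(x1, x2, gain, frame_len):
    """Frame-by-frame version of the FIG. 1 signal flow."""
    y1 = []
    for start in range(0, len(x1) - frame_len + 1, frame_len):
        X1 = dft(x1[start:start + frame_len])            # time-frequency analysis unit 11
        X2 = dft(x2[start:start + frame_len])            # time-frequency analysis unit 12
        GX2 = [g * v for g, v in zip(gain, X2)]          # gain applying unit 13
        m1 = [1 if abs(a) > abs(b) else 0
              for a, b in zip(X1, GX2)]                  # level difference comparison unit 14
        Y1 = [m * a for m, a in zip(m1, X1)]             # masking processing unit 15
        y1.extend(idft(Y1))                              # time-frequency synthesis unit 16
    return y1


frame_len = 16
n = 2 * frame_len
# Target tone in bin 2; interference tone in bin 5 at the same level on both microphones.
target = [math.cos(2 * math.pi * 2 * k / frame_len) for k in range(n)]
noise = [0.5 * math.cos(2 * math.pi * 5 * k / frame_len) for k in range(n)]
x1 = [s + v for s, v in zip(target, noise)]        # main microphone: strong target
x2 = [0.2 * s + v for s, v in zip(target, noise)]  # sub microphone: weak target

y1 = separate(x1, x2, [2.0] * frame_len, frame_len)
print(max(abs(a - b) for a, b in zip(y1, target)))  # residual error against the clean target
```

With a constant gain of 2, the interference bins are masked although both microphones receive them at the same level, while the target bins pass because the main level remains above the gained sub level.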

FIG. 2 is an explanatory diagram of the gain applying operation in the gain applying unit 13. Parts (a) and (b) of the figure show, at a given time, the amplitude (level) of each frequency component of the received sound signals output from the main and sub microphones, respectively. The white portions are frequency components of the received signal of the target sound, and the black portions are frequency components of the received signal of the interference sound.

For example, the frequency components near f1 consist only of the received signal of the target sound, and there is a fairly large amplitude difference between the received sound signals output from the main and sub microphones; this difference can be used to separate the target sound. The frequency components near f2, however, consist only of the received signal of the interference sound, and the amplitudes of the received sound signals output from the main and sub microphones are almost the same. Because the magnitude relationship between these amplitudes changes with circumstances, the frequency components near f2 are sometimes separated as target sound and sometimes as interference sound.

Therefore, as shown in FIG. 2(c), the gain Gf is applied to each frequency component X2(f,t) of the received sound signal output from the sub microphone, so that an amplitude difference arises between the frequency components of the received sound signals output from the main and sub microphones even when both output only the received signal of the interference sound, and that interference is not separated as target sound. Here, the gain Gf is lowered in the high-frequency region so that the target sound in that region is separated more easily.
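A small numerical illustration of the situation in FIG. 2 may help. The levels below are invented for illustration and are not read from the figure: the bin near f1 contains only target sound with a large level difference, and the bin near f2 contains only interference whose levels at the two microphones are nearly equal and fluctuate slightly.

```python
def keep_bin(main_level, sub_level, gain=1.0):
    """Return True when the bin is treated as target (main level dominates)."""
    return main_level > gain * sub_level


# Near f1: target only, main level much higher than sub level.
print(keep_bin(1.0, 0.2), keep_bin(1.0, 0.2, gain=2.0))

# Near f2: interference only, nearly equal levels with small fluctuations.
for main, sub in [(0.50, 0.49), (0.49, 0.50)]:
    without_gain = keep_bin(main, sub)          # flips with the fluctuation
    with_gain = keep_bin(main, sub, gain=2.0)   # consistently masked
    print(without_gain, with_gain)
```

Without the gain, the decision for the interference bin depends on which microphone happens to read marginally higher; with a gain of 2 on the sub path the bin is masked in both cases, while the target bin is still kept.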

It suffices that the gain Gf produces an amplitude difference between the frequency components of the received sound signals that the main and sub microphones output for the interference sound, and does not cancel the amplitude difference between the frequency components of the received sound signals that the main and sub microphones output for the target sound; in other words, it must not reverse the magnitude relationship between the two.

However, the gain Gf can also be set so that the target sound in a specific frequency component, or a target sound on which interference sound is superimposed, is not separated. For example, in FIG. 2(b), if the gain Gf for the frequency components near f3 is made extremely large, nothing in those components, including the target sound, is separated. The value of the gain Gf may be made adjustable or selectable.

FIG. 3 is a block diagram showing another embodiment of the sound source separation apparatus according to the present invention; parts identical or equivalent to those in FIG. 1 are given the same reference numerals. In this embodiment, the time-frequency analysis unit 12 and the gain applying unit 13 are arranged in the reverse order to FIG. 1. As before, the time-frequency analysis unit 11, the masking processing unit 15, and the time-frequency synthesis unit 16 constitute the signal path of the main microphone, and the time-frequency analysis unit 12 and the gain applying unit 13 constitute the signal path of the sub microphone.

The time-frequency analysis unit 11 receives the received sound signal x1(t) output from the main microphone and converts it into time-frequency components X1(f,t).

The gain applying unit 13 applies to the received sound signal x2(t) output from the sub microphone a gain G calculated in advance from the spatial relationship between the main and sub microphones, the properties of the interference sound, and so on, and outputs the resulting received sound signal G·x2(t). The gain G is a constant value greater than 1.

The received sound signal x2(t) output from the sub microphone is input to the time-frequency analysis unit 12 via the gain applying unit 13. The time-frequency analysis unit 12 therefore converts the gained received sound signal G·x2(t) into time-frequency components G·X2(f,t).
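That a constant gain applied before the analysis appears unchanged as a factor on every time-frequency component follows from the linearity of the transform. The check below assumes a plain DFT as the analysis (the text does not name a specific transform); the sample values and the helper name are hypothetical.

```python
import cmath


def dft(frame):
    """Naive discrete Fourier transform of one frame."""
    n = len(frame)
    return [sum(frame[k] * cmath.exp(-2j * cmath.pi * f * k / n) for k in range(n))
            for f in range(n)]


G = 2.0
x2 = [0.0, 1.0, 0.5, -0.25, -1.0, 0.75, 0.25, -0.5]

gain_then_transform = dft([G * v for v in x2])   # FIG. 3 ordering: gain, then analysis
transform_then_gain = [G * v for v in dft(x2)]   # FIG. 1 ordering with a constant gain

max_err = max(abs(a - b) for a, b in zip(gain_then_transform, transform_then_gain))
print(max_err)  # difference is at floating-point rounding level
```

This equivalence holds for a constant G only; a frequency-dependent Gf cannot be applied as a single time-domain multiplication, which is consistent with the constant gain used in this embodiment.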

The level difference comparison unit 14 compares the level |X1(f,t)| of the time-frequency component X1(f,t) output from the time-frequency analysis unit 11 with the level |G·X2(f,t)| of the time-frequency component G·X2(f,t) output from the time-frequency analysis unit 12, and generates a mask pattern m1(f,t) using the following equation (3). Equation (3) yields a mask pattern m1(f,t) that masks every component of X1(f,t), the time-frequency representation of the main microphone's received sound signal x1(t), except those in which x1(t) is dominant. The mask pattern m1(f,t) generated by the level difference comparison unit 14 is output to the masking processing unit 15.

FIG. 4 is a block diagram showing still another embodiment of the sound source separation apparatus according to the present invention; parts identical or equivalent to those in FIG. 1 are given the same reference numerals.

In this embodiment, the time-frequency analysis unit 11, the gain applying unit 13, a gain removing unit 17, the masking processing unit 15, and the time-frequency synthesis unit 16 constitute the signal path of the main microphone, and the time-frequency analysis unit 12 constitutes the signal path of the sub microphone.

The received sound signal x1(t) output from the main microphone is input to the time-frequency analysis unit 11, which converts it into time-frequency components X1(f,t).

The gain applying unit 13 applies to the time-frequency components X1(f,t) a gain Gf calculated in advance from the spatial relationship between the main and sub microphones, the properties of the interference sound, and so on, and outputs the resulting time-frequency components Gf·X1(f,t). The gain Gf is a frequency-dependent value smaller than 1.

The received sound signal x2(t) output from the sub microphone is input to the time-frequency analysis unit 12, which converts it into time-frequency components X2(f,t).

The level difference comparison unit 14 compares the level |Gf·X1(f,t)| of the time-frequency component Gf·X1(f,t) output from the gain applying unit 13 with the level |X2(f,t)| of the time-frequency component X2(f,t) output from the time-frequency analysis unit 12, and generates a mask pattern m1(f,t) using equation (4) below. Equation (4) yields a mask pattern m1(f,t) that masks every time-frequency component X1(f,t) of the received sound signal x1(t) output from the main microphone, other than the components in which x1(t) is dominant. The mask pattern m1(f,t) generated by the level difference comparison unit 14 is output to the masking processing unit 15.
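
The comparison in this embodiment differs from the earlier ones in that the gain sits on the main-microphone path and varies with frequency. The Python sketch below shows one way this could look, assuming a hard threshold (1 where |Gf·X1(f,t)| exceeds |X2(f,t)|, 0 otherwise). The function name `level_mask_main_gain`, the strict inequality, the per-bin gain values, and the sample data are hypothetical.

```python
def level_mask_main_gain(X1, X2, Gf):
    """Return m1[f][t] = 1 where |Gf[f] * X1[f][t]| > |X2[f][t]|, else 0.

    X1, X2: lists indexed [f][t] of time-frequency components.
    Gf:     list with one gain per frequency bin, each below 1
            (values here are illustrative assumptions).
    """
    return [
        [1 if abs(g * a) > abs(b) else 0 for a, b in zip(row1, row2)]
        for g, row1, row2 in zip(Gf, X1, X2)
    ]


# Hypothetical data: 2 frequency bins x 2 time frames.
Gf = [0.8, 0.5]
X1 = [[1.0, 0.4], [1.0, 0.4]]   # main microphone
X2 = [[0.5, 0.4], [0.6, 0.1]]   # sub microphone
m1 = level_mask_main_gain(X1, X2, Gf)
print(m1)
```

Because the gain is indexed by frequency bin, the same pair of levels (1.0 on the main side) passes in the first bin and is masked in the second, where the gain is smaller and the sub-microphone level is higher.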

The gain removal unit 17 applies the inverse of the processing of the gain applying unit 13 to the time-frequency component Gf·X1(f,t) and outputs the time-frequency component X1(f,t) to the masking processing unit 15. The gain removal unit 17 is provided to eliminate the distortion of the output signal y1(t) caused by the gain applied in the gain applying unit 13, but it may be omitted if that distortion is tolerable. The gain removal unit 17 may also be placed on the output side of the masking processing unit 15.
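
The statement that the gain removal unit may sit on either side of the masking processing unit can be checked with a small Python sketch. It assumes that gain application is a per-bin multiplication, that gain removal is the corresponding per-bin division, and that masking multiplies each component by 0 or 1. The helper names `apply_gain`, `remove_gain`, and `apply_mask`, and all numeric values, are hypothetical.

```python
def apply_gain(X, Gf):
    """Multiply every component in bin f by Gf[f]."""
    return [[g * x for x in row] for g, row in zip(Gf, X)]


def remove_gain(X, Gf):
    """Inverse of apply_gain: divide every component in bin f by Gf[f]."""
    return [[x / g for x in row] for g, row in zip(Gf, X)]


def apply_mask(X, m):
    """Multiply each component by its mask value (0 or 1)."""
    return [[mk * x for mk, x in zip(mrow, xrow)] for mrow, xrow in zip(m, X)]


# Gains chosen as powers of two so the round trip is exact in floating point.
Gf = [0.5, 0.25]
X1 = [[1 + 2j, 3 - 1j], [0.5j, -2 + 0j]]
m1 = [[1, 0], [0, 1]]

scaled = apply_gain(X1, Gf)
removed_before_mask = apply_mask(remove_gain(scaled, Gf), m1)
removed_after_mask = remove_gain(apply_mask(scaled, m1), Gf)
print(removed_before_mask == removed_after_mask)
```

Both orderings give the same masked components here because the mask and the gain are element-wise multiplications, which commute.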

The masking processing unit 15 masks the time-frequency component X1(f,t) input from the gain removal unit 17 with the mask pattern m1(f,t). Consequently, of the time-frequency components X1(f,t) of the received sound signal x1(t) output from the main microphone, the masking processing unit 15 outputs only the components Y1(f,t) in which x1(t) is dominant.

Embodiments have been described above, but the present invention is not limited to them and can be modified in various ways. For example, the gain applying unit is provided in the signal path of the sub microphone in the embodiments of FIG. 1 and FIG. 3 and in the signal path of the main microphone in the embodiment of FIG. 4, but gain applying units may instead be provided in the signal paths of both the main and sub microphones, with their gains adjusted accordingly. When a gain Gf (a frequency-dependent value) is applied in the signal path of the main microphone, however, the received sound signal output from the main microphone is altered by the gain Gf, so it is preferable to provide a gain removal unit as in the embodiment of FIG. 4.

The embodiments above separate and output the received signal of the target sound, but the received signal of the interference sound may be separated and output in addition to it, or only the received signal of the interference sound may be separated and output. The received signal of the interference sound can be used, for example, to measure ambient noise or to extract speech arriving from behind a mobile terminal.

FIG. 5 is a block diagram showing a modification in which the received signals of the target sound and the interference sound are each separated and output. In the figure, the time-frequency analysis units 11 and 12, the gain applying unit 13, the masking processing unit 15, and the time-frequency synthesis unit 16 are the same as in FIG. 1, but the level difference comparison unit 14 generates, in addition to the mask pattern m1(f,t), its inverted mask pattern m2(f,t) using equation (5) below.

The masking processing unit 18 masks the time-frequency component Gf·X2(f,t) input from the gain applying unit 13 with the mask pattern m2(f,t). Consequently, the masking processing unit 18 outputs the components of Gf·X2(f,t) that remain after removing those in which the received sound signal x1(t) output from the main microphone is dominant; that is, it outputs only the time-frequency components of the interference sound.

The time-frequency synthesis unit 19 synthesizes the components Gf·Y2(f,t), namely the components of Gf·X2(f,t) that remain after removing those in which the received sound signal x1(t) output from the main microphone is dominant (that is, only the time-frequency components of the interference sound), and outputs the output signal Gf·y2(t). Here too, providing a gain removal circuit allows the output signal y2(t) to be output.
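
The two-output modification can be sketched in Python as follows. The sketch assumes that the inverted mask is the binary complement of m1(f,t), so that every time-frequency component is assigned to exactly one of the two outputs; the names `invert_mask` and `apply_mask` and the sample values are hypothetical, and the time-frequency synthesis stage is omitted.

```python
def invert_mask(m1):
    """Binary complement of a 0/1 mask (assumed form of the inverted mask)."""
    return [[1 - v for v in row] for row in m1]


def apply_mask(X, m):
    """Multiply each component by its mask value (0 or 1)."""
    return [[mk * x for mk, x in zip(mrow, xrow)] for mrow, xrow in zip(m, X)]


# Hypothetical data: 2 frequency bins x 3 time frames.
m1 = [[1, 0, 0], [0, 1, 1]]
m2 = invert_mask(m1)

X1 = [[1.0, 0.4, 0.5], [0.1, 2.0, 0.3]]        # main path, X1(f,t)
GX2 = [[0.75, 0.6, 0.75], [0.6, 1.5, 0.15]]    # sub path after gain, Gf*X2(f,t)

Y1 = apply_mask(X1, m1)     # target-dominant components
GY2 = apply_mask(GX2, m2)   # interference components, still carrying the gain
print(Y1, GY2)
```

Under this assumption the two masks never both pass the same component, so the target output and the interference output are disjoint in the time-frequency plane.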

FIG. 1 is a block diagram showing an embodiment of the sound source separation device according to the present invention.
FIG. 2 is an explanatory diagram of the gain applying operation in the gain applying unit.
FIG. 3 is a block diagram showing another embodiment of the sound source separation device according to the present invention.
FIG. 4 is a block diagram showing yet another embodiment of the sound source separation device according to the present invention.
FIG. 5 is a block diagram showing a modification of the sound source separation device according to the present invention.

11, 12 ... time-frequency analysis units, 13 ... gain applying unit, 14 ... level difference comparison unit, 15, 18 ... masking processing units, 16, 19 ... time-frequency synthesis units, 17 ... gain removal unit

Claims

A sound source separation method for separating and outputting at least one of a target sound component and an interference sound component from received sound signals output from main and sub microphones, the method comprising:
a first step of converting, in the signal paths of the main and sub microphones, the received sound signals output from the main and sub microphones into time-frequency components, respectively;
a second step of applying a gain, in at least one of the signal paths of the main and sub microphones, to the received sound signal before conversion into time-frequency components or to the time-frequency components after conversion;
a third step of generating a mask pattern by comparing, for each time-frequency component, the amplitudes of the time-frequency components that have been given the gain in the second step and converted in the first step;
a fourth step of masking, using the mask pattern generated in the third step, at least one of the time-frequency components that have been given the gain in the second step and converted in the first step; and
a fifth step of synthesizing the time-frequency components output in the fourth step.

A sound source separation device that separates and outputs at least one of a target sound component and an interference sound component from received sound signals output from main and sub microphones, the device comprising:
conversion means provided in the signal paths of the main and sub microphones for converting the received sound signals output from the main and sub microphones into time-frequency components, respectively;
gain applying means provided in at least one of the signal paths of the main and sub microphones for applying a gain to the received sound signal before conversion into time-frequency components or to the time-frequency components after conversion;
level difference comparing means for generating a mask pattern by comparing, for each time-frequency component, the amplitudes of the time-frequency components that have been given the gain by the gain applying means and converted by the conversion means;
masking processing means for masking, using the mask pattern generated by the level difference comparing means, at least one of the time-frequency components that have been given the gain by the gain applying means and converted by the conversion means; and
time-frequency synthesis means for synthesizing the time-frequency components output from the masking processing means.

The sound source separation device according to claim 2, wherein the gain applied by the gain applying means is set to a constant value or a frequency-dependent value such that, for the interference sound, an amplitude difference arises between the time-frequency components of the received sound signals output from the main and sub microphones, and such that, for the target sound, the magnitude relationship between the amplitudes of the time-frequency components of the received sound signals output from the main and sub microphones is not reversed.

The sound source separation device according to claim 3, wherein the gain applying means applies a frequency-dependent gain to the time-frequency components in the signal path of the main microphone, the device further comprising gain removing means for removing the gain by applying processing inverse to the gain application of the gain applying means, either to the time-frequency components input to the masking processing means through the signal path of the main microphone or to the time-frequency components of the signal path of the main microphone output from the masking processing means.

A program for realizing a function of separating and outputting at least one of a target sound component and an interference sound component from received sound signals output from main and sub microphones, the program causing execution of:
a first function of converting, in the signal paths of the main and sub microphones, the received sound signals output from the main and sub microphones into time-frequency components, respectively;
a second function of applying a gain, in at least one of the signal paths of the main and sub microphones, to the received sound signal before conversion into time-frequency components or to the time-frequency components after conversion;
a third function of generating a mask pattern by comparing, for each time-frequency component, the amplitudes of the time-frequency components that have been given the gain by the second function and converted by the first function;
a fourth function of masking, using the mask pattern generated by the third function, at least one of the time-frequency components that have been given the gain by the second function and converted by the first function; and
a fifth function of synthesizing the time-frequency components output by the fourth function.