JP2016126136A

JP2016126136A - Automatic mixing device and program

Info

Publication number: JP2016126136A
Application number: JP2014266387A
Authority: JP
Inventors: Toshiharu Horiuchi (堀内 俊治)
Original assignee: KDDI Corp
Current assignee: KDDI Corp
Priority date: 2014-12-26
Filing date: 2014-12-26
Publication date: 2016-07-11
Anticipated expiration: 2034-12-26
Also published as: JP6524463B2

Abstract

PROBLEM TO BE SOLVED: To prevent the performance of interfering sound suppression or sound source separation from degrading even when the received sound signals output by a plurality of microphones are not sparse and there is no amplitude difference between the time-frequency components of these signals.

SOLUTION: This automatic mixing device comprises: time-frequency conversion units 10 to 10-n for converting sound signals obtained by a main microphone 1 and a plurality of sub-microphones 10-2 to 10-n into time-frequency components; gain assignment units 70 to 70-n for assigning gain to each sound signal obtained by the plurality of sub-microphones 10-2 to 10-n; a level difference comparison unit 90 for comparing the amplitude of the time-frequency component of the sound signal of the main microphone 1 with the amplitude of the time-frequency component of each sound signal to which gain has been assigned, and generating a mask pattern; a masking processing unit 30 for masking the time-frequency component of the sound signal obtained by the main microphone 1 using the mask pattern; and a time-frequency synthesis unit 50 for synthesizing the masked sound signals.

SELECTED DRAWING: Figure 1

Description

The present invention relates to a technique for extracting the audio signal of a target sound from audio signals obtained by a plurality of microphones.

In general, a plurality of microphones are used in meetings and group interviews. Using a plurality of microphones simultaneously reduces the howling (feedback) margin, increases ambient noise, and causes comb filtering. To solve these problems, either a mixing engineer is assigned or an automatic mixing device that takes over that work is used. An automatic mixing device generally monitors the signal paths of the microphones, selects the received sound signal output by the microphone with the highest input level, and adjusts the output level. Consequently, raising the mixing gain of a microphone to help a person with a quiet voice inevitably increases the interfering sound (ambient noise).

On the other hand, in noisy environments such as streets, vehicle interiors, or station platforms, other voices and ambient noise (interfering sounds) may be mixed into the desired speech (the target sound) even when a microphone placed close to the mouth, such as that of a handset or headset, is used. Various interfering sound suppression methods and sound source separation methods have been proposed to solve this problem. These methods can be broadly classified into those that use a single microphone and those that use a plurality of microphones. Methods that use a plurality of microphones can achieve higher interfering sound suppression performance than those that use a single microphone.

In methods that use a plurality of microphones, the microphones are arranged spatially so that the received sound signal output by each microphone reflects a time difference and an amplitude difference that depend on the spatial relationship between that microphone and the sound source. Statistical information on these time and amplitude differences can then be used to pick up only the target sound selectively, or to separate the target sound from the interfering sound.

As another method using a plurality of microphones, a technique called time-frequency masking, which exploits the sparseness of audio signals, has also been proposed. Sparseness of an audio signal means that its energy is concentrated in a part of the time-frequency domain and is almost zero elsewhere. In time-frequency masking, the directions of the target sound and the interfering sound need not be known; to extract the target sound, the amplitude difference, the time difference, or both are calculated for each time-frequency component of the received sound signals output by the microphones. Each time-frequency component is then classified on the basis of those differences, and the target sound and the interfering sound are separated. To calculate the amplitude and time differences, frequency analysis is performed for each frame of a predetermined time length.
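The frame-by-frame frequency analysis mentioned above can be illustrated with a short-time Fourier transform. The following is a minimal sketch, not the analysis used in the patent: the function name `stft`, the Hann window, the frame length, and the hop size are all assumptions chosen for illustration, and a naive DFT is used so that only the Python standard library is needed.

```python
import cmath
import math

def stft(signal, frame_len, hop):
    """Return a list of frames; each frame is a list of complex bins X(f, t)."""
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        # Hann window (an assumed choice) reduces leakage at the frame edges.
        frame = [
            signal[start + k] * 0.5 * (1.0 - math.cos(2.0 * math.pi * k / frame_len))
            for k in range(frame_len)
        ]
        # Naive O(n^2) DFT over the non-negative frequencies, kept explicit for clarity.
        bins = [
            sum(frame[k] * cmath.exp(-2j * math.pi * f * k / frame_len)
                for k in range(frame_len))
            for f in range(frame_len // 2 + 1)
        ]
        frames.append(bins)
    return frames

# A pure tone that completes 4 cycles per 32-sample frame.
tone = [math.sin(2.0 * math.pi * 4 * k / 32) for k in range(128)]
spec = stft(tone, frame_len=32, hop=16)
peak_bin = max(range(len(spec[0])), key=lambda f: abs(spec[0][f]))
```

For this tone, almost all of the energy of each frame falls in a few bins around `peak_bin`, which is the kind of concentration that the sparseness assumption relies on.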

Among methods based on time-frequency masking, those that use the amplitude difference of each time-frequency component of the received sound signals output by a plurality of microphones emulate, on a computer, the auditory masking phenomenon in which a stronger signal masks a weaker one. When two microphones are used, the mask pattern that masks the interfering sound superimposed on the target sound is generated by comparing the amplitudes of each time-frequency component of the received sound signals output by the two microphones, and is used to selectively extract the high-amplitude time-frequency components of the sound source close to the main microphone.

This processing is performed in the time-frequency domain. Frequency components in which the received sound signal output by the main microphone is dominant are output as they are, and frequency components in which the received sound signal output by the other (sub) microphone is dominant are masked. The mask processing for the received sound signal of the sound source close to the main microphone is defined by the following equation (1).

This mask processing assumes that the received sound signals output by the main and sub microphones are sparse and that there is an amplitude difference between their time-frequency components. This technique is described in Non-Patent Documents 1 to 4.
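The conventional two-microphone masking described above can be sketched as follows. Equation (1) itself is not reproduced in this text, so this is only an illustration under assumptions: the mask is taken to be binary (1 keeps a bin, 0 suppresses it), and the names `two_mic_mask` and `apply_mask` are hypothetical.

```python
def two_mic_mask(main_frame, sub_frame):
    """Per-bin binary mask: 1 where the main microphone is strictly louder, else 0."""
    return [1 if abs(x1) > abs(x2) else 0 for x1, x2 in zip(main_frame, sub_frame)]

def apply_mask(frame, mask):
    """Keep the bins selected by the mask and zero the rest."""
    return [m * x for m, x in zip(mask, frame)]

# One frame of complex time-frequency components for each microphone.
main_frame = [0.9 + 0.0j, 0.1 + 0.1j, 0.5 + 0.0j, 0.2 + 0.0j]
sub_frame = [0.3 + 0.0j, 0.6 + 0.0j, 0.5 + 0.0j, 0.05 + 0.0j]

mask = two_mic_mask(main_frame, sub_frame)
masked = apply_mask(main_frame, mask)
```

The third bin has the same amplitude at both microphones, as diffuse ambient noise would. Whether such a bin is kept or suppressed depends only on how ties are broken, which is the weakness the invention addresses.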

In Patent Document 1, an amplitude difference is created between the received sound signals of the main and sub microphones before the mask pattern is generated. In Patent Document 2, a power spectrum difference is created between the received sound signals of the main and sub microphones before the mask pattern is generated.

Patent Document 1: Japanese Patent No. 5107956
Patent Document 2: Japanese Patent No. 5113096

Non-Patent Document 1: R. F. Lyon: "A computational model of binaural localization and separation," Proc. ICASSP, 1983.
Non-Patent Document 2: M. Bodden: "Modeling human sound-source localization and the cocktail-party-effect," Acta Acustica, vol. 1, pp. 43-55, 1993.
Non-Patent Document 3: O. Yilmaz and S. Rickard: "Blind Separation of Speech Mixtures via Time-Frequency Masking," IEEE Transactions on Signal Processing, Vol. 52, No. 7, pp. 1830-1847, 2004.
Non-Patent Document 4: S. Rickard and O. Yilmaz: "On the Approximate W-disjoint Orthogonality of Speech," Proc. ICASSP, Vol. I, pp. 529-532, 2002.

However, while sparseness generally holds for received sound signals whose source is a human speaker, it does not hold for, for example, received sound signals of interfering sound (ambient noise). Moreover, among the received sound signals output by a plurality of microphones, there is often no amplitude difference between the interfering-sound signals even when there is one between the target-sound signals. Furthermore, because the sound pressure of received sound signals from human speakers varies widely, raising the mixing gain of a microphone to help a person with a quiet voice, for example, inevitably increases the interfering sound (ambient noise).

The present invention has been made in view of these circumstances. Its object is to provide an automatic mixing device and program in which the performance of interfering sound suppression and sound source separation does not degrade even when the received sound signals output by a plurality of microphones are not sparse and there is no amplitude difference between their time-frequency components.

(1) To achieve the above object, the present invention takes the following measures. That is, the automatic mixing device of the present invention extracts the audio signal of a target sound from audio signals obtained by a plurality of microphones, and comprises: a time-frequency conversion unit that converts the audio signals obtained by a main microphone, which is any one of the microphones, and by a plurality of sub microphones other than the main microphone into time-frequency components; a gain applying unit that applies a gain to each audio signal obtained by the plurality of sub microphones; a level difference comparison unit that compares the amplitude of the time-frequency component of the audio signal obtained by the main microphone with the amplitude of the time-frequency component of each audio signal to which the gain has been applied, and generates a mask pattern; a masking processing unit that masks the time-frequency component of the audio signal obtained by the main microphone using the mask pattern; and a time-frequency synthesis unit that synthesizes the time-frequency components of the masked audio signal.

In this way, the audio signals obtained by the main microphone and by the plurality of sub microphones are each converted into time-frequency components, a gain is applied to each audio signal obtained by the sub microphones, the amplitude of the time-frequency component of the main microphone's audio signal is compared with the amplitudes of the time-frequency components of the gain-applied audio signals to generate a mask pattern, and the time-frequency component of the main microphone's audio signal is masked using that mask pattern. An amplitude difference can therefore be created between the time-frequency components of the received sound signals output by the microphones before the mask pattern is generated. As a result, even when the received sound signals output by the microphones are not sparse and there is no amplitude difference between their time-frequency components, the performance of sound source separation and interfering sound suppression does not degrade, and the target sound can be obtained clearly.

(2) In the automatic mixing device of the present invention, the gain applying unit sets the gain so that an amplitude difference is created between the time-frequency components of the audio signals obtained by the sub microphones as interfering sound, and so that the magnitude relationship between the amplitude of the time-frequency component of the audio signal obtained by each sub microphone as target sound and the amplitude of the time-frequency component of the audio signal obtained by each sub microphone as interfering sound is not reversed.

Because the gain is set so that an amplitude difference is created between the interfering-sound time-frequency components obtained by the sub microphones without reversing the magnitude relationship between the target-sound and interfering-sound time-frequency components obtained by each sub microphone, the performance of sound source separation and interfering sound suppression does not degrade, and the target sound can be obtained clearly, even when the received sound signals output by the microphones are not sparse and there is no amplitude difference between their time-frequency components.

(3) In the automatic mixing device of the present invention, the level difference comparison unit takes |X_1(f, t)| as the level of the time-frequency component of the main microphone and 1/(N-1)·Σ|G_1n(f)·X_n(f, t)| as the level of the time-frequency components of the plurality of sub microphones to which the gain G_1n(f) has been applied, and generates the mask pattern m_1(f, t) given by the following equation.

With this configuration, even when the received sound signals output by the microphones are not sparse and there is no amplitude difference between their time-frequency components, the performance of sound source separation and interfering sound suppression does not degrade, and the target sound can be obtained clearly.
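The equation for the mask pattern in item (3) is not reproduced in this text. One form consistent with the two levels defined there, and with the later statement that components where the main microphone exceeds the others are treated as speech, would be the following. The binary values 1 and 0 and the summation range n = 2, ..., N are assumptions inferred from the description, not the published formula.

```latex
m_1(f,t) =
\begin{cases}
1, & |X_1(f,t)| > \dfrac{1}{N-1}\displaystyle\sum_{n=2}^{N} \bigl| G_{1n}(f)\, X_n(f,t) \bigr| \\[2ex]
0, & \text{otherwise}
\end{cases}
```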

(4) Another automatic mixing device of the present invention extracts the audio signal of a target sound from audio signals obtained by a plurality of microphones, and comprises: a time-frequency conversion unit that converts the audio signals obtained by a main microphone, which is any one of the microphones, and by a plurality of sub microphones other than the main microphone into time-frequency components; a gain applying unit that applies a gain to the audio signal obtained by the main microphone; a level difference comparison unit that compares the amplitude of the time-frequency component of the gain-applied audio signal with the amplitude of the time-frequency component of each audio signal obtained by the plurality of sub microphones, and generates a mask pattern; a masking processing unit that masks the time-frequency component of the gain-applied audio signal using the mask pattern; a gain removing unit that removes the gain from the masked audio signal; and a time-frequency synthesis unit that synthesizes the time-frequency components of the audio signal from which the gain has been removed.

In this way, the audio signals obtained by the main microphone and by the plurality of sub microphones are each converted into time-frequency components, a gain is applied to the audio signal obtained by the main microphone, the amplitude of the time-frequency component of the gain-applied audio signal is compared with the amplitudes of the time-frequency components of the audio signals obtained by the sub microphones to generate a mask pattern, the gain-applied audio signal is masked using that mask pattern, and the gain is then removed from the masked audio signal. An amplitude difference can therefore be created between the time-frequency components of the received sound signals output by the microphones before the mask pattern is generated. As a result, even when the received sound signals output by the microphones are not sparse and there is no amplitude difference between their time-frequency components, the performance of sound source separation and interfering sound suppression does not degrade, and the target sound can be obtained clearly.

(5) The program of the present invention is a program for an automatic mixing device that extracts the audio signal of a target sound from audio signals obtained by a plurality of microphones, and causes a computer to execute a series of processes comprising: converting the audio signals obtained by a main microphone, which is any one of the microphones, and by a plurality of sub microphones other than the main microphone into time-frequency components; applying a gain to each audio signal obtained by the plurality of sub microphones; comparing the amplitude of the time-frequency component of the audio signal obtained by the main microphone with the amplitude of the time-frequency component of each gain-applied audio signal, and generating a mask pattern; masking the time-frequency component of the audio signal obtained by the main microphone using the mask pattern; and synthesizing the time-frequency components of the masked audio signal.

According to the present invention, an amplitude difference can be created between the time-frequency components of the received sound signals output by the microphones before the mask pattern is generated. As a result, even when the received sound signals output by the microphones are not sparse and there is no amplitude difference between their time-frequency components, the performance of sound source separation and interfering sound suppression does not degrade, and the target sound can be obtained clearly.

FIG. 1 is a block diagram showing the schematic configuration of the automatic mixing device according to the present invention.
FIG. 2A is a diagram showing the concept of mask pattern generation.
FIG. 2B is a diagram showing the concept of mask pattern generation.
FIG. 3 is a diagram showing a modification of the present embodiment.
FIG. 4 is a diagram showing a modification of the present embodiment.

The present inventor noted that, when human voices are picked up by a plurality of microphones, sparseness holds for the human voice signals but not for the audio signals of interfering sound (ambient noise), and that there is no amplitude difference between the interfering-sound signals even when there is one between the target-sound signals. The inventor found that, by applying a gain to the interfering sound before generating the mask pattern, the performance of interfering sound suppression and sound source separation can be maintained even when the audio signals obtained by the plurality of microphones are not sparse and there is no amplitude difference between their time-frequency components, and thereby arrived at the present invention.

That is, the automatic mixing device of the present invention extracts the audio signal of a target sound from audio signals obtained by a plurality of microphones, and comprises: a time-frequency conversion unit that converts the audio signals obtained by a main microphone, which is any one of the microphones, and by a plurality of sub microphones other than the main microphone into time-frequency components; a gain applying unit that applies a gain to each audio signal obtained by the plurality of sub microphones; a level difference comparison unit that compares the amplitude of the time-frequency component of the audio signal obtained by the main microphone with the amplitude of the time-frequency component of each audio signal to which the gain has been applied, and generates a mask pattern; a masking processing unit that masks the time-frequency component of the audio signal obtained by the main microphone using the mask pattern; and a time-frequency synthesis unit that synthesizes the time-frequency components of the masked audio signal.

The present inventor has thereby made it possible to avoid degradation of sound source separation and interfering sound suppression performance, and to obtain the target sound clearly, even when the received sound signals output by the microphones are not sparse and there is no amplitude difference between their time-frequency components. Embodiments of the present invention are described in detail below with reference to the drawings.

FIG. 1 is a block diagram showing the schematic configuration of the automatic mixing device according to the present invention. The received sound signal x_1(t) picked up by microphone 1 and the received sound signals x_n(t) (n = 2, 3, ..., N) picked up by the other microphones are input to independent time-frequency analysis units 10 to 10-n and converted into time-frequency components X_1(f, t) and X_n(f, t). In the gain applying units 70 to 70-n, a per-frequency gain G_1n(f), calculated in advance from the spatial relationship between microphone 10 and the other microphones 10-2 to 10-n, the nature of the ambient noise, and so on, is applied to the time-frequency components X_n(f, t) picked up by the other microphones.

Here, the per-frequency gain G_1n(f) specifically makes use of:
(A) the amplitude difference that arises when a sound wave from a sound source close to microphone 10 is received by microphone 10 and by the other microphones 10-2 to 10-n; and
(B) the general property of ambient noise that it is high at low frequencies and low at high frequencies.

For (B), various kinds of ambient noise are measured, and the average amplitude of ambient noise at each frequency is calculated from their frequency characteristics. The gain for each frequency is calculated from the amplitude difference in (A) and the per-frequency amplitude in (B). More specifically, regarding the amplitude difference that arises when a sound wave from a sound source close to microphone 10 is received by microphone 10 and by the other microphones 10-2 to 10-n: assuming a point sound source, doubling the distance from the source attenuates the level by about 6 dB.
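The figure of about 6 dB per doubling of distance follows from the inverse-distance law for a point source, under which sound pressure falls as 1/r. A small worked example is given below; the function name and the distances are hypothetical values chosen for illustration.

```python
import math

def distance_attenuation_db(near_m, far_m):
    """Level drop in dB between distances near_m and far_m under the 1/r law."""
    return 20.0 * math.log10(far_m / near_m)

# A talker 0.5 m from the main microphone and 1.0 m or 2.0 m from a sub microphone.
doubling = distance_attenuation_db(0.5, 1.0)     # 20 * log10(2)
quadrupling = distance_attenuation_db(0.5, 2.0)  # 20 * log10(4)
```

Each doubling adds 20·log10(2), roughly 6.02 dB, so a sub microphone four times farther from the talker receives the direct sound about 12 dB lower.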

Ambient noise, on the other hand, is generally at roughly the same level at each microphone. Exploiting this, G_1n(f) applies a gain that depends on the distance from microphone 10 to microphone 10-n, which makes the ambient noise component relatively larger and, as a result, allows it to be masked in the subsequent stage. The level difference comparison unit 90 compares the level |X_1(f, t)| of the time-frequency component of microphone 10 with the level 1/(N-1)·Σ_n|G_1n(f)·X_n(f, t)| of the time-frequency components of the other microphones to which the gain G_1n(f) has been applied and, according to the following equation, generates a mask pattern m_1(f, t) that masks everything other than the dominant components among the time-frequency components picked up by microphone 10, as shown in FIGS. 2A and 2B. That is, the level difference comparison unit 90 compares each time-frequency component and judges components for which (microphone 10) > (others) to be speech components and components for which (microphone 10) ≤ (others) to be noise components. It then generates a mask pattern that masks each component accordingly.
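The comparison performed by the level difference comparison unit can be sketched as below. This is a minimal illustration under assumptions, not the patented implementation: the mask is taken to be binary, the function name `mask_pattern` is hypothetical, and the gain values are invented for the example rather than derived from (A) and (B).

```python
def mask_pattern(main_frame, sub_frames, gains):
    """Return m_1(f, t) for one frame.

    main_frame[f]    : X_1(f, t) of the main microphone
    sub_frames[n][f] : X_n(f, t) of each other microphone
    gains[n][f]      : G_1n(f) applied to that microphone (assumed values)
    """
    n_subs = len(sub_frames)  # corresponds to N - 1
    mask = []
    for f, x1 in enumerate(main_frame):
        # Mean gain-applied level of the other microphones at this bin.
        ref = sum(abs(gains[n][f] * sub_frames[n][f]) for n in range(n_subs)) / n_subs
        mask.append(1 if abs(x1) > ref else 0)
    return mask

main_frame = [1.0, 0.3, 0.3]  # bin 0: nearby talker; bins 1 and 2: ambient noise
sub_frames = [[0.25, 0.3, 0.3], [0.25, 0.3, 0.1]]

unity = [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]
boosted = [[2.0, 2.0, 2.0], [2.0, 2.0, 2.0]]

mask_without_gain = mask_pattern(main_frame, sub_frames, unity)
mask_with_gain = mask_pattern(main_frame, sub_frames, boosted)
```

With unity gain, the noise in bin 2 slips through because its level differs only slightly between microphones. Boosting the other microphones raises the reference level enough to mask it, while the talker's bin, which has a large level difference, is still kept.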

The masking processing unit 30 receives the mask pattern m_1(f, t) generated by the level difference comparison unit 90 and masks the audio signal input from the time-frequency analysis unit 10. The time-frequency synthesis unit 50 uses only the dominant components among the time-frequency components picked up by microphone 10 for synthesis, and outputs the output signal Y_1(t).
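Masking followed by synthesis can be illustrated on a single frame. The sketch below is an assumption-laden simplification: it uses a naive DFT and inverse DFT on one frame without windowing or overlap-add, the mask is written by hand rather than produced by a level comparison, and all names are hypothetical.

```python
import cmath
import math

def dft(frame):
    """Naive forward DFT of one frame."""
    n = len(frame)
    return [sum(frame[k] * cmath.exp(-2j * math.pi * f * k / n) for k in range(n))
            for f in range(n)]

def idft(bins):
    """Naive inverse DFT, returning the real part of each sample."""
    n = len(bins)
    return [(sum(bins[f] * cmath.exp(2j * math.pi * f * k / n) for f in range(n)) / n).real
            for k in range(n)]

def mask_and_synthesize(frame, mask):
    """Apply a per-bin mask in the frequency domain and return the time signal."""
    masked = [m * x for m, x in zip(mask, dft(frame))]
    return idft(masked)

n = 16
target = [math.cos(2.0 * math.pi * 2 * k / n) for k in range(n)]
noise = [0.5 * math.cos(2.0 * math.pi * 5 * k / n) for k in range(n)]
mixture = [a + b for a, b in zip(target, noise)]

# Keep bin 2 and its mirror bin n - 2 (the target tone); suppress everything else.
mask = [1 if f in (2, n - 2) else 0 for f in range(n)]
output = mask_and_synthesize(mixture, mask)
max_error = max(abs(a - b) for a, b in zip(output, target))
```

Because the two tones occupy disjoint bins, the synthesized output matches the target tone up to rounding error. Real signals overlap in frequency, so the recovered signal is only an approximation in practice.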

In this embodiment, the gain applying units 70 to 70-n are provided in the paths of the microphones 10-2 to 10-n other than microphone 10, but the same effect is obtained by providing a gain applying unit in the path of microphone 10 and lowering its gain. The same effect is also obtained by providing gain applying units in the paths of all the microphones and adjusting each gain.

FIG. 3 shows a configuration in which a gain applying unit is provided in the path of microphone 10. As shown in FIG. 3, when a gain applying unit 60 is provided in the path of microphone 10 and the per-frequency gain G_1n(f) is applied, the received sound signal picked up by microphone 10 is altered by that gain, so a gain removing unit 61 is also provided in the path of microphone 10.
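The variant with the gain on the main microphone's path can be sketched as follows. This is an illustrative assumption of how units 60 and 61 might fit together, not the circuit of FIG. 3: the function name, the binary masking, and the gain value of 0.5 are all hypothetical.

```python
def process_main_path(main_frame, sub_frames, main_gain):
    """Scale the main path, mask it against the other microphones, then undo the scaling.

    main_gain[f] must be non-zero so that it can be removed again.
    """
    n_subs = len(sub_frames)
    out = []
    for f, x1 in enumerate(main_frame):
        scaled = main_gain[f] * x1                       # role of gain applying unit 60
        ref = sum(abs(sub[f]) for sub in sub_frames) / n_subs
        kept = scaled if abs(scaled) > ref else 0.0      # masking on the scaled signal
        out.append(kept / main_gain[f])                  # role of gain removing unit 61
    return out

main_frame = [1.0, 0.3]   # bin 0: nearby talker, bin 1: ambient noise
sub_frames = [[0.25, 0.25]]
main_gain = [0.5, 0.5]    # lowering the main path, as the text suggests

output = process_main_path(main_frame, sub_frames, main_gain)
```

Lowering the main path has the same effect on the comparison as raising the other paths, and dividing by the gain afterwards restores the original level of the bins that survive.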

This embodiment has shown a path that extracts only the sound source signal received by the microphone 10. However, as shown in FIG. 4, applying the same circuit configuration as that of the microphone 10 to the sound source signals received by the other microphones makes it possible to separate and extract the sound source signals received by the microphone 10 and by each of the other microphones 10-2 to 10-n.
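As a rough illustration of the FIG. 4 arrangement, the same mask-and-extract path can be run once per microphone, treating each one in turn as the main microphone. The gain layout `G[m, n]` (gain applied to microphone n when microphone m is the main one), the binary mask, and the function name `separate_all` are assumptions made for this sketch.

```python
import numpy as np

def separate_all(X, G):
    """Run the same masking path once per microphone.

    X: complex (N, F, T), time-frequency components of all N microphones.
    G: real (N, N, F); G[m, n] is the gain applied to microphone n while
       microphone m is treated as the main one (hypothetical layout; the
       diagonal entries G[m, m] are unused).
    Returns the masked components, shape (N, F, T).
    """
    N = X.shape[0]
    out = np.zeros_like(X)
    for m in range(N):
        others = [n for n in range(N) if n != m]
        sub_level = np.mean(np.abs(G[m, others][:, :, None] * X[others]), axis=0)
        mask = np.abs(X[m]) > sub_level      # assumed binary mask
        out[m] = np.where(mask, X[m], 0.0)
    return out

# Two microphones, one frequency bin, two frames: each talker dominates one frame.
X = np.array([[[1.0, 0.3]],
              [[0.3, 1.0]]], dtype=complex)
G = np.ones((2, 2, 1))
Y = separate_all(X, G)
```

Each output row keeps only the frame in which its own microphone dominates, so the two sources end up in separate outputs.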

The present invention can be realized not only as an automatic mixing device but also as an automatic mixing method defined by the processing procedure applied to the received signals, and further as a program that causes a computer to implement the sound source separation and interfering-sound suppression functions. Each unit of the automatic mixing device can be implemented in either hardware or software.

As described above, this embodiment makes it possible to create an amplitude difference between the time-frequency components of the received signals output by the microphones and then generate a mask pattern. Consequently, even when the received signals output by the microphones are not sparse and there is no amplitude difference between their time-frequency components, the performance of sound source separation and interfering-sound suppression does not degrade, and the target sound can be obtained clearly.

10 to 10-n: time-frequency analysis units
70 to 70-n: gain applying units
1 to n: microphones
30: masking processing unit
50: time-frequency synthesis unit
60: gain applying unit
61: gain removing unit
90: level difference comparison unit

Claims

An automatic mixing device that extracts an audio signal of a target sound from audio signals obtained by a plurality of microphones, comprising:
a time-frequency conversion unit that converts the audio signals obtained by a main microphone, which is any one of the microphones, and by a plurality of sub-microphones other than the main microphone into respective time-frequency components;
a gain applying unit that applies a gain to each audio signal obtained by the plurality of sub-microphones;
a level difference comparison unit that compares the amplitude of the time-frequency component of the audio signal obtained by the main microphone with the amplitude of the time-frequency component of each audio signal to which the gain has been applied, and generates a mask pattern;
a masking processing unit that masks the time-frequency component of the audio signal obtained by the main microphone using the mask pattern; and
a time-frequency synthesis unit that synthesizes the time-frequency components of the masked audio signal.

The automatic mixing device according to claim 1, wherein the gain of the gain applying unit is set so as to produce an amplitude difference between the time-frequency components of the audio signals obtained as interfering sound by the respective sub-microphones, and so that the magnitude relationship between the amplitude of the time-frequency component of the audio signal obtained as the target sound by each sub-microphone and the amplitude of the time-frequency component of the audio signal obtained as interfering sound by each sub-microphone is not reversed.

The automatic mixing device according to claim 1 or 2, wherein the level difference comparison unit generates the mask pattern m_1(f, t), where
the level of the time-frequency component of the main microphone is |X_1(f, t)|, and
the level of the time-frequency components of the plurality of sub-microphones to which the gain G_1n(f) has been applied is 1/(N-1)·Σ |G_1n(f)·X_n(f, t)|.

An automatic mixing device that extracts an audio signal of a target sound from audio signals obtained by a plurality of microphones, comprising:
a time-frequency conversion unit that converts the audio signals obtained by a main microphone, which is any one of the microphones, and by a plurality of sub-microphones other than the main microphone into respective time-frequency components;
a gain applying unit that applies a gain to the audio signal obtained by the main microphone;
a level difference comparison unit that compares the amplitude of the time-frequency component of the audio signal to which the gain has been applied with the amplitude of the time-frequency component of each audio signal obtained by the plurality of sub-microphones, and generates a mask pattern;
a masking processing unit that masks the time-frequency component of the audio signal to which the gain has been applied, using the mask pattern;
a gain removing unit that removes the gain from the masked audio signal; and
a time-frequency synthesis unit that synthesizes the time-frequency components of the audio signal from which the gain has been removed.

A program for an automatic mixing device that extracts an audio signal of a target sound from audio signals obtained by a plurality of microphones, the program causing a computer to execute a series of processes comprising:
a process of converting the audio signals obtained by a main microphone, which is any one of the microphones, and by a plurality of sub-microphones other than the main microphone into respective time-frequency components;
a process of applying a gain to each audio signal obtained by the plurality of sub-microphones;
a process of generating a mask pattern by comparing the amplitude of the time-frequency component of the audio signal obtained by the main microphone with the amplitude of the time-frequency component of each audio signal to which the gain has been applied;
a process of masking the time-frequency component of the audio signal obtained by the main microphone using the mask pattern; and
a process of synthesizing the time-frequency components of the masked audio signal.