JP5143656B2 - Sound collection system and sound display method

Info

Publication number: JP5143656B2
Application number: JP2008188581A
Authority: JP
Inventors: 洋平川口; 真人戸上; 康成大淵
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 2008-07-22
Filing date: 2008-07-22
Publication date: 2013-02-13
Anticipated expiration: 2028-07-22
Also published as: JP2010028531A

Description

The present invention relates to a sound collection system, and more particularly to a sound collection system capable of displaying the volume of collected sound.

In telephone conference systems and video conference systems, voice collected by the voice input unit at one site is transmitted to other sites after echo cancellation or noise removal processing has been applied. The user's own transmitted voice is not necessarily sent at a volume high enough to be heard by the party at the far end (the receiving side of the voice). Likewise, voice that the user does not want heard is not necessarily sent at a volume low enough to be inaudible to the far-end party.

If the user could know at what volume his or her voice is being transmitted to the far end, the user could check that volume while speaking and adjust the conversation accordingly.

However, it is not easy for the user to know at what volume his or her voice is being transmitted to the far end. For example, a method of presenting the volume of the received signal to the user is conceivable (see, for example, Patent Document 1).

Patent Document 1: JP 2006-67240 A
M. Togami, T. Sumiyoshi, and A. Amano, "Stepwise phase difference restoration method for sound source localization using multiple microphone pairs", ICASSP2007, vol.1, pp.117-120, 2007.
T. Takatani, T. Nishikawa, H. Saruwatari, and K. Shikano, "Blind separation of binaural sound mixtures using SIMO-model-based independent component analysis", ICASSP2004, vol.4, pp.113-116, 2004.

However, the volume that can be checked by the method described in Patent Document 1 is not the volume of the user's own voice alone. It is only the volume of a picked-up signal in which sounds from multiple sources are superimposed, such as the voice of a person speaking at the same time, voice transmitted from another site and output from the speaker, and environmental noise.

An object of the present invention is to provide a sound collection system with which the degree to which voice collected by the voice input unit is suppressed can be checked for each sound source.

A representative example of the present invention is as follows. It is a sound collection system comprising a microphone array composed of two or more microphones and a processing unit that converts the signal output from the microphone array. The processing unit comprises a sound source separation unit that separates the signal output from the microphone array for each direction in which a sound source exists, a noise removal processing unit that removes noise from the signal output from the microphone array, and a direction-specific residual signal calculation unit that calculates the volume of the residual signal for each direction based on the signal output from the sound source separation unit and the residual signal output from the noise removal processing unit. The sound collection system further comprises a suppression amount display unit that displays the volume of the residual signal for each direction based on the result calculated by the direction-specific residual signal calculation unit.

According to an embodiment of the present invention, the degree to which the collected voice is suppressed can be checked for each sound source.

[First Embodiment]
A video conference system using the present invention is described below as an example. In a video conference system using an IP network line, two or more sites connected by a network communicate with each other using conference equipment composed of a microphone array, a speaker, and the like, enabling conversation among the speakers present at each site. The following description centers on an arbitrary site. That site is referred to as the near end, and each site other than the near end that is connected to it is referred to as the far end.

FIG. 1 is a diagram showing the hardware configuration of the video conference system according to the first embodiment of the present invention.

The video conference system consists of a microphone array 101 made up of two or more microphone elements, an A/D-D/A converter 102, a central processing unit 103, a volatile memory 104, a storage medium 105, a suppression amount display unit 106, a noise removal operation input unit 107, a speaker 108, a camera 109, an image display device 110, a hub 111, an audio cable 112, a digital cable 113, a digital cable 114, a digital cable 115, an audio cable 116, a digital cable 117, a monitor cable 118, and a LAN cable 119.

The A/D-D/A converter 102 converts the analog sound pressure signal output from the microphone array 101 into digital data. The central processing unit 103 manages the output of the A/D-D/A converter 102. The storage medium 105 stores the program and information such as the physical coordinates of each microphone element of the microphone array 101, and is connected to the central processing unit 103.

The multi-channel sound pressure data collected by the microphone elements of the microphone array 101 is output to the A/D-D/A converter 102 via the audio cable 112. The A/D-D/A converter 102 converts it into multi-channel digital sound pressure data. This conversion is performed with the conversion timing synchronized across the sound pressure signals output from the individual microphone elements.

The converted multi-channel digital sound pressure data is output to the central processing unit 103 via the digital cable 113. The central processing unit 103 performs acoustic signal processing on the input multi-channel digital sound pressure data. The processed signal is transmitted to the network via the LAN cable 119 and the hub 111.

Digital sound pressure data received from the far end via the network is output to the central processing unit 103 via the hub 111 and the LAN cable 119, and the central processing unit 103 performs acoustic signal processing on it. The processed digital sound pressure data is output to the A/D-D/A converter 102 via the digital cable 113. The A/D-D/A converter 102 converts it into analog sound pressure data, which is sent through the audio cable 116 and output from the speaker 108.

The noise removal operation input unit 107 is an input unit with which the user sets suppression direction data indicating, for each arrival direction contained in the collected multi-channel sound pressure data, whether voice arriving from that direction is to be suppressed. The noise removal operation input unit 107 is, for example, a device in which a plurality of buttons are arranged around the side surface of a cylindrical housing. Operating a button sets whether voice arriving from the direction in which that button is placed is to be suppressed. For example, the LED of a button is lit while voice from its direction is being suppressed and turned off while it is not, which shows the user which directions are currently suppressed. The suppression direction data that has been set is transmitted to the central processing unit 103 via the digital cable 115.
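The button and LED behavior described above can be sketched as a small state holder. This is only an illustration: the class name `NoiseRemovalPanel`, the even spacing of buttons, the placement of button 0 at 0 degrees, and the toggle-on-press behavior are assumptions made for the example, not details stated for the noise removal operation input unit 107.

```python
class NoiseRemovalPanel:
    """Hypothetical model of the ring of direction buttons and their LEDs."""

    def __init__(self, n_buttons):
        self.n_buttons = n_buttons
        # One flag per button: the "suppression direction data".
        self.suppressed = [False] * n_buttons

    def nearest_button(self, angle_deg):
        # Assumes evenly spaced buttons with button 0 at 0 degrees.
        step = 360.0 / self.n_buttons
        return int((angle_deg % 360.0) / step + 0.5) % self.n_buttons

    def press(self, button):
        # Each press toggles suppression for that button's direction.
        self.suppressed[button] = not self.suppressed[button]
        return self.suppressed[button]

    def led_states(self):
        # A lit LED means the direction is currently suppressed.
        return ["on" if s else "off" for s in self.suppressed]
```

With eight buttons, a user sitting at roughly 100 degrees would be nearest to button 2. Pressing it once lights its LED, and pressing it again turns the LED off.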

The multi-channel digital sound pressure data X collected by the microphone array 101 and output to the central processing unit 103 contains the sound output from the speaker 108 as an acoustic echo.

Based on the multi-channel digital sound pressure data X and the digital sound pressure data output from the hub 111, the central processing unit 103 updates, at each time, a multi-channel digital filter for removing the acoustic echo, stores the updated digital filter in the volatile memory 104, and removes the acoustic echo using the digital filter as updated at each time. The central processing unit 103 then refers to the suppression direction data output from the noise removal operation input unit 107 and to the physical coordinates of each microphone element of the microphone array 101 stored in the storage medium 105, and performs noise removal processing on the multi-channel sound pressure data Y obtained after the acoustic echo has been removed.
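The text does not say which adaptive algorithm updates the echo-removal filter. As one common possibility, the following sketch uses a single-channel normalized LMS (NLMS) filter that is updated on every sample from the microphone signal and the far-end reference. The choice of NLMS, the function name `nlms_echo_cancel`, and the parameters `n_taps`, `step`, and `eps` are all assumptions made for illustration.

```python
def nlms_echo_cancel(mic, ref, n_taps=4, step=0.5, eps=1e-8):
    """Remove the echo of `ref` from `mic` with an NLMS adaptive filter.

    mic: microphone samples (near-end sound plus echo of the reference)
    ref: far-end reference samples played from the speaker
    Returns (echo-reduced samples, final filter weights).
    """
    weights = [0.0] * n_taps   # current estimate of the echo path
    history = [0.0] * n_taps   # most recent reference samples, newest first
    out = []
    for x, r in zip(mic, ref):
        history = [r] + history[:-1]
        echo_estimate = sum(w * h for w, h in zip(weights, history))
        error = x - echo_estimate          # echo-reduced output sample
        norm = sum(h * h for h in history) + eps
        # Normalized LMS update, applied at every time step.
        weights = [w + step * error * h / norm
                   for w, h in zip(weights, history)]
        out.append(error)
    return out, weights
```

When the microphone signal contains only a filtered copy of the reference, the weights converge toward the echo path and the output decays toward zero. A multi-channel system would run one such filter per microphone channel.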

The central processing unit 103 also uses the multi-channel digital sound pressure data X to calculate the volume P_X for each arrival direction contained in the multi-channel digital sound pressure data X. In addition, using the multi-channel digital sound pressure data X and the digital sound pressure data Y on which the noise removal processing has been performed, the central processing unit 103 calculates the volume P_Y for each arrival direction contained in the digital sound pressure data Y. The calculated volumes P_X and P_Y are output from the central processing unit 103 to the suppression amount display unit 106 via the digital cable 114.

The suppression amount display unit 106 displays the calculated volume P_X and the calculated volume P_Y.

The image signal captured by the camera 109 is output to the central processing unit 103 via the digital cable 117. The central processing unit 103 performs image signal processing on the input image signal. The processed image signal is transmitted onto the network via the LAN cable 119 and the hub 111.

The image signal transmitted from the far end is output to the central processing unit 103 via the hub 111 and the LAN cable 119. The central processing unit 103 performs image signal processing on the input image signal and outputs the processed image signal to the image display device 110 via the monitor cable 118, and the image is displayed on the screen of the image display device 110.

USB cables or the like are used as the digital cable 113, the digital cable 114, the digital cable 115, and the digital cable 117.

The suppression amount display unit 106 can show the user the amount by which voice arriving from each direction is suppressed. The suppression amount display unit 106 is, for example, a device in which a column SEQ_X of vertically arranged green LEDs and a column SEQ_Y of vertically arranged red LEDs are paired into one column set SEQ_COMB, and a plurality of SEQ_COMB sets are arranged around the side surface of a cylindrical housing. When the direction θ in which a SEQ_COMB is placed falls within the range Θ = [θ_1, θ_2], the column SEQ_X_Θ displays, as a level meter, the volume of the voice arriving from the range Θ that is contained in the input multi-channel digital sound pressure data X. Likewise, the column SEQ_Y_Θ displays, as a level meter, the volume of the voice arriving from the range Θ that is contained in the noise-removed digital sound pressure data Y.
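As a rough sketch of how the per-direction volumes could drive the two LED columns of each SEQ_COMB, the following code converts a power value to a number of lit LEDs on a decibel scale. The function names, the ten-LED column height, the -60 dB floor, and the normalization of full scale to a power of 1.0 are assumptions chosen for the example. The text specifies only that the volumes are shown as level meters.

```python
import math


def lit_leds(power, n_leds=10, floor_db=-60.0):
    """Map a power value (1.0 = full scale, assumed) to a lit-LED count."""
    if power <= 0.0:
        return 0
    level_db = 10.0 * math.log10(power)
    fraction = (level_db - floor_db) / (0.0 - floor_db)
    return max(0, min(n_leds, round(fraction * n_leds)))


def meter_columns(p_x, p_y, n_leds=10):
    """Return one (green SEQ_X count, red SEQ_Y count) pair per direction range.

    p_x: per-direction power before noise removal (volume P_X)
    p_y: per-direction power after noise removal (volume P_Y)
    """
    return [(lit_leds(x, n_leds), lit_leds(y, n_leds))
            for x, y in zip(p_x, p_y)]
```

A direction whose green column is tall while its red column is short is one that is being strongly suppressed. Columns of similar height indicate little suppression.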

Because the volume is displayed for each range of arrival directions, the user can check the amount by which his or her own voice is suppressed.

In the present embodiment, it is desirable that the housings of the microphone array 101 and the suppression amount display unit 106 be physically fixed to each other so that their relative positional relationship is fixed. When the microphone array 101 is moved, the display unit showing the suppression amount then moves with it, so the user only has to think in terms of the position of the microphone array 101, and the suppressed directions are easy to understand.

In addition, because no additional sensor needs to be installed, the configuration of the apparatus can be kept simple. If the relative positional relationship between the microphone array 101 and the suppression amount display unit 106 changed over time, the position at which the suppression amount is displayed would have to change accordingly. That would require obtaining the relative positional relationship with some kind of position sensor, such as a magnetic sensor, an ultrasonic sensor, or a camera that detects marker positions, and introducing such a sensor would complicate the apparatus. Fixing the relative positional relationship between the microphone array 101 and the suppression amount display unit 106 makes the sensor unnecessary.

It is also desirable that the housings of the suppression amount display unit 106 and the noise removal operation input unit 107 be physically fixed to each other so that their relative positional relationship is fixed. As described above, this removes the need for a sensor to estimate the relative positional relationship and keeps the configuration of the apparatus simple.

Furthermore, it is desirable that the direction in which each LED column set SEQ_COMB of the suppression amount display unit 106 is placed coincide with the direction in which the corresponding button of the noise removal operation input unit 107 is placed. When the user specifies a direction to be suppressed, the shorter the distance between the display showing the suppression amount and the button, the easier the operation is for the user.

FIG. 2 is a diagram showing a usage example of the video conference system according to the first embodiment of the present invention.

A user U1 and a user U2 are present at site A and are in a call with a user at site B. Suppose user U1 wants to hold a conversation within site A only. So that the user at site B cannot hear his or her voice, user U1 operates, among the buttons on the noise removal operation input unit 107, the button closest to himself or herself, that is, the button corresponding to the position where user U1 is located. The central processing unit 103 then calculates a directional filter whose directivity pattern suppresses the volume of sound arriving from the direction of user U1. The central processing unit 103 applies the calculated filter to the signal after echo canceller processing and transmits to site B the voice in which sound arriving from the direction of user U1 has been suppressed.
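How the directional filter is computed is not detailed at this point in the text. As a minimal illustration of a filter with a directivity pattern that suppresses one direction, the sketch below builds a two-microphone null-steering weight pair for a single frequency from the microphone coordinates. The plane-wave model, the speed of sound of 340 m/s, the two-element array, and all function names are assumptions for the example and are not taken from the source.

```python
import cmath
import math

SPEED_OF_SOUND = 340.0  # m/s, assumed value


def steering(mic_xy, angle_deg, freq_hz):
    """Relative phase at each microphone for a plane wave from angle_deg."""
    ux = math.cos(math.radians(angle_deg))
    uy = math.sin(math.radians(angle_deg))
    omega = 2.0 * math.pi * freq_hz
    return [cmath.exp(1j * omega * (x * ux + y * uy) / SPEED_OF_SOUND)
            for x, y in mic_xy]


def null_weights(mic_xy, suppress_deg, freq_hz):
    """Weights (w0, w1) whose weighted sum cancels sound from suppress_deg."""
    a0, a1 = steering(mic_xy, suppress_deg, freq_hz)
    return [1.0 + 0j, -a0 / a1]


def response(weights, mic_xy, angle_deg, freq_hz):
    """Magnitude of the filter output for a unit plane wave from angle_deg."""
    a = steering(mic_xy, angle_deg, freq_hz)
    return abs(sum(w * ai for w, ai in zip(weights, a)))
```

For microphones at (-0.05, 0) and (0.05, 0) metres and a 1 kHz tone, weights computed for 0 degrees give a response of essentially zero at 0 degrees, while sound from 90 degrees still passes. A practical system would compute such weights for every frequency bin and use more microphones.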

At site B, the received signal passes through the central processing unit 203 and is output from the speaker 208.

The suppression amount display unit 106 at site A displays two volumes on the SEQ_COMB placed in the direction corresponding to user U1: the volume of the voice arriving from the direction of user U1 contained in the input multi-channel digital sound pressure data X, and the volume of the sound arriving from the direction of user U1 contained in the digital sound pressure data after noise removal.

By looking at the displayed suppression amount, user U1 can converse without being heard by the user at site B while checking whether the volume of the voice arriving from his or her direction is sufficiently suppressed. User U1 can also converse within site A only while confirming that the conversation between user U2 at site A and the user at site B is not being disturbed. If the conversation at site A is not sufficiently suppressed, users U1 and U2 at site A can operate the buttons of the noise removal operation input unit 107 to specify more directions, or speak more quietly, and thereby converse within site A only without being heard by the user at site B.

FIG. 3 is a diagram showing an example of a time chart of the relationship among each user's utterances, the input operations on the noise removal operation input unit 107, the display of the suppression amount display unit 106, and the volume of the sound pressure data transmitted to the far end, in a video conference system that is an embodiment of the sound collection system of the present invention. In FIG. 3, the horizontal axis represents time.

In time period t1, users U1 and U2 are speaking. Their voices are transmitted to the far end without being suppressed. The suppression amount display also shows only a slight difference between the volume of the collected voice and the volume of the transmitted voice, indicating that almost no suppression is taking place.

At time t2, user U1 operates button B1, the button of the noise removal operation input unit 107 located closest to user U1. From this operation onward, voice from the direction of user U1 is in the suppressed state.

In time period t3, users U1 and U2 are speaking. The voice of user U2 is transmitted without being suppressed, as in time period t1. Because voice from the direction of user U1 is in the suppressed state, however, the voice of user U1 is suppressed in the residual signal. The suppression amount display also shows a large difference between the volume of the voice collected from the direction of user U1 and the volume of the voice transmitted to the far end, so user U1 can see that his or her voice is sufficiently suppressed and can converse at the near end only with confidence.

At time t4, user U1 operates button B1 again. From this operation onward, voice from the direction of user U1 returns from the suppressed state to the normal, unsuppressed state.

In time period t5, voice from the direction of user U1 is no longer in the suppressed state, so the voice of user U1 is transmitted to the far end without being suppressed, as in time period t1. The suppression amount display also shows only a slight difference between the volume of the collected voice and the volume of the voice transmitted to the far end, indicating that almost no suppression is taking place.

FIG. 13 is a flowchart showing the series of processes of the video conference system according to the first embodiment of the present invention.

After the video conference system starts up, the central processing unit 103 at the local site (near end) first performs acoustic echo canceller adaptation processing (S1301). In this processing, a white-noise signal or a full-band signal whose frequency changes over time is output from the speaker, and the filter of the acoustic echo canceller is initialized. The central processing unit 103 then determines whether a connection has been requested by another site (far end) (S1302).

If it is determined that a connection has been requested by another site (far end), the central processing unit 103 connects to that site (S1304). If it is determined that no connection has been requested by another site (far end), the central processing unit 103 determines whether the local site (near end) has requested a connection to another site (far end) (S1303).

If it is determined that the local site (near end) has requested a connection to another site (far end), the central processing unit 103 connects to that site (S1304). If it is determined that the local site (near end) has not requested a connection to another site (far end), the central processing unit 103 returns to S1302.

After connecting to the other site (far end) in S1304, the central processing unit 103 performs the following processing in order: playback of the far-end voice from the speaker (S1305), acoustic echo cancellation (S1306), noise removal processing (S1307), sound source separation of the collected voice (S1308), calculation of the direction-specific volume of the residual signal (S1309), presentation of the suppression amount (S1310), and transmission of voice to the other site (far end) (S1311). After this processing, the central processing unit 103 determines whether the connection with the other site (far end) has been lost (S1312).

If it is determined that the connection with the other site (far end) has been lost, the central processing unit 103 executes the processing for disconnecting from the other site (S1314) and ends the series of processes. If it is determined that the connection has not been lost, the central processing unit 103 determines whether the local site (near end) has requested disconnection from the other site (far end) (S1313).

If it is determined that the local site (near end) has requested disconnection from the other site (far end), the central processing unit 103 executes the processing for disconnecting from the other site (S1314) and ends the series of processes. If it is determined that the local site (near end) has not requested disconnection, the process returns to S1315 and the same processing is repeated.
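The control flow of FIG. 13 can be summarized as a small loop driven by a scripted list of events. This is a simplified sketch: the event names, the function `run_conference`, and the grouping of steps S1305 to S1311 into a single per-frame step are assumptions made for illustration, and the sketch assumes that the per-frame steps simply repeat while the call stays connected.

```python
def run_conference(events):
    """Walk the flowchart with scripted events and return the visited steps.

    events: strings consumed one per decision point. Hypothetical values:
    'incoming', 'outgoing', 'none' while waiting for a connection, then
    'continue', 'dropped', 'hangup' once connected.
    """
    trace = ["S1301 adapt echo canceller"]
    it = iter(events)

    # S1302 / S1303: wait until either side requests a connection.
    for event in it:
        if event in ("incoming", "outgoing"):
            trace.append("S1304 connect")
            break
    else:
        return trace  # script ended before any connection was requested

    # S1305 to S1311 run once per frame until the call ends.
    for event in it:
        trace.append("S1305-S1311 process frame")
        if event in ("dropped", "hangup"):  # S1312 / S1313
            trace.append("S1314 disconnect")
            break
    return trace
```

For example, the script `["none", "incoming", "continue", "hangup"]` connects on the second event, processes two frames, and then disconnects.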

FIG. 4 is a block diagram showing the configuration of the video conference system according to the first embodiment of the present invention.

The multi-channel analog sound pressure data input to the microphone elements of the microphone array 101 is converted by the multi-channel A/D conversion unit 401 into multi-channel digital sound pressure data x_i(t) corresponding to each microphone element. Here, i is an index indicating the microphone element number; if the total number of microphone elements is M, i takes one of the values from 0 to M-1. Also, t is the discrete time in units of the sampling period. The converted multi-channel digital sound pressure data x_i(t) is output to the multi-channel frame processing unit 402.

The voice receiving unit 404 receives the digital sound pressure data ref(t) transmitted from the far end. The received digital sound pressure data ref(t) is digital sound pressure data carried using the TCP/IP protocol or the RTP protocol.

In a multi-site video conference system that goes through a central server, the server receives voice signals from the multiple sites, mixes the received voice signals, and transmits the result to each site. The voice receiving unit 404 receives the mixed voice signal transmitted from the server. In this case, the voice receiving unit 404 passes the mixed voice as is, as the digital sound pressure data ref(t), to the D/A conversion unit 405 and the multi-channel frame processing unit 402.

In a multi-site video conference system that communicates using multicast or the like without a central server, the voice signal of each site is transmitted to every other site, and the voice receiving unit 404 receives the voice signals directly from the respective sites. In this case, the voice receiving unit 404 mixes the voices of the sites and then outputs the mixed voice as the digital sound pressure data ref(t) to the D/A conversion unit 405 and the multi-channel frame processing unit 402. As described later, the digital sound pressure data ref(t) output to the multi-channel frame processing unit 402 is used there as a reference signal.
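For the serverless case, the mixing performed by the voice receiving unit 404 can be pictured as a sample-wise sum of the signals received from each site. The sketch below is an assumption-laden illustration: the function name `mix_sites`, the floating-point sample format, and the clipping to the range [-1.0, 1.0] are not specified in the text.

```python
def mix_sites(site_signals):
    """Mix per-site sample lists into one reference signal ref(t).

    site_signals: list of sample lists, one per remote site, with samples
    assumed to be floats in [-1.0, 1.0]. Shorter signals are treated as
    silent once they run out.
    """
    length = max(len(s) for s in site_signals)
    ref = []
    for t in range(length):
        total = sum(s[t] for s in site_signals if t < len(s))
        # Clip so the mixed signal stays within the assumed sample range.
        ref.append(max(-1.0, min(1.0, total)))
    return ref
```

The resulting list plays the role of ref(t): it is both played back through the speaker and used as the reference for echo cancellation.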

The D/A conversion unit 405 converts the input digital sound pressure data ref(t) into analog sound pressure data. The audio reproduction unit 406 outputs the converted analog sound pressure data from the speaker 108.

The multi-channel frame processing unit 402 converts the input multi-channel digital sound pressure data x_i(t) into the multi-channel time domain frame signal Xf_i(t, τ) and the time domain reference signal Reff(t, τ), each covering the range from t = τS to t = τS + F − 1.

Here, f is an index of the frequency bands obtained by dividing the frequency range at equal intervals; when the range is divided into N bands, f takes a value from 0 to N − 1. It is referred to below as frequency bin f. t denotes time. τ is called the frame index and is incremented by 1 each time the processing from the multi-channel frame processing unit 402 through the audio transmission unit 413 completes. S is called the frame shift and is the number of samples by which successive frames are offset. F is called the frame size and is the number of samples processed at one time in each frame.
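The framing described above can be sketched as follows. This is a minimal illustration for a single channel, assuming that only frames lying entirely inside the input are produced; the function and parameter names are hypothetical and do not appear in the specification.

```python
def split_into_frames(x, frame_size, frame_shift):
    """Cut x into frames covering t = tau*S .. tau*S + F - 1 (S = shift, F = size)."""
    frames = []
    tau = 0  # frame index
    while tau * frame_shift + frame_size <= len(x):
        start = tau * frame_shift
        frames.append(x[start:start + frame_size])
        tau += 1  # incremented once per processed frame
    return frames


# Ten samples, frame size F = 4, frame shift S = 2: consecutive frames overlap by two samples.
frames = split_into_frames(list(range(10)), frame_size=4, frame_shift=2)
```

With a frame shift smaller than the frame size, adjacent frames overlap, which is what later allows the time domain signal to be rebuilt by overlap-add.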

The converted multi-channel time domain frame signal Xf_i(t, τ) and time domain reference signal Reff(t, τ) are output to the multi-channel short-time frequency analysis unit 403. The multi-channel short-time frequency analysis unit 403 applies DC component removal and window processing, such as a Hamming window, a Hanning window, or a Blackman window, to the input signals Xf_i(t, τ) and Reff(t, τ). It then performs a short-time Fourier transform to convert them into the multi-channel frequency domain frame signal Xf_i(f, τ) and the frequency domain reference signal Reff(f, τ). The number of frequency bins f (hereinafter, the number of frequency bins) is N.
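As a rough sketch of this analysis step for one frame of one channel, the code below removes the DC component, applies a Hamming window, and evaluates a direct discrete Fourier transform. Removing DC by subtracting the frame mean, and keeping F/2 + 1 bins, are assumptions made for illustration; the specification does not state either detail.

```python
import cmath
import math


def hamming(frame_size):
    """Symmetric Hamming window of the given length."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (frame_size - 1))
            for n in range(frame_size)]


def analyze_frame(frame):
    """Return the complex spectrum of one time domain frame."""
    size = len(frame)
    mean = sum(frame) / size                      # assumed form of the DC cut
    windowed = [(s - mean) * w for s, w in zip(frame, hamming(size))]
    n_bins = size // 2 + 1                        # assumed number of bins N
    return [sum(windowed[n] * cmath.exp(-2j * math.pi * f * n / size)
                for n in range(size))
            for f in range(n_bins)]
```

A production implementation would use an FFT instead of the direct sum; the direct form is used here only to keep the sketch free of external dependencies.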

FIG. 14 is an explanatory diagram showing the data structure of the multi-channel frequency domain frame signal Xf_i(f, τ) in an arbitrary frame τ according to the first embodiment of the present invention.

Each cell of the structure, indexed by microphone element and by frequency bin, stores the corresponding multi-channel frequency domain frame signal Xf_i(f, τ). The signal Xf_i(f, τ) is complex-valued.

The multi-channel frequency domain frame signal Xf_i(f, τ) of each microphone element and the frequency domain reference signal Reff(f, τ) are output to the multi-channel acoustic echo canceller unit 407. The signal Xf_i(f, τ) is also output to the sound source separation unit 408.

The multi-channel acoustic echo canceller unit 407 removes, from the multi-channel frequency domain frame signal Xf_i(f, τ) of each microphone element input from the multi-channel short-time frequency analysis unit 403, the acoustic echo component of the signal that enters from the speaker 108. The acoustic echo component is calculated on the basis of the frequency domain reference signal Reff(f, τ) input from the multi-channel short-time frequency analysis unit 403. The echo removal may, for example, successively adapt the transfer function of the acoustic echo using a common algorithm such as the NLMS algorithm. Differences in how the acoustic echo canceller operates are not essential to the present invention.
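For illustration, one possible NLMS adaptation is sketched below for a single microphone and a single frequency bin, modelling the echo path as one complex coefficient per bin. The single-tap model, the step size `mu`, and the regularizer `eps` are assumptions; the specification names NLMS only as an example and does not fix these details.

```python
def nlms_step(h, x, ref, mu=0.5, eps=1e-12):
    """One NLMS update for one bin.

    h   : current estimate of the echo transfer coefficient (complex)
    x   : microphone spectrum value Xf_i(f, tau)
    ref : reference spectrum value Reff(f, tau)
    Returns the updated coefficient and the echo-reduced value Ef_i(f, tau).
    """
    e = x - h * ref                                   # subtract the estimated echo
    h_new = h + mu * ref.conjugate() * e / (abs(ref) ** 2 + eps)
    return h_new, e


# Toy run: the microphone picks up only echo, through a fixed coefficient 0.5 - 0.2j.
true_path = 0.5 - 0.2j
h = 0j
for tau in range(60):
    ref = [1 + 1j, 2 - 1j, -0.5 + 0.3j][tau % 3]
    h, residual = nlms_step(h, true_path * ref, ref)
```

In this toy run the estimate `h` approaches the true coefficient and the residual shrinks toward zero, which is the behaviour the successive adaptation is meant to achieve.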

The multi-channel frequency domain frame signal from which the multi-channel acoustic echo canceller unit 407 has removed the acoustic echo component is denoted Ef_i(f, τ). The signal Ef_i(f, τ) calculated by the multi-channel acoustic echo canceller unit 407 is output to the noise removal processing unit 409.

FIG. 15 is an explanatory diagram showing the data structure of the multi-channel frequency domain frame signal Ef_i(f, τ) in an arbitrary frame τ according to the first embodiment of the present invention.

Each cell of the structure, indexed by microphone element and by frequency bin, stores the corresponding multi-channel frequency domain frame signal Ef_i(f, τ). The signal Ef_i(f, τ) is complex-valued.

The noise removal operation input unit 107 outputs, for each of J mutually exclusive direction ranges Θ_j = [θ_j1, θ_j2], a signal indicating whether noise removal is to be performed. Here j is the direction index; when all directions are divided into J ranges, j takes a value from 0 to J − 1.

Specifically, the noise removal processing unit 409 has a button B_j for each direction range Θ_j, and each time the button B_j is pressed, a binary-valued IsReduced_j(τ) is output. IsReduced_j(τ) is a Boolean value that is true (value 1) when the button B_j is pressed in a given frame.

When the output IsReduced_j(τ) is 0, a signal meaning that noise removal is not to be performed is output to the noise removal processing unit 409. When the output IsReduced_j(τ) is not 0, a signal meaning that noise removal is to be performed is output to the noise removal processing unit 409.

FIG. 16 is an explanatory diagram showing the data structure of IsReduced_j(τ) in an arbitrary frame τ according to the first embodiment of the present invention.

A value is stored for each of the J regions into which all directions are divided. As shown in FIG. 16, the structure is a one-dimensional array.

The direction from which noise is removed need not be specified manually by the user; preset values may instead be stored in the noise removal processing unit 409 in advance. In that case, the noise removal operation input unit 107 is not required.

The noise removal processing unit 409 removes noise from the designated directions from the multi-channel frequency domain frame signal Ef_i(f, τ), on the basis of the signal Ef_i(f, τ) input from the multi-channel acoustic echo canceller unit 407 and IsReduced_j(τ) input from the noise removal operation input unit 107. The processing is described in detail below.

FIG. 5 is a block diagram showing a configuration example of the noise removal processing unit 409 based on a minimum variance beamformer.

The noise removal processing unit 409 includes a target sound/noise separation unit 501, a target sound steering vector update unit 502, a noise covariance matrix update unit 503, a filter update unit 504, and a filter multiplication unit 505.

First, the properties of the input multi-channel frequency domain frame signal Ef_i(f, τ) are described.

FIG. 6 is a diagram schematically showing the signal of one channel of the input multi-channel frequency domain frame signal Ef_i(f, τ).

As shown in FIG. 6, collected speech is known to be scattered discretely across frequency components. This property is called "sparseness". Each frequency component can therefore be assumed to contain the voice of only one speaker. The present embodiment uses this assumption to separate the target sound from noise.

First, the target sound/noise separation unit 501 calculates θ using three inputs: data on the arrangement of the microphone elements from the microphone arrangement 410; suppression direction data from the noise removal operation input unit 107, indicating whether sound arriving from a given direction in the collected multi-channel sound pressure data is to be suppressed; and the multi-channel frequency domain frame signal Ef_i(f, τ) input from the multi-channel acoustic echo canceller unit 407. Here θ is a quantity representing the arrival direction of the sound.

As an example of how θ is calculated, when the microphone array 101 has two microphone elements, θ is calculated from the phase difference ρ(f, τ) between the two elements, the bin frequency freq(f), the element spacing d, and the speed of sound c.

Here, ρ(f, τ) is the phase difference between the input signals of the two microphone elements at frame τ and frequency index f. The phase difference ρ(f, τ) can be calculated, for example, as follows: at a given frequency, that is, in a given row of the structure shown in FIG. 14, the multi-channel frequency domain frame signal Xf_1(f, τ) of microphone element 1 is divided by the signal Xf_i(f, τ) of microphone element i, and the phase difference is obtained from the phase of the quotient.

freq(f) is the frequency of frequency bin f and is calculated from Fs, the sampling rate of the multi-channel A/D conversion unit 401. d is the physical spacing between the two microphone elements. c is the speed of sound. Strictly speaking, the speed of sound varies with temperature and with the density of the medium, but it may be fixed at a single value such as 340 m/s.
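The equations for θ and freq(f) are not legible in this text. The sketch below uses the standard far-field relation between phase difference and arrival angle for two microphones, θ = arcsin(c·ρ / (2π·freq(f)·d)), together with the usual bin-to-frequency mapping for a transform of a given length. Both are assumptions that may differ from the patent's own expressions, and all names are hypothetical.

```python
import math


def bin_frequency(f, sampling_rate, fft_size):
    """Assumed mapping from bin index f to frequency in Hz."""
    return f * sampling_rate / fft_size


def doa_two_mics(rho, freq_hz, d, c=340.0):
    """Arrival angle in radians from the phase difference rho (radians)."""
    s = c * rho / (2 * math.pi * freq_hz * d)
    s = max(-1.0, min(1.0, s))        # guard against rounding just outside [-1, 1]
    return math.asin(s)


# A 1 kHz component arriving from 30 degrees at two microphones 5 cm apart.
rho = 2 * math.pi * 1000.0 * 0.05 * math.sin(math.pi / 6) / 340.0
theta = doa_two_mics(rho, 1000.0, 0.05)
```

The relation is unambiguous only while the spacing d stays below half the wavelength, which is the limitation that the multi-pair procedure for three or more microphones addresses.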

Under the sparseness assumption described above, the noise removal processing fixes a time-frequency point and applies the same processing at every such point. The time-frequency suffix (f, τ) is omitted below.

When the microphone array 101 has three or more microphone elements, θ can be calculated with the SPIRE algorithm (see Non-Patent Document 1). The SPIRE algorithm likewise relies on the sparseness assumption, fixing a time-frequency point and applying the same processing at every such point.

FIG. 7 is a flowchart showing how θ is calculated (the SPIRE algorithm) when the microphone array 101 has three or more microphone elements.

First, the target sound/noise separation unit 501 reads the data on the arrangement of the microphone elements (S701). This data is held in the storage medium 105.

Next, the target sound/noise separation unit 501 selects combinations of microphone elements to form microphone pairs, each consisting of two elements (S702). The pairs are preferably chosen so that the element spacing differs from pair to pair.

Next, the target sound/noise separation unit 501 sorts the selected microphone pairs in ascending order of element spacing and stores them in a microphone pair queue (S703). Here k is the index identifying a microphone pair, k = 1 denotes the pair with the shortest element spacing, and k = K denotes the pair with the longest element spacing.

The target sound/noise separation unit 501 determines whether the number of elements in the microphone pair queue is 0 (S704), that is, whether any microphone pair remains. If the number of elements is not 0, the target sound/noise separation unit 501 reads the pair with the shortest element spacing from the queue and removes it from the queue (S705).

The target sound/noise separation unit 501 calculates the phase difference for the microphone pair that was read (S706). Specifically, it first calculates an integer n_k that satisfies the inequality condition for that pair. Because the range enclosed by the inequality spans 2π, a solution always exists.

Next, the target sound/noise separation unit 501 substitutes the calculated integer n_k into the phase difference expression to calculate the phase difference. The initial value used when k = 1 is given by its own defining expression.

After S706, processing returns to S704, and the same processing is performed for every microphone pair.

If S704 determines that the number of elements in the microphone pair queue is 0, the target sound/noise separation unit 501 calculates θ(f, τ), the arrival direction of the sound, from the calculated phase difference. Here d_k is the element spacing of the k-th microphone pair.

The accuracy of the estimated arrival direction improves as the element spacing of the microphone pair increases. However, if the element spacing exceeds half the wavelength of the multi-channel frequency domain frame signal Ef_i(f, τ), a single direction can no longer be identified from the phase difference of the pair, because two or more directions produce the same phase difference (spatial aliasing).

The arrival direction calculation described above is equivalent to the following procedure: of the two or more directions that [Equation 2] would yield for a widely spaced microphone pair, select the one that matches the arrival direction θ(f, τ) obtained unambiguously for a more closely spaced pair in the previous loop iteration. The arrival direction of the sound can therefore be calculated with high accuracy even when spatial aliasing occurs.
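The stepwise procedure of S703 to S706 can be sketched as follows. Because the inequality and the phase expressions are not legible in this text, the sketch uses one common formulation: the unwrapped phase of the previous, more closely spaced pair predicts the phase of the next pair, and the integer n_k is the number of 2π turns that brings the measured phase closest to that prediction. This formulation, the assumption that the closest pair is free of aliasing, and all names are illustrative, not the patent's exact equations.

```python
import math


def wrap_phase(phi):
    """Wrap an angle to the principal range around zero."""
    return math.atan2(math.sin(phi), math.cos(phi))


def spire_doa(wrapped_phases, spacings, freq_hz, c=340.0):
    """Estimate the arrival angle (radians) from several microphone pairs."""
    pairs = sorted(zip(spacings, wrapped_phases))    # ascending spacing (S703)
    d_prev, rho_prev = pairs[0]                      # k = 1: assumed unaliased
    for d_k, rho_k in pairs[1:]:                     # loop S704 to S706
        predicted = rho_prev * d_k / d_prev          # phase expected at spacing d_k
        n_k = round((predicted - rho_k) / (2 * math.pi))
        rho_prev = rho_k + 2 * math.pi * n_k         # restored phase difference
        d_prev = d_k
    s = c * rho_prev / (2 * math.pi * freq_hz * d_prev)
    return math.asin(max(-1.0, min(1.0, s)))
```

For a 3 kHz component and spacings of 2 cm, 6 cm, and 18 cm, the widest pair alone is aliased, but the restored phase still yields the correct angle because each step inherits the disambiguation of the previous one.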

On the basis of the arrival direction θ(f, τ) calculated for each time-frequency point, the target sound/noise separation unit 501 separates the multi-channel frequency domain frame signal Ef_i(f, τ) into a target sound signal E_subject_i(f, τ) and a noise signal E_noise_i(f, τ).

Specifically, in each frequency bin f, the signal is separated according to the index j of the direction range Θ_j that contains the arrival direction θ(f, τ).
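The separation rule itself is not legible in this text. One plausible reading, sketched below, is that a bin whose arrival direction falls in a range flagged by IsReduced_j is assigned entirely to the noise signal, and every other bin is assigned entirely to the target sound signal. The rule, the half-open handling of range boundaries, and the names are assumptions made for illustration.

```python
def find_direction_index(theta, ranges):
    """Return j such that theta lies in ranges[j], or None if no range matches."""
    for j, (low, high) in enumerate(ranges):
        if low <= theta < high:      # half-open so that ranges stay mutually exclusive
            return j
    return None


def split_target_noise(e_bins, thetas, ranges, is_reduced):
    """e_bins[f] holds one complex value per microphone for bin f."""
    target, noise = [], []
    for e, theta in zip(e_bins, thetas):
        j = find_direction_index(theta, ranges)
        zeros = [0j] * len(e)
        if j is not None and is_reduced[j]:
            target.append(zeros)     # bin comes from a direction marked for removal
            noise.append(list(e))
        else:
            target.append(list(e))
            noise.append(zeros)
    return target, noise


ranges = [(-1.6, 0.0), (0.0, 1.6)]
target, noise = split_target_noise(
    [[1 + 0j, 1j], [2 + 0j, 2j]], [-0.5, 0.7], ranges, [True, False])
```

Assigning each bin wholly to one side is what the sparseness assumption permits, since each time-frequency point is taken to contain a single source.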

FIG. 17 is a diagram showing the data structure of the target sound signal E_subject_i(f, τ) in an arbitrary frame τ according to the first embodiment of the present invention. FIG. 18 is a diagram showing the data structure of the noise signal E_noise_i(f, τ) in an arbitrary frame τ according to the first embodiment of the present invention.

The target sound signal E_subject_i(f, τ) is output from the target sound/noise separation unit 501 to the target sound steering vector update unit 502. The noise signal E_noise_i(f, τ) is output from the target sound/noise separation unit 501 to the noise covariance matrix update unit 503.

The target sound steering vector update unit 502 updates the target sound steering vector a_subject(f, τ) = [a_0(f, τ), ..., a_M−1(f, τ)]^T according to its update expression. For stability, the update may be performed only when the absolute value of the target sound signal E_subject_i(f, τ) is sufficiently large. The updated target sound steering vector a_subject(f, τ) is output to the filter update unit 504.

The noise covariance matrix update unit 503 updates the noise covariance matrix R_n(f, τ) according to its update expression. Here the noise signal is the vector E_noise(f, τ) = [E_noise_0(f, τ), ..., E_noise_M−1(f, τ)]^T, and γ_n is a suitable constant parameter that is at least 0 and less than 1. For stability, the update may be performed only when the absolute value of the noise signal E_noise_i(f, τ) is sufficiently large. The updated noise covariance matrix R_n(f, τ) is output to the filter update unit 504.
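The update expression is not legible in this text. A common choice consistent with a forgetting factor γ_n in the range 0 to 1 is exponential smoothing of the outer product of the noise vector with its conjugate transpose, R_n ← γ_n·R_n + (1 − γ_n)·E_noise·E_noise^H. The sketch below implements that form as an assumption; the function name is hypothetical.

```python
def update_noise_covariance(r_n, e_noise, gamma_n=0.9):
    """Return gamma_n * R_n + (1 - gamma_n) * e e^H for one frequency bin.

    r_n     : M x M list of lists of complex values (previous matrix)
    e_noise : list of M complex values, one per microphone
    """
    m = len(e_noise)
    return [[gamma_n * r_n[p][q]
             + (1 - gamma_n) * e_noise[p] * e_noise[q].conjugate()
             for q in range(m)]
            for p in range(m)]


r_n = update_noise_covariance([[0j, 0j], [0j, 0j]], [1 + 0j, 1j], gamma_n=0.5)
```

A value of γ_n close to 1 makes the matrix track the noise slowly and smoothly, while a smaller value follows changes faster at the cost of a noisier estimate.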

The filter update unit 504 calculates the filter w(f, τ) from the input target sound steering vector a_subject(f, τ) and noise covariance matrix R_n(f, τ) according to the filter update expression. Here γ_w is a suitable constant parameter that is at least 0 and less than 1.

The filter multiplication unit 505 applies the filter w(f, τ) to the multi-channel frequency domain frame signal Ef_i(f, τ) to calculate the frequency domain frame signal y(f, τ), from which sound arriving from the designated directions has been removed.
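The filter and output expressions are not legible in this text. For orientation, the textbook minimum variance beamformer, combined with one assumed way of applying the smoothing constant γ_w, can be written as below. Here E(f, τ) denotes the vector [Ef_0(f, τ), ..., Ef_M−1(f, τ)]^T and ŵ is an intermediate symbol introduced only for this sketch; the patent's own expressions may differ.

```latex
\hat{w}(f,\tau) =
  \frac{R_n(f,\tau)^{-1}\, a_{\mathrm{subject}}(f,\tau)}
       {a_{\mathrm{subject}}(f,\tau)^{H}\, R_n(f,\tau)^{-1}\, a_{\mathrm{subject}}(f,\tau)},
\qquad
w(f,\tau) = \gamma_w\, w(f,\tau-1) + (1-\gamma_w)\, \hat{w}(f,\tau),
\qquad
y(f,\tau) = w(f,\tau)^{H}\, E(f,\tau)
```

In this form the filter passes the target direction with unit gain while minimizing the output power contributed by the noise described by R_n.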

The frequency domain frame signal y(f, τ) calculated by the above procedure is output to the time signal generation unit 411 and the direction-specific residual volume calculation unit 415. The time signal generation unit 411 applies an inverse FFT to the input signal y(f, τ) to convert it into the time domain frame signal y(t, τ). It then overlaps and adds the time domain frame signals y(t, τ) at each frame period and multiplies the result by the reciprocal of the window function to obtain the time domain signal y(t). The time signal generation unit 411 outputs the time domain signal y(t) to the audio transmission unit 413.

When a server is used, the audio transmission unit 413 transmits the time domain signal y(t) generated at each site to the server. When no server is used, it transmits the time domain signal y(t) to each site using the TCP/IP or RTP protocol.

The sound source separation unit 408 separates the multi-channel frequency domain frame signal Xf_i(f, τ) into direction-specific frequency domain frame signals Xf_j(f, τ), which are the components from each direction, and outputs the separated signals Xf_j(f, τ) to the volume calculation unit 414 and the direction-specific residual volume calculation unit 415. The processing in the sound source separation unit 408 is described below.

FIG. 8 is a flowchart showing the processing of the sound source separation unit 408 according to the first embodiment of the present invention.

First, the sound source separation unit 408 calculates the arrival direction θ(f, τ) of the sound from the input multi-channel frequency domain frame signal Xf_i(f, τ) (S801). The arrival direction θ(f, τ) is calculated with the SPIRE algorithm described above.

Next, the sound source separation unit 408 calculates the absolute value of the amplitude, A_X(f, τ), for each frequency bin f (S802).

When the calculated θ(f, τ) falls within the direction range Θ_j, the sound source separation unit 408 calculates the direction-specific frequency domain frame signal Xf_j(f, τ) for that range (S803).

FIG. 19 is a diagram showing the data structure of the direction-specific frequency domain frame signal Xf_j(f, τ) in an arbitrary frame τ according to the first embodiment of the present invention.

As shown in FIG. 19, the direction-specific frequency domain frame signal Xf_j(f, τ) calculated by the above processing stores one value for each frequency and each direction range.

The volume calculation unit 414 calculates, from the input direction-specific frequency domain frame signal Xf_j(f, τ), the direction-specific volume in the collected sound signal, P_j(τ). The calculated volume P_j(τ) is output to the suppression amount display unit 106.
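Steps S802 and S803 and the volume calculation can be sketched together as follows. The expressions for A_X(f, τ), Xf_j(f, τ), and P_j(τ) are not legible in this text, so the sketch assumes that Xf_j(f, τ) equals the bin amplitude when θ(f, τ) lies in Θ_j and 0 otherwise, and that P_j(τ) is the sum over bins of the squared magnitude. Both assumptions and all names are illustrative.

```python
def find_direction_index(theta, ranges):
    """Return j such that theta lies in ranges[j], or None if no range matches."""
    for j, (low, high) in enumerate(ranges):
        if low <= theta < high:
            return j
    return None


def separate_by_direction(amplitudes, thetas, ranges):
    """Return xf[j][f]: the bin amplitude in the matching direction, 0.0 elsewhere."""
    xf = [[0.0] * len(amplitudes) for _ in ranges]
    for f, (a, theta) in enumerate(zip(amplitudes, thetas)):
        j = find_direction_index(theta, ranges)
        if j is not None:
            xf[j][f] = a
    return xf


def direction_volumes(xf):
    """Assumed form of P_j: energy summed over the bins assigned to direction j."""
    return [sum(abs(v) ** 2 for v in bins) for bins in xf]


xf = separate_by_direction([1.0, 2.0, 3.0], [-0.5, 0.7, 0.2], [(-1.6, 0.0), (0.0, 1.6)])
p = direction_volumes(xf)
```

Each bin contributes to exactly one direction, so the per-direction volumes add up to the total energy of the bins that fell inside some range.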

The direction-specific residual volume calculation unit 415 uses the frequency domain frame signal y(f, τ) input from the noise removal processing unit 409 and the direction-specific frequency domain frame signal Xf_j(f, τ) input from the sound source separation unit 408 to calculate the volume that remains after sound arriving from within the direction range Θ_j has been suppressed, that is, the direction-specific volume in the residual signal, Q_j(τ). The processing in the direction-specific residual volume calculation unit 415 is described below.

FIG. 9 is a flowchart showing the processing of the direction-specific residual volume calculation unit 415 according to the first embodiment of the present invention.

First, the direction-specific residual volume calculation unit 415 performs initialization (S901). Specifically, it sets the frequency bin f to 0 and sets the direction-specific volume in the residual signal, Q_j(τ), to 0 for every direction range Θ_j.

Next, the direction-specific residual volume calculation unit 415 determines whether f = N − 1 (S902). If f is not N − 1, it sets j to 0 (S908).

The direction-specific residual volume calculation unit 415 determines whether j = J (S903). If j = J, it proceeds to S907. If j is not J, it determines whether the direction-specific frequency domain frame signal Xf_j(f, τ) is 0 (S904).

If Xf_j(f, τ) = 0, the direction-specific residual volume calculation unit 415 replaces j with j + 1 (S905) and returns to S903. If Xf_j(f, τ) is not 0, it adds |y(f, τ)|² to the direction-specific volume in the residual signal, Q_j(τ), and takes the sum as the new Q_j(τ) (S906).

Next, the direction-specific residual volume calculation unit 415 replaces the frequency bin f with f + 1 (S907), returns to S902, and repeats the same processing.

If S902 determines that f = N − 1, the direction-specific residual volume calculation unit 415 outputs the direction-specific volume in the residual signal, Q_j(τ), to the suppression amount display unit 106.

Because the sound source separation unit 408, under the sparseness assumption, assigns the component of each frequency f to exactly one direction range Θ_j, the direction-specific frequency domain frame signal Xf_j(f, τ) is non-zero for at most one direction index j at each frequency bin f. Therefore, once the loop of S903 to S905 finds one direction index j for which Xf_j(f, τ) ≠ 0, processing can move on to the next iteration of the loop of S902 to S907. This allows fast processing.
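The flow of S901 to S908, including the early exit from the inner loop, can be sketched as follows. The sketch visits every frequency bin, which is an assumption about the intent of the termination test in S902; the names are hypothetical.

```python
def residual_volumes(y_bins, xf_by_direction):
    """Accumulate Q_j: the energy of y(f) in the bins assigned to direction j.

    y_bins          : list of N complex values y(f, tau)
    xf_by_direction : xf_by_direction[j][f] is Xf_j(f, tau)
    """
    n_dirs = len(xf_by_direction)
    q = [0.0] * n_dirs                        # S901: initialize every Q_j to 0
    for f, y in enumerate(y_bins):            # outer loop over frequency bins
        for j in range(n_dirs):               # S903 to S905: search for the direction
            if xf_by_direction[j][f] != 0:
                q[j] += abs(y) ** 2           # S906: add |y(f, tau)|^2 to Q_j
                break                         # at most one j is non-zero per bin
    return q


q = residual_volumes([1 + 0j, 2j, 3 + 0j], [[1.0, 0.0, 0.0], [0.0, 0.0, 2.0]])
```

In the example, the middle bin belongs to no direction and is therefore not counted in any Q_j.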

FIG. 10 is a diagram showing an example of the display of the suppression amount display unit 106 according to the first embodiment of the present invention.

In the suppression amount display unit 106, a collected-sound volume meter SEQ_X_Θj and a residual-signal volume meter SEQ_Y_Θj are placed side by side to form one set, SEQ_COMB_Θj, and the sets SEQ_COMB_Θj are arranged on the side surface of a cylindrical housing. Each SEQ_COMB_Θj corresponds to one direction range Θ_j. For the direction range Θ_j = [θ_j1, θ_j2], SEQ_COMB_Θj is preferably placed in the direction θ_jm = (θ_j1 + θ_j2)/2. This makes the correspondence between the position of each SEQ_COMB_Θj and the arrival direction of the sound easy to understand.

The collected-sound volume meter SEQ_X_Θj displays the direction-specific volume in the collected sound signal, P_j(τ). The residual-signal volume meter SEQ_Y_Θj displays the direction-specific volume in the residual signal, Q_j(τ).

FIG. 11 is a diagram showing the correspondence between the number of LEDs lit on the suppression amount display unit 106 and the value of the direction-specific volume in the collected sound signal, P_j(τ), according to the first embodiment of the present invention. FIG. 12 is a diagram showing the correspondence between the number of LEDs lit on the suppression amount display unit 106 and the value of the direction-specific volume in the residual signal, Q_j(τ), according to the first embodiment of the present invention.

When the collected-sound volume meter SEQ_X_Θj consists of eight LEDs, one conceivable scheme is to light between zero and eight LEDs according to the ratio of the direction-specific volume P_j(τ) in the collected sound signal to Pmax. Here, Pmax in the table is the maximum value of P_j(τ), and the LEDs are numbered L1, L2, ..., L8 from the bottom. The same applies to the direction-specific volume Q_j(τ) in the residual signal, as shown in FIG. 12, where Qmax is the maximum value of Q_j(τ).
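The mapping from a volume ratio to a number of lit LEDs can be sketched in Python as follows. The thresholds of the tables in FIG. 11 and FIG. 12 are not reproduced in this text, so the sketch assumes equal-width steps of the ratio; the function names `leds_to_light` and `lit_led_names` are hypothetical.

```python
def leds_to_light(volume: float, max_volume: float, num_leds: int = 8) -> int:
    """Return how many LEDs (0..num_leds) to light for a given volume.

    Assumption: the ratio volume / max_volume is divided into equal-width
    steps and rounded to the nearest step. The actual thresholds are those
    of the table in the corresponding figure.
    """
    if max_volume <= 0:
        return 0
    ratio = min(max(volume / max_volume, 0.0), 1.0)  # clamp to [0, 1]
    return min(num_leds, int(ratio * num_leds + 0.5))


def lit_led_names(count: int) -> list[str]:
    """LEDs are numbered L1, L2, ... from the bottom, so the lowest ones light first."""
    return [f"L{k}" for k in range(1, count + 1)]
```

The same function would serve for the residual-signal meter by passing Q_j(τ) and Qmax instead of P_j(τ) and Pmax.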

The suppression amount display unit 106 is not limited to an LED display as in the present embodiment. For example, it may be another device such as an organic EL display or a liquid crystal display, or it may use another display method that functions as a level meter.

The present embodiment is not limited to a video conference system; it is also applicable to, for example, a videophone function of a mobile phone or a hands-free calling device of a car navigation system.

The embodiments of the present invention are not limited to the shapes of the microphone array 101, the suppression amount display unit 106, and the noise removal operation input unit 107 described above. For example, the microphone elements may be arranged on a hemispherical microphone array 101, with the suppression amount display unit 106 and the noise removal operation input unit 107 arranged so as to correspond to the individual microphone elements. In that case, sound can be separated and displayed in three-dimensional directions including height, rather than only in two-dimensional directions.

[Second Embodiment]
FIG. 20 is a block diagram showing the configuration of the video conference system according to the second embodiment of the present invention.

In the second embodiment, the multi-channel frequency domain frame signal Ef_i(f, τ) calculated by the multi-channel acoustic echo canceller unit 407 is output to the sound source separation unit 408 and the noise removal processing unit 409.

With the configuration described above, the first embodiment displays the result of both echo removal and noise removal as the suppressed sound, whereas the second embodiment displays the result of noise removal alone as the suppressed sound.

[Third Embodiment]
The suppression amount display unit 106 is not limited to a form that displays the direction-specific volume P_j(τ) in the collected sound signal and the direction-specific volume Q_j(τ) in the residual signal.

For example, the suppression amount display unit 106 may define the difference between the direction-specific volume P_j(τ) in the collected sound signal and the direction-specific volume Q_j(τ) in the residual signal as a suppression amount R_j(τ), and display the suppression amount R_j(τ).

Alternatively, the suppression amount display unit 106 may define the ratio between the direction-specific volume P_j(τ) in the collected sound signal and the direction-specific volume Q_j(τ) in the residual signal as the suppression amount R_j(τ), and display the suppression amount R_j(τ).

It is desirable that the suppression amount display unit 106 display a measure that shows the relative difference in magnitude between the direction-specific volume P_j(τ) in the collected sound signal and the direction-specific volume Q_j(τ) in the residual signal.
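The two definitions of the suppression amount described for the third embodiment can be sketched in Python. The defining expressions are not reproduced in this text, so the operand order (collected volume first, residual volume second) and the small floor `eps` that guards against division by zero are assumptions, and both function names are hypothetical.

```python
def suppression_difference(p_j: float, q_j: float) -> float:
    """Suppression amount as a difference: collected volume minus residual volume.

    Assumption: the operand order is P_j - Q_j.
    """
    return p_j - q_j


def suppression_ratio(p_j: float, q_j: float, eps: float = 1e-12) -> float:
    """Suppression amount as a ratio: collected volume over residual volume.

    Assumption: the ratio is P_j / Q_j; eps avoids dividing by zero when
    the residual volume is zero.
    """
    return p_j / max(q_j, eps)
```

Either value grows as more of the sound from direction range j is removed, which is the relative measure the display is meant to convey.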

FIG. 21 is a diagram illustrating an example of the display on the suppression amount display unit 106 according to the third embodiment of the present invention.

As shown in FIG. 21, one conceivable display method is for the suppression amount display unit 106 to show the suppression amount R_j(τ) and the direction-specific volume Q_j(τ) in the residual signal by means of display icons.

FIG. 22 is a diagram illustrating the correspondence between display icons and the suppression amount R_j(τ) according to the third embodiment of the present invention. FIG. 23 is a diagram illustrating the correspondence between display icons and the direction-specific volume Q_j(τ) in the residual signal according to the third embodiment of the present invention.

As shown in FIG. 22, one conceivable method is to associate the ratio of the suppression amount R_j(τ) to Rmax with a display icon, where Rmax is the maximum value of R_j(τ). Likewise, as shown in FIG. 23, the ratio of the direction-specific volume Q_j(τ) in the residual signal to Qmax may be associated with a display icon, where Qmax is the maximum value of Q_j(τ).

In the third embodiment described above, the user can grasp the volume more intuitively than in the first embodiment.

[Fourth Embodiment]
The method by which the sound source separation unit 408 separates sound by sound source is not limited to that of the first embodiment; the separation can also be performed by other processing.

FIG. 24 is a block diagram showing the configuration of the video conference system according to the fourth embodiment of the present invention.

As shown in FIG. 24, the multi-channel time domain frame signal Xf_i(t, τ) output from the multi-channel frame processing unit 402 is input to the sound source separation unit 2608. The processing of the sound source separation unit 2608 is described below.

FIG. 25 is a flowchart showing the processing of the sound source separation unit 2608 according to the fourth embodiment of the present invention.

The sound source separation unit 2608 takes the multi-channel time domain frame signal Xf_i(t, τ) as input, calculates a SIMO-ICA filter (see Non-Patent Document 2), and updates the SIMO-ICA filter (S2701). The methods described in Non-Patent Document 2 can be used to calculate and update the filter.

Next, the sound source separation unit 2608 multiplies the multi-channel time domain frame signal Xf_i(t, τ) by the updated SIMO-ICA filter to separate it by sound source (S2702). This separation yields S signals Xf_s_i(t, τ). Here, s is an integer from 0 to S−1 that indexes the sound sources; each is hereinafter referred to as sound source s. S is the maximum number of sound sources and is no greater than the number of microphone elements (M). In other words, the separated signal Xf_s_i(t, τ) represents the sound of sound source s as input to microphone element i.

The sound source separation unit 2608 converts the separated signal Xf_s_i(t, τ) into a frequency domain frame signal Xf_s_i(f, τ) (S2703).

The sound source separation unit 2608 sets the sound source index s to 0 (S2704) and then determines whether s = S−1 (S2705).

If it is determined that s = S−1 does not hold, the sound source separation unit 2608 sets the frequency bin f to 0 and initializes the direction histogram h_s(θ) (S2706). Specifically, h_s(θ) = 0 is set for every arrival direction θ of each sound source s. The direction histogram h_s(θ) is a histogram representing the angular distribution for a given sound source. Next, the sound source separation unit 2608 determines whether f = N−1 (S2707).

If it is determined that f = N−1 does not hold, the sound source separation unit 2608 calculates the arrival direction θ(f, τ) of the sound (S2708). As in the first embodiment, the arrival direction θ(f, τ) is calculated using the SPIRE algorithm.

Next, the sound source separation unit 2608 sets the calculated arrival direction θ(f, τ) as θ_h and, in accordance with

[Expression]

votes into the direction histogram h_s(θ_h) (S2709). Then f + 1 is defined as the new frequency bin f (S2710), and the process returns to S2707. The processing of S2708 to S2710 is repeated in the same way for all frequencies (until f = N−1).

If it is determined in S2707 that f = N−1, the sound source separation unit 2608 searches for a direction peak in the direction histogram h_s(θ_h) built by the loop of S2707 to S2710 (S2711). Because of the separation by the SIMO-ICA filter, the frequency domain frame signal Xf_s_i(f, τ) theoretically contains the component of a single sound source. The direction peak can therefore be found simply by taking the θ_h at which the distribution of the direction histogram h_s(θ_h) is maximal. The θ_h obtained is taken as the arrival direction θ_s of the sound of sound source s.
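The voting and peak search of S2706 to S2711 can be sketched in Python as follows. The voting expression is not reproduced in this text, so the sketch assumes that each frequency bin casts one unit vote; the bin width of 10° and the function name `direction_peak` are likewise assumptions. The per-frequency directions are taken as already estimated (for example by the SPIRE algorithm).

```python
def direction_peak(directions_deg, bin_width_deg=10.0):
    """Vote per-frequency direction estimates into a histogram and
    return the center (in degrees) of the most-voted bin.

    Assumptions: unit votes, and a histogram covering 0..360 degrees in
    equal-width bins.
    """
    num_bins = int(round(360.0 / bin_width_deg))
    histogram = [0] * num_bins                    # h_s(theta) = 0 (S2706)
    for theta in directions_deg:                  # one estimate per frequency bin f
        index = int((theta % 360.0) // bin_width_deg) % num_bins
        histogram[index] += 1                     # unit vote (assumed form of S2709)
    peak = max(range(num_bins), key=histogram.__getitem__)  # peak search (S2711)
    return (peak + 0.5) * bin_width_deg
```

Because each separated signal is expected to contain a single source, taking the single largest bin is sufficient here; no multi-peak search is attempted.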

After searching for the direction peak, the sound source separation unit 2608 defines s + 1 as the new sound source index s (S2712), returns to S2705, and repeats the same processing.

If it is determined in S2705 that s = S−1, the sound source separation unit 2608 associates the arrival direction θ_s of the sound of each sound source s, calculated by the loop of S2705 to S2712 described above, with a direction range Θ_j of the suppression amount display unit 106 (S2713).

Through the above processing, the sound source separation unit 2608 calculates the direction-specific frequency domain frame signal Xf_j(f, τ), separated by direction, based on the calculated arrival direction θ_s of the sound of each sound source s, and outputs Xf_j(f, τ) to the volume calculation unit 414 and the direction-specific residual volume calculation unit 2615. The following describes how the calculated arrival direction θ_s of each sound source s is associated with the display range Θ_j and how Xf_j(f, τ) is calculated.

As one method, for example, for each j whose direction range Θ_j contains θ_s, the direction-specific frequency domain frame signal Xf_j(f, τ) is calculated based on

[Expression]
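The containment-based association can be sketched in Python. The expression for Xf_j(f, τ) is not reproduced in this text, so several points are assumptions: the direction ranges are taken to be equal-width sectors covering 360°, each source is represented by its spectrum at one reference microphone, and spectra of sources that fall in the same range are summed. The function names are hypothetical.

```python
def range_index(theta_deg: float, num_ranges: int) -> int:
    """Index j of the equal-width direction range that contains theta_deg."""
    width = 360.0 / num_ranges
    return int((theta_deg % 360.0) // width) % num_ranges


def spectra_by_direction(source_spectra, source_directions_deg, num_ranges):
    """Collect per-source spectra into per-direction spectra.

    source_spectra[s][f] is the (complex) spectrum of source s at an assumed
    reference microphone; source_directions_deg[s] is its direction theta_s.
    Assumption: sources sharing a range are summed.
    """
    num_bins = len(source_spectra[0])
    out = [[0j] * num_bins for _ in range(num_ranges)]
    for spectrum, theta in zip(source_spectra, source_directions_deg):
        j = range_index(theta, num_ranges)
        for f, value in enumerate(spectrum):
            out[j][f] += value
    return out
```

Ranges that contain no source keep an all-zero spectrum, so their meters would show no volume.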

As another method, for example, consider the cost function C(δ) given by

[Expression]

Here, δ(s, j) is a function such that, for any sound source s, it is 1 for exactly one j and 0 for every other j, and, for any j, it is 1 for exactly one s and 0 for every other s. That is, δ(s, j) represents a one-to-one correspondence between sound sources s and direction ranges j. Further, dist(θ_s, Θ_j) is a distance function such as

[Expression]

As a method of association, for example, the δ that maximizes the cost function C(δ) may be found, thereby associating each sound source s with a direction range j. Then the separated signal Xf_s_i(t, τ) of the s for which δ(s, j) = 1 is substituted into

[Expression]

to calculate the direction-specific frequency domain frame signal Xf_j(f, τ). With this method, the sound sources s can be separated even when their directions are close to one another and the directions that the suppression amount display unit 106 can display are discrete.
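A one-to-one association of this kind can be sketched in Python by exhaustive search. The cost function and the distance function are not reproduced in this text, so the sketch makes its own assumptions: the distance is the angular distance from θ_s to the center of each direction range, and an assignment is scored by the negative total distance, so that maximizing the score selects the closest overall pairing. It also allows fewer sources than ranges. All names are hypothetical.

```python
from itertools import permutations


def angular_distance(a_deg: float, b_deg: float) -> float:
    """Smallest absolute angle between two directions, in degrees."""
    d = abs(a_deg - b_deg) % 360.0
    return min(d, 360.0 - d)


def assign_sources(source_dirs, range_centers):
    """Return a tuple a where a[s] is the range index assigned to source s.

    Each range is used at most once (one-to-one association). The score
    being maximized is the negative sum of angular distances (assumption).
    """
    if len(source_dirs) > len(range_centers):
        raise ValueError("more sources than direction ranges")
    best, best_score = None, float("-inf")
    for candidate in permutations(range(len(range_centers)), len(source_dirs)):
        score = -sum(
            angular_distance(source_dirs[s], range_centers[j])
            for s, j in enumerate(candidate)
        )
        if score > best_score:
            best, best_score = candidate, score
    return best
```

Exhaustive search is only practical for the small numbers of sources and ranges considered here; two sources at 40° and 60° with range centers at 45°, 135°, 225°, and 315° are assigned to separate ranges even though both lie nearest to the first center.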

Next, the direction-specific residual volume calculation unit 2615 is described.

FIG. 26 is a flowchart showing the processing of the direction-specific residual volume calculation unit 2615 according to the fourth embodiment of the present invention.

The direction-specific residual volume calculation unit 2615 performs initialization (S2801). Specifically, it sets the frequency bin f to 0 and sets the direction-specific volume Q_j(τ) in the residual signal to 0. Next, the direction-specific residual volume calculation unit 2615 determines whether f = N−1 (S2802).

If it is determined that f = N−1 does not hold, the direction-specific residual volume calculation unit 2615 calculates P_sum based on

[Expression]

(S2803) and sets the direction range j to 0 (S2804). Next, the direction-specific residual volume calculation unit 2615 determines whether j = J−1 (S2805).

If it is determined that j = J−1 does not hold, the direction-specific residual volume calculation unit 2615 substitutes the P_sum calculated in S2803 into

[Expression]

to calculate Q_j (S2806).

Next, the direction-specific residual volume calculation unit 2615 defines j + 1 as the new direction range j (S2807), returns to S2805, and repeats the same processing.

If it is determined in S2805 that j = J, the direction-specific residual volume calculation unit 2615 defines f + 1 as the new frequency bin f (S2808), returns to S2802, and repeats the same processing.

If it is determined in S2802 that f = N−1, the direction-specific residual volume calculation unit 2615 outputs the direction-specific volume Q_j(τ) in the residual signal, calculated by the loop of S2802 to S2808, to the suppression amount display unit 106.
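The loop structure of S2801 to S2808 can be sketched in Python. The expressions for P_sum and Q_j are not reproduced in this text, so both formulas below are assumptions: P_sum is taken as the total power of the direction-specific signals at frequency bin f, and the residual power at that bin is distributed to each direction in proportion to that direction's share of P_sum. The function and parameter names are hypothetical.

```python
def residual_volume_by_direction(direction_spectra, residual_spectrum):
    """Return Q_j for every direction range j.

    direction_spectra[j][f] is Xf_j(f, tau); residual_spectrum[f] is the
    residual signal after noise removal. Both formulas are assumed forms.
    """
    num_ranges = len(direction_spectra)
    num_bins = len(residual_spectrum)
    q = [0.0] * num_ranges                              # initialization (S2801)
    for f in range(num_bins):                           # frequency loop (S2802, S2808)
        p_sum = sum(abs(direction_spectra[j][f]) ** 2   # assumed P_sum (S2803)
                    for j in range(num_ranges))
        if p_sum == 0.0:
            continue                                    # nothing to distribute at this bin
        residual_power = abs(residual_spectrum[f]) ** 2
        for j in range(num_ranges):                     # direction loop (S2804 to S2807)
            share = abs(direction_spectra[j][f]) ** 2 / p_sum
            q[j] += share * residual_power              # assumed Q_j update (S2806)
    return q
```

Under these assumptions the values Q_j add up to the total residual power over all bins where any direction has energy, so the meters together account for what remains after noise removal.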

The display method of the suppression amount display unit 106 is the same as in the first embodiment or the third embodiment.

The present embodiment is not limited to a video conference system; it is also applicable to, for example, a videophone function of a mobile phone or a hands-free calling device of a car navigation system. In addition, because the sound source separation unit 2608 uses a separation method that does not require the assumption that speech is sparse, the embodiment is not limited to conversational speech and can also be applied to other kinds of sound, such as environmental sounds or musical sounds.

[Fifth Embodiment]
The present invention is also applicable to a voice recording device such as an IC recorder.

FIG. 27 is a diagram showing an example of the hardware configuration of the voice recording device according to the fifth embodiment of the present invention.

The voice recording device 2000 comprises: a microphone array 101 consisting of one or more microphone elements; an A/D converter 2002 that converts the analog sound pressure values input from the microphone array 101 into digital data; a central processing unit 103 that processes the digital data output from the A/D converter 2002; a volatile memory 104 connected to the central processing unit 103; a storage medium 105, connected to the central processing unit 103, that stores programs and information such as the physical arrangement of the microphone elements of the microphone array 101; a suppression amount display unit 106; a noise removal operation input unit 107; an audio cable 112; a digital cable 113; a digital cable 114; and a digital cable 115.

Because the fifth embodiment does not need to exchange data with a far end, it need not handle images and therefore needs no camera or monitor. In addition, because the device does not reproduce audio, the A/D converter 2002 does not need a D/A conversion function. The A/D converter 2002 therefore performs only the conversion of the input multi-channel sound pressure data into multi-channel digital sound pressure data.

The arrangement of the microphone array 101, the suppression amount display unit 106, and the noise removal operation input unit 107 is the same as in the first embodiment. The processing in the central processing unit 103, the suppression amount display device 2006, and the noise removal operation input unit 107 is also the same as in the first embodiment.

FIG. 28 is a block diagram showing the configuration of the voice recording device according to the fifth embodiment of the present invention.

As shown in FIG. 28, the voice recording device according to the present embodiment need not include the audio reception unit 404, the audio reproduction unit 406, or the multi-channel acoustic echo canceller unit 407. The direction to be removed as noise may be chosen manually by the user through the noise removal operation input unit 2007, or it may be determined by a value preset in the voice recording device. In the latter case, the noise removal operation input unit 2007 need not be included in the configuration of the voice recording device.

With the voice recording device according to the present embodiment, when noise or a voice that should not be recorded arrives from a certain direction and the device is operated so as to remove sound arriving from that direction as noise, the user can converse while confirming that the noise or unwanted voice is sufficiently removed and that the voice to be recorded is being recorded.

The present embodiment is not limited to IC recorders and can be applied as-is to, for example, the audio recording mechanism of a video camera. As in the fourth embodiment, the voice recording device may also be configured with the sound source separation unit 2608 using SIMO-ICA and the direction-specific residual volume calculation unit 2615. In that case, as described for the fourth embodiment, the sound source separation unit 2608 uses a separation method that does not require the assumption that speech is sparse, so the device is not limited to conversational speech and can also be applied to other kinds of sound, such as environmental sounds or musical sounds.

The embodiments of the present invention are not limited to a particular shape of the voice recording device. For example, the device may be hemispherical. In that case, sound can be separated and displayed in three-dimensional directions including height, rather than only in two-dimensional directions.

FIG. 1 is a diagram showing the hardware configuration of the video conference system according to the first embodiment of the present invention.
FIG. 2 is a diagram showing an example of use of the video conference system according to the first embodiment of the present invention.
FIG. 3 is a diagram showing an example of a time chart of the relationship among each user's utterances, input operations on the noise removal operation input unit, the display of the suppression amount display device, and the volume of the sound pressure data transmitted to the far end, in a video conference system that is an example of the sound collection system of the present invention.
FIG. 4 is a block diagram showing the configuration of the video conference system according to the first embodiment of the present invention.
FIG. 5 is a block diagram showing a configuration example of the noise removal processing unit using a minimum variance beamformer.
FIG. 6 is a diagram schematically showing the signal of one channel of the input multi-channel frequency domain frame signal Ef_i(f, τ).
FIG. 7 is a flowchart showing the method of calculating θ (the SPIRE algorithm) when the microphone array has three or more microphone elements.
FIG. 8 is a flowchart showing the processing of the sound source separation unit according to the first embodiment of the present invention.
FIG. 9 is a flowchart showing the processing of the direction-specific residual volume calculation unit according to the first embodiment of the present invention.
FIG. 10 is a diagram showing an example of the display on the suppression amount display unit according to the first embodiment of the present invention.
FIG. 11 is a diagram showing the correspondence between the number of LEDs lit on the suppression amount display unit according to the first embodiment of the present invention and the value of the direction-specific volume P_j(τ) in the collected sound signal.
FIG. 12 is a diagram showing the correspondence between the number of LEDs lit on the suppression amount display unit according to the first embodiment of the present invention and the value of the direction-specific volume Q_j(τ) in the residual signal.
FIG. 13 is a flowchart showing the series of processes of the video conference system according to the first embodiment of the present invention.
FIG. 14 is an explanatory diagram showing the data structure of the multi-channel frequency domain frame signal Xf_i(f, τ) in an arbitrary frame τ according to the first embodiment of the present invention.
FIG. 15 is an explanatory diagram showing the data structure of the multi-channel frequency domain frame signal Ef_i(f, τ) in an arbitrary frame τ according to the first embodiment of the present invention.
FIG. 16 is an explanatory diagram showing the data structure of IsReduced_j(τ) in an arbitrary frame τ according to the first embodiment of the present invention.
FIG. 17 is a diagram showing the data structure of the target sound signal E_subject_i(f, τ) in an arbitrary frame τ according to the first embodiment of the present invention.
FIG. 18 is a diagram showing the data structure of the noise signal E_noise_i(f, τ) in an arbitrary frame τ according to the first embodiment of the present invention.
FIG. 19 is a diagram showing the data structure of the direction-specific frequency domain frame signal Xf_j(f, τ) in an arbitrary frame τ according to the first embodiment of the present invention.
FIG. 20 is a block diagram showing the configuration of the video conference system according to the second embodiment of the present invention.
FIG. 21 is a diagram showing an example of the display on the suppression amount display unit according to the third embodiment of the present invention.
FIG. 22 is a diagram showing the correspondence between display icons and the suppression amount R_j(τ) according to the third embodiment of the present invention.
FIG. 23 is a diagram showing the correspondence between display icons and the direction-specific volume Q_j(τ) in the residual signal according to the third embodiment of the present invention.
FIG. 24 is a block diagram showing the configuration of the video conference system according to the fourth embodiment of the present invention.
FIG. 25 is a flowchart showing the processing of the sound source separation unit according to the fourth embodiment of the present invention.
FIG. 26 is a flowchart showing the processing of the direction-specific residual volume calculation unit according to the fourth embodiment of the present invention.
FIG. 27 is a diagram showing an example of the hardware configuration of the voice recording device according to the fifth embodiment of the present invention.
FIG. 28 is a block diagram showing the configuration of the voice recording device according to the fifth embodiment of the present invention.

Explanation of symbols

100 Video conference system
101 Microphone array
102 A/D-D/A converter
103 Central processing unit
104 Volatile memory
105 Storage medium
106 Suppression amount display unit
107 Noise removal operation input unit
108 Speaker
109 Camera
110 Image display device
111 Hub
112 Audio cable
113 to 115 Digital cables
116 Audio cable
117 Digital cable
118 Monitor cable
119 LAN cable
U1, U2 Users
203 Central processing unit
208 Speaker
t1, t3, t5 Time periods
t2, t4 Times
401 Multi-channel A/D conversion unit
402 Multi-channel frame processing unit
403 Multi-channel short-time frequency analysis unit
404 Audio reception unit
405 D/A conversion unit
406 Audio reproduction unit
407 Multi-channel acoustic echo canceller unit
408 Sound source separation unit
409 Noise removal processing unit
410 Microphone arrangement
411 Time signal generation unit
413 Audio transmission unit
414 Volume calculation unit
415 Direction-specific residual volume calculation unit
501 Target sound / noise separation unit
502 Target sound steering vector update unit
503 Noise covariance matrix update unit
504 Filter update unit
505 Filter update unit
S701 to S707 Steps
S801 to S803 Steps
S901 to S908 Steps
S1301 to S1314 Steps
L1 to L8 LED numbers
2000 Voice recording device
2002 A/D converter
2608 Sound source separation unit
2615 Direction-specific residual volume calculation unit
S2701 to S2713 Steps
S2801 to S2808 Steps

Claims

A sound collection system comprising: a microphone array composed of two or more microphones; and a processing unit that converts a signal output from the microphone array,
wherein the processing unit includes:
a sound source separation unit that separates the signal output from the microphone array for each direction in which a sound source exists;
a noise removal processing unit that removes noise from the signal output from the microphone array; and
a direction-specific residual signal calculation unit that calculates a volume for each direction of a residual signal, based on the signal output from the sound source separation unit and the residual signal output from the noise removal processing unit, and
wherein the sound collection system further comprises a suppression amount display unit that displays the volume for each direction of the residual signal based on a calculation result by the direction-specific residual signal calculation unit.

The sound collection system according to claim 1, wherein
the sound source separation unit
calculates, for each frequency in the time-frequency domain divided by time component and frequency component, the direction in which a sound source exists, and
separates the signal output from the microphone array for each direction in which a sound source exists, based on the calculated direction, and
the direction-specific residual signal calculation unit
determines whether the sound source exists in the time-frequency domain divided by the time component and the frequency component, and
calculates the volume of the residual signal for each direction based on the determination result.
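
For illustration only, the per-bin processing recited in the claim above can be sketched as follows. This is a minimal sketch under stated assumptions, not the method of the specification: it assumes two microphones, a far-field source, a phase-difference direction estimate, a fixed magnitude threshold as the source-presence test, and equal-width angular sectors. The names `bin_direction`, `residual_power_by_direction`, `presence_threshold`, and `n_sectors`, and the sound speed value, are hypothetical.

```python
import cmath
import math

SOUND_SPEED = 340.0  # m/s; assumed value, not taken from the claims


def bin_direction(x1: complex, x2: complex, freq_hz: float, mic_distance: float) -> float:
    """Estimate the arrival angle (radians) of one time-frequency bin.

    Assumes a far-field source and two microphones spaced mic_distance apart,
    with microphone 2 receiving the wavefront later for positive angles.
    """
    phase = cmath.phase(x1 * x2.conjugate())
    sin_theta = phase * SOUND_SPEED / (2.0 * math.pi * freq_hz * mic_distance)
    # Clamp against noise so that asin stays defined.
    return math.asin(max(-1.0, min(1.0, sin_theta)))


def residual_power_by_direction(mic1, mic2, residual, freqs_hz, mic_distance,
                                n_sectors=4, presence_threshold=1e-3):
    """Accumulate residual power per direction sector for one frame.

    mic1, mic2: observed spectra (complex values, one per frequency bin).
    residual:   spectrum after noise removal, on the same bins.
    A bin counts toward a sector only if a source is judged present
    (here: observed magnitude at or above presence_threshold, an assumed test).
    """
    power = [0.0] * n_sectors
    for x1, x2, r, f in zip(mic1, mic2, residual, freqs_hz):
        if f <= 0.0 or abs(x1) < presence_threshold:
            continue  # no source judged present in this bin
        theta = bin_direction(x1, x2, f, mic_distance)
        # Map [-pi/2, pi/2] onto sector indices 0 .. n_sectors - 1.
        sector = min(int((theta + math.pi / 2.0) / math.pi * n_sectors), n_sectors - 1)
        power[sector] += abs(r) ** 2
    return power
```

The returned list holds one residual power value per sector, which a display could then render per direction. The phase-based estimate is unambiguous only while the microphone spacing is below half a wavelength at the analyzed frequency.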

The sound collection system according to claim 1, wherein
the sound source separation unit
calculates, for each frequency of the sound collected by each microphone, the direction in which a sound source exists, and
separates the signal output from the microphone array for each direction in which a sound source exists, based on the direction calculated for each frequency, and
the direction-specific residual signal calculation unit
calculates, as the volume of the residual signal for each direction, the relative value of the magnitude of the separated signal in that direction with respect to the sum of the magnitudes of the signals output from the microphone array and separated for each direction in which a sound source exists.
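
As a hedged illustration of the relative value recited in the claim above, the magnitude in each direction can be divided by the sum of the magnitudes over all separated directions. The function name `relative_volume` and the dictionary keyed by direction label are hypothetical, and returning zero for an all-silent input is an assumption of this sketch.

```python
def relative_volume(magnitudes: dict[str, float]) -> dict[str, float]:
    """Return each direction's magnitude as a fraction of the total.

    magnitudes maps a direction label to the magnitude of the signal
    separated for that direction (non-negative values assumed).
    """
    total = sum(magnitudes.values())
    if total == 0.0:
        # Assumed convention: with no signal at all, every direction reads zero.
        return {direction: 0.0 for direction in magnitudes}
    return {direction: m / total for direction, m in magnitudes.items()}
```

For example, magnitudes of 3.0 for one direction and 1.0 for another give relative values of 0.75 and 0.25.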

The sound collection system according to claim 1, wherein
the processing unit further includes a volume calculation unit that calculates the volume of the signal output from the sound source separation unit, and
the suppression amount display unit further displays the volume of the signal output from the sound source separation unit, based on a calculation result of the volume calculation unit.

The sound collection system according to claim 1, wherein the suppression amount display unit displays a difference between a volume of the residual signal and a volume of a signal output from the sound source separation unit.
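
One way to picture the displayed difference, offered only as a sketch: express both volumes in decibels and subtract. The decibel scale, the power floor, and the names `to_db` and `suppression_db` are assumptions; the claim itself does not fix a unit.

```python
import math


def to_db(power: float, floor: float = 1e-12) -> float:
    """Convert a power value to decibels, flooring it to avoid log of zero."""
    return 10.0 * math.log10(max(power, floor))


def suppression_db(separated_power: float, residual_power: float) -> float:
    """Difference in dB between the separated-signal volume and the residual volume.

    A larger value means more of that direction's sound was removed.
    """
    return to_db(separated_power) - to_db(residual_power)
```

With a separated-signal power of 1.0 and a residual power of 0.01, the difference is about 20 dB.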

The sound collection system according to claim 1, wherein the suppression amount display unit includes a display that displays the volume of the residual signal for each direction in which a sound source exists.

The sound collection system according to claim 1, wherein the suppression amount display unit includes a display that displays a volume of a signal output from the sound source separation unit for each direction in which the sound source exists.

A sound display method in a sound collection device comprising: a microphone array including two or more microphones; a processing unit that converts a signal output from the microphone array; and a suppression amount display unit that displays a volume of the converted signal,
wherein the processing unit
separates the signal output from the microphone array for each direction in which a sound source exists,
removes noise from the signal output from the microphone array, and
calculates, based on the residual signal from which the noise has been removed, a volume of the residual signal for each direction, and
the suppression amount display unit
displays the calculated volume of the residual signal for each direction.

The sound display method according to claim 8, wherein
the processing unit,
when calculating the volume of the residual signal for each direction,
calculates, for each frequency in the time-frequency domain divided by time component and frequency component, the direction in which a sound source exists,
separates the signal output from the microphone array for each direction in which a sound source exists, based on the calculated direction,
determines whether the sound source exists in the time-frequency domain divided by the time component and the frequency component, and
calculates the volume of the residual signal for each direction based on the determination result.

The sound display method according to claim 8, wherein
the processing unit,
when calculating the volume of the residual signal for each direction,
calculates, for each frequency of the sound collected by each microphone, the direction in which a sound source exists,
calculates statistics of the directions in which a sound source exists, as calculated for each frequency,
separates the signal output from the microphone array for each direction in which a sound source exists, based on the calculated statistics, and
calculates, as the volume of the residual signal for each direction, the relative value of the magnitude of the separated signal in a predetermined direction with respect to the sum of the magnitudes of the signals output from the microphone array and separated for each direction in which a sound source exists.
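
The claim above does not say which statistics are taken over the per-frequency directions. As one possible reading, assumed here purely for illustration, the estimates can be collected into a histogram and the well-populated bins treated as directions in which a sound source exists. The names `direction_histogram` and `dominant_directions`, the 30-degree bin width, and the `min_count` threshold are all hypothetical.

```python
import math
from collections import Counter


def direction_histogram(angles_deg, bin_width_deg: float = 30.0) -> Counter:
    """Count per-frequency direction estimates in fixed-width angular bins.

    Each key is the lower edge (in degrees) of a histogram bin.
    """
    counts = Counter()
    for angle in angles_deg:
        lower_edge = math.floor(angle / bin_width_deg) * bin_width_deg
        counts[lower_edge] += 1
    return counts


def dominant_directions(angles_deg, bin_width_deg: float = 30.0, min_count: int = 2):
    """Return the lower edges of bins holding at least min_count estimates."""
    hist = direction_histogram(angles_deg, bin_width_deg)
    return sorted(edge for edge, n in hist.items() if n >= min_count)
```

Signals could then be separated only for the directions returned by `dominant_directions`, which keeps isolated, noisy per-frequency estimates from being treated as sources.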