JP2015070291A

JP2015070291A - Sound collection/emission device, sound source separation unit and sound source separation program

Info

Publication number: JP2015070291A
Application number: JP2013199981A
Authority: JP
Inventors: Katsuyuki Takahashi
Original assignee: Oki Electric Industry Co Ltd
Current assignee: Oki Electric Industry Co Ltd
Priority date: 2013-09-26
Filing date: 2013-09-26
Publication date: 2015-04-13

Abstract

PROBLEM TO BE SOLVED: To provide a sound collection/emission device that extracts a target sound from an intended sound source well, regardless of the type of emitted sound, even while sound is being emitted. SOLUTION: A sound collection/emission device emits sound from one or more speakers and captures ambient sound with two microphones. The device comprises: emitted non-target sound removal means, which adapts an acoustic echo canceller structure and removes the emitted non-target sound, that is, the sound emitted from the speakers and captured by the respective microphones; and sound source separation means, which extracts the target sound from the output of the removal means. The sound source separation means generates a parameter for sound source separation from the input sound signal from which the emitted non-target sound has been removed, corrects that parameter on the basis of a pseudo emitted non-target sound signal generated in the removal means, and uses the corrected parameter for sound source separation.

Description

The present invention relates to a sound collection/emission device, a sound source separation unit, and a sound source separation program. It can be applied, for example, to communication terminals and audio equipment that need to separate, from speech or other sound captured by microphones, only the sound arriving from a sound source in a predetermined direction (hereinafter referred to as the target sound).

For example, when call speech is input to a smartphone, or when a voice command is input to an audio device or a smartphone, it is preferable that the device receiving the speech extract only the sound coming from the front, where the user's mouth is assumed to be, while distinguishing it from speech, music, noise, and other sound coming from other directions.

Patent Document 1 describes a method (a sound source separation method) that captures the sound input to two microphones, suppresses ambient noise on the basis of the phase difference between the input sounds (electrical signals), and extracts the target sound arriving from a predetermined direction (for example, the front) relative to the microphones.

The target sound extraction method described as the third embodiment in Patent Document 1 forms two directivities having blind spots (nulls) to the left and to the right of the microphones, and multiplies the input sound signal, for each frequency component, by a suppression coefficient that depends on the correlation between the two resulting signals, thereby suppressing noise components (non-target sounds) arriving from the left and right. The target sound extraction method described as the fourth embodiment in Patent Document 1 forms a directivity having a blind spot toward the front of the microphones and subtracts the resulting signal from the input sound signal as the noise component arriving from the left and right, thereby suppressing those noise components (non-target sounds).
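The front blind spot used in the fourth embodiment of Patent Document 1 can be illustrated with a minimal time-domain sketch. It assumes two ideal, matched microphones, a front source that reaches both at the same instant, and a side source that reaches the second microphone one whole sample later. The function names are hypothetical, and the per-frequency processing of Patent Document 1 is not reproduced here.

```python
def front_null(mic_l, mic_r):
    """Difference of the two channels: zero for a source straight ahead."""
    return [a - b for a, b in zip(mic_l, mic_r)]


def delayed(signal, k):
    """Delay a signal by k whole samples, padding the start with zeros."""
    return [0.0] * k + list(signal[:len(signal) - k])


# Assumed toy signals: a target in front and a noise source on the left.
front = [0.0, 1.0, 0.5, -0.5, -1.0, 0.0]
side = [0.3, -0.2, 0.4, 0.1, -0.3, 0.2]

mic_l = [f + s for f, s in zip(front, side)]
mic_r = [f + s for f, s in zip(front, delayed(side, 1))]

# The front source cancels; only a filtered copy of the side noise remains.
noise_estimate = front_null(mic_l, mic_r)
```

Under these assumptions `noise_estimate` contains no contribution from `front`, which is why such a signal can serve as an estimate of the noise arriving from the sides.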

Patent Document 1: JP 2013-061421 A

Non-Patent Document 1: Nobuhiko Kitawaki, "Digital Voice / Audio Technology (Future Netto Technology Series)", published by the Telecommunications Association, pp. 218-243, 1999

In recent years, as shown in FIG. 8, sound collection/emission devices 1 have come into use in which a pair of speakers 3L and 3R are placed on both sides of, and connected to, a sound collecting device 2 having a communication function, such as a portable terminal (for example, a smartphone or a tablet terminal), and a call is made with a remote location in this configuration. With the same configuration, a method is also being studied in which the device accepts a voice command spoken by the user in front of the microphones of the sound collecting device 2 while sound (music) from a music file stored in the sound collecting device 2, or from a music file obtained from a music distribution site on the Internet, is being emitted from the speakers 3L and 3R on both sides.

When music or the like is being emitted from the speakers 3L and 3R on both sides and the target sound arriving from the front is extracted, either to convey the utterance to the other party of a call or to recognize a voice command through speech recognition and execute the corresponding processing, the sound emitted from the speakers 3L and 3R acts as noise and greatly degrades call quality and the speech recognition rate.

It is therefore necessary to apply a sound source separation method such as the technique described in Patent Document 1, suppress the noise components arriving from the speakers 3L and 3R on both sides, and extract the target sound from the front. To apply the sound source separation method described in Patent Document 1, two microphones 4L and 4R must be built into, or externally attached to, the sound collecting device 1, as shown in FIG. 9.

However, when a user enjoys music emitted from the sound collection/emission device 1, the volume is high, and the loud music is captured by the microphones 4L and 4R as a noise component (non-target sound). Consequently, even if the target sound is extracted by applying a sound source separation method, a large noise component remains in the extracted target sound signal.

To avoid this, the user could stop the music output (sound emission) before uttering the input speech, such as call speech or a voice command. However, if a key operation or the like is needed to stop the output, the advantage of voice commands is diminished, and it is simpler to enter the command by key operation in the first place. In addition, for a call that begins with an incoming call, the user may be unable to perform the operation to stop the output, or answering the call may be delayed by performing that operation.

There is therefore a demand for a sound collection/emission device, a sound source separation unit, and a sound source separation program that can extract the target sound from an intended sound source with a good S/N ratio even while sound is being emitted.

A first aspect of the present invention is a sound collection/emission device having a sound collection unit in which at least two microphones capture ambient sound and a sound emission unit that emits sound from one or more speakers, the device comprising: (1) sound source separation means that extracts a target sound from a sound source in a predetermined direction on the basis of input sound signals in which the two microphones have captured the ambient sound; and (2) emitted non-target sound removal means that is provided on the path leading to the sound source separation means, receives the sound signal emitted by the sound emission unit, generates a pseudo emitted non-target sound signal simulating the non-target sound that is emitted from the speakers and captured by each microphone, and subtracts it from the input sound signal from each microphone, thereby removing the emitted non-target sound captured by each microphone. The sound source separation means of (1) has: (1-1) a first separation parameter generation unit that generates a first parameter for sound source separation from the input sound signals, output from the emitted non-target sound removal means, from which the emitted non-target sound has been removed; (1-2) a second separation parameter generation unit that generates a second parameter for sound source separation on the basis of the pseudo emitted non-target sound signals generated in the emitted non-target sound removal means; (1-3) a parameter correction unit that corrects the first parameter using the second parameter to obtain the final parameter used for sound source separation; and (1-4) a sound source separation unit that performs sound source separation by applying the corrected parameter. (3) The device is characterized in that the emitted non-target sound is removed by the emitted non-target sound removal means and other non-target sounds are removed by the sound source separation means, so that the target sound is extracted.

A second aspect of the present invention is a sound source separation unit applied to a sound collection/emission device having a sound collection unit in which at least two microphones capture ambient sound and a sound emission unit that emits sound from one or more speakers, the unit comprising: (1) sound source separation means that extracts a target sound from a sound source in a predetermined direction on the basis of input sound signals in which the two microphones have captured the ambient sound; and (2) emitted non-target sound removal means that is provided on the path leading to the sound source separation means, receives the sound signal emitted by the sound emission unit, generates a pseudo emitted non-target sound signal simulating the non-target sound that is emitted from the speakers and captured by each microphone, and subtracts it from the input sound signal from each microphone, thereby removing the emitted non-target sound captured by each microphone. The sound source separation means of (1) has: (1-1) a first separation parameter generation unit that generates a first parameter for sound source separation from the input sound signals, output from the emitted non-target sound removal means, from which the emitted non-target sound has been removed; (1-2) a second separation parameter generation unit that generates a second parameter for sound source separation on the basis of the pseudo emitted non-target sound signals generated in the emitted non-target sound removal means; (1-3) a parameter correction unit that corrects the first parameter using the second parameter to obtain the final parameter used for sound source separation; and (1-4) a sound source separation unit that performs sound source separation by applying the corrected parameter. (3) The unit is characterized in that the emitted non-target sound is removed by the emitted non-target sound removal means and other non-target sounds are removed by the sound source separation means, so that the target sound is extracted.

A third aspect of the present invention is a sound source separation program executed by a computer mounted in a sound collection/emission device having a sound collection unit in which at least two microphones capture ambient sound and a sound emission unit that emits sound from one or more speakers, the program causing the computer to function as: (1) sound source separation means that extracts a target sound from a sound source in a predetermined direction on the basis of input sound signals in which the two microphones have captured the ambient sound; and (2) emitted non-target sound removal means that is provided on the path leading to the sound source separation means, receives the sound signal emitted by the sound emission unit, generates a pseudo emitted non-target sound signal simulating the non-target sound that is emitted from the speakers and captured by each microphone, and subtracts it from the input sound signal from each microphone, thereby removing the emitted non-target sound captured by each microphone. The sound source separation means of (1) realized in this way has: (1-1) a first separation parameter generation unit that generates a first parameter for sound source separation from the input sound signals, output from the emitted non-target sound removal means, from which the emitted non-target sound has been removed; (1-2) a second separation parameter generation unit that generates a second parameter for sound source separation on the basis of the pseudo emitted non-target sound signals generated in the emitted non-target sound removal means; (1-3) a parameter correction unit that corrects the first parameter using the second parameter to obtain the final parameter used for sound source separation; and (1-4) a sound source separation unit that performs sound source separation by applying the corrected parameter. (3) The program is characterized in that the emitted non-target sound is removed by the emitted non-target sound removal means and other non-target sounds are removed by the sound source separation means, so that the target sound is extracted.

According to the present invention, it is possible to realize a sound collection/emission device, a sound source separation unit, and a sound source separation program that can extract the target sound from an intended sound source with a good S/N ratio, regardless of the type of emitted sound, even while sound is being emitted.

FIG. 1 is a block diagram showing the configuration of the sound collection/emission device of the first embodiment.
FIG. 2 is a block diagram showing the detailed configuration of the sound source separation processing unit in the sound collection/emission device of the first embodiment.
FIG. 3 is a block diagram showing the detailed configuration of the suppression coefficient calculation unit in the sound source separation processing unit of the sound collection/emission device of the first embodiment.
FIG. 4 is a block diagram showing the detailed configuration of the sound source separation processing unit in the sound collection/emission device of the second embodiment.
FIG. 5 is a block diagram showing the detailed configuration of the emitted non-target sound type determination unit in the sound source separation processing unit of the sound collection/emission device of the second embodiment.
FIG. 6 is an explanatory diagram, relating to the second embodiment, showing the relationship between the type of sound source data and the behavior of the coherence.
FIG. 7 is a block diagram showing another detailed configuration example applicable as the emitted non-target sound canceller processing unit of each embodiment.
FIG. 8 is an explanatory diagram showing how speakers are connected in a conventional sound collection/emission device.
FIG. 9 is an explanatory diagram showing how microphones are mounted when a sound source separation method is applied to a conventional sound collection/emission device.

(A) First Embodiment
A first embodiment of the sound collection/emission device, sound source separation unit, and sound source separation program according to the present invention is described below with reference to the drawings.

(A-1) Configuration of the First Embodiment
The sound collection/emission device of the first embodiment has a pair of microphones that are either built in or externally attached, and a pair of speakers that are either built in or externally attached. For example, a sound collection/emission device based on a sound collecting device such as a smartphone or a tablet terminal has a built-in pair of microphones and an externally attached pair of speakers. As another example, a sound collection/emission device that is a speaker-integrated audio device has both the pair of microphones and the pair of speakers built in. The pair of microphones and the pair of speakers can thus be connected in various ways, and any of these connection forms may be used.

In the following description, the sound collection/emission device of the first embodiment is assumed to have a built-in pair of microphones and an externally attached pair of speakers, as shown in FIG. 9 described above. Components of the sound collection/emission device of the first embodiment that also appear in FIG. 9 are given the same reference numerals as in FIG. 9.

FIG. 1 is a block diagram showing the configuration of the sound collection/emission device 10 of the first embodiment.

The sound collection/emission device 10 of the first embodiment may be built by connecting various hardware components, or some of its components (for example, the parts other than the speakers, the microphones, the analog/digital conversion units (A/D conversion units), and the digital/analog conversion units (D/A conversion units)) may be built so that their functions are realized by a program execution configuration such as a CPU, ROM, and RAM. Whichever construction is used, the detailed functional configuration of the sound collection/emission device 10 is as shown in FIG. 1. When a program is used, the program may have been written into the memory of the sound collection/emission device 10 at the time of shipment, or it may be installed by download. An example of the latter is a case where the program is provided as a smartphone application and a user who needs it downloads and installs it via the Internet.

In FIG. 1, the sound collection/emission device 10 of the first embodiment has a sound emission unit 20 and a sound collection unit 30.

The sound emission unit 20 has the same configuration as an existing sound emission unit. It has L-channel and R-channel sound source data storage units 21L and 21R, D/A conversion units 22L and 22R, and speakers 3L and 3R.

The sound collection unit 30, on the other hand, has L-channel and R-channel microphones 4L and 4R, A/D conversion units 31L and 31R, an emitted non-target sound canceller processing unit 32, and a sound source separation processing unit 33 whose detailed configuration is shown in FIG. 2. The whole sound collection unit 30, provided with input terminals for the sound source data described later, may be built as a sound source separation unit and sold commercially. Alternatively, the part consisting of the A/D conversion units 31L and 31R, the emitted non-target sound canceller processing unit 32, and the sound source separation processing unit 33, provided with input terminals for the sound source data described later, may be built as a sound source separation unit and sold commercially. In other words, the sound collection/emission device 10, and the sound collection unit 30 in particular, may be built using a sound source separation unit.

The sound source data storage units 21L and 21R store the L-channel and R-channel sound source data (digital signals) sigL and sigR, respectively, and read out and output the sound source data sigL and sigR under the control of a sound emission control unit (not shown). The sound source data sigL and sigR may be, for example, music data, or speech data such as read-aloud audio for electronic books. Each of the sound source data storage units 21L and 21R may be a recording medium access device loaded with a recording medium such as a CD-ROM, or may be formed by a storage unit of the device itself that stores sound source data obtained by communication from an external device such as a site on the Internet. Each of the sound source data storage units 21L and 21R may also be an external device connected via, for example, a USB connector. Although the sound source data storage units 21L and 21R are named "storage units", the concept also covers configurations that output received sound source data in real time, such as a digital audio broadcast receiver.

The D/A conversion units 22L and 22R convert the sound source data sigL and sigR output from the corresponding sound source data storage units 21L and 21R into analog signals and supply them to the corresponding speakers 3L and 3R.

The speakers 3L and 3R emit, as sound, the sound source signals supplied from the corresponding D/A conversion units 22L and 22R. The sound or speech emitted from the speakers 3L and 3R is not intended to be captured by the microphones 4R and 4L; from the standpoint of the capturing function of the microphones 4R and 4L, it is a non-target sound.

The above describes the case where the music or speech emitted from the speakers 3L and 3R originates as a digital signal (sound source data). However, the component corresponding to the sound source data storage units 21L and 21R may instead be a record player, an audio cassette tape recorder, an AM or FM radio receiver, or the like that outputs an analog audio or speech signal. In that case, the D/A conversion units 22L and 22R are omitted, and separate A/D conversion units for the L channel and R channel are provided to convert the analog audio or speech signals into digital signals, which are supplied to the emitted non-target sound canceller processing unit 32.

The microphones 4R and 4L each capture ambient sound and convert it into an electrical signal (analog signal). The pair of microphones 4R and 4L yields a stereo signal. Each of the microphones 4R and 4L has a directivity that mainly captures sound arriving from the front of the sound collection/emission device 10, but it also captures the sound emitted from the speakers 3L and 3R placed on both sides. The speakers 3L and 3R are preferably placed on both sides of the pair of microphones 4R and 4L, but the arrangement is not limited to this.

Each of the microphones 4R and 4L is mounted, for example, in a cylindrical body provided in the housing of the sound collection/emission device 10. A sound insulating member made of synthetic resin is provided on the inner surface of the cylindrical body so that, once the microphones 4R and 4L are mounted, no path remains through which sound can pass between the inside and outside of the housing. This prevents, as far as possible, the microphones 4R and 4L from capturing noise generated inside the housing or noise that has entered the housing from outside and is about to leave it again after being reflected.

The A/D conversion units 31L and 31R convert the input sound signals captured by the corresponding microphones 4L and 4R into digital signals inputL and inputR and supply them to the emitted non-target sound canceller processing unit 32. Each of the A/D conversion units 31L and 31R converts its input into a digital signal having, for example, the same sampling rate as the sound source data sigL and sigR.

The sound source data sigL and sigR output from the sound source data storage units 21L and 21R are also supplied to the emitted non-target sound canceller processing unit 32. The four digital signals input to the emitted non-target sound canceller processing unit 32 must have the same sampling rate. For example, if the sampling rate of sound source data sigL and sigR downloaded from an Internet site and stored in the sound source data storage units 21L and 21R differs from the sampling rate of the digital signals inputL and inputR from the A/D conversion units 31L and 31R, the downloaded sound source data sigL and sigR may be supplied unchanged to the D/A conversion units 22L and 22R, while sound source data whose sampling rate has been converted is supplied to the emitted non-target sound canceller processing unit 32.
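The routing described above can be sketched as follows. This is a minimal illustration, assuming a simple linear-interpolation resampler; the description does not specify how the sampling rate is converted, and the function names are hypothetical. A practical converter would add an anti-aliasing filter, especially when lowering the rate.

```python
def resample_linear(samples, src_rate, dst_rate):
    """Resample by linear interpolation (illustrative only; no anti-alias filter)."""
    if src_rate == dst_rate:
        return list(samples)
    n_out = len(samples) * dst_rate // src_rate
    out = []
    for i in range(n_out):
        pos = i * src_rate / dst_rate          # position in the source signal
        i0 = int(pos)
        i1 = min(i0 + 1, len(samples) - 1)     # clamp at the last sample
        frac = pos - i0
        out.append(samples[i0] * (1.0 - frac) + samples[i1] * frac)
    return out


def route_source_data(sig, sig_rate, mic_rate):
    """Return (data for the D/A conversion unit, data for the canceller)."""
    to_dac = list(sig)                                        # passed through unchanged
    to_canceller = resample_linear(sig, sig_rate, mic_rate)   # matched to inputL/inputR
    return to_dac, to_canceller
```

Keeping the original data on the D/A path leaves playback untouched, while the canceller sees a reference at the same rate as inputL and inputR.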

Based on the sound source data sigL and sigR output from the sound source data storage units 21L and 21R, the emitted non-target sound canceller processing unit 32 removes (or reduces) the non-target sound component contained in the input sound signals (digital signals) inputL and inputR as a result of sound being emitted from the speakers 3L and 3R (hereinafter referred to, where appropriate, as the emitted non-target sound), and supplies the resulting input sound signals ECoutL and ECoutR to the sound source separation processing unit 33.

The sound that is emitted from the speakers 3L and 3R and captured by the microphones 4R and 4L, and that is unwanted from the standpoint of the target sound (the emitted non-target sound), can be regarded in the same way as the acoustic echo that is a problem in telephone communication. In the first embodiment, therefore, the emitted non-target sound canceller processing unit 32 is built by reusing acoustic echo canceller technology. For example, Non-Patent Document 1 describes a "stereo echo canceller". In the first embodiment, the canceller shown in FIG. 3.71 or FIG. 3.75 of Non-Patent Document 1 is assumed to be used as the emitted non-target sound canceller processing unit 32.

In the sound emission non-target sound canceller processing unit 32 with a stereo echo canceller configuration, pseudo versions of the emitted sound (hereinafter called the pseudo sound emission target sound signals) PSechoL and PSechoR are generated internally so that the emitted sound can be removed from the input sound signals inputL and inputR. In the first embodiment, the pseudo sound emission target sound signals PSechoL and PSechoR are also supplied to the sound source separation processing unit 33.

The sound source separation processing unit 33 has the detailed configuration shown in FIG. 2. Based on the input sound signals ECoutL and ECoutR from which the sound emission non-target sound has been removed and on the pseudo sound emission target sound signals PSechoL and PSechoR, it extracts only the target sound from a sound source located in a predetermined direction (for example, the front). The sound source separation method it uses is the coherence filter method, which applies coherence coefficients whose characteristics vary with the direction of the sound source.

In FIG. 2, the sound source separation processing unit 33 includes an FFT (fast Fourier transform) unit 41, a first coherence coefficient calculation unit 42, a second coherence coefficient calculation unit 43, a suppression coefficient calculation unit 44, a suppression coefficient multiplication unit 45, and an IFFT (inverse fast Fourier transform) unit 46.

The FFT unit 41 converts the time-domain signals, namely the input sound signals ECoutL(n) and ECoutR(n) from which the sound emission non-target sound has been removed and the pseudo sound emission target sound signals PSechoL(n) and PSechoR(n), into the frequency-domain signals XL(f, K), XR(f, K), YL(f, K), and YR(f, K), respectively. Here "n" is a parameter representing time, "f" is a parameter representing frequency, and "K" is a parameter representing the index of the frame, where a frame is a block of a predetermined number of input samples submitted to the transform.
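
The frame-wise transform can be sketched as follows. The frame length, the use of non-overlapping rectangular frames, and the helper names `to_frames`, `dft`, and `analyze` are assumptions for illustration; the text specifies only that each frame K of a fixed number of samples is transformed to the frequency domain. A plain DFT is used here in place of an FFT to keep the sketch dependency-free.

```python
import cmath

def to_frames(signal, frame_len):
    """Split a time-domain signal into consecutive, non-overlapping frames."""
    return [signal[i:i + frame_len]
            for i in range(0, len(signal) - frame_len + 1, frame_len)]

def dft(frame):
    """Direct discrete Fourier transform of one frame (O(N^2), for clarity)."""
    n = len(frame)
    return [sum(frame[t] * cmath.exp(-2j * cmath.pi * f * t / n) for t in range(n))
            for f in range(n)]

def analyze(signal, frame_len):
    """Return spectra indexed as result[K][f], mirroring the notation X(f, K)."""
    return [dft(frame) for frame in to_frames(signal, frame_len)]
```

The same `analyze` call would be applied to each of the four signals ECoutL(n), ECoutR(n), PSechoL(n), and PSechoR(n) to obtain XL, XR, YL, and YR.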

The first coherence coefficient calculation unit 42 calculates the coherence coefficient Xcoef(f, K) from the frequency-domain signals XL(f, K) and XR(f, K), which are obtained from the input sound signals ECoutL(n) and ECoutR(n) from which the sound emission non-target sound has been removed.

The second coherence coefficient calculation unit 43 calculates the coherence coefficient Ycoef(f, K) from the frequency-domain signals YL(f, K) and YR(f, K), which are obtained from the pseudo sound emission target sound signals PSechoL(n) and PSechoR(n).

The formulas described in Patent Document 1 can be used to calculate the coherence coefficients Xcoef(f, K) and Ycoef(f, K) (see equations (1), (2), and (4) of Patent Document 1).

The suppression coefficient calculation unit 44 calculates, from the two coherence coefficients Xcoef(f, K) and Ycoef(f, K), a suppression coefficient NRcoef(f, K) for suppressing the non-target sound and supplies it to the suppression coefficient multiplication unit 45. As shown in FIG. 3, the suppression coefficient calculation unit 44 consists of a coefficient reception unit 51 that receives the coherence coefficients Xcoef(f, K) and Ycoef(f, K) from the first and second coherence coefficient calculation units 42 and 43, a suppression coefficient computation unit 52 that computes the suppression coefficient NRcoef(f, K) according to equation (1), and a suppression coefficient transmission unit 53 that supplies the resulting suppression coefficient NRcoef(f, K) to the suppression coefficient multiplication unit 45.

NRcoef(f, K) = Xcoef(f, K) - α × Ycoef(f, K)
where α is a value in the range 0.0 < α ≤ 1.0 ... (1)

The suppression coefficient multiplication unit 45 multiplies XL(f, K), one of the frequency-domain signals obtained from the input sound signals from which the sound emission non-target sound has been removed, by the suppression coefficient NRcoef(f, K) as shown in equation (2). This yields the frequency-domain signal Z(f, K) from which the non-target sound has been removed (in other words, the frequency-domain signal of the target sound).

Z(f, K) = XL(f, K) × NRcoef(f, K) ... (2)

The IFFT unit 46 converts the non-target sound suppressed signal Z(f, K), which is a frequency-domain signal, into the time-domain signal z(n). If the downstream circuit can process the frequency-domain signal Z(f, K) directly, the IFFT unit 46 can be omitted.
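
Equations (1) and (2) amount to a per-frequency subtraction followed by a per-frequency multiplication. The sketch below applies them to one frame; the function names and the default value of `alpha` are assumptions for illustration, and the coherence coefficients are taken as already computed (their formulas come from Patent Document 1 and are not reproduced here).

```python
def suppression_coefficients(xcoef, ycoef, alpha=0.5):
    """Equation (1): NRcoef(f, K) = Xcoef(f, K) - alpha * Ycoef(f, K) for one frame K."""
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must satisfy 0.0 < alpha <= 1.0")
    return [x - alpha * y for x, y in zip(xcoef, ycoef)]

def apply_suppression(xl, nrcoef):
    """Equation (2): Z(f, K) = XL(f, K) * NRcoef(f, K) for one frame K."""
    return [x * g for x, g in zip(xl, nrcoef)]
```

Each list holds one value per frequency bin f, so the two functions would be called once per frame K. No clamping of NRcoef is applied because the text does not mention any.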

Like the sound source separation processing unit 33, the sound emission non-target sound canceller processing unit 32 has a function of removing non-target sound. It is provided in addition to the sound source separation processing unit 33 for the following reason. Rather than treating non-target sound as a single whole, the sound emission non-target sound and the background non-target sound are distinguished and a removal method suited to each is chosen: the sound emission non-target sound is removed by the sound emission non-target sound canceller processing unit 32, and the background non-target sound is removed by the sound source separation processing unit 33. In other words, the sound emission non-target sound canceller processing unit 32 is placed as a preprocessing stage ahead of the sound source separation processing unit 33 and suppresses in advance the non-target sound components with strong correlation between the L and R channels, which the sound source separation processing unit 33 handles poorly. This lets the sound source separation processing unit 33 perform to its full capability, while non-target sound components that the sound emission non-target sound canceller processing unit 32 could not fully suppress are suppressed by the sound source separation processing unit 33. The result is far better non-target sound suppression than when the sound source separation processing unit 33 is used alone.

If the coherence filter method were simply applied as the sound source separation method of the sound source separation processing unit 33, it would suffice to obtain the suppression coefficient used for suppressing non-target sound from the input sound signals ECoutL(n) and ECoutR(n) from which the sound emission non-target sound has been removed. In the first embodiment, however, the suppression coefficient NRcoef(f, K) is obtained using not only ECoutL(n) and ECoutR(n) but also the pseudo sound emission target sound signals PSechoL(n) and PSechoR(n). The reason is as follows.

Suppose the sound emitted from the speakers 3L and 3R is, for example, a piece of music containing impact sounds that suddenly have components at all frequencies, like percussion (for example, drums in rock music). In that case the adaptation in the sound emission non-target sound canceller processing unit 32 (in its adaptive filter) cannot keep up, and the sound emission non-target sound cannot be sufficiently suppressed. Moreover, because an impact sound has components at all frequencies, the sounds emitted from the left and right speakers 3L and 3R are strongly correlated with each other even though their direction of arrival is not the front, and they behave as if they arrived from the front. Consequently, if the suppression coefficient used for suppressing non-target sound is obtained only from the input sound signals ECoutL(n) and ECoutR(n) from which the sound emission non-target sound has been removed, removal of the sound emission non-target sound is insufficient when that sound is an impact sound.

To avoid this problem, the pseudo sound emission target sound signals PSechoL and PSechoR are also used to form the suppression coefficient NRcoef used for suppressing non-target sound.

The pseudo sound emission target sound signals PSechoL and PSechoR calculated by the sound emission non-target sound canceller processing unit 32 are obtained by convolving the sound source data sigL and sigR with the transfer characteristics from the speakers 3L and 3R to the microphones 4L and 4R. They can therefore be said to have characteristics close to those of the disturbing sound components contained in the input sound signals inputL and inputR captured by the microphones 4L and 4R. Referring to the pseudo sound emission target sound signals PSechoL and PSechoR, or to features derived from them, can thus be expected to improve the suppression of impact sounds.

For that reason, in the first embodiment, the pseudo sound emission target sound signals PSechoL and PSechoR are also used to form the suppression coefficient NRcoef used for suppressing non-target sound.

Next, the fact that the pseudo sound emission target sound signals PSechoL and PSechoR can be used to form the suppression coefficient NRcoef used for suppressing non-target sound is explained more concretely.

Given the device configuration assumed in the first embodiment (see FIGS. 8 and 9 described above), the disturbing sound cannot arrive from the front. Mapping this onto the behavior of a feature that is directly tied to the direction of arrival, such as the coherence described in Patent Document 1, the disturbing sound should not take a coherence value equal to or greater than that of the target sound arriving from the front. As described above, however, when the disturbing sound contains an impact sound, the correlation between the disturbing sounds emitted from the left and right speakers 3L and 3R increases markedly, and the disturbing sound behaves as if it arrived from the front even though it is a disturbing sound. That is, the coherence value of a disturbing sound that contains an impact sound is equal to or greater than that of the target sound. The coherence filter method, which sets the noise suppression gain according to the direction of arrival of the disturbing sound, therefore cannot sufficiently suppress it.

The pseudo sound emission target sound signals PSechoL and PSechoR, on the other hand, are obtained by convolving the sound source data sigL and sigR, which become the sound emission non-target sound once emitted, with the transfer characteristics from the speakers 3L and 3R to the microphones 4L and 4R. They therefore contain no target sound component and derive solely from the disturbing sound components arriving from the speakers 3L and 3R on either side. Accordingly, the range of coherence values obtained from the two pseudo sound emission target sound signals PSechoL and PSechoR is lower than the range for the target sound, and if the disturbing sound source data sigL and sigR contain an impact sound, the coherence of PSechoL and PSechoR becomes large. Put the other way around, the occurrence of an impact sound can be detected from a sudden rise in the coherence of PSechoL and PSechoR. By referring to the coherence filter coefficient Ycoef obtained from PSechoL and PSechoR, the impact sound component can be obtained for each frequency. By adjusting the coherence filter coefficient Xcoef, obtained from the input sound signals ECoutL and ECoutR output by the sound emission non-target sound canceller processing unit 32, with the coherence filter coefficient Ycoef as shown in equation (1), the components originating from the impact sound can be removed from the coherence filter coefficient and a more accurate suppression coefficient NRcoef can be calculated.

(A-2) Operation of the first embodiment
Next, the operation of the sound collection/emission device 10 of the first embodiment is described. In the following, the description assumes, where appropriate, that the sound source data is music data and that the target sound is speech uttered by a user located in front of the sound collection/emission device 10.

The sound source data (music data) read from the sound source data storage units 21L and 21R are converted into analog signals by the corresponding D/A conversion units 22L and 22R and then emitted from the speakers 3L and 3R. While such music is playing from the sound collection/emission device 10, speech uttered by the user toward the device is captured by both microphones 4L and 4R. Because music is also playing from the speakers 3L and 3R at that time, the music from the speaker 3L is captured by both microphones 4L and 4R, and so is the music from the speaker 3R. In addition, ambient background noise (the operating sound of an air conditioner, the sound of vehicles passing nearby, and so on) is also captured by both microphones 4L and 4R.

That is, the input sound signals captured by the microphones 4L and 4R contain, besides the target sound (the user's speech), the sound emission non-target sound (the music emitted by the device itself) and other non-target sound such as background noise (hereinafter called the background non-target sound where appropriate).

The input sound signals captured by the microphones 4L and 4R are converted into the digital signals inputL and inputR by the corresponding A/D conversion units 31L and 31R and supplied to the sound emission non-target sound canceller processing unit 32. The sound source data sigL and sigR are also supplied to the sound emission non-target sound canceller processing unit 32.

In the sound emission non-target sound canceller processing unit 32, the internally generated pseudo sound emission target sound signal PSechoL is subtracted from the L-channel input sound signal (digital signal) inputL to obtain the input sound signal ECoutL from which the sound emission non-target sound has been removed. Likewise, the internally generated pseudo sound emission target sound signal PSechoR is subtracted from the R-channel input sound signal (digital signal) inputR to obtain the input sound signal ECoutR from which the sound emission non-target sound has been removed. The resulting pair of input sound signals ECoutL and ECoutR is supplied to the sound source separation processing unit 33 together with the pair of internally generated pseudo sound emission target sound signals PSechoL and PSechoR.
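
The subtraction described above can be sketched for one microphone channel. The text relies on the stereo echo canceller of Non-Patent Document 1 and does not give the adaptation rule; the NLMS update, the tap count, the step size `mu`, and the function name below are assumptions chosen only to make the sketch concrete.

```python
def cancel_emitted_sound(mic, sig_l, sig_r, taps=4, mu=0.5, eps=1e-8):
    """Two-reference NLMS canceller for one microphone (illustrative stand-in).

    Returns (ec_out, ps_echo): the signal after removal (ECout) and the
    internally generated pseudo signal (PSecho), so that ec_out = mic - ps_echo.
    """
    w_l = [0.0] * taps   # estimated path: left speaker -> this microphone
    w_r = [0.0] * taps   # estimated path: right speaker -> this microphone
    ec_out, ps_echo = [], []
    for n in range(len(mic)):
        # Most recent `taps` reference samples, newest first (zero before the start).
        x_l = [sig_l[n - k] if n - k >= 0 else 0.0 for k in range(taps)]
        x_r = [sig_r[n - k] if n - k >= 0 else 0.0 for k in range(taps)]
        echo = (sum(w * x for w, x in zip(w_l, x_l))
                + sum(w * x for w, x in zip(w_r, x_r)))
        err = mic[n] - echo                      # input minus pseudo signal
        norm = sum(x * x for x in x_l) + sum(x * x for x in x_r) + eps
        step = mu * err / norm                   # normalized step
        w_l = [w + step * x for w, x in zip(w_l, x_l)]
        w_r = [w + step * x for w, x in zip(w_r, x_r)]
        ps_echo.append(echo)
        ec_out.append(err)
    return ec_out, ps_echo
```

Calling the function once with inputL and once with inputR would yield the pairs (ECoutL, PSechoL) and (ECoutR, PSechoR) that are passed on to the sound source separation stage.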

In the sound source separation processing unit 33, the FFT unit 41 converts the time-domain signals, namely the input sound signals ECoutL(n) and ECoutR(n) from which the sound emission non-target sound has been removed and the pseudo sound emission target sound signals PSechoL(n) and PSechoR(n), into the frequency-domain signals XL(f, K), XR(f, K), YL(f, K), and YR(f, K), respectively.

The first coherence coefficient calculation unit 42 then calculates the coherence coefficient Xcoef(f, K) from the frequency-domain signals XL(f, K) and XR(f, K) obtained from the input sound signals ECoutL(n) and ECoutR(n) from which the sound emission non-target sound has been removed. The second coherence coefficient calculation unit 43 calculates the coherence coefficient Ycoef(f, K) from the frequency-domain signals YL(f, K) and YR(f, K) obtained from the pseudo sound emission target sound signals PSechoL(n) and PSechoR(n).

After that, the suppression coefficient calculation unit 44 calculates the suppression coefficient NRcoef(f, K) for suppressing the non-target sound from the two coherence coefficients Xcoef(f, K) and Ycoef(f, K) according to equation (1) and supplies it to the suppression coefficient multiplication unit 45. The suppression coefficient multiplication unit 45 multiplies XL(f, K), one of the frequency-domain signals obtained from the input sound signals from which the sound emission non-target sound has been removed, by the suppression coefficient NRcoef(f, K) for each frequency component, yielding the frequency-domain signal Z(f, K) from which the non-target sound has been removed. The IFFT unit 46 converts this non-target sound suppressed signal Z(f, K) into the time-domain signal z(n), giving an output signal output (= z(n)) that contains only the target sound component.

(A-3) Effects of the first embodiment
According to the first embodiment, non-target sound is not treated as a single whole. It is divided into the sound emission non-target sound and the background non-target sound, and a removal process suited to each is applied before the target sound is extracted, so the target sound can be extracted with very high accuracy.

Moreover, according to the first embodiment, the characteristics of the coherence filter coefficient calculated from the pseudo sound emission target sound signals are reflected in the noise suppression coefficient, so the non-target sound can be sufficiently suppressed even if the sound source data contains impact sounds.

As a result, for example, call quality can be improved when the extracted target sound component (speech) is used for a telephone call, and the recognition rate can be improved when it is supplied to speech recognition.

(B) Second embodiment
Next, a second embodiment of the sound collection/emission device, sound source separation unit, and sound source separation program according to the present invention is described with reference to the drawings.

The second embodiment differs from the first embodiment in the internal configuration of the sound source separation processing unit (hereinafter denoted by reference numeral 33A).

FIG. 4 is a block diagram showing the detailed configuration of the sound source separation processing unit 33A of the second embodiment. Parts identical or corresponding to those in FIG. 2 of the first embodiment are given the same reference numerals.

In FIG. 4, the sound source separation processing unit 33A of the second embodiment includes a coherence calculation unit 47 and a sound emission non-target sound type determination unit 48 in addition to the FFT unit 41, the first coherence coefficient calculation unit 42, the second coherence coefficient calculation unit 43, a suppression coefficient calculation unit 44A, the suppression coefficient multiplication unit 45, and the IFFT unit 46. The suppression coefficient calculation unit 44A is also modified from that of the first embodiment.

The coherence calculation unit 47 calculates the coherence COH(K) from the second coherence coefficient Ycoef(f, K) obtained by the second coherence coefficient calculation unit 43. As shown in equation (5) of Patent Document 1, the coherence COH(K) is calculated as the average of the coherence coefficients Ycoef(f, K) over all M frequency components.
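
The averaging step can be sketched in a few lines. The function name `coherence` is an assumption; the input is taken to be the list of M per-frequency coefficients Ycoef(f, K) for a single frame K.

```python
def coherence(ycoef_frame):
    """COH(K): mean of the M per-frequency coherence coefficients of frame K."""
    m = len(ycoef_frame)
    if m == 0:
        raise ValueError("at least one frequency component is required")
    return sum(ycoef_frame) / m
```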

The sound emission non-target sound type determination unit 48 determines, based on the coherence COH(K) obtained by the coherence calculation unit 47, the sound type of the sound source data sigL and sigR that become the sound emission non-target sound. For example, it distinguishes sound source data sigL and sigR that contain impact sounds from sound source data sigL and sigR that contain almost none.

The sound emission non-target sound type determination unit 48 is implemented, for example, as a program. Functionally, as shown in FIG. 5, it has a coherence reception unit 61, a long-term average calculation unit 62, a variance calculation unit 63, a determination unit 64, and a determination result output unit 65.

The coherence reception unit 61 takes in the coherence COH(K) obtained by the coherence calculation unit 47.

The long-term average calculation unit 62 calculates the long-term average avecoh(K) of the coherence COH(K), for example according to equation (3). The variance calculation unit 63 calculates the variance var of the coherence COH(K) using the ordinary formula for variance.

avecoh(K) = β × COH(K) + (1 - β) × COH(K - 1)
where β is a value in the range 0.0 < β < 1.0 ... (3)

The determination unit 64 determines the sound type of the sound source data sigL and sigR that become the sound emission non-target sound from the long-term average avecoh(K) and the variance var of the coherence COH(K). For example, when the long-term average avecoh(K) exceeds a preset threshold and the variance var also exceeds a preset threshold, the determination unit 64 determines that the sound source data sigL and sigR contain impact sounds. For any other combination of the long-term average avecoh(K) and the variance var, it determines that the sound source data sigL and sigR do not contain impact sounds.
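
The determination logic can be sketched as below. Equation (3) is implemented as printed. The population form of the variance, the window of past COH values it is computed over, the default `beta`, and all function and threshold names are assumptions, since the text says only that an ordinary variance formula and preset thresholds are used.

```python
def long_term_average(coh_now, coh_prev, beta=0.1):
    """Equation (3) as printed: beta * COH(K) + (1 - beta) * COH(K - 1)."""
    if not 0.0 < beta < 1.0:
        raise ValueError("beta must satisfy 0.0 < beta < 1.0")
    return beta * coh_now + (1.0 - beta) * coh_prev

def variance(values):
    """Population variance of a window of COH values (assumed form)."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def contains_impact_sound(avecoh, var, ave_threshold, var_threshold):
    """True only when both the long-term average and the variance exceed their thresholds."""
    return avecoh > ave_threshold and var > var_threshold
```

The boolean result corresponds to the determination that the determination result output unit 65 forwards to the suppression coefficient calculation unit 44A.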

The determination result output unit 65 supplies the resulting sound type determination to the suppression coefficient calculation unit 44A.

FIG. 6 shows the time variation of the coherence COH(K) obtained when the music in the sound source data sigL and sigR is classical music with gentle changes and when it is rock music with abrupt changes that include impact sounds. For classical music, the long-term average of the coherence COH(K) is small and its variance is also small. For rock music, the impact sound portions raise the long-term average and increase the variance. Whether the sound source data sigL and sigR contain impact sounds can therefore be determined from the long-term average and variance of the coherence COH(K).

Coherence is used for the following reason. Coherence is the average of the per-frequency coherence coefficients, which are normalized by signal level, so it can be calculated without being affected by the volume of the sound emission non-target sound. Even for pieces whose volume differs greatly, such as rock and classical music, the characteristics can be compared independently of volume, and misjudging loud classical music as rock can be avoided as far as possible.

The suppression coefficient calculation unit 44A of the second embodiment switches the method of calculating the suppression coefficient NRcoef(f, K) according to the determination result of the sound emission non-target sound type determination unit 48.

For example, when the sound emission non-target sound type determination unit 48 determines that the sound source data sigL and sigR contain impact sounds, the suppression coefficient calculation unit 44A calculates the suppression coefficient NRcoef(f, K) according to equation (1), as in the first embodiment. When the determination is that the sound source data sigL and sigR do not contain impact sounds, it uses the coherence coefficient Xcoef(f, K) obtained by the first coherence coefficient calculation unit 42 directly as the suppression coefficient NRcoef(f, K). Other responses to the presence or absence of impact sounds are also possible. For example, the suppression coefficient calculation unit 44A may switch the value of α in equation (1) according to the presence or absence of impact sounds (with α made larger when impact sounds are present).
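
The first switching rule described above can be sketched as follows. The function name, the `has_impact` flag, and the default `alpha` are assumptions for illustration; the flag stands for the determination result received from the sound emission non-target sound type determination unit 48.

```python
def suppression_coefficients_switched(xcoef, ycoef, has_impact, alpha=0.5):
    """Choose the NRcoef(f, K) calculation for one frame by sound type.

    has_impact=True  -> equation (1): Xcoef - alpha * Ycoef
    has_impact=False -> Xcoef is used unchanged as NRcoef
    """
    if has_impact:
        return [x - alpha * y for x, y in zip(xcoef, ycoef)]
    return list(xcoef)
```

The alternative mentioned in the text, switching α instead of the formula, would keep the first branch for both cases and pass a larger `alpha` when impact sounds are present.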

The first embodiment includes a measure to prevent the accuracy of sound source separation from deteriorating when the sound source data sigL and sigR include an impact sound. When the sound source data do not include an impact sound, however, that measure may instead degrade the accuracy.

According to the second embodiment, the method of calculating the suppression coefficient for non-target sound is switched depending on whether the sound source data sigL and sigR, which become the emitted non-target sound, include an impact sound. The sound source separation accuracy can therefore be improved regardless of whether the sound source data sigL and sigR include an impact sound.

(C) Other Embodiments
Various modified embodiments have been mentioned in the description of the above embodiments; further modified embodiments such as those exemplified below are also possible.

In each of the above embodiments, the sound emission non-target sound canceller processing unit 32 borrows stereo echo canceller technology. Alternatively, the sound emission non-target sound canceller processing unit 32 may use a configuration of four monaural echo cancellers 71LL, 71RL, 71LR, and 71RR as shown in FIG. 7. The configuration shown in FIG. 7 can also be regarded as belonging to the category of stereo echo cancellers (see FIG. 3.73 of Non-Patent Document 1).

When monaural echo cancellers are used, because there are two speakers 3L and 3R and two microphones 4L and 4R, the acoustic paths become congested, the acoustic path characteristics cannot be estimated accurately, and a sufficient suppression effect may not be obtained.

Therefore, before the sound source data sigL and sigR are reproduced, white noise is first emitted from the speaker 3L alone, and the adaptive filters of the monaural echo cancellers 71LL and 71LR estimate the acoustic path characteristic H_LL from the speaker 3L to the microphone 4L and the acoustic path characteristic H_LR from the speaker 3L to the microphone 4R. Next, white noise is emitted from the speaker 3R alone, and the adaptive filters of the monaural echo cancellers 71RL and 71RR estimate the acoustic path characteristic H_RL from the speaker 3R to the microphone 4L and the acoustic path characteristic H_RR from the speaker 3R to the microphone 4R. These estimates are set as initial values. Thereafter, the emitted non-target sound can be suppressed by subtracting, from the input sound signals captured by the microphones, the pseudo sound emission non-target sound signals obtained by convolving the four acoustic path characteristics with the corresponding sound source data. In this way, because the adaptive filters of the four monaural echo cancellers 71LL, 71RL, 71LR, and 71RR each learn one of the four acoustic path characteristics before the sound source data are emitted, congestion of the acoustic paths is avoided and the emitted non-target sound can be suppressed.

After the white noise period ends, coefficient updating of the adaptive filters of the four monaural echo cancellers 71LL, 71RL, 71LR, and 71RR may be stopped, and the coefficients learned with white noise may be used at all times to remove the emitted non-target sound.
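The pre-training procedure can be sketched in Python as follows. This is only an illustration under stated assumptions: the specification does not name the adaptation algorithm, so NLMS is used here as a stand-in; the three-tap impulse responses in `true_path`, the step size `mu`, and all function names are made up for the example. Keys such as `"LR"` follow the H_LR convention (speaker L to microphone R).

```python
import random

def nlms_identify(x, d, taps, mu=0.5, eps=1e-8):
    """Estimate an FIR path h with d[n] ~= sum_k h[k] * x[n-k] using NLMS."""
    h = [0.0] * taps
    buf = [0.0] * taps                      # recent input samples, newest first
    for xn, dn in zip(x, d):
        buf = [xn] + buf[:-1]
        y = sum(hk * bk for hk, bk in zip(h, buf))
        e = dn - y                          # estimation error for this sample
        norm = sum(b * b for b in buf) + eps
        h = [hk + mu * e * bk / norm for hk, bk in zip(h, buf)]
    return h

def convolve(h, x):
    """Causal FIR filtering of x by h, output has the same length as x."""
    return [sum(h[k] * x[n - k] for k in range(len(h)) if n - k >= 0)
            for n in range(len(x))]

random.seed(0)
# Hypothetical acoustic paths, keyed as speaker + microphone.
true_path = {"LL": [0.6, 0.3, 0.1], "LR": [0.2, 0.4, 0.1],
             "RL": [0.25, 0.35, 0.05], "RR": [0.7, 0.2, 0.1]}

# Training: white noise from one speaker at a time, so each microphone
# signal contains a single path and the four filters are learned separately.
est = {}
for spk in "LR":
    noise = [random.gauss(0.0, 1.0) for _ in range(2000)]
    for mic in "LR":
        key = spk + mic
        est[key] = nlms_identify(noise, convolve(true_path[key], noise), taps=3)

# Playback with frozen coefficients: subtract the pseudo signal per microphone.
sig = {s: [random.uniform(-1.0, 1.0) for _ in range(200)] for s in "LR"}
residual = {}
for mic in "LR":
    captured = [a + b for a, b in zip(convolve(true_path["L" + mic], sig["L"]),
                                      convolve(true_path["R" + mic], sig["R"]))]
    pseudo = [a + b for a, b in zip(convolve(est["L" + mic], sig["L"]),
                                    convolve(est["R" + mic], sig["R"]))]
    residual[mic] = [c - p for c, p in zip(captured, pseudo)]
```

In this idealized setting (no background sound, filter length equal to the true path length) the residual is close to zero; a real room would leave a residual that the later sound source separation stage has to handle.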

In the first embodiment, the suppression coefficient calculation unit 44 calculates the suppression coefficient by expression (1). Flooring may be applied after the calculation of expression (1) so that the suppression coefficient does not become too small. This prevents degradation of sound quality due to excessive suppression.
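Flooring of this kind amounts to clamping each coefficient from below. A minimal Python sketch follows; the function name and the floor value of 0.1 are assumptions chosen for illustration, since the specification gives no value.

```python
def apply_floor(nrcoef, floor=0.1):
    """Clamp each suppression coefficient so that it never falls below `floor`."""
    return [max(c, floor) for c in nrcoef]
```

A subtractive correction can drive a coefficient to zero or below, which would mute that frequency bin entirely; the floor keeps a small part of the signal in every bin.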

In the first embodiment, the coefficient α in expression (1) calculated by the suppression coefficient calculation unit 44 is fixed, but a variable coefficient may be used as α. For example, α may be controlled according to how much emitted non-target sound (or non-target sound as a whole) is contained. For instance, α may be varied by treating the power of the sound source data that become the emitted non-target sound as the amount of noise contained. This makes it possible to control the suppression performance according to the amount of noise contained.
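One possible way to make α depend on the source data power is sketched below in Python. The specification does not give the mapping, so everything here is an assumption: the linear interpolation, the direction (larger power gives larger α), the thresholds `p_low` and `p_high`, and the range `a_min` to `a_max` (kept at or below 1.0 to match the range stated in claim 3).

```python
def frame_power(frame):
    """Mean squared amplitude of one frame of sound source data."""
    return sum(s * s for s in frame) / len(frame)

def alpha_from_power(power, p_low=1e-4, p_high=1e-1, a_min=0.2, a_max=1.0):
    """Map source-data power to alpha by clamped linear interpolation."""
    if power <= p_low:
        return a_min
    if power >= p_high:
        return a_max
    t = (power - p_low) / (p_high - p_low)
    return a_min + t * (a_max - a_min)
```

For each frame, `alpha_from_power(frame_power(frame))` would then replace the fixed α in expression (1).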

In the second embodiment, the sound type determination decides whether or not the sound source data include an impact sound. The determination may instead distinguish three or more classes, such as strongly including, weakly including, or not including an impact sound, in which case the coefficient α may be switched according to the degree to which an impact sound is included.

In each of the above embodiments, the first coherence coefficient is corrected using the second coherence coefficient by the subtraction shown in expression (1). Another arithmetic expression (function) may be applied to correct the first coherence coefficient using the second coherence coefficient. For example, the suppression coefficient may be calculated by dividing the first coherence coefficient by the second coherence coefficient multiplied by a coefficient.

In the second embodiment, the feature quantities used to determine the emitted non-target sound (interfering sound) are the variance and the long-term average of the coherence. Other statistics may be used as long as they can distinguish the behaviors shown in FIG. 6. For example, the maximum coherence divided by the average, or the coefficient of variation (= standard deviation / average), may be used as the feature quantity.
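The statistics named in this paragraph can be computed from a window of per-frame coherence values as in the following Python sketch. The function name, the use of the population variance, and the two sample series are assumptions for illustration; the series are made-up numbers, not data from FIG. 6, and the coherence values are assumed to be positive so that the average is non-zero.

```python
import math
import statistics

def coherence_features(coh):
    """Return candidate feature quantities for a window of coherence values."""
    mean = statistics.fmean(coh)
    variance = statistics.pvariance(coh)    # population variance (assumption)
    return {
        "mean": mean,
        "variance": variance,
        "peak_to_mean": max(coh) / mean,    # maximum divided by average
        "cv": math.sqrt(variance) / mean,   # standard deviation / average
    }

steady = [0.50, 0.52, 0.48, 0.51, 0.49]     # hypothetical slowly varying series
spiky = [0.10, 0.90, 0.12, 0.85, 0.11]      # hypothetical strongly fluctuating series
```

Both ratio features are dimensionless, so a threshold on either one separates the two series above without depending on their absolute level.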

The feature quantity may also be calculated not from the coherence but from the coherence coefficients of one or more, but not all, frequency components. Further, the presence or absence of an impact sound, or the degree to which an impact sound is mixed in, may be determined from changes in the power of the pseudo sound emission non-target sound signal or the like, without calculating coherence coefficients or coherence. Moreover, the feature quantity used for the determination is not limited to one obtained from the pseudo sound emission non-target sound signal. For example, instead of or in addition to such a feature quantity, a feature quantity obtained from the input sound signal output by the sound emission non-target sound canceller processing unit, from which the emitted non-target sound has been removed, may be used to determine the sound type of the emitted non-target sound.

In the second embodiment, the sound type determination result is reflected in the method of calculating the suppression coefficient. Instead of or in addition to this, it may be used to change the step size of the adaptive filter in the sound emission non-target sound canceller processing unit. For example, when an impact sound is included, the step size is increased to speed up tracking.

In each of the above embodiments, the sound source separation processing unit separates the target sound from the background non-target sound according to the coherence filter method, but the separation method is not limited to this. For example, the coherence filter method may be combined with the frequency subtraction method (spectral subtraction method), with the Wiener filter method, or with both.

When the frequency subtraction method is applied, the ratio at which the noise component spectrum is subtracted from the spectrum of the input audio signal may be changed according to the sound type determined in the second embodiment. Likewise, when the Wiener filter method is applied, the Wiener filter coefficient by which the spectrum of the input audio signal is multiplied may be changed according to the sound type determined in the second embodiment.
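For the frequency subtraction case, a sketch in Python might look like the following. It operates on magnitude spectra and clamps negative results to zero, which is a common form of spectral subtraction rather than anything specified here. The function name, the ratio values, and the choice to subtract more when an impact sound is present are all assumptions.

```python
def spectral_subtract(input_mag, noise_mag, has_impact,
                      ratio_impact=1.5, ratio_normal=1.0):
    """Subtract a scaled noise magnitude spectrum from the input spectrum.

    The subtraction ratio is selected by the sound type determination result.
    """
    ratio = ratio_impact if has_impact else ratio_normal
    # Negative magnitudes are not meaningful, so clamp each bin at zero.
    return [max(x - ratio * n, 0.0) for x, n in zip(input_mag, noise_mag)]
```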

In each of the above embodiments there are two speakers, but there may be one speaker or three or more. The number of microphones is likewise not limited to two and may be three or more. The internal configuration of the sound emission non-target sound canceller processing unit 32 should be designed in view of the number of emission acoustic paths, which is determined by the numbers of speakers and microphones.

In each of the above embodiments, the sound collection / sound emission device performs all of the processing by itself, but processing such as removal of non-target sound may be delegated to an external server. For example, when the sound collection / sound emission device is a smartphone, the system may be built on a so-called cloud and updated in such a way that the existence of the external server is not apparent to the user. The "sound collection / sound emission device" claims are intended to cover the case in which an external server invisible to the user performs the processing.

10 ... sound collection / sound emission device;
20 ... sound emission unit; 21L, 21R ... sound source data storage units; 22L, 22R ... D/A conversion units; 3L, 3R ... speakers;
30 ... sound collection unit; 4L, 4R ... microphones; 31L, 31R ... A/D conversion units; 32 ... sound emission non-target sound canceller processing unit; 33, 33A ... sound source separation processing units; 41 ... FFT unit; 42 ... first coherence coefficient calculation unit; 43 ... second coherence coefficient calculation unit; 44, 44A ... suppression coefficient calculation units; 45 ... suppression coefficient multiplication unit; 46 ... IFFT unit; 47 ... coherence calculation unit; 48 ... sound emission non-target sound type determination unit.

Claims

1. A sound collection / sound emission device having a sound collection unit in which at least two microphones capture ambient sound and a sound emission unit that emits sound from one or more speakers, the device comprising:
sound source separation means for extracting a target sound from a sound source in a predetermined direction based on input sound signals obtained by the two microphones capturing the ambient sound; and
sound emission non-target sound removal means, provided on the path leading to the sound source separation means, which receives the sound signal to be emitted by the sound emission unit, generates a pseudo sound emission non-target sound signal simulating the emitted non-target sound that is emitted from the speaker and captured by each microphone, and removes the emitted non-target sound captured by each microphone by subtracting the pseudo signal from the input sound signal from that microphone,
wherein the sound source separation means includes:
a first separation parameter generating unit that generates a first parameter for sound source separation from the input sound signals, output from the sound emission non-target sound removal means, from which the emitted non-target sound has been removed;
a second separation parameter generating unit that generates a second parameter for sound source separation based on the pseudo sound emission non-target sound signal generated in the sound emission non-target sound removal means;
a parameter correction unit that corrects the first parameter using the second parameter to obtain a final parameter used for sound source separation; and
a sound source separation unit that performs sound source separation by applying the corrected parameter,
and wherein the emitted non-target sound is removed by the sound emission non-target sound removal means and other non-target sound is removed by the sound source separation means, whereby the target sound is extracted.

2. The sound collection / sound emission device according to claim 1, wherein the sound source separation means performs sound source separation according to a coherence filter method, and the parameter is a suppression coefficient.

3. The sound collection / sound emission device according to claim 2, wherein the parameter correction unit multiplies the second coherence filter coefficient, which is the second parameter, by a positive correction value of 1.0 or less and subtracts the result from the first coherence filter coefficient, which is the first parameter, to calculate the suppression coefficient that is the final parameter.

4. The sound collection / sound emission device according to claim 3, wherein the parameter correction unit performs flooring processing on the calculated suppression coefficient.

5. The sound collection / sound emission device according to any one of claims 1 to 4, wherein the sound source separation means further includes a sound type determination unit that determines the sound type of the emitted non-target sound based on the pseudo sound emission non-target sound signal generated in the sound emission non-target sound removal means, and the parameter correction unit selects, according to the determined sound type of the emitted non-target sound, whether to perform the correction or which correction method to use.

6. A sound source separation unit applied to a sound collection / sound emission device having a sound collection unit in which at least two microphones capture ambient sound and a sound emission unit that emits sound from one or more speakers, the sound source separation unit comprising:
sound source separation means for extracting a target sound from a sound source in a predetermined direction based on input sound signals obtained by the two microphones capturing the ambient sound; and
sound emission non-target sound removal means, provided on the path leading to the sound source separation means, which receives the sound signal to be emitted by the sound emission unit, generates a pseudo sound emission non-target sound signal simulating the emitted non-target sound that is emitted from the speaker and captured by each microphone, and removes the emitted non-target sound captured by each microphone by subtracting the pseudo signal from the input sound signal from that microphone,
wherein the sound source separation means includes:
a first separation parameter generating unit that generates a first parameter for sound source separation from the input sound signals, output from the sound emission non-target sound removal means, from which the emitted non-target sound has been removed;
a second separation parameter generating unit that generates a second parameter for sound source separation based on the pseudo sound emission non-target sound signal generated in the sound emission non-target sound removal means;
a parameter correction unit that corrects the first parameter using the second parameter to obtain a final parameter used for sound source separation; and
a sound source separation unit that performs sound source separation by applying the corrected parameter,
and wherein the emitted non-target sound is removed by the sound emission non-target sound removal means and other non-target sound is removed by the sound source separation means, whereby the target sound is extracted.

7. A sound source separation program executed by a computer mounted on a sound collection / sound emission device having a sound collection unit in which at least two microphones capture ambient sound and a sound emission unit that emits sound from one or more speakers, the program causing the computer to function as:
sound source separation means for extracting a target sound from a sound source in a predetermined direction based on input sound signals obtained by the two microphones capturing the ambient sound; and
sound emission non-target sound removal means, provided on the path leading to the sound source separation means, which receives the sound signal to be emitted by the sound emission unit, generates a pseudo sound emission non-target sound signal simulating the emitted non-target sound that is emitted from the speaker and captured by each microphone, and removes the emitted non-target sound captured by each microphone by subtracting the pseudo signal from the input sound signal from that microphone,
wherein the sound source separation means includes:
a first separation parameter generating unit that generates a first parameter for sound source separation from the input sound signals, output from the sound emission non-target sound removal means, from which the emitted non-target sound has been removed;
a second separation parameter generating unit that generates a second parameter for sound source separation based on the pseudo sound emission non-target sound signal generated in the sound emission non-target sound removal means;
a parameter correction unit that corrects the first parameter using the second parameter to obtain a final parameter used for sound source separation; and
a sound source separation unit that performs sound source separation by applying the corrected parameter,
and wherein the emitted non-target sound is removed by the sound emission non-target sound removal means and other non-target sound is removed by the sound source separation means, whereby the target sound is extracted.