JP2015070292A - Sound collection/emission device and sound collection/emission program

Info

Publication number: JP2015070292A
Application number: JP2013199999A
Authority: JP
Inventors: Katsuyuki Takahashi
Original assignee: Oki Electric Industry Co Ltd
Current assignee: Oki Electric Industry Co Ltd
Priority date: 2013-09-26
Filing date: 2013-09-26
Publication date: 2015-04-13

Abstract

PROBLEM TO BE SOLVED: To provide a sound collection/emission device that can improve the accuracy of desired processing regardless of the type of interference sound. SOLUTION: A sound collection/emission device emits sound from a speaker and collects ambient sound with two microphones. The device comprises: emitted interference sound removal means which receives the emitted sound signal, generates a pseudo interference sound signal simulating the interference sound that leaks from the speaker into each microphone as a result of the emission, and subtracts the pseudo interference sound signal from the input signal of each microphone so that the emitted interference sound is removed; sound type determination means which determines the sound type of the emitted interference sound on the basis of the pseudo interference sound signal; and one or more sound type reflection processing means which switch their own processing according to the determination result. The sound type reflection processing means is, for example, sound source separation means.

Description

The present invention relates to a sound collection/emission device and a sound collection/emission program, and can be applied, for example, to communication terminals, audio equipment, and the like in which the component emitted by a speaker should be removed from voice or other sound captured by a microphone.

For example, when call speech is input to a smartphone, or when a voice command is input to an audio device or a smartphone, it is preferable that the receiving device extract only the voice arriving from the front, where the user's mouth is presumed to be, distinguishing it from voice, music, noise, and the like arriving from other directions.

Patent Document 1 describes a method (sound source separation method) that captures the sound input to two microphones, suppresses ambient noise on the basis of the phase difference between the input sounds (electrical signals), and separates and extracts only the sound arriving at the microphones from a sound source in a predetermined direction, for example the front (hereinafter referred to as the target sound).

The target sound extraction method described as the third embodiment in Patent Document 1 forms two directivities having blind spots to the left and to the right of the microphones, and multiplies the input sound signal, for each frequency component, by a suppression coefficient that depends on the correlation between the two resulting signals, thereby suppressing noise components (non-target sounds) arriving from the left and right. The target sound extraction method described as the fourth embodiment in Patent Document 1 forms a directivity having a blind spot in front of the microphones and subtracts the resulting signal, treated as the noise component arriving from the left and right, from the input sound signal, thereby suppressing noise components (non-target sounds) arriving from the left and right.

In recent years, as shown in FIG. 6, a sound collection/emission device 1 has come into use in which a pair of speakers 3L and 3R is placed on and connected to both sides of a sound collecting device 2 having a communication function, such as a portable terminal (for example, a smartphone or a tablet terminal), and which is used in this configuration to talk with a remote location. With a similar configuration, a method is also being studied in which the device accepts a voice command spoken by the user in front of the microphone of the sound collecting device 2 while sound (music) from a music file recorded in the sound collecting device 2, or from a music file obtained from a music distribution site on the Internet, is being emitted from the speakers 3L and 3R on both sides.

When the target sound arriving from the front is extracted while music or the like is being emitted from the speakers 3L and 3R on both sides, whether to convey the utterance to the other party of a call or to recognize a voice command through speech recognition and execute the corresponding processing, the sound emitted from the speakers 3L and 3R acts as noise and greatly degrades call quality and the speech recognition rate.

It is therefore necessary to apply a sound source separation method such as the technique described in Patent Document 1, suppress the noise components arriving from the speakers 3L and 3R on both sides, and extract the target sound from the front. Applying the sound source separation method of Patent Document 1 requires that two microphones 4L and 4R be built into or externally attached to the sound collecting device 1, as shown in FIG. 7.

Patent Document 1: JP 2013-061421 A

Non-Patent Document 1: Nobuhiko Kitawaki, "Digital Voice/Audio Technology (Future Netto Technology Series)", Telecommunications Association, pp. 218-243, 1999

However, when a user plays music from the sound collection/emission device 1 for enjoyment, the volume is high, and the loud music is captured by the microphones 4L and 4R as a noise component (non-target sound). Even if the target sound is extracted by applying the sound source separation method, a large noise component therefore remains in the extracted target sound signal.

To avoid this, the user could stop the music output (sound emission) before speaking the input voice, such as call speech or a voice command. However, if a key operation or the like is needed to stop the output, the advantage of voice commands is diminished, and it is simpler to enter the command by key operation in the first place. Moreover, for a call that starts with an incoming call, the user may be unable to perform the operation that stops the audio output, or answering the call may be delayed by performing it.

The voice and other sounds emitted from the sound collection/emission device 1 are varied, and so are the interference sounds (non-target sounds) with respect to the target sound that result when the sound emitted from the speakers 3L and 3R is captured. That is, interference sounds come in many types, such as music and spoken phrases, and their acoustic characteristics differ greatly depending on the type. Music, for example, spans genres from classical to rock. If the interference sound is rock, the volume is high, so the S/N ratio is poor, and sudden impact sounds from percussion instruments such as drums occur frequently, so the characteristics are non-stationary. Classical music, by contrast, is comparatively quiet, so the S/N ratio relative to the target sound is good, and sudden impact sounds are rare, so the characteristics are stationary. Accordingly, when the interference sound is a spoken phrase or classical music, a sufficient suppression effect is obtained without strong suppression of the interference sound (non-target sound), whereas for rock, strong suppression that can cope with loud noise and sudden fluctuations is required.

As described above, if interference sound suppression is applied uniformly, ignoring the differences in acoustic characteristics between types of interference sound (non-target sound), either the suppression performance is insufficient or the interference sound is suppressed excessively and sound quality deteriorates.

For this reason, a sound collection/emission device and a sound collection/emission program that can improve the accuracy of desired processing regardless of the type of interference sound are desired.

A first aspect of the present invention is a sound collection/emission device having a sound collection unit in which at least two microphones capture ambient sound and a sound emission unit that emits sound from one or more speakers, the device comprising: (1) emitted interference sound removal means, reusing an acoustic echo canceller configuration, which receives the sound signal emitted by the sound emission unit, generates a pseudo interference sound signal simulating the interference sound that accompanies the emission, that is, the sound emitted from the speakers and captured by each microphone, and subtracts it from the input sound signal of each microphone, thereby removing the emitted interference sound captured by each microphone; (2) sound type determination means which determines the sound type of the emitted interference sound on the basis of the pseudo emitted interference sound signal generated in the emitted interference sound removal means; and (3) one or more sound type reflection processing means which switch their own processing according to the determination result of the sound type determination means.

A second aspect of the present invention is a sound collection/emission program executed by a computer mounted on a sound collection/emission device having a sound collection unit in which at least two microphones capture ambient sound and a sound emission unit that emits sound from one or more speakers, the program causing the computer to function as: (1) emitted interference sound removal means, reusing an acoustic echo canceller configuration, which receives the sound signal emitted by the sound emission unit, generates a pseudo interference sound signal simulating the interference sound that accompanies the emission, that is, the sound emitted from the speakers and captured by each microphone, and subtracts it from the input sound signal of each microphone, thereby removing the emitted interference sound captured by each microphone; (2) sound type determination means which determines the sound type of the emitted interference sound on the basis of the pseudo emitted interference sound signal generated in the emitted interference sound removal means; and (3) one or more sound type reflection processing means which switch their own processing according to the determination result of the sound type determination means.

According to the present invention, a sound collection/emission device and a sound collection/emission program that can improve the accuracy of desired processing regardless of the type of interference sound can be realized.

FIG. 1 is a block diagram showing the configuration of the sound collection/emission device of the first embodiment.
FIG. 2 is a block diagram showing the detailed configuration of the sound source separation processing unit in the sound collection/emission device of the first embodiment.
FIG. 3 is a block diagram showing the detailed configuration of the emitted non-target sound type determination unit within the sound source separation processing unit in the sound collection/emission device of the first embodiment.
FIG. 4 is an explanatory diagram, related to the first embodiment, showing the relationship between the type of sound source data and the behavior of the coherence.
FIG. 5 is a block diagram showing the detailed configuration of the emitted non-target sound canceller processing unit in the sound collection/emission device of the second embodiment.
FIG. 6 is an explanatory diagram showing how speakers are connected in a conventional sound collection/emission device.
FIG. 7 is an explanatory diagram showing how microphones are mounted when a sound source separation method is applied to a conventional sound collection/emission device.

(A) First Embodiment
Hereinafter, a first embodiment of the sound collection/emission device and the sound collection/emission program according to the present invention will be described with reference to the drawings.

(A-1) Configuration of the First Embodiment
The sound collection/emission device of the first embodiment has a pair of microphones, either built in or externally attached, and a pair of speakers, either built in or externally attached. For example, a sound collection/emission device based on a sound collecting device such as a smartphone or a tablet terminal has a built-in pair of microphones and an externally attached pair of speakers. As another example, when the sound collection/emission device is a speaker-integrated audio device, both the pair of microphones and the pair of speakers are built in. The pair of microphones and the pair of speakers can thus be connected in various ways, and any of these connection forms may be used.

In the following description, the sound collection/emission device of the first embodiment is assumed to have a built-in pair of microphones and an externally attached pair of speakers, as shown in FIG. 7 described above. Components of the first embodiment that also appear in FIG. 7 are denoted by the same reference numerals as in FIG. 7.

FIG. 1 is a block diagram showing the configuration of the sound collection/emission device 10 of the first embodiment.

The sound collection/emission device 10 of the first embodiment may be built by connecting hardware components, or part of it (for example, everything except the speakers, the microphones, the analog/digital converters (A/D converters), and the digital/analog converters (D/A converters)) may be built so that its functions are realized by a program execution configuration such as a CPU, ROM, and RAM. Whichever approach is used, the detailed functional configuration of the sound collection/emission device 10 is as shown in FIG. 1. When a program is used, it may be written into the memory of the sound collection/emission device 10 before shipment, or it may be installed by download. As an example of the latter, the program may be provided as a smartphone application that users who need it download over the Internet and install.

In FIG. 1, the sound collection/emission device 10 of the first embodiment has a sound emission unit 20 and a sound collection unit 30.

The sound emission unit 20 has the same configuration as an existing sound emission unit. It has L-channel and R-channel sound source data storage units 21L and 21R, D/A converters 22L and 22R, and speakers 3L and 3R.

The sound collection unit 30, on the other hand, has L-channel and R-channel microphones 4L and 4R, A/D converters 31L and 31R, an emitted non-target sound canceller processing unit 32, and a sound source separation processing unit 33 whose detailed configuration is shown in FIG. 2. The entire sound collection unit 30, provided with input terminals for the sound source data described later, may be built as a sound source separation unit and sold as a product. Alternatively, the portion consisting of the A/D converters 31L and 31R, the emitted non-target sound canceller processing unit 32, and the sound source separation processing unit 33, provided with input terminals for the sound source data described later, may be built as a sound source separation unit and sold as a product.

The sound source data storage units 21L and 21R store the sound source data (digital signals) sigL and sigR for the L channel and the R channel, respectively, and read out and output sigL and sigR under the control of a sound emission control unit (not shown). The sound source data sigL and sigR may be, for example, music data, or voice data such as a reading of an electronic book. Each of the sound source data storage units 21L and 21R may be a recording medium access device loaded with a recording medium such as a CD-ROM, or may be a storage unit of the device that stores sound source data obtained by communication from an external device such as a site on the Internet. Each of the sound source data storage units 21L and 21R may also be an external device connected, for example, through a USB connector. Although the units 21L and 21R are named "storage units", the concept also covers configurations that output received sound source data in real time, such as a digital audio broadcast receiver.

The D/A converters 22L and 22R convert the sound source data sigL and sigR output from the corresponding sound source data storage units 21L and 21R into analog signals and supply them to the corresponding speakers 3L and 3R.

The speakers 3L and 3R emit (sound out) the sound source signals supplied from the corresponding D/A converters 22L and 22R. The sound or voice emitted from the speakers 3L and 3R is not intended to be captured by the microphones 4R and 4L; from the standpoint of the capturing function of the microphones 4R and 4L, it is non-target sound.

The above describes the case in which the music or voice emitted from the speakers 3L and 3R originates as a digital signal (sound source data). However, the component corresponding to the sound source data storage units 21L and 21R may instead be a record player, an audio cassette tape recorder, an AM or FM radio receiver, or the like that outputs an analog audio or voice signal. In that case, the D/A converters 22L and 22R are omitted, and separate A/D converters for the L and R channels are provided to convert the analog audio or voice signal into a digital signal, which is supplied to the emitted non-target sound canceller processing unit 32.

The microphones 4R and 4L each capture ambient sound and convert it into an electrical signal (analog signal). The pair of microphones 4R and 4L yields a stereo signal. Each of the microphones 4R and 4L has a directivity that mainly captures sound arriving from the front of the sound collection/emission device 10, but it also captures sound emitted from the speakers 3L and 3R placed on both sides. The speakers 3L and 3R are preferably placed on both sides of the pair of microphones 4R and 4L, but the arrangement is not limited to this.

Each of the microphones 4R and 4L is mounted, for example, inside a cylindrical body provided in the housing of the sound collection/emission device 10. A sound insulating member made of synthetic resin is provided on the inner surface of the cylindrical body so that, once the microphones 4R and 4L are mounted, no path remains through which sound can pass between the inside and outside of the housing. This prevents, as far as possible, the microphones 4R and 4L from capturing noise generated inside the housing or noise that enters the housing from outside and is about to leave it again after being reflected.

The A/D converters 31L and 31R convert the input sound signals captured by the corresponding microphones 4L and 4R into digital signals inputL and inputR, respectively, and supply them to the emitted non-target sound canceller processing unit 32. Each of the A/D converters 31L and 31R converts its input into a digital signal having, for example, the same sampling rate as the sound source data sigL and sigR.

The emitted non-target sound canceller processing unit 32 also receives the sound source data sigL and sigR output from the sound source data storage units 21L and 21R. The four digital signals input to the emitted non-target sound canceller processing unit 32 must have the same sampling rate. For example, if the sampling rate of sound source data sigL and sigR that were downloaded from an Internet site and stored in the sound source data storage units 21L and 21R differs from the sampling rate of the digital signals inputL and inputR from the A/D converters 31L and 31R, the downloaded sound source data sigL and sigR may be supplied unchanged to the D/A converters 22L and 22R, while sound source data whose sampling rate has been converted is supplied to the emitted non-target sound canceller processing unit 32.
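
The rate matching described in the preceding paragraph can be sketched as follows. This is a minimal illustration in Python with NumPy, not the method of the embodiment: the function name `resample_linear` and the use of linear interpolation are assumptions made for brevity, and a practical implementation would more likely use a band-limited resampler to avoid aliasing.

```python
import numpy as np


def resample_linear(signal, src_rate, dst_rate):
    """Resample a 1-D signal from src_rate to dst_rate (hypothetical helper).

    Linear interpolation is an assumption made for this sketch; the source
    only requires that all inputs to the canceller share one sampling rate.
    """
    signal = np.asarray(signal, dtype=float)
    if src_rate == dst_rate:
        return signal.copy()
    n_out = int(round(len(signal) * dst_rate / src_rate))
    # Time stamps (in seconds) of the source and destination samples.
    t_src = np.arange(len(signal)) / src_rate
    t_dst = np.arange(n_out) / dst_rate
    return np.interp(t_dst, t_src, signal)


# Example: bring one second of 44.1 kHz source data to a 16 kHz capture rate
# before passing it to the canceller (the rates are illustrative only).
sig_l_44k = np.zeros(44100)
sig_l_16k = resample_linear(sig_l_44k, 44100, 16000)
```

In this arrangement the original data would still go to the D/A converters unchanged, and only the copy fed to the canceller would be converted.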

Based on the sound source data sigL and sigR output from the sound source data storage units 21L and 21R, the emitted non-target sound canceller processing unit 32 removes (or reduces) the non-target sound component contained in the input sound signals (digital signals) inputL and inputR as a result of emission from the speakers 3L and 3R (hereinafter referred to as the emitted non-target sound where appropriate), and supplies the processed input sound signals ECoutL and ECoutR to the sound source separation processing unit 33.

The sound emitted from the speakers 3L and 3R and captured by the microphones 4R and 4L, which is unwanted from the standpoint of the target sound (the emitted non-target sound), can be regarded in the same way as the acoustic echo that is a problem in telephone communication. In the first embodiment, the emitted non-target sound canceller processing unit 32 is therefore built by reusing acoustic echo canceller technology. For example, Non-Patent Document 1 describes a "stereo echo canceller". In the first embodiment, the configuration shown in FIG. 3.71 or FIG. 3.75 of Non-Patent Document 1 is assumed to be used as the emitted non-target sound canceller processing unit 32.

In order to remove the emitted non-target sound from the input sound signals inputL and inputR, the emitted non-target sound canceller processing unit 32 with the stereo echo canceller configuration internally generates signals PSechoL and PSechoR that simulate the emitted sound to be removed (hereinafter referred to as pseudo sound emission target sound signals). In the first embodiment, the pseudo sound emission target sound signals PSechoL and PSechoR are also supplied to the sound source separation processing unit 33.
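
The echo canceller principle described above, in which a pseudo signal is generated from the sound source data and subtracted from the microphone input, can be sketched as follows. This is an illustration under stated assumptions, not the stereo echo canceller of Non-Patent Document 1: it handles a single reference/microphone pair, and the NLMS adaptive filter, the function name `nlms_cancel`, and the parameters `num_taps`, `step`, and `eps` are all choices made for this sketch. In the document's terms, `reference` plays the role of sigL or sigR, `mic` of inputL or inputR, `residual` of ECoutL or ECoutR, and `pseudo_echo` of PSechoL or PSechoR.

```python
import numpy as np


def nlms_cancel(reference, mic, num_taps=64, step=0.5, eps=1e-8):
    """Return (residual, pseudo_echo) for one reference/microphone pair.

    A monaural NLMS sketch (an assumption); the embodiment uses a stereo
    echo canceller whose internals are not given in this section.
    """
    reference = np.asarray(reference, dtype=float)
    mic = np.asarray(mic, dtype=float)
    weights = np.zeros(num_taps)   # estimate of the speaker-to-microphone path
    history = np.zeros(num_taps)   # recent reference samples, newest first
    residual = np.zeros(len(mic))
    pseudo_echo = np.zeros(len(mic))
    for n in range(len(mic)):
        history = np.roll(history, 1)
        history[0] = reference[n]
        # Simulated speaker sound as it would appear at the microphone.
        pseudo_echo[n] = weights @ history
        # Microphone input with the simulated speaker sound subtracted.
        residual[n] = mic[n] - pseudo_echo[n]
        # Normalised LMS update of the path estimate.
        weights += step * residual[n] * history / (history @ history + eps)
    return residual, pseudo_echo
```

Both outputs matter here: the residual goes on to sound source separation, and the pseudo signal is what the later sound type determination works from.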

The sound source separation processing unit 33 has the detailed configuration shown in FIG. 2 and, based on the input sound signals ECoutL and ECoutR from which the emitted non-target sound has been removed and on the pseudo sound emission target sound signals PSechoL and PSechoR, extracts only the target sound from a sound source in a predetermined direction (for example, the front). The sound source separation method used by the sound source separation processing unit 33 is a coherence filter method, which uses coherence coefficients whose characteristics change with the direction of the sound source.

In FIG. 2, the sound source separation processing unit 33 has an FFT (fast Fourier transform) unit 41, a first coherence coefficient calculation unit 42, a second coherence coefficient calculation unit 43, a suppression coefficient calculation unit 44, a suppression coefficient multiplication unit 45, an IFFT (inverse fast Fourier transform) unit 46, a coherence calculation unit 47, and an emitted non-target sound type determination unit 48.

The FFT unit 41 converts the time-domain signals, namely the input sound signals ECoutL(n) and ECoutR(n) from which the emitted non-target sound has been removed and the pseudo sound emission target sound signals PSechoL(n) and PSechoR(n), into the frequency-domain signals XL(f, K), XR(f, K), YL(f, K), and YR(f, K), respectively. Here "n" is a parameter representing time, "f" is a parameter representing frequency, and "K" is a parameter representing the index of a frame, where a frame is a block of a predetermined number of input samples subjected to the transform. These parameters are written out only where the explanation calls for them; where they are omitted, this is merely an abbreviation, and the processing still refers to them.
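
The frame-wise conversion from a time-domain signal indexed by n to a frequency-domain signal indexed by (f, K) can be sketched as follows. The function name `frames_to_spectra`, the frame length of 512 samples, and the use of non-overlapping, unwindowed frames are assumptions made for this sketch; the section does not specify them, and a practical implementation would likely add a window and overlap.

```python
import numpy as np


def frames_to_spectra(signal, frame_len=512):
    """Split a signal into frames and FFT each frame (hypothetical helper).

    Returns a complex array of shape (num_frames, frame_len // 2 + 1),
    indexed as [K, f]. Trailing samples that do not fill a frame are dropped.
    """
    signal = np.asarray(signal, dtype=float)
    num_frames = len(signal) // frame_len
    frames = signal[:num_frames * frame_len].reshape(num_frames, frame_len)
    # Real-input FFT along the sample axis of every frame.
    return np.fft.rfft(frames, axis=1)


# The same transform would be applied to all four signals, for example:
# XL = frames_to_spectra(ECoutL); YL = frames_to_spectra(PSechoL)
```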

第１のコヒーレンス係数計算部４２は、放音非目的音が除去された入力音信号ＥＣｏｕｔＬ（ｎ）、ＥＣｏｕｔＲ（ｎ）から得られた周波数領域信号ＸＬ（ｆ，Ｋ）及びＸＲ（ｆ，Ｋ）に基づいて、コヒーレンス係数Ｘｃｏｅｆ（ｆ，Ｋ）を計算するものである。 The first coherence coefficient calculation unit 42 calculates the coherence coefficient Xcoef(f, K) based on the frequency-domain signals XL(f, K) and XR(f, K) obtained from the input sound signals ECoutL(n) and ECoutR(n) from which the emitted non-target sound has been removed.

第２のコヒーレンス係数計算部４３は、疑似放音目的音信号ＰＳｅｃｈｏＬ（ｎ）、ＰＳｅｃｈｏＲ（ｎ）から得られた周波数領域信号ＹＬ（ｆ，Ｋ）及びＹＲ（ｆ，Ｋ）に基づいてコヒーレンス係数Ｙｃｏｅｆ（ｆ，Ｋ）を計算するものである。 The second coherence coefficient calculator 43 calculates the coherence coefficient based on the frequency domain signals YL (f, K) and YR (f, K) obtained from the pseudo sound emission target sound signals PSechoL (n) and PSechoR (n). Ycoef (f, K) is calculated.

コヒーレンス係数Ｘｃｏｅｆ（ｆ，Ｋ）、Ｙｃｏｅｆ（ｆ，Ｋ）の計算式として、特許文献１に記載のものを適用できる（特許文献１の（１）式、（２）式、（４）式参照）。 As the calculation formulas of the coherence coefficients Xcoef (f, K) and Ycoef (f, K), those described in Patent Document 1 can be applied (see Expressions (1), (2), and (4) of Patent Document 1). ).

コヒーレンス計算部４７は、第２のコヒーレンス係数計算部４３が得た第２のコヒーレンス係数Ｙｃｏｅｆ（ｆ，Ｋ）から、コヒーレンスＣＯＨ（Ｋ）を計算するものである。コヒーレンスＣＯＨ（Ｋ）は、特許文献１の（５）式に示すように、全Ｍ個の周波数成分毎のコヒーレンス係数Ｙｃｏｅｆ（ｆ，Ｋ）の平均値として算出される。 The coherence calculator 47 calculates the coherence COH (K) from the second coherence coefficient Ycoef (f, K) obtained by the second coherence coefficient calculator 43. Coherence COH (K) is calculated as an average value of coherence coefficients Ycoef (f, K) for all M frequency components, as shown in equation (5) of Patent Document 1.
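
The averaging step performed by the coherence calculation unit 47 can be sketched as follows. This is a minimal illustration, not the implementation of the embodiment: the function name `frame_coherence` is hypothetical, and the per-frequency coefficients Ycoef(f, K) are assumed to have been computed already by the formulas of Patent Document 1, which are not reproduced here.

```python
def frame_coherence(ycoef_frame):
    """Return COH(K) for one frame as the mean of its per-frequency coefficients.

    ycoef_frame: sequence of the M coherence coefficients Ycoef(f, K) of frame K
    (assumed to be precomputed; their formula is outside this sketch).
    """
    m = len(ycoef_frame)
    if m == 0:
        raise ValueError("a frame needs at least one frequency component")
    return sum(ycoef_frame) / m
```

Because the result is a single value per frame, the later stages only need to track a short history of these values.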

放音非目的音種判定部４８は、コヒーレンス計算部４７が得たコヒーレンスＣＯＨ（Ｋ）に基づいて、放音非目的音となる音源データｓｉｇＬ、ｓｉｇＲの音種を判定するものである。例えば、衝撃音を含む音源データｓｉｇＬ、ｓｉｇＲか、衝撃音をほとんど含まない音源データｓｉｇＬ、ｓｉｇＲかを判別するものである。 The sound emission non-purpose sound type determination unit 48 determines the sound type of the sound source data sigL and sigR that are the sound non-purpose sound based on the coherence COH (K) obtained by the coherence calculation unit 47. For example, it is determined whether the sound source data sigL and sigR includes impact sound, or the sound source data sigL and sigR that hardly includes impact sound.

放音非目的音種判定部４８は、例えば、プログラムで実現されており、機能的には、図３に示すように、コヒーレンス受信部５１、長期平均計算部５２、分散計算部５３、判定部５４及び判定結果出力部５５を有する。 The sound emission non-purpose sound type determination unit 48 is realized by a program, for example, and functionally, as shown in FIG. 3, a coherence reception unit 51, a long-term average calculation unit 52, a variance calculation unit 53, a determination unit 54 and a determination result output unit 55.

コヒーレンス受信部５１は、コヒーレンス計算部４７が得たコヒーレンスＣＯＨ（Ｋ）を取り込むものである。 The coherence receiving unit 51 takes in the coherence COH (K) obtained by the coherence calculating unit 47.

長期平均計算部５２は、コヒーレンスＣＯＨ（Ｋ）の長期平均値ａｖｅｃｏｈ（Ｋ）を、例えば、（１）式に従って計算するものであり、分散計算部５３は、一般的な分散の計算式に従ってコヒーレンスＣＯＨ（Ｋ）の分散ｖａｒを計算するものである。 The long-term average calculation unit 52 calculates the long-term average value avecoh (K) of the coherence COH (K), for example, according to the equation (1), and the variance calculation unit 53 performs the coherence according to a general equation for variance. The variance var of COH (K) is calculated.

ａｖｅｃｏｈ（Ｋ）＝β×ＣＯＨ（Ｋ）＋（１−β）×ＣＯＨ（Ｋ−１）　但し、βは０．０＜β＜１．０の範囲の値 …（１） avecoh(K) = β × COH(K) + (1 − β) × COH(K − 1), where β is a value in the range 0.0 < β < 1.0 … (1)

判定部５４は、コヒーレンスＣＯＨ（Ｋ）の長期平均値ａｖｅｃｏｈ（Ｋ）と分散ｖａｒとから、放音非目的音となる音源データｓｉｇＬ、ｓｉｇＲの音種を判定するものである。判定部５４は、例えば、長期平均値ａｖｅｃｏｈ（Ｋ）が予め設定されている閾値を超え、かつ、分散ｖａｒが予め設定されている閾値を超えている場合に、音源データｓｉｇＬ、ｓｉｇＲが衝撃音を含むものであると判定し、長期平均値ａｖｅｃｏｈ（Ｋ）及び分散ｖａｒの組み合わせが上記以外の場合に、音源データｓｉｇＬ、ｓｉｇＲが衝撃音を含まないものであると判定する。 The determination unit 54 determines the sound type of the sound source data sigL and sigR, which become the emitted non-target sound, from the long-term average value avecoh(K) and the variance var of the coherence COH(K). For example, when the long-term average value avecoh(K) exceeds a preset threshold and the variance var also exceeds a preset threshold, the determination unit 54 determines that the sound source data sigL and sigR contain impact sounds; for any other combination of avecoh(K) and var, it determines that the sound source data do not contain impact sounds.
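
The determination described above can be sketched as follows. The sketch follows equation (1) as printed and uses the population variance as the "general variance formula"; the default β, the two threshold values, the length of the history window, and all function names are assumptions made for illustration, since the text gives no concrete values.

```python
from statistics import pvariance


def long_term_average(coh_now, coh_prev, beta):
    """Equation (1) as printed: beta * COH(K) + (1 - beta) * COH(K - 1)."""
    if not 0.0 < beta < 1.0:
        raise ValueError("beta must satisfy 0.0 < beta < 1.0")
    return beta * coh_now + (1.0 - beta) * coh_prev


def contains_impact_sound(coh_history, beta=0.5,
                          avg_threshold=0.3, var_threshold=0.01):
    """Return True when both the long-term average and the variance exceed
    their thresholds (threshold values here are hypothetical).

    coh_history: recent COH values, oldest first, ending with COH(K).
    """
    if len(coh_history) < 2:
        raise ValueError("need at least COH(K - 1) and COH(K)")
    avecoh = long_term_average(coh_history[-1], coh_history[-2], beta)
    var = pvariance(coh_history)  # variance over the assumed history window
    return avecoh > avg_threshold and var > var_threshold
```

For instance, a flat history such as `[0.10, 0.12, 0.11, 0.10, 0.12]` stays below both thresholds, while a history with repeated jumps such as `[0.2, 0.9, 0.2, 0.9, 0.8, 0.9]` exceeds both under these assumed settings.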

判定結果出力部５５は、得られた音種の判定結果を、抑圧係数算出部４４に与えるものである。 The determination result output unit 55 gives the obtained sound type determination result to the suppression coefficient calculation unit 44.

図４は、音源データｓｉｇＬ、ｓｉｇＲの楽曲が、変化が穏やかなクラシックの場合と衝撃音を含む変化が激しいロックの場合に得られたコヒーレンスＣＯＨ（Ｋ）の時間変化を示している。クラシックの場合には、コヒーレンスＣＯＨ（Ｋ）の長期平均値は小さく分散も小さい。ロックの場合には、衝撃音部分が長期平均値を引き上げると共に、分散を大きくしている。そのため、コヒーレンスＣＯＨ（Ｋ）の長期平均値及び分散に基づいて、音源データｓｉｇＬ、ｓｉｇＲが衝撃音を含むものであるか否かを判定することができる。 FIG. 4 shows temporal changes in coherence COH (K) obtained when the music of the sound source data sigL and sigR is a classical music with a gentle change and a rock with a strong change including an impact sound. In the case of classic, the long-term average value of coherence COH (K) is small and the variance is also small. In the case of rock, the impact sound part raises the long-term average value and increases the dispersion. Therefore, based on the long-term average value and variance of the coherence COH (K), it can be determined whether or not the sound source data sigL and sigR includes an impact sound.

抑圧係数算出部４４は、２つのコヒーレンス係数Ｘｃｏｅｆ（ｆ，Ｋ）及びＹｃｏｅｆ（ｆ，Ｋ）から、非目的音を抑圧する抑圧係数ＮＲｃｏｅｆ（ｆ，Ｋ）を算出して抑圧係数乗算部４５に与えるものである。抑圧係数算出部４４は、放音非目的音種判定部４８の判定結果に応じて、抑圧係数ＮＲｃｏｅｆ（ｆ，Ｋ）の算出方法を切り替えるものである。 The suppression coefficient calculation unit 44 calculates, from the two coherence coefficients Xcoef(f, K) and Ycoef(f, K), a suppression coefficient NRcoef(f, K) for suppressing the non-target sound, and supplies it to the suppression coefficient multiplication unit 45. The suppression coefficient calculation unit 44 switches the method of calculating NRcoef(f, K) according to the determination result of the emitted non-target sound type determination unit 48.

例えば、抑圧係数算出部４４は、放音非目的音種判定部４８の判定結果が、音源データｓｉｇＬ、ｓｉｇＲは衝撃音を含むという結果のときには、（２）式に従って抑圧係数ＮＲｃｏｅｆ（ｆ，Ｋ）を算出し、一方、放音非目的音種判定部４８の判定結果が、音源データｓｉｇＬ、ｓｉｇＲは衝撃音を含まないという結果のときには、第１のコヒーレンス係数計算部４２が得たコヒーレンス係数Ｘｃｏｅｆ（ｆ，Ｋ）をそのまま抑圧係数ＮＲｃｏｅｆ（ｆ，Ｋ）とする。衝撃音の有無に対し、これ以外の対応であっても良い。例えば、抑圧係数算出部４４は、衝撃音の有無に応じて、（２）式におけるαを切り替えるようにしても良い（なお、衝撃音を含む場合の方がαを大きくする）。 For example, when the determination result of the emitted non-target sound type determination unit 48 is that the sound source data sigL and sigR contain impact sounds, the suppression coefficient calculation unit 44 calculates the suppression coefficient NRcoef(f, K) according to equation (2). When the determination result is that the sound source data do not contain impact sounds, it uses the coherence coefficient Xcoef(f, K) obtained by the first coherence coefficient calculation unit 42 as the suppression coefficient NRcoef(f, K) without modification. Other ways of responding to the presence or absence of impact sounds are also possible. For example, the suppression coefficient calculation unit 44 may switch the value of α in equation (2) according to whether impact sounds are present, using a larger α when they are.

ＮＲｃｏｅｆ（ｆ，Ｋ）＝Ｘｃｏｅｆ（ｆ，Ｋ）−α×Ｙｃｏｅｆ（ｆ，Ｋ）　但し、αは０．０＜α≦１．０の範囲の値 …（２） NRcoef(f, K) = Xcoef(f, K) − α × Ycoef(f, K), where α is a value in the range 0.0 < α ≤ 1.0 … (2)

抑圧係数乗算部４５は、放音非目的音が除去された入力音信号から得られた一方の周波数領域信号ＸＬ（ｆ，Ｋ）に対し、（３）式に示すように、抑圧係数ＮＲｃｏｅｆ（ｆ，Ｋ）を乗算して非目的音が除去された周波数領域信号（言い換えると、目的音の周波数領域信号）Ｚ（ｆ，Ｋ）を得るものである。 The suppression coefficient multiplication unit 45 multiplies one of the frequency-domain signals obtained from the input sound signals from which the emitted non-target sound has been removed, namely XL(f, K), by the suppression coefficient NRcoef(f, K), as shown in equation (3), to obtain a frequency-domain signal Z(f, K) from which the non-target sound has been removed (in other words, the frequency-domain signal of the target sound).

Ｚ（ｆ，Ｋ）＝ＸＬ（ｆ，Ｋ）×ＮＲｃｏｅｆ（ｆ、Ｋ） …（３） Z(f, K) = XL(f, K) × NRcoef(f, K) … (3)

ＩＦＦＴ部４６は、周波数領域信号である非目的音抑圧信号Ｚ（ｆ、Ｋ）を時間領域信号ｚ（ｎ）に変換するものである。後段回路が、周波数領域信号Ｚ（ｆ、Ｋ）をそのまま処理できる構成であれば、ＩＦＦＴ部４６は省略することができる。 The IFFT unit 46 converts the non-target sound suppression signal Z(f, K), which is a frequency-domain signal, into a time-domain signal z(n). If the subsequent circuit can process the frequency-domain signal Z(f, K) as it is, the IFFT unit 46 can be omitted.
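
The switching of the suppression coefficient and the multiplication of equation (3) can be sketched for a single frame as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the spectra are plain Python lists indexed by frequency bin, and no clipping of negative coefficients is applied because the text does not describe any.

```python
def suppression_coefficients(xcoef, ycoef, has_impact, alpha=1.0):
    """Return NRcoef(f, K) for one frame.

    With impact sounds: equation (2), Xcoef - alpha * Ycoef per frequency.
    Without impact sounds: Xcoef is used unchanged.
    """
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must satisfy 0.0 < alpha <= 1.0")
    if len(xcoef) != len(ycoef):
        raise ValueError("Xcoef and Ycoef must cover the same frequencies")
    if not has_impact:
        return list(xcoef)
    return [x - alpha * y for x, y in zip(xcoef, ycoef)]


def apply_suppression(xl_frame, nrcoef):
    """Equation (3): Z(f, K) = XL(f, K) * NRcoef(f, K) per frequency."""
    if len(xl_frame) != len(nrcoef):
        raise ValueError("spectrum and coefficients must have equal length")
    return [x * g for x, g in zip(xl_frame, nrcoef)]
```

Keeping the two steps separate mirrors the split between the suppression coefficient calculation unit 44 and the suppression coefficient multiplication unit 45, so the switching rule can be changed (for example, to a per-case α) without touching the multiplication.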

放音非目的音キャンセラ処理部３２も、音源分離処理部３３と同様に、非目的音の除去機能を有するものである。音源分離処理部３３に加えて、放音非目的音キャンセラ処理部３２を設けるようにしたのは、以下の理由による。すなわち、非目的音を一括して捉えるのではなく、放音非目的音及び背景非目的音を区別し、それぞれに適した除去方法を考慮し、放音非目的音を放音非目的音キャンセラ処理部３２で除去し、背景非目的音を音源分離処理部３３で除去することとした。すなわち、音源分離処理部３３の前処理部として放音非目的音キャンセラ処理部３２を設け、音源分離処理部３３が不得手なＬチャンネルとＲチャンネルの相関が強い非目的音成分を放音非目的音キャンセラ処理部３２で予め抑圧しておくことにより、音源分離処理部３３の機能を十分に発揮させると同時に、放音非目的音キャンセラ処理部３２で抑圧しきれなかった非目的音成分を音源分離処理部３３で抑圧し、音源分離処理部３３を単体で適用するよりもはるかに高性能な非目的音の抑圧性能を得るようにしている。 Like the sound source separation processing unit 33, the emitted non-target sound canceller processing unit 32 has a function of removing non-target sound. It is provided in addition to the sound source separation processing unit 33 for the following reason. Rather than treating non-target sound as a single whole, the embodiment distinguishes the emitted non-target sound from the background non-target sound and applies a removal method suited to each: the emitted non-target sound is removed by the emitted non-target sound canceller processing unit 32, and the background non-target sound is removed by the sound source separation processing unit 33. In other words, the emitted non-target sound canceller processing unit 32 serves as a preprocessing stage for the sound source separation processing unit 33. It suppresses in advance the non-target sound components with strong correlation between the L and R channels, which the sound source separation processing unit 33 handles poorly, so that the sound source separation processing unit 33 can work to its full capability. The sound source separation processing unit 33 in turn suppresses the non-target sound components that the canceller could not fully suppress. The combination suppresses non-target sound far better than the sound source separation processing unit 33 applied alone.

音源分離処理部３３の音源分離方法としてコヒーレンスフィルタ法を単に適用する場合であれば、放音非目的音が除去された入力音信号ＥＣｏｕｔＬ（ｎ）、ＥＣｏｕｔＲ（ｎ）から非目的音の抑圧に用いる抑圧係数を得るようにすれば良い。この実施形態において、放音非目的音が除去された入力音信号ＥＣｏｕｔＬ（ｎ）、ＥＣｏｕｔＲ（ｎ）だけでなく、疑似放音目的音信号ＰＳｅｃｈｏＬ（ｎ）、ＰＳｅｃｈｏＲ（ｎ）をも適用して、非目的音の抑圧に用いる抑圧係数ＮＲｃｏｅｆ（ｆ，Ｋ）を得ている。このようにしたのは、以下の理由による。 If the coherence filter method is simply applied as the sound source separation method of the sound source separation processing unit 33, the non-target sound is suppressed from the input sound signals ECoutL (n) and ECoutR (n) from which the emitted non-target sound is removed. What is necessary is just to obtain the suppression coefficient to be used. In this embodiment, not only the input sound signals ECoutL (n) and ECoutR (n) from which the sound non-target sound has been removed, but also the pseudo sound emission target sound signals PSechoL (n) and PSechoR (n) are applied. The suppression coefficient NRcoef (f, K) used for suppressing the non-target sound is obtained. The reason for this is as follows.

スピーカ３Ｌ、３Ｒから放音される放音音が、例えば、楽曲であって、打楽器の音のような突発的に全周波数に成分を有する衝撃音（例えば、ロックにおけるドラムの音）が含まれる場合、放音非目的音キャンセラ処理部３２（の適応フィルタ）における追従が間に合わず、放音非目的音を十分に抑圧できない。また、衝撃音は、全周波数に成分を有するため、その到来方位が正面ではなくても、左右のスピーカ３Ｌ、３Ｒから放音された音同士が強い相関を有し、恰も正面から到来するような特性を有する。そのため、放音非目的音が除去された入力音信号ＥＣｏｕｔＬ（ｎ）、ＥＣｏｕｔＲ（ｎ）だけから非目的音の抑圧に用いる抑圧係数を得た場合には、放音非目的音が衝撃音のときに、放音非目的音の除去が不十分となる。 When the sound emitted from the speakers 3L and 3R is, for example, music containing impact sounds that abruptly have components at all frequencies, such as percussion (for example, drums in rock music), the adaptive filter of the emitted non-target sound canceller processing unit 32 cannot track the change in time, and the emitted non-target sound is not sufficiently suppressed. In addition, because an impact sound has components at all frequencies, the sounds emitted from the left and right speakers 3L and 3R are strongly correlated with each other even though they do not arrive from the front, and they behave as if they arrived from the front. Consequently, if the suppression coefficient were obtained only from the input sound signals ECoutL(n) and ECoutR(n) from which the emitted non-target sound has been removed, removal of the emitted non-target sound would be insufficient when that sound is an impact sound.

このような不都合を回避するために、疑似放音目的音信号ＰＳｅｃｈｏＬ、ＰＳｅｃｈｏＲも非目的音の抑圧に用いる抑圧係数ＮＲｃｏｅｆの形成に用いることとした。 In order to avoid such an inconvenience, the pseudo sound emission target sound signals PSechoL and PSechoR are also used to form the suppression coefficient NRcoef used for suppressing the non-target sound.

放音非目的音キャンセラ処理部３２で算出される疑似放音目的音信号ＰＳｅｃｈｏＬ、ＰＳｅｃｈｏＲは、音源データｓｉｇＬ、ｓｉｇＲにスピーカ３Ｌ、３Ｒからマイクロホン４Ｌ、４Ｒまでの伝達特性を畳み込んだ信号であるので、マイクロホン４Ｌ、４Ｒが捕捉した入力音信号ｉｎｐｕｔＬ、ｉｎｐｕｔＲに含まれる妨害音成分と近い特性を有していると言える。従って、疑似放音目的音信号ＰＳｅｃｈｏＬ、ＰＳｅｃｈｏＲ、あるいは、疑似放音目的音信号ＰＳｅｃｈｏＬ、ＰＳｅｃｈｏＲから得られる特徴量を参照にすることで、衝撃音への抑圧性能を高めることが期待できる。 The pseudo sound emission target sound signals PSechoL and PSechoR calculated by the emitted non-target sound canceller processing unit 32 are obtained by convolving the sound source data sigL and sigR with the transfer characteristics from the speakers 3L and 3R to the microphones 4L and 4R. They can therefore be said to have characteristics close to those of the interference sound components contained in the input sound signals inputL and inputR captured by the microphones 4L and 4R. Accordingly, referring to the pseudo sound emission target sound signals PSechoL and PSechoR, or to feature quantities obtained from them, can be expected to improve the suppression of impact sounds.

そのため、第１の実施形態においては、疑似放音目的音信号ＰＳｅｃｈｏＬ、ＰＳｅｃｈｏＲも、非目的音の抑圧に用いる抑圧係数ＮＲｃｏｅｆの形成に用いることとした。 Therefore, in the first embodiment, the pseudo sound emission target sound signals PSechoL and PSechoR are also used to form the suppression coefficient NRcoef used for suppressing the non-target sound.

次に、疑似放音目的音信号ＰＳｅｃｈｏＬ、ＰＳｅｃｈｏＲを非目的音の抑圧に用いる抑圧係数ＮＲｃｏｅｆの形成に用いることができることを、より具体的に説明する。 Next, it will be described in more detail that the pseudo sound emission target sound signals PSechoL and PSechoR can be used to form the suppression coefficient NRcoef used for suppressing the non-target sound.

第１の実施形態が想定する機器構成（上述した図６、図７参照）を考慮すると、妨害音が正面から到来することはあり得ない。この挙動を、特許文献１に記載のコヒーレンスのような到来方位と直結する特徴量の挙動と対応付けると、妨害音は、正面から到来する目的音と同等以上のコヒーレンス値をとらないと言うことができる。しかし、上述した通り、妨害音に衝撃音が含まれる場合には、左右のスピーカ３Ｌ、３Ｒから放音される妨害音同士の相関が著しく増し、妨害音であるにも拘わらず、正面から到来するかのような挙動をする。つまり、衝撃音が含まれる場合の妨害音のコヒーレンス値は目的音と同等以上の値となる。従って、妨害音の到来方位に応じて雑音抑圧ゲインを設定するコヒーレンスフィルタ法では、十分に妨害音を抑圧できない。ところで、疑似放音目的音信号ＰＳｅｃｈｏＬ、ＰＳｅｃｈｏＲは、放音されれば放音非目的音となる音源データｓｉｇＬ、ｓｉｇＲに、スピーカ３Ｌ、３Ｒからマイクロホン４Ｌ、４Ｒまでの伝達特性を畳み込んだ音なので、目的音成分は含まず、両脇のスピーカ３Ｌ、３Ｒから到来する妨害音成分だけに由来する信号である。よって、２つの疑似放音目的音信号ＰＳｅｃｈｏＬ、ＰＳｅｃｈｏＲから得られるコヒーレンス値のレンジは、目的音のレンジより小さく、仮に、妨害音源データｓｉｇＬ、ｓｉｇＲに衝撃音が含まれている場合には、疑似放音目的音信号ＰＳｅｃｈｏＬ、ＰＳｅｃｈｏＲのコヒーレンスが大きくなる。逆に言えば、疑似放音目的音信号ＰＳｅｃｈｏＬ、ＰＳｅｃｈｏＲのコヒーレンスの急増によって衝撃音の発生を検出することができる。疑似放音目的音信号ＰＳｅｃｈｏＬ、ＰＳｅｃｈｏＲから得られたコヒーレンスフィルタ係数Ｙｃｏｅｆを参照することで、衝撃音の成分を周波数毎に取得することができる。放音非目的音キャンセラ処理部３２から出力された放音非目的音が除去された入力音信号ＥＣｏｕｔＬ、ＥＣｏｕｔＲから得たコヒーレンスフィルタ係数Ｘｃｏｅｆを、疑似放音目的音信号ＰＳｅｃｈｏＬ、ＰＳｅｃｈｏＲから得られたコヒーレンスフィルタ係数Ｙｃｏｅｆで（２）式に示すように調整することにより、衝撃音に由来する成分をコヒーレンスフィルタ係数から除去し、より正確な抑圧係数ＮＲｃｏｅｆを算出することができる。 Given the device configuration assumed in the first embodiment (see FIGS. 6 and 7 described above), interference sound cannot arrive from the front. In terms of a feature quantity directly tied to the direction of arrival, such as the coherence described in Patent Document 1, this means that interference sound does not take a coherence value equal to or greater than that of the target sound arriving from the front. As described above, however, when the interference sound contains impact sounds, the correlation between the interference sounds emitted from the left and right speakers 3L and 3R increases markedly, and the interference sound behaves as if it arrived from the front. That is, the coherence value of interference sound containing impact sounds is equal to or greater than that of the target sound. A coherence filter method that sets the noise suppression gain according to the direction of arrival of the interference sound therefore cannot suppress it sufficiently. The pseudo sound emission target sound signals PSechoL and PSechoR, on the other hand, are obtained by convolving the sound source data sigL and sigR, which become emitted non-target sound once emitted, with the transfer characteristics from the speakers 3L and 3R to the microphones 4L and 4R. They contain no target sound component and derive only from the interference sound arriving from the speakers 3L and 3R on either side. The range of coherence values obtained from the two signals PSechoL and PSechoR is therefore smaller than that of the target sound, and if the interfering sound source data sigL and sigR contain impact sounds, the coherence of PSechoL and PSechoR becomes large. Conversely, the occurrence of an impact sound can be detected from a sharp increase in the coherence of PSechoL and PSechoR. By referring to the coherence filter coefficient Ycoef obtained from PSechoL and PSechoR, the impact sound component can be obtained for each frequency. Adjusting the coherence filter coefficient Xcoef, obtained from the input sound signals ECoutL and ECoutR output by the emitted non-target sound canceller processing unit 32, with the coherence filter coefficient Ycoef as shown in equation (2) removes the component derived from the impact sound from the coherence filter coefficient and yields a more accurate suppression coefficient NRcoef.

また、コヒーレンスは、信号レベルで正規化されている周波数成分毎のコヒーレンス係数を平均したものであるので、放音非目的音の音量に影響を受けずに算出できる。従って、ロックとクラシックのような音量が大きく異なる楽曲同士であっても、音量に依存せずに特性を比較でき、音量が大きいクラシックを誤ってロックと判定するようなことを極力排除することができる。 Furthermore, because the coherence is the average of per-frequency coherence coefficients that are normalized by signal level, it can be calculated without being affected by the volume of the emitted non-target sound. Therefore, even pieces of music with very different volumes, such as rock and classical music, can be compared independently of volume, and misjudging loud classical music as rock can be avoided as far as possible.

（２）式は、音源データｓｉｇＬ、ｓｉｇＲが衝撃音を含む場合の音源分離の精度低下を防止する工夫を有するものであるが、音源データｓｉｇＬ、ｓｉｇＲが衝撃音を含まない場合に、その工夫が却って精度に影響する恐れがある。 Equation (2) is designed to prevent the loss of sound source separation accuracy that occurs when the sound source data sigL and sigR contain impact sounds. When the sound source data contain no impact sounds, however, that measure may instead degrade accuracy.

そこで、この実施形態においては、放音非目的音となる音源データｓｉｇＬ、ｓｉｇＲが衝撃音を含むか否かに応じて、非目的音の抑圧係数の算出方法（（２）式を適用するか否か）を切り替えて、音源データｓｉｇＬ、ｓｉｇＲが衝撃音を含むか否かに拘わらず、音源分離精度を高めるようにした。 Therefore, in this embodiment, the method of calculating the suppression coefficient for non-target sound (that is, whether equation (2) is applied) is switched according to whether the sound source data sigL and sigR, which become the emitted non-target sound, contain impact sounds, so that sound source separation accuracy is improved whether or not the sound source data contain impact sounds.

（Ａ−２）第１の実施形態の動作 (A-2) Operation of the First Embodiment

次に、第１の実施形態の集音・放音装置１０の動作を説明する。以下では、音源データが楽曲データであり、目的音が、集音・放音装置１０の正面に位置する利用者が発音した音声であるとして、適宜、説明する。 Next, the operation of the sound collection / sound emission device 10 of the first embodiment will be described. In the following, the description assumes, where appropriate, that the sound source data is music data and that the target sound is the voice uttered by a user located in front of the sound collection / sound emission device 10.

各音源データ記憶部２１Ｌ、２１Ｒから読み出された音源データ（楽曲データ）はそれぞれ、対応するＤ／Ａ変換部２２Ｌ、２２Ｒによってアナログ信号に変換された後、各スピーカ３Ｌ、３Ｒから放音される。このような音楽が当該集音・放音装置１０から流れているときに、利用者が当該集音・放音装置１０に向かって発音した音声は、両マイクロホン４Ｌ及び４Ｒによって捕捉される。この際、スピーカ３Ｌ、３Ｒからの音楽も流れているため、スピーカ３Ｌからの音楽も両マイクロホン４Ｌ及び４Ｒによって捕捉され、スピーカ３Ｒからの音楽も両マイクロホン４Ｌ及び４Ｒによって捕捉される。さらに、周囲の背景雑音（エアコンの駆動音、近くを走行する車両からの走行音など）も、両マイクロホン４Ｌ及び４Ｒによって捕捉される。 The sound source data (music data) read from the sound source data storage units 21L and 21R are converted into analog signals by the corresponding D/A conversion units 22L and 22R and then emitted from the speakers 3L and 3R. While such music is playing from the sound collection / sound emission device 10, speech uttered by the user toward the device is captured by both microphones 4L and 4R. Because music is also playing from the speakers 3L and 3R, the music from the speaker 3L is captured by both microphones 4L and 4R, and the music from the speaker 3R is likewise captured by both microphones. Ambient background noise (such as the operating sound of an air conditioner or the sound of vehicles passing nearby) is also captured by both microphones 4L and 4R.

すなわち、各マイクロホン４Ｌ、４Ｒが捕捉して得た入力音信号には、利用者の音声という目的音以外に、自装置が放音した音楽という放音非目的音や、背景雑音などの非目的音（背景非目的音）が含まれている。 In other words, the input sound signals obtained by the microphones 4L and 4R include non-purpose sounds such as music emitted by the device itself and non-purpose sounds such as background noise, in addition to the target sound of the user's voice. Sound (background non-target sound) is included.

各マイクロホン４Ｌ、４Ｒが捕捉して得た入力音信号はそれぞれ、対応するＡ／Ｄ変換部３１Ｌ、３１Ｒによってデジタル信号ｉｎｐｕｔＬ、ｉｎｐｕｔＲに変換されて放音非目的音キャンセラ処理部３２に与えられる。放音非目的音キャンセラ処理部３２には、音源データｓｉｇＬ及びｓｉｇＲも与えられる。 Input sound signals obtained by the microphones 4L and 4R are converted into digital signals inputL and inputR by the corresponding A / D converters 31L and 31R, respectively, and are supplied to the sound emission non-target sound canceller processing unit 32. The sound emission non-target sound canceller processing unit 32 is also provided with sound source data sigL and sigR.

放音非目的音キャンセラ処理部３２においては、Ｌチャンネルに係る入力音信号（デジタル信号）ｉｎｐｕｔＬから、内部で生成した疑似放音目的音信号ＰＳｅｃｈｏＬを減算することにより、放音非目的音が除去された入力音信号ＥＣｏｕｔＬが得られ、同様に、Ｒチャンネルに係る入力音信号（デジタル信号）ｉｎｐｕｔＲから、内部で生成した疑似放音目的音信号ＰＳｅｃｈｏＲを減算することにより、放音非目的音が除去された入力音信号ＥＣｏｕｔＲが得られる。このようにして得られた放音非目的音が除去された一対の入力音信号ＥＣｏｕｔＬ、ＥＣｏｕｔＲが、内部生成の一対の疑似放音目的音信号ＰＳｅｃｈｏＬ、ＰＳｅｃｈｏＲと共に、音源分離処理部３３に与えられる。 In the emitted non-target sound canceller processing unit 32, the internally generated pseudo sound emission target sound signal PSechoL is subtracted from the L-channel input sound signal (digital signal) inputL to obtain the input sound signal ECoutL from which the emitted non-target sound has been removed. Likewise, the internally generated pseudo sound emission target sound signal PSechoR is subtracted from the R-channel input sound signal (digital signal) inputR to obtain the input sound signal ECoutR from which the emitted non-target sound has been removed. The resulting pair of input sound signals ECoutL and ECoutR is supplied to the sound source separation processing unit 33 together with the pair of internally generated pseudo sound emission target sound signals PSechoL and PSechoR.

音源分離処理部３３においては、ＦＦＴ部４１によって、時間領域信号である、放音非目的音が除去された入力音信号ＥＣｏｕｔＬ（ｎ）、ＥＣｏｕｔＲ（ｎ）と、疑似放音目的音信号ＰＳｅｃｈｏＬ（ｎ）、ＰＳｅｃｈｏＲ（ｎ）とがそれぞれ、周波数領域信号ＸＬ（ｆ，Ｋ）、ＸＲ（ｆ，Ｋ）、ＹＬ（ｆ，Ｋ）、ＹＲ（ｆ，Ｋ）に変換される。 In the sound source separation processing unit 33, the FFT unit 41 converts the time-domain signals, namely the input sound signals ECoutL(n) and ECoutR(n) from which the emitted non-target sound has been removed and the pseudo sound emission target sound signals PSechoL(n) and PSechoR(n), into the frequency-domain signals XL(f, K), XR(f, K), YL(f, K), and YR(f, K), respectively.

そして、第１のコヒーレンス係数計算部４２によって、放音非目的音が除去された入力音信号ＥＣｏｕｔＬ（ｎ）、ＥＣｏｕｔＲ（ｎ）から得られた周波数領域信号ＸＬ（ｆ，Ｋ）及びＸＲ（ｆ，Ｋ）に基づいて、コヒーレンス係数Ｘｃｏｅｆ（ｆ，Ｋ）が計算され、第２のコヒーレンス係数計算部４３によって、疑似放音目的音信号ＰＳｅｃｈｏＬ（ｎ）、ＰＳｅｃｈｏＲ（ｎ）から得られた周波数領域信号ＹＬ（ｆ，Ｋ）及びＹＲ（ｆ，Ｋ）に基づいてコヒーレンス係数Ｙｃｏｅｆ（ｆ，Ｋ）が計算される。 Then, the first coherence coefficient calculation unit 42 performs frequency domain signals XL (f, K) and XR (f) obtained from the input sound signals ECoutL (n) and ECoutR (n) from which the emitted non-target sound has been removed. , K), the coherence coefficient Xcoef (f, K) is calculated, and the second coherence coefficient calculation unit 43 obtains the frequency region obtained from the pseudo sound emission target sound signals PSechoL (n) and PSechoR (n). A coherence coefficient Ycoef (f, K) is calculated based on the signals YL (f, K) and YR (f, K).

その後、コヒーレンス計算部４７によって、第２のコヒーレンス係数計算部４３が得た第２のコヒーレンス係数Ｙｃｏｅｆ（ｆ，Ｋ）から、コヒーレンスＣＯＨ（Ｋ）が計算され、放音非目的音種判定部４８によって、コヒーレンスＣＯＨ（Ｋ）の挙動に基づいて、放音非目的音となる音源データｓｉｇＬ、ｓｉｇＲが、衝撃音を含むか否かが判定され、その判定結果が抑圧係数算出部４４に与えられる。 Thereafter, the coherence calculation unit 47 calculates the coherence COH (K) from the second coherence coefficient Ycoef (f, K) obtained by the second coherence coefficient calculation unit 43, and the sound emission non-target sound type determination unit 48. Thus, based on the behavior of the coherence COH (K), it is determined whether or not the sound source data sigL and sigR, which are sound emission non-target sounds, include an impact sound, and the determination result is given to the suppression coefficient calculation unit 44. .

抑圧係数算出部４４において、音源データは衝撃音を含むという判定結果が与えられると、上述した（２）式に従って抑圧係数ＮＲｃｏｅｆ（ｆ，Ｋ）が算出され、一方、音源データは衝撃音を含まないという判定結果が与えられると、第１のコヒーレンス係数計算部４２が得たコヒーレンス係数Ｘｃｏｅｆ（ｆ，Ｋ）がそのまま抑圧係数ＮＲｃｏｅｆ（ｆ，Ｋ）とされる。そして、抑圧係数乗算部４５によって、放音非目的音が除去された入力音信号から得られた一方の周波数領域信号ＸＬ（ｆ，Ｋ）に抑圧係数ＮＲｃｏｅｆ（ｆ，Ｋ）が周波数成分毎に乗算されて非目的音が除去された周波数領域信号Ｚ（ｆ，Ｋ）が得られる。この周波数領域信号である非目的音抑圧信号Ｚ（ｆ、Ｋ）をＩＦＦＴ部４６によって時間領域信号ｚ（ｎ）に変換することにより、目的音成分だけを含む出力信号ｏｕｔｐｕｔ（＝ｚ（ｎ））が得られる。 When the suppression coefficient calculation unit 44 is given the determination result that the sound source data contain impact sounds, it calculates the suppression coefficient NRcoef(f, K) according to equation (2) above; when it is given the determination result that the sound source data do not contain impact sounds, the coherence coefficient Xcoef(f, K) obtained by the first coherence coefficient calculation unit 42 is used as the suppression coefficient NRcoef(f, K) without modification. The suppression coefficient multiplication unit 45 then multiplies one of the frequency-domain signals obtained from the input sound signals from which the emitted non-target sound has been removed, XL(f, K), by the suppression coefficient NRcoef(f, K) for each frequency component, yielding the frequency-domain signal Z(f, K) from which the non-target sound has been removed. The IFFT unit 46 converts this non-target sound suppression signal Z(f, K) into the time-domain signal z(n), giving an output signal output (= z(n)) that contains only the target sound component.

（Ａ−３）第１の実施形態の効果 (A-3) Effects of the First Embodiment

第１の実施形態によれば、非目的音を一括して捉えるのではなく、放音非目的音及び背景非目的音に区別し、さらに放音非目的音については音種を判定し、それぞれに適した除去処理を適用して除去して目的音を抽出するようにしたので、目的音の抽出精度を非常に高いものとすることができる。 According to the first embodiment, non-target sound is not treated as a single whole but is divided into emitted non-target sound and background non-target sound, the sound type of the emitted non-target sound is further determined, and the target sound is extracted after a removal process suited to each is applied. The target sound can therefore be extracted with very high accuracy.

その結果、例えば、抽出した目的音成分である音声を通話に用いた場合には通話音質を高めることができ、抽出した目的音成分である音声を音声認識に供する場合には認識率を高めることができる。 As a result, for example, when the extracted target sound component (speech) is used for a call, call quality can be improved, and when it is supplied to speech recognition, the recognition rate can be improved.

（Ｂ）第２の実施形態 (B) Second Embodiment

次に、本発明による集音・放音装置及び集音・放音プログラムの第２の実施形態を、第１の実施形態との差異を中心に説明する。 Next, a second embodiment of the sound collection / sound emission device and sound collection / sound emission program according to the present invention will be described, focusing on the differences from the first embodiment.

図５は、第２の実施形態の放音非目的音キャンセラ処理部（以下、符号３２Ａを用いる）の詳細構成を示すブロック図である。 FIG. 5 is a block diagram illustrating a detailed configuration of a sound emission non-target sound canceller processing unit (hereinafter, reference numeral 32A is used) according to the second embodiment.

図５において、放音非目的音キャンセラ処理部３２Ａは、４つの擬似放音非目的音生成部６１ＬＬ〜６１ＲＲと、４つの減算部６２ＬＬ〜６２ＲＲとを有している。 In FIG. 5, the sound emission non-purpose sound canceller processing unit 32A includes four pseudo sound emission non-purpose sound generation units 61LL to 61RR and four subtraction units 62LL to 62RR.

上述したように、スピーカ３Ｌ、３Ｒから放音され、マイクロホン４Ｒ、４Ｌによって捕捉される放音非目的音は、電話通信において問題となっている音響エコーと同様にみなすことができる。第２の実施形態においては、放音非目的音キャンセラ処理部３２Ａを、モノラルエコーキャンセラの技術を４つ適用して構成した。なお、図５に示すような構成もステレオエコーキャンセラの範疇に属すると捉えることができる（非特許文献１の図３．７３参照）。 As described above, the emitted non-target sound that is emitted from the speakers 3L and 3R and captured by the microphones 4R and 4L can be regarded in the same way as the acoustic echo that is a problem in telephone communication. In the second embodiment, the emitted non-target sound canceller processing unit 32A is configured by applying four monaural echo cancellers. The configuration shown in FIG. 5 can also be regarded as falling within the category of stereo echo cancellers (see FIG. 3.73 of Non-Patent Document 1).

擬似放音非目的音生成部６１ＬＬは、Ｌチャンネルの入力音信号ｉｎｐｕｔＬに含まれている、スピーカ３Ｌから放音されてマイクロホン４Ｌで捕捉された放音非目的音を擬似した擬似放音非目的音を音源データｓｉｇＬに基づいて生成し、減算部６２ＬＬは、Ｌチャンネルの入力音信号ｉｎｐｕｔＬから、擬似放音非目的音生成部６１ＬＬが生成した擬似放音非目的音を減算し、Ｌチャンネルの入力音信号ｉｎｐｕｔＬから、スピーカ３Ｌから放音されてマイクロホン４Ｌで捕捉された放音非目的音の成分を除去するものである。 The simulated sound emission non-purpose sound generation unit 61LL simulates the sound emission non-purpose sound that is included in the L channel input sound signal inputL and is emitted from the speaker 3L and captured by the microphone 4L. The sound is generated based on the sound source data sigL, and the subtraction unit 62LL subtracts the pseudo sound emission non-purpose sound generated by the pseudo sound emission non-purpose sound generation unit 61LL from the L channel input sound signal inputL, From the input sound signal inputL, the component of the non-target sound emitted from the speaker 3L and captured by the microphone 4L is removed.

擬似放音非目的音生成部６１ＲＬは、Ｌチャンネルの入力音信号ｉｎｐｕｔＬに含まれている、スピーカ３Ｒから放音されてマイクロホン４Ｌで捕捉された放音非目的音を擬似した擬似放音非目的音を音源データｓｉｇＲに基づいて生成し、減算部６２ＲＬは、減算部６２ＬＬの出力音信号から、擬似放音非目的音生成部６１ＲＬが生成した擬似放音非目的音を減算し、減算部６２ＬＬの出力音信号から、スピーカ３Ｒから放音されてマイクロホン４Ｌで捕捉された放音非目的音の成分を除去するものである。 The pseudo emitted non-target sound generation unit 61RL generates, based on the sound source data sigR, a pseudo emitted non-target sound simulating the emitted non-target sound that is emitted from the speaker 3R, captured by the microphone 4L, and contained in the L-channel input sound signal inputL. The subtraction unit 62RL subtracts this pseudo emitted non-target sound from the output sound signal of the subtraction unit 62LL, thereby removing from that signal the component of the emitted non-target sound emitted from the speaker 3R and captured by the microphone 4L.

As a result, the input sound signal ECoutL output from the stage of the pseudo emitted non-target sound generation unit 61RL is the input sound signal inputL from which two components have been removed: the emitted non-target sound emitted from the speaker 3L and captured by the microphone 4L, and the emitted non-target sound emitted from the speaker 3R and captured by the microphone 4L.

The pseudo emitted non-target sound generation unit 61LR generates, based on the sound source data sigL, a pseudo emitted non-target sound that simulates the emitted non-target sound contained in the R-channel input sound signal inputR, namely the sound emitted from the speaker 3L and captured by the microphone 4R. The subtraction unit 62LR subtracts the pseudo emitted non-target sound generated by the generation unit 61LR from the R-channel input sound signal inputR, thereby removing from inputR the component of the emitted non-target sound that was emitted from the speaker 3L and captured by the microphone 4R.

The pseudo emitted non-target sound generation unit 61RR generates, based on the sound source data sigR, a pseudo emitted non-target sound that simulates the emitted non-target sound contained in the R-channel input sound signal inputR, namely the sound emitted from the speaker 3R and captured by the microphone 4R. The subtraction unit 62RR subtracts the pseudo emitted non-target sound generated by the generation unit 61RR from the output sound signal of the stage of the pseudo emitted non-target sound generation unit 61LR (the signal from which the speaker 3L component has already been subtracted), thereby removing from that signal the component of the emitted non-target sound that was emitted from the speaker 3R and captured by the microphone 4R.

As a result, the input sound signal ECoutR output from the stage of the pseudo emitted non-target sound generation unit 61RR is the input sound signal inputR from which two components have been removed: the emitted non-target sound emitted from the speaker 3L and captured by the microphone 4R, and the emitted non-target sound emitted from the speaker 3R and captured by the microphone 4R.
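
The two-stage subtraction performed for each channel may be summarized as follows. The symbol $\hat{y}_{XY}(n)$ is notation introduced here for illustration (it does not appear in the original text) and denotes the pseudo emitted non-target sound generated by the unit 61XY for the path from speaker 3X to microphone 4Y at sample index $n$.

```latex
\begin{aligned}
\mathrm{ECout}_L(n) &= \mathrm{input}_L(n) - \hat{y}_{LL}(n) - \hat{y}_{RL}(n),\\
\mathrm{ECout}_R(n) &= \mathrm{input}_R(n) - \hat{y}_{LR}(n) - \hat{y}_{RR}(n).
\end{aligned}
```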

Each of the pseudo emitted non-target sound generation units 61LL to 61RR is configured as an adaptive filter of the kind used in acoustic echo cancellers. The adaptive algorithm applied by these adaptive filters is not limited; for example, the learning identification algorithm (normalized LMS, NLMS) can be applied.
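
As a rough illustration of how one such adaptive filter might behave, the following minimal Python sketch applies the learning identification (NLMS) update to a single speaker-to-microphone path. The function names, the tap count, the step size of 0.5, the regularization constant `eps`, and the three-tap test path are all assumptions made for this example and are not taken from the embodiment.

```python
import random


def nlms_identify(reference, microphone, num_taps, step_size=0.5, eps=1e-8):
    """Estimate one acoustic path with the NLMS (learning identification) update.

    reference  : samples sent to the speaker (e.g. sigL)
    microphone : samples captured by the microphone
    Returns (estimated taps, residual after subtracting the pseudo sound).
    """
    weights = [0.0] * num_taps
    history = [0.0] * num_taps  # most recent reference sample first
    residual = []
    for x, d in zip(reference, microphone):
        history = [x] + history[:-1]
        # Pseudo emitted non-target sound for this sample.
        estimate = sum(w * h for w, h in zip(weights, history))
        error = d - estimate  # role of the subtraction unit
        power = sum(h * h for h in history) + eps
        gain = step_size * error / power
        weights = [w + gain * h for w, h in zip(weights, history)]
        residual.append(error)
    return weights, residual


def demo_white_noise_identification():
    """Identify a hypothetical 3-tap path by playing white noise."""
    random.seed(0)
    true_path = [0.5, -0.3, 0.1]  # hypothetical path, for illustration only
    noise = [random.uniform(-1.0, 1.0) for _ in range(2000)]
    mic = [
        sum(true_path[k] * noise[n - k] for k in range(len(true_path)) if n - k >= 0)
        for n in range(len(noise))
    ]
    weights, residual = nlms_identify(noise, mic, num_taps=3)
    return true_path, weights, residual
```

In this noise-free toy setting the estimated taps approach the hypothetical path; with a real room response many more taps and a tuned step size would be needed.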

When monaural echo cancellers are used, the presence of two speakers 3L, 3R and two microphones 4L, 4R causes the acoustic paths to become congested, so the acoustic path characteristics cannot be estimated accurately and a sufficient suppression effect may not be obtained.

Therefore, prior to reproduction of the sound source data sigL and sigR, white noise is emitted from the speaker 3L alone, and the (adaptive filters of the) pseudo emitted non-target sound generation units 61LL and 61LR estimate the acoustic path characteristic H_LL from the speaker 3L to the microphone 4L and the acoustic path characteristic H_LR from the speaker 3L to the microphone 4R. Next, white noise is emitted from the speaker 3R alone, and the (adaptive filters of the) pseudo emitted non-target sound generation units 61RL and 61RR estimate the acoustic path characteristic H_RL from the speaker 3R to the microphone 4L and the acoustic path characteristic H_RR from the speaker 3R to the microphone 4R. These estimates are set as initial values.

Thereafter, the emitted non-target sound can be suppressed by subtracting, from the input sound signals captured by the microphones 4L and 4R, the pseudo emitted non-target sound signals obtained by convolving the four acoustic path characteristics with the corresponding sound source data sigL and sigR.
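
The convolution and subtraction described above can be sketched as follows. This is a minimal illustration that assumes each acoustic path characteristic is available as a finite impulse response held fixed during processing; the function names and the list-based representation are hypothetical.

```python
def convolve_causal(impulse_response, signal):
    """Return y[n] = sum_k h[k] * x[n - k], truncated to len(signal)."""
    out = []
    for n in range(len(signal)):
        acc = 0.0
        for k, hk in enumerate(impulse_response):
            if n - k >= 0:
                acc += hk * signal[n - k]
        out.append(acc)
    return out


def cancel_emitted_sound(input_l, input_r, sig_l, sig_r, h_ll, h_lr, h_rl, h_rr):
    """Subtract the four pseudo emitted non-target sounds.

    h_xy is the impulse response assumed for the path from speaker 3X
    to microphone 4Y (H_LL, H_LR, H_RL, H_RR in the text).
    """
    y_ll = convolve_causal(h_ll, sig_l)  # speaker 3L -> microphone 4L
    y_rl = convolve_causal(h_rl, sig_r)  # speaker 3R -> microphone 4L
    y_lr = convolve_causal(h_lr, sig_l)  # speaker 3L -> microphone 4R
    y_rr = convolve_causal(h_rr, sig_r)  # speaker 3R -> microphone 4R
    ecout_l = [d - a - b for d, a, b in zip(input_l, y_ll, y_rl)]
    ecout_r = [d - a - b for d, a, b in zip(input_r, y_lr, y_rr)]
    return ecout_l, ecout_r
```

When the assumed impulse responses match the real paths exactly, only the sound that did not come from the speakers remains in the two outputs.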

At this time, the result of the sound type determination made for the sound source data sigL and sigR by the emitted non-target sound type determination unit 48 (see FIG. 2) is supplied to, and used by, the pseudo emitted non-target sound generation units 61LL to 61RR. Each of the pseudo emitted non-target sound generation units 61LL to 61RR changes the step size of its adaptive filter according to the sound type determination result. For sound source data that contains impact sounds, each unit makes the step size smaller than for sound source data that contains no impact sounds, so as to speed up tracking.
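
The step size switching can be pictured with the following short sketch. The two numeric values are hypothetical placeholders chosen only so that the impact-sound case uses the smaller step size, as stated in the text; the embodiment does not give concrete values.

```python
# Hypothetical step sizes; only their ordering follows the text.
STEP_SIZE_WITH_IMPACT = 0.1
STEP_SIZE_WITHOUT_IMPACT = 0.5


def select_step_size(contains_impact_sound):
    """Return the adaptive filter step size for the determined sound type."""
    if contains_impact_sound:
        return STEP_SIZE_WITH_IMPACT
    return STEP_SIZE_WITHOUT_IMPACT
```

Each of the four adaptive filters would read this value before its next coefficient update.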

According to the second embodiment as well, non-target sounds are not handled as a single group but are divided into emitted non-target sounds and background non-target sounds, the sound type of the emitted non-target sound is further determined, and a removal process suited to each is applied to extract the target sound. The extraction accuracy of the target sound can therefore be made very high.

(C) Other Embodiments
Various modified embodiments have been mentioned in the description of the above embodiments; further modified embodiments such as those exemplified below can also be given.

In each of the above embodiments, the suppression coefficient calculation unit 44 calculates the suppression coefficient by equation (2) as necessary. Flooring may be applied after the calculation of equation (2) so that the suppression coefficient does not become too small. This prevents degradation of sound quality caused by excessive suppression.
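
Flooring of this kind amounts to clamping each suppression coefficient to a lower bound. The sketch below assumes the coefficients are held as a list of numbers and uses a hypothetical floor value of 0.1; neither detail is specified in the text.

```python
def apply_flooring(suppression_coefficients, floor=0.1):
    """Clamp every suppression coefficient so it never falls below `floor`.

    `floor` is a hypothetical value chosen for illustration.
    """
    return [max(c, floor) for c in suppression_coefficients]
```

For example, `apply_flooring([0.05, 0.3, -0.2])` leaves 0.3 untouched and raises the other two values to the floor.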

In each of the above embodiments, the sound type determination was a determination of whether the sound source data contains impact sounds, but it may instead distinguish three or more types, such as strongly containing, weakly containing, or not containing impact sounds. In that case, the coefficient α in equation (2) may be switched according to how strongly impact sounds are contained. One method for a three-way determination uses two thresholds for comparison with the long-term average and two thresholds for comparison with the variance: when both the long-term average and the variance exceed their larger thresholds, the data is determined to strongly contain impact sounds; when both are at or below their smaller thresholds, the data is determined to contain no impact sounds; and all other cases are determined to weakly contain impact sounds.
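
The three-way determination rule just described can be written down directly. In the sketch below the function name, the parameter names, and the string labels are hypothetical, and the threshold values are left to the caller because the text does not specify them.

```python
def classify_impact_content(long_term_average, variance,
                            avg_low, avg_high, var_low, var_high):
    """Three-way sound type determination using two thresholds per feature."""
    # Both features exceed their larger thresholds: impact sounds strongly contained.
    if long_term_average > avg_high and variance > var_high:
        return "strong"
    # Both features are at or below their smaller thresholds: no impact sounds.
    if long_term_average <= avg_low and variance <= var_low:
        return "none"
    # Every other combination: impact sounds weakly contained.
    return "weak"
```

The returned label could then be used to select the coefficient α in equation (2).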

In each of the above embodiments, the arithmetic expression for correcting the first coherence coefficient using the second coherence coefficient is the subtraction shown in equation (2), but another arithmetic expression (function) may be applied to correct the first coherence coefficient using the second coherence coefficient. For example, the suppression coefficient may be calculated by dividing the first coherence coefficient by the second coherence coefficient multiplied by a constant.

In each of the above embodiments, the feature quantities used for determining the emitted non-target sound (interfering sound) are the variance and the long-term average of the coherence, but other statistics may be used as long as they can distinguish the behaviors shown in FIG. 3. For example, the maximum value of the coherence divided by its average value, or the coefficient of variation (= standard deviation / average value), may be used as the feature quantity.
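
The two alternative statistics named above are easy to compute from a sequence of coherence values. The sketch below is illustrative only: it assumes the coherence values are collected in a list over some observation window and uses the population standard deviation, a choice the text leaves open.

```python
import statistics


def peak_to_mean_ratio(coherence_values):
    """Maximum coherence divided by the average coherence."""
    mean = statistics.fmean(coherence_values)
    if mean == 0:
        raise ValueError("average coherence is zero")
    return max(coherence_values) / mean


def coefficient_of_variation(coherence_values):
    """Standard deviation divided by the average (population form assumed)."""
    mean = statistics.fmean(coherence_values)
    if mean == 0:
        raise ValueError("average coherence is zero")
    return statistics.pstdev(coherence_values) / mean
```

A perfectly steady coherence gives a peak-to-mean ratio of 1 and a coefficient of variation of 0; both grow as the coherence fluctuates more.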

Further, instead of the coherence, the feature quantity may be calculated from the coherence coefficients of one or more, but not all, frequency components. Furthermore, without computing coherence coefficients or coherence, the presence or absence of impact sounds, or the degree to which impact sounds are mixed in, may be determined from, for example, changes in the power of the pseudo emitted non-target sound signal. Moreover, the feature quantity used for the determination is not limited to one obtained from the pseudo emitted non-target sound signal. For example, instead of or in addition to a feature quantity obtained from the pseudo emitted non-target sound signal, a feature quantity obtained from the input sound signal that is output from the emitted non-target sound canceller processing unit after removal of the emitted non-target sound may be used to determine the sound type of the emitted non-target sound.

In the first embodiment, the sound type determination result is reflected in the method of calculating the suppression coefficient, and in the second embodiment it is reflected in both the method of calculating the suppression coefficient and the step size of the adaptive filter; however, the ways of using the sound type determination result are not limited to these. For example, the result may be reflected only in the step size of the adaptive filter, and when another sound source separation method (described later) is applied, the result may be reflected in the switching of parameters used in that processing.

In each of the above embodiments, the sound source separation processing unit separates the target sound from the background non-target sound according to the coherence filter method, but the separation method is not limited to this. For example, a combination of the coherence filter method and the frequency subtraction method (spectral subtraction method) may be applied, a combination of the coherence filter method and the Wiener filter method may be applied, or a combination of the coherence filter method, the frequency subtraction method, and the Wiener filter method may be applied. When the frequency subtraction method is applied, the ratio at which the spectrum of the noise component is subtracted from the spectrum of the input sound signal may be changed according to the sound type determination result. Likewise, when the Wiener filter method is applied, the Wiener filter coefficient by which the spectrum of the input sound signal is multiplied may be changed according to the sound type determination result.
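
For the frequency subtraction case, switching the subtraction ratio by sound type might look like the sketch below. It assumes magnitude spectra held as lists of non-negative numbers, clamps negative results to zero (a common practice that the text does not mention), and uses hypothetical ratio values whose ordering is not specified by the embodiment.

```python
# Hypothetical ratios; the text only says the ratio may depend on the sound type.
RATIO_WITH_IMPACT = 1.5
RATIO_WITHOUT_IMPACT = 1.0


def subtraction_ratio_for(contains_impact_sound):
    """Pick the subtraction ratio from the sound type determination result."""
    return RATIO_WITH_IMPACT if contains_impact_sound else RATIO_WITHOUT_IMPACT


def spectral_subtract(input_spectrum, noise_spectrum, ratio):
    """Subtract `ratio` times the noise magnitude from each input bin.

    Results below zero are clamped to zero (an assumption of this sketch).
    """
    return [max(s - ratio * n, 0.0) for s, n in zip(input_spectrum, noise_spectrum)]
```

A Wiener filter variant would follow the same pattern, with the sound type selecting the gain applied to each bin instead of the subtraction ratio.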

The technical idea of the present invention is also applicable to a sound collection/emission device that includes only the emitted non-target sound canceller processing unit 32 and does not include the sound source separation processing unit 33. One example is a mode in which the sound type of the emitted non-target sound is reflected in the step size of the emitted non-target sound canceller processing unit 32 or 32A.

In each of the above embodiments, the case of two speakers was shown, but there may be one speaker or three or more speakers. The number of microphones is likewise not limited to two and may be three or more. The internal configuration of the emitted non-target sound canceller processing unit 32 may be designed in consideration of the number of emitted sound acoustic paths, which is determined by the numbers of speakers and microphones.
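
Assuming that every speaker reaches every microphone, the number of emitted sound acoustic paths, and hence of adaptive filters, is the product of the two counts. The helper below, whose name and label format are hypothetical, enumerates those paths.

```python
from itertools import product


def acoustic_paths(speakers, microphones):
    """List every (speaker, microphone) pair that needs its own adaptive filter."""
    return list(product(speakers, microphones))
```

With speakers 3L, 3R and microphones 4L, 4R this yields the four paths handled by the units 61LL to 61RR; one speaker and three microphones would yield three.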

In each of the above embodiments, the sound collection/emission device performs all processing by itself, but processing such as the removal of non-target sounds may be delegated to an external server. For example, when the sound collection/emission device is a smartphone, the system may be configured using a so-called cloud and updated in such a way that the user is unaware of the existence of the external server. The claims directed to the "sound collection/emission device" are intended to cover the case where an external server that is invisible to the user performs the processing.

10 … sound collection/emission device,
20 … sound emission unit, 21L, 21R … sound source data storage units, 22L, 22R … D/A conversion units, 3L, 3R … speakers,
30 … sound collection unit, 4L, 4R … microphones, 31L, 31R … A/D conversion units, 32, 32A … emitted non-target sound canceller processing unit, 33 … sound source separation processing unit, 41 … FFT unit, 42 … first coherence coefficient calculation unit, 43 … second coherence coefficient calculation unit, 44 … suppression coefficient calculation unit, 45 … suppression coefficient multiplication unit, 46 … IFFT unit, 47 … coherence calculation unit, 48 … emitted non-target sound type determination unit, 61LL to 61RR … pseudo emitted non-target sound generation units, 62LL to 62RR … subtraction units.

Claims

A sound collection/emission device having a sound collection unit in which at least two microphones capture ambient sound and a sound emission unit that emits sound from one or more speakers, the device comprising:
emitted interfering sound removal means which adapts an acoustic echo canceller configuration, receives the sound signal that the sound emission unit causes to be emitted, generates a pseudo emitted interfering sound signal simulating the interfering sound that accompanies the sound emitted from the speaker and is captured by each of the microphones, and removes the emitted interfering sound captured by each microphone by subtracting the pseudo emitted interfering sound signal from the input sound signal;
sound type determination means for determining the sound type of the emitted interfering sound based on the pseudo emitted interfering sound signal generated in the emitted interfering sound removal means; and
one or a plurality of sound type reflection processing means for switching their own processing according to the determination result of the sound type determination means.

The sound collection/emission device according to claim 1, wherein the sound type determination means calculates, from the pseudo emitted interfering sound signal, a coherence whose value changes depending on the direction of arrival, and determines the sound type of the emitted interfering sound according to the behavior of the coherence.

The sound collection/emission device according to claim 2, wherein the sound type determination means distinguishes the behavior of the coherence by the long-term average and the variance of the coherence.

The sound collection/emission device according to claim 1, wherein one of the sound type reflection processing means is sound source separation means for extracting a target sound from a sound source in a predetermined direction from the input sound signal that is output from the emitted interfering sound removal means after removal of the emitted interfering sound, and the sound source separation means switches a parameter for separation according to the sound type determination result.

The sound collection/emission device according to any one of claims 1 to 4, wherein the emitted interfering sound removal means is itself one of the sound type reflection processing means and switches the step size of its internal adaptive filter according to the sound type determination result.

A sound collection/emission program executed by a computer mounted in a sound collection/emission device having a sound collection unit in which at least two microphones capture ambient sound and a sound emission unit that emits sound from one or more speakers, the program causing the computer to function as:
emitted interfering sound removal means which adapts an acoustic echo canceller configuration, receives the sound signal that the sound emission unit causes to be emitted, generates a pseudo emitted interfering sound signal simulating the interfering sound that accompanies the sound emitted from the speaker and is captured by each of the microphones, and removes the emitted interfering sound captured by each microphone by subtracting the pseudo emitted interfering sound signal from the input sound signal;
sound type determination means for determining the sound type of the emitted interfering sound based on the pseudo emitted interfering sound signal generated in the emitted interfering sound removal means; and
one or a plurality of sound type reflection processing means for switching their own processing according to the determination result of the sound type determination means.