JP2019068133A

JP2019068133A - Sound pick-up device, program, and method

Info

Publication number: JP2019068133A
Application number: JP2017188770A
Authority: JP
Inventors: 一浩片桐; Kazuhiro Katagiri; 隆矢頭; Takashi Yato
Original assignee: Oki Electric Industry Co Ltd
Current assignee: Oki Electric Industry Co Ltd
Priority date: 2017-09-28
Filing date: 2017-09-28
Publication date: 2019-04-25
Anticipated expiration: 2037-09-28
Also published as: JP6943120B2

Abstract

PROBLEM TO BE SOLVED: To provide a sound pick-up device that can perform area sound pick-up with higher sound quality for each output destination. SOLUTION: A sound pick-up device 100 outputs sound pick-up signals, picked up using a plurality of microphone arrays MA1 and MA2, to a plurality of output destinations. To output to each output destination an acoustic signal containing at least a component of the target area sound as the sound pick-up signal, the device generates, according to the characteristics of the output destination, a post-mixing target area sound in which a mixed sound containing at least a component of the input signal and/or a component of the estimated noise is mixed into the target area sound, and outputs the post-mixing target area sound as the sound pick-up signal. SELECTED DRAWING: Figure 1

Description

The present invention relates to a sound collection device, program, and method. It can be applied, for example, to systems that emphasize sound from a specific area and suppress sound from other areas in an environment with multiple sound sources, such as voice communication systems and speech recognition systems used in noisy environments.

When a voice communication system or a speech recognition application is used in a noisy environment, ambient noise mixed in with the desired target speech is a nuisance that hinders good communication and lowers the speech recognition rate. A conventional technique for obtaining the desired target sound in such a multi-source environment, while avoiding unwanted sound, is to separate and collect only sound from a specific direction using a beamformer (hereinafter also "BF"; see Patent Documents 2 and 3) built on a microphone array. A BF forms directivity by exploiting the time differences between the signals arriving at each microphone. With a BF alone, however, it is difficult to collect only the sound present within the area targeted for collection (hereinafter the "target area"; the sound within it is the "target area sound") when other sound sources exist around that area. For this reason, area sound collection methods that collect sound from a target area using a plurality of microphone arrays have been proposed, for example in Patent Document 1.

FIG. 6 is an explanatory diagram showing the process of collecting target area sound from a sound source in the target area using two microphone arrays MA1 and MA2. FIG. 6(a) shows a configuration example of the microphone arrays. FIGS. 6(b) and 6(c) are graphs (conceptual diagrams) showing, in the frequency domain, the BF outputs of the microphone arrays MA1 and MA2 of FIG. 6(a), respectively.

In conventional area sound collection, as shown in FIG. 6(a), the directivities of the microphone arrays MA1 and MA2 are aimed from different directions so that they intersect in the area from which sound is to be collected (the target area). In the state of FIG. 6(a), the directivity of each microphone array captures not only the sound present in the target area (target area sound) but also noise lying in the direction of the target area (non-target area sound). However, as shown in FIGS. 6(b) and 6(c), when the outputs of MA1 and MA2 are compared in the frequency domain, the target area sound component appears in both outputs, whereas the non-target area sound components differ between the arrays. Conventional area sound collection exploits this property: by suppressing every component other than those common to the BF outputs of the two arrays, it extracts only the target area sound.

Conventional area sound collection is very effective at suppressing noise generated outside the area, but when the surrounding non-target area sound or background noise is loud, it can produce unpleasant artifacts such as musical noise. Patent Document 3 describes a technique for reducing musical noise in area sound collection: to improve sound quality, the input signal and estimated noise are mixed into the area sound collection output so that artifacts such as musical noise are masked (a signal-mixing area sound collection method). Hereinafter, any technique that, like that of Patent Document 3, mixes a predetermined sound (for example, the input signal or estimated noise) into the area sound collection output to mask artifacts such as musical noise is called "mixing area sound collection".

Patent Document 1: JP 2014-72708 A
Patent Document 2: JP 2005-195955 A
Patent Document 3: Japanese Patent No. 6187626

Non-Patent Document 1: Futoshi Asano, "Array Signal Processing of Sound: Localization, Tracking, and Separation of Sound Sources", Acoustic Technology Series 16, edited by the Acoustical Society of Japan, Corona Publishing, published February 25, 2011

As described above, conventional mixing area sound collection greatly improves the sound quality of area sound collection, but because it mixes in the input signal and estimated noise, its noise suppression effect is somewhat weakened. Consequently, when area sound collection is applied to a system that provides both voice communication and speech recognition, a noise suppression level that suits voice communication is insufficient as preprocessing for speech recognition, and the recognition rate drops.

In view of this problem, there is a demand for a sound collection device, program, and method capable of higher-quality area sound collection for each output destination.

A first aspect of the present invention is a sound collection device that outputs sound collection signals, collected using microphone arrays, to a plurality of output destinations, the device comprising: (1) noise suppression means that estimates the background noise contained in an input signal from a microphone array to obtain estimated noise, and uses the estimated noise to suppress the noise component of the input signal and obtain a noise-suppressed signal; (2) directivity forming means that obtains, from the noise-suppressed signal, a first non-target area sound for which directivity is formed in directions other than the target area direction, and a target area direction sound for which directivity is formed in the target area direction; (3) a target area sound extraction unit that uses the target area direction sound to extract a second non-target area sound arriving from the target area direction, and then uses the second non-target area sound and the target area direction sound to obtain a target area sound whose source is the target area; and (4) output means that outputs to each output destination, as the sound collection signal, an acoustic signal containing at least a component of the target area sound, and that can, according to the characteristics of the output destination, generate a post-mixing target area sound by mixing into the target area sound a mixed sound containing at least a component of the input signal and/or a component of the estimated noise, and output it as the sound collection signal.

A sound collection program according to a second aspect of the present invention causes a computer, mounted in a sound collection device that outputs sound collection signals collected using microphone arrays to a plurality of output destinations, to function as: (1) noise suppression means that estimates the background noise contained in an input signal from a microphone array to obtain estimated noise, and uses the estimated noise to suppress the noise component of the input signal and obtain a noise-suppressed signal; (2) directivity forming means that obtains, from the noise-suppressed signal, a first non-target area sound for which directivity is formed in directions other than the target area direction, and a target area direction sound for which directivity is formed in the target area direction; (3) a target area sound extraction unit that uses the target area direction sound to extract a second non-target area sound arriving from the target area direction, and then uses the second non-target area sound and the target area direction sound to obtain a target area sound whose source is the target area; and (4) output means that outputs to each output destination, as the sound collection signal, an acoustic signal containing at least a component of the target area sound, and that can, according to the characteristics of the output destination, generate a post-mixing target area sound by mixing into the target area sound a mixed sound containing at least a component of the input signal and/or a component of the estimated noise, and output it as the sound collection signal.

A third aspect of the present invention is a sound collection method performed by a sound collection device that outputs sound collection signals, collected using microphone arrays, to a plurality of output destinations, wherein: (1) the sound collection device has noise suppression means, directivity forming means, a target area sound extraction unit, and output means; (2) the noise suppression means estimates the background noise contained in an input signal from a microphone array to obtain estimated noise, and uses the estimated noise to suppress the noise component of the input signal and obtain a noise-suppressed signal; (3) the directivity forming means obtains, from the noise-suppressed signal, a first non-target area sound for which directivity is formed in directions other than the target area direction, and a target area direction sound for which directivity is formed in the target area direction; (4) the target area sound extraction unit uses the target area direction sound to extract a second non-target area sound arriving from the target area direction, and then uses the second non-target area sound and the target area direction sound to obtain a target area sound whose source is the target area; and (5) the output means outputs to each output destination, as the sound collection signal, an acoustic signal containing at least a component of the target area sound, and can, according to the characteristics of the output destination, generate a post-mixing target area sound by mixing into the target area sound a mixed sound containing at least a component of the input signal and/or a component of the estimated noise, and output it as the sound collection signal.

According to the present invention, a sound collection device capable of area sound collection with higher sound quality for each output destination can be provided.

FIG. 1 is a block diagram showing the functional configuration of the sound collection device according to the first embodiment.
FIG. 2 is a block diagram showing the configuration of the subtractive BF (with two microphones) according to the first embodiment.
FIG. 3 is an explanatory diagram showing examples of the directional filter formed by the subtractive BF (with two microphones) according to the first embodiment.
FIG. 4 is a block diagram showing the functional configuration of the sound collection device according to the second embodiment.
FIG. 5 is a block diagram showing the functional configuration of the sound collection device according to the third embodiment.
FIG. 6 is an explanatory diagram showing the process of collecting target area sound from a sound source in the target area using two microphone arrays.

(A) First Embodiment
A first embodiment of the sound collection device, program, and method according to the present invention is described in detail below with reference to the drawings.

(A-1) Configuration of the First Embodiment
FIG. 1 is a block diagram showing the functional configuration of the sound collection device 100 of this embodiment.

The sound collection device 100 performs target area sound collection processing, which collects target area sound from a sound source in the target area using the acoustic signals supplied by two microphone arrays MA (MA1, MA2).

The microphone arrays MA1 and MA2 are placed at arbitrary locations in the space containing the target area. As shown in FIG. 6(a), for example, they may be positioned anywhere relative to the target area as long as their directivities overlap only in the target area; they may, for instance, face each other across the target area. In this embodiment, each microphone array MA consists of two or more microphones M, and each microphone M picks up an acoustic signal. As shown in FIG. 1, the description here assumes that each microphone array MA has two microphones M1 and M2, that is, each microphone array MA is a 2-channel microphone array. The number of microphone arrays MA is not limited to two. When there are multiple target areas, enough microphone arrays MA must be deployed to cover all of them.

As described above, each microphone array MA is placed, within the space containing the target area, at a location from which it can be directed toward the target area. Each microphone array MA consists of two microphones M (M1, M2), and supplies the sound collection device 100 with acoustic signals based on the sound captured by the two microphones M1 and M2.

The sound collection device 100 then supplies the area-collected acoustic signals to a speech recognition unit 10 and a speaker 11. The speech recognition unit 10 and the speaker 11 may be connected directly to the sound collection device 100 (located locally), or connected indirectly over a network and supplied with the collected acoustic signals that way.

The speaker 11 is, for example, a device that plays back to an operator (user) an acoustic signal obtained by area sound collection at a remote location. Hereinafter, any system, such as the speaker 11, whose purpose is to let a person listen to the collected acoustic signal is also called a "call system".

The speech recognition unit 10 is, for example, a device that performs speech recognition processing, such as converting to text the speech contained in an acoustic signal obtained by area sound collection at a remote location. Hereinafter, any system that recognizes the speech contained in an acoustic signal and performs processing based on the recognition result (for example, speech-to-text or voiceprint recognition) is called a "speech recognition system".

The sound collection device 100 therefore supplies area-collected acoustic signals simultaneously to multiple output destinations with different characteristics (characteristics of use): the call system (speaker 11) and the speech recognition system (speech recognition unit 10). In other words, the sound collection device 100 collects an acoustic signal (target-sound-emphasized signal) suited to the characteristics of each output destination and outputs each signal to the output destination with the corresponding characteristics.

As described above, in this embodiment the sound collection device 100 collects (generates) one acoustic signal for the call system and another for the speech recognition system, supplies the former to the speaker 11, and supplies the latter to the speech recognition unit 10. As shown in FIG. 1, the only call-system output destination in this embodiment is the speaker 11 and the only speech-recognition output destination is the speech recognition unit 10, but there may be multiple output destinations for each characteristic. Likewise, although the sound collection device 100 in this embodiment collects acoustic signals for two characteristics (call system and speech recognition system), it may generate acoustic signals for three or more kinds of characteristics and output each to the corresponding output destination.

Next, the internal configuration of the sound collection device 100 is described with reference to FIG. 1.

The sound collection device 100 has a signal input unit 3, a noise suppression unit 4, a directivity forming unit 5, a target area sound extraction unit 6, a mixing level calculation unit 7, a mixing level adjustment unit 8, and a signal mixing unit 9.

The sound collection device 100 may be implemented, for example, by having a computer equipped with a processor, memory, and the like execute a program (including the sound collection program according to the embodiment). Even in that case, it can be represented functionally as in FIG. 1.

(A-2) Operation of the First Embodiment
Next, the operation of the sound collection device 100 of the first embodiment configured as above (the sound collection method according to the embodiment) is described.

The signal input unit 3 converts the acoustic signals picked up by the microphone arrays MA1 and MA2 from analog to digital and takes them in. It then transforms the digital signals from the time domain to the frequency domain, for example with a fast Fourier transform.
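The time-to-frequency conversion can be illustrated with a minimal sketch. It assumes a naive discrete Fourier transform in place of an optimized FFT, and the function name `dft` and the 8-sample frame are illustrative choices, not taken from the source.

```python
import cmath
import math

def dft(frame):
    """Naive discrete Fourier transform of one frame of real samples."""
    n = len(frame)
    return [
        sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]

# Example frame: exactly one cycle of a cosine over 8 samples.
frame = [math.cos(2 * math.pi * t / 8) for t in range(8)]
spectrum = dft(frame)

# Later stages in this description operate on amplitude spectra.
amplitudes = [abs(c) for c in spectrum]
```

A practical implementation would use an FFT with windowed, overlapping frames; the naive form is used here only to keep the sketch dependency-free.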

The noise suppression unit 4 estimates the background noise component contained in the signal obtained by the signal input unit 3 and suppresses it. The noise suppression can be performed using, for example, spectral subtraction (hereinafter simply "SS") or Wiener filtering.
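As a rough illustration of SS-based noise suppression, the sketch below estimates the noise as the average amplitude spectrum of frames assumed to contain only background noise, then subtracts it per frequency bin. The estimation rule, the function names, and the sample values are assumptions for illustration; the source does not specify how the noise is estimated.

```python
def estimate_noise(noise_frames):
    """Average amplitude spectrum over frames assumed to hold only noise."""
    n_bins = len(noise_frames[0])
    count = len(noise_frames)
    return [sum(f[k] for f in noise_frames) / count for k in range(n_bins)]

def spectral_subtract(amplitude, noise, floor=0.0):
    """Subtract the estimated noise per bin; clamp negative results to floor."""
    return [max(a - b, floor) for a, b in zip(amplitude, noise)]

# Hypothetical amplitude spectra with three frequency bins each.
noise_frames = [[0.2, 0.4, 0.1], [0.4, 0.2, 0.3]]
estimated_noise = estimate_noise(noise_frames)
suppressed = spectral_subtract([1.0, 0.2, 0.5], estimated_noise)
```

The estimated noise is kept as a separate output because later stages reuse it when computing the mixing level.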

For each microphone array MA, the directivity forming unit 5 applies a BF to the acoustic signal whose background noise has been suppressed by the noise suppression unit 4, forming directivity in the direction of the target sound.

Directivity formation by the BF of each microphone array MA (MA1, MA2) is described here with reference to FIGS. 2 and 3. A BF forms directivity by exploiting the time differences between the signals arriving at each microphone (see Non-Patent Document 1). BFs fall broadly into two types, additive and subtractive. The subtractive BF, which can form directivity with a small number of microphones, is described here.

FIG. 2 is a block diagram showing the configuration of the subtractive BF 200 for the case of two microphones M.

FIG. 3 is an explanatory diagram showing examples of the directional filter formed by the subtractive BF 200 using two microphones M.

In the subtractive BF 200 shown in FIG. 2, the delay unit 210 first calculates the time difference with which sound from the target direction (the target sound) arrives at the microphones M (M1, M2), and applies a delay to align the phase of the target sound. This time difference can be calculated by equation (1) below.

Here, d is the distance between the microphones M1 and M2, c is the speed of sound, and τ_i is the delay amount. θ_L is the angle from the direction perpendicular to the straight line connecting the microphones M (M1, M2) to the target direction.
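Equation (1) itself is not reproduced in this text. Under the usual far-field assumption, the variables defined above are related by delay = d · sin(θ_L) / c, and the sketch below uses that relation. The formula, the function name `arrival_delay`, and the value of 340 m/s for the speed of sound are assumptions, not taken from the source.

```python
import math

SPEED_OF_SOUND = 340.0  # m/s, assumed value

def arrival_delay(mic_distance, theta_l, c=SPEED_OF_SOUND):
    """Far-field arrival-time difference between two microphones.

    mic_distance: spacing d between M1 and M2 in metres
    theta_l: angle from the perpendicular of the microphone axis, in radians
    """
    return mic_distance * math.sin(theta_l) / c

# Microphones 3 cm apart.
delay_endfire = arrival_delay(0.03, math.pi / 2)  # sound along the mic axis
delay_broadside = arrival_delay(0.03, 0.0)        # sound from straight ahead
```

With this relation the delay is largest for sound arriving along the microphone axis and zero for sound arriving from the perpendicular direction.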

When the blind spot (null) lies in the direction of the microphone M1 relative to the midpoint between M1 and M2, the delay unit 210 applies the delay to the input signal x_1(t) of the microphone M1. The subtractor 220 of the subtractive BF 200 then performs the subtraction according to equation (2) below.

The processing of the subtractive BF 200 can be performed in the frequency domain in the same way, in which case equation (2) is replaced by equation (3) below.
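Equations (2) and (3) are not reproduced in this text. A form consistent with the description (delay the M1 signal, then subtract) is M(ω) = X_2(ω) − e^{−jωτ} X_1(ω), and the sketch below assumes that form. The plane-wave model, the function names, and all numeric values are hypothetical.

```python
import cmath
import math

def subtractive_bf(x1, x2, omega, tau):
    """Frequency-domain subtraction: delay X1 by tau, then subtract from X2."""
    return x2 - cmath.exp(-1j * omega * tau) * x1

def plane_wave_bins(omega, d, phi, c=340.0):
    """Spectra at M1 and M2 for a unit plane wave arriving from angle phi
    (measured from the perpendicular, positive toward the M1 side)."""
    x1 = 1.0 + 0.0j
    # For phi > 0 the wave reaches M2 later than M1.
    x2 = cmath.exp(-1j * omega * d * math.sin(phi) / c)
    return x1, x2

d, c = 0.03, 340.0
omega = 2 * math.pi * 1000.0          # 1 kHz component
tau = d * math.sin(math.pi / 2) / c   # place the null toward M1

x1, x2 = plane_wave_bins(omega, d, math.pi / 2, c)
null_gain = abs(subtractive_bf(x1, x2, omega, tau))

x1, x2 = plane_wave_bins(omega, d, -math.pi / 2, c)
opposite_gain = abs(subtractive_bf(x1, x2, omega, tau))
```

Sound from the M1 side cancels, while sound from the opposite side passes through, which is the unidirectional behavior described for θ_L = ±π/2.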

When θ_L = ±π/2, the directivity formed by the subtractive BF 200 is a cardioid (unidirectional) pattern, as shown in FIG. 3(a). When θ_L = 0 or π, it is a figure-eight (bidirectional) pattern, as shown in FIG. 3(b). Furthermore, by using SS, the subtractive BF 200 can also form strong directivity toward the null of the bidirectional pattern. Directivity by SS is formed according to equation (4) over all frequencies or over a specified frequency band. Equation (4) uses the input signal X_1 of the microphone M1, but the same effect can be obtained with the input signal X_2 of the microphone M2. Here β is a coefficient for adjusting the strength of the SS. When a subtraction yields a negative value, the subtractive BF 200 performs flooring, replacing the value with 0 or with a reduced version of the original value.
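Equation (4) is not reproduced in this text. The sketch below assumes the amplitude-domain form |Y| = |X_1| − β|M|, where M is the bidirectional BF output, followed by the flooring step described above. The function name, the `floor_ratio` parameter, and the sample values are illustrative assumptions.

```python
def ss_directivity(x1_amp, m_amp, beta=1.0, floor_ratio=0.0):
    """Subtract the scaled non-target amplitude spectrum from the input.

    x1_amp: amplitude spectrum of the M1 input signal
    m_amp: amplitude spectrum of the bidirectional BF output
    beta: SS strength coefficient
    floor_ratio: negative results become floor_ratio * original value
    """
    out = []
    for x, m in zip(x1_amp, m_amp):
        y = x - beta * m
        if y < 0.0:
            y = floor_ratio * x  # flooring: 0 or a reduced original value
        out.append(y)
    return out

x1_amp = [1.0, 0.5, 0.8]
m_amp = [0.2, 0.9, 0.8]
floored_to_zero = ss_directivity(x1_amp, m_amp)
floored_to_fraction = ss_directivity(x1_amp, m_amp, floor_ratio=0.1)
```

Flooring to a small fraction of the original value rather than to zero is a common way to reduce spectral holes, at the cost of slightly weaker suppression.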

In this processing scheme, the subtractive BF 200 uses the bidirectional pattern to extract sound from directions other than the target direction (non-target sound), and subtracts the amplitude spectrum of the extracted non-target sound from the amplitude spectrum of the input signal, thereby emphasizing the target sound.

The directivity forming unit 5 can obtain the BF output of each microphone array MA (MA1, MA2) using the subtractive BF 200 processing described above.

When only the sound within a particular area (the target area sound) is wanted, a subtractive BF alone will also pick up sound sources lying along the same direction as that area (non-target area sound). The sound collection device 100 therefore performs area sound collection processing as proposed in Patent Document 1: it uses a plurality of microphone arrays MA, directs their directivities toward the target area from different directions, and makes the directivities intersect in the target area to collect the target area sound.

The directivity forming unit 5 forms directivity with a BF (subtractive BF 200) for each of the two microphone arrays MA1 and MA2, and, as in FIG. 6(a), makes the directivities of MA1 and MA2 intersect from different directions in the area from which sound is to be collected (the target area).

The target area sound extraction unit 6 applies SS to the BF output data Y_1(n) and Y_2(n) of the microphone arrays MA1 and MA2 formed by the directivity forming unit 5, according to equation (5) or (6) below, to extract the non-target area sounds N_1(n) and N_2(n) present in the target area direction. Here α_1 and α_2 are correction coefficients that compensate for the difference in signal level caused by the different distances between the target area and each microphone array MA, and they should properly be recalculated each time by a predetermined process. For simplicity, however, the description here assumes that the target area is equidistant from both microphone arrays MA (α_1(n) = α_2(n) = 1), so that equations (7) and (8) are applied in place of equations (5) and (6).
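Equations (5) to (8) are not reproduced in this text. The sketch below assumes the form suggested by the description: the other array's BF output, scaled by a correction coefficient, is spectrally subtracted from the array's own BF output, and negative results are clamped to zero. The function name, the single `alpha` parameter, and the sample spectra are hypothetical.

```python
def extract_non_target(y_own, y_other, alpha=1.0):
    """Non-target area sound in the target direction of one array.

    Components common to both BF outputs (the target area sound) cancel,
    leaving what only this array hears along its look direction.
    alpha = 1.0 corresponds to the equal-distance simplification.
    """
    return [max(a - alpha * b, 0.0) for a, b in zip(y_own, y_other)]

# Hypothetical BF amplitude spectra: bin 0 is the shared target area sound,
# bin 1 is heard mostly by MA1, bin 2 mostly by MA2.
y1 = [1.0, 0.6, 0.1]
y2 = [1.0, 0.1, 0.5]

n1 = extract_non_target(y1, y2)  # non-target area sound seen by MA1
n2 = extract_non_target(y2, y1)  # non-target area sound seen by MA2
```

In this toy example the shared bin cancels in both results, while each array keeps only the component that the other array does not share.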

The target area sound extraction unit 6 then extracts the target area sound by spectrally subtracting the non-target area sound from the BF outputs of the microphone arrays MA1 and MA2, according to equations (9) and (10) below. Here γ_1(n) and γ_2(n) are coefficients for changing the strength of the SS.
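Equations (9) and (10) are not reproduced in this text. A form consistent with the description is Z_1(n) = Y_1(n) − γ_1(n) N_1(n), with Z_2(n) defined symmetrically, and the sketch below assumes it. The function names and sample spectra are hypothetical, and the non-target extraction uses the equal-distance simplification.

```python
def extract_non_target(y_own, y_other):
    """Non-target area sound under the equal-distance simplification."""
    return [max(a - b, 0.0) for a, b in zip(y_own, y_other)]

def extract_target(y_own, n_own, gamma=1.0):
    """Spectrally subtract the non-target area sound from the BF output."""
    return [max(a - gamma * b, 0.0) for a, b in zip(y_own, n_own)]

# Hypothetical BF amplitude spectra of MA1 and MA2.
y1 = [1.0, 0.6, 0.1]
y2 = [1.0, 0.1, 0.5]

n1 = extract_non_target(y1, y2)
z1 = extract_target(y1, n1)  # target area sound with MA1 as the main array
```

Each bin of `z1` ends up holding only the level that both arrays share, which is the sense in which components outside the intersection of the two directivities are suppressed.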

The mixing level calculation unit 7 calculates the power of the estimated noise estimated by the noise suppression unit 4, of the non-target area sound from directions other than the target area direction extracted by the directivity forming unit 5, and of the non-target area sound in the target area direction extracted by the target area sound extraction unit 6. From the magnitude of their sum, it determines the total volume level of the input signal and background noise to be mixed into the target area sound.

Here, the mixing level calculation unit 7 is assumed to perform area sound collection mainly with the microphone array MA1 according to equation (9). In this case, when A₁(n) is the sum of the estimated noise B₁(n) estimated from the input signal of the microphone array MA1, the non-target area sound M₁(n) from directions other than the target area direction extracted according to equation (3), and the non-target area sound N₁(n) in the target area direction extracted according to equation (7), the mixing level calculation unit 7 sets the mixing level to δ₁A₁(n). Here, δ₁ is a variable proportional to the SN ratio between the target area sound Z₁(n) and A₁(n); for example, it is a value that sets A₁(n) to −20 dB at an SN ratio of 0 dB.
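As a rough numerical illustration of the mixing level δ₁A₁(n), the Python sketch below treats B₁(n), M₁(n), N₁(n) and Z₁(n) as scalar frame powers and reads "−20 dB at an SN ratio of 0 dB" as a 20 dB attenuation of A₁(n) that scales linearly with the power SN ratio. That reading, the function name `mixing_level`, and the zero-power guard are assumptions; the source does not give the exact mapping.

```python
def mixing_level(b1, m1, n1, z1, level_db_at_0db_snr=-20.0):
    """Return (A1, delta1, delta1 * A1) for one frame, all as powers."""
    # A1(n): summed power of estimated noise and both non-target area sounds.
    a1 = b1 + m1 + n1
    if a1 <= 0.0:
        # Nothing to mix when there is no noise or interference power.
        return a1, 0.0, 0.0
    # SN ratio between the target area sound Z1(n) and A1(n), as a power ratio.
    snr = z1 / a1
    # delta1 is proportional to the SN ratio; at 0 dB (snr == 1) it equals the
    # configured attenuation (assumed interpretation of the example value).
    delta1 = snr * 10.0 ** (level_db_at_0db_snr / 10.0)
    return a1, delta1, delta1 * a1


a1, delta1, level = mixing_level(b1=1.0, m1=2.0, n1=1.0, z1=4.0)
```

In this example the SN ratio is 0 dB, so the mixing level comes out 20 dB below A₁(n).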

The mixing level adjustment unit 8 adjusts the volume levels (mixing ratio) of the input signal and the estimated noise to be mixed into the target area sound, based on the mixing level obtained by the mixing level calculation unit 7 and on the power ratio between the estimated noise and the non-target area sound. Here, the mixing level adjustment unit 8 is assumed to perform area sound collection mainly with the microphone array MA1 according to equation (9). In this case, the variable λ₁, which determines the ratio between the input signal and the estimated noise to be mixed, is inversely proportional to the power ratio (M₁(n) + N₁(n)) / B₁(n) between the estimated noise B₁(n) and the non-target area sound (M₁(n) + N₁(n)). For example, λ₁ = 1 when (M₁(n) + N₁(n)) / B₁(n) = 0. Here, λ₁ ranges from 0 to 1. Furthermore, the variable μ₁ for satisfying the mixing level δ₁A₁(n) is calculated by equation (11) below. Here, X₁₁(n) is the input signal acquired from the microphone M1 of the microphone array MA1.
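The text fixes only three properties of λ₁: it decreases as the ratio (M₁(n) + N₁(n)) / B₁(n) grows, it equals 1 when that ratio is 0, and it stays within 0 to 1. The Python sketch below uses 1 / (1 + r) as one simple mapping with those properties, and computes μ₁ as the gain that brings the mixture to the level δ₁A₁(n). Both formulas and the names `mix_ratio` and `mix_gain` are assumptions, since equation (11) is not reproduced here.

```python
def mix_ratio(b1, m1, n1):
    """lambda1: share of the input signal in the mixture (assumed mapping)."""
    if b1 <= 0.0:
        # No estimated noise power: the ratio is unbounded, so lambda1 -> 0.
        return 0.0
    r = (m1 + n1) / b1
    # Equals 1 at r == 0 and falls toward 0 as interference dominates.
    return 1.0 / (1.0 + r)


def mix_gain(level, lam, x11, b1):
    """mu1: gain that scales the mixture to the mixing level (assumed eq. (11))."""
    mixture = lam * x11 + (1.0 - lam) * b1
    return level / mixture if mixture > 0.0 else 0.0


lam1 = mix_ratio(b1=1.0, m1=1.0, n1=2.0)
mu1 = mix_gain(level=0.5, lam=lam1, x11=4.0, b1=0.0)
```

Any other monotonically decreasing mapping clamped to the range 0 to 1 would fit the description equally well.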

The signal mixing unit 9 mixes the input signal acquired by the signal input unit 3 and the noise estimated by the noise suppression unit 4 into the target area sound extracted by the target area sound extraction unit 6, based on the ratio calculated by the mixing level adjustment unit 8. For example, when area sound collection is performed mainly with the microphone array MA1 according to equation (9), the final output W₁(n) is obtained by mixing according to equation (12) below. The data mixed into the target area sound may be the input sound and the estimated noise combined at a predetermined ratio as described above, or only the input sound, or only the estimated noise.
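Equation (12) is not reproduced in this text. As a sketch of one mixing rule consistent with the description, the Python example below adds to the target area sound a mixture of input signal and estimated noise weighted by λ₁ and scaled by μ₁. The function name `mix_output`, the per-bin list representation, and the exact form of the sum are assumptions for illustration.

```python
def mix_output(z1, x11, b1, lam, mu):
    """Assumed form of eq. (12): W1 = Z1 + mu * (lam * X11 + (1 - lam) * B1).

    z1, x11, b1: per-bin values of the target area sound, the input signal and
    the estimated noise for one frame (lists of floats of equal length).
    """
    return [z + mu * (lam * x + (1.0 - lam) * b)
            for z, x, b in zip(z1, x11, b1)]


z1, x11, b1 = [1.0, 2.0], [4.0, 4.0], [2.0, 0.0]
both = mix_output(z1, x11, b1, lam=0.5, mu=0.5)        # input and noise
input_only = mix_output(z1, x11, b1, lam=1.0, mu=0.5)  # input sound only
noise_only = mix_output(z1, x11, b1, lam=0.0, mu=0.5)  # estimated noise only
```

The three calls correspond to the three options named in the text: a predetermined ratio of both signals, the input sound alone, or the estimated noise alone.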

In the sound collection device 100, the acoustic signal output from the signal mixing unit 9 (target-area-emphasized sound whose perceived sound quality has been improved by mixing) is supplied to the speaker 11 as sound to be heard by a person (an acoustic signal for voice systems). In addition, an acoustic signal output from the signal mixing unit 9 (target area sound in which interfering sounds and background noise are sufficiently suppressed; an acoustic signal for speech recognition systems) is supplied to the speech recognition unit 10.

(A-3) Effects of the First Embodiment
According to the first embodiment, the following effects can be achieved.

In the sound collection device 100 of the first embodiment, for a call system premised on a person listening (the speaker 11 in this embodiment), an acoustic signal in which musical noise is masked is output by mixing the input signal and the estimated noise into the area sound collection output. In the acoustic signal for the call system generated by the sound collection device 100, distortion of the target area sound is corrected, yielding speech that is easier to listen to while the sense of emphasis is maintained. In other words, the acoustic signal for the call system is a high-quality target sound in which the abnormal sounds and distortion produced by area sound collection processing when the background noise level is high are reduced by the mixing area sound collection function.

In addition, in the sound collection device 100 of the first embodiment, an acoustic signal in which the interfering sounds and noise that hinder speech recognition are sufficiently suppressed (an acoustic signal in which speech in the target area is emphasized and a high recognition rate is ensured even in noisy conditions) is output to the speech recognition system (in this embodiment, the speech recognition unit 10, which is a machine). In other words, the acoustic signal for the speech recognition system generated by the sound collection device 100 is a target-sound-emphasized signal in which noise and interfering sounds from outside the target area, which degrade speech recognition performance, are sufficiently suppressed.

Therefore, the sound collection device 100 of the first embodiment simultaneously provides the call system with an acoustic signal of improved sound quality and the speech recognition system with an acoustic signal that enables higher recognition accuracy. For example, suppose that the sound collection device 100 collects the voice of a customer at a remote location and outputs it to an operator at a center (call center) and to a system that automatically converts the customer's speech into text. In this case, the sound collection device 100 captures the remote customer's voice with the microphone arrays MA1 and MA2, outputs the acoustic signal for the call system to the center operator (using the speaker 11), and outputs the acoustic signal for the speech recognition system to the system that automatically converts the customer's speech into text (the speech recognition unit 10). The voice of the center operator may also be captured by a microphone (not shown) and input to the speech recognition unit 10. Likewise, the voice of the center operator may be captured by a microphone (not shown) and output to the remote customer from a speaker (not shown) at the remote location. The speech recognition unit 10, supplied with the acoustic signal for the speech recognition system, can then not only convert the exchange between operator and customer into text automatically and store it with higher recognition accuracy, but also improve service by quickly retrieving customer information or automatically searching product information based on the recognition result. Meanwhile, the operator can converse with the customer with improved sound quality, which reduces the workload.

(B) Second Embodiment
A second embodiment of the sound collection device, program, and method according to the present invention is described in detail below with reference to the drawings.

(B-1) Configuration of the Second Embodiment
FIG. 4 is a block diagram showing the overall configuration of the sound collection device 100A of the second embodiment. Parts identical or corresponding to those in FIG. 1 described above are given the same or corresponding reference signs.

The following describes how the second embodiment differs from the first embodiment.

In noise suppression processing in general, the amount of noise suppression and the sound quality are in a trade-off relationship: increasing the amount of suppression increases distortion. Area sound collection is an excellent method that can emphasize only the sound generated in the target area, but, as with general noise suppression, the stronger the emphasis, the greater the distortion. For this reason, the first embodiment showed a configuration that outputs an area sound collection result with a strong suppression effect to the speech recognition system, and a high-quality mixing area sound collection result to the call-system speaker or communication system.

Recent speech recognition engines, however, have become more robust to ambient noise, and some can maintain recognition performance even in moderately noisy environments. For such engines, giving top priority to the amount of suppression alone is not necessarily the best approach.

Therefore, the sound collection device 100A of the second embodiment is configured so that different mixing amounts (mixing levels) are set for the call system and the speech recognition system, and the optimal mixing area sound collection output can be provided to each system.

Next, the internal configuration of the sound collection device 100A of the second embodiment is described in terms of its differences from the first embodiment.

The sound collection device 100A of the second embodiment differs from the first embodiment in that the mixing level calculation unit 7, the mixing level adjustment unit 8, and the signal mixing unit 9 are replaced by a mixing level calculation unit 7A, a mixing level adjustment unit 8A, and a signal mixing unit 9A, respectively. The processing performed by the mixing level calculation unit 7A, the mixing level adjustment unit 8A, and the signal mixing unit 9A is described in detail later.

(B-2) Operation of the Second Embodiment
Next, the operation of the sound collection device 100A of the second embodiment configured as above (the sound collection method according to the embodiment) is described in terms of its differences from the first embodiment.

As described above, the sound collection device 100A of the second embodiment differs from the first embodiment in the mixing level calculation unit 7A, the mixing level adjustment unit 8A, and the signal mixing unit 9A, so the following description focuses on the processing of these elements.

The mixing level calculation unit 7A is similar to that of the first embodiment, but differs in that it calculates two mixing levels: a mixing level for the call system (for the speaker 11), hereinafter denoted "L1", and a mixing level for the speech recognition system (for the speech recognition unit 10), hereinafter denoted "L2".

For the speaker 11, as in the first embodiment, when A₁(n) is the sum of the estimated noise B₁(n), the non-target area sound M₁(n) from directions other than the target area direction, and the non-target area sound N₁(n) in the target area direction, the mixing level calculation unit 7A sets the mixing level to δ₁A₁(n). Here, δ₁ is a variable proportional to the SN ratio between the target area sound Z₁(n) and A₁(n); for example, it is a value that sets A₁(n) to −20 dB at an SN ratio of 0 dB.

For the speech recognition unit 10, the mixing level calculation unit 7A supplies an acoustic signal that places more weight on the amount of suppression than the signal for the speaker 11 while still limiting distortion. For example, for the speech recognition unit 10, the mixing level calculation unit 7A may set, as δ₂, a value that sets A₁(n) to −25 dB at an SN ratio of 0 dB.

For the acoustic signal supplied to the call system, the mixing level adjustment unit 8A uses the mixing level L1 to determine the volume levels of the input signal and the estimated noise to be mixed into the target area sound (the mixing ratio, hereinafter called "R1"). For the acoustic signal supplied to the speech recognition system, the mixing level adjustment unit 8A uses the mixing level L2 to determine the volume levels of the input signal and the estimated noise to be mixed into the target area sound (the mixing ratio, hereinafter called "R2").

The signal mixing unit 9A differs from the first embodiment in that it generates one mixed sound based on the mixing ratio R1 for the call system and another based on the mixing ratio R2 for the speech recognition system, both calculated by the mixing level adjustment unit 8A, and supplies each mixed sound to the corresponding system. Specifically, the signal mixing unit 9A mixes the input signal acquired by the signal input unit 3 and the noise estimated by the noise suppression unit 4 into the target area sound extracted by the target area sound extraction unit 6 based on the mixing ratio R1 for the voice system, and supplies the resulting mixed sound to the speaker 11. The signal mixing unit 9A also mixes the input signal acquired by the signal input unit 3 and the noise estimated by the noise suppression unit 4 into the target area sound extracted by the target area sound extraction unit 6 based on the mixing ratio R2 for the speech recognition system, and supplies the resulting mixed sound to the speech recognition unit 10.
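The per-destination levels L1 and L2 can be pictured with a small Python sketch. It reuses the example values from the text (−20 dB for the speaker 11 and −25 dB for the speech recognition unit 10 at an SN ratio of 0 dB) and assumes, as one possible reading, that each level is the power A₁(n) attenuated by that amount and scaled by the power SN ratio. The dictionary `DESTINATION_DB` and the function names are hypothetical.

```python
# Example attenuation per output destination at an SN ratio of 0 dB.
# The keys are illustrative labels, not identifiers from the source.
DESTINATION_DB = {
    "speaker_11": -20.0,             # call system, mixing level L1
    "speech_recognition_10": -25.0,  # speech recognition system, mixing level L2
}


def level_for(a1, snr, level_db):
    """Mixing level delta * A1 for one destination (assumed mapping)."""
    delta = snr * 10.0 ** (level_db / 10.0)
    return delta * a1


def mixing_levels(a1, snr):
    """Return one mixing level per output destination."""
    return {name: level_for(a1, snr, db) for name, db in DESTINATION_DB.items()}


levels = mixing_levels(a1=4.0, snr=1.0)
```

With these example values the speech recognition path receives 5 dB less of the mixed sound than the speaker path, which keeps suppression high while still reducing distortion.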

(B-3) Effects of the Second Embodiment
According to the second embodiment, the following effects can be achieved in addition to those of the first embodiment.

In the sound collection device 100A of the second embodiment, different mixing amounts (mixing ratios) can be set for the call system and the speech recognition system, so an acoustic signal (sound collection result) with a mixing amount suited to each system can be generated and supplied. In other words, the sound collection device 100A can supply even recent speech recognition systems with improved noise robustness with an input whose noise suppression amount and sound quality are best suited to that system.

(C) Third Embodiment
A third embodiment of the sound collection device, program, and method according to the present invention is described in detail below with reference to the drawings.

(C-1) Configuration of the Third Embodiment
FIG. 5 is a block diagram showing the overall configuration of the sound collection device 100B of the third embodiment. Parts identical or corresponding to those in FIG. 4 described above are given the same or corresponding reference signs.

The following describes how the third embodiment differs from the second embodiment.

Musical noise generated in processing such as noise suppression is an artificial noise that is unpleasant to human listeners but has relatively little effect on speech recognition. Mixing the estimated noise into the area sound collection output is effective in reducing musical noise, but it does not necessarily improve recognition accuracy in a speech recognition system.

Therefore, the sound collection device 100B of the third embodiment is configured so that, for a call system such as the speaker 11, both the input signal and the estimated noise are mixed into the target area sound to improve sound quality, while for a speech recognition system such as the speech recognition unit 10, only the input signal, which is effective in reducing distortion of the target speech, is mixed.

Next, the internal configuration of the sound collection device 100B of the third embodiment is described in terms of its differences from the second embodiment.

In the sound collection device 100B, the mixing level calculation unit 7A, the mixing level adjustment unit 8A, and the signal mixing unit 9A are omitted, and two signal mixing units 91 and 92 are provided instead. The first signal mixing unit 91 generates the mixed sound for the voice system, and the second signal mixing unit 92 generates the mixed sound for the speech recognition system.

(C-2) Operation of the Third Embodiment
Next, the operation of the sound collection device 100B of the third embodiment configured as above is described, focusing on the differences from the second embodiment.

The processing from the microphone arrays MA1 and MA2 up to the target area sound extraction unit 6 is the same as in the first and second embodiments.

The third embodiment includes two signal mixing units, the signal mixing unit 91 and the signal mixing unit 92, and the mixing level (mixing ratio) in each is assumed to be calculated appropriately by the same procedure as in the first or second embodiment. That is, to keep FIG. 5 simple, the description assumes that each signal mixing unit itself performs the calculation of the mixing level (mixing ratio) and related processing.

As shown in FIG. 5, the sound collection device 100B of the third embodiment includes, downstream of the target area sound extraction unit 6, the signal mixing unit 91 for the call system and the signal mixing unit 92 for the speech recognition system.

In the signal mixing unit 91, the target area sound extracted by the target area sound extraction unit 6 is mixed, at a suitable mixing level, with the input signal from the signal input unit 3 and the estimated noise calculated by the noise suppression unit 4, and is sent to the call system (the speaker 11). For example, the signal mixing unit 91 performs the same processing as the generation of the mixed sound for the voice system in the first or second embodiment (the processing of the mixing level calculation unit, the mixing level adjustment unit, and the signal mixing unit). The mixed sound for the voice system generated by the signal mixing unit 91 is then supplied to the speaker 11.

In the other signal mixing unit 92, the target area sound extracted by the target area sound extraction unit 6 is mixed with the input signal from the signal input unit 3 at a suitable mixing level and sent to the speech recognition system. As described above, the estimated noise component calculated by the noise suppression unit 4 is not mixed into the mixed sound generated by the signal mixing unit 92. That is, the signal mixing unit 92 generates, for the speech recognition system, a mixed sound combining the target area sound extracted by the target area sound extraction unit 6 with the input signal from the signal input unit 3, and supplies it to the speech recognition unit 10.

The mixing level used by the signal mixing unit 92 when mixing the target area sound and the input signal may be calculated, for example, by the same processing as the mixing level for the speech recognition system in the second embodiment. Alternatively, the signal mixing unit 92 may set λ₁ = 1 in equations (11) and (12) above, so that the proportion of estimated noise mixed into the target area sound becomes 0 and only the input signal component is mixed.
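The split between the two mixers can be sketched in Python as follows. The mixing rule W = Z + μ(λX + (1 − λ)B) is an assumed form of equation (12), chosen only because it reproduces the stated behavior that λ₁ = 1 removes the estimated noise; the function names and scalar inputs are hypothetical.

```python
def mix(z, x, b, lam, mu):
    """Assumed mixing rule: target sound plus a scaled input/noise mixture."""
    return z + mu * (lam * x + (1.0 - lam) * b)


def third_embodiment_outputs(z, x, b, lam_call, mu_call, mu_asr):
    """Return (call-system output, speech-recognition output) for one bin."""
    # Signal mixing unit 91: input signal and estimated noise are both mixed.
    call_out = mix(z, x, b, lam_call, mu_call)
    # Signal mixing unit 92: lambda = 1, so the estimated noise term vanishes
    # and only the input signal is mixed into the target area sound.
    asr_out = mix(z, x, b, 1.0, mu_asr)
    return call_out, asr_out


call_out, asr_out = third_embodiment_outputs(
    z=1.0, x=2.0, b=10.0, lam_call=0.5, mu_call=0.1, mu_asr=0.1)
```

Changing the estimated noise `b` alters only the call-system output in this sketch, mirroring the statement that no estimated noise component reaches the speech recognition unit 10.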

(C-3) Effects of the Third Embodiment
According to the third embodiment, the following effects can be achieved in addition to those of the first embodiment.

In the sound collection device 100B of the third embodiment, for a call system premised on a person listening, an acoustic signal (mixed sound) in which musical noise is masked is output by mixing the input signal and the estimated noise into the area sound collection output. As a result, the sound collection device 100B provides, as the acoustic signal for the call system, speech in which distortion of the target area sound is corrected and which is easier to listen to while the sense of emphasis is maintained. For the speech recognition system, the sound collection device 100B generates an acoustic signal into which only the input signal, which is effective in reducing speech distortion, is mixed, and this contributes to improved recognition accuracy in the destination speech recognition system.

(D) Other Embodiments
The present invention is not limited to the embodiments above, and modified embodiments such as the following are also possible.

(D-1) In the sound collection devices of the embodiments above, each microphone array MA used for sound collection had two microphones, but sound in the target area direction may instead be collected based on acoustic signals picked up with three or more microphones. In each of the embodiments above, various existing methods can be applied for the number of microphones per microphone array MA and for the method of collecting sound in the target sound direction.

100 ... sound collection device, 3 ... signal input unit, 4 ... noise suppression unit, 5 ... directivity forming unit, 6 ... target area sound extraction unit, 7 ... mixing level calculation unit, 8 ... mixing level adjustment unit, 9 ... signal mixing unit,
200 ... subtractive BF, 210 ... delay unit, 220 ... subtractor.

Claims

A sound collection device that outputs collected sound signals, collected using a microphone array, to a plurality of output destinations, the sound collection device comprising:
noise suppression means for estimating background noise included in an input signal input from the microphone array, acquiring it as estimated noise, and using the acquired estimated noise to suppress the noise component of the input signal to acquire a noise-suppressed signal;
directivity forming means for acquiring, from the noise-suppressed signal, a first non-target area sound in which directivity is formed in directions other than the target area direction and a target area direction sound in which directivity is formed in the target area direction;
a target area sound extraction unit for extracting a second non-target area sound from the target area direction using the target area direction sound, and for acquiring, using the second non-target area sound and the target area direction sound, a target area sound whose sound source is the target area; and
output means for outputting, to each of the output destinations, an acoustic signal including at least a component of the target area sound as the collected sound signal, the output means being capable of generating, according to the characteristics of the output destination, a post-mixing target area sound in which a mixed sound including at least a component of the input signal and/or a component of the estimated noise is mixed into the target area sound, and of outputting it as the collected sound signal.

The sound collection device according to claim 1, wherein the output means sets, for an output destination of a speech system, a sound including the component of the input signal and the component of the estimated noise as the mixed sound, and outputs, to an output destination of a speech recognition system, an acoustic signal that does not include the mixed sound as the collected sound signal.

The sound collection device according to claim 1, wherein the output means adjusts the volume level of the mixed sound according to the characteristics of the output destination.

The sound collection device according to claim 2, wherein the output means sets the volume level of the mixed sound for the output destination of the speech recognition system lower than the volume level of the mixed sound for the output destination of the speech system.

The output means sets a sound including the component of the input signal and the component of the estimated noise as the mixed sound to the output destination of the speech recognition system, and the mixed sound to the output destination of a speech system The sound pickup apparatus according to claim 1, wherein the sound including the input signal and the estimated noise not including the input signal is set.

A sound pickup program characterized by causing a computer mounted on a sound pickup apparatus, which outputs a sound pickup signal picked up using a microphone array to a plurality of output destinations, to function as:
Noise suppression means for estimating background noise included in an input signal input from the microphone array, acquiring it as estimated noise, and suppressing a noise component of the input signal using the acquired estimated noise to acquire a noise-suppressed signal;
Directivity forming means for acquiring, from the noise-suppressed signal, a first non-target area sound in which directivity is formed in a direction other than the target area direction and a target area direction sound in which directivity is formed in the target area direction;
A target area sound extraction unit for extracting, using the target area direction sound, a second non-target area sound arriving from the target area direction, and for acquiring, using the second non-target area sound and the target area direction sound, a target area sound whose sound source is the target area; and
Output means which, in order to output to each of the output destinations an audio signal including at least a component of the target area sound as the sound pickup signal, is capable of generating a post-mixing target area sound by mixing with the target area sound, according to the characteristics of the output destination, a mixed sound including at least a component of the input signal and/or a component of the estimated noise, and of outputting the post-mixing target area sound as the sound pickup signal.
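
The noise suppression means in the claim above is defined only by its function. As a minimal sketch, assuming spectral subtraction on per-bin amplitude spectra with a recursively averaged noise estimate (both are assumptions; the claim names no specific algorithm), the two steps of estimating the noise and suppressing it could look as follows. The function names and the `smoothing` and `over_subtraction` parameters are hypothetical.

```python
def update_noise_estimate(prev_noise, frame_amp, smoothing=0.9):
    """Track background noise by recursive averaging of amplitude spectra.

    Assumption: called on frames judged to contain only background noise.
    """
    return [
        smoothing * p + (1.0 - smoothing) * a
        for p, a in zip(prev_noise, frame_amp)
    ]


def suppress_noise(frame_amp, est_noise, over_subtraction=1.0):
    """Subtract the estimated noise per frequency bin, flooring at zero."""
    return [
        max(a - over_subtraction * n, 0.0)
        for a, n in zip(frame_amp, est_noise)
    ]


# Per-bin amplitudes of one input frame and of the current noise estimate.
frame_amp = [1.0, 0.5, 0.2]
est_noise = [0.25, 0.5, 0.5]

noise_suppressed = suppress_noise(frame_amp, est_noise)
```

Keeping `est_noise` available after suppression matters here, because the output means may later mix a component of the estimated noise back into the target area sound.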

In a sound pickup method performed by a sound pickup apparatus that outputs a sound pickup signal picked up using a microphone array to a plurality of output destinations,
The sound pickup apparatus includes noise suppression means, directivity forming means, a target area sound extraction unit, and output means;
The noise suppression means estimates background noise included in an input signal input from the microphone array, acquires it as estimated noise, and suppresses a noise component of the input signal using the acquired estimated noise to acquire a noise-suppressed signal;
The directivity forming means acquires, from the noise-suppressed signal, a first non-target area sound in which directivity is formed in a direction other than the target area direction and a target area direction sound in which directivity is formed in the target area direction;
The target area sound extraction unit extracts, using the target area direction sound, a second non-target area sound arriving from the target area direction, and acquires, using the second non-target area sound and the target area direction sound, a target area sound whose sound source is the target area; and
The output means, in order to output to each of the output destinations an audio signal including at least a component of the target area sound as the sound pickup signal, is capable of generating a post-mixing target area sound by mixing with the target area sound, according to the characteristics of the output destination, a mixed sound including at least a component of the input signal and/or a component of the estimated noise, and of outputting the post-mixing target area sound as the sound pickup signal.
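
The target area sound extraction step can be sketched under stated assumptions. The sketch assumes two microphone arrays (as in the abstract's MA1 and MA2) whose target area direction sounds `y1` and `y2` are available as per-bin amplitude spectra, and it assumes spectral subtraction as the extraction operation; the claim itself fixes neither. The function names and the correction weights `alpha` and `gamma` are hypothetical.

```python
def spectral_subtract(a, b, weight=1.0):
    """Per-bin subtraction of weight * b from a, floored at zero."""
    return [max(x - weight * y, 0.0) for x, y in zip(a, b)]


def extract_target_area(y1, y2, alpha=1.0, gamma=1.0):
    """Estimate the target area sound from two target-direction beams.

    y1, y2: amplitude spectra of the target area direction sound of
    array 1 and array 2 (assumed layout: the beams cross in the target area).
    Returns (target_area_sound, second_non_target_area_sound).
    """
    # What array 1 picks up in its target direction but array 2 does not
    # is treated as sound from outside the shared target area.
    n1 = spectral_subtract(y1, y2, alpha)
    # Removing that residue from array 1's beam leaves the target area sound.
    z1 = spectral_subtract(y1, n1, gamma)
    return z1, n1


y1 = [1.0, 0.5, 2.0]
y2 = [1.0, 0.0, 0.5]
z1, n1 = extract_target_area(y1, y2)
```

In this toy input, the first bin is common to both beams and is kept in full, the second bin appears only in `y1` and is removed entirely, and the third bin is reduced to the part common to both beams.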