JP7316614B2

JP7316614B2 - Sound source separation device, sound source separation method, and program

Info

Publication number: JP7316614B2
Application number: JP2020100287A
Authority: JP
Inventors: 一博中臺; 知鍾; 克寿糸山; 健次西田
Original assignee: Honda Motor Co Ltd; Tokyo Institute of Technology NUC
Current assignee: Honda Motor Co Ltd; Tokyo Institute of Technology NUC
Priority date: 2020-06-09
Filing date: 2020-06-09
Publication date: 2023-07-28
Anticipated expiration: 2040-06-09
Also published as: JP2021197566A

Description

Application of Article 30, Paragraph 2 of the Patent Act
[1] Publication date: September 3, 2019 (online proceedings; full-text download available from 13:00 on September 3, 2019 to 9:00 on September 4, 2019; URL: https://ac.rsj-web.org/2019/). Publication: Proceedings of the 37th Annual Conference of the Robotics Society of Japan (RSJ2019), The Robotics Society of Japan. Attached document: research paper published in the proceedings (excerpt).
[2] Dates held: September 3, 2019 to September 7, 2019 (date of disclosure: September 4, 2019). Meeting name and venue: The 37th Annual Conference of the Robotics Society of Japan, organized by the Robotics Society of Japan, Waseda University, Waseda Campus. Attached documents: printout of the conference overview web page; conference program (excerpt); oral presentation materials (slides).
[3] Publication date: January 12, 2020 (Honolulu, Hawaii local time and Japan time). Publication: Proceedings of the 2020 IEEE/SICE International Symposium on System Integration (SII2020), IEEE. Attached document: published research paper (excerpt).
[4] Dates held: January 12, 2020 to January 15, 2020 (Honolulu, Hawaii local time); January 12, 2020 to January 16, 2020 (Japan time). Meeting name and venue: 2020 IEEE/SICE International Symposium on System Integration (SII2020), organized by the IEEE Robotics & Automation Society, Hawaii Convention Center, Honolulu, Hawaii, USA. Program URL: https://ras.papercept.net/conferences/conferences/SII20/program/SII20_ContentListWeb_3.html. Attached documents: printout of the symposium overview web page; symposium program and abstracts of published research (excerpts); oral presentation materials (slides).

The present invention relates to a sound source separation device, a sound source separation method, and a program.

By applying processing such as beamforming to acoustic signals picked up by a microphone array, sound source separation can be performed, in which only a specific sound source is extracted from an observed signal containing a mixture of multiple sound sources (see, for example, Patent Document 1).

The theory behind these sound source separation processes assumes that the sound source is a point sound source. Since ordinary sound sources are surface sound sources, separation has conventionally been performed by treating a surface sound source as a point sound source. To separate surface sound sources in a pseudo manner, methods that use a wide beam (directivity), such as delay-and-sum beamforming and echo cancellation, have conventionally been used.

Patent Document 1: JP 2015-46759 A

However, with the conventional techniques, fine control of directivity is difficult. For example, a sound source located in a region that should be suppressed may be extracted, and sound sources whose surface shape is a square or a more complex shape cannot be handled.

The present invention has been made in view of the above problems, and an object thereof is to provide a sound source separation device, a sound source separation method, and a program capable of separating a surface sound source.

(1) To achieve the above object, a sound source separation device according to one aspect of the present invention includes: a microphone array that picks up acoustic signals and has N microphones (N is an integer of 2 or more) arranged at a first interval; and a sound source separation unit that subdivides a desired region at a second interval, separates and extracts, for each of the subdivided regions, the acoustic signal picked up by the microphone array by a beamforming method using a sub-beam corresponding to the azimuth angle θ of the subdivided region, and separates the acoustic signal of the desired region by adding the extracted acoustic signals.

(2) In the sound source separation device according to one aspect of the present invention, the pattern of the sub-beams may be represented by the following equation,

where D(θ) is the ideal filter for separating the acoustic signal of the desired region, which is 1 for azimuth angles from the start azimuth angle θ_b of the sound source to the end azimuth angle θ_e of the sound source and 0 at other azimuth angles; P_θi is the directional characteristic of a beamformer for a point sound source when the beam is directed in the θ_i direction; θ_i is the direction of the directivity of each beamformer; H denotes the Hermitian conjugate; a(θ) is the transfer function from the sound source to the microphones; and W_θi is the coefficient vector of the beamformer P_θi. In this case, the coefficient W of the beamformer for the surface sound source is given by the following equation,

and, when D(θ) is given by the following equation,

θ_i can be defined as θ_i = θ_b + (i − 1)θ_scan, where b is the value of i for which θ_i = θ_b and e is the value of i for which θ_i = θ_e.

(3) The sound source separation device according to one aspect of the present invention may further include: an evaluation unit that calculates the relationship among a cost function J, the number of microphones, and the second interval; and a selection unit that selects, from the relationship calculated by the evaluation unit, the number of microphones and the second interval that minimize the cost function.

(4) In the sound source separation device according to one aspect of the present invention, the evaluation unit may calculate the difference between the beam pattern and the ideal pattern using the logarithmic mean square error MSE of the following equation,

calculate the cost function J using the following equation, and obtain, based on the calculated cost function J, the number N of microphones, and the scan angle θ_scan, the optimal number N of microphones and the optimal scan angle θ_scan for separating the acoustic signal of the desired region,

where α is a predetermined value and λ_1 and λ_2 are adjustment parameters.

(5) In the sound source separation device according to one aspect of the present invention, the evaluation unit may represent the cost function J, the number N of microphones, and the scan angle θ_scan in a three-dimensional graph and obtain the optimal number N of microphones and the optimal scan angle θ_scan by selecting the N and θ_scan that minimize the cost function J, and the sound source separation unit may update its settings to the optimal number N of microphones and the optimal scan angle θ_scan selected by the evaluation unit.

(6) To achieve the above object, in a sound source separation method according to one aspect of the present invention, a microphone array having N microphones (N is an integer of 2 or more) arranged at a first interval picks up acoustic signals; a sound source separation unit subdivides a desired region at a second interval; the sound source separation unit separates and extracts, for each of the subdivided regions, the acoustic signal by a beamforming method using a sub-beam corresponding to the azimuth angle θ of the subdivided region; and the sound source separation unit separates the acoustic signal of the desired region by adding the extracted acoustic signals.

(7) To achieve the above object, a program according to one aspect of the present invention causes a computer to: have a microphone array, which has N microphones (N is an integer of 2 or more) arranged at a first interval, pick up acoustic signals; subdivide a desired region at a second interval; separate and extract, for each of the subdivided regions, the acoustic signal by a beamforming method using a sub-beam corresponding to the azimuth angle θ of the subdivided region; and separate the acoustic signal of the desired region by adding the extracted acoustic signals.

According to (1) to (7) described above, a sound source located in a region to be suppressed can be suppressed appropriately, and a surface sound source can be separated.
According to (2) described above, a sub-beamformer is used for each subdivided region to extract the acoustic signal of that region from the picked-up acoustic signal by a beamforming method, and the desired surface sound source can be separated by adding the acoustic signals extracted for the subdivided regions.
According to (3) to (5) described above, using the cost function J makes it possible to select the optimal number of microphones and the optimal interval for dividing the desired region in order to extract a surface sound source.

FIG. 1 is a block diagram showing a configuration example of a sound source separation system according to an embodiment.
FIG. 2 is a diagram showing an arrangement example of microphones according to the embodiment.
FIG. 3 is a diagram for explaining the beamformer used in the embodiment.
FIG. 4 is a diagram showing the relationship between the microphone array setting used for evaluation and the surface sound source that is the separation target.
FIG. 5 is a setting example for extracting the surface sound source that is the separation target and suppressing surrounding noise.
FIG. 6 is a diagram showing an example of an adaptive MVDR beamformer.
FIG. 7 is the MSE surface obtained when the adaptive MVDR beamformer of FIG. 6 is used.
FIG. 8 is a diagram showing a DS beamformer.
FIG. 9 is the MSE surface obtained when the DS beamformer of FIG. 8 is used.
FIG. 10 is a diagram showing a setting example of the surface sound source that is the separation target.
FIG. 11 is a diagram showing the MSE surface of a scan-and-sum beamformer whose sub-beamformer is DS.
FIG. 12 is a diagram showing the cost surface of a scan-and-sum beamformer whose sub-beamformer is DS.
FIG. 13 is a diagram showing pattern examples of Wiener beamforming and of the scan-and-sum beamformer.
FIG. 14 is a diagram showing the MSE surface of a scan-and-sum beamformer whose sub-beamformer is MVDR.
FIG. 15 is a diagram showing the cost surface of a scan-and-sum beamformer whose sub-beamformer is MVDR.
FIG. 16 is a diagram showing the pattern of the synthesized MVDR scan-and-sum beamformer with the recommended settings.
FIG. 17 is a diagram showing an example of the beam pattern when low-density scanning is selected.
FIG. 18 is a diagram showing evaluation results for three different buffer sizes.
FIG. 19 is a diagram showing an example of a separated surface sound source according to the embodiment.

Embodiments of the present invention will be described below with reference to the drawings. In the drawings used in the following description, the scale of each member is changed as appropriate so that each member is shown at a recognizable size. In this embodiment, the acoustic signal of a specific region is referred to as a surface sound source.

FIG. 1 is a block diagram showing a configuration example of a sound source separation system 1 according to this embodiment. The sound source separation system 1 includes a sound pickup unit 2 and a sound source separation device 3.
The sound pickup unit 2 includes N microphones 21-1, ..., 21-N (N is an integer of 2 or more). In the following description, when one of the microphones 21-1, ..., 21-N is not specified, it is referred to as a microphone 21.
The sound source separation device 3 includes an acquisition unit 31, a transfer function storage unit 32, a beam pattern storage unit 33, a sound source separation unit 34, and an output unit 35.
The sound source separation unit 34 includes a separation unit 341, an evaluation unit 342, and a selection unit 343.

The sound pickup unit 2 is a microphone array including N microphones 21. The sound pickup unit 2 picks up acoustic signals emitted by sound sources and outputs the picked-up N-channel acoustic signal to the acquisition unit 31. The arrangement of the microphones 21 will be described later.

The acquisition unit 31 acquires the analog N-channel acoustic signal output from the sound pickup unit 2 and converts it into a digital acoustic signal. The acoustic signals output from the N microphones 21 of the sound pickup unit 2 are all sampled at the same sampling frequency. The acquisition unit 31 outputs the digitized acoustic signal to the sound source separation unit 34.

The transfer function storage unit 32 stores, for each microphone 21 of the sound pickup unit 2, a transfer function modeled as a function that takes the direction of arrival as its argument.

The beam pattern storage unit 33 may store sub-beam patterns. The sub-beam patterns will be described later.

The sound source separation unit 34 separates the acoustic signal of the desired region and outputs it to the output unit 35. The sound source separation unit 34 evaluates the beam pattern used for the separation and, based on the evaluation result, selects the number of microphones 21 and the interval at which the desired region is divided. The desired region is a region that includes the region in which the surface sound source to be separated exists.

The separation unit 341 subdivides the desired region at equal intervals. For each subdivided region, the sound source separation unit 34 uses a sub-beamformer to extract the acoustic signal of that region from the picked-up acoustic signal by a beamforming method. The sound source separation unit 34 separates the desired surface sound source by adding the acoustic signals extracted for the subdivided regions. The separation unit 341 initially sets the number of microphones 21 and the interval for dividing the desired region to initial values that it stores. The separation unit 341 may update the number of microphones 21 and the interval for dividing the desired region based on the selection result output by the selection unit 343.
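The subdivide, extract, and add procedure of the separation unit can be sketched for a single frequency bin as follows. This is a minimal illustration rather than the patented implementation: it assumes a uniform linear array, a far-field phase-shift model with an assumed sign convention, and delay-and-sum sub-beamformers, and all function names are hypothetical.

```python
import cmath
import math


def steering_vector(theta_deg, n_mics, spacing_m, freq_hz, c=340.29):
    """Assumed far-field phase-shift vector of a uniform linear array."""
    tau = spacing_m * math.cos(math.radians(theta_deg)) / c
    return [cmath.exp(-2j * math.pi * freq_hz * k * tau) for k in range(n_mics)]


def sub_beam_output(z, theta_i, spacing_m, freq_hz):
    """Y_i = W_i^H Z with delay-and-sum weights W_i = a(theta_i) / N."""
    n = len(z)
    a = steering_vector(theta_i, n, spacing_m, freq_hz)
    return sum((a_k / n).conjugate() * z_k for a_k, z_k in zip(a, z))


def separate_region(z, theta_b, theta_e, theta_scan, spacing_m, freq_hz):
    """Subdivide [theta_b, theta_e] at theta_scan and add the sub-beam outputs."""
    count = int(math.floor((theta_e - theta_b) / theta_scan + 1e-9)) + 1
    return sum(
        sub_beam_output(z, theta_b + i * theta_scan, spacing_m, freq_hz)
        for i in range(count)
    )


# Unit-amplitude plane waves observed on 20 microphones (d = 2 cm, f = 2 kHz):
# one from 90 degrees (inside the 75-105 degree region), one from 30 degrees.
inside = separate_region(
    steering_vector(90.0, 20, 0.02, 2000.0), 75.0, 105.0, 5.0, 0.02, 2000.0
)
outside = separate_region(
    steering_vector(30.0, 20, 0.02, 2000.0), 75.0, 105.0, 5.0, 0.02, 2000.0
)
```

Under these assumptions the in-region wave comes out with a larger magnitude than the out-of-region wave, which is the behaviour the separation unit relies on.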

The evaluation unit 342 evaluates the quality of the selected beam pattern using the MSE (logarithmic mean square error) together with a cost function J that balances performance and cost. The evaluation unit 342 outputs the evaluation result to the selection unit 343. The cost function J, the MSE, and the evaluation method will be described later.

The selection unit 343 selects the number of microphones 21 and the interval for dividing the desired region based on the evaluation result from the evaluation unit 342. As described later, the selection unit 343 represents the cost function J, the number of microphones 21, and the interval for dividing the desired region in a three-dimensional graph and selects the number of microphones 21 and the interval by finding the minimum in this graph. The selection unit 343 outputs the selection result to the separation unit 341.
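Finding the minimum of the cost surface amounts to a search over candidate pairs of microphone count and division interval. The sketch below assumes an exhaustive grid search and takes the cost function as a caller-supplied callable, because the form of J is not reproduced in this text. `select_min_cost` and `toy_cost` are hypothetical names, and `toy_cost` is a stand-in used only to exercise the search.

```python
def select_min_cost(cost, mic_counts, scan_angles):
    """Return (J, N, theta_scan) for the grid point with the smallest cost J."""
    best = None
    for n_mics in mic_counts:
        for theta_scan in scan_angles:
            j = cost(n_mics, theta_scan)
            if best is None or j < best[0]:
                best = (j, n_mics, theta_scan)
    return best


def toy_cost(n_mics, theta_scan):
    """Hypothetical stand-in for J; not the cost function of the patent."""
    return (n_mics - 12) ** 2 + (theta_scan - 2.0) ** 2


best_j, best_n, best_scan = select_min_cost(
    toy_cost, mic_counts=range(2, 21), scan_angles=[0.5, 1.0, 2.0, 4.0]
)
```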

The output unit 35 is, for example, a speaker. The output unit 35 outputs the acoustic signal of the desired region separated by the sound source separation unit 34.

<Microphone arrangement example>
Next, an arrangement example of the microphones 21 will be described.
FIG. 2 is a diagram showing an arrangement example of the microphones 21 according to this embodiment. The N microphones 21 are arranged at equal intervals d (the first interval) in the order 21-1, 21-2, ..., 21-(N-1), 21-N, for example on a straight line g11. The microphone 21-1 at one end is used as the reference microphone. The number N of microphones 21 and the spacing d may be selected according to, for example, the cost described later.

All N microphones 21 are omnidirectional. The distance between the sound source and the microphones 21 is preferably large enough that the signal from the sound source arrives at the microphones 21 as a plane wave.
Symbol g12 represents a plane wave arriving at the microphones 21. The azimuth angle θ is the angle between the straight line g11 on which the microphones 21 are arranged and the incoming plane wave g12, and increases counterclockwise. The speed of sound is denoted by c (340.29 m/s).

A signal from a sound source propagates to the sound pickup unit 2, which is a microphone array, via a transfer function expressed as a phase-shift vector a(θ, f) ∈ C^(N×1) (C is the set of all complex numbers), given by the following Equation (1). This transfer function is stored in the transfer function storage unit 32.

In Equation (1), T denotes transposition. τ is the delay time between two adjacent microphones 21 in the microphone array, τ = (d·cos θ)/c. f is a frequency component of the input signal, f = ω/2π. In the following description, f is omitted for simplicity.
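As an illustration of the quantities just defined, the sketch below builds a phase-shift vector for the uniform linear array of FIG. 2. It assumes the common far-field form a(θ, f) = [1, e^(−jωτ), ..., e^(−jω(N−1)τ)]^T with microphone 21-1 as the phase reference. The sign of the exponent and the function name `steering_vector` are assumptions, since Equation (1) itself is not reproduced in this text.

```python
import cmath
import math

SPEED_OF_SOUND = 340.29  # m/s, the value of c given in the text


def steering_vector(theta_deg, n_mics, spacing_m, freq_hz, c=SPEED_OF_SOUND):
    """Phase-shift vector a(theta, f) of a uniform linear array (assumed form).

    Element k is exp(-j * omega * k * tau), with tau = d * cos(theta) / c and
    omega = 2 * pi * f. Index k = 0 is the reference microphone 21-1.
    """
    tau = spacing_m * math.cos(math.radians(theta_deg)) / c
    omega = 2.0 * math.pi * freq_hz
    return [cmath.exp(-1j * omega * k * tau) for k in range(n_mics)]


# Example with the parameter values used later in the text: d = 2 cm, N = 20, f = 2 kHz.
a = steering_vector(theta_deg=60.0, n_mics=20, spacing_m=0.02, freq_hz=2000.0)
```

Every element has unit magnitude, so the vector only encodes inter-microphone phase; at θ = 90 degrees (broadside) the delay τ is zero and all elements equal 1.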

The i-th received point source is Z_i = a_i S_i, where a_i is the transfer function for the direction of arrival (DOA) of the i-th point source. Spatially white noise, such as the thermal noise of the microphones, is represented as V ∈ C^(N×1).
The mixed signal Z ∈ C^(N×1) can be expressed by the following Equation (2).

In Equation (2), S = [S_1, S_2, ..., S_Nsig]^T collects the individual point-source signals that are mixed into the multichannel signal, N_sig is the number of signals, and A = [a_1, a_2, ..., a_Nsig] is the mixing matrix.
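Read together, these definitions suggest the usual linear mixing model Z = A S + V, the sum of the received point sources Z_i = a_i S_i plus noise. The sketch below simulates one frequency bin under that assumption. The phase-shift model, the Gaussian noise, and the name `mix` are illustrative choices, not taken from the patent.

```python
import cmath
import math
import random


def steering_vector(theta_deg, n_mics, spacing_m, freq_hz, c=340.29):
    """Assumed far-field phase-shift vector of a uniform linear array."""
    tau = spacing_m * math.cos(math.radians(theta_deg)) / c
    return [cmath.exp(-2j * math.pi * freq_hz * k * tau) for k in range(n_mics)]


def mix(sources, n_mics, spacing_m, freq_hz, noise_std=0.0, seed=0):
    """Return Z = A S + V for one frequency bin.

    sources is a list of (doa_deg, complex_amplitude) pairs: each pair supplies
    one column a_i of A and one entry S_i of S. V is complex white noise.
    """
    rng = random.Random(seed)
    z = [0j] * n_mics
    for doa_deg, s_i in sources:
        a_i = steering_vector(doa_deg, n_mics, spacing_m, freq_hz)
        for k in range(n_mics):
            z[k] += a_i[k] * s_i  # accumulate Z_i = a_i S_i
    for k in range(n_mics):
        z[k] += complex(rng.gauss(0.0, noise_std), rng.gauss(0.0, noise_std))
    return z


z = mix([(90.0, 2.0 + 0j), (60.0, 0.5j)], n_mics=20, spacing_m=0.02, freq_hz=2000.0)
```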

<Basic equation of the point-source beamformer>
Next, the basic equation of the beamformer for a point sound source will be described.
A beamformer is represented by a coefficient vector W ∈ C^(Nch×1). The directional frequency response of the beamformer is its "pattern", defined by the following Equation (3).

In Equation (3), H denotes the Hermitian conjugate, and θ_DOA denotes the direction of arrival of the target sound source (focus DOA). The subscript DOA indicates that the focus direction of arrival is a design parameter. a(θ) is the transfer function of Equation (1) with f omitted. The variable θ represents the input direction of the signal. The selectivity among signals arriving from different directions is the directivity of the beamformer.
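With these definitions the pattern is the inner product of the coefficient vector and the transfer function, P_θDOA(θ) = W_θDOA^H a(θ). The sketch below evaluates it for a delay-and-sum (DS) beamformer, W = a(θ_DOA)/N, chosen here only because it is the simplest point-source beamformer. The phase-shift model and all function names are assumptions for illustration.

```python
import cmath
import math


def steering_vector(theta_deg, n_mics, spacing_m, freq_hz, c=340.29):
    """Assumed far-field phase-shift vector of a uniform linear array."""
    tau = spacing_m * math.cos(math.radians(theta_deg)) / c
    return [cmath.exp(-2j * math.pi * freq_hz * k * tau) for k in range(n_mics)]


def ds_weights(theta_doa_deg, n_mics, spacing_m, freq_hz):
    """Delay-and-sum coefficient vector W = a(theta_DOA) / N."""
    a = steering_vector(theta_doa_deg, n_mics, spacing_m, freq_hz)
    return [x / n_mics for x in a]


def pattern(weights, theta_deg, spacing_m, freq_hz):
    """Directional response P(theta) = W^H a(theta)."""
    a = steering_vector(theta_deg, len(weights), spacing_m, freq_hz)
    return sum(w.conjugate() * x for w, x in zip(weights, a))


def gain_db(weights, theta_deg, spacing_m, freq_hz):
    """Gain 20 * log10 |P(theta)|, floored to avoid log of zero."""
    magnitude = abs(pattern(weights, theta_deg, spacing_m, freq_hz))
    return 20.0 * math.log10(max(magnitude, 1e-12))


w = ds_weights(90.0, n_mics=20, spacing_m=0.02, freq_hz=2000.0)
on_target = gain_db(w, 90.0, 0.02, 2000.0)   # gain toward the focus DOA
off_target = gain_db(w, 60.0, 0.02, 2000.0)  # gain 30 degrees away from it
```

Under these assumptions the response toward the focus DOA is 0 dB and the response 30 degrees away is attenuated.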

<Scan-and-sum beamformer>
Next, the scan-and-sum beamformer used in this embodiment will be described.
In this embodiment, instead of designing a surface beam pattern with a single beamformer, the focus DOA of a point-source sub-beamformer is varied so that the region containing the target surface sound source is scanned at an appropriate scan angle. The sub-beamformers are then integrated into a surface beamformer. In this embodiment, this technique is called the scan-and-sum beamformer method (or scan-and-sum method).

FIG. 3 is a diagram for explaining the beamformer used in this embodiment. In FIG. 3, the horizontal axis is the direction of arrival θ [degrees], and the vertical axis is the gain (20 log|P_θDOA|) [dB]. Pattern g21 is the beam pattern of an MVDR (Minimum Variance Distortionless Response) scan, shown as a comparative example. Pattern g22 is the beam pattern of the scan-and-sum method used in this embodiment. Pattern g23 is the ideal beam pattern for a surface sound source. In MVDR, sound source separation is performed by finding a separation matrix that minimizes the output power under a linear constraint that the target sound source is not distorted. The scan-and-sum beam pattern shown in FIG. 3 is an example, and the pattern is not limited to this.
The scan-and-sum method is based on a model in which a surface sound source can be decomposed into a combination of many point sound sources.

As an example of parameter settings, all patterns and analyses were computed with d = 2 cm, N = 20, and f = 2 kHz. The length of the microphone array is therefore 40 cm.

<Equations of the scan-and-sum beamformer>
Next, the equations used in the scan-and-sum beamformer will be described.
The ideal pattern D(θ) is defined over the azimuth dimension by the following Equation (4).
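For reference, the verbal definition of D(θ) given in aspect (2) above (1 from the start azimuth to the end azimuth of the surface sound source, 0 elsewhere) can be written as the following piecewise function. This is a transcription of that definition, not a copy of the original equation image.

```latex
D(\theta) =
\begin{cases}
1, & \theta_b \le \theta \le \theta_e \\
0, & \text{otherwise}
\end{cases}
```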

In Equation (4), θ_b is the start azimuth angle of the surface sound source, and θ_e is its end azimuth angle. The pattern of a known point-source beamformer focused on each of a series of DOAs (θ_i) is denoted P_θi. Here, θ_i is the direction of the directivity of each beamformer, θ_i = θ_b + (i − 1)θ_scan, and θ_scan is the scan angle (the second interval). b is the value of i for which θ_i = θ_b, and e is the value of i for which θ_i = θ_e.
The scan-and-sum beamformer used in this embodiment is expressed by the following Equation (5).

In Equation (5), P_θi is the directional characteristic of the beamformer at θ_i. P_θ is the directional characteristic of a beamformer for a point sound source when the beam is directed in the θ direction. In the embodiment, the maximum response was normalized to 0 dB.
From Equation (5), P(θ) is the sum of e − b + 1 functions P_θi(θ) and is therefore a function of θ_scan. Writing it as P_θscan(θ) to make this explicit, the scan-and-sum beamformer can also be expressed by the following Equation (6).

In Equation (6), P_θi(θ) (i = 1, 2, ...) is the pattern of a sub-beamformer whose directivity is in the direction of azimuth angle θ_i. As in Equation (5) or Equation (6), the scan-and-sum beamformer realizes a beam pattern close to the ideal pattern by combining the patterns of multiple sub-beams. A sub-beam pattern is a pattern for each predetermined azimuth angle, such as the dashed line g211 in FIG. 3. N_S is e − b + 1.
From Equations (3) and (5), the coefficient vector W of the scan-and-sum beamformer is expressed by the following Equation (7).
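Because the pattern W^H a(θ) is linear in W, summing the sub-beam patterns is equivalent to summing the sub-beamformer coefficient vectors, W = Σ_i W_θi. The sketch below checks this numerically under stated assumptions: delay-and-sum sub-beamformers, the assumed far-field phase-shift model, a 75 to 105 degree target region, and a 5 degree scan angle. The region and scan angle are chosen for readability and are not the settings of the embodiment; all function names are hypothetical.

```python
import cmath
import math


def steering_vector(theta_deg, n_mics, spacing_m, freq_hz, c=340.29):
    """Assumed far-field phase-shift vector of a uniform linear array."""
    tau = spacing_m * math.cos(math.radians(theta_deg)) / c
    return [cmath.exp(-2j * math.pi * freq_hz * k * tau) for k in range(n_mics)]


def ds_weights(theta_doa_deg, n_mics, spacing_m, freq_hz):
    """Delay-and-sum sub-beamformer W_theta_i = a(theta_i) / N."""
    return [x / n_mics for x in steering_vector(theta_doa_deg, n_mics, spacing_m, freq_hz)]


def pattern(weights, theta_deg, spacing_m, freq_hz):
    """P(theta) = W^H a(theta)."""
    a = steering_vector(theta_deg, len(weights), spacing_m, freq_hz)
    return sum(w.conjugate() * x for w, x in zip(weights, a))


def scan_directions(theta_b, theta_e, theta_scan):
    """theta_i = theta_b + (i - 1) * theta_scan for i = 1, 2, ... up to theta_e."""
    count = int(math.floor((theta_e - theta_b) / theta_scan + 1e-9)) + 1
    return [theta_b + i * theta_scan for i in range(count)]


def scan_and_sum_weights(directions, n_mics, spacing_m, freq_hz):
    """W = sum over i of W_theta_i."""
    total = [0j] * n_mics
    for theta_i in directions:
        for k, w_k in enumerate(ds_weights(theta_i, n_mics, spacing_m, freq_hz)):
            total[k] += w_k
    return total


directions = scan_directions(75.0, 105.0, 5.0)
w_sum = scan_and_sum_weights(directions, 20, 0.02, 2000.0)
magnitudes = [abs(pattern(w_sum, float(t), 0.02, 2000.0)) for t in range(181)]
peak = max(magnitudes)
# Normalize so that the maximum response is 0 dB, as stated for the embodiment.
pattern_db = [20.0 * math.log10(max(m / peak, 1e-12)) for m in magnitudes]
```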

Equation (6) may be replaced by the following Equation (8), which takes into account the weights B = [b_1, b_2, ..., b_i, ..., b_NS] applied in the addition.

In this case, B can be determined uniquely as the following Equation (9) by solving the problem of minimizing the mean square error (MSE) between P(θ) and D(θ).

Here, Q(θ) = [p_φ1(θ), p_φ2(θ), ..., p_φi(θ), ..., p_φNS(θ)].
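One way to carry out such a minimization numerically is a discretized least-squares fit: sample θ on a grid, stack the sub-beam patterns Q(θ) as rows, and solve the normal equations (Q^H Q) B = Q^H D. The sketch below does this with delay-and-sum sub-beams and a small hand-written complex solver. The discretization, the sub-beam type, the three focus directions, and every function name are assumptions for illustration; the closed form of Equation (9) is not reproduced in this text and may differ.

```python
import cmath
import math


def steering_vector(theta_deg, n, d, f, c=340.29):
    """Assumed far-field phase-shift vector of a uniform linear array."""
    tau = d * math.cos(math.radians(theta_deg)) / c
    return [cmath.exp(-2j * math.pi * f * k * tau) for k in range(n)]


def ds_pattern(phi_deg, theta_deg, n, d, f):
    """Sub-beam pattern p_phi(theta) = W_phi^H a(theta), with W_phi = a(phi) / N."""
    w = steering_vector(phi_deg, n, d, f)
    a = steering_vector(theta_deg, n, d, f)
    return sum(wk.conjugate() * ak for wk, ak in zip(w, a)) / n


def solve(matrix, rhs):
    """Gaussian elimination with partial pivoting for a small complex system."""
    size = len(rhs)
    aug = [row[:] + [r] for row, r in zip(matrix, rhs)]
    for col in range(size):
        piv = max(range(col, size), key=lambda r: abs(aug[r][col]))
        aug[col], aug[piv] = aug[piv], aug[col]
        for r in range(col + 1, size):
            factor = aug[r][col] / aug[col][col]
            for k in range(col, size + 1):
                aug[r][k] -= factor * aug[col][k]
    x = [0j] * size
    for r in reversed(range(size)):
        tail = sum(aug[r][k] * x[k] for k in range(r + 1, size))
        x[r] = (aug[r][size] - tail) / aug[r][r]
    return x


def ls_weights(phis, thetas, ideal, n, d, f):
    """Solve (Q^H Q) B = Q^H D on the sampled grid; returns (B, Q)."""
    q = [[ds_pattern(phi, th, n, d, f) for phi in phis] for th in thetas]
    ns = len(phis)
    gram = [[sum(row[i].conjugate() * row[j] for row in q) for j in range(ns)]
            for i in range(ns)]
    rhs = [sum(row[i].conjugate() * dv for row, dv in zip(q, ideal)) for i in range(ns)]
    return solve(gram, rhs), q


def sq_error(q, b, ideal):
    """Sum over the grid of |Q(theta) B - D(theta)|^2."""
    return sum(abs(sum(qi * bi for qi, bi in zip(row, b)) - dv) ** 2
               for row, dv in zip(q, ideal))


thetas = [float(t) for t in range(181)]
ideal = [1.0 if 75.0 <= t <= 105.0 else 0.0 for t in thetas]
b, q = ls_weights([75.0, 90.0, 105.0], thetas, ideal, n=20, d=0.02, f=2000.0)
```

By construction the fitted weights give a squared error no larger than the unweighted sum (all b_i = 1) on the same grid.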

<Criteria for error analysis>
Next, the criteria for error analysis will be described.
To check whether the scan-and-sum pattern accurately approximates the target pattern, the difference between the scan-and-sum beam pattern and the ideal pattern is evaluated. From Equation (6), P_θi(θ) is a function of the number of microphones N, and the interval of φ_i is θ_scan, the same as the interval of θ_i, so P(θ) is a function of N and θ_scan. To make this explicit, P(θ) is redefined as P_{N,θscan}(θ). The MSE (mean square error) used as the evaluation criterion can then be formulated as the following Equation (10).

ＭＳＥは、マイクロホンの数Ｎとスキャン角度θ_ｓｃａｎの２つの変数を用いて求められる。 The MSE is determined using two variables, the number of microphones N and the scan angle θ _scan .
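As a rough illustration of how such an MSE depends on N and θ_scan, the sketch below sums delay-and-sum sub-beams over a target sector and compares the result with a rectangular ideal pattern. It is not Equation (10): the uniform linear array geometry, the speed of sound, the peak normalisation, the 0/1 ideal pattern, and all function names are assumptions made for the example. The spacing (2 cm), frequency (2 kHz), N = 20, and θ_scan = 0.92 degrees follow the evaluation conditions stated in the text.

```python
import numpy as np

C_SOUND = 343.0  # speed of sound in m/s (assumed)

def steering(theta_deg, n_mics, d=0.02, f=2000.0):
    """Far-field steering vectors of a uniform linear array, shape (n_mics, n_angles)."""
    theta = np.deg2rad(np.atleast_1d(theta_deg))
    n = np.arange(n_mics)[:, None]
    return np.exp(-2j * np.pi * f * d * n * np.cos(theta)[None, :] / C_SOUND)

def scan_and_sum_pattern(theta_deg, n_mics, theta_scan, theta_b=75.0, theta_e=105.0):
    """Magnitude of the summed delay-and-sum sub-beam patterns, peak-normalised."""
    look = np.arange(theta_b, theta_e + 1e-9, theta_scan)   # sub-beam look directions
    W = steering(look, n_mics).sum(axis=1) / n_mics         # summed DS weights
    P = np.abs(W.conj() @ steering(theta_deg, n_mics))
    return P / P.max()

def pattern_mse(n_mics, theta_scan, theta_b=75.0, theta_e=105.0):
    """Mean squared error between the normalised pattern and a 0/1 ideal pattern."""
    theta = np.linspace(0.0, 180.0, 721)
    D = ((theta >= theta_b) & (theta <= theta_e)).astype(float)
    P = scan_and_sum_pattern(theta, n_mics, theta_scan, theta_b, theta_e)
    return float(np.mean((P - D) ** 2))

mse_20 = pattern_mse(n_mics=20, theta_scan=0.92)
```

Evaluating `pattern_mse` over a grid of `n_mics` and `theta_scan` values gives a surface of the same kind as the MSE surfaces discussed in the following paragraphs.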

FIG. 4 is a diagram showing the relationship between the microphone array setup used for the evaluation and the separation-target surface sound source. Azimuths of about 75 to 105 degrees correspond to the separation-target surface sound source (g31), about 45 to 75 degrees to the first interference wave (g32), and about 105 to 135 degrees to the second interference wave (g33). An interference wave is, for example, noise emitted from around the surface sound source to be extracted. The evaluation conditions were a spacing of 2 cm between the microphones 21, f = 2 kHz, 20 microphones, and a scan angle θ_scan of 0.92 degrees. The sound source used for the evaluation is a surface sound source in the far field, distant from the microphones.

Also, let F be the set of frequency slices in the FFT (fast Fourier transform), and let |F| be the cardinality of the set F. Placing emphasis on the error in the region where the sound source exists, Equation (10) is rewritten as the following Equation (11).

In Equation (9), as shown in FIG. 5, Θ_itf is the angular range in which the first and second interference waves of FIG. 4 exist, and Θ_tar is the angular range in which the target sound source of FIG. 4 exists. FIG. 5 is a setting example for extracting the separation-target surface sound source and suppressing the surrounding noise. FIG. 6 is a diagram showing an example of an adaptive MVDR beamformer. FIG. 7 is the MSE surface obtained with the adaptive MVDR beamformer of FIG. 6. In FIGS. 5 and 6, the horizontal axis is the angle (degrees) and the vertical axis is the gain (dB). In FIG. 7, the axis along the width of the page is the number of microphones N, the depth axis is the scan angle θ_scan (degrees), and the vertical axis is the logarithmic MSE.

As shown in FIG. 6, in the pattern of the adaptive MVDR beamformer, the gain difference between the separation-target surface sound source and the interference waves is about 10 dB. As shown in FIG. 7, the MSE was about -20 even with N = 100 microphones and the scan angle brought close to 0 degrees. Even with N = 100, the MSE was about -5 when the scan angle was widened.

FIG. 8 is a diagram showing a DS (delay-and-sum) beamformer. FIG. 9 is the MSE surface obtained with the DS beamformer of FIG. 8. In FIG. 8, the horizontal axis is the angle (degrees) and the vertical axis is the gain (dB). In FIG. 9, the axis along the width of the page is the number of microphones N, the depth axis is the scan angle θ_scan (degrees), and the vertical axis is the logarithmic MSE.

As shown in FIG. 8, in the pattern of the DS beamformer, the gain difference between the separation-target surface sound source and the interference waves is only about 6 dB even at an angle 30 degrees away. As shown in FIG. 9, the MSE was about -10 even with N = 100 microphones and the scan angle brought close to 0 degrees. Even with N = 100, the MSE was about -7 when the scan angle was widened.

<Cost used for evaluation>
The number of microphones (N) can be regarded as a physical cost, because a larger number of microphones requires a larger device to house them and more hardware.
In Equation (5), the number of sub-beamformers (N_S) is inversely proportional to the scan angle, so the scan angle (θ_scan) can be regarded as a computational cost. In addition, each P_θi must be computed.
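The inverse relationship between the scan angle and the number of sub-beamformers can be checked with a short count. The sketch assumes look directions θ_i = θ_b + (i - 1)θ_scan inside the target sector [θ_b, θ_e], as in the claims; the helper name `num_subbeams` and the small tolerance against floating-point rounding are illustrative choices.

```python
import math

def num_subbeams(theta_b, theta_e, theta_scan):
    """Count look directions theta_b + (i - 1) * theta_scan that lie in [theta_b, theta_e]."""
    # The 1e-9 tolerance guards against rounding when the span is an exact multiple.
    return math.floor((theta_e - theta_b) / theta_scan + 1e-9) + 1

# Target sector of 75 to 105 degrees, as in the evaluation setup.
counts = {scan: num_subbeams(75.0, 105.0, scan) for scan in (5.0, 1.0, 0.5)}
```

Halving the scan angle roughly doubles the number of sub-beam patterns that have to be computed and summed, which is why θ_scan acts as the computational-cost variable.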

In general, performance improves as cost increases. Although not completely monotonic, the MSE improves as the number of microphones N increases or as the scan angle θ_scan decreases. Minimizing the MSE alone generally incurs unrealistic and unacceptable costs, so the cost function J of the following Equation (12) is introduced to measure performance and cost on a common scale.

In Equation (12), α is a predetermined value and λ_i (i = 1, 2) are adjustment parameters. In the following evaluation, λ_1 = 0.0159, λ_2 = 0.000159, α = 0.2, and MSE_max = 50. Equation (12) handles both the slice of the MSE surface that contains the expected performance and the region of N and θ_scan. Because the values of the MSE and of the costs vary over a wide range, λ_i (i = 1, 2) are needed to act as weights.

The third and fourth terms of Equation (10) are normalized costs. Here, λ_2 is chosen to normalize by the practical maximum number of microphones (for example, 1/100 = 0.01). Similarly, λ_3 is chosen as the minimum scan angle to be considered (for example, 0.01°).
The weighting between the MSE slices (the first two terms) and the cost region (the last two terms) can be adjusted by changing the values of λ_i (i = 1, 2, 3).
The trade-off between performance and cost can be expressed as an optimization problem, as in the following Equation (13).

Note that arg min J(x) is the set of x that minimizes J(x).
According to this embodiment, introducing the cost function J makes it possible to select the optimum number of microphones for extracting a surface sound source and the optimum interval at which the desired region is divided.

<Optimization and parameter adjustment>
Guidelines for implementing and optimizing the scan-and-sum beamformer of this embodiment are as follows.

I. The type of microphone array is determined according to the spatial range that needs to be covered. For example, to cover 360 degrees of azimuth, a circular microphone array is selected. The ideal pattern is determined by Equation (4) according to the localization information given as a precondition.

II. The user has the flexibility to choose among various types of sub-beamformer, but should choose carefully according to the application scenario. For example, in a strongly noisy environment, a DS sub-beamformer may be suitable.

III. After the scan-and-sum beam pattern is synthesized, the MSE surface can be computed on a set of grid points, for example a grid covering N ∈ [6, 100] and θ_scan ∈ [0.01, 10] on a logarithmic scale.

IV. The MSE is expanded into the cost function J so that the minimum point in the space of MSE, number of microphones N, and scan angle θ_scan can be determined by a grid search algorithm.

V. To confirm the result, the scan-and-sum beamformer with the settings determined by the cost function should preferably be compared with, for example, the Wiener filter and the ideal pattern, and also evaluated numerically.
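Guidelines III and IV can be sketched as an exhaustive grid search. The grid ranges follow guideline III; everything else is an assumption for illustration. In particular, `toy_cost` is a made-up stand-in with the same qualitative trade-off (an error term plus penalties on N and on small θ_scan) and is not Equation (12), and `grid_search` is a hypothetical helper name.

```python
import numpy as np

def grid_search(cost, n_values, scan_values):
    """Evaluate cost(N, theta_scan) on the full grid and return the minimiser and the surface."""
    J = np.array([[cost(n, s) for s in scan_values] for n in n_values])
    i, j = np.unravel_index(np.argmin(J), J.shape)
    return int(n_values[i]), float(scan_values[j]), J

# Grid from guideline III: N in [6, 100], theta_scan in [0.01, 10] degrees (log-spaced).
n_values = np.arange(6, 101)
scan_values = np.logspace(-2, 1, 31)

def toy_cost(n, theta_scan, lam1=0.02, lam2=0.002):
    """Stand-in cost, NOT Equation (12): error falls with N and rises with theta_scan."""
    error = 1.0 / n + 0.1 * theta_scan
    return error + lam1 * n + lam2 / theta_scan

best_n, best_scan, J = grid_search(toy_cost, n_values, scan_values)
```

In practice `toy_cost` would be replaced by a function that synthesizes the beam pattern for each (N, θ_scan) pair, computes its MSE, and adds the cost terms; the search itself stays the same.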

Although solving the optimization problem analytically is difficult, it is desirable to examine the optimal theoretical solution first and then find a practical solution that approximates it. The Wiener filter, known as the optimal adaptive filter, requires both the spatial correlation information of the mixture and an ideal reference signal. Because users do not normally have a target signal that can serve as the reference, the Wiener filter is not practical. A comparison of the scan-and-sum beamformer with the Wiener filter shows that the scan-and-sum beamformer of this embodiment is an adequate approximation of the optimal solution.

In an adaptive design, the patterns of the adaptive sub-beamformers differ slightly when the position of the surface sound source differs. FIG. 10 shows a surface sound source setting different from that in FIG. 1. FIG. 10 is a diagram showing a setting example of the separation-target surface sound source. In the example shown in FIG. 10, there are a separation-target surface sound source (g31) and two interference waves (g32, g33). A buffer space (g35) occupying 30 degrees of azimuth is provided between the surface sound source (g31) and the first interference wave (g32). A buffer space (g36) occupying 30 degrees of azimuth is provided between the surface sound source (g31) and the second interference wave (g33).

In the following description, all MSE and cost surfaces are computed without buffer space.

(First example)
In the first example, a DS sub-beamformer is selected for P_θi.
The DS sub-beamformer is the most basic spatial filter and stands out for its simple implementation and its robustness to strong noise; however, because it does not adapt to the characteristics of the signal, a DS design is generally not optimal.

FIG. 11 is a diagram showing the MSE surface of the scan-and-sum beamformer with DS sub-beamformers. The axis along the width of the page is the number of microphones M, the depth axis is the scan angle interval Δθ (degrees), and the vertical axis is the logarithmic MSE. FIG. 11 was computed for every third N in [2, 50] and for θ_scan in [0.01, 10] on a logarithmic scale. The MSE surface of FIG. 11 varies monotonically with the number of microphones N and the scan angle interval Δθ, which makes a minimum hard to find. Because it is therefore difficult to choose the parameter set of N and Δθ manually for the DS scan-and-sum beamformer, a cost surface is introduced.

FIG. 12 is a diagram showing the cost surface of the scan-and-sum beamformer with DS sub-beamformers. The axis along the width of the page is the number of microphones N, the depth axis is the scan angle θ_scan (degrees), and the vertical axis is the cost function J. On the cost surface of FIG. 12, the grid search over N and θ_scan found the minimum value of J at N = 36 and θ_scan = 0.64 degrees.

FIG. 13 is a diagram showing example patterns of Wiener beamforming and of the scan-and-sum beamformer. The horizontal axis is the direction of arrival θ (degrees), and the vertical axis is the gain 20 log|P_θDOA| (dB). The pattern g41 is that of the scan-and-sum beamformer using delay-and-sum (DS). The pattern g42 is the ideal pattern for the surface sound source. The pattern g43 is that of a beamformer using the Wiener filter, shown for comparison.
The MSE at this setting is 40.8, which FIG. 11 confirms to be a small value; however, the high sidelobe level of the beam pattern results in poor separation.

(Second example)
In the second example, an MVDR sub-beamformer is selected for P_θi.
The MVDR sub-beamformer is an adaptive filter with high directivity and low distortion in both amplitude and phase. It uses the localization information and the spatial correlation of the mixed signal, which can be estimated from the instantaneous values of the mixture. The MVDR design is therefore practical in common scenarios.
FIG. 14 is a diagram showing the MSE surface of the scan-and-sum beamformer with MVDR sub-beamformers. The axes are the same as in FIG. 11. FIG. 15 is a diagram showing the cost surface of the scan-and-sum beamformer with MVDR sub-beamformers. The axes are the same as in FIG. 12.
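For reference, the standard MVDR weight vector for a look direction with steering vector a and spatial covariance R is w = R^-1 a / (a^H R^-1 a). The sketch below applies that textbook formula to a synthetic covariance. The uniform linear array, the interferer at 60 degrees, the power levels, and the diagonal loading are assumptions for illustration; the patent text states only that the spatial correlation is estimated from the mixture.

```python
import numpy as np

def steering(theta_deg, n_mics, d=0.02, f=2000.0, c=343.0):
    """Far-field steering vector of a uniform linear array (assumed geometry)."""
    theta = np.deg2rad(theta_deg)
    return np.exp(-2j * np.pi * f * d * np.arange(n_mics) * np.cos(theta) / c)

def mvdr_weights(R, a, loading=1e-3):
    """w = R^-1 a / (a^H R^-1 a), with diagonal loading for numerical stability."""
    n = R.shape[0]
    R_loaded = R + loading * np.trace(R).real / n * np.eye(n)
    Ri_a = np.linalg.solve(R_loaded, a)
    return Ri_a / (a.conj() @ Ri_a)

n_mics = 20
a_look = steering(90.0, n_mics)   # sub-beam look direction
a_itf = steering(60.0, n_mics)    # synthetic interferer direction

# Synthetic covariance: one strong interferer plus spatially white noise.
R = 10.0 * np.outer(a_itf, a_itf.conj()) + 0.1 * np.eye(n_mics)

w = mvdr_weights(R, a_look)
gain_look = abs(w.conj() @ a_look)   # distortionless constraint: unit gain
gain_itf = abs(w.conj() @ a_itf)     # strongly attenuated interferer
```

The unit gain in the look direction holds by construction for any covariance, while the depth of the null toward the interferer depends on how well R is estimated.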

The parameter setting recommended by the cost function J is N = 18 and θ_scan = 1.0°. FIG. 16 is a diagram showing the pattern of the MVDR scan-and-sum beamformer synthesized with the recommended setting. The axes in FIG. 16 are the same as in FIG. 13. In FIG. 16, g52 is the ideal pattern for the surface sound source, and g53 is the pattern of the MVDR scan-and-sum beamformer. As FIGS. 13 and 16 show, the pattern of the MVDR scan-and-sum beamformer is close to that of the Wiener filter. The MVDR scan-and-sum beamformer of this embodiment is thus easy to implement and can realize a pattern close to that of the Wiener filter.

In FIG. 11, the MSE surface of the DS scan-and-sum beamformer is not monotonic when the number of microphones is small, but the MSE improves as the number of microphones increases. Basically, the directivity of the sub-beamformer (P_θi) improves as the number of microphones N increases. If the phases of the outputs of the DS sub-beamformers are insufficiently synchronized, performance along the scan axis can become unstable.
The DS scan-and-sum example suggests that the desirable properties include the ability to achieve high sub-beamformer directivity with a small number of microphones and the ability to prevent phase shifts. These properties point to the MVDR filter.

Although the MVDR filter has several advantages, the design of the MVDR sub-beamformer assumes spatially independent point sources. There is basically no restriction on whether a surface sound source is generated from a uniform signal; in the simulation-based evaluation, however, the surface sound source is synthesized from point sources using a mixing matrix in order to simplify the simulation. In this simplified simulation, the MVDR filter is shown to be theoretically effective in comparison with the Wiener filter. However, it is difficult to keep the point sources independent, and this degrades the performance of each MVDR sub-beamformer.

As FIG. 14 shows, the MSE surface of the MVDR scan-and-sum beamformer has a valley with a local minimum along the scan-interval direction, and the scan interval at this minimum tends to become smaller as the number of microphones N increases. This local minimum is a distinctive feature of the MVDR scan-and-sum MSE surface compared with FIG. 11. It indicates that when the physical cost increases, the computational cost must also increase, or performance degrades. This is due to the high directivity of the MVDR.

In both FIG. 12 and FIG. 15, the minimum is easier to detect than in FIGS. 11 and 14. The cost surface is therefore better suited than the MSE surface for determining N and θ_scan.

FIG. 17 is a diagram showing an example of the beam pattern when a low-density scan is selected. The axes are the same as in FIG. 13. In FIG. 17, g61 indicates the MVDR pattern, g62 indicates the scan-and-sum pattern, and g63 indicates the ideal pattern.
The example shown in FIG. 17 was computed with N = 100 microphones and θ_scan = 5°. When the number of microphones is increased, the directivity becomes very high and the main lobe of the beam pattern narrows; the scan angle then becomes too coarse, to the point that gaps appear where adjacent sub-beamformers join. The result is a surface pattern in which information from the target region is lost.
For this reason, in this embodiment the sound source separation unit 34 (FIG. 1) performs the evaluation using the cost function J and thereby selects an appropriate number of microphones and scan angle.

<Evaluation results>
Next, the evaluation results will be described.
In the following evaluation, numerical simulations were performed to evaluate the performance of the scan-and-sum beamformer of this embodiment, using MVDR sub-beamformers for P_θi. In accordance with the design guidelines, the microphone spacing d was set to 2 cm, the number of microphones N to 20, and θ_scan to 0.5°. In this case, the cost is slightly higher than at the parameter setting recommended by the cost function J.

As shown in FIG. 10, the surface sound sources used for the evaluation were set adjacent to one another with small buffer spaces. The separation-target surface sound source is a male voice with a start azimuth θ_b of 75° and an end azimuth θ_e of 105°. The first interference wave is piano music and the second interference wave is white noise, each distributed over a 30° range with a density of Δθ = 0.1°.

In the evaluation, the surface sound sources were synthesized according to Equation (2) using the mixing matrix A. Spatially independent white noise was added to the mixture, so the simulation was run at an SNR of 20 dB. To cover general wideband signals, the sampling rate Fs was set to 44.1 kHz. The duration of the audio files is about 6 seconds.
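The noise condition described above can be reproduced in outline as follows. This is a minimal sketch assuming a multichannel array `x` of shape (channels, samples) and independent Gaussian noise per channel; the test tone, the four channels, and the helper name `add_white_noise` are illustrative and do not come from the patent.

```python
import numpy as np

def add_white_noise(x, snr_db, rng):
    """Add independent white Gaussian noise to every channel of x at the given SNR (dB)."""
    signal_power = np.mean(x ** 2)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=x.shape)
    return x + noise, noise

rng = np.random.default_rng(0)
fs = 44100                      # sampling rate from the text (44.1 kHz)
t = np.arange(fs) / fs          # one second of samples
# Stand-in mixture: a 440 Hz tone with a small phase offset per channel.
x = np.stack([np.sin(2 * np.pi * 440.0 * t + 0.1 * m) for m in range(4)])

y, noise = add_white_noise(x, 20.0, rng)
measured_snr = 10.0 * np.log10(np.mean(x ** 2) / np.mean(noise ** 2))
```

Because the noise is drawn independently for every channel, it is spatially uncorrelated across the array, which matches the "spatially independent" condition in the text.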

In the evaluation, version 2.1 of the Matlab (registered trademark) toolbox BSS EVAL was used to analyze the separation of the signals in terms of the signal-to-interference ratio (SIR; Signal-to-Interference Ratio), the signal-to-artifact ratio (SAR; Sources to Artifacts Ratio), and the signal-to-distortion ratio (SDR; Signal to Distortion Ratio).

The SIR evaluates how well the result is separated.
The SAR evaluates the sound quality of the algorithm, because a good algorithm generates almost no artifacts.
The SDR evaluates the distortion associated with interference, artifacts, and noise.
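The three measures are energy ratios between components of the estimated signal. The sketch below assumes that the estimate has already been decomposed into a target component, an interference component, and an artifact component (the toolbox obtains this decomposition by projection, which is not reproduced here, and a separate noise term is omitted). The function names and the sinusoidal test components are illustrative.

```python
import numpy as np

def energy_ratio_db(num, den):
    """10 * log10 of the energy ratio between two signals."""
    return 10.0 * np.log10(np.sum(num ** 2) / np.sum(den ** 2))

def bss_ratios(s_target, e_interf, e_artif):
    """SDR, SIR, and SAR in dB from an already decomposed estimate."""
    sdr = energy_ratio_db(s_target, e_interf + e_artif)
    sir = energy_ratio_db(s_target, e_interf)
    sar = energy_ratio_db(s_target + e_interf, e_artif)
    return sdr, sir, sar

# Mutually orthogonal test components over an integer number of periods.
t = np.arange(1000) / 1000.0
s_target = np.sin(2 * np.pi * 5 * t)
e_interf = 0.1 * np.sin(2 * np.pi * 7 * t)    # 20 dB below the target
e_artif = 0.01 * np.sin(2 * np.pi * 11 * t)   # 40 dB below the target

sdr, sir, sar = bss_ratios(s_target, e_interf, e_artif)
```

Since the SDR denominator contains both the interference and the artifacts, it can never exceed the SIR for orthogonal components, which is why a low SAR pulls the SDR down even when the SIR is high.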

FIG. 18 is a diagram showing the evaluation results for three different buffer spaces.
For a practical spatial filter with a transition band (the band between the passband and the stopband), separation with a buffer space of 0° is the most difficult; even so, the SIR improved by about 18 dB (= 4.7 + 13.0) compared with the mixed state.
As shown in FIG. 18, increasing the angle of the buffer space improved the SIR further. In the MVDR sub-beamformer, artifact noise arises from uncertain estimation of the spatial features. As a result, the SAR decreased as the buffer space became narrower, and the SDR improved by about 9 dB.

FIG. 19 is a diagram showing an example of the separated surface sound source according to this embodiment. In FIG. 19, the axis along the width of the page is the azimuth angle (degrees), the depth axis is the frame number, and the vertical axis is the spectrum (logarithmic representation). The arrangement of the sound sources and microphones used for the evaluation is the same as in FIG. 10. As shown in FIG. 10, the surface sound source is located at azimuth angles between 75 and 105 degrees.
As a result of performing sound source separation using the scan-and-sum beamformer of this embodiment, the surface sound source was separated appropriately, as shown in FIG. 19.

The advantages of surface sound source separation according to this embodiment are that various types of sub-beamformer can be selected according to the types of sound source and microphone array, and that various ideal patterns can be defined for sound sources of various sizes and shapes. The scan-and-sum beamformer is thus very flexible across different scenarios.
Furthermore, in this embodiment, the quality of the selected beam pattern can be evaluated accurately by using the MSE together with the cost function J, which balances performance and cost. In addition, according to this embodiment, the optimum number of microphones and the optimum scan angle can be obtained from the evaluated MSE.

As the first and second examples above show, the guidelines of this embodiment for implementing the scan-and-sum beamformer with tuned parameters and an optimized pattern are effective.
The numerical simulations also show that this embodiment improved the SIR for a mixture of three surface sound sources under difficult conditions.

As described above, in this embodiment a sub-beamformer is used for each subdivided region, and the acoustic signal of each subdivided region is extracted from the picked-up acoustic signals by a beamforming method. The desired surface sound source is then separated by adding the acoustic signals extracted for the subdivided regions.
As a result, according to this embodiment, sound sources in the region to be suppressed can be suppressed appropriately, and the surface sound source can be separated appropriately.
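Because beamforming is a linear operation, adding the outputs extracted for each subdivided region gives the same result as applying a single filter whose coefficients are the sum of the sub-beam coefficients. The sketch below checks this numerically for one frequency bin; the random weights and frames are placeholders, and the variable names are illustrative rather than the patent's notation.

```python
import numpy as np

rng = np.random.default_rng(1)
n_mics, n_sub, n_frames = 8, 5, 16

# Placeholder sub-beam weights (one row per subdivided region) and
# placeholder multichannel STFT frames for a single frequency bin.
W_sub = rng.normal(size=(n_sub, n_mics)) + 1j * rng.normal(size=(n_sub, n_mics))
X = rng.normal(size=(n_mics, n_frames)) + 1j * rng.normal(size=(n_mics, n_frames))

# (a) Extract each subdivided region separately, then add the outputs.
y_sum_of_outputs = sum(W_sub[i].conj() @ X for i in range(n_sub))

# (b) Add the sub-beam weights first, then filter once.
W = W_sub.sum(axis=0)
y_single_filter = W.conj() @ X
```

Route (b) needs only one filtering pass per frequency bin at run time, so the number of sub-beams affects the design-time cost of computing the weights rather than the per-frame cost.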

The sound source separation system 1 described above can be applied to various devices such as robots, reception systems, in-vehicle speech recognition systems, smart speakers using speech recognition, and home appliances using speech recognition.
According to this embodiment, sound sources in the region to be suppressed can be suppressed appropriately and the surface sound source can be separated appropriately, so the performance of a robot can be improved in applications such as human-robot interaction.

A program for realizing all or part of the functions of the sound source separation device 3 of the present invention may be recorded on a computer-readable recording medium, and all or part of the processing performed by the sound source separation device 3 may be carried out by having a computer system read and execute the program recorded on this medium. The "computer system" referred to here includes an OS and hardware such as peripheral devices. The "computer system" also includes a WWW system provided with a homepage providing environment (or display environment). The "computer-readable recording medium" refers to portable media such as flexible disks, magneto-optical disks, ROMs, and CD-ROMs, and to storage devices such as hard disks built into computer systems. The "computer-readable recording medium" further includes media that hold the program for a certain period of time, such as the volatile memory (RAM) inside a computer system that serves as a server or client when the program is transmitted via a network such as the Internet or a communication line such as a telephone line.

The program may also be transmitted from a computer system that stores it in a storage device or the like to another computer system via a transmission medium, or by transmission waves in a transmission medium. Here, the "transmission medium" that transmits the program refers to a medium having the function of transmitting information, such as a network (communication network) like the Internet or a communication line like a telephone line. The program may realize only part of the functions described above. It may also be a so-called difference file (difference program), which realizes the functions described above in combination with a program already recorded in the computer system.

Although modes for carrying out the present invention have been described above using embodiments, the present invention is in no way limited to these embodiments, and various modifications and substitutions can be made without departing from the gist of the present invention.

Reference Signs List: 1 ... sound source separation system, 2 ... sound pickup unit, 3 ... sound source separation device, 21, 21-1, ..., 21-N ... microphone, 31 ... acquisition unit, 32 ... transfer function storage unit, 33 ... beam pattern storage unit, 34 ... sound source separation unit, 35 ... output unit, 341 ... separation unit, 342 ... evaluation unit, 343 ... selection unit

Claims

a microphone array having N microphones (N is an integer equal to or greater than 2) arranged at a first interval, which picks up acoustic signals; and
a sound source separation unit that subdivides a desired region at a second interval, separates and extracts, for each of the subdivided regions, the acoustic signals picked up by the microphone array by a beamforming method using a sub-beam corresponding to the azimuth angle θ of the subdivided region, and adds the extracted acoustic signals to separate the acoustic signal of the desired region;
A sound source separation device comprising the above.

The pattern of the sub-beams is represented by the following equation,

When D(θ) is given by the following formula,

can be defined as θ _i =θ _b +(i−1) θ _scan , where b is the value of i that satisfies θ _i =θ _b , and e is the value of i that satisfies θ _i =θ _e ;
The sound source separation device according to claim 1.

The sound source separation device according to claim 1 or 2, further comprising:
an evaluation unit that calculates the relationship among a cost function J, the number of microphones, and the second interval; and
a selection unit that selects, from the relationship calculated by the evaluation unit, the number of microphones and the second interval that minimize the cost function.

The sound source separation device according to claim 3, wherein
the evaluation unit calculates the difference between the beam pattern and the ideal pattern using the logarithmic mean squared error (MSE) given by the following equation,

where α is a predetermined value and λ₁ and λ₂ are tuning parameters.

The sound source separation device according to claim 4, wherein
the evaluation unit represents the cost function J, the number N of microphones, and the scan angle θ_scan in a three-dimensional graph, and selects the number N of microphones and the scan angle θ_scan that minimize the cost function J as the optimum number N of microphones and the optimum scan angle θ_scan, and
the sound source separation unit updates to the optimum number N of microphones and the optimum scan angle θ_scan selected by the evaluation unit.
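
The selection of N and θ_scan that minimize J can be illustrated as an exhaustive search over a candidate grid. The sketch below is only one possible reading: it takes the cost as a caller-supplied function, and `toy_cost` is a hypothetical stand-in, not the cost function J of the claims, whose equation is not reproduced here. The candidate values are likewise assumptions.

```python
import itertools

def select_parameters(cost_fn, mic_counts, scan_angles_deg):
    """Return (N, theta_scan, J) with the smallest cost over the candidate grid."""
    best = None
    for n_mics, theta_scan in itertools.product(mic_counts, scan_angles_deg):
        cost = cost_fn(n_mics, theta_scan)
        if best is None or cost < best[2]:
            best = (n_mics, theta_scan, cost)
    return best

def toy_cost(n_mics, theta_scan_deg):
    """Hypothetical placeholder cost with a single minimum; NOT the claimed J."""
    return (n_mics - 8) ** 2 + (theta_scan_deg - 5.0) ** 2

best_n, best_scan, best_cost = select_parameters(
    toy_cost, range(2, 17), [1.0, 2.5, 5.0, 10.0]
)
```

The (N, θ_scan, J) triples visited by the loop are the same points that a three-dimensional graph of J would plot; the search simply reports the lowest one.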

A sound source separation method comprising:
picking up an acoustic signal with a microphone array having N microphones (N is an integer equal to or greater than 2) arranged at a first interval;
subdividing, by a sound source separation unit, a desired region at a second interval;
separating and extracting, by the sound source separation unit, an acoustic signal for each subdivided region by a beamforming method using a sub-beam corresponding to the azimuth angle θ of that subdivided region; and
separating, by the sound source separation unit, the acoustic signal of the desired region by adding the extracted acoustic signals.

A program that causes a computer to:
pick up acoustic signals with a microphone array having N microphones (N is an integer equal to or greater than 2) arranged at a first interval;
subdivide a desired region at a second interval;
separate and extract, for each subdivided region, an acoustic signal by a beamforming method using a sub-beam corresponding to the azimuth angle θ of that subdivided region; and
separate the acoustic signal of the desired region by adding the extracted acoustic signals.