JP7292646B2

JP7292646B2 - Sound source separation device, sound source separation method, and program

Info

Publication number: JP7292646B2
Application number: JP2019223975A
Authority: JP
Inventors: 一博中臺; 泰宏鍵本; 克寿糸山; 健次西田
Original assignee: Honda Motor Co Ltd; Tokyo Institute of Technology NUC
Current assignee: Honda Motor Co Ltd; Tokyo Institute of Technology NUC
Priority date: 2019-12-11
Filing date: 2019-12-11
Publication date: 2023-06-19
Anticipated expiration: 2039-12-11
Also published as: JP2021092695A

Description

The present invention relates to a sound source separation device, a sound source separation method, and a program.

Techniques for extracting a specific sound source from a plurality of sound sources have been developed. One sound source separation approach that uses position information is beamforming. Beamforming uses the arrival time differences and phase differences of signals to separate sound sources on the basis of direction information (see, for example, Patent Document 1).

Patent Document 1: JP 2010-152107 A

With the conventional technology, however, it is difficult to extract a desired sound source when a plurality of sound sources exist in the same direction.

The present invention has been made in view of the above problem, and an object thereof is to provide a sound source separation device, a sound source separation method, and a program capable of extracting a desired sound source even when a plurality of sound sources exist in the same direction.

(1) To achieve the above object, a sound source separation device according to one aspect of the present invention includes: a plurality of microphone arrays that pick up acoustic signals; and an extraction unit that, when each of the picked-up acoustic signals picked up by at least two of the microphone arrays includes a first acoustic signal of a sound source of interest and a second acoustic signal of another sound source in the same direction as the sound source of interest, extracts a common component included in the picked-up acoustic signals picked up by the at least two microphone arrays and thereby extracts the first acoustic signal from the picked-up acoustic signals.

(2) In the sound source separation device according to one aspect of the present invention, the extraction unit may extract the common component using latent Dirichlet allocation.

(3) The sound source separation device according to one aspect of the present invention may further include a classification unit that classifies the topics of the sounds included in the picked-up acoustic signals. The extraction unit may compare the topics classified by the classification unit for each microphone array and, when the comparison shows that the same topic appears in the picked-up acoustic signals of the plurality of microphone arrays, estimate that the same topic is the sound source of interest and extract, from the picked-up acoustic signals, the acoustic signal corresponding to that topic as the first acoustic signal.

(4) In the sound source separation device according to one aspect of the present invention, the classification unit may convert the picked-up acoustic signal of each microphone array into a frequency spectrum, segment the frequency spectrum of each microphone array by dividing it into M sections (M is an integer of 2 or more) along the time frames, and classify the frequency spectrum of each time frame included in each segment by topic.

(5) In the sound source separation device according to one aspect of the present invention, the extraction unit may estimate the distribution of the topics for each time interval and the distribution, for each topic, of quantized spectra obtained by quantizing the frequency spectrum; regard as active those whose posterior probabilities of the topic distribution and of the quantized-spectrum distribution each exceed a threshold for determining the active state; compare the topic distributions of the segments at the same time; and extract the common component by extracting the topics that are active in at least two of the microphone arrays.

(6) The sound source separation device according to one aspect of the present invention may further include a control unit that controls the microphone arrays so that each forms a beam in the direction of the sound source of interest, and the plurality of microphone arrays may, under the control of the control unit, pick up the picked-up acoustic signals including the first acoustic signal of the sound source of interest.

(7) The sound source separation device according to one aspect of the present invention may further include: a sound source localization unit that performs sound source localization on the picked-up acoustic signal of each microphone array; and a sound source separation unit that separates, from the picked-up acoustic signal of each microphone array and on the basis of the localization result, a separated signal including the first acoustic signal. The extraction unit may extract a common component included in the separated signals separated from the picked-up acoustic signals of at least two of the microphone arrays and thereby extract the first acoustic signal from the picked-up acoustic signals.

(8) To achieve the above object, in a sound source separation method according to one aspect of the present invention, a plurality of microphone arrays pick up acoustic signals, and, when each of the picked-up acoustic signals picked up by at least two of the microphone arrays includes a first acoustic signal of a sound source of interest and a second acoustic signal of another sound source in the same direction as the sound source of interest, an extraction unit extracts a common component included in the picked-up acoustic signals picked up by the at least two microphone arrays and thereby extracts the first acoustic signal from the picked-up acoustic signals.

(9) To achieve the above object, a program according to one aspect of the present invention causes a computer to pick up acoustic signals with a plurality of microphone arrays and, when each of the picked-up acoustic signals picked up by at least two of the microphone arrays includes a first acoustic signal of a sound source of interest and a second acoustic signal of another sound source in the same direction as the sound source of interest, to extract a common component included in the picked-up acoustic signals picked up by the at least two microphone arrays and thereby extract the first acoustic signal from the picked-up acoustic signals.

According to (1) to (9) above, the common component included in the picked-up acoustic signals is extracted, so a desired sound source can be extracted even when a plurality of sound sources exist in the same direction.
According to (2) above, the common component is extracted by latent Dirichlet allocation, so the desired sound source can be extracted with high accuracy.
According to (3) above, the picked-up acoustic signals are classified into sound topics and the matching topics are extracted as the common component, so the desired sound source can be extracted with high accuracy.
According to (4) above, the picked-up acoustic signals are divided into segments, each segment is classified into sound topics, and the matching topics are extracted as the common component, so the desired sound source can be extracted with high accuracy.
According to (5) above, the topic distributions of the segments at the same time are compared, and the topics that are active in at least two microphone arrays are extracted to obtain the common component, so the desired sound source can be extracted with high accuracy.
According to (6) above, the common component included in the picked-up acoustic signals separated by beamforming is extracted, so a desired sound source can be extracted even when a plurality of sound sources exist in the same direction.
According to (7) above, separated signals are separated from the picked-up acoustic signals by sound source localization and sound source separation, and the common component included in the separated signals is extracted, so a desired sound source can be extracted even when a plurality of sound sources exist in the same direction.

FIG. 1 is a diagram showing an example of the positions of the sound sources to be separated and an example of the arrangement of the microphone arrays according to the embodiment.
FIG. 2 is a block diagram showing a configuration example of the sound source separation device according to the first embodiment.
FIG. 3 is a flowchart showing the processing procedure performed by the sound source separation device according to the first embodiment.
FIG. 4 is a diagram for explaining the conversion of a frequency spectrum into quantized spectra.
FIG. 5 is a flowchart showing an example of the k-means processing procedure.
FIG. 6 is a flowchart showing an example of the LDA generative process for a collection of quantized spectra.
FIG. 7 is a diagram showing the graphical model of LDA.
FIG. 8 shows an example of a variational Bayesian inference algorithm for the topic model according to the embodiment.
FIG. 9 is a diagram showing an example of spectrum estimation for the sound source of interest.
FIG. 10 is a diagram showing an example of the extracted sound when the number of clusters K = 600, the segment time interval d = 4 seconds, and the number of topics L = 5.
FIG. 11 is a diagram showing the change in separation performance with the number of topics L when the number of clusters K = 600 and the time interval d = 4 seconds.
FIG. 12 is a diagram showing the change in separation performance for the numbers of clusters K = 100, 300, and 600 and for different segment lengths.
FIG. 13 is a diagram showing evaluation results comparing the separation performance with and without removal of silent components and unique components when the number of clusters K = 600, the time interval d = 4 seconds, and the number of topics L = 5.
FIG. 14 is a block diagram showing a configuration example of the sound source separation device according to the second embodiment.
FIG. 15 is a diagram for explaining silent intervals and speech intervals.

Hereinafter, embodiments of the present invention will be described with reference to the drawings.
First, an outline of the embodiments will be described. FIG. 1 is a diagram showing an example of the positions of the sound sources to be separated and an example of the arrangement of the microphone arrays according to the embodiment.
In the example shown in FIG. 1, among the sound sources S_0 to S_3 of four speakers, the sound source S_0 is the sound source of interest. Reference signs MA_1 to MA_3 denote microphone arrays. The separated sound obtained by separating the acoustic signal picked up by the microphone array MA_1 includes the sound sources S_0 and S_1. The separated sound obtained by separating the acoustic signal picked up by the microphone array MA_2 includes the sound sources S_0 and S_2. The separated sound obtained by separating the acoustic signal picked up by the microphone array MA_3 includes the sound sources S_0 and S_3.

As shown in FIG. 1, the sound source of interest S_0 may be included in common in the separated sounds picked up and separated by a plurality of microphone arrays. In each of the embodiments described below, therefore, a desired sound source is separated by extracting the common component that is included in common in the separated sounds picked up and separated by the plurality of microphone arrays.

<First Embodiment>
The first embodiment describes an example in which the sound source direction is known and the acoustic signal from that direction is picked up and separated by beamforming.

[Configuration example of the sound source separation device]
First, a configuration example of the sound source separation device 1 of this embodiment will be described.
FIG. 2 is a block diagram showing a configuration example of the sound source separation device 1 according to this embodiment. As shown in FIG. 2, the sound source separation device 1 includes a sound pickup unit 2 and a processing unit 3.
The sound pickup unit 2 includes a first microphone array 2-1, a second microphone array 2-2, and a third microphone array 2-3. Although the configuration shown in FIG. 2 is an example in which the sound pickup unit 2 includes three microphone arrays, the number of microphone arrays only needs to be two or more.
The processing unit 3 includes a beamforming control unit 30, an acquisition unit 31, a conversion unit 34, a classification unit 35, a removal unit 36, an extraction unit 37, an inverse conversion unit 38, and an output unit 39.

[Operation and functions of the sound source separation device]
Next, the operation and functions of each part of the sound source separation device 1 will be described.
Each of the first microphone array 2-1, the second microphone array 2-2, and the third microphone array 2-3 forms a beam in a known sound source direction under the control of the beamforming control unit 30 of the processing unit 3. Each of these microphone arrays has P microphones (P is an integer of 2 or more) and outputs the picked-up acoustic signal to the processing unit 3. The acoustic signal output by each microphone array contains identification information for identifying the microphone array. The acoustic signal picked up by each microphone array corresponds to an acoustic signal picked up by, for example, one directional microphone realized by one beam formed in the known sound source direction by beamforming. It is assumed that, as in FIG. 1, the picked-up acoustic signal of each microphone array may include the first acoustic signal of the sound source of interest and a second acoustic signal of another sound source in the same direction as the sound source of interest.

The beamforming control unit 30 controls each of the first microphone array 2-1, the second microphone array 2-2, and the third microphone array 2-3 so that it forms a beam in the known sound source direction by beamforming.
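The source does not say which beamformer is used. As a minimal sketch, assuming a delay-and-sum beamformer with integer sample delays (the function name `delay_and_sum` and the toy signals are hypothetical), steering a beam toward a known direction amounts to time-aligning the channels and averaging them:

```python
def delay_and_sum(channels, delays):
    """Align each channel by its integer sample delay and average the result.

    channels: list of equal-rate sample lists, one per microphone.
    delays:   per-microphone arrival delay (in samples) for the steered direction.
    Assumption: delay-and-sum is only one possible beamformer; the source
    only states that a beamforming method is used.
    """
    length = min(len(ch) - d for ch, d in zip(channels, delays))
    return [
        sum(ch[n + d] for ch, d in zip(channels, delays)) / len(channels)
        for n in range(length)
    ]


# Toy example: the second microphone receives the same wave one sample later.
source = [0, 1, 0, -1, 0, 1, 0, -1]
mic0 = source
mic1 = [0] + source[:-1]
beam = delay_and_sum([mic0, mic1], delays=[0, 1])
```

With the correct delays the two channels add coherently, so `beam` reproduces the first seven samples of `source`; sounds arriving from other directions would be misaligned and attenuated. A second source lying in the same direction is not attenuated, which is the case the common-component extraction addresses.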

The acquisition unit 31 acquires the acoustic signals (picked-up acoustic signals) output by the first microphone array 2-1, the second microphone array 2-2, and the third microphone array 2-3. The acquisition unit 31 outputs the acquired acoustic signal of each microphone array to the conversion unit 34.

The conversion unit 34 acquires the acoustic signal of each microphone array output by the acquisition unit 31. The conversion unit 34 applies a short-time Fourier transform (STFT) to the acoustic signal of each microphone array to convert it into an amplitude spectrum in the time-frequency domain (hereinafter also referred to as a frequency spectrum). The conversion unit 34 outputs the frequency spectrum of each microphone array to the classification unit 35.
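The conversion into an amplitude spectrum can be sketched as follows, using only the Python standard library. The frame length, hop size, and Hann window are assumptions chosen for illustration, and `stft_magnitude` is a hypothetical name; the source does not specify these parameters.

```python
import cmath
import math


def stft_magnitude(signal, frame_len=8, hop=4):
    """Return the amplitude spectrogram as a list of T frames, each with F bins."""
    # Periodic Hann window (assumption: the source does not name a window).
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_len) for n in range(frame_len)]
    n_bins = frame_len // 2 + 1  # non-negative frequencies only
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        chunk = [signal[start + n] * window[n] for n in range(frame_len)]
        spectrum = []
        for k in range(n_bins):
            # Direct DFT of the windowed frame at frequency bin k.
            acc = sum(
                chunk[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                for n in range(frame_len)
            )
            spectrum.append(abs(acc))  # keep the amplitude, discard the phase
        frames.append(spectrum)
    return frames


# A pure tone that falls exactly on bin 2 of an 8-point frame.
tone = [math.cos(2 * math.pi * 2 * n / 8) for n in range(32)]
spec = stft_magnitude(tone)
```

Each element of `spec` is the amplitude spectrum of one time frame, which is the vector that the later stages treat as a single item to be quantized. A practical implementation would use an FFT rather than this direct DFT.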

The classification unit 35 acquires the frequency spectrum of each microphone array output by the conversion unit 34. The classification unit 35 segments the frequency spectrum of each microphone array by dividing it into M sections (M is an integer of 2 or more) along the time frames. The classification unit 35 regards the amplitude spectrum of each time frame as one vector, treats the frequency spectrum of each time frame included in each segment as a quantized spectrum, and counts the number of quantized spectra. For each microphone array, the classification unit 35 also classifies the quantized spectra included in each segment by a clustering method such as the k-means method. The classification method will be described later. For each microphone array, the classification unit 35 outputs count information indicating the counting result and classification information indicating the classification result to the removal unit 36.
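As a rough illustration of the segmentation and counting, the sketch below assumes that each time frame has already been given a cluster label (for example by k-means) and that the M segments are contiguous and of equal length. The name `segment_counts` and the toy labels are hypothetical.

```python
from collections import Counter


def segment_counts(labels, num_segments):
    """Split per-frame cluster labels into M contiguous segments and count them.

    Returns one Counter per segment, mapping cluster label -> number of frames.
    Assumption: equal-length contiguous segments; the last one may be shorter.
    """
    if num_segments < 2:
        raise ValueError("M must be an integer of 2 or more")
    seg_len = -(-len(labels) // num_segments)  # ceiling division
    return [
        Counter(labels[m * seg_len:(m + 1) * seg_len])
        for m in range(num_segments)
    ]


# Eight time frames with cluster labels, split into M = 2 segments.
labels = [0, 0, 1, 2, 2, 2, 1, 1]
counts = segment_counts(labels, num_segments=2)
```

Each segment then plays the role of a document and each cluster label the role of a word, which is the bag-of-words form that a topic model such as LDA expects.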

The removal unit 36 acquires the count information and the classification information output by the classification unit 35. The removal unit 36 removes noise components from the quantized spectra. Human speech contains many silent portions, so a quantized spectrum that appears in many time intervals is likely to be silence. For this reason, the removal unit 36 removes, for example, classification units that appear in 70% or more of all segments and quantized spectra that appear in fewer than three segments. The removal unit 36 outputs the count information and the classification information after noise removal to the extraction unit 37.
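The removal rule can be sketched as a filter on how many segments each cluster label appears in. This is a minimal sketch using the example thresholds from the text (70% or more of all segments, fewer than three segments); the function name `remove_noise_clusters` and the toy data are hypothetical.

```python
def remove_noise_clusters(counts_per_segment, max_ratio=0.7, min_segments=3):
    """Drop cluster labels that are too common (likely silence) or too rare.

    counts_per_segment: list of dicts mapping cluster label -> frame count.
    Returns the filtered per-segment counts and the set of labels that were kept.
    """
    num_segments = len(counts_per_segment)
    # Number of segments in which each label occurs at least once.
    seg_freq = {}
    for counts in counts_per_segment:
        for label, count in counts.items():
            if count > 0:
                seg_freq[label] = seg_freq.get(label, 0) + 1
    keep = {
        label
        for label, freq in seg_freq.items()
        if freq >= min_segments and freq / num_segments < max_ratio
    }
    filtered = [
        {label: c for label, c in counts.items() if label in keep}
        for counts in counts_per_segment
    ]
    return filtered, keep


# Ten segments: label 0 occurs in 8 of them, label 1 in 5, label 2 in 2.
segments = []
for m in range(10):
    counts = {}
    if m < 8:
        counts[0] = 5
    if m % 2 == 0:
        counts[1] = 3
    if m in (1, 9):
        counts[2] = 1
    segments.append(counts)

filtered, kept = remove_noise_clusters(segments)
```

In this toy input only label 1 survives: label 0 is treated as a silence-like component and label 2 as a component unique to a few segments.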

The extraction unit 37 acquires the count information and the classification information after noise removal output by the removal unit 36. Using the acquired count information and classification information, the extraction unit 37 estimates a topic distribution for each microphone array and each segment, for example by latent Dirichlet allocation (LDA), treating groups of frequency spectra determined by speaker and utterance content as topics. The extraction unit 37 extracts an estimated time-frequency spectrogram of the sound source of interest by performing spectrum extraction based on the temporal identity of topics across the plurality of microphone arrays. Specifically, the extraction unit 37 selects the estimated topic for each time interval and extracts only the frequency spectra that belong to the topic distribution of that topic. The estimation method will be described later. The extraction unit 37 outputs the extracted spectrum to the inverse conversion unit 38.

The inverse conversion unit 38 acquires the spectrum output by the extraction unit 37. The inverse conversion unit 38 restores the estimated signal of the sound source of interest by applying an inverse short-time Fourier transform (ISTFT) to the acquired spectrum. The inverse conversion unit 38 outputs the restored acoustic signal of the sound source of interest to the output unit 39.

The output unit 39 is, for example, a speaker. The output unit 39 reproduces the acoustic signal output by the inverse conversion unit 38.

[Processing of the sound source separation device 1]
Next, an example of the processing procedure performed by the sound source separation device 1 will be described.
FIG. 3 is a flowchart showing the processing procedure performed by the sound source separation device 1 according to this embodiment.

(Step S1) The beamforming control unit 30 controls each microphone array of the sound pickup unit 2 so that it forms a beam in the known sound source direction.

(Step S2) The sound pickup unit 2 picks up acoustic signals with the formed beams. The sound pickup unit 2 thereby picks up the acoustic signal corresponding to the sound source in the sound source direction. The picked-up acoustic signal is a separated sound and, as in FIG. 1, may include the acoustic signals of a plurality of sound sources in the same sound source direction.

(Step S3) The conversion unit 34 applies a short-time Fourier transform to the picked-up acoustic signal of each microphone array to convert it into a frequency spectrum.

(Step S4) The classification unit 35 segments the frequency spectrum of each microphone array by dividing it into M sections along the time frames. The classification unit 35 then counts the number of quantized spectra included in each segment. The classification unit 35 then classifies, for each microphone array, the quantized spectra included in each segment by a clustering method such as the k-means method.

(Step S5) The removal unit 36 removes noise components from the quantized spectra.

(Step S6) Using the acquired count information and classification information, the extraction unit 37 estimates a topic distribution for each microphone array and each segment, for example by latent Dirichlet allocation, treating groups of frequency spectra determined by speaker and utterance content as topics.

(Step S7) The extraction unit 37 extracts the estimated time-frequency spectrogram of the sound source of interest by performing spectrum extraction based on the temporal identity of topics across the plurality of microphone arrays.

(Step S8) The inverse conversion unit 38 restores the estimated signal of the sound source of interest by applying an inverse short-time Fourier transform to the acquired spectrum. The output unit 39 then reproduces the acoustic signal output by the inverse conversion unit 38.

[Method of extracting the sound source of interest using LDA]
Next, a method of extracting the sound source of interest using LDA will be described.
In the embodiment, the sound source of interest is extracted by taking, from the separated sounds in the direction of the sound source of interest obtained by beamforming with a plurality of microphone arrays, only the components common to all of the separated sounds.

In the embodiment, the frequency spectrum of each time frame is treated as one quantized spectrum, and the set of frequency spectra in each time interval is treated as a segment. This makes it possible to classify the frequency spectra into groups, called topics, that are determined by speaker and utterance content.
Assuming that the spectra of different speakers are assigned to different topics, if the topics of the separated sounds differ in a given time interval, the sound source of interest is not present in that interval. Conversely, if the same topic is assigned to all of the separated sounds, that topic corresponds to the sound source of interest.
In the embodiment, the topic of the sound source of interest is estimated in this way from the temporal identity of topics, and the common component is extracted by taking out only the frequency spectra belonging to that topic.
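The comparison of topics across microphone arrays can be sketched as follows. This is an illustrative sketch under stated assumptions: topic indices are assumed to be shared across arrays (for example because one topic model is fitted to the segments of all arrays), only the per-segment topic distribution is thresholded, and the threshold value, the function name `common_active_topics`, and the toy distributions are all hypothetical.

```python
def common_active_topics(theta, threshold=0.3, min_arrays=2):
    """Return, for each segment, the topics active in at least min_arrays arrays.

    theta[i][m][l] is the estimated weight of topic l in segment m of array i.
    A topic counts as active in an array when its weight exceeds the threshold.
    """
    num_segments = len(theta[0])
    num_topics = len(theta[0][0])
    result = []
    for m in range(num_segments):
        common = [
            l
            for l in range(num_topics)
            if sum(1 for array in theta if array[m][l] > threshold) >= min_arrays
        ]
        result.append(common)
    return result


# Three arrays, two segments, four topics.
# Topic 0 stands for S_0; topics 1 to 3 stand for S_1 to S_3.
theta = [
    [[0.6, 0.4, 0.0, 0.0], [0.1, 0.9, 0.0, 0.0]],  # array MA_1: S_0 and S_1
    [[0.5, 0.0, 0.5, 0.0], [0.0, 0.0, 1.0, 0.0]],  # array MA_2: S_0 and S_2
    [[0.7, 0.0, 0.0, 0.3], [0.1, 0.0, 0.0, 0.9]],  # array MA_3: S_0 and S_3
]
active = common_active_topics(theta)
```

In the first segment only topic 0 is active in more than one array, so it is taken as the sound source of interest; in the second segment no topic is shared, which corresponds to an interval in which the sound source of interest is absent.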

(Preprocessing)
In the embodiment, preprocessing that converts the sound into quantized spectra is performed so that LDA can be applied to acoustic signals.
In the embodiment, the amplitude spectrum of each time frame is regarded as one quantized spectrum vector, and quantized spectrum vectors with similar components are divided into several groups using a clustering method such as the k-means method.

First, the conversion of a frequency spectrum into quantized spectra by the k-means method will be described.
Applying the short-time Fourier transform to the acoustic signal X_i(t) yields the amplitude spectrum Y_i(ω, t) ∈ R^(F×T) in the time-frequency domain (R is the set of all positive real numbers). Here, F is the number of frequency bins and T is the number of time frames. As shown in FIG. 4, the amplitude spectrum y_i(t) of each time frame is regarded as one vector and converted into a quantized spectrum. In the embodiment, y_i(t) is further classified into K classes k ∈ {1, ..., K} by the k-means method. FIG. 4 is a diagram for explaining the conversion of a frequency spectrum into quantized spectra. In FIG. 4, the horizontal axis is the time frame and the vertical axis is the frequency.

Here, an example of the k-means processing procedure will be described.
FIG. 5 is a flowchart showing an example of the k-means processing procedure. Here, i is the microphone array number and K is the number of clusters of quantized spectra. In the embodiment, a class k is assigned for each microphone array i and each time frame t on the basis of the similarity of the frequency vector components.

(Step S11) The classification unit 35 randomly assigns each y_i(t) to a cluster k.

(Step S12) The classification unit 35 calculates the cluster center V_k of the vectors y_i(t) belonging to each class k.
(Step S13) The classification unit 35 reassigns each y_i(t) to the nearest cluster center V_k.

(Step S14) The classification unit 35 determines whether the change has converged or whether a predetermined number of iterations has been completed. If the change has converged or the predetermined number of iterations has been completed (step S14; YES), the classification unit 35 ends the process. If the change has not converged and the predetermined number of iterations has not been completed (step S14; NO), the classification unit 35 returns to step S12.
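Steps S11 to S14 can be sketched as follows. This is a minimal illustration, not the implementation of the classification unit 35: the function name `kmeans`, the squared Euclidean distance, the seeded generator, and the fallback that leaves an empty cluster at its previous center are all assumptions.

```python
import random

def kmeans(vectors, num_clusters, max_iters=100, seed=0):
    """Cluster spectral vectors following steps S11 to S14."""
    rng = random.Random(seed)
    # S11: assign every vector y_i(t) to a random cluster k.
    labels = [rng.randrange(num_clusters) for _ in vectors]
    # Fallback centers so that an initially empty cluster is still defined.
    centers = [list(rng.choice(vectors)) for _ in range(num_clusters)]
    for _ in range(max_iters):  # S14: stop after a preset number of passes
        # S12: recompute each cluster center V_k as the mean of its members.
        for k in range(num_clusters):
            members = [v for v, lab in zip(vectors, labels) if lab == k]
            if members:  # an empty cluster keeps its previous center
                centers[k] = [sum(col) / len(members) for col in zip(*members)]
        # S13: reassign every vector to its nearest center.
        new_labels = [
            min(range(num_clusters),
                key=lambda k: sum((a - b) ** 2 for a, b in zip(v, centers[k])))
            for v in vectors
        ]
        if new_labels == labels:  # S14: assignments no longer change
            break
        labels = new_labels
    return labels, centers
```

The returned label of each frame is its quantized-spectrum class k. Because the start is random, the result can depend on the seed.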

Next, the topic model is described. To extract the component common to the separated sounds, the embodiment fits a topic model to the acoustic signals.
A topic model is a tool for discovering semantic information in a large amount of document data. Although it was devised as a method for analyzing document data, the generality of its structure has led to its use in image processing, social network analysis, and acoustic signal processing. In acoustic signal processing, for example, a speaker estimation method that applies a topic model to direction-of-arrival (DOA) information has been devised.

In the topic model, a topic distribution θ_m = (θ_m1, …, θ_mL) is obtained for each segment m. Here, θ_ml = p(l | θ_m) is the probability that topic l is assigned to a quantized spectrum of segment m, and satisfies θ_ml ≥ 0 and Σ_l θ_ml = 1. In addition, a quantized-spectrum distribution φ_l = (φ_l1, …, φ_lK) is obtained for each topic l. Here, φ_lk = p(k | φ_l) is the probability that value k appears in topic l, and satisfies φ_lk ≥ 0 and Σ_k φ_lk = 1.

The topic model ignores the order of the quantized spectra and represents a segment by how many times each quantized spectrum appears. For this purpose, the separated signal of each microphone array i is divided into M sections (segments).
For each segment m_i obtained in this way, the classification unit 35 counts the occurrences of each quantized spectrum k. Through this operation, the classification unit 35 creates the frequency matrix W ∈ R^(3M×K) (R is the set of all positive real numbers) that is input to LDA. W has 3M rows because LDA is computed over all segments of the three microphone arrays.
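The counting operation above amounts to building a bag-of-words matrix. The sketch below assumes equal-length segments, with any remainder frames dropped, and zero-based class labels; the function name `frequency_matrix` is hypothetical.

```python
def frequency_matrix(labels_per_array, num_segments, num_classes):
    """Count quantized-spectrum occurrences per segment (bag of words)."""
    rows = []
    for labels in labels_per_array:            # one label sequence per array i
        seg_len = len(labels) // num_segments  # frames per segment
        for m in range(num_segments):
            counts = [0] * num_classes
            for k in labels[m * seg_len:(m + 1) * seg_len]:
                counts[k] += 1                 # order within the segment is ignored
            rows.append(counts)
    return rows  # (number of arrays * M) rows, K columns
```

With three microphone arrays the result has 3M rows, matching the shape of W described above.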

(Estimation of the topic of interest by LDA)
As described above, after preprocessing, the removal unit 36 removes noise components.
In the embodiment, LDA is applied to the frequency matrix W created in preprocessing. The variational Bayesian method is used to estimate the LDA model.

In the embodiment, LDA is used to estimate the topic distribution of each segment, θ_im = {θ_1, …, θ_L} (where i = 1, …, N and m = 1, …, M), and the quantized-spectrum distribution of each topic, φ_l = {φ_l1, …, φ_lK} (where l = 1, …, L). LDA assumes multinomial distributions for the quantized-spectrum distribution and the topic distribution, and Dirichlet distributions as their priors. Here, the multinomial distribution represents "the probability of the outcome when one value is drawn from K kinds of discrete values N times, where the probability of drawing value k is φ_k". The multinomial distribution is expressed by equation (1), where x is the separated sound.

The Dirichlet distribution is the probability distribution of the parameters φ = (φ_1, …, φ_K) of a multinomial distribution, which satisfy the constraints φ_k ≥ 0 and Σ_{k=1}^{K} φ_k = 1, and is expressed by equation (2).

In equation (2), Γ(·) is the gamma function and the fractional part is the normalization term. β is a hyperparameter whose value determines the probability that the multinomial parameter takes the value φ.
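For reference, the standard textbook forms of the two distributions that match the descriptions of equations (1) and (2) are sketched below. The count notation x_k (the number of times value k is drawn, with Σ_k x_k = N) and the per-component hyperparameters β_k are assumptions introduced here for illustration.

```latex
p(x \mid \phi, N) = \frac{N!}{\prod_{k=1}^{K} x_k!} \prod_{k=1}^{K} \phi_k^{x_k}
\qquad
\mathrm{Dirichlet}(\phi \mid \beta) =
  \frac{\Gamma\!\left(\sum_{k=1}^{K} \beta_k\right)}{\prod_{k=1}^{K} \Gamma(\beta_k)}
  \prod_{k=1}^{K} \phi_k^{\beta_k - 1}
```

In the second expression the ratio of gamma functions is the normalization term mentioned above.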

FIG. 6 is a flowchart showing an example of the generative process of a set of quantized spectra in LDA.
Here, l is the topic number, m is the segment number, and N_m is the number of quantized spectra contained in segment m. For the n-th quantized spectrum of segment m, z_mn is its topic number and w_mn is its word number.
In this generative process, the topic distribution and the quantized-spectrum distribution are represented by multinomial distributions, with Dirichlet distributions as their priors. α and β are the respective hyperparameters.

(Steps S21 to S23) For each topic l = 1, …, L, the extraction unit 37 repeats the process of generating a quantized-spectrum distribution φ_l ~ Dirichlet(β) (step S22).

(Steps S24 to S30) For each segment m = 1, …, M, the extraction unit 37 repeats the process of generating a topic distribution θ_m ~ Dirichlet(α) (step S25) and the processing of steps S26 to S29.

(Steps S26 to S29) For each quantized spectrum n = 1, …, N_m, the extraction unit 37 repeats the process of generating a topic z_mn ~ Multinomial(θ_m) (step S27) and the process of generating a quantized spectrum w_mn ~ Multinomial(φ_{z_mn}) (step S28).
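The generative process of steps S21 to S30 can be simulated with a short sketch. The function names, the use of symmetric Dirichlet priors with scalar α and β, and sampling the Dirichlet through normalized Gamma variates are assumptions for illustration only.

```python
import random

def sample_dirichlet(rng, concentration, size):
    """Draw from a symmetric Dirichlet via normalized Gamma variates."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(size)]
    total = sum(draws)
    return [d / total for d in draws]

def generate_corpus(num_topics, num_classes, segment_lengths, alpha, beta, seed=0):
    rng = random.Random(seed)
    # S21-S23: one quantized-spectrum distribution phi_l per topic.
    phi = [sample_dirichlet(rng, beta, num_classes) for _ in range(num_topics)]
    theta, segments = [], []
    for n_m in segment_lengths:                              # S24-S30: segment m
        theta_m = sample_dirichlet(rng, alpha, num_topics)   # S25
        words = []
        for _ in range(n_m):                                 # S26-S29
            z = rng.choices(range(num_topics), weights=theta_m)[0]   # S27
            w = rng.choices(range(num_classes), weights=phi[z])[0]   # S28
            words.append(w)
        theta.append(theta_m)
        segments.append(words)
    return phi, theta, segments
```

Estimation runs this process in reverse: given only the observed w_mn, it infers θ and φ.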

Next, the graphical model of LDA is described.
FIG. 7 is a diagram showing the graphical model of LDA. In FIG. 7, the circled nodes (α, β, θ, φ, w, z) represent unknown variables, and the boxed regions (L, N, M) represent repetition. The graphical model visually expresses the probabilistic dependencies among the nodes.

In the embodiment, the LDA model is estimated using the variational Bayesian method. The topic estimation method based on the variational Bayesian method is described below.
In the following description, the unknown variables of the topic model are the topic set Z, the topic distribution set Θ, and the quantized-spectrum distribution set Φ.
First, the variational lower bound F of the log marginal likelihood log p(W | α, β) of the topic model is obtained as in equation (3).

In equation (3), the inequality in the third step follows from Jensen's inequality. In the fourth step, to simplify the calculation, the variational posterior distribution is assumed to factorize as q(Z, Θ, Φ) = q(Z) q(Θ, Φ).

Next, the variational posterior distribution q(Z) is estimated. The variational lower bound F is maximized using the method of Lagrange multipliers under the constraint Σ_z q(z) = 1, which q must satisfy to be a probability distribution. For this estimation, F(q(Z)) is written as in equation (4), and its extremum is obtained.

In equation (4), λ(·) is the undetermined multiplier.
Solving ∂F(q(Z))/∂q(Z) = 0, the q_mnl that maximizes F(q(Z)) is given by equation (5).

In equation (5), Ψ(·) is the digamma function.
Similarly, the variational lower bound F is maximized with respect to q(Θ, Φ). F(q(Θ, Φ)) is written as in equation (6), and its extremum is obtained.

Solving ∂F(q(Θ, Φ))/∂q(Θ, Φ) = 0, the variational posterior distribution q(Θ) of the topic distribution is given by equation (7).

In equation (7), the parameter α_ml of the variational posterior distribution q(Θ) of the topic distribution is defined as in equation (8).

Furthermore, the variational posterior distribution q(Φ) of the quantized-spectrum distribution is obtained as in equation (9).

In equation (9), the parameter β_lk is defined as in equation (10).

The extraction unit 37 estimates the topic distribution and the quantized-spectrum distribution by updating the parameters α_ml and β_lk according to equations (8) and (10).

FIG. 8 shows an example of the variational Bayesian estimation algorithm for the topic model according to the present embodiment.

(Step S101) The extraction unit 37 initializes the variational posterior parameters α_ml and β_lk with random positive values.

(Steps S102 to S114) The extraction unit 37 repeats the processing of steps S102 to S114 until the termination condition is satisfied.

(Step S103) The extraction unit 37 sets α_ml^new = α and β_lk^new = β, thereby initializing the parameters used from step S104 onward.

(Steps S104 to S112) The extraction unit 37 repeats the processing of steps S104 to S112 M times.
(Steps S105 to S111) The extraction unit 37 repeats the processing of steps S105 to S111 N times.
(Steps S106 to S110) The extraction unit 37 repeats the processing of steps S106 to S110 L times.

(Step S107) The extraction unit 37 calculates equation (5).
(Step S108) The extraction unit 37 sets α_ml^new = α_ml^new + q_mnl, updating the parameter of the variational posterior distribution of the topic distribution.
(Step S109) The extraction unit 37 sets β_{l,w_mn}^new = β_{l,w_mn}^new + q_mnl, updating the parameter of the variational posterior distribution of the quantized-spectrum distribution.

(Step S110) After repeating the processing of steps S106 to S110 L times, the extraction unit 37 proceeds to step S111.
(Step S111) After repeating the processing of steps S105 to S111 N times, the extraction unit 37 proceeds to step S112.
(Step S112) After repeating the processing of steps S104 to S112 M times, the extraction unit 37 proceeds to step S113.

(Step S113) The extraction unit 37 updates the parameters by setting α_ml = α_ml^new and β_lk = β_lk^new.
(Step S114) The extraction unit 37 ends the process once the termination condition is satisfied. The termination condition is, for example, that the values have converged within a predetermined range or that a predetermined number of iterations has been performed.
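The loop structure of FIG. 8 can be sketched as follows. Equation (5) is not reproduced in this text, so the update at step S107 uses the standard variational Bayes responsibility for LDA, q_mnl ∝ exp(Ψ(α_ml) + Ψ(β_{l,w_mn}) − Ψ(Σ_k β_lk)), which is an assumption rather than the patent's own formula. The function names, scalar priors, hand-written digamma approximation, and fixed iteration count as the termination condition are also illustrative choices.

```python
import math
import random

def digamma(x):
    """Digamma function for x > 0 (recurrence plus asymptotic series)."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    return (result + math.log(x) - 0.5 / x
            - inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252)))

def variational_bayes_lda(segments, num_topics, num_classes,
                          alpha, beta, iters=50, seed=0):
    rng = random.Random(seed)
    # S101: random positive initial values.
    a = [[rng.random() + 0.1 for _ in range(num_topics)] for _ in segments]
    b = [[rng.random() + 0.1 for _ in range(num_classes)]
         for _ in range(num_topics)]
    for _ in range(iters):                                  # S102-S114
        a_new = [[alpha] * num_topics for _ in segments]    # S103
        b_new = [[beta] * num_classes for _ in range(num_topics)]
        b_sum = [sum(row) for row in b]
        for m, words in enumerate(segments):                # S104-S112
            for w in words:                                 # S105-S111
                # S107: log responsibilities (assumed standard form; the term
                # in sum_l alpha_ml is constant in l and cancels on normalizing).
                logq = [digamma(a[m][l]) + digamma(b[l][w]) - digamma(b_sum[l])
                        for l in range(num_topics)]
                peak = max(logq)
                q = [math.exp(v - peak) for v in logq]
                norm = sum(q)
                for l in range(num_topics):                 # S106-S110
                    a_new[m][l] += q[l] / norm              # S108
                    b_new[l][w] += q[l] / norm              # S109
        a, b = a_new, b_new                                 # S113
    return a, b
```

Normalizing each row of the returned parameters gives point estimates of θ_m and φ_l.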

Through these processes, LDA estimates the topic distribution Θ ∈ R^(3M×L) for each time interval and the quantized-spectrum distribution Φ ∈ R^(L×K) for each topic. In the embodiment, an entry of the topic distribution or of the quantized-spectrum distribution is defined to be in the active state when its posterior probability exceeds the threshold γ or η, respectively.
Specifically, to determine which topics are contained in the topic distribution θ_im, each entry is compared with the threshold γ, and α_iml is determined as in equation (11).

Similarly, for the quantized-spectrum distribution, each entry is compared with the threshold η, and the active parameter β_lk is determined as in equation (12).

α_iml = 1 indicates that topic l has a high probability of appearing in that segment. Likewise, β_lk = 1 indicates that cluster k has a high probability of being contained in topic l.
When the same sound is contained in different time intervals m and m', the topic l containing that sound is likely to be active in both intervals, that is, α_iml = α_im'l = 1 is likely. Therefore, if the same topic is active for each of the input sounds in the same time interval, that topic is likely to be the sound source of interest.

FIG. 9 is a diagram showing an example of spectrum estimation for the sound source of interest. In FIG. 9, reference sign g110 denotes the acoustic signals picked up by the microphone arrays i, which are the input signals to the processing unit 3. Reference sign g111 denotes the acoustic signal picked up by the first microphone array 2-1, g112 the acoustic signal picked up by the second microphone array 2-2, and g113 the acoustic signal picked up by the third microphone array 2-3. Reference sign g120 denotes the estimated signal. In g111 to g113 and g120, the horizontal axis is the time frame and the vertical axis is the amplitude.

In FIG. 9, the most active topic of each input signal is indicated by shading for each time interval. In the portion enclosed by the rectangle g130, all the input signals have the same shading in the same time interval, so that topic represents the sound source of interest.
In the embodiment, when a topic l satisfies α_iml = 1 for all input signals in the same time interval m, the topic l is extracted and used as the estimated topic.

Furthermore, the extraction unit 37 selects the estimated topic for each time interval and extracts the quantized spectra for which the active parameter of the quantized-spectrum distribution of the selected topic is β_lk = 1. The inverse transform unit 38 then restores the estimated signal e_i(t) of the sound source of interest by applying the inverse short-time Fourier transform to the extracted quantized spectra.

In the example shown in FIG. 9, a topic is estimated to be the sound source of interest and extracted when the same topic is contained in the signals of all three microphone arrays, the first microphone array 2-1 to the third microphone array 2-3 (FIG. 2), but the embodiment is not limited to this.
The example of FIG. 9 is one in which, as shown in FIG. 1, the sound source of interest S_0 is contained in the acoustic signals picked up by each of the three microphone arrays 2-1 to 2-3. However, the signal of interest may, for example, be contained in the acoustic signals picked up by only two of the three microphone arrays. In such a case, when the same topic is contained in the signals of two or more of the microphone arrays used for sound pickup, that common topic may be estimated to be the sound source of interest.
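The selection rule described above can be sketched as a thresholding step followed by an agreement check across arrays. The nested-list layout `theta[i][m][l]`, the function name `estimated_topics`, and the optional `min_arrays` argument (covering the variant in which agreement of two or more arrays is sufficient) are assumptions for illustration.

```python
def estimated_topics(theta, gamma, min_arrays=None):
    """Return, per time interval m, the topics active in enough arrays.

    theta[i][m][l] is the probability of topic l in segment m of array i.
    """
    num_arrays = len(theta)
    num_segments = len(theta[0])
    num_topics = len(theta[0][0])
    required = num_arrays if min_arrays is None else min_arrays
    result = []
    for m in range(num_segments):
        common = []
        for l in range(num_topics):
            # A topic is active in array i when its probability exceeds gamma.
            active = sum(1 for i in range(num_arrays) if theta[i][m][l] > gamma)
            if active >= required:
                common.append(l)
        result.append(common)
    return result
```

With the default setting a topic must be active in every array; passing `min_arrays=2` relaxes this to any two arrays.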

As described above, the present embodiment focuses on the content (topic) of the sound source of interest. Each of the plurality of microphone arrays separates the sound arriving from the direction of the sound source of interest, the topic of each separated sound is estimated using a topic model, and the portions that share a common topic across the microphone arrays are estimated to be the sound of the sound source of interest.
Thus, according to the present embodiment, the sound source of interest can be separated easily.

<Evaluation results>
Next, the results of an evaluation using the sound source separation device 1 of the present embodiment are described.
In the evaluation, as shown in FIG. 1, sound from four speakers was picked up with three microphone arrays and separated. The sound sources were male read speech, 30 seconds long, with a sampling frequency of 16 kHz. Of the four speakers, the speech of the second speaker was used as the sound source of interest S_0. The speech of the first speaker was sound source S_3, that of the third speaker was sound source S_1, and that of the fourth speaker was sound source S_2. The sound source of interest S_0 spoke during the first 30 seconds and the other sound sources spoke during the last 30 seconds. In this way, three separated signals with a total length of 60 seconds each were created. The evaluation was performed with no overlap between the utterance times of the sound source of interest and the other sound sources, and under the condition that the amplitude and phase of the sound source of interest were equal in all separated signals. The sampling frequency was 16000 Hz, the window width of the short-time Fourier transform was 512, the shift width was 256, and a Hamming window was used as the window function.

In the evaluation, the short-time Fourier transform was applied to the created separated signals X_i(t), and the resulting amplitude spectra Y_i(ω, t) were quantized by the k-means method. The number of k-means clusters was K = 100, 300, and 600. For segmentation, the signal of each microphone array was divided into M = 10, 15, 20, and 25 segments, giving segment durations of d = 6, 4, 3, and 2.4 seconds, respectively. Because the separated signal switches from the sound source of interest to the other sound sources at the 30-second mark, for d = 3 seconds and 6 seconds the segment boundaries coincide with the sound source boundary, so every segment contains only one sound source. For d = 2.4 seconds and 4 seconds, the segment that straddles the 30-second mark contains both the sound source of interest and another sound source. After segmentation, quantized spectra that appeared in 70% or more of the whole and quantized spectra that appeared in fewer than 3 segments were removed, and the frequency matrix W for the segments was created.
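The segment durations and the removal of overly common or rare quantized spectra can be checked with a small sketch. Interpreting "70% or more of the whole" as 70% or more of all segments is an assumption, as are the function names and default values.

```python
def segment_durations(total_seconds, segment_counts):
    """Duration d of one segment for each choice of M."""
    return [total_seconds / m for m in segment_counts]

def prune_classes(rows, max_fraction=0.7, min_segments=3):
    """Drop columns of the count matrix that are too common or too rare."""
    num_rows = len(rows)
    num_classes = len(rows[0])
    keep = []
    for k in range(num_classes):
        # Number of segments in which class k appears at least once.
        seg_freq = sum(1 for row in rows if row[k] > 0)
        if seg_freq >= min_segments and seg_freq < max_fraction * num_rows:
            keep.append(k)
    return keep, [[row[k] for k in keep] for row in rows]
```

For the 60-second signals, `segment_durations(60, [10, 15, 20, 25])` gives the durations d listed above.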

The variational Bayesian method described above was used to estimate the LDA model. Dirichlet distributions were used as the priors of both the topic distribution and the quantized-spectrum distribution, with initial hyperparameter values of 1/L and 1/K, respectively. The thresholds for the active-state decision were γ = 1/L and η = 1/K.

The Source to Distortion Ratio (SDR) of BSS Eval was used as the evaluation index of sound source separation performance. SDR is the energy ratio between the estimated source signal and all noise. The BSS Eval toolbox was used for the calculation. The evaluation measured how much the SDR value improved relative to the unseparated state. Because an estimated signal is obtained for each microphone array, the SDR was calculated for each microphone array and the average was used as the index.

Here, with respect to the true target sound source signal s_i(t) contained in the mixture to be separated, the estimated signal ŝ_i(t) can be decomposed as in equation (13).

In equation (13), s_target(t) is the target sound source signal term, e_interf(t) is the noise term due to the other sound sources contained in the mixture, e_noise(t) is the external noise term not due to the other sound sources, and e_artif(t) is the noise term introduced by the separation algorithm.
The SDR is calculated by equation (14).
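As a sketch of the ratio described above, the function below uses the usual BSS Eval definition, SDR = 10 log10(‖s_target‖² / ‖e_interf + e_noise + e_artif‖²), and takes the four terms of the decomposition as already given. Computing the decomposition itself is the job of the BSS Eval toolbox and is not shown; the function name `sdr_db` is hypothetical.

```python
import math

def sdr_db(s_target, e_interf, e_noise, e_artif):
    """Source to Distortion Ratio in dB from the decomposition terms."""
    signal_energy = sum(s * s for s in s_target)
    # All three error terms are summed sample by sample before squaring.
    error_energy = sum((a + b + c) ** 2
                       for a, b, c in zip(e_interf, e_noise, e_artif))
    return 10.0 * math.log10(signal_energy / error_energy)
```

The SDR improvement reported in the evaluation is then the difference between this value with and without the separation method applied.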

FIG. 10 is a diagram showing an example of the extracted sound when the number of clusters is K = 600, the segment duration is d = 4 seconds, and the number of topics is L = 5. In FIG. 10, reference sign g201 is the waveform of the separated sound in the direction of the sound source of interest obtained by the first microphone array 2-1. Reference sign g202 is the signal waveform of the correct sound source. Reference sign g203 is the waveform of the estimated signal extracted from the acoustic signal picked up by the first microphone array 2-1. In g201 to g203, the horizontal axis is time (seconds) and the vertical axis is amplitude.

Comparing the waveform of the correct sound source with that of the estimated signal in FIG. 10 shows that the estimated signal captures the correct sound source portion for most of the time period. Thus, according to the present embodiment, only the sound in the same time intervals as the correct signal can be extracted accurately.

Next, the results of evaluating how the separation accuracy changes when the value of each parameter is varied are described.
FIG. 11 is a diagram showing the change in separation performance with the number of topics L when the number of clusters is K = 600 and the segment duration is d = 4 seconds. The horizontal axis is the number of topics and the vertical axis is the SDR improvement [dB].
In the evaluation results shown in FIG. 11, the SDR improvement is the difference in SDR between applying and not applying the method of the present embodiment, and a higher value means better separation. In these results, the SDR value hardly improved when the number of topics was L = 2, whereas the separation performance tended to improve as L increased. The number of topics L may therefore be changed according to the acoustic signal to which the method is applied, and may be set or changed by, for example, machine learning.

FIG. 12 is a diagram showing the change in separation performance with the number of clusters K = 100, 300, and 600 and with the segment length. Reference sign g310 is a graph showing the change in separation performance, and g320 is a table of the values plotted in g310. In g310, the horizontal axis is the number of clusters and the vertical axis is the SDR improvement [dB]. Reference sign g311 corresponds to a segment duration of 2.4 seconds, g312 to 3 seconds, g313 to 4 seconds, and g314 to 6 seconds.

Regarding the number of k-means clusters K, these results show that the separation accuracy is low when K is small. This is because, when K is small, different sounds are assigned to the same class, which degrades the separation performance. Conversely, when K is too large, quantized spectrum numbers are assigned one-to-one to the individual frequency spectra.
The accuracy can therefore be improved by setting K to an appropriate value that is neither too small nor too large. The number of clusters K may be set according to the environment in which the method is applied, and may be set or changed by, for example, machine learning.

Comparing the segment durations in FIG. 12, when d = 2.4 seconds or 4 seconds, the segment near the 30-second mark contains both the sound source of interest and another sound source. Because the topic distribution classifies words based on co-occurrence, in this evaluation the estimated topic is more likely to include words belonging to another sound source when d = 2.4 seconds or 4 seconds. With K = 600 clusters, the SDR value is high for d = 3 seconds but low for d = 4 seconds.

FIG. 13 is a diagram showing evaluation results comparing the separation performance with and without removal of silent components and unique components when the number of clusters is K = 600, the segment duration is d = 4 seconds, and the number of topics is L = 5. Reference sign g410 shows the evaluation results as a graph, and g420 shows the values of the graph g410 as a table. In g410, the horizontal axis is the segment duration d and the vertical axis is the SDR improvement [dB]. Reference sign g411 is the case with silence removal and g412 is the case without silence removal.

As shown in FIG. 13, the SDR value deteriorates in the comparative example without silence removal, whereas it improves greatly when silence is removed as in the present embodiment. This is because the read speech used in the evaluation contains many silent components, so that, without removal, topics containing silent components were judged to be active in multiple time intervals.

<Second embodiment>
The second embodiment describes an example in which the direction of each sound source is detected by sound source localization processing and sound source separation processing.

[Configuration example of sound source separation device 1A]
First, a configuration example of the sound source separation device 1A of the present embodiment is described.
FIG. 14 is a block diagram showing a configuration example of the sound source separation device 1A according to the present embodiment. As shown in FIG. 14, the sound source separation device 1A includes a sound pickup unit 2A and a processing unit 3A. Functional units having the same functions as those of the sound source separation device 1 of the first embodiment are denoted by the same reference signs, and their description is omitted.
The sound pickup unit 2A includes a first microphone array 2-1, a second microphone array 2-2, and a third microphone array 2-3.
The processing unit 3A includes an acquisition unit 31A, a sound source localization unit 32, a sound source separation unit 33, a conversion unit 34A, a classification unit 35, a removal unit 36, an extraction unit 37, an inverse transform unit 38, and an output unit 39.

［音源分離装置１Ａの動作、機能］ [Operations and functions of the sound source separation device 1A]
次に、音源分離装置１Ａの各部の動作と機能例を説明する。 Next, examples of the operation and function of each unit of the sound source separation device 1A will be described.
第１マイクロホンアレイ２－１、第２マイクロホンアレイ２－２、および第３マイクロホンアレイ２－３それぞれは、収音したＰチャネルの音響信号を処理部３Ａに出力する。なお、各マイクロホンアレイが出力するＰチャネルの音響信号には、マイクロホンアレイを識別するための識別情報が含まれている。 Each of the first microphone array 2-1, the second microphone array 2-2, and the third microphone array 2-3 outputs the picked-up P-channel acoustic signal to the processing unit 3A. The P-channel acoustic signal output from each microphone array includes identification information for identifying the microphone array.

取得部３１Ａは、第１マイクロホンアレイ２－１、第２マイクロホンアレイ２－２、および第３マイクロホンアレイ２－３それぞれが出力するＰチャネルの音響信号を取得する。取得部３１Ａは、取得したマイクロホンアレイ毎のＰチャネルの音響信号を音源定位部３２と音源分離部３３に出力する。 The acquisition unit 31A acquires the P-channel acoustic signals output from the first microphone array 2-1, the second microphone array 2-2, and the third microphone array 2-3. The acquisition unit 31A outputs the acquired P-channel acoustic signal for each microphone array to the sound source localization unit 32 and the sound source separation unit 33.

音源定位部３２は、取得部３１Ａが出力するマイクロホンアレイ毎のＰチャネルの音響信号を取得する。音源定位部３２は、取得したマイクロホンアレイ毎のＰチャネルの音響信号に対して、例えばビームフォーミング法またはＭＵＳＩＣ法による音源定位処理を行って、音響信号に含まれる音源方向を推定する。音源定位部３２は、マイクロホンアレイ毎に、推定した音源定位情報を音源分離部３３に出力する。 The sound source localization unit 32 acquires the P-channel acoustic signal for each microphone array output by the acquisition unit 31A. The sound source localization unit 32 performs sound source localization processing by, for example, the beamforming method or the MUSIC method on the acquired P-channel acoustic signal for each microphone array, and estimates the sound source direction included in the acoustic signal. The sound source localization unit 32 outputs the estimated sound source localization information to the sound source separation unit 33 for each microphone array.
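
As an illustration of beamforming-based localization, the following is a minimal sketch of a narrowband delay-and-sum scan. It assumes a far-field source, a uniform linear array, and a single frequency bin; the array geometry, the speed of sound, and the names `steering_vector` and `estimate_direction` are assumptions made for this example and are not taken from this document.

```python
import cmath
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed value for air at room temperature


def steering_vector(angle_deg, num_mics, spacing_m, freq_hz):
    """Far-field steering vector of a uniform linear array (assumed geometry)."""
    delay_per_mic = spacing_m * math.sin(math.radians(angle_deg)) / SPEED_OF_SOUND
    return [cmath.exp(-2j * math.pi * freq_hz * m * delay_per_mic)
            for m in range(num_mics)]


def estimate_direction(snapshot, spacing_m, freq_hz, angles_deg=range(-90, 91)):
    """Return the scan angle whose delay-and-sum output power is largest."""
    num_mics = len(snapshot)

    def beam_power(angle):
        weights = steering_vector(angle, num_mics, spacing_m, freq_hz)
        # Align the phases for this look direction, then sum over microphones.
        output = sum(w.conjugate() * x for w, x in zip(weights, snapshot))
        return abs(output) ** 2

    return max(angles_deg, key=beam_power)
```

Here `snapshot` is one complex frequency-bin value per microphone. A MUSIC-based localizer would replace `beam_power` with a subspace pseudo-spectrum, which is not shown.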

音源分離部３３は、音源定位部３２が出力する音源定位情報と、取得部３１Ａが出力するマイクロホンアレイ毎のＭチャネルの音響信号を取得する。音源分離部３３は、マイクロホンアレイ毎に、Ｍチャネルの音響信号から音源定位された方向の音響信号を抽出する。音源分離部３３は、例えばＧＨＤＳＳ（ＧｅｏｍｅｔｒｉｃＨｉｇｈ-ｏｒｄｅｒＤｅｃｏｒｒｅｌａｔｉｏｎ-ｂａｓｅｄＳｏｕｒｃｅＳｅｐａｒａｔｉｏｎ）法によって、音源分離処理を行う。例えば、図１において、マイクロホンアレイＭＡ_１が第１マイクロホンアレイ２－１の場合は、音源Ｓ_０とＳ_１が１チャネルの音響信号として抽出される。同様に、音源分離部３３は、第２マイクロホンアレイ２－２によって収音されたＰチャネルの音響信号に対して、音源に対応する音響信号を抽出する。音源分離部３３は、第３マイクロホンアレイ２－３によって収音されたＰチャネルの音響信号に対して、音源に対応する音響信号を抽出する。音源分離部３３は、抽出したマイクロホンアレイ毎の音響信号を変換部３４Ａに出力する。 The sound source separation unit 33 acquires the sound source localization information output by the sound source localization unit 32 and the M-channel acoustic signal for each microphone array output by the acquisition unit 31A. For each microphone array, the sound source separation unit 33 extracts, from the M-channel acoustic signal, the acoustic signal in the direction in which a sound source was localized. The sound source separation unit 33 performs the sound source separation processing by, for example, the GHDSS (Geometric High-order Decorrelation-based Source Separation) method. For example, in FIG. 1, when the microphone array MA₁ is the first microphone array 2-1, the sound sources S₀ and S₁ are extracted together as a one-channel acoustic signal. Similarly, the sound source separation unit 33 extracts the acoustic signal corresponding to a sound source from the P-channel acoustic signal picked up by the second microphone array 2-2, and likewise from the P-channel acoustic signal picked up by the third microphone array 2-3. The sound source separation unit 33 outputs the extracted acoustic signal for each microphone array to the conversion unit 34A.

なお、本実施形態において、複数のマイクロホンアレイの基準方向は、例えば図１の複数のマイクロホンアレイＭＡ_１～ＭＡ_３の重心（注目音源Ｓ_０位置）方向等に設定するようにしてもよい。 In this embodiment, the reference direction of the plurality of microphone arrays may be set, for example, to the direction of the center of gravity of the microphone arrays MA₁ to MA₃ in FIG. 1 (the position of the sound source of interest S₀).

第１実施形態では、ビームフォーミング法で形成されたビームによって注目音源を含む音響信号を収音することで、注目音源の音響信号を分離したが、本実施形態では、音源定位処理と音源分離処理によって、注目音源の音響信号を分離する。その後、処理部３Aは、第１実施形態と同様に、トピックの抽出、分類、共通トピックの抽出による推定トピックの推定等を行う。 In the first embodiment, the acoustic signal of the sound source of interest is separated by picking up an acoustic signal that includes the sound source of interest with beams formed by the beamforming method. In this embodiment, the acoustic signal of the sound source of interest is instead separated by sound source localization processing and sound source separation processing. After that, as in the first embodiment, the processing unit 3A performs topic extraction, classification, estimation of the estimated topic by extracting common topics, and so on.
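
The common-topic step can be sketched as follows. This is a hedged illustration rather than the processing unit's actual implementation: it assumes that topic indices are shared across microphone arrays, that each array provides one posterior value per topic for a given segment, and that a topic counts as active when its posterior exceeds a threshold. The names `active_topics` and `common_topics` and the default threshold are hypothetical.

```python
def active_topics(topic_posteriors, threshold):
    """Indices of topics whose posterior exceeds the activity threshold."""
    return {k for k, p in enumerate(topic_posteriors) if p > threshold}


def common_topics(posteriors_per_array, threshold=0.5, min_arrays=2):
    """Topics that are active in at least `min_arrays` arrays for one segment.

    posteriors_per_array: one list of per-topic posteriors per microphone array
    (topic indices are assumed to be shared across arrays).
    """
    counts = {}
    for posteriors in posteriors_per_array:
        for topic in active_topics(posteriors, threshold):
            counts[topic] = counts.get(topic, 0) + 1
    return sorted(t for t, n in counts.items() if n >= min_arrays)
```

Setting `min_arrays` to the number of arrays requires a topic to be active in every array, which is the stricter reading; the default of 2 corresponds to "at least two microphone arrays".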

本実施形態の音源分離装置１Ａの構成によっても、第１実施形態と同様の効果を得ることができる。 The configuration of the sound source separation device 1A of this embodiment can also provide the same effects as those of the first embodiment.

＜変形例＞ <Modification>
上述した第１実施形態と第２実施形態では、ｋ－ｍｅａｎｓ法によってクラスタリングを行う例を説明したが、これに限らない。クラスタリングは、他の周知の手法（例えば重み付き平均法等）を用いてもよい。 In the first and second embodiments described above, an example of performing clustering by the k-means method has been described, but the present invention is not limited to this. The clustering may use another well-known technique (for example, the weighted average method).
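
For readers unfamiliar with k-means, the following is a minimal sketch of the standard algorithm (Lloyd's iterations) on small feature vectors such as per-frame spectra. Using the first `k` points as initial centroids and a fixed iteration count are simplifications assumed here for determinism; they are not taken from this document, and the name `kmeans` is hypothetical.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=20):
    """Plain k-means; the first k points serve as initial centroids (assumption)."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        labels = [min(range(k), key=lambda c: squared_distance(p, centroids[c]))
                  for p in points]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(v) / len(members) for v in zip(*members)]
    return labels, centroids
```

The returned label of each vector plays the role of a quantized symbol; in practice a library implementation with better initialization would usually be preferred.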

また、上述した第１実施形態と第２実施形態では、先にクラスタリングを行い、クラスタリング後に除去部３６がノイズ成分を除去し、ノイズ成分が除去された後に注目音源を抽出する例を説明したが、これに限らない。 In addition, in the first and second embodiments described above, an example has been described in which clustering is performed first, the removal unit 36 removes noise components after the clustering, and the sound source of interest is extracted after the noise components are removed, but the present invention is not limited to this order.

図１５は、無音区間と発話区間を説明するための図である。 FIG. 15 is a diagram for explaining silent sections and speech sections.
図１５に示すように、音響信号には、一般的に無音区間ｇ５０１が含まれている。このような無音区間を除去、または発話区間ｇ５０２を抽出し、発話区間に対して所定の区間毎のからトピックを抽出するようにしてもよい。無音区間または発話区間の検出は、例えば音響信号の振幅に対する発話区間検出のための閾値と音響信号を比較して検出するようにしてもよい。 As shown in FIG. 15, an acoustic signal generally includes silent sections g501. Such silent sections may be removed, or the speech sections g502 may be extracted, and topics may then be extracted from each predetermined interval within the speech sections. Silent sections or speech sections may be detected, for example, by comparing the acoustic signal with an amplitude threshold for speech section detection.
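
The amplitude-threshold detection of speech sections can be sketched as below. This is a minimal illustration under assumptions: the signal is a list of samples, the decision is made per fixed-length frame using the frame's peak absolute amplitude, and adjacent speech frames are merged. The frame length, the threshold value, and the name `speech_intervals` are hypothetical.

```python
def speech_intervals(samples, threshold, frame_len):
    """Return merged (start, end) sample ranges whose frames exceed the threshold.

    A frame is treated as speech when its peak absolute amplitude is greater
    than `threshold`; `end` is exclusive.
    """
    intervals = []
    for start in range(0, len(samples), frame_len):
        frame = samples[start:start + frame_len]
        end = start + len(frame)
        if max(abs(s) for s in frame) > threshold:
            if intervals and intervals[-1][1] == start:
                # Extend the previous interval when speech frames are adjacent.
                intervals[-1] = (intervals[-1][0], end)
            else:
                intervals.append((start, end))
    return intervals
```

Everything outside the returned ranges would be treated as silent and skipped before topic extraction.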

なお、本発明における音源分離装置１（または１Ａ）の機能の全てまたは一部を実現するためのプログラムをコンピュータ読み取り可能な記録媒体に記録して、この記録媒体に記録されたプログラムをコンピュータシステムに読み込ませ、実行することにより音源分離装置１（または１Ａ）が行う処理の全てまたは一部を行ってもよい。なお、ここでいう「コンピュータシステム」とは、ＯＳや周辺機器等のハードウェアを含むものとする。また、「コンピュータシステム」は、ホームページ提供環境（あるいは表示環境）を備えたＷＷＷシステムも含むものとする。また、「コンピュータ読み取り可能な記録媒体」とは、フレキシブルディスク、光磁気ディスク、ＲＯＭ、ＣＤ－ＲＯＭ等の可搬媒体、コンピュータシステムに内蔵されるハードディスク等の記憶装置のことをいう。さらに「コンピュータ読み取り可能な記録媒体」とは、インターネット等のネットワークや電話回線等の通信回線を介してプログラムが送信された場合のサーバやクライアントとなるコンピュータシステム内部の揮発性メモリ（ＲＡＭ）のように、一定時間プログラムを保持しているものも含むものとする。 All or part of the processing performed by the sound source separation device 1 (or 1A) may be carried out by recording a program for realizing all or part of the functions of the sound source separation device 1 (or 1A) of the present invention on a computer-readable recording medium, and causing a computer system to read and execute the program recorded on this recording medium. The "computer system" here includes an OS and hardware such as peripheral devices. The "computer system" also includes a WWW system provided with a home page providing environment (or display environment). The "computer-readable recording medium" refers to portable media such as flexible disks, magneto-optical disks, ROMs, and CD-ROMs, and to storage devices such as hard disks built into computer systems. Furthermore, the "computer-readable recording medium" also includes media that hold the program for a certain period of time, such as volatile memory (RAM) inside a computer system that serves as a server or client when the program is transmitted via a network such as the Internet or a communication line such as a telephone line.

また、上記プログラムは、このプログラムを記憶装置等に格納したコンピュータシステムから、伝送媒体を介して、あるいは、伝送媒体中の伝送波により他のコンピュータシステムに伝送されてもよい。ここで、プログラムを伝送する「伝送媒体」は、インターネット等のネットワーク（通信網）や電話回線等の通信回線（通信線）のように情報を伝送する機能を有する媒体のことをいう。また、上記プログラムは、前述した機能の一部を実現するためのものであってもよい。さらに、前述した機能をコンピュータシステムにすでに記録されているプログラムとの組み合わせで実現できるもの、いわゆる差分ファイル（差分プログラム）であってもよい。 Further, the above program may be transmitted from a computer system storing this program in a storage device or the like to another computer system via a transmission medium or by a transmission wave in a transmission medium. Here, the "transmission medium" for transmitting the program refers to a medium having a function of transmitting information, such as a network (communication network) such as the Internet or a communication line (communication line) such as a telephone line. Further, the program may be for realizing part of the functions described above. Further, it may be a so-called difference file (difference program) that can realize the above-described functions in combination with a program already recorded in the computer system.

１，１Ａ…音源分離装置、
２，２Ａ…収音部、
３，３Ａ…処理部、
２－１…第１マイクロホンアレイ、
２－２…第２マイクロホンアレイ、
２－３…第３マイクロホンアレイ、
３０…ビームフォーミング制御部、
３１…取得部、
３２…音源定位部、
３３…音源分離部、
３４，３４Ａ…変換部、
３５…分類部、
３６…除去部、
３７…抽出部、
３８…逆変換部、
３９…出力部、
４０…音源定位部
1, 1A ... sound source separation device,
2, 2A ... sound pickup unit,
3, 3A ... processing unit,
2-1 ... first microphone array,
2-2 ... second microphone array,
2-3 ... third microphone array,
30 ... beamforming control unit,
31 ... acquisition unit,
32 ... sound source localization unit,
33 ... sound source separation unit,
34, 34A ... conversion unit,
35 ... classification unit,
36 ... removal unit,
37 ... extraction unit,
38 ... inverse conversion unit,
39 ... output unit,
40 ... sound source localization unit

Claims

A sound source separation device comprising:
a plurality of microphone arrays that pick up acoustic signals;
an extraction unit that, when each of the collected sound signals picked up by at least two of the microphone arrays includes a first acoustic signal of a sound source of interest and a second acoustic signal of another sound source in the same direction as the sound source of interest, extracts a common component included in each of the collected sound signals picked up by the at least two microphone arrays and thereby extracts the first acoustic signal from the collected sound signals; and
a classification unit that classifies topics of sounds included in the collected sound signals,
wherein the extraction unit compares the topics classified by the classification unit for each of the microphone arrays and, when the comparison shows that the same topic is present in the collected sound signals picked up by each of the plurality of microphone arrays, estimates that the same topic corresponds to the sound source of interest and extracts, as the first acoustic signal, the acoustic signal corresponding to the same topic from the collected sound signals.

The sound source separation device according to claim 1,
wherein the extraction unit extracts the common component using latent Dirichlet allocation.

The sound source separation device according to claim 1 or 2,
further comprising a conversion unit that converts the collected sound signal picked up by each of the microphone arrays into a frequency spectrum,
wherein the classification unit segments the frequency spectrum of each microphone array by dividing it into M sections (M is an integer equal to or greater than 2) in time frames, and classifies the frequency spectrum of each time frame included in each segment into the topics.

The sound source separation device according to claim 3,
wherein the extraction unit estimates the distribution of the topics for each time interval and the distribution, for each topic, of the quantized spectrum obtained by quantizing the frequency spectrum, regards as active those for which the posterior probabilities of the topic distribution and of the quantized spectrum distribution are each larger than a threshold for determining an active state, and
extracts the common component by comparing the topic distributions of the segments at the same time and extracting the topics that are active in at least two of the microphone arrays.

The sound source separation device according to any one of claims 1 to 4,
further comprising a control unit that controls the microphone arrays so as to form a beam in the direction of the sound source of interest,
wherein the plurality of microphone arrays pick up the collected sound signals including the first acoustic signal of the sound source of interest under the control of the control unit.

The sound source separation device according to any one of claims 1 to 4, further comprising:
a sound source localization unit that performs sound source localization on the collected sound signal picked up by each of the microphone arrays; and
a sound source separation unit that, based on the result of the sound source localization, separates a separated signal including the first acoustic signal from the collected sound signal picked up by each of the microphone arrays,
wherein the extraction unit extracts the first acoustic signal from the collected sound signals by extracting a common component included in the separated signals separated from the collected sound signals of at least two of the microphone arrays.

A sound source separation method comprising:
picking up acoustic signals with a plurality of microphone arrays;
extracting, by an extraction unit, when each of the collected sound signals picked up by at least two of the microphone arrays includes a first acoustic signal of a sound source of interest and a second acoustic signal of another sound source in the same direction as the sound source of interest, a common component included in each of the collected sound signals picked up by the at least two microphone arrays, thereby extracting the first acoustic signal from the collected sound signals;
classifying, by a classification unit, topics of sounds included in the collected sound signals; and
comparing, by the extraction unit, the topics classified by the classification unit for each of the microphone arrays and, when the comparison shows that the same topic is present in the collected sound signals picked up by each of the plurality of microphone arrays, estimating that the same topic corresponds to the sound source of interest and extracting, as the first acoustic signal, the acoustic signal corresponding to the same topic from the collected sound signals.

A program that causes a computer to:
pick up acoustic signals with a plurality of microphone arrays;
when each of the collected sound signals picked up by at least two of the microphone arrays includes a first acoustic signal of a sound source of interest and a second acoustic signal of another sound source in the same direction as the sound source of interest, extract a common component included in each of the collected sound signals picked up by the at least two microphone arrays and thereby extract the first acoustic signal from the collected sound signals;
classify topics of sounds included in the collected sound signals; and
compare the topics classified for each of the microphone arrays and, when the comparison shows that the same topic is present in the collected sound signals picked up by each of the plurality of microphone arrays, estimate that the same topic corresponds to the sound source of interest and extract, as the first acoustic signal, the acoustic signal corresponding to the same topic from the collected sound signals.