JP2023025458A

JP2023025458A - Sound source separation device, sound source separation method, and sound source separation program

Info

Publication number: JP2023025458A
Application number: JP2021130719A
Authority: JP
Inventors: Hirokazu Kameoka (亀岡弘和); Li Li (李莉)
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2021-08-10
Filing date: 2021-08-10
Publication date: 2023-02-22

Abstract

PROBLEM TO BE SOLVED: To provide a sound source separation device that improves the accuracy of estimating sound source signals from observed signals alone by matching the combination of the separated signals in the frequency direction, even when the microphone arrangement is unknown.

SOLUTION: A sound source separation device 100 comprises: a masking unit 103 that, for a separated signal obtained by separating with a separation matrix an observed signal in which a plurality of constituent sounds are mixed, masks each of a plurality of predetermined different frequency bands; a spectrogram generation unit 104 that generates a complementary spectrogram using the masked separated signal and a predetermined sound source model, and generates a separated-signal spectrogram for each frequency band of the separated signal; and a separation matrix correction unit 106 that corrects the separation matrix so as to realize the rearrangement of frequency bands given by relocation destinations assigned to each separated-signal spectrogram such that its distance to the complementary spectrogram becomes small.

SELECTED DRAWING: Figure 3

Description

The disclosed technology relates to a sound source separation device, a sound source separation method, and a sound source separation program.

Blind source separation (BSS) is a technique for estimating each sound source signal solely from observed signals in which a plurality of sound source signals are mixed, under conditions where the sound sources or the transfer characteristics from the sound sources to the microphones are unknown. BSS approaches formulated in the frequency domain, such as frequency-domain independent component analysis (FDICA), can express the mixing process as an instantaneous mixing system that involves no convolution, and therefore have the advantage of admitting relatively efficient algorithms.

However, because the order of the separated signals obtained at each frequency is arbitrary in FDICA, a separate permutation matching step that groups together the per-frequency independent components originating from the same sound source is required afterwards. Conventional permutation matching methods use as a clue the correlation of power between adjacent frequencies or the direction of arrival of each source obtained from microphone position information, and a method combining the two has also been proposed (Non-Patent Document 1).

On the other hand, many methods have been proposed in recent years that, rather than relying on post-processing, model the dependencies among the frequency components of each source and incorporate them into the BSS optimization problem in the form of constraints or costs, thereby solving permutation matching and per-frequency sound source separation simultaneously. Their effectiveness has been demonstrated.

For example, in independent low-rank matrix analysis (ILRMA), the power spectrogram of each sound source is represented by the product of two non-negative matrices. This corresponds to assuming that the power spectrum at each time frame can be approximated by a linear sum of basis spectra scaled by time-varying amplitudes. With this constraint, ILRMA uses the spectral structure of the sources as a clue and solves per-frequency source separation and permutation matching simultaneously.

In the multichannel variational autoencoder (MVAE) method, the generative model of the source spectrograms is represented by a conditional VAE (CVAE) and is pre-trained on training samples, which makes it possible to capture the dependencies among the components of a source across both frequency and time. By estimating the separation matrix so that each separated signal fits this source generative model as closely as possible, high-accuracy sound source separation can be performed.

Non-Patent Document 1: Hiroshi Sawada, Ryo Mukai, Shoko Araki, and Shoji Makino, "A robust and precise method for solving the permutation problem of frequency-domain blind source separation," IEEE Transactions on Speech and Audio Processing, vol. 12, no. 5, pp. 530-538, 2004.

Even with a method that solves permutation matching and per-frequency sound source separation simultaneously, permutation mismatches can occur in units of band blocks. This is called the block permutation problem. It arises when the sound source model does not adequately capture the dependencies between distant frequency bands, or when the expressive power of the sound source model is too high. If the block permutation problem can be solved, a further improvement in separation accuracy can be expected.

The block permutation problem can be regarded as an "assignment problem" of finding, with a suitable cost as a clue, which sound source the separated signal at each frequency corresponds to. However, because the separation matrix estimated by BSS when estimating each source signal from the observed signals alone is obtained separately for each frequency, it is difficult to find an appropriate combination of the separation results in the frequency direction unless some constraint is imposed on the structure of the sources in the frequency direction. Moreover, even when such a constraint is imposed, the structure between distant frequency bands may not be fully constrained, and separation results in which the values of those frequency bands are swapped may be obtained. Conventionally, the combination in the frequency direction has been matched using as a clue the correlation of power between adjacent frequencies or the direction of arrival obtained from microphone position information. However, methods that use microphone position information, as in Non-Patent Document 1, forfeit the advantage of BSS that it can operate even when the microphone arrangement is unknown.

The disclosed technology has been made in view of the above points. Its object is to provide a sound source separation device, a sound source separation method, and a sound source separation program that improve the accuracy of estimating sound source signals from observed signals alone by matching the combination of the separated signals in the frequency direction, even when the microphone arrangement is unknown.

A first aspect of the present disclosure is a sound source separation device including: a masking unit that, for a separated signal obtained by separating with a separation matrix an observed signal in which a plurality of constituent sounds are mixed, masks each of a plurality of predetermined different frequency bands; a spectrogram generation unit that generates a complementary spectrogram using the separated signal masked by the masking unit and a predetermined sound source model, and generates a separated-signal spectrogram for each of the frequency bands of the separated signal; and a separation matrix correction unit that corrects the separation matrix so as to realize the rearrangement of frequency bands given by relocation destinations assigned to each of the separated-signal spectrograms such that the distance to the complementary spectrogram becomes small.

A second aspect of the present disclosure is a sound source separation method in which a computer executes processing of: masking, for a separated signal obtained by separating with a separation matrix an observed signal in which a plurality of constituent sounds are mixed, each of a plurality of predetermined different frequency bands; generating a complementary spectrogram using the masked separated signal and a predetermined sound source model, and generating a separated-signal spectrogram for each of the frequency bands of the separated signal; and correcting the separation matrix so as to realize the rearrangement of frequency bands given by relocation destinations assigned to each of the separated-signal spectrograms such that the distance to the complementary spectrogram becomes small.

A third aspect of the present disclosure is a program for causing a computer to function as the sound source separation device of the first aspect.

According to the disclosed technology, matching the combination of the separated signals in the frequency direction, even when the microphone arrangement is unknown, yields separated signals with higher accuracy than when that combination is not matched.

FIG. 1 is a diagram showing the algorithm of the HBP method proposed in an embodiment of the disclosed technology.
FIG. 2 is a block diagram showing the hardware configuration of the sound source separation device.
FIG. 3 is a block diagram showing an example of the functional configuration of the sound source separation device.
FIG. 4 is a flowchart showing the flow of sound source separation processing by the sound source separation device.
FIG. 5 is a diagram showing the arrangement of microphones and sound sources in a separation experiment.
FIG. 6 is a diagram showing an example of the separated signals of nine sound sources.

An example of an embodiment of the disclosed technology is described below with reference to the drawings. In the drawings, identical or equivalent components and parts are given the same reference numerals. The dimensional ratios in the drawings are exaggerated for convenience of explanation and may differ from the actual ratios.

<Overview of this embodiment>
First, an overview of this embodiment is given.

In this embodiment, a sound source model representing the structure of the sources in the frequency direction and a separation matrix are first estimated and the signals are separated. Then, for each of a plurality of predetermined different frequency bands, that band of the separated signal is removed (masked) and restored with the sound source model to prepare a complementary spectrogram. The separated signals in each removed frequency band are then rearranged with the Hungarian method, an efficient algorithm, so that their distance to the complementary spectrogram becomes small. This makes it possible to match the combination of the separated signals in the frequency direction even when the microphone arrangement is unknown.

<Principle of this embodiment>
Next, the technical principle of this embodiment is described.

<Formulation of the frequency-domain multichannel sound source separation problem>
Consider the case where signals arriving from J sound sources are observed with I microphones. Let x_i(f, n) and s_j(f, n) denote the complex spectrograms of the observed signal at microphone i and of the signal of source j, respectively, and let the vectors having these as elements be

\[ \mathbf{x}(f,n) = [x_1(f,n), \ldots, x_I(f,n)]^{\mathsf T} \in \mathbb{C}^{I} \tag{1} \]

\[ \mathbf{s}(f,n) = [s_1(f,n), \ldots, s_J(f,n)]^{\mathsf T} \in \mathbb{C}^{J} \tag{2} \]

Here, the (over)determined condition I = J is considered. (·)^T denotes the transpose, and f and n are the frequency and time indices, respectively.

Under the condition I = J, when the reverberation time is shorter than the analysis window length, the instantaneous separation system

\[ \mathbf{s}(f,n) = \mathbf{W}^{\mathsf H}(f)\,\mathbf{x}(f,n) \tag{3} \]

\[ \mathbf{W}(f) = [\mathbf{w}_1(f), \ldots, \mathbf{w}_I(f)] \tag{4} \]

can be assumed as the relation between the source signal vector s(f, n) and the observed signal vector x(f, n). Here, W^H(f) is the separation matrix and (·)^H denotes the Hermitian transpose.
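As an illustration of the instantaneous separation system of Equation (3), the following NumPy sketch applies a separation matrix to an observed spectrogram bin by bin. The array layout (F frequency bins, N frames, I channels), the function name `separate`, and the random test mixture are assumptions made for this example and do not come from the original text.

```python
import numpy as np

def separate(W, X):
    """Apply s(f, n) = W(f)^H x(f, n) for every frequency bin f and frame n.

    W: complex array of shape (F, I, I); the columns of W[f] are w_1(f), ..., w_I(f).
    X: complex array of shape (F, N, I); X[f, n] is the observation vector x(f, n).
    Returns the separated signals with shape (F, N, I).
    """
    # s_j(f, n) = sum_i conj(W[f, i, j]) * X[f, n, i]
    return np.einsum("fij,fni->fnj", W.conj(), X)

# Toy check: mix random sources with random invertible matrices A(f),
# then separate them with W(f)^H = A(f)^{-1}.
rng = np.random.default_rng(0)
F, N, I = 4, 5, 3
S_true = rng.standard_normal((F, N, I)) + 1j * rng.standard_normal((F, N, I))
A = rng.standard_normal((F, I, I)) + 1j * rng.standard_normal((F, I, I))
X = np.einsum("fij,fnj->fni", A, S_true)           # x(f, n) = A(f) s(f, n)
W = np.linalg.inv(A).conj().transpose(0, 2, 1)     # W(f) = (A(f)^{-1})^H
S_est = separate(W, X)
```

In practice the mixing matrices are unknown and W(f) has to be estimated; the inverse is used here only to confirm that the indexing matches Equation (3).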

Under the above assumption of an instantaneous mixing system, the likelihood function of the mixed signal can be expressed as Equation (5).

\[ p(\mathbf{x}(f,n)) = \bigl|\det \mathbf{W}^{\mathsf H}(f)\bigr|^{2}\, p(\mathbf{s}(f,n)) \tag{5} \]

The ILRMA and MVAE methods assume the local Gaussian model (LGM). That is, the complex spectrogram s_j(f, n) of source signal j is assumed to be a random variable that follows a complex normal distribution with mean 0 and variance v_j(f, n):

\[ s_j(f,n) \sim \mathcal{N}_{\mathbb C}\bigl(s_j(f,n) \mid 0,\ v_j(f,n)\bigr) \tag{6} \]

The variant in which the power spectrogram V_j = {v_j(f, n)}_{f,n} is expressed as a product of non-negative matrices corresponds to the ILRMA model, and the variant in which it is expressed by a CVAE decoder corresponds to the model of the MVAE method. When s_j(f, n) and s_j'(f, n), j ≠ j', are statistically independent, Equation (6) implies that s(f, n) follows

\[ \mathbf{s}(f,n) \sim \mathcal{N}_{\mathbb C}\bigl(\mathbf{s}(f,n) \mid \mathbf{0},\ \mathbf{V}(f,n)\bigr) \tag{7} \]

where V(f, n) is the diagonal matrix whose diagonal elements are v_1(f, n), ..., v_I(f, n).

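From Equations (5) and (7), the log-likelihood function of the separation matrices W = {W(f)}_f and the power spectrograms of the sources V = {v_j(f, n)}_{f,n,j}, given the observed signals X = {x(f, n)}_{f,n}, is

\[ \log p(\mathcal{X}\mid\mathcal{W},\mathcal{V}) \overset{c}{=} 2N\sum_{f}\log\bigl|\det \mathbf{W}^{\mathsf H}(f)\bigr| - \sum_{f,n,j}\Bigl(\log v_j(f,n) + \frac{|\mathbf{w}_j^{\mathsf H}(f)\,\mathbf{x}(f,n)|^{2}}{v_j(f,n)}\Bigr) \tag{8} \]

where the symbol over the equals sign denotes equality up to constant terms and N is the number of time frames.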
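The following sketch evaluates a log-likelihood of this kind numerically. It assumes the usual local Gaussian form, 2N Σ_f log|det W^H(f)| − Σ_{f,n,j} [log v_j(f, n) + |w_j^H(f) x(f, n)|² / v_j(f, n)], up to an additive constant; the function name, array shapes, and random data are illustrative assumptions. The second half shows why the likelihood alone does not fix the source order: if the variances are free to follow a swap of the outputs in some bins, the value is unchanged.

```python
import numpy as np

def log_likelihood(W, X, V):
    """Log-likelihood of W and V given X under the local Gaussian model (up to a constant).

    W: (F, I, I) complex; the columns of W[f] are w_1(f), ..., w_I(f).
    X: (F, N, I) complex observations x(f, n).
    V: (F, N, I) positive variances v_j(f, n).
    """
    N = X.shape[1]
    Y = np.einsum("fij,fni->fnj", W.conj(), X)            # y_j(f, n) = w_j(f)^H x(f, n)
    _, logabsdet = np.linalg.slogdet(W.conj().transpose(0, 2, 1))
    return 2.0 * N * logabsdet.sum() - np.sum(np.log(V) + np.abs(Y) ** 2 / V)

rng = np.random.default_rng(0)
F, N, I = 6, 10, 3
W = rng.standard_normal((F, I, I)) + 1j * rng.standard_normal((F, I, I))
X = rng.standard_normal((F, N, I)) + 1j * rng.standard_normal((F, N, I))
V = rng.uniform(0.5, 2.0, size=(F, N, I))

# Swap outputs 0 and 1 (and the matching variances) in bins 3..5 only.
perm = [1, 0, 2]
W_perm, V_perm = W.copy(), V.copy()
W_perm[3:] = W[3:][:, :, perm]
V_perm[3:] = V[3:][:, :, perm]

ll_original = log_likelihood(W, X, V)
ll_permuted = log_likelihood(W_perm, X, V_perm)
```

Both values agree, so a source model flexible enough to fit the swapped variances cannot penalize the swap.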

<Formulation of the block permutation problem>
If Equation (8) could be maximized under an ideal sound source model V, per-frequency sound source separation and permutation matching would be solved simultaneously. However, many existing sound source models can fit flexibly even to a spectrogram in which the components of some frequency band have been swapped wholesale with those of another source. As a result, separated signals whose source components differ from band to band are obtained. When the components at every frequency within a band F_k are swapped between sources in the same way, and the correct separation matrix is denoted W(f), this means that

\[ \hat{\mathbf{W}}(f) = \mathbf{W}(f)\,\mathbf{P}_k, \quad f \in F_k \tag{9} \]

that is, W(f) multiplied by a permutation matrix P_k, has been estimated as a local solution. In Equation (9), F_k is the set of frequency bins in the k-th band block, and P_k is a permutation matrix that relates the order of the correct source components to the order of the separated components in that band. The block permutation problem is the problem of estimating P_k^{-1} = P_k^T for each band k and finding the correct separation matrix by

\[ \mathbf{W}(f) = \hat{\mathbf{W}}(f)\,\mathbf{P}_k^{\mathsf T}, \quad f \in F_k \tag{10} \]

When the band is divided into individual frequency bins, the block permutation problem reduces to the ordinary permutation problem.
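A small numerical sketch of the block permutation and its inverse follows. It assumes that the permutation acts on the columns of the separation matrix (right-multiplication by P_k), which matches the convention that column j of W(f) extracts source j; the helper `permutation_matrix`, the band choice, and the random matrices are hypothetical.

```python
import numpy as np

def permutation_matrix(order):
    """Return P such that (W @ P)[:, j] == W[:, order[j]]."""
    size = len(order)
    P = np.zeros((size, size))
    P[order, np.arange(size)] = 1.0
    return P

rng = np.random.default_rng(1)
F, I = 8, 3
W_true = rng.standard_normal((F, I, I)) + 1j * rng.standard_normal((F, I, I))

band = np.arange(4, 8)                 # F_k: bins 4..7 (illustrative choice)
P_k = permutation_matrix([2, 0, 1])

# Local solution: the columns are permuted identically at every bin of the band.
W_hat = W_true.copy()
W_hat[band] = W_true[band] @ P_k

# Undo the block permutation with P_k^T = P_k^{-1}.
W_fixed = W_hat.copy()
W_fixed[band] = W_hat[band] @ P_k.T
```

Because a permutation matrix is orthogonal, multiplying by its transpose restores the original column order exactly.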

<Overview of the assignment problem and the Hungarian method>
The permutation problem can be regarded as an "assignment problem" of finding, with a suitable cost as a clue, which sound source the separated signal at each frequency corresponds to. In this embodiment, the Hungarian method, one of the methods for solving the assignment problem, is used to solve the permutation problem. The assignment problem and the Hungarian method are therefore outlined first.

The assignment problem is the problem of finding the most efficient assignment when M jobs are assigned to M workers. Let c_pq be the cost incurred when worker p does job q, and let the cost matrix with c_pq as its element in row p and column q be

\[ \mathbf{C} = (c_{pq}) \in \mathbb{R}^{M\times M} \]

where p = 1, ..., M and q = 1, ..., M are the worker and job indices, respectively. The assignment problem can then be formulated as the optimization problem of finding the assignment matrix

\[ \mathbf{A} = (a_{pq}) \in \{0,1\}^{M\times M} \]

that satisfies

\[ \min_{\mathbf{A}}\ \langle \mathbf{C}, \mathbf{A}\rangle \quad \text{s.t.}\quad \sum_{p} a_{pq} = 1,\ \ \sum_{q} a_{pq} = 1 \tag{11} \]

where ⟨·,·⟩ denotes the inner product of matrices.
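To make the formulation concrete, the sketch below solves a 3 x 3 assignment problem by enumerating all M! one-to-one assignments and returns the minimum cost together with the corresponding 0/1 assignment matrix. The cost values and the function name are made up for illustration.

```python
from itertools import permutations

def solve_by_enumeration(C):
    """Return (minimum cost, 0/1 assignment matrix) by trying all M! assignments."""
    M = len(C)
    best_cost, best_perm = float("inf"), None
    for perm in permutations(range(M)):           # worker p is given job perm[p]
        cost = sum(C[p][perm[p]] for p in range(M))
        if cost < best_cost:
            best_cost, best_perm = cost, perm
    A = [[1 if best_perm[p] == q else 0 for q in range(M)] for p in range(M)]
    return best_cost, A

# Hypothetical cost matrix: C[p][q] is the cost of worker p doing job q.
C = [[4, 1, 3],
     [2, 0, 5],
     [3, 2, 2]]
z, A = solve_by_enumeration(C)
```

Every row and every column of `A` contains exactly one 1, which is the constraint of the assignment problem, and `z` equals the inner product of `C` and `A`.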

If this optimization problem is solved by full enumeration, there are M! candidate solutions, so a combinatorial explosion occurs as M grows. The Hungarian method is an algorithm that solves this optimization problem efficiently. The Hungarian method assumes that there exist sets of real numbers Π and ∇ that satisfy

(12)

(13)

These Π and ∇ are obtained from the dual problem. The cost z of the optimization problem can be expressed as Equation (14).

\[ z = \sum_{p}\sum_{q} c_{pq}\,a_{pq} = \sum_{p}\sum_{q} (c_{pq} - u_p - r_q)\,a_{pq} + \sum_{p} u_p + \sum_{q} r_q \tag{14} \]

Here a_pq ∈ {0, 1} is the element in row p and column q of the assignment matrix, each row and each column of which sums to 1, and u_p and r_q are the real numbers associated with row p and column q, respectively.

Equation (14) shows that subtracting the constants u_p and r_q from any row and any column of the cost matrix, respectively, does not affect the optimal assignment. The Hungarian method exploits this property to find the optimal assignment while modifying the cost matrix. The specific procedure is as follows.

(Step 1) Find the minimum value of each row and subtract it from every element of that row. Then, likewise, find the minimum value of each column and subtract it from every element of that column.

(Step 2) Determine whether one zero can be selected from each row and each column of the matrix obtained after subtracting the minimum values. If so, the set of selected coordinates is the optimal assignment. If not, proceed to the next step.

(Step 3) Draw as few lines as possible along rows or columns so as to cover all zero elements of the matrix obtained after subtracting the minimum values.

(Step 4) Subtract the minimum value among the elements not covered by the lines drawn in Step 3 from those uncovered elements, add that minimum value to the elements at which a vertical line and a horizontal line drawn in Step 3 intersect, and return to Step 2.

The order of computation time is O(M!) for full enumeration, whereas it is O(M^3) for the Hungarian method. That is, the larger M becomes, the more efficiently the Hungarian method finds the optimal assignment compared with full enumeration.
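The row and column reduction of Step 1 can be checked numerically: it makes every row and column contain a zero without changing which assignment is optimal. The sketch below uses `scipy.optimize.linear_sum_assignment` as the solver; SciPy documents that routine as a modified Jonker-Volgenant algorithm rather than the textbook procedure above, so it serves here only as a stand-in that solves the same assignment problem in polynomial time. The cost matrix is made up.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Hypothetical cost matrix: C[p, q] is the cost of worker p doing job q.
C = np.array([[4.0, 1.0, 3.0],
              [2.0, 0.0, 5.0],
              [3.0, 2.0, 2.0]])

# Step 1: subtract the minimum of each row, then the minimum of each column.
reduced = C - C.min(axis=1, keepdims=True)
reduced = reduced - reduced.min(axis=0, keepdims=True)

rows, cols = linear_sum_assignment(C)               # optimal assignment for C
rows_r, cols_r = linear_sum_assignment(reduced)     # optimal assignment after Step 1
total_cost = C[rows, cols].sum()
```

Worker p is assigned job `cols[p]`; the same assignment is returned for the reduced matrix, in line with the property that subtracting row and column constants leaves the optimum unchanged.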

<Hungarian block permutation method>
Next, a block permutation matching method using the Hungarian method is described. It is referred to as the HBP (Hungarian Block Permutation) method. Conventional permutation matching methods, which use as a clue the correlation between components of the separated signals at adjacent frequencies or the direction of arrival, have drawbacks such as requiring iterative computation and requiring the microphone arrangement to be known. In contrast, the HBP method proposed in this embodiment can reuse the sound source model of the source separation algorithm as is, and is applicable even when the microphone arrangement is unknown. Specifically, the HBP method consists of three steps carried out in the middle of the source separation algorithm: (1) the high-band components of each separated signal are artificially masked (set to zero); (2) the missing components of that band are restored using the sound source model; and (3) block permutation matching is performed by the Hungarian method on the basis of the restored values.

Let R(·) be a function (missing-band complementer) that takes as input a spectrogram in which the components of some band are missing and outputs a spectrogram in which the missing region has been complemented. Let the matrix V_j whose elements are v_j(f, n), corresponding to separated signal j, be

\[ \mathbf{V}_j = \{v_j(f,n)\}_{f,n} \in \mathbb{R}^{F\times N} \]

Then the process of removing (masking) the l-th band and subsequently complementing that band can be written as

\[ \tilde{\mathbf{V}}_j^{(l)} = R(\mathbf{M}_l \odot \mathbf{V}_j) \tag{15} \]

where ⊙ denotes the element-wise product of matrices.

Letting G_l be the frequency band to be masked, M_l ∈ {0, 1}^{F×N} denotes a binary matrix of the same size as V_j in which every element of each row f ∈ G_l is 0 and every element of each row f ∉ G_l is 1.
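The binary mask and the element-wise masking step can be sketched as follows. The function name `band_mask`, the toy spectrogram, and the choice of masked bins are assumptions for illustration only.

```python
import numpy as np

def band_mask(num_bins, num_frames, masked_bins):
    """Binary matrix M_l: rows listed in masked_bins are 0, all other rows are 1."""
    M = np.ones((num_bins, num_frames))
    M[list(masked_bins), :] = 0.0
    return M

F, N = 6, 4
V_j = np.arange(1.0, F * N + 1).reshape(F, N)   # stand-in for a power spectrogram
G_l = range(3, 6)                               # mask the upper three bins
M_l = band_mask(F, N, G_l)
V_masked = M_l * V_j                            # element-wise product
```

`V_masked` is what would be handed to the missing-band complementer R(·): the masked rows are zero and all other rows are untouched.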

There are various possible choices for the concrete form of the function R(·) and for its pre-training method; as described later, the sound source model used in the source separation algorithm may be reused as is. If the missing-band complementing ability of R(·) is sufficiently high, then when a permutation mismatch has occurred in the high band of V_j, the complemented spectrogram Ṽ_j^{(l)} obtained by the processing of Equation (15) can be expected to be closer than V_j to the spectrogram that the source should originally have. Exploiting this, if an appropriate cost matrix can be designed using Ṽ_j^{(l)}, block permutation matching can be performed by applying the Hungarian method. In other words, the method proposed in this embodiment is an algorithm that performs block permutation matching within each band k using the spectrograms Ṽ_j^{(l)} in which each missing band l has been complemented. For a given (l, k), each element c^{(l,k)}_{jj'} of the cost matrix C^{(l,k)} of the assignment problem only needs to be a measure of how well Ṽ_j^{(l)} fits separated signal j' in band F_k. In this embodiment, in relation to the log-likelihood function of Equation (8), the element c^{(l,k)}_{jj'} is set to the Itakura-Saito distance, within band F_k, between the complemented spectrogram Ṽ_j^{(l)} and the spectrogram of separated signal j':

(16)

Of course, the way each element c^{(l,k)}_{jj'} of the cost matrix C^{(l,k)} is determined is not limited to this example.
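A cost matrix of this kind can be sketched as below. The Itakura-Saito divergence is taken in its standard form, D_IS(p | q) = p/q − log(p/q) − 1, summed over the time-frequency points of the band. Which spectrogram plays the role of p and which of q, the small `eps` guard, and the function names are assumptions of this example, since the exact expression of Equation (16) is not reproduced in the text.

```python
import numpy as np

def itakura_saito(p, q, eps=1e-12):
    """Sum over all elements of D_IS(p | q) = p/q - log(p/q) - 1."""
    r = (p + eps) / (q + eps)
    return float(np.sum(r - np.log(r) - 1.0))

def cost_matrix(V_tilde, V_sep, band):
    """C[j, j'] = IS distance, restricted to `band`, between separated signal j'
    and the complemented spectrogram of source j. Inputs have shape (J, F, N)."""
    J = V_sep.shape[0]
    return np.array([[itakura_saito(V_sep[jp, band], V_tilde[j, band])
                      for jp in range(J)] for j in range(J)])

rng = np.random.default_rng(0)
V_sep = rng.uniform(0.1, 1.0, size=(3, 8, 5))    # toy power spectrograms
band = np.arange(4, 8)                           # F_k: bins 4..7
C = cost_matrix(V_sep, V_sep, band)              # complemented == separated
```

When the complemented and separated spectrograms coincide, the diagonal of `C` is zero and every off-diagonal entry is positive, so the identity assignment is optimal, as expected when no block permutation has occurred.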

The way the bands F_k and the masked frequency bands G_l are chosen is not limited to any particular method. For example, F_k and G_l may be chosen at random. As another example, F_k may be chosen so that each band block is a different single frequency bin, that is, F_k = {k} (k = F_0, ..., F). As yet another example, G_l may be restricted to a single set whose missing band consists of the frequency bins from 2 kHz up to the Nyquist frequency, that is, G_l = {F_0, ..., F}.

FIG. 1 is a diagram showing the algorithm of the HBP method proposed in this embodiment. An outline of that algorithm follows. The algorithm first masks the frequency band G_l of V_j using Equation (15) and obtains the complemented spectrogram Ṽ_j^{(l)} with the missing-band complementer. It then computes the cost matrix C^{(l,k)} by Equation (16). Next, it obtains the permutation matrix P_k from the cost matrix C^{(l,k)} by the Hungarian method and obtains the correct separation matrix by Equation (10).
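The steps just outlined can be sketched end to end on toy data. Everything below is an illustrative assumption rather than the method of FIG. 1: `toy_complementer` is a hypothetical stand-in for the learned complementer R(·) that picks the best-fitting fixed spectral template, the Itakura-Saito cost uses the standard definition, and `scipy.optimize.linear_sum_assignment` replaces a hand-written Hungarian solver.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def toy_complementer(V_masked, keep, templates):
    """Hypothetical R(.): fit each template on the unmasked rows, keep the best
    fit, and fill the masked rows with that template times per-frame gains."""
    best, best_err = None, np.inf
    for t in templates:
        gain = V_masked[keep].sum(axis=0) / t[keep].sum()   # per-frame gain
        model = np.outer(t, gain)
        err = np.sum((V_masked[keep] - model[keep]) ** 2)
        if err < best_err:
            best, best_err = model, err
    out = V_masked.copy()
    out[~keep] = best[~keep]
    return out

def is_divergence(p, q, eps=1e-12):
    r = (p + eps) / (q + eps)
    return float(np.sum(r - np.log(r) - 1.0))

def hbp_orders(V, masked_band, blocks, templates):
    """V: (J, F, N) power spectrograms of the separated signals.
    Returns one array per block; order[j] is the separated signal assigned to source j."""
    J, F, _ = V.shape
    keep = np.ones(F, dtype=bool)
    keep[masked_band] = False
    mask = keep[:, None].astype(float)                      # rows of M_l
    V_tilde = np.stack([toy_complementer(mask * V[j], keep, templates)
                        for j in range(J)])
    orders = []
    for band in blocks:
        C = np.array([[is_divergence(V[jp, band], V_tilde[j, band])
                       for jp in range(J)] for j in range(J)])
        _, order = linear_sum_assignment(C)
        orders.append(order)
    return orders

# Toy data: two sources with fixed spectral shapes and random frame gains.
rng = np.random.default_rng(0)
templates = np.array([[8.0, 4.0, 2.0, 1.0, 0.5, 0.4, 0.3, 0.2],
                      [1.0, 2.0, 4.0, 8.0, 3.0, 3.0, 3.0, 3.0]])
gains = rng.uniform(0.5, 2.0, size=(2, 6))
V_true = np.stack([np.outer(templates[j], gains[j]) for j in range(2)])

high = np.arange(4, 8)
V = V_true.copy()
V[:, high] = V_true[::-1][:, high]          # block permutation in the high band

orders = hbp_orders(V, high, [high], templates)
V_fixed = V.copy()
for band, order in zip([high], orders):
    V_fixed[:, band] = V[order][:, band]    # reorder the separated signals per block
```

Reordering the columns of the estimated separation matrix in the same way, for example `W_hat[f][:, order]` for every bin f of the block, would correspond to the multiplication by P_k^T in Equation (10) under the column convention assumed here.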

<Hardware configuration>
FIG. 2 is a block diagram showing the hardware configuration of the sound source separation device 100.

As shown in FIG. 2, the sound source separation device 100 has a CPU (Central Processing Unit) 11, a ROM (Read Only Memory) 12, a RAM (Random Access Memory) 13, a storage 14, an input unit 15, a display unit 16, and a communication interface (I/F) 17. These components are connected via a bus 19 so that they can communicate with one another.

The CPU 11 is a central processing unit that executes various programs and controls each unit. That is, the CPU 11 reads a program from the ROM 12 or the storage 14 and executes it using the RAM 13 as a work area. The CPU 11 controls the above components and performs various kinds of arithmetic processing in accordance with the programs stored in the ROM 12 or the storage 14. In this embodiment, the ROM 12 or the storage 14 stores a sound source separation program for separating each sound source signal from the observed signal.

The ROM 12 stores various programs and various data. The RAM 13 temporarily stores programs or data as a work area. The storage 14 is constituted by a storage device such as an HDD (Hard Disk Drive) or an SSD (Solid State Drive) and stores various programs, including an operating system, and various data.

The input unit 15 includes a pointing device such as a mouse, and a keyboard, and is used to perform various kinds of input.

In this embodiment, the input unit 15 receives an observed signal in which a plurality of sound source signals are mixed. The observed signal received by the input unit 15 is separated into the individual sound source signals by the CPU 11.

The display unit 16 is, for example, a liquid crystal display and displays various kinds of information. The display unit 16 may adopt a touch panel system and also function as the input unit 15.

The communication interface 17 is an interface for communicating with other devices. For this communication, a wired communication standard such as Ethernet (registered trademark) or FDDI, or a wireless communication standard such as 4G, 5G, or Wi-Fi (registered trademark), is used, for example.

＜機能構成＞
次に、音源分離装置１００の機能構成について説明する。 <Functional configuration>
Next, the functional configuration of the sound source separation device 100 will be described.

図３は、音源分離装置１００の機能構成の例を示すブロック図である。 FIG. 3 is a block diagram showing an example of the functional configuration of the sound source separation device 100. As shown in FIG.

As shown in FIG. 3, the sound source separation device 100 has, as its functional configuration, a learning unit 101, a model storage unit 102, a masking unit 103, a spectrogram generation unit 104, a band allocation unit 105, a separation matrix correction unit 106, and a sound source separation unit 107. Each functional component is realized by the CPU 11 reading the sound source separation program stored in the ROM 12 or the storage 14, loading it into the RAM 13, and executing it.

The learning unit 101 performs machine learning of the sound source model used for correcting the separation matrix. The machine learning process for the sound source model in this embodiment is described below, but the process is not limited to the one described.

For machine learning by the learning unit 101, a large number of triples are prepared in advance, each consisting of a spectrogram S of clean speech, a spectrogram S' obtained by masking a suitable band of S, and a speaker label c. For each triple, the learning unit 101 trains the sound source model with S' or S as the input and S as the target value. When S' is the input, the following two quantities are defined as reconstruction errors.

(17)

(18)
Here p^+_θ(S|z, c), q^+_φ(z|S), and r^+_ψ(c|S) denote the decoder distribution, the encoder distribution, and the class discriminator distribution, respectively, and θ, φ, and ψ are the parameters of the corresponding networks. The sound source model trained by the learning unit 101 can be used as the missing-band interpolator R(·).

The model storage unit 102 stores the sound source model used for correcting the separation matrix. The sound source model may be one prepared in advance or one obtained through machine learning by the learning unit 101.

The masking unit 103 takes the separated signal, which is obtained by separating with a separation matrix an observed signal in which constituent sounds are mixed, and masks each of a plurality of predetermined, mutually different frequency bands in turn. For example, the masking unit 103 masks the high-band components of the separated signal.
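The masking step can be illustrated with a short NumPy sketch. This is only an illustration under stated assumptions: the text does not say how a band is masked, so the sketch simply sets the masked bins to zero, and the function name `mask_bands` and the `(start, stop)` band layout are hypothetical.

```python
import numpy as np

def mask_bands(spec, band_edges):
    """Return one masked copy of `spec` per band, with that band zeroed.

    spec:       (n_freq, n_frames) magnitude or power spectrogram.
    band_edges: list of (start, stop) frequency-bin ranges (hypothetical layout).
    Zeroing the band is an assumption; the source does not specify the mask.
    """
    masked = []
    for start, stop in band_edges:
        m = spec.copy()          # leave the input spectrogram untouched
        m[start:stop, :] = 0.0   # drop the band that is to be re-estimated
        masked.append(m)
    return masked

# Toy spectrogram with 6 frequency bins and 4 frames, split into 3 bands.
spec = np.arange(1, 25, dtype=float).reshape(6, 4)
bands = [(0, 2), (2, 4), (4, 6)]
masked = mask_bands(spec, bands)
```

Each element of `masked` is a candidate input to the sound source model, which is then expected to fill in the band that was removed.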

The spectrogram generation unit 104 uses the separated signal masked by the masking unit 103 and the predetermined sound source model stored in the model storage unit 102 to generate a complementary spectrogram in which the portion lost through masking is filled in. The spectrogram generation unit 104 also generates a separated signal spectrogram for each frequency band of the separated signal.

The band allocation unit 105 assigns a frequency band to each of the separated signal spectrograms generated by the spectrogram generation unit 104 so that its distance to the complementary spectrogram becomes small. Specifically, the band allocation unit 105 performs this assignment by computing the cost matrix C^{(l, k)} according to Equation (16) above.
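As a rough illustration of how such a cost matrix might be built, the sketch below compares every complementary spectrogram with every separated signal spectrogram inside one band. Equation (16) is not reproduced in this section, so the squared Euclidean distance used here is an assumed stand-in rather than the patent's actual cost, and the function name and array layout are hypothetical.

```python
import numpy as np

def band_cost_matrix(completed, separated, band):
    """Pairwise distance between completed and separated spectrograms in a band.

    completed, separated: (n_src, n_freq, n_frames) arrays.
    band: (start, stop) frequency-bin range (hypothetical layout).
    Returns cost[i, j]: distance between completed source i and separated
    output j inside the band (squared Euclidean distance, an assumption).
    """
    lo, hi = band
    a = completed[:, lo:hi, :].reshape(completed.shape[0], -1)
    b = separated[:, lo:hi, :].reshape(separated.shape[0], -1)
    diff = a[:, None, :] - b[None, :, :]   # shape (n_src, n_src, n_bins * n_frames)
    return np.sum(diff ** 2, axis=-1)
```

A small entry `cost[i, j]` suggests that, within this band, separated output `j` belongs to the source whose other bands appear in output `i`.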

The separation matrix correction unit 106 corrects the separation matrix so as to realize the reordering of frequency bands that corresponds to the assignment made by the band allocation unit 105. Specifically, the separation matrix correction unit 106 obtains the permutation matrix P_k from the cost matrix C^{(l, k)} using the Hungarian method, and corrects the separation matrix by obtaining the correct separation matrix according to Equation (10).
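The assignment step can be sketched with SciPy's `linear_sum_assignment`, which solves the same linear assignment problem that the Hungarian method addresses. Equation (10) is not shown in this section, so the sketch assumes that the correction amounts to left-multiplying the separation matrices of the affected band by the permutation matrix; the function names and the `(n_freq, n_src, n_src)` layout are hypothetical.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def permutation_from_cost(cost):
    """Minimum-cost assignment expressed as a permutation matrix."""
    rows, cols = linear_sum_assignment(cost)
    P = np.zeros_like(cost, dtype=float)
    P[rows, cols] = 1.0          # P[i, j] = 1: output j moves to slot i
    return P

def apply_permutation(W, P, band):
    """Reorder the rows of the separation matrices inside one band.

    W: (n_freq, n_src, n_src) separation matrices, one per frequency bin.
    Assumes the correction is W_k <- P_k W_k (a stand-in for Equation (10)).
    """
    lo, hi = band
    W_fixed = W.copy()
    W_fixed[lo:hi] = P @ W[lo:hi]   # matmul broadcasts P over the bins
    return W_fixed
```

Because the separated signal at each bin is the separation matrix times the observation, permuting the rows of the matrix permutes the outputs in the same way, so no re-separation of earlier results is needed to realign a band.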

The sound source separation unit 107 uses BSS with a separation matrix to separate an observed signal, in which a plurality of sound source signals are mixed, into the individual sound source signals. In this embodiment, the sound source separation unit 107 performs this separation using, as the separation matrix, the one corrected by the separation matrix correction unit 106 at a predetermined timing.
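In the frequency domain, applying a separation matrix is a per-bin matrix-vector product. The following sketch shows that operation only; the array layout and the function name `separate` are assumptions made for illustration, not details taken from the patent.

```python
import numpy as np

def separate(W, X):
    """Apply per-frequency separation matrices to an observed STFT.

    W: (n_freq, n_src, n_mic) separation matrices.
    X: (n_freq, n_frames, n_mic) observed signal in the STFT domain.
    Returns Y: (n_freq, n_frames, n_src) with Y[f, t] = W[f] @ X[f, t].
    """
    return np.einsum("fsm,ftm->fts", W, X)
```

If the observation is an instantaneous mixture with an invertible mixing matrix per bin, passing the inverse of that matrix as `W` recovers the sources exactly, which is a convenient sanity check for the indexing.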

With this configuration, the sound source separation device 100 can correct the separation matrix used to separate an observed signal in which constituent sounds are mixed, so that the separated signals are consistently combined along the frequency direction. By correcting the separation matrix in this way, the sound source separation device 100 can obtain separated signals more accurately than when the separation matrix is not corrected.

<Operation>
Next, the operation of the sound source separation device 100 will be described.

FIG. 4 is a flowchart showing the flow of the sound source separation processing performed by the sound source separation device 100. The sound source separation processing is performed by the CPU 11 reading the sound source separation program from the ROM 12 or the storage 14, loading it into the RAM 13, and executing it.

In step S101, the CPU 11 takes the separated signal, which is obtained by separating with a separation matrix an observed signal in which constituent sounds are mixed, and masks each of a plurality of predetermined, mutually different frequency bands in turn. Specifically, the CPU 11 masks the high-band components of the separated signal.

Following step S101, in step S102, the CPU 11 uses a predetermined sound source model to generate a complementary spectrogram in which the portion lost through masking is filled in. The sound source model may be one prepared in advance or one obtained through machine learning by the CPU 11.

Following step S102, in step S103, the CPU 11 generates a separated signal spectrogram for each frequency band of the separated signal. The order of steps S102 and S103 may be reversed.

Following step S103, in step S104, the CPU 11 assigns a frequency band to each of the separated signal spectrograms generated in step S103 so that its distance to the complementary spectrogram becomes small. Specifically, the CPU 11 performs this assignment by computing the cost matrix C^{(l, k)} according to Equation (16) above.

Following step S104, in step S105, the CPU 11 corrects the separation matrix so as to realize the reordering of frequency bands that corresponds to the assignment made in step S104. Specifically, the CPU 11 obtains the permutation matrix P_k from the cost matrix C^{(l, k)} using the Hungarian method, and corrects the separation matrix by obtaining the correct separation matrix according to Equation (10).

When separating an observed signal in which constituent sounds are mixed, the CPU 11 uses the separation matrix corrected by the series of processes in steps S101 to S105, and can thereby obtain separated signals more accurately than when that series of processes is not performed.

<Effect>
To verify the speech separation performance of the sound source separation device 100 according to this embodiment, a speaker-independent separation experiment was conducted using the WSJ0 speech database. About 25 hours of data from the 101 speakers in the si_tr_s folder of the WSJ0 database were used as training data, and the data of the 18 speakers in the si_dt_05 and si_et_05 folders were used to create the evaluation data. For verification, mixed signals were created with the number of sound sources set to {2, 3, 6, 9, 12, 15, 18}. Impulse responses were generated by the image method, with a wall reflection coefficient of 0.2. FIG. 5 is a diagram showing the arrangement of the microphones and sound sources in the separation experiment. Ten mixed signals were created for each condition. In addition, ten mixed signals were created for each condition from speech in which every utterance was repeated. All speech signals were sampled at 16 kHz, and spectrograms were computed by a short-time Fourier transform with a frame length of 256 ms and a shift of 128 ms.

The FastMVAE and FastMVAE2 methods are accelerated versions of the MVAE method. To realize a fast separation algorithm, both methods pretrain a generative model of sound source spectrograms together with the inference process for its latent variables: the former uses a VAE with a class discriminator (Auxiliary Classifier VAE: ACVAE), and the latter uses ChimeraACVAE, which integrates the encoder and the class discriminator of the ACVAE. This separation experiment used the ChimeraACVAE network structure. The algorithm was run for 60 iterations, and the HBP method of this embodiment was applied every 10 iterations. The Source-to-Distortion Ratio (SDR) was used as the evaluation criterion.
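To make the evaluation criterion concrete, the sketch below computes a simplified SDR in decibels as the ratio of reference energy to error energy, plus the improvement over the unprocessed mixture. This is a deliberately reduced form given as an assumption: the text does not state which SDR implementation was used, and toolkit implementations of SDR typically allow for a filtered version of the reference before measuring distortion. The function names are hypothetical.

```python
import numpy as np

def sdr_db(reference, estimate):
    """Simplified SDR in dB: reference energy over error energy."""
    error = estimate - reference
    return 10.0 * np.log10(np.sum(reference ** 2) / np.sum(error ** 2))

def sdr_improvement_db(reference, estimate, mixture):
    """SDR of the separated estimate minus SDR of the raw mixture (input SDR)."""
    return sdr_db(reference, estimate) - sdr_db(reference, mixture)
```

An estimate that equals the reference plus 10 percent of it in amplitude gives an error energy of one hundredth of the reference energy, which corresponds to an SDR of about 20 dB.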

Table 1 shows the average SDR values.

Table 1 confirms that the HBP method improved sound source separation performance for every number of sound sources. The improvement was also found to be larger for the data with repetition. This is thought to be because the silent intervals decreased and the regions containing articulatory structure, which serve as clues for resolving the permutation, increased. FIG. 6 is a diagram showing an example of separated signals for nine sound sources. From top to bottom, FIG. 6 shows the spectrograms of the ground-truth signal, the signal separated by FastMVAE2 without the HBP method, and the signal separated by FastMVAE2 with the HBP method as generated by the sound source separation device 100. The input SDR and the SDR improvement are shown above each spectrogram.

As described above, according to the embodiment of the present disclosure, even when the microphone arrangement is unknown, making the separated signals consistent along the frequency direction yields separated signals more accurately than when they are not made consistent.

Note that the sound source separation processing, which in each of the above embodiments is executed by the CPU reading software (a program), may be executed by various processors other than a CPU. Examples of such processors include a PLD (Programmable Logic Device) whose circuit configuration can be changed after manufacture, such as an FPGA (Field-Programmable Gate Array), and a dedicated electric circuit, which is a processor with a circuit configuration designed specifically to execute particular processing, such as an ASIC (Application Specific Integrated Circuit). The sound source separation processing may be executed by one of these processors, or by a combination of two or more processors of the same or different types (for example, a plurality of FPGAs, or a combination of a CPU and an FPGA). More specifically, the hardware structure of these processors is an electric circuit that combines circuit elements such as semiconductor elements.

In each of the above embodiments, the sound source separation program is described as being stored (installed) in the storage 14 in advance, but the present disclosure is not limited to this. The program may be provided in a form stored on a non-transitory storage medium such as a CD-ROM (Compact Disk Read Only Memory), a DVD-ROM (Digital Versatile Disk Read Only Memory), or a USB (Universal Serial Bus) memory. The program may also be downloaded from an external device via a network.

The following supplementary notes are further disclosed with respect to the above embodiments.
(Appendix 1)
A sound source separation device comprising:
a memory; and
at least one processor connected to the memory,
wherein the processor is configured to:
mask each of a plurality of predetermined different frequency bands of a separated signal obtained by separating, with a separation matrix, an observed signal in which constituent sounds are mixed;
generate a complementary spectrogram using the masked separated signal and a predetermined sound source model, and generate a separated signal spectrogram in each of the frequency bands of the separated signal; and
correct the separation matrix so as to realize a reordering of frequency bands to the relocation destinations assigned to each of the separated signal spectrograms such that its distance to the complementary spectrogram becomes small.

(Appendix 2)
A non-transitory storage medium storing a program executable by a computer to perform sound source separation processing,
wherein the sound source separation processing comprises:
masking each of a plurality of predetermined different frequency bands of a separated signal obtained by separating, with a separation matrix, an observed signal in which constituent sounds are mixed;
generating a complementary spectrogram using the masked separated signal and a predetermined sound source model, and generating a separated signal spectrogram in each of the frequency bands of the separated signal; and
correcting the separation matrix so as to realize a reordering of frequency bands to the relocation destinations assigned to each of the separated signal spectrograms such that its distance to the complementary spectrogram becomes small.

100 Sound source separation device
101 Learning unit
102 Model storage unit
103 Masking unit
104 Spectrogram generation unit
105 Band allocation unit
106 Separation matrix correction unit
107 Sound source separation unit

Claims

1. A sound source separation device comprising:
a masking unit that masks each of a plurality of predetermined different frequency bands of a separated signal obtained by separating, with a separation matrix, an observed signal in which a plurality of constituent sounds are mixed;
a spectrogram generation unit that generates a complementary spectrogram using the separated signal masked by the masking unit and a predetermined sound source model, and generates a separated signal spectrogram in each of the frequency bands of the separated signal; and
a separation matrix correction unit that corrects the separation matrix so as to realize a reordering of frequency bands to the relocation destinations assigned to each of the separated signal spectrograms such that its distance to the complementary spectrogram becomes small.

2. The sound source separation device according to claim 1, wherein the separation matrix correction unit obtains a permutation matrix by applying the Hungarian method to a cost matrix generated using the complementary spectrogram, and corrects the separation matrix using the permutation matrix.

3. The sound source separation device according to claim 1, further comprising a sound source separation unit that separates the observed signal using the separation matrix corrected by the separation matrix correction unit.

4. The sound source separation device according to any one of claims 1 to 3, further comprising a learning unit that obtains the sound source model by predetermined machine learning.

5. The sound source separation device according to claim 4, wherein the learning unit receives as input a spectrogram of clean speech or a spectrogram obtained by masking part of the band of the spectrogram of clean speech, and learns the sound source model using the spectrogram of clean speech as a target value.

6. A sound source separation method in which a computer executes processing comprising:
masking each of a plurality of predetermined different frequency bands of a separated signal obtained by separating, with a separation matrix, an observed signal in which a plurality of constituent sounds are mixed;
generating a complementary spectrogram using the masked separated signal and a predetermined sound source model, and generating a separated signal spectrogram in each of the frequency bands of the separated signal; and
correcting the separation matrix so as to realize a reordering of frequency bands to the relocation destinations assigned to each of the separated signal spectrograms such that its distance to the complementary spectrogram becomes small.

7. A sound source separation program for causing a computer to function as the sound source separation device according to any one of claims 1 to 5.