JP6992873B2

JP6992873B2 - Sound source separation device, sound source separation method and program

Info

Publication number: JP6992873B2
Application number: JP2020504518A
Authority: JP
Inventors: 孝文越仲; 隆之鈴木; 薫鯉田
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 2018-03-06
Filing date: 2018-03-06
Publication date: 2022-01-13
Anticipated expiration: 2038-03-06
Also published as: WO2019171457A1; JPWO2019171457A1

Description

The present disclosure relates to a sound source separation device, a sound source separation method, and a non-transitory computer-readable medium storing a program.

Techniques are being studied for separating a mixed signal that contains a plurality of sound source signals, such as speech uttered simultaneously by a plurality of speakers, into the individual sound source signals. In this connection, Patent Document 1 discloses an acoustic processing device that stores a first basis matrix representing the acoustic features of a first sound source, which is learned in advance and does not contain the sound of a second sound source. Using non-negative matrix factorization with the first basis matrix, the acoustic processing device generates a second basis matrix and a second coefficient matrix from an observation matrix representing the time series of the spectrum of an acoustic signal in which the sound of the first sound source and the sound of the second sound source are mixed. The acoustic processing device then generates the acoustic signals of the first sound source and the second sound source, using an acoustic signal corresponding to the first basis matrix and a first coefficient matrix together with the second basis matrix and the second coefficient matrix.

Non-Patent Document 1 is also related to the above technique. It discloses a sound source separation method that treats the speech of each speaker as a sound source and separates speech uttered simultaneously by a plurality of speakers into the speech of the individual speakers. The method receives a single-channel mixed signal, converts it into a time-frequency representation (spectrogram), and extracts a feature vector from each time-frequency bin using a deep neural network. The extracted feature vectors are then clustered so that the time-frequency bins are classified into as many clusters as the target number of sound sources (speakers), and for each cluster a sound source signal for the corresponding speaker is created from a spectrogram reconstructed from the time-frequency bins in that cluster.

The deep neural network disclosed in Non-Patent Document 1 is prepared by training (learning) in advance. The training data is a large collection of sound source signals spoken by various speakers. These are all independent sound source signals, not mixed signals in which several speakers talk at once. In Non-Patent Document 1, a short-time Fourier transform is first applied to the training data to convert each sound source signal into a spectrogram. Next, the spectrograms of two sound source signals are superimposed to generate the spectrogram of a mixed signal, and each time-frequency bin is given a speaker label indicating the speaker to which it belongs. The speaker label is determined from the amplitudes of the original individual sound source signals: the time-frequency bin is assigned to the speaker with the larger amplitude. A feature vector is then extracted from each time-frequency bin using the deep neural network as it stands at that point. Finally, a loss function that measures consistency with the speaker labels is computed, and the parameters of the feature-extracting deep neural network are updated so that the loss decreases.
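For reference, the loss used in the cited deep clustering paper (Non-Patent Document 1) compares pairwise affinities of the feature vectors with those of the labels. Here V is the matrix whose rows are the unit-norm feature vectors v_i of all time-frequency bins, and Y is the matrix whose rows are the one-hot speaker labels y_i. The expression below comes from the cited paper, not from the patent text itself, which only describes the loss in words.

```latex
\mathcal{C}_Y(V) = \left\| V V^{\mathsf T} - Y Y^{\mathsf T} \right\|_F^2
                 = \sum_{i,j} \left( \langle v_i, v_j \rangle - \langle y_i, y_j \rangle \right)^2
```

The loss is small when bins with the same label have similar feature vectors and bins with different labels have nearly orthogonal ones.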

Patent Document 1: Japanese Unexamined Patent Application Publication No. 2013-33196

Non-Patent Document 1: J. R. Hershey, Z. Chen, J. Le Roux, and S. Watanabe, “Deep clustering: Discriminative embeddings for segmentation and separation,” in Proc. of the 41st IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2016), Mar. 2016.

The technique disclosed in Patent Document 1 separates individual sound sources from a mixed signal of a plurality of sound sources by using a first sound source, learned in advance, as a reference. The technique disclosed in Non-Patent Document 1 separates individual sound sources by using mixed signals generated artificially by superimposing two or more different sound source signals. In other words, the training data used in Patent Document 1 and Non-Patent Document 1 differ from mixed signals that are actually observed.

In a real environment, noise and reverberation are usually present, so an actually observed mixed signal differs from a simple superposition of the spectrograms of the individual sound source signals. One reason is that the amplitude ratio at which the sound source signals are superimposed depends on factors such as the positions of the microphone and the speakers, so the observed mixture is not necessarily the same across observations. Another is that speakers interact in conversation, so the observed mixture is not necessarily constant over time either. Therefore, unless the training data consists of actually observed mixed signals, a mismatch arises between the training data and the mixed signals that are actually observed. When such a mismatch arises, appropriate learning cannot be performed, and without appropriate learning the individual sound source signals cannot be separated from the mixed signal with high accuracy. That is, because the techniques disclosed in Patent Document 1 and Non-Patent Document 1 do not use actually observed mixed signals as training data, they cannot separate sound sources accurately from actually observed mixed signals.

The present disclosure has been made to solve this problem, and its object is to provide a sound source separation device, a sound source separation method, and a non-transitory computer-readable medium storing a program that can accurately separate individual sound source signals from a mixed signal.

A sound source separation device according to the present disclosure includes: feature extraction means for extracting, for each time-frequency bin of a spectrogram obtained by converting a mixed signal in which a plurality of sound source signals are mixed, a feature vector using a feature extractor to which parameters used for feature extraction are applied; clustering means for classifying the extracted feature vectors into a plurality of clusters; separation means for generating a sound source signal for each classified cluster using the time-frequency bins included in each of the plurality of clusters; and parameter update means for updating the parameters based on a learning mixed signal that includes an observed mixed signal.

A sound source separation method according to the present disclosure includes: extracting, for each time-frequency bin of a spectrogram obtained by converting a mixed signal in which a plurality of sound source signals are mixed, a feature vector using a feature extractor to which parameters used for feature extraction are applied; classifying the extracted feature vectors into a plurality of clusters; generating a sound source signal for each classified cluster using the time-frequency bins included in each of the plurality of clusters; and updating the parameters based on a learning mixed signal that includes an observed mixed signal.

A non-transitory computer-readable medium according to the present disclosure stores a program that causes a computer to execute: extracting, for each time-frequency bin of a spectrogram obtained by converting a mixed signal in which a plurality of sound source signals are mixed, a feature vector using a feature extractor to which parameters used for feature extraction are applied; classifying the extracted feature vectors into a plurality of clusters; generating a sound source signal for each classified cluster using the time-frequency bins included in each of the plurality of clusters; and updating the parameters based on a learning mixed signal that includes an observed mixed signal.

According to the present disclosure, it is possible to provide a sound source separation device, a sound source separation method, and a non-transitory computer-readable medium storing a program that can accurately separate individual sound source signals from a mixed signal.

A diagram showing an outline of the sound source separation device 1 according to an embodiment of the present disclosure.
A configuration diagram showing a configuration example of the sound source separation device according to Embodiment 1.
A diagram explaining sound source labels in the related art.
A diagram explaining sound source labels in Embodiment 1.
A flowchart showing an operation example of the sound source separation device according to Embodiment 1.
A flowchart showing an operation example of the sound source separation device according to Embodiment 1.
A flowchart showing an operation example of the sound source separation device according to Embodiment 1.
A configuration diagram showing a configuration example of the sound source separation device according to Embodiment 2.
A diagram showing a configuration example of the sound source separation device according to other embodiments.

(Outline of Embodiment)
Before the embodiments of the present disclosure are described, an outline of the embodiments is given. FIG. 1 is a diagram showing an outline of the sound source separation device 1 according to an embodiment of the present disclosure.

The sound source separation device 1 includes a feature extraction unit 2 that functions as feature extraction means, a clustering unit 3 that functions as clustering means, a separation unit 4 that functions as separation means, and a parameter update unit 5 that functions as parameter update means.

The feature extraction unit 2 extracts a feature vector for each time-frequency bin of a spectrogram obtained by converting a mixed signal in which a plurality of sound source signals are mixed, using a feature extractor to which the parameters used for feature extraction are applied.

The clustering unit 3 classifies the extracted feature vectors into a plurality of clusters.
The separation unit 4 generates a sound source signal for each classified cluster, using the time-frequency bins included in each of the plurality of clusters.
The parameter update unit 5 updates the parameters based on a learning mixed signal that includes an observed mixed signal.
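The clustering and separation steps can be illustrated with a minimal sketch. It assumes plain k-means with caller-supplied initial centers (the text does not name a clustering algorithm), uses hypothetical names `kmeans` and `separate`, and stops at the per-cluster spectrogram; converting each masked spectrogram back to a waveform is omitted.

```python
from typing import Dict, List, Sequence, Tuple

Bin = Tuple[int, int]  # a time-frequency bin (t, f)

def kmeans(points: Sequence[Sequence[float]], centers: List[List[float]],
           iters: int = 10) -> List[int]:
    """Plain k-means; returns the cluster index assigned to every point."""
    assign = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest center by squared Euclidean distance.
        for i, p in enumerate(points):
            assign[i] = min(
                range(len(centers)),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])),
            )
        # Update step: move each center to the mean of its members.
        for c in range(len(centers)):
            members = [points[i] for i in range(len(points)) if assign[i] == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign

def separate(spec: Dict[Bin, float], feats: Dict[Bin, Sequence[float]],
             init_centers: Sequence[Sequence[float]]) -> List[Dict[Bin, float]]:
    """Split a spectrogram into one partial spectrogram per cluster."""
    bins = sorted(spec)
    assign = kmeans([feats[b] for b in bins], [list(c) for c in init_centers])
    out: List[Dict[Bin, float]] = [dict() for _ in init_centers]
    for b, c in zip(bins, assign):
        out[c][b] = spec[b]  # each bin goes to exactly one cluster
    return out

# Toy data: four bins whose feature vectors form two obvious groups.
spec = {(0, 0): 1.0, (0, 1): 2.0, (1, 0): 3.0, (1, 1): 4.0}
feats = {(0, 0): (0.0, 0.1), (0, 1): (1.0, 0.9), (1, 0): (0.1, 0.0), (1, 1): (0.9, 1.0)}
parts = separate(spec, feats, init_centers=[(0.0, 0.0), (1.0, 1.0)])
```

In this toy case the bins with features near (0, 0) end up in the first partial spectrogram and those near (1, 1) in the second.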

The sound source separation device 1 according to the embodiment uses, as the learning mixed signal, an actually observed mixed signal rather than a mixed signal in which a plurality of sound source signals are artificially superimposed. As a result, the device can obtain feature vectors suited to the mixed signals it is asked to separate, and can therefore separate a mixed signal into the individual sound source signals with high accuracy.

The same accuracy in separating individual sound source signals from a mixed signal can be obtained with the sound source separation method performed by the sound source separation device 1, and with a program that executes that method.

(Embodiment 1)
Embodiments of the present disclosure are described below with reference to the drawings.
<Configuration example of the sound source separation device>
First, a configuration example of the sound source separation device 10 according to Embodiment 1 is described with reference to FIG. 2. FIG. 2 is a configuration diagram showing a configuration example of the sound source separation device according to Embodiment 1.

The sound source separation device 10 may be a computer such as a server device or a personal computer. The sound source separation device 10 includes a learning mixed signal storage unit 11, a learning label data storage unit 12, a feature extractor learning unit 13, and a sound source separation unit 14. These units function as learning mixed signal storage means, learning label data storage means, feature extractor learning means, and sound source separation means, respectively.

The learning mixed signal storage unit 11 stores, as training data, mixed signals that were actually observed and acquired in advance. A learning mixed signal is a signal emitted from a plurality of sound sources, for example audio data in which the speech of a plurality of speakers is recorded in monaural (single channel). In the present embodiment, unlike the learning mixed signals of the related art shown in Non-Patent Document 1, the learning mixed signal is not a mixture in which a plurality of sound sources are artificially superimposed but a mixed signal that was actually observed. The mixed signal may be, for example, uncompressed linear PCM (Pulse Code Modulation) with a sampling frequency of 8 kHz and a sample size of 16 bits. The format of the mixed signal is not limited to this and may be another format.

The learning label data storage unit 12 stores label data representing the time intervals of each sound source signal, determined in advance by analyzing the mixed signals stored in the learning mixed signal storage unit 11. Specifically, the label data indicates, for each mixed signal, the time intervals in which each sound source is present; each entry associates a sound source type with the start and end of a time interval. For example, if analysis shows that a mixed signal contains the sound source of speaker A from O minutes o seconds to P minutes p seconds, the label data is set as sound source type: speaker A, start: O minutes o seconds, end: P minutes p seconds. This label data is only an example, and other information may also be set.
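One way to picture such label data is as a list of records, each holding a source type and an interval. The sketch below is illustrative only: the class name `SourceInterval`, its field names, the use of seconds, and the half-open interval convention are all assumptions, not part of the patent.

```python
from dataclasses import dataclass
from typing import List

@dataclass(frozen=True)
class SourceInterval:
    source: str     # sound source type, e.g. a speaker identifier
    start_s: float  # start of the time interval, in seconds
    end_s: float    # end of the time interval, in seconds (exclusive)

def active_sources(labels: List[SourceInterval], t: float) -> List[str]:
    """Return the sources whose labelled interval contains time t."""
    return sorted({iv.source for iv in labels if iv.start_s <= t < iv.end_s})

# Hypothetical label data for one mixed signal with two speakers.
labels = [
    SourceInterval("speaker_A", 0.0, 4.2),
    SourceInterval("speaker_B", 3.5, 9.0),
]
```

Querying a time inside the overlap (for example 3.8 s) returns both speakers, which is the situation the supervised labelling later treats differently from single-speaker intervals.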

The feature extractor learning unit 13 learns the feature extraction parameters, that is, the parameters applied to the feature extractor and used when extracting features. The feature extractor may be a neural network, or another algorithm may be used. In the following description, the feature extractor is sometimes described as a neural network.
The sound source separation unit 14 separates a mixed signal in which a plurality of sound sources are mixed into individual sound source signals using the neural network that serves as the feature extractor.

Next, the feature extractor learning unit 13 and the sound source separation unit 14 are described in detail.
The feature extractor learning unit 13 includes a feature extraction unit 101, a feature extraction parameter storage unit 102, a parameter update unit 103, a supervised loss function calculation unit 104, and an unsupervised loss function calculation unit 105. The feature extraction unit 101, the feature extraction parameter storage unit 102, and the parameter update unit 103 function as feature extraction means, feature extraction parameter storage means, and parameter update means, respectively. The supervised loss function calculation unit 104 and the unsupervised loss function calculation unit 105 function as supervised loss function calculation means and unsupervised loss function calculation means, respectively. The feature extraction unit 101 and the feature extraction parameter storage unit 102 are functional units shared with the sound source separation unit 14.

The feature extraction unit 101 corresponds to the feature extraction unit 2 in the outline of the embodiment. The feature extraction unit 101 acquires all the mixed signals stored in the learning mixed signal storage unit 11 and applies a short-time Fourier transform (STFT: Short-Time Fourier Transform) to each of them to convert it into a time-frequency representation (spectrogram). The feature extraction unit 101 likewise applies a short-time Fourier transform to a mixed signal that is to be separated into individual sound source signals, converting it into a spectrogram.
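The conversion to a spectrogram can be sketched in pure Python as follows. This is a minimal illustration, not the patent's implementation: the function name `stft_magnitude`, the Hann window, and the direct DFT are assumptions, and the window length of 256 samples and hop of 64 samples at 8 kHz are used only as illustrative values.

```python
import cmath
import math
from typing import List

def stft_magnitude(signal: List[float], n_fft: int, hop: int) -> List[List[float]]:
    """Magnitude spectrogram: one row per frame, one column per DFT bin."""
    # Hann window (an assumption; the text does not name a window function).
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / n_fft) for n in range(n_fft)]
    # Precompute the complex exponentials used by the DFT.
    twiddle = [cmath.exp(-2j * math.pi * k / n_fft) for k in range(n_fft)]
    frames = []
    for start in range(0, len(signal) - n_fft + 1, hop):
        frame = [signal[start + n] * window[n] for n in range(n_fft)]
        frames.append([
            abs(sum(frame[n] * twiddle[(k * n) % n_fft] for n in range(n_fft)))
            for k in range(n_fft)
        ])
    return frames

fs, n_fft, hop = 8000, 256, 64            # illustrative values
tone = [math.sin(2 * math.pi * 1000 * n / fs) for n in range(512)]
spec = stft_magnitude(tone, n_fft, hop)   # 5 frames x 256 bins
peak_bin = max(range(n_fft // 2), key=lambda k: spec[0][k])
```

With these values each bin is fs / n_fft = 31.25 Hz wide, so the 1 kHz test tone peaks at bin 32, and a hop of 64 samples yields 125 frames per second of audio.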

The feature extraction unit 101 also acquires, from the feature extraction parameter storage unit 102, the feature extraction parameters to be applied to the feature extractor (neural network). The feature extraction unit 101 divides the spectrogram converted from the mixed signal into a predetermined number of time-frequency bins and inputs the partial spectrogram corresponding to each time-frequency bin into the neural network. The output of the neural network is taken as the feature vector.

In the present disclosure, a time-frequency bin produced by the feature extraction unit 101 is denoted (t, f), the partial spectrogram corresponding to time-frequency bin (t, f) is denoted x(t, f), and the feature vector is denoted v_{t,f}. The term "time-frequency bin" may also be written without the hyphen as "time frequency bin".

For example, suppose the mixed signal is uncompressed linear PCM with a sampling frequency of 8 kHz and a sample size of 16 bits. The spectrogram is then obtained by applying the short-time Fourier transform with, for example, a window width of 32 msec (256 points) per frame, shifted by 8 msec (64 points) at a time. In this case the frequency resolution is 31.25 Hz (8 kHz divided into 256 points), and the number of time-frequency bins is 125 per second in the time direction and 256 in the frequency direction. When obtaining the feature vector v_{t,f} for time-frequency bin (t, f), it is effective to take into account the context before and after (t, f). For example, a 100-dimensional vector collecting the bins of 100 frames including t is given to the neural network as the input x_{t,f}. The output v_{t,f} of the neural network usually has a lower dimension than the input; for example, when the input has 100 dimensions, the output may be set to about 50 dimensions.
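The shapes involved in this step can be made concrete with a small sketch. Everything here is a stand-in: `context_input`, `make_projection`, and `extract_feature` are hypothetical names, centering the 100-frame context on t and clamping at the edges are assumptions, and a single random linear layer with tanh takes the place of the trained neural network.

```python
import math
import random
from typing import List

def context_input(spec: List[List[float]], t: int, f: int, width: int = 100) -> List[float]:
    """Collect `width` frames around frame t at frequency bin f (edges clamped)."""
    first = t - width // 2
    last_frame = len(spec) - 1
    return [spec[min(max(first + i, 0), last_frame)][f] for i in range(width)]

def make_projection(in_dim: int, out_dim: int, seed: int = 0) -> List[List[float]]:
    """Random weights standing in for the learned feature extraction parameters."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]

def extract_feature(x: List[float], weights: List[List[float]]) -> List[float]:
    """One linear layer followed by tanh, as a placeholder for the real extractor."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in weights]

# Toy spectrogram: 200 frames, 1 frequency bin, value equal to the frame index.
spec = [[float(t)] for t in range(200)]
x = context_input(spec, t=60, f=0)                   # 100-dimensional input
weights = make_projection(in_dim=100, out_dim=50)
v = extract_feature(x, weights)                      # 50-dimensional feature vector
```

The point of the sketch is only the mapping from a 100-dimensional context vector to a 50-dimensional feature vector; a real extractor would be a deeper network whose weights are learned as described in the following paragraphs.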

The feature extraction parameter storage unit 102 stores the feature extraction parameters that the feature extraction unit 101 uses when extracting feature vectors. Specifically, it stores the feature extraction parameters determined by the parameter update unit 103, described later. In the initial stage, when the feature extraction parameters have not yet been determined, it stores feature extraction parameters that the parameter update unit 103 has initialized, for example by generating random numbers.

The parameter update unit 103 corresponds to the parameter update unit 5 in the outline of the embodiment. The parameter update unit 103 updates the feature extraction parameters applied to the neural network based on the learning mixed signals, which are the training data.

The parameter update unit 103 acquires the feature vectors extracted by the feature extraction unit 101 and the feature extraction parameters stored in the feature extraction parameter storage unit 102, and outputs them to the supervised loss function calculation unit 104 and the unsupervised loss function calculation unit 105, described later. The parameter update unit 103 evaluates the feature vectors extracted by the feature extraction unit 101 using the evaluation criterion determined by these two calculation units. Based on the evaluation result, it determines feature extraction parameters that yield better feature vectors and updates the stored parameters to the determined values.

When determining (updating) the feature extraction parameters, the parameter update unit 103 applies an iterative method used in neural network training, such as error backpropagation. The parameter update unit 103 stores the determined feature extraction parameters in the feature extraction parameter storage unit 102 so that they are applied to the neural network. More precisely, the parameter update unit 103 defines the evaluation criterion used to determine the feature extraction parameters as a loss function, that is, a mathematically specified evaluation function. It then determines and updates the feature extraction parameters iteratively, using a numerical method such as stochastic gradient descent (SGD), so that the loss function is minimized.
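The iterative update can be illustrated with a toy stochastic gradient descent loop. This is a sketch under strong simplifications: the parameter is a single scalar, the loss is a squared error rather than the patent's loss function, and the gradient is written by hand instead of being obtained by backpropagation. The names `sgd` and `squared_error_grad` are hypothetical.

```python
import random
from typing import Callable, List

def sgd(grad: Callable[[float, float], float], data: List[float],
        theta: float, lr: float, epochs: int, seed: int = 0) -> float:
    """Stochastic gradient descent: one parameter update per training sample."""
    rng = random.Random(seed)
    for _ in range(epochs):
        batch = data[:]
        rng.shuffle(batch)                    # visit samples in random order
        for x in batch:
            theta -= lr * grad(theta, x)      # step against the gradient
    return theta

def squared_error_grad(theta: float, x: float) -> float:
    """Gradient of the per-sample loss (theta - x)^2 with respect to theta."""
    return 2.0 * (theta - x)

# Minimizing the mean squared error drives theta toward the data mean (2.0).
theta = sgd(squared_error_grad, [1.0, 2.0, 3.0], theta=0.0, lr=0.01, epochs=200)
```

In a neural network the same loop runs over all weights at once, with the gradient of the loss supplied by backpropagation.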

The parameter update unit 103 defines the loss function, that is, the evaluation function for the feature extraction parameters of the neural network, as in equation (11). Specifically, the parameter update unit 103 defines this evaluation function using a supervised loss function, which is a first evaluation function, and an unsupervised loss function, which is a second evaluation function.

Here, θ denotes the feature extraction parameters, L_θ^(S) is the supervised loss function, L_θ^(U) is the unsupervised loss function, and λ is a weighting coefficient. X is the set of all spectrograms obtained from the training data, and V is the set of all feature vectors obtained from X. Y = (y_{t,f}) is the sound source label indicating which sound source the time-frequency bin (t, f) corresponding to feature vector v_{t,f} belongs to. For example, when the sound sources are speakers, y_{t,f} is a speaker label that uniquely identifies a speaker.
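The definitions above imply a weighted sum of the two loss terms. The form below is a reconstruction under assumptions, not a transcription of the patent's equation: it assumes that λ scales the unsupervised term and that the supervised term depends on V and Y while the unsupervised term depends on V alone. The text states only that the supervised loss is the first term on the right-hand side.

```latex
L_\theta(V, Y) = L_\theta^{(S)}(V, Y) + \lambda \, L_\theta^{(U)}(V)
```

Under this reading, λ = 0 reduces training to the purely supervised criterion, and larger λ gives more influence to bins that carry no sound source label.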

For example, suppose a mixed signal contains two speakers. If time-frequency bin (t, f) contains the speech of the first speaker more strongly than that of the second speaker, y_{t,f} is the two-dimensional vector (1, 0). Conversely, if it contains the speech of the second speaker more strongly than that of the first, y_{t,f} is the two-dimensional vector (0, 1). In general, when N speakers (that is, N sound sources) are present, these vectors are N-dimensional, with exactly one element equal to 1 and the other (N-1) elements equal to 0.
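The label encoding can be written down directly. In the sketch below the function names `one_hot` and `label_for_bin` are hypothetical, and breaking ties in favor of the lower-indexed source is an assumption the text does not address.

```python
from typing import List, Sequence

def one_hot(index: int, n_sources: int) -> List[int]:
    """N-dimensional vector with a single 1 at position `index`."""
    if not 0 <= index < n_sources:
        raise ValueError("index out of range")
    return [1 if i == index else 0 for i in range(n_sources)]

def label_for_bin(magnitudes: Sequence[float]) -> List[int]:
    """One-hot label for the source that is strongest in this time-frequency bin."""
    dominant = max(range(len(magnitudes)), key=lambda i: magnitudes[i])
    return one_hot(dominant, len(magnitudes))

y_first = label_for_bin([0.7, 0.2])        # first speaker dominates
y_second = label_for_bin([0.1, 0.5])       # second speaker dominates
y_three = label_for_bin([0.1, 0.2, 0.9])   # three sources, third dominates
```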

The supervised loss function calculation unit 104 calculates the supervised loss function in equation (11). As described above, the supervised loss function is the first evaluation function, so the supervised loss function calculation unit 104 can also be called the first calculation means. In other words, the supervised loss function calculation unit 104 calculates the first evaluation value using the supervised loss function, which represents the first evaluation function.

The supervised loss function calculation unit 104 acquires the label data stored in the learning label data storage unit 12 and generates the sound source labels y_{t,f}. It sets a sound source label for each time-frequency bin in a time interval in which only a single sound source is present. It does not set a sound source label for time-frequency bins in a time interval in which multiple sound sources are mixed.

The supervised loss function calculation unit 104 also acquires the feature vectors v_{t,f} from the parameter update unit 103, calculates the supervised loss function L_θ^(S), which is the first term on the right-hand side of equation (11), and outputs the result to the parameter update unit 103. The details of the supervised loss function L_θ^(S) and of the sound source labels are described later.

The unsupervised loss function calculation unit 105 calculates the unsupervised loss function in equation (11). As described above, the unsupervised loss function is the second evaluation function, so the unsupervised loss function calculation unit 105 can also be called the second calculation means. In other words, the unsupervised loss function calculation unit 105 calculates the second evaluation value using the unsupervised loss function, which represents the second evaluation function.

The unsupervised loss function calculation unit 105 acquires the feature vectors v_{t,f} from the parameter update unit 103 and the sound source labels y_{t,f} from the supervised loss function calculation unit 104. It calculates the unsupervised loss function L_θ^(U), which is the second term on the right-hand side of equation (11), and outputs the result to the parameter update unit 103.

The feature extraction unit 101, the parameter update unit 103, the supervised loss function calculation unit 104, and the unsupervised loss function calculation unit 105 operate iteratively while interacting with one another, and successively update the feature extraction parameters stored in the feature extraction parameter storage unit 102. These units update the feature extraction parameters a sufficient number of times for them to converge. Once the feature extraction parameters have converged, the final parameters are stored in the feature extraction parameter storage unit 102.

Here, the supervised loss function and the unsupervised loss function in equation (11) are described in detail.
The supervised loss function L_θ^(S) is defined by equation (12) below. It represents the loss over the feature vectors V = (v_{t,f}) extracted from the time-frequency bins for which a sound source label is set.
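Equation (12) is not reproduced in this text. From the description of its terms (the matrices VV^T and YY^T compared under the Frobenius norm), one consistent form would be the following. The square on the norm is an assumption; the text itself only names the norm.

```latex
L_\theta^{(S)}(V, Y) = \left\lVert V V^{\mathsf{T}} - Y Y^{\mathsf{T}} \right\rVert_F^{2} \tag{12}
```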

Here, VV^T is a matrix whose elements are the cosine similarities (normalized inner products) of all pairs of feature vectors v_{t,f}, and YY^T is a matrix whose elements are the inner products of all pairs of sound source labels y_{t,f}. An element of YY^T is 1 when the time-frequency bins (t, f) and (t', f') belong to the same sound source class and 0 when they do not, and the corresponding element of VV^T should ideally take the same value. ||·||_F is the Frobenius norm, that is, the square root of the sum of the squares of all elements of the matrix. Accordingly, the loss function of equation (12) becomes smaller as the cosine similarity of pairs of feature vectors in the same sound source class approaches 1 and that of pairs in different sound source classes approaches 0. When the supervised loss function L_θ^(S) is small, the feature extraction parameters θ can be said to extract feature vectors V that represent the sound source classes well.
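As an illustration only, the behavior described for the supervised loss can be checked numerically. The sketch below assumes the squared Frobenius form and row-normalizes V so that VV^T holds cosine similarities; the function name `supervised_loss` and the toy data are hypothetical and do not come from the disclosure.

```python
import numpy as np

def supervised_loss(V, Y):
    """Squared Frobenius distance between the cosine-similarity matrix of V
    and the label-affinity matrix Y Y^T (labeled bins only)."""
    V = np.asarray(V, dtype=float)
    Y = np.asarray(Y, dtype=float)
    # Normalize each feature vector so that inner products are cosine similarities.
    norms = np.linalg.norm(V, axis=1, keepdims=True)
    Vn = V / np.maximum(norms, 1e-12)
    diff = Vn @ Vn.T - Y @ Y.T
    return float(np.sum(diff ** 2))

# Four labeled bins: the first two belong to class 0, the last two to class 1.
Y = np.array([[1, 0], [1, 0], [0, 1], [0, 1]])
# Features aligned with the classes (same class parallel, different class orthogonal).
good = np.array([[1.0, 0.0], [2.0, 0.0], [0.0, 1.0], [0.0, 3.0]])
# Features that group the bins against the labels.
bad = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]])

loss_good = supervised_loss(good, Y)
loss_bad = supervised_loss(bad, Y)
```

With these toy inputs the class-aligned features give a loss of zero, while the misaligned features give a strictly larger loss, which is the ordering the text describes.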

A sound source class is information indicating an individual sound source contained in the mixed signal. For example, when the mixed signal contains the sound sources of speaker A and speaker B, the sound source of speaker A is the first sound source class and that of speaker B is the second sound source class.

Next, with reference to FIGS. 3 and 4, the sound source labels set by the supervised loss function calculation unit 104 are contrasted with the sound source labels in related art such as Non-Patent Document 1. FIG. 3 illustrates sound source labels in the related art. FIG. 4 illustrates sound source labels in Embodiment 1.

In related art such as Non-Patent Document 1, the sound source labels y_{t,f} of all time-frequency bins are assumed to be known. As described above, Non-Patent Document 1 generates the mixed signal artificially by superimposing multiple sound source signals, so the loss function is defined by equation (12) alone. In such related art, the amplitude of each sound source signal at each time-frequency bin (t, f) is known, so a sound source label can be set for every time-frequency bin (t, f) by finding the sound source signal with the largest amplitude.

FIG. 3 shows an example of the sound source labels set for the time-frequency bins in Non-Patent Document 1. A mixed signal obtained by mixing the sound source signals of two speakers is converted into a spectrogram, and a sound source label is set for every time-frequency bin. Because Non-Patent Document 1 generates the mixed signal artificially by superimposing multiple sound source signals, each time-frequency bin (t, f) carries the sound source label of either speaker A or speaker B.

In the present embodiment, by contrast, the sound source labels y_{t,f} of some of the time-frequency bins are assumed to be unknown, and sound source labels are set only for time-frequency bins that satisfy a predetermined condition. Unlike the related art described above, the sound source labels are assigned to a spectrogram obtained by converting an actually observed mixed signal. FIG. 4 shows an example in which sound source labels are set for a mixed signal of two speakers, as in FIG. 3. As shown in FIG. 4, sound source labels are set for the time-frequency bins in time intervals that contain only the sound source of speaker A or only that of speaker B. In other words, sound source labels are set for the time-frequency bins in time intervals of the mixed signal that contain only a single sound source.

As described above, the sound source labels are set by the supervised loss function calculation unit 104 based on the label data stored in the learning label data storage unit 12. For example, if the mixed signal contains the sound sources of speaker A and speaker B, the label data specifies the start and end of each time interval that contains the sound source of speaker A, and likewise the start and end of each time interval that contains the sound source of speaker B. By referring to the label data, the supervised loss function calculation unit 104 can determine which speaker's sound source is contained in which time interval, and can therefore set the sound source labels based on the label data.

FIG. 4 shows that the sound source label of speaker A is set for the time-frequency bins in the first through eighth positions along the time axis, and that the sound source label of speaker B is set for those in the 11th through 16th positions. For time-frequency bins in which multiple sound sources are mixed, it cannot be determined which sound source a bin belongs to, so the sound source label is treated as unknown and no label is set. As shown in FIG. 4, the time-frequency bins in the ninth and tenth positions along the time axis contain the sound sources of both speaker A and speaker B, so their labels are treated as unknown and are not set. The reason is that, for an actually observed mixed signal, the start and end of the time interval of each sound source signal can be specified relatively easily, whereas assigning a sound source label to every time-frequency bin of every sound source signal is nearly impossible. Therefore, in the present embodiment, the supervised loss function calculation unit 104 sets sound source labels for those time-frequency bins, among all time-frequency bins, that lie in time intervals in which only a single sound source is present. It does not set sound source labels for time-frequency bins in time intervals in which multiple sound sources are mixed.
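The labeling rule described for FIG. 4 can be sketched as follows. This is a minimal illustration under stated assumptions: the label data is taken to be a list of (source index, start frame, end frame) tuples with an exclusive end, every frequency bin in a frame shares that frame's label, and the function name `frame_labels` and the interval values are hypothetical, chosen only to reproduce the pattern the text describes (speaker A alone in positions 1 to 8, both speakers in 9 and 10, speaker B alone in 11 to 16).

```python
import numpy as np

def frame_labels(intervals, n_frames, n_sources):
    """Return (labels, labeled): one-hot labels per frame and a boolean mask.

    intervals: list of (source_index, start_frame, end_frame), end exclusive.
    A frame receives a label only if exactly one source is active in it.
    """
    active = np.zeros((n_frames, n_sources), dtype=int)
    for src, start, end in intervals:
        active[start:end, src] = 1
    labeled = active.sum(axis=1) == 1               # exactly one active source
    labels = np.where(labeled[:, None], active, 0)  # all-zero row means "not set"
    return labels, labeled

# Hypothetical label data (0-based frames): A active in 0..9, B active in 8..15.
labels, labeled = frame_labels([(0, 0, 10), (1, 8, 16)], n_frames=16, n_sources=2)
```

In this example the first eight frames carry the label (1, 0), the last six carry (0, 1), and the two overlapping frames are left without a label.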

For time-frequency bins whose amplitude in the mixed signal is sufficiently small, a special "no sound source" label, distinct from the sound source labels, may be assigned. This special label can be assigned automatically by simple signal processing. In the present disclosure, the special label is not counted as a sound source label.
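One possible form of such simple signal processing is a magnitude threshold relative to the spectrogram peak. The sketch below is an assumption for illustration: the text does not specify the criterion, and the function name `no_source_mask` and the -40 dB default are hypothetical.

```python
import numpy as np

def no_source_mask(spectrogram, rel_threshold_db=-40.0):
    """Return a boolean array that is True for bins whose magnitude lies
    below rel_threshold_db relative to the largest magnitude."""
    mag = np.abs(np.asarray(spectrogram))
    peak = mag.max()
    if peak == 0.0:
        # A completely silent input: every bin is "no sound source".
        return np.ones(mag.shape, dtype=bool)
    level_db = 20.0 * np.log10(np.maximum(mag, 1e-12) / peak)
    return level_db < rel_threshold_db

S = np.array([[1.0, 0.5], [0.001, 0.0]])
mask = no_source_mask(S)
```

Here the two bins in the second row fall more than 40 dB below the peak and would receive the special label.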

As described above, the present embodiment includes time-frequency bins for which no sound source label can be set. A loss function therefore needs to be defined for the feature vectors extracted from time-frequency bins without a sound source label.

Therefore, the present embodiment includes the unsupervised loss function calculation unit 105 and defines the unsupervised loss function as in equation (13) below. That is, the present embodiment defines a loss function for the time-frequency bins in FIG. 4 whose sound source labels are unknown and therefore not set. Equation (13) is used to determine which sound source each such time-frequency bin belongs to.
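Equation (13) is not reproduced in this text. From the symbols defined for it (unlabeled bins y_{t,f} = NULL, attribution rates γ_{t,f,i}, class means μ_i, and the class count c) and from the later remark that the loss is a form of hard clustering based on Euclidean distance, one consistent form would be the following. It is offered as an assumption, including the squared distance.

```latex
L_\theta^{(U)}(V, Y) = \sum_{(t,f):\, y_{t,f} = \mathrm{NULL}} \; \sum_{i=1}^{c}
  \gamma_{t,f,i} \, \left\lVert v_{t,f} - \mu_i \right\rVert^{2} \tag{13}
```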

Here, y_{t,f} = NULL denotes a time-frequency bin for which no sound source label is set, γ_{t,f,i} is the attribution rate of the feature vector v_{t,f} to sound source class i, and c is the number of sound source classes. μ_i is the mean of the feature vectors v_{t,f} over the time-frequency bins (t, f) belonging to sound source class i. The attribution rate of v_{t,f} to sound source class i is an index indicating which sound source the vector belongs to. The number of sound source classes can be determined from the label data.

μ_i is calculated according to equation (14) below.
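Equation (14) is not reproduced in this text. The description states that μ_i is a mean over the bins belonging to class i, and that the second terms of the numerator and denominator relate to feature vectors without sound source labels. One form consistent with both statements, offered as an assumption, is shown below; the condition y_{t,f} = i is shorthand for "the label of bin (t, f) indicates class i".

```latex
\mu_i =
  \frac{\displaystyle \sum_{(t,f):\, y_{t,f} = i} v_{t,f}
        \;+\; \sum_{(t,f):\, y_{t,f} = \mathrm{NULL}} \gamma_{t,f,i} \, v_{t,f}}
       {\displaystyle \sum_{(t,f):\, y_{t,f} = i} 1
        \;+\; \sum_{(t,f):\, y_{t,f} = \mathrm{NULL}} \gamma_{t,f,i}} \tag{14}
```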

Here, the attribution rate γ_{t,f,i} can be defined, for example, as a discrete attribution rate based on the nearest-neighbor rule: γ_{t,f,i} = 1 when i = argmin_j |v_{t,f} - μ_j| holds, and γ_{t,f,i} = 0 otherwise.

As is clear from the above, the unsupervised loss function given by equations (13) and (14) with discrete attribution rates is a form of hard clustering, which assigns each feature vector to a unique cluster based on Euclidean distance. More specifically, it is semi-supervised hard clustering over feature vectors whose sound source classes are known and feature vectors whose sound source classes are unknown. In other words, the unsupervised loss function calculation unit 105 calculates a loss function based on hard clustering.
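The semi-supervised hard clustering described here can be sketched in a few lines. The code is a minimal illustration under assumptions: labeled bins contribute to the class means with weight 1, unlabeled bins contribute through their attribution rates, the loss is the attribution-weighted squared Euclidean distance over unlabeled bins, and the means are first estimated from labeled bins alone before the attribution rates are computed. All function names and the toy data are hypothetical.

```python
import numpy as np

def class_means(V_lab, y_lab, V_unl, gamma, n_classes):
    """Class means from labeled vectors plus attribution-weighted unlabeled vectors."""
    means = np.zeros((n_classes, V_lab.shape[1]))
    for i in range(n_classes):
        num = V_lab[y_lab == i].sum(axis=0) + (gamma[:, i:i + 1] * V_unl).sum(axis=0)
        den = np.sum(y_lab == i) + gamma[:, i].sum()
        means[i] = num / max(den, 1e-12)
    return means

def hard_attribution(V_unl, means):
    """gamma[n, i] = 1 for the nearest class mean of vector n, else 0."""
    d = np.linalg.norm(V_unl[:, None, :] - means[None, :, :], axis=2)
    gamma = np.zeros_like(d)
    gamma[np.arange(len(V_unl)), d.argmin(axis=1)] = 1.0
    return gamma

def unsupervised_loss(V_unl, means, gamma):
    """Attribution-weighted sum of squared distances to the class means."""
    sq = np.sum((V_unl[:, None, :] - means[None, :, :]) ** 2, axis=2)
    return float(np.sum(gamma * sq))

V_lab = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
y_lab = np.array([0, 0, 1, 1])            # class index of each labeled vector
V_unl = np.array([[1.0, 1.0], [9.0, 1.0]])

# Step 1: means from labeled data only (all attribution rates zero).
means0 = class_means(V_lab, y_lab, V_unl, np.zeros((2, 2)), n_classes=2)
# Step 2: nearest-mean attribution, then means including the unlabeled vectors.
gamma = hard_attribution(V_unl, means0)
means = class_means(V_lab, y_lab, V_unl, gamma, n_classes=2)
loss = unsupervised_loss(V_unl, means, gamma)
```

In the toy data the first unlabeled vector is attributed to class 0 and the second to class 1, and the loss measures how far each lies from its updated class mean.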

The unsupervised loss function described above is only an example, and the present embodiment is not limited to it. For example, instead of the Euclidean distance (L2 norm), the closeness of feature vectors may be measured with the Manhattan distance (L1 norm), an Lp norm, or a similarity measure such as cosine similarity. Cosine similarity is particularly suitable because it is highly consistent with the supervised loss function of equation (12).

The attribution rate γ_{t,f,i} may also be continuous; for example, γ_{t,f,i} and μ_i may be defined based on soft clustering that assumes a Gaussian mixture distribution. In general, clustering admits any similarity measure and loss function, so the unsupervised loss function of the present embodiment may be defined likewise. Furthermore, when the time-frequency bins with unknown sound source labels (to which no label can be assigned) are sufficiently few compared with the time-frequency bins with known sound source labels, the second terms of the numerator and denominator on the right-hand side of equation (14) can be ignored. That is, the terms in equation (14) relating to feature vectors without sound source labels can be ignored.
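A continuous attribution rate of the kind mentioned here could, for instance, be the posterior probability under a Gaussian mixture. The sketch below assumes the simplest such mixture (equal weights and a shared isotropic variance), which the text does not specify; the function name `soft_attribution` and the parameter `sigma` are hypothetical.

```python
import numpy as np

def soft_attribution(V_unl, means, sigma=1.0):
    """Posterior class probabilities under an equal-weight Gaussian mixture
    with a shared isotropic variance sigma**2."""
    sq = np.sum((V_unl[:, None, :] - means[None, :, :]) ** 2, axis=2)
    logits = -sq / (2.0 * sigma ** 2)
    logits -= logits.max(axis=1, keepdims=True)   # for numerical stability
    w = np.exp(logits)
    return w / w.sum(axis=1, keepdims=True)

means = np.array([[0.0, 0.0], [2.0, 0.0]])
V_unl = np.array([[1.0, 0.0], [0.0, 0.0]])
gamma = soft_attribution(V_unl, means)
```

A vector midway between the two means is attributed equally to both classes, while a vector sitting on one mean is attributed mostly, but not entirely, to that class. As `sigma` shrinks, the rates approach the discrete nearest-neighbor attribution.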

Returning to FIG. 2, the sound source separation unit 14 will be described. The sound source separation unit 14 separates the mixed signal into individual sound source signals using the neural network that serves as the feature extractor. It includes the feature extraction unit 101, the feature extraction parameter storage unit 102, a clustering unit 106, and a separation unit 107. The clustering unit 106 and the separation unit 107 function as clustering means and separation means, respectively. The feature extraction unit 101 and the feature extraction parameter storage unit 102 are functional units shared with the feature extractor learning unit 13.

As in the feature extractor learning unit 13, the feature extraction unit 101 acquires a mixed signal, converts it into a spectrogram X, and generates the feature vectors v_{t,f} from the partial spectrograms x_{t,f}.

The clustering unit 106 corresponds to the clustering unit 3 in the outline of the embodiments. The clustering unit 106 classifies the feature vectors v_{t,f} into a plurality of clusters by applying an algorithm such as K-means, mean-shift, the shortest- or longest-distance method (single or complete linkage), or Ward's method.

The separation unit 107 corresponds to the separation unit 4 in the outline of the embodiments. The separation unit 107 generates a sound source signal for each cluster using the time-frequency bins contained in each of the clusters classified by the clustering unit 106. Specifically, for each cluster, the separation unit 107 applies an inverse Fourier transform to a spectrogram reconstructed from only the time-frequency bins (t, f) contained in that cluster, thereby generating the individual sound source signals.
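The reconstruction of one spectrogram per cluster can be pictured as binary masking. The sketch below is an assumption for illustration: it keeps the complex spectrogram values of the bins assigned to a cluster and zeroes the rest, and it stops before the inverse transform. The function name `split_by_cluster` and the toy values are hypothetical.

```python
import numpy as np

def split_by_cluster(spectrogram, assignment, n_clusters):
    """Return one masked spectrogram per cluster (zeros outside the cluster)."""
    spectrogram = np.asarray(spectrogram)
    assignment = np.asarray(assignment)
    return [np.where(assignment == k, spectrogram, 0) for k in range(n_clusters)]

# Toy complex spectrogram (2 frames x 2 frequency bins) and a cluster index per bin.
X = np.array([[1 + 1j, 2 + 0j], [0 + 3j, 4 - 1j]])
assignment = np.array([[0, 1], [1, 0]])
parts = split_by_cluster(X, assignment, 2)
```

Because every bin belongs to exactly one cluster, the masked spectrograms sum back to the original. Each element of `parts` would then be passed to an inverse short-time Fourier transform to obtain a time-domain signal.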

<Operation example of the sound source separation device>
Next, an operation example of the sound source separation device 10 will be described with reference to FIGS. 5 to 7. FIGS. 5 to 7 are flowcharts showing an operation example of the sound source separation device according to Embodiment 1.

First, the overall operation of the sound source separation device 10 will be described with reference to FIG. 5. As shown in FIG. 5, the sound source separation device 10 executes a feature extractor learning process (step A1) and a sound source separation process (step A2).

Specifically, in the feature extractor learning process, the sound source separation device 10 learns the feature extraction parameters of the neural network serving as the feature extractor, using actually observed mixed signals (step A1).

Next, in the sound source separation process, the sound source separation device 10 separates a mixed signal into individual sound source signals using the feature extractor to which the feature extraction parameters determined in step A1 are applied (step A2).

Next, the feature extractor learning process will be described with reference to FIG. 6. The flowchart in FIG. 6 shows the processing executed in step A1 of FIG. 5 and is executed by the feature extractor learning unit 13. The operation described below differs clearly from that disclosed in Non-Patent Document 1.

First, the feature extraction unit 101 sequentially acquires the mixed signals stored in the learning mixed signal storage unit 11 and converts each into a spectrogram by a short-time Fourier transform (step B1).

Next, the feature extraction unit 101 acquires the feature extraction parameters stored in the feature extraction parameter storage unit 102. Using the neural network (the feature extractor) to which the acquired parameters are applied, it extracts a feature vector v_{t,f} from each time-frequency bin (t, f) of the spectrogram (step B2).

In the initial stage, when the feature extraction parameters have not yet been determined, the parameter update unit 103 initializes them in an initialization step (not shown), for example by generating random numbers, and outputs them to the feature extraction parameter storage unit 102 in advance.

Next, the parameter update unit 103 acquires from the feature extraction unit 101 the feature vectors it extracted, and calculates the loss function, a measure of the quality of the feature vectors, based on equation (11). Specifically, the parameter update unit 103 calculates the loss function of equation (11) using the results calculated in steps B3 and B4 described below.

In step B3, the supervised loss function calculation unit 104 calculates the supervised loss function of equation (12). Specifically, it acquires the feature vectors extracted by the feature extraction unit 101 via the parameter update unit 103. It also acquires the label data, stored in the learning label data storage unit 12, that represents the time interval of each sound source. Based on the acquired label data, it sets sound source labels for the time-frequency bins in time intervals in which only a single sound source is present. It then calculates the supervised loss function based on equation (12) for the time-frequency bins for which sound source labels are set.

In step B4, the unsupervised loss function calculation unit 105 calculates the unsupervised loss function of equation (13). Specifically, it acquires the feature vectors extracted by the feature extraction unit 101 via the parameter update unit 103. It also acquires the sound source labels set by the supervised loss function calculation unit 104. It then calculates the unsupervised loss function based on equations (13) and (14) for the time-frequency bins for which no sound source label is set.

The parameter update unit 103 updates the feature extraction parameters based on the calculated value of the loss function of equation (11) (step B5). Specifically, it calculates the loss function of equation (11) using the supervised loss calculated in step B3 and the unsupervised loss calculated in step B4. It determines the feature extraction parameters so that the value of the loss function of equation (11) decreases. It then stores the determined feature extraction parameters in the feature extraction parameter storage unit 102, thereby updating them.

Next, the parameter update unit 103 determines whether a predetermined convergence condition is satisfied, for example that the value of the loss function of equation (11) no longer tends to decrease (step B6). Alternatively, in step B6 the parameter update unit 103 may determine whether the processing of steps B2 to B5 has been performed a predetermined number of times.

If the parameter update unit 103 determines in step B6 that the predetermined convergence condition is satisfied (YES in step B6), the process ends.
If it determines that the predetermined convergence condition is not satisfied (NO in step B6), the process returns to step B2, and the processing from step B2 onward is performed again.
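The decision in step B6 can be sketched as a small predicate over the history of loss values. This is one possible reading of "the loss no longer tends to decrease" combined with the alternative of a fixed iteration count; the function name `has_converged` and the parameters `tol`, `patience`, and `max_iters` are hypothetical and not taken from the disclosure.

```python
def has_converged(losses, tol=1e-4, patience=3, max_iters=1000):
    """Return True when the loss has stopped decreasing or the budget is used up.

    losses: loss values of equation (11), one per completed pass of steps B2-B5.
    """
    if len(losses) >= max_iters:
        return True                      # predetermined number of iterations reached
    if len(losses) <= patience:
        return False                     # not enough history to judge a trend
    recent_best = min(losses[-patience:])
    earlier_best = min(losses[:-patience])
    # Converged if the last `patience` passes improved on the earlier best by less than tol.
    return earlier_best - recent_best < tol

still_improving = has_converged([5.0, 4.0, 3.0, 2.0, 1.0])
flat = has_converged([5.0, 1.0, 1.0, 1.0, 1.0])
```

A training loop would call this after each update in step B5 and return to step B2 while it yields False.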

Next, the sound source separation process will be described with reference to FIG. 7. The flowchart in FIG. 7 shows the processing executed in step A2 of FIG. 5 and is executed by the sound source separation unit 14.

First, the feature extraction unit 101 applies a short-time Fourier transform to the mixed signal that is to be separated into individual sound source signals, converting it into a spectrogram (step C1). The mixed signal to be processed may be one observed by the sound source separation device 10 with a microphone (not shown), or one that was recorded and stored in advance.

Next, the feature extraction unit 101 acquires the feature extraction parameters stored in the feature extraction parameter storage unit 102. Using the neural network (the feature extractor) to which the acquired feature extraction parameters are applied, it extracts a feature vector v_{t,f} from each time-frequency bin (t, f) of the spectrogram (step C2).

Next, the clustering unit 106 clusters the feature vectors v_{t,f} extracted by the feature extraction unit 101 (step C3). Specifically, by clustering these feature vectors, it classifies the time-frequency bins into as many clusters as the number of sound sources assumed to be contained in the mixed signal.

The clustering unit 106 may perform the clustering by applying, for example, K-means, mean-shift, the shortest- or longest-distance method, or Ward's method. When prior information is available, such as "this is a conversation between two speakers", the clustering unit 106 may set the number of clusters according to that information. When no such prior information is available, it may use a method for determining the number of clusters provided by one of the above algorithms.
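As an illustration of step C3 with K-means, the sketch below clusters a handful of feature vectors into two groups, matching the case where prior information says there are two speakers. It is a plain NumPy implementation written for this example, not the disclosed one; the function name `kmeans`, the random initialization, and the toy vectors are assumptions.

```python
import numpy as np

def kmeans(V, k, n_iters=50, seed=0):
    """Plain K-means: returns (cluster index per vector, cluster centers)."""
    rng = np.random.default_rng(seed)
    V = np.asarray(V, dtype=float)
    # Initialize the centers with k distinct input vectors chosen at random.
    centers = V[rng.choice(len(V), size=k, replace=False)]
    for _ in range(n_iters):
        d = np.linalg.norm(V[:, None, :] - centers[None, :, :], axis=2)
        assign = d.argmin(axis=1)
        # Recompute each center; keep the old one if a cluster became empty.
        new_centers = np.array([
            V[assign == j].mean(axis=0) if np.any(assign == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return assign, centers

# Toy feature vectors forming two well-separated groups (one per assumed speaker).
V = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]])
assign, centers = kmeans(V, k=2)
```

The cluster indices themselves are arbitrary; only the grouping matters, so the first three vectors share one index and the last three share the other.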

Next, the separation unit 107 applies an inverse Fourier transform to the spectrogram reconstructed from the time-frequency bins contained in each of the classified clusters, and generates and outputs, for each cluster, a sound source signal separated into a single sound source (step C4).
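The reconstruction part of step C4 can be pictured as binary masking of the mixture spectrogram. The sketch below is an assumption-laden illustration: the function name `separate_by_mask` and the nested-list layout are hypothetical, and the inverse transform that would turn each masked spectrogram back into a waveform is omitted.

```python
def separate_by_mask(spectrogram, labels, n_sources):
    """Split a complex spectrogram into one masked spectrogram per cluster.

    spectrogram[t][f] is a complex STFT coefficient and labels[t][f] is the
    cluster index assigned to that time-frequency bin.
    """
    separated = []
    for k in range(n_sources):
        # Keep the bins assigned to cluster k and zero out all others.
        separated.append([
            [coef if lab == k else 0j for coef, lab in zip(frame, frame_labels)]
            for frame, frame_labels in zip(spectrogram, labels)
        ])
    return separated


# Toy mixture with 2 frames and 3 frequency bins.
spectrogram = [[1 + 1j, 2 + 0j, 0.5j],
               [3 + 0j, -1j, 4 + 2j]]
labels = [[0, 1, 0],
          [1, 1, 0]]
sources = separate_by_mask(spectrogram, labels, n_sources=2)
```

Because every bin belongs to exactly one cluster, the masked spectrograms add up to the original mixture spectrogram. Each one would then be passed through an inverse short-time Fourier transform to obtain the separated signal.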

As described above, the sound source separation device 10 according to the present embodiment determines the feature extraction parameters of the feature extractor using actually observed mixed signals together with the label data attached to them, which indicates the time intervals of each sound source. When determining the feature extraction parameters, the sound source separation device 10 uses a loss function composed of two loss functions, a supervised loss function and an unsupervised loss function, and updates the feature extraction parameters so that the sum of the values of the two loss functions is minimized. Therefore, by using the sound source separation device 10 according to the present embodiment, a feature extractor that is optimal for actually observed mixed signals, rather than for artificially created mixed signals, can be obtained. That is, individual sound source signals can be separated from a mixed signal with high accuracy.
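The idea of labeling only the intervals with a single active source and summing a supervised and an unsupervised loss can be sketched as below. The exact loss equations are not reproduced in this part of the text, so the two loss functions here are stand-ins chosen for illustration: a pairwise affinity loss over labeled vectors and a hard-clustering loss over unlabeled vectors, with class means taken from the labeled vectors. All function names and the frame-index interval format are likewise assumptions.

```python
def frame_labels(intervals, n_frames):
    """Source label per frame: the source id if exactly one source is
    active in that frame, otherwise None (no label is set).

    intervals maps a source id to (start_frame, end_frame) pairs, end exclusive.
    """
    labels = []
    for t in range(n_frames):
        active = [s for s, spans in intervals.items()
                  if any(a <= t < b for a, b in spans)]
        labels.append(active[0] if len(active) == 1 else None)
    return labels


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def supervised_loss(vectors, labels):
    """Stand-in: squared error between inner products and label agreement."""
    idx = [i for i, y in enumerate(labels) if y is not None]
    return sum((dot(vectors[i], vectors[j])
                - (1.0 if labels[i] == labels[j] else 0.0)) ** 2
               for i in idx for j in idx)


def unsupervised_loss(vectors, labels):
    """Stand-in: squared distance of each unlabeled vector to the nearest
    class mean (hard assignment). Assumes at least one labeled class."""
    classes = sorted({y for y in labels if y is not None})
    means = []
    for c in classes:
        members = [v for v, y in zip(vectors, labels) if y == c]
        means.append([sum(col) / len(members) for col in zip(*members)])
    return sum(min(sum((x - m) ** 2 for x, m in zip(v, mu)) for mu in means)
               for v, y in zip(vectors, labels) if y is None)


# Toy case: one feature vector per frame; sources A and B overlap in frame 2.
intervals = {"A": [(0, 3)], "B": [(2, 5)]}
labels = frame_labels(intervals, n_frames=5)
vectors = [[1.0, 0.0], [1.0, 0.0], [0.6, 0.8], [0.0, 1.0], [0.0, 1.0]]
total_loss = supervised_loss(vectors, labels) + unsupervised_loss(vectors, labels)
```

In an actual training loop, `total_loss` would be the quantity minimized with respect to the feature extraction parameters, for example by gradient descent on the neural network that produces `vectors`.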

(Embodiment 2)
Next, the second embodiment will be described.
<Configuration example of the sound source separation device>
The sound source separation device 80 according to the second embodiment will be described with reference to FIG. 8. FIG. 8 is a configuration diagram showing a configuration example of the sound source separation device according to the second embodiment. As shown in FIG. 8, the sound source separation device 80 according to the present embodiment includes a sound source separation program 81, a data processing device 82, and a storage device 83. The storage device 83 includes a feature extraction parameter storage area 831, a learning mixed signal storage area 832, and a learning label data storage area 833. The present embodiment is a configuration example in which the feature extractor learning unit 13 and the sound source separation unit 14 of the first embodiment are realized by a computer operated by a program.

The sound source separation program 81 is read into the data processing device 82 and controls the operation of the data processing device 82. The sound source separation program 81 describes, in a programming language, the operations of the feature extractor learning unit 13 and the sound source separation unit 14 of the first embodiment.

Specifically, under the control of the sound source separation program 81, the data processing device 82 executes the same processing as that of the feature extractor learning unit 13 and the sound source separation unit 14 of the first embodiment. That is, the data processing device 82 acquires the feature extraction parameters, the learning mixed signals, and the learning label data stored in the feature extraction parameter storage area 831, the learning mixed signal storage area 832, and the learning label data storage area 833 of the storage device 83, respectively. The data processing device 82 then performs the processing of the feature extractor learning unit 13 and the sound source separation unit 14 of the first embodiment.

More specifically, the data processing device 82 performs each process carried out in the first embodiment by the feature extraction unit 101, the parameter update unit 103, the supervised loss function calculation unit 104, the unsupervised loss function calculation unit 105, the clustering unit 106, and the separation unit 107.

As described above, the sound source separation device 80 according to the second embodiment also performs each process executed by the functional units of the first embodiment, and can therefore obtain the same effects as the first embodiment. That is, by using the sound source separation device 80 according to the present embodiment, a feature extractor that is optimal for actually observed mixed signals, rather than for artificially created mixed signals, can be obtained. Individual sound source signals can therefore be separated from a mixed signal with high accuracy.

The same effects as those of the first embodiment can also be obtained by using the sound source separation program 81 according to the second embodiment. That is, the sound source separation program 81 according to the present embodiment makes it possible to separate individual sound source signals from a mixed signal with high accuracy.

(Other embodiments)
The sound source separation device according to the above-described embodiments may have the following hardware configuration. FIG. 9 is a block diagram showing a configuration example of the sound source separation devices 1, 10, and 80 (hereinafter referred to as the sound source separation device 1 and the like) described in the above embodiments. Referring to FIG. 9, the sound source separation device 1 and the like include a processor 1201 and a memory 1202.

The processor 1201 reads software (a computer program) from the memory 1202 and executes it, thereby performing the processing of the sound source separation device 1 and the like described with reference to the flowcharts in the above embodiments. The processor 1201 may be, for example, a microprocessor, an MPU (Micro Processing Unit), or a CPU (Central Processing Unit). The processor 1201 may include a plurality of processors.

The memory 1202 is composed of a combination of volatile memory and non-volatile memory. The memory 1202 may include storage located away from the processor 1201. In this case, the processor 1201 may access the memory 1202 via an I/O interface (not shown).

In the example of FIG. 9, the memory 1202 is used to store a group of software modules. The processor 1201 can perform the processing of the sound source separation device 1 and the like described in the above embodiments by reading these software modules from the memory 1202 and executing them.

As described with reference to FIG. 9, each of the processors included in the sound source separation device 1 and the like executes one or more programs containing a group of instructions for causing a computer to perform the algorithms described with reference to the drawings.

In the above examples, the program can be stored using various types of non-transitory computer readable media and supplied to a computer. Non-transitory computer readable media include various types of tangible storage media. Examples of non-transitory computer readable media include magnetic recording media (for example, flexible disks, magnetic tapes, and hard disk drives) and magneto-optical recording media (for example, magneto-optical disks). Further examples include CD-ROM (Read Only Memory), CD-R, and CD-R/W, as well as semiconductor memory. Semiconductor memory includes, for example, mask ROM, PROM (Programmable ROM), EPROM (Erasable PROM), flash ROM, and RAM (Random Access Memory). The program may also be supplied to a computer by various types of transitory computer readable media. Examples of transitory computer readable media include electrical signals, optical signals, and electromagnetic waves. A transitory computer readable medium can supply the program to a computer via a wired communication path such as an electric wire or optical fiber, or via a wireless communication path.

The present disclosure is not limited to the above embodiments and can be modified as appropriate without departing from its spirit. The present disclosure may also be carried out by combining the embodiments as appropriate.

1, 10, 80 Sound source separation device
2, 101 Feature extraction unit
3, 106 Clustering unit
4, 107 Separation unit
5, 103 Parameter update unit
11 Learning mixed signal storage unit
12 Learning label data storage unit
13 Feature extractor learning unit
14 Sound source separation unit
102 Feature extraction parameter storage unit
104 Supervised loss function calculation unit
105 Unsupervised loss function calculation unit
81 Sound source separation program
82 Data processing device
83 Storage device

Claims

A sound source separation device comprising: feature extraction means for extracting a feature vector, for each time-frequency bin in a spectrogram obtained by converting a mixed signal in which a plurality of sound source signals are mixed, using a feature extractor to which parameters used for feature extraction are applied;
clustering means for classifying the extracted feature vectors into a plurality of clusters;
separation means for generating a sound source signal for each classified cluster using the time-frequency bins contained in each of the plurality of classified clusters;
parameter updating means for updating the parameters based on a learning mixed signal that includes an observed mixed signal;
first calculation means for setting a sound source label in the time-frequency bins that satisfy a predetermined condition in a spectrogram obtained by converting the learning mixed signal, and for calculating, using a first evaluation function, a first evaluation value for the feature vectors extracted from the time-frequency bins in which the sound source label is set; and
second calculation means for calculating, using a second evaluation function, a second evaluation value for the feature vectors extracted from the time-frequency bins in which the sound source label is not set,
wherein the parameter updating means updates the parameters based on the first evaluation value and the second evaluation value.

The sound source separation device according to claim 1, wherein the parameter updating means updates the parameters so as to reduce the sum of the first evaluation value and the second evaluation value.

The sound source separation device according to claim 1 or 2, wherein the first calculation means, based on label data indicating the time intervals in which each sound source signal is contained in the learning mixed signal, sets the sound source label in the time-frequency bins within time intervals in which a single sound source is present, and does not set the sound source label in the time-frequency bins within time intervals in which a plurality of sound sources are present.

The sound source separation device according to any one of claims 1 to 3, wherein the second evaluation function is a loss function based on at least one of hard clustering and soft clustering.

The sound source separation device according to any one of claims 1 to 4, wherein the first evaluation function is a supervised loss function and the second evaluation function is an unsupervised loss function.

The sound source separation device according to claim 5, wherein the supervised loss function is given by the following equation (1) and the unsupervised loss function is given by the following equation (2).

Here, θ is the parameter, X is the set of all spectrograms obtained from the learning mixed signal, (t, f) and (t', f') are time-frequency bins, v_{t,f} is the feature vector of the time-frequency bin (t, f), V is the set of all feature vectors v_{t,f} obtained from X, and Y = (y_{t,f}) is the sound source label of the time-frequency bin (t, f) corresponding to the feature vector v_{t,f}.

Here, y_{t,f} = NULL denotes a time-frequency bin in which the sound source label is not set, γ_{t,f,i} is the attribution ratio of the feature vector v_{t,f} to the sound source class i, c is the number of sound source classes, and μ_i is the mean of the feature vectors v_{t,f} over the time-frequency bins (t, f) belonging to the sound source class i, which is determined by equation (3).

The sound source separation device according to any one of claims 1 to 6, wherein the feature extractor is a neural network.

A sound source separation method comprising: extracting a feature vector, for each time-frequency bin in a spectrogram obtained by converting a mixed signal in which a plurality of sound source signals are mixed, using a feature extractor to which parameters used for feature extraction are applied;
classifying the extracted feature vectors into a plurality of clusters;
generating a sound source signal for each classified cluster using the time-frequency bins contained in each of the plurality of classified clusters;
updating the parameters based on a learning mixed signal that includes an observed mixed signal;
setting a sound source label in the time-frequency bins that satisfy a predetermined condition in a spectrogram obtained by converting the learning mixed signal, and calculating, using a first evaluation function, a first evaluation value for the feature vectors extracted from the time-frequency bins in which the sound source label is set; and
calculating, using a second evaluation function, a second evaluation value for the feature vectors extracted from the time-frequency bins in which the sound source label is not set,
wherein the parameters are updated based on the first evaluation value and the second evaluation value.

A program that causes a computer to execute: extracting a feature vector, for each time-frequency bin in a spectrogram obtained by converting a mixed signal in which a plurality of sound source signals are mixed, using a feature extractor to which parameters used for feature extraction are applied;
classifying the extracted feature vectors into a plurality of clusters;
generating a sound source signal for each classified cluster using the time-frequency bins contained in each of the plurality of classified clusters;
updating the parameters based on a learning mixed signal that includes an observed mixed signal;
setting a sound source label in the time-frequency bins that satisfy a predetermined condition in a spectrogram obtained by converting the learning mixed signal, and calculating, using a first evaluation function, a first evaluation value for the feature vectors extracted from the time-frequency bins in which the sound source label is set; and
calculating, using a second evaluation function, a second evaluation value for the feature vectors extracted from the time-frequency bins in which the sound source label is not set,
wherein the parameters are updated based on the first evaluation value and the second evaluation value.