JP2014215544A

JP2014215544A - Sound processing device

Info

Publication number: JP2014215544A
Application number: JP2013094475A
Authority: JP
Inventors: 大地北村; Daichi Kitamura; 洋猿渡; Hiroshi Saruwatari; 祐高橋; Yu Takahashi
Original assignee: Yamaha Corp
Current assignee: Yamaha Corp
Priority date: 2013-04-26
Filing date: 2013-04-26
Publication date: 2014-11-17

Abstract

PROBLEM TO BE SOLVED: To achieve highly accurate sound source separation when non-negative matrix factorization is applied to a specific component of an acoustic signal. SOLUTION: A matrix generation unit 34 generates a separation matrix Q in which each element q(m,n) corresponding to a time-frequency component y(m,n) of an acoustic signal SA in which a first sound source is dominant is set to 1, and each remaining element q(m,n) is set to 0. A matrix decomposition unit 36 takes as its operation target the time-frequency components y(m,n) whose elements q(m,n) are set to 1 in the separation matrix Q, and repeats the update operation of a non-negative matrix factorization that uses a basis matrix F of the first sound source as teacher information. From an observation matrix Y in which the time-frequency components y(m,n) of the acoustic signal SA are arranged, it thereby calculates a coefficient matrix G corresponding to the basis matrix F, a basis matrix H of a second sound source, and a coefficient matrix U corresponding to the basis matrix H. The matrix decomposition unit 36 executes the non-negative matrix factorization under a constraint condition that suppresses each element of the coefficient matrix G.

Description

The present invention relates to a technique for separating an acoustic signal into individual sound sources.

Sound source separation techniques that separate a mixture of sounds generated by different sound sources into the sound of each source have been proposed. For example, Non-Patent Document 1 discloses a sound source separation technique using supervised non-negative matrix factorization (SNMF: Semi-supervised Nonnegative Matrix Factorization), in which a basis matrix representing the acoustic characteristics of a specific sound source is used as teacher information. Non-Patent Document 2 discloses combining supervised non-negative matrix factorization with spatial sound source separation, which exploits the spatial position of each sound source identified from a multi-channel acoustic signal. Specifically, the component of each frequency at each point in time of the acoustic signal (hereinafter "time-frequency component") is classified (clustered) by sound source direction, the components whose sound source lies in the target direction are separated from the acoustic signal, and the same non-negative matrix factorization as in Non-Patent Document 1 is then executed.

Non-Patent Document 1: K. Yagi, et al., "Music Signal Separation by Orthogonality and Maximum-Distance Constrained Nonnegative Matrix Factorization with Target Signal Information", Proc. of Audio Engineering Society 45th International Conference Applications of Time-Frequency Processing in Audio (AES45th), March 2012
Non-Patent Document 2: Y. Iwao, et al., "Stereo Music Signal Separation Combining Directional Clustering and Nonnegative Matrix Factorization", Proc. the 12th IEEE International Symposium on Signal Processing and Information Technology (ISSPIT 2012), December 2012

However, in a configuration that assigns each time-frequency component of the acoustic signal exclusively to one of a plurality of classes (hard clustering), as in Non-Patent Document 2, the acoustic signal after spatial sound source separation has missing time-frequency components (zero intensity) at many points in the time-frequency domain. The signal therefore deviates from the acoustic characteristics represented by the basis matrix of the teacher information, and highly accurate sound source separation becomes difficult. The above description used, for convenience, the case where the specific component subjected to non-negative matrix factorization is extracted from the acoustic signal by spatial sound source separation, but the same problem can arise in any non-negative matrix factorization that operates on a specific component of an acoustic signal, regardless of how that component is extracted. In view of these circumstances, an object of the present invention is to achieve highly accurate sound source separation in a configuration that executes non-negative matrix factorization on a specific component of an acoustic signal.

To solve the above problem, a sound processing device according to a first aspect of the present invention comprises matrix generation means and matrix decomposition means. The matrix generation means generates a separation matrix (for example, separation matrix Q) that includes an element for each time-frequency component of an acoustic signal representing a mixture of the sounds of a plurality of sound sources; each element corresponding to a time-frequency component in which the sound of a first sound source is dominant is set to a maintenance value that maintains that component, and each element corresponding to a remaining time-frequency component is set to a suppression value that suppresses that component. The matrix decomposition means takes as its operation target the time-frequency components whose elements are set to the maintenance value in the separation matrix, and repeats the update operation of a non-negative matrix factorization that uses, as teacher information, a first basis matrix (for example, basis matrix F) including a plurality of basis vectors representing the spectra of the components of the sound of the first sound source. From an observation matrix in which the time-frequency components of the acoustic signal are arranged, it thereby calculates a first coefficient matrix (for example, coefficient matrix G) including a plurality of coefficient vectors corresponding to the basis vectors of the first basis matrix, a second basis matrix (for example, basis matrix H) including a plurality of basis vectors representing the spectra of the components of the sound of a second sound source different from the first sound source, and a second coefficient matrix (for example, coefficient matrix U) including a plurality of coefficient vectors corresponding to the basis vectors of the second basis matrix. The matrix decomposition means executes the non-negative matrix factorization under a constraint condition (for example, the first constraint condition described later) that suppresses each element of the first coefficient matrix. In a non-negative matrix factorization that operates only on the time-frequency components whose elements are set to the maintenance value in the separation matrix, the elements of the first coefficient matrix, which corresponds to the first basis matrix used as teacher information, may be set to excessively large values. In the sound processing device according to the first aspect, the non-negative matrix factorization is executed under the constraint of suppressing each element of the first coefficient matrix, so this possibility is reduced. Highly accurate sound source separation can therefore be achieved.

In a preferred aspect of the present invention, the matrix decomposition means executes the non-negative matrix factorization under the constraint of reducing the norm of a constrained matrix (for example, constrained matrix {/Q.×(FG)}) obtained by multiplying each element of the matrix product of the first basis matrix and the first coefficient matrix by the corresponding element of an inverted separation matrix (for example, inverted separation matrix /Q), in which the maintenance value and the suppression value of each element of the separation matrix are swapped. In this aspect, the above-described effect of improving the accuracy of sound source separation is particularly pronounced.
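The quantity that this constraint reduces can be illustrated with a small sketch. It assumes a binary separation matrix Q (1 = maintenance value, 0 = suppression value) and uses the squared Frobenius norm; the text above only says "norm", so that choice, the helper names, and the toy values are assumptions for illustration rather than the method of the invention. Plain Python lists are used to keep the sketch dependency-free.

```python
def matmul(a, b):
    """Matrix product of two nested-list matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def inverted_mask(q):
    """Inverted separation matrix /Q: swap maintenance (1) and suppression (0) values."""
    return [[1 - value for value in row] for row in q]


def constraint_penalty(q, f, g):
    """Squared Frobenius norm of the constrained matrix /Q .x (FG) (norm choice assumed)."""
    fg = matmul(f, g)
    q_inv = inverted_mask(q)
    return sum((q_inv[m][n] * fg[m][n]) ** 2
               for m in range(len(fg)) for n in range(len(fg[0])))


# Toy example: M = 2 frequency bins, N = 2 frames, one basis vector.
F = [[1.0], [2.0]]    # M x 1 first basis matrix (teacher information)
G = [[3.0, 0.5]]      # 1 x N first coefficient matrix
Q = [[1, 0], [0, 1]]  # separation matrix
penalty = constraint_penalty(Q, F, G)
```

Only the elements of FG that fall on suppressed positions contribute to the penalty, so reducing it discourages the first sound source model from placing energy where the mask has removed the observation.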

The above-described problem, namely that the elements of the first coefficient matrix may be set to excessively large values in a non-negative matrix factorization that operates on the time-frequency components whose elements are set to the maintenance value in the separation matrix, becomes more apparent as the number of maintenance values in the separation matrix decreases (that is, as fewer time-frequency components are dominated by the sound of the first sound source). In view of this, two configurations are preferable: one that variably controls the degree of influence of the constraint condition in the update operation according to the number of maintenance values in the separation matrix, and one that executes the non-negative matrix factorization under the constraint condition when an index value corresponding to the number of maintenance values in the separation matrix exceeds a threshold and releases the constraint condition when the index value falls below the threshold.
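One possible reading of these two configurations is sketched below. The index value is taken here to be the fraction of suppression values in Q, so that it grows as maintenance values become scarcer; this definition, the linear weight mapping, and the names `suppression_ratio`, `constraint_weight`, and `use_constraint` are all hypothetical and are not specified in the text above.

```python
def suppression_ratio(q):
    """Hypothetical index value: fraction of elements of Q set to the suppression value 0."""
    total = sum(len(row) for row in q)
    kept = sum(sum(row) for row in q)
    return (total - kept) / total


def constraint_weight(q, max_weight=1.0):
    """Variable control: the fewer maintenance values, the stronger the constraint (assumed linear)."""
    return max_weight * suppression_ratio(q)


def use_constraint(q, threshold=0.5):
    """Threshold control: apply the constraint only when the index value exceeds the threshold."""
    return suppression_ratio(q) > threshold


Q = [[1, 0], [0, 0]]  # only one of four components is dominated by the first source
weight = constraint_weight(Q, max_weight=2.0)
enabled = use_constraint(Q, threshold=0.5)
```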

A sound processing device according to a second aspect of the present invention comprises matrix generation means and matrix decomposition means. The matrix generation means generates a separation matrix (for example, separation matrix Q) that includes an element for each time-frequency component of an acoustic signal representing a mixture of the sounds of a plurality of sound sources; each element corresponding to a time-frequency component in which the sound of a first sound source is dominant is set to a maintenance value that maintains that component, and each element corresponding to a remaining time-frequency component is set to a suppression value that suppresses that component. The matrix decomposition means takes as its operation target the time-frequency components whose elements are set to the maintenance value in the separation matrix, and repeats the update operation of a non-negative matrix factorization that uses, as teacher information, a first basis matrix (for example, basis matrix F) including a plurality of basis vectors representing the spectra of the components of the sound of the first sound source. From an observation matrix in which the time-frequency components of the acoustic signal are arranged, it thereby calculates a first coefficient matrix (for example, coefficient matrix G) including a plurality of coefficient vectors corresponding to the basis vectors of the first basis matrix, a second basis matrix (for example, basis matrix H) including a plurality of basis vectors representing the spectra of the components of the sound of a second sound source different from the first sound source, and a second coefficient matrix (for example, coefficient matrix U) including a plurality of coefficient vectors corresponding to the basis vectors of the second basis matrix. The matrix decomposition means executes the non-negative matrix factorization under the constraint of suppressing temporal variation of those time-frequency components (for example, time-frequency components d(m,n)) of the matrix product of the first basis matrix and the first coefficient matrix that correspond to elements set to the suppression value in the separation matrix. With this configuration, the possibility that the elements of the first coefficient matrix are set to excessively large values (values that vary discontinuously on the time axis) is reduced. As with the sound processing device of the first aspect, highly accurate sound source separation can therefore be achieved.

The sound processing device according to each of the above aspects may be implemented by hardware (an electronic circuit) such as a DSP (Digital Signal Processor) dedicated to processing acoustic signals, or by cooperation between a general-purpose arithmetic processing device such as a CPU (Central Processing Unit) and a program. The program of the present invention may be provided in a form stored on a computer-readable recording medium and installed on a computer. The recording medium is, for example, a non-transitory recording medium; an optical recording medium (optical disc) such as a CD-ROM is a good example, but any known type of recording medium, such as a semiconductor recording medium or a magnetic recording medium, may be used. The program of the present invention may also be provided by distribution over a communication network and installed on a computer.

The sound processing device according to each of the above aspects may also be specified as a method (sound processing method). The method generates a separation matrix that includes an element for each time-frequency component of an acoustic signal representing a mixture of the sounds of a plurality of sound sources, in which each element corresponding to a time-frequency component dominated by the sound of a first sound source is set to a maintenance value that maintains that component and each element corresponding to a remaining time-frequency component is set to a suppression value that suppresses that component. Taking as its operation target the time-frequency components whose elements are set to the maintenance value in the separation matrix, the method repeats the update operation of a non-negative matrix factorization that uses, as teacher information, a first basis matrix including a plurality of basis vectors representing the spectra of the components of the sound of the first sound source. From an observation matrix in which the time-frequency components of the acoustic signal are arranged, it thereby calculates a first coefficient matrix including a plurality of coefficient vectors corresponding to the basis vectors of the first basis matrix, a second basis matrix including a plurality of basis vectors representing the spectra of the components of the sound of a second sound source different from the first sound source, and a second coefficient matrix including a plurality of coefficient vectors corresponding to the basis vectors of the second basis matrix. In the sound processing method according to the first aspect, the non-negative matrix factorization is executed under the constraint of suppressing each element of the first coefficient matrix. In the sound processing method according to the second aspect, it is executed under the constraint of suppressing temporal variation of those time-frequency components of the matrix product of the first basis matrix and the first coefficient matrix that correspond to elements set to the suppression value in the separation matrix.

FIG. 1 is a block diagram of a sound processing device according to a first embodiment of the present invention.
FIG. 2 is an explanatory diagram of non-negative matrix factorization.
FIG. 3 is a flowchart of directional clustering.
FIG. 4 is an explanatory diagram of a problem of Comparative Example 1.
FIG. 5 is an explanatory diagram of a problem of Comparative Example 2.
FIG. 6 is an explanatory diagram of a problem of Comparative Example 2.
FIG. 7 is an explanatory diagram of the separation matrix and the inverted separation matrix.
FIG. 8 is a chart of experimental results for the first embodiment and Comparative Example 3.
FIG. 9 is a chart of experimental results for the first embodiment and Comparative Example 4.
FIG. 10 is a chart of experimental results for the first embodiment and Comparative Example 2.
FIG. 11 is a block diagram of a sound processing device according to a second embodiment.
FIG. 12 is an explanatory diagram of the index value in a modification of the second embodiment.

<First Embodiment>
FIG. 1 is a block diagram of a sound processing device 100 according to the first embodiment of the present invention. As shown in FIG. 1, a signal supply device 12 and a sound emitting device 14 are connected to the sound processing device 100. The signal supply device 12 supplies the sound processing device 100 with a time-domain acoustic signal SA representing the waveform of a mixture of sounds generated by a plurality of sound sources that may differ in spatial position (direction relative to the sound collection point) or in acoustic characteristics. The acoustic signal SA is a multi-channel signal composed of acoustic signals SA[1] to SA[J] of J channels (J is a natural number of 2 or more). Among the sound sources of the sound represented by the acoustic signal SA, a sound source whose spatial position or acoustic characteristics are known is hereinafter referred to as the first sound source, and a sound source other than the first sound source is referred to as the second sound source. Each of the first sound source and the second sound source corresponds to one or more sound sources (a single sound source or a group of sound sources). The signal supply device 12 may be a sound collection device that picks up ambient sound and generates the acoustic signal SA, a playback device that acquires the acoustic signal SA from a portable or built-in recording medium and supplies it to the sound processing device 100, or a communication device that receives the acoustic signal SA from a communication network and supplies it to the sound processing device 100.

The sound processing device 100 of the first embodiment is a signal processing device (sound source separation device) that generates an acoustic signal SB by applying sound source separation to the acoustic signal SA supplied from the signal supply device 12. The acoustic signal SB is a time-domain signal in which the sound of one of the first sound source and the second sound source has been separated (extracted or suppressed) from the acoustic signal SA. Specifically, an acoustic signal SB is generated by extracting the sound of whichever of the first sound source and the second sound source is selected, for example, by the user. That is, the acoustic signal SA is separated by sound source. The sound emitting device 14 (for example, a loudspeaker or headphones) emits sound waves corresponding to the acoustic signal SB generated by the sound processing device 100.

As shown in FIG. 1, the sound processing device 100 is implemented by a computer system comprising an arithmetic processing device 22 and a storage device 24. The storage device 24 stores a program executed by the arithmetic processing device 22 and various data used by the arithmetic processing device 22 (for example, the basis matrix F). Any known recording medium, such as a semiconductor recording medium or a magnetic recording medium, or a combination of several types of recording media may be used as the storage device 24. A configuration in which the acoustic signal SA is stored in the storage device 24 (so that the signal supply device 12 can be omitted) is also suitable.

By executing the program stored in the storage device 24, the arithmetic processing device 22 implements a plurality of functions for generating the acoustic signal SB from the acoustic signal SA (a frequency analysis unit 32, a matrix generation unit 34, a matrix decomposition unit 36, and a waveform synthesis unit 38). Processing by each element of the arithmetic processing device 22 is executed sequentially in units of N unit intervals (frames) into which the acoustic signal SA is divided on the time axis (N is a natural number of 2 or more). A configuration in which the functions of the arithmetic processing device 22 are distributed over a plurality of devices, or in which a dedicated electronic circuit (for example, a DSP) implements some of the functions of the arithmetic processing device 22, may also be adopted.

The frequency analysis unit 32 performs frequency analysis, such as a short-time Fourier transform, on the acoustic signal SA. Specifically, the frequency analysis unit 32 generates an observation matrix Y from the acoustic signal SA (for example, from the average or sum of the acoustic signals SA[1] to SA[J], or from any one of the acoustic signals SA[1] to SA[J]). As illustrated in FIG. 2, the observation matrix Y is a matrix of M rows and N columns in which the time-frequency components y(m,n) (y(1,1) to y(M,N)) are arranged vertically and horizontally. The symbol m is a variable designating any one of the M frequencies (frequency bins) set on the frequency axis (m = 1 to M), and the symbol n is a variable designating any one of the N unit intervals on the time axis (n = 1 to N). A given time-frequency component y(m,n) is the intensity (amplitude) at the m-th frequency on the frequency axis of the acoustic signal SA within the n-th unit interval on the time axis. The sequence of M time-frequency components y(1,n) to y(M,n) in the n-th column of the observation matrix Y therefore corresponds to the amplitude spectrum of the acoustic signal SA in the n-th unit interval. As understood from the above description, the observation matrix Y is a non-negative matrix of M rows and N columns representing the time series of amplitude spectra (amplitude spectrogram) of the acoustic signal SA over N unit intervals.
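As a rough illustration of how such an observation matrix can be built, the sketch below averages two channels and computes an amplitude spectrogram with a naive discrete Fourier transform. The frame length, hop size, Hann window, and test tone are assumptions chosen for the example and are not values given in this description; a practical implementation would use an FFT library.

```python
import cmath
import math


def amplitude_spectrogram(signal, frame_len, hop):
    """Return Y as M rows (frequency bins) by N columns (unit intervals)."""
    window = [0.5 - 0.5 * math.cos(2 * math.pi * i / frame_len)
              for i in range(frame_len)]           # Hann window (assumed)
    n_bins = frame_len // 2 + 1                    # M non-redundant bins
    columns = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = [signal[start + i] * window[i] for i in range(frame_len)]
        spectrum = [abs(sum(frame[i] * cmath.exp(-2j * math.pi * m * i / frame_len)
                            for i in range(frame_len)))
                    for m in range(n_bins)]        # amplitude at each bin
        columns.append(spectrum)
    # Transpose so that rows index frequency (m) and columns index time (n).
    return [[col[m] for col in columns] for m in range(n_bins)]


# J = 2 channels carrying the same tone at different levels.
left = [math.sin(2 * math.pi * 4 * t / 32) for t in range(64)]
right = [0.5 * v for v in left]
mono = [(l + r) / 2 for l, r in zip(left, right)]  # average of SA[1] and SA[2]
Y = amplitude_spectrogram(mono, frame_len=32, hop=16)
```

Each column of `Y` is the amplitude spectrum of one unit interval, and every entry is non-negative because it is a magnitude.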

The frequency analysis unit 32 also calculates time-frequency components xj(m,n) for each of the J-channel acoustic signals SA[1] to SA[J]. A given time-frequency component xj(m,n) of the j-th channel is the intensity (amplitude) at the m-th frequency on the frequency axis of the acoustic signal SA[j] within the n-th unit interval on the time axis. For each unit interval and each frequency, the frequency analysis unit 32 generates an observation vector VX(m,n) (VX(m,n) = [x1(m,n), x2(m,n), ..., xJ(m,n)]^T) in which the time-frequency components xj(m,n) sharing the same frequency and unit interval are arranged over the J channels. The symbol T denotes matrix transposition.
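The arrangement of per-channel components into observation vectors can be sketched as follows. The function name and the nested-list layout (`vx[m][n]` holds the J-element vector for bin m and frame n) are illustrative assumptions, and indices are zero-based here whereas the description counts from 1.

```python
def observation_vectors(channel_spectrograms):
    """Stack J per-channel M x N amplitude spectrograms into vectors VX(m, n)."""
    n_bins = len(channel_spectrograms[0])
    n_frames = len(channel_spectrograms[0][0])
    return [[[x[m][n] for x in channel_spectrograms]   # [x1(m,n), ..., xJ(m,n)]
             for n in range(n_frames)]
            for m in range(n_bins)]


# Toy input: J = 2 channels, M = 2 bins, N = 2 frames.
X1 = [[1.0, 2.0], [3.0, 4.0]]
X2 = [[0.5, 0.0], [3.0, 8.0]]
VX = observation_vectors([X1, X2])
```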

The matrix generation unit 34 in FIG. 1 generates a separation matrix Q used as a mask for extracting the sound of the first sound source from the acoustic signal SA. The matrix generation unit 34 of the first embodiment generates the separation matrix Q using directional clustering. In directional clustering, the observation vectors VX(m,n) are classified into R classes according to the spatial direction of each sound source (sound image) estimated from the J-channel acoustic signals SA[1] to SA[J], and the observation vectors VX(m,n) classified into each class are approximated by the spatial representative vector Λr(m) (Λ1(m) to ΛR(m)) of that class (r = 1 to R). Directional clustering is also described in Non-Patent Document 2; an outline is given below to aid understanding of the separation matrix Q.

FIG. 3 is a flowchart of directional clustering. On starting directional clustering, the matrix generation unit 34 sets R initial centroids C1(m) to CR(m) corresponding to the different classes (S1). Then, as expressed by equation (1) below, the matrix generation unit 34 determines the number Ψ(m,n) of the class to which the observation vector VX(m,n) belongs so that the error E(VX(m,n),Cr(m))² between each centroid Cr(m) and the observation vector VX(m,n) is minimized (S2).

$$\Psi(m,n) = \underset{r}{\arg\min}\; E\bigl(V_X(m,n),\, C_r(m)\bigr)^2 \qquad (1)$$

As understood from equation (1), the variable Ψ(m,n) is set to the number r of the class whose centroid Cr(m) is most similar to the observation vector VX(m,n).

As expressed by equation (2) below, the matrix generation unit 34 updates the centroid Cr(m) of the r-th class so that the sum of the errors E(VX(m,n),Cr(m))² between the observation vectors VX(m,n) classified into the r-th class and the centroid is minimized (S3). The symbol Θr in equation (2) denotes the set of unit intervals in which the observation vectors VX(m,n) belonging to the r-th class were generated.

$$C_r(m) \leftarrow \underset{C}{\arg\min} \sum_{n \in \Theta_r} E\bigl(V_X(m,n),\, C\bigr)^2 \qquad (2)$$

The matrix generation unit 34 determines whether a convergence condition is satisfied (S4). Specifically, whether the centroid Cr(m) changed across the update in step S3 is a suitable convergence condition. That is, if the centroid Cr(m) changed between before and after the update, the matrix generation unit 34 determines that the convergence condition is not satisfied (S4: NO) and returns the process to step S2. If, on the other hand, the centroid Cr(m) did not change across the update, the matrix generation unit 34 determines that the convergence condition is satisfied (S4: YES) and fixes the most recently updated centroid Cr(m) as the spatial representative vector Λr(m) of the r-th class (S5).
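Steps S1 to S5 can be sketched in Python for a single frequency index m. This is a minimal sketch under stated assumptions, not the patent's implementation: the error E is taken to be the Euclidean distance (so the update of S3 reduces to the class mean), classes are numbered from 0 rather than 1, and the function name `direction_clustering` and the `max_iter` safeguard are hypothetical.

```python
import numpy as np

def direction_clustering(V, C0, max_iter=100):
    """Cluster the observation vectors of one frequency bin.

    V  : (N, J) array, one J-channel observation vector per unit interval n.
    C0 : (R, J) array of initial centroids (S1).
    Returns the class numbers psi (N,) and the final centroids (R, J).
    """
    V = np.asarray(V, dtype=float)
    C = np.asarray(C0, dtype=float).copy()
    for _ in range(max_iter):
        # S2: squared error between every vector and every centroid, shape (N, R)
        err = ((V[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
        psi = err.argmin(axis=1)
        # S3: move each centroid to the mean of its members (assumed Euclidean E)
        C_new = C.copy()
        for r in range(C.shape[0]):
            members = V[psi == r]
            if len(members) > 0:          # keep the old centroid for an empty class
                C_new[r] = members.mean(axis=0)
        # S4: converged when the update leaves every centroid unchanged
        if np.array_equal(C_new, C):
            break
        C = C_new
    return psi, C                          # S5: C holds the representative vectors

# Made-up two-channel data with two well-separated directions.
V = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
C0 = np.array([[1.0, 0.0], [0.0, 1.0]])
psi, centroids = direction_clustering(V, C0)
print(psi, centroids)
```

The final centroids play the role of the spatial representative vectors Λr(m), and `psi` plays the role of Ψ(m,n) for the chosen m.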

After executing the directional clustering of FIG. 3 described above, the matrix generation unit 34 generates the separation matrix Q according to the variable Ψ(m,n) set in step S2 of FIG. 3. As illustrated in FIG. 2, the separation matrix Q is a non-negative matrix of M rows and N columns in which elements q(m,n) (q(1,1) to q(M,N)), each corresponding one-to-one to a time-frequency component y(m,n) of the observation matrix Y, are arranged vertically and horizontally. Each element q(m,n) of the separation matrix Q is expressed by equation (3).
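Equation (3) is not reproduced in this text. The description of the separation matrix Q implies the following piecewise definition, where η is the number of the class assigned to the first sound source (a reconstruction, not the original typography):

```latex
q(m,n) =
\begin{cases}
1 & \text{if } \Psi(m,n) = \eta \\
0 & \text{otherwise}
\end{cases}
\tag{3}
```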

The symbol η in equation (3) denotes the number of the one class, among the R classes of directional clustering, that corresponds to the first sound source (hereinafter the "target class"), and is set in advance according to the known direction of the first sound source. For example, the separation matrix Q is generated with the class corresponding to the front direction of the sound pickup point as the target class. As understood from equation (3), an element q(m,n) of the separation matrix Q is set to 1 if it corresponds to a time-frequency component y(m,n) whose variable Ψ(m,n) equals η (a time-frequency component classified into the target class), and is set to 0 if it corresponds to a time-frequency component y(m,n) whose variable Ψ(m,n) differs from η (a time-frequency component classified into a class other than the target class). In other words, elements q(m,n) corresponding to time-frequency components y(m,n) in which the sound of the first sound source dominates over the sound of the second sound source are set to 1, and elements q(m,n) corresponding to the remaining time-frequency components y(m,n) (those in which the sound of the second sound source dominates over the sound of the first sound source) are set to 0. The value 1 of an element q(m,n) is a value that preserves the time-frequency component y(m,n) when multiplied by it (a preservation value), and the value 0 is a value that suppresses the time-frequency component y(m,n) when multiplied by it (a suppression value).
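The construction of Q from the class numbers can be illustrated with a short NumPy sketch. The function name `separation_matrix` and the sample values of Ψ(m,n) are hypothetical and chosen only for illustration.

```python
import numpy as np

def separation_matrix(psi, eta):
    """Return the M x N binary mask Q: 1 where psi equals the target class eta, else 0."""
    return (np.asarray(psi) == eta).astype(float)

# Class numbers for M = 2 frequencies and N = 3 unit intervals (made-up values).
psi = np.array([[1, 2, 1],
                [2, 2, 1]])
Q = separation_matrix(psi, eta=1)   # target class eta = 1
print(Q)
```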

As understood from FIG. 2, the matrix decomposition unit 36 in FIG. 1 calculates a first matrix D1 corresponding to the first sound source and a second matrix D2 corresponding to the second sound source by non-negative matrix factorization of the observation matrix Y calculated by the frequency analysis unit 32. The first matrix D1 is a non-negative matrix of M rows and N columns that represents the time series of amplitude spectra of the sound of the first sound source in the acoustic signal SA (the amplitude spectrogram of the sound of the first sound source), and is expressed as the matrix product of a basis matrix F and a coefficient matrix G. As illustrated in FIG. 2, the basis matrix F is a non-negative matrix of M rows and K columns in which K basis vectors f(1) to f(K), corresponding to the components that make up the sound of the first sound source, are arranged horizontally. The basis vector f(k) in the k-th column (k = 1 to K) of the basis matrix F consists of M elements f(1,k) to f(M,k) corresponding to different frequencies on the frequency axis, and corresponds to the amplitude spectrum of the k-th of the K components that make up the sound of the first sound source (for example, the components of K pitches that the first sound source can produce). The coefficient matrix G, on the other hand, is a non-negative matrix of K rows and N columns in which K coefficient vectors g(1) to g(K), corresponding to the basis vectors f(k) of the basis matrix F, are arranged vertically. The coefficient vector g(k) in the k-th row of the coefficient matrix G consists of N elements (coefficients) g(k,1) to g(k,N) corresponding to different unit intervals on the time axis, and corresponds to the time series of weights (activations) applied to the basis vector f(k) of the basis matrix F.

The second matrix D2 is a non-negative matrix of M rows and N columns that represents the time series of amplitude spectra of the sound of the second sound source in the acoustic signal SA (the amplitude spectrogram of the sound of the second sound source), and is expressed as the matrix product of a basis matrix H and a coefficient matrix U. As illustrated in FIG. 2, the basis matrix H is a non-negative matrix of M rows and K columns in which K basis vectors h(1) to h(K) are arranged horizontally. Each basis vector h(k) corresponds to the amplitude spectrum of one component of the sound of the second sound source and consists of M elements h(1,k) to h(M,k) corresponding to different frequencies on the frequency axis. The coefficient matrix U, on the other hand, is a non-negative matrix of K rows and N columns in which K coefficient vectors u(1) to u(K) are arranged vertically. Each coefficient vector u(k) consists of N elements u(k,1) to u(k,N) corresponding to different unit intervals on the time axis, and corresponds to the time series of weights applied to the basis vector h(k) of the basis matrix H. The number of columns of the basis matrix F (the number of rows of the coefficient matrix G) may differ from the number of columns of the basis matrix H (the number of rows of the coefficient matrix U).

The matrix decomposition unit 36 of the first embodiment calculates the coefficient matrix G, the basis matrix H, and the coefficient matrix U from the observation matrix Y by supervised non-negative matrix factorization that uses a known basis matrix F, prepared in advance, as teacher information (prior information). The basis matrix F is generated in advance from sound produced by the known first sound source alone (sound that does not include the sound of the second sound source) and is then stored in the storage device 24.

In the separation matrix Q generated by the matrix generation unit 34 described above, elements q(m,n) corresponding to time-frequency components y(m,n) in which the sound of the first sound source dominates are set to 1, and elements q(m,n) corresponding to the remaining time-frequency components y(m,n) are set to 0. A conceivable configuration (hereinafter "Comparative Example 1") is therefore to perform non-negative matrix factorization on the matrix {Y.×Q} produced by spatial sound source separation, in which each time-frequency component y(m,n) of the observation matrix Y is multiplied by the corresponding element q(m,n) of the separation matrix Q, as illustrated in FIG. 4. The symbol ".×" denotes element-wise multiplication between matrices (the Hadamard product).

However, in a configuration that assigns each time-frequency component y(m,n) of the acoustic signal SA to exactly one of the R classes (hard clustering), as the directional clustering described above does, an element q(m,n) corresponding to a time-frequency component y(m,n) in which the sound of the second sound source dominates can be set to 0 (the suppression value) even when that component contains the sound of the first sound source, as long as the first sound source is weaker than the second. Consequently, as understood from FIG. 4, in the matrix {Y.×Q} obtained after spatial sound source separation, many elements have an intensity of 0, including elements that contained the sound of the first sound source before the separation (time-frequency components are missing at many points in the time-frequency domain), and the acoustic characteristics (frequency characteristics) of this matrix may deviate from those of the sound of the first sound source represented by the basis matrix F. High-accuracy sound source separation is therefore difficult with Comparative Example 1.
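The loss described for Comparative Example 1 can be seen in a small numeric sketch. All values are hypothetical, the amplitudes of the two sources are assumed simply to add, and the mask is derived here from the known source spectrograms only for illustration (in the device it comes from directional clustering).

```python
import numpy as np

# Hypothetical amplitude spectrograms (M = 2 frequencies, N = 2 unit intervals).
S1 = np.array([[0.9, 0.3],
               [0.2, 0.1]])   # first sound source
S2 = np.array([[0.1, 0.5],
               [0.4, 0.3]])   # second sound source
Y = S1 + S2                   # observation, assuming amplitudes simply add

Q = (S1 > S2).astype(float)   # hard decision: 1 where the first source dominates
masked = Y * Q                # {Y.×Q}: element-wise (Hadamard) product
lost = S1 * (1.0 - Q)         # first-source amplitude at the suppressed points

print(masked)
print(lost.sum())
```

Three of the four points are zeroed even though the first sound source is present at each of them, which is the mismatch with the basis matrix F that the text describes.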

To address this problem, a conceivable configuration (hereinafter "Comparative Example 2") is to execute the update operations of non-negative matrix factorization using, as the operation target, only those time-frequency components y(m,n) of the observation matrix Y that correspond to elements q(m,n) set to 1 in the separation matrix Q. In the non-negative matrix factorization of Comparative Example 2, the separated matrices (G, H, U) are updated iteratively so as to minimize, for example, the evaluation function Φ of equation (4), which evaluates the difference (distance) between the observation matrix Y before processing and the matrix after processing (the matrix sum of the first matrix D1 and the second matrix D2); that is, so that the distance between the two is minimized or the correlation between the two is maximized.

The symbol δ(Y|FG+HU) in equation (4) denotes the distance (for example, the Euclidean distance) between the observation matrix Y and the separated matrix {FG+HU}. With Comparative Example 2, to which the evaluation function Φ of equation (4) is applied, the effect of the zero-valued elements q(m,n) of the separation matrix Q is compensated by the basis matrix F, so sound source separation is more accurate than with Comparative Example 1.
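Equation (4) is not reproduced in this text, so its exact form is unknown. One way to restrict the distance to the points where q(m,n) = 1 is sketched below, assuming δ is the squared Euclidean distance and that the restriction is applied by multiplying the residual by Q; the function name `masked_distance` and all values are hypothetical.

```python
import numpy as np

def masked_distance(Y, Q, F, G, H, U):
    """Squared Euclidean distance between Y and FG + HU over entries where Q is 1."""
    residual = Q * (Y - (F @ G + H @ U))
    return float(np.sum(residual ** 2))

F = np.array([[1.0], [0.0]]); G = np.array([[2.0, 1.0]])
H = np.array([[0.0], [1.0]]); U = np.array([[3.0, 4.0]])
Y = np.array([[2.0, 9.0],
              [3.0, 4.0]])          # Y[0, 1] does not match FG + HU
Q_all = np.ones_like(Y)
Q_part = np.array([[1.0, 0.0],
                   [1.0, 1.0]])     # the mismatching entry is excluded
print(masked_distance(Y, Q_all, F, G, H, U))
print(masked_distance(Y, Q_part, F, G, H, U))
```

Entries with q(m,n) = 0 contribute nothing to the distance, so the model FG + HU is left unconstrained there.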

In Comparative Example 2, however, when the number of time-frequency components y(m,n) classified into the target class of the first sound source (the total number of elements q(m,n) set to 1 in the separation matrix Q) is small, the elements g(k,n) of the coefficient vectors g(k) of the coefficient matrix G (the gains of the basis vectors f(k)) can become excessively large, because each time-frequency component y(m,n) of the first sound source, scattered across the time-frequency domain, is approximated by a scalar multiple of a basis vector f(k) of the basis matrix F. For example, suppose that, of the actual amplitude spectrum of the sound of the first sound source (part (A) of FIG. 5), only the element q(m1,n) corresponding to the time-frequency component y(m1,n) at the m1-th frequency is set to 1 as a result of directional clustering, and the remaining elements q(m,n) are set to 0. In this situation, only the time-frequency component y(m1,n) at the m1-th frequency of the amplitude spectrum in part (A) of FIG. 5 is an operation target (part (B) of FIG. 5). Consequently, as understood from part (C) of FIG. 5, each coefficient g(k,n) of the coefficient vector g(k) (that is, the gain of the basis vector f(k)) is set to an excessively large value so that the product of the element f(m1,k) of a particular basis vector f(k) of the basis matrix F and the coefficient g(k,n) approximates the time-frequency component y(m1,n). As a result, as illustrated by the interval σ in FIG. 6, the intensity of the acoustic signal SB in which the sound of the first sound source is emphasized can diverge.
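The mechanism of FIG. 5 can be followed with a worked example. The numbers are hypothetical: a basis vector whose element at the only observed frequency m1 is small forces a large gain, and that gain then scales every other frequency of the reconstructed spectrum.

```python
import numpy as np

f_k = np.array([1.0, 0.2, 0.02])   # hypothetical basis vector f(k), M = 3
m1 = 2                              # index of the only observed frequency
y_m1 = 0.5                          # observed component y(m1, n)

g = y_m1 / f_k[m1]                  # gain that makes f(m1,k) * g equal y(m1,n)
spectrum = f_k * g                  # spectrum implied for all M frequencies
print(g, spectrum.max())
```

Nothing in the masked distance penalizes the large values at the unobserved frequencies, which is why the output intensity can diverge.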

To solve the problems of Comparative Example 1 and Comparative Example 2 described above, the first embodiment performs non-negative matrix factorization of the observation matrix Y under a constraint (hereinafter the "first constraint") that each element of the coefficient matrix G (each coefficient vector g(k)) corresponding to the basis matrix F is suppressed. Specifically, the separated matrices (G, H, U) are updated iteratively so that the evaluation function Φ expressed by equation (5) is minimized.

Minimizing the second term on the right side of equation (5) corresponds to the first constraint. The symbol ‖ ‖ in equation (5) denotes a norm (for example, the Frobenius norm), and the coefficient λ is set to a predetermined positive number. As understood from FIG. 7, the symbol /Q denotes a matrix of M rows and N columns (hereinafter the "inverted separation matrix") whose elements /q(m,n) are obtained by swapping the values 0 and 1 of the elements q(m,n) of the separation matrix Q (/q(m,n) = 1 - q(m,n)). That is, the first constraint of the first embodiment corresponds to suppressing the squared norm of the matrix {/Q.×(FG)} (hereinafter the "constrained matrix") obtained by multiplying each element of the matrix product of the basis matrix F and the coefficient matrix G (the first matrix D1) by the corresponding element /q(m,n) of the inverted separation matrix /Q. In the inverted separation matrix /Q, elements /q(m,n) corresponding to time-frequency components y(m,n) in which the sound of the first sound source dominates are set to 0, and elements /q(m,n) corresponding to time-frequency components y(m,n) in which the sound of the second sound source dominates are set to 1. The first constraint of the first embodiment can therefore also be described as a condition for suppressing those elements of the first matrix D1 (= FG) corresponding to the first sound source that correspond to time-frequency components y(m,n) in which the sound of the second sound source dominates in the observation matrix Y.
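Equation (5) is not reproduced in this text. Based on the description of its second term, its structure can be written as follows, where the first term stands for the right side of equation (4) (whose exact form is not given here), the overlined Q corresponds to /Q, and ∘ corresponds to ".×". This is a reconstruction under those assumptions:

```latex
\Phi = \Phi_{(4)} + \lambda \,\bigl\lVert\, \overline{Q} \circ (FG) \,\bigr\rVert^{2},
\qquad \overline{q}(m,n) = 1 - q(m,n)
\tag{5}
```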

The condition of minimizing the evaluation function Φ of equation (5) alone, however, leaves the possibility that the basis matrix F and the basis matrix H become equivalent (that is, that the sound of the first sound source and the sound of the second sound source cannot be separated properly). In view of this, the first embodiment introduces a constraint (hereinafter the "second constraint") that the similarity between the basis matrix F, which represents the acoustic characteristics of the first sound source, and the basis matrix H, which represents the acoustic characteristics of the second sound source, is reduced (the distance between the two is maximized, or the correlation between the two is minimized). Specifically, the separated matrices (G, H, U) are updated iteratively so that the evaluation function Φ expressed by equation (6) is minimized.

Minimizing the third term (penalty term) on the right side of equation (6), which contains the norm (for example, the Frobenius norm) ‖F^T H‖ of the correlation matrix {F^T H} between the basis matrix F and the basis matrix H, corresponds to the second constraint. Introducing the second constraint avoids a situation in which the basis matrix F and the basis matrix H have bases in common. The second constraint is also described in detail in Japanese Unexamined Patent Application Publication No. 2013-033196.
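The three-term structure described for equation (6) can be sketched as a single function. This is an illustration under assumptions, since the equation itself is not reproduced here: the distance is taken to be squared Euclidean and restricted to q(m,n) = 1 by masking, both norms are squared Frobenius norms, and the weight `mu` of the third term is a hypothetical name (the text names only λ).

```python
import numpy as np

def evaluation_function(Y, Q, F, G, H, U, lam, mu):
    """Masked distance + first constraint (lam) + second constraint (mu)."""
    D1 = F @ G                                        # first matrix D1
    D2 = H @ U                                        # second matrix D2
    data = np.sum((Q * (Y - D1 - D2)) ** 2)           # distance over q(m,n) = 1
    first = lam * np.sum(((1.0 - Q) * D1) ** 2)       # ||/Q .x (FG)||^2
    second = mu * np.sum((F.T @ H) ** 2)              # ||F^T H||^2
    return float(data + first + second)

F = np.array([[1.0], [0.0]]); G = np.array([[2.0]])
H = np.array([[1.0], [1.0]]); U = np.array([[3.0]])
Y = F @ G + H @ U                                     # exact fit, so the distance is 0
Q = np.array([[0.0], [1.0]])
phi = evaluation_function(Y, Q, F, G, H, U, lam=0.5, mu=2.0)
print(phi)
```

Setting `lam=0` corresponds to the configuration the text calls Comparative Example 2, and `mu=0` removes the second constraint.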

The non-negative matrix factorization performed by the matrix decomposition unit 36 of the first embodiment corresponds to a process of minimizing the evaluation function Φ of equation (6). Equations (7) to (9) are derived from the condition that the evaluation function Φ is minimized. The symbol ".−" in equations (7) to (9) denotes element-wise division between matrices.

The matrix decomposition unit 36 calculates the coefficient matrix G, the basis matrix H, and the coefficient matrix U from the observation matrix Y by repeating the update operations of equations (7) to (9). Specifically, the matrix decomposition unit 36 sets the initial values of the unknown matrices (G, H, U) in the evaluation function Φ to, for example, random numbers, repeats the update operations of equations (7) to (9), and fixes the results (coefficient matrix G, basis matrix H, coefficient matrix U) when the number of iterations reaches a predetermined number. The number of iterations is chosen experimentally or statistically so that, for example, the evaluation function Φ converges to a predetermined value (for example, zero). As understood from the above description, the matrix decomposition unit 36 of the first embodiment repeats the update operations of non-negative matrix factorization (equations (7) to (9)) that use the basis matrix F of the first sound source as teacher information, with the time-frequency components y(m,n) corresponding to elements q(m,n) set to 1 in the separation matrix Q as the operation target, so that the first constraint for suppressing each element of the coefficient matrix G and the second constraint for reducing the similarity between the basis matrix F and the basis matrix H are satisfied.
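Equations (7) to (9) are not reproduced in this text, so the sketch below does not claim to match them. It uses the standard multiplicative-update heuristic (each matrix is multiplied element-wise by the ratio of the negative to the positive part of the gradient) applied to an assumed squared-Euclidean form of the evaluation function with a binary mask Q. The names `supervised_nmf`, `cost`, `mu`, and `n_bases_h`, and the synthetic data, are hypothetical.

```python
import numpy as np

def cost(Y, Q, F, G, H, U, lam, mu):
    D1 = F @ G
    return float(np.sum((Q * (Y - D1 - H @ U)) ** 2)
                 + lam * np.sum(((1.0 - Q) * D1) ** 2)
                 + mu * np.sum((F.T @ H) ** 2))

def supervised_nmf(Y, Q, F, n_bases_h, lam, mu, n_iter=100, seed=0, eps=1e-12):
    """Estimate G, H, U with F fixed; Q must be a binary (0/1) mask."""
    rng = np.random.default_rng(seed)
    M, N = Y.shape
    G = rng.random((F.shape[1], N)) + eps     # random non-negative initial values
    H = rng.random((M, n_bases_h)) + eps
    U = rng.random((n_bases_h, N)) + eps
    Qbar = 1.0 - Q                            # inverted separation matrix /Q
    QY = Q * Y
    history = []
    for _ in range(n_iter):                   # fixed, predetermined iteration count
        X = F @ G + H @ U
        G *= (F.T @ QY) / (F.T @ (Q * X) + lam * (F.T @ (Qbar * (F @ G))) + eps)
        X = F @ G + H @ U
        H *= (QY @ U.T) / ((Q * X) @ U.T + mu * (F @ (F.T @ H)) + eps)
        X = F @ G + H @ U
        U *= (H.T @ QY) / (H.T @ (Q * X) + eps)
        history.append(cost(Y, Q, F, G, H, U, lam, mu))
    return G, H, U, history

# Synthetic check with made-up sizes: M = 6, N = 8, two bases per source.
rng = np.random.default_rng(1)
F = rng.random((6, 2))
Y = F @ rng.random((2, 8)) + rng.random((6, 2)) @ rng.random((2, 8))
Q = (rng.random((6, 8)) > 0.5).astype(float)
G, H, U, history = supervised_nmf(Y, Q, F, n_bases_h=2, lam=0.1, mu=0.1)
print(history[0], history[-1])
```

Because every factor in the updates is non-negative, G, H, and U stay non-negative without an explicit projection, and F is never modified, which is what makes the factorization supervised.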

The waveform synthesis unit 38 in FIG. 1 generates an acoustic signal SB using the analysis results (G, H, U) of the matrix decomposition unit 36. Specifically, when the first sound source is designated, the waveform synthesis unit 38 calculates the amplitude spectrogram of the sound of the first sound source in the acoustic signal SA by multiplying the basis matrix F stored in the storage device 24 by the coefficient matrix G generated by the matrix decomposition unit 36, and generates the time-domain acoustic signal SB by an inverse short-time Fourier transform that applies the amplitude spectrum of each unit interval together with the phase spectrum of the acoustic signal SA in that unit interval. When the second sound source is designated, the waveform synthesis unit 38 calculates the amplitude spectrogram of the sound of the second sound source in the acoustic signal SA by multiplying the basis matrix H by the coefficient matrix U, both generated by the matrix decomposition unit 36, and generates the time-domain acoustic signal SB by an inverse short-time Fourier transform in the same way. That is, an acoustic signal SB is generated in which the acoustic signal SA is separated into the first sound source and the second sound source. The acoustic signal SB generated by the waveform synthesis unit 38 is supplied to the sound emitting device 14 and reproduced as sound waves.
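Combining a separated amplitude spectrogram with the phase of the mixture can be sketched as follows. This is a simplification under stated assumptions: unit intervals are treated as non-overlapping rectangular frames, whereas a practical short-time Fourier transform would normally use windowing and overlap-add. The function name `synthesize` and the test signal are hypothetical.

```python
import numpy as np

def synthesize(amplitude, mixture_spectrogram, frame_len):
    """Combine an amplitude spectrogram with the mixture phase and return a waveform."""
    phase = np.angle(mixture_spectrogram)                 # phase spectra of SA
    spectrum = amplitude * np.exp(1j * phase)             # complex spectrum per unit interval
    frames = np.fft.irfft(spectrum, n=frame_len, axis=0)  # back to the time domain
    return frames.T.reshape(-1)                           # concatenate the unit intervals

# Round-trip check: with the mixture's own amplitude, the input is recovered.
frame_len = 8
x = np.sin(0.3 * np.arange(32)) + 0.5 * np.cos(1.1 * np.arange(32))
X = np.fft.rfft(x.reshape(-1, frame_len).T, axis=0)       # shape (M, N) = (5, 4)
y = synthesize(np.abs(X), X, frame_len)
print(np.allclose(x, y))
```

In the setting of the text, `amplitude` would be the product F @ G when the first sound source is designated, or H @ U when the second is designated.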

In the first embodiment described above, the coefficient matrix G, the basis matrix H, and the coefficient matrix U are calculated by non-negative matrix factorization that takes into account the first constraint for suppressing each element of the coefficient matrix G corresponding to the basis matrix F of the first sound source. This resolves the problem of Comparative Example 2 (the problem described with reference to FIG. 5), in which the elements of the coefficient matrix G are set to excessively large values and the intensity of the separated acoustic signal SB diverges. That is, the first embodiment achieves more accurate sound source separation than Comparative Example 1 or Comparative Example 2. Furthermore, because the first embodiment performs non-negative matrix factorization with the first constraint being to reduce the norm of the constrained matrix {/Q.×(FG)}, obtained by multiplying each element of the first matrix D1 (= FG) by the corresponding element /q(m,n) of the inverted separation matrix /Q, the improvement in separation accuracy is particularly pronounced.

FIGS. 8 to 10 show experimental results of sound source separation according to the first embodiment. FIG. 8 also shows, for comparison, the results of spatial sound source separation using directional clustering alone (hereinafter "Comparative Example 3"), and FIG. 9 also shows, for comparison, the results of sound source separation using only non-negative matrix factorization with the second constraint (minimization of the evaluation function Φ of equation (6) with its second term omitted) (hereinafter "Comparative Example 4"). FIG. 10 also shows, for comparison with the first embodiment, the results of Comparative Example 2, which executes directional clustering and non-negative matrix factorization with the second constraint in succession (a configuration in which the coefficient λ of the evaluation function Φ of equation (6) is set to 0).

FIGS. 8 to 10 list, as evaluation measures of the experimental results, the signal-to-distortion ratio (SDR), the signal-to-interference ratio (SIR), and the sources-to-artifacts ratio (SAR), a measure of nonlinear distortion, for several cases in which the instruments of the first and second sound sources differ (Ob. = oboe, Fl. = flute, Pf. = piano, Tb. = trombone). The signal-to-distortion ratio is an overall measure of both separation accuracy and the quality of the separated signal, the signal-to-interference ratio is a measure of separation accuracy only, and the sources-to-artifacts ratio is a measure of the signal distortion introduced between before and after separation. For each measure, a larger value means a better separation result. Values that exceed those of the comparison target are shown in bold.

As understood from FIGS. 8 to 10, the first embodiment achieves better sound source separation than any of Comparative Examples 2 to 4 in terms of both separation accuracy and the quality of the separated signal. In FIG. 10, there are some cases in which Comparative Example 1 exceeds the first embodiment in the signal-to-interference ratio (separation accuracy only). However, in the signal-to-distortion ratio, an overall performance measure that reflects both separation accuracy and the quality of the separated signal, the first embodiment exceeds Comparative Example 1 in every case. Overall, therefore, the first embodiment can be judged to be advantageous over Comparative Example 1.

<Second Embodiment>
A second embodiment of the present invention is described below. In each of the embodiments illustrated below, elements whose operation and function are the same as in the first embodiment are denoted by the reference signs used in the description of the first embodiment, and detailed description of each is omitted as appropriate.

The problem in Comparative Example 2 that the coefficients g(k,n) of the coefficient vectors g(k) become excessively large tends to become apparent when the number of time-frequency components y(m,n) classified into the target class corresponding to the first sound source is small. In view of this tendency, the second embodiment variably controls the influence of the constrained matrix {/Q.×(FG)} (the influence of the first constraint) according to the number of time-frequency components y(m,n) classified into the target class.

FIG. 11 is a block diagram of the sound processing device 100 according to the second embodiment. As shown in FIG. 11, the arithmetic processing device 22 of the sound processing device 100 of the second embodiment functions as an index calculation unit 42 in addition to the same elements as in the first embodiment (the frequency analysis unit 32, the matrix generation unit 34, the matrix decomposition unit 36, and the waveform synthesis unit 38). The index calculation unit 42 calculates an index value ν according to the number of time-frequency components y(m,n) classified into the target class (the η-th class).

As in the first embodiment, in the separation matrix Q generated by the matrix generation unit 34, each element q(m,n) corresponding to a time-frequency component y(m,n) classified into the target class is set to the value 1, and each element q(m,n) corresponding to a time-frequency component y(m,n) classified into any other class is set to the value 0. The total number of elements q(m,n) set to 1 in the separation matrix Q therefore equals the number of time-frequency components y(m,n) classified into the target class. In view of this relationship, the index calculation unit 42 of the present embodiment calculates the index value ν by Equation (10).

The numerator on the right side of Equation (10) is the total number of elements q(m,n) set to 1 in the M-row, N-column separation matrix Q (that is, the total number of time-frequency components y(m,n) classified into the target class). The denominator on the right side of Equation (10) normalizes this count into the range from 0 to 1 inclusive. As understood from the earlier description with reference to FIG. 5, when a large number of time-frequency components y(m,n) are classified into the target class, an excessive increase in the coefficients g(k,n) of the coefficient vector g(k) tends not to occur. The index value ν can therefore also be regarded as a measure of the reliability of the basic condition (Equation (4)) of minimizing the difference between the observation matrix Y before the non-negative matrix factorization and the matrix obtained after it (the matrix sum of the first matrix D1 and the second matrix D2).
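The description of Equation (10) (count of elements equal to 1, normalized into the range 0 to 1) can be sketched as follows. This is a minimal illustration, assuming the normalizing denominator is the total number of elements M×N; the function name `index_value` is hypothetical and does not appear in the original.

```python
import numpy as np

def index_value(Q: np.ndarray) -> float:
    """Fraction of elements of the binary separation matrix Q that equal 1.

    Assumption: the denominator of Equation (10) is M * N, the largest
    possible count of elements set to 1.
    """
    M, N = Q.shape
    return float(np.count_nonzero(Q == 1)) / (M * N)

# Example: 4 of the 8 time-frequency components belong to the target class.
Q = np.array([[1, 0, 0, 1],
              [0, 0, 1, 1]])
nu = index_value(Q)
```

A value of ν close to 1 indicates that most time-frequency components are assigned to the target class, which is the situation in which the text says the first constraint condition is least needed.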

As understood from the above description, the larger the index value ν (the more time-frequency components y(m,n) are classified into the target class), the less need there tends to be to add, to the non-negative matrix factorization of the observation matrix Y, the first constraint condition for minimizing the norm of the constraint target matrix {/Q.×(FG)}. In view of this tendency, the matrix decomposition unit 36 of the second embodiment weights the constraint target matrix {/Q.×(FG)} in the update operation of the non-negative matrix factorization according to the index value ν calculated by the index calculation unit 42. That is, the degree of influence of the first constraint condition in the update operation of the non-negative matrix factorization is variably controlled according to the index value ν. Specifically, instead of Equation (6) above, the matrix decomposition unit 36 calculates the coefficient matrix G, the basis matrix H, and the coefficient matrix U by iterating an update operation chosen so as to minimize the evaluation function Φ expressed by Equation (11).

As understood from Equation (11), the larger the index value ν (the more time-frequency components y(m,n) are classified into the target class), the smaller the influence of the norm of the constraint target matrix {/Q.×(FG)} in the non-negative matrix factorization (evaluation function Φ). The method by which the matrix generation unit 34 calculates the separation matrix Q and the method by which the waveform synthesis unit 38 generates the acoustic signal SB are the same as in the first embodiment.
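The weighting idea can be illustrated with a small sketch. The exact form of Equation (11) is not reproduced in this text, so everything below is an assumption made for illustration: a Frobenius data term restricted to the target-class components, a penalty on the norm of {/Q.×(FG)}, and a hypothetical weight (1 − ν) that shrinks the penalty as ν grows. Any other constraint terms of the original evaluation function are omitted.

```python
import numpy as np

def weighted_objective(Y, Q, F, G, H, U, lam, nu):
    """Illustrative evaluation function with a nu-dependent penalty weight.

    Assumed form (not taken from the original equations):
        ||Q .* (Y - FG - HU)||^2 + (1 - nu) * lam * ||Q_bar .* (FG)||^2
    where Q_bar = 1 - Q is the inverted separation matrix.
    """
    Q_bar = 1.0 - Q
    D1 = F @ G                      # first matrix (first sound source)
    D2 = H @ U                      # second matrix (second sound source)
    fit = np.sum((Q * (Y - D1 - D2)) ** 2)
    penalty = np.sum((Q_bar * D1) ** 2)
    return fit + (1.0 - nu) * lam * penalty
```

With this choice the penalty vanishes when ν = 1 and takes its full weight λ when ν = 0, which matches the qualitative behaviour described in the text; other monotonically decreasing weights would serve equally well.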

The second embodiment achieves the same effects as the first embodiment. Moreover, because the influence of the constraint target matrix {/Q.×(FG)} in the non-negative matrix factorization (evaluation function Φ) is variably controlled according to the index value ν, the improvement in sound source separation accuracy is particularly pronounced.

<Modifications of Second Embodiment>
(1) The relationship between the index value ν and the influence of the constraint target matrix {/Q.×(FG)} is not limited to the above example (Equation (11)). For example, the coefficient matrix G, the basis matrix H, and the coefficient matrix U may be calculated by iterating an update operation chosen so as to minimize the evaluation function Φ of Equation (12).

Even in a configuration that uses the evaluation function Φ of Equation (12), the influence of the constraint target matrix {/Q.×(FG)} decreases as the index value ν increases, so the same effects as in the second embodiment are achieved.

(2) In Equation (10) above, the index value ν is calculated so as to be proportional to the total number of elements q(m,n) set to 1 in the separation matrix Q, but the method by which the index calculation unit 42 calculates the index value ν is not limited to this example. For example, the index value ν may be calculated by Equation (13).

Equation (13) is the sigmoid function illustrated in FIG. 12, and the coefficient α corresponds to the gain of the sigmoid function. As understood from FIG. 12, the coefficient β defines the inflection point of the sigmoid function and is set to a value that depends on the maximum (M×N) of the total number of elements q(m,n) set to 1 (for example, about 80% of that maximum). As understood from Equation (13) and FIG. 12, the index value ν varies nonlinearly with the total number of elements q(m,n) set to 1.
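A sigmoid-shaped index of this kind might be computed as in the sketch below. The standard logistic form 1 / (1 + exp(−α(c − β))), with c the count of elements set to 1, is an assumption, since Equation (13) itself is not reproduced here; `sigmoid_index` is a hypothetical name, and the gain 0.1 is an arbitrary illustrative value.

```python
import numpy as np

def sigmoid_index(Q: np.ndarray, alpha: float, beta: float) -> float:
    """Nonlinear index value from the number of elements of Q set to 1.

    Assumed form: logistic function with gain alpha and inflection point beta.
    """
    count = float(np.sum(Q))
    return float(1.0 / (1.0 + np.exp(-alpha * (count - beta))))

# Example: M = 8, N = 10, inflection point at 80% of M * N.
M, N = 8, 10
beta = 0.8 * M * N
Q = np.zeros(M * N)
Q[:64] = 1.0                      # exactly beta elements are set to 1
Q = Q.reshape(M, N)
nu = sigmoid_index(Q, alpha=0.1, beta=beta)
```

At the inflection point the index equals 0.5, and it saturates toward 0 or 1 as the count moves away from β.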

As understood from the above description, the matrix decomposition unit 36 of the second embodiment is expressed as an element that variably controls the influence of the constraint target matrix {/Q.×(FG)} in the update operation of the non-negative matrix factorization according to the number of elements equal to 1 in the separation matrix Q (the number of time-frequency components y(m,n) classified into the target class). The specific relationship between the index value ν and the constraint target matrix {/Q.×(FG)}, and the specific method of calculating the index value ν, are immaterial.

(3) In the above description, a single index value ν is calculated over the N unit sections, but it is also possible to calculate an index value for each unit section and to control the influence of the constraint target matrix {/Q.×(FG)} for each unit section. For example, an index value ν(n) may be calculated for each unit section as expressed by Equation (14), and the evaluation function Φ of Equation (15), which uses the per-section index value ν(n), may be applied to the non-negative matrix factorization.

The symbol y_n in Equation (15) denotes the amplitude spectrum of the acoustic signal SA in the n-th unit section (the series of M time-frequency components y(1,n) to y(M,n)). The symbol g_n is the series of K coefficients g(1,n) to g(K,n) of the coefficient matrix G corresponding to the n-th unit section, and the symbol u_n is the series of K coefficients u(1,n) to u(K,n) of the coefficient matrix U corresponding to the n-th unit section. The symbol Q_n is the series of elements (q(1,n) to q(M,n)) in the n-th column of the separation matrix Q.
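A per-section index can be sketched as a column-wise mean of the separation matrix. This assumes, by analogy with the global index, that ν(n) is the count of elements set to 1 in column n divided by M; Equation (14) is not reproduced in this text, and `per_section_index` is a hypothetical name.

```python
import numpy as np

def per_section_index(Q: np.ndarray) -> np.ndarray:
    """Index value nu(n) for each unit section (column) of Q.

    Assumption: nu(n) = (number of ones in column n) / M.
    """
    return Q.astype(float).mean(axis=0)

Q = np.array([[1, 0, 0, 1],
              [0, 0, 1, 1]])
nu_n = per_section_index(Q)       # one value per unit section
```

Each ν(n) could then weight the constraint term for the corresponding column, so that sections with few target-class components are constrained more strongly than sections with many.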

(4) In the above description, the influence of the constraint target matrix {/Q.×(FG)} is controlled in multiple steps according to the index value ν, but it is also possible to control, according to the index value ν, whether or not the constraint target matrix {/Q.×(FG)} (the first constraint condition) is taken into account in the non-negative matrix factorization at all. Specifically, when the index value ν is below a predetermined threshold νTH (when few time-frequency components y(m,n) are classified into the target class), the matrix decomposition unit 36 iterates, as in the first embodiment, the update operation that minimizes the evaluation function Φ of Equation (6), thereby executing non-negative matrix factorization that takes the first constraint condition into account. When the index value ν exceeds the threshold νTH (when many time-frequency components y(m,n) are classified into the target class), the matrix decomposition unit 36 iterates the update operation that minimizes the evaluation function Φ with the coefficient λ of Equation (6) set to 0, thereby executing non-negative matrix factorization with the first constraint condition released. This configuration also achieves the same effects as the second embodiment. In addition, because the influence (presence or absence) of the constraint target matrix {/Q.×(FG)} is controlled in a binary manner according to the index value ν, processing is simpler than in a configuration with multi-step control.
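The binary switching described in modification (4) reduces to choosing the coefficient λ before the factorization starts. The sketch below uses the hypothetical name `effective_lambda`; how the case ν = νTH is handled is not stated in the text, and treating it as "constraint released" is an assumption.

```python
def effective_lambda(nu: float, nu_th: float, lam: float) -> float:
    """Return the coefficient of the first constraint term to use.

    nu below the threshold: keep the constraint (lambda unchanged).
    Otherwise: release the constraint (lambda = 0).
    Assumption: nu equal to the threshold is treated as 'released'.
    """
    return lam if nu < nu_th else 0.0

lam_few = effective_lambda(nu=0.1, nu_th=0.5, lam=2.0)   # constraint kept
lam_many = effective_lambda(nu=0.9, nu_th=0.5, lam=2.0)  # constraint released
```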

<Third Embodiment>
The first embodiment illustrated non-negative matrix factorization whose first constraint condition is to suppress, within the first matrix D1 (= FG) corresponding to the first sound source, the elements corresponding to the time-frequency components y(m,n) in which the sound of the second sound source is dominant in the observation matrix Y (that is, to suppress the norm of the constraint target matrix {/Q.×(FG)}). The matrix decomposition unit 36 of the third embodiment executes non-negative matrix factorization of the observation matrix Y under a constraint condition (hereinafter the "third constraint condition") that suppresses temporal variation of the elements of the first matrix D1 corresponding to the time-frequency components y(m,n) in which the sound of the second sound source is dominant in the observation matrix Y.

Specifically, the matrix decomposition unit 36 calculates each unknown matrix (G, H, U) by iterating an update operation chosen so as to minimize the evaluation function Φ expressed by Equation (16). The coefficient ε in Equation (16) is set to a predetermined positive number.

Minimizing the third term on the right side of Equation (16) corresponds to the third constraint condition. The symbol {f(m,k)g(k,n)} is the product of the element f(m,k) of the basis vector f(k) (the amplitude spectrum of the k-th component) corresponding to the m-th frequency on the frequency axis and the element g(k,n) of the coefficient vector g(k), which corresponds to the basis vector f(k), for the n-th unit section on the time axis. The symbol Σ_k f(m,k)g(k,n) in Equation (16) therefore corresponds to the intensity (amplitude) of the time-frequency component d(m,n), at the n-th unit section and the m-th frequency, of the sound of the first sound source within the acoustic signal SA. That is, the expression in parentheses in the third term of Equation (16) corresponds to the intensity difference (error) between the time-frequency component d(m,n) of the n-th unit section and the time-frequency component d(m,n-1) of the immediately preceding (n-1)-th unit section, for the sound of the first sound source in the acoustic signal SA. The multiplication by the element /q(m,n) of the inverted separation matrix /Q in Equation (16) represents extracting from the first matrix D1 the time-frequency components d(m,n) that correspond to time-frequency components y(m,n) in which the sound of the second sound source is dominant in the observation matrix Y (elements q(m,n) set to 0 in the separation matrix Q).

As understood from the above description, the third term on the right side of Equation (16) becomes smaller as the intensities of the time-frequency components d(m,n) of the first matrix D1 (the matrix product of the basis matrix F and the coefficient matrix G) that correspond to elements q(m,n) set to 0 in the separation matrix Q (elements /q(m,n) set to 1 in the inverted separation matrix /Q) become closer to one another across successive unit sections (that is, as the intensity difference becomes smaller). The third constraint condition can therefore be expressed as a condition for suppressing temporal variation of the time-frequency components d(m,n) of the first matrix D1 that correspond to elements q(m,n) of value 0.
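The temporal-continuity term can be sketched as a masked sum of squared frame-to-frame differences of D1 = FG. Equation (16) is not reproduced in this text, so the squared-difference form, the placement of the mask /q(m,n), and the omission of the first frame (which has no predecessor) are assumptions; `temporal_penalty` is a hypothetical name.

```python
import numpy as np

def temporal_penalty(F, G, Q, eps):
    """Illustrative third-constraint term.

    Assumed form: eps * sum over m and n >= 2 of
        q_bar(m, n) * (d(m, n) - d(m, n - 1)) ** 2
    with D1 = F @ G and q_bar = 1 - q (inverted separation matrix).
    """
    D1 = F @ G
    Q_bar = 1.0 - Q
    diff = D1[:, 1:] - D1[:, :-1]            # d(m, n) - d(m, n - 1)
    return float(eps * np.sum(Q_bar[:, 1:] * diff ** 2))

F = np.array([[1.0], [2.0]])                 # one basis vector, two frequencies
G_flat = np.array([[1.0, 1.0, 1.0]])         # constant activation
G_jump = np.array([[1.0, 3.0, 3.0]])         # one jump between frames 1 and 2
Q_none = np.zeros((2, 3))                    # no component in the target class
p_flat = temporal_penalty(F, G_flat, Q_none, eps=0.5)
p_jump = temporal_penalty(F, G_jump, Q_none, eps=0.5)
```

A constant activation incurs no penalty, whereas a sudden jump in the masked region is penalized in proportion to its squared size, which is what discourages isolated, excessively large coefficients.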

The matrix decomposition unit 36 of the third embodiment calculates the coefficient matrix G by iterating the update operation of Equation (17), which is chosen so as to minimize the evaluation function Φ of Equation (16). The update equations for calculating the basis matrix H and the coefficient matrix U are the same as in the first embodiment (Equations (8) and (9)).

The symbol G^(-) in Equation (17) denotes the matrix obtained by shifting each coefficient vector g(k) of the most recently updated coefficient matrix G one column to the right, and the symbol G^(+) denotes the matrix obtained by shifting each coefficient vector g(k) of the most recently updated coefficient matrix G one column to the left. The method by which the matrix generation unit 34 calculates the separation matrix Q and the method by which the waveform synthesis unit 38 generates the acoustic signal SB are the same as in the first embodiment.
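The shifted matrices G^(-) and G^(+) can be formed with simple slicing, as sketched below. The text does not say how the column vacated by the shift is filled; zero fill is an assumption here (repeating the edge column would be another plausible choice), and the function names are hypothetical.

```python
import numpy as np

def shift_right(G: np.ndarray) -> np.ndarray:
    """G^(-): every row of G moved one column to the right (zero-filled)."""
    out = np.zeros_like(G)
    out[:, 1:] = G[:, :-1]
    return out

def shift_left(G: np.ndarray) -> np.ndarray:
    """G^(+): every row of G moved one column to the left (zero-filled)."""
    out = np.zeros_like(G)
    out[:, :-1] = G[:, 1:]
    return out

G = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])
G_minus = shift_right(G)   # column n holds g(k, n - 1)
G_plus = shift_left(G)     # column n holds g(k, n + 1)
```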

In the third embodiment described above, the coefficient matrix G, the basis matrix H, and the coefficient matrix U are calculated by non-negative matrix factorization that takes into account the third constraint condition, namely that the elements of the first matrix D1 corresponding to the time-frequency components y(m,n) in which the sound of the second sound source is dominant in the observation matrix Y are continuous on the time axis. This resolves the problem of Comparative Example 2 (FIG. 5), in which elements of the coefficient matrix G are set to excessively large values (that is, values that deviate extremely from those of the immediately preceding unit section) and the intensity of the separated acoustic signal SB diverges. Like the first embodiment, the third embodiment therefore achieves more accurate sound source separation than Comparative Example 1 or Comparative Example 2.

<Modifications>
Each of the above embodiments may be modified in various ways. Specific modifications are illustrated below. Two or more modifications freely selected from the following examples may be combined as appropriate.

(1) In each of the above embodiments, the separation matrix Q is generated using direction-based clustering, but the matrix generation unit 34 may generate the separation matrix Q by any method. For example, the methods illustrated below may be adopted to generate the separation matrix Q.

[Aspect 1]
Japanese Patent Application Laid-Open No. 2012-249048, for example, discloses a configuration in which the distribution of the frequency components of the acoustic signal SA is displayed in a localization-frequency domain defined by a localization axis (the localization direction of the sound image) and a frequency axis, and a designated region covering the localization range and frequency band of the first sound source is set within that domain, for example in response to an instruction from a user. The matrix generation unit 34 causes a display device to display the distribution of the frequency components of the acoustic signal SA in the localization-frequency domain, receives the user's designation of the region, and generates a separation matrix Q in which each element q(m,n) corresponding to a time-frequency component y(m,n) of the observation matrix Y inside the designated region is set to 1 and each remaining element q(m,n) is set to 0.

[Aspect 2]
The matrix generation unit 34 executes noise suppression processing on the acoustic signal SA. Specifically, as expressed by Equation (18), spectral subtraction (generalized spectral subtraction) is executed, in which an estimated noise component ζ(m,n) is subtracted in the frequency domain from each time-frequency component x(m,n) of the acoustic signal SA. The time-frequency component x(m,n) is, for example, the average or the sum of the time-frequency components x1(m,n) to xJ(m,n) (or any one of the time-frequency components x1(m,n) to xJ(m,n)), and the estimated noise component ζ(m,n) is, for example, a noise component (typically stationary noise) estimated from a noise section (non-speech section) of the acoustic signal SA.

The symbol a in Equation (18) is a subtraction coefficient, and the symbol b is a flooring coefficient. The exponent p applied to the time-frequency component x(m,n) and the estimated noise component ζ(m,n) is set to a predetermined value (typically 2 or 4). The matrix generation unit 34 calculates each element q(m,n) of the separation matrix Q according to the result of the noise suppression processing of Equation (18), as expressed by Equation (19).

As understood from Equation (19), each element q(m,n) corresponding to a time-frequency component x(m,n) that is dominant over the estimated noise component ζ(m,n) is set to 1 (the maintenance value), and each remaining element q(m,n) is set to 0 (the suppression value). The matrix decomposition unit 36 executes non-negative matrix factorization that applies the separation matrix Q generated by the matrix generation unit 34 to the observation matrix Y in which the time-frequency components y(m,n) after the noise suppression processing of Equation (18) are arranged.
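One common form of generalized spectral subtraction, together with a mask derived from it, is sketched below. Equations (18) and (19) are not reproduced in this text, so the specific rule (subtract a·|ζ|^p from |x|^p, take the p-th root where the result is positive, otherwise fall back to b·|x|, and set q = 1 exactly where the subtraction result is positive) is an assumption; `spectral_subtraction_mask` is a hypothetical name.

```python
import numpy as np

def spectral_subtraction_mask(X, noise, a=1.0, b=0.1, p=2.0):
    """Generalized spectral subtraction plus a binary separation matrix.

    Assumed rule:
        r = |x|^p - a * |zeta|^p
        y = r^(1/p)      and q = 1   where r > 0
        y = b * |x|      and q = 0   otherwise (flooring)
    """
    mag = np.abs(X)
    residual = mag ** p - a * np.abs(noise) ** p
    keep = residual > 0
    # np.maximum avoids a fractional power of a negative number in the
    # branch that np.where discards.
    Y = np.where(keep, np.maximum(residual, 0.0) ** (1.0 / p), b * mag)
    Q = keep.astype(float)
    return Y, Q

X = np.array([[3.0, 1.0],
              [0.5, 2.0]])
noise = np.ones((2, 2))
Y, Q = spectral_subtraction_mask(X, noise, a=1.0, b=0.1, p=2.0)
```

Components that survive the subtraction are kept in the mask, while floored components are marked for suppression.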

[Aspect 3]
The matrix decomposition unit 36 executes multi-stage non-negative matrix factorization including a first process and a second process. In the first process, the coefficient matrix G, the basis matrix H, and the coefficient matrix U are calculated by executing, on the observation matrix Y, non-negative matrix factorization to which the separation matrix Q is not applied. The matrix generation unit 34 calculates each element q(m,n) of the separation matrix Q according to the result of the first process, as expressed by Equation (20).

The symbol {Σ_k h(m,k)u(k,n)}/y(m,n) in Equation (20) corresponds to an index of the degree to which the sound of the second sound source (Σ_k h(m,k)u(k,n)) is dominant over the sound of the first sound source in the time-frequency component y(m,n). The symbol TH in Equation (20) is a predetermined threshold. As understood from Equation (20), each element q(m,n) corresponding to a time-frequency component y(m,n) in which the sound of the second sound source is dominant to the extent that the index {Σ_k h(m,k)u(k,n)}/y(m,n) exceeds the threshold TH is set to 1, and each remaining element q(m,n) is set to 0. The matrix decomposition unit 36 takes the second matrix D2 (D2 = HU) determined from the result of the first process as a new observation matrix Y and executes, as the second process, non-negative matrix factorization that applies the separation matrix Q generated by the matrix generation unit 34, as in each of the above embodiments. It is also possible to generate a separation matrix (soft mask) Q whose elements q(m,n) are set to the values of the index {Σ_k h(m,k)u(k,n)}/y(m,n) themselves.
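The mask construction of Aspect 3 can be sketched directly from the description: compute the ratio of the second-source estimate HU to the observation Y, then either threshold it or keep it as a soft mask. The function name `second_source_mask` and the small constant that guards against division by zero are assumptions added for this illustration.

```python
import numpy as np

def second_source_mask(Y, H, U, th, soft=False, tiny=1e-12):
    """Separation matrix from the first-stage factorization result.

    ratio(m, n) = sum_k h(m, k) u(k, n) / y(m, n)
    Binary mask: 1 where ratio exceeds th, 0 elsewhere.
    Soft mask:   the ratio itself.
    'tiny' (an assumption) prevents division by zero.
    """
    ratio = (H @ U) / np.maximum(Y, tiny)
    if soft:
        return ratio
    return (ratio > th).astype(float)

H = np.array([[1.0], [2.0]])
U = np.array([[1.0, 2.0]])
Y = np.array([[2.0, 2.0],
              [2.0, 8.0]])
Q_hard = second_source_mask(Y, H, U, th=0.75)
Q_soft = second_source_mask(Y, H, U, th=0.75, soft=True)
```

The soft variant keeps graded values instead of a hard 0/1 decision, corresponding to the soft mask mentioned at the end of Aspect 3.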

(2) The values of the elements q(m,n) of the separation matrix Q (and of the elements /q(m,n) of the inverted separation matrix /Q) are not limited to the values (0, 1) used as examples in the above embodiments. More generally, each element q(m,n) of the separation matrix Q is set either to a maintenance value γ1 that maintains the time-frequency component y(m,n) or to a suppression value γ0 that suppresses the time-frequency component y(m,n). The value 1 of an element q(m,n) in the above embodiments is a typical example of the maintenance value γ1, and the value 0 is a typical example of the suppression value γ0.

(3) The basis matrix F used as teacher information may be generated by any method. For example, the basis matrix F may be generated by non-negative matrix factorization of a previously recorded acoustic signal of the first sound source. Alternatively, because the basis matrix F consists of K amplitude spectra expected of the sound of the first sound source, the basis matrix F may be generated by, for example, calculating an average amplitude spectrum of the sound of the first sound source for each of K pitches and arranging the K amplitude spectra corresponding to the respective pitches.
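Learning a basis matrix from a recording of the first sound source alone could be sketched with the standard multiplicative updates for Frobenius-norm non-negative matrix factorization (the Lee and Seung updates). This is a generic textbook procedure chosen for illustration, not necessarily the procedure used in the original; `learn_basis`, the iteration count, and the random initialization are assumptions.

```python
import numpy as np

def learn_basis(S, K, n_iter=200, seed=0, tiny=1e-12):
    """Factorize a non-negative spectrogram S (M x N) as F @ G.

    F (M x K) holds K amplitude-spectrum basis vectors; G (K x N) holds
    their activations. Standard multiplicative updates for the squared
    Frobenius norm are used; 'tiny' avoids division by zero.
    """
    rng = np.random.default_rng(seed)
    M, N = S.shape
    F = rng.random((M, K)) + tiny
    G = rng.random((K, N)) + tiny
    for _ in range(n_iter):
        G *= (F.T @ S) / (F.T @ F @ G + tiny)
        F *= (S @ G.T) / (F @ G @ G.T + tiny)
    return F, G

# Synthetic 'recording': an exactly rank-2 non-negative spectrogram.
rng = np.random.default_rng(1)
S = rng.random((6, 2)) @ rng.random((2, 8))
F, G = learn_basis(S, K=2)
```

Only F would be retained as teacher information; the activations G learned from the training recording are discarded.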

(4) Each of the above embodiments illustrates non-negative matrix factorization based on the Frobenius norm, but the distance criterion applied to the non-negative matrix factorization is not limited to the Frobenius norm. Specifically, any known distance criterion, such as the Kullback-Leibler pseudo-distance or a divergence, may be adopted. Non-negative matrix factorization with a sparseness constraint may also be adopted.
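As an illustration of an alternative distance criterion, the generalized Kullback-Leibler divergence commonly used in non-negative matrix factorization can be computed as below. The function name `generalized_kl` and the small constant inside the logarithm are assumptions; the text does not specify which variant would be used.

```python
import numpy as np

def generalized_kl(Y, X, tiny=1e-12):
    """Generalized Kullback-Leibler divergence D(Y || X).

    D = sum( y * log(y / x) - y + x ) over all elements.
    'tiny' (an assumption) keeps the logarithm finite at zero entries.
    """
    Y = np.asarray(Y, dtype=float)
    X = np.asarray(X, dtype=float)
    return float(np.sum(Y * np.log((Y + tiny) / (X + tiny)) - Y + X))

Y = np.array([[1.0, 2.0],
              [3.0, 4.0]])
d_same = generalized_kl(Y, Y)        # identical matrices
d_scaled = generalized_kl(Y, 2 * Y)  # model overestimates by a factor of 2
```

Unlike the Frobenius norm, this criterion is asymmetric in its arguments, so the order of observation and model matters.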

(5) In each of the above embodiments, an acoustic signal SB is generated by extracting the sound of one of the first sound source and the second sound source, but the waveform synthesis unit 38 may also generate in parallel both an acoustic signal SB extracting the sound of the first sound source and an acoustic signal SB extracting the sound of the second sound source. For example, a configuration may be adopted in which different acoustic processing (for example, application of effects) is applied to the acoustic signal SB of the first sound source and to the acoustic signal SB of the second sound source, and the results are then added together.

(6) In each of the above embodiments, the acoustic signal SB of the first sound source is generated by multiplying the basis matrix F by the coefficient matrix G, and the acoustic signal SB of the second sound source is generated by multiplying the basis matrix H by the coefficient matrix U, but the matrices generated by the matrix decomposition unit 36 may instead be used to generate a filter for processing the acoustic signal SA. For example, a configuration may be adopted in which a filter (for example, a Wiener filter) for suppressing or emphasizing the sound of the first sound source is generated from the matrix product of the basis matrix F and the coefficient matrix G (the first matrix D1) and applied to the acoustic signal SA, or in which a filter for suppressing or emphasizing the sound of the second sound source is generated from the matrix product of the basis matrix H and the coefficient matrix U (the second matrix D2) and applied to the acoustic signal SA.
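A Wiener-style filter built from the two matrix products might look like the following sketch. The power-ratio gain D1² / (D1² + D2²) is one conventional choice and is an assumption here, as are the function name `wiener_separate` and the small constant in the denominator; `X` stands for the complex spectrogram of the acoustic signal SA.

```python
import numpy as np

def wiener_separate(X, D1, D2, tiny=1e-12):
    """Apply a Wiener-style gain derived from D1 = FG and D2 = HU to X.

    Gain for the first sound source: W = D1^2 / (D1^2 + D2^2).
    Returns the filtered spectrograms of the first and second sources.
    'tiny' (an assumption) avoids division by zero.
    """
    P1 = D1 ** 2
    P2 = D2 ** 2
    W = P1 / (P1 + P2 + tiny)
    return W * X, (1.0 - W) * X

X = np.array([[1.0 + 1.0j, 2.0],
              [0.5j, 4.0]])
D1 = np.ones((2, 2))
D2 = np.ones((2, 2))
S1, S2 = wiener_separate(X, D1, D2)   # equal estimates: each source gets half
```

Because the two gains sum to one, the two outputs always add back up to the input spectrogram, and the phase of the observed signal is reused for both sources.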

(7) Generation of the acoustic signal SB using the results (G, H, U) calculated by the matrix decomposition unit 36 (that is, the waveform synthesis unit 38) may be omitted. For example, the present invention may also be implemented as a sound processing device 100 that calculates the unknown matrices (G, H, U) by non-negative matrix factorization of the observation matrix Y of the acoustic signal SA. The sound processing device 100 may also be realized as a server device that communicates with a terminal device such as a mobile phone. For example, the sound processing device 100 generates the acoustic signal SB from the acoustic signal SA received from the terminal device and transmits it to the terminal device. In a configuration in which the observation matrix Y of the acoustic signal SA and the observation vectors VX(m,n) are received from the terminal device (for example, a configuration in which the terminal device includes the frequency analysis unit 32), the frequency analysis unit 32 is omitted from the sound processing device 100. In a configuration in which the results calculated by the matrix decomposition unit 36 are transmitted to the terminal device (for example, a configuration in which the terminal device includes the waveform synthesis unit 38), the waveform synthesis unit 38 is omitted from the sound processing device 100.

DESCRIPTION OF SYMBOLS 100 ... acoustic processing device, 12 ... signal supply device, 14 ... sound emitting device, 22 ... arithmetic processing device, 24 ... storage device, 32 ... frequency analysis unit, 34 ... matrix generation unit, 36 ... matrix decomposition unit, 38 ... waveform synthesis unit, 42 ... index calculation unit.

Claims

An acoustic processing device comprising: matrix generation means for generating a separation matrix that includes an element corresponding to each time-frequency component of an acoustic signal representing a mixture of sounds from a plurality of sound sources, in which each element corresponding to a time-frequency component where the sound of a first sound source among the plurality of sound sources is dominant is set to a maintenance value that maintains the time-frequency component, and each element corresponding to a remaining time-frequency component is set to a suppression value that suppresses the time-frequency component; and
matrix decomposition means for calculating, from an observation matrix in which the time-frequency components of the acoustic signal are arranged, a first coefficient matrix including a plurality of coefficient vectors corresponding to the respective basis vectors of a first basis matrix, a second basis matrix including a plurality of basis vectors representing the spectrum of each component of the sound of a second sound source different from the first sound source, and a second coefficient matrix including a plurality of coefficient vectors corresponding to the respective basis vectors of the second basis matrix, by repeating an update operation of non-negative matrix factorization that uses, as teacher information, the first basis matrix including a plurality of basis vectors representing the spectrum of each component of the sound of the first sound source, with each time-frequency component corresponding to an element set to the maintenance value in the separation matrix as a calculation target,
wherein the matrix decomposition means executes the non-negative matrix factorization under a constraint condition of suppressing each element of the first coefficient matrix.

The acoustic processing device according to claim 1, wherein the matrix decomposition means executes the non-negative matrix factorization under a constraint condition of reducing the norm of a restriction target matrix obtained by multiplying each element of the matrix product of the first basis matrix and the first coefficient matrix by the corresponding element of an inverted separation matrix, in which the maintenance value and the suppression value of each element of the separation matrix are inverted.

The acoustic processing device according to claim 1, wherein the matrix decomposition means variably controls the degree of influence of the constraint condition on the update operation according to the number of maintenance values in the separation matrix.

The acoustic processing device according to claim 1, wherein the matrix decomposition means executes the non-negative matrix factorization under the constraint condition when an index value corresponding to the number of maintenance values in the separation matrix exceeds a threshold value, and releases the constraint condition when the index value falls below the threshold value.

An acoustic processing device comprising: matrix generation means for generating a separation matrix that includes an element corresponding to each time-frequency component of an acoustic signal representing a mixture of sounds from a plurality of sound sources, in which each element corresponding to a time-frequency component where the sound of a first sound source among the plurality of sound sources is dominant is set to a maintenance value that maintains the time-frequency component, and each element corresponding to a remaining time-frequency component is set to a suppression value that suppresses the time-frequency component; and
matrix decomposition means for calculating, from an observation matrix in which the time-frequency components of the acoustic signal are arranged, a first coefficient matrix including a plurality of coefficient vectors corresponding to the respective basis vectors of a first basis matrix, a second basis matrix including a plurality of basis vectors representing the spectrum of each component of the sound of a second sound source different from the first sound source, and a second coefficient matrix including a plurality of coefficient vectors corresponding to the respective basis vectors of the second basis matrix, by repeating an update operation of non-negative matrix factorization that uses, as teacher information, the first basis matrix including a plurality of basis vectors representing the spectrum of each component of the sound of the first sound source, with each time-frequency component corresponding to an element set to the maintenance value in the separation matrix as a calculation target,
wherein the matrix decomposition means executes the non-negative matrix factorization under a constraint condition of suppressing temporal variation of the time-frequency components that, within the matrix product of the first basis matrix and the first coefficient matrix, correspond to elements set to the suppression value in the separation matrix.