JP2018136368A - Signal analyzer, method, and program

Info

Publication number: JP2018136368A
Application number: JP2017028843A
Authority: JP
Inventors: 弘和亀岡; Hirokazu Kameoka; 莉李; Ri Ri
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2017-02-20
Filing date: 2017-02-20
Publication date: 2018-08-30
Anticipated expiration: 2037-02-20
Also published as: JP6618493B2

Abstract

PROBLEM TO BE SOLVED: To provide a signal analyzer capable of learning basis spectra using an algorithm whose convergence is guaranteed.
SOLUTION: The signal analyzer includes: an auxiliary variable update unit 42 that updates auxiliary variables; a parameter update unit 44 that updates the basis spectrum and the activation parameters of each constituent sound so as to reduce an auxiliary function, which is an upper bound of a criterion representing the magnitude of the error between the extracted spectrogram of each constituent sound signal, extracted from the spectrogram of a mixed signal, and the spectrogram of that constituent sound signal; and a convergence determination unit 46 that causes the auxiliary variable update unit 42 and the parameter update unit 44 to repeat their updates until a predetermined convergence condition is satisfied.
SELECTED DRAWING: Figure 1

Description

The present invention relates to a signal analysis apparatus, method, and program, and more particularly to a signal analysis apparatus, method, and program for estimating parameters.

In recent years, non-negative matrix factorization (NMF) has attracted attention as a powerful technique for monaural acoustic signal processing problems (Non-Patent Document 1). Approximating the amplitude or power spectrum observed at each time by a non-negative combination of basis spectra is equivalent to regarding the observed spectrogram as a matrix and approximating it by the product of two matrices, a basis matrix and an activation matrix. Because every element of both matrices is non-negative, the observed spectrogram is factorized under non-negativity constraints, hence the name NMF. In supervised or semi-supervised sound source separation, NMF is first applied to the spectrogram of the training samples of each sound source to pre-train the basis matrix. At test time, the learned basis matrix is held fixed and only the activation matrix is estimated. Using the power spectrogram of each sound source obtained in this way, the target source signal can be recovered from the mixed signal with a Wiener filter.

In the above approach (Non-Patent Document 1), the error between the spectrogram of the training sample and the matrix product is used as the optimization criterion for basis learning, but this is not a criterion under which the separated signal itself is optimal. Focusing on this point, a framework called discriminative non-negative matrix factorization (DNMF) (Non-Patent Document 2) has been proposed, in which basis learning directly uses the error between the output signal of the Wiener filter and the training sample of the target source as the optimization criterion. Because the same optimization criterion is then used at training time and at test time, learning is expected to yield basis spectra with higher separation capability.

[Non-Patent Document 1] P. Smaragdis, R. Bhiksha, and S. Madhusudana, "Supervised and semi-supervised separation of sounds from single-channel mixtures," in Proc. ICA, pp. 414-421, 2007.
[Non-Patent Document 2] F. Weninger, J. L. Roux, J. R. Hershey, and S. Watanabe, "Discriminative NMF and its application to single-channel source separation," in Proc. INTERSPEECH, pp. 865-869, 2014.

However, the learning criterion of discriminative NMF (described later) has an analytically more complicated form than the optimization criterion of conventional NMF. Non-Patent Document 2 therefore proposes an optimization algorithm based on a general-purpose technique called the multiplicative update algorithm, but its convergence to a stationary point is not guaranteed, so the potential of DNMF could not be said to be fully realized.

The present invention has been made in view of the above circumstances, and its object is to provide a signal analysis apparatus, method, and program that can learn basis spectra with an algorithm whose convergence is guaranteed.

To achieve the above object, a signal analysis apparatus according to the present invention includes: a time-frequency expansion unit that receives, as input, time-series data of a mixed signal in which constituent sounds are mixed and time-series data of the constituent sound signal of each constituent sound into which the mixed signal is separated, and outputs, for the mixed signal and for each constituent sound signal, a spectrogram representing the signal component at each time and each frequency; and a parameter learning unit that, based on the spectrograms output by the time-frequency expansion unit, estimates the basis spectrum and the activation parameters of each constituent sound so as to reduce a criterion representing the magnitude of the error between (a) the extracted spectrogram of each constituent sound signal, which is extracted from the spectrogram of the mixed signal using the basis spectrum of each constituent sound signal and activation parameters representing the volume for each basis at each time, and (b) the spectrogram of that constituent sound signal. The parameter learning unit includes a parameter update unit that updates the basis spectrum and the activation parameters of each constituent sound so as to reduce an auxiliary function that is an upper bound of the criterion, and a convergence determination unit that causes the parameter update unit to repeat the update until a predetermined convergence condition is satisfied.

In a signal analysis method according to the present invention, a time-frequency expansion unit receives, as input, time-series data of a mixed signal in which constituent sounds are mixed and time-series data of the constituent sound signal of each constituent sound into which the mixed signal is separated, and outputs, for the mixed signal and for each constituent sound signal, a spectrogram representing the signal component at each time and each frequency. A parameter learning unit then estimates, based on the spectrograms output by the time-frequency expansion unit, the basis spectrum and the activation parameters of each constituent sound so as to reduce a criterion representing the magnitude of the error between (a) the extracted spectrogram of each constituent sound signal, which is extracted from the spectrogram of the mixed signal using the basis spectrum of each constituent sound signal and activation parameters representing the volume for each basis at each time, and (b) the spectrogram of that constituent sound signal. In this estimation, a parameter update unit updates the basis spectrum and the activation parameters of each constituent sound so as to reduce an auxiliary function that is an upper bound of the criterion, and a convergence determination unit causes the parameter update unit to repeat the update until a predetermined convergence condition is satisfied.
The extracted spectrogram of each constituent sound signal is extracted from the spectrogram of the mixed signal by a Wiener filter.

The program of the present invention is a program for causing a computer to function as each unit of the signal analysis apparatus described above.

As described above, according to the signal analysis apparatus, method, and program of the present invention, the basis spectrum and the activation parameters of each constituent sound are repeatedly updated so as to reduce an auxiliary function that is an upper bound of a criterion. The criterion represents the magnitude of the error between the extracted spectrogram of each constituent sound signal, which is extracted from the spectrogram of the mixed signal using the basis spectrum of each constituent sound signal and activation parameters representing the volume for each basis at each time, and the spectrogram of that constituent sound signal. As a result, basis spectra can be learned with an algorithm whose convergence is guaranteed.

FIG. 1 is a block diagram showing the functional configuration of the signal analysis apparatus according to the embodiment of the present invention.
FIG. 2 is a flowchart showing the learning processing routine in the signal analysis apparatus according to the embodiment of the present invention.
FIG. 3 is a diagram showing experimental results.

Hereinafter, an embodiment of the present invention will be described in detail with reference to the drawings.

<Existing method>
<Sound source separation by supervised NMF>
Consider the power spectrogram of a mixed signal consisting of L sound sources, indexed by frequency ω and time t. In supervised NMF, the pre-trained basis spectra of the individual sound sources are used to approximate the observed spectrogram by the product of a basis matrix and an activation matrix. The aim is to obtain the power-spectrogram estimates that a Wiener filter needs in order to extract each sound source signal from the mixed signal.

In Non-Patent Document 1, the basis matrix of each sound source l is pre-trained using, as the optimization criterion, the error of Equation (1) between the spectrogram of the training sample of sound source l and the matrix product of that source's basis matrix and activation matrix, where the error is measured by an error function. At test time, the pre-trained basis matrices are held fixed and the activation matrix that minimizes the error for the observed spectrogram is estimated, which yields an estimate of the power-spectrogram component of each sound source contained in the observed spectrogram. When the I-divergence is used as the error function, Equation (2) takes a concrete form in which [·]_{i,j} denotes the (i, j)-th element of a matrix.

Once the power spectrogram of each sound source has been obtained, the Wiener filter of Equation (4) gives spectrograms of the individual source signals that are guaranteed to sum consistently to the observed spectrogram. The two operators used in that equation denote element-wise multiplication and element-wise division. In the above approach (Non-Patent Document 1), however, the basis is learned with the criterion of Equation (1), which is not a criterion under which the separated signal given by Equation (4) is optimal.
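The supervised-NMF procedure described above can be sketched in NumPy. This is a minimal illustration rather than the patent's own formulas, which are not reproduced in this text: the names `Y`, `W`, `H`, `i_divergence`, `estimate_activations`, and `wiener_separate` are assumptions, and the activation update is the standard multiplicative rule for the I-divergence with the basis matrix held fixed.

```python
import numpy as np

def i_divergence(Y, X, eps=1e-12):
    """I-divergence (generalized KL) between two non-negative arrays, summed over all bins."""
    Y = np.maximum(Y, eps)
    X = np.maximum(X, eps)
    return float(np.sum(Y * np.log(Y / X) - Y + X))

def estimate_activations(Y, W, n_iter=100, eps=1e-12, seed=0):
    """Estimate H >= 0 such that Y ~ W @ H, keeping the basis matrix W fixed.

    Uses the standard multiplicative update for the I-divergence (an assumed choice).
    """
    rng = np.random.default_rng(seed)
    H = rng.random((W.shape[1], Y.shape[1])) + eps
    col_sums = W.sum(axis=0)[:, None] + eps
    for _ in range(n_iter):
        H *= (W.T @ (Y / (W @ H + eps))) / col_sums
    return H

def wiener_separate(Y, W_list, H_list, eps=1e-12):
    """Split Y into one spectrogram per source with a Wiener-type mask.

    The masks are element-wise ratios, so the outputs add up to Y.
    """
    models = [W @ H for W, H in zip(W_list, H_list)]
    total = sum(models) + eps
    return [Y * (m / total) for m in models]
```

In use, the per-source basis matrices would be stacked column-wise into `W`, the activations estimated once for the mixture, and the rows of `H` split back per source before calling `wiener_separate`.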

<Discriminative NMF and the multiplicative update algorithm>
Discriminative NMF (Non-Patent Document 2) is a framework for sound source separation by supervised NMF in which the basis is learned using, instead of Equation (1), the error between the Wiener filter output and the spectrogram of the training sample as the criterion (Equation (5)). Here α_l ≥ 0 is a parameter representing the importance of the l-th separated signal.

To simplify the explanation, the following considers a source separation problem with two kinds of sound sources (L = 2), speech and noise. When the goal is speech enhancement, the separation accuracy of the speech signal matters most, so the importance α is set to 1 for speech and 0 for noise. Given the spectrogram of a clean-speech training sample, the spectrogram of a noise training sample, and the spectrogram of their mixture, the basis learning problem of discriminative NMF is formulated as the optimization problem of Equation (6). The basis matrix consists of K^s speech basis spectra and K^n noise basis spectra.
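As a rough illustration of the discriminative criterion in the two-source case, the sketch below evaluates the error between a Wiener-filtered speech estimate and the clean-speech spectrogram. The formula of the criterion is not reproduced in this text, so the use of the I-divergence as the error measure and all names (`dnmf_objective`, `S`, `Y`, `Ws`, `Hs`, `Wn`, `Hn`) are assumptions. With importance 1 for speech and 0 for noise, only the speech term remains.

```python
import numpy as np

def dnmf_objective(S, Y, Ws, Hs, Wn, Hn, eps=1e-12):
    """Error between the Wiener-filtered speech estimate and the clean speech S.

    S, Y   : clean-speech and mixture power spectrograms (frequency x time)
    Ws, Hs : speech basis matrix and activations
    Wn, Hn : noise basis matrix and activations
    The I-divergence is an assumed error measure for this sketch.
    """
    speech_model = Ws @ Hs
    noise_model = Wn @ Hn
    # Wiener-type mask applied to the mixture spectrogram.
    S_hat = Y * speech_model / (speech_model + noise_model + eps)
    S = np.maximum(S, eps)
    S_hat = np.maximum(S_hat, eps)
    return float(np.sum(S * np.log(S / S_hat) - S + S_hat))
```

Unlike the plain reconstruction error, this quantity depends on the noise bases as well, because they enter the mask applied to the mixture.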

Weninger et al. proposed an optimization algorithm for the above optimization problem based on the multiplicative update method (Non-Patent Document 2). In their algorithm, the activation matrix is first obtained by ordinary NMF (that is, Equation (2)), and the basis matrix W is then updated with the activation matrix held fixed. The update is given by the element-wise product of the current basis matrix and the quotient of the negative and positive terms of the partial derivative of the objective function with respect to the basis matrix. However, it is not guaranteed that each update decreases the objective function, so the convergence of the iterative algorithm based on these update equations is not guaranteed.

<Proposed method>
<Basis learning algorithm by the auxiliary function method>
The embodiment of the present invention is an optimization algorithm, derived from the principle of the auxiliary function method, that is guaranteed to converge to a stationary point of the optimization problem of Equation (6).

<Auxiliary function method>
Let F(θ) be an objective function to be minimized with respect to θ. A function of θ and an additional variable α that bounds F(θ) from above, and that coincides with F(θ) when minimized over α, is called an auxiliary function, and α is called an auxiliary variable. If such an auxiliary function can be designed, a stationary point of the objective function F(θ) can be obtained by alternately minimizing the auxiliary function with respect to α and with respect to θ. This optimization technique is called the auxiliary function method.
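The alternating scheme can be illustrated on a toy problem that is unrelated to the patent's objective. The sketch below minimizes F(θ) = Σ_i |θ - a_i| using the quadratic upper bound |u| ≤ u²/(2α) + α/2, which is tight at α = |u|. The problem, the bound, and the names `objective`, `auxiliary`, and `mm_minimize` are all chosen here for illustration only.

```python
import numpy as np

def objective(theta, a):
    """F(theta) = sum_i |theta - a_i| (minimized at the median of a)."""
    return float(np.sum(np.abs(theta - a)))

def auxiliary(theta, alpha, a):
    """Upper bound of F: |u| <= u**2 / (2 * alpha) + alpha / 2, tight at alpha = |u|."""
    return float(np.sum((theta - a) ** 2 / (2 * alpha) + alpha / 2))

def mm_minimize(a, theta0, n_iter=50, eps=1e-9):
    """Alternate the auxiliary-variable update and the parameter update."""
    theta = float(theta0)
    history = [objective(theta, a)]
    for _ in range(n_iter):
        # Auxiliary-variable update: the equality condition of the bound
        # (clamped away from zero for numerical safety).
        alpha = np.maximum(np.abs(theta - a), eps)
        # Parameter update: closed-form minimizer of the quadratic bound.
        theta = float(np.sum(a / alpha) / np.sum(1.0 / alpha))
        history.append(objective(theta, a))
    return theta, history
```

Because each parameter update minimizes an upper bound that touches the objective at the current point, the recorded objective values never increase, which is the property the auxiliary function method relies on.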

<Design of the auxiliary function>
An auxiliary function for the objective function is designed below. First, an auxiliary function for one of the terms in the objective function is designed using the following inequality.

(Lemma 1)
For any [formula], the inequality [formula] holds, with equality when [formula].

(Proof)
For any [formula], [formula].

Since M_{ω,t} is non-negative, Lemma 1 yields the bound of Equation (12), in which =^c denotes equality restricted to the terms that depend on the parameters. Two shorthand quantities are also introduced there. Equality in Equation (12) holds under the condition of Equation (13). Next, an auxiliary function is designed for each term of Equation (12).

Because its argument is positive and the negative logarithm is a convex function, Jensen's inequality gives Equation (14). The weights introduced there are variables subject to a constraint, and equality in Equation (14) holds under the condition of Equation (15).

Because its argument is positive, the logarithm in the second term of Equation (12) is a concave function. A concave function is bounded from above by its tangent at any point, which gives Equation (16). The variable introduced there is positive, and equality in Equation (16) holds under the condition of Equation (17).

Next, an auxiliary function is designed for the following term. A quadratic function is convex, so Jensen's inequality gives Equation (18). The weights introduced there are positive numbers subject to a constraint, and equality in Equation (18) holds under the condition of Equation (19).

Finally, an auxiliary function is designed for the remaining term. The function 1/x² is convex for x > 0, so Jensen's inequality gives Equation (20). The weights introduced there are variables subject to a constraint, and equality in Equation (20) holds under the condition of Equation (21).
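The four kinds of bound used in this derivation can be checked numerically in their generic textbook forms. The sketch below is not a reconstruction of Equations (14) to (20), whose exact terms are not reproduced in this text. It assumes a sum of positive numbers `x`, weights `w` that are positive and sum to one, and hypothetical function names.

```python
import numpy as np

def neg_log_bound(x, w):
    """Jensen upper bound on -log(sum(x)); w > 0 and sum(w) == 1."""
    return float(-np.sum(w * np.log(x / w)))

def log_tangent_bound(s, point):
    """Tangent-line upper bound on log(s); touches log at s == point."""
    return float(np.log(point) + (s - point) / point)

def square_bound(x, w):
    """Jensen upper bound on sum(x) ** 2; w > 0 and sum(w) == 1."""
    return float(np.sum(x ** 2 / w))

def inv_square_bound(x, w):
    """Jensen upper bound on 1 / sum(x) ** 2 for x > 0; w > 0 and sum(w) == 1."""
    return float(np.sum(w ** 3 / x ** 2))

def tight_weights(x):
    """Weights at which the three Jensen bounds hold with equality."""
    return x / np.sum(x)
```

Each bound separates a function of a sum into a sum of functions of the individual terms, which is what makes the resulting auxiliary function easy to minimize term by term.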

From Equations (12), (14), (16), (18), and (20), an auxiliary function for the objective function is obtained. It depends on the set of auxiliary variables introduced above, and d denotes a constant term. The key point of this auxiliary function is that its global optimum with respect to the parameters can be obtained analytically.

<Parameter update equations>
The auxiliary-variable values that minimize the auxiliary function are exactly the equality conditions of the respective inequalities, and are therefore given by Equations (13), (15), (17), (19), and (21). The parameters that minimize the auxiliary function are obtained by finding the positive solutions of quartic and cubic equations. Because the constant term and the coefficient of the second-order term of the quartic equation are both negative, it can be shown that the equation always has exactly one positive solution.
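A small numerical sketch of this step follows. The coefficients of the patent's quartic and cubic equations are not reproduced in this text, so the polynomials below are hypothetical. They only share the assumed sign pattern of non-negative leading coefficients followed by non-positive ones, which has a single sign change and therefore, by Descartes' rule of signs, exactly one positive root. The helper name `unique_positive_root` is also an assumption.

```python
import numpy as np

def unique_positive_root(coeffs, tol=1e-9):
    """Return the single positive real root of a polynomial.

    coeffs are ordered from the highest degree to the constant term,
    as expected by numpy.roots. Raises ValueError if the polynomial
    does not have exactly one positive real root.
    """
    roots = np.roots(coeffs)
    is_real = np.abs(roots.imag) <= tol * np.maximum(1.0, np.abs(roots))
    real = roots[is_real].real
    positive = real[real > 0]
    if positive.size != 1:
        raise ValueError("expected exactly one positive real root")
    return float(positive[0])

# Hypothetical coefficients: 2x^4 + x^3 - 3x^2 - 4 = 0 and x^3 + 2x^2 - 5 = 0.
example_quartic = [2.0, 1.0, -3.0, 0.0, -4.0]
example_cubic = [1.0, 2.0, 0.0, -5.0]
```

A closed-form quartic solution could be used instead; a general root finder is chosen here only to keep the sketch short.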

<Configuration of the signal analysis apparatus according to the embodiment of the present invention>
Next, the configuration of the signal analysis apparatus according to the embodiment of the present invention will be described. As shown in FIG. 1, the signal analysis apparatus 100 according to the embodiment of the present invention can be configured as a computer including a CPU, a RAM, and a ROM that stores various data and a program for executing the learning processing routine described later. Functionally, as shown in FIG. 1, the signal analysis apparatus 100 includes an input unit 10, a calculation unit 20, and an output unit 90.

The input unit 10 receives time-series data of a mixed signal in which constituent sounds are mixed, and time-series data of the acoustic signal of each constituent sound into which the mixed signal is separated.

The calculation unit 20 includes a time-frequency expansion unit 24 and a parameter learning unit 36.

The time-frequency expansion unit 24 calculates, from the time-series data of the mixed signal, a power spectrogram representing the signal component of each frequency at each time. It likewise calculates, from the time-series data of each constituent sound signal, a power spectrogram representing the signal component of each frequency at each time. In the present embodiment, a time-frequency expansion such as the short-time Fourier transform or the wavelet transform is used.
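A power spectrogram of this kind can be computed with a short-time Fourier transform as sketched below. The defaults of 512 and 256 samples correspond to the 32 ms frame length and 16 ms frame shift at 16 kHz reported in the experimental section. The Hann window and the function name `power_spectrogram` are assumptions made for this sketch.

```python
import numpy as np

def power_spectrogram(x, frame_len=512, hop=256):
    """Return the power spectrogram |STFT|^2 of a 1-D signal.

    Output shape is (frame_len // 2 + 1, n_frames): rows are frequency
    bins and columns are time frames. A Hann window is assumed.
    """
    x = np.asarray(x, dtype=float)
    if x.size < frame_len:
        # Zero-pad very short signals to a single full frame.
        x = np.pad(x, (0, frame_len - x.size))
    n_frames = 1 + (x.size - frame_len) // hop
    window = np.hanning(frame_len)
    frames = np.stack(
        [x[i * hop: i * hop + frame_len] * window for i in range(n_frames)],
        axis=1,
    )
    return np.abs(np.fft.rfft(frames, axis=0)) ** 2
```

The same function would be applied to the mixed signal and to each constituent sound signal so that all spectrograms share one time-frequency grid.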

Based on the power spectrogram of the mixed signal and the power spectrogram of each constituent sound signal calculated by the time-frequency expansion unit 24, the parameter learning unit 36 estimates the basis spectrum of each constituent sound and the activations of each constituent sound so as to reduce the criterion of Equation (5) above. That criterion represents the magnitude of the error between the extracted spectrogram of each constituent sound signal, which is extracted from the spectrogram of the mixed signal using the basis spectrum of each constituent sound signal and activation parameters representing the volume for each basis at each time, and the spectrogram of that constituent sound signal.

Specifically, the parameter learning unit 36 includes an initial value setting unit 40, an auxiliary variable update unit 42, a parameter update unit 44, and a convergence determination unit 46.

The initial value setting unit 40 sets initial values for the basis spectra of speech and noise and for the activations of speech and noise. For example, the initial values are set at random.

Based on the basis spectra of speech and noise and the activations of speech and noise, which are either the initial values or the values from the previous update, the auxiliary variable update unit 42 updates the auxiliary variables according to Equations (13), (15), (17), (19), and (21) above: γ_{k,ω,t}, β_{k,ω,t}, and θ_{k,ω,t} for each basis k, each frequency ω, and each time t, and λ_{ω,t} and η_{ω,t} for each frequency ω and each time t.

The parameter update unit 44 updates the basis spectra of speech and noise and the activations of speech and noise by solving the quartic and cubic equations of Equations (23) to (26) above. The update is based on: the power spectrogram of the mixed signal and the power spectrogram of the speech signal output by the time-frequency expansion unit 24; the auxiliary variables updated by the auxiliary variable update unit 42, namely γ_{k,ω,t}, β_{k,ω,t}, and θ_{k,ω,t} for each basis k, each frequency ω, and each time t, and λ_{ω,t} and η_{ω,t} for each frequency ω and each time t; and the basis spectra and activations of speech and noise that are either the initial values or the values from the previous update.

The convergence determination unit 46 determines whether a convergence condition is satisfied, and causes the update processing in the auxiliary variable update unit 42 and the update processing in the parameter update unit 44 to be repeated until the condition is satisfied.

As the convergence condition, for example, the number of iterations reaching an upper limit can be used. Alternatively, the condition can be that the difference between the current value of the criterion of Equation (6) above and its previous value is equal to or less than a predetermined threshold.
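The two convergence conditions mentioned above can be combined in a generic driver loop, sketched below. The function `run_until_converged` and its callable arguments are hypothetical placeholders: `update_aux` and `update_params` stand in for the processing of the auxiliary variable update unit 42 and the parameter update unit 44, and `criterion` stands in for the learning criterion.

```python
def run_until_converged(update_aux, update_params, criterion, state,
                        max_iter=100, tol=1e-6):
    """Alternate the two updates until a convergence condition is met.

    Stops when the change in the criterion is at most tol, or when
    max_iter iterations have been performed. Returns the final state
    and the number of iterations executed.
    """
    prev = criterion(state)
    for iteration in range(1, max_iter + 1):
        aux = update_aux(state)            # auxiliary-variable update
        state = update_params(state, aux)  # parameter update
        current = criterion(state)
        if abs(prev - current) <= tol:     # threshold-based condition
            return state, iteration
        prev = current
    return state, max_iter                 # iteration-limit condition
```

Evaluating the criterion on every pass has a cost, so an implementation that uses only the iteration limit can skip that evaluation entirely.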

The output unit 90 outputs the basis spectra of speech and noise and the activations of speech and noise finally obtained by the parameter learning unit 36.

<Operation of the signal analysis apparatus according to the embodiment of the present invention>
Next, the operation of the signal analysis apparatus 100 according to the embodiment of the present invention will be described. When the input unit 10 receives the time-series data of a mixed signal in which constituent sounds are mixed and the time-series data of the acoustic signal of each constituent sound into which the mixed signal is separated, the signal analysis apparatus 100 executes the learning processing routine shown in FIG. 2.

First, in step S100, a power spectrogram representing the signal component of each frequency at each time is calculated from the time-series data of the mixed signal received by the input unit 10. Likewise, a power spectrogram representing the signal component of each frequency at each time is calculated from the time-series data of each constituent sound signal.

Next, in step S102, initial values are set for the basis spectra of speech and noise and for the activations of speech and noise.

In step S104, based on the basis spectra of speech and noise and the activations of speech and noise, which are either the initial values or the values from the previous update, the auxiliary variables are updated according to Equations (13), (15), (17), (19), and (21) above: γ_{k,ω,t}, β_{k,ω,t}, and θ_{k,ω,t} for each basis k, each frequency ω, and each time t, and λ_{ω,t} and η_{ω,t} for each frequency ω and each time t.

Next, in step S106, the basis spectra of speech and noise and the activations of speech and noise are estimated by solving the quartic and cubic equations of Equations (23) to (26) above. The estimation is based on: the power spectrogram of the mixed signal and the power spectrogram of the speech signal output by the time-frequency expansion unit 24; the auxiliary variables updated by the auxiliary variable update unit 42, namely γ_{k,ω,t}, β_{k,ω,t}, and θ_{k,ω,t} for each basis k, each frequency ω, and each time t, and λ_{ω,t} and η_{ω,t} for each frequency ω and each time t; and the basis spectra and activations of speech and noise that are either the initial values or the values from the previous update.

Next, in step S108, it is determined whether the convergence condition is satisfied. If it is satisfied, the routine proceeds to step S110. If it is not satisfied, the routine returns to step S104 and the processing of steps S104 to S106 is repeated.

In step S110, the basis spectra of speech and noise and the activations of speech and noise finally updated in step S106 are output from the output unit 90, and the learning processing routine ends.

<Experimental example>
To verify the speech enhancement effect of the method of the present embodiment, an evaluation experiment was conducted using the speech data of the 503-sentence ATR speech database (see Non-Patent Document 3) and the ATR environmental sound database (two types: department noise and subway station noise). The methods compared were the conventional supervised NMF method (SNMF) and the multiplicative-update algorithm for discriminative NMF (DNMF MU). The evaluation measured the improvement in signal-to-distortion ratio (SDR) and signal-to-interference ratio (SIR) (see Non-Patent Document 4) between the signals before and after processing.

[Non-Patent Document 3] A. Kurematsu, K. Takeda, Y. Sagisaka, S. Katagiri, H. Kuwabara, and K. Shikano, "ATR Japanese speech database as a tool of speech recognition and synthesis," Speech Communication, vol. 9, pp. 357-363, 1990.
[Non-Patent Document 4] E. Vincent, R. Gribonval, and C. Fevotte, "Performance measurement in blind audio source separation," IEEE Transactions on Audio, Speech, and Language Processing, vol. 14, no. 4, pp. 1462-1469, 2006.

The test data was created by superimposing each noise on clean speech at signal-to-noise ratios (SNR) of -6, -3, 0, and 3 dB. The acoustic signals used in the experiment were monaural signals with a sampling frequency of 16 kHz. A short-time Fourier transform with a frame length of 32 ms and a frame shift of 16 ms was applied to compute the observed spectrograms.

For basis learning, the speech bases were learned from a total of 200 sentences spoken by two male and two female speakers. The number of bases was 40 for both speech and noise. The iterative algorithm was run five times from randomly chosen initial values; FIG. 3 plots the mean and variance of the SDR improvement at 0, 10, 25, 50, 100, and 200 iterations in each trial. Based on the results in FIG. 3, the number of iterations was set to 25 in the following experiments. The test data set was created by superimposing noise on the speech data of 40 sentences selected at random from the ATR 503-sentence database. Tables 1 and 2 show the average SDR and SIR improvements obtained over five trials of the proposed method (DNMF AU) and the conventional methods (SNMF, DNMF MU) under the above conditions. For both evaluation measures, the proposed method achieved a higher improvement than the conventional methods in all cases.
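The test conditions above can be made concrete with a small sketch. At a sampling frequency of 16 kHz, a 32 ms frame and a 16 ms shift correspond to 512 and 256 samples, and mixing at a target SNR amounts to scaling the noise before adding it to the speech. The helper names `mean_power`, `mix_at_snr`, and `num_frames` are hypothetical, the signals are synthetic stand-ins (a tone and Gaussian noise) rather than ATR data, and counting only full frames without padding is an assumption.

```python
import math
import random

FS = 16000                      # sampling frequency [Hz], as stated in the text
FRAME_LEN = FS * 32 // 1000     # 32 ms frame  -> 512 samples
FRAME_SHIFT = FS * 16 // 1000   # 16 ms shift  -> 256 samples
SNRS_DB = (-6, -3, 0, 3)        # SNR conditions used for the test data

def mean_power(x):
    """Average power of a real-valued signal."""
    return sum(v * v for v in x) / len(x)

def mix_at_snr(speech, noise, snr_db):
    """Scale the noise so that speech power / noise power matches snr_db.

    Returns the mixture and the scaled noise.
    """
    gain = math.sqrt(mean_power(speech) / (mean_power(noise) * 10 ** (snr_db / 10)))
    scaled = [gain * n for n in noise]
    return [s + n for s, n in zip(speech, scaled)], scaled

def num_frames(n_samples):
    """Number of full analysis frames, assuming no padding at the edges."""
    if n_samples < FRAME_LEN:
        return 0
    return 1 + (n_samples - FRAME_LEN) // FRAME_SHIFT

# Synthetic one-second signals standing in for clean speech and noise.
rng = random.Random(0)
speech = [math.sin(2 * math.pi * 220 * n / FS) for n in range(FS)]
noise = [rng.gauss(0.0, 1.0) for _ in range(FS)]
mixtures = {snr: mix_at_snr(speech, noise, snr)[0] for snr in SNRS_DB}
```

Each entry of `mixtures` would then be passed through a short-time Fourier transform with the frame parameters above to obtain an observed spectrogram.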

Table 1 above shows the average SDR improvement [dB] obtained over five trials of each method. The upper part shows the speech enhancement results for department noise, and the lower part shows the results for subway station noise.

Table 2 above shows the average SIR improvement [dB] obtained over five trials of each method. The upper part shows the speech enhancement results for department noise, and the lower part shows the results for subway station noise.

As described above, the signal analysis device according to the embodiment of the present invention repeatedly updates the base spectrum of each component sound and the activation parameter of each component sound so as to reduce an auxiliary function. The auxiliary function is an upper-bound function of a criterion representing the magnitude of the error between the extracted spectrogram of the component sound signal of a component sound, which is extracted from the spectrogram of the mixed signal using the base spectrum and the activation parameter of the component sound signal of each component sound, and the spectrogram of the component sound signal of that component sound. The base spectra can thereby be learned by an algorithm with guaranteed convergence.
Further, in a supervised sound source separation method using non-negative matrix factorization, the base spectra can be learned by an algorithm with guaranteed convergence, using the reconstruction error of the separated signal as the criterion.
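The quantity being minimized can be illustrated with a short sketch for the speech and noise case. It assumes the Wiener-filter extraction takes the usual ratio-mask form, in which the mixture spectrogram M is multiplied elementwise by W^s H^s / (W^s H^s + W^n H^n), and that the error against the speech spectrogram S^s is measured by the I-divergence. The patent's exact expression is given by its own equations, which are not reproduced here, so the functions below (`wiener_extract`, `i_divergence`) and the noise symbols W^n and H^n are illustrative assumptions.

```python
import math

def matmul(A, B):
    """Plain-Python matrix product of nested lists."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def wiener_extract(M, Ws, Hs, Wn, Hn):
    """Mask the mixture spectrogram M with speech model / (speech + noise model)."""
    Xs, Xn = matmul(Ws, Hs), matmul(Wn, Hn)
    return [[m * xs / (xs + xn) for m, xs, xn in zip(rm, rs, rn)]
            for rm, rs, rn in zip(M, Xs, Xn)]

def i_divergence(S, S_hat):
    """I-divergence sum(s*log(s/s_hat) - s + s_hat); entries must be positive."""
    return sum(s * math.log(s / sh) - s + sh
               for rs, rh in zip(S, S_hat) for s, sh in zip(rs, rh))

# Toy example: 2 frequency bins, 2 time frames, one basis per source.
Ws, Hs = [[1.0], [2.0]], [[1.0, 3.0]]   # speech model  -> [[1, 3], [2, 6]]
Wn, Hn = [[0.5], [0.5]], [[2.0, 4.0]]   # noise model   -> [[1, 2], [1, 2]]
M = [[2.0, 5.0], [3.0, 8.0]]            # mixture = speech model + noise model
S_hat = wiener_extract(M, Ws, Hs, Wn, Hn)
```

In this toy case the mixture equals the sum of the two models exactly, so the extracted spectrogram coincides with the speech model and its I-divergence from that model is zero. With real data the criterion is positive, and the learning procedure adjusts the bases and activations to reduce it.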

Note that the present invention is not limited to the above-described embodiment, and various modifications and applications are possible without departing from the gist of the invention.

For example, although this specification describes an embodiment in which the program is installed in advance, the program can also be provided stored on a computer-readable recording medium or provided via a network.

DESCRIPTION OF SYMBOLS
10 Input unit
20 Calculation unit
24 Time-frequency expansion unit
36 Parameter learning unit
40 Initial value setting unit
42 Auxiliary variable update unit
44 Parameter update unit
46 Convergence determination unit
90 Output unit
100 Signal analysis device

Claims

A signal analysis device comprising:
a time-frequency expansion unit that receives as input time-series data of a mixed signal in which component sounds are mixed and time-series data of a component sound signal for each component sound separated from the mixed signal, and outputs, for each of the mixed signal and the component sound signal of each component sound, a spectrogram representing the signal component at each time and each frequency; and
a parameter learning unit that, based on the spectrograms of the mixed signal and of the component sound signal of each component sound output by the time-frequency expansion unit, estimates a base spectrum of each component sound and an activation parameter of each component sound so as to reduce a criterion representing the magnitude of the error between an extracted spectrogram of the component sound signal of a component sound, which is extracted from the spectrogram of the mixed signal using the base spectrum for the component sound signal of each component sound and the activation parameter representing the volume at each basis and each time, and the spectrogram of the component sound signal of that component sound,
wherein the parameter learning unit includes:
a parameter update unit that updates the base spectrum of each component sound and the activation parameter of each component sound so as to reduce an auxiliary function that is an upper-bound function of the criterion; and
a convergence determination unit that causes the update by the parameter update unit to be repeated until a predetermined convergence condition is satisfied.

The signal analysis device according to claim 1, wherein the extracted spectrogram of the component sound signal of the component sound is extracted from the spectrogram of the mixed signal by a Wiener filter.

The signal analysis device according to claim 2, wherein the component sounds are speech and noise, and the criterion is an I-divergence criterion expressed by the following equation.

Here, W^s denotes the speech base spectra, H^s denotes the speech activation parameters, W denotes the base matrix composed of the speech base spectra and the noise base spectra, H denotes the activation matrix composed of the speech activation parameters and the noise activation parameters, S^s denotes the spectrogram of the speech component sound signal, M denotes the spectrogram of the mixed signal, and W_{ω,k} denotes the power spectrum of basis k at frequency ω.

A signal analysis method in which:
a time-frequency expansion unit receives as input time-series data of a mixed signal in which component sounds are mixed and time-series data of a component sound signal for each component sound separated from the mixed signal, and outputs, for each of the mixed signal and the component sound signal of each component sound, a spectrogram representing the signal component at each time and each frequency; and
a parameter learning unit, based on the spectrograms of the mixed signal and of the component sound signal of each component sound output by the time-frequency expansion unit, estimates a base spectrum of each component sound and an activation parameter of each component sound so as to reduce a criterion representing the magnitude of the error between an extracted spectrogram of the component sound signal of a component sound, which is extracted from the spectrogram of the mixed signal using the base spectrum for the component sound signal of each component sound and the activation parameter representing the volume at each basis and each time, and the spectrogram of the component sound signal of that component sound,
wherein the estimation by the parameter learning unit includes:
a parameter update unit updating the base spectrum of each component sound and the activation parameter of each component sound so as to reduce an auxiliary function that is an upper-bound function of the criterion; and
a convergence determination unit causing the update by the parameter update unit to be repeated until a predetermined convergence condition is satisfied.

The signal analysis method according to claim 4, wherein the extracted spectrogram of the component sound signal of the component sound is extracted from the spectrogram of the mixed signal by a Wiener filter.

The signal analysis method according to claim 4, wherein the component sounds are speech and noise, and the criterion is an I-divergence criterion expressed by the following equation.

Here, W^s denotes the speech base spectra, H^s denotes the speech activation parameters, W denotes the base matrix composed of the speech base spectra and the noise base spectra, H denotes the activation matrix composed of the speech activation parameters and the noise activation parameters, S^s denotes the spectrogram of the speech component sound signal, M denotes the spectrogram of the mixed signal, and W_{ω,k} denotes the power spectrum of basis k at frequency ω.

A program for causing a computer to function as each unit of the signal analysis device according to any one of claims 1 to 3.