JP2019193073A

JP2019193073A - Sound source separation device, method thereof, and program

Info

Publication number: JP2019193073A
Application number: JP2018083097A
Authority: JP
Inventors: 弘章伊藤; Hiroaki Ito; 悠馬小泉; Yuma Koizumi; 登原田; Noboru Harada
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2018-04-24
Filing date: 2018-04-24
Publication date: 2019-10-31
Also published as: WO2019208137A1

Abstract

To provide a sound source separation device with higher separation accuracy than before. SOLUTION: The sound source separation device includes: a diffuse noise removal unit that removes an estimate of a diffuse noise acoustic signal from an observation signal to obtain a removed signal; a filter design unit that obtains a filter by combining a probability distribution modeling the removed signal with a probability distribution modeling a transfer function; and a sound source separation unit that uses the filter to separate, from the observation signal, at least a first acoustic signal and an estimate of a noise component that includes a coherent noise acoustic signal. SELECTED DRAWING: Figure 4

Description

The present invention relates to a sound source separation device that, in a noisy environment, separates the speech component and the noise component contained in an observation signal, using the observation signal obtained when a known acoustic signal is presented to a microphone (for example, when the known acoustic signal is played back and the reproduced sound is recorded with the microphone) together with the known acoustic signal itself.

One way to evaluate the speech recognition performance of a microphone is to estimate the SN ratio from the observation signal recorded by the microphone and to compare the estimated SN ratio with the speech recognition rate. For example, by running a single speech recognition device on two or more observation signals with different estimated SN ratios, the recognition rate of that device can be compared across the estimated SN ratios.

With such a method it is possible to estimate whether a human would feel that the observation signal ought to be recognized (for example, an observation signal with a high SN ratio is easy to hear, so a listener can be presumed to expect it to be recognized), which allows recognition performance to be evaluated in a way that is close to the user's perceived experience. In other words, recognition performance can be evaluated while taking into account that recognition accuracy is high when the SN ratio is high (little noise relative to speech, easy to hear) and low when the SN ratio is low (much noise relative to speech, hard to hear).

Data for such performance evaluation are generally obtained by preparing a speech signal database (not shown) in advance and, as shown in FIG. 1, playing back the target sound s_t from a loudspeaker 71 and coherent noise n_t from a loudspeaker 72. An SN ratio estimation unit 74 then estimates the SN ratio from the observation signal x_t recorded by a microphone 73. The observation signal x_t also contains diffuse noise d_t. Here t is an index denoting time.

As shown in FIG. 2, a conventional SN ratio estimation technique refers to utterance interval information obtained from the target sound s_t (the original acoustic signal, also called the source sound or source signal), and obtains the speech component from the utterance interval (T_s0 to T_s1 in FIG. 2) and the noise component from the non-utterance interval (T_n0 to T_n1 in FIG. 2) (see Non-Patent Document 1).

However, when non-stationary noise is present, the estimated SN ratio deviates from the value perceived by the user. For example, taking FIG. 2A as the state in which no non-stationary noise is present, in FIG. 2B non-stationary noise is present in a section that includes the non-utterance interval (T_n0 to T_n1 in FIG. 2), so the SN ratio is estimated to be lower than the user-perceived value. In FIG. 2C non-stationary noise is present in a section that includes the utterance interval (T_s0 to T_s1 in FIG. 2), so the SN ratio is estimated to be higher than the user-perceived value.
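The bias described for FIG. 2B and FIG. 2C can be reproduced with a small numerical sketch. The formula used below (mean power in the utterance interval divided by mean power in the non-utterance interval) is an assumption chosen for illustration, not the definition from Non-Patent Document 1, and all signals are synthetic stand-ins.

```python
import numpy as np

def interval_snr_db(x, speech, noise):
    """SN ratio in dB from the mean power in an utterance interval and in a
    non-utterance interval, each given as (start, stop) sample indices.
    Hypothetical helper; the exact conventional formula is not in the text."""
    p_speech = np.mean(x[speech[0]:speech[1]] ** 2)
    p_noise = np.mean(x[noise[0]:noise[1]] ** 2)
    return 10.0 * np.log10(p_speech / p_noise)

rng = np.random.default_rng(0)
fs = 16000
t = np.arange(fs) / fs

# A tone in the second half stands in for speech; weak white noise is the stationary part.
target = np.where(t >= 0.5, 0.1 * np.sin(2 * np.pi * 200 * t), 0.0)
background = 0.01 * rng.standard_normal(fs)
burst = 0.05 * rng.standard_normal(4000)  # non-stationary noise

case_a = target + background          # like FIG. 2A: no non-stationary noise
case_b = case_a.copy()
case_b[2000:6000] += burst            # like FIG. 2B: burst in the non-utterance interval
case_c = case_a.copy()
case_c[10000:14000] += burst          # like FIG. 2C: burst in the utterance interval

speech_iv, noise_iv = (8000, 16000), (0, 8000)
snr_a = interval_snr_db(case_a, speech_iv, noise_iv)
snr_b = interval_snr_db(case_b, speech_iv, noise_iv)
snr_c = interval_snr_db(case_c, speech_iv, noise_iv)
print(f"A: {snr_a:.1f} dB, B: {snr_b:.1f} dB, C: {snr_c:.1f} dB")
```

Under these assumptions the burst lowers the estimate when it falls in the non-utterance interval and raises it when it falls in the utterance interval, even though the background noise level is unchanged.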

Therefore, as shown in FIG. 3, an approach is proposed in which a sound source separation unit 84 separates the speech component and the noise component of the observation signal x_t, and an SN ratio estimation unit 85 estimates the SN ratio from the separated signals. The problem considered here is to estimate, from an observation signal x_{ω,τ} ∈ C (C is the set of all complex numbers) in which the target sound s_{ω,τ} ∈ C, the coherent noise n_{ω,τ} ∈ C, and the diffuse noise d_{ω,τ} ∈ C are superimposed as in Equation (1), the component derived from the target sound (speech component) a_ω s_{ω,τ} and the component derived from noise (noise component) n_{ω,τ} + d_{ω,τ} that are contained in x_{ω,τ}.
$$x_{\omega,\tau} = a_{\omega} s_{\omega,\tau} + n_{\omega,\tau} + d_{\omega,\tau} \tag{1}$$
Here x_{ω,τ}, s_{ω,τ}, n_{ω,τ}, and d_{ω,τ} are the time-domain signals x_t, s_t, n_t, and d_t converted into frequency-domain signals; ω ∈ {1,…,Ω} and τ ∈ {1,…,Τ} are the frequency and (frame) time indices; and a_ω is the transfer characteristic (also called the transfer function) from the target sound position (the position where the target sound is generated) to the observation position. Hereafter, for notational simplicity, the absolute value of a complex number is denoted by the uppercase letter corresponding to its lowercase letter, as in |x_{ω,τ}| = X_{ω,τ}. Unless otherwise noted, lowercase variables are complex and uppercase variables are real. Diffuse noise is assumed to be stationary noise such as background noise, including air-conditioning sound. Coherent noise is assumed to be non-stationary noise such as utterances by people who are not the intended recording target, TV audio, and sudden noises.

A representative method for estimating the speech component a_ω s_{ω,τ} and the noise component n_{ω,τ} + d_{ω,τ} from the observation signal x_{ω,τ} is nonlinear filtering. In this method, a nonlinear filter G_{ω,τ} is designed by Equation (2),

[Equation (2) is not reproduced in the source text.]

and each signal (component) is estimated as follows.
$$\hat{a}_{\omega}\hat{s}_{\omega,\tau} = G_{\omega,\tau}\, x_{\omega,\tau} \tag{3}$$
$$\hat{n}_{\omega,\tau} + \hat{d}_{\omega,\tau} = (1 - G_{\omega,\tau})\, x_{\omega,\tau} \tag{4}$$
Estimating each signal (component) in this way makes it possible to estimate, for example, the sSNR (segmental SNR), which is the SNR of each time frame as defined by Equation (5).

[Equation (5) is not reproduced in the source text.]
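As an illustration of Equations (3) and (4), the following sketch applies a given filter G to a complex spectrogram and computes a per-frame SN ratio. The filter values here are random placeholders, and the per-frame ratio (power summed over frequency) is an assumed stand-in for Equation (5), whose exact form is not available in this text. The function names are hypothetical.

```python
import numpy as np

def separate_with_mask(x, G):
    """Equations (3) and (4): split x into speech and noise estimates."""
    speech_hat = G * x            # estimate of a_w * s_{w,t}
    noise_hat = (1.0 - G) * x     # estimate of n_{w,t} + d_{w,t}
    return speech_hat, noise_hat

def framewise_snr_db(speech_hat, noise_hat, eps=1e-12):
    """Assumed per-frame SN ratio: power summed over frequency, in dB."""
    p_s = np.sum(np.abs(speech_hat) ** 2, axis=0)
    p_n = np.sum(np.abs(noise_hat) ** 2, axis=0)
    return 10.0 * np.log10((p_s + eps) / (p_n + eps))

rng = np.random.default_rng(0)
n_freq, n_frames = 129, 50
# Synthetic complex spectrogram, shape (frequency, frame).
x = rng.standard_normal((n_freq, n_frames)) + 1j * rng.standard_normal((n_freq, n_frames))
# Placeholder filter values in [0, 1]; a real G would come from Equation (2).
G = rng.uniform(0.0, 1.0, size=(n_freq, n_frames))

speech_hat, noise_hat = separate_with_mask(x, G)
snr_per_frame = framewise_snr_db(speech_hat, noise_hat)
```

Because the two estimates use G and 1 - G, they always sum back to the observation, so any error in G moves energy between the two components rather than discarding it.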

To estimate the nonlinear filter G_{ω,τ} of Equation (2), the transfer characteristic A_ω, the target sound S_{ω,τ}, the coherent noise N_{ω,τ}, and the diffuse noise D_{ω,τ} must be estimated. In this problem setting the target sound S_{ω,τ} is assumed to be known, so estimating the transfer characteristic A_ω, the coherent noise N_{ω,τ}, and the diffuse noise D_{ω,τ} from the observation signal X_{ω,τ} makes it possible to estimate the nonlinear filter G_{ω,τ} and the SNR.

Many conventional methods for the above source separation problem assume instantaneous mixing of the sources in the amplitude domain and multiplicativity of the transfer characteristic in the amplitude domain. If these assumptions hold, the observation signal X_{ω,τ} can be written as follows.
$$X_{\omega,\tau} = A_{\omega} S_{\omega,\tau} + N_{\omega,\tau} + D_{\omega,\tau} \tag{6}$$
Under this model there are various methods for estimating each component. A representative method for estimating the diffuse noise D_{ω,τ} assumes that D_{ω,τ} is stationary noise and takes it to be the expected value of the observation signal X_{ω,τ}.

With this method alone, however, only the diffuse noise D_{ω,τ} among the noise components can be estimated; the coherent noise N_{ω,τ} cannot. One method for estimating the coherent noise N_{ω,τ} is semi-supervised non-negative matrix factorization (NMF). Semi-supervised NMF assumes the following model for the observation signal X_{ω,τ}.

[Equation (7) is not reproduced in the source text.]

Here W^S_{ω,r} and W^N_{ω,k} are the bases of the amplitude spectra of the target sound and the coherent noise, respectively; H^S_{r,τ} and H^N_{k,τ} are the intensities (activations) corresponding to each basis of the amplitude spectra of the target sound and the coherent noise, respectively; and R and K are the respective numbers of bases. In this problem setting the target sound S_{ω,τ} is known, so the basis W^S_{ω,r} and the activation H^S_{r,τ} are first learned so as to minimize an objective function, such as the generalized KL divergence, between the target sound S_{ω,τ} and its approximation by W^S_{ω,r} and H^S_{r,τ} (the expression itself is missing from this text). The basis W^N_{ω,k} and the activation H^N_{k,τ} are then learned so as to minimize an objective function, such as the generalized KL divergence, between the observation signal X_{ω,τ} and Equation (7) (see Non-Patent Document 2).
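The two-stage learning described above can be sketched as follows. This is a minimal illustration, not the method of Non-Patent Document 2: it assumes that model (7) has the usual semi-supervised NMF form X ≈ W^S H^S + W^N H^N, uses the standard multiplicative updates for the generalized KL divergence, and runs on synthetic magnitudes. The function names are hypothetical.

```python
import numpy as np

EPS = 1e-12

def gkl(V, V_hat):
    """Generalized KL divergence D(V || V_hat), summed over all elements."""
    return float(np.sum(V * np.log((V + EPS) / (V_hat + EPS)) - V + V_hat))

def train_target(S, R, n_iter, rng):
    """Stage 1: fit the known target, S ~ W_S @ H_S."""
    W = rng.random((S.shape[0], R)) + EPS
    H = rng.random((R, S.shape[1])) + EPS
    ones = np.ones_like(S)
    for _ in range(n_iter):
        H *= (W.T @ (S / (W @ H + EPS))) / (W.T @ ones + EPS)
        W *= ((S / (W @ H + EPS)) @ H.T) / (ones @ H.T + EPS)
    return W, H

def train_noise(X, W_S, H_S, K, n_iter, rng):
    """Stage 2: fit X ~ W_S @ H_S + W_N @ H_N with the target part held fixed."""
    W_N = rng.random((X.shape[0], K)) + EPS
    H_N = rng.random((K, X.shape[1])) + EPS
    ones = np.ones_like(X)
    fixed = W_S @ H_S
    for _ in range(n_iter):
        H_N *= (W_N.T @ (X / (fixed + W_N @ H_N + EPS))) / (W_N.T @ ones + EPS)
        W_N *= ((X / (fixed + W_N @ H_N + EPS)) @ H_N.T) / (ones @ H_N.T + EPS)
    return W_N, H_N

rng = np.random.default_rng(0)
n_freq, n_frames, R, K = 32, 60, 4, 4
S = rng.random((n_freq, R)) @ rng.random((R, n_frames))  # synthetic target magnitudes
N = rng.random((n_freq, K)) @ rng.random((K, n_frames))  # synthetic coherent noise
X = S + N  # no transfer characteristic and no diffuse noise, as in the assumed model (7)

W_S, H_S = train_target(S, R, 200, rng)
W_N, H_N = train_noise(X, W_S, H_S, K, 200, rng)
d_target_only = gkl(X, W_S @ H_S)
d_full = gkl(X, W_S @ H_S + W_N @ H_N)
print(d_target_only, d_full)
```

In this sketch the second stage only has to explain what the fixed target model leaves over, which is why the synthetic data deliberately omit the transfer characteristic and the diffuse noise.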

Non-Patent Document 1: "G.160: Revised Appendix II - Objective measures for the characterization of the basic functioning of noise reduction algorithms", International Telecommunication Union.
Non-Patent Document 2: D. Kitamura, N. Ono, H. Saruwatari, Y. Takahashi, and K. Kondo, "Discriminative and reconstructive basis training for audio source separation with semi-supervised nonnegative matrix factorization", in Proc. IWAENC, 2016.

However, because Equation (7) takes neither the transfer characteristic A_ω nor the diffuse noise D_{ω,τ} into account, the accuracy with which the target-derived component a_ω s_{ω,τ} and the noise-derived component n_{ω,τ} + d_{ω,τ} are separated from the observation signal X_{ω,τ} is low, and it is difficult to estimate the SNR precisely simply by applying this method.

An object of the present invention is to provide a sound source separation technique with higher separation accuracy than conventional techniques.

To solve the above problem, according to one aspect of the present invention, a sound source separation device obtains a desired acoustic signal from an observation signal obtained by recording, with a microphone, a predetermined acoustic signal emitted from a loudspeaker. The observation signal contains a first acoustic signal based on the predetermined acoustic signal and on a transfer function, which is a function expressing the spatial characteristics between the loudspeaker and the microphone; a coherent noise acoustic signal, which is coherent noise; and a diffuse noise acoustic signal, which is diffuse noise. The sound source separation device includes: a diffuse noise removal unit that removes an estimate of the diffuse noise acoustic signal from the observation signal to obtain a removed signal; a filter design unit that obtains a filter by combining a probability distribution modeling the removed signal with a probability distribution modeling the transfer function; and a sound source separation unit that uses the filter to separate, from the observation signal, at least the first acoustic signal and an estimate of a noise component that includes the coherent noise acoustic signal.

According to the present invention, separation accuracy is higher than with conventional techniques. Furthermore, using the separated components yields higher SN ratio estimation accuracy than before.

FIG. 1 is a diagram for explaining a conventional technique for estimating the SN ratio.
FIG. 2A is a diagram showing a state in which no non-stationary noise is present, FIG. 2B is a diagram showing a state in which non-stationary noise is present in a section including the non-utterance interval, and FIG. 2C is a diagram showing a state in which non-stationary noise is present in a section including the utterance interval.
FIG. 3 is a diagram for explaining a conventional technique for estimating the SN ratio.
FIG. 4 is a functional block diagram of the SN ratio estimation device according to the first embodiment.
FIG. 5 is a diagram showing an example of the processing flow of the SN ratio estimation device according to the first embodiment.

Embodiments of the present invention are described below. In the drawings used in the following description, components having the same function and steps performing the same processing are given the same reference numerals, and duplicate description is omitted. In the following description, symbols such as "^" used in the text should properly be placed directly above the character that immediately follows them, but owing to the limitations of text notation they are written immediately before that character. Likewise, symbols such as "_" used in the text should properly be placed directly below the character that immediately follows them, but owing to the limitations of text notation they are written immediately before that character. In formulas these symbols are written in their proper positions. Processing performed on each element of a vector or matrix is applied to all elements of that vector or matrix unless otherwise noted.

<Points of the First Embodiment>
This embodiment provides a method that extends semi-supervised NMF to estimate the transfer characteristic A_ω, the coherent noise N_{ω,τ}, and the diffuse noise D_{ω,τ} from the observation signal X_{ω,τ}, and then estimates the SN ratio. The key point of this embodiment is to provide an algorithm that
・estimates the diffuse noise in advance and removes it from the observation signal, so that semi-supervised NMF can be adapted to real-environment observation models such as Equations (1) and (6), and
・incorporates a term for the transfer characteristic A_ω into semi-supervised NMF (see Non-Patent Document 2) that has been probabilistically modeled on the signal after removal, and performs optimization based on maximum a posteriori (MAP) estimation.
With this configuration, the target-derived component and the noise-derived component can be separated from the observation signal with high accuracy even in a real environment, and the SN ratio can be estimated.

First, how the observation signal is modeled is described.

<Modeling of the Observation Signal>
To model the observation signal in line with Equation (6), the observation signal X_{ω,τ} is first approximated as follows.

[Equation not reproduced in the source text.]

Here, existing techniques for estimating the diffuse noise D_{ω,τ} are extended by assuming that D_{ω,τ} is stationary noise within a certain span of time frames. Further assuming that the target sound S_{ω,τ} and the coherent noise N_{ω,τ} are temporally sparse signals, the diffuse noise D_{ω,τ} is estimated as follows.
$$\hat{D}_{\omega,\tau} \leftarrow \Upsilon \cdot \min\left[X_{\omega,\tau-F_{wd}},\, X_{\omega,\tau-F_{wd}+1},\, \ldots,\, X_{\omega,\tau+B_{wd}}\right] \tag{8}$$
Here F_wd and B_wd are parameters that specify the number of time frames over which D_{ω,τ} is stationary. They can be determined by tuning and may, for example, each be set to about 20. Υ is a predetermined value. The observation signal from which the diffuse noise D_{ω,τ} has been removed (hereinafter also called the "removed signal") Y_{ω,τ} can then be written as follows.

[The equations, including Equation (9), are not reproduced in the source text.]
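Equation (8) amounts to a sliding minimum over time frames. The sketch below illustrates it under stated assumptions: the window is clipped at the signal edges, Υ defaults to 1.0, and the removal step is taken to be a floored subtraction X - ^D, since Equation (9) is not available in this text. The function names are hypothetical.

```python
import numpy as np

def estimate_diffuse_noise(X, f_wd=20, b_wd=20, upsilon=1.0):
    """Equation (8): scaled minimum of X over frames tau-f_wd .. tau+b_wd.
    X has shape (frequency, frame); edge windows are clipped (assumption)."""
    n_freq, n_frames = X.shape
    D_hat = np.empty_like(X)
    for tau in range(n_frames):
        lo = max(0, tau - f_wd)
        hi = min(n_frames, tau + b_wd + 1)
        D_hat[:, tau] = upsilon * X[:, lo:hi].min(axis=1)
    return D_hat

def remove_diffuse_noise(X, D_hat, floor=1e-12):
    """Assumed form of the removal step: subtract and keep the result positive."""
    return np.maximum(X - D_hat, floor)

# Synthetic magnitudes: a constant noise floor plus one short, sparse event.
X = np.full((3, 100), 0.5)
X[:, 40:45] += 2.0

D_hat = estimate_diffuse_noise(X)
Y = remove_diffuse_noise(X, D_hat)
```

The minimum works here only because the event is shorter than the window, which is the sparsity assumption stated in the text. A sound that fills the whole window would be absorbed into ^D.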

Here the basis W^S_{ω,r} and the activation H^S_{r,τ} of the amplitude spectrum of the target sound can be estimated using the conventional semi-supervised NMF framework (see Non-Patent Document 2). The following describes a method for estimating, from the removed signal Y_{ω,τ}, the basis W^N_{ω,k} and the activation H^N_{k,τ} of the amplitude spectrum of the coherent noise, and the transfer characteristic A_ω. The estimates of W^S_{ω,r}, H^S_{r,τ}, W^N_{ω,k}, H^N_{k,τ}, and A_ω are written ^W^S_{ω,r}, ^H^S_{r,τ}, ^W^N_{ω,k}, ^H^N_{k,τ}, and ^A_ω, respectively.

The transfer characteristic A_ω is inherently a physical parameter, and its estimation accuracy can be expected to improve if acoustic prior knowledge, such as the shape of the room and the observation environment, is incorporated. To achieve this, each parameter is estimated by MAP estimation in this embodiment. Specifically, a likelihood function p(_A,_N|_S,_Y) for the removed signal Y_{ω,τ} and a prior distribution p(_A|_α) for the transfer characteristic A_ω are designed, and the parameters _A, _N, and _α are estimated so as to maximize the joint probability L of Equation (11).
$$L = p(\underline{A}, \underline{N} \mid \underline{S}, \underline{Y})\; p(\underline{A} \mid \underline{\alpha}) \tag{11}$$
$$\underline{A} := [\hat{A}_{\omega}] \in \mathbb{R}^{\Omega}$$
$$\underline{N} := [\hat{N}_{\omega,\tau}] \in \mathbb{R}^{\Omega \times T}$$
$$\underline{S} := [S_{\omega,\tau}] \in \mathbb{R}^{\Omega \times T}$$
$$\underline{Y} := [Y_{\omega,\tau}] \in \mathbb{R}^{\Omega \times T}$$
$$\underline{\alpha} := [\alpha_{\omega}] \in \mathbb{R}^{\Omega}$$
_α is the set of parameters used in modeling the prior distribution of the transfer characteristic ^A_ω. For the likelihood function, a Poisson distribution is used, which is the probability distribution obtained by interpreting the generalized KL divergence probabilistically. A Poisson distribution is also used for the transfer characteristic A_ω, because A_ω is a non-negative variable. Each distribution can then be written as follows.

[Equations not reproduced in the source text.]

Because each distribution belongs to the exponential family, it is numerically more efficient to maximize the joint probability L by maximizing the log joint distribution obtained by taking the logarithm of both sides. Taking the logarithm of each distribution gives the following.

[Equations not reproduced in the source text.]

Hence the objective function to be maximized is

[Equation (18) is not reproduced in the source text.]

Maximizing this objective function J(Θ) is equivalent to maximizing the joint probability L.
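The link between a Poisson likelihood and the generalized KL divergence mentioned above can be checked with a generic scalar identity. The block below is a sketch of that identity only. It does not reproduce the distributions or objective of this document, whose parameterization is not available in this text, and the use of the Gamma function for non-integer y is an assumption.

```latex
% y >= 0 is an observed value and \lambda > 0 the corresponding model value.
\begin{align*}
\log \mathrm{Poisson}(y;\lambda) &= y\log\lambda - \lambda - \log\Gamma(y+1),\\
D_{\mathrm{KL}}(y\,\|\,\lambda) &= y\log\frac{y}{\lambda} - y + \lambda,\\
\log \mathrm{Poisson}(y;\lambda) &= -D_{\mathrm{KL}}(y\,\|\,\lambda)
  + \bigl[\,y\log y - y - \log\Gamma(y+1)\,\bigr].
\end{align*}
```

The bracketed term does not depend on λ, so maximizing the Poisson log-likelihood with respect to the model value is the same as minimizing the generalized KL divergence between the observation and the model.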

<Derivation of the Update Rules>
This section describes an algorithm that estimates the basis estimate ^W^N_{ω,k}, the activation estimate ^H^N_{k,τ}, and the transfer characteristic estimate ^A_ω so as to maximize Equation (18). Because it is difficult to maximize Equation (18) directly, this embodiment describes an update algorithm that uses the auxiliary function method. To simplify the problem, R = K is assumed. By the log-sum inequality, if λ_{r,ω,τ} ≥ 0 and

[Equation not reproduced in the source text.]

then the following inequality holds.

[Equation not reproduced in the source text.]

The objective function J(Θ) can then be bounded from below by the following J'(Θ).

[Equation not reproduced in the source text.]

According to the auxiliary function method, J'(Θ) is first maximized with respect to λ_{r,ω,τ}, and each variable is then maximized under that λ_{r,ω,τ}. Repeating these two steps estimates the parameters in such a way that the objective function J(Θ) increases monotonically. The update algorithm based on the auxiliary function method is as follows.

[Equations not reproduced in the source text.]
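The log-sum inequality behind the lower bound can be stated in generic form. The block below is a sketch of the standard inequality for positive terms z_r. The actual terms and the index structure of λ_{r,ω,τ} used in this document are not available in this text, so treating each (ω,τ) pair as having its own set of weights is an assumption.

```latex
\begin{align*}
\log \sum_{r} z_r
  \;=\; \log \sum_{r} \lambda_r \frac{z_r}{\lambda_r}
  \;\ge\; \sum_{r} \lambda_r \log \frac{z_r}{\lambda_r},
  \qquad \lambda_r > 0,\quad \sum_{r} \lambda_r = 1,
\end{align*}
% with equality when the weights are proportional to the terms:
\[
  \lambda_r = \frac{z_r}{\sum_{r'} z_{r'}} .
\]
```

The equality condition is what makes the first step of the auxiliary function method closed-form: maximizing the bound over the weights makes it touch the objective at the current parameters, so a subsequent increase of the bound cannot decrease the objective.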

When computing with a matrix library, Equations (22) and (23) may be replaced, as an approximation of the above algorithm, by the following update rules.

[Equations not reproduced in the source text.]

Here T denotes transposition, _E is an Ω×Τ matrix whose elements are all 1, and matrix division denotes element-wise division. In addition, _Z = [_Z^(S), _W^(N)], _H = [(_H^(S))^T, (_H^(N))^T]^T, _Z^(S) := {^A_ω ^W^S_{ω,r}} ∈ R^{Ω×R}, _W^(N) := {^W^N_{ω,k}} ∈ R^{Ω×K}, _H^(S) := {^H^S_{r,τ}} ∈ R^{R×Τ}, and _H^(N) := {^H^N_{k,τ}} ∈ R^{K×Τ}.

So that _Z^(S) and _H^(S) are not updated, they are replaced with their pre-trained values after every update.
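The stacked matrices and the reset step can be sketched as below. The update rules themselves are missing from this text, so the sketch assumes the standard multiplicative updates for the generalized KL divergence written with _Z, _H, and _E. It also holds ^A_ω fixed at the example initial value 1, because the update for the transfer characteristic depends on a prior whose form is not shown here. The function name, sizes, and data are hypothetical.

```python
import numpy as np

EPS = 1e-12

def gkl(V, V_hat):
    """Generalized KL divergence D(V || V_hat), summed over all elements."""
    return float(np.sum(V * np.log((V + EPS) / (V_hat + EPS)) - V + V_hat))

def stacked_updates(Y, A, W_S, H_S, K, n_iter, rng):
    """Update Z = [Z_S, W_N] and H = [H_S; H_N], restoring Z_S and H_S each time."""
    n_freq, n_frames = Y.shape
    R = W_S.shape[1]
    Z_S = A[:, None] * W_S                                  # _Z^(S) = {A_w * W^S_{w,r}}
    Z = np.hstack([Z_S, rng.random((n_freq, K)) + EPS])     # _Z = [_Z^(S), _W^(N)]
    H = np.vstack([H_S, rng.random((K, n_frames)) + EPS])   # _H stacks _H^(S) over _H^(N)
    E = np.ones((n_freq, n_frames))                         # all-ones matrix _E
    history = [gkl(Y, Z @ H)]
    for _ in range(n_iter):
        Z = Z * ((Y / (Z @ H + EPS)) @ H.T) / (E @ H.T + EPS)
        Z[:, :R] = Z_S                                      # restore pre-trained block
        H = H * (Z.T @ (Y / (Z @ H + EPS))) / (Z.T @ E + EPS)
        H[:R, :] = H_S                                      # restore pre-trained block
        history.append(gkl(Y, Z @ H))
    return Z, H, Z_S, history

rng = np.random.default_rng(0)
n_freq, n_frames, R, K = 32, 60, 4, 4
W_S = rng.random((n_freq, R))        # stand-in for the pre-trained target basis
H_S = rng.random((R, n_frames))      # stand-in for the pre-trained target activation
A = np.ones(n_freq)                  # transfer characteristic held at 1 (assumption)
noise = rng.random((n_freq, K)) @ rng.random((K, n_frames))
Y = A[:, None] * (W_S @ H_S) + noise  # synthetic removed signal

Z, H, Z_S, history = stacked_updates(Y, A, W_S, H_S, K, 100, rng)
print(history[0], history[-1])
```

Each row of _H and each column of _Z is updated independently given the current model, so updating everything and then restoring the supervised blocks gives the same noise-block values as updating only the noise blocks. The stacked form simply keeps the whole step as two matrix expressions.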

<SN Ratio Estimation Device According to the First Embodiment>
FIG. 4 is a functional block diagram of the SN ratio estimation device according to the first embodiment, and FIG. 5 shows an example of its processing flow.

The SN ratio estimation device 100 includes an initialization unit 102, a diffuse noise removal unit 103, a filter design unit 104, a sound source separation unit 105, and a signal-to-noise ratio estimation unit 106.

The SN ratio estimation device 100 receives as input the target sound s_{ω,τ}, obtained by converting the time-domain target sound s_t played back by the loudspeaker 71 into a frequency-domain signal; the observation signal x_{ω,τ}, obtained by converting the time-domain observation signal x_t recorded by the microphone 73 into a frequency-domain signal; and various parameters. The various parameters here are, for example, Υ in Equation (8), the numbers of bases R and K (which can be set to, for example, about R = K = 10), and the initial value of the transfer characteristic estimate ^A (for example, ^A_ω = 1). This embodiment is described on the assumption that the frequency-domain target sound s_{ω,τ} and observation signal x_{ω,τ} are input, but the device may instead receive the time-domain target sound s_t and observation signal x_t, where t is the time index. In that case the SN ratio estimation device 100 converts them into frequency-domain signals. A fast Fourier transform or the like may be used for the frequency transform, with, for example, a Fourier transform length of 256 points and a shift of 128 points.
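The conversion to the frequency domain can be sketched as a short-time Fourier transform with the example settings above (256-point transform, 128-point shift). The Hann window, the dropping of a final partial frame, and the 16 kHz test signal are assumptions made for this sketch. The text specifies only the transform length and the shift.

```python
import numpy as np

def stft(x, n_fft=256, hop=128):
    """Short-time Fourier transform; returns an array of shape
    (n_fft // 2 + 1, n_frames) indexed by (frequency, frame)."""
    window = np.hanning(n_fft)                 # window choice is an assumption
    n_frames = 1 + (len(x) - n_fft) // hop     # a trailing partial frame is dropped
    frames = np.stack(
        [x[i * hop:i * hop + n_fft] * window for i in range(n_frames)], axis=1
    )
    return np.fft.rfft(frames, axis=0)

fs = 16000                                     # sampling rate is an assumption
t = np.arange(fs) / fs
x_t = np.sin(2 * np.pi * 1000 * t)             # 1 kHz tone as a stand-in signal

x_wt = stft(x_t)                               # complex x_{w,tau}
X_wt = np.abs(x_wt)                            # amplitude X_{w,tau} = |x_{w,tau}|
peak_bins = X_wt.argmax(axis=0)
```

With these settings a 1 kHz tone at 16 kHz sampling falls exactly on bin 16 (1000 / 16000 × 256), which gives a quick sanity check on the frame layout.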

Using the target sound s_{ω,τ} and the observation signal x_{ω,τ}, the SN ratio estimation device 100 separates the speech component and the noise component contained in the observation signal x_{ω,τ}, obtains the signal-to-noise ratio, and outputs it.

The SN ratio estimation device is, for example, a special device configured by loading a special program into a known or dedicated computer that has a central processing unit (CPU), a main storage device (RAM: random access memory), and the like. The SN ratio estimation device executes each process under the control of the CPU, for example. Data input to the device and data obtained in each process are stored, for example, in the main storage device, and are read out to the CPU as needed and used in other processes. At least part of each processing unit of the device may be implemented in hardware such as an integrated circuit. Each storage unit of the device can be implemented by, for example, a main storage device such as RAM, or by middleware such as a relational database or a key-value store. However, each storage unit need not be provided inside the SN ratio estimation device. It may be implemented by an auxiliary storage device composed of a hard disk, an optical disc, or a semiconductor memory element such as flash memory, and provided outside the device.

Each unit is described below.

＜初期化部１０２＞
初期化部１０２は、目的音s_ω,τと観測信号x_ω,τと各種パラメータとを入力とする。 <Initialization unit 102>
The initialization unit 102 receives the target sound _{sω, τ} , the observation signal _{xω, τ,} and various parameters.

初期化部１０２は、観測信号x_ω,τとΥとを用いて、式(8)により、拡散性雑音D_ω,τを推定し、推定値^D_ω,τを出力する。
^D_ω,τ←Υ・min[X_{ω,τ-F_wd},X_{ω,τ-F_wd+1},…,X_{ω,τ+B_wd}] (8)
初期化部１０２は、例えば、目的音s_ω,τと基底数Rを用いて、一般化KL情報量最小化などに基づく既存のNMFの枠組み(非特許文献２参照)で、基底の推定値^W^S _ω,rと強度の推定値^H^S _r,τとを求め、出力する。例えば、目的音S_ω,τが既知であるため、基底の推定値^W^S _ω,rと強度の推定値^H^S _r,τを、目的音S_ω,τと The initialization unit 102 estimates the diffusive noise D _{ω, τ according} to the equation (8) using the observation signal x _{ω, τ} and Υ, and outputs the estimated value ^ D _{ω, τ} .
^ D _{ω, τ} ← Υ ・ min [X _{ω, τ-F_wd} , X _{ω, τ-F_wd + 1} ,…, X _{ω, τ + B_wd} ] (8)
The initialization unit 102 uses, for example, the target sound s _{ω, τ} and the basis number R to estimate the basis value in an existing NMF framework (see Non-Patent Document 2) based on generalized KL information minimization. ^ W ^S _{ω, r} and intensity estimate ^ H ^S _{r, τ} are obtained and output. For example, since the target sound S _{ω, τ} is known, the base estimate ^ W ^S _{ω, r} and the intensity estimate ^ H ^S _{r, τ} are changed to the target sound S _{ω, τ}

の間の一般化KL情報量などの目的関数を最小化するように学習する（非特許文献２参照）。また、基底の推定値^W^N _ω,kと強度の推定値^H^N _k,τは非負の乱数などで初期化する。 Learning is performed so as to minimize an objective function such as a generalized KL information amount between (see Non-Patent Document 2). The base estimation value ^ W ^N _{ω, k} and the intensity estimation value ^ H ^N _{k, τ} are initialized with a non-negative random number or the like.

The initialization unit 102 obtains, for example by the method described above, the initial values of the estimate $\hat{D}_{\omega,\tau}$, the basis estimates $\hat{W}^S_{\omega,r}$, the intensity estimates $\hat{H}^S_{r,\tau}$, the basis estimates $\hat{W}^N_{\omega,k}$, and the intensity estimates $\hat{H}^N_{k,\tau}$ (S102), and outputs them. The transfer characteristic estimate $\hat{A}_{\omega}$, the basis estimates $\hat{W}^N_{\omega,k}$, and the intensity estimates $\hat{H}^N_{k,\tau}$ are updated repeatedly in this embodiment, whereas the estimate $\hat{D}_{\omega,\tau}$, the basis estimates $\hat{W}^S_{\omega,r}$, and the intensity estimates $\hat{H}^S_{r,\tau}$, once set for a given usage environment, may be used with their initial values unchanged.

<Diffusive noise removal unit 103>
The diffusive noise removal unit 103 receives the observation signal $x_{\omega,\tau}$ and the estimate $\hat{D}_{\omega,\tau}$ of the diffusive noise $D_{\omega,\tau}$ as inputs, removes the estimate of the diffusive noise from the observation signal $x_{\omega,\tau}$ according to equation (9) to obtain the removed signal $Y_{\omega,\tau}$ (S103), and outputs it.
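Equation (9) is not reproduced in this text, so the sketch below does not claim to be it. It only illustrates one common way to remove an estimated noise floor from a non-negative spectrogram, namely bin-wise subtraction clipped at a floor (spectral subtraction). That technique, the function name `remove_diffuse_noise`, and the `floor` parameter are assumptions swapped in purely for illustration.

```python
def remove_diffuse_noise(X, D_hat, floor=0.0):
    """Subtract the noise estimate bin by bin and clip the result at `floor`.

    X and D_hat are non-negative spectrograms of the same shape, stored as
    lists of frequency rows. This is an assumed stand-in, not equation (9).
    """
    return [[max(x - d, floor) for x, d in zip(x_row, d_row)]
            for x_row, d_row in zip(X, D_hat)]
```

Clipping keeps the removed signal non-negative, which matters if it is later modeled with non-negative bases and intensities.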

<Filter design unit 104>
The filter design unit 104 receives as inputs the initial values of the basis estimates $\hat{W}^S_{\omega,r}$, the intensity estimates $\hat{H}^S_{r,\tau}$, the basis estimates $\hat{W}^N_{\omega,k}$, and the intensity estimates $\hat{H}^N_{k,\tau}$, as well as the removed signal $Y_{\omega,\tau}$, the estimate $\hat{D}_{\omega,\tau}$ of the diffusive noise $D_{\omega,\tau}$, the observation signal $x_{\omega,\tau}$, and various parameters including the numbers of bases $K$ and $R$. By combining a probability distribution that models the removed signal $Y_{\omega,\tau}$ with a probability distribution that models the transfer characteristic $A_{\omega}$, the filter design unit 104 obtains a nonlinear filter $G_{\omega,\tau}$ (S104) and outputs it. For example, it estimates the parameters $\mathbf{A}$, $\mathbf{N}$, and $\boldsymbol{\alpha}$ so as to maximize the joint probability $L$ of equation (11), which combines the likelihood function $p(\mathbf{A},\mathbf{N}\mid\mathbf{S},\mathbf{Y})$ for the removed signal $Y_{\omega,\tau}$ with the prior distribution $p(\mathbf{A}\mid\boldsymbol{\alpha})$ for the transfer characteristic $A_{\omega}$.
$$L = p(\mathbf{A},\mathbf{N}\mid\mathbf{S},\mathbf{Y})\,p(\mathbf{A}\mid\boldsymbol{\alpha}) \quad (11)$$
This process corresponds to estimating the parameters (the basis estimates $\hat{W}^N_{\omega,k}$, the intensity estimates $\hat{H}^N_{k,\tau}$, and the transfer characteristic estimate $\hat{A}_{\omega}$) so as to maximize the objective function $J(\Theta)$.
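Because the logarithm is monotonically increasing, maximizing the joint probability of equation (11) is equivalent to maximizing its logarithm, which turns the product into the sum of a data term and a prior term. The worked form below is only a sketch of that step. Whether the objective function J(Θ) of this embodiment is exactly this log form, and the grouping of the parameters under the symbol Θ, are assumptions.

```latex
\begin{align}
\log L &= \log p(\mathbf{A}, \mathbf{N} \mid \mathbf{S}, \mathbf{Y})
        + \log p(\mathbf{A} \mid \boldsymbol{\alpha}), \\
\operatorname*{arg\,max}_{\Theta} L &= \operatorname*{arg\,max}_{\Theta} \log L .
\end{align}
```

In this form the prior on the transfer characteristic acts as an additive regularizer on the fit to the removed signal.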

For example, updating the basis estimates $\hat{W}^N_{\omega,k}$, the intensity estimates $\hat{H}^N_{k,\tau}$, and the transfer characteristic estimate $\hat{A}_{\omega}$ according to equations (21) to (24), or according to equations (21), (25), (26), and (24) (S104-1), amounts to maximizing the joint probability $L$ and estimating the parameters $\mathbf{A}$, $\mathbf{N}$, and $\boldsymbol{\alpha}$.

Here, $\mathbf{Z} = [\mathbf{Z}^{(S)}, \mathbf{W}^{(N)}]$, $\mathbf{H} = [(\mathbf{H}^{(S)})^{\top}, (\mathbf{H}^{(N)})^{\top}]^{\top}$, $\mathbf{Z}^{(S)} := \{\hat{A}_{\omega}\hat{W}^S_{\omega,r}\} \in \mathbb{R}^{\Omega \times R}$, $\mathbf{W}^{(N)} := \{\hat{W}^N_{\omega,k}\} \in \mathbb{R}^{\Omega \times K}$, $\mathbf{H}^{(S)} := \{\hat{H}^S_{r,\tau}\} \in \mathbb{R}^{R \times \mathrm{T}}$, and $\mathbf{H}^{(N)} := \{\hat{H}^N_{k,\tau}\} \in \mathbb{R}^{K \times \mathrm{T}}$, where $\top$ denotes transpose. When the update uses equations (21), (25), (26), and (24), $\mathbf{Z}^{(S)}$ and $\mathbf{H}^{(S)}$ are replaced with their pre-learned values after every update so that they are effectively not updated.
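The concatenated matrices defined above, and the step of restoring the speech part after each update, can be sketched as follows. This is an illustrative sketch only: the function names `build_dictionary`, `stack_activations`, and `restore_speech_part` and the nested-list layout are assumptions, and the update equations themselves are not shown.

```python
def build_dictionary(A, W_S, W_N):
    """Return Z = [Z_S, W_N], where Z_S[w][r] = A[w] * W_S[w][r]."""
    return [[A[w] * v for v in W_S[w]] + list(W_N[w]) for w in range(len(A))]


def stack_activations(H_S, H_N):
    """Return H = [H_S; H_N] by stacking the rows of both matrices."""
    return [list(row) for row in H_S] + [list(row) for row in H_N]


def restore_speech_part(Z, H, Z_S_fixed, H_S_fixed):
    """Overwrite the speech columns of Z and the speech rows of H in place.

    Z_S_fixed (Omega x R) and H_S_fixed (R x T) hold the pre-learned values.
    """
    n_r = len(H_S_fixed)
    for w, row in enumerate(Z):
        row[:n_r] = Z_S_fixed[w]
    H[:n_r] = [list(row) for row in H_S_fixed]
    return Z, H
```

Calling `restore_speech_part` after every update leaves only the noise columns of Z and the noise rows of H free to change, which is the effect the text describes.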

When a predetermined condition is satisfied (S104-2), the filter design unit 104 ends the updates, obtains the nonlinear filter $G_{\omega,\tau}$ from the basis estimates $\hat{W}^N_{\omega,k}$, the intensity estimates $\hat{H}^N_{k,\tau}$, and the transfer characteristic estimate $\hat{A}_{\omega}$ at the end of the updates (S104-3), and outputs it.

The filter design unit 104 repeats the update process S104-1 until the predetermined condition is satisfied. Possible conditions include (i) S104-1 having been repeated a predetermined number of times (for example, 100 times) and (ii) the update amount having become smaller than a predetermined value. In short, it suffices that the update amounts of the basis estimates $\hat{W}^N_{\omega,k}$, the intensity estimates $\hat{H}^N_{k,\tau}$, and the transfer characteristic estimate $\hat{A}_{\omega}$ converge to the desired level.
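The two stopping conditions (i) and (ii) can be combined in a simple loop. The sketch below is generic and assumes the parameters are flattened into one list of floats; the function name `run_updates`, the `update` callback, and the use of the largest absolute change as the "update amount" are illustrative assumptions, not details from the source.

```python
def run_updates(update, params, max_iters=100, tol=1e-6):
    """Apply `update` until max_iters is reached or the change drops below tol.

    `params` is a flat list of floats and `update` maps such a list to a new
    list. Returns the final parameters and the number of updates performed.
    """
    for n in range(1, max_iters + 1):
        new_params = update(params)
        # Update amount: largest absolute change of any parameter.
        change = max(abs(a - b) for a, b in zip(new_params, params))
        params = new_params
        if change < tol:   # condition (ii): update amount small enough
            return params, n
    return params, max_iters   # condition (i): iteration budget used up


# Usage: a toy update that halves every parameter.
final, n_done = run_updates(lambda p: [0.5 * v for v in p], [1.0], tol=1e-3)
```

Keeping the iteration cap alongside the tolerance guarantees termination even when the updates converge slowly.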

<Sound source separation unit 105>
The sound source separation unit 105 receives the observation signal $x_{\omega,\tau}$ and the filter $G_{\omega,\tau}$ as inputs and, using the filter $G_{\omega,\tau}$, separates from the observation signal $x_{\omega,\tau}$ at least the estimate $\hat{a}_{\omega}\hat{s}_{\omega,\tau}$ of the speech component and an estimate of the noise component that includes the coherent noise $n_{\omega,\tau}$. For example, it separates the speech component estimate $\hat{a}_{\omega}\hat{s}_{\omega,\tau}$ and the noise component estimate $\hat{n}_{\omega,\tau}+\hat{d}_{\omega,\tau}$ by the following equations (S105) and outputs them.
$$\hat{a}_{\omega}\hat{s}_{\omega,\tau} = G_{\omega,\tau}\,x_{\omega,\tau} \quad (3)$$
$$\hat{n}_{\omega,\tau} + \hat{d}_{\omega,\tau} = (1 - G_{\omega,\tau})\,x_{\omega,\tau} \quad (4)$$
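Equations (3) and (4) apply the filter as a bin-wise gain and assign the remainder to the noise estimate. A minimal sketch follows, assuming the filter and the observation signal are stored as nested lists of the same shape; the function name `separate` is hypothetical, and the filter values are taken as given rather than derived here.

```python
def separate(X, G):
    """Split X into a speech estimate G * X and a noise estimate (1 - G) * X.

    X holds the observation signal per time-frequency bin (real or complex
    values) and G holds the filter gain for the same bins.
    """
    speech = [[g * x for g, x in zip(g_row, x_row)]
              for g_row, x_row in zip(G, X)]
    noise = [[(1.0 - g) * x for g, x in zip(g_row, x_row)]
             for g_row, x_row in zip(G, X)]
    return speech, noise
```

By construction the two outputs add up to the observation signal in every bin, so no energy is created or lost by the split itself.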

<Signal-to-noise ratio estimation unit 106>
The signal-to-noise ratio estimation unit 106 receives the speech component estimate $\hat{a}_{\omega}\hat{s}_{\omega,\tau}$ and the noise component estimate $\hat{n}_{\omega,\tau}+\hat{d}_{\omega,\tau}$ as inputs, obtains the signal-to-noise ratio (S106), and outputs it. For example, it computes the sSNR.
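The sSNR equation is not reproduced in this text, so the sketch below does not implement it. It computes a plain power-ratio SNR in decibels over all time-frequency bins, which is an assumption chosen only to show how the two separated components feed an SNR figure; the function name `snr_db` is hypothetical.

```python
import math


def snr_db(speech, noise):
    """Return 10 * log10(sum |speech|^2 / sum |noise|^2) over all bins."""
    p_speech = sum(abs(v) ** 2 for row in speech for v in row)
    p_noise = sum(abs(v) ** 2 for row in noise for v in row)
    if p_speech == 0 or p_noise == 0:
        raise ValueError("SNR in dB is undefined for zero speech or noise power")
    return 10.0 * math.log10(p_speech / p_noise)
```

A segment-wise variant would apply the same ratio per block of frames and then average, but the exact definition used in this embodiment is not stated here.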

<Effects>
With this configuration, the speech component and the noise component can be separated from an observation signal obtained by recording an utterance in a noisy environment with a microphone, so the SN ratio within the utterance interval can be estimated with high accuracy even in an environment with non-stationary noise. The resulting SN ratio estimate can be used in applications such as the following.
・Comparison of noise suppression performance between microphones: for example, by obtaining SN ratio estimates from observation signals of an utterance in a noisy environment recorded with two or more microphones that have a noise-canceling function, the noise suppression performance of the microphones can be compared.
・Comparison of speech recognition performance between speech recognition systems to which a microphone is connected: for example, an SN ratio estimate is obtained from an observation signal of an utterance in a noisy environment recorded with a microphone, speech recognition is performed by two or more speech recognition systems, and the speech recognition performance of each system relative to the SN ratio estimate can be compared using the SN ratio estimate and the recognition results.
・Comparison of the microphone's observation signal with the user-perceived recognition rate: for example, an SN ratio estimate is obtained from an observation signal of an utterance in a noisy environment recorded with a microphone, the user's perceived recognition rate for that observation signal is obtained, and the two can be compared.
・Comparison of the microphone's observation signal with the recognition performance of a speech recognition engine: for example, by performing speech recognition with one speech recognition engine on two or more observation signals with different SN ratio estimates, the recognition performance of that engine for each SN ratio estimate can be compared.

<Modification>
In this embodiment, the signal-to-noise ratio is the output of the device. Alternatively, the speech component estimate $\hat{a}_{\omega}\hat{s}_{\omega,\tau}$ and the noise component estimate $\hat{n}_{\omega,\tau}+\hat{d}_{\omega,\tau}$, which are the output values of the sound source separation unit 105, may be the output of the device, and the signal-to-noise ratio estimation unit 106 may be omitted. Such a device is called a sound source separation device. The SN ratio estimation device can therefore be said to include a sound source separation device.

In this embodiment, the sound source separation unit 105 uses the filter $G_{\omega,\tau}$ to separate from the observation signal $x_{\omega,\tau}$ at least the speech component estimate $\hat{a}_{\omega}\hat{s}_{\omega,\tau}$ and the noise component estimate $\hat{n}_{\omega,\tau}+\hat{d}_{\omega,\tau}$. However, estimating the SN ratio does not necessarily require separating the diffusive noise $d_{\omega,\tau}$ from the observation signal, so only $\hat{n}_{\omega,\tau}$ may be separated as the noise component estimate. In that case, the filter may be designed without taking the diffusive noise into account.

<Other modifications>
The present invention is not limited to the above embodiment and modifications. For example, the various processes described above need not be executed in time series in the order described; they may also be executed in parallel or individually, depending on the processing capability of the device that executes them or as necessary. Other changes may be made as appropriate without departing from the spirit of the present invention.

<Program and recording medium>
The various processing functions of each device described in the above embodiment and modifications may be implemented by a computer. In that case, the processing content of the functions that each device should have is described by a program. Executing this program on a computer implements the various processing functions of each device on the computer.

The program describing this processing content can be recorded on a computer-readable recording medium. Any computer-readable recording medium may be used, for example a magnetic recording device, an optical disc, a magneto-optical recording medium, or a semiconductor memory.

The program is distributed, for example, by selling, transferring, or lending a portable recording medium such as a DVD or CD-ROM on which the program is recorded. The program may also be distributed by storing it in a storage device of a server computer and transferring it from the server computer to other computers over a network.

A computer that executes such a program first stores, for example, the program recorded on the portable recording medium or the program transferred from the server computer in its own storage unit. When executing a process, the computer reads the program stored in its own storage unit and executes the process according to the program it has read. As another mode of execution, the computer may read the program directly from the portable recording medium and execute the process according to that program. Furthermore, each time a program is transferred from the server computer to this computer, the computer may execute the process according to the received program in sequence. The processing described above may also be executed by a so-called ASP (Application Service Provider) type service, in which the program is not transferred from the server computer to this computer and the processing functions are implemented only through execution instructions and the retrieval of results. The term "program" here includes information that is provided for processing by an electronic computer and is equivalent to a program (such as data that is not a direct command to the computer but has the property of defining the computer's processing).

Although each device is described as being configured by executing a predetermined program on a computer, at least part of this processing content may be implemented in hardware.

Claims

A sound source separation device that obtains a desired acoustic signal from an observation signal obtained by recording, with a microphone, a predetermined acoustic signal emitted from a speaker, wherein
the observation signal includes a first acoustic signal based on the predetermined acoustic signal and on a transfer function, which is a function expressing the spatial characteristics between the speaker and the microphone, a coherent noise acoustic signal, which is coherent noise, and a diffusive noise acoustic signal, which is diffusive noise,
the sound source separation device comprising:
a diffusive noise removal unit that removes an estimate of the diffusive noise acoustic signal from the observation signal to obtain a removed signal;
a filter design unit that obtains a filter by combining a probability distribution that models the removed signal with a probability distribution that models the transfer function; and
a sound source separation unit that uses the filter to separate, from the observation signal, at least the first acoustic signal and an estimate of a noise component that includes the coherent noise acoustic signal.

The sound source separation device according to claim 1, wherein
$\omega \in \{1, 2, \ldots, \Omega\}$ and $\tau \in \{1, 2, \ldots, \mathrm{T}\}$ are the frequency and time indices, respectively, the estimate of the transfer function is $\hat{A}_{\omega}$, the estimate of the coherent noise acoustic signal is $\hat{N}_{\omega,\tau}$, the predetermined acoustic signal is $S_{\omega,\tau}$, the removed signal is $Y_{\omega,\tau}$, $\mathbf{A} := [\hat{A}_{\omega}] \in \mathbb{R}^{\Omega}$, $\mathbf{N} := [\hat{N}_{\omega,\tau}] \in \mathbb{R}^{\Omega \times \mathrm{T}}$, $\mathbf{S} := [S_{\omega,\tau}] \in \mathbb{R}^{\Omega \times \mathrm{T}}$, $\mathbf{Y} := [Y_{\omega,\tau}] \in \mathbb{R}^{\Omega \times \mathrm{T}}$, and $\boldsymbol{\alpha} := [\alpha_{\omega}] \in \mathbb{R}^{\Omega}$,
the probability distribution that models the removed signal is a likelihood function $p(\mathbf{A},\mathbf{N}\mid\mathbf{S},\mathbf{Y})$ for the removed signal, the probability distribution that models the transfer function is a prior distribution $p(\mathbf{A}\mid\boldsymbol{\alpha})$ for the transfer function, and
the filter design unit estimates parameters so as to maximize the joint probability
$$L = p(\mathbf{A},\mathbf{N}\mid\mathbf{S},\mathbf{Y})\,p(\mathbf{A}\mid\boldsymbol{\alpha})$$
and obtains the filter from the estimated parameters.

The sound source separation device according to claim 2, wherein
the observation signal is $X_{\omega,\tau}$, the estimates of the bases of the amplitude spectra of the predetermined acoustic signal and the coherent noise acoustic signal are $\hat{W}^S_{\omega,r}$ and $\hat{W}^N_{\omega,k}$, respectively, the estimates of the intensities corresponding to those bases are $\hat{H}^S_{r,\tau}$ and $\hat{H}^N_{k,\tau}$, respectively, and the numbers of bases of the amplitude spectra of the predetermined acoustic signal and the coherent noise acoustic signal are $R$ and $K$, respectively, and
the filter design unit estimates the parameters so as to maximize the joint probability by updating $\lambda_{r,\omega,\tau}$, $\hat{W}^N_{\omega,\tau}$, $\hat{H}^N_{\omega,\tau}$, and $\hat{A}_{\omega}$ according to either of two alternative sets of update equations,
where $\top$ denotes transpose, $\mathbf{E}$ is an $\Omega \times \mathrm{T}$ matrix whose elements are all 1, matrix division is element-wise division, $\mathbf{Z} = [\mathbf{Z}^{(S)}, \mathbf{W}^{(N)}]$, $\mathbf{H} = [(\mathbf{H}^{(S)})^{\top}, (\mathbf{H}^{(N)})^{\top}]^{\top}$, $\mathbf{Z}^{(S)} := \{\hat{A}_{\omega}\hat{W}^S_{\omega,r}\} \in \mathbb{R}^{\Omega \times R}$, $\mathbf{W}^{(N)} := \{\hat{W}^N_{\omega,k}\} \in \mathbb{R}^{\Omega \times K}$, $\mathbf{H}^{(S)} := \{\hat{H}^S_{r,\tau}\} \in \mathbb{R}^{R \times \mathrm{T}}$, and $\mathbf{H}^{(N)} := \{\hat{H}^N_{k,\tau}\} \in \mathbb{R}^{K \times \mathrm{T}}$.

The sound source separation device according to claim 3, wherein
the estimate of the diffusive noise acoustic signal is $\hat{D}_{\omega,\tau}$, and the filter design unit repeats the update process until a predetermined condition is satisfied and obtains the filter from the parameters at the end of the updates.

A sound source separation method for obtaining a desired acoustic signal from an observation signal obtained by recording, with a microphone, a predetermined acoustic signal emitted from a speaker, wherein
the observation signal includes a first acoustic signal based on the predetermined acoustic signal and on a transfer function, which is a function expressing the spatial characteristics between the speaker and the microphone, a coherent noise acoustic signal, which is coherent noise, and a diffusive noise acoustic signal, which is diffusive noise,
the sound source separation method comprising:
a diffusive noise removal step of removing an estimate of the diffusive noise acoustic signal from the observation signal to obtain a removed signal;
a filter design step of obtaining a filter by combining a probability distribution that models the removed signal with a probability distribution that models the transfer function; and
a sound source separation step of using the filter to separate, from the observation signal, at least the first acoustic signal and an estimate of a noise component that includes the coherent noise acoustic signal.

The sound source separation method according to claim 5, wherein
$\omega \in \{1, 2, \ldots, \Omega\}$ and $\tau \in \{1, 2, \ldots, \mathrm{T}\}$ are the frequency and time indices, respectively, the estimate of the transfer function is $\hat{A}_{\omega}$, the estimate of the coherent noise acoustic signal is $\hat{N}_{\omega,\tau}$, the predetermined acoustic signal is $S_{\omega,\tau}$, the removed signal is $Y_{\omega,\tau}$, $\mathbf{A} := [\hat{A}_{\omega}] \in \mathbb{R}^{\Omega}$, $\mathbf{N} := [\hat{N}_{\omega,\tau}] \in \mathbb{R}^{\Omega \times \mathrm{T}}$, $\mathbf{S} := [S_{\omega,\tau}] \in \mathbb{R}^{\Omega \times \mathrm{T}}$, $\mathbf{Y} := [Y_{\omega,\tau}] \in \mathbb{R}^{\Omega \times \mathrm{T}}$, and $\boldsymbol{\alpha} := [\alpha_{\omega}] \in \mathbb{R}^{\Omega}$,
the probability distribution that models the removed signal is a likelihood function $p(\mathbf{A},\mathbf{N}\mid\mathbf{S},\mathbf{Y})$ for the removed signal, the probability distribution that models the transfer function is a prior distribution $p(\mathbf{A}\mid\boldsymbol{\alpha})$ for the transfer function, and
the filter design step estimates parameters so as to maximize the joint probability
$$L = p(\mathbf{A},\mathbf{N}\mid\mathbf{S},\mathbf{Y})\,p(\mathbf{A}\mid\boldsymbol{\alpha})$$
and obtains the filter from the estimated parameters.

The sound source separation method according to claim 6, wherein
the observation signal is $X_{\omega,\tau}$, the estimates of the bases of the amplitude spectra of the predetermined acoustic signal and the coherent noise acoustic signal are $\hat{W}^S_{\omega,r}$ and $\hat{W}^N_{\omega,k}$, respectively, the estimates of the intensities corresponding to those bases are $\hat{H}^S_{r,\tau}$ and $\hat{H}^N_{k,\tau}$, respectively, and the numbers of bases of the amplitude spectra of the predetermined acoustic signal and the coherent noise acoustic signal are $R$ and $K$, respectively, and
the filter design step estimates the parameters so as to maximize the joint probability by updating $\lambda_{r,\omega,\tau}$, $\hat{W}^N_{\omega,\tau}$, $\hat{H}^N_{\omega,\tau}$, and $\hat{A}_{\omega}$ according to either of two alternative sets of update equations,
where $\top$ denotes transpose, $\mathbf{E}$ is an $\Omega \times \mathrm{T}$ matrix whose elements are all 1, matrix division is element-wise division, $\mathbf{Z} = [\mathbf{Z}^{(S)}, \mathbf{W}^{(N)}]$, $\mathbf{H} = [(\mathbf{H}^{(S)})^{\top}, (\mathbf{H}^{(N)})^{\top}]^{\top}$, $\mathbf{Z}^{(S)} := \{\hat{A}_{\omega}\hat{W}^S_{\omega,r}\} \in \mathbb{R}^{\Omega \times R}$, $\mathbf{W}^{(N)} := \{\hat{W}^N_{\omega,k}\} \in \mathbb{R}^{\Omega \times K}$, $\mathbf{H}^{(S)} := \{\hat{H}^S_{r,\tau}\} \in \mathbb{R}^{R \times \mathrm{T}}$, and $\mathbf{H}^{(N)} := \{\hat{H}^N_{k,\tau}\} \in \mathbb{R}^{K \times \mathrm{T}}$.

A program for causing a computer to function as the sound source separation device according to any one of claims 1 to 4.