JP2013044908A

JP2013044908A - Background sound suppressor, background sound suppression method and program

Info

Publication number: JP2013044908A
Application number: JP2011182277A
Authority: JP
Inventors: Tomohiro Nakatani; 智広中谷; Akiko Araki; 章子荒木; Takuya Yoshioka; 拓也吉岡; Masakiyo Fujimoto; 雅清藤本; Marc Delcroix; マークデルクロア
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2011-08-24
Filing date: 2011-08-24
Publication date: 2013-03-04
Anticipated expiration: 2031-08-24
Also published as: JP5498452B2

Abstract

PROBLEM TO BE SOLVED: To provide a more efficient and more accurate background sound suppressor that reduces calculation cost and can use probability density functions of a more complex form.

SOLUTION: In the background sound suppressor 20 of the present invention, a feature quantity extraction unit 100 extracts a high-resolution sound source position feature quantity and a high-resolution spectral feature quantity from an observation signal; a sound source position occupancy estimation unit 200 obtains a high-resolution sound source position occupancy; a frequency resolution reduction unit 300 reduces the frequency resolution of the high-resolution spectral feature quantity and of the high-resolution sound source position occupancy; a low-resolution occupancy estimation unit 400 estimates a spectral parameter; a high-resolution occupancy re-estimation unit 510 estimates a high-resolution occupancy; and a target speech estimation unit 600 estimates the target speech.

Description

The present invention relates to a background sound suppression apparatus, a background sound suppression method, and a program that suppress background sound and estimate and extract the target speech from observation signals in which target speech and background sound are mixed and which are picked up by a plurality of microphones.

Some conventional background sound suppression apparatuses include a high-resolution spectral model storage unit, a feature extraction unit, a high-resolution occupancy estimation unit, and a target speech estimation unit (see, for example, Non-Patent Documents 1 and 2).

The conventional background sound suppression apparatus described in Non-Patent Documents 1 and 2 is described below. As noted above, it includes a high-resolution spectral model storage unit, a feature extraction unit, a high-resolution occupancy estimation unit, and a target speech estimation unit.

For each of the target speech and the background sound, the high-resolution spectral model storage unit stores a prior probability density function of a spectral parameter that represents the state of the entire time series of spectral feature quantities, together with a spectral feature quantity model, which is the posterior probability density function of each sound source signal (target speech or background sound) at each time-frequency point given that spectral parameter.

The feature extraction unit takes as input an observation signal obtained by converting the time-domain signals picked up by a plurality of microphones into time-frequency-domain signals, and extracts a high-resolution sound source position feature quantity and a high-resolution spectral feature quantity at each time-frequency point.

The high-resolution occupancy estimation unit takes as input the high-resolution sound source position feature quantity, the high-resolution spectral feature quantity, the high-resolution spectral feature quantity model, and the prior probability density function of the spectral parameter. It obtains an estimate of the high-resolution occupancy, which is the posterior probability density function of the index of the dominant sound source given the observation signal, together with an estimate of the spectral parameter.

Finally, the target speech estimation unit takes as input the estimates of the high-resolution occupancy and of the spectral parameter output by the high-resolution occupancy estimation unit, the high-resolution spectral feature quantity output by the feature extraction unit, and the high-resolution spectral feature quantity model stored in the high-resolution spectral model storage unit, and extracts an estimate of the target speech.

Non-Patent Document 1: Tomohiro Nakatani, Shoko Araki, Takuya Yoshioka, Masakiyo Fujimoto, "Multichannel source separation based on source location cue with log-spectral shaping by hidden Markov source model," Proc. of Interspeech-2010, pp. 2766-2769, Sep. 2010.

Non-Patent Document 2: Tomohiro Nakatani, Shoko Araki, Takuya Yoshioka, Masakiyo Fujimoto, "Sound source separation based on DOA clustering and a log-spectral HMM of speech," Proceedings of the Acoustical Society of Japan 2010 Autumn Meeting, pp. 577-580, Sep. 2010 (in Japanese).

However, the conventional background sound suppression apparatus runs an iterative procedure to estimate the spectral parameter and the high-resolution occupancy from the high-resolution sound source position feature quantity and the high-resolution spectral feature quantity, so its calculation cost grows as the dimension of each feature quantity grows. In particular, in a reverberant environment it is desirable to enlarge the analysis window so that the sound source position information extracted from the sound source position feature quantity is handled properly. Doing so increases the dimension of each feature quantity, and the increase in calculation cost is then unavoidable.

In addition, the conventional apparatus assumes that the sound source position feature quantity is observed from a point source in an environment with relatively little reverberation, so it can handle only simple probability density functions for that feature quantity, such as a single Gaussian distribution. Consequently, it cannot estimate the target speech properly when the reverberation is longer than the analysis window, or when the background sound is not a point source or consists of a plurality of sound sources.

The present invention has been made in view of these points. Its object is to provide a background sound suppression apparatus that keeps the calculation cost low even when the dimension of each feature quantity is large, and that estimates the target speech properly even when long reverberation is present or when the background sound is not a point source or consists of a plurality of sound sources.

The background sound suppression apparatus of the present invention suppresses background sound and extracts an estimate $\hat{S}^{(j)}_{n,k}$ of the target speech from an observation signal $x^{(m)}_{n,k}$ obtained by converting time-domain signals picked up by a plurality of microphones into time-frequency-domain signals. Here, $m$ denotes the microphone index, $n$ the frame index, $k$ the frequency bin index at high resolution, $\bar{k}$ the frequency bin index at low resolution, and $j$ the sound source index.

The low-resolution spectral model storage unit stores the prior probability density function $p(q^{(j)})$ of the spectral parameter of each sound source signal and the low-resolution spectral feature quantity model $\bar{\beta}_{q^{(j)},n,\bar{k}}(S)$ of each sound source signal. The high-resolution spectral model storage unit stores the high-resolution spectral feature quantity model $\beta_{q^{(j)},n,k}(S)$ of each sound source signal.

The feature extraction unit extracts the high-resolution sound source position feature quantity $A_{n,k}$ and the high-resolution spectral feature quantity $X_{n,k}$ from the observation signal $x^{(m)}_{n,k}$. The sound source position occupancy estimation unit obtains the sound source position parameter $\phi^{(j)}$ of each sound source signal from $A_{n,k}$, and then obtains the high-resolution sound source position occupancy $Q^{(j)}_{n,k}$ of each sound source signal from $A_{n,k}$ and $\phi^{(j)}$. The frequency resolution reduction unit obtains the low-resolution spectral feature quantity $\bar{X}_{n,\bar{k}}$ and the low-resolution sound source position occupancy $\bar{Q}^{(j)}_{n,\bar{k}}$ from $X_{n,k}$ and $Q^{(j)}_{n,k}$ by smoothing across neighboring frequencies.

The low-resolution occupancy estimation unit obtains the estimate $\hat{q}^{(j)}$ of the spectral parameter of each sound source signal from $\bar{X}_{n,\bar{k}}$, $\bar{Q}^{(j)}_{n,\bar{k}}$, the prior probability density function $p(q^{(j)})$, and the low-resolution spectral feature quantity model $\bar{\beta}_{q^{(j)},n,\bar{k}}(S)$, so as to maximize a log-likelihood function. The high-resolution occupancy re-estimation unit obtains the estimate $\hat{M}^{(j)}_{n,k}$ of the high-resolution occupancy from $X_{n,k}$, $\hat{q}^{(j)}$, $Q^{(j)}_{n,k}$, and the high-resolution spectral feature quantity model $\beta_{q^{(j)},n,k}(S)$. The target speech estimation unit obtains the estimate $\hat{S}^{(j)}_{n,k}$ of the target speech from $\hat{q}^{(j)}$, $\hat{M}^{(j)}_{n,k}$, $X_{n,k}$, and $\beta_{q^{(j)},n,k}(S)$.
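The frequency resolution reduction step can be illustrated with a small sketch. It smooths one frame of per-bin values (a spectral feature quantity or a sound source position occupancy) across neighboring bins and keeps every `step`-th bin. The triangular weights, the function names, and the edge handling are assumptions made for illustration only; the filter coefficients actually used by the apparatus are not given in this passage.

```python
def triangular_weights(width):
    """Symmetric triangular weights of odd length `width`, normalised to sum to 1.

    The triangular shape is a hypothetical choice for illustration.
    """
    half = width // 2
    raw = [half + 1 - abs(i - half) for i in range(width)]
    total = sum(raw)
    return [w / total for w in raw]


def reduce_resolution(frame, step, width):
    """Smooth one frame across neighbouring bins and keep every `step`-th bin.

    `frame` holds one value per high-resolution bin k; the result holds one
    value per low-resolution bin.
    """
    weights = triangular_weights(width)
    half = width // 2
    low = []
    for centre in range(0, len(frame), step):
        acc, norm = 0.0, 0.0
        for offset, w in enumerate(weights):
            k = centre + offset - half
            if 0 <= k < len(frame):  # skip filter taps that fall off the edge
                acc += w * frame[k]
                norm += w
        low.append(acc / norm)  # renormalise so edge bins stay unbiased
    return low
```

With `step=2`, an 8-bin frame becomes a 4-bin frame, so any iterative estimation that runs on the reduced frame touches half as many bins per iteration.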

The background sound suppression apparatus of the present invention keeps the calculation cost low even when the dimension of each feature quantity is large, and estimates the target speech properly even when long reverberation is present or when the background sound is not a point source or consists of a plurality of sound sources.

FIG. 1 is a block diagram showing the configuration of a conventional background sound suppression apparatus.
FIG. 2 is a flowchart showing the operation of the conventional background sound suppression apparatus.
FIG. 3 is a block diagram showing the configuration of the background sound suppression apparatus according to Embodiment 1.
FIG. 4 is a flowchart showing the operation of the background sound suppression apparatus according to Embodiment 1.
FIG. 5 shows an example of the filter coefficients used by the frequency resolution reduction unit.
FIG. 6 is a block diagram showing the configuration of the background sound suppression apparatus according to Embodiment 2.
FIG. 7 is a flowchart showing the operation of the background sound suppression apparatus according to Embodiment 2.
FIG. 8 is a block diagram showing the configuration of the background sound suppression apparatus according to Embodiment 3.
FIG. 9 is a flowchart showing the operation of the background sound suppression apparatus according to Embodiment 3.

Embodiments of the present invention are described in detail below. Components having the same function are given the same reference number, and duplicate description is omitted.

First, the symbols used in the description are explained. The target speech and the background sound are superimposed in the observation signal, and these sound source signals are picked up by $N_m$ microphones. The observation signal obtained by converting the acoustic signal picked up by the $m$-th microphone into a frequency-domain signal, for example by the short-time Fourier transform, is written $x^{(m)}_{n,k}$. Here $n$ is the frame index (the $n$-th time) and $k$ is the bin index (the $k$-th frequency); the time-frequency point corresponding to the $n$-th time and the $k$-th frequency is written $(n,k)$. The total number of frequency bins in each frame is written $N_k$. $j$ is the index of the $j$-th sound source signal, where $j=1$ denotes the target speech and $j=2$ the background sound. The notation used in the equations and the notation used in the text correspond as follows.
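As a concrete picture of the observation signal indexed by microphone, frame, and bin, the following sketch computes a short-time Fourier transform per microphone with a naive DFT. The Hann window, the frame length, the hop size, and the function names are assumptions chosen to keep the example small; the passage only states that a short-time Fourier transform or a similar transform is used.

```python
import cmath
import math


def stft(signal, frame_len, hop):
    """Return frames[n][k]: windowed DFT of frame n at bin k, k = 0..frame_len//2."""
    # Periodic Hann window (an assumed choice of analysis window).
    window = [0.5 - 0.5 * math.cos(2.0 * math.pi * t / frame_len)
              for t in range(frame_len)]
    n_bins = frame_len // 2 + 1
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        seg = [signal[start + t] * window[t] for t in range(frame_len)]
        frames.append([
            sum(seg[t] * cmath.exp(-2j * math.pi * k * t / frame_len)
                for t in range(frame_len))
            for k in range(n_bins)
        ])
    return frames


def observe(mic_signals, frame_len=8, hop=4):
    """Return x[m][n][k] for every microphone m (one STFT per microphone)."""
    return [stft(sig, frame_len, hop) for sig in mic_signals]
```

In this layout `len(x)` plays the role of $N_m$ and `len(x[0][0])` the role of $N_k$. A production implementation would use an FFT rather than the quadratic-time DFT shown here.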

<Description of the conventional example>
First, an outline of the operation of the conventional background sound suppression apparatus 10 is given with reference to FIGS. 1 and 2. FIG. 1 is a block diagram showing the configuration of the conventional background sound suppression apparatus 10. FIG. 2 is a flowchart showing its operation.

The conventional background sound suppression apparatus 10 models the joint probability density function of the entire spectral time series $\{S^{(j)}_{n,k}\}$ of the $j$-th sound source signal as shown in the following equation.

Here, $q^{(j)}$ is the spectral parameter representing the state of the entire spectral time series of the $j$-th sound source signal. Below, the $q^{(j)}$ of all sound source signals are also written collectively as $q = [q^{(1)}, q^{(2)}]$.

$\beta_{q^{(j)},n,k}(S)$ is the spectral feature quantity model. As expressed in Equation (3), it is the probability density function of the spectral value of the sound source signal at each time-frequency point $(n,k)$ taking the value $S$, given the spectral parameter $q^{(j)}$.

Equation (2) introduces the assumption that, once the spectral parameter is known, the spectral values $S^{(j)}_{n,k}$ at different time-frequency points are mutually independent.
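The independence assumption and the definition of the spectral feature quantity model can be written out as a sketch. The three lines below are one form consistent with the verbal description in this section (conditional independence across time-frequency points, the meaning of $\beta$, and the use of a prior on $q^{(j)}$); they are an assumed reading and are not claimed to match the numbered equations term by term.

```latex
\begin{align*}
p\bigl(\{S^{(j)}_{n,k}\} \mid q^{(j)}\bigr)
  &= \prod_{n,k} \beta_{q^{(j)},n,k}\bigl(S^{(j)}_{n,k}\bigr), \\
\beta_{q^{(j)},n,k}(S)
  &= p\bigl(S^{(j)}_{n,k} = S \mid q^{(j)}\bigr), \\
p\bigl(\{S^{(j)}_{n,k}\},\, q^{(j)}\bigr)
  &= p\bigl(q^{(j)}\bigr) \prod_{n,k} \beta_{q^{(j)},n,k}\bigl(S^{(j)}_{n,k}\bigr).
\end{align*}
```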

Further, as shown in Equation (4), the conventional example assumes that the spectral value $S^{(j)}_{n,k}$ of the sound source signal with the largest energy at each time-frequency point $(n,k)$ (hereinafter, the dominant sound source signal) equals the spectral value of the observation signal.

For a non-dominant sound source $j$, the relation $S^{(j)}_{n,k} \le X_{n,k}$ is assumed to hold. Under the condition that the spectral parameter of each sound source signal is known, the posterior probability density function of the high-resolution spectral feature quantity $X_{n,k}$ of the observation signal is then known to be expressible as follows (for details, see S. J. Rennie, J. R. Hershey, and P. A. Olsen, "Hierarchical variational loopy belief propagation for multi-talker speech recognition," Proc. ASRU-2009, pp. 176-181, 2009).

The conventional example further assumes that the above expression can be decomposed as follows.

$Z_{n,k}$ is a random variable representing the index of the dominant sound source at the time-frequency point $(n,k)$; $Z_{n,k} = j$ means that the $j$-th sound source is the dominant sound source.
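The two assumptions above (the dominant source equals the observation, every other source lies at or below it) lead to a density for the observed value that has one term per possible dominant source. The sketch below evaluates that density in its standard max-model form for scalar Gaussian source models. The Gaussian choice, the function names, and the `(mean, std)` parameterisation are assumptions for illustration; the exact decomposition used in the conventional example is not reproduced in this passage.

```python
import math


def gauss_pdf(x, mean, std):
    """Density of a scalar Gaussian."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def gauss_cdf(x, mean, std):
    """Cumulative distribution of a scalar Gaussian."""
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))


def dominance_terms(x, models):
    """Term j is the joint density of X = x and Z = j.

    Source j takes the value x (its pdf) while every other source
    lies at or below x (their cdfs).
    """
    terms = []
    for j, (mean_j, std_j) in enumerate(models):
        term = gauss_pdf(x, mean_j, std_j)
        for i, (mean_i, std_i) in enumerate(models):
            if i != j:
                term *= gauss_cdf(x, mean_i, std_i)
        terms.append(term)
    return terms


def observation_density(x, models):
    """Density of the observed value: sum over the possible dominant sources."""
    return sum(dominance_terms(x, models))
```

Normalising `dominance_terms(x, models)` by `observation_density(x, models)` gives the probability that each source is dominant given only the spectral observation, which is the spectral half of an occupancy.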

To estimate the sound source position parameter $\phi^{(j)}$ from the high-resolution sound source position feature quantity $A_{n,k}$, the conventional background sound suppression apparatus 10 also introduces a model $p(A_{n,k};\phi)$ of the high-resolution sound source position feature quantity. The model assumes that the energy of each sound source signal is sparsely distributed across time-frequency points, and that the feature quantity at a given time-frequency point is determined only by the position of the sound source that is dominant there. Writing the sound source position parameters $\phi^{(j)}$ of all sound sources collectively as $\phi = [\phi^{(1)}, \phi^{(2)}]$, the model $p(A_{n,k};\phi)$ for the observation signal, that is, the probability density function of the high-resolution sound source position feature quantity of the observation signal, can be expanded as a mixture distribution as shown in Equation (8).

In Equation (8), $p(Z_{n,k}=j)$ is the prior probability density function of the $j$-th sound source being the dominant sound source at the time-frequency point $(n,k)$. The following notation is also used in the rest of the description.

$\gamma_{\phi^{(j)},n,k}(A)$ is the probability density function of obtaining the high-resolution sound source position feature quantity $A_{n,k}$ when the index of the dominant sound source at the time-frequency point $(n,k)$ is $j$. It is assumed to depend only on the sound source position parameter $\phi^{(j)}$ of the $j$-th sound source. The specific definitions of $\gamma_{\phi^{(j)},k}(A)$ and $\phi^{(j)}$ are given later.

Under Equation (8), when $\gamma_{\phi^{(j)},k}(A)$ is defined, the model $p(A_{n,k};\phi)$ of the sound source position feature quantity is uniquely determined once the sound source position parameter $\phi^{(j)}$ and the prior probability density function $p(Z_{n,k}=j)$ of the dominant sound source index are given. Conversely, when the sound source position feature quantity $A_{n,k}$ is observed, the sound source position parameter, the prior probability density function $p(Z_{n,k}=j)$ of the dominant sound source index, and its posterior probability density function can be estimated by a method such as maximum likelihood estimation.
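The mixture structure described for the sound source position feature quantity can be sketched as follows: the density of a feature value is the prior-weighted sum of per-source densities, and the posterior over the dominant source index follows from Bayes' rule. A scalar feature and a single Gaussian per source are assumed here purely for illustration (the text notes that the conventional example is limited to such simple densities); the names `mixture_density` and `position_posterior` are hypothetical.

```python
import math


def gauss_pdf(a, mean, std):
    """Density of a scalar Gaussian, standing in for the per-source density."""
    z = (a - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def mixture_density(a, priors, components):
    """Mixture density: sum over j of prior_j times the density of source j at a."""
    return sum(p * gauss_pdf(a, mean, std)
               for p, (mean, std) in zip(priors, components))


def position_posterior(a, priors, components):
    """Posterior probability that each source is dominant, given the feature a."""
    joint = [p * gauss_pdf(a, mean, std)
             for p, (mean, std) in zip(priors, components)]
    total = sum(joint)
    return [v / total for v in joint]
```

For example, `position_posterior(0.1, [0.5, 0.5], [(0.0, 0.3), (2.0, 0.3)])` assigns almost all of the probability to the first source, because the feature value lies close to that source's mean.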

From these definitions, the probability density function of the complete data is derived as shown in Equation (10).

In Equation (10), $q$ is the spectral parameter and $\phi$ is the sound source position parameter. The conventional example estimates the spectral parameter $q$ and the sound source position parameter $\phi$ as the values that maximize the following log-likelihood function.

In Equation (12), the random variable $Z_{n,k}$ is treated as a hidden variable. A log-likelihood function containing hidden variables can be maximized with, for example, the expectation-maximization algorithm. In that algorithm, the posterior probability density function of the dominant sound source index given the observation signal, $\hat{M}^{(j)}_{n,k} = p(Z_{n,k}=j \mid A_{n,k}, X_{n,k}, \hat{q}; \hat{\phi})$, must be estimated at the same time, based on the estimate $\hat{q}$ of the spectral parameter. The conventional example calls the value of this function the high-resolution occupancy and counts it among the parameters to be estimated.

The units are described below in the order in which the procedure is actually carried out. The conventional background sound suppression apparatus 10 includes a feature extraction unit 100, a high-resolution occupancy estimation unit 500, a target speech estimation unit 600, and a high-resolution spectral model storage unit 800.

The high-resolution spectral model storage unit 800 stores the prior probability density function $p(q^{(j)})$ of the spectral parameter $q^{(j)}$, which represents the state of the entire spectral time series of the target speech or of the background sound, and the high-resolution spectral feature quantity model $\beta_{q^{(j)},n,k}(S)$ of each sound source signal at each time-frequency point given that spectral parameter. Here $S$ is a variable standing for the sound source power feature quantity $X_{n,k}$. The prior probability density function $p(q^{(j)})$ and the model $\beta_{q^{(j)},n,k}(S)$ are assumed to be given by prior training for each of the target speech and the background sound.

The feature extraction unit 100 takes as input the observation signal $x^{(m)}_{n,k}$ obtained by converting the time-domain signals picked up by a plurality ($N_m$) of microphones into time-frequency-domain signals, and extracts the high-resolution sound source position feature quantity $A_{n,k}$ and the high-resolution spectral feature quantity $X_{n,k}$ at each time-frequency point $(n,k)$ (S101, S102).

The high-resolution spectral feature quantity $X_{n,k}$ is extracted, for example, as the log power spectrum of the signal picked up by the first microphone. It is calculated as shown in Equation (13).
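A minimal sketch of this feature, assuming the observation is stored as `x[m][n][k]` with the first microphone at index 0. The natural logarithm and the small floor that guards against taking the logarithm of zero are assumptions added for illustration and are not stated in the text.

```python
import math


def log_power(value, floor=1e-12):
    """Log power of one complex STFT coefficient (natural log, floored)."""
    return math.log(max(abs(value) ** 2, floor))


def spectral_features(x):
    """X[n][k]: log power spectrum of the first microphone of x[m][n][k]."""
    first_mic = x[0]
    return [[log_power(v) for v in frame] for frame in first_mic]
```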

The high-resolution sound source position feature quantity $A_{n,k}$ generally manifests itself in the phase differences and intensity ratios of the signals between different microphones at each time-frequency point. It is therefore extracted as a vector that collects the phase differences and intensity ratios of each microphone pair, or as a value obtained by further feature extraction from such a vector. For example, when the phase difference between the signals picked up by two microphones is extracted as $A_{n,k}$, it is calculated as shown in Equation (14).
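For the two-microphone case, an inter-microphone phase difference can be computed as below. The sign convention (second microphone relative to the first) and the function names are assumptions; the text states only that a phase difference between the two signals is used.

```python
import cmath


def phase_difference(x1, x2):
    """Phase of x2 relative to x1, wrapped to the interval (-pi, pi]."""
    return cmath.phase(x2 * x1.conjugate())


def position_features(x):
    """A[n][k] for a two-microphone observation x[m][n][k]."""
    mic1, mic2 = x[0], x[1]
    return [[phase_difference(a, b) for a, b in zip(f1, f2)]
            for f1, f2 in zip(mic1, mic2)]
```

Multiplying by the conjugate instead of subtracting two separately computed phases keeps the result wrapped to a single period without extra bookkeeping.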

Other quantities can also serve as the sound source position feature quantity, for example the normalized complex spectrum vector calculated as shown in Equation (14') (for details, see Hiroshi Sawada, Shoko Araki, and Shoji Makino, "Underdetermined convolutive blind source separation via frequency bin-wise clustering and permutation alignment," IEEE Trans. Audio, Speech, and Language Processing, vol. 19, no. 3, pp. 516-527, 2011; hereinafter, Reference 1).
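One simple reading of a normalized complex spectrum vector is the vector of all microphone coefficients at a time-frequency point scaled to unit Euclidean norm, which removes the source's amplitude and keeps the inter-microphone phase and level relations. The sketch below implements only that unit-norm scaling. Whether Equation (14') applies further steps, such as whitening or phase referencing, is not stated in this passage, so treat this as an assumption.

```python
import math


def normalized_vector(x_bin):
    """Scale the per-microphone coefficients at one (n, k) to unit norm.

    `x_bin` is the list [x^(1), ..., x^(N_m)] at a single time-frequency point.
    An all-zero input is returned unchanged to avoid dividing by zero.
    """
    norm = math.sqrt(sum(abs(v) ** 2 for v in x_bin))
    if norm == 0.0:
        return list(x_bin)
    return [v / norm for v in x_bin]
```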

In the rest of this specification, the configuration of the invention is described using the high-resolution sound source position feature quantity $A_{n,k}$ of Equation (14'). For the configuration that uses Equation (14), see Non-Patent Documents 1 and 2.

The high-resolution occupancy estimation unit 500 takes as input the high-resolution sound source position feature quantity $A_{n,k}$ and the high-resolution spectral feature quantity $X_{n,k}$ output by the feature extraction unit 100, together with the prior probability density function $p(q^{(j)})$ of the spectral parameter and the spectral feature quantity model $\beta_{q^{(j)},n,k}(S)$ stored in the high-resolution spectral model storage unit 800, and estimates the high-resolution occupancy $\hat{M}^{(j)}_{n,k}$ of each sound source signal.

First, the high-resolution occupancy estimation unit 500 initializes the high-resolution occupancy $\hat{M}^{(j)}_{n,k}$ of each sound source $j$, for example with random numbers, so that $\sum_j \hat{M}^{(j)}_{n,k} = 1$. It then repeats the following processes (1) to (3) until convergence.

(1) Update of the spectral parameter (S501)
Using the high-resolution spectral feature quantity $X_{n,k}$, the high-resolution occupancy $\hat{M}^{(j)}_{n,k}$, the prior probability density function $p(q^{(j)})$ of the spectral parameter, and the high-resolution spectral feature quantity model $\beta_{q^{(j)},n,k}(S)$, the estimate $\hat{q}^{(j)}$ of the spectral parameter is updated as shown in Equation (15) (M-step).

(2) Update of the sound source position parameter (S502)
Using the high-resolution occupancy $\hat{M}^{(j)}_{n,k}$ and the high-resolution sound source position feature quantity $A_{n,k}$, the sound source position parameter $\hat{\phi}^{(j)}$ is updated as shown in Equation (17) (M-step).

(3) Update of the high-resolution occupancy (S503)
Using the spectral parameter $\hat{q}^{(j)}$, the high-resolution sound source position feature quantity $A_{n,k}$, the high-resolution spectral feature quantity $X_{n,k}$, and the high-resolution spectral feature quantity model $\beta_{q^{(j)},n,k}(S)$, the high-resolution occupancy $\hat{M}^{(j)}_{n,k}$ is updated as shown in Equation (18) (E-step).
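The control flow of steps (1) to (3) (random occupancies that sum to one, parameter updates in the M-step, occupancy update in the E-step, repeated) can be seen in a deliberately small stand-in. The sketch below runs expectation-maximization on a one-dimensional Gaussian mixture with a fixed standard deviation. It is a toy model chosen for illustration and does not implement Equations (15), (17), or (18); the function name, the fixed iteration count in place of a convergence test, and the seed are all assumptions.

```python
import math
import random


def gauss_pdf(x, mean, std):
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def em_occupancy(data, n_sources=2, std=1.0, n_iter=50, seed=0):
    """Toy EM: returns (means, occupancies) for a 1-D Gaussian mixture."""
    rng = random.Random(seed)
    # Initialisation: random occupancies, normalised to sum to one per point.
    occ = []
    for _ in data:
        raw = [rng.random() + 1e-3 for _ in range(n_sources)]
        total = sum(raw)
        occ.append([r / total for r in raw])
    means = [0.0] * n_sources
    for _ in range(n_iter):
        # M-step: occupancy-weighted parameter updates.
        weights = [sum(o[j] for o in occ) for j in range(n_sources)]
        means = [sum(o[j] * x for o, x in zip(occ, data)) / weights[j]
                 for j in range(n_sources)]
        priors = [w / len(data) for w in weights]
        # E-step: posterior probability of each source at each data point.
        new_occ = []
        for x in data:
            joint = [priors[j] * gauss_pdf(x, means[j], std)
                     for j in range(n_sources)]
            total = sum(joint)
            new_occ.append([v / total for v in joint])
        occ = new_occ
    return means, occ
```

Every pass of the loop visits every data point once per source, which is why the cost of the iteration scales with the number of points processed. In the apparatus, that number is the count of time-frequency points.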

The target speech estimation unit 600 takes as input the high-resolution spectral feature quantity $X_{n,k}$, the high-resolution occupancy $\hat{M}^{(j)}_{n,k}$, the spectral parameter $\hat{q}^{(j)}_{n}$, and the high-resolution spectral feature quantity model $\beta_{q^{(j)},n,k}(S)$, and obtains the estimate $\hat{S}^{(j)}_{n,k}$ of the target speech by least-squares error estimation (S600). The estimate is computed by the following equation.
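Under the assumptions stated earlier (a dominant source equals the observation, a non-dominant source lies at or below it), one way to form a least-squares estimate is the posterior mean: the observation weighted by the occupancy, plus the expected value of the source truncated at the observation weighted by the remainder. The sketch below works this out for a scalar Gaussian source model. The Gaussian model, the function names, and the underflow guard are assumptions for illustration and are not claimed to reproduce the apparatus's own estimation equation.

```python
import math


def truncated_mean_below(x, mean, std):
    """Mean of a Gaussian(mean, std) conditioned on being at most x."""
    z = (x - mean) / std
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    if cdf <= 0.0:
        # The cdf underflowed; the conditional mean approaches x from below.
        return x
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    return mean - std * pdf / cdf


def target_estimate(x, occupancy, mean, std):
    """Posterior-mean estimate of the target's spectral value at one (n, k).

    occupancy: probability that the target is the dominant source there.
    mean, std: hypothetical Gaussian model of the target at that point.
    """
    return occupancy * x + (1.0 - occupancy) * truncated_mean_below(x, mean, std)
```

When the occupancy is 1 the estimate is the observation itself; as the occupancy falls toward 0 the estimate moves toward what the target's own model predicts below the observed level.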

<Problems of the conventional example>
In the conventional background sound suppression apparatus 10, the high-resolution occupancy estimation unit 500 repeatedly executes Equations (15), (17), and (18) to update the spectral parameter $\hat{q}^{(j)}$, the sound source position parameter $\hat{\phi}^{(j)}$, and the high-resolution occupancy $\hat{M}^{(j)}_{n,k}$. The calculation cost therefore grows as the dimensions of the high-resolution spectral feature quantity $X_{n,k}$ and the high-resolution sound source position feature quantity $A_{n,k}$ grow, that is, as the total number $N_k$ of frequency bins in each frame increases.

Moreover, in the conventional example, the probability density function γ_(φ^(j),k)(A) of the high-resolution sound source position feature of each source could only take a simple form, such as a single Gaussian distribution. It could therefore express only the statistical properties of the position feature of a point source with relatively little reverberation. When the reverberation is long, or when the background sound contains several point sources or sources that are not point sources, the high-resolution sound source position features of the target speech and the background sound could not be represented appropriately.

<Overview of the present invention>
In the first embodiment, the iterative estimation of the spectral parameter ^q^(j), which drove up the computational cost in the conventional example, is carried out in a space with reduced frequency resolution. This allows the spectral parameter ^q^(j) to be estimated efficiently. The high-resolution occupancy can then be estimated from the estimated spectral parameter ^q^(j), the high-resolution spectral model, and the high-resolution sound source position occupancy. Because this last step requires no iteration, it does not add significant computational cost. As a result, both the spectral parameter ^q^(j) and the high-resolution occupancy are obtained efficiently. This configuration still needs an iterative process to estimate the high-resolution sound source position occupancy, but that process costs less than the iterative high-resolution occupancy estimation of the conventional example.

The second embodiment additionally includes a model of the high-resolution sound source position feature that is learned in advance, so the high-resolution sound source position occupancy can be estimated without any iterative processing. Background sound suppression therefore becomes more efficient. Furthermore, when the model of the high-resolution sound source position feature is learned in advance, a more complex probability density function can be used for that feature. Even when the signal contains long reverberation, or when the background sound does not consist of a single point source, the sound source position features of the target speech and the background sound can be distinguished more appropriately.

In the third embodiment, the frequency resolution is not reduced, and a model of the high-resolution sound source position feature learned in advance is provided. Because the sound source position model stored in the high-resolution sound source position model storage unit can be used, the model parameters of the sound source positions of the target speech and the background sound need not be estimated, which keeps the computational cost low. In addition, a model with a more complex distribution shape, such as a mixture distribution, can be used as the stored sound source position model, so background sound can be suppressed appropriately even in reverberant environments or in environments where the background sound contains several sounds.

Next, the operation of the background sound suppression apparatus 20 according to the first embodiment of the present invention is described in detail with reference to FIGS. 3 and 4. FIG. 3 is a block diagram showing the configuration of the background sound suppression apparatus 20 according to the first embodiment. FIG. 4 is a flowchart showing the operation of the background sound suppression apparatus 20 according to the first embodiment.

The description below follows the order in which the procedures are actually performed. The background sound suppression apparatus 20 of this embodiment includes a feature extraction unit 100, a sound source position occupancy estimation unit 200, a frequency resolution reduction unit 300, a low-resolution occupancy estimation unit 400, a high-resolution occupancy re-estimation unit 510, a target speech estimation unit 600, a low-resolution spectral model storage unit 700, and a high-resolution spectral model storage unit 810.

The low-resolution spectral model storage unit 700 stores the prior probability density function p(q^(j)) of the spectral parameter q^(j), which represents the state of the entire spectral time series of the target speech or the background sound, together with the model β￣_(q^(j),n,k￣)(S) of the low-resolution spectral feature at each time-frequency point of each source signal given that spectral parameter. Here (S) is a variable representing the low-resolution spectral feature X￣_(n,k￣). The joint probability density function of the whole time series {S￣^(j)_(n,k￣)} of the low-resolution spectral feature of the j-th source signal is modeled as shown in equations (1'), (2'), and (3').

Further, the spectral parameter q^(j) is decomposed into a state sequence q^(j) = {q^(j)_(0), q^(j)_(1), …} that represents the state at each time, and state transitions are assumed to occur at each time according to a first-order Markov process. Here q^(j)_(0) represents the initial state of the hidden Markov model. The posterior probability density function of S￣^(j)_(n,k￣) at each time-frequency point (n,k￣), defined by equation (3'), is assumed to follow a Gaussian distribution that depends only on the state q^(j)_(n) at that time. These assumptions are expressed by equations (20) and (21).

Here, π^(j)_(i) = p(q^(j)_(0) = i) is the prior probability that the initial state of the hidden Markov model is i. α^(j)_(i,h) = p(q^(j)_(n) = h | q^(j)_(n-1) = i) is the probability of a transition from state i to state h. β￣_(i,n,k￣)(S) = p(S￣^(j)_(n,k￣) = S | q^(j)_(n) = i) = N(S￣^(j)_(n,k￣); μ￣^(j)_(i,k￣), σ￣^(j)_(i,k￣)) is the output probability density function of the hidden Markov model in state i, and μ￣^(j)_(i,k￣) and σ￣^(j)_(i,k￣) are its mean and variance. In this embodiment, π^(j)_(i), α^(j)_(i,h), μ￣^(j)_(i,k￣), and σ￣^(j)_(i,k￣) for all h, i, j, k are assumed to have been obtained in advance by training on a speech database or the like.
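The equation images for (20) and (21) are not reproduced in this text. The following is a reconstruction from the definitions of π, α, and β￣ given above, offered as a sketch rather than as the exact published form; the upper limit N_s is assumed from the length of the state sequence used later in the description.

```latex
p\bigl(q^{(j)}\bigr)
  = \pi^{(j)}_{q^{(j)}_{0}}
    \prod_{n=1}^{N_s} \alpha^{(j)}_{q^{(j)}_{n-1},\, q^{(j)}_{n}},
\qquad
\bar{\beta}_{i,n,\bar{k}}(S)
  = \mathcal{N}\bigl(S;\ \bar{\mu}^{(j)}_{i,\bar{k}},\ \bar{\sigma}^{(j)}_{i,\bar{k}}\bigr)
```

The first factor is the usual first-order Markov chain prior; the second is the per-state Gaussian output density for one low-resolution frequency band.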

The high-resolution spectral model storage unit 810 stores the model β_(q^(j),n,k)(S) of the high-resolution spectral feature at each time-frequency point of each source signal, given the spectral parameter q^(j).

The feature extraction unit 100 receives the observation signal x^(m)_(n,k) and, based on equation (13), extracts the logarithmic power spectrum as the high-resolution spectral feature X_(n,k) (S101). It also extracts, based on equation (14'), the normalized complex spectrum as the high-resolution sound source position feature A_(n,k) (S102).
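Equations (13) and (14') are not reproduced in this text, so the following Python sketch only illustrates the two kinds of feature named above for a single time-frequency point. The function name `extract_features`, the use of one reference microphone for the log power, the small `floor` constant, and the plain unit-norm scaling are all assumptions made for illustration.

```python
import math

def extract_features(x_mics, ref=0, floor=1e-12):
    """x_mics: complex STFT coefficients of all microphones at one point (n, k).

    Returns (X, A): a log power value and a unit-norm complex vector.
    """
    # Log power spectrum of the reference microphone (which microphone is
    # used is an assumption; `floor` only avoids log(0)).
    X = math.log(abs(x_mics[ref]) ** 2 + floor)
    # Observation vector scaled to unit Euclidean norm (assumed normalization).
    norm = math.sqrt(sum(abs(v) ** 2 for v in x_mics))
    A = [v / norm if norm > 0 else 0j for v in x_mics]
    return X, A
```

For two microphones with coefficients `3+0j` and `4j`, the vector norm is 5, so `A` becomes `[0.6, 0.8j]` and `X` is approximately log 9.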

The sound source position occupancy estimation unit 200 receives the high-resolution sound source position feature A_(n,k) and estimates the sound source position parameter φ^(j) (S201). For this estimation, the method described in Reference 1 or in Tomohiro Nakatani, Shoko Araki, Takuya Yoshioka, Masakiyo Fujimoto, "Joint unsupervised learning of hidden Markov source models and source location models for multi-channel source separation," Proc. of IEEE ICASSP-2011, pp. 237-240, 2011 (hereinafter, Reference 2) can be used. To this end, this embodiment assumes that the normalized complex spectrum of the observation signal originating from each source signal follows the distribution below, with a mean μ^(j)_(k) and a variance σ^(j)_(k) that differ for each frequency.

Here, φ^(j)_(k) = [μ^(j)_(k), σ^(j)_(k)] is the part of the sound source position parameter φ^(j) that relates only to frequency k, and φ^(j) collects φ^(j)_(k) over all frequencies k: φ^(j) = [φ^(j)_(1), …, φ^(j)_(N_k)]. Based on this assumption, in this embodiment the probability density function of the high-resolution sound source position feature of the observation signal x^(j)_(n,k) is modeled by equations (8), (9), and (19).

Next, based on the estimated sound source position parameter φ^(j), the sound source position occupancy estimation unit 200 estimates the high-resolution sound source position occupancy Q^(j)_(n,k) as follows (S202).

The frequency resolution reduction unit 300 receives the high-resolution spectral feature X_(n,k) output by the feature extraction unit 100 and the high-resolution sound source position occupancy Q_(n,k) output by the sound source position occupancy estimation unit 200. By smoothing across neighboring frequencies, it converts them into the low-resolution spectral feature X￣_(n,k￣) and the low-resolution sound source position occupancy Q￣_(n,k￣).

To reduce the frequency resolution of the high-resolution spectral feature X_(n,k), filter bank processing of the kind often used for feature extraction in speech recognition can be used, for example. Let F_(k￣) = [F_(k￣,1), F_(k￣,2), …, F_(k￣,N_k)] be the filter coefficients that produce the k￣-th output of the filter bank. The conversion from the high-resolution spectral feature X_(n,k) to the low-resolution spectral feature X￣_(n,k￣) is then computed with the filter coefficients F_(k￣) as follows (S301).

Here, k￣ denotes the frequency index of the low-resolution spectral feature X￣_(n,k￣), and k￣ ≤ k.

Next, the frequency resolution reduction unit 300 converts the high-resolution sound source position occupancy Q_(n,k) into the low-resolution sound source position occupancy Q￣_(n,k￣) using the same filter coefficients F_(k￣), as follows (S302).

FIG. 5 shows an example of the filter coefficients F_(k￣).
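The conversion equations for S301 and S302 are not reproduced in this text. As an illustration only, the Python sketch below treats the reduction as a weighted sum over frequency bins with one row of coefficients per low-resolution band. The triangular, evenly spaced, sum-to-one filters and the names `triangular_filters` and `reduce_resolution` are assumptions; the actual coefficients are those of FIG. 5.

```python
def triangular_filters(n_bins, n_bands):
    """Hypothetical filter bank: evenly spaced triangles, each row sums to 1."""
    width = (n_bins - 1) / (n_bands + 1)
    filters = []
    for band in range(n_bands):
        center = (band + 1) * width
        row = [max(0.0, 1.0 - abs(k - center) / width) for k in range(n_bins)]
        total = sum(row)
        filters.append([w / total for w in row])
    return filters

def reduce_resolution(values, filters):
    """Weighted sum over high-resolution bins for each low-resolution band.

    `values` holds one frame of X_(n,k) or of Q_(n,k); the same filters are
    applied to both, as the description states.
    """
    return [sum(f * v for f, v in zip(row, values)) for row in filters]
```

With rows that sum to one, a frame that is constant across frequency keeps the same value in every band, which keeps smoothed occupancies in the range 0 to 1.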

The low-resolution occupancy estimation unit 400 receives the low-resolution spectral feature X￣_(n,k￣) and the low-resolution sound source position occupancy Q￣_(n,k￣) and, following the expectation-maximization algorithm, obtains the spectral parameter estimate ^q^(j) and the low-resolution occupancy estimate ^M￣^(j)_(n,k￣). To do so, steps (1) and (2) below are repeated until convergence.

(1) Update of the spectral parameter estimate (S401)
For each source j, the spectral parameter estimate ^q^(j) = [^q^(j)_(0), …, ^q^(j)_(N_s)] that satisfies equation (22) is updated using the Viterbi algorithm.
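Equation (22) is not reproduced in this text, so the objective maximized in this step cannot be shown exactly. The sketch below is a generic log-domain Viterbi decoder in Python, given only to illustrate the kind of search the step refers to. The function name and argument layout are hypothetical, and applying the initial-state probability to the first frame is a simplifying assumption. How the per-frame scores in `log_emit` are formed from the low-resolution features and occupancies is left to equation (22).

```python
import math

def viterbi(log_init, log_trans, log_emit):
    """Most likely state path of a hidden Markov model.

    log_init[i]     : log prior of starting in state i
    log_trans[i][h] : log probability of moving from state i to state h
    log_emit[n][i]  : log score of frame n under state i
    """
    n_states = len(log_init)
    score = [log_init[i] + log_emit[0][i] for i in range(n_states)]
    back = []
    for n in range(1, len(log_emit)):
        new_score, ptr = [], []
        for h in range(n_states):
            # Best predecessor of state h at frame n.
            best = max(range(n_states), key=lambda i: score[i] + log_trans[i][h])
            ptr.append(best)
            new_score.append(score[best] + log_trans[best][h] + log_emit[n][h])
        score = new_score
        back.append(ptr)
    # Trace back from the best final state.
    state = max(range(n_states), key=lambda i: score[i])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    path.reverse()
    return path
```

The cost of one pass is proportional to the number of frames times the square of the number of states, and each `log_emit` entry sums over frequency bands, which is where a smaller number of bands reduces the work per iteration.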

(2) Update of the low-resolution occupancy (S402)
The low-resolution occupancy M￣^(j)_(n,k￣) is updated as shown in equation (32) (E-step).

The spectral parameter estimate ^q^(j) obtained by repeating (1) and (2) above is the output of the low-resolution occupancy estimation unit 400.

The high-resolution occupancy re-estimation unit 510 receives the high-resolution spectral feature X_(n,k), the high-resolution sound source position occupancy Q_(n,k), the spectral parameter estimate ^q^(j), and the spectral feature model β_(q^(j),n,k)(S), and obtains the high-resolution occupancy estimate ^M^(j)_(n,k) according to equation (32') (S510).

The target speech estimation unit 600 receives the high-resolution spectral feature X_(n,k), the high-resolution occupancy estimate ^M^(j)_(n,k), the spectral parameter estimate ^q^(j), and the high-resolution spectral feature model β_(q^(j),n,k)(S). Based on equation (36), which is the same as in the conventional example, it obtains the target speech estimate ^S^(j)_(n,k), in which the background sound has been suppressed from the observation signal (S600).

As described above, the background sound suppression apparatus 20 of this embodiment carries out the iterative estimation of the spectral parameter ^q^(j), which drove up the computational cost in the conventional example, in a space with reduced frequency resolution. This allows the spectral parameter ^q^(j) to be estimated efficiently. The high-resolution occupancy M^(j)_(n,k) can then be estimated from the estimated spectral parameter ^q^(j), the high-resolution spectral feature model β_(q^(j),n,k)(S), and the high-resolution sound source position occupancy Q_(n,k). Because this last step requires no iteration, it does not add significant computational cost. As a result, both the spectral parameter ^q^(j) and the high-resolution occupancy M^(j)_(n,k) are obtained efficiently. This configuration still needs an iterative process to estimate the high-resolution sound source position occupancy Q_(n,k), but that process costs less than the iterative high-resolution occupancy estimation of the conventional example.

Next, the operation of the background sound suppression apparatus 30 according to the second embodiment of the present invention is described in detail with reference to FIGS. 6 and 7. FIG. 6 is a block diagram showing the configuration of the background sound suppression apparatus 30 according to the second embodiment. FIG. 7 is a flowchart showing the operation of the background sound suppression apparatus 30 according to the second embodiment. The description below focuses on the differences from the first embodiment and omits matters common to both.

The background sound suppression apparatus 30 of this embodiment includes a feature extraction unit 100, a sound source position occupancy estimation unit 210, a frequency resolution reduction unit 300, a low-resolution occupancy estimation unit 400, a high-resolution occupancy re-estimation unit 510, a target speech estimation unit 600, a low-resolution spectral model storage unit 700, a high-resolution spectral model storage unit 810, and a high-resolution sound source position model storage unit 900.

The high-resolution sound source position model storage unit 900 stores, for each source signal (target speech or background sound), the probability density function γ^(j)_(k)(A) of the high-resolution sound source position feature. The shape of γ^(j)_(k)(A) is fixed by prior learning and need not be estimated from the observation signal. Consequently, unlike equation (19), it does not have to be a form whose parameters are easy to estimate from the observation signal, and a more complex form can be used.

The sound source position occupancy estimation unit 210 receives the high-resolution sound source position feature A_(n,k) output by the feature extraction unit 100 and the probability density function γ^(j)_(k)(A) stored in the high-resolution sound source position model storage unit 900, and estimates the high-resolution sound source position occupancy Q^(j)_(n,k) according to the following equation (S210).
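The equation for S210 is not reproduced in this text. One natural reading, shown here purely as an assumption, is that the occupancy is the posterior probability of each source given the feature, obtained by normalizing the stored densities evaluated at A_(n,k). In the Python sketch below, the function name, the optional `priors` argument, and the equal-prior default are hypothetical.

```python
def position_occupancy(likelihoods, priors=None):
    """Normalize per-source density values into occupancies that sum to 1.

    likelihoods[j] : value of the stored density for source j at one point (n, k)
    priors[j]      : optional source weight (assumed equal when omitted)
    """
    n_sources = len(likelihoods)
    if priors is None:
        priors = [1.0 / n_sources] * n_sources
    weighted = [p * l for p, l in zip(priors, likelihoods)]
    total = sum(weighted)
    if total == 0.0:
        # No source explains the observation; fall back to the priors.
        return list(priors)
    return [w / total for w in weighted]
```

Because the densities are fixed in advance, this is a single pass over the time-frequency points with no parameter re-estimation, which matches the non-iterative character of this step.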

The other components and the processing flow are the same as those of the background sound suppression apparatus 20 of the first embodiment.

Next, the prior learning method for the probability density function γ^(j)_(k)(A) is described. Suppose that an observation signal containing only source j (target speech or background sound) is available as prior learning data, and that the high-resolution sound source position feature A_(n,k) has been extracted from it for n = 1, …, N. Any function can be used for γ^(j)_(k)(A) as long as it expresses the probability density function of this feature at each frequency k. As an example, the case of a mixture distribution whose components are the distribution F(A; μ^(j)_(k), σ^(j)_(k)) defined by equation (19) is described. In this case, γ^(j)_(k)(A) is modeled as follows.

Here, r is the index of a mixture component, u^(j)_(r) is the mixture weight of that component, and F(A; μ^(j)_(r,k), σ^(j)_(r,k)) is the distribution of that component. One difference between equations (19) and (19') is that equation (19) models the probability density function of each source j with a single component, whereas equation (19') is a mixture distribution of several components. For each source j, the parameters to be determined by prior learning are u^(j)_(r), μ^(j)_(r,k), and σ^(j)_(r,k) for all r and k. Using the high-resolution sound source position feature A_(n,k) extracted from the prior learning data, these parameters can be obtained with the expectation-maximization algorithm by the following procedure.
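The equation image for (19') is not reproduced in this text. Based on the definitions of r, u^(j)_(r), and the component distribution F given above, the mixture can be written as follows; this is a reconstruction from the prose, and the constraint on the weights is taken from the initialization step of the learning procedure.

```latex
\gamma^{(j)}_{k}(A)
  = \sum_{r} u^{(j)}_{r}\,
    F\bigl(A;\ \mu^{(j)}_{r,k},\ \sigma^{(j)}_{r,k}\bigr),
\qquad
\sum_{r} u^{(j)}_{r} = 1,\quad u^{(j)}_{r} > 0
```

Setting the number of components to one recovers the single-component form of equation (19).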

(1) Initialize μ^(j)_(r,k) and σ^(j)_(r,k) for all r and k. For example, μ^(j)_(r,k) is initialized with random numbers and σ^(j)_(r,k) is set to σ^(j)_(r,k) = 1.

(2) Initialize u^(j)_(r) (> 0), for example with random numbers, so that Σ_r u^(j)_(r) = 1.

(3) Repeat (3-1) to (3-4) below until convergence.
(3-1) K^(j)_(n,r,k) is updated as follows.

(3-2) σ^(j)_(r,k) is updated as follows.

(3-3) The eigenvector corresponding to the maximum eigenvalue of the matrix R_(r,k), obtained as follows, is computed and assigned to μ^(j)_(r,k) as its updated value.

(3-4) u^(j)_(r,k) is updated as follows.

The values of u^(j)_(r), μ^(j)_(r,k), and σ^(j)_(r,k) finally obtained from the above iteration are the parameters produced by prior learning, and with these parameters the probability density function γ^(j)_(k)(A) is defined by equation (19').

As described above, because the background sound suppression apparatus 30 of this embodiment includes a model of the high-resolution sound source position feature learned in advance, it can estimate the high-resolution sound source position occupancy Q^(j)_(n,k) without iterative processing, which keeps the computational cost low. Background sound suppression therefore becomes more efficient.

In addition, when the model of the high-resolution sound source position feature is learned in advance, a more complex probability density function γ^(j)_(k)(A) can be used for that feature. Even when the signal contains long reverberation, or when the background sound does not consist of a single point source, the sound source position features of the target speech and the background sound can be distinguished more appropriately. As a result, background sound suppression can be performed more appropriately.

Next, the operation of the background sound suppression apparatus 40 according to the third embodiment of the present invention is described in detail with reference to FIGS. 8 and 9. FIG. 8 is a block diagram showing the configuration of the background sound suppression apparatus 40 according to the third embodiment. FIG. 9 is a flowchart showing the operation of the background sound suppression apparatus 40 according to the third embodiment. The description below focuses on the differences from the second embodiment and omits matters common to both.

The background sound suppression apparatus 40 of this embodiment includes a feature extraction unit 100, a sound source position occupancy estimation unit 210, a high-resolution occupancy estimation unit 520, a target speech estimation unit 600, a high-resolution spectral model storage unit 800, and a high-resolution sound source position model storage unit 900.

The high-resolution occupancy estimation unit 520 receives the high-resolution spectral feature X_(n,k) output by the feature extraction unit 100 and the high-resolution sound source position occupancy Q^(j)_(n,k) output by the sound source position occupancy estimation unit 210. Following the expectation-maximization algorithm, it obtains the spectral parameter estimate ^q^(j) and the high-resolution occupancy estimate ^M^(j)_(n,k). To do so, steps (1) and (2) below are repeated until convergence.
(1) Update of the spectral parameter estimate (S521)
For each source j, the spectral parameter estimate ^q^(j) = [^q^(j)_(0), …, ^q^(j)_(N_s)] that satisfies equation (22') is updated using the Viterbi algorithm.

(2) Update of the high-resolution occupancy (S522)
The high-resolution occupancy M^(j)_(n,k) is updated as shown in equation (32') (E-step).

The spectral parameter estimate ^q^(j) and the high-resolution occupancy estimate ^M^(j)_(n,k) obtained by repeating (1) and (2) above are the outputs of the high-resolution occupancy estimation unit 520.

Note that the high-resolution occupancy estimation unit 520 differs from the low-resolution occupancy estimation unit 400 of the second embodiment only in the frequency resolution of the features; the processing itself is identical.

The other components and the processing flow are the same as those of the background sound suppression apparatus 30 of the second embodiment.

Considering only the result of the overall processing, the background sound suppression apparatus 40 of this embodiment corresponds to the second embodiment in the case where the filter coefficients F_(k￣) = [F_(k￣,1), F_(k￣,2), …, F_(k￣,N_k)] used for filter bank processing have length N_k and each element is F_(k￣,k) = 1 when k￣ = k and F_(k￣,k) = 0 otherwise. In that case, the input and output of the frequency resolution reduction unit 300 of the second embodiment are identical; that is, the frequency resolution reduction unit 300 is equivalent to performing no processing. The low-resolution spectral feature model β￣_(q^(j),n,k￣)(S) and the high-resolution spectral feature model β_(q^(j),n,k)(S) then coincide, as do the low-resolution occupancy estimate ^M￣^(j)_(n,k￣) and the high-resolution occupancy estimate ^M^(j)_(n,k).
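The equivalence described above can be checked with a small Python sketch. It assumes, as an illustration, that the resolution reduction is a weighted sum over frequency bins with one coefficient row per output band; the names `identity_filters` and `apply_filters` are hypothetical.

```python
def identity_filters(n_bins):
    """Coefficient rows with F(band, k) = 1 when band == k and 0 otherwise."""
    return [[1.0 if k == band else 0.0 for k in range(n_bins)]
            for band in range(n_bins)]

def apply_filters(values, filters):
    """Weighted sum over frequency bins for each output band."""
    return [sum(f * v for f, v in zip(row, values)) for row in filters]
```

Applying `identity_filters(3)` to the frame `[0.5, -1.0, 2.0]` returns the same three values, so a reduction stage configured this way passes its input through unchanged.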

For this reason, this embodiment omits the frequency resolution reduction unit 300, and the high-resolution spectral feature X_(n,k) output by the feature extraction unit 100 and the occupancy Q^(j)_(n,k) output by the sound source position occupancy estimation unit 210 are input directly to the high-resolution occupancy estimation unit 520. The low-resolution occupancy estimation unit 400 and the low-resolution spectral model storage unit 700 are also omitted. The high-resolution occupancy estimate ^M^(j)_(n,k) and the spectral parameter estimate ^q^(j) output by the high-resolution occupancy estimation unit 520, together with the high-resolution spectral feature model β_(q^(j),n,k)(S) stored in the high-resolution spectral model storage unit 800, are input to the target speech estimation unit.

As described above, because the background sound suppression apparatus 40 of this embodiment is provided with a model of the high-resolution sound source position feature quantity learned in advance, it can estimate the high-resolution sound source position occupancy Q^(j)_(n, k) without iterative processing, which keeps the computational cost low. As a result, background sound suppression can be performed more efficiently.

<Confirmation experiment>
A confirmation experiment was conducted to evaluate the background sound suppression apparatus of the present invention.

The experimental conditions were as follows. The observed signals were recordings made in a reverberant room with two microphones, in which the speech of a talker positioned in front of the microphones was captured simultaneously with various ambient background sounds. These observed signals contained relatively long reverberation, and the background sound contained multiple point sources as well as sources that were not point sources. To handle such observed signals appropriately, the model of the high-resolution sound source position feature quantity described in Embodiment 2 of the present invention was prepared by prior training. In this experiment, the invention of Embodiment 2 was run with frequency resolution reduction (the present invention) and without it (the conventional example), and the two were compared. In both cases, the analysis window length of the short-time Fourier transform was set to 100 milliseconds so that the sound source position information of signals containing reverberation could be handled appropriately. Because the sampling frequency was 16 kHz, the dimension of the high-resolution spectral feature quantity was 801. The dimension of the low-resolution spectral feature quantity was set to 40.
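
The stated dimension of 801 follows from the window length and the sampling frequency. The short sketch below reproduces that arithmetic, assuming the FFT length equals the analysis window length (no zero padding) and that only the non-negative frequency bins of a real-valued signal are kept. The function name `stft_feature_dim` is hypothetical.

```python
def stft_feature_dim(window_ms, sample_rate_hz):
    """Number of one-sided frequency bins for a real-valued STFT frame.

    Assumes the FFT length equals the window length in samples.
    """
    window_samples = round(window_ms * sample_rate_hz / 1000)
    # A real FFT of length L yields L // 2 + 1 bins (DC up to Nyquist).
    return window_samples // 2 + 1

high_res_dim = stft_feature_dim(window_ms=100, sample_rate_hz=16000)
low_res_dim = 40  # value stated in the experiment
print(high_res_dim)  # 801
```

A 100 ms window at 16 kHz holds 1600 samples, giving 1600 / 2 + 1 = 801 bins, so the low-resolution feature quantity has roughly one twentieth of the dimensions.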

First, the real-time factor was measured to compare computational cost. The real-time factor is the ratio of the time (in seconds) required for background sound suppression processing to the length (in seconds) of the observed signal. A real-time factor of 1 or less means that processing finishes within a time no longer than the duration of the observed signal. In our experiment, the real-time factors of the conventional example and the present invention were 4.52 and 0.69, respectively. This confirmed that the present invention substantially reduces the computational cost.
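
The definition above can be written out directly. In the sketch below, the function name `real_time_factor` and the 100-second signal are hypothetical illustrations. Only the two factors 4.52 and 0.69 come from the experiment.

```python
def real_time_factor(processing_seconds, signal_seconds):
    """Ratio of processing time to observed signal length."""
    if signal_seconds <= 0:
        raise ValueError("signal length must be positive")
    return processing_seconds / signal_seconds

# Reported real-time factors.
rtf_conventional = 4.52
rtf_proposed = 0.69

# Hypothetical illustration: a 100 s recording processed in 69 s has RTF 0.69.
example_rtf = real_time_factor(69.0, 100.0)

# Ratio of the two reported factors: about 6.55 times less processing time.
speedup = rtf_conventional / rtf_proposed
runs_in_real_time = rtf_proposed <= 1.0
```

Under these figures the conventional example needs about 6.5 times the processing time of the present invention, and only the present invention has a factor below 1.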

Next, we show the results of applying automatic speech recognition to the observed signal and to the signals with the background sound suppressed. The word correct rate when the observed signal was recognized as is was 69.4%, whereas the word correct rates for speech whose background sound was suppressed by the conventional example and by the present invention were 82.7% and 81.6%, respectively. Because both the conventional example and the present invention yielded a large improvement in recognition rate, the high-resolution sound source position feature quantity model of Embodiment 2 evidently functioned effectively. The present invention showed a slight drop in speech recognition performance relative to the conventional example, but the difference was very small.
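
The size of the gap can be made concrete from the three reported rates. The sketch below only rearranges those figures. The variable names and the "share of gain retained" measure are illustrative additions and are not part of the experiment.

```python
# Reported word correct rates in percent.
unprocessed = 69.4
conventional = 82.7
proposed = 81.6

# Absolute gains over the unprocessed observed signal, in percentage points.
gain_conventional = conventional - unprocessed   # about 13.3 points
gain_proposed = proposed - unprocessed           # about 12.2 points

# Cost of reducing the frequency resolution, in percentage points.
gap = conventional - proposed                    # about 1.1 points

# Share of the conventional gain that the present invention retains.
retained = gain_proposed / gain_conventional     # about 0.92
```

On these numbers the present invention keeps roughly 92% of the recognition gain of the conventional example.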

These results confirm that the present invention substantially reduces the computational cost of the conventional example with almost no degradation in background sound suppression performance.

<Program, recording medium>
The various processes described above need not be executed in time series in the order described. They may also be executed in parallel or individually, depending on the processing capability of the apparatus that executes them or as needed. It goes without saying that other modifications may be made as appropriate without departing from the spirit of the present invention.

When the above configuration is implemented by a computer, the processing content of the functions that each apparatus should have is described by a program. Executing this program on a computer realizes the above processing functions on the computer.

The program describing this processing content can be recorded on a computer-readable recording medium. The computer-readable recording medium may be of any kind, for example a magnetic recording device, an optical disc, a magneto-optical recording medium, or a semiconductor memory.

The program is distributed, for example, by selling, transferring, or lending a portable recording medium such as a DVD or CD-ROM on which the program is recorded. Alternatively, the program may be stored in a storage device of a server computer and distributed by transferring it from the server computer to other computers via a network.

A computer that executes such a program first stores, for example, the program recorded on the portable recording medium or the program transferred from the server computer in its own storage device. When executing the processing, the computer reads the program stored in its own recording medium and executes processing according to the program it has read. As another mode of execution, the computer may read the program directly from the portable recording medium and execute processing according to it, or it may execute processing according to the received program sequentially each time the program is transferred to it from the server computer. The above processing may also be executed by a so-called ASP (Application Service Provider) type service, in which the program is not transferred from the server computer to the computer and the processing functions are realized solely through execution instructions and result acquisition. The program in this embodiment includes information that is provided for processing by an electronic computer and is equivalent to a program (such as data that is not a direct instruction to the computer but has the property of defining the computer's processing).

In this embodiment, the apparatus is configured by executing a predetermined program on a computer, but at least part of the processing content may instead be realized in hardware.

INDUSTRIAL APPLICABILITY: The present invention can be used to suppress background sound and to estimate and extract target speech from observed signals in which target speech and background sound are mixed and picked up by a plurality of microphones.

10, 20, 30, 40 Background sound suppression apparatus
100 Feature extraction unit
200, 210 Sound source position occupancy estimation unit
300 Frequency resolution reduction unit
400 Low-resolution occupancy estimation unit
500, 520 High-resolution occupancy estimation unit
510 High-resolution occupancy re-estimation unit
600 Target speech estimation unit
700 Low-resolution spectral model storage unit
800, 810 High-resolution spectral model storage unit
900 High-resolution sound source position model storage unit

Claims

A background sound suppression apparatus that suppresses background sound in an observed signal x^(m)_(n, k), obtained by converting time-domain signals picked up by a plurality of microphones into time-frequency-domain signals, and extracts an estimate ^S^(j)_(n, k) of target speech,
where m denotes a microphone number, n a frame number, k a frequency bin number at high resolution, k￣ a frequency bin number at low resolution, and j a sound source number, the apparatus comprising:
a low-resolution spectral model storage unit that stores a prior probability density function p(q^(j)) of the spectral parameter of each sound source signal and a low-resolution spectral feature quantity model β￣_(q^(j), n, k￣)(S) of each sound source signal;
a high-resolution spectral model storage unit that stores a high-resolution spectral feature quantity model β_(q^(j), n, k)(S) of each sound source signal;
a feature extraction unit that extracts a high-resolution sound source position feature quantity A_(n, k) and a high-resolution spectral feature quantity X_(n, k) from the observed signal x^(m)_(n, k);
a sound source position occupancy estimation unit that obtains a sound source position parameter φ^(j) of each sound source signal from the high-resolution sound source position feature quantity A_(n, k), and obtains a high-resolution sound source position occupancy Q^(j)_(n, k) of each sound source signal from the high-resolution sound source position feature quantity A_(n, k) and the sound source position parameter φ^(j);
a frequency resolution reduction unit that obtains a low-resolution spectral feature quantity X￣_(n, k￣) and a low-resolution sound source position occupancy Q￣^(j)_(n, k￣) from the high-resolution spectral feature quantity X_(n, k) and the high-resolution sound source position occupancy Q^(j)_(n, k) by smoothing between neighboring frequencies;
a low-resolution occupancy estimation unit that obtains an estimate ^q^(j) of the spectral parameter of each sound source signal from the low-resolution spectral feature quantity X￣_(n, k￣), the low-resolution sound source position occupancy Q￣^(j)_(n, k￣), the prior probability density function p(q^(j)), and the low-resolution spectral feature quantity model β￣_(q^(j), n, k￣)(S) so as to maximize a log-likelihood function;
a high-resolution occupancy re-estimation unit that obtains a high-resolution occupancy estimate ^M^(j)_(n, k) from the high-resolution spectral feature quantity X_(n, k), the spectral parameter estimate ^q^(j), the high-resolution sound source position occupancy Q^(j)_(n, k), and the high-resolution spectral feature quantity model β_(q^(j), n, k)(S); and
a target speech estimation unit that obtains the estimate ^S^(j)_(n, k) of the target speech from the spectral parameter estimate ^q^(j), the high-resolution occupancy estimate ^M^(j)_(n, k), the high-resolution spectral feature quantity X_(n, k), and the high-resolution spectral feature quantity model β_(q^(j), n, k)(S).

The background sound suppression apparatus according to claim 1,
wherein the frequency resolution reduction unit obtains the low-resolution spectral feature quantity X￣_(n, k￣) by filter bank processing using the high-resolution spectral feature quantity X_(n, k) and filter coefficients F_(k￣), and obtains the low-resolution sound source position occupancy Q￣^(j)_(n, k￣) by computing an average of the high-resolution sound source position occupancy Q^(j)_(n, k) weighted by the value of a function whose arguments are the filter coefficients F_(k￣) and the high-resolution spectral feature quantity X_(n, k).

The background sound suppression apparatus according to claim 1 or 2,
further comprising a high-resolution sound source position model storage unit that stores a probability density function γ^(j)_(k)(A) of the high-resolution sound source position feature quantity of each sound source signal,
wherein the sound source position occupancy estimation unit obtains the high-resolution sound source position occupancy Q^(j)_(n, k) of each sound source signal from the high-resolution sound source position feature quantity A_(n, k) and the probability density function γ^(j)_(k)(A).

A background sound suppression method that suppresses background sound in an observed signal x^(m)_(n, k), obtained by converting time-domain signals picked up by a plurality of microphones into time-frequency-domain signals, and extracts an estimate ^S^(j)_(n, k) of target speech,
where m denotes a microphone number, n a frame number, k a frequency bin number at high resolution, k￣ a frequency bin number at low resolution, and j a sound source number,
a low-resolution spectral model storage unit storing a prior probability density function p(q^(j)) of the spectral parameter of each sound source signal and a low-resolution spectral feature quantity model β￣_(q^(j), n, k￣)(S) of each sound source signal, and
a high-resolution spectral model storage unit storing a high-resolution spectral feature quantity model β_(q^(j), n, k)(S) of each sound source signal, the method comprising:
a feature extraction step in which a feature extraction unit extracts a high-resolution sound source position feature quantity A_(n, k) and a high-resolution spectral feature quantity X_(n, k) from the observed signal x^(m)_(n, k);
a sound source position occupancy estimation step in which a sound source position occupancy estimation unit obtains a sound source position parameter φ^(j) of each sound source signal from the high-resolution sound source position feature quantity A_(n, k), and obtains a high-resolution sound source position occupancy Q^(j)_(n, k) of each sound source signal from the high-resolution sound source position feature quantity A_(n, k) and the sound source position parameter φ^(j);
a frequency resolution reduction step in which a frequency resolution reduction unit obtains a low-resolution spectral feature quantity X￣_(n, k￣) and a low-resolution sound source position occupancy Q￣^(j)_(n, k￣) from the high-resolution spectral feature quantity X_(n, k) and the high-resolution sound source position occupancy Q^(j)_(n, k) by smoothing between neighboring frequencies;
a low-resolution occupancy estimation step in which a low-resolution occupancy estimation unit obtains an estimate ^q^(j) of the spectral parameter of each sound source signal from the low-resolution spectral feature quantity X￣_(n, k￣), the low-resolution sound source position occupancy Q￣^(j)_(n, k￣), the prior probability density function p(q^(j)), and the low-resolution spectral feature quantity model β￣_(q^(j), n, k￣)(S) so as to maximize a log-likelihood function;
a high-resolution occupancy re-estimation step in which a high-resolution occupancy re-estimation unit obtains a high-resolution occupancy estimate ^M^(j)_(n, k) from the high-resolution spectral feature quantity X_(n, k), the spectral parameter estimate ^q^(j), the high-resolution sound source position occupancy Q^(j)_(n, k), and the high-resolution spectral feature quantity model β_(q^(j), n, k)(S); and
a target speech estimation step in which a target speech estimation unit obtains the estimate ^S^(j)_(n, k) of the target speech from the spectral parameter estimate ^q^(j), the high-resolution occupancy estimate ^M^(j)_(n, k), the high-resolution spectral feature quantity X_(n, k), and the high-resolution spectral feature quantity model β_(q^(j), n, k)(S).

The background sound suppression method according to claim 4,
wherein, in the frequency resolution reduction step, the low-resolution spectral feature quantity X￣_(n, k￣) is obtained by filter bank processing using the high-resolution spectral feature quantity X_(n, k) and filter coefficients F_(k￣), and the low-resolution sound source position occupancy Q￣^(j)_(n, k￣) is obtained by computing an average of the high-resolution sound source position occupancy Q^(j)_(n, k) weighted by the value of a function whose arguments are the filter coefficients F_(k￣) and the high-resolution spectral feature quantity X_(n, k).

The background sound suppression method according to claim 4 or 5,
wherein a high-resolution sound source position model storage unit stores a probability density function γ^(j)_(k)(A) of the high-resolution sound source position feature quantity of each sound source signal, and
in the sound source position occupancy estimation step, the high-resolution sound source position occupancy Q^(j)_(n, k) of each sound source signal is obtained from the high-resolution sound source position feature quantity A_(n, k) and the probability density function γ^(j)_(k)(A).

A program for causing a computer to execute the background sound suppression method according to any one of claims 4 to 6.