JP2017151188A

JP2017151188A - Vocal tract spectrum estimation device, vocal tract spectrum estimation method, and program

Info

Publication number: JP2017151188A
Application number: JP2016031809A
Authority: JP
Inventors: Hirokazu Kameoka; Tomohiko Nakamura
Original assignee: Nippon Telegraph and Telephone Corp; University of Tokyo NUC
Current assignee: Nippon Telegraph and Telephone Corp; University of Tokyo NUC
Priority date: 2016-02-23
Filing date: 2016-02-23
Publication date: 2017-08-31
Anticipated expiration: 2036-02-23
Also published as: JP6420781B2

Abstract

PROBLEM TO BE SOLVED: To accurately estimate a vocal tract spectrum from a speech signal. SOLUTION: An estimation unit 23 estimates the vocal tract spectrum H_{k,r} at each normalized angular frequency of each spectral pattern and the weight U_{r,t} of each spectral pattern at each time so as to reduce an objective function expressed using the distance between an observed spectrogram Y_{k,t} and a vocal tract spectrogram X_{k,t}. SELECTED DRAWING: Figure 1

Description

The present invention relates to a vocal tract spectrum estimation device, a vocal tract spectrum estimation method, and a program, and more particularly to a device, method, and program for estimating a vocal tract spectrum from a speech signal.

Techniques for estimating a vocal tract spectrum from a speech signal are used throughout speech processing, including speech synthesis and voice conversion. If the speech signal in each short interval is modeled as the output of a linear time-invariant system driven by a periodic delta function (pulse train), the input and the impulse response of that system correspond to the glottal source signal and the vocal tract characteristics, respectively. In the frequency domain, this assumption means that the speech spectrum is the product of the glottal source spectrum, itself a periodic delta function, and the vocal tract spectrum. The speech spectrum can therefore be regarded as the vocal tract spectrum sampled periodically at intervals of the fundamental frequency.
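The sampling interpretation above can be checked numerically. The following is a minimal sketch, not part of the patent: it assumes an arbitrary decaying resonance `h` as a stand-in for the vocal tract impulse response, a pitch period `P` in samples, and a circular (periodic) signal of `M` periods so that the DFT is exact.

```python
import numpy as np

P, M = 40, 8                      # pitch period in samples, number of periods (illustrative)
N = P * M
n = np.arange(N)
h = 0.95 ** n * np.cos(0.3 * n)   # hypothetical impulse response standing in for the vocal tract

# Periodic excitation: one copy of h per pitch pulse, summed circularly.
x = sum(np.roll(h, m * P) for m in range(M))

X = np.fft.fft(x)                 # spectrum of the periodic signal
H = np.fft.fft(h)                 # spectrum of the impulse response
harmonic_bins = np.arange(0, N, M)                 # bins at multiples of the fundamental
off_harmonic = np.delete(np.arange(N), harmonic_bins)

print(np.allclose(X[harmonic_bins], M * H[harmonic_bins]))   # envelope sampled at harmonics
print(np.max(np.abs(X[off_harmonic])))                       # numerically zero elsewhere
```

The spectrum of the periodic signal is nonzero only at the harmonic bins, where it equals the envelope `H` up to the factor `M`; everything between harmonics is unobserved.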

Based on this viewpoint, methods for estimating the vocal tract spectrum from the speech spectrum have been proposed. One widely known representative, "STRAIGHT", cuts the speech signal into segments one fundamental period wide and takes the spectrum of each segment as the estimate of the vocal tract spectrum (Non-Patent Document 1). In the frequency domain, this amounts to treating the sinc interpolation of the peaks of the harmonic components as the vocal tract spectrum.

However, the vocal tract spectrum estimate obtained with STRAIGHT is known to vary periodically in time, even for stationary speech, depending on the offset of the frame used to cut out the signal at the width of the fundamental period. This is because the harmonic components interfere with one another. Such periodic temporal variation arises unavoidably when a periodic signal is analyzed with a finite window, and it should not be part of the vocal tract spectrum estimate. An improved method that removes this variation from the STRAIGHT estimate has therefore been proposed (Non-Patent Document 2).

As noted above, the speech spectrum can be regarded as the vocal tract spectrum sampled at intervals of the fundamental frequency (F₀), so the higher the F₀ of the speech, the fewer clues are available for estimating the vocal tract spectrum. This suggests that processing each frame independently has an inherent limit.

On the other hand, because the same phonemes recur in a speech signal, similar vocal tract spectra appear at several different times, and this too is a clue for estimation. If several frames can be assumed to share a common vocal tract spectrum and their F₀ values differ, the number of observable sample points of the vocal tract spectrum is larger than for a single frame, so the estimation accuracy is expected to improve.
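A small numerical illustration of this argument, under assumed values that do not come from the patent (two frames with F₀ of 200 Hz and 230 Hz, and an 8 kHz band): pooling the harmonics of both frames nearly doubles the number of frequencies at which the shared envelope is observed.

```python
import numpy as np

def harmonic_frequencies(f0_hz, nyquist_hz):
    """Frequencies (Hz) at which a voiced frame with fundamental f0_hz samples the envelope."""
    n_harmonics = int(nyquist_hz // f0_hz)
    return f0_hz * np.arange(1, n_harmonics + 1)

nyquist = 8000.0                                   # illustrative band limit
single = harmonic_frequencies(200.0, nyquist)      # one frame, F0 = 200 Hz
other = harmonic_frequencies(230.0, nyquist)       # a second frame, F0 = 230 Hz
combined = np.union1d(single, other)               # sample points available jointly

print(len(single), len(other), len(combined))      # 40 34 73
```

Only 4600 Hz is shared between the two harmonic series, so the two frames together give 73 distinct sample points instead of 40.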

Based on this idea, a method has been proposed that estimates the vocal tract spectrum from multiple frames using simultaneously recorded articulatory movement data (Non-Patent Document 3). A similar approach based on a factor-analyzed trajectory hidden Markov model has also been proposed (Non-Patent Document 4). That method uses the context label assigned to each frame of the speech signal and estimates the vocal tract spectrum from the harmonic information in the frames that share the same context, together with dynamic features.

Non-Patent Document 1: H. Kawahara, I. Masuda-Katsuse, A. de Cheveigne, "Restructuring speech representations using a pitch-adaptive time-frequency smoothing and an instantaneous-frequency-based F0 extraction: Possible role of a repetitive structure in sounds," Speech Commun., 27, pp. 187-207, 1999.
Non-Patent Document 2: H. Kawahara, M. Morise, T. Takahashi, R. Nisimura, T. Irino and H. Banno, "Tandem-STRAIGHT: A temporally stable power spectral representation for periodic signals and applications to interference-free spectrum, F0, and aperiodicity estimation," Proc. ICASSP, pp. 3933-3936, 2008.
Non-Patent Document 3: Y. Shiga and S. King, "Estimating the spectral envelope of voiced speech using multi-frame analysis," Proc. EUROSPEECH, pp. 1737-1740, 2003.
Non-Patent Document 4: T. Toda, "Statistical approach to vocal tract transfer function estimation based on factor analyzed trajectory HMM," Proc. ICASSP, pp. 3925-3928, 2008.

However, although the method based on the factor-analyzed trajectory hidden Markov model can estimate the vocal tract spectrum accurately by using harmonic information from multiple frames, assigning context labels to a speech signal requires enormous effort.

The present invention has been made in view of the above circumstances, and its object is to provide a vocal tract spectrum estimation device, a vocal tract spectrum estimation method, and a program capable of accurately estimating a vocal tract spectrum from a speech signal.

To achieve the above object, the vocal tract spectrum estimation device according to the present invention includes: an observation spectrogram estimation unit that cuts time-series data of a speech signal into segments one fundamental period wide and outputs, from the spectrum of each segment, an observed spectrogram representing the observed time-frequency component at each time and each normalized angular frequency; and an estimation unit that estimates the vocal tract spectrum at each normalized angular frequency of each spectral pattern and the weight of each spectral pattern at each time so as to reduce an objective function expressed using the distance between the observed spectrogram output by the observation spectrogram estimation unit and a vocal tract spectrogram. Here, the vocal tract spectrum represents the power spectrum at each normalized angular frequency of each spectral pattern, and the vocal tract spectrogram represents the time-frequency component at each time and each normalized angular frequency obtained from the vocal tract spectra and the weights of the spectral patterns at each time.

The vocal tract spectrum estimation method according to the present invention is a method performed in a vocal tract spectrum estimation device that includes an observation spectrogram estimation unit and an estimation unit. The observation spectrogram estimation unit cuts time-series data of a speech signal into segments one fundamental period wide and outputs, from the spectrum of each segment, an observed spectrogram representing the observed time-frequency component at each time and each normalized angular frequency. The estimation unit estimates the vocal tract spectrum at each normalized angular frequency of each spectral pattern and the weight of each spectral pattern at each time so as to reduce an objective function expressed using the distance between the observed spectrogram output by the observation spectrogram estimation unit and a vocal tract spectrogram, where the vocal tract spectrum represents the power spectrum at each normalized angular frequency of each spectral pattern, and the vocal tract spectrogram represents the time-frequency component at each time and each normalized angular frequency obtained from the vocal tract spectra and the weights of the spectral patterns at each time.

The program according to the present invention causes a computer to function as each unit of the above vocal tract spectrum estimation device.

As described above, the vocal tract spectrum estimation device, method, and program of the present invention estimate the vocal tract spectrum at each normalized angular frequency of each spectral pattern and the weight of each spectral pattern at each time so as to reduce an objective function expressed using the distance between the observed spectrogram and the vocal tract spectrogram, and thereby achieve the effect that the vocal tract spectrum can be estimated accurately from a speech signal.

A schematic diagram showing the configuration of the vocal tract spectrum estimation device according to the embodiment of the present invention.
A flowchart showing an example of the vocal tract spectrum estimation processing routine that uses the vocal tract spectrum estimation algorithm for GMM-NMF.
A flowchart showing an example of the vocal tract spectrum estimation processing routine that uses the vocal tract spectrum estimation algorithm for AR-NMF.
A diagram showing the evaluation results of the vocal tract spectrum estimated by the vocal tract spectrum estimation device.

Hereinafter, embodiments of the present invention will be described in detail with reference to the drawings.
<Outline of Embodiment of the Present Invention>
Based on the assumption that the vocal tract spectrogram can be approximated by a low-rank non-negative matrix, the vocal tract spectrum is estimated accurately from the harmonic information in multiple frames of the speech signal, without using context labels attached to the speech signal.

<Estimation of vocal tract spectrum by interpolation of missing data>
<Modeling of vocal tract spectrogram using low-rank non-negative matrix>
First, the modeling of the vocal tract spectrogram by a low-rank non-negative matrix is described. Let t (t = 0, ..., T-1) be the time index, and let ω_k denote the normalized angular frequency corresponding to frequency index k (k = 0, ..., K-1).

Because the number of phoneme types in speech is limited, the same vocal tract spectrum appears at several different times within a single utterance. This can be interpreted as equivalent to saying that the vocal tract spectrogram can be represented by a low-rank non-negative matrix. Therefore, using a non-negative matrix H = (H_{k,r})_{k,r} whose columns are R smooth spectral patterns and the non-negative weights U = (U_{r,t})_{r,t} of the spectral patterns, the vocal tract spectrogram X_{k,t} can be expressed by equation (1).

Here, r is the index of the spectral pattern.
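Equation (1) is not reproduced in this text. From the description (spectral patterns as columns of H, per-time weights in U), a natural reading is X_{k,t} = Σ_r H_{k,r} U_{r,t}, that is, X = HU. The sketch below assumes that form; the sizes and random values are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
K, T, R = 64, 50, 4            # frequency bins, frames, spectral patterns (illustrative sizes)

H = rng.random((K, R))         # column r: spectral pattern r over frequency, non-negative
U = rng.random((R, T))         # row r: weight of pattern r at each time, non-negative
X = H @ U                      # X[k, t] = sum_r H[k, r] * U[r, t]

print(X.shape, np.linalg.matrix_rank(X))   # rank is at most R
```

The K x T spectrogram is described by only R(K + T) numbers, which is what allows frames with the same phoneme but different F₀ to share one envelope.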

Many designs of H_{k,r} that are smooth along the frequency axis and non-negative are possible; this embodiment proposes two.

The first defines H_{k,r} by a function in the form of a Gaussian mixture, as shown in equations (2) and (3).

Here, ω is the normalized angular frequency, N is the number of mixture components, and n (n = 0, ..., N-1) is the component index. h(ω) is a frequency warping function; if h(ω) = ω, G_n(ω) has the same form as a normal distribution with mean ρ_n and variance ν_n² in the linear frequency domain. In this embodiment, the frequency warping function h(ω) is set, for example, as in equation (4), so that the vocal tract spectrum is smooth in the mel-frequency domain.

Here, f_s is the sampling frequency, and equation (4) maps the normalized frequency range [0, π] to [0, 1]. W = (W_{r,n})_{r,n} ≥ 0 holds the weights of the normal distributions; to remove the scale ambiguity between the non-negative matrix H and the weights U, Σ_n W_{r,n} = 1 is imposed for each r.
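The Gaussian-mixture pattern can be sketched as follows. Equations (2) to (4) are not reproduced in this text, so the forms used here are assumptions: H_{k,r} = Σ_n W_{r,n} G_n(ω_k) with G_n a normal density in the warped variable h(ω), and a hypothetical mel-style logarithmic warping `mel_warp` chosen only because it maps [0, π] to [0, 1] as the text requires. The patent's actual equation (4) may differ.

```python
import numpy as np

def mel_warp(omega, fs_hz=16000.0):
    """Hypothetical warping: mel-style log curve scaled so that [0, pi] maps to [0, 1]."""
    f_hz = omega * fs_hz / (2.0 * np.pi)
    return np.log1p(f_hz / 700.0) / np.log1p(0.5 * fs_hz / 700.0)

def gmm_basis(omega, rho, nu, W, warp=mel_warp):
    """Return H[k, r] = sum_n W[r, n] * G_n(omega_k) and G, with G_n Gaussian on the warped axis."""
    x = warp(omega)[:, None]                                                    # (K, 1)
    G = np.exp(-0.5 * ((x - rho) / nu) ** 2) / np.sqrt(2.0 * np.pi * nu ** 2)   # (K, N)
    return G @ W.T, G                                                           # (K, R), (K, N)

K, N, R = 128, 20, 3                    # illustrative sizes
omega = np.linspace(0.0, np.pi, K)      # normalized angular frequencies omega_k
rho = np.linspace(0.0, 1.0, N)          # component means on the warped axis
nu = np.full(N, 0.05)                   # component standard deviations
W = np.random.default_rng(0).random((R, N))
W /= W.sum(axis=1, keepdims=True)       # enforce sum_n W[r, n] = 1 for each r
H, G = gmm_basis(omega, rho, nu, W)
```

Because every G_n is smooth and non-negative and W is non-negative, each column of H is automatically a smooth, non-negative envelope; only W needs to be learned.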

The second uses an all-pole filter, which is common in source-filter models. Specifically, with the coefficients a_r := [a_{r,0}, a_{r,1}, ..., a_{r,P}]^T of a P-th order all-pole filter, the amplitude spectrum of the filter H_{k,r}^{(AR)}, that is, the vocal tract spectrum H_{k,r}^{(AR)}, is given by equation (5).

Here, Q(ω) is the (P+1)×(P+1) Toeplitz matrix whose (p, q) element is cos(ω(p-q)).
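The role of Q(ω) can be made concrete with a short sketch. The quadratic form a^T Q(ω) a equals |A(e^{jω})|² for A(z) = Σ_p a_p z^{-p}, which the code verifies numerically. Equation (5) itself is not reproduced in this text; the all-pole form H = 1 / (a^T Q(ω) a) used below is an assumption, and the coefficients are illustrative.

```python
import numpy as np

def ar_quadratic_form(a, omega):
    """a^T Q(omega) a, where Q[p, q] = cos(omega * (p - q))."""
    idx = np.arange(len(a))
    Q = np.cos(omega * (idx[:, None] - idx[None, :]))   # (P+1, P+1) Toeplitz matrix
    return a @ Q @ a

def ar_spectrum(a, omegas):
    """Assumed all-pole form: H(omega) = 1 / (a^T Q(omega) a)."""
    return np.array([1.0 / ar_quadratic_form(a, w) for w in omegas])

a = np.array([1.0, -0.9, 0.4])                 # illustrative coefficients, P = 2
omegas = np.linspace(0.0, np.pi, 8)
H_ar = ar_spectrum(a, omegas)

# Cross-check against the filter's frequency response A(e^{j omega}).
A = np.array([np.sum(a * np.exp(-1j * w * np.arange(len(a)))) for w in omegas])
print(np.allclose(H_ar, 1.0 / np.abs(A) ** 2))
```

Writing the spectrum as a quadratic form in a_r is what makes the gradient with respect to a_r a matrix-vector product, which the later multiplicative update relies on.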

<Non-negative matrix factorization approach to missing data>
Given the vocal tract spectrogram Y = (Y_{k,t})_{k,t} estimated by STRAIGHT, the vocal tract spectrogram estimation problem can be formulated as in equation (6), using the distance D_*(Y_{k,t}; X_{k,t}) between the given spectrogram Y_{k,t} and the estimated vocal tract spectrogram X_{k,t}.

Here, Θ is the parameter set: Θ = {W, U} when H_{k,r}^{(GMM)} of equation (2) is used as the vocal tract spectrum H_{k,r}, and Θ = {{a_r}_r, U} when H_{k,r}^{(AR)} of equation (5) is used. For convenience, non-negative matrix factorization (NMF) with H_{k,r}^{(GMM)} is hereinafter called "GMM-NMF", and NMF with H_{k,r}^{(AR)} is called "AR-NMF". Z_{k,t} ∈ [0,1] is a parameter representing the reliability of each time-frequency component.

In equation (6), if Z_{k,t} = 0, the cost for the STRAIGHT-estimated spectrogram Y_{k,t} is ignored; the larger Z_{k,t} is, the more weight that time-frequency component receives.

A simple way to design the reliability Z_{k,t} is to set Z_{k,t} = 1 for the time-frequency components at the fundamental frequency F₀ and its harmonics at each time, and Z_{k,t} = ξ (where ξ is a constant between 0 and 1 inclusive) for all other components. The value of ξ can be chosen experimentally or empirically, but this embodiment describes a design based on the aperiodicity index A_{k,t} ∈ [0,1] obtained with STRAIGHT. Since A_{k,t} is the proportion of aperiodic component contained in each time-frequency component, setting Z_{k,t} = 1 - A_{k,t} for each k, t makes it possible to estimate a vocal tract spectrogram X_{k,t} that emphasizes the periodic components.
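Both reliability designs described above can be sketched in a few lines. The function names, the mapping from harmonic frequency to the nearest bin, and the numerical values are illustrative assumptions, not taken from the patent.

```python
import numpy as np

def harmonic_reliability(f0_hz, fs_hz, n_bins, xi=0.1):
    """Z[k] = 1 at the bins nearest to the harmonics of f0_hz, xi elsewhere (one frame)."""
    Z = np.full(n_bins, xi)
    bin_hz = (fs_hz / 2.0) / (n_bins - 1)          # bins span 0 .. fs/2 inclusive
    harmonics = np.arange(f0_hz, fs_hz / 2.0 + 1e-9, f0_hz)
    Z[np.rint(harmonics / bin_hz).astype(int)] = 1.0
    return Z

def aperiodicity_reliability(A):
    """Z = 1 - A, where A in [0, 1] is the aperiodic fraction of each component."""
    return 1.0 - np.clip(A, 0.0, 1.0)

Z_frame = harmonic_reliability(f0_hz=200.0, fs_hz=16000.0, n_bins=401)
Z_ap = aperiodicity_reliability(np.array([[0.0, 0.25], [1.0, 0.5]]))
print(int((Z_frame == 1.0).sum()))     # 40 harmonic bins
print(Z_ap)
```

The first design needs only an F₀ track; the second reuses the aperiodicity estimate and so downweights noisy harmonics as well as the gaps between harmonics.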

Distances D_*(Y_{k,t}; X_{k,t}) that are widely used in NMF include the generalized Kullback-Leibler (KL) divergence D_GKL and the squared distance D_EU. The objective function L_GKL(Θ) for the generalized KL divergence D_GKL is shown in equation (7), and the objective function L_EU(Θ) for the squared distance D_EU is shown in equation (8).
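Equations (6) to (8) are not reproduced in this text. The sketch below assumes the reliability-weighted sum Σ_{k,t} Z_{k,t} D(Y_{k,t}; X_{k,t}) suggested by the description, with the standard definitions of the generalized KL divergence, Y log(Y/X) - Y + X, and the squared distance, (Y - X)². The small `eps` is an added numerical guard, not part of the definitions.

```python
import numpy as np

def weighted_gkl(Y, X, Z, eps=1e-12):
    """sum_{k,t} Z * (Y log(Y/X) - Y + X): reliability-weighted generalized KL divergence."""
    return float(np.sum(Z * (Y * np.log((Y + eps) / (X + eps)) - Y + X)))

def weighted_squared(Y, X, Z):
    """sum_{k,t} Z * (Y - X)^2: reliability-weighted squared distance."""
    return float(np.sum(Z * (Y - X) ** 2))

Y = np.array([[1.0, 2.0], [3.0, 4.0]])
X = 2.0 * Y                              # a deliberately poor model
Z_all = np.ones_like(Y)                  # every component fully trusted
Z_none = np.zeros_like(Y)                # every component ignored

print(weighted_gkl(Y, X, Z_all))         # 10 * (1 - log 2)
print(weighted_squared(Y, X, Z_all))     # 30.0
print(weighted_gkl(Y, X, Z_none))        # 0.0: components with Z = 0 carry no cost
```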

<Parameter estimation algorithm by auxiliary function method>
Next, the parameter estimation algorithm for GMM-NMF and the parameter estimation algorithm for AR-NMF are described.

<Derivation of iterative algorithm for GMM-NMF>
The second term inside the parentheses on the right-hand side of the objective function L_GKL(Θ) of equation (7), which uses the generalized KL divergence D_GKL, contains a sum inside the logarithm, so it is difficult to solve directly the optimization problem of minimizing the objective function of equation (7).

However, as is done in many studies using NMF, it is known that a locally optimal solution can be obtained iteratively by applying an optimization principle called the auxiliary function method to an objective function that is difficult to minimize directly.

In the auxiliary function method, an auxiliary variable λ is introduced for the objective function L(Θ) of the parameter Θ, and an auxiliary function L⁺(Θ, λ) is derived that defines an upper bound satisfying L(Θ) = min_λ L⁺(Θ, λ). Minimizing the auxiliary function L⁺(Θ, λ) alternately with respect to the parameter Θ and the auxiliary variable λ makes the objective function L(Θ) monotonically non-increasing.
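The monotonicity claim follows in one line from the definition of the auxiliary function. The iteration superscripts below are notation introduced here for illustration:

```latex
\lambda^{(i)} = \operatorname*{arg\,min}_{\lambda} L^{+}(\Theta^{(i)}, \lambda),
\qquad
\Theta^{(i+1)} = \operatorname*{arg\,min}_{\Theta} L^{+}(\Theta, \lambda^{(i)}),

L(\Theta^{(i+1)})
  \;\le\; L^{+}(\Theta^{(i+1)}, \lambda^{(i)})
  \;\le\; L^{+}(\Theta^{(i)}, \lambda^{(i)})
  \;=\; L(\Theta^{(i)}).
```

The first inequality holds because L⁺ is an upper bound for every λ, the second because Θ^{(i+1)} minimizes L⁺ for fixed λ^{(i)}, and the final equality because λ^{(i)} makes the bound tight at Θ^{(i)}.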

Because the logarithm is a concave function, an upper bound on the second term inside the parentheses on the right-hand side of equation (7) is given by equation (9) using Jensen's inequality.

Here, λ_{k,t,r,n} ≥ 0 are auxiliary variables satisfying Σ_{r,n} λ_{k,t,r,n} = 1 for each k, t. Equality in equation (9) holds when equation (10) is satisfied.
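Equations (9) and (10) are not reproduced in this text. Assuming the model X_{k,t} = Σ_{r,n} W_{r,n} G_n(ω_k) U_{r,t} implied by the Gaussian-mixture pattern, Jensen's inequality applied to the concave logarithm gives a bound of the following shape; this is a reconstruction from the surrounding definitions, not a copy of the patent's equations:

```latex
-\log X_{k,t}
  = -\log \sum_{r,n} W_{r,n}\, G_n(\omega_k)\, U_{r,t}
  \;\le\; -\sum_{r,n} \lambda_{k,t,r,n}
          \log \frac{W_{r,n}\, G_n(\omega_k)\, U_{r,t}}{\lambda_{k,t,r,n}},
\qquad
\text{with equality when }
\lambda_{k,t,r,n} = \frac{W_{r,n}\, G_n(\omega_k)\, U_{r,t}}{X_{k,t}}.
```

The bound moves the sum outside the logarithm, so each W_{r,n} and U_{r,t} appears in its own term and the minimization separates.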

Therefore, the auxiliary function for the objective function L_GKL(Θ) is given by equation (11).

Here, λ := {λ_{k,t,r,n}}. Setting the partial derivatives of the auxiliary function of equation (11) with respect to W_{r,n} and U_{r,t} to zero and substituting equation (10) yields the closed-form update rules shown in equations (12) and (13).

All of the update rules in equations (12) and (13) are computed as products of non-negative terms, so if the initial values are non-negative, the non-negativity of W_{r,n} and U_{r,t} is preserved automatically.
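Equations (12) and (13) are not reproduced in this text. The sketch below uses the multiplicative updates obtained by carrying out the derivation described above for the reliability-weighted generalized KL objective; they may differ in detail from the patent's equations. In particular, the handling of the constraint Σ_n W_{r,n} = 1 (renormalizing W and moving the scale into U, which leaves X unchanged) is an assumption, and the data are random placeholders.

```python
import numpy as np

def gkl(Y, X, Z):
    """Reliability-weighted generalized KL divergence (inputs assumed strictly positive)."""
    return float(np.sum(Z * (Y * np.log(Y / X) - Y + X)))

def gmm_nmf_step(Y, Z, G, W, U):
    """One pass of multiplicative updates: U first, then W (assumed forms, see lead-in)."""
    H = G @ W.T                                   # (K, R) spectral patterns
    ratio = Z * Y / (H @ U)
    U = U * (H.T @ ratio) / (H.T @ Z)             # update of the weights U[r, t]
    ratio = Z * Y / (G @ W.T @ U)
    W = W * (U @ ratio.T @ G) / (U @ Z.T @ G)     # update of the mixture weights W[r, n]
    s = W.sum(axis=1, keepdims=True)
    return W / s, U * s                           # restore sum_n W[r, n] = 1; X is unchanged

rng = np.random.default_rng(0)
K, T, N, R = 40, 30, 12, 3                        # illustrative sizes
x = np.linspace(0.0, 1.0, K)[:, None]
G = np.exp(-0.5 * ((x - np.linspace(0.0, 1.0, N)) / 0.1) ** 2)   # Gaussian-shaped bumps (K, N)
Y = rng.random((K, T)) + 0.1                      # placeholder observed spectrogram
Z = 0.1 + 0.9 * rng.random((K, T))                # placeholder reliabilities
W = rng.random((R, N))
W /= W.sum(axis=1, keepdims=True)
U = rng.random((R, T))

history = [gkl(Y, G @ W.T @ U, Z)]
for _ in range(50):
    W, U = gmm_nmf_step(Y, Z, G, W, U)
    history.append(gkl(Y, G @ W.T @ U, Z))
print(history[0], history[-1])
```

Each factor in the updates is non-negative, so W and U stay non-negative without any projection step, and the recorded objective values should never increase from one iteration to the next.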

Next, consider the update rules for the objective function L_EU(Θ) when the squared distance D_EU is used as the distance D_*(Y_{k,t}; X_{k,t}). Equation (8) also contains a sum in the second term inside the parentheses on its right-hand side, so it is difficult to solve directly the optimization problem of minimizing the objective function of equation (8).

However, because a quadratic function is convex, an auxiliary function can be designed by applying Jensen's inequality to equation (8), in the same way as for the objective function L_GKL(Θ). The design follows the same procedure as for L_GKL(Θ) and is omitted here; the update rules that minimize equation (8) are given by equations (14) and (15).
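Equations (14) and (15) are not reproduced in this text. As an illustration of the same construction for the squared distance, the sketch below shows the standard reliability-weighted multiplicative update of U for fixed spectral patterns H; it is the update that results from the Jensen bound on the quadratic term and is offered as an assumption about the general shape, not as the patent's exact equation. The update of W is omitted, and all data are random placeholders.

```python
import numpy as np

def weighted_squared(Y, X, Z):
    """Reliability-weighted squared distance."""
    return float(np.sum(Z * (Y - X) ** 2))

def eu_update_U(Y, Z, H, U):
    """Multiplicative update of U for fixed H under the weighted squared distance."""
    X = H @ U
    return U * (H.T @ (Z * Y)) / (H.T @ (Z * X))

rng = np.random.default_rng(1)
K, T, R = 40, 30, 3                      # illustrative sizes
H = rng.random((K, R)) + 0.01            # fixed, strictly positive spectral patterns
Y = rng.random((K, T)) + 0.1             # placeholder observed spectrogram
Z = 0.1 + 0.9 * rng.random((K, T))       # placeholder reliabilities
U = rng.random((R, T)) + 0.01

history_eu = [weighted_squared(Y, H @ U, Z)]
for _ in range(50):
    U = eu_update_U(Y, Z, H, U)
    history_eu.append(weighted_squared(Y, H @ U, Z))
print(history_eu[0], history_eu[-1])
```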

<Derivation of iterative algorithm for AR-NMF>
For AR-NMF as well, closed-form update rules can be derived in the same way as in the derivation of the iterative algorithm for GMM-NMF.

First, consider the auxiliary function L⁺_{GKL,AR}(Θ, ξ) for the generalized KL divergence objective function L_GKL(Θ). It is defined by equation (16) by introducing non-negative auxiliary variables ξ = {ξ_{k,t,r}}_{k,t,r} that satisfy Σ_r ξ_{k,t,r} = 1 for each k, t.

Equality in equation (16) holds when ξ_{k,t,r} = H_{k,r}^{(AR)} U_{r,t} / X_{k,t}. The update rule for U_{r,t} is the same as equation (13) with H_{k,r}^{(GMM)} replaced by H_{k,r}^{(AR)}, and is given by equation (17).

For the update of a_r in the iterative algorithm for AR-NMF, on the other hand, a multiplicative update algorithm can be used.

The partial derivative of the generalized KL divergence objective function L_GKL(Θ) of equation (7) with respect to a_r can be expressed by equations (18) to (20).

Both the first and second terms inside the parentheses on the right-hand side of equation (18) are positive definite matrices, and using the multiplicative update rule of equation (21) makes the objective function L_GKL(Θ, ξ) monotonically non-increasing.

Although the details are omitted, the update rules for the squared distance D_EU can be derived in the same way as for the generalized KL divergence. Specifically, the update rule for U_{r,t} is equation (22), obtained by replacing H_{k,r}^{(GMM)} in equation (15) with H_{k,r}^{(AR)}, and the update rule for a_r is equation (23), which is analogous to equation (21).

<First Embodiment>
<System configuration>
Next, a first embodiment of the present invention is described, taking as an example the case where the present invention is applied to a vocal tract spectrum estimation device that estimates the vocal tract spectrum from the harmonic information in multiple frames of a speech signal. The first embodiment describes a vocal tract spectrum estimation device that uses the vocal tract spectrum estimation algorithm for GMM-NMF with the generalized KL divergence D_GKL as the distance D_*(Y_{k,t}; X_{k,t}).

As shown in FIG. 1, the vocal tract spectrum estimation device according to the first embodiment of the present invention is implemented on a computer comprising a CPU, a RAM, and a ROM that stores a program for executing the vocal tract spectrum estimation processing routine described later. Functionally, it is configured as follows.

The vocal tract spectrum estimation device 100 includes an input unit 10, a computation unit 20, a storage unit 30, and an output unit 40.

The input unit 10 receives the time-series data of the speech signal whose vocal tract spectrum is to be estimated. The storage unit 30 stores the time-series data of the speech signal received by the input unit 10. The storage unit 30 also stores the results of the processing steps described later and the initial values of the parameters used in the processing routine.

The computation unit 20 includes an observation spectrogram estimation unit 21, an initial setting unit 22, an estimation unit 23, an end determination unit 24, and an output unit 25.

The observation spectrogram estimation unit 21 takes as input the time-series data of the speech signal collected by the input unit 10 and computes the vocal tract spectrogram Y_{k,t} estimated with STRAIGHT, a known vocal tract spectrum estimation method. The computed vocal tract spectrogram Y_{k,t} is stored in the storage unit 30. More specifically, the observation spectrogram estimation unit 21 cuts the time-series data of the speech signal collected by the input unit 10 into segments one fundamental period wide and computes, from each segment, the vocal tract spectrogram Y_{k,t} representing the observed time-frequency component at each time t and each normalized angular frequency ω_k. The vocal tract spectrogram Y_{k,t} is an example of the observed spectrogram according to the present embodiment.

When calculating the vocal tract spectrogram Y_{k,t}, the observation spectrogram estimation unit 21 also calculates an aperiodicity index A_{k,t} for each observed time-frequency component and obtains the reliability Z_{k,t} of each time-frequency component as Z_{k,t} = 1 − A_{k,t}. The calculated reliability Z_{k,t} of each time-frequency component is stored in the storage unit 30.
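As an illustration only, the reliability calculation Z_{k,t} = 1 − A_{k,t} can be sketched as follows. The function name `reliability`, the nested-list layout (rows indexed by k, columns by t), and the sample values are assumptions made for this sketch; it also assumes the aperiodicity index lies between 0 and 1.

```python
def reliability(aperiodicity):
    """Return Z[k][t] = 1 - A[k][t] for every time-frequency bin."""
    return [[1.0 - a for a in row] for row in aperiodicity]


# Hypothetical 2 x 3 aperiodicity map (rows: frequency bins k, columns: frames t).
A = [[0.0, 0.25, 1.0],
     [0.5, 0.75, 0.125]]
Z = reliability(A)
```

A bin that is fully aperiodic (A = 1) therefore receives zero reliability, and a fully periodic bin (A = 0) receives reliability 1.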

The initial setting unit 22 sets the initial values of the parameters W_{r,n}, U_{r,t}, and G_n(ω_k) used in the processing described later. W_{r,n} is non-negative, and its initial value is set so that Σ_n W_{r,n} = 1 holds. U_{r,t} is also non-negative, and its initial value is set to a suitable value, for example by using random numbers. Here r (r = 0, ..., R−1) is the spectral index identifying the R spectral patterns. The initial value of G_n(ω_k) is set so as to satisfy equation (3) above, with suitable values chosen for the mean ρ_n and the variance ν_n². The initial values of the parameters W_{r,n}, U_{r,t}, and G_n(ω_k) set in this way are stored in the storage unit 30.
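A minimal sketch of such an initialization is shown below, assuming nested lists indexed as W[r][n] and U[r][t]. The function name `init_parameters`, the seed, and the small positive offset are choices made for this sketch, not details taken from the description.

```python
import random


def init_parameters(R, N, T, seed=0):
    """Draw non-negative W (each row summing to 1 over n) and non-negative U."""
    rng = random.Random(seed)
    W = []
    for _ in range(R):
        # The small offset keeps every entry strictly positive before normalizing.
        row = [rng.random() + 1e-3 for _ in range(N)]
        total = sum(row)
        W.append([w / total for w in row])  # enforce sum_n W[r][n] = 1
    U = [[rng.random() for _ in range(T)] for _ in range(R)]
    return W, U


W, U = init_parameters(R=3, N=4, T=5)
```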

For every combination (k, r), the estimation unit 23 calculates the vocal tract spectrum H^(GMM)_{k,r} (hereinafter simply "H_{k,r}") according to equation (2) above, based on W_{r,n} and G_n(ω_k) stored in the storage unit 30, and stores it in the storage unit 30.

For every combination (k, t), the estimation unit 23 calculates the vocal tract spectrogram X_{k,t} according to equation (1) above, based on H_{k,r} and U_{r,t} stored in the storage unit 30, and stores it in the storage unit 30.
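Equations (1) and (2) are not reproduced in this part of the description. The sketch below assumes the usual GMM-NMF forms, H_{k,r} = Σ_n W_{r,n} G_n(ω_k) and X_{k,t} = Σ_r H_{k,r} U_{r,t}; both forms, the function names, and the toy values are assumptions for illustration.

```python
def vocal_tract_spectra(W, G):
    """H[k][r] = sum_n W[r][n] * G[n][k] (assumed form of equation (2))."""
    K = len(G[0])
    return [[sum(w * G[n][k] for n, w in enumerate(W_r)) for W_r in W]
            for k in range(K)]


def spectrogram(H, U):
    """X[k][t] = sum_r H[k][r] * U[r][t] (assumed form of equation (1))."""
    T = len(U[0])
    return [[sum(h * U[r][t] for r, h in enumerate(H_k)) for t in range(T)]
            for H_k in H]


# Toy sizes: R = 2 patterns, N = 2 basis functions, K = 3 bins, T = 2 frames.
W = [[0.5, 0.5], [1.0, 0.0]]
G = [[1.0, 2.0, 3.0], [3.0, 2.0, 1.0]]
U = [[1.0, 0.0], [0.0, 2.0]]
H = vocal_tract_spectra(W, G)
X = spectrogram(H, U)
```

Under these assumptions X is a rank-R non-negative product, which is the low-rank structure the evaluation section later refers to.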

For every combination (r, n), the estimation unit 23 updates the normal-distribution weight W_{r,n} according to equation (12) above, based on W_{r,n}, Z_{k,t}, X_{k,t}, Y_{k,t}, G_n(ω_k), and U_{r,t} stored in the storage unit 30, and stores it in the storage unit 30.

Following the update of the normal-distribution weight W_{r,n}, the estimation unit 23 recalculates, for every combination (k, r), the vocal tract spectrum H_{k,r} according to equation (2) above, based on W_{r,n} and G_n(ω_k), and stores it in the storage unit 30.

Following the update of the vocal tract spectrum H_{k,r}, the estimation unit 23 recalculates, for every combination (k, t), the vocal tract spectrogram X_{k,t} according to equation (1) above, based on H_{k,r} and U_{r,t} stored in the storage unit 30, and stores it in the storage unit 30.

For every combination (r, t), the estimation unit 23 updates the spectral pattern weight U_{r,t} of the vocal tract spectrogram X_{k,t} according to equation (13) above, based on U_{r,t}, Z_{k,t}, X_{k,t}, Y_{k,t}, and H^(GMM)_{k,r} (that is, H_{k,r}) stored in the storage unit 30, and stores it in the storage unit 30.
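Equation (13) itself is not reproduced in this part of the description, so the following is not that equation. It is a sketch of the standard multiplicative update for a reliability-weighted generalized KL objective, Σ Z_{k,t} (Y log(Y/X) − Y + X) with X = HU, which is one well-known way to update U_{r,t} without increasing such an objective. All names, the guard constant `eps`, and the toy data are assumptions.

```python
import math


def reconstruct(H, U):
    """X[k][t] = sum_r H[k][r] * U[r][t]."""
    return [[sum(h * U[r][t] for r, h in enumerate(H_k)) for t in range(len(U[0]))]
            for H_k in H]


def weighted_gkl(Y, X, Z):
    """Reliability-weighted generalized KL divergence (assumes Y > 0 and X > 0)."""
    return sum(z * (y * math.log(y / x) - y + x)
               for Y_k, X_k, Z_k in zip(Y, X, Z)
               for y, x, z in zip(Y_k, X_k, Z_k))


def update_U(Y, Z, H, U, eps=1e-12):
    """Standard weighted-KL multiplicative update of U (illustrative only)."""
    X = reconstruct(H, U)
    K = len(H)
    new_U = []
    for r in range(len(U)):
        row = []
        for t in range(len(U[0])):
            num = sum(Z[k][t] * H[k][r] * Y[k][t] / (X[k][t] + eps) for k in range(K))
            den = sum(Z[k][t] * H[k][r] for k in range(K))
            row.append(U[r][t] * num / (den + eps))
        new_U.append(row)
    return new_U


# Toy data: K = 3 bins, T = 2 frames, R = 2 patterns.
Y = [[1.0, 2.0], [3.0, 1.0], [2.0, 2.0]]
Z = [[1.0, 0.5], [0.8, 1.0], [1.0, 1.0]]
H = [[1.0, 0.5], [0.5, 1.0], [1.0, 1.0]]
U = [[1.0, 1.0], [1.0, 1.0]]
before = weighted_gkl(Y, reconstruct(H, U), Z)
U_new = update_U(Y, Z, H, U)
after = weighted_gkl(Y, reconstruct(H, U_new), Z)
```

Because the update only multiplies U by a non-negative ratio, non-negativity of U is preserved automatically, and bins with low reliability Z contribute less to the ratio.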

Furthermore, following the update of the spectral pattern weight U_{r,t} of the vocal tract spectrogram X_{k,t}, the estimation unit 23 recalculates, for every combination (k, t), the vocal tract spectrogram X_{k,t} according to equation (1) above, based on H_{k,r} and U_{r,t} stored in the storage unit 30, and stores it in the storage unit 30.

The end determination unit 24 determines whether a predetermined end condition is satisfied. If the end condition is not satisfied, the processes of the estimation unit 23 are repeated. If the end determination unit 24 determines that the end condition is satisfied, processing moves to the output unit 25.

The output unit 25 outputs the vocal tract spectrogram X_{k,t} stored in the storage unit 30 as the vocal tract spectrogram estimated from the speech signal input to the input unit 10.

As the end condition, it suffices to use the condition that the difference between the value of the objective function of equation (7) at the (L−1)-th iteration and its value at the L-th iteration has fallen below a predetermined threshold. Alternatively, the end condition may be that the number of iterations has reached a predetermined upper limit.

<Operation of the vocal tract spectrum estimation device>
Next, the operation of the vocal tract spectrum estimation device 100 according to the first embodiment, which uses the vocal tract spectrum estimation algorithm for GMM-NMF, will be described.

Time-series data of a speech signal acquired with a microphone is input to the vocal tract spectrum estimation device 100 and stored in the storage unit 30. The vocal tract spectrum estimation device 100 then executes the vocal tract spectrum estimation processing routine shown in FIG. 2.

First, in step S100, the time-series data of the speech signal is read from the storage unit 30, and the vocal tract spectrum is estimated from it with STRAIGHT. The vocal tract spectrogram Y_{k,t}, which represents the observed time-frequency component at each time t and each normalized angular frequency ω_k (k = 0, ..., K−1), is calculated and stored in the storage unit 30.

In step S102, the aperiodicity index A_{k,t} corresponding to the vocal tract spectrogram Y_{k,t} obtained in step S100 is calculated, the reliability Z_{k,t} of each time-frequency component is calculated as Z_{k,t} = 1 − A_{k,t}, and the obtained reliability Z_{k,t} is stored in the storage unit 30.

In step S104, the initial values of W_{r,n} and U_{r,t} are set using random numbers. The initial values of W_{r,n} and U_{r,t} are non-negative, and the initial value of W_{r,n} is set so that Σ_n W_{r,n} = 1 holds. The initial value of G_n(ω_k) is set so as to satisfy equation (3) above, with suitable values chosen for the mean ρ_n and the variance ν_n².

The initial values of W_{r,n}, U_{r,t}, and G_n(ω_k) set in this way are stored in the storage unit 30.

Next, in step S106, the vocal tract spectrum H_{k,r} is calculated for each combination (k, r) according to equation (2) above, based on W_{r,n} and G_n(ω_k) set in step S104, and stored in the storage unit 30. Also in step S106, the vocal tract spectrogram X_{k,t} is calculated for each combination (k, t) according to equation (1) above, based on U_{r,t} set in step S104 and H_{k,r} calculated in this step, and stored in the storage unit 30.

In step S108, the normal-distribution weight W_{r,n} is calculated for each combination (r, n) according to equation (12) above and stored in the storage unit 30. The calculation uses the latest parameters W_{r,n}, Z_{k,t}, X_{k,t}, Y_{k,t}, G_n(ω_k), and U_{r,t} stored in the storage unit 30. On the first pass these are W_{r,n}, G_n(ω_k), and U_{r,t} set in step S104, Z_{k,t} calculated in step S102, X_{k,t} calculated in step S106, and Y_{k,t} calculated in step S100.

In step S110, as in step S106, the vocal tract spectrum H_{k,r} is calculated for each combination (k, r) according to equation (2) above and stored in the storage unit 30, based on the latest parameters W_{r,n} and G_n(ω_k) stored in the storage unit 30, that is, W_{r,n} calculated in step S108 and G_n(ω_k) set in step S104. Also in step S110, the vocal tract spectrogram X_{k,t} is calculated for each combination (k, t) according to equation (1) above and stored in the storage unit 30, based on the latest parameters U_{r,t} and H_{k,r} stored in the storage unit 30, that is, U_{r,t} (on the first pass, the value set in step S104) and H_{k,r} calculated in this step.

In step S112, the spectral pattern weight U_{r,t} of the vocal tract spectrogram X_{k,t} is calculated for each combination (r, t) according to equation (13) above and stored in the storage unit 30. The calculation uses the latest parameters Z_{k,t}, X_{k,t}, Y_{k,t}, H_{k,r}, and U_{r,t} stored in the storage unit 30, that is, U_{r,t} (on the first pass, the value set in step S104), Z_{k,t} calculated in step S102, X_{k,t} and H_{k,r} calculated in step S110, and Y_{k,t} calculated in step S100.

In step S114, the vocal tract spectrogram X_{k,t} is calculated for each combination (k, t) according to equation (1) above and stored in the storage unit 30, based on the latest parameters U_{r,t} and H_{k,r} stored in the storage unit 30, that is, U_{r,t} calculated in step S112 and H_{k,r} calculated in step S110.

In the next step S116, the value of the objective function L_GKL(Θ) is calculated according to equation (7), based on Y_{k,t} calculated in step S100, Z_{k,t} calculated in step S102, and X_{k,t} calculated in step S114, and stored in the storage unit 30. The value of the objective function L_GKL(Θ) calculated in the previous execution of step S116 is then read from the storage unit 30, and it is determined whether the difference between the current value and the previous value of L_GKL(Θ) is smaller than a predetermined threshold stored in advance in the storage unit 30. If the difference is greater than or equal to the threshold, the end condition is judged not to be satisfied, processing returns to step S108, and steps S108 to S116 are repeated.

If, on the other hand, the difference is less than the predetermined threshold, the end condition is judged to be satisfied. In step S118, the latest vocal tract spectrogram X_{k,t} calculated in step S114 is output, and the vocal tract spectrum estimation processing routine ends.
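The control flow of this routine (repeat the update steps, then compare successive objective values against a threshold) can be sketched generically as follows. The update equations are deliberately left as caller-supplied functions because they are not reproduced here. The names `run_routine`, `steps`, and `objective`, the use of an absolute difference, the iteration cap, and the toy "halving" update are all assumptions for illustration.

```python
def run_routine(state, steps, objective, threshold, max_iterations=100):
    """Repeat the update steps until the objective changes by less than threshold.

    `steps` stands in for S108, S110, S112, and S114; `objective` stands in
    for the evaluation performed in S116.
    """
    previous = None
    iteration = 0
    for iteration in range(1, max_iterations + 1):
        for step in steps:
            step(state)                 # each step updates the stored parameters
        current = objective(state)
        if previous is not None and abs(previous - current) < threshold:
            break                       # end condition satisfied
        previous = current
    return state, iteration


def halve(state):
    """Toy update: stands in for a parameter update that lowers the objective."""
    state["x"] /= 2.0


final, iterations = run_routine({"x": 8.0}, [halve], lambda s: s["x"], threshold=0.3)
```

The iteration cap corresponds to the alternative end condition mentioned earlier, in which the loop stops once a predetermined upper limit on the number of iterations is reached.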

The above describes an example of a vocal tract spectrum estimation device that uses the vocal tract spectrum estimation algorithm for GMM-NMF with the generalized KL divergence D_GKL as the distance D_*(Y_{k,t}; X_{k,t}). The squared distance D_EU may of course be used as the distance D_*(Y_{k,t}; X_{k,t}) instead.

In that case, for example, step S108 is replaced by the calculation of the update equation for W_{r,n} shown in equation (14), and step S112 is replaced by the calculation of the update equation for U_{r,t} shown in equation (15).
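For reference, the two distances can be sketched with their standard element-wise definitions. The reliability-weighted sum in `weighted_objective` is an assumed form of the objective (equation (7) is not reproduced in this part of the description), and the function names are hypothetical.

```python
import math


def d_gkl(y, x):
    """Generalized KL divergence between scalars y >= 0 and x > 0."""
    if y == 0.0:
        return x                        # limit of y*log(y/x) - y + x as y -> 0
    return y * math.log(y / x) - y + x


def d_eu(y, x):
    """Squared distance between scalars."""
    return (y - x) ** 2


def weighted_objective(Y, X, Z, distance):
    """Sum of Z[k][t] * D(Y[k][t]; X[k][t]) over all bins (assumed form)."""
    return sum(z * distance(y, x)
               for Y_k, X_k, Z_k in zip(Y, X, Z)
               for y, x, z in zip(Y_k, X_k, Z_k))


value_eu = weighted_objective([[3.0, 1.0]], [[1.0, 1.0]], [[0.5, 1.0]], d_eu)
value_gkl = weighted_objective([[3.0, 1.0]], [[3.0, 1.0]], [[0.5, 1.0]], d_gkl)
```

Both distances are zero when the model matches the observation exactly; they differ in how strongly they penalize mismatches at large versus small amplitudes.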

<Second Embodiment>
<System configuration>
Next, a vocal tract spectrum estimation device according to the second embodiment will be described. The second embodiment describes an example of a vocal tract spectrum estimation device that uses the vocal tract spectrum estimation algorithm for AR-NMF with the generalized KL divergence D_GKL as the distance D_*(Y_{k,t}; X_{k,t}). Like the system configuration of the vocal tract spectrum estimation device according to the first embodiment shown in FIG. 1, the vocal tract spectrum estimation device according to the second embodiment of the present invention includes an input unit 10, a calculation unit 20, a storage unit 30, and an output unit 40. The calculation unit 20 includes an observation spectrogram estimation unit 21, an initial setting unit 22, an estimation unit 23, an end determination unit 24, and an output unit 25.

The input unit 10 and the observation spectrogram estimation unit 21 are the same as in the first embodiment, so their description is omitted.

The initial setting unit 22 sets the initial values of the parameters a_r and U_{r,t} used in the processing described later. The initial value of the P-th order all-pole filter coefficients a_r is set to a suitable value, for example by using random numbers. The initial values of U_{r,t} are set as described for the system configuration of the first embodiment. The initial values of the parameters a_r and U_{r,t} set in this way are stored in the storage unit 30.

For every combination (k, r), the estimation unit 23 calculates the vocal tract spectrum H^(AR)_{k,r} (hereinafter simply "H_{k,r}") according to equation (5) above, based on a_r stored in the storage unit 30, and stores it in the storage unit 30.
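Equation (5) is not reproduced in this part of the description. As a rough illustration of how an all-pole filter determines a spectrum, the sketch below evaluates the textbook power response 1 / |1 − Σ_p a_p e^{−jωp}|². The sign convention, the unit gain, and the choice of a power rather than an amplitude spectrum are all assumptions and may differ from equation (5); the function name is hypothetical.

```python
import cmath
import math


def all_pole_spectrum(a, omegas):
    """Power response 1 / |1 - sum_p a[p-1] * exp(-1j*omega*p)|**2 at each omega."""
    spectrum = []
    for omega in omegas:
        denom = 1.0 - sum(a_p * cmath.exp(-1j * omega * (p + 1))
                          for p, a_p in enumerate(a))
        spectrum.append(1.0 / abs(denom) ** 2)
    return spectrum


# First-order example (P = 1) evaluated at omega = 0 and omega = pi.
spec = all_pole_spectrum([0.5], [0.0, math.pi])
```

With the single coefficient 0.5, the response is largest at ω = 0 and smallest at ω = π, i.e. a low-pass envelope.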

The estimation unit 23 updates the all-pole filter coefficients a_r according to equation (20) above, based on a_r, Z_{k,t}, H_{k,r}, X_{k,t}, Y_{k,t}, and U_{r,t} stored in the storage unit 30, and stores them in the storage unit 30.

Following the update of the all-pole filter coefficients a_r, the estimation unit 23 recalculates, for every combination (k, r), the vocal tract spectrum H_{k,r} according to equation (5) above, based on a_r, and stores it in the storage unit 30.

For every combination (r, t), the estimation unit 23 updates the spectral pattern weight U_{r,t} of the vocal tract spectrogram X_{k,t} according to equation (17) above, based on U_{r,t}, Z_{k,t}, X_{k,t}, Y_{k,t}, and H^(AR)_{k,r} (that is, H_{k,r}) stored in the storage unit 30, and stores it in the storage unit 30.

<Operation of the vocal tract spectrum estimation device>
Next, the operation of the vocal tract spectrum estimation device 100 according to the second embodiment, which uses the vocal tract spectrum estimation algorithm for AR-NMF, will be described.

Time-series data of a speech signal acquired with a microphone is input to the vocal tract spectrum estimation device 100 and stored in the storage unit 30. The vocal tract spectrum estimation device 100 then executes the vocal tract spectrum estimation processing routine shown in FIG. 3.

Steps S100, S102, and S118 of the vocal tract spectrum estimation processing routine shown in FIG. 3 are the same as the corresponding steps of the routine in FIG. 2 described above, so their description is omitted.

In step S105, the initial values of a_r and U_{r,t} are set using random numbers. The initial values of a_r and U_{r,t} are set to non-negative values.

The initial values of a_r and U_{r,t} set in this way are stored in the storage unit 30.

Next, in step S107, the vocal tract spectrum H_{k,r} is calculated for each combination (k, r) according to equation (5) above, based on a_r set in step S105, and stored in the storage unit 30. Also in step S107, the vocal tract spectrogram X_{k,t} is calculated for each combination (k, t) according to equation (1) above, based on U_{r,t} set in step S105 and H_{k,r} calculated in this step, and stored in the storage unit 30.

In step S109, the all-pole filter coefficients a_r are calculated according to equation (21) above and stored in the storage unit 30. The calculation uses the latest parameters a_r, Z_{k,t}, X_{k,t}, Y_{k,t}, H_{k,r}, and U_{r,t} stored in the storage unit 30. On the first pass these are Y_{k,t} calculated in step S100, Z_{k,t} calculated in step S102, a_r and U_{r,t} set in step S105, and H_{k,r} and X_{k,t} calculated in step S107.

In step S111, the vocal tract spectrum H_{k,r} is calculated for each combination (k, r) according to equation (5) above and stored in the storage unit 30, based on the latest a_r stored in the storage unit 30, that is, a_r calculated in step S109. Also in step S111, the vocal tract spectrogram X_{k,t} is calculated for each combination (k, t) according to equation (1) above and stored in the storage unit 30, based on the latest parameters U_{r,t} and H_{k,r} stored in the storage unit 30, that is, U_{r,t} (on the first pass, the value set in step S105) and H_{k,r} calculated in this step.

In step S113, the spectral pattern weight U_{r,t} of the vocal tract spectrogram X_{k,t} is calculated for each combination (r, t) according to equation (17) above and stored in the storage unit 30. The calculation uses the latest parameters Z_{k,t}, X_{k,t}, Y_{k,t}, H_{k,r}, and U_{r,t} stored in the storage unit 30, that is, U_{r,t} (on the first pass, the value set in step S105), Z_{k,t} calculated in step S102, X_{k,t} and H_{k,r} calculated in step S111, and Y_{k,t} calculated in step S100.

In step S115, the vocal tract spectrogram X_{k,t} is calculated for each combination (k, t) according to equation (1) above and stored in the storage unit 30, based on the latest parameters U_{r,t} and H_{k,r} stored in the storage unit 30, that is, U_{r,t} calculated in step S113 and H_{k,r} calculated in step S111.

In the next step S117, the value of the objective function L_GKL(Θ) is calculated according to equation (7), based on Y_{k,t} calculated in step S100, Z_{k,t} calculated in step S102, and X_{k,t} calculated in step S115, and stored in the storage unit 30. The value of the objective function L_GKL(Θ) calculated in the previous execution of step S117 is then read from the storage unit 30, and it is determined whether the difference between the current value and the previous value of L_GKL(Θ) is smaller than a predetermined threshold stored in advance in the storage unit 30. If the difference is greater than or equal to the threshold, the end condition is judged not to be satisfied, processing returns to step S109, and steps S109 to S117 are repeated.

If, on the other hand, the difference is less than the predetermined threshold, the end condition is judged to be satisfied. In step S118, the latest vocal tract spectrogram X_{k,t} calculated in step S115 is output, and the vocal tract spectrum estimation processing routine ends.

The above describes an example of a vocal tract spectrum estimation device that uses the vocal tract spectrum estimation algorithm for AR-NMF with the generalized KL divergence D_GKL as the distance D_*(Y_{k,t}; X_{k,t}). The squared distance D_EU may of course be used as the distance D_*(Y_{k,t}; X_{k,t}) instead.

In that case, for example, step S109 is replaced by the calculation of the update equation for a_r shown in equation (23), and step S113 is replaced by the calculation of the update equation for U_{r,t} shown in equation (22).

<Vocal tract spectrum estimation accuracy evaluation experiment>
<Conditions of the evaluation experiment>
Next, to demonstrate the effectiveness of the vocal tract spectrum estimation method according to the first embodiment (hereinafter, the "proposed method"), an evaluation experiment was conducted to compare the estimation accuracy of the vocal tract spectrum estimated by the proposed method with that of the vocal tract spectrum estimated by STRAIGHT.

From set A of the ATR digital speech database, speech signals of 20 sentences uttered by one Japanese female speaker (sampling frequency 16 kHz) were analyzed with STRAIGHT to extract the fundamental frequency F₀, the vocal tract spectrum, and the aperiodicity index A_{k,t}. The spectrum obtained here is regarded as the ground-truth vocal tract spectrum.

Speech signals were then resynthesized with STRAIGHT using the ground-truth vocal tract spectrum and F₀ scaled by factors of 2^-1.0, 2^-0.5, 2^0.0, 2^0.5, 2^1.0, and 2^1.5.
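The six scaling factors are half-octave steps from one octave below to one and a half octaves above the original F₀. A small sketch of applying them to an F₀ contour follows; the function name, the convention that 0 marks an unvoiced frame, and the sample contour are assumptions for illustration.

```python
# Factors 2^-1.0, 2^-0.5, 2^0.0, 2^0.5, 2^1.0, 2^1.5 (half-octave steps).
F0_FACTORS = [2.0 ** (e / 2.0) for e in range(-2, 4)]


def scale_f0(contour, factor):
    """Scale voiced F0 values; zeros (assumed to mark unvoiced frames) stay zero."""
    return [f * factor if f > 0.0 else 0.0 for f in contour]


contour = [0.0, 200.0, 220.0]            # hypothetical F0 values in Hz
scaled = scale_f0(contour, F0_FACTORS[0])
```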

The vocal tract spectrum was then estimated from the resynthesized speech signals with both STRAIGHT and the proposed method, and the estimation performance of each method was compared using the mel-cepstral distortion between the estimated vocal tract spectrum and the ground-truth vocal tract spectrum. The mel-cepstral distortion was calculated using the 1st to 24th mel-cepstral coefficients. For the vocal tract spectrum estimation by STRAIGHT, the frame shift was 5 ms (T = 81761) and the dimension of the vocal tract spectrum was K = 513.
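The description does not state the formula used for the mel-cepstral distortion. A commonly used per-frame definition is (10 / ln 10) · sqrt(2 Σ_d (c_d − ĉ_d)²) in dB, summed here over coefficients 1 to 24 as in the experiment. The sketch below assumes that definition and assumes the mel-cepstral coefficients have already been computed; the function name and sample vectors are hypothetical.

```python
import math


def mel_cepstral_distortion(c_ref, c_est, first=1, last=24):
    """Per-frame mel-cepstral distortion in dB over coefficients first..last."""
    diff_sq = sum((c_ref[d] - c_est[d]) ** 2 for d in range(first, last + 1))
    return (10.0 / math.log(10.0)) * math.sqrt(2.0 * diff_sq)


ref = [0.0] * 25
est = [0.0] * 25
est[0] = 5.0   # 0th coefficient (energy term) is excluded from the sum
est[1] = 1.0   # a difference of 1.0 in the 1st coefficient
mcd = mel_cepstral_distortion(ref, est)
```

A figure reported as "mean ± standard deviation" would then be obtained by taking these per-frame values over all frames.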

For the proposed method, the parameter estimation algorithm for GMM-NMF was evaluated with both the generalized KL divergence D_GKL and the squared distance D_EU. Specifically, using the vocal tract spectra of the resynthesized speech signals estimated by STRAIGHT, W_{r,n} and U_{r,t} were estimated from the vocal tract spectrograms Y_{k,t} of all 20 utterances simultaneously, separately for each F₀ scaling factor. The number of spectral patterns was R = 90 and the number of mixture components was N = 100, with mean ρ_n = n/(N−1) and standard deviation ν_n = 1/(N−1) for n = 0, ..., N−1. W_{r,n} and U_{r,t} were initialized with non-negative random numbers, and the vocal tract spectrum estimation algorithm of the proposed method was run for 100 iterations.
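The mean and standard-deviation settings above place N equally spaced basis functions on an axis running from 0 to 1. The sketch below builds such a basis under two assumptions not confirmed by this part of the description: that G_n is a normalized Gaussian density (equation (3) is not reproduced here) and that the frequency bins are mapped linearly onto [0, 1]. The function name is hypothetical.

```python
import math


def gaussian_basis(N, K):
    """G[n][k]: Gaussian with mean n/(N-1) and std 1/(N-1) on K bins over [0, 1]."""
    nu = 1.0 / (N - 1)                   # standard deviation, as in the experiment
    G = []
    for n in range(N):
        rho = n / (N - 1)                # mean of the n-th basis function
        row = []
        for k in range(K):
            x = k / (K - 1)              # assumed mapping of bin k onto [0, 1]
            row.append(math.exp(-0.5 * ((x - rho) / nu) ** 2)
                       / (math.sqrt(2.0 * math.pi) * nu))
        G.append(row)
    return G


# Small example; the experiment used N = 100 and K = 513.
G = gaussian_basis(N=5, K=9)
```

Because the standard deviation equals the spacing between means, neighbouring basis functions overlap, so weighted sums of them can form smooth spectral envelopes.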

<Results of the evaluation experiment>
FIG. 4 shows the mel-cepstral distortion of the vocal tract spectra estimated by the proposed method and by STRAIGHT. The "GKL" column gives the results obtained with the generalized KL divergence for each F₀ scaling factor x, the "EU" column gives the results obtained with the squared distance for each F₀ scaling factor x, and the "STRAIGHT" column gives the results obtained with STRAIGHT. Each result is given in the form [mean ± standard deviation] [dB], and the values in parentheses are the results obtained without using the aperiodicity index A_{k,t}.

As the F₀ scaling factor x increases, fewer harmonic components can be observed, so the benefit of using harmonic components from other frames is expected to become apparent. As FIG. 4 shows, the higher the F₀ scaling factor x, the smaller the mel-cepstral distortion of the GKL results relative to the STRAIGHT results, which confirms that harmonic components from frames other than the current frame are effective for estimating the vocal tract spectrum.

In the EU results as well, the higher the F₀ scaling factor x, the smaller the mel-cepstral distortion relative to the STRAIGHT results. Compared with the GKL results, however, the mel-cepstral distortion tends to be larger on average, so the parameter estimation algorithm for GMM-NMF using the generalized KL divergence can be said to be better suited to estimating the vocal tract spectrum.

When the aperiodicity index A_{k,t} is made uniform over all time-frequency components, that is, when Z_{k,t} = 1 is set for every combination (k, t), the values in parentheses are obtained for each estimation method.

In the GKL results, for every F₀ scaling factor x, the mel-cepstral distortion obtained with the aperiodicity index A_{k,t} is smaller than that obtained when A_{k,t} is made uniform. This confirms that the aperiodicity index A_{k,t} contributes to improving the performance of vocal tract spectrum estimation.

Furthermore, comparing the GKL results in parentheses with the results obtained with STRAIGHT shows that the mel-cepstral distortion of GKL is smaller for every F₀ scaling factor x. This suggests that the assumption that the vocal tract spectrogram X_{k,t} can be approximated by a low-rank non-negative matrix is useful for vocal tract spectrum estimation.
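The low-rank assumption can be illustrated with a minimal sketch of plain NMF under the generalized KL divergence. It assumes that the vocal tract spectrogram factorizes as X_{k,t} = Σ_r H_{k,r} U_{r,t} and uses the standard multiplicative updates for this divergence. It omits the aperiodicity index and the GMM or all-pole parameterization of H, so it is not the estimation algorithm of the embodiment. The function names `gkl_divergence` and `nmf_gkl` are hypothetical.

```python
import numpy as np

def gkl_divergence(Y, X, eps=1e-12):
    """Generalized KL divergence D(Y || X), summed over all entries."""
    Y = np.maximum(Y, eps)
    X = np.maximum(X, eps)
    return float(np.sum(Y * np.log(Y / X) - Y + X))

def nmf_gkl(Y, R, n_iter=200, seed=0, eps=1e-12):
    """Approximate a non-negative K x T matrix Y by H @ U with R patterns.

    H (K x R): spectral patterns; U (R x T): weight of each pattern per frame.
    Returns H, U and the divergence after each iteration.
    """
    rng = np.random.default_rng(seed)
    K, T = Y.shape
    H = rng.random((K, R)) + eps
    U = rng.random((R, T)) + eps
    history = []
    for _ in range(n_iter):
        X = H @ U + eps
        # Multiplicative update of the spectral patterns
        H *= ((Y / X) @ U.T) / (U.sum(axis=1)[None, :] + eps)
        X = H @ U + eps
        # Multiplicative update of the per-frame weights
        U *= (H.T @ (Y / X)) / (H.sum(axis=0)[:, None] + eps)
        history.append(gkl_divergence(Y, H @ U))
    return H, U, history
```

Because the updates are multiplicative, H and U stay non-negative as long as they are initialized with non-negative values, and the divergence does not increase from one iteration to the next.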

As described above, the proposed method according to the present invention can accurately estimate the vocal tract spectrum from a speech signal by using information on the harmonic components in multiple frames of the speech signal as a cue, without using context labels assigned to the speech signal.

The present invention is not limited to the embodiment described above, and various modifications and applications are possible without departing from the gist of the invention.

For example, although the vocal tract spectrum estimation apparatus described above contains a computer system, the term "computer system" also includes a homepage providing environment (or display environment) when a WWW system is used.

In addition, although this specification describes an embodiment in which the program is installed in advance, the program may also be provided stored on a computer-readable recording medium.

DESCRIPTION OF SYMBOLS
10 Input unit
20 Computation unit
21 Observation spectrogram estimation unit
22 Initial setting unit
23 Estimation unit
24 End determination unit
25 Output unit
30 Storage unit
100 Vocal tract spectrum estimation apparatus

Claims

A vocal tract spectrum estimation apparatus comprising:
an observation spectrogram estimation unit that cuts out time-series data of a speech signal with a width of the fundamental period and, from the spectrum of the cut-out speech signal, outputs an observation spectrogram representing the observed time-frequency component at each time and each normalized angular frequency; and
an estimation unit that estimates the vocal tract spectrum at each normalized angular frequency in each spectral pattern, and the weight of each spectral pattern at each time, so as to reduce an objective function expressed using the distance between the observation spectrogram output by the observation spectrogram estimation unit and a vocal tract spectrogram, the vocal tract spectrogram representing the time-frequency component at each time and each normalized angular frequency and being obtained from the vocal tract spectrum, which represents the power spectrum at each normalized angular frequency in each spectral pattern, and from the weight of each spectral pattern at each time.

The vocal tract spectrum estimation apparatus according to claim 1, wherein the estimation unit estimates a weight W_{r,n} for each n-th normal distribution as the vocal tract spectrum at the k-th normalized angular frequency ω_k in the r-th spectral pattern, the vocal tract spectrum being H_{k,r}^{(GMM)} expressed by the following equation:

Here, G_n(ω) denotes a normal distribution with mean ρ_n and variance ν_n², and h(ω) denotes a frequency warping function.

The vocal tract spectrum estimation apparatus according to claim 1, wherein the estimation unit estimates the coefficients a_r of a P-th order all-pole filter as the vocal tract spectrum at the k-th normalized angular frequency ω_k in the r-th spectral pattern, the vocal tract spectrum being H_{k,r}^{(AR)} expressed by the following equation:

Here, Q(ω) is a (P+1) × (P+1) Toeplitz matrix whose (p, q) element is cos(ω(p−q)).

A vocal tract spectrum estimation method performed in a vocal tract spectrum estimation apparatus including an observation spectrogram estimation unit and an estimation unit, the method comprising:
cutting out, by the observation spectrogram estimation unit, time-series data of a speech signal with a width of the fundamental period and outputting, from the spectrum of the cut-out speech signal, an observation spectrogram representing the observed time-frequency component at each time and each normalized angular frequency; and
estimating, by the estimation unit, the vocal tract spectrum at each normalized angular frequency in each spectral pattern, and the weight of each spectral pattern at each time, so as to reduce an objective function expressed using the distance between the observation spectrogram output by the observation spectrogram estimation unit and a vocal tract spectrogram, the vocal tract spectrogram representing the time-frequency component at each time and each normalized angular frequency and being obtained from the vocal tract spectrum, which represents the power spectrum at each normalized angular frequency in each spectral pattern, and from the weight of each spectral pattern at each time.

The vocal tract spectrum estimation method according to claim 4, wherein the estimation unit estimates a weight W_{r,n} for each n-th normal distribution as the vocal tract spectrum at the k-th normalized angular frequency ω_k in the r-th spectral pattern, the vocal tract spectrum being H_{k,r}^{(GMM)} expressed by the following equation:

Here, G_n(ω) denotes a normal distribution with mean ρ_n and variance ν_n², and h(ω) denotes a frequency warping function.

The vocal tract spectrum estimation method according to claim 4, wherein the estimation unit estimates the coefficients a_r of a P-th order all-pole filter as the vocal tract spectrum at the k-th normalized angular frequency ω_k in the r-th spectral pattern, the vocal tract spectrum being H_{k,r}^{(AR)} expressed by the following equation:

Here, Q(ω) is a (P+1) × (P+1) Toeplitz matrix whose (p, q) element is cos(ω(p−q)).

A program for causing a computer to function as each unit of the vocal tract spectrum estimation apparatus according to any one of claims 1 to 3.