JP2014134730A

JP2014134730A - Fundamental frequency model parameter estimation device, method and program

Info

Publication number: JP2014134730A
Application number: JP2013003585A
Authority: JP
Inventors: Hirokazu Kameoka; 弘和亀岡; Kota Yoshizato; 幸太吉里; Daisuke Saito; 大輔齋藤; Shigeki Sagayama; 茂樹嵯峨山
Original assignee: Nippon Telegraph and Telephone Corp; University of Tokyo NUC
Current assignee: Nippon Telegraph and Telephone Corp; University of Tokyo NUC
Priority date: 2013-01-11
Filing date: 2013-01-11
Publication date: 2014-07-24
Anticipated expiration: 2033-01-11
Also published as: JP5885210B2

Abstract

[Problem] To make it possible to estimate the parameters of the Fujisaki model using a constraint on the non-negativity of phrase commands and accent commands.
[Solution] A fundamental frequency extraction unit 2 extracts an observed fundamental frequency sequence ^y from time-series data of a speech signal, a voiced/unvoiced interval estimation unit 3 estimates the degree of uncertainty of the fundamental frequency at each time k, and an initial value setting unit 4 sets the initial value of the command function ^o and the initial value of the parameter group θ. A command state sequence posterior probability update unit 5 computes the posterior probability P(^s | ^y, ^o′, θ′) of the command state sequence ^s, and a model parameter update unit 6 updates the non-negative command function ^o and the parameter group θ, using the log posterior probability log P(^o, θ | ^y) as the objective function. The computation by the command state sequence posterior probability update unit 5 and the update by the model parameter update unit 6 are repeated until a predetermined convergence condition is satisfied.
[Selected drawing] Figure 3

Description

The present invention relates to a fundamental frequency model parameter estimation device, method, and program, and more particularly to a fundamental frequency model parameter estimation device, method, and program that estimate, from a speech signal, the parameters of an observed fundamental frequency sequence.

<Fujisaki model>
Fujisaki's fundamental frequency (F₀) pattern generation process model (the Fujisaki model) is a well-known method for analyzing the intonation of speech (Non-Patent Document 1). The Fujisaki model is a mechanical model that explains how the F₀ pattern is generated by focusing on the motion of the thyroid cartilage. In this model, the sum of the vocal cord elongations that accompany two independent motions of the thyroid cartilage (translation and rotation) is interpreted as causing the temporal variation of F₀, and the F₀ pattern is modeled on the assumption that vocal cord elongation is proportional to the logarithm y(t) of the F₀ pattern. The F₀ pattern y_p(t) produced by the translational motion of the thyroid cartilage is called the phrase component, and the F₀ pattern y_a(t) produced by the rotational motion is called the accent component. In the Fujisaki model, the F₀ pattern y(t) of speech is expressed as the sum of these components and a baseline component y_b determined by the physical constraints of the vocal cords:

y(t) = y_p(t) + y_a(t) + y_b

The two components are modeled as the outputs of second-order critically damped systems:

y_p(t) = G_p(t) * u_p(t)
y_a(t) = G_a(t) * u_a(t)

where G_p(t) and G_a(t) are the responses of the two systems and * denotes convolution with respect to time t. Here u_p(t), called the phrase command function, consists of a sequence of delta functions (phrase commands), and u_a(t), called the accent command function, consists of a sequence of rectangular pulses (accent commands). These command sequences are subject to the following constraints: a phrase command occurs at the beginning of an utterance, two phrase commands do not occur in succession, and two different commands (a phrase command and an accent command) do not occur at the same time. α and β are the natural angular frequencies of the phrase control mechanism and the accent control mechanism, respectively, and are known empirically to be roughly α = 3 rad/s and β = 20 rad/s regardless of the speaker or the utterance content.
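The generation process described above can be illustrated with a minimal Python sketch. It assumes the standard impulse response of a second-order critically damped system, G(t) = ω² t e^(−ωt) for t ≥ 0, which is not printed in this text; the function names, the 8 ms step, and the Riemann-sum convolution are likewise illustrative choices rather than the patent's implementation.

```python
import math


def impulse_response(omega, t):
    """Impulse response of a second-order critically damped system
    (assumed form: omega^2 * t * exp(-omega * t) for t >= 0)."""
    return omega * omega * t * math.exp(-omega * t) if t >= 0.0 else 0.0


def convolve(g, u, dt):
    """Causal discrete convolution, scaled by dt to approximate the integral."""
    out = []
    for k in range(len(u)):
        acc = 0.0
        for l in range(k + 1):
            acc += g[k - l] * u[l]
        out.append(acc * dt)
    return out


def fujisaki_contour(u_p, u_a, y_b, alpha=3.0, beta=20.0, dt=0.008):
    """Log-F0 contour y = y_p + y_a + y_b from sampled command functions.

    u_p: phrase command samples (a delta of magnitude A is stored as A / dt)
    u_a: accent command samples (rectangular pulses)
    y_b: baseline component
    """
    n = len(u_p)
    g_p = [impulse_response(alpha, k * dt) for k in range(n)]
    g_a = [impulse_response(beta, k * dt) for k in range(n)]
    y_p = convolve(g_p, u_p, dt)
    y_a = convolve(g_a, u_a, dt)
    return [y_b + p + a for p, a in zip(y_p, y_a)]


# Hypothetical example: one phrase command at the start, one accent command.
dt = 0.008
n = 200
u_p = [0.0] * n
u_p[0] = 0.5 / dt                      # phrase command of magnitude 0.5
u_a = [0.4 if 25 <= k < 60 else 0.0 for k in range(n)]
contour = fujisaki_contour(u_p, u_a, y_b=4.5, dt=dt)
```

Because both command functions are non-negative here, the resulting contour never falls below the baseline y_b, which is the property the non-negativity constraint is meant to preserve.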

<Fujisaki model parameter estimation method 1>
The method described in Non-Patent Document 2 is a known conventional technique for estimating the parameters of the Fujisaki model from the F₀ pattern of a speech signal. This method first preprocesses the observed F₀ pattern to smooth it. Specifically, after removing gross errors, correcting microprosody, and interpolating short silent and unvoiced intervals, it approximates the F₀ pattern with a piecewise cubic curve that is continuous and differentiable everywhere. Next, using the local maxima and minima of the derivative of the resulting smoothed F₀ pattern as cues, it estimates the positions and magnitudes of the accent commands. It then inserts phrase commands from left to right, based on the pattern obtained by subtracting the estimated accent component from the observed F₀ pattern. Finally, the command sequence is perturbed slightly so as to minimize the mean squared error between the F₀ pattern generated from the estimated command sequence and the observed F₀ pattern, and the resulting command sequence is taken as the estimated parameters of the Fujisaki model.

<Fujisaki model parameter estimation method 2>
Other conventional techniques for estimating the parameters of the Fujisaki model from the F₀ pattern of a speech signal include the following (Non-Patent Documents 3 to 5). These methods use a probabilistic model of the F₀ pattern generation process, formulated on the basis of a discretized Fujisaki model, and estimate suitable parameters by solving an optimization problem for P(y|θ) under that model (y is the observed F₀ pattern and θ denotes the parameters of the Fujisaki model). In this model, the pair of a phrase command and an accent command, which is hard to handle because of the constraints on it, is treated as a value output stochastically from a hidden Markov model (HMM). The estimation algorithm marginalizes over each component and then estimates suitable parameters iteratively with the EM algorithm.

[Non-Patent Document 1] Hiroya Fujisaki, Sumio Ohno and Wentao Gu, “Physiological and physical mechanisms for fundamental frequency control in some tone languages and a command-response model for generation of their F0 contours,” Proceedings of International Symposium on Tonal Aspects of Languages: Emphasis on Tone Languages, Beijing, pp. 61-64 (2004-3).
[Non-Patent Document 2] S. Narusawa, N. Minematsu, K. Hirose, and H. Fujisaki, “A method for automatic extraction of model parameters from fundamental frequency contours of speech,” in Proc. ICASSP, 2002, pp. 509-512.
[Non-Patent Document 3] H. Kameoka, J. L. Roux, and Y. Ohishi, “A statistical model of speech F0 contours,” in Proc. SAPA, 2010, pp. 43-48.
[Non-Patent Document 4] Kota Yoshizato, Hirokazu Kameoka, Daisuke Saito, and Shigeki Sagayama, “Estimation of phrase and accent commands from speech signals using a statistical model of the F0 pattern generation process,” Proc. Spring Meeting of the Acoustical Society of Japan, 2012, no. 1-11-9, pp. 311-314 (in Japanese).
[Non-Patent Document 5] K. Yoshizato, H. Kameoka, D. Saito, and S. Sagayama, “Statistical approach to fujisaki-model parameter estimation from speech signals and its quantitative evaluation,” in Proc. Speech Prosody 2012, 2012, pp. 175-178.

The present invention relates to a method for estimating the parameters of the Fujisaki model from the F₀ pattern of speech.

This estimation problem is an ill-posed inverse problem and is therefore difficult to solve analytically. In non-tonal languages such as Japanese, there is a constraint that the magnitudes of phrase commands and accent commands must be non-negative. Although this non-negativity is an important constraint for narrowing down the solution, the Fujisaki model parameter estimation methods described in Non-Patent Documents 3 to 5 could not incorporate it directly into the optimization problem. The technique described in Non-Patent Document 5 addressed this with an ad hoc approach, solving a non-negativity-constrained deconvolution problem when back-calculating the command sequence from the phrase and accent components. As a result, however, the convergence of the algorithm was no longer guaranteed, and the error between the F₀ pattern generated from the estimated parameters and the observed F₀ pattern was large. This problem of large error is also seen in the Fujisaki model parameter estimation method described in Non-Patent Document 2.

The present invention has been made in view of the above circumstances, and its object is to provide a fundamental frequency model parameter estimation device, method, and program that can estimate the parameters of the Fujisaki model using a constraint on the non-negativity of phrase commands and accent commands.

To achieve the above object, a fundamental frequency model parameter estimation device according to the present invention takes a speech signal as input and estimates: a command state sequence ^s consisting of the states s_k of a hidden Markov model at each time k; a command function ^o consisting of the pairs ^o[k], at each time k, of a phrase command u_p[k] representing the fundamental frequency pattern produced by the translational motion of the thyroid cartilage and an accent command u_a[k] representing the fundamental frequency pattern produced by the rotational motion of the thyroid cartilage; and a parameter group θ representing the phrase command amplitude A_p[k] corresponding to the state s_k at each time k and the amplitude A_a^(n) of each accent command n. The device comprises: fundamental frequency extraction means for extracting, from time-series data of the speech signal, an observed fundamental frequency sequence ^y representing the fundamental frequency of the speech signal at each time k; voiced/unvoiced interval estimation means for estimating the degree of uncertainty of the fundamental frequency at each time k, according to whether the time-series data of the speech signal at that time belongs to a voiced interval or an unvoiced interval; initial value setting means for setting the initial value of the command function ^o and the initial value of the parameter group θ; command state sequence posterior probability update means for computing, based on the previously updated command function ^o′ or the initial value ^o′ of the command function ^o, the posterior probability P(^s | ^y, ^o′, θ′) of the command state sequence ^s given the observed fundamental frequency sequence ^y, the command function ^o′, and the parameter group θ′; model parameter update means for updating the command function ^o and the parameter group θ, each of which is non-negative, so as to increase an objective function, based on the previously updated command function ^o′ or the initial value ^o′ of the command function ^o, the observed fundamental frequency sequence ^y, the degree of uncertainty at each time k, and the posterior probability P(^s | ^y, ^o′, θ′), the objective function being the log posterior probability log P(^o, θ | ^y) of the command function ^o and the parameter group θ given the observed fundamental frequency sequence ^y; and first convergence determination means for repeating the computation by the command state sequence posterior probability update means and the update by the model parameter update means until a predetermined convergence condition is satisfied.

A fundamental frequency model parameter estimation method according to the present invention takes a speech signal as input and estimates: a command state sequence ^s consisting of the states s_k of a hidden Markov model at each time k; a command function ^o consisting of the pairs ^o[k], at each time k, of a phrase command u_p[k] representing the fundamental frequency pattern produced by the translational motion of the thyroid cartilage and an accent command u_a[k] representing the fundamental frequency pattern produced by the rotational motion of the thyroid cartilage; and a parameter group θ representing the phrase command amplitude A_p[k] corresponding to the state s_k at each time k and the amplitude A_a^(n) of each accent command n. In the method, the fundamental frequency extraction means extracts, from time-series data of the speech signal, an observed fundamental frequency sequence ^y representing the fundamental frequency of the speech signal at each time k; the voiced/unvoiced interval estimation means estimates the degree of uncertainty of the fundamental frequency at each time k, according to whether the time-series data of the speech signal at that time belongs to a voiced interval or an unvoiced interval; the initial value setting means sets the initial value of the command function ^o and the initial value of the parameter group θ; the command state sequence posterior probability update means computes, based on the previously updated command function ^o′ or the initial value ^o′ of the command function ^o, the posterior probability P(^s | ^y, ^o′, θ′) of the command state sequence ^s given the observed fundamental frequency sequence ^y, the command function ^o′, and the parameter group θ′; the model parameter update means updates the command function ^o and the parameter group θ, each of which is non-negative, so as to increase an objective function, based on the previously updated command function ^o′ or the initial value ^o′ of the command function ^o, the observed fundamental frequency sequence ^y, the degree of uncertainty at each time k, and the posterior probability P(^s | ^y, ^o′, θ′), the objective function being the log posterior probability log P(^o, θ | ^y) of the command function ^o and the parameter group θ given the observed fundamental frequency sequence ^y; and the first convergence determination means repeats the computation by the command state sequence posterior probability update means and the update by the model parameter update means until a predetermined convergence condition is satisfied.

A program according to the present invention is a program for causing a computer to function as each means of the fundamental frequency model parameter estimation device described above.

As described above, the fundamental frequency model parameter estimation device, method, and program of the present invention update the command function ^o and the parameter group θ, each of which is non-negative, so as to increase an objective function given by the log posterior probability log P(^o, θ | ^y) of the command function ^o and the parameter group θ given the observed fundamental frequency sequence ^y. This has the effect that the parameters of the Fujisaki model can be estimated using a constraint on the non-negativity of phrase commands and accent commands.

FIG. 1 is a diagram for explaining the HMM.
FIG. 2 is a diagram for explaining the splitting of states.
FIG. 3 is a schematic diagram showing the configuration of the fundamental frequency model parameter estimation device according to the embodiment of the present invention.
FIG. 4 is a flowchart showing the fundamental frequency model parameter estimation routine in the fundamental frequency model parameter estimation device according to the embodiment of the present invention.
FIG. 5 is a diagram for explaining the matching of command functions.
FIG. 6 is a diagram showing experimental results.
FIG. 7 is a diagram showing experimental results.

Embodiments of the present invention are described in detail below with reference to the drawings. To achieve Fujisaki model parameter estimation that reproduces the observed F₀ pattern with high fidelity, the method proposed in the present invention formulates a probabilistic model of the F₀ pattern generation process based on the Fujisaki model and assumes that the observed F₀ pattern was generated according to it. The parameter estimation algorithm for the Fujisaki model is likewise based on this probabilistic model.

<Probabilistic model of the F₀ pattern generation process>
The principle of the present invention is described first, starting with the probabilistic model of the F₀ pattern generation process.

To incorporate the various constraints on the command functions into the model, consider an HMM that outputs the pair of a phrase command and an accent command, ^o[k] = (u_p[k], u_a[k])^T. In this HMM, the output command function {^o[k]}_{k=1}^K follows a Gaussian distribution at each time and is expressed probabilistically as

^o[k] | s_k ~ N(^ν_{s_k}[k], ^Υ_{s_k})

Here {s_k}_{k=1}^K is the state sequence of the HMM, and the mean vector ^ν_{s_k}[k] and the variance-covariance matrix ^Υ_{s_k} are determined by the state transitions of the HMM. The specific structure of the HMM is shown in FIG. 1. Symbols that denote matrices or vectors are prefixed with "^".

In addition, to parameterize the duration of self-transitions, each state is split into a number of sub-states. All sub-states of a state share the same output distribution, and the number of sub-states is set to a sufficiently large value. FIG. 2 shows an example of splitting the state a_n. For example, by setting the transition probability from a_{n,m} to a_{n,m+1} to 1 for all m ≠ 0, as in FIG. 2, the transition probability from a_{n,0} to a_{n,m} corresponds to the probability that the state a_n persists for exactly m steps, which makes it possible to control the duration of accent commands flexibly. Similarly, splitting p₁, p₀, and a₀ into sub-states makes it possible to parameterize the distributions of the phrase command duration and of the interval between commands. With this splitting in mind, the states are hereafter written as p₀ = {p_{0,0}, p_{0,1}, ...}, a₀ = {a_{0,0}, a_{0,1}, ...}, and a_n = {a_{n,0}, a_{n,1}, ...}.

The configuration of the proposed HMM can be formalized as follows.

Given a state sequence ^s = {s_k}_{k=1}^K, this HMM outputs the pair of the phrase command function u_p[k] and the accent command function u_a[k]. As shown in Equations (2) and (4), u_p[k] and u_a[k] are convolved with the filters G_p[k] and G_a[k], respectively, to produce the phrase component x_p[k] and the accent component x_a[k]:

x_p[k] = G_p[k] * u_p[k]
x_a[k] = G_a[k] * u_a[k]

where * denotes convolution with respect to discrete time k. The F₀ pattern x[k] is then the superposition of three components,

x[k] = x_p[k] + x_a[k] + u_b

where u_b is a baseline component that does not depend on time.

In real speech, a reliable value of the fundamental frequency F₀ cannot always be observed. For example, F₀ estimates obtained by pitch extraction from speech data are entirely unreliable in unvoiced intervals. When estimating the parameters of the Fujisaki model, it is desirable to take into account only the F₀ values in reliable observation intervals and to ignore the others. The proposed model therefore introduces the degree of uncertainty v_n²[k] of the observed F₀ value at time k. Specifically, the observed F₀ value y[k] is expressed as the superposition of the true F₀ value x[k] and a noise component x_n[k] ~ N(0, v_n²[k]),

y[k] = x[k] + x_n[k]

so that all observation intervals can be treated uniformly, whether or not they are reliable.
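The effect of the per-frame uncertainty can be seen in a small Python sketch of the resulting Gaussian log-likelihood. It assumes the noise is independent across frames; the function name and the numbers in the example are illustrative only.

```python
import math


def log_likelihood(y, x, noise_var):
    """log P(y | x) under y[k] = x[k] + x_n[k], with x_n[k] ~ N(0, noise_var[k]).

    Frames are assumed independent (an assumption of this sketch).
    """
    total = 0.0
    for y_k, x_k, v_k in zip(y, x, noise_var):
        total += -0.5 * math.log(2.0 * math.pi * v_k)
        total -= (y_k - x_k) ** 2 / (2.0 * v_k)
    return total


# Hypothetical frames: two reliable ones and one with a very large variance.
x_true = [5.0, 5.0, 5.0]
variances = [0.04, 0.04, 1e15]
ll_a = log_likelihood([5.0, 5.1, 0.0], x_true, variances)
ll_b = log_likelihood([5.0, 5.1, 100.0], x_true, variances)
```

Here `ll_a` and `ll_b` are practically identical, because the squared error of the third frame is divided by its huge variance. A frame marked as unreliable therefore has almost no influence on the fit, without needing a separate code path for unvoiced intervals.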

The quantities φ_{i′,i}, u_b, v_{p,s_k}², v_{a,s_k}², v_n²[k], α, and β are treated as constants, and {A_p[k]}_{k=1}^K and {A_a^(n)}_{n=1}^N are assumed to be uniformly distributed. Marginalizing out x_n[k] then gives the probability density function of ^y = {y[k]}_{k=1}^K given the output value sequence ^o = {^o[k]}_{k=1}^K:

P(^y | ^o) = ∏_{k=1}^{K} N(y[k]; x[k], v_n²[k])

Given the state sequence ^s = {s_k}_{k=1}^K and the parameter group θ = {{A_p[k]}_{k=1}^K, {A_a^(n)}_{n=1}^N} representing the command amplitudes, the output value sequence ^o is generated according to

P(^o | ^s, θ) = ∏_{k=1}^{K} N(^o[k]; ^ν_{s_k}[k], ^Υ_{s_k})

P(^s) can be written as a product of state transition probabilities,

P(^s) = φ_{s_1} ∏_{k=2}^{K} φ_{s_{k-1},s_k}

where φ_{s_1} denotes the probability that the initial state is s₁.
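As a small illustration of how the state sequence probability encodes the command constraints, the following Python sketch evaluates log P(^s) as a sum of log transition probabilities. The three-state topology and its probabilities are a hypothetical toy, not the HMM of FIG. 1; transitions that are absent from the table are treated as having probability zero.

```python
import math


def log_state_sequence_prob(states, initial, transition):
    """log P(s) = log phi_{s_1} + sum over k of log phi_{s_{k-1}, s_k}."""
    def safe_log(p):
        # Forbidden transitions have probability 0, i.e. log-probability -inf.
        return math.log(p) if p > 0.0 else float("-inf")

    logp = safe_log(initial.get(states[0], 0.0))
    for prev, cur in zip(states, states[1:]):
        logp += safe_log(transition.get(prev, {}).get(cur, 0.0))
    return logp


# Toy model (hypothetical): "p1" emits a phrase command, "p0" is a rest
# state, and "a1" emits an accent command.
TOY_INITIAL = {"p1": 1.0}                  # utterance starts with a phrase command
TOY_TRANSITION = {
    "p1": {"p0": 1.0},                     # no two phrase commands in succession
    "p0": {"p0": 0.9, "a1": 0.1},
    "a1": {"a1": 0.8, "p0": 0.2},
}

allowed = log_state_sequence_prob(
    ["p1", "p0", "a1", "a1", "p0"], TOY_INITIAL, TOY_TRANSITION)
forbidden = log_state_sequence_prob(["p1", "p1"], TOY_INITIAL, TOY_TRANSITION)
```

In this toy, a sequence that violates a constraint receives a log-probability of negative infinity, so it can never be selected, while permitted sequences are ranked by their transition probabilities.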

<Parameter estimation algorithm for the Fujisaki model>
In the present invention, the Fujisaki model parameter estimation problem is formulated as a maximum a posteriori (MAP) estimation problem: find the ^o and θ that maximize the posterior probability P(^o, θ | ^y) of the parameters ^o and θ given the observed fundamental frequency sequence ^y. Treating the command state sequence ^s as a latent variable, a locally optimal solution for ^o and θ is searched for by iterative computation based on the Expectation-Maximization (EM) algorithm. The EM algorithm iteratively increases a lower bound (called the Q function) on the log posterior probability log P(^o, θ | ^y) of the parameters, thereby indirectly increasing log P(^o, θ | ^y) itself. The Q function for this problem is written as

where an equals sign with a superscript c denotes equality up to constant terms, and ^o′ and θ′ are the values of ^o and θ at the previous iteration.
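The Q function itself is not printed in this text. A form consistent with the definitions above can be sketched as follows, assuming frame-wise independence, constant transition probabilities, and a uniform prior on θ; here \hat{\cdot} stands for the "^" prefix used in this document, t ranges over the HMM states, and the mean \hat{\nu}_t[k] depends on θ through the command amplitudes. This is a reconstruction under those assumptions, not necessarily the exact expression of the original.

```latex
\begin{aligned}
Q(\hat{o},\theta;\hat{o}',\theta')
 &= \sum_{\hat{s}} P(\hat{s}\mid\hat{y},\hat{o}',\theta')\,
    \log\bigl[P(\hat{y}\mid\hat{o})\,P(\hat{o}\mid\hat{s},\theta)\,P(\hat{s})\bigr] \\
 &\overset{c}{=} \log P(\hat{y}\mid\hat{o})
   + \sum_{k=1}^{K}\sum_{t} P(s_k=t\mid\hat{y},\hat{o}',\theta')\,
     \log \mathcal{N}\bigl(\hat{o}[k];\,\hat{\nu}_{t}[k],\,\hat{\Upsilon}_{t}\bigr)
\end{aligned}
```

The first term ties ^o to the observation through the uncertainty-weighted squared error, and the second term ties ^o and θ to the HMM through the state posteriors. The term in log P(^s) is absorbed into the constant because the transition probabilities are fixed.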

<System configuration>
Next, an embodiment of the present invention is described, taking as an example the case where the invention is applied to a fundamental frequency model parameter estimation device that analyzes time-series data of an observed speech signal and estimates the parameters of the Fujisaki model.

As shown in FIG. 3, the fundamental frequency model parameter estimation device according to the embodiment of the present invention is implemented on a computer comprising a CPU, a RAM, and a ROM that stores a program for executing the fundamental frequency model parameter estimation routine described later. Functionally, it is configured as follows.

As shown in FIG. 3, the fundamental frequency model parameter estimation device comprises a storage unit 1, a fundamental frequency sequence extraction unit 2, a voiced/unvoiced interval estimation unit 3, an initial value setting unit 4, a command state sequence posterior probability update unit 5, a model parameter update unit 6, a convergence determination unit 7, a state sequence calculation unit 8, and an output unit 9.

The storage unit 1 stores time-series data of the observed speech signal.

The fundamental frequency sequence extraction unit 2 extracts time-series data of the fundamental frequency from the time-series data of the speech signal, converts it to a representation in discrete time k, and takes the result as the observed fundamental frequency sequence ^y = {F₀[k]} (k = 1, ..., K), which is the time-series data of the fundamental frequency of the speech signal. This extraction can be implemented with well-known techniques. For example, the fundamental frequency is extracted every 8 ms using the method described in Non-Patent Document 6 (H. Kameoka, "Statistical speech spectrum model incorporating all-pole vocal tract model and F0 contour generating process model," in Tech. Rep. IEICE, 2010, in Japanese).

The voiced/unvoiced interval estimation unit 3 identifies voiced and unvoiced intervals from the time-series data of the speech signal and, for each discrete time k, estimates the degree of uncertainty v_n²[k] of the observed F₀[k] value according to whether that time belongs to a voiced or an unvoiced interval. The degree of uncertainty is estimated to be large in unvoiced intervals (for example, v_n²[k] = 10¹⁵) and small in voiced intervals (for example, v_n²[k] = 0.22).
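A minimal Python sketch of this assignment follows. It assumes the voiced/unvoiced decision is already available as one boolean per frame; the function name is hypothetical, and the default variances are simply the example values as printed in the text above.

```python
def uncertainty_per_frame(voiced_flags, voiced_var=0.22, unvoiced_var=1e15):
    """Map per-frame voicing decisions to the uncertainty v_n^2[k].

    voiced_flags: iterable of booleans, True where frame k is voiced.
    The default variances are the example values quoted in the text.
    """
    return [voiced_var if voiced else unvoiced_var for voiced in voiced_flags]


# Hypothetical voicing decisions for five frames.
v_n2 = uncertainty_per_frame([True, True, False, False, True])
```

The resulting list can be passed directly as the per-frame noise variance of the observation model, so unvoiced frames are effectively ignored during estimation.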

The initial value setting unit 4 treats the following parameters used in later processing as constants and sets their values: the number of accent commands N, the number of EM iterations M, α, β, v_p²[k], v_a²[k], and u_b. A suitable value is chosen for each. The initial value setting unit 4 also determines the number of HMM sub-states and the transition probabilities φ_{i′,i} by learning from ground-truth data prepared in advance. It sets the initial value of ^o (non-negative) using the Fujisaki model parameter estimation method described in Non-Patent Document 2. Finally, it sets the initial value of A_p[k] to a linear interpolation of the amplitudes of the phrase command function in ^o, and sets a suitable initial value for A_a^(n).

In this embodiment, based on the Q function of equation (15) above, a locally optimal solution for the Fujisaki model parameters ^o and θ is obtained by alternating two steps: the step performed by the command state sequence posterior probability update unit 5 and the step performed by the model parameter update unit 6.

The command state sequence posterior probability update unit 5 computes the posterior probability P(^s | ^y, ^o′, θ′) of the command state sequence (the latent variable); in the EM algorithm this is called the E step. With the Forward-Backward algorithm, P(s_k = t | ^y, ^o′, θ′) can be obtained efficiently for every k and t. Specifically, when the posterior is rewritten as

each P(s_k = t | {y[l], ^o′[l], θ′[l]}_{l=1}^{k}) can be computed by solving the recursion

sequentially for k = 1, 2, ..., K, and each P({y[l], ^o′[l], θ′[l]}_{l=k+1}^{K} | s_k = t) can be computed by solving the recursion

sequentially for k = K, K-1, ..., 1.

In this way, for every combination (k, t) of time k and state t, the command state sequence posterior probability update unit 5 computes the posterior probability P(s_k = t | ^y, ^o′, θ′) from the previously updated command function ^o′ (or its initial value), and thereby obtains the posterior probability P(^s | ^y, ^o′, θ′) of the command state sequence ^s given the observed fundamental frequency sequence ^y, the command function ^o′, and the parameter group θ′.
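The forward and backward passes of the E step can be illustrated with a textbook scaled Forward-Backward routine for a generic HMM. This is a sketch under stated assumptions, not the patent's own recursions: the function name, the argument layout, and the use of a precomputed per-frame likelihood table `lik[k, t]` (standing in for the model-specific term that depends on ^y, ^o′ and θ′) are all hypothetical.

```python
import numpy as np

def forward_backward(init, trans, lik):
    """State posteriors P(s_k = t | all observations) for a generic HMM.

    init:  (T,)   initial state probabilities
    trans: (T, T) trans[i, j] = P(s_{k+1} = j | s_k = i)
    lik:   (K, T) lik[k, t] = likelihood of frame k under state t
    """
    K, T = lik.shape
    alpha = np.zeros((K, T))
    beta = np.ones((K, T))
    scale = np.zeros(K)

    # Forward pass, k = 1, ..., K (0-based here), rescaled at each frame.
    alpha[0] = init * lik[0]
    scale[0] = alpha[0].sum()
    alpha[0] /= scale[0]
    for k in range(1, K):
        alpha[k] = (alpha[k - 1] @ trans) * lik[k]
        scale[k] = alpha[k].sum()
        alpha[k] /= scale[k]

    # Backward pass, k = K, ..., 1, using the same scaling factors.
    for k in range(K - 2, -1, -1):
        beta[k] = trans @ (lik[k + 1] * beta[k + 1]) / scale[k + 1]

    gamma = alpha * beta
    return gamma / gamma.sum(axis=1, keepdims=True)

# Toy example with two states and three frames.
gamma = forward_backward(
    np.array([0.5, 0.5]),
    np.array([[0.9, 0.1], [0.1, 0.9]]),
    np.array([[0.9, 0.1], [0.9, 0.1], [0.1, 0.9]]),
)
```

The per-frame rescaling keeps the recursions numerically stable for long sequences; the result is one posterior distribution over states per frame.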

The model parameter update unit 6 includes an auxiliary variable update unit 61, a command function update unit 62, a convergence determination unit 63, and an average amplitude update unit 64.

The model parameter update unit 6 updates the non-negative command function ^o and the parameter group θ so as to increase the objective function Q(^o, θ, ^o′, θ′); in the EM algorithm this is called the M step. The log P(^y | ^o, θ) term can be written as

where G_b[k] = δ[k] (the Kronecker delta). It is difficult to find directly the ^o that maximizes equation (21) under the condition that the command functions u_p[k] and u_a[k] are non-negative, but an ^o that locally maximizes equation (21) can be found by iterative computation based on the auxiliary function method. Like the EM algorithm, the auxiliary function method increases the objective function by iteratively increasing a lower-bound function of it. A lower-bound function for equation (21) can be designed by using the fact that Jensen's inequality

holds. Here λ_{i,k,l} ≥ 0 are called auxiliary variables and satisfy Σ_i Σ_l λ_{i,k,l} = 1. The condition for equality in equation (22) is

The term Σ_{^s} P(^s | ^y, ^o′, θ′) log P(θ, ^o | ^s) can be written as

Therefore, the right-hand side of the inequality

(called the auxiliary function) is a lower-bound function of Q(^o, θ, ^o′, θ′), and is denoted Q′(^o, θ, ^o′, θ′). Differentiating the auxiliary function Q′(^o, θ, ^o′, θ′) with respect to u_i[l] gives

Hence, to obtain u_i[l] in the M step, the update equations of the auxiliary function method

are used to update λ_{i,k,l} and u_i[l] alternately, repeated a sufficient number of times.
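The alternating update of auxiliary variables and non-negative commands can be illustrated on a simplified problem. The sketch below is not equations (28) and (29): it covers only a weighted squared-error data term and ignores the prior term that comes from the command state sequence. The matrix `A` (a stand-in for the stacked impulse responses G_i[k-l]), the weights `w` (a stand-in for 1/v_n²[k]), and the function name are assumptions made for illustration. Applying Jensen's inequality to the squared error with weights λ that sum to one over the unknowns gives a bound that separates over the unknowns, so each can be updated in closed form.

```python
import numpy as np

def mm_nonneg_step(A, x, w, u, eps=1e-12):
    """One auxiliary-function update for min_u sum_k w[k] (x[k] - (A u)[k])^2, u >= 0.

    A: (K, J) non-negative matrix, x: (K,) non-negative targets,
    w: (K,) positive weights, u: (J,) current non-negative estimate.
    """
    pred = A @ u
    # Auxiliary variables at the equality condition of Jensen's inequality:
    # lam[k, j] is the share of pred[k] explained by unknown j.
    lam = (A * u) / (pred[:, None] + eps)
    # Closed-form minimizer of the separable upper bound for each u[j].
    num = (w[:, None] * A * x[:, None]).sum(axis=0)
    den = (w[:, None] * A ** 2 / (lam + eps)).sum(axis=0)
    return num / (den + eps)

# Toy problem: recover non-negative u from noiseless x = A u_true.
rng = np.random.default_rng(0)
A = rng.random((30, 5))
u_true = rng.random(5)
x = A @ u_true
w = np.ones(30)

u = np.ones(5)
err_before = np.sum(w * (x - A @ u) ** 2)
for _ in range(200):
    u = mm_nonneg_step(A, x, w, u)
err_after = np.sum(w * (x - A @ u) ** 2)
```

Because the numerator and denominator are non-negative, the estimate stays non-negative without any explicit projection, which is the property the text relies on for the phrase and accent commands.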

In this way, based on the previously updated phrase command u_p[k] at each time k (or its initial value), the auxiliary variable update unit 61 computes and updates the auxiliary variable λ_{p,k,l} according to equation (28) above for every combination (k, l) of times k and l (l < k). Likewise, based on the previously updated accent command u_a[k] at each time k (or its initial value), it computes and updates the auxiliary variable λ_{a,k,l} according to equation (28) for every combination (k, l).

The auxiliary variable update unit 61 also computes and updates the auxiliary variable λ_{b,k,l} from u_b according to equation (28) for every combination (k, l).

The command function update unit 62 updates the non-negative phrase command u_p[l] at each time l according to equation (29) above, based on the fundamental frequency sequence ^y, the degree of uncertainty v_n²[k], the posterior probability P(^s | ^y, ^o′, θ′) of the command state sequence updated by the command state sequence posterior probability update unit 5, and the auxiliary variable λ_{p,k,l} updated by the auxiliary variable update unit 61.

Based on the same fundamental frequency sequence ^y, degree of uncertainty v_n²[k], and posterior probability P(^s | ^y, ^o′, θ′), together with the auxiliary variable λ_{a,k,l} updated by the auxiliary variable update unit 61, the command function update unit 62 also updates the non-negative accent command u_a[l] at each time l according to equation (29).

Similarly, using the auxiliary variable λ_{b,k,l} updated by the auxiliary variable update unit 61 together with ^y, v_n²[k], and P(^s | ^y, ^o′, θ′), the command function update unit 62 updates the base component u_b according to equation (29).

The convergence determination unit 63 determines whether a predetermined convergence condition is satisfied. If it is not, the processing of the auxiliary variable update unit 61 and the command function update unit 62 is repeated. If it is, processing proceeds to the average amplitude update unit 64.

The convergence condition may be that the iteration count s has reached a predetermined number S (for example, 20). Alternatively, the condition may be that the difference between the auxiliary function value obtained with the parameters of iteration s-1 and that obtained with the parameters of iteration s has fallen below a predetermined threshold.

Continuing the M step, the average amplitude update unit 64 then updates θ = {{A_p[k]}_{k=1}^{K}, {A_a^(n)}_{n=1}^{N}}. Differentiating Q′(^o, θ, ^o′, θ′) with respect to A_p[k] and A_a^(n) gives

Hence the update equations for A_p[k] and A_a^(n) in the M step can be written as


In this way, the average amplitude update unit 64 updates the parameter group θ. It updates the phrase command amplitude A_p[k] at each time k according to equation (32) above, based on the phrase command u_p[k] at each time k updated by the command function update unit 62. It also updates the amplitude A_a^(n) of each accent command n according to equation (32), based on the accent command u_a[k] at each time k updated by the command function update unit 62 and the posterior probability P(^s | ^y, ^o′, θ′) of the command state sequence updated by the command state sequence posterior probability update unit 5.

The convergence determination unit 7 determines whether a predetermined convergence condition is satisfied. If it is not, the updated values are substituted into ^o′ and θ′, and the iterative algorithm (the processing of the command state sequence posterior probability update unit 5 and the model parameter update unit 6) is repeated. If it is, processing proceeds to the state sequence calculation unit 8.

The convergence condition may be that the iteration count r has reached a predetermined number R (for example, 20). Alternatively, the condition may be that the difference between the objective function value obtained with the parameters of iteration r-1 and that obtained with the parameters of iteration r has fallen below a predetermined threshold.

Finally, the state sequence calculation unit 8 obtains the optimal state sequence ^s* using the Viterbi algorithm. Specifically, δ_t[k] and ψ_t[k] are obtained by solving the recursion

sequentially for k = 1, 2, ..., K, and the optimal state sequence is then determined from them.

In this way, the state sequence calculation unit 8 computes the state sequence ^s according to equations (33) to (37) above, based on the command function ^o finally updated by the model parameter update unit 6. The output unit 9 then outputs the command function ^o, the parameter group θ, and the state sequence ^s.
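The roles of δ_t[k] (best score ending in state t at time k) and ψ_t[k] (back-pointer) can be illustrated with a generic log-domain Viterbi decoder. This is a textbook sketch, not a reproduction of equations (33) to (37): the function name and the precomputed log-likelihood table `log_lik[k, t]`, which stands in for the model-specific emission term, are assumptions.

```python
import numpy as np

def viterbi(log_init, log_trans, log_lik):
    """Most probable state path for a generic HMM, in the log domain.

    log_init:  (T,)   log initial state probabilities
    log_trans: (T, T) log_trans[i, j] = log P(s_{k+1} = j | s_k = i)
    log_lik:   (K, T) log-likelihood of frame k under state t
    """
    K, T = log_lik.shape
    delta = np.empty((K, T))           # best log score ending in each state
    psi = np.zeros((K, T), dtype=int)  # best predecessor of each state

    delta[0] = log_init + log_lik[0]
    for k in range(1, K):
        scores = delta[k - 1][:, None] + log_trans  # scores[i, j]
        psi[k] = scores.argmax(axis=0)
        delta[k] = scores.max(axis=0) + log_lik[k]

    # Backtrack from the best final state.
    path = np.empty(K, dtype=int)
    path[-1] = delta[-1].argmax()
    for k in range(K - 2, -1, -1):
        path[k] = psi[k + 1, path[k + 1]]
    return path

# Toy example: evidence favours state 0 for two frames, then state 1.
best_path = viterbi(
    np.log(np.array([0.5, 0.5])),
    np.log(np.array([[0.9, 0.1], [0.1, 0.9]])),
    np.log(np.array([[0.9, 0.1], [0.9, 0.1], [0.1, 0.9], [0.1, 0.9]])),
)
```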

<Operation of the fundamental frequency model parameter estimation device>
Next, the operation of the fundamental frequency model parameter estimation device 100 according to this embodiment is described. First, time-series data of an observed speech signal is input to the fundamental frequency model parameter estimation device 100 as the analysis target and stored in the storage unit 1. The fundamental frequency model parameter estimation device 100 then executes the fundamental frequency model parameter estimation processing routine shown in FIG. 4.

First, in step S101, the time-series data of the speech signal is read from the storage unit 1, and the fundamental frequency sequence ^y consisting of the fundamental frequency F₀ at each time k is extracted. In step S102, voiced and unvoiced intervals are identified from the time-series data of the speech signal, and the degree of uncertainty v_n²[k] of the fundamental frequency at each time k is estimated.

In the next step S103, suitable initial values are set for the parameters N, M, α, β, v_p²[k], v_a²[k], and u_b, and the number of HMM sub-states and the transition probabilities φ_{i′,i} are determined by learning from ground-truth data prepared in advance. In addition, the command sequence ^o is estimated by a conventional method and set as its initial value, and the initial values of A_p[k] and A_a^(n) are set.

Then, in step S104, based on the initial value of the command sequence ^o set in step S103, or on the command sequence ^o last updated in step S105 described below, the posterior probability P(s_k = t | ^y, ^o′, θ′) is updated for every combination (k, t), thereby updating the posterior probability P(^s | ^y, ^o′, θ′) of the command state sequence.

In step S105, the command sequence ^o and the parameter group θ representing the command amplitudes are updated so as to increase the objective function Q(^o, θ, ^o′, θ′). The update is based on the initial value of the command sequence ^o set in step S103 (or the command sequence ^o last updated in step S105), the fundamental frequency sequence ^y calculated in step S101, the degree of uncertainty v_n²[k] at each time k calculated in step S102, and the posterior probability P(^s | ^y, ^o′, θ′) of the command state sequence updated in step S104.

Step S105 is realized by the processing of steps S111 to S114 below.

In step S111, based on the initial value of the command sequence ^o set in step S103, or on the command sequence ^o last updated in step S112 described below, the auxiliary variables λ_{p,k,l}, λ_{a,k,l}, and λ_{b,k,l} are computed and updated according to equation (28) above for every combination (k, l).

In the next step S112, the command sequence ^o, consisting of the non-negative phrase command u_p[l] and accent command u_a[l] at each time l, and the base component u_b are updated according to equation (29) above. The update is based on the fundamental frequency sequence ^y calculated in step S101, the degree of uncertainty v_n²[k] at each time k calculated in step S102, the posterior probability P(^s | ^y, ^o′, θ′) of the command state sequence updated in step S104, and the auxiliary variables λ_{p,k,l}, λ_{a,k,l}, and λ_{b,k,l} updated in step S111.

In the next step S113, it is determined, as the convergence condition, whether the iteration count s has reached S. If s has not reached S, the convergence condition is judged not to be satisfied, and processing returns to step S111 to repeat steps S111 and S112. If s has reached S, the convergence condition is judged to be satisfied, and in step S114 the parameter group θ is updated by updating the phrase command amplitude A_p[k] at each time k and the amplitude A_a^(n) of each accent command n according to equation (32) above, based on the phrase command u_p[k] and accent command u_a[k] at each time k updated in step S112 and the posterior probability P(^s | ^y, ^o′, θ′) of the command state sequence updated in step S104.

Then, in step S106, it is determined, as the convergence condition, whether the iteration count r has reached R. If r has not reached R, the convergence condition is judged not to be satisfied; in step S107 the command function ^o and parameter group θ updated in step S105 are substituted into ^o′ and θ′, and processing returns to step S104 to repeat steps S104 and S105. If r has reached R, the convergence condition is judged to be satisfied; in step S108 the state sequence ^s is computed according to equations (33) to (37) above, based on the command function ^o finally updated in step S105. The output unit 9 then outputs the command function ^o, the parameter group θ representing the command amplitudes, and the state sequence ^s, and the fundamental frequency model parameter estimation processing routine ends.
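The control flow of steps S104 to S114 amounts to an outer loop of R passes, each containing one E step, S inner auxiliary-function iterations, and one amplitude update. The skeleton below sketches only that nesting with fixed iteration counts. The four callables are hypothetical placeholders for the computations of units 5, 61, 62 and 64, and their signatures are assumptions, not something the text specifies.

```python
def estimate(e_step, update_aux, update_commands, update_amplitudes,
             o_init, theta_init, R=20, S=20):
    """Nested iteration of steps S104 to S114 (hypothetical skeleton)."""
    o, theta = o_init, theta_init
    for _ in range(R):                          # S106: stop after R passes
        posterior = e_step(o, theta)            # S104: E step
        for _ in range(S):                      # S113: stop after S passes
            lam = update_aux(o)                 # S111: auxiliary variables
            o = update_commands(o, lam, posterior)   # S112: commands and u_b
        theta = update_amplitudes(o, posterior)      # S114: amplitudes
        # S107: the updated o and theta serve as o' and theta' next pass.
    return o, theta

# Usage with counting stubs, to show how often each stage runs.
calls = {"e": 0, "aux": 0, "cmd": 0, "amp": 0}

def _e(o, theta):
    calls["e"] += 1

def _aux(o):
    calls["aux"] += 1

def _cmd(o, lam, posterior):
    calls["cmd"] += 1
    return o + 1

def _amp(o, posterior):
    calls["amp"] += 1
    return o

o_final, theta_final = estimate(_e, _aux, _cmd, _amp, 0, 0, R=3, S=4)
```

A threshold on the change in the objective or auxiliary function, which the text also allows as a convergence condition, could replace either fixed count.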

<Experiment>
An important result of this embodiment is that the Fujisaki model has been successfully expressed as a probabilistic model. The inventors believe that incorporating the model proposed in this embodiment into the many speech applications built on statistical methods will, in the future, yield a powerful way of handling prosody. For that purpose, it would be very convenient if the phrase and accent command functions, which are the parameters of the Fujisaki model, could be learned automatically from a speech corpus in the same way as spectral features. In this respect, the proposed model and algorithm, being formulated as a probabilistic model, can be said to be superior to non-statistical methods such as that of Non-Patent Document 2. However, it is not yet clear whether the proposed algorithm estimates Fujisaki model parameters from real speech better than existing methods. An experiment was therefore conducted to evaluate quantitatively the parameter estimation performance of the method proposed in this embodiment.

The detailed experimental conditions are as follows. The real speech data used in this experiment was set B of the ATR Japanese speech database (see Non-Patent Document 7: A. Kurematsu, K. Takeda, Y. Sagisaka, S. Katagiri, H. Kuwabara, and K. Shikano, "ATR japanese speech database as a tool of speech recognition and synthesis," Speech Communication, vol. 27, pp. 187-207, 1999). This is a speech database of 503 phonetically balanced sentences, from which one male speaker (MHT) was selected. Phrase and accent command functions determined manually for this database by an expert in prosody research were used as ground-truth data. The observed F₀ pattern given as input to the proposed method was extracted from the speech data using the algorithm of Non-Patent Document 6, previously proposed by the inventors.

The constant parameters were set as follows: N = 10; discrete-time sampling interval t₀ = 8 ms; number of EM iterations M = 20; α = 3.0 rad/s; β = 20.0 rad/s; v_p²[k] = 0.2²; v_a²[k] = 0.1²; v_n²[k] = 10¹⁵ in unvoiced intervals and v_n²[k] = 0.22 in voiced intervals; and u_b equal to the minimum value of log F₀ in the voiced intervals. The number of HMM sub-states and the transition probabilities φ_{i′,i} were determined by learning from the ground-truth data of the 200 sentences No. 1 to No. 200 of set B of the ATR Japanese speech database. The initial value of ^o was determined using the method described in Non-Patent Document 2. The initial value of A_p[k] was a linear interpolation of the amplitudes of the phrase command function in ^o, and the initial value of A_a^(n) was 0.1n.

The parameter estimation experiment was run on the 303 sentences No. 201 to No. 503. Two criteria were considered in evaluating the estimated parameters: how well they reproduce the observed F₀ pattern, and how linguistically plausible they are. These are generally in a trade-off relationship. For example, placing a large number of fine-grained commands within a short interval can reproduce the observed F₀ pattern very well, but the resulting command function cannot be called linguistically plausible. The aim of this experiment was therefore to confirm that the parameters estimated by the proposed method reproduce the observed F₀ pattern very well while remaining sufficiently plausible linguistically.

Reproduction of the observed F₀ pattern was evaluated by the root-mean-square error between the observed F₀ pattern and the F₀ pattern reconstructed from the estimated command functions (log F₀ [Hz] RMSE); the smaller this value, the better the reproduction. Linguistic plausibility was evaluated by a value called the detection rate; the larger it is, the more linguistically plausible the parameters.

The detection rate is defined as follows. As in the example of FIG. 5, the estimated parameter sequence is compared with the ground-truth parameter sequence, and commands are matched one to one. Two commands can be matched if they are of the same type (both phrase commands or both accent commands) and their time difference is at most S = 0.3 seconds. For accent commands, the average of the onset time and the offset time is used as the reference time. In addition, no two matches may cross in time. With a score of 1 for a matched pair and 0 otherwise, the matching that satisfies these conditions and maximizes the total score can be found by dynamic programming. Let N_M be the total number of matches when this matching is performed over all 303 sentences used in the estimation experiment, N_E the total number of commands in the estimated parameter sequences, and N_A the total number of commands in the ground-truth parameter sequences. The insertion error E_I is defined as (N_E - N_M)/N_A, the deletion error E_D as (N_A - N_M)/N_A, and the final detection rate D as 1 - E_I - E_D. This definition of the detection rate does not take command amplitudes into account. The reason is that the amplitudes of phrase and accent commands depend strongly on the value of the baseline component, and this value differs greatly between the proposed method and the ground-truth data. Specifically, the proposed method sets the baseline value u_b to the minimum of log F₀ in the voiced intervals, whereas the ground-truth data always fixes it at log 60 Hz, and fixing u_b in the proposed method was found to degrade estimation performance.
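The detection rate defined above can be sketched in two parts: a dynamic-programming count of non-crossing matches, and the formula D = 1 - E_I - E_D. This is an illustrative sketch rather than the evaluation code used in the experiment. It assumes each command is represented as a hypothetical `(kind, time)` pair sorted by time, with the accent time already taken as the midpoint of onset and offset; the function names are also assumptions.

```python
def count_matches(est, ref, max_shift=0.3):
    """Maximum number of non-crossing matches between two command lists.

    est, ref: lists of (kind, time) pairs sorted by time, where kind is
    e.g. "p" (phrase) or "a" (accent). Processing both lists in time
    order guarantees that matches never cross.
    """
    n, m = len(est), len(ref)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best = max(dp[i - 1][j], dp[i][j - 1])
            (kind_e, t_e), (kind_r, t_r) = est[i - 1], ref[j - 1]
            if kind_e == kind_r and abs(t_e - t_r) <= max_shift:
                best = max(best, dp[i - 1][j - 1] + 1)
            dp[i][j] = best
    return dp[n][m]

def detection_rate(n_matched, n_est, n_ref):
    """D = 1 - E_I - E_D with both errors normalized by N_A."""
    e_ins = (n_est - n_matched) / n_ref   # insertion error E_I
    e_del = (n_ref - n_matched) / n_ref   # deletion error E_D
    return 1.0 - e_ins - e_del

# Toy example: one spurious estimated accent command at 1.2 s.
est = [("p", 0.0), ("a", 0.5), ("a", 1.2), ("a", 2.0)]
ref = [("p", 0.1), ("a", 0.6), ("a", 1.9)]
n_m = count_matches(est, ref)
d = detection_rate(n_m, len(est), len(ref))
```

In the experiment the counts N_M, N_E and N_A are accumulated over all test sentences before the rate is computed; the example applies the formula to a single utterance only for brevity.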

FIG. 6 summarizes the parameter estimation results of the proposed method alongside those of the comparison method, the (non-statistical) parameter estimation algorithm of Non-Patent Document 2. As the results show, the detection rate of the proposed method is comparable to that of the comparison method, while its log F₀ RMSE is substantially lower. In other words, when estimating phrase and accent command functions from real speech, the proposed method achieves linguistic plausibility comparable to the existing method while reproducing the observed F₀ pattern better.

FIG. 7 shows a parameter estimation result of the proposed method. The upper graph shows the observed F₀ pattern in the voiced intervals (solid line) and the F₀ pattern reconstructed from the estimated parameters (dotted line); the lower graph shows the estimated phrase command function and the estimated accent command function. The input F₀ pattern in this example was obtained from No. 353 of set B of the ATR Japanese speech database. As this example shows, the present invention can estimate parameters such that the F₀ pattern reconstructed from them agrees very closely with the observed F₀ pattern.

As described above, the fundamental frequency model parameter estimation device according to the embodiment of the present invention takes as its objective function the lower-bound function Q(^o, θ, ^o′, θ′) of the log posterior probability log P(^o, θ | ^y) of the Fujisaki model parameters ^o and θ given the observed fundamental frequency sequence ^y, and updates the command function ^o and the parameter group θ, each non-negative, so as to increase that objective function. The parameters of the Fujisaki model can thereby be estimated under the non-negativity constraints on the phrase and accent commands.

In this embodiment, the M step of the EM algorithm consists of an iterative computation based on the auxiliary function method (alternating the update of λ with the update of ^o and θ). Because this iteration is guaranteed not to decrease Q(^o, θ, ^o′, θ′), convergence of the objective function value (= p(^o, θ | y)) is guaranteed.

In addition, the estimation algorithm according to the present embodiment, which estimates the parameters of the Fujisaki model from the F₀ pattern of speech, can directly incorporate the non-negativity of the phrase component and the accent component while still guaranteeing convergence. Specifically, formulating the probabilistic model representation of the Fujisaki model on the basis of a convolutive mixture hidden Markov model makes it possible both to incorporate the non-negativity of the phrase and accent components directly and to guarantee convergence. As a result, Fujisaki model parameters that reproduce the observed F₀ pattern with very high fidelity can be estimated.

The present invention is not limited to the embodiment described above, and various modifications and applications are possible without departing from the gist of the invention.

For example, the fundamental frequency model parameter estimation device described above contains a computer system; when a WWW system is used, the "computer system" is taken to include the homepage providing environment (or display environment) as well.

Although this specification describes an embodiment in which the program is installed in advance, the program can also be provided stored on a computer-readable recording medium.

DESCRIPTION OF SYMBOLS
1 Storage unit
2 Fundamental frequency sequence extraction unit
3 Voiced/unvoiced section estimation unit
4 Initial value setting unit
5 Command state sequence posterior probability update unit
6 Model parameter update unit
7 Convergence determination unit
8 State sequence calculation unit
61 Auxiliary variable update unit
62 Command function update unit
63 Convergence determination unit
64 Average amplitude update unit
100 Fundamental frequency model parameter estimation device

Claims

1. A fundamental frequency model parameter estimation device that takes a speech signal as input and estimates: a command state sequence ^s consisting of the state s_k at each time k of a hidden Markov model; a command function ^o consisting of the pairs ^o[k] of a phrase command u_p[k], which represents the fundamental frequency pattern produced by translational movement of the thyroid cartilage, and an accent command u_a[k], which represents the fundamental frequency pattern produced by rotational movement of the thyroid cartilage, at each time k; and a parameter group θ representing the amplitude A_p[k] of the phrase command according to the state s_k at each time k and the amplitude A_a^(n) of each accent command n, the device comprising:
fundamental frequency extraction means for extracting, from time-series data of the speech signal, an observed fundamental frequency sequence ^y representing the fundamental frequency of the speech signal at each time k;
voiced/unvoiced section estimation means for estimating the degree of uncertainty of the fundamental frequency at each time k according to whether the time-series data of the speech signal is in a voiced section or an unvoiced section;
initial value setting means for setting an initial value of the command function ^o and an initial value of the parameter group θ;
command state sequence posterior probability update means for calculating, based on the previously updated command function ^o' or the initial value ^o' of the command function ^o, the posterior probability P(^s | ^y, ^o', θ') of the command state sequence ^s given the observed fundamental frequency sequence ^y, the command function ^o', and the parameter group θ';
model parameter update means for updating the command function ^o and the parameter group θ, each of which is non-negative, based on the previously updated command function ^o' or the initial value ^o' of the command function ^o, the observed fundamental frequency sequence ^y, the degree of uncertainty at each time k, and the posterior probability P(^s | ^y, ^o', θ'), so as to increase an objective function that is the log posterior probability log P(^o, θ | ^y) of the command function ^o and the parameter group θ given the observed fundamental frequency sequence ^y; and
first convergence determination means for causing the calculation by the command state sequence posterior probability update means and the update by the model parameter update means to be repeated until a predetermined convergence condition is satisfied.

2. The fundamental frequency model parameter estimation device according to claim 1, wherein the model parameter update means includes:
auxiliary variable update means for calculating and updating an auxiliary variable λ_{p,k,l} for each combination (k, l) of times k and l, based on the previously updated phrase command u_p[l] at each time l or the initial value of the phrase command u_p[l] at each time l, and for calculating and updating an auxiliary variable λ_{a,k,l} for each combination (k, l) of times k and l, based on the previously updated accent command u_a[k] at each time k or the initial value of the accent command u_a[k] at each time k;
command function update means for updating the phrase command u_p[l] and the accent command u_a[l] at each time l, based on the observed fundamental frequency sequence ^y, the degree of uncertainty at each time k, the calculated posterior probability P(^s | ^y, ^o', θ') of the command state sequence, and the auxiliary variables λ_{p,k,l} and λ_{a,k,l} updated by the auxiliary variable update means, so as to increase a function that further lower-bounds the lower-bound function Q(^o, θ, ^o', θ') of the objective function;
second convergence determination means for causing the update by the auxiliary variable update means and the update by the command function update means to be repeated until a predetermined convergence condition is satisfied; and
average amplitude update means for updating the parameter group θ by updating the amplitude A_p[k] of the phrase command at each time k based on the phrase command u_p[l] at each time l updated by the command function update means, and by updating the amplitude A_a^(n) of each accent command n based on the accent command u_a[l] at each time l updated by the command function update means and the calculated posterior probability P(^s | ^y, ^o', θ') of the command state sequence.

3. The fundamental frequency model parameter estimation device according to claim 1, further comprising state sequence calculation means for calculating the command state sequence ^s based on the command function ^o finally updated by the model parameter update means.

4. A fundamental frequency model parameter estimation method that takes a speech signal as input and estimates: a command state sequence ^s consisting of the state s_k at each time k of a hidden Markov model; a command function ^o consisting of the pairs ^o[k] of a phrase command u_p[k], which represents the fundamental frequency pattern produced by translational movement of the thyroid cartilage, and an accent command u_a[k], which represents the fundamental frequency pattern produced by rotational movement of the thyroid cartilage, at each time k; and a parameter group θ representing the amplitude A_p[k] of the phrase command according to the state s_k at each time k and the amplitude A_a^(n) of each accent command n, the method comprising:
extracting, by fundamental frequency extraction means, from time-series data of the speech signal, an observed fundamental frequency sequence ^y representing the fundamental frequency of the speech signal at each time k;
estimating, by voiced/unvoiced section estimation means, the degree of uncertainty of the fundamental frequency at each time k according to whether the time-series data of the speech signal is in a voiced section or an unvoiced section;
setting, by initial value setting means, an initial value of the command function ^o and an initial value of the parameter group θ;
calculating, by command state sequence posterior probability update means, based on the previously updated command function ^o' or the initial value ^o' of the command function ^o, the posterior probability P(^s | ^y, ^o', θ') of the command state sequence ^s given the observed fundamental frequency sequence ^y, the command function ^o', and the parameter group θ';
updating, by model parameter update means, the command function ^o and the parameter group θ, each of which is non-negative, based on the previously updated command function ^o' or the initial value ^o' of the command function ^o, the observed fundamental frequency sequence ^y, the degree of uncertainty at each time k, and the posterior probability P(^s | ^y, ^o', θ'), so as to increase an objective function that is the log posterior probability log P(^o, θ | ^y) of the command function ^o and the parameter group θ given the observed fundamental frequency sequence ^y; and
causing, by first convergence determination means, the calculation by the command state sequence posterior probability update means and the update by the model parameter update means to be repeated until a predetermined convergence condition is satisfied.

5. A program for causing a computer to function as each means of the fundamental frequency model parameter estimation device according to any one of claims 1 to 3.