JP6137477B2

JP6137477B2 - Fundamental frequency model parameter estimation apparatus, method, and program

Info

Publication number: JP6137477B2
Application number: JP2013172366A
Authority: JP
Inventors: 弘和亀岡; 達馬石原; 幸太吉里; 大輔齋藤; 茂樹嵯峨山
Original assignee: Nippon Telegraph and Telephone Corp; University of Tokyo NUC
Current assignee: Nippon Telegraph and Telephone Corp; University of Tokyo NUC
Priority date: 2013-08-22
Filing date: 2013-08-22
Publication date: 2017-05-31
Anticipated expiration: 2033-08-22
Also published as: JP2015041004A

Description

The present invention relates to a fundamental frequency model parameter estimation apparatus, method, and program, and more particularly to a fundamental frequency model parameter estimation apparatus, method, and program for estimating, from a speech signal, the parameters of an observed fundamental frequency sequence.

Speech carries various kinds of information besides linguistic information and is used in everyday communication. With the goal of building a framework for handling such non-linguistic information from an engineering standpoint, we have been conducting research on information processing and signal processing for the analysis and synthesis of non-linguistic information.

The fundamental frequency (F₀) trajectory of speech is known to be rich in non-linguistic information such as speaker identity, emotion, and intention. Modeling the F₀ trajectory is therefore extremely useful in applications where prosodic information plays an important role, such as speech synthesis, speaker recognition, emotion recognition, and dialogue systems. The F₀ trajectory consists of a component that varies slowly over an entire prosodic phrase (the phrase component) and a component that varies sharply in accordance with accents (the accent component). These components can be interpreted as corresponding to the translational and rotational motions of the human thyroid cartilage, respectively, and based on this interpretation a mathematical model that represents the logarithmic F₀ trajectory as the sum of these components (hereinafter, the Fujisaki model) has been proposed. The Fujisaki model has parameters such as the onset times, durations, and magnitudes of the phrase and accent commands, and it is known to approximate measured trajectories very well when these parameters are set appropriately. The linguistic validity of the parameters has also been widely confirmed.

Because the parameters of the Fujisaki model described above represent prosodic features efficiently, estimating them from a measured F₀ trajectory is a very important problem (see Non-Patent Document 1). This has not necessarily been easy, however, because the problem is inherently ill-posed and because the Fujisaki model carries constraints, grounded in linguistic knowledge, that must be respected. The inventors have previously modeled the stochastic generation process of the F₀ trajectory based on the Fujisaki model (see Non-Patent Document 2), succeeded in reducing the parameter estimation problem of the Fujisaki model to a maximum likelihood estimation problem based on the EM algorithm, and developed an effective parameter estimation algorithm.

Non-Patent Document 1: Shuichi Narusawa, Nobuaki Minematsu, Keikichi Hirose, and Hiroya Fujisaki, "A method for automatic extraction of parameters of the fundamental frequency pattern generation process model of speech" (in Japanese), IPSJ Journal, vol. 43, no. 7, pp. 2155-2168, 2002.
Non-Patent Document 2: K. Yoshizato, H. Kameoka, D. Saito, and S. Sagayama, "Hidden Markov convolutive mixture model for pitch contour analysis of speech," in Proc. The 13th Annual Conference of the International Speech Communication Association (Interspeech 2012), Sep. 2012.

The central idea of the above method is to represent the generation process of the phrase and accent command sequences with a hidden Markov model (HMM). Under the state transition topology used so far, however, any command sequence within the range satisfying the constraints of the Fujisaki model could be generated, which allowed the generation of command sequences that are not necessarily linguistically valid.

If the range of possible command sequences can be appropriately restricted on the basis of linguistic constraints, the proposed model should allow the Fujisaki model parameters to be estimated effectively.

The present invention has been made in view of the above circumstances, and its object is to provide a fundamental frequency model parameter estimation apparatus, method, and program that can estimate the parameters of the Fujisaki model accurately by incorporating linguistic prior knowledge into the model through the design of the HMM state transition topology.

To achieve the above object, a fundamental frequency model parameter estimation apparatus according to a first invention receives a speech signal as input and estimates: a command sequence ^o consisting of pairs ^o[k] of a phrase command u_p[k], which represents the fundamental frequency pattern produced by the translational motion of the thyroid cartilage at each time k, and an accent command u_a[k], which represents the fundamental frequency pattern produced by the rotational motion of the thyroid cartilage; a command state sequence ^s consisting of the indices s_k of the states of a hidden Markov model at each time k, each state indicating a pair of the phrase command and the accent command; and a parameter group ^θ including the transition probabilities φ_{i',i} between states i' and i of the hidden Markov model. The apparatus includes: a fundamental frequency extraction unit that extracts, from time-series data of the speech signal, an observed fundamental frequency sequence ^y representing the fundamental frequency of the speech signal at each time k; a voiced/unvoiced section estimation unit that estimates, for the time-series data of the speech signal, the degree of uncertainty of the fundamental frequency at each time k according to whether the section is voiced or unvoiced; an initial value setting unit that sets an initial value of the command sequence ^o and an initial value of the parameter group ^θ; a state sequence posterior probability update unit that, based on the previously updated command sequence ^o' or the initial value ^o' of the command sequence ^o, calculates with the Forward-Backward algorithm, for each combination (k, t) of time k and state t, the posterior probability P(s_k = t | ^y, ^o', ^θ') given the observed fundamental frequency sequence ^y, the command sequence ^o', and the parameter group ^θ'; a model parameter update unit that updates the command sequence ^o and the parameter group ^θ based on the previously updated command sequence ^o' or the initial value ^o' of the command sequence ^o, the observed fundamental frequency sequence ^y, the degree of uncertainty at each time k, and the posterior probability P(s_k = t | ^y, ^o', ^θ') for each combination (k, t) of time k and state t; a convergence determination unit that repeats the calculation by the state sequence posterior probability update unit and the update by the model parameter update unit until a predetermined convergence condition is satisfied; and a state sequence calculation unit that calculates the state sequence ^s with the Viterbi algorithm based on the command sequence ^o finally updated by the model parameter update unit. The hidden Markov model consists of a plurality of left-to-right HMMs, each representing a sequence of the states corresponding to one of a plurality of templates and each transitioning between states in one direction; the start state of each of the left-to-right HMMs is connected to a specific state, and the end state of each of the left-to-right HMMs is connected to the same specific state. The parameter group ^θ further includes, for each template n of the plurality of templates, the output mean μ_p^(n) of the state corresponding to the phrase command in template n and the output mean μ_a^(n,m) of each state corresponding to each accent command m in template n.
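The posterior probability P(s_k = t | ^y, ^o', ^θ') named above is computed with the Forward-Backward algorithm. The following is a minimal generic sketch of that step, not the patented implementation. It assumes the per-frame likelihood of ^o'[k] under each state's output distribution has already been evaluated into a matrix; the names `state_posteriors`, `emission`, `phi`, and `pi` are hypothetical and do not appear in the source.

```python
import numpy as np

def state_posteriors(emission, phi, pi):
    """Scaled Forward-Backward pass.

    emission[k, t]: likelihood of frame k under state t (assumed precomputed)
    phi[i, j]:      transition probability from state i to state j
    pi[t]:          initial state probability (an assumption of this sketch)
    Returns gamma[k, t] = P(s_k = t | all frames).
    """
    K, T = emission.shape
    alpha = np.zeros((K, T))
    beta = np.ones((K, T))
    scale = np.zeros(K)

    # Forward recursion, normalised per frame to avoid underflow.
    alpha[0] = pi * emission[0]
    scale[0] = alpha[0].sum()
    alpha[0] /= scale[0]
    for k in range(1, K):
        alpha[k] = (alpha[k - 1] @ phi) * emission[k]
        scale[k] = alpha[k].sum()
        alpha[k] /= scale[k]

    # Backward recursion, reusing the forward scaling factors.
    for k in range(K - 2, -1, -1):
        beta[k] = phi @ (emission[k + 1] * beta[k + 1]) / scale[k + 1]

    gamma = alpha * beta
    return gamma / gamma.sum(axis=1, keepdims=True)
```

In an EM-style loop of the kind described above, these posteriors would be recomputed after every update of the command sequence and parameters, and a Viterbi pass would be run once after convergence.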

A fundamental frequency model parameter estimation method according to a second invention receives a speech signal as input and estimates: a command sequence ^o consisting of pairs ^o[k] of a phrase command u_p[k], which represents the fundamental frequency pattern produced by the translational motion of the thyroid cartilage at each time k, and an accent command u_a[k], which represents the fundamental frequency pattern produced by the rotational motion of the thyroid cartilage; a command state sequence ^s consisting of the indices s_k of the states of a hidden Markov model at each time k, each state indicating a pair of the phrase command and the accent command; and a parameter group ^θ including the transition probabilities φ_{i',i} between states i' and i of the hidden Markov model. The method includes: extracting, by a fundamental frequency extraction unit, from time-series data of the speech signal, an observed fundamental frequency sequence ^y representing the fundamental frequency of the speech signal at each time k; estimating, by a voiced/unvoiced section estimation unit, for the time-series data of the speech signal, the degree of uncertainty of the fundamental frequency at each time k according to whether the section is voiced or unvoiced; setting, by an initial value setting unit, an initial value of the command sequence ^o and an initial value of the parameter group ^θ; calculating, by a state sequence posterior probability update unit, based on the previously updated command sequence ^o' or the initial value ^o' of the command sequence ^o, with the Forward-Backward algorithm, for each combination (k, t) of time k and state t, the posterior probability P(s_k = t | ^y, ^o', ^θ') given the observed fundamental frequency sequence ^y, the command sequence ^o', and the parameter group ^θ'; updating, by a model parameter update unit, the command sequence ^o and the parameter group ^θ based on the previously updated command sequence ^o' or the initial value ^o' of the command sequence ^o, the observed fundamental frequency sequence ^y, the degree of uncertainty at each time k, and the posterior probability P(s_k = t | ^y, ^o', ^θ') for each combination (k, t) of time k and state t; repeating, by a convergence determination unit, the calculation by the state sequence posterior probability update unit and the update by the model parameter update unit until a predetermined convergence condition is satisfied; and calculating, by a state sequence calculation unit, the state sequence ^s with the Viterbi algorithm based on the command sequence ^o finally updated by the model parameter update unit. The hidden Markov model consists of a plurality of left-to-right HMMs, each representing a sequence of the states corresponding to one of a plurality of templates and each transitioning between states in one direction; the start state of each of the left-to-right HMMs is connected to a specific state, and the end state of each of the left-to-right HMMs is connected to the same specific state. The parameter group ^θ further includes, for each template n of the plurality of templates, the output mean μ_p^(n) of the state corresponding to the phrase command in template n and the output mean μ_a^(n,m) of each state corresponding to each accent command m in template n.

A fundamental frequency model parameter estimation apparatus according to a third invention receives a speech signal as input and estimates: a command sequence ^o consisting of pairs ^o[k] of a phrase command u_p[k], which represents the fundamental frequency pattern produced by the translational motion of the thyroid cartilage at each time k, and an accent command u_a[k], which represents the fundamental frequency pattern produced by the rotational motion of the thyroid cartilage; and a command state sequence ^s consisting of the indices s_k of the states of a hidden Markov model at each time k, each state indicating a pair of the phrase command and the accent command. The apparatus includes: a fundamental frequency extraction unit that extracts, from time-series data of the speech signal, an observed fundamental frequency sequence ^y representing the fundamental frequency of the speech signal at each time k; a voiced/unvoiced section estimation unit that estimates, for the time-series data of the speech signal, the degree of uncertainty of the fundamental frequency at each time k according to whether the section is voiced or unvoiced; an initial value setting unit that sets an initial value of the command sequence ^o; a state sequence posterior probability update unit that, based on the previously updated command sequence ^o' or the initial value ^o' of the command sequence ^o and on a parameter group ^θ' obtained in advance that includes the transition probabilities φ_{i',i} between states i' and i of the hidden Markov model, calculates with the Forward-Backward algorithm, for each combination (k, t) of time k and state t, the posterior probability P(s_k = t | ^y, ^o', ^θ') given the observed fundamental frequency sequence ^y, the command sequence ^o', and the parameter group ^θ'; a model parameter update unit that updates the command sequence ^o based on the previously updated command sequence ^o' or the initial value ^o' of the command sequence ^o, the observed fundamental frequency sequence ^y, the degree of uncertainty at each time k, and the posterior probability P(s_k = t | ^y, ^o', ^θ') for each combination (k, t) of time k and state t; a convergence determination unit that repeats the calculation by the state sequence posterior probability update unit and the update by the model parameter update unit until a predetermined convergence condition is satisfied; and a state sequence calculation unit that calculates the state sequence ^s with the Viterbi algorithm based on the command sequence ^o finally updated by the model parameter update unit. The hidden Markov model consists of a plurality of left-to-right HMMs, each representing a sequence of the states corresponding to one of a plurality of templates and each transitioning between states in one direction; the start state of each of the left-to-right HMMs is connected to a specific state, and the end state of each of the left-to-right HMMs is connected to the same specific state. The parameter group ^θ further includes, for each template n of the plurality of templates, the output mean μ_p^(n) of the state corresponding to the phrase command in template n and the output mean μ_a^(n,m) of each state corresponding to each accent command m in template n.

A fundamental frequency model parameter estimation method according to a fourth invention receives a speech signal as input and estimates: a command sequence ^o consisting of pairs ^o[k] of a phrase command u_p[k], which represents the fundamental frequency pattern produced by the translational motion of the thyroid cartilage at each time k, and an accent command u_a[k], which represents the fundamental frequency pattern produced by the rotational motion of the thyroid cartilage; and a command state sequence ^s consisting of the indices s_k of the states of a hidden Markov model at each time k, each state indicating a pair of the phrase command and the accent command. The method includes: extracting, by a fundamental frequency extraction unit, from time-series data of the speech signal, an observed fundamental frequency sequence ^y representing the fundamental frequency of the speech signal at each time k; estimating, by a voiced/unvoiced section estimation unit, for the time-series data of the speech signal, the degree of uncertainty of the fundamental frequency at each time k according to whether the section is voiced or unvoiced; setting, by an initial value setting unit, an initial value of the command sequence ^o; calculating, by a state sequence posterior probability update unit, based on the previously updated command sequence ^o' or the initial value ^o' of the command sequence ^o and on a parameter group ^θ' obtained in advance that includes the transition probabilities φ_{i',i} between states i' and i of the hidden Markov model, with the Forward-Backward algorithm, for each combination (k, t) of time k and state t, the posterior probability P(s_k = t | ^y, ^o', ^θ') given the observed fundamental frequency sequence ^y, the command sequence ^o', and the parameter group ^θ'; updating, by a model parameter update unit, the command sequence ^o based on the previously updated command sequence ^o' or the initial value ^o' of the command sequence ^o, the observed fundamental frequency sequence ^y, the degree of uncertainty at each time k, and the posterior probability P(s_k = t | ^y, ^o', ^θ') for each combination (k, t) of time k and state t; repeating, by a convergence determination unit, the calculation by the state sequence posterior probability update unit and the update by the model parameter update unit until a predetermined convergence condition is satisfied; and calculating, by a state sequence calculation unit, the state sequence ^s with the Viterbi algorithm based on the command sequence ^o finally updated by the model parameter update unit. The hidden Markov model consists of a plurality of left-to-right HMMs, each representing a sequence of the states corresponding to one of a plurality of templates and each transitioning between states in one direction; the start state of each of the left-to-right HMMs is connected to a specific state, and the end state of each of the left-to-right HMMs is connected to the same specific state. The parameter group ^θ further includes, for each template n of the plurality of templates, the output mean μ_p^(n) of the state corresponding to the phrase command in template n and the output mean μ_a^(n,m) of each state corresponding to each accent command m in template n.

A program according to the present invention is a program for causing a computer to function as each unit of the above fundamental frequency model parameter estimation apparatus.

As described above, the fundamental frequency model parameter estimation apparatus, method, and program of the present invention estimate the command sequence ^o, the command state sequence ^s, and the parameter group ^θ using a hidden Markov model made up of a plurality of left-to-right HMMs that represent state sequences corresponding to a plurality of templates. Linguistic prior knowledge is thereby incorporated into the model through the design of the HMM state transition topology, with the effect that the parameters of the Fujisaki model can be estimated accurately.

A diagram for explaining the Fujisaki model.
A diagram for explaining the HMM in the conventional method.
A diagram for explaining the HMM in the embodiments of the present invention.
A diagram for explaining the splitting of states.
A schematic diagram showing the configuration of the fundamental frequency model parameter estimation apparatus according to the first embodiment of the present invention.
A flowchart showing the fundamental frequency model parameter estimation processing routine in the fundamental frequency model parameter estimation apparatus according to the first embodiment of the present invention.
A schematic diagram showing the configuration of the fundamental frequency model parameter estimation apparatus according to the second embodiment of the present invention.
A flowchart showing the fundamental frequency model parameter estimation processing routine in the fundamental frequency model parameter estimation apparatus according to the second embodiment of the present invention.
A diagram showing experimental results.

Hereinafter, embodiments of the present invention will be described in detail with reference to the drawings. To estimate Fujisaki model parameters that reproduce the observed F₀ pattern well, the method proposed in the present invention formulates a probabilistic model of the F₀ pattern generation process based on the Fujisaki model and assumes that the observed F₀ pattern was generated according to it. The parameter estimation algorithm for the Fujisaki model is likewise based on this probabilistic model.

<Outline of the invention>
First, an outline of the present invention will be described.

In ordinary speech, the number of intonation types is limited. This is because, in Japanese, pitch accent is represented by two values, high and low, and the number of morae contained in one accent phrase is limited. For example, the phrases 「あらゆる現実を」 ("every reality") and 「明日は輪講だ」 ("tomorrow is the reading seminar") have the same accent pattern, so their intonation is almost identical. This suggests that the command sequence pairs of the Fujisaki model may be expressible by concatenating a finite number of template types. The key point of the present invention is therefore to consider a finite number of left-to-right HMMs, each corresponding to a command sequence template, and to build the template-based generative model of the command sequence described above using an HMM that can transition between templates. Specifically, this is realized by (1) to (3) below.

(1) The criterion is the product of the probability that the logarithmic fundamental frequency trajectory is generated given the phrase and accent command sequence and the base component, and the probability that the phrase and accent command sequence (the HMM output sequence) is generated. The HMM state transition probabilities, the HMM output sequence, the posterior probabilities of the HMM state sequence, and the output distribution parameters are updated so as to increase this criterion.

(2) In (1) above, the HMM is formed by connecting the start point of each left-to-right HMM, one per phrase and accent command sequence, to a specific state, and by connecting each end point to that same specific state.

(3) Using the model of (2) above, with the HMM state transition probabilities and the output distribution parameters held fixed, the HMM output sequence and the HMM state sequence are estimated to obtain the phrase and accent command sequence.
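The topology in (2) can be sketched as a transition matrix in which every template is a one-directional chain that is entered from, and returns to, a shared hub state. This is an illustrative sketch only. The function name `template_hmm_transitions`, the choice of index 0 for the specific state, the self-loops, and the uniform initial probabilities are all assumptions, not details given in the source.

```python
import numpy as np

def template_hmm_transitions(template_lengths):
    """Build allowed transitions for left-to-right chains sharing a hub.

    template_lengths: number of states in each template chain.
    Returns (allowed, phi): a boolean mask of permitted transitions and a
    row-stochastic matrix that is uniform over the permitted ones.
    """
    n_states = 1 + sum(template_lengths)
    allowed = np.zeros((n_states, n_states), dtype=bool)
    allowed[0, 0] = True  # assumption: the hub state may persist

    first = 1
    for length in template_lengths:
        last = first + length - 1
        allowed[0, first] = True          # hub -> start of this template
        for i in range(first, last + 1):
            allowed[i, i] = True          # assumption: self-loop for duration
            if i < last:
                allowed[i, i + 1] = True  # one-directional step inside chain
        allowed[last, 0] = True           # end of this template -> hub
        first = last + 1

    # Uniform initial probabilities over the permitted transitions per row.
    phi = allowed / allowed.sum(axis=1, keepdims=True)
    return allowed, phi
```

For example, `template_hmm_transitions([2, 3])` describes one hub plus two templates with two and three states. Entries that are zero in `phi` stay zero under EM re-estimation, which is how a topology of this kind restricts the command sequences the model can generate.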

<Probabilistic modeling of the F₀ trajectory>
Next, probabilistic modeling of the F₀ trajectory will be described.

In the Fujisaki model, the logarithmic F₀ trajectory y(t) is assumed to be the sum of three components, as follows (see FIG. 1).
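The equation itself is not reproduced in this text. Judging from the three components named in the surrounding description, it presumably has the following form (the equation number is an assumption):

$$
y(t) = x_p(t) + x_a(t) + x_b \tag{1}
$$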

Here, t is time, x_p(t) is the phrase component, x_a(t) is the accent component, and x_b is a time-independent constant called the baseline component. The phrase component and the accent component are further assumed to be the outputs of second-order filters driven by signals called the phrase command and the accent command, respectively.
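The filter equations are likewise not reproduced in this text. A form commonly used for the Fujisaki model, with critically damped second-order impulse responses and the angular frequencies α and β described in the following paragraph, would be the following; both the exact form and the equation numbers are assumptions:

$$
\begin{aligned}
x_p(t) &= G_p(t) * u_p(t), & \text{(2)}\\
G_p(t) &= \begin{cases} \alpha^2 t\, e^{-\alpha t} & (t \ge 0)\\ 0 & (t < 0) \end{cases} & \text{(3)}\\
x_a(t) &= G_a(t) * u_a(t), & \text{(4)}\\
G_a(t) &= \begin{cases} \beta^2 t\, e^{-\beta t} & (t \ge 0)\\ 0 & (t < 0) \end{cases} & \text{(5)}
\end{aligned}
$$

where $*$ denotes convolution over time $t$.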

Here, u_p(t) is a train of delta functions called the phrase command, and u_a(t) is a train of rectangular pulses called the accent command. At most one of them takes a non-zero value at any time. α and β are angular frequencies that represent the response speed of the respective second-order filters, and are known to take values of roughly α = 3 rad/s and β = 20 rad/s regardless of speaker or utterance.
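As a rough illustration of how these components combine, the sketch below synthesizes a sampled log F₀ trajectory from command sequences. It assumes critically damped impulse responses α²t·exp(−αt) and β²t·exp(−βt) for the two filters, which is a common form for the Fujisaki model but is not spelled out in this text. The function name `fujisaki_log_f0` and its arguments are hypothetical.

```python
import numpy as np

def fujisaki_log_f0(u_p, u_a, x_b, dt, alpha=3.0, beta=20.0):
    """Return y[k] = x_p[k] + x_a[k] + x_b for sampled command sequences.

    u_p: phrase command, non-zero only at impulse positions (impulse areas)
    u_a: accent command, rectangular pulses (pulse heights)
    x_b: baseline component (constant)
    dt:  sampling period in seconds
    """
    t = np.arange(len(u_p)) * dt
    # Assumed impulse responses of the two second-order filters.
    g_p = alpha**2 * t * np.exp(-alpha * t)
    g_a = beta**2 * t * np.exp(-beta * t)
    # Causal convolution truncated to the signal length.
    # u_p stores impulse areas, so no dt factor is needed;
    # u_a stores pulse heights, so the sum approximates an integral via dt.
    x_p = np.convolve(u_p, g_p)[: len(u_p)]
    x_a = np.convolve(u_a, g_a)[: len(u_a)] * dt
    return x_p + x_a + x_b
```

With α = 3 rad/s the phrase response peaks about 1/α ≈ 0.33 s after an impulse and decays slowly, while with β = 20 rad/s the accent response settles within a few tenths of a second. This matches the description of a slowly varying phrase component and a sharply varying accent component.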

The following outlines the probabilistic model of the F₀ trajectory generation process (see Non-Patent Document 2) that the inventors have developed on the basis of the Fujisaki model. In the Fujisaki model described above, the phrase command and the accent command are a delta train and a rectangular pulse train, respectively, and it is further assumed that they do not overlap each other. To describe command sequences that satisfy these constraints in the form of a probabilistic model, the inventors devised a model in which the pair ^o[k] = (u_p[k], u_a[k])^T of the phrase command u_p[k] and the accent command u_a[k] is expressed as the output of an HMM. When the output distribution of each state is a normal distribution, the output sequence {^o[k]}_{k=1}^{K} follows

^o[k] ~ N(^c_{s_k}, ^Υ_{s_k})    (6)

Here, s_k denotes the state at time k. That is, Equation (6) means that the mean ^μ[k] = (μ_p[k], μ_a[k])^T = ^c_{s_k} and the covariance ^Σ[k] = ^Υ_{s_k} = diag(v_{p,k}, v_{a,k}) change over time as a result of state transitions. Symbols denoting a matrix or a vector are written with a "^" prefix. The configuration of the above HMM is as follows.

The phrase component and the accent component are obtained by convolving the command functions u_p[k] and u_a[k] output from the above HMM with different filters G_p[k] and G_a[k]:

x_p[k] = G_p[k] * u_p[k],  x_a[k] = G_a[k] * u_a[k]

Here, * denotes convolution over discrete time k, and G_p[k] and G_a[k] are the discrete-time representations of G_p(t) and G_a(t), respectively. The discrete-time representation x[k] of the F₀ trajectory is therefore

x[k] = x_p[k] + x_a[k] + x_b

where x_b denotes the baseline component.
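As a minimal sketch of this discrete-time model, the following Python code synthesizes a log F₀ contour from given command sequences. It assumes the standard critically damped filter responses and a frame period of 8 ms; the function names, the default values and the scaling of the accent filter by the frame period are illustrative assumptions and are not taken from this description.

```python
import math


def sampled_response(omega, n_taps, dt):
    """Samples of the critically damped response omega^2 * t * exp(-omega * t)."""
    return [omega * omega * (k * dt) * math.exp(-omega * k * dt)
            for k in range(n_taps)]


def convolve_causal(u, g):
    """Causal convolution out[k] = sum_{l <= k} g[k - l] * u[l], truncated to len(u)."""
    out = [0.0] * len(u)
    for l, u_l in enumerate(u):
        if u_l == 0.0:
            continue  # commands are sparse, so skip empty frames
        for k in range(l, len(u)):
            out[k] += g[k - l] * u_l
    return out


def synthesize_log_f0(u_p, u_a, x_b, alpha=3.0, beta=20.0, dt=0.008):
    """Return x[k] = x_p[k] + x_a[k] + x_b for all frames k."""
    n = len(u_p)
    # Assumption: phrase commands are impulse magnitudes, so G_p is only sampled.
    g_p = sampled_response(alpha, n, dt)
    # Assumption: accent commands are pulse heights, so G_a is scaled by dt
    # to approximate the continuous-time integral.
    g_a = [dt * g for g in sampled_response(beta, n, dt)]
    x_p = convolve_causal(u_p, g_p)
    x_a = convolve_causal(u_a, g_a)
    return [x_p[k] + x_a[k] + x_b for k in range(n)]
```

With this scaling, a long accent pulse of height A raises the contour by approximately A, and a single phrase impulse of magnitude A produces A times the sampled phrase response.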

In unvoiced segments, F₀ may not be observed at all, and is often unreliable even when it is observed. Estimation errors can also occur in F₀ extraction. The observed F₀ trajectory y[k] is therefore expressed as the sum of the above F₀ trajectory model x[k] and noise x_n[k] ~ N(0, v_n²[k]), so that the uncertainty of the observed F₀ sequence can be incorporated through the setting of the variance v_n²[k]. The observed F₀ sequence y[k] is thus expressed as

y[k] = x[k] + x_n[k]

Marginalizing out x_n[k], the conditional probability density function P(^y|^o) of ^y = {y[k]}_{k=1}^{K} given ^o = {^o[k]}_{k=1}^{K} is

P(^y|^o) = Π_{k=1}^{K} N(y[k]; x[k], v_n²[k])

From Equation (6), the conditional probability density function P(^o|^s, ^θ) of {^o[k]}_{k=1}^{K} given the state sequence ^s = {s_k}_{k=1}^{K} is P(^o|^s, ^θ) = Π_{k=1}^{K} N(^o[k]; ^c_{s_k}, ^Υ_{s_k}). Here, ^θ denotes the sequence of means and variances of the output distributions. By the Markov assumption of the HMM, the probability distribution P(^s) of the state sequence ^s is given by the product of transition probabilities, P(^s) = φ_{s_1} Π_{k=2}^{K} φ_{s_{k-1}, s_k}.

FIG. 2 shows the state transition model of the phrase/accent command sequence in the conventional method. In state r_0, μ_p[k] and μ_a[k] are both zero. In state p_1, μ_p[k] can take a non-negative value A_p[k], and μ_a[k] is zero. Self-transition is prohibited in state p_1. In state r_1, μ_p[k] and μ_a[k] are again restricted to zero. This state guarantees that μ_p[k] is a pulse train. State r_0 can transition only to states a_1, ..., a_N; in these states μ_a[k] can take the respective values A_a^(n), while μ_p[k] is restricted to zero. A direct transition from a_n to a_n' without passing through r_1 is prohibited. This guarantees that μ_a[k] is a rectangular pulse train.

<Vocabulary model of phrase/accent commands>
The key idea in the above model is that the constraints of the Fujisaki model are expressed through the state transition topology of the HMM. Under the topology used so far, however, any command sequence within the range allowed by the Fujisaki model constraints could be generated, including sequences that are not necessarily linguistically plausible. If the range of possible command sequences can be restricted appropriately on the basis of linguistic prior knowledge, the proposed model should allow the Fujisaki model parameters to be estimated more effectively. The essence of the present invention is therefore to incorporate linguistic prior knowledge into the model through the design of the HMM state transition topology.

In ordinary utterances, the number of intonation types is limited. In Japanese, this is because pitch accent is expressed by the two values high and low, and the number of morae contained in one accent phrase is bounded. For example, 「あらゆる現実を」 ("every reality") and 「明日は輪講だ」 ("there is a reading seminar tomorrow") have the same accent pattern, so their intonation is almost identical. This suggests that the command sequence pairs of the Fujisaki model may be expressible by concatenating a finite set of templates. Accordingly, by first considering a finite set of Left-to-Right HMMs corresponding to command sequence templates, and then considering an HMM that can transition between templates, a template-based generative model of command sequences can be constructed. The state transition topology of the phrase/accent command sequence based on a vocabulary model of pitch pattern templates can be represented, for example, by an HMM such as that shown in FIG. 3. This HMM is called the vocabulary model of the phrase/accent command sequence. To allow flexible control over how much time stretching of each template is permitted, each state (except p_1, p_2, ...) is split, as shown in FIG. 4, into sub-states constrained to share the same output distribution. This makes it possible to parameterize the dwell time in each state individually.

FIG. 4 shows the case where state a_{1,1} is split into four sub-states a_{1,1,0}, a_{1,1,1}, a_{1,1,2}, a_{1,1,3}. The transition probability φ_{a_{1,1,0}, a_{1,1,1}} corresponds to the probability that state a_{1,1} persists for four time steps.

<Optimization algorithm>
Next, an iterative algorithm that finds a locally optimal solution of the posterior probability P(^o, ^θ|^y) of the model parameters ^θ and ^o, given the observed F₀ sequence ^y, is derived on the basis of the EM algorithm and the auxiliary function method. Taking the state sequence ^s as the hidden variable, and noting that the posterior probability P(^o, ^θ|^y) is obtained by marginalizing P(^o, ^θ, ^s|^y) ∝ P(^y|^o) P(^o|^s, ^θ) P(^s) over ^s, a Q function Q(^o, ^θ, ^o', ^θ') can be defined up to constant terms; equality up to constant terms is written =^c.

A solution at which P(^o, ^θ|^y) is locally maximal can therefore be obtained by repeating a step of computing P(^s|^y, ^o', ^θ') with the Forward-Backward algorithm and a step of increasing Q(^o, ^θ, ^o', ^θ') with respect to ^o and ^θ. Because ^o is the pair of command functions of the Fujisaki model, the non-negativity constraint on ^o must be taken into account in the step that increases Q(^o, ^θ, ^o', ^θ'). An update rule that increases Q(^o, ^θ, ^o', ^θ') while satisfying the non-negativity constraint on ^o can be derived in the same way as in Non-Patent Document 2. Following Non-Patent Document 2, a lower bound of Q(^o, ^θ, ^o', ^θ') can be designed using Jensen's inequality. Here, G_b[k] = δ[k] (the Kronecker delta), and λ_{i,k,l} are arbitrary variables satisfying 0 < λ_{i,k,l} < 1 and Σ_i Σ_l λ_{i,k,l} = 1.

Q(^o, ^θ, ^o', ^θ') can be increased by alternating between a step that maximizes the resulting lower bound function with respect to λ_{i,k,l} ≥ 0 and a step that maximizes it with respect to ^o. The update rules of both steps can be obtained analytically and are given by Equations (19) and (20), respectively.

To make the algorithm run stably in the above updates, the function Q(^o, ^θ, ^o', ^θ') + Λ, obtained by adding a sparse regularization term Λ on u_p[k] and u_a[k], can be optimized instead of Q(^o, ^θ, ^o', ^θ'). An update rule that includes this regularization term can be designed using the inequalities of Equations (22) and (23), in which w_k, d_k and m_k are auxiliary variables; the conditions under which equality holds in Equations (22) and (23) are available in closed form. Applying these inequalities gives an inequality whose right-hand side, Q''(^o, ^θ, ^o', ^θ'), is a lower bound function of Q(^o, ^θ, ^o', ^θ') + Λ, and the update rule follows from differentiating Q''(^o, ^θ, ^o', ^θ') with respect to u_i[l].

Q(^o, ^θ, ^o', ^θ') + Λ can thus be increased by alternately executing a step that maximizes this lower bound function with respect to λ_{i,k,l} ≥ 0, d_k and m_k, and a step that maximizes it with respect to ^o. The auxiliary variables w_k, d_k and m_k are updated by Equations (33) to (35), and the command functions are updated by Equation (36).

After the above iterations have converged, ^θ is updated next. Its update equations can be obtained analytically and are given by Equations (42) and (43).

These updated values are then substituted anew into ^o' and ^θ', and the algorithm proceeds to the next iteration. The update of the transition probabilities is described later.

After the above iterative algorithm has converged, the optimal ^s obtained by the Viterbi algorithm is taken as the estimate of the command sequence.

<System configuration>
Next, an embodiment of the present invention is described, taking as an example the case where the present invention is applied to a fundamental frequency model parameter estimation apparatus that analyzes time-series data of an observed speech signal and estimates the parameters of the Fujisaki model.

The fundamental frequency model parameter estimation apparatus according to the first embodiment of the present invention is implemented by a computer that includes a CPU, a RAM, and a ROM storing a program for executing the fundamental frequency model parameter estimation processing routine described later. Functionally, it is configured as follows.

As shown in FIG. 5, the fundamental frequency model parameter estimation apparatus 100 includes a storage unit 1, a fundamental frequency sequence extraction unit 2, a voiced/unvoiced segment estimation unit 3, an initial value setting unit 4, a state sequence posterior probability update unit 5, a model parameter update unit 6, a convergence determination unit 7, a state sequence calculation unit 8, and an output unit 9.

The storage unit 1 stores the time-series data of the observed speech signal. The storage unit 1 also stores data representing each state of an HMM whose state transition topology has been designed in advance on the basis of linguistic prior knowledge. This HMM contains a plurality of Left-to-Right HMMs, one for each of a plurality of templates, and can transition between templates.

The fundamental frequency sequence extraction unit 2 extracts time-series data of the fundamental frequency from the time-series data of the speech signal and converts it to a representation in discrete time k, yielding the observed fundamental frequency sequence ^y = {F₀[k]} (k = 1, ..., K), which is the time-series data of the fundamental frequency of the speech signal. This extraction can be realized by well-known techniques. For example, the fundamental frequency is extracted every 8 ms using the method described in Non-Patent Document 3 (H. Kameoka, "Statistical speech spectrum model incorporating all-pole vocal tract model and F0 contour generating process model," in Tech. Rep. IEICE, 2010, in Japanese.).

The voiced/unvoiced segment estimation unit 3 identifies voiced segments and unvoiced segments from the time-series data of the speech signal and, for each discrete time k, estimates the degree of uncertainty v_n²[k] of the observed F₀[k] value according to whether k lies in a voiced or an unvoiced segment. Because F₀ is unreliable in unvoiced segments, the degree of uncertainty is estimated to be large in unvoiced segments (for example, v_n²[k] = 10¹⁵) and small in voiced segments (for example, v_n²[k] = 0.1²).
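The effect of this per-frame variance on the observation model y[k] = x[k] + x_n[k] can be sketched as follows. The code assumes a Gaussian observation likelihood with one variance per frame, as in the noise model introduced earlier; the function names and the boolean `reliable` flag are hypothetical and serve only to show that frames with a very large variance contribute almost nothing to the fit.

```python
import math


def frame_variances(reliable, small=0.1 ** 2, large=1e15):
    """Assign a small variance to reliable frames and a very large one otherwise."""
    return [small if r else large for r in reliable]


def log_likelihood(y, x, variances):
    """log P(y | x) = sum_k log N(y[k]; x[k], v[k]) with per-frame variance v[k]."""
    total = 0.0
    for y_k, x_k, v in zip(y, x, variances):
        total += -0.5 * math.log(2.0 * math.pi * v) - 0.5 * (y_k - x_k) ** 2 / v
    return total
```

For instance, a model error of 100 on a frame with variance 10¹⁵ changes the log-likelihood by only about 5 × 10⁻¹², whereas an error of 0.5 on a frame with variance 0.1² changes it by 12.5.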

The initial value setting unit 4 treats the parameters used in the processing described later, namely the number of EM iterations M, α, β, v_p²[k], v_a²[k] and v_b², as constants and sets suitable values for them.

The initial value setting unit 4 also sets the initial value of ^o[k] = (u_p[k], u_a[k])^T and the initial value of u_b using the Fujisaki model parameter estimation method described in Non-Patent Document 1.

The initial value setting unit 4 further gives the initial values of μ_p^(n) and μ_a^(n,m) as values drawn from a uniform random distribution between 0.5 and 1 times the maximum of the initial u_p[k] and u_a[k], respectively. Among the states of the HMM whose state transition topology has been designed in advance on the basis of linguistic prior knowledge, each state i with i ∈ r_0 is given an initial probability φ_i that is uniform over such i; for all other states i, the initial probability φ_i is initialized to 0. In addition, on the basis of the HMM state transition topology, the transition probability φ_{i',i} is set to 0 when a transition from state i' to state i is not possible, and the transition probabilities φ_{i',i} of the possible transitions are initialized with a uniform distribution over i.
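The initialization of the transition probabilities can be sketched as follows. This is a simplified illustration, not the topology of FIG. 3 or FIG. 4: it assumes one hub state (named `r0` here) to which the start and end of every Left-to-Right template chain are connected, and it allows a self-transition in every chain state. The state naming scheme and both helper functions are hypothetical.

```python
def template_topology(template_lengths, hub="r0"):
    """Return allowed successors for a hub state plus one left-to-right chain per template."""
    allowed = {hub: [hub]}
    for n, length in enumerate(template_lengths, start=1):
        chain = [f"t{n}_{j}" for j in range(length)]
        allowed[hub].append(chain[0])  # hub -> start of template n
        for j, state in enumerate(chain):
            # The last state of a chain returns to the hub.
            nxt = chain[j + 1] if j + 1 < length else hub
            allowed[state] = [state, nxt]  # assumed: self-loop or advance
    return allowed


def uniform_transitions(allowed):
    """Uniform probability over allowed successors; missing entries mean probability 0."""
    phi = {}
    for src, dests in allowed.items():
        p = 1.0 / len(dests)
        phi[src] = {dst: p for dst in dests}
    return phi
```

Transitions that the topology forbids never appear in the table, so they stay at probability zero through any later re-estimation that only rescales existing entries.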

In the present embodiment, a locally optimal solution for the Fujisaki model parameters ^o and ^θ is obtained on the basis of the above Q function by repeating two steps, performed by the state sequence posterior probability update unit 5 and the model parameter update unit 6.

The state sequence posterior probability update unit 5 performs the step of computing the posterior probability P(^s|^y, ^o', ^θ') of the command state sequence (the latent variable); in the EM algorithm this is called the E step. With the Forward-Backward algorithm, P(s_k = t|^y, ^o', ^θ') can be obtained efficiently for each k and t. Specifically, the posterior is rewritten as a product of a forward term and a backward term. Each forward term P(s_k = t|{y[l], ^o'[l], ^θ'[l]}_{l=1}^{k}) is computed by solving a recurrence sequentially for k = 1, 2, ..., K, and each backward term P({y[l], ^o'[l], ^θ'[l]}_{l=k+1}^{K}|s_k = t) is computed by solving a recurrence sequentially for k = K, K-1, ..., 1.
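The two recurrences can be sketched with a generic scaled Forward-Backward pass. This is a textbook version, not the exact recurrences of this description: `init[t]` stands for the initial probability of state t, `trans[i][t]` for the probability of moving from state i to state t, and `lik[k][t]` for the likelihood of the data at time k under state t; all three names are assumptions made for the example.

```python
def forward_backward(init, trans, lik):
    """Return gamma[k][t] = P(s_k = t | all data) for a discrete-state HMM."""
    K, T = len(lik), len(init)
    alpha = [[0.0] * T for _ in range(K)]
    beta = [[1.0] * T for _ in range(K)]
    scale = [0.0] * K

    # Forward recurrence, solved for k = 1, 2, ..., K (0-based here).
    for k in range(K):
        for t in range(T):
            if k == 0:
                prior = init[t]
            else:
                prior = sum(alpha[k - 1][i] * trans[i][t] for i in range(T))
            alpha[k][t] = prior * lik[k][t]
        scale[k] = sum(alpha[k])  # normalizer that keeps the values in range
        alpha[k] = [a / scale[k] for a in alpha[k]]

    # Backward recurrence, solved for k = K, K-1, ..., 1.
    for k in range(K - 2, -1, -1):
        for i in range(T):
            beta[k][i] = sum(
                trans[i][t] * lik[k + 1][t] * beta[k + 1][t] for t in range(T)
            ) / scale[k + 1]

    # The posterior is the product of the scaled forward and backward terms.
    return [[alpha[k][t] * beta[k][t] for t in range(T)] for k in range(K)]
```

Scaling each forward vector to sum to one avoids numerical underflow on long sequences; the returned posteriors sum to one over the states at every time k.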

In this way, for every combination (k, t) of time k and state t, the state sequence posterior probability update unit 5 uses the Forward-Backward algorithm to compute the posterior probability P(s_k = t|^y, ^o', ^θ') on the basis of the previously updated command sequence ^o' or its initial value, and thereby computes the posterior probability P(^s|^y, ^o', ^θ') of the command state sequence ^s given the observed fundamental frequency sequence ^y, the command sequence ^o', and the parameter group ^θ'.

The model parameter update unit 6 includes an auxiliary variable update unit 61, a command function update unit 62, a convergence determination unit 63, and an average amplitude update unit 64.

The model parameter update unit 6 performs the step of updating the command sequence ^o and the parameter group ^θ so as to increase the objective function Q(^o, ^θ, ^o', ^θ'); in the EM algorithm this is called the M step. When the sparse constraint term is taken into account, ^o and ^θ are updated so as to increase Q(^o, ^θ, ^o', ^θ') + Λ.

On the basis of the previously updated phrase command u_p[k] at each time k (or its initial value), the auxiliary variable update unit 61 computes and updates the auxiliary variable λ_{p,k,l} according to Equation (19) above for every combination (k, l) of times k and l (l < k). Likewise, on the basis of the previously updated accent command u_a[k] at each time k (or its initial value), the auxiliary variable update unit 61 computes and updates the auxiliary variable λ_{a,k,l} according to Equation (19) above for every combination (k, l).

When the sparse constraint term is taken into account, the auxiliary variables w_k, d_k and m_k are additionally updated by Equations (33) to (35).

The command function update unit 62 updates the phrase command u_p[k] at each time k according to Equation (20) above, on the basis of the fundamental frequency sequence ^y, the degree of uncertainty v_n²[k], the posterior probability P(^s|^y, ^o', ^θ') of the command state sequence updated by the state sequence posterior probability update unit 5, and the auxiliary variable λ_{p,k,l} updated by the auxiliary variable update unit 61. When the sparse constraint term is taken into account, u_p[k] is updated using Equation (36) instead.

Similarly, the command function update unit 62 updates the accent command u_a[k] at each time k according to Equation (20) above, on the basis of the fundamental frequency sequence ^y, the degree of uncertainty v_n²[k], the posterior probability P(^s|^y, ^o', ^θ') of the command state sequence updated by the state sequence posterior probability update unit 5, and the auxiliary variable λ_{a,k,l} updated by the auxiliary variable update unit 61. When the sparse constraint term is taken into account, u_a[k] is updated using Equation (36) instead.

The convergence determination unit 63 determines whether a predetermined convergence condition is satisfied. If it is not satisfied, the processes of the auxiliary variable update unit 61 and the command function update unit 62 are repeated. If the convergence determination unit 63 determines that the convergence condition is satisfied, processing moves on to the average amplitude update unit 64.

The convergence condition may be that the iteration count s has reached a predetermined number S (for example, 500). Alternatively, the condition may be that the difference between the value of the auxiliary function with the parameters of iteration s-1 and its value with the parameters of iteration s has become smaller than a predetermined threshold.

As the continuation of the M step, the average amplitude update unit 64 then updates ^θ = {{μ_p^(n)}_{n=1}^{N}, {μ_a^(n,m)}_{n=1,m=1}^{N,M}, {φ_{i',i}}}.

For each of the plurality of templates n, the average amplitude update unit 64 updates the output mean μ_p^(n) of the state corresponding to the phrase command in template n according to Equation (42) above, on the basis of the phrase command u_p[k] at each time k updated by the command function update unit 62 and the posterior probability P(s_k = t|^y, ^o', ^θ') updated by the state sequence posterior probability update unit 5. Likewise, for each of the plurality of templates n, the average amplitude update unit 64 updates the output mean μ_a^(n,m) of the state corresponding to each accent command m in template n according to Equation (43) above, on the basis of the accent command u_a[k] at each time k updated by the command function update unit 62 and the posterior probability P(s_k = t|^y, ^o', ^θ') updated by the state sequence posterior probability update unit 5.
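Equations (42) and (43) are not reproduced in this text. As a hedged illustration only, the usual M-step for the mean of a Gaussian output distribution is a posterior-weighted average of the observations, which is sketched below; the function name and its arguments are hypothetical, and the form of the actual equations is an assumption.

```python
def update_state_mean(u, gamma_state):
    """Posterior-weighted mean: sum_k gamma[k] * u[k] / sum_k gamma[k].

    u           -- command values u[k] at each time k
    gamma_state -- posterior weight of the state(s) of interest at each time k
    Returns None when the state has no posterior mass, so the caller can keep
    the previous value.
    """
    total_weight = sum(gamma_state)
    if total_weight == 0.0:
        return None
    return sum(g * u_k for g, u_k in zip(gamma_state, u)) / total_weight
```

Under this assumption, when several sub-states are constrained to share one output distribution, their posteriors would be summed into a single `gamma_state` before the update.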

In addition, for each pair of HMM states i', i, the average amplitude update unit 64 updates the transition probability φ_{i',i} between states i' and i using the update equation for φ_{i',i}, on the basis of the forward terms P(s_k = i'|{y[l], ^o'[l], ^θ'[l]}_{l=1}^{k}) and the backward terms P({y[l], ^o'[l], ^θ'[l]}_{l=k+1}^{K}|s_k = i) computed by the state sequence posterior probability update unit 5.

The convergence determination unit 7 determines whether a predetermined convergence condition is satisfied. If it is not satisfied, the updated values are substituted anew into ^o' and ^θ', and the iterative algorithm (the processes of the state sequence posterior probability update unit 5 and the model parameter update unit 6) is repeated. If the convergence determination unit 7 determines that the convergence condition is satisfied, processing moves on to the state sequence calculation unit 8.

The convergence condition may be that the iteration count r has reached a predetermined number R (for example, 20). Alternatively, the condition may be that the difference between the value of the objective function with the parameters of iteration r-1 and its value with the parameters of iteration r has become smaller than a predetermined threshold.

Finally, the state sequence calculation unit 8 obtains the optimal state sequence ^s* using the Viterbi algorithm. Specifically, δ_t[k] and ψ_t[k] are first obtained by solving a recurrence sequentially for k = 1, 2, ..., K, and ^s* is then computed from them by solving a second recurrence sequentially for k = K, K-1, ..., 1.
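A generic log-domain Viterbi decoder illustrates the two passes. It is a textbook sketch rather than the exact recurrences of Equations (49) to (53): `delta` and `psi` play the roles of δ_t[k] and ψ_t[k], and the argument names `init`, `trans` and `lik` (initial probabilities, transition probabilities from row state to column state, and per-frame state likelihoods) are assumptions made for the example.

```python
import math


def viterbi(init, trans, lik):
    """Return the most probable state path as a list of state indices."""

    def log(p):
        # Zero probabilities (forbidden transitions) map to -infinity.
        return math.log(p) if p > 0.0 else -math.inf

    K, T = len(lik), len(init)
    delta = [[-math.inf] * T for _ in range(K)]  # best log score ending in state t
    psi = [[0] * T for _ in range(K)]            # best predecessor of state t

    for t in range(T):
        delta[0][t] = log(init[t]) + log(lik[0][t])

    # Forward pass, k = 1, 2, ..., K (0-based here).
    for k in range(1, K):
        for t in range(T):
            best = max(range(T), key=lambda i: delta[k - 1][i] + log(trans[i][t]))
            psi[k][t] = best
            delta[k][t] = delta[k - 1][best] + log(trans[best][t]) + log(lik[k][t])

    # Backtracking pass, k = K, K-1, ..., 1.
    path = [0] * K
    path[-1] = max(range(T), key=lambda t: delta[-1][t])
    for k in range(K - 1, 0, -1):
        path[k - 1] = psi[k][path[k]]
    return path
```

Because forbidden transitions have log probability of minus infinity, the decoded path always respects the state transition topology, which is what restricts the estimated command sequence to the designed templates.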

In this way, the state sequence calculation unit 8 calculates the state sequence ^s according to Equations (49) to (53) above, on the basis of the command sequence ^o finally updated by the model parameter update unit 6. The output unit 9 then outputs the command sequence ^o, the parameter group ^θ, and the state sequence ^s.

<Operation of the fundamental frequency model parameter estimation apparatus>
Next, the operation of the fundamental frequency model parameter estimation apparatus 100 according to the first embodiment is described. First, the time-series data of an observed speech signal to be analyzed is input to the fundamental frequency model parameter estimation apparatus 100 and stored in the storage unit 1. The fundamental frequency model parameter estimation apparatus 100 then executes the fundamental frequency model parameter estimation processing routine shown in FIG. 6.

First, in step S101, the time-series data of the speech signal is read from the storage unit 1, and the fundamental frequency sequence ^y consisting of the fundamental frequency F₀ at each time k is extracted. In step S102, voiced and unvoiced segments are identified on the basis of the time-series data of the speech signal, and the degree of uncertainty v_n²[k] of the fundamental frequency at each time k is estimated.

In the next step S103, suitable initial values are set for the parameters M, α, β, v_p²[k], v_a²[k] and v_b². The command sequence ^o and u_b are also estimated by a conventional method and set as initial values. Then, in step S104, the initial values of μ_p^(n) and μ_a^(n,m) are set on the basis of the command sequence ^o set in step S103. In addition, the initial probability φ_i of each state i and the transition probability φ_{i',i} between states i' and i are set on the basis of the HMM state transition topology designed in advance.

Then, in step S105, the posterior probability P(s_k = t|^y, ^o', ^θ') is updated for every combination (k, t) on the basis of the initial value of the command sequence ^o set in step S103 or the command sequence ^o previously updated in step S106 described later, whereby the posterior probability P(^s|^y, ^o', ^θ') of the command state sequence is updated.

In step S106, the command sequence ^o and the parameter group ^θ, which represents the command amplitudes and the state transition probabilities, are updated so as to increase the objective function Q(^o, θ, ^o', ^θ') or Q(^o, θ, ^o', ^θ') + Λ. The update is based on the initial value of the command sequence ^o set in step S103 or the command sequence ^o last updated in step S106, the fundamental frequency sequence ^y calculated in step S101, the degree of uncertainty v_n²[k] at each time k calculated in step S102, and the posterior probability P(^s | ^y, ^o', ^θ') of the command state sequence updated in step S104.

Step S106 is implemented by the processes of steps S111 to S114 below.

In step S111, based on the initial value of the command sequence ^o set in step S103, or on the command sequence ^o last updated in step S112 described later, the auxiliary variables λ_{p,k,l} and λ_{a,k,l} are calculated and updated according to equation (19) for every combination (k, l). When the sparsity constraint term is taken into account, the auxiliary variables w_k, d_k, and m_k are additionally updated according to equations (33) to (35).

In the next step S112, the command sequence ^o, consisting of the phrase command u_p[k] and the accent command u_a[k] at each time k, is updated according to equation (20), based on the fundamental frequency sequence ^y calculated in step S101, the degree of uncertainty v_n²[k] at each time k calculated in step S102, the posterior probability P(^s | ^y, ^o', ^θ') of the command state sequence updated in step S104, and the auxiliary variables λ_{p,k,l} and λ_{a,k,l} updated in step S111. When the sparsity constraint term is taken into account, equation (36) is used instead to update the command sequence ^o.

In the next step S113, whether the iteration count s has reached S is checked as the convergence condition. If s has not reached S, the convergence condition is judged not to be satisfied, the process returns to step S111, and steps S111 to S112 are repeated. If s has reached S, the convergence condition is judged to be satisfied, and in step S114 the following updates are performed based on the phrase command u_p[k] and the accent command u_a[k] at each time k updated in step S112 and on the posterior probability P(^s | ^y, ^o', ^θ') of the command state sequence updated in step S104.

For each of the plurality of templates n, the output mean μ_p^(n) of the state corresponding to the phrase command in template n is updated according to equation (42), and the output mean amplitude μ_a^(n,m) of the state corresponding to each accent command m in template n is updated according to equation (43).

In addition, based on P(s_k = i' | {y[l], o'[l], ^θ'[l]}_{l=k+1}^{l=K}) and P({y[l], o'[l], ^θ'[l]}_{l=k+1}^{l=K} | s_k = i) calculated in step S104, the transition probability φ_{i',i} between states i' and i is updated for each pair of HMM states i', i according to equations (49) to (51).

Then, in step S107, whether the iteration count r has reached R is checked as the convergence condition. If r has not reached R, the convergence condition is judged not to be satisfied; in step S108, the command sequence ^o and the parameter group θ updated in step S106 are substituted into ^o' and ^θ', the process returns to step S105, and steps S105 to S106 are repeated. If r has reached R, the convergence condition is judged to be satisfied; in step S108, the state sequence ^s is calculated according to equations (49) to (53) based on the command sequence ^o finally updated in step S106, the output unit 9 outputs the command sequence ^o, the parameter group ^θ representing the command amplitudes, and the state sequence ^s, and the fundamental frequency model parameter estimation processing routine ends.
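The control flow of steps S105 to S108 is a fixed-count nested iteration: an outer loop of R passes that refreshes the state posteriors, and an inner loop of S passes that alternates the auxiliary-variable and command updates. The sketch below shows only this loop structure. The function `run_estimation` and its four callback arguments are hypothetical placeholders for the update rules given by the numbered equations; they are assumptions for illustration, not part of the described apparatus.

```python
def run_estimation(o_init, theta_init, R, S,
                   update_posterior, update_aux, update_commands, update_theta):
    """Nested fixed-count iteration mirroring steps S105 to S108.

    The four callbacks are hypothetical stand-ins for the update equations.
    """
    o_prev, theta_prev = o_init, theta_init            # play the role of ^o', ^theta'
    o, theta = o_init, theta_init
    for _ in range(R):                                  # S107: stop after R passes
        post = update_posterior(o_prev, theta_prev)     # S105: state posteriors
        o = o_prev
        for _ in range(S):                              # S113: stop after S passes
            aux = update_aux(o)                         # S111: auxiliary variables
            o = update_commands(o, aux, post)           # S112: command sequence
        theta = update_theta(o, post)                   # S114: output means, transitions
        o_prev, theta_prev = o, theta                   # S108: carry over to next pass
    return o, theta
```

Using iteration counts rather than a tolerance on the objective keeps the running time predictable, which matches the convergence conditions stated in steps S107 and S113.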

As described above, the fundamental frequency model parameter estimation apparatus according to the first embodiment estimates the command sequence ^o, the command state sequence ^s, and the parameter group ^θ using a hidden Markov model that includes a plurality of Left-to-Right HMMs, each representing a sequence of states corresponding to one of a plurality of templates. Because linguistic prior knowledge is built into the model through the design of the HMM state transition topology, the parameters of the Fujisaki model can be estimated accurately.
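The topology described here, in which several Left-to-Right chains share one connecting state, can be sketched as a table of permitted transitions. The code below is an illustrative assumption rather than the design used in the embodiment: the helper names `build_topology` and `uniform_transitions`, the choice of state 0 as the shared state, and the presence of self-loops are all hypothetical.

```python
def build_topology(chain_lengths):
    """Return (allowed, starts, ends) for a hub-and-chains HMM topology.

    allowed[i][j] is True when a transition from state i to state j is permitted.
    State 0 is the shared hub state; each template gets its own left-to-right
    chain of states. Self-loops are an assumption of this sketch.
    """
    n_states = 1 + sum(chain_lengths)
    allowed = [[False] * n_states for _ in range(n_states)]
    allowed[0][0] = True                       # assumed: the hub may persist
    starts, ends = [], []
    next_index = 1
    for length in chain_lengths:
        first, last = next_index, next_index + length - 1
        starts.append(first)
        ends.append(last)
        allowed[0][first] = True               # hub -> start state of this template
        for i in range(first, last + 1):
            allowed[i][i] = True               # assumed self-loop
            if i < last:
                allowed[i][i + 1] = True       # one-directional step along the chain
        allowed[last][0] = True                # end state of this template -> hub
        next_index = last + 1
    return allowed, starts, ends


def uniform_transitions(allowed):
    """Spread probability evenly over the permitted transitions of each state."""
    rows = []
    for row in allowed:
        n = sum(row)
        rows.append([1.0 / n if ok else 0.0 for ok in row])
    return rows
```

For example, `build_topology([2, 3])` yields six states: the hub, a two-state chain, and a three-state chain. Transitions that the table forbids start at probability zero and can be kept at zero by the re-estimation, which is one way prior knowledge about permissible command patterns can be held in the model.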

[Second Embodiment]
<System configuration>
Next, a second embodiment will be described. Parts having the same configuration as in the first embodiment are given the same reference numerals, and their description is omitted.

The second embodiment differs from the first embodiment in that the command sequence ^o and the state sequence ^s are estimated with the model parameters fixed to values obtained in advance in a learning stage.

As shown in FIG. 7, the fundamental frequency model parameter estimation apparatus 100 according to the second embodiment includes a storage unit 1, a fundamental frequency sequence extraction unit 2, a voiced/unvoiced section estimation unit 3, an initial value setting unit 4, a state sequence posterior probability update unit 5, a model parameter update unit 6, a convergence determination unit 7, a state sequence calculation unit 8, and an output unit 9.

The storage unit 1 stores the time-series data of the observed speech signal. The storage unit 1 also stores the model parameters ^θ = {{μ_p^(n)}_{n=1,...,N}, {μ_a^(n,m)}_{n=1,...,N; m=1,...,M}, {φ_{i',i}}} estimated by the fundamental frequency model parameter estimation apparatus described in the first embodiment.

The initial value setting unit 4 sets the values of μ_p^(n) and μ_a^(n,m) and the value of the transition probability φ_{i',i} to the values stored in the storage unit 1. The initial value setting unit 4 sets the initial values of the other parameters in the same manner as in the first embodiment.

The model parameter update unit 6 includes an auxiliary variable update unit 61, a command function update unit 62, and a convergence determination unit 63. The model parameter update unit 6 does not update the values of the model parameters ^θ = {{μ_p^(n)}_{n=1,...,N}, {μ_a^(n,m)}_{n=1,...,N; m=1,...,M}, {φ_{i',i}}}.

<Operation of fundamental frequency model parameter estimation apparatus>
Next, the operation of the fundamental frequency model parameter estimation apparatus 100 according to the second embodiment will be described. First, the model parameters ^θ = {{μ_p^(n)}_{n=1,...,N}, {μ_a^(n,m)}_{n=1,...,N; m=1,...,M}, {φ_{i',i}}}, estimated by the fundamental frequency model parameter estimation apparatus using the method described in the first embodiment, are input to the fundamental frequency model parameter estimation apparatus 100 and stored in the storage unit 1. The time-series data of the observed speech signal is also input to the fundamental frequency model parameter estimation apparatus 100 as the recognition target and stored in the storage unit 1.

Then, the fundamental frequency model parameter estimation apparatus 100 executes the fundamental frequency model parameter estimation processing routine shown in FIG. 8. Processes identical to those of the first embodiment are given the same reference numerals, and their detailed description is omitted.

In the next step S103, appropriate initial values are set for the parameters M, α, β, v_p²[k], v_a²[k], and v_b². The command sequence ^o and u_b are also estimated by a conventional method and set as initial values. Then, in step S204, the values of μ_p^(n) and μ_a^(n,m) are read from the storage unit 1 and assigned to μ_p^(n) and μ_a^(n,m). The value of the transition probability φ_{i',i} is likewise read from the storage unit 1 and assigned to the transition probability φ_{i',i}.

Then, in step S105, the posterior probability P(s_k = t | ^y, ^o', ^θ') is updated for every combination (k, t), thereby updating the posterior probability P(^s | ^y, ^o', ^θ') of the command state sequence.

In step S106, the command sequence ^o and the parameter group θ, which represents the command amplitudes and the state transition probabilities, are updated so as to increase the objective function Q(^o, θ, ^o', ^θ') or Q(^o, θ, ^o', ^θ') + Λ.

Step S106 is implemented by the processes of steps S111 to S113 below.

In step S111, the auxiliary variables λ_{p,k,l} and λ_{a,k,l} are calculated and updated according to equation (19) for every combination (k, l). When the sparsity constraint term is taken into account, the auxiliary variables w_k, d_k, and m_k are additionally updated according to equations (33) to (35).

In the next step S112, the command sequence ^o, consisting of the phrase command u_p[k] and the accent command u_a[k] at each time k, is updated according to equation (20). When the sparsity constraint term is taken into account, equation (36) is used instead to update the command sequence ^o.

In the next step S113, whether the iteration count s has reached S is checked as the convergence condition. If s has reached S, the convergence condition is judged to be satisfied, and the process proceeds to step S107.

Then, in step S107, whether the iteration count r has reached R is checked as the convergence condition. If r has not reached R, in step S108 the command sequence ^o and the parameter group θ updated in step S106 are substituted into ^o' and ^θ', and the process returns to step S105. If r has reached R, the convergence condition is judged to be satisfied; in step S108, the state sequence ^s is calculated according to equations (49) to (53), the output unit 9 outputs the command sequence ^o, the parameter group ^θ representing the command amplitudes, and the state sequence ^s, and the fundamental frequency model parameter estimation processing routine ends.

As described above, the fundamental frequency model parameter estimation apparatus according to the second embodiment estimates the command sequence ^o and the command state sequence ^s using a hidden Markov model that includes a plurality of Left-to-Right HMMs, each representing a sequence of states corresponding to one of a plurality of templates. Because linguistic prior knowledge is built into the model through the design of the HMM state transition topology, the parameters of the Fujisaki model can be estimated accurately.
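The claims state that the final command state sequence ^s is obtained with the Viterbi algorithm. The following is a generic log-domain Viterbi sketch, offered only as an illustration under assumptions: the function name `viterbi` and the precomputed per-state likelihoods `lik[k][i]` are hypothetical, and the exact recursion of equations (49) to (53) is not reproduced here.

```python
import math


def viterbi(init, trans, lik):
    """Return the most likely state path of an HMM, computed in the log domain.

    init[i]     : initial probability of state i
    trans[i][j] : probability of a transition from state i to state j
    lik[k][i]   : likelihood of the observation at time k under state i (assumed given)
    """
    def log(x):
        # Zero probability maps to -inf so forbidden transitions are never chosen.
        return math.log(x) if x > 0.0 else float("-inf")

    K, I = len(lik), len(init)
    score = [log(init[i]) + log(lik[0][i]) for i in range(I)]
    back = []                                  # back[k-1][j] = best predecessor of j at time k
    for k in range(1, K):
        new_score, ptr = [], []
        for j in range(I):
            best_i = max(range(I), key=lambda i: score[i] + log(trans[i][j]))
            new_score.append(score[best_i] + log(trans[best_i][j]) + log(lik[k][j]))
            ptr.append(best_i)
        score = new_score
        back.append(ptr)

    # Trace back from the best final state.
    state = max(range(I), key=lambda i: score[i])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    path.reverse()
    return path
```

Working with log probabilities avoids underflow on long utterances and lets zero-probability transitions in a constrained topology be handled without special cases.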

<Experiment>
Because the estimation accuracy of the command sequence depends on the size of the vocabulary model (the number of intonation templates), the estimation accuracy was evaluated for vocabulary models of different sizes.

In the learning stage, the model parameters were estimated using, as training data, F_0 contours extracted with a fundamental frequency estimation method from the first 50 sentences of the ATR phonetically balanced sentences read by a male speaker (MHT). The constant parameters were set as follows: t_0 = 8 ms, α = 3.0 rad/s, β = 20.0 rad/s, v_p²[k] = 32, v_a²[k] = 0.032, v_b² = 10^-8, v_n²[k] = 10^15 in voiced sections, and v_n²[k] = 0.1² in unvoiced sections. μ_b was set to the minimum of all log F_0 values in the voiced sections.

In the recognition stage, the command sequence was estimated with the model parameters fixed, using the last 53 sentences of the ATR phonetically balanced sentences read by the same speaker as in the learning stage, and the estimation accuracy of the results was evaluated.

Evaluation was performed for 5, 10, and 15 templates. The initial values of the model parameters Θ were set to values obtained using the method described in Non-Patent Document 1.

The evaluation framework is the same as in the inventors' previous study (see Non-Patent Document 2). To evaluate the estimation accuracy of the command sequence, the estimated commands were matched against the ground-truth data command by command using dynamic programming, and the matching was used for evaluation. A phrase command was defined as matched when the difference in position was S or less. A pair of accent commands was defined as matched when the average of the absolute difference in onset position and the absolute difference in offset position was S or less. Command magnitudes were not used for matching, because the magnitude is affected by the value of the baseline component, which is set differently in the proposed method and in the ground-truth data. Let N_E and N_A be the number of Fujisaki model commands estimated by the proposed method and the number of commands in the ground-truth data, respectively, and let N_M be the number of commands matched between the estimated and ground-truth command sequences. Let N_Esum, N_Asum, and N_Msum be the sums of N_E, N_A, and N_M over all sentences. From these values, the insertion error E_I is defined as (N_Esum − N_Msum) / N_Asum, the deletion error E_D as (N_Asum − N_Msum) / N_Asum, and the accuracy A as 1 − E_I − E_D.
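The match criteria and error measures defined above translate directly into a few lines of code. The sketch below is illustrative only: the function names `phrase_match`, `accent_match`, and `error_rates`, the tuple layouts, and the sample counts in the usage note are assumptions, and the dynamic-programming alignment that selects which commands to compare is not implemented.

```python
def phrase_match(t_est, t_ref, tol):
    """A phrase command matches when its position differs by at most tol seconds."""
    return abs(t_est - t_ref) <= tol


def accent_match(est, ref, tol):
    """est and ref are (onset, offset) pairs in seconds.

    An accent command matches when the mean of the absolute onset difference
    and the absolute offset difference is at most tol.
    """
    return (abs(est[0] - ref[0]) + abs(est[1] - ref[1])) / 2.0 <= tol


def error_rates(per_sentence_counts):
    """per_sentence_counts: iterable of (N_E, N_A, N_M) tuples, one per sentence.

    Returns (E_I, E_D, A) computed from the totals over all sentences.
    """
    counts = list(per_sentence_counts)
    n_e_sum = sum(c[0] for c in counts)        # estimated commands
    n_a_sum = sum(c[1] for c in counts)        # ground-truth commands
    n_m_sum = sum(c[2] for c in counts)        # matched commands
    insertion = (n_e_sum - n_m_sum) / n_a_sum
    deletion = (n_a_sum - n_m_sum) / n_a_sum
    return insertion, deletion, 1.0 - insertion - deletion
```

As a made-up example, two sentences with counts (5, 4, 3) and (5, 6, 5) give totals of 10 estimated, 10 ground-truth, and 8 matched commands, so E_I = 0.2, E_D = 0.2, and A = 0.6. Since both errors are normalised by the ground-truth count, A can become negative when many spurious commands are inserted.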

FIG. 9 shows the evaluation results for S = 0.3 s. The left, center, and right columns of FIG. 9 show the estimation accuracy for phrase and accent commands combined, for phrase commands only, and for accent commands only, respectively. The "initial value" row gives the accuracy of the initial values of the EM algorithm (that is, of the method of Non-Patent Document 1), each "T = N" row gives the accuracy of the proposed algorithm with N templates, and the "conventional method" row gives the accuracy of the probabilistic-model method that does not incorporate the vocabulary model of this embodiment (Non-Patent Document 2). The overall estimation accuracy improves for T = 5 and T = 10, and the estimation accuracy of phrase commands improves in all cases.

Because the proposed method can use information on the position, magnitude, and number of accent commands when estimating the positions of phrase commands, it is considered to have yielded the large improvement in phrase command detection accuracy. On the other hand, the estimation accuracy of accent commands tended to decrease as the vocabulary model grew, owing to overfitting.

Note that the present invention is not limited to the embodiments described above, and various modifications and applications are possible without departing from the gist of the invention.

For example, the fundamental frequency model parameter estimation apparatus described above contains a computer system; when a WWW system is used, the "computer system" also includes a homepage providing environment (or display environment).

Although this specification describes an embodiment in which the program is installed in advance, the program can also be provided stored on a computer-readable recording medium.

DESCRIPTION OF SYMBOLS
1 Storage unit
2 Fundamental frequency sequence extraction unit
3 Voiced/unvoiced section estimation unit
4 Initial value setting unit
5 State sequence posterior probability update unit
6 Model parameter update unit
7 Convergence determination unit
8 State sequence calculation unit
9 Output unit
61 Auxiliary variable update unit
62 Command function update unit
63 Convergence determination unit
64 Average amplitude update unit
100 Fundamental frequency model parameter estimation apparatus

Claims

A fundamental frequency model parameter estimation device that takes a speech signal as input and estimates: a command sequence ^o consisting of pairs ^o[k] of a phrase command u_p[k], representing the fundamental frequency pattern generated by translational movement of the thyroid cartilage at each time k, and an accent command u_a[k], representing the fundamental frequency pattern generated by rotational movement of the thyroid cartilage; a command state sequence ^s consisting of the indices s_k of the states of a hidden Markov model at each time k, each state indicating a pair of the phrase command and the accent command; and a parameter group ^θ including the transition probabilities φ_{i',i} between states i' and i of the hidden Markov model, the device comprising:
A fundamental frequency extraction unit that extracts, from the time-series data of the speech signal, an observed fundamental frequency sequence ^y representing the fundamental frequency at each time k of the speech signal;
A voiced/unvoiced section estimation unit that estimates, for the time-series data of the speech signal, the degree of uncertainty of the fundamental frequency at each time k according to whether the section is voiced or unvoiced;
An initial value setting unit that sets an initial value of the command sequence ^o and an initial value of the parameter group ^θ;
A state sequence posterior probability update unit that, using the Forward-Backward algorithm, calculates for each combination (k, t) of time k and state t the posterior probability P(s_k = t | ^y, ^o', ^θ') given the observed fundamental frequency sequence ^y, the command sequence ^o', which is the previously updated value or the initial value of the command sequence ^o, and the parameter group ^θ', which is the previously updated value or the initial value of the parameter group ^θ;
A model parameter update unit that updates the command sequence ^o and the parameter group ^θ based on the command sequence ^o', the observed fundamental frequency sequence ^y, the degree of uncertainty at each time k, and the posterior probability P(s_k = t | ^y, ^o', ^θ') for each combination (k, t) of time k and state t;
A convergence determination unit that repeats the calculation by the state sequence posterior probability update unit and the update by the model parameter update unit until a predetermined convergence condition is satisfied; and
A state sequence calculation unit that calculates the command state sequence ^s using the Viterbi algorithm based on the command sequence ^o finally updated by the model parameter update unit;
Wherein the hidden Markov model includes a plurality of Left-to-Right HMMs, each representing a sequence of the states corresponding to one of a plurality of templates and having state transitions in a fixed direction, a specific state being connected to the start state of each of the plurality of Left-to-Right HMMs, and the end state of each of the plurality of Left-to-Right HMMs being connected to the specific state; and
Wherein the parameter group ^θ further includes, for each of the plurality of templates n, an output mean μ_p^(n) of the state corresponding to the phrase command in the template n and an output mean μ_a^(n,m) of each state corresponding to each accent command m in the template n.

The fundamental frequency model parameter estimation device of claim 1, wherein the model parameter update unit takes as the objective function the log posterior probability log P(^o, ^θ | ^y) of the command sequence ^o and the parameter group ^θ given the observed fundamental frequency sequence ^y, and updates the command sequence ^o and the parameter group ^θ so as to increase the objective function, based on the command sequence ^o', the observed fundamental frequency sequence ^y, the degree of uncertainty at each time k, and the posterior probability P(s_k = t | ^y, ^o', ^θ') for each combination (k, t) of time k and state t.

A fundamental frequency model parameter estimation device that takes a speech signal as input and estimates: a command sequence ^o consisting of pairs ^o[k] of a phrase command u_p[k], representing the fundamental frequency pattern generated by translational movement of the thyroid cartilage at each time k, and an accent command u_a[k], representing the fundamental frequency pattern generated by rotational movement of the thyroid cartilage; and a command state sequence ^s consisting of the indices s_k of the states of a hidden Markov model at each time k, each state indicating a pair of the phrase command and the accent command, the device comprising:
A fundamental frequency extraction unit that extracts, from the time-series data of the speech signal, an observed fundamental frequency sequence ^y representing the fundamental frequency at each time k of the speech signal;
A voiced/unvoiced section estimation unit that estimates, for the time-series data of the speech signal, the degree of uncertainty of the fundamental frequency at each time k according to whether the section is voiced or unvoiced;
An initial value setting unit that sets an initial value of the command sequence ^o;
A state sequence posterior probability update unit that, using the Forward-Backward algorithm, calculates for each combination (k, t) of time k and state t the posterior probability P(s_k = t | ^y, ^o', ^θ') given the observed fundamental frequency sequence ^y, the command sequence ^o', and the parameter group ^θ', based on the command sequence ^o', which is the previously updated value or the initial value of the command sequence ^o, and on a previously obtained parameter group ^θ' including the transition probabilities φ_{i',i} between states i' and i of the hidden Markov model;
A model parameter update unit that updates the command sequence ^o based on the command sequence ^o', the observed fundamental frequency sequence ^y, the degree of uncertainty at each time k, and the posterior probability P(s_k = t | ^y, ^o', ^θ') for each combination (k, t) of time k and state t;
A convergence determination unit that repeats the calculation by the state sequence posterior probability update unit and the update by the model parameter update unit until a predetermined convergence condition is satisfied; and
A state sequence calculation unit that calculates the command state sequence ^s using the Viterbi algorithm based on the command sequence ^o finally updated by the model parameter update unit;
Wherein the hidden Markov model includes a plurality of Left-to-Right HMMs, each representing a sequence of the states corresponding to one of a plurality of templates and having state transitions in a fixed direction, a specific state being connected to the start state of each of the plurality of Left-to-Right HMMs, and the end state of each of the plurality of Left-to-Right HMMs being connected to the specific state; and
Wherein the parameter group ^θ further includes, for each of the plurality of templates n, an output mean μ_p^(n) of the state corresponding to the phrase command in the template n and an output mean μ_a^(n,m) of each state corresponding to each accent command m in the template n.

With an audio signal as an input, a phrase command u _p [k] representing a fundamental frequency pattern generated by translational movement of thyroid cartilage at each time k and an accent command u _a [k] representing a fundamental frequency pattern generated by rotational movement of thyroid cartilage a command string ^ o consisting of a pair ^ o [k] of, and hidden at each time k of the Markov model, consisting of the index s _k of state indicating the phrase command and said accent command of the pair instruction state series ^ s, the the state i of the hidden Markov model ', each of the transition probability between i phi _i', a fundamental frequency model parameter estimation method for estimating a parameter group ^ theta including _i,
The fundamental frequency extraction unit extracts the observed fundamental frequency sequence ^ y representing the fundamental frequency at each time k of the speech signal from the time series data of the speech signal,
The voiced and unvoiced section estimation unit estimates the degree of uncertainty of the fundamental frequency at each time k according to whether the time series data of the voice signal is a voiced or unvoiced section,
An initial value setting unit sets an initial value of the command sequence ^ o and an initial value of the parameter group ^ θ,
A state series posterior probability update unit calculates, using the Forward-Backward algorithm, for each combination (k, t) of a time k and a state t, the posterior probability P(s_k = t | ^y, ^o′, ^θ′) given the observed fundamental frequency sequence ^y, the command sequence ^o′, and the parameter group ^θ′, where ^o′ is the previously updated value of the command sequence ^o or the initial value of the command sequence ^o, and ^θ′ is the previously updated value of the parameter group ^θ or the initial value of the parameter group ^θ;
A model parameter update unit updates the command sequence ^o and the parameter group ^θ based on the command sequence ^o′, the observed fundamental frequency sequence ^y, the degree of uncertainty at each time k, and the posterior probability P(s_k = t | ^y, ^o′, ^θ′) for each combination (k, t) of a time k and a state t;
A convergence determination unit causes the calculation by the state series posterior probability update unit and the update by the model parameter update unit to be repeated until a predetermined convergence condition is satisfied;
A state series calculation unit calculates the command state sequence ^s using the Viterbi algorithm, based on the command sequence ^o finally updated by the model parameter update unit;
The hidden Markov model consists of a plurality of Left-to-Right HMMs that correspond to a plurality of templates, each representing a sequence of the states and having state transitions in a fixed direction, wherein the start state of each of the plurality of Left-to-Right HMMs is connected to a specific state and the end state of each of the plurality of Left-to-Right HMMs is connected to that specific state;
In this fundamental frequency model parameter estimation method, the parameter group ^θ further includes, for each of the plurality of templates n, an output mean μ_p^(n) of the state corresponding to the phrase command in template n and an output mean μ_a^(n,m) of each state corresponding to each accent command m in template n.
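
As a non-limiting illustration of the posterior calculation recited above, the state posterior P(s_k = t | ^y, ^o′, ^θ′) can be obtained with a scaled Forward-Backward recursion. The sketch below is written under stated assumptions and is not the claimed implementation: the function name `forward_backward`, the initial distribution `pi`, and the table `b[k][t]` of per-frame emission likelihoods (assumed to have been evaluated beforehand from ^y, ^o′, and the degree of uncertainty) are all hypothetical; `phi[i][j]` plays the role of the transition probabilities.

```python
def forward_backward(pi, phi, b):
    """Return gamma[k][t] = P(s_k = t | all observations) for a discrete-state HMM.

    pi[t]     : initial state probability (hypothetical input)
    phi[i][j] : transition probability from state i to state j
    b[k][t]   : likelihood of the observation at time k under state t (assumed given)
    """
    K, T = len(b), len(pi)
    alpha = [[0.0] * T for _ in range(K)]
    beta = [[1.0] * T for _ in range(K)]
    scale = [0.0] * K

    # Forward pass, normalised frame by frame to avoid numerical underflow.
    for t in range(T):
        alpha[0][t] = pi[t] * b[0][t]
    scale[0] = sum(alpha[0])
    alpha[0] = [a / scale[0] for a in alpha[0]]
    for k in range(1, K):
        for t in range(T):
            alpha[k][t] = b[k][t] * sum(alpha[k - 1][i] * phi[i][t] for i in range(T))
        scale[k] = sum(alpha[k])
        alpha[k] = [a / scale[k] for a in alpha[k]]

    # Backward pass, reusing the scaling factors from the forward pass.
    for k in range(K - 2, -1, -1):
        for t in range(T):
            beta[k][t] = sum(
                phi[t][j] * b[k + 1][j] * beta[k + 1][j] for j in range(T)
            ) / scale[k + 1]

    # With this scaling, alpha * beta is already the normalised posterior.
    return [[alpha[k][t] * beta[k][t] for t in range(T)] for k in range(K)]
```

In an iterative scheme of the kind described, such a routine would be called once per iteration with the current ^o′ and ^θ′, and the returned posteriors would then feed the model parameter update.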

A fundamental frequency model parameter estimation method that takes a speech signal as input and estimates: a command sequence ^o consisting of the pairs ^o[k], at each time k, of a phrase command u_p[k] representing the fundamental frequency pattern generated by translational movement of the thyroid cartilage and an accent command u_a[k] representing the fundamental frequency pattern generated by rotational movement of the thyroid cartilage; and a command state sequence ^s consisting of the indices s_k, at each time k, of the states of a hidden Markov model that indicate the pair of the phrase command and the accent command, wherein:
A fundamental frequency extraction unit extracts, from the time-series data of the speech signal, the observed fundamental frequency sequence ^y representing the fundamental frequency of the speech signal at each time k;
A voiced and unvoiced section estimation unit estimates, from the time-series data of the speech signal, the degree of uncertainty of the fundamental frequency at each time k according to whether the signal at that time lies in a voiced or an unvoiced section;
An initial value setting unit sets an initial value of the command sequence ^o;
A state series posterior probability update unit calculates, using the Forward-Backward algorithm, for each combination (k, t) of a time k and a state t, the posterior probability P(s_k = t | ^y, ^o′, ^θ′) given the command sequence ^o′ and the parameter group ^θ′, based on the command sequence ^o′, which is the previously updated value of the command sequence ^o or the initial value of the command sequence ^o, and on a predetermined parameter group ^θ including the transition probabilities φ_{i′,i} between states i′ and i of the hidden Markov model;
A model parameter update unit updates the command sequence ^o based on the command sequence ^o′, the observed fundamental frequency sequence ^y, the degree of uncertainty at each time k, and the posterior probability P(s_k = t | ^y, ^o′, ^θ′) for each combination (k, t) of a time k and a state t;
A convergence determination unit causes the calculation by the state series posterior probability update unit and the update by the model parameter update unit to be repeated until a predetermined convergence condition is satisfied;
A state series calculation unit calculates the command state sequence ^s using the Viterbi algorithm, based on the command sequence ^o finally updated by the model parameter update unit;
The hidden Markov model consists of a plurality of Left-to-Right HMMs that correspond to a plurality of templates, each representing a sequence of the states and having state transitions in a fixed direction, wherein the start state of each of the plurality of Left-to-Right HMMs is connected to a specific state and the end state of each of the plurality of Left-to-Right HMMs is connected to that specific state;
In this fundamental frequency model parameter estimation method, the parameter group ^θ further includes, for each of the plurality of templates n, an output mean μ_p^(n) of the state corresponding to the phrase command in template n and an output mean μ_a^(n,m) of each state corresponding to each accent command m in template n.
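
As a non-limiting illustration of the template topology and the Viterbi step recited above, the following sketch builds a transition matrix in which several Left-to-Right chains share one hub state, then decodes the most likely state sequence. Everything here is an assumption made for illustration: the names `build_template_hmm` and `viterbi`, placing the specific state at index 0, giving the hub a self-loop, the uniform entry probabilities, the `stay` value, and the precomputed emission table `b[k][t]` are hypothetical and are not taken from the claims.

```python
import math


def build_template_hmm(template_lengths, stay=0.5):
    """Transition matrix with state 0 as the shared hub and one chain per template."""
    n_states = 1 + sum(template_lengths)
    phi = [[0.0] * n_states for _ in range(n_states)]
    first, starts = 1, []
    for length in template_lengths:
        starts.append(first)
        for j in range(length):
            s = first + j
            phi[s][s] = stay                      # remain in the current state
            nxt = s + 1 if j < length - 1 else 0  # the end state returns to the hub
            phi[s][nxt] = 1.0 - stay
        first += length
    phi[0][0] = stay                              # assumed self-loop on the hub
    for s in starts:                              # hub enters each template's start state
        phi[0][s] = (1.0 - stay) / len(starts)
    return phi


def viterbi(pi, phi, b):
    """Most likely state path, computed in the log domain."""
    def log(x):
        return math.log(x) if x > 0.0 else -math.inf

    K, T = len(b), len(pi)
    delta = [log(pi[t]) + log(b[0][t]) for t in range(T)]
    back = []
    for k in range(1, K):
        prev, delta, ptr = delta, [], []
        for t in range(T):
            best = max(range(T), key=lambda i: prev[i] + log(phi[i][t]))
            ptr.append(best)
            delta.append(prev[best] + log(phi[best][t]) + log(b[k][t]))
        back.append(ptr)
    # Trace the best path backwards from the best final state.
    path = [max(range(T), key=lambda t: delta[t])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]
```

For example, `build_template_hmm([2, 3])` yields six states: the hub, a two-state template, and a three-state template. Because transitions inside a template only move forward or stay, a decoded path cannot jump between templates without passing through the hub.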

A program for causing a computer to function as each unit of the fundamental frequency model parameter estimation device according to claim 1 or 2.

A program for causing a computer to function as each unit of the fundamental frequency model parameter estimation device according to claim 3.