JP2003241776A

JP2003241776A - Speech analyzing method and apparatus therefor, and speech analyzing program and recording medium therefor

Info

Publication number: JP2003241776A
Application number: JP2002039693A
Authority: JP
Inventors: Sadao Hiroya; 定男廣谷; Masaaki Yoda; 雅彰誉田
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2002-02-18
Filing date: 2002-02-18
Publication date: 2003-08-29

Abstract

PROBLEM TO BE SOLVED: To accurately determine, from a given speech signal, the articulatory movement trajectory during the utterance.

SOLUTION: Vectors are formed by adding velocity and acceleration parameters to the position parameters of the articulatory movement and to the speech spectrum. A hidden Markov model (HMM) is built as a dynamic model of the articulatory movement, and for each HMM state a linear function that determines the speech spectrum from the articulatory parameters is prepared as a speech generation model. The speech spectrum of the input speech is compared with the speech spectrum patterns in an articulatory-acoustic codebook, and a plurality of articulatory code parameters xC are selected in increasing order of distance. Using the speech generation models, the selected xC, and the input speech spectrum, the state sequence q that maximizes the appearance probability of the speech spectrum is determined. For this q, the articulatory parameters that maximize the appearance probability of the speech spectrum under the speech generation model are generated from the given speech signal.

COPYRIGHT: (C)2003,JPO

Description

Detailed Description of the Invention

[0001]

[Technical Field of the Invention] The present invention relates to a speech analysis method that, for an input speech signal, determines the positions of the articulatory organs (speech organs) such as the jaw, tongue, and lips at the time the speech was produced, that is, the articulatory movement. It also relates to an apparatus for that method, a speech analysis program, and a recording medium for the program.

[0002]

[Prior Art] One method for inversely estimating, from input speech, the articulatory movement at the time the speech was produced uses an articulatory-acoustic pair codebook. A large number of data pairs obtained by simultaneously observing articulatory movement and speech are stored in the codebook. The codebook is then searched with the input speech data, and the articulatory data paired with the nearest speech data are retrieved to estimate the articulatory movement (Schroeter, J. and Sondhi, M. M., "Speech coding based on physiological models of speech production," Advances in Speech Signal Processing, edited by S. Furui and M. M. Sondhi (Dekker, New York), 231-267, 1992).

[0003] In contrast, a method has been proposed that incorporates a dynamic constraint: it examines how the articulatory data obtained from the articulatory-acoustic pair codebook change relative to the preceding and following data, and excludes articulatory data that would not occur in normal articulatory movement (S. Suzuki, T. Okadome and M. Honda, "Determination of articulatory positions from speech acoustics by applying dynamic articulatory constraints," Proc. of the 5th ICSLP, 2251-2254, 1998). If the articulatory movement of the articulatory organs during an utterance can be analyzed in this way, the results can be used in various speech-related fields such as artificial speech generation and language learning.

[0004] Although it is not an inverse estimation of articulatory movement, an algorithm has been proposed for generating speech parameters from a hidden Markov model (hereinafter, HMM) that takes into account, as dynamic features, velocity and acceleration parameters representing the temporal change of the speech parameters (K. Tokuda, T. Masuko, T. Yamada, T. Kobayashi and S. Imai, "An algorithm for speech generation from continuous mixture HMMs with dynamic features," Proc. EUROSPEECH, 757-760, 1995). It is used, for example, to synthesize speech by taking a given phoneme sequence and obtaining the speech spectrum from the corresponding HMMs.

[0005]

[Problems to Be Solved by the Invention] Conventional inverse estimation of articulatory movement uses an articulatory-acoustic pair codebook built from training data and retrieves the articulatory data paired with the speech data closest to the given speech. It is almost never the case that each item in the input speech data sequence exactly matches some speech data in the codebook; rather, most input speech data differ slightly from the speech data in the codebook. Consequently, the resulting articulatory data sequence, that is, the estimated articulatory movement, was not accurate.

[0006]

[Means for Solving the Problems] The present invention uses an acoustic-articulatory pair codebook that stores pairs of a speech parameter vector and an articulatory parameter vector, together with a model storage unit. The model storage unit stores an articulatory model, which represents the time series of articulatory parameter vectors by stochastic dynamic state transitions, and a speech generation model, which determines the occurrence probability of a speech parameter vector given an articulatory parameter vector. The input speech signal is spectrally analyzed to obtain input speech parameter vectors. The acoustic-articulatory pair codebook is consulted to select, as approximate articulatory parameter vectors, at least one articulatory parameter vector whose paired speech parameter vector is close to the input speech parameter vector. Then, using the input speech parameter vector, the approximate articulatory parameter vectors, and the corresponding speech generation model and articulatory model read from the model storage unit, the articulatory parameter vector sequence that maximizes the appearance probability of the input speech parameter vector sequence is obtained.

[0007]

[Embodiments of the Invention] Embodiments of the present invention are described below. First, the method of creating the model used in this embodiment is described with reference to FIG. 1.

Model creation

An articulatory-acoustic pair codebook is created from speech signals of continuously uttered sentences and articulatory data observed simultaneously by a magnetic sensor system. In the spectrum analysis unit 12, the speech signal from the speech input terminal 11 is cut out frame by frame, for example at a rate of 250 frames per second with a Blackman window of length 32 ms, and spectrally analyzed. For example, the mel-cepstral coefficients of orders 1 to 21 (the 0th-order term excluded) are obtained as speech parameters. As needed, the speed detection unit 13 differentiates (differences) the speech parameters to obtain velocity parameters representing their temporal change, and the synthesis unit 14 outputs a vector whose elements are the speech parameters and the velocity parameters as the speech parameter vector y.

[0008] The position information signals, in the horizontal and vertical directions, of a plurality of simultaneously observed positions on the articulatory organs are input to the articulatory input terminals 15a to 15n and sampled by the position information unit 17 at a rate of 250 times per second. In this example there are seven points in total: the lower jaw, the upper and lower lips, and four points on the tongue. As needed, the speed detection unit 18 differentiates (differences) each position parameter to obtain a velocity parameter as its temporal change, and, further as needed, the acceleration detection unit 19 differentiates (differences) each velocity parameter to obtain an acceleration parameter. A vector whose elements are these 14 position parameters, 14 velocity parameters, and 14 acceleration parameters is output from the synthesis unit 21 as the articulatory parameter vector x. In this example, therefore, the speech parameter vector y and the articulatory parameter vector x are each a vector of 42 elements, as follows:

$$y = [k_1, \ldots, k_{21}, k_1', \ldots, k_{21}']$$

$$x = [p_a, \ldots, p_n, p_a', \ldots, p_n', p_a'', \ldots, p_n'']$$

A plurality of data pairs, for example 200,000 sets, each consisting of a speech parameter vector y and an articulatory parameter vector x obtained at the same time instant, are held to form the articulatory-acoustic pair codebook 22.
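As an illustration of how the 42-element vectors x and y could be assembled, the following minimal sketch appends finite-difference velocity and acceleration to a sequence of static parameters. The function name `add_dynamic_features`, the use of `numpy.gradient` as the difference operator, and the random stand-in data are assumptions made for this example; the text above only states that differentiation (differencing) is used.

```python
import numpy as np

def add_dynamic_features(static, order=2, frame_rate=250.0):
    """Append time derivatives (velocity, acceleration, ...) to static parameters.

    static: array of shape (T, D), one row per frame.
    Returns an array of shape (T, D * (order + 1)).
    """
    dt = 1.0 / frame_rate
    parts = [static]
    for _ in range(order):
        # Assumed scheme: central differences inside, one-sided at the edges.
        parts.append(np.gradient(parts[-1], dt, axis=0))
    return np.hstack(parts)

rng = np.random.default_rng(0)
positions = rng.normal(size=(100, 14))  # 7 sensor points x (horizontal, vertical)
mcep = rng.normal(size=(100, 21))       # 21 mel-cepstral coefficients per frame

x = add_dynamic_features(positions, order=2)  # position + velocity + acceleration
y = add_dynamic_features(mcep, order=1)       # static + velocity
print(x.shape, y.shape)
```

Both vectors end up with 42 elements per frame, matching the example in the text: 14 × 3 for x and 21 × 2 for y.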

[0009] Using the articulatory parameter vectors x and speech parameter vectors y obtained in this way, the model creation unit 23 creates a model λ. The model λ includes an articulatory model, which represents the time series of the articulatory parameter vector x by stochastic dynamic state transitions, and a speech generation model, which determines the occurrence probability of the speech parameter vector y given the articulatory parameter vector x. In this example an HMM is used as the model λ. The HMM is a two-phoneme model with three states, a single Gaussian mixture component per state, and a left-to-right topology without skips. For example, as shown in FIG. 2, there are three states q_1, q_2, q_3. In each state, the appearance probabilities of the articulatory parameter vector and of the speech parameter vector are each a single Gaussian distribution. The only state transitions are the transitions from a state to itself (loops) and the transitions from q_1 to q_2 and from q_2 to q_3, five in total. For each phoneme, a separate model is created for each different phoneme that follows it. However, to preserve accuracy, when the number of data items in a state of an HMM is 256 or fewer, that state is replaced by the single-phoneme HMM.
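The topology described above can be written down as a small transition matrix. The sketch below is illustrative only: the self-loop probabilities are placeholder values rather than trained ones, and the extra "exit" column that carries the probability mass leaving q_3 is an assumption added so that each row sums to one.

```python
import numpy as np

def left_to_right_transitions(self_loop=(0.6, 0.6, 0.6)):
    """3-state left-to-right HMM without skips, plus an exit column.

    Rows are states q1..q3; columns are q1..q3 followed by an 'exit' column.
    """
    n = len(self_loop)
    trans = np.zeros((n, n + 1))
    for i, p_stay in enumerate(self_loop):
        trans[i, i] = p_stay            # loop: q_i -> q_i
        trans[i, i + 1] = 1.0 - p_stay  # advance: q_i -> q_{i+1} (exit from q3)
    return trans

trans = left_to_right_transitions()
print(trans)
```

Inside the model (the first three columns) exactly five entries are non-zero: the three loops and the two forward transitions.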

[0010] To create the HMM model λ, the model is first trained so that the appearance probability P(x, q | λ) of the entire articulatory parameter vector sequence x obtained from the continuous utterance of the sentences is maximized. This can be done in the same way as the method, described in the [Prior Art] section, of creating an HMM using speech parameters only. Here q denotes the state sequence for the entire articulatory parameter vector sequence x. Let q_i be each state in the state sequence q. The appearance probability of the articulatory parameter vector x in state q_i is the product of the transition probability into that state, P_t = P(q_i | λ), and the appearance probability of x in that state, P_x = P(x | q_i, λ). Each model is therefore created so that P(x, q | λ) = P(x | q, λ) P(q | λ) is maximized. The appearance probability P(x | q, λ) of the articulatory parameter vector is assumed to be Gaussian.

[0011] In this way, the state transition probability P_t and the appearance probability P_x are obtained for each state q_i of each HMM. In the model storage unit 24, as shown for example in FIG. 3, the state transition probabilities P_t are stored in the storage unit 26 for each of the storage units 25-1 to 25-J of the models λ_1 to λ_J. As described above, these are five probabilities in total: the loops and the transitions to the adjacent state. Furthermore, in the present invention a linear function y = Ax + b is used, for each state, as the conversion function y = f(x) that gives the speech parameter vector y with the articulatory parameter vector x as its variable. For each state of the HMM model λ obtained above, the speech generation coefficient generation unit 20 determines the matrix A and the vector b of the articulatory-to-acoustic conversion function y = Ax + b by the least-squares method, from the plurality of articulatory parameter vectors x and speech parameter vectors y belonging to that state. As shown in FIG. 4, in the above example the vector y on the left-hand side has 42 elements, the vector x also has 42 elements, the matrix A is a 42 × 42 matrix, and the constant term b is a vector of 42 elements. Accordingly, for each state q_i, at least 42 pairs of a speech parameter vector y and an articulatory parameter vector x belonging to that state are used. From these, the A and b are found that minimize the squared error between the speech parameter vector y' = Ax + b computed from the articulatory parameter vector x and the speech parameter vector y paired with that x.
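The per-state least-squares fit of A and b can be sketched as follows. This is a minimal illustration, not the patent's implementation: the helper name `fit_affine_map`, the trick of appending a column of ones to absorb b, and the synthetic data are assumptions, and `numpy.linalg.lstsq` stands in for whatever solver was actually used.

```python
import numpy as np

def fit_affine_map(X, Y):
    """Least-squares fit of Y ~ X @ A.T + b for the frames of one HMM state.

    X: (n, M) articulatory vectors, Y: (n, N) speech vectors of that state.
    Returns A of shape (N, M) and b of shape (N,).
    """
    n = X.shape[0]
    X_aug = np.hstack([X, np.ones((n, 1))])        # column of ones absorbs b
    W, *_ = np.linalg.lstsq(X_aug, Y, rcond=None)  # W has shape (M + 1, N)
    return W[:-1].T, W[-1]

# Synthetic stand-in data for one state (42-dimensional x and y).
rng = np.random.default_rng(0)
A_true = rng.normal(size=(42, 42))
b_true = rng.normal(size=42)
X = rng.normal(size=(500, 42))
Y = X @ A_true.T + b_true + 0.01 * rng.normal(size=(500, 42))

A, b = fit_affine_map(X, Y)
print(np.abs(A - A_true).max(), np.abs(b - b_true).max())
```

Because b adds one unknown per output dimension, each row of the fit has 43 unknowns, so in practice a state needs somewhat more than the minimum number of pairs for the solution to be well determined.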

[0012] The A and b obtained in this way, which form part of the speech generation model, are stored in the coefficient storage unit 28 for the corresponding state of the corresponding model in the model storage unit 24. In addition, statistics are computed so that the appearance probability P_x of the articulatory parameter vector x and the appearance probability P_y of the speech parameter vector y can be obtained for each state of each model. The noise generation unit 31 (FIG. 1) computes, for each state of each model, the error w between y' = Ax + b and y (in this example a vector of 42 elements). The noise mean calculation unit 32 computes the mean w_m of each element of the error w, and the noise covariance calculation unit 33 computes the covariance σ_w of the error (residual) w. Likewise, for each state of each model, the articulatory mean calculation unit 34 computes the mean x_m of the articulatory parameter vectors x belonging to that state, and the covariance calculation unit 35 computes the covariance σ_x of those vectors. The means w_m, x_m and covariances σ_w, σ_x computed for each state are stored in the mean/covariance storage unit 29 for the corresponding state of the corresponding model. The appearance probabilities P_x and P_y can be obtained from x_m, σ_x and from A, b, w_m, σ_w, respectively.
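The per-state statistics and the two Gaussian appearance probabilities can be sketched as follows. The helper names `state_statistics` and `gaussian_logpdf`, the small dimensions, and the synthetic data are assumptions for illustration; log-probabilities are used instead of probabilities purely for numerical convenience.

```python
import numpy as np

def state_statistics(X, Y, A, b):
    """Per-state means and covariances: (w_m, sigma_w, x_m, sigma_x)."""
    W = Y - (X @ A.T + b)              # residual w = y - (A x + b), one row per frame
    w_m = W.mean(axis=0)
    sigma_w = np.cov(W, rowvar=False)  # sample covariance of the residual
    x_m = X.mean(axis=0)
    sigma_x = np.cov(X, rowvar=False)  # sample covariance of the articulatory vectors
    return w_m, sigma_w, x_m, sigma_x

def gaussian_logpdf(v, mean, cov):
    """Log density of a multivariate Gaussian N(mean, cov) evaluated at v."""
    d = v - mean
    _, logdet = np.linalg.slogdet(cov)
    maha = d @ np.linalg.solve(cov, d)
    return -0.5 * (len(v) * np.log(2.0 * np.pi) + logdet + maha)

# Small synthetic example (4-dimensional x, 3-dimensional y).
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
A = rng.normal(size=(3, 4))
b = rng.normal(size=3)
Y = X @ A.T + b + 0.1 * rng.normal(size=(300, 3))

w_m, sigma_w, x_m, sigma_x = state_statistics(X, Y, A, b)
log_p_x = gaussian_logpdf(X[0], x_m, sigma_x)                 # log P(x | q_i)
log_p_y = gaussian_logpdf(Y[0], A @ X[0] + b + w_m, sigma_w)  # log P(y | x, q_i)
print(log_p_x, log_p_y)
```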

[0013]

Articulatory movement estimation

Next, an embodiment of the invention is described in which the articulatory movement for an input speech signal is estimated using the model described above. As shown in FIG. 5, the speech signal input from the input terminal 41 is normally stored temporarily in the signal storage unit 42 and then spectrally analyzed frame by frame in the spectrum analysis unit 43 to detect the speech parameters. The speech parameters are the same as the training speech parameters used when the model was created; in the above example, the mel-cepstral coefficients of orders 1 to 21 (the 0th-order term excluded). The speech signal is also preferably cut out in the same way as when the model was created. To match the speech parameter vector y used at model creation, the speed detection unit 44 obtains the velocity parameters as the temporal change of the speech parameters, and the synthesis unit 45 combines the two sets of parameters into the speech parameter vector y.

[0014] The approximate articulatory parameter vector selection unit 46 compares this speech parameter vector y with the speech parameter vectors y_c in the articulatory-acoustic pair codebook 22 created as described above, and selects, in increasing order of distance, a plurality of approximate articulatory parameter vectors x_c paired with those y_c. The state sequence determination unit 47 then determines the HMM state sequence q by the Viterbi algorithm. For the HMM models λ in the model storage unit 24 and for each of the approximate articulatory parameter vectors x_c1 to x_cR, it uses the input speech parameter vector y, the appearance probabilities P_y and P_x of each state of the models, and the transition probabilities P_t. The state sequence q chosen is the one that maximizes the appearance probability of the sequence of input speech parameter vectors y:

$$P(y, q \mid \lambda) = \int P(y \mid x_c, q, \lambda)\, P(x_c \mid q, \lambda)\, P(q \mid \lambda)\, dx_c$$
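The codebook lookup that yields the R approximate articulatory parameter vectors can be sketched as a nearest-neighbour search. The function name `select_candidates` and the choice of Euclidean distance are assumptions; the text only says that candidates are taken in increasing order of distance.

```python
import numpy as np

def select_candidates(y, codebook_y, codebook_x, R=10):
    """Return the R articulatory vectors whose paired speech vectors are closest to y."""
    dist = np.linalg.norm(codebook_y - y, axis=1)  # Euclidean distance (assumed measure)
    nearest = np.argsort(dist)[:R]                 # indices in increasing order of distance
    return codebook_x[nearest], dist[nearest]

# Toy codebook with one-dimensional vectors.
codebook_y = np.array([[0.0], [1.0], [2.0], [3.0]])
codebook_x = np.array([[10.0], [11.0], [12.0], [13.0]])

x_c, d = select_candidates(np.array([1.9]), codebook_y, codebook_x, R=2)
print(x_c.ravel(), d)
```

For a codebook of the size mentioned in the text (about 200,000 pairs), a full sort per frame is wasteful; `numpy.argpartition` or a k-d tree would be natural replacements, but the brute-force form keeps the idea visible.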

[0015] For example, as shown in FIG. 6, for each of the approximate articulatory parameter vectors x_c1 to x_cR, the search starts from the silence model, preferably from each silence model among the two-phoneme models. In each state q_i, the branch metric M_B is the product of three terms: the appearance probability P(y | x_cr, q_i, λ) of the input speech parameter vector y given the approximate articulatory parameter vector x_cr (r = 1, ..., R), the appearance probability P(x_cr | q_i, λ) of that x_cr, and the transition probability P(q_i | λ). The branch metric M_B is computed for every state to which a transition is possible. The largest of all the M_B values obtained for x_c1 to x_cR defines the surviving path, and that branch metric is added to the path metric M_P accumulated so far. This is done for the input speech parameter vector y of every frame.
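The state search can be sketched as a standard Viterbi recursion in which each frame's score is first maximized over the R candidates. This is a simplified illustration under stated assumptions: it works with log-probabilities, so the product that forms the branch metric becomes a sum; the score array `cand_score` is assumed to be precomputed; and the name `viterbi_with_candidates` is hypothetical.

```python
import numpy as np

def viterbi_with_candidates(cand_score, log_trans, log_init):
    """Most likely state sequence when each frame has R candidate articulatory vectors.

    cand_score[t, r, s] = log P(y_t | x_cr, q_s) + log P(x_cr | q_s)
    log_trans[i, j]     = log P(q_j | q_i), with -inf for forbidden transitions
    log_init[s]         = log probability of starting in state s
    """
    T, R, S = cand_score.shape
    emit = cand_score.max(axis=1)                  # best candidate per frame and state
    path_metric = log_init + emit[0]
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        scores = path_metric[:, None] + log_trans  # scores[i, j]: come from i, go to j
        back[t] = scores.argmax(axis=0)            # surviving predecessor of each state
        path_metric = scores.max(axis=0) + emit[t]
    states = np.empty(T, dtype=int)
    states[-1] = path_metric.argmax()
    for t in range(T - 1, 0, -1):                  # trace the survivor path backwards
        states[t - 1] = back[t, states[t]]
    return states

# Two-state toy example: the first two frames favour state 0, the last two state 1.
log_trans = np.array([[np.log(0.5), np.log(0.5)], [-np.inf, 0.0]])
log_init = np.array([0.0, -np.inf])
cand_score = np.array([[[0.0, -5.0]], [[0.0, -5.0]], [[-5.0, 0.0]], [[-5.0, 0.0]]])

states = viterbi_with_candidates(cand_score, log_trans, log_init)
print(states)
```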

[0016] In this way a state sequence q is obtained in which one state of one model is associated with each y in the sequence of input speech parameter vectors. Following this HMM state sequence q, the articulatory parameter calculation unit 48 uses the speech generation model in the model storage unit 24 to generate the articulatory movement that maximizes the overall appearance probability of the speech parameter vector sequence of the input speech signal:

$$P(y \mid q, \lambda) = \int P(y \mid x, q, \lambda)\, P(x \mid q, \lambda)\, dx \tag{1}$$

[0017] As stated above, the appearance probabilities P_y and P_x are each assumed to be Gaussian. Each appearance probability can then be written as

$$P_y = P(y \mid x, q, \lambda) = \frac{1}{(2\pi)^{N/2}\, |\sigma_w|^{1/2}} \exp\left[-\frac{1}{2}\,(y - Ax - b - w_m)^T \sigma_w^{-1} (y - Ax - b - w_m)\right] \tag{2}$$

$$P_x = P(x \mid q, \lambda) = \frac{1}{(2\pi)^{M/2}\, |\sigma_x|^{1/2}} \exp\left[-\frac{1}{2}\,(x - x_m)^T \sigma_x^{-1} (x - x_m)\right] \tag{3}$$

where N is the dimension of the vector y and M is the dimension of the vector x, both 42 in the above example, and (·)^T denotes the matrix transpose.

[0018] When equations (2) and (3) are substituted into equation (1), finding the x that maximizes equation (1) amounts to finding the x that minimizes the quadratic form appearing in the exponent of the substituted expression, that is, the x that minimizes

$$J = (y - Ax - b - w_m)^T \sigma_w^{-1} (y - Ax - b - w_m) + (x - x_m)^T \sigma_x^{-1} (x - x_m)$$

Differentiating this expression with respect to x and setting the result to zero gives

$$(\sigma_x^{-1} + A^T \sigma_w^{-1} A)\, x = \sigma_x^{-1} x_m + A^T \sigma_w^{-1} (y - b - w_m) \tag{4}$$

Therefore, for the speech parameter vector y of each frame, the values A, b, w_m, x_m, σ_w, σ_x for the state q_i to which that frame belongs in the determined state sequence q are read from the model storage unit 24. Substituting these values and the speech parameter vector y into equation (4) and solving for x gives the articulatory parameter vector x for that frame's y.
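Equation (4) is a linear system in x, so a single frame can be solved directly. The following sketch assumes symmetric, invertible covariance matrices; the function name `estimate_articulation` and the toy numbers are illustrative only.

```python
import numpy as np

def estimate_articulation(y, A, b, w_m, sigma_w, x_m, sigma_x):
    """Solve equation (4) for x, given one frame's y and its state's parameters."""
    sw_inv_A = np.linalg.solve(sigma_w, A)           # sigma_w^{-1} A
    sx_inv = np.linalg.inv(sigma_x)                  # sigma_x^{-1}
    lhs = sx_inv + A.T @ sw_inv_A                    # sigma_x^{-1} + A^T sigma_w^{-1} A
    # sw_inv_A.T equals A^T sigma_w^{-1} because sigma_w is symmetric.
    rhs = sx_inv @ x_m + sw_inv_A.T @ (y - b - w_m)
    return np.linalg.solve(lhs, rhs)

# Toy check: with identity matrices and zero means, (I + I) x = y, so x = y / 2.
I2 = np.eye(2)
zero = np.zeros(2)
x_hat = estimate_articulation(np.array([2.0, 4.0]), I2, zero, zero, I2, zero, I2)
print(x_hat)
```

The solution is a compromise between the state's prior mean x_m and the value implied by inverting y = Ax + b: a small residual covariance σ_w pulls x toward the acoustic evidence, while a small σ_x pulls it toward x_m.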

[0019] The procedure for estimating the articulatory parameter vector sequence, that is, the articulatory movement, is summarized with reference to FIG. 7. When a speech signal whose articulatory movement is to be estimated is input, it is stored in the storage unit 42 (S1), and the speech parameter vector y of this speech signal is detected frame by frame (S2). In the above example, each frame is spectrally analyzed to obtain the mel-cepstral coefficients as speech parameters (S2-1), the temporal change of the speech parameters, that is, the velocity parameters, is detected (S2-2), and the two sets of parameters together form the speech parameter vector. The articulatory-acoustic pair codebook 22 is consulted to select the approximate articulatory parameter vectors x_c paired with a plurality of speech parameter vectors y_c close to each speech parameter vector (S3), and each speech parameter vector y is stored in the signal storage unit 42 together with its plurality of approximate articulatory parameter vectors x_c (S4).

[0020] Once the detection of the speech parameter vectors y and the selection of the corresponding approximate articulatory parameter vectors x_c have been completed from the beginning to the end of the input speech signal, the models in the model storage unit 24 are consulted and the state sequence q that maximizes the appearance probability of the entire speech parameter vector sequence is determined by the Viterbi algorithm (S5). With reference to the determined state sequence q, for each vector in the sequence of input speech parameter vectors y, the speech generation coefficients A, b, the means w_m, x_m, and the covariances σ_w, σ_x of the state q_i to which the vector belongs are obtained from the model storage unit 24, and equation (4) is evaluated to obtain the articulatory parameter vector x (S6). If desired, the resulting sequence of articulatory parameter vectors may be displayed or recorded as the time course of each element of the vector.

[0021] In the above, the speech parameter vector may consist of the speech spectrum only, that is, the velocity parameters need not be used; acceleration parameters may also be added. Similarly, the articulatory parameter vector may consist of the position parameters only, or of the position parameters and their velocity parameters with the acceleration parameters omitted. However, adding the velocity parameters gives better results than using the position parameters alone, and adding the acceleration parameters improves the accuracy further. The speech parameter vector and the articulatory parameter vector need not have the same dimension. It is also acceptable to select only one approximate articulatory parameter vector x_c. The speech analysis apparatus shown in FIG. 5 may also be realized by having a computer execute a speech analysis program. In that case, the speech analysis program that causes the computer to execute the procedure shown in FIG. 7 is installed in the computer from a CD-ROM, a flexible magnetic disk, or the like, or is downloaded over a communication line, and is then executed by the computer.

[0022]

[Effects of the Invention] As described above, the present invention uses a model that includes an articulatory model and a speech generation model. Starting from approximate articulatory parameter vectors whose paired speech parameter vectors are close to the input speech parameter vector, it generates the articulatory parameter vector for the input speech parameter vector with the speech generation model, and obtains the articulatory parameter sequence that maximizes the appearance probability of the input speech parameter sequence. The articulatory movement during the utterance of the input speech can therefore be determined more accurately.

[0023] An experiment conducted to evaluate the method of the invention is described. The experiment used articulatory movement and speech observed simultaneously during continuous utterances with a magnetic sensor system (EMA). The articulatory data were observed with the EMA at a sampling rate of 250 Hz. The articulatory parameters were the horizontal and vertical positions of seven points in total: the lower jaw, the upper and lower lips, and four points on the tongue. The speech was recorded at a sampling rate of 8 kHz and cut out with a Blackman window of length 32 ms at an analysis period of 4 ms. A 21st-order mel-cepstral analysis was performed, and the coefficients excluding the 0th-order term were used as the speech parameters. The articulatory parameter vector x consisted of position, velocity, and acceleration, and the speech parameter vector y consisted of the static coefficients (corresponding to position) and their velocity.

[0024] The simultaneously observed articulatory-acoustic data consisted of 358 sentences (231,949 data items) uttered by a Japanese male speaker. The HMMs were trained on 342 sentences selected at random from the 358. The remaining 16 sentences served as test data for evaluating the method of the invention. The HMM structure was a two-phoneme model with three states, a single Gaussian mixture component, and a left-to-right topology without skips; states with few data were replaced by single-phoneme models. The conversion function f(x) was approximated by the linear transformation y = Ax + b, and the phoneme information of the utterances was treated as known, although this phoneme information is not required.

【００２５】図８に男性話者が「午後はたまった書類に
目を通します」という文章を発声した音声信号から、こ
の発明方法により推定された調音運動軌道（太線）と観
測された調音運動軌道（細線）の垂直信号の例を示す。
この図からこの発明方法により推定された調音運動起動
（太線）が観測された調音運動軌道（細線）をよく再現
していることがわかる。図９に、ＨＭＭの状態系列ｑの
決定における近似調音コードベクトルｘ_cの選択数に関
する調音運動の推定軌道と観測軌道の平均２乗誤差の結
果を示す。この図より選択数が１０以上で誤差の減少が
飽和し、最小平均２乗誤差は、１．７１ｍｍとなった。
またこの図より近似調音パラメータベクトルの選択数は
５以上が好ましいことがわかる。更に従来の方法では最
もよくても２．０９ｍｍであったが、この発明では近似
調音パラメータベクトルを１個のみ選択した場合も従来
方法より高い精度が得られることがわかる。FIG. 8 shows an example of the vertical components of the articulatory trajectory estimated by the method of the present invention (thick line) and the observed articulatory trajectory (thin line), for a voice signal in which a male speaker uttered a sentence meaning "In the afternoon I will look through the documents that have piled up". The figure shows that the trajectory estimated by the method of the present invention (thick line) reproduces the observed trajectory (thin line) well. FIG. 9 shows the mean square error between the estimated and observed articulatory trajectories as a function of the number of approximate articulatory code vectors x_c selected when determining the HMM state sequence q. According to this figure, the reduction in error saturates once the number of selections reaches 10 or more, and the minimum mean square error was 1.71 mm. The figure also shows that selecting five or more approximate articulatory parameter vectors is preferable. Furthermore, while the conventional method achieved 2.09 mm at best, the present invention achieves higher accuracy than the conventional method even when only one approximate articulatory parameter vector is selected.
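
The selection of several approximate articulatory parameter vectors whose paired voice parameter vectors lie closest to an input frame can be sketched as below. The squared Euclidean distance and the names `select_candidates`, `codebook_y`, and `codebook_x` are assumptions made for illustration; the text only states that candidates are chosen in increasing order of distance.

```python
import numpy as np

def select_candidates(y, codebook_y, codebook_x, k=5):
    """Return the k articulatory code vectors whose paired voice
    code vectors are nearest to the input voice parameter vector y.

    codebook_y: (n, dy) voice parameter vectors of the codebook.
    codebook_x: (n, dx) articulatory parameter vectors paired with them.
    Assumption: closeness is measured by squared Euclidean distance.
    """
    dist = np.sum((codebook_y - y) ** 2, axis=1)
    k = min(k, len(dist))
    order = np.argsort(dist)[:k]  # indices in increasing order of distance
    return codebook_x[order], dist[order]
```

The default k=5 reflects the observation above that five or more selections are preferable, with the error reduction saturating at around 10.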

[Brief description of drawings]

【図１】この発明に用いられるモデルの生成装置の実施
例の機能構成を示すブロック図。FIG. 1 is a block diagram showing a functional configuration of an embodiment of a model generation device used in the present invention.

【図２】ＨＭＭの状態遷移図の例を示す図。FIG. 2 is a diagram showing an example of a state transition diagram of an HMM.

【図３】モデル記憶部２４の記憶例を示す図。FIG. 3 is a diagram showing a storage example of a model storage unit 24.

【図４】変換式ｙ＝Ａｘ＋ｂの行列表現を示す図。FIG. 4 is a diagram showing a matrix expression of a conversion formula y = Ax + b.

【図５】この発明による音声分析装置の実施例の機能構
成を示すブロック図。FIG. 5 is a block diagram showing a functional configuration of an embodiment of a voice analysis device according to the present invention.

【図６】ビタビアルゴリズムにより状態系列ｑを求める
ための説明図。FIG. 6 is an explanatory diagram for obtaining a state series q by a Viterbi algorithm.

【図７】この発明による音声分析方法の実施例を示す流
れ図。FIG. 7 is a flowchart showing an embodiment of a voice analysis method according to the present invention.

【図８】この発明方法の実験結果による調音運動軌道と
実際に観測された調音運動軌道を示す図。FIG. 8 is a diagram showing an articulatory trajectory according to an experimental result of the method of the present invention and an actually observed articulatory trajectory.

【図９】この発明方法の実験結果における近似調音パラ
メータベクトルの選択数と、推定軌道と観測軌道の平均
２乗誤差との関係を示す図。FIG. 9 is a diagram showing the relationship between the number of approximate articulatory parameter vectors selected and the mean square error between the estimated and observed trajectories in the experimental results of the method of the present invention.

Claims

[Claims]

1. A voice analysis method comprising: a process of storing an input voice signal in a storage unit; a process of reading the input voice signal from the storage unit and performing spectrum analysis for each frame to generate an input voice parameter vector; a process of referring to an acoustic/articulatory pair codebook that stores pairs of a voice parameter vector and its articulatory parameter vector, and selecting, as an approximate articulatory parameter vector, at least one articulatory parameter vector that is paired with a voice parameter vector close to each input voice parameter vector and that includes at least position information of a plurality of articulatory organs; a process of storing the input voice parameter vector in the storage unit; and an articulatory parameter calculation process of referring to a model storage unit that stores an articulatory model expressing the time series of articulatory parameter vectors by probabilistic dynamic state transitions and a voice generation model determining the occurrence probability of a voice parameter vector for an articulatory parameter vector, and obtaining, by using the input voice parameter vector, the approximate articulatory parameter vector, and the voice generation model and articulatory model corresponding to them, the articulatory parameter vector sequence that maximizes the appearance probability of the input voice parameter vector sequence.

2. The voice analysis method according to claim 1, wherein the model in the model storage unit is configured as a hidden Markov model that includes: the state transition probabilities of the time series of articulatory parameter vectors and the appearance probability in each state, which constitute the articulatory model; and the coefficients, for each state, of a function that determines a voice parameter vector from an articulatory parameter vector, together with the appearance probability, for each state, of a voice parameter vector for an articulatory parameter vector, which constitute the voice generation model.

3. The voice analysis method according to claim 2, wherein the articulatory parameter calculation process comprises: a process of determining, by using each of the approximate articulatory parameter vectors, the input voice parameter vector, and the hidden Markov model, the state sequence of the hidden Markov model that maximizes the appearance probability of the input voice parameter vector sequence; and a process of calculating, for each state of the determined state sequence, by using the input voice parameter vector and the coefficients, the articulatory parameter vector that maximizes the appearance probability of the input voice parameter vector sequence.

4. The voice analysis method according to claim 3, wherein the function is a linear function and the coefficients are a first-order coefficient A and a constant b; the mean and covariance of the articulatory parameter vectors belonging to each state, and the mean and covariance of the error when the coefficients A and b are used, are stored in the model storage unit; and the process of calculating the articulatory parameter vector performs the calculation by using A and b of the state to which each input voice parameter vector belongs, the mean and covariance of the articulatory parameter vectors, and the mean and covariance of the error.

5. The voice analysis method according to claim 1, wherein at least one of the articulatory parameter vector and the voice parameter vector includes an element representing a temporal change of the parameter.

6. A voice analysis device comprising: a signal storage unit that stores an input voice signal, input voice parameter vectors, and the like; an acoustic/articulatory pair codebook that stores pairs of a voice parameter vector and its articulatory parameter vector; a model storage unit that stores an articulatory model expressing the time series of articulatory parameter vectors by probabilistic dynamic state transitions and a voice generation model determining the occurrence probability of a voice parameter vector for an articulatory parameter vector; spectrum analysis means for performing spectrum analysis of the input voice signal for each frame to generate an input voice parameter vector; means for referring to the acoustic/articulatory pair codebook and selecting, as an approximate articulatory parameter vector, at least one articulatory parameter vector that is paired with a voice parameter vector close to the input voice parameter vector and that includes at least position information of a plurality of articulatory organs; and articulatory parameter calculation means for referring to the model storage unit and obtaining, by using the input voice parameter vector, the approximate articulatory parameter vector, and the voice generation model and articulatory model corresponding to them, the articulatory parameter vector sequence that maximizes the appearance probability of the input voice parameter vector sequence.

7. The voice analysis device according to claim 6, wherein the model in the model storage unit is configured as a hidden Markov model that includes: the state transition probabilities of the time series of articulatory parameter vectors and the appearance probability in each state, which constitute the articulatory model; and the coefficients, for each state, of a function that determines a voice parameter vector from an articulatory parameter vector, together with the appearance probability, for each state, of a voice parameter vector for an articulatory parameter vector, which constitute the voice generation model.

8. The voice analysis device according to claim 7, wherein the articulatory parameter calculation means comprises: means for determining, by using each of the approximate articulatory parameter vectors, the input voice parameter vector, and the hidden Markov model, the state sequence of the hidden Markov model that maximizes the appearance probability of the input voice parameter vector sequence; and means for calculating, for each state of the determined state sequence, by using the input voice parameter vector and the coefficients, the articulatory parameter vector that maximizes the appearance probability of the input voice parameter vector sequence.

9. The voice analysis device according to claim 8, wherein the function is a linear function and the coefficients are a first-order coefficient A and a constant b; the mean and covariance of the articulatory parameter vectors belonging to each state, and the mean and covariance of the error when the coefficients A and b are used, are stored in the model storage unit; and the means for calculating the articulatory parameter vector performs the calculation by using A and b of the state to which each input voice parameter vector belongs, the mean and covariance of the articulatory parameter vectors, and the mean and covariance of the error.

10. The voice analysis device according to claim 6, wherein at least one of the articulatory parameter vector and the voice parameter vector includes an element representing a temporal change of the parameter.

11. A voice analysis program for causing a computer to function as a voice analysis device comprising: a signal storage unit that stores an input voice signal, input voice parameter vectors, and the like; an acoustic/articulatory pair codebook that stores pairs of a voice parameter vector and its articulatory parameter vector; a model storage unit that stores an articulatory model expressing the time series of articulatory parameter vectors by probabilistic dynamic state transitions and a voice generation model determining the occurrence probability of a voice parameter vector for an articulatory parameter vector; spectrum analysis means for performing spectrum analysis of the input voice signal for each frame to generate an input voice parameter vector; means for referring to the acoustic/articulatory pair codebook and selecting, as an approximate articulatory parameter vector, at least one articulatory parameter vector that is paired with a voice parameter vector close to the input voice parameter vector and that includes at least position information of a plurality of articulatory organs; and articulatory parameter calculation means for referring to the model storage unit and obtaining, by using the input voice parameter vector, the approximate articulatory parameter vector, and the voice generation model and articulatory model corresponding to them, the articulatory parameter vector sequence that maximizes the appearance probability of the input voice parameter vector sequence.

12. The voice analysis program according to claim 11, wherein the model in the model storage unit is configured as a hidden Markov model that includes: the state transition probabilities of the time series of articulatory parameter vectors and the appearance probability in each state, which constitute the articulatory model; and the coefficients, for each state, of a function that determines a voice parameter vector from an articulatory parameter vector, together with the appearance probability, for each state, of a voice parameter vector for an articulatory parameter vector, which constitute the voice generation model; and wherein the articulatory parameter calculation means comprises: means for determining, by using each of the approximate articulatory parameter vectors, the input voice parameter vector, and the hidden Markov model, the state sequence of the hidden Markov model that maximizes the appearance probability of the input voice parameter vector sequence; and means for calculating, for each state of the determined state sequence, by using the input voice parameter vector and the coefficients, the articulatory parameter vector that maximizes the appearance probability of the input voice parameter vector sequence.

13. The voice analysis program according to claim 12, wherein the function is a linear function and the coefficients are a first-order coefficient A and a constant b; the mean and covariance of the articulatory parameter vectors belonging to each state, and the mean and covariance of the error when the coefficients A and b are used, are stored in the model storage unit; and the means for calculating the articulatory parameter vector performs the calculation by using A and b of the state to which each input voice parameter vector belongs, the mean and covariance of the articulatory parameter vectors, and the mean and covariance of the error.

14. The voice analysis program according to claim 11, wherein at least one of the articulatory parameter vector and the voice parameter vector includes an element representing a temporal change of the parameter.

15. A computer-readable recording medium in which the voice analysis program according to claim 11 is recorded.