JP2951661B2

JP2951661B2 - Pattern likelihood calculation method

Info

Publication number: JP2951661B2
Application number: JP62254116A
Authority: JP
Inventors: 芳春阿部
Original assignee: Mitsubishi Electric Corp
Current assignee: Mitsubishi Electric Corp
Priority date: 1987-10-08
Filing date: 1987-10-08
Publication date: 1999-09-20
Anticipated expiration: 2014-09-20
Also published as: JPH0195373A

Description

DETAILED DESCRIPTION OF THE INVENTION

[Industrial Application Field]

The present invention relates to an apparatus that compares a time-series pattern, typified by a speech pattern, with a template consisting of a sequence of parameters that define the standard distribution of the vectors in each partial time segment of the time-series pattern, and that calculates the likelihood (degree of plausibility) of the time-series pattern with respect to the template.

[Prior Art]

In speech recognition systems based on comparing an input pattern with a template that defines a reference pattern, calculating the likelihood of the input pattern with respect to a given template is important. For a time-series pattern that includes nonlinear expansion and contraction of the time axis, such as a speech pattern, the likelihood with respect to a reference template is obtained by a likelihood calculation apparatus based on dynamic programming, which is capable of normalizing the time axis. The time-normalization capability of dynamic programming is determined by the slope constraint imposed on the dynamic-programming state path.

The cited reference (Transactions of the Institute of Electronics and Communication Engineers of Japan (A), Vol. 69-A, No. 2, pp. 261-270) points out a problem with the commonly used uniform slope constraint of 1/2 to 2 on the time axis: it is insufficient for handling the difference in the degree of time-axis expansion and contraction between the stationary and transient parts of a speech pattern. The reference describes a method that applies finer-grained slope constraints.

In that method, the template contains the sequence of mean autocorrelation-coefficient vectors for the segments obtained when the input pattern is divided into J segments, together with the sequences of allowable minimum and maximum lengths of each segment. For a given division of the input pattern into J segments within the allowable length ranges, a likelihood is calculated for each segment from the mean autocorrelation-coefficient vector in the template and the vectors in that segment, and these likelihoods are summed over the J segments to give a pattern likelihood. The likelihood of the input pattern with respect to the template is the maximum of this pattern likelihood as the division is varied within the allowable length ranges. (This calculation can be performed efficiently with dynamic programming.)

With such a method, consider words for which the difference in segment length is the main distinguishing factor, for example /ookii/ (大きい, "big") and /oki/ (沖, "open sea"), which contrast in the length of the segment corresponding to the vowel. For these words to be distinguished by comparing the pattern likelihoods described above, the difference between the segment length for /o/ and the segment length for /oo/ must be expressed as a difference in the allowable segment-length values in the templates. In practice, however, the two allowable ranges can overlap depending on speaking rate.

Fig. 5 schematically shows the frequency distributions of the two segment lengths. If the allowable range of the segment length is set to the lower and upper limits of each distribution (that is, t1 and t2 for /o/, and T1 and T2 for /oo/), then when the segment length lies between T1 and t2 it is difficult to tell from the pattern likelihood whether the segment is /o/ or /oo/. If instead the upper limit for /o/ and the lower limit for /oo/ are both set to the crossing point t0 of the distributions, then an /o/ segment longer than t0, or an /oo/ segment shorter than t0, is not matched correctly and the pattern likelihood is erroneously reduced.

[Problems to Be Solved by the Invention]

Because the conventional calculation of the input pattern likelihood restricts the range of segment lengths as described above, it has difficulty tolerating large variations in segment length.

This invention was made to solve the above problem. Its objects are to make it easier to distinguish between time-series patterns that contrast in segment length, and to reduce mismatching caused by large variations in segment length.

[Means for Solving the Problems]

In the pattern likelihood calculation method according to this invention, the template contains, in addition to the distribution parameters of the vectors in each time-division segment, the distribution parameters of the length of each time-division segment. The pattern likelihood is calculated using the sum of (a) the likelihood of the vectors in the time-division segments of the input pattern, calculated from the sequence of vector distribution parameters in the template, and (b) the likelihood of the lengths of the time-division segments of the input pattern, calculated from the sequence of segment-length distribution parameters in the template. The sum used is the one obtained with the division that maximizes it over the entire speech interval.

[Operation]

The template against which the likelihood of the input pattern is evaluated contains, in addition to the sequence of parameters that define the standard distribution of the vectors in each time-division segment, the sequence of distribution parameters of the length of each time-division segment.

The likelihood is calculated using the sum of the likelihood calculated from the parameters that define the vector distribution of each time-division segment and the likelihood calculated from the parameters that define the segment-length distribution of each time-division segment, where the sum is the one obtained with the division that maximizes it over the entire speech interval. In other words, the maximum pattern likelihood is obtained with the division for which the sum of the spectral distance of the segments and the segment-length distance of the segments is maximized over the entire speech interval.

[Embodiment]

An embodiment of this invention is described below with reference to the drawings. Fig. 1 is a functional block diagram showing an embodiment of this invention.

In the figure, 1 is an input pattern storage unit that stores an input pattern consisting of a time-series pattern such as speech; 2 is a template storage unit that stores a template consisting of parameters that define the standard distributions used to calculate the likelihood of the input pattern; 3 is a cumulative likelihood storage unit that stores cumulative likelihoods; and 4 is a numerical operation unit that calculates the likelihood.

The template storage unit 2 includes a segment length sequence storage unit 21, which stores the distribution parameters of the segment length for each time-division segment of the input pattern, and a distribution parameter sequence storage unit 22, which stores the distribution parameters of the vectors in each time-division segment.

Fig. 2 shows the storage layout of the input pattern held in the input pattern storage unit 1. As shown, the input pattern consists of I frames, and each frame consists of 15-dimensional cepstral coefficients $c(i,m)$, where $i$ is the frame number and $m$ is the order.

Fig. 3 shows the storage layout of the template held in the template storage unit 2. As shown, the template consists of J storage areas corresponding to the J segments. Each storage area holds the 15-dimensional mean cepstral coefficients $a(j,m)$ (where $j$ is the segment number and $m$ is the order) as the distribution parameter of the vectors in the corresponding segment of the input pattern (here the 15-dimensional cepstral coefficients of each frame are treated as one vector), and the mean $\mu(j)$ and variance $\sigma^2(j)$ of the segment length as the distribution parameters of the segment length.

Fig. 4 shows the storage layout of the cumulative likelihoods held in the cumulative likelihood storage unit 3. As shown, the cumulative likelihood is an array of overall size $(I+1) \times (J+1)$ whose elements consist of the cumulative sum of vector likelihoods $L_s(i,j)$ and the cumulative sum of segment-length likelihoods $L_t(i,j)$, where $1 \le i \le I+1$ and $1 \le j \le J+1$.

The likelihood of the input pattern stored in the input pattern storage unit 1 with respect to the template stored in the template storage unit 2 is calculated by the numerical operation unit 4 as follows. The numerical operation unit 4 calculates, by dynamic programming, the pattern likelihood $S$ defined by Equation (1). (Equations (1), (3), (4), and (10) are not reproduced in this text.)

Here $\nu(j)$ is the starting frame number of the j-th segment (so the j-th segment ends at frame $\nu(j+1)-1$), and satisfies

$$1 = \nu(1) < \nu(2) < \nu(3) < \cdots < \nu(J) < \nu(J+1) = I+1 \tag{2}$$

The quantity $l_s(\nu(j), \nu(j+1), j)$ in the first term of Equation (1) is the vector likelihood of the j-th segment. It is calculated as in Equation (3) from the vectors $c(i,m)$ in the j-th segment ($\nu(j) \le i < \nu(j+1)$) and the mean cepstral coefficients $a(j,m)$, which are the distribution parameter of the vectors in that segment.

The quantity $l_t(\nu(j), \nu(j+1), j)$ in the second term of Equation (1) is the segment-length likelihood of the j-th segment. It is calculated as in Equation (4) from the segment length $\nu(j+1) - \nu(j)$ and the mean $\mu(j)$ and variance $\sigma^2(j)$ of the segment length, which are the distribution parameters of the length of that segment.

Equation (1) involves a maximization over the division (that is, over $\{\nu(j)\}$). The numerical operation unit 4 therefore solves the following dynamic-programming recurrence:

$$\tau_0 = \operatorname*{argmax}_{1 \le \tau \le i-1} \left[ L_s(i-\tau, j-1) + l_s(i-\tau, i, j) + L_t(i-\tau, j-1) + l_t(i-\tau, i, j) \right] \tag{5}$$

$$L_s(i,j) = L_s(i-\tau_0, j-1) + l_s(i-\tau_0, i, j) \tag{6}$$

$$L_t(i,j) = L_t(i-\tau_0, j-1) + l_t(i-\tau_0, i, j) \tag{7}$$

for $2 \le i \le I+1$ and $2 \le j \le J+1$, under the initial conditions

$$L_s(1,j) = L_t(1,j) = 0 \quad (1 \le j \le J+1) \tag{8}$$

$$L_s(i,1) = L_t(i,1) = \infty \quad (2 \le i \le I+1) \tag{9}$$

and obtains the pattern likelihood $S$ efficiently by setting it as in Equation (10).

As described above, in this apparatus the numerical operation unit 4 calculates the likelihood of the input pattern stored in the input pattern storage unit 1 with respect to the template stored in the template storage unit 2, keeping the intermediate results of the recurrence in the cumulative likelihood storage unit 3 and finally calculating the pattern likelihood as in Equation (10).

The second term of Equation (10) is the average of the segment-length likelihoods of the segments. Therefore, when this apparatus is used to distinguish words such as /oki/ (沖) and /ookii/ (大きい), for which the difference in segment length is the main distinguishing factor, the words can be discriminated even if their segment-length distributions overlap, and mismatching is less likely to occur even when the segment length varies widely.

The above description covers the case where the pattern likelihood calculation apparatus according to this invention is used as the pattern likelihood calculation section of a speech recognition system. The apparatus can also be applied to other apparatuses that use matching between speech patterns and templates, for example an apparatus for automatically labeling speech data.

[Effects of the Invention]

As described above, according to the present invention, the likelihood for each time-division segment is calculated based on the sum of the likelihood of the feature vectors in the time-division segment, calculated from the first distribution parameter, and the likelihood of the length of the time-division segment, calculated from the second distribution parameter, where the sum is the one obtained with the division that maximizes it over the entire speech interval. In other words, the maximum pattern likelihood is obtained with the division for which the sum of the spectral distance of the segments and the segment-length distance of the segments is maximized over the entire speech interval. This has the advantages of making it easier to distinguish between time-series patterns that contrast in segment length and of reducing mismatching caused by large variations in segment length.
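The recurrence (5) to (7) of the embodiment can be sketched in Python as follows. This is a minimal illustration under several assumptions, not the patented implementation: Equations (3) and (4) are not reproduced in the text, so `vector_loglik` (negative squared Euclidean distance to the segment mean) and `length_loglik` (Gaussian log-density of the segment length) are hypothetical stand-ins for $l_s$ and $l_t$. The sketch also initializes only cell (1, 1) to 0 and every other cell to negative infinity, a choice made here so that the maximization respects constraint (2); conditions (8) and (9) as printed read differently. Cell (i, j) is taken to mean that the first j-1 segments cover frames 1 to i-1, so the step into column j uses the parameters of segment j-1. The function name `best_segmentation` and the tuple layout of `template` are likewise assumptions.

```python
import math

NEG_INF = float("-inf")

def vector_loglik(frames, mean):
    # Assumed stand-in for l_s: negative squared distance to the segment mean.
    return -sum((c - a) ** 2 for frame in frames for c, a in zip(frame, mean))

def length_loglik(length, mu, var):
    # Assumed stand-in for l_t: Gaussian log-density of the segment length.
    return -0.5 * math.log(2.0 * math.pi * var) - (length - mu) ** 2 / (2.0 * var)

def best_segmentation(frames, template):
    """frames: list of I feature vectors; template: list of J tuples
    (mean_vector, mu, var). Assumes I >= J. Returns (Ls, Lt, boundaries),
    where boundaries are the 1-based start frames nu(1)..nu(J+1)."""
    I, J = len(frames), len(template)
    Ls = [[NEG_INF] * (J + 2) for _ in range(I + 2)]
    Lt = [[NEG_INF] * (J + 2) for _ in range(I + 2)]
    back = [[0] * (J + 2) for _ in range(I + 2)]
    Ls[1][1] = Lt[1][1] = 0.0  # assumed start: nothing consumed yet
    for j in range(2, J + 2):
        mean, mu, var = template[j - 2]  # parameters of segment j-1
        for i in range(j, I + 2):
            best = NEG_INF
            for tau in range(1, i):  # candidate segment length, as in (5)
                start = i - tau
                if Ls[start][j - 1] == NEG_INF:
                    continue  # predecessor cell is unreachable
                s = Ls[start][j - 1] + vector_loglik(frames[start - 1:i - 1], mean)
                t = Lt[start][j - 1] + length_loglik(tau, mu, var)
                if s + t > best:  # keep the two sums separately, as in (6), (7)
                    best = s + t
                    Ls[i][j], Lt[i][j], back[i][j] = s, t, start
    # Trace the chosen boundaries back from the final cell (I+1, J+1).
    bounds, i = [I + 1], I + 1
    for j in range(J + 1, 1, -1):
        i = back[i][j]
        bounds.append(i)
    bounds.reverse()
    return Ls[I + 1][J + 1], Lt[I + 1][J + 1], bounds

# Toy input: three frames near 0 followed by two frames near 5.
frames = [[0.0], [0.0], [0.0], [5.0], [5.0]]
template = [([0.0], 3.0, 1.0), ([5.0], 2.0, 1.0)]
print(best_segmentation(frames, template))
```

The two cumulative sums are returned separately because Equation (10), which combines them into the final pattern likelihood S, is not reproduced in the text; how they are normalized and combined is left to the caller.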

BRIEF DESCRIPTION OF THE DRAWINGS

Fig. 1 is a functional block diagram of an embodiment of the present invention, Fig. 2 is a storage layout diagram of the input pattern storage unit, Fig. 3 is a storage layout diagram of the template storage unit, Fig. 4 is a storage layout diagram of the cumulative likelihood storage unit, and Fig. 5 is an explanatory diagram of the conventional method of calculating the pattern likelihood.

In the figures, 1 is the input pattern storage unit, 2 is the template storage unit, 3 is the cumulative likelihood storage unit, 4 is the numerical operation unit, 21 is the segment length sequence storage unit, and 22 is the distribution parameter sequence storage unit. The same reference numerals denote the same or corresponding parts throughout the figures.
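The situation depicted in Fig. 5 can be illustrated numerically. The sketch below contrasts a hard allowable range of segment lengths with a likelihood computed from a mean and variance of the segment length. All numbers are hypothetical and chosen only so that the two ranges overlap, the Gaussian form of `length_loglik` is an assumption (Equation (4) is not reproduced in the text), and the names `MODELS`, `admitted_by_range`, and `best_by_likelihood` are invented for this example.

```python
import math

# Hypothetical segment-length statistics in frames (not taken from the patent).
# "lo"/"hi" play the role of (t1, t2) for /o/ and (T1, T2) for /oo/ in Fig. 5.
MODELS = {
    "o":  {"mu": 8.0,  "var": 9.0,  "lo": 2,  "hi": 14},
    "oo": {"mu": 16.0, "var": 16.0, "lo": 10, "hi": 26},
}

def length_loglik(length, mu, var):
    # Assumed Gaussian log-density of the segment length.
    return -0.5 * math.log(2.0 * math.pi * var) - (length - mu) ** 2 / (2.0 * var)

def admitted_by_range(length):
    # Range-based constraint: every model whose range contains the length.
    return [name for name, m in MODELS.items() if m["lo"] <= length <= m["hi"]]

def best_by_likelihood(length):
    # Distribution-based scoring: the model with the higher length likelihood.
    return max(
        MODELS,
        key=lambda n: length_loglik(length, MODELS[n]["mu"], MODELS[n]["var"]),
    )

for length in (11, 13):
    print(length, admitted_by_range(length), best_by_likelihood(length))
```

Both test lengths fall in the overlap between T1 and t2, so the range constraint admits both models and contributes nothing to the decision, whereas the length likelihood still prefers one model over the other.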

Continued from the front page

(58) Field searched (Int. Cl.⁶, DB name): G10L 3/00 - 9/20

Claims

(57) [Claims] A pattern likelihood calculation method comprising: means for storing distribution parameters characterizing each time-division segment obtained by dividing a time-series pattern of vectors; a first step of dividing the time-series pattern of vectors into the time-division segments; and a second step of calculating a likelihood for each time-division segment based on the distribution parameters characterizing that segment; the likelihood of the entire time-series pattern of vectors being calculated based on the likelihoods of the time-division segments calculated in the second step; characterized in that the distribution parameters characterizing each time-division segment comprise a first distribution parameter defining the distribution of the feature vectors in the segment and a second distribution parameter defining the distribution of the length of the segment, and in that the likelihood for each time-division segment is calculated based on the sum of the likelihood of the feature vectors in the segment, calculated from the first distribution parameter, and the likelihood of the length of the segment, calculated from the second distribution parameter, the sum being the one obtained with the division that maximizes it over the entire speech interval.