JP3709436B2

JP3709436B2 - Fine segment acoustic model creation device for speech recognition

Info

Publication number: JP3709436B2
Application number: JP2001397777A
Authority: JP
Inventors: 和世田中
Original assignee: National Institute of Advanced Industrial Science and Technology AIST
Current assignee: National Institute of Advanced Industrial Science and Technology AIST
Priority date: 2001-12-27
Filing date: 2001-12-27
Publication date: 2005-10-26
Anticipated expiration: 2021-12-27
Also published as: JP2003195888A

Description

[0001]
[Technical Field of the Invention]
The present invention relates to a unit that creates acoustic standard patterns for speech segments, which is a key component of an automatic speech recognition apparatus. Speech recognition is performed by computing, on the basis of these standard patterns, the likelihood or similarity of the input speech with respect to hypothesized words and the like.
[0002]
[Prior art]
The minimum recognition unit in automatic speech recognition is usually the phoneme symbol used to describe speech. Taking Japanese as an example, a phoneme symbol is a unit corresponding to one letter of romanized orthography. For example, the phonemic transcription of the word 秋 (aki, "autumn") is written /aki/, and the three letters a, k, and i are its phoneme symbols. Automatic speech recognition requires a device that creates acoustic model standard patterns for the phoneme symbols. These standard patterns are estimated by some method from a speech waveform data set together with data that express each sample in phoneme symbols (that is, speech waveform data annotated with phoneme symbol sequences). For example, for the utterance "aki", the standard patterns of a, k, and i are created either from data in which an expert has marked, by visual inspection, which parts of the speech waveform correspond to a, k, and i, or by automatically identifying the intervals that correspond to a, k, and i.
[0003]
At present, the most widely used acoustic model standard pattern is the hidden Markov model (HMM) (Steve Young et al., “The HTK book”, Entropic Cambridge Research Laboratory; hereinafter referred to as “Reference 1”). Methods also exist for estimating phoneme HMMs when speech sample data and their phoneme descriptions are given. However, accurate HMM estimation requires either speech data that an expert has labeled in advance or appropriate initial values for the HMMs.
[0004]
On the other hand, to achieve higher-performance speech recognition, “fine segments”, which are speech recognition units finer than the phonemes described above, are used. One of them is the “Sub-Phonetic Segment” (hereinafter abbreviated as SPS) proposed by the applicants (Tanaka, Kojima, Tomiyama, Dantsuji, “A speech code system common across languages and creation of its acoustic segment models”, Proceedings of the Acoustical Society of Japan, pp. 191-192 (2001-3); hereinafter referred to as “Reference 2”). The label sequence of the SPS symbol system is itself obtained by conversion rules from XSAMPA (http://www.phon.ucl.ac.uk/home/sampa/home), a multilingual symbol system. The XSAMPA symbol system has almost the same level of detail as phonemes and is used mainly in Europe and elsewhere. For example, the XSAMPA description of “aki” (autumn) is akI, and its SPS description is #a, aa, ak, kcl, kk, kI, II, I#. It has been confirmed that adopting this symbol system improves performance [Reference 2].
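The relationship between the XSAMPA string akI and its SPS labels can be illustrated with a toy sketch. The rules below are an assumption inferred only from this single example (a transition label between neighbouring symbols, a stationary label per symbol, and a closure label `kcl` before a plosive); the actual conversion rules are those of Reference 2 and are not reproduced here. The function name `xsampa_to_sps` and the `PLOSIVES` set are hypothetical.

```python
# Toy expansion of an XSAMPA symbol sequence into SPS-style labels.
# ASSUMPTION: these rules are inferred from the single "akI" example in the
# text; the real rewrite rules are defined in Reference 2.
PLOSIVES = {"p", "t", "k", "b", "d", "g"}  # assumed set that receives a closure label


def xsampa_to_sps(symbols):
    """Return SPS-style labels for a list of XSAMPA symbols ('#' marks silence)."""
    labels = []
    prev = "#"
    for sym in symbols:
        labels.append(prev + sym)      # transition from the previous symbol
        if sym in PLOSIVES:
            labels.append(sym + "cl")  # closure interval before the plosive
        labels.append(sym + sym)       # stationary interval of the symbol
        prev = sym
    labels.append(prev + "#")          # transition into final silence
    return labels


print(xsampa_to_sps(["a", "k", "I"]))
```

For the input `["a", "k", "I"]` this reproduces the eight labels quoted above. It does not generate three-symbol (XYZ) labels, which the real rule set also contains.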
[0005]
However, estimating HMMs for symbols such as SPS is more difficult than estimating HMMs for phonemes or XSAMPA symbols. The reasons include the following: XSAMPA has roughly 50 to 100 symbol types, whereas SPS has more than 1000, so associating speech waveforms with symbol labels (labeling) places a heavy burden on experts; and automatic estimation from speech samples and their symbol descriptions alone cannot segment the speech appropriately.
[0006]
[Problems to be solved by the invention]
As described in the preceding section, introducing the sub-phonetic segment (SPS), one type of fine segment, makes it possible to raise speech recognition performance. However, when no speech data with SPS symbols labeled on the speech waveform are available, it is generally difficult to estimate automatically the HMMs that serve as its acoustic standard patterns.
[0007]
[Means for Solving the Problems]
Therefore, taking the comparatively easily obtained HMM parameters of XSAMPA symbols as a starting point, a method for automatically estimating the SPS HMM parameters from them has been developed and installed in a speech recognition system. With this acoustic model standard pattern creation device, appropriate acoustic standard patterns can be created for the SPS symbol system even without data in which SPS sequences are labeled on the speech waveform.
[0008]
To estimate the HMMs of the sub-phonetic segments (SPS) that enable high-performance speech recognition, what is needed is to set appropriate initial values of the HMM parameters for the required SPS symbols. This description gives a method that, given the HMMs of the XSAMPA symbols, sets the initial values of the SPS HMM parameters from the XSAMPA parameter values. It also presents a system that optimally estimates the SPS-HMMs when actual speech data and XSAMPA text describing their pronunciation are given.
[0009]
(I) A method of setting the initial values of the HMM parameters for the sub-phonetic segments (SPS) from the parameter values of the XSAMPA symbol HMMs, when the latter are given.
The relative positions on the time axis of the XSAMPA acoustic segments and the SPS acoustic segments can be as shown in FIG. 1, and three types of SPS can be generated: XX, XY, and XYZ. For each of these cases, the initial values of the SPS symbol HMM are derived from the values of the XSAMPA symbol HMMs by the formulas given below.
[0010]
Here, the XSAMPA symbol HMMs and the SPS symbol HMMs are both assumed to be HMMs with three states and three loops, as shown in FIG. 2, in which the output probability of the observations in each state is defined by a single Gaussian continuous probability distribution. That is, the variables of the HMM are expressed as follows.
[0011]
Let the mean vector and variance matrix of the output probability distribution, and the transition probability (matrix), in state $i$ of the HMM of the preceding XSAMPA symbol (X in the figure) be $m^{(X)}_i$, $v^{(X)}_i$, and $T^{(X)}_{ii}$, respectively. Similarly, let the mean vector, variance matrix, and transition probability (matrix) of the HMM of the following XSAMPA symbol Y be $m^{(Y)}_i$, $v^{(Y)}_i$, and $T^{(Y)}_{ii}$.
[0012]
[Case A]
Let the initial values of the mean vector, variance matrix, and transition probability (matrix) of the HMM corresponding to the stationary-interval label XX of the sub-phonetic segment SPS (see FIG. 1) be $m^{(SPS)}_i$, $v^{(SPS)}_i$, and $T^{(SPS)}_{ii}$, respectively. These values are determined as follows. First, for state 2 (center):
$m^{(SPS)}_2 = m^{(X)}_2$
$v^{(SPS)}_2 = v^{(X)}_2$
$T^{(SPS)}_{22} = T^{(X)}_{22}$
$T^{(SPS)}_{23} = 1 - T^{(SPS)}_{22}$
For state 1, using the weighting coefficients $a_m$, $a_v$, $a_T$:
$m^{(SPS)}_1 = (a_m m^{(X)}_1 + m^{(X)}_2) / (1 + a_m)$
$v^{(SPS)}_1 = (a_v v^{(X)}_1 + v^{(X)}_2) / (1 + a_v)$
$T^{(SPS)}_{11} = (a_T T^{(X)}_{11} + T^{(X)}_{22}) / (1 + a_T)$
$T^{(SPS)}_{12} = 1 - T^{(SPS)}_{11}$
Similarly for state 3, using the coefficients $a_m$, $a_v$, $a_T$:
$m^{(SPS)}_3 = (a_m m^{(X)}_3 + m^{(X)}_2) / (1 + a_m)$
$v^{(SPS)}_3 = (a_v v^{(X)}_3 + v^{(X)}_2) / (1 + a_v)$
$T^{(SPS)}_{33} = (a_T T^{(X)}_{33} + T^{(X)}_{22}) / (1 + a_T)$
$T^{(SPS)}_{34} = 1 - T^{(SPS)}_{33}$
In the above formulas, the coefficients $a_m$, $a_v$, and $a_T$ take values roughly in the range 0.5 to 1.0. The label YY is determined in the same way by replacing X with Y.
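The Case A rules can be sketched in a few lines of Python. This is a minimal illustration rather than the device's implementation: it assumes diagonal variances stored as plain lists, and the names `blend` and `init_case_a` as well as the numeric inputs are hypothetical.

```python
def blend(w, a, b):
    """Element-wise weighted average (w*a + b) / (1 + w) of two equal-length lists."""
    return [(w * x + y) / (1.0 + w) for x, y in zip(a, b)]


def init_case_a(mean_x, var_x, loop_x, a_m=0.5, a_v=0.5, a_t=0.5):
    """Initial SPS-HMM parameters for a stationary label XX (Case A).

    mean_x, var_x: three per-state vectors (lists of floats) of the X HMM;
    var_x holds diagonal variances (an assumption of this sketch).
    loop_x: the three self-loop probabilities T_11, T_22, T_33 of the X HMM.
    """
    def per_state(w, params):
        # States 1 and 3 are pulled toward the center state; state 2 is copied.
        return [blend(w, params[0], params[1]),
                list(params[1]),
                blend(w, params[2], params[1])]

    mean = per_state(a_m, mean_x)
    var = per_state(a_v, var_x)
    loop = [row[0] for row in per_state(a_t, [[t] for t in loop_x])]
    forward = [1.0 - t for t in loop]  # T_12, T_23, T_34
    return mean, var, loop, forward


# Made-up one-dimensional example values, for illustration only.
mean, var, loop, forward = init_case_a(
    mean_x=[[0.0], [3.0], [6.0]],
    var_x=[[1.0], [1.0], [1.0]],
    loop_x=[0.6, 0.9, 0.6],
)
print(mean, var, loop, forward)
```

With weights of 0.5 the center state counts twice as much as the edge state in each average, so all three states of the XX model start close to the stationary middle of X.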
[0013]
[Case B]
The initial parameter values of the acoustic model indicated by the SPS label XY (see FIG. 1) are determined as follows, where $b_m$, $b_v$, and $b_T$ are weighting coefficients with values greater than 1.0 and at most 3.0.
For $m^{(SPS)}_1$, $v^{(SPS)}_1$, $T^{(SPS)}_{11}$:
$m^{(SPS)}_1 = (b_m m^{(X)}_2 + m^{(Y)}_2) / (1 + b_m)$
$v^{(SPS)}_1 = (b_v v^{(X)}_2 + v^{(Y)}_2) / (1 + b_v)$
$T^{(SPS)}_{11} = (b_T T^{(X)}_{22} + T^{(Y)}_{22}) / (1 + b_T)$
For $m^{(SPS)}_2$, $v^{(SPS)}_2$, $T^{(SPS)}_{22}$:
$m^{(SPS)}_2 = (m^{(X)}_2 + m^{(Y)}_2) / 2$
$v^{(SPS)}_2 = (v^{(X)}_2 + v^{(Y)}_2) / 2$
$T^{(SPS)}_{22} = (T^{(X)}_{22} + T^{(Y)}_{22}) / 2$
For $m^{(SPS)}_3$, $v^{(SPS)}_3$, $T^{(SPS)}_{33}$:
$m^{(SPS)}_3 = (m^{(X)}_2 + b_m m^{(Y)}_2) / (1 + b_m)$
$v^{(SPS)}_3 = (v^{(X)}_2 + b_v v^{(Y)}_2) / (1 + b_v)$
$T^{(SPS)}_{33} = (T^{(X)}_{22} + b_T T^{(Y)}_{22}) / (1 + b_T)$
$T^{(SPS)}_{12}$, $T^{(SPS)}_{23}$, and $T^{(SPS)}_{34}$ are calculated from $T^{(SPS)}_{11}$, $T^{(SPS)}_{22}$, and $T^{(SPS)}_{33}$, respectively, by the same formulas as in Case A.
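The Case B rules for a transition label XY amount to three weighted averages of the center-state parameters of X and Y. The sketch below is illustrative only: the helper names `weighted_mean` and `init_transition` and the input values are hypothetical, and the same function is assumed to be applied separately to means, diagonal variances, and self-loop probabilities with the weights $b_m$, $b_v$, and $b_T$.

```python
def weighted_mean(w_x, x, w_y, y):
    """Element-wise (w_x*x + w_y*y) / (w_x + w_y) of two equal-length lists."""
    return [(w_x * a + w_y * b) / (w_x + w_y) for a, b in zip(x, y)]


def init_transition(center_x, center_y, weight=2.0):
    """Three per-state initial vectors for a transition label XY (Case B).

    center_x, center_y: state-2 parameter vectors of the X and Y HMMs.
    weight: b_m, b_v or b_T, depending on which parameter is being set.
    """
    return [
        weighted_mean(weight, center_x, 1.0, center_y),  # state 1 leans toward X
        weighted_mean(1.0, center_x, 1.0, center_y),     # state 2 is the midpoint
        weighted_mean(1.0, center_x, weight, center_y),  # state 3 leans toward Y
    ]


# Made-up example values, for illustration only.
means = init_transition([0.0, 3.0], [3.0, 0.0], weight=2.0)
loops = [s[0] for s in init_transition([0.9], [0.6], weight=2.0)]
forward = [1.0 - t for t in loops]  # T_12, T_23, T_34 as in Case A
print(means, loops, forward)
```

A weight above 1.0 makes the first state resemble the preceding symbol and the last state resemble the following one, which matches the intended left-to-right movement through the transition.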
[0014]
[Case C]
The initial parameter values of the acoustic model indicated by the SPS label XYZ (see FIG. 1) are determined as follows, where X and Z denote the first and third XSAMPA symbols and $c_m$, $c_v$, and $c_T$ are weighting coefficients with values greater than 1.0 and at most 3.0.
For $m^{(SPS)}_1$, $v^{(SPS)}_1$, $T^{(SPS)}_{11}$:
$m^{(SPS)}_1 = (c_m m^{(X)}_2 + m^{(Z)}_2) / (1 + c_m)$
$v^{(SPS)}_1 = (c_v v^{(X)}_2 + v^{(Z)}_2) / (1 + c_v)$
$T^{(SPS)}_{11} = (c_T T^{(X)}_{22} + T^{(Z)}_{22}) / (1 + c_T)$
For $m^{(SPS)}_2$, $v^{(SPS)}_2$, $T^{(SPS)}_{22}$:
$m^{(SPS)}_2 = (m^{(X)}_2 + m^{(Z)}_2) / 2$
$v^{(SPS)}_2 = (v^{(X)}_2 + v^{(Z)}_2) / 2$
$T^{(SPS)}_{22} = (T^{(X)}_{22} + T^{(Z)}_{22}) / 2$
For $m^{(SPS)}_3$, $v^{(SPS)}_3$, $T^{(SPS)}_{33}$:
$m^{(SPS)}_3 = (m^{(X)}_2 + c_m m^{(Z)}_2) / (1 + c_m)$
$v^{(SPS)}_3 = (v^{(X)}_2 + c_v v^{(Z)}_2) / (1 + c_v)$
$T^{(SPS)}_{33} = (T^{(X)}_{22} + c_T T^{(Z)}_{22}) / (1 + c_T)$
$T^{(SPS)}_{12}$, $T^{(SPS)}_{23}$, and $T^{(SPS)}_{34}$ are calculated from $T^{(SPS)}_{11}$, $T^{(SPS)}_{22}$, and $T^{(SPS)}_{33}$, respectively, by the same formulas as in Case A.
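It may help to read the Case C formulas as convex combinations of the center states of X and Z. The worked example below uses $c_m = 2.0$, the value reported in the experiments; the variance and transition formulas behave in the same way with $c_v$ and $c_T$.

```latex
\begin{align*}
m^{(SPS)}_1 &= \tfrac{c_m}{1+c_m}\, m^{(X)}_2 + \tfrac{1}{1+c_m}\, m^{(Z)}_2
             = \tfrac{2}{3}\, m^{(X)}_2 + \tfrac{1}{3}\, m^{(Z)}_2, \\
m^{(SPS)}_2 &= \tfrac{1}{2}\, m^{(X)}_2 + \tfrac{1}{2}\, m^{(Z)}_2, \\
m^{(SPS)}_3 &= \tfrac{1}{1+c_m}\, m^{(X)}_2 + \tfrac{c_m}{1+c_m}\, m^{(Z)}_2
             = \tfrac{1}{3}\, m^{(X)}_2 + \tfrac{2}{3}\, m^{(Z)}_2 .
\end{align*}
```

The share of X therefore falls from 2/3 to 1/2 to 1/3 across the three states, while the middle symbol Y does not enter the initial values at all.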
[0015]
(II) A system that, starting from the initial values set as described above, optimally estimates the HMM parameters of the sub-phonetic segments using training speech samples.
The following precondition is assumed to hold.
(Precondition) XSAMPA symbol text describing the pronunciation of the training speech samples is given.
Under this precondition, the HMM parameters of the sub-phonetic segments (SPS) are optimally estimated by the procedure shown in the block diagram of FIG. 3.
(1) The training speech samples ①, the XSAMPA text ② describing their pronunciation, and the XSAMPA-HMM parameter values ③ are given in advance.
(2) The rewrite rules [Reference 2] are applied to the training XSAMPA text ② to obtain a description ⑥ of the uttered text in terms of SPS.
(3) All SPS labels occurring in text ⑥ are collected, and for every one of them the initial values of the SPS-HMM are set from the XSAMPA-HMMs ③ by the method shown in (I), yielding the initial values ⑧. At this point, all variance matrix values (the parameters denoted by the variable v) are multiplied by a coefficient of 0.5.
(4) The parameters are trained on the speech samples by running a standard HMM parameter estimation method [Reference 1] for about 3 to 5 iterations, using the training speech data ①, its SPS description text ⑥, and the SPS-HMM initial values ⑧. Labels with few actual samples (roughly fewer than 10) are corrected by the following formula. Let pp be the final value of a parameter, q its initial value, p the estimate obtained from the training sample set, and i the number of actual training samples; then
$pp = (5q + i\,p) / (5 + i)$.
When i is 10 or more, pp = p may be used.
In the above procedure, step (3) is the new content to which the claims relate.
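The two numerical adjustments in steps (3) and (4) can be sketched as follows. This is an illustration under stated assumptions: variances are stored as nested lists, the cut-off at 10 samples is applied as a hard switch (the text only says pp = p is acceptable there), and the function names `scale_variances` and `smooth` are hypothetical. The HMM re-estimation itself is not shown.

```python
def scale_variances(variances, factor=0.5):
    """Step (3): multiply every initial variance value by the given factor."""
    return [[factor * v for v in state] for state in variances]


def smooth(q, p, i, prior_weight=5, threshold=10):
    """Step (4): blend the initial value q with the trained estimate p.

    i is the number of actual training samples for the label. With fewer
    than `threshold` samples the result is (5q + i*p) / (5 + i); otherwise
    the trained estimate is used unchanged (an assumption of this sketch).
    """
    if i >= threshold:
        return p
    return (prior_weight * q + i * p) / (prior_weight + i)


print(scale_variances([[2.0, 4.0]]))
print(smooth(q=1.0, p=2.0, i=0), smooth(q=1.0, p=2.0, i=5), smooth(q=1.0, p=2.0, i=10))
```

The formula treats the initial value as if it were backed by five pseudo-samples, so a label that never occurs in the training data keeps its initial value and a label with five samples lands halfway between q and p.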
[0016]
[Example]
To show that the present invention is effective, word recognition experiments were carried out with the following two kinds of symbol labels as the speech recognition unit, and the resulting recognition rates are compared.
(A) Using XSAMPA-HMMs, the acoustic models of XSAMPA symbols, which are equivalent to the commonly used phoneme labels.
(B) Using as the acoustic standard patterns (SPS-HMMs) the HMMs of the “sub-acoustic segments SPS” computed by the system described in steps (3) and (4), with the XSAMPA-HMMs of (A) as the starting point and the same training speech data.
[0017]
All experiments are speaker-independent speech recognition using speech data from adult male speakers. The HMMs are three-state, three-loop left-to-right (LR) models, and the number of mixture components per state is 1. The weighting coefficients a_m, a_v, and a_T used in the formulas of steps (3) and (4) were all set to 0.5, and b_m, b_v, b_T, c_m, c_v, and c_T were all set to 2.0. The training speech data set is the vocabulary set Basic English Words (B.E.W.), 850 words, spoken by five native speakers of English. The test data are B.E.W. utterances by five other speakers. The recognition rates are shown below.
・ Recognition rate in case (A): 5-speaker average 59.3% (minimum 55%, maximum 67%)
・ Recognition rate in case (B): 5-speaker average 76.2% (minimum 71%, maximum 82%)
As shown above, the effect of adopting the “fine acoustic segments” is clear: the recognition rate improves by roughly 15 points (from 59.3% to 76.2% on the five-speaker averages). The results also demonstrate that computing the HMMs of the “fine acoustic segments” by the method described here is effective.
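The size of the improvement can be checked directly from the two five-speaker averages. The relative error reduction below is derived here from the reported figures and is not a number stated in the source.

```latex
\begin{align*}
\text{absolute gain} &= 76.2 - 59.3 = 16.9 \text{ points}, \\
\text{relative error reduction} &= \frac{(100 - 59.3) - (100 - 76.2)}{100 - 59.3}
  = \frac{16.9}{40.7} \approx 0.415 .
\end{align*}
```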
[0018]
[Effect of the Invention]
As is clear from the above experiment, introducing “fine acoustic segments” greatly improves speech recognition performance. However, for English speech data there are about 65 types of XSAMPA labels, whereas there are more than 1000 types of “fine acoustic segments”, and labeling speech data with them is practically infeasible. The difficulty grows further as speech from more languages is handled. Therefore, the approach of the present invention, which automatically computes the initial values from the HMMs of the XSAMPA labels, is efficient, and the computed results are also effective.
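[Brief description of the drawings]
FIG. 1 shows the relationship on the time axis between the acoustic model HMMs of XSAMPA symbols and the acoustic model HMMs of SPS symbols.
FIG. 2 shows the topology of the hidden Markov model (HMM) used in the present invention as the acoustic standard pattern of XSAMPA segments and of SPS.
FIG. 3 is a block diagram of a system for optimally estimating the acoustic standard pattern HMMs of SPS symbols.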

Claims

A method of setting initial values of the acoustic model standard pattern of a fine segment, which is a recognition unit used for speech recognition, wherein the initial values are calculated from the values of the acoustic model standard patterns of the phoneme segments involved in the fine segment according to the rules of (Case A), (Case B), and (Case C) below, and wherein, in (Case B) and (Case C), when the interval of the fine segment standard pattern is divided in time, the weighting gives greater weight to the standard pattern values of the preceding phoneme segment on the start side and greater weight to the values of the following phoneme segment on the end side.
(Case A) When the fine segment represents the central part of the physical interval of the single phoneme segment to which it corresponds, the values of the acoustic model standard pattern representing the vicinity of the central part of that phoneme segment are used as the initial values of the fine segment standard pattern.
(Case B) When the fine segment represents the physical transition interval between two consecutive phoneme segments, a weighted average is calculated from two values, namely the value of the acoustic model standard pattern representing the central part of the preceding phoneme segment and the value of the acoustic model standard pattern representing the central part of the following phoneme segment, and this is used as the initial value of the fine segment standard pattern.
(Case C) When the fine segment relates to the physical transition interval of three consecutive phoneme segments and the influence of the second phoneme segment on the characteristics of the fine segment is extremely small, a weighted average is calculated from the value of the acoustic model standard pattern in the central part of the first phoneme segment and the value of the acoustic model standard pattern in the central part of the third phoneme segment, and this is used as the initial value of the fine segment standard pattern.

An apparatus for creating acoustic model standard patterns from speech samples using the method according to claim 1.