JP2000099071A

JP2000099071A - Speech recognition device and method

Info

Publication number: JP2000099071A
Application number: JP26416298A
Authority: JP
Inventors: Shoichi Matsunaga; 昭一松永
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 1998-09-18
Filing date: 1998-09-18
Publication date: 2000-04-07
Anticipated expiration: 2018-09-18
Also published as: JP3583930B2

Abstract

PROBLEM TO BE SOLVED: To improve the recognition performance by considering the relationship of acoustic features of a segment and the ones immediately before and after. SOLUTION: Speech to be learned is transformed into acoustic feature parameter. Then, the loci of acoustic parameters are obtained for any one of a section 1 which includes a recognition object phoneme Wi and the tail of a phoneme Wi-1 that is immediately before the phoneme Wi, a section 2 which includes the phoneme Wi and the leading top of a phoneme Wi+1 that is immediately after the phoneme Wi and a section 3 which includes the phoneme Wi, the tail of the phoneme Wi-1 and the phoneme Wi+1 that is immediately after the phoneme Wi. Then, appearance probabilities P(Bi-1, Ai|Wi), P(Ai, Bi+1|Wi) and P(Bi+1, Ai, Bi+1|Wi) are obtained against the phoneme Wi to generate models. Then, inputted voices are transformed into feature parameters during a speech recognition, loci of the section parameters corresponding to the sections of the models for each phoneme are obtained and the likelihood between the loci and each model is obtained.

Description

DETAILED DESCRIPTION OF THE INVENTION

【０００１】[0001]

【発明の属する技術分野】この発明は、音響特徴パラメ
ータの軌跡に基づいて音声を認識するセグメントモデル
を用いた音声認識装置及び方法に関する。The present invention relates to a speech recognition apparatus and method using a segment model for recognizing speech based on the locus of acoustic feature parameters.

【０００２】[0002]

【従来の技術】従来、音声認識における認識の基本単位
としては、音素単位、副単語（サブワード）単位、単語
単位等（以下これらをユニットと呼ぶ）があり、その単
位に対して隠れマルコフモデル（ＨＭＭ）が音響モデル
として広く用いられている（例えば、中川聖一著、“確
率モデルによる音声認識”、電子情報通信学会、昭和６
３年７月発行参照。）。音声認識では音声をある一定時
間間隔（ここではこれをフレームと呼ぶ）でパラメータ
化する。このＨＭＭに基づく方式では、隣接するフレー
ム間のパラメータの値は独立であるとして、音声のモデ
ル化、及び認識候補の尤度計算を行っていた。一方、人
間の発声機構の制約により、音声の特徴パラメータは隣
接するフレームでは独立とは考えられない。その点を補
強するモデルとしてユニット内でのパラメータの値の連
続性を仮定したセグメントモデルが提案されている（例
えば、M. Ostendorf他“From HMMs to segment models
：Aubified view of stochastic modeling for speech
recognition”IEEE Transactions on Speech and Audi
o Processing，SAP-4(5)，pp.360-378（1996-9))。2. Description of the Related Art Conventionally, basic units of speech recognition include phoneme units, subword (subword) units, word units, and the like (hereinafter, these are referred to as units). HMM) is widely used as an acoustic model (eg, Seiichi Nakagawa, “Speech Recognition by Stochastic Model”, IEICE, Showa 6)
See issue in July 2013. ). In speech recognition, speech is parameterized at certain fixed time intervals (here, this is called a frame). In the method based on the HMM, assuming that values of parameters between adjacent frames are independent, modeling of speech and calculation of likelihood of a recognition candidate are performed. On the other hand, due to the restriction of the human vocalization mechanism, the feature parameters of the speech are not considered to be independent in adjacent frames. As a model that reinforces this point, a segment model assuming continuity of parameter values in a unit has been proposed (for example, M. Ostendorf et al., “From HMMs to segment models”
: Aubified view of stochastic modeling for speech
recognition ”IEEE Transactions on Speech and Audi
o Processing, SAP-4 (5), pp. 360-378 (1996-9)).

【０００３】[0003]

【発明が解決しようとする課題】従来のＨＭＭではパラ
メータ値が独立と仮定され、パラメータの軌跡の連続性
を十分に扱えなかった。また、これまでのセグメントモ
デルはユニット内のパラメータの連続性については捉え
られていたが、ユニット外（隣接するユニット間）のパ
ラメータ値との連続性については扱っておらず、認識性
能はまだ十分ではなかった。この発明の目的は、当該セ
グメント（ユニット）の中だけではなく、隣接するセグ
メント（ユニット）とのパラメータ値の連続性を考慮す
ることで上記の問題点を解決し、これを効率よくモデル
化する方式を具備した、音声認識装置及び方法を提供す
ることにある。In the conventional HMM, the parameter values are assumed to be independent, and the continuity of the trajectory of the parameter cannot be sufficiently handled. In addition, although the segment model up to now has been concerned with the continuity of the parameters inside the unit, it does not deal with the continuity with the parameter values outside the unit (between adjacent units), and the recognition performance is still insufficient. Was not. An object of the present invention is to solve the above problem by considering the continuity of parameter values not only within the segment (unit) but also with an adjacent segment (unit), and efficiently model the problem. An object of the present invention is to provide a speech recognition apparatus and method having a method.

【０００４】[0004]

【課題を解決するための手段】この発明によれば、入力
された音声を音声音響特徴パラメータに分析し、その特
徴パラメータの軌跡の情報に基づいて認識をおこなうセ
グメントモデルを用いた音声認識装置において、認識を
行うセグメントの直前のセグメントの末尾の部分を含め
た区間、あるいは直後のセグメントの先頭の部分を含め
た区間、あるいは直前のセグメントの末尾の部分及び直
後のセグメントの先頭の部分を含めた区間、すなわち隣
接するセグメントへの遷移部分の特徴パラメータと、認
識を行うセグメントの特徴パラメータを含めたセグメン
ト区間の特徴パラメータを併せて、パラメータの軌跡を
求め、その軌跡の情報に基づいたセグメントの尤度を用
いて音声を認識することを特徴とする。つまり前記遷移
部分を含む特徴パラメータの軌跡のそのセグメント情報
に対する出現確率をモデルとして予め求めておき、この
モデルと入力音声信号の特徴パラメータの軌跡との尤度
を求める。According to the present invention, there is provided a speech recognition apparatus using a segment model for analyzing an inputted speech into speech acoustic feature parameters and performing recognition based on information of a locus of the feature parameters. , The section including the end of the segment immediately before the segment to be recognized, the section including the beginning of the immediately following segment, or the end of the immediately preceding segment and the beginning of the immediately following segment The trajectory of the parameter is obtained by combining the feature parameter of the segment, that is, the feature parameter of the segment section including the feature parameter of the segment to be recognized, and the likelihood of the segment based on the information of the trajectory. It is characterized in that speech is recognized using degrees. That is, the appearance probability of the trajectory of the feature parameter including the transition portion with respect to the segment information is obtained in advance as a model, and the likelihood between this model and the trajectory of the feature parameter of the input audio signal is obtained.

【０００５】また、請求項２記載の発明では請求項１記
載の発明において、上記セグメントの尤度計算におい
て、当該セグメントの前後のセグメントのラベル情報も
考慮して、当該セグメントの尤度を計算することを特徴
とする。According to a second aspect of the present invention, in the first aspect of the invention, the likelihood of the segment is calculated in consideration of the label information of segments before and after the segment in the likelihood calculation of the segment. It is characterized by the following.

【０００６】[0006]

【発明の実施の形態】以下、図面を参照してこの発明に
係る実施形態について説明する。図１は、この発明の要
部である特徴パラメータの軌跡を求める範囲を示す図で
ある。図１に認識対象となるｉ番目のセグメント（具体
的には音素、副単語（サブワード）、単語）のラベルを
ｗｉ、その前のセグメントのラベルをｗi-1 、後ろのセ
グメントのラベルをｗi+1 とそれぞれする。また、それ
ぞれのセグメントのラベルｗｉ，ｗi-1 ，ｗi+1 におけ
る各フレームごとに得られる特徴パラメータの軌跡をそ
れぞれＡｉ，Ａi-1 ，Ａi+1 とする。この発明では、前
後のセグメントのすべてを用いると、処理量が多くなる
ばかりでなく、軌跡の推定精度も落ちるため、前後のセ
グメントの遷移部分、即ち認識を行うセグメントの直前
のセグメントに関しては末尾の部分Ｂi-1 、直後のセグ
メントに関しては先頭の部分Ｂi+1 のみを考慮する。具
体的には、セグメントが音素の場合、その長さは通常５
０〜１００ミリ秒程度であるが、遷移部分Ｂi-1 ，Ｂi+
1 は１０〜５０ミリ秒程度とする。Embodiments of the present invention will be described below with reference to the drawings. FIG. 1 is a diagram showing a range for obtaining a locus of a characteristic parameter which is a main part of the present invention. In FIG. 1, the label of the i-th segment (specifically, phoneme, sub-word (sub-word), word) to be recognized is wi, the label of the preceding segment is wi-1, and the label of the following segment is wi + 1 and each. The trajectories of the characteristic parameters obtained for each frame at the labels wi, wi-1 and wi + 1 of the respective segments are denoted by Ai, Ai-1 and Ai + 1, respectively. In the present invention, if all of the preceding and succeeding segments are used, not only does the processing amount increase, but also the trajectory estimation accuracy decreases, so the transition part of the preceding and following segments, that is, the segment immediately before the segment to be recognized, has a trailing end. Only the leading portion Bi + 1 is considered for the portion Bi-1 and the immediately succeeding segment. Specifically, if the segment is a phoneme, its length is usually 5
It is about 0 to 100 milliseconds, but the transition parts Bi-1 and Bi +
1 is about 10 to 50 milliseconds.

【０００７】認識を行うセグメントの直前のセグメント
の末尾の部分を含めた区間でパラメータの軌跡を求める
場合は図１中の区間１となり、その軌跡の出現する確
率、つまりラベルｗｉの時に、パラメータ軌跡Ｂi-1 ，
Ａｉが生じる確率は、Ｐ（Ｂi-1 ，Ａｉ｜ｗｉ）あるいは前のセグメントの出現確率で正規化した確率Ｐ（Ｂi-1 ，Ａｉ｜ｗｉ）／Ｐ（Ｂi-1 ｜ｗｉ）で表す。また、直後のセグメントの先頭の部分を含めた
区間でパラメータの軌跡を求める場合は区間２となり、
その軌跡の出現する確率は、Ｐ（Ａｉ，Ｂi+1 ｜ｗｉ）あるいは後のセグメントの出現確率で正規化した確率Ｐ（Ａｉ，Ｂi+1 ｜ｗｉ）／Ｐ（Ｂi+1 ｜ｗｉ）で表す。また、直前のセグメントの末尾の部分及び直後
のセグメントの先頭の部分を含めた区間でパラメータの
軌跡を求める場合は区間３となり、その軌跡の出現する
確率は、Ｐ（Ｂi-1 ，Ａｉ，Ｂi+1 ｜ｗｉ）で表す。When a parameter trajectory is obtained in a section including the end of the segment immediately before the segment to be recognized, the section 1 in FIG. 1 is obtained. Bi-1,
The probability of occurrence of Ai is represented by P (Bi-1, Ai | wi) or the probability P (Bi-1, Ai | wi) / P (Bi-1 | wi) normalized by the appearance probability of the previous segment. In addition, when the locus of the parameter is obtained in the section including the leading part of the immediately following segment, the section becomes the section 2,
The probability of the locus appearing is P (Ai, Bi + 1 | wi) or the probability P (Ai, Bi + 1 | wi) / P (Bi + 1 | wi) normalized by the appearance probability of the subsequent segment. Represent. When a parameter trajectory is obtained in a section including the end of the immediately preceding segment and the beginning of the immediately following segment, the section 3 is obtained, and the probability of occurrence of the trajectory is represented by P (Bi-1, Ai, Bi). +1 | wi).

【０００８】一方、請求項２のコンテキスト（例えば音
素環境）依存の音響セグメントモデルに関しては、認識
を行うセグメントの直前のセグメントの末尾の部分を含
めた区間でパラメータの軌跡を求める場合は区間１とな
り、その軌跡の出現する確率は、Ｐ（Ｂi-1 ，Ａｉ｜ｗi-1 ，ｗｉ，ｗi+1 ）あるいは前のセグメントの出現確率で正規化した確率Ｐ（Ｂi-1 ，Ａｉ｜ｗi-1,ｗi,ｗi+1)/ Ｐ（Ｂi-1 ｜ｗ
i-1,ｗｉ，ｗi+1 ）で表す。また、直後のセグメントの先頭の部分を含めた
区間でパラメータの軌跡を求める場合は区間２となり、
その軌跡の出現する確率は、Ｐ（Ａｉ，Ｂi+1 ｜ｗi-1 ，ｗｉ，ｗi+1 ）あるいは後のセグメントの出現確率で正規化した確率Ｐ（Ａｉ，Ｂi+1 ｜ｗi-1,ｗi,ｗi+1)/ Ｐ（Ｂi+1 ｜ｗ
i-1,ｗｉ，ｗi+1 ）で表す。また、直前のセグメントの末尾の部分及び直後
のセグメントの先頭の部分を含めた区間でパラメータの
軌跡を求める場合は区間３となり、その軌跡の出現する
確率は、Ｐ（Ｂi-1 ，Ａｉ，Ｂi+1 ｜ｗi-1 ，ｗｉ，ｗ
i+1 ）で表す。On the other hand, with regard to the acoustic segment model depending on the context (for example, phoneme environment), when the trajectory of the parameter is obtained in the section including the end of the segment immediately before the segment to be recognized, the section becomes the section 1. , The probability that the locus appears is P (Bi−1, Ai | wi−1, wi, wi + 1) or the probability P (Bi−1, Ai | wi−1) normalized by the appearance probability of the previous segment. , wi, wi + 1) / P (Bi-1 | w
i-1, wi, wi + 1). In addition, when the locus of the parameter is obtained in the section including the leading part of the immediately following segment, the section becomes the section 2,
The probability of appearance of the trajectory is P (Ai, Bi + 1 | wi-1, Wi, wi + 1) or the probability P (Ai, Bi + 1 | wi-1, normalized by the appearance probability of the subsequent segment. wi, wi + 1) / P (Bi + 1 | w
i-1, wi, wi + 1). When a parameter trajectory is obtained in a section including the last part of the immediately preceding segment and the first part of the immediately following segment, the trajectory becomes Section 3, and the appearance probability of the trajectory is represented by P (Bi-1, Ai, Bi). +1 | wi-1, wi, w
i + 1).

【０００９】このコンテキスト依存の音響セグメントモ
デルとしては、認識を行うセグメントのラベル情報と、
その直前又は直後のセグメントのラベル情報のみを考慮
してもよい。図２はこの実施例において使用する音響セ
グメントモデルの作成のブロック図である。入力された
学習音声データは、特徴抽出部１２でケプストラム等の
特徴パラメータに変換され、軌跡計算部１３で上記軌跡
の推定区間に応じて、各パラメータの軌跡を推定する。
これらの軌跡の集合と入力学習音声データのラベルデー
タ（発声内容を記述したもの）を用いてモデル作成部１
４で音響セグメントモデルを作成し、メモリ１５に蓄積
する。The context-dependent acoustic segment model includes label information of a segment to be recognized,
Only the label information of the segment immediately before or immediately after may be considered. FIG. 2 is a block diagram for creating an acoustic segment model used in this embodiment. The input learning speech data is converted into feature parameters such as cepstrum by the feature extraction unit 12, and the trajectory of each parameter is estimated by the trajectory calculation unit 13 according to the estimated section of the trajectory.
A model creation unit 1 uses a set of these trajectories and label data of input learning speech data (in which utterance contents are described).
In step 4, an acoustic segment model is created and stored in the memory 15.

【００１０】図３はこの実施例の音声認識システムのブ
ロック図である。入力端子２１より入力された音声は、
特徴抽出部２２で、ケプストラム等の特徴パラメータに
変換され、上記軌跡の推定区間に応じて、軌跡計算部２
３で各パラメータの軌跡を推定する。メモリ２４から、
この推定区間の対応する音響セグメントモデルを用い
て、単語辞書２５と文法記述２６を用いて生成した認識
候補の確からしさ（尤度）を求め、最も確からしさの高
い認識候補を認識結果として出力する。FIG. 3 is a block diagram of the speech recognition system of this embodiment. The voice input from the input terminal 21 is
The feature extraction unit 22 converts the parameters into feature parameters such as cepstrum and the like.
In step 3, the trajectory of each parameter is estimated. From memory 24,
The likelihood of the recognition candidate generated using the word dictionary 25 and the grammar description 26 is obtained using the acoustic segment model corresponding to this estimated section, and the recognition candidate with the highest likelihood is output as a recognition result. .

【００１１】以上、説明したように、この発明によれば
前後のセグメントとの関連を考慮した音響セグメントモ
デルを作成し、それを用いて認識する方法を提供するこ
とができる。As described above, according to the present invention, it is possible to provide a method of creating an acoustic segment model in consideration of the relationship with the preceding and following segments and using the model to recognize the segment.

【００１２】[0012]

【発明の効果】以上、詳述したように、この発明によれ
ば、音響セグメントの軌跡を基に音声を認識する技術に
おいて、前後のセグメントの音響的特徴の関連性を考慮
してモデル化することにより、それを用いた音声認識に
おいて、従来のＨＭＭに代表される音響モデルより、よ
り優れた認識性能を提供できるという利点がある。As described above in detail, according to the present invention, in a technique for recognizing speech based on the trajectory of an acoustic segment, modeling is performed in consideration of the relevance of the acoustic features of the preceding and following segments. As a result, there is an advantage that it is possible to provide more excellent recognition performance in speech recognition using the same than an acoustic model represented by a conventional HMM.

【００１３】以下に実施例を述べる。学習用に１５人の
男性と、１５人の女性とを用い、試験用に５人の男性
と、５人の女性を用いた。音声の２５ミリ秒の窓に対
し、１３メルオープドケプストラム係数のベクトルを１
０ミリ秒ごとに計算した。ある実験では、この静的係数
に、いわゆるデルタ及び加速係数を加算して使用した。
発声者の変化を強調するため、単語をパラメータ化した
後、平均ベクトルを決定し、各フレームごとのパラメー
タベクトルから平均ベクトルを差し引いた。この実験で
は全てのモデルは、コンテキスト依存（三音素）であ
り、各モデルは３混合であり、ＨＭＭモデルは３状態を
もち、セグメントモデルはＨＭＭモデル及びセグメント
モデルのパラメータの数は同一である、ＨＭＭは固有の
エキスポネンシャル間隔モデルを用い、セグメントモデ
ルはガラシアン間隔モデルを用いた。セグメントモデル
は直前のセグメントの末尾の３０ミリ秒だけを考慮し
た。この値は、全遷移領域を含むように選定したが、離
れた音響データの使用を避けた。音素モデルのＨＭＭを
使用した場合の誤り率は静的パラメータでは１５．４７
％、静的＋△＋△△パラメータでは１３．５７％、とな
り、ポリノミナルセグメントモデルを用いた場合の誤り
率はそれぞれ１１．５３％、１０．１８％となり、この
発明のモデルを用いた場合はそれぞれ１０．０５％、
９．３１％となった。セグメントモデルの使用によれ
ば、ＨＭＭモデルの使用よりも誤り率が２５％よりな
り、この発明によれば誤り率が更に９〜１３％よくな
り、この発明が優れていることが理解される。An embodiment will be described below. Fifteen men and fifteen women were used for learning, and five men and five women were used for testing. For a 25 ms window of audio, the vector of 13 mel-opened cepstrum coefficients is 1
Calculated every 0 ms. In some experiments, so-called delta and acceleration factors were added to this static factor.
After parameterizing the words to emphasize speaker changes, the average vector was determined and the average vector was subtracted from the parameter vector for each frame. In this experiment, all models are context-dependent (triphones), each model is three-mixed, the HMM model has three states, and the segment model has the same number of parameters for the HMM model and the segment model. The HMM used a unique exponential interval model, and the segment model used a Galassian interval model. The segment model considered only the last 30 ms of the previous segment. This value was chosen to include the entire transition region, but avoided using distant acoustic data. The error rate when the phoneme model HMM is used is 15.47 for static parameters.
% And the static + では + 静的 parameter are 13.57%, and the error rates when the polynomial segment model is used are 11.53% and 10.18%, respectively, and when the model of the present invention is used. Are 10.05%,
It was 9.31%. It can be understood that the use of the segment model has an error rate of 25% or more than the use of the HMM model, and that the present invention further improves the error rate by 9 to 13%.

[Brief description of the drawings]

【図１】この発明に用いる音響モデルにおいて特徴パラ
メータの軌跡を求める範囲を示す図。FIG. 1 is a diagram showing a range in which a locus of a characteristic parameter is obtained in an acoustic model used in the present invention.

【図２】この発明に用いる音響モデルの生成過程を示す
ブロック図。FIG. 2 is a block diagram showing a process of generating an acoustic model used in the present invention.

【図３】この発明に係る一実施形態である音声認識装置
の機能構成を示すブロック図。FIG. 3 is a block diagram showing a functional configuration of a speech recognition device according to an embodiment of the present invention.

Claims

[Claims]

1. A speech recognition apparatus for analyzing an inputted speech signal into speech acoustic feature parameters, and comparing the locus of the parameters with a model for each segment in units of phonemes, sub-words or words to perform recognition. In the first, including the end of the segment immediately before the segment
, A second section including the beginning of the segment immediately after the segment, or a third section including the end of the segment immediately before the segment and the beginning of the segment immediately after the segment.
A memory for storing a segment model representing the trajectory of the characteristic parameter for each segment label for at least one of the sections, a means for calculating the sound and audio parameters of the input sound signal, Means for calculating a trajectory for each section corresponding to the segment model in the memory; means for obtaining the likelihood of the calculated trajectory for each segment model in the memory; and recognition candidates using the likelihood A voice recognition device.

2. The speech recognition apparatus according to claim 1, wherein each of the segment models stored in the memory includes at least one of a label of each segment model, a label of a segment immediately before the segment model, and a label of a segment immediately after the segment model. A speech recognition apparatus characterized in that in the means for calculating the likelihood, at least one of the label information of the segments immediately before and after the corresponding segment is also considered.

3. A speech recognition method for analyzing speech acoustic feature parameters of an input speech signal and performing speech recognition for each segment based on phonemes, sub-words, or words based on the locus of the parameters. A first section including the end of the segment immediately preceding the segment, a second section including the beginning of the segment immediately following the segment, a third section including the end of the segment immediately preceding the segment and the beginning of the segment immediately following the segment.
A segment model representing the trajectory of the characteristic parameter is created in advance for at least one section of each section for each label of each segment, and this is stored in a memo. At the time of speech recognition, the speech acoustic parameters of the input speech signal are Calculating a trajectory of each of the sections corresponding to the segment model in the memory of the calculated voice acoustic parameter, obtaining a likelihood of the calculated trajectory for each segment model in the memory, A speech recognition method characterized by performing speech recognition using likelihood.

4. The speech recognition method according to claim 3, wherein the segment model is created by considering at least one of a label of the model, a label of a segment immediately before the model, and a segment label immediately after the model. A speech recognition method characterized in that in calculating a degree, likelihood calculation is performed in consideration of label information of segments immediately before and after a corresponding segment according to the model.