JPH1185184A

JPH1185184A - Speech recognition device

Info

Publication number: JPH1185184A
Application number: JP9239528A
Authority: JP
Inventors: Hiroshi Yamamoto; 博史山本; Singer Harald; ハラルド・シンガー; Atsushi Nakamura; 篤中村; Foo Chan; チャン・フオー
Original assignee: ATR ONSEI HONYAKU TSUSHIN KENKYUSHO KK; ATR Interpreting Telecommunications Research Laboratories
Current assignee: ATR ONSEI HONYAKU TSUSHIN KENKYUSHO KK; ATR Interpreting Telecommunications Research Laboratories
Priority date: 1997-09-04
Filing date: 1997-09-04
Publication date: 1999-03-30

Abstract

PROBLEM TO BE SOLVED: To provide a speech recognition device that does not need to prompt the user to utter prescribed content and that performs continuous speech recognition while executing speaker adaptation "unsupervised". SOLUTION: Speech recognition means (a word matching unit 4 and a word hypothesis narrowing unit 6) refer to an acoustic model and a word-based statistical language model and continuously recognize the word string of an uttered speech sentence from the speech signal of the input utterance. A teacher signal generation unit 21 then refers to a word dictionary containing the phoneme string corresponding to each word string and converts the word string output from the unit 6 into a phoneme string. An online speaker adaptation control unit 22 uses the converted phoneme string as a teacher signal and executes online speaker adaptation on the acoustic model to update it. Here, the unit 22 executes online speaker adaptation based on the quasi-Bayes approximation, and the statistical language model includes an N-gram of words with variable length N.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Field of the Invention] The present invention relates to a continuous speech recognition apparatus provided with online speaker adaptation control means that performs speaker adaptation processing in real time.

[0002]

[Description of the Related Art] Online speaker adaptation, in which speaker adaptation is performed sequentially using the speech to be recognized as is, can raise the accuracy of the acoustic model for an unknown speaker step by step as the amount of speech grows. It is therefore highly promising as a practical technique for building speech recognition systems. As an online speaker adaptation process, an online speaker adaptation method based on the quasi-Bayes approximation is disclosed in, for example, the prior art document Q. Huo et al., "A study of on-line Quasi-Bayes adaptation for CDHMM-based speech recognition", In Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, pp. 705-708, May 1996. That method was formulated under the "supervised" condition in which the utterance content is known; that is, it was operated with a manually supplied teacher signal, and its effectiveness was confirmed under that condition.

[0003]

[Problems to Be Solved by the Invention] However, "supervised" speaker adaptation requires prompting the user to utter prescribed content, which complicates dialogue control and increases the burden on the user. An object of the present invention is to solve this problem and to provide a speech recognition apparatus that does not need to prompt the user to utter prescribed content and that can perform continuous speech recognition while executing speaker adaptation "unsupervised".

[0004]

[Means for Solving the Problems] The speech recognition apparatus according to claim 1 of the present invention comprises: speech recognition means for continuously recognizing the word string of an uttered speech sentence based on the speech signal of the input uttered speech sentence, with reference to a predetermined acoustic model and a predetermined word-based statistical language model; conversion means for converting the word string output from the speech recognition means into a phoneme string, with reference to a word dictionary containing the phoneme string corresponding to each word string; and adaptation means for updating the acoustic model by executing online speaker adaptation processing on the acoustic model, using the phoneme string converted by the conversion means as a teacher signal.

[0005] The speech recognition apparatus according to claim 2 is the speech recognition apparatus according to claim 1, wherein the adaptation means executes online speaker adaptation processing based on the quasi-Bayes approximation.

[0006] Further, the speech recognition apparatus according to claim 3 is the speech recognition apparatus according to claim 1 or 2, wherein the statistical language model includes an N-gram of words with variable length N.

[0007]

[Embodiments of the Invention] An embodiment of the present invention is described below with reference to the drawings.

[0008] FIG. 1 is a block diagram of a continuous speech recognition apparatus according to one embodiment of the present invention. The continuous speech recognition apparatus of this embodiment includes a word matching unit 4 that uses the known one-pass Viterbi decoding method to detect word hypotheses of an input uttered speech sentence from the feature parameters of its speech signal, and to calculate and output their likelihoods. The apparatus is characterized by further comprising:
(a) a word hypothesis narrowing unit 6 that, for word hypotheses of the same word output from the word matching unit 4 via a buffer memory 5 that have the same end time but different start times, refers to a statistical language model 13 containing a word-based variable-length N-gram (an N-gram of words with variable length N) and narrows the word hypotheses so that, for each head phoneme environment of the word, they are represented by the single word hypothesis having the highest of the total likelihoods calculated from the utterance start time to the end time of the word;
(b) a teacher signal generation unit 21 that converts the word string of the speech recognition result output from the word hypothesis narrowing unit 6 into a phoneme string with reference to a word dictionary 12; and
(c) an online speaker adaptation control unit 22 that uses the phoneme string output from the teacher signal generation unit 21 as a teacher signal and updates the phoneme hidden Markov model (hereinafter, phoneme HMM) 11, which is the acoustic model, by the known online speaker adaptation method based on the quasi-Bayes approximation.

[0009] In "unsupervised" speaker adaptation, the teacher signal, which is the information corresponding to the utterance content in the "supervised" case, is generated internally from the recognition result obtained with the acoustic model before speaker adaptation. The effect of speaker adaptation therefore depends heavily on the accuracy (recognition rate) of the generated teacher signal. In "unsupervised" online speaker adaptation based on the quasi-Bayes approximation that used, as the teacher signal, the recognition result of a continuous speech recognizer given only Japanese syllable rules as its linguistic constraint (a so-called phoneme typewriter) (hereinafter, the comparative example), the accuracy of the teacher signal was insufficient and recognition performance hardly improved. It was therefore expected that "unsupervised" online speaker adaptation based on the quasi-Bayes approximation would not improve recognition performance unless a stronger linguistic constraint were used and the teacher signal were taken from a recognizer that at least outperforms the phoneme typewriter. The present inventors accordingly devised continuous speech recognition that is performed while executing "unsupervised" quasi-Bayes online speaker adaptation in which the teacher signal is the recognition result of a continuous speech recognizer given, as its linguistic constraint, a variable-length N-gram, which is a word-based statistical language model (disclosed, for example, in Japanese Patent Laid-Open Publication No. 09-134192 and described in detail later). This presupposes that the vocabulary and domain of the recognition task are known. As described in detail later, using the variable-length N-gram makes it possible to generate a highly accurate teacher signal and to obtain an adaptation effect even under the "unsupervised" condition.

[0010] The statistical language model 13 used in this embodiment is generated by a language model generation unit from training text data. It is based on a bigram (N = 2) between part-of-speech classes, but words that are reliable on their own are separated from their part-of-speech class and treated as classes of their own. Furthermore, to improve prediction accuracy, the words of a frequently occurring word sequence are joined and treated as a single class, which makes it possible to represent long word chains. The resulting model is a statistical language model that combines the characteristics of a part-of-speech bigram and a variable-length word N-gram, and it balances the accuracy and the reliability of the transition probabilities.

[0011] First, the concept of the variable-length N-gram used in this embodiment is explained. An N-gram is an (N-1)th-order Markov model, which can be interpreted as a simple (first-order) Markov model whose states have each been split so as to remember the past (N-1) state transitions. As an example, FIG. 3 shows a state transition diagram representing a bigram as a Markov model, and FIG. 4 shows a state transition diagram representing a trigram as a Markov model.
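To make the Markov-model reading of an N-gram concrete, the following minimal Python sketch estimates transition probabilities by treating the preceding N-1 symbols as the state. The function name `train_ngram` and the toy symbol sequence are illustrative assumptions and do not come from the patent.

```python
from collections import Counter, defaultdict


def train_ngram(symbols, n):
    """Estimate P(next symbol | previous n-1 symbols) by relative frequency."""
    context_counts = Counter()
    transition_counts = defaultdict(Counter)
    for i in range(n - 1, len(symbols)):
        # The (n-1)-symbol history plays the role of the Markov state.
        state = tuple(symbols[i - n + 1:i])
        transition_counts[state][symbols[i]] += 1
        context_counts[state] += 1
    return {
        state: {sym: count / context_counts[state] for sym, count in nexts.items()}
        for state, nexts in transition_counts.items()
    }


sequence = list("aabbabab")          # hypothetical training symbols
bigram = train_ngram(sequence, 2)    # two states: ('a',) and ('b',)
trigram = train_ngram(sequence, 3)   # every bigram state is split by one more symbol of history
```

With this toy sequence the bigram has two states and the trigram has four, which mirrors the state splitting described for FIG. 3 and FIG. 4.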

[0012] In FIG. 3, when symbol a is output in state s₁ the model remains in state s₁, whereas when symbol b is output in state s₁ it transitions to state s₂. When symbol b is output in state s₂ the model remains in state s₂, whereas when symbol a is output in state s₂ it returns to state s₁. The trigram of FIG. 4 can be regarded as the bigram with state s₁ split into states s₁₁ and s₁₂ and state s₂ split into states s₂₁ and s₂₂. Splitting all the states further yields higher-order N-grams.

[0013] The variable-length N-gram shown in FIG. 5 is obtained by splitting only some of the states of the simple Markov model. Specifically, in the bigram of FIG. 3, the case in which symbol a is output from state s₂ is divided into two: the case in which symbol b is output next (written ab, and referred to as outputting symbol ab) and the case in which a symbol other than b is output next (written a(/b), and referred to as outputting symbol a(/b); here "/" denotes negation and stands for an overbar). In the former case a transition is made from state s₁ to state s₁₂, while in the latter case a transition is made from state s₂ to state s₁₁. In other words, for the former case state s₁₂ is split off from state s₁, and the remaining transitions that output symbol a (a(/b)) are left in state s₁₁. In this model, outputting symbol ab in state s₁₁ causes a transition to state s₁₂, whereas outputting symbol a(/b) in state s₁₁ leaves the model in state s₁₁. Likewise, outputting symbol ab in state s₁₂ leaves the model in state s₁₂, whereas outputting symbol a(/b) in state s₁₂ causes a transition to state s₁₁.

[0014] A feature of this model is that, by regarding a run of consecutive symbols as a new symbol, it can represent long chains while keeping the structure of a simple Markov model. Repeating the same state splitting makes it possible to represent still longer chains locally. This is the variable-length N-gram. That is, a variable-length word N-gram, as a language model in which the symbols are words, is expressed as a bigram between word strings (including single words).
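The idea of regarding a run of consecutive symbols as one new symbol can be sketched as follows. This is only an illustration under assumptions: the helper `merge_sequence`, the toy sequence, and the choice of merging "ab" are hypothetical, and the patent does not specify how the sequences to merge are selected.

```python
def merge_sequence(symbols, pattern):
    """Replace each occurrence of `pattern` (a tuple of symbols) with one joined symbol."""
    merged = []
    i = 0
    n = len(pattern)
    while i < len(symbols):
        if tuple(symbols[i:i + n]) == pattern:
            merged.append("".join(pattern))  # the run becomes a single new symbol
            i += n
        else:
            merged.append(symbols[i])
            i += 1
    return merged


units = merge_sequence(list("aabbabab"), ("a", "b"))
# A plain bigram over the merged units now spans up to four original symbols.
unit_bigrams = list(zip(units, units[1:]))
```

Because the model over `units` is still an ordinary bigram, the structure of the simple Markov model is preserved while the pair ("ab", "ab") captures a dependency across four original symbols.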

[0015] Next, the operation of the variable-length N-gram is described. The statistical language model 13 used in this embodiment is a variable-length N-gram of part-of-speech classes and words, and is expressed as a bigram between the following three kinds of classes:
(1) part-of-speech classes (hereinafter, the first class);
(2) classes of words separated from a part-of-speech class (hereinafter, the second class); and
(3) classes formed by joining consecutive words (hereinafter, the third class).

[0016] Words belonging to the first class are mainly those with a low frequency of occurrence, for which the transition probabilities are more reliable than if each word were handled individually. Words belonging to the second class are mainly those with a high frequency of occurrence, which are sufficiently reliable even when handled individually. Furthermore, consecutive words that are joined and assigned to the third class make the model operate as a variable-length N-gram and raise the prediction accuracy for the next word. In this embodiment, however, joining of consecutive part-of-speech classes with each other, or of a part-of-speech class with a word, is not considered. The generation probability P(w_1^L) of a sentence consisting of L words is given by the following equation.

[0017]

[Equation 1]
$$
P(w_1^L) = \prod_{t=1}^{K} P(ws_t \mid c_t) \cdot P(c_t \mid c_{t-1})
$$

[0018] Here, ws_t denotes the t-th word string (which may be a single word) when the sentence is divided into the above classes. Accordingly, P(ws_t | c_t) is the probability that the word string ws_t appears given the t-th class, and P(c_t | c_{t-1}) is the probability that a word of the t-th class appears following the preceding (t-1)-th class. K denotes the number of word strings in the sentence, with K ≤ L, so the product Π in Equation 1 runs from t = 1 to K. As an example, consider an uttered sentence consisting of the following seven words.

[0019]

[Equation 2] "watakushi - Murayama - to - i - i - ma - su" (わたくし−村山−と−言−い−ま−す; "My name is Murayama")

[0020] The generation probability P(w_1^L) of this sentence is given, using Equation 1, by the following equation.

[Equation 3]
$$
\begin{aligned}
P(w_1^L) ={}& P(\text{watakushi} \mid \{\text{watakushi}\}) \cdot P(\{\text{watakushi}\}) \\
&\cdot P(\text{Murayama} \mid \langle\text{proper noun}\rangle) \cdot P(\langle\text{proper noun}\rangle \mid \{\text{watakushi}\}) \\
&\cdot P(\text{to} \mid \{\text{to}\}) \cdot P(\{\text{to}\} \mid \langle\text{proper noun}\rangle) \\
&\cdot P(\text{iimasu} \mid [\text{iimasu}]) \cdot P([\text{iimasu}] \mid \{\text{to}\})
\end{aligned}
$$

Here watakushi, Murayama, to, and iimasu stand for わたくし, 村山, と, and 言います, respectively.

[0021] Here, < >, { }, and [ ] indicate membership in the first class, the second class, and the third class, respectively. The words and word strings belong to the classes as follows:
(1) "Murayama" (村山) is a noun and therefore belongs to the first class.
(2) "watakushi" (わたくし) and "to" (と) are a noun and a particle, respectively, and belong to the second class.
(3) "iimasu" (言います) is a combination of a verb, a verb suffix, an auxiliary verb, and an auxiliary-verb suffix, and belongs to the third class.
In the second and third classes, the frequency of occurrence of the word equals that of its class, so P(watakushi | {watakushi}) = 1, P(to | {to}) = 1, and P(iimasu | [iimasu]) = 1. Equation 3 therefore reduces to the following equation.

[0022]

[Equation 4]
$$
\begin{aligned}
P(w_1^L) ={}& P(\text{watakushi}) \cdot P(\text{Murayama} \mid \langle\text{proper noun}\rangle) \cdot P(\langle\text{proper noun}\rangle \mid \text{watakushi}) \\
&\cdot P(\text{to} \mid \langle\text{proper noun}\rangle) \cdot P(\text{iimasu} \mid \text{to})
\end{aligned}
$$
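The class-bigram sentence probability described above, a product over word strings of P(ws_t | c_t) and P(c_t | c_{t-1}), can be sketched in Python as follows. All probability values, the function name `sentence_probability`, and the start marker `<s>` used to condition the first class are assumptions made for illustration; the patent gives no numerical values.

```python
import math


def sentence_probability(units, emission, class_bigram, start="<s>"):
    """units: ordered list of (word_string, class_label) pairs for one sentence."""
    log_p = 0.0
    previous_class = start
    for word_string, cls in units:
        log_p += math.log(emission[(word_string, cls)])         # P(ws_t | c_t)
        log_p += math.log(class_bigram[(previous_class, cls)])  # P(c_t | c_{t-1})
        previous_class = cls
    return math.exp(log_p)


# Hypothetical model for the example sentence (K = 4 word strings, L = 7 words).
units = [
    ("watakushi", "{watakushi}"),
    ("Murayama", "<proper noun>"),
    ("to", "{to}"),
    ("iimasu", "[iimasu]"),
]
emission = {
    ("watakushi", "{watakushi}"): 1.0,   # second class: word and class coincide
    ("Murayama", "<proper noun>"): 0.01,
    ("to", "{to}"): 1.0,
    ("iimasu", "[iimasu]"): 1.0,         # third class: joined word string
}
class_bigram = {
    ("<s>", "{watakushi}"): 0.1,
    ("{watakushi}", "<proper noun>"): 0.2,
    ("<proper noun>", "{to}"): 0.5,
    ("{to}", "[iimasu]"): 0.4,
}
p = sentence_probability(units, emission, class_bigram)
```

Working in the log domain is a common design choice that avoids underflow for long sentences; only the first-class term contributes an emission probability below 1, which corresponds to the simplification from Equation 3 to Equation 4.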

[0023] In FIG. 1, the phoneme HMM 11, which is connected to the word matching unit 4 and stored in, for example, a hard disk memory, is expressed as a set of states, each of which holds the following information:
(a) state number;
(b) acceptable context classes;
(c) lists of preceding states and succeeding states;
(d) parameters of the output probability density distribution; and
(e) self-transition probability and transition probabilities to succeeding states.
Because it is necessary to identify which speaker each distribution derives from, the phoneme HMM 11 used in this embodiment is generated by converting a predetermined speaker-mixture HMM. The output probability density function is a Gaussian mixture distribution with 34-dimensional diagonal covariance matrices.

[0024] The word dictionary 12, which is connected to the word matching unit 4 and stored in, for example, a hard disk, stores, for each word of the phoneme HMM 11, a symbol string or phoneme string that gives the reading of the word in phoneme symbols.
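As an illustration of how such a dictionary supports the conversion of a recognized word string into a phoneme string (the role of the teacher signal generation unit 21), here is a minimal sketch. The dictionary contents, the phoneme symbols, and the function name `words_to_phonemes` are hypothetical; the patent does not list its phoneme inventory or dictionary format.

```python
# Hypothetical word dictionary: word -> reading as a list of phoneme symbols.
LEXICON = {
    "watakushi": ["w", "a", "t", "a", "k", "u", "sh", "i"],
    "Murayama": ["m", "u", "r", "a", "y", "a", "m", "a"],
    "to": ["t", "o"],
    "iimasu": ["i", "i", "m", "a", "s", "u"],
}


def words_to_phonemes(words, lexicon):
    """Concatenate the dictionary readings of a recognized word string."""
    phonemes = []
    for word in words:
        if word not in lexicon:
            raise KeyError(f"word not in dictionary: {word}")
        phonemes.extend(lexicon[word])
    return phonemes


teacher_signal = words_to_phonemes(["watakushi", "Murayama", "to", "iimasu"], LEXICON)
```

Raising an error for an unknown word is one possible design choice; since the recognizer's vocabulary is assumed to match the dictionary, an unknown word would indicate an inconsistency rather than a normal case.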

[0025] In FIG. 1, a speaker's utterance is input to a microphone 1, converted into a speech signal, and then input to a feature extraction unit 2. The feature extraction unit 2 A/D-converts the input speech signal and then performs, for example, LPC analysis to extract 34-dimensional feature parameters consisting of the log power, 16th-order cepstrum coefficients, Δ log power, and 16th-order Δ cepstrum coefficients. The time series of the extracted feature parameters is input to the word matching unit 4 via a buffer memory 3.
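The 34-dimensional layout (1 log power + 16 cepstrum coefficients + their 17 Δ counterparts) can be illustrated with the sketch below. It assumes the 17 static values per frame are already available and uses a simple central difference for the Δ terms; the patent does not state how the Δ features are computed, so that formula and the function name `add_deltas` are assumptions.

```python
def add_deltas(static_frames):
    """Append delta features to each 17-dim static frame, giving 34 dims per frame.

    static_frames: list of frames, each [log_power, c1, ..., c16].
    The delta is a central difference, with edge frames clamped to the boundary.
    """
    last = len(static_frames) - 1
    features = []
    for t, frame in enumerate(static_frames):
        previous_frame = static_frames[max(t - 1, 0)]
        next_frame = static_frames[min(t + 1, last)]
        delta = [(n - p) / 2.0 for n, p in zip(next_frame, previous_frame)]
        features.append(list(frame) + delta)
    return features


# Toy input: five frames whose 17 static values all equal the frame index.
static = [[float(t)] * 17 for t in range(5)]
feats = add_deltas(static)
```

Each output frame holds the static values in positions 0 to 16 and the corresponding Δ values in positions 17 to 33.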

[0026] Using the one-pass Viterbi decoding method, the word matching unit 4 detects word hypotheses and calculates and outputs their likelihoods from the feature parameter data input via the buffer memory 3, with reference to the phoneme HMM 11 and the word dictionary 12. For each HMM state at each time, the word matching unit 4 calculates the likelihood within the word and the likelihood from the start of the utterance. Likelihoods are held separately for each combination of word identification number, word start time, and preceding word. To reduce the amount of computation, grid hypotheses with low likelihood among the total likelihoods calculated from the phoneme HMM 11 and the word dictionary 12 are pruned. The word matching unit 4 outputs the resulting word hypotheses and likelihood information, together with time information measured from the utterance start time (specifically, for example, frame numbers), to the word hypothesis narrowing unit 6 via the buffer memory 5.

[0027] Based on the word hypotheses output from the word matching unit 4 via the buffer memory 5, the word hypothesis narrowing unit 6 refers to the statistical language model 13 and, for word hypotheses of the same word that have the same end time but different start times, narrows the word hypotheses so that, for each head phoneme environment of the word, they are represented by the single word hypothesis having the highest of the total likelihoods calculated from the utterance start time to the end time of the word. It then outputs, as the recognition result, the word string of the hypothesis having the maximum total likelihood among the word strings of all the word hypotheses remaining after the narrowing. The output word string is input to the teacher signal generation unit 21 in order to generate the teacher signal. In this embodiment, the head phoneme environment of the word being processed preferably means the sequence of three phonemes consisting of the final phoneme of the word hypothesis preceding the word and the first two phonemes of the word hypothesis of the word.

[0028] For example, as shown in FIG. 2, suppose that the (i-1)-th word W_{i-1} is followed by the i-th word W_i consisting of the phoneme string a₁, a₂, ..., a_n, and that six hypotheses Wa, Wb, Wc, Wd, We, and Wf exist as word hypotheses for the word W_{i-1}. Suppose that the final phoneme of the first three word hypotheses Wa, Wb, and Wc is /x/ and that the final phoneme of the last three word hypotheses Wd, We, and Wf is /y/. Among the hypotheses that have the same end time t_e and the same head phoneme environment (in FIG. 2, the top three word hypotheses, whose head phoneme environment is "x/a₁/a₂"), all hypotheses other than the one with the highest total likelihood (for example, the topmost hypothesis in FIG. 2) are deleted. The fourth hypothesis from the top is not deleted, because its head phoneme environment differs: the final phoneme of its preceding word hypothesis is y rather than x. In other words, only one hypothesis is kept for each final phoneme of the preceding word hypothesis. In the example of FIG. 2, one hypothesis is kept for the final phoneme /x/ and one for the final phoneme /y/.
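The narrowing rule just described (same word, same end time, same head phoneme environment: keep only the hypothesis with the highest total likelihood) can be sketched as below. The `WordHypothesis` fields, the log-likelihood values, and the function name `narrow` are illustrative assumptions rather than the data structures of the actual apparatus.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WordHypothesis:
    word: str
    start: int                   # start frame
    end: int                     # end frame
    prev_last_phoneme: str       # final phoneme of the preceding word hypothesis
    first_two_phonemes: tuple    # first two phonemes of this word
    total_log_likelihood: float  # accumulated from the start of the utterance


def narrow(hypotheses):
    """Keep the best hypothesis per (word, end time, head phoneme environment)."""
    best = {}
    for hyp in hypotheses:
        key = (hyp.word, hyp.end, hyp.prev_last_phoneme, hyp.first_two_phonemes)
        current = best.get(key)
        if current is None or hyp.total_log_likelihood > current.total_log_likelihood:
            best[key] = hyp
    return list(best.values())


# Six hypotheses for the same word and end time, as in the FIG. 2 example:
# three preceded by a word ending in /x/, three by a word ending in /y/.
hyps = [
    WordHypothesis("W_i", 10, 50, "x", ("a1", "a2"), -120.0),
    WordHypothesis("W_i", 12, 50, "x", ("a1", "a2"), -135.0),
    WordHypothesis("W_i", 14, 50, "x", ("a1", "a2"), -150.0),
    WordHypothesis("W_i", 11, 50, "y", ("a1", "a2"), -128.0),
    WordHypothesis("W_i", 13, 50, "y", ("a1", "a2"), -140.0),
    WordHypothesis("W_i", 15, 50, "y", ("a1", "a2"), -160.0),
]
kept = narrow(hyps)
```

In this toy case two hypotheses survive, one for each final phoneme of the preceding word.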

[0029] Meanwhile, the teacher signal generation unit 21 converts the word string of the speech recognition result output from the word hypothesis narrowing unit 6 into a phoneme string with reference to the word dictionary 12 and outputs it to the online speaker adaptation control unit 22. In response, the online speaker adaptation control unit 22 uses the phoneme string output from the teacher signal generation unit 21 as the teacher signal and, by the known online speaker adaptation method based on the quasi-Bayes approximation, updates the phoneme HMM 11, which is the acoustic model, for each utterance of the speaker, thereby executing online speaker adaptation. Specifically, in estimating the parameters of the target phoneme HMM 11, the probability density function of each individual parameter is approximated. The criterion for the approximating distribution is that it has the same mode, and that value is estimated by iterative computation based on the known EM (Expectation-Maximization) algorithm. By effectively discounting the influence of earlier training data at this point, the adaptation can quickly track changes in the environment such as a change of speaker. The influence of the training data is stored in a parameterized form called hyperparameters, which are updated sequentially while being used, together with new utterance data, for speaker adaptation.
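The role of hyperparameters and of discounting earlier data can be illustrated with a heavily simplified sketch. This is not the quasi-Bayes algorithm of Huo et al.: it assumes a single scalar Gaussian mean with a conjugate prior, hard-assigned frames (as if aligned by the teacher signal), and a hypothetical `forgetting` factor, and it omits the EM iterations over mixture components and HMM states.

```python
from dataclasses import dataclass


@dataclass
class MeanPrior:
    mean: float  # hyperparameter: mode of the prior on the Gaussian mean
    tau: float   # hyperparameter: confidence in that mode, as a pseudo-count


def update(prior, frames, forgetting=0.9):
    """Update the hyperparameters with the frames of one utterance.

    `forgetting` < 1 discounts the weight of earlier data so that the
    estimate can follow a change of speaker more quickly.
    """
    tau = prior.tau * forgetting
    n = len(frames)
    if n == 0:
        return MeanPrior(prior.mean, tau)
    new_mean = (tau * prior.mean + sum(frames)) / (tau + n)
    return MeanPrior(new_mean, tau + n)


prior = MeanPrior(mean=0.0, tau=10.0)
no_forgetting = update(prior, [1.0] * 10, forgetting=1.0)
with_forgetting = update(prior, [1.0] * 10, forgetting=0.5)
```

Only the two hyperparameters are carried from one utterance to the next, so earlier speech never has to be stored; with a smaller `forgetting` value the estimate moves further toward the new data.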

[0030] In the embodiment described above, the head phoneme environment of a word is defined as the sequence of three phonemes consisting of the final phoneme of the word hypothesis preceding the word and the first two phonemes of the word hypothesis of the word. The present invention is not limited to this, however; the head phoneme environment may be a phoneme sequence consisting of a phoneme string of the preceding word hypothesis, which includes the final phoneme of the preceding word hypothesis and at least one phoneme of the preceding word hypothesis contiguous with that final phoneme, and a phoneme string that includes the first phoneme of the word hypothesis of the word.

[0031] In the embodiment described above, the feature extraction unit 2, the word matching unit 4, the word hypothesis narrowing unit 6, the teacher signal generation unit 21, and the online speaker adaptation control unit 22 are implemented by, for example, a digital computer; the buffer memories 3 and 5 are implemented by, for example, hard disk memories; and the phoneme HMM 11, the word dictionary 12, and the statistical language model 13 are stored in a storage device such as a hard disk memory.

[0032] In the embodiment described above, speech recognition is performed using the word matching unit 4 and the word hypothesis narrowing unit 6. The present invention is not limited to this, however, and the apparatus may instead be composed of, for example, a phoneme matching unit that refers to the phoneme HMM 11 and a speech recognition unit that recognizes words with reference to the statistical language model 13 using, for example, the One Pass DP algorithm.

[0033] In the embodiment described above, the online speaker adaptation control unit 22 uses the phoneme string output from the teacher signal generation unit 21 as the teacher signal and updates the phoneme HMM 11, which is the acoustic model, by the known online speaker adaptation method based on the quasi-Bayes approximation. The present invention is not limited to this, however, and another online speaker adaptation method may be used.

[0034]

[Examples] The inventors conducted an experiment on the continuous speech recognition apparatus of FIG. 1 as follows. Speaker adaptation is performed after each utterance, and the adapted acoustic model is then used for the next utterance, both to generate the teacher signal and as the source model for the speaker adaptation that uses that signal. Online speaker adaptation would normally use the speech being recognized directly as adaptation data. In this experiment, however, speech to be recognized (test data) was prepared separately from the adaptation data so that the effect of adaptation after each utterance could be compared under identical conditions. The teacher signals were generated for five speakers under the following conditions.
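The utterance-by-utterance cycle described above (label the utterance with the current model, adapt, then carry the adapted model into the next utterance) can be sketched as follows. This is only an illustration under stated assumptions: the one-dimensional "model", the names `GaussianMean`, `adapt_online`, and `nearest_mean_labels`, and the toy data are all hypothetical, and the count-weighted mean update is a simplified stand-in, not the Quasi-Bayes procedure used in the apparatus.

```python
from dataclasses import dataclass


@dataclass
class GaussianMean:
    """Toy 1-D model of one phoneme: a mean and a pseudo-count (assumption)."""
    mean: float
    weight: float  # how many observations the current mean is worth

    def update(self, observations):
        # Count-weighted blend of the current mean and the new observations.
        n = len(observations)
        if n == 0:
            return
        total = self.mean * self.weight + sum(observations)
        self.weight += n
        self.mean = total / self.weight


def nearest_mean_labels(model, frames):
    # Hypothetical stand-in for recognition plus teacher-signal generation:
    # label each frame with the phoneme whose current mean is closest.
    return [min(model, key=lambda p: abs(model[p].mean - x)) for x in frames]


def adapt_online(model, utterances, label_fn):
    """Adapt the model one utterance at a time, without reference labels."""
    for frames in utterances:
        labels = label_fn(model, frames)  # teacher signal from the current model
        for phoneme in set(labels):
            obs = [x for x, p in zip(frames, labels) if p == phoneme]
            model[phoneme].update(obs)    # adapted model is used for the next utterance
    return model


model = {"a": GaussianMean(0.0, 1.0), "i": GaussianMean(10.0, 1.0)}
utterances = [[1.0, 9.0], [2.0, 8.0]]  # made-up feature values
adapt_online(model, utterances, nearest_mean_labels)
print(model["a"].mean, model["i"].mean)
```

The point of the sketch is the data flow: the labels for the second utterance are produced by the model already adapted on the first, which is the property the experiment relies on.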

[0035]

[Table 1] Acoustic analysis
─────────────────────────────
Sampling:        12 kHz
Frame interval:  10 msec
Hamming window:  20 msec
Features:        log power + 16th-order LPC cepstrum + Δ log power + 16th-order Δ LPC cepstrum
─────────────────────────────

[Table 2] Acoustic model
───────────────────────────────────
Speaker-independent HMnet, 600 states with 5 mixtures, plus a 1-state, 10-mixture silence model
───────────────────────────────────

[Table 3] Language model
────────────────────────────
6,370 words, variable-length N-gram (number of separated classes: 500)
────────────────────────────
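As a small worked example, the acoustic analysis settings in Table 1 translate into sample counts and a feature dimension as sketched below. The constant and function names are illustrative, and the frame-count helper assumes that only full windows are analyzed (no padding), which the table does not state.

```python
SAMPLE_RATE_HZ = 12_000   # Table 1: sampling at 12 kHz
FRAME_SHIFT_MS = 10       # Table 1: frame interval
WINDOW_MS = 20            # Table 1: Hamming window length
LPC_ORDER = 16            # Table 1: 16th-order LPC cepstrum

# Samples between successive frames and samples per analysis window.
frame_shift_samples = SAMPLE_RATE_HZ * FRAME_SHIFT_MS // 1000
window_samples = SAMPLE_RATE_HZ * WINDOW_MS // 1000

# log power + cepstrum + delta log power + delta cepstrum.
feature_dim = 1 + LPC_ORDER + 1 + LPC_ORDER


def num_frames(num_samples):
    """Number of full windows in a signal, assuming no padding (assumption)."""
    if num_samples < window_samples:
        return 0
    return 1 + (num_samples - window_samples) // frame_shift_samples


print(frame_shift_samples, window_samples, feature_dim)  # 120 240 34
print(num_frames(SAMPLE_RATE_HZ))                        # frames in one second: 99
```

With these settings, consecutive windows overlap by half their length, and each frame yields a 34-dimensional feature vector.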

[0036] Table 4 shows the adaptation data for each speaker. Table 5 shows the phoneme error rate obtained with the initial, speaker-independent model, that is, the accuracy of the teacher signal.

[0037]

[Table 4] Data used for adaptation
────────────────────────────────────────────────────
Speaker   Conversations   Utterances   Speech time
────────────────────────────────────────────────────
FYOMA     11              282          415.48 sec
FYUYO     9               245          383.31 sec
FHITA     15              508          699.47 sec
FTOAS     6               215          317.92 sec
MMATA     2               46           78.20 sec
────────────────────────────────────────────────────

[Table 5]

[0038] Compared with the comparative example, which uses a phoneme typewriter as the language constraint, the speech recognition apparatus using the method of this embodiment yields a much more accurate teacher signal. Moreover, because the actual teacher signal is generated with an acoustic model whose adaptation advances utterance by utterance, its accuracy improves further. For speaker FYOMA, for example, the figure improves from 10.6% to 8.6%.

[0039] Teacher signals were generated with the speech recognition apparatus of FIG. 1, and speaker adaptation was performed. Table 6 shows the phoneme error rate on the test data at the point where utterance-by-utterance online speaker adaptation had been completed for all utterances (a phoneme typewriter was used in the comparative example). FIG. 6 shows how the phoneme error rate for test data TAS12008 changes as adaptation data are added one utterance at a time.

[0040]

[Table 6] Phoneme misrecognition rate after speaker adaptation
─────────────────────────────────────────────────────────────
Test data    Speaker   Phoneme misrecognition rate (%)
                       Embodiment   Supervised   No adaptation
─────────────────────────────────────────────────────────────
TAS12008.A   FYOMA     19.33        17.09        24.23
TAS12010.A   FYOMA     14.55        13.09        22.18
TAS13005.B   FYUYO     15.49        16.20        21.83
TAS13009.B   FYUYO     16.11        14.99        26.02
TAS22001.B   FHITA     17.99        15.35        27.01
TAS23002.A   FHITA     16.19        14.52        32.59
TAS32002.A   FYOAS     16.75        13.66        21.65
TAS33001.B   FYOAS     12.95        9.76         19.53
TAS33011.B   MMATA     17.83        17.04        29.86
─────────────────────────────────────────────────────────────
Average                17.46        15.74        25.08
─────────────────────────────────────────────────────────────

[0041] In Table 6, "supervised" denotes the case where the teacher signal was supplied manually, and "no adaptation" denotes the case where the phoneme HMM was used without speaker adaptation. As Table 6 and FIG. 6 show, the "unsupervised" online adaptation of this embodiment does not match "supervised" adaptation, but it clearly improves recognition performance, and the amount of speech needed for adaptation does not differ greatly from the "supervised" case.
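To put the comparison in Table 6 in relative terms, the short calculation below takes the averages reported in the table's final row and computes the relative reduction in phoneme misrecognition rate against the "no adaptation" baseline. The variable and function names are illustrative, and the relative-reduction measure is a convenience chosen here, not a figure the text itself reports.

```python
# Averages as reported in the last row of Table 6 (percent).
reported_average = {
    "embodiment": 17.46,     # unsupervised adaptation (this embodiment)
    "supervised": 15.74,     # teacher signal supplied manually
    "no_adaptation": 25.08,  # speaker-independent phoneme HMM
}


def relative_reduction(baseline, adapted):
    """Relative error reduction in percent, measured against the baseline."""
    return 100.0 * (baseline - adapted) / baseline


baseline = reported_average["no_adaptation"]
unsupervised_gain = relative_reduction(baseline, reported_average["embodiment"])
supervised_gain = relative_reduction(baseline, reported_average["supervised"])

# Share of the supervised improvement that unsupervised adaptation achieves.
share_of_supervised = 100.0 * unsupervised_gain / supervised_gain

print(round(unsupervised_gain, 1))    # 30.4
print(round(supervised_gain, 1))      # 37.2
print(round(share_of_supervised, 1))  # 81.6
```

On these reported averages, unsupervised adaptation removes roughly 30% of the baseline errors, against roughly 37% for supervised adaptation.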

[0042] As described above, according to this embodiment, a teacher signal is obtained from a continuous speech recognizer that uses a word-level statistical language model as its linguistic constraint, and "unsupervised" online speaker adaptation based on the Quasi-Bayes approximation is then performed. The teacher signal can thus be generated automatically, and a very high speech recognition rate compared with the prior art can be obtained for the speech of unspecified speakers without advance preparation. In this embodiment, the more the apparatus is used, the further the acoustic model is adapted, so speech recognition performance can be improved substantially. In addition, using an acoustic model that includes variable-length N-grams makes it possible to generate a highly accurate teacher signal, so that a large speaker adaptation effect is obtained even under the "unsupervised" condition.
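The automatic generation of the teacher signal amounts to looking up each recognized word in the word dictionary and concatenating the phoneme strings. A minimal sketch follows. The dictionary entries, the phoneme symbols, and the function name `words_to_teacher_signal` are hypothetical and chosen only for illustration; the actual dictionary format is not given in the text.

```python
# Hypothetical toy word dictionary: word -> phoneme string (assumption).
WORD_DICTIONARY = {
    "hai": ["h", "a", "i"],
    "sou": ["s", "o", "u"],
    "desu": ["d", "e", "s", "u"],
}


def words_to_teacher_signal(words, dictionary):
    """Convert a recognized word string into one phoneme string."""
    phonemes = []
    for word in words:
        if word not in dictionary:
            # A recognizer constrained by the dictionary should not emit
            # unknown words, so treat this as an error in the sketch.
            raise KeyError(f"word not in dictionary: {word!r}")
        phonemes.extend(dictionary[word])
    return phonemes


recognized_words = ["hai", "sou", "desu"]  # stand-in for recognizer output
teacher_signal = words_to_teacher_signal(recognized_words, WORD_DICTIONARY)
print(teacher_signal)
# ['h', 'a', 'i', 's', 'o', 'u', 'd', 'e', 's', 'u']
```

Because the recognized words are constrained by the language model, the resulting phoneme string can only contain sequences that form valid words, which is one plausible reading of why this teacher signal is more accurate than one produced by a phoneme typewriter.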

[0043]

[Effects of the Invention] As described in detail above, the present invention comprises: speech recognition means for continuously recognizing the word string of an uttered speech sentence, based on the speech signal of the input uttered speech sentence, by referring to a predetermined acoustic model and a predetermined word-level statistical language model; conversion means for converting the word string output from the speech recognition means into a phoneme string by referring to a word dictionary that contains the phoneme string corresponding to each word string; and adaptation means for updating the acoustic model by performing online speaker adaptation on it, using the phoneme string produced by the conversion means as a teacher signal. The adaptation means preferably performs online speaker adaptation based on the Quasi-Bayes approximation, and the statistical language model preferably includes word N-grams of variable length N.

[0044] Accordingly, by obtaining the teacher signal from speech recognition means that uses a word-level statistical language model as its linguistic constraint and then performing online speaker adaptation, the teacher signal can be generated automatically, and a very high speech recognition rate compared with the prior art can be obtained for the speech of unspecified speakers without advance preparation. In the present invention, the more the apparatus is used, the further the acoustic model is adapted, so speech recognition performance can be improved substantially. In addition, using an acoustic model that includes variable-length word N-grams makes it possible to generate a highly accurate teacher signal, so that a large speaker adaptation effect is obtained even under the "unsupervised" condition.

[Brief description of the drawings]

FIG. 1 is a block diagram of a continuous speech recognition apparatus according to one embodiment of the present invention.

FIG. 2 is a timing chart showing the processing of the word hypothesis narrowing unit 6 in the continuous speech recognition apparatus of FIG. 1.

FIG. 3 is a state transition diagram showing a bigram statistical language model.

FIG. 4 is a state transition diagram showing a trigram statistical language model.

FIG. 5 is a state transition diagram showing the model underlying the variable-length N-gram used in the continuous speech recognition apparatus of FIG. 1.

FIG. 6 is a graph of experimental results for the continuous speech recognition apparatus of FIG. 1, showing the phoneme recognition error rate against the amount of adaptation data.

[Explanation of symbols]

1: microphone; 2: feature extraction unit; 3, 5: buffer memories; 4: word matching unit; 6: word hypothesis narrowing unit; 11: phoneme HMM; 12: word dictionary; 13: language model generation unit; 21: teacher signal generation unit; 22: online speaker adaptation control unit.

Continued from the front page
(72) Inventor: Harald Singer, c/o ATR Interpreting Telecommunications Research Laboratories, 5 Sanpeidani, Inuidani, Seika-cho, Soraku-gun, Kyoto
(72) Inventor: Atsushi Nakamura, c/o ATR Interpreting Telecommunications Research Laboratories, 5 Sanpeidani, Inuidani, Seika-cho, Soraku-gun, Kyoto
(72) Inventor: Foo Chan (チャン・フオー), c/o ATR Interpreting Telecommunications Research Laboratories, 5 Sanpeidani, Inuidani, Seika-cho, Soraku-gun, Kyoto

Claims

[Claims]

1. A speech recognition apparatus comprising: speech recognition means for continuously recognizing the word string of an uttered speech sentence, based on a speech signal of the input uttered speech sentence, with reference to a predetermined acoustic model and a predetermined word-level statistical language model; conversion means for converting the word string output from the speech recognition means into a phoneme string, with reference to a word dictionary including phoneme strings corresponding to the respective word strings; and adaptation means for updating the acoustic model by performing online speaker adaptation processing on the acoustic model, using the phoneme string converted by the conversion means as a teacher signal.

2. The speech recognition apparatus according to claim 1, wherein said adaptation means performs online speaker adaptation processing based on a Quasi-Bayes approximation.

3. The speech recognition apparatus according to claim 1, wherein said statistical language model includes an N-gram of a word having a variable length of N.