JPH09114482A - Speaker adaptation method for voice recognition - Google Patents

Speaker adaptation method for voice recognition

Info

Publication number
JPH09114482A
Authority
JP
Japan
Prior art keywords
recognition
speaker
speech
acoustic model
adaptation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
JP26826895A
Other languages
Japanese (ja)
Inventor
Shigeru Honma
Junichi Takahashi
Shigeki Sagayama
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nippon Telegraph and Telephone Corp
Original Assignee
Nippon Telegraph and Telephone Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nippon Telegraph and Telephone Corp filed Critical Nippon Telegraph and Telephone Corp
Priority to JP26826895A priority Critical patent/JPH09114482A/en
Publication of JPH09114482A publication Critical patent/JPH09114482A/en
Pending legal-status Critical Current

Links

Abstract

PROBLEM TO BE SOLVED: To perform speaker adaptation of an acoustic model without a teacher. SOLUTION: A speaker's recorded recognition-target speech is first recognized using a speaker-independent model (S1, S2). The recognition results are then used as teacher signals to adapt the speaker-independent model to the recognition-target speech (S3, S4). The adapted model is used to recognize the recognition-target speech again (S5). If the recognition results differ from the previous ones and the time limit has not been exceeded, the process returns to step S3; if the results are the same or the time limit is exceeded, the process ends (S6).

Description

Detailed Description of the Invention

[0001]

[Field of the Invention] The present invention relates to a speaker adaptation method for speech recognition, suited to cases where recognition can be carried out off-line with ample time. For example, in the dictation of medical findings, it can make more efficient the work in which a professional transcriber writes down and documents diagnostic results that a doctor has dictated onto tape.

[0002]

[Description of the Related Art] FIG. 3A shows speech recognition processing that takes the phoneme as the unit of recognition. In FIG. 3A, input speech 11 is converted into a time series of feature parameters by analysis processing 12, and this time series is matched against phoneme models 15 by search processing 13 under the constraints of grammar 16. The phoneme sequence with the highest evaluation score is output as recognition result 14.

[0003] Linear predictive coding (LPC) analysis is often used as the signal processing in analysis processing 12, and typical feature parameters include the LPC cepstrum, LPC delta cepstrum, mel cepstrum, and logarithmic power. For the phoneme models 15, the hidden Markov model method (hereinafter, HMM method), which models speech on the basis of probability and statistics, is the mainstream. Details of the HMM method are disclosed, for example, in Seiichi Nakagawa, "Speech Recognition by Stochastic Models," edited by the Institute of Electronics, Information and Communication Engineers.
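
As background for the HMM method cited above, the following is a minimal sketch (ours, not from the patent) of the forward algorithm that such models use to score an observed feature-parameter sequence; the log-domain representation and array shapes are illustrative assumptions.

```python
import numpy as np

def forward_log_likelihood(log_A, log_pi, log_b):
    """Log-likelihood of an observation sequence under an HMM.

    log_A  : (S, S) log transition probabilities log P(j | i)
    log_pi : (S,)   log initial-state probabilities
    log_b  : (T, S) log emission probabilities log p(o_t | state)
    """
    T, S = log_b.shape
    alpha = log_pi + log_b[0]              # log alpha_1(s)
    for t in range(1, T):
        # log-sum-exp over predecessor states, for each successor state
        m = alpha[:, None] + log_A         # (S, S): from-state x to-state
        alpha = np.logaddexp.reduce(m, axis=0) + log_b[t]
    return np.logaddexp.reduce(alpha)      # log P(o_1 .. o_T)
```

In a phoneme recognizer, `log_b` would come from each state's output distributions (here, the mixture continuous densities mentioned later in the experiments), and one such score is computed per hypothesis during the search.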

[0004] Search processing 13 evaluates, against the input speech, the plausibility of hypotheses corresponding to the phoneme strings permitted by the grammar, and advances the search while extending the hypotheses one phoneme at a time. Here, a hypothesis is a phoneme string concatenated in accordance with the phoneme-ordering constraints given by the grammar, and extending a hypothesis means appending one more phoneme to the hypothesis's phoneme string in accordance with the grammar.
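
To make the hypothesis-expansion idea concrete, here is a schematic beam search (our own sketch, not the patent's implementation) in which each hypothesis is a phoneme string grown one phoneme at a time under grammar constraints; `grammar.successors`, `score_extension`, and the beam width are assumed interfaces.

```python
import heapq

def beam_search(grammar, score_extension, beam_width=10, max_len=50):
    """Expand phoneme-string hypotheses one phoneme at a time.

    grammar.successors(phonemes) -> iterable of phonemes allowed next
                                    (empty when the string is complete)
    score_extension(hyp, p)      -> log-score of appending phoneme p to hyp
    """
    hypotheses = [(0.0, ())]      # (cumulative log score, phoneme tuple)
    finished = []
    for _ in range(max_len):
        candidates = []
        for score, hyp in hypotheses:
            nexts = list(grammar.successors(hyp))
            if not nexts:                   # grammar allows no extension
                finished.append((score, hyp))
                continue
            for p in nexts:                 # extend hypothesis by one phoneme
                candidates.append((score + score_extension(hyp, p), hyp + (p,)))
        if not candidates:
            break
        # keep only the best `beam_width` partial hypotheses
        hypotheses = heapq.nlargest(beam_width, candidates)
    best = max(finished + hypotheses)
    return best  # (score, phoneme sequence) with the highest evaluation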

[0005] When the phoneme models 15 used for speech recognition are built from the speech of the speaker to be recognized, this is called speaker-dependent recognition; when they are built from the speech of one or more speakers other than the target speaker, it is called speaker-independent recognition. Comparing recognition performance, speaker-dependent recognition, in which the phoneme models 15 are built from the target speaker's speech, is generally higher. Building the phoneme models 15 usually requires preparing speech that covers the entire recognition vocabulary. When the recognition vocabulary is large, uttering such speech in advance is therefore difficult, so either speaker-independent recognition is performed, or recognition is performed after applying the speaker adaptation described next to the speaker-independent phoneme models 15 using a small amount of the target speaker's speech.

[0006] Speaker adaptation is a technique that uses the target speaker's speech to bring the phoneme models 15 used for speaker-independent recognition close to the phoneme models 15 that would be used for speaker-dependent recognition. In general, speaker adaptation uses speech of the target speaker whose utterance content is known. Performing speaker adaptation with the utterance content supplied in this way is called supervised speaker adaptation, while speaker adaptation given speech whose utterance content is unknown is called unsupervised speaker adaptation.

[0007] Conventional speaker adaptation falls into the following three types.
- Batch adaptation: Before it is used to recognize unknown speech, speech is specially recorded in advance for speaker adaptation, separately from the speech to be recognized, and speaker adaptation is performed on it as a batch process to create an acoustic model fitted to that speaker. The utterance content of this pre-recorded speech is generally known. The resulting acoustic model is then used to recognize the speaker's unknown speech.

[0008] - Sequential adaptation: Unlike batch adaptation, no speech for speaker adaptation is recorded and no adaptation is processed before recognition begins. For recognition-target speech arriving online one input after another, recognition and speaker adaptation are repeated each time an input arrives. Each time one unit (a word, a phrase, and so on) is recognized, speaker adaptation is applied to the acoustic model using as the teacher signal either the recognition result itself or the result obtained by confirming and correcting it, and the newly adapted acoustic model is used to recognize the subsequent speech.

[0009] - Immediate adaptation: No speech for speaker adaptation is recorded and no adaptation is processed in advance. In batch and sequential adaptation the speech used for adaptation is distinct from the speech that the adapted model is to recognize, whereas in immediate adaptation the very speech to be recognized is used for the adaptation. Speaker adaptation is performed for each speech input, and the resulting model is used to recognize that input. Because the adaptation method focuses on spectral characteristics of the speech, whether the utterance content is known or unknown is irrelevant to the adaptation.

[0010]

[Problems to be Solved by the Invention] The conventional methods described above each have the following problems.
- Batch adaptation: Speech used solely for speaker adaptation, unrelated to recognition, must be recorded. Because speech characteristics that cover all of the recognition-target speech cannot be reflected in the model by the speaker adaptation, the effect of the adaptation may be reduced.

[0011] - Sequential adaptation: Because speaker adaptation is performed from partial speech, recognition accuracy cannot always be improved, depending on the adaptation method. The speech used for speaker adaptation does not necessarily cover the recognition-target speech, so the speaker's characteristics cannot be fully reflected in the model, and the effect of the adaptation may not be fully realized.

[0012] - Immediate adaptation: Because speaker adaptation is performed from partial speech, recognition accuracy is not always improved, depending on the adaptation method. It has been reported that when there is no large difference between the speaker-independent model and the target speaker's speech, the improvement in recognition rate is small.

[0013]

[Means for Solving the Problems] In the present invention, accumulated recognition-target speech is recognized using an acoustic model, and the recognition result is used as the teacher signal to adapt that acoustic model to the recognition-target speech. The available time is used effectively, without human intervention, to obtain a high recognition rate on the recognition-target speech. The procedure of the invention of claim 2 is as follows.

[0014] A first step recognizes the target speaker's speech using an initial acoustic model; a second step performs speaker adaptation of the acoustic model using the recognition result obtained by this recognition as the teacher signal; a third step recognizes the same speech again using the new acoustic model obtained by the adaptation; if the recognition result is the same as the previous one, or the time limit has been exceeded, the process ends, and otherwise it proceeds to the second step. With this configuration, the cycle is repeated off-line over ample time, the speaker-specific characteristics present in the recognition-target speech itself are reflected directly in the acoustic model, and an acoustic model is obtained that has been speaker-adapted using speaker information drawn from the entire recognition-target speech, rather than from partial speech as in sequential or immediate adaptation.

[0015] In the invention of claim 3, the recognition-target speech is recognized with an acoustic model and part of the recognition result is confirmed and corrected; speaker adaptation is then performed on the acoustic model using the confirmation/correction result as the teacher signal for the portion of the recognition-target speech whose confirmation and correction has finished, and the earlier recognition result, as-is, as the teacher signal for the portion whose confirmation and correction is not yet complete. A new acoustic model adapted to the target speaker is thereby created, and the recognition-target speech is recognized again using this new acoustic model.

[0016]

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

FIG. 1 shows the configuration of the invention of claim 2.
1. Switch SW2 to A and load the speaker-independent model 21 for the recognition processing unit 25 (S1).
2. Perform recognition processing 25 on all of the recognition-target speech 22 (for example, medical findings recorded on tape), phrase by phrase (S2).
3. Switch SW1 to C and take the speaker-independent model 21 as the initial model for speaker adaptation.
4. Perform speaker adaptation processing 23 using the previous recognition result as the teacher signal (S3). For this adaptation, maximum likelihood estimation, maximum a posteriori estimation, or either of these combined with transfer vector field smoothing, among others, can be used.
5. Switch SW2 to B and load the new acoustic model 24 obtained by the adaptation for recognition processing 25 (S4).
6. Perform recognition processing 25 again on all of the previously used recognition-target speech 22, phrase by phrase (S5).
7. Determine whether the recognition result is the same as the previous one, or the time limit has been exceeded (S6).
8. If, in this determination, the recognition result is not the same as the previous one and the time limit has not elapsed, switch SW1 to D, take the acoustic model 24 obtained by the previous adaptation as the initial model for speaker adaptation, and go to step 4, i.e., return to step S3. When the recognition result becomes the same as the previous one, or the time limit is exceeded, the process ends.
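
Steps 1 through 8 above can be restated as the following sketch; `recognize` and `adapt` stand in for recognition processing 25 and speaker adaptation processing 23, and their interfaces, like the default time limit, are illustrative assumptions rather than the patent's implementation.

```python
import time

def iterative_unsupervised_adaptation(si_model, utterances, recognize, adapt,
                                      time_limit_s=8 * 3600):
    """Repeat recognize -> adapt -> re-recognize on the same recorded speech.

    si_model   : speaker-independent acoustic model (initial model)
    utterances : the stored recognition-target speech, phrase by phrase
    recognize(model, utterances)           -> list of phrase-level results
    adapt(init_model, utterances, teacher) -> new acoustic model
    """
    start = time.monotonic()
    model = si_model
    results = recognize(model, utterances)           # S1, S2
    while True:
        teacher = results                            # recognition result as teacher signal
        model = adapt(model, utterances, teacher)    # S3: ML / MAP (+ smoothing) adaptation
        new_results = recognize(model, utterances)   # S4, S5: re-recognize with new model
        if new_results == results:                   # S6: result unchanged -> stop
            break
        if time.monotonic() - start > time_limit_s:  # S6: time limit exceeded -> stop
            results = new_results
            break
        results = new_results                        # step 8: loop with adapted model
    return model, results
```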

[0017] In this way, speaker adaptation is applied to the acoustic model using the recognition result as the teacher signal, and the same recognition-target speech is recognized again with the adapted acoustic model; because this cycle repeats, the speaker adaptation becomes better as the iteration proceeds, and the recognition rate improves. An experimental example is described next. Speaker adaptation according to the present invention is compared against conventional speaker adaptation methods using recorded speech concerning head X-ray CT findings. Here, as a measure for evaluating the utility of speech recognition in dictation, a measure based on the total correction effort for recognition errors in documenting the entire recognition-target speech was introduced. One measure is the proportion of phrases, over the entire recognition target, whose recognition result need not be corrected; this is called the phrase recognition rate. A comparison was also made by total correction effort, with each misrecognition weighted by the number of operations needed to correct it.

[0018] For the speaker adaptation, concatenated training was used, in which HMMs concatenated according to the phoneme label sequence of each phrase utterance are trained on that utterance without segmenting it. Even when parameters are estimated using all of the recognition-target speech, some phonemes have few samples and it is difficult to ensure parameter estimation accuracy, so only the mean vectors were adapted. As adaptation algorithms, maximum likelihood (ML) estimation using the Baum-Welch algorithm and maximum a posteriori (MAP) estimation are used.
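
As a sketch of the mean-vector-only adaptation described here, the textbook MAP update for Gaussian means interpolates the speaker-independent prior mean with the occupancy-weighted data mean; the prior-weight parameter `tau` and the array interfaces are assumptions, not values from the experiment.

```python
import numpy as np

def map_adapt_means(prior_means, occupancy, weighted_obs_sums, tau=10.0):
    """MAP re-estimation of Gaussian mean vectors only.

    prior_means       : (K, D) speaker-independent means (the prior)
    occupancy         : (K,)   sum over frames of posterior occupancy gamma_k(t)
    weighted_obs_sums : (K, D) sum over frames of gamma_k(t) * o_t
    tau               : prior weight; larger tau trusts the prior more
    """
    occ = occupancy[:, None]
    # mu_hat = (tau * mu_prior + sum_t gamma_t o_t) / (tau + sum_t gamma_t)
    return (tau * prior_means + weighted_obs_sums) / (tau + occ)
```

ML estimation corresponds to the `tau = 0` limit, i.e., the occupancy-weighted data mean alone; in either case the occupancies come from Baum-Welch alignment of the teacher phoneme sequence against the speech.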

[0019] A speaker-independent acoustic model was created using the ATR 5240 important words and 216 phonetically balanced words uttered by 10 male and 10 female speakers, and the 150 phonetically balanced sentences of the Acoustical Society of Japan database uttered by 30 male and 34 female speakers. The speech used in the recognition experiments consists of head X-ray CT findings uttered phrase by phrase at natural speed by two female speakers, 30 findings each (1,300 phrases, 14,000 phonemes). This roughly corresponds to a radiologist's standard daily workload.

[0020] The acoustic analysis conditions were a sampling frequency of 12 kHz, a Hamming window, an analysis window length of 32 ms, and 16th-order LPC analysis; the features were the 16th-order LPC cepstrum, 16th-order delta cepstrum, and delta power. The HMMs used in the experiments were context-dependent HMMs, with 26 base phonemes and about 1,700 context-dependent models. The structure is an HMnet with 450 states, and the output probabilities are 4-mixture continuous distributions.

[0021] For comparison, the following were evaluated: batch speaker adaptation, in which teacher signals are given for all of the recognition-target speech and which is considered to give results close to those of optimal adaptation; sequential speaker adaptation, in which a teacher signal is given for the recognition result of each finding, or of each sentence, and speaker adaptation is then performed by the MAP estimation method; and recognition with the speaker-independent model. Over the input of all findings, the phrase recognition rate is computed using as the criterion the number of misrecognized phrases that the user must correct. In this experiment, phrase recognition rates were computed by comparing phoneme sequences.

[0022] In the case of sequential speaker adaptation, for each sentence or each finding:
1. Perform recognition using the adapted model obtained up to that point; for the first sentence (finding), the speaker-independent model is used.
2. Count the total number of phrases in the sentence (finding) and the number of correctly recognized phrases.
3. Give the teacher signal and adapt the acoustic model.

[0023] By repeating the above processing, after all findings have been processed, the phrase recognition rate is computed from the total number of phrases and the total number of correct phrases. In the case of repeated unsupervised speaker adaptation (the present invention), all findings are recognized after unsupervised adaptation, and the phrase recognition rate is computed from the total number of phrases and the total number of correctly recognized phrases. FIG. 3B shows the relationship between the average phrase recognition rate of the two speakers and the number of iterations. With repeated unsupervised speaker adaptation using the MAP estimation method (invention, MAP), the recognition rate was 86.3% at the third iteration; with the ML estimation method (invention, ML), it was 85.3% at the fourth iteration. When batch speaker adaptation was performed with the correct phoneme sequences given as teacher signals (batch adaptation), the recognition rate was 94.1%; with sequential speaker adaptation by the MAP estimation method (sequential adaptation), it was 90.5% when performed per sentence and 90.7% when performed per finding; and the average recognition rate for phrase recognition with the speaker-independent model (speaker-independent) was 78.4%. With the method of the present invention using MAP estimation, three iterations reduced the phrase misrecognition rate from 21.6% to 13.7%, a 37% improvement in misrecognition.

[0024] In a typical dictation system, only the first-place recognition result is displayed on the screen, and the second- and lower-place results are displayed only after some operation is performed. Displaying many candidates below first place raises the probability that the correct recognition result is included, but if too many are displayed, similar candidates line up and it becomes difficult to pick out the correct one. For this reason it is common to display candidates down to tenth place. Given this, the recognition results of a dictation system can be classified into the following three groups.

[0025]
- Those whose first-place candidate is correct.
- Those whose correct answer lies between second and tenth place.
- Those with no correct answer within the top ten.
Considering normal usage, correcting a recognition result in the second group requires two operations: instructing the system to display the next candidates, and selecting the correct candidate from the second- through tenth-place results. Each error in this group therefore takes two operations to fix. Similarly, the effort needed to correct a recognition result in the third group consists of instructing the system to display the next candidates, informing the system that no correct conversion result is among them, entering n characters for kana-kanji conversion (romaji input is assumed here), instructing the kana-kanji conversion, and confirming the character string. Each error in this group therefore takes 4 + n operations. In practice, part of the recognized character string may be reusable, and the "next candidate" key may be pressed several times during kana-kanji conversion, but the operation counts were computed simply by the method above. To clarify the relationship with the recognition rate, the number of operations per 100 phrases needed to correct the recognition results is defined as the correction index, and the effort needed to correct recognition results is compared on this basis. FIG. 3C shows the correction indices. When the repeated unsupervised speaker adaptation of the present invention using the MAP estimation method was iterated three times, the correction index decreased by 36% compared with recognition using the speaker-independent model.
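
The correction index defined above reduces to a small calculation; the operation counts (2 for ranks 2-10, 4 + n otherwise) follow the text, while the input data structure is an illustrative assumption.

```python
def correction_index(results):
    """Operations per 100 phrases needed to correct recognition results.

    results: list of (rank_of_correct, n_chars) per phrase, where
             rank_of_correct is None when the answer is outside the top 10
             and n_chars is the character count typed for romaji re-entry.
    """
    ops = 0
    for rank, n_chars in results:
        if rank == 1:
            continue                    # first candidate correct: no operation
        elif rank is not None and 2 <= rank <= 10:
            ops += 2                    # show next candidates + select one
        else:
            ops += 4 + n_chars          # show, reject, type n chars, convert, confirm
    return 100.0 * ops / len(results)
```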

[0026] With the method of the present invention, a large performance improvement was obtained at the first adaptation for both MAP and ML estimation. This is presumably because the effect of speaker adaptation, which moves the model from the speaker-independent acoustic model toward the target speaker's model, is large compared with the influence of erroneous teacher signals. From the second iteration onward, the effect of speaker adaptation is small compared with the influence of erroneous teacher signals, and the same misrecognitions are repeated, so performance is thought not to improve.

[0027] Examining the relationship between misrecognition rate and correction index across the evaluated adaptation methods, the lower an adaptation method's recognition rate, the more its correction index increases, at a rate greater than the misrecognition itself. This is presumably because the misrecognition rate merely indicates that the first-place recognition result is not correct, whereas the correction index is an evaluation criterion that takes into account not only the number of misrecognitions but also the rank of the correct candidate. From the above, the correction index is considered better suited than the misrecognition rate as a measure for evaluating the performance of a dictation system from the user's standpoint.

[0028] The correction index of the present invention is about twice that of immediate recognition with sequential adaptation, but since dictation and documentation are here divided between different workers, this is not considered a problem from the standpoint of ease of work. The speaker need not wait for the recognition result or confirm and correct it at each utterance, so there is no interruption of thought, and migration from the conventional practice of recording utterances on tape is easy; and for the documenting side, a document about 87% of which has been correctly converted is produced without human intervention.

[0029] Next, an embodiment of the invention of claim 3 is described with reference to FIG. 2. In this embodiment, the recognition-target speech is recognized with the speaker-independent acoustic model (S1), the recognition result is confirmed and corrected (S2), and the confirmation and correction is then interrupted, for example for a lunch break (S3). During the interruption, in this invention, speaker adaptation is performed on the acoustic model using the confirmation/correction result as the teacher signal for the portion of the recognition-target speech whose confirmation and correction has finished, and the recognition result of step S1, as-is, as the teacher signal for the portion whose confirmation and correction is not yet complete; a new acoustic model adapted to the target speaker is thereby created (S4). The recognition-target speech is then recognized again using this new acoustic model (S5), and it is checked whether the recognition result is the same as the previous one or the time limit has been exceeded; if not, the process returns to step S4, and when the result is the same or the time limit is exceeded, the process ends (S6). In this case, the recognition in step S5 may be performed only on the portion of the recognition-target speech whose confirmation and correction has not been completed.
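
The adaptation step S4 above mixes two kinds of teacher signal. A schematic sketch (the function name and dictionary interface are assumptions) might look like this:

```python
def build_teacher_signals(recognition_results, corrections):
    """Teacher signals for the claim-3 adaptation step (S4).

    recognition_results : dict utterance_id -> phoneme sequence from step S1
    corrections         : dict utterance_id -> confirmed/corrected sequence,
                          present only for the portion already checked
    """
    teacher = {}
    for utt_id, recognized in recognition_results.items():
        if utt_id in corrections:
            teacher[utt_id] = corrections[utt_id]   # confirmed/corrected part
        else:
            teacher[utt_id] = recognized            # unchecked part: result as-is
    return teacher
```

The resulting teacher signals feed the same MAP or ML adaptation as in the claim 2 embodiment.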

[0030] For the learning shown in FIG. 2, the same experiment as above was carried out, treating the first third of the experimental data, 10 findings, as supervised, that is, as speech whose recognition results had been confirmed and corrected, using the speaker-independent model as the initial model, and applying the adaptation method above. Evaluated by the criterion of "the number of misrecognized phrases the user must correct over the input of all findings," the recognition rate at the first iteration was 84.4% with the MAP estimation method and 83.9% with the ML estimation method. Aggregated over all findings, the recognition rate appears low because it includes the effort of correcting the first third of the findings, which were recognized with the speaker-independent model. To confirm the effect of the adaptation, aggregating over the 20 findings of the latter two-thirds, the figures of 86.6% for MAP and 85.3% for ML obtained with the invention of claim 2 improved to 87.2% and 86.3%, respectively. This shows that the teacher signals (confirmation/correction signals) improve the performance of the model.

[0031] In the two embodiments above, the speaker-independent acoustic model was used as the initial model, but an acoustic model created from speech uttered by the target speaker may also be used. In this case, too, it is difficult to collect very much speech for building the acoustic model, so the recognition-target speech often contains much speech other than that used to build the model, and the present invention is effective on this point. Moreover, as the experimental examples above make clear, once speaker adaptation of the acoustic model using recognition results as teacher signals has been performed, the recognition rate improves markedly, and the cycle of speaker adaptation and recognition need not be repeated.

[0032]

[Effects of the Invention] As is clear from the above description, according to the present invention, adaptive processing without human intervention improves the speech recognition rate on the recognition-target speech compared with the speaker-independent model, and greatly reduces the effort of correcting misrecognized results. For the 30 texts (2,632 phrases in total) uttered by each of two speakers, applying this method roughly halved the misrecognition rate.

[0033] In a typical dictation device, the next candidates displayed are generally those ranked second through tenth. Therefore, when the correct recognition result lies between second and tenth place, correction is completed in two operations: instructing the device to display the next candidates on its screen, and selecting one of the displayed candidates. When there is no correct recognition result within the top ten, the operation is the same as with an ordinary Japanese word processor: correcting the misrecognized result requires an operation to display the next candidates, an operation informing the device that the answer was not in second through tenth place, as many operations as there are characters to enter the correct text in romaji as on a word processor, an operation to perform the kana-kanji conversion, and an operation to confirm the conversion result. Comparing the correction effort by the number of operations, counted in this way, needed to manually fix misrecognized conversion results, the conventional method required about 1,600 operations, whereas with the present method this was reduced to about 1,100 operations.

Brief Description of the Drawings

[FIG. 1] A is a block diagram showing an embodiment of the invention of claim 3, and B is a flowchart showing the processing procedure of an embodiment of the invention of claim 2.

[FIG. 2] A flowchart showing the processing procedure of an embodiment of the invention of claim 3.

[FIG. 3] A is a block diagram showing speech recognition processing with the phoneme as the unit of recognition; B shows an experimental example of the relationship between the number of iterations and the phrase recognition rate for the invention of claim 2, together with the recognition rates of conventional methods; and C shows an experimental example of the correction indices of the method of claim 2 and of conventional methods.

Claims (5)

[Claims]
1. A speaker adaptation method for speech recognition, comprising: a first step of recognizing, using an acoustic model, recorded recognition-target speech uttered by a single speaker; and a second step of performing speaker adaptation using the recognition result obtained in the first step as a teacher signal, thereby creating a new acoustic model adapted to the recognition-target speech.
2. A speaker adaptation method for speech recognition, comprising: a third step of recognizing said recognition-target speech again using said newly created acoustic model; a fourth step of performing speaker adaptation on said new acoustic model using the recognition result obtained in the third step as a teacher signal, thereby creating a new acoustic model further adapted to the target speaker; and a fifth step of repeating said third and fourth steps until the recognition result becomes identical to the previous one or a predetermined processing time is exceeded.
3. A speaker adaptation method for speech recognition, comprising: a first step of recognizing recognition-target speech using an acoustic model prepared in advance; a second step of confirming and correcting part of the recognition result obtained in the first step; a third step of performing speaker adaptation on said acoustic model, using the confirmation/correction result as a teacher signal for the portion of the recognition-target speech whose confirmation and correction was completed in the second step, and using the recognition result obtained in the first step as-is as a teacher signal for the portion whose confirmation and correction is not yet completed, thereby creating a new acoustic model adapted to the target speaker; and a fourth step of recognizing the recognition-target speech again using the acoustic model obtained in the third step.
4. The speaker adaptation method for speech recognition according to claim 2 or 3, wherein the initial acoustic model used in said first step is an acoustic model created from speech uttered by one or more speakers other than the target speaker.
5. The speaker adaptation method for speech recognition according to claim 2 or 3, wherein the acoustic model used in said first step is an acoustic model created from speech uttered by the target speaker.
JP26826895A 1995-10-17 1995-10-17 Speaker adaptation method for voice recognition Pending JPH09114482A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP26826895A JPH09114482A (en) 1995-10-17 1995-10-17 Speaker adaptation method for voice recognition

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
JP26826895A JPH09114482A (en) 1995-10-17 1995-10-17 Speaker adaptation method for voice recognition

Publications (1)

Publication Number Publication Date
JPH09114482A true JPH09114482A (en) 1997-05-02

Family

ID=17456214

Family Applications (1)

Application Number Title Priority Date Filing Date
JP26826895A Pending JPH09114482A (en) 1995-10-17 1995-10-17 Speaker adaptation method for voice recognition

Country Status (1)

Country Link
JP (1) JPH09114482A (en)


Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2002073669A (en) * 2000-08-30 2002-03-12 Nippon Telegr & Teleph Corp <Ntt> Device and method for providing information
JP2003029784A (en) * 2001-04-20 2003-01-31 Koninkl Philips Electronics Nv Method for determining entry of database
JP2006505002A (en) * 2002-11-02 2006-02-09 コーニンクレッカ フィリップス エレクトロニクス エヌ ヴィ Speech recognition method and system
WO2007105409A1 (en) * 2006-02-27 2007-09-20 Nec Corporation Reference pattern adapter, reference pattern adapting method, and reference pattern adapting program
US8762148B2 (en) 2006-02-27 2014-06-24 Nec Corporation Reference pattern adaptation apparatus, reference pattern adaptation method and reference pattern adaptation program
JP2021529978A (en) * 2018-05-10 2021-11-04 エル ソルー カンパニー, リミテッドLlsollu Co., Ltd. Artificial intelligence service method and equipment for it

Similar Documents

Publication Publication Date Title
JP4301102B2 (en) Audio processing apparatus, audio processing method, program, and recording medium
EP1557822B1 (en) Automatic speech recognition adaptation using user corrections
US5333275A (en) System and method for time aligning speech
EP2003572B1 (en) Language understanding device
Jelinek et al. 25 Continuous speech recognition: Statistical methods
EP1139332A9 (en) Spelling speech recognition apparatus
JP2003316386A (en) Method, device, and program for speech recognition
JPWO2009025356A1 (en) Speech recognition apparatus and speech recognition method
US20040210437A1 (en) Semi-discrete utterance recognizer for carefully articulated speech
KR101014086B1 (en) Voice processing device and method, and recording medium
US20070038453A1 (en) Speech recognition system
US20070203700A1 (en) Speech Recognition Apparatus And Speech Recognition Method
US20020091520A1 (en) Method and apparatus for text input utilizing speech recognition
US6301561B1 (en) Automatic speech recognition using multi-dimensional curve-linear representations
JP6031316B2 (en) Speech recognition apparatus, error correction model learning method, and program
Metze Articulatory features for conversational speech recognition
EP3309778A1 (en) Method for real-time keyword spotting for speech analytics
JPH09114482A (en) Speaker adaptation method for voice recognition
JP3633254B2 (en) Voice recognition system and recording medium recording the program
JP2003345388A (en) Method, device, and program for voice recognition
JP3277522B2 (en) Voice recognition method
JP3575904B2 (en) Continuous speech recognition method and standard pattern training method
Amdal Learning pronunciation variation: A data-driven approach to rule-based lexicon adaptation for automatic speech recognition
JPH09160586A (en) Learning method for hidden markov model
JPH08211893A (en) Speech recognition device