JP2976795B2

JP2976795B2 - Speaker adaptation method

Info

Publication number: JP2976795B2
Application number: JP6020734A
Authority: JP
Inventors: Koichi Shinoda (篠田 浩一)
Original assignee: Nippon Electric Co Ltd
Current assignee: NEC Corp
Priority date: 1994-02-18
Filing date: 1994-02-18
Publication date: 1999-11-10
Anticipated expiration: 2014-11-10
Also published as: JPH07230295A

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Field of Industrial Application] The present invention relates to a speaker adaptation method for quickly adapting a speech recognition device to its user, and in particular to an unsupervised speaker adaptation method for speech recognition systems that use mixture continuous-density HMMs.

[0002]

[Prior Art] In recent years, machine recognition of speech patterns has been studied extensively and many techniques have been proposed. Two representative and widely used techniques are DP matching (dynamic programming matching) and the hidden Markov model (HMM). Speaker-independent recognition systems built on such techniques, which aim to recognize anyone's voice, are being actively researched and developed.

[0003] A speaker-independent system has the advantage that, unlike a speaker-dependent system tied to a particular user, the user does not need to register utterances in advance. However, the following problems have recently been pointed out. First, for most speakers the recognition performance is lower than that of a speaker-dependent system. Second, there are speakers (outlier speakers) for whom recognition performance is markedly poor. To address these problems, research has recently begun on applying speaker adaptation techniques, previously used in speaker-dependent systems, to speaker-independent systems as well.

[0004] Speaker adaptation refers to adapting a recognition system to a new user (an unknown speaker) using a smaller amount of adaptation data than is used for training. Speaker adaptation methods are reviewed in Sadaoki Furui, "Speaker Adaptation Techniques in Speech Recognition," Journal of the Institute of Television Engineers of Japan, Vol. 43, No. 9, 1989, pp. 929-934.

[0005] Speaker adaptation falls broadly into two approaches: supervised speaker adaptation and unsupervised speaker adaptation. Here the "teacher" (the supervision) is the phoneme transcription that represents the content of the input utterance. Supervised adaptation applies when the phoneme transcription of the input utterance is known; during adaptation, the unknown speaker must be told in advance which vocabulary to utter.

[0006] Unsupervised adaptation, by contrast, applies when the phoneme transcription of the input utterance is unknown. It places no restriction on what the unknown speaker says; that is, the speaker does not need to be instructed what to utter. Because adaptation can be carried out on the speech that is input during actual use of the recognizer, without the speaker being aware of it, the method is convenient for the user.

[0007] In general, unsupervised adaptation yields lower recognition performance after adaptation than supervised adaptation, so supervised adaptation is the more commonly used approach at present.

[0008] A speech recognition device using conventional supervised adaptation is described below with reference to FIG. 6.

[0009] The speaker's utterance input to the speech recognition device 6-1 is passed to the input pattern creation unit 6-2, where, through A/D conversion, speech analysis, and similar steps, it is converted into a time series of feature vectors, one for each fixed-length unit called a frame. This time series of feature vectors is called the input pattern here. The frame length is typically about 10 ms to 100 ms. Each feature vector captures features of the speech spectrum at that time and typically has 10 to 100 dimensions.
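As a rough illustration of the framing step described above, the following sketch cuts a sampled signal into fixed-length frames. The function name, the 20 ms frame length, and the 10 ms frame shift are assumptions chosen for the example; the text only states that frames are typically about 10 ms to 100 ms long, and the spectral analysis that turns each frame into a feature vector is not shown.

```python
def split_into_frames(samples, sample_rate_hz, frame_ms=20.0, hop_ms=10.0):
    """Split a list of samples into overlapping fixed-length frames.

    frame_ms and hop_ms are illustrative defaults, not values from the patent.
    """
    frame_len = int(round(sample_rate_hz * frame_ms / 1000.0))
    hop_len = int(round(sample_rate_hz * hop_ms / 1000.0))
    if frame_len <= 0 or hop_len <= 0:
        raise ValueError("frame and hop lengths must be positive")
    frames = []
    start = 0
    # Keep only complete frames; a trailing partial frame is dropped.
    while start + frame_len <= len(samples):
        frames.append(samples[start:start + frame_len])
        start += hop_len
    return frames
```

Each returned frame would then be passed to a spectral analysis routine to obtain one feature vector per frame.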

[0010] The standard pattern storage unit 6-6 stores HMMs (hidden Markov models). An HMM is one model of the speech source, and its parameters can be trained from a speaker's speech. HMMs are described in detail in the explanation of the recognition unit 6-3.

[0011] An HMM is usually prepared for each recognition unit. Here the phoneme is taken as the recognition unit. The HMMs in the standard pattern storage unit 6-6 may be, for example, other-speaker HMMs trained on another user's utterances, or speaker-independent HMMs trained beforehand on the utterances of many speakers.

[0012] Suppose that 1000 words are to be recognized, that is, that the one correct word is to be chosen from 1000 candidate words. To recognize words, the phoneme HMMs are concatenated to build an HMM for each candidate word; for 1000-word recognition, 1000 word HMMs are built. The figure of 1000 words is only an example, and any number of words may be used. The recognition target may also be, for example, continuous syllables. Continuous-syllable recognition uses an HMM that represents, as a network, the concatenation of all syllables that occur in Japanese (or, when another language is recognized, in that language), and it can in effect recognize any utterance in Japanese. These operations are performed by the vocabulary pattern creation unit 6-5.

[0013] The recognition unit 6-3 recognizes the input pattern using the word HMMs created by the vocabulary pattern creation unit 6-5. An HMM is a model of the speech source in which statistical ideas are introduced into the description of the standard pattern in order to cope with the variability of speech patterns. HMMs are described in detail in Seiichi Nakagawa, "Speech Recognition by Probabilistic Models," edited by the Institute of Electronics, Information and Communication Engineers (1988) (hereinafter Reference 1), pp. 40-46, 55-60, and 69-74. Each phoneme HMM usually consists of 1 to 10 states and the transitions between them. A start state and an end state are normally defined; at each unit of time, a symbol is output from the current state and a state transition takes place. The speech of each phoneme is represented as the time series of symbols output by the HMM during the transitions from the start state to the end state. A symbol output probability is defined for each state, and a transition probability is defined for each transition between states. The transition probability parameters express the temporal variability of the speech pattern, and the output probability parameters express the variability of its timbre. By setting the probability of the start state to some value and multiplying by the output probability and the transition probability at each state transition, the probability that an utterance is generated by the model can be obtained. Conversely, when an utterance is observed, its generation probability can be computed under the assumption that it was generated by a given HMM.

[0014] In HMM-based speech recognition, an HMM is prepared for each recognition candidate. When an utterance is input, its generation probability is computed under each HMM, the HMM that gives the maximum probability is taken to be the source, and the recognition candidate corresponding to that HMM is output as the recognition result.
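The decision rule described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: it works in the log domain for numerical stability, assumes every word HMM starts in its first state and ends in its last state, and takes the per-frame output log-probabilities as a precomputed table. The names `forward_log_likelihood` and `recognize` are hypothetical.

```python
import math

NEG_INF = float("-inf")


def log_sum_exp(values):
    """Return log(sum(exp(v))) without overflow; -inf if all terms are -inf."""
    peak = max(values)
    if peak == NEG_INF:
        return NEG_INF
    return peak + math.log(sum(math.exp(v - peak) for v in values))


def forward_log_likelihood(log_trans, log_emit):
    """Log-probability that the HMM generated the input pattern.

    log_trans[j][i]: log transition probability from state j to state i.
    log_emit[t][i]:  log output probability of frame t in state i.
    Assumes the path starts in state 0 and ends in the last state.
    """
    n_states = len(log_trans)
    alpha = [NEG_INF] * n_states
    alpha[0] = log_emit[0][0]
    for t in range(1, len(log_emit)):
        alpha = [
            log_sum_exp([alpha[j] + log_trans[j][i] for j in range(n_states)])
            + log_emit[t][i]
            for i in range(n_states)
        ]
    return alpha[-1]


def recognize(candidates):
    """candidates maps each word to (log_trans, log_emit); return the best word."""
    return max(candidates, key=lambda w: forward_log_likelihood(*candidates[w]))
```

In practice the `log_emit` table for each word would be computed from the input pattern and that word's output distributions.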

[0015] Output probability parameters may be represented either as discrete probability distributions or as continuous probability distributions; the continuous representation is taken as the example here. The continuous representation uses a mixture continuous distribution, that is, a weighted sum of several Gaussian distributions. The output probability parameters, the transition probability parameters, the weights of the Gaussian components, and the other parameters are trained in advance with the Baum-Welch algorithm, using training speech corresponding to each model. The Baum-Welch algorithm is described in detail in Reference 1. In the examples below the output probability is a mixture continuous distribution.
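To make the "weighted sum of Gaussian distributions" concrete, here is a small sketch that evaluates the log output probability of one feature vector under a Gaussian mixture. Diagonal covariance matrices are assumed purely to keep the example short; the text itself speaks of general covariance matrices. The function names are hypothetical.

```python
import math


def log_gaussian_diag(x, mean, var):
    """Log density of a Gaussian with diagonal covariance (assumed for brevity)."""
    return -0.5 * sum(
        math.log(2.0 * math.pi * v) + (xi - mi) ** 2 / v
        for xi, mi, v in zip(x, mean, var)
    )


def mixture_log_prob(x, weights, means, variances):
    """Log of the weighted sum of component Gaussian densities at x."""
    logs = [
        math.log(w) + log_gaussian_diag(x, m, v)
        for w, m, v in zip(weights, means, variances)
        if w > 0.0  # zero-weight components contribute nothing
    ]
    peak = max(logs)
    return peak + math.log(sum(math.exp(value - peak) for value in logs))
```

Evaluating this for every state and every frame yields the table of output log-probabilities used in likelihood computation.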

[0016] Word recognition is described below using equations. Let the input pattern $X$, expressed as a time series of feature vectors, be $X = x_1, x_2, x_3, \ldots, x_t, \ldots, x_T$ (1), where $T$ is the total number of frames in the input pattern. Let the recognition candidate words be $W_1, W_2, \ldots, W_N$, where $N$ is the number of candidate words. Matching between the word HMM of each word $W_n$ and the input pattern $X$ is performed as follows. The subscript $n$ is omitted below unless it is needed.

[0017] First, in the word HMM, let $a_{ji}$ be the transition probability from state $j$ to state $i$, $\lambda_{im}$ the mixture weights of the output probability distribution, $\mu_{im}$ the mean vector of each component Gaussian distribution (called a frame distribution), and $\Sigma_{im}$ its covariance matrix. Here $t$ denotes the input time, $i$ and $j$ denote HMM states, and $m$ denotes the mixture component index. The following recurrence is computed for the forward probability $\alpha(i,t)$.

[0018]

[0019] where

[0020]

[0021]

[0022] The likelihood of the input pattern for the word $W_n$ is obtained by

[0023]

[0024] where $I$ is the final state. This processing is carried out for each word model, and the recognition result word for the input pattern $X$,

[0025]

[0026] is given by

[0027]

[0028] The recognition result word is sent to the recognition result output unit.
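The equations of paragraphs [0018] to [0027] are not reproduced in this text. A reconstruction consistent with the symbols defined in [0016] and [0017] is sketched below. The equation numbers are inferred from later references to equations (2), (5), (7), and (9); the symbols $b_i(x_t)$, $N(\cdot)$, $D$ (feature dimension), and $\hat{n}$ are assumptions introduced for the sketch, and the original may differ in detail.

```latex
\begin{align}
\alpha(i,t) &= \sum_{j} \alpha(j,t-1)\, a_{ji}\, b_i(x_t) \tag{2} \\
b_i(x_t) &= \sum_{m} \lambda_{im}\, N(x_t;\, \mu_{im}, \Sigma_{im}) \tag{3} \\
N(x;\, \mu, \Sigma) &= (2\pi)^{-D/2}\, |\Sigma|^{-1/2}
  \exp\!\left( -\tfrac{1}{2} (x-\mu)^{\top} \Sigma^{-1} (x-\mu) \right) \tag{4} \\
P(X \mid W_n) &= \alpha(I,T) \tag{5} \\
\hat{n} &= \operatorname*{arg\,max}_{n}\; P(X \mid W_n) \tag{6}
\end{align}
```

Under this reading, the recognition result word of [0025] is $W_{\hat{n}}$.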

[0029] The recognition result output unit 6-4 performs processing such as displaying the recognition result on a screen or sending a control command corresponding to the recognition result to another device.

[0030] The speech recognition device has been described above taking HMMs as the example.

[0031] Next, supervised speaker adaptation for this speech recognition device is described. In supervised speaker adaptation, the user is told in advance which words or sentences to utter, and the HMM parameters are updated using the word transcription and the input speech. It is called supervised adaptation because the correct word for each utterance is known in advance. Supervised speaker adaptation methods include those described in Japanese Patent Application No. Hei 2-203437, "Standard Pattern Adaptation Method," and Japanese Patent Application No. Hei 4-203669, "Speech Recognition Device." The method based on Japanese Patent Application No. Hei 2-203437 is briefly described here.

[0032] In supervised speaker adaptation, the amount of input speech needed for adaptation should be kept as small as possible in order to reduce the burden on the speaker. However, an HMM generally has many parameters, and if all of them are adapted from a small number of adaptation utterances, the parameter estimates may be poor for lack of data and recognition performance may not improve. The supervised speaker adaptation described here therefore adapts only the mean vectors of the output probability distributions. The mean vectors are chosen because, among the HMM parameters, they are considered to have the greatest effect on recognition performance.

[0033] For simplicity, the case in which the output probability distribution is a single Gaussian is described first, and the Gaussian mixture case is described afterwards.

[0034] Supervised adaptation is divided into the following two stages. The first stage is described first.

[0035] First, adaptation initial HMMs are prepared in advance in the adaptation initial standard pattern storage unit 6-11. As the adaptation initial HMMs, for example, speaker-independent phoneme HMMs built beforehand from the utterances of many speakers are used; they may be the same as or different from the phoneme HMMs stored in the standard pattern storage unit 6-6. In the adaptation unit 6-9, for each state of each phoneme HMM, a buffer B1(i) with the same dimension as the feature vector and a one-dimensional buffer B2(i) for counting feature vectors are prepared. The following processing is then performed for each input utterance.

[0036] First, as in recognition, the input pattern creation unit 6-8 creates an input pattern from the input speech. As noted above, in supervised adaptation the correct word is known in advance, so the adaptation dictionary creation unit 6-7 creates an adaptation dictionary from the input correct-word transcription and the created input pattern. Next, the vocabulary pattern creation unit 6-10 creates the word HMM corresponding to the input pattern from the phoneme sequence in the adaptation dictionary and the adaptation initial HMM of each phoneme. The adaptation unit 6-9 then computes the likelihood between the input pattern and the adaptation word HMM. Here, instead of equations (2) and (5), the recurrences

[0037]

[0038]

[0039] are used. This is called the Viterbi algorithm. In parallel with equation (7),

[0040]

[0041] is computed, and for each state at each time the state at the preceding time is stored in the array $\Psi$. After the computation of equation (9) for the final frame $T$ is finished, $\Psi$ is used to find the state corresponding to each frame, in order from the final frame back to the first frame. That is, denoting the state corresponding to frame $t$ by $S(t)$,

[0042]

[0043]

[0044] This process is called backtracking. It yields the state corresponding to the feature vector at each time. Next, for the feature vector $x_t$ at each time, the operations

[0045]

[0046]

[0047] are carried out, adding to the buffers B1 and B2. This processing is repeated once for each word uttered for adaptation.
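The equations of paragraphs [0037] to [0046] are not reproduced in this text. The following reconstruction is a sketch based on the surrounding description: a max-based recurrence replacing equations (2) and (5), a back-pointer array $\Psi$, backtracking, and buffer accumulation. Only the numbers (7) and (9) are confirmed by the text; the remaining numbering and the symbol $b_i(x_t)$ are assumptions.

```latex
\begin{align}
\alpha(i,t) &= \max_{j}\; \alpha(j,t-1)\, a_{ji}\, b_i(x_t) \tag{7} \\
P(X \mid W) &= \alpha(I,T) \tag{8} \\
\Psi(i,t) &= \operatorname*{arg\,max}_{j}\; \alpha(j,t-1)\, a_{ji} \tag{9} \\
S(T) &= I \tag{10} \\
S(t) &= \Psi\bigl(S(t+1),\, t+1\bigr), \qquad t = T-1, \ldots, 1 \tag{11} \\
B1\bigl(S(t)\bigr) &\leftarrow B1\bigl(S(t)\bigr) + x_t \tag{12} \\
B2\bigl(S(t)\bigr) &\leftarrow B2\bigl(S(t)\bigr) + 1 \tag{13}
\end{align}
```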

[0048] After the above alignment has been completed for all input utterances, the feature vectors of the frames aligned to each state $i$ of each phoneme HMM are averaged over all input patterns. The adapted mean vector of that state, denoted

[0049]

[0050] is computed as

[0051]

[0052] as above.
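The first stage (Viterbi alignment, accumulation into B1 and B2, and averaging) can be sketched as follows. This is an illustrative reading, not the patent's code: the adapted mean is taken to be B1(i) divided by B2(i), states with no aligned frames keep their initial mean, and the word HMM's states are assumed to index the buffers directly (in the patent the buffers belong to the phoneme HMM states shared across words). All function names are hypothetical.

```python
NEG_INF = float("-inf")


def viterbi_align(log_trans, log_emit):
    """Return the most likely state for each frame (start: state 0, end: last state)."""
    n_states = len(log_trans)
    score = [NEG_INF] * n_states
    score[0] = log_emit[0][0]
    back = []  # back[t - 1][i] = best predecessor of state i at frame t
    for t in range(1, len(log_emit)):
        new_score, pointers = [], []
        for i in range(n_states):
            best_j = max(range(n_states), key=lambda j: score[j] + log_trans[j][i])
            pointers.append(best_j)
            new_score.append(score[best_j] + log_trans[best_j][i] + log_emit[t][i])
        score = new_score
        back.append(pointers)
    # Backtracking from the final state at the final frame.
    path = [n_states - 1]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


def accumulate(path, frames, sums, counts):
    """Add each frame's feature vector to B1 (sums) and increment B2 (counts)."""
    for state, x in zip(path, frames):
        sums[state] = [s + v for s, v in zip(sums[state], x)]
        counts[state] += 1


def adapted_means(sums, counts, initial_means):
    """Average the accumulated vectors; unseen states keep their initial mean."""
    return [
        [s / c for s in total] if c > 0 else list(mu)
        for total, c, mu in zip(sums, counts, initial_means)
    ]
```

`accumulate` would be called once per adaptation utterance, and `adapted_means` once after all utterances have been processed.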

[0053] In the second stage, the HMMs corresponding to phonemes that are not contained in the adaptation utterances are adapted by a technique called spectral interpolation. In spectral interpolation, the mean vectors of phonemes that did not occur in the adaptation utterances are estimated from the differences between the pre-adaptation and post-adaptation mean vectors of the phonemes that did occur.

[0054] Let set A be the set of mean vectors of the HMM states contained in the adaptation utterances, and let set B be the set of mean vectors of the HMM states not contained in them. First, the adaptation vector $\Delta^A$ is computed for every state in set A. The adaptation vector is defined as the difference between the adapted mean vector $\tau^A$ and the pre-adaptation mean vector $\mu^A$. Next, the adaptation vectors of the states in set B are obtained by interpolating the adaptation vectors of the states in set A. The algorithm is as follows.

1. For a state $j$ in set A, the adapted mean vector $\tau_j^A$ has already been obtained. The adaptation vector $\Delta_j^A$ is given by the following equation.

[0055]

[0056] Here the superscript $A$ indicates that state $j$ belongs to set A. The adaptation vector $\Delta_j^A$ is computed for all states in set A.

2. For a state $i$ in set B, the adaptation vector $\Delta_i^B$ is obtained by interpolating the adaptation vectors of the states $j$ in set A.

[0057]

[0058] The weight $w_i^j$ applied to the adaptation vector $\Delta_j^A$ is defined as a function of the distance $d_i^j$ between $\mu_i^B$ and $\mu_j^A$. For example, $w_i^j$ is defined as follows.

[0059]

[0060] Here $m$ is a constant expressing how strongly the weight $w_i^j$ depends on the distance $d_i^j$. The adaptation vector $\Delta_i^B$ is computed for all states belonging to set B.

3. The mean vector $\tau_i^B$ of state $i$ for the new speaker is given by the following equation.

[0061]

[0062] Here $\mu_i^B$ is the mean vector of the adaptation initial HMM.

4. Steps 2 and 3 are repeated for all states in set B.

The above procedure can also be applied to HMMs whose output probability distributions are Gaussian mixtures, by treating the component distributions within a state separately. In the backtracking of the first stage, among the component distributions in the state, the one that maximizes the product of the output probability of the corresponding feature vector and the weight coefficient is selected and classified into set A. Component distributions with no corresponding adaptation data are classified into set B. The spectral interpolation of the second stage is performed on the component distributions in set B. That is, the adaptation vectors of the component distributions in set B are obtained by spectral interpolation using the adaptation vectors of the component distributions in set A across all states.
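The spectral interpolation steps above can be sketched as follows. Because the equations of [0055] to [0061] are not reproduced in this text, the forms used here are assumptions: the adaptation vector of a seen state is the adapted mean minus the initial mean, the weight is the Euclidean distance raised to the power of minus m, the interpolated adaptation vector is the weight-normalized sum of the seen adaptation vectors, and the new mean is the initial mean plus the interpolated adaptation vector. The function name and the default m = 2 are hypothetical.

```python
import math


def spectral_interpolation(initial_means, adapted, m=2.0):
    """Estimate adapted means for states without adaptation data.

    initial_means: list of mean vectors of the adaptation initial HMM.
    adapted: dict mapping state index -> adapted mean vector (set A).
    States missing from `adapted` form set B. The weight d ** (-m) is an
    assumed form; the patent's exact weight equation is not reproduced.
    """
    deltas = {
        j: [t - u for t, u in zip(tau, initial_means[j])]
        for j, tau in adapted.items()
    }
    result = []
    for i, mu_i in enumerate(initial_means):
        if i in adapted:
            result.append(list(adapted[i]))
            continue
        if not deltas:  # nothing to interpolate from
            result.append(list(mu_i))
            continue
        distances = {j: math.dist(mu_i, initial_means[j]) for j in deltas}
        coincident = [j for j, d in distances.items() if d == 0.0]
        if coincident:
            # Identical initial means: reuse that state's adaptation vector.
            delta_i = deltas[coincident[0]]
        else:
            weights = {j: d ** (-m) for j, d in distances.items()}
            total = sum(weights.values())
            delta_i = [
                sum(weights[j] * deltas[j][k] for j in deltas) / total
                for k in range(len(mu_i))
            ]
        result.append([u + d for u, d in zip(mu_i, delta_i)])
    return result
```

A larger m makes the nearest seen states dominate the interpolation, while m close to zero approaches a plain average of all adaptation vectors. For Gaussian mixture HMMs the same function could be applied with one entry per component distribution instead of one per state.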

[0063] Only adaptation of the mean vectors has been shown here, but the variances, weights, transition probabilities, and other parameters can readily be adapted in the same way. Several of these parameters can also be adapted at the same time.

[0064] The adapted HMMs are stored in the standard pattern storage unit 6-6 in place of the previous HMMs. HMMs may instead be stored separately for each speaker, but in that case, as preprocessing for recognition, a means is needed either for the user to select an HMM or for an HMM to be selected automatically from the user's utterance.

[0065] Supervised adaptation of HMMs with the phoneme as the recognition unit has been described so far, but adaptation is equally straightforward when words or sentences are the recognition unit, by creating an HMM corresponding to each word or sentence. When the recognition unit and the unit of the input utterance are the same (for example, word HMMs and word utterances), the adaptation initial HMMs do not need to be concatenated, and likelihood computation and adaptation can be performed directly.

[0066] Conventional supervised adaptation has been briefly described above.

[0067]

[Problem to Be Solved by the Invention] The conventional supervised adaptation method described above gives higher performance than unsupervised adaptation. However, it has the drawback of placing a heavy burden on the user, who must, separately from the utterances made during actual use, utter the words specified by the device as training.

[0068] The object of the present invention is to provide, for speaker adaptation in speech recognition systems, an unsupervised adaptation technique that operates without the user being aware of it and whose performance is comparable to that of supervised adaptation.

[0069]

[Means for Solving the Problem] The first invention is a speaker adaptation method that adapts the standard patterns used for speech recognition by using input speech patterns, characterized by comprising: a standard pattern storage unit that holds word standard patterns, created according to a predetermined criterion, with each recognition candidate word as the unit; an input pattern creation unit that performs speech analysis on input speech and creates an input pattern; a recognition unit that recognizes a word from the created input pattern using the word standard patterns in the standard pattern storage unit; a recognition result output unit that outputs the recognition result of the recognition unit; an adaptation standard pattern selection unit that prepares an adaptation initial word standard pattern by a predetermined method with reference to the transcription of the recognized word; a supervised adaptation unit that computes the likelihood of the input pattern based on the adaptation initial word standard pattern and updates the word standard patterns in the standard pattern storage unit with the adapted word standard pattern obtained from the likelihood computation result; an input pattern storage unit that stores the input pattern; and an iteration control unit that, taking the input pattern in the input pattern storage unit as input, causes the operations of the recognition unit, the recognition result output unit, the adaptation standard pattern selection unit, and the supervised adaptation unit to be repeated until a predetermined variable reaches a predetermined reference value.
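The control flow of the first invention, in which the recognition result is used in place of a known transcription and the recognize-then-adapt cycle is repeated over the stored input patterns, might be sketched as below. This is a toy illustration only: the "predetermined variable" is assumed to be a pass counter, the selection of the adaptation initial pattern is folded into the `adapt` callable, and each word model is reduced to a single scalar mean. None of the names come from the patent.

```python
def adapt_unsupervised(models, stored_patterns, recognize, adapt, max_iterations=3):
    """Repeat recognition followed by supervised adaptation over stored patterns."""
    iteration = 0  # the "predetermined variable" (assumed here: a pass counter)
    while iteration < max_iterations:  # the "reference value" (assumed: a pass limit)
        for pattern in stored_patterns:
            label = recognize(models, pattern)      # recognition result acts as the teacher
            models = adapt(models, label, pattern)  # supervised adaptation step
        iteration += 1
    return models


# Toy stand-ins: each "word model" is a single scalar mean.
def nearest_word(models, x):
    """Recognize x as the word whose mean is closest."""
    return min(models, key=lambda w: abs(models[w] - x))


def shift_mean(models, word, x, rate=0.5):
    """Move the recognized word's mean part of the way toward the input."""
    updated = dict(models)
    updated[word] = models[word] + rate * (x - models[word])
    return updated
```

For example, `adapt_unsupervised({"a": 0.0, "b": 10.0}, [2.0, 8.0], nearest_word, shift_mean, max_iterations=2)` moves each mean toward the inputs recognized as that word. Because the labels come from the recognizer itself, a misrecognized input pulls the wrong model; repeating the cycle with the updated models gives later passes a chance to correct such labels.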

[0070] The second invention is a speaker adaptation method that adapts the standard patterns used for speech recognition by using input speech patterns, characterized by comprising: a standard pattern storage unit that holds sub-word standard patterns, created according to a predetermined criterion, with sub-words such as syllables or phonemes as the unit; a vocabulary pattern creation unit that creates word standard patterns corresponding to the recognition candidate words using the sub-word standard patterns; an input pattern creation unit that performs speech analysis on input speech and creates an input pattern; a recognition unit that recognizes a word from the created input pattern using the word standard patterns created by the vocabulary pattern creation unit; a recognition result output unit that outputs the recognition result of the recognition unit; an adaptation standard pattern selection unit that prepares an adaptation initial word standard pattern by a predetermined method with reference to the transcription of the recognized word; a supervised adaptation unit that computes the likelihood of the input pattern based on the adaptation initial word standard pattern and updates the sub-word standard patterns in the standard pattern storage unit with the adapted sub-word standard patterns obtained from the likelihood computation result; an input pattern storage unit that stores the input pattern; and an iteration control unit that, taking the input pattern in the input pattern storage unit as input, causes the operations of the recognition unit, the recognition result output unit, the adaptation standard pattern selection unit, and the supervised adaptation unit to be repeated until a predetermined variable reaches a predetermined reference value.

[0071] The third invention is a speaker adaptation method that adapts the standard patterns used for speech recognition by using input speech patterns, characterized by comprising: a standard pattern storage unit that holds sub-word standard patterns, created according to a predetermined criterion, with sub-words such as syllables or phonemes as the unit; a vocabulary pattern creation unit that creates word standard patterns corresponding to the recognition candidate words using the sub-word standard patterns; an input pattern creation unit that performs speech analysis on input speech and creates an input pattern; an input pattern storage unit that stores the created input pattern; a recognition unit that recognizes a word using the input pattern in the input pattern storage unit and the word standard patterns created by the vocabulary pattern creation unit; a recognition result output unit that outputs the recognition result of the recognition unit; a supervised adaptation unit that computes the likelihood of the input pattern based on the word standard pattern in the vocabulary pattern creation unit that corresponds to the recognized word, and updates the sub-word standard patterns in the standard pattern storage unit with the adapted sub-word standard patterns obtained from the likelihood computation result; and an iteration control unit that causes the operations of the recognition unit, the recognition result output unit, and the supervised adaptation unit to be repeated until a predetermined variable reaches a predetermined reference value.

【００７２】第４の発明は、音声認識に用いる標準パタ
ーンを入力音声パターンを用いて適応化する話者適応化
方式において、音節，音素，などのサブワードを単位と
する予め定められた基準により作成されたサブワード標
準パターンを保持する標準パターン記憶部と、前記サブ
ワード標準パターンを用いて認識候補単語に対応する単
語標準パターンを作成する語彙パターン作成部と、予め
定められた方法により作成された前記認識候補単語に対
応する基本標準パターンを記憶する基本標準パターン記
憶部と、入力音声に対し音声分析を行ない入力パターン
を作成する入力パターン作成部と、作成された前記入力
パターンを記憶する入力パターン記憶部と、前記入力パ
ターン記憶部における前記入力パターンと前記語彙パタ
ーン作成部により作成された前記単語標準パターンを用
いて単語の認識をする認識部と、前記認識部における認
識結果を出力する認識結果出力部と、前記認識単語に相
当する前記基本標準パターン記憶部における前記基本標
準パターンに基く前記入力パターンの尤度計算を行ない
前記尤度計算結果により求められた適応化後サブワード
標準パターンにより前記標準パターン記憶部における前
記サブワード標準パターンを更新する教師あり適応化部
と前記認識部と前記認識結果出力部および前記教師あり
適応化部の動作を予め決めれらた変数が予め定められた
基準値に達するまで繰り返させる繰り返し制御部を備え
たことを特徴とする。 According to a fourth aspect of the present invention, a speaker adaptation method for adapting a standard pattern used for speech recognition using an input speech pattern comprises: a standard pattern storage unit that holds subword standard patterns created according to a predetermined criterion in units of subwords such as syllables and phonemes; a vocabulary pattern creation unit that creates word standard patterns corresponding to recognition candidate words using the subword standard patterns; a basic standard pattern storage unit that stores basic standard patterns corresponding to the recognition candidate words, created by a predetermined method; an input pattern creation unit that performs speech analysis on input speech to create an input pattern; an input pattern storage unit that stores the created input pattern; a recognition unit that recognizes words using the input pattern in the input pattern storage unit and the word standard patterns created by the vocabulary pattern creation unit; a recognition result output unit that outputs the recognition result of the recognition unit; a supervised adaptation unit that calculates the likelihood of the input pattern based on the basic standard pattern in the basic standard pattern storage unit corresponding to the recognized word and updates the subword standard patterns in the standard pattern storage unit with the adapted subword standard patterns obtained from the likelihood calculation result; and a repetition control unit that causes the recognition unit, the recognition result output unit, and the supervised adaptation unit to repeat their operation until a predetermined variable reaches a predetermined reference value.

【００７３】[0073]

【００７４】[0074]

【実施例】次に、本発明について図面を参照して説明す
る。Next, the present invention will be described with reference to the drawings.

【００７５】説明の前提として、後述する図１〜図５に
示す標準パターン記憶部１０１〜５０１，入力パターン
作成部１０２〜５０２，認識部１０３〜５０３，認識結
果出力部１０４〜５０４，語彙パターン作成部２０７，
３０９，４０９，５１０は、それぞれ従来技術の説明の
項で説明した標準パターン記憶部６−６，入力パターン
作成部６−２，認識部６−３，認識結果出力部６−４，
語彙パターン作成部６−５と同様であるため、本実施例
では、簡単な説明に留める。 As a premise for the description, the standard pattern storage units 101 to 501, the input pattern creation units 102 to 502, the recognition units 103 to 503, the recognition result output units 104 to 504, and the vocabulary pattern creation units 207, 309, 409, and 510 shown in FIGS. 1 to 5 (described below) are the same as the standard pattern storage unit 6-6, the input pattern creation unit 6-2, the recognition unit 6-3, the recognition result output unit 6-4, and the vocabulary pattern creation unit 6-5 described in the prior-art section, respectively. They are therefore described only briefly in this embodiment.

【００７６】また、ここでの教師あり話者適応化では、
従来技術で説明した場合と同様、ＨＭＭのパラメータの
中でもっとも認識性能に影響がある、出力確率分布の平
均ベクトルのみを適応化するものとする。In the supervised speaker adaptation described here, as in the prior-art description, only the mean vectors of the output probability distributions are adapted, since among the HMM parameters they have the greatest effect on recognition performance.

【００７７】図１は、第１の話者学習方式の一実施例の
ブロック図である。入力発声、および、ＨＭＭの認識単
位は、単語とする。標準パターン記憶部１０１は各認識
候補単語の単語ＨＭＭを保持する。単語ＨＭＭは不特定
話者のＨＭＭ、あるいは、他の話者のＨＭＭである。入
力パターン作成部１０２は入力音声に対し、音声分析を
行ない入力パターンＸを作成する。作成された入力パタ
ーンＸは認識部１０３において標準パターン記憶部１０
１における単語ＨＭＭを用いて認識をされ、認識結果出
力部１０４から認識結果が出力される。FIG. 1 is a block diagram of an embodiment of the first speaker learning method. The input utterance and the HMM recognition unit are both words. The standard pattern storage unit 101 holds a word HMM for each recognition candidate word. The word HMM is a speaker-independent HMM or the HMM of another speaker. The input pattern creation unit 102 performs speech analysis on the input speech to create an input pattern X. The created input pattern X is recognized by the recognition unit 103 using the word HMMs in the standard pattern storage unit 101, and the recognition result is output from the recognition result output unit 104.

【００７８】適応化用標準パターン選択部１０５は、認
識結果単語の表記を参照して適応化初期単語ＨＭＭを用
意する。適応化初期単語ＨＭＭは多くの話者の発声で予
め学習された不特定話者の単語ＨＭＭ、あるいは、他の
話者の発声で学習された異話者の単語ＨＭＭであり、標
準パターン記憶部１０１の単語ＨＭＭでも良いし、それ
とは別のものでもよい。The adaptation standard pattern selection unit 105 prepares an adaptation initial word HMM by referring to the notation of the recognition-result word. The adaptation initial word HMM is either a speaker-independent word HMM trained in advance on the utterances of many speakers or a different-speaker word HMM trained on the utterances of another speaker. It may be the word HMM in the standard pattern storage unit 101 or a different one.

【００７９】教師あり適応化部１０６では、入力パター
ンＸおよび適応化初期単語ＨＭＭを用いた尤度計算を、
１つまたは複数の入力パターンについて行なったのち、
適応化後の平均ベクトルを計算し適応化後ＨＭＭを求め
る。教師あり適応化部の詳しい動作については従来の技
術の説明における適応化部６−９を参照されたい。教師
あり適応化部１０６より出力された適応化後ＨＭＭは、
標準パターン記憶部１０１に出力され、今までの認識Ｈ
ＭＭのかわりに記憶される。The supervised adaptation unit 106 performs the likelihood calculation using the input pattern X and the adaptation initial word HMM for one or more input patterns, then computes the adapted mean vectors to obtain the adapted HMM. For the detailed operation of the supervised adaptation unit, see the adaptation unit 6-9 in the description of the prior art. The adapted HMM output from the supervised adaptation unit 106 is sent to the standard pattern storage unit 101 and stored there in place of the recognition HMM used until then.

【００８０】図２は、第２の話者学習方式の一実施例の
ブロック図である。入力発声は単語であるとする。標準
パターン記憶部２０１は各音素のＨＭＭを保持する。語
彙パターン作成部２０７は各音素のＨＭＭを用いて認識
候補単語に対応する単語ＨＭＭを作成する。入力パター
ン作成部２０２は入力音声に対し、音声分析を行ない入
力パターンＸを作成する。作成された入力パターンは認
識部２０３において認識候補単語の単語ＨＭＭを用いて
認識をされ、認識結果出力部２０４から認識結果が出力
される。適応化用辞書作成部２０５は、認識結果表記か
ら適応化用辞書を作成する。FIG. 2 is a block diagram of an embodiment of the second speaker learning method. The input utterance is assumed to be a word. The standard pattern storage unit 201 holds an HMM for each phoneme. The vocabulary pattern creation unit 207 uses the phoneme HMMs to create the word HMM corresponding to each recognition candidate word. The input pattern creation unit 202 performs speech analysis on the input speech to create an input pattern X. The created input pattern is recognized by the recognition unit 203 using the word HMMs of the recognition candidate words, and the recognition result is output from the recognition result output unit 204. The adaptation dictionary creation unit 205 creates an adaptation dictionary from the notation of the recognition result.

【００８１】教師あり適応化部２０６では、まず、適応
化用辞書を用いて適応化初期音素ＨＭＭを連結して適応
化初期単語ＨＭＭを作成する。適応化初期音素ＨＭＭ
は、標準パターン記憶部２０１にある音素ＨＭＭでも良
いし、別の音素ＨＭＭでも良い。次に、作成された適応
化初期単語ＨＭＭと入力パターンを用いて尤度計算を、
１つまたは複数の入力パターンについて行なったのち、
適応化後の平均ベクトルを計算し適応化後ＨＭＭを求め
る。適応化されたＨＭＭは、標準パターン記憶部２０１
に出力され、今までの認識ＨＭＭのかわりに記憶され
る。The supervised adaptation unit 206 first uses the adaptation dictionary to concatenate adaptation initial phoneme HMMs into an adaptation initial word HMM. The adaptation initial phoneme HMMs may be the phoneme HMMs in the standard pattern storage unit 201 or different phoneme HMMs. Next, the likelihood calculation using the created adaptation initial word HMM and the input pattern is performed for one or more input patterns, after which the adapted mean vectors are computed to obtain the adapted HMM. The adapted HMM is output to the standard pattern storage unit 201 and stored there in place of the recognition HMM used until then.
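The concatenation step in paragraph [0081] can be illustrated with a minimal sketch. Everything named here (`PhonemeHMM`, `build_word_hmm`, the toy dictionary and mean values) is hypothetical and not taken from the patent; each HMM is reduced to a list of per-state mean vectors, since only the mean vectors are adapted in this description.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class PhonemeHMM:
    """Hypothetical container: one mean vector per HMM state."""
    name: str
    state_means: List[List[float]]


# One word-HMM state: (phoneme name, state index inside that phoneme, mean vector)
WordState = Tuple[str, int, List[float]]


def build_word_hmm(word: str,
                   dictionary: Dict[str, List[str]],
                   phoneme_hmms: Dict[str, PhonemeHMM]) -> List[WordState]:
    """Concatenate phoneme HMM states in the order given by the dictionary."""
    states: List[WordState] = []
    for phoneme in dictionary[word]:
        hmm = phoneme_hmms[phoneme]
        for idx, mean in enumerate(hmm.state_means):
            # Keep (phoneme, idx) so an adapted mean can later be written
            # back to the shared phoneme inventory.
            states.append((phoneme, idx, mean))
    return states


phoneme_hmms = {
    "a": PhonemeHMM("a", [[0.0, 0.0], [1.0, 1.0]]),
    "k": PhonemeHMM("k", [[5.0, 5.0]]),
}
dictionary = {"aka": ["a", "k", "a"]}  # toy adaptation dictionary
word_hmm = build_word_hmm("aka", dictionary, phoneme_hmms)
print(len(word_hmm))  # number of states in the concatenated word HMM
```

Because both occurrences of "a" refer to the same phoneme HMM, an update to that phoneme's means affects every word that contains it, which is why adapting in subword units lets a small number of recognized words influence the whole vocabulary.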

【００８２】図３は、本願の請求項２の発明に係る第３
の話者学習方式の一実施例のブロック図である。図２に
示す第２の実施例と異なる点は、繰り返し制御部３０８
が制御することにより適応化が繰り返し行なわれる点で
ある。教師あり適応化部３０６の適応化により作成され
た適応化後ＨＭＭは、認識に用いた認識ＨＭＭに比べ、
使用者の発声に対し一般により高い認識性能を示す。し
たがって、この適応化後ＨＭＭを用いて、もう一回入力
パターンを認識すれば、さらに良好な認識率を示すと考
えられる。そして、その認識結果を用いて作成した適応
化用辞書を用いて適応化すればさらに認識性能の高い適
応化後ＨＭＭが作成される可能性がある。認識・適応化
の繰り返しの際には、繰り返し毎に入力パターンを作成
する計算を省くために、最初の適応化の際に入力パター
ンを入力パターン記憶部３０７に記憶しておき、２回目
以降の繰り返しにおいては、入力パターンは、入力パタ
ーン記憶部３０７から出力されるものを用いる。繰り返
し回数は、予め決めておくか、あるいは、繰り返しごと
に認識部３０３における認識結果単語に対応する認識結
果尤度を記憶しておき、前回の繰り返しにおける尤度と
比較して尤度が飽和したかどうかを判定し、飽和したら
繰り返しをとめるなどの方法で決める。この繰り返し手
段３０８は、図１に示す第１の実施例に対しても同様に
適用でき、また、音素が認識単位の場合でも、単語や文
などの入力発声と同じ認識単位の場合でも、同様に適用
可能である。FIG. 3 is a block diagram of an embodiment of the third speaker learning method, according to the invention of claim 2 of the present application. It differs from the second embodiment shown in FIG. 2 in that adaptation is performed repeatedly under the control of the repetition control unit 308. The adapted HMM created by the supervised adaptation unit 306 generally shows higher recognition performance on the user's utterances than the recognition HMM that was used for recognition. Recognizing the input pattern once more with this adapted HMM is therefore expected to give a better recognition rate, and adapting with the adaptation dictionary created from that recognition result may yield an adapted HMM with still higher recognition performance. To avoid recomputing the input pattern at every iteration of recognition and adaptation, the input pattern is stored in the input pattern storage unit 307 during the first adaptation, and the second and subsequent iterations use the input pattern output from the input pattern storage unit 307. The number of iterations is either fixed in advance or determined by, for example, storing at each iteration the recognition-result likelihood corresponding to the recognition-result word in the recognition unit 303, comparing it with the likelihood from the previous iteration to judge whether the likelihood has saturated, and stopping the iteration once it has. The repetition means 308 can be applied in the same way to the first embodiment shown in FIG. 1, and it applies equally whether the recognition unit is the phoneme or is the same unit as the input utterance, such as a word or a sentence.
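The recognition and adaptation loop with a likelihood-saturation stop can be sketched as follows. This is only an illustration under stated assumptions: `iterate_adaptation`, the callables `recognize` and `adapt`, and the threshold `tol` are hypothetical names, and the patent does not specify how saturation is judged beyond comparing with the previous iteration's likelihood.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

Model = TypeVar("Model")
Pattern = TypeVar("Pattern")


def iterate_adaptation(
    patterns: Sequence[Pattern],   # stored input patterns (computed once)
    model: Model,                  # current recognition model
    recognize: Callable[[Model, Pattern], Tuple[str, float]],
    adapt: Callable[[Model, Sequence[Pattern], List[str]], Model],
    max_iters: int = 20,           # assumed upper bound on iterations
    tol: float = 1e-3,             # assumed saturation threshold
) -> Tuple[Model, int]:
    """Repeat recognition and supervised adaptation until likelihood saturates."""
    prev_total = float("-inf")
    n_adaptations = 0
    for _ in range(max_iters):
        # Recognize every stored pattern: (recognized word, log-likelihood).
        results = [recognize(model, x) for x in patterns]
        total = sum(loglik for _, loglik in results)
        if total - prev_total < tol:
            break  # likelihood no longer improves: stop iterating
        # The recognition results serve as the teacher labels.
        labels = [word for word, _ in results]
        model = adapt(model, patterns, labels)
        prev_total = total
        n_adaptations += 1
    return model, n_adaptations
```

The stored `patterns` list plays the role of the input pattern storage unit 307: speech analysis is done once, and every iteration reuses the same patterns. Passing a fixed `max_iters` with a very small `tol` corresponds to the alternative of fixing the number of iterations in advance.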

【００８３】図４は、本願の請求項３の発明に係る第４
の話者学習方式の一実施例のブロック図である。第４の
話者学習方式では、第３の話者学習方式において、適応
化初期音素ＨＭＭとして、標準パターン記憶部に記憶さ
れた音素ＨＭＭを用いる。繰り返しを行なうことによ
り、標準パターン記憶部に記憶された音素ＨＭＭはすで
に使用者にある程度適応しているため、それを適応化の
初期モデルとして用いることにより、適応化が速やかに
行なわれる。すなわち、認識・適応化の繰り返しの回数
が減少する効果がある。また、この方式は、認識単位が
単語であっても容易に適用可能である。FIG. 4 is a block diagram of an embodiment of the fourth speaker learning method, according to the invention of claim 3 of the present application. The fourth speaker learning method is the third speaker learning method with the phoneme HMMs stored in the standard pattern storage unit used as the adaptation initial phoneme HMMs. Because of the repetition, the phoneme HMMs stored in the standard pattern storage unit are already adapted to the user to some extent, so using them as the initial model for adaptation makes adaptation proceed quickly. That is, the number of recognition and adaptation iterations is reduced. This method is also readily applicable when the recognition unit is the word.

【００８４】図５は、本願の請求項４の発明に係る第５
の話者学習方式の一実施例のブロック図である。第５の
話者学習方式では、第３の話者学習方式において、適応
化初期ＨＭＭとして、基本標準パターン記憶部５０９に
記憶された音素ＨＭＭを用いる。基本標準パターンは、
予め多数の話者の発声により学習された不特定話者ＨＭ
Ｍや、他の使用者の発声により学習された異話者ＨＭＭ
を用いる。この基本標準パターンは、繰り返しにより更
新されることはない。第４の話者適応化方式では、前の
繰り返しにおける適応化後ＨＭＭを適応化初期ＨＭＭと
しているが、適応化が迅速に行なわれる反面、認識の
際、誤認識があると、それが、適応化の性能に与える影
響がより大きくなるという問題点がある。しかし、この
第５の話者適応化方式では、適応化において前ループか
ら得る情報は、教師となる適応化用辞書のみとなり、第
４の話者適応化方式に比べ、繰り返しの回数は多くかか
るものの誤認識の度合が少ないと考えられる。また、こ
の方式は、認識単位が単語であっても容易に適用可能で
ある。FIG. 5 is a block diagram of an embodiment of the fifth speaker learning method, according to the invention of claim 4 of the present application. The fifth speaker learning method is the third speaker learning method with the phoneme HMMs stored in the basic standard pattern storage unit 509 used as the adaptation initial HMMs. The basic standard patterns are speaker-independent HMMs trained in advance on the utterances of many speakers, or different-speaker HMMs trained on the utterances of another user. These basic standard patterns are not updated by the repetition. The fourth speaker adaptation method uses the adapted HMM from the previous iteration as the adaptation initial HMM, so adaptation proceeds quickly, but any misrecognition during recognition has a larger effect on adaptation performance. In the fifth speaker adaptation method, by contrast, the only information carried over from the previous loop into adaptation is the adaptation dictionary that serves as the teacher. Compared with the fourth speaker adaptation method it needs more iterations, but it is expected to be less affected by misrecognition. This method is also readily applicable when the recognition unit is the word.
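The difference between the fourth and fifth methods is only the model from which each adaptation pass starts. The toy sketch below makes that concrete. All of it is assumed for illustration: the model is one scalar mean per word, `recognize` picks the nearest mean, and `adapt` moves a mean halfway toward the average of the patterns labeled with that word. None of this is the adaptation formula of the patent.

```python
from typing import Dict, List

Model = Dict[str, float]  # toy model: one scalar mean per word


def recognize(model: Model, x: float) -> str:
    """Return the word whose mean is closest to the pattern x."""
    return min(model, key=lambda w: abs(x - model[w]))


def adapt(start: Model, patterns: List[float], labels: List[str]) -> Model:
    """Move each mean halfway toward the average of its labeled patterns."""
    adapted = dict(start)
    for word in start:
        xs = [x for x, lab in zip(patterns, labels) if lab == word]
        if xs:
            adapted[word] = start[word] + 0.5 * (sum(xs) / len(xs) - start[word])
    return adapted


def run(patterns: List[float], base: Model, n_iters: int, init: str) -> Model:
    """init='previous': start from the last adapted model (fourth method).
    init='base': always start from the fixed basic model (fifth method)."""
    current = dict(base)
    for _ in range(n_iters):
        labels = [recognize(current, x) for x in patterns]  # teacher labels
        start = current if init == "previous" else base
        current = adapt(start, patterns, labels)
    return current


base = {"a": 0.0, "b": 10.0}
patterns = [2.0, 3.0, 9.0]
fourth = run(patterns, base, n_iters=2, init="previous")
fifth = run(patterns, base, n_iters=2, init="base")
print(fourth, fifth)
```

In the `"previous"` mode the means keep moving toward the data at each pass, so any wrong label is compounded. In the `"base"` mode only the labels change between passes, which mirrors the statement that the adaptation dictionary is the only information taken from the previous loop.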

【００８５】以上述べた実施例は、いずれも図６に示す
従来の装置を拡張変更することによって達成することが
できるという特徴を有する。Each of the embodiments described above has the feature that it can be realized by extending and modifying the conventional device shown in FIG. 6.

【００８６】なお、ここでは、認識対象として単語を例
にあげたが、文、あるいは、会話発声においても同様な
手段で適応化可能である。また、認識方式としてＨＭＭ
を例にあげて説明したが、他の認識方式、例えば、NN、
DPマッチングなどの認識方式においても、認識・適応化
部は同様の手法を用いて構成できる。また、適応化手段
として、特願平2-203437の方式に基づく方式について説
明したが、他の教師あり適応化方式を用いても構成可能
である。さらに、認識・適応化手段において、認識単位
として、音素を例にとりあげたが、音素以外の、音節、
半音節など他の認識単位の場合も、本方式は容易に適用
可能である。Words were used here as the example recognition target, but sentences and conversational utterances can be adapted by the same means. The recognition method was described using HMMs as the example, but with other recognition methods, such as NN (neural networks) or DP matching, the recognition and adaptation units can be configured using the same approach. The adaptation means was described as one based on the method of Japanese Patent Application No. 2-203437, but other supervised adaptation methods can also be used. Furthermore, phonemes were used as the example recognition unit in the recognition and adaptation means, but the method is readily applicable to other recognition units such as syllables and demisyllables.

【００８７】以下に上述した第２の話者適応化方式の評
価実験の結果を述べる。評価実験は半音節を認識単位と
した混合ガウス分布ＨＭＭを用い、類似５０００単語認
識を行なった。ここで、ＨＭＭの混合ガウス分布数は２
とし、多数話者のデータとして、男性４６名女性３９名
計８５名の音素バランスを考慮した２５０単語１回発声
を用いた。また、評価話者として上の85名に含まれない
男性３名、女性４名計７名を用い、適応化用データ、お
よび、評価用データとしてそれぞれ、学習時とは異なる
語彙２５０単語１回発声を用いた。適応化用、評価用の
データの語彙はお互いに異なっている。分析条件は、サ
ンプリング周波数１６ kHz、帯域０．１−７．２ kHz、
フレーム間隔１０ｍｓで、メルケプストラム分析を用い
た。特徴ベクトルは正規化パワー差分、メルケプストラ
ム１０次元、メルケプストラムの変化量１０次元の計２
１次元である。また、適応化の初期ＨＭＭは話者８５名
の発声データを用いて学習した不特定話者モデルを用い
た。The results of an evaluation experiment on the second speaker adaptation method described above are given below. The experiment used Gaussian-mixture HMMs with the demisyllable as the recognition unit, on a recognition task with a vocabulary of 5000 similar words. Each HMM used two Gaussian mixture components. The multi-speaker training data consisted of one utterance each of 250 phoneme-balanced words by 85 speakers (46 male, 39 female). The evaluation speakers were 7 speakers (3 male, 4 female) not among those 85. The adaptation data and the evaluation data were each one utterance of 250 words whose vocabulary differed from the training vocabulary, and the adaptation and evaluation vocabularies also differed from each other. The analysis conditions were a sampling frequency of 16 kHz, a band of 0.1 to 7.2 kHz, and a frame interval of 10 ms, using mel-cepstrum analysis. The feature vector has 21 dimensions in total: the normalized power difference, 10 mel-cepstrum dimensions, and 10 dimensions of mel-cepstrum change. The initial HMM for adaptation was a speaker-independent model trained on the utterance data of the 85 speakers.
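The figures in paragraph [0087] can be cross-checked with a few lines of arithmetic. The constant names are hypothetical, and the single dimension assigned to the normalized power difference is inferred from the stated total of 21 rather than stated directly.

```python
SAMPLE_RATE_HZ = 16_000   # sampling frequency from the analysis conditions
FRAME_SHIFT_MS = 10       # frame interval from the analysis conditions

# Number of samples between successive analysis frames.
samples_per_shift = SAMPLE_RATE_HZ * FRAME_SHIFT_MS // 1000

# Feature layout; the size 1 for the power term is inferred from the total.
FEATURE_LAYOUT = {
    "normalized_power_difference": 1,
    "mel_cepstrum": 10,
    "mel_cepstrum_change": 10,
}
feature_dim = sum(FEATURE_LAYOUT.values())

training_speakers = 46 + 39   # male + female training speakers
evaluation_speakers = 3 + 4   # male + female evaluation speakers

print(samples_per_shift, feature_dim, training_speakers, evaluation_speakers)
```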

【００８８】離散５０００単語を適応化の認識対象とし
た場合について教師なし適応化の評価実験を行なった結
果、性能が大幅に向上し、話者７名平均で不特定話者認
識率８４．５％のところ、適応化単語数２５０単語で教
師なし適応化後の認識率９１．３％と誤りが半分近く減
少している。また、教師あり適応化と比べても、各々の
適応化用単語数において、１〜２％低いに過ぎない。An evaluation experiment on unsupervised adaptation was carried out with 5000 isolated words as the recognition target for adaptation. Performance improved substantially: averaged over the 7 speakers, the speaker-independent recognition rate of 84.5% rose to 91.3% after unsupervised adaptation with 250 adaptation words, so that errors were reduced by nearly half. Compared with supervised adaptation, the recognition rate was only 1 to 2% lower at each number of adaptation words.
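The statement that errors fell by nearly half follows from the two recognition rates. A short check, using a hypothetical helper name:

```python
def relative_error_reduction(acc_before: float, acc_after: float) -> float:
    """Fraction of recognition errors removed, given accuracies in percent."""
    err_before = 100.0 - acc_before
    err_after = 100.0 - acc_after
    return (err_before - err_after) / err_before


# 84.5% before adaptation, 91.3% after unsupervised adaptation.
reduction = relative_error_reduction(84.5, 91.3)
print(f"{reduction:.1%}")
```

The error rate goes from 15.5% to 8.7%, a relative reduction of about 44%.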

【００８９】[0089]

【発明の効果】以上説明したように、本発明により、音
声認識装置を使用者が意識することなしに使用者に適応
させ、高い認識性能を得ることが可能になり、同時に使
用者の負担が軽減されユーザーインターフェースが向上
し、さらに、すでに教師あり適応化システムが存在して
いる場合、それを利用することによりわずかな手間で教
師なし適応化システムを構築可能になるという効果があ
る。As described above, the present invention makes it possible to adapt a speech recognition device to its user without the user being aware of it and to obtain high recognition performance. At the same time, it reduces the burden on the user and improves the user interface. Furthermore, when a supervised adaptation system already exists, an unsupervised adaptation system can be built from it with little effort.

[Brief description of the drawings]

【図１】本発明の第１の実施例を示すブロック図であ
る。FIG. 1 is a block diagram showing a first embodiment of the present invention.

【図２】本発明の第２の実施例を示すブロック図であ
る。FIG. 2 is a block diagram showing a second embodiment of the present invention.

【図３】本発明の第３の実施例を示すブロック図であ
る。FIG. 3 is a block diagram showing a third embodiment of the present invention.

【図４】本発明の第４の実施例を示すブロック図であ
る。FIG. 4 is a block diagram showing a fourth embodiment of the present invention.

【図５】本発明の第５の実施例を示すブロック図であ
る。FIG. 5 is a block diagram showing a fifth embodiment of the present invention.

【図６】従来技術の実施例を示すブロック図である。FIG. 6 is a block diagram showing an embodiment of the prior art.

[Explanation of symbols]

１０１標準パターン記憶部１０２入力パターン作成部１０３認識部１０４認識結果出力部１０５適応化用標準パターン選択部１０６教師あり適応化部２０１標準パターン記憶部２０２入力パターン作成部２０３認識部２０４認識結果出力部２０５適応化用辞書作成部２０６教師あり適応化部２０７語彙パターン作成部３０１標準パターン記憶部３０２入力パターン作成部３０３認識部３０４認識結果出力部３０５適応化用辞書作成部３０６教師あり適応化部３０７入力パターン記憶部３０８繰り返し制御部３０９語彙パターン作成部４０１標準パターン記憶部４０２入力パターン作成部４０３認識部４０４認識結果出力部４０５適応化用辞書作成部４０６教師あり適応化部４０７入力パターン記憶部４０８繰り返し制御部４０９語彙パターン作成部５０１標準パターン記憶部５０２入力パターン作成部５０３認識部５０４認識結果出力部５０５適応化用辞書作成部５０６教師あり適応化部５０７入力パターン記憶部５０８繰り返し制御部５０９基本標準パターン記憶部５１０語彙パターン作成部６−１音声認識装置６−２入力パターン作成部６−３認識部６−４認識結果出力部６−５語彙パターン作成部６−６標準パターン記憶部６−７適応化用辞書作成部６−８入力パターン作成部６−９適応化部６−１０語彙パターン作成部６−１１適応化初期標準パターン記憶部

Reference Signs List
101 Standard pattern storage unit; 102 Input pattern creation unit; 103 Recognition unit; 104 Recognition result output unit; 105 Adaptation standard pattern selection unit; 106 Supervised adaptation unit
201 Standard pattern storage unit; 202 Input pattern creation unit; 203 Recognition unit; 204 Recognition result output unit; 205 Adaptation dictionary creation unit; 206 Supervised adaptation unit; 207 Vocabulary pattern creation unit
301 Standard pattern storage unit; 302 Input pattern creation unit; 303 Recognition unit; 304 Recognition result output unit; 305 Adaptation dictionary creation unit; 306 Supervised adaptation unit; 307 Input pattern storage unit; 308 Repetition control unit; 309 Vocabulary pattern creation unit
401 Standard pattern storage unit; 402 Input pattern creation unit; 403 Recognition unit; 404 Recognition result output unit; 405 Adaptation dictionary creation unit; 406 Supervised adaptation unit; 407 Input pattern storage unit; 408 Repetition control unit; 409 Vocabulary pattern creation unit
501 Standard pattern storage unit; 502 Input pattern creation unit; 503 Recognition unit; 504 Recognition result output unit; 505 Adaptation dictionary creation unit; 506 Supervised adaptation unit; 507 Input pattern storage unit; 508 Repetition control unit; 509 Basic standard pattern storage unit; 510 Vocabulary pattern creation unit
6-1 Speech recognition device; 6-2 Input pattern creation unit; 6-3 Recognition unit; 6-4 Recognition result output unit; 6-5 Vocabulary pattern creation unit; 6-6 Standard pattern storage unit; 6-7 Adaptation dictionary creation unit; 6-8 Input pattern creation unit; 6-9 Adaptation unit; 6-10 Vocabulary pattern creation unit; 6-11 Adaptation initial standard pattern storage unit

フロントページの続き (56)参考文献特開平５−88693（ＪＰ，Ａ) 特開平２−198499（ＪＰ，Ａ) 特開平４−280299（ＪＰ，Ａ) 特開昭63−63098（ＪＰ，Ａ) 特開平５−232985（ＪＰ，Ａ) 特開平５−224691（ＪＰ，Ａ) 日本音響学会講演論文集平成５年10 月２−７−13 ｐ．95〜96 日本音響学会講演論文集平成６年３月３−７−８ｐ．103〜104 日本音響学会講演論文集平成２年９月１−８−12 ｐ．23〜24 日本音響学会講演論文集平成４年３月１−Ｐ−15 ｐ．133〜134 (58)調査した分野(Int.Cl.⁶，ＤＢ名) G10L 3/00 521 G10L 3/00 535 ＪＩＣＳＴファイル（ＪＯＩＳ)

Continuation of the front page
(56) References
JP-A-5-88693 (JP, A)
JP-A-2-198499 (JP, A)
JP-A-4-280299 (JP, A)
JP-A-63-63098 (JP, A)
JP-A-5-232985 (JP, A)
JP-A-5-224691 (JP, A)
Proceedings of the Acoustical Society of Japan, October 1993, 2-7-13, pp. 95-96
Proceedings of the Acoustical Society of Japan, March 1994, 3-7-8, pp. 103-104
Proceedings of the Acoustical Society of Japan, September 1990, 1-8-12, pp. 23-24
Proceedings of the Acoustical Society of Japan, March 1992, 1-P-15, pp. 133-134
(58) Fields searched (Int. Cl.⁶, DB name)
G10L 3/00 521
G10L 3/00 535
JICST file (JOIS)

Claims

(57) [Claims]

1. A speaker adaptation method for adapting a standard pattern used for speech recognition using an input speech pattern, comprising: a standard pattern storage unit that holds word standard patterns created according to a predetermined criterion for each recognition candidate word; an input pattern creation unit that performs speech analysis on input speech to create an input pattern; a recognition unit that recognizes words from the created input pattern using the word standard patterns in the standard pattern storage unit; a recognition result output unit that outputs the recognition result of the recognition unit; an adaptation standard pattern selection unit that prepares an adaptation initial word standard pattern by a predetermined method with reference to the notation of the recognized word; a supervised adaptation unit that calculates the likelihood of the input pattern based on the adaptation initial word standard pattern and updates the word standard patterns in the standard pattern storage unit with the adapted word standard patterns obtained from the likelihood calculation result; an input pattern storage unit that stores the input pattern; and a repetition control unit that causes the recognition unit, the recognition result output unit, the adaptation standard pattern selection unit, and the supervised adaptation unit to repeat their operation until a predetermined variable reaches a predetermined reference value.

2. A speaker adaptation method for adapting a standard pattern used for speech recognition using an input speech pattern, comprising: a standard pattern storage unit that holds subword standard patterns created according to a predetermined criterion in units of subwords such as syllables and phonemes; a vocabulary pattern creation unit that creates word standard patterns corresponding to recognition candidate words using the subword standard patterns; an input pattern creation unit that performs speech analysis on input speech to create an input pattern; a recognition unit that recognizes words from the created input pattern using the word standard patterns created by the vocabulary pattern creation unit; a recognition result output unit that outputs the recognition result of the recognition unit; an adaptation standard pattern selection unit that prepares an adaptation initial word standard pattern by a predetermined method with reference to the notation of the recognized word; a supervised adaptation unit that calculates the likelihood of the input pattern based on the adaptation initial word standard pattern and updates the subword standard patterns in the standard pattern storage unit with the adapted subword standard patterns obtained from the likelihood calculation result; an input pattern storage unit that stores the input pattern; and a repetition control unit that, taking the input pattern in the input pattern storage unit as input, causes the recognition unit, the recognition result output unit, the adaptation standard pattern selection unit, and the supervised adaptation unit to repeat their operation until a predetermined variable reaches a predetermined reference value.

3. A speaker adaptation method for adapting a standard pattern used for speech recognition using an input speech pattern, comprising: a standard pattern storage unit that holds subword standard patterns created according to a predetermined criterion in units of subwords such as syllables and phonemes; a vocabulary pattern creation unit that creates word standard patterns corresponding to recognition candidate words using the subword standard patterns; an input pattern creation unit that performs speech analysis on input speech to create an input pattern; an input pattern storage unit that stores the created input pattern; a recognition unit that recognizes words using the input pattern in the input pattern storage unit and the word standard patterns created by the vocabulary pattern creation unit; a recognition result output unit that outputs the recognition result of the recognition unit; a supervised adaptation unit that calculates the likelihood of the input pattern based on the word standard pattern in the vocabulary pattern creation unit corresponding to the recognized word and updates the subword standard patterns in the standard pattern storage unit with the adapted subword standard patterns obtained from the likelihood calculation result; and a repetition control unit that causes the recognition unit, the recognition result output unit, and the supervised adaptation unit to repeat their operation until a predetermined variable reaches a predetermined reference value.

4. A speaker adaptation method for adapting a standard pattern used for speech recognition using an input speech pattern, comprising: a standard pattern storage unit that holds subword standard patterns created according to a predetermined criterion in units of subwords such as syllables and phonemes; a vocabulary pattern creation unit that creates word standard patterns corresponding to recognition candidate words using the subword standard patterns; a basic standard pattern storage unit that stores basic standard patterns corresponding to the recognition candidate words, created by a predetermined method; an input pattern creation unit that performs speech analysis on input speech to create an input pattern; an input pattern storage unit that stores the created input pattern; a recognition unit that recognizes words using the input pattern in the input pattern storage unit and the word standard patterns created by the vocabulary pattern creation unit; a recognition result output unit that outputs the recognition result of the recognition unit; a supervised adaptation unit that calculates the likelihood of the input pattern based on the basic standard pattern in the basic standard pattern storage unit corresponding to the recognized word and updates the subword standard patterns in the standard pattern storage unit with the adapted subword standard patterns obtained from the likelihood calculation result; and a repetition control unit that causes the recognition unit, the recognition result output unit, and the supervised adaptation unit to repeat their operation until a predetermined variable reaches a predetermined reference value.