JP2000003190A

JP2000003190A - Voice model learning device and voice recognition device

Info

Publication number: JP2000003190A
Application number: JP10169587A
Authority: JP
Inventors: Masaru Takano; 優高野
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 1998-06-17
Filing date: 1998-06-17
Publication date: 2000-01-07
Anticipated expiration: 2018-06-17
Also published as: JP3090204B2

Abstract

PROBLEM TO BE SOLVED: To eliminate the dictionary dependence of discriminative learning and to create a discriminatively trained model with high discriminative performance for every set of words in a specific language, by storing in advance a pair of mutually different monosyllable models in a dictionary unit as the learning word patterns. SOLUTION: A speech analysis unit 101 receives training speech and, at fixed time intervals (frames), calculates and outputs the acoustic features of the training speech in each frame. A dictionary unit 102 holds a learning monosyllable-pair dictionary (patterns: a pair of monosyllable models and the subword models that constitute them). A pattern matching unit 103 receives the per-frame features and outputs the alignment of every model in the monosyllable-pair dictionary with the training speech, together with its likelihood. A calculation unit 104 calculates and outputs correction amounts for the parameters of each model in the dictionary from the alignment, the likelihood, and the monosyllable-pair dictionary. A correction unit 105 corrects the parameters of each model in the dictionary using the correction amounts, and a control unit 106 receives the likelihood and decides when to end the calculation.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Technical Field of the Invention] The present invention relates to a speech model learning device and to a speech recognition device that uses it.

[0002]

[Description of the Related Art] Speech models used for speech recognition are often trained by a speech model learning device using actual speech. Such training methods are divided into non-discriminative learning methods and discriminative learning methods.

[0003] One non-discriminative learning method uses the EM (Expectation-Maximization) algorithm. This is a training algorithm for building an HMM (hidden Markov model).

[0004] In an HMM, only the symbol sequence of the input pattern is observed; the transitions between states cannot be observed directly. Therefore, a model is assumed, the number of state transitions is computed from the observed data (the input pattern), and the transition probabilities and output probabilities are then estimated by maximum likelihood from those counts to update the HMM parameters. The EM algorithm is the estimation algorithm that repeats these two steps until convergence.

[0005] In EM learning, which uses this EM algorithm, a model is generated for a given word so that the utterances of that word (usually several) fit the model as well as possible overall. The goodness of fit of the model is usually expressed by a Forward-Backward score or a Viterbi score.

[0006] EM learning normally generates models for a large number of words, on the order of several hundred. For a subword (a syllable, a phoneme, etc.) shared by several words, it generates a balanced model that fits all of those words evenly.

[0007] The EM algorithm is described, for example, in Kiyohiro Shikano, "Basics of Speech Recognition" (May 17, 1996), pp. 52-57.

[0008] Discriminative learning, by contrast, takes the utterances of a given word (there may be several) and trains so as to separate the word model of the utterance content from word models that differ from the utterance content but are mistakenly recognized as representing it (error models). Because the goal is to create a margin between the word model and the error models, how well each model fits the utterance matters less; the models are adjusted against each other so that the fit of the model representing the utterance content becomes higher (more favorable) than the fit of the error models. Discriminative learning is normally applied to models obtained by EM learning or the like, so in most cases the fit of both models decreases. The error models are word models contained in some dictionary; they may be all word models in that dictionary (other than the one representing the utterance content), or specific words may be selected.

[0009] One such discriminative learning method is the minimum classification error (MCE) learning method, described for example in Junichi Takahashi and Shigeki Sagayama, "A Study of Initial Models in Learning from a Small Amount of Data Using Classification Error Minimization", Proceedings of the Acoustical Society of Japan, 3-5-8 (March 26, 1996), pp. 123-124. In MCE learning, the measure used to discriminate classes is a discriminant function g_k. From the misclassification function d_k(X, Λ), which is expressed as a difference of discriminant functions for a sample X, the effective number of classification errors is evaluated using a loss function l(d_k) given by a sigmoid function, and the model parameters Λ are determined under the criterion of minimizing this error count, which raises the discriminative ability of the model. It is a learning method that takes into account the relationship between the correct class for a sample X and the classes competing with it. The MCE method is formulated as in Equations 1 to 5 below.

[0010]

(Equation 1)

[0011]

(Equation 2)

[0012]

(Equation 3)

[0013]

(Equation 4)

[0014]

(Equation 5)

As Equation 5 shows, this method finds the optimal parameters by iteratively adjusting the model parameters Λ with the steepest descent method. The parameter ε_t controls the learning step size.
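The equation images are not reproduced in this text, so the following is only a minimal sketch of the quantities the description names: a misclassification measure built from a difference of discriminant scores, a sigmoid loss, and one steepest-descent update with step size ε_t. The single-competitor form of `misclassification`, the slope parameter `gamma`, and all function names are assumptions made for illustration, not the patent's Equations 1 to 5.

```python
import math


def misclassification(g_correct, g_rival):
    """d_k as a difference of discriminant scores (assumed single-rival form).

    Negative when the correct class scores higher than its rival.
    """
    return -g_correct + g_rival


def sigmoid_loss(d, gamma=1.0):
    """Smooth 0-to-1 error count l(d); gamma is an assumed slope parameter."""
    return 1.0 / (1.0 + math.exp(-gamma * d))


def mce_step(params, grad_d, d, step, gamma=1.0):
    """One steepest-descent update: params - step * dl/dparams.

    grad_d is the gradient of d with respect to each parameter, supplied by
    the caller; the chain rule gives dl/dparams = gamma * l * (1 - l) * grad_d.
    """
    loss = sigmoid_loss(d, gamma)
    scale = gamma * loss * (1.0 - loss)
    return [p - step * scale * g for p, g in zip(params, grad_d)]
```

Because the sigmoid is flat far from zero, samples that are already clearly right or clearly wrong barely move the parameters; updates concentrate on samples near the decision boundary.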

[0015]

[Problems to Be Solved by the Invention] In general, a speech model created by discriminative learning has the advantage of higher discriminative performance than one created by non-discriminative learning.

[0016] However, a speech model created by the MCE learning method discriminates well among the words contained in the dictionary used during training, but it does not discriminate nearly as well among other combinations of words (words not contained in the dictionary).

[0017] More specifically, the MCE learning method uses a specific word set containing a plurality of words as the dictionary for discriminative learning. In this case, every word in the dictionary other than the word corresponding to the training utterance serves as an error model. As a result, the ability to discriminate among the words in the training dictionary improves. With a training method that uses such a specific word set as its dictionary, however, the misrecognition rate within that dictionary can be improved, but discriminative performance on word sets outside the dictionary is not guaranteed. As an extreme example, if a particular syllable model does not occur in the dictionary, conventional discriminative learning does nothing at all to improve the misrecognition rate for that syllable model.

[0018] An object of the present invention is to provide a speech model learning device that eliminates this dictionary dependence of discriminative learning and can create a discriminatively trained model with high discriminative performance for every set of words in a specific language, and a speech recognition device using it.

[0019]

[Means for Solving the Problems] The speech model learning device of the present invention comprises: a speech analysis unit that calculates features of the input speech at fixed time intervals; a dictionary unit that stores a plurality of learning word patterns; a pattern matching unit that matches the features for each fixed time interval against each of the plurality of learning word patterns and obtains the alignment with, and the likelihood of, each learning word pattern; a calculation unit that calculates a parameter correction amount for each learning word pattern based on the alignment and the likelihood; and a correction unit that corrects the parameters of each of the plurality of word patterns stored in the dictionary unit based on the parameter correction amounts. The dictionary unit stores a pair of mutually different monosyllable models as the learning word patterns.

[0020] The speech model learning device of the present invention also has a control unit that determines, based on the likelihood, whether to perform the learning process again using the learning word patterns whose parameters have been corrected by the correction unit and, if so, instructs the pattern matching unit to re-execute the learning process.

[0021] Further, in the speech model learning device of the present invention, the dictionary unit may hold, in place of the pair of mutually different monosyllable models, a pair of word models as the word patterns. One of the pair is the word model corresponding to a given input speech, and the other is the word model closest to that input speech as obtained by a continuous word recognition algorithm.

[0022] Still further, in the speech model learning device of the present invention, the dictionary unit may hold, in place of the pair of mutually different monosyllable models, a plurality of pairs of word models as the word patterns. One word model of each pair corresponds to a different given input speech, and the other word model of each pair is the word model closest to the respective input speech as obtained by a continuous word recognition algorithm.

[0023] The speech recognition device of the present invention comprises: a speech analysis unit that calculates features of the input speech at fixed time intervals; a dictionary unit that stores a pair of word patterns; a calculation unit that matches the features for each fixed time interval against each of the pair of word patterns and calculates the likelihood of each; and an output unit that, based on the likelihoods, judges one of the pair of word patterns to correspond to the input speech and outputs it. Each of the pair of word patterns stored in the dictionary unit is formed by connecting monosyllable models in utterance order, and each monosyllable model is generated by discriminative learning using a pair of learning word patterns in which it is paired with the monosyllable model at the same utterance position in the other word pattern.

[0024] The speech recognition device of the present invention also comprises: a speech analysis unit that calculates features of the input speech at fixed time intervals; a dictionary unit that stores word patterns corresponding to a plurality of words; a calculation unit that matches the features for each fixed time interval against each of the word patterns and calculates the likelihood of each word pattern; and an output unit that determines from the likelihoods the word pattern closest to the input speech and outputs it. The word pattern for each word is formed by connecting, in utterance order, the monosyllable models that constitute the word. For each word, as many word patterns are prepared as there are other words, so that each monosyllable model constituting the word is one generated by discriminative learning using a pair of learning word patterns in which it is paired, position by position in utterance order, with the monosyllable model constituting another word.

[0025] Further, the speech model learning method of the present invention comprises: a first step of calculating features of an input monosyllable utterance at fixed time intervals; a second step of matching the features for each fixed time interval against each of a pair of learning monosyllable patterns stored in advance and obtaining the alignment between the monosyllable utterance and each learning monosyllable pattern, together with its likelihood; a third step of calculating a parameter correction amount for each learning monosyllable pattern based on the alignment and the likelihood; and a fourth step of correcting the parameters of each of the plurality of learning monosyllable patterns stored in advance based on the parameter correction amounts.

[0026] The speech model learning method of the present invention also comprises a fifth step, performed after the fourth step, of determining, based on the likelihood, whether to perform the learning process again using the learning word patterns with the corrected parameters and, if so, repeating the second and subsequent steps.

[0027] Still further, the speech model learning method of the present invention comprises: a first step of calculating features of the input speech at fixed time intervals; a second step of matching the features for each fixed time interval against each of a pair of word patterns stored in advance and calculating the likelihood of each; and a third step of judging, based on the likelihoods, that one of the pair of word patterns corresponds to the input speech and outputting it. The pair of word patterns are word models formed by connecting monosyllable models obtained by a speech model learning method that uses a pair of learning monosyllable patterns.

[0028]

[Operation] For every pair of syllables that occurs in a specific language (for example Japanese; it may also be something like a general-purpose language covering all of Japanese and English), the speech model learning device creates a discriminatively trained model that discriminates only that syllable pair. That is, the device holds only one pair of syllable models as its training dictionary.

[0029] In this speech model learning device, discriminative learning is performed for all syllable pairs. This eliminates the drawbacks that arise from dependence on the contents of the dictionary used for training, such as a particular syllable model being absent from the dictionary. However, although the device generates models that discriminate single syllables with high accuracy, the accuracy may be somewhat insufficient for representing word models made up of consecutive syllables, because the models do not represent the changes in pronunciation that occur between syllables.

[0030] Therefore, a syllable string model corresponding to the input and a model representing every syllable string contained in the specific language may be used as the training dictionary.

[0031] In this case, discrimination is still performed on syllable pairs, but because discriminative learning is carried out on consecutive syllables, models representing the changes in pronunciation between syllables are learned at the same time, and a discriminatively trained model with higher discriminative performance can be created.

[0032]

[Embodiments of the Invention] Embodiments of the present invention are described below with reference to the drawings.

[0033] FIG. 1 shows a first embodiment of the present invention. This speech model learning device performs discriminative learning on every syllable that can occur in a specific language (any language will do, for example Japanese, English, or a combination of them), pairing each syllable with the other syllables. In Japanese, for example, if there are 50 syllables, each syllable can be paired with 49 others, and discriminative learning is performed on each of the (50 × 49) / 2 syllable pairs that remain once duplicate pairs are removed. In other words, the device handles one pair of syllables (two syllables) at a time, but in the end it performs discriminative learning on every conceivable pair.
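The pair count in this paragraph can be checked with a short sketch. The syllable labels are placeholders invented for illustration; only the figure of 50 syllables and the (50 × 49) / 2 formula come from the text.

```python
from itertools import combinations


def syllable_pairs(syllables):
    """Return every unordered pair of distinct syllables, each pair once."""
    return list(combinations(syllables, 2))


# Placeholder labels standing in for the 50 syllables assumed in the text.
syllables = [f"syl{i:02d}" for i in range(50)]
pairs = syllable_pairs(syllables)

# (50 * 49) / 2 unordered pairs, one training run per pair.
assert len(pairs) == (50 * 49) // 2
```

Each element of `pairs` would correspond to one two-entry training dictionary in the first embodiment.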

[0034] The speech model learning device of FIG. 1 has: a speech analysis unit 101 that receives training speech and, at fixed time intervals (called frames), calculates and outputs the acoustic features of the training speech in each frame; a dictionary unit 102 that holds the learning monosyllable-pair dictionary (patterns: a pair of monosyllable models and the subword models constituting each monosyllable model); a pattern matching unit 103 that receives the per-frame features output by the speech analysis unit 101 and outputs the alignment of every model in the monosyllable-pair dictionary of the dictionary unit 102 with the training speech, together with its likelihood; a calculation unit 104 that calculates and outputs correction amounts for the parameters of each model in the dictionary from the alignment and likelihood output by the pattern matching unit 103 and from the monosyllable-pair dictionary of the dictionary unit 102; a correction unit 105 that corrects the parameters of each model in the dictionary using the correction amounts output by the calculation unit 104; and a control unit 106 that receives the likelihood output by the pattern matching unit 103 and decides whether to end the calculation.

[0035] Next, the operation of this speech model learning device is described with reference to FIG. 1 and FIGS. 2 to 4. As an example, the description uses monosyllable utterances of "shi" and "chi" as the input speech.

[0036] First, in step S1 of FIG. 2, the speech analysis unit 101 divides the input speech into frames of a fixed period and performs frequency analysis on the speech in each frame. It then calculates the features of the input speech for each frame and outputs them to the pattern matching unit 103. Features that can be used include the speech power, the change in power, the cepstrum, and the change in cepstrum.
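As a rough illustration of per-frame analysis, the sketch below computes only two of the listed features, log power and its frame-to-frame change. The frame length, hop size, and function name are assumptions; the cepstral features are omitted, and a real front end would typically apply windowing and a spectral transform.

```python
import math


def frame_features(samples, frame_len, hop):
    """Split samples into frames and return (log_power, delta_log_power) per frame.

    frame_len and hop are in samples; both are illustrative parameters that
    the source text does not specify.
    """
    features = []
    previous = None
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = samples[start:start + frame_len]
        energy = sum(x * x for x in frame) / frame_len
        log_power = math.log(energy + 1e-10)  # small floor avoids log(0)
        delta = 0.0 if previous is None else log_power - previous
        features.append((log_power, delta))
        previous = log_power
    return features
```

For instance, a signal whose amplitude doubles from one frame to the next yields a power change of about log 4 in the second frame.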

[0037] The dictionary unit 102 holds two learning monosyllable dictionaries (monosyllable models). Here, as shown in FIG. 3, these are a dictionary representing the monosyllable "shi" and a dictionary representing the monosyllable "chi". The dictionary unit 102 also has a subword dictionary (subword models) D1 containing all the subwords corresponding to "shi" and "chi". In addition, the dictionary unit 102 has a subword dictionary D2 into which subword dictionary D1 is copied. In the initial state, subword dictionary D2 has the same contents as subword dictionary D1.

[0038] Next, in step S2, the pattern matching unit 103 uses the features of each frame received from the speech analysis unit 101 to extract the subword models corresponding to each of the two monosyllable models held by the dictionary unit 102. It also obtains the likelihood of each extracted subword model. The pattern matching unit 103 outputs the alignment it has found, the extracted subword models, and their likelihoods to the calculation unit 104.

[0039] In step S3, the calculation unit 104 obtains the correction amount for each parameter of each subword model and outputs it to the correction unit 105 together with the likelihood from the pattern matching unit 103. A conventional method such as the MCE method can be used to obtain the correction amounts (see Equations 1 to 5).

[0040] On receiving the correction amounts from the calculation unit, the correction unit 105, in step S4, corrects each model in subword dictionary D1 by adding the correction amount to its feature values. When it has finished the correction, the correction unit 105 outputs the likelihood from the calculation unit 104 to the control unit 106 and reports that the correction process is complete.

[0041] When the correction process in the correction unit 105 is finished, the control unit 106 calculates the value of the loss function from the likelihood supplied by the calculation unit 104. The control unit 106 stores the loss function value from the previous pass and, in step S5, compares the newly obtained loss function value with the stored one.

[0042] If the loss function value obtained here is less than or equal to the stored previous value, the control unit 106 stores the new value and instructs the dictionary unit 102 to copy subword dictionary D1 to subword dictionary D2. Subword dictionary D1 was corrected in step S4 while subword dictionary D2 was not, so the contents of the two dictionaries had diverged; the copy makes them identical again. The control unit 106 also instructs the pattern matching unit 103 to run its processing again. In the initial state a dummy value of 1.0, the upper bound of the loss function, is set as the stored loss value, so on the first pass the control unit 106 always behaves as described above. The pattern matching unit 103, the calculation unit 104, and the correction unit 105 then repeat steps S2 through S4, and the control unit 106 again compares, in step S5, the newly obtained loss function value with the stored one.

[0043] If, on the other hand, the loss function value obtained exceeds the stored previous value, the control unit 106 outputs the contents of subword dictionary D2 in the dictionary unit 102 as the trained model. That is, the control unit 106 outputs as the trained model not the subword models corrected in the current pass but those corrected in the previous pass.

[0044] For example, as shown in FIG. 4, suppose the loss function values obtained by the control unit 106 on the first, second, and third passes (loop iterations) are 0.500, 0.450, and 0.480, respectively. On the first pass the value is smaller than the initial value 1.0, so a second pass is performed. On the second pass the value 0.450 is smaller than the previous value 0.500, so a third pass is performed. On the third pass, however, the value 0.480 is larger than the previous value 0.450, so processing ends.
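The stopping rule built around subword dictionaries D1 and D2 can be sketched as a small loop. This is a simplified reading of steps S2 to S5, not the device itself: `loss_of` and `correct` are hypothetical callables standing in for pattern matching plus loss evaluation and for the calculation and correction units, and `max_loops` is a safety bound added here that the source does not mention.

```python
import copy


def train_pair(d1, loss_of, correct, initial_loss=1.0, max_loops=100):
    """Iterate correction of D1 until the loss rises, then return the kept copy D2.

    Returns (d2, loops), where loops is the number of passes performed.
    """
    d2 = copy.deepcopy(d1)        # D2 starts with the same contents as D1
    stored_loss = initial_loss    # dummy value 1.0, the upper bound of the loss
    for loop in range(1, max_loops + 1):
        loss = loss_of(d1)        # S2: matching; the loss is derived from its likelihood
        d1 = correct(d1)          # S3 and S4: compute and apply corrections to D1
        if loss <= stored_loss:   # S5: no worse than the previous pass
            stored_loss = loss
            d2 = copy.deepcopy(d1)  # copy the corrected D1 into D2 and loop again
        else:
            return d2, loop       # loss rose: output D2, discarding this pass's correction
    return d2, max_loops
```

With a toy model that merely counts how many corrections have been applied and the loss sequence 0.500, 0.450, 0.480 from the example, the loop stops on the third pass and returns the model kept after the second correction.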

[0045] In this way, the speech model learning device of this embodiment performs discriminative learning on one pair of monosyllables. The embodiment exploits the property of discriminative learning that model accuracy depends on the training dictionary: by narrowing the training dictionary down to the minimum of two syllables, it can generate a highly accurate syllable discrimination model specialized in discriminating just those two syllables. Consequently, if this discriminative learning on a unit of one monosyllable pair is carried out for all the pairs envisaged, the training results (speech models) can be combined to enable speech recognition of any word in the specific language.

[0046] Next, a speech model learning device according to a second embodiment of the present invention is described. It is basically the same as the speech model learning device of the first embodiment, but the contents stored in the dictionary unit 102 differ. In this embodiment a given syllable string is used as the training utterance, so the dictionary unit 102 correspondingly holds, as its recognition dictionary, a syllable string model representing the content of the input speech. The dictionary unit 102 also holds a model that accepts an arbitrary syllable string (hereinafter, the general-purpose model).

[0047] The general-purpose model used here differs completely in form from a normal word model. A normal word model is represented as a linear chain of subword (syllable) models, whereas the general-purpose model is represented as a network, as shown in FIG. 5. That is, the general-purpose model has a path from the input side to the output side for every syllable (in Japanese, the 50 sounds of the syllabary) and a path (null) that returns from the output side to the input side. The speech model learning device searches the general-purpose model for the path that best matches the input speech. For example, when "yakan" (kettle) is given as the input speech, the device finds the path shown in FIG. 6, namely "ya" → "null" → "ka" → "null" → "n". The path found can be regarded as a linear chain, just like a normal word model, so the device uses it as the word model of the misrecognized word (the error model) and performs discriminative learning. Because the general-purpose model has a path for every syllable and a path returning to the input side, it can handle any input speech.
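The path search over the general-purpose model can be illustrated with a small dynamic-programming sketch. It is not the One Pass DP method cited in the text, only a simplified stand-in under several assumptions: features are single numbers, each syllable is a short list of state means, the local cost is a squared difference, and the names `decode_loop` and `null_penalty` are hypothetical.

```python
def decode_loop(frames, syllables, null_penalty=0.0):
    """Return the syllable sequence on the lowest-cost path through a syllable loop.

    frames: list of scalar features (assumption: one number per frame).
    syllables: dict mapping a syllable name to its list of state means.
    """
    states = [(n, i) for n, means in syllables.items() for i in range(len(means))]
    mean = {(n, i): m for n, means in syllables.items() for i, m in enumerate(means)}
    entry = [s for s in states if s[1] == 0]
    exits = [s for s in states if s[1] == len(syllables[s[0]]) - 1]

    inf = float("inf")
    # A path must start in the first state of some syllable.
    cost = {s: (frames[0] - mean[s]) ** 2 if s in entry else inf for s in states}
    back = []
    for x in frames[1:]:
        best_exit = min(exits, key=lambda s: cost[s])
        new_cost, ptr = {}, {}
        for s in states:
            name, i = s
            cands = [(cost[s], s)]  # stay in the same state
            if i > 0:
                cands.append((cost[(name, i - 1)], (name, i - 1)))  # advance
            else:
                # null arc: leave the best finished syllable and re-enter the loop
                cands.append((cost[best_exit] + null_penalty, best_exit))
            c, p = min(cands, key=lambda cp: cp[0])
            new_cost[s] = c + (x - mean[s]) ** 2
            ptr[s] = p
        cost = new_cost
        back.append(ptr)

    # Trace back from the cheapest syllable-final state.
    s = min(exits, key=lambda st: cost[st])
    path = [s]
    for ptr in reversed(back):
        s = ptr[s]
        path.append(s)
    path.reverse()

    # A new syllable starts whenever state 0 is entered from a different state.
    seq = [path[0][0]]
    for prev, cur in zip(path, path[1:]):
        if cur[1] == 0 and prev != cur:
            seq.append(cur[0])
    return seq


# Toy models with made-up numbers, for illustration only.
toy = {"ya": [1.0, 2.0], "ka": [5.0, 6.0], "n": [9.0, 9.5]}
print(decode_loop([1.0, 1.1, 2.0, 5.0, 6.0, 6.1, 9.0, 9.5], toy))
```

The returned sequence plays the role of the linear chain that the text treats as the error model.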

[0048] In practice, each syllable in a word model is composed of a plurality of subwords. For example, when one syllable consists of three subwords, a normal word model is represented as shown in FIG. 7(a), and the corresponding general-purpose model is represented as shown in FIG. 7(b).

[0049] To perform recognition against the general-purpose model, a continuous word recognition algorithm may be used, such as the One Pass DP method described in Kiyohiro Shikano, "Basics of Speech Recognition" (May 17, 1996), pp. 42-43.

[0050] When recognition learning is performed with such a speech model learning device, the recognition result for the general-purpose model is the syllable string model that the device recognizes as closest to the learning utterance. That is, discriminative learning between the model representing the correct utterance content and the model closest to the actual utterance (i.e., a character string that is easily confused with it) is performed automatically. Thus, in the present embodiment, the correct model and the confusion model are generated automatically, and discriminative learning against the easily confused model is performed. A dictionary containing every word that might be confused could be used instead of the general-purpose model, but the number of words in such a dictionary could become very large, which is not practical.

[0051] This speech model learning device can also learn from a plurality of input utterances at the same time. In that case, the dictionary unit 102 holds a syllable string model for each input utterance and a general-purpose model corresponding to each of them. Through execution of recognition learning, each general-purpose model becomes the confusion model closest to its input utterance.

[0052] In the present embodiment, dedicated models for various categories can be generated depending on how the general-purpose model is constructed. For example, if the general-purpose model represents an arbitrary monosyllable and learning is performed with monosyllable speech, a discriminative model dedicated to monosyllables can be created. Likewise, if the general-purpose model represents an arbitrary string of connected digits, a discriminative model dedicated to connected digits can be created. Thus, according to the present embodiment, a discriminative model suited to speech of a specific category can be generated by discriminative learning between only two models.

[0053] Next, a speech recognition device according to a third embodiment of the present invention will be described with reference to FIGS. 8 to 10. As shown in FIG. 8, the speech recognition device according to the present embodiment has: a speech analysis unit 501 that receives speech and calculates and outputs its acoustic feature amounts for each fixed time interval (frame); a dictionary unit 502 that holds a dictionary D containing a pair of word models for speech recognition; a calculation unit 503 that receives the per-frame feature amounts output from the speech analysis unit 501, aligns each word model in the dictionary unit 502 with the input speech based on those feature amounts, and calculates and outputs the likelihood; and an output unit 504 that determines and outputs the recognition result from the likelihoods output by the calculation unit 503.

[0054] As shown in FIG. 9, for example, the dictionary unit 502 holds the two word models to be recognized and the subword models that constitute each word model. In the example of FIG. 9, a word model is held for each of the words "Yoshida" and "Uchida". Each word model is formed by connecting, in time series, the syllable models that make up the word. That is, "Yoshida" consists of model M1 representing "yo", model M2 representing "shi", and model M3 representing "da", and "Uchida" consists of model M4 representing "u", model M5 representing "chi", and model M6 (= M3) representing "da". Models M1 to M6 are speech models obtained by the speech model learning device according to the first embodiment. Specifically, models M1 and M4 are obtained by discriminative learning using the pair of syllables "yo" and "u", and models M2 and M5 are obtained by discriminative learning using the pair of syllables "shi" and "chi". Models M3 and M6, however, represent the same syllable and therefore cannot be obtained with the speech model learning device of the first embodiment, which learns from a pair of different monosyllables. They must be prepared separately, for example by using the EM algorithm.
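The assembly of the two word models from pair-trained syllable models can be sketched as a simple lookup. This is only an illustration: the function `build_word_pair`, the dictionaries `pair_models` and `shared_models`, and the use of plain labels in place of real models are all assumptions, and the sketch assumes both words have the same number of syllables, as in the "Yoshida"/"Uchida" example.

```python
def build_word_pair(word_a, word_b, pair_models, shared_models):
    """Build two word models by chaining syllable models in utterance order.

    word_a, word_b: lists of syllables of equal length (assumption).
    pair_models: {(syllable, rival): model}, one entry per direction of each
                 discriminatively trained syllable pair.
    shared_models: {syllable: model} for positions where both words have the
                   same syllable, prepared separately (e.g. with EM).
    """
    model_a, model_b = [], []
    for syl_a, syl_b in zip(word_a, word_b):
        if syl_a == syl_b:
            # Same syllable in both words: no pairwise model exists for it.
            model_a.append(shared_models[syl_a])
            model_b.append(shared_models[syl_b])
        else:
            model_a.append(pair_models[(syl_a, syl_b)])
            model_b.append(pair_models[(syl_b, syl_a)])
    return model_a, model_b


# Labels M1..M5 and M3 follow the example in the text; real entries would be models.
pair_models = {("yo", "u"): "M1", ("u", "yo"): "M4",
               ("shi", "chi"): "M2", ("chi", "shi"): "M5"}
shared_models = {"da": "M3"}
yoshida, uchida = build_word_pair(["yo", "shi", "da"], ["u", "chi", "da"],
                                  pair_models, shared_models)
print(yoshida, uchida)
```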

[0055] Each of the monosyllable models M1 to M6 is represented as a time series of subword models. Here, a model composed of nine subwords is shown as an example. Each subword contains information representing the per-frame feature amounts and their standard deviations.

[0056] The operation of the speech recognition device according to the present embodiment will now be described with reference also to FIG. 10.

[0057] First, in step S71, the speech analysis unit 501 divides the input speech into frames, performs frequency analysis of the speech in each frame, and calculates and outputs the feature amounts. Usable feature amounts include the speech power, the change in power, the cepstrum, and the change in cepstrum.
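As a rough illustration of step S71, the sketch below splits a sample sequence into frames and computes two of the feature amounts named in the text, power and its frame-to-frame change. It is a minimal sketch, not the analysis used in the patent: the function name `frame_features`, the frame length and hop, the decibel scale, and the small floor added before the logarithm are all assumptions, and the cepstral features are omitted.

```python
import math


def frame_features(samples, frame_len, hop):
    """Return one (log_power, delta_log_power) tuple per frame."""
    feats = []
    prev = None
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = samples[start:start + frame_len]
        energy = sum(x * x for x in frame) / frame_len
        # Floor avoids log(0) on silent frames (an assumption of this sketch).
        log_power = 10.0 * math.log10(energy + 1e-12)
        # Change in power relative to the previous frame; 0 for the first frame.
        delta = 0.0 if prev is None else log_power - prev
        feats.append((log_power, delta))
        prev = log_power
    return feats


# A silent frame followed by a louder one, with made-up sample values.
print(frame_features([0.0] * 4 + [1.0] * 4, frame_len=4, hop=4))
```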

[0058] Next, in step S72, the calculation unit 503 uses the input per-frame feature amounts to obtain the likelihood of each word model in the dictionary held in the dictionary unit 502, and outputs the likelihoods to the output unit 504. The likelihood can be obtained by dynamic time warping (DTW) based on dynamic programming (DP). DTW is described, for example, in Kiyohiro Shikano, "Basics of Speech Recognition" (May 17, 1996), pp. 32-34.

[0059] Finally, in step S73, the output unit 504 compares the likelihoods of the word models output by the calculation unit 503 and outputs the word model with the higher likelihood as the word closest to the input speech (i.e., as the recognition result).
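Steps S72 and S73 can be sketched with a textbook DTW recursion followed by a comparison of scores. The sketch makes several simplifying assumptions: features and templates are plain numbers rather than vectors with standard deviations, the local cost is a squared difference, the negated DTW cost stands in for the likelihood, and the names `dtw_cost` and `recognize` are hypothetical.

```python
def dtw_cost(frames, template):
    """Accumulated cost of the best monotonic alignment of frames to template."""
    inf = float("inf")
    n, m = len(frames), len(template)
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = (frames[i - 1] - template[j - 1]) ** 2
            # Allowed moves: repeat a template state, skip ahead, or advance both.
            d[i][j] = local + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][m]


def recognize(frames, word_models):
    """Score every word model (higher is better) and return the best word."""
    scores = {word: -dtw_cost(frames, tmpl) for word, tmpl in word_models.items()}
    return max(scores, key=scores.get), scores


# Toy templates with made-up numbers; the shared last value mimics the common "da".
models = {"yoshida": [1.0, 2.0, 3.0], "uchida": [5.0, 6.0, 3.0]}
best, scores = recognize([1.0, 1.0, 2.0, 3.0, 3.0], models)
print(best, scores)
```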

[0060] Thus, because the present embodiment recognizes words using the speech models obtained by the speech model learning device of the first embodiment, high-accuracy speech recognition can be achieved without recognition learning on the words themselves. In other words, speech recognition can be performed with high accuracy for any word without performing recognition learning on that word.

[0061] Next, a speech recognition device according to a fourth embodiment of the present invention will be described. Whereas the speech recognition device of the third embodiment recognizes the pair of words held in the dictionary unit 502, the speech recognition device of the present embodiment enables recognition of three or more words.

[0062] In the present embodiment, the dictionary unit 502 holds a dictionary D containing three or more word models. Each word model is formed by connecting, in time series, models obtained through syllable-pair recognition learning with the speech model learning device of the first embodiment. Consequently, three pairs of word models are held as dictionary D when three words are handled, and six pairs when four words are handled.
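The pair counts quoted above are the number of unordered word pairs, n(n-1)/2 for n words. A minimal sketch that enumerates them follows; the helper name `word_pairs` and the words other than "yoshida" and "uchida" are hypothetical.

```python
from itertools import combinations


def word_pairs(words):
    """List every unordered pair of distinct words in the vocabulary."""
    return list(combinations(words, 2))


three = word_pairs(["yoshida", "uchida", "ishida"])
four = word_pairs(["yoshida", "uchida", "ishida", "ikeda"])
print(len(three), len(four))  # 3 and 6, matching the counts in the text
```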

[0063] The calculation unit 503 calculates the likelihood of each word for every word pair contained in dictionary D, in the same way as in the third embodiment. For a word pair (s, t) whose word likelihoods are L(s) and L(t), the relative likelihood of word s with respect to word t is given by Equation 6:

[0064]

[Equation 6]

Consequently, the relative likelihood average of word s is given by Equation 7:

[0065]

[Equation 7]

The calculation unit 503 obtains this relative likelihood average for each word and outputs it to the output unit 504.

[0066] The output unit 504 compares the relative likelihood averages output by the calculation unit 503 and outputs the word model with the largest value as the recognition result.
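The averaging and selection steps of the fourth embodiment can be sketched as follows. Because the exact forms of Equations 6 and 7 are not reproduced in this text, the relative likelihood is passed in as a function rather than fixed. The simple difference used in the usage lines is purely a placeholder assumption, as are the names `recognize_by_relative_average` and `pair_likelihoods` and the made-up scores.

```python
def recognize_by_relative_average(pair_likelihoods, relative):
    """Pick the word with the largest average relative likelihood.

    pair_likelihoods: {(s, t): (L_s, L_t)} with one entry per unordered word
                      pair, scored with the models trained for that pair.
    relative: function (own, other) -> relative likelihood; its true form
              (Equation 6) is not given here, so the caller supplies it.
    """
    totals, counts = {}, {}
    for (s, t), (l_s, l_t) in pair_likelihoods.items():
        for word, own, other in ((s, l_s, l_t), (t, l_t, l_s)):
            totals[word] = totals.get(word, 0.0) + relative(own, other)
            counts[word] = counts.get(word, 0) + 1
    # Average over all pairs in which the word takes part.
    averages = {w: totals[w] / counts[w] for w in totals}
    return max(averages, key=averages.get), averages


# Placeholder relative likelihood: plain difference of (log-)likelihoods.
scores = {("a", "b"): (-1.0, -4.0), ("a", "c"): (-2.0, -3.0), ("b", "c"): (-5.0, -2.0)}
best, averages = recognize_by_relative_average(scores, lambda own, other: own - other)
print(best, averages)
```

Passing the function in keeps the sketch usable with whatever form the equations actually take.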

[0067] In this way, the speech recognition device of the present embodiment can recognize words from a vocabulary of three or more words.

[0068]

[Effects of the Invention] According to the present invention, the property of discriminative learning that model accuracy depends on the learning dictionary is exploited: by narrowing the learning dictionary down to the minimum of two syllables, a high-accuracy syllable discrimination model that specializes in distinguishing only those two syllables can be generated. Therefore, once a specific language is assumed as the target of speech recognition, the speech pairs needed for recognition are determined, and if speech models are generated in advance for all of the determined syllable pairs, a speech recognition device with high accuracy for arbitrary words can be realized.

[Brief Description of the Drawings]

FIG. 1 is a block diagram showing one embodiment of the present invention.

FIG. 2 is a flowchart for explaining the operation of the speech model learning device of FIG. 1.

FIG. 3 is a diagram for explaining the monosyllable model and the subword model.

FIG. 4 is a diagram for explaining the operation of the control unit of FIG. 1.

FIG. 5 is a diagram for explaining the general-purpose model.

FIG. 6 is a diagram for explaining path selection in the general-purpose model.

FIG. 7(a) is a diagram for explaining an actual word model, and FIG. 7(b) is a diagram for explaining an actual general-purpose model.

FIG. 8 is a block diagram showing the third embodiment of the present invention.

FIG. 9 is a diagram for explaining the dictionary D of the dictionary unit 502 of FIG. 8.

FIG. 10 is a flowchart for explaining the operation of the speech recognition device of FIG. 8.

[Explanation of symbols]

101 speech analysis unit
102 dictionary unit
103 pattern matching unit
104 calculation unit
105 correction unit
106 control unit
501 speech analysis unit
502 dictionary unit
503 calculation unit
504 output unit

Claims

1. A speech model learning device comprising: a speech analysis unit that calculates a feature amount of input speech at predetermined time intervals; a dictionary unit that stores a plurality of learning word patterns; a pattern matching unit that matches the feature amount at each predetermined time interval against each of the learning word patterns and calculates the correspondence with, and the likelihood of, each learning word pattern; a calculation unit that calculates a parameter correction amount for each learning word pattern based on the correspondence and the likelihood; and a correction unit that corrects, based on the parameter correction amount, the parameters of the plurality of learning word patterns stored in the dictionary unit, wherein the dictionary unit stores a pair of mutually different monosyllable models as the learning word patterns.

2. The speech model learning device according to claim 1, further comprising a control unit that determines, based on the likelihood, whether to perform the learning process again using the learning word patterns whose parameters have been corrected by the correction unit, and that, when the learning process is to be performed again, instructs the pattern matching unit to execute it.

3. The speech model learning device according to claim 1, wherein the dictionary unit holds, as the word patterns, a pair of word models instead of the pair of different monosyllable models, one word model of the pair being a word model corresponding to a predetermined input speech and the other being the word model closest to the predetermined input speech as obtained by a continuous word recognition algorithm.

4. The speech model learning device according to claim 1, wherein the dictionary unit holds, as the word patterns, a plurality of pairs of word models instead of the pair of different monosyllable models, one word model of each pair being a word model corresponding to a different predetermined input speech and the other word model of each pair being the word model closest to that predetermined input speech as obtained by a continuous word recognition algorithm.

5. A speech recognition device comprising: a speech analysis unit that calculates a feature amount of input speech at predetermined time intervals; a dictionary unit that stores a pair of word patterns; a calculation unit that matches the feature amount at each predetermined time interval against each of the pair of word patterns and calculates the likelihood of each of the pair of word patterns; and an output unit that determines, based on the likelihood, which of the pair of word patterns corresponds to the input speech and outputs it, wherein each of the pair of word patterns stored in the dictionary unit is formed by connecting monosyllable models in order of utterance, and each monosyllable model is generated by discriminative learning using, as a pair of learning word patterns, that model and the monosyllable model at the same position in the utterance order of the other word pattern.

6. A speech recognition device comprising: a speech analysis unit that calculates a feature amount of input speech at fixed time intervals; a dictionary unit that stores word patterns corresponding to a plurality of words; a matching unit that matches the feature amount at each fixed time interval against each of the word patterns and calculates the likelihood of each word pattern; and an output unit that determines, from the likelihoods of the word patterns, the word pattern closest to the input speech and outputs it, wherein the word pattern corresponding to each of the plurality of words is formed by connecting, in order of utterance, the monosyllable models constituting that word, and each word is provided with as many word patterns as there are other words, so that each monosyllable model constituting the word is generated by discriminative learning using, as a pair of learning word patterns, that model and the monosyllable model at the same position in the utterance order of another word.

7. A speech model learning method comprising: a first step of calculating a feature amount at fixed time intervals for an input monosyllable utterance; a second step of matching the feature amount at each fixed time interval against each of a pair of learning monosyllable patterns stored in advance, and obtaining the correspondence between the monosyllable utterance and each learning monosyllable pattern together with the likelihood; a third step of calculating a parameter correction amount for each learning monosyllable pattern based on the correspondence and the likelihood; and a fourth step of correcting, based on the parameter correction amount, the parameters of the learning monosyllable patterns stored in advance.

8. The speech model learning method according to claim 7, further comprising a fifth step of determining, after the fourth step and based on the likelihood, whether to perform the learning process again using the learning patterns having the corrected parameters, and repeating the second and subsequent steps when the learning process is to be performed again.

9. A speech recognition method comprising: a first step of calculating a feature amount of input speech at fixed time intervals; a second step of matching the feature amount at each fixed time interval against each of a pair of word patterns stored in advance and calculating the likelihood of each of the pair of word patterns; and a third step of determining, based on the likelihood, which of the pair of word patterns corresponds to the input speech and outputting it, wherein word models formed by connecting monosyllable models obtained by the speech model learning method according to claim 7 are used as the pair of word patterns.