JPH07104783A

JPH07104783A - Voice recognition device and method for generating standard pattern of recognition basic unit

Info

Publication number: JPH07104783A
Application number: JP5246587A
Authority: JP
Inventors: Akio Amano; 明雄天野; Hiroaki Kokubo; 浩明小窪; Nobuo Hataoka; 信夫畑岡
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 1993-10-01
Filing date: 1993-10-01
Publication date: 1995-04-21

Abstract

PURPOSE: To cope with coarticulation and with speaker-dependent deformation of speech patterns by creating standard patterns while grouping similar speakers through clustering, and by building standard patterns of larger units using the correspondence between phonemes and those standard patterns. CONSTITUTION: Input speech is converted into an electric signal by a speech input means 1 and analyzed by a speech analysis means 2, which outputs a time series of feature vectors. A standard pattern connecting means 33 concatenates the standard patterns of recognition basic units stored in advance in a standard pattern storage means 4, in accordance with the phoneme-recognition basic unit correspondence information stored in a phoneme-recognition basic unit correspondence storage means 34 and a word dictionary 35, to construct word standard patterns. A collation part 31 collates the standard pattern of each word generated by the standard pattern connecting means 33 with the feature-vector time series of the input speech to obtain the similarity between the input speech and each word to be recognized. A decision part 32 outputs the recognition result on the basis of the similarity.

Description

Detailed Description of the Invention

[0001]

[Field of Industrial Application] The present invention relates to a speech recognition apparatus that uses, as standard patterns, speech segments shorter in duration than a phoneme, and to a method for creating such standard patterns.

[0002]

[Prior Art] Several units can be considered for the standard patterns of a speech recognition apparatus, such as words, syllables, and phonemes (units subdivided down to individual vowels and consonants). Which unit to use is decided by weighing several factors: the size of the target vocabulary, the degree of generality the apparatus is meant to have, the ability of the standard patterns to represent speech phenomena, and the amount of training speech data available for training the standard patterns.

[0003] First, consider the number of standard patterns. A small-vocabulary speech recognition apparatus, for example one that recognizes only the ten digits, often uses word-level standard patterns. In contrast, an apparatus that must recognize arbitrary Japanese sentences would need an enormous number of word-level standard patterns (for example, 100,000 words), which is practically infeasible. Such systems therefore use a smaller (shorter-duration) unit, the syllable or the phoneme, as the standard-pattern unit. They either integrate syllable-level or phoneme-level recognition results into a sentence recognition result, or concatenate syllable or phoneme standard patterns as needed to build word-level standard patterns and recognize with those. Japanese has about 120 syllables and about 40 phonemes, so preparing that many standard patterns covers all of Japanese. In short, small speech units cover a wide range of recognition targets with few standard patterns, whereas large speech units require many standard patterns.

[0004] Next, consider the ability of the standard patterns to represent speech phenomena. As noted above, Japanese speech can in principle be represented with about 120 syllables, but the speech pattern of a given syllable is strongly deformed by the preceding and following syllables (this is called coarticulation). The speech pattern of a given syllable also varies strongly from speaker to speaker. When a single standard pattern is assigned to each syllable or phoneme, that one pattern must represent speech patterns that differ by context and by speaker. A single representative pattern is thus required to express many variations, and its representational ability is inherently limited. If, on the other hand, standard patterns are registered for large units such as words, each pattern includes the deformations caused by the context of the syllables that make up the word, so the effects of coarticulation are absorbed into the standard pattern. From the standpoint of representing speech phenomena (coarticulation in particular), large speech units are therefore preferable for standard patterns.

[0005] Next, consider the amount of available training speech data. As noted above, small speech units keep the total number of standard patterns low, and large units increase it. If the total amount of training speech data is fixed, smaller units therefore give more training data per standard pattern and yield more reliable standard patterns. Conversely, for standard patterns of comparable reliability, larger units require more training speech data.

[0006] Given the above, the conventional wisdom is to use word-level standard patterns when the vocabulary is small, and standard patterns of syllables, phonemes, or comparable units when it is large. Refinements in the prior art include mixing standard patterns of different sizes, building standard patterns of larger units by concatenating those of smaller units, and interpolating between standard patterns with sufficient training data and standard patterns with insufficient training data. Each of these prior-art techniques is briefly described below.

[0007] An example that uses word-level standard patterns is the digit recognition described in "On the Application of Vector Quantization and Hidden Markov Models to Speaker-Independent, Isolated Word Recognition," The Bell System Technical Journal, Vol. 62, No. 4, April 1983. There, a word-level standard pattern (in this prior-art example a hidden Markov model, hereinafter HMM) is registered for each digit. As noted above, this approach suits small vocabularies, but registration becomes difficult as the vocabulary grows.

[0008] An example that uses phoneme-level standard patterns is described in "Performance Evaluation of the HMM-LR Speech Recognition System," IEICE Technical Report SP89-94 (1989-12): phonemes are excised from training speech data uttered word by word and used to train the standard patterns (HMMs in this prior-art example). In this example the phoneme-level standard patterns are prepared independently of the surrounding context, and neither coarticulation nor speaker-dependent deformation is taken into account.

[0009] An example that uses syllable-level standard patterns is described in "Evaluation of the Syntax/Semantics-Driven Japanese Continuous Speech Understanding System SPOJUS-SYNO/SEMO," IEICE Technical Report SP89-96 (1989-12): training speech data uttered syllable by syllable is used to train the standard patterns (HMMs in this prior-art example). In this example the syllable-level standard patterns are prepared independently of the surrounding context, so coarticulation is not taken into account. Speaker-dependent deformation is not considered either.

[0010] An example that uses standard patterns of several speech units of different sizes is described in "A Continuous Speech Recognition System Based on CV and Word Spotting," IEICE Technical Report SP88-96 (1988-12), which combines syllable-level and word-level standard patterns. In this example, syllable-level standard patterns are prepared, and word-level standard patterns are additionally prepared for frequent words. At recognition time both sets of standard patterns are used to obtain syllable-level and word-level recognition results, which are integrated into the final recognition result. Coarticulation is thus handled for the frequent words but not for any other words. Speaker-dependent deformation is not considered either.

[0011] An example that concatenates standard patterns of small units to create standard patterns of large units is described in "A Study of Training-Word Selection Methods for Large-Vocabulary Word Speech Recognition Based on Concatenation of Syllable Patterns," Proceedings of the Acoustical Society of Japan, 2-2-10 (1988-3): syllable-level standard patterns are created as initial standard patterns, and word-level standard patterns are created by concatenating them. In this example the initial standard patterns are created from syllables uttered in isolation and are then averaged with syllables excised from training speech data uttered word by word. This processing adapts the syllable-level standard patterns to word utterances, but it does not address coarticulation. Speaker-dependent deformation is not considered either.

[0012]

[Problem to Be Solved by the Invention] An object of the present invention is to provide means for creating and using standard patterns that can cope with coarticulation and with speaker-dependent deformation of speech patterns, both of which were insufficiently addressed in the prior art described above.

[0013]

[Means for Solving the Problem] The above object is achieved by preparing highly reliable standard patterns for speech units smaller than a phoneme, creating those standard patterns while grouping similar speakers through clustering, and then building standard patterns of larger units using the correspondence between phonemes and these standard patterns.

[0014]

[Operation] According to the present invention, setting standard patterns smaller (shorter in duration) than a phoneme limits the degree of deformation each standard pattern must cover, so each standard pattern is created with high accuracy. Because similar speakers are grouped by clustering while the standard patterns are created, the resulting standard patterns cover the variety of speakers accurately. In addition, acoustic deformations that depend on the phonemic context are described in the phoneme-recognition basic unit standard pattern correspondence information. These three effects combine to enable highly accurate recognition.

[0015]

[Embodiment] An embodiment of the present invention is described below with reference to the drawings. FIG. 1 is a block diagram showing one embodiment of the speech recognition apparatus of the present invention. Input speech is converted into an electric signal by the speech input means 1. The resulting signal is analyzed by the speech analysis means 2, which outputs a time series of feature vectors. Meanwhile, the standard pattern connecting means 33 concatenates the standard patterns of the recognition basic units stored in advance in the standard pattern storage means 4, in accordance with the phoneme-recognition basic unit correspondence information stored in the phoneme-recognition basic unit correspondence storage means 34 and the word dictionary 35, to construct word standard patterns. The collation unit 31 collates each word standard pattern created by the standard pattern connecting means 33 with the feature-vector time series of the input speech, yielding the similarity between the input speech and each word to be recognized. The determination unit 32 outputs the recognition result based on these similarities; for example, the word with the highest similarity is taken as the recognition result.

[0016] Next, the standard patterns of the recognition basic units used in the present invention are described. FIG. 2 shows the probabilistic model (hidden Markov model, hereinafter HMM) corresponding to a standard pattern used in the present invention. Each circle in the figure represents a state, and each arrow represents a transition between states. The symbol a_ij attached to an arrow denotes the probability of a transition from state i to state j, and the symbol b_ij(k) denotes the probability that a feature vector belonging to the k-th class is observed when the transition from state i to state j occurs. With these parameters, the posterior probability that the feature-vector time series of speech data was output by this probabilistic model can be computed. Estimating the values of a_ij and b_ij(k) from training speech data corresponds to creating a standard pattern. Known techniques can be used for the concrete computation of this posterior probability and for parameter estimation, for example those described in "Automatic Speech Recognition," Kluwer Academic Publishers, Norwell, MA, 1989, pp. 20-26.
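As a rough illustration of how the parameters a_ij and b_ij(k) yield a probability for an observed sequence, the following Python sketch implements the forward algorithm for an HMM whose output symbols are attached to transitions, as in FIG. 2. The function name, the toy two-state model, and the convention that the model starts in state 0 are assumptions made for this example; they are not taken from the cited reference.

```python
def forward_probability(a, b, observations, start=0, final=None):
    """Probability that the HMM emits `observations` (a list of class indices k).

    a[i][j]    : probability of the transition from state i to state j
    b[i][j][k] : probability of observing class k on the transition i -> j
    start      : assumed initial state (hypothetical convention)
    final      : if given, only paths ending in this state are counted
    """
    n = len(a)
    alpha = [0.0] * n
    alpha[start] = 1.0
    for k in observations:
        nxt = [0.0] * n
        for i in range(n):
            if alpha[i] == 0.0:
                continue
            for j in range(n):
                nxt[j] += alpha[i] * a[i][j] * b[i][j][k]
        alpha = nxt
    return sum(alpha) if final is None else alpha[final]


# Toy left-to-right model with two states and two observation classes.
a = [[0.6, 0.4],
     [0.0, 1.0]]
b = [[[0.9, 0.1], [0.5, 0.5]],
     [[0.0, 0.0], [0.2, 0.8]]]

p = forward_probability(a, b, [0, 1])
```

In this sketch the cost grows linearly with the sequence length and quadratically with the number of states, which is why short recognition basic units with few states remain inexpensive to evaluate.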

[0017] Next, the phoneme-recognition basic unit correspondence information stored in the phoneme-recognition basic unit correspondence storage means 34 is described. The correspondence between phonemes and recognition basic units can be represented in several ways: a phoneme can be represented as a sequence of recognition basic units, as a network of recognition basic units, or as an HMM whose symbols are recognition basic units. FIG. 3 shows the representation as sequences of recognition basic units. In the example of FIG. 3, syllables are associated with sequences of recognition basic units. For example, the syllable /a/ is stated to be either the sequence of recognition basic units N1 and N2 or the sequence N9 and N10. Likewise, the syllable /i/ is stated to be the recognition basic unit N3, the sequence N8 and N9, or the sequence N32 and N90. The other syllables are described in the same way, ending with the moraic nasal /N/, which consists of the recognition basic unit N130.
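One possible in-memory form of the sequence-style correspondence of FIG. 3 is a table from each syllable to its alternative unit sequences. The Python sketch below is only an assumed data layout; the names `UNIT_SEQUENCES` and `alternatives` are hypothetical, and only the three syllables quoted in the text are filled in.

```python
# Each syllable maps to a list of alternative sequences of recognition basic units.
UNIT_SEQUENCES = {
    "a": [("N1", "N2"), ("N9", "N10")],
    "i": [("N3",), ("N8", "N9"), ("N32", "N90")],
    "N": [("N130",)],  # moraic nasal
}


def alternatives(syllable):
    """Return every unit sequence recorded for the given syllable."""
    return UNIT_SEQUENCES[syllable]
```

Keeping several alternatives per syllable is what lets the table carry context-dependent variants of the same syllable.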

[0018] Next, the standard pattern connecting means used in the present invention is described with reference to FIG. 4. FIG. 4 shows an example in which the HMMs of recognition basic units are concatenated to form the HMM of a word, taking the word "Hitachi" (/hitachi/) as an example. First, the word dictionary 35 shows that "Hitachi" consists of the syllable sequence /hi/, /ta/, /chi/. The phoneme-recognition basic unit correspondence information stored in the phoneme-recognition basic unit correspondence storage means 34 then shows which recognition basic units make up each syllable. In the example of FIG. 4, the syllable /hi/ is either the sequence of recognition basic units N1 and N2 or the sequence N1 and N4; the syllable /ta/ is the sequence N5 and N6; and the syllable /chi/ is the sequence N12 and N9. Two HMMs are concatenated by directing the last state transition of the preceding HMM into the first state of the following HMM. Concatenating in order the HMMs for the syllables /hi/, /ta/, /chi/ therefore yields the HMM of the word "Hitachi". Because there are two unit sequences for the syllable /hi/ in this example, there are also two HMMs for "Hitachi": the first is the sequence N1, N2, N5, N6, N12, N9, and the second is the sequence N1, N4, N5, N6, N12, N9.
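The expansion of a word into its alternative unit sequences can be sketched in Python as follows. This covers only the bookkeeping of which recognition basic units are chained, not the joining of HMM states; the names `WORD_DICTIONARY`, `UNIT_SEQUENCES`, and `word_unit_sequences` are hypothetical, and the table entries are the ones quoted for FIG. 4.

```python
from itertools import product

# Word dictionary: word -> syllable sequence (the role of word dictionary 35).
WORD_DICTIONARY = {"hitachi": ["hi", "ta", "chi"]}

# Correspondence information: syllable -> alternative unit sequences (the role of storage means 34).
UNIT_SEQUENCES = {
    "hi": [("N1", "N2"), ("N1", "N4")],
    "ta": [("N5", "N6")],
    "chi": [("N12", "N9")],
}


def word_unit_sequences(word):
    """List every unit sequence obtainable by choosing one alternative per syllable."""
    choices = [UNIT_SEQUENCES[s] for s in WORD_DICTIONARY[word]]
    return [tuple(unit for seq in combo for unit in seq)
            for combo in product(*choices)]


sequences = word_unit_sequences("hitachi")
```

The number of word-level sequences is the product of the number of alternatives per syllable, so a practical system would presumably limit how many alternatives are stored for each syllable.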

[0019] Next, the method of training the recognition basic units used in the present invention is described with reference to FIG. 5, a flowchart of the training flow. First, the training speech is segmented according to a predetermined criterion into small (short-duration) speech segments; segmenting at fixed time intervals, for example, is sufficient. Next, one HMM is created for each segment and used as an initial model. The HMMs are created with known techniques, for example those described in "Automatic Speech Recognition," Kluwer Academic Publishers, Norwell, MA, 1989, pp. 20-26. Parameter estimation is then applied again to the initial models to update each HMM. Next, the HMMs are clustered, that is, similar HMMs are gathered into clusters. Many clustering methods exist; here, consider for example merging the two most similar HMMs into a single cluster. With this method, each clustering pass reduces the number of clusters by one. As the measure of similarity between HMMs, the Kullback information described in "Prediction and Evaluation of Acoustic Variation in Unknown Phoneme Contexts Using a Tree-Structured Phoneme Model," IEICE Technical Report SP90-64 (1990-12) can be used, for example. Parameter re-estimation and clustering are repeated until a predetermined convergence condition is met, for example that the number of HMMs falls below a predetermined number, and the resulting HMMs are stored as the standard patterns of the recognition basic units.
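The merge-two-closest loop of FIG. 5 can be sketched in Python under strong simplifications. Each model is reduced here to a single discrete output distribution with a segment count, similarity is measured with a symmetric Kullback-Leibler divergence (an assumed concrete form of the Kullback information), and merging is a count-weighted average that stands in for the parameter re-estimation step. All function names are hypothetical.

```python
import math


def symmetric_kl(p, q, eps=1e-12):
    """Symmetric Kullback-Leibler divergence between two discrete distributions."""
    return sum((pi - qi) * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q))


def merge_closest(models):
    """Merge the two most similar models; models is a list of (distribution, count)."""
    best = None
    for i in range(len(models)):
        for j in range(i + 1, len(models)):
            d = symmetric_kl(models[i][0], models[j][0])
            if best is None or d < best[0]:
                best = (d, i, j)
    _, i, j = best
    (p, n), (q, m) = models[i], models[j]
    merged = ([(n * pi + m * qi) / (n + m) for pi, qi in zip(p, q)], n + m)
    return [mod for k, mod in enumerate(models) if k not in (i, j)] + [merged]


def cluster(models, target_count):
    """Repeat merging until only target_count models remain (one fewer per pass)."""
    while len(models) > target_count:
        models = merge_closest(models)
    return models


clustered = cluster([([0.9, 0.1], 1), ([0.8, 0.2], 1), ([0.1, 0.9], 1)], 2)
```

Each pass compares every pair of models, so the total cost grows roughly with the cube of the number of initial segments; starting from fixed-interval segments of a large corpus would call for pruning or batching that this sketch omits.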

[0020] Next, another method of training the recognition basic units used in the present invention is described with reference to FIG. 6, a flowchart of this alternative training flow. Whereas the training flow of FIG. 5 starts from many initial models and successively reduces their number by clustering, the method of FIG. 6 creates a single initial model and successively splits it to increase the number of models. First, the training speech is segmented according to a predetermined criterion into small (short-duration) speech segments; segmenting at fixed time intervals, for example, is sufficient. Next, a single HMM common to all the segments is created and used as the initial model. Parameter estimation is then applied again to the initial model to update the HMM. Next, each HMM is split into two by slightly modifying its parameters. For example, taking the HMM of FIG. 2 as the HMM before splitting, two models with slightly modified values of b_ij(k) are created; the two resulting models differ slightly from the original. The training speech is then recognized using the doubled set of HMMs. This associates each HMM with segmented speech segments, and the parameters are re-estimated using this new association. Parameter re-estimation, splitting in two, and recognition of the training speech are repeated until a predetermined convergence condition is met, for example that the number of HMMs reaches a predetermined number (for example, 256), and the resulting HMMs are stored as the standard patterns of the recognition basic units.
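The split-and-re-estimate loop of FIG. 6 can be sketched in Python in the same simplified setting: each model is a single discrete output distribution, each training segment is a histogram of observed classes, and "recognizing" a segment means assigning it to the model with the highest log-likelihood. These simplifications and all function names are assumptions made for illustration, not the procedure of the embodiment itself.

```python
import math


def normalize(v):
    s = sum(v)
    return [x / s for x in v]


def log_likelihood(hist, dist, eps=1e-12):
    """Log-likelihood of a class histogram under a discrete distribution."""
    return sum(c * math.log(p + eps) for c, p in zip(hist, dist))


def split(dist, delta=0.01):
    """Create two slightly different copies by nudging b in opposite directions."""
    up = normalize([p * (1 + delta * (-1) ** k) for k, p in enumerate(dist)])
    down = normalize([p * (1 - delta * (-1) ** k) for k, p in enumerate(dist)])
    return up, down


def reestimate(models, segments):
    """Assign each segment to its best model, then re-estimate from the assignments."""
    sums = [[0.0] * len(m) for m in models]
    for hist in segments:
        best = max(range(len(models)),
                   key=lambda i: log_likelihood(hist, models[i]))
        sums[best] = [s + c for s, c in zip(sums[best], hist)]
    # A model that attracted no segments keeps its previous parameters.
    return [normalize(s) if sum(s) > 0 else m for s, m in zip(sums, models)]


def train_by_splitting(segments, target_count):
    """Start from one common model and double the model count each round."""
    models = reestimate([[1.0] * len(segments[0])], segments)
    while len(models) < target_count:
        models = [half for m in models for half in split(m)]
        models = reestimate(models, segments)
    return models


units = train_by_splitting([[9, 1], [8, 2], [1, 9], [2, 8]], 2)
```

Because the model count doubles in every round, a target that is a power of two, such as the 256 mentioned in the text, is reached exactly; other targets would be overshot by this sketch.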

[0021] Next, the means for learning the phoneme-recognition basic unit correspondence used in the present invention is described with reference to FIG. 7, a flowchart of the learning flow. The learning method described here is very simple. Speech whose content is known is recognized using the standard patterns of the recognition basic units, giving the recognition result as a sequence of recognition basic units. Next, the phoneme sequence of the known utterance is associated with the recognized sequence of recognition basic units. This association may, for example, be linear. Because the number of recognition basic units in the recognition result is not necessarily an integer multiple of the number of phonemes in the utterance, several phoneme-to-unit relations are provided, such as one-to-one, one-to-two, and one-to-three. After the association, frequencies are accumulated, with each of these relations weighted according to the linear relation. In this way the phoneme-recognition basic unit correspondence is obtained, for example in the form shown in FIG. 3.
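A minimal Python sketch of the linear association and frequency accumulation follows. It assigns each phoneme a proportional slice of the recognized unit sequence and counts how often each slice occurs; the weighting of the one-to-one, one-to-two, and one-to-three relations mentioned in the text is not specified in enough detail to reproduce, so plain counts are used here as an assumption. The function names are hypothetical.

```python
from collections import Counter, defaultdict


def linear_align(phonemes, units):
    """Give each phoneme a proportional, contiguous slice of the unit sequence."""
    P, U = len(phonemes), len(units)
    pairs = []
    for p, ph in enumerate(phonemes):
        lo = (p * U) // P
        hi = ((p + 1) * U) // P
        pairs.append((ph, tuple(units[lo:hi])))
    return pairs


def accumulate(utterances):
    """Count how often each unit sequence was aligned to each phoneme."""
    table = defaultdict(Counter)
    for phonemes, units in utterances:
        for ph, seq in linear_align(phonemes, units):
            if seq:  # skip phonemes that received no unit
                table[ph][seq] += 1
    return table


aligned = linear_align(["h", "i"], ["N1", "N2", "N3"])
table = accumulate([
    (["a"], ["N1", "N2"]),
    (["a"], ["N1", "N2"]),
    (["a"], ["N9", "N10"]),
])
```

The most frequent sequences for each phoneme could then be kept as the alternatives of a FIG. 3 style table, for instance with `table["a"].most_common(2)`.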

[0022]

[Effect of the Invention] As described above, according to the present invention, even when standard patterns of small speech units are used for a speech recognition apparatus, the standard patterns can be used in a way that fully reflects speech phenomena such as coarticulation at recognition time, so highly accurate recognition is achieved.

[Brief description of drawings]

FIG. 1 is a block diagram showing the configuration of one embodiment of the speech recognition apparatus of the present invention.

FIG. 2 is a diagram illustrating the hidden Markov model of a recognition basic unit used in the speech recognition apparatus of the present invention.

FIG. 3 is a diagram illustrating the correspondence information between phonemes and the standard patterns of recognition basic units used in the speech recognition apparatus of the present invention.

FIG. 4 is a diagram illustrating the method of concatenating the standard patterns of recognition basic units used in the speech recognition apparatus of the present invention.

FIG. 5 is a flowchart illustrating the training flow for the standard patterns of recognition basic units used in the speech recognition apparatus of the present invention.

FIG. 6 is a flowchart illustrating another training flow for the standard patterns of recognition basic units used in the speech recognition apparatus of the present invention.

FIG. 7 is a flowchart illustrating the means for learning the correspondence between phonemes and the standard patterns of recognition basic units used in the speech recognition apparatus of the present invention.

[Explanation of symbols]

1: speech input means; 2: speech analysis means; 3: speech recognition means; 31: collation means; 32: determination means; 33: standard pattern connecting means; 34: word dictionary; 4: standard pattern storage means.

Claims

[Claims]

1. A speech recognition apparatus comprising: speech input means for inputting speech; speech analysis means for analyzing the input speech and outputting a time series of feature vectors; standard pattern storage means for storing standard patterns of recognition basic units, each recognition basic unit being a speech section whose duration is shorter than a phoneme; phoneme-recognition basic unit correspondence storage means for storing a correspondence between phonemes and recognition basic units; a word dictionary describing recognition target words as sequences of phonemes; standard pattern connecting means for connecting standard patterns of recognition basic units to form a standard pattern of a recognition target word; and collation means for collating the time series of feature vectors of the input speech with the standard pattern formed by the connection, the apparatus performing recognition based on the collation result output by the collation means, characterized in that the standard pattern connecting means forms the standard pattern of the recognition target word by using the correspondence between phonemes and recognition basic units stored in the phoneme-recognition basic unit correspondence storage means.

2. The speech recognition apparatus according to claim 1, wherein the correspondence between a phoneme and recognition basic units is given as a sequence of at least one recognition basic unit.

3. The speech recognition apparatus according to claim 1, wherein the correspondence between a phoneme and recognition basic units is given as a network of at least one recognition basic unit.

4. The speech recognition apparatus according to claim 1, wherein the correspondence between a phoneme and recognition basic units is given as a hidden Markov model having recognition basic units as labels.

5. A method for learning standard patterns of recognition basic units, characterized in that speech for learning is divided according to a predetermined criterion, an acoustic model of a recognition basic unit is created for each divided speech section, and parameter re-estimation and clustering are repeatedly performed on the group of acoustic models.

6. The method for learning standard patterns of recognition basic units according to claim 5, wherein the divided sections of the learning speech are kept fixed throughout the repeated re-estimation of the acoustic model parameters.

7. The method for learning standard patterns of recognition basic units according to claim 5, wherein the divided sections of the learning speech are corrected as needed to the sections obtained as a result of recognizing the learning speech with the acoustic models in each repetition.

8. A method for learning standard patterns of recognition basic units, characterized in that speech for learning is divided according to a predetermined criterion, a single acoustic model of a recognition basic unit common to all divided sections is set, and thereafter the following three processes are repeatedly performed until a predetermined criterion is satisfied: a process of estimating the parameters of the acoustic models using the learning speech; a process of dividing each acoustic model whose parameters have been estimated into two models by modifying its parameters; and a process of recognizing the learning speech using the acoustic models, whose total number has thereby been doubled.

9. The method for learning standard patterns of recognition basic units according to claim 8, wherein the divided sections of the learning speech are kept fixed throughout the repeated re-estimation of the acoustic model parameters.

10. The described method for learning standard patterns of recognition basic units, wherein the divided sections of the learning speech are corrected as needed to the sections obtained as a result of recognizing the learning speech with the acoustic models in each repetition.

11. A learning method for obtaining the correspondence between phonemes and recognition basic units, characterized in that speech whose utterance content is known is recognized using the standard patterns of recognition basic units, and the sequence of recognition basic units obtained as the recognition result is associated with the phoneme sequence of the utterance content, whereby the correspondence between phonemes and recognition basic units is learned.

12. The learning method according to claim 11, wherein the correspondence relationship between the phoneme and the recognition basic unit is given as a sequence of at least one recognition basic unit.

13. The learning method according to claim 11, wherein the correspondence relationship between the phoneme and the recognition basic unit is given as a sequence of at least one recognition basic unit.

14. The learning method according to claim 11, wherein the correspondence relationship between the phoneme and the recognition basic unit is given as a hidden Markov model having the recognition basic unit as a label.
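
As a non-limiting illustration of the learning method of claims 11 and 12, the following minimal Python sketch associates a recognized sequence of recognition basic units with the phoneme sequence of a known utterance and keeps, for each phoneme, the unit sequence observed most often. The function names `units_per_phoneme` and `learn_correspondence`, the `(label, start, end)` segment tuples, the availability of phoneme boundaries, the midpoint rule for association, and the majority vote are all assumptions made for illustration; they are not taken from the specification.

```python
from collections import Counter, defaultdict


def units_per_phoneme(phoneme_segments, unit_segments):
    """Group recognized units under the phoneme whose interval contains each unit's midpoint.

    phoneme_segments: list of (phoneme, start, end) for the known utterance (assumed given).
    unit_segments:    list of (unit_id, start, end) output by the recognizer (assumed format).
    Returns one (phoneme, tuple_of_unit_ids) entry per phoneme segment.
    """
    grouped = [(phoneme, []) for phoneme, _, _ in phoneme_segments]
    for unit_id, u_start, u_end in unit_segments:
        midpoint = (u_start + u_end) / 2.0
        for index, (_, p_start, p_end) in enumerate(phoneme_segments):
            if p_start <= midpoint < p_end:
                grouped[index][1].append(unit_id)
                break
    return [(phoneme, tuple(units)) for phoneme, units in grouped]


def learn_correspondence(utterances):
    """For each phoneme, keep the unit sequence that was observed most often (hypothetical rule)."""
    counts = defaultdict(Counter)
    for phoneme_segments, unit_segments in utterances:
        for phoneme, units in units_per_phoneme(phoneme_segments, unit_segments):
            if units:  # ignore phonemes to which no unit was associated
                counts[phoneme][units] += 1
    return {phoneme: counter.most_common(1)[0][0] for phoneme, counter in counts.items()}


# Toy data: three utterances with known phoneme boundaries and recognized units.
example_utterances = [
    ([("a", 0, 10), ("k", 10, 16)], [("u3", 0, 5), ("u7", 5, 10), ("u1", 10, 16)]),
    ([("k", 0, 6), ("a", 6, 18)], [("u1", 0, 6), ("u3", 6, 12), ("u7", 12, 18)]),
    ([("a", 0, 8)], [("u3", 0, 8)]),
]
table = learn_correspondence(example_utterances)
print(table)
```

In this sketch each phoneme maps to a single sequence of recognition basic units, in the manner of claim 12. Retaining every observed sequence together with its count, instead of only the most frequent one, would be one possible starting point for the alternative forms of correspondence recited in the other dependent claims.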