JPH0786758B2

JPH0786758B2 - Voice recognizer

Info

Publication number: JPH0786758B2
Application number: JP4203669A
Authority: JP
Inventors: 浩一篠田
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 1992-07-30
Filing date: 1992-07-30
Publication date: 1995-09-20
Anticipated expiration: 2010-09-20
Also published as: JPH06175678A

Description

Detailed Description of the Invention

[0001]

[Field of Industrial Application] The present invention relates to a speech recognition device that uses standard patterns created from a user's utterances.

[0002]

[Prior Art] In the field of speech recognition, speaker-independent (unspecified-speaker) recognition systems, which aim to recognize anyone's voice, are currently the subject of active research and development. Recognition methods such as the hidden Markov model (Hidden Markov Model, hereinafter HMM) and the neural network (Neural Network, hereinafter NN) are widely used in these systems. HMMs are explained in detail in, for example, Seiichi Nakagawa, "Speech Recognition by Probabilistic Models", Institute of Electronics, Information and Communication Engineers, 1988 (hereinafter Reference 1). Speech recognition with NNs is covered in detail in, for example, Shunichi Amari (ed.), "Speech, Hearing and Neural Network Models", Ohmsha, 1990 (hereinafter Reference 2). In these methods, standard patterns are learned in advance from learning data consisting of a vocabulary of words, sentences, and the like uttered by many speakers, and recognition is performed using those patterns.

[0003] Unlike a speaker-dependent (specified-speaker) system, which is tied to a particular user, a speaker-independent system has the advantage that the user does not need to register utterances in advance. In recent years, however, the following problems have been pointed out. First, for most speakers, recognition performance is lower than that of a speaker-dependent system. Second, there are speakers (outlier speakers) for whom recognition performance is markedly poor. To solve these problems, research has recently begun on applying speaker adaptation techniques, previously used in speaker-dependent systems, to speaker-independent systems as well.

[0004] Speaker adaptation refers to methods that adapt a recognition system to a new user (unknown speaker) using a smaller amount of learning data than is used for ordinary training. Speaker adaptation methods are explained in Sadaoki Furui, "Speaker Adaptation Techniques in Speech Recognition", Journal of the Institute of Television Engineers of Japan, Vol. 43, No. 9, 1989, pp. 929-934 (hereinafter Reference 3). Speaker adaptation falls broadly into two approaches: unsupervised speaker adaptation and supervised speaker adaptation. The former has the advantage of being easy to use, because the vocabulary to be uttered by the unknown speaker need not be specified in advance, but its recognition performance after adaptation falls short of the latter's. Supervised adaptation, in which the vocabulary to be uttered is specified in advance, is therefore currently the mainstream.

[0005]

[Problems to be Solved by the Invention] In a speech recognition system, if standard patterns are prepared word by word, unknown words cannot be recognized. In a speaker-dependent system, the user can simply utter each unknown word as it comes up, but in a speaker-independent system it is practically impossible to collect utterances of the unknown words from many speakers. For this reason, many speaker-independent systems use units smaller than a word, such as phonemes or syllables (hereinafter, subwords), as the unit of the standard pattern (hereinafter, recognition unit). To recognize an utterance of a word or sentence, the subword patterns are concatenated to create the standard pattern of that word or sentence. In this way, a standard pattern can be prepared even for an unknown word.
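As a rough illustration of the concatenation described above, the following sketch builds a word-level pattern by joining per-phoneme patterns. The lexicon, the phoneme symbols, and the list-of-state-labels representation are hypothetical and are not taken from the patent; they only stand in for real model parameters.

```python
# Hypothetical phoneme-level standard patterns: each phoneme maps to a list
# of state labels standing in for real model parameters.
PHONEME_PATTERNS = {
    "k": ["k_1", "k_2"],
    "a": ["a_1", "a_2"],
    "i": ["i_1", "i_2"],
}

# Hypothetical pronunciation lexicon (word -> phoneme sequence).
LEXICON = {
    "aka": ["a", "k", "a"],
    "kai": ["k", "a", "i"],
}


def build_word_pattern(phonemes, phoneme_patterns):
    """Concatenate subword patterns into one word-level pattern."""
    word_pattern = []
    for phoneme in phonemes:
        if phoneme not in phoneme_patterns:
            raise KeyError(f"no standard pattern for phoneme {phoneme!r}")
        word_pattern.extend(phoneme_patterns[phoneme])
    return word_pattern


# A word absent from the lexicon can still be covered, as long as its
# phoneme sequence is known.
unknown_word = build_word_pattern(["i", "k", "a"], PHONEME_PATTERNS)
```

Because the patterns are stored per phoneme, adding a word requires only its phoneme sequence and no new recordings.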

[0006] When supervised speaker adaptation is performed in such a speaker-independent speech recognition system that uses subwords as recognition units, the following problem arises. The acoustic features corresponding to a recognition unit vary with its context, that is, with which recognition units precede and follow it. When the learning data contains utterances of a large vocabulary, the data covers utterances in a wide variety of contexts, so the standard patterns learned from it are almost independent of context.

[0007] In adaptation, however, the amount of learning data must be kept far smaller than in ordinary training, in order to reduce the speaking burden on the user. With so little learning data, the variety of contexts it contains is naturally limited, and the standard patterns learned from it depend on the contexts contained in the vocabulary of the learning data (hereinafter, learning vocabulary). Such standard patterns give poor recognition performance for utterances in contexts that did not appear in the learning vocabulary.

[0008] An object of the present invention is to improve recognition performance as follows. A mapping between patterns that depend on the context of the learning vocabulary and patterns that do not is created in advance from the utterances of reference speakers. A pattern created from a new user's utterances of the learning vocabulary, which depends on the context of that vocabulary, is then converted by this mapping into one that does not depend on the learning vocabulary, and the converted pattern is used as the standard pattern.

[0009]

[Means for Solving the Problems] The speech recognition device according to the present invention uses standard patterns created from a user's utterances and comprises: a reference speaker learning vocabulary-independent pattern creating unit that receives reference speakers' utterances of a large vocabulary and outputs a reference speaker learning vocabulary-independent pattern; a reference speaker learning vocabulary-dependent pattern creating unit that receives the reference speakers' utterances of the learning vocabulary and outputs a reference speaker learning vocabulary-dependent pattern; a conversion mapping creating unit that receives the reference speaker learning vocabulary-independent pattern and the reference speaker learning vocabulary-dependent pattern and outputs a conversion mapping from the reference speaker learning vocabulary-dependent pattern to the reference speaker learning vocabulary-independent pattern; a new user learning vocabulary-dependent pattern creating unit that receives a new user's utterances of the learning vocabulary and outputs a new user learning vocabulary-dependent pattern; a pattern conversion unit that converts the new user learning vocabulary-dependent pattern with the conversion mapping and outputs a standard pattern; and a recognition unit that receives the new user's utterances and performs recognition using the standard pattern.

[0010]

[Embodiment] An embodiment of the present invention is described below with reference to the drawings. FIG. 1 is a block diagram showing one embodiment of the standard pattern creating apparatus according to the present invention.

[0011] The reference speaker learning vocabulary-independent pattern creating unit 102 receives utterance data of a large vocabulary from the reference speakers and outputs a reference speaker learning vocabulary-independent pattern RI.

[0012] The reference speaker learning vocabulary-dependent pattern creating unit 101 receives utterance data of the learning vocabulary from the reference speakers and outputs a reference speaker learning vocabulary-dependent pattern RD.

[0013] The conversion mapping creating unit 103 receives the reference speaker learning vocabulary-independent pattern RI and the reference speaker learning vocabulary-dependent pattern RD, and outputs a conversion mapping M from learning vocabulary-dependent patterns to learning vocabulary-independent patterns.

[0014] The new user learning vocabulary-dependent pattern creating unit 104 receives utterance data of the new user's vocabulary and outputs the new user learning vocabulary-dependent pattern PD of the unknown speaker.

[0015] The pattern conversion unit 105 receives the new user's learning vocabulary-dependent pattern PD, converts it with the conversion mapping M, and outputs the converted pattern PI as the standard pattern.

[0016] The recognition unit 106 receives the new user's utterance at recognition time, performs recognition using the standard pattern PI, and outputs the recognition result.
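The data flow among units 101 to 106 can be summarized in a short sketch. The function and parameter names below are hypothetical; each unit's actual method is passed in as a callable, so the sketch implies no particular training, mapping, or recognition algorithm.

```python
def prepare_mapping(ref_many_vocab_data, ref_learning_vocab_data,
                    train_pattern, make_mapping):
    """Offline stage (units 101 to 103), run before a new user appears."""
    ri = train_pattern(ref_many_vocab_data)        # unit 102 -> RI
    rd = train_pattern(ref_learning_vocab_data)    # unit 101 -> RD
    return make_mapping(rd, ri)                    # unit 103 -> M


def adapt_to_new_user(new_user_learning_data, mapping, train_pattern):
    """Adaptation stage (units 104 and 105)."""
    pd = train_pattern(new_user_learning_data)     # unit 104 -> PD
    return mapping(pd)                             # unit 105 -> PI


def recognize(utterance, standard_pattern, recognizer):
    """Recognition stage (unit 106)."""
    return recognizer(utterance, standard_pattern)
```

Only `adapt_to_new_user` and `recognize` involve the new user's speech; `prepare_mapping` depends on reference speaker data alone.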

[0017] The operation of the speech recognition device according to the present invention is described in detail below.

[0018] A speaker's utterance input to the speech recognition device passes through processes such as A/D conversion and speech analysis and is converted into a time series of feature vectors, one per unit of fixed duration called a frame. A frame is usually about 10 ms to 100 ms long. A feature vector captures the features of the speech spectrum at that time and usually has 10 to 100 dimensions. This time series of feature vectors is referred to here as utterance data.
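A minimal sketch of turning a sampled waveform into such a time series is given below. It assumes non-overlapping frames and replaces the spectral analysis with a two-dimensional toy feature (log energy and zero-crossing count); both choices, and all function names, are illustrative assumptions rather than the analysis used in the patent.

```python
import math


def split_into_frames(samples, sample_rate_hz, frame_ms=10.0):
    """Cut a sampled waveform into consecutive, non-overlapping frames."""
    frame_len = int(sample_rate_hz * frame_ms / 1000.0)
    if frame_len <= 0:
        raise ValueError("frame length must be at least one sample")
    n_frames = len(samples) // frame_len
    return [samples[i * frame_len:(i + 1) * frame_len] for i in range(n_frames)]


def toy_feature_vector(frame):
    """Two-dimensional stand-in for a spectral feature vector (assumption)."""
    energy = sum(x * x for x in frame) / len(frame)
    log_energy = math.log(energy + 1e-10)  # small offset avoids log(0)
    zero_crossings = sum(
        1 for a, b in zip(frame, frame[1:]) if (a < 0.0) != (b < 0.0)
    )
    return [log_energy, float(zero_crossings)]


def utterance_data(samples, sample_rate_hz, frame_ms=10.0):
    """Time series of feature vectors, one per frame."""
    frames = split_into_frames(samples, sample_rate_hz, frame_ms)
    return [toy_feature_vector(f) for f in frames]
```

A real front end would replace `toy_feature_vector` with a spectral analysis producing the 10 to 100 dimensions mentioned above.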

[0019] In the following, the HMM is used as an example of the recognition method. An HMM is a model of the speech source. One HMM is prepared for each recognition unit; here, the phoneme is taken as the recognition unit. To recognize a word or sentence, the HMMs of its phonemes are concatenated to create the HMM of that word or sentence. The HMM of each phoneme usually consists of 1 to 10 states and the transitions between them. A start state and an end state are normally defined; at each unit of time, a symbol is output from the current state and a state transition takes place. The speech of each phoneme is represented as the time series of symbols output by the HMM during the state transitions from the start state to the end state. A symbol output (appearance) probability is defined for each state, and a transition probability for each transition between states. Symbols are generated in each state according to the output probabilities, and transitions between states occur according to the transition probabilities. By fixing the probability of the start state and multiplying by the output probability and the transition probability at each state transition, the probability that an utterance is generated by the model can be obtained. Conversely, when an utterance is observed, the probability of its generation can be computed on the assumption that it was generated by a given HMM. In HMM-based speech recognition, an HMM is prepared for each recognition candidate; when an utterance is input, the generation probability is computed for each HMM, the HMM with the highest probability is taken as the source, and the recognition candidate corresponding to that HMM is output as the recognition result.
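The probability computation and the choice of the most likely model can be sketched with the standard forward algorithm for a discrete-symbol HMM. This is a simplified illustration: it sums over all final states instead of requiring a designated end state, and the two toy models and their numbers are invented for the example.

```python
def forward_probability(observations, initial, transition, emission):
    """Probability that a discrete HMM generates the observation sequence.

    initial[i]        probability of starting in state i
    transition[i][j]  probability of moving from state i to state j
    emission[i][o]    probability that state i outputs symbol o
    """
    n_states = len(initial)
    alpha = [initial[i] * emission[i][observations[0]] for i in range(n_states)]
    for symbol in observations[1:]:
        alpha = [
            sum(alpha[i] * transition[i][j] for i in range(n_states))
            * emission[j][symbol]
            for j in range(n_states)
        ]
    return sum(alpha)


def recognize_by_likelihood(observations, models):
    """Return the candidate whose HMM gives the highest probability."""
    return max(
        models,
        key=lambda name: forward_probability(observations, *models[name]),
    )


# Toy models (invented values): (initial, transition, emission) per candidate.
MODELS = {
    "yes": ([1.0, 0.0], [[0.5, 0.5], [0.0, 1.0]], [[0.9, 0.1], [0.2, 0.8]]),
    "no": ([1.0, 0.0], [[0.5, 0.5], [0.0, 1.0]], [[0.1, 0.9], [0.8, 0.2]]),
}

best = recognize_by_likelihood([0, 0, 1], MODELS)
```

For long utterances a practical implementation would work with log probabilities to avoid numerical underflow.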

[0020] A notable feature of the HMM is the existence of an algorithm, called the Baum-Welch algorithm, that learns parameters such as the transition probabilities and output (appearance) probabilities from speech corresponding to the model. The Baum-Welch algorithm is described in detail in Reference 1. If the symbols are assumed to follow a continuous distribution, the output probability is expressed by a continuous probability density function. When a Gaussian mixture is used as the continuous density function, its parameters are the mean vector and variance vector of each component distribution and the weighting coefficients that set the weights among the distributions. The number of distributions in the mixture of each state is usually about 1 to 10. The mean vector and variance vector of each distribution have the same dimension as the utterance data. The weighting coefficient of each distribution is a scalar. The parameters that can be learned are the mean vectors, variance vectors, and weighting coefficients of these continuous mixture distributions, and the transition probabilities. The example below considers the case where the mean vector of each distribution is learned. In this case, the standard pattern is the set of mean vectors of each distribution of each state, for each phoneme.
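To make the parameterization concrete, the sketch below evaluates the output density of one state as a Gaussian mixture. It reads the "variance vector" as a diagonal covariance, which is an interpretation rather than something the text states explicitly; the function names are illustrative.

```python
import math


def diagonal_gaussian_density(x, mean, variance):
    """Density of a Gaussian whose covariance is diagonal (variance vector)."""
    log_density = 0.0
    for x_d, mean_d, var_d in zip(x, mean, variance):
        log_density += -0.5 * (
            math.log(2.0 * math.pi * var_d) + (x_d - mean_d) ** 2 / var_d
        )
    return math.exp(log_density)


def mixture_density(x, weights, means, variances):
    """Output density of one state: weighted sum of its component Gaussians."""
    return sum(
        w * diagonal_gaussian_density(x, m, v)
        for w, m, v in zip(weights, means, variances)
    )
```

In the adaptation scheme described here, only the `means` argument would change between the vocabulary-dependent and the converted pattern; the weights and variances stay as they are.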

[0021] Now take one phoneme and consider the mean vector μ of one distribution of one of its states.

[0022] First, a mean vector μ_RD that depends on the context of the learning vocabulary is learned from utterance data of the learning vocabulary prepared in advance from the reference speakers. There may be one reference speaker or several. With several, a mean vector may be learned for each speaker, or a single mean vector may be learned over all reference speakers. The latter case is described here. The Baum-Welch algorithm described above can be used as the learning method. Alternatively, the utterances can be aligned to the distributions by the Viterbi algorithm, and the average of all utterance data aligned to a distribution can be taken as its mean vector. The Viterbi algorithm is described in detail in Reference 1. The above corresponds to the reference speaker learning vocabulary-dependent pattern creating unit 101.
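The averaging alternative can be sketched as follows. The sketch assumes that the frame-to-distribution alignment has already been produced (for example by the Viterbi algorithm, which is not implemented here) and only shows the averaging step; the function name and data layout are hypothetical.

```python
def mean_vectors_from_alignment(frames, alignment):
    """Average the feature vectors assigned to each distribution.

    frames     list of feature vectors (lists of floats)
    alignment  list of distribution ids, one per frame (assumed given)
    """
    if len(frames) != len(alignment):
        raise ValueError("need exactly one distribution id per frame")
    sums, counts = {}, {}
    for vector, dist_id in zip(frames, alignment):
        if dist_id not in sums:
            sums[dist_id] = [0.0] * len(vector)
            counts[dist_id] = 0
        sums[dist_id] = [s + v for s, v in zip(sums[dist_id], vector)]
        counts[dist_id] += 1
    return {d: [s / counts[d] for s in sums[d]] for d in sums}
```

Pooling the frames of all reference speakers before calling this function corresponds to learning a single mean vector over all speakers.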

[0023] Next, a mean vector μ_RI that is independent of the context of the learning vocabulary is learned from a large number of utterances prepared in advance from a large number of reference speakers. The learning method is the same as above. This step corresponds to ordinary training in a speaker-independent system. The above corresponds to the reference speaker learning vocabulary-independent pattern creating unit 102.

[0024] Next, the two kinds of reference speaker mean vectors created above are used to create a mapping from mean vectors that depend on the learning vocabulary to mean vectors that are independent of it. For example, a mapping such as the following is used.

[0025]

[Equation 1]

[0026] Here, μ is the input mean vector, which depends on the learning vocabulary, and

[0027]

[Equation 2]

[0028] is the output mean vector, which is independent of the learning vocabulary. Many other mappings are possible. The mapping created here is called the conversion mapping. The above corresponds to the conversion mapping creating unit 103.
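Equation 1 itself is not reproduced in this text, so the following is only a sketch under a stated assumption: the mapping is taken to be a plain additive shift by the difference between μ_RI and μ_RD, which is suggested by the later remark that a term of Equation (1) is a difference of μ_RI and μ_RD. The patent's actual mapping is described as nonlinear and may well differ; the function name is hypothetical.

```python
def make_shift_mapping(mu_rd, mu_ri):
    """Build a mapping from vocabulary-dependent to vocabulary-independent means.

    ASSUMPTION: the mapping is a plain additive shift by (mu_ri - mu_rd).
    This is not the patent's Equation 1, which is unavailable here.
    """
    shift = [ri - rd for rd, ri in zip(mu_rd, mu_ri)]

    def mapping(mu):
        return [m + s for m, s in zip(mu, shift)]

    return mapping


# Example: the reference speakers' dependent mean maps onto their
# independent mean, and a new user's mean is moved by the same shift.
mapping = make_shift_mapping([1.0, 2.0], [1.5, 1.0])
converted = mapping([4.0, 4.0])
```

The shift encodes how the limited learning vocabulary biased the reference speakers' means, and applying it to a new user's mean removes a comparable bias.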

[0029] The processing up to this point can be carried out before a new user starts using the device. The following is the processing performed when a new user uses it.

[0030] First, using the new user's utterances of the learning vocabulary, a mean vector μ_PD is learned that is adapted to the new user's speech and depends on the context of the learning vocabulary. The learning method is the same as for the reference speakers. The above corresponds to the new user learning vocabulary-dependent pattern creating unit 104.

[0031] Next, the new user's mean vector that depends on the learning vocabulary is converted with the conversion mapping to estimate a mean vector that does not depend on the context of the learning vocabulary. The above corresponds to the pattern conversion unit 105.

[0032] The above procedure is carried out for each distribution of each state of each phoneme.
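Repeating the conversion over every distribution can be sketched as a loop over mean vectors keyed by (phoneme, state, distribution). The dictionary layout and function name are hypothetical, and the per-vector conversion again assumes a plain additive shift in place of the patent's Equation 1, which is not available in this text.

```python
def convert_all_means(pd_means, rd_means, ri_means):
    """Convert every mean vector, keyed by (phoneme, state, distribution)."""
    converted = {}
    for key, mu_pd in pd_means.items():
        mu_rd, mu_ri = rd_means[key], ri_means[key]
        # ASSUMPTION: additive shift; the patent's Equation 1 may differ.
        converted[key] = [
            p + (ri - rd) for p, rd, ri in zip(mu_pd, mu_rd, mu_ri)
        ]
    return converted
```

The returned dictionary plays the role of the standard pattern PI used at recognition time.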

[0033] After the standard patterns have been created in this way, when the speech recognition device is in use, the new user's input utterances are recognized using HMMs that have the estimated mean vectors. HMM-based recognition is described in detail in Reference 1. The above corresponds to the recognition unit 106.

[0034] The example here used several reference speakers with a single pair of mean vectors μ_RD and μ_RI learned over all of them, but a mean vector may instead be learned for each speaker. In that case, the second term of the second expression of Equation (1) can be replaced by the sum, over all reference speakers, of the suitably weighted differences between each speaker's μ_RI and μ_RD.
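The weighted sum over reference speakers can be sketched as below. How the weights are chosen is not specified in the text, so the weights are simply passed in by the caller; the function name is hypothetical.

```python
def multi_speaker_shift(mu_rd_by_speaker, mu_ri_by_speaker, weights):
    """Weighted sum over reference speakers of (mu_RI - mu_RD)."""
    dim = len(mu_rd_by_speaker[0])
    shift = [0.0] * dim
    for mu_rd, mu_ri, w in zip(mu_rd_by_speaker, mu_ri_by_speaker, weights):
        for d in range(dim):
            shift[d] += w * (mu_ri[d] - mu_rd[d])
    return shift
```

With uniform weights 1/N the result is the average per-speaker difference; other weightings, for example favoring reference speakers who resemble the new user, are a design choice the text leaves open.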

[0035] Also, the mapping here is created, and the mean vector converted, using only one distribution of the mixture, but the mapping can also be created using other distributions at the same time. Such other distributions may be other distributions of the same state, distributions of other states, distributions of other recognition units, and so on. In that case, the several mean vectors corresponding to those distributions are converted using the created mapping.
Alternatively, the distribution of other states, the distribution of other recognition units, and the like are possible. In such a case, a plurality of average vectors corresponding to the distribution are converted using the created mapping.

[0036] A nonlinear mapping was used as the example here, but either a linear or a nonlinear mapping can be used.

[0037] The number of reference speakers is variable; a single speaker is also possible.

[0038] For simplicity, the same learning method was assumed here for the reference speaker learning vocabulary-dependent pattern, the reference speaker learning vocabulary-independent pattern, and the new user learning vocabulary-dependent pattern, but the present invention can also be applied when these methods differ.

[0039] An example in which only the mean vectors are learned was shown here, but other parameters such as variances, weights, and transition probabilities can readily be learned in the same way. Several of these parameters can also be learned at the same time. The phoneme was taken as the recognition unit here, but the method is readily applicable to other recognition units such as syllables and demisyllables.

[0040] The HMM was used as the example recognition method here, but the method can readily be applied when learning parameters in other recognition methods as well, such as NN and DP matching.

[0041]

[Effects of the Invention] A standard pattern created from utterances of a limited learning vocabulary depends on the context of that vocabulary. By correcting the influence of context with a mapping created from other speakers' utterances of a large vocabulary, a standard pattern that is independent of context can be estimated. A standard pattern with high recognition performance can thus be created from a smaller amount of learning utterances than before.

[Brief description of drawings]

[FIG. 1] A block diagram showing one embodiment of the speech recognition device according to the present invention.

[Explanation of symbols]

101 Reference speaker learning vocabulary-dependent pattern creating unit
102 Reference speaker learning vocabulary-independent pattern creating unit
103 Conversion mapping creating unit
104 New user learning vocabulary-dependent pattern creating unit
105 Pattern conversion unit
106 Recognition unit

Claims

[Claims]

1. A speech recognition device using a standard pattern created from a user's utterances, comprising: a reference speaker learning vocabulary-independent pattern creating unit that receives reference speakers' utterances of a large vocabulary and outputs a reference speaker learning vocabulary-independent pattern; a reference speaker learning vocabulary-dependent pattern creating unit that receives the reference speakers' utterances of a learning vocabulary and outputs a reference speaker learning vocabulary-dependent pattern; a conversion mapping creating unit that receives the reference speaker learning vocabulary-independent pattern and the reference speaker learning vocabulary-dependent pattern and outputs a conversion mapping from the reference speaker learning vocabulary-dependent pattern to the reference speaker learning vocabulary-independent pattern; a new user learning vocabulary-dependent pattern creating unit that receives a new user's utterances of the learning vocabulary and outputs a new user learning vocabulary-dependent pattern; a pattern conversion unit that converts the new user learning vocabulary-dependent pattern with the conversion mapping and outputs a standard pattern; and a recognition unit that receives the new user's utterances and performs recognition using the standard pattern.