JP2879989B2 - Voice recognition method

Info

Publication number: JP2879989B2
Application number: JP3058796A
Authority: JP
Inventors: 田麻紀宮; 見昌克星; 勝行二矢田
Original assignee: Matsushita Electric Industrial Co Ltd
Current assignee: Panasonic Holdings Corp
Priority date: 1991-03-22
Filing date: 1991-03-22
Publication date: 1999-04-05
Anticipated expiration: 2014-04-05
Also published as: JPH04293095A

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Field of Industrial Application] The present invention relates to a method for recognizing word speech uttered by unspecified speakers.

[0002]

[Prior Art] A known method for recognizing word speech uttered by unspecified speakers is shown in FIG. 7: the input speech of a word is analyzed by an acoustic analysis unit 71 to extract feature parameters, these are matched against word standard patterns 72 created in advance from many speakers, and a word recognition unit 73 computes word similarities and performs recognition. An example is the method described in "Speech recognition device for unspecified speakers and small vocabularies using a word spotting technique" (IEICE SP88-18).

[0003] In this method, speech data actually uttered by 330 speakers is used to create the word standard patterns for unspecified speakers. For the speech data of 330 people uttering the ten digits, a human visually cuts out the speech sections with reference to spectral waveforms and the like, a time series of the feature parameters (LPC cepstrum coefficients) obtained at each analysis time is computed, the speech data is linearly compressed to an utterance duration fixed for each word, and the word standard patterns are created from the absolute values of the data of the 330 speakers. Matching the unknown input speech against the standard patterns created in this way using the Mahalanobis distance, a statistical distance measure, makes speech recognition for unspecified speakers possible. The method rests on the idea of statistically absorbing the spectral variation among unspecified speakers by matching against word standard patterns with a statistical distance measure, and creating standard patterns for such a measure requires data uttered by several hundred or more speakers for each word to be recognized.

[0004] Another speech recognition method that uses word standard patterns is the multi-standard-pattern method. This method analyzes a large amount of data, selects several representative patterns from it, and copes with the spectral variation of unspecified speakers by matching the unknown input against the multiple word standard patterns. This method, too, requires collecting and analyzing data from several hundred speakers in order to create the multiple word standard patterns.

[0005] Yet another method for recognizing word speech of unspecified speakers is based on phoneme recognition, as shown in FIG. 8. The input speech of a word is analyzed by an acoustic analysis unit 81 to extract feature parameters at each analysis time (frame), and a segmentation unit 82 divides the input speech into vowel sections and consonant sections. Phoneme standard patterns 83 are created in advance for each phoneme by acoustically analyzing speech data uttered by many speakers. Next, a phoneme recognition unit 84 recognizes phonemes by matching vowel sections against the vowel phoneme standard patterns and consonant sections against the consonant phoneme standard patterns, yielding a phoneme symbol string for the input speech. A word recognition unit 85 then matches this phoneme symbol string against a word dictionary 86 written in phoneme notation, computes word similarities, and performs recognition. Compared with the methods described above, this method has the advantage that the word dictionary can be registered in phoneme notation, so there is no need to collect and analyze a huge amount of data to create word standard patterns, and the word dictionary is easy to change. However, because the phoneme is the basic unit of recognition, the information in the portions that change over time from one phoneme to the next is not used, which limits the recognition rate.

[0006]

[Problems to Be Solved by the Invention] As described above, the former speech recognition method, which uses word standard patterns, requires an enormous amount of work, such as data collection and cutting out of speech sections, to create the word standard patterns for the speech to be recognized, and therefore has the problem that the recognition targets cannot easily be changed.

[0007] The latter method, which uses the phoneme as the basic unit of recognition, loses the element of temporal change from phoneme to phoneme and therefore has the problem that there is a limit to how far the recognition rate can be raised.

[0008] The present invention solves these conventional problems. Its object is to provide a speech recognition method that makes it possible to recognize the speech of unspecified speakers using recognition target speech uttered by only one to several speakers, that allows the recognition targets to be changed easily, and that achieves a high recognition rate.

[0009]

[Means for Solving the Problems] To achieve the above object, in the present invention, at dictionary creation time, the speech to be recognized is uttered by one or a small number of speakers and acoustically analyzed to extract feature parameters; these are matched against phoneme standard patterns created in advance from many speakers to obtain a similarity vector for each frame; the similarity vector is subjected to similarity emphasis, which makes large similarities larger still, and to normalization; and the resulting time series of similarity vectors is registered in a dictionary. At recognition time, a time series of similarity vectors is obtained from the input speech of an unspecified speaker in the same way as at dictionary registration and matched against the dictionary, and the recognition target with the greatest similarity is output as the recognition result.

[0010]

[Operation] According to the present invention, therefore, the input speech is analyzed to extract feature parameters, a similarity vector is obtained for each frame by computing similarities to phoneme standard patterns created from the speech data of many speakers, and an emphasis function is applied to this similarity vector followed by frame-by-frame normalization. As a result, merely registering as a dictionary the similarity vector time series of speech uttered by one or a small number of speakers is enough to recognize the speech of unspecified speakers accurately.

[0011]

[Embodiments] Embodiments of the present invention are described below, preceded by an explanation of the basic idea behind the invention.

[0012] For voiced sounds, the human voice originates as vibration of the vocal cords; this vibration is modulated in various ways as it passes through the vocal tract, which is formed by the articulatory organs such as the larynx, pharynx, tongue, jaw, and lips, and emerges from the mouth as speech. Phonemic identity, such as a, i, u, and so on, is therefore given by the shape of the vocal tract. For unvoiced sounds the sound source may not be the vocal cords, but phonemic identity is again determined by the shape of the vocal tract. However, the shapes and dimensions of the throat, tongue, teeth, jaw, lips, and other parts forming the vocal tract differ subtly from person to person, and the size of the vocal cords differs with sex and age. This gives rise to differences in voice between individuals. In other words, voice differences between people are due largely to differences in their articulatory organs.

[0013] On the other hand, when speech is uttered not as isolated phonemes such as a, i, u but as words or sentences, the shape of the vocal tract changes over time. That is, words are formed by temporal changes of the vocal tract. For example, when uttering "akai" (red), the vocal tract starts from the articulation of /a/, with the jaw open and a constriction toward the back of the tongue, moves to the plosive /k/, which involves closure and abrupt release in the laryngeal region, returns to the shape of /a/, and then gradually moves the tongue toward the lips to reach /i/, with the mouth closed. Such a pattern of vocal tract change is determined by the word being uttered, and differences between individuals are considered to be small.

[0014] If speech as language is thus divided into the static shape of the vocal tract and its temporal change, only the former can be regarded as differing from speaker to speaker, while the latter differs little between speakers. Therefore, if the differences arising from static vocal tract shape can be normalized in some way, recognition of unspecified speakers becomes possible.

[0015] Differences in vocal tract shape appear in the emitted speech signal as differences in the frequency spectrum. The simplest way to normalize the frequency spectrum across speakers is to classify it by matching against short-time speech standard patterns for units such as phonemes or syllables. If general-purpose standard patterns created for unspecified speakers are used, similarity information can be obtained that is not strongly affected by speaker differences. In other words, converting the spectrum into similarity information by pattern matching amounts to reducing the differences between speakers.

[0016] On the other hand, because the pattern of vocal tract change differs little between speakers, information from one to several speakers is sufficient. Therefore, if utterances of words, phrases, and the like by a small number of speakers are registered in a dictionary as time patterns of similarity information, the result is a dictionary for unspecified speakers.

[0017] Based on this idea, the present invention analyzes the speech to be recognized uttered by one to several speakers, matches the resulting feature parameters against n types of standard patterns created in advance from many speakers at every frame (the analysis interval) to obtain a time series of n-dimensional similarity vectors, passes each similarity vector through an emphasis function that emphasizes the top-ranked similarities, normalizes it frame by frame, and registers the resulting time series of n-dimensional similarity vectors as a dictionary. To recognize input speech, the input speech is likewise matched against the n types of standard patterns, passed through the same emphasis function as at dictionary registration, and normalized frame by frame, and the resulting time series of n-dimensional similarity vectors is matched against the dictionary, thereby recognizing the speech of unspecified speakers.

[0018] With the above configuration, the present invention first computes, at every unit time (every frame), the similarities between the feature parameters obtained by analyzing speech uttered by one to several speakers and n types of standard patterns, such as phonemes or syllables, created from many speakers. Because these similarities result from matching against general-purpose standard patterns created from many speakers, the relative relationship among the n similarity values is not easily affected by individual characteristics. Using the relative relationship of the similarities at each unit time as a parameter is therefore effective for unspecified speakers.

[0019] Furthermore, since the recognition rate can be improved by emphasizing the portions that contribute to recognition, the similarity vector is passed through an emphasis function that makes large similarities larger and reduces small similarities to values so small that they do not contribute to recognition.

[0020] In addition, so that the relative relationship of the similarities within a frame is captured with equal weight over the entire speech section, the similarity vector is normalized frame by frame. The time series of n-dimensional similarity vectors obtained in this way is registered as a dictionary.

[0021] To recognize input speech, the time series of n-dimensional similarity vectors prepared as the dictionary is matched against the time series of similarity vectors obtained from the input speech by the same procedure as at dictionary creation. In this way, the speech of unspecified speakers can be recognized with a dictionary created from a small number of speakers. Because the present invention uses multiple candidates as the similarities at each unit time, rather than only the single most reliable one, a higher recognition rate can be obtained.

[0022] Since any word can be described as a combination of phonemes or syllables, the n types of phoneme or syllable standard patterns need to be created only once, and the same patterns can always be used even when the recognition targets are changed. Changing the recognition vocabulary, that is, changing the dictionary so that other speech can be recognized, requires only that a small number of speakers utter the new words. Speech recognition for unspecified speakers is therefore possible with a simple procedure, and a recognition device that is flexible with respect to vocabulary changes and the like can be realized.

[0023] A first embodiment of the present invention is described below with reference to FIG. 1. In FIG. 1, 1 is an acoustic analysis unit, 2 is a feature parameter extraction unit, 3 is a standard pattern storage unit, 4 is a similarity calculation unit, 5 is a similarity emphasis and normalization unit, 6 is a parameter time series creation unit, 7 is a dictionary storage unit, and 8 is a pattern matching unit.

[0024] The operation of this embodiment is described next for the case where the speech of one speaker is registered in the dictionary. In this embodiment, a dictionary is first created using the recognition target speech uttered by one speaker as the input speech, and at recognition time that dictionary is used to recognize the input speech of unspecified speakers.

[0025] In FIG. 1, when speech is input, the acoustic analysis unit 1 computes linear prediction coefficients (LPC) for every frame, the analysis interval (1 frame = 10 msec in this embodiment).

[0026] Next, the feature parameter extraction unit 2 computes the LPC cepstrum coefficients (nine coefficients, C0 to C8).

[0027] The standard pattern storage unit 3 stores n types of phoneme standard patterns created in advance from data uttered by many speakers. In this embodiment, n = 20, and the following 20 phoneme standard patterns are used: /a/, /o/, /u/, /i/, /e/, /j/, /w/, /m/, /n/, /[External character 1]/, /b/, /d/, /r/, /z/, /h/, /s/, /c/, /p/, /t/, /k/.

[0028]

[External character 1]

[0029] Each phoneme standard pattern is created by visually and accurately locating the characteristic portion of the phoneme (the temporal position that best expresses the features of that phoneme) and using the time pattern of feature parameters centered on this feature frame. In this embodiment, the time pattern is a parameter time series formed from the LPC cepstrum coefficients (C0 to C8) of 12 frames in total: 8 frames before the feature frame and 3 frames after it. This parameter time series is extracted from a large amount of data uttered by many people, and the mean vector μ_p of the elements and the covariance matrix Σ between the elements are computed and used as the standard pattern. The phoneme standard patterns used in this embodiment thus use the feature parameters of multiple frames; their distinguishing feature is that they are created with the temporal movement of the parameters taken into account.
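The 12-frame time pattern described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: it assumes the 12 frames are the 8 preceding frames, the feature frame itself, and the 3 following frames (8 + 1 + 3 = 12), which gives a 12 × 9 = 108-dimensional vector. The names `window_vector`, `FRAMES_BEFORE`, and `FRAMES_AFTER` are hypothetical.

```python
from typing import List, Sequence

FRAMES_BEFORE = 8  # frames preceding the feature frame
FRAMES_AFTER = 3   # frames following the feature frame
NUM_CEPSTRA = 9    # LPC cepstrum coefficients C0..C8 per frame


def window_vector(frames: Sequence[Sequence[float]], center: int) -> List[float]:
    """Concatenate the frames around `center` into one flat vector.

    Assumption: the window includes the feature frame itself,
    so it spans 8 + 1 + 3 = 12 frames.
    """
    start = center - FRAMES_BEFORE
    stop = center + FRAMES_AFTER + 1  # exclusive upper bound
    if start < 0 or stop > len(frames):
        raise ValueError("window does not fit inside the utterance")
    return [coeff for frame in frames[start:stop] for coeff in frame]
```

Averaging such vectors over many utterances of the same phoneme would give the mean vector μ_p mentioned in the text; the covariance estimation is omitted here.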

[0030] The similarity calculation unit 4 computes, frame by frame, the similarity between these 20 phoneme standard patterns and the feature parameters (LPC cepstrum coefficients) obtained by the feature parameter extraction unit 2. That is, the input is matched against the standard patterns while being shifted one frame at a time, giving a time series of similarities like that in FIG. 2. In this embodiment, the Mahalanobis distance with a shared covariance matrix is used as the distance measure for the similarity calculation. The Mahalanobis distance d_p used to compute the similarity between the input and the standard pattern of phoneme p is given by (Equation 1) below, where x is the vector formed from the feature parameters of 12 frames, which is the time pattern of the input.

[0031]

(Equation 1)

[0032] If the covariance matrix Σ_p is made common to all phonemes, the expression can be expanded into the simple form of (Equation 2) below. Let Σ denote the shared covariance matrix.

[0033]

(Equation 2)

[0034] This embodiment uses (Equation 2) above, which requires less computation. a_p and b_p constitute the standard pattern for phoneme p and are stored in advance in the standard pattern storage unit 3. The vector whose elements are the similarities to the 20 phoneme standard patterns obtained in this way (the hatched portion in FIG. 2) is called the similarity vector.
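The equation images are not reproduced in this text. As a hedged reconstruction, assuming the textbook definition of the Mahalanobis distance and one common way of expanding it under a shared covariance matrix, (Equation 1) and (Equation 2) may take roughly the following form. The exact sign and scaling conventions for a_p and b_p in the original are not recoverable from this section, so the definitions below are assumptions.

```latex
% (Equation 1), assumed standard form: Mahalanobis distance to phoneme p
d_p = (x - \mu_p)^{\mathsf{T}} \, \Sigma_p^{-1} \, (x - \mu_p)

% With a shared covariance matrix, \Sigma_p = \Sigma for all p:
d_p = x^{\mathsf{T}} \Sigma^{-1} x
      - 2\, \mu_p^{\mathsf{T}} \Sigma^{-1} x
      + \mu_p^{\mathsf{T}} \Sigma^{-1} \mu_p

% The first term does not depend on p, so ranking phonemes only needs
% (assumed form of Equation 2):
d_p' = b_p - a_p^{\mathsf{T}} x,
\qquad a_p = 2\, \Sigma^{-1} \mu_p,
\qquad b_p = \mu_p^{\mathsf{T}} \Sigma^{-1} \mu_p
```

Under this reading, each phoneme costs one inner product and one subtraction per frame, which would account for the reduced computation the text mentions, and the negated value a_p^T x - b_p can serve as a similarity in which larger means more similar.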

[0035] The similarity emphasis and normalization unit 5 passes the similarity vector obtained by the similarity calculation unit 4 through an emphasis function that emphasizes the top-ranked similarities, normalizing it so that in each frame the maximum value is 1 and the minimum value is 0. This is done for all frames, and the parameter time series creation unit 6 creates the time series of similarity vectors.

[0036] The similarity emphasis and normalization unit 5 transforms the similarity vector obtained for each frame as follows. First, the 20 elements of the similarity vector are sorted in descending order (a larger value meaning greater similarity to that phoneme standard pattern), and the similarity values from rank 1 to rank k are mapped linearly onto the range 1 to 0, so that the rank-1 similarity becomes 1 and the rank-k similarity becomes 0. Ranks k+1 through 20 are all set to 0, giving a new similarity vector of 20 elements. That is, if the similarity vector is a = (a_1, a_2, …, a_i, …, a_20), the emphasis function F is expressed by (Equation 3) below, where M is the maximum similarity and M_k is the value of the k-th ranked similarity.

[0037]

(Equation 3)

[0038] Passing the similarity vector through this function F emphasizes the similarities to the top-ranked phonemes. Because the values of M and M_k differ from frame to frame, F(a_i) also differs from frame to frame, but within each frame the maximum is always 1 and the minimum is always 0, so the vector is normalized frame by frame. The reason for normalizing frame by frame is as follows. Because the phoneme standard patterns are created from the time pattern of feature parameters around the feature frame, the similarities to all phoneme standard patterns are small overall in the transitions between phonemes; without frame-by-frame normalization, the relative relationship of the similarities in these transition portions would be underestimated. Normalization is therefore performed for every frame so that the relative relationship of the similarities is treated equally over the entire speech section.
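The emphasis and normalization step described above can be sketched as follows. This is an illustration under stated assumptions rather than the patent's implementation: it assumes F(a_i) = (a_i - M_k) / (M - M_k) for a_i above M_k and 0 otherwise, which matches the verbal description (rank 1 maps to 1, rank k maps to 0, linear in between). The function name `emphasize`, the parameter `k` (whose value is not stated in this section), and the handling of ties are all hypothetical.

```python
from typing import List, Sequence


def emphasize(similarities: Sequence[float], k: int) -> List[float]:
    """Map the top-k similarities linearly onto [1, 0]; zero out the rest.

    Larger input values are assumed to mean "more similar".
    """
    if not 2 <= k <= len(similarities):
        raise ValueError("k must satisfy 2 <= k <= len(similarities)")
    ranked = sorted(similarities, reverse=True)
    m_top = ranked[0]      # M: largest similarity in this frame
    m_k = ranked[k - 1]    # M_k: k-th largest similarity in this frame
    if m_top == m_k:
        # Degenerate frame where the top k values tie (assumption:
        # treat the tied maxima as 1 and everything else as 0).
        return [1.0 if s >= m_top else 0.0 for s in similarities]
    # Values at or below M_k come out <= 0 and are clipped to 0.
    return [max(0.0, (s - m_k) / (m_top - m_k)) for s in similarities]
```

For example, with five similarities and k = 3, `emphasize([5.0, 1.0, 3.0, 4.0, 2.0], 3)` gives `[1.0, 0.0, 0.0, 0.5, 0.0]`: the maximum becomes 1, the third-ranked value becomes 0, and everything below it is zeroed.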

[0039] New similarity vectors are obtained in this way over the entire speech section, and the parameter time series creation unit 6 creates the time series of similarity vectors. FIG. 3 shows an example of the similarity vector time series of FIG. 2 after emphasis and normalization. Looking at the hatched similarity vector, the similarity of phoneme /a/, which has the maximum similarity in FIG. 2, becomes 1, the rank-k similarity (say, that of phoneme /p/) becomes 0, ranks 1 through k are mapped linearly in between, and all the small values at rank k+1 and below are set to 0.

[0040] The procedure up to this point is the same at dictionary creation time and at recognition time.

[0041] At dictionary creation time, the recognition target speech uttered by one speaker is supplied as the input speech, and the resulting time series of similarity vectors is registered in the dictionary storage unit 7.

[0042] At recognition time, a time series of similarity vectors is obtained from the input speech in the same way as at dictionary creation, the pattern matching unit 8 matches it against the similarity vector time series in the dictionary storage unit 7, and the dictionary entry with the highest score is taken as the recognition result. In this embodiment, DP matching is used as the matching method. An example of the recurrence formula for DP matching is shown in (Equation 4), where the dictionary length is J frames, the input length is I frames, the distance function between the i-th frame and the j-th frame is l(i, j), and the cumulative similarity is g(i, j).

[0043]

(Equation 4)

[0044] In this embodiment, the Euclidean distance is used as the distance measure for the distance function l(i, j). If the similarity vector at frame i of the input speech is a = (a_1, a_2, …, a_20) and the similarity vector at frame j of the dictionary is b = (b_1, b_2, …, b_20), then l(i, j) using the Euclidean distance is given by (Equation 5).

[0045]

(Equation 5)
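DP matching with a Euclidean frame distance can be sketched as follows. Since (Equation 4) and (Equation 5) are not reproduced in this text, the recurrence here is an assumption: a basic symmetric step pattern (horizontal, vertical, diagonal) that accumulates distance, so the best dictionary entry is the one with the smallest total, whereas the text speaks of the largest score. No path-length normalization is applied. The names `frame_distance` and `dp_match` are hypothetical.

```python
import math
from typing import List, Optional, Sequence, Tuple

Vector = Sequence[float]
Cell = Tuple[int, int]


def frame_distance(a: Vector, b: Vector) -> float:
    """Euclidean distance between two similarity vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dp_match(inp: Sequence[Vector], ref: Sequence[Vector]) -> Tuple[float, List[Cell]]:
    """Return (accumulated distance, alignment path) between input and dictionary.

    Assumed recurrence:
        g(i, j) = l(i, j) + min(g(i-1, j), g(i, j-1), g(i-1, j-1))
    """
    if not inp or not ref:
        raise ValueError("both sequences must be non-empty")
    n_in, n_ref = len(inp), len(ref)
    g = [[math.inf] * n_ref for _ in range(n_in)]
    back: List[List[Optional[Cell]]] = [[None] * n_ref for _ in range(n_in)]
    for i in range(n_in):
        for j in range(n_ref):
            d = frame_distance(inp[i], ref[j])
            if i == 0 and j == 0:
                g[i][j] = d
                continue
            candidates = []
            if i > 0:
                candidates.append((g[i - 1][j], (i - 1, j)))
            if j > 0:
                candidates.append((g[i][j - 1], (i, j - 1)))
            if i > 0 and j > 0:
                candidates.append((g[i - 1][j - 1], (i - 1, j - 1)))
            best, prev = min(candidates, key=lambda c: c[0])
            g[i][j] = best + d
            back[i][j] = prev
    # Trace the DP path back from the final cell to (0, 0).
    path: List[Cell] = []
    cell: Optional[Cell] = (n_in - 1, n_ref - 1)
    while cell is not None:
        path.append(cell)
        cell = back[cell[0]][cell[1]]
    path.reverse()
    return g[n_in - 1][n_ref - 1], path
```

The returned path is what a back-trace of the DP path would yield, so the same routine could also serve for time alignment between two speakers' utterances.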

[0046] By emphasizing the similarities to the top-ranked phonemes, uniformly setting the similarities to the lower-ranked phonemes to 0, and computing the Euclidean distance between the resulting vectors, the movement of the similarities to the top-ranked phonemes is captured with emphasis while the movement of the similarities to the lower-ranked phonemes is ignored. In addition, frame-by-frame normalization allows the relative relationship of the similarities in the transitions between phonemes to be treated with the same weight as that around the feature frames. Such emphasis and normalization therefore yield a high recognition rate.

[0047] The case where the speech of two or more speakers is registered in the dictionary is described next. The recognition method is the same as in the case, already described, where the dictionary is registered from the utterances of one speaker. First, a method is described in which the same speech uttered by multiple speakers is time-aligned by DP matching and registered as a single dictionary; then a method is described in which the same speech uttered by multiple speakers is registered in the dictionary as multi-standard patterns.

[0048] When there are two speakers, DP matching is performed between the same speech uttered by the two speakers, just as in recognition, to achieve time alignment. Time alignment is explained with reference to FIG. 4, which shows an example of two speakers uttering "akai" (red). Because the duration of the utterance differs between speakers, DP matching is performed between the two speakers' utterances of the same recognition target, and the DP path is traced back from the result to obtain the time alignment. Time alignment brings the sections of the same phoneme (/a/, /k/, /a/, /i/) into correspondence. The average of each similarity is then computed between the time-aligned frames, and the resulting time series is registered as the dictionary.

【００４９】すなわち、図４の斜線で示した話者１の第
ｉフレームと話者２の第ｊフレームが時間的に整合する
場合は、話者１の第ｉフレームの類似度ベクトルをｃ＝
（ｃ ₁，ｃ₂，…，ｃ₂₀）、話者２の第ｊフレームをｅ＝
（ｅ₁，ｅ₂，…，ｅ₂₀）とすると、新しくｆ＝（（ｃ₁
＋ｅ₁））／２，（ｃ₂＋ｅ₂）／２，…，（ｃ₂₀＋
ｅ₂₀）／２）を求め、この類似度ベクトルｆを辞書のｉ
フレームの類似度ベクトルとして登録する。こうするこ
とによって、辞書の精度を向上させ、より高い認識率を
得ることができる。That is, when the i-th frame of speaker 1 and the j-th frame of speaker 2, shown hatched in FIG. 4, are temporally aligned, let the similarity vector of the i-th frame of speaker 1 be c = (c₁, c₂, …, c₂₀) and that of the j-th frame of speaker 2 be e = (e₁, e₂, …, e₂₀). A new vector f = ((c₁ + e₁)/2, (c₂ + e₂)/2, …, (c₂₀ + e₂₀)/2) is then computed and registered as the similarity vector of frame i of the dictionary. This improves the accuracy of the dictionary and yields a higher recognition rate.
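A minimal sketch of this averaging step is given below. It takes an already computed DP path as input. The text does not state what to register when one frame i of speaker 1 is aligned with several frames j of speaker 2; the sketch assumes that the corresponding vectors f are themselves averaged. The function name is hypothetical.

```python
def average_aligned_frames(speaker1, speaker2, path):
    """Build dictionary frames f = (c + e) / 2, indexed by speaker 1's frame i.

    `path` is a list of aligned (i, j) pairs covering every frame of speaker 1.
    Assumption: if frame i appears in several pairs, the resulting f vectors
    are averaged.
    """
    dims = len(speaker1[0])
    sums = [[0.0] * dims for _ in speaker1]
    counts = [0] * len(speaker1)
    for i, j in path:
        for d in range(dims):
            sums[i][d] += (speaker1[i][d] + speaker2[j][d]) / 2.0
        counts[i] += 1
    return [[s / counts[i] for s in sums[i]] for i in range(len(speaker1))]
```

For example, aligning c = (1.0, 0.0) with e = (0.5, 0.5) gives the dictionary frame f = (0.75, 0.25).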

【００５０】次に、複数話者の発声した音声をマルチ標
準パターンとして辞書に登録するときは、認識対象音声
を複数話者が発声した音声の類似度ベクトル時系列をそ
のまま辞書として複数個登録する。この場合は、辞書項
目毎に複数個登録されている標準パターンの中のどの辞
書で認識されてもその辞書項目を認識したものとする。Next, when the speech uttered by multiple speakers is registered in the dictionary as multi-standard patterns, the similarity vector time series of the recognition target speech uttered by each speaker are registered as they are, as multiple dictionary entries. In this case, a dictionary item is regarded as recognized whichever of the standard patterns registered for that item produces the match.
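The multi-standard-pattern decision rule can be illustrated as follows. This is a sketch only: `toy_distance` is a hypothetical stand-in for the DP matching score, and the dictionary layout (a mapping from dictionary item to a list of registered patterns) is an assumption made for the example.

```python
def toy_distance(p, q):
    """Hypothetical stand-in for the DP matching score between two patterns."""
    return sum(abs(a - b) for a, b in zip(p, q))


def recognize(input_pattern, dictionary, distance):
    """Return the dictionary item whose best-matching registered pattern is closest.

    `dictionary` maps each item to a list of patterns (one per registering speaker).
    """
    best_item, best_score = None, float("inf")
    for item, patterns in dictionary.items():
        for pattern in patterns:
            score = distance(input_pattern, pattern)
            if score < best_score:
                best_item, best_score = item, score
    return best_item
```

Because the minimum is taken over every pattern of every item, a match against any one of an item's patterns is enough for that item to be returned.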

【００５１】ただし、２名以上の話者の発声によって辞
書を作成する場合、辞書パターンの男女差を減らすた
め、男女各１名ずつまたは男女ほぼ同数の発声によって
辞書を作成する。Note that when a dictionary is created from the utterances of two or more speakers, it is created from one male and one female speaker, or from roughly equal numbers of male and female speakers, in order to reduce gender bias in the dictionary patterns.

【００５２】次に、本実施例を用いた音声認識実験およ
びその結果について説明する。実験は、２１２単語を発
声した２０名のデータを用い、２０名の中の１名が２１
２単語を発声したデータを辞書として登録し、他の１９
名の発声した２１２単語を認識する方法で行なった。実
験の結果、８８．５％という認識率を得ることができ
た。これに対し、音素標準パターンとのマッチングによ
り得られた類似度の時系列をそのまま使用し、フレーム
毎に正規化を行なわなかった場合の認識率は、８２．１
％であり、類似度の強調効果が認識率に大きく寄与して
いることが明らかになった。Next, a speech recognition experiment using this embodiment and its results are described. The experiment used data from 20 speakers who each uttered 212 words: the 212 words uttered by one of the 20 speakers were registered as the dictionary, and the 212 words uttered by the other 19 speakers were recognized. The resulting recognition rate was 88.5%. By contrast, when the time series of similarities obtained by matching against the phoneme standard patterns was used as is, without per-frame normalization, the recognition rate was 82.1%, showing that emphasizing the similarities contributes greatly to the recognition rate.

【００５３】１名の話者の発声で辞書を作成した場合、
その話者と異性の話者の認識率は平均８６．０％であ
り、同性の話者の音声の平均認識率９１．４％に比べ５
％程度低い。そこで、男女各１名の計２名が発声した認
識対象音声から得られる類似度ベクトルの時系列を平均
化した時系列パターンを辞書として使用すると、男女差
が解消されるため、９３．４％という高い認識率が得ら
れた。男女各１名の計２名が発声した音声を平均化しな
いで２つとも辞書として登録するマルチ標準パターンを
用いた方法では、９３．２％という認識率が得られた。When the dictionary was created from the utterances of a single speaker, the average recognition rate for speakers of the opposite sex was 86.0%, about 5 percentage points lower than the average recognition rate of 91.4% for speakers of the same sex. When a time-series pattern obtained by averaging the similarity vector time series of the recognition target speech uttered by two speakers, one male and one female, was used as the dictionary, the gender difference was eliminated and a high recognition rate of 93.4% was obtained. With the multi-standard-pattern method, in which the speech of the two speakers (one male, one female) is not averaged and both utterances are registered as dictionaries, a recognition rate of 93.2% was obtained.

【００５４】以上のように、入力音声を分析して特徴パ
ラメータを抽出し、多数の話者の音声データで作成した
音素標準パターンとの類似度計算からフレーム毎に類似
度ベクトルを求め、この類似度ベクトルに強調関数を施
してフレーム毎に正規化することにより、１名または数
名の少数話者の発声した音声の類似度ベクトル時系列を
辞書として登録するだけで、入力音声の類似度ベクトル
時系列と辞書とのＤＰマッチングにより不特定話者の音
声を精度良く認識することができる。As described above, the input speech is analyzed to extract feature parameters, a similarity vector is obtained for each frame by computing similarities to phoneme standard patterns created from the speech data of many speakers, and this similarity vector is passed through an emphasis function and normalized frame by frame. As a result, merely registering as the dictionary the similarity vector time series of speech uttered by one speaker or a small number of speakers makes it possible to recognize the speech of unspecified speakers accurately, by DP matching between the similarity vector time series of the input speech and the dictionary.

【００５５】次に、類似度ベクトルを指数関数などの類
似度の値の大きい部分を強調するような強調関数に通
し、その時間変化量を表す回帰係数を併用して、相関余
弦（correlation cosine）によって認識を行なう本発明
の第２の実施例について、図５を参照して説明する。Next, a second embodiment of the present invention is described with reference to FIG. 5. In this embodiment the similarity vector is passed through an emphasis function, such as an exponential function, that emphasizes large similarity values; regression coefficients representing its temporal variation are used together with it; and recognition is performed using the correlation cosine.

【００５６】図５において、１１は音響分析部、１２は
特徴パラメータ抽出部、１３は標準パターン格納部、１
４は類似度計算部、１５は類似度の強調部、１６は類似
度の正規化部、１７は回帰係数計算部、１８は回帰係数
の正規化部、１９はパラメータ時系列作成部、２０は辞
書格納部、２１はパターンマッチング部である。In FIG. 5, reference numeral 11 denotes an acoustic analysis unit, 12 a feature parameter extraction unit, 13 a standard pattern storage unit, 14 a similarity calculation unit, 15 a similarity emphasizing unit, 16 a similarity normalizing unit, 17 a regression coefficient calculation unit, 18 a regression coefficient normalization unit, 19 a parameter time series creation unit, 20 a dictionary storage unit, and 21 a pattern matching unit.

【００５７】この第２の実施例においても、前記第１の
実施例と同様に入力音声を音響分析部１１で分析して特
徴パラメータ抽出部１２で特徴パラメータを求め、あら
かじめ標準パターン格納部１３に登録してある音素標準
パターンとフレーム毎にマッチングし、類似度ベクトル
の時系列を類似度計算部１４で求める。In the second embodiment, as in the first, the input speech is analyzed by the acoustic analysis unit 11, feature parameters are obtained by the feature parameter extraction unit 12 and matched frame by frame against the phoneme standard patterns registered in advance in the standard pattern storage unit 13, and the similarity calculation unit 14 obtains a time series of similarity vectors.

【００５８】次に類似度の強調部１５において、類似度
計算部１４で求められた類似度を指数関数である強調関
数Ｇに通すことによって、値の大きい類似度がより大き
くなるよう変換する。この強調関数Ｇは、入力音声の類
似度ベクトルをａ＝（ａ₁，ａ₂，ａ₃，…，ａ_i，…，ａ
₂₀）とすると（数６）で表される。Next, in the similarity emphasizing unit 15, the similarities obtained by the similarity calculation unit 14 are passed through an emphasis function G, which is an exponential function, so that large similarities become still larger. Letting the similarity vector of the input speech be a = (a₁, a₂, a₃, …, a_i, …, a₂₀), the emphasis function G is expressed by (Equation 6).

【００５９】[0059]

【数６】 (Equation 6)

【００６０】α、βは全音素、全フレームに対して共通
な定数であり、この式により全フレームに対して新たに
類似度ベクトルを計算する。α and β are constants common to all phonemes and all frames, and this equation is used to compute a new similarity vector for every frame.
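The effect of an exponential emphasis function can be illustrated with the sketch below. The exact form of (Equation 6) is not reproduced in this text, so the form exp(α·a_i + β) used here is purely an assumption for illustration, as are the function name and the example constants.

```python
import math


def emphasize(similarities, alpha, beta):
    """Apply an assumed exponential emphasis exp(alpha * a_i + beta) to each element.

    alpha and beta are shared by all phonemes and all frames, as in the text;
    the functional form itself is a guess, not the patent's Equation 6.
    """
    return [math.exp(alpha * a + beta) for a in similarities]
```

With alpha > 0, equal steps between input similarities turn into growing steps between output values, so the larger similarities stand out more strongly after emphasis.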

【００６１】さらに類似度の正規化部１６において、こ
のｎ次元の類似度ベクトルをフレーム毎に大きさ１に正
規化して新たな類似度ベクトルを作成する。これを式で
表すと（数７）のようになる。Further, the similarity normalizing unit 16 normalizes this n-dimensional similarity vector to magnitude 1 for each frame, creating a new similarity vector. This is expressed by (Equation 7).

【００６２】[0062]

【数７】 (Equation 7)

【００６３】ここで、強調関数Ｇによって強調された類
似度ベクトルをａ’＝（ａ₁’，ａ₂’，ａ₃’，…，
ａ_i’，…，ａ₂₀’）とし、大きさ１にしたベクトルを
ａ”＝（ａ₁”，ａ₂”，ａ₃”，…，ａ_i”，…，
ａ₂₀”）とする。フレーム毎の類似度ベクトルの大きさ
を１にすることにより、全音声区間に亙って類似度の相
対関数の特徴を平等に扱うことができるようになる。Here, the similarity vector emphasized by the emphasis function G is denoted a′ = (a₁′, a₂′, a₃′, …, a_i′, …, a₂₀′), and the vector normalized to magnitude 1 is denoted a″ = (a₁″, a₂″, a₃″, …, a_i″, …, a₂₀″). Setting the magnitude of the similarity vector of each frame to 1 allows the features of the relative relationship among the similarities to be treated equally over the entire speech interval.
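Normalizing a frame to magnitude 1 amounts to dividing each element of a′ by the Euclidean norm of a′. A minimal sketch follows; the function name and the handling of an all-zero frame are assumptions.

```python
import math


def normalize_frame(vec):
    """Scale a similarity vector so that its Euclidean magnitude is 1."""
    norm = math.sqrt(sum(v * v for v in vec))
    if norm == 0.0:
        # Assumption: an all-zero frame is returned unchanged.
        return list(vec)
    return [v / norm for v in vec]
```

After this step only the direction of the vector, that is, the relative sizes of the similarities within the frame, is retained.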

【００６４】次に回帰係数計算部１７で、正規化された
各類似度の時系列に対して類似度の時間的変化量である
回帰係数（ｎ個）をフレーム毎に求める。回帰係数は、
各音素に対する類似度のそれぞれの時間方向の傾きであ
る。すなわち、例えばまず音素/a/の標準パターンに対
する類似度の時系列の、あるフレームの前後２フレーム
の類似度値（計５フレームの類似度値）の最小２乗近似
直線の傾き（類似度の時間的変化量）を求める。これを
（数８）に示す。Next, the regression coefficient calculation unit 17 obtains, for each frame, n regression coefficients, which are the temporal variations of the normalized similarity time series. A regression coefficient is the slope in the time direction of the similarity to each phoneme. For example, for the time series of the similarity to the standard pattern of phoneme /a/, the slope of the least-squares line fitted to the similarity values of a given frame and the two frames before and after it (five frames in total) is obtained as the temporal variation of the similarity. This is shown in (Equation 8).

【００６５】[0065]

【数８】 (Equation 8)

【００６６】ここで、ｘｔ＝（ｔ＝１，２，３，…）は
音素/a/に対する類似度の時系列を表し、Ｋ（/a/）は時
刻ｔ＋２における音素/a/の回帰係数である。これを各
音素に対する類似度について２０個求め、さらに１フレ
ーム毎に全フレームに対して求め、回帰係数ベクトルの
時系列とする。Here, x_t (t = 1, 2, 3, …) denotes the time series of the similarity to phoneme /a/, and K(/a/) is the regression coefficient of phoneme /a/ at time t+2. Such a coefficient is obtained for each of the 20 phoneme similarities, and this is repeated frame by frame over all frames to give a time series of regression coefficient vectors.
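The five-frame regression coefficient can be sketched as below. For equally spaced frames with offsets k = -2, …, 2 around the center frame, the least-squares slope reduces to Σ k·x(center + k) / Σ k², because the offsets sum to zero. The function name is hypothetical, and skipping the first and last two frames, where a full window is unavailable, is an assumption.

```python
def regression_coefficients(x, half_width=2):
    """Least-squares slope of x over a window of 2 * half_width + 1 frames.

    Returns one slope per frame that has a full window; edge frames are
    skipped (an assumption, the text does not say how edges are handled).
    """
    offsets = range(-half_width, half_width + 1)
    denom = sum(k * k for k in offsets)  # 10 for a five-frame window
    slopes = []
    for center in range(half_width, len(x) - half_width):
        slopes.append(sum(k * x[center + k] for k in offsets) / denom)
    return slopes
```

Applying this to each of the 20 similarity time series and collecting the results per frame gives the time series of regression coefficient vectors.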

【００６７】次に回帰係数の正規化部１８で、類似度と
同様に回帰係数ベクトルをフレーム毎に大きさ１に正規
化する。Next, the regression coefficient normalization unit 18 normalizes the regression coefficient vector to magnitude 1 for each frame, in the same way as the similarity vector.

【００６８】そしてパラメータ時系列作成部１９で、指
数関数によって強調を施した大きさ１のｎ次元の類似度
ベクトルおよびそこから求めた大きさ１のｎ次元の回帰
係数ベクトルの時系列の両方をパラメータ時系列とす
る。Then the parameter time series creation unit 19 takes as the parameter time series both the time series of n-dimensional similarity vectors of magnitude 1, emphasized by the exponential function, and the time series of n-dimensional regression coefficient vectors of magnitude 1 derived from them.

【００６９】ここまでの手続きは辞書作成時、認識時と
もに同じであり、辞書作成時には、第１の実施例と同様
にまず最初に１名の発声した認識対象音声を入力音声と
して辞書を作成し、認識時にはその辞書を用いて不特定
話者の入力音声の認識を行なう。The procedure up to this point is the same for dictionary creation and for recognition. At dictionary creation, as in the first embodiment, a dictionary is first created by taking the recognition target speech uttered by one speaker as the input speech; at recognition, that dictionary is used to recognize the input speech of unspecified speakers.

【００７０】辞書作成時には、パラメータ時系列作成部
１９で求められたパラメータ時系列を辞書格納部２０に
登録する。なお、第１の実施例で既に述べた方法と同様
にして、２名以上の少数話者の発声した同一音声から辞
書を作成し登録してもよい。At dictionary creation, the parameter time series obtained by the parameter time series creation unit 19 is registered in the dictionary storage unit 20. As in the method already described for the first embodiment, the dictionary may also be created and registered from the same speech uttered by a small number of speakers (two or more).

【００７１】認識時には、パターンマッチング部２１に
おいて、辞書登録時と同様の方法で求めたパラメータ時
系列と辞書格納部２０にあるパラメータの時系列とを相
関余弦を用いてＤＰマッチングし、もっとも類似度の大
きい辞書項目を認識結果とする。ＤＰマッチングを行な
う漸化式は第１の実施例で用いた（数４）と同様である
が、距離関数１（ｉ，ｊ）の距離尺度として本実施例で
は相関余弦を用いる。回帰係数を併用して相関余弦を用
いた距離関数１（ｉ，ｊ）は、（数９）で表される。At recognition, the pattern matching unit 21 performs DP matching, using the correlation cosine, between the parameter time series obtained in the same way as at dictionary registration and the parameter time series held in the dictionary storage unit 20, and takes the dictionary item with the highest similarity as the recognition result. The recurrence formula for DP matching is the same as (Equation 4) used in the first embodiment, but this embodiment uses the correlation cosine as the distance measure of the distance function 1 (i, j). The distance function 1 (i, j) that uses the correlation cosine together with the regression coefficients is expressed by (Equation 9).

【００７２】[0072]

【数９】 (Equation 9)

【００７３】ただし、入力音声のｉフレームにおける類
似度ベクトルをａ＝（ａ₁，ａ₂，…，ａ₂₀）、回帰係数
ベクトルをｃ＝（ｃ₁，ｃ₂，…，ｃ₂₀）、辞書のｊフレ
ームにおける類似度ベクトルをｂ＝（ｂ₁，ｂ₂，…，ｂ
₂₀）、回帰係数ベクトルをｄ＝（ｄ₁，ｄ₂，…，ｄ₂₀）
とする。ｗは類似度と回帰係数の混合比率であり、０．
４から０．６がよい。実際にはすでに類似度ベクトル、
回帰係数ベクトルとも大きさ１に正規化されているた
め、それぞれ内積を求め、ｗ：（１−ｗ）の重みで足し
合わせるだけでよい。すなわち（数１０）のようにな
る。Here, the similarity vector of frame i of the input speech is a = (a₁, a₂, …, a₂₀) and its regression coefficient vector is c = (c₁, c₂, …, c₂₀); the similarity vector of frame j of the dictionary is b = (b₁, b₂, …, b₂₀) and its regression coefficient vector is d = (d₁, d₂, …, d₂₀). w is the mixing ratio between the similarity and the regression coefficient, and a value from 0.4 to 0.6 is preferable. In practice, because the similarity vectors and the regression coefficient vectors are already normalized to magnitude 1, it suffices to compute the respective inner products and add them with weights w : (1 − w). This gives (Equation 10).

【００７４】[0074]

【数１０】 (Equation 10)
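The weighted combination described in the text for (Equation 10) can be sketched as follows, assuming all four vectors have already been normalized to magnitude 1 so that each inner product equals the correlation cosine. The function names and the default w = 0.5 (inside the stated range of 0.4 to 0.6) are choices made for this illustration.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(u, v))


def frame_similarity(a, c, b, d, w=0.5):
    """Mix the similarity and regression-coefficient inner products as w : (1 - w).

    a, c: similarity and regression coefficient vectors of input frame i.
    b, d: similarity and regression coefficient vectors of dictionary frame j.
    All vectors are assumed to have magnitude 1.
    """
    return w * dot(a, b) + (1.0 - w) * dot(c, d)
```

The result is a similarity, so larger values indicate a better frame match; identical unit vectors on both sides give the maximum value of 1.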

【００７５】次に第２の実施例を用いた音声認識実験お
よびその結果について説明する。実験は、２１２単語を
発声した２０名のデータを用い、２０名の中の１名が２
１２単語を発声したデータを辞書として登録し、他の１
９名の発声した２１２単語を認識する方法で行なった。
図１に示す第１の実施例に対しこの第２の実施例におけ
る回帰係数ベクトルを併用すると、９０．３％の単語認
識率が得られた。これは回帰係数を併用しない８８．５
％に比べ１．８％向上している。また、図５に示す第２
の実施例の方法で回帰係数を併用すると９１．６％とな
り、さらに１．３％の認識率の向上がみられた。Next, a speech recognition experiment using the second embodiment and its results are described. The experiment used data from 20 speakers who each uttered 212 words: the 212 words uttered by one of the 20 speakers were registered as the dictionary, and the 212 words uttered by the other 19 speakers were recognized. When the regression coefficient vectors of this second embodiment were used together with the first embodiment shown in FIG. 1, a word recognition rate of 90.3% was obtained, an improvement of 1.8 percentage points over the 88.5% obtained without regression coefficients. When the regression coefficients were used with the method of the second embodiment shown in FIG. 5, the rate was 91.6%, a further improvement of 1.3 percentage points.

【００７６】また、男女各１名の計２名の話者の発声し
た単語音声を平均化したデータを辞書として登録し、残
り１８名の単語音声を評価すると９５．９％の高い認識
率が得られた。Further, when data obtained by averaging the word speech uttered by two speakers, one male and one female, was registered as the dictionary and the word speech of the remaining 18 speakers was evaluated, a high recognition rate of 95.9% was obtained.

【００７７】以上のように、類似度に強調関数を施し、
その回帰係数を併用して相関余弦を用いたパターンマッ
チングを行なうことにより認識率が向上する。As described above, the recognition rate is improved by applying an emphasis function to the similarities and performing pattern matching using the correlation cosine together with their regression coefficients.

【００７８】なお、第２の実施例において回帰係数は、
類似度ベクトルを指数関数で強調して大きさ１に正規化
したものに対して求めたが，図６に示すように、類似度
の正規化部１６で大きさ１にする前に，類似度の強調部
１５における指数関数で強調した類似度系列に対して求
めてもよい。In the second embodiment, the regression coefficients were computed from the similarity vectors that had been emphasized by the exponential function and normalized to magnitude 1. However, as shown in FIG. 6, they may instead be computed from the similarity series emphasized by the exponential function in the similarity emphasizing unit 15, before the similarity normalizing unit 16 normalizes it to magnitude 1.

【００７９】[0079]

【発明の効果】以上のように、本発明は、音声を分析し
て得られた特徴パラメータに対してあらかじめ多くの話
者で作成したｎ種類の標準パターンとの類似度計算を行
なって類似度系列を求め，強調関数を通してフレーム毎
に類似度の大きいものがより大きくなるように類似度を
強調し、ｎ次元の類似度ベクトルまたはｎ次元の類似度
ベクトルとそのｎ次元回帰係数ベクトルを音声認識のた
めの特徴パラメータとすることによって、１名から数名
の少数の話者が発声した認識対象音声を辞書として登録
するだけで、精度良く高い認識率で不特定話者の音声認
識を行なうことができる。As described above, the present invention computes similarities between the feature parameters obtained by analyzing speech and n kinds of standard patterns created in advance from many speakers to obtain a similarity series, emphasizes the similarities through an emphasis function so that, in each frame, large similarities become still larger, and uses the n-dimensional similarity vector, or the n-dimensional similarity vector together with its n-dimensional regression coefficient vector, as the feature parameters for speech recognition. As a result, merely registering as the dictionary the recognition target speech uttered by a small number of speakers, from one to several, makes it possible to recognize the speech of unspecified speakers accurately and with a high recognition rate.

【００８０】したがって、辞書の作成が極めて容易であ
り、また認識対象音声を変更したい場合には、１名また
は数名の少数の話者が発声した音声データを辞書として
登録するだけで更新できる。Therefore, creating the dictionary is extremely easy, and when the recognition target speech needs to be changed, the dictionary can be updated merely by registering speech data uttered by one or a few speakers.

【００８１】また強調関数を通してフレーム毎に正規化
することにより、より高い認識率を得ることができる。Further, by normalizing each frame through the enhancement function, a higher recognition rate can be obtained.

【００８２】さらにまた男女同数の少数話者の発声した
音声データから辞書を作成することにより、さらに高い
認識率を得ることができる。Furthermore, creating the dictionary from speech data uttered by a small number of speakers with equal numbers of men and women yields a still higher recognition rate.

【００８３】このように、本発明は、不特定話者用音声
認識装置の性能向上および種々の用途へ適用するための
柔軟性の向上に対して極めて大きく貢献するものであ
る。As described above, the present invention greatly contributes to the improvement of the performance of the speech recognition apparatus for unspecified speakers and the improvement of the flexibility for application to various uses.

[Brief description of the drawings]

【図１】本発明の第１の実施例を示す音声認識方法を実
施するための機能ブロック図FIG. 1 is a functional block diagram for implementing a voice recognition method according to a first embodiment of the present invention;

【図２】同実施例における類似度ベクトルの時系列の一
例を示す時系列図FIG. 2 is a time series diagram showing an example of a time series of similarity vectors in the embodiment.

【図３】同実施例における強調・正規化後の類似度ベク
トルの時系列の一例を示す時系列図FIG. 3 is a time series diagram showing an example of a time series of a similarity vector after emphasis and normalization in the embodiment.

【図４】同実施例における２名の話者の登録音声に対す
る時間整合の一例を示す模式図FIG. 4 is a schematic diagram showing an example of time alignment for registered voices of two speakers in the embodiment.

【図５】本発明の第２の実施例における機能ブロック図FIG. 5 is a functional block diagram according to a second embodiment of the present invention.

【図６】本発明の第２の実施例の変形例を示す機能ブロ
ック図FIG. 6 is a functional block diagram showing a modification of the second embodiment of the present invention.

【図７】従来の単語音声認識方法の一例を示す機能ブロ
ック図FIG. 7 is a functional block diagram showing an example of a conventional word speech recognition method.

【図８】従来の単語音声認識方法の他の例を示す機能ブ
ロック図FIG. 8 is a functional block diagram showing another example of the conventional word speech recognition method.

[Explanation of symbols]

１音響分析部２特徴パラメータ抽出部３標準パターン格納部４類似度計算部５類似度の強調・正規化部６パラメータ時系列作成部７辞書格納部８パターンマッチング部１１音響分析部１２特徴パラメータ抽出部１３標準パターン格納部１４類似度計算部１５類似度の強調１６類似度の正規化部１７回帰係数計算部１８回帰係数の正規化部１９パラメータ時系列作成部２０辞書格納部２１パターンマッチング部
1 acoustic analysis unit
2 feature parameter extraction unit
3 standard pattern storage unit
4 similarity calculation unit
5 similarity emphasizing and normalizing unit
6 parameter time series creation unit
7 dictionary storage unit
8 pattern matching unit
11 acoustic analysis unit
12 feature parameter extraction unit
13 standard pattern storage unit
14 similarity calculation unit
15 similarity emphasizing unit
16 similarity normalizing unit
17 regression coefficient calculation unit
18 regression coefficient normalization unit
19 parameter time series creation unit
20 dictionary storage unit
21 pattern matching unit

フロントページの続き (56)参考文献特開平１−216397（ＪＰ，Ａ) 特開昭60−200296（ＪＰ，Ａ) 特開昭２−248999（ＪＰ，Ａ) 特開昭62−66300（ＪＰ，Ａ) 特開昭62−145297（ＪＰ，Ａ) 特開昭60−202488（ＪＰ，Ａ) 特開昭60−209794（ＪＰ，Ａ) 特公昭62−17760（ＪＰ，Ｂ２) 特公昭61−54240（ＪＰ，Ｂ２) ＩＥＥＥＴｒａｎｓａｃｔｉｏｎｓｏｎＡｃｏｕｓｔｉｃｓ，ＳｐｅｅｃｈａｎｄＳｉｇｎａｌｐｒｏｃｅｓｓｉｎｇ，Ｖｏｌ．ＡＳＳＰ−34, Ｎｏ．１，ｐ．52−59（1986) (58)調査した分野(Int.Cl.⁶，ＤＢ名) G10L 3/00 531 G10L 3/00 521 ＪＩＣＳＴファイル（ＪＯＩＳ)Continuation of the front page (56) References JP-A-1-216397 (JP, A) JP-A-60-200296 (JP, A) JP-A-2-248999 (JP, A) JP-A-62-66300 (JP, A) JP-A-62-145297 (JP, A) JP-A-60-202488 (JP, A) JP-A-60-209794 (JP, A) JP-B-62-17760 (JP, B2) JP-B-sho 61-54240 (JP, B2) IEEE Transactions on Acoustics, Speech and Signaling, Vol. ASSP-34, no. 1, p. 52-59 (1986) (58) Fields investigated (Int. Cl. ⁶ , DB name) G10L 3/00 531 G10L 3/00 521 JICST file (JOIS)

Claims

(57) [Claims]

[Claim 1] A speech recognition method, wherein: at the time of dictionary creation, recognition target speech uttered by a small number of speakers, from one to several, is analyzed to obtain m feature parameters for each frame, which is the unit of analysis time; these are matched against n kinds of standard patterns created in advance from many speakers to obtain n similarities for each frame; the resulting n-dimensional similarity vector is passed through an emphasis function that makes larger similarities larger and is normalized for each frame; and the time-series pattern of the n-dimensional similarity vectors thus created is registered as a dictionary; and, at the time of recognizing input speech, the m feature parameters obtained for each frame by acoustic analysis of the input speech are matched against the n kinds of standard patterns to obtain an n-dimensional similarity vector, a time series of similarity vectors is created through the same emphasis function as at dictionary creation, and this time series is compared with the time-series patterns of similarity vectors registered in the dictionary to recognize the input speech.

2. The speech recognition method according to claim 1, wherein the temporal variation of each dimension of the time series of n-dimensional similarity vectors to which the emphasis function has been applied is obtained for each frame, n values per frame, and a time-series pattern is created using both the resulting n-dimensional time variation vector and the n-dimensional similarity vector.

3. The speech recognition method according to claim 1, wherein, when comparing the input speech with the dictionary, matching is performed based on the distance between the similarity vector, or the similarity vector and its time variation vector, of the input speech normalized for each frame and the similarity vector, or the similarity vector and its time variation vector, of the dictionary speech.

4. The speech recognition method according to claim 1, wherein, as the emphasis function, a function is used that selects the first to k-th elements when the elements of the similarity vector are arranged in descending order and assigns the lowest value to the (k + 1)-th and lower elements, thereby weighting the elements by rank, and a Euclidean distance is used as the distance measure for comparing the input speech with the dictionary.

5. The speech recognition method according to claim 1, wherein an exponential function is used as the emphasis function that makes larger values larger, and a correlation distance is used as the distance measure for comparing the input speech with the dictionary.

6. The speech recognition method according to claim 1 or 2, wherein two or more speakers utter the same recognition target speech, each utterance is analyzed to obtain a time series of n-dimensional similarity vectors, or of n-dimensional similarity vectors and n-dimensional time variation vectors, time alignment between the speakers is performed by DP matching, the average of each similarity is computed between the temporally aligned frames, and the time-series pattern of the averages is registered in the dictionary.

7. The speech recognition method according to claim 1 or 2, wherein two or more speakers utter the same recognition target speech, each utterance is analyzed to obtain a plurality of time series of n-dimensional similarity vectors, or of n-dimensional similarity vectors and n-dimensional time variation vectors, and these are registered as a dictionary and used as multi-standard patterns.

8. The speech recognition method according to claim 6, wherein, when creating a dictionary based on the utterances of two or more speakers, the dictionary is created based on one male and one female or approximately the same number of male and female utterances.