JPH07113838B2

JPH07113838B2 - Speech recognition method

Info

Publication number: JPH07113838B2
Application number: JP3338102A
Authority: JP
Inventors: 明石田
Original assignee: Panasonic Corp; Matsushita Electric Industrial Co Ltd
Current assignee: Panasonic Corp; Panasonic Holdings Corp
Priority date: 1991-12-20
Filing date: 1991-12-20
Publication date: 1995-12-06
Anticipated expiration: 2010-12-06
Also published as: JPH0619497A

Description

Detailed Description of the Invention

[0001]

[Field of Industrial Application] The present invention relates to a speech recognition method for recognizing the speech of unspecified speakers.

[0002]

[Prior Art] Conventionally, standard patterns for recognizing the speech of unspecified speakers have been created from data in which many speakers uttered the recognition target words in advance: a human cut out the speech segments, for example by visual inspection, and the segments were then processed statistically. With this method, speech data actually uttered by several hundred speakers is used to create the recognition word dictionary for unspecified speakers. Recently it has become possible to create such standard patterns from data in which only one or a few speakers uttered the recognition target words. For example, the configuration described in "An unspecified-speaker speech recognition method that models the dynamic features of word speech from the utterances of a small number of speakers" (Institute of Electronics, Information and Communication Engineers, SP91-20) is known.

[0003] FIG. 7 shows the configuration of this technique. In FIG. 7, 1 is an acoustic analysis unit, 2 is a feature parameter extraction unit, 3 is a similarity calculation unit, 4 is a standard pattern storage unit, 5 is a regression coefficient calculation unit, 6 is a parameter sequence creation unit, 7 is a recognition unit, and 8 is a dictionary storage unit.

[0004] The standard pattern storage unit 4 stores phoneme standard patterns. For data uttered in advance by many speakers, the temporal position that best expresses the features of each of the n phonemes (the feature frame) is determined, and each standard pattern is created from the time pattern of the feature parameters centered on that feature frame. As the time pattern, LPC cepstrum coefficients (C0 to C8) are calculated for several frames before and after the feature frame and arranged into a one-dimensional parameter sequence. The mean vector of the elements of this parameter sequence and the covariance matrix between the elements are then computed and stored as the standard pattern.
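The construction of one standard pattern described above can be illustrated roughly as follows. This is a minimal sketch under stated assumptions, not the procedure of the cited method: the function names `stack_window` and `mean_and_covariance`, the window half-width, the use of the biased (divide-by-count) covariance, and the toy data with two coefficients per frame are all hypothetical.

```python
def stack_window(frames, center, half_width):
    """Concatenate the coefficient vectors of the frames around `center`."""
    stacked = []
    for t in range(center - half_width, center + half_width + 1):
        stacked.extend(frames[t])
    return stacked


def mean_and_covariance(samples):
    """Mean vector and biased covariance matrix of equal-length vectors."""
    count = len(samples)
    dim = len(samples[0])
    mean = [sum(s[d] for s in samples) / count for d in range(dim)]
    cov = [
        [
            sum((s[a] - mean[a]) * (s[b] - mean[b]) for s in samples) / count
            for b in range(dim)
        ]
        for a in range(dim)
    ]
    return mean, cov


# Toy data: two coefficients per frame instead of the nine (C0 to C8) in the text.
utterance_a = [[0.0, 1.0], [1.0, 2.0], [2.0, 3.0]]
utterance_b = [[2.0, 1.0], [3.0, 2.0], [4.0, 3.0]]

# One stacked parameter sequence per utterance, centered on the feature frame.
samples = [stack_window(u, center=1, half_width=1) for u in (utterance_a, utterance_b)]
mean, cov = mean_and_covariance(samples)
```

With nine coefficients per frame and a window of several frames, the stacked vector and its covariance matrix become correspondingly larger, but the computation has the same shape.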

[0005] The dictionary storage unit 8 holds, registered as the dictionary, the complete time series of similarity vectors obtained by analyzing recognition target speech uttered in advance by one or more speakers and calculating, frame by frame, its similarity to the n standard patterns described above, together with the time series of regression coefficient vectors, likewise n-dimensional, obtained for each frame as the slope of the similarity vector.

[0006] When speech is actually input, the acoustic analysis unit 1 performs acoustic analysis, the feature parameter extraction unit 2 calculates the LPC cepstrum coefficients, and the similarity calculation unit 3 calculates the similarity to the standard patterns in the standard pattern storage unit 4. The regression coefficient calculation unit 5 calculates regression coefficients from the frame-to-frame change in similarity, the parameter sequence creation unit 6 arranges the resulting similarities and regression coefficients into a one-dimensional time series, and the recognition unit 7 performs recognition using the dictionary in the dictionary storage unit 8.
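The regression coefficient step can be sketched as a least-squares slope of each similarity over a short run of frames. This is only an illustration: the text does not state the window length or how the ends of an utterance are handled, so the half-width of 2 frames, the clipping at the ends, and the names `slope` and `regression_coefficients` are assumptions.

```python
def slope(values):
    """Least-squares slope of `values` against frame offsets 0, 1, 2, ..."""
    count = len(values)
    mean_t = (count - 1) / 2
    mean_v = sum(values) / count
    num = sum((t - mean_t) * (v - mean_v) for t, v in enumerate(values))
    den = sum((t - mean_t) ** 2 for t in range(count))
    # A single-frame window has no slope; return 0.0 in that case (assumption).
    return num / den if den else 0.0


def regression_coefficients(similarities, half_width=2):
    """Per-frame, per-pattern slope of the similarity time series.

    similarities[t][k] is the similarity of frame t to standard pattern k.
    Windows are clipped at the ends of the utterance (assumption).
    """
    n_frames = len(similarities)
    n_patterns = len(similarities[0])
    result = []
    for t in range(n_frames):
        lo = max(0, t - half_width)
        hi = min(n_frames, t + half_width + 1)
        result.append(
            [slope([similarities[u][k] for u in range(lo, hi)]) for k in range(n_patterns)]
        )
    return result


# Toy data: pattern 0 rises by 1 per frame, pattern 1 falls by 1 per frame.
sims = [[0.0, 4.0], [1.0, 3.0], [2.0, 2.0], [3.0, 1.0], [4.0, 0.0]]
coeffs = regression_coefficients(sims)
```

For this toy input every frame yields a coefficient of 1 for the rising pattern and -1 for the falling one, which matches the description of the regression coefficient as the slope of the similarity.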

[0007] Recognition uses a matching method called DP matching. An example of the recurrence formula for DP matching is shown in (Equation 1). Here the dictionary length is J frames, the input length is I frames, the distance function between the i-th frame and the j-th frame is l(i,j), and the cumulative similarity is g(i,j).

[0008]

[Equation 1]

[0009] In DP matching, a sum-of-products calculation is performed over all frames of the dictionary and the input, so the number of calculations becomes very large.
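Equation 1 itself is not reproduced in this text, so the following is only a minimal sketch of a DP matching recurrence of the common textbook kind, not the formula of the cited method. It assumes that l(i,j) is the sum of products of the two frame vectors and that g(i,j) = l(i,j) + max{g(i-1,j), g(i,j-1), g(i-1,j-1)}; the function name `dp_match` and the toy data are hypothetical.

```python
def dp_match(dictionary, inputs):
    """Cumulative similarity g(I, J) between two sequences of frame vectors."""
    J, I = len(dictionary), len(inputs)

    def l(i, j):
        # Frame-level score: sum of products of input frame i and dictionary frame j.
        return sum(a * b for a, b in zip(inputs[i], dictionary[j]))

    neg_inf = float("-inf")
    g = [[neg_inf] * J for _ in range(I)]
    for i in range(I):
        for j in range(J):
            if i == 0 and j == 0:
                best = 0.0
            else:
                # Best predecessor among the three neighbouring cells (assumed path set).
                best = max(
                    g[i - 1][j] if i > 0 else neg_inf,
                    g[i][j - 1] if j > 0 else neg_inf,
                    g[i - 1][j - 1] if i > 0 and j > 0 else neg_inf,
                )
            g[i][j] = best + l(i, j)
    return g[I - 1][J - 1]


# Toy data: a 2-frame dictionary entry and a 3-frame input.
dictionary = [[1, 0], [0, 1]]
inputs = [[1, 0], [1, 0], [0, 1]]
score = dp_match(dictionary, inputs)
```

In this sketch the frame score is evaluated I × J times, and each evaluation costs one multiplication per vector element, which is why the cost depends on both sequence lengths and on the dimension of the vectors being compared.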

[0010]

[Problems to Be Solved by the Invention] As described above, the conventional method requires many calculations for matching and is therefore slow, and it also needs a large storage area to hold the dictionary.

[0011] The present invention solves these problems. Its object is to reduce the number of parameter sequences registered in the recognition word dictionary while maintaining a high recognition rate, thereby reducing both the dictionary storage area and the number of operations in recognition, enabling high-speed recognition processing, and facilitating hardware implementation.

[0012]

[Means for Solving the Problems] To achieve this object, in the present invention the recognition target speech is uttered by a small number of speakers, from one to several. The m feature parameters (m is an integer) obtained at each analysis time (frame) are matched against the m feature parameters of each of n kinds (n is an integer) of standard patterns created in advance from many speakers, and n similarities are obtained for each frame. From the time-series pattern formed by these n-dimensional similarity vectors, N elements (N is an integer smaller than n) are selected and registered as the word dictionary. Input speech to be recognized is analyzed in the same way, its m feature parameters are matched against the m feature parameters of each of the n kinds of standard patterns, and the resulting time series of n-dimensional similarity vectors is compared with the time series of N-dimensional similarity vectors registered in the dictionary. In this way the method recognizes the speech of the speakers who registered the recognition target speech as well as other input speech.

[0013]

[Operation] With the above configuration, the present invention obtains, for each frame, the similarity between the feature parameters obtained by analyzing speech uttered by a small number of speakers (one to several) and n kinds of standard patterns, such as phonemes or syllables, created from many speakers. By reducing these n similarities to a number N smaller than n, and by reducing the regression coefficients, obtained as the amount of change in similarity, from n to a number M smaller than n, the number of calculations in the matching stage, which is the most time-consuming part of recognition, is reduced accordingly. The capacity of the dictionary storage unit is likewise reduced by the amount by which the similarities fall from n to N and the regression coefficients fall from n to M.

[0014]

[Embodiment] An embodiment of the present invention is described below, preceded by an outline of the invention.

[0015] The feature parameters obtained by analyzing recognition target speech uttered by one to several speakers are matched, at each analysis time (one frame), against n kinds of standard patterns created in advance from many speakers. From the resulting time series of n-dimensional similarity vectors, N effective feature values are then chosen and registered as the dictionary. Input speech to be recognized is likewise matched against the n kinds of standard patterns to obtain a time series of n-dimensional similarity vectors, and only the elements corresponding to the N similarity feature values in the dictionary are compared. For the regression coefficients, obtained as the temporal change of similarity, M effective feature values are likewise selected from the n and used. Recognizing the speech of unspecified speakers in this way reduces the matching calculation from 2×n operations per frame per word to N+M.

[0016] For example, when the word "kasa" is uttered, in the /a/ portion the similarities of vowels such as /a/ and /o/ are large, whereas the similarities of consonants such as /k/, /t/, and /s/ are extremely small, negligible compared with the similarity of /a/. In other words, when the similarity of the input speech to n kinds of standard patterns (phonemes, syllables, and so on) created from many speakers is obtained for each frame, the similarities to phonemes or syllables that differ from the input take extremely small values and contribute little to actual recognition. Elements with small similarity, that is, elements that play no part in recognition, therefore need not be registered as components of the dictionary.

[0017] Doing so reduces the number of calculations in the matching stage, the most time-consuming part of recognition, and reduces the capacity of the dictionary storage unit by the amount by which the similarities and regression coefficients were reduced.

[0018] An embodiment of the present invention is described below with reference to FIG. 1. In FIG. 1, 9 is an acoustic analysis unit, 10 is a feature parameter extraction unit, 11 is a similarity calculation unit, 12 is a standard pattern storage unit, 13 is a regression coefficient calculation unit, 14 is a parameter sequence creation unit, 15 is a parameter selection unit, 16 is a recognition unit, and 17 is a dictionary storage unit.

[0019] FIG. 2 is a conceptual diagram explaining the matching method, in which 18 is an index unit, 19 is a similarity storage unit, and 20 is a regression coefficient storage unit. When the dictionary to be placed in the dictionary storage unit 17 is created, the parameter selection unit 15 selects only the effective parameters. In this embodiment, from the similarities corresponding to the 23 phonemes described later, only the N largest are selected and stored in the similarity storage unit 19. The dictionary stored in the regression coefficient storage unit 20 is created in the same way, by selecting only the M regression coefficients with the largest absolute values from those corresponding to the 23 phonemes.
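The selection performed by the parameter selection unit 15 for one frame can be sketched as follows. This is an illustrative sketch only: the function name `select_frame`, the dictionary-of-lists layout, and the toy values (four patterns instead of 23) are assumptions, and ties are broken by the lower phoneme position, which the text does not specify.

```python
def select_frame(similarities, regressions, n_keep, m_keep):
    """Keep the N largest similarities and the M regression coefficients
    with the largest absolute values, recording which patterns they belong to."""
    sim_idx = sorted(
        range(len(similarities)), key=lambda k: similarities[k], reverse=True
    )[:n_keep]
    reg_idx = sorted(
        range(len(regressions)), key=lambda k: abs(regressions[k]), reverse=True
    )[:m_keep]
    return {
        "sim_index": sim_idx,                              # which patterns were kept
        "sim_value": [similarities[k] for k in sim_idx],   # their similarities
        "reg_index": reg_idx,
        "reg_value": [regressions[k] for k in reg_idx],    # sign is preserved
    }


# Toy frame with four standard patterns.
frame_sims = [0.1, 0.9, 0.3, 0.7]
frame_regs = [-0.8, 0.1, 0.5, -0.2]
entry = select_frame(frame_sims, frame_regs, n_keep=2, m_keep=2)
```

Note that the regression coefficients are ranked by magnitude but stored with their sign, since a strongly falling similarity is as informative as a strongly rising one.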

[0020] Conventionally, the dictionary storage unit 17 stored only the similarities, or only the similarities and the regression coefficients, as shown in FIGS. 5 and 6. In the present invention, as shown in FIGS. 2 and 3, an index unit 18 is newly provided to record which phonemes' similarities or regression coefficients were retained and registered. As shown in FIG. 2, the index unit 18 determines which phonemes enter the matching calculation, and the calculation is performed only for the phonemes registered in the index unit 18. Matching and recognition then use the similarities in the similarity storage unit 19 and the regression coefficients in the regression coefficient storage unit 20. A regression coefficient is the temporal change of similarity and is represented by the slope of the straight line shown in FIG. 4.
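The index-restricted matching can be sketched as a sum of products taken only over the registered phonemes. This is a minimal sketch under assumptions: the text does not give the exact frame score, so a plain sum of products is used here, and the function name `frame_score`, the entry layout, and the toy values are hypothetical.

```python
def frame_score(entry, input_sims, input_regs):
    """Score one dictionary frame against one input frame, visiting only the
    phonemes listed in the entry's index lists."""
    sim_part = sum(
        value * input_sims[k]
        for k, value in zip(entry["sim_index"], entry["sim_value"])
    )
    reg_part = sum(
        value * input_regs[k]
        for k, value in zip(entry["reg_index"], entry["reg_value"])
    )
    return sim_part + reg_part


# Dictionary frame that kept two similarities and one regression coefficient.
dictionary_frame = {
    "sim_index": [1, 3],
    "sim_value": [2.0, 1.0],
    "reg_index": [0],
    "reg_value": [-1.0],
}

# Full (unreduced) vectors computed from the input speech.
input_sims = [5.0, 3.0, 7.0, 4.0]
input_regs = [2.0, 9.0, 9.0, 9.0]

score = frame_score(dictionary_frame, input_sims, input_regs)
multiplications = len(dictionary_frame["sim_index"]) + len(dictionary_frame["reg_index"])
```

The input side still carries all n values per frame; only the dictionary side is reduced, and the index tells the matcher which input elements to read. The number of multiplications per frame pair is N + M rather than 2 × n.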

[0021] The standard pattern storage unit 12 stores 23 kinds of phoneme standard patterns created in advance from data uttered by many speakers. This embodiment uses the following 23 phoneme standard patterns: /a/, /o/, /u/, /i/, /e/, /j/, /jv/, /ju/, /w/, /m/, /n/, /s/, /hv/, /hu/, /p/, /t/, /k/, /c/, /b/, /d/, /r/, /

[0022]

[External character 1]

[0023] /, /z/. Here the phoneme /hv/ is a voiced /h/, the phoneme /hu/ is an unvoiced /h/, the phoneme /jv/ is the yōon (palatalized glide) that follows a voiced consonant, and the phoneme /ju/ is the yōon that follows an unvoiced consonant. The phoneme standard patterns are created by accurately locating, by visual inspection, the characteristic part of each phoneme (the temporal position that best expresses the features of that phoneme) and using the time pattern of the feature parameters centered on this feature frame.

[0024] Using the configuration of this embodiment, a recognition experiment was carried out on data from 20 speakers who each uttered 212 words. One male and one female of the 20 were registered as the dictionary for the 212 words, and the words uttered by the remaining 18 speakers were recognized. With a dictionary that retained only the 5 largest similarities among the 23 phonemes, and likewise only the 5 regression coefficients with the largest absolute values, a recognition rate of 95.35% was obtained.

[0025] Recognition with a dictionary built from all 23 phonemes, with no reduction at all, gave 95.64%, so the recognition rate fell by only about 0.3 percentage points. Meanwhile, the dictionary previously held, per frame, 23 similarities and 23 regression coefficients corresponding to the 23 phonemes. Each of these is reduced to 5, and 5 indexes are added for each, so the storage capacity becomes 10/23 of the original. The time required for recognition also falls to less than half, because the number of matching operations per frame is reduced. Furthermore, when the same recognition experiment is run keeping only 4 instead of 5 of both the similarities and the regression coefficients, a high recognition rate of 95.30% is still maintained.
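The figures in this paragraph can be checked with a short calculation. The sketch below uses only the numbers stated in the text; the one added assumption, implied by the 10/23 figure, is that each index occupies the same storage as one stored value.

```python
from fractions import Fraction

n = 23        # phoneme standard patterns in the embodiment
N = M = 5     # similarities and regression coefficients kept per frame

# Storage per frame: values only before reduction; after reduction,
# each kept value is accompanied by one index (assumed to be the same size).
full_storage = n + n
reduced_storage = (N + N) + (M + M)
storage_ratio = Fraction(reduced_storage, full_storage)

# Matching operations per frame per word: 2 x n before, N + M after.
full_ops = 2 * n
reduced_ops = N + M
ops_ratio = Fraction(reduced_ops, full_ops)

# Recognition rates reported in the experiment, in percent.
rate_drop = round(95.64 - 95.35, 2)

print(storage_ratio, ops_ratio, rate_drop)
```

The storage ratio comes out as 10/23 (about 43%), the per-frame matching operations as 5/23 of the original (about 22%, consistent with "less than half"), and the drop in recognition rate as 0.29 percentage points.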

[0026] This embodiment was described for the case in which LPC cepstrum coefficients C0 to C8 are used as the acoustic parameters and 23 phonemes are used as the standard patterns, but the order of the LPC cepstrum and the number of phonemes may be changed without any problem. It is also possible to use, in place of phonemes, syllables, speech segments such as VC/CV, or half-syllables such as VCV as the standard patterns.

[0027]

[Effects of the Invention] As described above, the present invention calculates the similarity between the feature parameters obtained by analyzing recognition word speech uttered by a small number of speakers (one to several) and n kinds of standard patterns created in advance from many speakers. Even when the dictionary is created and registered using only N dimensions of the n-dimensional similarity vector and M dimensions of the n-dimensional regression coefficient vector as the feature parameters for speech recognition, the speech of unspecified speakers can be recognized accurately.

[0028] As a result, the dictionary needs less than half the previous storage capacity, the amount of computation is greatly reduced, and the recognition rate hardly declines. The present invention thus contributes greatly to advancing the practical use of speech recognition devices for unspecified speakers.

[Brief description of drawings]

[FIG. 1] Block diagram of the speech recognition method in an embodiment of the present invention.

[FIG. 2] Conceptual diagram explaining the matching method in the same embodiment.

[FIG. 3] Conceptual diagram explaining the dictionary and the input in the same embodiment.

[FIG. 4] Characteristic diagram explaining the regression coefficient in the same embodiment.

[FIG. 5] Conceptual diagram explaining the time series of similarity vectors in the conventional speech recognition method.

[FIG. 6] Conceptual diagram explaining the dictionary and the input in the conventional speech recognition method.

[FIG. 7] Block diagram explaining the conventional speech recognition method.

[Explanation of symbols]

9 Acoustic analysis unit
10 Feature parameter extraction unit
11 Similarity calculation unit
12 Standard pattern storage unit
13 Regression coefficient calculation unit
14 Parameter sequence creation unit
15 Parameter selection unit
16 Recognition unit
17 Dictionary storage unit
18 Index unit
19 Similarity storage unit
20 Regression coefficient storage unit

Claims


1. A speech recognition method characterized in that: recognition target speech is uttered by a small number of speakers, from one to several; the m acoustic feature parameters (m is an integer) obtained at each analysis time (frame) are matched against the m feature parameters of each of n kinds (n is an integer) of standard patterns created in advance from many speakers; n similarities are obtained for each frame; N elements (N is an integer smaller than n) are selected from the time-series pattern formed by the n-dimensional similarity vectors and registered as a word dictionary; input speech to be recognized is analyzed in the same way, and the resulting m feature parameters are matched against the m feature parameters of each of the n kinds of standard patterns to obtain a time series of n-dimensional similarity vectors, which is compared with the time series of N-dimensional similarity vectors registered in the dictionary; whereby the speech of the speakers who registered the recognition target speech and other input speech are recognized.

2. The speech recognition method according to claim 1, wherein the temporal change of similarity is obtained for each frame from the time series of each of the n kinds of similarity, M elements (M is an integer smaller than n) are selected from the n-dimensional vector of temporal changes of similarity and registered as a dictionary, and this word dictionary is used together with the N-dimensional similarity vector.

3. The speech recognition method according to claim 1, wherein, when the n-dimensional similarity vector is reduced to N dimensions, the N elements with the largest values are selected.

4. The speech recognition method according to claim 1, wherein, when the n-dimensional vector of temporal changes of similarity is reduced to M dimensions, the M elements with the largest absolute values are selected.