JPH02298996A

JPH02298996A - Word voice recognition device

Info

Publication number: JPH02298996A
Application number: JP1119505A
Authority: JP
Inventors: Yasuyuki Masai; 康之正井
Original assignee: Toshiba Corp
Current assignee: Toshiba Corp
Priority date: 1989-05-12
Filing date: 1989-05-12
Publication date: 1990-12-11

Abstract

PURPOSE: To improve recognition performance by detecting the feature patterns of highly reliable speech section candidates of input word speech and generating a high-performance standard pattern. CONSTITUTION: A word boundary hypothesis generation part 2 sets plural speech section candidates for the feature parameters of the input word speech supplied from an acoustic analysis part 1; the speech section candidates are obtained as combinations of start-point and end-point candidates. A similarity arithmetic part 7 then calculates the similarity between the feature patterns of repeated utterances and, when the combination of feature patterns having the largest similarity is consistent across the voice inputs, extracts those feature patterns and supplies them to a standard pattern generation part 8. The generation part 8 generates a standard pattern for the word to be recognized and registers it in a pattern dictionary 5. At recognition time, a similarity calculation part 4 calculates the similarity between the feature pattern of the input word speech and each word to be recognized that is registered in the dictionary 5. A recognition result output part 6 then compares the similarities with one another and outputs the recognition target having the largest similarity.

Description

DETAILED DESCRIPTION OF THE INVENTION [Object of the Invention] (Industrial Application Field) The present invention relates to a word speech recognition device that can recognize input spoken words efficiently and with high accuracy.

(Prior Art) Speech recognition technology plays an important role in realizing a good man-machine interface. In speech recognition, speech section detection is an important preprocessing step for improving recognition accuracy, and it has long been the subject of various research and development efforts.

Conventionally, speech section detection is generally performed by obtaining the power time series of the input word speech, detecting the point at which the speech power P becomes larger than a predetermined threshold T1 as the start point S of the input spoken word, and, after the start point has been detected, detecting the point at which the speech power P becomes smaller than a predetermined threshold T2 as the end point E of the input spoken word.
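The two-threshold rule described above can be sketched as follows. This is a minimal illustration, not the device's actual implementation: the function name `detect_speech_section`, the use of frame indices, and the choice to treat the last frame as the end point when the power never falls below T2 are all assumptions made for the example.

```python
from typing import Optional, Sequence, Tuple


def detect_speech_section(
    power: Sequence[float], t1: float, t2: float
) -> Optional[Tuple[int, int]]:
    """Return (start, end) frame indices, or None if the power never exceeds t1."""
    start = None
    for i, p in enumerate(power):
        if start is None:
            if p > t1:
                # Power first rises above T1: this frame is the start point S.
                start = i
        elif p < t2:
            # After S, power first falls below T2: this frame is the end point E.
            return start, i
    if start is None:
        return None
    # Assumption: if the power never drops below T2, the section runs to the last frame.
    return start, len(power) - 1
```

Because the rule returns exactly one section, a click or breath that crosses T1 before the word shifts the start point, which is the weakness the following paragraphs discuss.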

However, because this kind of speech section detection determines the speech section uniquely, it has the drawback that breath noise, tongue clicks, and similar noise occurring just before or after the actual speech section are also detected as part of the speech section. Conversely, for word speech whose first or last syllable tends to be devoiced, the power P of the devoiced syllable becomes extremely small, so that syllable is easily dropped from the detected speech section.

Such speech section detection errors cause fatal misrecognition or recognition rejection. This is particularly serious in a speaker-dependent word speech recognition device, in which the feature patterns of word speech uttered by the user are registered as standard patterns and a recognition result is obtained from the similarity between the feature pattern of the word speech input at recognition time and the registered standard patterns. If a speech section detection error occurs while a standard pattern is being created, the feature pattern of a section that differs from the true speech section of the word is registered as the standard pattern. Once such an incorrect standard pattern has been registered, incorrect recognition results are output even if the speech section is detected correctly at recognition time.

One technique for reducing misrecognition caused by speech section detection errors during standard pattern creation is to have the user utter the word to be recognized several times, to create and register a standard pattern only when the utterance durations are approximately equal, and to request a re-utterance when the durations vary. However, if the same kind of noise is added to every utterance, or the same part of the speech section is dropped from every utterance, this comparison of utterance durations alone cannot prevent speech section detection errors.
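A minimal sketch of the duration comparison in this prior-art technique is given below. The source does not say how "approximately equal" is judged, so the relative tolerance of 20% around the mean and the name `durations_consistent` are assumptions for illustration only.

```python
from typing import Sequence


def durations_consistent(durations: Sequence[float], tolerance: float = 0.2) -> bool:
    """True when every duration lies within `tolerance` (relative) of the mean duration."""
    if not durations:
        return False
    mean = sum(durations) / len(durations)
    if mean <= 0:
        return False
    # Assumed criterion: each utterance may deviate from the mean by at most tolerance * mean.
    return all(abs(d - mean) <= tolerance * mean for d in durations)
```

The check passes whenever all utterances are equally long, including the case where each of them carries the same leading noise, which is why it cannot catch systematic detection errors.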

As a technique for reducing the addition of noise and the dropping of parts of the speech section, the word boundary hypothesis method has been proposed. In this method, a plurality of start-point candidates (S1, S2, ..., SM) and end-point candidates (E1, E2, ..., EN) are obtained for the input word speech, and probabilities (fs1, fs2, ..., fsM) and (fe1, fe2, ..., feN) are assigned to these candidates according to predetermined rules. For the speech section candidates [Sm, En] (m = 1, 2, ..., M; n = 1, 2, ..., N) obtained as combinations of the start-point and end-point candidates, the likelihood Lmn is calculated as

$$L_{mn} = f_{sm} \times f_{en}$$

and the recognition result for the input word speech is obtained by matching a number of the speech section candidates with the highest likelihood Lmn against the standard patterns.
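The likelihood calculation and the selection of the highest-ranked candidates can be sketched as below. The function name `rank_section_candidates`, the 0-based indices, and the parameter `top_k` are assumptions for the example; how the probabilities themselves are assigned is left to the "predetermined rules" that the source does not spell out.

```python
from typing import List, Sequence, Tuple


def rank_section_candidates(
    f_start: Sequence[float],
    f_end: Sequence[float],
    top_k: int,
) -> List[Tuple[int, int, float]]:
    """Return the top_k (m, n, L_mn) triples, where L_mn = f_start[m] * f_end[n]."""
    scored = [
        (m, n, fs * fe)
        for m, fs in enumerate(f_start)
        for n, fe in enumerate(f_end)
    ]
    # Highest likelihood first; ties keep their enumeration order (stable sort).
    scored.sort(key=lambda t: t[2], reverse=True)
    return scored[:top_k]
```

For example, `rank_section_candidates([0.9, 0.1], [0.4, 0.6], top_k=2)` ranks the pair of start candidate 0 and end candidate 1 first, followed by start candidate 0 and end candidate 0.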

However, even with this technique, if the speech section of the word speech from which the standard pattern was obtained was itself detected incorrectly, the input word speech ultimately cannot be recognized correctly.

(Problem to be Solved by the Invention) Thus, conventionally, several problems remain as to how to prevent speech section detection errors during standard pattern creation so as to obtain a reliable, highly accurate standard pattern.

The present invention has been made in view of these circumstances. Its object is to provide a word speech recognition device that obtains standard patterns while effectively preventing speech section detection errors during standard pattern creation, and that can therefore reliably recognize even input spoken words for which speech section detection errors are likely to occur.

[Structure of the Invention] (Means for Solving the Problems) The present invention relates to a word speech recognition device that acoustically analyzes input word speech to obtain its feature parameters, normalizes the feature parameters for each speech section candidate of the input word speech obtained from them by a word boundary hypothesis generation unit to generate feature patterns of the input speech, and calculates the similarity between these feature patterns and the standard patterns of recognition-target word speech created by a standard pattern creation unit to obtain a word speech recognition result for the input word speech. The invention is characterized in that the standard pattern creation unit mutually calculates likelihoods among the plurality of feature patterns obtained for each of several utterances of a recognition-target word and, when the combination of feature patterns with the highest likelihood is consistent among the feature patterns of the several utterances, takes those feature patterns as the standard pattern of the recognition-target word speech.

(Function) According to the present invention, when the correct speech section is determined from among the speech section candidates obtained by the word boundary hypothesis generation unit, likelihoods are mutually calculated among the feature patterns obtained for each of several utterances of the same word, and the corresponding speech section candidates are determined to be the correct speech sections only when the feature patterns with the highest likelihood are consistent across the utterances. The speech section can therefore be determined reliably and with high accuracy.

As a result, a highly reliable standard pattern is obtained and recognition performance can be improved.

(Embodiment) An embodiment of the word speech recognition device according to the present invention is described below with reference to the drawings.

FIG. 1 is a schematic configuration diagram of the main part of the embodiment device. Reference numeral 1 denotes an acoustic analysis unit that acoustically analyzes input speech to obtain its feature parameters. The acoustic analysis unit 1 obtains the speech power time series as the feature used for speech section detection and, as the feature used for matching against the recognition dictionary, obtains, for example, the outputs of a bank of band-pass filters that perform frequency analysis.

The word boundary hypothesis generation unit 2 adaptively sets various speech section detection parameters for the feature parameters of the input speech obtained by the acoustic analysis unit 1 and sets a plurality of speech section candidates. Specifically, the word boundary hypothesis generation unit 2 obtains a plurality of start-point candidates (S1, S2, ..., SM) and a plurality of end-point candidates (E1, E2, ..., EN) for the input word speech and assigns probabilities (fs1, fs2, ..., fsM) and (fe1, fe2, ..., feN) to them according to predetermined rules. For the speech section candidates [Sm, En] (m = 1, 2, ..., M; n = 1, 2, ..., N) obtained as combinations of the start-point and end-point candidates, it calculates the likelihood Lmn as

$$L_{mn} = f_{sm} \times f_{en}$$

and detects a plurality of top-ranked speech section candidates [Sm, En] in descending order of the likelihood Lmn.

For each of the speech section candidates [Sm, En] obtained as described above, the resampling unit 3 then resamples the feature parameters consisting of the band-pass filter bank outputs obtained by frequency analysis in the acoustic analysis unit 1, and obtains a normalized feature pattern of the input word speech for each candidate. If only one speech section candidate with high likelihood is obtained, resampling is performed for that candidate only and a single feature pattern is obtained.
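The source does not state how the resampling is carried out. One simple possibility, shown here purely as an assumption, is to pick a fixed number of evenly spaced frames from each candidate section so that every candidate yields a feature pattern of the same length regardless of its duration. The name `resample_section` and the default of 16 sample points are hypothetical.

```python
from typing import List, Sequence


def resample_section(
    frames: Sequence[Sequence[float]],
    start: int,
    end: int,
    num_points: int = 16,
) -> List[float]:
    """Pick num_points evenly spaced frames from frames[start..end] and flatten them."""
    if not (0 <= start <= end < len(frames)):
        raise ValueError("section lies outside the analysed frames")
    if num_points < 2:
        raise ValueError("num_points must be at least 2")
    span = end - start
    pattern: List[float] = []
    for k in range(num_points):
        # Map sample k onto the nearest frame between start and end (inclusive).
        idx = start + round(k * span / (num_points - 1))
        pattern.extend(frames[idx])
    return pattern
```

With band-pass filter outputs of C channels per frame, each candidate section produces a vector of `num_points * C` values, so patterns from sections of different lengths can be compared directly.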

The similarity calculation unit 4 calculates the similarity between each feature pattern of the input word speech obtained in this way and the standard patterns, described later, of the recognition-target words registered in the standard pattern dictionary 5. The recognition result output unit 6 compares with one another the similarities calculated by the similarity calculation unit 4 between the feature patterns of the input word speech and the standard patterns of the recognition-target words, and obtains the category names and similarity values of a predetermined number of recognition-target words that yielded high similarities. It then outputs these top-ranked category names as recognition candidates for the input word speech, or outputs the category name of the recognition-target word with the highest similarity value as the recognition result for the input word speech.
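The recognition step can be sketched as follows. Several points are assumptions made for the example and are not taken from the source: cosine similarity stands in for the unspecified similarity measure, each category is scored by its best-matching candidate pattern, and the names `cosine_similarity` and `recognize` are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length feature patterns."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def recognize(
    candidate_patterns: Sequence[Sequence[float]],
    dictionary: Dict[str, Sequence[float]],
    top_n: int = 1,
) -> List[Tuple[str, float]]:
    """Return the top_n (category, similarity) pairs, best first."""
    if not candidate_patterns:
        raise ValueError("at least one candidate feature pattern is required")
    scores = []
    for category, standard in dictionary.items():
        # Assumption: a category is scored by its best-matching section candidate.
        best = max(cosine_similarity(p, standard) for p in candidate_patterns)
        scores.append((category, best))
    scores.sort(key=lambda t: t[1], reverse=True)
    return scores[:top_n]
```

Calling `recognize(patterns, dictionary, top_n=3)` corresponds to outputting several recognition candidates, while the default `top_n=1` corresponds to outputting only the category with the highest similarity.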

The standard patterns of the recognition-target words registered in the standard pattern dictionary 5 are generated by a separately provided similarity computation unit 7 and a standard pattern creation unit 8.

When a recognition-target word whose category name is known is uttered several times during standard pattern creation, the similarity computation unit 7 calculates the similarity between the utterances using the feature patterns obtained for each utterance by the resampling unit 3. This similarity calculation uses one of the various previously proposed techniques, such as the subspace method or DP matching.
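The source names DP matching as one option without giving details. The following is a minimal sketch of the textbook dynamic time warping recurrence, which is a common form of DP matching; it is not claimed to be the variant used in the device. The Euclidean frame distance, the unconstrained step pattern, and the name `dtw_distance` are assumptions. A smaller distance corresponds to a higher similarity.

```python
import math
from typing import Sequence


def dtw_distance(
    a: Sequence[Sequence[float]], b: Sequence[Sequence[float]]
) -> float:
    """Accumulated Euclidean frame distance along the best monotonic alignment."""
    if not a or not b:
        raise ValueError("both sequences must contain at least one frame")
    inf = float("inf")
    rows, cols = len(a), len(b)
    # cost[i][j] = best accumulated distance aligning a[:i] with b[:j].
    cost = [[inf] * (cols + 1) for _ in range(rows + 1)]
    cost[0][0] = 0.0
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            d = math.sqrt(sum((x - y) ** 2 for x, y in zip(a[i - 1], b[j - 1])))
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[rows][cols]
```

Because the alignment may stretch or compress time, two utterances of the same word spoken at different speeds can still obtain a small distance.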

When the combination of feature patterns that maximizes the similarity is consistent across the utterances, the similarity computation unit 7 extracts those feature patterns and passes them to the standard pattern creation unit 8.

If the combination of feature patterns that maximizes the similarity is not consistent across the utterances, the user is prompted to utter the word again.

The standard pattern creation unit 8 takes the feature patterns of the input word speech obtained in this way as the standard pattern for the recognition-target word, attaches the category name, and registers it in the standard pattern dictionary 5.

To describe the creation of the standard pattern in more detail: when a standard pattern is created, a word whose category name is known is uttered several times.

Each of these input utterances is acoustically analyzed by the acoustic analysis unit 1 to obtain its feature parameters, and the word boundary hypothesis generation unit 2 obtains the likely speech section candidates for each. The resampling unit 3 then obtains a feature pattern for each of these speech section candidates.

Specifically, as shown in FIG. 2, speech section candidates L11 and L12 are obtained for the first utterance, and the feature patterns P11 and P12 in these candidates are obtained. Similarly, feature patterns P21 and P22 are obtained for the speech section candidates L21 and L22 of the second utterance, and feature patterns P31 and P32 are obtained for the speech section candidates L31 and L32 of the third utterance.

Here two speech section candidates, and their feature patterns, are obtained for each utterance, but feature patterns may also be obtained for three or more speech section candidates. Likewise, if three utterances do not give a satisfactory result, the user may of course be prompted for a fourth or further utterance.

The similarity computation unit 7 mutually calculates the similarity (likelihood) between the feature patterns of the utterances obtained in this way and checks whether the combination of feature patterns that maximizes the similarity is consistent across the utterances. For example, the similarities Q between the feature patterns of the first and second utterances are calculated as

$$Q_{11} = [P_{11} \cdot P_{21}], \quad Q_{12} = [P_{11} \cdot P_{22}], \quad Q_{13} = [P_{12} \cdot P_{21}], \quad Q_{14} = [P_{12} \cdot P_{22}]$$

and the combination of feature patterns that gives the maximum of these similarities is found.

Next, the similarities Q between the feature patterns of the first and third utterances are calculated as

$$Q_{21} = [P_{11} \cdot P_{31}], \quad Q_{22} = [P_{11} \cdot P_{32}], \quad Q_{23} = [P_{12} \cdot P_{31}], \quad Q_{24} = [P_{12} \cdot P_{32}]$$

and the combination of feature patterns that gives the maximum of these similarities is found.

Similarly, the similarities Q between the feature patterns of the second and third utterances are calculated as

$$Q_{31} = [P_{21} \cdot P_{31}], \quad Q_{32} = [P_{21} \cdot P_{32}], \quad Q_{33} = [P_{22} \cdot P_{31}], \quad Q_{34} = [P_{22} \cdot P_{32}]$$

and the combination of feature patterns that gives the maximum of these similarities is found.
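The search for the maximum-similarity combination between two utterances can be sketched as below. The similarity measure is passed in as a function because the source leaves it open; `neg_sq_dist` (negative squared Euclidean distance, so that larger means more similar) is only a stand-in chosen for the example, and both function names are hypothetical.

```python
from typing import Callable, Sequence, Tuple

Pattern = Sequence[float]


def neg_sq_dist(a: Pattern, b: Pattern) -> float:
    """Stand-in similarity: negative squared Euclidean distance."""
    return -sum((x - y) ** 2 for x, y in zip(a, b))


def best_pair(
    cands_a: Sequence[Pattern],
    cands_b: Sequence[Pattern],
    similarity: Callable[[Pattern, Pattern], float] = neg_sq_dist,
) -> Tuple[int, int, float]:
    """Return (i, j, q) for the candidate pair with the largest similarity q."""
    best = None
    for i, pa in enumerate(cands_a):
        for j, pb in enumerate(cands_b):
            q = similarity(pa, pb)
            if best is None or q > best[2]:
                best = (i, j, q)
    if best is None:
        raise ValueError("both utterances need at least one candidate pattern")
    return best
```

With two candidates per utterance, each call evaluates the four similarities listed for one pair of utterances and reports which of them is largest, using 0-based candidate indices.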

When, through this processing, the combinations of feature patterns that maximize the similarity are consistent across the utterances, the corresponding feature pattern of each utterance is extracted.

For example, suppose that Q11 = [P11·P21] is the largest of the similarities between the first and second utterances and that Q22 = [P11·P32] is the largest of the similarities between the first and third utterances. The combinations that give these maximum similarities then agree on the feature pattern P11 of the first utterance. If, in addition, Q32 = [P21·P32] is found to be the largest of the similarities between the second and third utterances, this combination agrees with the feature patterns P21 and P32 already selected for those utterances.

When it has been confirmed in this way that the combinations giving the maximum similarities are free of contradiction with respect to the feature pattern of each utterance, that is, here, when the combinations [P11·P21], [P11·P32], and [P21·P32] agree on the feature patterns P11, P21, and P32 of the respective utterances, these are judged to be the feature patterns of correctly detected speech sections.

When this judgment is obtained, the feature patterns P11, P21, and P32 are extracted and passed to the standard pattern creation unit 8. The standard pattern creation unit 8 creates the standard pattern for the recognition-target word, for example by averaging these feature patterns, and registers it in the standard pattern dictionary 5.
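The consistency check and the averaging step can be sketched together as follows. This is an illustration under stated assumptions rather than the device's implementation: the similarity measure (negative squared Euclidean distance), the element-wise mean as the averaging method, and the names `select_consistent` and `make_standard_pattern` are all hypothetical. Candidate indices are 0-based, so index 0 of the first utterance corresponds to P11.

```python
from itertools import combinations
from typing import Dict, List, Optional, Sequence

Pattern = Sequence[float]


def similarity(a: Pattern, b: Pattern) -> float:
    """Assumed measure: negative squared Euclidean distance (larger is more similar)."""
    return -sum((x - y) ** 2 for x, y in zip(a, b))


def select_consistent(utterances: Sequence[Sequence[Pattern]]) -> Optional[List[int]]:
    """Return one candidate index per utterance, or None if the best pairs disagree."""
    if len(utterances) < 2:
        raise ValueError("at least two utterances are required")
    chosen: Dict[int, int] = {}
    for u, v in combinations(range(len(utterances)), 2):
        # Find the most similar candidate pair between utterances u and v.
        pairs = [
            (similarity(pa, pb), i, j)
            for i, pa in enumerate(utterances[u])
            for j, pb in enumerate(utterances[v])
        ]
        _, i, j = max(pairs)
        for utt, idx in ((u, i), (v, j)):
            # Each utterance must be represented by the same candidate in every pair.
            if chosen.setdefault(utt, idx) != idx:
                return None  # contradiction: the speaker should repeat the word
    return [chosen[u] for u in range(len(utterances))]


def make_standard_pattern(utterances: Sequence[Sequence[Pattern]]) -> Optional[List[float]]:
    """Average the consistently selected patterns, or return None to request re-utterance."""
    picks = select_consistent(utterances)
    if picks is None:
        return None
    selected = [utterances[u][k] for u, k in enumerate(picks)]
    return [sum(values) / len(selected) for values in zip(*selected)]
```

A `None` result plays the role of the re-utterance request: it is returned whenever two pairwise maxima point at different candidates of the same utterance.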
1 is averaged, etc., to create a standard pattern for the word to be recognized, and this is registered in the standard pattern dictionary 5.

By contrast, suppose that the similarity Q13 of the combination [P12·P21] is the largest between the first and second utterances, while the similarity Q22 of the combination [P11·P32] is the largest between the first and third utterances. The combinations giving these maxima do not agree on the feature pattern of the first utterance (different feature patterns, P11 and P12, are selected), so they are judged unsuitable as feature patterns for creating a standard pattern, and the user is prompted to utter the word again.

Likewise, suppose that Q11 = [P11·P21] is the largest for the first and second utterances and Q22 = [P11·P32] is the largest for the first and third utterances, so that these combinations agree on the feature pattern P11 of the first utterance. If Q34 = [P22·P32] is then the largest for the second and third utterances, a contradiction arises because the feature patterns P21 and P22 selected for the second utterance disagree, and in this case too the user is prompted to utter the word again.

In this way, the device obtains the similarity (likelihood) between the feature patterns obtained for several (three or more) utterances of a word in the same category and, when the combinations of feature patterns giving the maximum values are free of contradiction with respect to the feature pattern of each utterance, extracts them as feature patterns whose speech sections were correctly detected. A standard pattern is then created from these feature patterns and registered in the standard pattern dictionary 5 for use in word speech recognition.

According to this device, therefore, the standard pattern itself is a feature pattern obtained when the speech section was detected correctly, so the quality of the recognition dictionary can be made sufficiently high. As a result, recognition performance can also be made sufficiently high.

Moreover, as described above, the speech sections of the input speech are evaluated as combinations of feature patterns on the basis of the similarity between feature patterns, and only the feature patterns of correct speech sections are extracted to create the standard pattern. The procedure is therefore very simple and efficient, and standard patterns can be created effectively from a small number of utterances.

The present invention is not limited to the embodiment described above. For example, the similarity (likelihood) may be calculated between the feature patterns of four or more utterances in order to extract the correct feature patterns of the speech section. Also, when the similarity (likelihood) between feature patterns does not reach a predetermined threshold, the user may be prompted to utter the word again before the standard pattern is created, even if the combinations are free of contradiction.

This increases the reliability of standard pattern creation and makes it possible to build higher-performance standard patterns (a recognition dictionary). The present invention can also be implemented with various other modifications without departing from its gist.

[Effects of the Invention] As described above, according to the present invention, the speech section of the input speech is detected correctly, free from the effects of added noise and dropped portions of the speech section, and its feature pattern is extracted. High-performance standard patterns can therefore be created efficiently using only the feature patterns of highly reliable speech section candidates, and recognition performance can be improved effectively, which is of great practical benefit.

[Brief explanation of drawings]

FIG. 1 is a schematic configuration diagram of a word speech recognition device according to an embodiment of the present invention, and FIG. 2 is a diagram schematically showing the speech section candidates and their feature patterns for several input utterances during standard pattern creation in the embodiment device.

3: resampling unit, 4: similarity calculation unit (calculation of similarity between feature pattern and standard pattern), 5: standard pattern dictionary, 6: recognition result output unit, 7: similarity computation unit (calculation of likelihood between feature patterns), 8: standard pattern creation unit.

Claims

[Scope of Claims] A word speech recognition device comprising: an acoustic analysis unit that acoustically analyzes input word speech to obtain its feature parameters; a word boundary hypothesis generation unit that obtains speech section candidates of the input word speech from the acoustically analyzed feature parameters; a resampling unit that normalizes the feature parameters for each speech section candidate obtained by the word boundary hypothesis generation unit to generate feature patterns of the input speech; a standard pattern creation unit that creates a standard pattern of recognition-target word speech; a similarity calculation unit that calculates the similarity between the standard pattern of the recognition-target word speech created by the standard pattern creation unit and the feature pattern of the input word speech obtained by the resampling unit; and a recognition result output unit that obtains a word speech recognition result for the input word speech according to the similarity calculation result, wherein the standard pattern creation unit comprises means for mutually calculating likelihoods among the plurality of feature patterns obtained for each word speech uttered multiple times for the recognition-target word, and means for obtaining, when the combination of feature patterns having the highest likelihood is consistent among the feature patterns of the multiple utterances, those feature patterns as the standard pattern of the recognition-target word speech.