JP2721341B2

JP2721341B2 - Voice recognition method

Info

Publication number: JP2721341B2
Application number: JP61230001A
Authority: JP
Inventors: Yasuhiro Komori (小森 康弘)
Original assignee: Canon Inc
Current assignee: Canon Inc
Priority date: 1986-09-30
Filing date: 1986-09-30
Publication date: 1998-03-04
Anticipated expiration: 2013-03-04
Also published as: JPS6385697A

Description

【発明の詳細な説明】［産業上の利用分野］本発明は音声認識方法、特に入力された音声の音響の
特徴により音声を認識する音声認識方法に関するもので
ある。［従来の技術］従来この種の装置は、辞書に登録された項目が全て認
識対象であり、全登録を検索すると入力音声の認識処理
に時間がかかる欠点があつた。このため、大語彙を対象
とした場合には、現在のコンピユータの処理速度では、
音声の重要な要素である実時間性が保証できなくなると
いう問題がある。そこで、入力された音声からまず大まかに特徴抽出を
行い辞書中の単語と大まかマツチング計算を行つて、良
い結果が出たものに関しては、再び詳細な特徴によるマ
ツチングを行う、階層マツチングによる計算量の減少に
よる高速化が行われていたが、この方法でも、すべての
単語に対して必ず一度はマツチングの計算を行う必要が
あり、更に音素認識手段を有する音声認識装置には組込
みにくい欠点があつた。又、すべての単語に対してマッチングを行わなくても
すむように、予めある特徴に従って単語を分類し、該特
徴により認識候補を予備選択する技術が、特開昭61−13
7198号，特開昭61−149997号，特開昭61−67899号，特
開昭60−217399号等に開示されている。［発明が解決しようとする課題］しかしながら、上記従来例の予備選択は、大分類（お
おまかな分類）により音声認識を高速化することのみを
意図しているので、単語の１つ１つが持つ特徴が考慮さ
れずにその共通性のみが優先してしまっている。例え
ば、特開昭61−137198号では、母音・促音列に基づいて
予測される単語のみが予備選択されて、その音節が比較
照合されるが、母音・促音の挿入や置換の可能性に関し
ては、挿入や置換をした母音・促音列に対応する候補を
付加して、予備選択の結果としている。又、特開昭61−
149997号では、母音大分類と子音大分類とから母子音記
号列候補を抽出しているが、抽出結果はそれぞれの母子
音から可能な置換例を網羅したものである。このような選択候補の単純な併合では、１つ１つの単
語の音声パターンがどのように変化し易いか、あるいは
どのように異なって抽出されるか等が考慮されていない
ので、併合の範囲を拡大すると、本来は認識マッチング
の候補にする必要のない単語が増加して、高速化の妨げ
となる一方、併合の範囲を制限すると、予備選択の時点
で正しい単語が排除されてしまい、認識候補から漏れて
しまう危険性を含むことになる。本発明は、上述の従来の欠点を除去し、入力された音
声がもつ音響の特徴により単語の予備選択を行って認識
の高速化を図ると共に、１つ１つの単語の音声パターン
がどのように変化し易いか、あるいはどのように異なっ
て抽出されるかを考慮した予備選択を行うことで、予備
選択に用いる特徴が異なって抽出されても、あるいは入
力される音声の個人的な違いがあっても、高速で認識率
の高い音声認識を可能とする音声認識方法を提供する。［課題を解決するための手段］この課題を解決するために、本発明の音声認識方法
は、マッチング用の単語音声パターンを格納する音声辞
書と、入力された単語音声パターンの音響の特徴を表わ
す単語インデックスと、該単語インデックスと対応付け
られた前記音声辞書の単語音声パターンを示すポインタ
とを格納する索引辞書とを用意し、前記索引辞書では、
１つの単語音声パターンから抽出し得る複数の音響の特
徴をそれぞれ表わす複数の単語インデックスに、前記音
声辞書の当該単語音声パラメータを示すポインタが格納
され、入力された単語音声パターンから音響の特徴を抽
出し、前記索引辞書に前記抽出された音響の特徴を表わ
す単語インデックスに対応して記憶されているポインタ
に従って、該ポインタが示す前記音声辞書の単語音声パ
ターンと前記入力された単語音声パターンとをマッチン
グして、当該入力された単語音声パターンを認識するこ
とを特徴とする。［実施例］第１図に第１の実施例の音声認識装置の構成図を示
す。図中、１は音声入力部でマイク２とA/D変換器３で
構成され、入力音声をデイジタルに変換する。４は音響
処理部で周波数変換や、音響的パラメータを抽出すると
ころである。５は予備選択部で、予備選択用の特徴抽出
部６と予備選択用索引作成部７及び単語検索部８からな
つており、音響処理部４から出力された音響特徴量か
ら、予備選択用の特徴を抽出し、その特徴から索引を作
成する。単語検索部８は索引作成部７より出力される索
引と一致する索引を、索引辞書９より検索し、その索引
が有している辞書10中の単語へのポインタを、索引辞書
９から求め、その単語へのポインタのさす辞書10中の単
語を、候補単語として候補単語用メモリ11に出力する。
上述の方式により予備選択を行う。12は音素認識部で、
音響処理部４の出力を入力として音素の認識を行い、音
素の系列を出力する。13は単語マツチング部で、候補単
語用メモリ11上にある候補単語と音素認識部12の出力で
ある音素系列とのマツチングを行い、単語認識結果を出
力する。第２図に予備選択に用いる音響特徴とその記号化の例
を示す。図中の無声摩擦（音素に直すとだいたい“s",
“c",“h"に当る）、有声摩擦（“z"）、有声音の母
音，半母音，鼻韻（“z",“b",“d",“g",“r"）、無声
音（“s",“c",“h",“p",“t",“k"）、無音・鼻韻
（“m",“n"撥音の“N"…日本語の「ん」にあたる音
素）などは、その音素の前後の音韻によつてもその特徴
の出現のふるまいは異なるものの、比較的抽出しやすい
音響特徴量である。また、これらの音響特徴量に対して
時間的継続長を加えて、それぞれに記号をあたえた。第３図（ａ）に表記に基づいた索引の作成法を示す。
この例では「わたし」“watasi"から“wa"は長い有声音
で［Ｖ］、“wa"と“ta"の間の短い無音に対して
［ｑ］、“ta"に対して短い有声音の［ｖ］、“si"に対
しては長い無声摩擦［Ｓ］と短い有声音の［ｖ］として
表わし、単語“watasi"に対しては、［VqvSv］という予
備選択用の索引（以降INDEX）を作成する。このとき
に、“watasi"と発語されたとき、特徴抽出が異なる可
能性がある。この際のおこりそうな事象例えば“s"の間
が短いとか“si"の“i"の無声化などを考えて、予め［V
qvsv］や［VqvS］などのINDEXを作成する。“baketu"
「バケツ」の例も第３図（ｂ）に示す。第４図に第３図（ｂ）のINDEXを用いた索引辞書９の
構成と辞書10の構成を示す。第５図に第１の実施例の音声認識装置の動作フローチ
ヤートを示す。ステツプS51で音声入力部１よりの、本例では「私」
の音声入力があると、ステツプS52で音響処理部４で音
響処理が行われて。音響処理の結果は、ステツプS53の
特徴抽出部６による特徴抽出と、ステツプS59の音素認
識部12による音素認識に向う。ステップS53において、
入力音声の無声摩擦、有声摩擦、有声音、無声音、無
音、Buzz、鼻音性、および継続時間等の特徴を抽出し、
ステップS54で、第２図のテーブルに従ってINDEXを作成
する。ここでは、「私（watasi）」の入力音声に対し
て、通常は第３図（ａ）に示したように「VqvSv」とい
うINDEXが作成されるところが、“s"の無声摩擦音が短
くて「Vqvsv」というINDEXが作成された例で説明する。
この場合、ステップS56では、INDEX「Vqvsv」を基に索
引辞書９が検索され、ステップS56−S57で、第４図にお
いてINDX「Vqvsv」→ポインタ10→“watasi"、INDEX「V
qvsv」→ポインタ20→“baketu"、…と順に認識候補を
予備選択して、候補単語「私」、「バケツ」、…を候補
単語メモリ11に記憶する。一方、ステツプS59で音素認識された結果と候補単語
用メモリ11に記憶された認識候補とを、ステツプS58で
単語マツチング部13によりマツチングの判定をし、ステ
ツプS60で認識結果、本例では「私」を出力する。尚、第１図のシステムの単語マツチングは音素記号に
基づいているため、音素認識部12を必要とする。しか
し、辞書10内の単語表現を音響レベルのパラメータで表
わし、単語マツチング部13で音響パラメータレベルのマ
ツチングをおこなえば、音素認識部12は不要となる。
又、第２図にある例に限らず、例えば破裂性（有声及び
無声に分けてもよい）や音声のエネルギー（パワー）の
大小などを記号化しても良い。第６図に第２の実施例の音声認識装置の構成図を示
す。図中の61は音響入力部でマイク62とA/D変換器63で
構成され、入力音声は、デイジタルに変換される。64は
音響処理部で周波数変換や音響的パラメータを抽出する
ところである。65は予備選択部で、単語候補として決定
するための特徴ビツト系列を求めるための特徴抽出部66
と、特徴ビツト系列作成部67と、辞書69から引きだした
単語について、候補単語として更に詳細なマツチング計
算を必要とするか否かを決定する特徴ビツト系列比較部
68とからなつており、音響処理部64から出力された音響
特徴量から特徴ビツト系列用の特徴を抽出し、その特徴
から特徴ビツト系列を作成する。特徴ビツト系列比較部
68は、特徴ビツト系列作成部67の出力である特徴ビツト
系列と、辞書69内の単語にもたせてある特徴ビツト系列
とを比較し、入力音声側の特徴ビツト系列で“1"がたつ
ているもの全てについて、辞書69側の特徴ビツト系列で
も“1"が全て立つていれば、その特徴ビツト系列を有し
ている単語を候補単語と決め候補単語用メモリ70に送
る。71は音素認識部で音響処理部64の出力を入力として
音素認識を行い、音素系列を出力する。72は単語マツチ
ング部で候補単語用メモリ70上の単語と音素認識部71の
出力である音素系列とのマツチングを行い、単語認識結
果を出力する。第７図では第６図の辞書69の構成法を例で示す。本例
は「弁当」という単語についての例である。この「弁
当」の音素記号“beNtou"に対する特徴ビツト系列は、H
eader（16ビツトである必要はない）と本例では16ビツ
トで表されたビツト系列（各音素記号に対して16ビツト
とする）とで表されている。ここで16ビツト１つ（音素
に対応する）をSEGとする。各SEGは16ビツトで表され、
その16ビツトは、０ビツト〜15ビツト各々第７図のよう
な意味を持たせる。例えば、０ビツト目は脱落でここに
１が立つものは、そのSEGが脱落する可能性を示してい
る。１ビツト目,2ビツト目は、その特徴の継続時間で長
いか短いかである。SEG1では両方に“1"が立つているの
は、長くなつたり短くなつたりする可能性があるためで
ある。つまり語頭であるために、“b"にBuzzがあつたり
なかつたりする可能性を考えている。同様に前のSEGの
脱落や前後のSEGの影響を考えて、SEG2の破裂性や鼻韻
性などを組込んでおく。各SEGの作成に関しては、各音
素についての音響的特徴及び前後の関係から自動的に音
素系列が決まれば作成できる。Headerの情報としては、
SEGの脱落を考えて全体としておこりうる最大SEG数と最
小SEG数、及び単語へのポインタを入れる。以上第７図
のような特徴ビツト系列をもつ辞書構造を作成する。第８図に、入力音声から作成された特徴ビツト系列の
例を示す。本例は、“beNtou"と発生された時に予想さ
れる特徴ビツト系列の１例である。第７図の辞書の場合と異なり、第８図では挿入の可能
性のビツトを考える他は表現法は同じである。但し、HE
ADER部における単語へのポインタは不要である。各SEG
決定区間において、得られた特徴に“1"ビツトを立て
る。第９図は、入力も辞書側もそれぞれ挿入，脱落が仮定
されるので、表わされたSEG系列で表現可能なSEG系列を
示す。これらのSEG系列を用いて、詳細なマツチングを
するか否かを決定する。第10図は詳細なマツチングを行うか否かを決定する部
分である特徴ビツト系列比較部68の動作説明図である。
入力音声側レジスタ104にまず入力された音声のSEG系列
を入れ、辞書側レジスタ101に入力側と長さの一致するS
EG系列を辞書69からとりだす。メモリ102,103にそれぞ
れのSEG系列を順々に入れていく。メモリ102,103の中の
ビツトを比較回路105で比較し、“1",“0"の出力106が
でる。比較回路105とその動作説明図を第11図（ａ），
（ｂ）に示す。107の判定において“0"であれば、108又
は109で次の単語もしくは次のSEG系列を入力音声側又は
辞書側によびだし、くりかえし処理を行う。107で“1"
の判定であれば110で最終SEGかをCHECKする。最終SEGで
なければ、113で次のSEGに同じ処理を行う。上記のこと
を繰りかえし行い、比較回路105の出力が最終SEGまで全
て“1"であつた単語に関しては、111で候補単語として1
12で単語マツチング部72へ送り詳細なマツチングを行
う。尚、辞書内容に音響的特徴量を用いる方法において
は、第６図における音素認識部71がないシステムについ
ても応用が可能である。以上説明したように、音声認識装置に予備選択部と索
引辞書をもうけることにより、入力音声の音響特徴を有
する可能性のある単語を予備選択し、単語候補とするこ
とにより単語マツチング回数を減少させ、認識処理時間
の高速化を可能にした。又、音声認識装置に詳細マツチング判定部を用いるこ
とにより、多くの無駄の計量を大はばに減少することが
でき、認識処理時間の短縮が可能である。［発明の効果］本発明により、入力された音声がもつ音響の特徴によ
り索引辞書で単語の予備選択を行って認識の高速化を図
ると共に、１つ１つの単語の音声パターンがどのように
変化し易いか、あるいはどのように異なって抽出される
かを考慮した索引辞書を使用して予備選択を行うこと
で、予備選択に用いる特徴が異なって抽出されても、あ
るいは入力される音声の個人的な違いがあっても、高速
で認識率の高い音声認識を可能とした音声認識方法を提
供できる。すなわち、予備選択を行なう為の索引辞書において、
音響の特徴抽出における誤抽出等を考慮してポインタを
格納するので、無駄なマッチングを極力無くし、予備選
択の時点で正しい単語が漏れる可能性が激減し、高速で
ありながら高認識率を得られる音声認識が可能となる。

Description: BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention relates to a speech recognition method, and more particularly to a speech recognition method that recognizes speech from the acoustic features of the input speech.

2. Description of the Related Art

In conventional apparatus of this type, every item registered in the dictionary is a recognition target, so searching all entries makes recognition of the input speech slow. For a large vocabulary, the processing speed of current computers therefore cannot guarantee real-time operation, which is an important property of speech.

One approach to speeding up recognition has been hierarchical matching: coarse features are first extracted from the input speech and roughly matched against the words in the dictionary, and only the words that score well are matched again using detailed features. This reduces the amount of computation, but a matching calculation must still be performed at least once for every word, and the method is difficult to incorporate into a speech recognition apparatus that has phoneme recognition means.

Techniques that avoid matching against every word, by classifying the words in advance according to some feature and preliminarily selecting recognition candidates by that feature, are disclosed in JP-A-61-137198, JP-A-61-149997, JP-A-61-67899, JP-A-60-217399 and elsewhere.

[Problems to be Solved by the Invention]

However, the preliminary selection in these conventional examples is intended only to speed up speech recognition through coarse classification, so what the words have in common takes priority and the features of each individual word are not considered. For example, in JP-A-61-137198 only the words predicted from a vowel/geminate-consonant (sokuon) sequence are preliminarily selected and their syllables compared; the possibility of an inserted or substituted vowel or geminate consonant is handled by adding the candidates that correspond to the sequences with the insertion or substitution to the preliminary selection result. In JP-A-61-149997, candidate vowel-consonant symbol strings are extracted from a coarse vowel classification and a coarse consonant classification, but the extraction result simply covers every substitution possible for each vowel and consonant.

Such simple merging of selection candidates does not consider how the speech pattern of each individual word tends to vary, or how it may be extracted differently. If the range of merging is widened, words that need not be candidates for recognition matching increase and hinder speed. If the range of merging is narrowed, there is a risk that the correct word is excluded at the time of preliminary selection and is lost from the recognition candidates.

The present invention removes these drawbacks of the prior art. It speeds up recognition by preliminarily selecting words from the acoustic features of the input speech, and it performs that preliminary selection in a way that considers how the speech pattern of each individual word tends to vary and how it may be extracted differently. It thereby provides a speech recognition method that is fast and has a high recognition rate even if the features used for preliminary selection are extracted differently or the input speech differs from speaker to speaker.

[Means for Solving the Problems]

To solve this problem, the speech recognition method of the present invention prepares a speech dictionary that stores word speech patterns for matching, and an index dictionary that stores word indexes, each representing acoustic features of an input word speech pattern, together with pointers indicating the word speech patterns of the speech dictionary associated with each word index. In the index dictionary, a pointer indicating the corresponding word speech parameter of the speech dictionary is stored under each of a plurality of word indexes that respectively represent the plurality of acoustic features that can be extracted from one word speech pattern. Acoustic features are extracted from the input word speech pattern and, following the pointers stored in the index dictionary for the word index that represents the extracted features, the word speech patterns of the speech dictionary indicated by those pointers are matched against the input word speech pattern to recognize it.

[Embodiment]

FIG. 1 shows the configuration of the speech recognition apparatus of the first embodiment. In the figure, 1 is a speech input unit comprising a microphone 2 and an A/D converter 3, which converts the input speech to digital form. 4 is an acoustic processing unit that performs frequency conversion and extracts acoustic parameters. 5 is a preliminary selection unit made up of a feature extraction unit 6 for preliminary selection, an index creation unit 7 for preliminary selection, and a word search unit 8; it extracts the features for preliminary selection from the acoustic features output by the acoustic processing unit 4 and creates an index from those features. The word search unit 8 searches the index dictionary 9 for an index that matches the index output by the index creation unit 7, obtains from the index dictionary 9 the pointers to words in the dictionary 10 held under that index, and outputs the words in the dictionary 10 indicated by those pointers to the candidate word memory 11 as candidate words.
Preliminary selection is performed in this way. 12 is a phoneme recognition unit, which takes the output of the acoustic processing unit 4 as input, recognizes phonemes, and outputs a phoneme sequence. 13 is a word matching unit, which matches the candidate words in the candidate word memory 11 against the phoneme sequence output by the phoneme recognition unit 12 and outputs the word recognition result.

FIG. 2 shows an example of the acoustic features used for preliminary selection and their symbols. The features in the figure are unvoiced friction (roughly the phonemes "s", "c", "h"), voiced friction ("z"), voiced sound (vowels, semivowels, nasals; "z", "b", "d", "g", "r"), unvoiced sound ("s", "c", "h", "p", "t", "k"), and silence/nasality ("m", "n", and the moraic nasal "N", the phoneme corresponding to the Japanese 「ん」). Although the way each feature appears depends on the phonemes before and after it, these are acoustic features that are relatively easy to extract. A time duration is added to each of these acoustic features, and a symbol is assigned to each combination.

FIG. 3(a) shows how an index is created from the written form of a word.
In this example, for 「わたし」 ("watasi", meaning "I"), "wa" is a long voiced sound [V], the short silence between "wa" and "ta" is [q], "ta" is a short voiced sound [v], and "si" is a long unvoiced friction [S] followed by a short voiced sound [v]. The index for preliminary selection (hereinafter INDEX) created for the word "watasi" is therefore [VqvSv]. When "watasi" is actually uttered, however, the features may be extracted differently. Anticipating likely events, for example that the "s" segment is short or that the "i" of "si" is devoiced, INDEXes such as [Vqvsv] and [VqvS] are also created in advance. The example of "baketu" (「バケツ」, "bucket") is shown in FIG. 3(b).

FIG. 4 shows the structure of the index dictionary 9, using the INDEXes of FIG. 3(b), and the structure of the dictionary 10.

FIG. 5 shows the operation flowchart of the speech recognition apparatus of the first embodiment.

When speech, in this example 「私」 ("watasi"), is input from the speech input unit 1 in step S51, the acoustic processing unit 4 performs acoustic processing in step S52. The result of the acoustic processing goes both to feature extraction by the feature extraction unit 6 in step S53 and to phoneme recognition by the phoneme recognition unit 12 in step S59. In step S53, features of the input speech such as unvoiced friction, voiced friction, voiced sound, unvoiced sound, silence, buzz, nasality and duration are extracted, and in step S54 an INDEX is created according to the table of FIG. 2. Normally the INDEX "VqvSv" shown in FIG. 3(a) would be created for the input speech "watasi"; the following describes the case in which the unvoiced fricative "s" is short and the INDEX "Vqvsv" is created instead.
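The INDEX creation of steps S53 and S54 can be sketched roughly as follows. This is only an illustration: the symbols V/v, S/s and q are the ones named in the text, but the table of FIG. 2 is not reproduced here, so the symbol "Q" for a long silence, the feature names, the 100 ms threshold and the segment durations are all assumptions made for the example.

```python
# Minimal sketch of INDEX creation (steps S53-S54), under stated assumptions.
# From the text: V/v = long/short voiced sound, S/s = long/short unvoiced
# friction, q = short silence. "Q" (long silence) is an assumed symbol.
SYMBOLS = {
    "voiced": ("V", "v"),
    "unvoiced_friction": ("S", "s"),
    "silence": ("Q", "q"),
}

LONG_MS = 100  # hypothetical boundary between "short" and "long" segments


def make_index(segments):
    """Turn (feature, duration_ms) segments into an INDEX string."""
    symbols = []
    for feature, duration_ms in segments:
        long_symbol, short_symbol = SYMBOLS[feature]
        symbols.append(long_symbol if duration_ms >= LONG_MS else short_symbol)
    return "".join(symbols)


# Hypothetical segmentation of "watasi": wa, pause, ta, s, i.
watasi_segments = [
    ("voiced", 150),
    ("silence", 40),
    ("voiced", 70),
    ("unvoiced_friction", 120),
    ("voiced", 60),
]

# Same utterance with a short fricative, the case discussed in the text.
watasi_short_s = watasi_segments[:3] + [("unvoiced_friction", 60), ("voiced", 60)]

print(make_index(watasi_segments))  # VqvSv
print(make_index(watasi_short_s))   # Vqvsv
```

The second call shows why a single INDEX per word is not enough: a small change in one segment's duration yields a different string, which is what the additional INDEXes registered in advance are meant to absorb.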
In this case, the index dictionary 9 is searched with the INDEX "Vqvsv" in step S56, and in steps S56-S57 recognition candidates are preliminarily selected in order as shown in FIG. 4: INDEX "Vqvsv" → pointer 10 → "watasi", INDEX "Vqvsv" → pointer 20 → "baketu", and so on. The candidate words 「私」 ("watasi"), 「バケツ」 ("baketu"), ... are stored in the candidate word memory 11.

Meanwhile, in step S58 the word matching unit 13 matches the phoneme recognition result of step S59 against the recognition candidates stored in the candidate word memory 11, and in step S60 the recognition result, 「私」 in this example, is output.

Because word matching in the system of FIG. 1 is based on phoneme symbols, the phoneme recognition unit 12 is required. If, however, the words in the dictionary 10 are represented by acoustic-level parameters and the word matching unit 13 performs matching at the acoustic parameter level, the phoneme recognition unit 12 becomes unnecessary.
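The index dictionary lookup of the first embodiment can be sketched as a mapping from INDEX strings to lists of pointers. The pointer values 10 and 20 and the INDEXes for "watasi" come from the walk-through above; the function names are hypothetical, and for "baketu" only the one INDEX mentioned in the text is registered, since the full content of FIG. 3(b) is not available here.

```python
from collections import defaultdict

# Dictionary 10: pointer -> word (pointers as in the FIG. 4 walk-through).
DICTIONARY = {10: "watasi", 20: "baketu"}

# INDEXes registered in advance for each word. The "baketu" list is
# incomplete on purpose: only the INDEX named in the text is included.
VARIANTS = {
    10: ["VqvSv", "Vqvsv", "VqvS"],
    20: ["Vqvsv"],
}


def build_index_dictionary(variants):
    """Store each word's pointer under every INDEX it may produce."""
    index_dictionary = defaultdict(list)
    for pointer, indexes in variants.items():
        for index in indexes:
            index_dictionary[index].append(pointer)
    return dict(index_dictionary)


def preselect(index, index_dictionary, dictionary):
    """Return the candidate words whose pointers are stored under `index`."""
    return [dictionary[pointer] for pointer in index_dictionary.get(index, [])]


index_dictionary = build_index_dictionary(VARIANTS)
print(preselect("Vqvsv", index_dictionary, DICTIONARY))  # ['watasi', 'baketu']
print(preselect("VqvS", index_dictionary, DICTIONARY))   # ['watasi']
```

Only the words returned by `preselect` would go on to detailed word matching, so the cost of a lookup does not grow with the number of words that do not share the INDEX.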
The features are also not limited to the examples of FIG. 2; for example, plosiveness (which may be divided into voiced and unvoiced) or the magnitude of the speech energy (power) may be symbolized.

FIG. 6 shows the configuration of the speech recognition apparatus of the second embodiment. In the figure, 61 is a speech input unit comprising a microphone 62 and an A/D converter 63, which converts the input speech to digital form. 64 is an acoustic processing unit that performs frequency conversion and extracts acoustic parameters. 65 is a preliminary selection unit made up of a feature extraction unit 66, which obtains the features for the feature bit sequence used to decide word candidates; a feature bit sequence creation unit 67; and a feature bit sequence comparison unit 68, which decides, for each word drawn from the dictionary 69, whether it requires the more detailed matching calculation as a candidate word. The preliminary selection unit extracts the features for the feature bit sequence from the acoustic features output by the acoustic processing unit 64 and creates a feature bit sequence from them. The feature bit sequence comparison unit 68 compares the feature bit sequence output by the feature bit sequence creation unit 67 with the feature bit sequence attached to each word in the dictionary 69. If every bit that is set to "1" in the feature bit sequence on the input speech side is also set to "1" in the feature bit sequence on the dictionary 69 side, the word having that feature bit sequence is taken as a candidate word and sent to the candidate word memory 70. 71 is a phoneme recognition unit, which takes the output of the acoustic processing unit 64 as input, performs phoneme recognition, and outputs a phoneme sequence. 72 is a word matching unit, which matches the words in the candidate word memory 70 against the phoneme sequence output by the phoneme recognition unit 71 and outputs the word recognition result.

FIG. 7 shows by example how the dictionary 69 of FIG. 6 is constructed. The example is the word 「弁当」 ("bento", a boxed lunch). The feature bit sequence for its phoneme symbols "beNtou" consists of a Header (which need not be 16 bits) and a bit sequence that, in this example, uses 16 bits for each phoneme symbol. One such 16-bit unit, corresponding to one phoneme, is called a SEG. Bits 0 to 15 of each SEG are given the meanings shown in FIG. 7. For example, bit 0 means omission: a 1 here indicates that the SEG may be dropped. Bits 1 and 2 indicate whether the duration of the feature is long or short. Both are set to "1" in SEG1 because the segment may become either longer or shorter; that is, because it is at the beginning of the word, the "b" may or may not have buzz. Similarly, plosiveness, nasality and so on are built into SEG2 in consideration of omission of the preceding SEG and the influence of the neighboring SEGs. Once the phoneme sequence is fixed, each SEG can be created automatically from the acoustic features of each phoneme and its context. The Header holds the maximum and minimum numbers of SEGs that can occur overall, allowing for omitted SEGs, and a pointer to the word. A dictionary structure with feature bit sequences like those of FIG. 7 is built in this way.

FIG. 8 shows an example of a feature bit sequence created from input speech. It is one example of the feature bit sequence expected when "beNtou" is uttered. The representation is the same as for the dictionary of FIG. 7, except that FIG. 8 also considers bits for possible insertion. The pointer to the word in the Header, however, is not needed. In each SEG decision interval, a "1" bit is set for each feature obtained.

Because insertion and omission are assumed on both the input side and the dictionary side, FIG. 9 shows the SEG sequences that can be expressed by the represented SEG sequence. These SEG sequences are used to decide whether or not to perform detailed matching.

FIG. 10 is an operation diagram of the feature bit sequence comparison unit 68, which is the part that decides whether or not detailed matching is performed.
First, the SEG sequence of the input speech is placed in the input-speech-side register 104, and a SEG sequence whose length matches the input side is taken from the dictionary 69 into the dictionary-side register 101. The SEG sequences are loaded one SEG at a time into the memories 102 and 103. The comparison circuit 105 compares the bits in the memories 102 and 103 and produces an output 106 of "1" or "0". The comparison circuit 105 and an explanatory diagram of its operation are shown in FIGS. 11(a) and 11(b). If the decision at 107 is "0", the next word or the next SEG sequence is called up on the input speech side or the dictionary side at 108 or 109, and the processing is repeated. If the decision at 107 is "1", 110 checks whether this is the final SEG. If it is not the final SEG, the same processing is applied to the next SEG at 113. This is repeated, and a word for which the output of the comparison circuit 105 is "1" for every SEG up to the final SEG is taken as a candidate word at 111 and sent at 112 to the word matching unit 72 for detailed matching.

The method of using acoustic features for the dictionary content can also be applied to a system that does not have the phoneme recognition unit 71 of FIG. 6.

As described above, by providing a preliminary selection unit and an index dictionary in the speech recognition apparatus, words that may have the acoustic features of the input speech are preliminarily selected as word candidates, which reduces the number of word matchings and speeds up recognition processing.

In addition, using a detailed-matching decision unit in the speech recognition apparatus greatly reduces wasted computation and shortens the recognition processing time.

[Effects of the Invention]

According to the present invention, words are preliminarily selected with an index dictionary from the acoustic features of the input speech, which speeds up recognition, and the index dictionary used for that preliminary selection takes into account how the speech pattern of each individual word tends to vary and how it may be extracted differently. The invention can therefore provide a speech recognition method that is fast and has a high recognition rate even if the features used for preliminary selection are extracted differently or the input speech differs from speaker to speaker.

That is, because pointers are stored in the index dictionary used for preliminary selection with erroneous extraction of acoustic features and the like taken into account, useless matching is kept to a minimum, the possibility that the correct word is lost at the time of preliminary selection is drastically reduced, and speech recognition that is fast yet achieves a high recognition rate becomes possible.
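The comparison rule of the feature bit sequence comparison unit 68 in the second embodiment (every "1" on the input side must also be "1" on the dictionary side, for every SEG up to the last) can be sketched with integer bit masks. The description gives bit 0 as omission and bits 1 and 2 as duration; which of bits 1 and 2 means "long", and the BUZZ and NASAL positions, are assumptions for this example, as are the function names. Handling of inserted or omitted SEGs (FIG. 9) is left out; the sketch only compares sequences of equal length.

```python
# Bit positions: bit 0 (omission) and bits 1-2 (duration) follow the
# description; the long/short order and the BUZZ/NASAL bits are assumed.
OMISSION = 1 << 0
LONG = 1 << 1
SHORT = 1 << 2
BUZZ = 1 << 3   # hypothetical position
NASAL = 1 << 4  # hypothetical position


def seg_matches(input_seg: int, dict_seg: int) -> bool:
    """True if every bit set in the 16-bit input SEG is set in the dictionary SEG."""
    return ((input_seg & ~dict_seg) & 0xFFFF) == 0


def is_candidate(input_segs, dict_segs) -> bool:
    """Compare equal-length SEG sequences; all SEGs must match."""
    if len(input_segs) != len(dict_segs):
        return False
    return all(seg_matches(i, d) for i, d in zip(input_segs, dict_segs))


# Dictionary side allows both durations, with or without buzz, in the first SEG.
dict_word = [LONG | SHORT | BUZZ, LONG | SHORT]

print(is_candidate([SHORT | BUZZ, LONG], dict_word))   # True
print(is_candidate([SHORT | NASAL, LONG], dict_word))  # False
```

Because the dictionary side sets a bit for every feature a segment might show, the test is a subset check rather than an equality check, so a word survives preliminary selection whenever the observed features fall within what was anticipated for it.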

【図面の簡単な説明】第１図は第１の実施例の音声認識装置の構成図、第２図は第１の実施例の音響特徴の記号化例を示す図、第３図（ａ），（ｂ）は第１の実施例の索引の作成法を
示す図、第４図は第１の実施例の索引辞書と辞書の構成図、第５図は第１の実施例の音声認識装置の動作フローチヤ
ート、第６図は第２の実施例の音声認識装置の構成図、第７図は第２の実施例の単語辞書の表記法を示す図、第８図は第２の実施例の入力音声の特徴表記例を示す
図、第９図は第２の実施例におけるSEG系列を示す図、第10図は第２の実施例の特徴ビツト系列比較部の処理説
明図、第11図（ａ），（ｂ）は第10図の回路Ａの説明図であ
る。図中、１…音声入力部、２…マイク、３…A/D変換部、
４…音響処理部、５…予備選択部、６…特徴抽出部、７
…索引作成部、８…単語検索部、９…索引辞書、10…辞
書、11…候補単語用メモリ、12…音素認識部、13…単語
マツチング部、14…認識結果出力部、61…音声入力部、
62…マイク、63…A/D変換部、64…音響処理部、65…予
備選択部、66…特徴抽出部、67…特徴ビツト系列作成
部、68…特徴ビツト系列比較部、69…辞書、70…候補単
語用メモリ、71…音素認識部、72…単語マツチング部、
73…認識結果出力部である。

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a configuration diagram of the speech recognition apparatus of the first embodiment.
FIG. 2 is a diagram showing an example of symbolizing acoustic features in the first embodiment.
FIGS. 3(a) and 3(b) are diagrams showing how the index is created in the first embodiment.
FIG. 4 is a configuration diagram of the index dictionary and the dictionary of the first embodiment.
FIG. 5 is an operation flowchart of the speech recognition apparatus of the first embodiment.
FIG. 6 is a configuration diagram of the speech recognition apparatus of the second embodiment.
FIG. 7 is a diagram showing the notation of the word dictionary of the second embodiment.
FIG. 8 is a diagram showing an example of the feature notation of input speech in the second embodiment.
FIG. 9 is a diagram showing SEG sequences in the second embodiment.
FIG. 10 is a diagram explaining the processing of the feature bit sequence comparison unit of the second embodiment.
FIGS. 11(a) and 11(b) are explanatory diagrams of circuit A of FIG. 10.

In the figures: 1, speech input unit; 2, microphone; 3, A/D converter; 4, acoustic processing unit; 5, preliminary selection unit; 6, feature extraction unit; 7, index creation unit; 8, word search unit; 9, index dictionary; 10, dictionary; 11, candidate word memory; 12, phoneme recognition unit; 13, word matching unit; 14, recognition result output unit; 61, speech input unit; 62, microphone; 63, A/D converter; 64, acoustic processing unit; 65, preliminary selection unit; 66, feature extraction unit; 67, feature bit sequence creation unit; 68, feature bit sequence comparison unit; 69, dictionary; 70, candidate word memory; 71, phoneme recognition unit; 72, word matching unit; 73, recognition result output unit.

Claims

(57) [Claims] A speech recognition method characterized in that: a speech dictionary storing word speech patterns for matching, and an index dictionary storing word indexes, each representing acoustic features of an input word speech pattern, together with pointers indicating the word speech patterns of the speech dictionary associated with the word indexes, are prepared; in the index dictionary, a pointer indicating the corresponding word speech parameter of the speech dictionary is stored under each of a plurality of word indexes respectively representing a plurality of acoustic features that can be extracted from one word speech pattern; acoustic features are extracted from the input word speech pattern; and, in accordance with the pointer stored in the index dictionary in correspondence with the word index representing the extracted acoustic features, the word speech pattern of the speech dictionary indicated by the pointer is matched against the input word speech pattern to recognize the input word speech pattern.