JPS6385697A

JPS6385697A - Voice recognition equipment

Info

Publication number: JPS6385697A
Application number: JP61230001A
Authority: JP
Inventors: 康弘小森
Original assignee: Canon Inc
Current assignee: Canon Inc
Priority date: 1986-09-30
Filing date: 1986-09-30
Publication date: 1988-04-16
Anticipated expiration: 2013-03-04
Also published as: JP2721341B2

Abstract

(57) [Abstract] This publication contains application data from before electronic filing, so no abstract data is recorded.

Description

DETAILED DESCRIPTION OF THE INVENTION [Field of Industrial Application] The present invention relates to a speech recognition device, and in particular to a speech recognition device with reduced recognition time.

[Prior Art] In conventional devices of this type, every item registered in the dictionary is a recognition target, so searching all entries makes recognition of the input speech slow. Consequently, when a large vocabulary is targeted, the processing speed of current computers cannot guarantee real-time operation, which is an important property of speech.

To address this, speed-ups have been achieved through hierarchical matching, which reduces the amount of computation: coarse features are first extracted from the input speech and roughly matched against the words in the dictionary, and only the words that score well are matched again using detailed features. Even with this method, however, a matching calculation must be performed at least once for every word, and the method is difficult to incorporate into a speech recognition device that has phoneme recognition means.

[Problems to be Solved by the Invention] The present invention eliminates the conventional drawbacks described above. It provides a speech recognition device that preselects words from the dictionary based on the acoustic features of the input speech, thereby reducing the number of word candidates, the number of word matching operations, and the speech recognition time.

[Means for Solving the Problems] As one means of solving these problems, the speech recognition device of the present invention comprises feature extraction means for extracting acoustic features from input speech, preliminary selection means for preselecting recognition candidates based on the acoustic features extracted by the feature extraction means, and recognition means for recognizing a recognition result from the recognition candidates preselected by the preliminary selection means.

[Operation] With this configuration, the recognition means of the speech recognition device of the present invention recognizes a recognition result from the recognition candidates that the preliminary selection means has preselected based on the acoustic features extracted by the feature extraction means.

[Embodiments] FIG. 1 shows a configuration diagram of the speech recognition device of the first embodiment.

In the figure, reference numeral 1 denotes a speech input unit, which consists of a microphone 2 and an A/D converter 3 and converts input speech into digital form. Reference numeral 4 denotes an acoustic processing unit, which performs frequency conversion and extracts acoustic parameters. Reference numeral 5 denotes a preliminary selection unit, which consists of a feature extraction unit 6 for preliminary selection, an index creation unit 7 for preliminary selection, and a word search unit 8. It extracts features for preliminary selection from the acoustic features output by the acoustic processing unit 4 and creates an index from those features. The word search unit 8 searches the index dictionary 9 for an index that matches the index output by the index creation unit 7, obtains from the index dictionary 9 the pointers to words in the dictionary 10 that are held by that index, and outputs the words in the dictionary 10 that those pointers point to into the candidate word memory 11 as candidate words. Preliminary selection is performed in this manner. Reference numeral 12 denotes a phoneme recognition unit, which takes the output of the acoustic processing unit 4 as input, performs phoneme recognition, and outputs a phoneme sequence. Reference numeral 13 denotes a word matching unit, which matches the candidate words in the candidate word memory 11 against the phoneme sequence output by the phoneme recognition unit 12 and outputs the word recognition result.

FIG. 2 shows examples of the acoustic features used for preliminary selection and their symbolization. The features in the figure are unvoiced friction (in phoneme terms, roughly "s", […], "h"), voiced friction ("z"), voiced sounds (vowels, semivowels, nasals, and "z", "b", "d", […], "r"), unvoiced sounds ("s", […], "h", "p", "t", "k"), and silence and nasals ("m", "n", and the moraic nasal "N", the phoneme corresponding to Japanese 「ん」). Although the way each feature appears depends on the phonemes before and after the phoneme in question, these are acoustic features that are relatively easy to extract. A temporal duration was also added to each of these acoustic features, and a symbol was assigned to each combination.

FIG. 3(a) shows how an index is created from the notation of a word. In this example, for the word "watasi" (watashi, "I"), "wa" is represented as a long voiced sound [V], the short silence between "wa" and "ta" as [q], "ta" as a short voiced sound [v], and "si" as a long unvoiced friction [S] followed by a short voiced sound [v], so the index for preliminary selection (hereinafter INDEX) created for the word "watasi" is [VqvSv]. When "watasi" is actually uttered, however, the extracted features may differ. Events that are likely to occur, such as a short "S" interval or devoicing of the "i" in "si", are therefore taken into account, and INDEXes such as [Vqvsv] and [VqvS] are also created in advance.
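The index construction described above can be sketched as follows. This is a minimal illustration rather than the patent's implementation: the feature class names, the `SYMBOLS` table, and the `variants` helper are assumptions introduced here, and only the symbols V, v, q, S, and s and the three resulting INDEXes come from the text.

```python
# Hypothetical symbol table: (feature class, is_long) -> index symbol.
# Only the symbols themselves appear in the text; the class names are assumed.
SYMBOLS = {
    ("voiced", True): "V",
    ("voiced", False): "v",
    ("silence", False): "q",
    ("unvoiced_friction", True): "S",
    ("unvoiced_friction", False): "s",
}


def make_index(segments):
    """Concatenate one symbol per (feature class, is_long) segment."""
    return "".join(SYMBOLS[seg] for seg in segments)


# "watasi": wa (long voiced), short silence, ta (short voiced),
# si (long unvoiced friction followed by a short voiced sound).
WATASI = [
    ("voiced", True),
    ("silence", False),
    ("voiced", False),
    ("unvoiced_friction", True),
    ("voiced", False),
]


def variants(segments):
    """Return the canonical INDEX plus the two variants named in the text."""
    result = {make_index(segments)}
    # Variant 1: the friction interval turns out short instead of long.
    shortened = [
        (cls, False) if cls == "unvoiced_friction" else (cls, is_long)
        for cls, is_long in segments
    ]
    result.add(make_index(shortened))
    # Variant 2: the final vowel is devoiced, so the last voiced segment is absent.
    if segments[-1][0] == "voiced":
        result.add(make_index(segments[:-1]))
    return result
```

Registering every variant in advance trades a larger index dictionary for a search that is a plain exact-match lookup at recognition time.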

An example for "baketu" (baketsu, "bucket") is also shown in FIG. 3(b).

FIG. 4 shows the structure of the index dictionary 9, which uses the INDEXes of FIG. 3(b), and the structure of the dictionary 10.
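One way to picture the two structures is a pair of lookup tables. The sketch below is an assumption-laden illustration: FIG. 4 itself is not reproduced in the text, so the exact assignment of INDEXes to pointers, and the idea that one INDEX may hold several pointers, are hypothetical. The pointer values 10 and 20 and the two words follow the example in the description.

```python
# Dictionary 10: pointer -> word (pointer values follow the example in the text).
DICTIONARY = {10: "watasi", 20: "baketu"}

# Index dictionary 9: INDEX -> pointers into the dictionary.
# Which INDEX holds which pointer is assumed here for illustration only.
INDEX_DICTIONARY = {
    "VqvSv": [10],
    "VqvS": [10],
    "Vqvsv": [10, 20],  # assumed: one INDEX shared by two words
    "Bvqvsv": [20],
}


def preselect(index):
    """Return the candidate words whose pointers are held by the given INDEX."""
    pointers = INDEX_DICTIONARY.get(index, [])
    return [DICTIONARY[p] for p in pointers]
```

With this layout, preliminary selection costs one table lookup plus one dereference per pointer, independent of the total vocabulary size.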

FIG. 5 shows an operation flowchart of the speech recognition device of the first embodiment.

In step S51, speech is input from the speech input unit 1, in this example the word "watasi" ("I"). In step S52, the acoustic processing unit 4 performs acoustic processing, and its result is passed both to feature extraction by the feature extraction unit 6 in step S53 and to phoneme recognition by the phoneme recognition unit 12 in step S59. In step S55, an INDEX is created from the result of the feature extraction in step S53; in this example, [VqvSv] is created. In step S56, the index dictionary 9 is searched based on INDEX [VqvSv]. In steps S56 and S57, recognition candidates are preselected in order by following FIG. 4 (INDEX [VqvSv] → pointer 10 → "watasi" → INDEX [Vqvsv] → pointer 20 → "baketu" → INDEX [Bvqvsv] → ...), and the candidate words "watasi", "baketu", ... are stored in the candidate word memory 11.

Meanwhile, in step S58 the word matching unit 13 matches the result of the phoneme recognition in step S59 against the recognition candidates stored in the candidate word memory 11, and in step S60 the recognition result, "watasi" ("I") in this example, is output.
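The final matching step can be sketched as below. The text does not say how the word matching unit 13 scores a candidate, so the similarity measure used here (the standard library's `difflib.SequenceMatcher`) is a stand-in chosen for illustration, and `match_word` is a hypothetical name. The point of the sketch is only that matching runs over the preselected candidates rather than over the whole dictionary.

```python
from difflib import SequenceMatcher


def match_word(phonemes, candidates):
    """Return the candidate most similar to the recognised phoneme string.

    `candidates` is the content of the candidate word memory; the similarity
    measure is an assumption, not the method of the patent.
    """
    if not candidates:
        return None
    return max(
        candidates,
        key=lambda word: SequenceMatcher(None, phonemes, word).ratio(),
    )


# Only the preselected words are scored, however large the dictionary is.
result = match_word("wadasi", ["watasi", "baketu"])
```

Here a slightly misrecognised phoneme string such as "wadasi" still resolves to "watasi", because the competing candidate shares far fewer symbols with it.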

Note that word matching in the system of FIG. 1 is based on phoneme symbols and therefore requires the phoneme recognition unit 12. However, if the words in the dictionary 10 are represented by acoustic-level parameters and the word matching unit 13 performs matching at the acoustic parameter level, the phoneme recognition unit 12 becomes unnecessary. The features are also not limited to the examples in FIG. 2; for example, plosiveness (which may be divided into voiced and unvoiced) or the magnitude of the speech energy (power) may be symbolized as well.

FIG. 6 shows a configuration diagram of the speech recognition device of the second embodiment.

In the figure, reference numeral 61 denotes a speech input unit, which consists of a microphone 62 and an A/D converter 63 and converts input speech into digital form.

Reference numeral 64 denotes an acoustic processing unit, which performs frequency conversion and extracts acoustic parameters. Reference numeral 65 denotes a preliminary selection unit, which consists of a feature extraction unit 66 for obtaining the feature bit sequence used to decide word candidates, a feature bit sequence creation unit 67, and a feature bit sequence comparison unit 68, which decides whether a word retrieved from the dictionary 69 requires more detailed matching as a candidate word. The preliminary selection unit 65 extracts features for the feature bit sequence from the acoustic features output by the acoustic processing unit 64 and creates a feature bit sequence from them. The feature bit sequence comparison unit 68 compares the feature bit sequence output by the feature bit sequence creation unit 67 with the feature bit sequence attached to each word in the dictionary 69. If every bit that is set to "1" in the feature bit sequence on the input speech side is also set to "1" in the feature bit sequence on the dictionary 69 side, the word having that feature bit sequence is chosen as a candidate word and sent to the candidate word memory 70. Reference numeral 71 denotes a phoneme recognition unit, which takes the output of the acoustic processing unit 64 as input, performs phoneme recognition, and outputs a phoneme sequence. Reference numeral 72 denotes a word matching unit, which matches the words in the candidate word memory 70 against the phoneme sequence output by the phoneme recognition unit 71 and outputs the word recognition result.
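The candidate rule stated here, that every bit set on the input side must also be set on the dictionary side, is a subset test on bit masks. A minimal sketch, assuming the bit sequences are held as non-negative Python integers (the function name and the sample bit patterns are hypothetical):

```python
def is_candidate(input_bits: int, dictionary_bits: int) -> bool:
    """True when every "1" bit of the input is also "1" in the dictionary entry."""
    # Bits set in the input but not in the dictionary entry disqualify the word.
    return input_bits & ~dictionary_bits == 0


# Made-up patterns for illustration only.
covered = is_candidate(0b0101, 0b0111)      # input bits 0 and 2 are both allowed
not_covered = is_candidate(0b1001, 0b0111)  # input bit 3 is not allowed
```

Because the dictionary side encodes everything a word may exhibit, extra "1" bits on the dictionary side are harmless, while a single unexpected feature on the input side rules the word out.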

FIG. 7 illustrates by example how the dictionary 69 of FIG. 6 is constructed. The example is for the word bentou ("boxed lunch"). The feature bit sequence for its phoneme symbols "beNtou" consists of a Header (which need not be 16 bits) and a bit sequence that in this example uses 16 bits for each phoneme symbol. Each 16-bit unit, corresponding to one phoneme, is called a SEG. Each SEG is represented by 16 bits, and bit 0 through bit 15 are each given the meaning shown in FIG. 7.

For example, bit 0 represents deletion: a "1" in this bit indicates that the SEG may be dropped. Bits 1 and 2 indicate whether the duration of the feature is long or short. In SEG1 both are set to "1" because the duration may turn out either long or short.

That is, because "b" is at the beginning of the word, the possibility that a buzz may or may not be present is taken into account. Similarly, the plosiveness, nasality, and so on of SEG2 are incorporated in consideration of the deletion of the preceding SEG and the influence of the neighboring SEGs. Each SEG can be created once the phoneme sequence is determined, from the acoustic features of each phoneme and its relationship to the phonemes before and after it. The Header holds the maximum and minimum number of SEGs that can occur overall, allowing for SEG deletions, and a pointer to the word. A dictionary structure having feature bit sequences like the one in FIG. 7 is created in this way.
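The dictionary entry layout can be sketched as a header plus a list of 16-bit flag words. Only three facts below come from the text: bit 0 marks possible deletion, bits 1 and 2 carry the long/short duration, and the Header stores the maximum and minimum SEG counts and a word pointer. Which of bits 1 and 2 means "long", the positions of the other feature bits, and all names are assumptions, since FIG. 7 is not reproduced.

```python
from dataclasses import dataclass
from enum import IntFlag


class Seg(IntFlag):
    DELETION = 1 << 0  # from the text: this SEG may be dropped
    LONG = 1 << 1      # bits 1 and 2 carry duration; the order is assumed
    SHORT = 1 << 2
    BUZZ = 1 << 3      # positions of the remaining bits are assumptions
    PLOSIVE = 1 << 4
    NASAL = 1 << 5


@dataclass
class Entry:
    max_segs: int      # Header: largest SEG count that can occur
    min_segs: int      # Header: smallest SEG count once deletions are allowed
    word_pointer: int  # Header: pointer to the word
    segs: list


def make_entry(word_pointer, segs):
    """Build an entry, deriving the Header counts from the deletion bits."""
    deletable = sum(1 for s in segs if s & Seg.DELETION)
    return Entry(len(segs), len(segs) - deletable, word_pointer, list(segs))


# Made-up SEG values; the first allows both durations, as SEG1 does in the text.
example = make_entry(
    30,
    [
        Seg.LONG | Seg.SHORT | Seg.BUZZ | Seg.PLOSIVE,
        Seg.SHORT,
        Seg.DELETION | Seg.SHORT | Seg.NASAL,
    ],
)
```

Storing the counts in the Header lets a comparison stage skip words whose possible lengths cannot match the input without reading their SEGs.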

FIG. 8 shows an example of a feature bit sequence created from input speech. It is one example of the feature bit sequence expected when "beNtou" is uttered.

The representation in FIG. 8 is the same as for the dictionary in FIG. 7, except that a bit for the possibility of insertion is considered. A pointer to a word is not needed in the Header, however. In each SEG decision interval, a "1" bit is set for each feature obtained.

Because insertions and deletions are assumed on both the input side and the dictionary side, FIG. 9 shows the SEG sequences that can be expressed by a given SEG sequence. These SEG sequences are used to decide whether detailed matching is to be performed.
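One plausible reading of this expansion, shown for deletion only, is to enumerate every sequence obtained by keeping or dropping each SEG whose deletion bit is set. FIG. 9 is not reproduced in the text, so the sketch below is an interpretation, not the patent's procedure; `expand` is a hypothetical name, and only the use of bit 0 as the deletion bit comes from the description.

```python
from itertools import product

DELETION = 1 << 0  # bit 0 marks a SEG that may be dropped, as in the text


def expand(segs):
    """Yield each distinct SEG sequence reachable by dropping deletable SEGs."""
    # A deletable SEG contributes two options (kept or dropped), others one.
    options = [((s,), ()) if s & DELETION else ((s,),) for s in segs]
    seen = set()
    for choice in product(*options):
        sequence = tuple(seg for part in choice for seg in part)
        if sequence not in seen:
            seen.add(sequence)
            yield sequence


# Made-up SEG values: only the middle one has the deletion bit set.
sequences = list(expand([0b0010, 0b0101, 0b0100]))
```

The number of expanded sequences doubles with each deletable SEG, which is one reason to keep the per-SEG comparison itself as cheap as a bit test.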

FIG. 10 is a diagram explaining the operation of the feature bit sequence comparison unit 68, the part that decides whether detailed matching is to be performed. First, the SEG sequence of the input speech is placed in the input speech side register 104, and a SEG sequence whose length matches that of the input side is fetched from the dictionary 69 into the dictionary side register 101. The SEGs of each sequence are loaded one after another into the memories 102 and 103. The comparison circuit 105 compares the bits in the memories 102 and 103 and produces an output 106 of "1" or "0". The comparison circuit 105 and its operation are shown in FIGS. 11(a) and 11(b). If the decision at 107 is "0", the next word or the next SEG sequence is fetched on the input speech side or the dictionary side at 108 or 109, and the process is repeated. If the decision at 107 is "1", 110 checks whether the SEG is the final one. If it is not, the same processing is applied to the next SEG at 113. This is repeated, and a word for which the output of the comparison circuit 105 is "1" for every SEG up to the final one is taken as a candidate word at 111 and sent at 112 to the word matching unit 72 for detailed matching.
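The control flow of FIG. 10 can be sketched as a nested loop with early exit. The details of comparison circuit 105 are in FIG. 11, which is not reproduced, so `seg_output` below simply applies the rule given earlier in the description (every "1" on the input side must be "1" on the dictionary side). The function names and data layout are assumptions, and the expansion of insertions and deletions is left out for brevity.

```python
def seg_output(input_seg: int, dict_seg: int) -> int:
    """Stand-in for comparison circuit 105: 1 if the input bits are all covered."""
    return 1 if input_seg & ~dict_seg == 0 else 0


def select_candidates(input_segs, dictionary):
    """dictionary is a list of (word, seg_list) pairs; returns candidate words."""
    candidates = []
    for word, dict_segs in dictionary:         # 108/109: fetch the next word
        if len(dict_segs) != len(input_segs):  # only equal-length sequences compare
            continue
        for inp, ref in zip(input_segs, dict_segs):
            if seg_output(inp, ref) == 0:      # 107: "0" ends work on this word
                break
        else:                                  # 110: final SEG reached, all "1"
            candidates.append(word)            # 111/112: pass to detailed matching
    return candidates
```

The early `break` is what saves work: most words are rejected after one or two SEG comparisons, and only the survivors reach the more expensive word matching stage.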

Note that when acoustic features are used for the dictionary contents, the scheme can also be applied to a system without the phoneme recognition unit 71 of FIG. 6.

As described above, providing the speech recognition device with a preliminary selection unit and an index dictionary makes it possible to preselect the words that may have the acoustic features of the input speech and treat them as word candidates, which reduces the number of word matching operations and speeds up recognition processing.

In addition, using a detailed matching decision unit in the speech recognition device greatly reduces unnecessary computation and shortens the recognition processing time.

[Effects of the Invention] The present invention provides a speech recognition device that preselects words from the dictionary based on the acoustic features of the input speech, thereby reducing the number of word candidates, the number of word matching operations, and the speech recognition time.

[Brief Description of the Drawings]

FIG. 1 is a configuration diagram of the speech recognition device of the first embodiment. FIG. 2 is a diagram showing examples of the symbolization of acoustic features in the first embodiment. FIGS. 3(a) and 3(b) are diagrams showing how indexes are created in the first embodiment. FIG. 4 is a configuration diagram of the index dictionary and the dictionary of the first embodiment. FIG. 5 is an operation flowchart of the speech recognition device of the first embodiment. FIG. 6 is a configuration diagram of the speech recognition device of the second embodiment. FIG. 7 is a diagram showing the notation of the word dictionary of the second embodiment. FIG. 8 is a diagram showing an example of the feature notation of input speech in the second embodiment. FIG. 9 is a diagram showing SEG sequences in the second embodiment. FIG. 10 is a diagram explaining the processing of the feature bit sequence comparison unit of the second embodiment. FIGS. 11(a) and 11(b) are diagrams explaining circuit A of FIG. 10.

In the figures: 1, speech input unit; 2, microphone; 3, A/D converter; 4, acoustic processing unit; 5, preliminary selection unit; 6, feature extraction unit; 7, index creation unit; 8, word search unit; 9, index dictionary; 10, dictionary; 11, candidate word memory; 12, phoneme recognition unit; 13, word matching unit; 14, recognition result output unit; 61, speech input unit; 62, microphone; 63, A/D converter; 64, acoustic processing unit; 65, preliminary selection unit; 66, feature extraction unit; 67, feature bit sequence creation unit; 68, feature bit sequence comparison unit; 69, dictionary; 70, candidate word memory; 71, phoneme recognition unit; 72, word matching unit; 73, recognition result output unit.

Patent applicant: Canon Inc.

Claims

[Claims]

(1) A speech recognition device that performs speech recognition by comparing input speech with a pre-registered dictionary, comprising: feature extraction means for extracting acoustic features from the input speech; preliminary selection means for preselecting recognition candidates based on the acoustic features extracted by the feature extraction means; and recognition means for recognizing a recognition result from the recognition candidates preselected by the preliminary selection means.

(2) The speech recognition device according to claim 1, wherein the feature extraction means extracts features from the input speech as a string of predetermined standard sounds in units of words.

(3) The speech recognition device according to claim 1, wherein the feature extraction means extracts the features by decomposing the input speech into predetermined standard sounds.

(4) The speech recognition device, wherein the recognition means recognizes the recognition result based on a comparison between the recognition candidates preselected by the preliminary selection means and the result of phoneme recognition of the input speech.