JPS6033599A

JPS6033599A - Voice recognition equipment

Info

Publication number: JPS6033599A
Application number: JP58143181A
Authority: JP
Inventors: 英一坪香
Original assignee: Matsushita Electric Industrial Co Ltd
Current assignee: Panasonic Holdings Corp
Priority date: 1983-08-04
Filing date: 1983-08-04
Publication date: 1985-02-20
Also published as: JPH0585918B2

Abstract

(57) [Abstract] This publication contains application data from before electronic filing, so no abstract data is recorded.

Description

[Detailed Description of the Invention]

Field of industrial application

The present invention relates to a speech recognition device, and in particular to a speech recognition device that allows arbitrary text to be entered by recognizing monosyllabic speech.

Configuration of the conventional example and its problems

Speech is the most natural means by which humans produce information. If it could be used as the input means of a human-machine system, the benefit would be very large.

Speech recognition devices based on speaker-dependent registration have already been put into practical use. A speaker who intends to use the device first utters every word to be recognized; each utterance is converted into a sequence of feature vectors and registered in a word dictionary as a standard pattern. At recognition time, the uttered speech is likewise converted into a sequence of feature vectors, its closeness to each word in the dictionary is computed by a predetermined rule, and the most similar word is output as the recognition result.

This method works well when the vocabulary is small, but as the vocabulary grows to hundreds or thousands of words, the following three problems can no longer be ignored.

(1) The burden on the speaker at registration time increases markedly.

(2) The time needed to compute the similarity or distance between the uttered speech and the standard patterns increases markedly, so the device responds more slowly.

(3) The memory required for the word dictionary becomes very large.

One way to avoid these drawbacks is to take the monosyllable as the unit of recognition, either consonant + vowel or a vowel alone (hereinafter CV and V respectively, where C denotes a consonant and V a vowel). Monosyllables are registered as standard patterns in the form of feature-vector sequences, and at recognition time the input speech, converted into a feature-vector sequence, is matched against these monosyllable standard patterns and thereby converted into a sequence of monosyllables. Japanese has at most 101 distinct monosyllables, and monosyllables correspond to kana characters, so this method can convert (recognize) any Japanese word or sentence as a monosyllable string, and problems (1) to (3) above are all resolved.

This approach, however, raises two problems of its own: coarticulation and segmentation. Coarticulation is the phenomenon whereby, when syllables are uttered in succession, each syllable is influenced by its neighbors, so that its spectral structure changes depending on the syllables that precede and follow it. Segmentation is the division of continuously uttered speech into individual monosyllables, which is difficult to do reliably with current technology. To sidestep both problems, current practice is to have the speaker utter each monosyllable separately, and some devices of this kind are in practical use.

FIG. 1 shows the general configuration of a device that recognizes monosyllabic speech by pattern matching. Reference numeral 1 is the input terminal for the speech signal. 2 is a feature extraction unit, which analyzes the input speech signal with a filter bank, FFT, LPC, or the like and converts it, every few milliseconds, into a sequence of feature vectors $A = a_1 a_2 \cdots a_i \cdots a_I$. 3 is a standard pattern storage unit, which stores, as the standard pattern for each syllable, the monosyllabic speech to be recognized after conversion into a feature-vector sequence by the same means: $R_n = b_{n1} b_{n2} \cdots b_{nj} \cdots b_{nJ_n}$, where $n = 1, 2, \ldots, N$ and $N$ is the number of standard patterns. 4 is a pattern comparison unit, which compares the input pattern $A$ output by the feature extraction unit 2 with each standard pattern $R_n$ stored in the standard pattern storage unit 3 and computes the distance $D(A, R_n)$ between them. 5 is a decision unit, which determines the standard pattern $R_{\hat{n}}$ closest to the input pattern by $\hat{n} = \arg\min_n D(A, R_n)$. 6 is an output terminal that outputs the decision as the monosyllable recognition result.

For the comparison in the pattern comparison unit 4, DP matching (matching based on dynamic programming) and linear shift matching are commonly used. It is also common to first recognize the vowel to determine the candidate vowel row, and then to recognize the consonant part using only the standard patterns belonging to that vowel row, which improves both the recognition rate and the matching speed.

However, monosyllabic utterances are short, and many of them, such as "shi" and "chi", must be distinguished by subtle differences in the consonant part. It is therefore difficult to achieve recognition rates as high as those for whole-word speech.

To address this problem, methods that use a word dictionary have been proposed. FIG. 2 shows an example.

In FIG. 2, blocks bearing the same numerals as in FIG. 1 operate in the same way as in FIG. 1. 7 is a word dictionary in which each word to be recognized, $W_l$ ($l = 1, 2, \ldots, L$, where $L$ is the number of registered words), is stored as a string of symbols corresponding to monosyllables, $W_l = C_{l1} C_{l2} \cdots C_{lk} \cdots C_{lK_l}$, where $C_{lk}$ is the $k$-th syllable of word $W_l$. 8 is a word comparison unit. Let the input monosyllable string be $T = A_1 A_2 \cdots A_m \cdots A_M$, where $M$ is the number of syllables in the input word. For every word in the word dictionary 7 that has the same number of syllables as the input, $W_{l'} = C_{l'1} C_{l'2} \cdots C_{l'M}$, the unit uses the distances $D(A_m, C_{l'm})$ computed by the pattern comparison unit 4 to calculate

$$D_W(T, W_{l'}) = \sum_{m=1}^{M} D(A_m, C_{l'm}).$$

9 is a decision unit, which finds $\hat{l}' = \arg\min_{l'} D_W(T, W_{l'})$ and judges $W_{\hat{l}'}$ to be the recognized word. 10 is an output terminal that outputs the recognized word.

Using the knowledge in a word dictionary in this way improves the recognition rate. Moreover, when the device is used as input to a word processor, the dictionary already used for kana-kanji conversion can be shared, so no word dictionary needs to be prepared specifically for speech recognition.

However, such a word dictionary usually contains 30,000 or more words, and the amount of computation in the word comparison unit 8 can no longer be ignored.

Object of the invention

The present invention relates to a monosyllabic speech recognition device that uses a word dictionary to improve the recognition rate for monosyllables, and more specifically to a speech recognition device characterized by faster matching against the word dictionary.

Structure of the invention

The present invention is a speech recognition device comprising: means for converting an input speech signal into a sequence of feature vectors; means for dividing the input speech signal into syllables; means for recognizing the following vowel of each syllable from the sequence of feature vectors; means for obtaining, as symbol strings, the syllable strings of the words or phrases whose following-vowel sequence is identical to the recognized following-vowel sequence; means for matching those symbol strings against the syllable string obtained from the input speech signal; and decision means for judging, on the basis of this matching, the word or phrase closest to the input speech signal to be the recognition result for the input speech.

The basic idea of the invention is as follows.

In monosyllabic speech recognition, vowels are recognized almost without error. Therefore, if the following vowels of the input monosyllables (each a CV or a V, where C is a consonant and V a vowel) form the sequence $V_1 V_2 \cdots V_M$, the only dictionary words that need to be matched are those whose constituent monosyllables have the same following-vowel sequence $V_1 V_2 \cdots V_M$. For example, if the following vowels of the input monosyllable string are /o/ /o/ /a/ /a/, the words selected for matching are "oosaka", "toyonaka", and so on.
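The candidate selection described here can be sketched as a lookup keyed by the vowel sequence. The Python below is illustrative only: the romanized syllable lists, the toy dictionary, and the names `vowel_key` and `build_index` are assumptions made for this example and do not come from the patent.

```python
from collections import defaultdict

def vowel_key(syllables):
    """Following-vowel sequence of a word given as romanized syllables.

    Assumes each syllable string ends in its vowel ("sa" -> "a"); a
    one-character symbol such as "N" or "Q" is used as its own key element.
    """
    return tuple(s[-1] for s in syllables)

def build_index(dictionary):
    """Group dictionary words by their following-vowel sequence."""
    index = defaultdict(list)
    for word, syllables in dictionary.items():
        index[vowel_key(syllables)].append(word)
    return index

# Hypothetical toy dictionary: word -> syllable symbols.
dictionary = {
    "oosaka": ["o", "o", "sa", "ka"],
    "toyonaka": ["to", "yo", "na", "ka"],
    "nagoya": ["na", "go", "ya"],
    "yokohama": ["yo", "ko", "ha", "ma"],
    "hirosima": ["hi", "ro", "si", "ma"],
}

index = build_index(dictionary)
# Vowel sequence recognized from the input: /o/ /o/ /a/ /a/.
candidates = index.get(("o", "o", "a", "a"), [])
print(candidates)
```

Because the key is a tuple, its length also fixes the syllable count, so words with a different number of syllables are excluded by the same lookup.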

For four-syllable words, for example, if all vowels are equally likely, the probability of any particular vowel sequence is $(1/5)^4 = 1/625$. If the dictionary contains 10,000 four-syllable words, only 16 of them correspond to a given vowel sequence, so the number of words for which the comparison must actually be computed drops sharply.

Even if, to leave some margin, the second-best vowel candidate is also admitted for each syllable, the fraction is $(2/5)^4 \approx 1/39$. With the same 10,000 four-syllable words, about 256 words must be compared, which is still a large reduction. If the geminate consonant (sokuon) and the moraic nasal (hatsuon) are handled in the same way as the vowels, the number of comparisons can be reduced further.
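The estimates for four-syllable words can be checked with exact fractions. This is a small sketch under the text's own assumption that the five vowels are equally likely; the function name `expected_candidates` and its parameters are hypothetical.

```python
from fractions import Fraction

def expected_candidates(n_words, n_syllables, n_vowels=5, n_best=1):
    """Expected number of dictionary words that survive vowel filtering.

    n_best is how many vowel candidates are admitted per syllable.
    Returns (fraction of words kept, expected number of words kept).
    """
    kept = Fraction(n_best, n_vowels) ** n_syllables
    return kept, n_words * kept

p1, c1 = expected_candidates(10_000, 4)            # best vowel only
p2, c2 = expected_candidates(10_000, 4, n_best=2)  # best two vowels
print(p1, c1)
print(p2, float(1 / p2), c2)
```

With the best vowel only, the kept fraction is 1/625 and 16 words remain. With two candidates per syllable the fraction is 16/625, roughly 1/39, which leaves 256 words out of 10,000.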

Because vowels, geminate consonants (sokuon), and moraic nasals (hatsuon) are recognized almost perfectly, this not only reduces the amount of computation but also improves the recognition rate itself.

Description of the embodiment

FIG. 3 is a block diagram showing the configuration of a speech recognition device according to one embodiment of the present invention. 11 is the input terminal for the speech signal; a word is input as a chain of monosyllables.

12 is a feature extraction unit like the one described for the conventional example; it converts the input speech into a sequence of feature vectors as described above. 13 is a power calculation unit. Writing the output vector sequence of the feature extraction unit 12 as $a_1 a_2 \cdots a_i \cdots a_I$ with $a_i = (a_{i1}, a_{i2}, \ldots, a_{i\mu})$, the power $P_i$ of the $i$-th frame can be obtained, for example, as

$$P_i = \sqrt{a_{i1}^2 + a_{i2}^2 + \cdots + a_{i\mu}^2}.$$

14 is a syllable interval detection unit, which uses the output of the power calculation unit 13 to divide the input speech into syllables and detect the start frame and end frame of each syllable. FIG. 4 shows an example: the point at which the power rises above threshold 29 is taken as the start frame of a syllable, the point at which it falls to threshold 29 or below is taken as the end frame, and the interval above the threshold is taken as the interval in which the syllable exists. When an interval at or below threshold 29 lasts for a fixed time $t_c$ or longer, that interval is regarded as a sokuon (geminate consonant). FIG. 4 shows the utterance "sapporo", where Q denotes the sokuon. 15 is a syllable counting unit, which counts the number of syllables (that is, the number of moras), treating a sokuon as one syllable. 16 is a vowel standard pattern storage unit, in which standard patterns for the vowels /a/, /i/, /u/, /e/, /o/ and the moraic nasal (hatsuon) /N/ are registered in advance.
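The power-threshold segmentation can be sketched as follows. This is a minimal illustration, not the patent's circuit: the function names, the toy power contour, and the choice to measure $t_c$ in frames and to ignore silence before the first or after the last syllable are all assumptions of the example.

```python
import math

def frame_power(frame):
    """Power of one feature vector: square root of the sum of squares."""
    return math.sqrt(sum(x * x for x in frame))

def segment(powers, threshold, tc):
    """Split a power contour into syllables and sokuon (Q) intervals.

    Returns (label, start, end) tuples with end exclusive. A run above the
    threshold is a syllable; a run at or below it that lies between two
    syllables and lasts at least tc frames is labeled "Q".
    """
    segments = []
    i, n = 0, len(powers)
    while i < n:
        voiced = powers[i] > threshold
        j = i
        while j < n and (powers[j] > threshold) == voiced:
            j += 1
        if voiced:
            segments.append(("syllable", i, j))
        elif segments and j < n and j - i >= tc:
            segments.append(("Q", i, j))
        i = j
    return segments

# Toy contour loosely shaped like "sa", long pause, "po", short pause, "ro".
powers = [0, 0, 5, 5, 5, 0, 0, 0, 0, 0, 5, 5, 0, 5, 5, 0, 0]
segments = segment(powers, threshold=1, tc=4)
mora_count = len(segments)  # the sokuon counts as one mora
print(segments, mora_count)
```

For this contour the result is three syllables plus one Q interval, so the mora count is 4, matching the four moras of "sapporo".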

17 is a vowel frame detection unit, which detects the frame position corresponding to the vowel from the start and end frames of each syllable detected by the syllable interval detection unit 14 and the sequence of feature vectors extracted by the feature extraction unit 12.

Because the vowel part is stationary, the frame sought can be detected as the frame $i$ at which the sum, over all components of the feature vector, of the variance across frames $i-r$ to $i+r$ ($r$ is a constant) reaches a local minimum. That is, with the feature vector of the $i$-th input frame written $a_i = (a_{i1}, a_{i2}, \ldots, a_{ij}, \ldots, a_{i\mu})$, let

$$m_{ij} = \frac{1}{2r+1} \sum_{k=i-r}^{i+r} a_{kj}, \qquad v_i = \sum_{j=1}^{\mu} \frac{1}{2r+1} \sum_{k=i-r}^{i+r} (a_{kj} - m_{ij})^2 .$$

$v_i$ is evaluated backwards from the final frame of each monosyllable, and the frame at which $v_i$ reaches a minimum is taken as the center frame of the vowel's stationary part.

18 is a buffer memory, which stores, for each monosyllable, the sequence of feature vectors extracted by the feature extraction unit 12, from the monosyllable start frame to the end frame detected by the syllable interval detection unit 14. 19 is a vowel pattern comparison unit, which reads from the buffer memory 18 the feature vector of the frame detected by the vowel frame detection unit 17, compares it with each vowel standard pattern in the vowel standard pattern storage unit 16, and computes the distance to each. For example, if $a_i = (a_{i1}, a_{i2}, \ldots, a_{i\mu})$ is the feature vector of the vowel frame of the input monosyllable, its distance to the $\nu$-th vowel standard pattern (the hatsuon included) $v_\nu = (v_{\nu 1}, v_{\nu 2}, \ldots, v_{\nu\mu})$, $\nu = 1, 2, \ldots$, can be taken as

$$d_{i\nu} = \sqrt{\sum_{k=1}^{\mu} (a_{ik} - v_{\nu k})^2}.$$

20 is a vowel decision unit, which finds $\hat{\nu} = \arg\min_{\nu} d_{i\nu}$ and takes the vowel corresponding to $v_{\hat{\nu}}$ as the vowel recognition result.
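The vowel-frame search and the vowel decision can be sketched in a few lines. The sketch assumes that the variance is the population variance over the $2r+1$ frames of the window, that the backward scan stops at the first local minimum, and that the vowel distance is Euclidean; the function names, the toy frames, and the two-entry template set are hypothetical.

```python
import math

def summed_variance(frames, i, r):
    """Sum over components of the variance across frames i-r .. i+r."""
    window = frames[i - r:i + r + 1]
    n = len(window)
    total = 0.0
    for j in range(len(window[0])):
        column = [f[j] for f in window]
        mean = sum(column) / n
        total += sum((x - mean) ** 2 for x in column) / n
    return total

def steady_frame(frames, start, end, r):
    """Scan backwards from the end of syllable [start, end) and return the
    first frame whose summed variance is a local minimum."""
    lo, hi = start + r, end - 1 - r  # frames whose window fits the syllable
    if lo > hi:
        raise ValueError("syllable is shorter than 2r+1 frames")
    v = {i: summed_variance(frames, i, r) for i in range(lo, hi + 1)}
    inf = float("inf")
    for i in range(hi, lo - 1, -1):
        if v[i] <= v.get(i - 1, inf) and v[i] <= v.get(i + 1, inf):
            return i

def classify_vowel(vec, templates):
    """Label of the template with the smallest Euclidean distance to vec."""
    def dist(label):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(vec, templates[label])))
    return min(templates, key=dist)

# Toy syllable: a changing onset, a flat vowel region, a decaying tail.
frames = [[9, 0], [5, 3], [1, 1], [1, 1], [1, 1], [1, 1.2], [1.5, 2], [3, 3]]
templates = {"a": [1.0, 1.0], "i": [5.0, 5.0]}  # hypothetical vowel patterns
center = steady_frame(frames, 0, len(frames), r=1)
vowel = classify_vowel(frames[center], templates)
print(center, vowel)
```

Scanning from the end rather than taking the global minimum keeps the search inside the vowel, which occupies the tail of a CV syllable.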

21 is a vowel/sokuon result storage unit, which stores the vowels determined by the vowel decision unit 20 and the sokuon (geminate consonants) detected by the syllable interval detection unit 14 in the order in which they occurred.

22 is a monosyllable standard pattern storage unit, which stores the standard pattern for each monosyllable as a sequence of feature vectors. 23 is a monosyllable pattern comparison unit, which compares the input pattern held in the buffer memory 18 with the monosyllable standard patterns held in the monosyllable standard pattern storage unit 22 and computes the distance from the input pattern to each of them. The standard patterns to be matched are limited to monosyllables whose following vowel is the vowel determined by the vowel decision unit 20. For each monosyllable, the range compared runs from its start frame to the stationary part of the vowel, which is precisely the portion that carries the consonant information.

Well-known methods such as linear shift matching or DP matching can be used for the comparison. With DP matching the procedure is as follows. Let the $n$-th monosyllable standard pattern be $R_n = b_{n1} b_{n2} \cdots b_{nj} \cdots b_{nJ_n}$ and the monosyllable input pattern be $A = a_1 a_2 \cdots a_i \cdots a_I$, where $I$ and $J_n$ are the center frames of the vowel stationary part of the input pattern and the standard pattern respectively, and let $d_n(i, j)$ be the distance between the vectors $a_i$ and $b_{nj}$. Solving the recurrence for $g(i, j)$ from the initial condition $g(1, 1) = 2 d_n(1, 1)$ gives the distance between $A$ and $R_n$ as $D(A, R_n) = g(I, J_n)$. With $a_i = (a_{i1}, a_{i2}, \ldots, a_{i\mu})$ and $b_{nj} = (b_{nj1}, b_{nj2}, \ldots, b_{nj\mu})$, $d_n(i, j)$ is generally taken to be

$$d_n(i, j) = \sum_{k=1}^{\mu} |a_{ik} - b_{njk}|.$$

Various forms of the recurrence have been proposed, and the one referred to here is only one example.

24 is a distance storage unit, which stores the distances computed by the monosyllable pattern comparison unit 23. When a word consisting of the monosyllable string $A_1 A_2 \cdots A_m \cdots A_M$ is input, the distance storage unit 24 stores $D(A_m, R_n)$ for all $1 \le m \le M$ and all $R_n \in S_{A_m}$, where $S_{A_m}$ is the set of monosyllable standard patterns that have the same following vowel as $A_m$.
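The recurrence itself did not survive in this text, so the sketch below substitutes a commonly used symmetric DP-matching recurrence that is consistent with the stated initial condition $g(1,1) = 2d_n(1,1)$: $g(i,j) = \min\{g(i-1,j) + d,\; g(i-1,j-1) + 2d,\; g(i,j-1) + d\}$ with $d = d_n(i,j)$. That choice is an assumption, as are the function names. The local distance is the city-block distance given in the text.

```python
def local_distance(a, b):
    """City-block distance between two feature vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))

def dp_distance(A, R):
    """DP-matching distance between feature-vector sequences A and R.

    Assumed symmetric recurrence (0-based indices here):
      g[0][0] = 2 * d(0, 0)
      g[i][j] = min(g[i-1][j] + d, g[i-1][j-1] + 2 * d, g[i][j-1] + d)
    The caller is expected to pass only the frames from the syllable start
    to the center of the vowel's stationary part.
    """
    I, J = len(A), len(R)
    inf = float("inf")
    g = [[inf] * J for _ in range(I)]
    g[0][0] = 2 * local_distance(A[0], R[0])
    for i in range(I):
        for j in range(J):
            if i == 0 and j == 0:
                continue
            d = local_distance(A[i], R[j])
            best = inf
            if i > 0:
                best = min(best, g[i - 1][j] + d)
            if j > 0:
                best = min(best, g[i][j - 1] + d)
            if i > 0 and j > 0:
                best = min(best, g[i - 1][j - 1] + 2 * d)
            g[i][j] = best
    return g[I - 1][J - 1]

# A time-stretched copy matches with zero distance.
print(dp_distance([[0], [1], [2]], [[0], [1], [1], [2]]))
print(dp_distance([[0], [1]], [[0], [3]]))
```

Implementations of this recurrence often divide $g(I, J)$ by $I + J$ to normalize for length. The text gives the distance as $g(I, J)$ directly, so no normalization is applied here.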

25 is a word dictionary, in which the words to be recognized are stored as strings of syllable symbols. 26 is a word distance calculation unit, which computes the distance between the word input as a monosyllable string and each word in the word dictionary 25 from the distances held in the distance storage unit 24.

The words in the word dictionary 25 that are compared are limited to those that have the number of syllables given by the syllable counting unit 15, that is, the number of syllables of the input word, and whose following-vowel sequence (hatsuon and sokuon included) is identical to the sequence held in the vowel/sokuon result storage unit 21. Call this restricted set of words $S_W$, and suppose a word $W_l \in S_W$ consists of the syllable string $C_{l1} C_{l2} \cdots C_{lm} \cdots C_{lM}$. As explained above, the monosyllable distance $D(A_m, C_{lm})$ between $A_m$ and $C_{lm}$ is held in the distance storage unit 24, so the distance between the input word $T = A_1 A_2 \cdots A_m \cdots A_M$ and the dictionary word $W_l = C_{l1} C_{l2} \cdots C_{lm} \cdots C_{lM}$ can be obtained as

$$D_W(T, W_l) = \sum_{m=1}^{M} D(A_m, C_{lm}).$$

27 is a word decision unit, which finds $\hat{l} = \arg\min_{W_l \in S_W} D_W(T, W_l)$ and judges $W_{\hat{l}}$ to be the recognized word. 28 is the output terminal for the recognition result.
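The word-level decision amounts to summing stored syllable distances over each candidate and taking the minimum. The sketch below assumes the stored distances are available as one table per input syllable; the name `recognize_word`, the table layout, and all numeric values are invented for illustration.

```python
def recognize_word(syllable_dists, candidates):
    """Pick the candidate word with the smallest summed syllable distance.

    syllable_dists[m] maps a syllable symbol to D(A_m, C) for the m-th input
    syllable. candidates maps each word of the restricted set to its
    syllable symbols.
    """
    best_word, best_score = None, float("inf")
    for word, syllables in candidates.items():
        if len(syllables) != len(syllable_dists):
            continue  # different syllable count
        try:
            score = sum(d[s] for d, s in zip(syllable_dists, syllables))
        except KeyError:
            continue  # a syllable with no stored distance
        if score < best_score:
            best_word, best_score = word, score
    return best_word, best_score

# Hypothetical stored distances for an input with vowels /o/ /o/ /a/ /a/.
syllable_dists = [
    {"o": 0.1, "to": 0.9, "yo": 0.8},
    {"o": 0.2, "yo": 0.7, "ko": 0.6},
    {"sa": 0.3, "na": 0.5, "ha": 0.9},
    {"ka": 0.1, "ma": 0.8},
]
candidates = {
    "oosaka": ["o", "o", "sa", "ka"],
    "toyonaka": ["to", "yo", "na", "ka"],
    "yokohama": ["yo", "ko", "ha", "ma"],
}
word, score = recognize_word(syllable_dists, candidates)
print(word, round(score, 3))
```

Since each table only holds syllables that share the recognized vowel, a candidate containing any other syllable is skipped rather than scored.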

Although this embodiment has been described as recognizing word by word, recognition can of course also be performed phrase by phrase (bunsetsu by bunsetsu). One option in that case is to treat nouns with attached function words, and even the conjugated forms of verbs, adjectives, and adjectival nouns, as words and register them all in the word dictionary, but this greatly increases the memory needed for the dictionary. Instead, the word dictionary can hold only stems and nouns without attached function words, and the various phrases can be generated by rule when the word distance calculation unit 26 performs the comparison.

In particular, when the device of the present invention is used as the input to a word processor with kana-kanji conversion, the kana-kanji conversion dictionary can be shared as the word dictionary, and the function for generating attached function words is already present, which is very convenient.

In this embodiment the sokuon (geminate consonant) is detected from the length of the silent interval, but it can also be input by uttering "tsu". In that case the sokuon in the word dictionary need only be replaced with "tsu"; whether a given "tsu" is actually "tsu" or a sokuon can easily be decided as a matter of language processing.

Furthermore, although the invention has been described for speech uttered one monosyllable at a time, all that is required is that the monosyllable boundaries can be found. If they can be found in continuously uttered speech, the principle of the invention applies unchanged.

Effects of the invention

According to the present invention, recognition is performed on the word as a whole rather than on individual monosyllables alone, and the words to be compared are restricted by their vowel sequence, so that both the recognition rate and the matching speed are greatly improved.

[Brief explanation of drawings]

FIG. 1 is a block diagram of a conventional monosyllabic speech recognition device. FIG. 2 is a block diagram of an improved version of that conventional example. FIG. 3 is a block diagram of a speech recognition device according to one embodiment of the present invention. FIG. 4 is a waveform diagram explaining the operation of part of the device of the present invention.

11: speech signal input terminal
12: feature extraction unit
13: power calculation unit
14: syllable interval detection unit
15: syllable counting unit
16: vowel standard pattern storage unit
17: vowel frame detection unit
18: buffer memory
19: vowel pattern comparison unit
20: vowel decision unit
21: vowel/sokuon result storage unit
22: monosyllable standard pattern storage unit
23: monosyllable pattern comparison unit
24: distance storage unit
25: word dictionary
26: word distance calculation unit
27: word decision unit
28: recognition result output terminal

Claims

[Claims]

A speech recognition device characterized by comprising:
means for converting an input speech signal into a sequence of feature vectors;
means for dividing the input speech signal into syllables;
means for recognizing the following vowel of each syllable from the sequence of feature vectors;
means for obtaining, as a symbol string, the syllable string of a word or phrase whose following-vowel sequence is identical to the recognized following-vowel sequence;
means for matching the symbol string against the syllable string obtained from the input speech signal; and
decision means for judging, as a result of this matching, the word or phrase closest to the input speech signal to be the recognition result for the input speech.