JP3004749B2

JP3004749B2 - Standard pattern registration method

Info

Publication number: JP3004749B2
Application number: JP3047624A
Authority: JP
Inventors: 潤一郎藤本
Original assignee: Ricoh Co Ltd
Current assignee: Ricoh Co Ltd
Priority date: 1990-05-14
Filing date: 1991-02-20
Publication date: 2000-01-31
Anticipated expiration: 2015-01-31
Also published as: JPH04212199A

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Technical Field] The present invention relates to a standard pattern registration method and, more particularly, to pattern matching in speech recognition.

[0002]

[Prior Art] Most current speech recognition devices use a pattern matching method: an unknown input speech pattern is compared with standard patterns registered in advance, and the category of the most similar standard pattern is output as the recognition result.

[0003] FIG. 7 illustrates an example of a conventional speech pattern matching method. In the figure, 1 is a microphone, 2 is a microphone amplifier, 3 is a feature conversion unit, 4 is an A/D conversion unit, 5 is a changeover switch, 6 is a standard pattern storage unit, 7 is a matching unit, 8 is a maximum similarity detection unit, and 9 is a recognition result output unit. First, the changeover switch 5 is set to the standard pattern registration side (side a) and speech is input from the microphone 1. The speech, converted into an electric signal by the microphone 1, is amplified by the microphone amplifier 2 and converted into features by the feature conversion unit 3; several feature quantities are known for this purpose, the spectrum among them. The features are converted into discrete values and stored in the standard pattern storage unit 6 as a standard pattern. For recognition, the switch is set to the matching side (side b). A speech pattern is created in the same way as at registration and matched against all the standard patterns registered in advance; the pattern with the highest similarity is found and taken as the recognition result.

[0004] The details of this recognition scheme and of the feature quantities are described in, for example, Niimi's "Speech Recognition" and are well known, so further explanation is omitted here. Within this scheme, one problem is how to deal with variation in the patterns during matching. This variation is mostly temporal and is affected by factors such as speaking rate. There are two countermeasures. One is nonlinear matching, typified by DP matching, which dynamically aligns the two patterns so as to maximize their similarity while examining that similarity. The other is linear matching, which makes no similarity check: the two patterns are brought to the same length by uniformly inserting or thinning out data, and are then compared. The former is accurate at the cost of a large amount of computation, while the latter has the advantage of requiring very little computation. In the latter case in particular, if all patterns are kept at a fixed length, then once the input speech pattern has been adjusted to that length, no pattern expansion or contraction is needed during matching. This method is quite effective when the speech pattern is complete, with nothing missing or added. However, speech actually expands and contracts nonlinearly, and linear expansion and contraction only approximates this, so when part of the speech pattern is missing or something is added, matching accuracy becomes very poor.
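As a rough illustration of the linear matching approach described above, the following Python sketch brings a sequence of feature frames to a fixed length by repeating or skipping frames at evenly spaced positions. The function name `normalize_length` and the index rule are assumptions made for illustration; the text only states that data is inserted or thinned out uniformly.

```python
def normalize_length(frames, target_len):
    """Linearly stretch or shrink a list of frames to exactly target_len frames.

    Hypothetical helper: output frame i takes the input frame at the
    proportionally scaled index, so frames repeat when stretching and
    are skipped when shrinking.
    """
    if not frames or target_len <= 0:
        raise ValueError("need at least one frame and a positive target length")
    n = len(frames)
    return [frames[(i * n) // target_len] for i in range(target_len)]


if __name__ == "__main__":
    print(normalize_length([1, 2, 3], 6))     # stretching repeats frames
    print(normalize_length([1, 2, 3, 4], 2))  # shrinking skips frames
```

Because every pattern ends up with the same number of frames, two patterns can be compared frame by frame with no alignment search, which is where the low computational cost comes from.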

[0005] FIG. 8 shows how speech energy changes over time. As the figure shows, when two instances of the same speech pattern "staff" are both intact and are compared after linear expansion and contraction, the error between them can be kept small, as in (a). However, as in (b), if speech section detection fails and one pattern becomes "sta" with the /f/ missing, different sounds are aligned with each other near the end of the speech even though the word is the same, and the difference between the two patterns becomes very large. Consonants uttered with low energy, such as the /f/ of "staff" used as the example here, often defeat speech section detection, so this problem occurs very frequently. Some pattern matching methods based on nonlinear expansion and contraction leave the end points free and can match accurately even when the /f/ is missing. As noted above, however, methods based on nonlinear expansion and contraction still require a large amount of computation.

[0006] One countermeasure marks the unstable parts of the standard patterns, that is, the parts prone to dropping out. When the input speech contains an unstable part, the standard patterns are matched with their unstable parts left in place; when the input speech contains no unstable part, the unstable parts are removed from all the standard patterns before matching. Because this method changes the standard patterns according to the input pattern, it has the drawback that the standard patterns must be modified at every matching operation.

[0007]

[Object] The present invention has been made in view of the above drawbacks of the prior art. In particular, its object is to enable correct matching by the linear expansion and contraction method, which requires little computation, even when speech section detection has not succeeded.

[0008]

[Configuration] To achieve the above object, the present invention provides, in a speech pattern matching method that extracts feature quantities from a speech signal, forms them into a feature pattern, and performs matching, a method characterized by: finding, at the beginning or end of the speech, a portion whose energy is lower than that of a vowel and whose spectral components are concentrated in the high-frequency range; converting the entire pattern to a predetermined length; and also converting to the predetermined length the pattern that remains after removing the portion whose energy is lower than that of a vowel and whose spectral components are concentrated in the high-frequency range, the results being used as standard patterns. The invention is described below on the basis of embodiments.

[0009] FIG. 1 is a flowchart for explaining one embodiment of the present invention, and FIG. 2 is a block diagram for implementing the flowchart shown in FIG. 1. In the figure, 10 is an expansion/contraction unit, 11 is a power calculation unit, 12 is a comparison unit, 13 is a high-band spectrum calculation unit, 14 is a comparison unit, 15 is an expansion/contraction unit, 16 is a memory, 17 and 18 are thresholds, 19 is a pattern shaping unit, and 20 is an expansion/contraction unit. Parts that operate in the same way as in the prior art shown in FIG. 6 carry the same reference numerals as in FIG. 6. The present invention focuses on the fact that consonants whose speech sections are hard to detect have relatively low energy and frequency components that tend to concentrate in the high-frequency range. In a speech pattern matching method that extracts feature quantities from a speech signal, forms them into a feature pattern, and performs matching, a portion whose energy is lower than that of a vowel and whose spectral components are concentrated in the high-frequency range is found at the beginning or end of the speech; the entire pattern is converted to a predetermined length; and the pattern that remains after removing that portion is also converted to the predetermined length, the results being used as standard patterns.

[0010] FIG. 1 shows the method of the present invention as a flowchart, and the description below follows this flowchart. Registration is described first. As usual, the entire speech is brought to a fixed length and registered as a standard pattern. It is then checked whether the beginning or end of the speech contains a specific portion, that is, a portion whose energy is low and whose spectral components are concentrated in the high-frequency range. The energy criterion is whether the energy falls below a fixed value, and this fixed value is set to about 1/5 of the energy observed when a vowel is input. Concentration of frequency can be examined in various ways. For example, the analysis frequency band can be divided into two, and the condition taken to hold when the high band contains several times the components of the low band; or a straight line can be fitted to the spectral distribution along the frequency axis, and the condition taken to hold when its slope is negative. If there is no low-energy, high-frequency portion at the beginning or end of the speech, registration of this speech is complete. If there is one, it is noted whether it lies at the beginning or at the end, that is, where in the speech an easily dropped sound such as the /f/ above is attached. Next, a pattern with this portion dropped is also created in advance. In other words, when an easily dropped sound is attached at the beginning of the speech, the part from the low-energy, high-frequency portion to the end is removed, and the remainder is brought to the fixed length and registered.
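The registration step described above can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions: `flags` is a per-frame list of booleans saying whether a frame passed the low-energy, high-frequency test (how that test is computed is left to the caller), the names `weak_edge_runs`, `stretch`, and `register` are hypothetical, and removing weak runs at both edges in a single variant is a simplification chosen here, not something the text prescribes.

```python
def weak_edge_runs(flags):
    """Return (lead, trail): lengths of the runs of True at the start and end."""
    n = len(flags)
    lead = 0
    while lead < n and flags[lead]:
        lead += 1
    trail = 0
    while trail < n - lead and flags[n - 1 - trail]:
        trail += 1
    return lead, trail


def stretch(frames, target_len):
    """Linear length normalization by repeating or skipping frames."""
    n = len(frames)
    return [frames[(i * n) // target_len] for i in range(target_len)]


def register(frames, flags, target_len):
    """Return the standard patterns to store for one utterance."""
    templates = [stretch(frames, target_len)]        # full pattern, always stored
    lead, trail = weak_edge_runs(flags)
    core = frames[lead:len(frames) - trail]
    if (lead or trail) and core:
        templates.append(stretch(core, target_len))  # variant without the weak edge
    return templates


if __name__ == "__main__":
    frames = ["s", "t", "a", "a", "f", "f"]          # frame labels stand in for features
    flags = [False, False, False, False, True, True]
    for template in register(frames, flags, 6):
        print(template)
```

Storing both variants at registration time means the dictionary never has to be modified during matching, which is the drawback of the marking approach in [0006].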

[0011] At recognition time, the input speech is brought to the predetermined fixed length and matched against all the registered standard patterns. If a consonant or similar sound at the beginning or end of the input speech has been dropped, the input can be matched against the dropped-sound pattern registered in advance, so recognition accuracy is improved.

[0012] FIG. 2 is a block diagram for implementing the present invention. Up to the point where the speech from the microphone 1 is feature-converted and turned into discrete values, it is the same as the prior art shown in FIG. 6. Registration is described first. With the switch 5 set to the registration side (side a), the speech signal is fed to the power calculation unit 11, which computes its power. Here it is checked whether there is a portion whose power is below a fixed value, if so whether its frequency components are concentrated in the high-frequency range, and further whether it lies at the beginning or at the end. The following expansion/contraction unit scales the whole pattern to a fixed length, and the result is registered in the standard pattern storage unit (memory) 16. If a portion with power below the fixed value and frequency components concentrated in the high-frequency range exists, the pattern shaping unit removes it up to the leading or trailing end as shown in the flowchart of FIG. 1; the shaped pattern is again brought to the fixed length by an expansion/contraction unit and registered in the standard pattern storage unit. Once the speech to be registered has been stored in this way, the switch 5 is set to the recognition side (side b) and recognition is performed. For recognition, the speech is converted into a feature pattern as at registration, brought to the fixed length by the expansion/contraction unit 10, and matched. This expansion/contraction unit only needs the same function as the one used at registration; it is drawn separately in the figure, but the same unit may be used. The matching unit 7 is not restricted to any particular matching method: the difference between patterns may be obtained as a city-block distance, or a similarity may be computed from the inner product of vectors. The similarity, or the error, between the unknown input pattern and each standard pattern is obtained. The maximum similarity detection unit 8 finds the standard pattern showing the greatest similarity and outputs its name, or a symbol representing it, as the recognition result. With this method, speech patterns with part of the speech missing are also registered in advance at the fixed length, so when a consonant or similar sound at the beginning or end of the input speech has been dropped, the input can be matched against such a pattern. The amount of computation is therefore smaller than with methods that expand and contract patterns during matching, and recognition accuracy is improved.
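To make the matching step concrete, here is a minimal Python sketch that uses the city-block distance, one of the two measures the text mentions. The function names and the `(label, template)` dictionary layout are assumptions for illustration; several templates (the full pattern and the variant with the weak consonant removed) may share one label.

```python
def city_block(a, b):
    """City-block (L1) distance between two equal-length patterns of frame vectors."""
    return sum(abs(x - y) for fa, fb in zip(a, b) for x, y in zip(fa, fb))


def recognize(pattern, dictionary):
    """Return the label of the template closest to the input pattern.

    dictionary: list of (label, template) pairs, all templates having the
    same fixed length as the input pattern.
    """
    if not dictionary:
        raise ValueError("dictionary is empty")
    label, _ = min(dictionary, key=lambda entry: city_block(pattern, entry[1]))
    return label


if __name__ == "__main__":
    dictionary = [
        ("staff", [[1, 0], [0, 1]]),   # full pattern
        ("staff", [[1, 0], [1, 0]]),   # variant registered without the weak ending
        ("stop",  [[0, 1], [0, 1]]),
    ]
    print(recognize([[1, 0], [1, 0]], dictionary))
```

With a distance, the smallest value wins; with an inner-product similarity, the largest value would be selected instead.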

[0013] FIG. 3 illustrates the operation of FIG. 1 as it would be carried out in hardware by a microcomputer. The description assumes that standard patterns of speech normalized to several different lengths have been loaded into the register 28 in advance. The unknown speech to be recognized is input from the microphone 1, amplified by the microphone amplifier 2, and analyzed by the band-pass filter bank 3 into a number of frequency bands (for example, 15). The result is quantized to about 12 bits by the A/D converter 4, and speech section detection 21 is performed on that data. The portion belonging to the detected speech is stored in the register 22. A method of speech section detection is given in Niimi, "Speech Recognition" (Kyoritsu Shuppan), p. 68. From speech section detection onward, the operations are performed by microcomputer software, so in many cases there is no hardware other than registers; the figure nevertheless depicts the operations as hardware. The register 23 holds the several frame lengths to which patterns are normalized. The frame length of the input speech, obtained when the speech section is detected, is sent to the comparator 24 and compared with the contents of the register 23. The one or two closest frame lengths are selected from the register 23 and sent as a frame length signal to the register 22 and the comparator 27. Based on this signal, the register 22 brings its contents to the designated frame length by copying within the register. FIG. 4 shows how the register operates during this copying.

[0014] In FIG. 4, suppose the length of the unknown speech is l₁ and that it is stored in the register 22. When it must be extended to l₁ + 2 frames (a), a simple way to decide where to insert is to divide the input frame length by the number of frames to insert plus one. In this case the division is l₁/3; let its integer part be l₁′. First, the l₁-th data is copied to position l₁ + 2, the (l₁ − 1)-th data to position l₁ + 1, and so on. The data at 2l₁′, however, is copied to both frame 2l₁′ + 1 and frame 2l₁′ + 2. After that, copying is repeated in the manner of the (l₁′ − 1)-th to the l₁′-th, and the operation ends when the l₁′-th has been copied to position l₁′ + 1. Next, the case of reducing l₁ to l₁ − 2 frames is shown (b). Contrary to the previous case, copying starts from the low-numbered end: first, the (l₁′ + 1)-th data is copied to the l₁′-th position. Copying of l₁′ + n to l₁′ + n − 1 is repeated, and on reaching 2l₁′, the data at 2l₁′ + 2 is copied. From then on, copying of 2l₁′ + n to 2l₁′ + n − 2 is repeated, and the operation is complete when all l₁ frames have been processed. The above was explained for adding or removing two frames, but the same applies to other frame lengths.
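The frame insertion and removal of FIG. 4 can be approximated in Python as below. This is a sketch under assumptions, not the in-register procedure itself: it builds a new list instead of copying in place, the names `insert_frames` and `delete_frames` are hypothetical, and the frames dropped when shrinking are chosen at the same positions used for insertion, which may differ slightly from the copy sequence described in the text.

```python
def _step(length, k):
    """Spacing l' between edit points: frame count divided by (k + 1), integer part."""
    step = length // (k + 1)
    if step == 0:
        raise ValueError("pattern too short for the requested change")
    return step


def insert_frames(frames, k):
    """Lengthen a pattern by k frames, duplicating frames l', 2l', ... (1-based)."""
    step = _step(len(frames), k)
    out = []
    for idx, frame in enumerate(frames, start=1):
        out.append(frame)
        if idx % step == 0 and idx // step <= k:
            out.append(frame)              # duplicate this frame
    return out


def delete_frames(frames, k):
    """Shorten a pattern by k frames, dropping frames l', 2l', ... (1-based)."""
    step = _step(len(frames), k)
    return [frame for idx, frame in enumerate(frames, start=1)
            if not (idx % step == 0 and idx // step <= k)]


if __name__ == "__main__":
    frames = list(range(1, 10))            # l1 = 9, so l1' = 3
    print(insert_frames(frames, 2))        # 11 frames
    print(delete_frames(frames, 2))        # 7 frames
```

Spreading the inserted or removed frames evenly over the pattern keeps the distortion from being concentrated at one point.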

[0015] Once the pattern has the designated length, it is next binarized. Many common methods, however, do not require binarization; it is done here because recognition uses the method described in the literature (Ohmsha, "Introduction to Applied Fuzzy Systems"). Binarization is performed frame by frame by the comparator 25. The sum of all the data in one frame is sent from the register 22 shifted by 3 bits, that is, divided by 8, and stored as the threshold. This threshold is then compared with each value in the frame; values greater than the threshold become 1 and all others 0, and the binarized result is saved back to the register 22. The contents of the register 28 are cleared to 0 beforehand; the pattern in the register 22 and the contents of the register 28 are added by the adder 27, and the result is returned to the register 28. This serves to register, as the standard pattern, the average of several utterances of one word; if averaging is not needed, the contents of the register 22 can simply be registered in the register 28 as the standard pattern. Here, registration from three utterances is described. First, the all-zero pattern and the pattern made from the first utterance are added and stored in the register 28. A pattern is made from the second utterance in the same way as the first, added to the contents of the register 28 (the first utterance pattern), and returned to the register 28. The third utterance is added likewise and returned to the register 28, and the result is written to the register 28, which serves as the dictionary unit. At the same time, the present invention is applied: in the manner described for FIG. 3, it is checked whether there is a portion near the leading or trailing end of the speech that has low energy and only high-frequency components. If there is, that portion is removed and the remainder is converted to the designated length. When addition is required, patterns with the portion removed are added to each other, and patterns without removal are added to each other. If there are too few patterns, further utterances may be requested, or particular patterns may be weighted in the average. After all the required words have been registered in this way, the standard patterns in the register 28 are written to the floppy disk 42 or the like so that the contents are preserved even when the power is turned off.
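The binarization and accumulation described above can be sketched in Python as follows. The function names are hypothetical, and the patterns are assumed to hold non-negative integers (such as the quantized filter-bank outputs) so that the 3-bit right shift is well defined.

```python
def binarize_frame(frame):
    """Binarize one frame against a threshold of (frame total) / 8."""
    threshold = sum(frame) >> 3            # shift right by 3 bits = divide by 8
    return [1 if value > threshold else 0 for value in frame]


def binarize(pattern):
    """Binarize every frame of a pattern independently."""
    return [binarize_frame(frame) for frame in pattern]


def accumulate(patterns):
    """Element-wise sum of equal-size binary patterns, starting from all zeros."""
    total = [[0] * len(frame) for frame in patterns[0]]
    for pattern in patterns:
        total = [[t + v for t, v in zip(tf, pf)] for tf, pf in zip(total, pattern)]
    return total


if __name__ == "__main__":
    print(binarize_frame([8, 0, 16, 40]))  # total 64, threshold 8
    utterances = [[[1, 0]], [[1, 1]], [[0, 1]]]
    print(accumulate(utterances))          # per-element counts over three utterances
```

Because the threshold is derived from each frame's own total, the binarization is insensitive to overall loudness; the accumulated counts play the role of the averaged standard pattern.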

[0016] The processing specific to the present invention is performed here. In FIG. 3, the speech is searched for a portion whose energy is low and whose spectrum is concentrated in the high-frequency range. The band-pass filter bank 3 consists of 15 filters, spaced at 1/3-octave intervals with a lowest frequency of 250 Hz and a highest of 6500 Hz. After A/D conversion, the outputs of the 1st to 11th filters, counted from the low end, are summed by Σ31 and those of the 12th to 15th by Σ32, and the two sums are compared. If the output of Σ32 is the larger, the comparator 33 outputs 1. Meanwhile, if the sum of Σ31 and Σ32, that is, the speech energy output by the adder 34, is smaller than the value set as the threshold 35, the comparator 36 also outputs 1; otherwise it outputs 0. When the product of the signals from the comparators 33 and 36 is 1, the unknown speech pattern temporarily held in the register 37 is transferred to the register 22, and from then on the same operation as in FIG. 7 is performed to obtain the recognition result.
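The per-frame test built from the two band sums and the two comparators can be written as a short Python sketch. The constant and function names are hypothetical, a frame is assumed to be a list of 15 filter-bank outputs ordered from low to high frequency, and the value of the energy threshold is left to the caller.

```python
LOW_CHANNELS = 11   # channels 1-11 form the low band, 12-15 the high band


def is_weak_high_frame(frame, energy_threshold):
    """True when a frame is both high-band dominant and low in total energy."""
    low = sum(frame[:LOW_CHANNELS])                # low-band sum
    high = sum(frame[LOW_CHANNELS:])               # high-band sum
    high_dominant = high > low                     # role of comparator 33
    low_energy = (low + high) < energy_threshold   # role of comparator 36
    return high_dominant and low_energy            # product of the two outputs


if __name__ == "__main__":
    fricative_like = [1] * 11 + [5] * 4            # weak, energy mostly in the top bands
    vowel_like = [5] * 11 + [1] * 4                # energy mostly in the low bands
    print(is_weak_high_frame(fricative_like, 50))
    print(is_weak_high_frame(vowel_like, 50))
```

Both conditions are needed together: a loud high-frequency frame or a quiet low-frequency frame is not treated as an easily dropped consonant.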

[0017] The comparator 24 and the register 22 perform the same processing as for the previously matched pattern. After matching, the similarity calculated this time is written after the similarities already stored in the register 43. Finally, as in the earlier example, the word that obtained the greatest similarity is found and its name is output. In this case, however, the maximum is taken over all the similarity values written in the register 43.

[0018] FIG. 5 shows the flow of speech recognition, and FIG. 6 illustrates the flow of FIG. 5 as it would be carried out in hardware by a computer. Only the parts that differ from FIG. 3 are described below. Everything up to the creation of the binarized pattern in the register 22 is the same. Ordinary matching proceeds as follows. The comparator 40 compares the frame length of each word sent from the dictionary unit in the register 41 with the frame length of the input speech. Only when the two are equal is the dictionary pattern loaded into the matching unit and matched against the pattern in the register 22; the similarity is calculated and written to the register 43. The register 43 is assumed to have been cleared to 0 beforehand, so that entries not matched because of a different frame length have a similarity of 0. This is repeated until the end signal marking the end of the patterns registered in the dictionary unit appears. When that is done, the first similarity value in the register 43 is moved to the register 45, and the second and subsequent similarity values in the register 43 are compared with the value in the register 45 by the comparator 44; whenever a value larger than that in the register 45 is found, it is written to the register 45. The newly written value is then compared in turn with the remaining similarity values in the register 43, and this is repeated. As with FIG. 3, all of these operations are controlled by a program and run on a microcomputer.
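The recognition flow above (score only the templates whose frame length equals that of the input, leave the others at 0, then scan for the maximum) can be sketched in Python as follows. The names are hypothetical, and the inner product is used as the similarity only because [0012] mentions it as one option; the text does not fix the measure.

```python
def inner_product(a, b):
    """Similarity of two equal-size patterns as the sum of element-wise products."""
    return sum(x * y for fa, fb in zip(a, b) for x, y in zip(fa, fb))


def match_same_length(pattern, dictionary):
    """Return (label, score) of the best template among those of equal frame length.

    dictionary: non-empty list of (label, template) pairs.
    """
    if not dictionary:
        raise ValueError("dictionary is empty")
    scores = [0] * len(dictionary)                 # score table, cleared to zero
    for i, (_, template) in enumerate(dictionary):
        if len(template) == len(pattern):          # frame-length gate
            scores[i] = inner_product(pattern, template)
    best = 0                                       # start from the first score
    for i in range(1, len(scores)):                # sequential maximum scan
        if scores[i] > scores[best]:
            best = i
    return dictionary[best][0], scores[best]


if __name__ == "__main__":
    dictionary = [
        ("a", [[1, 0], [1, 1]]),
        ("b", [[1, 1]]),                           # different length, stays at 0
        ("c", [[0, 1], [1, 1]]),
    ]
    print(match_same_length([[0, 1], [1, 1]], dictionary))
```

Skipping templates of a different frame length avoids any rescaling at match time, at the cost of requiring the dictionary to hold patterns at each permitted length.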

[0019]

[Effect] As is clear from the above description, according to the present invention, correct matching can be performed without expanding or contracting patterns during matching, even when speech section detection has not succeeded.

[Brief description of the drawings]

FIG. 1 is a flowchart for explaining one embodiment of speech registration according to the present invention.

FIG. 2 is a block diagram for implementing the speech registration shown in FIG. 1.

FIG. 3 is a diagram for implementing FIG. 1 in hardware with a microcomputer.

FIG. 4 is a diagram for explaining the expansion and contraction of frames.

FIG. 5 is a diagram showing a flowchart of speech recognition.

FIG. 6 is a diagram for implementing FIG. 5 in hardware with a microcomputer.

FIG. 7 is an explanatory diagram of general pattern matching.

FIG. 8 is a diagram for explaining the alignment obtained when a weak consonant is detected and the alignment obtained when it is not detected.

[Explanation of symbols]

1: microphone, 2: microphone amplifier, 3: feature conversion unit, 4: A/D conversion unit, 5: switch, 6: standard pattern storage unit, 7: matching unit, 8: maximum similarity detection unit, 9: recognition result output unit, 10: expansion/contraction unit, 11: power calculation unit, 12: comparison unit, 13: high-band spectrum calculation unit, 14: comparison unit, 15: expansion/contraction unit, 16: memory, 17, 18: thresholds, 19: pattern shaping unit, 20: expansion/contraction unit, 21: speech section detector, 22, 23, 28, 30, 32, 39, 41: registers, 24, 25, 27, 31, 35, 38: comparators, 26, 37: thresholds, 29: matching unit, 36, 40: adders, 42: floppy disk.

Continuation of front page: (58) Fields searched (Int. Cl.⁷, DB name): G10L 3/00 521; G10L 3/00 531; G10L 7/08

Claims

(57) [Claims]

1. A standard pattern registration method for use in a speech pattern matching method that extracts feature quantities from a speech signal to generate a feature pattern and matches the feature pattern against standard patterns, the registration method comprising: finding, at the beginning or end of the speech, a portion whose speech energy is lower than that of a vowel and whose spectral components are concentrated in the high-frequency range; converting the entire pattern to a predetermined length; and converting to the predetermined length the pattern that remains after removing the portion whose energy is lower than that of a vowel and whose spectral components are concentrated in the high-frequency range, the results being used as standard patterns.