JPS6328320B2

Info

Publication number: JPS6328320B2
Application number: JP55023795A
Authority: JP
Inventors: Masaru Nishimura
Original assignee: Sanyo Electric Co Ltd
Current assignee: Sanyo Electric Co Ltd
Priority date: 1980-02-26
Filing date: 1980-02-26
Publication date: 1988-06-08
Also published as: JPS56119198A

Description

[Detailed description of the invention]

The present invention relates to a word speech recognition device based on the pattern matching method, and provides a new matching scheme for speech signals.

A word speech recognition system based on the pattern matching principle usually has, as shown in FIG. 1, a speech input section 1, a feature extraction section 2, a recognition processing section 3, a registered pattern memory 4, and an input pattern memory 5 as its main components, and has two operating modes: a registration mode and a recognition mode. In the registration mode, the word utterances to be recognized are registered in advance: the speech features extracted by the feature extraction section 2 from the registration speech signal, which is the output of the speech input section 1 including a microphone, are filed as a time-series pattern in the registered pattern memory (also called the standard pattern memory) 4.
In the recognition mode, the speech feature pattern extracted in the same way from the input speech signal is stored in the input pattern memory 5, after which the recognition processing section 3 calculates the similarity between this input pattern and each registered pattern stored in the registered pattern memory 4. The registered pattern with the highest similarity is determined to match the input speech, and an appropriate output is produced accordingly. The recognition processing section 3, registered pattern memory 4, and input pattern memory 5 of FIG. 1 having these functions are implemented by a computer system built around a central processing unit (CPU).

It is well known that physical quantities such as the frequency spectrum, correlation functions, the number of zero crossings, and α parameters are used to extract the phonological features of speech from the speech waveform. Among these, the method of extracting the frequency spectrum of the speech with a number of bandpass filters is coming into wide use because it achieves a high recognition rate with a relatively simple configuration.

FIG. 2 shows a specific example of a speech recognition device that analyzes the frequency spectrum with filters. The speech input section 1 consists of a microphone 11, a microphone amplifier 12, and an AGC circuit 13 that keeps the level of the input speech signal substantially constant regardless of the loudness of the input speech. M bandpass filters (hereinafter BPF) 21-1, 21-2, ..., 21-M connected to the output of the input section 1, together with low-pass filters (hereinafter LPF) 22-1, 22-2, ..., 22-M cascaded with the respective BPFs to detect each output envelope, constitute the feature extraction section 2, which performs frequency analysis of the speech-band signal. The filter components of the speech signal that has passed through the speech input section 1 are sampled in turn by the multiplexer 23 at an appropriate period (10 to 20 msec in most cases). That is, the speech spectrum signals obtained in parallel at the outputs of LPFs 22-1, 22-2, ..., 22-M become a serial signal train, which is then converted sample by sample into digital codes by the analog-to-digital converter 24 (hereinafter A-D converter) and temporarily taken into the buffer memory 33 via the I/O port 32 controlled by the CPU 31. For example, with M = 8 filters, a maximum speech input time of 1.6 seconds, a sampling period of 10 msec, and an 8-bit A-D converter 24, the maximum amount of data taken in is 1.6/0.01 × 8 × 8 = 10240 bits = 1.28 KB (B: byte).

Even for the same speech uttered by the same speaker, a speech signal normally varies in both time axis and signal amplitude with each utterance, so some form of normalization is needed for each. As mentioned above, the AGC circuit 13 is often used for amplitude normalization. For the time axis, the usual method is to divide the time from the start to the end of the word utterance into equal parts, as shown in FIG. 3. The start and end of the speech signal are detected by the speech detection circuit 25 from data such as the level, frequency distribution, and number of zero crossings of the input signal. In FIG. 3, let the sampling point number at the start of the input speech signal be 1 and that at the end be l. The integer closest to l/N (N an integer) is found and denoted n, and N data items are taken from the input sampling data at intervals of n, including the start, and rearranged (FIG. 3b); this normalizes the time axis. If, for example, N = 32, then N × 8 × 8 = 2048 bits = 256 B of data is stored in the registered pattern memory 40 in the registration mode and in the input pattern memory 50 in the recognition mode. These memories are usually RAM, and the address of the registered pattern memory 40 is specified by the ROM 34, which stores the CPU program, and the input control section 35. The number of registered patterns is determined by the specifications of the speech recognition system, namely the number of registered speakers and the number of words each can register.

Recognition processing in the recognition mode is performed by pattern matching the contents of the input pattern memory 50, which stores the N sample points of data obtained in the same way from the data entered into the buffer memory 33, against the contents of the registered patterns.
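The time-axis normalization and the two data-size figures above can be checked with a short sketch. This is only an illustration under stated assumptions: the function names are hypothetical, ties in "closest integer" are rounded up, and indices that would run past the end of the utterance are clamped to the last frame, a case the description does not address.

```python
def pattern_bits(n_frames, n_filters=8, bits_per_sample=8):
    """Storage needed for n_frames frames of n_filters samples each, in bits."""
    return n_frames * n_filters * bits_per_sample


def normalize_time_axis(frames, n_out=32):
    """Pick n_out frames at a fixed stride n, starting with the first frame.

    frames: one entry per sampling point between the detected start and end.
    The stride n is the integer closest to len(frames) / n_out.
    """
    length = len(frames)
    if length == 0:
        raise ValueError("no frames between start and end")
    # Integer closest to length / n_out (ties rounded up: an assumption).
    stride = max(1, (2 * length + n_out) // (2 * n_out))
    # Clamping to the last frame is an assumption for the case where
    # rounding the stride up would step past the end of the utterance.
    return [frames[min(k * stride, length - 1)] for k in range(n_out)]


# 1.6 s at a 10 msec period gives 160 frames of 8 filters x 8 bits.
buffer_bits = pattern_bits(160)      # 10240 bits = 1280 bytes
normalized_bits = pattern_bits(32)   # 2048 bits = 256 bytes
```

With 160 frames and N = 32 the stride is 5, so frames 1, 6, 11, ... (1-based) are kept.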
Various methods have been proposed for calculating the distance between the input pattern and a registered pattern; here, for convenience, the explanation uses the simplest one, the Chebyshev distance. The Chebyshev distance D between the time series of the eight filters of the registered pattern of a given word utterance, $[f_{ij}^{(R)}]$ (i: filter number 1 to 8, j: sample point 1 to N), and the corresponding filter time series $[f_{ij}]$ of the input speech pattern is defined by the following equation.

$$D = \sum_{j=1}^{N} \sum_{i=1}^{8} \left| f_{ij} - f_{ij}^{(R)} \right| \qquad (1)$$

That is, D is the sum of the absolute values of the differences between corresponding data of the input pattern $f_{ij}$ and the registered pattern $f_{ij}^{(R)}$.
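Equation (1) can be written out directly. The sketch below assumes each pattern is a list of N frames, each frame a list of filter outputs; the function names are hypothetical. The quantity computed is the plain sum of absolute differences, which is commonly also called the city-block distance.

```python
def pattern_distance(input_pattern, registered_pattern):
    """Distance D of equation (1): sum over frames j and filters i of |f_ij - f_ij(R)|."""
    if len(input_pattern) != len(registered_pattern):
        raise ValueError("patterns must have the same number of frames")
    return sum(
        abs(f - f_ref)
        for frame, frame_ref in zip(input_pattern, registered_pattern)
        for f, f_ref in zip(frame, frame_ref)
    )


def best_match(input_pattern, registered_patterns):
    """Name of the registered pattern with the smallest distance D.

    registered_patterns: dict mapping a word label to its pattern.
    """
    return min(
        registered_patterns,
        key=lambda name: pattern_distance(input_pattern, registered_patterns[name]),
    )
```

Because the time axis has already been normalized to N frames, the two patterns can be compared frame by frame with no alignment step.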
The registered pattern that yields the smallest of the Chebyshev distances obtained for the registered patterns is regarded as matching the input pattern. For convenience of explanation, the memory area for temporarily storing these calculation results is shown separately as the recognition processing memory 36.

In the conventional speech recognition system based on the pattern matching principle described above, the similarity is calculated as the sum of the distances between the input pattern and the registered pattern at corresponding time points. Although the circuit configuration is simple, the calculation is subject to considerable error, and sufficient recognition performance cannot always be obtained.

In addition to this recognition processing, the present invention captures the shape of the waveform as peak positions and a peak count, and refers to these as auxiliary data when calculating the similarity, thereby achieving more accurate recognition.

FIG. 4 is a block diagram showing the configuration of the device of the present invention. It differs from the conventional device of FIG. 2 in that a bypass path through an LPF 26, which has an appropriate cutoff frequency and detects the envelope of the signal, is provided between the input section 1 and the multiplexer 23, and in that a peak detection circuit 27 for detecting local maxima of the speech is inserted between the A-D converter 24 and the I/O port 32. Since most of the components in FIG. 4 are the same as those in FIG. 2, a detailed description of them is omitted. The peak detection circuit 27 detects the peaks of the input speech signal waveform and sends its detection signal to the CPU 31 via the I/O port 32.
From this signal the CPU 31 stores the sampling point number of each peak position in the buffer register 33 together with the filter output sequences. In the embodiment of the present invention, the storage capacity of the buffer register is therefore increased by an appropriate amount over the 1.28 KB calculated earlier for the conventional device of FIG. 2. When all the sampling data and the peak positions (sampling point numbers) have been stored in the buffer register 33, the CPU 31 normalizes the time axis by extracting from the sampling data the N data items that divide the span from the start to the end of the speech signal into N equal parts. At the same time, it obtains normalized peak positions by dividing the sampling point number of each peak by the sampling point number of the end point, and stores these positions and their count, together with the N data items, in the corresponding part of the input pattern memory 50 or the registered pattern memory 40.

FIG. 5 shows a specific example of the circuit 27 that detects the peak positions of the input speech signal. The signal envelope data detected by the LPF 26 passes through the multiplexer 23 and the A-D converter 24 and is input to and held in the latch circuit 61 as a digital code. In the figure the output of the A-D converter 24 is 8-bit parallel. The latch circuit 61 latches the output of the A-D converter 24 in synchronization with a suitably divided version of the timing pulse with which the multiplexer 23 samples the output of the LPF 26, and then, after an appropriate time difference, transfers its contents to the cascaded latch 62 of the same storage capacity.

Analog multiplexers commonly respond to a clock pulse and switch their inputs to the output terminal in turn according to a binary code, supplied together with the clock pulse, that selects one of the input terminals. The present invention uses a multiplexer of this type. The OR gate 66 combines the sampling clock pulse 63 of the analog multiplexer 23 (the same as the convert command pulse of the A-D converter 24), supplied from the CPU 31 via the I/O port 32, with the output of a coincidence circuit 65, which detects the code designating the LPF 26 among the input designation codes 64 of the analog multiplexer 23 likewise supplied from the CPU 31 via the I/O port 32. The output of the gate 66 is divided by K (K is a fixed, suitably chosen integer of 1 or more) by the frequency divider circuit 67. In response to the output of the divider circuit 67, the first latch 61 stores and holds the digital code of the LPF 26 output that is present at the output of the A-D converter 24 at that moment. In addition, in response to the output of the AND gate 69 (described later), which passes the output of a circuit 68 that delays the output of the K divider circuit by an appropriate time $T_D$, the second latch 62 likewise stores and holds the contents of the first latch 61.

Let the period of the clock pulse be $T_C$. If sampling is performed at equal time intervals and the number of band-division filters is M + 1, the sampling period $T_S$ is $(M+1)T_C$, so the output period of the K divider circuit 67 is $K T_S = K(M+1)T_C$. The delay time $T_D$ of the delay circuit 68 therefore naturally satisfies $0 < T_D < K(M+1)T_C$. As stated above, the sampling period $T_S$ is chosen in practice to be 10 to 20 msec. The detection circuit 26 for the amplitude envelope of the waveform can be replaced by a relatively low-frequency one of the band-division filters 21-1, 21-2, ..., 21-M and its cascaded LPF 22-1, 22-2, ..., 22-M; in that case circuit 26 is omitted and (M + 1) in the above explanation becomes M.

With this configuration, when the first latch 61 latches the data of the J-th sampling point (J a multiple of K), the second latch 62 holds the (J - K)-th sampling data. The 8-bit data of latch 62 is converted to two's complement representation by the complement circuit 70, after which its upper L bits (L an integer, 1 ≤ L ≤ 8) are added to the upper L bits of the first latch 61 by the adder circuit 71. In other words, the complement circuit 70 and the adder circuit 71 take the difference between the upper L bits of the contents of the first latch 61 and the second latch 62, and the sign of the result appears at the most significant bit (MSB) 72 of the adder circuit 71.
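The difference stage just described (complement circuit 70 followed by adder 71 on the upper L bits) can be modelled in a few lines. This is a sketch, not the circuit itself: it assumes the sum is formed in an (L + 1)-bit two's complement word so that the top bit carries the sign, which the description does not spell out, and the function name is hypothetical.

```python
def difference_sign_bit(current, previous, l_bits=4, width=8):
    """Model of circuits 70 and 71: subtract the upper l_bits of two samples.

    current:  sample held in the first latch (61), 0 .. 2**width - 1
    previous: sample held in the second latch (62)
    Returns (sum_word, msb); msb is 1 when current < previous on the upper bits.
    """
    shift = width - l_bits
    a = current >> shift              # upper L bits of latch 61
    b = previous >> shift             # upper L bits of latch 62
    mask = (1 << (l_bits + 1)) - 1    # assumed (L + 1)-bit adder word
    neg_b = (~b + 1) & mask           # two's complement of b
    total = (a + neg_b) & mask        # a - b, modulo 2**(L + 1)
    msb = total >> l_bits             # sign bit: 0 = rising or flat, 1 = falling
    return total, msb


rising = difference_sign_bit(0xA0, 0x80)    # (2, 0)
falling = difference_sign_bit(0x80, 0xA0)   # (30, 1)
ripple = difference_sign_bit(0x81, 0x8F)    # (0, 0): change confined to the lower bits
```

The last call shows the effect of dropping the lower 8 - L bits: a change confined to those bits produces a zero difference.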
When the MSB 72 is 0, the result of the subtraction is positive or zero, indicating that the sample sequence is increasing or unchanged; when the MSB 72 is 1, the result is negative, indicating that the sample sequence is decreasing. The content of the MSB 72 is transferred to and stored in the 1-bit memory 74 in synchronization with the latch signal 73 of the second latch 62, and the exclusive OR of this stored bit and the MSB 72 is computed by the NOR gate 75. With this configuration, the gate 75 outputs logic "1" when the sign of the difference between the sampling data successively entering the first and second latch circuits 61 and 62 changes. If the MSB 72 of the adder circuit 71 is logic "1" at that moment, the difference has changed from positive to negative, that is, a local maximum has occurred; this is indicated by the output of the output AND gate 76, which combines the two signals. If the output of the adder circuit 71 is 0 (zero), the OR gate 77, acting as a coincidence circuit, detects this and, via the inverter 78 and the AND gate 69, blocks the output of the latch pulse circuit 73 to the latches 62 and 74, stopping the data transfer to each. This prevents a temporarily flat portion of the waveform from being mistaken for an extremum.

FIG. 6 shows the signal waveforms at various points in FIG. 5. In FIG. 6, A is the sampling clock pulse 63, B is the output of the OR gate 66, C is the output of the frequency divider circuit 67, D is the subtraction timing of the complement circuit 70 and the adder circuit 71, E is the delayed output of the delay circuit 68, and F is the output of the output AND gate 76.

In the above configuration, sampling only every K-th point by means of the K divider circuit 67, and dropping the lower (8 - L) bits in the difference calculation, both serve to avoid detecting minute peaks of the waveform so that they are ignored. Together with the effect of the LPF 26, whose cutoff frequency is chosen in the range of 50 to 100 Hz, this is effective for capturing the rough shape of the waveform. Needless to say, the peak position detection circuit is not limited to this configuration; it can also be realized, for example, by a suitably programmed CPU system.

The peak positions and peak count detected in this way are stored as speech data in the input pattern memory 50 or the registered pattern memory 40. They are used in the similarity decision of the recognition processing as follows.

In one method (method A), the distance calculation is first performed on the sampling data as in the conventional approach. From the registered patterns with high similarity, several are selected in order of similarity, and among them the pattern with the same number of peaks is chosen. If this does not single out a pattern, the decision is made by the sum of the absolute values of the differences between corresponding peak intervals.

Conversely, in the other method (method B), the registered patterns are first narrowed down to some extent by comparing the peak count and peak intervals, and the similarity decision by distance calculation is then performed on these as in the conventional approach.

The relative merits of the two methods cannot be stated categorically, but in the experiments method A gave a larger improvement in recognition rate than method B, as shown in the table below. On the other hand, the total calculation time is shorter for method B, so the choice between them is left to overall judgment in system design. The experimental conditions for the table were:

(1) five adult male speakers, four trials per word utterance;
(2) 32 registered words;
(3) the same speech was input from a tape recorder for both methods A and B.
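The decision procedure of method A can be sketched as follows. Several details are assumptions made only for illustration: the shortlist size of three, the fallback to the smallest distance when no shortlisted pattern has the same peak count, and all function and parameter names. The description fixes none of these.

```python
def peak_intervals(peaks):
    """Intervals between consecutive (normalized) peak positions."""
    return [later - earlier for earlier, later in zip(peaks, peaks[1:])]


def method_a(distances, input_peaks, registered_peaks, shortlist=3):
    """Two-stage decision: distance ranking first, then peak information.

    distances:        dict label -> distance D to the input pattern
    input_peaks:      normalized peak positions of the input utterance
    registered_peaks: dict label -> normalized peak positions
    """
    # Stage 1: keep the few registered patterns with the smallest distance.
    ranked = sorted(distances, key=distances.get)[:shortlist]
    # Stage 2: prefer candidates with the same number of peaks as the input.
    same_count = [
        name for name in ranked
        if len(registered_peaks[name]) == len(input_peaks)
    ]
    if not same_count:
        return ranked[0]   # assumed fallback, not stated in the description
    if len(same_count) == 1:
        return same_count[0]
    # Tie-break: smallest sum of absolute peak-interval differences.
    input_iv = peak_intervals(input_peaks)

    def interval_score(name):
        ref_iv = peak_intervals(registered_peaks[name])
        return sum(abs(x - y) for x, y in zip(input_iv, ref_iv))

    return min(same_count, key=interval_score)
```

Method B would apply the same two tests in the opposite order, filtering on peak count and intervals before any distance is computed, which is consistent with the shorter total calculation time reported for it.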

[Table]

As explained above, the present invention detects the peak positions of the speech waveform and their number, and uses this peak information, together with the similarity calculation on the sampling data, as decision data for pattern recognition. It can therefore provide a highly practical scheme that improves the recognition performance of the whole system.
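Since the description notes that the peak detector could equally be realized by a suitably programmed CPU, the peak detection summarized above can be sketched in software. This is a behavioural model under assumptions, not the circuit of FIG. 5: peaks are reported as the 1-based number of the sample held before the fall began, a fall counts as a peak only after a rise has been seen, and the names are hypothetical.

```python
def detect_peaks(envelope, k=1, l_bits=4, width=8):
    """Return 1-based sampling point numbers of local maxima of the envelope.

    Every k-th sample is examined and only the upper l_bits are compared,
    so small ripples are ignored. Flat stretches leave the held sample and
    the remembered slope untouched, as in the latch-inhibit logic.
    """
    shift = width - l_bits
    peaks = []
    held_value = None   # plays the role of the second latch
    held_index = None
    last_sign = 0       # remembered slope: +1 rising, -1 falling, 0 unknown
    for j in range(0, len(envelope), k):
        value = envelope[j] >> shift
        if held_value is None:
            held_value, held_index = value, j
            continue
        diff = value - held_value
        if diff == 0:
            continue    # flat: do not update the held sample or the slope
        sign = 1 if diff > 0 else -1
        if sign < 0 and last_sign > 0:
            peaks.append(held_index + 1)   # rising-to-falling: local maximum
        held_value, held_index, last_sign = value, j, sign
    return peaks


def normalize_peaks(peaks, end_number):
    """Divide each peak's sampling point number by that of the end point."""
    return [p / end_number for p in peaks]
```

The resulting normalized positions and their count are the peak information that is stored alongside the N normalized frames.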

[Brief description of the drawings]

FIG. 1 is a block diagram outlining a word speech recognition device based on the pattern matching principle; FIG. 2 is a block diagram showing its internal configuration; FIGS. 3a and 3b are waveform diagrams of a speech signal; FIG. 4 is a block diagram showing the configuration of the device of the present invention; FIG. 5 is a block diagram showing the configuration of its main part; and FIG. 6 is a timing chart for explaining the operation. In the figures, 1 denotes the input section, 2 the feature extraction section, 3 the recognition processing section, 4 the registered pattern memory, 5 the input pattern memory, and 27 the peak detection circuit.

Claims

[Claims]

1. A word speech recognition device using a pattern matching method, comprising:
speech input means for converting speech into an electrical signal;
feature extraction means for extracting features of the input speech waveform;
sampling means;
conversion means for converting the sampled speech features into digital codes;
start/end detection means for detecting the start and end of the speech signal;
amplitude detection means for detecting the amplitude of the speech signal;
difference detection means for detecting the difference between values of the amplitude detection means output sampled by the sampling means;
change detection means for detecting a change in the sign of the difference;
peak detection means responsive to the sign change detection means;
means for counting the number of peaks and calculating peak intervals from the output of the peak detection means;
registered pattern storage means for storing the features and peak information of speech input in advance for registration;
input pattern storage means for storing the features and peak information of input speech each time speech is input; and
recognition processing means for performing pattern recognition by calculating the degree of similarity between the contents of the registered pattern storage means and the contents of the input pattern storage means and by comparing the peak information of the two.