JPS6228480B2

Info

Publication number: JPS6228480B2
Application number: JP54154289A
Authority: JP
Inventors: Akinobu Masuko; Akito Tanabe
Original assignee: Tokyo Shibaura Electric Co Ltd
Current assignee: Toshiba Corp
Priority date: 1979-11-30
Filing date: 1979-11-30
Publication date: 1987-06-20
Also published as: JPS5677898A

Description

[Detailed description of the invention]

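The present invention relates to a speech recognition device based on the so-called pattern matching method, which treats a command given by an input speech signal, that is, the time series of physical quantities extracted from the speaker's speech wave, as a feature pattern and compares it with previously registered patterns to recognize the command. More particularly, it relates to a speech recognition device in which the means for sampling the speech signal and the means for judging the similarity between patterns are provided with means for preventing misrecognition of speech.

Speech recognition methods are broadly divided into two types: methods that directly calculate the similarity between the feature (input) pattern obtained by extracting some feature from the speech signal and registered patterns stored in advance, and methods that extract features from the speech signal, convert them into a phoneme sequence, and compare that sequence with a previously registered word dictionary (patterns) to calculate the similarity. Of the two, the latter identifies speech in phoneme units and is therefore advantageous when the vocabulary is large. When the vocabulary is not very large, however, the former, pattern-matching recognition, generally gives a higher recognition rate.

In consumer equipment, an example of a pattern-matching speech recognition system with a vocabulary on the order of a few dozen words is voice control of a television receiver.
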
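That is, for controls such as power on/off, volume control, and channel switching of the television receiver, the spoken words representing each control content are registered in the speech recognition device in advance, and a response device stores speech to be used as the recognition response. When a voice command is collated with the registered control contents and a match is found, the device replies by voice that the control content has been recognized and carries out the corresponding control. For example, in channel switching, to select channel 1 the utterance "1 channel" is stored in advance as a registered pattern; when the user then speaks the command "1 channel" toward the microphone that receives voice commands, the device answers "OK" by voice and channel 1 is selected.

The problem here is that the command "8 channel", which sounds similar to "1 channel", is also registered as a control pattern (registered pattern). The sounds "ichi" (one) and "hachi" (eight) resemble each other, and the question is how to prevent one from being mistaken for the other.
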
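This arises because, in both "ichi" and "hachi", the speech energy of the "chi" portion is large, which makes it difficult to distinguish "i" from "ha". In general, when a word contains an accented sound, the speech energy concentrates in that portion and the other portions become difficult to recognize. In speech recognition, therefore, the feature (input) pattern and the registered pattern must be compared without losing the information in the portions of the voice command other than the strong sound.

In addition, the speaking rate when a control content is registered by voice as a registered pattern does not necessarily match the speaking rate when the same words are issued as a voice command. This means that even if a word is uttered again in the same way after it has been registered, the word length differs. Consequently, unless the time axis is also taken into account when evaluating the similarity between the input pattern and the registered pattern, misrecognition results.

The object of the present invention is to solve these problems and to provide a speech recognition device in which control means sets the speech sampling period to a period longer than the pitch interval of ordinary speech, thereby preventing the misrecognition caused by accented sounds within a word and by variation in utterance time (word length).

A representative embodiment of the present invention is described below with reference to the drawings.

In speech recognition by the so-called pattern matching method, which recognizes speech by judging the similarity between an input pattern representing the features of the input speech and registered patterns in which predetermined words have been registered, the misrecognition rate depends on how the features of the input pattern are extracted. In the present invention, therefore, the extraction of the input pattern's features includes amplitude normalization and time-axis normalization of the input speech so as to prevent misrecognition, and the calculation of the similarity between the two patterns is also simplified.
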
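FIG. 1 is a block diagram of a speech recognition device according to the present invention. The device comprises a speech input unit 1, which converts the sound-pressure vibration of an utterance into an electrical signal with a microphone and also flattens the frequency distribution of the speech; a feature extraction unit 2, which extracts features from the speech signal, now an electrical signal, supplied by the speech input unit 1; and a recognition processing unit 3, which stores the features extracted by the feature extraction unit 2, performs the arithmetic that compares them with the input pattern, and identifies the spoken control command. A voice response unit 4, which replies by voice that the control command has been recognized, is added where necessary. The voice response unit comprises a memory unit 401 storing the words of the response as patterns, a second I/O (input/output) port 402, a control unit 403, a D/A converter 404, and a low-pass filter 405, and replies by voice through the audio circuit of the controlled device, such as a television receiver 406, that the speaker's voice command has been recognized.

In the speech input unit 1, the input speech enters the system in one of two ways: it is converted into an FM wave by a wireless microphone 11, received by an FM receiver 12, and fed to a preamplifier 13, or it is picked up by a microphone 14 placed ahead of the preamplifier. In either case the S/N ratio, that is, the ratio of the speech signal power needed for recognition to the other acoustic signals, depends mainly on the directivity of the microphone, so unidirectional microphones are used for microphones 11 and 14. The speech signal obtained at the preamplifier 13 has its high-frequency range emphasized by a pre-emphasis circuit 15 to improve monosyllable intelligibility.
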
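The output of the speech input unit 1 obtained in this way is supplied to the feature extraction unit 2, where the feature data needed to form the input pattern are extracted. That is, the speaker's speech wave is frequency-analyzed as a time series, the resulting data are sampled at fixed time intervals, and the sampled analog data are converted into digital values by an A/D converter. Specifically, switched-capacitor band-pass filters (hereinafter BPFs) 16(1) to 16(15) are connected to the input of the feature extraction unit 2. The center frequency of each of BPFs 16(1) to 16(15) is determined by the applied clock, and each filter has a sixth-order Chebyshev characteristic with an attenuation slope of approximately -36 dB/octave. BPFs 16(1) to 16(15) divide the band from approximately 200 Hz to 6.4 kHz into 15 bands at 1/3-octave intervals. Sample-and-hold circuits 17(1) to 17(15), which sample and hold the signal at intervals of approximately 20 ms, are connected to the respective BPFs 16(1) to 16(15) that pass the speech signal components of these 15 bands, and this sample-and-hold action extracts the features of the incoming speech.
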
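
The band layout described above can be checked with a little arithmetic. The sketch below assumes that the 15 bands tile the range exactly from 200 Hz to 6.4 kHz (five octaves) and that each center frequency is the geometric mean of its band edges; the text gives only approximate figures, and the function name `third_octave_bands` is introduced here for illustration.

```python
def third_octave_bands(f_low=200.0, n_bands=15):
    """Return (lower edge, center, upper edge) in Hz for each 1/3-octave band.

    Assumption: bands are contiguous and centers are geometric means of the edges.
    """
    step = 2.0 ** (1.0 / 3.0)  # frequency ratio of one 1/3-octave step
    bands = []
    for k in range(n_bands):
        lower = f_low * step ** k
        upper = lower * step
        center = (lower * upper) ** 0.5
        bands.append((lower, center, upper))
    return bands


if __name__ == "__main__":
    for index, (lower, center, upper) in enumerate(third_octave_bands(), start=1):
        print(f"band {index:2d}: {lower:7.1f} Hz to {upper:7.1f} Hz, center {center:7.1f} Hz")
```

Fifteen 1/3-octave steps span exactly five octaves, so under these assumptions the top edge lands at 200 × 2^5 = 6400 Hz, consistent with the stated range.
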
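The features extracted by sample-and-hold circuits 17(1) to 17(15) in this way are analog quantities, and they are converted into digital values by an A/D (analog-to-digital) converter 18 of, for example, 8 bits. Switching between the sample-and-hold circuits 17(1) to 17(15) and the A/D converter 18 is controlled by a multiplexer 19. The A/D converter 18 thus yields a digitized version of the time-frequency-level characteristic, shown in FIG. 2, extracted from the speech signal. The speech feature data from the A/D converter 18 are supplied to the recognition processing unit 3 through a first I/O (input/output) port 20.

The recognition processing unit 3 comprises a registered pattern memory 21, which stores and registers the speech features extracted from the command utterance when control contents such as the channel to be received or power on/off are specified by voice; an input pattern memory 22, which temporarily stores as an input pattern the features of the utterance when the speaker speaks the desired control content; a system program memory 23, which stores the program that judges which of the registered patterns in the registered pattern memory 21 the contents of the input pattern memory resemble; and a CPU (central processing unit) 24, which executes the system program. The CPU 24 is, for example, an 8-bit microprocessor; the system program memory 23 is a ROM with a capacity of 2 Kbytes; and the input pattern memory 22 and registered pattern memory 21 are implemented in a RAM with a capacity of 10 Kbytes. Of this 10-Kbyte RAM, 1.75 Kbytes are used as the input pattern memory 22 and approximately 7.5 Kbytes as the registered pattern memory 21.
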
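The data extracted by the feature extraction unit 2 are sent to the recognition processing unit 3 configured in this way either as input pattern data or as registered pattern data. The case of registered pattern data is described first.

Registered pattern data are sent to the registered pattern memory 21 of the recognition processing unit 3 when, as described above, the speaker registers the desired control contents in the speech recognition device by uttering them. Consider storing the selection of channel 1 in the registered pattern memory 21 as a control content. The features of the utterance "1 channel" are extracted as digital data by the A/D converter 18. These data are sent through the first I/O port 20 toward the registered pattern memory 21, but first they are stored temporarily in the input pattern memory 22 in the form of an m × n matrix with elements a_ij, where m is the number of samples, n is the number of filters, and a_ij is a digitized sample value.

Here the number of rows of the matrix is the number of samples, that is, the number of times the outputs of the switched-capacitor band-pass filters 16 are sampled in response to sample pulses spaced approximately 20 ms apart; the number of columns is the number of BPFs 16; and each element is the digitized sample value of the corresponding BPF. The speaker's speech features extracted in this way have not yet been normalized with respect to amplitude, and nothing has been done about the way weak sounds are pushed into the background by the position of the speaker's accent or by strong sounds, so they cannot be said to represent the features of the speaker's speech adequately. The elements of each row of the matrix are therefore weighted. That is, the CPU 24 applies the operation stored in the system program memory 23 to the data held temporarily in the input pattern memory 22, and the resulting matrix is stored in the registered pattern memory 21 as a registered pattern.
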
In this way the amplitude information in the speech is normalized. This amplitude normalization is applied to every utterance the speaker makes as a control content before its content (matrix) is stored in the registered pattern memory 21. Once the speaker has registered the desired control contents in the registered pattern memory 21 by uttering them, the setting of control contents in the speech recognition device is complete, and as many registered patterns as there are control contents are held in the registered pattern memory 21.

As stated above, the operation that normalizes the amplitude of a speech feature pattern is executed by the CPU 24 according to the program stored in the system program memory 23; what it does is explained schematically below.

The operation of the A/D converter 18, first I/O port 20, system program memory 23, and CPU 24 of FIG. 1 can be represented by the functional blocks of FIG. 3. Latch circuits 30(1) to 30(15) in FIG. 3 (which in practice correspond to the input pattern memory 22) latch the data corresponding to the matrix, and the latched contents are supplied to an adder 31 and to multipliers 32. The output of the adder 31 is supplied to a level judgment circuit 33 and to dividers 34(1) to 34(15). The adder 31 adds the elements of each row of the matrix to obtain the row sums.

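[formula]

The row sums are computed in this way, and each row element latched in latch circuits 30(1) to 30(15) is divided by the corresponding row sum in dividers 34(1) to 34(15). Multipliers 32(1) to 32(15), placed ahead of the dividers 34(1) to 34(15), multiply by a factor N so that the quotient can be evaluated as an integer; they can be omitted in some cases. The amplitude-normalized data produced by dividers 34(1) to 34(15) are stored over the bus line in the registered pattern memory 21 as a registered pattern.
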
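
The formula images are not reproduced in this text. From the surrounding description, the normalization can plausibly be written as below, where the row-sum symbol $S_i$ and the normalized element $b_{ij}$ are names assumed here for illustration, and $a_{ij}$, $m$, $n$, and $N$ are as defined in the text.

```latex
S_i = \sum_{j=1}^{n} a_{ij}, \qquad
b_{ij} = \frac{N \, a_{ij}}{S_i}, \qquad
i = 1, \dots, m, \quad j = 1, \dots, n
```

Before any integer truncation, each row of the normalized matrix sums to N regardless of how loud that sample was, which is what keeps a strong syllable from swamping the weaker ones.
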
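A threshold of a predetermined level is set in the level judgment circuit 33. When the level of the output of the adder 31 is at or below the threshold, the latched contents of latch circuits 30(1) to 30(15) and 35(1) to 35(15) are cleared; otherwise the two sets of latch circuits are left alone. Because latch circuits 30(1) to 30(15) and 35(1) to 35(15) thus latch only when the output of the adder 31 is at or above a certain value, malfunction due to noise while the detected speech is weak is prevented.

As the explanation of FIG. 3 shows, in the process of registering the speaker's desired control contents in the registered pattern memory 21, the feature pattern before amplitude normalization is first stored temporarily in the input pattern memory, which is implemented in RAM, and the amplitude-normalized feature pattern is then stored in the registered pattern memory.
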
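
The amplitude normalization and level gating described for FIG. 3 can be sketched in software as follows. This is a minimal illustration, not the patent's program: the function name `normalize_amplitude`, the default scale factor, and the threshold value are assumptions, and integer division stands in for evaluating the quotient as an integer.

```python
def normalize_amplitude(frames, scale_n=100, threshold=5):
    """Divide each row (one sample time across all filters) by its row sum.

    frames    : list of rows, each a list of non-negative integer filter outputs
    scale_n   : multiplier N applied before the division (assumed value)
    threshold : rows whose sum is at or below this level are cleared (assumed value)
    """
    normalized = []
    for row in frames:
        row_sum = sum(row)  # role of adder 31
        if row_sum <= threshold:
            # Level judgment: treat a weak frame as noise and clear it.
            normalized.append([0] * len(row))
        else:
            # Multiply by N, then divide by the row sum, keeping the integer part.
            normalized.append([(scale_n * value) // row_sum for value in row])
    return normalized


if __name__ == "__main__":
    quiet = [[10, 30, 60]]
    loud = [[20, 60, 120]]
    print(normalize_amplitude(quiet))        # same spectral shape ...
    print(normalize_amplitude(loud))         # ... gives the same normalized row
    print(normalize_amplitude([[1, 0, 1]]))  # at or below threshold: cleared
```

In this toy data the loud frame has the same spectral shape as the quiet one at twice the level, so both normalize to the same row.
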
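Next, the case in which the speaker specifies by voice one of the registered control contents is described.

When the speaker utters the desired control content from among those registered and so issues a voice command, the feature pattern of the speech is amplitude-normalized in the same way as a registered pattern and recorded in the input pattern memory 22. The amplitude-normalized input pattern for the speaker's command is likewise expressed as a matrix.

The amplitude-normalized input pattern stored in the input pattern memory 22 is compared with the registered patterns already stored in the registered pattern memory 21 as control contents. The similarity between the two patterns is computed in this comparison, and the control content corresponding to the most similar pattern is judged to be the one the speaker commanded.

The similarity between an input pattern and a registered pattern is judged by calculating the distance between the patterns. The matrix obtained by taking the absolute value of the difference between each pair of corresponding elements of the amplitude-normalized registered pattern and the input pattern (k_ij and its counterpart in the other pattern) is defined as the distance pattern, which represents the distance between the two patterns, and the similarity d is calculated as the sum of all elements of this matrix.

The similarity d is calculated for every registered pattern, in other words for the patterns representing all control contents, and the pattern with the smallest value of d is judged to be the one the speaker commanded by voice. Speech recognition is carried out in this way, and normalizing the speech amplitude as described above markedly reduces the misrecognition rate. The similarity between the registered patterns and the input pattern is thus calculated by the CPU 24 executing the operations specified by the similarity calculation program held in the system program memory 23, which makes it possible to control equipment by speech recognition.
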
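
The distance calculation and minimum search can be sketched as follows. The sketch assumes that the input pattern and every registered pattern have already been brought to the same number of rows and columns; the names `pattern_distance` and `recognize` and the sample data are illustrative only.

```python
def pattern_distance(pattern_a, pattern_b):
    """Sum of absolute element differences between two equal-sized patterns."""
    if len(pattern_a) != len(pattern_b):
        raise ValueError("patterns must have the same number of rows")
    total = 0
    for row_a, row_b in zip(pattern_a, pattern_b):
        if len(row_a) != len(row_b):
            raise ValueError("patterns must have the same number of columns")
        total += sum(abs(x - y) for x, y in zip(row_a, row_b))
    return total


def recognize(input_pattern, registered_patterns):
    """Return (label, d) for the registered pattern with the smallest d."""
    return min(
        ((label, pattern_distance(input_pattern, pattern))
         for label, pattern in registered_patterns.items()),
        key=lambda item: item[1],
    )


if __name__ == "__main__":
    registered = {
        "channel 1": [[10, 30, 60], [50, 25, 25]],
        "channel 8": [[40, 30, 30], [50, 25, 25]],
    }
    spoken = [[12, 28, 60], [48, 27, 25]]
    print(recognize(spoken, registered))
```

Because d is a distance, a smaller value means a closer match. Using only additions and absolute differences keeps the arithmetic within reach of a small 8-bit processor.
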
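With the pattern-matching speech recognition described above, however, although amplitude normalization reduces the misrecognition rate, the time a speaker takes to utter the same word is not necessarily the same each time. Solving this requires normalizing the time axis as well. Time-axis normalization is performed by always dividing the time from the start to the end of the speaker's utterance of a word by a fixed constant n. In this invention, the constant n is chosen to be larger than the pitch interval of the speaker's voice. This means that by sampling the input speech with a period longer than the pitch interval, so that no peak value is missed during the sampling period, errors can be suppressed not only in the time-axis normalization but also in the normalization of the amplitude values.

As noted earlier, for both input patterns and registered patterns the extraction of the speaker's speech features depends on BPFs 16(1) to 16(15) and sample-and-hold circuits 17(1) to 17(15), and both circuits have time-constant elements in their operation. In particular, the peak detection method of the sample-and-hold circuits strongly affects whether the end time of the speaker's utterance is detected correctly. The peak detection method and the sampling timing of the sample-and-hold circuits in the feature extraction unit 2 are therefore key to capturing the speaker's utterance length accurately before normalizing the time axis.
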
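
The text does not spell out how the fixed division by n is carried out. One possible reading, shown here purely as an assumption, is to split the utterance (from detected start to detected end) into n equal spans and keep the per-band peak of each span, so that every utterance yields an n-row pattern. The function name `normalize_time_axis` and the choice of the maximum as the representative value are not from the source.

```python
def normalize_time_axis(frames, n):
    """Map a variable-length list of frames onto exactly n rows.

    frames : non-empty list of rows (one per sample time), all of equal length
    n      : fixed number of rows in the output
    Each output row is the per-column maximum over one of n equal spans.
    """
    if n <= 0 or not frames:
        raise ValueError("need at least one frame and a positive n")
    m = len(frames)
    result = []
    for k in range(n):
        start = (k * m) // n
        end = max(start + 1, ((k + 1) * m) // n)  # always take at least one frame
        span = frames[start:end]
        result.append([max(column) for column in zip(*span)])
    return result


if __name__ == "__main__":
    slow = [[1, 0], [2, 1], [3, 5], [0, 4], [7, 7], [6, 8]]  # 6 frames
    fast = [[2, 1], [3, 5], [7, 8]]                          # 3 frames
    print(normalize_time_axis(slow, 3))
    print(normalize_time_axis(fast, 3))
```

In this toy data the slow and fast utterances reduce to the same three-row pattern, which is the property a fixed division of the utterance is meant to provide.
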
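Another embodiment of the feature extraction unit 2, suited to correcting the time axis properly, is described next.

In general, when a speaker utters a single sound (the speech waveform shown in FIG. 4a), the outputs of BPFs 16(1) to 16(15) show a waveform with a pitch P between peak values, as shown in FIG. 4b. The pitch P is about 8 ms when, for example, the single sound "a" is uttered, and in ordinary speech it falls within 5 to 15 ms. The outputs of BPFs 16(1) to 16(15) shown in FIG. 4b, which have this pitch P, are each peak-detected as shown in FIG. 4c, but depending on the time constant of the detection, the end time of the utterance is detected incorrectly, as shown in FIGS. 4d and 4e. That is, if the time constant is made large to reduce the ripple of the peak detection, the detected output, as FIG. 4d shows, makes it appear that the speech continues until time t2 even though the utterance actually ended at time t1. If, on the other hand, the time constant is made small, ripple appears in the detected waveform and accurate feature pattern extraction cannot be expected. This affects both the time-axis normalization and the feature pattern extraction, and becomes a cause of misrecognition.

In this embodiment, the problem is solved by detecting peak values with a period longer than the pitch period. The embodiment is described below with reference to the drawings.
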
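FIG. 5 is a circuit block diagram of another embodiment of the feature extraction unit 2 shown in FIG. 1. The speech signal from the speech input unit 1 (not shown) is applied at input terminal P1 and supplied to BPFs 41(1) to 41(o). The output of each of BPFs 41(1) to 41(o) is peak-detected by a diode D(1) to D(o) and by a sample-and-hold circuit 42(1) to 42(o) with a peak detection function, made up of a MOS transistor Q(1) to Q(o) and a capacitor C(1) to C(o) that holds the peak value. The peak values found by this peak detection, that is, the speech amplitude data, are held on capacitors C(1) to C(o) and supplied to an A/D converter 45 through a multiplexer 44 consisting of a binary-to-decimal decoder 43 and MOS transistors Q'(1) to Q'(o). While MOS transistors Q(1) to Q(o) are on, the MOS transistors Q'(1) to Q'(o) of the multiplexer are off; the two groups are controlled so that when one is on the other is off. The speech amplitude data held on capacitors C(1) to C(o) while Q(1) to Q(o) are on are therefore supplied to the A/D converter 45 through Q'(1) to Q'(o) while Q(1) to Q(o) are off, and converted into digital values. The peak values are sampled over a time T longer than the pitch P described above. After a peak value has been held for the time T, the charge on capacitors C(1) to C(o) is discharged by a reset circuit 46 made up of transistors T(1) to T(o) and T'(1) to T'(o) and resistors R(1) to R(o) and R'(1) to R'(o). After this discharge, peak detection starts again, and the cycle repeats until the end of the speaker's utterance.

This is illustrated in FIG. 6. FIG. 6a shows the output of one of BPFs 41(1) to 41(o). The peak value of the speech is detected and held during the sampling pulse of duration T shown in FIG. 6b, and the charge on capacitors C(1) to C(o) is discharged by the reset pulse shown in FIG. 6c, so the waveform shown in FIG. 6d is applied to the input of the A/D converter 45. As FIG. 6 shows, the peak value of the speech is held for the time T, which is longer than the pitch P, and because the discharge takes place during the reset pulse period, the erroneous amplitude data of the detected output during discharge are never sent to the A/D converter 45.
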
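
The peak-hold-and-reset cycle can be imitated in software to see why the hold time T should exceed the pitch P. The sketch below is a simplified discrete-time model under assumed conditions: an ideal diode and capacitor (no droop), an instantaneous reset, and a toy pulse train standing in for one BPF output. The function name `peak_hold` and the sample values are illustrative.

```python
def peak_hold(samples, window):
    """Return the peak of each consecutive window of `window` samples.

    Models one channel: the capacitor starts each window discharged (reset),
    charges to the largest positive input seen, and is read at the window end.
    """
    if window <= 0:
        raise ValueError("window must be a positive number of samples")
    peaks = []
    for start in range(0, len(samples), window):
        held = 0  # capacitor discharged by the reset pulse
        for value in samples[start:start + window]:
            if value > held:  # diode conducts only when the input exceeds the held level
                held = value
        peaks.append(held)
    return peaks


if __name__ == "__main__":
    pitch = 4                                    # samples between peaks (pitch P)
    voiced = [0, 3, 0, 0] * 4                    # 16 samples of a pulse train
    signal = voiced + [0] * 8                    # followed by silence
    print(peak_hold(signal, window=2 * pitch))   # window longer than the pitch
    print(peak_hold(signal, window=pitch // 2))  # window shorter than the pitch
```

With the longer window every voiced frame reports the same peak and the first silent frame reads zero, so in this model the detected end of the utterance is off by at most one window. With the shorter window the output alternates between the peak and zero, which is the ripple the text describes.
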
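Next, the means for generating the sampling pulse that holds the peak value for the time T and the means for generating the reset pulse are described with reference to FIGS. 5, 7, and 8. The sampling pulse that samples and holds the speech peak value on capacitors C(1) to C(o) is obtained from a frequency divider 47 and a NAND gate 48. The clock pulse labeled CK in FIG. 7 is applied to the clock terminal CK of the frequency divider 47; it is divided, and the outputs labeled Q0 and Q1 are applied to the NAND gate 48 to produce the sampling pulse shown at a in FIG. 7. As stated above, this sampling pulse controls the conduction of MOS transistors Q(1) to Q(o). A first monostable multivibrator 49 detects the falling edge of the sampling pulse a and generates the pulse shown at b in FIG. 7, which inverts the output of a flip-flop 50 (FIG. 7c). A clock pulse CK' is then applied through a NAND gate 51 and an inverter 52 to an m-bit counter 53, which begins counting CK' and steps the binary-to-decimal decoder of the multiplexer 44 through its outputs in turn. When the whole scan is finished, the output Q of the m-bit counter 53 supplies a reset pulse to the flip-flop 50 through an inverter 54, and the state of the flip-flop 50 inverts again. At the same time, a second monostable multivibrator 55 generates the reset pulse (FIG. 7d, corresponding to c in FIG. 6) that turns on transistors T(1) to T(o) and discharges the charge on capacitors C(1) to C(o).

An initialize circuit 57 connected to the frequency divider 47 resets the frequency divider 47 when power is turned on.

The timing for reading data into the A/D converter 45 is set by generating the pulse shown at f in FIG. 8, as follows. As described above, the first monostable multivibrator 49 generates a pulse (b in FIGS. 7 and 8) at the falling edge of the sampling pulse (FIG. 7a). This pulse inverts the state of the flip-flop 50 (c in FIGS. 7 and 8), and the clock pulse CK' (FIG. 8e) is applied to the m-bit counter 53. The falling edge of this clock pulse (FIG. 8e) is detected by a third monostable multivibrator 53, whose output is the pulse shown at f in FIG. 8.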
This pulse is used as the data read timing pulse for the A/D converter 45.

In this embodiment, then, a time T longer than the pitch P observed when a single sound is uttered is used as the sampling time for speech feature extraction, which prevents the faulty feature extraction that ripple in the peak detection would otherwise cause during recognition. In judging the end time of the speaker's utterance, too, the error can be kept to roughly less than the pitch length P, so misrecognition is reduced when the time axis is normalized. In other words, even if the time the speaker takes to utter the same word differs from one utterance to the next, the resulting misrecognition rate can be reduced.

As is clear from the above description, the present invention can provide a speech recognition device in which the recognition errors of the pattern matching method are markedly reduced.

The equipment controlled by the speech recognition device of the present invention is not limited to television receivers; the device can be applied to systems in general that require remote operation.
0(1) _{to (15)} and 35(1) _{to (15)} are the adder 31
By latching only when the output is above a certain value, malfunctions due to noise can be prevented when the detected voice is small. As can be seen from the explanation of FIG. 3 above, in the process of registering the control content desired by the speaker in the registered pattern memory 21, the characteristic pattern before the amplitude is normalized is partially composed of RAM. The characteristic pattern stored in the input pattern memory and then having its amplitude normalized is stored in the registered pattern memory. Next, a case will be described in which the speaker instructs the desired control content by voice with respect to the registered control content. When the speaker utters a desired control content among the registered control content and gives a voice command, the characteristic pattern of the voice is normalized in amplitude and recorded in the input pattern memory 22 in the same way as the registered pattern. Here, it is assumed that an input pattern obtained by normalizing the amplitude of the content of the voice command given by the speaker is expressed by the following determinant. This amplitude is normalized and input pattern memory 22
The input pattern 〓 stored in is referenced with the registered pattern already registered in the registered pattern memory 21 as the control content. By calculating the similarity between both patterns using this reference operation, it is determined that the control content corresponding to the pattern with the closest similarity is the control content commanded by the speaker. The degree of similarity between the input pattern and the registered pattern is determined by calculating the distance between the patterns shown below. That is,
The determinant obtained by taking the absolute value of the difference between the registration pattern whose amplitude has been normalized, the input pattern, and each element _kij , _ij is defined as the determinant distance pattern, which represents the distance between both patterns. , the similarity is calculated based on the total value of each element of this determinant 〓. To further explain this, the distance pattern 〓 is expressed by the following equation, and the similarity d is expressed as follows. The above calculation of the degree of similarity d is performed for all registered patterns, or in other words, patterns representing all control contents, and the pattern with the smallest value of degree of similarity d is the pattern commanded by the speaker by voice. It is determined that Speech recognition is performed in this way,
By normalizing the audio amplitude as described above, the misrecognition rate is significantly reduced. In this way, the speech recognition of the speaker's utterance is performed by calculating the degree of similarity between the registered pattern and the input pattern by the CPU 24 executing calculations instructed by the similarity calculation program set in the system program memory 23. It becomes possible to control devices using voice recognition. However, in speech recognition using the speech pattern matching method described above, the speech misrecognition rate is reduced by normalizing the amplitude, but even if a speaker utters the same word, the utterance time is not always the same. It does not match. To solve this problem, it is necessary to normalize the time axis as well. The time axis is normalized by always dividing the time taken between the pronunciation start time and the pronunciation end time of a word pronounced by the speaker by a constant constant n. This constant n
In this invention, is a constant larger than the pitch interval time of the sounds produced by the speaker. This can be done by sampling the input audio at a period longer than the pitch interval to avoid missing the peak value detection during the sampling period, which not only normalizes the time axis but also reduces the normalization error for the amplitude value. This means that it can be suppressed. As mentioned above, extraction of the characteristics of the speaker's voice in both input patterns and registered patterns is performed by both the BPF 16(1) _{to (15)} and the sample/hold circuits 17(1) _{to (15).} Although dependent, both circuits have a time constant element in their operation. In particular, the peak detection method of the sample-and-hold circuit greatly influences the correct detection of the end time of the speaker's utterance. Therefore, the peak detection method and sampling timing in the sample-and-hold circuit that constitutes the feature extraction unit 2 are important points in normalizing the time axis after accurately capturing the speaker's utterance length. . Next, another embodiment of the feature extracting section 2 suitable for making correction of the time axis will be described. Generally, when a speaker utters a certain single sound (speech waveform shown in Figure 4a), the output of the BPF 16(1) _{to (15)} has a pitch between peak values of P as shown in Figure 4b. The wave number can be obtained. For example, this pitch P is approximately 8 mSec when a single sound such as "a" is uttered, but in normal speech, this pitch is 5 to 5 mSec.
Enter within 15mSec. The outputs of the BPFs 16(1) _{to (15)} shown in FIG. 4B having such a pitch P are each subjected to peak detection as shown in FIG. Depending on the constant, the end time of the utterance may be detected incorrectly as shown in FIGS. 4d and 4e. In other words, when the time constant is increased to reduce ripples caused by peak detection, the detection output changes at time t ₂ even though vocalization actually ends at time t ₁ , as shown in Figure 4 (d).
It is recognized that the audio continues until the end. On the other hand, if the time constant is made small, ripples will occur in the detected waveform, making it impossible to expect accurate feature pattern extraction. This may affect the normalization of the time axis and the extraction of feature patterns, leading to incorrect speech recognition. Therefore, in this embodiment, this problem is solved by performing peak value detection at a cycle longer than the pitch cycle. The present embodiment will be described below with reference to the drawings. FIG. 5 is a circuit diagram showing another embodiment of the feature extraction section 3 shown in _FIG .
The speech signal from the speech input section 1 (not shown) is supplied to the BPFs 41(1) to 41(o). Each output of the BPFs 41(1) to 41(o) is peak-detected by a sample/hold circuit 42(1) to 42(o) having a peak detection function, made up of a diode D(1) to D(o), a MOS transistor Q(1) to Q(o), and a capacitor C(1) to C(o) that holds the peak value. The peak values obtained by this peak detection, that is, the speech amplitude data, are held on the capacitors C(1) to C(o), and these amplitude data are supplied to the A/D converter 45 via a multiplexer 44 consisting of a binary-to-decimal decoder 43 and MOS transistors Q'(1) to Q'(o). Here, when the MOS transistors Q(1) to Q(o) are on, the MOS transistors Q'(1) to Q'(o) constituting the multiplexer are off; the two groups are controlled so that when one is on, the other is off. Therefore, the speech amplitude data held on the capacitors C(1) to C(o) while the MOS transistors Q(1) to Q(o) are on is supplied to the A/D converter 45 via the MOS transistors Q'(1) to Q'(o) when the MOS transistors Q(1) to Q(o) are off, and is converted into digital quantities.

The peak value is sampled for a time T longer than the pitch P mentioned above. After the peak value has been held for the time T, the charge on the capacitors C(1) to C(o) is discharged by a reset circuit 46 composed of transistors T(1) to T(o) and T'(1) to T'(o) and resistors R(1) to R(o) and R'(1) to R'(o). After this discharge time, peak detection starts again, and the cycle is repeated until the end of the speaker's utterance. This is illustrated in FIG. 6: the peak value is held as it is detected, and the charge on the capacitors C(1) to C(o) is discharged by the reset pulse shown in FIG. 6c, so the held waveform shown in FIG. 6 is input to the A/D converter 45. As can be seen from FIG. 6, the peak value of the speech is held for a time T longer than the pitch P, and since discharge takes place only during the reset pulse period, amplitude data corrupted by the discharge is never sent to the A/D converter 45.

Next, the means for generating the sampling pulse of the time T for holding the peak value and the means for generating the reset pulse will be explained with reference to FIGS. 5, 7 and 8. To sample and hold the peak value of the speech on the capacitors C(1) to C(o), a sampling pulse is obtained from a frequency divider 47 and a NAND gate 48. That is, the clock pulse shown as CK in FIG. 7 is applied to the clock terminal CK of the frequency divider 47, and the sampling pulse shown in FIG. 7a is obtained. This sampling pulse controls the conduction of the MOS transistors Q(1) to Q(o) as described above. Further, the first monostable multivibrator 49 detects the falling edge of the sampling pulse a, generates the pulse shown in FIG. 7b, and inverts the output of the flip-flop 50 (see FIG. 7c). Clock pulses CK' are then applied to the m-bit counter 53 via the NAND gate 51 and the inverter 52. The m-bit counter 53 starts counting the clock pulses CK' and sequentially switches the binary-to-decimal decoder constituting the multiplexer 44. When all channels have been scanned, a reset pulse is supplied from the m-bit counter 53 to the flip-flop 50 through the inverter 54, and the state of the flip-flop 50 is inverted again. At the same time, the second monostable multivibrator 55 generates a reset pulse (d in FIG. 7, corresponding to c in FIG. 6) that makes the transistors T(1) to T(o) conductive and discharges the charge stored on the capacitors C(1) to C(o). An initialization circuit 57 connected to the frequency divider 47 resets the frequency divider 47 when the power is turned on.

The timing for reading data into the A/D converter 45 is determined by generating the pulse shown in FIG. 8f, as follows. As mentioned above, at the falling edge of the sampling pulse (FIG. 7a), the first monostable multivibrator 49 generates a pulse (b in FIGS. 7 and 8). This pulse inverts the state of the flip-flop 50 (c in FIGS. 7 and 8), and the clock pulses CK' (FIG. 8e) are applied to the m-bit counter 53. The falling edge of each clock pulse (FIG. 8e) is detected by a third monostable multivibrator, whose output is the pulse shown in FIG. 8f.
This pulse is then used as the data read timing pulse for the A/D converter 45.

In this way, in this embodiment, a time T longer than the pitch P observed when a single sound is uttered is used as the sample time for extracting speech features, which prevents erroneous feature extraction caused by ripple in the peak-detected output. Furthermore, when the end time of a speaker's utterance is determined, the error can be confined to roughly the pitch length P or less, so that misrecognition arising in the normalization of the time axis can be reduced. In other words, even if the time a speaker takes to utter the same word differs from one utterance to the next, the rate of misrecognition caused by this variation can be reduced.

As is clear from the above, the present invention makes it possible to provide a speech recognition device in which recognition errors in pattern-matching speech recognition are significantly reduced. Note that the device controlled by the speech recognition device according to the present invention is not limited to a television receiver; the invention can be applied to systems in general that require remote control.
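The windowed peak-hold scheme of this embodiment can be sketched in software as follows. This is an illustrative model under stated assumptions, not the circuit itself: the capacitors are modeled as running maxima, the scan through the multiplexer and the reset are treated as instantaneous at the start of the non-sampling period, and the A/D converter is assumed to be 8-bit. The function names, the synthetic burst signal, and the sample counts (8 ms pitch, T = 20 ms, 1 ms gap at a 20 kHz simulation rate) are hypothetical choices made only for the demonstration.

```python
import math

def burst_train(n, end_n, pitch_n, fs, amplitude):
    """Hypothetical rectified filter output: a decaying burst every pitch period."""
    out = []
    for i in range(n):
        phase = (i % pitch_n) / fs
        live = amplitude if i < end_n else 0.0  # silent after the utterance ends
        out.append(live * math.exp(-phase * 500)
                   * abs(math.sin(2 * math.pi * 1000 * phase)))
    return out

def quantize(value, full_scale=1.0, bits=8):
    """Assumed A/D converter: clamp to [0, full_scale] and return an integer code."""
    levels = (1 << bits) - 1
    return round(min(max(value, 0.0), full_scale) / full_scale * levels)

def windowed_peak_hold(channels, window_n, gap_n):
    """Return one frame of digital peak values per sampling window.

    window_n is the sampling time T in samples (longer than the pitch period);
    gap_n (at least 1) is the non-sampling time used for scanning and reset.
    """
    period = window_n + gap_n
    caps = [0.0] * len(channels)  # charge on the hold capacitors
    frames = []
    for i in range(len(channels[0])):
        pos = i % period
        if pos < window_n:
            # Sampling period: each capacitor tracks the peak of its channel.
            caps = [max(c, ch[i]) for c, ch in zip(caps, channels)]
        elif pos == window_n:
            # Non-sampling period: scan the capacitors in channel order into
            # the A/D converter, then discharge them before the next window.
            frames.append([quantize(c) for c in caps])
            caps = [0.0] * len(channels)
    return frames

FS = 20_000     # simulation rate in Hz (assumed)
PITCH_N = 160   # 8 ms pitch period
WINDOW_N = 400  # T = 20 ms, longer than the pitch period
GAP_N = 20      # 1 ms for scanning and reset
END_N = 4000    # utterance ends at 200 ms
channels = [burst_train(8000, END_N, PITCH_N, FS, a) for a in (1.0, 0.5)]
frames = windowed_peak_hold(channels, WINDOW_N, GAP_N)
for k, frame in enumerate(frames):
    print(k, frame)
```

Because every window spans more than one pitch period, each frame taken during the utterance captures a full peak, so the codes show no ripple from frame to frame. The first window that begins after the utterance has ended reads zero, since the capacitors were discharged and nothing recharged them.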

[Brief explanation of the drawing]

FIG. 1 is a circuit block diagram showing an embodiment of the speech recognition device according to the present invention; FIGS. 2 and 3 are a characteristic diagram and a circuit block diagram for explaining the present invention; FIG. 4 is a characteristic diagram for explaining the detection characteristics of a speech wave; FIG. 5 is a circuit block diagram showing another embodiment of the present invention; FIG. 6 is a waveform diagram for explaining the operation of the other embodiment of the present invention; and FIGS. 7 and 8 are timing charts for explaining the operation of the other embodiment of the present invention.

16(1) to 16(o), 41(1) to 41(o): filters; 17: peak value detection means; D(1) to D(o), 42(1) to 42(o), 44, 46: peak value detection means; 3: recognition processing unit.

Claims

[Claims]

1. A speech recognition device comprising at least: a plurality of filters for extracting predetermined frequency band components from speech uttered by a speaker; a plurality of peak value detecting means for detecting a peak value of each output of the filters at every sample pulse having a predetermined period longer than a normal speech pitch interval; an A/D converter for converting the peak values detected by the peak value detecting means into digital values; reset means, associated with the A/D converter and the peak value detecting means, for controlling the peak value detecting means so that the output of the peak value detecting means is reset once during the non-sampling period and the peak value is then detected anew in each successive sample period; a timing control circuit that drives the peak value detecting means at every sample pulse of the predetermined period and drives the reset means and the A/D converter during the non-sampling period; and a recognition processing unit that recognizes the features of the speech in response to the output of the A/D converter.