JPS6295598A

JPS6295598A - Voice recognition apparatus

Info

Publication number: JPS6295598A
Application number: JP23677085A
Authority: JP
Inventors: 納田　重利
Original assignee: Sony Corp
Current assignee: Sony Corp
Priority date: 1985-10-23
Filing date: 1985-10-23
Publication date: 1987-05-02

Abstract

(57) [Summary] This publication contains application data from before electronic filing, so no abstract data is recorded.

Description

[Detailed description of the invention]

[Field of industrial application]

The present invention relates to a speech recognition device applied, for example, to recognizing a speaker's speech word by word.

[Summary of the invention]

In a speech recognition device, this invention converts a speech signal into a frequency spectrum to form time-series frame data and binarizes the spectral data of each frame. Binary data describing each frame as a whole, such as voiced/unvoiced status, the distance between adjacent frames, and phonemic character (vowel-like, consonant-like, etc.), is appended to the binarized spectral data to form binary feature data. Pattern matching on this binary feature data improves the recognition rate while reducing memory capacity and shortening computation time.

[Conventional technology]

A speech recognition device previously proposed by the present applicant (Japanese Patent Application No. Sho 59-106177) comprises a microphone as the voice input section, a preprocessing circuit, an acoustic analyzer, a feature data extractor, a registered pattern memory, a pattern matching determiner, and so on.

In this speech recognition device, the preprocessing circuit limits the audio signal input from the microphone to the band required for speech recognition, an A/D converter converts it into a digital audio signal, and this digital audio signal is supplied to the acoustic analyzer.

The acoustic analyzer converts the audio signal into a frequency spectrum and normalizes it using N representative frequencies spaced, for example, at constant intervals on a logarithmic axis. In each frame period, frame data consisting of N channels of spectral data is supplied to the feature data extractor.

The feature data extractor calculates the distance between adjacent frames and sums these inter-frame distances to obtain the trajectory length of the N-dimensional vector from the start frame to the end frame of the audio signal. It divides the trajectory length equally by a predetermined number of divisions, chosen as the number needed to extract the features of the longest utterance with the most words, and extracts only the frame data corresponding to the division points as feature data. The time axis is thereby normalized so that the output is unaffected by variations in the speaker's speaking rate.

During registration, this feature data is supplied to the registered pattern memory and stored as a registered feature data block (standard pattern). During recognition, the input audio signal is processed as described above to form an input feature data block (input pattern), which is supplied to the pattern matching determiner. The pattern matching determiner then performs pattern matching between the input feature data block and the registered feature data blocks.

The pattern matching determiner calculates the inter-frame distance between the frame data of a registered feature data block and the frame data of the input feature data block, and takes the sum of the inter-frame distances as the matching distance. It calculates the matching distance for the other registered feature data blocks in the same way, and outputs as the recognition result the word corresponding to the registered feature data block whose matching distance is the smallest and is judged to be sufficiently small.

[Problem that the invention seeks to solve]

However, in the conventional speech recognition device, the frame data output from the acoustic analyzer passes through the feature data extractor and is stored unchanged in the registered pattern memory as a registered feature data block, so the registered pattern memory requires an enormous amount of memory.

In addition, the computation time for pattern matching grows with the amount of data.

Accordingly, an object of this invention is to provide a speech recognition device that, by binarizing the frame data, can reduce the capacity of the registered pattern memory and shorten the matching processing time.

The present applicant has also proposed, in Japanese Patent Application No. Sho 60-166191, a speech recognition device that accurately normalizes the spectral trend, which varies for various reasons, binarizes each item of spectral data constituting the frame data, and performs pattern matching based on this binary data.

However, in that speech recognition device, binarization weakens the characteristics of each frame as a whole, so frames differ less from one another and become more similar. For example, when the frame data shown in Fig. 3A is binarized with the reference level shown in the figure, it becomes the binary data "1, 1, 0, 0, 1, 0, 1, 1, ■", and when the frame data shown in Fig. 3B is binarized with the reference level shown in the figure, it becomes the binary data "1, 1, 0, 0, 0, 0, 1, 1, 1". Although the two frames are clearly different, their binary data hardly differ. When the pattern matching determiner computes the distance between such binary frames, the inter-frame distance comes out small, the matching distances show no large differences, and the recognition rate may fall.
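The weakness described above can be illustrated with a minimal sketch. It assumes the inter-frame distance is the sum of absolute differences over channels, as the description states elsewhere; the function name `frame_distance` and both example frames are hypothetical and are not the values of Fig. 3.

```python
def frame_distance(frame_a, frame_b):
    """Sum of absolute per-channel differences between two binarized frames."""
    if len(frame_a) != len(frame_b):
        raise ValueError("frames must have the same number of channels")
    return sum(abs(a - b) for a, b in zip(frame_a, frame_b))


# Hypothetical 9-channel binarized frames (illustrative values only).
frame_a = [1, 0, 1, 1, 0, 0, 1, 0, 1]
frame_b = [1, 0, 1, 0, 0, 0, 1, 0, 1]

# Only one channel differs, so the distance is small even if the
# underlying analog spectra looked quite different before binarization.
distance = frame_distance(frame_a, frame_b)
```

With only a handful of bits per frame, many distinct spectra collapse onto nearly identical bit patterns, which is the motivation for appending extra per-frame feature bits.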

Accordingly, this invention in particular aims to improve the recognition rate without sacrificing the advantages of binarization, namely reduced memory capacity and fast matching.

[Means for solving problems]

This invention is a speech recognition device in which an input audio signal is converted into an N-channel frequency spectrum and time-series data of the N-channel frequency spectrum is input, characterized in that: the spectral data is binarized for each frame of the time-series data to obtain binary data of the spectral data; features relating to a single frame, such as voiced/unvoiced status, the distance between adjacent frames, and phonemic character, are extracted to obtain binary data; the binary data relating to the frame is appended to the binary data of the spectral data to form binary feature data; and the input audio signal is recognized using the binary feature data.

[Operation]

In the mixed binarization pattern extractor 10, the spectral data of the time-series frame data, compressed by time-axis normalization, is binarized. Binary data indicating the features of each frame, supplied from the voiced/unvoiced extractor 7, the adjacent distance calculator 8, and the phonemic pattern extractor 9, is appended to the binarized spectral data of the temporally corresponding frame to form binary feature data, and pattern matching is performed using this binary feature data.

[Embodiment]

An embodiment of this invention is described below with reference to the drawings.

Fig. 1 shows an embodiment of this invention. In Fig. 1, reference numeral 1 denotes a microphone serving as the voice input section.

The analog audio signal from microphone 1 is supplied to filter 2. Filter 2 is, for example, a low-pass filter with a cutoff frequency of 7.5 kHz. It limits the audio signal to the band of 7.5 kHz and below required for speech recognition, and the band-limited signal is supplied to A/D converter 4 via amplifier 3.

A/D converter 4 is, for example, an 8-bit A/D converter with a sampling frequency of 12.5 kHz. It converts the analog audio signal into an 8-bit digital signal, which is supplied to spectrum converter 5.

Spectrum converter 5 converts the audio signal into a frequency spectrum and generates, for example, an N-channel spectral data sequence. In spectrum converter 5, the audio signal is converted into a frequency spectrum by arithmetic processing, yielding a spectral data sequence whose representative values are N frequencies spaced, for example, at constant intervals on a logarithmic axis. The audio signal is thus expressed by the magnitudes of a discrete N-channel frequency spectrum. In each unit of time (frame period), the N-channel spectral data sequence is output as one frame of data. In other words, in each frame period the audio signal is cut out as a parameter expressed by an N-dimensional vector and supplied to the spectral pattern extractor 6, the voiced/unvoiced extractor 7, the adjacent distance calculator 8, and the phonemic pattern extractor 9.
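As a minimal sketch of "N frequencies at constant intervals on a logarithmic axis", the representative frequencies can be generated as a geometric sequence. The description gives neither N nor the lowest frequency, so the function name `log_spaced_frequencies`, the 100 Hz lower bound, and the choice of 7 channels are assumptions for illustration; only the 7.5 kHz upper bound comes from the filter described above.

```python
def log_spaced_frequencies(n_channels, f_low=100.0, f_high=7500.0):
    """Return n_channels frequencies equally spaced on a logarithmic axis.

    f_low is a hypothetical lower bound; f_high follows the 7.5 kHz band limit.
    """
    if n_channels < 2:
        raise ValueError("need at least two channels")
    # Equal spacing on a log axis means a constant ratio between neighbors.
    ratio = (f_high / f_low) ** (1.0 / (n_channels - 1))
    return [f_low * ratio ** k for k in range(n_channels)]


# Hypothetical channel count, chosen only for illustration.
center_frequencies = log_spaced_frequencies(7)
```

Each frame would then hold one spectral magnitude per representative frequency, giving the N-dimensional vector the later stages operate on.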

The spectral pattern extractor 6 compresses the time-series frame data by normalizing the time axis.

For example, in the spectral pattern extractor 6, the absolute value of the difference in spectral data is obtained for each channel of adjacent frames, and the sum of these values is taken as the inter-frame distance between the adjacent frames. The inter-frame distances are then summed to obtain the trajectory length of the N-dimensional vector from the start frame to the end frame of the audio signal. The trajectory length is divided equally by a predetermined number of divisions, chosen as the number needed to extract the features of the longest utterance with the most words. Only the frame data corresponding to each division point is extracted, which normalizes the time axis so that the result is unaffected by variations in the speaker's speaking rate, and the extracted frame data is supplied to the mixed binarization pattern extractor 10.
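The time-axis normalization above can be sketched as follows. This is an illustration under stated assumptions, not the patented circuit: the names `l1_distance` and `normalize_time_axis` are hypothetical, the start and end points are treated as division points (so `divisions` equal parts yield `divisions + 1` frames), and each division point is mapped to the first frame whose cumulative trajectory length reaches it.

```python
def l1_distance(frame_a, frame_b):
    """Sum of absolute per-channel differences between two frames."""
    return sum(abs(a - b) for a, b in zip(frame_a, frame_b))


def normalize_time_axis(frames, divisions):
    """Select frames at equal steps along the cumulative trajectory length."""
    if len(frames) < 2 or divisions < 1:
        raise ValueError("need at least two frames and one division")
    # cumulative[i] is the trajectory length from frame 0 to frame i.
    cumulative = [0.0]
    for prev, cur in zip(frames, frames[1:]):
        cumulative.append(cumulative[-1] + l1_distance(prev, cur))
    total = cumulative[-1]

    selected = []
    index = 0
    for k in range(divisions + 1):
        target = total * k / divisions
        # Advance to the first frame whose cumulative length reaches the target.
        while index < len(frames) - 1 and cumulative[index] < target:
            index += 1
        selected.append(frames[index])
    return selected


# One-channel toy utterance, spoken "fast" and "slow" (repeated frames).
fast = [[0], [1], [1], [1], [2], [4]]
slow = [[0], [0], [1], [1], [1], [1], [2], [2], [4]]
fast_pattern = normalize_time_axis(fast, 4)
slow_pattern = normalize_time_axis(slow, 4)
```

Because repeated (stationary) frames add nothing to the trajectory length, the fast and slow versions of the toy utterance reduce to the same selected frames, which is the speaking-rate invariance the description aims for.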

In the voiced/unvoiced extractor 7, voiced sections and unvoiced sections in the audio signal are detected from the presence or absence of a pitch wave.

Voiced sounds are produced when the breath sent out from the lungs is interrupted in pulses by the vibration of the vocal cords, so they contain a pitch wave. Unvoiced sounds are produced by the turbulent airflow that arises when the breath from the lungs passes through a narrow space formed by the articulatory organs, particularly the tip of the tongue, the teeth, and the lips, so they contain no pitch wave. For this reason, the presence or absence of a pitch wave is detected, for example, by computing a correlation over the low-frequency channels of the time-series frame data supplied sequentially from spectrum converter 5. A section containing a pitch wave is treated as a voiced section and represented, for example, by "1", and a section without a pitch wave is treated as an unvoiced section and represented, for example, by "0", producing the voiced/unvoiced data. This voiced/unvoiced data is supplied to the mixed binarization pattern extractor 10.

In the adjacent distance calculator 8, the distance between adjacent frames of the time-series frame data supplied from spectrum converter 5 is calculated, for example, as the sum over all channels of the absolute values of the differences in spectral data. This adjacent inter-frame distance is quantized to 2 bits and converted into adjacent distance data.

This adjacent distance data is supplied to the mixed binarization pattern extractor 10.
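A minimal sketch of the adjacent distance calculation and its 2-bit quantization follows. The description does not give the quantization thresholds, so the `thresholds` values, the function names, and the example frames are hypothetical.

```python
def adjacent_distance(prev_frame, frame):
    """Sum over channels of absolute spectral differences between neighbors."""
    return sum(abs(a - b) for a, b in zip(prev_frame, frame))


def quantize_to_two_bits(distance, thresholds=(4, 8, 16)):
    """Map a distance to one of four levels and return it as two bits.

    The three thresholds are assumed values, not taken from the description.
    """
    level = sum(1 for t in thresholds if distance >= t)  # 0, 1, 2, or 3
    return [(level >> 1) & 1, level & 1]


# Hypothetical neighboring frames with four channels each.
previous = [12, 3, 7, 1]
current = [10, 6, 2, 1]
distance_bits = quantize_to_two_bits(adjacent_distance(previous, current))
```

Here the distance is 2 + 3 + 5 + 0 = 10, which falls in the third of the four assumed ranges and is encoded as the bits 1, 0.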

In the phonemic pattern extractor 9, the phonemic character of each frame, that is, the characteristic shape of its spectrum, is detected and expressed, for example, as 2-bit data. For example, the spectra of the voiced vowels "a", "u", and "o" are characterized by a large level in the low-band channels, and in this case the data "1, 0" is generated. The spectra of the voiced vowels "i" and "e" are characterized by large levels in both the low-band and high-band channels, and in this case the data "0, 0" is generated. The spectra of unvoiced consonants such as "s" and "t" are characterized by a large level in the high-band channels, and in this case the data "0, 1" is generated.

For any other frame data that matches none of the three characteristic spectral shapes above, the data "1, 1" is generated. This phonemic data is supplied to the mixed binarization pattern extractor 10.
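The four phonemic codes can be sketched as a small classifier. The description does not say how "a large level" is decided, so this sketch assumes that the mean of the lowest third and of the highest third of the channels is compared against a fixed threshold; the band split, the `threshold` value, and the name `phonemic_code` are all hypothetical.

```python
def phonemic_code(frame, threshold=10.0):
    """Return the 2-bit phonemic code for one frame of spectral data.

    Assumption: "large level" means the band mean is at least `threshold`.
    """
    third = max(1, len(frame) // 3)
    low_mean = sum(frame[:third]) / third
    high_mean = sum(frame[-third:]) / third
    low_large = low_mean >= threshold
    high_large = high_mean >= threshold

    if low_large and high_large:
        return (0, 0)  # low and high bands large, e.g. vowels "i", "e"
    if low_large:
        return (1, 0)  # low band large, e.g. vowels "a", "u", "o"
    if high_large:
        return (0, 1)  # high band large, e.g. unvoiced consonants
    return (1, 1)      # none of the three characteristic shapes


# Hypothetical 6-channel frame with energy concentrated in the low band.
code = phonemic_code([20, 20, 1, 1, 1, 1])
```

The code assignments themselves follow the description; only the decision rule that produces them is invented for the example.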

In the mixed binarization pattern extractor 10, the frame data extracted by the spectral pattern extractor 6 is binarized. For example, a trend value that corrects the trend variation of the spectral data in each frame is obtained for each channel n (0 ≤ n ≤ N-1) by multiplying by a suitable coefficient the average of two means: the mean of the spectral data from channel 0 to channel n, and the mean of the spectral data from channel n to the highest channel N-1. The trend value obtained for each channel is subtracted from the corresponding spectral data, which flattens the spectral trend and normalizes it so that it is unaffected by individual differences between speakers, ambient noise, and the like. The trend-normalized spectral data is then compared with a suitably chosen reference value: spectral data larger than the reference value is set to "1" and spectral data smaller than the reference value is set to "0".
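The trend correction and binarization can be sketched per channel as below. The coefficient and reference value are described only as "suitable", so the defaults of 1.0 and 0.0 are assumptions, as is the name `binarize_frame`; the description also leaves the case of exact equality with the reference value open, and this sketch maps it to 0.

```python
def binarize_frame(spectrum, coefficient=1.0, reference=0.0):
    """Flatten the spectral trend of one frame and binarize each channel."""
    bits = []
    for n in range(len(spectrum)):
        lower = spectrum[: n + 1]   # channels 0 .. n
        upper = spectrum[n:]        # channels n .. N-1
        lower_mean = sum(lower) / len(lower)
        upper_mean = sum(upper) / len(upper)
        # Trend value: coefficient times the average of the two means.
        trend = coefficient * 0.5 * (lower_mean + upper_mean)
        flattened = spectrum[n] - trend
        bits.append(1 if flattened > reference else 0)
    return bits


# Hypothetical 4-channel frame.
bits = binarize_frame([10, 2, 8, 1])
```

For this frame the per-channel trend values are 7.625, about 4.83, about 5.58, and 3.125, so channels 0 and 2 lie above their trend and the result is 1, 0, 1, 0.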

At the same time, the voiced/unvoiced data (for example, 1 bit) supplied from the voiced/unvoiced extractor 7, the adjacent distance data (for example, 2 bits) supplied from the adjacent distance calculator 8, and the phonemic data (for example, 2 bits) supplied from the phonemic pattern extractor 9 are appended to the temporally corresponding binarized frame data (for example, 7 bits) to form binary feature data, and this binary feature data is supplied to the mode switching circuit 11.
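With the example widths above (7 + 1 + 2 + 2), each frame's binary feature data occupies 12 bits. A minimal packing sketch follows; the bit order (spectral bits first, then voiced/unvoiced, adjacent distance, phonemic) is an assumption, since the actual layout is defined by Fig. 2, and the name `pack_feature` is hypothetical.

```python
def pack_feature(spectral_bits, voiced_bit, distance_bits, phonemic_bits):
    """Concatenate the per-frame bit fields into one integer word.

    Returns (word, width). The field order is an assumed layout.
    """
    bits = (list(spectral_bits) + [voiced_bit]
            + list(distance_bits) + list(phonemic_bits))
    if any(b not in (0, 1) for b in bits):
        raise ValueError("all fields must contain only 0 or 1")
    word = 0
    for b in bits:
        word = (word << 1) | b  # first bit ends up most significant
    return word, len(bits)


# Hypothetical frame: 7 spectral bits, voiced, distance level 1, code (1, 0).
word, width = pack_feature([1, 0, 1, 0, 0, 1, 1], 1, [0, 1], [1, 0])
```

Packing each frame into a single integer keeps the registered pattern small and lets the later matching step compare whole frames with bitwise operations.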

During registration, this binary feature data is supplied to the registered pattern memory 12 via the mode switching circuit 11. For example, when M frames are extracted by the spectral pattern extractor 6, a data block like that shown in Fig. 2 is stored as a registered feature data block. During recognition, the input audio signal undergoes the processing described above to become binary feature data, which is supplied to the pattern matching determiner 13 as an input feature data block. Pattern matching is performed between the input feature data block and every registered feature data block.

That is, the pattern matching determiner 13 obtains the inter-frame distance for each pair of corresponding frames in the registered feature data block supplied from the registered pattern memory 12 and the input feature data block. For example, the inter-frame distance is obtained as the sum of the absolute values of the differences between data at the same bit positions, and the sum of the inter-frame distances is taken as the matching distance. The word corresponding to the registered feature data block whose matching distance is the smallest among those obtained for all registered feature data blocks, and is judged to be sufficiently small, is taken as the recognition result.
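For binary data, the sum of absolute differences at the same bit positions equals the number of differing bits, which can be computed as the population count of an exclusive OR. The sketch below assumes each frame is stored as an integer word; the names `block_distance` and `recognize`, the `max_distance` acceptance threshold, and the toy vocabulary are hypothetical.

```python
def block_distance(block_a, block_b):
    """Matching distance: total number of differing bits over all frames."""
    if len(block_a) != len(block_b):
        raise ValueError("blocks must contain the same number of frames")
    return sum(bin(a ^ b).count("1") for a, b in zip(block_a, block_b))


def recognize(input_block, registered, max_distance):
    """Return the closest registered word, or None if nothing is close enough."""
    best_word, best_distance = None, None
    for word, block in registered.items():
        d = block_distance(input_block, block)
        if best_distance is None or d < best_distance:
            best_word, best_distance = word, d
    if best_distance is None or best_distance > max_distance:
        return None
    return best_word


# Toy vocabulary with two frames of 4 bits per word (illustrative only).
registered = {"yes": [0b1010, 0b1100], "no": [0b0101, 0b0011]}
result = recognize([0b1010, 0b1101], registered, max_distance=2)
```

The `max_distance` check stands in for the "sufficiently small" condition in the description, so an utterance far from every registered block is rejected instead of being forced onto the nearest word.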

Although this embodiment has been described with a configuration in which voiced/unvoiced data, adjacent distance data, and phonemic data are appended to the binarized frame data, a configuration that appends at least one of these may be used, and data such as speech intensity (power) may be appended as well.

Furthermore, this invention is not limited to a hard-wired configuration; the processing may be performed in software using a microcomputer or a microprogram method.

[Effect of the invention]

In this invention, the mixed binarization pattern extractor binarizes the spectral data of the time-series frame data compressed by time-axis normalization. Binary data indicating the features of each frame, supplied from the voiced/unvoiced extractor, the adjacent distance calculator, and the phonemic pattern extractor, is appended to the binarized spectral data of the temporally corresponding frame to form binary feature data, and pattern matching is performed using this binary feature data.

Therefore, according to this invention, because binary data indicating the features of each frame as a whole is appended to the binarized spectral data, the speech features reinforce one another and the recognition rate improves. In addition, because binary feature data is used, the capacity of the registered pattern memory can be reduced and the matching processing time can be shortened.

[Brief explanation of drawings]

Fig. 1 is a block diagram of the configuration of an embodiment of this invention, Fig. 2 is a schematic diagram showing the data structure of the binary feature data block in the embodiment, and Fig. 3 is a schematic diagram used to explain a conventional speech recognition device.

Explanation of the main reference numerals in the drawings: 1: microphone, 5: spectrum converter, 6: spectral pattern extractor, 7: voiced/unvoiced extractor, 8: adjacent distance calculator, 9: phonemic pattern extractor, 10: mixed binarization pattern extractor, 11: mode switching circuit, 12: registered pattern memory, 13: pattern matching determiner.

Figures: Fig. 1; Fig. 2 (binary feature data block); Fig. 3A; Fig. 3B.

Claims

[Scope of Claims] A speech recognition device in which an input audio signal is converted into an N-channel frequency spectrum and time-series data of the N-channel frequency spectrum is input, characterized in that the spectral data is binarized for each frame of the time-series data to obtain binary data of the spectral data; features relating to a single frame, such as voiced/unvoiced status, the distance between adjacent frames, and phonemic character, are extracted to obtain binary data; the binary data relating to the frame is appended to the binary data of the spectral data to form binary feature data; and the input audio signal is recognized using the binary feature data.