JPH0344317B2

Info

Publication number: JPH0344317B2
Application number: JP59058709A
Authority: JP
Inventors: Satoshi Fujii; Katsuyuki Futayada
Original assignee: Matsushita Electric Industrial Co Ltd
Current assignee: Panasonic Holdings Corp
Priority date: 1984-03-27
Filing date: 1984-03-27
Publication date: 1991-07-05
Also published as: JPS60202489A

Description

[Detailed description of the invention]

Field of industrial application

The present invention relates to a speech recognition method for automatically recognizing the content of speech.

Configuration of the conventional example and its problems

In recent years, research and development on speaker-independent, large-vocabulary speech recognition has become active.

Conventional phoneme discrimination, in speech recognition that is based on phoneme recognition, is described in Ide et al., "Study of consonant recognition using dynamic features of the spectrum," Proceedings of the Acoustical Society of Japan, 1981, 10, 2-1-2. Its flowchart is shown in FIG. 1.

First, the procedure for creating the standard patterns is described.
Every 10 ms, the outputs of a bank of 29 band-pass filters (Q = 6, center frequencies from 250 to 6300 Hz at 1/6-octave intervals) are obtained for the consonants and semivowels in the speech. Several adjacent bands along the frequency axis are then merged to give 6 channels. These 6 channels form one frame, and 5 consecutive frames are taken to form a 6 × 5 = 30-dimensional vector. These vectors are accumulated for each phoneme to obtain, for phoneme i, the mean $m_i$ and the covariance matrix $W_i$, whose inverse is $W_i^{-1}$. $m_i$ and $W_i^{-1}$ are stored in advance in the standard pattern storage unit as the standard patterns.

Next, the input unknown speech is acoustically analyzed. As shown in process A of FIG. 1, the outputs of the 29-channel band-pass filter bank, $X''_1 = (X''_{1,1}, X''_{1,2}, \dots, X''_{1,29})$, are obtained. As shown in process B, these outputs are merged several bands at a time into 6 channels, $X'_1 = (X'_{1,1}, X'_{1,2}, \dots, X'_{1,6})$. As shown in process C, the spectra of 5 consecutive frames, $X'_1, X'_2, X'_3, X'_4, X'_5$, are computed, and as shown in process D they are converted into a 6 × 5 = 30-dimensional vector $x = (x_1, x_2, \dots, x_{30})$. For this vector, the similarity is calculated by Bayes decision using the standard patterns described above, as shown in process E. The similarity $P_i$ for phoneme i is given by

$$P_i = (2\pi)^{-15/2}\,|W_i|^{-1/2}\exp\left\{-\tfrac{1}{2}(x - m_i)^t W_i^{-1}(x - m_i)\right\} \qquad (1)$$

$P_i$ is computed for every phoneme, the phoneme with the highest similarity is taken as the discrimination result (process F), and the result is transferred to the phoneme recognition unit.

This method takes the approach of actively capturing the movement of the spectrum for phonemes, such as consonants and semivowels, that are characterized by the temporal change of their spectra.

FIG. 2 shows an example of the spectral change of semivowels and palatalized sounds (yōon). FIG. 2a represents a case similar to the conventional example above. The horizontal axis shows time in frames, and the vertical axis shows the distance between adjacent spectra as the Euclidean distance between LPC cepstral coefficients. The distance curve stays at small values below a threshold (TH) for 13 frames, which shows that the semivowel or palatalized sound lasts for as long as 130 ms. In the conventional example, however, only the 5 frames marked with circles are used, because the amount of computation needed for the similarity calculation would otherwise become enormous. As a result, the features of semivowels and palatalized sounds cannot be fully captured, and discrimination accuracy is poor.

To overcome this drawback, one could use 13 frames as shown in FIG. 2b, in which circles mark the 13 frames used for the same semivowels and palatalized sounds as in FIG. 2a. In this case the features of semivowels and palatalized sounds can be captured sufficiently, but the similarity calculation requires an enormous amount of computation, which makes a device built this way expensive.

Object of the invention

The present invention eliminates the above drawbacks and, in automatic speech recognition, discriminates phonemes or syllables with high accuracy.
Its object is to provide a speech recognition method that achieves this with a small amount of computation.

Structure of the invention

To achieve this object, the present invention provides a speech recognition method characterized in that standard patterns created from the speech of many speakers are prepared in advance; the input unknown speech is divided into n consecutive fixed time intervals (frames); the speech is analyzed for each frame to obtain spectral information; X frames (n > X) are extracted from the n frames by thinning out at equal intervals those portions in which the spectral change between frames does not exceed a threshold; and phonemes or syllables are discriminated by calculating the similarity between the spectral information of the X frames and the standard patterns using a statistical distance measure.

Description of the embodiments

An embodiment of the present invention is described below with reference to the drawings. FIG. 3 is a block diagram showing one embodiment of a device that implements the speech recognition method of the present invention.

In the figure, 1 is an acoustic analysis unit, which analyzes speech input through a microphone or the like. Linear predictive analysis is used, and LPC cepstral coefficients are obtained every frame period (about 10 ms).

2 is a phoneme discrimination unit, which discriminates the phoneme of each frame using the LPC cepstral coefficients obtained by the acoustic analysis unit 1.

3 is a standard pattern storage unit, which stores standard patterns obtained in advance for each phoneme from the speech of many speakers.

4 is a segmentation unit, which detects speech intervals and determines the boundaries between phonemes (hereinafter called segmentation) based on the analysis output of the acoustic analysis unit 1.

5 is a phoneme recognition unit, which decides which phoneme each phoneme interval is, based on the results of the segmentation unit 4 and the phoneme discrimination unit 2. The result is a completed phoneme sequence.

6 is a word recognition unit, which matches this phoneme sequence against a word dictionary 7, whose entries are likewise written as phoneme sequences, and outputs the word with the highest similarity as the recognition result.

7 is the word dictionary mentioned above.

Next, the method is explained in more detail using the flowchart of FIG. 4, taking the recognition of semivowels and palatalized sounds (yōon) as an example. The method is effective not only for semivowels and palatalized sounds but also for other phonemes whose spectra change slowly over time, such as vowels, nasals, and fricative consonants.

Statistical distance measures include Bayes decision and the Mahalanobis distance. This embodiment is explained using the Mahalanobis distance, and LPC cepstral coefficients are used as the spectral information.

The five vowels, the semivowels, and the palatalized sounds in word utterances from many speakers are labeled in advance. The procedure for creating the standard patterns from this speech is as follows. For each of n consecutive frames from the start of each phoneme, LPC cepstral coefficients up to order N are obtained, and the coefficients up to order m (N ≥ m), that is, $C' = (C'_1, C'_2, \dots, C'_m)$, are extracted. Next, X frames (X < n) are extracted from the n consecutive frames so that they include at least some non-consecutive frames (in this embodiment, at one-frame intervals, that is, every other frame), and the X vectors $C'$ are concatenated to form $C = (C'_1, C'_2, \dots, C'_X)$. $C$ consists of M cepstral coefficients (M = X × m), that is, $C = (C_1, C_2, \dots, C_M)$. From these vectors, the mean $m_i$ for each phoneme (i is the phoneme name) and a covariance matrix $W$ common to all target phonemes are obtained. Let the inverse matrix be $W^{-1}$ and its (j, j′) element be $\sigma_{jj'}$. The weighting coefficient $a_{ij}$ of phoneme i for $C_j$ is then

$$a_{ij} = 2\sum_{j'=1}^{M} \sigma_{jj'}\, m_{ij'} \qquad (2)$$

and the average distance $d_i$ for phoneme i is

$$d_i = m_i^t W^{-1} m_i \qquad (3)$$

These $a_{ij}$ and $d_i$ are stored as the standard patterns in the standard pattern storage unit 3 of FIG. 3.

Next, the unknown speech input to the acoustic analysis unit 1 of FIG. 3 is analyzed frame by frame by linear prediction. For each of n consecutive frames, LPC cepstral coefficients of order N are obtained, and from them the coefficients up to order m (N ≥ m), $X'_1 = (X'_1, X'_2, \dots, X'_m)$, are extracted as shown in process G of FIG. 4.

Next, X frames are extracted from the n consecutive frames so that they include at least some non-consecutive frames (in this embodiment, every other frame), and the X frames $X'_1, X'_3, \dots, X'_X$ are computed as shown in process H.

Using the results of processes G and H, these are converted into an X × m = M-dimensional vector $x = (x_1, x_2, \dots, x_M)$ as shown in process I.

Using this $x$ and the standard patterns in the standard pattern storage unit 3, the similarity $l_i$ is obtained from

$$l_i = \sum_{j=1}^{M} a_{ij}\, x_j - d_i \qquad (4)$$

$l_i$ is computed for each frame of the input speech (process J), and the phoneme with the maximum similarity is transferred to the phoneme recognition unit 5 as the discrimination result (process K). The phoneme recognition unit 5 combines this result with the result of the segmentation unit 4 to create a time series of phonemes and sends it to the word recognition unit 6. The word recognition unit 6 matches it against the word dictionary 7, whose entries are written in advance as phoneme time series, and outputs the name of the word with the highest similarity as the recognition result.

A specific example is shown in FIG. 5.

FIG. 5a shows an example of the spectral change of semivowels and palatalized sounds. The horizontal axis shows time in frames, and the vertical axis shows the spectral distance between adjacent frames as the Euclidean distance between LPC cepstral coefficients. The distance curve does not exceed the threshold (marked TH) anywhere in the 13 frames, which shows that the spectrum changes slowly over time. The threshold is set to the value that gives the maximum recognition rate for the target phonemes (here, semivowels and palatalized sounds). It is therefore not necessary to use every frame to capture the temporal change of the spectrum. In this embodiment every other frame is dropped, and the 7 frames marked with circles are used. The vector of LPC cepstral coefficients needed for phoneme discrimination then has 6 × 7 = 42 dimensions. Table 1 compares the number of multiplications and additions per standard pattern for the conventional method of FIG. 2b and for this embodiment, separately for Bayes decision by equation (1) and for the Mahalanobis distance by equation (4).
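The standard-pattern construction and similarity calculation of equations (2) to (4) can be sketched as follows. This is a minimal illustration in plain Python, not the implementation of the patent: the function names (`decimate`, `stack`, `make_pattern`, `similarity`, `mahalanobis_sq`, `classify`) and the toy numbers are hypothetical, and the inverse covariance matrix $W^{-1}$ is assumed to be symmetric and already available.

```python
# Illustrative sketch only. All names and numbers here are hypothetical.

def decimate(frames, step=2):
    """Keep every `step`-th frame, starting with the first (13 frames -> 7 for step=2)."""
    return frames[::step]


def stack(frames):
    """Concatenate per-frame coefficient lists into one M-dimensional vector."""
    return [c for frame in frames for c in frame]


def make_pattern(mean, w_inv):
    """Equations (2) and (3): a_j = 2 * sum_j' w_inv[j][j'] * mean[j'], d = mean^t W^-1 mean."""
    wm = [sum(row[k] * mean[k] for k in range(len(mean))) for row in w_inv]
    a = [2.0 * v for v in wm]
    d = sum(m * v for m, v in zip(mean, wm))
    return a, d


def similarity(x, pattern):
    """Equation (4): l = sum_j a_j * x_j - d."""
    a, d = pattern
    return sum(aj * xj for aj, xj in zip(a, x)) - d


def mahalanobis_sq(x, mean, w_inv):
    """Squared Mahalanobis distance, used here only to cross-check equation (4)."""
    diff = [xi - mi for xi, mi in zip(x, mean)]
    n = len(diff)
    return sum(diff[j] * w_inv[j][k] * diff[k] for j in range(n) for k in range(n))


def classify(x, patterns):
    """Return the phoneme name whose stored pattern gives the largest similarity."""
    return max(patterns, key=lambda name: similarity(x, patterns[name]))


if __name__ == "__main__":
    # 13 frames of 6 coefficients each, thinned to 7 frames and stacked to 42 values.
    frames = [[0.1 * f] * 6 for f in range(13)]
    print(len(stack(decimate(frames))))

    # Toy 2-dimensional example with an assumed shared inverse covariance.
    w_inv = [[2.0, 0.5], [0.5, 1.0]]
    means = {"y": [1.0, 0.0], "w": [0.0, 1.0]}
    patterns = {name: make_pattern(m, w_inv) for name, m in means.items()}
    print(classify([0.9, 0.2], patterns))
```

Because $l_i$ equals $x^t W^{-1} x$ minus the squared Mahalanobis distance to $m_i$, and the first term is the same for every phoneme, choosing the largest $l_i$ chooses the smallest Mahalanobis distance while needing only M products per standard pattern at recognition time.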

[Table 1]

As Table 1 shows, with Bayes decision this embodiment reduces the amount of computation substantially, to about 30% of that of the conventional method. With the Mahalanobis distance it reduces the computation to half of that of the conventional method.

Table 2 compares the discrimination accuracy for semivowels and palatalized sounds (yōon) between the conventional method of FIG. 2b and the present embodiment of FIG. 5a.
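As a rough cross-check of the reductions reported for Table 1 (about 30% for Bayes decision, about half for the Mahalanobis distance), the following sketch counts multiplications under a simple assumed model. The counting functions are hypothetical and are not the entries of Table 1, which is not reproduced in this text. The sketch also assumes 6 coefficients per frame for both methods, so the 13-frame conventional vector has 78 dimensions and the 7-frame vector of the embodiment has 42.

```python
# Illustrative arithmetic only. The counting model below is an assumption.

def quadratic_form_mults(dim):
    """Products in (x - m)^t W^-1 (x - m): dim*dim for W^-1 (x - m), plus dim for the dot product."""
    return dim * dim + dim


def linear_form_mults(dim):
    """Products in sum_j a_j * x_j, as in equation (4): one per dimension."""
    return dim


conventional_dim = 6 * 13  # assumed: 6 coefficients x 13 frames (FIG. 2b)
proposed_dim = 6 * 7       # 6 coefficients x 7 frames (this embodiment)

bayes_ratio = quadratic_form_mults(proposed_dim) / quadratic_form_mults(conventional_dim)
linear_ratio = linear_form_mults(proposed_dim) / linear_form_mults(conventional_dim)

if __name__ == "__main__":
    print(f"Bayes decision, eq. (1):       {bayes_ratio:.3f} of conventional")
    print(f"Mahalanobis distance, eq. (4): {linear_ratio:.3f} of conventional")
```

Under this model the quadratic form of equation (1) scales roughly with the square of the dimension and the linear form of equation (4) scales with the dimension itself, which is consistent with a larger saving for Bayes decision than for the Mahalanobis distance.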

[Table 2]

That is, compared with the conventional method, this embodiment improves both the recognition rate and the standard deviation, which represents the variation.
One likely reason is that thinning out the frames gives a global view of the slow temporal change of the spectrum, so its features are captured efficiently. Another is that removing redundant spectral information reduces the variation caused by factors such as speaker and context.

The present invention is characterized by using the spectral information of a plurality of non-consecutive frames to discriminate phonemes or syllables. Taking the case of 13 frames as an example, the methods of FIG. 5b and 5c can also be applied, depending on the type of phoneme or syllable.

The curve in FIG. 5b shows the distance between adjacent spectra of the consonant /s/, expressed using LPC cepstral coefficients. Frames 1, 2, and 3 and frames 12 and 13, where the threshold (TH) is exceeded, are used consecutively, and the interval that does not exceed the threshold is thinned out at equal intervals. This method is effective for discriminating phonemes and syllables, such as the consonants /s/ and /h/, that differ in the movement at the boundary and in the spectrum of the frication portion.

The curve in FIG. 5c shows the distance between adjacent spectra of the consonant /z/, expressed using LPC cepstral coefficients. Frames 1, 2, 3, and 4, where the threshold (TH) is exceeded, are used consecutively, and the interval that does not exceed the threshold is thinned out at equal intervals. This is effective for discriminating phonemes and syllables, such as the consonants /z/, /c/, and /k/, that differ in the movement of the burst portion and in the spectrum of the frication portion.

The embodiments above use consecutive frames in the regions that exceed the threshold, but the frames there need not be consecutive. The spectral information of the present invention can be obtained by any of linear predictive analysis, analysis by a band-pass filter bank, or fast Fourier transform (FFT) analysis. The similarity is best calculated with a statistical distance measure, and a distance based on Bayes decision, the Mahalanobis distance, a linear discriminant function, and the like are particularly suitable.

Effects of the invention

In summary, the present invention provides a speech recognition method characterized in that speech is divided into n consecutive frames; the speech is analyzed for each frame to obtain spectral information; X frames (n > X) are extracted from the n frames by thinning out at equal intervals those portions in which the spectral change between frames does not exceed a threshold; and phonemes or syllables are discriminated by calculating the similarity between the spectral information of these X frames and the standard patterns using a statistical distance measure. By efficiently capturing the features of the temporal change of the spectrum of the target phonemes or syllables, it achieves high recognition performance with little variation. The amount of computation needed to discriminate phonemes or syllables can be reduced to between 1/2 and 1/3 of the conventional amount, so the cost of the device can be lowered. The invention has these and other advantages.
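The frame-selection rule described for FIG. 5a to 5c (keep the frames where the inter-frame spectral distance exceeds TH, and thin the remaining stretches at equal intervals) can be sketched as follows. This is a minimal sketch under stated assumptions: the function name `select_frames`, the convention that `distances[k]` belongs to frame k + 1, the choice to start thinning at the first frame of each low-change stretch, and all distance values are hypothetical. The text does not state which frames inside the thinned stretches of FIG. 5b and 5c are actually kept.

```python
# Illustrative sketch only. Names, conventions, and numbers are assumptions.

def select_frames(distances, threshold, step=2):
    """Return the 1-based numbers of the frames to keep.

    Frames whose distance exceeds `threshold` are all kept. Within each run of
    frames at or below the threshold, only every `step`-th frame is kept,
    counting from the first frame of that run.
    """
    selected = []
    run_pos = 0  # position inside the current low-change run
    for k, dist in enumerate(distances):
        if dist > threshold:
            selected.append(k + 1)
            run_pos = 0
        else:
            if run_pos % step == 0:
                selected.append(k + 1)
            run_pos += 1
    return selected


if __name__ == "__main__":
    th = 0.5
    # FIG. 5a-like case: the curve never exceeds TH in 13 frames.
    print(select_frames([0.1] * 13, th))
    # FIG. 5b-like case: frames 1-3 and 12-13 exceed TH.
    print(select_frames([0.9, 0.8, 0.7] + [0.1] * 8 + [0.8, 0.9], th))
    # FIG. 5c-like case: frames 1-4 exceed TH.
    print(select_frames([0.9, 0.8, 0.7, 0.6] + [0.1] * 9, th))
```

For the FIG. 5a-like input this keeps 7 of the 13 frames, matching the count given in the description. A different starting offset inside each low-change run would be an equally valid reading of "thinned out at equal intervals".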

[Brief explanation of drawings]

FIG. 1 is a flowchart explaining phoneme discrimination in a conventional speech recognition method. FIG. 2 is a diagram explaining the conventional frame extraction method. FIG. 3 is a block diagram showing one embodiment of a speech recognition device that implements the speech recognition method of the present invention. FIG. 4 is a flowchart explaining phoneme discrimination in the speech recognition method according to one embodiment of the present invention. FIG. 5 is a diagram explaining the frame extraction method of the present invention.

1: acoustic analysis unit, 2: phoneme discrimination unit, 3: standard pattern storage unit, 4: segmentation unit, 5: phoneme recognition unit, 6: word recognition unit, 7: word dictionary.

Claims

[Claims]

1. A speech recognition method characterized in that speech is divided into n consecutive fixed time intervals (frames); the speech is analyzed for each frame to obtain spectral information; X frames (X < n) are extracted from the n frames by thinning out at equal intervals those portions in which the rate of temporal change of the spectrum does not exceed a threshold; and phonemes or syllables are discriminated by calculating, using a statistical distance measure, the similarity between the spectral information of the X extracted frames and standard patterns created in advance from the speech of many speakers.

2. The speech recognition method according to claim 1, wherein the spectral information is obtained by any one of linear predictive analysis, analysis by a band-pass filter bank, and fast Fourier transform analysis.

3. The speech recognition method according to claim 1, wherein the statistical distance measure is any one of a distance based on Bayes decision, the Mahalanobis distance, and a linear discriminant function.