JPS61188599A

JPS61188599A - Voice recognition

Info

Publication number: JPS61188599A
Application number: JP60029547A
Authority: JP
Inventors: 二矢田　勝行
Original assignee: Matsushita Electric Industrial Co Ltd
Current assignee: Panasonic Holdings Corp
Priority date: 1985-02-18
Filing date: 1985-02-18
Publication date: 1986-08-22

Abstract

(57) [Summary] This publication contains application data filed before electronic filing, so abstract data is not recorded.

Description

DETAILED DESCRIPTION OF THE INVENTION

Field of Industrial Application

The present invention relates to a speech recognition method that enables a machine to recognize the human voice.

Prior Art

Speech recognition technology has been actively developed and commercialized in recent years, but most of these systems are speaker-dependent: they recognize only the people who have registered their voices.

A speaker-dependent device requires the words to be recognized to be registered in advance, which is a heavy burden on the user unless the device is used continuously for a long time. In response, research on easy-to-use speaker-independent recognition technology, which requires no voice registration, has recently become very active.

Generally speaking, a speech recognition method performs pattern matching between the input speech and the standard speech stored in a dictionary (both are parameterized), and outputs the dictionary entry with the highest similarity as the recognition result. There would be no problem if the input speech and the dictionary speech were physically identical, but in general the same utterance differs from speaker to speaker and from one way of speaking to another, so the two are never exactly the same.

Physically, differences between speakers and differences in the manner of speaking appear as differences in spectral features and differences in temporal features.

That is, the shape of the articulatory organs (mouth, tongue, throat, etc.) differs from person to person, so the spectral shape of the same word differs between speakers.

The temporal features also differ depending on whether the word is spoken quickly or slowly.

Speaker-independent recognition must normalize these variations in the spectrum and in its time course before comparing the input with a standard pattern. Conventional methods do this with nonlinear matching (the DP method). One known example is a matching method that applies the DP method to both spectrum and time (Miwa et al., "Speaker-independent recognition of a large number of words using preselection and spectral matching", Acoustical Society of Japan, Speech Research Committee materials, S83-20 (1983-6)), which is described below as one conventional example.

The input speech is analyzed every frame (10 msec) by a 29-channel filter bank, and the result is denoted z_k (k = 1, 2, ..., N; N = 29). The standard spectrum used as the reference is denoted r_i (i = 1, 2, ..., 29).

Assume now that the frequency variation is confined to within ±2 channels of channel k.

In nonlinear spectral matching, the distance d between the input spectrum and the standard spectrum is given by (Equation 1), in which k_s and k_e are the start and the end of the speech, respectively.

[Equation 1, Equation 2, and the definition of the ramp function [ ] are illegible in the source text.]

The denominator of (Equation 2) is the path normalization term of nonlinear spectral matching, and the numerator is the value, at the end point, of the objective function computed by the following recurrence.

[Equation 3, the recurrence, is illegible in the source text.]

$$q(i, j) = |z_j - r_i| \qquad \text{(Equation 4)}$$

Word matching uses the inter-spectrum distance d obtained from (Equation 1) to compute the inter-word distance w.

The inter-word distance is also computed with the DP method in order to cope with variations in speaking rate.

$$w = \frac{h(I, J)}{I + J} \qquad \text{(Equation 5)}$$

Here I and J are the durations, in frames, of the standard pattern and the input pattern, respectively. h(I, J) in (Equation 5) is computed by the following recurrence.

[Equation 6, the recurrence, is illegible in the source text.]

Here i is the frame number of the standard pattern and j is the frame number of the input pattern.

The word in the dictionary that minimizes w in (Equation 5) is output as the result. Thus the conventional example normalizes spectral variation with (Equation 3) and temporal variation with (Equation 6).
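To give a feel for the cost of the time-axis DP step, the sketch below uses a common symmetric DP recurrence whose normalization by I + J matches (Equation 5). The recurrence itself is an assumption, since (Equation 6) is not legible in the source, and the names `dtw_word_distance` and `frame_dist` are hypothetical. In the conventional example each frame distance would itself come from a second DP pass over frequency, which this sketch does not model.

```python
def dtw_word_distance(ref, inp, frame_dist):
    """Symmetric DP matching; returns h(I, J) / (I + J) as in Equation 5.

    ref, inp   : sequences of frames (standard pattern, input pattern)
    frame_dist : function giving the distance between two frames
    The recurrence below is an assumed, commonly used form, not the
    patent's own Equation 6.
    """
    I, J = len(ref), len(inp)
    INF = float("inf")
    # h[i][j]: accumulated distance at reference frame i, input frame j (1-based).
    h = [[INF] * (J + 1) for _ in range(I + 1)]
    h[1][1] = 2.0 * frame_dist(ref[0], inp[0])
    for i in range(1, I + 1):
        for j in range(1, J + 1):
            if i == 1 and j == 1:
                continue
            d = frame_dist(ref[i - 1], inp[j - 1])
            h[i][j] = min(
                h[i - 1][j] + d,            # only the reference advances
                h[i - 1][j - 1] + 2.0 * d,  # both advance (diagonal, weight 2)
                h[i][j - 1] + d,            # only the input advances
            )
    return h[I][J] / (I + J)
```

Every one of the I × J grid cells needs a frame distance and a three-way minimum, and this is repeated for every word in the dictionary, which is the cost the invention sets out to avoid.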

Problems to Be Solved by the Invention

As described above, the conventional example applies the DP method twice. The DP method already requires a considerable amount of computation when applied only along the time axis (Equation 6); applying it along the frequency axis as well (Equation 3) clearly makes the amount of computation enormous.

The present invention solves this problem. Its object is to normalize both spectrum and time with a small amount of computation and to obtain a high recognition rate for the speech of unspecified speakers.

Means for Solving the Problems

The present invention expands or contracts the word length by dividing the interval between the start and end detected in the input speech into a fixed number of frames (J), obtains feature parameters for each of these frames (d parameters per frame), forms a d × J-dimensional input pattern, and computes the distance between this pattern and each speech standard pattern using a statistical distance measure. This absorbs spectral and temporal variation with a small amount of computation and allows the speech of unspecified speakers to be recognized efficiently.

Operation

As described above, the present invention replaces the spectral and temporal variations that cause problems in speaker-independent speech recognition with variations of the feature parameters and of their time course, assumes that these variations follow a multidimensional normal distribution, and incorporates them into the standard patterns as statistics. This makes it possible to absorb spectral deviations between the unknown input and the standard pattern, as well as frame misalignment after time normalization. The amount of computation is reduced, a high recognition rate is obtained, and a compact, low-cost speaker-independent speech recognition device can be realized.

Embodiment

The figure is a functional block diagram showing an implementation of the speech recognition method in one embodiment of the present invention.

In the figure, 1 is an A/D converter that converts the input speech into a digital signal, 2 is an acoustic analysis unit that analyzes the speech in each analysis interval (frame) and obtains spectral information, 3 is a feature parameter extraction unit that obtains the feature parameters, 4 is a speech interval detection unit that detects the start frame and the end frame, 5 is a time axis normalization unit that expands or contracts the word length, 6 is a distance calculation unit that calculates the similarity between the input pattern and the standard patterns, and 7 is a standard pattern storage unit that stores standard patterns created in advance. The operation of this configuration is described below.

The input speech is converted into a 12-bit digital signal by the A/D converter 1. The sampling frequency is 8 kHz. The acoustic analysis unit 2 performs LPC analysis by the autocorrelation method every frame (10 msec). The analysis order is 10, and the linear prediction coefficients α_0, α_1, α_2, ..., α_10 are obtained. The speech power W_0 of each frame is also obtained here. The feature parameter extraction unit 3 uses the linear prediction coefficients to obtain the LPC cepstral coefficients C_1 to C_p (p is the truncation order) and the normalized logarithmic residual power C_0.

LPC analysis and the extraction of LPC cepstral coefficients are described in detail in, for example, J. D. Markel and A. H. Gray, "Linear Prediction of Speech" (Japanese translation by Hisaki Suzuki), so the explanation is omitted here.
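As a rough illustration of the step from linear prediction coefficients to LPC cepstral coefficients, the sketch below uses the standard recursion for an all-pole model 1/A(z). It assumes the sign convention A(z) = 1 + a_1 z^-1 + ... + a_M z^-M, which may differ from the convention used in the embodiment, and it does not compute C_0 (the normalized logarithmic residual power). The function name `lpc_to_cepstrum` is hypothetical.

```python
def lpc_to_cepstrum(a, p):
    """Return [c_1, ..., c_p], the cepstrum of 1/A(z), truncated at order p.

    a : [a_1, ..., a_M], with A(z) = 1 + sum_k a_k z^-k (assumed convention)
    p : truncation order of the cepstrum
    Recursion: c_n = -a_n - (1/n) * sum_{k=1}^{n-1} k * c_k * a_{n-k},
    where a_n is taken as 0 for n > M.
    """
    M = len(a)
    c = []
    for n in range(1, p + 1):
        a_n = a[n - 1] if n <= M else 0.0
        acc = 0.0
        for k in range(1, n):
            if n - k <= M:
                acc += k * c[k - 1] * a[n - k - 1]
        c.append(-a_n - acc / n)
    return c
```

For a single-pole model with a_1 = -r the result is c_n = r^n / n, which is a convenient sanity check of the recursion.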

The feature parameter extraction unit 3 also obtains the logarithmic power LW_0 by the following equation.

$$LW_0 = 10 \log_{10} W_0 \qquad \text{(Equation 7)}$$

The speech interval detection unit 4 compares LW_0 obtained by (Equation 7) with a threshold θ_s. When frames with LW_0 > θ_s continue for a prescribed number of frames or more, the first of those frames is taken as the start frame F_s of the speech interval. After F_s, LW_0 is compared with a threshold θ_e, and when frames with LW_0 < θ_e continue for a prescribed number of frames or more, the first of those frames is taken as the end frame F_e of the speech interval. The interval from F_s to F_e is thus the speech interval. To simplify the explanation, F_s is renumbered as the first frame and the frame number is written i (i = 1, 2, ..., I), where I = F_e − F_s + 1.
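The start and end detection described above can be sketched as follows. This is only an illustration under stated assumptions: the parameter names `min_start_run` and `min_end_run` stand in for the two prescribed frame counts, whose symbols and values are not legible in the source, and falling back to the last frame when no end is found is an added assumption.

```python
import math

def log_power(frame_powers):
    """Equation 7: LW_0 = 10 * log10(W_0) for each frame power W_0 > 0."""
    return [10.0 * math.log10(w) for w in frame_powers]

def detect_endpoints(lw, theta_s, theta_e, min_start_run, min_end_run):
    """Return (F_s, F_e) as 1-based frame numbers, or None if no start is found.

    F_s: first frame of the first run of at least min_start_run frames
         with LW_0 > theta_s.
    F_e: first frame, after F_s, of a run of at least min_end_run frames
         with LW_0 < theta_e.
    """
    f_s = None
    run = 0
    for i, value in enumerate(lw):
        run = run + 1 if value > theta_s else 0
        if run >= min_start_run:
            f_s = i - run + 1  # 0-based index of the first frame of the run
            break
    if f_s is None:
        return None
    f_e = len(lw) - 1  # assumption: use the last frame if no end run appears
    run = 0
    for i in range(f_s + 1, len(lw)):
        run = run + 1 if lw[i] < theta_e else 0
        if run >= min_end_run:
            f_e = i - run + 1
            break
    return f_s + 1, f_e + 1
```

The word length in frames then follows as I = F_e − F_s + 1.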

The time axis normalization unit 5 linearly expands or contracts the word length by dividing it into J frames. The j-th frame after normalization and the i-th frame of the input speech are related by (Equation 8).

$$i = \left[ \frac{I-1}{J-1}\, j + \frac{J-I}{J-1} + 0.5 \right] \qquad \text{(Equation 8)}$$

Here [ ] denotes the largest integer not exceeding the enclosed value. In the embodiment, J = 16.
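The linear time normalization can be sketched as below. It assumes the reading i = [ (I−1)/(J−1) · j + (J−I)/(J−1) + 0.5 ] for (Equation 8), where [ ] is the floor, so that j = 1 maps to i = 1 and j = J maps to i = I. The function names are hypothetical, and integer arithmetic is used only to avoid floating-point rounding at the floor.

```python
def source_frame(j, I, J):
    """1-based input frame i selected for normalized frame j (1 <= j <= J).

    Computes floor((I-1)/(J-1) * j + (J-I)/(J-1) + 0.5) exactly by putting
    every term over the common denominator 2 * (J - 1).
    """
    if J < 2 or I < 1:
        raise ValueError("need J >= 2 and I >= 1")
    return (2 * (I - 1) * j + 2 * (J - I) + (J - 1)) // (2 * (J - 1))

def normalize_length(frames, J=16):
    """Pick J frames from an utterance of I frames by linear resampling."""
    I = len(frames)
    return [frames[source_frame(j, I, J) - 1] for j in range(1, J + 1)]
```

With I = 40 and J = 16, for example, the selected input frames start at 1, end at 40, and are spaced about 2.6 frames apart.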

Next, the feature parameters after normalization are arranged in time order to form the time-series pattern C_x. Let the feature parameters (LPC cepstral coefficients) of the j-th frame be C_{j,k} (k = 0, 1, 2, ..., p; p + 1 parameters). Then C_x is given by the following expression.

$$C_x = (C_{1,0}, C_{1,1}, \ldots, C_{1,p}, \ldots, C_{j,0}, C_{j,1}, C_{j,2}, \ldots, C_{j,p}, \ldots, C_{J,0}, \ldots, C_{J,p}) \qquad \text{(Equation 9)}$$

That is, C_x is a vector of J · (p + 1), i.e. J · d, dimensions (d is the number of parameters per frame).
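Building the input pattern amounts to concatenating the per-frame parameter lists in time order. A minimal sketch, with the hypothetical name `build_input_pattern`:

```python
def build_input_pattern(frames):
    """Flatten J frames of d = p + 1 parameters into one J * d vector.

    frames[j-1] holds [C_{j,0}, C_{j,1}, ..., C_{j,p}] for normalized frame j.
    """
    d = len(frames[0])
    if any(len(frame) != d for frame in frames):
        raise ValueError("every frame must hold the same number of parameters")
    return [c for frame in frames for c in frame]
```

With J = 16 and p = 4 (d = 5), as in the embodiment, the resulting vector has 80 elements.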

The distance calculation unit 6 calculates the similarity between the input pattern C_x and the standard pattern of each utterance stored in the standard pattern storage unit 7 using a statistical distance measure, and outputs the utterance with the smallest distance as the recognition result. Let C_n be the standard pattern (mean vector) of the n-th utterance stored in the standard pattern storage unit and W be the covariance matrix common to all target utterances. The Mahalanobis distance S_n between the input pattern C_x and the n-th standard pattern is then calculated by the following equation.

$$S_n = (C_x - C_n)^t \cdot W^{-1} \cdot (C_x - C_n) \qquad \text{(Equation 10)}$$

The superscript t denotes the transpose and −1 denotes the inverse matrix.

Expanding (Equation 10), discarding the term that does not depend on n, and calling the result D_n, D_n can be calculated by the following equation.

$$D_n = b_n - a_n^t \cdot C_x \qquad \text{(Equation 11)}$$

where

$$a_n = 2\, W^{-1} \cdot C_n \qquad \text{(Equation 12)}$$

$$b_n = C_n^t \cdot W^{-1} \cdot C_n \qquad \text{(Equation 13)}$$

D_n is calculated for all n (n = 1, 2, ..., N), and the utterance that minimizes D_n is taken as the recognition result.
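The simplification from the Mahalanobis distance to a linear form can be checked numerically. Expanding S_n = (C_x − C_n)^t W^-1 (C_x − C_n) with a symmetric W gives C_x^t W^-1 C_x − 2 C_n^t W^-1 C_x + C_n^t W^-1 C_n; the first term is the same for every n, so ranking by D_n = b_n − a_n^t C_x with a_n = 2 W^-1 C_n and b_n = C_n^t W^-1 C_n gives the same winner. The sketch below assumes the inverse covariance matrix is already available as a list of rows; all function names are hypothetical.

```python
def mat_vec(m, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(x * y for x, y in zip(row, v)) for row in m]

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def make_template(c_n, w_inv):
    """Return (a_n, b_n): a_n = 2 W^-1 C_n and b_n = C_n^t W^-1 C_n."""
    wc = mat_vec(w_inv, c_n)
    return [2.0 * x for x in wc], dot(c_n, wc)

def discriminant(c_x, a_n, b_n):
    """D_n = b_n - a_n^t C_x (one multiplication per vector element)."""
    return b_n - dot(a_n, c_x)

def mahalanobis(c_x, c_n, w_inv):
    """S_n = (C_x - C_n)^t W^-1 (C_x - C_n), for comparison only."""
    diff = [x - y for x, y in zip(c_x, c_n)]
    return dot(diff, mat_vec(w_inv, diff))

def recognize(c_x, templates):
    """Return the index n of the template (a_n, b_n) that minimizes D_n."""
    scores = [discriminant(c_x, a, b) for a, b in templates]
    return min(range(len(scores)), key=scores.__getitem__)
```

Because a_n and b_n are precomputed, the matrix product disappears from the recognition loop; only one inner product and one subtraction remain per template.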

Here N is the number of speech standard patterns stored in the standard pattern storage unit 7. In practice, each standard pattern is stored as a pair a_n and b_n, one pair for each of the N utterances.

Calculating (Equation 11) requires J · (p + 1) multiplications and one subtraction. In the embodiment J = 16 and p = 4, so the number of multiplications is 80, which is far smaller than in the conventional example ((Equation 3), (Equation 6)).
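The operation count can be reproduced with a few lines of arithmetic. The sketch below counts only the multiplications and subtractions named in the text (additions are not counted there either), and the function names are hypothetical. The ten-word case corresponds to the ten-digit vocabulary used in the experiment.

```python
def ops_per_template(J, p):
    """(multiplications, subtractions) for one evaluation of D_n."""
    return J * (p + 1), 1

def ops_per_utterance(J, p, n_words):
    """Totals when D_n is evaluated once for each of n_words templates."""
    mults, subs = ops_per_template(J, p)
    return mults * n_words, subs * n_words
```

For J = 16 and p = 4 this gives 80 multiplications and 1 subtraction per template, and 800 multiplications and 10 subtractions for a ten-word vocabulary.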

Next, the method of creating the standard patterns C_n and W (which are actually converted into a_n and b_n) is described.

The standard patterns are created from many data samples of each utterance. Let M be the number of samples used for each utterance. (Equation 8) is applied to each sample to bring the number of frames to J. For utterance n, the mean vector (Equation 14) is obtained, where C_{j,k}^{n,m} denotes the k-th order cepstral coefficient of the j-th frame of the m-th sample of utterance n. The covariance matrix W^n of utterance n is obtained by the same procedure as the mean vector. The covariance matrix W common to all utterances is obtained by the following equation.

$$W = \frac{1}{N}\left(W^1 + W^2 + \cdots + W^n + \cdots + W^N\right) \qquad \text{(Equation 16)}$$

C_n and W are converted into a_n and b_n by (Equation 12) and (Equation 13) and stored in the standard pattern storage unit 7 in advance.
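The training statistics can be sketched as follows: a mean vector and a covariance matrix per word, then the average of the per-word covariance matrices as the common matrix W. Dividing by M rather than M − 1 is an assumption, as the source does not show the covariance formula, and inverting W to obtain a_n and b_n is left out. All function names are hypothetical.

```python
def mean_vector(samples):
    """Element-wise mean of M time-normalized sample vectors of one word."""
    m = len(samples)
    return [sum(column) / m for column in zip(*samples)]

def covariance(samples, mean):
    """Covariance matrix of one word (divides by M, an assumed choice)."""
    m = len(samples)
    dim = len(mean)
    cov = [[0.0] * dim for _ in range(dim)]
    for sample in samples:
        diff = [x - mu for x, mu in zip(sample, mean)]
        for r in range(dim):
            for c in range(dim):
                cov[r][c] += diff[r] * diff[c] / m
    return cov

def pooled_covariance(per_word_covs):
    """W = (W^1 + W^2 + ... + W^N) / N, the matrix shared by all words."""
    n = len(per_word_covs)
    dim = len(per_word_covs[0])
    return [[sum(w[r][c] for w in per_word_covs) / n for c in range(dim)]
            for r in range(dim)]
```

Sharing one covariance matrix across all words is what allows the quadratic term in C_x to be dropped from the distance, leaving a linear function of C_x for each word.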

To confirm the effect of this embodiment, a recognition experiment on the ten digits ("ichi", "ni", "san", "yon", "go", "roku", "nana", "hachi", "kyuu", "zero") was carried out with adult male and female speakers.

The table shows the average recognition rate over all speakers and all words as a function of the truncation order p of the LPC cepstral coefficients (number of parameters d = p + 1).

Table

[The table contents are not present in the source text.]

According to the table, the recognition rate peaks at p = 4 and changes little at higher orders. p = 4 is therefore optimal, also from the viewpoint of reducing the number of dimensions used in the statistical distance measure and hence the amount of computation. The average recognition rate at p = 4 is as high as 98.6%, and recognizing one word requires only 800 multiplications and 10 subtractions, so the effect of this embodiment is clear.

Although (Equation 11) has been used as the statistical distance measure in this description, it may be replaced by another statistical distance measure such as the Bayes decision or a linear discriminant function. With the Bayes decision, for example, the log likelihood L_n is obtained by the following equation.

$$L_n = -(C_x - C_n)^t \cdot W_n^{-1} \cdot (C_x - C_n) - A_n \qquad \text{(Equation 17)}$$

The utterance in the standard pattern storage unit 7 that maximizes (Equation 17) is output.

The feature parameters need not be LPC cepstral coefficients; for example, the outputs of band-pass filters, PARCOR coefficients, autocorrelation coefficients, or modified forms of these can be used.

Effects of the Invention

In short, the present invention expands or contracts the word length by dividing the interval between the start and end detected in the input speech into J frames of fixed length, obtains feature parameters for each frame (d parameters per frame), forms a d × J-dimensional input pattern, computes the distance between this pattern and each speech standard pattern using a statistical distance measure, and takes the utterance corresponding to the most similar standard pattern as the recognition result. The spectral and temporal variations that have been a problem in speaker-independent speech recognition are replaced by variations of the feature parameters and of their time course; these variations are assumed to follow a multidimensional normal distribution and are incorporated into the standard patterns as statistics. Consequently, spectral deviations between the unknown input and the standard pattern, and frame misalignment after time normalization, are absorbed as long as they lie within a statistically acceptable range.

Furthermore, because the present invention does not separate spectral variation from its temporal variation but treats both at the same level (that is, as the time pattern of the feature parameters), the amount of computation is reduced, and the speech of unspecified speakers can be recognized with little computation and high accuracy. The effect is substantial.

[Brief Description of the Drawing]

The figure is a functional block diagram embodying the speech recognition method in one embodiment of the present invention.

1: A/D converter, 2: acoustic analysis unit, 3: feature parameter extraction unit, 4: speech interval detection unit, 5: time axis normalization unit, 6: distance calculation unit, 7: standard pattern storage unit.

Claims

[Claims]

(1) A speech recognition method characterized in that as many speech standard patterns as there are utterances to be recognized are prepared in advance, each composed of d × J dimensions, being a fixed length of J frames with d parameters per frame of the analysis interval; the start and end of the input speech are detected; the interval between the start and end is divided into J frames; d feature parameters are extracted for each frame and arranged in time order to form a d × J-dimensional input time-series vector; the similarity between this vector and each of said speech standard patterns is calculated using a statistical distance measure; and the utterance corresponding to the speech standard pattern with the highest similarity is taken as the recognition result.

(2) The speech recognition method according to claim 1, wherein the statistical distance measure is the Mahalanobis distance, the Bayes decision, or a linear discriminant function.

(3) The speech recognition method according to claim 1, wherein the feature parameter is any one of an LPC cepstral coefficient, an autocorrelation coefficient, and an output of a bandpass filter.