JP3445117B2

JP3445117B2 - Acoustic analysis method for speech recognition

Info

Publication number: JP3445117B2
Application number: JP26782697A
Authority: JP
Inventors: 眞吾黒岩; 恒夫加藤; 文廣谷戸
Original assignee: KDDI Corp
Current assignee: KDDI Corp
Priority date: 1997-09-12
Filing date: 1997-09-12
Publication date: 2003-09-08
Anticipated expiration: 2017-09-12
Also published as: JPH1185200A

Description

[Detailed Description of the Invention]

[0001]

[Technical Field of the Invention] The present invention relates to an acoustic analysis method for speech recognition, and in particular to one well suited to use in a spoken dialogue system.

[0002]

[Prior Art] One example of a conventional acoustic analysis method and apparatus for speech recognition is the one invented by the present inventors and filed as a patent application (Japanese Unexamined Patent Publication No. 9-90990). This prior art is outlined briefly with reference to the block diagram of FIG. 4. When the M-th speech input M, consisting of frames x_1 to x_T, arrives, the feature vector calculation unit 31 obtains a feature vector C_M(t) (where t = 1, 2, ..., T) for each frame. The feature vector C_M(t) is sent to the storage unit 32 and to the subtraction unit 34. The storage unit 32 consists of a first storage unit 32a and a second storage unit 32b. The first storage unit 32a stores the feature vectors C_M(t) of the speech input M of the M-th utterance, and the second storage unit 32b stores the feature vectors C_{M-1}(t) of the speech input (M-1) of the (M-1)-th utterance.

[0003] The average calculation unit 33 obtains the cepstrum mean of speech input (M-1), given by equation (1) below, from the feature vectors C_{M-1}(1), C_{M-1}(2), ..., C_{M-1}(T) of that input.

[0004]

[Equation 1]
The subtraction unit 34 subtracts the cepstrum mean of speech input (M-1) from the feature vector C_M(t) of speech input M supplied by the feature vector calculation unit 31, performing the operation given by equation (2) below, and obtains the cepstrum-mean-normalized (hereinafter, CMN) cepstrum <C_M(t)>.

[0005]

[Equation 2]
The pattern matching unit 35 matches the CMN cepstrum against the standard patterns stored in the standard pattern storage unit 36 and outputs the recognition result.
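Equations (1) and (2) survive in this text only as placeholders. A reconstruction consistent with the descriptions in [0003] and [0004] is given here as an assumption: the overbar for the mean is a notational choice not taken from the source, and T is the number of frames of utterance (M-1).

```latex
\begin{align}
\bar{C}_{M-1} &= \frac{1}{T} \sum_{t=1}^{T} C_{M-1}(t) \tag{1} \\
\langle C_M(t) \rangle &= C_M(t) - \bar{C}_{M-1} \tag{2}
\end{align}
```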

[0006]

[Problem to Be Solved by the Invention] As is clear from equation (2), the CMN cepstrum <C_M(t)> is strongly affected by the immediately preceding utterance. When that preceding utterance is short, for example just "a" or "hai" ("yes"), its cepstrum mean cannot adequately reflect the characteristics of speech, and performing CMN with this mean lowers the recognition rate.

[0007] In addition, because the prior art requires the preceding utterance, a speech input consisting of a single utterance cannot be recognized in real time.
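The prior-art scheme and its weakness with a short preceding utterance can be illustrated with a minimal Python sketch. The function names, the two-dimensional toy vectors, and the additive "content + line characteristic" model of a frame are assumptions made for illustration; none of them comes from the patent.

```python
def cepstral_mean(frames):
    """Average a list of equal-length feature vectors, dimension by dimension."""
    count = len(frames)
    return [sum(frame[d] for frame in frames) / count for d in range(len(frames[0]))]


def cmn_with_previous_mean(current_frames, previous_frames):
    """Prior-art style CMN: subtract the previous utterance's mean from each frame."""
    previous_mean = cepstral_mean(previous_frames)
    return [[c - m for c, m in zip(frame, previous_mean)] for frame in current_frames]


# Toy data (assumed): each frame = phonetic content + a constant line characteristic.
line = [0.5, -0.2]
previous_short = [[1.0 + line[0], 0.0 + line[1]]]   # a single frame of "a"
current = [[-0.5 + line[0], 0.4 + line[1]]]         # one frame of the next utterance

short_mean = cepstral_mean(previous_short)          # "a" content + line, not line alone
normalized = cmn_with_previous_mean(current, previous_short)
```

In this toy case the line characteristic is removed, but the content of "a" is subtracted as well, so the normalized frame is shifted by the content of the preceding utterance. That is the bias described in [0006].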

[0008] An object of the present invention is to eliminate the problems of the prior art described above and to provide an acoustic analysis method for speech recognition that does not lower the recognition rate, or that limits its drop, even when the preceding utterance is short. Another object is to provide an acoustic analysis method for speech recognition that can recognize speech in real time without needing any speech prior to the current utterance.

[0009]

[Means for Solving the Problem] To achieve these objects, the present invention is an acoustic analysis method that obtains feature vectors from input speech, normalizes them, and performs speech recognition by pattern matching against prestored standard speech patterns. Its first feature is that, from the average value of the feature vectors and an utterance-content-dependent part estimated using the recognition result of the pattern matching, the content-dependent part is cancelled to estimate a cepstrum mean that does not depend on utterance content, and this cepstrum mean is used to normalize the feature vectors for recognition. Its second feature is that the content-dependent part is estimated using the pattern-matching recognition result of the immediately preceding sentence. Its third feature is that the content-dependent part is estimated using the partial recognition result of the speech uttered so far in the current utterance.

[0010] According to the first and second features, the cepstrum mean can be estimated accurately even for a short utterance such as "a", so the recognition rate is not lowered, or its drop is limited, even when the preceding utterance is short. According to the third feature, speech can be recognized in real time using only the current utterance, without any earlier speech.

[0011]

[Embodiments of the Invention] The present invention is described in detail below with reference to the drawings. FIG. 1 is a block diagram of one embodiment of the invention. Its main part consists of a feature vector calculation unit 1, a storage unit 2, a normalization parameter calculation unit 3, a subtraction unit 4, a pattern matching unit 5, a standard pattern storage unit 6, and a recognition-result state sequence storage unit 7. The circuit consisting of a delay unit 11, a subtraction unit 12, an average calculation unit 13, and a pattern matching unit 14 is, as will become clear from the description below, provided only to recognize the first speech input (M = 0) and is an auxiliary part of the invention. This auxiliary part is connected to the main part through switches SW1 and SW2, which are shown functionally. SW1 and SW2 are in the connection state drawn with solid lines only while the first speech input (M = 0) is being received. Once the second speech input (M = 1) arrives, they switch to the state drawn with dotted lines.

[0012] The operation of this embodiment is as follows. When the first speech input (M = 0) arrives, the feature vector calculation unit 1 obtains and outputs a feature vector C_0(t) (where t = 1 to T) for each frame of the speech. The feature vectors C_0(t) are stored in the first storage unit 2a of the storage unit 2 and are also input to the delay unit 11 and the average calculation unit 13 of the auxiliary part.

[0013] The first storage unit 2a stores the feature vectors C_0(t). Meanwhile, the delay unit 11 delays C_0(t) by the time the average calculation unit 13 needs to compute their mean. The subtraction unit 12 subtracts that mean from the C_0(t) received from the delay unit 11 and outputs the resulting CMN cepstrum <C_0(t)> to the pattern matching unit 14. As is well known, the pattern matching unit 14 matches <C_0(t)> against the standard patterns stored in the standard pattern storage unit 6 and outputs the recognition result.

[0014] The pattern matching units 5 and 14 use, for example, the well-known Viterbi algorithm (Viterbi search) to match the outputs of the subtraction units 4 and 12 against the standard patterns stored in the standard pattern storage unit 6. Running the Viterbi algorithm yields the optimal state sequence Q_M for the given observation sequence, that is, for speech input M.
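As a reminder of how a Viterbi search produces the optimal state sequence, the following is a minimal generic sketch in Python. It is not the patent's pattern matching unit: the two-state model, the probabilities, and all names are assumptions chosen only to make the example self-contained.

```python
import math


def viterbi(log_init, log_trans, log_emit):
    """Return the most likely state sequence for a hidden Markov model.

    log_init[s]     : log probability of starting in state s
    log_trans[r][s] : log probability of moving from state r to state s
    log_emit[t][s]  : log likelihood of frame t under state s
    """
    n_states = len(log_init)
    score = [log_init[s] + log_emit[0][s] for s in range(n_states)]
    backpointers = []
    for t in range(1, len(log_emit)):
        new_score, pointers = [], []
        for s in range(n_states):
            # Best predecessor of state s at time t.
            best = max(range(n_states), key=lambda r: score[r] + log_trans[r][s])
            pointers.append(best)
            new_score.append(score[best] + log_trans[best][s] + log_emit[t][s])
        score = new_score
        backpointers.append(pointers)
    # Trace back from the best final state.
    state = max(range(n_states), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


def to_log(rows):
    """Convert nested lists of probabilities to log probabilities."""
    return [[math.log(p) for p in row] for row in rows]


# Toy two-state model (assumed values).
log_init = [math.log(0.9), math.log(0.1)]
log_trans = to_log([[0.8, 0.2], [0.2, 0.8]])
log_emit = to_log([[0.9, 0.1], [0.8, 0.2], [0.3, 0.7], [0.1, 0.9]])

best_path = viterbi(log_init, log_trans, log_emit)
```

For these toy values the search stays in state 0 for the first two frames and moves to state 1 for the last two. In the patent's setting the per-frame likelihoods would come from matching the CMN cepstrum against the standard patterns.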

[0015] The recognition-result state sequence storage unit 7 stores the optimal state sequence Q_0(t) (= {q_0(1), q_0(2), ..., q_0(T)}) obtained by the pattern matching unit 14 for speech input M (= 0).

[0016] Next, when the next speech input (M = 1) arrives, SW1 and SW2 switch to the dotted-line positions shown in the figure. The feature vector calculation unit 1 obtains the feature vectors C_1(t) (where t = 1 to T) of this speech and outputs them in synchronization with the input. The feature vectors C_1(t) are stored in the first storage unit 2a and are also input to the subtraction unit 4. By this time, the feature vectors C_0(t) previously held in the first storage unit 2a have been transferred to the second storage unit 2b. The normalization parameter calculation unit 3 uses the feature vectors C_0(t) stored in the second storage unit 2b and the optimal state sequence, in other words the maximum likelihood state sequence, Q_0(t) (= {q_0(1), q_0(2), ..., q_0(T)}) stored in the recognition-result state sequence storage unit 7 to evaluate equation (3) below and obtain the normalization parameter [C_{M-1}], that is, the cepstrum mean.

[0017]

[Equation 3]
Here, ω(q_{M-1}(t)) is defined as the mean of the average cepstrum of state q_{M-1}(t) minus the average cepstrum of long-term speech. It is learned and stored as one parameter of the standard patterns. The first term on the right-hand side of the equation represents the utterance of speech input (M = 0) plus the line characteristic, and the second term represents an estimate of the utterance minus long-term speech. For example, if speech input (M = 0) is a short sound consisting only of "a", the first term is "a" + line characteristic and the second term is "a" - long-term speech. The cepstrum mean [C_0] output by the normalization parameter calculation unit 3 is therefore as follows.

[0018]
[C_0] = {"a" + line characteristic} - {"a" - long-term speech} = {line characteristic} + {long-term speech}
Next, the subtraction unit 4 obtains the CMN cepstrum <C_1(t)> by the expression below. Suppose, as an example, that speech input (M = 1) is "hai" ("yes"). The output of the feature vector calculation unit 1 is then "hai" + line characteristic, so <C_1(t)> is as follows.

[0019]
<C_1(t)> = C_1(t) - [C_0] = {C_1(t) without line characteristic + line characteristic} - {line characteristic + long-term speech} = C_1(t) without line characteristic - long-term speech
Thus, according to this embodiment, a cepstrum mean that does not depend on utterance content can be estimated and used for cepstrum mean normalization.
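Equation (3) survives in this text only as a placeholder. A reconstruction consistent with the term-by-term description in [0017], and with the cancellation worked through in [0018] and [0019], is sketched below. The symbols h (line characteristic), \mu_{long} (long-term speech mean), \bar{s}_0 (mean content of utterance 0) and s_1(t) (content of frame t of utterance 1) are introduced here for illustration and do not appear in the source.

```latex
\begin{align}
[C_{M-1}] &= \frac{1}{T}\sum_{t=1}^{T} C_{M-1}(t)
           - \frac{1}{T}\sum_{t=1}^{T} \omega\bigl(q_{M-1}(t)\bigr) \tag{3} \\
[C_0] &\approx (\bar{s}_0 + h) - (\bar{s}_0 - \mu_{\mathrm{long}})
       = h + \mu_{\mathrm{long}} \\
\langle C_1(t) \rangle &= C_1(t) - [C_0]
       \approx (s_1(t) + h) - (h + \mu_{\mathrm{long}})
       = s_1(t) - \mu_{\mathrm{long}}
\end{align}
```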

[0020] The pattern matching unit 5 then matches <C_1(t)> against the standard patterns stored in the standard pattern storage unit 6 and outputs the recognition result. At the same time, it outputs the optimal state sequence Q_1(t) (= {q_1(1), q_1(2), ..., q_1(T)}) of the recognition result, which is stored in the recognition-result state sequence storage unit 7. Thereafter, the operations described above are repeated and speech recognition continues.

[0021] As described above, according to this embodiment, even if the preceding sentence is a short sound such as "a", the CMN cepstrum <C_M(t)> equals the current speech minus the long-term speech estimated from the preceding sentence. It is therefore largely unaffected by the length of the preceding utterance, and a drop in the recognition rate can be prevented.
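The first embodiment's estimate can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the patent's implementation: the `omega` table, the state labels, the toy vectors, and the additive frame model are all hypothetical.

```python
def mean_vector(vectors):
    """Average a list of equal-length vectors, dimension by dimension."""
    count = len(vectors)
    return [sum(v[d] for v in vectors) / count for d in range(len(vectors[0]))]


def content_independent_mean(prev_frames, prev_states, omega):
    """Observed mean of the previous utterance minus the mean of omega over its
    recognized state sequence (the content-dependent part)."""
    observed = mean_vector(prev_frames)
    content = mean_vector([omega[q] for q in prev_states])
    return [o - c for o, c in zip(observed, content)]


def normalize(frames, norm_param):
    """Subtract the normalization parameter from every frame."""
    return [[c - n for c, n in zip(frame, norm_param)] for frame in frames]


# Toy model (assumed): frame = state mean + line characteristic.
long_term = [0.2, 0.1]
line = [0.5, -0.2]
state_mean = {"a": [1.0, 0.0], "h": [-0.5, 0.4]}
# omega[q] = state mean minus long-term speech mean, as described in [0017].
omega = {q: [m - l for m, l in zip(vec, long_term)] for q, vec in state_mean.items()}

prev_states = ["a", "a"]                      # recognized sequence of the short utterance
prev_frames = [[m + c for m, c in zip(state_mean[q], line)] for q in prev_states]
norm_param = content_independent_mean(prev_frames, prev_states, omega)

current = [[m + c for m, c in zip(state_mean["h"], line)]]
normalized = normalize(current, norm_param)
```

With these toy values the normalization parameter comes out as line characteristic + long-term speech, and the normalized frame as the "h" state mean minus long-term speech, regardless of the fact that the preceding utterance contained only "a".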

[0022] A second embodiment of the invention is described next with reference to the block diagram of FIG. 2. Unlike the first embodiment, this embodiment does not use the preceding sentence. Instead, it uses the partial recognition result of the speech uttered so far in the current utterance, so that recognition can be performed in real time.

[0023] In FIG. 2, reference numeral 8 denotes the long-term speech CM_0 obtained in advance from training data. The other reference numerals denote the same or equivalent parts as in FIG. 1. As will become clear from the description below, CM_0 is used to compute the normalization parameter only for a short time at the start of recognition, for example for one or a few frame periods of speech input M (for example, the input period of X_1, or of X_1 to X_t (t < T_s)), and is no longer used once this period has passed.

[0024] The operation of this embodiment is described with reference to the flowchart of FIG. 3. In step S1, as soon as speech input begins to be accepted, the feature vector calculation unit 1 obtains the feature vector C_0(t) (where t = 1 to T) for each frame of the speech and outputs the vectors one after another. These feature vectors C_0(t) are stored sequentially in the storage unit 2a and are input to the subtraction unit 4. In step S2, it is determined, based on the maximum likelihood state sequence Q_0(t), whether several frames of speech have been input. In other words, it is determined whether the time T_s corresponding to those frames has elapsed. If the determination is negative, the process goes to step S9 and CM_0 is used as the normalization parameter, that is, [C_0(t-1)] = CM_0.

[0025] In step S4, the normalization parameter calculation unit 3 uses equation (4) below to estimate the normalization parameter [C_M(t-1)], that is, the cepstrum mean that does not depend on utterance content, from the per-frame feature vectors stored sequentially in the storage unit 2a and the state sequence supplied sequentially by the pattern matching unit 5 described below.
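Equation (4) survives in this text only as a placeholder. One reconstruction that is consistent with the description of equation (3) and with the statement that the estimate at time t uses only the input up to time (t-1) is given below. It is an assumption, and the summation index \tau is introduced here for illustration.

```latex
\begin{equation}
[C_M(t-1)] = \frac{1}{t-1}\sum_{\tau=1}^{t-1} C_M(\tau)
           - \frac{1}{t-1}\sum_{\tau=1}^{t-1} \omega\bigl(q_M(\tau)\bigr) \tag{4}
\end{equation}
```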

[0026]

[Equation 4]
Next, in step S5, the subtraction unit 4 subtracts the normalization parameter [C_0(t-1)] from the feature vector C_0(t) to obtain the CMN cepstrum <C_0(t)>. In step S6, the pattern matching unit 5 matches <C_0(t)> against the standard patterns stored in the standard pattern storage unit 6 and outputs the recognition result. At the same time, in step S7, the state sequence Q_0(t) obtained so far is sent sequentially to the normalization parameter calculation unit 3. In step S8, it is determined whether speech recognition has finished. If it has not, the process returns to step S1 and feature vector calculation continues. In step S2, it is again determined whether the speech input has reached several frames.

[0027] Once the speech input reaches several frames and the determination in step S2 becomes affirmative, the process goes to step S3, where the normalization parameter calculation unit 3 disconnects CM_0 (reference numeral 8) and uses the state sequence Q_0(t-1) to compute the normalization parameter. In step S5, the subtraction unit 4 obtains the CMN cepstrum <C_M(t)> using the normalization parameter obtained in step S4. In step S6, <C_M(t)> is matched against the standard patterns stored in the standard pattern storage unit 6 and the recognition result is output. In step S7, the state sequence of the recognition result obtained at the same time in step S6 is output to the normalization parameter calculation unit 3.

[0028] Thereafter, the operations described above are repeated and recognition of speech input M continues.

[0029] As described above, according to this embodiment, the cepstrum mean can be estimated from the partial recognition result of the speech uttered so far, without using the preceding sentence. That is, the cepstrum mean used for normalization at time t can be estimated from the input up to time (t-1), which makes real-time speech recognition possible.
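The frame-by-frame flow of the second embodiment can be sketched as follows. This is a simplified illustration, not the patent's procedure: `decode_state` is a hypothetical stand-in for the partial Viterbi result (here a nearest-neighbour lookup), and `warmup_frames` stands in for the start-up period T_s during which CM_0 is used. All names and toy values are assumptions.

```python
def streaming_cmn(frames, decode_state, omega, cm0, warmup_frames):
    """Normalize each frame with a mean estimated from earlier frames only."""
    if warmup_frames < 1:
        raise ValueError("at least one warm-up frame is needed before estimating")
    dim = len(cm0)
    sum_obs = [0.0] * dim      # running sum of observed feature vectors
    sum_omega = [0.0] * dim    # running sum of omega over decoded states
    normalized, states = [], []
    for t, frame in enumerate(frames):
        if t < warmup_frames:
            norm = list(cm0)   # start-up: use the pre-trained long-term mean
        else:
            # t frames have been accumulated so far (indices 0 .. t-1).
            norm = [(o - w) / t for o, w in zip(sum_obs, sum_omega)]
        out = [c - n for c, n in zip(frame, norm)]
        state = decode_state(out)
        normalized.append(out)
        states.append(state)
        for d in range(dim):
            sum_obs[d] += frame[d]
            sum_omega[d] += omega[state][d]
    return normalized, states


# Toy model (assumed): frame = state mean + line characteristic.
long_term = [0.2, 0.1]
line = [0.5, -0.2]
state_mean = {"a": [1.0, 0.0], "h": [-0.5, 0.4]}
omega = {q: [m - l for m, l in zip(vec, long_term)] for q, vec in state_mean.items()}


def nearest_state(vec):
    """Hypothetical decoder: pick the state whose omega is closest to the frame."""
    return min(omega, key=lambda q: sum((v - w) ** 2 for v, w in zip(vec, omega[q])))


frames = [[m + c for m, c in zip(state_mean[q], line)] for q in ["a", "h"]]
normalized, states = streaming_cmn(frames, nearest_state, omega, long_term, warmup_frames=1)
```

In this toy run the first frame is normalized with CM_0 alone, so the line characteristic is still present. From the second frame on, the running estimate equals line characteristic + long-term speech and the frame is normalized as in the first embodiment.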

[0030] A third embodiment of the invention is described next. In this embodiment, only the portions that the recognition result of the pattern matching unit 5 judges to contain speech are used for the estimation. Conversely, the silent portions |Q| judged to contain no speech are not used. This embodiment greatly reduces the risk of misrecognition caused by estimation errors.

[0031] The expression evaluated by the normalization parameter calculation unit in this embodiment is as follows.

[0032]

[Equation 5]
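Equation (5) is not reproduced in this text, so its exact form is not known here. The idea described in [0030], estimating only from frames whose recognized state is speech, might be sketched as below. The set of silence labels, the averaging over the kept frames only, and all names are assumptions made for illustration.

```python
def speech_only_mean(frames, states, omega, silence_states):
    """Estimate the content-independent mean from speech frames only.

    Frames whose recognized state is in silence_states are skipped, and the
    average is taken over the remaining frames. Returns None if no speech
    frame is available.
    """
    kept = [(f, omega[q]) for f, q in zip(frames, states) if q not in silence_states]
    if not kept:
        return None
    dim = len(frames[0])
    return [sum(f[d] - w[d] for f, w in kept) / len(kept) for d in range(dim)]


# Toy data (assumed): speech frame = state mean + line characteristic.
long_term = [0.2, 0.1]
line = [0.5, -0.2]
state_mean = {"a": [1.0, 0.0], "h": [-0.5, 0.4]}
omega = {q: [m - l for m, l in zip(vec, long_term)] for q, vec in state_mean.items()}

states = ["a", "sil", "h"]
frames = [
    [1.0 + line[0], 0.0 + line[1]],    # "a"
    [0.05, 0.02],                      # silence: background noise only
    [-0.5 + line[0], 0.4 + line[1]],   # "h"
]
norm_param = speech_only_mean(frames, states, omega, silence_states={"sil"})
```

Here the silent frame is ignored, so the estimate stays at line characteristic + long-term speech instead of being pulled toward the background noise.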

[0033]

[Effects of the Invention] As is clear from the above description, according to the present invention the cepstrum mean can be estimated with high accuracy even from a short utterance, so the recognition rate is not lowered, or its drop is limited, even when the preceding utterance is short.

[0034] Furthermore, because the cepstrum mean can be estimated from the portion of the utterance spoken so far, speech recognition with a high recognition rate can be performed in real time.

[Brief Description of the Drawings]

FIG. 1 is a block diagram showing the configuration of one embodiment of the present invention.

FIG. 2 is a block diagram showing the configuration of a second embodiment of the present invention.

FIG. 3 is a flowchart explaining the operation of the second embodiment.

FIG. 4 is a block diagram showing the configuration of a conventional example.

[Explanation of Symbols]

1: feature vector calculation unit; 2: storage unit; 3: normalization parameter calculation unit; 4: subtraction unit; 5: pattern matching unit; 6: standard pattern storage unit; 7: recognition-result state sequence storage unit.

Continuation of front page
(56) References: JP-A-7-191689 (JP, A); JP-A-8-202385 (JP, A); JP-A-9-90990 (JP, A); JP-A-7-334184 (JP, A); Shingo Kuroiwa, Dieu Tran, Tsuneo Kato, Fumihiro Yato, "A study of real-time cepstrum mean normalization that takes utterance content into account," Proceedings of the Acoustical Society of Japan, Acoustical Society of Japan, September 1997, Autumn 1997 I, 159-160
(58) Fields searched (Int. Cl.7, DB name): G10L 15/02, G10L 15/20, G10L 21/02

Claims

(57) [Claims]

1. An acoustic analysis method for speech recognition that obtains feature vectors from input speech, normalizes the feature vectors, and performs speech recognition by pattern matching against prestored standard speech patterns, characterized in that, from the average value of the feature vectors and an utterance-content-dependent part estimated using the recognition result of the pattern matching, the utterance-content-dependent part is cancelled to estimate a cepstrum mean that does not depend on utterance content, and the cepstrum mean is used to normalize the feature vectors for speech recognition.

2. The acoustic analysis method for speech recognition according to claim 1, characterized in that the utterance-content-dependent part is estimated using the pattern-matching recognition result of the immediately preceding sentence.

3. The acoustic analysis method for speech recognition according to claim 1, characterized in that the utterance-content-dependent part is estimated using the partial recognition result of the speech uttered so far in the current utterance.

4. The acoustic analysis method for speech recognition according to claim 3, characterized in that the estimation of the cepstrum mean is started after a predetermined time has elapsed from the start of speech input.

5. The acoustic analysis method for speech recognition according to claim 2, characterized in that the utterance-content-dependent part is estimated using only the portions of the pattern-matching recognition result that are judged to contain speech.