JPH0830294A

JPH0830294A - Device and method for voice recognition

Info

Publication number: JPH0830294A
Application number: JP6186392A
Authority: JP
Inventors: Toshihiro Kasuya; 敏宏糟谷; Noriya Murakami; 憲也村上
Original assignee: N T T DATA TSUSHIN KK; NTT Data Communications Systems Corp
Current assignee: N T T DATA TSUSHIN KK; NTT Data Corp
Priority date: 1994-07-15
Filing date: 1994-07-15
Publication date: 1996-02-02

Abstract

PURPOSE: To provide a voice recognition device that recognizes noisy speech with high precision by estimating the dynamic feature values of the input speech. CONSTITUTION: Noise alone is input beforehand and its static feature values are extracted (301). These are added (302) to the static feature values of speech only, prepared in advance, to generate a static feature codebook 304 for noise-mixed speech. Differential computations are then performed using the codebook 304 and the previously prepared static feature values of speech only (403), generating a conversion table 405 that, for every feature value of the noise-mixed speech, converts a dynamic feature value of noise-mixed speech into a dynamic feature value of speech only. During recognition, the noise-mixed speech to be recognized is input and a speech-only static feature value of the input speech is computed (303). Meanwhile, the dynamic feature value of the input speech is converted into a speech-only dynamic feature value by applying the table 405 (404 and 306). Recognition is then performed on the basis of the speech-only static and dynamic feature values.

Description

Detailed Description of the Invention

[0001]

[Industrial Field of Application] The present invention relates to an improvement of a voice recognition device that performs voice recognition by creating, as needed, a model adapted to the noise in an environment where the noise characteristics fluctuate frequently.

[0002]

[Prior Art] Speech mixed with background noise differs from speech uttered in a noise-free environment in the feature parameters extracted from it for voice recognition, such as the spectrum. Therefore, to maintain a high recognition rate under noise, it is necessary to perform either some form of noise removal or an adaptation process such as rebuilding the model.

[0003] For methods that use a codebook, such as the discrete-distribution or semi-continuous hidden Markov model (HMM), which are standard voice recognition methods, devices are known that recognize speech uttered under noise by regenerating the model with noisy speech against the codebook generated by clustering the features, and using the resulting mapping.

[0004] More recently, there are also devices that prepare separate hidden Markov models for speech and for noise and recognize speech using MAP (maximum a posteriori probability) estimation, or using a method that convolves cepstra, which are speech features, in the linear spectral domain to generate a cepstrum of speech under noise.

[0005] Conventional voice recognition methods are described in, for example: B.-H. Juang and L. R. Rabiner, "Signal restoration by spectral mapping," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., ICASSP87, pp. 2368-2371, Apr. 1987; H. Gish, Y.-L. Chow and J. R. Rohlicek, "Probabilistic vector mapping of noisy speech parameters for HMM word spotting," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., ICASSP90, pp. 117-120, Apr. 1990; M. J. Gales and S. Young, "An improved approach to the hidden Markov model decomposition of speech and noise," ICASSP92; and S. F. Boll, IEEE Trans. ASSP-27, No. 2 (1978).

[0006]

[Problem to Be Solved by the Invention] The speech features used for voice recognition include static features, obtained by dividing the input speech into a plurality of frames (continuously repeated sections of a fixed duration) and analyzing each frame, and dynamic features, obtained by differentiating the static features with respect to time.

[0007] In voice recognition under noise, static features can be estimated by the above-mentioned method of superimposing cepstra, the speech features, in the linear spectral domain. However, estimating dynamic features, for example obtaining the delta cepstrum (a dynamic feature) by differentiating the cepstrum (a static feature) with respect to time, requires an approach different from the one used for static features.

[0008] For this reason, the conventional voice recognition methods described above estimated static features but did not estimate dynamic features.

[0009] Consequently, even though static-feature estimation could compensate for the shift in static features caused by background noise, it could not compensate for the shift in dynamic features caused by background noise, which made it difficult to further improve the accuracy of voice recognition under noise.

[0010] Accordingly, an object of the present invention is to provide a voice recognition device capable of more accurate voice recognition under noise by estimating the dynamic features of the input speech.

[0011]

[Means for Solving the Problem] According to the present invention, there is provided a voice recognition device that computes static and dynamic features from noise-mixed speech to be recognized and passes them to voice recognition processing, the device comprising: conversion table creating means that receives a silent section in advance and, based on the features of the noise contained in the silent section and features of speech only prepared beforehand, creates a conversion table for converting dynamic features of noise-mixed speech into dynamic features of speech only; means that receives the noise-mixed speech to be recognized at recognition time and obtains its dynamic features; and means that estimates the dynamic features of the speech alone within the input noise-mixed speech by applying the previously created conversion table to the obtained dynamic features of the noise-mixed speech; wherein the estimated speech-only dynamic features are passed to the voice recognition processing.

[0012] The present invention also provides a dynamic feature estimation method executed by the above voice recognition device.

[0013]

[Operation] In the present invention, the features of the noise contained in a silent section are regarded as estimates of the features of the noise mixed into the input speech. Based on these noise features and on speech-only features prepared beforehand, a conversion table for converting the dynamic features of the actually input noise-mixed speech into noise-free, speech-only dynamic features is created in advance. At recognition time, the dynamic features of the input noise-mixed speech are first obtained, and the conversion table is then applied to them to estimate the speech-only dynamic features. The speech-only static features can be obtained with conventional techniques. In this way, speech-only features free of the influence of noise are obtained for the dynamic features as well as the static features, and the accuracy of voice recognition improves as a result.

[0014]

[Embodiments] Embodiments of the present invention are described in detail below with reference to the drawings.

[0015] First, as background related to the present invention, a general method of estimating the cepstrum, a static feature of speech, is described.

[0016] Actual noise-mixed speech, in which background noise or the like is mixed into the speech to be recognized, is the input for voice recognition. Let Cs+n denote the cepstrum feature obtained by analyzing this noise-mixed speech during voice recognition processing, and let Cs denote the cepstrum feature that would be obtained from the same speech without noise.

[0017] Consider maximizing the posterior probability density p(Cs|Cs+n), that is, the conditional probability of the event Cs given Cs+n. In this maximization, the noise-free speech cepstrum Cs can be estimated by subtracting the cepstrum of the noise component from the noise-mixed speech cepstrum Cs+n in the linear spectral domain and converting the resulting difference back to a cepstrum.

[0018] In this subtraction, however, physically impossible results can occur, such as a spectral component becoming negative, which can cause large errors in the estimate of the noise-free speech cepstrum Cs.

[0019] One alternative is to add the target noise to the speech data used for training, create in advance a noise codebook whose entries correspond to those of a codebook created from noise-free speech, and perform voice recognition by mapping from the noise codebook to the noise-free codebook. Another is to add the noise-free speech cepstrum Cs and the cepstrum of the noise component in the linear spectral domain and convert the sum back to a cepstrum.

[0020] FIG. 1 conceptually shows the relationship between the static features and the dynamic features of speech (here and below, "speech" refers to both noise-free speech and noise-mixed speech).

[0021] The dynamic features of speech are generally the time derivative of its static features. In voice recognition, the basic unit of analysis of the input speech is the frame, a section of fixed duration repeated continuously. During voice recognition processing, the input speech is divided into a plurality of frames (shown along the horizontal axis of FIG. 1), each frame is analyzed, and the static feature of the noise-free speech, indicated by reference numeral 101 in FIG. 1, is calculated. Because the dynamic feature is obtained by differentiating the static feature with respect to time, as described above, it is given by the slope (reference numeral 102 in FIG. 1), relative to the horizontal axis, of the discrete time series formed by the static features in past and future frames (line segment AB in FIG. 1).
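The slope described in this paragraph can be sketched in a few lines of Python. This is only an illustration: the patent does not say how the slope is computed, so the least-squares regression over a window of `k` past and `k` future frames, the edge clamping, and the names `delta` and `ramp` are all assumptions.

```python
def delta(frames, k=2):
    """Least-squares slope of each coefficient over frames t-k .. t+k.

    `frames` is a list of static feature vectors, one per frame.
    The regression window and edge clamping are illustrative choices.
    """
    n = len(frames)
    dim = len(frames[0])
    denom = 2 * sum(i * i for i in range(1, k + 1))
    out = []
    for t in range(n):
        row = []
        for d in range(dim):
            num = 0.0
            for i in range(1, k + 1):
                past = frames[max(t - i, 0)][d]        # clamp at the first frame
                future = frames[min(t + i, n - 1)][d]  # clamp at the last frame
                num += i * (future - past)
            row.append(num / denom)
        out.append(row)
    return out


# A one-dimensional static feature that rises by 0.5 per frame.
ramp = [[0.5 * t] for t in range(7)]
deltas = delta(ramp)
```

Away from the edges the computed slope equals the per-frame increase of the ramp, and adding a constant offset to every frame leaves it unchanged; only a disturbance that varies from frame to frame alters the dynamic feature.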

[0022] When noise is added to the input speech, the noise has an effect as indicated by reference numeral 103 in FIG. 1, and the static feature 104 of the noise-added speech takes a value different from the static feature 101 of the noise-free speech. As a result, the dynamic feature of the noise-added speech, given by the slope relative to the horizontal axis of the discrete time series formed by its static features (line segment CD in FIG. 1), also differs from that of the noise-free speech (102), as indicated by reference numeral 105.

[0023] For the static features of speech, there have been attempts to adapt to noise or to remove its influence using the various methods described above. In contrast to those efforts on static features, the present invention concerns a method that addresses the dynamic features of speech.

[0024] FIG. 2 shows the overall configuration of one embodiment of a voice recognition device to which the present invention can be applied.

[0025] As illustrated, this general device comprises an input terminal 201, an A/D conversion unit 202, a voice section detection unit 203, a frame division unit 204, an acoustic analysis unit 205, a voice recognition processing unit 206, an output terminal 207, a feature codebook 208, and a recognition dictionary 209.

[0026] A speech signal (analog signal) input from a microphone, telephone line, or the like through the input terminal 201 is quantized by the A/D conversion unit 202. The voice section detection unit 203 divides the quantized samples into voice sections and silent sections. The frame division unit 204 then divides the quantized samples into a plurality of frames and outputs them to the acoustic analysis unit 205.
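As a rough illustration of the frame division step, the sketch below cuts a sample sequence into fixed-length frames. The function name `split_frames` and the `frame_len` and `hop` parameters are hypothetical; the text only says that frames are sections of a fixed duration, so allowing frames to overlap through `hop` is an assumption.

```python
def split_frames(samples, frame_len, hop):
    """Split a sequence of quantized samples into fixed-length frames.

    A trailing remainder shorter than `frame_len` is dropped.
    """
    frames = []
    start = 0
    while start + frame_len <= len(samples):
        frames.append(samples[start:start + frame_len])
        start += hop
    return frames


samples = list(range(10))  # stand-in for A/D converter output
frames = split_frames(samples, frame_len=4, hop=2)
```

Setting `hop` equal to `frame_len` gives non-overlapping frames.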

[0027] The acoustic analysis unit 205 uses the feature codebook 208, which consists of a noise-free static feature codebook (reference numeral 501 in FIG. 5) and a noise-free dynamic feature codebook (reference numeral 307 in FIG. 3), to extract the static and dynamic features of the speech for each frame output from the frame division unit 204, across both voice sections and silent sections.

[0028] The static and dynamic features extracted by the acoustic analysis unit 205 are output to the voice recognition processing unit 206, which performs voice recognition by comparing both sets of features against the recognition dictionary 209. The recognition result is output from the output terminal 207.

[0029] FIG. 3 shows a configuration commonly adopted for the acoustic analysis unit 205 of this voice recognition device.

[0030] As illustrated, the general configuration of the acoustic analysis unit 205 comprises a feature extraction unit 301, a noise-mixed static feature generation unit 302, a static feature quantization unit 303, a noise-mixed static feature codebook 304, a delta feature extraction unit 305, a dynamic feature quantization unit 306, a noise-free dynamic feature codebook 307, and a plurality of delay units 308.

[0031] In this configuration, before the features of a voice section are extracted, a noise section is first input from the frame division unit 204 shown in FIG. 2 through the feature extraction unit 301, and the noise-mixed static feature generation unit 302 generates the noise-mixed static feature codebook 304 from that noise section and stores it in advance.

[0032] Next, the speech to be recognized (noise-mixed speech) is input from the frame division unit 204 one frame at a time, and the feature extraction unit 301 obtains its static features by processing such as LPC (linear predictive coding) analysis. Because these static features are continuous quantities, the static feature quantization unit 303 converts them into quantized static features using the noise-mixed static feature codebook 304, in order to reduce the amount of computation in the voice recognition processing unit 206. The quantized static features are output from the static feature quantization unit 303 to the voice recognition processing unit 206 shown in FIG. 2.
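The quantization step can be pictured as a nearest-codeword search. The sketch below is an assumption-laden illustration: the patent does not name a distance measure here, so squared Euclidean distance, the function name `quantize`, and the toy codebook values are all hypothetical.

```python
def quantize(vector, codebook):
    """Return the index of the codeword closest to `vector`.

    Closeness is squared Euclidean distance, chosen only for illustration.
    """
    best_index, best_dist = 0, float("inf")
    for index, codeword in enumerate(codebook):
        dist = sum((v - c) ** 2 for v, c in zip(vector, codeword))
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index


# Toy two-dimensional codebook standing in for codebook 304.
codebook = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.5]]
index = quantize([0.9, 1.2], codebook)  # nearest codeword is [1.0, 1.0]
```

Replacing each continuous feature vector with a codeword index is what lets the recognizer work with a small discrete set of symbols.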

[0033] Meanwhile, the continuous-valued features are output from the feature extraction unit 301 to the delay units 308, where they are delayed by several frames and held (the leftmost delay unit 308 in the figure holds the current frame, and units further to the right hold progressively earlier frames). The delta feature extraction unit 305 calculates the dynamic features of the speech to be recognized (noise-mixed speech) from the frames output by the delay units 308 and outputs them to the dynamic feature quantization unit 306. Because these dynamic features are also continuous quantities, the dynamic feature quantization unit 306 converts them into quantized dynamic features using the noise-free dynamic feature codebook 307 prepared in advance (that is, a dictionary for dynamic features). The quantized dynamic features are output from the dynamic feature quantization unit 306 to the voice recognition processing unit 206 of FIG. 2, which performs voice recognition using them together with the static features output from the static feature quantization unit 303.

[0034] FIG. 4 shows the configuration of one embodiment in which the present invention is applied to the acoustic analysis unit 205 of the voice recognition device of FIG. 2.

[0035] In addition to the general configuration shown in FIG. 3, the acoustic analysis unit 205 of this embodiment further comprises a dynamic feature conversion table generation unit 403, a vector conversion unit 404, and a dynamic feature conversion table 405. With these added elements, when background noise shifts the dynamic features of the input speech, the vector conversion unit 404 corrects the shift, so that the dynamic features of the noise-free speech can be estimated. As a result, the accuracy of voice recognition can be improved further. In FIG. 4, elements identical to those in FIG. 3 are given the same reference numerals.

[0036] The vector conversion unit 404 requires the dynamic feature conversion table 405 in order to vector-convert the dynamic features obtained by analyzing the input speech. The dynamic feature conversion table 405 is generated by the dynamic feature conversion table generation unit 403 before speech input begins, after the noise-mixed static feature generation unit 302 has generated the noise-mixed static feature codebook 304.

[0037] This embodiment is now described in more detail, beginning with the noise-mixed static feature generation unit 302.

[0038] FIG. 5 shows the details of the noise-mixed static feature generation unit 302.

[0039] As illustrated, the noise-mixed static feature generation unit 302 comprises a noise-free static feature codebook 501, a feature retrieval unit 502, a cosine transform unit 503, a linear transform unit 504, an addition unit 505, a logarithmic transform unit 506, an inverse cosine transform unit 507, a feature storage unit 508, and a noise spectrum generation unit 509.

[0040] First, a silent section (a section containing only noise), such as the period before or after an utterance, output from the feature extraction unit 301 is analyzed by the noise spectrum generation unit 509 to calculate the noise power spectrum Φn, which is used as the estimate of the noise power spectrum.

[0041] Next, a single static feature of noise-free speech, stored in cepstral form, is retrieved from the noise-free static feature codebook 501 prepared in advance. To convert it from a cepstrum to a linear power spectrum, this static feature is cosine-transformed in the cosine transform unit 503 and then converted to the linear domain in the linear transform unit 504.

[0042] The relationship between the cepstrum Cs of noise-free speech and the linear power spectrum Φs of noise-free speech is given by equation (1) below.

[0043]
Φs = exp(Cos(Cs)) ... (1)
where Cos() denotes the cosine transform.

[0044] The linear power spectrum Φs calculated by equation (1) is added in the addition unit 505 to the noise power spectrum Φn output from the noise spectrum generation unit 509, thereby synthesizing the spectrum of noise-mixed speech. This spectrum is inverse-transformed back to a cepstrum by the logarithmic transform unit 506 and the inverse cosine transform unit 507, as shown in equation (2) below, yielding the estimated cepstrum Cs+n.

[0045]

[Equation 1]
Cs+n = Cos^-1(log(Φs + Φn)) ... (2)
The noise-mixed cepstrum Cs+n computed by equation (2) in the logarithmic transform unit 506 and the inverse cosine transform unit 507 is registered in the noise-mixed static feature codebook 304 by the feature storage unit 508. Repeating this process for every feature in the noise-free static feature codebook 501 completes the noise-mixed static feature codebook 304.
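The codebook generation loop described in paragraphs [0041] to [0045] can be sketched as follows. This is a minimal illustration under stated assumptions: the patent does not specify the exact cosine transform, so an orthonormal DCT-II matrix (whose inverse is its transpose) stands in for Cos(), and all function names and numeric values are hypothetical.

```python
import math


def dct_matrix(n):
    """Orthonormal DCT-II matrix, used here as a stand-in for Cos()."""
    rows = []
    for k in range(n):
        scale = math.sqrt((1.0 if k == 0 else 2.0) / n)
        rows.append([scale * math.cos(math.pi * (j + 0.5) * k / n) for j in range(n)])
    return rows


def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def transpose(m):
    return [list(col) for col in zip(*m)]


def add_noise_to_cepstrum(c_s, phi_n, cos_m):
    """Map a clean cepstrum Cs to the noise-mixed cepstrum Cs+n."""
    phi_s = [math.exp(x) for x in matvec(cos_m, c_s)]   # Phi_s = exp(Cos(Cs))
    phi_sn = [s + n for s, n in zip(phi_s, phi_n)]      # add noise power spectrum
    log_spec = [math.log(p) for p in phi_sn]            # logarithmic transform
    return matvec(transpose(cos_m), log_spec)           # inverse cosine transform


def build_noisy_codebook(clean_codebook, phi_n):
    """Repeat the conversion for every entry of the clean codebook."""
    cos_m = dct_matrix(len(phi_n))
    return [add_noise_to_cepstrum(c, phi_n, cos_m) for c in clean_codebook]


# Toy values: two clean codewords and an estimated noise power spectrum.
clean_codebook = [[1.0, 0.2, -0.1, 0.05], [0.3, -0.4, 0.2, 0.0]]
phi_n = [0.5, 0.4, 0.3, 0.2]
noisy_codebook = build_noisy_codebook(clean_codebook, phi_n)
```

Because each noisy codeword is produced from exactly one clean codeword, the two codebooks stay index-aligned, which is what later allows a noisy entry to be paired with its clean counterpart.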

[0046] The dynamic feature conversion table 405 is then set up as follows.

[0047] FIG. 6 shows the details of the dynamic feature conversion table generation unit 403.

[0048] As illustrated, the dynamic feature conversion table generation unit 403 comprises feature retrieval units 603 and 604, a conversion vector calculation unit 605, and a conversion vector storage unit 607.

[0049] The dynamic feature conversion table generation unit 403 generates the dynamic feature conversion table 405 after generation of the noise-mixed static feature codebook 304 is complete, using the noise-free static feature codebook 501 and the noise-mixed static feature codebook 304.

[0050] First, the feature retrieval unit 603 retrieves a feature Cs (a cepstrum of noise-free speech) from the noise-free static feature codebook 501 prepared in advance, and the feature retrieval unit 604 retrieves the corresponding feature Cs+n (a cepstrum of noise-mixed speech) from the noise-mixed static feature codebook 304. The power spectrum Φs of the feature Cs (the power spectrum of noise-free speech) is given by equation (3) below, and the power spectrum Φs+n of the feature Cs+n (the power spectrum of noise-mixed speech) is given by equation (4) below.

[0051]
Φs = exp(Cos(Cs)) ... (3)
Φs+n = exp(Cos(Cs+n)) ... (4)
Furthermore, from the relationship Φs+n = Φs + Φn, equation (5) below is derived.

[0052]
exp(Cos(Cs+n)) = exp(Cos(Cs)) + Φn ... (5)
Equation (5) can be regarded as a relation between the cepstrum Cs+n of noise-mixed speech and the cepstrum Cs of noise-free speech. To simplify the notation, let x = Cs+n and f(x) = Cs in equation (5); partially differentiating both sides with respect to x yields equation (6) below.

[0053]

[Equation 2]
Rearranging equation (6) yields equation (7) below.

[0054]

[Equation 3]
where Cos'(x) is

[Equation 4]

[0055] Substituting x = Cs+n and f(x) = Cs and rearranging gives equation (9) below.

【００５６】[0056]

【数５】尚、cosine変換は線形変数なのでCos'(x)はxに係らず一
定である。逆cosine変換についても同様のことが言え
る。そのため、（９）式中のCos-1'(Cs)及びCos'(Cs+n)
は、一定値（行列）である。[Equation 5] Because the cosine transform is a linear transform, Cos'(x) is constant regardless of x. The same holds for the inverse cosine transform. Therefore, Cos^-1'(Cs) and Cos'(Cs+n) in equation (9) are constant values (matrices).
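
The images for equations (6) to (9) are not reproduced in this text. The block below is a reconstruction from equation (5) and the surrounding definitions, not the patent's own typeset form. Two points are assumptions: Cos'(x) is taken to denote the derivative of Cos with respect to x, and the placement of the exponential ratio between the two constant matrices in (9) is inferred from the statement that Cos^-1'(Cs) and Cos'(Cs+n) both appear in that equation.

```latex
% Reconstruction (assumed form), starting from eq. (5) with x = C_{s+n}, f(x) = C_s.
% Phi_n does not depend on x, so it vanishes on differentiation.
\begin{align*}
\exp(\mathrm{Cos}(x)) &= \exp(\mathrm{Cos}(f(x))) + \Phi_n \\
\exp(\mathrm{Cos}(x))\,\mathrm{Cos}'(x)
  &= \exp(\mathrm{Cos}(f(x)))\,\mathrm{Cos}'(f(x))\,f'(x) && \text{(6)} \\
f'(x) &= \frac{\exp(\mathrm{Cos}(x))\,\mathrm{Cos}'(x)}
              {\exp(\mathrm{Cos}(f(x)))\,\mathrm{Cos}'(f(x))} && \text{(7)} \\
f'(C_{s+n}) &= \mathrm{Cos}^{-1\prime}(C_s)\,
               \frac{\exp(\mathrm{Cos}(C_{s+n}))}{\exp(\mathrm{Cos}(C_s))}\,
               \mathrm{Cos}'(C_{s+n}) && \text{(9)}
\end{align*}
```

In the last step, the factor 1/Cos'(f(x)) of the scalar form is read as the derivative of the inverse cosine transform, which is legitimate for a linear transform.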

【００５７】雑音混入静的特徴量コードブック３０１内
の全ての特徴量（即ち、雑音混入音声のケプストラム）
Cs+nに対して、f'（Cs+n）が、変換ベクトル算出部６０
５において（９）に式より算出される。そして、このf'
（Cs+n）は、特徴量Cn+sと対応づけられて、変換べクト
ル格納部６０６によって動的特徴量変換テーブル４０５
に格納されることとなる。For every feature quantity Cs+n (that is, every cepstrum of noise-containing speech) in the noise-containing static feature quantity codebook 304, the conversion vector calculation unit 605 computes f'(Cs+n) from equation (9). The conversion vector storage unit 606 then stores each f'(Cs+n) in the dynamic feature quantity conversion table 405 in association with its feature quantity Cs+n.
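
The table construction described here can be sketched as follows. The sketch assumes that Cos is an orthonormal DCT and that f'(Cs+n) has the form Cos^-1' · diag(Φs+n / Φs) · Cos', which is a reading of equation (9) rather than its confirmed typeset form. The function names, the toy codebooks and the finite-difference check are hypothetical.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II matrix (n x n), so its transpose is its inverse."""
    k = np.arange(n)[:, None]
    m = np.arange(n)[None, :]
    d = np.sqrt(2.0 / n) * np.cos(np.pi * (m + 0.5) * k / n)
    d[0, :] /= np.sqrt(2.0)
    return d

def conversion_matrix(c_s, c_sn, d):
    """Assumed form of f'(Cs+n): Cos^-1' . diag(Phi_s+n / Phi_s) . Cos'."""
    ratio = np.exp(d.T @ (c_sn - c_s))   # exp(Cos(Cs+n)) / exp(Cos(Cs)) per bin
    return d @ np.diag(ratio) @ d.T

def build_table(clean_codebook, noisy_codebook, d):
    """One conversion matrix per paired codebook entry, indexed like the codebooks."""
    return np.stack([conversion_matrix(cs, csn, d)
                     for cs, csn in zip(clean_codebook, noisy_codebook)])

def numerical_jacobian(c_sn, phi_n, d, eps=1e-6):
    """Central-difference Jacobian of f: Cs+n -> Cs, used only as a sanity check."""
    f = lambda x: d @ np.log(np.exp(d.T @ x) - phi_n)
    cols = [(f(c_sn + eps * e) - f(c_sn - eps * e)) / (2 * eps)
            for e in np.eye(len(c_sn))]
    return np.stack(cols, axis=1)

rng = np.random.default_rng(0)
d = dct_matrix(8)
phi_n = np.full(8, 0.3)                                  # toy noise power spectrum
clean_codebook = rng.normal(scale=0.5, size=(4, 8))      # toy stand-in for codebook 501
# Toy stand-in for codebook 304: add the noise power to each clean entry.
noisy_codebook = np.log(np.exp(clean_codebook @ d) + phi_n) @ d.T
table = build_table(clean_codebook, noisy_codebook, d)   # toy stand-in for table 405

print(np.allclose(table[0], numerical_jacobian(noisy_codebook[0], phi_n, d), atol=1e-5))
```

Storing one matrix per codebook entry keeps run-time work to a lookup and a matrix-vector product, at the cost of quantization error in the selected entry.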

【００５８】次に、ベクトル変換部４０４を説明する。Next, the vector conversion unit 404 is described.

【００５９】図７は、図４のベクトル変換部４０４の詳
細を示す。FIG. 7 shows the details of the vector conversion unit 404 of FIG. 4.

【００６０】本実施例に係るベクトル変換部４０４は、
図示のように、最尤静的特徴量選択部７０１及びベクト
ル生成部７０２を備える。As illustrated, the vector conversion unit 404 according to this embodiment comprises a maximum likelihood static feature quantity selection unit 701 and a vector generation unit 702.

【００６１】ここで入力音声を分析して得られる静的特
徴量（ケプストラム）をCinとする。始めに、最尤静的
特徴量選択部７０１において、静的特徴量量子化処理部
３０３から出力される量子化された音声の静的特徴量
中、事後確率密度p（Cs+n｜Cin）が最大となる量子化後
の特徴量（ケプストラム）Cs+nが選択される。この選択
された特徴量Cs+nを雑音混入音声の静的特徴量と見做
す。Let Cin denote the static feature quantity (cepstrum) obtained by analyzing the input speech. First, from among the quantized static feature quantities of speech output by the static feature quantity quantization processing unit 303, the maximum likelihood static feature quantity selection unit 701 selects the quantized feature quantity (cepstrum) Cs+n that maximizes the posterior probability density p(Cs+n | Cin). The selected feature quantity Cs+n is regarded as the static feature quantity of the noise-containing speech.
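
The text does not say how p(Cs+n | Cin) is evaluated. The following is a minimal sketch of the selection step under two stated assumptions: an isotropic Gaussian likelihood around each codebook entry and, unless priors are supplied, equal prior probabilities, in which case the choice reduces to the nearest entry. The function `select_codeword` and the toy codebook are hypothetical.

```python
import numpy as np

def select_codeword(c_in, codebook, priors=None, var=1.0):
    """Return the index of the entry maximizing the assumed posterior p(entry | c_in)."""
    sq_dist = np.sum((codebook - c_in) ** 2, axis=1)
    log_post = -0.5 * sq_dist / var          # log Gaussian likelihood up to a constant
    if priors is not None:
        log_post = log_post + np.log(priors)  # optional non-uniform prior
    return int(np.argmax(log_post))

codebook = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]])  # toy quantized cepstra
c_in = np.array([0.9, 1.2])                                 # toy input cepstrum Cin
best = select_codeword(c_in, codebook)
print(best, codebook[best])
```

With equal priors the index returned is simply that of the closest entry in Euclidean distance; passing `priors` shows how a strongly favored entry can override proximity.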

【００６２】ベクトル生成部７０２は、この選択された
特徴量Cs+nと動的特徴量変換テーブル４０５とを使用し
て、入力音声を分析して得られる動的特徴量に基づきベ
クトル生成を行う。ベクトル生成部７０２におけるベク
トル生成の手順は、以下のようである。The vector generation unit 702 uses the selected feature quantity Cs+n and the dynamic feature quantity conversion table 405 to generate a vector from the dynamic feature quantity obtained by analyzing the input speech. The vector generation procedure in the vector generation unit 702 is as follows.

【００６３】即ち、f（x）の定義式からf'（Cs+n）＝dC
s／dCs+nであり、dCs／dtは、動的特徴量変換テーブル
４０５に格納されている動的特徴量であるから、上記
（９）式は、下記の（１０）式及び（１１）式に変形さ
れる。That is, from the definition of f(x), f'(Cs+n) = dCs/dCs+n, and dCs/dt is the dynamic feature quantity stored in the dynamic feature quantity conversion table 405, so equation (9) above can be transformed into equations (10) and (11) below.

【００６４】[0064]

【数６】（１１）式中の、ΔCs及びΔCs+nは、いずれも動的特徴
量である。（１１）式から、動的特徴量変換テーブル４
０５は、動的特徴量（ΔCs、ΔCs+n）の変換行列となっ
ていることがわかる。従って、雑音混入音声から抽出さ
れた動的特徴量△Cs+nに、（１１）式の右辺に記述した
変換行列（動的特徴量変換テーブル内の変換行列）を乗
じてベクトル変換することにより、音声のみ（即ち、雑
音なし音声）の動的特徴量△Csの推定が可能である。[Equation 6] Both ΔCs and ΔCs+n in equation (11) are dynamic feature quantities. Equation (11) shows that the dynamic feature quantity conversion table 405 serves as a conversion matrix between the dynamic feature quantities ΔCs and ΔCs+n. Therefore, multiplying the dynamic feature quantity ΔCs+n extracted from the noise-containing speech by the conversion matrix on the right-hand side of equation (11) (the conversion matrix in the dynamic feature quantity conversion table) performs a vector conversion that makes it possible to estimate the dynamic feature quantity ΔCs of speech only (that is, noise-free speech).
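
The image for equations (10) and (11) is not reproduced in this text. One plausible reading, offered as a reconstruction rather than the patent's typeset form, applies the chain rule to f'(Cs+n) = dCs/dCs+n and replaces the time derivatives by the dynamic feature quantities ΔCs and ΔCs+n. The explicit matrix expression in the last line assumes that f'(Cs+n) consists of the two constant matrices Cos^-1'(Cs) and Cos'(Cs+n) with the ratio of exponentials between them.

```latex
% Reconstruction (assumed form) of the step from f'(C_{s+n}) to the delta conversion.
\begin{align*}
f'(C_{s+n}) &= \frac{dC_s}{dC_{s+n}} \\
\frac{dC_s}{dt} &= \frac{dC_s}{dC_{s+n}}\,\frac{dC_{s+n}}{dt}
                 = f'(C_{s+n})\,\frac{dC_{s+n}}{dt} && \text{(10)} \\
\Delta C_s &= \mathrm{Cos}^{-1\prime}(C_s)\,
              \frac{\exp(\mathrm{Cos}(C_{s+n}))}{\exp(\mathrm{Cos}(C_s))}\,
              \mathrm{Cos}'(C_{s+n})\;\Delta C_{s+n} && \text{(11)}
\end{align*}
```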

【００６５】以上説明したように、本発明の一実施例に
よれば、ベクトル変換部４０４が、静的特徴量量子化処
理部３０３の出力中から最尤静的特徴量Cs+nを選択し、
この最尤静的特徴量Cs+nに対応する動的特徴量の変換行
列を、動的特徴量変換テーブル４０５中から読出して、
これをデルタ特徴量抽出部３０５から出力された雑音混
入部の動的特徴量に乗ずることによって、この雑音混入
音声の動的特徴量を雑音を含まない音声のみの動的特徴
量にベクトル変換することとしたので、背景雑音に起因
してデルタ特徴量抽出部３０５から抽出された動的特徴
量にズレが生じたとしても、このズレによって音声認識
の精度が低下するような不具合は生じない。よって、よ
り精度の高い音声の特徴量の抽出が可能となり、これに
よって音声認識性能の向上を図ることが可能である。As described above, according to this embodiment of the present invention, the vector conversion unit 404 selects the maximum likelihood static feature quantity Cs+n from the output of the static feature quantity quantization processing unit 303, reads the dynamic feature quantity conversion matrix corresponding to this Cs+n from the dynamic feature quantity conversion table 405, and multiplies the dynamic feature quantity of the noise-containing speech output from the delta feature quantity extraction unit 305 by that matrix. This vector conversion turns the dynamic feature quantity of the noise-containing speech into the dynamic feature quantity of speech only, free of noise. Consequently, even if background noise causes a deviation in the dynamic feature quantity extracted by the delta feature quantity extraction unit 305, that deviation does not degrade voice recognition accuracy. Speech feature quantities can therefore be extracted more accurately, which improves voice recognition performance.
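
The effect of the conversion can be illustrated on synthetic data. The sketch below is not the patent's implementation: it assumes Cos is an orthonormal DCT, assumes the noise power spectrum is known and stationary, uses frame-to-frame differences as the dynamic feature quantity, and skips quantization by computing the conversion matrix directly from each noisy frame instead of looking it up in a table. All names and values are hypothetical.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II matrix (n x n), so its transpose is its inverse."""
    k = np.arange(n)[:, None]
    m = np.arange(n)[None, :]
    d = np.sqrt(2.0 / n) * np.cos(np.pi * (m + 0.5) * k / n)
    d[0, :] /= np.sqrt(2.0)
    return d

def conversion_matrix(c_sn, phi_n, d):
    """Assumed f'(Cs+n), with the clean power recovered as Phi_s = Phi_s+n - Phi_n."""
    phi_sn = np.exp(d.T @ c_sn)
    return d @ np.diag(phi_sn / (phi_sn - phi_n)) @ d.T

rng = np.random.default_rng(1)
n, frames = 8, 50
d = dct_matrix(n)
phi_n = np.full(n, 0.5)                           # toy stationary noise power
base, swing = rng.normal(scale=0.3, size=(2, n))
t = np.linspace(0.0, 1.0, frames)[:, None]
c_s = base + np.sin(2 * np.pi * t) * swing        # clean cepstra, one row per frame
c_sn = np.log(np.exp(c_s @ d) + phi_n) @ d.T      # noisy cepstra via power-domain addition

delta_s = np.diff(c_s, axis=0)                    # true clean dynamic features
delta_sn = np.diff(c_sn, axis=0)                  # observed noisy dynamic features
delta_est = np.array([conversion_matrix(c, phi_n, d) @ dl
                      for c, dl in zip(c_sn[:-1], delta_sn)])

err_raw = np.linalg.norm(delta_sn - delta_s)
err_est = np.linalg.norm(delta_est - delta_s)
print(f"error without conversion: {err_raw:.4f}, with conversion: {err_est:.4f}")
```

Because the conversion is a first-order approximation, its residual error shrinks with the size of the frame-to-frame change, whereas the unconverted dynamic feature is off by a factor that depends on the noise level in each frequency bin.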

【００６６】なお、上記内容はあくまで本発明の一実施
例に係るものであり、本発明は、上記内容にのみ限定さ
れるものでないのは勿論である。The above description concerns only one embodiment of the present invention, and the present invention is of course not limited to it.

【００６７】[0067]

【発明の効果】以上説明したように、本発明によれば、
入力音声の動的特徴量の推定を行うことにより、雑音下
において、より精度の高い音声認識を行うことが可能な
音声認識装置を提供することができる。[Effects of the Invention] As described above, the present invention estimates the dynamic feature quantity of the input speech and can thereby provide a voice recognition device capable of more accurate voice recognition in noisy conditions.

[Brief description of drawings]

【図１】背景雑音による、音声の静的特徴量と音声の動
的特徴量の変動を概念的に示した説明図。FIG. 1 is an explanatory diagram conceptually showing changes in static voice feature and dynamic voice feature due to background noise.

【図２】本発明が適用できる音声認識装置の一例の全体
を示す機能ブロック図。FIG. 2 is a functional block diagram showing the overall configuration of an example of a voice recognition device to which the present invention can be applied.

【図３】図２に示した音響分析部に一般的に採用される
構成を示すの機能ブロック図。FIG. 3 is a functional block diagram showing a configuration generally adopted in the acoustic analysis unit shown in FIG. 2.

【図４】図２に示した音響分析部に本発明を適用した一
実施例の構成を示す機能ブロック図。FIG. 4 is a functional block diagram showing the configuration of an embodiment in which the present invention is applied to the acoustic analysis unit shown in FIG. 2.

【図５】同実施例における雑音混入静的特徴量生成部の
詳細を示す機能ブロック図。FIG. 5 is a functional block diagram showing details of a noise-containing static feature amount generation unit in the embodiment.

【図６】同実施例における動的特徴量変換テーブル生成
部の詳細を示す機能ブロック図。FIG. 6 is a functional block diagram showing details of a dynamic feature quantity conversion table generation unit in the embodiment.

【図７】同実施例のベクトル変換部の詳細を示す機能ブ
ロック図。FIG. 7 is a functional block diagram showing details of a vector conversion unit of the embodiment.

[Explanation of symbols]

２０５ 音響分析部 / 205 acoustic analysis unit
３０１ 特徴量抽出部 / 301 feature quantity extraction unit
３０５ デルタ特徴量抽出部 / 305 delta feature quantity extraction unit
３０８ 遅延処理部 / 308 delay processing unit
４０３ 動的特徴量変換テーブル生成部 / 403 dynamic feature quantity conversion table generation unit
４０４ ベクトル変換部 / 404 vector conversion unit
４０５ 動的特徴量変換テーブル / 405 dynamic feature quantity conversion table

Claims

[Claims]

1. A voice recognition device that obtains a static feature quantity and a dynamic feature quantity from noise-containing speech to be recognized and passes them to voice recognition processing, the device comprising: conversion table creating means for receiving a silent section in advance and creating, based on a feature quantity of the noise contained in the silent section and a feature quantity of speech only prepared in advance, a conversion table for converting a dynamic feature quantity of noise-containing speech into a dynamic feature quantity of speech only; means for receiving the noise-containing speech to be recognized at the time of voice recognition and obtaining a dynamic feature quantity of the noise-containing speech; and means for estimating a dynamic feature quantity of speech only in the input noise-containing speech by applying the previously created conversion table to the obtained dynamic feature quantity of the noise-containing speech, wherein the estimated dynamic feature quantity of speech only is passed to the voice recognition processing.

2. A voice recognition method that obtains a static feature quantity and a dynamic feature quantity from noise-containing speech to be recognized and passes them to voice recognition processing, the method comprising: a step of receiving a silent section in advance and creating, based on a feature quantity of the noise contained in the silent section and a feature quantity of speech only prepared in advance, a conversion table for converting a dynamic feature quantity of noise-containing speech into a dynamic feature quantity of speech only; a step of receiving the noise-containing speech to be recognized at the time of voice recognition and obtaining a dynamic feature quantity of the noise-containing speech; and a step of estimating a dynamic feature quantity of speech only in the input noise-containing speech by applying the previously created conversion table to the obtained dynamic feature quantity of the noise-containing speech, wherein the estimated dynamic feature quantity of speech only is passed to the voice recognition processing.