JP2006084659A

JP2006084659A - Audio signal analysis method, voice recognition methods using same, their devices, program, and recording medium thereof

Info

Publication number: JP2006084659A
Application number: JP2004268120A
Authority: JP
Inventors: Kentaro Ishizuka; 健太郎石塚; Noboru Miyazaki; 昇宮崎; Tomohiro Nakatani; 智広中谷; Yasuhiro Minami; 泰浩南
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2004-09-15
Filing date: 2004-09-15
Publication date: 2006-03-30

Abstract

PROBLEM TO BE SOLVED: To correct the influence of multiplicative distortion and additive distortion caused by noise or the like. SOLUTION: An audio signal analysis method divides a speech signal into a plurality of band signals with a filter bank 11; finds the aperiodic component power (13, 14 and 15A) and the periodic component power (15F and 16) of each band signal; applies a discrete cosine transform to the periodic and aperiodic component powers (17P and 17A); and concatenates the resulting discrete cosine coefficients into one vector (18). The method then finds the variance (or standard deviation) of at least some of the feature parameters obtained above, over time and/or over vector elements, and divides (normalizes) the corresponding elements of the concatenated vector by that value (22b), thereby correcting the influence of the multiplicative and additive distortions. COPYRIGHT: (C)2006, JPO&NCIPI

Description

The present invention relates to an audio signal analysis method for extracting a feature representation of an audio signal such as a speech signal or a music signal, a speech recognition method using that method, apparatuses for them, a program, and a recording medium for the program.

An automatic speech recognition apparatus performs speech signal analysis to extract a speech feature representation. One feature extraction method that is highly robust, that is, hardly affected by noise, separates the periodic and aperiodic components of a speech signal and outputs them concatenated (see Non-Patent Document 1). FIG. 1 shows an example of the functional configuration of an apparatus that executes this conventional speech signal analysis method, and FIG. 2 shows its processing procedure. The speech signal analyzing apparatus 10 includes a band-pass filter bank 11, speech waveform cutting means 12, period estimating means 13, comb filter means 14, power calculating means 15F and 15A, subtracting means 16, discrete cosine transform means 17P and 17A, and vector connecting means 18.

The speech signal input to the speech signal analyzing apparatus 10 from the input terminal 100 is a discrete speech signal sampled at a sampling rate of, for example, 16,000 Hz, with each sample converted to a digital value.

The band-pass filter bank 11 divides the input discrete speech signal into bands using a plurality of band-pass digital filters 11_1, …, 11_B and outputs the result (step S1). A suitable choice for the band-pass filter bank 11 is, for example, a gammatone filter bank based on the characteristics of auditory perception, whose center frequencies correspond to the equivalent rectangular bandwidth (M. Slaney, "An Efficient Implementation of the Patterson-Holdsworth Auditory Filter Bank," Apple Computer Technical Report #35, 1993). In this gammatone filter bank, the gammatone filters serving as band-pass filters 11_b (b = 1, …, B) are prepared for, for example, 24 bands, so that their passbands overlap and their center frequencies follow the equivalent rectangular bandwidth (approximately a logarithmic scale). FIG. 3 shows an example of the frequency characteristics of the filters of the filter bank 11; the characteristics of the band-pass filters 11_1, …, 11_24 (gammatone filters) are plotted together. Filtering the input discrete speech signal with each of the band-pass filters 11_1, …, 11_B yields B discrete signals, one per filter, which are output from the band-pass filter bank 11. FIG. 4 shows an example of the input and output signals of the band-pass filter bank 11 when gammatone filters having three of the 24 characteristics shown in FIG. 3 are used as the band-pass filters 11_b. FIG. 4A shows the waveform of the input discrete speech signal over time, FIG. 4B shows the frequency characteristics of the band-pass filters 11_b with center frequencies f_c1, f_c2 and f_c3, and FIG. 4C shows the output signal waveforms of these three band-pass filters.
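As a rough illustration of how center frequencies can be placed according to the equivalent rectangular bandwidth, the following sketch spaces 24 center frequencies uniformly on the ERB-rate scale. The Glasberg-Moore ERB-rate formula, the function names, and the 60 Hz and 7,000 Hz band edges are assumptions chosen for illustration; they are not taken from this document or from Slaney's report, which may use a different parameterization.

```python
import math


def hz_to_erb_rate(f_hz):
    """ERB-rate scale (Glasberg-Moore form, an assumption for this sketch)."""
    return 21.4 * math.log10(4.37e-3 * f_hz + 1.0)


def erb_rate_to_hz(e):
    """Inverse of hz_to_erb_rate."""
    return (10.0 ** (e / 21.4) - 1.0) / 4.37e-3


def erb_center_frequencies(num_bands=24, f_low=60.0, f_high=7000.0):
    """Center frequencies equally spaced on the ERB-rate scale.

    f_low and f_high are hypothetical band edges, not values from the text.
    """
    e_low = hz_to_erb_rate(f_low)
    e_high = hz_to_erb_rate(f_high)
    step = (e_high - e_low) / (num_bands - 1)
    return [erb_rate_to_hz(e_low + b * step) for b in range(num_bands)]


if __name__ == "__main__":
    for b, fc in enumerate(erb_center_frequencies(), start=1):
        print(f"band {b:2d}: {fc:8.1f} Hz")
```

The spacing between neighboring center frequencies grows with frequency, which is the roughly logarithmic behavior the text describes.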

The speech waveform cutting means 12 cuts out, in its cutting units 12_1, …, 12_B, a signal segment 30 ms long from the output signal of each band-pass filter 11_1, …, 11_B of the band-pass filter bank 11 while shifting by, for example, 10 ms along the time axis (step S2). As a result, the cutting units 12_1, …, 12_B output discrete signals of, for example, 480 sample points (16,000 Hz × 30 ms), cut out while shifting by 160 sample points (16,000 Hz × 10 ms). In other words, each band signal from the band-pass filters 11_1, …, 11_B is divided into analysis sections (frames).
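The frame cutting of step S2 can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `cut_frames` is hypothetical, and trailing samples that do not fill a complete frame are simply dropped, which the text does not specify.

```python
def cut_frames(signal, frame_len=480, hop=160):
    """Cut a band signal into overlapping analysis frames.

    frame_len=480 and hop=160 correspond to 30 ms and 10 ms at 16,000 Hz.
    Incomplete trailing frames are discarded (an assumption of this sketch).
    """
    frames = []
    start = 0
    while start + frame_len <= len(signal):
        frames.append(signal[start:start + frame_len])
        start += hop
    return frames


if __name__ == "__main__":
    one_second = [0.0] * 16000  # placeholder band signal, 1 s at 16,000 Hz
    frames = cut_frames(one_second)
    print(len(frames), "frames of", len(frames[0]), "samples")
```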

The period estimating means 13 receives the output signals of the cutting units 12_1, …, 12_B of the speech waveform cutting means 12, and its period estimating units 13_1, …, 13_B estimate the period of the periodicity of each output signal in each analysis section (step S3). The period is estimated using, for example, the autocorrelation method, one of the fundamental frequency extraction methods (W. Hess, "Pitch Determination of Speech Signals," Springer-Verlag, New York, 1983). The autocorrelation method first obtains the autocorrelation function coefficients of the input signal. Let N be the total number of sample points of the input signal (the number of sample points in one analysis section) and s_j the amplitude of the signal at the j-th sample point; the autocorrelation function coefficient ac_i of the input signal is then obtained by the following equation.

$$ac_i = \frac{1}{N}\sum_{j=1}^{N-1-i} s_j\, s_{i+j}, \qquad i = 1, \ldots, N$$

FIG. 5A shows an example of the input signal waveform, and FIG. 5B shows its autocorrelation function coefficients. Next, the value of i that maximizes ac_i is detected within a fixed search range of i, for example 80 ≤ i ≤ 200 (at a sampling frequency of 16,000 Hz, these lags correspond to fundamental frequencies between 80 Hz and 200 Hz). The resulting i is denoted n. This n represents the period length of the most dominant periodic component within the search range; when the input signal is a single, perfectly periodic signal (for example, a sine wave), n takes a value corresponding to its period length. Each period estimating unit 13_1, …, 13_B of the period estimating means 13 outputs its estimated period n.
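The period search of step S3 can be sketched as follows. This is a minimal illustration rather than the patented implementation: the function names are hypothetical, indices are zero-based, and the sum runs over every sample pair available at each lag, which differs slightly from the summation limits printed above.

```python
def autocorrelation(frame, max_lag):
    """Autocorrelation coefficients ac[0..max_lag], each scaled by 1/N."""
    n = len(frame)
    ac = []
    for lag in range(max_lag + 1):
        total = 0.0
        for j in range(n - lag):
            total += frame[j] * frame[j + lag]
        ac.append(total / n)
    return ac


def estimate_period(frame, min_lag=80, max_lag=200):
    """Return the lag in [min_lag, max_lag] with the largest autocorrelation.

    The default range matches the example search range 80 <= i <= 200.
    """
    ac = autocorrelation(frame, max_lag)
    return max(range(min_lag, max_lag + 1), key=lambda lag: ac[lag])


if __name__ == "__main__":
    # Pulse train with a period of 100 samples (160 Hz at 16,000 Hz sampling).
    frame = [1.0 if j % 100 == 0 else 0.0 for j in range(480)]
    print("estimated period:", estimate_period(frame), "samples")
```

Because each coefficient is scaled by 1/N rather than by the number of overlapping samples, larger lags are weighted down slightly, so for smooth signals the detected peak can shift by a sample or so from the true period.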

The comb filter means 14 sets a discrete comb filter based on the period obtained by the period estimating means 13 and filters the output signal of the speech waveform cutting means 12 (step S4). The frequency characteristic of the discrete comb filter used here is given, for example, by the following z-domain expression for the output period n of the period estimating means 13.

$$H(z) = 1 - z^{-n}$$

The output signal obtained by filtering the output signal of the speech waveform cutting means 12 with this discrete comb filter is a discrete signal in which the power at the zeros of the comb filter's frequency characteristic (the fundamental frequency component and its integer multiples) is suppressed. The period n estimated by each period estimating unit 13_b (b = 1, …, B) is set in the corresponding filter unit 14_b of the comb filter means 14, and the band discrete speech signal of each analysis section from each cutting unit 12_b is input to the corresponding filter unit 14_b. FIG. 6A shows examples of the signal waveforms cut out by the speech waveform cutting means 12 from the output signals of the three band-pass filters selected from the band-pass frequency characteristics shown in FIG. 4; FIG. 6B shows the frequency characteristics of the discrete comb filters set to the periods estimated from these signals; and FIG. 6C shows the corresponding filtered output signals.
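In the time domain, H(z) = 1 − z^{-n} corresponds to subtracting from each sample the sample n points earlier. The sketch below illustrates this; the function name is hypothetical, and treating the samples before the start of the frame as zero is an assumption, since the text does not state how the filter is initialized.

```python
def comb_filter(frame, period):
    """Apply y[j] = x[j] - x[j - period], i.e. H(z) = 1 - z^-period.

    Samples before the start of the frame are taken as zero (assumption),
    so the first `period` output samples equal the input samples.
    """
    out = []
    for j, x in enumerate(frame):
        delayed = frame[j - period] if j >= period else 0.0
        out.append(x - delayed)
    return out


if __name__ == "__main__":
    # A pulse train with period 100 is cancelled once the delay line is filled.
    frame = [1.0 if j % 100 == 0 else 0.0 for j in range(480)]
    filtered = comb_filter(frame, 100)
    print("remaining energy after the first period:",
          sum(v * v for v in filtered[100:]))
```

A component whose period equals n cancels exactly after the first n samples, while components at other frequencies pass through with a gain between 0 and 2.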

Each calculating unit 15F_b of the power calculating means 15F calculates the power of the output signal of the corresponding cutting unit 12_b of the speech waveform cutting means 12, and each calculating unit 15A_b of the power calculating means 15A calculates the power of the output signal of the corresponding filter unit 14_b of the comb filter means 14 (step S5). The power W is calculated in each calculating unit 15F_b and 15A_b as, for example, the sum of squares shown in the following equation, where s_j is the amplitude of the input discrete signal at sample point j and N is the total number of sample points of the input signal.

$$W = \sum_{j=1}^{N} s_j^{2}$$

Each subtracting unit 16_b of the subtracting means 16 subtracts, from the output power value of the calculating unit 15F_b of the power calculating means 15F, that is, the power value WF_b of the output signal of the cutting unit 12_b of the speech waveform cutting means 12, the output power value of the corresponding calculating unit 15A_b of the power calculating means 15A, that is, the power value WA_b calculated from the output signal of the filter unit 14_b that corresponds to the cutting unit 12_b (step S6). Each subtracting unit 16_b thereby yields the power of the frequency components suppressed by the filter unit 14_b of the comb filter means 14, namely the periodic component power value WP_b of each band discrete speech signal. This subtraction is expressed by the following equation.

$$\mathit{WP}_b = \mathit{WF}_b - \mathit{WA}_b$$

The periodic component power vectorizing means 20P receives the periodic component powers WP_b and arranges them into a vector ordered by the center frequencies of the corresponding band-pass filters 11_b (b = 1, …, B); the aperiodic component power vectorizing means 20A likewise arranges the aperiodic component powers WA_b into a vector (step S7). The discrete cosine transform means 17P takes the logarithm of the periodic component power vector and applies a discrete cosine transform (for this transform see, for example, Non-Patent Document 3, page 14). Similarly, the discrete cosine transform means 17A applies a discrete cosine transform to the aperiodic component power vector (step S8). For example, when band-pass filters 11_b for 24 bands are used, 24 values each of WP_b and WA_b are calculated. These are arranged in order of the center frequencies of the corresponding band-pass filters and handled as two 24-dimensional vectors. A discrete cosine transform is applied to each vector according to, for example, the following equation.

Here, p_j denotes the j-th element (power value) of the B-dimensional vector formed from WP_b or WA_b arranged in order of the center frequencies of the corresponding band-pass filters, and c_i denotes the i-th discrete cosine coefficient of the B-dimensional vector C obtained after the discrete cosine transform. c_i is obtained for all i = 1, …, B. The discrete cosine transform means 17P and 17A output the discrete cosine coefficients c_iP and c_iA obtained from WP_b and WA_b, respectively.
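Steps S5 to S8 can be sketched as follows. The transform equation itself is not reproduced in this text, so the cosine transform below uses a commonly seen form, c_i = sqrt(2/B) Σ_j log(p_j) cos(πi(j − 0.5)/B) with i and j running from 1 to B, as an assumption; the patent's exact scaling may differ. The function names and the small floor applied before the logarithm (to guard against zero or negative power differences) are likewise illustrative assumptions.

```python
import math


def power(samples):
    """Sum of squared amplitudes, W = sum_j s_j^2."""
    return sum(s * s for s in samples)


def periodic_aperiodic_power(frame, comb_filtered):
    """Return (WP, WA) for one band: WA from the comb-filtered frame,
    WP = WF - WA with WF the power of the unfiltered frame."""
    wf = power(frame)
    wa = power(comb_filtered)
    return wf - wa, wa


def log_dct(powers, floor=1e-10):
    """Cosine transform of the log band powers (assumed form, see lead-in).

    `floor` is a hypothetical safeguard: WF - WA can be zero or negative.
    """
    b = len(powers)
    logs = [math.log(max(p, floor)) for p in powers]
    scale = math.sqrt(2.0 / b)
    coeffs = []
    for i in range(1, b + 1):
        acc = 0.0
        for j in range(b):  # zero-based j, hence (j + 0.5) instead of (j - 0.5)
            acc += logs[j] * math.cos(math.pi * i * (j + 0.5) / b)
        coeffs.append(scale * acc)
    return coeffs


if __name__ == "__main__":
    wp, wa = periodic_aperiodic_power([1.0, 2.0, 3.0], [1.0, 1.0, 1.0])
    print("WP =", wp, "WA =", wa)
    print(log_dct([1.0, 2.0, 4.0, 8.0]))
```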

The vector connecting means 18 receives the B-dimensional discrete cosine coefficients c_iP and c_iA corresponding to WP_b and WA_b, which are the outputs of the discrete cosine transform means 17P and 17A, and concatenates part or all of each into a single vector C = (c_1, c_2, …, c_k), which it outputs (step S9). For example, when the 24-dimensional discrete cosine coefficients c_iP and c_iA of WP_b and WA_b are input, the 12 lowest-order coefficients of each are concatenated and output as a single 24-dimensional vector. This analysis method provides a certain degree of robustness in automatic speech recognition under noise.
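The concatenation of step S9 amounts to joining the low-order coefficients of the two transforms. A minimal sketch, with a hypothetical function name and the periodic coefficients placed first (the text does not state the order):

```python
def concatenate_features(c_periodic, c_aperiodic, keep=12):
    """Join the `keep` lowest-order coefficients of each vector.

    With 24-dimensional inputs and keep=12 the result is 24-dimensional.
    Placing the periodic coefficients first is an assumption of this sketch.
    """
    return list(c_periodic[:keep]) + list(c_aperiodic[:keep])


if __name__ == "__main__":
    c_p = [float(i) for i in range(1, 25)]      # stand-in periodic coefficients
    c_a = [float(-i) for i in range(1, 25)]     # stand-in aperiodic coefficients
    feature = concatenate_features(c_p, c_a)
    print(len(feature), feature[:3], feature[12:15])
```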

The following techniques are known for speech analysis methods that extract a speech feature representation without separating the periodic and aperiodic components of the speech signal.
(1) Cepstral variance normalization, which normalizes feature parameters such as MFCCs (mel-frequency cepstral coefficients) by their variance to correct the influence of additive distortion (see Non-Patent Document 2).
(2) Cepstral mean removal, which copes with multiplicative distortion caused by differences in microphones, transmission paths, and the like between the speech signal used to build the recognition model and the speech signal to be recognized: feature parameters such as MFCCs are averaged over time, and the average is subtracted from the original parameters to correct the influence of the multiplicative distortion (see Non-Patent Document 3, pages 14 to 15).

(3) Cepstral gain normalization, which copes with additive distortion by normalizing the gain (magnitude) of feature parameters such as MFCCs (see Non-Patent Document 4).
The mean removal method of (2) is briefly described with reference to FIG. 7. The speech signal is cut into analysis frames by the speech waveform cutting unit 1, and each analysis frame is subjected to a discrete Fourier transform in the discrete Fourier transform unit 2. The resulting spectrum is divided by the triangular window filters 3_1, …, 3_L into L bands, each weighted by a triangular window; the windows are equally spaced on the mel frequency axis, and each extends to the centers of its two adjacent bands. The power calculating units 4_1, …, 4_L calculate the power of each of these L band spectra, and the discrete cosine transform unit 5 applies a discrete cosine transform to the power vector formed by arranging these L powers in ascending order of the center frequencies of the corresponding filters, yielding the MFCCs.

The time averaging unit 6 averages the MFCCs over a sufficient number of analysis frames to obtain a mean vector that is approximately constant; the subtracting unit 7 subtracts this mean vector from the MFCCs, and the result is output as the speech feature parameters. The time averaging and the subtraction are both performed by logarithmic calculation.

Non-Patent Document 1: Kentaro Ishizuka, Noboru Miyazaki, "Speech feature extraction method representing periodicity and aperiodicity in sub bands for robust speech recognition," Proceedings of the 29th International Conference on Acoustics, Speech, and Signal Processing, Vol. 1, pp. 141-144, 2004.
Non-Patent Document 2: Chia-Ping Chen, Karim Filali, Jeff A. Bilmes, "Frontend post-processing and backend model enhancement on the Aurora 2.0/3.0 databases," Proceedings of the 7th International Conference on Spoken Language Processing, pp. 241-244, 2002.
Non-Patent Document 3: Kiyohiro Shikano, Katsunobu Ito, Tatsuya Kawahara, Kazuya Takeda, Mikio Yamamoto (eds.), "Speech Recognition System" (音声認識システム), Ohmsha, 2001, pp. 14-15.
Non-Patent Document 4: Shingo Yoshizawa, Noboru Hayasaka, Naoya Wada, Yoshikazu Miyanaga, "Cepstral gain normalization for noise robust speech recognition," Proceedings of the 29th International Conference on Acoustics, Speech, and Signal Processing, Vol. 1, pp. 209-212, 2004.

The method of Non-Patent Document 1, which extracts feature parameters by separating the speech signal into periodic and aperiodic components, does not achieve sufficient robustness against external variation factors such as additive distortion (for example, noise) and multiplicative distortion, or against variation factors inherent in the speech itself. The techniques of Non-Patent Documents 2 to 4 all assume that the long-term average of the power spectrum has a nearly constant shape; because this assumption is unrealistic, they likewise fail to achieve sufficient robustness against these variation factors.

The same problem arises when the methods of Non-Patent Documents 1 to 4 are applied individually to the analysis of parameters that characterize not only speech signals but also other acoustic signals, such as music signals, in which periodic and aperiodic components are mixed. Signals in which periodic and aperiodic components are mixed, such as speech signals and music signals, are collectively referred to as audio signals.

An object of the present invention is to provide an audio signal analysis method that yields audio feature parameters in which the influence of variation factors arising from at least one of these distortions is corrected, together with a speech recognition method using that method, an apparatus, a program, and a recording medium for the program.

According to the present invention, an audio signal is separated into a periodic component and an aperiodic component and its feature parameters are extracted; a statistical parameter is calculated for at least some of the extracted feature parameters; and the corresponding feature parameters are normalized by that statistical parameter to give the feature parameters of the analysis result.

With this configuration, the feature parameters are extracted with the periodic and aperiodic components separated and are then normalized by their statistical parameter for distortion correction, so feature parameters that are less susceptible to at least one of the external and internal variation factors can be obtained.

Embodiments of the present invention are described below with reference to the drawings. Corresponding parts in FIG. 1 and in the figures described below are given the same reference numerals, and duplicate description is omitted. The following description covers the case where the invention is applied to a speech signal as the audio signal.
[First Embodiment]
In this invention, an audio signal is separated into a periodic component and an aperiodic component, feature parameters are extracted, a statistical parameter is obtained for at least some of the feature parameters, and the corresponding feature parameters are normalized by that statistical parameter to correct distortion. The first embodiment applies the invention to distortion correction that obtains the variance or standard deviation as the statistical parameter and reduces the influence of external variation factors and of variation factors inherent in the speech (internal variation factors). FIG. 8 shows an example of its functional configuration, and FIG. 9 shows an example of its processing procedure.

Of the input speech signal from the input terminal 100, the portion detected as a speech section by the speech section detecting unit 21 is input to the band-pass filter bank 11 in the signal analyzing means 10 (step S11). The speech section detecting unit 21 detects, as the speech section signal, the entire section from the beginning to the end of the speech signal to be recognized in the input signal.

In the signal analyzing means 10, the detected speech signal is divided into a plurality of band signals, each band signal is separated into a periodic component and an aperiodic component, and feature parameters representing the speech features are extracted (step S12). In this example the signal analyzing means 10 has the same configuration as the speech signal analyzing apparatus 10 shown in FIG. 1, and the signal analysis processing (step S12) is the same as the procedure shown in FIG. 2. The discrete cosine transform means 17P and 17A and the vector connecting means 18 in FIG. 8 constitute the feature vector generating means.

In this embodiment, the distortion correcting means 22 applies distortion correction to the feature parameters extracted by the signal analyzing means 10 (step S13). In the distortion correcting means 22, the variance calculating means 22a calculates the variance of the input feature parameters, that is, of the concatenated vector from the vector connecting means 18 in the signal analyzing means 10 (step S13a). The dividing means 22b then divides the feature parameters from the signal analyzing means 10 by this variance, thereby correcting the distortion (step S13b).

These processes are described more specifically. The vector connecting means 18 outputs one discrete cosine coefficient vector C each time the speech waveform cutting means 12 cuts out a frame, that is, as many vectors as there are frames (analysis sections) in one speech section detected by the speech section detecting unit 21. The k-th coefficient of the vector C output by the vector connecting means 18 at a given time (frame) τ is denoted c_k(τ), where τ is the time as discretized by the speech waveform cutting means 12. For example, when the speech waveform cutting means 12 cuts out waveforms 30 ms long from a 1-second speech section while shifting by 10 ms, τ takes values from 1 to 97 (= (1,000 ms − 30 ms) / 10 ms). The variance calculating means 22a then obtains the variance σ_k² of the k-th discrete cosine coefficient as the variance of c_k(τ) over τ, as in the following equation.

β and α denote the range over which the variance is calculated and satisfy β ≥ α. If α = 1 and β is the maximum value of τ, the parameters of the entire speech section are used; otherwise, only part of the speech section is used. The variance σ_k² of the discrete cosine coefficients is obtained for all or some of the k.

The dividing means 22b normalizes the feature parameters by dividing the k-th discrete cosine coefficient c_k(τ) obtained by the vector connecting means 18 by its variance σ_k². The corrected discrete cosine coefficient Nc_k(τ) is obtained by the following equation.

$$\mathit{Nc}_k(\tau) = \phi_k(\tau)\,\frac{c_k(\tau)}{\sigma_k^{2}}$$

Here, φ_k(τ) is a real-valued parameter that adjusts the scale of the normalized feature parameter obtained by the division; for example, 1 is used.

This normalization is performed for all or some of the τ and for all or some of the k. In this way, a discrete cosine coefficient vector C is obtained in which the influence of external variation factors and of variation factors inherent in the speech is corrected.

As indicated by the broken lines in FIGS. 8 and 9, the square root calculating unit 22c may take the square root of the variance σ_k² output by the variance calculating means 22a to obtain the standard deviation σ_k = √(σ_k²) (step S13c), and this value may be input to the dividing means 22b.

In this case, the output normalized feature parameter Nc_k(τ) is as follows.

$$\mathit{Nc}_k(\tau) = \phi_k(\tau)\,\frac{c_k(\tau)}{\sigma_k}$$
[Second Embodiment]
In the second embodiment, distortion correction is performed using a time average of feature parameters obtained by signal analysis as a statistical parameter. FIG. 10 shows an example of the functional configuration, and FIG. 11 shows an example of the processing procedure.
The audio signal from the input terminal 100 is input to the signal analysis means 10 through the audio section detection unit 21. In this example, the signal analysis means 10 has the same configuration as that of the speech analysis apparatus 10 shown in FIG. A process for reducing the influence of the multiplicative distortion is performed by the distortion correcting unit 31 on the characteristic parameter output from the signal analyzing unit 10 (step S21). For this reason, the characteristic parameters input to the distortion correction means 31 are first time averaged by the time averaging means 31a (step S21a).

Specifically, for example, the discrete cosine coefficient vector C output from the vector connecting means 18 is output, in the time direction, as many times as the speech waveform cutting means 12 cuts out a waveform, that is, once per analysis frame in one speech segment. As in the first embodiment, the k-th coefficient of the discrete cosine coefficient vector output from the vector connecting means 18 at a time τ is denoted c_k(τ), where τ represents the time discretized by the speech waveform cutting means 12. For example, when the speech waveform cutting means 12 cuts out 30 ms speech waveforms while advancing in 10 ms steps over a 1-second speech segment, τ takes values from 1 to 97 (= (1,000 ms − 30 ms) / 10 ms). The time averaging means 31a then obtains the time-average discrete cosine coefficient m_k by the following equation (1).

β and α define the range over which the time average is taken and satisfy β > α. If α = 1 and β is the maximum value of τ, the entire speech segment is used; otherwise, only part of the speech segment is used. γ_k(τ) is the weight applied when the coefficients are summed; for example, 1 is used. The time-average discrete cosine coefficient m_k is obtained for all or some of k.
Next, the subtracting means 31b subtracts the time-average discrete cosine coefficient m_k obtained by the time averaging means 31a from the discrete cosine coefficient c_k(τ) obtained by the vector connecting means 18 to obtain the corrected discrete cosine coefficient Nc_k(τ) (step S31b). The subtraction is performed by the following equation (2).

Nc_k(τ) = c_k(τ) − φ_k(τ)·m_k … (2)
Here, φ_k(τ) is the weight by which the time-average discrete cosine coefficient is multiplied before subtraction; for example, 1 is used.
By computing this for all or some of τ and for all or some of k, a discrete cosine coefficient vector in which multiplicative distortion is corrected is obtained.
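The time averaging and the subtraction of equation (2) can be sketched as follows. The exact form of equation (1) is not reproduced in this text, so the sketch assumes a weighted sum of c_k(τ) over τ = α, …, β divided by the number of frames in that range; the function name and array layout are likewise assumptions made for illustration.

```python
import numpy as np

def subtract_time_average(c, alpha=1, beta=None, gamma=None, phi=1.0):
    """Return Nc_k(tau) = c_k(tau) - phi * m_k for every frame and coefficient.

    c     : array of shape (num_frames, num_coeffs); row tau-1 holds c_k(tau).
    gamma : weights gamma_k(tau), broadcastable to the averaging window;
            None means all weights are 1.
    """
    c = np.asarray(c, dtype=float)
    if beta is None:
        beta = c.shape[0]
    window = c[alpha - 1:beta]                    # frames alpha..beta
    if gamma is None:
        gamma = np.ones_like(window)
    # Assumed form of equation (1): weighted sum divided by the frame count.
    m = (gamma * window).sum(axis=0) / window.shape[0]
    return c - phi * m
```

If a fixed distortion adds the same offset to a coefficient at every time τ, as a time-invariant multiplicative distortion does for coefficients derived from logarithmic powers, the offset enters m_k unchanged and cancels in the subtraction.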
[Third Embodiment]
In the third embodiment, distortion correction is performed using, as the statistical parameter, the variation range of the feature parameters obtained by signal analysis. FIG. 12 shows an example of the functional configuration, and FIG. 13 shows an example of the processing procedure.

The speech signal from the input terminal 100 is input to the signal analysis means 10 through the speech section detection unit 21. In this example, the signal analysis means 10 has the same configuration as the speech analysis apparatus 10 shown in FIG. 1. The distortion correcting means 33 corrects the feature parameters output from the signal analysis means 10 to reduce the influence of additive distortion (step S23). The distortion correcting means 33 detects the variation range of each feature parameter with the variation range detecting means 33a (step S23a), and the dividing means 33b divides the feature parameter by the detected variation range (step S23b).

Specifically, for example, the discrete cosine coefficient vector C output from the vector connecting means 18 is output, in the time direction, as many times as the speech waveform cutting means 12 cuts out a waveform. As in the first and second embodiments, the k-th coefficient of this vector is denoted c_k(τ), where τ is the time discretized by the speech waveform cutting means 12; for example, when the speech waveform cutting means 12 cuts out 30 ms speech waveforms while advancing in 10 ms steps over a 1-second speech segment, τ takes values from 1 to 97. In the variation range detecting means 33a, the maximum value selecting means 33a1 selects the maximum discrete cosine coefficient Max_k, the maximum of c_k(τ) over τ, as given by the following equation.
Max_k = max{ c_k(τ) : α ≤ τ ≤ β }

β and α define the range searched for the maximum value and satisfy β ≥ α. If α = 1 and β is the maximum value of τ, the entire speech segment is searched; otherwise, only part of the speech segment is searched. The maximum discrete cosine coefficient Max_k is obtained for all or some of k. Similarly, the minimum value selecting means 33a2 obtains the minimum discrete cosine coefficient Min_k, the minimum of c_k(τ) over τ, as given by the following equation (step S23a1).
Min_k = min{ c_k(τ) : α ≤ τ ≤ β }

Next, the subtracting means 33a3 subtracts the minimum discrete cosine coefficient Min_k obtained by the minimum value selecting means 33a2 from the maximum discrete cosine coefficient Max_k obtained by the maximum value selecting means 33a1, and obtains the discrete cosine coefficient variation range Gain_k = Max_k − Min_k for all or some of k (step S23a2).
The dividing means 33b divides the discrete cosine coefficient c_k(τ) obtained by the vector connecting means 18 by the detected range Gain_k to normalize the parameter, obtaining the corrected discrete cosine coefficient Nc_k(τ) = φ_k(τ)·c_k(τ) / Gain_k. Here, φ_k(τ) is a real-valued parameter that adjusts the scale of the normalized parameter obtained by the division; for example, 1 is used. By computing Nc_k(τ) in this way for all or some of τ and for all or some of k, a discrete cosine coefficient vector in which additive distortion is corrected is obtained.
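The variation-range normalization of the third embodiment can be sketched as follows. The function name, the array layout (one row per time τ, one column per coefficient k), and the guard against a zero range are assumptions made for illustration only.

```python
import numpy as np

def normalize_by_range(c, alpha=1, beta=None, phi=1.0):
    """Return Nc_k(tau) = phi * c_k(tau) / Gain_k with Gain_k = Max_k - Min_k.

    c : array of shape (num_frames, num_coeffs); row tau-1 holds c_k(tau).
    """
    c = np.asarray(c, dtype=float)
    if beta is None:
        beta = c.shape[0]
    window = c[alpha - 1:beta]        # frames alpha..beta are searched
    max_k = window.max(axis=0)        # Max_k for each coefficient k
    min_k = window.min(axis=0)        # Min_k for each coefficient k
    gain = max_k - min_k              # Gain_k
    # Assumption: a coefficient that never varies is left unscaled.
    gain = np.where(gain > 0, gain, 1.0)
    return phi * c / gain
```

After this step each coefficient spans a range of exactly 1 over the searched frames when φ_k(τ) = 1, regardless of how strongly the original range was compressed or expanded.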
[Modified Embodiment]
Modified embodiments of the audio signal analysis of the present invention will now be described. In one of them, the discrete cosine transform means 17P and 17A of the first to third embodiments are omitted. FIG. 14 shows an example of the functional configuration in this case, and FIG. 15 shows an example of the processing procedure.

The speech signal from the input terminal 100 is input to the signal analysis means 35, through the speech section detection unit 21 as necessary, and signal analysis is performed (step S25). The signal analysis means 35 corresponds to the speech analysis apparatus 10 in FIG. 1 with the discrete cosine transform means 17P and 17A omitted: the periodic component power values from the subtracting means 16 and the aperiodic component power values from the power calculating means 15A are concatenated by the vector connecting means 18, and the logarithm of the concatenated vector is calculated by the logarithm calculating means 37. Accordingly, in the processing procedure, as shown in step S25 in FIG. 15, steps S1 to S6 in FIG. 2 are executed; then, without the discrete cosine transform of step S7, the power values are concatenated into a vector (step S27), and the logarithm of each power value of the concatenated vector is calculated (step S29). The vector connecting means 18 and the logarithm calculating means 37 in FIG. 14 constitute the feature vector generating means.
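The concatenation and logarithm steps (steps S27 and S29) can be sketched as follows. The natural logarithm and the small floor that keeps zero power values out of the logarithm are assumptions; the text does not specify the base or any flooring, and the function name is hypothetical.

```python
import numpy as np

def log_power_feature(periodic_power, aperiodic_power, floor=1e-10):
    """Concatenate per-band periodic and aperiodic powers and take the log.

    periodic_power, aperiodic_power : length-B sequences of band powers.
    floor : assumed lower bound applied before the logarithm.
    """
    p = np.asarray(periodic_power, dtype=float)
    a = np.asarray(aperiodic_power, dtype=float)
    joined = np.concatenate([p, a])              # vector connection (step S27)
    return np.log(np.maximum(joined, floor))     # logarithm (step S29)
```

For B = 24 bands this yields a 48-dimensional logarithmic power value vector for each analysis frame.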

The feature parameter from the signal analysis means 35, in this example the logarithmic power value vector, is input to the distortion correcting means 39, which performs distortion correction on the logarithmic power value vector (step S31). The distortion correcting means 39 is, for example, the distortion correcting means 22 in FIG. 8, the distortion correcting means 31 in FIG. 10, or the distortion correcting means 33 in FIG. 12. The distortion correction processing in step S31 is, for example, step S13 in FIG. 9, step S21 in FIG. 11, or step S23 in FIG. 13.
In the first and third embodiments, the distortion correction may be performed after the influence of multiplicative distortion on the feature parameters of the signal analysis result has been corrected. For example, as shown in FIG. 16, the discrete cosine coefficient vector from the signal analysis means 10 is input to the distortion correcting means 31, which corrects the influence of multiplicative distortion. This correction is performed, for example, with the same configuration as the distortion correcting means 31 shown in FIG. 10 for the second embodiment. The distortion correcting means 41 then applies further distortion correction to the corrected feature parameters. The distortion correcting means 41 is the distortion correcting means 22 in FIG. 8 of the first embodiment or the distortion correcting means 33 in FIG. 12 of the third embodiment.

In this processing procedure, for example as shown in FIG. 17, after the signal analysis processing of step S12 in FIG. 9, distortion correction for correcting the influence of multiplicative distortion is applied to the discrete cosine coefficient vector (step S21), and further distortion correction is then applied to the corrected vector (step S33). The distortion correction in step S33 is the distortion correction of step S13 in FIG. 9 or of step S23 in FIG. 13.
Performing further distortion correction after the influence of multiplicative distortion on the feature parameters has been corrected can also be applied to the logarithmic power value vector obtained from the signal analysis means 35. To make this clear, the signal analysis means 35 and the logarithm calculating means 37 are shown in parentheses in FIG. 10, and the signal analysis step S25 and the logarithm calculation step S29 are shown in parentheses in FIG. 17.
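The two-stage procedure of FIG. 17, time-average subtraction (step S21) followed by a second correction (step S33), can be sketched as follows. The function name, the `second_stage` switch, the use of the whole segment for all statistics, and the zero-scale guard are assumptions made for illustration.

```python
import numpy as np

def cascade_correction(c, second_stage="std"):
    """Subtract the time average, then divide by sigma_k or by Gain_k.

    c : array of shape (num_frames, num_coeffs); row tau-1 holds c_k(tau).
    second_stage : "std" (first embodiment) or "range" (third embodiment).
    """
    c = np.asarray(c, dtype=float)
    centered = c - c.mean(axis=0)                 # step S21, weights of 1
    if second_stage == "std":
        scale = centered.std(axis=0)              # standard deviation sigma_k
    elif second_stage == "range":
        scale = centered.max(axis=0) - centered.min(axis=0)   # Gain_k
    else:
        raise ValueError("second_stage must be 'std' or 'range'")
    # Assumption: coefficients with zero scale are left unscaled.
    scale = np.where(scale > 0, scale, 1.0)
    return centered / scale                       # step S33
```

Under these assumptions the output is unchanged when any coefficient is shifted by a constant and multiplied by a positive constant, which is one way to see why the cascade addresses both kinds of distortion at once.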

The number B of band-pass filters in the band-pass filter bank 11 is, for example, 24 when the sampling frequency of the input speech signal is 8 kHz, and it is preferable to increase B as the sampling frequency becomes higher. In this way, the influence of additive distortion can be corrected after the influence of multiplicative distortion has been corrected.
In the above description, a blocking-type comb filter that blocks the periodic component estimated by the period estimating means 13 and its integer multiples is used as the comb filter means 14; however, a pass-type comb filter that passes the estimated periodic component and its integer multiples may be used instead. FIG. 18 shows the processing procedure in that case. As described so far, the input speech signal undergoes speech section detection (step S11), band division (step S1), and fundamental period estimation for each band (step S2). Then, for each band signal, the filter units 14_1′, …, 14_B′ of the comb filter means 14 (shown in parentheses in FIG. 8; the same applies below) pass and select only the estimated periodic component of the corresponding band and its integer-multiple components (step S41). The power of the selected fundamental periodic component and its integer-multiple components in each band, that is, the periodic component power WP_b (b = 1, …, B), is calculated by the calculating unit 15P_b of the power calculating means 15P, and the power WF_b of each band signal is calculated by the calculating unit 15F_b of the power calculating means 15F (step S43). In each subtracting unit 16_b of the subtracting means 16, the output power WP_b from the calculating unit 15P_b is subtracted from the output power WF_b of the calculating unit 15F_b to obtain the aperiodic component power WA_b of each band (step S43). The aperiodic component power values from the subtracting means 16 are vectorized by the aperiodic component power vectorizing means 20A (step S7) and then discrete-cosine-transformed by the discrete cosine transform means 17A (step S8); likewise, the periodic component power values from the power calculating means 15P are vectorized by the periodic component power vectorizing means 20P (step S7) and then discrete-cosine-transformed by the discrete cosine transform means 17P (step S8). The remaining processing is the same as in the embodiments described above. In this case as well, as indicated by the one-dot chain lines in FIGS. 8 and 18, the periodic component power vector and the aperiodic component power vector obtained without the discrete cosine transform may be concatenated by the vector connecting means 18, and the logarithm of the concatenated vector may be obtained by the logarithm calculating unit 37. Also, as shown in parentheses in FIG. 18, the distortion correction processing (step S33) may be performed after the influence of multiplicative distortion has been removed in step S21 from the feature parameters obtained by signal analysis, as in FIG. 17.
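The pass-type variant for one band can be sketched as follows. The specific filter used here, the two-tap averaging comb y[n] = (x[n] + x[n − T]) / 2, is only one simple pass-type comb filter and is an assumption, as are the function name, the restriction of the power sums to samples n ≥ T, and the clamping of the aperiodic power at zero.

```python
import numpy as np

def band_powers_pass_comb(x, period):
    """Return (WP_b, WA_b) for one band signal and its estimated period.

    x      : 1-D array holding one analysis frame of the band signal.
    period : estimated fundamental period T in samples (0 < T < len(x)).
    """
    x = np.asarray(x, dtype=float)
    # Assumed pass-type comb filter: passes 1/T and its integer multiples.
    y = 0.5 * (x[period:] + x[:-period])
    wf = float(np.sum(x[period:] ** 2))   # band power WF_b
    wp = float(np.sum(y ** 2))            # periodic component power WP_b
    wa = max(wf - wp, 0.0)                # aperiodic power WA_b = WF_b - WP_b
    return wp, wa
```

For a signal that repeats exactly every T samples the filter output equals the input, so WA_b is essentially zero; for white noise roughly half of the band power remains in WA_b with this particular filter.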

It was stated above that the distortion correction based on the variance σ_k², the standard deviation σ_k, the time average m_k, or the variation range Gain_k may be computed for some of k and some of τ. "Some" here means any combination: for k, for example, the low-order coefficients, the high-order coefficients, or several suitably chosen ones may be used, and the same applies to τ. That is, for the signal from which feature parameters are to be extracted, appropriate k and τ are selected according to the factors that affect the extraction, for example whether the mixed-in noise is relatively stationary or sudden and how it varies in time or frequency, and likewise according to the origin of the multiplicative distortion. These may be determined in advance by experiment for the various factors. In this way, the k and τ used are those for which the long-time average of the discrete cosine coefficients obtained in the course of analysis approaches a constant value.

The time average m_k obtained in the second embodiment corresponds to the mean of a statistical distribution curve, and subtracting this time average m_k from the corresponding element (coefficient or power value) of the concatenated vector corresponds to normalizing that element. Accordingly, the variance, standard deviation, mean, and variation range obtained in the first to third embodiments are collectively referred to as statistical parameters, and division of a coefficient or power value by the variance, standard deviation, or variation range, as well as subtraction of the time average, are collectively referred to as normalization.
Although speech signals are analyzed in the embodiments described above, the signal analysis of the present invention can also be applied to other signals, such as music signals, in which periodic and aperiodic components are mixed.
[Fourth Embodiment]
The fourth embodiment is an embodiment of an apparatus and method that perform speech recognition by analyzing a speech signal according to any of the first to third embodiments or the modified embodiments. FIG. 19 shows an example of the functional configuration of the fourth embodiment, and FIG. 20 shows the processing procedure. In this example, learning speech data is input to the input terminal 200 of the speech recognition apparatus 60 (step S51), and learning processing is performed. That is, the learning speech data is analyzed by the signal analysis unit 62, and feature parameters are extracted (step S52). The learning speech data and the speech signal to be recognized that are input to the input terminal 200 are signal sequences sampled at a predetermined sampling frequency and converted into digital values. The signal analysis unit 62 includes signal analysis means and distortion correcting means similar to those of any of the first to third embodiments or the modified embodiments; the feature parameters extracted by the signal analysis means are distortion-corrected and output from the signal analysis unit 62.

The learning speech feature parameters are input to the pattern (learning) identification unit 64, which generates standard patterns from them and stores the patterns in the standard pattern storage unit 66 (step S53). A standard pattern is, for example, an HMM (hidden Markov model), whose parameters include the number of states, the number of distributions, and the transition probabilities and output probabilities for each phoneme.
Next, a speech signal to be recognized is input to the input terminal 200 (step S54), and the signal analysis unit 62 extracts feature parameters from the input speech signal (step S55).

The pattern (learning) identification unit 64 compares these feature parameters with the standard patterns stored in advance in the standard pattern storage unit 66, and outputs data representing the phoneme, word, or the like that corresponds to the standard pattern with the highest similarity (step S56). For the specific learning and recognition processing, see, for example, Kenji Kita and two others, "Speech Language Processing" (音声言語処理), Morikita Publishing Co., Ltd., 1996, pp. 37-43.
In this example, the standard patterns are first learned from learning speech data. Alternatively, a standard pattern storage unit 66 may be used that already stores standard patterns generated in advance from feature parameters of the same type as those extracted by the signal analysis unit 62; that is, steps S51 to S53 in FIG. 20 may be omitted and only recognition of the input speech signal performed. In that case, the pattern identification unit 64 performs only the recognition processing.

The learning speech data used to generate the standard patterns preferably has superimposed on it environmental noise similar to that of the environment in which the speech to be recognized is picked up, and the signal analysis unit that extracts feature parameters from the learning speech data is preferably identical or similar to the one that extracts feature parameters from the input speech signal to be recognized.
The audio signal analysis apparatuses of the first to second embodiments and the modified embodiments, and the speech recognition apparatus of the fourth embodiment, can each be implemented by a computer. A program for causing the computer to function as, for example, the audio signal analysis apparatus shown in FIG. 8 is installed from a recording medium such as a magnetic disk, a CD-ROM, or a semiconductor storage device, or downloaded via a communication line, and then executed by the computer. When a computer is made to function as the analysis apparatus or the recognition apparatus, the target signal is first loaded into a storage device in the computer and then processed.
[Experimental Example]
To show the effects of the present invention, experiments are described below that compare, for digit recognition under noise, the recognition accuracy of a speech recognition apparatus using speech feature parameters obtained by the speech signal analysis method of the present invention with that of the speech recognition apparatus shown in Non-Patent Document 1 cited in the [Prior Art] section (referred to simply as the conventional apparatus).
Experiment 1
Experiment 1 clarifies the effect of the first embodiment. It used the noisy speech recognition evaluation environment AURORA-2J of the Noisy Speech Recognition Evaluation Working Group, Special Interest Group on Spoken Language Processing, Information Processing Society of Japan. Both the apparatus of the first embodiment and the conventional apparatus used a 24-channel gammatone filter bank as the filter bank 11, and the speech waveform cutting means 12 cut out 25 ms speech waveforms every 10 ms. The coefficient vectors after the discrete cosine transform corresponding to the periodic component power WP_b and the aperiodic component power WA_b had 12 dimensions each; together with a power value representing the power of the entire input signal, they formed a 25-dimensional feature vector. The Δ and ΔΔ parameters, which are its dynamic features, were obtained in the same manner as ΔMFCC, ΔΔMFCC, Δ power, and ΔΔ power (see Non-Patent Document 3, page 13), and the resulting 75-dimensional vector was used as the feature parameter.
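The dimension count (12 + 12 + 1 = 25 static values, tripled to 75 by the Δ and ΔΔ parameters) can be sketched as follows. The regression formula for the dynamic features is a commonly used one and is an assumption here; the exact method of Non-Patent Document 3 is not reproduced in this text. The function names, the window half-width of 2 frames, and the edge padding are likewise illustrative.

```python
import numpy as np

def delta(feat, width=2):
    """Regression-based dynamic features over +/- width frames (assumed form)."""
    feat = np.asarray(feat, dtype=float)
    num_frames = feat.shape[0]
    # Repeat the first and last frames so every frame has full context.
    padded = np.pad(feat, ((width, width), (0, 0)), mode="edge")
    num = np.zeros_like(feat)
    for n in range(1, width + 1):
        ahead = padded[width + n:width + n + num_frames]     # frame t + n
        behind = padded[width - n:width - n + num_frames]    # frame t - n
        num += n * (ahead - behind)
    return num / (2 * sum(n * n for n in range(1, width + 1)))

def build_feature(periodic_dct, aperiodic_dct, power):
    """Stack 12 + 12 + 1 static values with their delta and delta-delta."""
    power = np.asarray(power, dtype=float)
    static = np.hstack([periodic_dct, aperiodic_dct, power[:, None]])  # 25 dims
    d = delta(static)        # delta parameters
    dd = delta(d)            # delta-delta parameters
    return np.hstack([static, d, dd])                                   # 75 dims
```

Calling `build_feature` with two arrays of shape (T, 12) and a length-T power sequence returns an array of shape (T, 75).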

In the apparatus of the first embodiment, the discrete cosine coefficient vector was first corrected for multiplicative distortion by the distortion correcting means 31 as shown in FIG. 16. Then, in the distortion correcting means 22 in FIG. 8, all of the discrete cosine coefficients were corrected using their standard deviations, computed over the entire speech segment of each utterance, so as to suppress the influence of external fluctuation factors and of fluctuation factors inherent in speech.
For the learning processing in the pattern (learning) identification unit 64, digit HMMs with 16 states and 24 Gaussian mixture components were used, and HMM learning was performed with the noise-mixed digit-reading learning speech data of 8,440 utterances included in the learning speech data supplied with AURORA-2J, together with the HMM learning script (learning program). Digit recognition accuracy under noise was evaluated with the evaluation data supplied with AURORA-2J, using the digit-reading speech on which in-car noise, which involves strong additive distortion, is superimposed at the same power as the speech (signal-to-noise ratio 0 dB, 1,001 utterances).

FIG. 21 shows the recognition accuracy results. As shown in FIG. 21, the recognition accuracy of the speech recognition apparatus using the speech signal analysis method of the first embodiment, which corrects additive distortion, was about 10% or more higher than that of the conventional apparatus, showing that the method of the first embodiment effectively improves robustness.
Experiment 2
Experiment 2 clarifies the effect of the second embodiment; only the differences from Experiment 1 are described. The 75-dimensional feature parameter vector was corrected for multiplicative distortion by the distortion correcting means 31 in FIG. 10, the number of Gaussian distributions of the digit HMMs processed by the identification unit 64 was set to 20, and the evaluation data were the 14,014 utterances of digit-reading speech under noise with multiplicative distortion in the evaluation data supplied with AURORA-2J.

FIG. 22 shows the average recognition accuracy. As shown in FIG. 22, the recognition accuracy of the speech recognition apparatus using the speech signal analysis method of the second embodiment, which corrects multiplicative distortion, was about 10% higher than that of the conventional apparatus, showing that the method of the second embodiment is effective against multiplicative distortion.
Experiment 3
Experiment 3 clarifies the effect of the third embodiment; only the differences from Experiment 1 are described. The 75-dimensional feature parameter vector that had been corrected for multiplicative distortion was normalized by the variation range Gain_k in the distortion correcting means 33 in FIG. 12 to correct additive distortion.

認識精度の結果を図２３に示す。図２３に示されたとおり、第３実施形態による加法性歪を補正する音声信号分析方法を用いた場合の装置の認識精度が従来装置の認識精度よりも１０％程度以上高く、第３実施形態の手法が加法性歪に効果的であることが明らかにされた。
従来との差の理由
図７に示した従来方法および非特許文献２及び４にそれぞれ示す従来方法のいずれにおいても、離散フーリエ変換の結果得られるパワースペクトルに基づいた離散コサイン係数であることが前提となる。具体的には、雑音や乗法性歪が時間方向に急激な変化なくパワースペクトルに一定の変動を与えており、かつ音声のパワースペクトルの長時間平均が一定の形状に近づく性質を前提とする。つまり音声信号のパワースペクトルの形状を利用するものである。 The result of recognition accuracy is shown in FIG. As shown in FIG. 23, the recognition accuracy of the apparatus when the speech signal analysis method for correcting additive distortion according to the third embodiment is used is about 10% or more higher than the recognition accuracy of the conventional apparatus. This method is effective for additive distortion.
Reason for the difference from the conventional methods
Both the conventional method shown in FIG. 7 and the conventional methods described in Non-Patent Documents 2 and 4 presuppose discrete cosine coefficients based on a power spectrum obtained by a discrete Fourier transform. Specifically, they assume that noise and multiplicative distortion impose a constant variation on the power spectrum without abrupt changes over time, and that the long-term average of the speech power spectrum approaches a fixed shape. In other words, they exploit the shape of the power spectrum of the speech signal.

In contrast, the discrete cosine coefficients extracted as feature parameters by the conventional method shown in FIG. 1 are not based on the shape of the power spectrum. Moreover, because the speech signal is divided into a periodic component and an aperiodic component, there is no guarantee that the long-term average of each component approaches a constant. Normalizing the discrete cosine coefficients (concatenated vector) obtained by the conventional method of FIG. 1 by their average, variation range, variance, or standard deviation would therefore not ordinarily be considered.
In the present invention, however, the feature parameters to which normalization is applied are those discrete cosine coefficients or power values, obtained in the course of the analysis, whose long-term average approaches a constant value. That is, statistical parameters are obtained for a part of the feature parameters, and that part is normalized by those statistical parameters, which yields the excellent effects described above.
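Normalizing only a part of the feature parameters can be sketched as follows. This is a minimal illustration, assuming the statistics are taken over the frames of one signal section and that the part to be normalized is given as a list of element indices. The function name `normalize_subset`, the `divide_by_std` flag, and the zero-deviation guard are hypothetical and do not appear in the source.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def normalize_subset(frames: List[List[float]],
                     indices: Sequence[int],
                     divide_by_std: bool = False) -> List[List[float]]:
    """Normalize only the selected vector elements; leave the rest as is."""
    out = [list(frame) for frame in frames]
    for k in indices:
        column = [frame[k] for frame in frames]
        mu = mean(column)  # time average of element k
        sigma = pstdev(column) if divide_by_std else 1.0
        if sigma == 0.0:
            sigma = 1.0  # assumption: skip scaling of a constant element
        for t, frame in enumerate(frames):
            out[t][k] = (frame[k] - mu) / sigma
    return out


# Made-up 3-dimensional feature vectors over three frames.
frames = [[1.0, 5.0, 7.0], [3.0, 5.0, 9.0], [5.0, 5.0, 2.0]]
mean_only = normalize_subset(frames, indices=[0])
mean_and_std = normalize_subset(frames, indices=[0, 2], divide_by_std=True)
```

With `divide_by_std=False` the sketch corresponds to normalization by subtracting the time average; with `divide_by_std=True` it subtracts the time average and then divides by the standard deviation. Elements not listed in `indices` pass through unchanged.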

However, for a part of the concatenated vector (feature parameters, discrete cosine coefficient vector) obtained by the conventional method shown in FIG. 1, that is, for some plurality of the elements of the vector and/or some plurality of frames (analysis sections) in the speech section, the inventors noted that the time average becomes approximately constant, and considered normalizing the corresponding feature parameters by that average value.
In each of the experiments described above, distortion correction was applied to all of the feature parameters rather than to a part of them, and the method of the present invention still gave superior results. This is thought to be because the distortion correction of that part of the feature parameters has a large influence. An even greater effect is expected if distortion correction is applied only to the portion whose time average is approximately constant.

FIG. 1 is a block diagram showing the functional configuration of a speech signal analysis apparatus, for explaining the technique of Non-Patent Document 1.
FIG. 2 is a flowchart showing the processing procedure of the apparatus shown in FIG. 1.
FIG. 3 is a diagram showing an example of the frequency characteristics of a gammatone filter bank.
FIG. 4: A shows an example of an input speech waveform, B shows an example of the characteristics of three gammatone filters, and C shows the output signal of each of those filters.
FIG. 5: A shows an example of a cut-out speech waveform, and B shows its autocorrelation function.
FIG. 6: A shows examples of the input signals of three comb filters, B shows an example of the frequency characteristics of the comb filters, and C shows examples of their output signals.
FIG. 7 is a block diagram showing the functional configuration of a speech signal analysis apparatus, for explaining the technique of Non-Patent Document 3.
FIG. 8 is a block diagram showing an example of the functional configuration of the apparatus of the first embodiment.
FIG. 9 is a flowchart showing an example of the procedure of the analysis method of the first embodiment.
FIG. 10 is a block diagram showing an example of the functional configuration of the apparatus of the second embodiment.
FIG. 11 is a flowchart showing an example of the procedure of the method of the second embodiment.
FIG. 12 is a block diagram showing an example of the functional configuration of the apparatus of the third embodiment.
FIG. 13 is a flowchart showing an example of the procedure of the method of the third embodiment.
FIG. 14 is a block diagram showing an example of the functional configuration of the apparatus of an embodiment that uses power values as feature parameters.
FIG. 15 is a flowchart showing an example of the processing procedure of the apparatus shown in FIG. 14.
FIG. 16 is a block diagram showing an example of the functional configuration of an embodiment in which statistical parameters are obtained after multiplicative distortion correction of the feature parameters.
FIG. 17 is a flowchart showing an example of the processing procedure of the apparatus shown in FIG. 16.
FIG. 18 is a flowchart showing an example of the processing procedure of another embodiment of feature parameter generation.
FIG. 19 is a block diagram showing an example of the functional configuration of an embodiment of the speech recognition apparatus according to the present invention.
FIG. 20 is a flowchart showing an example of the processing procedure of an embodiment of the speech recognition method according to the present invention.
FIG. 21 is a diagram showing recognition results, evaluated using AURORA-2J, for clarifying the effect of the first embodiment.
FIG. 22 is a diagram showing recognition results, evaluated using AURORA-2J, for clarifying the effect of the second embodiment.
FIG. 23 is a diagram showing recognition results, evaluated using AURORA-2J, for clarifying the effect of the third embodiment.

Claims

A bandpass filter bank that filters the input audio signal into multiple band signals;
A fundamental period estimator for estimating a fundamental period included in each band signal;
Comb filters, each set with a corresponding one of the fundamental periods, that filter the respective band signals and output one of a periodic component and a non-periodic component included in each band signal;
First power calculating means for calculating the one power of the periodic component and the non-periodic component of each band;
Second power calculating means for calculating the power of each band signal;
Subtracting means for subtracting the output power value of the first power calculating means from the output power value of the second power calculating means and outputting the power value of the other of the periodic component and the non-periodic component of each band;
First and second vectorizing means for vectorizing the periodic component power value of each band and the non-periodic component power value of each band;
Feature parameter generating means for generating a feature parameter from the periodic component power value vector and the non-periodic component power value vector;
Statistical parameter generating means for calculating statistical parameters for at least some of the characteristic parameters in the signal section of the audio signal;
An audio signal analyzing apparatus comprising: normalizing means for normalizing a corresponding one of the characteristic parameters with the statistical parameter and outputting the result as an analysis result characteristic parameter.

The apparatus of claim 1.
The feature parameter generating means comprises: first and second discrete cosine transform means for performing a discrete cosine transform on the periodic component power value vector and the non-periodic component power value vector, respectively, to obtain a periodic component discrete cosine coefficient vector and a non-periodic component discrete cosine coefficient vector; and
vector concatenating means for concatenating the periodic component discrete cosine coefficient vector and the non-periodic component discrete cosine coefficient vector into the feature parameter.

The apparatus of claim 1.
The feature parameter generating means comprises: vector concatenating means for concatenating the periodic component power value vector and the non-periodic component power value vector; and
logarithm calculating means for calculating logarithmic values of the concatenated vector and outputting them as the feature parameter.

The apparatus of claim 1.
The statistical parameter is a variance value of at least some of the characteristic parameters,
The audio signal analyzing apparatus according to claim 1, wherein the normalizing means is a dividing means.

The apparatus of claim 1.
The statistical parameter is the standard deviation of at least some of the characteristic parameters,
The audio signal analyzing apparatus according to claim 1, wherein the normalizing means is a dividing means.

The apparatus of claim 1.
The statistical parameter is a variation range, and the statistical parameter calculating means is a variation range detecting means comprising means for selecting the maximum value of the at least some feature parameters, means for selecting the minimum value of the at least some feature parameters, and subtracting means for subtracting the minimum value from the maximum value to obtain the variation range;
The audio signal analyzing apparatus according to claim 1, wherein the normalizing means is a dividing means.

The device according to claim 5 or 6,
Time averaging means for obtaining a time average of the at least some feature parameters from the feature parameter generating means; and
Subtracting means for subtracting the time average from the corresponding ones of the at least some feature parameters to obtain the at least some feature parameters that are supplied to the statistical parameter calculating means and the normalizing means.

The apparatus of claim 1.
The statistical parameter is an average, and the statistical parameter calculation means is a time average means for obtaining a time average of the at least some characteristic parameters,
The audio signal analyzing apparatus according to claim 1, wherein the normalizing means is a subtracting means.

Filter the input audio signal into multiple band signals,
Estimating the fundamental period included in each band signal above,
Comb filter processing each band signal based on the estimated fundamental period to obtain one of a periodic component and an aperiodic component included in the band signal,
Calculate the above one power value of the periodic component and non-periodic component of each band,
Calculate the power value of each band signal above,
Subtracting the one power value from the power value of each band signal to obtain the other power value of the periodic component and non-periodic component of each band,
Vectorize the periodic component power value of each band and the aperiodic component power value of each band,
A feature parameter is generated from the periodic component power value vector and the non-periodic component power value vector,
Calculating statistical parameters for at least some of the characteristic parameters in the signal section of the audio signal;
An audio signal analysis method characterized by normalizing a corresponding one of the feature parameters with the statistical parameter to obtain an analysis result feature parameter.

The method of claim 9, wherein
The periodic component power value vector and the non-periodic component power value vector are each subjected to a discrete cosine transform to obtain a periodic component discrete cosine coefficient vector and a non-periodic component discrete cosine coefficient vector,
A method for analyzing an audio signal, wherein the periodic component discrete cosine coefficient vector and the non-periodic component discrete cosine coefficient vector are connected to form the characteristic parameter.

The method of claim 9, wherein
Connecting the periodic component power value vector and the non-periodic component power value vector;
An audio signal analysis method characterized in that a logarithmic value of the concatenated vector is calculated and used as the feature parameter.

The method of claim 9, wherein
The statistical parameter is a variance value of the characteristic parameter,
An audio signal analysis method, wherein the normalization is performed by dividing a corresponding one of the at least some feature parameters by the variance value.

The method of claim 9, wherein
The statistical parameter is the standard deviation of at least some of the characteristic parameters,
An audio signal analysis method, wherein the normalization is performed by dividing the corresponding one of the at least some feature parameters by the standard deviation.

The method of claim 9, wherein
The statistical parameter is a variation range, and the maximum value and the minimum value of at least some of the characteristic parameters are selected,
Subtract the minimum value from the maximum value to obtain the fluctuation range,
An audio signal analysis method, wherein the normalization is performed by dividing the corresponding one of the at least some characteristic parameters by the fluctuation range.

The method according to claim 13 or 14, wherein
Find the time average of at least some of the above characteristic parameters,
An audio signal analysis method characterized in that the time average is subtracted from at least a part of the corresponding characteristic parameter and used for the calculation of the statistical parameter and the normalization.

The method of claim 9, wherein
The statistical parameter is an average, and a time average of at least some of the characteristic parameters is obtained,
An audio signal analysis method, wherein the normalization is performed by subtracting the time average from at least some of the feature parameters.

A standard pattern storage unit in which standard feature parameters are stored;
A signal analysis unit that extracts a voice feature parameter from the input voice signal by the audio signal analysis device according to any one of claims 7 to 8,
A pattern identifying unit that receives the speech feature parameter and performs speech recognition on the speech signal using the standard pattern;
A speech recognition apparatus comprising:

The input audio signal is analyzed by the audio signal analysis method according to any one of claims 9 to 16 to obtain a characteristic parameter,
A speech recognition method characterized by performing speech recognition using the feature parameters for learning and recognition.

A program for causing a computer to function as the apparatus according to any one of claims 1 to 8.

A computer-readable recording medium on which the program according to claim 19 is recorded.
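
As an illustration of the analysis steps recited in the method claims (fundamental period estimation, comb filtering, power calculation, and subtraction), the following minimal sketch processes a single band signal. It is a sketch under stated assumptions and not the claimed implementation: the period is estimated by picking the largest autocorrelation value within a lag range, and the comb filter is a simple first-order difference that blocks components repeating at the estimated period. The names `estimate_period` and `split_band_power`, the lag range, and the test signal are all hypothetical. The filter bank, vectorization, discrete cosine transform, and normalization steps are not shown.

```python
import math
from typing import List, Tuple


def estimate_period(x: List[float], min_lag: int, max_lag: int) -> int:
    """Return the lag in [min_lag, max_lag] with the largest autocorrelation."""
    def autocorr(lag: int) -> float:
        return sum(x[n] * x[n - lag] for n in range(lag, len(x)))
    return max(range(min_lag, max_lag + 1), key=autocorr)


def split_band_power(x: List[float], period: int) -> Tuple[float, float]:
    """Return (periodic_power, aperiodic_power) for one band signal."""
    # Comb filter that blocks components repeating every `period` samples.
    # This first-order form is an assumption, not the filter of the source.
    residual = [(x[n] - x[n - period]) / 2.0 for n in range(period, len(x))]
    aperiodic_power = sum(v * v for v in residual) / len(residual)
    # Power of the unfiltered band signal over the same samples.
    total_power = sum(x[n] * x[n] for n in range(period, len(x))) / len(residual)
    # The power of the other component is obtained by subtraction.
    periodic_power = max(total_power - aperiodic_power, 0.0)
    return periodic_power, aperiodic_power


# Hypothetical band signal: a pure sinusoid with a period of 20 samples.
band = [math.sin(2.0 * math.pi * n / 20.0) for n in range(200)]
period = estimate_period(band, min_lag=10, max_lag=30)
periodic_power, aperiodic_power = split_band_power(band, period)
```

For this purely periodic input the comb filter output is essentially zero, so nearly all of the band power is assigned to the periodic component. With a noisy input, part of the power would remain in the comb filter output and be counted as aperiodic.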