JP2016033677A

JP2016033677A - Voice feature quantity extraction device, voice feature quantity extraction method, and voice feature quantity extraction program

Info

Publication number: JP2016033677A
Application number: JP2015216661A
Authority: JP
Inventors: 匡伸中村; Masanobu Nakamura; 貴史益子; Takashi Masuko
Original assignee: Toshiba Corp
Current assignee: Toshiba Corp
Priority date: 2012-01-10
Filing date: 2015-11-04
Publication date: 2016-03-10
Anticipated expiration: 2032-03-09
Also published as: JP6092345B2

Abstract

PROBLEM TO BE SOLVED: To extract a voice feature quantity capable of improving the noise resistance of voice recognition. SOLUTION: A voice feature quantity extraction device includes a segmentation section 101 and a calculation section 106. The segmentation section 101 generates either a unit voice signal 11 or a plurality of sub-band unit voice signals by cutting out, for each unit time, a voice waveform of a predetermined time length from either an input voice signal 10 or a plurality of sub-band input voice signals obtained by extracting the signal components of a plurality of frequency bands from the input voice signal 10. The calculation section 106 obtains a voice feature quantity 16 by calculating either the average time of the unit voice signal 11 in each of the plurality of frequency bands or the average time of each of the plurality of sub-band unit voice signals. SELECTED DRAWING: Figure 1

Description

Embodiments described herein relate to a technique for extracting speech features.

Speech recognition technology that remains practical in noisy environments is becoming increasingly important. In a noisy environment, degradation of recognition accuracy due to noise is a problem. Speech recognition is performed using speech features extracted from an input speech signal. Mel-frequency cepstral coefficients (MFCC) are a well-known type of speech feature. However, speech recognition that uses only MFCC cannot be said to have sufficiently high noise robustness. A speech feature that can improve the noise robustness of speech recognition is therefore desired.

Non-Patent Document 1: Yamamoto et al., "Study on speech recognition by using long-time phase feature and amplitude spectrum feature together," Proceedings of the Acoustical Society of Japan 2011 Autumn Meeting, 2-Q-13
Non-Patent Document 2: L. Cohen, "Time-Frequency Analysis," Asakura Shoten, October 1, 1998, pp. 4-5
Non-Patent Document 3: Yamamoto et al., "Study on speech recognition using phase information based on long-term analysis," Speech Signal Processing Technical Report SP2010-40

An object of the embodiments is to extract a speech feature that can improve the noise robustness of speech recognition.

According to an embodiment, a speech feature extraction device includes a cutout unit and a first calculation unit. The cutout unit generates either a unit speech signal or a plurality of subband unit speech signals by cutting out, for each unit time, a speech waveform of a predetermined time length from either an input speech signal or a plurality of subband input speech signals obtained by extracting the signal components of a plurality of frequency bands from the input speech signal. The first calculation unit obtains a speech feature by calculating either the average time of the unit speech signal in each of the plurality of frequency bands or the average time of each of the plurality of subband unit speech signals.

FIG. 1 is a block diagram illustrating a speech feature extraction device according to the first embodiment.
FIG. 2 is a flowchart illustrating the operation of the speech feature extraction device of FIG. 1.
FIG. 3 is a block diagram illustrating a speech feature extraction device according to the second embodiment.
FIG. 4 is a flowchart illustrating the operation of the speech feature extraction device of FIG. 3.
FIG. 5 is a flowchart illustrating the operation of a speech feature extraction device according to a comparative example of the second embodiment.
FIG. 6 is an explanatory diagram of the effect of the second embodiment.
FIG. 7 is a block diagram illustrating a speech recognition device according to the third embodiment.
FIG. 8 is a block diagram illustrating a speech feature extraction device according to the fourth embodiment.
FIG. 9 is a flowchart illustrating the operation of the speech feature extraction device of FIG. 8.
FIG. 10 is an explanatory diagram of the band-specific average time calculated in the fourth embodiment.
FIG. 11 is a graph showing the band-specific average times calculated in the first embodiment and the fourth embodiment.
FIG. 12 is a graph showing the band-specific average times calculated in the first embodiment and the fourth embodiment.
FIG. 13 is a block diagram illustrating a speech feature extraction device according to the fifth embodiment.
FIG. 14 is a flowchart illustrating the operation of the speech feature extraction device of FIG. 13.
FIG. 15 is a graph showing the band-specific average times calculated in the first embodiment and the fourth embodiment.

Embodiments are described below with reference to the drawings. In the following, elements identical or similar to those already described are given identical or similar reference numerals, and redundant description is basically omitted.

(First embodiment)
As illustrated in FIG. 1, the speech feature extraction device according to the first embodiment includes a waveform cutout unit 101, a power spectrum calculation unit 102, a third spectrum calculation unit 103, filter bank application units 104 and 105, a band-specific average time calculation unit 106, and an axis conversion unit 107. The speech feature extraction device of FIG. 1 extracts a speech feature 17 from an input speech signal 10.

The waveform cutout unit 101 acquires the input speech signal 10 from outside. The waveform cutout unit 101 generates the unit speech signal 11 (x_n(t)) at time (n) by cutting out, for each unit time, a speech waveform of time length T (for example, T = 56 milliseconds) from the input speech signal 10. In the following description, the time length T is also called the analysis window width. In addition to cutting out the speech waveform of time length T, the waveform cutout unit 101 may generate the unit speech signal 11 by removing the DC component of the cut-out waveform, emphasizing its high-frequency components, multiplying it by a window function (for example, a Hamming window), and so on. The waveform cutout unit 101 outputs the unit speech signal 11 to the power spectrum calculation unit 102 and the third spectrum calculation unit 103.
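The cutout step can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `cut_frames`, the 10 ms frame shift, and the pre-emphasis coefficient 0.97 are assumptions that do not appear in the description; only the 56 ms window, DC removal, high-frequency emphasis, and Hamming window are taken from the text.

```python
import math

def cut_frames(signal, sample_rate, window_ms=56.0, shift_ms=10.0, preemph=0.97):
    """Cut windowed frames (unit speech signals) from a list of samples.

    shift_ms and preemph are illustrative assumptions, not values from the text.
    """
    win = int(round(sample_rate * window_ms / 1000.0))   # analysis window width T
    hop = int(round(sample_rate * shift_ms / 1000.0))    # unit time between frames
    frames = []
    for start in range(0, len(signal) - win + 1, hop):
        frame = signal[start:start + win]
        mean = sum(frame) / win
        frame = [v - mean for v in frame]                # remove the DC component
        # Emphasize high frequencies with a first-order difference filter.
        emphasized = [frame[0]] + [frame[i] - preemph * frame[i - 1]
                                   for i in range(1, win)]
        # Multiply by a Hamming window.
        frames.append([emphasized[i] * (0.54 - 0.46 * math.cos(2.0 * math.pi * i / (win - 1)))
                       for i in range(win)])
    return frames
```

Each returned frame corresponds to one unit speech signal x_n(t), with t counted in samples from the start of the frame.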

The power spectrum calculation unit 102 receives the unit speech signal 11 from the waveform cutout unit 101 and calculates the power spectrum 12 of the unit speech signal 11. Specifically, applying a complex Fourier transform to the unit speech signal 11 yields the first spectrum (X(ω)) for each frequency (ω), as shown in formula (1) below.
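Formula (1) is not reproduced in this text. From the definitions of X_R(ω), X_I(ω), and j, it is presumably of the following form (a reconstruction; the sign convention of the Fourier transform is an assumption):

```latex
X(\omega) = \sum_{t} x_n(t)\, e^{-j\omega t} = X_R(\omega) + j\, X_I(\omega) \tag{1}
```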

Here, X_R(ω) represents the real part of the first spectrum (X(ω)), X_I(ω) represents its imaginary part, and j represents the imaginary unit. The power spectrum calculation unit 102 then obtains the power spectrum 12 by calculating the power of the first spectrum, as shown in formula (2) below.
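Formula (2) is likewise not reproduced here. Since it is the power of the first spectrum, it is presumably of the following form (a reconstruction):

```latex
|X(\omega)|^{2} = X_R^{2}(\omega) + X_I^{2}(\omega) \tag{2}
```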

The power spectrum calculation unit 102 outputs the power spectrum 12 to the filter bank application unit 104.

The third spectrum calculation unit 103 receives the unit speech signal 11 from the waveform cutout unit 101. The third spectrum calculation unit 103 calculates the third spectrum 13 using the first spectrum (X(ω)) described above and a second spectrum, which is the spectrum of the product of the unit speech signal 11 (x_n(t)) and time (t). For example, as shown in formula (3) below, the second spectrum for each frequency (ω) can be derived by applying a complex Fourier transform to the product of the unit speech signal 11 (x_n(t)) and time (t).
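Formula (3) is not reproduced in this text. Given that the second spectrum is the Fourier transform of t·x_n(t), it is presumably of the following form (a reconstruction, using the same assumed sign convention as for the first spectrum):

```latex
Y(\omega) = \sum_{t} t\, x_n(t)\, e^{-j\omega t} = Y_R(\omega) + j\, Y_I(\omega) \tag{3}
```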

Here, Y_R(ω) represents the real part of the second spectrum (Y(ω)) and Y_I(ω) represents its imaginary part. The third spectrum calculation unit 103 calculates a first product of the real part of the first spectrum (X_R(ω)) and the real part of the second spectrum (Y_R(ω)), calculates a second product of the imaginary part of the first spectrum (X_I(ω)) and the imaginary part of the second spectrum (Y_I(ω)), and obtains the third spectrum 13 by adding the first product and the second product. That is, the third spectrum calculation unit 103 can calculate the third spectrum 13 (XY(ω)) for each frequency (ω) as shown in formula (4) below.
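Formula (4) is not reproduced in this text, but the sentence above determines it: the sum of the product of the real parts and the product of the imaginary parts.

```latex
XY(\omega) = X_R(\omega)\, Y_R(\omega) + X_I(\omega)\, Y_I(\omega) \tag{4}
```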

The third spectrum calculation unit 103 outputs the third spectrum 13 to the filter bank application unit 105.
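The power spectrum and third spectrum of one frame can be sketched as below. This is an illustrative sketch under stated assumptions: the helper names `dft` and `power_and_third_spectrum` are hypothetical, a naive O(N²) discrete Fourier transform stands in for the complex Fourier transform, and time t is measured in samples from the start of the frame.

```python
import cmath

def dft(samples):
    """Naive discrete Fourier transform (assumed sign convention e^{-j w t})."""
    n = len(samples)
    return [sum(samples[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def power_and_third_spectrum(frame):
    """Return (|X|^2, X_R*Y_R + X_I*Y_I) for every frequency bin of one frame."""
    first = dft(frame)                                   # first spectrum X
    second = dft([t * v for t, v in enumerate(frame)])   # second spectrum Y of t * x(t)
    power = [x.real ** 2 + x.imag ** 2 for x in first]
    third = [x.real * y.real + x.imag * y.imag for x, y in zip(first, second)]
    return power, third
```

For a frame containing a single impulse at sample 3, every bin of the third spectrum divided by the power spectrum equals 3, which is the delay of the impulse in samples.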

The filter bank application unit 104 receives the power spectrum 12 from the power spectrum calculation unit 102. The filter bank application unit 104 applies a filter bank to the power spectrum 12 to obtain a filtered power spectrum 14, and outputs the filtered power spectrum 14 to the band-specific average time calculation unit 106. The filter bank applied by the filter bank application unit 104 includes one or more (for example, 16) frequency filters. Each frequency filter may be a triangular filter, a rectangular filter, or the like. The filter bank may be a mel filter bank, a linear filter bank, or the like.
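One possible filter bank is sketched below: linearly spaced triangular filters applied bin by bin. The function names and the choice of a linear (rather than mel) spacing are assumptions made for illustration; the description allows either.

```python
def triangular_filter_bank(num_bins, num_filters):
    """Build linearly spaced triangular filters over bins 0 .. num_bins - 1."""
    step = (num_bins - 1) / (num_filters + 1)
    points = [i * step for i in range(num_filters + 2)]  # band edges and centres
    bank = []
    for m in range(1, num_filters + 1):
        left, centre, right = points[m - 1], points[m], points[m + 1]
        weights = []
        for k in range(num_bins):
            if left <= k <= centre:
                weights.append((k - left) / (centre - left))    # rising slope
            elif centre < k <= right:
                weights.append((right - k) / (right - centre))  # falling slope
            else:
                weights.append(0.0)
        bank.append(weights)
    return bank

def apply_filter_bank(spectrum, bank):
    """Weight the spectrum by each filter, giving one filtered spectrum per band."""
    return [[w * s for w, s in zip(weights, spectrum)] for weights in bank]
```

The same bank would be applied to both the power spectrum and the third spectrum so that the two filtered spectra cover identical bands.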

The filter bank application unit 105 receives the third spectrum 13 from the third spectrum calculation unit 103. The filter bank application unit 105 applies a filter bank to the third spectrum 13 to obtain a filtered third spectrum 15, and outputs the filtered third spectrum 15 to the band-specific average time calculation unit 106. The filter bank applied by the filter bank application unit 105 must include the same number of frequency filters as the filter bank applied by the filter bank application unit 104. Preferably, the filter bank application unit 105 applies the same filter bank as the filter bank application unit 104. In the following description, the filter bank application unit 105 is assumed to apply the same filter bank as the filter bank application unit 104.

The band-specific average time calculation unit 106 receives the filtered power spectrum 14 from the filter bank application unit 104 and the filtered third spectrum 15 from the filter bank application unit 105. Based on the filtered power spectrum 14 and the filtered third spectrum 15, the band-specific average time calculation unit 106 calculates the average time of the unit speech signal 11 in each of one or more frequency bands, which may also be called subbands (this average time is referred to below as the band-specific average time 16). The band-specific average time calculation unit 106 outputs the band-specific average time 16 to the axis conversion unit 107. Details of the processing of the band-specific average time calculation unit 106 are described later.

The axis conversion unit 107 receives the band-specific average time 16 from the band-specific average time calculation unit 106. The axis conversion unit 107 applies an axis conversion process to the band-specific average time 16 to generate the speech feature 17. In the following description, the speech feature 17 is also called the sub-band average time cepstrum (SATC). The axis conversion unit 107 can use, for example, the discrete cosine transform (DCT). The axis conversion unit 107 outputs the speech feature 17 to the outside. The axis conversion unit 107 may be omitted, in which case the band-specific average time 16 is output to the outside as the speech feature 17. For example, when the filter bank applied by the filter bank application units 104 and 105 contains only one frequency filter, the axis conversion unit 107 is unnecessary.
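As a sketch of the axis conversion, the band-specific average times can be passed through a discrete cosine transform. The description names the DCT but not its type or scaling, so the unnormalized DCT-II below, and the function name `dct_ii`, are assumptions.

```python
import math

def dct_ii(values):
    """Unnormalized DCT-II of a list of band-specific average times."""
    n = len(values)
    return [sum(v * math.cos(math.pi * k * (i + 0.5) / n) for i, v in enumerate(values))
            for k in range(n)]
```

With M frequency bands this yields M coefficients; a constant input (the same average time in every band) produces energy only in the zeroth coefficient.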

Here, the band-specific average time 16 means the time to the energy centroid of the unit speech signal 11 in each of the one or more frequency bands. For the average time of a general signal, Non-Patent Document 2 discloses the definition shown in formula (5) below.
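Formula (5) is not reproduced in this text. Based on the description of its numerator and denominator in the following paragraph, it presumably has the form below (a reconstruction; the original may also include the equivalent time-domain expression):

```latex
\langle t \rangle = \frac{\sum_{\omega} \tau_g(\omega)\, |S(\omega)|^{2}}{\sum_{\omega} |S(\omega)|^{2}} \tag{5}
```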

Here, s(t) represents the power-normalized signal obtained by normalizing the power of the signal within the analysis window, S(ω) represents the spectrum for each frequency (ω) obtained by applying a complex Fourier transform to the power-normalized signal (s(t)), and τ_g(ω) represents the group delay spectrum for each frequency (ω). Formula (5) defines the average time of a signal over the entire frequency band. Specifically, in formula (5), the numerator on the right side is the sum over the entire frequency band of the product of the group delay spectrum and the power spectrum, and the denominator on the right side is the sum over the entire frequency band of the power spectrum. The band-specific average time 16, by contrast, means the average time of the unit speech signal 11 in each of one or more frequency bands, as described above. The average time (<t>_(m)) of the unit speech signal 11 in the m-th frequency band (Ω_m) can be calculated, for example, according to formula (6) below. Here, m is an index identifying each of the one or more frequency bands and is an integer from 1 to M. M represents the total number of frequency bands and is assumed to be smaller than the number of frequency (ω) bins.
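Formula (6) is not reproduced in this text. Restricting the sums of formula (5) to the m-th band and weighting by the frequency filter h_m(ω) gives the following presumed form (a reconstruction):

```latex
\langle t \rangle_{(m)} = \frac{\sum_{\omega \in \Omega_m} h_m(\omega)\, \tau_g(\omega)\, |X(\omega)|^{2}}{\sum_{\omega \in \Omega_m} h_m(\omega)\, |X(\omega)|^{2}} \tag{6}
```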

Here, h_m(ω) represents the frequency filter corresponding to the m-th frequency band (Ω_m) in the filter bank applied by the filter bank application units 104 and 105. The group delay spectrum (τ_g(ω)) in formula (6) can also be expressed as shown in formula (7) below.
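Formula (7) is not reproduced in this text. The statement that the product of the group delay spectrum and the power spectrum equals the third spectrum implies the following form (a reconstruction; it agrees with the standard identity for the group delay as the negative derivative of the phase when the transform kernel is e^{-jωt}):

```latex
\tau_g(\omega) = \frac{X_R(\omega)\, Y_R(\omega) + X_I(\omega)\, Y_I(\omega)}{|X(\omega)|^{2}} \tag{7}
```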

According to formulas (2), (4), and (7) above, the product of the group delay spectrum and the power spectrum in formula (6) (τ_g(ω)|X(ω)|²) is equal to the third spectrum (XY(ω)). Formula (6) can therefore be rewritten, based on formula (7), as formula (8) below.
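Formula (8) is not reproduced in this text. Substituting the third spectrum XY(ω) for the product τ_g(ω)|X(ω)|² in the band-specific average time gives the following presumed form (a reconstruction):

```latex
\langle t \rangle_{(m)} = \frac{\sum_{\omega \in \Omega_m} h_m(\omega)\, XY(\omega)}{\sum_{\omega \in \Omega_m} h_m(\omega)\, |X(\omega)|^{2}} \tag{8}
```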

In formula (8), h_m(ω)|X(ω)|² corresponds to the filtered power spectrum 14, and h_m(ω)XY(ω) corresponds to the filtered third spectrum 15. That is, the band-specific average time calculation unit 106 obtains the band-specific average time 16 of the m-th frequency band (Ω_m) by dividing the sum of the filtered third spectrum 15 over the m-th frequency band (Ω_m) by the sum of the filtered power spectrum 14 over the m-th frequency band (Ω_m).
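The division described above can be sketched as follows. The function name `band_average_times` and the handling of a band with zero energy (returning 0.0) are assumptions; the description does not say what to do when the denominator vanishes.

```python
def band_average_times(power, third, filters):
    """Average time per band: sum(h_m * XY) / sum(h_m * |X|^2).

    power and third are per-bin spectra; filters is a list of per-bin weight lists h_m.
    """
    times = []
    for weights in filters:
        numerator = sum(w * xy for w, xy in zip(weights, third))
        denominator = sum(w * p for w, p in zip(weights, power))
        # Assumption: report 0.0 for a band that carries no energy.
        times.append(numerator / denominator if denominator > 0.0 else 0.0)
    return times
```

For example, with two rectangular bands over four bins, a flat power spectrum, and a third spectrum of 2 in the lower band and 6 in the upper band, the result is an average time of 2 for the first band and 6 for the second.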

The speech feature extraction device of FIG. 1 can operate as illustrated in FIG. 2. The waveform cutout unit 101 generates the unit speech signal 11 by cutting out, for each unit time, a speech waveform of time length T from the input speech signal 10 acquired from outside (step S101).

The power spectrum calculation unit 102 calculates the power spectrum 12 of the unit speech signal 11 generated in step S101 (step S102). Specifically, the power spectrum calculation unit 102 obtains the power spectrum 12 by calculating the power of the first spectrum (X(ω)) described above. The filter bank application unit 104 applies a filter bank to the power spectrum 12 calculated in step S102 to obtain the filtered power spectrum 14 (step S104).

The third spectrum calculation unit 103 calculates the third spectrum 13 of the unit speech signal 11 generated in step S101 (step S103). Specifically, the third spectrum calculation unit 103 calculates a first product of the real part of the first spectrum (X_R(ω)) and the real part of the second spectrum (Y_R(ω)), calculates a second product of the imaginary part of the first spectrum (X_I(ω)) and the imaginary part of the second spectrum (Y_I(ω)), and obtains the third spectrum 13 by adding the first product and the second product. The filter bank application unit 105 applies a filter bank to the third spectrum 13 calculated in step S103 to obtain the filtered third spectrum 15 (step S105).

Because there is no dependency between the series of steps S102 and S104 and the series of steps S103 and S105, the two series may be executed either in parallel or serially after step S101 is completed.

The band-specific average time calculation unit 106 calculates the band-specific average time 16 based on the filtered power spectrum 14 obtained in step S104 and the filtered third spectrum 15 obtained in step S105 (step S106). Specifically, the band-specific average time calculation unit 106 obtains the band-specific average time 16 of the m-th frequency band (Ω_m) by dividing the sum of the filtered third spectrum 15 over the m-th frequency band (Ω_m) by the sum of the filtered power spectrum 14 over the m-th frequency band (Ω_m). The axis conversion unit 107 applies the axis conversion process to the band-specific average time 16 calculated in step S106 to generate the speech feature 17.

As described above, the speech feature extraction device according to the first embodiment extracts the SATC as a speech feature. With this device, the noise robustness of speech recognition can be improved by, for example, combining (appending) the SATC with a conventional speech feature such as MFCC.

In the present embodiment, the filter bank application units 104 and 105 may be omitted. In that case, the band-specific average time calculation unit 106 calculates the band-specific average time 16 based on the power spectrum 12 and the third spectrum 13. Specifically, the band-specific average time calculation unit 106 can use formula (9) below.
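Formula (9) is not reproduced in this text. Without the frequency filters, the band-specific average time reduces to a ratio of unweighted sums over the band, presumably of the following form (a reconstruction):

```latex
\langle t \rangle_{(m)} = \frac{\sum_{\omega \in \Omega_m} XY(\omega)}{\sum_{\omega \in \Omega_m} |X(\omega)|^{2}} \tag{9}
```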

In formula (9), |X(ω)|² corresponds to the power spectrum 12 and XY(ω) corresponds to the third spectrum 13. That is, the band-specific average time calculation unit 106 obtains the band-specific average time 16 of the m-th frequency band (Ω_m) by dividing the sum of the third spectrum 13 over the m-th frequency band (Ω_m) by the sum of the power spectrum 12 over the m-th frequency band (Ω_m).

(Second embodiment)
In the first embodiment described above, the band-specific average time is calculated based on the power spectrum and the third spectrum, for example according to formula (8). According to formula (6), however, the band-specific average time can also be calculated based on the group delay spectrum and the power spectrum.

As illustrated in FIG. 3, the speech feature extraction device according to the second embodiment includes the waveform cutout unit 101, the power spectrum calculation unit 102, the filter bank application unit 104, the axis conversion unit 107, a group delay spectrum calculation unit 208, a spectrum multiplication unit 209, a filter bank application unit 210, and a band-specific average time calculation unit 211. The speech feature extraction device of FIG. 3 extracts a speech feature 22 from the input speech signal 10.

The group delay spectrum calculation unit 208 receives the unit speech signal 11 from the waveform cutout unit 101, calculates the group delay spectrum 18 of the unit speech signal 11, and outputs the group delay spectrum 18 to the spectrum multiplication unit 209.

For example, the group delay spectrum calculation unit 208 may calculate the group delay spectrum 18 by substituting the real part (X_R(ω)) and imaginary part (X_I(ω)) of the first spectrum and the real part (Y_R(ω)) and imaginary part (Y_I(ω)) of the second spectrum into formula (7) above.

Alternatively, the group delay spectrum calculation unit 208 may calculate the group delay spectrum 18 by a technique other than formula (7). Specifically, the group delay spectrum 18 (τ_g(ω)) is defined as the value obtained by differentiating the phase term (θ(ω)) of the first spectrum (X(ω)) with respect to frequency (ω) and inverting the sign, as shown in formula (10) below.
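Formula (10) is not reproduced in this text, but the sentence above determines it: the negative derivative of the phase term with respect to frequency.

```latex
\tau_g(\omega) = -\frac{\partial \theta(\omega)}{\partial \omega} \tag{10}
```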

Here, the phase term (θ(ω)) is defined by formula (11) below.
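Formula (11) is not reproduced in this text. As the phase of the first spectrum, it is presumably the arctangent of the ratio of the imaginary part to the real part (a reconstruction):

```latex
\theta(\omega) = \tan^{-1}\!\left( \frac{X_I(\omega)}{X_R(\omega)} \right) \tag{11}
```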

Accordingly, as described in Non-Patent Document 3, the group delay spectrum calculation unit 208 may calculate the group delay spectrum 18 using the difference of the phase term (θ(ω)) of formula (11) along the frequency (ω) axis. When the group delay spectrum 18 is calculated by this technique, phase unwrapping must be performed, because the phase term (θ(ω)) obtained from formula (11) is confined to the range from −π to π.
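The phase-difference technique can be sketched as below. This is an illustration under assumptions: the function name `group_delay_from_phase` is hypothetical, the spectrum is taken to be an N-point discrete spectrum with bin spacing 2π/N, and unwrapping is done by removing multiples of 2π from each bin-to-bin phase step. The result is in samples and has one fewer element than the spectrum.

```python
import cmath
import math

def group_delay_from_phase(spectrum):
    """Estimate group delay (in samples) from differences of the unwrapped phase."""
    n = len(spectrum)
    phase = [cmath.phase(x) for x in spectrum]       # principal values in [-pi, pi]
    unwrapped = [phase[0]]
    for k in range(1, n):
        step = phase[k] - phase[k - 1]
        step -= 2.0 * math.pi * round(step / (2.0 * math.pi))  # remove 2*pi jumps
        unwrapped.append(unwrapped[-1] + step)
    bin_width = 2.0 * math.pi / n                    # frequency spacing between bins
    return [-(unwrapped[k + 1] - unwrapped[k]) / bin_width for k in range(n - 1)]
```

This simple unwrapping is only reliable when the true phase change between adjacent bins is smaller than π in magnitude, that is, when the delay is less than half the transform length.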

The spectrum multiplication unit 209 receives the power spectrum 12 from the power spectrum calculation unit 102 and the group delay spectrum 18 from the group delay spectrum calculation unit 208. The spectrum multiplication unit 209 multiplies the power spectrum 12 by the group delay spectrum 18 to obtain a multiplied spectrum 19, and outputs the multiplied spectrum 19 to the filter bank application unit 210. The multiplied spectrum 19 corresponds to the third spectrum 13 described above.

The filter bank application unit 210 receives the multiplied spectrum 19 from the spectrum multiplication unit 209. The filter bank application unit 210 applies a filter bank to the multiplied spectrum 19 to obtain a filtered multiplied spectrum 20, and outputs the filtered multiplied spectrum 20 to the band-specific average time calculation unit 211. The filter bank applied by the filter bank application unit 210 must include the same number of frequency filters as the filter bank applied by the filter bank application unit 104. Preferably, the filter bank application unit 210 applies the same filter bank as the filter bank application unit 104. In the following description, the filter bank application unit 210 is assumed to apply the same filter bank as the filter bank application unit 104.

The band-specific average time calculation unit 211 receives the filtered power spectrum 14 from the filter bank application unit 104 and the filtered multiplied spectrum 20 from the filter bank application unit 210. Based on the filtered power spectrum 14 and the filtered multiplied spectrum 20, the band-specific average time calculation unit 211 calculates the average time of the unit speech signal 11 in each of one or more frequency bands (referred to below as the band-specific average time 21).

Specifically, the band-specific average time calculation unit 211 can use formula (6) above. In formula (6), h_m(ω)τ_g(ω)|X(ω)|² corresponds to the filtered multiplied spectrum 20, and h_m(ω)|X(ω)|² corresponds to the filtered power spectrum 14. That is, the band-specific average time calculation unit 211 obtains the band-specific average time 21 of the m-th frequency band (Ω_m) by dividing the sum of the filtered multiplied spectrum 20 over the m-th frequency band (Ω_m) by the sum of the filtered power spectrum 14 over the m-th frequency band (Ω_m). The band-specific average time calculation unit 211 outputs the band-specific average time 21 to the axis conversion unit 107.

The axis conversion unit 107 receives the band-specific average time 21 from the band-specific average time calculation unit 211. The axis conversion unit 107 applies to the band-specific average time 21 an axis conversion process identical or similar to that of the first embodiment, and generates a speech feature quantity 22. The speech feature quantity 22 corresponds to the speech feature quantity 17 described above and is also called SATC. The axis conversion unit 107 outputs the speech feature quantity 22 to the outside. The axis conversion unit 107 may be omitted, in which case the band-specific average time 21 is output to the outside as the speech feature quantity 22. For example, when the filter bank applied by the filter bank application units 104 and 210 contains only one frequency filter, the axis conversion unit 107 is unnecessary.

The speech feature quantity extraction device of FIG. 3 can operate as illustrated in FIG. 4. The group delay spectrum calculation unit 208 calculates the group delay spectrum 18 of the unit speech signal 11 generated in step S101 (step S208). Specifically, the group delay spectrum calculation unit 208 may calculate the group delay spectrum 18 using formula (7) above, or using the difference of the phase term (θ(ω)) shown in formula (11) above along the frequency (ω) axis.
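The phase-difference route can be illustrated as follows. This is a sketch under stated assumptions rather than the patent's procedure: it uses the conventional definition of group delay as the negative derivative of phase with respect to frequency, a naive DFT, and a first-order difference between neighboring bins. The names `dft` and `group_delay_from_phase` are hypothetical, and formulas (7) and (11) are not reproduced here.

```python
import cmath
import math


def dft(x):
    """Naive discrete Fourier transform (O(N^2)); adequate for a short frame."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def group_delay_from_phase(x):
    """Group delay in samples at bins 0 .. N/2 - 1 from the phase difference."""
    n = len(x)
    spec = dft(x)
    d_omega = 2.0 * math.pi / n  # frequency spacing between neighboring bins
    delays = []
    for k in range(n // 2):
        # The phase of X[k+1] * conj(X[k]) equals theta[k+1] - theta[k],
        # already wrapped to [-pi, pi], so no explicit unwrapping is needed.
        d_theta = cmath.phase(spec[k + 1] * spec[k].conjugate())
        delays.append(-d_theta / d_omega)
    return delays


# A unit impulse delayed by 3 samples has a group delay of 3 at every bin.
frame = [0.0] * 16
frame[3] = 1.0
delays = group_delay_from_phase(frame)
```

Bins whose magnitude is close to zero have an ill-defined phase, so a practical implementation would need to guard against them; the sketch does not.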

Here, since there is no dependency between the process of step S102 and the process of step S208, the two may be executed in parallel or in series after step S101 is complete.

The spectrum multiplication unit 209 multiplies the power spectrum 12 calculated in step S102 by the group delay spectrum 18 calculated in step S208 to obtain the multiplication spectrum 19 (step S209). The filter bank application unit 210 applies the filter bank to the multiplication spectrum 19 calculated in step S209 to obtain the filtered multiplication spectrum 20 (step S210).

Here, since there is no dependency between the series of processes in steps S209 and S210 and the process of step S104, they may be executed in parallel or in series after step S102 is complete. However, step S209 must be executed after the completion of not only step S102 but also step S208.
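The ordering constraints among the steps can be written down as a small dependency graph. The sketch below is illustrative only: the dependency table is inferred from the step descriptions (for example, that steps S102 and S208 both consume the unit speech signal from step S101), and it uses `graphlib.TopologicalSorter` from the Python standard library (Python 3.9 or later) to group steps that could run in parallel.

```python
from graphlib import TopologicalSorter

# Each step maps to the set of steps that must finish before it starts
# (inferred from the description, not stated as a table in the source).
dependencies = {
    "S102": {"S101"},           # power spectrum of the unit speech signal
    "S208": {"S101"},           # group delay spectrum of the unit speech signal
    "S104": {"S102"},           # filter bank applied to the power spectrum
    "S209": {"S102", "S208"},   # multiplication needs both spectra
    "S210": {"S209"},           # filter bank applied to the multiplication spectrum
    "S211": {"S104", "S210"},   # band-specific average time
    "S107": {"S211"},           # axis conversion
}

sorter = TopologicalSorter(dependencies)
sorter.prepare()
waves = []
while sorter.is_active():
    ready = sorted(sorter.get_ready())  # steps whose inputs are all available
    waves.append(ready)
    sorter.done(*ready)
```

Each entry of `waves` lists steps with no mutual dependency, so they may run in parallel or in any serial order within that wave.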

The band-specific average time calculation unit 211 calculates the band-specific average time 21 based on the filtered power spectrum 14 obtained in step S104 and the filtered multiplication spectrum 20 obtained in step S210 (step S211). Specifically, the band-specific average time calculation unit 211 divides the sum of the filtered multiplication spectrum 20 over the m-th frequency band (Ω_m) by the sum of the filtered power spectrum 14 over the m-th frequency band (Ω_m) to obtain the band-specific average time 21 of the m-th frequency band (Ω_m). The axis conversion unit 107 applies the axis conversion process to the band-specific average time 21 calculated in step S211 and generates the speech feature quantity 22.

As described above, the speech feature quantity extraction device according to the second embodiment extracts the above-described SATC as a speech feature quantity. This device can therefore provide effects identical or similar to those of the first embodiment.

The effects of the present embodiment are described below through a comparison with two comparative examples. In the following description, Comparative Example 1 corresponds to conventional speech recognition that uses only MFCC. Comparative Example 2 corresponds to speech recognition that uses a speech feature quantity obtained by combining the long-time group delay cepstrum disclosed in Non-Patent Document 1 with MFCC. Specifically, the long-time group delay cepstrum in Comparative Example 2 is extracted by a speech feature quantity extraction device that operates as illustrated in FIG. 5.

The speech feature quantity extraction device according to Comparative Example 2 generates a unit speech signal by cutting out a speech waveform from the input speech signal at every unit time (step S101). The device calculates the group delay spectrum of the unit speech signal generated in step S101 (step S208). The device calculates a band-specific group delay spectrum based on the group delay spectrum calculated in step S208 (step S312). The device applies the axis conversion process to the band-specific group delay spectrum calculated in step S312 and generates the long-time group delay cepstrum (step S107).

FIG. 6 shows the result of speech recognition using a speech feature quantity obtained by combining the SATC extracted by the speech feature quantity extraction device according to the present embodiment with MFCC, together with the speech recognition results of Comparative Example 1 and Comparative Example 2. Specifically, FIG. 6 shows the word recognition performance (%) when isolated word recognition with a vocabulary of about 100,000 words is performed using the above three types of feature quantities in a noisy environment such as inside a train station. To confirm word recognition performance in noisy environments, this evaluation experiment measured word recognition performance at five signal-to-noise ratio (SNR) levels: 20, 15, 10, 5, and 0 dB. FIG. 6 shows the average of the word recognition performance values measured at the five SNR levels. For the long-time group delay cepstrum and SATC, the experiment also measured word recognition performance at several analysis window widths (in milliseconds).

Because Comparative Example 1 uses only MFCC extracted with the analysis window width fixed at 25 milliseconds, it achieves a constant word recognition performance that does not depend on the analysis window width. The word recognition performance of Comparative Example 2 varies with the analysis window width, but it is higher than that of Comparative Example 1 for most analysis window widths (56 to 152 milliseconds). However, its performance improvement rate is limited, reaching a maximum of only about 3.6%, for example at an analysis window width of 152 milliseconds. In contrast, the present embodiment achieves higher word recognition performance than Comparative Examples 1 and 2 at all analysis window widths (25 to 216 milliseconds). Specifically, its performance improvement rate reaches a maximum of about 9.5% at an analysis window width of 56 milliseconds. This evaluation experiment thus shows quantitatively that the noise robustness of speech recognition improves when a speech feature quantity obtained by combining SATC with a conventional speech feature quantity such as MFCC is used.

In the present embodiment, the filter bank application units 104 and 210 may be omitted. In that case, the band-specific average time calculation unit 211 calculates the band-specific average time 21 based on the power spectrum 12 and the multiplication spectrum 19. Specifically, the band-specific average time calculation unit 211 can use formula (12) below.

In formula (12), |X(ω)|² corresponds to the power spectrum 12, and τ_g(ω) |X(ω)|² corresponds to the multiplication spectrum 19. That is, the band-specific average time calculation unit 211 divides the sum of the multiplication spectrum 19 over the m-th frequency band (Ω_m) by the sum of the power spectrum 12 over the m-th frequency band (Ω_m) to obtain the band-specific average time 21 of the m-th frequency band (Ω_m).
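Formula (12) itself is not reproduced in this text. Based on the description of its terms, it can plausibly be reconstructed as follows; the symbol \bar{\tau}_m for the band-specific average time 21 of the m-th band is notation assumed here, not taken from the source.

```latex
\bar{\tau}_m = \frac{\displaystyle\sum_{\omega \in \Omega_m} \tau_g(\omega)\,|X(\omega)|^2}
                    {\displaystyle\sum_{\omega \in \Omega_m} |X(\omega)|^2}
```

Compared with the filtered case, the only difference is that the filter weight h_m(ω) no longer appears in either sum.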

(Third embodiment)
As illustrated in FIG. 7, the speech recognition device according to the third embodiment includes a feature quantity extraction unit 400, a decoder 401, an acoustic model storage unit 402, and a language model storage unit 403. The speech recognition device of FIG. 7 performs speech recognition processing on the input speech signal 10 and outputs language text indicating the content of the input speech signal 10 as the speech recognition result.

The feature quantity extraction unit 400 may incorporate the speech feature quantity extraction device according to the first or second embodiment described above, or according to the fourth or fifth embodiment described later. The feature quantity extraction unit 400 acquires the input speech signal 10 from the outside. The feature quantity extraction unit 400 extracts from the input speech signal 10 a speech feature quantity 17 that includes at least SATC. The feature quantity extraction unit 400 outputs the speech feature quantity 17 to the decoder 401.

The decoder 401 receives the speech feature quantity 17 from the feature quantity extraction unit 400. The decoder 401 refers to the acoustic model stored in the acoustic model storage unit 402 and the language model stored in the language model storage unit 403, and performs speech recognition processing using the speech feature quantity 17. Based on acoustic similarity and linguistic reliability, the decoder 401 generates the speech recognition result by sequentially replacing the input speech signal 10 with words registered in a recognition dictionary stored in a recognition dictionary storage unit (not shown). Here, acoustic similarity means the acoustic similarity between the speech to be recognized (that is, the speech feature quantity 17) and the acoustic model of a word that is a recognition candidate. Linguistic reliability means the linguistic (grammatical or syntactic) reliability of a sequence that includes a recognition candidate word, and is evaluated based on a language model such as an n-gram model. The decoder 401 outputs the speech recognition result to the outside. Here, the outside may be a display device for displaying the text, a printing device for printing the text, or a language processing device that performs arbitrary language processing, such as translating the text into another language.
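As a toy illustration of how acoustic similarity and linguistic reliability might jointly select a word, the sketch below adds an acoustic log-likelihood to a weighted bigram log-probability. The source does not specify how the decoder 401 combines the two scores, so the additive rule, the function name `best_candidate`, the weight `lm_weight`, the floor probability, and all numbers are assumptions.

```python
import math


def best_candidate(acoustic_log_likelihood, bigram_prob, history, lm_weight=1.0):
    """Pick the candidate word with the highest combined log score."""
    def score(word):
        acoustic = acoustic_log_likelihood[word]
        # Unseen bigrams get a small floor probability (an assumption).
        linguistic = math.log(bigram_prob.get((history, word), 1e-6))
        return acoustic + lm_weight * linguistic
    return max(acoustic_log_likelihood, key=score)


# Hypothetical scores for two acoustically similar candidates.
acoustic = {"station": -12.0, "stationed": -11.5}
bigram = {("the", "station"): 0.02, ("the", "stationed"): 0.0001}
winner = best_candidate(acoustic, bigram, history="the")
```

In this example the acoustically slightly better candidate loses because the language model rates its word sequence as far less likely.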

The acoustic model storage unit 402 stores the acoustic model, which the decoder 401 refers to as necessary. The language model storage unit 403 stores the language model, which the decoder 401 likewise refers to as necessary.

As described above, the speech recognition device according to the third embodiment performs speech recognition processing based on a speech feature quantity that includes at least SATC. This speech recognition device can therefore achieve high recognition accuracy even in a noisy environment.

(Fourth embodiment)
As illustrated in FIG. 8, the speech feature quantity extraction device according to the fourth embodiment includes a waveform cutout unit 101, a power spectrum calculation unit 102, a filter bank application unit 104, a band-specific average time calculation unit 513, and an axis conversion unit 107. The speech feature quantity extraction device of FIG. 8 extracts a speech feature quantity 32 from the input speech signal 10.

The waveform cutout unit 101 acquires the input speech signal 10 from the outside. The waveform cutout unit 101 generates the unit speech signal 11 (x_n(t)) at time n by cutting out, at every unit time, a speech waveform of time length T_0 (for example, T_0 = 25 milliseconds) from the input speech signal 10. That is, in this embodiment the waveform cutout unit 101 performs waveform cutout processing identical or similar to that of the first or second embodiment. The waveform cutout unit 101 outputs the unit speech signal 11 to the power spectrum calculation unit 102.

The time length T_0 used by the waveform cutout unit 101 in this embodiment may be set shorter than the time length T (that is, the analysis window width) used by the waveform cutout unit 101 in the first or second embodiment. For example, T may be set to 56 milliseconds and T_0 to 25 milliseconds.

The band-specific average time calculation unit 513 receives the filtered power spectrum 14 from the filter bank application unit 104. Based on the filtered power spectrum 14, the band-specific average time calculation unit 513 calculates the average time of the unit speech signal 11 in each of one or more frequency bands (hereinafter also referred to as the band-specific average time 31). The band-specific average time calculation unit 513 outputs the band-specific average time 31 to the axis conversion unit 107. Details of the processing of the band-specific average time calculation unit 513 are described later.

The axis conversion unit 107 receives the band-specific average time 31 from the band-specific average time calculation unit 513. The axis conversion unit 107 applies to the band-specific average time 31 an axis conversion process identical or similar to that of the first or second embodiment, and generates the speech feature quantity 32. The speech feature quantity 32 corresponds to the speech feature quantity 17 or the speech feature quantity 22 described above and is also called SATC. The axis conversion unit 107 outputs the speech feature quantity 32 to the outside. The axis conversion unit 107 may be omitted, in which case the band-specific average time 31 is output to the outside as the speech feature quantity 32. For example, when the filter bank applied by the filter bank application unit 104 contains only one frequency filter, the axis conversion unit 107 is unnecessary.

Here, the band-specific average time 31 means the time to the energy center of gravity of the unit speech signal 11 in each of one or more frequency bands. The band-specific average time calculation unit 513 can therefore calculate the band-specific average time 31 according to, for example, formula (13) below.

In formula (13), τ represents the offset from time n, and w(τ) represents the weight corresponding to τ. |X(n+τ, ω)|² represents the power spectrum 12 at frequency ω at time n+τ, and h_m(ω) |X(n+τ, ω)|² represents the filtered power spectrum 14 at frequency ω at time n+τ.

The weight w(τ) may be chosen so that it is largest at τ = 0 and decreases linearly or nonlinearly as the absolute value of τ increases. Alternatively, the weight w(τ) may be a constant (for example, 1) regardless of the value of τ, or it may be 0 for some values of τ.

T in formula (13) is also called the analysis window width. T is set to a value equal to or greater than the unit time described above (for example, 56 milliseconds). Formula (13) yields the band-specific average time 31 of the m-th frequency band (Ω_m).
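Formula (13) itself is not reproduced in this text. From the symbol definitions above, one plausible reconstruction is the following. The symbol \bar{\tau}_m(n) for the band-specific average time 31 and the appearance of the weight w(τ) in both the numerator and the denominator are assumptions.

```latex
\bar{\tau}_m(n) =
  \frac{\displaystyle\sum_{\tau=-T/2}^{T/2} \tau\, w(\tau) \sum_{\omega \in \Omega_m} h_m(\omega)\,|X(n+\tau,\omega)|^2}
       {\displaystyle\sum_{\tau=-T/2}^{T/2} w(\tau) \sum_{\omega \in \Omega_m} h_m(\omega)\,|X(n+\tau,\omega)|^2}
```

Under this reading, the inner sum is the band energy at time n+τ and the outer sums form its weighted centroid along the time axis.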

That is, as illustrated in FIG. 10, the band-specific average time calculation unit 513 calculates the sum of the filtered power spectrum 14 over the m-th frequency band (Ω_m) at each given time. The band-specific average time calculation unit 513 then calculates the energy center of gravity of this sum within the interval from time n−T/2 to time n+T/2, thereby obtaining the band-specific average time 31 of the m-th frequency band (Ω_m).
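The two-step computation (band energy per time step, then its center of gravity over the analysis window) can be sketched as below. This is a minimal illustration under assumptions: time is counted in frames, the band energies are given as a plain list, frames outside the signal are skipped, and the name `band_energy_centroid` and its parameters are hypothetical.

```python
def band_energy_centroid(band_energy, n, half_width, weight=lambda tau: 1.0):
    """Energy center of gravity (in frames) of one band around frame n.

    band_energy[t] : filtered power summed over the m-th band at frame t
    half_width     : T/2 expressed in frames
    weight         : w(tau); constant 1 by default
    """
    num = 0.0
    den = 0.0
    for tau in range(-half_width, half_width + 1):
        t = n + tau
        if 0 <= t < len(band_energy):  # assumption: skip frames outside the signal
            e = weight(tau) * band_energy[t]
            num += tau * e
            den += e
    # Assumption: a window without energy is reported as 0.
    return num / den if den > 0.0 else 0.0


# Hypothetical band energies; most of the energy lies one frame after n = 3.
energy = [0.0, 0.0, 1.0, 0.0, 3.0, 0.0, 0.0]
centroid = band_energy_centroid(energy, n=3, half_width=2)
```

A positive result means the energy in the window is concentrated after time n, and a negative result means it is concentrated before time n.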

The speech feature quantity extraction device of FIG. 8 can operate as illustrated in FIG. 9. The waveform cutout unit 101 generates the unit speech signal 11 by cutting out, at every unit time, a speech waveform of time length T_0 from the input speech signal 10 acquired from the outside (step S101).

The band-specific average time calculation unit 513 calculates the band-specific average time 31 based on the filtered power spectrum 14 obtained in step S104 (step S513). The axis conversion unit 107 applies the axis conversion process to the band-specific average time 31 calculated in step S513 and generates the speech feature quantity 32 (step S107).

As described above, the band-specific average time 31 in this embodiment differs in calculation method from both the band-specific average time 16 calculated in the first embodiment and the band-specific average time 21 in the second embodiment. However, as explained with reference to FIGS. 11, 12, and 15, the band-specific average time 31 expresses speech characteristics identical or similar to those of the band-specific average time 16 calculated in the first embodiment.

The graph of FIG. 15(a) illustrates the band-specific average time 16, and the graph of FIG. 15(b) illustrates the band-specific average time 31. FIGS. 11 and 12 show two-dimensional graphs cut out from the three-dimensional graphs of FIG. 15.

The graph of FIG. 11(a) shows the relationship between time and the band-specific average time 16 at a first frequency of interest in the graph of FIG. 15(a). The first frequency of interest was selected from the low-frequency side of FIG. 15. The graph of FIG. 11(b) shows the relationship between time and the band-specific average time 31 at the same first frequency of interest in the graph of FIG. 15(b). FIG. 11 confirms that, on the low-frequency side, the band-specific average time 16 and the band-specific average time 31 have substantially the same characteristics.

The graph of FIG. 12(a) shows the relationship between time and the band-specific average time 16 at a second frequency of interest in the graph of FIG. 15(a). The second frequency of interest was selected from the high-frequency side of FIG. 15. The graph of FIG. 12(b) shows the relationship between time and the band-specific average time 31 at the same second frequency of interest in the graph of FIG. 15(b). FIG. 12 confirms that, on the high-frequency side as well, the band-specific average time 16 and the band-specific average time 31 have substantially the same characteristics.

As described above, the speech feature quantity extraction device according to the fourth embodiment extracts the above-described SATC as a speech feature quantity. This device can therefore provide effects identical or similar to those of the first or second embodiment.

In the present embodiment, the filter bank application unit 104 may be omitted. In that case, the band-specific average time calculation unit 513 calculates the band-specific average time 31 based on the power spectrum 12. Specifically, the band-specific average time calculation unit 513 can use formula (14) below.

That is, the band-specific average time calculation unit 513 calculates the sum of the power spectrum 12 over the m-th frequency band (Ω_m) at each given time. The band-specific average time calculation unit 513 then calculates the energy center of gravity of this sum within the interval from time n−T/2 to time n+T/2, thereby obtaining the band-specific average time 31 of the m-th frequency band (Ω_m).

(Fifth embodiment)
As illustrated in FIG. 13, the speech feature quantity extraction device according to the fifth embodiment includes a bandpass filter application unit 614, a waveform cutout unit 615, a band-specific average time calculation unit 616, and an axis conversion unit 107. The speech feature quantity extraction device of FIG. 13 extracts a speech feature quantity 44 from the input speech signal 10.

The bandpass filter application unit 614 acquires the input speech signal 10 from the outside. The bandpass filter application unit 614 applies one or more bandpass filters to the input speech signal 10. That is, the bandpass filter application unit 614 obtains one or more subband input speech signals 41 by extracting signal components of one or more (for example, 16) frequency bands from the input speech signal 10. The bandpass filter application unit 614 outputs the one or more subband input speech signals 41 to the waveform cutout unit 615. When the number of bandpass filters is 1, the bandpass filter application unit 614 may be omitted. In that case, the values obtained are identical or similar to those obtained in the fourth embodiment when the filter bank applied by the filter bank application unit 104 contains only one frequency filter.

The waveform cutout unit 615 receives the one or more subband input speech signals 41 from the bandpass filter application unit 614. The waveform cutout unit 615 generates one or more subband unit speech signals 42 by cutting out, at every unit time, a speech waveform of time length T (for example, T = 56 milliseconds) from the one or more subband input speech signals 41. More specifically, the waveform cutout unit 615 generates the m-th subband unit speech signal 42 (x_nm(t)) at time n by cutting out, at every unit time, a speech waveform of time length T from the m-th subband input speech signal 41. The waveform cutout unit 615 outputs the one or more subband unit speech signals 42 to the band-specific average time calculation unit 616.

In addition to cutting out a speech waveform of time length T at every unit time, the waveform cutout unit 615 may generate the one or more subband unit speech signals 42 by further removing the DC component of the cut-out speech waveform, emphasizing its high-frequency components, multiplying it by a window function (for example, a Hamming window), and so on.
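The optional conditioning steps listed above can be sketched as a single framing function. This is an illustrative sketch, not the patent's implementation: the function name `cut_frames`, the order of the three steps, the first-order pre-emphasis filter, and its coefficient 0.97 are assumptions, and lengths are given in samples rather than milliseconds.

```python
import math


def cut_frames(signal, frame_len, hop, pre_emphasis=0.97):
    """Cut frames of frame_len samples every hop samples and condition each one."""
    # Hamming window coefficients (frame_len must be at least 2).
    window = [0.54 - 0.46 * math.cos(2.0 * math.pi * i / (frame_len - 1))
              for i in range(frame_len)]
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len]
        # Remove the DC component by subtracting the frame mean.
        mean = sum(frame) / frame_len
        frame = [s - mean for s in frame]
        # Emphasize high frequencies with a first-order difference filter.
        frame = [frame[0]] + [frame[i] - pre_emphasis * frame[i - 1]
                              for i in range(1, frame_len)]
        # Multiply by the window function.
        frames.append([s * w for s, w in zip(frame, window)])
    return frames


# A constant signal is pure DC, so every conditioned frame is all zeros.
samples = [1.0] * 20
frames = cut_frames(samples, frame_len=8, hop=4)
```

Here `hop` plays the role of the unit time and `frame_len` the role of the time length T; with 20 samples the sketch yields four overlapping frames.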

The band-specific average time calculation unit 616 receives the one or more subband unit speech signals 42 from the waveform cutout unit 615. The band-specific average time calculation unit 616 calculates the average time of each of the one or more subband unit speech signals 42 (hereinafter also referred to as the band-specific average time 43). The band-specific average time calculation unit 616 outputs the band-specific average time 43 to the axis conversion unit 107. Details of the processing of the band-specific average time calculation unit 616 are described later.

The axis conversion unit 107 receives the band-specific average time 43 from the band-specific average time calculation unit 616. The axis conversion unit 107 applies to the band-specific average time 43 an axis conversion process identical or similar to that of the first, second, or fourth embodiment, and generates the speech feature quantity 44. The speech feature quantity 44 corresponds to the speech feature quantity 17, the speech feature quantity 22, or the speech feature quantity 32 described above and is also called SATC. The axis conversion unit 107 outputs the speech feature quantity 44 to the outside. The axis conversion unit 107 may be omitted, in which case the band-specific average time 43 is output to the outside as the speech feature quantity 44. For example, the axis conversion unit 107 is unnecessary when the bandpass filter application unit 614 applies only one bandpass filter or when the bandpass filter application unit 614 is omitted.

Here, the band-specific average time 43 is the average time of each of the one or more subband unit speech signals 42. The band-specific average time calculation unit 616 can therefore calculate the band-specific average time 43 according to, for example, formula (15) below.

In formula (15), x_nm(t) represents the m-th subband unit speech signal 42 at time n. T in formula (15) is also called the analysis window width. Formula (15) yields the band-specific average time 43 of the m-th frequency band (Ω_m).

That is, the band-specific average time calculation unit 616 obtains the band-specific average time 43 of the m-th frequency band (Ω_m) by calculating the energy centroid position of the power (|x_m(n+τ)|^2) of the m-th subband unit audio signal 42 within the interval from time n−T/2 to time n+T/2.
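The image of Equation (15) is not reproduced in this text. Based solely on the description above (an energy centroid of the power |x_m(n+τ)|^2 over the interval from n−T/2 to n+T/2), the equation can be understood to take a form such as the following. The left-hand symbol is an assumption introduced here for illustration and may differ from the notation of the original equation.

```latex
\bar{t}_m(n) \;=\;
\frac{\displaystyle\sum_{\tau=-T/2}^{T/2} \tau \,\lvert x_m(n+\tau)\rvert^{2}}
     {\displaystyle\sum_{\tau=-T/2}^{T/2} \lvert x_m(n+\tau)\rvert^{2}}
```

Under this form, a subband whose energy is concentrated late in the analysis window gives a positive value, and one whose energy is concentrated early gives a negative value.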

In Equation (15), the time τ = 0 is assumed to be set at the center of the subband unit audio signal 42, but it does not necessarily have to be set there. The summation ranges of the numerator and the denominator on the right-hand side of Equation (15) may be changed as appropriate according to the position of τ = 0.
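The energy centroid calculation described above can be sketched as follows. This is a minimal illustration, not the implementation of the embodiment: the function name `band_average_time`, the `origin` parameter (the sample index treated as τ = 0), and the handling of an all-zero frame are assumptions made for this example.

```python
from typing import Optional, Sequence


def band_average_time(frame: Sequence[float],
                      origin: Optional[float] = None) -> float:
    """Energy centroid (average time) of one subband unit audio signal.

    frame  : samples of the subband unit audio signal, x_m(n + tau)
    origin : sample index treated as tau = 0 (hypothetical parameter);
             defaults to the center of the frame
    """
    if origin is None:
        origin = (len(frame) - 1) / 2.0
    power = [abs(x) ** 2 for x in frame]      # |x_m(n + tau)|^2
    total = sum(power)                        # denominator
    if total == 0.0:
        return 0.0  # assumption: a silent frame is mapped to tau = 0
    weighted = sum((i - origin) * p for i, p in enumerate(power))  # numerator
    return weighted / total
```

Moving `origin` away from the center shifts every result by a constant, which corresponds to choosing a different position for τ = 0.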

The voice feature quantity extraction device of FIG. 13 can operate as illustrated in FIG. 14. The bandpass filter application unit 614 obtains one or more subband input audio signals 41 by applying one or more bandpass filters to the input audio signal 10 acquired from the outside (step S614).

The waveform cutout unit 615 generates one or more subband unit audio signals 42 by cutting out an audio waveform of time length T for each unit time from the one or more subband input audio signals 41 obtained in step S614 (step S615).

The band-specific average time calculation unit 616 obtains the band-specific average time 43 by calculating the average time of each of the one or more subband unit audio signals 42 generated in step S615 (step S616). The axis conversion unit 107 applies the axis conversion process to the band-specific average time 43 calculated in step S616 and generates the voice feature quantity 44 (step S107).
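The flow of steps S614, S615, S616, and S107 can be sketched end to end as follows. This is only an illustrative sketch under several assumptions that the text does not specify: the bandpass filters are realized as Hamming-windowed sinc FIR filters, the axis conversion is taken to be a type-II discrete cosine transform, and all function and parameter names are hypothetical.

```python
import math


def bandpass_fir(low_hz, high_hz, fs, taps=101):
    """Hamming-windowed sinc band-pass (assumed filter design)."""
    mid = (taps - 1) / 2.0

    def lowpass(cut_hz, k):
        t, fc = k - mid, cut_hz / fs
        return 2 * fc if t == 0 else math.sin(2 * math.pi * fc * t) / (math.pi * t)

    return [(lowpass(high_hz, k) - lowpass(low_hz, k))
            * (0.54 - 0.46 * math.cos(2 * math.pi * k / (taps - 1)))
            for k in range(taps)]


def apply_filter(signal, h):
    """Centered FIR filtering with zero padding at both ends."""
    half, n_sig = len(h) // 2, len(signal)
    return [sum(hk * signal[n + half - k] for k, hk in enumerate(h)
                if 0 <= n + half - k < n_sig)
            for n in range(n_sig)]


def cut_frames(signal, frame_len, hop):
    """Cut a waveform of length frame_len every hop samples."""
    return [signal[s:s + frame_len]
            for s in range(0, len(signal) - frame_len + 1, hop)]


def average_time(frame):
    """Energy centroid with tau = 0 at the frame center."""
    origin = (len(frame) - 1) / 2.0
    power = [x * x for x in frame]
    total = sum(power)
    if total == 0.0:
        return 0.0  # assumption: silent frame maps to tau = 0
    return sum((i - origin) * p for i, p in enumerate(power)) / total


def dct2(values):
    """Unnormalized type-II DCT, used here as the assumed axis conversion."""
    m = len(values)
    return [sum(v * math.cos(math.pi * (j + 0.5) * k / m)
                for j, v in enumerate(values))
            for k in range(m)]


def extract_satc(signal, fs, bands, frame_len, hop, axis_conversion=True):
    subbands = [apply_filter(signal, bandpass_fir(lo, hi, fs))
                for lo, hi in bands]                              # step S614
    framed = [cut_frames(sb, frame_len, hop) for sb in subbands]  # step S615
    features = []
    for n in range(len(framed[0])):
        times = [average_time(framed[m][n])
                 for m in range(len(bands))]                      # step S616
        # Axis conversion is skipped when there is only one band (step S107).
        features.append(dct2(times)
                        if axis_conversion and len(bands) > 1 else times)
    return features
```

For example, `extract_satc(signal, 8000, [(500, 1500), (2500, 3500)], 400, 200)` would return one two-element feature vector per frame. Passing `axis_conversion=False` returns the band-specific average times directly, which corresponds to the case in which the axis conversion unit 107 is omitted.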

As described above, the voice feature quantity extraction device according to the fifth embodiment extracts the above-described SATC as a voice feature quantity. This voice feature quantity extraction device can therefore provide effects that are the same as or similar to those of the first, second, or fourth embodiment.

The processing of each of the above embodiments can be realized by using a general-purpose computer as basic hardware. A program that realizes the processing of each of the above embodiments may be provided by being stored in a computer-readable storage medium. The program is stored in the storage medium as a file in an installable format or an executable format. Examples of the storage medium include a magnetic disk, an optical disk (CD-ROM, CD-R, DVD, etc.), a magneto-optical disk (MO, etc.), and a semiconductor memory. Any storage medium may be used as long as it can store the program and can be read by a computer. The program that realizes the processing of each of the above embodiments may also be stored on a computer (server) connected to a network such as the Internet and downloaded to a computer (client) via the network.

Although several embodiments of the present invention have been described, these embodiments are presented by way of example and are not intended to limit the scope of the invention. These novel embodiments can be implemented in various other forms, and various omissions, replacements, and changes can be made without departing from the gist of the invention. These embodiments and their modifications are included in the scope and gist of the invention, and are included in the invention described in the claims and its equivalents.

DESCRIPTION OF SYMBOLS
10 ... Input audio signal
11 ... Unit audio signal
12 ... Power spectrum
13 ... Third spectrum
14 ... Filtered power spectrum
15 ... Filtered third spectrum
16, 21, 31, 43 ... Band-specific average time
17, 22, 32, 44 ... Voice feature quantity
18 ... Group delay spectrum
19 ... Multiplication spectrum
20 ... Filtered multiplication spectrum
41 ... Subband input audio signal
42 ... Subband unit audio signal
101, 615 ... Waveform cutout unit
102 ... Power spectrum calculation unit
103 ... Third spectrum calculation unit
104, 105, 210 ... Filter bank application unit
106, 211, 513, 616 ... Band-specific average time calculation unit
107 ... Axis conversion unit
208 ... Group delay spectrum
209 ... Spectrum multiplication unit
400 ... Feature quantity extraction unit
401 ... Decoder
402 ... Acoustic model storage unit
403 ... Language model storage unit
614 ... Bandpass filter application unit

Claims

A cutout unit that generates a unit voice signal by cutting out a voice waveform over a predetermined time length for each unit time from the input voice signal;
An average time calculation unit that calculates an average time corresponding to the time to the energy centroid in each of a plurality of frequency bands obtained by dividing the entire frequency band of the unit audio signal by a number smaller than the number of bins of the frequency;
An audio feature quantity extraction device comprising: a generation unit that generates an audio feature quantity based on the average time.

The voice feature quantity extraction device according to claim 1, wherein the generation unit generates the voice feature quantity by performing axis conversion on the average time.

A power spectrum calculation unit for calculating a power spectrum of the unit audio signal;
The cutout unit generates the unit audio signal by cutting out the audio waveform over the predetermined time length for each unit time from the input audio signal,
The average time calculation unit calculates the average time based on the power spectrum,
The speech feature amount extraction apparatus according to claim 1 or 2.

A second spectrum calculation unit that obtains a third spectrum by calculating a first product of the real part of a first spectrum of the unit audio signal and the real part of a second spectrum of the product of the unit audio signal and time, calculating a second product of the imaginary part of the first spectrum and the imaginary part of the second spectrum, and adding the first product and the second product;
The average time calculation unit calculates the average time based on the power spectrum and the third spectrum,
The speech feature amount extraction apparatus according to claim 3.

The average time calculation unit calculates the average time in a given frequency band by dividing the sum of the third spectrum in the given frequency band by the sum of the power spectrum in the given frequency band,
The speech feature amount extraction apparatus according to claim 4.

A first application unit for obtaining a filtered power spectrum by applying a first filter bank to the power spectrum;
A second applying unit that obtains a filtered third spectrum by applying a second filter bank to the third spectrum; and
The average time calculating unit calculates the average time based on the filtered power spectrum and the filtered third spectrum;
The speech feature amount extraction apparatus according to claim 4.

A group delay spectrum calculator for calculating a group delay spectrum of the unit audio signal;
A multiplier for multiplying the power spectrum by the group delay spectrum to obtain a multiplication spectrum;
The average time calculation unit calculates the average time based on the power spectrum and the multiplication spectrum.
The speech feature amount extraction apparatus according to claim 3.

A first application unit for obtaining a filtered power spectrum by applying a first filter bank to the power spectrum;
A second applying unit that obtains a filtered multiplication spectrum by applying a second filter bank to the multiplication spectrum; and
The average time calculation unit calculates the average time based on the filtered power spectrum and the filtered multiplication spectrum.
The speech feature amount extraction apparatus according to claim 7.

An application unit for obtaining a filtered power spectrum by applying a filter bank to the power spectrum;
The average time calculation unit calculates the average time based on the filtered power spectrum,
The speech feature amount extraction apparatus according to claim 3.

Generating a unit voice signal by cutting out a voice waveform over a predetermined time length for each unit time from the input voice signal;
Calculating an average time corresponding to the time to the energy center of gravity in each of a plurality of frequency bands obtained by dividing the entire frequency band of the unit audio signal by a number smaller than the number of bins of the frequency;
Generating a voice feature amount based on the average time.

A voice feature quantity extraction program for causing a computer to function as:
Cutout means for generating a unit audio signal by cutting out an audio waveform over a predetermined time length for each unit time from an input audio signal;
Average time calculating means for calculating an average time corresponding to the time to the energy centroid in each of a plurality of frequency bands obtained by dividing the entire frequency band of the unit audio signal by a number smaller than the number of frequency bins; and
Generating means for generating a voice feature quantity based on the average time.

A cutout unit that generates a plurality of subband unit audio signals by cutting out an audio waveform over a predetermined time length for each unit time from a plurality of subband input audio signals obtained by extracting signal components of a plurality of frequency bands from an input audio signal;
An average time calculation unit that calculates an average time corresponding to the energy barycentric position of each of the plurality of subband unit audio signals within a predetermined time interval;
An audio feature quantity extraction device comprising: a generation unit that generates an audio feature quantity based on the average time.

The voice feature quantity extraction device according to claim 12, wherein the generation unit generates the voice feature quantity by performing axis conversion on the average time.

An application unit that obtains the plurality of subband input audio signals by applying a plurality of bandpass filters to the input audio signals;
The cutout unit generates the plurality of subband unit sound signals by cutting out the sound waveform over the predetermined time length for each unit time from the plurality of subband input sound signals.
The speech feature amount extraction apparatus according to claim 12 or 13.

Generating a plurality of subband unit audio signals by cutting out an audio waveform over a predetermined time length for each unit time from a plurality of subband input audio signals obtained by extracting signal components of a plurality of frequency bands from an input audio signal;
Calculating an average time corresponding to the energy barycentric position of each of the plurality of subband unit audio signals within a predetermined time interval;
Generating a voice feature amount based on the average time.

A voice feature quantity extraction program for causing a computer to function as:
Cutout means for generating a plurality of subband unit audio signals by cutting out an audio waveform over a predetermined time length for each unit time from a plurality of subband input audio signals obtained by extracting signal components of a plurality of frequency bands from an input audio signal;
Average time calculation means for calculating an average time corresponding to the energy centroid position of each of the plurality of subband unit audio signals within a predetermined time interval; and
Generating means for generating a voice feature quantity based on the average time.