JP6063843B2

JP6063843B2 - Signal section classification device, signal section classification method, and program

Info

Publication number: JP6063843B2
Application number: JP2013176821A
Authority: JP
Inventors: 達也加古; 和則小林; 仲大室
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2013-08-28
Filing date: 2013-08-28
Publication date: 2017-01-18
Anticipated expiration: 2033-08-28
Also published as: JP2015045737A

Description

The present invention relates to a technique for classifying signal sections based on sound source position from digital acoustic signals of a plurality of channels.

One method of classifying signal sections of acoustic signals emitted from a plurality of sound sources (for example, speakers) using a microphone array composed of a plurality of microphones is based on the MUSIC method (see, for example, Non-Patent Document 1). In this method, acoustic signals are acquired using K (K > 1) microphones capable of recording synchronized acoustic signals. The acquired acoustic signals are converted into frequency-domain signals for each frame (a predetermined time section), and a K-dimensional time-frequency-domain signal vector X_{τ,ω} having these signals as its elements is obtained. Here, τ denotes the frame number and ω denotes the frequency bin. The frame with frame number τ is written “frame τ”, and the frequency of frequency bin ω is written “frequency ω”. Next, the autocorrelation matrix R_{τ,ω} of the input signal is calculated from the time-frequency-domain signal vector X_{τ,ω} as follows.
R_{τ,ω} = E{X_{τ,ω} X_{τ,ω}^H}  (1)
Here, E{·} denotes the expected value of {·}, and ·^H denotes the conjugate transpose of ·. The autocorrelation matrix R_{τ,ω} is subjected to eigenvalue decomposition. If the acoustic signal contains acoustic signals emitted from N (N > 1) sound sources and the eigenvalues are arranged in descending order, the first N eigenvalues have large values corresponding to the energy of the acoustic signals emitted from the respective sound sources. The remaining (N+1)-th to K-th eigenvalues, by contrast, correspond to noise. The eigenvectors e_{τ,ω}^{N+1} to e_{τ,ω}^{K} corresponding to these remaining eigenvalues have the property of being orthogonal to the transfer function vector corresponding to the sound source arrival direction. Next, the MUSIC spectrum is calculated as follows.

[Equation (2): not reproduced in the source text]

Here, a_{d,ω} is the K-dimensional transfer function vector corresponding to direction d and frequency bin ω. These transfer function vectors are obtained in advance by measuring acoustic signals with the microphone array. The eigenvectors e_{τ,ω}^{N+1} to e_{τ,ω}^{K} are always orthogonal to the transfer function vector a_{d,ω} corresponding to the sound source arrival direction d = d′. Therefore, the denominator of the MUSIC spectrum P_{τ,d,ω} becomes 0 for the sound source arrival direction d = d′; that is, P_{τ,d,ω} diverges at d = d′. The MUSIC spectra are then summed over the frequency bins ω as follows.

[Equation (3): not reproduced in the source text]

Here, ω_min is the lower limit of the frequency bins, ω_max is the upper limit of the frequency bins, and q_{τ,ω}^{1} is the maximum eigenvalue at frequency bin ω. With this MUSIC spectrum P_{τ,d′}, the acoustic signals emitted from the sound sources present in each frequency bin ω can be observed. That is, the directions of the sound sources contained in a given frame τ become known, and the acoustic signals emitted from those sources within the frame can be classified by source direction.
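The two equations referred to above (the MUSIC spectrum and its sum over frequency bins) are missing from this text. For orientation only, a commonly used form is sketched below with the symbols defined above. The squared magnitude in the denominator and the weighting function w(·) applied to the largest eigenvalue are assumptions drawn from the general MUSIC literature, not from this patent, whose own equations may differ in these details.

```latex
P_{\tau,d,\omega}
  = \frac{a_{d,\omega}^{H}\, a_{d,\omega}}
         {\sum_{n=N+1}^{K} \left| a_{d,\omega}^{H}\, e_{\tau,\omega}^{n} \right|^{2}},
\qquad
P_{\tau,d}
  = \sum_{\omega=\omega_{\min}}^{\omega_{\max}}
    w\!\left(q_{\tau,\omega}^{1}\right) P_{\tau,d,\omega}
```

In this form the denominator vanishes when a_{d,ω} is orthogonal to every noise eigenvector, which is the divergence at d = d′ described in the text.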

[Non-Patent Document 1] Takuma Otsuka, Kazuhiro Nakadai, Tetsuya Ogata, Hiroshi Okuno, “Bayesian extension of the sound source localization method MUSIC,” Japanese Society for Artificial Intelligence, technical report SIG-Challenge-B102-6.

However, the conventional method of classifying signal sections based on sound source direction using a microphone array requires the relative positions of the microphones to be known. Consequently, signal section classification using the conventional MUSIC method cannot be applied to digital acoustic signals observed with freely placed microphones.

The present invention has been made in view of this point, and provides a technique that can classify signal sections accurately even when the observation positions are unknown.

Frequency-domain digital acoustic signals obtained on a plurality of channels for each predetermined time section are taken as input. For each channel, a normalized signal is obtained by normalizing the magnitude of the digital acoustic signal in speech sections by the magnitude of the digital acoustic signal in non-speech sections. For one or more fundamental frequencies, a harmonic structure signal is obtained in which the normalized signal at frequencies equal to or near integer multiples of the fundamental frequency is prioritized over the normalized signal at all other frequencies. A feature sequence is obtained from the harmonic structure signals obtained for the plurality of channels, and the feature sequence is clustered to determine the signal section class to which it belongs.

In the present invention, signal section classification is performed using a feature sequence obtained by taking the harmonic structure into account, without using information on the observation positions. Therefore, signal sections can be classified accurately even when the observation positions are unknown.

FIG. 1 is a diagram illustrating the functional configuration of the signal section classification device of the embodiment.
FIG. 2 is a diagram illustrating the signal section classification method of the embodiment.
FIG. 3 is a diagram for explaining the harmonic structure filter.
FIG. 4 is a diagram illustrating a harmonic structure filter.
FIGS. 5A to 5C are diagrams illustrating signal section classification results.
FIGS. 6A to 6C are diagrams illustrating signal section classification results.

Embodiments of the present invention are described below with reference to the drawings.
[First Embodiment]
As illustrated in FIG. 1, the signal section classification device 100 of this embodiment includes a sampling frequency conversion unit 111, a signal synchronization unit 112, a frame division unit 113, a frequency domain conversion unit 114, a VAD determination unit 115, a non-speech power storage unit 116, a gain normalization unit 117 (normalization unit), a harmonic structuring unit 118, a feature sequence calculation unit 119, and a vector classification unit 120 (classification unit). The signal section classification device 100 is configured, for example, by loading a predetermined program into a general-purpose or dedicated computer equipped with a CPU (central processing unit), RAM (random-access memory), and the like. Data input to the signal section classification device 100 and data processed by it are stored in a memory (not shown) and read by each unit as needed.

The signal section classification device 100 is connected to K freely placed observation devices 20-1, ..., 20-K (K is an integer of 2 or more). The positions of the observation devices 20-1, ..., 20-K and their positions relative to one another may be unknown or known. However, it is preferable that the observation devices 20-1, ..., 20-K are not all at the same position, and more preferable that their positions all differ from one another. Each observation device 20-k (k = 1, 2, ..., K) has a microphone 21-k and an A/D converter 22-k. The observation devices 20-1, ..., 20-K may operate independently of one another or may cooperate in some way. The sensitivities of the microphones 21-1, ..., 21-K may differ from one another or be identical, and the sampling frequencies of the A/D converters 22-1, ..., 22-K may likewise differ from one another or be identical. Specific examples of the observation devices 20-1, ..., 20-K are terminal devices with a recording function, such as smartphones, landline telephones, and voice recorders, whose sampling frequencies and microphone sensitivities differ from one another.

The microphone 21-k of each observation device 20-k observes an acoustic signal. The acoustic signal observed by each microphone 21-k is input to the A/D converter 22-k. Each A/D converter 22-k A/D-converts the acoustic signal at its own sampling frequency and obtains and outputs an input digital acoustic signal x_k(i_k) at a plurality of sample points. Here, i_k is an integer index denoting a sample point in the time domain; that is, x_k(i_k) denotes the input digital acoustic signal at the sample point indexed by i_k.

The processing sequence that processes the input digital acoustic signal x_k(i_k) obtained by the observation device 20-k is called channel k. In other words, channel k is the processing sequence for the input digital acoustic signal x_k(i_k) obtained by converting the acoustic signal with the A/D converter 22-k. Channel k thus handles the input digital acoustic signal x_k(i_k) and the values derived from it. In this embodiment there are K channels k = 1, ..., K.

<Sampling frequency conversion unit 111>
The input digital acoustic signals x_k(i_k) of the channels k = 1, ..., K obtained by the observation devices 20-1, ..., 20-K are input to the sampling frequency conversion unit 111. Because the input digital acoustic signals x_k(i_k) of different channels k are obtained by different A/D converters 22-k, their sampling frequencies may differ. The sampling frequency conversion unit 111 brings the sampling frequencies of the input digital acoustic signals x_k(i_k) of all channels k = 1, ..., K to an arbitrary common sampling frequency. In other words, it performs sampling frequency conversion on the input digital acoustic signals x_k(i_k) of the channels k = 1, ..., K and obtains, for those channels, converted digital acoustic signals cx_k(i_k) at a specific sampling frequency. The “specific sampling frequency” may be the sampling frequency of any one of the A/D converters 22-1, ..., 22-K, or some other sampling frequency. One example is 16 kHz. The sampling frequency conversion unit 111 performs the conversion on the basis of the nominal sampling frequency of each A/D converter 22-k. That is, it converts a signal sampled at the nominal sampling frequency of each A/D converter 22-k into a signal sampled at the specific sampling frequency. Such sampling frequency conversion is well known. The sampling frequency conversion unit 111 outputs the converted digital acoustic signal cx_k(i_k) of each channel k obtained in this way (step S111).
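The text leaves the conversion method open ("well known"). As a minimal sketch only, the following converts one channel from its nominal rate f_k to the common rate f by linear interpolation. The function name `resample_linear` and the choice of linear interpolation are assumptions for illustration; a practical converter would normally use a band-limited (anti-aliasing) resampler instead.

```python
def resample_linear(x, f_k, f):
    """Resample samples x from nominal rate f_k (Hz) to common rate f (Hz).

    Hypothetical helper: linear interpolation, no anti-aliasing filter.
    f_k and f are assumed to be positive integers.
    """
    if len(x) < 2:
        return list(x)
    # Number of output samples that fit in the time span of the input.
    n_out = ((len(x) - 1) * f) // f_k + 1
    out = []
    for j in range(n_out):
        t = j * f_k / f                 # output instant in input-sample units
        i = min(int(t), len(x) - 2)     # index of the left neighbour
        frac = t - i
        out.append((1.0 - frac) * x[i] + frac * x[i + 1])
    return out
```

For example, a channel recorded at a nominal 8 kHz would be brought to 16 kHz with `resample_linear(x, 8000, 16000)`.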

<Signal synchronization unit 112>
The signal synchronization unit 112 receives as input the converted digital acoustic signals cx_1(i_1), ..., cx_K(i_K) of the channels k = 1, ..., K. It synchronizes the converted digital acoustic signals cx_1(i_1), ..., cx_K(i_K) across the channels k = 1, ..., K, and obtains and outputs the digital acoustic signals sx_1(i_1), ..., sx_K(i_K) of the channels k = 1, ..., K (step S112). The details are described below.

The A/D converters 22-k vary from unit to unit. Therefore, even if the nominal sampling frequency of the A/D converter 22-k is f_k, the A/D converter 22-k may actually perform A/D conversion at a sampling frequency of f_k/α_k. Here, α_k is a positive parameter representing the deviation between the actual sampling frequency of the A/D converter 22-k and its nominal value. If x_k′(i_k) denotes the input digital acoustic signal obtained by A/D-converting an acoustic signal at the sampling frequency f_k, then the input digital acoustic signal obtained by A/D-converting the same acoustic signal at the sampling frequency f_k/α_k is x_k′(i_k × α_k), where “×” is the multiplication operator. In other words, a deviation in sampling frequency appears as a timing shift of the input digital acoustic signal in the time domain.

The sampling frequency conversion unit 111 performs sampling frequency conversion on the basis of the nominal sampling frequency f_k of each A/D converter 22-k. That is, letting f be the “specific sampling frequency” common to all channels k = 1, ..., K, the sampling frequency conversion unit 111 multiplies the sampling frequency of each channel k by f/f_k. Therefore, if the actual sampling frequency of each A/D converter 22-k is f_k/α_k, the sampling frequency of the converted digital acoustic signal cx_k(i_k) of each channel k becomes f × α_k. This frequency deviation due to individual differences appears as a timing shift in the time domain of the converted digital acoustic signals cx_k(i_k) between the channels k = 1, ..., K.

To reduce the time-domain timing shift of the converted digital acoustic signals cx_k(i_k) caused by individual differences, the signal synchronization unit 112 synchronizes the time-domain converted digital acoustic signals cx_1(i_1), ..., cx_K(i_K) across the channels k = 1, ..., K. For example, the signal synchronization unit 112 shifts the converted digital acoustic signals cx_1(i_1), ..., cx_K(i_K) relative to one another along the time axis (the sample-point direction) so that the cross-correlation between channels is maximized, and thereby obtains the synchronized digital acoustic signals sx_1(i_1), ..., sx_K(i_K).

For example, the signal synchronization unit 112 extracts from the converted digital acoustic signal cx_k(i_k) of each channel k a sample sequence cx_k(1), ..., cx_k(I) long enough (for example, 3 seconds) for a sufficiently characteristic change in waveform, such as the utterance of a word, to be observed (step S112a). Here, I is a positive integer. Next, the signal synchronization unit 112 takes the sample sequence cx_k′(1), ..., cx_k′(I) of one channel k′ ∈ {1, ..., K} among the extracted sample sequences as the reference sample sequence (step S112b). Then, for each channel k″ ∈ {1, ..., K} (k″ ≠ k′) other than channel k′, the signal synchronization unit 112 searches a predetermined search range for the delay r_k″ that maximizes the cross-correlation Σ_n {cx_k″(n) × cx_k′(n)} between the sample sequence cx_k″(1 + r_k″), ..., cx_k″(I + r_k″), obtained by shifting cx_k″(1), ..., cx_k″(I) along the time axis, and the reference sample sequence cx_k′(1), ..., cx_k′(I), and sets sx_k″(i_k″) = cx_k″(i_k″ + r_k″) and sx_k′(i_k′) = cx_k′(i_k′) (step S112c). The signal synchronization unit 112 then shifts the range from which the sample sequences cx_k(1), ..., cx_k(I) are extracted (for example, by the number of sample points corresponding to 1 second) and repeats steps S112a to S112c, thereby obtaining and outputting the synchronized digital acoustic signals sx_1(i_1), ..., sx_K(i_K) for all sample points. The amount by which the extraction range is shifted (the number of sample points corresponding to 1 second in the example above) is shorter than the length over which a sufficiently characteristic change in waveform can be observed (3 seconds in the example above).
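The delay search of step S112c can be sketched as a brute-force scan over candidate delays. This is an illustrative sketch, not the patented implementation: the names `best_delay` and `apply_delay`, the symmetric search range, and the zero-padding at the window edges are all assumptions.

```python
def best_delay(ref, other, max_delay):
    """Return the delay r in [-max_delay, max_delay] that maximises
    sum_n other[n + r] * ref[n] (hypothetical helper)."""
    best_r, best_score = 0, float("-inf")
    for r in range(-max_delay, max_delay + 1):
        score = 0.0
        for n, ref_n in enumerate(ref):
            m = n + r
            if 0 <= m < len(other):     # samples outside the window count as zero
                score += other[m] * ref_n
        if score > best_score:
            best_r, best_score = r, score
    return best_r


def apply_delay(other, r):
    """Build sx(i) = cx(i + r), padding with zeros where cx is unavailable."""
    n = len(other)
    return [other[i + r] if 0 <= i + r < n else 0.0 for i in range(n)]
```

In the procedure described above this search would be run once per extraction window (3 seconds long, advanced by 1 second in the example), with one channel held fixed as the reference.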

<Frame division unit 113>
The frame division unit 113 receives the synchronized digital acoustic signals sx_1(i_1), ..., sx_K(i_K) as input. For each channel k, it divides the digital acoustic signal sx_k(i_k) into frames, which are predetermined time sections (step S113). In this frame division, the length L of the extraction section (the frame length, in sample points) and the shift width m of the extraction section (in sample points) can be chosen freely, where L and m are positive integers. For example, the extraction section length is 2048 points and the shift width is 256 points. For each channel k, the frame division unit 113 extracts and outputs a segment of the digital acoustic signal sx_k(i_k) of the extraction section length. It then shifts the extraction section by the chosen shift width and repeats the extraction and output of a segment of that length for each channel k. Through this processing, the digital acoustic signal of each frame is output for each channel k. In the following, the digital acoustic signal belonging to frame τ of channel k is written sx_k(i_{k,τ,0}), ..., sx_k(i_{k,τ,L−1}). Here τ denotes the frame number, and the frame with frame number τ is written “frame τ”. A larger frame number corresponds to a later time.
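The frame division just described amounts to a sliding window of length L advanced by m samples. A minimal sketch follows. The function name `split_frames` is hypothetical, and dropping an incomplete trailing frame is an assumption, since the text does not say how the end of the signal is handled.

```python
def split_frames(sx, L=2048, m=256):
    """Return the length-L frames of sx taken every m samples.

    Defaults follow the example values in the text (L = 2048, m = 256).
    An incomplete trailing frame is dropped (assumption).
    """
    frames = []
    start = 0
    while start + L <= len(sx):
        frames.append(sx[start:start + L])
        start += m
    return frames
```

With L = 2048 and m = 256, consecutive frames overlap by 1792 samples, so each sample away from the signal edges appears in eight frames.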

The frequency domain conversion unit 114 takes as input sx_k(i_{k,τ,0}), ..., sx_k(i_{k,τ,L−1}) belonging to each frame τ of each channel k, applies a frequency-domain transform to them for each channel k and frame τ, and converts them into the frequency-domain digital acoustic signal X_τ^(k)(ω) for each channel k and frame τ (the frequency-domain digital acoustic signals obtained on a plurality of channels for each predetermined time section). For example, the frequency domain conversion unit 114 applies the discrete Fourier transform to sx_k(i_{k,τ,0}), ..., sx_k(i_{k,τ,L−1}) and obtains the frequency-domain digital acoustic signal X_τ^(k)(ω) for each channel k and frame τ. Here, ω denotes the frequency bin, and the frequency of frequency bin ω is written “frequency ω”. The frequency domain conversion unit 114 outputs the frequency-domain digital acoustic signal X_τ^(k)(ω) (step S114).
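As an illustration of the discrete Fourier transform step, the sketch below evaluates the DFT definition directly for one frame. The function name `dft` is hypothetical and no analysis window is applied, because the text does not mention one. The direct evaluation costs O(L²) operations, so a real system would use an FFT routine.

```python
import cmath


def dft(frame):
    """Discrete Fourier transform of one frame (direct evaluation).

    Returns one complex value per frequency bin w = 0, ..., L-1.
    """
    L = len(frame)
    return [
        sum(frame[i] * cmath.exp(-2j * cmath.pi * w * i / L) for i in range(L))
        for w in range(L)
    ]
```

Applied to the frames of channel k, element w of the result plays the role of X_τ^(k)(ω) for frequency bin ω = w.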

<VAD determination unit 115>
The VAD determination unit 115 receives as input the frequency-domain digital acoustic signal X_τ^(k)(ω) of each channel k and each frame τ. Using the input digital acoustic signal X_τ^(k)(ω), it determines whether each frame τ of each channel k is a speech section or a non-speech section (step S115). This determination can use well-known techniques, for example a method that builds a probabilistic model of speech and a probabilistic model of non-speech and judges a frame to be a speech section if the value of the speech model relative to the non-speech model (such as their ratio) exceeds a threshold, and a non-speech section otherwise (see, for example, Reference 1).
[Reference 1] Jongseo Sohn, Nam Soo Kim, Wonyong Sung, “A Statistical Model-Based Voice Activity Detection,” IEEE Signal Processing Letters, vol. 6, no. 1, 1999.

The VAD determination unit 115 uses the per-channel results of determining whether each frame τ of each channel k is a speech section or a non-speech section to determine whether each frame τ is a speech section or a non-speech section. For example, it determines that a frame τ judged to be a speech section on a majority of the channels is a speech section, and that any other frame τ is a non-speech section. Alternatively, the determination result for the channel, among the channels k = 1, ..., K, with the largest power or S/N ratio of X_τ^(k)(ω), or with the largest average power or average S/N ratio of sx_k(i_{k,τ,0}), ..., sx_k(i_{k,τ,L−1}), may be used as the determination result for frame τ.
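The majority rule for combining per-channel decisions can be sketched as follows. The function name `frame_is_speech` is hypothetical, and treating an exact tie as non-speech is an assumption, since the text only says "a majority of the channels".

```python
def frame_is_speech(channel_decisions):
    """Combine per-channel VAD decisions for one frame.

    channel_decisions: one boolean per channel, True meaning speech.
    Returns True only if strictly more than half of the channels say speech
    (tie handling is an assumption).
    """
    votes = sum(1 for d in channel_decisions if d)
    return votes * 2 > len(channel_decisions)
```

A frame label in the style used later in the text could then be derived as `1 if frame_is_speech(decisions) else 0`.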

The VAD determination unit 115 sends X_τ^(k)(ω) of all channels k = 1, ..., K for each frame τ determined to be a speech section to the gain normalization unit 117. It sends X_τ^(k)(ω) of all channels k = 1, ..., K for each frame τ determined to be a non-speech section to the non-speech power storage unit 116. In addition, the VAD determination unit 115 assigns to each frame τ a label θ_τ indicating whether the frame was determined to be a speech section or a non-speech section. An example of a label indicating that frame τ is a speech section is θ_τ = 1, and an example of a label indicating that frame τ is a non-speech section is θ_τ = 0. The VAD determination unit 115 sends the label θ_τ of each frame τ to the vector classification unit 120.

<Non-speech power storage unit 116>
The non-speech power storage unit 116 uses the digital acoustic signal X_τ^(k)(ω) of frames τ determined to be non-speech sections to calculate and store the magnitude P_N^(k)(ω) of the digital acoustic signal in non-speech sections (step S116). Examples of “the magnitude P_N^(k)(ω) of the digital acoustic signal in non-speech sections” are the power or absolute value of the digital acoustic signal in non-speech sections, the average of its power, a weighted average of its power, the sum of its power, and sign-inverted values or function values of these. An example of the “average” is a time average. P_N^(k)(ω) corresponds to each channel k and frequency bin ω. It may be calculated for each channel k and frequency bin ω, or a value calculated for one channel or frequency bin may be used as P_N^(k)(ω) for several channels k or frequency bins ω. For example, the non-speech power storage unit 116 calculates, for each channel k and frequency bin ω, the average or weighted average of the magnitude of the digital acoustic signal X_τ^(k)(ω) over frames τ determined to be non-speech sections, and uses it as P_N^(k)(ω). For example, the non-speech power storage unit 116 obtains P_N^(k)(ω) as in the following equation (4).

[Equation (4): not reproduced in the source text]

Here, η is a predetermined forgetting factor. An example of the forgetting factor is 0 < η ≤ 1, for instance η = 0.1. P′_N^(k)(ω) is the P_N^(k)(ω) calculated for a past frame determined to be a non-speech section. For example, P′_N^(k)(ω) is the most recent P_N^(k)(ω) obtained by the previous evaluation of equation (4). If no past frame has been determined to be a non-speech section, P′_N^(k)(ω) is set to a constant (for example, 0). As in equation (4), the superscript “(k)” in “P_N^(k)(ω)” and “P′_N^(k)(ω)” should properly be written directly above the subscript “N”, but owing to notational constraints it is sometimes written to the upper right of the subscript “N”. A symbol with the superscript “(k)” directly above the subscript “N” and a symbol with the superscript “(k)” to the upper right of the subscript “N” denote the same thing. Likewise, any other symbol “A_B^C” whose superscript “C” is written directly above its subscript B denotes the same thing as the symbol whose superscript “C” is written to the upper right of the subscript B.
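The equation governing this update is not present in this text, so its exact form cannot be confirmed. One plausible reading, given a forgetting factor η, a previous value P′_N and an initial value of 0, is a first-order recursive average of the non-speech power. The sketch below assumes that form; the function name `update_noise_power`, the use of |X|² as the magnitude, and which of the two terms η weights are all assumptions.

```python
def update_noise_power(prev_p_n, x_bin, eta=0.1):
    """One recursive update of the non-speech magnitude for a single
    channel and frequency bin (assumed form, not taken from the patent).

    prev_p_n : previously stored value P'_N (0.0 if none exists yet)
    x_bin    : complex spectrum value X of the current non-speech frame
    eta      : forgetting factor, 0 < eta <= 1
    """
    if not 0.0 < eta <= 1.0:
        raise ValueError("eta must satisfy 0 < eta <= 1")
    power = abs(x_bin) ** 2
    # Assumed weighting: eta on the new observation, (1 - eta) on the history.
    return (1.0 - eta) * prev_p_n + eta * power
```

Under this assumed form, η = 1 would discard the history entirely and keep only the power of the latest non-speech frame.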

<Gain normalization unit 117>
The gain normalization unit 117 receives X_τ^(k)(ω) of all channels k = 1, …, K for each frame τ determined to be a speech section. It also reads P_N^(k)(ω) from the non-speech power storage unit 116. Using these, the gain normalization unit 117 obtains, for each channel k, a normalized signal P~_Xτ^(k)(ω) in which the magnitude of the digital acoustic signal X_τ^(k)(ω) in the speech section is normalized by the magnitude P_N^(k)(ω) of the digital acoustic signal in the non-speech section (step S117). Examples of "the magnitude of the digital acoustic signal X_τ^(k)(ω) in the speech section" include the power or absolute value of X_τ^(k)(ω) in the speech section, the average value, weighted average value, or total value of that power, and sign-inverted values or function values of any of these. The normalized signal P~_Xτ^(k)(ω) is, for example, a value representing the ratio of the magnitude of the digital acoustic signal X_τ^(k)(ω) in the speech section to the magnitude P_N^(k)(ω) of the digital acoustic signal in the non-speech section. Examples of the "value representing the ratio" include that ratio itself, its reciprocal, and other function values of it. For example, the gain normalization unit 117 obtains P~_Xτ^(k)(ω) as in the following Equations (5) and (6).

Note that the "~" in "P~_Xτ^(k)(ω)" should be written directly above "P", but because of notation restrictions it may be written at the upper right of "P". A symbol with "~" directly above "P" and a symbol with "~" at the upper right of "P" denote the same thing. If P_N^(k)(ω) has not yet been stored in the non-speech power storage unit 116, the gain normalization unit 117 obtains the normalized signal P~_Xτ^(k)(ω) using a constant as P_N^(k)(ω); for example, it uses the average of previously calculated values of P_N^(k)(ω). The gain normalization unit 117 sends the normalized signal P~_Xτ^(k)(ω) of each channel k and frame τ to the harmonic structuring unit 118.
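The normalization step can be illustrated with a short sketch. Equations (5) and (6) are not reproduced in this text, so the sketch assumes the simplest reading of the description: the ratio of the speech-frame power |X|^2 to the stored non-speech power. The function name `normalize_gain`, the `fallback` constant, and the small `floor` that guards against division by zero are all hypothetical.

```python
def normalize_gain(frame_spectrum, noise_power=None, fallback=1.0, floor=1e-12):
    """Return P~(w) = |X(w)|^2 / P_N(w) for one channel of one speech frame.

    noise_power is None when no non-speech frame has been stored yet; the
    constant `fallback` is then used in place of P_N(w).
    ASSUMPTION: a plain power ratio; Equations (5) and (6) may differ.
    """
    if noise_power is None:
        noise_power = [fallback] * len(frame_spectrum)
    # `floor` keeps the division defined if a stored P_N(w) is zero.
    return [abs(x) ** 2 / max(p, floor)
            for x, p in zip(frame_spectrum, noise_power)]


normalized = normalize_gain([3 + 4j, 2 + 0j], [5.0, 0.5])
```

Dividing by the per-channel noise level removes differences in microphone gain between observation devices, so that values from different channels can be compared directly.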

<Harmonic structuring unit 118>
The harmonic structuring unit 118 receives the normalized signal P~_Xτ^(k)(ω) of each channel k and each frame τ determined to be a speech section. For one or more fundamental frequencies f_0(k_f0), the harmonic structuring unit 118 obtains a harmonic structure signal P_outτ^(k)(ω, k_f0) in which the normalized signal P~_Xτ^(k)(ω) at frequencies ω that are integer multiples of f_0(k_f0) or in the vicinity of such multiples is prioritized over the normalized signal P~_Xτ^(k)(ω) at all other frequencies ω (step S118). Because the fundamental frequency of the observed acoustic signal is unknown, the harmonic structuring unit 118 obtains P_outτ^(k)(ω, k_f0) for one or more fundamental frequencies f_0(k_f0) within a predetermined range. k_f0 is an index representing the fundamental frequency f_0(k_f0). For example, f_0(k_f0) takes discrete values, k_f0 is an integer index, and f_0(k_f0 − 1) < f_0(k_f0). "An integer multiple of the fundamental frequency f_0(k_f0)" means h × f_0(k_f0), where h is the order of the harmonic considered in the harmonic structure and is an integer of 1 or more. "A frequency in the vicinity of an integer multiple of the fundamental frequency f_0(k_f0)" means, for example, a frequency from h × f_0(k_f0) − δ_1 to h × f_0(k_f0) + δ_2 inclusive, where δ_1 and δ_2 are positive constants. For example, when f_0(k_f0 − 1) < f_0(k_f0), h × f_0(k_f0 − 1) + δ_2 < h × f_0(k_f0) − δ_1 and h × f_0(k_f0) + δ_2 < h × f_0(k_f0 + 1) − δ_1. In what follows, the normalized signal at integer multiples of f_0(k_f0) and in their vicinity is called "P~_Xτ^(k)(ω) in the harmonic range", and the normalized signal at all other frequencies is called "P~_Xτ^(k)(ω) outside the harmonic range". Prioritizing the former over the latter means differentiating "P~_Xτ^(k)(ω) in the harmonic range" from "P~_Xτ^(k)(ω) outside the harmonic range" in favor of the former. Examples are listed below.
(Example 1) A larger weight, on average, is given to "P~_Xτ^(k)(ω) in the harmonic range" than to "P~_Xτ^(k)(ω) outside the harmonic range". For example, a larger weight is given to "P~_Xτ^(k)(ω) in the harmonic range" than to "P~_Xτ^(k)(ω) outside the harmonic range".
(Example 2) "P~_Xτ^(k)(ω) in the harmonic range" is emphasized.
(Example 3) "P~_Xτ^(k)(ω) in the harmonic range" is extracted.
(Example 4) "P~_Xτ^(k)(ω) in the harmonic range" is extracted while its magnitude is changed.
(Example 5) A smaller weight, on average, is given to "P~_Xτ^(k)(ω) outside the harmonic range" than to "P~_Xτ^(k)(ω) in the harmonic range". For example, a smaller weight is given to "P~_Xτ^(k)(ω) outside the harmonic range" than to "P~_Xτ^(k)(ω) in the harmonic range".
(Example 6) "P~_Xτ^(k)(ω) outside the harmonic range" is suppressed (attenuated).
(Example 7) "P~_Xτ^(k)(ω) outside the harmonic range" is removed.
(Example 8) A combination of (Example 2) or (Example 3) with (Example 6) or (Example 7).

These are illustrated using FIG. 3. The horizontal axis of FIG. 3 represents frequency, and the vertical axis represents the magnitude (power, absolute value, or the like) of the normalized signal. The ranges β_1 to β_5 illustrate frequencies that are integer multiples of the fundamental frequency f_0(k_f0) or in the vicinity of such multiples, and the ranges α_1 to α_6 illustrate all other frequencies. For example, the harmonic structuring unit 118 uses, as the harmonic structure signal P_outτ^(k)(ω, k_f0) for frequencies ω in the ranges β_1 to β_5, the value obtained by multiplying P~_Xτ^(k)(ω) at those frequencies by a weight w_β, and uses, as the harmonic structure signal P_outτ^(k)(ω, k_f0) for frequencies ω in the ranges α_1 to α_6, the value obtained by multiplying P~_Xτ^(k)(ω) at those frequencies by a weight w_α, where w_β > w_α (Examples 1, 2, 5, and 6). For example, w_β = 1 and w_α = 0 (Examples 3, 6, and 7), w_β = 0.5 and w_α = 0 (Example 4), or w_β = 2 and w_α = 0.1 (a combination of Example 2 and Example 6) may be used. The weights for the ranges α_1 to α_6 need not all be equal, nor need the weights for the ranges β_1 to β_5; for example, the weights may depend on frequency. That is, any weights may be used as long as the average weight over the ranges β_1 to β_5 is larger than the average weight over the ranges α_1 to α_6.
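A minimal sketch of the two-weight scheme of FIG. 3 follows, assuming frequencies in Hz and symmetric margins δ_1 = δ_2 = 10 Hz; the function names, the margins, and the five-harmonic limit are illustrative choices, not values from the patent.

```python
def in_harmonic_range(freq, f0, max_order, delta1, delta2):
    """True if h*f0 - delta1 <= freq <= h*f0 + delta2 for some h in 1..max_order."""
    return any(h * f0 - delta1 <= freq <= h * f0 + delta2
               for h in range(1, max_order + 1))


def apply_weights(freqs, normalized, f0, max_order=5,
                  delta1=10.0, delta2=10.0, w_beta=1.0, w_alpha=0.0):
    """Multiply P~(w) by w_beta inside the harmonic range and w_alpha outside it."""
    return [(w_beta if in_harmonic_range(f, f0, max_order, delta1, delta2)
             else w_alpha) * p
            for f, p in zip(freqs, normalized)]


freqs = [100.0, 150.0, 200.0, 250.0, 300.0]
values = [1.0, 1.0, 1.0, 1.0, 1.0]
# w_beta = 1, w_alpha = 0: pure extraction (Examples 3, 6, and 7).
extracted = apply_weights(freqs, values, f0=100.0)
# w_beta = 2, w_alpha = 0.1: emphasis plus suppression (Examples 2 and 6).
emphasized = apply_weights(freqs, values, f0=100.0, w_beta=2.0, w_alpha=0.1)
```

Only the ordering w_β > w_α matters for the prioritization; the specific values decide whether the off-harmonic bins are removed outright or merely attenuated.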

For example, the harmonic structuring unit 118 multiplies the normalized signal P~_Xτ^(k)(ω) by a comb-shaped harmonic structure filter f(ω, f_0(k_f0)) that prioritizes "P~_Xτ^(k)(ω) in the harmonic range" over "P~_Xτ^(k)(ω) outside the harmonic range", and thereby obtains P_outτ^(k)(ω, k_f0). f(ω, f_0(k_f0)) is a weight by which P~_Xτ^(k)(ω) is multiplied and satisfies, for example, f(ω, f_0(k_f0)) ≥ 0. The average magnitude of f(ω, f_0(k_f0)) applied to "P~_Xτ^(k)(ω) in the harmonic range" is larger than the average magnitude of f(ω, f_0(k_f0)) applied to "P~_Xτ^(k)(ω) outside the harmonic range". For example, the magnitude of f(ω, f_0(k_f0)) applied to "P~_Xτ^(k)(ω) in the harmonic range" is larger than the magnitude of f(ω, f_0(k_f0)) applied to "P~_Xτ^(k)(ω) outside the harmonic range". The value of f(ω, f_0(k_f0)) applied to "P~_Xτ^(k)(ω) outside the harmonic range" may be 0 or nonzero. The value applied to "P~_Xτ^(k)(ω) in the harmonic range" may be 1, or may be larger or smaller than 1. The value applied to "P~_Xτ^(k)(ω) in the harmonic range" may be the same for all ω and k_f0, or may differ according to at least one of ω and k_f0. Likewise, the value applied to "P~_Xτ^(k)(ω) outside the harmonic range" may be the same for all ω and k_f0, or may differ according to at least one of ω and k_f0. For example, the value applied to "P~_Xτ^(k)(ω) in the harmonic range" may decrease as ω increases (as the frequency rises), and the value applied to "P~_Xτ^(k)(ω) outside the harmonic range" may also decrease as ω increases. The harmonic structure filter may be a set of values f(ω, f_0(k_f0)) calculated in advance for ω and f_0(k_f0) within a predetermined range, or a function whose domain is ω and f_0(k_f0) and whose range is f(ω, f_0(k_f0)). An example of the harmonic structure filter f(ω, f_0(k_f0)) is shown below.

Here, A is a constant (A > 0) that sets the shape of the harmonic components simulating the harmonic structure, and σ is the variance of the Gaussian distribution (σ > 0). H is an integer of 1 or more that is the upper limit of h. As an example, H = 8, A = 0.9849, and σ = 7; the fundamental frequency f_0(k_f0) takes discrete values in steps of Δf_0 = 7.81 Hz within the analysis range 148.4 ≤ f_0 ≤ 500 Hz; there are 46 values of k_f0; and 1 ≤ k_f0 ≤ 46. FIG. 4 is a graph illustrating the harmonic structure filter f(ω, f_0(k_f0)) of Equation (7). The horizontal axis of FIG. 4 represents the index k_f0 of the fundamental frequency f_0(k_f0), and the vertical axis represents the frequency ω. FIG. 4 illustrates the harmonic structure filter f(ω, f_0(k_f0)) for H = 1 to 8. In this graph, the closer a point is to white, the larger the magnitude of f(ω, f_0(k_f0)), and black indicates that the magnitude of f(ω, f_0(k_f0)) is zero. For example, the harmonic structuring unit 118 multiplies P~_Xτ^(k)(ω) by the harmonic structure filter f(ω, f_0(k_f0)) as follows to obtain P_outτ^(k)(ω, k_f0).

The harmonic structuring unit 118 sends the harmonic structure signals P_outτ^(k)(ω, k_f0), obtained for each frame τ determined to be a speech section, each channel k, each frequency bin ω, and each index k_f0, to the feature quantity sequence calculation unit 119.
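The comb-shaped filter described above can be sketched as a sum of Gaussian bumps centred on the harmonics h × f_0. Equation (7) is not reproduced in this text, so the exact functional form here is an assumption: σ is treated as a variance in the exponent, and frequencies are taken in Hz. Only the constants H = 8, A = 0.9849, σ = 7 and the 46-point f_0 grid come from the description; the function names are hypothetical.

```python
import math


def f0_grid(f_min=148.4, step=7.81, count=46):
    """Candidate fundamentals f_0(k_f0) for k_f0 = 1..count (values from the text)."""
    return [f_min + step * k for k in range(count)]


def comb_filter(freq, f0, num_harmonics=8, amplitude=0.9849, sigma=7.0):
    """Hypothetical comb weight f(w, f_0): Gaussian bumps centred on h * f0.

    ASSUMPTION: exponent -(freq - h*f0)^2 / (2*sigma) with sigma as a variance;
    Equation (7) may use a different form.
    """
    return sum(amplitude * math.exp(-(freq - h * f0) ** 2 / (2.0 * sigma))
               for h in range(1, num_harmonics + 1))


def harmonic_structure(freqs, normalized, f0):
    """P_out(w, k_f0) = f(w, f_0(k_f0)) * P~(w) for one channel and one frame."""
    return [comb_filter(f, f0) * p for f, p in zip(freqs, normalized)]


grid = f0_grid()
structured = harmonic_structure([200.0, 150.0], [2.0, 2.0], f0=100.0)
```

Because the weight is non-negative and peaks at each harmonic, the result keeps energy near h × f_0 and attenuates everything between the harmonics, which is the prioritization the description calls for.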

<Feature quantity sequence calculation unit 119>
The feature quantity sequence calculation unit 119 receives the harmonic structure signals P_outτ^(k)(ω, k_f0) and obtains a feature quantity sequence P_H^(τ)(ω, k_f0) from the harmonic structure signals P_outτ^(1)(ω, k_f0), …, P_outτ^(K)(ω, k_f0) of the plurality of channels k = 1, …, K (step S119). The feature quantity sequence P_H^(τ)(ω, k_f0) may be generated for every index k_f0, but some fundamental frequencies f_0(k_f0) may not correspond to the observed acoustic signal. It is therefore desirable to generate P_H^(τ)(ω, k_f0) only for the indexes k_f0 of fundamental frequencies f_0(k_f0) that correspond to the observed acoustic signal. One way to select such indexes is to set a positive threshold γ and select the indexes k_f0 for which P_outτ^(k)(ω, k_f0) > γ holds for all channels k = 1, …, K. Alternatively, the indexes k_f0 for which the average of P_outτ^(1)(ω, k_f0), …, P_outτ^(K)(ω, k_f0) over the channels exceeds γ may be selected, or the indexes k_f0 for which P_outτ^(k)(ω, k_f0) of a specific channel k satisfies P_outτ^(k)(ω, k_f0) > γ may be selected. As a further alternative, positive thresholds γ and ρ may be set, and the indexes k_f0 for which P_outτ^(k)(ω, k_f0) > γ holds for at least ρ of the channels k = 1, …, K may be selected. The threshold γ need not be a fixed value. For example, γ may be a constant multiple a × min(P_outτ^(k)(ω, k_f0)) of the minimum value min(P_outτ^(k)(ω, k_f0)) of P_outτ^(k)(ω, k_f0) in each frame τ, where a is a positive constant, for example a = 100. Using the minimum value min(P_outτ^(k)(ω, k_f0)) in each frame τ as a reference allows the threshold to vary with the magnitude of the noise component superimposed on the observed acoustic signal.
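The index selection rules can be sketched as follows. The description does not say how the dependence on ω is handled when comparing with γ, so this sketch assumes that each channel already provides one scalar score per candidate index (for example, P_out summed over ω); that reduction, the function names, and the zero-based indexes are assumptions.

```python
def select_all_channels(scores, gamma):
    """Candidate indexes whose score exceeds gamma on every channel."""
    return [i for i in range(len(scores[0]))
            if all(channel[i] > gamma for channel in scores)]


def select_at_least(scores, gamma, rho):
    """Candidate indexes whose score exceeds gamma on at least rho channels."""
    return [i for i in range(len(scores[0]))
            if sum(channel[i] > gamma for channel in scores) >= rho]


def adaptive_threshold(scores, a=100.0):
    """gamma = a * (smallest score in the frame), with a > 0."""
    return a * min(min(channel) for channel in scores)


# scores[k][i]: channel k, candidate index i (zero-based here).
scores = [[0.01, 5.0, 0.5],
          [0.02, 3.0, 2.0]]
gamma = adaptive_threshold(scores)
strict = select_all_channels(scores, gamma)
relaxed = select_at_least(scores, gamma, rho=1)
```

Tying γ to the frame minimum means a noisier frame raises the bar automatically, so weak candidates that are only noise are less likely to be selected.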

For example, the feature quantity sequence calculation unit 119 uses, as the feature quantity sequence P_H^(τ)(ω, k_f0), a sequence obtained by normalizing the magnitude of the harmonic structure signals P_outτ^(1)(ω, k_f0), …, P_outτ^(K)(ω, k_f0) of the plurality of channels k = 1, …, K. For example, it takes the K-dimensional vector [P_outτ^(1)(ω, k_f0), …, P_outτ^(K)(ω, k_f0)]^T whose elements are P_outτ^(1)(ω, k_f0), …, P_outτ^(K)(ω, k_f0), normalizes its magnitude to a predetermined value, and uses the resulting vector as P_H^(τ)(ω, k_f0). Here, ·^T denotes the transpose of ·. For example, the feature quantity sequence calculation unit 119 obtains P_H^(τ)(ω, k_f0) as follows. In the following example, P_H^(τ)(ω, k_f0) is a unit vector.

The feature quantity sequence calculation unit 119 sends the obtained feature quantity sequence P_H^(τ)(ω, k_f0) to the vector classification unit 120.
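Normalizing the K channel values to a unit vector can be sketched as below. Equation (9) is not reproduced in this text; the sketch assumes the Euclidean norm, which is one way to obtain the unit vector the description mentions. The function name `feature_vector` is hypothetical.

```python
import math


def feature_vector(channel_values):
    """Scale [P_out^(1), ..., P_out^(K)] to unit Euclidean length.

    ASSUMPTION: Euclidean norm; Equation (9) may use a different normalization.
    """
    norm = math.sqrt(sum(v * v for v in channel_values))
    if norm == 0.0:
        raise ValueError("an all-zero vector cannot be normalized")
    return [v / norm for v in channel_values]


vec = feature_vector([3.0, 4.0])
```

After normalization only the relative levels across channels remain, so a loud and a quiet utterance from the same position map to the same direction in the K-dimensional space.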

<Vector classification unit 120>
The vector classification unit 120 receives the label θ_τ of each frame τ and the feature quantity sequence P_H^(τ)(ω, k_f0) of each frame τ determined to be a speech section. It clusters the feature quantity sequences P_H^(τ)(ω, k_f0) and determines the signal section classification (cluster) to which the feature quantity sequence P_H^(τ)(ω, k_f0) of each frame τ determined to be a speech section belongs (step S120). The vector classification unit 120 updates the label θ_τ of each frame τ according to the cluster and outputs the label θ_τ of each frame τ.

The feature quantity sequences P_H^(τ)(ω, k_f0) are clustered into signal section classifications for each sound source on the basis of the attenuation that arises according to the distance from the sound source to the microphones 22-1 to 22-K of the observation devices 20-1 to 20-K. Usable clustering methods include, for example, online clustering, which clusters the feature quantity sequence P_H^(τ)(ω, k_f0) frame by frame without learning information, and offline clustering, which learns in a supervised manner from correct-answer information given by the observation devices 20-1 to 20-K or by a user. Examples of online clustering include leader-follower clustering and a k-nearest neighbor method in which the number of clusters is determined in advance. Examples of offline clustering include a method that performs GMM (Gaussian mixture model) learning followed by MAP (maximum a posteriori) estimation, and a method that uses an SVM (support vector machine) (see, for example, Reference 2).
[Reference 2] Richard O. Duda, Peter E. Hart, David G. Stork, "Pattern Classification," Wiley-Interscience, 2000.

The following describes an example that uses leader-follower clustering and an example that uses the k-nearest neighbor method with a fixed number of clusters.
[Example using leader-follower clustering]
The vector classification unit 120 receives the feature quantity sequence P_H^(τ)(ω, k_f0) as input and can determine the cluster to which the feature quantity sequence P_H^(τ)(ω, k_f0) of each frame τ belongs by leader-follower clustering, an unsupervised learning method. Cosine similarity can be used for the distance that serves as the clustering criterion. The distance function based on cosine similarity can be defined as follows.

Here, CL is the label of each cluster and takes a value (for example, an integer of 1 or more) other than the label θ_r (for example, 0) that represents a non-speech section. The cluster with label CL is written "cluster CL". P_CL is the centroid vector of cluster CL. d(CL) represents the distance between the centroid vector P_CL of cluster CL and the input feature quantity sequence (vector) P_H^(τ)(ω, k_f0). P_H^(τ)(ω, k_f0) · P_CL denotes the inner product of P_H^(τ)(ω, k_f0) and P_CL, and |P_H^(τ)(ω, k_f0)||P_CL| denotes the product of the magnitude |P_H^(τ)(ω, k_f0)| of P_H^(τ)(ω, k_f0) and the magnitude |P_CL| of P_CL. An example of the magnitude of a vector is a norm such as the Euclidean norm. If d(CL) of the P_H^(τ)(ω, k_f0) to be classified exceeds a threshold η (η is a positive constant) for all candidate clusters CL, the vector classification unit 120 adds to the cluster candidates a new cluster whose centroid vector is this P_H^(τ)(ω, k_f0) and classifies this P_H^(τ)(ω, k_f0) into the new cluster. If, on the other hand, d(CL) of the P_H^(τ)(ω, k_f0) to be classified falls below the threshold η for any candidate cluster CL, the vector classification unit 120 classifies this P_H^(τ)(ω, k_f0) into the cluster CL with the smallest d(CL). The vector classification unit 120 uses, as the label θ_τ of a frame τ determined to be a speech section, the set of labels CL representing the clusters CL into which the feature quantity sequences P_H^(τ)(ω, k_f0) of that frame were classified, that is, the clusters to which they belong.
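The leader-follower procedure can be sketched as follows. Equation (10) is not reproduced in this text, so the sketch assumes the distance d = 1 − (cosine similarity), which is small for similar vectors as the thresholding rule requires. As in the description, a new cluster takes the incoming vector as its centroid; centroids are not moved afterwards, since the description does not specify an update. The function names and the threshold value 0.1 are illustrative.

```python
import math


def cosine_distance(u, v):
    """ASSUMPTION: d = 1 - cosine similarity, so smaller means closer."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def assign_cluster(vector, centroids, threshold):
    """Return the 1-based label for `vector`, adding a new centroid if needed.

    Labels start at 1 so that 0 can stay reserved for non-speech frames.
    """
    distances = [cosine_distance(vector, c) for c in centroids]
    if not distances or min(distances) > threshold:
        centroids.append(list(vector))  # the vector becomes the new centroid
        return len(centroids)
    return distances.index(min(distances)) + 1


centroids = []
labels = [assign_cluster(v, centroids, threshold=0.1)
          for v in ([1.0, 0.0], [0.99, 0.05], [0.0, 1.0], [0.1, 0.9])]
```

The number of clusters is not fixed in advance here: it grows whenever a vector is farther than the threshold from every existing centroid, which suits situations where the number of speakers is unknown.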

[Example using the k-nearest neighbor method with a fixed number of clusters]
The vector classification unit 120 defines P_CL vectors that serve as references for classifying the feature quantity sequence P_H^(τ)(ω, k_f0). Each P_CL vector corresponds to a cluster CL, and each cluster CL corresponds to a different observation device 20-k. In this embodiment the total number of observation devices 20-1 to 20-K is K, so K mutually orthogonal P_CL vectors are defined. Each P_CL vector is a K-dimensional vector. An example of a P_CL vector is a vector in which exactly one of the K elements is 1 and all the others are 0, such as [1, 0, ..., 0]^T. In this way, the vector classification unit 120 defines the P_CL vectors from the total number K of observation devices 20-1 to 20-K. The vector classification unit 120 calculates the distance between the feature quantity sequence P_H^(τ)(ω, k_f0) to be classified and each P_CL vector. P_H^(τ)(ω, k_f0) corresponds to the power ratio between the speech section and the noise section of the digital acoustic signals obtained when the observation devices 20-k observe the acoustic signal emitted from a sound source. The power of an acoustic signal emitted from a sound source attenuates in inverse proportion to distance as it propagates through the air. Therefore, when observation devices 20-k placed at different positions observe the acoustic signal emitted from a given sound source, the observation device 20-k closest to that sound source observes the acoustic signal with the greatest power. For example, if only the k-th observation device 20-k observes the acoustic signal emitted from the sound source, the feature quantity sequence P_H^(τ)(ω, k_f0) calculated by Equation (9) is the unit vector whose k-th element is 1 and whose other elements are 0. This property makes it possible to determine which observation device 20-k is closest to the sound source that emitted an acoustic signal in a given frame τ. Classifying the feature quantity sequence P_H^(τ)(ω, k_f0) into the cluster CL that corresponds to the observation device 20-k closest to the sound source that emitted the corresponding acoustic signal therefore classifies the acoustic signals emitted from the individual sound sources. The vector classification unit 120 obtains the distance between the feature quantity sequence P_H^(τ)(ω, k_f0) of a frame τ determined to be a speech section and each P_CL vector, and classifies that feature quantity sequence into the cluster CL corresponding to the P_CL vector with the smallest distance. Cosine similarity can be used as the distance measure (Equation (10)). The vector classification unit 120 uses, as the label θ_τ of a frame τ determined to be a speech section, the set of labels CL representing the clusters CL into which the feature quantity sequences P_H^(τ)(ω, k_f0) of that frame were classified, that is, the clusters to which they belong.
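Classification against fixed one-hot reference vectors can be sketched as below. As before, the distance is assumed to be 1 − (cosine similarity) because Equation (10) is not reproduced in this text; the function names are hypothetical.

```python
import math


def one_hot_references(num_devices):
    """K mutually orthogonal K-dimensional reference vectors P_CL."""
    return [[1.0 if i == j else 0.0 for j in range(num_devices)]
            for i in range(num_devices)]


def cosine_distance(u, v):
    """ASSUMPTION: d = 1 - cosine similarity, so smaller means closer."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def classify(vector, references):
    """Return the 1-based label of the reference vector nearest to `vector`."""
    distances = [cosine_distance(vector, r) for r in references]
    return distances.index(min(distances)) + 1


refs = one_hot_references(3)
label = classify([0.2, 0.9, 0.3], refs)
```

With one-hot references the nearest cluster is simply the channel with the largest normalized value, which matches the idea that the closest observation device records the strongest signal.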

[Second Embodiment]
As illustrated in FIG. 1, the signal section classification device 200 of this embodiment is the signal section classification device 100 of the first embodiment with a probability calculation unit 221 added. That is, the signal section classification device 200 includes a sampling frequency conversion unit 111, a signal synchronization unit 112, a frame division unit 113, a frequency domain conversion unit 114, a VAD determination unit 115, a non-speech power storage unit 116, a gain normalization unit 117 (normalization unit), a harmonic structuring unit 118, a feature quantity sequence calculation unit 119, a vector classification unit 120 (classification unit), and a probability calculation unit 221. The signal section classification device 200 of this embodiment is configured, for example, by loading a predetermined program into a general-purpose or dedicated computer equipped with a CPU, RAM, and the like. Data input to the signal section classification device 200 and data processed by it are stored in a memory (not shown) and read by each unit as needed. In other respects the device is the same as in the first embodiment. Only the differences from the first embodiment are described below; items common to the first embodiment are given the same reference numbers as before, and their description is omitted.

The probability calculation unit 221 receives the label θ_τ of each frame τ output from the vector classification unit 120. From the signal section classifications (clusters) to which the feature sequences P_H^(τ)(ω, k_f0) belong, as represented by the label θ_τ, the probability calculation unit 221 obtains values corresponding to the probability that a feature sequence P_H^(τ)(ω, k_f0) belongs to each signal section classification, and outputs those values. For example, the probability calculation unit 221 outputs the "value corresponding to the probability" for each frame τ. For example, for each frame τ, the probability calculation unit 221 normalizes the frequency distribution of the clusters CL represented by the label θ_τ so that the frequencies sum to a predetermined value (for example, 1), and uses the normalized frequency of each cluster CL as the "value corresponding to the probability". When the frequencies are normalized to sum to 1, the values output by the probability calculation unit 221 are the probabilities that a feature sequence belongs to each cluster.
For example, suppose that there are five clusters in total, that the label θ_τ of one frame τ represents 47 clusters CL obtained as classification results, and that the frequencies are normalized to sum to 1. If 45 of these represent cluster 1 and 2 represent cluster 2, the probability that a feature sequence belongs to cluster 1 (the value corresponding to the probability) is approximately 95%, and the probability that a feature sequence belongs to cluster 2 is approximately 5%. Alternatively, the probability calculation unit 221 may output, for each frame τ, the number of clusters for which the probability that a feature sequence belongs to the cluster is equal to or greater than a threshold as the "value corresponding to the probability". Alternatively, the probability calculation unit 221 may output, for each frame τ, the identification number of the cluster for which that probability is largest as the "value corresponding to the probability".
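The normalization of the label frequencies can be illustrated with a short sketch. The function name and the representation of θ_τ as a plain list of cluster numbers are assumptions made here; the numbers reproduce the 45-of-47 example above, and the 0.7 threshold is only an example value.

```python
from collections import Counter


def cluster_probabilities(theta, num_clusters, total=1.0):
    """Normalize the frequency of each cluster label so the values sum to `total`."""
    counts = Counter(theta)
    n = len(theta)
    if n == 0:
        # assumption: a frame with no labels gets zero for every cluster
        return {cl: 0.0 for cl in range(1, num_clusters + 1)}
    return {cl: total * counts.get(cl, 0) / n for cl in range(1, num_clusters + 1)}


# Label of one frame: 45 feature sequences in cluster 1, 2 in cluster 2.
theta = [1] * 45 + [2] * 2
probs = cluster_probabilities(theta, num_clusters=5)

# Alternative outputs mentioned in the text:
above = [cl for cl, p in probs.items() if p >= 0.7]  # clusters at or above a threshold
best = max(probs, key=probs.get)                      # cluster with the largest probability
```

With these inputs, `probs[1]` is 45/47 and `probs[2]` is 2/47, while clusters 3 to 5 receive zero.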

FIGS. 5A to 5C and 6A to 6C show experimental results. In these graphs, the horizontal axis represents the frame number τ and the vertical axis represents the cluster CL. The closer a point is to white, the higher the probability; black indicates a probability of zero. The total number of clusters was set to 5, five smartphones were used as the observation devices 20-1, ..., 20-5 (K = 5), and acoustic signals emitted from two sound sources, placed so that their distances from the five smartphones all differed, were recorded. The recorded acoustic signals were input to the signal section classification device 200 as the input digital acoustic signals of channels k = 1, ..., 5, and the processing of the device was carried out with a total of 46 indices k_f0. FIGS. 5A and 6A show the results when only sound source 1 emitted an acoustic signal, FIGS. 5B and 6B show the results when only sound source 2 emitted an acoustic signal, and FIGS. 5C and 6C show the results when both sound sources 1 and 2 emitted acoustic signals.

≪Experiment 1≫
FIGS. 5A to 5C show the results of Experiment 1, in which the k-nearest neighbor method was used as the clustering technique under the above conditions. FIGS. 5A and 5B show that the feature sequences of frames in which sound source 1 emits an acoustic signal are classified into cluster 5 with high probability, and that the feature sequences of frames in which sound source 2 emits an acoustic signal are classified into cluster 4 with high probability. FIG. 5C shows that the feature sequences of frames in which both sound sources 1 and 2 emit acoustic signals are highly likely to be split between two clusters, cluster 4 and cluster 5.

≪Experiment 2≫
FIGS. 6A to 6C show the results of Experiment 2, in which leader-follower clustering was used as the clustering technique under the above conditions. The threshold for generating a new cluster in leader-follower clustering was set to 0.35. FIGS. 6A and 6B show that the feature sequences of frames in which sound source 1 emits an acoustic signal are classified into cluster 1 with high probability, and that the feature sequences of frames in which sound source 2 emits an acoustic signal are classified into cluster 2 with high probability. FIG. 6C shows that the feature sequences of frames in which both sound sources 1 and 2 emit acoustic signals are highly likely to be split between two clusters, cluster 1 and cluster 2.
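For readers unfamiliar with leader-follower clustering, a minimal sketch of the basic algorithm is given here. Only the threshold 0.35 comes from the text; the use of cosine distance, the update rate of 0.1, the function names, and the two-dimensional sample data are assumptions, and the experiment's actual configuration may differ.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity; inputs are assumed to be non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def leader_follower(samples, threshold=0.35, rate=0.1):
    """Assign each sample to the nearest center, or open a new cluster.

    A new cluster is created when the nearest center is at distance
    `threshold` or more; otherwise that center is moved toward the sample.
    """
    centers, labels = [], []
    for x in samples:
        if centers:
            dists = [cosine_distance(x, c) for c in centers]
            j = min(range(len(centers)), key=dists.__getitem__)
        if not centers or dists[j] >= threshold:
            centers.append(list(x))       # the sample becomes a new leader
            labels.append(len(centers))   # cluster numbers start at 1
        else:
            centers[j] = [c + rate * (a - c) for c, a in zip(centers[j], x)]
            labels.append(j + 1)
    return labels, centers


# Hypothetical feature vectors pointing in two clearly different directions.
samples = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9], [1.0, 0.0]]
labels, centers = leader_follower(samples)
```

Because clusters are created in order of appearance, the first source heard tends to occupy cluster 1 and the next cluster 2, which is consistent with the cluster numbers reported for this experiment.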

[Third Embodiment]
As illustrated in FIG. 1, the signal section classification device 300 of this embodiment is obtained by further adding a signal enhancement unit 322 to the signal section classification device 200 of the second embodiment. That is, the signal section classification device 300 includes a sampling frequency conversion unit 111, a signal synchronization unit 112, a frame division unit 113, a frequency domain conversion unit 114, a VAD determination unit 115, a non-speech power storage unit 116, a gain normalization unit 117 (normalization unit), a harmonic structuring unit 118, a feature sequence calculation unit 119, a vector classification unit 120 (classification unit), a probability calculation unit 221, and the signal enhancement unit 322. The signal section classification device 300 of this embodiment is configured by loading a predetermined program into a general-purpose or dedicated computer equipped with, for example, a CPU and RAM. Data input to the signal section classification device 300 and data produced by its processing are stored in a memory (not shown) and read by each unit as necessary. In all other respects the device is the same as in the second embodiment. Only the differences from the first and second embodiments are described below; items common to the first and second embodiments are given the same reference numerals as before, and their description is omitted.

The signal enhancement unit 322 receives the values corresponding to the probabilities output from the probability calculation unit 221 and the digital acoustic signals sx_k(i_{k,τ,0}), ..., sx_k(i_{k,τ,L-1}) belonging to each frame τ of one of the channels k. Using the "values corresponding to the probabilities", the signal enhancement unit 322 enhances the digital acoustic signals sx_k(i_{k,τ,0}), ..., sx_k(i_{k,τ,L-1}) emitted from a sound source and suppresses noise components. That is, the signal enhancement unit 322 enhances the digital acoustic signals sx_k(i_{k,τ,0}), ..., sx_k(i_{k,τ,L-1}) of a frame τ corresponding to a cluster for which the probability that a feature sequence belongs to it is equal to or greater than a predetermined value TH_p (for example, 70% or more), and outputs the enhanced digital acoustic signals.
For example, the signal enhancement unit 322 enhances the digital acoustic signals sx_k(i_{k,τ,0}), ..., sx_k(i_{k,τ,L-1}) of frames τ for which the above probability is equal to or greater than the predetermined value. In doing so, it may enhance the digital acoustic signals of frames τ for which the probability that a feature sequence belongs to a specific cluster is equal to or greater than the predetermined value; in this case, the acoustic signal emitted from a specific sound source can be enhanced. Alternatively, it may enhance the digital acoustic signals of frames τ for which the probability that a feature sequence belongs to any one of the clusters is equal to or greater than the predetermined value; in this case, the acoustic signals emitted from every sound source are enhanced and noise signal components are suppressed. The predetermined value TH_p is, for example, a value that allows only one cluster to be identified. For example, TH_p is greater than 50%; specifically, for example, TH_p = 70%. One method of enhancing the digital acoustic signals sx_k(i_{k,τ,0}), ..., sx_k(i_{k,τ,L-1}) of a frame τ is to use, as the enhanced digital acoustic signals, the sequence of values obtained by multiplying these signals by a weight greater than 1. Alternatively, the digital acoustic signals of frames for which the above probability is less than the predetermined value may be multiplied by a weight of 0 or more and less than 1, so that the digital acoustic signals of frames for which the probability is equal to or greater than the predetermined value are enhanced in relative terms.
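The frame-wise weighting described above might be sketched as follows. This is a simplified illustration under stated assumptions: the function name, the per-frame probability dictionaries, and the weight values 1.0 and 0.25 are hypothetical, and only the threshold TH_p = 70% is taken from the text. The sketch uses the second method, attenuating frames below the threshold rather than amplifying those above it.

```python
def enhance_frames(frames, frame_probs, th_p=0.7, keep=1.0, attenuate=0.25, target=None):
    """Weight each frame's samples according to its cluster probabilities.

    frames      : list of per-frame sample lists (one channel k)
    frame_probs : list of {cluster: probability} dicts, one per frame
    target      : a specific cluster to enhance, or None for any cluster
    """
    out = []
    for samples, probs in zip(frames, frame_probs):
        if target is None:
            p = max(probs.values(), default=0.0)  # best probability over all clusters
        else:
            p = probs.get(target, 0.0)            # probability of the chosen cluster
        weight = keep if p >= th_p else attenuate
        out.append([weight * s for s in samples])
    return out


frames = [[0.5, -0.5], [0.5, 0.25], [1.0, -1.0]]
frame_probs = [{1: 0.96, 2: 0.04}, {1: 0.5, 2: 0.5}, {1: 0.1, 2: 0.9}]

any_source = enhance_frames(frames, frame_probs)          # frames dominated by any one cluster
source_1 = enhance_frames(frames, frame_probs, target=1)  # only frames dominated by cluster 1
```

The second frame, where no cluster reaches 70%, is attenuated in both calls; the third frame is kept when any cluster qualifies but attenuated when only cluster 1 is targeted.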

[Features]
As described above, each embodiment takes as input the frequency-domain digital acoustic signals obtained on a plurality of channels for each predetermined time section; obtains, for each channel, a normalized signal in which the magnitude of the digital acoustic signal in a speech section is normalized by the magnitude of the digital acoustic signal in a non-speech section; obtains, for one or more fundamental frequencies, a harmonic structure signal in which the normalized signal at integer multiples of the fundamental frequency and at frequencies in their vicinity is given priority over the normalized signal at other frequencies; obtains feature sequences from the harmonic structure signals obtained for the plurality of channels; and clusters the feature sequences to obtain the signal section classification to which each feature sequence belongs. These processes use no information about the positions of the microphones that observe the acoustic signals. Therefore, signal section classification based on sound source position can be performed on digital acoustic signals recorded by freely placed terminal devices that have a recording function and differing microphone sensitivities, such as multiple smartphones, fixed-line telephones, and voice recorders. Moreover, because the section classification result separates target sound sections from the sections of other sound sources, it can be used as information for designing a filter that suppresses noise and enhances the target sound. Furthermore, because the sound source (a speaker, for example) contained in each frame can be identified, the result can also be used to select the filter with which the target sound should be enhanced.
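One way to see why differing microphone sensitivities do not disturb the classification is sketched below, assuming that the normalization is a per-bin ratio of speech-section power to stored non-speech power. The ratio form, the function name, the small floor value, and the sample numbers are assumptions for illustration, not the embodiment's formula.

```python
def normalize_by_noise(speech_power, noise_power, floor=1e-12):
    """Divide speech-section power by non-speech power, bin by bin."""
    return [s / max(n, floor) for s, n in zip(speech_power, noise_power)]


# Hypothetical power spectra of one channel (three frequency bins).
speech = [4.0, 9.0, 1.0]
noise = [1.0, 3.0, 0.5]

# The same scene recorded by a device whose microphone is 4 times more sensitive
# in power terms: speech and non-speech power are scaled by the same factor.
gain = 4.0
speech_loud = [gain * s for s in speech]
noise_loud = [gain * n for n in noise]

normalized = normalize_by_noise(speech, noise)
normalized_loud = normalize_by_noise(speech_loud, noise_loud)
```

Under this assumption the device-specific gain cancels in the ratio, so `normalized` and `normalized_loud` are identical.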

Furthermore, in each embodiment, the sampling frequency conversion unit 111 performs sampling frequency conversion to correct sampling frequency mismatches between channels, and the signal synchronization unit 112 synchronizes the channels, which suppresses the influence of individual differences among the observation devices 20-k. Therefore, the signal section classification can be determined accurately even when the nominal sampling frequencies of the A/D converters 22-k of the channels differ from one another or the actual sampling frequencies vary between individual devices.

In addition, the frequency-domain representation of a speech signal is sparse (the signal is present at only a few points). Because each of the above embodiments clusters frequency-domain signals, signal section classification can be performed even in an environment where utterances overlap.

Because the feature sequence is obtained from a harmonic structure signal in which the normalized signal at integer multiples of the fundamental frequency and at frequencies in their vicinity is given priority over the normalized signal at other frequencies, signal sections of acoustic signals having a harmonic structure, such as speech and stringed instruments, can be classified accurately.
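The idea of giving priority to bins near the harmonics can be illustrated with a simple comb weighting. This is a sketch under stated assumptions, not the harmonic structuring unit's actual formula: the binary weights, the half-width of one bin, the function name, and the flat test spectrum are all hypothetical.

```python
def harmonic_structure(normalized, k_f0, half_width=1, off_weight=0.0):
    """Keep bins within `half_width` of a multiple of k_f0; scale the rest.

    normalized : normalized signal indexed by frequency bin omega
    k_f0       : bin index of the candidate fundamental frequency (>= 1)
    """
    out = []
    for omega, value in enumerate(normalized):
        r = omega % k_f0
        dist = min(r, k_f0 - r)  # distance to the nearest multiple of k_f0
        # Exclude the neighborhood of bin 0, which is not a harmonic.
        near = omega >= k_f0 - half_width and dist <= half_width
        out.append(value if near else off_weight * value)
    return out


spectrum = [1.0] * 13  # flat spectrum, bins 0 to 12
comb = harmonic_structure(spectrum, k_f0=4)
energy = sum(comb)  # could serve as a score for the candidate k_f0
```

With k_f0 = 4, the bins 3 to 5, 7 to 9, 11 and 12 are kept and all others are zeroed, so a spectrum whose peaks fall on those bins retains most of its energy.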

[Modifications, etc.]
The present invention is not limited to the embodiments described above. For example, in each of the above embodiments, the feature sequence calculation unit 119 uses, as the feature sequence P_H^(τ)(ω, k_f0), the sequence obtained by normalizing the magnitudes of the harmonic structure signals P_outτ^(1)(ω, k_f0), ..., P_outτ^(K)(ω, k_f0) of channels k = 1, ..., K (for example, Formula (9)). However, the harmonic structure signals P_outτ^(1)(ω, k_f0), ..., P_outτ^(K)(ω, k_f0) may instead be converted into harmonic structure signal powers P_outτ^(1)(k_f0), ..., P_outτ^(K)(k_f0) by computing, for each channel, the sum over frequency components Σ_ω P_outτ^(k)(ω, k_f0), and the sequence obtained by normalizing the magnitudes of P_outτ^(1)(k_f0), ..., P_outτ^(K)(k_f0) may be used as the feature sequence P_H^(τ)(k_f0). For example, the feature sequence P_H^(τ)(k_f0) may be generated as follows.
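The formula referred to here is not reproduced in this text. One form consistent with the description, under the assumption that the normalization divides by the Euclidean norm of the K channel powers (the choice of norm is an assumption and is not stated here), is:

```latex
P_{\mathrm{out}\tau}^{(k)}(k_{f0}) = \sum_{\omega} P_{\mathrm{out}\tau}^{(k)}(\omega, k_{f0}),
\qquad
P_H^{(\tau)}(k_{f0}) =
\frac{\left[ P_{\mathrm{out}\tau}^{(1)}(k_{f0}), \ldots, P_{\mathrm{out}\tau}^{(K)}(k_{f0}) \right]^{\mathsf{T}}}
     {\left\| \left[ P_{\mathrm{out}\tau}^{(1)}(k_{f0}), \ldots, P_{\mathrm{out}\tau}^{(K)}(k_{f0}) \right]^{\mathsf{T}} \right\|}
```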

In this case, P_H^(τ)(k_f0) is used as the feature sequence in place of P_H^(τ)(ω, k_f0) in the processing of the vector classification unit 120. Everything else is the same as in the embodiments described above.

In addition, instead of clustering feature sequences independently for each frame, the vector classification unit 120 may cluster feature sequences over a plurality of frames at a time. That is, all feature sequences corresponding to a plurality of frames may be made the target of the same clustering.

Alternatively, in the first embodiment, the vector classification unit 120 may take the cluster CL into which the largest number of feature sequences of a frame τ were classified as the cluster to which that frame belongs, and update the label θ_τ of the frame to the label CL of that cluster. Alternatively, the cluster CL into which the largest number of a subset of the feature sequences of the frame τ were classified may be taken as the cluster to which the frame belongs, and the label θ_τ updated to the label CL of that cluster; or a cluster CL into which any one of the feature sequences of the frame τ was classified may be taken as the cluster to which the frame belongs, and the label θ_τ updated to the label CL of that cluster.
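The first of these variants amounts to a majority vote over the labels of a frame. A minimal sketch, assuming θ_τ is held as a list of cluster numbers and with a hypothetical function name, is:

```python
from collections import Counter


def majority_label(theta):
    """Return the cluster that received the most feature sequences of the frame."""
    if not theta:
        return None  # assumption: a frame without labels has no cluster
    # On a tie, Counter.most_common keeps the label that appeared first.
    return Counter(theta).most_common(1)[0][0]


theta = [1, 1, 2, 1, 2]
frame_cluster = majority_label(theta)
```

Restricting `theta` to a subset of the feature sequences before the call gives the second variant described above.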

For example, if the nominal sampling frequencies of the A/D converters 22-k of all channels k = 1, ..., K are identical, the processing of the sampling frequency conversion unit 111 need not be performed. In this case, the "input digital acoustic signal" may be input to the signal synchronization unit 112 unchanged as the "converted digital acoustic signal", and the sampling frequency conversion unit 111 may be omitted.

Furthermore, if the nominal sampling frequencies of the A/D converters 22-k of all channels k = 1, ..., K are identical and the influence of their individual differences is also small, the processing of the sampling frequency conversion unit 111 and the signal synchronization unit 112 need not be performed. In this case, the "input digital acoustic signal" may be input to the frame division unit 113 unchanged as the "digital acoustic signal", and the sampling frequency conversion unit 111 and the signal synchronization unit 112 may be omitted.

The various processes described above may be executed not only in time series in the order described, but also in parallel or individually, according to the processing capability of the device that executes them or as needed. It goes without saying that other modifications may be made as appropriate without departing from the spirit of the present invention.

When the above configuration is implemented by a computer, the processing performed by the functions that each device should have is described by a program. Executing this program on a computer realizes the above processing functions on the computer. The program describing this processing can be recorded on a computer-readable recording medium. An example of a computer-readable recording medium is a non-transitory recording medium. Examples of such recording media include magnetic recording devices, optical discs, magneto-optical recording media, and semiconductor memories.

This program is distributed, for example, by selling, transferring, or lending a portable recording medium such as a DVD or CD-ROM on which the program is recorded. The program may also be distributed by storing it in a storage device of a server computer and transferring it from the server computer to other computers over a network.

A computer that executes such a program first stores, for example, the program recorded on the portable recording medium or the program transferred from the server computer in its own storage device. When executing the processing, the computer reads the program stored in its own storage device and executes processing according to the program it has read. As another form of execution, the computer may read the program directly from the portable recording medium and execute processing according to it, or the computer may execute processing according to the received program each time a program is transferred to it from the server computer. The above processing may also be executed by a so-called ASP (Application Service Provider) type service, in which no program is transferred from the server computer to the computer and the processing functions are realized solely through execution instructions and the retrieval of results.

In the above embodiments, the processing functions of the device are realized by executing a predetermined program on a computer, but at least some of these processing functions may be realized in hardware.

Observation devices 20-1 to 20-K
Signal section classification devices 100, 200, 300

Claims

1. A signal section classification device comprising:
a normalization unit that takes as input frequency-domain digital acoustic signals obtained on a plurality of channels for each predetermined time section and obtains, for each channel, a normalized signal in which the magnitude of the digital acoustic signal in a speech section is normalized by the magnitude of the digital acoustic signal in a non-speech section;
a harmonic structuring unit that obtains, for one or more arbitrary fundamental frequencies within a predetermined range, a harmonic structure signal in which the normalized signal at integer multiples of the fundamental frequency and at frequencies in the vicinity of the integer multiples of the fundamental frequency is given priority over the normalized signal at frequencies other than the integer multiples of the fundamental frequency and their vicinity;
a feature sequence calculation unit that estimates the fundamental frequency of the observed digital acoustic signal using the harmonic structure signals obtained for the plurality of channels and obtains a feature sequence from the harmonic structure signal corresponding to the estimated fundamental frequency; and
a classification unit that clusters the feature sequences and determines the signal section classification to which each feature sequence belongs.

2. The signal section classification device according to claim 1, wherein
the feature sequence calculation unit selects
(1) a fundamental frequency corresponding to a harmonic structure signal that exceeds a positive threshold γ in the plurality of channels,
(2) a fundamental frequency corresponding to a harmonic structure signal for which the average value of the harmonic structure signals obtained for the plurality of channels exceeds the threshold γ, or
(3) a fundamental frequency corresponding to a harmonic structure signal that exceeds the threshold γ in a predetermined number or more of channels among the harmonic structure signals obtained for the plurality of channels,
and obtains the feature sequence from the harmonic structure signal corresponding to the selected fundamental frequency, and
the threshold γ is a constant multiple of the minimum value of the harmonic structure signal.

3. A signal section classification method comprising:
a normalization step of taking as input frequency-domain digital acoustic signals obtained on a plurality of channels for each predetermined time section and obtaining, for each channel, a normalized signal in which the magnitude of the digital acoustic signal in a speech section is normalized by the magnitude of the digital acoustic signal in a non-speech section;
a harmonic structuring step of obtaining, for one or more arbitrary fundamental frequencies within a predetermined range, a harmonic structure signal in which the normalized signal at integer multiples of the fundamental frequency and at frequencies in the vicinity of the integer multiples of the fundamental frequency is given priority over the normalized signal at frequencies other than the integer multiples of the fundamental frequency and their vicinity;
a feature sequence calculation step of estimating the fundamental frequency of the observed digital acoustic signal using the harmonic structure signals obtained for the plurality of channels and obtaining a feature sequence from the harmonic structure signal corresponding to the estimated fundamental frequency; and
a classification step of clustering the feature sequences and determining the signal section classification to which each feature sequence belongs.

4. The signal section classification method according to claim 3, wherein
the feature sequence calculation step selects
(1) a fundamental frequency corresponding to a harmonic structure signal that exceeds a positive threshold γ in the plurality of channels,
(2) a fundamental frequency corresponding to a harmonic structure signal for which the average value of the harmonic structure signals obtained for the plurality of channels exceeds the threshold γ, or
(3) a fundamental frequency corresponding to a harmonic structure signal that exceeds the threshold γ in a predetermined number or more of channels among the harmonic structure signals obtained for the plurality of channels,
and obtains the feature sequence from the harmonic structure signal corresponding to the selected fundamental frequency, and
the threshold γ is a constant multiple of the minimum value of the harmonic structure signal.

5. A program for causing a computer to function as the signal section classification device according to claim 1 or 2.