JP2008233866A

JP2008233866A - Signal separating device, signal separating method, and computer program

Info

Publication number: JP2008233866A
Application number: JP2007328516A
Authority: JP
Inventors: Atsuo Hiroe; 厚夫廣江
Original assignee: Sony Corp
Current assignee: Sony Corp
Priority date: 2007-02-21
Filing date: 2007-12-20
Publication date: 2008-10-02
Anticipated expiration: 2027-12-20
Also published as: JP5233827B2; JP5195979B2; JP2009169439A; JP2011215649A; JP4403436B2

Abstract

PROBLEM TO BE SOLVED: To perform highly accurate separation processing that takes delay amounts into account for mixed sound signals having various delay amounts.

SOLUTION: Observation spectrograms generated by converting an input sound signal into the time-frequency domain are interpreted as observation signals subjected to convolutive mixing in the time-frequency domain, and an independent component analysis that solves the convolutive mixture is performed to generate signal separation results. Alternatively, modulation spectrograms generated by short-time Fourier transform (STFT) are interpreted as instantaneous mixtures, and an independent component analysis that solves the instantaneous mixture is performed to generate signal separation results. Highly accurate separation processing is thereby performed, taking delay amounts into account, for mixed sound signals having various delay amounts such as direct waves and reflected waves.

COPYRIGHT: (C)2009, JPO&INPIT

Description

The present invention relates to a signal separation device, a signal separation method, and a computer program. More specifically, the present invention relates to a signal separation device, a signal separation method, and a computer program that use independent component analysis (ICA) to separate a signal in which a plurality of signals are mixed into its individual signals.

Independent component analysis (ICA), a technique that separates and restores original signals using only statistical independence when a plurality of original signals are linearly mixed with unknown coefficients, is attracting attention in the field of signal processing. By applying independent component analysis, a speech signal can be separated and restored even when, for example, the speaker is some distance from the microphone and the microphone picks up sounds other than the speaker's voice.

ICA is a type of multivariate analysis: a technique for separating multidimensional signals by exploiting their statistical properties. For details of ICA itself, see, for example, Non-Patent Document 1 ("Introduction to Independent Component Analysis" (Noboru Murata, Tokyo Denki University Press)).

First, a method that uses time-frequency-domain independent component analysis to separate a mixture of a plurality of signals (in particular, sound signals) in the time-frequency domain is described, followed by the problems of that method. As shown in FIG. 1, consider a situation in which N sound sources (signal sources) emit different sounds and these are observed by n microphones (sensors). When the sounds emitted by the sources (the original signals) reach a microphone, the sound it picks up contains direct waves and reflected waves, with time delays that depend on the distance to each source, among other effects. The signal observed at a given microphone j (1 ≤ j ≤ n), the observation signal, can therefore be written, as in Equation [1.1], as the sum over all sources of the convolution of each original signal with a transfer function (hereinafter "convolutive mixing"). Writing the observation signals of all microphones 1 to n in a single expression gives Equation [1.2], where x(t) and s(t) are column vectors whose elements are x_k(t) and s_k(t), respectively, and A^[l] is an n × N matrix whose elements are a_kj. (Hereinafter, n = N is assumed.)
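As an illustration of the convolutive mixing model described above, the following minimal sketch computes x(t) = Σ_l A^[l] s(t - l) with NumPy. The function name `convolutive_mix`, the number of lags, and the coefficient values are assumptions made for this example and do not come from the original text.

```python
import numpy as np

def convolutive_mix(s, A):
    """Mix sources through FIR transfer functions.

    s: array (N, T), one row per original signal s_k(t).
    A: array (L, n, N); A[l] plays the role of the matrix A^[l] for lag l.
    Returns x: array (n, T) with x(t) = sum_l A[l] @ s(t - l),
    taking s(t) = 0 for t < 0.
    """
    L, n, N = A.shape
    T = s.shape[1]
    x = np.zeros((n, T))
    for l in range(L):
        # Delay all sources by l samples, then mix with A[l].
        delayed = np.zeros_like(s)
        delayed[:, l:] = s[:, :T - l]
        x += A[l] @ delayed
    return x

rng = np.random.default_rng(0)
s = rng.standard_normal((2, 1000))        # two hypothetical sources
A = np.zeros((3, 2, 2))
A[0] = [[1.0, 0.5], [0.4, 1.0]]           # direct wave (illustrative values)
A[2] = [[0.3, 0.0], [0.0, 0.2]]           # reflection arriving 2 samples later
x = convolutive_mix(s, A)                 # observation signals, shape (2, 1000)
```

Each microphone signal here depends on past samples of every source, which is why a single mixing matrix cannot describe the observation in the time domain.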

The following two methods are known for solving such a convolutive mixture.
(1) Solve the convolutive mixture directly in the time domain (time-domain deconvolution).
(2) Convert the observation signals into the time-frequency domain and solve an instantaneous mixing problem.
Each method is described below.

(1) Solving the convolutive mixture directly in the time domain (time-domain deconvolution)
To undo the convolution of Equation [1.2] above, an expression that convolves and mixes the observation signals, such as Equation [2.1], is prepared.

With an expression such as Equation [2.1], the separation matrices W^[0] to W^[L'] are determined so that the components y_1(t) to y_n(t) of the separation result y(t) are as mutually independent as possible over all t (hereinafter, W^[0] to W^[L'] are collectively called the separation filter). To this end, Equations [2.1] to [2.4] are repeated until the separation matrices and the separation result converge (hereinafter, such repetition is called "learning", and the expressions that update the separation matrix or compute ΔW are called "learning rules"). In Equation [2.3], E_t[] denotes the average over t, and φ is a function called a score function or activation function. For details of the equations for solving a convolutive mixture in the time domain, see, for example, Non-Patent Document 2 ("Detailed Independent Component Analysis" (Aapo Hyvarinen et al., Tokyo Denki University Press), 19.2 Blind separation of convolutive mixtures, 19.2.3 Natural gradient methods).
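The separation filter described above can be sketched as follows, assuming that Equation [2.1] has the form y(t) = Σ_{l=0}^{L'} W^[l] x(t - l), which is how the text describes it. The function name `apply_separation_filter` and the toy mixture are illustrative assumptions. The learning rule itself (Equations [2.2] to [2.4]) is not reproduced; instead the example uses a hand-constructed mixture whose exact FIR inverse is known, so that the filtering step can be checked.

```python
import numpy as np

def apply_separation_filter(x, W):
    """y(t) = sum_{l=0}^{L'} W[l] @ x(t - l), taking x(t) = 0 for t < 0.

    x: array (n, T) of observation signals.
    W: array (L' + 1, n, n), the separation matrices W^[0] .. W^[L'].
    """
    taps, n, _ = W.shape
    T = x.shape[1]
    y = np.zeros((n, T))
    for l in range(taps):
        y[:, l:] += W[l] @ x[:, :T - l]
    return y

rng = np.random.default_rng(1)
s = rng.standard_normal((2, 500))

# Toy mixture: source 2 leaks into microphone 1 with a one-sample delay.
# B is nilpotent (B @ B == 0), so the mixture has an exact 2-tap inverse.
B = np.array([[0.0, 0.5], [0.0, 0.0]])
x = s.copy()
x[:, 1:] += B @ s[:, :-1]

W = np.stack([np.eye(2), -B])             # W^[0] = I, W^[1] = -B
y = apply_separation_filter(x, W)         # recovers s exactly for this toy case
```

In a real room no such short exact inverse is available, which is why the text notes that thousands of taps are needed and that learning becomes slow.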

(2) Converting the observation signals into the time-frequency domain and solving an instantaneous mixing problem
It is known that a convolutive mixture in the time domain is represented by an instantaneous mixture in the time-frequency domain, and time-frequency-domain ICA exploits this property. For time-frequency-domain ICA itself, see Non-Patent Document 2 cited above ("Detailed Independent Component Analysis", Aapo Hyvarinen et al., Tokyo Denki University Press, "19.2.4 Fourier transform methods") and Patent Document 1 (JP 2006-238409 A, "Audio Signal Separation Device / Noise Removal Device and Method").

In time-frequency-domain independent component analysis, A and s(t) are not estimated directly from x(t) of Equation [1.2]; instead, x(t) is converted into a time-frequency-domain signal, and the signals corresponding to A and s(t) are estimated in the time-frequency domain. The points mainly relevant to the present invention are described below. Applying a short-time Fourier transform to both sides of Equation [1.2] yields, approximately, Equation [3.1]. That is, let X(ω, t) and S(ω, t) be the short-time Fourier transforms of the signal vectors x(t) and s(t) with a window of length L, and let A(ω) be the corresponding transform of the matrix A(t); then Equation [1.2] in the time domain can be expressed as Equation [3.1] in the time-frequency domain. Here ω is the frequency bin number (1 ≤ ω ≤ M) and t is the frame number (1 ≤ t ≤ T). Time-frequency-domain ICA thus estimates S(ω, t) and A(ω) of Equation [3.1] in the time-frequency domain.
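The approximation described above can be checked numerically. The sketch below assumes a Hann window, a window length of 512, a hop of 128, and a short two-tap transfer function; all of these values, and the helper name `stft`, are assumptions for this example. It compares the STFT of a convolved signal with the product A(ω) S(ω, t) for a single source and a single microphone.

```python
import numpy as np

def stft(x, win_len=512, hop=128):
    """Short-time Fourier transform; returns an array (bins, frames)."""
    window = np.hanning(win_len)
    n_frames = 1 + (len(x) - win_len) // hop
    frames = np.stack([x[i * hop:i * hop + win_len] * window
                       for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1).T

rng = np.random.default_rng(0)
s = rng.standard_normal(16000)            # hypothetical source signal

a = np.zeros(512)                         # hypothetical transfer function
a[0], a[4] = 1.0, 0.5                     # direct wave plus one short echo
x = np.convolve(s, a)[:len(s)]            # time-domain convolution

S, X = stft(s), stft(x)
A = np.fft.rfft(a)                        # frequency response A(w)

# Relative error of the instantaneous model X(w, t) ~ A(w) S(w, t).
rel_err = np.linalg.norm(X - A[:, None] * S) / np.linalg.norm(X)
```

The error stays small here because the echo is much shorter than the window; as the text goes on to explain, the model breaks down when the reverberation is longer than the window.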

In Equation [3.1], ω is the frequency bin number and t is the frame number. With ω fixed, the equation can be regarded as an instantaneous mixture. To separate the observation signals, an expression such as Equation [3.5] is therefore prepared, and the separation matrix W(ω) is determined so that the components of Y(ω, t) are as independent as possible.

The number of frequency bins is, in principle, equal to the window length L, and each frequency bin represents one of the frequency components obtained by dividing the range from -R/2 to R/2 (R being the sampling frequency) into L equal parts. However, the negative-frequency components are the complex conjugates of the positive-frequency components and can be obtained as X(-ω) = conj(X(ω)), where conj(·) denotes the complex conjugate.
To estimate S(ω, t) and A(ω) in the time-frequency domain, first consider a separation expression such as Equation [3.5]. In Equation [3.5], Y(ω, t) is a column vector whose elements Y_k(ω, t) are the short-time Fourier transforms of y_k(t) with a window of length L, and W(ω) is an n × n matrix (the separation matrix) whose elements are w_ij(ω).
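The conjugate symmetry noted above is why only about half of the L bins need to be processed for real-valued input. A small check with NumPy, using an arbitrary window length L = 8 chosen for this example:

```python
import numpy as np

L = 8                                      # illustrative window length
frame = np.random.default_rng(0).standard_normal(L)

X = np.fft.fft(frame)                      # all L frequency bins
# Bin L - k holds the negative frequency -k; it equals conj(X[k]).
symmetric = all(np.allclose(X[L - k], np.conj(X[k])) for k in range(1, L))

X_half = np.fft.rfft(frame)                # only the non-negative frequencies
n_bins = len(X_half)                       # L // 2 + 1 bins
```

With this convention the number of bins M actually separated is L // 2 + 1, and the remaining bins are restored by conjugation before the inverse transform.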

Conventional time-frequency-domain ICA suffered from the so-called permutation problem, in which "which component is separated into which channel" differs from one frequency bin to another. This problem was largely solved by Patent Document 1 cited above (JP 2006-238409 A, "Audio Signal Separation Device / Noise Removal Device and Method"), a patent application by the present inventor.

Because the present invention is a theoretical extension of that published application, JP 2006-238409 A, the features of JP 2006-238409 A are also described below.

Conventionally, that is, before the technique of Patent Document 1 (JP 2006-238409 A) was disclosed, the per-frequency-bin expression [3.5] was used as the separation expression in the time-frequency domain, and a separation matrix W(ω) that maximizes independence was obtained separately for each frequency bin.

That is, W(ω) is sought such that Y_1(ω, t) to Y_n(ω, t) become statistically independent (in practice, such that their independence is maximized) when ω is fixed and t is varied. As described later, time-frequency-domain ICA has permutation and scaling ambiguities, so solutions other than W(ω) = A(ω)^-1 also exist. Once statistically independent Y_1(ω, t) to Y_n(ω, t) have been obtained for all ω, the time-domain separated signals y(t) are obtained by inverse Fourier transform.

An outline of conventional independent component analysis in the time-frequency domain is as follows. Let s_1 to s_n be the mutually independent original signals emitted by n sound sources, and let s be the vector having them as elements. The observation signal x picked up by the microphones is the original signal s subjected to the convolution and mixing operation of Equation [1.2]. A short-time Fourier transform is then applied to the observation signal x to obtain the time-frequency-domain signal X. The elements X_k(ω, t) of X are complex-valued. A spectrogram is a plot of the absolute value |X_k(ω, t)| as color intensity, for example with the frame number t on the horizontal axis and the frequency bin number ω on the vertical axis. Next, each frequency bin of X is multiplied by W(ω) to obtain the separated signal Y. Finally, Y is inverse Fourier transformed to obtain the time-domain separated signal y.
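The per-bin multiplication step in the outline above can be sketched as follows. The array layout (bins, channels, frames), the random test data, and the use of the known inverse A(ω)^-1 as the separation matrix are assumptions for illustration; in practice W(ω) is learned by ICA rather than computed from A(ω).

```python
import numpy as np

rng = np.random.default_rng(0)
M, n, T = 5, 2, 100                        # bins, channels, frames (illustrative)

# Hypothetical source spectrograms S(w, t) and per-bin mixing matrices A(w).
S = rng.standard_normal((M, n, T)) + 1j * rng.standard_normal((M, n, T))
A = 2 * np.eye(n) + 0.5 * (rng.standard_normal((M, n, n))
                           + 1j * rng.standard_normal((M, n, n)))

# Instantaneous mixing in each bin: X(w, t) = A(w) S(w, t).
X = np.einsum('wij,wjt->wit', A, S)

# Separation in each bin: Y(w, t) = W(w) X(w, t).
W = np.linalg.inv(A)                       # ideal W(w), known only in this toy case
Y = np.einsum('wij,wjt->wit', W, X)

# Magnitudes |X_1(w, t)| of the first channel, i.e. the data a spectrogram plots.
spectrogram = np.abs(X[:, 0, :])
```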

However, the time-frequency-domain ICA described above performs separation for each frequency bin and does not take the relationship between frequency bins into account. Even if the separation itself succeeds, scaling and separation destinations may therefore be inconsistent across frequency bins. The scaling inconsistency can be resolved by estimating the observation signal for each sound source. Inconsistent separation destinations, on the other hand, refers to the phenomenon in which, for example, a signal originating from S_1 appears in Y_1 at ω = 1 whereas a signal originating from S_2 appears in Y_1 at ω = 2; this is called the permutation problem.
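The permutation problem described above can be reproduced in a toy setting. In the sketch below, each bin is separated perfectly, but the rows of the separation matrix for one bin are swapped; that bin is still a valid per-bin solution, since its outputs remain independent, yet the output channels no longer line up across bins. The data, sizes, and the swapped bin index are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
M, n, T = 3, 2, 50
S = rng.standard_normal((M, n, T))                      # stand-in source spectrograms
A = np.eye(n) + 0.1 * rng.standard_normal((M, n, n))    # per-bin mixing matrices
X = A @ S                                               # X(w, t) = A(w) S(w, t)

W = np.linalg.inv(A)                                    # one valid solution per bin
P = np.array([[0.0, 1.0], [1.0, 0.0]])                  # permutation of the outputs
W[1] = P @ W[1]                                         # equally valid for bin 1 alone

Y = W @ X
# Bin 0: Y_1 carries source 1.  Bin 1: Y_1 carries source 2.
```

Because per-bin independence cannot distinguish the two solutions, some criterion that couples the bins is needed, which is what the whole-spectrogram approach of Patent Document 1 provides.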

In contrast, Patent Document 1 (JP 2006-238409 A) uses Equation [4.4], which expresses separation over the entire spectrogram, and obtains the separation matrix W that maximizes independence over the entire spectrogram.

Specifically, the Kullback-Leibler information I(Y) of Equation [4.5] is introduced as the measure of independence over the entire spectrogram, and the separation matrix W that minimizes I(Y) is obtained. In independent component analysis there are many variations in the measure used to express independence and in the algorithm used to maximize it; one such measure is the Kullback-Leibler information (KL information). I(Y) is the sum of the entropies of the individual spectrograms minus the joint entropy of all the spectrograms, and it takes its minimum (ideally 0) when all the spectrograms are mutually independent.

As stated above, the KL information I(Y) is defined by Equation [4.5]. In that equation, H(Y_k) is the entropy of one spectrogram, that of channel k, and H(Y) is the joint entropy of the spectrograms of all channels. FIG. 2 shows the relationship between H(Y_k) and H(Y) for n = 2. In FIG. 2, P(Y_k(t)) is the probability density function of Y_k(t), and H(Y_k) is the entropy of one spectrogram for each channel. I(Y) is the sum of the per-spectrogram entropies 11 and 12 minus the joint entropy 13 of the whole set of spectrograms, and it is minimal (ideally 0) when all spectrograms are mutually independent.
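The definition I(Y) = Σ_k H(Y_k) - H(Y) can be illustrated on a small discrete example. The sketch below replaces the spectrogram densities with 2 × 2 probability tables, which is a deliberate simplification; the function names and table values are assumptions for this example.

```python
import numpy as np

def entropy(p):
    """Shannon entropy (in nats) of a probability table; zero cells are skipped."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

def kl_information(joint):
    """I(Y) = H(Y_1) + H(Y_2) - H(Y_1, Y_2) for a 2-D joint probability table."""
    joint = np.asarray(joint, dtype=float)
    return (entropy(joint.sum(axis=1)) + entropy(joint.sum(axis=0))
            - entropy(joint))

# Independent outputs: the joint table is the product of its marginals.
independent = np.outer([0.5, 0.5], [0.2, 0.8])
# Fully dependent outputs: knowing one determines the other.
dependent = np.array([[0.5, 0.0], [0.0, 0.5]])

i_indep = kl_information(independent)     # approximately 0
i_dep = kl_information(dependent)         # log(2)
```

The measure is zero only in the independent case and grows as the outputs share more information, which is why minimizing it drives the separated signals toward independence.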

To minimize the KL information I(Y) over the entire spectrogram, Equations [5.1] to [5.3] are repeated until W and Y converge.

Here ΔW(ω), W(ω), and Y(ω, t) in Equation [5.3] are the submatrices obtained by extracting from ΔW, W, and Y(t), respectively, the elements corresponding to the ω-th frequency bin. This makes it possible to obtain separation results free of the permutation problem.
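Equations [5.1] to [5.3] are not reproduced in this text, so the following is only a sketch of what such an iteration can look like, not the patent's own learning rule. It assumes a natural-gradient update ΔW(ω) = {I + E_t[φ_ω(Y(t)) Y(ω, t)^H]} W(ω) with the spherical score function φ_ω(Y_k(t)) = -Y_k(ω, t) / ||Y_k(t)||_2, a form commonly used when independence is evaluated over whole spectrograms. The function name, step size η = 0.1, iteration count, and synthetic data are all assumptions.

```python
import numpy as np

def learn_separation(X, n_iter=50, eta=0.1):
    """Iterate separation and update over all bins jointly.

    X: observed spectrograms, shape (M bins, n channels, T frames).
    Returns (W, Y) with W of shape (M, n, n) and Y = W X.
    """
    M, n, T = X.shape
    W = np.tile(np.eye(n, dtype=complex), (M, 1, 1))
    for _ in range(n_iter):
        Y = W @ X                                        # separate every bin
        # Norm of each channel's whole spectrum per frame, shape (n, T).
        norm = np.sqrt(np.sum(np.abs(Y) ** 2, axis=0))
        phi = -Y / (norm + 1e-12)                        # assumed score function
        # Time average E_t[phi(Y) Y^H] for every bin, shape (M, n, n).
        corr = phi @ np.conj(np.swapaxes(Y, 1, 2)) / T
        W = W + eta * (np.eye(n) + corr) @ W             # natural-gradient step
    return W, W @ X

rng = np.random.default_rng(0)
M, n, T = 8, 2, 2000
# Synthetic sources whose bins share one envelope per source and frame.
envelope = rng.exponential(size=(1, n, T))
S = envelope * (rng.standard_normal((M, n, T))
                + 1j * rng.standard_normal((M, n, T)))
A = np.eye(n) + 0.4 * rng.standard_normal((M, n, n))
X = A @ S
W, Y = learn_separation(X)
```

Because the score function for bin ω depends on the norm taken across all bins, the bins are no longer updated independently of each other, which is the property the text credits with avoiding the permutation problem.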

However, the two methods for solving convolutive mixtures described above, namely
(1) solving the convolutive mixture directly in the time domain (time-domain deconvolution), and
(2) converting the observation signals into the time-frequency domain and solving an instantaneous mixing problem,
each have problems.
(1) Solving the convolutive mixture directly in the time domain (time-domain deconvolution)
This method converges slowly. Reasons include the fact that the whole waveform changes whenever a coefficient of the separation filter changes, and that the computational cost of the separation filter update is proportional to the square of the number of taps L'. When L' is large, separation in a practical time is therefore difficult unless an initial value of the separation filter as close as possible to the converged value is obtained in advance. Handling reverberation in real environments requires a number of taps on the order of at least several thousand, so method (1) requires a computational cost on the order of several thousand squared.

Method (2), converting the observation signals into the time-frequency domain and solving an instantaneous mixing problem, has the problem that there is a trade-off between the window length of the short-time Fourier transform (STFT) and the separation accuracy. When the observation signal contains long reverberation, that is, when it is a convolutive mixture with a large number of taps, the STFT window length (= number of taps) must also be made large in order to represent it as an instantaneous mixture in the time-frequency domain. (When the window length is shorter than the reverberation length, the reverberation spans multiple frames and cannot be represented by instantaneous mixing.) It is known, however, that making the window too long actually lowers the separation accuracy. For this trade-off, see, for example, the following documents.
Patent Document 2 (JP 2003-271168 A, "Signal Extraction Method and Signal Extraction Apparatus, Signal Extraction Program and Recording Medium Recording the Program")
Non-Patent Document 3 ("Study on blind sound source separation by subband processing", Shoko Araki, Robert Aichner, Shoji Makino, Tsuyoki Nishikawa, Hiroshi Saruwatari, Proceedings of the Acoustical Society of Japan, March 2002, pp. 619-620)
Non-Patent Document 4 ("Optimization of the number of band divisions in Blind Source Separation using band-division ICA", Tsuyoki Nishikawa, Shoko Araki, Shoji Makino, Hiroshi Saruwatari, Proceedings of the Acoustical Society of Japan, March 2001, pp. 569-570)

The reason separation accuracy falls as the window is lengthened is that the longer the window (= the larger the number of taps), the more gradual the variation of the generated spectrogram along the time axis, that is, the variation of its temporal envelope. Time-frequency-domain ICA separates the observation signals on the basis of the independence between envelopes, but the independence between slowly varying envelopes tends to be computed as lower than that between rapidly varying envelopes. In other words, even envelopes originating from different sound sources may be judged to be "correlated", and the separation accuracy deteriorates as a result.

As described above, the problem with method (2), converting the observation signals into the time-frequency domain and solving an instantaneous mixing problem, is the trade-off between the STFT window length and the separation accuracy. The results of an experiment on this trade-off conducted by the inventor are shown below. FIG. 3 plots the relationship between the STFT window length and the separation accuracy of time-frequency-domain ICA.

In FIG. 3, the horizontal axis is the STFT window length (64, 128, 256, 512, 1024, 2048, 4096) and the vertical axis is the signal-to-interference ratio (SIR), a measure of separation accuracy. The solid line is the SIR obtained when method (2), converting the observation signals into the time-frequency domain and solving an instantaneous mixing problem, is carried out using the method of JP 2006-238409 A (the details of the experiment are described later). The upper graph of FIG. 3 shows the waveform-based SIR and the lower graph the frequency-bin-based SIR. FIG. 4 shows the same results with the window length on the horizontal axis expressed in seconds. Both graphs show a peak in separation accuracy at an intermediate window length (1024 for the waveform-based SIR and 512 for the frequency-bin-based SIR).
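For reference, SIR is commonly computed as the ratio of target power to interference power in decibels. The sketch below uses that standard definition; the exact procedure used in the experiment is described later in the document and may differ, and the function name and test values are assumptions. Because it takes absolute values, the same function applies to waveforms and to complex frequency-bin data.

```python
import numpy as np

def sir_db(target, interference):
    """Signal-to-interference ratio in dB: 10 log10(P_target / P_interference)."""
    p_target = np.sum(np.abs(target) ** 2)
    p_interference = np.sum(np.abs(interference) ** 2)
    return 10.0 * np.log10(p_target / p_interference)

# Interference at one tenth of the target amplitude gives a power ratio of 100.
target = np.ones(100)
interference = 0.1 * np.ones(100)
sir = sir_db(target, interference)        # 20 dB
```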

In other words, in time-frequency-domain ICA, lengthening the STFT window to handle long reverberation degrades separation performance once the window exceeds a certain length.

To summarize, both of the following independent component analysis (ICA) approaches,
(1) solving the convolutive mixture directly in the time domain (time-domain deconvolution), and
(2) converting the observation signals into the time-frequency domain and solving an instantaneous mixing problem,
provide insufficient separation accuracy for convolutive mixtures with a large number of taps.

Non-Patent Document 5 (Serviere, C., "Separation of speech signals under reverberant conditions", in Proc. EUSIPCO04, pp. 1693-1696 (2004)) discloses processing based on the assumption that "when the STFT uses a window shorter than the reverberation length, convolution still remains on the spectrogram".

Non-Patent Document 5 regards the observation signal as a convolutive mixture in the time-frequency domain and proposes a deconvolution algorithm in the time-frequency domain to solve it. This is close to "solving the convolutive mixture directly in the time-frequency domain". However, the algorithm disclosed there is limited to the two-input, two-output case, that is, two sound sources emitting the sound signals and two microphones serving as inputs. Moreover, it performs separation and deconvolution independently for each frequency bin, so the permutation problem, in which "which component is separated into which channel" differs between frequency bins, arises.

As described above, there are several prior-art disclosures of separation processing for sound signals in which a plurality of signals are mixed. However, for signal separation that uses independent component analysis (ICA) to achieve highly accurate separation of each signal, no sufficient solution has yet been presented to the following problems:
(1) handling reverberation that exceeds the window length (= the length of the analysis frame),
(2) handling the permutation problem, and
(3) handling configurations with more than two inputs and outputs.
Patent Document 1: JP 2006-238409 A
Patent Document 2: JP 2003-271168 A
JP 2002-342198 A
Non-Patent Document 1: "Introduction to Independent Component Analysis", Noboru Murata, Tokyo Denki University Press
Non-Patent Document 2: "Detailed Independent Component Analysis", Aapo Hyvarinen et al., Tokyo Denki University Press
Non-Patent Document 3: "Study on blind sound source separation by subband processing", Shoko Araki, Robert Aichner, Shoji Makino, Tsuyoki Nishikawa, Hiroshi Saruwatari, Proceedings of the Acoustical Society of Japan, March 2002, pp. 619-620
Non-Patent Document 4: "Optimization of the number of band divisions in Blind Source Separation using band-division ICA", Tsuyoki Nishikawa, Shoko Araki, Shoji Makino, Hiroshi Saruwatari, Proceedings of the Acoustical Society of Japan, March 2001, pp. 569-570
Non-Patent Document 5: Serviere, C., "Separation of speech signals under reverberant conditions", in Proc. EUSIPCO04, pp. 1693-1696 (2004)

The present invention has been made in view of these circumstances. Its object is to provide a signal separation device, a signal separation method, and a computer program that use independent component analysis (ICA) to achieve highly accurate separation, signal by signal, of a sound signal in which a plurality of signals are mixed, and in particular to provide a signal separation device, a signal separation method, and a computer program with improved separation accuracy for convolutive mixtures with a large number of taps.

A first aspect of the present invention is
a signal separation device that receives a signal in which a plurality of signals are mixed and separates it into individual signals, comprising:
signal conversion means for converting an input signal into the time-frequency domain to generate an observation signal spectrogram; and
signal separation means for generating a signal separation result from the observation signal spectrogram generated by the signal conversion means,
wherein the signal separation means
interprets the observation signal spectrogram as an observation signal subjected to convolutive mixing in the time-frequency domain and generates the signal separation result by executing processing that solves the convolutive mixture in the time-frequency domain.

Furthermore, in one embodiment of the signal separation device of the present invention, the signal conversion means performs a short-time Fourier transform (STFT) on the input signal to convert it into the time-frequency domain and generate the observation signal spectrogram.
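As an illustration of this conversion, the following is a minimal NumPy sketch of an STFT that turns multi-microphone input into observation signal spectrograms. The function names, the Hann window, and the frame length and hop size are assumptions chosen for the example; this section does not specify them. The sketch also assumes the input is at least one frame long.

```python
import numpy as np

def stft(x, frame_len=512, hop=128):
    """Return a complex spectrogram of shape (n_bins, n_frames) for a 1-D signal x."""
    window = np.hanning(frame_len)            # assumed window; not specified in the text
    n_frames = 1 + (len(x) - frame_len) // hop
    # Cut the signal into overlapping windowed frames, one frame per column
    frames = np.stack(
        [x[i * hop : i * hop + frame_len] * window for i in range(n_frames)],
        axis=1,
    )
    # Fourier transform of each frame gives one spectrogram column
    return np.fft.rfft(frames, axis=0)

def observation_spectrograms(mics, frame_len=512, hop=128):
    """mics: (n_channels, n_samples) -> X: (n_channels, n_bins, n_frames)."""
    return np.stack([stft(ch, frame_len, hop) for ch in mics])
```

The array layout (channel, frequency bin, frame) is only a convention of this sketch; any consistent layout would serve.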

Furthermore, in one embodiment of the signal separation device of the present invention, the signal separation means sets the separated signal Y(t) of frame number t as a convolutive mixture of the observation signals X(t-L') to X(t), and generates the signal separation result by processing that increases the independence of the individual signal components Y1(t) to Yn(t) contained in the separated signal Y(t).
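The relation between Y(t) and X(t-L') to X(t) can be sketched as a sum over frame lags of a separation matrix applied to the delayed observation. The code below is a hedged illustration only: the filter array W, its layout, and the zero treatment of frames before the start are assumptions, and the learning of W (the independence maximization) is not shown.

```python
import numpy as np

def separate_convolutive(X, W):
    """
    Apply time-frequency-domain separation filters.

    X: (n_ch, n_bins, n_frames) observation spectrograms.
    W: (L' + 1, n_bins, n_ch, n_ch); W[tau, w] is the separation matrix for
       frame lag tau at frequency bin w (hypothetical layout).
    Returns Y: (n_ch, n_bins, n_frames) with
       Y(w, t) = sum over tau of W[tau, w] @ X(w, t - tau).
    """
    n_taps = W.shape[0]
    n_ch, n_bins, n_frames = X.shape
    Y = np.zeros_like(X, dtype=complex)
    for tau in range(n_taps):
        # Observation delayed by tau frames; frames before the start are zero
        X_shift = np.zeros_like(X, dtype=complex)
        X_shift[:, :, tau:] = X[:, :, : n_frames - tau]
        # Per-bin matrix product over the channel axis
        Y += np.einsum("wkc,cwt->kwt", W[tau], X_shift)
    return Y
```

With only the lag-0 matrices non-zero this reduces to ordinary instantaneous (per-bin) separation, which is why the number of frame taps L' controls how much delay the separation can account for.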

Furthermore, in one embodiment of the signal separation device of the present invention, the signal separation means applies the Kullback-Leibler information I(Y), a measure of independence, as the processing that increases the independence of the individual signal components Y1(t) to Yn(t) contained in the separated signal Y(t), and generates the signal separation result by updating a separation matrix so as to minimize the Kullback-Leibler information I(Y).
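For reference, the Kullback-Leibler information between the joint density of the separated signals and the product of their marginal densities is usually written as follows. The notation H for entropy and p for probability density is an assumption of this note; the section itself only names the quantity I(Y).

```latex
I(Y) \;=\; \mathrm{KL}\!\left( p(Y) \,\middle\|\, \prod_{k=1}^{n} p(Y_k) \right)
      \;=\; \sum_{k=1}^{n} H(Y_k) \;-\; H(Y)
```

I(Y) is non-negative and equals zero exactly when Y1 to Yn are statistically independent, which is why minimizing it over the separation matrix increases independence.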

Furthermore, in one embodiment of the signal separation device of the present invention, the signal separation means generates a first signal separation result by applying instantaneous-mixture ICA (independent component analysis) to the observation signal spectrogram, executes unnecessary-channel removal processing that removes channels determined from the first signal separation result not to correspond to any sound source, and generates the signal separation result by executing processing that solves the convolutive mixture in the time-frequency domain on the observation signal spectrograms remaining after the removal.

Furthermore, in one embodiment of the signal separation device of the present invention, the processing that applies the instantaneous-mixture ICA generates a time-frequency-domain separated signal from the time-frequency-domain observation signal and a separation matrix, corrects the separation matrix until the generated separated signal and the separation matrix, which is computed using a multidimensional score function derived from a multidimensional probability density function, have substantially converged, and generates the time-frequency-domain separated signal by applying the corrected separation matrix.
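The iteration described here can be sketched with a natural-gradient update. The following is a rough illustration under stated assumptions, not the procedure of the patent: the score function phi(Y_k) = -Y_k / ||Y_k|| (norm taken over all frequency bins) is one common choice derived from a spherical multivariate density, and the step size eta and the fixed iteration count (in place of a convergence test) are hypothetical.

```python
import numpy as np

def instantaneous_ica(X, n_iter=100, eta=0.1, eps=1e-9):
    """
    Instantaneous-mixture ICA in the time-frequency domain (sketch).

    X: (n_ch, n_bins, n_frames) observation spectrograms.
    Returns (Y, W), where W has shape (n_bins, n_ch, n_ch).
    """
    n_ch, n_bins, n_frames = X.shape
    W = np.tile(np.eye(n_ch, dtype=complex), (n_bins, 1, 1))
    identity = np.eye(n_ch)
    for _ in range(n_iter):
        # Separated signals: Y(w, t) = W(w) X(w, t)
        Y = np.einsum("wkc,cwt->kwt", W, X)
        # Multidimensional score: each channel is normalised by its norm
        # taken across all frequency bins (assumed form)
        norm = np.sqrt(np.sum(np.abs(Y) ** 2, axis=1, keepdims=True)) + eps
        phi = -Y / norm
        # Time average of phi(Y) Y^H for every frequency bin
        C = np.einsum("kwt,lwt->wkl", phi, Y.conj()) / n_frames
        # Natural-gradient correction of the separation matrix
        W = W + eta * np.einsum("wkl,wlc->wkc", identity + C, W)
    Y = np.einsum("wkc,cwt->kwt", W, X)
    return Y, W
```

In practice the loop would stop once the change in W falls below a threshold, matching the "substantially converged" condition in the text.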

Furthermore, a second aspect of the present invention is
a signal separation device that receives a signal in which a plurality of signals are mixed and separates it into individual signals, comprising:
first signal conversion means for converting the input signal into the time-frequency domain to generate an observation signal spectrogram;
second signal conversion means for performing data conversion on the observation signal spectrogram generated by the first signal conversion means to generate a modulation spectrogram; and
signal separation means for generating a signal separation result from the modulation spectrogram generated by the second signal conversion means,
wherein the signal separation means
interprets the modulation spectrogram as an instantaneous mixture and generates the signal separation result.

Furthermore, in one embodiment of the signal separation device of the present invention, the first signal conversion means performs a short-time Fourier transform (STFT) on the input signal to convert it into the time-frequency domain and generate the observation signal spectrogram.

Furthermore, in one embodiment of the signal separation device of the present invention, the second signal conversion means generates the modulation spectrogram as the result of performing a short-time Fourier transform (STFT) in the time direction on the observation signal spectrogram, and the signal separation means generates the signal separation result by processing that increases the independence of the signal components Y1' to Yn' that are contained in the modulation spectrogram and correspond to the separated signals.
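A modulation spectrogram of this kind can be sketched as a second STFT taken along the frame axis of each frequency bin. The window type, window length, and hop below are hypothetical values for illustration, as is the function name.

```python
import numpy as np

def modulation_spectrogram(X, win_len=16, hop=4):
    """
    Second STFT along the time (frame) direction of a spectrogram.

    X: (n_bins, n_frames) complex observation spectrogram of one channel.
    Returns M: (n_bins, win_len, n_segments), where axis 1 indexes the
    modulation frequency and axis 2 the segment position.
    """
    n_bins, n_frames = X.shape
    window = np.hanning(win_len)              # assumed window
    n_seg = 1 + (n_frames - win_len) // hop
    # Windowed segments of consecutive frames, taken for every frequency bin
    segs = np.stack(
        [X[:, s * hop : s * hop + win_len] * window for s in range(n_seg)],
        axis=2,
    )
    # Full FFT, because the spectrogram is complex-valued
    return np.fft.fft(segs, axis=1)
```

Because a delay of several frames in the spectrogram becomes a phase factor after this second transform, a mixture that is convolutive over frames can be treated approximately as an instantaneous mixture in the modulation domain.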

Furthermore, in one embodiment of the signal separation device of the present invention, the signal separation means applies the Kullback-Leibler information, a measure of independence, as the processing that increases the independence of the signal components Y1' to Yn' corresponding to the separated signals, and generates the signal separation result by updating a separation matrix so as to minimize the Kullback-Leibler information.

Furthermore, in one embodiment of the signal separation device of the present invention, the signal separation device further comprises inverse Fourier transform means for performing an inverse Fourier transform on each of the signal components Y1' to Yn' corresponding to the separated signals obtained by the signal separation means, to generate spectrograms Y1 to Yn corresponding to the separated signals.

Furthermore, in one embodiment of the signal separation device of the present invention, the signal separation device further comprises unnecessary-channel removal means for generating a first signal separation result by applying instantaneous-mixture ICA (independent component analysis) to the observation signal spectrogram generated by the first signal conversion means, and for executing unnecessary-channel removal processing that removes channels determined from the first signal separation result not to correspond to any sound source, wherein the second signal conversion means and the signal separation means process only the signals remaining after the unnecessary-channel removal to generate the signal separation result.

Furthermore, a third aspect of the present invention is
a signal separation device that receives a signal in which a plurality of signals are mixed and separates it into individual signals, comprising:
signal conversion means for converting the input signal into the time-frequency domain to generate an observation signal spectrogram; and
signal separation means for generating a signal separation result from the observation signal spectrogram generated by the signal conversion means,
wherein the signal separation means
shifts the observation signal spectrogram in the frame direction to generate an observation signal spectrogram shift set in which data having mutually different shift amounts are stacked, and generates the signal separation result by applying instantaneous-mixture ICA (independent component analysis) to the generated observation signal spectrogram shift set.

Furthermore, in one embodiment of the signal separation device of the present invention, the signal separation means generates the signal separation result by applying instantaneous-mixture ICA to a multi-channel observation signal spectrogram shift set obtained by stacking the plurality of observation signal spectrogram shift sets generated for the respective observation signals of a plurality of signal input sources.

Furthermore, in one embodiment of the signal separation device of the present invention, the signal separation means generates the observation signal spectrogram shift set by filling the gaps produced by the shift with zero or a value close to zero, or with copies of the values at the two ends of the observation signal spectrogram.

Furthermore, in one embodiment of the signal separation device of the present invention, the signal separation means performs the shift as a cyclic shift, in which the data pushed out of one end by the shift is copied to the other end.

Furthermore, in one embodiment of the signal separation device of the present invention, the signal separation means generates a plurality of shifted data with a minimum shift amount of 0 and a maximum shift amount equal to the number of frame taps L' used when generating the separation result from the observation signal, and generates the observation signal spectrogram shift set by stacking the generated data having different shift amounts.
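The construction of a shift set, including the three ways of handling the frames vacated by the shift, can be sketched as follows. The function names and the `fill` parameter are hypothetical, and the "edge" mode here copies only the first frame because this sketch shifts in one direction only.

```python
import numpy as np

def shift_set(X, max_shift, fill="zero"):
    """
    Stack copies of a spectrogram shifted by 0 .. max_shift frames.

    X: (n_bins, n_frames) spectrogram of one channel.
    fill: "zero" (gap set to 0), "edge" (gap set to the first frame),
          or "cyclic" (frames pushed out at one end re-enter at the other).
    Returns an array of shape (max_shift + 1, n_bins, n_frames).
    """
    n_bins, n_frames = X.shape
    out = []
    for s in range(max_shift + 1):
        if fill == "cyclic":
            shifted = np.roll(X, s, axis=1)
        else:
            shifted = np.zeros_like(X)
            shifted[:, s:] = X[:, : n_frames - s]
            if fill == "edge" and s > 0:
                shifted[:, :s] = X[:, :1]     # copy the end value into the gap
        out.append(shifted)
    return np.stack(out)

def multichannel_shift_set(X_all, max_shift, fill="zero"):
    """X_all: (n_ch, n_bins, n_frames) -> (n_ch * (max_shift + 1), n_bins, n_frames)."""
    return np.concatenate([shift_set(X, max_shift, fill) for X in X_all])
```

Treating every shifted copy as an additional observation channel is what lets instantaneous-mixture ICA account for delays of up to `max_shift` frames (the frame tap count L' in the text).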

Furthermore, in one embodiment of the signal separation device of the present invention, the signal separation means changes the number of frame taps L' according to frequency when generating the observation signal spectrogram shift set.
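A frequency-dependent tap count means the shift set has a different number of rows for each frequency bin. The sketch below shows one way to hold such a ragged structure, using zero fill for the gaps. The function name and the `taps_per_bin` argument are hypothetical, and how L' should be chosen for each frequency is not stated in this section.

```python
import numpy as np

def shift_sets_per_bin(X, taps_per_bin):
    """
    Build a shift set whose depth varies with frequency.

    X: (n_bins, n_frames) spectrogram of one channel.
    taps_per_bin: sequence giving the frame tap count L' for each bin.
    Returns a list; element w has shape (taps_per_bin[w] + 1, n_frames).
    """
    n_bins, n_frames = X.shape
    sets = []
    for w in range(n_bins):
        rows = np.zeros((taps_per_bin[w] + 1, n_frames), dtype=X.dtype)
        for s in range(taps_per_bin[w] + 1):
            # Row s holds bin w delayed by s frames, with a zero-filled gap
            rows[s, s:] = X[w, : n_frames - s]
        sets.append(rows)
    return sets
```

Each list element can then be passed to a per-bin instantaneous-mixture ICA whose separation matrix size matches that bin's row count.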

Furthermore, in one embodiment of the signal separation device of the present invention, the signal separation means generates a first signal separation result by applying instantaneous-mixture ICA (independent component analysis) to the observation signal spectrogram, executes unnecessary-channel removal processing that removes channels determined from the first signal separation result not to correspond to any sound source, shifts the observation signal spectrograms remaining after the removal in the frame direction to generate an observation signal spectrogram shift set, and generates the signal separation result by applying instantaneous-mixture ICA to the generated observation signal spectrogram shift set.

Furthermore, a fourth aspect of the present invention is
a signal separation device that receives a signal in which a plurality of signals are mixed and separates it into individual signals, comprising:
signal conversion means for converting the input signal into the time-frequency domain to generate an observation signal spectrogram; and
signal separation means for generating a signal separation result from the observation signal spectrogram generated by the signal conversion means,
wherein the signal separation means
generates signal separation results Y1 to Yn by applying instantaneous-mixture ICA (independent component analysis) to the observation signal spectrogram,
shifts the signal spectrogram corresponding to each of the signal separation results Y1 to Yn in the frame direction to generate an observation signal spectrogram shift set in which data having mutually different shift amounts are stacked, executes dereverberation processing by applying instantaneous-mixture ICA to the generated observation signal spectrogram shift set, and generates a dereverberated signal separation result by integrating the dereverberated spectrograms.

Furthermore, a fifth aspect of the present invention is
a signal separation method executed in a signal separation device to receive a signal in which a plurality of signals are mixed and separate it into individual signals, comprising:
a signal conversion step in which signal conversion means converts the input signal into the time-frequency domain to generate an observation signal spectrogram; and
a signal separation step in which signal separation means generates a signal separation result from the observation signal spectrogram generated in the signal conversion step,
wherein the signal separation step
is a step of interpreting the observation signal spectrogram as an observation signal subjected to convolutive mixing in the time-frequency domain, and generating the signal separation result by executing processing that solves the convolutive mixture in the time-frequency domain.

Furthermore, in one embodiment of the signal separation method of the present invention, the signal conversion step is a step of performing a short-time Fourier transform (STFT) on the input signal to convert it into the time-frequency domain and generate the observation signal spectrogram.

Furthermore, in one embodiment of the signal separation method of the present invention, the signal separation step is a step of setting the separated signal Y(t) of frame number t as a convolutive mixture of the observation signals X(t-L') to X(t), and generating the signal separation result by processing that increases the independence of the individual signal components Y1(t) to Yn(t) contained in the separated signal Y(t).

Furthermore, in one embodiment of the signal separation method of the present invention, the signal separation step applies the Kullback-Leibler information I(Y), a measure of independence, as the processing that increases the independence of the individual signal components Y1(t) to Yn(t) contained in the separated signal Y(t), and generates the signal separation result by updating a separation matrix so as to minimize the Kullback-Leibler information I(Y).

Furthermore, in one embodiment of the signal separation method of the present invention, the signal separation step is a step of generating a first signal separation result by applying instantaneous-mixture ICA (independent component analysis) to the observation signal spectrogram, executing unnecessary-channel removal processing that removes channels determined from the first signal separation result not to correspond to any sound source, and generating the signal separation result by executing processing that solves the convolutive mixture in the time-frequency domain on the observation signal spectrograms remaining after the removal.

Furthermore, in one embodiment of the signal separation method of the present invention, the processing that applies the instantaneous-mixture ICA generates a time-frequency-domain separated signal from the time-frequency-domain observation signal and a separation matrix, corrects the separation matrix until the generated separated signal and the separation matrix, which is computed using a multidimensional score function derived from a multidimensional probability density function, have substantially converged, and generates the time-frequency-domain separated signal by applying the corrected separation matrix.

Furthermore, a sixth aspect of the present invention is
a signal separation method executed in a signal separation device to receive a signal in which a plurality of signals are mixed and separate it into individual signals, comprising:
a first signal conversion step in which first signal conversion means converts the input signal into the time-frequency domain to generate an observation signal spectrogram;
a second signal conversion step in which second signal conversion means performs data conversion on the observation signal spectrogram generated in the first signal conversion step to generate a modulation spectrogram; and
a signal separation step in which signal separation means generates a signal separation result from the modulation spectrogram generated in the second signal conversion step,
wherein the signal separation step
is a step of interpreting the modulation spectrogram as an instantaneous mixture and generating the signal separation result.

Furthermore, in one embodiment of the signal separation method of the present invention, the first signal conversion step is a step of performing a short-time Fourier transform (STFT) on the input signal to convert it into the time-frequency domain and generate the observation signal spectrogram.

Furthermore, in one embodiment of the signal separation method of the present invention, the second signal conversion step is a step of generating the modulation spectrogram as the result of performing a short-time Fourier transform (STFT) in the time direction on the observation signal spectrogram, and the signal separation step generates the signal separation result by processing that increases the independence of the signal components Y1' to Yn' that are contained in the modulation spectrogram and correspond to the separated signals.

Furthermore, in one embodiment of the signal separation method of the present invention, the signal separation step applies the Kullback-Leibler information, a measure of independence, as the processing that increases the independence of the signal components Y1' to Yn' corresponding to the separated signals, and generates the signal separation result by updating a separation matrix so as to minimize the Kullback-Leibler information.

Furthermore, in one embodiment of the signal separation method of the present invention, the signal separation method further comprises an inverse Fourier transform step in which inverse Fourier transform means performs an inverse Fourier transform on each of the signal components Y1' to Yn' corresponding to the separated signals obtained in the signal separation step, to generate spectrograms Y1 to Yn corresponding to the separated signals.

Furthermore, in one embodiment of the signal separation method of the present invention, the signal separation method further comprises an unnecessary-channel removal step in which unnecessary-channel removal means generates a first signal separation result by applying instantaneous-mixture ICA (independent component analysis) to the observation signal spectrogram generated by the first signal conversion means, and executes unnecessary-channel removal processing that removes channels determined from the first signal separation result not to correspond to any sound source, wherein the second signal conversion means and the signal separation means process only the signals remaining after the unnecessary-channel removal to generate the signal separation result.

Furthermore, a seventh aspect of the present invention is
a signal separation method executed in a signal separation device to receive a signal in which a plurality of signals are mixed and separate it into individual signals, comprising:
a signal conversion step in which signal conversion means converts the input signal into the time-frequency domain to generate an observation signal spectrogram; and
a signal separation step in which signal separation means generates a signal separation result from the observation signal spectrogram generated in the signal conversion step,
wherein the signal separation step
is a step of shifting the observation signal spectrogram in the frame direction to generate an observation signal spectrogram shift set in which data having mutually different shift amounts are stacked, and generating the signal separation result by applying instantaneous-mixture ICA (independent component analysis) to the generated observation signal spectrogram shift set.

Furthermore, in one embodiment of the signal separation method of the present invention, the signal separation step generates the signal separation result by applying instantaneous-mixture ICA to a multi-channel observation signal spectrogram shift set obtained by stacking the plurality of observation signal spectrogram shift sets generated for the respective observation signals of a plurality of signal input sources.

Furthermore, in one embodiment of the signal separation method of the present invention, the signal separation step generates the observation signal spectrogram shift set by filling the gaps produced by the shift with zero or a value close to zero, or with copies of the values at the two ends of the observation signal spectrogram.

Furthermore, in one embodiment of the signal separation method of the present invention, the signal separation step performs the shift as a cyclic shift, in which the data pushed out of one end by the shift is copied to the other end.

Furthermore, in one embodiment of the signal separation method of the present invention, the signal separation step generates a plurality of shifted data with a minimum shift amount of 0 and a maximum shift amount equal to the number of frame taps L' used when generating the separation result from the observation signal, and generates the observation signal spectrogram shift set by stacking the generated data having different shift amounts.

Furthermore, in one embodiment of the signal separation method of the present invention, the signal separation step changes the number of frame taps L' according to frequency when generating the observation signal spectrogram shift set.

Furthermore, in one embodiment of the signal separation method of the present invention, the signal separation step is a step of generating a first signal separation result by applying instantaneous-mixture ICA (independent component analysis) to the observation signal spectrogram, executing unnecessary-channel removal processing that removes channels determined from the first signal separation result not to correspond to any sound source, shifting the observation signal spectrograms remaining after the removal in the frame direction to generate an observation signal spectrogram shift set, and generating the signal separation result by applying instantaneous-mixture ICA to the generated observation signal spectrogram shift set.

Furthermore, an eighth aspect of the present invention is
a signal separation method executed in a signal separation device to receive a signal in which a plurality of signals are mixed and separate it into individual signals, comprising:
a signal conversion step in which signal conversion means converts the input signal into the time-frequency domain to generate an observation signal spectrogram; and
a signal separation step in which signal separation means generates a signal separation result from the observation signal spectrogram generated in the signal conversion step,
wherein the signal separation step
generates signal separation results Y1 to Yn by applying instantaneous-mixture ICA (independent component analysis) to the observation signal spectrogram,
shifts the signal spectrogram corresponding to each of the signal separation results Y1 to Yn in the frame direction to generate an observation signal spectrogram shift set in which data having mutually different shift amounts are stacked, executes dereverberation processing by applying instantaneous-mixture ICA to the generated observation signal spectrogram shift set, and generates a dereverberated signal separation result by integrating the dereverberated spectrograms.

Furthermore, a ninth aspect of the present invention is
a computer program that causes a signal separation device to execute signal separation processing for receiving a signal in which a plurality of signals are mixed and separating it into individual signals, comprising:
a signal conversion step of causing signal conversion means to convert the input signal into the time-frequency domain and generate an observation signal spectrogram; and
a signal separation step of causing signal separation means to generate a signal separation result from the observation signal spectrogram generated in the signal conversion step,
wherein the signal separation step
is a step of interpreting the observation signal spectrogram as an observation signal subjected to convolutive mixing in the time-frequency domain, and causing the signal separation result to be generated by executing processing that solves the convolutive mixture in the time-frequency domain.

Furthermore, a tenth aspect of the present invention is a computer program that causes a signal separation device to execute signal separation processing in which a signal containing a mixture of a plurality of signals is input and separated into individual signals, the program comprising:
a first signal conversion step of causing first signal conversion means to convert the input signal into the time-frequency domain and generate an observed signal spectrogram;
a second signal conversion step of causing second signal conversion means to perform data conversion on the observed signal spectrogram generated in the first signal conversion step and generate a modulation spectrogram; and
a signal separation step of causing signal separation means to generate a signal separation result from the modulation spectrogram generated in the second signal conversion step,
wherein the signal separation step interprets the modulation spectrogram as an instantaneous mixture and generates the signal separation result.

Furthermore, an eleventh aspect of the present invention is a computer program that causes a signal separation device to execute signal separation processing in which a signal containing a mixture of a plurality of signals is input and separated into individual signals, the program comprising:
a signal conversion step of causing signal conversion means to convert the input signal into the time-frequency domain and generate an observed signal spectrogram; and
a signal separation step of causing signal separation means to generate a signal separation result from the observed signal spectrogram generated in the signal conversion step,
wherein the signal separation step shifts the observed signal spectrogram in the frame direction to generate an observed signal spectrogram shift set in which data with different shift amounts are stacked, and generates the signal separation result by processing that applies instantaneous-mixture ICA (Independent Component Analysis) to the generated shift set.

The computer program of the present invention can be provided to a computer system capable of executing various program codes by way of a storage medium or communication medium that supplies it in a computer-readable format, for example a recording medium such as a CD, FD, or MO, or a communication medium such as a network. Providing the program in a computer-readable format allows processing corresponding to the program to be carried out on the computer system.

Other objects, features, and advantages of the present invention will become apparent from the more detailed description based on the embodiments described later and the accompanying drawings. In this specification, a system is a logical collection of a plurality of devices, and the constituent devices are not necessarily housed in the same casing.

According to the configuration of one embodiment of the present invention, an input signal containing a mixture of a plurality of signals is converted into the time-frequency domain to generate an observed signal spectrogram, and a signal separation result is generated from that spectrogram. In this signal separation processing, either the observed signal spectrogram is interpreted as an observed signal convolutively mixed in the time-frequency domain and the separation result is generated by solving that convolutive mixture, or a modulation spectrogram is generated by applying a short-time Fourier transform (STFT) to the observed signal spectrogram in the time direction and the separation result is generated by interpreting the modulation spectrogram as an instantaneous mixture. As a result, highly accurate separation that takes the delay amounts into account is achieved for mixed sound signals with various delays, such as direct and reflected waves.

The signal separation device, signal separation method, and computer program of the present invention are described in detail below with reference to the drawings. As stated above, the present invention performs signal separation processing that separates and restores the original signals by analyzing a mixed signal obtained from a mixture of a plurality of original signals, and it does so using independent component analysis (ICA).

Specifically, as shown in FIG. 5, N sound sources 111-1 to 111-N emit different sounds, which are observed by n microphones 121-1 to 121-n. Signal separation by independent component analysis (ICA) is performed on the mixed signals acquired by the microphones 121-1 to 121-n.

As explained earlier, the signal observed by a given microphone j (1 ≤ j ≤ n), the observed signal, can be written as in equation [1.1]: the convolution of each original signal with its transfer function, summed over all sound sources (a "convolutive mixture"). Writing the observed signals of all microphones 1 to n in a single expression gives equation [1.2]. Two approaches to solving this convolutive mixture have been used:
(1) Solve the convolutive mixture directly in the time domain (time-domain deconvolution).
(2) Convert the observed signal into the time-frequency domain and solve it as an instantaneous mixture problem.
Approach (2) rests on the premise, adopted by the conventional time-frequency-domain ICA framework, that a convolutive mixture in the time domain becomes an instantaneous mixture in the time-frequency domain. The present invention instead treats the mixture as still convolutive in the time-frequency domain. This concept is explained with reference to FIG. 6.

FIG. 6(a) shows the spectrograms of the original signals, that is, the signals output by the sound sources 111-1 to 111-N in FIG. 5, stacked vertically. The spectrograms of the individual sources are denoted S_1 and S_2, and the vertical stack of the two is denoted S. As described earlier, a spectrogram plots the absolute value |Xk(ω, t)| of Xk(ω, t) as shading, with the frame number t on the horizontal axis and the frequency bin number ω on the vertical axis.

In the spectrogram of the original signals in FIG. 6(a), the signal of the t-th frame expressed as a vector is denoted S(t). One frame of a spectrogram is called a spectrum.

Conventionally, S(t) was assumed to reach the microphones with no frame delay; the present invention assumes that frame delays exist. In terms of FIG. 5, a vector called a spectrum is generated independently at each of the sound sources 111-1 to 111-N, and these vectors reach the microphones 121-1 to 121-n, which act as sensors, with delays of zero or more frames. The arriving signals include both direct and reflected waves.

The microphones therefore pick up a variety of signals: direct waves from different sources, direct and reflected waves, and both simple and complex reflections. These signals can be expected to carry a range of delays. If the maximum delay is taken to be L+1, the spectrum S(t), which is the vector representation of the t-th frame of the original-signal spectrogram in FIG. 6(a), influences the t-th through (t+L)-th frames of the observed signal.

FIG. 6(b) shows the spectrogram X of the observed signals, generated by applying the short-time Fourier transform (STFT) to the observed signals acquired by the microphones 121-1 to 121-n.

The short-time Fourier transform (STFT) is explained with reference to FIG. 7. FIG. 7(a) shows an observed signal x_k recorded by the k-th microphone in an environment such as that of FIG. 5. Segments of fixed length, such as frames 171 to 173, are cut out of the observed signal x_k, and a window function such as a Hanning window or sine window is applied to each. Each cut-out segment is called a frame. The frame length (number of sample points) may be the same as the length that gives the most accurate separation in conventional time-frequency-domain ICA (around 512 or 1024 points according to FIG. 3).
Applying the discrete Fourier transform (DFT, the Fourier transform over a finite interval) or the fast Fourier transform (FFT) to one frame of data yields the spectrum Xk(t), which is frequency-domain data (t is the frame number).

Consecutive frames may overlap, as frames 171 to 173 in the figure do; the overlap makes the spectra Xk(t−1) to Xk(t+1) of consecutive frames vary smoothly. Spectra arranged in order of frame number are called a spectrogram. FIG. 7(b) is an example of a spectrogram.

When the frames cut out in the short-time Fourier transform (STFT) overlap, the inverse-transform results (waveforms) of the individual frames are likewise superimposed with overlap in the inverse Fourier transform. This is called overlap-add. A window function such as a sine window may be applied again to each inverse-transform result before the overlap-add; this is called weighted overlap-add (WOLA). WOLA reduces noise caused by discontinuities between frames.
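The framing, windowing, and weighted overlap-add steps described above can be sketched as follows. This is a minimal illustration rather than the implementation used in the patent: the function names `sine_window`, `stft`, and `istft_wola`, the window length of 512, and the shift of 256 (50% overlap) are assumptions chosen so that the squared sine window sums to one across overlapping frames.

```python
import numpy as np

def sine_window(n):
    # Sine window; with a shift of n/2 its square sums to 1 across frames.
    return np.sin(np.pi * (np.arange(n) + 0.5) / n)

def stft(x, win_len=512, shift=256):
    """Cut x into overlapping frames, window each, and apply the FFT.
    Returns a spectrogram of shape (frequency bins, frames)."""
    w = sine_window(win_len)
    n_frames = 1 + (len(x) - win_len) // shift
    frames = np.stack([x[t * shift : t * shift + win_len] * w
                       for t in range(n_frames)])
    return np.fft.rfft(frames, axis=1).T

def istft_wola(X, win_len=512, shift=256):
    """Inverse-transform each frame, window it again, and overlap-add (WOLA)."""
    w = sine_window(win_len)
    n_frames = X.shape[1]
    frames = np.fft.irfft(X.T, n=win_len, axis=1)
    out = np.zeros(win_len + shift * (n_frames - 1))
    for t in range(n_frames):
        out[t * shift : t * shift + win_len] += frames[t] * w
    return out
```

With these settings the interior of the signal (everything except the first and last half-window, which are covered by only one frame) is recovered by `istft_wola(stft(x))` up to numerical precision.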

FIG. 6(b) shows the spectrograms of the observed signals obtained by the processing described with reference to FIG. 7, stacked vertically. The spectrograms of the individual sensors (microphones) are denoted X_1 and X_2, and the vertical stack of the two is denoted X. The L+1 frames X(t) to X(t+L) of the observed-signal spectrogram X are influenced by the original-signal spectrum S(t). Conversely, the observed signal X(t) of the t-th frame in FIG. 6(b) is influenced by the L+1 most recent frames of the original signal.

Because the observed signal X(t) of the t-th frame is influenced by the L+1 most recent frames of the original signal, X(t) can be expressed as a convolutive mixture, as in equation [6.1].
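As a sketch of what such a convolutive mixture looks like, one form consistent with the description (X(t) depends on the original-signal frames S(t−L) through S(t), and the case L = 0 is an instantaneous mixture) is given below. The symbol A^{[l]} for the mixing matrix at frame tap l is an assumption, chosen to match the A that appears in the discussion of equation [6.4].

```latex
X(t) = \sum_{l=0}^{L} A^{[l]} \, S(t-l)
```

With L = 0 this reduces to X(t) = A^{[0]} S(t), in which each observed frame depends only on the original-signal frame with the same index.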

Equation [6.1] resembles equation [1.2] described earlier, but note that equation [6.1] is an expression in the time-frequency domain. When L = 0, that is, when the observed signal X(t) is influenced only by the original-signal spectrum S(t), it is equivalent to the conventional instantaneous mixture.
To distinguish the two convolutions, L in equation [1.2] is called the "number of time taps" and L in equation [6.1] is called the "number of frame taps".

Equation [6.1] holds exactly when the frame shift width in the STFT is 1, and holds approximately when the shift width is 2 or more. For details, see the following paper by the inventor:
[Hiroe, A.: Blind Vector Deconvolution: Convolutive Mixture Models in Short-Time Fourier Transform Domain. In: M. E. Davies et al. (Eds.): ICA 2007, LNCS 4666, pp. 471-479, 2007.]

When the reverberation time is longer than the STFT window length, the effect of reverberation is not confined to a single frame but extends over multiple frames. Reverberation spanning multiple frames can be expressed as a convolution in the time-frequency domain, so the "convolutive mixture in the time-frequency domain" introduced by the present invention makes it possible to remove reverberation that exceeds the STFT window length.

Take as an example the graph in FIG. 3 described earlier, which plots the separation accuracy of time-frequency-domain ICA against the STFT window length. Instead of a long window (2048, 4096, and so on), a shorter window (512 or 1024) can be combined with multiple frame taps (16 or 32). This avoids the trade-off associated with long windows while securing a time span equivalent to that of a long window (the time calculated from the number of time taps, the shift width in the frame direction, and the number of frame taps).

In addition, far fewer taps are needed than in time-domain deconvolution (on the order of a few tens), so the problems of time-domain deconvolution are also avoided. In what follows, the number of frame taps involved when the observed signal is generated from the original signal is denoted L, and the number of frame taps used when the separation result is generated from the observed signal is denoted L'.
L is determined by the reverberation time of the environment and by the window length and shift width of the STFT. L' can be set to a value different from L. (Setting L' = 0 is equivalent to the conventional method.)

The number of frame taps L of the observed signal can be calculated as
L = Tr × Fs / S
where
Tr: reverberation time of the environment (seconds)
Fs: sampling frequency (Hz)
S: STFT shift width (samples)

For example, with reverberation time Tr = 0.3 seconds, sampling frequency Fs = 16000 Hz, and shift width S = 256, the number of frame taps when the observed signal is generated from the original signal is
L = 0.3 × 16000 / 256 = 18.75.
That is, the effect of reverberation extends over 19 frames (rounding up).
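The calculation above is simple enough to check directly. The following sketch evaluates L = Tr × Fs / S for the example values; the function name `frame_taps` and its parameter names are illustrative and do not come from the source.

```python
import math

def frame_taps(reverb_time_s, sample_rate_hz, shift_samples):
    """Number of frame taps L = Tr * Fs / S (may be fractional)."""
    return reverb_time_s * sample_rate_hz / shift_samples

L = frame_taps(0.3, 16000, 256)   # about 18.75 frames
frames_affected = math.ceil(L)    # round up to whole frames
print(L, frames_affected)
```

Rounding up reflects the statement that a partially affected frame still counts as affected by the reverberation.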

For the number of frame taps L' used to generate the separation result Y from the observed signal X, that is, the separation result Y of FIG. 6(c) from the observed signal X of FIG. 6(b), if L is known (that is, if the reverberation time is known), one may set
L' = αL
where α is a suitable positive real number.
If L is unknown, L' can be determined by, for example, one of the following methods.

The first method is to fix L' at a constant value such as L' = 64 or L' = 100. Because the computational cost basically grows with L', L' may be chosen by balancing computational cost against separation performance.
The second method is to measure the reverberation time in some way and set L' to a constant multiple of the value of L obtained from that reverberation time by the formula above, that is, L' = αL. The reverberation time can be measured, for example, by emitting an impulsive sound from a loudspeaker built into the device and measuring the time until the sound has decayed sufficiently.
The third method is to separate observed signals generated from known original signals under various values of L' and adopt the value of L' that gives the best separation. To do this, for example, several loudspeakers are placed around the device, a known sound is played from each, and the sounds are observed with multiple microphones. Separation results are then generated from the observations for different values of L' (for example, each value from L' = 0 to 100). A separation performance measure called SIR (signal-to-interference ratio), described later, is calculated from the separation results and the original signals, and the L' giving the highest SIR is adopted. As long as the environment is the same, that L' is likely to give the best separated signals even when the original signals are unknown.

The number of frame taps L' used to generate the separation result Y from the observed signal X (for example, the separation result Y of FIG. 6(c) from the observed signal X of FIG. 6(b)) is determined by, for example, one of the above methods, and the separation result is generated from multiple consecutive frames of the observed signal using this L'.

Observed signals that are convolutively mixed in the time-frequency domain can be separated by, for example, any of the following methods:
(1) Solve the convolutive mixture directly in the time-frequency domain.
(2) Apply the short-time Fourier transform (STFT) to the spectrogram once more in the time direction and solve the result as an instantaneous mixture.
(3) Solve by combining shift-stacking with instantaneous-mixture ICA.
Each method is described below.

Method (3), which combines shift-stacking with instantaneous-mixture ICA, achieves separation equivalent to method (1), solving the convolutive mixture directly in the time-frequency domain. It stacks shifted copies of the observed signal spectrogram and then applies conventional instantaneous-mixture ICA in the time-frequency domain to the result. Details are given later.
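To make the idea of shift-stacking concrete, here is a minimal sketch of stacking frame-shifted copies of an observed spectrogram. The exact layout of the shift set in the patent is described later in the document, so everything here is an assumption made for illustration: the function name `shift_stack`, the (channel, bin, frame) array layout, the ordering of the shifted copies, and the choice to drop the first L' frames so that every stacked column is fully populated.

```python
import numpy as np

def shift_stack(X, n_taps):
    """Stack copies of X shifted by 0..n_taps frames along the channel axis.

    X      : complex array of shape (channels, bins, frames)
    n_taps : number of frame taps L'
    Returns an array of shape (channels * (n_taps + 1), bins, frames - n_taps)
    whose column for frame t holds X(t), X(t-1), ..., X(t-n_taps).
    """
    n_frames = X.shape[2]
    T = n_frames - n_taps
    shifted = [X[:, :, n_taps - l : n_taps - l + T] for l in range(n_taps + 1)]
    return np.concatenate(shifted, axis=0)
```

An instantaneous-mixture ICA applied to the stacked array then sees each delayed copy as an additional input channel, which is what lets a per-frame matrix act like a filter across frames.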

(1) Solving the convolutive mixture directly in the time-frequency domain
First, the processing that separates observed signals convolutively mixed in the time-frequency domain by solving the convolutive mixture directly in that domain is described.

Refer again to FIG. 6. As noted above, the t-th frame S(t) of the original-signal spectrogram affects the t-th through (t+L)-th frames of the observed signal. Estimating one frame of the original signal therefore requires L or more frames of the observed signal. Let this number be L'.

Taking the t-th frame of the separated signal as the reference, for example Y(t) in the separated signal of FIG. 6(c), estimating S(t) requires at least the following L+1 frames of data. The estimate of the original signal (that is, the separation result) Y(t) is therefore expressed as a convolutive mixture of the observed signals X(t) through X(t+L'), as in equation [6.3].

Taking the (t+L')-th frame of the separated signal as the reference instead, for example Y(t+L') in the separated signal of FIG. 6(c), estimating S(t) requires the immediately preceding L+1 frames of data. In that case the separated signal Y(t) is expressed as a convolutive mixture of X(t−L') through X(t), as in equation [6.2].
The two equations differ in the frame offset relative to S(t) but are essentially equivalent, so the following describes how to estimate Y(t) from equation [6.2].

If mixing is assumed to occur only within the same frequency bin (that is, no frequency modulation occurs during propagation), equation [6.1], which describes the mixing of all frequency bins, can be rewritten as equation [6.4], which describes the mixing for each frequency bin. Under this assumption, the separation matrix W^[l] of equation [6.2] can be expressed as a matrix composed of diagonal matrices, as in equation [6.5], so estimating W^[l] only requires estimating the non-zero elements of equation [6.5].
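The per-bin structure described above means the separation filter can be applied one frequency bin at a time with small n × n matrices. The sketch below assumes that equation [6.2] has the form Y(t) = Σ_{l=0}^{L'} W^[l] X(t−l), which is inferred from the statement that Y(t) is a convolutive mixture of X(t−L') through X(t). The function name `apply_separation_filter` and the array layouts are illustrative, and the learning of W itself is not shown.

```python
import numpy as np

def apply_separation_filter(W, X):
    """Apply per-bin separation matrices across frame taps.

    W : complex array (L'+1, bins, n, n); W[l, w] is the n x n block for
        frame tap l and frequency bin w (the non-zero part of W^[l]).
    X : complex array (n, bins, frames), the observed spectrograms.
    Returns Y of shape (n, bins, frames - L'), where output frame index tau
    corresponds to observed frame t = tau + L'.
    """
    n_taps = W.shape[0] - 1
    T = X.shape[2] - n_taps
    Y = np.zeros((W.shape[2], X.shape[1], T), dtype=complex)
    for l in range(n_taps + 1):
        X_delayed = X[:, :, n_taps - l : n_taps - l + T]   # X(t - l)
        # For each bin w: Y[:, w, t] += W[l, w] @ X_delayed[:, w, t]
        Y += np.einsum("wij,jwt->iwt", W[l], X_delayed)
    return Y
```

Storing only the per-bin blocks, rather than the full matrix over all bins, is what the remark about estimating only the non-zero elements amounts to in practice.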

The learning rule (the expression for ΔW) is derived from equation [6.2] as follows. The Kullback-Leibler information I(Y) calculated by equation [4.5] is used as the measure of independence over the entire spectrogram. This approach is the same as that described in JP-A-2006-238409.

To make the components Y1(t) to Yn(t) of Y(t) mutually independent, it suffices to find the separation matrices W^[0] to W^[L'] that minimize the Kullback-Leibler information I(Y) of equation [4.5]. The method in JP-A-2006-238409 assumed an instantaneous mixture, so only one separation matrix had to be estimated; the present invention deals with a convolutive mixture over L'+1 frames, so L'+1 separation matrices must be estimated.

If, in addition to the assumption that Y1(t) to Yn(t) are mutually independent (independence between channels), it is also assumed that Yk(t−L') to Yk(t) are mutually independent (independence between frames), the learning rule of equation [7.1] is finally obtained.

That is, to find the separation matrices W^[0] to W^[L'], equations [6.2], [7.1], and [7.8] are iterated until W^[0] to W^[L'] converge (or for a fixed number of iterations). Here ΔW^[l](ω) and W^[l](ω) in equation [7.1] are the submatrices obtained by extracting from ΔW^[l] and W^[l], respectively, the elements corresponding to frequency bin ω (equation [6.6]), and Rω^[l] is the cross term calculated by equation [7.2]. φω(Y(t)) in equation [7.2] is a vector of score functions (equation [7.4]), identical to the vector of score functions presented in the applicant's earlier application (JP-A-2006-238409). The score function is defined as the logarithmic derivative of the probability density function (equation [7.5]); as disclosed in JP-A-2006-238409, using a multivariate score function prevents permutation from occurring.

The score function may take the same concrete form as in JP-A-2006-238409; for example, equation [7.6] is used. In this equation, αk(ω), m, and γk(ω) are positive real numbers and βk(ω) is a non-negative real number. As a simpler example, equation [7.7] may be applied.

In equation [7.8], η is a positive real number called the learning rate. η may be a constant such as 0.1, or it may be calculated adaptively as in equation [7.9]. In that equation, ‖W(ω)‖ is the sum of squares of all elements of W^[0](ω) to W^[L'](ω) (equation [7.10]), ‖ΔW(ω)‖ is likewise the sum of squares of all elements of ΔW^[0](ω) to ΔW^[L'](ω), and η_0 is a positive real number giving the upper bound of η. With the adaptive calculation, η is relatively small at the start of learning (because ‖ΔW(ω)‖ is large), which keeps W(ω) from overflowing, and relatively large at the end of learning (because ΔW(ω) is close to the zero matrix), so W(ω) converges quickly to its target value.

When equation [6.3] is used in place of equation [6.2], learning iterates equations [6.3], [7.1], and [7.8]. In that case, equation [7.3] is used instead of equation [7.2] for Rω^[l] in equation [7.1].

Equation [7.1] was derived above under the assumption that Yk(t−L') to Yk(t) are mutually independent. If it is instead assumed that Yk(t−L') to Yk(t) are mutually dependent, a different learning rule, equation [8.1], is obtained (equation [7.1] is common to both).

Equations [7.2] and [8.1] differ in the arguments of the score function: equation [7.2] takes only Y(t) as its argument, whereas equation [8.1] takes all of Y(t−L') to Y(t). This score function is defined by equation [8.4], and P(Yk(t), ..., Yk(t−L')) appearing there is the probability that the data of L'+1 adjacent frames occur together. Using equation [8.1] therefore lets the dependence between adjacent frames be reflected more strongly in the separation matrices. Equation [8.5] is an example of such a score function, and equation [8.6] is a more specific instance.

Equation [8.1] corresponds to Equation [6.2]. When Equation [6.3] is used instead of Equation [6.2], Equation [8.2] is the corresponding equation.

In the above description, the Kullback-Leibler information is used as the measure of independence, but other measures may be used. Measures of independence other than the Kullback-Leibler information include non-Gaussianity and kurtosis, and the separation matrix may be updated so as to maximize or minimize such a quantity.
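As a small illustration of one of the alternative measures named above, the following sketch estimates excess kurtosis from samples. It is not taken from the specification; the function name `excess_kurtosis` and the choice of test distributions are assumptions made for illustration. Excess kurtosis is close to 0 for Gaussian data and clearly positive for a heavier-tailed (Laplacian) distribution, which is why it can serve as a non-Gaussianity measure.

```python
import numpy as np

def excess_kurtosis(y):
    """Sample excess kurtosis: E[(y - mean)^4] / E[(y - mean)^2]^2 - 3."""
    y = y - np.mean(y)
    return np.mean(y ** 4) / np.mean(y ** 2) ** 2 - 3.0

rng = np.random.default_rng(0)
k_gauss = excess_kurtosis(rng.standard_normal(200_000))    # near 0
k_laplace = excess_kurtosis(rng.laplace(size=200_000))     # near 3 in theory
```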

(2) Applying the STFT to the spectrogram again in the time direction and solving as instantaneous mixing
Next, we describe a process that separates observation signals convolutively mixed in the time-frequency domain by applying the short-time Fourier transform (STFT) to the spectrogram once more in the time direction and solving the result as an instantaneous mixture.

When a convolution is short-time Fourier transformed (STFT) with a window longer than the number of taps, the convolution becomes a simple product. The same holds for convolutive mixing in the time-frequency domain. That is, applying the STFT in the time direction to Equation [6.4] above, which expresses convolutive mixing in the time-frequency domain, yields Equation [9.1] shown below, where X', A', and S' are the results of applying the STFT to each element of X, A, and S in Equation [6.4].
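The convolution-to-product property can be checked numerically. The sketch below is an illustration under simplifying assumptions, not the patent's procedure: it uses a single zero-padded block rather than a windowed STFT, so the relation is exact here, whereas in an STFT it holds approximately per frame. The sizes `taps` and `window` are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
taps = 4        # length of the mixing filter (number of taps)
window = 16     # transform length, chosen longer than the filter

# Complex-valued filter and source, as in a single frequency bin of a spectrogram.
a = rng.standard_normal(taps) + 1j * rng.standard_normal(taps)
s_len = window - taps + 1
s = rng.standard_normal(s_len) + 1j * rng.standard_normal(s_len)

x = np.convolve(a, s)            # linear convolution, length == window

# Transform each sequence, zero-padded to the same length.
X = np.fft.fft(x, window)
A = np.fft.fft(a, window)
S = np.fft.fft(s, window)

max_err = np.max(np.abs(X - A * S))   # convolution became an element-wise product
```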

Equation [9.1] above is an instantaneous mixing equation, so to separate the observation signals into independent components it suffices to consider Equation [9.2].

The conversion from the spectrogram X to X' (the modulation spectrogram), obtained by applying the STFT again in the time direction, is now described with reference to FIGS. 8 and 9. For comparison, the conversion from the waveform x to the spectrogram X is also described.

FIG. 8(a) shows the waveforms of the observation signals (the number of channels is 2 in this figure, but it may be arbitrary).
FIG. 8(b) shows the spectrograms generated by applying the short-time Fourier transform (STFT) to the observation signal waveforms of FIG. 8(a) (the STFT is performed for each channel, and the results are displayed stacked vertically). A Fourier transform with window length N yields N frequency components, but for real-valued data the negative frequency components are the complex conjugates of the positive frequency components (this is called conjugate symmetry), so only the N/2+1 = M frequency bins consisting of the DC component and the positive frequency components need to be considered. The frequency bin 201 shown in FIG. 8(b) is one of these bins. A spectrogram normally refers to a plot of the absolute value of X, but here X itself is also called a spectrogram. The same applies to the original signal S and the separation result Y.
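The bin count N/2+1 = M for real-valued input can be confirmed with a short numerical check. This is an illustrative sketch only; the window length `N = 8` and the random test frame are assumptions.

```python
import numpy as np

N = 8                                   # window length (illustrative)
x = np.random.default_rng(1).standard_normal(N)   # one real-valued frame

full = np.fft.fft(x)                    # all N frequency components
half = np.fft.rfft(x)                   # DC + positive frequencies only
M = N // 2 + 1

# Conjugate symmetry: bin k equals the conjugate of bin N - k for k = 1..N-1.
sym_ok = np.allclose(full[1:], np.conj(full[1:][::-1]))
same_bins = np.allclose(half, full[:M])
```

Because of this symmetry, the M bins returned by `rfft` carry all the information of the N-point transform of a real frame.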

Next, a short-time Fourier transform (STFT) is applied to the spectrogram X of FIG. 8(b) separately for each frequency bin. The data generated by applying the STFT to a spectrogram a second time is called a modulation spectrogram. If the window length of the second STFT is L', then L' bins are generated from one frequency bin, for example bin 201 in FIG. 8(b), and these are drawn along the depth direction. These are the bins 202 shown in FIG. 9(c), and collecting them for all frequency bins gives the data shown in FIG. 9(c).

That is, FIG. 9(c) is the modulation spectrogram generated by applying the STFT to the spectrogram X of FIG. 8(b) for each frequency bin, and it is represented as a modulation spectrogram X' with the rectangular-box structure shown in FIG. 9(c). The depth direction is also a frequency axis, but it represents the frequency components of the envelope rather than those of the waveform. In the second STFT the input data is already complex, so the result is not conjugate symmetric. All L' bins must therefore be taken into account.

The newly generated bins can be arranged vertically instead of along the depth direction. That is, if the bins 202 shown in FIG. 9(c) are arranged like the bins 203 shown in FIG. 9(d), the modulation spectrogram can also be represented as a plane, as shown in FIG. 9(d). Note that the modulation spectrogram X' of FIG. 9(d) looks similar to the spectrogram X of FIG. 8(b), but the meaning of the frequency bins differs. (The number of bins per channel is M for the spectrogram X of FIG. 8(b) and M×L' for the modulation spectrogram X' of FIG. 9(d).)
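The two-stage transform and the planar rearrangement can be sketched as follows. This is a minimal illustration, not the implementation of the specification: the helper `stft_1d`, the Hann window, the hop sizes, the values of `N` and `L2` (standing in for L'), and the flattening order ω' = ω·L' + ω_2 are all assumptions.

```python
import numpy as np

def stft_1d(x, win_len, hop, real_input):
    """Frame x, apply a Hann window, and FFT each frame. Returns (bins, frames)."""
    n_frames = 1 + (len(x) - win_len) // hop
    window = np.hanning(win_len)
    frames = np.stack(
        [x[i * hop:i * hop + win_len] * window for i in range(n_frames)], axis=1
    )
    # Real input: keep N/2+1 bins. Complex input: no conjugate symmetry, keep all bins.
    fft = np.fft.rfft if real_input else np.fft.fft
    return fft(frames, axis=0)

N, L2 = 64, 8          # first window length N and second window length L' (assumed)
x = np.random.default_rng(2).standard_normal(4096)   # one channel of waveform

X = stft_1d(x, N, N // 4, real_input=True)            # spectrogram, shape (M, T)
M = X.shape[0]

# Second STFT along time, separately for each frequency bin.
X_mod = np.stack(
    [stft_1d(X[w], L2, L2 // 4, real_input=False) for w in range(M)]
)                                                     # box form, shape (M, L', T')

# Planar form: the pair (w, w2) becomes the single index w * L' + w2.
X_flat = X_mod.reshape(M * L2, -1)                    # shape (M * L', T')
```

With these sizes, one channel goes from M = 33 bins in the spectrogram to M × L' = 264 bins in the planar modulation spectrogram.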

Returning to Equations [9.n] shown earlier, X' in Equations [9.1] and [9.2] corresponds to the three-dimensional modulation spectrogram X' shown in FIG. 9(c), where ω denotes the frequency bin in the vertical direction and ω_2 the bin in the depth direction. If the pair (ω, ω_2) in Equation [9.2] is represented collectively by a single index ω', Equation [9.3] is obtained. Equation [9.3] corresponds to the planar modulation spectrogram X' shown in FIG. 9(d).

To derive the learning rule (the equation for ΔW) from Equation [9.2] or [9.3], consider the following. As a measure of independence over the whole modulation spectrogram, take the Kullback-Leibler information calculated by Equation [9.5]. This equation is almost identical to Equation [4.5], except that H(Yk') is the entropy calculated from the modulation spectrogram of one channel and H(Y') is the joint entropy calculated from the entire modulation spectrogram. The method of calculating H(Yk') is described with reference to FIG. 10.

FIG. 10 corresponds to the three-dimensional modulation spectrogram X' shown in FIG. 9(c); that is, to the modulation spectrogram generated by applying the STFT for each frequency bin to the spectrogram X of FIG. 8(b), which was itself generated by applying the STFT to the observation signal waveforms of FIG. 8(a). In the entropy calculation for the first channel, for example, the modulation spectrogram Y1'(t) 221 of one frame in FIG. 10 is a plane. Substituting Y1'(t) into the multivariate probability density function P(Y1'(t)) 222, which takes that plane as its argument, yields the entropy H(Y1') 223.

Apart from the variable names, Equation [9.3] is identical to Equation [3.5]. The learning rule can therefore be derived simply by renaming the variables of Equation [5.2], which gives Equation [9.5]. That is, if Equations [9.3], [9.5], and [9.6] are repeated until W' converges, Y1'(t) to Yn'(t) become mutually independent.

Applying the inverse Fourier transform and overlap-add to each of the mutually independent modulation spectrograms Y1' to Yn' yields mutually independent spectrograms Y1 to Yn.
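The inverse step can be sketched for a single frequency-bin track as an STFT followed by inverse FFT and overlap-add. The specification does not state the window or hop used, so the strictly positive Hann-like window, the hop of 2 frames, and the normalization by the overlap-added window are assumptions made so that the round trip is exact; the function names are hypothetical.

```python
import numpy as np

def stft(x, win, hop):
    """Windowed FFT of successive frames; returns shape (len(win), n_frames)."""
    n = 1 + (len(x) - len(win)) // hop
    return np.stack(
        [np.fft.fft(x[i * hop:i * hop + len(win)] * win) for i in range(n)], axis=1
    )

def istft(X, win, hop):
    """Inverse FFT of each frame followed by overlap-add, normalized by the window sum."""
    L, n = X.shape
    out = np.zeros(hop * (n - 1) + L, dtype=complex)
    norm = np.zeros(hop * (n - 1) + L)
    for i in range(n):
        out[i * hop:i * hop + L] += np.fft.ifft(X[:, i])   # overlap-add of frames
        norm[i * hop:i * hop + L] += win                   # accumulated window weight
    return out / norm

rng = np.random.default_rng(4)
# One frequency bin of a spectrogram over 48 frames (complex-valued).
y = rng.standard_normal(48) + 1j * rng.standard_normal(48)
win = np.hanning(10)[1:-1]        # length-8 window with no zero endpoints

Y_mod = stft(y, win, hop=2)       # modulation-domain representation of this bin
y_back = istft(Y_mod, win, hop=2) # back to the spectrogram domain
```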

In the above description, the Kullback-Leibler information is used as the measure of independence, but other measures may be used, as in (1).
In addition, although the update equation for the separation matrix was derived from the natural gradient method, other algorithms may be used instead. These include the gradient method with an orthonormality constraint, the fixed-point method, and Newton's method; in this respect the situation is the same as for conventional instantaneous-mixing ICA.

(3) Solving by a process that combines shift stacking with instantaneous-mixing ICA
Next, we describe a process that separates observation signals convolutively mixed in the time-frequency domain by combining shift stacking with instantaneous-mixing ICA.

This third method achieves separation almost equivalent to the method described earlier, "(1) directly solving the convolutive mixture in the time-frequency domain", and is realized using the instantaneous-mixing ICA process disclosed in JP-A-2006-238409, an earlier patent application by the present applicant.

In this method, for example, the observed signal spectrograms are stacked while being shifted, and the time-frequency-domain instantaneous-mixing ICA disclosed in JP-A-2006-238409, the applicant's earlier patent application, is then applied to the result. This third method solves the permutation problem and achieves highly accurate separation that takes delay into account for mixed sound signals with various delays, such as direct waves and reflected waves.

Before describing the third method, we briefly review the permutation problem that arises in separating observation signals and the instantaneous-mixing ICA of JP-A-2006-238409, the applicant's earlier patent application that solved this problem.

Let s1 to sn be the mutually independent original signals emitted by n sound sources, and let s be the vector having them as elements. The observation signal x observed at the microphones is then the original signal s subjected to the convolution and mixing operation of Equation [1.2] described above. A short-time Fourier transform is applied to the observation signal x to obtain the time-frequency-domain signal X. If the elements of X are written Xk(ω, t), each Xk(ω, t) is complex-valued. A plot of |Xk(ω, t)|, the absolute value of Xk(ω, t), rendered as color intensity is, for example, the spectrogram of the observation signal shown in FIG. 6(b). This spectrogram is the observed signal spectrogram X generated by applying the short-time Fourier transform (STFT) to the observation signals acquired by, for example, the microphones 121-1 to 121-n shown in FIG. 5.

A spectrogram is, for example, a plot of |Xk(ω, t)|, the absolute value of Xk(ω, t), rendered as color intensity with t (frame number) on the horizontal axis and ω (frequency bin number) on the vertical axis. The separated signal Y is then obtained by multiplying each frequency bin of the signal X by the separation matrix W(ω), and the time-domain separated signal y is obtained by applying the inverse Fourier transform to Y.
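The per-bin multiplication Y(ω, t) = W(ω) X(ω, t) can be sketched as a batched matrix product. The array layout (frequency bin, channel, frame) and the random values are assumptions for illustration; in practice W(ω) would come from the learning procedure.

```python
import numpy as np

rng = np.random.default_rng(3)
n, M, T = 2, 5, 10      # channels, frequency bins, frames (illustrative sizes)

# Observed spectrograms, laid out as (bin, channel, frame).
X = rng.standard_normal((M, n, T)) + 1j * rng.standard_normal((M, n, T))
# One n x n separation matrix per frequency bin.
W = rng.standard_normal((M, n, n)) + 1j * rng.standard_normal((M, n, n))

# For every bin w: Y[w] = W[w] @ X[w]
Y = np.einsum('wkj,wjt->wkt', W, X)
```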

However, as described above, conventional independent component analysis in the time-frequency domain performs separation separately for each frequency bin and does not consider the relationship between frequency bins. Therefore, even if the separation itself succeeds, scaling and the assignment of sources to output channels may be inconsistent across frequency bins. The scaling inconsistency can be resolved by estimating the observation signal for each sound source, but the inconsistency in output assignment, for example a signal derived from S1 appearing in Y1 at ω = 1 while a signal derived from S2 appears in Y1 at ω = 2, that is, the permutation problem, cannot be resolved in this way.

JP-A-2006-238409, the applicant's earlier patent application, disclosed a technique for solving this permutation problem. Specifically, separation over the whole spectrogram is expressed by Equation [4.4] described above, and the separation matrix W that maximizes independence over the whole spectrogram is sought.

Specifically, the KL (Kullback-Leibler) information I(Y) given by Equation [4.5] is introduced as the measure of independence over the whole spectrogram, and the separation matrix W that minimizes I(Y) is sought. The KL information I(Y) is the sum of the entropies of the individual spectrograms minus the joint entropy of all spectrograms, and it takes its minimum (ideally 0) when all spectrograms are mutually independent.

In Equation [4.5], which defines the KL information I(Y), H(Y_k) denotes the entropy of the single spectrogram of each channel, and H(Y) denotes the joint entropy of the spectrograms of all channels.

For example, the relationship between H(Y_k) and H(Y) for n = 2 is as described earlier with reference to FIG. 2. In FIG. 2, P(Y_k(t)) is the probability density function of Y_k(t), and H(Y_k) is the entropy of the single spectrogram of each channel. The Kullback-Leibler information I(Y) is the sum of the per-spectrogram entropies 11 and 12 minus the joint entropy 13 of all spectrograms, and it takes its minimum (ideally 0) when all spectrograms are mutually independent.
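The quantity "sum of per-channel entropies minus the joint (simultaneous) entropy" can be illustrated with a case where both terms have closed forms. The sketch below assumes real-valued, zero-mean Gaussian variables, which is not the multivariate density used in the specification; it is only meant to show that the measure is 0 for independent channels and positive otherwise. The function name is hypothetical.

```python
import numpy as np

def gaussian_kl_information(cov):
    """I = sum_k H(Y_k) - H(Y) for a zero-mean real Gaussian with covariance cov."""
    cov = np.asarray(cov, dtype=float)
    n = cov.shape[0]
    # Marginal entropies: 0.5 * log(2*pi*e * variance_k)
    marginal = 0.5 * np.log(2 * np.pi * np.e * np.diag(cov))
    # Joint entropy: 0.5 * log((2*pi*e)^n * det(cov))
    joint = 0.5 * (n * np.log(2 * np.pi * np.e) + np.linalg.slogdet(cov)[1])
    return marginal.sum() - joint

I_indep = gaussian_kl_information([[1.0, 0.0], [0.0, 2.0]])   # independent channels
I_dep = gaussian_kl_information([[1.0, 0.8], [0.8, 2.0]])     # correlated channels
```

In this Gaussian case the measure reduces to 0.5 · (Σ log σ_k² − log det Σ), so `I_indep` is 0 up to rounding and `I_dep` is strictly positive.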

To minimize the KL information I(Y) over the whole spectrogram, Equations [5.1] to [5.3] shown below are repeated until W and Y converge, as described earlier.

The third method applies the time-frequency-domain instantaneous-mixing ICA disclosed in JP-A-2006-238409. Specifically, the signal separation process of JP-A-2006-238409 generates a time-frequency-domain separated signal from the time-frequency-domain observation signal and a separation matrix set to initial values; it then corrects the separation matrix until the generated separated signal and the separation matrix, calculated using a multidimensional score function derived from a multidimensional probability density function, have almost converged; and it applies the corrected separation matrix to generate the time-frequency-domain separated signal. Details are as disclosed in JP-A-2006-238409.

The third method described below, "(3) separating observation signals convolutively mixed in the time-frequency domain by combining shift stacking with instantaneous-mixing ICA", applies the time-frequency-domain instantaneous-mixing ICA disclosed in JP-A-2006-238409. Specifically, the observed signal spectrograms are, for example, stacked while being shifted, and the time-frequency-domain instantaneous-mixing ICA is applied to the result. This third method is described below.

In this method, for the observed signal spectrogram of each of the microphones serving as the sound input unit, a vector is generated by stacking copies vertically while shifting the frame number. For example, consider the observed signal spectrogram of the k-th channel, corresponding to the k-th microphone, that is, the observed signal spectrogram X_k(t) of Equation [4.1] above, and form the vector obtained by stacking it vertically while shifting the frame number. Then consider the vector obtained by stacking these for all channels. This is the vector X''(t) of Equation [11.1] shown below, which contains the vectors for n channels. The vector for each channel is written X_k''(t).

The procedure for creating the vector X''(t) of Equation [11.1] above is described with reference to FIG. 11 and the following figures. FIG. 11 illustrates the process of generating the vector X''(t) for each channel from the per-channel observed signal spectrogram X_k(t) generated from the input signal of each microphone. The data 301 shown in FIG. 11(a), that is X_k, is the spectrogram of one channel of the observation signal, namely the observed signal spectrogram of the k-th channel corresponding to the k-th microphone, and corresponds to X1 or X2 in FIG. 6(b) described earlier.

Let X_k^[l] denote the result of shifting X_k to the left by l (lowercase letter el) frames. FIG. 11(b) shows a structure in which copies of the observed signal spectrogram X_k are stacked vertically while the shift amount l is varied sequentially from l = 0 to L':
Data 311-0 has shift amount l = 0,
Data 311-1 has shift amount l = 1 frame,
:
Data 311-L' has shift amount l = L' frames.
As described above, L' is the number of frame taps used when generating the separation result from the observation signal.

From one observed signal spectrogram, a set of observed signal spectrograms with these different shift amounts in the frame direction is generated; this is called the observed signal spectrogram shift set [X'']. Cutting out one frame from the observed signal spectrogram shift set [X''] gives the expression 312 shown in FIG. 11(b). This expression corresponds to the vector X_k''(t) for one channel contained in Equation [11.1] above.

As described above, Equation [11.1] is the vector obtained by taking, for each channel, the observed signal spectrogram shift set generated by stacking the observed signal spectrogram X_k(t) vertically while shifting the frame number, and then stacking these sets for all channels to form the multi-channel observed signal spectrogram shift set.

As shown in FIG. 11(b), the gaps created by the shift, that is, the hatched portions of the data 311-0 to 311-L' in FIG. 11(b), are filled either with values close to zero or with copies of the values at the ends (X(1), X(T), and so on). Alternatively, zeros may be used if the division-by-zero countermeasure described later is in place. Alternatively, the gaps at both ends may be removed and only the data of the middle T−L' frames used. Furthermore, instead of an ordinary shift, a cyclic shift of length T (in which the data pushed off the left end by the shift is copied to the right end) may be applied. The processing example described below uses the observed signal spectrogram shift set [X''] generated by the cyclic shift.
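The shift-and-stack construction with a cyclic shift can be sketched as follows. The array layout (channel, frequency bin, frame), the function name `shift_stack`, and the ordering of the stacked copies (all shifts of channel 1, then all shifts of channel 2, and so on) are assumptions chosen to mirror the description of Equation [11.1].

```python
import numpy as np

def shift_stack(X, L_prime):
    """Build the shift set from spectrograms X of shape (n, M, T).

    Each channel is copied L_prime + 1 times, the l-th copy cyclically shifted
    left by l frames, giving an array of shape (n * (L_prime + 1), M, T).
    """
    shifted = [
        np.roll(X[k], -l, axis=-1)          # left shift; overflow wraps to the right end
        for k in range(X.shape[0])
        for l in range(L_prime + 1)
    ]
    return np.stack(shifted)

# Tiny example: n = 2 channels, M = 3 bins, T = 6 frames, L' = 2.
X = np.arange(2 * 3 * 6).reshape(2, 3, 6)
X_shift_set = shift_stack(X, L_prime=2)
```

The result has n × (L'+1) = 6 apparent channels, which is the form to which the instantaneous-mixing ICA is then applied.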

Comparing the observed signal spectrogram shift set [X''] generated by the shift and stacking process shown in FIG. 11(b) with the original observed signal spectrogram [X]:
the observed signal spectrogram [X] consists of spectrograms for n channels,
whereas
the observed signal spectrogram shift set [X''] apparently contains spectrograms for n×(L'+1) channels. Here n is the number of channels, corresponding to the number of microphones, and (L'+1) is the number of shifted copies created for each channel.

This observed signal spectrogram shift set [X''] is treated as an observed signal spectrogram with n×(L'+1) channels, and separation is performed by the method using the instantaneous-mixing ICA disclosed in JP-A-2006-238409, the applicant's earlier patent application. This achieves separation equivalent to the method "(1) directly solving the convolutive mixture in the time-frequency domain" described earlier. The principle is explained below.

For the observed signal spectrogram X, consider the operation of generating a separation result by convolving the (t−l)-th through (t−l+L')-th frames, where l is the lowercase letter el. That is, as shown in FIG. 12, one frame of the separation result is generated from the L'+1 frames X(t−l) to X(t−l+L'), and the separated signal Y^[l](t) obtained by this operation is given by Equation [11.2].

Let the separation result be Y^[l](t). Because the separated signal Y^[l](t) is a convolution over L'+1 frames, L'+1 coefficient matrices are required. These also take different values depending on the number of shift frames l, so the separation matrices W are written with two indices, as
W^[l,0] to W^[l,L'].
That is, a separation matrix W is set for each combination of the number of shift frames l and the individual shifted spectrogram.

Equations [11.3] and [11.4] give the details of the submatrices appearing in Equation [11.2], and Equation [11.5] gives the details of the submatrices appearing in Equation [11.4].
The separated signal Y^[l](t) and the separation matrices W^[l,τ] each consist of vectors or matrices corresponding to the components of the individual channels. The index τ of W ranges over τ = 0 to L'.

Now define the separation result vector Y''(t), given by Equation [11.6], which contains all of the separation results Y^[0](t) to Y^[L'](t), and the matrix W'', given by Equation [11.7], which contains all of the separation matrices W^[0,0] to W^[L'+1,L']. Using the vector Y''(t) and the matrix W'', the separation process can be written simply as Equation [11.8]:
Y''(t) = W''X''(t) ... [11.8]

In Patent Document 1 (JP-A-2006-238409), described earlier as the conventional method, separation over the whole spectrogram is expressed by Equation [4.4] described above:
Y(t) = WX(t) ... [4.4]
Comparing Equation [11.8] with Equation [4.4], Equation [11.8] can be regarded as Equation [4.4] applied with the number of channels simply increased from n to n×(L'+1).

That is, as shown in FIG. 13, the observed signal spectrogram shift set [X''] for multiple channels is composed of X_1'' to X_n''. If these are regarded as observed signal spectrograms of n×(L'+1) separate channels, Equation [11.8] can be regarded as Equation [4.4] applied with the number of channels simply increased from n to n×(L'+1).

Therefore, if the observed signal spectrogram X for n channels is expanded to n×(L'+1) channels by the method described with reference to FIG. 11, and Equations [5.1] to [5.3], the learning equations of JP-A-2006-238409, are applied repeatedly to the resulting observed signal spectrogram shift set [X''], the separation result Y'' and the separation matrix W'' are obtained.

Note that in Equations [5.4] to [5.7], which give the details of the variables in Equations [5.1] to [5.3], n is to be read as n×(L'+1). Likewise, k is an index in the range 1 ≤ k ≤ n×(L'+1) rather than 1 ≤ k ≤ n.

The separation result Y'' contains spectrograms for n×(L'+1) channels, but only n channels (or fewer) are actually wanted, so spectrograms are selected as necessary. One applicable selection method is to keep only the components of Y'' that correspond to a specific shift amount l, such as Y_1^[0], Y_2^[0], ..., Y_n^[0].
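The selection of the components for one shift amount can be sketched as simple indexing. The sketch assumes that Y'' is laid out in the same order as the stacked input, that is, channel k and shift l sit at row k·(L'+1) + l; this layout and the placeholder data are assumptions for illustration.

```python
import numpy as np

n, L_prime, M, T = 3, 2, 4, 5     # channels, frame taps, bins, frames (illustrative)

# Placeholder for the separation result Y'': n * (L' + 1) spectrograms of shape (M, T).
Y_stacked = np.arange(n * (L_prime + 1) * M * T, dtype=float).reshape(
    n * (L_prime + 1), M, T
)

l = 0                                              # shift amount to keep
rows = [k * (L_prime + 1) + l for k in range(n)]   # rows of Y_1^[l], ..., Y_n^[l]
Y_selected = Y_stacked[rows]                       # n spectrograms remain
```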

Alternatively, the optimal shift amount l in the frame direction may be determined using a known signal, in the same way as the frame tap count L' used to generate the separation result from the observation signal. That is, a known signal is played from a loudspeaker or the like, the sound is collected and separated by the method of the present invention, and the SIR (signal-to-interference ratio), a measure of separation accuracy, is computed for each of the separation results Y_k^[0] to Y_k^[L']. The separation result Y_k^[l] for the shift l that gives the highest SIR is then selected.

FIG. 14 is a flowchart of the sequence for separating observation signals that are convolutively mixed in the time-frequency domain by this processing (3), which combines shift stacking with instantaneous-mixture ICA. Each step of the flow in FIG. 14 is described below.

First, in step S11, the observation signal spectrograms are stacked while being shifted. This is the processing described with reference to FIG. 11: the observation signal spectrogram generated from the observation signal acquired by each microphone is shifted successively in units of the shift frame l, shifted copies are generated and stacked until the shift amount corresponding to the frame tap count L' is reached, and the observation signal spectrogram shift set [X''] is produced.
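Step S11 can be sketched in NumPy as follows. This is a minimal illustration rather than the implementation in the patent: the function name, the array layout `(n, F, T)`, the use of a cyclic shift via `np.roll`, and the channel-major ordering of the stacked output (all shifts of channel 1 first, then channel 2, and so on) are assumptions made for the example.

```python
import numpy as np

def stack_shifted_spectrograms(X, num_taps):
    """Build a shift set X'' from spectrograms X (hypothetical helper).

    X        : complex array of shape (n, F, T) = (channels, bins, frames)
    num_taps : frame tap count L'; shifts 0..L' are generated
    returns  : array of shape (n * (L' + 1), F, T)
    """
    # Cyclic shift along the frame axis: shifted[t] = X[t - shift].
    shifted = [np.roll(X, shift, axis=-1) for shift in range(num_taps + 1)]
    # Stack on a new axis 1 -> (n, L'+1, F, T), so that the shifted
    # copies of one microphone channel sit next to each other.
    stacked = np.stack(shifted, axis=1)
    n, s, F, T = stacked.shape
    return stacked.reshape(n * s, F, T)
```

With n = 2 microphones and L' = 2, the result has 6 channels: indices 0 to 2 hold shifts 0, 1, 2 of microphone 1, and indices 3 to 5 hold those of microphone 2.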

Next, in step S12, the separation result Y'' is obtained using instantaneous-mixture ICA (or a modified score function). That is, Equations [5.1] to [5.3], the learning equations of JP-A-2006-238409, are applied repeatedly to the observation signal spectrogram shift set [X''] to compute the separation result Y'' and the separation matrix W''. As before, in Equations [5.4] to [5.7], which define the variables of Equations [5.1] to [5.3], n is read as n×(L'+1), and the index k ranges over 1 ≤ k ≤ n×(L'+1) rather than 1 ≤ k ≤ n.

The score function is defined as the logarithmic derivative of the probability density function and is given by Equation [5.7]. As explained for Equation [7.5] in the method [(1) directly solving the convolutive mixture in the time-frequency domain], using a multivariate score function, as disclosed in JP-A-2006-238409, prevents permutation from occurring. Processing that uses this score function is described later.

Next, in step S13, the desired spectrograms are selected from the separation result Y'' as necessary. As described above, Y'' contains spectrograms for n×(L'+1) channels, whereas only n channels (or fewer) are wanted, so a selection is made as necessary.

One applicable selection method is to keep only the components of Y'' that correspond to a specific shift amount l, for example Y_1^[0], Y_2^[0], ..., Y_n^[0]. In doing so, the separation result Y_k^[l] for the shift l that gives the highest separation accuracy (SIR) can be selected.
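The selection in step S13 can be sketched as below. The sketch assumes that Y'' is stored channel-major as an array of shape `(n*(L'+1), F, T)` and that SIR values have already been measured with a known signal; both helper names and the `sir_db` table are hypothetical.

```python
import numpy as np

def select_by_shift(Y_stacked, n_channels, num_taps, shifts):
    """Keep one spectrogram per source from Y'' (hypothetical helper).

    Y_stacked : array of shape (n * (L' + 1), F, T), channel-major order
    shifts    : one shift index l per source, each in 0..L'
    returns   : array of shape (n, F, T)
    """
    F, T = Y_stacked.shape[1:]
    grouped = Y_stacked.reshape(n_channels, num_taps + 1, F, T)
    # Pick element [k, shifts[k]] for every source k.
    return grouped[np.arange(n_channels), np.asarray(shifts)]

def best_shifts_from_sir(sir_db):
    """sir_db: (n, L' + 1) table of measured SIR; returns the best l per source."""
    return np.argmax(sir_db, axis=1)
```

Passing `shifts=[0] * n` reproduces the fixed choice Y_1^[0], ..., Y_n^[0]; passing the output of `best_shifts_from_sir` reproduces the SIR-based choice.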

The method described above performs processing nearly equivalent to the method using Equations [7.2] to [7.5], described earlier for the method [(1) directly solving the convolutive mixture in the time-frequency domain]. In this processing example, the signals of the n×(L'+1) channels are separated so that they are all mutually independent. With reference to FIG. 13, for example, the signal Y_1^[0] 341, which is one spectrogram within the signal Y'' 331 obtained as the separation result from the multi-channel observation signal spectrogram shift set [X''], becomes independent not only of the signals Y_n^[0] 343 and Y_n^[L'] 344, which derive from another sound source, but also of Y_1^[L'] 342, which should derive from the same sound source.

On the other hand, by changing the score function used above (Equation [5.7] and related equations), processing equivalent to the method using Equations [8.1] to [8.4], described earlier for the method [(1) directly solving the convolutive mixture in the time-frequency domain], can also be performed.

The method using Equations [8.1] to [8.4], described for the method [(1) directly solving the convolutive mixture in the time-frequency domain], is based on the assumption that Y_k(t−L') to Y_k(t) depend on one another. In the present processing example [(3) processing that combines shift stacking with instantaneous-mixture ICA], processing that takes the dependence among separation results into account is likewise possible. With reference to FIG. 13, separation can be performed such that Y_1^[0] 341 is independent of Y_n^[0] 343 and Y_n^[L'] 344 but dependent on Y_1^[L'] 342. The method is described below.

To introduce dependence among the separation results Y_k^[0], ..., Y_k^[L'] that derive from the same sound source, Equation [12.1] shown below is used in place of Equation [5.2], the formula for ΔW(ω) described earlier.

Here, Y''(ω, t) and W''(ω) in Equation [12.1] are the vector and the matrix obtained by extracting the components of the ω-th frequency bin from Y'' and W'', respectively, and are expressed as in Equations [12.3] and [12.4]. φ_ω(Y''(t)) is a vector whose elements are n×(L'+1) score functions, as expressed by Equation [12.5]. (Specific examples of the score function are given later.)

Equation [12.5] differs from Equation [5.6] in the arguments of the score functions. When Equation [5.6] is extended to n×(L'+1) channels, the n×(L'+1) score functions all take different arguments. In Equation [12.5], by contrast, φ_kω^[0](Y_k''(t)) to φ_kω^[L'](Y_k''(t)) all take the same argument Y_k''(t), so there are only n distinct arguments.

The score function φ_kω^[l](Y_k''(t)) is defined as the logarithmic derivative of a multidimensional (multivariate) probability density function whose argument is Y_k''(t), that is, Y_k^[0], ..., Y_k^[L'] (Equation [12.5]). It has been shown theoretically that when multiple arguments are included in a single probability density function and ICA learning is performed with the score function derived from it, the elements that make up the argument become mutually dependent (they do not become independent). Referring again to FIG. 13, within the signal Y_1'' 351, which is the set Y_1^[0], ..., Y_1^[L'], the elements depend on one another, while the set as a whole is independent of any other set of signals, such as Y_n'' 352.

Specific examples of the multidimensional probability density function and the score function are now described. One kind of multidimensional probability density function is the spherical distribution. As shown in Equation [13.1] below, it is generated by substituting the L2 norm of a vector into a function that takes a scalar argument ("∝" denotes proportionality).

The L2 norm is the square root of the sum of the squared absolute values of the elements, and is obtained by setting m = 2 in Equation [13.2]. As an example of a spherical distribution, using one based on the exponential distribution as in Equation [13.3] (γ is a positive real number) yields Equation [13.4] as the corresponding score function. This expression is substituted into Equation [12.5].
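As an illustration of a score function shared by all shifts of one source, the sketch below assumes a spherical density of the form P(Y_k''(t)) ∝ exp(−K·‖Y_k''(t)‖_2), whose logarithmic derivative is proportional to −K·Y_k^[l](ω, t) / ‖Y_k''(t)‖_2. The exact form and constants of Equations [13.3] and [13.4] are not reproduced in this text, so the formula, the coefficient name `K`, the function name, and the array layout are assumptions.

```python
import numpy as np

def grouped_spherical_score(Y_stacked, n_channels, num_taps, K=1.0):
    """Score values for Y'' under an assumed spherical density (sketch).

    Y_stacked : complex array (n * (L' + 1), F, T), channel-major order
    returns   : array of the same shape holding the score values
    """
    C, F, T = Y_stacked.shape
    grouped = Y_stacked.reshape(n_channels, num_taps + 1, F, T)
    # One L2 norm per source k and frame t, taken over all shifts l and
    # all frequency bins: every element of Y_k''(t) shares this norm,
    # which is what ties the L'+1 shifted results of a source together.
    norm = np.sqrt(np.sum(np.abs(grouped) ** 2, axis=(1, 2), keepdims=True))
    return (-K * grouped / norm).reshape(C, F, T)
```

The function has no guard against a zero norm; it assumes each frame of each source group contains at least one non-zero element.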

As with Equation [7.6], described earlier for the method [(1) directly solving the convolutive mixture in the time-frequency domain], Equation [13.4] may also be modified. An example is shown in Equation [13.5]. Possible modifications include the following.
1) A positive value β_k^[l](ω) is added to the denominator to prevent division by zero. Different values may be used for each k, l, and ω.
2) The L-m norm (Equation [13.2]) is used instead of the L2 norm.
3) Instead of the score-function coefficient K, a positive value γ_k^[l](ω) that differs for each k, l, and ω is used.
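The three modifications can be combined in a single function, sketched below for the elements of one source group Y_k''. Equation [13.5] itself is not reproduced in this text, so the expression −γ·Y / (‖Y‖_m + β) is an assumed form that simply applies the three listed changes; the function name and array layout are also hypothetical.

```python
import numpy as np

def modified_score(Y_group, gamma=1.0, beta=1e-8, m=2.0):
    """Modified score for one source group Y_k'' (assumed form).

    Y_group : complex array (L' + 1, F, T), all shifts of one source
    gamma   : positive scalar, or array broadcastable to (L' + 1, F, 1)
              to give a separate value per shift l and bin omega
    beta    : positive scalar or array of the same kind; avoids 0/0
    m       : order of the L-m norm (m = 2 gives the L2 norm)
    """
    mag = np.abs(Y_group)
    # L-m norm over shifts and frequency bins, one value per frame t.
    norm_m = np.sum(mag ** m, axis=(0, 1), keepdims=True) ** (1.0 / m)
    return -gamma * Y_group / (norm_m + beta)
```

With `beta > 0`, an all-zero frame produces a score of zero instead of an undefined value.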

Equation [12.1] is an update rule based on the natural gradient method, but other algorithms can also be used. For example, the update rule based on the algorithm called Equivariant Adaptive Separation via Independence (EASI), which performs decorrelation and separation of signals simultaneously, is given by Equation [12.2]. With this algorithm, learning can converge in fewer iterations than with the natural gradient method.
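The two update rules can be compared in a per-bin sketch. Equations [12.1] and [12.2] are not reproduced in this text, so the code assumes their commonly used forms with the score φ defined as a log-derivative: ΔW = {I + E_t[φ Y^H]} W for the natural gradient and ΔW = {I − E_t[Y Y^H] + E_t[φ Y^H] − E_t[Y φ^H]} W for EASI. The signs, the function names, and the array shapes are assumptions.

```python
import numpy as np

def delta_w_natural_gradient(W, Y, Phi):
    """Natural-gradient direction for one frequency bin (assumed form).

    W : (C, C) separation matrix; Y, Phi : (C, T) outputs and scores.
    """
    C, T = Y.shape
    E_phi_y = Phi @ Y.conj().T / T          # time average of phi Y^H
    return (np.eye(C) + E_phi_y) @ W

def delta_w_easi(W, Y, Phi):
    """EASI direction for one frequency bin (assumed form)."""
    C, T = Y.shape
    E_phi_y = Phi @ Y.conj().T / T          # first term
    E_y_phi = E_phi_y.conj().T              # second term: Hermitian transpose
    E_y_y = Y @ Y.conj().T / T              # third term: decorrelation
    return (np.eye(C) - E_y_y + E_phi_y - E_y_phi) @ W
```

In either case the matrix would be updated as W ← W + η·ΔW with a small learning rate η, after which Y = W X is recomputed.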

In Equations [12.1] and [12.2], the amount of computation can be reduced by exploiting the symmetry of the matrix elements, as explained next.

The inside of E_t[·] in Equation [12.1] expands into an (L'+1)n × (L'+1)n matrix as in Equation [12.7] (an overline denotes the complex conjugate). When each element of this matrix is averaged, elements for which the relative shift between the first factor φ_kω^[α](Y_k''(t)) and the second factor Y_i^[β](ω, t) is the same have nearly the same averaged value (α and β are integers satisfying 0 ≤ α, β ≤ L'). That is, the relation of Equation [12.8] holds. In particular, when the cyclic shift described above is used as the shift, the values are exactly identical.

Using this property, of the {(L'+1)n}^2 elements in Equation [12.7], only 2(L'+1)n^2 actually need to be computed; for the remaining elements, values can be reused according to Equation [12.8].

The amount of computation can be reduced for Equation [12.2] in the same way. Of the three terms inside E_t[·], the first is computed as in Equation [12.1]. The second is obtained simply by taking the Hermitian transpose of the first term once it has been computed (Equation [12.9]). For the third term, the computation can be reduced by applying the transformation of Equation [12.10]. Here, X''(ω, t) in Equation [12.10] is the vector obtained by extracting from Equation [11.1] the elements corresponding to the ω-th frequency bin, and can be written as in Equation [12.11].

The term E_t[X''(ω, t) X''(ω, t)^H] remains constant throughout learning. It therefore needs to be computed only once before learning, and the averaging does not have to be repeated in every iteration. In other words, the right-hand side of Equation [12.10] requires less computation than the left-hand side.
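The saving can be illustrated with the identity E_t[Y'' Y''^H] = W'' E_t[X'' X''^H] W''^H, which follows from Y'' = W'' X'' and is assumed here to be the content of Equation [12.10]. The function names are hypothetical, and the arrays hold a single frequency bin.

```python
import numpy as np

def precompute_covariance(X_bin):
    """E_t[X X^H] for one bin; X_bin has shape (C, T). Computed once."""
    return X_bin @ X_bin.conj().T / X_bin.shape[1]

def output_covariance(W, cov_x):
    """E_t[Y Y^H] obtained from the cached covariance, without averaging."""
    return W @ cov_x @ W.conj().T
```

Inside the learning loop only `output_covariance` is called, so each iteration costs two C×C matrix products instead of an average over all T frames.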

Furthermore, in computing E_t[X''(ω, t) X''(ω, t)^H], Equation [12.12], a symmetry of the same kind as Equation [12.8], and Equation [12.13], a symmetry about the diagonal, both hold. Therefore, of the {(L'+1)n}^2 elements, only (L'+1)n^2 actually need to be computed.
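Both symmetries can be checked numerically for a cyclically shifted stack. The sketch below is an illustration under assumptions (cyclic shift via `np.roll`, one frequency bin, hypothetical function name); it indexes the covariance as cov[k, α, i, β] so that the relative shift α − β is easy to see.

```python
import numpy as np

def stacked_covariance(x_bin, num_taps):
    """Covariance of the shift-stacked observations for one bin (sketch).

    x_bin   : complex array (n, T), one frequency bin of each channel
    returns : array (n, L'+1, n, L'+1), entry [k, a, i, b] is the time
              average of X_k^[a](t) * conj(X_i^[b](t))
    """
    n, T = x_bin.shape
    S = num_taps + 1
    # (n, S, T): cyclically shifted copies of each channel.
    stacked = np.stack([np.roll(x_bin, s, axis=-1) for s in range(S)], axis=1)
    flat = stacked.reshape(n * S, T)
    cov = flat @ flat.conj().T / T
    return cov.reshape(n, S, n, S)
```

With a cyclic shift, cov[k, α, i, β] equals cov[k, α+1, i, β+1] exactly, and cov[k, α, i, β] is the complex conjugate of cov[i, β, k, α], so only entries with one representative per relative shift and one side of the diagonal need to be evaluated.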

[Specific configuration example and processing example]
Configuration examples of the signal separation device of the present invention are shown in FIG. 15 and FIG. 16. FIG. 15 is a configuration example of a signal separation device that executes the method of solving the convolutive mixture in the time-frequency domain, and FIG. 16 is a configuration example of a signal separation device that executes the method of converting to a modulation spectrogram and then solving an instantaneous mixture.

(1) Configuration that executes the method of solving the convolutive mixture in the time-frequency domain
First, the configuration and processing of the signal separation device shown in FIG. 15, which executes the method of solving the convolutive mixture in the time-frequency domain, are described. Overall control of the processing described below is performed by the control unit 409. The control unit 409 controls the processing according to, for example, a program that executes the processing described below and is stored in advance in a storage unit (not shown) of the device. The processing of each component is as follows. Independent sounds emitted by a plurality of sound sources are observed with a plurality of microphones 401, and the AD conversion unit 402 converts the input analog signals into digital signals to obtain digital observation signals.

The digital observation signals are input to the short-time Fourier transform (STFT) unit 403, where short-time Fourier transform processing is performed to obtain spectrograms of the observation signals. The processing up to this point corresponds, for example, to obtaining the spectrogram X of the observation signal shown in FIG. 6(b).
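The transform performed by the STFT unit can be sketched with NumPy as follows. The frame length, hop size, and Hann window are illustrative assumptions and are not values taken from this document; the function name is hypothetical.

```python
import numpy as np

def stft(signal, frame_len=512, hop=128):
    """Short-time Fourier transform of a real 1-D signal (sketch).

    returns: complex array (frame_len // 2 + 1, n_frames),
             rows are frequency bins, columns are frames.
    """
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    # Cut overlapping frames, apply the window, one frame per column.
    frames = np.stack(
        [signal[i * hop : i * hop + frame_len] * window for i in range(n_frames)],
        axis=1,
    )
    # Real-input FFT along the sample axis of each frame.
    return np.fft.rfft(frames, axis=0)
```

Applying `stft` to each microphone signal and stacking the results gives an array of shape (n, F, T) that plays the role of the observation spectrogram X.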

The signal separation unit 404 separates the spectrogram X of the observation signal generated by the short-time Fourier transform (STFT) unit 403 into independent components. The signal separation device shown in FIG. 15 separates observation signals that are convolutively mixed in the time-frequency domain by directly solving the convolutive mixture in the time-frequency domain: the operations of Equations [6.2], [7.2], [7.1], and [7.8] are repeated until the separation matrix and the separation result have converged sufficiently (or for a fixed number of iterations). This separation processing yields the separation result Y shown in FIG. 6(c).

The processing performed by the convolution operation unit 408 follows the processing described earlier with reference to FIG. 6. It takes into account that the observation signal X(t) of the t-th frame is affected by the source signals of the preceding L+1 frames, where L+1 is taken as the maximum delay. That is, the observation signal X(t) is expressed by the convolutive mixture of Equation [6.1] using the frame tap count L described above. Then, taking Y(t+L') in the separated signal of FIG. 6(c) as the reference and considering the data of the preceding L+1 frames in order to estimate S(t), the separated signal Y(t) is expressed as a convolutive mixture of X(t−L') through X(t) as in Equation [6.2], and the convolution operation is carried out by applying Equations [6.2] and [7.2].
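The convolution over frames can be sketched as below, assuming that Equation [6.2] has the form Y(ω, t) = Σ_{l=0}^{L'} W^[l](ω) X(ω, t−l) with one n×n matrix per tap and frequency bin. The array layout, the zero padding before the first frame, and the function name are assumptions made for illustration.

```python
import numpy as np

def separate_convolutive(W, X):
    """Apply per-bin multi-tap separation matrices (sketch).

    W : complex array (L' + 1, F, n, n), one matrix per tap and bin
    X : complex array (n, F, T), observation spectrograms
    returns Y : complex array (n, F, T)
    """
    taps = W.shape[0]
    T = X.shape[-1]
    Y = np.zeros(X.shape, dtype=complex)
    for l in range(taps):
        # X delayed by l frames; frames before the start are treated as zero.
        X_delayed = np.zeros(X.shape, dtype=complex)
        X_delayed[..., l:] = X[..., : T - l]
        # For every bin f: Y[:, f, t] += W[l, f] @ X_delayed[:, f, t].
        Y += np.einsum("fij,jft->ift", W[l], X_delayed)
    return Y
```

With L' = 0 the loop reduces to an ordinary instantaneous per-bin separation Y(ω, t) = W(ω) X(ω, t).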

As described above, the frame tap count L' used to generate the separation result Y from the observation signal X, that is, to generate the separation result Y of FIG. 6(c) from the observation signal X of FIG. 6(b), may be set to L' = αL (α is a suitable positive real number) when L is known (that is, when the reverberation time is known). When L is unknown, L' is determined, for example, by one of the following methods.
(a) Fix L' at a constant value such as L' = 64 or L' = 100.
(b) Measure the reverberation time and use the value of L obtained from it as L'.
(c) Perform separation with various values of L' and adopt the value of L' that gives the best separation result. For example, compute the separation performance measure SIR (signal-to-interference ratio) and adopt the L' that gives the highest SIR.
By one of the above methods, the frame tap count L' for generating the separation result Y from the observation signal X, for example the separation result Y of FIG. 6(c) from the observation signal X of FIG. 6(b), is determined, and the separation result is generated from a plurality of consecutive frames of the observation signal using this L'.
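Methods (b) and (c) can be sketched as small helpers. The conversion from reverberation time to a frame count (reverberation time divided by the STFT hop duration, rounded up) is an assumption, since the document does not state how L is obtained from the reverberation time; `evaluate_sir` stands for a user-supplied measurement routine and is hypothetical, as are the function names.

```python
import math

def taps_from_reverb_time(reverb_seconds, sample_rate, hop):
    """Method (b): frames spanned by the reverberation time (assumed rule)."""
    return math.ceil(reverb_seconds * sample_rate / hop)

def taps_by_best_sir(candidates, evaluate_sir):
    """Method (c): try each candidate L' and keep the one with the highest SIR.

    evaluate_sir: callable taking L' and returning the measured SIR in dB
    (hypothetical; it would run separation on a known test signal).
    """
    return max(candidates, key=evaluate_sir)
```

For a 0.3 s reverberation time at 16 kHz with a hop of 128 samples, `taps_from_reverb_time` gives 38 frames.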

The rescaling unit 405 performs rescaling, which aligns the scale of each frequency bin of the separated signals. Rescaling is the processing of adjusting the scale of each frequency bin. If the observation signals were normalized (by adjusting their mean or variance) before the separation processing, the normalization is undone here.

The inverse Fourier transform unit 406 converts the spectrograms of the separated signals into time-domain signals by an inverse Fourier transform. The converted signals are sent to the post-stage processing execution unit 407 as necessary. Post-stage processing includes, for example, playback from a loudspeaker and speech recognition. Depending on the post-stage processing, the inverse Fourier transform unit may be omitted.

As described above, the signal separation device shown in FIG. 15 receives a signal in which a plurality of sound signals are mixed and separates it into individual sound signals. It has signal conversion means (STFT unit 403) that converts the input signal into the time-frequency domain and generates an observation signal spectrogram, and signal separation means (signal separation unit 404) that generates a signal separation result from the observation signal spectrogram generated by the signal conversion means (STFT unit 403). The signal separation means (signal separation unit 404) interprets the observation signal spectrogram as an observation signal that is convolutively mixed in the time-frequency domain, and generates the signal separation result by executing the convolution operation in the convolution operation unit 408.

The signal conversion means (STFT unit 403) applies a short-time Fourier transform (STFT) to the input signal to convert it into the time-frequency domain and generate the observation signal spectrogram.

The signal separation means (signal separation unit 404) sets the separated signal Y(t) of frame number t as a convolutive mixture of the observation signals X(t−L') to X(t), and generates the signal separation result by processing that increases the mutual independence of Y1(t) to Yn(t), the individual sound signal components contained in Y(t). Specifically, the Kullback-Leibler information I(Y) is used as the measure of independence, and the signal separation result is generated by updating the separation matrix so as to minimize I(Y).
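For reference, the independence measure can be written out as follows, assuming that I(Y) takes the usual mutual-information form, in which H(·) denotes entropy and P(·) a probability density (the corresponding equations of this document are not reproduced in this text, so the notation is an assumption):

```latex
I(Y) \;=\; \sum_{k=1}^{n} H(Y_k) \;-\; H(Y)
     \;=\; \int P(Y)\,\log\frac{P(Y)}{\prod_{k=1}^{n} P(Y_k)}\,dY \;\ge\; 0
```

I(Y) is the Kullback-Leibler divergence between the joint density and the product of the marginals, and it is zero exactly when Y_1, ..., Y_n are mutually independent, which is why minimizing it with respect to the separation matrix drives the outputs toward independence.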

As a device configuration for executing the processing (3), which separates observation signals convolutively mixed in the time-frequency domain by combining shift stacking with instantaneous-mixture ICA, the configuration shown in FIG. 15 with the convolution operation unit 408 omitted can be used, for example. The processing executed in the signal separation unit differs, however.

In a device that performs the processing (3), which combines shift stacking with instantaneous-mixture ICA, the STFT unit 403 functions as the signal conversion means that converts the input signal into the time-frequency domain and generates the observation signal spectrogram, and the signal separation unit 404 generates the signal separation result from the observation signal spectrogram generated by the signal conversion means. As described earlier with reference to FIG. 11 to FIG. 14, the signal separation unit 404 shifts the observation signal spectrogram in the frame direction, generates an observation signal spectrogram shift set in which copies with different shift amounts are stacked, and generates the signal separation result by applying instantaneous-mixture ICA (Independent Component Analysis) to that shift set. The processing that applies instantaneous-mixture ICA follows the technique disclosed in JP-A-2006-238409. That is, a separated signal in the time-frequency domain is generated from the observation signal in the time-frequency domain and the separation matrix; the separation matrix is corrected until the generated separated signal and the separation matrix computed with the multidimensional score function derived from a multidimensional probability density function have nearly converged; and the corrected separation matrix is applied to generate the separated signal in the time-frequency domain.

(2) Configuration that executes the method of converting to a modulation spectrogram and then solving an instantaneous mixture
Next, the configuration and processing of the signal separation device shown in FIG. 16, which executes the method of converting to a modulation spectrogram and then solving an instantaneous mixture, are described. Overall control of the processing described below is performed by the control unit 461. The control unit 461 controls the processing according to, for example, a program that executes the processing described below and is stored in advance in a storage unit (not shown) of the device. The processing of each component is as follows. Independent sounds emitted by a plurality of sound sources are observed with a plurality of microphones 451, and the AD conversion unit 452 converts the input analog signals into digital signals to obtain digital observation signals.

The digital observation signals are input to the first short-time Fourier transform (STFT) unit 453, where short-time Fourier transform (STFT) processing is performed to obtain spectrograms of the observation signals. The signal obtained at this stage is, for example, the spectrogram X shown in FIG. 8(b). The spectrogram obtained by this first-stage STFT is then input to the second short-time Fourier transform (STFT) unit 454, where the STFT is applied again to each frequency bin to obtain a modulation spectrogram.

The modulation spectrogram obtained by the short-time Fourier transform (STFT) in the second STFT unit 454 is, for example, the modulation spectrogram X' shown in FIG. 9(c) and (d).
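The second-stage transform can be sketched as an STFT taken along the frame axis of each frequency bin. The frame length, hop size, window, output layout, and function name are illustrative assumptions; a full complex FFT is used because the input spectrogram is complex-valued.

```python
import numpy as np

def modulation_spectrogram(spec, frame_len=16, hop=4):
    """Second-stage STFT over the frames of a spectrogram (sketch).

    spec    : complex array (F, T), one channel's spectrogram
    returns : complex array (F, frame_len, n_frames), where axis 1 is
              the modulation-frequency bin and axis 2 the new frame index
    """
    F, T = spec.shape
    window = np.hanning(frame_len)
    n_frames = 1 + (T - frame_len) // hop
    # For every frequency bin, cut windowed segments along the frame axis.
    segments = np.stack(
        [spec[:, i * hop : i * hop + frame_len] * window for i in range(n_frames)],
        axis=2,
    )
    # FFT along the segment axis gives the modulation frequencies.
    return np.fft.fft(segments, axis=1)
```

Applying this to each of the n channel spectrograms yields the modulation spectrogram X', which the signal separation unit then treats as an instantaneous mixture.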

The signal separation unit 455 receives the modulation spectrogram X' and separates it into independent components. This separation is the processing described earlier with reference to FIG. 10. FIG. 10 corresponds to the three-dimensional modulation spectrogram X' shown in FIG. 9(c). In this three-dimensional modulation spectrogram, for the entropy calculation of the first channel, for example, the modulation spectrogram Y1'(t) 221 of the first frame in FIG. 10 represents a plane, and the entropy H(Y1') 223 is obtained by substituting Y1'(t) into the multivariate probability density function P(Y1'(t)) 222 that takes it as its argument. Equation [9.3] is identical to Equation [3.5] apart from the variable names. The learning rule is therefore derived simply by renaming the variables of Equation [5.2], which gives Equation [9.5]. That is, repeating Equations [9.3], [9.5], and [9.6] until W' converges makes Y1'(t) to Yn'(t) mutually independent.

Next, the first rescaling unit 456 rescales the modulation spectrogram. Rescaling is the processing of adjusting the scale of each frequency bin. The first inverse Fourier transform (FT) unit 457 then applies an inverse Fourier transform to the rescaled modulation spectrogram to convert it into a spectrogram. After that, the second rescaling unit 458 performs rescaling again, and the second inverse Fourier transform (FT) unit 459 applies an inverse Fourier transform to the rescaled spectrogram to convert it into a waveform. The signal converted into a waveform is sent to the post-stage processing execution unit 461 as necessary, where the required post-stage processing is executed. Post-stage processing includes, for example, playback from a loudspeaker and speech recognition.

As described above, the signal separation device shown in FIG. 16 receives a signal in which a plurality of signals are mixed and separates it into the individual signals. It has first signal conversion means (first STFT unit 453) that converts the input signal into the time-frequency domain to generate an observed-signal spectrogram; second signal conversion means (second STFT unit 454) that performs a data conversion on the observed-signal spectrogram generated by the first signal conversion means to generate a modulation spectrogram; and signal separation means (signal separation unit 455) that generates a signal separation result from the modulation spectrogram generated by the second signal conversion means. The signal separation means (signal separation unit 455) interprets the modulation spectrogram as an instantaneous mixture when generating the separation result.

The first signal conversion means (first STFT unit 453) applies a short-time Fourier transform (STFT) to the input signal, converting it into the time-frequency domain and generating the observed-signal spectrogram. The second signal conversion means (second STFT unit 454) then applies an STFT along the time direction of the observed-signal spectrogram to generate the modulation spectrogram.
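The two conversion stages can be sketched as follows. This is only an illustration under stated assumptions: the function names, the Hann window, the naive DFT, and the frame and shift sizes are all hypothetical choices made for the example and are not taken from this disclosure.

```python
import cmath
import math


def stft(samples, frame_len, shift):
    """Naive STFT: returns frames[t][k], with frame_len complex bins per frame."""
    # Hann window (an assumption; any analysis window could be used).
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_len)
              for n in range(frame_len)]
    frames = []
    for start in range(0, len(samples) - frame_len + 1, shift):
        seg = [samples[start + n] * window[n] for n in range(frame_len)]
        frames.append([
            sum(seg[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                for n in range(frame_len))
            for k in range(frame_len)
        ])
    return frames


def modulation_spectrogram(samples, frame_len, shift, frame_len2, shift2):
    """First STFT over the waveform, then a second STFT along time per bin."""
    spec = stft(samples, frame_len, shift)      # spec[t][omega]
    n_bins = frame_len // 2 + 1                 # non-negative frequencies only
    result = []
    for omega in range(n_bins):
        series = [frame[omega] for frame in spec]   # one bin as a time series
        result.append(stft(series, frame_len2, shift2))
    return result                               # result[omega][t2][mod_bin]
```

The result is indexed by frequency bin, second-stage frame, and modulation bin, which corresponds to the three-dimensional form of the modulation spectrogram; flattening the first and last indices would give the planar form.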

The signal separation means (signal separation unit 455) generates the separation result by a process that increases the independence of the signal components Y1' to Yn', contained in the modulation spectrogram, that correspond to the separated signals. Specifically, it uses the Kullback-Leibler divergence as the measure of independence and generates the separation result by updating the separation matrix so as to minimize that divergence.

The inverse Fourier transform means (first inverse FT unit 457) applies an inverse Fourier transform to each of the signal components Y1' to Yn' obtained by the signal separation means (signal separation unit 455), generating the spectrograms Y1 to Yn that correspond to the separated signals.

An example of the processing sequence executed by the signal separation device of the present invention is described with reference to the flowchart shown in FIG. 17. In step S101, sound is observed with microphones. For example, as described earlier with reference to FIG. 5, the microphones capture a mixture of the sounds output from a plurality of sound sources. Next, in step S102, a short-time Fourier transform (STFT) is applied to the observed signal to obtain a spectrogram. The STFT is the process described earlier with reference to FIG. 7. The resulting spectrogram is, for example, the one shown in FIG. 6(b).

Next, in step S103, separation by ICA is applied to the spectrogram of the observed signal. The processing sequence of this separation is described in detail later. In step S104, an inverse Fourier transform (IFT) is applied to the separation result as needed, and in step S105 post-processing is then performed as needed.

The detailed sequence of the separation process executed in step S103 is described with reference to the flowcharts shown in FIGS. 18 and 19.
The separation sequences shown in FIGS. 18 and 19 are the specific sequences executed by the signal separation devices described earlier with reference to FIGS. 15 and 16, respectively:
FIG. 18 shows the separation process of the device of FIG. 15, which solves the convolutive mixture in the time-frequency domain;
FIG. 19 shows the separation process of the device of FIG. 16, which converts to a modulation spectrogram and then solves an instantaneous mixture.

First, with reference to FIG. 18, the separation process executed by the signal separation device of FIG. 15 is described. This is the method that solves the convolutive mixture in the time-frequency domain, that is, the separation sequence that performs deconvolution in the time-frequency domain.

First, in step S201, the observed-signal spectrogram is normalized. Normalization here means setting the mean to 0 and the variance to 1 for each frequency bin of the spectrogram, or adjusting them to other values convenient for the subsequent processing. Next, in step S202, the separation matrices are initialized, that is, initial values are assigned to the separation matrices W^[τ]. It suffices to assign the identity matrix to W^[0] and the zero matrix to W^[τ] for τ > 0. Alternatively, if separation matrices obtained in a previous learning run are available, they may be used as the initial values.
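Steps S201 and S202 might look like the following minimal sketch. The function names and the nested-list data layout are assumptions made for illustration; the mean and standard deviation are returned so that the normalization can be undone later if required.

```python
import math


def normalize_bin(values):
    """Normalize one frequency bin (complex values over frames) to mean 0, variance 1.

    Returns the normalized values plus the mean and standard deviation that
    were removed, so the original scale can be restored afterwards.
    """
    n = len(values)
    mean = sum(values) / n
    centered = [v - mean for v in values]
    std = math.sqrt(sum(abs(c) ** 2 for c in centered) / n)
    if std == 0.0:
        return centered, mean, 1.0          # constant bin: nothing to scale
    return [c / std for c in centered], mean, std


def init_separation_matrices(n_channels, n_taps):
    """Return W[0..n_taps]: identity for tau = 0, zero matrices for tau > 0."""
    return [
        [[(1.0 + 0j) if (tau == 0 and r == c) else 0j
          for c in range(n_channels)]
         for r in range(n_channels)]
        for tau in range(n_taps + 1)
    ]
```

With this initialization the first pass through the learning loop reproduces the observed signal as the separation result, since only the zero-lag matrix is non-zero.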

Steps S203 to S210 form the learning loop, which is repeated until the separation matrices and the separation result converge. The loop consists of the following steps, executed repeatedly:
Step S203: determine whether the separation matrices have converged,
Step S204: compute the separated signal Y,
Step S205: start of the frequency-bin loop (ω = 1, ..., M),
Step S206: start of the frame-tap loop (τ = 0, ..., L),
Step S207: compute the increment ΔW^[τ] corresponding to the τ-th frame tap,
Step S208: end of the frame-tap loop,
Step S209: update W^[0](ω) to W^[L'](ω),
Step S210: end of the frequency-bin loop.

The separation result Y in step S204 is computed with Equation [6.2] or Equation [6.3] described earlier (where Y = [Y(1), ..., Y(T)]). Steps S205 to S210 form the loop over frequency bins: with M denoting the number of frequency bins, steps S206 to S209 are repeated for each frequency ω satisfying 1 ≤ ω ≤ M. The frequency bins may be processed in parallel instead of in a loop. In the technique shown in Japanese Patent Application Laid-Open No. 2006-238409, an earlier published application by the same applicant, only one separation matrix (or one per frequency bin) had to be estimated, whereas the present invention must estimate as many separation matrices as there are frame taps. The loop of steps S206 to S208 therefore runs once per frame tap.
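To illustrate why one separation matrix per frame tap is needed, the sketch below computes the separation result for a single frequency bin as a sum over frame taps. Equations [6.2] and [6.3] are not reproduced in this section, so the causal form Y(t) = Σ_τ W^[τ] X(t − τ) used here is an assumption, as are the function name and the treatment of frames before t = 0 as zero.

```python
def separate_bin(W, X):
    """Apply tap matrices to one frequency bin.

    W[tau][i][j] : separation matrix for frame tap tau (tau = 0 .. L')
    X[t][j]      : observed spectrogram values of channel j at frame t
    Returns Y[t][i] = sum over tau, j of W[tau][i][j] * X[t - tau][j],
    where frames with t - tau < 0 are treated as zero (an assumption).
    """
    n_out = len(W[0])
    Y = []
    for t in range(len(X)):
        y_t = [0j] * n_out
        for tau, W_tau in enumerate(W):
            if t - tau < 0:
                continue
            x = X[t - tau]
            for i in range(n_out):
                y_t[i] += sum(W_tau[i][j] * x[j] for j in range(len(x)))
        Y.append(y_t)
    return Y
```

With a single tap (only W[0]) this reduces to the instantaneous case, in which each frame is multiplied by one matrix.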

In step S207, the increment ΔW^[τ](ω) corresponding to the τ-th frame tap is obtained using Equation [7.1]. As noted earlier, R_ω^[l] in Equation [7.1] differs depending on whether Equation [6.2] or Equation [6.3] was used to compute the separation result Y.

When Equation [6.2] is used to compute the separation result Y, R_ω^[l] is computed with Equation [7.2] or Equation [8.1]; when Equation [6.3] is used, R_ω^[l] is computed with Equation [7.3] or Equation [8.2].

After leaving the frame-tap loop of steps S206 to S208, step S209 updates the separation matrices W^[0](ω) to W^[L'](ω) using Equation [7.8]. This update may instead be performed for all frequency bins together after step S210. (Note, however, that it cannot be placed inside the frame-tap loop.)

After leaving the frequency-bin loop of steps S205 to S210, the process returns to the convergence check of step S203. When step S203 determines that the separation matrices have converged (or that the loop has run a predetermined number of times), the process takes the right-hand branch and proceeds to step S211.

The convergence decision in step S203 may be made, for example, by checking whether the norm ‖ΔW‖ of ΔW has fallen below a certain value (the matrix norm being computed, for example, with Equation [7.10] above), or whether ‖ΔW‖/‖W‖ has fallen below a certain value. Alternatively, a fixed number of loop iterations may simply be set in advance and executed.
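The three stopping criteria described here could be combined as in the sketch below. Equation [7.10] is not reproduced in this section, so the Frobenius norm is used as an assumed stand-in; the tolerance, the iteration cap, and the function names are likewise illustrative.

```python
import math


def frobenius_norm(mats):
    """Frobenius norm taken over a list of matrices (assumed norm definition)."""
    return math.sqrt(sum(abs(x) ** 2 for m in mats for row in m for x in row))


def has_converged(delta_W, W, iteration, tol=1e-4, max_iter=200, relative=True):
    """Stop when the update is small or the iteration budget is used up."""
    if iteration >= max_iter:
        return True                      # fixed-loop-count criterion
    dn = frobenius_norm(delta_W)
    if relative:
        wn = frobenius_norm(W)
        return wn > 0 and dn / wn < tol  # ||dW|| / ||W|| criterion
    return dn < tol                      # absolute ||dW|| criterion
```

The relative criterion has the practical advantage that the threshold does not depend on the overall scale of the separation matrices.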

When step S203 determines that the separation matrices have not yet converged, steps S204 to S210 are repeated. When step S203 determines that the separation matrices have converged (or that the loop has run a predetermined number of times), the process takes the right-hand branch to step S211, where rescaling is performed. Rescaling is a process that adjusts the scale of each frequency bin. If the mean or variance of the frequency bins was changed in the normalization step (S201), it is restored here as needed.

The rescaling coefficients used in step S211 are obtained as follows. Using Equation [7.11] shown earlier, the scale that minimizes the squared error between the observed signal and the separation result in a given frequency bin is found (specifically, by the least-squares method or the like). The separation result is then updated to its value multiplied by that scale (Equation [7.12]). If necessary, the separation matrix itself is updated in the same way (Equation [7.13]).
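For a single channel and a single frequency bin, a least-squares scale of this kind has a closed form. The sketch below assumes the objective is the sum over frames of |X(t) − αY(t)|² with one complex scale α; Equation [7.11] itself is not reproduced here, so this form and the function names are assumptions.

```python
def least_squares_scale(observed, separated):
    """Complex scale alpha minimizing sum_t |observed[t] - alpha * separated[t]|^2."""
    num = sum(x * y.conjugate() for x, y in zip(observed, separated))
    den = sum(abs(y) ** 2 for y in separated)
    return num / den if den else 0j


def rescale(observed, separated):
    """Return the separation result multiplied by its least-squares scale."""
    alpha = least_squares_scale(observed, separated)
    return [alpha * y for y in separated]
```

Because ICA leaves the scale of each output undetermined, fitting the output back to the observation in this way fixes the scale per frequency bin.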

Alternatively, rescaling may be performed as follows. Using Equation [7.14], the observed signal is expressed as a linear sum of the separation results and a constant. The scales α_k1(ω) to α_kn(ω) and the constant term β_k(ω) are obtained with Equation [7.15] (specifically, by the least-squares method or the like). Once the scales have been found, the separation result is updated using Equation [7.16]. (The separation matrix is also updated if necessary.)

Outputting every term α_kj(ω)Y_j(ω, t) that appears in Equation [7.14] yields output in single-input multiple-output (SIMO) form. SIMO output of ICA means decomposing the observed signal into the components originating from each sound source. For example, if Y_j is the estimate of the j-th sound source, α_kj(ω)Y_j(ω, t) represents the component of the signal observed at the k-th microphone that originates from the j-th sound source. This concludes the description of the flowchart for solving the convolutive mixture in the time-frequency domain.

Next, the processing for solving an instantaneous mixture in the modulation spectrogram domain is described with reference to the flowchart shown in FIG. 19. FIG. 19 is the detailed sequence of the separation process executed by the signal separation device of FIG. 16, which converts to a modulation spectrogram and then solves an instantaneous mixture.

Step S301 normalizes the observed-signal spectrogram. This is the same as the normalization of step S201 in the flow of FIG. 18: for each frequency bin of the spectrogram, the mean is set to 0 and the variance to 1, or they are adjusted to other values convenient for the subsequent processing. In step S302, a short-time Fourier transform (STFT) is applied to each frequency bin to generate the modulation spectrogram, that is, the modulation spectrogram X' shown in FIGS. 9(c) and 9(d).

As described earlier with reference to FIG. 16, generating this modulation spectrogram requires two stages. First, the first short-time Fourier transform (STFT) unit 453 applies an STFT to the digital observed signal to obtain the spectrogram of the observed signal (for example, the spectrogram X shown in FIG. 8(b)). The spectrogram obtained by this first-stage STFT is then input to the second STFT unit 454, which applies an STFT again to each frequency bin. The modulation spectrogram obtained by the STFT in the second STFT unit 454 is, for example, the modulation spectrogram X' shown in FIGS. 9(c) and 9(d).

As shown in FIGS. 9(c) and 9(d), the modulation spectrogram can be expressed in a three-dimensional (cuboid) form (corresponding to Equation [9.2]) or in a planar form (corresponding to Equation [9.3]). The following description uses the planar form. That is, the bins along both the vertical direction and the depth direction in FIG. 9(c) are expressed together by a single index ω'.

In step S303, normalization is performed again, this time for each bin ω' of the modulation spectrogram. Before the learning loop, step S304 assigns an initial value to the separation matrix W'. The initial value may be the identity matrix, or a separation matrix obtained in a previous learning run.

Steps S305 to S310 form the learning loop, which is repeated until the separation matrix W' converges (or for a fixed number of iterations). The convergence decision in step S305 is the same as that of step S203 described with reference to FIG. 18. It may be made, for example, by checking whether the norm ‖ΔW'‖ of ΔW' has fallen below a certain value (the matrix norm being computed, for example, with Equation [7.10] above), or whether ‖ΔW'‖/‖W'‖ has fallen below a certain value. Alternatively, a fixed number of loop iterations may simply be set in advance and executed.

In step S306, the separation-result modulation spectrogram Y' is computed by evaluating Equation [9.3] for all ω' and t.

Steps S307 to S310 form a loop over each bin ω' of the modulation spectrogram shown in FIG. 9(c), that is, over the bins in both the vertical and depth directions. Instead of iterating over the bins in a loop, the bins may be processed in parallel. Step S308 computes the increment of the separation matrix (Equation [9.5]), and step S309 updates the separation matrix (Equation [9.6]).

After the loop is exited at step S310, the process returns to the convergence decision of step S305. When step S305 determines that the separation matrix has converged (or that the loop has run a predetermined number of times), the process takes the right-hand branch. In step S311, rescaling is performed. Rescaling is a process that adjusts the scale of each bin; here it is applied to the modulation spectrogram of the separation result. The method is almost the same as that of step S211 described earlier with reference to FIG. 18 and is based on Equations [7.11] to [7.16] with Y, X, and W replaced by Y', X', and W' as appropriate. The normalization of step S301 is also undone if necessary.

Next, step S312 executes the inverse Fourier transform (FT) that converts the modulation spectrogram into a spectrogram, applying WOLA or a similar technique as needed. That is, in this inverse Fourier transform as well, the per-frame inverse-transform results (waveforms) are superimposed with overlap; this is called overlap-add. A window function such as a sine window may be applied again to each inverse-transform result before the overlap-add; this is called weighted overlap-add (WOLA). WOLA reduces the noise caused by discontinuities between frames.
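A minimal sketch of weighted overlap-add is given below. It assumes the frames have already been inverse-transformed, that a sine window was also used at the analysis stage, and that the shift is half the frame length; the function names are hypothetical.

```python
import math


def sine_window(n):
    """Sine window of length n."""
    return [math.sin(math.pi * (i + 0.5) / n) for i in range(n)]


def weighted_overlap_add(frames, shift):
    """Window each inverse-transformed frame again, then overlap-add.

    frames: list of equal-length sequences (real or complex values)
    shift : hop size between consecutive frames
    """
    frame_len = len(frames[0])
    window = sine_window(frame_len)
    out = [0.0] * (shift * (len(frames) - 1) + frame_len)
    for idx, frame in enumerate(frames):
        start = idx * shift
        for n in range(frame_len):
            out[start + n] += window[n] * frame[n]   # synthesis window, then add
    return out
```

With a sine window at both stages and a shift of half the frame length, the squared windows of adjacent frames sum to one, so samples covered by two frames are reconstructed without amplitude modulation.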

Then, in step S313, the spectrogram is rescaled. This is the same process as the rescaling of step S311.

In the inverse Fourier transform (FT) executed in step S104 of the flow in FIG. 17 and in step S312 of the flow in FIG. 19, the inverse transform is applied, as needed, not only to the spectrogram or modulation spectrogram of the separation result but also to the separation matrix itself.

[Modifications]
Next, modifications of the embodiments described above are explained. In the embodiments above, the number of frame taps L' used when generating the separation result from the observed signal was the same for all frequencies. Instead of using a single value for all frequencies, the value of L' may be varied from one frequency to another.

For example, high-frequency components decay more rapidly than low-frequency components, so their reverberation time is shorter. The number of frame taps L' may therefore be made smaller in frequency bins corresponding to high frequencies than in low-frequency bins. This reduces the amount of computation while maintaining separation performance.

In the method described with reference to the signal separation device of FIG. 16 and the flowchart of FIG. 19, that is, the separation process that converts to a modulation spectrogram and then solves an instantaneous mixture, the second short-time Fourier transform (STFT) can use not only a different number of frame taps L' for each frequency bin but also a different shift width. However, when the number of frame taps or the shift width differs between frequency bins, the time length per frame in the modulation spectrogram may also differ between bins.

For example, suppose that in the second short-time Fourier transform (STFT)
low frequencies use 32 taps with a shift width of 16, and
high frequencies use 16 taps with a shift width of 8.
The time length per frame in the resulting modulation spectrogram is then twice as long for the low frequencies as for the high frequencies. In other words, the low frequencies have fewer frames per unit time (half as many).

When the time length per frame is constant, Yk'(t) 221 can be cut out of the modulation spectrogram as shown in FIG. 10 and the independence can be computed; when it is not constant, this is difficult. In that case, the frame mismatch is handled by one of the methods described below (Methods 1 to 3).

(Method 1) Decimation of frame data
In the generated modulation spectrogram, data are thinned out from the bins that have more frames per unit time so that their number of data points matches that of the bins with fewer frames. In the example above of 32 taps with shift 16 and 16 taps with shift 8, discarding every other data point from the bins transformed with 16 taps and shift 8 makes the number of frames per unit time equal in both (that is, the time per frame becomes the same).

(Method 2) Interpolation of frame data
In contrast to Method 1, this method matches the bins with fewer data points to those with more. In the example above of 32 taps with shift 16 and 16 taps with shift 8, data are interpolated for the bins transformed with 32 taps and shift 16. For example, new data are inserted between frames by taking the average of the preceding and following frame data.

(Method 3) Duplication of frame data
Like Method 2, this method matches the bins with fewer data points to those with more. In the example above of 32 taps with shift 16 and 16 taps with shift 8, each data point of the bins transformed with 32 taps and shift 16 is repeated twice so that the number of data points matches that of the bins transformed with 16 taps and shift 8.
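The three frame-alignment methods can be sketched as simple list operations on the frame sequence of one bin. The function names are hypothetical, and the factor of 2 matches the 32-tap/16-shift versus 16-tap/8-shift example only.

```python
def decimate(frames, factor=2):
    """Method 1: keep every `factor`-th frame of the bin with more frames."""
    return frames[::factor]


def interpolate(frames):
    """Method 2: insert the average of each pair of neighbouring frames.

    Note that n input frames give 2n - 1 output frames, so the last frame
    may still need padding to match the denser bin exactly.
    """
    if not frames:
        return []
    out = []
    for a, b in zip(frames, frames[1:]):
        out.append(a)
        out.append((a + b) / 2)
    out.append(frames[-1])
    return out


def duplicate(frames, factor=2):
    """Method 3: repeat each frame `factor` times."""
    return [f for f in frames for _ in range(factor)]
```

Method 1 discards data from the denser bins, whereas Methods 2 and 3 keep all data at the cost of a larger number of frames to process.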

Next, a modification is described for the method explained with reference to FIGS. 11 to 14, namely (3) the method that combines shift-stacking with instantaneous-mixture ICA. In that method, the observed-signal spectrograms are stacked while being shifted, and instantaneous-mixture ICA (for example, the technique described in Japanese Patent Application Laid-Open No. 2006-238409) is applied to the stack to separate observed signals that are convolutively mixed in the time-frequency domain.
The modification varies, from one frequency to another, the value of L', that is, the number of frame taps used when generating the separation result from the observed signal.

This modification, in which the number of frame taps L' differs from one frequency to another, can be realized as follows. Let L'(ω) denote the value of L' for each frequency bin. In the shift processing described with reference to FIG. 11, once the shift amount exceeds L'(ω), the data of that frequency bin are replaced with 0. The method is described with reference to FIG. 20.

Suppose that the number of frame taps L'(ω) is to be varied with the frequency bin number ω as follows (M is the number of frequency bins per spectrogram):
For 1 ≤ ω < M/4, L'(ω) = 2
For M/4 ≤ ω < M/2, L'(ω) = 1
For M/2 ≤ ω < M, L'(ω) = 0
To achieve this, the following operations are applied to the data X_k^[0], X_k^[1], and X_k^[2] generated by the shift processing described earlier with reference to FIG. 11:
X_k^[0] is X_k unchanged. (A shift of 0 is required in every frequency bin.)
X_k^[1] has the frequency bins with M/2 ≤ ω masked with 0. (For M/2 ≤ ω, shifts of 1 or more are unnecessary.)
X_k^[2] has the frequency bins with M/4 ≤ ω masked with 0. (For M/4 ≤ ω, shifts of 2 or more are unnecessary.)
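The masking rule above can be sketched as follows. The function names and the row-per-bin data layout are assumptions for illustration; bin numbers start at 1 as in the text, and the topmost bin ω = M is treated like the other high-frequency bins.

```python
def taps_for_bin(omega, n_bins):
    """L'(omega) for 1 <= omega <= n_bins, using the three bands in the text."""
    if omega < n_bins / 4:
        return 2
    if omega < n_bins / 2:
        return 1
    return 0


def mask_shifted(shifted, shift, n_bins):
    """Zero out the bins of a shifted spectrogram whose L'(omega) < shift.

    shifted[omega - 1][t] holds the spectrogram already shifted by `shift`
    frames; rows for bins where this shift is not needed are replaced by zeros.
    """
    return [
        row if shift <= taps_for_bin(omega, n_bins) else [0j] * len(row)
        for omega, row in enumerate(shifted, start=1)
    ]
```

This version materializes the zeros for clarity; an implementation concerned with memory could skip the masked rows on access instead of storing them.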

Specifically, as shown in FIG. 20(b), the data portions 5112 filled in black are the locations masked with 0. In an actual implementation there is no need to allocate memory for the masked portions; skipping the masked locations when accessing the spectrogram prevents any increase in processing time or memory usage.

Combining a conventional instantaneous-mixture ICA in the time-frequency domain (for example, Japanese Patent Application Laid-Open No. 2006-238409) as preprocessing for the present invention can curb the increase in processing time to some extent. The combination of the two is described below, taking each of the following processing examples in turn.
(1) Basic two-stage separation
(2) Reduction of the number of channels
(3) Use for dereverberation

(1) Basic two-stage separation
When conventional instantaneous-mixture ICA in the time-frequency domain uses an analysis frame (or analysis window) shorter than the reverberation, interfering sound that spans multiple frames cannot be removed completely. On the other hand, its computational cost is lower than that of the present invention (for the same analysis frame length in the first STFT). Therefore, if separation is first performed with conventional time-frequency-domain ICA and the resulting spectrogram is then separated further by the method of the present invention, equivalent separation accuracy can be achieved in less time than when the present invention alone is used from the start.

In particular, when the "(1) solve the convolutive mixture directly in the time-frequency domain" method of the present invention is used, the conventional method and the present invention can be run seamlessly. This exploits the fact that setting L' = 0 in Equations [7.2] and [8.1] (or Equations [7.3] and [8.2]) makes them equivalent to the conventional method. In the learning loop of steps S203 to S210 in the flow of FIG. 18, L' = 0 is used while the loop count is small, and L' is set to its intended value once the loop count exceeds a certain value. Alternatively, L' may be increased gradually as the loop count grows.
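The schedule for L' over the learning loop could be expressed as a small helper like the one below. The parameter names, the warm-up length, and the ramp step are illustrative assumptions; the text specifies only that L' starts at 0 and later reaches its intended value, either at once or gradually.

```python
def taps_at_iteration(iteration, target_taps, warmup_iters,
                      ramp=False, ramp_every=10):
    """Number of frame taps L' to use at a given learning-loop iteration.

    During the first `warmup_iters` iterations L' = 0 (the conventional
    instantaneous-mixture case). Afterwards L' either jumps to `target_taps`
    or, with ramp=True, grows by one every `ramp_every` iterations.
    """
    if iteration < warmup_iters:
        return 0
    if not ramp:
        return target_taps
    return min(target_taps, 1 + (iteration - warmup_iters) // ramp_every)
```

For example, with `target_taps=4` and `warmup_iters=20`, iterations 0 to 19 run with L' = 0 and iteration 20 onward use L' = 4, unless the gradual ramp is selected.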

（２）チャンネル数の削減
一般に、ＩＣＡの計算量は、チャンネル数の２乗に比例する。そのため、チャンネル数を削減することができれば、計算量を大幅に削減することができる。２段階分離を用いると、本発明のステップのチャンネル数を削減することもできる。その方法について説明する。 (2) Reduction of the number of channels Generally, the amount of calculation of ICA is proportional to the square of the number of channels. Therefore, if the number of channels can be reduced, the amount of calculation can be greatly reduced. The use of two-stage separation can also reduce the number of channels in the steps of the present invention. The method will be described.

時間周波数領域のＩＣＡにおいて、音源数よりもマイク数の方が多い場合、出力チャンネルのうちのいくつかは、どの音源にも対応しないと判定される信号が出力される。例えば、マイク数＝４・音源数＝３の場合、出力チャンネルのうち３つは音源に対応しているが、残りの１つはどの音源にも対応しない、背景雑音と残響音とが混ざったような信号が出力される。このような出力は、他のチャンネルと比べてパワーが極端に小さかったり、他のどのチャンネルとも相関があったりするため、容易に検出できる。 In time-frequency domain ICA, when there are more microphones than sound sources, some of the output channels carry signals that are judged not to correspond to any sound source. For example, with 4 microphones and 3 sound sources, three of the output channels correspond to sound sources, while the remaining one corresponds to no sound source and outputs a signal that resembles a mixture of background noise and reverberation. Such an output can be detected easily because its power is extremely small compared with the other channels, or because it is correlated with every other channel.
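The two detection cues mentioned above (extremely small power, or correlation with every other channel) can be sketched as follows. This is a minimal sketch under assumptions: the function names and the thresholds `power_ratio` and `corr_threshold` are hypothetical, and the description does not specify how the judgment is actually made.

```python
import math


def channel_power(x):
    """Mean power of one output channel (real or complex samples)."""
    return sum(abs(v) ** 2 for v in x) / len(x)


def normalized_correlation(x, y):
    """Magnitude of the normalized inner product of two channels (0..1)."""
    num = abs(sum(a * complex(b).conjugate() for a, b in zip(x, y)))
    den = math.sqrt(sum(abs(a) ** 2 for a in x) * sum(abs(b) ** 2 for b in y))
    return num / den if den > 0 else 0.0


def find_unneeded_channels(channels, power_ratio=0.01, corr_threshold=0.5):
    """Return indices of channels that look like 'no sound source' outputs.

    Assumed criteria (thresholds are illustrative only):
    - power below power_ratio times the strongest channel, or
    - correlation above corr_threshold with every other channel.
    """
    powers = [channel_power(c) for c in channels]
    max_power = max(powers)
    unneeded = []
    for i, c in enumerate(channels):
        low_power = powers[i] < power_ratio * max_power
        corrs = [normalized_correlation(c, d)
                 for j, d in enumerate(channels) if j != i]
        correlated_with_all = bool(corrs) and min(corrs) > corr_threshold
        if low_power or correlated_with_all:
            unneeded.append(i)
    return unneeded


# Toy data: three mutually uncorrelated "sources" and one weak leftover channel.
s1 = [1, -1, 0, 0, 0, 0]
s2 = [0, 0, 1, -1, 0, 0]
s3 = [0, 0, 0, 0, 1, -1]
leftover = [0.01 * (a + b + c) for a, b, c in zip(s1, s2, s3)]
detected = find_unneeded_channels([s1, s2, s3, leftover])
```

In practice the channels would be separated spectrograms rather than short toy sequences, and the thresholds would need tuning for the recording conditions.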

そこで、２段階分離においては、図２１に示すフローチャートに従って、まず、ステップＳ５０１で時間周波数領域の瞬時混合ＩＣＡによる分離処理を行なう。この処理は、特開２００６−２３８４０９において開示した処理として実行可能である。その後、ステップＳ５０２において、例えば「どの音源にも対応しないと判定される出力」（不要チャンネル）を除去した後、上述した本発明に従った処理、すなわち、
（１）時間周波数領域において、畳み込み混合を直接解く。
（２）スペクトログラムを時間方向へもう一度短時間フーリエ変換（ＳＴＦＴ）し、瞬時混合として解く。
（３）シフト積み重ねと瞬時混合ＩＣＡを組み合わせた処理によって解く。
これらのいずれかの処理によって、時間周波数領域において畳み込み混合された観測信号を分離する処理を実行すれば、分離処理における計算量を削減することができる。なお、入力チャンネル数＝音源数であれば分離は可能であるため、ステップＳ５０２においてチャンネル数を削減しても分離精度には影響しない。 Therefore, in the two-stage separation, according to the flowchart shown in FIG. 21, first, separation processing by the time-frequency domain instantaneous mixing ICA is performed in step S501. This process can be executed as the process disclosed in JP-A-2006-238409. Thereafter, in step S502, for example, after removing “output determined not to correspond to any sound source” (unnecessary channel), the processing according to the present invention described above, that is,
(1) Solve convolutional mixing directly in the time-frequency domain.
(2) The spectrogram is again subjected to short-time Fourier transform (STFT) in the time direction and solved as instantaneous mixing.
(3) Solve by a process combining shift stacking and instantaneous mixing ICA.
If any of these processes is performed to separate the convolution mixed observation signals in the time-frequency domain, the amount of calculation in the separation process can be reduced. Since the separation is possible if the number of input channels = the number of sound sources, even if the number of channels is reduced in step S502, the separation accuracy is not affected.

例えば、上記の（１）時間周波数領域において、畳み込み混合を直接解く方式に、この２段階処理を適用した場合は、信号分離手段が、観測信号スペクトログラムに対して、瞬時混合ＩＣＡ（ＩｎｄｅｐｅｎｄｅｎｔＣｏｍｐｏｎｅｎｔＡｎａｌｙｓｉｓ）を適用した処理により第１の信号分離結果を生成し、該第１の信号分離結果から、どの音源にも対応しないと判定される不要チャンネル除去処理を実行して、除去処理後に残存する観測信号スペクトログラムに対して時間周波数領域の畳み込み混合を解く処理を実行して信号分離結果を生成する。 For example, when the two-stage processing is applied to the method of directly solving convolutional mixing in the above (1) time-frequency domain, the signal separation means performs instantaneous mixing ICA (Independent Component Analysis) on the observed signal spectrogram. The first signal separation result is generated by the process of applying, the unnecessary channel removal process determined to not correspond to any sound source is executed from the first signal separation result, and the observation signal remaining after the removal process A signal separation result is generated by executing a process of solving the convolutional mixture in the time-frequency domain for the spectrogram.

また、（２）スペクトログラムを時間方向へもう一度短時間フーリエ変換（ＳＴＦＴ）し、瞬時混合として解く方式にこの２段階処理を適用した場合は、
第１信号変換手段が、入力信号を時間周波数領域に変換し観測信号スペクトログラムを生成し、不要チャンネル除去手段が、第１信号変換手段の生成した観測信号スペクトログラムに対して、瞬時混合ＩＣＡを適用した処理により第１の信号分離結果を生成し、この信号分離結果から、どの音源にも対応しないと判定される不要チャンネル除去処理を実行し、さらに、第２信号変換手段が、不要チャンネルが除去された観測信号スペクトログラムに対してデータ変換を実行してモジュレーション・スペクトログラムを生成し、信号分離手段が、モジュレーション・スペクトログラムから信号分離結果を生成するといった処理となる。 In addition, when (2) this two-step process is applied to a method of performing a short-time Fourier transform (STFT) once again in the time direction and solving as an instantaneous mixture,
The first signal conversion means converts the input signal into the time-frequency domain to generate observation signal spectrograms. The unnecessary channel removal means applies instantaneous-mixture ICA to the observation signal spectrograms generated by the first signal conversion means to generate a first signal separation result, and removes from this result the unnecessary channels judged not to correspond to any sound source. The second signal conversion means then performs data conversion on the observation signal spectrograms from which the unnecessary channels have been removed to generate modulation spectrograms, and the signal separation means generates the signal separation result from the modulation spectrograms.

また、（３）シフト積み重ねと瞬時混合ＩＣＡを組み合わせた処理にこの２段階処理を適用した場合は、
観測信号スペクトログラムに対して、瞬時混合ＩＣＡを適用した処理により第１の信号分離結果を生成し、生成した第１の信号分離結果から、どの音源にも対応しないと判定される不要チャンネル除去処理を実行し、除去処理後に残存する観測信号スペクトログラムをフレーム方向へシフトさせて観測信号スペクトログラムシフトセットを生成し、生成した観測信号スペクトログラムシフトセットに対して、再度、瞬時混合ＩＣＡを適用して信号分離結果を生成する構成となる。 When (3) this two-stage process is applied to a process that combines shift stacking and instantaneous mixing ICA,
A first signal separation result is generated by applying instantaneous-mixture ICA to the observation signal spectrograms, and the unnecessary channels judged not to correspond to any sound source are removed from the generated first signal separation result. The observation signal spectrograms remaining after the removal are shifted in the frame direction to generate an observation signal spectrogram shift set, and instantaneous-mixture ICA is applied again to this shift set to generate the signal separation result.

（３）残響除去として利用
本発明の分離処理、すなわち、
（１）時間周波数領域において、畳み込み混合を直接解く。
（２）スペクトログラムを時間方向へもう一度短時間フーリエ変換（ＳＴＦＴ）し、瞬時混合として解く。
（３）シフト積み重ねと瞬時混合ＩＣＡを組み合わせた処理によって解く。
これらの処理のうち、３番目の「シフト積み重ね＋従来法」を用いた場合、分離自体は前処理の従来法で行ない、本発明は残響除去のみを行なうという役割分担も可能である。こうすることで、計算量はＯ（｛ｎ×（Ｌ'＋１）｝^２）からＯ（ｎ×ｎ×（Ｌ'＋１））に削減される。以下はその方法について説明する。 (3) Use as dereverberation. The separation processing of the present invention, that is,
(1) Solve convolutional mixing directly in the time-frequency domain.
(2) The spectrogram is again subjected to short-time Fourier transform (STFT) in the time direction and solved as instantaneous mixing.
(3) Solve by a process combining shift stacking and instantaneous mixing ICA.
Among these processes, when the third one, “shift stacking + conventional method”, is used, the roles can also be divided so that the separation itself is performed by the conventional method as preprocessing and the present invention performs only dereverberation. This reduces the amount of calculation from O({n × (L' + 1)}^2) to O(n × n × (L' + 1)). The method is described below.

先に図１１を参照して説明した「スペクトログラムのシフト＋積み重ね」を行なうと、１チャンネル分のスペクトログラムが見かけ上はＬ'＋１チャンネルに拡張される。その結果に対して、従来の時間周波数領域の瞬時混合ＩＣＡを用いてＬ'＋１チャンネルの入力として処理すると、結果としてＬ'＋１チャンネル分のスペクトログラムが生成される。このような処理を行なっても、音源ごとの成分に分離されるわけではないが、複数のフレームにまたがった成分を取り除く効果、すなわち残響除去の効果はある。そこで、従来の時間周波数領域の瞬時混合ＩＣＡで分離を行い、ｎチャンネル分の分離結果スペクトログラムを生成した後、各チャンネルに対して前述の「残響除去」を行なうという組み合わせが考えられる。そのような処理について、図２２に示すフローチャートを参照して説明する。 When the “spectrogram shift + stacking” described above with reference to FIG. 11 is performed, the spectrogram of one channel is apparently expanded to L' + 1 channels. When this result is processed as an (L' + 1)-channel input by the conventional instantaneous-mixture ICA in the time-frequency domain, spectrograms for L' + 1 channels are generated. Such processing does not separate the signal into components for each sound source, but it does have the effect of removing components that span a plurality of frames, that is, a dereverberation effect. A possible combination is therefore to perform separation by the conventional instantaneous-mixture ICA in the time-frequency domain, generate separation result spectrograms for n channels, and then apply the above “dereverberation” to each channel. Such processing is described with reference to the flowchart shown in FIG. 22.
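The expansion of a one-channel spectrogram into L' + 1 apparent channels can be sketched as follows. This is an illustration only: the function name `shift_and_stack` is hypothetical, and the shift direction and the trimming of the edge frames are assumptions, since the exact layout of FIG. 11 is not reproduced in this passage.

```python
def shift_and_stack(spectrogram, extra_taps):
    """Expand one channel into extra_taps + 1 pseudo-channels.

    spectrogram: list of frames; each frame is a list of frequency-bin values.
    extra_taps:  L' in the description.

    Pseudo-channel l holds the input delayed by l frames. Frames that would
    need data from before the start of the signal are dropped (an assumed
    edge treatment), so every pseudo-channel has the same number of frames.
    """
    num_frames = len(spectrogram)
    if extra_taps < 0 or extra_taps >= num_frames:
        raise ValueError("extra_taps must satisfy 0 <= extra_taps < number of frames")
    out_len = num_frames - extra_taps
    return [
        [spectrogram[t + extra_taps - l] for t in range(out_len)]
        for l in range(extra_taps + 1)
    ]


# One channel, five frames, one frequency bin per frame, L' = 2.
frames = [[1], [2], [3], [4], [5]]
stacked = shift_and_stack(frames, 2)
```

Each of the resulting pseudo-channels would then be passed together to the conventional instantaneous-mixture ICA as an (L' + 1)-channel input.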

まず、ステップＳ６０１において、時間周波数領域の瞬時混合ＩＣＡによる分離処理を行なう。この処理は、特開２００６−２３８４０９において開示した処理として実行可能である。分離結果としてｎチャンネル分のスペクトログラムＹ_１〜Ｙ_ｎが生成される。以降の処理は、ｎチャンネル分のスペクトログラムＹ_１〜Ｙ_ｎに対して別個に行なう。第１チャンネル対応のスペクトログラムＹ_１に対する処理がステップＳ６１１〜Ｓ６１３、第ｎチャンネル対応のスペクトログラムＹ_ｎに対する処理がステップＳ６２１〜Ｓ６２３である。なお、ステップＳ６０１の瞬時混合ＩＣＡによる分離処理が終了した時点で、先に図２０を参照して説明したと同様、不要チャンネル（どの音源にも対応しないと判定される出力）を除去する処理を行なう構成としてもよい。 First, in step S601, separation processing is performed by instantaneous-mixture ICA in the time-frequency domain. This processing can be executed as the processing disclosed in JP-A-2006-238409. Spectrograms Y_1 to Y_n for n channels are generated as the separation result. The subsequent processing is performed separately on each of the n-channel spectrograms Y_1 to Y_n. The processing for the spectrogram Y_1 of the first channel is steps S611 to S613, and the processing for the spectrogram Y_n of the n-th channel is steps S621 to S623. When the separation processing by the instantaneous-mixture ICA in step S601 is completed, unnecessary channels (outputs judged not to correspond to any sound source) may be removed in the same manner as described above with reference to FIG. 20.

ステップＳ６１１〜Ｓ６１３の処理は、先に説明した［（３）シフト積み重ねと瞬時混合ＩＣＡを組み合わせた処理によって解く］方式の処理シーケンスである図１４に示すフローのステップＳ１１〜Ｓ１３の処理に対応する処理である。ただし、図１４に示すフローのステップＳ１１の処理はｎチャンネル分のスペクトログラムをｎ×（Ｌ'＋１）に拡張する処理だったのに対し、図２１に示すフローのステップＳ６１１の処理は、１チャンネル分をＬ'＋１チャンネル分に拡張する処理である。また、ステップＳ６１２の残響除去処理は、処理自体は、図１４に示すフローにおけるステップＳ１２の処理と同一の処理となるが、前述の理由により、このステップＳ６１２の処理の効果は音源の分離ではなくて残響除去として実行される。ステップＳ６１３の処理は、Ｌ'＋１チャンネル分の残響除去済みスペクトログラムから所望の一つを選択する処理であり、図１４のステップＳ１３の処理と同様の処理である。
ステップＳ６２１〜Ｓ６２３の処理は、処理対象が異なるチャンネル対応の信号Ｙ_ｎである点を除いてはステップＳ６１１〜Ｓ６１３の処理と同様である。 The processing of steps S611 to S613 corresponds to the processing of steps S11 to S13 of the flow shown in FIG. 14, which is the processing sequence of the method described above [(3) Solve by a process combining shift stacking and instantaneous mixing ICA]. However, whereas step S11 of the flow shown in FIG. 14 expands the spectrograms of n channels to n × (L' + 1) channels, step S611 of the flow shown in FIG. 21 expands the spectrogram of one channel to L' + 1 channels. The dereverberation processing of step S612 is itself the same as the processing of step S12 in the flow shown in FIG. 14, but for the reason given above its effect is dereverberation rather than sound source separation. Step S613 selects a desired one from the dereverberated spectrograms of the L' + 1 channels, and is the same as step S13 of FIG. 14.
The processing of steps S621 to S623 is the same as that of steps S611 to S613, except that the signal to be processed is Y_n, which corresponds to a different channel.

全ての出力チャンネル（ただし不要チャンネルは削除して構わない）に対して残響除去と選択とが完了したら、ステップＳ６３１において、残ったスペクトログラムを統合する。例えば縦に積み重ねる処理を実行する。この処理によって、複数のフレームにまたがった成分を取り除く処理、すなわち残響除去処理が実現される。 When dereverberation and selection are completed for all output channels (however, unnecessary channels may be deleted), the remaining spectrograms are integrated in step S631. For example, a process of stacking vertically is executed. By this process, a process for removing a component across a plurality of frames, that is, a dereverberation process is realized.

［本発明に従った信号分離処理における効果の検証］
上述した本発明の方法により、従来の時間周波数領域ＩＣＡを超える分離性能が出ることを実験で確かめた。以下、この実験結果に基づいて本発明に従った信号分離処理による効果について説明する。 [Verification of effects in signal separation processing according to the present invention]
Experiments have confirmed that the above-described method of the present invention provides separation performance exceeding the conventional time frequency domain ICA. Hereinafter, the effects of the signal separation processing according to the present invention will be described based on the experimental results.

最初に、実験の条件について説明する。
音データの収録を、図２３に示す環境（オフィスの部屋）で行なった。
マイク数＝４（間隔＝７．５ｃｍ）、音源数＝３であり、音源として以下のＷｅｂページで公開されているものを用いた。
原信号：
ＩＣＡ'９９ＳＹＮＴＨＥＴＩＣＢＥＮＣＨＭＡＲＫＳ
ｈｔｔｐ：／／ｓｏｕｎｄ．ｍｅｄｉａ．ｍｉｔ．ｅｄｕ／ｉｃａ−ｂｅｎｃｈ／ｓｏｕｒｃｅｓ／
ｓｒｃ１：ｂｅｅｔ．ｗａｖ
ｓｒｃ２：ｂｅｅｔ９．ｗａｖ
ｓｒｃ３：ｍｉｋｅ．ｗａｖ
なお、収録はそれぞれの音源を単独に鳴らした状態で行ない、後で計算機上で混合している。 First, experimental conditions will be described.
Recording of sound data was performed in the environment (office room) shown in FIG. 23.
The number of microphones was 4 (spacing = 7.5 cm) and the number of sound sources was 3; the sound sources published on the following Web page were used.
Original signals:
ICA'99 SYNTHETIC BENCHMARKS
http://sound.media.mit.edu/ica-bench/sources/
src1: beet.wav
src2: beet9.wav
src3: mike.wav
Each sound source was recorded while being played alone, and the recordings were mixed afterwards on a computer.

実験は以下の条件で行なった。
サンプリング周波数：１６ｋＨｚ
ＳＴＦＴの窓長：６４，１２８，２５６，５１２，１０２４，（２０４８，４０９６）
ＳＴＦＴのシフト幅：窓長の１／２
窓：短時間フーリエ変換（ＳＴＦＴ）時にサイン窓、逆フーリエ変換（ＦＴ）時に再びサイン窓
η０＝０．５（式）
ループ回数＝２００ｏｒ４００
方式：
（方式１）式［５．２］（従来法に相当）
（方式２）式［７．１］＆式［７．２］（以降「逆方向畳み込み」）
（方式３）式［９．５］（以降「再ＳＴＦＴ」）
スコア関数：式［７．７］を使用
スコア関数のγの値：
（方式１＆２）γ＝ｓｑｒｔ（Ｍ）Ｍ：周波数ビンの本数
（方式３） γ＝ｓｑｒｔ（Ｌ'Ｍ）
フレームタップ：
（方式２）Ｌ'＝４，５，８，１０，１５，１６，２０，２５，３０，３２
（方式３）Ｌ'＝４，８，１６，３２ The experiment was performed under the following conditions.
Sampling frequency: 16 kHz
STFT window length: 64, 128, 256, 512, 1024, (2048, 4096)
STFT shift width: 1/2 of the window length
Window: sine window for the short-time Fourier transform (STFT), sine window again for the inverse Fourier transform (FT)
η0 = 0.5 (formula)
Number of loops = 200 or 400
Methods:
(Method 1) Equation [5.2] (equivalent to the conventional method)
(Method 2) Equation [7.1] & Equation [7.2] (hereinafter “reverse convolution”)
(Method 3) Equation [9.5] (hereinafter “re-STFT”)
Score function: Equation [7.7] is used
Value of γ in the score function:
(Methods 1 & 2) γ = sqrt(M), where M is the number of frequency bins
(Method 3) γ = sqrt(L'M)
Frame taps:
(Method 2) L' = 4, 5, 8, 10, 15, 16, 20, 25, 30, 32
(Method 3) L' = 4, 8, 16, 32

評価尺度として、波形ベースのｓｉｇｎａｌ−ｉｎｔｅｒｆｅｒｅｎｃｅ−ｒａｔｉｏ（ＳＩＲ）と周波数ビンベースのＳＩＲとを用いている。以下で、ＳＩＲの計算方法について説明する。 As an evaluation measure, waveform-based signal-interference-ratio (SIR) and frequency bin-based SIR are used. Below, the calculation method of SIR is demonstrated.

ｋ番目のチャンネルに対応した分離結果（波形）をｙｋ（ｔ）とし、原信号ｓ_１（ｔ）〜ｓ_Ｎ（ｔ）の線形結合でｙｋ（ｔ）を近似することを考える（以下に示す式［１０．１］）。
Let y_k(t) be the separation result (waveform) corresponding to the k-th channel, and consider approximating y_k(t) by a linear combination of the original signals s_1(t) to s_N(t) (Equation [10.1] shown below).

ｓ_１（ｔ）〜ｓ_Ｎ（ｔ）の係数λ_１〜λ_Ｎは、式［１０．２］の二乗誤差を最小にすることで求まる。
ｙｋ（ｔ）をｉ番目の音源ｓ_ｉ（ｔ）の推定結果と見なした場合、ＳＩＲはｓ_ｉ（ｔ）とそれ以外の音源とのパワー比として定義される（式［１０．３］）。
出力チャンネル数（＝マイク数）をｎとすると、１つの音源に対してＳＩＲはｎ通り計算されるが、その内の最大値を音源ｉのＳＩＲと定義する（式［１０．４］）。以降の実験結果では、３つの音源のからそれぞれ求めたＳＩＲを、さらに平均している。
周波数ビンベースのＳＩＲは、周波数ビンごとにＳＩＲを計算した後、全ての周波数ビンについて平均を取ることで計算する（式［１０．６］）。 The coefficients λ_1 to λ_N of s_1(t) to s_N(t) are obtained by minimizing the squared error of Equation [10.2].
When y_k(t) is regarded as an estimate of the i-th sound source s_i(t), the SIR is defined as the power ratio between s_i(t) and the other sound sources (Equation [10.3]).
With n output channels (= number of microphones), n SIR values are calculated for each sound source, and the maximum among them is defined as the SIR of sound source i (Equation [10.4]). In the experimental results below, the SIRs obtained for the three sound sources are further averaged.
The frequency-bin-based SIR is calculated by computing the SIR for each frequency bin and then averaging over all frequency bins (Equation [10.6]).
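The waveform-based SIR calculation described above can be sketched as follows for real-valued waveforms. Equations [10.1] to [10.4] themselves are not reproduced in this passage, so the exact form used here is an assumption: the SIR is taken as 10·log10 of the power of the fitted target component λ_i·s_i(t) over the power of the sum of the other fitted components. All function names are hypothetical.

```python
import math


def solve_linear_system(a, b):
    """Solve a small square system by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("original signals are linearly dependent")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def fit_coefficients(y, sources):
    """Least-squares coefficients lambda_1..lambda_N with y ~ sum_j lambda_j * s_j."""
    gram = [[sum(p * q for p, q in zip(si, sj)) for sj in sources] for si in sources]
    rhs = [sum(p * q for p, q in zip(si, y)) for si in sources]
    return solve_linear_system(gram, rhs)


def sir_db(y, sources, i):
    """SIR (dB) of output y regarded as an estimate of source i (assumed form)."""
    lam = fit_coefficients(y, sources)
    target = sum((lam[i] * v) ** 2 for v in sources[i])
    interference = sum(
        sum(lam[j] * sources[j][t] for j in range(len(sources)) if j != i) ** 2
        for t in range(len(y))
    )
    if interference == 0:
        return math.inf
    return 10 * math.log10(target / interference)


def source_sir_db(outputs, sources, i):
    """SIR of source i: the maximum over all output channels."""
    return max(sir_db(y, sources, i) for y in outputs)


# Toy check: two orthogonal sources, each output leaks 10% of the other source.
s1 = [1.0, 0.0, -1.0, 0.0]
s2 = [0.0, 1.0, 0.0, -1.0]
y1 = [a + 0.1 * b for a, b in zip(s1, s2)]
y2 = [0.1 * a + b for a, b in zip(s1, s2)]
sir_src1 = source_sir_db([y1, y2], [s1, s2], 0)   # about 20 dB
```

The frequency-bin-based variant would apply the same calculation to each frequency bin of the spectrograms and average the results over all bins.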

以下では、実験結果について説明する。以下、実験結果を表として示す。
各表において、
窓長：ＳＴＦＴの窓長、
ｆｒｍ−ｔａｐはフレームタップ数、
ＳＩＲ（ｗａｖｅ）は波形ベースのＳＩＲ、
ＳＩＲ（ｂｉｎ）は周波数ビンベースのＳＩＲ、
を表わす。 Below, an experimental result is demonstrated. The experimental results are shown as a table below.
In each table:
Window length: the STFT window length
frm-tap: the number of frame taps
SIR (wave): the waveform-based SIR
SIR (bin): the frequency-bin-based SIR

以下、各表において、
（１）方式１（従来法）、２００回ループ
（２）方式２（式［６．１］，［７．１］，［７．２］）、２００回ループ
（３）方式３（式［９．２］，［９．５］）、２００回ループ
（４）方式１（従来法）、４００回ループ
（５）方式２（式［６．１］，［７．１］，［７．２］）、４００回ループ
（６）方式３（式［９．２］，［９．５］）、４００回ループ
これらの実験結果を示す。 Hereinafter, in each table,
(1) Method 1 (conventional method), 200 loops
(2) Method 2 (Equations [6.1], [7.1], [7.2]), 200 loops
(3) Method 3 (Equations [9.2], [9.5]), 200 loops
(4) Method 1 (conventional method), 400 loops
(5) Method 2 (Equations [6.1], [7.1], [7.2]), 400 loops
(6) Method 3 (Equations [9.2], [9.5]), 400 loops
the experimental results for these settings are shown.

図２４は、以下の３方式による分離結果についての評価データである。
（１）方式１（従来法）、２００回ループ
（２）方式２（式［６．１］，［７．１］，［７．２］）、２００回ループ
（３）方式３（式［９．２］，［９．５］）、２００回ループ
これらの３方式を実行した場合の結果データに基づくＳＩＲデータをプロットしたものであり、
（ａ）波形ベースのＳＩＲ（ｓｉｇｎａｌ−ｉｎｔｅｒｆｅｒｅｎｃｅ−ｒａｔｉｏ）
（ｂ）周波数ビンベースのＳＩＲ
これらのＳＩＲデータをプロットしたものである。横軸がＳＴＦＴの窓長、縦軸がＳＩＲである。各グラフにおいて、
＊（実線）：方式１、
◆：が方式２、
＋：が方式３、
である。
いくつかの設定において、方式２と方式３は、従来法を上回っているのが確認できる。 FIG. 24 shows evaluation data on the separation results by the following three methods.
(1) Method 1 (conventional method), 200 loops
(2) Method 2 (Equations [6.1], [7.1], [7.2]), 200 loops
(3) Method 3 (Equations [9.2], [9.5]), 200 loops
The figure plots the SIR data obtained from the results of executing these three methods:
(a) Waveform-based SIR (signal-interference-ratio)
(b) Frequency-bin-based SIR
The horizontal axis is the STFT window length, and the vertical axis is the SIR. In each graph:
* (solid line): Method 1
◆: Method 2
+: Method 3
In some settings, it can be seen that method 2 and method 3 outperform the conventional method.

次に、横軸として以下の式で計算されるタイムスパンを用いてプロットした評価データを図２５に示す。
ｔｉｍｅ＿ｓｐａｎ＝｛（ｆｒａｍｅ＿ｔａｐ−１）×ｆｒａｍｅ＿ｓｈｉｆｔ＋ｗｉｎｄｏｗ＿ｌｅｎ｝／ｓｒａｔｅ
なお、
ｆｒａｍｅ＿ｔａｐ：フレームタップ数（＝Ｌ'）
ｗｉｎｄｏｗ＿ｌｅｎ：窓長（一度目のSTFTの切り出し区間長）
ｆｒａｍｅ＿ｓｈｉｆｔ：窓シフト幅（今回は窓長の１／２）
ｓｒａｔｅ：サンプリング周波数（１６ｋＨｚ） Next, FIG. 25 shows evaluation data plotted using the time span calculated by the following equation as the horizontal axis.
time_span = {(frame_tap - 1) × frame_shift + window_len} / srate
where
frame_tap: number of frame taps (= L')
window_len: window length (length of the segment cut out in the first STFT)
frame_shift: window shift width (here, 1/2 of the window length)
srate: sampling frequency (16 kHz)
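The time span formula above can be written directly as a small function. This is a sketch for checking values by hand; the function name `time_span_seconds` is hypothetical, the parameter names follow the formula, and `srate` denotes the sampling frequency in Hz.

```python
def time_span_seconds(frame_tap, window_len, frame_shift, srate):
    """time_span = {(frame_tap - 1) * frame_shift + window_len} / srate."""
    return ((frame_tap - 1) * frame_shift + window_len) / srate


# A 1024-sample window with a single frame tap ...
long_window = time_span_seconds(frame_tap=1, window_len=1024, frame_shift=512, srate=16000)
# ... covers the same span as a 512-sample window with three frame taps.
short_window = time_span_seconds(frame_tap=3, window_len=512, frame_shift=256, srate=16000)
```

Both calls evaluate to 1024 / 16000 = 0.064 seconds, which illustrates how a shorter window combined with several frame taps can cover the same time span as one longer window.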

図２５も、以下の３方式による分離結果についての評価データである。
（１）方式１（従来法）、２００回ループ
（２）方式２（式［６．１］，［７．１］，［７．２］）、２００回ループ
（３）方式３（式［９．２］，［９．５］）、２００回ループ
これらの３方式を実行した場合の結果データに基づくＳＩＲデータをプロットしたものであり、
（ａ）波形ベースのＳＩＲ（ｓｉｇｎａｌ−ｉｎｔｅｒｆｅｒｅｎｃｅ−ｒａｔｉｏ）
（ｂ）周波数ビンベースのＳＩＲ
これらのＳＩＲデータをプロットしたものである。横軸が上述したタイムスパン（Ｔｉｍｅ＿ｓｐａｎ）の窓長、縦軸がＳＩＲである。各グラフにおいて、
＊（実線）：方式１、
◆：が方式２、
＋：が方式３、
である。 FIG. 25 is also evaluation data about the separation results by the following three methods.
(1) Method 1 (conventional method), 200 loops
(2) Method 2 (Equations [6.1], [7.1], [7.2]), 200 loops
(3) Method 3 (Equations [9.2], [9.5]), 200 loops
The figure plots the SIR data obtained from the results of executing these three methods:
(a) Waveform-based SIR (signal-interference-ratio)
(b) Frequency-bin-based SIR
The horizontal axis is the time span (time_span) described above, and the vertical axis is the SIR. In each graph:
* (solid line): Method 1
◆: Method 2
+: Method 3

従来は、長いタイムスパンをカバーするためには短時間フーリエ変換（ＳＴＦＴ）の窓長を長くするしかなく、それがＳＩＲの低下を招いていた。それに対し本発明では、短めの窓と複数のフレームタップという組み合わせを用いることで、ＳＩＲを低下させずに同等のタイムスパンをカバーすることができる。 Conventionally, in order to cover a long time span, the window length of the short-time Fourier transform (STFT) must be increased, which has caused a decrease in SIR. On the other hand, in the present invention, by using a combination of a short window and a plurality of frame taps, an equivalent time span can be covered without reducing the SIR.

図２６は、以下の３方式による分離結果についての評価データである。
（４）方式１（従来法）、４００回ループ
（５）方式２（式［６．１］，［７．１］，［７．２］）、４００回ループ
（６）方式３（式［９．２］，［９．５］）、４００回ループ
これらの３方式を実行した場合の結果データに基づくＳＩＲデータをプロットしたものであり、
（ａ）波形ベースのＳＩＲ（ｓｉｇｎａｌ−ｉｎｔｅｒｆｅｒｅｎｃｅ−ｒａｔｉｏ）
（ｂ）周波数ビンベースのＳＩＲ
これらのＳＩＲデータをプロットしたものである。横軸がＳＴＦＴの窓長、縦軸がＳＩＲである。各グラフにおいて、
＊（実線）：方式１、
◆：が方式２、
＋：が方式３、
である。 FIG. 26 shows evaluation data on the separation results by the following three methods.
(4) Method 1 (conventional method), 400 loops
(5) Method 2 (Equations [6.1], [7.1], [7.2]), 400 loops
(6) Method 3 (Equations [9.2], [9.5]), 400 loops
The figure plots the SIR data obtained from the results of executing these three methods:
(a) Waveform-based SIR (signal-interference-ratio)
(b) Frequency-bin-based SIR
The horizontal axis is the STFT window length, and the vertical axis is the SIR. In each graph:
* (solid line): Method 1
◆: Method 2
+: Method 3

次に、分離処理のループ回数を400回に増やして同様の評価実験を行なった。
図２６に示すデータに対応するデータとして、横軸としてタイムスパンを用いてプロットした評価データを図２７に示す。以下の３方式による分離結果についての評価データである。
（４）方式１（従来法）、４００回ループ
（５）方式２（式［６．１］，［７．１］，［７．２］）、４００回ループ
（６）方式３（式［９．２］，［９．５］）、４００回ループ
これらの３方式を実行した場合の結果データに基づくＳＩＲデータをプロットしたものであり、
（ａ）波形ベースのＳＩＲ（ｓｉｇｎａｌ−ｉｎｔｅｒｆｅｒｅｎｃｅ−ｒａｔｉｏ）
（ｂ）周波数ビンベースのＳＩＲ
これらのＳＩＲデータをプロットしたものである。横軸が上述したタイムスパン（Ｔｉｍｅ＿ｓｐａｎ）の窓長、縦軸がＳＩＲである。各グラフにおいて、
＊（実線）：方式１、
◆：が方式２、
＋：が方式３、
である。 Next, the same evaluation experiment was performed by increasing the number of separation processing loops to 400.
FIG. 27 shows evaluation data corresponding to the data shown in FIG. 26, plotted with the time span as the horizontal axis. It is evaluation data on the separation results of the following three methods.
(4) Method 1 (conventional method), 400 loops
(5) Method 2 (Equations [6.1], [7.1], [7.2]), 400 loops
(6) Method 3 (Equations [9.2], [9.5]), 400 loops
The figure plots the SIR data obtained from the results of executing these three methods:
(a) Waveform-based SIR (signal-interference-ratio)
(b) Frequency-bin-based SIR
The horizontal axis is the time span (time_span) described above, and the vertical axis is the SIR. In each graph:
* (solid line): Method 1
◆: Method 2
+: Method 3

図２６、図２７においても、方式２，方式３は、従来法を上回る設定が存在する。このように、本発明によって、従来の時間周波数領域ＩＣＡが持っていた「窓長と分離性能とのトレードオフ」という課題を回避することが可能である。 Also in FIGS. 26 and 27, method 2 and method 3 have settings that exceed the conventional method. As described above, according to the present invention, it is possible to avoid the problem of “tradeoff between window length and separation performance” that the conventional time-frequency domain ICA has.

次に、別種のデータについての評価実験について説明する。図２８は、収録環境であるオフィス環境の見取り図である。図に示すようにほぼ７５０ｃｍ×３７５ｃｍの長方形の室内で実験を行なった。なお、図に示すように完全な長方形ではなく、一方は高さ１５３ｃｍのパーティションによって区切られた空間である。部屋の残響時間は０．３秒よりやや短い値である。（以降では、０．２７５秒としてプロットしてある。） Next, an evaluation experiment for different types of data will be described. FIG. 28 is a sketch of an office environment that is a recording environment. As shown in the figure, the experiment was conducted in a rectangular room of approximately 750 cm × 375 cm. As shown in the figure, it is not a complete rectangle, but one is a space delimited by partitions having a height of 153 cm. The reverberation time of the room is a little shorter than 0.3 seconds. (Hereafter, it is plotted as 0.275 seconds.)

音源として、以下の３種類の音を用意した。（各信号のスペクトログラムを図２９に示す。）
音源１（ｓｒｃ１）：女性１名の発話（以降、女声またはＦ）
音源２（ｓｒｃ２）：男性１名の発話（以降、男声またはＭ）
音源３（ｓｒｃ３）：以下のＵＲＬで公開されているストリートノイズ（以降、雑踏またはＳ）ｈｔｔｐ：／／ｓｏｕｎｄ．ｍｅｄｉａ．ｍｉｔ．ｅｄｕ／ｉｃａ−ｂｅｎｃｈ／ｓｏｕｒｃｅｓ／ｓｔｒｅｅｔ．ｗａｖ The following three types of sounds were prepared as sound sources. (The spectrogram of each signal is shown in FIG. 29.)
Sound source 1 (src1): Utterance of one woman (hereinafter referred to as female voice or F)
Sound source 2 (src2): Utterance of one male (hereinafter male voice or M)
Sound source 3 (src3): Street noise published at the following URL (hereinafter “crowd noise” or S): http://sound.media.mit.edu/ica-bench/sources/street.wav

図中のｓｐ１〜ｓｐ４の各スピーカーから上記の音をそれぞれ再生し、５ｃｍ間隔で並べた４本のマイク（ｍｉｃ１〜ｍｉｃ４）で収録した。次に、図３０に示す８通りの組み合わせで、スビーカｓｐ１〜ｓｐ４からの音声出力を行なって４本のマイク（ｍｉｃ１〜ｍｉｃ４）によって入力するデータの解析を行なった。女性１名の発話を［Ｆ］、男性１名の発話を［Ｍ］、ストリートノイズを［Ｓ］、音声出力なしを［０］として、
（１）ｓｐ１＝Ｓ，ｓｐ２＝０、ｓｐ３＝Ｆ、ｓｐ４＝Ｍ
（２）ｓｐ１＝Ｓ，ｓｐ２＝０、ｓｐ３＝Ｍ、ｓｐ４＝Ｆ
（３）ｓｐ１＝Ｆ，ｓｐ２＝Ｓ、ｓｐ３＝０、ｓｐ４＝Ｍ
（４）ｓｐ１＝Ｍ，ｓｐ２＝Ｓ、ｓｐ３＝０、ｓｐ４＝Ｍ
（５）ｓｐ１＝０，ｓｐ２＝０、ｓｐ３＝Ｆ、ｓｐ４＝Ｍ
（６）ｓｐ１＝０，ｓｐ２＝０、ｓｐ３＝Ｍ、ｓｐ４＝Ｆ
（７）ｓｐ１＝Ｆ，ｓｐ２＝０、ｓｐ３＝０、ｓｐ４＝Ｍ
（８）ｓｐ１＝Ｍ，ｓｐ２＝０、ｓｐ３＝０、ｓｐ４＝Ｍ
これらの８つのパターンである。
なお、実験では、（１）〜（８）の書くパターンについて、観測信号が４秒の場合と８秒の場合とについて実験しているため、観測信号のバリエーションは合計で８×２＝１６通り存在する。 The above sounds were reproduced from the speakers sp1 to sp4 in the figure and recorded with four microphones (mic1 to mic4) arranged at 5 cm intervals. Next, for the eight combinations shown in FIG. 30, sound was output from the speakers sp1 to sp4 and the data captured by the four microphones (mic1 to mic4) was analyzed. Denoting the utterance of one woman by [F], the utterance of one man by [M], street noise by [S], and no sound output by [0], the combinations are:
(1) sp1 = S, sp2 = 0, sp3 = F, sp4 = M
(2) sp1 = S, sp2 = 0, sp3 = M, sp4 = F
(3) sp1 = F, sp2 = S, sp3 = 0, sp4 = M
(4) sp1 = M, sp2 = S, sp3 = 0, sp4 = M
(5) sp1 = 0, sp2 = 0, sp3 = F, sp4 = M
(6) sp1 = 0, sp2 = 0, sp3 = M, sp4 = F
(7) sp1 = F, sp2 = 0, sp3 = 0, sp4 = M
(8) sp1 = M, sp2 = 0, sp3 = 0, sp4 = M
These are the eight patterns.
In the experiment, each of the patterns (1) to (8) was tested with a 4-second observation signal and with an 8-second observation signal, so there are 8 × 2 = 16 observation signal variations in total.

観測信号の例を図３１に示す。これは、図３０に示すパターン中の［ＴａｋｅＮｏ．＝３］に該当する。すなわち、以下の出力パターンである。
（３）ｓｐ１＝Ｆ，ｓｐ２＝Ｓ、ｓｐ３＝０、ｓｐ４＝Ｍ
図３１（ａ）に示すＸ_１〜Ｘ_４の４枚のスペクトログラムは、図２８に示す４本のマイク（ｍｉｃ１〜ｍｉｃ４）で観測された観測信号である。図３１（ｂ）は周波数ビンごとのＳＩＲである。４枚のスペクトログラムの間で、４つの音源の混ざり具合はほぼ同じであることが分かる。 An example of the observation signal is shown in FIG. 31. It corresponds to [Take No. = 3] in the patterns shown in FIG. 30, that is, the following output pattern.
(3) sp1 = F, sp2 = S, sp3 = 0, sp4 = M
The four spectrograms X_1 to X_4 shown in FIG. 31(a) are the observation signals observed by the four microphones (mic1 to mic4) shown in FIG. 28. FIG. 31(b) shows the SIR for each frequency bin. It can be seen that the four sound sources are mixed to almost the same degree in all four spectrograms.

音源分離実験は以下の３つの方式について行なった。すなわち、前述の実験から（方式２）を省き、代わりに前述の「（３）シフト積み重ね＋瞬時混合ＩＣＡ」（の中の１番目の方法）を方式４として行なった。
（方式１）式［５．２］（従来法に相当）
（方式３）式［９．５］（以降「再ＳＴＦＴ」）
（方式４）式［１１．１］＆式［５．２］（以降「シフト積み重ね」） The sound source separation experiment was conducted for the following three methods. That is, (Method 2) was omitted from the above-described experiment, and “(3) Shift stack + instantaneous mixing ICA” (the first method) was performed as Method 4 instead.
(Method 1) Equation [5.2] (equivalent to the conventional method)
(Method 3) Equation [9.5] (hereinafter “Re-STFT”)
(Method 4) Formula [11.1] & Formula [5.2] (hereinafter “shift stack”)

実験の条件は以下とおりである。
共通：
サンプリング周波数＝１６ｋＨｚ
サンプルビット数＝１６
観測信号の長さ：４秒および８秒 The experimental conditions are as follows.
Common:
Sampling frequency = 16 kHz
Number of sample bits = 16
Observation signal length: 4 and 8 seconds

（方式１）
ＳＴＦＴの窓長：２５６，５１２，１０２４，２０４８，４０９６，８１９２
ＳＴＦＴのシフト幅：窓長の１／４
窓：短時間フーリエ変換（ＳＴＦＴ）時にハニング窓、逆フーリエ変換（ＦＴ）時には窓なし。
η０＝０．３
ループ回数＝４００
スコア関数のγの値：γ＝ｓｑｒｔ（Ｍ）Ｍ：周波数ビンの本数 (Method 1)
STFT window length: 256, 512, 1024, 2048, 4096, 8192
STFT shift width: 1/4 of window length
Window: Hanning window during short-time Fourier transform (STFT), no window during inverse Fourier transform (FT).
η0 = 0.3
Loop count = 400
Value of γ in the score function: γ = sqrt(M), where M is the number of frequency bins

ただし、観測信号＝４秒、ＳＴＦＴの窓長＝８１９２の場合のみ、シフト幅は１／８の１０２４を使用した。（１／４シフトではフレーム数が少なくなりすぎるため。） However, only for the combination of a 4-second observation signal and an STFT window length of 8192, a shift width of 1024 (1/8 of the window length) was used, because a 1/4 shift would leave too few frames.
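The effect of the shift width on the number of frames can be checked with a short calculation. This is a sketch assuming that only full windows are used and that the signal is not padded at either end; the description does not state how the edges are handled, and the function name `num_frames` is hypothetical.

```python
def num_frames(num_samples, window_len, frame_shift):
    """Number of full analysis windows that fit in the signal (no padding assumed)."""
    if num_samples < window_len:
        return 0
    return (num_samples - window_len) // frame_shift + 1


samples_4s = 4 * 16000  # 4-second observation signal at 16 kHz

frames_quarter_shift = num_frames(samples_4s, 8192, 8192 // 4)  # shift = 2048
frames_eighth_shift = num_frames(samples_4s, 8192, 8192 // 8)   # shift = 1024
```

Under these assumptions a 1/4 shift gives 28 frames for the 4-second signal, while the 1/8 shift of 1024 samples gives 55 frames.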

（方式３）
ＳＴＦＴ（１回目）の窓長：５１２
ＳＴＦＴ（１回目）のシフト幅：窓長の１／４
窓（１回目）：短時間フーリエ変換（ＳＴＦＴ）時にハニング窓、逆フーリエ変換（ＦＴ）時には窓なし。
η０＝０．３
ループ回数＝４００
スコア関数のγの値：γ＝ｓｑｒｔ（Ｍ（Ｌ'＋１））Ｍ：周波数ビンの本数
ＳＴＦＴ（２回目）の窓長：Ｌ'＋１＝４，８，１６，３２
ＳＴＦＴ（２回目）のシフト幅：窓長の１／８（端数切り上げ）
窓（２回目）：短時間フーリエ変換（ＳＴＦＴ）時にハミング窓、逆フーリエ変換（ＦＴ）時には窓なし。 (Method 3)
STFT (first time) window length: 512
STFT (first time) shift width: 1/4 of the window length
Window (first time): Hanning window during short-time Fourier transform (STFT), no window during inverse Fourier transform (FT).
η0 = 0.3
Loop count = 400
Value of γ in the score function: γ = sqrt(M(L' + 1)), where M is the number of frequency bins
STFT (second time) window length: L' + 1 = 4, 8, 16, 32
STFT (second time) shift width: 1/8 of the window length (rounded up)
Window (second time): Hamming window during short-time Fourier transform (STFT), no window during inverse Fourier transform (FT).

２回目のＳＴＦＴでハニング窓ではなくてハミング窓を使った理由は、タップ数が小さい場合でも両端のサンプルを有効に使用するためである。（ハニング窓は両端が０なので、有効なサンプル数が２個少なくなってしまう。） The reason why the Hamming window is used instead of the Hanning window in the second STFT is that the samples at both ends are used effectively even when the number of taps is small. (Since the Hanning window is zero at both ends, the number of effective samples is reduced by two.)
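The difference between the two windows at the end samples can be confirmed numerically. The sketch below assumes the common symmetric definitions of the Hanning window (0.5 - 0.5·cos) and the Hamming window (0.54 - 0.46·cos); the function names are hypothetical, and the description does not state which exact definitions were used in the experiment.

```python
import math


def hann_window(n):
    """Symmetric Hanning window of length n (n >= 2); both end samples are 0."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * k / (n - 1)) for k in range(n)]


def hamming_window(n):
    """Symmetric Hamming window of length n (n >= 2); end samples are 0.08."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * k / (n - 1)) for k in range(n)]


def effective_samples(window, tol=1e-12):
    """Count the samples whose weight is not (numerically) zero."""
    return sum(1 for w in window if abs(w) > tol)


# With the smallest second-STFT window length used in the experiment (4 taps):
hann_effective = effective_samples(hann_window(4))        # 2 of 4 samples contribute
hamming_effective = effective_samples(hamming_window(4))  # all 4 samples contribute
```

For a 4-tap window the Hanning window therefore discards half of the samples, which is why the loss of two samples matters most when the number of taps is small.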

（方式４）
ＳＴＦＴ（１回目）の窓長：５１２
ＳＴＦＴ（１回目）のシフト幅：窓長の１／４
窓（１回目）：短時間フーリエ変換（ＳＴＦＴ）時にハニング窓、逆フーリエ変換（ＦＴ）時には窓なし。
η０＝０．３
ループ回数＝４００
スコア関数のγの値：γ＝ｓｑｒｔ（Ｍ（Ｌ'＋１））Ｍ：周波数ビンの本数
フレームタップ：Ｌ'＋１＝２，４，８，１２ (Method 4)
STFT (first time) window length: 512
STFT (first time) shift width: 1/4 of the window length
Window (first time): Hanning window during short-time Fourier transform (STFT), no window during inverse Fourier transform (FT).
η0 = 0.3
Loop count = 400
Value of γ in the score function: γ = sqrt(M(L' + 1)), where M is the number of frequency bins
Frame taps: L' + 1 = 2, 4, 8, 12

図３２と図３３は、図３１の観測信号に対して方式４で処理をした結果である。Ｌ'＝１（すなわち２タップ分）でシフト＆積み重ね（図１１参照）を行なった結果が図３２であり、それを８チャンネルの観測信号として分離をした結果が図３３である。図３３（ａ）が分離結果のスペクトログラム、図３３（ｂ）が周波数ビンごとのＳＩＲである。図３３（ａ）の分離結果のスペクトログラムを見ると、原信号と以下のように対応している。
Ｙ_２ ^［１］，Ｙ_４ ^［０］：音源１
Ｙ_３ ^［０］，Ｙ_３ ^［１］：音源２
Ｙ_１ ^［０］，Ｙ_１ ^［１］：音源３
Ｙ_２ ^［０］，Ｙ_４ ^［１］：対応なし FIGS. 32 and 33 show the results of processing the observation signals of FIG. 31 by Method 4. FIG. 32 shows the result of shifting and stacking (see FIG. 11) with L' = 1 (that is, 2 taps), and FIG. 33 shows the result of separating it as 8-channel observation signals. FIG. 33(a) shows the spectrograms of the separation result, and FIG. 33(b) shows the SIR for each frequency bin. The spectrograms of the separation result in FIG. 33(a) correspond to the original signals as follows.
Y_2^[1], Y_4^[0]: Sound source 1
Y_3^[0], Y_3^[1]: Sound source 2
Y_1^[0], Y_1^[1]: Sound source 3
Y_2^[0], Y_4^[1]: No correspondence

分離度合いを表わす尺度として、周波数ビンごとの改善ＳＩＲの平均を計算した。図３３を例にとると、図３３（ａ）に示す分離結果スペクトログラムの各チャンネル（Ｙ_１ ^［０］，〜Ｙ_４ ^［１］）について、最も強く現れている音源に対してＳＩＲを計算し、さらに全周波数ビンで平均をとった。例えばＹ_２ ^［１］では音源１が最も強く現れているので、音源１へのＳＩＲを計算する。Ｙ_４ ^［０］に対しても同様に計算し、両者で値が大きい方を音源１の分離度合いとする。音源２・３についても同様に計算してから３つの間で平均を取ることで、全体の分離度合いとする。この値から観測信号の平均ＳＩＲ、すなわち図３３（ｂ）に示す周波数ビンごとのＳＩＲのプロットを全周波数で平均したものを引くと、改善ＳＩＲが計算される。 As a measure of the degree of separation, the average of the improved SIR over the frequency bins was calculated. Taking FIG. 33 as an example, for each channel (Y_1^[0] to Y_4^[1]) of the separation result spectrograms shown in FIG. 33(a), the SIR was calculated with respect to the sound source that appears most strongly, and then averaged over all frequency bins. For example, since sound source 1 appears most strongly in Y_2^[1], the SIR with respect to sound source 1 is calculated. The same calculation is performed for Y_4^[0], and the larger of the two values is taken as the degree of separation of sound source 1. Sound sources 2 and 3 are treated in the same way, and the average of the three values is taken as the overall degree of separation. Subtracting from this value the average SIR of the observation signals, that is, the per-frequency-bin SIR plot shown in FIG. 33(b) averaged over all frequencies, gives the improved SIR.
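The aggregation steps above (best channel per sound source, average over the sound sources, subtraction of the observation-signal SIR) can be sketched as follows. The function name `improved_sir` and its argument layout are hypothetical, and the numbers in the example are made-up values for illustration only; they are not results from the experiment.

```python
def improved_sir(channel_sir, channel_source, observed_mean_sir):
    """Improved SIR for one take.

    channel_sir[c]:    bin-averaged SIR (dB) of output channel c with respect
                       to the sound source that appears most strongly in it.
    channel_source[c]: index of that sound source, or None when the channel
                       corresponds to no sound source.
    observed_mean_sir: SIR (dB) of the observation signals averaged over all
                       frequency bins.
    """
    best_per_source = {}
    for sir, src in zip(channel_sir, channel_source):
        if src is None:
            continue  # channels with no corresponding source are ignored
        best_per_source[src] = max(sir, best_per_source.get(src, float("-inf")))
    if not best_per_source:
        raise ValueError("no channel corresponds to any sound source")
    separation = sum(best_per_source.values()) / len(best_per_source)
    return separation - observed_mean_sir


# Channel order Y_1^[0], Y_1^[1], Y_2^[0], Y_2^[1], Y_3^[0], Y_3^[1], Y_4^[0], Y_4^[1],
# using the channel-to-source correspondence listed for FIG. 33(a)
# (sources indexed from 0). SIR values are illustrative, not measured.
example_sir = [10.0, 12.0, 8.0, 9.0, 15.0, 3.0, 14.0, 2.0]
example_src = [2, 2, None, 0, 1, 1, 0, None]
example_improvement = improved_sir(example_sir, example_src, observed_mean_sir=1.0)
```

Averaging this quantity over the takes would then give the degree of separation for one experimental parameter setting.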

Finally, the degree of separation for one experimental parameter is calculated by averaging over the 8 takes. Results for 4-second and 8-second observation signals were tabulated separately. The tabulated results are shown in FIGS. 34 and 35. FIG. 34 shows the case of a 4-second observation signal and FIG. 35 the case of an 8-second observation signal; in both, the vertical axis is the SIR improvement and the horizontal axis is the time span (logarithmic scale). The vertical dashed line in each figure marks the room reverberation time, taken as 0.275 seconds. The three line plots correspond to the conventional method (method 1), re-STFT (method 3), and shift-and-stack (method 4), respectively.

As shown in FIGS. 34 and 35, in the conventional method, lengthening the STFT window (analysis frame length) raises the separation accuracy only up to a certain value (1024 for a 4-second observation signal, 2048 for an 8-second one); beyond that peak, a longer window degrades the accuracy. This is because an overly long window lowers the time resolution of the STFT result. The loss of time resolution has a stronger effect the shorter the observation signal is, so the peak is reached at a shorter window length for the 4-second signal than for the 8-second signal. Conversely, when the window is short, the time resolution is high, but more components span frame boundaries (the reverberation no longer fits within one frame), so sufficient separation accuracy is not obtained.
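
One way to see why a long window hurts short recordings more is to count how many STFT frames remain. The sketch below does this arithmetic only. The 16 kHz sample rate and the quarter-window frame shift are assumptions, since neither is stated in this section, and `frame_count` is a hypothetical helper.

```python
def frame_count(num_samples, window_len, shift):
    """Number of complete STFT frames that fit in the signal."""
    if num_samples < window_len:
        return 0
    return 1 + (num_samples - window_len) // shift


SAMPLE_RATE = 16000  # assumed; not stated in this section
for seconds in (4, 8):
    for window_len in (512, 1024, 2048, 4096):
        shift = window_len // 4  # assumed quarter-window shift
        frames = frame_count(seconds * SAMPLE_RATE, window_len, shift)
        print(f"{seconds} s, window {window_len}: {frames} frames")
```

Under these assumptions, each doubling of the window roughly halves the number of frames, and the 4-second signal always has about half as many frames as the 8-second one.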

In contrast, methods 3 and 4 of the present invention apply the STFT with a short window (512 in this experiment) and then perform separation using multiple frames of the result, so they limit the loss of time resolution while still handling components that span multiple frames. As a result, compared with the conventional method at the same time span, they achieve higher separation accuracy; compared at peak separation accuracy, they achieve a higher accuracy at a longer time span.

The present invention has been described in detail above with reference to specific embodiments. However, it is obvious that those skilled in the art can modify or substitute the embodiments without departing from the gist of the present invention. In other words, the present invention has been disclosed by way of example and should not be interpreted restrictively. To determine the gist of the present invention, the claims should be taken into consideration.

The series of processes described in this specification can be executed by hardware, by software, or by a combination of both. When the processes are executed by software, a program recording the processing sequence can be installed into memory in a computer built into dedicated hardware and executed, or it can be installed and executed on a general-purpose computer capable of executing various processes.

For example, the program can be recorded in advance on a hard disk or in ROM (Read Only Memory) serving as a recording medium. Alternatively, the program can be stored (recorded) temporarily or permanently on a removable recording medium such as a flexible disk, CD-ROM (Compact Disc Read Only Memory), MO (Magneto-Optical) disk, DVD (Digital Versatile Disc), magnetic disk, or semiconductor memory. Such removable recording media can be provided as so-called packaged software.

Besides being installed on a computer from a removable recording medium as described above, the program can be transferred to the computer wirelessly from a download site, or by wire via a network such as a LAN (Local Area Network) or the Internet. The computer can receive the program transferred in this way and install it on a recording medium such as a built-in hard disk.

The various processes described in this specification may be executed not only in time series in the order described, but also in parallel or individually, according to the processing capability of the device executing them or as needed. In this specification, a system is a logical collection of a plurality of devices, and the constituent devices are not necessarily housed in the same casing.

As described above, according to the configuration of one embodiment of the present invention, signal separation processing converts an input signal in which a plurality of sound signals are mixed into the time-frequency domain to generate an observation signal spectrogram, and generates a signal separation result from that spectrogram, in one of two ways. Either the observation signal spectrogram is interpreted as an observation signal convolutively mixed in the time-frequency domain, and the signal separation result is generated by an independent component analysis that solves the convolutive mixture; or a modulation spectrogram is generated by applying a short-time Fourier transform (STFT) in the time direction to the observation signal spectrogram, the modulation spectrogram is interpreted as an instantaneous mixture, and the signal separation result is generated by an independent component analysis that solves the instantaneous mixture. As a result, for mixed sound signals with various delay amounts, such as direct waves and reflected waves, highly accurate separation that takes the delay amounts into account is achieved.
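
The second alternative, applying an STFT again along the time direction of the spectrogram, can be sketched for a single frequency bin as below. This is an illustration under stated assumptions rather than the method of the embodiment: a rectangular window and a naive DFT are used for brevity, and the names `dft` and `modulation_spectrogram` are hypothetical.

```python
import cmath


def dft(values):
    """Naive discrete Fourier transform of a complex sequence."""
    n = len(values)
    return [
        sum(values[k] * cmath.exp(-2j * cmath.pi * m * k / n) for k in range(n))
        for m in range(n)
    ]


def modulation_spectrogram(bin_trajectory, window_len, shift):
    """Second STFT along the frame axis of one frequency bin.

    bin_trajectory: complex STFT values X(omega, t) for a fixed bin omega,
                    indexed by frame number t.
    Returns one list of window_len modulation-frequency coefficients per
    segment. A rectangular window is assumed for brevity.
    """
    segments = []
    for start in range(0, len(bin_trajectory) - window_len + 1, shift):
        segments.append(dft(bin_trajectory[start:start + window_len]))
    return segments


# Hypothetical input: a bin whose value is constant over 8 frames.
segments = modulation_spectrogram([1 + 0j] * 8, window_len=4, shift=2)
print(len(segments), len(segments[0]))
```

Because the input to the second transform is already complex, the sketch keeps all `window_len` modulation-frequency coefficients rather than only half of them.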

FIG. 1 is a diagram explaining an example configuration for acquiring the sound information used in separating a sound signal in which a plurality of signals are mixed.
FIG. 2 is a diagram showing the relationship between the entropy H(Y_k) of one spectrogram for each channel and the joint entropy H(Y) of one spectrogram for all channels.
FIG. 3 is a graph plotting the relationship between the STFT window length and the separation accuracy of time-frequency-domain ICA.
FIG. 4 is a graph showing the separation accuracy of time-frequency-domain ICA with the window length on the horizontal axis expressed in actual seconds.
FIG. 5 is a diagram explaining an example configuration for acquiring the sound information used in separating a sound signal in which a plurality of signals are mixed.
FIG. 6 is a diagram explaining the concept that convolutive mixing in the time domain is regarded as convolutive mixing, not instantaneous mixing, in the time-frequency domain.
FIG. 7 is a diagram explaining the short-time Fourier transform (STFT).
FIG. 8 is a diagram explaining the conversion from the spectrogram X to X′ (the modulation spectrogram), obtained by applying the STFT again in the time direction.
FIG. 9 is a diagram explaining the conversion from the spectrogram X to X′ (the modulation spectrogram), obtained by applying the STFT again in the time direction.
FIG. 10 is a diagram explaining the method of calculating the entropy H(Y′_k).
FIG. 11 is a diagram explaining the processing that generates vectors stacked vertically while shifting the frame number of the observation signal spectrogram.
FIG. 12 is a diagram explaining the operation of generating a separation result by convolving the (t−l)-th to (t−l+L′)-th frames of the observation signal spectrogram X, where l is a lowercase L.
FIG. 13 is a diagram explaining processing that combines shift-and-stack with instantaneous mixing ICA.
FIG. 14 is a flowchart explaining the sequence of processing that combines shift-and-stack with instantaneous mixing ICA.
FIG. 15 is a diagram explaining a configuration example of the signal separation device of the present invention.
FIG. 16 is a diagram explaining a configuration example of the signal separation device of the present invention.
FIG. 17 is a flowchart explaining the processing sequence of the signal separation device of the present invention.
FIG. 18 is a flowchart explaining the detailed sequence of the separation processing executed by the signal separation device of the present invention.
FIG. 19 is a flowchart explaining the detailed sequence of the separation processing executed by the signal separation device of the present invention.
FIG. 20 is a diagram explaining processing that varies the number of frame taps [L′] by frequency.
FIG. 21 is a flowchart of processing that performs signal separation with the number of channels reduced by two-stage separation.
FIG. 22 is a diagram explaining the processing to be performed.
FIG. 23 is a diagram explaining the configuration of the experimental apparatus used to confirm the effect of the signal separation device of the present invention.
FIG. 24 is a diagram showing evaluation data from the experiments confirming the effect of the signal separation device of the present invention.
FIG. 25 is a diagram showing evaluation data from the experiments confirming the effect of the signal separation device of the present invention.
FIG. 26 is a diagram showing evaluation data from the experiments confirming the effect of the signal separation device of the present invention.
FIG. 27 is a diagram showing evaluation data from the experiments confirming the effect of the signal separation device of the present invention.
FIG. 28 is a diagram explaining the environment in which the evaluation experiment for the signal separation processing was performed.
FIG. 29 is a diagram explaining the sound sources used in the evaluation experiment for the signal separation processing.
FIG. 30 is a diagram explaining the input/output patterns of the sound sources used in the evaluation experiment for the signal separation processing.
FIG. 31 is a diagram explaining an example of the observation signal in the evaluation experiment for the signal separation processing.
FIG. 32 is a diagram explaining the result of shift-and-stack (see FIG. 11) obtained in the evaluation experiment for the signal separation processing.
FIG. 33 is a diagram explaining the separation result and the SIR obtained in the evaluation experiment for the signal separation processing.
FIG. 34 is a diagram explaining the evaluation results obtained in the evaluation experiment for the signal separation processing.
FIG. 35 is a diagram explaining the evaluation results obtained in the evaluation experiment for the signal separation processing.

Explanation of symbols

11, 12 Entropy
13 Joint entropy
111 Sound source
121 Microphone
201 Frequency bin
202 Bin
203 Bin
221 Modulation spectrogram
222 Multivariate probability density function
223, 224 Entropy
401 Microphone
402 AD conversion unit
403 STFT unit
404 Signal separation unit
405 Rescaling unit
406 Inverse FT unit
407 Post-stage processing execution unit
408 Convolution operation unit
409 Control unit
451 Microphone
452 AD conversion unit
453 First STFT unit
454 Second STFT unit
455 Signal separation unit
456 First rescaling unit
457 First inverse FT unit
458 Second rescaling unit
459 Second inverse FT unit
460 Post-stage processing execution unit
461 Control unit

Claims

A signal separation device that receives a signal in which a plurality of signals are mixed and separates it into individual signals, comprising:
signal conversion means for converting the input signal into the time-frequency domain to generate an observed signal spectrogram; and
signal separation means for generating a signal separation result from the observed signal spectrogram generated by the signal conversion means,
wherein the signal separation means interprets the observed signal spectrogram as an observation signal convolutively mixed in the time-frequency domain, and generates the signal separation result by executing processing that solves the convolutive mixture in the time-frequency domain.

The signal separation device according to claim 1, wherein the signal conversion means is configured to generate the observed signal spectrogram by applying a short-time Fourier transform (STFT) to the input signal to convert it into the time-frequency domain.

The signal separation device according to claim 1, wherein the signal separation means is configured to treat the separated signal Y(t) of frame number t as a convolutive mixture of the observation signals X(t−L′) to X(t), and to generate the signal separation result by processing that increases the independence of the individual signal components Y1(t) to Yn(t) contained in the separated signal Y(t).

The signal separation device according to claim 3, wherein the signal separation means is configured to use the Kullback-Leibler information I(Y), a measure of independence, in the processing that increases the independence of the individual signal components Y1(t) to Yn(t) contained in the separated signal Y(t), and to generate the signal separation result by a separation matrix update process that minimizes the Kullback-Leibler information I(Y).

The signal separation device according to claim 1, wherein the signal separation means is configured to generate a first signal separation result by applying instantaneous mixing ICA (Independent Component Analysis) to the observed signal spectrogram, to remove from the first signal separation result unnecessary channels determined not to correspond to any sound source, and to generate the signal separation result by executing the processing that solves the convolutive mixture in the time-frequency domain on the observed signal spectrogram remaining after the removal.

The signal separation device according to claim 5, wherein the processing using the instantaneous mixing ICA generates a separated signal in the time-frequency domain from the observation signal in the time-frequency domain and a separation matrix, corrects the separation matrix until the separation matrix, calculated using the separated signal in the time-frequency domain and a multidimensional score function based on a multidimensional probability density function, substantially converges, and applies the corrected separation matrix to generate the separated signal in the time-frequency domain.

A signal separation device that receives a signal in which a plurality of signals are mixed and separates it into individual signals, comprising:
first signal conversion means for converting the input signal into the time-frequency domain to generate an observed signal spectrogram;
second signal conversion means for performing data conversion on the observed signal spectrogram generated by the first signal conversion means to generate a modulation spectrogram; and
signal separation means for generating a signal separation result from the modulation spectrogram generated by the second signal conversion means,
wherein the signal separation means interprets the modulation spectrogram as an instantaneous mixture and generates the signal separation result.

The signal separation device according to claim 7, wherein the first signal conversion means is configured to generate the observed signal spectrogram by applying a short-time Fourier transform (STFT) to the input signal to convert it into the time-frequency domain.

The signal separation device according to claim 7, wherein the second signal conversion means is configured to generate the modulation spectrogram by applying a short-time Fourier transform (STFT) in the time direction to the observed signal spectrogram, and
the signal separation means is configured to generate the signal separation result by processing that increases the independence of the signal components Y1′ to Yn′, corresponding to the separated signals, contained in the modulation spectrogram.

The signal separation device according to claim 9, wherein the signal separation means is configured to use the Kullback-Leibler information, a measure of independence, in the processing that increases the independence of the signal components Y1′ to Yn′ corresponding to the separated signals, and to generate the signal separation result by a separation matrix update process that minimizes the Kullback-Leibler information.

The signal separation device according to claim 7, further comprising inverse Fourier transform means for generating spectrograms Y1 to Yn corresponding to the separated signals by applying an inverse Fourier transform to each of the signal components Y1′ to Yn′ corresponding to the separated signals obtained by the signal separation means.

The signal separation device according to claim 7, further comprising unnecessary channel removal means for generating a first signal separation result by applying instantaneous mixing ICA (Independent Component Analysis) to the observed signal spectrogram generated by the first signal conversion means, and for removing from the first signal separation result unnecessary channels determined not to correspond to any sound source,
wherein the second signal conversion means and the signal separation means are configured to generate the signal separation result by processing only the signals remaining after the unnecessary channels are removed.

The signal separation device according to claim 12, wherein the processing using the instantaneous mixing ICA generates a separated signal in the time-frequency domain from the observation signal in the time-frequency domain and a separation matrix, corrects the separation matrix until the separation matrix, calculated using the separated signal in the time-frequency domain and a multidimensional score function based on a multidimensional probability density function, substantially converges, and applies the corrected separation matrix to generate the separated signal in the time-frequency domain.

A signal separation device that receives a signal in which a plurality of signals are mixed and separates it into individual signals, comprising:
signal conversion means for converting the input signal into the time-frequency domain to generate an observed signal spectrogram; and
signal separation means for generating a signal separation result from the observed signal spectrogram generated by the signal conversion means,
wherein the signal separation means shifts the observed signal spectrogram in the frame direction to generate an observed signal spectrogram shift set in which data with different shift amounts are stacked, and generates the signal separation result by applying instantaneous mixing ICA (Independent Component Analysis) to the generated observed signal spectrogram shift set.

The signal separation device according to claim 14, wherein the processing using the instantaneous mixing ICA generates a separated signal in the time-frequency domain from the observation signal in the time-frequency domain and a separation matrix, corrects the separation matrix until the separation matrix, calculated using the separated signal in the time-frequency domain and a multidimensional score function based on a multidimensional probability density function, substantially converges, and applies the corrected separation matrix to generate the separated signal in the time-frequency domain.

The signal separation device according to claim 14, wherein the signal separation means is configured to generate the signal separation result by applying the instantaneous mixing ICA to a multichannel observed signal spectrogram shift set obtained by stacking a plurality of observed signal spectrogram shift sets generated for the observation signals of a plurality of signal input sources.

The signal separation device according to claim 14, wherein the signal separation means is configured to generate the observed signal spectrogram shift set by filling the gaps produced by the shift with zero, with a value close to zero, or with copies of the values at both ends of the observed signal spectrogram.

The signal separation device according to claim 14, wherein the signal separation means is configured to perform a cyclic shift in which data pushed out at one end by the shift is copied to the other end.

The signal separation device according to claim 14, wherein the signal separation means is configured to generate, from the observation signal, a plurality of shifted data with a minimum shift amount of 0 and a maximum shift amount equal to the number of frame taps [L′], and to generate the observed signal spectrogram shift set by stacking the generated data with different shift amounts.

The signal separation device according to claim 14, wherein the signal separation means is configured to generate the observed signal spectrogram shift set while varying the number of frame taps [L′] according to frequency.

The signal separation device according to claim 14, wherein the signal separation means is configured to generate a first signal separation result by applying instantaneous mixing ICA (Independent Component Analysis) to the observed signal spectrogram, to remove from the first signal separation result unnecessary channels determined not to correspond to any sound source, to shift the observed signal spectrogram remaining after the removal in the frame direction to generate the observed signal spectrogram shift set, and to generate the signal separation result by applying the instantaneous mixing ICA to the generated observed signal spectrogram shift set.

A signal separation device that receives a signal in which a plurality of signals are mixed and separates it into individual signals, comprising:
signal conversion means for converting the input signal into the time-frequency domain to generate an observed signal spectrogram; and
signal separation means for generating a signal separation result from the observed signal spectrogram generated by the signal conversion means,
wherein the signal separation means generates signal separation results Y1 to Yn by applying instantaneous mixing ICA (Independent Component Analysis) to the observed signal spectrogram,
shifts the signal spectrogram corresponding to each of the signal separation results Y1 to Yn in the frame direction to generate an observed signal spectrogram shift set in which data with different shift amounts are stacked, performs dereverberation by applying instantaneous mixing ICA to the generated observed signal spectrogram shift set, and generates a dereverberated signal separation result by integrating the spectrograms obtained after dereverberation.

The signal separation device according to claim 22, wherein the processing using the instantaneous mixing ICA generates a separated signal in the time-frequency domain from the observation signal in the time-frequency domain and a separation matrix, corrects the separation matrix until the separation matrix, calculated using the separated signal in the time-frequency domain and a multidimensional score function based on a multidimensional probability density function, substantially converges, and applies the corrected separation matrix to generate the separated signal in the time-frequency domain.

A signal separation method executed in a signal separation device to receive a signal in which a plurality of signals are mixed and separate it into individual signals, comprising:
a signal conversion step in which signal conversion means converts the input signal into the time-frequency domain to generate an observed signal spectrogram; and
a signal separation step in which signal separation means generates a signal separation result from the observed signal spectrogram generated in the signal conversion step,
wherein the signal separation step interprets the observed signal spectrogram as an observation signal convolutively mixed in the time-frequency domain, and generates the signal separation result by executing processing that solves the convolutive mixture in the time-frequency domain.

The signal separation method according to claim 24, wherein the signal conversion step generates the observed signal spectrogram by applying a short-time Fourier transform (STFT) to the input signal to convert it into the time-frequency domain.

The signal separation method according to claim 24, wherein the signal separation step treats the separated signal Y(t) of frame number t as a convolutive mixture of the observation signals X(t−L′) to X(t), and generates the signal separation result by processing that increases the independence of the individual signal components Y1(t) to Yn(t) contained in the separated signal Y(t).

The signal separation method according to claim 26, wherein the signal separation step uses the Kullback-Leibler information I(Y), a measure of independence, in the processing that increases the independence of the individual signal components Y1(t) to Yn(t) contained in the separated signal Y(t), and generates the signal separation result by a separation matrix update process that minimizes the Kullback-Leibler information I(Y).

The signal separation method according to claim 24, wherein the signal separation step generates a first signal separation result by applying instantaneous mixing ICA (Independent Component Analysis) to the observed signal spectrogram, removes from the first signal separation result unnecessary channels determined not to correspond to any sound source, and generates the signal separation result by executing the processing that solves the convolutive mixture in the time-frequency domain on the observed signal spectrogram remaining after the removal.

The signal separation method according to claim 28, wherein the processing using the instantaneous mixing ICA generates a separated signal in the time-frequency domain from the observation signal in the time-frequency domain and a separation matrix, corrects the separation matrix until the separation matrix, calculated using the separated signal in the time-frequency domain and a multidimensional score function based on a multidimensional probability density function, substantially converges, and applies the corrected separation matrix to generate the separated signal in the time-frequency domain.

A signal separation method executed in a signal separation device to receive a signal in which a plurality of signals are mixed and separate it into individual signals, comprising:
a first signal conversion step in which first signal conversion means converts the input signal into the time-frequency domain to generate an observed signal spectrogram;
a second signal conversion step in which second signal conversion means performs data conversion on the observed signal spectrogram generated in the first signal conversion step to generate a modulation spectrogram; and
a signal separation step of generating a signal separation result from the modulation spectrogram generated in the second signal conversion step,
wherein the signal separation step interprets the modulation spectrogram as an instantaneous mixture and generates the signal separation result.

The signal separation method according to claim 30, wherein the first signal converting step is a step of executing a short-time Fourier transform (STFT) on the input signal to convert the input signal into the time-frequency domain and generate the observed signal spectrogram.

The signal separation method according to claim 30, wherein the second signal converting step generates the modulation spectrogram as a result of performing a short-time Fourier transform (STFT) in the time direction on the observed signal spectrogram, and
the signal separation step generates the signal separation result by a process of increasing the mutual independence of the signal components Y1′ to Yn′, included in the modulation spectrogram, that correspond to the separated signals.
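
The two conversion stages recited above (an STFT of the waveform, then a second STFT taken along the frame axis of each frequency bin) can be sketched as follows. The function names, the Hann window, and the frame and hop lengths are assumptions made for illustration only; the claims do not fix any of them.

```python
import numpy as np

def frame_signal(x, frame_len, hop):
    """Cut the last axis of x into overlapping frames: (..., n_frames, frame_len)."""
    n_frames = 1 + (x.shape[-1] - frame_len) // hop
    idx = hop * np.arange(n_frames)[:, None] + np.arange(frame_len)[None, :]
    return x[..., idx]

def observed_spectrogram(x, frame_len=64, hop=16):
    """First conversion: STFT of a real waveform -> (n_bins, n_frames)."""
    frames = frame_signal(x, frame_len, hop) * np.hanning(frame_len)
    return np.fft.rfft(frames, axis=-1).T

def modulation_spectrogram(X, mod_len=8, mod_hop=2):
    """Second conversion: STFT along the frame (time) axis of every bin.

    X: complex spectrogram of shape (n_bins, n_frames).
    Returns an array of shape (n_bins, n_blocks, mod_len).
    """
    blocks = frame_signal(X, mod_len, mod_hop) * np.hanning(mod_len)
    return np.fft.fft(blocks, axis=-1)  # full FFT because X is complex

x = np.random.default_rng(0).standard_normal(1024)
X = observed_spectrogram(x)       # shape (33, 61)
M = modulation_spectrogram(X)     # shape (33, 27, 8)
```

A convolution along the frame axis of the spectrogram becomes, to the approximation that any STFT makes, a multiplication in the modulation domain, which is why the claim can treat the modulation spectrogram as an instantaneous mixture.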

The signal separation method according to claim 32, wherein, as the process of increasing the independence of the signal components Y1′ to Yn′ corresponding to the separated signals, the signal separation step applies the Kullback-Leibler information amount, which is a measure of independence, and generates the signal separation result by a separation matrix updating process that minimizes the Kullback-Leibler information amount.
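
For orientation, one standard way to write the Kullback-Leibler information amount as an independence measure is shown below. The claim itself does not spell out the formula, so the notation (p for probability densities, H for differential entropy) is an assumption of this note.

```latex
I(Y') \;=\; \mathrm{KL}\!\left( p(Y'_1,\dots,Y'_n) \,\Big\|\, \prod_{k=1}^{n} p(Y'_k) \right)
      \;=\; \sum_{k=1}^{n} H(Y'_k) \;-\; H(Y'_1,\dots,Y'_n) \;\ge\; 0
```

The quantity is zero exactly when the joint density factorizes into the product of its marginals, that is, when Y1′ to Yn′ are mutually independent, so driving it toward its minimum by updating the separation matrix increases independence.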

The signal separation method according to claim 30, further comprising a step in which an inverse Fourier transform means performs an inverse Fourier transform on each of the signal components Y1′ to Yn′, corresponding to the separated signals, obtained in the signal separation step, to generate spectrograms Y1 to Yn corresponding to the separated signals.

The signal separation method according to claim 30, further comprising an unnecessary channel removal step in which an unnecessary channel removing means generates a first signal separation result by applying instantaneous mixing ICA (Independent Component Analysis) to the observed signal spectrogram generated by the first signal converting means, and removes channels determined, from the first signal separation result, not to correspond to any sound source,
wherein the second signal converting means and the signal separating means generate the signal separation result by processing only the signals remaining after the unnecessary channel removal.

The signal separation method according to claim 35, wherein the processing using the instantaneous mixing ICA generates a separated signal in the time-frequency domain from the observation signal in the time-frequency domain and a separation matrix, corrects the separation matrix, using the separated signal in the time-frequency domain and a multi-dimensional score function corresponding to a multi-dimensional probability density function, until the separation matrix substantially converges, and applies the corrected separation matrix to generate the separated signal in the time-frequency domain.

A signal separation method, executed in a signal separation device, for inputting a signal in which a plurality of signals are mixed and separating it into individual signals, the method comprising:
a signal conversion step in which a signal conversion means converts the input signal into the time-frequency domain and generates an observed signal spectrogram; and
a signal separation step in which a signal separation means generates a signal separation result from the observed signal spectrogram generated in the signal conversion step,
wherein the signal separation step is a step of shifting the observed signal spectrogram in the frame direction to generate an observed signal spectrogram shift set in which data having different shift amounts are stacked, and generating the signal separation result by processing that applies instantaneous mixing ICA (Independent Component Analysis) to the generated observed signal spectrogram shift set.
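
The construction of the shift set can be sketched as follows. The array layout (tap axis first, so that each tap can be treated like an extra channel by a later instantaneous-mixing ICA stage), the zero filling of the gap, and the name `shift_set` are assumptions made for this illustration.

```python
import numpy as np

def shift_set(X, n_taps):
    """Stack copies of X delayed by 0, 1, ..., n_taps - 1 frames.

    X: spectrogram of shape (n_bins, n_frames).
    Returns an array of shape (n_taps, n_bins, n_frames); the frames left
    empty by each delay are filled with zeros.
    """
    n_bins, n_frames = X.shape
    out = np.zeros((n_taps, n_bins, n_frames), dtype=X.dtype)
    for tap in range(n_taps):
        # Delay by `tap` frames: frame t of the copy holds frame t - tap of X.
        out[tap, :, tap:] = X[:, : n_frames - tap]
    return out

X = np.arange(6, dtype=float).reshape(2, 3)
S = shift_set(X, n_taps=2)
# S[0] is X itself; S[1] is X delayed by one frame:
# [[0., 0., 1.],
#  [0., 3., 4.]]
```

With this layout, `S[:, k, :]` gives, for frequency bin `k`, a set of signals that differ only by frame delay, so a convolution over frames in that bin appears as an instantaneous mixture of the rows.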

The signal separation method according to claim 37, wherein the processing using the instantaneous mixing ICA generates a separated signal in the time-frequency domain from the observation signal in the time-frequency domain and a separation matrix, corrects the separation matrix, using the separated signal in the time-frequency domain and a multi-dimensional score function corresponding to a multi-dimensional probability density function, until the separation matrix substantially converges, and applies the corrected separation matrix to generate the separated signal in the time-frequency domain.

The signal separation method according to claim 37, wherein the signal separation step generates the signal separation result by applying the instantaneous mixing ICA to a multi-channel observed signal spectrogram shift set obtained by stacking a plurality of observed signal spectrogram shift sets generated for the respective observation signals of a plurality of signal input sources.

The signal separation step includes
Generating the observed signal spectrogram shift set by filling the gap produced by the shift with zero, with a value close to zero, or with a copy of the values at both ends of the observed signal spectrogram. The signal separation method according to 1.

The signal separation method according to claim 37, wherein the signal separation step performs a cyclic shift process in which data pushed out at one end by the shift is copied to the other end.
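
The alternatives for handling the frames vacated by a shift (zero fill, copying the edge value, or a cyclic shift) can be compared in a short sketch. The function name `shift_frames` and the `mode` labels are hypothetical, and only a delay toward later frames is shown.

```python
import numpy as np

def shift_frames(X, shift, mode="zero"):
    """Delay X by `shift` frames along its last axis.

    mode="zero":   vacated frames are set to zero.
    mode="edge":   vacated frames repeat the first frame of X.
    mode="cyclic": frames pushed out at the end wrap around to the start.
    """
    if shift == 0:
        return X.copy()
    if mode == "cyclic":
        return np.roll(X, shift, axis=-1)
    out = np.empty_like(X)
    out[..., shift:] = X[..., : X.shape[-1] - shift]
    if mode == "zero":
        out[..., :shift] = 0
    elif mode == "edge":
        out[..., :shift] = X[..., :1]  # broadcast the first frame into the gap
    else:
        raise ValueError(f"unknown mode: {mode}")
    return out

X = np.array([[1.0, 2.0, 3.0, 4.0]])
shift_frames(X, 1, "zero")    # [[0., 1., 2., 3.]]
shift_frames(X, 1, "edge")    # [[1., 1., 2., 3.]]
shift_frames(X, 1, "cyclic")  # [[4., 1., 2., 3.]]
```

The cyclic variant keeps every shifted copy a permutation of the same frames, at the cost of placing late frames before early ones at the wrapped edge.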

The signal separation method according to claim 37, wherein the signal separation step generates, from the observation signal, a plurality of shifted data whose shift amounts range from a minimum shift amount of 0 up to a maximum shift amount, the number of shifted data being the number of frame taps [L′], and generates the observed signal spectrogram shift set by stacking the generated data having different shift amounts.

The signal separation method according to claim 37, wherein the signal separation step generates the observed signal spectrogram shift set while changing the number of frame taps [L′] according to frequency.

The signal separation method according to claim 37, wherein the signal separation step generates a first signal separation result by applying instantaneous mixing ICA (Independent Component Analysis) to the observed signal spectrogram, executes unnecessary channel removal processing that removes channels determined, from the first signal separation result, not to correspond to any sound source, shifts the observed signal spectrogram remaining after the removal processing in the frame direction to generate the observed signal spectrogram shift set, and generates the signal separation result by applying the instantaneous mixing ICA to the generated observed signal spectrogram shift set.

A signal separation method, executed in a signal separation device, for inputting a signal in which a plurality of signals are mixed and separating it into individual signals, the method comprising:
a signal conversion step in which a signal conversion means converts the input signal into the time-frequency domain and generates an observed signal spectrogram; and
a signal separation step in which a signal separation means generates a signal separation result from the observed signal spectrogram generated in the signal conversion step,
wherein the signal separation step is a step of generating signal separation results Y1 to Yn by processing that applies instantaneous mixing ICA (Independent Component Analysis) to the observed signal spectrogram,
shifting the signal spectrogram corresponding to each of the signal separation results Y1 to Yn in the frame direction to generate an observed signal spectrogram shift set in which data having different shift amounts are stacked, performing dereverberation on the generated observed signal spectrogram shift set by processing that applies instantaneous mixing ICA (Independent Component Analysis), and generating, by integrating the resulting dereverberated spectrograms, a signal separation result from which reverberation has been removed.

The signal separation method according to claim 45, wherein the processing using the instantaneous mixing ICA generates a separated signal in the time-frequency domain from the observation signal in the time-frequency domain and a separation matrix, corrects the separation matrix, using the separated signal in the time-frequency domain and a multi-dimensional score function corresponding to a multi-dimensional probability density function, until the separation matrix substantially converges, and applies the corrected separation matrix to generate the separated signal in the time-frequency domain.

A computer program for causing a signal separation device to execute signal separation processing of inputting a signal in which a plurality of signals are mixed and separating it into individual signals, the program comprising:
a signal conversion step of causing a signal conversion means to convert the input signal into the time-frequency domain and generate an observed signal spectrogram; and
a signal separation step of causing a signal separation means to generate a signal separation result from the observed signal spectrogram generated in the signal conversion step,
wherein the signal separation step is a step of interpreting the observed signal spectrogram as an observation signal subjected to convolutional mixing in the time-frequency domain and generating the signal separation result by executing processing for solving the convolutional mixing in the time-frequency domain.

A computer program for causing a signal separation device to execute signal separation processing of inputting a signal in which a plurality of signals are mixed and separating it into individual signals, the program comprising:
a first signal converting step of causing a first signal converting means to convert the input signal into the time-frequency domain and generate an observed signal spectrogram;
a second signal converting step of causing a second signal converting means to perform data conversion on the observed signal spectrogram generated in the first signal converting step to generate a modulation spectrogram; and
a signal separation step of causing a signal separation means to generate a signal separation result from the modulation spectrogram generated in the second signal converting step,
wherein the signal separation step is a step of interpreting the modulation spectrogram as an instantaneous mixture and generating the signal separation result.

A computer program for causing a signal separation device to execute signal separation processing of inputting a signal in which a plurality of signals are mixed and separating it into individual signals, the program comprising:
a signal conversion step of causing a signal conversion means to convert the input signal into the time-frequency domain and generate an observed signal spectrogram; and
a signal separation step of causing a signal separation means to generate a signal separation result from the observed signal spectrogram generated in the signal conversion step,
wherein the signal separation step is a step of shifting the observed signal spectrogram in the frame direction to generate an observed signal spectrogram shift set in which data having different shift amounts are stacked, and generating the signal separation result by processing that applies instantaneous mixing ICA (Independent Component Analysis) to the generated observed signal spectrogram shift set.