JP2006084898A

JP2006084898A - Sound input device

Info

Publication number: JP2006084898A
Application number: JP2004270772A
Authority: JP
Inventors: Mitsunobu Kaminuma; 充伸神沼
Original assignee: Nissan Motor Co Ltd
Current assignee: Nissan Motor Co Ltd
Priority date: 2004-09-17
Filing date: 2004-09-17
Publication date: 2006-03-30

Abstract

PROBLEM TO BE SOLVED: To provide a sound input device capable of removing diffuse noise with only a slight increase in computation compared with SBE applied to general frequency-domain ICA.

SOLUTION: The sound input device detects sound containing a target sound and non-target sound with a plurality of microphones 10-1 to 10-n, obtains a plurality of sound signals in which target sound signals and non-target sound signals are mixed, and executes an independent component analysis method that acquires, by iterative learning, a filter separating one target sound signal from the sound signals. The sound frequency band is divided into a plurality of frequency division bands at a band division stage 30, discrimination levels evaluating the target sound separation performance of the learned filter are calculated for each frequency division band at an evaluation stage 50, and the frequency division bands in which the learning calculation is performed are determined at a determination stage 60 on the basis of the discrimination levels.

COPYRIGHT: (C)2006, JPO&NCIPI

Description

The present invention relates to a voice input device.

In recent years, voice input systems in vehicle cabins have come into wide use for operating in-vehicle equipment by speech recognition, for hands-free telephony, and for similar applications. One factor that hinders these technologies is sound in the cabin from sources other than the voice-input user. Independent component analysis (hereinafter, ICA) has been developed as a way to separate the user's voice from sound from other sources: sound signals are acquired from a plurality of acoustic sensors and, using only those acquired signals, a filter that separates the target voice signal from them is obtained by learning.

Patent Document 1: JP 2003-271166 A
Non-Patent Document 1: "Basics of blind source separation using array signal processing," Technical Report of IEICE, EA2001-7.
Non-Patent Document 2: "What is independent component analysis?" Computer Today, pp. 38-43, 1998.9, No. 87; "Application to fMRI image analysis," Computer Today, pp. 60-67, 2001.1, No. 95.

However, target signal separation based on ICA has the following problems.

First, the method relies on the statistical independence of the signals emitted by the sources, but in a real environment the transfer characteristics of the signals, background noise, and similar factors make it difficult to estimate the relevant statistics accurately, and separation accuracy deteriorates as a result.

Second, a diffuse signal source is very difficult to separate, because it is difficult to regard it as a single source.

To address these problems, Patent Document 1 proposes a technique that removes the influence of diffuse signal sources during the ICA calculation. The technique predicts the accuracy of source separation from the magnitude of the cost function calculated for each frequency during the ICA calculation, and reduces the filter response at frequencies where that accuracy is low (hereinafter, SBE: Sub-Band Eliminate). Because SBE judges, for every frequency, whether the separation accuracy exceeds a threshold, it requires more computation than general frequency-domain ICA.

An object of the present invention is to improve on this point and provide a voice input device that removes diffuse noise with only a slight increase in computation compared with SBE applied to general frequency-domain ICA.

To this end, a voice input device is configured as follows. The device executes the independent component analysis method described above, which acquires sound signals from a plurality of acoustic sensors and, using only those signals, obtains by learning a filter that separates the target voice signal from them. The device divides the acoustic frequency band into a plurality of frequency division bands; calculates, for each frequency division band during the iterations of the learning, a discrimination level that evaluates the target-voice separation performance of the filter obtained by the learning; and determines, on the basis of the discrimination level, the frequency division bands in which the learning is executed.

Implementing the present invention reduces the number of learning iterations executed in frequency division bands where learning yields no improvement. A voice input device can thus be provided that removes diffuse noise with only a slight increase in computation compared with SBE applied to general frequency-domain ICA.

The following describes the learning method for obtaining the filter, which characterizes the voice input device according to the present invention, as applied to one example of ICA.

For example, when sound is received by K microphones (sensors), exploiting the fact that the sound signals arriving from the individual sources are statistically independent of one another makes it possible to separate K sources, the same number as the microphones, or fewer. Early ICA-based source separation did not take into account the time differences between sounds arriving from each source, which made it difficult to apply to microphone arrays. In recent years, however, many methods have been proposed that account for these time differences, observe a plurality of sound signals with a microphone array, and find the inverse of the mixing process in the frequency domain.

In general, when sound signals arriving from L sound sources are linearly mixed and observed by K microphones, the observed sound signal at a given frequency f can be written as follows.

$$X(f) = A(f)\,S(f) \tag{1}$$

Here, S(f) is the vector of sound signals emitted by the sources, X(f) is the vector of observed signals at the microphone array (the sound-receiving points), and A(f) is the mixing matrix of the spatial acoustic system between the sources and the receiving points. These can each be written as follows.

$$S(f) = [S_1(f), \ldots, S_L(f)]^{T} \tag{2}$$

$$X(f) = [X_1(f), \ldots, X_K(f)]^{T} \tag{3}$$

Here, the superscript T denotes vector transposition. If the mixing matrix A(f) is known, the sound signal S(f) emitted by the sources can be calculated from the observed signal vector X(f) at the receiving points by computing the generalized inverse of A(f):

$$S(f) = A(f)^{-}\,X(f) \tag{5}$$

where $A(f)^{-}$ denotes the generalized inverse of the matrix A(f). In general, however, A(f) is unknown, and the sound signal S(f) must be obtained using X(f) alone.
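As a toy illustration of Equations (1) and (5), the sketch below mixes two source spectra at a single frequency and recovers them with a known mixing matrix. The matrix and source values are made up, K = L = 2 is assumed, and for a square invertible A(f) the ordinary inverse is used in place of the generalized inverse.

```python
# Toy illustration of Equations (1) and (5) at a single frequency f.
# All numeric values are hypothetical; K = L = 2 is assumed.

def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def inverse_2x2(m):
    """Ordinary inverse of an invertible 2x2 complex matrix."""
    (a, b), (c, d) = m
    det = a * d - b * c
    if abs(det) < 1e-12:
        raise ValueError("mixing matrix is singular at this frequency")
    return [[d / det, -b / det], [-c / det, a / det]]

A = [[1.0 + 0.0j, 0.5 - 0.2j],
     [0.3 + 0.4j, 1.0 + 0.0j]]      # assumed mixing matrix A(f)
S = [0.8 + 0.1j, -0.2 + 0.6j]       # assumed source spectra S(f)

X = matvec(A, S)                    # Equation (1): X(f) = A(f) S(f)
S_hat = matvec(inverse_2x2(A), X)   # Equation (5) with a known A(f)
```

In practice A(f) is not available, which is why the demixing matrix has to be learned from X(f) alone.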

To solve this problem, assume that the sound signal S(f) is generated stochastically and that the components of S(f) are all mutually independent. Because the observed signal X(f) is a mixture, the distributions of its components are not independent. ICA is therefore used to search for the independent components contained in the observed signal. That is, a matrix W(f) that transforms the observed signal X(f) into independent components (hereinafter, the demixing matrix) is calculated, and applying W(f) to X(f) by matrix multiplication yields an approximation of the sound signal S(f) emitted by the sources.

Both time-domain and frequency-domain methods have been proposed for finding the inverse of the mixing process with ICA. The frequency-domain approach is used as the example here.

The learning method for obtaining the filter, which characterizes the voice input device according to the present invention, is described with reference to FIG. 1. As shown in the figure, sound in which the target voice and non-target sound are mixed is detected by microphones 10-1 to 10-n, which are a plurality of acoustic sensors (detection process 20). The plurality of acoustic signals obtained by detection process 20, in which the target voice signal and non-target sound signals are mixed, are divided into a plurality of frequency division bands in band division process 30 and passed to filter learning process 40, which acquires the filter by iterative learning. In this learning method, during the iterations, evaluation process 50 calculates for each frequency division band a discrimination level that evaluates the target-voice separation performance of the filter at that point in the iterations, and determination process 60 determines, on the basis of the calculated discrimination levels, the frequency division bands in which learning is executed. The result of the determination is fed back to filter learning process 40, and learning proceeds accordingly. A concrete example of the discrimination level and how it is calculated is given later.

First, the signal observed at each microphone is subjected to short-time frame analysis using a suitable orthogonal transform. The complex spectral values of one microphone input at a particular frequency bin, taken frame by frame, are treated as a time series. Here, a frequency bin means, for example, an individual complex component of the signal vector obtained by a short-time discrete Fourier transform. The same operation is applied to the other microphone inputs. The resulting time-frequency signal sequence can be written as

$$X(f,t) = [X_1(f,t), \ldots, X_K(f,t)]^{T} \tag{6}$$

Next, signal separation is performed using the demixing matrix W(f). This processing is expressed as follows.
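The short-time frame analysis that produces X(f,t) in Equation (6) might be sketched as below. This is a minimal illustration using a naive DFT; the Hann window, the frame length, the hop size, and the function names `stft` and `time_frequency_sequence` are assumptions made for the example, not details given in the source.

```python
import cmath
import math

def stft(signal, frame_len, hop):
    """Return frames[t][f]: complex DFT value of frame t at frequency bin f."""
    # Hann window: an assumed choice for this sketch.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_len)
              for n in range(frame_len)]
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        seg = [signal[start + n] * window[n] for n in range(frame_len)]
        spectrum = [
            sum(seg[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                for n in range(frame_len))
            for k in range(frame_len // 2 + 1)
        ]
        frames.append(spectrum)
    return frames

def time_frequency_sequence(mics, frame_len, hop):
    """Return X[f][t] = [X_1(f,t), ..., X_K(f,t)] as in Equation (6)."""
    per_mic = [stft(m, frame_len, hop) for m in mics]
    n_frames = len(per_mic[0])
    n_bins = frame_len // 2 + 1
    return [[[frames[t][f] for frames in per_mic] for t in range(n_frames)]
            for f in range(n_bins)]

# Two hypothetical microphone signals: the same sinusoid with different gains.
N = 64
mic1 = [math.sin(2 * math.pi * 4 * n / 16) for n in range(N)]
mic2 = [0.5 * x for x in mic1]
X = time_frequency_sequence([mic1, mic2], frame_len=16, hop=8)
```

Indexing the result as `X[f][t]` keeps all frames of one frequency bin together, which is convenient because the demixing matrix is learned separately for each bin.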

$$Y(f,t) = [Y_1(f,t), \ldots, Y_L(f,t)]^{T} = W(f)\,X(f,t) \tag{7}$$

Here, the demixing matrix W(f) is optimized so that the L output time series Y(f,t) are mutually independent. This processing is carried out for every frequency bin. Finally, an inverse orthogonal transform is applied to the separated time series Y(f,t) to reconstruct the time-domain waveforms of the source signals.
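Applying Equation (7) amounts to one small matrix-vector product per frequency bin and frame. The sketch below assumes the data layout `X[f][t]` (a length-K vector) and `W[f]` (an L x K list of rows); the function name `separate` and the numeric values are hypothetical.

```python
def separate(W, X):
    """Apply Equation (7): Y[f][t] = W[f] X[f][t] for every bin f and frame t.

    W[f] is an L x K matrix (list of rows); X[f][t] is a length-K vector.
    """
    return [
        [[sum(w * x for w, x in zip(row, x_ft)) for row in W_f] for x_ft in X_f]
        for W_f, X_f in zip(W, X)
    ]

# One bin, two frames, K = 2 microphones, L = 2 outputs (values made up).
X = [[[1 + 1j, 2 + 0j], [0 + 1j, 1 - 1j]]]
W = [[[1 + 0j, 0 + 0j],
      [0 + 0j, 0 + 1j]]]
Y = separate(W, X)
```

How W(f) itself is optimized is left to the ICA learning rule and is not shown here.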

Methods proposed for evaluating independence and optimizing the demixing matrix include unsupervised learning algorithms based on minimizing the Kullback-Leibler divergence and algorithms that decorrelate second-order or higher-order correlations (see Non-Patent Document 1).

ICA is used not only for sound signal processing but also, for example, to separate signals that arrive mixed through crosstalk in mobile communications, and to separate and extract a target signal from measurements when signals arising at various sites inside the brain are measured externally with an electroencephalograph, a magnetoencephalograph, fMRI (functional magnetic resonance imaging), or the like (see Non-Patent Document 2).

The following describes a concrete example of the discrimination level, and of how it is calculated, in the learning method for obtaining the filter that characterizes the voice input device according to the present invention.

In frequency bands where signals are difficult to separate even with ICA, the discrimination level, that is, the separation accuracy (for example, the cosine distance), often fails to improve even after several tens of learning iterations. The learning computation in such bands is redundant, so it suffices to detect bands in which the discrimination level changes little and terminate the frequency-domain ICA calculation there.

A decision function that judges when to terminate the frequency-domain ICA calculation is introduced below.

First, the time-frequency signal sequence collected by each microphone and subjected to short-time frame analysis is written as $X(f,t) = [X_1(f,t), \ldots, X_K(f,t)]^{T}$. Next, signal separation is performed using the demixing matrix W(f) optimized by ICA, as expressed by Equation (7), where Y(f,t) is the separated signal after source separation. W(f) is a matrix with L rows and K columns and represents the attenuation characteristics of the filter that separates the target voice signal from the plurality of acoustic signals.

To detect bands in which the separation accuracy of the filter changes little, a cost function that evaluates the independence between the separated signals is defined as the discrimination level evaluating the filter's target-voice separation performance after learning by ICA, and the bands with little change in separation accuracy are determined from the rate of change of this cost function. The cost function may be, for example, a higher-order correlation value between the separated signals or the cosine distance. The cosine distance is particularly efficient because it requires little computation. A cost function based on the cosine distance between two sources is shown below.
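The expression for Equation (8) is not reproduced in this text. A form consistent with the accompanying description (a local time average, a complex conjugate, a correlation coefficient of the two separated signals, and a maximum value of 1) would be the following, offered as an assumption rather than as the published equation:

$$J(f,t) = \frac{\left|\left\langle Y_1(f,t)\,Y_2^{*}(f,t)\right\rangle_t\right|}{\sqrt{\left\langle |Y_1(f,t)|^2\right\rangle_t\,\left\langle |Y_2(f,t)|^2\right\rangle_t}} \tag{8}$$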

Here, the symbol ⟨ ⟩_t denotes averaging over time within a local time interval, for example the interval from time t-200 ms to time t, and the symbol * denotes the complex conjugate. The right-hand side of Equation (8) represents the correlation coefficient of the two separated signals Y_1(f,t) and Y_2(f,t) over that local time interval, and this serves as the discrimination level in this case. The t on the left-hand side of Equation (8) is the t above, that is, the upper end of the local time interval (the right end if time is drawn flowing from left to right), and differs in meaning from the t on the right-hand side.
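A correlation-coefficient cost of the kind described for Equation (8) can be sketched as follows. The normalization, the zero-power convention, the window expressed as a number of frames, and the function names are assumptions made for this illustration.

```python
import math

def cosine_cost(y1, y2):
    """Magnitude of the correlation coefficient of two complex sequences."""
    n = len(y1)
    cross = sum(a * b.conjugate() for a, b in zip(y1, y2)) / n
    p1 = sum(abs(a) ** 2 for a in y1) / n
    p2 = sum(abs(b) ** 2 for b in y2) / n
    if p1 == 0.0 or p2 == 0.0:
        return 0.0  # assumed convention for a silent output
    return abs(cross) / math.sqrt(p1 * p2)

def local_cost(Y1_f, Y2_f, t, window):
    """J(f,t) over the last `window` frames ending at frame t (inclusive)."""
    lo = max(0, t - window + 1)
    return cosine_cost(Y1_f[lo:t + 1], Y2_f[lo:t + 1])

y1 = [1 + 0j, 0 + 1j, -1 + 0j, 0 - 1j]
same = [2 * v for v in y1]                   # scaled copy of y1
ortho = [1 + 0j, 1 + 0j, 1 + 0j, 1 + 0j]     # uncorrelated with y1 here

j_same = local_cost(y1, same, t=3, window=4)
j_ortho = local_cost(y1, ortho, t=3, window=4)
```

A scaled copy of a signal gives a cost near 1 (poor separation), while outputs that are uncorrelated over the window give a cost near 0.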

In practical applications, the value of the discrimination level depends on factors such as where the time frames are cut in the short-time frame analysis, so pronounced discontinuities can arise between frequencies. An example of such inter-frequency discontinuity of the cost function is shown by the rapidly fluctuating line in FIG. 9. To avoid this, one option is to use a smoothed cost function, that is, a smoothed discrimination level, obtained by taking a moving average of the cost function of Equation (8) over a certain frequency bandwidth. This can be written as follows (see the slowly varying solid line in FIG. 9).
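The expression for Equation (9) is likewise not reproduced in this text. A moving average over neighboring frequencies that matches the description could take the following form; the symmetric window of 2B+1 bins is an assumption:

$$J_S(f,t) = \frac{1}{2B+1}\sum_{f'=f-B}^{f+B} J(f',t) \tag{9}$$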

Here, B is a parameter that sets the smoothing width; that is, the smoothing is performed by averaging over a local frequency interval. The smoothed cost function J_S(f,t) takes a small value if the separated signals are independent and a large value if they are not. Its maximum value is 1.
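The frequency smoothing can be sketched as a simple moving average over bins. The symmetric window of up to 2B+1 bins and its truncation at the band edges are assumptions; the source only states that B sets the smoothing width.

```python
def smooth_cost(J_t, B):
    """Moving average of J(f,t) over bins f-B..f+B, truncated at the edges."""
    n = len(J_t)
    out = []
    for f in range(n):
        lo, hi = max(0, f - B), min(n, f + B + 1)
        out.append(sum(J_t[lo:hi]) / (hi - lo))
    return out

J_t = [0.0, 1.0, 0.0, 1.0, 0.0]   # hypothetical, strongly fluctuating cost
J_S = smooth_cost(J_t, B=1)
```

Because each output is an average of values no larger than 1, the smoothed cost also stays at or below 1.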

Furthermore, the rate of change ΔJ of the cost function can be used to detect bands in which the separation accuracy does not improve. This ΔJ corresponds to the rate of change between discrimination levels recited in claim 1. For example, the rate of change ΔJ_S(f,t) in one frequency division band at time t can be expressed, using the cost function J_S(f,t) of that band at time t and the cost function J_S(f,t-m) at time t-m with m > 0, as

$$\Delta J_S(f,t) = \left\| J_S(f,t) - J_S(f,t-m) \right\| \tag{10}$$

The right-hand side of Equation (10) is a norm of the difference vector J_S(f,t) - J_S(f,t-m) between the two smoothed discrimination levels, for example the square of J_S(f,t) - J_S(f,t-m) averaged over frequency within the frequency division band. The f on the left-hand side of Equation (10) is a frequency that labels the frequency division band, for example its center frequency, and differs in meaning from the f on the right-hand side.
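Using the example norm mentioned for Equation (10), the mean squared difference over the bins of one band, the rate of change could be computed as in the sketch below. Representing a band as a half-open bin range `(lo, hi)` is an assumption of this example.

```python
def change_rate(J_S_now, J_S_prev, band):
    """Mean squared difference of the smoothed cost over the bins in `band`."""
    lo, hi = band  # half-open bin range [lo, hi): hypothetical representation
    diffs = [(J_S_now[f] - J_S_prev[f]) ** 2 for f in range(lo, hi)]
    return sum(diffs) / len(diffs)

bands = [(0, 2), (2, 4)]
J_S_prev = [0.90, 0.80, 0.50, 0.50]   # smoothed cost at time t-m (made up)
J_S_now = [0.50, 0.40, 0.50, 0.50]    # smoothed cost at time t (made up)
delta = [change_rate(J_S_now, J_S_prev, b) for b in bands]
```

In this toy data the first band is still improving, while the second band shows no change and is a candidate for stopping.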

The decision function B(f) can then be given by Equation (11), where J_T is a threshold set in advance for the decision. If B(f) = 1, the learning calculation is performed in the frequency division band labeled by f; if B(f) = 0, it is determined that the learning calculation is not performed in that band.
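The expression for Equation (11) is not reproduced in this text. A form consistent with the rule stated here, and with the flag update in the flow of FIG. 6 (the flag is turned off when ΔJ is at or below the threshold), would be the following, given as an assumption:

$$B(f) = \begin{cases} 1, & \Delta J_S(f,t) > J_T \\ 0, & \Delta J_S(f,t) \le J_T \end{cases} \tag{11}$$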

Equation (11) makes it possible to detect automatically, without prior information about the sound sources, the bands in which the separation accuracy may still improve. ΔJ_S(f,t) may itself be smoothed, as J_S(f,t) is, to take the influence of multiple bands into account, and J(f,t) may be used as is in place of J_S(f,t).

The description above concerns the case of two sound sources, but the effect of the present invention is also obtained when there are three or more sources by applying the above technique between each pair of sources.

The configuration of the present invention is described below by way of embodiments.

(Embodiment 1)
FIG. 1 is a block diagram of the filter update process in the first embodiment. In the figure, 10-1 to 10-n are microphones, a plurality of acoustic sensors that detect sound in which the target voice and non-target sound are mixed and output it as a plurality of acoustic signals in which the target voice signal and non-target sound signals are mixed; 20 is a detection process that detects the acoustic signals output by microphones 10-1 to 10-n and converts them into discrete signals; and 30 is a band division process that decomposes the discrete signals into frequencies and divides them into frequency division bands. The FFT is commonly used to decompose a signal into frequencies, but any orthogonal transform, such as a wavelet transform or Z-transform, may be used.

Filter learning process 40 acquires, by iterative learning, a filter that separates at least one target voice signal from the plurality of acoustic signals divided into frequency division bands. Any method that learns per frequency may be used for this process; this embodiment uses frequency-domain ICA.

Evaluation process 50 calculates, for each frequency division band, the discrimination level that evaluates the target-voice separation performance of the filter acquired in filter learning process 40. Determination process 60 determines, on the basis of the discrimination levels calculated in evaluation process 50, the frequency division bands in which the learning calculation is performed, and the result is fed back to filter learning process 40. Evaluation process 50 and determination process 60 need not be executed on every learning iteration of filter learning process 40. For example, they may run once every 10 iterations, or they may run on every iteration at the start of learning and then once every 10 iterations after learning has finished in most frequency division bands.

One concrete way to determine the frequency division bands in which the learning calculation is performed is as follows. In a frequency division band where the rate of change between the discrimination level calculated at time t and the discrimination level calculated at time t-m (m > 0) does not exceed a preset threshold, it is determined that the learning calculation is not performed in the learning iterations after time t. In a frequency division band where that rate of change is greater than the preset threshold, the learning calculation continues. If the bands are determined again after time t, the learning calculation is performed from time t until that next determination; if the iterative learning ends without another determination, the learning calculation is performed from time t until the end.
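The band decision just described can be sketched as a flag update per frequency division band. The function name `decide_bands` and the assumption that a stopped band is never re-enabled are choices made for this illustration.

```python
def decide_bands(delta, threshold, active):
    """Return updated activity flags for each frequency division band.

    A band keeps learning only if it was still active and its change rate
    exceeds the threshold; a stopped band stays stopped (an assumption).
    """
    return [was_on and d > threshold for was_on, d in zip(active, delta)]

active = [True, True, True]
delta = [0.16, 0.0, 0.002]        # hypothetical change rates per band
active = decide_bands(delta, threshold=0.001, active=active)
```

Bands whose flag is off are simply skipped by the learning step in later iterations, which is where the computation is saved.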

After learning ends, the filter acquired in filter learning process 40 is used as the content of attenuation process 45 in FIG. 2.

In this way, omitting wasted computation in the adaptive learning of each frequency division band reduces the amount of computation required for filter learning.

As the discrimination level, J(f,t) given by Equation (8) can be used, for example, and as the rate of change between discrimination levels, ΔJ_S(f,t) given by Equation (10) can be used, for example.

FIG. 2 is a block diagram of the filter processing. The acoustic signals output by microphones 10-1 to 10-n, in which the target voice signal and non-target sound signals are mixed, are converted into discrete signals in detection process 20, pass through attenuation process 45, whose content is the filter acquired in filter learning process 40, and are output as the target voice signal (signal R100). Attenuation process 45 either extracts the target voice signal from the input acoustic signals or suppresses the non-target sound signals.

FIG. 3 is a block diagram of the filter update system. General-purpose microphones can be used as microphones 110-1 to 110-n. Detection means 120 corresponds to filter (anti-aliasing filter) 220, AD converter 230, and arithmetic device 240 in FIG. 5, and is built by combining general operating circuits such as a CPU, MPU, DSP, or FPGA. Band division means 130, filter learning means 140, evaluation means 150, and determination means 160 each correspond to arithmetic device 240 and storage device 250 in FIG. 5.

FIG. 4 is a block diagram of the filter processing system. Microphones 110-1 to 110-n and detection means 120 are the same as those shown in FIG. 3. Attenuation means 145 corresponds to arithmetic device 240 and storage device 250 in FIG. 5. Storage means 170 corresponds to storage device 250 in FIG. 5 and consists of a general storage medium such as cache memory, main memory, an HDD, CD, MD, DVD, optical disk, or FDD.

FIG. 5 is a block diagram showing an example of the system configuration. The acoustic signals output by microphones 210-1 to 210-n pass through filter 220 to AD converter 230, are AD-converted, and are then input to arithmetic device 240 for processing. Filter 220 is used to remove noise contained in the acoustic signals.

FIG. 6 is a flowchart of the filter learning procedure. S100 to S170 denote the individual steps.

In S100, the system is initialized and data are loaded into memory.

In S110, sound input is detected. Once it is detected, the procedure advances to S120.

In S120, the input signal is divided into bands. The bandwidth of each frequency bin may be fixed or variable.

In S130, the filter is learned and updated, for example using frequency-domain ICA.

In S150, the source separation accuracy at time t, that is, the discrimination level J_t, is calculated, and the rate of change ΔJ between J_t and the discrimination level J_{t-m} at time t-m (m > 0) is then calculated. The discrimination level J_t at time t is stored.

In S160, the update flags are evaluated and set. That is, when ΔJ ≤ threshold, the flag for that divided band is turned off.

In S170, the update flags are checked. If all update flags are OFF, learning ends; if any flag is ON, the procedure returns to S140.

学習が終了したら、完成したフィルタを図２の減衰過程45のフィルタにする。 When the learning is completed, the completed filter is used as the filter of the attenuation process 45 in FIG.

FIG. 7 is a flowchart of the filter learning procedure (details of S150 and S160).

In S151, the identification level Jt at time t (for example, the cosine distance between the signals of two or more sound sources) is calculated.

In S152, the rate of change ΔJ of the identification level in the divided band ω is calculated from the identification levels at time t-m (m > 0) and at time t.

In S161, if ΔJ ≦ threshold in the divided band ω, the process proceeds to S163; otherwise it proceeds to S162.

In S162, the divided band ω is updated, and the process proceeds to S164.

In S163, the update flag is turned OFF.

In S164, if no divided band ω remains, this flow ends; if one remains, the process returns to S152.

In the above flow, for a frequency division band in which the rate of change ΔJ between the identification level Jt calculated at time t and the identification level Jt-m calculated at time t-m does not exceed a preset threshold, it is determined that no learning calculation is performed in the learning iterations after time t.

Conversely, for a frequency division band in which the rate of change ΔJ between Jt and Jt-m is larger than the preset threshold, learning calculation continues. If the frequency division bands for learning calculation are determined again after time t, it is determined that learning calculation is performed in the learning iterations from time t until that next determination. If the learning iterations end without such a further determination, it is determined that learning calculation is performed in the learning iterations from time t until the end.
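The flows of FIG. 6 and FIG. 7 can be sketched as follows. This is a minimal illustration, not the implementation described in the specification: `learn_step` and `level` are hypothetical callbacks standing in for one frequency-domain ICA update of a band and for the identification level J of that band, ΔJ is taken here as the absolute difference of two scalar levels with m equal to one iteration, and `max_iters` is a safety cap added for the sketch.

```python
from typing import Callable, List


def learn_with_band_flags(
    n_bands: int,
    learn_step: Callable[[int], None],  # hypothetical: one ICA update for band k (S130)
    level: Callable[[int], float],      # hypothetical: identification level J of band k (S150)
    threshold: float,
    max_iters: int = 1000,              # safety cap for this sketch only
) -> List[int]:
    """Return, for each band, how many learning updates were performed."""
    flags = [True] * n_bands                      # update flags, all ON at the start
    prev = [level(k) for k in range(n_bands)]     # stored levels J_{t-m}
    counts = [0] * n_bands
    for _ in range(max_iters):
        if not any(flags):                        # S170: every flag OFF, learning ends
            break
        for k in range(n_bands):
            if not flags[k]:
                continue                          # band already stopped: no learning calculation
            learn_step(k)
            counts[k] += 1
            j_now = level(k)
            if abs(j_now - prev[k]) <= threshold:  # S160/S161: delta J <= threshold
                flags[k] = False                   # S163: turn the update flag OFF
            prev[k] = j_now                        # store J_t for the next comparison
    return counts
```

A band whose level barely changes is dropped after its first evaluation, while a band that is still improving keeps being updated, which is the source of the saving in calculation.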

FIG. 8 is a flowchart of the filter processing procedure applied to an input signal.

In S100, the system is initialized and the filter is loaded into memory.

In S110, sound input is detected. Once sound is detected, the process proceeds to S180.

In S180, the input signal is filtered, and the resulting target voice signal is output.

FIG. 9 shows an example of calculated values of a cost function (cosine distance), which is one example of the identification level. The horizontal axis is frequency and the vertical axis is the cost function (for example, the cosine distance). The rapidly fluctuating line (dotted) is the calculated cost function, and the other line (solid) shows the same values after smoothing. The values shown are obtained after several tens to several hundreds of learning iterations. The closer the cost function is to 1, the more likely it is that separation has not succeeded. For example, in the frequency division bands below 100 Hz or at 3500 Hz and above, the cost function is high, so reduced separation accuracy is expected. Initially the value is close to 1 in all frequency division bands, and it decreases as learning proceeds.
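As an illustration of the kind of quantity plotted in FIG. 9, the sketch below computes a cosine-type cost for one band and smooths it across neighbouring bands. It rests on assumptions that the text does not confirm: the cost is taken as the magnitude of the normalized inner product of two separated signals (so that 1 means the outputs are still strongly correlated), and the smoothing is a plain moving average. The function names and the `half_width` parameter are hypothetical.

```python
import math
from typing import List, Sequence


def cosine_level(y1: Sequence[complex], y2: Sequence[complex]) -> float:
    """|<y1, y2>| / (||y1|| * ||y2||) for one band; 0.0 if either signal is silent."""
    inner = sum(a * b.conjugate() for a, b in zip(y1, y2))
    n1 = math.sqrt(sum(abs(a) ** 2 for a in y1))
    n2 = math.sqrt(sum(abs(b) ** 2 for b in y2))
    if n1 == 0.0 or n2 == 0.0:
        return 0.0
    return abs(inner) / (n1 * n2)


def smooth(levels: Sequence[float], half_width: int = 1) -> List[float]:
    """Moving average over neighbouring bands; the window shrinks at the edges."""
    out = []
    for k in range(len(levels)):
        lo = max(0, k - half_width)
        hi = min(len(levels), k + half_width + 1)
        out.append(sum(levels[lo:hi]) / (hi - lo))
    return out
```

Identical outputs give a value of about 1 and orthogonal outputs give 0, matching the reading of FIG. 9 in which values near 1 indicate poor separation.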

FIG. 10 is a conceptual diagram of the relationship between the cost function and the number of learning iterations. The horizontal axis is frequency and the vertical axis is the cost function (for example, the cosine distance). L110 conceptually shows the cost function after 10 learning iterations of frequency-domain ICA, and L120 shows it after 20 iterations. The rate of change from L110 to L120 is large near 1 kHz and small near 100 Hz and 3 kHz.

FIG. 11 is a conceptual diagram of the transition of the cost function J when learning is terminated in the bands where ΔJ does not exceed the threshold J_T. L210 (dashed line) is the value of J at each frequency after 10 learning iterations, L220 (dotted line) after 20 iterations, and L230 (solid line) after 30 iterations. In the frequency division bands where the rate of change ΔJ(20 iterations) in the transition from L210 to L220 does not exceed the threshold, that is, where ΔJ(20 iterations) ≦ J_T (below 100 Hz and at 3 kHz or above), learning ends at 20 iterations and no further learning calculation is performed. In the frequency division bands where the rate of change ΔJ(30 iterations) in the transition from L220 to L230 does not exceed the threshold, that is, where ΔJ(30 iterations) ≦ J_T (below 500 Hz and at 2 kHz or above), learning likewise ends and no further learning calculation is performed.

Thus, in the present invention, it is decided as appropriate not to perform learning calculation in frequency division bands where learning is no longer effective, which reduces the amount of calculation during filter learning.

(Embodiment 2)
Specific examples of how the evaluation process 50 of FIG. 1 may be executed include the following cases.

(Case 1)
The evaluation process 50 is executed at every learning iteration. That is, the identification level is calculated at every iteration, and the rate of change between the current and previous identification levels is calculated and evaluated. This determines the learning bands most precisely, but the amount of calculation is large. It is practical if the CPU is sufficiently fast.

(Case 2)
The evaluation process 50 is executed once every 10 learning iterations. That is, with u denoting the duration of one learning iteration and i = 10u, the evaluation process 50 executed at time t is executed again at time t+i. Compared with Case 1, the amount of calculation is greatly reduced. How often the evaluation process 50 is executed, that is, what multiple of u is chosen for i, may be decided in view of the environment and the permissible amount of calculation.

(Case 3)
The time m is set to 5u, the length of five learning iterations. With m = u, learning may be stopped by mistake in a frequency division band where it is still effective; making m longer in this way reduces such errors. In other words, the calculation becomes robust against small fluctuations.

(Case 4)
The learning stop time is set to a learning step several iterations ahead. If the threshold J_T is set inappropriately, learning may be repeated a very large number of times. To prevent this, a procedure for setting a learning stop time is inserted into the learning iteration flow. This reduces the amount of calculation for evaluation and determination. Most simply, the learning operation may be terminated when the number of learning iterations reaches a specified value. In this case, as shown in FIG. 12, step S170 additionally ends learning when the maximum number of learning iterations is reached, and, as shown in FIG. 13, a stop process 80 for ending the learning is provided.
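Cases 2 to 4 can be combined in one loop, sketched below under stated assumptions. `learn_step` and `level` are hypothetical callbacks (one learning update of a band, and its identification level), `eval_every` plays the role of i expressed in iterations, and `max_iters` is the specified maximum number of iterations of Case 4. For simplicity the sketch compares each evaluation with the previous one, so m equals i here; the text allows them to be chosen separately.

```python
from typing import Callable, List, Tuple


def learn_scheduled(
    n_bands: int,
    learn_step: Callable[[int], None],  # hypothetical: one learning update for band k
    level: Callable[[int], float],      # hypothetical: identification level J of band k
    threshold: float,
    eval_every: int = 10,               # Case 2: evaluate once per this many iterations
    max_iters: int = 100,               # Case 4: stop when this count is reached
) -> Tuple[int, List[bool]]:
    """Return the number of iterations run and the final update flags."""
    flags = [True] * n_bands
    last = [level(k) for k in range(n_bands)]   # level at the previous evaluation
    iters = 0
    while any(flags) and iters < max_iters:
        for k in range(n_bands):
            if flags[k]:
                learn_step(k)
        iters += 1
        if iters % eval_every == 0:             # evaluation and determination
            for k in range(n_bands):
                if flags[k]:
                    j_now = level(k)
                    if abs(j_now - last[k]) <= threshold:
                        flags[k] = False        # stop learning in this band
                    last[k] = j_now
    return iters, flags
```

Between evaluations no identification level is computed at all, and the iteration cap guarantees termination even when the threshold is poorly chosen.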

(Case 5)
As the values of Jt and Jt-m, the sum or the average of the values at the current learning iteration, one iteration earlier, and two iterations earlier is used. This makes the calculation robust against small fluctuations.
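The averaging of Case 5 can be sketched with a short fixed-length buffer per band. The class name `SmoothedLevel` and its `window` parameter are hypothetical; the average variant is shown, and until three values are available the sketch averages whatever has been pushed so far, which is an assumption.

```python
from collections import deque


class SmoothedLevel:
    """Running average of the most recent identification levels of one band."""

    def __init__(self, window: int = 3) -> None:
        # deque with maxlen drops the oldest value automatically
        self._buf = deque(maxlen=window)

    def push(self, j: float) -> float:
        """Add the newest level and return the average of the buffered values."""
        self._buf.append(j)
        return sum(self._buf) / len(self._buf)
```

The value returned by `push` would then be used in place of the raw Jt when ΔJ is computed.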

(Case 6)
In Cases 1 to 4, either J_S(j,t) and J_S(j,t-m) given by Equation (9) are used as Jt and Jt-m, with ΔJ_S(j,t) given by Equation (10) used as ΔJ, or J(j,t) and J(j,t-m) given by Equation (8) are used as Jt and Jt-m, with ΔJ(j,t) given by the following equation used as ΔJ.

ΔJ(f,t) = ‖J(f,t) − J(f,t-m)‖ (12)
Here, the meaning of the expression on the right-hand side is the same as in Equation (10).

(Embodiment 3)
The frequency division bands are grouped into a plurality of blocks, and it is determined that the learning calculation in the repetition of learning is not performed in any of the frequency division bands belonging to a block that contains a frequency division band for which it has been determined that the learning calculation is not performed. An example is shown in FIG. 14. In the figure, the frequency division bands (not shown) are grouped into five blocks B1 to B5, where the number of frequency division bands ≧ the number of blocks. B1 and B5 finish after 10 learning iterations, B2 and B4 finish after 20, and B3 has continued learning up to 30 iterations. Grouping into blocks in this way reduces the amount of calculation for evaluation and determination.
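The block rule can be sketched as a small helper that spreads a stop decision to every band in the same block. The function name and the `block_of` mapping (band index to block id) are hypothetical and only illustrate the rule stated above.

```python
from typing import List, Sequence


def propagate_block_stops(flags: Sequence[bool], block_of: Sequence[int]) -> List[bool]:
    """flags[k] is True while band k is still learning; block_of[k] is its block id.

    Any block containing at least one stopped band is stopped as a whole.
    """
    stopped_blocks = {block_of[k] for k, on in enumerate(flags) if not on}
    return [on and block_of[k] not in stopped_blocks for k, on in enumerate(flags)]
```

With this rule a single evaluated band can stand in for its whole block, which is one way the evaluation and determination cost could be kept low.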

FIG. 1 is a diagram explaining the learning method for obtaining the filter in the voice input device according to the present invention.
FIG. 2 is a block diagram of the filter processing process in the voice input device according to the present invention.
FIG. 3 is a block diagram of the filter update system in the voice input device according to the present invention.
FIG. 4 is a block diagram of the filter processing system in the voice input device according to the present invention.
FIG. 5 is a diagram showing a configuration example of the voice input device according to the present invention.
FIG. 6 is a flowchart of filter updating in the voice input device according to the present invention.
FIG. 7 is a detailed flowchart of filter updating in the voice input device according to the present invention.
FIG. 8 is a flowchart of filter processing in the voice input device according to the present invention.
FIG. 9 is a diagram showing an example of calculated values of the cost function in a voice input device.
FIG. 10 is a conceptual diagram of the relationship between the cost function and the number of learning iterations in a voice input device.
FIG. 11 is a conceptual diagram of the relationship between the cost function and the number of learning iterations in the voice input device according to the present invention.
FIG. 12 is a flowchart of filter updating in the voice input device according to the present invention when a limit is placed on the number of learning iterations.
FIG. 13 is a diagram explaining the learning method for obtaining the filter in the voice input device according to the present invention when a limit is placed on the number of learning iterations.
FIG. 14 is a conceptual diagram of the relationship between the cost function and the number of learning iterations in the voice input device according to the present invention when the bands are grouped into blocks.

Explanation of symbols

10-1 to 10-n: microphones, 20: detection process, 30: band division process, 40: filter learning process, 45: attenuation process, 50: evaluation process, 60: determination process, 80: stop process, 110-1 to 110-n: microphones, 120: detection means, 130: band division means, 140: filter learning means, 145: attenuation means, 150: evaluation means, 160: determination means, 170: storage means, 210-1 to 210-n: microphones, 220: filter, 230: AD converter, 240: arithmetic device, 250: storage device.

Claims

1. A voice input device that detects, with a plurality of acoustic sensors, sound in which a target voice and non-target sound are mixed, thereby obtaining a plurality of acoustic signals in which a target voice signal and a non-target sound signal are mixed, and that executes an independent component analysis method for acquiring, by repetition of learning, a filter that separates at least one target voice signal from the plurality of acoustic signals,
wherein an acoustic frequency band is divided into a plurality of frequency division bands, an identification level for evaluating the target voice separation performance of the filter obtained by the learning is calculated for each frequency division band in the course of the learning, and a frequency division band in which learning calculation is to be performed is determined from among the frequency division bands based on the identification level.

2. The voice input device according to claim 1, wherein, in the process of determining the frequency division band in which the learning calculation is to be performed, with m > 0, it is determined that the learning calculation is not performed in the repetition of the learning after time t in a frequency division band in which the rate of change between the identification level calculated at time t and the identification level calculated at time t-m does not exceed a preset threshold.

3. The voice input device according to claim 1 or 2, wherein the determination of the frequency division band in which the learning calculation is to be performed, made at time t, is made again at time t+i, with i > 0.

4. The voice input device according to claim 1, 2, or 3, wherein, in the process of determining the frequency division band in which the learning calculation is to be performed, with m > 0, in a frequency division band in which the rate of change between the identification level calculated at time t and the identification level calculated at time t-m is larger than a preset threshold: when the determination of the frequency division band in which the learning calculation is to be performed is made again after time t, it is determined that the learning calculation in the repetition of the learning is performed from time t until that determination is made; and when the repetition of the learning ends without that determination being made after time t, it is determined that the learning calculation in the repetition of the learning is performed from time t until the end.

5. The voice input device according to claim 1, wherein the learning operation is terminated when the number of repetitions of the learning reaches a specified value.

6. The voice input device according to claim 1, wherein the identification level is a correlation coefficient, in a local time interval, between two separated signals obtained by filtering the plurality of acoustic signals with the filter, and the rate of change between two identification levels is the norm of the difference vector between the two identification levels.

7. The voice input device wherein the identification level is a correlation coefficient, in a local time interval, between two separated signals obtained by filtering the plurality of acoustic signals with the filter, and the rate of change between two identification levels is the norm of the difference vector between the two identification levels after each has been smoothed by averaging over a local frequency interval.

8. The voice input device according to claim 1, wherein the frequency division bands are grouped into a plurality of blocks, and it is determined that the learning calculation in the repetition of the learning is not performed in any frequency division band belonging to a block that contains a frequency division band for which it has been determined that the learning calculation in the repetition of the learning is not performed.