JP2004191968A - Method and device for separating signal source

Info

Publication number: JP2004191968A
Application number: JP2003400576A
Authority: JP
Inventors: Sabine V Deligne; Satyanarayana Dharanipragada
Original assignee: International Business Machines Corp
Current assignee: International Business Machines Corp
Priority date: 2002-12-10
Filing date: 2003-11-28
Publication date: 2004-07-08
Anticipated expiration: 2023-11-28
Also published as: JP3999731B2; US7225124B2; US20040111260A1

Abstract

PROBLEM TO BE SOLVED: To provide a technique for separating a signal from a mixture of a first source signal associated with a first source and a second source signal associated with a second source.

SOLUTION: Two signals representing two mixtures of the first source signal and the second source signal are first obtained. Using those two signals and at least one known statistical property associated with the first and second sources, the first source signal is separated from the mixtures in a non-linear signal domain, without the need for a reference signal.

COPYRIGHT: (C)2004, JPO&NCIPI

Description

The present invention relates generally to signal separation techniques and, more particularly, to techniques for separating non-linear mixtures of sources when some statistical property of each source is known, for example when the probability density function of each source is modeled by a known mixture of Gaussians.

Source separation addresses the problem of recovering source signals by observing different mixtures of those signals. Conventional approaches to source separation generally assume that the source signals are mixed linearly. They are also generally blind, in the sense that no detailed information about the statistical properties of the sources (or, in semi-blind methods, very little) is assumed to be known and explicitly exploited in the separation process. The method disclosed in the paper by J.F. Cardoso entitled "Blind Signal Separation: Statistical Principles," Proceedings of the IEEE, vol. 9, October 1998, pp. 2009-2025, is one example of a source separation method that assumes a linear mixture and is blind.

The method disclosed in the paper by A. Acero et al. entitled "Speech/Noise Separation Using Two Microphones and a VQ Model of Speech Signals," Proceedings of ICSLP 2000, proposes a source separation technique that uses a priori information about the probability density functions (pdfs) of the sources. However, because that technique operates in the Linear Predictive Coefficient (LPC) domain, which results from a linear transformation of the waveform domain, it assumes that the observed mixture is linear. It therefore cannot be used when the mixture is non-linear.

There are cases, however, in which the observed mixture is not linear and in which a priori information about the statistical properties of the sources can be obtained reliably. This is the case, for example, in speech applications that require the separation of mixed audio sources. Examples of such applications are speech recognition in the presence of competing speech, interfering music, or a specific noise source such as car or street noise.

Even if the audio sources can be assumed to be mixed linearly in the waveform domain, a linear mixture of waveforms results in a non-linear mixture in the cepstral domain, which is the domain in which speech applications usually operate. As is known, cepstra are vectors computed by the front end of a speech recognition system from the log spectrum of a segment of the speech waveform. See, for example, chapter 3 of "Fundamentals of Speech Recognition" by L. Rabiner et al., Prentice Hall Signal Processing Series, 1993.

Because of this log transformation, a linear mixture of waveform signals results in a non-linear mixture of cepstral signals. Nevertheless, in speech applications it is computationally advantageous to perform source separation in the cepstral domain rather than in the waveform domain. The stream of cepstra corresponding to an utterance is computed from successive overlapping segments of the speech waveform. Segments are usually about 100 milliseconds (ms) long, and the shift between two adjacent segments is about 10 ms. A separation process operating in the cepstral domain on 11 kilohertz (kHz) speech data therefore needs to be applied only once every 110 samples, whereas in the waveform domain it would have to be applied to every sample.
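The arithmetic behind the "every 110 samples" figure can be checked with a short sketch. The function names are illustrative, and the sampling rate is taken as exactly 11000 Hz, which is an assumption (the text only says 11 kHz).

```python
def samples_per_shift(sample_rate_hz: int, shift_ms: float) -> int:
    """Number of waveform samples between the starts of two adjacent frames."""
    return round(sample_rate_hz * shift_ms / 1000.0)


def num_frames(num_samples: int, sample_rate_hz: int,
               frame_ms: float, shift_ms: float) -> int:
    """Number of complete overlapping frames that fit in the waveform."""
    frame_len = round(sample_rate_hz * frame_ms / 1000.0)
    hop = samples_per_shift(sample_rate_hz, shift_ms)
    if num_samples < frame_len:
        return 0
    return 1 + (num_samples - frame_len) // hop


# Assumed figures: 11000 Hz sampling, 100 ms frames, 10 ms shift.
hop = samples_per_shift(11000, 10.0)
frames_in_one_second = num_frames(11000, 11000, 100.0, 10.0)
```

Under these assumptions a cepstral-domain process runs once per frame, i.e. once per `hop` samples, instead of once per sample.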

Furthermore, the pdf of speech and the pdfs of many possible interfering audio signals (e.g., competing speech, music, specific noise sources) can be modeled reliably in the cepstral domain and integrated into the separation process. The pdf of speech in the cepstral domain is estimated for recognition purposes, and the pdf of an interfering source can be estimated off-line on a representative set of data collected from similar sources.

The method disclosed in the paper by S. Deligne and R. Gopinath entitled "Robust Speech Recognition with Multi-channel Codebook Dependent Cepstral Normalization (MCDCN)," Proceedings of ASRU 2001, 2002, proposes a source separation technique that integrates a priori information about the pdf of at least one of the sources and does not assume linear mixing. In this method, an unwanted source signal interferes with a desired source signal. A mixture of the desired and interfering signals is recorded on one channel, while the interfering signal alone (i.e., without the desired signal) is recorded on a second channel, forming a so-called reference signal. In many cases, however, a reference signal is not available. For example, in an in-car speech recognition application with competing speech from passengers, it is not possible to capture separately the speech of the user of the speech recognition system (e.g., the driver) and the competing speech of the other passengers.

Accordingly, there is a need for a source separation technique that overcomes the drawbacks and disadvantages associated with conventional source separation techniques.

J.F. Cardoso, "Blind Signal Separation: Statistical Principles," Proceedings of the IEEE, vol. 9, October 1998, pp. 2009-2025.
A. Acero et al., "Speech/Noise Separation Using Two Microphones and a VQ Model of Speech Signals," Proceedings of ICSLP 2000.
L. Rabiner et al., "Fundamentals of Speech Recognition," chapter 3, Prentice Hall Signal Processing Series, 1993.
S. Deligne and R. Gopinath, "Robust Speech Recognition with Multi-channel Codebook Dependent Cepstral Normalization (MCDCN)," Proceedings of ASRU 2001, 2002.

It is an object of the present invention to provide an improved speech separation technique.

In one aspect of the invention, a technique for separating a signal from a mixture of a first source signal associated with a first source and a second source signal associated with a second source comprises the following steps/operations. First, two mixed signals are obtained, each representing a mixture of the first source signal and the second source signal. Then, using the two mixed signals and at least one known statistical property associated with the first source and the second source, and without requiring a reference signal, the first source signal is separated from the mixture in a non-linear signal domain.

The two mixed signals obtained represent, respectively, an unweighted mixture of the first and second source signals and a weighted mixture of the first and second source signals. The separation step/operation may be performed in the non-linear domain by converting the unweighted mixed signal into a first cepstral mixed signal and the weighted mixed signal into a second cepstral mixed signal.

The separation step/operation may further include iteratively generating an estimate of the second source signal based on the second cepstral mixed signal and on the estimate of the first source signal from the previous iteration of the separation step/operation. Preferably, the step/operation of generating the estimate of the second source signal assumes that the second source signal is modeled by a mixture of Gaussians.

The separation step/operation may further include iteratively generating an estimate of the first source signal based on the first cepstral mixed signal and on the estimate of the second source signal. Preferably, the step/operation of generating the estimate of the first source signal assumes that the first source signal is modeled by a mixture of Gaussians.

After the separation process, the separated first source signal may be used by a signal processing application, for example a speech recognition application. In a speech processing application, the first source signal may be a speech signal, and the second source signal may be a signal representing competing speech, interfering music, or a specific noise source.

These and other objects, features, and advantages of the present invention will become apparent from the following detailed description of illustrative embodiments, which is to be read in connection with the accompanying drawings.

The present invention is described below in the context of an illustrative speech recognition application. The illustrative application is considered "codebook dependent." The phrase "codebook dependent" refers to the use of a mixture of Gaussians to model the probability density function of each source signal. The codebook associated with a source signal comprises a collection of codewords characterizing that source signal. Each codeword is specified by its prior probability and by the parameters of a Gaussian distribution, namely its mean and covariance matrix. In other words, a mixture of Gaussians is equivalent to a codebook.
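A codebook in this sense can be represented as a plain list of codewords. The following is a minimal sketch, not the patent's implementation: the names `Codeword`, `log_gaussian`, and `posteriors` are hypothetical, and the covariance matrix is assumed to be diagonal for brevity.

```python
import math
from typing import List, NamedTuple


class Codeword(NamedTuple):
    prior: float        # prior probability of the codeword
    mean: List[float]   # mean vector of the Gaussian
    var: List[float]    # diagonal of the covariance matrix (assumption)


def log_gaussian(x: List[float], mean: List[float], var: List[float]) -> float:
    """Log density of a diagonal-covariance Gaussian at x."""
    return sum(-0.5 * (math.log(2.0 * math.pi * s) + (a - m) ** 2 / s)
               for a, m, s in zip(x, mean, var))


def posteriors(codebook: List[Codeword], x: List[float]) -> List[float]:
    """Posterior probability p(k | x) of each codeword k given a vector x."""
    scores = [math.log(cw.prior) + log_gaussian(x, cw.mean, cw.var)
              for cw in codebook]
    top = max(scores)                        # subtract the max for stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]


# Toy one-dimensional codebook with two equally likely codewords.
toy_codebook = [Codeword(0.5, [-1.0], [1.0]), Codeword(0.5, [1.0], [1.0])]
```

The posterior over codewords is the quantity the separation process weights its corrections by, which is why the codebook and the mixture of Gaussians can be treated as the same object.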

It should be further understood, however, that the invention is not limited to this application or to any particular application. Rather, the invention is more generally applicable to any application in which it is desirable to perform a source separation process that does not assume a linear mixture of the sources, that assumes at least one statistical property of the sources is known, and that does not require a reference signal.

Accordingly, before describing the source separation process of the invention in the context of speech recognition, the principles of the source separation are first described in general terms.

Let ypcm1 and ypcm2 be two waveform signals that are mixed linearly, yielding two mixtures xpcm1 and xpcm2 according to xpcm1 = ypcm1 + ypcm2 and xpcm2 = a ypcm1 + ypcm2, where a < 1. Further, let yf1 and yf2 be the spectra of the signals ypcm1 and ypcm2, respectively, and let xf1 and xf2 be the spectra of the signals xpcm1 and xpcm2, respectively.

Further, let y1, y2, x1 and x2 be the cepstral signals corresponding to yf1, yf2, xf1 and xf2, respectively, according to y1 = C log(yf1), y2 = C log(yf2), x1 = C log(xf1) and x2 = C log(xf2), where C denotes the Discrete Cosine Transform. Since the mixtures are linear, the spectra satisfy xf1 = yf1 + yf2 and xf2 = a yf1 + yf2, and therefore:
y1 = x1 - g(y1, y2, 1) (1)
y2 = x2 - g(y2, y1, a) (2)
where g(u, v, w) = C log(1 + w exp(invC(v - u))) and invC denotes the inverse Discrete Cosine Transform.
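Equations (1) and (2) can be checked numerically. Equation (1) follows from x1 = C log(yf1 + yf2) = C log(yf1) + C log(1 + yf2/yf1), and equation (2) is the analogous expansion of C log(a yf1 + yf2). The sketch below makes assumptions that the text does not fix: C is taken to be an orthonormal DCT-II matrix (so that invC is its transpose), the spectra are short positive vectors that add exactly, and all helper names are hypothetical.

```python
import math


def dct_matrix(n):
    """Orthonormal DCT-II matrix C (n x n); its inverse is its transpose."""
    rows = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        rows.append([scale * math.cos(math.pi * (i + 0.5) * k / n)
                     for i in range(n)])
    return rows


def matvec(m, v):
    """Matrix-vector product on plain lists."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def transpose(m):
    return [list(col) for col in zip(*m)]


def cepstrum(spectrum, C):
    """y = C log(yf) for a positive spectrum yf."""
    return matvec(C, [math.log(s) for s in spectrum])


def g(u, v, w, C):
    """g(u, v, w) = C log(1 + w exp(invC(v - u)))."""
    inv_c = transpose(C)
    d = matvec(inv_c, [vi - ui for ui, vi in zip(u, v)])
    return matvec(C, [math.log1p(w * math.exp(x)) for x in d])


# Toy spectra (assumed values) and mixing weight a < 1.
C = dct_matrix(4)
yf1, yf2, a = [4.0, 1.0, 2.0, 3.0], [1.0, 2.0, 1.0, 0.5], 0.5
y1, y2 = cepstrum(yf1, C), cepstrum(yf2, C)
x1 = cepstrum([p + q for p, q in zip(yf1, yf2)], C)
x2 = cepstrum([a * p + q for p, q in zip(yf1, yf2)], C)

# Residuals of equations (1) and (2); both should be numerically zero.
res1 = [xa - (ya + ga) for xa, ya, ga in zip(x1, y1, g(y1, y2, 1.0, C))]
res2 = [xb - (yb + gb) for xb, yb, gb in zip(x2, y2, g(y2, y1, a, C))]
```

With real speech front ends the additivity of spectra is only approximate, so the residuals would be small rather than zero.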

Since y1 in equation (1) is unknown, the value of the function g is approximated by its expected value over y1, i.e., Ey1[g(y1, y2, 1) | y2], where the expectation is computed with respect to the mixture of Gaussians modeling the pdf of y1. Likewise, since y2 in equation (2) is unknown, the value of the function g is approximated by its expected value over y2, i.e., Ey2[g(y2, y1, a) | y1], where the expectation is computed with respect to the mixture of Gaussians modeling the pdf of y2. Replacing the value of g in equations (1) and (2) with the corresponding expected value, the estimates y2(n) and y1(n) of y2 and y1 are computed alternately at each iteration n of the following iterative procedure:
Initialization:
y1(0) = x1
Iteration n (n ≥ 1):
y2(n) = x2 - Ey2[g(y2, y1, a) | y1 = y1(n-1)]
y1(n) = x1 - Ey1[g(y1, y2, 1) | y2 = y2(n)]
n = n + 1

With the general principles of the source separation of the invention in mind, the source separation process of the invention is now described in the context of speech recognition.

Referring first to FIG. 1, a block diagram illustrates the integration of a source separation process into a speech recognition system according to an embodiment of the invention. As shown, the speech recognition system 100 includes an alignment and scaling module 102, first and second feature extractors 104 and 106, a source separation module 108, a post separation processing module 110, and a speech recognition engine 112.

First, the observed waveform mixtures xpcm1 and xpcm2 are aligned and scaled in the alignment and scaling module 102 in order to compensate for the delays and attenuations introduced during propagation of the signals to the sensors that capture them, for example microphones (not shown) associated with the speech recognition system. Such alignment and scaling operations are well known in the field of speech signal processing. Any suitable alignment and scaling technique may be used.
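The text leaves the alignment technique open. As one possible illustration only, a delay between two channels can be estimated by brute-force cross-correlation and then removed. The function names and the choice of cross-correlation are assumptions made for this sketch, and scaling is not covered.

```python
def estimate_delay(ref, sig, max_lag):
    """Lag (in samples) at which sig best matches ref, within [-max_lag, max_lag]."""
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for i, r in enumerate(ref):
            j = i + lag
            if 0 <= j < len(sig):
                score += r * sig[j]      # cross-correlation at this lag
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


def align(sig, lag):
    """Shift sig left by lag samples, zero-padding so the length is unchanged."""
    n = len(sig)
    return [sig[i + lag] if 0 <= i + lag < n else 0.0 for i in range(n)]


# Toy example: the second channel is the first one delayed by 3 samples.
channel_1 = [0.0, 0.0, 1.0, 2.0, 3.0, 0.0, 0.0, 0.0, 0.0, 0.0]
channel_2 = [0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 2.0, 3.0, 0.0, 0.0]
lag = estimate_delay(channel_1, channel_2, max_lag=5)
channel_2_aligned = align(channel_2, lag)
```

A production front end would more likely use an FFT-based correlation and handle fractional delays; the quadratic loop here is only meant to make the idea concrete.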

Next, cepstral features are extracted from the aligned and scaled waveform mixtures xpcm1 and xpcm2 in the first and second feature extractors 104 and 106, respectively. Techniques for cepstral feature extraction are well known in the field of speech signal processing. Any suitable extraction technique may be used.

The cepstral mixtures x1 and x2 output by the feature extractors 104 and 106, respectively, are then separated by the source separation module 108 in accordance with the invention. The output of the source separation module 108 is preferably an estimate of the desired source to which speech recognition is to be applied, in this case an estimate of the source signal y1. An illustrative source separation process that the source separation module 108 may implement is described in detail below in connection with FIGS. 2 and 3.

The enhanced cepstral features output by the source separation module 108, for example those associated with the estimated source signal y1, are then normalized and further processed in the post separation processing module 110. Examples of processing techniques that may be performed in module 110 include, but are not limited to, computing the first- and second-order temporal derivatives of the cepstral features and appending them to the vector of cepstral features. These derivatives are also referred to as dynamic features, or delta and delta-delta cepstral features, and they carry information about the temporal structure of speech (see, for example, chapter 3 of the Rabiner et al. reference cited above).
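The text does not say how the temporal derivatives are computed. The sketch below uses a simple central difference with edge replication, which is an assumption (regression over a wider window is also common); the function names are hypothetical.

```python
def deltas(frames):
    """First-order temporal derivative of a sequence of feature vectors,
    using a central difference and replicating the first and last frames."""
    last = len(frames) - 1
    out = []
    for t in range(len(frames)):
        prev_frame = frames[max(t - 1, 0)]
        next_frame = frames[min(t + 1, last)]
        out.append([(b - a) / 2.0 for a, b in zip(prev_frame, next_frame)])
    return out


def add_dynamic_features(frames):
    """Append delta and delta-delta features to each cepstral vector."""
    d1 = deltas(frames)
    d2 = deltas(d1)
    return [c + a + b for c, a, b in zip(frames, d1, d2)]


# Toy stream of one-dimensional "cepstra".
augmented = add_dynamic_features([[0.0], [1.0], [2.0], [3.0]])
```

Each output vector is three times as long as the input vector: static features first, then the delta features, then the delta-delta features.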

Finally, the estimated source signal y1 is sent to the speech recognition engine 112 for decoding. Techniques for performing speech recognition are well known in the field of speech signal processing. Any suitable recognition technique may be used.

Referring now to FIGS. 2 and 3, flow diagrams illustrate a first portion and a second portion, respectively, of a source separation process according to an embodiment of the invention. More specifically, FIGS. 2 and 3 respectively show the two steps that form each iteration of the source separation process.

First, the process is initialized by setting y1(0,t) equal to the observed mixture x1(t), i.e., y1(0,t) = x1(t) for each time index t.

As shown in FIG. 2, the first step 200A of iteration n (n ≥ 1) computes the estimate y2(n,t) of source y2 at time t from the observed mixture x2 and from the estimate y1(n-1,t) (where y1(0,t) is initialized with x1(t)), by assuming that the pdf of the random variable y2 is modeled by a mixture of K Gaussians N(μ2k, Σ2k), k = 1 to K (where N denotes a Gaussian pdf with mean μ2k and variance Σ2k). The step is expressed as:
y2(n,t) = x2(t) - Σ_k p(k|x2(t)) g(μ2k, y1(n-1,t), a) (3)
where p(k|x2(t)) is computed in substep 202 (posterior computation for Gaussian k) by assuming that the random variable x2 follows the Gaussian distribution N(μ2k + g(μ2k, y1(n-1,t), a), Ξ2k(n,t)), where Ξ2k(n,t) is computed so as to approximate the variance of the random variable x2, and g(u,v,w) = C log(1 + w exp(invC(v-u))). Substep 204 multiplies p(k|x2(t)) by g(μ2k, y1(n-1,t), a), and substep 206 subtracts Σ_k p(k|x2(t)) g(μ2k, y1(n-1,t), a) from x2(t). The result is the estimated source y2(n,t).

As shown in FIG. 3, the second step 200B of iteration n (n ≥ 1) computes the estimate y1(n,t) of source y1 at time t from the observed mixture x1 and from the estimate y2(n,t), by assuming that the pdf of the random variable y1 is modeled by a mixture of K Gaussians N(μ1k, Σ1k), k = 1 to K (where N denotes a Gaussian pdf with mean μ1k and variance Σ1k). The step is expressed as:
y1(n,t) = x1(t) - Σ_k p(k|x1(t)) g(μ1k, y2(n,t), 1) (4)
where p(k|x1(t)) is computed in substep 208 (posterior computation for Gaussian k) by assuming that the random variable x1 follows the Gaussian distribution N(μ1k + g(μ1k, y2(n,t), 1), Ξ1k(n,t)), where Ξ1k(n,t) is computed so as to approximate the variance of the random variable x1, and g(u,v,w) = C log(1 + w exp(invC(v-u))). Substep 210 multiplies p(k|x1(t)) by g(μ1k, y2(n,t), 1), and substep 212 subtracts Σ_k p(k|x1(t)) g(μ1k, y2(n,t), 1) from x1(t). The result is the estimated source y1(n,t).
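The two steps of an iteration, equations (3) and (4), share one form: subtract from the observed mixture a posterior-weighted sum of g evaluated at each codeword mean. The sketch below processes a single frame and rests on several assumptions not taken from the text: C is an orthonormal DCT-II matrix, covariances are diagonal, each codebook is a list of hypothetical `(prior, mean, var)` tuples, and the codeword's own variance stands in for Ξk(n,t).

```python
import math


def dct_matrix(n):
    """Orthonormal DCT-II matrix; its inverse is its transpose."""
    return [[(math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n))
             * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n)]
            for k in range(n)]


def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def log1p_wexp(w, x):
    """Numerically stable log(1 + w * exp(x)) for w > 0."""
    if x > 0:
        return x + math.log(w) + math.log1p(math.exp(-x) / w)
    return math.log1p(w * math.exp(x))


def g(u, v, w, C, Ct):
    """g(u, v, w) = C log(1 + w exp(invC(v - u)))."""
    d = matvec(Ct, [b - a for a, b in zip(u, v)])
    return matvec(C, [log1p_wexp(w, x) for x in d])


def log_gauss(x, mean, var):
    return sum(-0.5 * (math.log(2.0 * math.pi * s) + (a - m) ** 2 / s)
               for a, m, s in zip(x, mean, var))


def estimate(x, other, w, codebook, C, Ct):
    """x - sum_k p(k | x) g(mu_k, other, w): shared form of (3) and (4)."""
    shifts = [g(mu, other, w, C, Ct) for _, mu, _ in codebook]
    # Posterior of codeword k, assuming x ~ N(mu_k + g(mu_k, other, w), var_k).
    # Using var_k in place of Xi_k(n,t) is a simplifying assumption.
    scores = [math.log(p) + log_gauss(x, [m + s for m, s in zip(mu, sh)], var)
              for (p, mu, var), sh in zip(codebook, shifts)]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    corr = [sum(wk / total * sh[i] for wk, sh in zip(weights, shifts))
            for i in range(len(x))]
    return [xi - ci for xi, ci in zip(x, corr)]


def separate(x1, x2, a, codebook1, codebook2, C, iterations=3):
    """Alternate equations (3) and (4) for one frame, starting from y1 = x1."""
    Ct = [list(col) for col in zip(*C)]
    y1, y2 = list(x1), list(x2)
    for _ in range(iterations):
        y2 = estimate(x2, y1, a, codebook2, C, Ct)    # equation (3)
        y1 = estimate(x1, y2, 1.0, codebook1, C, Ct)  # equation (4)
    return y1, y2
```

To process an utterance, `separate` would be called once per frame t. In an idealized sanity check where each codebook has a single codeword whose mean equals the true source cepstrum, the iteration settles on the true sources; with real codebooks the result is only an estimate.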

After M iterations have been performed (M ≥ 1), the estimated stream of T cepstral feature vectors y1(M,t), t = 1 to T, is sent to the speech recognition engine for decoding. The estimated stream of T cepstral feature vectors y2(M,t), t = 1 to T, is discarded, since it is not to be decoded. The stream y1 is determined to be the source to be decoded on the basis of the relative positions of the microphones capturing the streams x1 and x2: the microphone placed close to the speech source to be decoded captures signal x1, and the microphone placed farther from that speech source captures signal x2.

To describe the illustrative source separation process above in further detail: as noted, the process estimates the covariance matrices Ξ1k(n,t) and Ξ2k(n,t) of the observed mixtures x1 and x2, which are used in steps 200B and 200A of each iteration n, respectively. The covariance matrices Ξ1k(n,t) and Ξ2k(n,t) may be computed on the fly from the observed mixtures, or according to the Parallel Model Combination (PMC) equations, which define the covariance matrix of a random variable resulting from the exponentiation of the sum of two log-normally distributed random variables. See, for example, the paper by M.J.F. Gales et al. entitled "Robust Continuous Speech Recognition Using Parallel Model Combination," IEEE Transactions on Speech and Audio Processing, vol. 4, 1996.

The PMC equations may be used as follows. Let μ1 and Ξ1 be the mean and the covariance matrix, respectively, of a Gaussian random variable z1 in the cepstral domain, and let μ2 and Ξ2 be the mean and the covariance matrix, respectively, of a Gaussian random variable z2 in the cepstral domain. Let z1f = exp(invC(z1)) and z2f = exp(invC(z2)) be the random variables obtained by transforming z1 and z2 into the spectral domain, and let zf = z1f + z2f be their sum. The PMC equations then allow the covariance matrix Ξ of the random variable z = C log(zf), obtained by transforming zf into the cepstral domain, to be computed as:
Ξ_ij = log[((Ξ1f_ij + Ξ2f_ij) / ((μ1f_i + μ2f_i)(μ1f_j + μ2f_j))) + 1]
where Ξ1f_ij (resp. Ξ2f_ij) denotes the (i,j)-th element of the covariance matrix Ξ1f (resp. Ξ2f), defined as Ξ1f_ij = μ1f_i μ1f_j (exp(Ξ1_ij) - 1) (resp. Ξ2f_ij = μ2f_i μ2f_j (exp(Ξ2_ij) - 1)), and μ1f_i (resp. μ2f_i) denotes the i-th dimension of the vector μ1f (resp. μ2f), with μ1f_i = exp(μ1_i + Ξ1_ii/2) (resp. μ2f_i = exp(μ2_i + Ξ2_ii/2)).
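The PMC covariance computation can be sketched in a few lines. This sketch uses the standard log-normal moment formulas: linear-domain mean exp(μ_i + Ξ_ii/2) and linear-domain covariance μf_i μf_j (exp(Ξ_ij) - 1). It also assumes the input means and covariances are already expressed in the log domain, so the mapping through C and invC is left out. The function names are hypothetical.

```python
import math


def to_linear(mean, cov):
    """Log-normal moments: mean and covariance of exp(z) for Gaussian z."""
    n = len(mean)
    mean_f = [math.exp(mean[i] + cov[i][i] / 2.0) for i in range(n)]
    cov_f = [[mean_f[i] * mean_f[j] * math.expm1(cov[i][j])
              for j in range(n)] for i in range(n)]
    return mean_f, cov_f


def pmc_covariance(mean1, cov1, mean2, cov2):
    """Log-domain covariance of the sum of two log-normal variables."""
    m1, c1 = to_linear(mean1, cov1)
    m2, c2 = to_linear(mean2, cov2)
    n = len(mean1)
    return [[math.log1p((c1[i][j] + c2[i][j])
                        / ((m1[i] + m2[i]) * (m1[j] + m2[j])))
             for j in range(n)] for i in range(n)]


# Toy check: when the second source is negligible, the combined covariance
# should fall back to the covariance of the first source.
cov_speech = [[0.5, 0.1], [0.1, 0.3]]
combined = pmc_covariance([0.0, 1.0], cov_speech,
                          [-50.0, -50.0], [[0.2, 0.0], [0.0, 0.2]])
```

The fallback behaviour in the toy check is a useful sanity property: a source that contributes almost no energy should not change the covariance of the mixture.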

As will become apparent below, in experiments in which speech from various speakers is mixed with car noise, the pdf of the speech source is modeled by a mixture of 32 Gaussians and the pdf of the noise source is modeled by a mixture of two Gaussians. As far as the test data are concerned, 32 Gaussians for speech and two Gaussians for noise appear to offer a good trade-off between recognition accuracy and complexity. Sources with more complex pdfs may require mixtures with more Gaussians.

Finally, referring to FIG. 4, a block diagram illustrates an exemplary implementation of a speech recognition system incorporating a source separation process according to an embodiment of the invention (e.g., as shown in FIGS. 1, 2 and 3). In this particular implementation 300, a processor 302 for controlling and performing the operations disclosed herein (e.g., alignment, scaling, feature extraction, source separation, post separation processing, and speech recognition) is coupled to a memory 304 and a user interface 306 via a computer bus 308.

The term "processor" as used herein is intended to include any processing device, such as, for example, one that includes a CPU (central processing unit) and/or other suitable processing circuitry. For example, the processor may be a digital signal processor, as is known in the art. The term "processor" may also refer to more than one individual processor. The term "memory" as used herein is intended to include memory associated with a processor or CPU, such as, for example, RAM, ROM, a fixed memory device (e.g., a hard drive), a removable memory device (e.g., a floppy disk), and the like. In addition, the term "user interface" as used herein is intended to include, for example, a microphone for inputting speech data to the processing unit and, preferably, a visual display for presenting results associated with the speech recognition process.

Accordingly, computer software including instructions or code for performing the methods of the invention, as disclosed herein, may be stored in one or more of the associated memory devices (e.g., ROM, fixed or removable memory) and, when ready to be used, loaded in part or in whole (e.g., into RAM) and executed by a CPU.

In any case, the elements illustrated in FIGS. 1, 2 and 3 may be implemented in various forms of hardware, software, or combinations thereof, e.g., one or more digital signal processors with associated memory, application-specific integrated circuits, functional circuitry, or one or more appropriately programmed general-purpose digital computers with associated memory. The methods of the invention may also be embodied in a machine-readable medium containing one or more programs which, when executed, implement the steps of those methods. Given the teachings of the invention provided herein, one of ordinary skill in the art will be able to contemplate other implementations of the elements of the invention.

An illustrative evaluation of an embodiment of the invention, as used in conjunction with speech recognition where the signal mixed with the speech is car noise, is now given. The evaluation protocol is described first. The recognition scores obtained with the source separation process of the invention (referred to below as the "codebook dependent source separation" process, or "CDSS") are then compared with the scores obtained without any separation process and with the scores obtained with the MCDCN process described above.

The experiments are performed on a corpus of 12 male and female subjects uttering connected digit sequences in a non-moving car. A noise signal pre-recorded in a car at 60 mph (about 96.5 km/h) is artificially added to the speech signal weighted by a factor of either 1 or "a", which yields two distinct linear mixtures of the speech and noise waveforms ("ypcm1 + ypcm2" and "a ypcm1 + ypcm2" as described above, where ypcm1 refers to the speech waveform and ypcm2 to the noise waveform). Experiments were run with the factor "a" set to 0.3, 0.4, and 0.5. All recordings of speech and noise were made at 22 kHz with an AKG Q400 microphone and downsampled to 11 kHz.
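For illustration, the construction of the two test mixtures may be sketched as follows, assuming the speech and noise waveforms are available as equal-length sequences of samples. The function name `make_mixtures` is hypothetical; recording and downsampling are outside the scope of the sketch.

```python
def make_mixtures(speech, noise, a):
    """Return the two linear mixtures ypcm1 + ypcm2 and a*ypcm1 + ypcm2.

    speech: samples of the speech waveform (ypcm1)
    noise:  samples of the noise waveform (ypcm2)
    a:      weight applied to the speech in the second mixture (e.g., 0.3)
    """
    if len(speech) != len(noise):
        raise ValueError("speech and noise must have the same length")
    mix1 = [s + n for s, n in zip(speech, noise)]
    mix2 = [a * s + n for s, n in zip(speech, noise)]
    return mix1, mix2
```

With "a" below 1, the second mixture contains relatively more noise than the first, which is why it can serve as the reference signal for a process such as MCDCN.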

To model the pdf of the speech source, a mixture of 32 Gaussians was estimated on a collection of a few thousand sentences uttered by both males and females and recorded with an AKG Q400 microphone in a non-moving car and a noise-free environment. To model the pdf of the car noise, a mixture of two Gaussians was estimated (prior to the experiments) on about four minutes of noise recorded with an AKG Q400 microphone in a car at 60 mph (about 96.5 km/h), using the same setup as for the test data.

The mixture of speech and noise decoded by the speech recognition engine is either:
(A) not separated;
(B) separated with the MCDCN process; or
(C) separated with the CDSS process.
The performance of the speech recognition engine obtained with (A), (B) and (C) is compared in terms of word error rate (WER).
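The word error rate is not defined in the text. The sketch below assumes the conventional definition: the minimum number of word substitutions, deletions and insertions needed to turn the reference transcript into the recognizer output, divided by the number of reference words. The function name `word_error_rate` is hypothetical.

```python
def word_error_rate(reference, hypothesis):
    """Conventional WER: word-level edit distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words consumed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,             # deletion of a reference word
                cur[j - 1] + 1,          # insertion of a hypothesis word
                prev[j - 1] + (r != h),  # substitution, or match at no cost
            ))
        prev = cur
    return prev[-1] / len(ref)
```

For connected digit sequences such as those used here, each digit counts as one word, so a single misrecognized digit in a four-digit string contributes a WER of 0.25 for that utterance.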

The speech recognition engine used in the experiments is particularly suited for use in portable devices or in automotive applications. The engine includes a set of speaker-independent acoustic models (156 subphones covering the phonetics of English) with about 10,000 context-dependent Gaussians, i.e., triphone contexts tied by using a decision tree (see L.R. Bahl et al., "Performance of the IBM Large Vocabulary Continuous Speech Recognition System on the ARPA Wall Street Journal Task," Proceedings of ICASSP 1995, vol. 1, pp. 41-44), trained on several hundred hours of general English speech (about half of these training data either have car noise digitally added or were recorded in a car moving at 30 mph and 60 mph (about 48 km/h and about 96.5 km/h)). The front end of the system computes 12 cepstra + the energy + delta and delta-delta coefficients from 15 ms frames using 24 mel-filter banks (see, e.g., chapter 3 of the Rabiner et al. reference cited above).
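As a rough illustration of the front-end feature layout, the sketch below appends delta and delta-delta coefficients to static frames of 12 cepstra plus energy. The text does not specify how the deltas are computed; the central difference with repeated edge frames used here, the assumption that deltas are taken over all 13 static coefficients, and the names `deltas` and `add_dynamic_features` are all assumptions made for the example.

```python
def deltas(frames):
    """Central-difference time derivative, repeating the edge frames.

    Assumption: the disclosure does not state the delta window; a
    one-frame central difference is used purely for illustration.
    """
    n = len(frames)
    out = []
    for t in range(n):
        prev = frames[max(t - 1, 0)]
        nxt = frames[min(t + 1, n - 1)]
        out.append([(b - a) / 2.0 for a, b in zip(prev, nxt)])
    return out


def add_dynamic_features(static):
    """Concatenate static, delta and delta-delta coefficients per frame."""
    d = deltas(static)
    dd = deltas(d)
    return [s + x + y for s, x, y in zip(static, d, dd)]
```

Under these assumptions, 13 static coefficients per frame (12 cepstra plus energy) yield 39-dimensional feature vectors.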

The CDSS process is applied as generally described above, and preferably as illustratively described above in connection with FIGS. 1, 2 and 3.

Table 1 below shows the word error rates (WER) obtained after decoding the test data. The WER obtained on the clean speech, before the addition of noise, is 1.53%. The WER obtained on the noisy speech, after the addition of noise and without any separation process, is 12.31%. The WER obtained after applying the MCDCN process with the second mixture ("ayf1 + yf2") as the reference signal is given for the various values of the mixing factor "a". MCDCN reduces the WER when the leakage of speech into the reference signal is small (a = 0.3), but its performance degrades as the leakage becomes more significant, and for a factor "a" equal to 0.5 the MCDCN process performs worse than the 12.31% baseline WER. The CDSS process, on the other hand, substantially improves on the baseline WER for all experimental values of the factor "a".

(Table 1) Word error rates (%)

Original speech: 1.53
Noisy speech, no separation: 12.31

                         a = 0.3    a = 0.4    a = 0.5
Noisy speech, MCDCN        7.86      10.00      15.51
Noisy speech, CDSS         6.35       6.87       7.59
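To make the comparison in Table 1 easier to read, the relative WER reduction with respect to the 12.31% baseline may be computed as in the following sketch. The values are copied from Table 1; the names `BASELINE_WER`, `WER` and `relative_reduction` are introduced for the example only.

```python
BASELINE_WER = 12.31  # noisy speech, no separation (Table 1)

# WER (%) from Table 1, indexed by process and by mixing factor "a".
WER = {
    "MCDCN": {0.3: 7.86, 0.4: 10.00, 0.5: 15.51},
    "CDSS": {0.3: 6.35, 0.4: 6.87, 0.5: 7.59},
}


def relative_reduction(baseline, wer):
    """Relative WER reduction in percent; negative means a degradation."""
    return 100.0 * (baseline - wer) / baseline


if __name__ == "__main__":
    for process, by_a in WER.items():
        for a, wer in sorted(by_a.items()):
            print(f"{process}, a = {a}: {relative_reduction(BASELINE_WER, wer):+.1f}%")
```

A positive value indicates an improvement over the unseparated noisy speech. The value for MCDCN at a = 0.5 is negative, consistent with the statement that MCDCN then performs worse than the baseline.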

Although illustrative embodiments of the present invention have been described herein with reference to the accompanying drawings, it is to be understood that the invention is not limited to those precise embodiments, and that various other changes and modifications may be made by one skilled in the art without departing from the scope or spirit of the invention.

In summary, the following matters are disclosed with respect to the configuration of the present invention.

(1) A method of separating a signal from a mixture of a signal associated with a first source (first source signal) and a signal associated with a second source (second source signal), the method comprising the steps of:
obtaining two signals respectively representing two mixtures of the first source signal and the second source signal; and
separating the first source signal from the mixture in a non-linear signal domain, using the two signals and at least one known statistical property associated with the first source and the second source, and without requiring the use of a reference signal.
(2) The method according to (1) above, wherein the two signals respectively represent a non-weighted mixture signal of the first source signal and the second source signal and a weighted mixture signal of the first source signal and the second source signal.
(3) The method according to (2) above, wherein the separating step is performed in the non-linear domain by converting the non-weighted mixture signal into a first cepstral mixture signal and converting the weighted mixture signal into a second cepstral mixture signal.
(4) The method according to (3) above, wherein the separating step comprises the step of iteratively generating an estimate of the second source signal based on the second cepstral mixture signal and an estimate of the first source signal from a previous iteration of the separating step.
(5) The method according to (4) above, wherein the step of generating the estimate of the second source signal assumes that the second source signal is modeled by a mixture of Gaussians.
(6) The method according to (4) above, wherein the separating step further comprises the step of iteratively generating an estimate of the first source signal based on the first cepstral mixture signal and the estimate of the second source signal.
(7) The method according to (6) above, wherein the step of generating the estimate of the first source signal assumes that the first source signal is modeled by a mixture of Gaussians.
(8) The method according to (1) above, wherein the separated first source signal is subsequently used by a signal processing application.
(9) The method according to (8) above, wherein the application is speech recognition.
(10) The method according to (1) above, wherein the first source signal is a speech signal and the second source signal is a signal representing at least one of competing speech, interfering music and a specific noise source.
(11) Apparatus for separating a signal from a mixture of a signal associated with a first source (first source signal) and a signal associated with a second source (second source signal), the apparatus comprising:
a memory; and
at least one processor coupled to the memory and operative to: (i) obtain two signals respectively representing two mixtures of the first source signal and the second source signal; and (ii) separate the first source signal from the mixture in a non-linear signal domain, using the two signals and at least one known statistical property associated with the first source and the second source, and without requiring the use of a reference signal.
(12) The apparatus according to (11) above, wherein the two signals respectively represent a non-weighted mixture signal of the first source signal and the second source signal and a weighted mixture signal of the first source signal and the second source signal.
(13) The apparatus according to (12) above, wherein the separating operation is performed in the non-linear domain by converting the non-weighted mixture signal into a first cepstral mixture signal and converting the weighted mixture signal into a second cepstral mixture signal.
(14) The apparatus according to (13) above, wherein the separating operation comprises an operation of iteratively generating an estimate of the second source signal based on the second cepstral mixture signal and an estimate of the first source signal from a previous iteration of the separating operation.
(15) The apparatus according to (14) above, wherein the operation of generating the estimate of the second source signal assumes that the second source signal is modeled by a mixture of Gaussians.
(16) The apparatus according to (14) above, wherein the separating operation further comprises an operation of iteratively generating an estimate of the first source signal based on the first cepstral mixture signal and the estimate of the second source signal.
(17) The apparatus according to (16) above, wherein the operation of generating the estimate of the first source signal assumes that the first source signal is modeled by a mixture of Gaussians.
(18) The apparatus according to (11) above, wherein the separated first source signal is subsequently used by a signal processing application.
(19) The apparatus according to (18) above, wherein the application is speech recognition.
(20) The apparatus according to (11) above, wherein the first source signal is a speech signal and the second source signal is a signal representing at least one of competing speech, interfering music and a specific noise source.
(21) A computer program for separating a signal from a mixture of a signal associated with a first source (first source signal) and a signal associated with a second source (second source signal), the computer program constituting a machine-readable medium containing one or more programs which, when executed, implement the steps of:
obtaining two signals respectively representing two mixtures of the first source signal and the second source signal; and
separating the first source signal from the mixture in a non-linear signal domain, using the two signals and at least one known statistical property associated with the first source and the second source, and without requiring the use of a reference signal.
(22) The computer program according to (21) above, wherein the two signals respectively represent a non-weighted mixture signal of the first source signal and the second source signal and a weighted mixture signal of the first source signal and the second source signal.
(23) The computer program according to (22) above, wherein the separating step is performed in the non-linear domain by converting the non-weighted mixture signal into a first cepstral mixture signal and converting the weighted mixture signal into a second cepstral mixture signal.
(24) The computer program according to (23) above, wherein the separating step comprises the step of iteratively generating an estimate of the second source signal based on the second cepstral mixture signal and an estimate of the first source signal from a previous iteration of the separating step.
(25) The computer program according to (24) above, wherein the step of generating the estimate of the second source signal assumes that the second source signal is modeled by a mixture of Gaussians.
(26) The computer program according to (24) above, wherein the separating step further comprises the step of iteratively generating an estimate of the first source signal based on the first cepstral mixture signal and the estimate of the second source signal.
(27) The computer program according to (26) above, wherein the step of generating the estimate of the first source signal assumes that the first source signal is modeled by a mixture of Gaussians.
(28) The computer program according to (21) above, wherein the separated first source signal is subsequently used by a signal processing application.
(29) The computer program according to (28) above, wherein the application is speech recognition.
(30) The computer program according to (21) above, wherein the first source signal is a speech signal and the second source signal is a signal representing at least one of competing speech, interfering music and a specific noise source.
(31) Apparatus for separating a signal from a mixture of a signal associated with a first source (first source signal) and a signal associated with a second source (second source signal), the apparatus comprising:
means for obtaining two signals respectively representing two mixtures of the first source signal and the second source signal; and
means, coupled to the means for obtaining the two signals, for separating the first source signal from the mixture in a non-linear signal domain, using the two signals and at least one known statistical property associated with the first source and the second source, and without requiring the use of a reference signal.
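The alternating structure recited in items (4) and (6) above can be illustrated with a deliberately simplified sketch. It operates on a single log-spectral value rather than on cepstral vectors, assumes the weight "a" is known, and omits the Gaussian-mixture modeling of items (5) and (7) entirely, so it is not the estimator of the invention. It only shows how the estimate of the second source is refreshed from the second mixture and the previous estimate of the first source, and how the estimate of the first source is then refreshed from the first mixture. The function name `alternate_estimates` and its parameters are hypothetical.

```python
import math


def alternate_estimates(y1, y2, a, iterations=50, floor=1e-12):
    """Alternately re-estimate two sources from two log-domain mixtures.

    Model assumed for the sketch, for one log-spectral bin:
        y1 = log(exp(x1) + exp(x2))      (non-weighted mixture)
        y2 = log(a * exp(x1) + exp(x2))  (weighted mixture)
    No Gaussian-mixture prior is used; `floor` only guards the logarithm.
    """
    x1 = y1  # initial guess: the first mixture is entirely source 1
    x2 = y2
    for _ in range(iterations):
        # Source-2 estimate from the second mixture and the previous
        # source-1 estimate.
        x2 = math.log(max(math.exp(y2) - a * math.exp(x1), floor))
        # Source-1 estimate from the first mixture and the new
        # source-2 estimate.
        x1 = math.log(max(math.exp(y1) - math.exp(x2), floor))
    return x1, x2
```

In this noiseless toy setting with 0 < a < 1, the error in the linear-domain estimate of the first source shrinks by a factor of "a" at each pass, so the iteration converges to the true values. The disclosed process additionally relies on the known statistical properties of the sources, which this sketch leaves out.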

FIG. 1 is a block diagram illustrating the integration of a source separation process in a speech recognition system according to an embodiment of the present invention.
FIG. 2 is a flow diagram illustrating a first part of a source separation process according to an embodiment of the present invention.
FIG. 3 is a flow diagram illustrating a second part of a source separation process according to an embodiment of the present invention.
FIG. 4 is a block diagram illustrating an exemplary implementation of a speech recognition system incorporating a source separation process according to an embodiment of the present invention.

Claims

1. A method of separating a signal from a mixture of a signal associated with a first source (first source signal) and a signal associated with a second source (second source signal), the method comprising the steps of:
obtaining two signals respectively representing two mixtures of the first source signal and the second source signal; and
separating the first source signal from the mixture in a non-linear signal domain, using the two signals and at least one known statistical property associated with the first source and the second source, and without requiring the use of a reference signal.

2. The method of claim 1, wherein the two signals respectively represent a non-weighted mixture signal of the first source signal and the second source signal and a weighted mixture signal of the first source signal and the second source signal.

3. The method of claim 2, wherein the separating step is performed in the non-linear domain by converting the non-weighted mixture signal into a first cepstral mixture signal and converting the weighted mixture signal into a second cepstral mixture signal.

4. The method of claim 3, wherein the separating step comprises the step of iteratively generating an estimate of the second source signal based on the second cepstral mixture signal and an estimate of the first source signal from a previous iteration of the separating step.

5. The method of claim 4, wherein the step of generating the estimate of the second source signal assumes that the second source signal is modeled by a mixture of Gaussians.

6. The method of claim 4, wherein the separating step further comprises the step of iteratively generating an estimate of the first source signal based on the first cepstral mixture signal and the estimate of the second source signal.

7. The method of claim 6, wherein the step of generating the estimate of the first source signal assumes that the first source signal is modeled by a mixture of Gaussians.

8. The method of claim 1, wherein the separated first source signal is subsequently used by a signal processing application.

9. The method of claim 8, wherein the application is speech recognition.

10. The method of claim 1, wherein the first source signal is a speech signal and the second source signal is a signal representing at least one of competing speech, interfering music and a specific noise source.

11. Apparatus for separating a signal from a mixture of a signal associated with a first source (first source signal) and a signal associated with a second source (second source signal), the apparatus comprising:
a memory; and
at least one processor coupled to the memory and operative to: (i) obtain two signals respectively representing two mixtures of the first source signal and the second source signal; and (ii) separate the first source signal from the mixture in a non-linear signal domain, using the two signals and at least one known statistical property associated with the first source and the second source, and without requiring the use of a reference signal.

12. The apparatus of claim 11, wherein the two signals respectively represent a non-weighted mixture signal of the first source signal and the second source signal and a weighted mixture signal of the first source signal and the second source signal.

13. The apparatus of claim 12, wherein the separating operation is performed in the non-linear domain by converting the non-weighted mixture signal into a first cepstral mixture signal and converting the weighted mixture signal into a second cepstral mixture signal.

14. The apparatus of claim 13, wherein the separating operation comprises an operation of iteratively generating an estimate of the second source signal based on the second cepstral mixture signal and an estimate of the first source signal from a previous iteration of the separating operation.

15. The apparatus of claim 14, wherein the operation of generating the estimate of the second source signal assumes that the second source signal is modeled by a mixture of Gaussians.

16. The apparatus of claim 14, wherein the separating operation further comprises an operation of iteratively generating an estimate of the first source signal based on the first cepstral mixture signal and the estimate of the second source signal.

17. The apparatus of claim 16, wherein the operation of generating the estimate of the first source signal assumes that the first source signal is modeled by a mixture of Gaussians.

18. The apparatus of claim 11, wherein the separated first source signal is subsequently used by a signal processing application.

19. The apparatus of claim 18, wherein the application is speech recognition.

20. The apparatus of claim 11, wherein the first source signal is a speech signal and the second source signal is a signal representing at least one of competing speech, interfering music and a specific noise source.

21. A computer program for separating a signal from a mixture of a signal associated with a first source (first source signal) and a signal associated with a second source (second source signal), the computer program constituting a machine-readable medium containing one or more programs which, when executed, implement the steps of:
obtaining two signals respectively representing two mixtures of the first source signal and the second source signal; and
separating the first source signal from the mixture in a non-linear signal domain, using the two signals and at least one known statistical property associated with the first source and the second source, and without requiring the use of a reference signal.

22. The computer program of claim 21, wherein the two signals respectively represent a non-weighted mixture signal of the first source signal and the second source signal and a weighted mixture signal of the first source signal and the second source signal.

23. The computer program of claim 22, wherein the separating step is performed in the non-linear domain by converting the non-weighted mixture signal into a first cepstral mixture signal and converting the weighted mixture signal into a second cepstral mixture signal.

24. The computer program of claim 23, wherein the separating step comprises the step of iteratively generating an estimate of the second source signal based on the second cepstral mixture signal and an estimate of the first source signal from a previous iteration of the separating step.

25. The computer program of claim 24, wherein the step of generating the estimate of the second source signal assumes that the second source signal is modeled by a mixture of Gaussians.

26. The computer program of claim 24, wherein the separating step further comprises the step of iteratively generating an estimate of the first source signal based on the first cepstral mixture signal and the estimate of the second source signal.

27. The computer program of claim 26, wherein the step of generating the estimate of the first source signal assumes that the first source signal is modeled by a mixture of Gaussians.

28. The computer program of claim 21, wherein the separated first source signal is subsequently used by a signal processing application.

29. The computer program of claim 28, wherein the application is speech recognition.

30. The computer program of claim 21, wherein the first source signal is a speech signal and the second source signal is a signal representing at least one of competing speech, interfering music and a specific noise source.

31. Apparatus for separating a signal from a mixture of a signal associated with a first source (first source signal) and a signal associated with a second source (second source signal), the apparatus comprising:
means for obtaining two signals respectively representing two mixtures of the first source signal and the second source signal; and
means, coupled to the means for obtaining the two signals, for separating the first source signal from the mixture in a non-linear signal domain, using the two signals and at least one known statistical property associated with the first source and the second source, and without requiring the use of a reference signal.