JP2013037174A - Noise/reverberation removal device, method thereof, and program

Info

Publication number: JP2013037174A
Application number: JP2011172919A
Authority: JP
Inventors: Keisuke Kinoshita; 慶介木下; Tomohiro Nakatani; 智広中谷; Soden Meretsu; ソウデンメレツ; Marc Delcroix; マークデルクロア
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2011-08-08
Filing date: 2011-08-08
Publication date: 2013-02-21
Anticipated expiration: 2031-08-08
Also published as: JP5634959B2

Abstract

PROBLEM TO BE SOLVED: To provide a noise/reverberation removal device that performs speech enhancement using only exemplar models of clean speech. SOLUTION: An enhancement processing result reliability calculation unit outputs a value indicating the uncertainty of a primary speech enhancement signal from a feature quantity of an input signal and the primary speech enhancement signal. A matching unit receives the primary speech enhancement signal, the value indicating its uncertainty, and exemplar models of learning data, and outputs, for each time frame, a learning data segment that gives the clean speech sequence closest to the clean speech contained in the input signal. A speech enhancement filtering unit receives the input signal and the learning data segment, reads the amplitude spectrum data paired with the learning data segment from an exemplar model storage unit to generate a Wiener filter, multiplies the power spectrum of the input signal by the Wiener filter to perform filtering, and outputs a speech enhancement signal.

Description

The present invention relates to a noise/reverberation removal apparatus that extracts, from an acoustic signal accompanied by noise and reverberation, an acoustic signal from which the noise and reverberation have been removed, and to a method and program therefor.

When an acoustic signal is picked up in a noisy or reverberant environment, it is observed as the original signal with acoustic distortion (noise and reverberation) superimposed on it. When the acoustic signal is speech, the superimposed distortion greatly reduces intelligibility. As a result, it becomes difficult to extract the properties of the original speech signal and, for example, the recognition rate of a speech recognition system drops. Preventing this drop requires a method for removing the superimposed acoustic distortion.

Besides speech recognition, such a noise/reverberation removal method can be used in, for example, hearing aids, TV conference systems, machine control interfaces, and music information processing systems that search for or transcribe music.

FIG. 7 shows an example of the functional configuration of a conventional noise/reverberation removal apparatus 700, whose operation is briefly described here. The apparatus 700 includes a matching unit 703, a speech enhancement filtering unit 704, and a case model 705. The matching unit 703 matches the feature quantity of the input signal against the feature quantity cases contained in the case model 705 and searches for the case closest to the input signal.

The case model 705 (an exemplar model) consists of clean speech data corresponding to each case and the noisy/reverberant speech feature quantities paired with it. It is generated in advance by a case model learning device: a large amount of clean speech obtained from a speech corpus or the like is combined with noise/reverberation data obtained in a wide range of environments (noise waveforms and room impulse responses) to simulate observed signals in various environments, and the simulated observed signals are converted into the feature domain.

The speech enhancement filtering unit 704 creates a speech enhancement filter from the clean-speech amplitude spectrum data of the case found to be closest to the input signal, and filters the input signal with it. This method has been reported to remove highly time-varying noise, which was difficult with earlier methods. Highly time-varying noise here means noise such as the ringing of an alarm clock, as opposed to background noise.

J. Ming and R. Srinivasan, and D. Crooke, “A Corpus-Based Approach to Speech Enhancement From Nonstationary Noise,” IEEE Trans. On Acoustics, Speech and Signal Processing, 19(4), pp. 822-836, 2011.

In the conventional method, however, learning requires noise/reverberation data that simulates every noise/reverberation environment. When the amount of data is insufficient and no case is available whose conditions are sufficiently close to the noise/reverberation present at enhancement time, accurate speech enhancement is difficult. Moreover, even if every environment could be simulated and the case model contained sufficiently close cases, the number of cases would become enormous, and the amount of computation needed to search for the case closest to the input signal would become very large.

The present invention has been made in view of these problems. Its object is to provide a noise/reverberation removal apparatus, and a method and program therefor, that can use a case model to find the clean speech judged closest to the clean speech contained in the input signal, and thereby perform accurate speech enhancement, without preparing every kind of noise/reverberation data at learning time.

The noise/reverberation removal apparatus of the present invention includes a speech enhancement processing unit, an enhancement processing result reliability calculation unit, a case model storage unit, a matching unit, and a speech enhancement filtering unit. The speech enhancement processing unit takes as its input signal a digital speech signal on which noise and reverberation are superimposed, and outputs a primary speech enhancement signal in the feature domain, obtained by applying primary speech enhancement processing to the input signal. The enhancement processing result reliability calculation unit outputs a value indicating the uncertainty of the primary speech enhancement signal, computed from the feature quantity of the input signal and the primary speech enhancement signal. The case model storage unit stores a case model of learning data and the corresponding amplitude spectrum data. The matching unit receives the primary speech enhancement signal, the value indicating its uncertainty, and the case model of the learning data, and outputs, for each time frame, the learning data segment that gives the clean speech sequence closest to the clean speech contained in the input signal. The speech enhancement filtering unit receives the power spectrum of the input signal and the learning data segment, reads the amplitude spectrum data paired with the learning data segment from the case model storage unit to generate a Wiener filter, multiplies the power spectrum of the input signal by the Wiener filter to perform filtering, and outputs a speech enhancement signal.

Because the noise/reverberation removal apparatus of the present invention uses a case model generated from clean speech only, the amount of computation needed for the case search can be reduced. In addition, primary speech enhancement processing is applied to the input signal and matching takes the uncertainty (reliability) of that processing into account, which makes it possible to find appropriate clean speech cases. The specific effects are described later, but the present invention improves the SN ratio of noise/reverberation removal over the prior art while reducing the amount of computation.

FIG. 1 is a diagram showing an example of the functional configuration of the noise/reverberation removal apparatus 100 of the present invention.
FIG. 2 is a diagram showing the operation flow of the noise/reverberation removal apparatus 100.
FIG. 3 is a diagram showing an example of the functional configuration of the case model generation apparatus 200.
FIG. 4 is a diagram showing the operation flow of the case model generation apparatus 200.
FIG. 5 is a diagram showing spectrograms from the evaluation experiment: (a) clean speech, (b) reverberant speech, (c) the conventional method, (d) the output signal obtained by matching without considering uncertainty, and (e) the output signal of the noise/reverberation removal apparatus 100 of the present invention.
FIG. 6 is a diagram showing the evaluation experiment results as (a) segmental SNR and (b) log spectral distance.
FIG. 7 is a diagram showing an example of the functional configuration of the conventional noise/reverberation removal apparatus 700.

Embodiments of the present invention are described below with reference to the drawings. Identical components in multiple drawings are given the same reference numerals, and their description is not repeated.

FIG. 1 shows an example of the functional configuration of the noise/reverberation removal apparatus 100 of the present invention, and FIG. 2 shows its operation flow. The apparatus 100 includes a speech enhancement processing unit 102, an enhancement processing result reliability calculation unit 103, a case model storage unit 104, a matching unit 105, a speech enhancement filtering unit 106, and a control unit 107. The function of each unit is realized by loading a predetermined program into a computer composed of, for example, a ROM, a RAM, and a CPU, and having the CPU execute that program.

The noise/reverberation removal apparatus 100 can produce its output in various signal domains, such as the time domain, the power spectrum domain, the amplitude spectrum domain, and the feature domain; the domain is chosen according to how the output signal will be used. In this embodiment, the input signal is described as being in the power spectrum domain and the output signal as a time-domain signal.

Because the input signal is given in the power spectrum domain, this embodiment includes a feature quantity generation unit 101. The unit 101 generates a feature quantity for each frame (for example, mel-frequency cepstral coefficients) from the input power spectrum (step S101). If the input signal is given in the feature domain, the unit 101 is unnecessary, which is why it is drawn with a broken line.

The input signal y_t in the feature domain is modeled as shown in equation (1).

y_t = s_t + b_t   (1)

Here y_t is the input signal at time frame t, s_t is the clean speech, and b_t is the acoustic distortion component (noise and late reverberation). Modeling noise as an additive term in this way is common practice, and late reverberation is also often modeled as an additive term (Reference 1: K. Kinoshita, M. Delcroix, T. Nakatani, and M. Miyoshi, “Suppression of late reverberation effect on speech signal using long-term multiple-step linear prediction,” IEEE TASLP, 17(4), pp. 534-545, 2009). In the following description, the corresponding signals in the power spectrum domain are written Y_t^2, S_t^2, and B_t^2.

The speech enhancement processing unit 102 takes as its input signal a digital speech signal on which noise and reverberation are superimposed, and outputs a primary speech enhancement signal ^~s_t in the feature domain, obtained by applying primary speech enhancement processing to the input signal (step S102). In correct notation, the marks ^~ sit directly above the variable, as written in equation (2). The enhancement processing result reliability calculation unit 103 outputs a value Σ_bt indicating the uncertainty of the primary speech enhancement signal ^~s_t, computed from the input signal y_t and the signal ^~s_t output by the unit 102 (step S103).

The case model storage unit 104 stores a case model of learning data and the corresponding amplitude spectrum data. The matching unit 105 receives the primary speech enhancement signal ^~s_t output by the speech enhancement processing unit 102, the value Σ_bt indicating its uncertainty output by the enhancement processing result reliability calculation unit 103, and the case model M of the learning data stored in the case model storage unit 104, and outputs the learning data segment that gives the clean speech sequence closest to the clean speech contained in the input signal y_t (step S105).

The speech enhancement filtering unit 106 receives the power spectrum Y_t^2 of the input signal and the learning data segment output by the matching unit 105, reads the amplitude spectrum data paired with that learning data segment from the case model storage unit 104 to generate a Wiener filter, multiplies the power spectrum Y_t^2 of the input signal by the Wiener filter to perform filtering, and outputs a speech enhancement signal (step S106). The control unit 107 controls the time-series operation of the units described above.
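The filtering step can be sketched in Python as follows. This is only an illustration under stated assumptions, not the filter actually used by the apparatus: the gain formula S^2 / (S^2 + N^2), the noise estimate N^2 = max(Y^2 - S^2, 0), and the names wiener_gain and floor are introduced here for illustration. The text itself only states that a Wiener filter is generated from the exemplar amplitude spectrum and multiplied with the input power spectrum.

```python
import numpy as np

def wiener_gain(clean_amp, input_power, floor=1e-10):
    """Per-bin Wiener gain from an estimated clean amplitude spectrum.

    Assumption: noise power is whatever part of the input power is not
    explained by the estimated clean power (clipped at zero).
    """
    clean_power = np.asarray(clean_amp, dtype=float) ** 2
    input_power = np.asarray(input_power, dtype=float)
    noise_power = np.maximum(input_power - clean_power, 0.0)
    # floor avoids division by zero in bins where both powers are zero
    return clean_power / np.maximum(clean_power + noise_power, floor)

# One frame with two frequency bins: exemplar amplitudes and input power.
clean_amp = np.array([1.0, 2.0])
input_power = np.array([4.0, 4.0])
gain = wiener_gain(clean_amp, input_power)
enhanced_power = gain * input_power  # filtered power spectrum
```

In the first bin the exemplar explains a quarter of the input power, so the bin is attenuated; in the second bin the exemplar explains all of it, so the bin passes unchanged.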

By operating as described above, the noise/reverberation removal apparatus 100 uses a case model generated from clean speech only, needs little computation for the case search, and achieves noise/reverberation removal with a good SN ratio.

The function of each unit of the noise/reverberation removal apparatus 100 is described in more detail below.

[Speech enhancement processing unit]
Because the input signal is in the feature domain, the speech enhancement processing unit 102 of this embodiment applies the primary speech enhancement processing directly to the input signal y_t. Any conventional speech enhancement method can be used to obtain the primary speech enhancement signal ^~s_t, and the method should be chosen to suit the type of acoustic distortion contained in the input signal. For example, a method that linearly predicts the reverberation component from past signals and removes it in the power spectrum domain can be used (Reference 2: Re-publication of PCT International Publication No. 2007/100137).

[Enhancement processing result reliability calculation unit]
The enhancement processing result reliability calculation unit 103 uses the primary speech enhancement signal ^~s_t and the feature quantity y_t of the input signal to calculate and output a value Σ_bt indicating the uncertainty of the enhanced speech (the primary speech enhancement signal ^~s_t). A full covariance matrix can be used for Σ_bt, but in this embodiment Σ_bt is a diagonal covariance matrix, that is, a covariance matrix whose off-diagonal components are zero, and its k-th diagonal element σ_k is calculated as shown in equation (2).

Here k is an index over the dimensions of the feature vector.

In other words, the enhancement processing result reliability calculation unit 103 sets the value Σ_bt indicating the uncertainty of the primary speech enhancement signal ^~s_t to a covariance matrix whose components are formed from the difference between the feature quantity y_t of the input signal and the primary speech enhancement signal ^~s_t.
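Equation (2) is not reproduced in this text, so the following Python sketch only illustrates the idea stated above. It assumes that each diagonal element is the squared difference between the input feature and the primary enhanced feature, optionally multiplied by a constant. The function name enhancement_uncertainty and the parameter scale are hypothetical.

```python
import numpy as np

def enhancement_uncertainty(y, s_enh, scale=1.0):
    """Diagonal of Sigma_bt for each frame.

    y, s_enh: arrays of shape (frames, dims) holding the input features
    and the primary enhanced features.
    Assumption: sigma_k = scale * (y_k - s_enh_k) ** 2.
    """
    y = np.asarray(y, dtype=float)
    s_enh = np.asarray(s_enh, dtype=float)
    return scale * (y - s_enh) ** 2

# One frame, two feature dimensions: the enhancement changed only the
# second dimension, so only that dimension is marked as uncertain.
sigma = enhancement_uncertainty([[1.0, 2.0]], [[1.0, 0.0]])
```

Dimensions that the primary enhancement changed a lot receive a large variance, which matches the role the text gives Σ_bt as a measure of how far the enhanced feature may be from the clean one.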

[Case model generation apparatus]
The case model generation apparatus 200, which generates the case model stored in the case model storage unit 104, is described here. FIG. 3 shows an example of its functional configuration, and FIG. 4 shows its operation flow. The case model generation apparatus 200 includes a Fourier transform unit 201, a feature quantity generation unit 202, a Gaussian mixture model learning unit 203, a maximum likelihood Gaussian distribution calculation unit 204, and a control unit 205. The function of each unit is realized by loading a predetermined program into a computer composed of, for example, a ROM, a RAM, and a CPU, and having the CPU execute that program.

The Fourier transform unit 201 takes clean speech in the form of a digital speech signal as its input signal. The input signal is windowed with a short-time Hamming window of, for example, about 30 ms, and each windowed segment is converted to an amplitude spectrum by a discrete Fourier transform (step S201). The amplitude spectrum is the amplitude data of the frequency spectrum.
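The windowing and transform step above can be sketched with NumPy as follows. The 30 ms frame length follows the text; the 15 ms frame shift, the function name amplitude_spectrogram, and the test tone are assumptions made for this illustration.

```python
import numpy as np

def amplitude_spectrogram(signal, sample_rate, frame_ms=30.0, hop_ms=15.0):
    """Amplitude spectrum of each Hamming-windowed frame.

    Returns an array of shape (frames, frame_len // 2 + 1).
    hop_ms (the frame shift) is an assumption; the text gives only the
    window length.
    """
    signal = np.asarray(signal, dtype=float)
    frame_len = int(round(sample_rate * frame_ms / 1000.0))
    hop_len = int(round(sample_rate * hop_ms / 1000.0))
    if len(signal) < frame_len:
        raise ValueError("signal is shorter than one frame")
    n_frames = 1 + (len(signal) - frame_len) // hop_len
    window = np.hamming(frame_len)
    frames = np.stack([
        signal[i * hop_len: i * hop_len + frame_len] * window
        for i in range(n_frames)
    ])
    # rfft keeps only the non-negative frequencies of a real signal
    return np.abs(np.fft.rfft(frames, axis=1))

# A 1 kHz tone sampled at 8 kHz, 2400 samples (0.3 s) long.
fs = 8000
tone = np.sin(2 * np.pi * 1000 * np.arange(2400) / fs)
A = amplitude_spectrogram(tone, fs)
```

With these settings each frame has 240 samples, so the frequency resolution is 8000 / 240 Hz per bin and the 1 kHz tone falls on bin 30 of every frame.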

The feature quantity generation unit 202 converts every amplitude spectrum output by the Fourier transform unit 201 into a mel-cepstrum feature quantity s_i. Commonly used mel-cepstra are of order 10 to 20 at most, but here a high-order mel-cepstrum (for example, of order 30 to 100) is used so that the case data is represented accurately. Feature quantities other than the mel-cepstrum may also be used.

The Gaussian mixture model learning unit 203 takes the feature quantities s_i obtained by the feature quantity generation unit 202 for each short-time frame i as learning data and obtains a Gaussian mixture model g (equation (3)) by ordinary maximum likelihood estimation.

g(s) = Σ_{q=1}^{Q} w(q) g(s|q)   (3)

Here g(s|q) denotes the q-th Gaussian distribution, with mean μ_q and covariance Σ_q, and w(q) is its mixture weight. Q is the number of mixture components.

The maximum likelihood Gaussian distribution calculation unit 204 finds, for each time frame i, the index q_i of the Gaussian distribution in the Gaussian mixture model g that gives the maximum likelihood, and takes the time series of these indices q_i as the case model M (step S204). The case model M is expressed in terms of the set of Gaussian distribution indices q_i and the Gaussian mixture model g, as shown in equation (4).

Here q_i is the index of the Gaussian distribution that gives the maximum likelihood for the feature quantity s_i of the i-th frame, and it designates the distribution g(s|q_i) within the Gaussian mixture model g. The model M is called a case model because it captures the detailed time-frequency characteristics of the learning data s. The case model M is stored, for example, in the case model storage unit 104 (FIG. 1), together with the amplitude spectrum data A of the clean learning speech paired with the learning data s.
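The index assignment that produces the case model can be sketched in Python as follows. The sketch assumes that the Gaussian mixture model has already been trained and has diagonal covariances, and it compares the component densities g(s|q) without the mixture weights. The function names and the toy parameters are hypothetical.

```python
import numpy as np

def diag_gauss_logpdf(x, means, variances):
    """log N(x; mean_q, diag(var_q)) for every frame and every component.

    x: (frames, dims); means, variances: (components, dims).
    Returns an array of shape (frames, components).
    """
    x = np.asarray(x, dtype=float)[:, None, :]
    means = np.asarray(means, dtype=float)[None, :, :]
    variances = np.asarray(variances, dtype=float)[None, :, :]
    return -0.5 * np.sum(
        np.log(2.0 * np.pi * variances) + (x - means) ** 2 / variances,
        axis=2,
    )

def build_case_model(features, means, variances):
    """Return q_i, the index of the most likely Gaussian for each frame i."""
    return np.argmax(diag_gauss_logpdf(features, means, variances), axis=1)

# Toy mixture with two components and three training frames.
means = [[0.0, 0.0], [5.0, 5.0]]
variances = [[1.0, 1.0], [1.0, 1.0]]
features = [[0.1, -0.2], [4.8, 5.1], [0.3, 0.0]]
q = build_case_model(features, means, variances)
```

The resulting index sequence q, kept in frame order together with the mixture parameters, plays the role of the case model M; the paired amplitude spectra would be stored alongside it using the same frame index.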

[Matching unit]
The matching unit 105 uses the case model M to search for the learning data segment closest to the feature quantity y_t of the input signal, and outputs the learning data segment M^t_{u:u+τmax} judged to give the clean speech sequence closest to the clean speech s_t contained in the input signal y_t. The matching unit 105 performs this search while taking into account the value Σ_bt indicating the uncertainty of the primary speech enhancement signal ^~s_t. To make clear how this differs from the conventional method, which does not take Σ_bt into account, a matching method without Σ_bt is described first.

Assume that the input signal consists of T time frames and write it as y = {y_t : t = 1, 2, ..., T}. Let y_{t:t+τ} denote the sequence of input frames from time frame t to t+τ. Let M_{u:u+τ} = {g, q_i : i = u, u+1, ..., u+τ} denote the sequence of Gaussian distributions corresponding to the consecutive time frames u to u+τ of the learning data s.

Several other choices, such as the Euclidean distance, are conceivable for defining the distance between the input signal y_t and a segment of the learning data s, and for searching for the learning data closest to y_t. Here, the learning data segment closest to time frame t of the input signal y is taken to be the longest of the learning data segments that match the input signal well. That is, the learning data segment M^t_{u:u+τ} closest to the input signal can be obtained by maximizing the posterior probability shown in the following equations.

Here p(M_{u:u+τ} | y_{t:t+τ}) is the posterior probability. It has the property that, when y_{t:t+τ} and M_{u:u+τ} match reasonably well, a longer τ gives a higher posterior probability. Searching for longer segments makes the match less susceptible to noise that is present only locally in time, so relatively robust matching can be expected. In equation (6), for simplicity, p(M_{u:u+τ}) can be assumed to be equal for all learning data segments. This corresponds to assuming that every sequence pattern observed in the learning data is equally likely to occur at noise/reverberation removal time.

The numerator term p(y_{t:t+τ} | M_{u:u+τ}) in equation (6) is the likelihood of the speech enhancement signal y_{t:t+τ} given the learning data segment corresponding to M_{u:u+τ}. This likelihood is calculated by the following equation.

For simplicity, adjacent frames are assumed to be independent. The denominator of equation (6) is the sum of p(y_{t:t+τ} | M_{u:u+τ}) over all patterns contained in the case model M.
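Equations (5) to (7) are not reproduced in this text, so the following Python sketch reconstructs the search only from the description above. It assumes independent frames, a uniform prior over training segments, diagonal covariances, and a denominator that sums over all training segments of the same length. The function names, the tie-breaking rule in favor of longer segments, and the toy data are assumptions.

```python
import numpy as np

def frame_loglik(y, means, variances, q):
    """log g(y_t | q_i) for every input frame t and training frame i.

    y: (T, dims); means, variances: (components, dims); q: (I,) indices.
    Returns an array of shape (T, I).
    """
    q = np.asarray(q)
    y = np.asarray(y, dtype=float)[:, None, :]
    mu = np.asarray(means, dtype=float)[q][None, :, :]
    var = np.asarray(variances, dtype=float)[q][None, :, :]
    return -0.5 * np.sum(np.log(2.0 * np.pi * var) + (y - mu) ** 2 / var, axis=2)

def segment_posteriors(ll, t, tau):
    """Posterior of each training segment M_{u:u+tau} given y_{t:t+tau}."""
    n_seg = ll.shape[1] - tau
    seg_ll = np.array([
        sum(ll[t + e, u + e] for e in range(tau + 1)) for u in range(n_seg)
    ])
    seg_ll -= seg_ll.max()  # stabilise the exponentials
    post = np.exp(seg_ll)
    return post / post.sum()

def longest_match(ll, t, tau_limit):
    """Return (u, tau) maximising the posterior; ties favor the longer tau."""
    best = (-1.0, 0, 0)
    for tau in range(tau_limit + 1):
        if t + tau >= ll.shape[0] or tau >= ll.shape[1]:
            break
        post = segment_posteriors(ll, t, tau)
        u = int(np.argmax(post))
        if post[u] >= best[0]:
            best = (float(post[u]), u, tau)
    return best[1], best[2]

# Toy case model: 7 training frames labelled with 2 one-dimensional Gaussians.
means, variances = [[0.0], [10.0]], [[1.0], [1.0]]
q = [0, 1, 1, 0, 1, 0, 0]
y = [[10.0], [10.0], [0.0]]  # resembles training frames 1..3
ll = frame_loglik(y, means, variances, q)
u_best, tau_best = longest_match(ll, t=0, tau_limit=5)
```

With a single frame (tau = 0) three training frames fit equally well and the posterior is split among them; once two or three frames are compared, only the segment starting at training frame 1 fits, which illustrates why longer segments give a sharper match.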

If the input signal y_t is sufficiently close to clean speech, that is, if the acoustic distortion component b_t is sufficiently close to zero, the mismatch with the clean speech data used for learning is small and a pattern close to the clean speech s_t can be found in the learning data. In general, however, the input signal y_t differs from the clean speech s_t because of noise and reverberation, and this difference directly affects the matching process. As things stand, it is therefore not easy to find a pattern close to the clean speech s_t among the learning patterns, and some means of reducing the influence of this difference is needed.

The noise/reverberation removal apparatus 100 of the present invention therefore takes uncertainty (reliability) into account in order to reduce the influence of the difference caused by noise and reverberation. That is, the apparatus 100 matches the input signal y_t against the learning data while taking reliability into account, and searches for the learning data segment M^t_{u:u+τmax} closest to the input signal.

To account explicitly for the difference between the primary speech enhancement signal ^~s_t and the clean speech s_t, the reliability/uncertainty of the primary speech enhancement signal ^~s_t is considered. Specifically, the input signal y_t is formulated probabilistically.

First, the noise/reverberation component b_t is assumed to follow the Gaussian process below.

Here ^b_t is the estimate of the difference between the input signal y_t and the primary speech enhancement signal ^~s_t, calculated as ^b_t = y_t - ^~s_t, and the value Σ_bt indicating the uncertainty of the primary speech enhancement signal ^~s_t is the time-varying covariance matrix of b_t. With this formulation, the likelihood of the input signal y_t can be obtained as follows by marginalizing the joint probability over the clean speech signal.

The product rule of probability was used in the derivation. Equation (9) shows that the time-varying covariance matrix Σ_bt can be regarded as a measure of the uncertainty of the primary speech enhancement signal ^~s_t. For example, for an unreliable, uncertain feature quantity, the corresponding entries of the covariance matrix Σ_bt become large, so that feature quantity has less influence on the result.
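Equation (9) is not reproduced in this text. Under the additive model y_t = s_t + b_t, with b_t Gaussian with mean ^b_t and covariance Σ_bt and with s_t drawn from a Gaussian component with mean μ_q and covariance Σ_q, marginalizing over s_t gives a Gaussian in ^~s_t = y_t - ^b_t with mean μ_q and covariance Σ_q + Σ_bt. The Python sketch below evaluates that likelihood for diagonal covariances. The function name and the toy numbers are assumptions.

```python
import numpy as np

def uncertain_loglik(s_enh, sigma_b, mean, var):
    """log N(s_enh; mean, diag(var + sigma_b)) for one frame.

    s_enh: primary enhanced feature; sigma_b: diagonal of Sigma_bt;
    mean, var: parameters of one diagonal-covariance Gaussian component.
    """
    s_enh, sigma_b, mean, var = (
        np.asarray(a, dtype=float) for a in (s_enh, sigma_b, mean, var)
    )
    total = var + sigma_b  # the variance is inflated by the uncertainty
    return float(
        -0.5 * np.sum(np.log(2.0 * np.pi * total) + (s_enh - mean) ** 2 / total)
    )

# The second feature dimension is badly off and flagged as uncertain.
mean, var = [0.0, 0.0], [1.0, 1.0]
s_enh = [0.0, 6.0]
plain = uncertain_loglik(s_enh, [0.0, 0.0], mean, var)
aware = uncertain_loglik(s_enh, [0.0, 36.0], mean, var)
```

In this example the uncertainty-aware score is much higher than the plain one, because the large variance on the second dimension reduces the penalty for its mismatch; the first dimension, which has zero uncertainty, is scored exactly as before.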

By inserting this time-varying correction of the variance term of the Gaussian distribution into equation (6), it becomes possible to search for a learning data segment M^t_{u:u+τmax} close to the clean speech signal s_t while taking into account the reliability/uncertainty of the primary speech enhancement signal ^~s_t, which is the result of the primary speech enhancement processing.
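The effect of the variance correction can be illustrated with a minimal Python sketch. It assumes diagonal covariances and a single Gaussian component; the function names and the toy numbers are hypothetical and are not taken from the patent.

```python
import math


def log_gauss_diag(x, mean, var):
    """Log density of a Gaussian with diagonal covariance `var` at point `x`."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - mi) ** 2 / v)
        for xi, mi, v in zip(x, mean, var)
    )


def uncertain_log_likelihood(s_tilde, mean, var, var_b):
    """Score the primary enhanced feature `s_tilde` against one component.

    The time-varying uncertainty `var_b` (diagonal of the assumed Sigma_bt)
    is added to the model variance before the Gaussian is evaluated.
    """
    inflated = [v + vb for v, vb in zip(var, var_b)]
    return log_gauss_diag(s_tilde, mean, inflated)


# Toy example: the second feature dimension is badly distorted.
s_tilde = [0.0, 5.0]
mean, var = [0.0, 0.0], [1.0, 1.0]

without_uncertainty = uncertain_log_likelihood(s_tilde, mean, var, [0.0, 0.0])
with_uncertainty = uncertain_log_likelihood(s_tilde, mean, var, [0.0, 24.0])
```

With the uncertainty applied, the mismatch in the distorted dimension is penalized far less, so the reliable dimension dominates the match.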

[Speech enhancement filtering unit]
The speech enhancement filtering unit 106 performs filtering using the learning data segment M^t_{u:u+τmax} output by the matching unit 105 and the corresponding exemplars of the clean speech amplitude spectrum.

First, the clean speech amplitude spectrum corresponding to the matching result M^t_{u:u+τmax} is read from the case model storage unit 104, and restoration of the amplitude spectrum of the clean speech component s contained in the input signal y_t is attempted. Letting ε (ε = 1, 2, …, T) be the index of the time frame whose clean speech amplitude spectrum is to be restored, the clean speech amplitude spectrum ^S_ε is estimated and restored as follows.

Here, A(u^t_ε) is the exemplar of the clean speech amplitude spectrum paired with the learning data segment M^t_{u:u+τmax}, and u^t_ε is the index corresponding to ε within the most likely learning data segment u = {u, u+1, …, u+τmax} obtained at each frame t. The set [A] of clean speech amplitude spectrum data is {A(i) : i = 1, 2, …, I_s}.

Next, a Wiener filter H_ε is constructed using the estimated amplitude spectrum ^S_ε (equation (11)).

The estimate ^B^2_ε of the noise/reverberation component is obtained as shown in equation (12).

Here, α is a smoothing coefficient, and max[k, k′] is a function that selects and outputs the larger of k and k′. Letting the Wiener filter H_ε be H_t, the final output signal is obtained by multiplying the power spectrum Y_t^2 of the input signal by H_t.

The output signal obtained by multiplying the power spectrum Y_t^2 of the input signal by the Wiener filter H_t is converted into a time-domain signal by an inverse Fourier transform and then output.
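The filtering steps above can be sketched in Python as follows. Equations (11) and (12) are not reproduced in this text, so the forms used here are assumptions: a textbook Wiener gain S²/(S² + B²), and a recursively smoothed noise/reverberation power that is floored with max. The function names, the default α = 0.9 and the floor value are hypothetical.

```python
def update_noise_power(prev_noise_pow, y_pow, s_amp, alpha=0.9, floor=0.0):
    """Assumed form of the noise/reverberation power estimate per frequency bin.

    Smooths the residual (Y^2 - S^2) over time with coefficient `alpha`
    and keeps the larger of the smoothed value and `floor`.
    """
    return [
        max(alpha * prev + (1.0 - alpha) * (y - s * s), floor)
        for prev, y, s in zip(prev_noise_pow, y_pow, s_amp)
    ]


def wiener_gain(s_amp, noise_pow, eps=1e-12):
    """Textbook Wiener gain S^2 / (S^2 + B^2) per frequency bin."""
    return [s * s / (s * s + n + eps) for s, n in zip(s_amp, noise_pow)]


def enhance_frame(y_pow, s_amp, prev_noise_pow, alpha=0.9):
    """Apply the gain to the input power spectrum of one frame.

    Returns the filtered power spectrum and the updated noise power.
    """
    noise_pow = update_noise_power(prev_noise_pow, y_pow, s_amp, alpha)
    gain = wiener_gain(s_amp, noise_pow)
    return [g * y for g, y in zip(gain, y_pow)], noise_pow


# Toy frame with two bins: bin 0 is pure speech, bin 1 is pure distortion.
out_pow, noise_pow = enhance_frame(
    y_pow=[4.0, 1.0], s_amp=[2.0, 0.0], prev_noise_pow=[0.0, 0.0], alpha=0.5
)
```

In this toy frame the speech-dominated bin passes almost unchanged and the distortion-only bin is suppressed. Converting the filtered spectrum back to the time domain would additionally need the phase of the input signal and an inverse Fourier transform, which the sketch omits.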

[Evaluation experiment]
An evaluation experiment was conducted to evaluate the performance of the noise / dereverberation apparatus 100 of the present invention. The experimental conditions were as follows.

The Gaussian mixture model g was trained on the TIMIT core training-set, which consists of 1088 sentences from 136 speakers. The sampling frequency was 8 kHz, and the feature vector used for training the Gaussian mixture model was formed by concatenating 40th-order mel-cepstral coefficients with a log-energy term. The number of mixture components Q was set to 4096, a value large enough to model the various time-frequency patterns in the learning data accurately. The frame length used for the Fourier transform was 20 ms, and the shift of the short-time window was 10 ms.
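As a small worked example of the framing parameters stated above, the following Python sketch converts the 20 ms frame length and 10 ms shift at 8 kHz into sample counts and splits a signal into overlapping frames. The function name is hypothetical, and windowing and the Fourier transform itself are left out.

```python
def frame_signal(samples, sample_rate=8000, frame_ms=20, shift_ms=10):
    """Split `samples` into overlapping frames (no padding of the tail)."""
    frame_len = round(sample_rate * frame_ms / 1000)  # 160 samples at 8 kHz
    shift = round(sample_rate * shift_ms / 1000)      # 80 samples at 8 kHz
    return [
        samples[start:start + frame_len]
        for start in range(0, len(samples) - frame_len + 1, shift)
    ]


# One second of (silent) audio at 8 kHz.
frames = frame_signal([0.0] * 8000)
```

With these settings, one second of audio yields 99 complete frames of 160 samples each.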

The experiment assumed a room measuring 5 m × 5 m × 5 m with a reverberation time of 0.5 seconds, and the room impulse response that would be measured with the speaker 2.5 m from the microphone was simulated on a computer. The input signal y_t to the noise / dereverberation apparatus 100 was generated by convolving this room impulse response with the speech of 64 sentences included in the TIMIT core training-set. The method of Reference 2 described above was used as the speech enhancement processing for obtaining the primary speech enhancement signal ^~s_t.
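The generation of a reverberant input by convolution can be sketched in Python as below. The patent does not state how the room impulse response was simulated, so this sketch substitutes a simple stand-in: Gaussian noise with an exponential envelope that falls by 60 dB at the reverberation time. The function names and parameters are hypothetical, and the room geometry and the speaker distance are not modeled.

```python
import math
import random


def synthetic_rir(sample_rate=8000, rt60=0.5, length_s=0.5, seed=0):
    """Stand-in impulse response: noise whose amplitude decays 60 dB at rt60."""
    rng = random.Random(seed)
    n = int(sample_rate * length_s)
    decay = math.log(1000.0) / rt60  # amplitude factor 10^-3 at t = rt60
    return [
        rng.gauss(0.0, 1.0) * math.exp(-decay * i / sample_rate)
        for i in range(n)
    ]


def convolve(x, h):
    """Direct (full) linear convolution of two sequences."""
    y = [0.0] * (len(x) + len(h) - 1)
    for i, xi in enumerate(x):
        for j, hj in enumerate(h):
            y[i + j] += xi * hj
    return y


# Short toy signal convolved with a short stand-in response.
rir = synthetic_rir(sample_rate=100, rt60=0.5, length_s=0.5)
reverberant = convolve([1.0, 0.0, 0.5], rir)
```

The direct convolution is quadratic in the signal length, so an FFT-based routine would be the usual choice for real utterances.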

FIG. 5 shows the experimental results as spectrograms. The horizontal axis is time and the vertical axis is frequency, and the black-and-white shading represents the intensity at each frequency. (a) is the input signal, (b) is the reverberant speech, (c) is the output signal of the conventional method, (d) is the output signal obtained by matching without taking uncertainty into account, and (e) is the output signal of the noise / dereverberation apparatus 100 of the present invention.

In the output signal (c) of the conventional method, a certain degree of dereverberation can be confirmed, but the energy in regions where the original speech energy exists is excessively suppressed, which shows that the processing is inaccurate. In contrast, the output signal (d), obtained by matching without taking uncertainty into account, has somewhat less distortion than the conventional method (c) because exemplar-based processing has been added.

In the output signal (e) of the noise / dereverberation apparatus 100 of the present invention, dereverberation is more effective than in the two processed signals above, as can be seen from the recovery of the harmonic structure around 0.54 seconds, 0.81 seconds and 0.96 seconds, indicated by the arrows (↓).

Next, to evaluate the effect of the noise/reverberation removal method of the present invention more objectively, the segmental SNR and the log-spectral distance were calculated. A higher segmental SNR means that the acoustic distortion has been removed more accurately. Conversely, a smaller log-spectral distance means that the speech is closer to clean speech. FIG. 6 shows the average of the results obtained over all evaluation utterances. The horizontal direction in FIG. 6 is the processing method: from the left, the input signal (□), the conventional method, matching without taking uncertainty into account, and the present invention (■). The vertical axis is (a) the segmental SNR (dB) and (b) the log-spectral distance (dB).
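The two measures can be sketched in Python as follows. The patent does not give their formulas, so the definitions below are commonly used ones and are assumptions here: the segmental SNR is the frame-wise SNR in dB, clamped to a fixed range and averaged, and the log-spectral distance is the root-mean-square difference of the log power spectra in dB, averaged over frames. The function names, the clamping range and the frame settings are hypothetical.

```python
import math


def segmental_snr(clean, processed, frame_len=160, shift=80,
                  lo=-10.0, hi=35.0, eps=1e-12):
    """Average of frame-wise SNRs in dB, each clamped to [lo, hi]."""
    values = []
    for start in range(0, len(clean) - frame_len + 1, shift):
        s = clean[start:start + frame_len]
        p = processed[start:start + frame_len]
        signal = sum(a * a for a in s)
        error = sum((a - b) ** 2 for a, b in zip(s, p))
        snr = 10.0 * math.log10((signal + eps) / (error + eps))
        values.append(min(max(snr, lo), hi))
    if not values:
        raise ValueError("signal is shorter than one frame")
    return sum(values) / len(values)


def log_spectral_distance(clean_pow, processed_pow, eps=1e-12):
    """Mean over frames of the RMS log-power difference in dB."""
    if not clean_pow:
        raise ValueError("no frames given")
    distances = []
    for c, p in zip(clean_pow, processed_pow):
        sq = [
            (10.0 * math.log10(ci + eps) - 10.0 * math.log10(pi + eps)) ** 2
            for ci, pi in zip(c, p)
        ]
        distances.append(math.sqrt(sum(sq) / len(sq)))
    return sum(distances) / len(distances)


# Toy check: a constant 10 % amplitude error gives a frame SNR of 20 dB.
snr_db = segmental_snr([1.0] * 320, [0.9] * 320)
lsd_db = log_spectral_distance([[1.0, 1.0]], [[10.0, 10.0]])
```

In the toy check, a tenfold power difference in every bin gives a log-spectral distance of 10 dB, and identical spectra give 0 dB.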

As described above, the noise/reverberation removal method of the present invention obtains the best values for both the segmental SNR and the log-spectral distance while using only a case model generated from clean speech alone. That is, because noise/reverberation data is not needed at training time, the method reduces the amount of computation and also improves the SN ratio after noise/reverberation removal compared with the prior art.

When the processing means of the noise / dereverberation apparatus 100 and the case model generation apparatus 200 described above are realized by a computer, the processing contents of the functions that each apparatus should have are described by a program. By executing this program on the computer, the processing means of each apparatus are realized on the computer.

The program describing these processing contents can be recorded on a computer-readable recording medium. The computer-readable recording medium may be of any kind, for example a magnetic recording device, an optical disc, a magneto-optical recording medium, or a semiconductor memory. Specifically, for example, a hard disk device, a flexible disk, or a magnetic tape can be used as the magnetic recording device; a DVD (Digital Versatile Disc), DVD-RAM (Random Access Memory), CD-ROM (Compact Disc Read Only Memory), or CD-R (Recordable)/RW (ReWritable) as the optical disc; an MO (Magneto Optical disc) as the magneto-optical recording medium; and an EEP-ROM (Electronically Erasable and Programmable-Read Only Memory) as the semiconductor memory.

The program is distributed, for example, by selling, transferring, or lending a portable recording medium such as a DVD or CD-ROM on which the program is recorded. Alternatively, the program may be stored in a storage device of a server computer and distributed by transferring it from the server computer to other computers via a network.

Each means may be configured by executing a predetermined program on a computer, or at least part of these processing contents may be realized in hardware.

Claims

A speech enhancement processing unit that takes as an input signal a digital speech signal on which noise and reverberation are superimposed, performs primary speech enhancement processing on the input signal, and outputs a primary speech enhancement signal in the feature amount domain;
An enhancement processing result reliability calculation unit that outputs a value indicating the uncertainty of the primary speech enhancement signal from the feature amount of the input signal and the primary speech enhancement signal;
A case model storage unit that stores a case model of learning data and the amplitude spectrum data thereof;
A matching unit that takes as inputs the primary speech enhancement signal, the value indicating the uncertainty of the primary speech enhancement signal, and the case model of the learning data, and outputs, for each time frame, a learning data segment that gives the clean speech sequence closest to the clean speech included in the input signal;
A speech enhancement filtering unit that takes as inputs the power spectrum of the input signal and the learning data segment, reads the amplitude spectrum data paired with the learning data segment from the case model storage unit to generate a Wiener filter, multiplies the power spectrum of the input signal by the Wiener filter to perform filtering, and outputs a speech enhancement signal;
A noise / dereverberation apparatus comprising:

The noise / dereverberation apparatus according to claim 1,
wherein the enhancement processing result reliability calculation unit outputs, as the value indicating the uncertainty of the primary speech enhancement signal, a covariance matrix whose components are the difference between the feature amount of the input signal and the primary speech enhancement signal.

The noise / dereverberation apparatus according to claim 1 or 2,
wherein the learning data segment output by the matching unit for each time frame, which gives the clean speech sequence closest to the clean speech included in the input signal, is the longest learning data segment that closely matches the feature amount of the input signal.

A speech enhancement process of taking as an input signal a digital speech signal on which noise and reverberation are superimposed, performing primary speech enhancement processing on the input signal, and outputting a primary speech enhancement signal in the feature amount domain;
An enhancement processing result reliability calculation process for outputting a value indicating the uncertainty of the primary speech enhancement signal from the feature amount of the input signal and the primary speech enhancement signal;
A case model of learning data, a case model storage unit for storing the amplitude spectrum data,
The primary speech enhancement signal, a value indicating the uncertainty of the primary speech enhancement signal, and the case model of the learning data stored in the case model storage unit are input and included in the input signal for each time frame. A matching process that outputs a training data segment giving the clean speech sequence closest to the clean speech;
Using the power spectrum of the input signal and the learning data segment as inputs, the amplitude spectrum data stored in pairs with the learning data segment is read from the case model storage unit to generate a Wiener filter, and the input signal A voice enhancement filtering process of outputting a voice enhancement signal by multiplying the power spectrum by the Wiener filter and filtering;
A noise / dereverberation method comprising:

The noise / dereverberation method according to claim 4,
wherein the enhancement processing result reliability calculation process outputs, as the value indicating the uncertainty of the primary speech enhancement signal, a covariance matrix whose components are the difference between the feature amount of the input signal and the primary speech enhancement signal.

The noise / dereverberation method according to claim 4 or 5,
wherein the learning data segment output by the matching process for each time frame, which gives the clean speech sequence closest to the clean speech included in the input signal, is the longest learning data segment that closely matches the feature amount of the input signal.

A program for causing a computer to function as the noise / dereverberation apparatus according to any one of claims 1 to 3.