JP2007093630A - Speech emphasizing device

Info

Publication number: JP2007093630A
Application number: JP2005268174A
Authority: JP
Inventors: Herbordt Wolfgang; ヘルボートウォルフガング; Masakiyo Fujimoto; 雅清藤本; Toshiharu Horiuchi; 俊治堀内; Satoru Nakamura; 哲中村
Original assignee: ATR Advanced Telecommunications Research Institute International
Current assignee: ATR Advanced Telecommunications Research Institute International
Priority date: 2005-09-05
Filing date: 2005-09-15
Publication date: 2007-04-12

Abstract

PROBLEM TO BE SOLVED: To provide a speech emphasizing device capable of extracting only a target speech with high precision even in an environment wherein sound conditions change.

SOLUTION: The speech emphasizing device is equipped with a microphone array equipped with a plurality of microphones, an adaptive beam former which generates a signal by emphasizing a target speech signal, and a noise reducing device which suppresses noise of the output signal of the adaptive beam former. The speech emphasizing device uses as the adaptive beam former a robust generalized side-lobe canceller which is equipped with a fixed beam former, an adaptive blocking matrix, and an adaptive disturbance canceller, and has the fixed beam former and adaptive disturbance canceller adaptively controlled according to the SNR of an input signal and also uses as the noise reducing device a single-channel noise reducing device which suppresses noise by using a Wiener filter based upon a GMM.

COPYRIGHT: (C)2007,JPO&INPIT

Description

この発明は、音声強調装置に関する。 The present invention relates to a speech enhancement device.

In recent years, there has been growing interest in integrating microphone arrays into mobile platforms such as laptop computers, game consoles, personal digital assistants (PDAs), and mobile phones. Even after decades of research, microphone array technology on such platforms still has untapped potential, particularly given the rapidly growing demand for Internet telephony and voice control applications (see Reference [1]).

Reference [1]: M.S. Brandstein and D.B. Ward, Eds., Microphone Arrays: Signal Processing Techniques and Applications, Springer, Berlin, 2001.

Even if a microphone array on such a platform may seem optional for a hands-free communication system, it is highly effective for noise-robust automatic speech recognition. However, one of the major problems in such applications is the high variability of the acoustic conditions, such as the type of noise, the noise level, the reverberation time, and the movement of the mobile platform relative to the speaker's head.

Patent Document 1: JP 2004-112701 A
Patent Document 2: JP H11-52988 A (Japanese Patent Laid-Open No. 11-52988)

この発明は、音響条件が変動する環境下においても、目的音声のみを高精度に抽出できる音声強調装置を提供することを目的とする。 An object of the present invention is to provide a speech enhancement device that can extract only a target speech with high accuracy even in an environment where acoustic conditions vary.

The invention according to claim 1 comprises a microphone array having a plurality of microphones, an adaptive beamformer that generates, from the plurality of microphone signals obtained by the microphone array, a signal in which a target speech signal is emphasized, and a noise reduction device that suppresses noise in the output signal of the adaptive beamformer. It is characterized in that a robust generalized sidelobe canceller, which includes a fixed beamformer, an adaptive blocking matrix, and an adaptive disturbance canceller and in which the fixed beamformer and the adaptive disturbance canceller are adaptively controlled according to the SNR of the input signal, is used as the adaptive beamformer, and in that a single-channel noise reduction device that suppresses noise using a GMM-based Wiener filter is used as the noise reduction device.

The invention according to claim 2 is the invention according to claim 1, characterized in that the noise reduction device comprises: first means for obtaining the log mel spectrum corresponding to the input speech signal by performing mel filter bank analysis, frame by frame, on the input speech signal sent from the adaptive beamformer; second means for determining whether the frame number of the log mel spectrum obtained by the first means is equal to or greater than a predetermined value; third means for, when that frame number is less than the predetermined value, performing processing for estimating the log mel spectrum corresponding to noise on the basis of the log mel spectrum obtained by the first means, and then proceeding to processing of the next frame by the first means; fourth means for, when that frame number is equal to or greater than the predetermined value, designing a Wiener filter for each component distribution of the GMM by using the GMM and the noise log mel spectrum obtained by the third means, and then taking a weighted average of the resulting Wiener filters; and fifth means for converting the weighted-average Wiener filter obtained by the fourth means into an impulse response, obtaining an estimated clean speech signal by convolving that impulse response with the input speech signal, and then proceeding to processing of the next frame by the first means.

この発明によれば、音響条件が変動する環境下においても、目的音声のみを高精度に抽出できるようになる。 According to the present invention, only the target voice can be extracted with high accuracy even in an environment where the acoustic conditions fluctuate.

以下、図面を参照して、この発明の実施例について説明する。 Embodiments of the present invention will be described below with reference to the drawings.

〔１〕マイクロホンアレー [1] Microphone array

図１は、携帯情報端末（ＰＤＡ）にマイクロホンアレーユニットが取り付けられた状態を示している。 FIG. 1 shows a state in which a microphone array unit is attached to a personal digital assistant (PDA).

図１において、１はマイクロホンアレーユニットであり、２は携帯情報端末（小型ＰＣ）である。なお、２１は、携帯情報端末２の前面に設けられた表示部である。この明細書においては、携帯情報端末２の背面側から前面側に向かう方向を前方ということにする。 In FIG. 1, 1 is a microphone array unit, and 2 is a portable information terminal (small PC). Reference numeral 21 denotes a display unit provided on the front surface of the portable information terminal 2. In this specification, the direction from the back side to the front side of the portable information terminal 2 is referred to as the front.

The microphone array unit 1 includes a rectangular plate-shaped base 100, a horizontally elongated first microphone holding unit 101 that projects forward from the upper end of the base 100, and a vertically elongated second microphone holding unit 102 that projects forward from the right end of the base 100.

The microphone array unit 1 is attached to the portable information terminal 2 with the portable information terminal 2 placed on the base 100 of the microphone array unit 1. In this attached state, the first microphone holding unit 101 lies along the upper end face of the portable information terminal 2, and the second microphone holding unit 102 lies along its right side face. The microphone array unit 1 and the portable information terminal 2 are connected via USB.

図２は、マイクロホンアレーユニット１におけるマイクロホンの配置形態を示している。 FIG. 2 shows the arrangement of microphones in the microphone array unit 1.

この例では、マイクロホンアレーユニット１には８個のマイクロホンＭ１〜Ｍ８が設けられている。マイクロホンＭ５は、第１マイクロホン保持部１０１と第２マイクロホン保持部１０２の接続部に設けられている。マイクロホンＭ１〜Ｍ４は第１マイクロホン保持部１０１に設けられ、マイクロホンＭ６〜Ｍ８は第２マイクロホン保持部１０２に設けられている。 In this example, the microphone array unit 1 is provided with eight microphones M1 to M8. The microphone M5 is provided at a connection portion between the first microphone holding unit 101 and the second microphone holding unit 102. The microphones M1 to M4 are provided in the first microphone holding unit 101, and the microphones M6 to M8 are provided in the second microphone holding unit 102.

マイクロホンＭ１〜Ｍ５は、等間隔Ｄ１をおいて横方向に並んで前向きに配置されている。マイクロホンＭ５〜Ｍ８は、等間隔Ｄ２をおいて縦方向に並んで前向きに配置されている。間隔Ｄ１はこの例では、２ｃｍに設定されており、間隔Ｄ２はこの例では、４ｃｍに設定されている。各マイクロホンＭ１〜Ｍ８としては、無指向性コンデンサマイクロホンが用いられている。 The microphones M1 to M5 are arranged side by side in the horizontal direction with an equal interval D1. The microphones M5 to M8 are arranged in the vertical direction with an equal interval D2 and facing forward. The distance D1 is set to 2 cm in this example, and the distance D2 is set to 4 cm in this example. As each of the microphones M1 to M8, an omnidirectional condenser microphone is used.

〔２〕音声認識システム [2] Voice recognition system

図３は、音声認識システムの構成を示している。 FIG. 3 shows the configuration of the voice recognition system.

音声認識システムは、マイクロホンアレーユニット１が装着された携帯情報端末（ＰＤＡ）２と、携帯情報端末２と無線ＬＡＮによって接続される音声認識装置３とからなる。 The voice recognition system includes a personal digital assistant (PDA) 2 to which a microphone array unit 1 is attached, and a voice recognition device 3 connected to the personal digital assistant 2 through a wireless LAN.

The microphone array unit 1 contains a multichannel A/D converter 11 that converts the speech signals picked up by the microphones M1 to M8 into digital signals. The multichannel speech signals x1(t) to x8(t) obtained by the multichannel A/D converter 11 are transmitted to the speech recognition device 3 via the portable information terminal 2 over the wireless LAN.

音声認識装置３は、適応ビームフォーマ３１、単一チャンネル雑音低減装置３２および音声認識部３３を備えている。 The speech recognition device 3 includes an adaptive beamformer 31, a single channel noise reduction device 32, and a speech recognition unit 33.

As the adaptive beamformer 31, a robust generalized sidelobe canceller (RGSC) (see References [2] and [3]) that uses an adaptive blocking matrix (see Reference [4] and Patent Document 2) is used. This GSC is designed to be highly robust against cancellation of the target signal caused by speaker movement and reverberation, while strongly suppressing noise.

Reference [2]: W. Herbordt, H. Buchner, S. Nakamura, and W. Kellermann, "Application of a double-talk resilient DFT-domain adaptive filter for bin-wise stepsize controls to adaptive beamforming," Proc. IEEE-EURASIP Workshop on Nonlinear Signal and Image Processing, May 2005.
Reference [3]: W. Herbordt, H. Buchner, S. Nakamura, and W. Kellermann, "Outlier-robust DFT-domain adaptive filtering for bin-wise stepsize controls, and its application to a generalized sidelobe canceller," Proc. Int. Workshop on Acoustic Echo and Noise Control, September 2005.
Reference [4]: O. Hoshuyama, A. Sugiyama, and A. Hirano, "A robust adaptive beamformer for microphone arrays with a blocking matrix using constrained adaptive filters," IEEE Trans. on Signal Processing, vol. 47, no. 10, pp. 2677-2684, October 1999.

The single-channel noise reduction device 32 is connected in cascade with the adaptive beamformer 31 and removes residual noise from the output of the adaptive beamformer 31. The noise reduction device 32 is based on a minimum mean-squared error (MMSE) estimator of the mel log-spectral energy coefficients that uses a Gaussian mixture model (GMM) of clean speech (see References [5] and [6]).

Reference [5]: J.C. Segura, A. de la Torre, M.C. Benitez, and A.M. Peinado, "Model-based compensation of the additive noise for continuous speech recognition. Experiments using AURORA II database and tasks," Proc. Eurospeech, vol. 1, pp. 221-224, September 2001.

Reference [6]: M. Fujimoto and Y. Ariki, "Combination of temporal domain SVD based speech enhancement and GMM based speech estimation for ASR in noise - evaluation on the AURORA2 tasks," Proc. Eurospeech, pp. 1781-1784, September 2003.

単一チャンネル雑音低減装置３２によって雑音が低減された音声信号は、音声認識部３３に送られ、音声認識が行なわれる。音声認識部３３としては、例えば、本出願人が開発したバージョン３．３のＡＴＲＡＳＲ大語彙音声認識システムのような公知の音声認識システムが用いられる。 The voice signal whose noise has been reduced by the single channel noise reduction device 32 is sent to the voice recognition unit 33 for voice recognition. As the speech recognition unit 33, for example, a known speech recognition system such as the version 3.3 ATRASR large vocabulary speech recognition system developed by the present applicant is used.

本願発明の特徴は、ロバスト一般化サイドローブ・キャンセラからなる適応ビームフォーマ３１と、ＧＭＭに基づくウイナーフィルタを用いて雑音を抑圧する単一チャンネル雑音低減装置３２とを組み合わせた点にある。上記適応ビームフォーマ３１では比較的周波数の高い雑音を効果的に除去できるが周波数の低い雑音が残る。上記単一チャンネル雑音低減装置３２は、周波数の低い雑音を効果的に除去できる。したがって、これらを組み合わせることにより、広範囲の周波数帯の雑音を除去できる。 The feature of the present invention resides in that an adaptive beamformer 31 composed of a robust generalized sidelobe canceller and a single channel noise reduction device 32 that suppresses noise using a Wiener filter based on GMM are combined. Although the adaptive beamformer 31 can effectively remove noise having a relatively high frequency, noise having a low frequency remains. The single channel noise reduction device 32 can effectively remove low frequency noise. Therefore, by combining these, noise in a wide frequency band can be removed.

〔３〕適応ビームフォーマについての説明 [3] Explanation of adaptive beamformer

図４は、適応ビームフォーマ３１の構成を示している。 FIG. 4 shows the configuration of the adaptive beamformer 31.

適応ビームフォーマ３１は、固定ビームフォーマ４１、適応ブロッキング行列４２、適応外乱キャンセラ４３および適応制御部４４を備えている。 The adaptive beamformer 31 includes a fixed beamformer 41, an adaptive blocking matrix 42, an adaptive disturbance canceller 43, and an adaptive control unit 44.

[3-1] Fixed beamformer
The multichannel speech signals sent from the portable information terminal 2 (the signals x1(t) to x8(t) of the microphones M1 to M8) are input to the fixed beamformer 41. From the multichannel speech signals x1(t) to x8(t), the fixed beamformer 41 generates a signal in which the target speech signal is emphasized and signals other than the target speech signal are attenuated.

固定ビームフォーマ４１は、携帯情報端末２から送られてくるマルチチャンネル音声信号ｘ１（ｔ）〜ｘ８（ｔ）に基づいて目的方向を推定し、この推定結果を用いて目的の音声信号を得る。この方向推定処理としては、例えば、特許文献１に開示された方法を使用することができる。特許文献１では、正三角形の頂点に配置された３つのマイクロホンの信号に基づいて方向推定を行なっている。この３つのマイクロホンの信号として、例えば、Ｍ３、Ｍ５、Ｍ６の信号ｘ３（ｔ），ｘ５（ｔ），ｘ６（ｔ）を用いることが可能である。 The fixed beamformer 41 estimates the target direction based on the multi-channel audio signals x1 (t) to x8 (t) sent from the portable information terminal 2, and obtains the target audio signal using the estimation result. As this direction estimation processing, for example, the method disclosed in Patent Document 1 can be used. In Patent Document 1, direction estimation is performed based on signals from three microphones arranged at the vertices of an equilateral triangle. As signals of these three microphones, for example, signals x3 (t), x5 (t), and x6 (t) of M3, M5, and M6 can be used.

推定された目的方向を用いて目的の音声信号を得る方法としては、例えば、遅延和ビームフォーマ方式が用いられる。つまり、各マイクロホンによって受音された信号それぞれに遅延を付加して目的信号を同相化した後、これらを加算する。各マイクロホンに付加する遅延量は、推定された目的方向に基づいてそれぞれ求められる。 As a method of obtaining a target audio signal using the estimated target direction, for example, a delay sum beamformer method is used. That is, after adding a delay to each signal received by each microphone to make the target signal in-phase, these signals are added. The amount of delay added to each microphone is obtained based on the estimated target direction.
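The delay-and-sum step described above can be sketched as follows. This is a minimal illustration rather than the implementation used in the patent: it assumes a far-field source, a linear array, integer-sample delays, and a speed of sound of 343 m/s, and the names `steering_delays` and `delay_and_sum` are hypothetical.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed room-temperature value


def steering_delays(mic_positions_m, direction_deg, sample_rate_hz):
    """Integer sample delays that time-align a far-field source on a line array.

    mic_positions_m: microphone coordinates along the array axis (metres).
    direction_deg: source angle from broadside; positive angles point toward
    increasing coordinates, so those microphones receive the wavefront earlier.
    """
    sin_theta = math.sin(math.radians(direction_deg))
    # Relative arrival time of the wavefront at each microphone.
    arrival = [-p * sin_theta / SPEED_OF_SOUND for p in mic_positions_m]
    latest = max(arrival)
    # Delay the early channels so that every channel lines up with the latest one.
    return [round((latest - a) * sample_rate_hz) for a in arrival]


def delay_and_sum(channels, delays):
    """Delay each channel by its integer delay, then average across channels."""
    n = len(channels[0])
    out = [0.0] * n
    for x, d in zip(channels, delays):
        for t in range(d, n):
            out[t] += x[t - d]
    return [v / len(channels) for v in out]
```

With the 2 cm spacing D1 of the horizontal row M1 to M5, the delays at a 16 kHz sampling rate never exceed a few samples, so a practical implementation would likely need fractional delays; integer delays are used here only to keep the sketch short.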

[3-2] Adaptive blocking matrix
The adaptive blocking matrix 42 uses adaptive signal processing to generate signals in which the target speech signal is attenuated and signals other than the target speech signal are emphasized.

携帯情報端末２から送られてくるマルチチャンネル音声信号ｘ１（ｔ）〜ｘ８（ｔ）は、それぞれ減算器５１〜５８に送られる。一方、固定ビームフォーマ４１の出力信号は、適応フィルタ６１〜６８に送られる。各適応フィルタ６１〜６８の出力信号は、対応する減算器５１〜５８に送られる。各減算器５１〜５８は、対応するマイクロホンの信号ｘ１（ｔ）〜ｘ８（ｔ）から対応する適応フィルタ６１〜６８の出力信号を減算する。各減算器５１〜５８の出力信号は、適応外乱キャンセラ４３に送られるとともに、各適応フィルタ６１〜６８のフィルタ係数更新のために対応する適応フィルタ６１〜６８に送られる。 Multi-channel audio signals x1 (t) to x8 (t) sent from the portable information terminal 2 are sent to subtracters 51 to 58, respectively. On the other hand, the output signal of the fixed beam former 41 is sent to the adaptive filters 61-68. The output signal of each adaptive filter 61-68 is sent to the corresponding subtracter 51-58. Each subtracter 51-58 subtracts the output signal of the corresponding adaptive filter 61-68 from the corresponding microphone signal x1 (t) -x8 (t). The output signals of the subtracters 51 to 58 are sent to the adaptive disturbance canceller 43 and to the corresponding adaptive filters 61 to 68 for updating the filter coefficients of the adaptive filters 61 to 68.

各適応フィルタ６１〜６８では、対応する減算器５１〜５８の出力信号電力が最小化されるように、フィルタ係数の更新を行なう。この結果、減算器５１〜５８の出力信号は、目的信号が減衰された信号となる。なお、後述するように、各適応フィルタ６１〜６８は適応制御部４４によって制御される。 In each of the adaptive filters 61 to 68, the filter coefficients are updated so that the output signal power of the corresponding subtracters 51 to 58 is minimized. As a result, the output signals of the subtracters 51 to 58 are signals in which the target signal is attenuated. As will be described later, each of the adaptive filters 61 to 68 is controlled by the adaptive control unit 44.
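One branch of the blocking matrix (one adaptive filter with its subtracter) can be sketched as below. The sketch swaps in a time-domain NLMS update purely for illustration; the cited References [2] and [3] describe DFT-domain adaptive filters, and the function name, tap count, and step size here are assumptions.

```python
def nlms_blocking_branch(mic, reference, num_taps=4, step=0.5, adapt=None, eps=1e-8):
    """Subtract an adaptively filtered reference from one microphone signal.

    mic: samples of one microphone signal.
    reference: fixed beamformer output (the target speech estimate).
    adapt: optional per-sample booleans; the filter is frozen where False.
    Returns the subtracter output and the final filter weights.
    """
    weights = [0.0] * num_taps
    history = [0.0] * num_taps  # most recent reference samples, newest first
    errors = []
    for t, (x, y) in enumerate(zip(mic, reference)):
        history = [y] + history[:-1]
        estimate = sum(w * h for w, h in zip(weights, history))
        e = x - estimate  # subtracter output: target component removed
        errors.append(e)
        if adapt is None or adapt[t]:
            # Normalized LMS step that reduces the subtracter output power.
            norm = sum(h * h for h in history) + eps
            weights = [w + step * e * h / norm for w, h in zip(weights, history)]
    return errors, weights
```

The `adapt` argument stands in for the control exercised by the adaptive control unit 44: where it is false, the coefficients are left unchanged and only the subtraction is performed.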

[3-3] Adaptive disturbance canceller
The adaptive disturbance canceller 43 removes, from the output signal of the fixed beamformer 41, the components that are correlated with the group of output signals of the adaptive blocking matrix 42. The output signals of the adaptive blocking matrix 42 are sent to the multichannel adaptive filter 71. The output of the multichannel adaptive filter 71 (the sum of the outputs of the adaptive filters that make it up) is sent to the subtracter 72, which also receives the output signal of the fixed beamformer 41. The subtracter 72 subtracts the output signal of the multichannel adaptive filter 71 from the output signal of the fixed beamformer 41. The output signal z(t) of the subtracter 72 is sent to the single-channel noise reduction device 32 as the output signal of the adaptive beamformer 31, and is also fed back to the multichannel adaptive filter 71 for updating its filter coefficients.

多チャンネル適応フィルタ７１では、減算器７２の出力信号電力が最小化されるように、フィルタ係数の更新を行なう。この結果、減算器７２の出力信号は、目的信号以外の信号（干渉信号および雑音信号）が大きく減衰された信号となる。なお、後述するように、多チャンネル適応フィルタ７１は適応制御部４４によって制御される。 In the multi-channel adaptive filter 71, the filter coefficient is updated so that the output signal power of the subtracter 72 is minimized. As a result, the output signal of the subtracter 72 is a signal in which signals other than the target signal (interference signal and noise signal) are greatly attenuated. As will be described later, the multi-channel adaptive filter 71 is controlled by the adaptive control unit 44.

[3-4] Adaptive control unit
As pointed out in Reference [4], the blocking matrix 42 should be adapted when the SNR (signal-to-noise ratio) is high; that is, the coefficients of the adaptive filters 61 to 68 in the blocking matrix 42 should be updated when the SNR is high. The adaptive disturbance canceller 43, on the other hand, should be adapted when the SNR is low; that is, the coefficients of the multichannel adaptive filter 71 in the adaptive disturbance canceller 43 should be updated when the SNR is low.

The adaptive control unit 44 therefore adapts the blocking matrix 42 when only speech is detected, and adapts the adaptive disturbance canceller 43 when only noise is detected. During so-called double talk, all adaptation is stopped. The adaptive control unit 44 continuously detects these three states, "target speech only", "noise only", and "double talk", in the DFT domain for each frequency bin.

図５は、適応制御部４４の構成を示している。 FIG. 5 shows the configuration of the adaptive control unit 44.

Like the fixed beamformer 41, the fixed beamformer 201 generates, from the multichannel speech signals x1(t) to x8(t), a signal in which the target speech signal is emphasized and signals other than the target speech signal are attenuated.

固定ビームフォーマ２０１の出力信号ｙ（ｔ）が目的信号の推定値とみなされるように、固定ビームフォーマ２０１は雑音と比較して目的信号を高める。マルチチャンネル音声信号ｘ１（ｔ）〜ｘ８（ｔ）は、それぞれ減算器２１１〜２１８に送られる。これらの減算器２１１〜２１８には、固定ビームフォーマ２０１の出力信号ｙ（ｔ）が送られる。 The fixed beamformer 201 enhances the target signal compared to noise so that the output signal y (t) of the fixed beamformer 201 is regarded as an estimated value of the target signal. Multi-channel audio signals x1 (t) to x8 (t) are sent to subtracters 211 to 218, respectively. The output signals y (t) of the fixed beam former 201 are sent to these subtracters 211 to 218.

各減算器２１１〜２１８は、対応する音声信号ｘ１（ｔ）〜ｘ８（ｔ）から目的信号の推定値ｙ（ｔ）を減算することにより、雑音の推定値ｃ１（ｔ）〜ｃ８（ｔ）を算出する。 Each of the subtracters 211 to 218 subtracts the estimated value y (t) of the target signal from the corresponding audio signal x1 (t) to x8 (t), thereby estimating the noise estimated values c1 (t) to c8 (t). Is calculated.

The output signal y(t) of the fixed beamformer 201 is transformed into the DFT domain by a discrete Fourier transformer (DFT) 220, and its power spectral density (PSD) is estimated by a PSD estimator 230. The PSD of y(t) is denoted SYY(n, t), where n = 1, ..., N is the DFT frequency bin index and t is the time index.

Each of the noise estimates c1(t) to c8(t) is transformed into the DFT domain by the corresponding DFT 221 to 228, and its power spectral density (PSD) is estimated by the corresponding PSD estimator 231 to 238. The PSDs of c1(t) to c8(t) are averaged in the adder 239. The averaged PSD is denoted Scc(n, t).

SYY(n, t) and Scc(n, t) can be regarded as estimates of the power spectral density of the target speech and of the noise, respectively. The divider 241 computes R(n, t) = SYY(n, t) / Scc(n, t). R(n, t) is thus an estimate of the SNR at the microphones for each frequency bin. Maxima of R(n, t) correspond to high SNR, and minima of R(n, t) correspond to low SNR.

The decision unit 242 detects the maxima and minima of R(n, t). When a maximum of R(n, t) is detected, the decision unit 242 sets Db(n, t) = 1 and Di(n, t) = 0 so that only the blocking matrix 42 is adapted. When a minimum of R(n, t) is detected, it sets Db(n, t) = 0 and Di(n, t) = 1 so that only the adaptive disturbance canceller 43 is adapted. In all other cases, it sets Db(n, t) = 0 and Di(n, t) = 0 to stop the adaptation of both the blocking matrix 42 and the adaptive disturbance canceller 43.
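The three-way decision can be illustrated per frequency bin as below. The text does not specify how maxima and minima of R(n, t) are detected, so this sketch substitutes two fixed thresholds, `r_high` and `r_low`, which are assumptions, as is the function name `adaptation_flags`.

```python
def adaptation_flags(s_yy, s_cc, r_high=10.0, r_low=0.1, floor=1e-12):
    """Return (Db, Di) for every frequency bin of one frame.

    s_yy: target speech PSD estimates SYY(n, t) for n = 1..N.
    s_cc: averaged noise PSD estimates Scc(n, t) for n = 1..N.
    r_high, r_low: assumed thresholds standing in for the detection of
    maxima and minima of R(n, t).
    """
    d_b, d_i = [], []
    for yy, cc in zip(s_yy, s_cc):
        r = yy / max(cc, floor)  # R(n, t) = SYY(n, t) / Scc(n, t)
        if r >= r_high:
            # High SNR ("target speech only"): adapt the blocking matrix only.
            d_b.append(1)
            d_i.append(0)
        elif r <= r_low:
            # Low SNR ("noise only"): adapt the disturbance canceller only.
            d_b.append(0)
            d_i.append(1)
        else:
            # Anything in between is treated as double talk: freeze both.
            d_b.append(0)
            d_i.append(0)
    return d_b, d_i
```

Because the flags are computed per bin, a single frame can adapt the blocking matrix in some bins and the disturbance canceller in others.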

〔４〕単一チャンネル雑音低減装置についての説明 [4] Description of single channel noise reduction device

The single-channel noise reduction device 32 suppresses noise using a GMM-based Wiener filter. The design of the GMM-based Wiener filter and the noise suppression method are described below.

[4-1] Design of the GMM-based Wiener filter
First, letting s(t) denote the clean speech signal and n(t) the noise, the noise-superimposed speech signal z(t) is expressed by the following equation (1).
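Equation (1) itself is not reproduced in this text. From the definitions in the preceding sentence it is presumably the additive noise model, given here as a reconstruction rather than as the original formula:

```latex
\[
  z(t) = s(t) + n(t) \tag{1}
\]
```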

次に、上記式（１）に対して離散フーリエ変換（ＤＦＴ）およびメルフィルタバンク分析を適用することにより、メルスペクトルを求める。メルスペクトル上での雑音重畳音声、クリーン音声、雑音の関係は、次式（２）のように表される。 Next, a mel spectrum is obtained by applying discrete Fourier transform (DFT) and mel filter bank analysis to the above equation (1). The relationship between noise superimposed speech, clean speech, and noise on the mel spectrum is expressed by the following equation (2).
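Equation (2) is likewise missing from this text. Assuming that speech and noise are uncorrelated, so that their mel spectra add, a consistent reconstruction with the symbols used in this section is:

```latex
\[
  Z_b^{\mathrm{lin}}(i) = S_b^{\mathrm{lin}}(i) + N_b^{\mathrm{lin}}(i) \tag{2}
\]
```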

In equation (2), $Z_b^{\mathrm{lin}}(i)$, $S_b^{\mathrm{lin}}(i)$, and $N_b^{\mathrm{lin}}(i)$ are the mel spectra of the noise-superimposed speech, the clean speech, and the noise, respectively; $b$ is the mel filter index and $i$ is the frame number.

Under these definitions, $S_b^{\mathrm{lin}}(i)$ is estimated from $Z_b^{\mathrm{lin}}(i)$ with a Wiener filter. The Wiener filter estimate is given by the following equation (3).
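Equation (3) is not reproduced in this text. The standard Wiener estimate on the mel spectrum, which matches the description of the observation multiplied by a filter term, would read as follows; the hat on the estimate is notation introduced here:

```latex
\[
  \hat{S}_b^{\mathrm{lin}}(i)
  = Z_b^{\mathrm{lin}}(i)\,
    \frac{S_b^{\mathrm{lin}}(i)}{S_b^{\mathrm{lin}}(i) + N_b^{\mathrm{lin}}(i)} \tag{3}
\]
```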

The second term on the right-hand side of equation (3) is the Wiener filter. Of the Wiener filter parameters, the noise mel spectrum $N_b^{\mathrm{lin}}(i)$ is estimated by the following equation (4), on the assumptions that the noise is stationary and that only noise is present in the first 10 frames of the input noise-superimposed speech.
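Equation (4) is missing from this text. Under the two stated assumptions, the simplest estimator is the average over the first 10 frames, sketched below; counting frames from $i = 1$ and writing the estimate as $\hat{N}_b^{\mathrm{lin}}$ are assumptions made here:

```latex
\[
  \hat{N}_b^{\mathrm{lin}} = \frac{1}{10} \sum_{i=1}^{10} Z_b^{\mathrm{lin}}(i) \tag{4}
\]
```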

Next, the Wiener filter parameter $S_b^{\mathrm{lin}}(i)$ is the mel spectrum of the clean speech, which is unknown at filter design time. Therefore, a probabilistic model of clean speech is built in advance using a normal distribution, and the mean $\mu_b^{S,\mathrm{lin}}$ of that distribution is used in place of $S_b^{\mathrm{lin}}(i)$. As a result, equation (3) becomes the following equation (5).
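Equation (5) is not reproduced in this text. Substituting the model mean for the unknown clean spectrum, and the 10-frame noise estimate (written $\hat{N}_b^{\mathrm{lin}}$ here, an assumed symbol) for the noise spectrum, gives the following reconstruction:

```latex
\[
  \hat{S}_b^{\mathrm{lin}}(i)
  = Z_b^{\mathrm{lin}}(i)\,
    \frac{\mu_b^{S,\mathrm{lin}}}{\mu_b^{S,\mathrm{lin}} + \hat{N}_b^{\mathrm{lin}}} \tag{5}
\]
```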

The above is a formulation on the mel spectrum. However, because the mel spectrum has a large dynamic range (its values vary greatly over time), it is not well suited as the parameter for building a probabilistic model of clean speech. Modeling is therefore performed on the log mel spectrum, in which the dynamic range is compressed by the logarithm, and the Wiener filter is likewise designed on the log mel spectrum. This yields the following equations (6), (7), and (8).
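Equations (6) to (8) are missing from this text and their exact form cannot be recovered from it. One consistent way to move the filter to the log mel spectrum is sketched below as an assumption: it writes log-domain quantities without the "lin" superscript, introduces the symbol $W_b$ for the log-domain filter, and approximates $\mu_b^{S,\mathrm{lin}}$ by $\exp(\mu_b^{S})$.

```latex
\begin{align*}
  Z_b(i) &= \log Z_b^{\mathrm{lin}}(i), \qquad
  \hat{N}_b = \log \hat{N}_b^{\mathrm{lin}}, \\
  \hat{S}_b(i) &= Z_b(i) + W_b, \\
  W_b &= \log \frac{\exp(\mu_b^{S})}{\exp(\mu_b^{S}) + \exp(\hat{N}_b)}
       = -\log\bigl(1 + \exp(\hat{N}_b - \mu_b^{S})\bigr).
\end{align*}
```

In this form the multiplication by the filter on the mel spectrum becomes an addition on the log mel spectrum, and $W_b \le 0$ always holds, so the filter can only attenuate.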

A normal distribution is used for the probabilistic model of clean speech, but a single normal distribution has limited modeling capability. Modeling is therefore performed with a GMM comprising multiple normal distributions, as in the following equation (9).

In equation (9), $P_k$ is the weight of component distribution $k$, and $K$ is the total number of component distributions. $S(i)$ is the vector whose elements are $S_b(i)$, $\mu_k^S$ is the vector whose elements are the means $\mu_{b,k}^S$ of component distribution $k$, and $\Sigma_k^S$ is the diagonal matrix whose elements are the variances $\sigma_{b,k}^S$ of component distribution $k$.
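Equation (9) is not reproduced in this text. With the symbols defined above, the usual form of a diagonal-covariance GMM is the following, where $\mathcal{N}(\cdot\,;\mu,\Sigma)$ denotes the multivariate normal density:

```latex
\[
  p\bigl(S(i)\bigr) = \sum_{k=1}^{K} P_k\,
    \mathcal{N}\bigl(S(i);\, \mu_k^{S},\, \Sigma_k^{S}\bigr) \tag{9}
\]
```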

Using the GMM of equation (9), a Wiener filter can be designed for each component distribution k, as shown in the following equation (10).

Finally, the Wiener filters designed in this way are combined into a single filter by the weighted average of the following equation (11).

P_{k,i} in equation (11) is the weight used in the weighted average. It is the posterior probability of the noisy speech with respect to the noisy speech GMM, given by the following equation (12). Because the only GMM that can be prepared in advance in this method is the clean speech GMM, the parameters of the noisy speech GMM are generated approximately using the following equations (13) and (14).

In equation (12), P_k is taken unchanged from the clean speech GMM. Z(i) is the vector whose elements are Z_b(i), μ_k^Z is the vector whose elements are the means μ_{b,k}^Z of component k, and Σ_k^Z is the diagonal matrix whose elements are the variances σ_{b,k}^Z of component k.

In this method, multiple Wiener filters are designed from the GMM parameters, and the final filter is obtained as their weighted average. Because the weights P_{k,i} are computed for every short-time frame, the resulting filter is time-varying.
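The design described above can be sketched as follows. Equations (10) to (14) are not reproduced in this text, so the sketch rests on assumptions: the per-component filter is taken to be the usual Wiener gain S/(S+N) expressed in the log mel domain, the noisy-speech mean is taken to be log(exp(μ^S) + exp(N̂)), and the clean-speech variances are reused for the noisy-speech GMM. All function names are hypothetical.

```python
import math

def log_add(a, b):
    """Return log(exp(a) + exp(b)) in a numerically stable way."""
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))

def component_filter(mu_s, noise):
    """Assumed log-domain Wiener gain per mel band: log(S / (S + N))."""
    return [m - log_add(m, n) for m, n in zip(mu_s, noise)]

def log_gauss_diag(x, mean, var):
    """Log density of a Gaussian with diagonal covariance."""
    return sum(-0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))

def posterior_weights(z, weights, means_s, vars_s, noise):
    """Posterior of each GMM component given the noisy log mel vector z."""
    log_p = []
    for w, mu_s, var_s in zip(weights, means_s, vars_s):
        # Assumption: noisy-speech mean combines clean mean and noise estimate.
        mu_z = [log_add(m, n) for m, n in zip(mu_s, noise)]
        # Assumption: the clean-speech variances are reused unchanged.
        log_p.append(math.log(w) + log_gauss_diag(z, mu_z, var_s))
    top = max(log_p)
    unnorm = [math.exp(lp - top) for lp in log_p]
    total = sum(unnorm)
    return [u / total for u in unnorm]

def combined_filter(z, weights, means_s, vars_s, noise):
    """Posterior-weighted average of the per-component log-domain filters."""
    post = posterior_weights(z, weights, means_s, vars_s, noise)
    filters = [component_filter(mu_s, noise) for mu_s in means_s]
    return [sum(p * f[b] for p, f in zip(post, filters))
            for b in range(len(noise))]
```

Because the posteriors are recomputed for every frame vector `z`, the combined filter changes from frame to frame even though the per-component filters depend only on the GMM means and the noise estimate.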

[4-2] Application of the Wiener filter
The Wiener filter designed by equation (11) is applied to the noisy speech. Because the filter is designed on the log mel spectrum, it is first converted into a filter on the mel spectrum using the exponential function, as in the following equation (15).

Next, the inverse DCT is applied to Ĝ_b^lin(i) to convert it into an impulse response. Because Ĝ_b^lin(i) is a Wiener filter on the mel frequency scale, a conventional inverse DCT cannot convert it into a time-domain parameter (impulse response). The MEL-warped IDCT (see reference [7]), which performs the transform while mapping mel frequencies to linear frequencies, is therefore used. This yields the following equation (16).

Reference [7]: ETSI ES 202 050 V1.1.3, "Speech Processing, Transmission and Quality Aspects (STQ); Distributed Speech Recognition; Advanced Front-end Feature Extraction Algorithm; Compression Algorithms," Nov. 2003.

In equation (16), ξ_{b,t} are the coefficients of the MEL-warped IDCT.

Finally, the estimated clean speech signal ŝ(t) is obtained by convolving the resulting impulse response with the noisy speech signal z(t), as in the following equation (17).
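The convolution step can be illustrated with a minimal sketch. Equation (17) is not reproduced in this text, so the code assumes a causal FIR convolution, ŝ(t) = Σ_τ g(τ) z(t − τ), truncated to the length of the input. The function name `apply_filter` is hypothetical, and the handling of frame boundaries when the impulse response changes per frame is omitted.

```python
def apply_filter(g, z):
    """Causal convolution of impulse response g with signal z.

    Returns a list with the same length as z; samples before t = 0
    are treated as zero.
    """
    out = []
    for t in range(len(z)):
        acc = 0.0
        for tau, g_tau in enumerate(g):
            if t - tau >= 0:
                acc += g_tau * z[t - tau]
        out.append(acc)
    return out
```

For example, a two-tap averaging response `[0.5, 0.5]` applied to `[2, 4, 6]` gives `[1.0, 3.0, 5.0]`.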

[4-3] Processing procedure of the single-channel noise reduction device 32
FIG. 6 shows the processing procedure of the single-channel noise reduction device 32.
First, mel filter bank analysis is performed on the noisy speech signal z(t) sent from the adaptive beamformer 31 to obtain the log mel spectrum Z_b(i) (step S1).

Next, it is determined whether i is 10 or more (step S2). If i is less than 10 (i < 10), the noise N̂_b is estimated based on equation (8) (step S3). Then i is incremented by 1 (step S4), and the process returns to step S1 to process the next frame.

If i is 10 or more in step S2 (i ≥ 10), a Wiener filter G_{b,k}(μ_{b,k}^S, N̂_b) is designed for each component distribution of the GMM based on equation (10) (step S5). Next, the weights P_{k,i} are computed and the Wiener filters G_{b,k}(μ_{b,k}^S, N̂_b) are averaged with these weights, based on equations (11) to (14) (step S6). The weighted-average Wiener filter Ĝ_b(i) is then converted into an impulse response g(t) based on equations (15) and (16) (step S7). Finally, the impulse response g(t) is convolved with the noisy speech z(t) based on equation (17) to obtain the estimated clean speech signal ŝ(t) (step S8). Then i is incremented by 1 (step S4), and the process returns to step S1 to process the next frame.
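The control flow of steps S1 to S8 can be sketched as below. This is only an outline under stated assumptions: frames are indexed from 0, the noise estimate of equation (8) (not reproduced in this text) is replaced by a running mean of the first frames, and the per-frame filtering of steps S5 to S8 is passed in as a hypothetical callable `enhance`.

```python
def run_noise_reduction(log_mel_frames, enhance, n_init=10):
    """Use the first n_init frames for noise estimation, enhance the rest.

    log_mel_frames: list of log mel vectors Z(i), one per frame.
    enhance: callable (z, noise) -> output, standing in for steps S5 to S8.
    Returns the final noise estimate and the list of enhanced outputs.
    """
    noise_sum = None
    noise = None
    outputs = []
    for i, z in enumerate(log_mel_frames):
        if i < n_init:
            # Step S3 (assumed form): running mean of the leading frames.
            if noise_sum is None:
                noise_sum = [0.0] * len(z)
            noise_sum = [a + b for a, b in zip(noise_sum, z)]
            noise = [a / (i + 1) for a in noise_sum]
        else:
            # Steps S5 to S8: design, average, convert and apply the filter.
            outputs.append(enhance(z, noise))
    return noise, outputs
```

The sketch makes one property of the procedure explicit: no enhanced output is produced for the leading frames, which are assumed to contain noise only.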

[5] System evaluation

Version 3.3 of the ATRASR large-vocabulary speech recognition system, developed by the present applicant, was used as the speech recognition unit 33. The feature vector consists of 12-dimensional MFCC, 12-dimensional ΔMFCC, and Δ log power, extracted from 20 ms frames with a 10 ms frame shift from data recorded at a 16 kHz sampling rate. Cepstral mean subtraction (CMS) was also applied. Gender-dependent Japanese acoustic models of clean speech were trained on 5 hours of speech from a travel-arrangement dialogue speech corpus created by the applicant and 25 hours of read speech of phonetically balanced sentences. Phoneme HMMs with 2086 states generated by the MDL-SSS algorithm (see reference [8]) were used.
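As a small worked example of the front-end figures above, the following sketch converts the 20 ms frame length and 10 ms shift at 16 kHz into sample counts and splits a signal into frames. The function names are hypothetical, and dropping the incomplete trailing frame is an assumption, not something the text specifies.

```python
# 12 MFCC + 12 delta MFCC + 1 delta log power, as listed in the text.
FEATURE_DIM = 12 + 12 + 1

def frame_params(sample_rate_hz=16000, frame_ms=20, shift_ms=10):
    """Return (frame length, frame shift) in samples."""
    frame_len = sample_rate_hz * frame_ms // 1000
    shift = sample_rate_hz * shift_ms // 1000
    return frame_len, shift

def split_frames(samples, frame_len, shift):
    """Split samples into overlapping frames, dropping any partial last frame."""
    return [samples[s:s + frame_len]
            for s in range(0, len(samples) - frame_len + 1, shift)]
```

With these settings each frame holds 320 samples and consecutive frames overlap by 160 samples, so one second of audio yields 99 full frames.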

Reference [8]: T. Jitsuhiro, T. Matsui, and S. Nakamura, "Automatic generation of non-uniform HMM topologies based on the MDL criterion," IEICE Trans. on Information and Systems, vol. E87-D, no. 8, pp. 2121-2129, August 2004.

This speech recognition system uses a multi-class composite bigram language model (see reference [9]) and a word trigram language model for rescoring. The language models are trained on the spontaneous speech database (SDB), the language database (LDB), and the spoken language database (SLDB), which total 6.3 million words (see reference [10]). The dictionary size is 55,000 words.

Reference [9]: H. Yamamoto, S. Isogai, and Y. Sagisaka, "Multi-class composite N-gram language model," Speech Communication, vol. 41, no. 2-3, pp. 369-379, October 2003.
Reference [10]: T. Takezawa, T. Morimoto, and Y. Sagisaka, "Speech and language databases for speech translation research in ATR," Proc. Int. Workshop on East-Asian Language Resources and Evaluation (EALREW), pp. 148-155, May 1998.

To test the noise reduction system, a small database was recorded in two acoustic environments, the ATR demonstration room and a cafeteria, using the PDA microphone array (microphone array unit 1 in FIG. 1) and, as a reference, a close-talking microphone. In each environment, two male and two female speakers read 102 utterances from Basic Travel Expression Corpus (BTEC) test set 01 (see reference [11]).

Reference [11]: T. Takezawa, E. Sumita, F. Sugaya, and H. Yamamoto, "Towards a broad-coverage bilingual corpus for speech translation of travel conversations in the real world," Proc. Int. Conference on Language Resources and Evaluation, vol. 1, pp. 147-152, May 2002.

All speakers were at the same position. The PDA 2 was placed on a reading stand. Speakers were allowed to move their heads while reading the script placed next to the PDA on the stand. The reverberation times of the demonstration room and the cafeteria are approximately T₆₀ = 250 ms and T₆₀ = 1 s, respectively. Because the distance between the speaker's head and the PDA is about 50 cm, the direct-to-reverberant ratio is high. In the demonstration room, there was noise from the fans of several personal computers and from the air conditioning. In the cafeteria, there was kitchen noise, babble, and noise from the air conditioning and refrigerators. The average SNR of each speaker in the demonstration room and in the cafeteria is shown in Table 1 and Table 2, respectively.

Table 1 shows the experimental results in the demonstration room, and Table 2 shows those in the cafeteria.

In Tables 1 and 2, "F1" and "F2" denote the female speakers, and "M1" and "M2" denote the male speakers. The overall average over all speakers was computed in the linear domain. The frequency band is 50 Hz to 8 kHz.
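Averaging in the linear domain, as opposed to averaging the dB values directly, can be sketched as follows. The sketch assumes the SNR is a power ratio expressed as 10·log10; the function name and the sample values are hypothetical and are not taken from Tables 1 and 2.

```python
import math

def mean_snr_db(snrs_db):
    """Average SNR values given in dB by averaging their linear power ratios."""
    linear = [10.0 ** (s / 10.0) for s in snrs_db]
    return 10.0 * math.log10(sum(linear) / len(linear))
```

For inputs of 10 dB and 20 dB this gives about 17.4 dB, whereas the arithmetic mean of the dB values would be 15 dB, so the linear-domain average is dominated by the higher-SNR speakers.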

In Tables 1 and 2, "Close-talk" denotes the average word accuracy [%] when the speech signal from the close-talking microphone is input directly to the speech recognition unit. "Baseline" denotes the word accuracy [%] averaged over all microphones when, for each microphone of the microphone array, the speech signal from that microphone is input directly to the speech recognition unit.

"MMSE" denotes the word accuracy [%] averaged over all microphones when, for each microphone of the microphone array, the speech signal from that microphone is input to the speech recognition unit via MMSE (the single-channel noise reduction device 32).

"RGSC" denotes the word accuracy [%] when the multi-channel speech signal from the microphone array is beamformed by the RGSC (the adaptive beamformer 31) and then recognized by the speech recognition unit.

"RGSC+MMSE" denotes the word accuracy [%] when the multi-channel speech signal from the microphone array is beamformed by the RGSC (the adaptive beamformer 31), its noise is then reduced by MMSE (the single-channel noise reduction device 32), and the result is recognized by the speech recognition unit. The SNR is averaged in the linear domain.

[5-2] Speech recognition performance

[5-2-1] Demonstration room
Table 1 shows the experimental results in the demonstration room. The average SNR over all speakers is 24.9 dB. The average word accuracy over all speakers for the close-talking microphone ("Close-talk") is 96.74%. Because the SNR is high and the reverberation time is relatively short, the average word accuracy of "Baseline" is 93.25%. "MMSE" and "RGSC" each show an improvement of 1% over "Baseline". With "RGSC+MMSE", the word accuracy is 95.57%, which is close to the performance of the close-talking microphone.

[5-2-2] Cafeteria
Table 2 shows the experimental results in the cafeteria. The average word accuracy of "Baseline" is 79.42%, which is markedly lower than in the demonstration room because the cafeteria has a higher noise level and a longer reverberation time. With "MMSE" or "RGSC", the word accuracy improves as it does in the demonstration room. With "RGSC+MMSE", it approaches the performance of the close-talking microphone.

[5-2-3] Variable SNR
To study the performance of the ASR system at varying SNR, cafeteria noise at varying levels was added to the speech data recorded in the cafeteria. The noise was recorded separately with the PDA microphone array placed on the reading stand at the same location as the speech recordings. The results are shown in Table 3. The values in parentheses are the standard deviations of the word accuracy over all microphones.

"MMSE" is slightly better than "RGSC". However, with "RGSC+MMSE" the RGSC and MMSE complement each other, and the improvement in word accuracy is larger than with either one alone ("MMSE" or "RGSC") or with "Baseline".

A perspective view showing a microphone array unit attached to a portable terminal.
A plan view showing the arrangement of the microphones in the microphone array unit 1.
A block diagram showing the configuration of the speech recognition system.
A block diagram showing the configuration of the adaptive beamformer 31.
A block diagram showing the configuration of the adaptive control unit 44.
A flowchart showing the processing procedure of the single-channel noise reduction device 32.

Explanation of symbols

1 Microphone array unit
2 Portable information terminal
3 Speech recognition device
31 Adaptive beamformer
32 Single-channel noise reduction device
33 Speech recognition unit
M1 to M8 Microphones

Claims

A speech enhancement apparatus comprising a microphone array having a plurality of microphones, an adaptive beamformer that generates, from the plurality of microphone signals obtained by the microphone array, a signal in which the target speech signal is emphasized, and a noise reduction device that suppresses noise in the output signal of the adaptive beamformer, wherein
a robust generalized sidelobe canceller, which includes a fixed beamformer, an adaptive blocking matrix, and an adaptive disturbance canceller, and in which the adaptive beamformer and the adaptive disturbance canceller are adaptively controlled according to the SNR of the input signal, is used as the adaptive beamformer, and
a single-channel noise reduction device that suppresses noise using a GMM-based Wiener filter is used as the noise reduction device.

The speech enhancement apparatus according to claim 1, wherein the noise reduction device comprises:
first means for obtaining a log mel spectrum corresponding to the input speech signal by performing mel filter bank analysis, frame by frame, on the input speech signal sent from the adaptive beamformer;
second means for determining whether the frame number of the log mel spectrum obtained by the first means is equal to or greater than a predetermined value;
third means for, when the frame number of the log mel spectrum obtained by the first means is less than the predetermined value, estimating a log mel spectrum corresponding to the noise based on the log mel spectrum obtained by the first means, and then proceeding to the processing of the next frame by the first means;
fourth means for, when the frame number of the log mel spectrum obtained by the first means is equal to or greater than the predetermined value, designing a Wiener filter for each component distribution of the GMM using the GMM and the log mel spectrum corresponding to the noise obtained by the third means, and then computing a weighted average of the resulting Wiener filters; and
fifth means for converting the weighted-average Wiener filter obtained by the fourth means into an impulse response, obtaining an estimated clean speech signal by convolving the impulse response with the input speech signal, and then proceeding to the processing of the next frame by the first means.