JP2016143042A

JP2016143042A - Noise removal system and noise removal program

Info

Publication number: JP2016143042A
Application number: JP2015021452A
Authority: JP
Inventors: 章子荒木; Akiko Araki; 智広中谷; Tomohiro Nakatani; マークデルクロア; Marc Delcroix; 雅清藤本; Masakiyo Fujimoto
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2015-02-05
Filing date: 2015-02-05
Publication date: 2016-08-08
Anticipated expiration: 2035-02-05
Also published as: JP6348427B2

Abstract

PROBLEM TO BE SOLVED: To improve performance of noise removal.
SOLUTION: A noise removal system 10 performs frequency analysis on a first audio signal, which is observed by a first sound collection unit, and a second audio signal observed by a second sound collection unit disposed at a position different from the position of the first sound collection unit, and calculates a multi-channel feature quantity concerning the first audio signal and the second audio signal. The noise removal system 10 learns a parameter of a DAE that inputs a log spectrum of the first audio signal for learning, and a multi-channel feature quantity for learning concerning the first audio signal for learning and the second audio signal for learning, and that outputs a log spectrum of original sounds for learning. The noise removal system 10 inputs a log spectrum of the first audio signal that is an object of noise removal, and a multi-channel feature quantity for noise removal concerning the first audio signal that is an object of noise removal and the second audio signal that is an object of noise removal, and uses the learned parameter to remove a noise component from the log spectrum of the first audio signal that is the object of noise removal.
SELECTED DRAWING: Figure 1

Description

The present invention relates to a noise removal device and a noise removal program.

Techniques exist for removing noise from an audio signal corresponding to sound collected by a sound collection device such as a microphone. One such technique is the DAE (Denoising AutoEncoder). A DAE is a kind of DNN (Deep Neural Network).

FIG. 7A outlines the DAE learning process of the prior art. As shown in FIG. 7A, in the prior-art DAE, the DAE learning unit performs frequency analysis on an observation signal for learning obtained under noise by one sound collection device (one channel), and takes the vector $y$, which represents the log spectrum of the amplitude of that signal over the entire frequency range, as the input vector $z_{in}$. The output $h^l(z_{in})$ of each of the $L$ hidden layers $h^l$ ($l = 1, \ldots, L$) is expressed as $h^l(z_{in}) = \sigma_\theta(W^l h^{(l-1)}(z_{in}) + b^l)$, using a predetermined nonlinear function $\sigma_\theta$ with parameters $\theta = \{W, b\}$ consisting of the weight matrix $W$ and the bias vector $b$. The output of the hidden layer $h^1$ is $h^1(z_{in}) = \sigma_\theta(W^1 z_{in} + b^1)$.

The prior-art DAE then learns the parameters $\theta$ so that the output vector $z_{out}$, obtained by successively removing noise from the input vector $z_{in}$ through $h^l(z_{in})$ for $l = 1, \ldots, L$, matches the original speech for learning, that is, so that $z_{out} = f_\theta(z_{in}) = W^L h^{(L-1)}(z_{in}) + b^L$.
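The layer recursion described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the patent's implementation: the names `sigmoid` and `dae_forward`, and the choice of a sigmoid for the nonlinear function, are hypothetical choices made for the example.

```python
import numpy as np

def sigmoid(a):
    # Assumed choice for the nonlinear function sigma_theta.
    return 1.0 / (1.0 + np.exp(-a))

def dae_forward(z_in, weights, biases):
    """Forward pass through L layers (hypothetical helper).

    weights[l], biases[l] hold W^(l+1), b^(l+1). Layers 1..L-1 apply the
    nonlinearity; layer L is linear, z_out = W^L h^(L-1) + b^L.
    """
    h = z_in
    for W, b in zip(weights[:-1], biases[:-1]):
        h = sigmoid(W @ h + b)
    return weights[-1] @ h + biases[-1]
```

Keeping the last layer linear lets the output take any real value, which suits log-spectrum targets that are not confined to the range of a sigmoid.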

FIG. 7B outlines the noise removal process of the prior art. In the prior-art DAE, the DAE decoding unit performs frequency analysis on an observation signal obtained by one sound collection device (one channel), and takes the vector $y'$, which represents the log spectrum of the amplitude of the observation signal over the entire frequency range, as the input vector $z_{in}'$. The prior-art DAE then copies the parameters $\theta$ learned as shown in FIG. 7A from the DAE learning unit to the DAE decoding unit, and uses them in each of the $L$ hidden layers $h^l$ ($l = 1, \ldots, L$) to obtain the output vector $z_{out}'$, in which noise has been removed from the input vector $z_{in}'$.

P. Vincent, H. Larochelle, Y. Bengio, and P. A. Manzagol, “Extracting and composing robust features with denoising autoencoders”, in Proc. of ICML2008, 2008, pp. 1096-1103.
Y. Liu, P. Zhang, and T. Hain, “Using neural network front-ends on far field multiple microphones based speech recognitions”, Proc. of ICASSP2014, 2014, pp. 5579-5583.
S. Renals and P. Swietojanski, “Neural networks for distant speech recognition”, in Proc. of HSCMA2014, 2014.
T. Nakatani, S. Araki, T. Yoshioka, M. Delcroix, and M. Fujimoto, “Dominance based integration of spatial and spectral features for speech enhancement”, IEEE Trans. Audio, Speech and Language Processing, vol. 21, no. 12, 2013, pp. 2516-2531.
M. Fujimoto, S. Watanabe, and T. Nakatani, “A robust estimation method of noise mixture model for noise suppression”, Proc. of Interspeech2011, 2011, pp. 697-700.
M. Delcroix, Y. Kubo, T. Nakatani, and A. Nakamura, “Is speech enhancement pre-processing still relevant when using deep neural networks for acoustic modeling?”, Proc. of Interspeech2013, 2013, pp. 2992-2996.

However, in the above prior art, the DAE is trained with an observation signal for learning obtained by a single sound collection device (one channel) as its input, and merely removes noise from an observation signal obtained by a single sound collection device. Even when observation signals from multiple sound collection devices (multiple channels) are available, the prior art merely uses a channel concatenation in which the signals of the channels are simply joined, or feeds the DAE a speech signal enhanced by beamforming, and its noise removal performance is low.

An example of the embodiments disclosed in the present application has been made in view of the above, and aims to improve noise removal performance.

A noise removal device according to an example of the embodiments disclosed in the present application includes a frequency analysis unit, a logarithm calculation unit, a multichannel feature calculation unit, a DAE (Denoising AutoEncoder) learning unit, and a DAE decoding unit. The frequency analysis unit performs frequency analysis on a first audio signal observed by a first sound collection device and on a second audio signal observed by a second sound collection device placed at a position different from that of the first sound collection device. The logarithm calculation unit calculates the log spectrum, that is, the logarithm of the amplitude, of the first audio signal analyzed by the frequency analysis unit. The multichannel feature calculation unit calculates, from the frequency-analyzed first and second audio signals, a multichannel feature relating to the first and second audio signals. The DAE learning unit learns the parameters of a DAE whose input is the log spectrum of the first audio signal for learning, calculated by the logarithm calculation unit, together with the multichannel feature for learning relating to the first and second audio signals for learning, calculated by the multichannel feature calculation unit, and whose output is the log spectrum of the original speech for learning. The DAE decoding unit takes as input the log spectrum of the first audio signal to be denoised, calculated by the logarithm calculation unit, together with the multichannel feature for noise removal relating to the first and second audio signals to be denoised, calculated by the multichannel feature calculation unit. Using the parameters learned by the DAE learning unit, it outputs the log spectrum of denoised speech, in which the noise component has been removed from the log spectrum of the first audio signal to be denoised.

According to an example of the embodiments disclosed in the present application, noise removal performance can be improved, for example.

FIG. 1 is a diagram illustrating an example of the configuration of the noise removal system.
FIG. 2 is a diagram illustrating an example of the configuration of the signal reconstruction device.
FIG. 3 is a flowchart illustrating an example of the DAE learning process.
FIG. 4 is a flowchart illustrating an example of the noise removal process.
FIG. 5A is a diagram illustrating an example of the effect of the embodiment.
FIG. 5B is a diagram illustrating an example of the effect of the embodiment.
FIG. 5C is a diagram illustrating an example of the effect of the embodiment.
FIG. 6 is a diagram illustrating an example of a computer that realizes the noise removal device and the signal reconstruction device by executing a program.
FIG. 7A is a diagram outlining the DAE learning process of the prior art.
FIG. 7B is a diagram outlining the noise removal process of the prior art.

[Embodiment]
Embodiments of the noise removal device and the noise removal program disclosed in the present application are described below. The following embodiments are merely examples and do not limit the technology disclosed in the present application. The embodiments shown below and other embodiments may be combined as appropriate to the extent that they do not contradict each other.

In the following embodiments, for a vector or scalar $A$, the notation "^A" denotes $A$ with a circumflex placed above it ($\hat{A}$), as defined by equation (1), and the notation "￣A" denotes $A$ with an overbar ($\bar{A}$), as defined by equation (2).

In the following embodiments, the sound collection device is a microphone. The log spectrum of the observation signal observed by the $m$-th microphone ($1 \le m \le M$) among $M$ microphones ($M$ is a natural number) placed at mutually different positions is written $Y_{t,f}^{(m)}$, where $t$ is the time index and $f$ is the mel-scale frequency index. The log spectra of the observation signals from the $M$ microphones are written as the vector $Y_{t,f} = [Y_{t,f}^{(1)}, \ldots, Y_{t,f}^{(M)}]$. Letting $X_{t,f}^{(m)}$ be the original speech picked up by the $m$-th microphone and $N_{t,f}^{(m)}$ the noise, the observation vector $Y_{t,f}$ is given by equation (3) below in terms of the original-speech vector $X_{t,f} = [X_{t,f}^{(1)}, \ldots, X_{t,f}^{(M)}]$ and the noise vector $N_{t,f} = [N_{t,f}^{(1)}, \ldots, N_{t,f}^{(M)}]$.

From the observation vector $Y_{t,f}$, the embodiment obtains the estimates $\hat{X}_{t,f}^{(m)}$ of the log spectrum of the denoised speech, and then resynthesizes the speech waveform to obtain the denoised speech. The embodiment describes the case of $M = 2$ microphones, but the case of $M \ge 3$ can be treated in the same way.

For simplicity, the embodiment writes the vector that collects the log spectrum of the first microphone's observation signal at all frequencies as $y_t = [y_{t,1}, \ldots, y_{t,f}, \ldots, y_{t,F}]$, where $y_{t,f} = Y_{t,f}^{(1)}$. It also writes the log spectrum of the original speech at the first microphone over all frequencies as $x_t$, and the log spectrum of the denoised speech over all frequencies as $\hat{x}_t$.

(Configuration of the noise removal system)
FIG. 1 is a diagram illustrating an example of the configuration of the noise removal system. The noise removal system 100 includes microphones 1 and 2, a noise removal device 10, and a signal reconstruction device 20. Microphone 1 is the first microphone and microphone 2 is the second microphone.

The noise removal device 10 includes a frequency analysis unit 11, a logarithm calculation unit 12, a multichannel feature calculation unit 13, a DAE (Denoising AutoEncoder) learning unit 14, and a DAE decoding unit 15.

The frequency analysis unit 11 performs frequency analysis, for example mel-frequency analysis, on the observation signals observed by the two microphones 1 and 2, and converts them into frequency-domain signals. The observation signals comprise observation signals for learning by the DAE learning unit 14, described later, and observation signals to be denoised using the learning result. The logarithm calculation unit 12 takes the logarithm of the amplitude of each observation signal converted into a frequency-domain signal and obtains the log-spectrum vector $Y_{t,f} = [Y_{t,f}^{(1)}, Y_{t,f}^{(2)}]$. For the observation signal $Y_{t,f}^{(1)}$ of the first microphone, $y_{t,f} = Y_{t,f}^{(1)}$. The logarithm calculation unit 12 then passes the vector $y_t = [y_{t,1}, \ldots, y_{t,f}, \ldots, y_{t,F}]$, which collects the log spectrum of the first microphone's observation signal at all frequencies, to the DAE learning unit 14 if it is an observation signal for learning, and to the DAE decoding unit 15 if it is an observation signal to be denoised.
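As a rough illustration of the frequency analysis and logarithm steps, the sketch below computes a log mel spectrum for one channel with NumPy. Everything specific here is an assumption made for the example rather than something stated in the patent: the Hann window, the frame length and hop, the 24 triangular mel filters, the mel-scale formula, and the function names `stft`, `mel_filterbank`, and `log_mel_spectrum`.

```python
import numpy as np

def stft(x, n_fft=512, hop=256):
    # Short-time Fourier transform with a Hann window (assumed settings).
    win = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop:i * hop + n_fft] * win for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)          # shape: (frames, n_fft // 2 + 1)

def mel_filterbank(n_mels, n_fft, sr):
    # Triangular filters equally spaced on a commonly used mel scale.
    hz2mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel2hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    edges = mel2hz(np.linspace(0.0, hz2mel(sr / 2), n_mels + 2))
    freqs = np.linspace(0.0, sr / 2, n_fft // 2 + 1)
    fb = np.zeros((n_mels, len(freqs)))
    for m in range(n_mels):
        lo, ctr, hi = edges[m], edges[m + 1], edges[m + 2]
        rising = (freqs - lo) / (ctr - lo)
        falling = (hi - freqs) / (hi - ctr)
        fb[m] = np.maximum(0.0, np.minimum(rising, falling))
    return fb

def log_mel_spectrum(x, sr=16000, n_fft=512, hop=256, n_mels=24, eps=1e-10):
    # Log of the mel-band amplitude; eps avoids log(0) in silent bands.
    amp = np.abs(stft(x, n_fft, hop))
    return np.log(amp @ mel_filterbank(n_mels, n_fft, sr).T + eps)
```

Applying `log_mel_spectrum` to the first microphone's waveform gives one row per frame, playing the role of $y_t$ in the text.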

The multichannel feature calculation unit 13 performs frequency analysis on the observation signals of microphones 1 and 2 and uses the result to calculate the vector $r_t = [r_{t,1}, \ldots, r_{t,F}]$ representing the multichannel feature. It inputs the calculated vector $r_t$ to the DAE learning unit 14 and the DAE decoding unit 15. Details of the multichannel feature calculation unit 13 are described later.

The DAE learning unit 14 is a DNN (Deep Neural Network) and learns $l$ hidden layers ($l$ is a natural number) such that the original speech is obtained as the output vector $z_{out}$ for the input vector $z_{in}$, which is observed speech for learning recorded under noise. The embodiment illustrates the case $l = 2$. The DAE learning unit 14 has an input unit 14a, a hidden layer 14b, a hidden layer 14c, and an output unit 14d; that is, it contains two hidden layers, 14b and 14c. The number of hidden layers $l = 2$ is merely an example.

The input unit 14a concatenates the log-spectrum vectors $y_t = [y_{t,1}, \ldots, y_{t,f}, \ldots, y_{t,F}]$ of the observed speech for learning from microphone 1, input from the logarithm calculation unit 12, using a context window consisting of the $T$ frames before and after time $t$.

The input unit 14a further concatenates the vectors $r_t$ input from the multichannel feature calculation unit 13, using a context window consisting of the $T$ frames before and after time $t$, and obtains the input vector $z_{in}$ as in equation (4) below.
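The context-window concatenation can be sketched as follows. Equation (4) is not reproduced in this text, so the ordering used here (all $y$ frames from $t-T$ to $t+T$, followed by all $r$ frames) is an assumption, as are the edge handling by frame repetition and the helper names `stack_context` and `build_z_in`.

```python
import numpy as np

def stack_context(feats, T):
    """Concatenate frames t-T..t+T for every t (hypothetical helper).

    feats has shape (n_frames, F); the result has shape (n_frames, (2T+1)*F).
    Frames beyond the signal edges are filled by repeating the edge frame,
    which is an assumption of this sketch.
    """
    padded = np.pad(feats, ((T, T), (0, 0)), mode="edge")
    n = feats.shape[0]
    return np.concatenate([padded[k:k + n] for k in range(2 * T + 1)], axis=1)

def build_z_in(y, r, T):
    # Assumed layout: context-stacked log spectra, then context-stacked features.
    return np.concatenate([stack_context(y, T), stack_context(r, T)], axis=1)
```

With $F$ mel bands for $y_t$ and $F_r$ feature dimensions for $r_t$, each row of the result has $(2T+1)(F + F_r)$ entries.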

The output unit 14d concatenates the original speech for learning using a context window consisting of the $T$ frames before and after time $t$, and obtains the output vector $z_{out}$ as in equation (5) below.

The hidden layers 14b and 14c are obtained as the $l$-th hidden layers ($l = 1, 2$) by training the DAE with the input vector $z_{in}$ as the input of the DAE learning unit 14 and the output vector $z_{out}$ as its output. That is, the DAE learning unit 14 learns the parameters $\theta = \{W, b\}$ (where $W$ is the weight matrix and $b$ is the bias vector) so that the outputs of the hidden layers 14b and 14c are those given by equation (6) below. In equation (6), $\sigma_\theta(\cdot)$ is a nonlinear function, for example a sigmoid function. The output of the first layer (hidden layer 14b) is calculated as $h^1(z_{in}) = \sigma_\theta(W^1 z_{in} + b^1)$, and the output of the final layer (the second layer, hidden layer 14c) is calculated with a linear function as the vector $z_{out} = f_\theta = W^L h^{(L-1)} + b^L$. The parameters $\theta = \{W, b\}$ are learned by an existing method, for example stochastic gradient descent. The DAE learning unit 14 copies the learned parameters $\theta = \{W, b\}$ to the DAE decoding unit 15.
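A minimal sketch of learning such parameters with stochastic gradient descent is given below for the two-layer case (one sigmoid layer followed by the linear output layer). The text names only the optimizer, so the squared-error loss, the learning rate, the hidden size, the initialization, and the names `train_dae` and `decode` are all assumptions of this example.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def train_dae(Z_in, Z_out, n_hidden=32, lr=0.05, epochs=50, seed=0):
    """Per-sample SGD on 0.5 * ||z_out_pred - z_out||^2 (assumed loss)."""
    rng = np.random.default_rng(seed)
    d_in, d_out = Z_in.shape[1], Z_out.shape[1]
    W1 = rng.standard_normal((n_hidden, d_in)) * 0.1
    b1 = np.zeros(n_hidden)
    W2 = rng.standard_normal((d_out, n_hidden)) * 0.1
    b2 = np.zeros(d_out)
    for _ in range(epochs):
        for i in rng.permutation(len(Z_in)):
            z, target = Z_in[i], Z_out[i]
            h = sigmoid(W1 @ z + b1)              # first layer output h^1
            err = (W2 @ h + b2) - target          # gradient w.r.t. the linear output
            grad_h = (W2.T @ err) * h * (1.0 - h)  # backpropagate through the sigmoid
            W2 -= lr * np.outer(err, h)
            b2 -= lr * err
            W1 -= lr * np.outer(grad_h, z)
            b1 -= lr * grad_h
    return W1, b1, W2, b2

def decode(z, params):
    # Decoding reuses the learned parameters unchanged, as the text describes.
    W1, b1, W2, b2 = params
    return W2 @ sigmoid(W1 @ z + b1) + b2
```

Returning the parameters as a tuple mirrors the step in which the learning unit copies $\theta$ to the decoding unit: `decode` needs nothing else from training.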

The DAE decoding unit 15 includes an input unit 15a, a hidden layer 15b, a hidden layer 15c, and an output unit 15d. The input unit 15a corresponds to the input unit 14a of the DAE learning unit 14, the hidden layer 15b to the hidden layer 14b, the hidden layer 15c to the hidden layer 14c, and the output unit 15d to the output unit 14d.

The input unit 15a concatenates the log-spectrum vectors $y_t = [y_{t,1}, \ldots, y_{t,f}, \ldots, y_{t,F}]$ of the observed speech to be denoised, observed by microphones 1 and 2 and input from the logarithm calculation unit 12, and the vectors $r_t = [r_{t,1}, \ldots, r_{t,F}]$ input from the multichannel feature calculation unit 13, using a context window consisting of the $T$ frames before and after time $t$, and obtains the vector $z_{in}$ as in equation (4) above. The input unit 15a inputs the obtained vector $z_{in}$ to the hidden layer 15b.

Using the parameters $\theta = \{W, b\}$ learned by the DAE learning unit 14, the hidden layers 15b and 15c perform the processing of equations (7) and (8) below to obtain the output vector $z_{out} = [x_{t-T}, \ldots, x_t, \ldots, x_{t+T}]$. The parameters $\theta = \{W, b\}$ in equation (7) are those learned by equation (6) above. In equation (7), as in equation (6), the output of the first layer (hidden layer 15b) is calculated as $h^1(z_{in}) = \sigma_\theta(W^1 z_{in} + b^1)$, and the output of the final layer (the second layer, hidden layer 15c) is calculated with a linear function as the vector $z_{out} = f_\theta = W^L h^{(L-1)} + b^L$. The output unit 15d outputs the vector $z_{out} = [x_{t-T}, \ldots, x_t, \ldots, x_{t+T}]$ as the estimated vector $\hat{x}$ of the log spectrum of the denoised speech, for example to the signal reconstruction device 20.

(Details of the multichannel feature calculation unit 13)
The multichannel feature calculation unit 13 calculates the vector $r_t = [r_{t,1}, \ldots, r_{t,F}]$ representing the multichannel feature. Examples of the vector $r_t$ include the following.

(1) Interaural level difference (ILD)
The observation signals of microphones 1 and 2 are treated as the sounds perceived by a person's two ears, and the ratio of the log spectra of the two observation signals is taken. The vector $r_t$ is obtained by equation (9) below. For details, see Y. Liu, P. Zhang, and T. Hain, “Using neural network front-ends on far field multiple microphones based speech recognitions”, Proc. of ICASSP2014, 2014, pp. 5579-5583.

(2) Interaural phase difference (IPD)
It is well known that the phase difference between observation signals at multiple microphones is a feature that reflects the positional relationship between the sound source and the microphones. The phase difference between the observation signals at microphones 1 and 2 at the center frequency $f'(f)$ of the $f$-th mel filter bank is therefore obtained as the vector $r_t$ by equation (10) below. Here $\bar{Y}_{t,f'(f)}$ is the short-time Fourier transform coefficient of the observation signal at the center frequency $f'(f)$ of the $f$-th mel filter bank, corresponding to the log-spectrum vector $Y_{t,f}$. $\angle\bar{Y}_{t,f'(f)}^{(1)}$ is the phase of the short-time Fourier transform coefficient of microphone 1's observation signal at the center frequency $f'(f)$ of the $f$-th mel filter bank, and $\angle\bar{Y}_{t,f'(f)}^{(2)}$ is the corresponding phase for microphone 2.

Alternatively, instead of equation (10), the cosine $\phi_{f'(f)}$ of the phase difference between the short-time Fourier transform coefficients of the observation signals at microphones 1 and 2 at the center frequency $f'(f)$ of the $f$-th mel filter bank, obtained by equation (11) below, may be used as the vector $r_t$.
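The level and phase features in (1) and (2) can be illustrated as below. Equations (9) to (11) are not reproduced in this text, so the forms used here are assumptions: the ILD is taken as the difference of log amplitudes (the log of the amplitude ratio), which may differ from the patent's exact definition, and the IPD as the plain difference of the two phases. The function names are hypothetical.

```python
import numpy as np

def ild(Y1, Y2, eps=1e-10):
    # Assumed form: log|Y1| - log|Y2|, i.e. the log of the amplitude ratio.
    return np.log(np.abs(Y1) + eps) - np.log(np.abs(Y2) + eps)

def ipd(Y1, Y2):
    # Phase of channel 1 minus phase of channel 2 at each time-frequency point.
    return np.angle(Y1) - np.angle(Y2)

def cos_ipd(Y1, Y2):
    # Cosine of the phase difference; unaffected by 2*pi wrapping of the phases.
    return np.cos(ipd(Y1, Y2))
```

Here `Y1` and `Y2` stand for complex short-time Fourier transform coefficients of the two channels sampled at the mel center frequencies. The cosine variant is bounded in $[-1, 1]$, which may be convenient as a network input.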

(3) Time-frequency mask (MASK)
When multichannel information is available, a feature that reflects the positional relationship between the sound source and the microphones (for example, the inter-channel phase difference IPD) can be calculated, and clustering that feature yields a time-frequency mask that performs speech enhancement. For example, the time-frequency mask can be calculated by clustering the IPD vectors $r_t$ obtained at each time-frequency point by equation (10) or (11) above. For details, see T. Nakatani, S. Araki, T. Yoshioka, M. Delcroix, and M. Fujimoto, “Dominance based integration of spatial and spectral features for speech enhancement”, IEEE Trans. Audio, Speech and Language Processing, vol. 21, no. 12, pp. 2516-2531, 2013.

Specifically, the time-frequency mask $M_{t,f}$ in the mel-frequency domain can be calculated, for example, by converting a time-frequency mask obtained in the short-time frequency domain into the mel-frequency domain. As the multichannel feature, the mel-domain time-frequency mask $M_{t,f}$ is used, as in equation (12) below.

Alternatively, the logarithm of the time-frequency mask in the mel-frequency domain may be used, as in equation (13) below.

In equation (13) above, $M$ tends to take values near 0 and 1 and rarely takes values in between. Equation (13) may therefore be converted, as in equation (14) below, into data with a unimodal distribution whose peak lies between 1 and 0.

(4) Speech enhanced with the time-frequency mask (ENHANCE)
Using the logarithm of the mel-domain time-frequency mask as in equation (13) above gives the enhanced speech $\hat{x}_{t,f} = \log(M_{t,f} \cdot \exp(y_{t,f})) = \log(M_{t,f}) + y_{t,f}$. The multichannel feature for speech enhanced with the time-frequency mask (ENHANCE) is therefore obtained by equation (15) below.
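The identity $\log(M_{t,f} \cdot \exp(y_{t,f})) = \log(M_{t,f}) + y_{t,f}$ means the enhanced log spectrum can be computed without leaving the log domain. The short sketch below shows this; the function name `enhance_feature` and the small floor applied to the mask (to avoid $\log 0$) are assumptions added for the example.

```python
import numpy as np

def enhance_feature(mask, y, floor=1e-6):
    """Enhanced log spectrum log(M) + y (hypothetical helper).

    mask: mel-domain time-frequency mask with values in [0, 1].
    y:    log mel spectrum of the first microphone, same shape as mask.
    The floor on the mask is an assumption that keeps log(M) finite.
    """
    m = np.clip(mask, floor, 1.0)
    return np.log(m) + y
```

For mask values above the floor, the result equals `np.log(mask * np.exp(y))` up to rounding, but it avoids exponentiating and re-taking the logarithm of the spectrum.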

(5) Noise calculated with the time-frequency mask (NOISE)
Instead of the speech enhanced with the time-frequency mask, the noise signal estimated by the mask may be used as the multichannel feature, as in equation (16) below.

(6) Other multichannel features
For example, instead of the IPD above, the time difference of arrival (TDOA) of the observation signals at multiple microphones at the center frequency $f'(f)$ of the $f$-th mel filter bank may be used. This can be calculated, for example, by the well-known GCC-PHAT (Generalized Cross-Correlation PHAse Transform) method. As shown in equation (17) below, it is a feature that does not depend on frequency. In equation (17), $\bar{Y}_{t,f'(f)}^{(1)}$ is the short-time Fourier transform coefficient of the signal observed by microphone 1 at the center frequency $f'(f)$ of the $f$-th mel filter bank, and $\bar{Y}_{t,f'(f)}^{(2)*}$ is the complex conjugate of the short-time Fourier transform coefficient of the signal observed by microphone 2 at the same center frequency. $|\cdot|$ denotes the norm of its argument, and $j$ is the imaginary unit.
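A common textbook form of GCC-PHAT is sketched below for two time-domain frames. It is not necessarily identical to equation (17), which is not reproduced in this text: the zero-padded FFT length, the small constant in the normalization, the optional `max_tau` search limit, and the sign convention (a positive result means the first signal lags the second) are assumptions of this example.

```python
import numpy as np

def gcc_phat_tdoa(x1, x2, sr, max_tau=None):
    """Estimate the delay of x1 relative to x2 in seconds (hypothetical helper)."""
    n = len(x1) + len(x2)                      # zero-pad to avoid circular wrap-around
    X1 = np.fft.rfft(x1, n=n)
    X2 = np.fft.rfft(x2, n=n)
    cross = X1 * np.conj(X2)                   # cross-power spectrum
    cross /= np.abs(cross) + 1e-12             # PHAT weighting: keep phase only
    cc = np.fft.irfft(cross, n=n)              # generalized cross-correlation
    max_shift = n // 2
    if max_tau is not None:
        max_shift = min(int(sr * max_tau), max_shift)
    # Reorder so that index 0 corresponds to lag -max_shift.
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    shift = int(np.argmax(np.abs(cc))) - max_shift
    return shift / sr
```

Because the PHAT weighting discards magnitude, the correlation peak stays sharp even when the source spectrum is strongly colored, which is one reason the method is widely used for delay estimation.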

Alternatively, a histogram of the TDOA values calculated over a fixed period (for example, 5 seconds or one utterance) may be taken and used as the multi-channel feature vector r_t. In this case, t denotes the index of a histogram bin rather than time, and the input vector z_in need not be concatenated over a context window. A vector formed by arbitrarily selecting and arranging the multi-channel feature vectors listed above may also be used as the multi-channel feature vector.
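The histogram variant can be sketched as follows. The bin count, the delay range, and the normalization to relative frequencies are assumptions chosen for illustration; the specification does not fix them.

```python
import numpy as np

def tdoa_histogram(tdoas, max_delay, n_bins=21):
    """Histogram of TDOA values (seconds) over one period, e.g. one utterance.

    Returns the normalized bin heights and the bin edges. max_delay and n_bins
    are hypothetical parameters, not values taken from the specification.
    """
    counts, edges = np.histogram(tdoas, bins=n_bins, range=(-max_delay, max_delay))
    total = counts.sum()
    hist = counts / total if total > 0 else counts.astype(float)
    return hist, edges

# Toy example: four per-frame TDOA estimates collected over one utterance.
hist, edges = tdoa_histogram([0.0, 0.0, 1e-4, -1e-4], max_delay=2e-4, n_bins=5)
```

The returned vector has one entry per bin, which matches the statement that the index t then refers to a histogram bin rather than to a time frame.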

When the number of microphones M is 3 or more, the quantities in equations (9), (10), (11), and (17) above may be calculated in the same way with the superscript (2) replaced by (3), (4), and so on, and the resulting feature quantities, arranged in the order of the superscripts (that is, in microphone order), may be used as the multi-channel feature vector r_t.

(Configuration of the signal reconstruction device)
FIG. 2 is a diagram illustrating an example of the configuration of the signal reconstruction device. The signal reconstruction device 20 includes a noise removal filter calculation unit 21, a frequency domain conversion unit 22, a noise-removed speech calculation unit 23, and an inverse Fourier transform unit 24. From the logarithmic spectrum ^x_t, over all frequencies, of the noise-removed speech input from the noise removal device 10, the noise removal filter calculation unit 21 calculates a noise removal filter (Wiener filter) W_{t,f} in the mel frequency domain according to equation (18) below.

Next, the frequency domain conversion unit 22 converts the noise removal filter W_{t,f} calculated by the noise removal filter calculation unit 21 into a noise removal filter ￣W_{t,f'} in the linear frequency domain, where f' is a linear frequency. The details follow M. Fujimoto, S. Watanabe, and T. Nakatani, "A robust estimation method of noise mixture model for noise suppression", Proc. of Interspeech 2011, 2011, pp. 697-700.

Next, the noise-removed speech calculation unit 23 calculates the noise-removed speech in the linear frequency domain as ￣x_{t,f'} = ￣W_{t,f'} · ￣y_{t,f'}. The inverse Fourier transform unit 24 applies the short-time inverse Fourier transform to the ￣x_{t,f'} calculated by the noise-removed speech calculation unit 23 and returns it to a time waveform, which yields the final waveform of the noise-removed speech. Here, ￣y_{t,f'} is the linear frequency domain representation of y_{t,f} and is obtained by the short-time Fourier transform of the observed signal.

(DAE learning process)
FIG. 3 is a flowchart illustrating an example of the DAE learning process. The DAE learning process learns the hidden layer parameters θ = {W, b} and is executed by the noise removal device 10. In the DAE learning process shown in FIG. 3, the input signal of microphone 1 is the audio signal to be denoised, and the input signals of microphones 1 and 2 are the audio signals from which the multi-channel feature quantity r_t is calculated.

First, the frequency analysis unit 11 of the DAE learning unit 14 performs frequency analysis on the input signal of microphone 1 (and on the original speech for learning), for example in frames of 100 ms (step S11). Next, the output unit 14d concatenates the original speech for learning, frequency-analyzed in step S11, over the preceding and following T frames using a context window (step S12). Next, the logarithm calculation unit 12 calculates the logarithmic spectrum of the input signal frequency-analyzed in step S11 (step S13). Next, the multi-channel feature calculation unit 13 calculates the multi-channel feature quantity r_t from the input signals of microphones 1 and 2 (step S14).

Next, the input unit 14a of the DAE learning unit 14 takes the vector formed by arranging the logarithmic spectrum calculated in step S13 and the multi-channel feature quantity r_t calculated in step S14, and concatenates it over the preceding and following T frames using a context window (step S15). Next, the hidden layer 14b computes its output from the vector concatenated in step S15 and learns the weight vector W^1 and the bias vector b^1 (step S16). Next, the hidden layer 14c computes its output from the output of the hidden layer 14b in step S16 and learns the weight vector W^2 and the bias vector b^2 (step S17). The output of the hidden layer 14c in step S17 matches the original speech for learning. The output unit 14d then outputs the output of the hidden layer 14c from step S17 as the final output (step S18). The DAE learning unit 14 copies the hidden layer parameters θ = {W, b} learned by the DAE learning process to the DAE decoding unit 15.
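The context-window concatenation of step S15 can be sketched as below. The function name `stack_context` and the choice to pad the first and last frames by repeating the edge frame are assumptions; the specification does not say how utterance boundaries are handled.

```python
import numpy as np

def stack_context(y, r, T):
    """Concatenate [y_t; r_t] over frames t-T .. t+T for every frame t.

    y : (frames, Dy) log spectra of microphone 1
    r : (frames, Dr) multi-channel feature quantities
    T : number of context frames on each side
    Returns an array of shape (frames, (2T + 1) * (Dy + Dr)).
    """
    feats = np.concatenate([y, r], axis=1)                  # (frames, Dy + Dr)
    padded = np.pad(feats, ((T, T), (0, 0)), mode="edge")   # assumed boundary handling
    n = feats.shape[0]
    return np.stack([padded[t:t + 2 * T + 1].reshape(-1) for t in range(n)])

# Toy example: 20 frames, 40-dimensional y and r, T = 5.
y = np.random.default_rng(0).standard_normal((20, 40))
r = np.random.default_rng(1).standard_normal((20, 40))
z_in = stack_context(y, r, T=5)   # shape (20, 880)
```

The same routine would apply unchanged to step S24 of the noise removal process, since both steps build the DAE input in the same way.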

(Noise removal process)
FIG. 4 is a flowchart illustrating an example of the noise removal process. The noise removal process removes noise from the microphone input signal using the hidden layer parameters θ = {W, b} learned by the DAE learning process shown in FIG. 3, and is executed by the noise removal device 10. When the DAE learning process shown in FIG. 3 treats the input signal of microphone 1 as the audio signal to be denoised and the input signals of microphones 1 and 2 as the audio signals from which the multi-channel feature quantity r_t is calculated, the noise removal process shown in FIG. 4 likewise treats the input signal of microphone 1 as the audio signal to be denoised and the input signals of microphones 1 and 2 as the audio signals from which r_t is calculated.

First, the frequency analysis unit 11 performs frequency analysis on the input signal of microphone 1, for example in frames of 100 ms (step S21). Next, the logarithm calculation unit 12 calculates the logarithmic spectrum of the input signal frequency-analyzed in step S21 (step S22). Next, the multi-channel feature calculation unit 13 calculates the multi-channel feature quantity r_t from the input signals of microphones 1 and 2 (step S23).

Next, the input unit 15a of the DAE decoding unit 15 takes the vector formed by arranging the logarithmic spectrum calculated in step S22 and the multi-channel feature quantity r_t calculated in step S23, and concatenates it over the preceding and following T frames using a context window (step S24). Next, the hidden layer 15b takes the vector concatenated in step S24 as input and computes its output using the weight vector W^1 and the bias vector b^1 that the DAE learning unit 14 learned in the DAE learning process (step S25). Next, the hidden layer 15c takes the output of the hidden layer 15b obtained in step S25 as input and computes its output using the weight vector W^2 and the bias vector b^2 that the DAE learning unit 14 learned in the DAE learning process (step S26). The output unit 15d then outputs the output of the hidden layer 15c from step S26 as the final output (step S27).
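Steps S25 and S26 amount to two successive affine transforms with learned parameters. The sketch below assumes a sigmoid activation after the first layer and a linear second layer; this section does not state the activation functions, so both choices, like the function names, are assumptions for illustration only.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def dae_decode(z_in, W1, b1, W2, b2):
    """Forward pass with learned parameters (W1, b1) and (W2, b2).

    z_in : (frames, D_in) context-window input vectors
    W1   : (D_in, H), b1 : (H,)      parameters used in step S25
    W2   : (H, D_out), b2 : (D_out,) parameters used in step S26
    """
    h = sigmoid(z_in @ W1 + b1)   # step S25 (sigmoid is an assumed activation)
    return h @ W2 + b2            # step S26 (linear output is an assumption)

# Toy example with the sizes used in the experiments: 1,024 units, 40 mel bins.
rng = np.random.default_rng(0)
z_in = rng.standard_normal((3, 880))
W1, b1 = 0.01 * rng.standard_normal((880, 1024)), np.zeros(1024)
W2, b2 = 0.01 * rng.standard_normal((1024, 40)), np.zeros(40)
x_hat = dae_decode(z_in, W1, b1, W2, b2)   # shape (3, 40)
```

No parameter is updated here: decoding only reuses what the learning process copied over, which is the difference between FIG. 4 and FIG. 3.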

[Effects of the embodiment]
FIGS. 5A to 5C are diagrams illustrating an example of the effects of the embodiment. The observed signals used in the experiment are mixtures of speech uttered approximately 2 m in front of an array of M = 2 microphones installed in an ordinary British home and noise recorded in that home with the same microphone array. The noise includes non-stationary sounds (children's voices, a vacuum cleaner, television sound, and so on). All speech signals are command utterances consisting of six English words.

The learning data consists of 6 hours of observed signals obtained by adding noise to clean speech data from 34 speakers. The evaluation data consists of observed signals obtained by adding noise (recorded in the same room as the learning data, but different from the noise used in the learning data) to speech uttered by the same 34 speakers, with 600 utterances for each of the SN ratio conditions -6 dB, -3 dB, 0 dB, 3 dB, 6 dB, and 9 dB. The spectral feature vector y_t has 40 dimensions in the mel frequency domain, and the length of the context window is T = 5. The frame length and the frame shift are 100 ms and 25 ms, respectively.
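The figures above determine the size of the DAE input once a feature dimension is fixed. The following sketch works through the arithmetic; the 16 kHz sampling rate and the 40-dimensional multi-channel feature (one value per mel bin) are assumptions, since neither is stated in this passage.

```python
def num_frames(n_samples, fs, frame_ms=100, shift_ms=25):
    """Number of full analysis frames in a signal of n_samples samples."""
    frame = fs * frame_ms // 1000
    shift = fs * shift_ms // 1000
    if n_samples < frame:
        return 0
    return 1 + (n_samples - frame) // shift

def dae_input_dim(spec_dim=40, feat_dim=40, T=5):
    """Dimension of [y_t; r_t] stacked over the 2T + 1 frames of the context window."""
    return (2 * T + 1) * (spec_dim + feat_dim)

frames_per_second = num_frames(16000, fs=16000)   # assumed 16 kHz sampling rate
input_dim = dae_input_dim()                       # assumed 40-dimensional r_t
```

With these assumptions, one second of audio gives 1 + (16000 - 1600) / 400 = 37 frames, and each DAE input vector has 11 × 80 = 880 elements.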

Three types of multi-channel feature vector r_t were used for the embodiment: IPD based on equation (10) or (11) above, ENHANCE based on equation (15) above, and NOISE based on equation (16) above. The DAE used one hidden layer, and the number of units in the hidden layer (the parameter dimension) is 1,024. Four comparative examples were tried: the observed signal itself without noise removal, the result of noise removal with the time-frequency mask used for ENHANCE (see equation (15) above), the output of a DAE using only one microphone, and ILD.

FIGS. 5A and 5B show the noise removal performance of each method. The evaluation measures are cepstral distortion (CD) and segmental SN ratio (SSNR). A smaller CD indicates higher performance, and a larger SSNR indicates higher performance. FIGS. 5A and 5B show the CD and SSNR of the noise-removed speech obtained with each multi-channel feature quantity, for each SN ratio of the observed signal. As FIGS. 5A and 5B show, IPD, NOISE, and ENHANCE of the embodiment achieve higher performance than the comparative examples.
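For reference, a commonly used time-domain definition of segmental SN ratio is sketched below. The specification does not give its exact formula, so the frame length, the clamping range of -10 dB to 35 dB, and the function name are assumptions drawn from common practice rather than from the patent.

```python
import numpy as np

def segmental_snr(clean, estimate, frame_len, lo=-10.0, hi=35.0):
    """Mean per-frame SNR in dB between a clean signal and its estimate.

    Each frame's SNR is clamped to [lo, hi] dB before averaging, a common
    convention that limits the influence of silent or perfectly matched frames.
    """
    clean = np.asarray(clean, dtype=float)
    estimate = np.asarray(estimate, dtype=float)
    n = min(len(clean), len(estimate)) // frame_len
    snrs = []
    for i in range(n):
        s = clean[i * frame_len:(i + 1) * frame_len]
        e = s - estimate[i * frame_len:(i + 1) * frame_len]
        ratio = (np.sum(s ** 2) + 1e-12) / (np.sum(e ** 2) + 1e-12)
        snrs.append(np.clip(10.0 * np.log10(ratio), lo, hi))
    return float(np.mean(snrs)) if snrs else float("nan")

# Toy example: an estimate with a 10 % amplitude error in every sample.
clean = np.ones(400)
ssnr = segmental_snr(clean, 0.9 * clean, frame_len=100)
```

Averaging per-frame values in dB, rather than computing one ratio over the whole signal, keeps loud segments from dominating the score.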

FIG. 5C shows the results of command speech recognition on seven types of speech: as comparative examples, the observed signal without noise removal, the result of noise removal with the time-frequency mask used for ENHANCE (see equation (15) above), the output of a DAE using only one microphone, and ILD; and, as the embodiment, IPD, ENHANCE, and NOISE. The speech recognizer used the deep neural network (DNN) based technique described in M. Delcroix, Y. Kubo, T. Nakatani, and A. Nakamura, "Is speech enhancement pre-processing still relevant when using deep neural networks for acoustic modeling?", Proc. of Interspeech 2013, 2013, pp. 2992-2996. FIG. 5C shows that IPD, ENHANCE, and NOISE of the embodiment achieve higher performance than the comparative examples.

That is, the embodiment achieves higher noise removal performance in blind source separation, which is the form of source separation (estimating each original signal from acoustic data in which multiple signals are mixed) that estimates each original signal from the mixed acoustic data alone, without using the original signals or information on how the signals were mixed.

[Other embodiments]
The embodiment uses the logarithmic spectrum in the mel frequency domain as the observed signal y_t, but this is not a limitation: a logarithmic spectrum in the linear frequency domain, mel-frequency cepstral coefficients (MFCC), or the like may be used instead. The signal reconstruction device 20 uses a Wiener filter, but this is not a limitation either: for example, a time-frequency mask M_{t,f} as shown in equation (19) below may be used. Furthermore, the context window need not be used for the output vector z_out in equation (5) above.

(Device configuration of the noise removal device and the signal reconstruction device)
The components of the noise removal device 10 shown in FIG. 1 and the signal reconstruction device 20 shown in FIG. 2 are functional concepts and need not be physically configured as illustrated. That is, the specific form of distribution and integration of the functions of the noise removal device 10 is not limited to the one illustrated, and all or part of them may be functionally or physically distributed or integrated in arbitrary units according to various loads, usage conditions, and the like. For example, the DAE learning unit 14 and the DAE decoding unit 15 of the noise removal device 10 may be a single unit that switches its processing between DAE learning and noise removal. The frequency analysis unit 11, the logarithm calculation unit 12, and the multi-channel feature calculation unit 13 are described as shared by the DAE learning unit 14 and the DAE decoding unit 15, but the DAE learning unit 14 and the DAE decoding unit 15 may each have their own frequency analysis unit, logarithm calculation unit, and multi-channel feature calculation unit. In the embodiment, the noise removal device 10 and the signal reconstruction device 20 are separate devices, but they may instead be a single device.

All or any part of the processes performed in the noise removal device 10 may be realized by a CPU (Central Processing Unit) and a program analyzed and executed by the CPU. The processes performed in the noise removal device 10 and the signal reconstruction device 20 may also be realized as hardware using wired logic.

Among the processes described in the embodiment, all or part of the processes described as being performed automatically may be performed manually. Conversely, all or part of the processes described as being performed manually may be performed automatically by a known method. In addition, the processing procedures, control procedures, specific names, and information including various data and parameters described above and shown in the drawings may be changed as appropriate unless otherwise specified.

(About the program)
FIG. 6 is a diagram illustrating an example of a computer that realizes the noise removal device and the signal reconstruction device by executing a program. The computer 1000 includes, for example, a memory 1010 and a CPU 1020. The computer 1000 also includes a hard disk drive interface 1030, a disk drive interface 1040, a serial port interface 1050, a video adapter 1060, and a network interface 1070. In the computer 1000, these units are connected by a bus 1080.

The memory 1010 includes a ROM (Read Only Memory) 1011 and a RAM (Random Access Memory) 1012. The ROM 1011 stores, for example, a boot program such as a BIOS (Basic Input Output System). The hard disk drive interface 1030 is connected to a hard disk drive 1031. The disk drive interface 1040 is connected to a disk drive 1041. A removable storage medium such as a magnetic disk or an optical disk is inserted into the disk drive 1041. The serial port interface 1050 is connected to, for example, a mouse 1051 and a keyboard 1052. The video adapter 1060 is connected to, for example, a display 1061.

The hard disk drive 1031 stores, for example, an OS 1091, an application program 1092, a program module 1093, and program data 1094. That is, the program that defines the processes of the noise removal device 10 and the signal reconstruction device 20 is stored, for example in the hard disk drive 1031, as a program module 1093 in which instructions executed by the computer 1000 are written. For example, a program module 1093 for executing information processing equivalent to the functional configuration of the noise removal device 10 and the signal reconstruction device 20 is stored in the hard disk drive 1031.

The setting data used in the processing of the embodiment is stored as program data 1094, for example in the memory 1010 or the hard disk drive 1031. The CPU 1020 reads the program module 1093 and the program data 1094 stored in the memory 1010 or the hard disk drive 1031 into the RAM 1012 as necessary and executes them.

The program module 1093 and the program data 1094 need not be stored in the hard disk drive 1031; for example, they may be stored in a removable storage medium and read by the CPU 1020 via the disk drive 1041 or the like. Alternatively, the program module 1093 and the program data 1094 may be stored in another computer connected via a network (a LAN (Local Area Network), a WAN (Wide Area Network), or the like) and read by the CPU 1020 via the network interface 1070.

The embodiment described above and the other embodiments are included in the invention described in the claims and its equivalents, just as they are included in the technology disclosed in the present application.

DESCRIPTION OF SYMBOLS
1, 2 Microphone
10 Noise removal device
11 Frequency analysis unit
12 Logarithm calculation unit
13 Multi-channel feature calculation unit
14 DAE learning unit
14a Input unit
14b, 14c Hidden layer
14d Output unit
15 DAE decoding unit
15a Input unit
15b, 15c Hidden layer
15d Output unit
20 Signal reconstruction device
21 Noise removal filter calculation unit
22 Frequency domain calculation unit
23 Noise-removed speech calculation unit
24 Inverse Fourier transform unit
100 Noise removal system
1000 Computer
1010 Memory
1020 CPU

Claims

A noise removal device comprising:
a frequency analysis unit that performs frequency analysis on a first audio signal observed by a first sound collector and a second audio signal observed by a second sound collector disposed at a position different from that of the first sound collector;
a logarithm calculation unit that calculates a logarithmic spectrum, which is the logarithm of the amplitude of the first audio signal frequency-analyzed by the frequency analysis unit;
a multi-channel feature quantity calculation unit that calculates, from the first audio signal and the second audio signal frequency-analyzed by the frequency analysis unit, a multi-channel feature quantity relating to the first audio signal and the second audio signal;
a DAE learning unit that learns parameters of a DAE (Denoising AutoEncoder) that receives as input the logarithmic spectrum of a first audio signal for learning calculated by the logarithm calculation unit and the multi-channel feature quantity for learning, calculated by the multi-channel feature quantity calculation unit, relating to the first audio signal for learning and a second audio signal for learning, and that outputs the logarithmic spectrum of original speech for learning; and
a DAE decoding unit that receives as input the logarithmic spectrum of a first audio signal to be denoised calculated by the logarithm calculation unit and the multi-channel feature quantity for noise removal, calculated by the multi-channel feature quantity calculation unit, relating to the first audio signal to be denoised and a second audio signal to be denoised, and that uses the parameters learned by the DAE learning unit to output the logarithmic spectrum of noise-removed speech in which the noise component has been removed from the logarithmic spectrum of the first audio signal to be denoised.

The noise removal device according to claim 1, wherein the multi-channel feature quantity calculation unit calculates, as the multi-channel feature quantity relating to the first audio signal and the second audio signal frequency-analyzed by the frequency analysis unit, a time-frequency mask for speech enhancement obtained by clustering a quantity characterizing the first audio signal and the second audio signal at each time-frequency point in the linear frequency domain into a speech cluster and a noise cluster, and converting the time-frequency mask that extracts the time-frequency components of the speech cluster into the mel frequency domain.

The noise removal device according to claim 1, wherein the multi-channel feature quantity calculation unit calculates, as the multi-channel feature quantity relating to the first audio signal and the second audio signal frequency-analyzed by the frequency analysis unit, the logarithm of a time-frequency mask for speech enhancement obtained by clustering a quantity characterizing the first audio signal and the second audio signal at each time-frequency point in the linear frequency domain into a speech cluster and a noise cluster, and converting the time-frequency mask that extracts the time-frequency components of the speech cluster into the mel frequency domain.

The noise removal device according to claim 1, wherein the multi-channel feature quantity calculation unit calculates, as the multi-channel feature quantity relating to the first audio signal and the second audio signal frequency-analyzed by the frequency analysis unit, the sum of the logarithmic spectrum of the first audio signal calculated by the logarithm calculation unit and the logarithm of a time-frequency mask for speech enhancement obtained by clustering a quantity characterizing the first audio signal and the second audio signal at each time-frequency point in the linear frequency domain into a speech cluster and a noise cluster, and converting the time-frequency mask that extracts the time-frequency components of the speech cluster into the mel frequency domain.

The noise removal device according to claim 1, wherein the multi-channel feature quantity calculation unit calculates, as the multi-channel feature quantity relating to the first audio signal and the second audio signal frequency-analyzed by the frequency analysis unit, the sum of the logarithmic spectrum of the first audio signal calculated by the logarithm calculation unit and the logarithm of the value obtained by subtracting from 1 a time-frequency mask for speech enhancement, the time-frequency mask being obtained by clustering a quantity characterizing the first audio signal and the second audio signal at each time-frequency point in the linear frequency domain into a speech cluster and a noise cluster, and converting the time-frequency mask that extracts the time-frequency components of the speech cluster into the mel frequency domain.

The noise removal device according to any one of claims 2 to 5, wherein the characterizing quantity is the phase difference between the first audio signal and the second audio signal, frequency-analyzed by the frequency analysis unit, at each time-frequency point in the linear frequency domain.

The noise removal device according to claim 1, wherein the multi-channel feature quantity calculation unit calculates, as the multi-channel feature quantity relating to the first audio signal and the second audio signal frequency-analyzed by the frequency analysis unit, the phase difference between the first audio signal and the second audio signal at the center frequency of each mel filter bank, or the cosine of that phase difference.

A noise removal program that causes a computer to function as the noise removal device according to any one of claims 1 to 7.