JP7026357B2

JP7026357B2 - Time frequency mask estimator learning device, time frequency mask estimator learning method, program

Info

Publication number: JP7026357B2
Application number: JP2019015065A
Authority: JP
Inventors: 悠馬小泉; 浩平矢田部
Original assignee: Waseda University; Nippon Telegraph and Telephone Corp; NTT Inc
Current assignee: Waseda University; Nippon Telegraph and Telephone Corp; NTT Inc
Priority date: 2019-01-31
Filing date: 2019-01-31
Publication date: 2022-02-28
Anticipated expiration: 2039-01-31
Also published as: JP2020122896A

Description

The present invention relates to a time-frequency mask estimator learning device that trains a time-frequency mask estimator using deep learning, a time-frequency mask estimator learning method, and a program.

The present invention is applicable to any acoustic signal processing technique that uses deep learning; in this specification, sound source enhancement is used as the example.

<Signal analysis and sound source enhancement using the STFT>
Acoustic signal processing starts by observing sound with a microphone. The observed sound contains noise in addition to the target sound to be processed. Sound source enhancement is signal processing that extracts the target sound from an observed signal containing noise.

Sound source enhancement is defined as follows. Let x_k be the signal observed by the microphone, and assume that x_k is a mixture of the target sound s_k and the noise n_k.
x_k = s_k + n_k … (1)
Here, k is the time index in the time domain. To extract the target sound from the observed signal, consider analyzing the time-domain observed signal in blocks of L points taken at a fixed interval of points. Hereafter, the t-th (t ∈ {0, …, T}) signal grouped in this way (Equation (2), in which ^T denotes transposition) is called the observed signal of frame t. From Equation (1), the observed signal of frame t can be written as follows.
x_t = s_t + n_t … (3)
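The framing step and Equation (3) can be illustrated with a minimal sketch in plain Python. The function name `frame_signal`, the toy signals, and the frame length and shift values are assumptions made for illustration; they are not taken from the specification.

```python
def frame_signal(x, frame_len, shift):
    """Split a sequence into frames of `frame_len` samples taken every `shift` samples."""
    n_frames = 1 + (len(x) - frame_len) // shift
    return [x[t * shift : t * shift + frame_len] for t in range(n_frames)]

# Toy target sound s_k and noise n_k; the observation is their sum, as in Equation (1).
s = [0.0, 1.0, 0.0, -1.0, 0.0, 1.0, 0.0, -1.0]
n = [0.1, -0.1, 0.2, 0.0, -0.2, 0.1, 0.0, 0.1]
x = [sk + nk for sk, nk in zip(s, n)]

L, SHIFT = 4, 2  # frame length L and an assumed example shift
x_frames = frame_signal(x, L, SHIFT)
s_frames = frame_signal(s, L, SHIFT)
n_frames = frame_signal(n, L, SHIFT)

# Equation (3): each observed frame is the sum of the target frame and the noise frame.
frames_add_up = all(
    abs(a - (b + c)) < 1e-12
    for x_t, s_t, n_t in zip(x_frames, s_frames, n_frames)
    for a, b, c in zip(x_t, s_t, n_t)
)
```

Because framing only selects samples, the additive relation of Equation (1) carries over to every frame unchanged.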

Here, s_t and n_t are the target sound and the noise of frame t, each defined in the same way as in Equation (2). In time-frequency analysis using the STFT, the STFT is applied to the observed signal of each time frame. The signal after the STFT satisfies the following property.
X_t = S_t + N_t … (4)
Here, X_t is the result of applying the STFT to the observed signal of frame t, and S_t and N_t are the corresponding results for the target sound and the noise.

Time-frequency masking is one of the representative methods of sound source enhancement. In this processing, the observed signal after the STFT is multiplied by a time-frequency mask G_t to obtain an estimate of the target sound after the STFT as follows.
Ŝ_t = G_t ○ X_t … (5)
Here, ○ is the Hadamard product. Finally, the inverse STFT (ISTFT) is applied to Ŝ_t to obtain an estimate of the target sound in the time domain.
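The masking procedure (transform, element-wise multiplication by the mask, inverse transform) can be sketched for a single frame as follows. This is an illustrative sketch only: it uses a naive DFT without an analysis window or overlap-add, and the hand-made oracle mask, the toy sinusoids, and all function names are assumptions rather than part of the specification.

```python
import cmath
import math

def dft(frame):
    """Naive DFT of one frame (no analysis window, for brevity)."""
    L = len(frame)
    return [sum(frame[k] * cmath.exp(-2j * math.pi * f * k / L) for k in range(L))
            for f in range(L)]

def idft(spectrum):
    """Inverse of dft()."""
    L = len(spectrum)
    return [sum(spectrum[f] * cmath.exp(2j * math.pi * f * k / L) for f in range(L)) / L
            for k in range(L)]

def apply_mask(G_t, X_t):
    """Hadamard (element-wise) product of the mask G_t and the spectrum X_t."""
    return [g * x for g, x in zip(G_t, X_t)]

L = 8
s_t = [math.cos(2 * math.pi * 1 * k / L) for k in range(L)]        # target: bin 1
n_t = [0.5 * math.cos(2 * math.pi * 3 * k / L) for k in range(L)]  # noise: bin 3
x_t = [s + n for s, n in zip(s_t, n_t)]

X_t = dft(x_t)
# Hand-made oracle mask: keep bins 1 and L-1 (the target), drop everything else.
G_t = [1.0 if f in (1, L - 1) else 0.0 for f in range(L)]
S_hat_t = apply_mask(G_t, X_t)
s_hat_t = [v.real for v in idft(S_hat_t)]  # time-domain estimate of the target
```

In this toy case the target and the noise occupy different frequency bins, so a binary mask separates them exactly. In practice the mask is unknown and has to be estimated, which is the role of the estimator described next.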

Now, introduce a function with parameters θ_G that estimates G_t from the observed signal, and define G_t as the output of this function for the input φ_t. Here, φ_t is an acoustic feature extracted from X_t, for example the amplitude spectrum of X_t. In sound source enhancement using deep learning, which has been actively studied in recent years, the mainstream approach is to design this function as a deep neural network (DNN). In the following, the function is assumed to be implemented using a DNN.

C. Trabelsi, O. Bilaniuk, Y. Zhang, D. Serdyuk, S. Subramanian, J. F. Santos, S. Mehri, N. Rostamzadeh, Y. Bengio, and C. J. Pal, "Deep complex networks," in Int. Conf. Learn. Representat., 2018.

<Problems of signal analysis and sound source enhancement using the STFT>
In speech and acoustic signal processing, the waveform is rarely handled directly. In most cases, as described above, the observed signal is Fourier transformed over short time intervals (STFT), and enhancement or classification is applied to the resulting signal. However, the STFT is a transform from real numbers to complex numbers, and deep learning with complex numbers makes training complicated, so it is common to use or control only the amplitude information of the STFT spectrum. Likewise, only the amplitude information is often used as the feature input to the DNN. For example, when the above φ_t is the amplitude spectrum of X_t, the information in the phase spectrum is ignored, so the information available from the observed signal is not fully used. On the other hand, many DNN modules are designed on the assumption that they operate on real numbers, so inputting X_t as it is causes problems. Non-Patent Document 1 proposes a CNN and BN that use complex numbers, but these are proposed as general machine learning theory and are not specialized for acoustic signal processing.

Accordingly, an object of the present invention is to provide a time-frequency mask estimator learning device that trains a time-frequency mask estimator using the real part and the imaginary part of the STFT of the observed signal.

The time-frequency mask estimator learning device of the present invention includes a time-frequency mask estimator and a learning unit. The time-frequency mask estimator estimates a time-frequency mask for estimating a target sound from an arbitrary observed signal, and includes a CNN processing unit, a BN processing unit, and a GLU processing unit.

The CNN processing unit executes convolutional neural network processing on the real part of the STFT spectrum of the observed signal and on the value obtained by regarding the corresponding imaginary part as a real number. The BN processing unit executes batch normalization processing that uses a common parameter for the norm operation on the real part of the STFT spectrum of the observed signal and on the value obtained by regarding the corresponding imaginary part as a real number. The GLU processing unit executes gated linear unit processing on the value obtained by concatenating the real part of the STFT spectrum of the observed signal with the value obtained by regarding the corresponding imaginary part as a real number. The learning unit learns the parameters of the time-frequency mask estimator so as to minimize a cost function between the known target sound and the value obtained by multiplying, by the time-frequency mask, a known observed signal formed by superimposing the known target sound and known noise.

According to the time-frequency mask estimator learning device of the present invention, a time-frequency mask estimator that uses the real part and the imaginary part of the STFT of the observed signal can be trained.

FIG. 1 is a block diagram showing the configuration of the time-frequency mask estimator learning device of the embodiment.
FIG. 2 is a flowchart showing the operation of the time-frequency mask estimator learning device of the embodiment.
FIG. 3 is a flowchart showing the detailed operation of the time-frequency mask estimator of the embodiment.
FIG. 4 is a diagram showing a configuration example of a time-frequency mask estimator that combines the three modules and auxiliary operations.

Embodiments of the present invention are described in detail below. Components having the same function are given the same reference number, and duplicate descriptions are omitted.

Example 1 discloses three DNN modules based on the characteristics of the phase spectrum: (1) a constrained complex convolutional neural network (CNN), (2) simplified complex batch normalization (BN), and (3) a complex gated linear unit (GLU). In the following, I = √(-1), and ()_Re and ()_Im denote the real part and the imaginary part of a complex number, respectively. Note that ()_Im refers to the imaginary part regarded as a real number (the imaginary part with the imaginary unit removed).

<Constrained complex CNN>
In Example 1, a complex CNN of the following form is executed.
W * z = (W_Re * z_Re - W_Im * z_Im) + I (W_Re * z_Im + W_Im * z_Re) … (8)

Here, W * denotes a convolutional layer with weight parameter W. As a variant, the convolution weights can also be restricted to real numbers.
W_Re * z = (W_Re * z_Re) + I (W_Re * z_Im) … (9)

The phase spectrum of an audio signal varies continuously in the time direction. Therefore, unlike Non-Patent Document 1, this example constrains the stride in the time direction of Equations (8) and (9) to 1.
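The complex convolution built from real convolutions, with the stride fixed to 1, can be sketched as follows. This is a simplified illustration under stated assumptions: it works on a one-dimensional sequence along the time axis only, it uses cross-correlation (as deep learning frameworks usually do) in place of the operator written as * in the text, and the function names and toy values are hypothetical.

```python
def conv1d(w, z):
    """'Valid' real 1-D cross-correlation along the time axis with stride 1."""
    K = len(w)
    return [sum(w[i] * z[t + i] for i in range(K)) for t in range(len(z) - K + 1)]

def complex_conv1d(w_re, w_im, z_re, z_im):
    """Complex convolution assembled from four real convolutions (form of Equation (8))."""
    out_re = [a - b for a, b in zip(conv1d(w_re, z_re), conv1d(w_im, z_im))]
    out_im = [a + b for a, b in zip(conv1d(w_re, z_im), conv1d(w_im, z_re))]
    return out_re, out_im

def real_weight_conv1d(w_re, z_re, z_im):
    """Variant with real-valued weights only (form of Equation (9))."""
    return conv1d(w_re, z_re), conv1d(w_re, z_im)

# Toy weights and input; real and imaginary parts are stored as separate real lists.
w_re, w_im = [0.5, -1.0], [0.25, 2.0]
z_re, z_im = [1.0, 2.0, -1.0, 0.5], [0.0, -1.0, 3.0, 1.0]

out_re, out_im = complex_conv1d(w_re, w_im, z_re, z_im)

# Reference result computed directly with Python's built-in complex arithmetic.
w = [complex(a, b) for a, b in zip(w_re, w_im)]
z = [complex(a, b) for a, b in zip(z_re, z_im)]
direct = [sum(w[i] * z[t + i] for i in range(len(w))) for t in range(len(z) - len(w) + 1)]
```

With stride 1 every time frame contributes to an output frame, so no frames are skipped along the time axis. The four-real-convolution form agrees with direct complex arithmetic, which is why the real and imaginary parts can be handled as ordinary real-valued channels.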

<Simplified complex BN>
BN is processing that sets the mean and variance within a mini-batch used for DNN training to 0 and 1, respectively. Using learnable scale and shift parameters γ and β, it is defined as follows.
BN(z) = γ (z - μ_B) / √(σ²_B + ε) + β … (10)
Here, μ_B and σ²_B are the mean and variance within the mini-batch, and ε > 0 is a small positive constant, which may be set to about 10^-5, for example. This works well for real numbers, but applying it directly to complex numbers is not appropriate. For example, applying BN separately to the real part and the imaginary part destroys the relationship between the norms of the complex variables (corresponding to the relationship between the amplitudes at each frequency) as well as the phase spectrum. Processing based on principal component analysis, as proposed in Non-Patent Document 1, is inappropriate for the same reason. On the other hand, subtracting the mean from a set of complex numbers is equivalent to a high-pass filter, and multiplying a set of complex numbers by a scalar can be regarded as adjusting the filter gain. This example therefore proposes the following simplified complex BN.
BN(z) = (z_Re - μ_BRe) / √(σ²_BRe + σ²_BIm + ε) + I (z_Im - μ_BIm) / √(σ²_BRe + σ²_BIm + ε) … (11)

In Equation (11), the mean is subtracted separately for the real part and the imaginary part, but the norm operation uses the same parameter (the denominator of Equation (11)) for both the real part and the imaginary part, which differs from the conventional method. The learnable parameters γ and β are not used, for the same reason.
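A minimal sketch of the simplified complex BN follows, assuming a mini-batch given as two real lists (real parts and imaginary parts). The means are subtracted separately, while both parts are divided by the common factor √(σ²_BRe + σ²_BIm + ε) named in the embodiment. The function name and the toy values are assumptions for illustration.

```python
import math

def simplified_complex_bn(z_re, z_im, eps=1e-5):
    """Subtract the mean per part, then scale both parts by one shared factor."""
    n = len(z_re)
    mu_re = sum(z_re) / n
    mu_im = sum(z_im) / n
    var_re = sum((v - mu_re) ** 2 for v in z_re) / n
    var_im = sum((v - mu_im) ** 2 for v in z_im) / n
    # Shared norm parameter: one scalar gain for the real and imaginary parts.
    scale = 1.0 / math.sqrt(var_re + var_im + eps)
    out_re = [(v - mu_re) * scale for v in z_re]
    out_im = [(v - mu_im) * scale for v in z_im]
    return out_re, out_im

# Toy mini-batch whose real part varies much more than its imaginary part.
z_re = [1.0, 2.0, 3.0, 4.0]
z_im = [10.0, 10.5, 9.5, 10.0]
bn_re, bn_im = simplified_complex_bn(z_re, z_im)

# Average power of the normalized complex values (close to 1 by construction).
power = sum(r * r + i * i for r, i in zip(bn_re, bn_im)) / len(bn_re)
```

Because a single scalar gain is applied to both parts, the ratio between the centered real and imaginary parts is unchanged. Normalizing each part by its own variance would instead stretch the two axes differently and alter that ratio.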

<Complex GLU>
Next, consider an activation function suitable for complex spectra. Non-Patent Document 1 proposes three types of activation functions, given as Equations (12) to (14), in which | | denotes the absolute value and Arg[ ] is a function that returns the argument (angle) of a complex number. To preserve the phase spectrum, this example considers an activation function of the form of Equation (14), that is, an activation function that manipulates only the amplitude spectrum.

A representative acoustic signal processing method that manipulates only the amplitude spectrum is the time-frequency mask of Equation (5). Therefore, consider a CNN that takes the real part and the imaginary part as input and outputs a time-frequency mask (Equation (15)), where concat denotes concatenation of two variables along the channel direction. This form is similar to a GLU. A GLU is an activation function that takes real values as input and outputs a mask; in this example, a GLU that concatenates the complex values along the channel direction and outputs a mask (that is, a complex GLU) is executed.
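The idea of a gate that acts only on the amplitude can be sketched as follows. The exact formula of the complex GLU is not reproduced in this text, so the sketch makes loud assumptions: the mask-producing CNN is replaced by a hypothetical pointwise linear map with weights `w_re`, `w_im` and a bias, followed by a sigmoid, and the resulting real-valued gate multiplies both the real and the imaginary part.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def complex_glu(z_re, z_im, w_re, w_im, bias):
    """Gate each complex value by a real mask computed from concat(z_re, z_im)."""
    out_re, out_im = [], []
    for re, im in zip(z_re, z_im):
        features = (re, im)  # concatenation of the two parts along the channel axis
        # Hypothetical stand-in for the mask-producing CNN: pointwise linear map.
        gate = sigmoid(w_re * features[0] + w_im * features[1] + bias)
        out_re.append(gate * re)  # the same real gate scales both parts,
        out_im.append(gate * im)  # so only the amplitude changes
    return out_re, out_im

z_re = [1.0, -2.0, 0.5]
z_im = [0.5, 1.0, -3.0]
g_re, g_im = complex_glu(z_re, z_im, w_re=0.8, w_im=-0.3, bias=0.1)

# Phase before and after gating, for comparison.
phase_in = [math.atan2(i, r) for r, i in zip(z_re, z_im)]
phase_out = [math.atan2(i, r) for r, i in zip(g_re, g_im)]
```

Since the gate is a positive real number, the phase of every element is left unchanged and only its magnitude is reduced, which matches the stated goal of preserving the phase spectrum.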

<Embodiment>
A time-frequency mask estimator learning device 1 is described below as a specific embodiment. As shown in FIG. 1, the time-frequency mask estimator learning device 1 includes a time-frequency mask estimator 11 and a learning unit 12, and has a target sound DB 91 and a noise DB 92 either outside or inside the device. The time-frequency mask estimator 11 includes a CNN processing unit 111, a BN processing unit 112, a GLU processing unit 113, and an auxiliary calculation unit 114.

As shown in FIG. 2, the time-frequency mask estimator 11 randomly selects a target sound from the target sound DB 91 and noise from the noise DB 92, superimposes them to simulate an observed signal, and executes CNN processing, BN processing, and GLU processing on the STFT of the observed signal to estimate a time-frequency mask for estimating the target sound from an arbitrary observed signal (S11). When step S11 is executed for the first time, the parameters for the time-frequency mask G are initialized, for example, with random numbers. The STFT processing may be executed by the time-frequency mask estimator 11 or by an STFT processing unit (not shown).

The learning unit 12 learns the parameters of the time-frequency mask estimator 11 so as to minimize an arbitrary cost function between the known target sound and the value obtained by multiplying the known observed signal, formed by superimposing the known target sound and the known noise, by the time-frequency mask (Equation (5)) (S12). Stochastic gradient (steepest) descent or a similar method may be used for training, with the learning rate set to about 10^-5.

The time-frequency mask estimator learning device 1 performs a convergence check and returns to step S11 if training has not converged. The convergence rule may be, for example, whether S12 has been repeated a fixed number of times (for example, 100,000 times).
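The loop of S11, S12, and the convergence check can be sketched with a toy stand-in for the estimator. Everything specific here is an assumption for illustration: the DNN is replaced by one learnable parameter per frequency bin passed through a sigmoid (so the mask does not depend on the input), the cost is the squared error between the masked observation and the target, the two small "databases" hold ready-made spectra, and the learning rate and iteration count are chosen for the toy problem rather than taken from the values quoted above.

```python
import math
import random

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def train_mask(target_db, noise_db, n_iters=1000, lr=0.5, seed=0):
    """Toy training loop: one real mask parameter per frequency bin."""
    rng = random.Random(seed)
    theta = [0.0] * len(target_db[0])
    for _ in range(n_iters):  # convergence rule: fixed number of iterations
        # S11: simulate an observation X = S + N and compute the mask G.
        S = rng.choice(target_db)
        N = rng.choice(noise_db)
        X = [s + n for s, n in zip(S, N)]
        G = [sigmoid(th) for th in theta]
        # S12: one gradient step on the cost sum_f |G_f X_f - S_f|^2.
        for f in range(len(theta)):
            err = G[f] * X[f] - S[f]
            dcost_dG = 2.0 * (X[f].conjugate() * err).real
            theta[f] -= lr * dcost_dG * G[f] * (1.0 - G[f])  # chain rule through sigmoid
    return [sigmoid(th) for th in theta]

# Toy spectra with two bins: targets live in bin 0, noise lives in bin 1.
target_db = [[1 + 0j, 0j], [0 + 1j, 0j]]
noise_db = [[0j, 1 + 0j], [0j, -1 + 0j]]
G = train_mask(target_db, noise_db)
```

In this toy setting the learned mask moves toward passing bin 0 and suppressing bin 1. A real estimator would instead compute the mask from the observed spectrum through the CNN, BN, and GLU modules and update all of their parameters by backpropagation.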

FIG. 3 shows the details of step S11. The time-frequency mask estimator 11 repeatedly executes CNN processing (S111), BN processing (S112), GLU processing (S113), and other processing (S114) on the STFT of the observed signal in a predetermined order. Steps S111 to S114 are described in detail below.

The CNN processing unit 111 executes convolutional neural network processing on the real part of the STFT spectrum of the observed signal and on the value obtained by regarding the corresponding imaginary part as a real number, based on Equation (8) or (9) and with the stride in the time direction constrained to 1 (S111). The BN processing unit 112 executes batch normalization processing based on Equation (11), which uses a common parameter (1 / √(σ²_BRe + σ²_BIm + ε)) for the norm operation on the real part of the STFT spectrum of the observed signal and on the value obtained by regarding the corresponding imaginary part as a real number (S112). The GLU processing unit 113 executes gated linear unit processing, based on Equation (15), on the value concat(z_Re, z_Im) obtained by concatenating the real part z_Re of the STFT spectrum of the observed signal with the value z_Im obtained by regarding the corresponding imaginary part as a real number (S113). The auxiliary calculation unit 114 executes processing other than the above (for example, concat and sigmoid) (S114).

FIG. 4 shows a configuration example of a time-frequency mask estimator (neural network) that combines the three modules. "Conv", "BN", "GLU", "Concat", and "sigmoid" denote the constrained complex CNN, the simplified complex BN, the complex GLU, concatenation of the inputs along the channel direction, and the sigmoid function, respectively. c, k, s, and p are the number of channels, kernel size, stride size, and padding size, respectively.

<Effect>
According to the time-frequency mask estimator learning device 1 of Example 1, three DNN modules for inputting the complex spectrum as it is are introduced: (1) a constrained complex convolutional neural network (CNN), (2) simplified complex batch normalization (BN), and (3) a complex gated linear unit (GLU). As a result, a DNN can process the FFT spectrum as complex numbers in a manner consistent with signal processing theory.

<Supplementary note>
The device of the present invention has, for example as a single hardware entity, an input unit to which a keyboard or the like can be connected, an output unit to which a liquid crystal display or the like can be connected, a communication unit to which a communication device (for example, a communication cable) capable of communicating with the outside of the hardware entity can be connected, a CPU (Central Processing Unit, which may include cache memory, registers, and the like), RAM and ROM as memory, an external storage device such as a hard disk, and a bus that connects the input unit, output unit, communication unit, CPU, RAM, ROM, and external storage device so that data can be exchanged among them. If necessary, the hardware entity may also be provided with a device (drive) that can read from and write to a recording medium such as a CD-ROM. A general-purpose computer is an example of a physical entity having such hardware resources.

The external storage device of the hardware entity stores the program required to realize the functions described above and the data required for the processing of this program (the storage is not limited to the external storage device; for example, the program may be stored in a ROM, which is a read-only storage device). Data obtained by the processing of these programs is stored as appropriate in the RAM, the external storage device, or the like.

In the hardware entity, each program stored in the external storage device (or ROM, etc.) and the data required for the processing of each program are read into memory as needed and are interpreted, executed, and processed by the CPU as appropriate. As a result, the CPU realizes the predetermined functions (the components referred to above as "... unit", "... means", and so on).

The present invention is not limited to the embodiment described above and can be modified as appropriate without departing from the spirit of the present invention. The processes described in the above embodiment need not be executed in time series in the order described; they may also be executed in parallel or individually, depending on the processing capacity of the device that executes them or as needed.

As described above, when the processing functions of the hardware entity (the device of the present invention) described in the above embodiment are realized by a computer, the processing content of the functions that the hardware entity should have is described by a program. By executing this program on the computer, the processing functions of the hardware entity are realized on the computer.

The program describing this processing content can be recorded on a computer-readable recording medium. The computer-readable recording medium may be of any kind, for example a magnetic recording device, an optical disc, a magneto-optical recording medium, or a semiconductor memory. Specifically, for example, a hard disk device, a flexible disk, or magnetic tape can be used as the magnetic recording device; a DVD (Digital Versatile Disc), DVD-RAM (Random Access Memory), CD-ROM (Compact Disc Read Only Memory), or CD-R (Recordable) / RW (ReWritable) as the optical disc; an MO (Magneto-Optical disc) as the magneto-optical recording medium; and an EEP-ROM (Electronically Erasable and Programmable-Read Only Memory) as the semiconductor memory.

The program is distributed, for example, by selling, transferring, or lending a portable recording medium such as a DVD or CD-ROM on which the program is recorded. Alternatively, the program may be stored in the storage device of a server computer and distributed by transferring it from the server computer to another computer via a network.

A computer that executes such a program first stores, for example, the program recorded on the portable recording medium or the program transferred from the server computer in its own storage device. When executing the processing, the computer reads the program stored in its own storage device and executes processing according to the program it has read. As another form of execution, the computer may read the program directly from the portable recording medium and execute processing according to it, or it may execute processing according to the received program each time the program is transferred to it from the server computer. The processing described above may also be executed by a so-called ASP (Application Service Provider) type service, in which the program is not transferred from the server computer to the computer and the processing functions are realized only through execution instructions and result acquisition. The program in this embodiment includes information that is provided for processing by an electronic computer and is equivalent to a program (such as data that is not a direct command to the computer but has the property of defining the processing of the computer).

In this embodiment, the hardware entity is configured by executing a predetermined program on a computer, but at least part of the processing content may instead be realized in hardware.

Claims

1. A time-frequency mask estimator learning device comprising:
a time-frequency mask estimator that estimates a time-frequency mask for estimating a target sound from an arbitrary observed signal, the time-frequency mask estimator including
a CNN processing unit that executes convolutional neural network processing on the real part of the STFT spectrum of the observed signal and on the value obtained by regarding the corresponding imaginary part as a real number,
a BN processing unit that executes batch normalization processing using a common parameter for the norm operation on the real part of the STFT spectrum of the observed signal and on the value obtained by regarding the corresponding imaginary part as a real number, and
a GLU processing unit that executes gated linear unit processing on the value obtained by concatenating the real part of the STFT spectrum of the observed signal with the value obtained by regarding the corresponding imaginary part as a real number; and
a learning unit that learns parameters of the time-frequency mask estimator so as to minimize a cost function between a known target sound and the value obtained by multiplying, by the time-frequency mask, a known observed signal formed by superimposing the known target sound and known noise.

2. The time-frequency mask estimator learning device according to claim 1, wherein
the CNN processing unit executes the convolutional neural network processing with the stride in the time direction constrained to 1.

3. A time-frequency mask estimator learning method, wherein
a time-frequency mask estimator that estimates a time-frequency mask for estimating a target sound from an arbitrary observed signal executes:
a step of executing convolutional neural network processing on the real part of the STFT spectrum of the observed signal and on the value obtained by regarding the corresponding imaginary part as a real number;
a step of executing batch normalization processing using a common parameter for the norm operation on the real part of the STFT spectrum of the observed signal and on the value obtained by regarding the corresponding imaginary part as a real number; and
a step of executing gated linear unit processing on the value obtained by concatenating the real part of the STFT spectrum of the observed signal with the value obtained by regarding the corresponding imaginary part as a real number, and
a time-frequency mask estimator learning device executes
a step of learning parameters of the time-frequency mask estimator so as to minimize a cost function between a known target sound and the value obtained by multiplying, by the time-frequency mask, a known observed signal formed by superimposing the known target sound and known noise.

4. The time-frequency mask estimator learning method according to claim 3, wherein
the convolutional neural network processing is executed with the stride in the time direction constrained to 1.

5. A program that causes a computer to function as the time-frequency mask estimator learning device according to claim 1.