JP6567478B2

JP6567478B2 - Sound source enhancement learning device, sound source enhancement device, sound source enhancement learning method, program, signal processing learning device

Info

Publication number: JP6567478B2
Application number: JP2016164726A
Authority: JP
Inventors: 健太丹羽; 小林　和則; 和則小林; 悠馬小泉; 智子川瀬
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2016-08-25
Filing date: 2016-08-25
Publication date: 2019-08-28
Anticipated expiration: 2036-08-25
Also published as: JP2018031910A

Description

The present invention relates to sound source enhancement, a technique for emphasizing the signal of a specific sound source within a mixture of signals from various sound sources, and in particular to sound source enhancement using deep learning.

Conventionally, neural networks for sound source enhancement have been built by setting the network parameters of a deep-learning model (for example, weight matrices and bias vectors) to random initial values and optimizing them by error backpropagation. Such a network outputs a group of latent variables for sound source enhancement, for example the PSD (Power Spectral Density) of the target sound and of the noise, a Wiener filter, or the a priori SNR (Signal-to-Noise Ratio).

Non-Patent Document 1 describes a conventional sound source separation and enhancement method. Let s(t) denote the sound source signal to be enhanced (hereinafter, the target sound) and n(t) the noise other than the target sound, where t is the time index. Both s(t) and n(t) are time-domain signals.

Let S(τ, ω) and N(τ, ω) denote the frequency-domain target sound and the frequency-domain noise obtained by transforming s(t) and n(t) into the frequency domain, where τ and ω are the time-frame index and the frequency index, respectively. The frequency-domain mixed sound X(τ, ω) is then expressed as follows. The time-domain representation of X(τ, ω) is denoted x(t).
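The equation referred to here can be reconstructed from the surrounding definitions. Assuming the usual additive mixing model, with the equation number inferred from the later reference to Equation (3), it presumably reads:

```latex
X(\tau,\omega) = S(\tau,\omega) + N(\tau,\omega) \tag{1}
```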

The noise may be divided into several components according to its characteristics. Examples of such characteristics are strong non-stationarity, as in music; strong stationarity or strong low-frequency content, as in air-conditioning noise; and, when observing with two or more microphone channels, a different direction of arrival. When the noise characteristics differ, the noise can be modeled as follows.
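A reconstruction of this model from the definitions of K and N_i(τ, ω), assuming the noise components add linearly to the target sound (the equation number is inferred, not taken from the original):

```latex
X(\tau,\omega) = S(\tau,\omega) + \sum_{i=1}^{K} N_i(\tau,\omega) \tag{2}
```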

Here, K is the number of noises having different characteristics, and N_i(τ, ω) is the frequency-domain noise of the noise source belonging to the i-th noise characteristic.

The conventional sound source enhancement learning device 800 is first described with reference to FIGS. 12 and 13. FIG. 12 is a block diagram showing the configuration of the sound source enhancement learning device 800, and FIG. 13 is a flowchart showing its operation. As shown in FIG. 12, the sound source enhancement learning device 800 includes a feature quantity/label generation unit 810 and a pre-learning unit 820.

The sound source enhancement learning device 800 is connected to a learning data recording unit 890. The learning data recording unit 890 stores, as learning data, the monaural mixed sound X(τ, ω) used for pre-learning together with the target sound S(τ, ω) and the noise N(τ, ω) that constitute it. Hereinafter, X(τ, ω), S(τ, ω), and N(τ, ω) are referred to as the frequency-domain mixed sound learning data, the frequency-domain target sound learning data, and the frequency-domain noise learning data, respectively.

The feature quantity/label generation unit 810 generates feature quantities and labels from the frequency-domain mixed sound learning data X(τ, ω), the frequency-domain target sound learning data S(τ, ω), and the frequency-domain noise learning data N(τ, ω) (S810). Feature quantities can be designed in various ways; in the simplest case, the spectrum |X(τ, ω)|² of the mixed sound, or a smoothed version of it, can be used. Labels can also be designed in various ways; in the simplest case, a binary mask I(τ, ω) can be used. The description below uses the mixed sound spectrum |X(τ, ω)|² as the feature quantity and the binary mask I(τ, ω) as the label. |X(τ, ω)|² and I(τ, ω) are prepared for each time frame and each frequency.

The binary mask I(τ, ω) is calculated by the following equation.
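The mask equation can be reconstructed from the description of the threshold θ and of the floor value, which is 0 in Equation (3) whenever NL(τ, ω) is smaller than θ. On that reading it presumably has the form:

```latex
I(\tau,\omega) =
\begin{cases}
1 & \text{if } NL(\tau,\omega) \ge \theta \\
0 & \text{otherwise}
\end{cases} \tag{3}
```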

Here, NL(τ, ω) is the noise mixing level at the time of observation, and θ is the threshold that determines the binary mask value (0 or 1).

A label (0 or 1) is therefore assigned for each frequency or for each frequency band.

The threshold θ is often set to about 0 dB, which corresponds to judging whether the target sound source is the dominant source in the time-frequency bin concerned. The floor value (that is, the value of I(τ, ω) when NL(τ, ω) is smaller than θ) is 0 in Equation (3), but in practice a value α satisfying 0 < α < 1 is often used, for example α of about 0.05 to 0.3.

NL(τ, ω) is calculated by the following equation.
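The following is a minimal NumPy sketch of label generation, not the patent's own formula. It assumes that NL(τ, ω) is the target-to-noise power ratio in dB, which is consistent with θ being given in dB but is not confirmed by the text. The function names, the `eps` guard, and the toy data are all hypothetical.

```python
import numpy as np

def noise_mixing_level_db(S, N, eps=1e-12):
    """Assumed form of NL: per-bin target-to-noise power ratio in dB."""
    return 10.0 * np.log10((np.abs(S) ** 2 + eps) / (np.abs(N) ** 2 + eps))

def binary_mask(NL, theta_db=0.0, floor=0.0):
    """1 where NL >= theta, `floor` elsewhere (floor=0 gives a strict 0/1 mask)."""
    return np.where(NL >= theta_db, 1.0, floor)

# Toy data: 2 time frames x 3 frequency bins (values chosen for illustration only).
S = np.array([[1.0, 0.1, 2.0], [0.5, 0.5, 0.0]])
N = np.array([[0.5, 1.0, 1.0], [1.0, 0.1, 1.0]])

NL = noise_mixing_level_db(S, N)
I_hard = binary_mask(NL)               # floor value 0
I_soft = binary_mask(NL, floor=0.1)    # floor value alpha = 0.1, within 0.05 to 0.3
```

Passing a non-zero `floor` reproduces the practical variant in which bins dominated by noise are attenuated rather than zeroed.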

The pre-learning unit 820 learns the network parameters p from the pairs of feature quantity and label (S820). The network parameters p are used by the binary mask estimation unit 920 of the sound source enhancement device 900 described below. Any supervised learning framework may be used, for example a DNN (Deep Neural Network), CNN (Convolutional Neural Network), or RNN (Recurrent Neural Network). In other words, the network parameters p are a set of parameters optimized for each sound source by prior supervised learning. When a DNN is used, the network parameters p are the weight matrices and bias vectors.

Next, the conventional sound source enhancement device 900 is described with reference to FIGS. 14 and 15. FIG. 14 is a block diagram showing the configuration of the sound source enhancement device 900, and FIG. 15 is a flowchart showing its operation. As shown in FIG. 14, the sound source enhancement device 900 includes a frequency domain transform unit 910, a binary mask estimation unit 920, a signal enhancement unit 930, and a time domain transform unit 940.

The sound source enhancement device 900 is connected to a learning result recording unit 990, which stores the network parameters p learned in advance by the sound source enhancement learning device 800.

The frequency domain transform unit 910 transforms the time-domain observed signal x(t) into the frequency domain to generate the frequency-domain observed signal X(τ, ω) (S910). The observed signal x(t) may be a single-channel signal or a signal of two or more channels.
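One common way to realize this transform is a short-time Fourier transform. The sketch below is an illustration only: the text does not specify the transform, and the frame length, hop size, and Hann window are assumptions.

```python
import numpy as np

def stft(x, frame_len=512, hop=256):
    """Return X[tau, omega]: one row per time frame, one column per frequency bin.

    Assumes len(x) >= frame_len; trailing samples that do not fill a frame are dropped.
    """
    window = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop : i * hop + frame_len] for i in range(n_frames)])
    return np.fft.rfft(frames * window, axis=1)

fs = 16000                              # sampling rate in Hz (assumed)
t = np.arange(fs) / fs                  # one second of samples
x = np.sin(2 * np.pi * 1000.0 * t)      # 1 kHz test tone standing in for x(t)
X = stft(x)                             # frequency-domain observed signal X(tau, omega)
```

For a multichannel observation, the same function would be applied to each channel separately.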

The binary mask estimation unit 920 generates the binary mask I(τ, ω) from the frequency-domain observed signal X(τ, ω) using the network parameters p read from the learning result recording unit 990 (S920). The binary mask estimation method is assumed to correspond to the learning method used in the sound source enhancement learning device 800.

When a signal of two or more channels is input, the input may instead be a group of signals in which the target sound and the noise have each been emphasized by beamforming or the like (that is, a group of linear transforms of X(τ, ω)), with the binary mask I(τ, ω) as output. In this case, the same processing must also be applied when the feature quantity/label generation unit 810 generates the feature quantities and when the pre-learning unit 820 pre-learns the network parameters.

The signal enhancement unit 930 performs signal enhancement (sound source enhancement) on the frequency-domain observed signal X(τ, ω) using the binary mask I(τ, ω) according to Equation (5), and generates the frequency-domain enhanced sound ^S(τ, ω) (S930).
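Equation (5) can be reconstructed on the assumption that the mask is applied as a per-bin gain to the observed signal, which is the standard way a time-frequency mask is used:

```latex
\hat{S}(\tau,\omega) = I(\tau,\omega)\, X(\tau,\omega) \tag{5}
```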

The time domain transform unit 940 generates, from the frequency-domain enhanced sound ^S(τ, ω), the enhanced sound ^s(t), which is the estimated sound source signal in the time domain (S940).

Y. Wang, A. Narayanan and D. L. Wang, “On training targets for supervised speech separation”, IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 22, no. 12, pp. 1849-1858, 2014.

In the conventional method, a function that takes the (frequency-domain) mixed sound as input and outputs a binary mask (hereinafter, the binary mask function) is designed using a DNN, CNN, RNN, or the like in order to estimate the binary mask used for sound source enhancement. The binary mask function is represented by a neural network, but the network parameters have been learned without imposing any physical constraint on the structure of that network. For example, the number of layers and the number of nodes were determined heuristically, and the initial values of the network parameters (such as weight matrices and bias vectors) were determined randomly. This amounts to treating the binary mask function, which estimates the binary mask from the mixed sound, as a black box. It was therefore difficult to judge whether the resulting neural network was suited to estimating the latent variables (for example, the target sound and noise PSDs) needed for accurate and stable sound source enhancement. In fact, the target sound and noise PSDs sometimes could not be estimated accurately at intermediate stages of the network's computation.

An object of the present invention is therefore to provide a sound source enhancement learning technique that learns network parameters for accurate and stable sound source enhancement by incorporating a model for sound source enhancement into the structure of a neural network. A further object is to provide a signal processing learning technique that learns network parameters for accurate and stable signal processing by incorporating the structure of a signal processing algorithm into the structure of a neural network.

One aspect of the present invention is a sound source enhancement learning device comprising: a feature quantity/label generation unit that, from a set of frequency-domain mixed sound learning data, frequency-domain target sound learning data, and frequency-domain noise learning data, generates the frequency-domain mixed sound learning data as a feature quantity and a pair of a target sound PSD and a noise PSD as a label; and a pre-learning unit that uses the feature quantity and the label to learn network parameters characterizing a sound source enhancement neural network that takes an observed signal as input and outputs an estimated target sound PSD and an estimated noise PSD. The pre-learning unit includes: a target sound non-negative autoencoder calculation unit that calculates a target sound spectral characteristic vector, which models the spectral characteristics of the target sound, from the target sound BF output power, i.e. the power of a beamforming output signal that is generated from the frequency-domain mixed sound learning data and emphasizes the target sound; a noise non-negative autoencoder calculation unit that calculates a noise spectral characteristic vector, which models the spectral characteristics of the noise, from the noise BF output power, i.e. the power of a beamforming output signal that is generated from the frequency-domain mixed sound learning data and emphasizes the noise; a complementary subtraction neural network calculation unit that calculates the estimated target sound PSD and the estimated noise PSD from the target sound spectral characteristic vector and the noise spectral characteristic vector; and a network parameter optimization unit that optimizes the network parameters using the estimated target sound PSD, the estimated noise PSD, the target sound PSD, and the noise PSD.

One aspect of the present invention is a signal processing learning device that learns network parameters characterizing a signal processing neural network, which takes a signal as input and outputs an estimated processing result for the signal, using learning data consisting of pairs of a feature quantity of a signal and a label that is the processing result of the signal. Let intermediate-stage estimated processing result i (1 ≤ i ≤ n, where n is an integer of 1 or more) denote an estimated processing result calculated at an intermediate stage between the feature quantity calculated from the signal and the output of the estimated processing result. The device includes: a neural network 1 calculation unit that calculates intermediate-stage estimated processing result 1 using the feature quantity; a neural network j calculation unit (2 ≤ j ≤ n) that calculates intermediate-stage estimated processing result j using the feature quantity or any of intermediate-stage estimated processing results 1 to j-1; a neural network n+1 calculation unit that calculates the estimated processing result using the feature quantity or any of intermediate-stage estimated processing results 1 to n; and a network parameter optimization unit that optimizes the network parameters using the estimated processing result and the label.

According to the present invention, a model for sound source enhancement is incorporated as the structure of a neural network and a group of latent variables, such as the target sound and noise PSDs, is estimated, which makes it possible to learn network parameters for accurate and stable sound source enhancement. Likewise, the structure of a signal processing algorithm is incorporated as the structure of a neural network and the intermediate-stage estimated processing results, which correspond to the latent variables, are estimated, which makes it possible to learn network parameters for accurate and stable signal processing.

FIG. 1 is a diagram showing the structure of the sound source enhancement neural network 500.
FIG. 2 is a block diagram showing the configuration of the sound source enhancement learning device 100.
FIG. 3 is a flowchart showing the operation of the sound source enhancement learning device 100.
FIG. 4 is a block diagram showing the configuration of the pre-learning unit 120.
FIG. 5 is a flowchart showing the operation of the pre-learning unit 120.
FIG. 6 is a block diagram showing the configuration of the network parameter calculation unit 122.
FIG. 7 is a flowchart showing the operation of the network parameter calculation unit 122.
FIG. 8 is a block diagram showing the configuration of the sound source enhancement device 200.
FIG. 9 is a flowchart showing the operation of the sound source enhancement device 200.
FIG. 10 is a block diagram showing the configuration of the Wiener filter estimation unit 220.
FIG. 11 is a flowchart showing the operation of the Wiener filter estimation unit 220.
FIG. 12 is a block diagram showing the configuration of the sound source enhancement learning device 800.
FIG. 13 is a flowchart showing the operation of the sound source enhancement learning device 800.
FIG. 14 is a block diagram showing the configuration of the sound source enhancement device 900.
FIG. 15 is a flowchart showing the operation of the sound source enhancement device 900.

Embodiments of the present invention are described in detail below. Components having the same function, including those of the sound source enhancement learning device 800 and the sound source enhancement device 900, are given the same reference numerals, and duplicate description is omitted.

In addition, _ (underscore) denotes a subscript. For example, x^y_z means that y_z is a superscript of x, and x_{y_z} means that y_z is a subscript of x.

First, the principle of the invention is described.
<Principle of the invention>
In the conventional method, the number of network layers, the number of nodes, and the connections between nodes were determined heuristically. In other words, the neural network was designed (learned) from pairs of feature quantity and label without imposing any particular constraint on its physical structure.

The present application designs a neural network that reflects a physical model for sound source enhancement. FIG. 1 shows the structure of this network. As shown in FIG. 1, the sound source enhancement neural network 500 includes a target sound non-negative autoencoder 510, a noise non-negative autoencoder 515, and a complementary subtraction neural network 520. The target sound non-negative autoencoder 510 and the noise non-negative autoencoder 515 are neural networks that independently model the spectral characteristics of the target sound and of the noise, respectively. The complementary subtraction neural network 520 is a neural network that estimates the target sound PSD and the noise PSD.

When there are several target sounds or several noises with different characteristics, the sound source enhancement neural network 500 may include multiple target sound non-negative autoencoders 510 or noise non-negative autoencoders 515, one for each target sound or noise. Alternatively, instead of considering several noise models, such as one for air-conditioning noise and one that treats television sound as noise, a single noise non-negative autoencoder 515 may be designed using the sum of these noises. For simplicity, the description below assumes one target sound non-negative autoencoder 510 and one noise non-negative autoencoder 515.

The sound source enhancement neural network 500 takes two inputs. The first is the power φ_{Y_S}(τ,ω) of the beamforming output signal Y_S, obtained by applying beamforming to the frequency-domain mixed sound X(τ,ω) so as to emphasize the target sound (hereinafter, φ_{Y_S}(τ,ω) is called the target sound BF output power). The second is the power φ_{Y_N}(τ,ω) of the beamforming output signal Y_N, obtained by applying beamforming to X(τ,ω) so as to emphasize the noise (hereinafter, φ_{Y_N}(τ,ω) is called the noise BF output power). The network is configured to output the estimated PSD q^(4), which concatenates the estimated target sound PSD q_S^(4) and the estimated noise PSD q_N^(4). Here, Ω is the number of frequency bands, and φ_{Y_S}(τ,ω), φ_{Y_N}(τ,ω), q_S^(4), and q_N^(4) are Ω-dimensional real-valued vectors. That is, the target sound BF output power φ_{Y_S}(τ,ω) and the noise BF output power φ_{Y_N}(τ,ω) are calculated for each frame and each frequency band.

As learning data for training the sound source enhancement neural network 500, the frequency-domain mixed sound X(τ, ω) and its constituents, the frequency-domain target sound S(τ, ω) and the frequency-domain noise N(τ, ω), are given. The target sound BF output power φ_{Y_S}(τ,ω) and the noise BF output power φ_{Y_N}(τ,ω) are generated before learning from the frequency-domain mixed sound learning data X(τ, ω), the frequency-domain target sound learning data S(τ, ω), and the frequency-domain noise learning data N(τ, ω). The target sound PSD φ_S and the noise PSD φ_N, which are used to control learning, that is, to optimize the network parameters p, are likewise generated beforehand from S(τ, ω) and N(τ, ω), respectively.
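The preparation of the network inputs and labels can be sketched as follows. This is an illustration under stated assumptions, not the procedure of the patent: the beamformer design is not given in this passage, so the weights `w_S` and `w_N` are placeholders (a channel average and a channel difference), and the labels are taken to be instantaneous per-bin powers at microphone 0. All names are hypothetical.

```python
import numpy as np

def bf_output_power(X, w):
    """X: (mics, frames, bins) complex STFT; w: (mics, bins) complex beamformer weights.

    Returns |Y|^2 with Y(tau, omega) = sum_m conj(w[m, omega]) * X[m, tau, omega].
    """
    Y = np.einsum("mf,mtf->tf", np.conj(w), X)
    return np.abs(Y) ** 2

def instantaneous_psd(Z):
    """Assumed label definition: per-bin power of a single-channel STFT."""
    return np.abs(Z) ** 2

rng = np.random.default_rng(0)
M, T, F = 2, 4, 5                                   # mics, frames, frequency bands
S = rng.standard_normal((M, T, F)) + 1j * rng.standard_normal((M, T, F))
N = rng.standard_normal((M, T, F)) + 1j * rng.standard_normal((M, T, F))
X = S + N                                           # mixed sound learning data

w_S = np.full((M, F), 1.0 / M, dtype=complex)                    # placeholder weights
w_N = np.stack([np.ones(F), -np.ones(F)]).astype(complex) / M    # placeholder weights

phi_YS = bf_output_power(X, w_S)    # input: target sound BF output power
phi_YN = bf_output_power(X, w_N)    # input: noise BF output power
phi_S = instantaneous_psd(S[0])     # label: target sound PSD
phi_N = instantaneous_psd(N[0])     # label: noise PSD
```

Each row of `phi_YS` and `phi_YN` is one Ω-dimensional input vector for a single frame.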

The target sound non-negative autoencoder 510, the noise non-negative autoencoder 515, and the complementary subtraction neural network 520 are described below.

[Modeling of target sound and noise spectral characteristics by non-negative autoencoders]
First, the target sound non-negative autoencoder 510 and the noise non-negative autoencoder 515, which are neural networks for modeling the spectral characteristics of the target sound and of the noise, are described.

Non-negative matrix factorization (NMF) is one method that uses the spectral characteristics of a sound source for sound source separation. In NMF, a non-negative spectrogram S is expressed as the product of a non-negative spectral basis matrix B and a non-negative activation matrix A, as in the following equation.
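The factorization can be written out as below. The matrix dimensions are inferred from the definitions of Ω, Υ, and β rather than copied from the original equation, and the non-negativity conditions are stated elementwise.

```latex
S \approx BA, \qquad
S \in \mathbb{R}^{\Omega \times \Upsilon},\quad
B \in \mathbb{R}^{\Omega \times \beta},\quad
A \in \mathbb{R}^{\beta \times \Upsilon},\qquad
S_{\omega,\tau} \ge 0,\; B_{\omega,k} \ge 0,\; A_{k,\tau} \ge 0
```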

Here, Ω is the number of frequency bands, Υ is the number of time frames analyzed, and β is the number of spectral bases. R denotes the set of real numbers.

One way to factorize S is to minimize the squared error, which allows the activation matrix A and the spectral basis matrix B to be optimized (Reference Non-Patent Document 1).
[Reference Non-Patent Document 1] D. D. Lee et al., “Algorithms for non-negative matrix factorization”, Proc. NIPS 2000, pp. 556-562, 2000.
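For illustration, the squared-error factorization can be carried out with the multiplicative update rules commonly attributed to the cited reference. The sketch below is a minimal NumPy version under that assumption; the iteration count, the `eps` guard against division by zero, and the random initialization are choices made here, not taken from the text.

```python
import numpy as np

def nmf(S, n_bases, n_iter=200, eps=1e-12, seed=0):
    """Factorize non-negative S (bands x frames) as B @ A by multiplicative updates."""
    rng = np.random.default_rng(seed)
    n_bands, n_frames = S.shape
    B = rng.random((n_bands, n_bases)) + eps    # spectral basis matrix
    A = rng.random((n_bases, n_frames)) + eps   # activation matrix
    errors = []
    for _ in range(n_iter):
        # Both factors stay non-negative because each update multiplies by a
        # ratio of non-negative quantities.
        A *= (B.T @ S) / (B.T @ B @ A + eps)
        B *= (S @ A.T) / (B @ A @ A.T + eps)
        errors.append(np.linalg.norm(S - B @ A) ** 2)
    return B, A, errors

rng = np.random.default_rng(1)
S = rng.random((6, 3)) @ rng.random((3, 20))    # synthetic non-negative spectrogram
B, A, errors = nmf(S, n_bases=3)
```

The list `errors` records the squared reconstruction error after each iteration, which is convenient for checking that the factorization is converging.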

The idea of decomposing a non-negative power spectrum into the product of a set of spectral bases and a set of activations, as in non-negative matrix factorization, can be expressed with a neural network. The following describes how to model the spectral characteristics of the target sound using a fully connected DNN, that is, the structure of the target sound non-negative autoencoder 510.

First, the target sound BF output power φ_{Y_S}(τ,ω) is fed to the input layer (first layer) of the target sound non-negative autoencoder 510. That is, the first-layer input q_S^(1) is set equal to φ_{Y_S}(τ,ω).

Next, the weight matrix W_S^(2), a β×Ω real-valued matrix, is regarded as a collection of β spectral bases, and a non-negativity constraint is imposed on it. That is, every element of W_S^(2) is constrained to be non-negative: W_S^(2)_{i,j} ≥ 0.

Here, W_S^(2)_{i,j} denotes the (i, j)-th element of W_S^(2).

The following calculation yields q_S^(2), a β×1 real-valued matrix (β-dimensional real-valued vector) that is the output of the first layer and hence the input to the second layer.

Here, b_S^(2) is a bias vector, expressed as a β-dimensional real-valued vector, and f^(2)(·) is the ramp function (ReLU) given by the following equation.
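Written out, the layer computation and the ramp function described above presumably take the standard fully connected form below. This is a reconstruction from the stated dimensions of W_S^(2), b_S^(2), and q_S^(2), with the maximum applied elementwise; the original equation numbers are not reproduced because they cannot be recovered from the text.

```latex
q_S^{(2)} = f^{(2)}\!\left( W_S^{(2)} q_S^{(1)} + b_S^{(2)} \right),
\qquad
f^{(2)}(x) = \max(0, x)
```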

Although the ramp function is used here as an example of the activation function (transfer function), the invention is not limited to it. Various nonlinear functions, such as the conventionally used sigmoid function, can be used.

Computed in this way, q_S^(2) can be regarded as the activations of the β spectral bases for one frame.

Next, the weight matrix W_S^(3) is set to the transpose of W_S^(2) (Equation (12)), and the spectrogram is reconstructed by Equations (13) to (15).

Here, T denotes transposition.

Here, b_S^(3) is a bias vector, expressed as an Ω-dimensional real-valued vector, and f^(3)(·) is the ramp function (ReLU) given by Equation (15).
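The forward pass of such a tied-weight non-negative autoencoder can be sketched as follows. This is a minimal illustration rather than the patent's implementation: the class and method names are hypothetical, the random initialization is arbitrary, and clipping negative weights to zero is only one possible way of keeping W_S^(2) non-negative during training, since the text here does not say how the constraint is enforced.

```python
import numpy as np

def relu(x):
    """Ramp function applied elementwise."""
    return np.maximum(0.0, x)

class NonNegativeAutoencoder:
    def __init__(self, n_bands, n_bases, seed=0):
        rng = np.random.default_rng(seed)
        self.W2 = rng.random((n_bases, n_bands))   # beta x Omega, non-negative spectral bases
        self.b2 = np.zeros(n_bases)                # beta-dimensional bias
        self.b3 = np.zeros(n_bands)                # Omega-dimensional bias

    def project(self):
        """Re-impose W2 >= 0, e.g. after a gradient step (assumed mechanism)."""
        np.maximum(self.W2, 0.0, out=self.W2)

    def forward(self, q1):
        q2 = relu(self.W2 @ q1 + self.b2)          # activations of the beta bases
        W3 = self.W2.T                             # tied weights: W3 is the transpose of W2
        q3 = relu(W3 @ q2 + self.b3)               # reconstructed spectral characteristic vector
        return q2, q3

ae = NonNegativeAutoencoder(n_bands=8, n_bases=3)
phi_YS = np.abs(np.random.default_rng(1).standard_normal(8)) ** 2   # toy BF output power
q2, q3 = ae.forward(phi_YS)
```

Because the decoder reuses the transpose of the encoder weights, only W2 and the two bias vectors need to be stored and optimized.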

The target sound non-negative autoencoder 510 outputs the Ω×1 real-valued matrix (Ω-dimensional real-valued vector) q_S^(3) (hereinafter, q_S^(3) is called the target sound spectral characteristic vector). Constructing the target sound non-negative autoencoder 510 with the non-negative weight matrices W_S^(2) and W_S^(3) corresponds to modeling the spectral characteristics of the target sound.

Incidentally, a neural network having the structure of Equation (12) is generally called an autoencoder (Reference Non-Patent Document 2).
[Reference Non-Patent Document 2] Takayuki Okatani, "Deep Learning (Machine Learning Professional Series)", Kodansha, 2015.

Similarly, for the noise BF output power φ_{Y_N}(τ,ω), the spectral characteristics of the noise can be modeled by configuring a non-negative autoencoder, that is, a neural network that performs the same computation. The output of the noise non-negative autoencoder 515 is an Ω × 1 real-valued matrix (Ω-dimensional vector) q_N⁽³⁾ (hereinafter, q_N⁽³⁾ is referred to as the noise spectrum characteristic vector).

[Estimation of target sound / noise PSD by complementary subtraction]
Next, the complementary subtraction neural network 520 is described. It is a neural network that estimates the target sound PSD and the noise PSD used to generate the Wiener filter from the output q_S⁽³⁾ of the target sound non-negative autoencoder 510, which models the spectral characteristics of the target sound, and the output q_N⁽³⁾ of the noise non-negative autoencoder 515, which models the spectral characteristics of the noise.

The target sound spectrum characteristic vector q_S⁽³⁾ and the noise spectrum characteristic vector q_N⁽³⁾, which are the outputs of the second layer, that is, the inputs of the third layer, are concatenated as q⁽³⁾ = [q_S⁽³⁾ᵀ, q_N⁽³⁾ᵀ]ᵀ. At the output of the third layer, the estimated target sound PSD q_S⁽⁴⁾ is generated by subtracting the noise component remaining in the target sound spectrum characteristic vector q_S⁽³⁾, and the estimated noise PSD q_N⁽⁴⁾ is generated by subtracting the target sound component remaining in the noise spectrum characteristic vector q_N⁽³⁾. In other words, the weight matrix W⁽⁴⁾, a 2Ω × 2Ω real-valued matrix, is structured to perform complementary subtraction so that the estimated target sound PSD q_S⁽⁴⁾ and the estimated noise PSD q_N⁽⁴⁾ can be estimated with high accuracy.

When sound is received using a symmetric microphone array or symmetric beamforming, the average beamforming sensitivity D_ω is determined independently of the direction of arrival of the target sound. The structure of W⁽⁴⁾ is therefore also determined independently of the direction of arrival of the target sound, so the optimization is expected to converge easily.

As an example, the initial value W_init⁽⁴⁾ of W⁽⁴⁾ is designed by the following equation.

Here, λ_{S,ω_i} and λ_{N,ω_i} are positive values, and γ_{S-N,ω_i} and γ_{N-S,ω_i} are negative values (1 ≤ i ≤ Ω).
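The design equation itself is not reproduced in this text. One structure consistent with the stated sign conditions and with the 2Ω × 2Ω size of W⁽⁴⁾ is a 2 × 2 block matrix whose blocks are Ω × Ω diagonal matrices: positive gains on the diagonal blocks pass each spectrum through, and negative gains on the off-diagonal blocks subtract the other spectrum frequency by frequency. The layout below is an assumption inferred from the description, not the equation as filed.

```latex
\begin{equation}
\mathbf{W}_{\mathrm{init}}^{(4)} =
\begin{bmatrix}
\operatorname{diag}\!\left(\lambda_{S,\omega_1},\dots,\lambda_{S,\omega_\Omega}\right) &
\operatorname{diag}\!\left(\gamma_{S\text{-}N,\omega_1},\dots,\gamma_{S\text{-}N,\omega_\Omega}\right) \\[4pt]
\operatorname{diag}\!\left(\gamma_{N\text{-}S,\omega_1},\dots,\gamma_{N\text{-}S,\omega_\Omega}\right) &
\operatorname{diag}\!\left(\lambda_{N,\omega_1},\dots,\lambda_{N,\omega_\Omega}\right)
\end{bmatrix}
\end{equation}
```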

Note that no constraint is imposed on the range of values of the elements of W⁽⁴⁾ at this time.

The output q⁽⁴⁾ of the third layer is calculated by the following equation.
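A minimal sketch of the complementary subtraction layer follows. It assumes that the third layer is a plain matrix product q⁽⁴⁾ = W⁽⁴⁾ q⁽³⁾ (the text lists bias vectors for only two layers and does not state an activation here) and that the initial weights form a block matrix of diagonal gains. The function names and the numeric gains are hypothetical.

```python
import numpy as np

def init_complementary_weights(lam_S, lam_N, gam_SN, gam_NS):
    """Build a (2*Omega, 2*Omega) initial weight matrix from per-frequency gains.

    lam_* are positive pass-through gains, gam_* are negative cross gains.
    The block-diagonal layout is an assumption, not the patent's equation.
    """
    return np.block([
        [np.diag(lam_S), np.diag(gam_SN)],
        [np.diag(gam_NS), np.diag(lam_N)],
    ])

def complementary_subtraction(W4, q3_S, q3_N):
    """Concatenate the two spectra, apply W4, and split into PSD estimates."""
    q3 = np.concatenate([q3_S, q3_N])
    q4 = W4 @ q3
    Omega = q3_S.shape[0]
    return q4[:Omega], q4[Omega:]

Omega = 4
W4 = init_complementary_weights(
    lam_S=np.ones(Omega), lam_N=np.ones(Omega),
    gam_SN=-0.5 * np.ones(Omega), gam_NS=-0.25 * np.ones(Omega),
)
q3_S = np.array([4.0, 3.0, 2.0, 1.0])
q3_N = np.array([1.0, 1.0, 2.0, 2.0])
q4_S, q4_N = complementary_subtraction(W4, q3_S, q3_N)
# q4_S = q3_S - 0.5 * q3_N and q4_N = q3_N - 0.25 * q3_S
```

Because no range constraint is imposed on W⁽⁴⁾, the outputs of such a linear layer may become negative after training. Any clipping to non-negative values would be an additional design choice.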

The output q⁽⁴⁾ of the third layer is q⁽⁴⁾ = [q_S⁽⁴⁾ᵀ, q_N⁽⁴⁾ᵀ]ᵀ (where q_S⁽⁴⁾ and q_N⁽⁴⁾ are Ω-dimensional real-valued vectors), and q_S⁽⁴⁾ and q_N⁽⁴⁾ are the estimated target sound PSD and the estimated noise PSD, respectively.

[Optimization of network parameters of the sound source enhancement neural network]
Finally, all three layers are optimized by the error back-propagation method using the target sound PSD φ_S and the noise PSD φ_N, thereby optimizing the network parameter p. The optimization is executed as many times as there are items of learning data.

Here, the network parameter p consists of the weight matrices W_S⁽²⁾, W_N⁽²⁾, W_S⁽³⁾, W_N⁽³⁾, W⁽⁴⁾ for three layers and the bias vectors b_S⁽²⁾, b_N⁽²⁾, b_S⁽³⁾, b_N⁽³⁾ for two layers. W_S⁽²⁾ and W_N⁽²⁾ are β × Ω real-valued matrices, W_S⁽³⁾ and W_N⁽³⁾ are Ω × β real-valued matrices, W⁽⁴⁾ is a 2Ω × 2Ω real-valued matrix, b_S⁽²⁾ and b_N⁽²⁾ are β-dimensional real-valued vectors, and b_S⁽³⁾ and b_N⁽³⁾ are Ω-dimensional real-valued vectors. W_S⁽²⁾ and W_S⁽³⁾ satisfy Equation (12), W_N⁽²⁾ and W_N⁽³⁾ satisfy the following relationship, and the initial value of W⁽⁴⁾ is given by Equation (16).
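The shapes listed above can be collected in a small container. The sketch below is illustrative only: the class and function names, the sizes Ω = 257 and β = 32, the random initial values, and the identity placeholder for W⁽⁴⁾ are all hypothetical. It also assumes that the noise-side weights are tied in the same way as the target-side weights (W_N⁽³⁾ equal to the transpose of W_N⁽²⁾), which is the natural counterpart of Equation (12) but is not shown in this text.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class NetworkParams:
    """Network parameter p: only the free parameters are stored."""
    W2_S: np.ndarray  # (beta, Omega)
    W2_N: np.ndarray  # (beta, Omega)
    b2_S: np.ndarray  # (beta,)
    b2_N: np.ndarray  # (beta,)
    b3_S: np.ndarray  # (Omega,)
    b3_N: np.ndarray  # (Omega,)
    W4: np.ndarray    # (2*Omega, 2*Omega)

    @property
    def W3_S(self) -> np.ndarray:
        return self.W2_S.T  # (Omega, beta), tied to W2_S

    @property
    def W3_N(self) -> np.ndarray:
        return self.W2_N.T  # (Omega, beta), assumed tied to W2_N

def init_params(Omega: int, beta: int, seed: int = 0) -> NetworkParams:
    rng = np.random.default_rng(seed)
    return NetworkParams(
        W2_S=rng.uniform(0.0, 0.1, (beta, Omega)),
        W2_N=rng.uniform(0.0, 0.1, (beta, Omega)),
        b2_S=np.zeros(beta), b2_N=np.zeros(beta),
        b3_S=np.zeros(Omega), b3_N=np.zeros(Omega),
        W4=np.eye(2 * Omega),  # placeholder initial value
    )

params = init_params(Omega=257, beta=32)
```

Storing only the second-layer weights and deriving the third-layer weights as views keeps the tie exact during updates.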

Alternatively, the target sound non-negative autoencoder 510 and the noise non-negative autoencoder 515 may be optimized in advance, after which the third-layer parameter W⁽⁴⁾ (that is, the complementary subtraction neural network 520) is optimized by the error back-propagation method.
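The stage-wise option, in which the autoencoders are held fixed and only W⁽⁴⁾ is optimized, can be sketched as below. The text specifies only error back-propagation, so the squared-error cost, the plain gradient-descent update, the linear third layer, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np

def train_w4(W4_init, Q3, Phi, lr=0.1, epochs=200):
    """Fit W4 so that W4 @ q3 approximates the label PSDs.

    Q3  : (n_frames, 2*Omega) stacked autoencoder outputs [q3_S, q3_N]
    Phi : (n_frames, 2*Omega) stacked labels [phi_S, phi_N]
    Cost: mean over frames of 0.5 * ||W4 @ q3 - phi||^2 (assumed).
    """
    W4 = W4_init.copy()
    n = Q3.shape[0]
    losses = []
    for _ in range(epochs):
        err = Q3 @ W4.T - Phi                  # (n_frames, 2*Omega)
        losses.append(0.5 * np.sum(err ** 2) / n)
        W4 -= lr * (err.T @ Q3) / n            # gradient of the cost w.r.t. W4
    return W4, losses

# Synthetic data: labels generated from a known complementary-subtraction matrix.
rng = np.random.default_rng(0)
Omega = 3
W_true = np.block([[np.eye(Omega), -0.3 * np.eye(Omega)],
                   [-0.2 * np.eye(Omega), np.eye(Omega)]])
Q3 = rng.uniform(0.0, 1.0, size=(50, 2 * Omega))
Phi = Q3 @ W_true.T
W4, losses = train_w4(np.eye(2 * Omega), Q3, Phi)
```

Starting from a structured initial value rather than random weights is the point made in the description: the search begins close to a matrix that already performs complementary subtraction.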

<Embodiment 1>
The sound source enhancement learning device 100 of Embodiment 1 is described below with reference to FIGS. 2 and 3. FIG. 2 is a block diagram showing the configuration of the sound source enhancement learning device 100. FIG. 3 is a flowchart showing the operation of the sound source enhancement learning device 100. As shown in FIG. 2, the sound source enhancement learning device 100 includes a feature quantity / label generation unit 110 and a pre-learning unit 120.

Like the sound source enhancement learning device 800, the sound source enhancement learning device 100 is connected to the learning data recording unit 890. The learning data recording unit 890 stores frequency domain mixed sound learning data X(τ, ω), frequency domain target sound learning data S(τ, ω), and frequency domain noise learning data N(τ, ω).

The feature quantity / label generation unit 110 generates a feature quantity and labels from the frequency domain mixed sound learning data X(τ, ω), the frequency domain target sound learning data S(τ, ω), and the frequency domain noise learning data N(τ, ω) (S110). Here, the frequency domain mixed sound learning data X(τ, ω) is used as the feature quantity. The labels are the target sound PSD φ_S and the noise PSD φ_N calculated from the frequency domain target sound learning data S(τ, ω) and the frequency domain noise learning data N(τ, ω).
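Step S110 can be sketched as follows, assuming that each PSD label is the squared magnitude of the corresponding spectrum in every time-frequency bin. Any averaging over frames, and the function name itself, are not specified in this passage and are left out or hypothetical.

```python
import numpy as np

def make_features_and_labels(X, S, N):
    """Return the feature X and the PSD labels (phi_S, phi_N).

    X, S, N : complex arrays indexed by (frame tau, frequency omega).
    The PSD is taken as |.|^2 per bin, which is an assumption.
    """
    phi_S = np.abs(S) ** 2
    phi_N = np.abs(N) ** 2
    return X, phi_S, phi_N

# Toy spectra for one frame and two frequency bins.
S = np.array([[1 + 1j, 2 + 0j]])
N = np.array([[0 + 1j, 1 - 1j]])
X = S + N  # mixed sound as the sum of target sound and noise
feat, phi_S, phi_N = make_features_and_labels(X, S, N)
```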

The pre-learning unit 120 learns the network parameter p from pairs of the feature quantity X(τ, ω) and the labels φ_S and φ_N (S120). The pre-learning unit 120 is described below with reference to FIGS. 4 and 5. FIG. 4 is a block diagram showing the configuration of the pre-learning unit 120. FIG. 5 is a flowchart showing the operation of the pre-learning unit 120. As shown in FIG. 4, the pre-learning unit 120 includes a beamforming output power calculation unit 121 and a network parameter calculation unit 122.

First, the beamforming output power calculation unit 121 calculates the target sound BF output power φ_{Y_S}(τ,ω) and the noise BF output power φ_{Y_N}(τ,ω) from the feature quantity X(τ, ω) (S121). As described above, the target sound BF output power φ_{Y_S}(τ,ω) is obtained by applying beamforming to the frequency domain mixed sound learning data X(τ,ω), which is the feature quantity, to generate a beamforming output signal Y_S in which the target sound is emphasized, and then calculating its power φ_{Y_S}(τ,ω). The noise BF output power φ_{Y_N}(τ,ω) can be obtained in the same way.
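A sketch of step S121 for a multichannel observation follows. The beamformer weights here are hypothetical placeholders (an equal-weight sum for the target sound and a two-microphone difference for the noise); the description defines its own beamformers elsewhere, and only the "filter, then take the power" pattern is taken from the text.

```python
import numpy as np

def bf_output_power(X, w):
    """Power of a frequency-domain beamformer output.

    X : (M, T, Omega) complex STFT of M microphones
    w : (M, Omega)    complex beamformer weights per frequency
    Returns |w^H x|^2 with shape (T, Omega).
    """
    Y = np.einsum("mf,mtf->tf", np.conj(w), X)  # sum over microphones
    return np.abs(Y) ** 2

M, T, Omega = 2, 3, 4
X = np.ones((M, T, Omega), dtype=complex)        # identical signal on both mics
w_S = np.full((M, Omega), 1.0 / M, dtype=complex)  # hypothetical target-sound weights
w_N = np.stack([np.full(Omega, 0.5), np.full(Omega, -0.5)]).astype(complex)  # hypothetical null
phi_YS = bf_output_power(X, w_S)
phi_YN = bf_output_power(X, w_N)
```

In this toy case the summing beamformer passes the common signal with unit power, while the difference beamformer cancels it.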

Next, the network parameter calculation unit 122 calculates the network parameter p from the target sound BF output power φ_{Y_S}(τ,ω) and the noise BF output power φ_{Y_N}(τ,ω) calculated in S121 and from the labels, that is, the target sound PSD φ_S and the noise PSD φ_N (S122). The network parameter calculation unit 122 is described below with reference to FIGS. 6 and 7. FIG. 6 is a block diagram showing the configuration of the network parameter calculation unit 122. FIG. 7 is a flowchart showing the operation of the network parameter calculation unit 122. As shown in FIG. 6, the network parameter calculation unit 122 includes a target sound non-negative autoencoder calculation unit 126, a noise non-negative autoencoder calculation unit 127, a complementary subtraction neural network calculation unit 128, and a network parameter optimization unit 129.

First, the target sound non-negative autoencoder calculation unit 126, which corresponds to the target sound non-negative autoencoder 510 of the sound source enhancement neural network, calculates the target sound spectrum characteristic vector q_S⁽³⁾ from the target sound BF output power φ_{Y_S}(τ,ω) (S126). Similarly, the noise non-negative autoencoder calculation unit 127, which corresponds to the noise non-negative autoencoder 515 of the sound source enhancement neural network, calculates the noise spectrum characteristic vector q_N⁽³⁾ from the noise BF output power φ_{Y_N}(τ,ω) (S127).

Next, the complementary subtraction neural network calculation unit 128, which corresponds to the complementary subtraction neural network 520 of the sound source enhancement neural network, calculates the estimated target sound PSD q_S⁽⁴⁾ and the estimated noise PSD q_N⁽⁴⁾ from the target sound spectrum characteristic vector q_S⁽³⁾ calculated in S126 and the noise spectrum characteristic vector q_N⁽³⁾ calculated in S127 (S128).

Finally, the network parameter optimization unit 129 optimizes the network parameter p using the estimated target sound PSD q_S⁽⁴⁾ and the estimated noise PSD q_N⁽⁴⁾ calculated in S128 and the target sound PSD φ_S and the noise PSD φ_N calculated in S110 (S129). As described in <Principle of the Invention>, the network parameter p consists of the weight matrices W_S⁽²⁾, W_N⁽²⁾, W_S⁽³⁾, W_N⁽³⁾, W⁽⁴⁾ for three layers and the biases b_S⁽²⁾, b_N⁽²⁾, b_S⁽³⁾, b_N⁽³⁾ for two layers.

The processing of S121 to S129 is repeated as many times as there are items of learning data.

Note that there may be a plurality of target sound non-negative autoencoder calculation units 126 and a plurality of noise non-negative autoencoder calculation units 127, for example according to the number of target sound sources and the number of noise sources that have different characteristics.

According to the invention of this embodiment, a neural network that incorporates physical constraints is trained using (a large amount of) audio data, that is, sets of frequency domain mixed sound learning data, frequency domain target sound learning data, and frequency domain noise learning data. Specifically, this is a sound source enhancement neural network composed of non-negative autoencoder units that independently model the spectral characteristics of the target sound and of the noise, and a complementary subtraction neural network that generates the estimated target sound PSD and the estimated noise PSD. Training it in this way makes it possible to estimate the target sound PSD and the noise PSD accurately and stably. Using the estimated target sound PSD and noise PSD in turn makes it possible to generate the Wiener filter accurately and stably.

<Embodiment 2>
Embodiment 1 described a method of learning the network parameter p of the sound source enhancement neural network from sets of frequency domain mixed sound learning data X(τ, ω), frequency domain target sound learning data S(τ, ω), and frequency domain noise learning data N(τ, ω). This embodiment describes a method of generating an enhanced sound from an observation signal picked up by microphones, using the network parameter p learned in Embodiment 1. This makes it possible to output an enhanced sound in which the target sound in the observation signal is emphasized.

Note that the value of the network parameter p used here is the value of p at the end of learning by the sound source enhancement learning device 100 of Embodiment 1.

The sound source enhancement device 200 is described below with reference to FIGS. 8 and 9. FIG. 8 is a block diagram showing the configuration of the sound source enhancement device 200. FIG. 9 is a flowchart showing the operation of the sound source enhancement device 200. As shown in FIG. 8, the sound source enhancement device 200 includes a frequency domain conversion unit 910, a Wiener filter estimation unit 220, a signal enhancement unit 230, and a time domain conversion unit 940.

Like the sound source enhancement device 900, the sound source enhancement device 200 is connected to the learning result recording unit 990. The learning result recording unit 990 stores the value of the network parameter p at the end of learning described above.

The frequency domain conversion unit 910 converts the observation signal x(t), which is a time domain mixed sound, into the frequency domain and generates the frequency domain observation signal X(τ, ω) (S910).
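The text does not specify which frequency domain transform is used in S910. A common choice is the short-time Fourier transform, sketched below for a single channel. The Hann window, the frame length of 512 samples, the hop of 256 samples, and the 16 kHz test tone are all assumptions.

```python
import numpy as np

def stft(x, frame_len=512, hop=256):
    """Short-time Fourier transform of a real signal.

    Returns X with shape (n_frames, frame_len // 2 + 1), indexed by
    (frame tau, frequency bin omega).
    """
    window = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([
        x[i * hop : i * hop + frame_len] * window for i in range(n_frames)
    ])
    return np.fft.rfft(frames, axis=1)

fs = 16000                          # hypothetical sampling rate in Hz
t = np.arange(2048) / fs
x = np.sin(2 * np.pi * 1000.0 * t)  # 1 kHz test tone
X = stft(x)
```

With these settings the bin spacing is fs / frame_len = 31.25 Hz, so the 1 kHz tone falls on bin 32.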

The Wiener filter estimation unit 220 generates the Wiener filter G(τ, ω) from the frequency domain observation signal X(τ, ω) using the network parameter p read from the learning result recording unit 990 (S220). The Wiener filter estimation unit 220 is described below with reference to FIGS. 10 and 11. FIG. 10 is a block diagram showing the configuration of the Wiener filter estimation unit 220. FIG. 11 is a flowchart showing the operation of the Wiener filter estimation unit 220. As shown in FIG. 10, the Wiener filter estimation unit 220 includes the beamforming output power calculation unit 121, a target sound non-negative autoencoder unit 221, a noise non-negative autoencoder unit 222, a complementary subtraction unit 223, and a Wiener filter calculation unit 224.

First, the beamforming output power calculation unit 121 calculates the target sound BF output power φ_{Y_S}(τ,ω) and the noise BF output power φ_{Y_N}(τ,ω) from the frequency domain observation signal X(τ, ω) (S121).

Next, the target sound non-negative autoencoder unit 221 calculates the target sound spectrum characteristic vector q_S⁽³⁾, using the target sound BF output power φ_{Y_S}(τ,ω) calculated in S121 as the first-layer input q_S⁽¹⁾ (S221). The target sound non-negative autoencoder unit 221 differs from the target sound non-negative autoencoder calculation unit 126 only in that the network parameters p at the end of learning (the weight matrices W_S⁽²⁾, W_S⁽³⁾ and the bias vectors b_S⁽²⁾, b_S⁽³⁾) are set.

Similarly, the noise non-negative autoencoder unit 222 calculates the noise spectrum characteristic vector q_N⁽³⁾, using the noise BF output power φ_{Y_N}(τ,ω) calculated in S121 as the first-layer input q_N⁽¹⁾ (S222). The noise non-negative autoencoder unit 222 differs from the noise non-negative autoencoder calculation unit 127 only in that the network parameters p at the end of learning (the weight matrices W_N⁽²⁾, W_N⁽³⁾ and the bias vectors b_N⁽²⁾, b_N⁽³⁾) are set.

Next, the complementary subtraction unit 223 calculates the estimated target sound PSD q_S⁽⁴⁾ and the estimated noise PSD q_N⁽⁴⁾ from the target sound spectrum characteristic vector q_S⁽³⁾ and the noise spectrum characteristic vector q_N⁽³⁾ calculated in S221 and S222 (S223). The complementary subtraction unit 223 differs from the complementary subtraction neural network calculation unit 128 only in that the network parameter p at the end of learning (the weight matrix W⁽⁴⁾) is set.

Finally, the Wiener filter calculation unit 224 calculates the Wiener filter G(τ, ω) from the estimated target sound PSD q_S⁽⁴⁾ and the estimated noise PSD q_N⁽⁴⁾ calculated in S223 (S224).
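Step S224 can be sketched with the usual Wiener gain, the ratio of the target sound PSD to the sum of the target sound PSD and the noise PSD. The clipping of negative PSD estimates and the small constant eps are safeguards added here as assumptions; they are not stated in the text.

```python
import numpy as np

def wiener_filter(psd_S, psd_N, eps=1e-12):
    """Wiener gain per frequency bin from estimated PSDs.

    psd_S, psd_N : arrays of estimated target sound and noise PSDs.
    Negative estimates are clipped to zero (an added safeguard), and
    eps avoids division by zero when both PSDs vanish.
    """
    psd_S = np.maximum(psd_S, 0.0)
    psd_N = np.maximum(psd_N, 0.0)
    return psd_S / (psd_S + psd_N + eps)

q4_S = np.array([3.0, 0.0, 1.0])
q4_N = np.array([1.0, 2.0, 1.0])
G = wiener_filter(q4_S, q4_N)
```

The gain lies between 0 and 1: bins dominated by the target sound are passed, and bins dominated by noise are attenuated.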

The signal enhancement unit 230 performs signal enhancement (sound source enhancement) on the frequency domain observation signal X(τ, ω) using the Wiener filter G(τ, ω) estimated in S220, according to the following equation, and generates the frequency domain enhanced sound ^S(τ, ω) (S230).
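The enhancement equation is not reproduced in this text. For a Wiener filter applied as a gain in each time-frequency bin, the standard form is the product below. It is given as an assumption consistent with the quantities named above, with X(τ, ω) understood as the signal of a single reference channel.

```latex
\begin{equation}
\hat{S}(\tau,\omega) = G(\tau,\omega)\, X(\tau,\omega)
\end{equation}
```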

Note that, like the target sound non-negative autoencoder calculation unit 126 and the noise non-negative autoencoder calculation unit 127, there may be a plurality of target sound non-negative autoencoder units 221 and noise non-negative autoencoder units 222. In that case, the device includes the same number of each as the corresponding calculation units.

According to the invention of this embodiment, an enhanced sound in which the target sound in the observation signal is emphasized can be output accurately and stably.

<Modification>
In Embodiment 1, a neural network whose physical structure reflects the sound source enhancement algorithm was trained in order to perform sound source enhancement on a mixed sound. There, instead of the Wiener filter that is ultimately required for sound source enhancement, neural networks that output latent variables were trained: a neural network that outputs the target sound spectrum characteristic vector, a neural network that outputs the noise spectrum characteristic vector, and a neural network that outputs the estimated target sound PSD and the estimated noise PSD. The characteristics of the sound source enhancement algorithm were thereby incorporated as the physical structure of the neural network, yielding a neural network that can perform sound source enhancement accurately and stably.

Similarly, for processing of acoustic signals other than sound source enhancement, for processing of image signals, and more generally for processing of any signal obtained by converting radio waves or light waves into electrical signals, it is possible to configure neural networks that output the vectors each processing algorithm generates at intermediate stages (intermediate-stage estimation results), and to combine these neural networks as components into a neural network that yields the final processing result (estimation result). In this way a neural network corresponding to each processing algorithm can be configured, and that neural network can execute the processing accurately and stably.

<Supplementary note>
The device of the present invention has, for example as a single hardware entity, an input unit to which a keyboard or the like can be connected, an output unit to which a liquid crystal display or the like can be connected, a communication unit to which a communication device (for example, a communication cable) capable of communicating with the outside of the hardware entity can be connected, a CPU (Central Processing Unit, which may include a cache memory, registers, and the like), RAM and ROM as memory, an external storage device such as a hard disk, and a bus that connects the input unit, output unit, communication unit, CPU, RAM, ROM, and external storage device so that data can be exchanged among them. If necessary, the hardware entity may also be provided with a device (drive) that can read from and write to a recording medium such as a CD-ROM. A general-purpose computer is one example of a physical entity having such hardware resources.

The external storage device of the hardware entity stores the program required to realize the functions described above and the data required for processing by this program (the program is not limited to the external storage device and may, for example, be stored in a ROM, which is a read-only storage device). Data obtained by the processing of these programs is stored as appropriate in the RAM, the external storage device, or the like.

In the hardware entity, each program stored in the external storage device (or the ROM or the like) and the data required for processing by each program are read into memory as needed, and are interpreted, executed, and processed by the CPU as appropriate. As a result, the CPU realizes the predetermined functions (the components referred to above as "... unit", "... means", and so on).

The present invention is not limited to the embodiments described above and can be modified as appropriate without departing from the spirit of the present invention. The processing described in the above embodiments need not be executed in time series in the order described; it may be executed in parallel or individually according to the processing capability of the device that executes it or as needed.

As described above, when the processing functions of the hardware entity (the device of the present invention) described in the above embodiments are realized by a computer, the processing content of the functions that the hardware entity should have is described by a program. Executing this program on a computer realizes the processing functions of the hardware entity on the computer.

The program describing this processing content can be recorded on a computer-readable recording medium. The computer-readable recording medium may be of any kind, for example a magnetic recording device, an optical disc, a magneto-optical recording medium, or a semiconductor memory. Specifically, for example, a hard disk device, a flexible disk, or a magnetic tape can be used as the magnetic recording device; a DVD (Digital Versatile Disc), DVD-RAM (Random Access Memory), CD-ROM (Compact Disc Read Only Memory), or CD-R (Recordable) / RW (ReWritable) as the optical disc; an MO (Magneto-Optical disc) as the magneto-optical recording medium; and an EEP-ROM (Electronically Erasable and Programmable-Read Only Memory) as the semiconductor memory.

The program is distributed, for example, by selling, transferring, or lending a portable recording medium such as a DVD or CD-ROM on which the program is recorded. The program may also be distributed by storing it in the storage device of a server computer and transferring it from the server computer to other computers over a network.

A computer that executes such a program first stores, for example, the program recorded on the portable recording medium or the program transferred from the server computer in its own storage device. When executing the processing, the computer reads the program stored in its own recording medium and executes processing according to the read program. As another form of execution, the computer may read the program directly from the portable recording medium and execute processing according to it, or it may execute processing according to the received program each time the program is transferred to it from the server computer. The processing described above may also be executed by a so-called ASP (Application Service Provider) type service, in which the program is not transferred from the server computer to the computer and the processing functions are realized only through execution instructions and the acquisition of results. The program in this embodiment includes information that is provided for processing by an electronic computer and that is equivalent to a program (such as data that is not a direct command to the computer but has the property of defining the processing of the computer).

In this embodiment, the hardware entity is configured by executing a predetermined program on a computer, but at least part of the processing content may be realized in hardware.

100 Sound source enhancement learning device
110 Feature quantity / label generation unit
120 Pre-learning unit
121 Beamforming output power calculation unit
122 Network parameter calculation unit
126 Target sound non-negative autoencoder calculation unit
127 Noise non-negative autoencoder calculation unit
128 Complementary subtraction neural network calculation unit
129 Network parameter optimization unit
200 Sound source enhancement device
220 Wiener filter estimation unit
221 Target sound non-negative autoencoder unit
222 Noise non-negative autoencoder unit
223 Complementary subtraction unit
224 Wiener filter calculation unit
230 Signal enhancement unit
800 Sound source enhancement learning device
810 Feature quantity / label generation unit
820 Pre-learning unit
890 Learning data recording unit
900 Sound source enhancement device
910 Frequency domain conversion unit
920 Binary mask estimation unit
930 Signal enhancement unit
940 Time domain conversion unit
990 Learning result recording unit

Claims

A sound source enhancement learning device comprising:
a feature / label generation unit that generates, from sets of frequency-domain mixed sound learning data, frequency-domain target sound learning data, and frequency-domain noise learning data, a feature amount that is the frequency-domain mixed sound learning data and a label that is a pair of a target sound PSD and a noise PSD; and
a pre-learning unit that uses the feature amount and the label to learn network parameters characterizing a sound source enhancement neural network that takes an observation signal as input and outputs an estimated target sound PSD and an estimated noise PSD,
wherein the pre-learning unit comprises:
a target sound non-negative autoencoder calculation unit that calculates a target sound spectral characteristic vector, which models the spectral characteristics of the target sound, from a target sound BF output power, which is the power of a beamforming output signal that is generated from the frequency-domain mixed sound learning data and emphasizes the target sound;
a noise non-negative autoencoder calculation unit that calculates a noise spectral characteristic vector, which models the spectral characteristics of the noise, from a noise BF output power, which is the power of a beamforming output signal that is generated from the frequency-domain mixed sound learning data and emphasizes the noise;
a complementary subtraction neural network calculation unit that calculates the estimated target sound PSD and the estimated noise PSD from the target sound spectral characteristic vector and the noise spectral characteristic vector; and
a network parameter optimization unit that optimizes the network parameters using the estimated target sound PSD, the estimated noise PSD, the target sound PSD, and the noise PSD.
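
The data flow of the pre-learning unit can be illustrated with a minimal sketch. Everything beyond the overall structure is an assumption made for illustration: the ReLU activations, the layer sizes (two frequency bins, one hidden unit), the numeric weights, and the squared-error loss are hypothetical and are not taken from the claim, which specifies only the two non-negative autoencoders, the complementary subtraction network, and the optimization against the PSD labels.

```python
def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def relu(v):
    """Element-wise max(0, .); an assumed activation, not from the claim."""
    return [max(0.0, a) for a in v]

def nonneg_autoencoder(bf_power, W2, b2, W3, b3):
    """Map a BF output power vector to a spectral characteristic vector
    using weight matrices constrained to be non-negative."""
    assert all(w >= 0 for row in W2 + W3 for w in row), "weights must be non-negative"
    hidden = relu([a + b for a, b in zip(matvec(W2, bf_power), b2)])
    return relu([a + b for a, b in zip(matvec(W3, hidden), b3)])

def complementary_subtraction(q_s, q_n, W4):
    """Combine both spectral vectors into (estimated target PSD, estimated noise PSD)."""
    out = relu(matvec(W4, q_s + q_n))   # q_s + q_n concatenates the two lists
    half = len(q_s)
    return out[:half], out[half:]

def squared_error(est_target, est_noise, target_psd, noise_psd):
    """Mean squared error against the labels (assumed loss for the optimizer)."""
    est, ref = est_target + est_noise, target_psd + noise_psd
    return sum((e - r) ** 2 for e, r in zip(est, ref)) / len(est)

# Hypothetical toy parameters: 2 frequency bins, 1 hidden unit per autoencoder.
W2, b2 = [[0.5, 0.5]], [0.0]
W3, b3 = [[1.0], [1.0]], [0.0, 0.0]
W4 = [[1.0, 0.0, -0.5, 0.0],    # target output: keep target, subtract part of noise
      [0.0, 1.0, 0.0, -0.5],
      [-0.5, 0.0, 1.0, 0.0],    # noise output: keep noise, subtract part of target
      [0.0, -0.5, 0.0, 1.0]]

q_s = nonneg_autoencoder([4.0, 2.0], W2, b2, W3, b3)   # from target sound BF output power
q_n = nonneg_autoencoder([1.0, 1.0], W2, b2, W3, b3)   # from noise BF output power
est_target, est_noise = complementary_subtraction(q_s, q_n, W4)
loss = squared_error(est_target, est_noise, [2.0, 2.0], [0.5, 0.5])
```

In an actual training loop the loss would drive updates of all weight matrices and bias vectors; the sketch stops at the loss because the claim does not fix the optimization procedure.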

The sound source enhancement learning device according to claim 1, wherein, among the network parameters:
the network parameters used in the target sound non-negative autoencoder calculation unit are non-negative weight matrices W_S⁽²⁾ and W_S⁽³⁾ and bias vectors b_S⁽²⁾ and b_S⁽³⁾, where W_S⁽²⁾ and W_S⁽³⁾ satisfy the following relationship:

the network parameters used in the noise non-negative autoencoder calculation unit are non-negative weight matrices W_N⁽²⁾ and W_N⁽³⁾ and bias vectors b_N⁽²⁾ and b_N⁽³⁾, where W_N⁽²⁾ and W_N⁽³⁾ satisfy the following relationship:

the network parameter used in the complementary subtraction neural network calculation unit is a weight matrix W⁽⁴⁾, and the initial value W_init⁽⁴⁾ of the weight matrix W⁽⁴⁾ is

(where λ_{S,ω_i} and λ_{N,ω_i} are positive values, γ_{SN,ω_i} and γ_{NS,ω_i} are negative values, and 1 ≦ i ≦ Ω).

A sound source enhancement device that generates, from an observation signal, an enhanced sound in which the observation signal is enhanced, using a sound source enhancement neural network in which the network parameters generated by the sound source enhancement learning device according to claim 1 are set, the sound source enhancement device comprising:
a frequency domain transform unit that generates a frequency-domain observation signal from the observation signal;
a Wiener filter estimation unit that uses the sound source enhancement neural network to generate an estimated target sound PSD and an estimated noise PSD from the frequency-domain observation signal, and estimates a Wiener filter from the estimated target sound PSD and the estimated noise PSD;
a signal enhancement unit that uses the Wiener filter to generate a frequency-domain enhanced sound from the frequency-domain observation signal; and
a time domain transform unit that transforms the frequency-domain enhanced sound into the time domain to generate the enhanced sound.
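
The Wiener filter estimation and signal enhancement steps can be sketched for a single frame as follows. This is a minimal illustration that assumes the standard per-frequency Wiener gain, G = S / (S + N), computed from the estimated target sound PSD S and the estimated noise PSD N; the function names, the small constant `eps`, and the toy values are hypothetical, and the frequency and time domain transforms are omitted.

```python
def wiener_gain(target_psd, noise_psd, eps=1e-12):
    """Per-frequency Wiener gain G = S / (S + N); eps guards against division by zero."""
    return [s / (s + n + eps) for s, n in zip(target_psd, noise_psd)]

def enhance_frame(observed_spectrum, target_psd, noise_psd):
    """Multiply each complex frequency bin of the observation by its Wiener gain."""
    gain = wiener_gain(target_psd, noise_psd)
    return [g * x for g, x in zip(gain, observed_spectrum)]

# Toy frame with three frequency bins (hypothetical values).
observed = [1 + 1j, 2 - 1j, 0.5j]
target_psd = [3.0, 1.0, 0.0]   # estimated target sound PSD
noise_psd = [1.0, 1.0, 2.0]    # estimated noise PSD
enhanced = enhance_frame(observed, target_psd, noise_psd)
```

Bins where the estimated target sound PSD dominates pass almost unchanged, while bins dominated by noise are attenuated toward zero.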

A sound source enhancement learning method comprising:
a feature / label generation step of generating, from sets of frequency-domain mixed sound learning data, frequency-domain target sound learning data, and frequency-domain noise learning data, a feature amount that is the frequency-domain mixed sound learning data and a label that is a pair of a target sound PSD and a noise PSD; and
a pre-learning step of using the feature amount and the label to learn network parameters characterizing a sound source enhancement neural network that takes an observation signal as input and outputs an estimated target sound PSD and an estimated noise PSD,
wherein the pre-learning step includes:
a target sound non-negative autoencoder calculation step of calculating a target sound spectral characteristic vector, which models the spectral characteristics of the target sound, from a target sound BF output power, which is the power of a beamforming output signal that is generated from the frequency-domain mixed sound learning data and emphasizes the target sound;
a noise non-negative autoencoder calculation step of calculating a noise spectral characteristic vector, which models the spectral characteristics of the noise, from a noise BF output power, which is the power of a beamforming output signal that is generated from the frequency-domain mixed sound learning data and emphasizes the noise;
a complementary subtraction neural network calculation step of calculating the estimated target sound PSD and the estimated noise PSD from the target sound spectral characteristic vector and the noise spectral characteristic vector; and
a network parameter optimization step of optimizing the network parameters using the estimated target sound PSD, the estimated noise PSD, the target sound PSD, and the noise PSD.

A program for causing a computer to function as the sound source enhancement learning device according to claim 1 or 2, or the sound source enhancement device according to claim 3.

A signal processing learning device that learns network parameters characterizing a signal processing neural network, which takes a signal as input and outputs an estimated processing result for the signal, using learning data consisting of pairs of a feature amount relating to the signal and a label that is a processing result of the signal,
wherein an estimated processing result calculated at an intermediate stage between the feature amount calculated from the signal and the output of the estimated processing result is denoted as intermediate-stage estimated processing result i (1 ≦ i ≦ n, where n is an integer of 1 or more), the signal processing learning device comprising:
a neural network 1 calculation unit that calculates intermediate-stage estimated processing result 1 using the feature amount;
a neural network j calculation unit that calculates intermediate-stage estimated processing result j using any of the feature amount and intermediate-stage estimated processing results 1 to j-1 (2 ≦ j ≦ n);
a neural network n+1 calculation unit that calculates the estimated processing result using any of the feature amount and intermediate-stage estimated processing results 1 to n; and
a network parameter optimization unit that optimizes the network parameters using the estimated processing result and the label.
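
The staged structure of this last claim can be illustrated with a short sketch. It assumes that each stage is represented by a pair of an input index and a callable, where index 0 selects the feature amount and index i selects intermediate-stage result i; the function `run_cascade` and the toy stage functions are hypothetical stand-ins for trained networks and do not appear in the source.

```python
def run_cascade(feature, stages):
    """Run n+1 stage networks in order.

    `stages` is a list of (input_index, network) pairs. Each network reads either
    the feature (index 0) or an earlier intermediate result (index i >= 1) and
    produces the next result. The last stage produces the final estimate.
    """
    results = [feature]
    for input_index, network in stages:
        if not 0 <= input_index < len(results):
            raise ValueError("a stage may only read the feature or earlier results")
        results.append(network(results[input_index]))
    return results[-1], results[1:-1]   # final estimate, intermediate results 1..n

# Toy stand-ins for three networks (n = 2 intermediate stages).
stages = [
    (0, lambda v: [2 * a for a in v]),   # network 1: reads the feature
    (1, lambda v: [a + 1 for a in v]),   # network 2: reads intermediate result 1
    (0, lambda v: [sum(v)]),             # network 3: reads the feature again
]
final, intermediates = run_cascade([1.0, 2.0], stages)
```

The index check enforces the ordering constraint in the claim: stage j can depend only on the feature amount or on results produced before it.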