JP2019078864A

JP2019078864A - Musical sound emphasis device, convolution auto encoder learning device, musical sound emphasis method, and program

Info

Publication number: JP2019078864A
Application number: JP2017205041A
Authority: JP
Inventors: 健太丹羽; Kenta Niwa; 一哉武田; Kazuya Takeda; 隆典西野; Takanori Nishino; 健登大谷; Kento Otani
Original assignee: Nagoya University NUC; Nippon Telegraph and Telephone Corp
Current assignee: Nagoya University NUC; Nippon Telegraph and Telephone Corp
Priority date: 2017-10-24
Filing date: 2017-10-24
Publication date: 2019-05-23

Abstract

To provide a musical sound emphasis technique that enables highly accurate sound source emphasis, using a DNN that reflects the characteristics of musical instrument sounds. SOLUTION: The musical sound emphasis device includes a frequency domain conversion unit that converts a time domain music signal into the frequency domain to generate a frequency domain music signal, a Wiener filter estimation unit that estimates, from the frequency domain music signal, a Wiener filter used to emphasize a predetermined musical instrument sound, a signal emphasizing unit that generates a frequency domain emphasized musical sound from the frequency domain music signal and the Wiener filter, and a time domain conversion unit that generates an emphasized musical sound in the time domain from the frequency domain emphasized musical sound. The Wiener filter estimation unit estimates the Wiener filter from the frequency domain music signal, using a deep neural network including a 2n-layer convolutional denoising auto-encoder that takes the amplitude spectrum of the music signal in the logarithmic frequency domain as input and outputs the amplitude spectrum of a part of the music signal in the logarithmic frequency domain. SELECTED DRAWING: Figure 1

Description

The present invention relates to a sound source emphasis technique that emphasizes the source signal of a specific sound source in a signal in which the source signals of various sound sources are mixed, and in particular to a musical sound emphasis technique that emphasizes a target musical sound signal in a music signal composed of a plurality of musical sound signals.

Methods for emphasizing the source signal of a specific sound source in a signal in which the source signals of various sound sources are mixed have long been widely studied. Recently, many approaches, such as that of Non-Patent Document 1, obtain a mapping between a plurality of variables by supervised learning using deep neural networks (DNN).

The sound source emphasis device 900 of Non-Patent Document 1 is described below with reference to FIGS. 13 and 14 (see Fig. 1 of Non-Patent Document 1). The sound source emphasis device 900 outputs an emphasized sound in which the target sound contained in an observation signal picked up by a microphone is emphasized. FIG. 13 is a block diagram showing the configuration of the sound source emphasis device 900. FIG. 14 is a flowchart showing the operation of the sound source emphasis device 900. As shown in FIG. 13, the sound source emphasis device 900 includes a frequency domain conversion unit 910, a Wiener filter estimation unit 920, a signal emphasizing unit 930, and a time domain conversion unit 940.

The sound source emphasis device 900 is connected to a learning result recording unit 990. The learning result recording unit 990 stores the values of the DNN network parameters learned in advance. The DNN network parameters are the weight matrices W_S^(2), W_S^(3) and bias vectors b_S^(2), b_S^(3); the weight matrices W_N^(2), W_N^(3) and bias vectors b_N^(2), b_N^(3); and the weight matrix W^(4).

The frequency domain conversion unit 910 converts the observation signal x(t), which is a time domain mixed sound, into the frequency domain and generates a frequency domain observation signal X(ω,τ) (S910). Here, t is the time index, ω is the frequency bin number, and τ is the frame number.

The Wiener filter estimation unit 920 estimates the Wiener filter ^W(ω,τ) from the frequency domain observation signal X(ω,τ), using the network parameters read from the learning result recording unit 990 (S920). The Wiener filter estimation unit 920 is described below with reference to FIGS. 15 and 16. FIG. 15 is a block diagram showing the configuration of the Wiener filter estimation unit 920. FIG. 16 is a flowchart showing the operation of the Wiener filter estimation unit 920. As shown in FIG. 15, the Wiener filter estimation unit 920 includes a beamforming output power calculation unit 921, a target sound non-negative auto-encoder unit 922, a noise non-negative auto-encoder unit 923, a complementary subtraction unit 924, and a Wiener filter calculation unit 925.

First, the beamforming output power calculation unit 921 calculates the target sound BF output power φ_{Y_S}(ω,τ) and the noise BF output power φ_{Y_N}(ω,τ) from the frequency domain observation signal X(ω,τ) (S921).

Next, the target sound non-negative auto-encoder unit 922 takes the target sound BF output power φ_{Y_S}(ω,τ) calculated in S921 as the first-layer input q_S^(1) and calculates the target sound spectral characteristic vector q_S^(3) (S922). The DNN network parameters learned in advance (specifically, the weight matrices W_S^(2), W_S^(3) and the bias vectors b_S^(2), b_S^(3)) are set in the target sound non-negative auto-encoder unit 922.

Similarly, the noise non-negative auto-encoder unit 923 takes the noise BF output power φ_{Y_N}(ω,τ) calculated in S921 as the first-layer input q_N^(1) and calculates the noise spectral characteristic vector q_N^(3) (S923). The DNN network parameters learned in advance (specifically, the weight matrices W_N^(2), W_N^(3) and the bias vectors b_N^(2), b_N^(3)) are set in the noise non-negative auto-encoder unit 923.

Next, the complementary subtraction unit 924 calculates the estimated target sound PSD q_S^(4) and the estimated noise PSD q_N^(4) from the target sound spectral characteristic vector q_S^(3) and the noise spectral characteristic vector q_N^(3) calculated in S922 and S923 (S924). The DNN network parameter learned in advance (specifically, the weight matrix W^(4)) is set in the complementary subtraction unit 924.

Finally, the Wiener filter calculation unit 925 calculates the Wiener filter ^W(ω,τ) from the estimated target sound PSD q_S^(4) and the estimated noise PSD q_N^(4) calculated in S924 (S925).
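The text does not state how the filter is obtained from the two PSD estimates. A minimal sketch, assuming the textbook Wiener gain (target PSD divided by the sum of target and noise PSDs), is shown below; the function name `wiener_gain` and the small constant `eps` are illustrative assumptions, not taken from the source.

```python
def wiener_gain(target_psd, noise_psd, eps=1e-12):
    """Per-bin Wiener gain for one frame (assumed textbook form, not from the source).

    target_psd, noise_psd: nonnegative PSD estimates, one value per frequency bin.
    eps: hypothetical guard against division by zero when both PSDs are zero.
    """
    return [s / (s + n + eps) for s, n in zip(target_psd, noise_psd)]


q_s4 = [4.0, 1.0, 0.0]  # estimated target sound PSD per frequency bin
q_n4 = [1.0, 1.0, 2.0]  # estimated noise PSD per frequency bin
gain = wiener_gain(q_s4, q_n4)  # values lie between 0 and 1
```

Bins dominated by the target get a gain close to 1, and bins dominated by noise get a gain close to 0.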

The signal emphasizing unit 930 performs signal emphasis (sound source emphasis) from the frequency domain observation signal X(ω,τ) and the Wiener filter ^W(ω,τ) estimated in S920 according to the following equation, and generates the frequency domain emphasized sound S'(ω,τ) (S930).
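The equation itself is not reproduced in this text. Given the quantities named above, it is presumably the bin-wise product of the estimated filter and the observation (an assumption based on the usual way a Wiener filter is applied):

```latex
S'(\omega,\tau) = \hat{W}(\omega,\tau)\, X(\omega,\tau)
```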

The time domain conversion unit 940 generates, from the frequency domain emphasized sound S'(ω,τ), the emphasized sound s'(t), which is the estimated source signal in the time domain (S940).

Following Non-Patent Document 1, the sound source emphasis device 900 models all sounds other than the target sound as a single noise source, but they may instead be modeled as a plurality of noise sources, one per sound source. In that case, the sound source emphasis device 900 includes as many noise non-negative auto-encoder units 923 as there are noise sources. For example, to emphasize the guitar as the target sound among three sources (guitar, bass, and drums), the bass and drums may be modeled together as one noise source, or as two noise sources, with the bass as noise source 1 and the drums as noise source 2.

Non-Patent Document 1: K. Niwa, Y. Koizumi, T. Kawase, K. Kobayashi, and Y. Hioka, “Supervised source enhancement composed of nonnegative auto-encoders and complementarity subtraction”, Proc. of 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2017), pp. 266-270, 2017.

The sound source emphasis method of Non-Patent Document 1 is characterized by the Wiener filter estimation unit 920. That is, the Wiener filter estimation unit 920 models the target sound spectrum and the noise spectrum by using auto-encoders to calculate the target sound spectral characteristic vector q_S^(3) and the noise spectral characteristic vector q_N^(3). Specifically, the weight matrices W_S^(2), W_S^(3), W_N^(2), W_N^(3), which are DNN network parameters, are constrained so that (i) every element is non-negative and (ii) the weight matrices projecting to the second and third layers are transposes of each other (that is, W_S^(3) = (W_S^(2))^T and W_N^(3) = (W_N^(2))^T). These constraints correspond to a decomposition into two matrices: a matrix of the bases (characteristic spectra) that make up the target sound or the noise, and a matrix representing how strongly each basis is activated. This yields a DNN that reflects the physical characteristics of the target sound spectrum and the noise spectrum.
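The two constraints can be illustrated with a small forward pass. This is only a sketch: the ReLU activation, the layer shapes, and the helper names are assumptions for illustration, and Non-Patent Document 1 may define the layers differently.

```python
def relu(v):
    return [max(0.0, x) for x in v]


def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def transpose(m):
    return [list(col) for col in zip(*m)]


def tied_nonnegative_autoencoder(q1, w2, b2, b3):
    """Forward pass q^(1) -> q^(2) -> q^(3) with W^(3) tied to the transpose of W^(2)."""
    # Constraint (i): every weight is nonnegative.
    assert all(x >= 0.0 for row in w2 for x in row), "weights must be nonnegative"
    # Layer 2: activation of each basis (assumed ReLU activation).
    q2 = relu([x + b for x, b in zip(matvec(w2, q1), b2)])
    # Constraint (ii): the third-layer weights are the transpose of the second-layer weights.
    w3 = transpose(w2)
    # Layer 3: spectrum rebuilt as a nonnegative combination of the bases.
    q3 = relu([x + b for x, b in zip(matvec(w3, q2), b3)])
    return q2, q3


w2 = [[1.0, 0.0, 1.0],   # basis 1 (one row per basis, one column per frequency bin)
      [0.0, 2.0, 0.0]]   # basis 2
q2, q3 = tied_nonnegative_autoencoder([1.0, 2.0, 3.0], w2, [0.0, 0.0], [0.0, 0.0, 0.0])
```

Here each row of `w2` plays the role of a spectral basis, `q2` holds the activations, and `q3` is the nonnegative recombination of the bases.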

However, because the method of Non-Patent Document 1 accepts every kind of sound as input, the structure of the above DNN is rather complex. If the input is restricted to music signals composed of a plurality of instrument sounds, such as the guitar, bass, and drums mentioned above, and the DNN is designed to reflect the characteristics of instrument sounds, the target sound spectrum and the noise spectrum can be expected to be represented by a simpler and more efficient DNN. Such a DNN structure is also expected to improve the accuracy of sound source emphasis.

Accordingly, an object of the present invention is to provide a musical sound emphasis technique that can perform sound source emphasis with high accuracy by using a DNN that reflects the characteristics of instrument sounds.

One aspect of the present invention includes: a frequency domain conversion unit that converts a time domain music signal x(t), which is a monaural mixed music signal, into the frequency domain and generates a frequency domain music signal X(ω,τ); a Wiener filter estimation unit that estimates, from the frequency domain music signal X(ω,τ), a Wiener filter ^W(ω,τ) used to emphasize a predetermined instrument sound; a signal emphasizing unit that generates a frequency domain emphasized musical sound S'(ω,τ) from the frequency domain music signal X(ω,τ) and the Wiener filter ^W(ω,τ); and a time domain conversion unit that generates an emphasized musical sound s'(t) in the time domain from the frequency domain emphasized musical sound S'(ω,τ). The Wiener filter estimation unit estimates the Wiener filter ^W(ω,τ) from the frequency domain music signal X(ω,τ) using a deep neural network including a 2n-layer (n is an integer of 1 or more) convolutional denoising auto-encoder that takes the amplitude spectrum of a music signal in the logarithmic frequency domain as input and outputs the amplitude spectrum of a part of the music signal in the logarithmic frequency domain.

According to the present invention, a specific musical sound in a music signal composed of a plurality of instrument sounds can be emphasized with high accuracy.

FIG. 1 is a block diagram showing the configuration of the musical sound emphasis device 100.
FIG. 2 is a flowchart showing the operation of the musical sound emphasis device 100.
FIG. 3 is a block diagram showing the configuration of the Wiener filter estimation unit 120.
FIG. 4 is a flowchart showing the operation of the Wiener filter estimation unit 120.
FIG. 5 is a block diagram showing the configuration of the Wiener filter estimation unit 220.
FIG. 6 is a flowchart showing the operation of the Wiener filter estimation unit 220.
FIG. 7 is a block diagram showing the configuration of the Wiener filter estimation unit 320.
FIG. 8 is a flowchart showing the operation of the Wiener filter estimation unit 320.
FIG. 9 is a block diagram showing the configuration of the Wiener filter estimation unit 420.
FIG. 10 is a flowchart showing the operation of the Wiener filter estimation unit 420.
FIG. 11 is a block diagram showing the configuration of the convolutional auto-encoder learning device 500.
FIG. 12 is a flowchart showing the operation of the convolutional auto-encoder learning device 500.
FIG. 13 is a block diagram showing the configuration of the sound source emphasis device 900.
FIG. 14 is a flowchart showing the operation of the sound source emphasis device 900.
FIG. 15 is a block diagram showing the configuration of the Wiener filter estimation unit 920.
FIG. 16 is a flowchart showing the operation of the Wiener filter estimation unit 920.

Embodiments of the present invention are described in detail below. Components having the same function are given the same reference numerals, and redundant description is omitted.

Before describing the embodiments, the notation used in this specification is explained.

_ (underscore) denotes a subscript. For example, x^y_z means that y_z is a superscript of x, and x_{y_z} means that y_z is a subscript of x.

<Technical background>
The following deals with monaural (1 ch) music signals and describes a framework for estimating the target sound spectrum and the noise spectrum using the characteristics of instrument sounds. First, a model of the spectral characteristics of instrument sounds (the spectral model of instrument sounds) is described. Next, the structure of a DNN that reflects this spectral model is described. The DNN structure described here corresponds to feeding a logarithmic frequency spectrum into a convolutional denoising auto-encoder (CDAE: Convolutional De-noising Auto-Encoder).

Assume that the monaural mixed music signal X(ω,τ) in the frequency domain is expressed as the simple sum of N (N is an integer of 1 or more) source signals S_n(ω,τ).
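The corresponding equation is not reproduced in this text. From the sentence above it presumably reads as follows (labeled (1) on the assumption that it is the first numbered equation):

```latex
X(\omega,\tau) = \sum_{n=1}^{N} S_n(\omega,\tau) \tag{1}
```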

Here, n is the number identifying each sound source, ω is the frequency bin number (frequency index), and τ is the frame number (time frame index).

Each sound source is assumed to correspond to a predetermined musical instrument.

[Spectral model of instrument sound]
The spectral characteristic S_n(ω',τ) of many instrument sounds is expressed as a weighted sum of spectral bases that vary depending on the fundamental frequency and on the instrument.

Here, I_n is the number of spectral bases that can be considered as different fundamental frequencies, and w_{i,n}(ω',τ) and H_{i,n}(τ) are, respectively, the spectral basis corresponding to the i-th fundamental frequency of the n-th instrument sound in frame τ and the weight (gain) of that basis. ω' is the logarithmic frequency bin number.

Here, a logarithmic frequency on the cent scale is assumed. For example, the conversion from a frequency f_k [Hz] to a logarithmic frequency f_cent [cents] on the cent scale is expressed as follows.

The constant c is usually set to c = 1200.
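The conversion equation is not reproduced in this text. As a sketch, the usual cent conversion is c times the base-2 logarithm of the ratio to a reference frequency; the reference value `REF_HZ` below (about 16.35 Hz, a convention often used in music signal processing) is an assumption, since the source does not state which reference it uses.

```python
import math

# Assumed reference frequency (about 16.35 Hz); the source does not specify it.
REF_HZ = 440.0 * 2.0 ** (3.0 / 12.0 - 5.0)


def hz_to_cents(f_hz, c=1200.0, ref_hz=REF_HZ):
    """Convert a frequency in Hz to a log frequency in cents."""
    return c * math.log2(f_hz / ref_hz)


def cents_to_hz(f_cent, c=1200.0, ref_hz=REF_HZ):
    """Inverse conversion, from cents back to Hz."""
    return ref_hz * 2.0 ** (f_cent / c)


a4_cents = hz_to_cents(440.0)
octave_cents = hz_to_cents(880.0) - hz_to_cents(440.0)
```

With c = 1200, one octave spans 1200 cents and one equal-tempered semitone spans 100 cents, regardless of the reference frequency chosen.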

The logarithmic frequency bin numbers ω' are obtained by discretizing the scale produced by this conversion at intervals that are as uniform as possible.

A method of converting Ω linear frequency bins {ω_1,…,ω_Ω} into D logarithmic frequency bands {ω'_1,…,ω'_D} is described below.

Let the sampling frequency be F_s [Hz]. Logarithmic conversion of the center frequency of the k-th (1≦k≦Ω) frequency bin gives the following.

Similarly, logarithmic conversion of the center frequency of the d-th (1≦d≦D) logarithmic frequency band gives the following.

Next, the index correspondence (d,k) for which these two values coincide is obtained using the following equation, which yields the conversion to logarithmic frequency bands. Note that d and k are not in one-to-one correspondence; a group of several indices k corresponds to a single logarithmic frequency band.
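The three equations of this procedure are not reproduced in this text. The sketch below only illustrates the idea under stated assumptions: the center frequency of linear bin k is taken to be k·F_s/(2Ω), the D band centers are spaced uniformly on the log axis between the lowest and highest bin, and each bin is assigned to the nearest band. The function name and these choices are hypothetical, not the source's definitions.

```python
import math


def linear_to_log_bands(num_bins, num_bands, fs):
    """Return, for each linear bin k = 1..num_bins, the 1-based log band index d.

    Assumptions (not from the source): bin k is centred at k * fs / (2 * num_bins),
    band centres are uniform on the log2 axis, and bins go to the nearest band.
    log2 differs from the cent scale only by the constant factor c.
    """
    log_f = [math.log2(k * fs / (2.0 * num_bins)) for k in range(1, num_bins + 1)]
    lo, hi = log_f[0], log_f[-1]
    step = (hi - lo) / (num_bands - 1)  # spacing between neighbouring band centres
    return [1 + round((x - lo) / step) for x in log_f]


mapping = linear_to_log_bands(num_bins=8, num_bands=4, fs=16000)
```

In this toy setting the three highest bins share band 4 and bins 3 to 5 share band 3, which shows the many-to-one correspondence. With many bands, some low-frequency bands may receive no bin at all, which a real implementation would need to handle.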

We now return to equation (2). The spectral basis w_{i,n}(ω',τ) of the n-th instrument sound can be decomposed, as in the following equation, into a harmonic structure component C_i(ω',τ) that varies depending on the fundamental frequency and an envelope structure component F_n(ω',τ) whose characteristics differ from instrument to instrument.
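The equation is not reproduced in this text. Since the later derivation speaks of a product of variables, the decomposition is presumably multiplicative, as in a source-filter model (a reconstruction under that assumption):

```latex
w_{i,n}(\omega',\tau) = C_i(\omega',\tau)\, F_n(\omega',\tau) \tag{3}
```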

The harmonic structure component C_i(ω',τ) and the envelope structure component F_n(ω',τ) are expressed, respectively, by a function having peaks at the fundamental frequency and at the frequencies of its overtones, and by an all-pole function, as follows.

Here, R is the number of overtone components of the harmonic structure component, f_ω' [cents] is the logarithmic frequency corresponding to the logarithmic frequency bin ω', f_o(i,τ) [cents] is the logarithmic fundamental frequency of the i-th harmonic structure component in frame τ, a_{n,p} are the autoregressive filter coefficients of the all-pole function of the n-th envelope structure component, and ω_{i,r}(τ) [rad] is the normalized angular frequency corresponding to the r-th overtone of the i-th harmonic structure component in frame τ. j denotes the imaginary unit.

Equation (4) shows that, for the harmonic structure component, a change in the fundamental frequency f_o(i,τ) can be expressed by shifting the entire harmonic structure component along the logarithmic frequency axis. For example, the harmonic structure component obtained when the fundamental frequency is increased by f [cents] can be calculated as follows.
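The shift property can be checked numerically. The sketch below uses a toy harmonic comb on a grid of 100 cents per bin (both the grid and the 1/r partial amplitudes are illustrative assumptions, not equation (4) itself) and shows that raising the fundamental equals convolving the template with a shifted unit impulse.

```python
import math


def harmonic_template(num_bins, f0_bin, num_harmonics, cents_per_bin=100.0):
    """Toy harmonic comb: the r-th partial sits 1200*log2(r) cents above the fundamental."""
    spec = [0.0] * num_bins
    for r in range(1, num_harmonics + 1):
        b = f0_bin + round(1200.0 * math.log2(r) / cents_per_bin)
        if b < num_bins:
            spec[b] += 1.0 / r  # arbitrary decaying partial amplitude (assumption)
    return spec


def convolve_full(x, h):
    """Plain linear convolution of two sequences."""
    y = [0.0] * (len(x) + len(h) - 1)
    for i, xi in enumerate(x):
        for j, hj in enumerate(h):
            y[i + j] += xi * hj
    return y


num_bins = 48
base = harmonic_template(num_bins, f0_bin=0, num_harmonics=4)
impulse = [0.0] * num_bins
impulse[7] = 1.0  # raise the fundamental by 7 bins, that is 700 cents
shifted = convolve_full(base, impulse)[:num_bins]
direct = harmonic_template(num_bins, f0_bin=7, num_harmonics=4)
```

Because the partial spacing in cents does not depend on the fundamental, `shifted` and `direct` coincide. This holds only on a logarithmic frequency axis; on a linear axis the partial spacing would scale with the fundamental.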

Here, for the harmonic structure component C_{f_o(i,τ)}(f_ω',τ), if the fundamental frequency is fixed to an arbitrary value f_o [cents] and H_{i,n}(τ) is regarded as a function H_n(f_o(i,τ),τ) that varies depending on the fundamental frequency f_o(i,τ), equation (2) can be transformed into equation (6).

Here, * denotes the convolution operation and δ(·) denotes the unit impulse; the product of variables is expressed as a convolution with an impulse.

In this transformation, i is replaced with f_o(i,τ) to make clear that the i-th harmonic structure component is determined by the fundamental frequency f_o(i,τ), and the logarithmic frequency bin number ω' is rewritten as the corresponding logarithmic frequency f_ω' [cents].

Equation (6) shows that the spectral characteristic of an instrument sound is expressed as the convolution of the harmonic structure component corresponding to an arbitrary fundamental frequency f_o with impulses whose magnitudes correspond to the gain of each harmonic structure component and to the gain of the instrument-specific envelope structure component at frequency f_ω'.

[Structure of DNN reflecting the spectral model of instrument sound]
Next, the structure of a DNN that reflects the spectral model of instrument sounds is described.

A DNN is constructed using the convolutional decomposition model of the instrument sound spectrum in equation (6). This allows the spectrum of the target sound (that is, the instrument sound to be emphasized) to be represented efficiently as a DNN, and as a result the spectrum of the target sound is expected to be estimated with high accuracy.

Specifically, the target instrument sound spectrum is represented as a DNN having the following two-stage convolution.

Here, g(·) denotes the activation function.

Since the instrument sound spectrum and its decomposed components are non-negative, using, for example, the following ReLU (ramp function) as the activation function g(·) is considered to make equations (7) and (8) equivalent to equation (6).
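The equation is not reproduced in this text. The standard definition of the ReLU (ramp function) is:

```latex
g(x) = \max(0,\, x)
```

For non-negative arguments g(x) = x, which is why the activation leaves a non-negative spectral model unchanged.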

In other words, the convolutional decomposition model of the instrument sound spectrum in equation (6) is represented by a CDAE that computes equation (7) and a CDAE that computes equation (8), that is, by a DNN including a two-layer CDAE. Using this DNN is therefore considered to make it possible to obtain the instrument sound spectrum efficiently from the mixed music spectrum.
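Equations (7) and (8) are not reproduced in this text, so the following is only a structural sketch of "two convolutions along the log-frequency axis, each followed by ReLU". It uses a single channel, no bias, hand-picked kernels, and zero padding, all of which are assumptions; the kernel length equals the number of input bins, in line with the remark on filter size later in this section. In an actual CDAE the kernels would be learned from pairs of mixture and target spectra.

```python
def relu(v):
    return [max(0.0, x) for x in v]


def conv_same(x, kernel):
    """1-D convolution along the log-frequency axis, zero padded so len(out) == len(x)."""
    n, centre = len(x), len(kernel) // 2
    out = [0.0] * n
    for m in range(n):
        for j in range(n):
            k = m - j + centre  # kernel tap that links input bin j to output bin m
            if 0 <= k < len(kernel):
                out[m] += x[j] * kernel[k]
    return out


def two_layer_cdae(x, kernel1, kernel2):
    """Two convolution stages with ReLU (single channel, no bias: an assumption)."""
    hidden = relu(conv_same(x, kernel1))
    return relu(conv_same(hidden, kernel2))


x = [0.0, 1.0, 0.0, 2.0, 0.0]          # toy log-frequency amplitude spectrum
identity = [0.0, 0.0, 1.0, 0.0, 0.0]   # impulse at the kernel centre
shift_up = [0.0, 0.0, 0.0, 1.0, 0.0]   # impulse one bin above the centre
y = two_layer_cdae(x, identity, shift_up)
```

With these impulse kernels the network simply moves the spectrum up by one bin, which mirrors how a convolution with an impulse encodes a shift of the fundamental frequency.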

The CDAE included in the DNN reflecting the spectral model of instrument sounds need not be limited to two layers; in general it may have 2n layers (n is an integer of 1 or more).

When the DNN reflecting the spectral model of instrument sounds is trained as a DNN including a CDAE, the assignment of components to layers is not fixed even if an appropriate spectral model is learned, so the envelope structure component and the harmonic structure component may be output in swapped order.

In addition, when the DNN reflecting the spectral model of instrument sounds is trained as a DNN including a two-layer CDAE, the size of the convolution filter of each layer in the frequency direction (the filter size) should be equal to the number of frequency bins of the input spectrum. Experiments with filter sizes smaller than the input spectrum confirmed that performance was highest when the filter size matched the input spectrum.

<First embodiment>
The musical sound emphasis device 100 is described below with reference to FIGS. 1 and 2. FIG. 1 is a block diagram showing the configuration of the musical sound emphasis device 100. FIG. 2 is a flowchart showing the operation of the musical sound emphasis device 100. As shown in FIG. 1, the musical sound emphasis device 100 includes a frequency domain conversion unit 910, a Wiener filter estimation unit 120, a signal emphasizing unit 930, and a time domain conversion unit 940.

Like the sound source emphasis device 900, the musical sound emphasis device 100 is connected to the learning result recording unit 990. The learning result recording unit 990 stores the values of the network parameters at the end of DNN training. These network parameters are used by the Wiener filter estimation unit 120.

The frequency domain conversion unit 910 converts the time domain music signal x(t), which is a monaural mixed music signal, into the frequency domain and generates a frequency domain music signal X(ω,τ) (S910). Here, t is the time index, ω is the frequency bin number, and τ is the frame number.

The Wiener filter estimation unit 120 estimates, from the frequency domain music signal X(ω,τ), the Wiener filter ^W(ω,τ) used to emphasize a predetermined instrument sound (hereinafter, the target sound), using the network parameters read from the learning result recording unit 990 (S120). The predetermined instrument sound is the sound of one of the instruments used to produce the monaural mixed music signal. For example, if the monaural mixed music signal is produced by a guitar, a bass, and drums, the guitar sound can be chosen as the predetermined instrument sound.

The Wiener filter estimation unit 120 is described below with reference to FIGS. 3 and 4. FIG. 3 is a block diagram showing the configuration of the Wiener filter estimation unit 120. FIG. 4 is a flowchart showing the operation of the Wiener filter estimation unit 120. As shown in FIG. 3, the Wiener filter estimation unit 120 includes a logarithmic amplitude spectrum calculation unit 121, a target sound convolutional auto-encoder unit 122, a logarithmic-to-linear frequency conversion unit 123, and a Wiener filter calculation unit 124.

The log amplitude spectrum calculation unit 121 converts the frequency domain music signal X(ω,τ) into the logarithmic frequency domain and calculates the music signal log amplitude spectrum |X(ω′,τ)| (S121). The music signal log amplitude spectrum |X(ω′,τ)| is the amplitude spectrum of the logarithmic frequency domain music signal X(ω′,τ), which is obtained by converting X(ω,τ) from the frequency domain to the logarithmic frequency domain. This conversion may be performed as described in <Technical background>.
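The conversion itself is defined in <Technical background>, which lies outside this passage. Purely as an illustration, the sketch below assumes that the linear-frequency amplitude spectrum is resampled by linear interpolation onto geometrically spaced centre frequencies. The name `linear_to_log` and the parameters `n_log_bins` and `f_min` are hypothetical, and the method actually described in the specification may differ.

```python
import numpy as np

def linear_to_log(amp, fs, n_log_bins=256, f_min=30.0):
    """Resample |X(omega, tau)| (bins x frames) onto a log-frequency axis.

    Assumption: linear interpolation onto geometrically spaced frequencies
    between f_min and the Nyquist frequency fs / 2.
    """
    linear_hz = np.linspace(0.0, fs / 2.0, amp.shape[0])
    log_hz = np.geomspace(f_min, fs / 2.0, n_log_bins)
    return np.stack(
        [np.interp(log_hz, linear_hz, amp[:, t]) for t in range(amp.shape[1])],
        axis=1,
    )
```

On a logarithmic axis the harmonics of a note keep the same relative spacing whatever its pitch, which is the property the convolutional layers are meant to exploit.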

The target sound convolutional autoencoder unit 122 calculates, from the music signal log amplitude spectrum |X(ω′,τ)| calculated in S121, the estimated target sound log amplitude spectrum |^S(ω′,τ)|, which is the amplitude spectrum of the predetermined instrument sound (target sound) in the logarithmic frequency domain (S122). The target sound convolutional autoencoder unit 122 is a DNN including a 2n-layer CDAE that takes as input the amplitude spectrum of the music signal in the logarithmic frequency domain and outputs the amplitude spectrum, in the logarithmic frequency domain, of the predetermined instrument sound (a signal that is part of the music signal). The network parameters obtained at the end of DNN learning (that is, the network parameters recorded in the learning result recording unit 990) are set in the target sound convolutional autoencoder unit 122. Specifically, the target sound convolutional autoencoder unit 122 is a DNN including a 2n-layer CDAE that computes equations (7) and (8), as described in <Technical background>. Learning of the DNN network parameters is described later.

The log-linear frequency conversion unit 123 converts the estimated target sound log amplitude spectrum |^S(ω′,τ)| calculated in S122 into the frequency domain and calculates the estimated target sound amplitude spectrum |^S(ω,τ)| (S123). The estimated target sound amplitude spectrum |^S(ω,τ)| is the amplitude spectrum of the estimated frequency domain target sound ^S(ω,τ), obtained by converting |^S(ω′,τ)| from the logarithmic frequency domain to the frequency domain. This conversion may be performed as described in <Technical background>.

The Wiener filter calculation unit 124 calculates the Wiener filter ^W(ω,τ) from the music signal amplitude spectrum |X(ω,τ)|, which is the amplitude spectrum of the frequency domain music signal X(ω,τ) generated in S910, and the estimated target sound amplitude spectrum |^S(ω,τ)| calculated in S123 (S124). Specifically, the Wiener filter ^W(ω,τ) is calculated by the following equation.
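The equation referred to above is not reproduced in this text. The sketch below therefore assumes one common form, the amplitude ratio ^W(ω,τ) = |^S(ω,τ)| / |X(ω,τ)| limited to the range [0, 1]. The specification's own equation may differ (for example, a power ratio), and the name `wiener_filter_from_target` and the guard `eps` are illustrative.

```python
import numpy as np

def wiener_filter_from_target(s_hat_amp, x_amp, eps=1e-8):
    """Assumed form of S124: amplitude ratio |S_hat| / |X|, limited to [0, 1].

    eps guards against division by zero in silent time-frequency bins.
    """
    w = s_hat_amp / np.maximum(x_amp, eps)
    return np.clip(w, 0.0, 1.0)
```

Clipping keeps the gain from exceeding 1 when the network overestimates the target amplitude in a bin.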

The signal emphasis unit 930 performs signal emphasis (sound source emphasis) on the frequency domain music signal X(ω,τ) generated in S910, using the Wiener filter ^W(ω,τ) estimated in S120, according to the following equation, and generates the frequency domain emphasized musical sound S'(ω,τ) (S930).
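The equation for S930 is likewise not reproduced in this text. Wiener-type emphasis is conventionally an element-wise product of the filter and the observed spectrum, so the sketch below assumes S'(ω,τ) = ^W(ω,τ) X(ω,τ). The function name `apply_wiener_filter` is illustrative.

```python
import numpy as np

def apply_wiener_filter(X, W):
    """Assumed form of S930: scale each time-frequency bin of X by the real gain W.

    Because W is real and non-negative, the phase of X is left unchanged.
    """
    return W * X
```

Under this assumption only the magnitude of each bin is changed, and the emphasized sound inherits the phase of the mixture.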

The time domain conversion unit 940 generates the emphasized musical sound s'(t) in the time domain from the frequency domain emphasized musical sound S'(ω,τ) (S940).
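The inverse transform is not detailed in the text. Assuming the forward transform was a Hann-windowed short-time Fourier transform, a weighted overlap-add inverse could look like the sketch below. The function name `istft` and the parameters `frame_len` and `hop` are illustrative and must match whatever was used in the forward transform.

```python
import numpy as np

def istft(S, frame_len=1024, hop=256):
    """Weighted overlap-add inverse of a Hann-windowed one-sided STFT.

    Assumption: the same frame_len, hop and window were used in the forward transform.
    """
    window = np.hanning(frame_len)
    n_frames = S.shape[1]
    out = np.zeros((n_frames - 1) * hop + frame_len)
    norm = np.zeros_like(out)
    frames = np.fft.irfft(S, n=frame_len, axis=0)
    for i in range(n_frames):
        start = i * hop
        out[start:start + frame_len] += frames[:, i] * window
        norm[start:start + frame_len] += window ** 2
    # Divide by the accumulated squared window wherever it is non-negligible.
    return np.where(norm > 1e-8, out / np.maximum(norm, 1e-8), 0.0)
```

The normalization by the summed squared window makes the reconstruction exact for an unmodified spectrum, except at the very first and last samples where the window is zero.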

According to the invention of this embodiment, a specific musical sound in a music signal composed of a plurality of instrument sounds can be emphasized with high accuracy. By constructing a DNN that includes a 2n-layer CDAE taking the logarithmic frequency spectrum as input, the invention of this embodiment makes it possible to estimate the target sound spectrum using a spectral characteristic of instrument sounds, namely that the spectrum decomposes into harmonic and envelope components. As a result, the Wiener filter can be estimated accurately even when the computational scale of the DNN is limited, which in turn enables accurate musical sound emphasis.

(Modification)
The Wiener filter estimation unit 120 estimates the Wiener filter by performing the log-linear frequency conversion first and the Wiener filter calculation second, but this order may be reversed. A Wiener filter estimation unit 220 that works in the reversed order is described here. In this case, the musical sound emphasis device 100 includes the frequency domain conversion unit 910, the Wiener filter estimation unit 220, the signal emphasis unit 930, and the time domain conversion unit 940 (see FIGS. 1 and 2).

The Wiener filter estimation unit 220 is described below with reference to FIGS. 5 and 6. FIG. 5 is a block diagram showing the configuration of the Wiener filter estimation unit 220. FIG. 6 is a flowchart showing its operation. As shown in FIG. 5, the Wiener filter estimation unit 220 includes the log amplitude spectrum calculation unit 121, the target sound convolutional autoencoder unit 122, a log Wiener filter calculation unit 125, and the log-linear frequency conversion unit 123.

The log amplitude spectrum calculation unit 121 converts the frequency domain music signal X(ω,τ) into the logarithmic frequency domain and calculates the music signal log amplitude spectrum |X(ω′,τ)| (S121).

The target sound convolutional autoencoder unit 122 calculates, from the music signal log amplitude spectrum |X(ω′,τ)| calculated in S121, the estimated target sound log amplitude spectrum |^S(ω′,τ)|, which is the amplitude spectrum of the predetermined instrument sound (target sound) in the logarithmic frequency domain (S122).

The log Wiener filter calculation unit 125 calculates the log Wiener filter ^W(ω′,τ) from the music signal log amplitude spectrum |X(ω′,τ)| calculated in S121 and the estimated target sound log amplitude spectrum |^S(ω′,τ)| calculated in S122 (S125). Specifically, the log Wiener filter ^W(ω′,τ) is calculated by the following equation.

The log-linear frequency conversion unit 123 converts the log Wiener filter ^W(ω′,τ) calculated in S125 into the frequency domain and calculates the Wiener filter ^W(ω,τ) (S123).
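The reversed order (S125, then S123) can be sketched as follows. Both steps rest on assumptions, since neither the equation for the log Wiener filter nor the conversion from <Technical background> is available in this text: the filter is taken to be the amplitude ratio limited to [0, 1], and the log-to-linear conversion is taken to be linear interpolation from geometrically spaced frequencies. All function and parameter names are hypothetical.

```python
import numpy as np

def log_to_linear(values_log, fs, n_linear_bins, f_min=30.0):
    """Assumed S123: interpolate from a log-frequency axis back to linear bins."""
    log_hz = np.geomspace(f_min, fs / 2.0, values_log.shape[0])
    linear_hz = np.linspace(0.0, fs / 2.0, n_linear_bins)
    return np.stack(
        [np.interp(linear_hz, log_hz, values_log[:, t])
         for t in range(values_log.shape[1])],
        axis=1,
    )

def wiener_filter_via_log_domain(s_hat_log_amp, x_log_amp, fs, n_linear_bins, eps=1e-8):
    """Assumed S125 followed by S123: filter in the log domain, then convert."""
    w_log = np.clip(s_hat_log_amp / np.maximum(x_log_amp, eps), 0.0, 1.0)
    return log_to_linear(w_log, fs, n_linear_bins)
```

In this order only one array, the filter itself, has to be converted back to the linear frequency axis.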

<Second Embodiment>
In the first embodiment, musical sound emphasis is performed using a DNN including a 2n-layer CDAE that estimates the logarithmic frequency domain spectrum of the musical sound to be emphasized. Musical sound emphasis can also be performed by additionally using a DNN including a 2n-layer CDAE that estimates the logarithmic frequency domain spectrum of the sounds other than the target musical sound, that is, the noise. A Wiener filter estimation unit 320 of this kind is described here. In this case, the musical sound emphasis device 100 includes the frequency domain conversion unit 910, the Wiener filter estimation unit 320, the signal emphasis unit 930, and the time domain conversion unit 940 (see FIGS. 1 and 2).

The Wiener filter estimation unit 320 is described below with reference to FIGS. 7 and 8. FIG. 7 is a block diagram showing the configuration of the Wiener filter estimation unit 320. FIG. 8 is a flowchart showing its operation. As shown in FIG. 7, the Wiener filter estimation unit 320 includes the log amplitude spectrum calculation unit 121, the target sound convolutional autoencoder unit 122, a noise convolutional autoencoder unit 222, the log-linear frequency conversion unit 123, and a Wiener filter calculation unit 224.

Similarly, the noise convolutional autoencoder unit 222 calculates, from the music signal log amplitude spectrum |X(ω′,τ)| calculated in S121, the estimated noise log amplitude spectrum |^N(ω′,τ)|, which is the amplitude spectrum in the logarithmic frequency domain of the noise, that is, of all sounds other than the predetermined instrument sound (target sound) (S222). The noise convolutional autoencoder unit 222 is a DNN including a 2n-layer CDAE that takes as input the amplitude spectrum of the music signal in the logarithmic frequency domain and outputs the amplitude spectrum, in the logarithmic frequency domain, of the noise (a signal that is part of the music signal and consists of all sounds other than the predetermined instrument sound). The network parameters obtained at the end of DNN learning (that is, the network parameters recorded in the learning result recording unit 990) are set in the noise convolutional autoencoder unit 222. Specifically, the noise convolutional autoencoder unit 222 is a DNN including a 2n-layer CDAE that computes equations (7) and (8), as described in <Technical background>. Learning of the DNN network parameters is described later.

The log-linear frequency conversion unit 123 converts the estimated target sound log amplitude spectrum |^S(ω′,τ)| calculated in S122 into the frequency domain to calculate the estimated target sound amplitude spectrum |^S(ω,τ)|, and converts the estimated noise log amplitude spectrum |^N(ω′,τ)| calculated in S222 into the frequency domain to calculate the estimated noise amplitude spectrum |^N(ω,τ)| (S123). The estimated noise amplitude spectrum |^N(ω,τ)| is the amplitude spectrum of the estimated frequency domain noise ^N(ω,τ), obtained by converting |^N(ω′,τ)| from the logarithmic frequency domain to the frequency domain. Instead of giving the Wiener filter estimation unit 320 two log-linear frequency conversion units 123 as shown in FIG. 7, it may be given a single log-linear frequency conversion unit 123 that calculates both the estimated target sound amplitude spectrum |^S(ω,τ)| and the estimated noise amplitude spectrum |^N(ω,τ)|.

The Wiener filter calculation unit 224 calculates the Wiener filter ^W(ω,τ) from the estimated target sound amplitude spectrum |^S(ω,τ)| and the estimated noise amplitude spectrum |^N(ω,τ)| calculated in S123 (S124). Specifically, the Wiener filter ^W(ω,τ) is calculated by the following equation.
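The equation for this step is not reproduced in this text either. One form that fits an additive relation between the target and noise amplitudes is ^W(ω,τ) = |^S(ω,τ)| / (|^S(ω,τ)| + |^N(ω,τ)|), and the sketch below assumes it. The specification's actual equation may differ (for example, it may use squared amplitudes), and the names `wiener_filter_from_target_and_noise` and `eps` are illustrative.

```python
import numpy as np

def wiener_filter_from_target_and_noise(s_hat_amp, n_hat_amp, eps=1e-8):
    """Assumed form: |S_hat| / (|S_hat| + |N_hat|), evaluated per time-frequency bin.

    eps guards against division by zero where both estimates are zero.
    """
    return s_hat_amp / np.maximum(s_hat_amp + n_hat_amp, eps)
```

For non-negative amplitude estimates this ratio always lies in [0, 1], so no clipping is needed.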

According to the invention of this embodiment, a specific musical sound in a music signal composed of a plurality of instrument sounds can be emphasized with high accuracy. By constructing DNNs that include a 2n-layer CDAE taking the logarithmic frequency spectrum as input, the invention of this embodiment makes it possible to estimate the target sound spectrum and the noise spectrum using a spectral characteristic of instrument sounds, namely that the spectrum decomposes into harmonic and envelope components. As a result, the Wiener filter can be estimated accurately even when the computational scale of the DNN is limited, which in turn enables accurate musical sound emphasis.

In particular, complementarity in the amplitude domain holds among the target sound, the noise, and the observed sound. When the estimation accuracy of one of the estimated target sound amplitude spectrum |^S(ω,τ)| and the estimated noise amplitude spectrum |^N(ω,τ)| is poor, that of the other can therefore be expected to be good, so a highly accurate Wiener filter ^W(ω,τ) can be estimated even in such a case.

(Modification)
The Wiener filter estimation unit 320 estimates the Wiener filter by performing the log-linear frequency conversion first and the Wiener filter calculation second, but, as in the modification of the first embodiment, this order may be reversed. A Wiener filter estimation unit 420 that works in the reversed order is described here. In this case, the musical sound emphasis device 100 includes the frequency domain conversion unit 910, the Wiener filter estimation unit 420, the signal emphasis unit 930, and the time domain conversion unit 940 (see FIGS. 1 and 2).

The Wiener filter estimation unit 420 is described below with reference to FIGS. 9 and 10. FIG. 9 is a block diagram showing the configuration of the Wiener filter estimation unit 420. FIG. 10 is a flowchart showing its operation. As shown in FIG. 9, the Wiener filter estimation unit 420 includes the log amplitude spectrum calculation unit 121, the target sound convolutional autoencoder unit 122, the noise convolutional autoencoder unit 222, a log Wiener filter calculation unit 225, and the log-linear frequency conversion unit 123.

Similarly, the noise convolutional autoencoder unit 222 calculates, from the music signal log amplitude spectrum |X(ω′,τ)| calculated in S121, the estimated noise log amplitude spectrum |^N(ω′,τ)|, which is the amplitude spectrum in the logarithmic frequency domain of the noise, that is, of all sounds other than the predetermined instrument sound (target sound) (S222).

The log Wiener filter calculation unit 225 calculates the log Wiener filter ^W(ω′,τ) from the estimated target sound log amplitude spectrum |^S(ω′,τ)| calculated in S122 and the estimated noise log amplitude spectrum |^N(ω′,τ)| calculated in S222 (S225). Specifically, the log Wiener filter ^W(ω′,τ) is calculated by the following equation.

The log-linear frequency conversion unit 123 converts the log Wiener filter ^W(ω′,τ) calculated in S225 into the frequency domain and calculates the Wiener filter ^W(ω,τ) (S123).

(Modification 2)
In the description so far, the DNN is configured by treating all sounds other than the musical sound to be emphasized as a single noise. Alternatively, each musical sound other than the one to be emphasized may be treated as a separate noise, and a DNN including a 2n-layer CDAE that estimates the logarithmic frequency domain spectrum may be configured for each of them. In this case, the musical sound emphasis device 100 includes as many noise convolutional autoencoder units 222 as there are musical sounds (instruments) other than the one to be emphasized. The network parameters set in these noise convolutional autoencoder units 222 generally differ from one another.

<Third Embodiment>
This section describes a convolutional autoencoder learning device 500 that learns the network parameters of the DNNs constituting the target sound convolutional autoencoder unit 122 and the noise convolutional autoencoder unit 222.

The convolutional autoencoder learning device 500 is described below with reference to FIGS. 11 and 12. FIG. 11 is a block diagram showing the configuration of the convolutional autoencoder learning device 500. FIG. 12 is a flowchart showing its operation. As shown in FIG. 11, the convolutional autoencoder learning device 500 includes the log amplitude spectrum calculation unit 121, a convolutional autoencoder calculation unit 510, and a network parameter optimization unit 520.

The convolutional autoencoder learning device 500 is connected to a learning data recording unit 590. The learning data recording unit 590 stores, as learning data, the frequency domain music signal X(ω,τ), which is a monaural mixed music signal in the frequency domain used for learning, and the frequency domain partial music signal Y(ω,τ), which is a signal forming part of it. The frequency domain partial music signal Y(ω,τ) is, for example, the frequency domain target sound S(ω,τ), which is the musical sound to be emphasized among those making up X(ω,τ), or the frequency domain noise N(ω,τ), which consists of the sounds in X(ω,τ) other than the musical sound to be emphasized.

The convolutional autoencoder learning device 500 takes pairs of a frequency domain music signal X(ω,τ) and a frequency domain partial music signal Y(ω,τ) as learning data, and outputs the network parameters of a DNN including a 2n-layer CDAE that takes the music signal log amplitude spectrum as input and outputs the estimated partial music signal log amplitude spectrum. In other words, by learning with the frequency domain target sound S(ω,τ) or the frequency domain noise N(ω,τ) as the frequency domain partial music signal Y(ω,τ), the convolutional autoencoder learning device 500 can learn the network parameters of the DNN constituting the target sound convolutional autoencoder unit 122 or the noise convolutional autoencoder unit 222, respectively.

The log amplitude spectrum calculation unit 121 converts the frequency domain music signal X(ω,τ) into the logarithmic frequency domain to calculate the music signal log amplitude spectrum |X(ω′,τ)|, and converts the frequency domain partial music signal Y(ω,τ) into the logarithmic frequency domain to calculate the partial music signal log amplitude spectrum |Y(ω′,τ)| (S121).

The convolutional autoencoder calculation unit 510 calculates the estimated partial music signal log amplitude spectrum |^Y(ω′,τ)| from the music signal log amplitude spectrum |X(ω′,τ)| calculated in S121, using a DNN that includes a 2n-layer (n is an integer of 1 or more) CDAE in which the network parameters currently being learned (optimized) are set (S510).

The network parameter optimization unit 520 optimizes the network parameters used in S510, using the partial music signal log amplitude spectrum |Y(ω′,τ)| calculated in S121 and the estimated partial music signal log amplitude spectrum |^Y(ω′,τ)| calculated in S510 (S520). The network parameters may be optimized with, for example, error backpropagation based on the sum-of-squares error between |Y(ω′,τ)| and |^Y(ω′,τ)|.
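To make the S510 and S520 loop concrete, the sketch below minimizes the sum-of-squares error by gradient descent. It is not the 2n-layer CDAE of the specification: as a deliberately tiny stand-in, the "network" is a single 1-D kernel slid along the log-frequency axis, for which the backpropagated gradient can be written in closed form. All names, the learning rate and the step count are illustrative assumptions.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def sum_squared_error(y_amp, y_hat_amp):
    """Sum-of-squares error between |Y| and |Y_hat| over all bins and frames."""
    return float(np.sum((y_hat_amp - y_amp) ** 2))

def train_toy_filter(x_amp, y_amp, kernel_size=5, lr=0.1, n_steps=500):
    """Toy stand-in for the CDAE: one kernel applied along the log-frequency axis.

    kernel_size is assumed to be odd so that the output keeps the input shape.
    """
    pad = kernel_size // 2
    x_pad = np.pad(x_amp, ((pad, pad), (0, 0)))
    # windows[f, t, :] holds the kernel_size input bins centred on bin f.
    windows = sliding_window_view(x_pad, kernel_size, axis=0)
    kernel = np.zeros(kernel_size)
    losses = []
    for _ in range(n_steps):
        y_hat = windows @ kernel                      # forward pass (S510)
        err = y_hat - y_amp
        losses.append(sum_squared_error(y_amp, y_hat))
        # Gradient of the sum-of-squares error with respect to the kernel.
        grad = 2.0 * np.tensordot(err, windows, axes=([0, 1], [0, 1]))
        kernel -= lr * grad / err.size                # update (S520), step scaled by data size
    return kernel, losses
```

In a real system the forward pass and gradient would come from a deep learning framework, and only the loss and the overall loop structure would carry over.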

The processing of S121 to S520 is repeated once for each item of learning data. Alternatively, a learning termination condition may be defined in advance so that learning stops at a suitable point and the network parameters are output.

In this way, the network parameters used in the target sound convolutional autoencoder unit 122 and the noise convolutional autoencoder unit 222 are learned.

According to the invention of this embodiment, it is possible to learn the network parameters of a DNN that reflects the characteristics of instrument sounds, that is, a DNN including a 2n-layer CDAE. The inventions of the first and second embodiments, using the network parameters obtained by this learning, make it possible to emphasize with high accuracy a specific musical sound in a music signal composed of a plurality of instrument sounds.

<Supplementary Note>
The device of the present invention has, for example as a single hardware entity, an input unit to which a keyboard or the like can be connected, an output unit to which a liquid crystal display or the like can be connected, a communication unit to which a communication device (for example, a communication cable) capable of communicating with the outside of the hardware entity can be connected, a CPU (Central Processing Unit, which may include cache memory, registers, and the like), RAM and ROM as memory, an external storage device such as a hard disk, and a bus that connects the input unit, output unit, communication unit, CPU, RAM, ROM, and external storage device so that data can be exchanged among them. If necessary, the hardware entity may also be provided with a device (drive) that can read from and write to a recording medium such as a CD-ROM. A general-purpose computer is one example of a physical entity with such hardware resources.

The external storage device of the hardware entity stores the program needed to realize the functions described above and the data needed for the processing of that program. Storage is not limited to the external storage device. For example, the program may be stored in a ROM, which is a read-only storage device. Data obtained through the processing of these programs is stored as appropriate in the RAM, the external storage device, or the like.

In the hardware entity, each program stored in the external storage device (or the ROM or the like) and the data needed for its processing are read into memory as required, and are interpreted, executed, and processed by the CPU as appropriate. As a result, the CPU realizes the predetermined functions (the components referred to above as "... unit", "... means", and so on).

The present invention is not limited to the embodiments described above and may be modified as appropriate without departing from the spirit of the invention. The processing described in the above embodiments need not be executed in time series in the order described. It may also be executed in parallel or individually, depending on the processing capability of the device that executes it or as required.

As already stated, when the processing functions of the hardware entity (the device of the present invention) described in the above embodiments are realized by a computer, the processing content of the functions that the hardware entity should have is described by a program. By executing this program on a computer, the processing functions of the hardware entity are realized on the computer.

The program describing this processing content can be recorded on a computer-readable recording medium. The computer-readable recording medium may be of any kind, for example a magnetic recording device, an optical disc, a magneto-optical recording medium, or a semiconductor memory. Specifically, for example, a hard disk device, a flexible disk, a magnetic tape, or the like can be used as the magnetic recording device. A DVD (Digital Versatile Disc), DVD-RAM (Random Access Memory), CD-ROM (Compact Disc Read Only Memory), CD-R (Recordable)/RW (ReWritable), or the like can be used as the optical disc. An MO (Magneto-Optical disc) or the like can be used as the magneto-optical recording medium, and an EEP-ROM (Electronically Erasable and Programmable-Read Only Memory) or the like can be used as the semiconductor memory.

The program is distributed, for example, by selling, transferring, or lending a portable recording medium such as a DVD or CD-ROM on which the program is recorded. The program may also be distributed by storing it in the storage device of a server computer and transferring it from the server computer to other computers over a network.

A computer that executes such a program first stores, for example, the program recorded on the portable recording medium or the program transferred from the server computer in its own storage device. When executing the processing, the computer reads the program stored in its own recording medium and executes processing according to the program it has read. As another form of execution, the computer may read the program directly from the portable recording medium and execute processing according to it. The computer may also execute processing according to the received program each time a program is transferred to it from the server computer. The processing described above may also be executed by a so-called ASP (Application Service Provider) type service, in which the program is not transferred from the server computer to the computer and the processing functions are realized only through execution instructions and retrieval of results. The program in this embodiment includes information that is provided for processing by an electronic computer and is equivalent to a program (such as data that is not a direct command to the computer but has the property of defining the computer's processing).

In this embodiment, the hardware entity is configured by executing a predetermined program on a computer; however, at least part of the processing may instead be implemented in hardware.

Claims

A musical tone emphasizing device comprising:
a frequency domain conversion unit that converts a time domain music signal x(t), which is a monaural mixed music signal, into the frequency domain to generate a frequency domain music signal X(ω, τ);
a Wiener filter estimation unit that estimates, from the frequency domain music signal X(ω, τ), a Wiener filter ^W(ω, τ) used to emphasize a predetermined musical instrument sound;
a signal emphasizing unit that generates a frequency domain emphasized tone S′(ω, τ) from the frequency domain music signal X(ω, τ) and the Wiener filter ^W(ω, τ); and
a time domain conversion unit that generates an emphasized tone s′(t) in the time domain from the frequency domain emphasized tone S′(ω, τ),
wherein the Wiener filter estimation unit estimates the Wiener filter ^W(ω, τ) from the frequency domain music signal X(ω, τ) using a deep neural network including a 2n-layer (n is an integer greater than or equal to 1) convolutional denoising auto encoder that receives as input the amplitude spectrum of a music signal in the logarithmic frequency domain and outputs the amplitude spectrum of a part of the music signal in the logarithmic frequency domain.
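
As an illustrative sketch only, and not part of the claim, the four units above can be chained as follows. The function names are hypothetical, a naive per-frame DFT over non-overlapping rectangular frames is assumed for brevity (a practical system would more likely use a windowed, overlapping short-time transform), and `estimate_wiener_filter` is a placeholder for the deep-neural-network estimation unit, which is not implemented here.

```python
import cmath
import math

def dft(frame):
    """Naive discrete Fourier transform of one frame."""
    n = len(frame)
    return [sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(spectrum):
    """Naive inverse DFT; returns the real part of each sample."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)).real / n
            for t in range(n)]

def to_frequency_domain(x, frame_len):
    """x(t) -> X(omega, tau): one DFT per non-overlapping frame (simplifying assumption)."""
    return [dft(x[i:i + frame_len]) for i in range(0, len(x) - frame_len + 1, frame_len)]

def emphasize(X, W):
    """S'(omega, tau) = W(omega, tau) * X(omega, tau), bin by bin."""
    return [[w * v for w, v in zip(w_frame, x_frame)] for w_frame, x_frame in zip(W, X)]

def to_time_domain(S):
    """S'(omega, tau) -> s'(t): inverse DFT per frame, frames concatenated."""
    return [sample for frame in S for sample in idft(frame)]

def emphasize_tone(x, estimate_wiener_filter, frame_len=8):
    X = to_frequency_domain(x, frame_len)
    W = estimate_wiener_filter(X)  # hypothetical stand-in for the DNN-based estimation unit
    return to_time_domain(emphasize(X, W))

# Usage: an all-pass filter (W = 1 everywhere) should return the input unchanged.
x = [math.sin(0.3 * t) for t in range(16)]
all_pass = lambda X: [[1.0] * len(frame) for frame in X]
y = emphasize_tone(x, all_pass)
```

With a filter of ones the sketch reduces to an analysis and resynthesis round trip, which is a convenient sanity check before plugging in a real estimator.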

The musical tone emphasizing device according to claim 1, wherein
the Wiener filter estimation unit includes:
a logarithmic amplitude spectrum calculation unit that performs logarithmic frequency domain conversion on the frequency domain music signal X(ω, τ) and calculates a music signal logarithmic amplitude spectrum |X(ω′, τ)|;
a target sound convolution auto encoder unit that calculates, from the music signal logarithmic amplitude spectrum |X(ω′, τ)|, an estimated target sound logarithmic amplitude spectrum |^S(ω′, τ)|, which is the amplitude spectrum in the logarithmic frequency domain of the predetermined musical instrument sound, using a deep neural network including a 2n-layer (n is an integer greater than or equal to 1) convolutional denoising auto encoder that receives as input the amplitude spectrum of a music signal in the logarithmic frequency domain and outputs the amplitude spectrum of a predetermined musical instrument sound in the logarithmic frequency domain;
a logarithmic-linear frequency conversion unit that performs frequency domain conversion on the estimated target sound logarithmic amplitude spectrum |^S(ω′, τ)| and calculates an estimated target sound amplitude spectrum |^S(ω, τ)|; and
a Wiener filter calculation unit that calculates the Wiener filter ^W(ω, τ) from the music signal amplitude spectrum |X(ω, τ)|, which is the amplitude spectrum of the frequency domain music signal X(ω, τ), and the estimated target sound amplitude spectrum |^S(ω, τ)|.
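
As a hedged illustration of a Wiener filter calculation from |X(ω, τ)| and |^S(ω, τ)|: this section does not state the formula used, so the sketch below assumes an amplitude-ratio gain clipped to 1. The function name and the `eps` guard against division by zero are hypothetical.

```python
def wiener_filter_from_mixture(x_amp, s_amp, eps=1e-12):
    """Per-bin gain from |X| and |^S|.

    Assumed form (not taken from the claim): W = min(1, |^S| / |X|).
    Inputs are lists of frames, each frame a list of per-bin amplitudes.
    """
    return [[min(1.0, s / max(x, eps)) for x, s in zip(x_frame, s_frame)]
            for x_frame, s_frame in zip(x_amp, s_amp)]

# Usage: one frame with three frequency bins.
W = wiener_filter_from_mixture([[2.0, 1.0, 0.0]], [[1.0, 3.0, 0.0]])
```

Clipping keeps the gain from amplifying bins where the estimate overshoots the mixture, which is a design choice of this sketch rather than something the claim requires.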

The musical tone emphasizing device according to claim 1, wherein
the Wiener filter estimation unit includes:
a logarithmic amplitude spectrum calculation unit that performs logarithmic frequency domain conversion on the frequency domain music signal X(ω, τ) and calculates a music signal logarithmic amplitude spectrum |X(ω′, τ)|;
a target sound convolution auto encoder unit that calculates, from the music signal logarithmic amplitude spectrum |X(ω′, τ)|, an estimated target sound logarithmic amplitude spectrum |^S(ω′, τ)|, which is the amplitude spectrum in the logarithmic frequency domain of the predetermined musical instrument sound, using a deep neural network including a 2n-layer (n is an integer greater than or equal to 1) convolutional denoising auto encoder that receives as input the amplitude spectrum of a music signal in the logarithmic frequency domain and outputs the amplitude spectrum of a predetermined musical instrument sound in the logarithmic frequency domain;
a noise convolution auto encoder unit that calculates, from the music signal logarithmic amplitude spectrum |X(ω′, τ)|, an estimated noise logarithmic amplitude spectrum |^N(ω′, τ)|, which is the amplitude spectrum in the logarithmic frequency domain of noise consisting of all sounds other than the predetermined musical instrument sound, using a deep neural network including a 2n-layer (n is an integer greater than or equal to 1) convolutional denoising auto encoder that receives as input the amplitude spectrum of a music signal in the logarithmic frequency domain and outputs the amplitude spectrum, in the logarithmic frequency domain, of noise consisting of all sounds other than the predetermined musical instrument sound;
a logarithmic-linear frequency conversion unit that performs frequency domain conversion on the estimated target sound logarithmic amplitude spectrum |^S(ω′, τ)| to calculate an estimated target sound amplitude spectrum |^S(ω, τ)|, and performs frequency domain conversion on the estimated noise logarithmic amplitude spectrum |^N(ω′, τ)| to calculate an estimated noise amplitude spectrum |^N(ω, τ)|; and
a Wiener filter calculation unit that calculates the Wiener filter ^W(ω, τ) from the estimated target sound amplitude spectrum |^S(ω, τ)| and the estimated noise amplitude spectrum |^N(ω, τ)|.
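
For illustration, a Wiener filter computed from separate target and noise amplitude estimates might look like the sketch below. It assumes the textbook power-ratio gain |^S|² / (|^S|² + |^N|²); this section does not confirm which formula the calculation unit uses, and the function name and `eps` term are hypothetical.

```python
def wiener_filter_from_estimates(s_amp, n_amp, eps=1e-12):
    """Per-bin gain from |^S| and |^N|.

    Assumed form (textbook Wiener gain): W = |^S|^2 / (|^S|^2 + |^N|^2).
    eps keeps the division defined when both estimates are zero.
    """
    return [[s * s / (s * s + n * n + eps) for s, n in zip(s_frame, n_frame)]
            for s_frame, n_frame in zip(s_amp, n_amp)]

# Usage: one frame with three frequency bins.
W = wiener_filter_from_estimates([[3.0, 0.0, 1.0]], [[4.0, 2.0, 0.0]])
```

The gain approaches 1 where the target estimate dominates and 0 where the noise estimate dominates.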

The musical tone emphasizing device according to claim 1, wherein
the Wiener filter estimation unit includes:
a logarithmic amplitude spectrum calculation unit that performs logarithmic frequency domain conversion on the frequency domain music signal X(ω, τ) and calculates a music signal logarithmic amplitude spectrum |X(ω′, τ)|;
a target sound convolution auto encoder unit that calculates, from the music signal logarithmic amplitude spectrum |X(ω′, τ)|, an estimated target sound logarithmic amplitude spectrum |^S(ω′, τ)|, which is the amplitude spectrum in the logarithmic frequency domain of the predetermined musical instrument sound, using a deep neural network including a 2n-layer (n is an integer greater than or equal to 1) convolutional denoising auto encoder that receives as input the amplitude spectrum of a music signal in the logarithmic frequency domain and outputs the amplitude spectrum of a predetermined musical instrument sound in the logarithmic frequency domain;
a logarithmic Wiener filter calculation unit that calculates a logarithmic Wiener filter ^W(ω′, τ) from the music signal logarithmic amplitude spectrum |X(ω′, τ)| and the estimated target sound logarithmic amplitude spectrum |^S(ω′, τ)|; and
a logarithmic-linear frequency conversion unit that performs frequency domain conversion on the logarithmic Wiener filter ^W(ω′, τ) and calculates the Wiener filter ^W(ω, τ).
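
The conversion of the logarithmic Wiener filter ^W(ω′, τ) back to the linear frequency axis can be pictured as resampling one frame of filter values from logarithmically spaced bin frequencies onto linearly spaced ones. The sketch below assumes simple linear interpolation between neighbouring bins; the conversion method actually used is not stated in this section, and all names and bin layouts here are hypothetical.

```python
import bisect

def log_spaced(f_min, f_max, n_bins):
    """Geometrically spaced bin centre frequencies from f_min to f_max."""
    ratio = (f_max / f_min) ** (1 / (n_bins - 1))
    return [f_min * ratio ** k for k in range(n_bins)]

def interpolate(src_freqs, src_values, dst_freqs):
    """Resample values defined on src_freqs (ascending) onto dst_freqs.

    Linear interpolation between neighbours; values are held constant
    outside the source range.
    """
    out = []
    for f in dst_freqs:
        if f <= src_freqs[0]:
            out.append(src_values[0])
        elif f >= src_freqs[-1]:
            out.append(src_values[-1])
        else:
            hi = bisect.bisect_right(src_freqs, f)
            lo = hi - 1
            w = (f - src_freqs[lo]) / (src_freqs[hi] - src_freqs[lo])
            out.append((1 - w) * src_values[lo] + w * src_values[hi])
    return out

# Usage: one frame of a filter defined on 6 log-spaced bins, mapped to 10 linear bins.
log_freqs = log_spaced(100.0, 1000.0, 6)
linear_freqs = [100.0 * k for k in range(1, 11)]
w_log = [f / 1000.0 for f in log_freqs]            # toy filter values, linear in frequency
w_lin = interpolate(log_freqs, w_log, linear_freqs)
```

The same helper covers the opposite direction (linear to logarithmic) by swapping the source and destination axes.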

The musical tone emphasizing device according to claim 1, wherein
the Wiener filter estimation unit includes:
a logarithmic amplitude spectrum calculation unit that performs logarithmic frequency domain conversion on the frequency domain music signal X(ω, τ) and calculates a music signal logarithmic amplitude spectrum |X(ω′, τ)|;
a target sound convolution auto encoder unit that calculates, from the music signal logarithmic amplitude spectrum |X(ω′, τ)|, an estimated target sound logarithmic amplitude spectrum |^S(ω′, τ)|, which is the amplitude spectrum in the logarithmic frequency domain of the predetermined musical instrument sound, using a deep neural network including a 2n-layer (n is an integer greater than or equal to 1) convolutional denoising auto encoder that receives as input the amplitude spectrum of a music signal in the logarithmic frequency domain and outputs the amplitude spectrum of a predetermined musical instrument sound in the logarithmic frequency domain;
a noise convolution auto encoder unit that calculates, from the music signal logarithmic amplitude spectrum |X(ω′, τ)|, an estimated noise logarithmic amplitude spectrum |^N(ω′, τ)|, which is the amplitude spectrum in the logarithmic frequency domain of noise consisting of all sounds other than the predetermined musical instrument sound, using a deep neural network including a 2n-layer (n is an integer greater than or equal to 1) convolutional denoising auto encoder that receives as input the amplitude spectrum of a music signal in the logarithmic frequency domain and outputs the amplitude spectrum, in the logarithmic frequency domain, of noise consisting of all sounds other than the predetermined musical instrument sound;
a logarithmic Wiener filter calculation unit that calculates a logarithmic Wiener filter ^W(ω′, τ) from the estimated target sound logarithmic amplitude spectrum |^S(ω′, τ)| and the estimated noise logarithmic amplitude spectrum |^N(ω′, τ)|; and
a logarithmic-linear frequency conversion unit that performs frequency domain conversion on the logarithmic Wiener filter ^W(ω′, τ) and calculates the Wiener filter ^W(ω, τ).

A convolution auto encoder learning device comprising:
a logarithmic amplitude spectrum calculation unit that performs logarithmic frequency domain conversion on a frequency domain music signal X(ω, τ), which is a monaural mixed music signal in the frequency domain, to calculate a music signal logarithmic amplitude spectrum |X(ω′, τ)|, and performs logarithmic frequency domain conversion on a frequency domain partial music signal Y(ω, τ), which is a part of the frequency domain music signal X(ω, τ), to calculate a partial music signal logarithmic amplitude spectrum |Y(ω′, τ)|;
a convolution auto encoder calculation unit that calculates an estimated partial music signal logarithmic amplitude spectrum |^Y(ω′, τ)| from the music signal logarithmic amplitude spectrum |X(ω′, τ)| using a deep neural network including a 2n-layer (n is an integer greater than or equal to 1) convolutional denoising auto encoder in which the network parameters currently being optimized are set; and
a network parameter optimization unit that optimizes the network parameters using the partial music signal logarithmic amplitude spectrum |Y(ω′, τ)| and the estimated partial music signal logarithmic amplitude spectrum |^Y(ω′, τ)|.
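
To make the optimization loop concrete, the toy sketch below replaces the 2n-layer convolutional denoising auto encoder with a single 1-D convolution kernel and updates it by plain gradient descent on a mean squared error between |^Y(ω′, τ)| and |Y(ω′, τ)|. Both the loss and the optimizer are assumptions of this sketch, since this section does not specify them, and every name is hypothetical.

```python
def conv1d(x, kernel):
    """Same-length 1-D cross-correlation with zero padding (odd kernel length assumed)."""
    half = len(kernel) // 2
    padded = [0.0] * half + list(x) + [0.0] * half
    return [sum(kernel[j] * padded[i + j] for j in range(len(kernel)))
            for i in range(len(x))]

def mse(estimate, target):
    """Mean squared error between two equal-length sequences."""
    return sum((e - t) ** 2 for e, t in zip(estimate, target)) / len(target)

def train_step(kernel, x_log_amp, y_log_amp, lr):
    """One gradient-descent update of the kernel on the MSE loss."""
    estimate = conv1d(x_log_amp, kernel)          # stands in for |^Y(omega', tau)|
    half = len(kernel) // 2
    padded = [0.0] * half + list(x_log_amp) + [0.0] * half
    n = len(y_log_amp)
    # d(MSE)/d(kernel[j]) = (2 / n) * sum_i (estimate[i] - target[i]) * padded[i + j]
    grad = [sum(2 * (estimate[i] - y_log_amp[i]) * padded[i + j] for i in range(n)) / n
            for j in range(len(kernel))]
    return [k - lr * g for k, g in zip(kernel, grad)]

# Usage: one frame of a toy mixture spectrum and the matching partial-signal target.
x_frame = [0.0, 1.0, 2.0, 3.0, 2.0, 1.0, 0.0, 1.0]
y_frame = conv1d(x_frame, [0.2, 0.5, 0.3])        # synthetic target for the demo
kernel = [0.0, 0.0, 0.0]
loss_before = mse(conv1d(x_frame, kernel), y_frame)
for _ in range(200):
    kernel = train_step(kernel, x_frame, y_frame, lr=0.05)
loss_after = mse(conv1d(x_frame, kernel), y_frame)
```

A real learning device would iterate over many frames and many network layers, but the structure is the same: compute the estimate with the current parameters, compare it with the target, and update the parameters.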

A musical tone emphasizing method comprising:
a frequency domain conversion step in which a musical tone emphasizing device converts a time domain music signal x(t), which is a monaural mixed music signal, into the frequency domain to generate a frequency domain music signal X(ω, τ);
a Wiener filter estimation step in which the musical tone emphasizing device estimates, from the frequency domain music signal X(ω, τ), a Wiener filter ^W(ω, τ) used to emphasize a predetermined musical instrument sound;
a signal emphasizing step in which the musical tone emphasizing device generates a frequency domain emphasized tone S′(ω, τ) from the frequency domain music signal X(ω, τ) and the Wiener filter ^W(ω, τ); and
a time domain conversion step in which the musical tone emphasizing device generates an emphasized tone s′(t) in the time domain from the frequency domain emphasized tone S′(ω, τ),
wherein the Wiener filter estimation step estimates the Wiener filter ^W(ω, τ) from the frequency domain music signal X(ω, τ) using a deep neural network including a 2n-layer (n is an integer greater than or equal to 1) convolutional denoising auto encoder that receives as input the amplitude spectrum of a music signal in the logarithmic frequency domain and outputs the amplitude spectrum of a part of the music signal in the logarithmic frequency domain.

A program for causing a computer to function as the musical tone emphasizing device according to any one of claims 1 to 5 or the convolution auto encoder learning device according to claim 6.