JP2010054802A

JP2010054802A - Unit rhythm extraction method from musical acoustic signal, musical piece structure estimation method using this method, and replacing method of percussion instrument pattern in musical acoustic signal

Info

Publication number: JP2010054802A
Application number: JP2008219539A
Authority: JP
Inventors: Shigeki Sagayama; 茂樹嵯峨山; Junki Ono; 順貴小野; Emiru Tsunoo; 衣未留角尾; Kenichi Miyamoto; 賢一宮本
Original assignee: University of Tokyo NUC
Current assignee: University of Tokyo NUC
Priority date: 2008-08-28
Filing date: 2008-08-28
Publication date: 2010-03-11

Abstract

PROBLEM TO BE SOLVED: To extract bar-length percussion patterns from a music acoustic signal.

SOLUTION: A method is provided for creating a map of beat patterns by extracting the multiple kinds of bar-length percussion patterns contained in a music acoustic signal while simultaneously estimating where each of them is played. The method can be used for automatic music genre classification, music information retrieval, and music processing such as replacing percussion patterns. Optimal segmentation into bars by the One-Pass DP (Dynamic Programming) method and clustering of the percussion patterns by k-means clustering are performed repeatedly; the optimal number of percussion patterns, as estimated by an information criterion, is extracted, and a map indicating the structure of the piece in terms of those patterns is obtained.

COPYRIGHT: (C)2010, JPO&INPIT

Description

The present invention relates to music information processing.

In research on music information retrieval, and particularly in the task of music genre classification, rhythm-related features are highly effective. For example, pieces such as sambas and tangos are strongly characterized by their typical bar-length percussion patterns, that is, their unit rhythm patterns.

Rhythm is clearly one of the most fundamental and important elements of music. From a micro perspective, a unit rhythm pattern is in most cases a bar composed of beats. From a macro perspective, the structure of a piece is often formed by the multiple unit rhythm patterns that appear throughout it. If each of these rhythm patterns could be extracted accurately and the music structure formed from the unit rhythm patterns could be analyzed, the result would be very useful for rhythm-based music analysis, music genre classification, and music information retrieval.

The most basic prior work in rhythm analysis is beat tracking (Non-Patent Document 1). That system estimates beats in real time from the onsets of harmonic sounds, chord transitions, and percussion patterns. The percussion patterns used there are combinations of a bass drum, whose energy is localized in the low-frequency range of the spectrum, and a snare drum, whose energy is spread over a wide frequency range; beat positions are estimated using prior knowledge based on musically common patterns.

In research on music genre classification (Non-Patent Document 2), timbre features, pitch features, and rhythm features are each extracted. In particular, the beat histogram used as a rhythm feature is computed as a tempo statistic extracted from the temporal envelope of the music acoustic signal and its autocorrelation (Non-Patent Document 3).

Other related work on rhythm patterns includes measuring the distance between rhythm patterns (Non-Patent Document 4). In that work, the spectral centroid, MFCCs (Mel-Frequency Cepstrum Coefficients), and similar quantities are extracted as rhythm pattern features, and the resulting feature patterns are compared using Dynamic Time Warping (DTW). For percussion-only sources it was confirmed that more similar patterns yield smaller distances, but the method was not extended to real music.

Research on extracting rhythm patterns from real music includes work that extracts periodic patterns in the power of the acoustic signal by a heuristic method (Non-Patent Document 5) and work that extracts features based on spectral periodicity (Non-Patent Document 6). These have achieved results sufficient to distinguish pieces with characteristic rhythms, such as samba and tango.

[Non-Patent Document 1] Goto, M., "An audio-based real-time beat tracking system for music with or without drum-sounds," Journal of New Music Research, Vol. 30, No. 2, pp. 159-171, June 2001.
[Non-Patent Document 2] Tzanetakis, G. and Cook, P., "Musical genre classification of audio signals," IEEE Transactions on Speech and Audio Processing, Vol. 10, No. 5, pp. 293-302, 2002.
[Non-Patent Document 3] Tzanetakis, G., Essl, G. and Cook, P., "Audio analysis using the discrete wavelet transform," Proc. of WSES Int. Conf. on Acoustics and Music: Theory and Applications, 2001.
[Non-Patent Document 4] Paulus, J. and Klapuri, A., "Measuring the similarity of rhythmic patterns," Proc. of the 3rd Int. Conf. on Music Information Retrieval (ISMIR 2002), pp. 150-156, IRCAM Centre Pompidou, 2002.
[Non-Patent Document 5] Dixon, S., Gouyon, F. and Widmer, G., "Towards characterization of music via rhythmic patterns," Proc. of the 5th Int. Conf. on Music Information Retrieval (ISMIR 2004), Barcelona, Spain, 2004.
[Non-Patent Document 6] Peeters, G., "Rhythm classification using spectral rhythm patterns," Proc. of the 6th Int. Conf. on Music Information Retrieval (ISMIR 2005), pp. 644-647, London, UK, September 2005.

An object of the present invention is to extract unit rhythm patterns, typically bar-length percussion patterns, from a music acoustic signal.

Another object of the present invention is to estimate music structure on the basis of the extracted unit rhythm patterns.

A further object of the present invention is to process a piece of music by replacing unit rhythm patterns, that is, unit percussion patterns, in the music acoustic signal.

The first technical means adopted by the present invention is a method for extracting unit rhythm patterns from a music acoustic signal that includes percussion sounds, in which:
a spectrum sequence of the percussion sounds contained in the music acoustic signal is prepared; and
partition-optimizing clustering is performed iteratively, in which the spectrum sequence of the percussion sounds is segmented and the segments are classified by DP matching against a plurality of reference patterns, and the reference patterns are then updated on the basis of the resulting segmentation and classification, so that the converged reference patterns are extracted as the unit rhythm patterns.

Because actual music usually also contains harmonic sounds, preparing the spectrum sequence of the percussion sounds includes extracting the percussion sounds from the music acoustic signal. The percussion sounds are extracted using a harmonic/percussive sound separation technique. In one aspect, the spectrogram of the music acoustic signal is regarded as the sum of a non-harmonic component that is smooth in the frequency direction and a harmonic component that is smooth in the time direction, and the non-harmonic component is extracted as the percussion sound by means of a time-frequency mask (distribution coefficients that apportion each spectral component).

In one aspect, the time-frequency mask (the distribution coefficients) is obtained by setting up an objective function that includes a function of a smoothness index of each apportioned spectral component, with the distribution coefficients as parameters, and estimating the parameters that optimize that objective function. The smoothness index of each apportioned spectral component is determined from the energy difference between the spectral component of interest and the apportioned spectral components in its neighborhood on the time-frequency plane. The neighboring spectral components are typically those adjacent on the time-frequency plane, but the extent of the neighborhood is not limited to this. Setting the distribution coefficients, that is, the time-frequency mask, can be treated as an optimization problem in which a smoothness cost is designed as a function of the derivative of the spectrogram and then minimized. Those skilled in the art will understand that the separation technique used to extract the percussion sounds is not limited to this one.

The music acoustic signal that includes percussion sounds may be a signal consisting only of percussion sounds. Preparing the spectrum sequence of the percussion sounds may also mean using percussion sounds that have already been separated from a music acoustic signal and stored. In one aspect, the spectrum sequence of the percussion sounds is a spectrum sequence whose frequency axis has been scaled to the mel scale, although those skilled in the art will understand that it is not limited to this.

In one aspect, the DP matching is selected from the DP methods used in continuous word speech recognition. In one preferred aspect the DP method is the One-Pass DP method, but other DP methods that can be used in the present invention include the two-level DP method, level building, the clockwise DP method, one-stage DP, and the continuous DP method.

In one aspect, the initial reference patterns for DP matching are selected from a plurality of percussion rhythm patterns prepared in advance. In another aspect, the initial reference patterns for DP matching are obtained from the input spectrum sequence of the percussion sounds. In a more specific example, a reference pattern width is specified, the input spectrum sequence of the percussion sounds is divided into a plurality of input patterns, and the required number of input patterns are drawn at random and used as the initial reference patterns. Alternatively, a reference pattern width is specified, the input spectrum sequence is divided into a plurality of input patterns, each input pattern is given one of the required number of labels either at random or in order, and the average of the input patterns sharing a label is used as an initial reference pattern. The division by a specified reference pattern width is performed by finding the peak of the autocorrelation function of the input spectrum sequence and dividing the sequence into equal parts whose length is the number of frames at which that peak occurs.
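The initialization described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: each frame is a plain list of spectral magnitudes, the autocorrelation is an unnormalized sum of frame inner products, and the function names `autocorr_peak_lag` and `initial_patterns` as well as the `min_lag`/`max_lag` search range are hypothetical.

```python
import random

def autocorr_peak_lag(frames, min_lag, max_lag):
    """Return the lag (in frames) at which the sequence autocorrelation peaks."""
    n = len(frames)
    best_lag, best_val = min_lag, float("-inf")
    for lag in range(min_lag, min(max_lag, n - 1) + 1):
        # Sum of inner products between frames that are `lag` frames apart.
        val = sum(
            sum(a * b for a, b in zip(frames[t], frames[t + lag]))
            for t in range(n - lag)
        )
        if val > best_val:
            best_lag, best_val = lag, val
    return best_lag

def initial_patterns(frames, width, k, seed=0):
    """Cut the sequence into equal-width patterns, label them at random,
    and average the patterns that share a label.
    Assumes the sequence yields at least k patterns of the given width."""
    rng = random.Random(seed)
    chunks = [frames[i:i + width] for i in range(0, len(frames) - width + 1, width)]
    labels = [i % k for i in range(len(chunks))]  # every label is used at least once
    rng.shuffle(labels)
    dim = len(frames[0])
    patterns = []
    for c in range(k):
        members = [ch for ch, lab in zip(chunks, labels) if lab == c]
        patterns.append([
            [sum(m[j][d] for m in members) / len(members) for d in range(dim)]
            for j in range(width)
        ])
    return patterns
```

In practice the lag search range would be chosen to bracket plausible bar lengths, since an unnormalized autocorrelation also peaks at multiples of the true period.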

The partition-optimizing clustering is the k-means method. The k-means method is known as a typical partition-optimizing clustering technique, and numerous partition-optimizing clustering techniques obtained by modifying or improving it are also known to those skilled in the art. In this specification, the term "k-means method" is treated as including these modified and improved techniques.

The center of the segments belonging to each cluster produced by the DP-matching classification is obtained, and the reference pattern is updated to that segment center. The segment center represents the segments belonging to a cluster, and those skilled in the art will understand that there are several ways to compute it. In one aspect, a single representative segment selected from the segments belonging to the cluster is used as the segment center. In another aspect, the average of the segments belonging to the cluster is used as the segment center.

Typically, the width of a unit rhythm pattern corresponds to the width of one bar.

In one aspect, the number of reference patterns is known, and that known number of initial reference patterns is prepared.

In one aspect, the number of reference patterns is unknown, and the optimal number of clusters for the partition-optimizing clustering is determined using an information criterion. In one aspect, the information criterion is the Bayesian information criterion (BIC).

The information criterion adopted by the present invention is not limited to BIC (Bayesian information criterion), provided that it evaluates a probability distribution model. It may be AIC (Akaike information criterion), ABIC (Akaike's Bayesian information criterion), TIC (Takeuchi information criterion), MDL (minimum description length), GIC (generalized information criterion), EIC (bootstrap information criterion), PIC (predictive information criterion), cross-validation, Mallows' C_p criterion, or the Hannan-Quinn criterion, and may further include approximations or equivalents of these information criteria.
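For reference, the standard general form of BIC is given below. How the likelihood of the clustered rhythm patterns is modeled is not stated in this passage, so the symbols are generic assumptions: $\hat{L}_K$ is the maximized likelihood of the model with $K$ clusters, $p_K$ is its number of free parameters, and $n$ is the number of observations.

```latex
\mathrm{BIC}(K) = -2 \ln \hat{L}_K + p_K \ln n,
\qquad
K^{*} = \arg\min_{K} \mathrm{BIC}(K)
```

The first term rewards fit and the second penalizes model size, so adding a further reference pattern is accepted only when it improves the fit by more than the penalty it incurs.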

The second technical means adopted by the present invention is a method for estimating music structure using the plurality of unit rhythm patterns extracted by the above method, in which the sequence of unit rhythm patterns corresponding to the spectrum sequence of the percussion sounds is taken as the music structure.

In one aspect, the music structure is a rhythm map displayed on a plane whose horizontal axis is the frame number of the input spectrum sequence of the percussion sounds and along whose vertical axis the plurality of unit rhythm patterns are arranged.
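A rhythm map of this kind can be sketched as a small grid with one row per unit rhythm pattern and one column per frame. The segment representation `(start_frame, end_frame_exclusive, pattern_index)` and the function names are assumptions made for illustration only.

```python
def rhythm_map(segments, num_patterns, num_frames):
    """Build a pattern-by-frame grid that marks which pattern is played at each frame.

    segments: list of (start_frame, end_frame_exclusive, pattern_index).
    """
    grid = [[0] * num_frames for _ in range(num_patterns)]
    for start, end, k in segments:
        for t in range(start, end):
            grid[k][t] = 1
    return grid

def render(grid):
    """Render the grid as text: '#' where a pattern is active, '.' elsewhere."""
    return "\n".join("".join("#" if v else "." for v in row) for row in grid)
```

For example, a piece segmented as pattern 0, pattern 1, pattern 0 with four frames each renders as two rows, `####....####` and `....####....`.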

The third technical means adopted by the present invention is a method for replacing unit rhythm patterns using the music structure obtained by the above method, in which some of the unit rhythm patterns making up the music structure are replaced with percussion patterns prepared in advance.
Typically, the percussion patterns prepared in advance are bar-length percussion patterns.

In one aspect, the method includes superimposing the spectrogram of the percussion component containing the replaced percussion patterns on the spectrogram of the harmonic component separated from the music acoustic signal. Typically, the superimposed spectrogram is then converted into a time-domain waveform.
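The replacement and superposition steps can be sketched on magnitude spectrograms as follows. This is only an illustration under assumptions: spectrograms are lists of frames, segments are `(start, end_exclusive, pattern_index)` tuples, the prepared pattern is stretched to each segment by nearest-neighbour resampling, and the conversion back to a waveform (which needs phase information and an inverse short-time transform) is left out. All names are hypothetical.

```python
def resample(pattern, length):
    """Nearest-neighbour stretch of a pattern (list of frames) to `length` frames."""
    n = len(pattern)
    return [pattern[min(n - 1, (i * n) // length)] for i in range(length)]

def replace_and_mix(perc, harm, segments, target, new_pattern):
    """Replace every segment labelled `target` in the percussive spectrogram
    with `new_pattern`, then add the harmonic spectrogram bin by bin."""
    out = [list(frame) for frame in perc]  # copy so the input is left untouched
    for start, end, k in segments:
        if k == target:
            out[start:end] = [list(f) for f in resample(new_pattern, end - start)]
    return [[p + h for p, h in zip(pf, hf)] for pf, hf in zip(out, harm)]
```

Stretching the prepared pattern to each segment's own length keeps the replacement aligned with the estimated bar boundaries even when the tempo drifts.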

As a hardware configuration, the present invention can be implemented on a computer such as a personal computer (specifically, one comprising an input device, an output device including a display device, a CPU, storage devices such as ROM and RAM, and a bus connecting them). Accordingly, the present invention can also be provided as a computer program that causes each of the above methods to be executed, or as a computer-readable medium storing such a program.

The present invention focuses on the unit percussion patterns, that is, the unit rhythm patterns, that make up a piece as one of the features used for music retrieval and classification, and extracts the unit rhythm patterns specific to the piece. In the present invention, segmentation into rhythm patterns and clustering of the rhythm patterns are performed iteratively on a percussion-enhanced spectrogram, so that the unit rhythm patterns are extracted and the places where they are played are estimated at the same time, which yields the music structure. Music information retrieval uses spectrogram similarity, but a hierarchical approach that first groups pieces broadly by structure and then compares individual spectrograms is expected to search faster. Structural analysis of music is therefore very useful for music information retrieval.

In the present invention, replacing the percussion patterns in a music acoustic signal makes it possible to process a piece by substituting corresponding rhythm patterns.

[A] Overview
In one embodiment of the present invention, the unit percussion patterns specific to a piece are extracted from the spectrogram of its music acoustic signal, the places where those patterns are played in the piece are estimated, and the structure of the piece is displayed in the form of a map.

[A-1] Problems in rhythm analysis
The aim is to extract the unit rhythm pattern from a signal in which the same rhythm pattern repeats periodically, as percussion patterns do. If a single rhythm pattern were simply repeated, the problem would be relatively easy, but in actual music several unit percussion patterns are played according to the performance practice of each genre. This is a chicken-and-egg problem: if the segmentation into unit rhythm patterns were given, determining the unit rhythm patterns themselves would not be difficult, and conversely, if the unit rhythm patterns were given, segmenting the input percussion pattern sequence would be easy. A further problem is possible tempo fluctuation: the segmentation must remain accurate even when the unit percussion patterns in the piece are stretched or compressed in time. These issues already arise when extracting unit percussion patterns from percussion-only music. Extraction from actual music, especially contemporary popular music and jazz, is harder still, because such music contains melodies and chords as well as percussion, and these partially mask the percussion patterns.

The problems in extracting bar-length percussion rhythm patterns from an input acoustic signal without prior knowledge, such as percussion sound templates, can therefore be summarized in the following four points.
(i) The input acoustic signal contains melodies and chords as well as percussion sounds.
(ii) The tempo and the percussion patterns themselves vary with the performer.
(iii) The segmentation into unit patterns is unknown.
(iv) The unit percussion patterns themselves are unknown.

[A-2] Percussion sound enhancement
Problem (i) above means that harmonic sounds and percussion sounds are mixed in the spectrogram obtained from the input acoustic signal. The percussion sounds must be separated from such a spectrogram and enhanced without prior knowledge. Because a harmonic sound has a pitch, its energy in the spectrogram is concentrated at particular frequencies and is sustained for some length of time. A percussion sound, by contrast, spreads its energy over a wide frequency range and has energy only for an instant. Exploiting this difference in timbre on the spectrogram, harmonic and percussion sounds are separated using masks that divide the spectrogram into harmonic sounds, which are smooth along the time axis, and percussion sounds, which are smooth along the frequency axis. The specific method is described later.

[A-3] Iterative estimation of rhythm-based music structure and unit percussion patterns
If the correct bar-length percussion patterns were given as templates, problem (iii) above could be regarded as analogous to continuous speech recognition. Just as the One-Pass DP method is used to estimate where each word was uttered, it can be used to estimate the music structure as the places where each unit percussion pattern is played. Because this algorithm permits stretching and compression in time, problem (ii) above is solved at the same time.
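A one-pass DP segmentation against known templates can be sketched as below. This is a simplified illustration, not the patent's implementation: it assumes a squared Euclidean frame distance and local moves of "stay", "advance by one", and "skip one" within a template, and it lets any template start right after any template ends. The function names are hypothetical.

```python
def sqdist(a, b):
    """Squared Euclidean distance between two frames."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def one_pass_dp(frames, templates):
    """Segment `frames` into a concatenation of time-warped templates.

    Returns [(start, end_exclusive, template_index), ...] with minimal total distance.
    """
    INF = float("inf")
    K = len(templates)
    # cost[k][j]: best accumulated cost of being in state j of template k at the current frame.
    cost = [[INF] * len(r) for r in templates]
    back = []  # back[t][k][j] = (prev_k, prev_j, starts_new_segment)
    for t, frame in enumerate(frames):
        if t == 0:
            entry = (0.0, -1, -1, True)  # the first frame always starts a segment
        else:
            # Cheapest template that has just reached its last state at frame t-1.
            ek = min(range(K), key=lambda k: cost[k][-1])
            entry = (cost[ek][-1], ek, len(templates[ek]) - 1, True)
        new_cost, ptr = [], []
        for k, r in enumerate(templates):
            row_c, row_p = [], []
            for j in range(len(r)):
                # Within-template moves: stay (stretch), advance, or skip one state (compress).
                cands = [(cost[k][j - s], k, j - s, False) for s in (0, 1, 2) if j - s >= 0]
                if j == 0:
                    cands.append(entry)  # a template may only be entered at its first state
                best = min(cands, key=lambda c: c[0])
                row_c.append(best[0] + sqdist(frame, r[j]))
                row_p.append(best[1:])
            new_cost.append(row_c)
            ptr.append(row_p)
        cost = new_cost
        back.append(ptr)
    # Backtrack from the cheapest template end state at the last frame.
    k = min(range(K), key=lambda i: cost[i][-1])
    j = len(templates[k]) - 1
    segments, end = [], len(frames)
    for t in range(len(frames) - 1, -1, -1):
        pk, pj, new_seg = back[t][k][j]
        if new_seg:
            segments.append((t, end, k))
            end = t
        k, j = pk, pj
    return segments[::-1]
```

Because every frame updates all template states in a single left-to-right sweep, segment boundaries and pattern labels come out of the same backtrack, which is what makes the speech-recognition analogy useful here.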

In practice, however, as stated in problem (iv), the correct unit patterns are not given. This is a chicken-and-egg problem in which both the segmentation and the correct unit patterns are initially unknown, so they must be estimated simultaneously. In other words, it can be regarded as an unsupervised learning problem. One approach is therefore to use the k-means clustering algorithm in combination with the One-Pass DP method: the unit patterns are updated by averaging the cluster members, and the music structure is re-estimated, in alternation. FIG. 6 shows a flowchart of the algorithm. The spectrogram of the music acoustic signal is separated into harmonic and percussion components to obtain a spectrogram in which the percussion sounds are enhanced. For the resulting spectrum sequence of percussion sounds, the update of the rhythm structure and the update of the unit patterns are repeated alternately, yielding the music structure (rhythm map) and the unit rhythm patterns. The algorithm is described in detail below.
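One way to formalize this alternation is given below. The notation is an assumption introduced here and does not appear in the source: $X$ is the percussion spectrum sequence, $b_0 < b_1 < \dots < b_N$ are the segment boundaries, $c_n \in \{1, \dots, K\}$ is the pattern label of segment $n$, $R_k$ is the $k$-th reference pattern, and $D$ is a time-warping distance between a segment and a pattern.

```latex
J(\mathbf{b}, \mathbf{c}, R) = \sum_{n=1}^{N} D\bigl(X_{b_{n-1}:b_n},\, R_{c_n}\bigr)

\text{Structure update (DP):}\quad
(\mathbf{b}, \mathbf{c}) \leftarrow \arg\min_{\mathbf{b}, \mathbf{c}} J(\mathbf{b}, \mathbf{c}, R)

\text{Pattern update (averaging):}\quad
R_k \leftarrow \operatorname{mean}\bigl\{ X_{b_{n-1}:b_n} : c_n = k \bigr\}
```

Under this reading the structure update cannot increase $J$; if the averaging is taken along the DP alignment and $D$ is a squared-error distance, the pattern update cannot increase it either, which is the usual argument for why k-means-style iterations settle.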

[B] Percussion sound enhancement
The goal is to extract percussion patterns from the spectrum sequence of a piece, but in general the spectrum sequence contains a mixture of harmonic and percussion spectra. As preprocessing for extracting the percussion patterns, it is therefore necessary to enhance the percussion sounds, that is, to separate them from the spectrum sequence.

The object of analysis is a music signal in which harmonic and percussion sounds are mixed, and the spectrogram obtained by short-time frequency analysis of the input signal is denoted W(x, t) (x: frequency, t: time). The task here is to decompose W(x, t) into two spectrograms: a non-harmonic component P(x, t), which is percussive and has no pitch, and a harmonic component H(x, t), such as that of a pitched instrument. The requirement to be satisfied is that, at every time-frequency point (x, t),

$$W(x, t) = P(x, t) + H(x, t)$$

holds.

Attention is paid to the anisotropy, in the time-frequency domain, of the spectral components of the harmonic and percussion parts. More specifically, as shown in FIG. 1 and FIG. 2, the spectrogram of a popular music signal in the time-frequency domain generally consists of spectral components that form ridges running in the frequency direction and spectral components that form ridges running in the time direction. The former can be regarded as corresponding to a component P(x, t) that, like a percussion instrument, changes sharply in the time direction but is broad (smooth) in the frequency direction; the latter, conversely, corresponds to a component H(x, t) that is sharp in the frequency direction but smooth in the time direction. The two components can also be regarded as sparsely distributed on the time-frequency plane.

The spectrogram of the input signal is decomposed into two spectrograms by time-frequency masks. That is, given the sparseness of P(x, t) and H(x, t) described above, it is considered that W(x, t) can be decomposed as

$$P(x, t) = m_P(x, t)\, W(x, t), \qquad H(x, t) = m_H(x, t)\, W(x, t)$$

by designing time-frequency masks m_P(x, t) and m_H(x, t) that take values from 0 to 1 at every time-frequency point.

The time-frequency masks are designed to detect the direction in which the spectral components forming each of the two decomposed spectrograms are smooth. Using the property that the spectral components of the percussion part are smooth in the frequency direction and the property that the spectral components of the harmonic part are smooth in the time direction, time-frequency masks are designed that separate the spectrogram of the input signal into the respective components. In one aspect, the time-frequency mask taking values from 0 to 1 is a binary mask taking the value 0 or 1.
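Applying such a mask can be sketched as follows. The sketch assumes that the two masks are complementary (m_H = 1 - m_P), which is one simple way to make the two parts add back up to the input spectrogram; spectrograms are nested lists indexed as `W[x][t]`, and the function name is hypothetical.

```python
def apply_masks(W, m_P):
    """Split W into a percussive part P = m_P * W and a harmonic part H = (1 - m_P) * W.

    W and m_P are nested lists of the same shape; mask values lie in [0, 1].
    The complementary harmonic mask is an assumption made for this sketch.
    """
    P = [[m * w for m, w in zip(mr, wr)] for mr, wr in zip(m_P, W)]
    H = [[(1 - m) * w for m, w in zip(mr, wr)] for mr, wr in zip(m_P, W)]
    return P, H
```

With a binary mask every bin goes entirely to one component; with values strictly between 0 and 1 the energy of a bin is shared between the two.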

Three embodiments of the mask design are described: 1) a method using two-dimensional filters; 2) a method that minimizes a divergence plus a smoothness cost with an EM-like algorithm; and 3) a method that minimizes a smoothness cost on a level-compressed spectrogram with an EM-like algorithm.

[B-1] First method
In the first method, the spectrogram of the observed signal on the time-frequency plane is regarded as an image. Two-dimensional filters that exploit the difference in general properties between harmonic and percussive sounds are applied, so that percussive and harmonic sounds are separated from the music signal without any instrument-specific information.

The design of the time-frequency masks m_P(x, t) and m_H(x, t) is now described. If W(x, t) is regarded as an image, the features of P(x, t) and H(x, t) are edges in the frequency direction (vertical edges) and edges in the time direction (horizontal edges), respectively. Applying two-dimensional filters that extract each feature separately makes it possible to decide, from the relative magnitude of the filter outputs, whether each time-frequency component belongs to P(x, t) or to H(x, t).

Let W̄(a, b) be the two-dimensional Fourier transform of W(x, t), where a is the Fourier component in the frequency direction and b is the Fourier component in the time direction. Applying the P(x, t) feature-extraction filter F̄_P(a, b) and the H(x, t) feature-extraction filter F̄_H(a, b) to W̄(a, b) gives one filter output for each component. The time-frequency masks m_P(x, t) and m_H(x, t) are then obtained from these filter outputs.

Various shapes are conceivable for the two-dimensional filters F̄_P(a, b) and F̄_H(a, b) that extract the features of P(x, t) and H(x, t). Specifically, they consist of a filter that substantially smooths in the time direction and a filter that substantially smooths in the frequency direction. More specifically, they include a one-dimensional low-pass filter acting only in the time direction paired with a one-dimensional low-pass filter acting only in the frequency direction, or two two-dimensional low-pass filters whose cutoff frequency ωt in the time direction and cutoff frequency ωf in the frequency direction differ greatly (ωt >> ωf for one, ωt << ωf for the other). In the simplest example, F_P(a, b) is a low-pass filter in the frequency direction only and F_H(a, b) is a low-pass filter in the time direction only, built from one-dimensional low-pass filters g(a) and h(b) whose cross-sectional shape can be a triangular window or a Gaussian. Those skilled in the art will understand that a filter extracting the smooth direction of the spectral components is not limited to a digital filter in the frequency domain and can also be designed as a spatial filter.
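As a rough illustration of the first method, the sketch below uses the spatial-filter variant: the spectrogram is smoothed once along time and once along frequency with a moving average, and a binary mask is chosen by comparing the two outputs. The function names, the moving-average kernel, the half-width `k`, and the tie-breaking rule are assumptions made for this example; they are not taken from the patent, whose filter equations are not reproduced in this text.

```python
def smooth_along_time(W, k=1):
    """Moving average over time for each frequency row; W is indexed [frequency][time]."""
    n_f, n_t = len(W), len(W[0])
    out = [[0.0] * n_t for _ in range(n_f)]
    for x in range(n_f):
        for t in range(n_t):
            lo, hi = max(0, t - k), min(n_t, t + k + 1)
            out[x][t] = sum(W[x][lo:hi]) / (hi - lo)
    return out


def smooth_along_frequency(W, k=1):
    """Moving average over frequency for each time column."""
    n_f, n_t = len(W), len(W[0])
    out = [[0.0] * n_t for _ in range(n_f)]
    for x in range(n_f):
        lo, hi = max(0, x - k), min(n_f, x + k + 1)
        for t in range(n_t):
            out[x][t] = sum(W[y][t] for y in range(lo, hi)) / (hi - lo)
    return out


def binary_masks(W, k=1):
    """Return (m_P, m_H): a bin is percussive when frequency smoothing keeps more energy."""
    h_out = smooth_along_time(W, k)       # preserves ridges that run along time
    p_out = smooth_along_frequency(W, k)  # preserves ridges that run along frequency
    # Ties (for example empty bins) are assigned to the harmonic mask by assumption.
    m_P = [[1 if p > h else 0 for p, h in zip(p_row, h_row)]
           for p_row, h_row in zip(p_out, h_out)]
    m_H = [[1 - m for m in row] for row in m_P]
    return m_P, m_H
```

Multiplying `W` bin by bin with `m_P` would then give the percussion-emphasized spectrogram; a triangular or Gaussian kernel could replace the moving average.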

[B-2] Second method
The second method proposes an iterative solution using an EM algorithm based on the anisotropy of the smoothness of the spectrogram. An objective function is defined as the sum of a smoothness cost and a distance measure, and the distribution coefficients are optimized so as to minimize this objective function.

(Introduction of smoothness cost)
This section discusses the problem of estimating H(x, t) and P(x, t) from W(x, t) by exploiting the anisotropy of the harmonic and percussive components in the spectrogram. Because (x, t) is obtained as discrete coordinates in an implementation, the following discussion is defined on the discrete time-frequency domain (x_i, t_j) (I: number of frequency bins, J: number of analysis frames).

In this embodiment, the anisotropy of the smoothness of the spectrogram is expressed as a cost to be minimized, namely the squared error between the square roots of the energies of adjacent time-frequency bins (Equations (13) and (14)).

The spectrum of a harmonic sound is smooth along the time axis, but its energy is often concentrated around particular frequencies, which form the pitch. Conversely, the spectrum of a percussive sound does not change abruptly along the frequency axis, but often exists only momentarily in time. Therefore, by adding the smoothness along the time axis and along the frequency axis to the objective function as a cost (the squared error between the square roots of the energies of adjacent time-frequency bins) and estimating the masks with the EM algorithm, percussive sounds can be separated and emphasized.
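The smoothness cost described above can be sketched as follows. The equations themselves are not reproduced in this text, so this is only one plausible reading: the harmonic cost sums squared differences of square-rooted energies between neighbouring frames, and the percussive cost does the same between neighbouring frequency bins. Any weighting factors in the original equations are omitted, and the function name is hypothetical.

```python
import math


def smoothness_costs(H, P):
    """Return (harmonic cost along time, percussive cost along frequency).

    H and P are non-negative energy arrays indexed [frequency][time].
    Weighting constants are left out; this is an unweighted assumption.
    """
    cost_h = sum(
        (math.sqrt(row[j]) - math.sqrt(row[j - 1])) ** 2
        for row in H
        for j in range(1, len(row))
    )
    cost_p = sum(
        (math.sqrt(P[i][j]) - math.sqrt(P[i - 1][j])) ** 2
        for i in range(1, len(P))
        for j in range(len(P[0]))
    )
    return cost_h, cost_p
```

A sustained tone (constant along time) placed in `H` and a click (constant along frequency) placed in `P` both cost nothing, while swapping the two assignments is penalized, which is the asymmetry the separation relies on.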

(Iterative parameter estimation by objective function minimization)
Time-frequency masks m_H(x_i, t_j) and m_P(x_i, t_j) that distribute the observed spectrogram between the harmonic and percussive components are introduced. The masks m_H(x_i, t_j) and m_P(x_i, t_j) satisfy the conditions of Equations (4), (5), and (6).

If the I-divergence is adopted as the measure of the distance between distributions, expressing how close the distributed energies m_P(x_i, t_j)W(x_i, t_j) and m_H(x_i, t_j)W(x_i, t_j) are to P(x_i, t_j) and H(x_i, t_j), the problem can be formulated as minimizing an objective function (Equation (15)) that is the sum of this divergence and the smoothness costs of Equations (13) and (14).

With this objective function, two updates are performed alternately: an update of H(x_i, t_j) and P(x_i, t_j) that minimizes Equation (15) with the time-frequency masks fixed, and an update of m_P(x_i, t_j) and m_H(x_i, t_j) that minimizes Equation (15) with H(x, t) and P(x, t) fixed. This yields a locally optimal solution to the minimization of objective function (15).

The I-divergence has the advantage that analytical update equations are easy to derive. However, any other distance measure, such as the Euclidean distance (squared error) or the Mahalanobis distance, may be used as long as the parameter update equations can be obtained analytically.

[B-3] Third method
Whereas the second method treats the problem of estimating H(x, t) and P(x, t) from W(x, t), the third method treats the problem as one of minimizing the smoothness cost of the distributed spectrograms.

Let F_{h,i} be the short-time Fourier transform (STFT) of a monaural audio signal f(t). The spectrogram is then given by φ(|F_{h,i}|^2), where h and i are the indices of the frequency bin and the time bin. With φ(A) = A this is the ordinary spectrogram; choosing a concave function such as φ(A) = A^γ (γ < 1) produces a range-compressed spectrogram.

The goal is to find an appropriate time-frequency binary mask m_{h,i} that splits the spectrogram into H_{h,i} and P_{h,i}, which denote the harmonic component and the non-harmonic (percussive) component of the spectrogram, respectively. One approach to designing the mask m_{h,i} is to apply maximum a posteriori (MAP) estimation based on a prior distribution.

Focusing on the envelopes of H_{h,i} and P_{h,i}, which are smooth in the horizontal and vertical directions respectively, a Gaussian prior on the spectrogram gradient is assumed for each component. The vectors H and P denote the sets of H_{h,i} and P_{h,i}, and σ_H^2 and σ_P^2 denote the variances of the spectrogram gradients, which will depend on the STFT frame length and frame shift. The actual distribution of the spectrogram gradient differs from a Gaussian, but assuming a Gaussian makes the problem easier to formulate and solve. As described later, compressing the dynamic range of the spectrogram with φ(A) narrows the gap between the actual situation and this assumption to some extent.

The objective function of the MAP estimation can therefore be written as Equation (20), where the vector m is the set of m_{h,i} and constant terms are omitted for simplicity.

Derivation of the update rule using an auxiliary function
Equation (20) is in definite-integral form with respect to m_{h,i}, and, if m is treated as a continuous-valued variable, the optimal vector m is obtained from ∂J/∂m_{h,i} = 0. To solve ∂J/∂m_{h,i} = 0 more easily, the auxiliary function approach can be used. Auxiliary functions are used, for example, in NMF (non-negative matrix factorization) and HTC (Harmonic-Temporal Clustering), and the approach is known to those skilled in the art.

The percussive-sound separation or emphasis method described here is characterized by its focus on the difference in the smooth direction of the spectral components in the spectrogram, but the processing steps that produce the separated signal do not require the spectrogram to be actually displayed on a screen. In the present invention it suffices that the spectral components obtained by transforming the sound signal under analysis into the time-frequency domain are available. The transform to the time-frequency domain is typically the short-time Fourier transform, but a wavelet transform, constant-Q filter bank analysis, or other filter bank analysis may be used. In an actual spectrogram computation, short-time frequency analysis yields a component for each discrete time and frequency. Each spectral component (time-frequency component) in the spectrogram is therefore a time-frequency bin specified by a time bin (frame) and a frequency bin. The algorithm for estimating the distribution coefficients, which are the parameters of the objective function, is the EM algorithm in one preferred embodiment, but other optimization algorithms such as steepest descent or Newton's method may be used. Auxiliary variables may also be introduced when solving with the EM algorithm. Those skilled in the art will further understand that other separation methods may be used for percussive-sound separation in the present invention.

[C] Extraction of percussion patterns
In music containing percussion, such as popular music, a percussion pattern with a certain rhythmic structure is often used repeatedly. The idea is therefore to extract the minimal pattern that makes up this repetition and use it as a rhythm feature. A piece often contains several such patterns, and each percussion pattern is to be extracted from the spectral sequence of the percussive sound.

[C-1] Segmentation by the One-pass DP method
Three problems arise when extracting percussion patterns, of which there are several kinds, as unit rhythm patterns: the intervals occupied by each kind of pattern are initially unknown; it must be estimated which intervals contain the same pattern; and the patterns are subject to some variation due to performance, arrangement, and the like.

To address these problems, this embodiment proposes an algorithm that extracts percussion patterns using the One-pass DP method used in continuous speech recognition (H. Ney and S. Ortmanns, "Dynamic programming search for continuous speech recognition," IEEE Signal Processing Magazine, Vol. 16, No. 5, pp. 64-83, September 1999). Given a known number of suitable initial reference patterns, the observed input can be matched against these reference patterns to obtain the optimal pattern classification and segmentation. If the DP matching paths are set appropriately, variations in the patterns can be absorbed.

In designing the One-pass DP, tempo variation is assumed to be small in music containing percussion, so the design allows only a small amount of time warping. With the local paths set as in FIG. 7(A), and given a reasonable initial reference pattern width, the percussion patterns are optimally segmented within an allowed range of widths close to the given reference pattern width. If the reference pattern width is fixed, the same holds after the references are updated.

If the distance between frames is taken to be the Mahalanobis distance between vectors whose elements are the frequency bins, patterns can be extracted with attention to differences in timbre that are visible in the spectral shape, for example distinguishing a bass drum, which has high energy in the low-frequency range, from a hi-hat, which has little energy in the low-frequency range but high energy in the high-frequency range.

The advantages of using One-pass DP include not only the many choices of local path but also the freedom to change the path constraints. For example, if the difference between a snare drum played on beat 3 and one played on beat 3.5 is to be treated as significant, the range of time variation absorbed by One-pass DP can be set more strictly so that the two are extracted as different percussion patterns. One way to do this, with i the frame index of the input pattern and j the frame index of the reference pattern, is to require −D ≤ i − j ≤ D for some threshold D and to disallow any path outside this band.
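The band constraint −D ≤ i − j ≤ D can be illustrated with an ordinary DP match between one input segment and one reference pattern. In the sketch below the frame distance is a Mahalanobis distance with a diagonal covariance, as the text describes; the symmetric three-way local path, the function names, and the toy data are assumptions for illustration and are not the paths of FIG. 7.

```python
import math


def frame_distance(x, y, var):
    """Mahalanobis distance with a diagonal covariance (per-bin variances in var)."""
    return sum((a - b) ** 2 / v for a, b, v in zip(x, y, var))


def banded_dp_distance(inp, ref, var, D):
    """Cumulative DP distance that only allows cells with -D <= i - j <= D."""
    M, N = len(inp), len(ref)
    acc = [[math.inf] * N for _ in range(M)]
    for i in range(M):
        for j in range(N):
            if abs(i - j) > D:
                continue  # outside the band: path not permitted
            d = frame_distance(inp[i], ref[j], var)
            if i == 0 and j == 0:
                acc[i][j] = d
                continue
            best = min(
                acc[i - 1][j] if i > 0 else math.inf,
                acc[i][j - 1] if j > 0 else math.inf,
                acc[i - 1][j - 1] if i > 0 and j > 0 else math.inf,
            )
            acc[i][j] = d + best
    return acc[M - 1][N - 1]
```

With a one-bin "snare" at frame 4 in the reference and at frame 5 in the input, `D = 1` lets the alignment absorb the shift (distance 0.0), whereas `D = 0` forces a frame-by-frame comparison (distance 2.0), so the two would be treated as different patterns.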

[C-2] Updating the reference patterns with k-means clustering
Because the correct unit rhythm patterns are initially unknown, unsupervised learning must be performed within the piece. Based on the pattern classification and segmentation obtained by the method above, the center pattern of each cluster is taken as the updated reference pattern. Repeating these steps yields, by the principle of the k-means algorithm, locally optimal reference patterns, segmentation, and per-segment pattern classification. With this learning algorithm, unit rhythm patterns can be extracted from the audio signal of a piece.

[C-3] Determining the initial reference patterns
In the algorithm above, the locally optimal solution is expected to depend on the initial reference patterns. Three ways of choosing initial reference patterns that lead to a better local optimum are considered.

(Method 1: Use of existing percussion patterns)
Several spectral sequences of common percussion patterns are prepared, and some of them are selected as the initial reference patterns. With this method the percussion patterns finally learned are expected to resemble existing percussion patterns, so convergence to a reasonable local optimum is likely. However, a large variety of patterns must be prepared, and because One-pass DP selects only some of them, the number of patterns cannot be controlled. This is a problem also seen in k-means clustering, and preparing a rich variety of existing percussion patterns is not easy either.

(Method 2: Use of representative segments)
Given an approximate pattern width, spectra of that width are extracted at random from the input pattern and used as reference patterns. For percussion patterns, however, the distances within each pattern cluster in a piece are expected to be small, so the same pattern is likely to be chosen more than once, with the result that a single percussion pattern may be split across several clusters.

(Method 3: Use of segment averages)
As above, given an approximate pattern width, the input pattern is cut into segments of that width at equal intervals. Each segment is given a label, either at random or in order, and the average of the segments sharing a label is used as a reference pattern. With this method almost all reference patterns are similar to one another, but because each is given as an average across the true percussion pattern clusters, the error of one percussion pattern being split across several clusters is expected to be less likely.

Based on these considerations, one embodiment adopts Method 3, in which spectral sequences taken from the input pattern at equal intervals and averaged are used as the initial reference patterns. The pattern width (number of frames) must be supplied; this is resolved by computing the autocorrelation function locally and determining the approximate pattern width from its peak.
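A minimal sketch of this initialization follows, assuming the input is a list of per-frame spectra. The pattern width is taken as the lag that maximizes the autocorrelation of the frame energy within a caller-supplied lag range, and labels are assigned to segments in order. The function names, the use of total frame energy, and the lag range are assumptions for this example; the patent computes the autocorrelation locally and does not specify these details.

```python
def estimate_pattern_width(frames, min_lag, max_lag):
    """Return the lag (in frames) with the largest autocorrelation of frame energy."""
    energy = [sum(f) for f in frames]
    mean = sum(energy) / len(energy)
    c = [e - mean for e in energy]
    best_lag, best_r = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):  # requires max_lag < len(frames)
        n = len(c) - lag
        r = sum(c[t] * c[t + lag] for t in range(n)) / n
        if r > best_r:
            best_lag, best_r = lag, r
    return best_lag


def initial_references(frames, width, n_patterns):
    """Cut the input into equal segments, label them in order, average per label."""
    segs = [frames[s:s + width] for s in range(0, len(frames) - width + 1, width)]
    n_bins = len(frames[0])
    refs = []
    for label in range(n_patterns):  # needs at least n_patterns segments
        members = segs[label::n_patterns]
        refs.append([[sum(seg[t][b] for seg in members) / len(members)
                      for b in range(n_bins)] for t in range(width)])
    return refs
```

Restricting the lag range to roughly one bar avoids picking a multiple of the true period; random labeling would replace the slice `segs[label::n_patterns]` with a shuffled assignment.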

[C-4] Updating the reference patterns
In the algorithm above, the input is divided into segments by the One-pass DP method and each segment is classified into a pattern. The center of the segments belonging to each pattern is then computed and used as the updated reference pattern, so that the optimal segmentation and classification are found in the manner of the k-means algorithm. There are several ways to compute the segment center.

One way is to select a single representative from the segments in each classified cluster. Specifically, since the distance between any two segments in a cluster can be computed by DP matching, the segment whose total distance to all other segments is smallest is taken as the cluster center. The One-pass DP distance cost is then reduced at every update, and convergence can be guaranteed.

The other way is to compute the average of the segments in a cluster according to the alignment obtained by One-pass DP. The One-pass DP distance cost equals the sum of the distances between the reference pattern frames and the input pattern frames that the alignment puts in correspondence. If a squared error (Mahalanobis distance) is chosen as the distance function, taking the average minimizes the sum of the distances of the corresponding frames, so the One-pass DP distance cost is likewise reduced at every update and convergence can be guaranteed.

The former method is unstable because the width of the reference pattern changes greatly at every update. Since the aim is to extract percussion patterns, the pattern width should be kept as constant as possible, ideally one measure. This work therefore adopts the latter way of computing the center. Computing the average not only minimizes the One-pass DP distance cost but is also expected to reduce the residual harmonic components left over from imperfect harmonic/percussive separation, so that purer percussion patterns can be extracted.

[C-5] Formulation of the algorithm
[C-5-1] One-pass DP method
Because the distance computed by One-pass DP becomes smaller as the reference patterns and the input pattern become closer, the task can be formulated as minimizing the One-pass DP distance. Let the input pattern sequence be X_1, ..., X_m, ..., X_M, and let the pattern sequence for reference pattern R_v be Y_{v1}, ..., Y_{vn}, ..., Y_{vN_v}. The distance between nodes used here is, for R_v, a Mahalanobis distance between X_m and Y_{vn} in which Σ is the diagonal covariance matrix of the input pattern.

Let i_x be the frame index of the input pattern sequence and i_y the frame index of the reference pattern sequence. The local path computation accumulates, over the allowed predecessor points, a term ξ equal to the weighted cost of moving from point (i'_x, i'_y) to point (i_x, i_y). The quantity L_s depends on the choice of local path and satisfies the condition given in L. Rabiner and B. H. Juang, Fundamentals of Speech Recognition, chapter 4, pp. 200-238, Prentice Hall, 1993. For the paths of FIG. 7(A), the weights can be set as in FIG. 7(B), in which case L_s = 1.

The concrete computation distinguishes four cases: the initialization at i_x = 1, i_y = 1; the recursion inside a reference pattern R_v for 1 ≤ i_x ≤ M; the case i_y = 1, where the paths between reference patterns R_v and R_v' follow FIG. 7(C); and the case i_y = 2, which follows FIG. 7(D).

The final cumulative distance is the accumulated distance at the last input frame, taken at the end of a reference pattern.

[C-5-2] Cluster average computation
The distance computed by One-pass DP can be written as Equation (30). Let φ_y(k) that minimizes Equation (30) be denoted i_{v,n}, the corresponding φ_x be denoted i_m(v, n), and m(k) be denoted m(v, n). Setting ∂D/∂i_{v,n} = 0 gives the value of each i_{v,n} that minimizes D (Equation (32)). M_{v,n} is the number of elements i_m(v, n) corresponding to i_{v,n}; when every m(k) equals 1, i_{v,n} is simply the average of the i_m(v, n).

The distance D' after the update is no greater than the distance D before it. Since the total distance D is positive and each update decreases it, or at least does not increase it, convergence is guaranteed.
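The averaging update for the case in which all weights m(k) equal 1 can be sketched as below. The alignment is assumed to be a list of triples (input frame index, pattern index v, reference frame index n); that representation, and the function name, are choices made for this example rather than anything specified in the patent.

```python
def update_references(frames, alignment, refs):
    """Replace each aligned reference frame by the mean of its input frames.

    frames:    list of input spectra (lists of floats)
    alignment: list of (i, v, n) triples from the DP backtrace (assumed format)
    refs:      refs[v][n] is frame n of reference pattern v
    """
    sums = {}
    for i, v, n in alignment:
        total, count = sums.get((v, n), ([0.0] * len(frames[i]), 0))
        sums[(v, n)] = ([s + x for s, x in zip(total, frames[i])], count + 1)
    # Reference frames that received no input frame keep their old value.
    new_refs = [[list(frame) for frame in ref] for ref in refs]
    for (v, n), (total, count) in sums.items():
        new_refs[v][n] = [s / count for s in total]
    return new_refs
```

Because the mean minimizes the summed squared error to the frames assigned to it, this step cannot increase the total distance for a fixed alignment, which is the property the convergence argument relies on.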

[C-6] Algorithm procedure
The algorithm shown in FIG. 6 is described in detail below.
[1] Emphasis of percussive sound
(1) Apply the short-time Fourier transform to the input music audio signal.
(2) Apply harmonic/percussive separation to the resulting spectral sequence.
(3) Compute bins on the mel scale along the frequency direction.

[2] Computation of the initial reference patterns
(1) Find the peak of the autocorrelation function of the input spectral pattern emphasized above.
(2) Divide the input spectrum into equal segments whose length is the number of frames at that peak.
(3) Label the resulting spectral segments at random (or in order).
(4) Gather the segments sharing a label and compute the average for each frame and each frequency to obtain the initial reference patterns.

[3] Computation of the One-pass DP method
(1) Take a frame i of the input spectral pattern.
(2) For frame j of reference pattern v, compute D_A(i, j, v) by Equations (26)-(28).
(3) When frame j is the last frame of the reference pattern, store the value of D_A(i, N_v, v).
(4) When the above has been computed for all reference patterns, store min D_A(i, N_v, v).
(5) Add 1 to i and repeat steps (1)-(4), computing sequentially.
(6) When i has reached the last frame of the input pattern, perform a backtrace to obtain the alignment.

[4] Update of the reference patterns
(1) Using the obtained alignment, collect all input spectrum frames i_m(v, n) corresponding to frame i_{v,n} of each reference pattern, and compute the average separately for each frequency (Equation (32)).
(2) Use the computed averages as the new reference patterns.

[5] Repeat [3] and [4] alternately until the convergence condition is satisfied.
Percussion patterns can be extracted by computing in this order.
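Stage [3] of the procedure above can be sketched as a simplified One-pass DP. The local paths used here (hold the reference frame, advance by one, or skip one frame, plus a jump from the end of any pattern to the start of any pattern) are an assumption chosen for brevity; they are not the paths of FIG. 7, and Equations (26)-(28) are not reproduced. The squared-error distance and all names are likewise illustrative.

```python
import math


def sq_dist(a, b):
    """Squared Euclidean distance between two frames (unit variances assumed)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def one_pass_dp(frames, refs, dist=sq_dist):
    """Return (total_cost, labels); labels[i] = (pattern v, reference frame j)."""
    M = len(frames)
    cost = [[[math.inf] * len(r) for r in refs] for _ in range(M)]
    back = [[[None] * len(r) for r in refs] for _ in range(M)]
    for v, r in enumerate(refs):
        cost[0][v][0] = dist(frames[0], r[0])  # every path starts a pattern
    for i in range(1, M):
        # Cheapest pattern that ended at the previous input frame.
        end_cost, end_v = min((cost[i - 1][v][-1], v) for v in range(len(refs)))
        for v, r in enumerate(refs):
            for j in range(len(r)):
                cands = [(cost[i - 1][v][j], (v, j))]  # hold reference frame
                if j >= 1:
                    cands.append((cost[i - 1][v][j - 1], (v, j - 1)))  # advance
                if j >= 2:
                    cands.append((cost[i - 1][v][j - 2], (v, j - 2)))  # skip one
                if j == 0:  # pattern boundary: start a new pattern
                    cands.append((end_cost, (end_v, len(refs[end_v]) - 1)))
                best, prev = min(cands, key=lambda c: c[0])
                if best < math.inf:
                    cost[i][v][j] = best + dist(frames[i], r[j])
                    back[i][v][j] = prev
    total, v = min((cost[M - 1][v][-1], v) for v in range(len(refs)))
    if total == math.inf:
        raise ValueError("input too short to complete any reference pattern")
    labels, state = [None] * M, (v, len(refs[v]) - 1)
    for i in range(M - 1, -1, -1):  # backtrace from the last input frame
        labels[i] = state
        state = back[i][state[0]][state[1]]
    return total, labels
```

The returned labels give, for each input frame, which reference pattern and which frame within it the frame was matched to, which is the information stage [4] needs in order to average the aligned frames.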

[C-7] Experiment
The algorithm above was applied to actual music in an experiment on percussion pattern extraction. WAV files from the RWC Music Database (Goto et al., "RWC Music Database: Music Genre Database and Musical Instrument Sound Database," IPSJ SIG Technical Report on Music and Computer, 2002-MUS45-4, Vol. 2002, No. 40, pp. 19-26, May 2002) were converted to one channel and downsampled to 22.05 kHz. The percussion-emphasized sound obtained by the harmonic/percussive separation method was short-time Fourier transformed with a frame length of 1024 samples and a shift of 512 samples, and the resulting power spectrum was used as the input pattern. To reduce the dimensionality of the patterns, the power spectrum was averaged into eight bins spaced logarithmically in frequency. As initial reference patterns, as many spectral patterns as there are percussion patterns were supplied, each 2 seconds long (corresponding to one measure) and regarded as a typical percussion pattern at 120 BPM. The number of frames in each reference pattern was fixed. From this data, clustering with the One-pass DP method was applied to RWC-MDB-G-2001 No. 16, a dance piece with four kinds of percussion pattern, and the converged result is shown in FIG. 11.
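As a quick check of the figures in this experiment, the number of analysis frames in one reference pattern follows from the sampling rate, the frame shift, and the tempo. The helper below is a hypothetical illustration; the assumption of four beats per measure is not stated in the text but is consistent with 2 seconds per measure at 120 BPM.

```python
def frames_per_measure(sample_rate, hop, bpm, beats_per_measure=4):
    """Number of analysis frames (possibly fractional) spanned by one measure."""
    seconds = beats_per_measure * 60.0 / bpm  # 4 beats at 120 BPM = 2.0 s
    return seconds * sample_rate / hop


# 22.05 kHz audio with a 512-sample shift: about 86 frames per 2-second measure.
print(frames_per_measure(22050, 512, 120))
```

With eight frequency bins per frame, each reference pattern in this setting is therefore roughly an 86 x 8 array.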

In FIG. 11, the horizontal axis is the frame number of the input pattern, and the vertical axis shows the reference patterns stacked vertically. FIG. 12 shows the spectrum of each learned reference pattern. The patterns contain sounds with only low-frequency energy, sounds with only high-frequency energy, and sounds spread over the whole frequency range, which confirms that they represent the timbre patterns of percussion instruments. FIG. 11 can be regarded as showing the structure of the piece as a map of percussion patterns, and is called a "rhythm map".

[C-8] Music structure analysis by rhythm using the One-pass DP method
Those skilled in the art will understand that several concrete computational schemes exist for the One-pass DP method. If the percussion pattern of the input piece is assumed to be played from each template pattern according to some probability distribution, the problem can be formulated as computing the log-likelihood sequentially with the One-pass DP method and maximizing its sum. As described above, the One-pass DP method yields, as in continuous speech recognition, the optimal segmentation and the alignment to each template from the given templates of the unit patterns. Because the tempo does not change greatly in most pieces that contain a percussion part, the local paths used in the dynamic programming can be designed so that only a narrow range of temporal variation is allowed.

The spectral shape in each analysis frame represents the timbre played at that time. For example, the spectrum of a bass drum has large energy in the low-frequency range, whereas that of a hi-hat has large energy in the high-frequency range. The model should capture such differences in spectral shape. A probabilistic model can be designed by assuming that the energy in each frequency band of each time frame of the input piece is observed according to a probability distribution whose expected value is the energy in the same frequency band of the corresponding time frame of the template percussion pattern. The larger the likelihood of the observation, the closer the timbre is to the template.

Let $P(t, f)$ denote the time-frequency component of the percussion spectrogram, where $t$ is time and $f$ is log-frequency. Because the observations are discrete in practice, we write $P(t_x, f_y) = P_{x,y}$, let $Y$ be the number of frequency bins, and collect the frequency components $P_{x,1}, \ldots, P_{x,Y}$ at time $t_x$ into the vector $r_x = (P_{x,1}, \ldots, P_{x,Y})^{\mathsf T}$. Template spectrograms for the One-pass DP method with time-frequency components $M_m(t_i, f_y) = M_{m,i,y}$ ($m = 1, \ldots, M$) are prepared and written in the same way as $\mu_{m,i} = (M_{m,i,1}, \ldots, M_{m,i,Y})^{\mathsf T}$.

With $e_{m,i,x} = \mu_{m,i} - r_x$, the probability that the spectrum $r_x$ at time $x$ is emitted from time frame $i$ of template $m$ can then be written as

$$p_{m,i}(r_x) = \frac{1}{(2\pi)^{Y/2}\,|\Sigma_{m,i}|^{1/2}} \exp\left(-\frac{1}{2}\, e_{m,i,x}^{\mathsf T}\, \Sigma_{m,i}^{-1}\, e_{m,i,x}\right),$$

where $\Sigma_{m,i}$ is the diagonal covariance matrix of the spectral frequency bins in time frame $i$ of template $m$.
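As a minimal sketch of this per-frame score, the function below evaluates the log-likelihood of one input spectrum against one template frame, assuming the density is a multivariate Gaussian with diagonal covariance. The function name and the representation of the covariance as a list of per-bin variances are illustrative choices, not part of the source.

```python
import math


def frame_log_likelihood(r, mu, var):
    """ln p(r) for a Gaussian with mean mu and diagonal covariance diag(var).

    r, mu and var are equal-length lists with one entry per frequency bin.
    """
    total = 0.0
    for r_y, mu_y, var_y in zip(r, mu, var):
        e = mu_y - r_y  # one component of e = mu - r
        total += -0.5 * (math.log(2.0 * math.pi * var_y) + e * e / var_y)
    return total
```

A spectrum close to the template mean scores higher than a distant one, which is the sense in which a larger likelihood indicates a timbre closer to the template.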

Using this probability, the One-pass DP algorithm follows the local paths of FIG. 7(A) and sequentially accumulates the log-likelihood $\ln p_{m,i}(r_x)$ of each frame multiplied by the path cost $w$. When the path costs are designed as in FIG. 7(A), all routes that arrive at the same point are treated equivalently regardless of which local paths they took. Searching for the most likely path therefore yields the alignment to each template pattern, and the music structure in terms of the prepared template percussion patterns is estimated.

Suppose that the path maximizing the total log-likelihood associates the spectrum $r_{x(a)}$ of each time frame $x(a)$ with the spectrum $\mu_{m(a),i(a)}$ at time frame number $i = i(a)$ of template $m = m(a)$. Writing the weighted log-likelihood of alignment step $a$ as

$$d(a) = w(a)\,\ln p_{m(a),i(a)}\left(r_{x(a)}\right),$$

where $w(a)$ is the path cost applied at step $a$, the final total log-likelihood $D_A$ can be expressed as

$$D_A = \sum_{a=1}^{A} d(a). \tag{35}$$
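The search for the maximum-likelihood path can be sketched as a small dynamic program over (template, template frame) states. This is only an illustration under stated assumptions: the three local moves (stay, advance by one, skip one) and the unit path cost for every move are hypothetical stand-ins for the local paths and costs of FIG. 7(A), a single shared variance is used for all template frames, and the names `one_pass_dp` and `frame_loglik` are not from the source.

```python
import math


def frame_loglik(r, mu, var):
    """Log-likelihood of spectrum r under a Gaussian with mean mu and diagonal variance var."""
    return sum(-0.5 * (math.log(2.0 * math.pi * v) + (m - x) ** 2 / v)
               for x, m, v in zip(r, mu, var))


def one_pass_dp(frames, templates, var=1.0):
    """Return (total log-likelihood, path); path[x] = (template m, template frame i)."""
    neg_inf = float("-inf")
    states = [(m, i) for m, t in enumerate(templates) for i in range(len(t))]
    ends = [(m, len(t) - 1) for m, t in enumerate(templates)]
    prev, back = {}, []
    for x, r in enumerate(frames):
        cur, ptr = {}, {}
        for m, i in states:
            if x == 0:
                # The piece must start at the first frame of some template.
                best, arg = (0.0 if i == 0 else neg_inf), None
            else:
                cands = [(m, i)]                  # stay: tempo slightly slower
                if i == 0:
                    cands += ends                 # enter from the end of any template
                else:
                    cands.append((m, i - 1))      # advance one template frame
                if i >= 2:
                    cands.append((m, i - 2))      # skip one frame: tempo slightly faster
                arg = max(cands, key=lambda s: prev[s])
                best = prev[arg]
            cur[(m, i)] = best + frame_loglik(r, templates[m][i], [var] * len(r))
            ptr[(m, i)] = arg
        back.append(ptr)
        prev = cur
    # The piece must end at the last frame of some template.
    last = max(ends, key=lambda s: prev[s])
    path = [last]
    for x in range(len(frames) - 1, 0, -1):
        path.append(back[x][path[-1]])
    path.reverse()
    return prev[last], path
```

The returned path gives both the bar boundaries (where the template frame index returns to 0) and the template label of each bar, which together form the rhythm map.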

[C-9] Updating the template unit percussion patterns using the k-means clustering method
The One-pass DP method segments the input percussion pattern optimally, and the resulting segments are clustered. Next, to estimate the unit percussion patterns, the pattern at the center of each cluster is computed as in the k-means clustering method and adopted as the new template pattern. Each center pattern is computed by averaging the segments labeled as belonging to the same cluster, based on the alignment given by the One-pass DP method. In practice, a weighted average that uses the path costs as weights maximizes the likelihood described above. Consequently, when the segmentation by the One-pass DP method and the computation of the cluster center patterns are iterated, the total log-likelihood of the One-pass DP method never decreases, and convergence is guaranteed.

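The concrete update follows the maximum-likelihood parameter update

$$\left(\hat{\mu}_{m,i},\, \hat{\Sigma}_{m,i}\right) = \operatorname*{arg\,max}_{\mu_{m,i},\,\Sigma_{m,i}} D_A,$$

where $D_A$ is the sum over the alignment steps $a$ of the log-likelihoods $\ln p_{m(a),i(a)}(r_{x(a)})$ weighted by the path costs $w(a)$. Setting

$$\frac{\partial D_A}{\partial \mu_{m,i}} = -\sum_{a \in A_{m,i}} w(a)\, \Sigma_{m,i}^{-1} \left(\mu_{m,i} - r_{x(a)}\right)$$

to zero gives

$$\hat{\mu}_{m,i} = \frac{\sum_{a \in A_{m,i}} w(a)\, r_{x(a)}}{\sum_{a \in A_{m,i}} w(a)}, \tag{37}$$

where

$$A_{m,i} = \{\, a \mid m(a) = m,\ i(a) = i \,\}$$

is the set of alignment steps assigned to time frame $i$ of template $m$. Similarly, setting

$$\frac{\partial D_A}{\partial \Sigma_{m,i}^{-1}} = \frac{1}{2} \sum_{a \in A_{m,i}} w(a) \left(\Sigma_{m,i} - \operatorname{diag}\left(e_{m,i,x(a)}\, e_{m,i,x(a)}^{\mathsf T}\right)\right)$$

to zero gives

$$\hat{\Sigma}_{m,i} = \frac{\sum_{a \in A_{m,i}} w(a)\, \operatorname{diag}\left(e_{m,i,x(a)}\, e_{m,i,x(a)}^{\mathsf T}\right)}{\sum_{a \in A_{m,i}} w(a)}, \tag{38}$$

with $e_{m,i,x(a)} = \hat{\mu}_{m,i} - r_{x(a)}$. With this update, the value $D'_A$ recomputed by the One-pass DP method satisfies

$$D'_A \ge D_A$$

relative to the previously computed $D_A$. Because the iterative update never decreases the objective function, convergence is guaranteed.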
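The template update can be sketched as a weighted mean and weighted per-bin variance over the input frames aligned to each template frame. This is a minimal illustration assuming that the alignment is given as a list of (template, frame) pairs and that one path-cost weight is available per input frame; the function name and data layout are hypothetical.

```python
def update_templates(frames, path, weights):
    """Weighted mean and diagonal variance for every template frame (m, i) on the path.

    frames:  list of spectra (lists of floats), one per input frame
    path:    list of (m, i) pairs, the template frame aligned to each input frame
    weights: list of path-cost weights, one per input frame
    """
    groups = {}
    for r, state, w in zip(frames, path, weights):
        groups.setdefault(state, []).append((w, r))
    mu, var = {}, {}
    for state, items in groups.items():
        w_sum = sum(w for w, _ in items)
        dim = len(items[0][1])
        mu[state] = [sum(w * r[y] for w, r in items) / w_sum for y in range(dim)]
        var[state] = [sum(w * (r[y] - mu[state][y]) ** 2 for w, r in items) / w_sum
                      for y in range(dim)]
    return mu, var
```

In practice a small floor would likely be added to each variance before it is used in a log-likelihood, since a template frame aligned to a single input frame has zero variance.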

[C-9] Determining the optimal number of clusters using an information criterion
In the proposed algorithm, the choice of the number of clusters can become a problem: the result may change considerably depending on k in the k-means clustering method described above. Naturally, as the number of clusters increases, the percussion patterns in the piece can be represented more closely, so the objective function (Eq. (35)), which is the total likelihood, becomes larger, and it is expected to reach its maximum when as many clusters are prepared as there are bars in the piece. That is unlikely to be the optimal choice, however, so an information criterion is used as one way of determining the optimal number of clusters. The information criterion decreases as the likelihood increases but increases as the model carries more redundant information, so the number of clusters can be chosen as the one that minimizes it. The Bayesian information criterion (BIC) can be used as one such criterion.

Suppose that the algorithm above segments the observation vector sequence $r_1, \ldots, r_X$ into bars, yielding $N$ spectrogram patterns $R_1, \ldots, R_N$, where $R_n = (r_{x(n-1)+1}, \ldots, r_{x(n)})$ with $x(0) = 0$ and $a(N) = A$. The percussion patterns $R_1, \ldots, R_N$ played in the bars of the piece can be assumed to occur with a probability that follows a Gaussian distribution centered on the corresponding reference pattern. The model of Eq. (41) can therefore be fitted to each cluster $m = 1, \ldots, M$.

Here too, by the maximum-likelihood method, the parameters that maximize Eq. (41) are given by Eqs. (37) and (38). Using the maximum log-likelihood obtained with these parameters, the Bayesian information criterion is written as Eq. (42), and the number of clusters $M$ that minimizes it can be regarded as the optimal number of clusters.
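As a hedged illustration of model selection by BIC, the sketch below uses the standard textbook form, minus twice the maximum log-likelihood plus the number of free parameters times the logarithm of the number of samples. The exact parameter count and sample count used in Eq. (42) are not reproduced here, so `n_params` and `n_samples` are assumptions supplied by the caller, and the numbers in the usage example are made up.

```python
import math


def bic(max_log_likelihood, n_params, n_samples):
    """Standard Bayesian information criterion (smaller is better)."""
    return -2.0 * max_log_likelihood + n_params * math.log(n_samples)


def best_cluster_count(candidates):
    """candidates maps a cluster count M to (max_log_likelihood, n_params, n_samples)."""
    return min(candidates, key=lambda m: bic(*candidates[m]))


# Hypothetical values: the likelihood keeps improving with M,
# but the penalty term eventually outweighs the gain.
example = {1: (-500.0, 16, 50), 2: (-400.0, 32, 50), 3: (-395.0, 48, 50)}
chosen = best_cluster_count(example)
```

In this toy example the step from two to three clusters improves the log-likelihood only slightly, so the penalty dominates and two clusters are selected.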

One way to create the clusters is to start from a single template unit percussion pattern and increase the number of clusters by a method such as the LBG algorithm (Linde, Y., Buzo, A. and Gray, R., "An algorithm for vector quantizer design," IEEE Trans. Commun., vol. 28, pp. 84-95, 1980). The clusters are increased one at a time by splitting the cluster with the largest variance into two. After each increase, the optimal unit percussion patterns are estimated and the music structure is analyzed with the algorithm described above, and the information criterion (Eq. (42)) is computed. This is repeated and stopped as soon as the information criterion increases, which determines the optimal number of clusters.
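The splitting step can be sketched as follows. The multiplicative perturbation by a small factor is the classic LBG way of turning one centroid into two; whether the source perturbs the centroid in exactly this way is an assumption, as are the function name and the flat-vector representation of a template.

```python
def split_largest_variance(centroids, variances, eps=0.01):
    """Replace the centroid of the highest-variance cluster by two perturbed copies.

    centroids: list of flat vectors (one per cluster)
    variances: list of total variances (one float per cluster)
    """
    k = max(range(len(centroids)), key=lambda j: variances[j])
    plus = [v * (1.0 + eps) for v in centroids[k]]
    minus = [v * (1.0 - eps) for v in centroids[k]]
    return centroids[:k] + [plus, minus] + centroids[k + 1:]
```

The two perturbed copies then serve as initial templates and are refined by the alternating segmentation and update steps.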

[C-10] Algorithm procedure
The algorithm discussed above can be summarized in the following steps:
(1) Percussion enhancement: obtain a spectrogram in which the percussion components have been separated and enhanced by harmonic/percussive sound separation.
(2) Optimal segmentation: compute the path that maximizes Eq. (35) with the One-pass DP method.
(3) Template update: compute the cluster centers by Eqs. (37) and (38), as in the k-means clustering method, and adopt them as the new template patterns.
(4) Iteration: repeat steps 2 and 3 until the objective function (Eq. (35)) converges.
(5) Information criterion: compute the information criterion (Eq. (42)) and terminate if it has increased over its previous value.
(6) Cluster splitting: split the cluster with the largest variance into two, as in the LBG algorithm.
(7) Return to step 2.
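The control flow of steps (2) to (7) can be sketched as a small driver that takes the individual operations as callables. Everything here is a hypothetical skeleton: `segment`, `update`, `criterion` and `split` stand for the One-pass DP segmentation, the template update, the information criterion and the cluster split, and the limits `max_clusters`, `tol` and `max_iter` are illustrative safeguards that the source does not specify.

```python
def extract_unit_patterns(templates, segment, update, criterion, split,
                          max_clusters=8, tol=1e-6, max_iter=100):
    """Alternate segmentation and template update, growing the cluster count
    until the information criterion increases. Returns (templates, alignment)."""
    best, best_score = None, float("inf")
    while len(templates) <= max_clusters:
        previous = float("-inf")
        for _ in range(max_iter):
            objective, alignment = segment(templates)   # step (2)
            templates = update(templates, alignment)    # step (3)
            if objective - previous < tol:              # step (4): converged
                break
            previous = objective
        score = criterion(objective, len(templates))    # step (5)
        if score > best_score:
            break                                       # criterion increased: stop
        best, best_score = (templates, alignment), score
        templates = split(templates, alignment)         # step (6), then step (7)
    return best
```

Passing the operations as arguments keeps the loop independent of how the segmentation and the criterion are actually implemented.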

[C-11] Experimental evaluation
An evaluation experiment was carried out to confirm whether the algorithm for extracting bar-unit percussion patterns described above can be applied to actual music that contains harmonic sounds as well as percussion sounds. As the data set, WAV files from the RWC Music Database (Goto, M., Hashiguchi, H., Nishimura, T. and Oka, R., "RWC music database: music genre database and musical instrument sound database," Proc. of the 4th Int. Conf. on Music Information Retrieval (ISMIR 2003), pp. 229-230, October 2003) were downsampled to a 22.05 kHz, one-channel signal. Harmonic/percussive sound separation was applied to the spectrogram obtained by a short-time Fourier transform (STFT) with a 1024-point Hamming window and half-window overlap. To capture differences in spectral shape while also reducing the amount of computation, the log-frequency axis of the spectrogram was divided equally into eight bands and the average of each band was computed.

The dance music piece RWC-MDB-G-2001 No. 16 from this data set was used. The template unit percussion patterns were updated iteratively; FIG. 13 shows the alignment obtained when the information criterion after convergence was at its minimum, and FIG. 14 shows the individual bar-unit percussion patterns that were learned.

FIG. 13 shows that, for a certain period from the start of the piece, pattern 2 appears intermittently while pattern 1 is being played. Listening to the actual piece confirms that pattern 1 is played repeatedly and pattern 2 is played once every four bars. This basic rhythm is followed by the percussion pattern of the interlude (pattern 3) and then by the percussion pattern of the climax section (pattern 4).

This section discussed a method for extracting bar-unit percussion patterns from a music acoustic signal and for representing and analyzing the music structure as a map of those patterns. A percussion-enhanced spectrogram is obtained from the acoustic signal by the harmonic/percussive sound separation method, and an algorithm was proposed that uses it to learn the unit percussion patterns and the "rhythm map" iteratively by combining the One-pass DP method with the k-means clustering method. A method for determining the optimal number of clusters with an information criterion was also introduced. Experiments confirmed that plausible unit percussion patterns are extracted from pieces of various genres and that the rhythm map indicating where they are played is estimated. As an application, the individual unit percussion patterns extracted by this algorithm could be used for genre classification. The rhythm map, which shows the music structure, could also be used to improve the performance of music retrieval.

[D] Automatic replacement of percussion patterns in a music acoustic signal
Until now, listening to music has meant simply playing it back and enjoying it passively. In recent years, however, techniques for active listening have been developed, such as adjusting the volume of each frequency band or processing the music while it is played back. Processing a melody requires relatively specialized knowledge, whereas processing the percussion can easily change the mood of a piece even with little knowledge. This embodiment therefore discusses the automatic replacement of percussion patterns.

Previous work on processing percussion sounds includes a method for extracting the bass drum and snare drum from polyphonic music (Zils, A. et al., "Automatic Extraction of Drum Tracks from Polyphonic Music Signals," in Proc. of WEDELMUSIC, pp. 179-183, 2002) and Drumix, a music player that edits the drum part in real time (Yoshii et al., "Drumix: An audio player with real-time drum-part editing functions," IPSJ Interaction 2006 Proceedings, pp. 207-208, March 2006). The latter can replace the drum part of a piece with entirely different drum timbres and edit it, so that the listener can, for example, play the piece while the drum pattern changes randomly from bar to bar, or play it with a favorite drum pattern held fixed.

As one such music processing technique, this embodiment proposes a method that recognizes the multiple percussion patterns played in a music acoustic signal and replaces each of them with a preferred percussion pattern while preserving the structure of the piece.

[D-1] Problem setting
When a percussion pattern is replaced, the harmonic sounds such as the melody should be changed as little as possible, but editing only the percussion sounds in a piece in which harmonic and percussion sounds are mixed is not easy. In addition, a piece generally contains several kinds of percussion patterns. In popular music in particular, the chorus is the high point of the piece, so it is often played with a complex, rich percussion pattern that differs from those of the other sections. When percussion patterns are replaced, it will therefore sound more natural to replace each kind of percussion pattern while preserving this music structure than to repeat a single percussion pattern or play patterns at random. This music structure therefore has to be estimated, and because such an estimate presupposes an optimal division into bars, that division has to be estimated as well.

These problems can be summarized in the following three points:
(1) the musical sound is a mixture of harmonic sounds and percussion sounds;
(2) the division into bars is unknown;
(3) the music structure in terms of multiple percussion patterns has to be estimated.

[D-2] Separation of percussion sounds and harmonic sounds
Music is rarely played with percussion sounds alone and usually contains harmonic sounds as well, so the two have to be separated when a percussion pattern is replaced. The separation method described earlier can be used for this purpose. The harmonic sounds and percussion sounds are separated with this method and stored individually, and the percussion pattern is replaced by adding a new percussion pattern to the harmonic sounds.

[D-3] Music structure estimation and optimal segmentation into bars
As stated above, a more natural percussion pattern replacement requires an optimal division into bars and an estimate of the music structure. The percussion pattern extraction method described above can be used for this. For the percussion spectrogram separated and enhanced by the harmonic/percussive sound separation method, a suitable number of initial bar-unit percussion patterns are prepared as templates. Iterative estimation is then performed: the sections corresponding to each template are estimated by the One-pass DP method, the sections are clustered as in the k-means clustering algorithm, and the center of each cluster is adopted as the updated template. Percussion pattern replacement can use the optimal segmentation into bars obtained by this method together with the label information that indicates which percussion pattern each bar corresponds to.

[D-4] Creating a new percussion pattern
To create a bar-unit percussion pattern, it is considered sufficient to specify the timbre of each percussion sound played in the bar, the timing at which each is played, and the volume of each.

Various individually recorded percussion sounds are prepared in advance, and the timing at which each is played within the bar is specified as a real number from 0 to 1. An arbitrary percussion pattern can then be created by shifting each single percussion sound signal by the specified time and summing the shifted signals.

In addition, by specifying each volume relative to the maximum volume of the percussion sounds originally played in the bar, the volume changes of the original percussion pattern can be reflected in the replacement. In other words, in sections where the percussion is kept quiet so that the melody stands out, such as an interlude, the volume is automatically reduced after replacement as well. Examples of percussion pattern designs of this kind are listed in Table 1.
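The shift-and-add construction described above can be sketched as follows. The data layout is hypothetical: each hit is a (sample name, onset in the range 0 to 1, relative volume) triple, `samples` maps a name to a recorded single-hit waveform, and `original_peak` stands for the maximum percussion volume of the original bar. Truncating hits that run past the end of the bar is an assumption made to keep the example short.

```python
def render_pattern(hits, samples, bar_len, original_peak=1.0):
    """Build one bar of percussion by shifting and summing single-hit recordings.

    hits:          list of (name, onset, volume); onset is a fraction of the bar in [0, 1]
    samples:       dict mapping name -> list of float samples for one hit
    bar_len:       length of the bar in samples
    original_peak: maximum percussion volume of the original bar (scales every hit)
    """
    out = [0.0] * bar_len
    for name, onset, volume in hits:
        start = int(round(onset * bar_len))
        gain = volume * original_peak
        for k, s in enumerate(samples[name]):
            if start + k >= bar_len:
                break  # truncate at the bar boundary (assumption)
            out[start + k] += gain * s
    return out
```

Because every hit is scaled by `original_peak`, a bar whose original percussion was quiet yields a proportionally quiet replacement.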

[D-5] Procedure of the automatic percussion pattern replacement algorithm
From the discussion above, the automatic percussion pattern replacement algorithm can be summarized as follows.
(1) Create and store several bar-unit percussion patterns as in Table 1.
(2) Separate the harmonic sounds and the percussion sounds by the separation method described earlier, and store each.
(3) Obtain the optimal segmentation of the piece into bars and the correspondence of each bar to a percussion pattern by the percussion pattern extraction method described earlier.
(4) Add the percussion patterns prepared in (1) to the harmonic sound signal separated in (2), in accordance with (3).
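Step (4) can be sketched as adding a pre-rendered pattern to the harmonic signal bar by bar. This is a minimal illustration under assumptions: the segmentation is given as (start sample, end sample, pattern label) triples, each replacement pattern is already rendered as a sample list, and a pattern shorter or longer than the estimated bar is simply zero-padded or truncated.

```python
def replace_percussion(harmonic, bars, new_patterns):
    """Add a replacement percussion pattern to the harmonic signal for each bar.

    harmonic:     list of float samples (harmonic component only)
    bars:         list of (start, end, label) triples from the segmentation
    new_patterns: dict mapping label -> list of float samples for one bar
    """
    out = list(harmonic)
    for start, end, label in bars:
        pattern = new_patterns[label]
        for k in range(min(end - start, len(pattern))):
            out[start + k] += pattern[k]
    return out
```

Bars that share a label receive the same replacement pattern, which is how the structure of the piece is preserved.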

[D-6] Experiment
As the data set, WAV files from the RWC Music Database mentioned above were downsampled to a 22.05 kHz, one-channel signal. The percussion pattern replacement algorithm was applied to actual music. First, several suitable bar-unit percussion patterns were created as in Table 1. The individual percussion sounds were taken from WAV files in the musical instrument sound database of the same RWC Music Database. Next, the music structure was analyzed and the piece was optimally segmented into bars, and the percussion patterns were replaced while this structure was preserved.

The result for the popular music piece RWC-MDB-G-2001 No. 6 is given as an example. The result of the music structure analysis and the optimal segmentation into bars is shown in FIG. 15. The intro begins with pattern 1, the A-melody and B-melody (verse) sections are played with the same percussion pattern, and the chorus is pattern 3. Three different percussion patterns were automatically substituted for patterns 1, 2 and 3 while this structure was preserved, which turned the original piece (FIG. 16) into the piece with replaced percussion patterns (FIG. 17). The figures confirm that only the percussion patterns were replaced, with accurate timing, without changing the harmonic components.

This embodiment discussed a method for automatically replacing the percussion patterns in a music acoustic signal. The percussion sounds and harmonic sounds were separated, and the optimal segmentation into bars and the music structure in terms of percussion patterns were estimated from the percussion sounds with an algorithm that combines the One-pass DP method and the k-means clustering method. Using this information, the percussion patterns were replaced while the music structure was preserved, and the method was applied to actual music. A possible application is the development of an interface with which a user can easily create a desired percussion pattern and easily replace the percussion patterns of a piece.

The present invention can be used for automatic music genre classification, music information retrieval, and music processing such as the replacement of percussion patterns.

FIG. 1 is a diagram showing the observation model of a time-frequency spectrogram.
FIG. 2 shows, on the left, a spectrogram of harmonic sound, which consists of spectral components that are smooth in the time direction and steep in the frequency direction, and, on the right, a spectrogram of percussion sound, which consists of spectral components that are steep in the time direction and smooth in the frequency direction. The spectral components in both panels are sparse on the time-frequency plane.
FIG. 3 is a diagram showing the separation of an input spectrogram by multiplying it with a time-frequency mask.
FIG. 4 is a block diagram showing a first method for separating harmonic and percussion components.
FIG. 5 is a block diagram showing a second method for separating harmonic and percussion components.
FIG. 6 is a flow diagram of the algorithm for extracting unit rhythm patterns from a music acoustic signal.
FIG. 7 is a diagram showing the local paths and path costs in the One-pass DP method.
FIG. 8 is a diagram explaining the setting of the initial percussion reference patterns (method 2).
FIG. 9 is a diagram explaining the setting of the initial percussion reference patterns (method 3).
FIG. 10 is a diagram explaining segment division by the One-pass DP method, showing the optimal matching path of the input percussion spectrum sequence against three kinds of reference patterns.
FIG. 11 is a diagram showing the alignment obtained by the One-pass DP method for the learned percussion patterns.
FIG. 12 is a diagram showing the spectra of the four learned percussion patterns and their index numbers.
FIG. 13 is a diagram showing the optimally estimated alignment (rhythm map) of the dance music piece (No. 16).
FIG. 14 shows the four unit percussion patterns extracted from the dance music piece (No. 16) and their index numbers.
FIG. 15 shows the result of the music structure analysis and optimal segmentation into bars of the popular music piece (RWC-MDB-G-2001 No. 6).
FIG. 16 is a spectrogram of the original RWC-MDB-G-2001 No. 6.
FIG. 17 is a spectrogram of RWC-MDB-G-2001 No. 6 after percussion pattern replacement.

Claims

1. A method for extracting a unit rhythm pattern from a music acoustic signal including percussion instrument sounds, the method comprising:
preparing a spectrum sequence of the percussion instrument sounds included in the music acoustic signal;
performing segment division and pattern classification of the spectrum sequence of the percussion instrument sounds by DP matching using a plurality of types of reference patterns;
performing division optimization clustering in which the reference patterns are updated based on the obtained segment division and pattern classification; and
extracting the converged reference patterns as unit rhythm patterns.
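The alternation described in claim 1 can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation disclosed in the patent: the function names (`one_pass_dp`, `update_refs`, `extract_unit_patterns`), the Euclidean frame distance, the local path (stay, advance one, or skip one reference frame), and the linear time normalization used before averaging are all hypothetical choices made for the example.

```python
import numpy as np

def one_pass_dp(X, refs):
    """Segment X (T x D) into a chain of reference patterns (each L_k x D).

    Returns [(start, end, k), ...]: frames start..end-1 are matched to refs[k].
    """
    T, K = len(X), len(refs)
    # Local distance between every input frame and every reference frame.
    dist = [np.linalg.norm(X[:, None, :] - R[None, :, :], axis=2) for R in refs]
    cost = [np.full((T, len(R)), np.inf) for R in refs]
    # back[k][t, j] = (previous pattern, previous frame, 1 if a segment starts here)
    back = [np.zeros((T, len(R), 3), dtype=int) for R in refs]
    for k in range(K):
        cost[k][0, 0] = dist[k][0, 0]      # every path starts at a pattern start
        back[k][0, 0] = (-1, -1, 1)
    for t in range(1, T):
        # Cheapest pattern end at t-1; any pattern may start right after it.
        k_end = int(np.argmin([cost[k][t - 1, -1] for k in range(K)]))
        for k, R in enumerate(refs):
            for j in range(len(R)):
                # Assumed local path inside a pattern: stay, step, or skip.
                cands = [(cost[k][t - 1, i], k, i, 0)
                         for i in (j, j - 1, j - 2) if i >= 0]
                if j == 0:                 # transition from the end of a pattern
                    cands.append((cost[k_end][t - 1, -1], k_end,
                                  len(refs[k_end]) - 1, 1))
                best = min(cands, key=lambda c: c[0])
                cost[k][t, j] = best[0] + dist[k][t, j]
                back[k][t, j] = best[1:]
    # Backtrace from the cheapest pattern end at the last frame.
    k = int(np.argmin([cost[i][T - 1, -1] for i in range(K)]))
    j, end, segs = len(refs[k]) - 1, T, []
    for t in range(T - 1, -1, -1):
        pk, pj, is_start = (int(v) for v in back[k][t, j])
        if is_start:
            segs.append((t, end, k))
            end = t
        k, j = pk, pj
    return segs[::-1]

def update_refs(X, segs, refs):
    """k-means-style update: average the time-normalized segments of each cluster."""
    new = []
    for k, R in enumerate(refs):
        members = []
        for start, end, label in segs:
            if label == k:
                # Linearly resample the segment to the reference length.
                idx = np.linspace(start, end - 1, len(R)).round().astype(int)
                members.append(X[idx])
        new.append(np.mean(members, axis=0) if members else R)
    return new

def extract_unit_patterns(X, refs, n_iter=20):
    """Alternate DP segmentation and reference update until the references settle."""
    for _ in range(n_iter):
        segs = one_pass_dp(X, refs)
        new = update_refs(X, segs, refs)
        if all(np.allclose(a, b) for a, b in zip(new, refs)):
            break
        refs = new
    return refs, segs
```

Here `X` stands for the percussion instrument sound spectrum sequence (one row per frame) and `refs` for the initial reference patterns. The returned segment list gives both the segment division and the pattern classification, and the returned references play the role of the extracted unit rhythm patterns. The sketch assumes the input is at least as long as the reference patterns.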

2. The method according to claim 1, wherein preparing the spectrum sequence of the percussion instrument sounds includes extracting the percussion instrument sounds from the music acoustic signal.

3. The method according to claim 2, wherein the percussion instrument sounds are extracted by treating the spectrogram of the music acoustic signal as the sum of a non-harmonic component that is smooth in the frequency direction and a harmonic component that is smooth in the time direction, and extracting the non-harmonic component as the percussion instrument sounds.

4. The method according to claim 1, wherein the music acoustic signal including percussion instrument sounds is a signal consisting only of percussion instrument sounds.

5. The method according to claim 1, wherein preparing the spectrum sequence of the percussion instrument sounds uses percussion instrument sounds that have already been separated from a music acoustic signal and stored.

6. The method according to claim 1, wherein the spectrum sequence of the percussion instrument sounds is a spectrum sequence whose frequency axis is scaled to the mel scale.
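As an illustration of the mel scaling in claim 6, the sketch below pools the linearly spaced bins of a spectrogram into bands that are equally spaced on the mel scale. The conversion formula used here, mel = 2595 log10(1 + f/700), is one common convention and is an assumption; the claim does not state which formula is used. The function names and the band-averaging scheme are likewise hypothetical.

```python
import numpy as np

def hz_to_mel(f):
    """Common mel formula (an assumption; several variants exist)."""
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)

def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)

def mel_pool(S, sr, n_bands):
    """Average the columns of S (T x F, bins linearly spaced over 0..sr/2)
    into n_bands bands of equal width on the mel scale."""
    freqs = np.linspace(0.0, sr / 2.0, S.shape[1])
    edges = mel_to_hz(np.linspace(0.0, hz_to_mel(sr / 2.0), n_bands + 1))
    # Band index of each linear-frequency bin.
    band = np.clip(np.searchsorted(edges, freqs, side="right") - 1, 0, n_bands - 1)
    out = np.zeros((S.shape[0], n_bands))
    for b in range(n_bands):
        cols = band == b
        if cols.any():                     # bands without any bin stay at zero
            out[:, b] = S[:, cols].mean(axis=1)
    return out
```

Pooling in this way reduces the dimension of each frame and gives low frequencies finer resolution than high frequencies.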

7. The method according to any one of claims 1 to 6, wherein the DP matching uses a DP method selected from those used for continuous word speech recognition.

8. The method according to claim 7, wherein the DP method is selected from the One-pass DP method, the two-stage DP method, the level-building method, the clockwise DP method, the one-stage DP method, and the continuous DP method.

9. The method according to claim 1, wherein the initial reference patterns in the DP matching are selected from a plurality of types of percussion instrument rhythm patterns prepared in advance.

10. The method according to claim 1, wherein the initial reference patterns in the DP matching are obtained from the input spectrum sequence of the percussion instrument sounds.

11. The method, wherein the width of the reference pattern is specified, the input spectrum sequence of the percussion instrument sounds is divided into a plurality of input patterns, and the required number of input patterns are randomly extracted and used as the initial reference patterns.

12. The method according to claim 10, wherein the width of the reference pattern is specified, the input spectrum sequence of the percussion instrument sounds is divided into a plurality of input patterns, each input pattern is labeled, randomly or in order, with one of the required number of types, and the average of the input patterns given the same label is used as an initial reference pattern.

13. The method, wherein the division performed by specifying the width of the reference pattern comprises acquiring a peak of the autocorrelation function of the input spectrum sequence of the percussion instrument sounds, and dividing that spectrum sequence into equal parts whose width is the number of frames at which the peak was acquired.
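The autocorrelation-based division in claim 13 can be sketched as below. This is an illustrative reading of the claim, not the patented procedure: the search range (`min_lag`, `max_lag`), the mean removal, the normalization by overlap length, and the function names are assumptions introduced for the example.

```python
import numpy as np

def bar_length_frames(X, min_lag, max_lag):
    """Return the lag (in frames) at which the autocorrelation of X (T x D) peaks."""
    Xc = X - X.mean(axis=0)
    lags = np.arange(min_lag, max_lag + 1)
    # Autocorrelation summed over frequency bins, normalized by the overlap length.
    ac = np.array([np.sum(Xc[:-lag] * Xc[lag:]) / (len(X) - lag) for lag in lags])
    return int(lags[np.argmax(ac)])

def split_equal(X, width):
    """Divide X into consecutive input patterns of `width` frames (remainder dropped)."""
    n = len(X) // width
    return X[: n * width].reshape(n, width, X.shape[1])
```

For a sequence whose percussion pattern repeats every 8 frames, `bar_length_frames` returns 8 when the search range contains that lag and no larger multiple of it, and `split_equal` then yields one input pattern per repetition.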

14. The method according to claim 1, wherein the division optimization clustering is the k-means method.

15. The method according to claim 1, wherein the center of the plurality of segments belonging to each cluster classified by the DP matching is obtained, and the reference pattern is updated to that segment center.

16. The method according to claim 15, wherein one representative segment selected from the plurality of segments belonging to a pattern-classified cluster is used as the segment center.

17. The method according to claim 15, wherein the average of the plurality of segments belonging to a pattern-classified cluster is used as the segment center.

18. The method according to claim 1, wherein the width of the unit rhythm pattern corresponds to the width of one measure.

19. The method according to claim 1, wherein the number of reference patterns is known, and that number of initial reference patterns is prepared.

20. The method according to claim 1, wherein the number of reference patterns is unknown, and the optimum number of clusters in the division optimization clustering is determined using an information criterion.

21. The method according to claim 20, wherein the information criterion is the Bayesian information criterion (BIC).
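The model selection in claims 20 and 21 can be illustrated with the general form of the criterion, BIC = -2 ln L + p ln n, where the candidate with the smallest value is chosen. The likelihood model below is an assumption made for the example only: the residual between the input spectrum sequence and the matched reference patterns is treated as isotropic Gaussian noise with a shared variance, and every scalar spectrum value counts as one observation. The function names and the parameter count (one parameter per reference-pattern value plus the variance) are hypothetical.

```python
import numpy as np

def gaussian_bic(distortion, n_values, n_params):
    """BIC = -2 ln L + p ln n for an isotropic Gaussian fitted to the residuals."""
    sigma2 = distortion / n_values          # maximum-likelihood shared variance
    log_lik = -0.5 * n_values * (np.log(2.0 * np.pi * sigma2) + 1.0)
    return -2.0 * log_lik + n_params * np.log(n_values)

def select_num_patterns(distortions, n_values, params_per_pattern):
    """distortions maps K to the total squared matching error obtained with K patterns.

    Returns the K with the smallest BIC and the dictionary of all scores.
    """
    scores = {K: gaussian_bic(J, n_values, K * params_per_pattern + 1)
              for K, J in distortions.items()}
    return min(scores, key=scores.get), scores
```

In use, the clustering would be run once for each candidate number of patterns, the resulting matching errors collected in `distortions`, and the candidate with the lowest score kept. Adding a pattern is accepted only when it lowers the error enough to outweigh the penalty for its extra parameters.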

22. A music structure estimation method using a plurality of unit rhythm patterns extracted by the method according to any one of claims 1 to 21, wherein the sequence of unit rhythm patterns corresponding to the spectrum sequence of the percussion instrument sounds is taken as the music structure.

23. The music structure estimation method according to claim 22, wherein the music structure is displayed on a plane whose horizontal axis is the frame number of the input spectrum sequence of the percussion instrument sounds and along whose vertical axis the plurality of unit rhythm patterns are arranged.
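The display plane of claim 23 can be represented as a small matrix. The sketch below is an assumption-laden illustration: it supposes that the segmentation is available as a list of hypothetical `(start, end, k)` tuples (frames `start` to `end - 1` assigned to pattern `k`) and marks, for every frame, which unit rhythm pattern is being played.

```python
import numpy as np

def rhythm_map(segs, n_patterns, n_frames):
    """Boolean map with patterns on the vertical axis and frames on the horizontal axis."""
    M = np.zeros((n_patterns, n_frames), dtype=bool)
    for start, end, k in segs:
        M[k, start:end] = True   # pattern k is active over this segment
    return M

def pattern_sequence(segs):
    """Music structure as the sequence of unit rhythm pattern indices."""
    return [k for _, _, k in segs]
```

Plotting `M` as an image gives a map of the kind the figure list describes as a rhythm map, and `pattern_sequence` returns the same structure as a plain list of indices.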

24. A unit rhythm pattern replacement method using the music structure obtained by the method according to claim 22 or 23, wherein some of the unit rhythm patterns constituting the music structure are replaced with a percussion instrument pattern prepared in advance.

25. The unit rhythm pattern replacement method according to claim 24, further comprising superimposing the spectrogram of the percussion instrument component containing the replaced percussion instrument pattern on the spectrogram of the harmonic component separated from the music acoustic signal.
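The replacement and superposition in claims 24 and 25 can be sketched at the spectrogram level as follows. This is a minimal illustration under assumptions that the claims do not state: segments are given as hypothetical `(start, end, k)` tuples, the prepared pattern is fitted to each segment by linear resampling in time, and the two spectrograms are combined by simple addition. Converting the result back to a waveform would also require phase information, which is not covered here.

```python
import numpy as np

def replace_pattern(P, segs, target, new_pattern):
    """Return a copy of the percussion spectrogram P (T x F) in which every segment
    labeled `target` is overwritten by `new_pattern` (L x F), resampled in time."""
    out = P.copy()
    for start, end, k in segs:
        if k == target:
            # Linear time normalization of the prepared pattern to the segment length.
            idx = np.linspace(0, len(new_pattern) - 1, end - start).round().astype(int)
            out[start:end] = new_pattern[idx]
    return out

def superimpose(H, P_new):
    """Add the harmonic spectrogram H and the modified percussion spectrogram (assumed additive)."""
    return H + P_new
```

For example, `superimpose(H, replace_pattern(P, segs, 1, new_pattern))` swaps every occurrence of pattern 1 for the prepared pattern while leaving the harmonic component untouched.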

26. The unit rhythm pattern replacement method according to claim 24, wherein the percussion instrument pattern prepared in advance is a bar-unit percussion instrument pattern.