JP6263382B2

JP6263382B2 - Audio signal processing apparatus, audio signal processing apparatus control method, and program

Info

Publication number: JP6263382B2
Application number: JP2013268963A
Authority: JP
Inventors: 松本 崇
Original assignee: Pioneer DJ Corp
Current assignee: Pioneer DJ Corp
Priority date: 2013-12-26
Filing date: 2013-12-26
Publication date: 2018-01-17
Anticipated expiration: 2033-12-26
Also published as: JP2015125238A

Description

The present invention relates to an audio signal processing apparatus that extracts instrument sounds from a musical piece, a control method for the audio signal processing apparatus, and a program.

Conventionally, a method using an isolator (a type of equalizer built on band-splitting filters) has been known as a way to extract the sound of each instrument in a musical piece. However, when an isolator is used to lower the mid-range level, the bass drum, snare drum, and hi-hat sounds of the percussion are all attenuated at the same time. In other words, per-instrument control, such as attenuating only the snare drum or only the hi-hat, is not possible.

In contrast, Patent Documents 1 to 3 are known as techniques for extracting instrument sounds from a musical piece. Patent Document 1 separates and extracts percussion sounds, such as bass drum and hi-hat, from non-percussion sounds by matching a spectrogram against templates (profile spectra). Patent Document 2 separates and extracts percussion and non-percussion sounds without templates or iterative estimation, by focusing on the anisotropy of spectrogram components along the frequency and time directions. Patent Document 3 separates instrument sounds from a mixed sound, generating the separated sound by using, as a template, the sound of a sound source module that outputs a specific sound simulating a specific instrument.

Patent Document 1: JP 2007-536587 A (published Japanese translation of a PCT application)
Patent Document 2: JP 2009-210888 A
Patent Document 3: JP 2011-209592 A

However, the techniques of Patent Document 1 and Patent Document 3 require templates as prior information. The technique of Patent Document 2 can separate percussion sounds from non-percussion sounds, but it cannot estimate the spectrogram shape of each individual percussion instrument.

In view of the above problems, an object of the present invention is to provide an audio signal processing method, an audio signal processing apparatus, and a program capable of estimating the spectrogram shape of each percussion instrument without using templates.

The audio signal processing apparatus of the present invention comprises: a first specifying unit that specifies, from a predetermined sounding section, a first spectrogram that is a frequency spectrogram of an arbitrary instrument; a second specifying unit that specifies, from the predetermined sounding section and based on a correlation value with the first spectrogram, a second spectrogram that is a frequency spectrogram of the same instrument as the arbitrary instrument; and a synchronous subtraction unit that extracts a common component of the first spectrogram and the second spectrogram.

In the above audio signal processing apparatus, when a plurality of frequency spectrograms of the same instrument as the arbitrary instrument exist in the predetermined sounding section, the second specifying unit specifies, as the second spectrogram, the frequency spectrogram having the highest correlation value with the first spectrogram.

The above audio signal processing apparatus further comprises a synchronous addition unit that, when L frequency spectrograms of the arbitrary instrument exist in the predetermined sounding section (where L is an integer satisfying L ≥ 2), averages the L common components extracted by the synchronous subtraction unit to calculate a common spectrogram.

The above audio signal processing apparatus further comprises a sound source generation unit that generates a synchronization-processed sound source of the arbitrary instrument by replacing the L frequency spectrograms of the arbitrary instrument existing in the predetermined sounding section with the common spectrogram calculated by the synchronous addition unit.

The above audio signal processing apparatus further comprises a sound source separation unit that separates, from an arbitrary musical piece, the synchronization-processed sound source of the arbitrary instrument generated by the sound source generation unit. When the musical piece contains a plurality of instrument sounds, an instrument sound separation process, comprising the processing of the first specifying unit, the second specifying unit, the synchronous subtraction unit, the synchronous addition unit, the sound source generation unit, and the sound source separation unit, is executed for each instrument sound.

In the above audio signal processing apparatus, the plurality of instrument sounds are a bass drum, a snare drum, and a hi-hat. When the musical piece contains both a hi-hat sounding alone and a hi-hat sounding simultaneously with another percussion instrument, the instrument sound separation process separates the synchronization-processed sound sources in the following order: the hi-hat sounding simultaneously with another percussion instrument, the bass drum, the snare drum, and the hi-hat sounding alone.

The above audio signal processing apparatus further comprises a velocity specifying unit that, targeting a frequency band defined for each percussion instrument, specifies the velocity of each percussion instrument for each unit time obtained by equally dividing the predetermined sounding section. The synchronous subtraction unit extracts the common component after aligning the amplitude values of the first spectrogram and the second spectrogram based on the velocity of each percussion instrument.

The above audio signal processing apparatus further comprises a detail determination unit that groups a plurality of frequency spectrograms existing in the predetermined sounding section on the condition that they are of the same percussion instrument type and are sounded in the same way. The second specifying unit specifies the second spectrogram from among one or more frequency spectrograms belonging to the same group as the first spectrogram.

The above audio signal processing apparatus further comprises a sounding position information acquisition unit that acquires sounding position information indicating the sounding positions of the three percussion instruments. When it is known from the sounding position information that a plurality of frequency spectrograms of the same percussion instrument as the first spectrogram exist in the predetermined sounding section, the detail determination unit calculates the average of the correlation values of those frequency spectrograms with respect to the first spectrogram, and classifies the frequency spectrograms whose correlation value exceeds that average into the same group as the first spectrogram.

The above audio signal processing apparatus further comprises a sustained sound removal unit that, based on amplitude spectrum information obtained by Fourier-transforming the audio signal of an arbitrary musical piece into the frequency domain, extracts sustained sound components that continue for a predetermined time or longer and removes them from the audio signal. The first specifying unit and the second specifying unit specify the first spectrogram and the second spectrogram after the sustained sound components have been removed.

The control method for an audio signal processing apparatus of the present invention executes: a first specifying step of specifying, from a predetermined sounding section, a first spectrogram that is a frequency spectrogram of an arbitrary instrument; a second specifying step of specifying, from the predetermined sounding section and based on a correlation value with the first spectrogram, a second spectrogram that is a frequency spectrogram of the same instrument as the arbitrary instrument; and a synchronous subtraction step of extracting a common component of the first spectrogram and the second spectrogram.

The program of the present invention causes a computer to execute each step of the above control method for an audio signal processing apparatus.

A block diagram showing the functional configuration of an audio signal processing apparatus according to an embodiment of the present invention.
A detailed block diagram of the detail-determination sound processing unit.
A detailed block diagram of the velocity specifying unit.
A detailed block diagram of the percussion sound separation unit.
A flowchart showing the flow of the sustained sound removal process.
A flowchart showing the flow of the local minimum detection process.
A flowchart showing the flow of the local maximum detection process.
A flowchart showing the flow of the process for judging how prominent a local maximum is.
An explanatory diagram of the continuation counter update process.
A flowchart showing the flow of the sustained sound range detection process.
An explanatory diagram of the sustained sound correction process.
An explanatory diagram of the bass drum processing (end-of-ringing determination).
An explanatory diagram of the bass drum processing (bass sound subtraction).
An explanatory diagram of the snare drum processing.
An explanatory diagram of the hi-hat processing.
A flowchart showing the flow of the detail determination process.
An explanatory diagram of the detail determination process.
An explanatory diagram of the grouping threshold.
A flowchart showing the flow of the velocity specifying process.
An explanatory diagram of the velocity detection process.
An explanatory diagram of the velocity calculation process.
A flowchart showing the flow of the percussion sound separation process.
An explanatory diagram of the composite hi-hat separation process and the bass drum separation process.
An explanatory diagram of the synchronous subtraction process.
An explanatory diagram of the synchronous addition process.
A simplified flowchart showing the flow up to the re-attack detection process, with an explanatory diagram.
An explanatory diagram of the end-of-ringing correction process.
An explanatory diagram concerning adjustment of percussion sounds.
An explanatory diagram concerning score display of percussion sounds.
A supplementary explanatory diagram of the bass drum processing.
An explanatory diagram of an application example of the bass drum processing.

Hereinafter, an audio signal processing apparatus, a control method for the audio signal processing apparatus, and a program according to an embodiment of the present invention will be described with reference to the accompanying drawings. The present invention estimates the spectrogram shape of a specific instrument in a musical piece and separates (extracts) that instrument's sound. This embodiment illustrates the case in which the spectrogram shapes of three types of percussion instruments (bass drum, snare drum, hi-hat) are estimated and the three percussion sounds are separated.

FIG. 1 is a block diagram showing the functional configuration of an audio signal processing apparatus 1 according to an embodiment of the present invention. As its main functional components, the audio signal processing apparatus 1 includes an FFT (Fast Fourier Transform) unit 11, a sustained sound removal unit 12, a band dividing unit 13, a detail-determination sound processing unit 14, a detail determination unit 15, a velocity specifying unit 16, a groove determination unit 17, and a percussion sound separation unit 18. The audio signal processing apparatus 1 may be a dedicated device, or it may be part of any of various electronic devices such as DJ equipment (DJ player, DJ mixer, etc.), audio equipment (CD player, DVD player, etc.), audio editing equipment, a personal computer, a tablet terminal, an effector, recording equipment, or broadcasting equipment.

The FFT unit 11 generates analysis data (amplitude spectrum information) by Fourier-transforming the input sound (the audio signal of an arbitrary musical piece, such as a wav file) into the frequency domain. Here the FFT size is 2048 samples and the overlap factor is 4, so one frame (the FFT processing interval) is 512 samples.
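The relationship between FFT size, overlap factor, and frame interval can be sketched as follows. This is a minimal illustration in plain Python; the function names `frame_hop` and `frame_starts` are hypothetical and do not come from the patent, and the 4096-sample signal length is only an example.

```python
def frame_hop(fft_size: int, overlap: int) -> int:
    """Frame interval in samples when windows overlap `overlap` times."""
    if fft_size % overlap != 0:
        raise ValueError("fft_size must be divisible by overlap")
    return fft_size // overlap


def frame_starts(num_samples: int, fft_size: int, overlap: int) -> list:
    """Start index of every full analysis window that fits in the signal."""
    hop = frame_hop(fft_size, overlap)
    return list(range(0, num_samples - fft_size + 1, hop))


# FFT size 2048 with an overlap factor of 4 gives a 512-sample frame interval.
hop = frame_hop(2048, 4)
# Window start positions for an example 4096-sample signal.
starts = frame_starts(4096, 2048, 4)
```

Each window start is one frame later than the previous one, so every sample away from the signal edges is covered by four overlapping windows.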

The sustained sound removal unit 12 extracts, based on the amplitude spectrum information obtained by the Fourier transform, sustained sound components that continue for a predetermined time or longer, and removes them from the input audio signal. This removes "non-percussion components" that would otherwise be a source of error when specifying the spectrogram shape of each percussion instrument. Although not shown in the drawings, after the processing of the sustained sound removal unit 12, an IFFT (Inverse FFT) is performed to return temporarily to the time domain, and the FFT is then performed again, this time with an FFT size of 512 samples and a frame of 128 samples. Changing the FFT size in this way before supplying the data (the sustained-sound-removed sound source) to the percussion sound separation unit 18 makes it possible to reproduce the sense of attack.

The band dividing unit 13 divides the signal into a frequency band for each percussion instrument. For example, it limits the frequency range corresponding to each percussion instrument as follows: bass drum, 40 to 300 Hz; snare drum, 600 to 3000 Hz; hi-hat, 6000 to 16000 Hz.
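As a rough sketch of how such frequency ranges map onto FFT bins, the following converts each band to an inclusive range of bin indices. The 44100 Hz sample rate is an assumption (the text does not state one), and `band_bins` is a hypothetical helper name; the abbreviations BD, SD, and HH follow the ones used in the drawings.

```python
import math

# Example frequency ranges from the text, in Hz.
BANDS_HZ = {"BD": (40.0, 300.0), "SD": (600.0, 3000.0), "HH": (6000.0, 16000.0)}


def band_bins(low_hz: float, high_hz: float, fft_size: int, sample_rate: float):
    """Inclusive (first, last) FFT bin indices whose center lies in [low_hz, high_hz]."""
    bin_width = sample_rate / fft_size
    first = math.ceil(low_hz / bin_width)
    last = min(math.floor(high_hz / bin_width), fft_size // 2)
    return first, last


# Assumed sample rate of 44100 Hz with the 2048-sample FFT from the text.
bins = {name: band_bins(lo, hi, 2048, 44100.0) for name, (lo, hi) in BANDS_HZ.items()}
```

With these assumptions the bass drum band covers only a dozen bins, which gives a sense of how coarse the low-frequency resolution is at this FFT size.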

The detail-determination sound processing unit 14 acquires sounding position information from outside and processes the input audio signal so that only percussion components remain. This raises the rate of correct determinations by the detail determination unit 15 in the subsequent stage. The sounding position information is generated from the result of beat position analysis of the musical piece and indicates the sounding position of each percussion instrument contained in the piece (that is, where the bass drum, snare drum, and hi-hat each sound). In this embodiment, it is assumed that this analysis has already been performed by an external device (not shown).

Based on the acquired sounding position information, the detail determination unit 15 groups the plurality of frequency spectrograms existing in a predetermined sounding section (eight measures of the musical piece in this embodiment) on the condition that they are of the same percussion instrument type and are sounded in the same way. This improves the accuracy of the synchronous processing in the percussion sound separation unit 18 described later.

When it is known from the acquired sounding position information that a plurality of frequency spectrograms of the same percussion instrument as an arbitrary spectrogram (the first spectrogram) exist in the predetermined sounding section, the detail determination unit 15 calculates the average of the correlation values of those frequency spectrograms with respect to the arbitrary spectrogram, and classifies the frequency spectrograms whose correlation value exceeds that average into the same group as the arbitrary spectrogram (that is, it judges them to be of the same percussion instrument type and sounded in the same way). Details are described later.
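The mean-correlation grouping rule described above can be sketched as follows. The passage does not say how the correlation value is computed, so the Pearson coefficient over flattened spectrograms is an assumption here, and the names `pearson` and `same_group` are hypothetical.

```python
import math


def pearson(a, b):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)


def same_group(reference, candidates):
    """Indices of candidates whose correlation with `reference` exceeds the mean correlation."""
    corrs = [pearson(reference, c) for c in candidates]
    threshold = sum(corrs) / len(corrs)  # the average serves as the grouping threshold
    return [i for i, r in enumerate(corrs) if r > threshold]


first = [1.0, 2.0, 3.0, 4.0]                    # flattened first spectrogram (toy data)
others = [[2.0, 4.0, 6.0, 8.0],                 # same shape, louder
          [4.0, 3.0, 2.0, 1.0],                 # opposite shape
          [1.0, 2.0, 3.0, 5.0]]                 # similar shape
group = same_group(first, others)
```

Because the threshold is the average of the observed correlations, it adapts to each musical piece rather than relying on a fixed cut-off.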

Based on the acquired sounding position information, the velocity specifying unit 16 specifies the velocity information of each percussion instrument in the predetermined sounding section. This processing is also performed to improve the accuracy of the synchronous processing in the percussion sound separation unit 18 described later.

The groove determination unit 17 determines the groove of the musical piece in the predetermined sounding section, based on the velocity information of each percussion instrument in that section specified by the velocity specifying unit 16 and on the acquired sounding position information. It outputs the determination result as groove information. Velocity information here means the velocity value for each unit time obtained by equally dividing one measure (for example, in sixteenth-note units). Details are given later, but the velocity for each unit time (unit times 1/16 to 16/16) is specified in common for all measures included in the predetermined sounding section (eight measures).

The percussion sound separation unit 18 uses the sustained-sound-removed sound source obtained by the sustained sound removal unit 12 and, based on the detail determination information (grouping information and velocity information) obtained by the detail determination unit 15 and the velocity specifying unit 16, estimates the spectrogram shape of each percussion instrument and separates each percussion sound. It also outputs the spectrogram shape estimates as spectrogram information (bass drum amplitude information, snare drum amplitude information, hi-hat amplitude information). In the drawings, the bass drum is abbreviated as "BD", the snare drum as "SD", and the hi-hat as "HH".

Next, the detailed functional configurations of the detail-determination sound processing unit 14, the velocity specifying unit 16, and the percussion sound separation unit 18 are described with reference to FIGS. 2 to 4. FIG. 2 is a detailed block diagram of the detail-determination sound processing unit 14. The detail-determination sound processing unit 14 includes a bass drum processing unit 21, a snare drum processing unit 22, and a hi-hat processing unit 23. The bass drum processing unit 21 applies processing that excludes bass (instrument) sounds and the like in order to extract the bass drum sound. The snare drum processing unit 22 applies processing that excludes human voices, piano accompaniment, and the like in order to extract the snare drum sound. The hi-hat processing unit 23 applies processing that excludes other hi-hat sounds and the like overlapping the hi-hat sound to be extracted.

The bass drum processing unit 21 is now described in more detail. The bass drum processing unit 21 includes a sounding position information acquisition unit 21a, a search section specifying unit 21b, an end-of-ringing determination unit 21c, an extraction unit 21d, a first processing unit 21e, and a second processing unit 21f.

The sounding position information acquisition unit 21a acquires, from outside, sounding position information indicating the sounding positions of arbitrary instruments (bass drum, snare drum, hi-hat) contained in an arbitrary musical piece. Based on the acquired sounding position information, the search section specifying unit 21b specifies a search section in which to search for the sounding section of the bass drum. In this embodiment, a section consisting of predetermined times before and after the attack position of the bass drum is specified as the search section.

The end-of-ringing determination unit 21c determines the end of ringing of the bass drum within the specified search section. In this embodiment, the end of ringing is determined using two methods: the "average value end point determination method" and the "new attack determination method". The former judges as the end of ringing the point in the search section at which the moving average over a plurality of frames has stayed continuously below a variation threshold, which is the average value near the attack position. The latter applies when a sound other than the bass drum under determination is sounded, and judges the point at which that other sound is sounded as the end of ringing.
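The "average value end point determination method" might be sketched along the following lines. The window length, the number of frames averaged near the attack, and the number of consecutive below-threshold frames are not given in this passage; the defaults `window`, `attack_span`, and `min_run` are illustrative assumptions, as is the function name `find_ring_end`. The "new attack determination method" is not covered.

```python
def find_ring_end(amplitudes, attack_index, window=3, attack_span=4, min_run=2):
    """Return the frame index judged to be the end of ringing, or None if not found.

    amplitudes:   per-frame band amplitude within the search section
    attack_index: frame index of the attack position
    """
    near_attack = amplitudes[attack_index:attack_index + attack_span]
    threshold = sum(near_attack) / len(near_attack)  # variation threshold (assumed form)
    run = 0
    for end in range(attack_index + window, len(amplitudes) + 1):
        moving_avg = sum(amplitudes[end - window:end]) / window
        if moving_avg < threshold:
            run += 1
            if run >= min_run:
                # Last frame of the first window in the below-threshold run.
                return end - min_run
        else:
            run = 0
    return None


decaying = [10.0, 8.0, 6.0, 4.0, 2.0, 1.0, 0.5, 0.2]
sustained = [10.0, 10.0, 10.0, 10.0, 10.0, 10.0]
ring_end = find_ring_end(decaying, attack_index=0)
no_end = find_ring_end(sustained, attack_index=0)
```

Returning `None` corresponds to the case in which no end of ringing is determined within the search section.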

When the end-of-ringing determination unit 21c does not determine an end of ringing (that is, when the sound has not finished ringing), the extraction unit 21d extracts the amplitude value (hereinafter also called "amplitude data") at a predetermined position in the search section. In this embodiment, the amplitude value of the last frame of the search section is extracted.

When the end-of-ringing determination unit 21c does not determine an end of ringing, the first processing unit 21e processes the audio data contained in the search section based on the amplitude value extracted by the extraction unit 21d. In this embodiment, the amplitude value extracted by the extraction unit 21d is subtracted from every frame in the search section. As a result, when the bass drum sound and a bass sound overlap, the bass sound can be excluded and only the bass drum sound extracted. On the other hand, when the end-of-ringing determination unit 21c does determine an end of ringing (that is, when the sound has finished ringing), the second processing unit 21f sets the amplitude values from the end of ringing onward in the search section to zero. This eliminates unwanted sound after the end of ringing.
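The two processing paths can be illustrated with a short sketch over a list of per-frame amplitude spectra. Clamping negative results to zero after subtraction is an assumption (the passage only says the value is subtracted), and the names `subtract_last_frame` and `zero_after_end` are hypothetical.

```python
def subtract_last_frame(frames):
    """First processing: subtract the last frame's amplitudes from every frame.

    Negative results are clamped to zero (an assumption, not stated in the text).
    """
    floor = frames[-1]
    return [[max(a - f, 0.0) for a, f in zip(frame, floor)] for frame in frames]


def zero_after_end(frames, end_index):
    """Second processing: zero every frame from the end of ringing onward."""
    return [list(frame) if i < end_index else [0.0] * len(frame)
            for i, frame in enumerate(frames)]


# Toy data: bin 0 decays (drum-like), bin 1 stays constant (sustained bass-like).
frames = [[5.0, 3.0], [4.0, 3.0], [2.0, 3.0]]
drum_only = subtract_last_frame(frames)
trimmed = zero_after_end([[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]], end_index=1)
```

In the toy data the constant bin is cancelled entirely while the decaying bin keeps its shape, which is the intended effect of removing an overlapping sustained bass sound.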

Although details are described later, the snare drum processing unit 22 and the hi-hat processing unit 23 also perform substantially the same processing as the sounding position information acquisition unit 21a, the search section specifying unit 21b, the extraction unit 21d, and the first processing unit 21e of the bass drum processing unit 21. As a modification, the snare drum processing unit 22 and the hi-hat processing unit 23 may also be configured to include the end-of-ringing determination unit 21c and the second processing unit 21f. That is, the end of ringing within the search section may also be determined for the snare drum and the hi-hat, and the amplitude values from the end of ringing onward may be set to zero.

Next, FIG. 3 is a detailed block diagram of the velocity specifying unit 16. The velocity specifying unit 16 includes a velocity detection unit 31 and a velocity calculation unit 32. Each part of the velocity specifying unit 16 performs its processing separately for each percussion instrument.

The velocity detection unit 31 detects the velocity of each percussion instrument, targeting a partial section within the predetermined sounding section. For the bass drum, for example, the target is the section of the eight measures of the musical piece from the second sounding position to the last sounding position obtained from the acquired sounding position information. For the snare drum and the hi-hat, the target is the section from the third measure to the sixth measure of the eight measures. For every percussion instrument, the velocity is detected for each unit time obtained by equally dividing each measure into N parts (16 in this embodiment). Taking as the amplitude intensity the sum of the amplitude values within the frequency range and the sounding continuation section defined for each percussion instrument, the velocity detection unit 31 detects, as the velocity of the corresponding percussion instrument, the value normalized by the largest amplitude intensity in the predetermined sounding section.

Using the detection result of the velocity detection unit 31, the velocity calculation unit 32 calculates the velocity of each percussion instrument in the remainder of the predetermined sounding section, that is, the section excluding the partial section above. Specifically, for each of the 1st through 16th unit times, the average over the measures included in the partial section is calculated as the velocity of each percussion instrument.
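The normalization performed by the velocity detection unit and the per-unit-time averaging performed by the velocity calculation unit can be sketched as below. The function names are hypothetical, and the toy data uses two unit times per measure instead of sixteen to keep the example short.

```python
def normalize_velocities(intensities):
    """Divide each amplitude intensity by the largest one, giving velocities in [0, 1]."""
    peak = max(intensities)
    if peak == 0:
        return [0.0] * len(intensities)
    return [v / peak for v in intensities]


def per_slot_average(measures):
    """Average the velocity of each unit time across the measures of the partial section.

    measures: list of measures, each a list of N per-unit-time velocities.
    """
    slots = len(measures[0])
    return [sum(m[s] for m in measures) / len(measures) for s in range(slots)]


# Amplitude intensities (sum of amplitudes in the band over the sounding duration).
velocities = normalize_velocities([2.0, 4.0, 1.0])
# Two detected measures with two unit times each (toy data).
pattern = per_slot_average([[1.0, 0.0], [0.5, 0.5]])
```

The averaged pattern is what would then be applied to the measures that lie outside the directly detected partial section.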

Next, FIG. 4 is a detailed block diagram of the percussion instrument sound separation unit 18. The percussion instrument sound separation unit 18 includes a first specifying unit 41, a second specifying unit 42, a synchronous subtraction unit 43, a synchronous addition unit 44, a re-attack detection unit 45, a ringing end determination unit 46, a sound source generation unit 47, and a sound source separation unit 48. Each of these units also performs its processing separately for each percussion instrument.

After the continuous sound removal unit 12 has removed the continuous sound component from the input audio signal, the first specifying unit 41 specifies, from the predetermined sounding section, a first spectrogram, which is the frequency spectrogram of an arbitrary percussion instrument. The second specifying unit 42 specifies, from the predetermined sounding section, a second spectrogram, which is a frequency spectrogram of the same instrument as the first spectrogram. It selects the second spectrogram, based on the correlation value with the first spectrogram, from among the one or more frequency spectrograms that belong to the same group as the arbitrary percussion instrument. When there are several candidates, it specifies the one with the highest correlation value with the first spectrogram.

The synchronous subtraction unit 43 extracts the component common to the specified first and second spectrograms. Before extracting the common component, it equalizes the amplitude values of the two spectrograms based on the per-instrument velocity specified by the velocity specifying unit 16. When the predetermined sounding section contains L frequency spectrograms of the arbitrary percussion instrument (where L is an integer with L ≥ 2), the synchronous addition unit 44 averages the L common components extracted by the synchronous subtraction unit 43 to calculate a common spectrogram.

The re-attack detection unit 45 performs attack detection using the results of the synchronous processing by the synchronous subtraction unit 43 and the synchronous addition unit 44. As a result, the attack position can be detected accurately even when only the snare drum is played slightly ahead of the bass drum and hi-hat. The ringing end determination unit 46 determines when the arbitrary percussion instrument stops ringing and removes other components in the same frequency band as that instrument. This removes non-percussion components that the synchronous processing could not remove, so the result sounds more like the percussion instrument itself.

The sound source generation unit 47 replaces the L frequency spectrograms of each percussion instrument in the predetermined sounding section with the common spectrogram calculated by the synchronous addition unit 44, and generates a synchronously processed sound source that is further shaped according to the results from the re-attack detection unit 45 and the ringing end determination unit 46. The sound source separation unit 48 separates the synchronously processed sound source of each percussion instrument, generated by the sound source generation unit 47, from the input sound (an arbitrary piece of music). The instrument sound separation process executed by the percussion instrument sound separation unit 18 is performed in the following order: hi-hats sounded simultaneously with another percussion instrument (hereinafter, "composite hi-hat"), bass drum, snare drum, and hi-hats sounded alone (hereinafter, "single hi-hat").

Next, each of the above units is described further with specific examples, with reference to FIG. 5 onward. First, the continuous sound removal process performed by the continuous sound removal unit 12 is described with reference to FIGS. 5 to 11. As noted above, the continuous sound removal process removes "components that are not percussion instruments". FIG. 5 is a flowchart showing the flow of the continuous sound removal process performed by the audio signal processing apparatus 1.

In the continuous sound removal process, local minimum points and local maximum points are detected from the amplitude spectrum information obtained by FFT (S11, S12), and the degree of protrusion of each local maximum point is determined from these results (S13). Based on that determination, the continuation counter is updated (S14) and the continuous sounds are confirmed (S15). A continuous sound range is then detected from the confirmed continuous sounds (S16), and erroneously detected continuous sounds are corrected based on the detected range (S17). Finally, the amplitude of the continuous sounds is removed from the original sound (S18), and the process ends.

FIG. 6 is a flowchart showing the flow of the local minimum point detection process (see S11 in FIG. 5). The process starts at frequency bin 0 and covers half of the FFT size (S21). For the target frequency bin (see reference P1 in the figure), the slopes between its amplitude value and the amplitude values of the two adjacent bins are obtained (S22). If the target frequency bin is a local minimum, that is, if the slopes toward both adjacent bins are at or above a predetermined value (S23: Yes), the bin is recorded as a local minimum frequency bin (S24). If it is not a local minimum (S23: No), S24 is skipped. S21 to S24 are then repeated while the target frequency bin is incremented (S25).

FIG. 7 is a flowchart showing the flow of the local maximum point detection process (see S12 in FIG. 5). This process also starts at frequency bin 0 and covers half of the FFT size (S31). For the target frequency bin (see reference P2 in the figure), the slopes between its amplitude value and the amplitude values of the two adjacent bins are obtained (S32). If the target frequency bin is a local maximum, that is, if the slopes toward both adjacent bins are at or below a predetermined value (S33: Yes), the bin is recorded as a local maximum frequency bin (S34). If it is not a local maximum (S33: No), S34 is skipped. S31 to S34 are then repeated while the target frequency bin is incremented (S35).
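
The two detection passes (S11 and S12) can be sketched in a single loop. This is a minimal sketch under stated assumptions: the slope is taken as the difference between a neighboring bin and the target bin, the `threshold` parameter stands in for the "predetermined value", and the first and last bins are skipped because they have only one neighbor. None of these details are fixed by the text.

```python
def find_extrema(amps, threshold):
    """Return (minima, maxima) as lists of bin indices.

    amps: amplitude spectrum for one frame (first half of the FFT).
    threshold: assumed minimum slope magnitude toward both neighbors.
    """
    minima, maxima = [], []
    for k in range(1, len(amps) - 1):
        left = amps[k - 1] - amps[k]    # slope toward the lower neighbor
        right = amps[k + 1] - amps[k]   # slope toward the upper neighbor
        if left >= threshold and right >= threshold:
            minima.append(k)            # both neighbors are higher
        elif left <= -threshold and right <= -threshold:
            maxima.append(k)            # both neighbors are lower
    return minima, maxima

amps = [0.0, 1.0, 0.2, 0.8, 0.1, 0.5, 0.5]
minima, maxima = find_extrema(amps, threshold=0.1)
```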

FIG. 8 is a flowchart showing the flow of the process that determines the degree of protrusion of the local maximum points (see S13 in FIG. 5). This process eliminates local maximum points caused by noise components. For example, when white noise or the like is input, countless small local maxima can appear in the high-frequency range. The protrusion determination distinguishes such noise from the local maxima of the voices and instruments that should be detected, and keeps only the local maxima that protrude to some extent above the amplitude values of the surrounding frequencies.

The protrusion determination process also starts at frequency bin 0 and covers half of the FFT size (S41). For the frequency bin of the target local maximum point (see reference P5 in the figure), a value is obtained by linear interpolation between the amplitude values of the local minimum points on either side (the local minimum values; see references P3 and P4 in the figure); this is the interpolated value (see reference L11 in the figure) (S42). If the linearly interpolated value is large, that is, at or above a predetermined value (S43: Yes), the bin is recorded as a local maximum frequency bin (S44). If the value is small (S43: No), S44 is skipped. S41 to S44 are then repeated while the target frequency bin is incremented (S45).
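
One plausible reading of the protrusion check is sketched below. The text does not spell out exactly which quantity is compared with the predetermined value; this sketch assumes it is the height of the peak above the baseline interpolated between the two neighboring minima. The function name, the `min_protrusion` parameter, and the handling of peaks without a minimum on both sides are all assumptions.

```python
def prominent_maxima(amps, minima, maxima, min_protrusion):
    """Keep maxima that rise at least min_protrusion above the baseline
    interpolated linearly between the nearest minima on either side."""
    kept = []
    for p in maxima:
        left = [m for m in minima if m < p]
        right = [m for m in minima if m > p]
        if not left or not right:
            continue  # assumption: peaks lacking a minimum on one side are skipped
        lo, hi = left[-1], right[0]
        frac = (p - lo) / (hi - lo)
        baseline = amps[lo] + frac * (amps[hi] - amps[lo])
        if amps[p] - baseline >= min_protrusion:
            kept.append(p)
    return kept

amps = [0.1, 0.0, 1.0, 0.2, 0.3, 0.2, 0.0, 0.1]
kept = prominent_maxima(amps, minima=[1, 3, 6], maxima=[2, 4], min_protrusion=0.5)
```

In this example the tall peak at bin 2 survives, while the shallow bump at bin 4 is treated as noise.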

FIG. 9 is an explanatory diagram of the continuation counter update process. Based on the local maximum points recorded by the protrusion determination process, this process updates a counter that indicates how long each sound has continued. Specifically, if a local maximum point existed in the previous frame at the same frequency bin or at either adjacent frequency bin, the sound is judged to be continuing and the counter is incremented. In the example in the figure, the local maximum points from frame 0 to frame 4, indicated by the arrows, are counted as one continuing sound, which is judged to have continued for five frames. The local maximum point in frame 6 (see reference P6 in the figure) is counted as a new sound because the chain of local maxima was interrupted.
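
The counter update can be sketched with a dictionary that maps a frequency bin to the number of frames its peak has continued. The data layout and the name `update_counters` are assumptions for illustration; the example reproduces the pattern described for FIG. 9 (a peak that drifts by at most one bin over frames 0 to 4, a gap, and a new peak in frame 6).

```python
def update_counters(prev_counters, peaks):
    """Return the counters for the current frame.

    prev_counters: dict {bin: frames continued} from the previous frame.
    peaks: bins holding a retained local maximum in the current frame.
    A peak continues if the previous frame had a peak in the same bin
    or in either adjacent bin; otherwise it starts a new sound at 1.
    """
    counters = {}
    for b in peaks:
        previous = max(prev_counters.get(b + d, 0) for d in (-1, 0, 1))
        counters[b] = previous + 1
    return counters

frames = [[10], [10], [11], [11], [12], [], [12]]
history, counters = [], {}
for peaks in frames:
    counters = update_counters(counters, peaks)
    history.append(counters)
```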

FIG. 10 is a flowchart showing the flow of the continuous sound range detection process (see S16 in FIG. 5). This process collects, for each frame, the highest frequency bin among the confirmed continuous sounds, and detects the continuous sound range based on the median of those values over the eight measures.

The continuous sound range detection process starts at frame 0 and covers the frames of the eight measures (S51). For the target frame, the highest frequency bin is tallied (S52), and it is determined whether that bin falls outside a predetermined search range (for example, 0 Hz to 4000 Hz) (S53). If it does (S53: Yes), the departure is added to a count of departures (S54). S51 to S54 are repeated while the target frame is incremented (S55). The median of the tallied highest frequencies is then calculated, and the range from 0 Hz to that median is fixed as the continuous sound range (S56). In addition, if the number of departures from the search range (in time) amounts to half or more of the eight measures, the sound is judged to be dense, and a chorus flag is set (S57). The chorus flag is used later, for example in the ringing end determination of the percussion instrument sound separation process.

FIG. 11 is an explanatory diagram of the continuous sound correction process (see S17 in FIG. 5). This process corrects the continuous sounds on the assumption that a continuous sound located at twice the continuous sound range or higher is a false detection of a sound that "comes from a percussion instrument but continues", such as an open hi-hat. For example, as shown in FIG. 11(a), continuous sound components above twice the upper frequency of the determined continuous sound range are regarded as false detections, and as shown in FIG. 11(b), those falsely detected components are removed.
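
The range detection (S16) and the correction (S17) can be sketched together. Several details are assumptions: frequencies are given in Hz rather than bins, frames that leave the search range are still included in the median, and a component is dropped only when it strictly exceeds twice the upper limit. The function names are hypothetical.

```python
from statistics import median

def detect_continuous_range(top_freqs_hz, search_max_hz=4000.0):
    """Return (upper limit of the continuous sound range, chorus flag).

    top_freqs_hz: highest continuous-sound frequency of each frame.
    """
    departures = sum(1 for f in top_freqs_hz if f > search_max_hz)
    upper_hz = median(top_freqs_hz)
    # Flag is set when half or more of the frames leave the search range.
    chorus_flag = departures * 2 >= len(top_freqs_hz)
    return upper_hz, chorus_flag

def drop_false_detections(components_hz, upper_hz):
    """Remove continuous-sound components above twice the range limit."""
    return [f for f in components_hz if f <= 2 * upper_hz]

top = [500.0, 800.0, 1200.0, 5000.0, 900.0]
upper, chorus = detect_continuous_range(top)
kept = drop_false_detections([300.0, 1500.0, 7000.0], upper)
```

Using the median rather than the mean keeps a single outlier frame, such as the 5000 Hz value here, from stretching the detected range.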

Next, the sound processing for detailed discrimination performed by the detailed discrimination sound processing unit 14 is described with reference to FIGS. 12 to 15. As noted above, this processing is a preprocessing step for the detailed discrimination process and generates percussion-only components. FIGS. 12 and 13 are explanatory diagrams of the bass drum processing. In both figures, ta denotes the attack position (sounding position) of the bass drum, te denotes the position where the bass drum stops ringing, t1 denotes the start position of the search section used to search for the end of the percussion sound, and t2 denotes the end position of the search section. In this embodiment, the start and end points of the search section are defined as the points a predetermined time before and after the attack position of each percussion instrument (for example, on the order of tens to hundreds of milliseconds). The predetermined times before and after may differ for each percussion instrument.

Because the detailed discrimination process for the bass drum uses only the low range, the processing is applied to the low-range amplitude information. For the bass drum, the bass (instrument) component should be removed as far as possible, so the processing reflects the point where the bass drum stops ringing. FIG. 12(a) is an explanatory diagram of the average value end determination method. In this method, the average low-range level near the attack position is calculated with an LPF (low-pass filter) that has a relatively large time constant, and this average is used as a variable threshold. A moving average over several frames (for example, four frames) is then taken, and the point at which the moving average stays continuously below the variable threshold (see te in the figure) is determined to be the end of ringing. The moving average over several frames is used because a small FFT size gives low frequency resolution and a disturbed low-range waveform (a larger FFT size would reduce the disturbance). The variable threshold is used so that the low-range section of the bass drum is detected accurately without being confused with the bass. Finally, the condition that the moving average remain below the threshold continuously is imposed because averaging over several frames may not fully suppress the waveform disturbance, in which case the level can dip below the threshold momentarily; the condition avoids a false determination in such cases.
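
The average value end determination can be sketched as below. The threshold is passed in directly, since its derivation from the low-pass-filtered level is outside this sketch, and the number of consecutive frames required (`run`) is an assumed parameter; the text only says the moving average must stay below the threshold continuously.

```python
def find_ring_end(levels, threshold, window=4, run=3):
    """Return the index of the first frame of a run of `run` consecutive
    frames whose `window`-frame moving average is below `threshold`,
    or None if no such run exists.

    levels: low-range level per frame, starting near the attack.
    """
    below = 0
    for i in range(window - 1, len(levels)):
        average = sum(levels[i - window + 1:i + 1]) / window
        below = below + 1 if average < threshold else 0
        if below >= run:
            return i - run + 1
    return None

levels = [8, 8, 8, 8, 4, 2, 1, 1, 1, 1, 1, 1]
end_frame = find_ring_end(levels, threshold=3)
```

Requiring a run of frames, rather than a single frame, is what keeps a momentary dip from being taken as the end of ringing.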

FIG. 12(b) is an explanatory diagram of the new attack determination method. In this method, the end of ringing is determined to be the point at which some new sound occurs (see te in the figure). One example, shown in the figure, is the case where the next bass drum hit sounds before the current one has finished ringing.

When the end of ringing has been determined by either the average value end determination method of FIG. 12(a) or the new attack determination method of FIG. 12(b), the amplitude after the end of ringing (the range te to t2) is set to 0, as shown in FIG. 12(c). When, as shown in FIG. 13(a), neither method determines an end of ringing within the search section (the range t1 to t2), the amplitude data of the last frame of the search section (see reference 51a in the figure) is subtracted from the audio data in the search section, as shown in FIG. 13(b). FIG. 13(c) shows the result of the subtraction. In this way, when the bass has not decayed, subtracting the amplitude data of the last frame of the search section eliminates its influence.
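
The two branches of the bass drum processing can be sketched as one function. The frame layout (a list of per-frame amplitude lists covering the search section) and the clamping of negative results to zero are assumptions; the text does not say how negative differences are handled.

```python
def process_bass_frames(frames, end_index=None):
    """Process the low-range amplitude frames of one search section.

    end_index: frame index of the end of ringing (te), or None if
    neither determination method found one.
    """
    if end_index is not None:
        # End of ringing found: zero every frame from te onward.
        return [list(f) if i < end_index else [0.0] * len(f)
                for i, f in enumerate(frames)]
    # No end found: subtract the last frame of the section from all frames.
    last = frames[-1]
    return [[max(a - b, 0.0) for a, b in zip(f, last)] for f in frames]

frames = [[5.0, 3.0], [4.0, 3.0], [2.0, 2.0]]
zeroed = process_bass_frames(frames, end_index=1)
subtracted = process_bass_frames(frames)
```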

The section over which the subtraction is performed (the section handled by the first processing unit 21e) is not limited to the search section; it may be the search section extended by predetermined times before and after. Alternatively, the subtraction section may be defined relative to the attack position, independently of the search section. Likewise, the section processed when an end of ringing has been determined (the section handled by the second processing unit 21f) need not stop at the end of the search section and may extend beyond it.

FIG. 14 is an explanatory diagram of the snare drum processing. In the figure, ta denotes the attack position of the snare drum, and t1 and t2 denote the start and end positions of the search section. t3 denotes a point a predetermined time before the end position of the search section (a point a predetermined time after the attack position of the snare drum) and indicates the position extracted by the extraction unit 21d (see FIG. 2).

In the snare drum band, accompaniment such as voice and piano is the main influence on the result of the detailed discrimination. To reduce this influence, the amplitude data (see reference 52 in the figure) at the point a predetermined time before the end position of the search section (see t3 in the figure), that is, at a time several tens of milliseconds after the attack position of the snare drum (see ta in the figure), is subtracted. In other words, when a sound other than the snare drum, such as a voice, overlaps the snare drum as shown in FIG. 14(a), the amplitude data of the frame at time t3 is subtracted from all frames in the search section, as shown in FIG. 14(b). FIG. 14(c) shows the result of the subtraction. In this way, when a sound other than the snare drum keeps sounding at the same pitch, subtracting the amplitude data taken a predetermined time after the attack position reduces its influence.

FIG. 15 is an explanatory diagram of the hi-hat processing. In the figure, ta denotes the attack position of the hi-hat, and t1 and t2 denote the start and end positions of the search section. t4 denotes a point a predetermined time after the start position of the search section (a point a predetermined time before the attack position of the hi-hat) and indicates the position extracted by the extraction unit 21d (see FIG. 2). With hi-hats, the next hit often sounds before the previous one has finished ringing. Moreover, unlike the bass drum and snare drum, there is a high likelihood that a different type of hi-hat sounds. In that case the earlier sound is not cut off when the new sound starts; instead, the previous hi-hat sound overlaps the newly generated one, which adversely affects the detailed discrimination. To counter this, past amplitude is subtracted. That is, when a hi-hat sounds before an open hi-hat has finished ringing, as shown in FIG. 15(a), the amplitude data (see reference 53 in the figure) at a time several tens of milliseconds before the attack position of the hi-hat (see t4 and ta in the figure) is subtracted from all frames, as shown in FIG. 15(b). FIG. 15(c) shows the result of the subtraction. In this way, even when a hi-hat sounds before an open hi-hat has finished ringing, subtracting the amplitude data from a predetermined time before the hi-hat attack reduces the influence of the open hi-hat.
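
The snare drum and hi-hat processing both subtract one reference frame from every frame in the section; they differ only in where the reference is taken (t3, after the attack, for the snare drum; t4, before the attack, for the hi-hat). A minimal sketch follows, assuming per-frame amplitude lists and clamping negative results to zero, neither of which is specified in the text.

```python
def subtract_reference_frame(frames, ref_index):
    """Subtract the amplitude data of frames[ref_index] from all frames."""
    ref = frames[ref_index]
    return [[max(a - r, 0.0) for a, r in zip(f, ref)] for f in frames]

# Four frames, two frequency bins; the attack is at frame 1.
frames = [[2.0, 1.0], [6.0, 1.5], [3.0, 1.0], [1.0, 1.0]]
snare_like = subtract_reference_frame(frames, ref_index=3)  # reference after the attack (t3)
hihat_like = subtract_reference_frame(frames, ref_index=0)  # reference before the attack (t4)
```

A steady component, such as the second bin here, is largely cancelled in both cases, while the transient in the first bin remains.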

Note that the snare drum processing of FIG. 14 and the hi-hat processing of FIG. 15 do not necessarily require a search section to be specified. That is, the snare drum processing unit 22 and the hi-hat processing unit 23 may be configured without the search section specifying unit 21b provided in the bass drum processing unit 21. In that case, the subtraction section may be defined as predetermined times before and after the attack position.

Next, the detailed discrimination process performed by the detailed discrimination unit 15 is described with reference to FIGS. 16 to 18. As noted above, the detailed discrimination process obtains the information needed by the percussion instrument sound separation process about "where the same type of percussion instrument (hi-hat and open hi-hat count as different types), or the same percussion instrument, overlaps". Specifically, sounds are grouped on the condition that they come from the same type of percussion instrument and ring in the same way.

FIG. 16 is a flowchart showing the flow of the detailed discrimination process performed by the audio signal processing apparatus 1. In the detailed discrimination process, discrimination is performed for each ... (S61). Processing starts from attack 0 within the eight measures and covers all attacks (S62). First, it is determined whether the detected attack belongs to the target percussion instrument (S63); if it does (S63: Yes), correlation values are calculated among percussion sounds of the same type (S64). Attacks whose correlation value exceeds a threshold are regarded as belonging to the same group (S65). S62 to S65 are then repeated while the target attack is incremented (S66), and S61 to S66 are repeated for each percussion instrument (S67).

FIG. 17 is an explanatory diagram of the detailed discrimination process. As shown in the figure, the correlation value between the frequency spectrograms of the same percussion instrument is used as the feature for checking that sounds are of the same percussion type and ring in the same way. Because the comparison is made between sounds already identified as the same percussion instrument from the sounding position information, a high correlation value should basically be obtained. For example, when the eight measures contain several frequency spectrograms as shown in FIG. 17(a), an arbitrary frequency spectrogram is chosen as the reference for the correlation (see reference 52 in the figure; a hi-hat in this example). Next, as shown in FIG. 17(b), the correlation value with the reference spectrogram is obtained for every hi-hat frequency spectrogram in the eight measures. As a result, as shown in FIG. 17(c), the three frequency spectrograms other than the hi-hat with a low correlation value (an open hi-hat) are judged to belong to the same group as the reference spectrogram (HH group 1 in this example). The threshold used for grouping (the grouping threshold) is described later. The same process is then repeated for percussion frequency spectrograms that do not yet have a group number, and the detailed discrimination process ends once every frequency spectrogram has one. The group numbers are output to the percussion instrument sound separation unit 18 as grouping information.

FIG. 18 is an explanatory diagram of the grouping threshold. The grouping threshold used to judge that sounds ring in the same way is obtained from the formula "threshold = 1 − (maximum correlation value − average correlation value) × 0.5". In other words, the grouping threshold is the midpoint between the maximum and the average of the correlation values; the formula and this description agree when the maximum correlation value is 1 (for instance, if the correlation of the reference spectrogram with itself is included). The example in the figure shows the correlation values calculated, for an arbitrary percussion instrument with nine frequency spectrograms in the eight measures, against one of them taken as the reference spectrogram. Making the grouping threshold variable in this way allows more accurate grouping regardless of the piece of music. The grouping threshold is determined separately for each percussion instrument.
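
The grouping threshold formula and its use for grouping can be sketched as follows. The correlation values are made-up illustrative numbers, the function names are hypothetical, and the list is assumed to include the reference spectrogram's correlation with itself (1.0) at index 0.

```python
def grouping_threshold(correlations):
    """threshold = 1 - (max correlation - mean correlation) * 0.5"""
    average = sum(correlations) / len(correlations)
    return 1 - (max(correlations) - average) * 0.5

def same_group(correlations):
    """Indices of spectrograms whose correlation exceeds the threshold."""
    threshold = grouping_threshold(correlations)
    return [i for i, c in enumerate(correlations) if c > threshold]

# Reference (1.0), two similar hi-hats, and one open hi-hat with low correlation.
correlations = [1.0, 0.9, 0.9, 0.2]
threshold = grouping_threshold(correlations)
members = same_group(correlations)
```

With these numbers the average is 0.75, so the threshold is 0.875, the midpoint between the maximum and the average, and the low-correlation spectrogram is left out of the group.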

Next, the velocity specifying process performed by the velocity specifying unit 16 is described with reference to FIGS. 19 to 21. In the percussion instrument sound separation process described later, if synchronous processing is performed between frequency spectrograms with different amplitude levels, the result converges toward the spectrogram with the smaller amplitude. The velocity specifying process is therefore performed to obtain the ratio of intensities between the amplitudes (velocity information).

FIG. 19 is a flowchart showing the flow of the velocity specifying process performed by the audio signal processing apparatus 1. The process is performed for each percussion instrument (S71). First, the velocity detection range is calculated for the target percussion instrument (S72). The amplitude intensity is detected within that range (S73) and converted into a velocity (S74, velocity detection unit 31). The velocity outside the detection range is then calculated from the information obtained in S74 (S75, velocity calculation unit 32). S71 to S75 are repeated for each percussion instrument (S76).

FIG. 20 is an explanatory diagram of the velocity detection process. As shown in the figure, the detection range calculated in S72 for the snare drum and the hi-hat is the third bar through the sixth bar. This is because a cymbal or the like often sounds at the beginning of the eight bars, and an irregular rhythm pattern such as a fill-in or a sound effect often occurs at their end. For the bass drum, the detection range is from the second attack position to the last attack position within the eight bars. This is because, in the case of the bass drum, only the first attack may have an extremely large amplitude.

The amplitude intensity detected in S73 is the sum of the amplitude values that lie in a predetermined frequency range within a target section starting from the attack position of each percussion instrument. The target section is defined for each percussion instrument (several tens of ms to several hundred ms). As with the band division by the band dividing unit 13, the frequency ranges are defined as 40 to 300 Hz for the bass drum, 600 to 3000 Hz for the snare drum, and 6000 to 16000 Hz for the hi-hat.

In the conversion to velocity in S74, each calculated amplitude intensity is normalized by the largest amplitude intensity within the eight bars, and the resulting value from 0 to 1 is used as the velocity. This normalization is performed among frequency spectrograms in the same group as determined by the detailed discrimination process (the same percussion instrument type sounding the same way).
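S73 and S74 can be sketched as below. This is an illustrative reading only: the spectrogram layout (a list of frames, each a list of bin amplitudes), the `bin_hz` parameter, and all function names are assumptions introduced for the example. The band limits are the ones stated in the text.

```python
# Frequency ranges from the description (Hz).
BANDS_HZ = {
    "bass_drum": (40, 300),
    "snare_drum": (600, 3000),
    "hi_hat": (6000, 16000),
}


def amplitude_intensity(frames, bin_hz, band):
    """S73: sum the amplitudes of all bins inside `band` over the frames
    of the target section that starts at the attack position."""
    low, high = band
    return sum(
        amp
        for frame in frames
        for k, amp in enumerate(frame)
        if low <= k * bin_hz <= high
    )


def to_velocities(intensities):
    """S74: normalize by the largest intensity in the group, giving 0..1."""
    peak = max(intensities)
    if peak == 0:
        return [0.0 for _ in intensities]
    return [value / peak for value in intensities]
```

`to_velocities` would be called once per group, with one intensity per attack in the eight bars, so that the loudest hit of the group maps to velocity 1.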

FIG. 21 is an explanatory diagram of the velocity calculation process. The velocity calculation in S75 fills in, for each group, the velocities outside the detection range on the basis of the velocity information detected within the detection range. The example in the figure shows the fill-in method for the snare drum and the hi-hat, whose detection range is the third bar through the sixth bar. As shown in the figure, velocities in the detection range are calculated in units of sixteenth notes. For the four bars in the detection range (the third to sixth bars), the velocities are averaged for each group and for each unit time (1/16 to 16/16). The averaged values are then used as the fill-in values for the corresponding unit times in bars 1, 2, 7, and 8. Although not illustrated, the bass drum is handled in the same way as the snare drum and the hi-hat: the velocity information detected in the detection range is averaged bar by bar, and the result is filled in as the velocity outside the detection range.
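The fill-in for the snare drum and hi-hat can be sketched as follows. The data layout (eight bars, each a list of sixteen per-step velocities for one group) and the function name are assumptions made for illustration.

```python
def fill_velocities(bars, detect=(2, 3, 4, 5)):
    """Average each sixteenth-note step over the detection-range bars
    (0-based indices 2..5, i.e. bars 3 to 6) and copy that average into
    every bar outside the detection range (bars 1, 2, 7, 8)."""
    steps = len(bars[0])
    mean = [
        sum(bars[b][s] for b in detect) / len(detect)
        for s in range(steps)
    ]
    return [
        list(bar) if index in detect else list(mean)
        for index, bar in enumerate(bars)
    ]
```

Bars inside the detection range keep their detected values; only the bars outside it are overwritten with the per-step averages.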

Next, the percussion instrument sound separation process performed by the percussion instrument sound separation unit 18 will be described with reference to FIGS. 22 to 27. This process performs synchronization processing based on the detailed discrimination information (grouping information and velocity information) obtained in the preceding processes, and generates frequency spectrogram information for each percussion instrument. FIG. 22 is a flowchart showing the flow of the percussion instrument sound separation process. First, the composite hi-hat is separated from the continuous-sound-removed sound source output by the continuous sound removing unit 12 (S81). Next, the bass drum is separated from the sound source (1) remaining after the composite hi-hat separation (S82), the snare drum is separated from the sound source (2) remaining after the bass drum separation (S83), and the single hi-hat is separated from the sound source (3) remaining after the snare drum separation (S84). Information for which no attack could be detected is then filled in (S85), the amplitude before the attack is removed, and the spectrogram information of each percussion instrument is finally generated. S81 to S84 differ only in the target percussion instrument type; in each step, the processing of the first specifying unit 41 through the end-of-sound determination unit 46 shown in FIG. 4 is performed.

FIG. 23(a) is an explanatory diagram of the composite hi-hat separation process (see S81 in FIG. 22). A composite hi-hat requires that the eight bars contain both hi-hats sounded simultaneously with another percussion instrument, such as the bass drum or snare drum, and hi-hats sounded alone. In other words, even if there are hi-hats sounded simultaneously with another percussion instrument, they are not regarded as a composite hi-hat when no hi-hat is sounded alone. This is because, without a hi-hat sounded alone, the bass drum or snare drum cannot be removed in the synchronous subtraction process. The figure shows an example in which the two hi-hats labeled "HH sounded together with BD" (reference numeral 56) and the two hi-hats labeled "HH sounded alone" have been grouped as the same hi-hat. To separate only the hi-hat without impairing the amplitude component of the bass drum that sounded together with the hi-hats of reference numeral 56, the amplitude component of the hi-hats sounded alone is used. In this example, therefore, the two hi-hats of reference numeral 56 and the two hi-hats sounded alone, four hi-hats in total, are separated as the composite hi-hat.

FIG. 23(b) is an explanatory diagram of the bass drum separation process (see S82 in FIG. 22). As shown in the figure, in four-on-the-floor music the bass drum and the snare drum may sound at the same time. The bass drum sounds more often than the snare drum, so it is more likely to occur on its own. Conversely, if the spectrogram shape of the snare drum were estimated without first separating the bass drum, the estimate would probably be wrong because the snare drum frequently overlaps the bass drum. There is also the problem that, if the snare drum is separated while the bass drum timbre remains, the bass drum becomes conspicuous. The bass drum therefore needs to be separated first. In the example of the figure, the four bass drums of reference numeral 57 are separated.

The snare drum separation process (see S83 in FIG. 22) is not illustrated, but because it is performed after the bass drum has been separated, the spectrogram shape is easy to estimate. The single hi-hat separation process (see S84 in FIG. 22) is likewise not illustrated, but as with the snare drum, it is performed after the overlapping sounds have already been separated, so the spectrogram shape is easy to estimate.

Next, the synchronous subtraction process performed by the first specifying unit 41, the second specifying unit 42, and the synchronous subtraction unit 43 will be described. A percussion instrument is sounded repeatedly with the same waveform, whereas components other than percussion are usually sounded at different pitches. Extracting the common part therefore leaves only the percussion component. First, the method by which the first specifying unit 41 and the second specifying unit 42 specify the partner for synchronous subtraction will be described. Correlation values are calculated among the spectrograms determined in the detailed discrimination process to belong to the same group, and synchronous subtraction is performed between the pair with the highest correlation value. For example, when the eight bars contain several frequency spectrograms of the same type, as in FIG. 17(b), an arbitrary one of them (reference numeral 52 in that figure) is specified as the first spectrogram. The frequency spectrogram with the highest correlation value against the first spectrogram (reference numeral 53 in that figure) is specified as the second spectrogram, which is the partner for synchronous subtraction. The first and second spectrograms are specified for every frequency spectrogram in a group. That is, when a group contains L frequency spectrograms, the spectrogram specification and synchronous subtraction are repeated L times.
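The partner selection described above can be sketched as follows. The text does not state which correlation measure is used, so the Pearson correlation over flattened spectrograms is an assumption, as are the function names. The group must contain at least two spectrograms.

```python
import math


def correlation(a, b):
    """Pearson correlation of two equally sized, flattened spectrograms
    (assumed measure; the source only says 'correlation value')."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    denom = math.sqrt(var_a * var_b)
    return cov / denom if denom else 0.0


def pick_partners(group):
    """For each spectrogram i (the 'first' spectrogram), return the index of
    the group member with the highest correlation (the 'second' one)."""
    partners = []
    for i, first in enumerate(group):
        others = [j for j in range(len(group)) if j != i]
        partners.append(max(others, key=lambda j: correlation(first, group[j])))
    return partners
```

Running this over a group of L spectrograms yields L (first, second) pairs, matching the L repetitions mentioned in the text.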

FIG. 24 is an explanatory diagram of the synchronous subtraction process performed by the synchronous subtraction unit 43. FIG. 24(a) shows two snare drums selected for synchronous subtraction after their spectrograms have been scaled to the same magnitude on the basis of the velocity information. For example, the left side of the figure is the first spectrogram and the right side is the second spectrogram. As shown in FIG. 24(b), when these two snare drums are synchronously subtracted, components not contained in the first spectrogram (the minuend) become negative. Because the snare drum components have the same magnitude in both spectrograms, they cancel, so the components other than the percussion instrument can be extracted. When the negative components are then set to 0, what remains is the non-percussion components of the first spectrogram. Finally, as shown in FIG. 24(c), subtracting these non-percussion components from the first spectrogram yields the common component (the original spectrogram shape of the snare drum).

The synchronous subtraction process need not use the method shown in FIG. 24; a simpler method may be used in which the amplitudes of the two spectrograms are compared for each frame and each bin and the smaller one is adopted.
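Both variants can be sketched as below, assuming that each spectrogram is a list of frames of bin amplitudes and that the two inputs have already been scaled using the velocity information. The function names are hypothetical. Under this element-wise reading the two procedures give the same result, since f − max(f − s, 0) = min(f, s).

```python
def synchronous_subtract(first, second):
    """FIG. 24 style: subtract, clip negatives to 0, then remove the
    remaining non-percussion residual from the first spectrogram."""
    common = []
    for f_frame, s_frame in zip(first, second):
        row = []
        for f, s in zip(f_frame, s_frame):
            residual = max(f - s, 0.0)  # part present only in `first`
            row.append(f - residual)    # component common to both
        common.append(row)
    return common


def elementwise_min(first, second):
    """Simpler variant: keep the smaller amplitude per frame and bin."""
    return [
        [min(f, s) for f, s in zip(f_frame, s_frame)]
        for f_frame, s_frame in zip(first, second)
    ]
```

For example, with `first = [[5, 3]]` and `second = [[4, 6]]`, both functions return `[[4, 3]]`.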

Next, the synchronous addition process performed by the synchronous addition unit 44 will be described. This process is performed to reduce the error in the spectrogram shape obtained by synchronous subtraction. FIG. 25 is an explanatory diagram of the synchronous addition process. As shown in the figure, the synchronous subtraction results for the L frequency spectrograms in a group are averaged by synchronous addition. In the example of the figure, four frequency spectrograms are subject to synchronous addition, so the sum of the synchronous subtraction data for the four spectrograms is divided by 4 to obtain the average (the synchronization-processed data). The synchronous addition process is omitted when a group contains only one synchronous subtraction result.
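The averaging step can be sketched as follows, again assuming each synchronous subtraction result is a list of frames of bin amplitudes with identical dimensions. The function name is hypothetical.

```python
def synchronous_add(results):
    """Average L synchronous-subtraction results frame by frame and bin by
    bin. With a single result the averaging is skipped, as in the text."""
    count = len(results)
    if count == 1:
        return [list(frame) for frame in results[0]]
    frames = len(results[0])
    return [
        [
            sum(result[t][k] for result in results) / count
            for k in range(len(results[0][t]))
        ]
        for t in range(frames)
    ]
```

Averaging reduces errors that appear in only some of the subtraction results, because such errors are divided by L while the shared percussion shape is preserved.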

Next, the re-attack detection process performed by the re-attack detection unit 45 will be described. This process performs attack detection on the synchronization-processed data, to allow for cases where the attack position is inaccurate or where only the snare drum is sounded ahead of the beat. FIG. 26 shows a simplified flowchart of the flow up to the re-attack detection process, together with explanatory diagrams.

First, the beat position analysis result is acquired from a beat position analysis application (not shown) (S91). S91 and S92 in the figure are steps performed before the sound generation position information (see FIG. 1) is acquired. That is, the sound generation position information in this embodiment is generated from the analysis result of the beat position analysis application. As shown in the explanatory diagram for S91, depending on the piece of music, the analyzed beat positions may lag even when the BPM is accurate. To avoid this lag, all beat positions are moved earlier by a predetermined time (for example, 50 ms) (S92). Synchronization processing is then performed by the first specifying unit 41, the second specifying unit 42, the synchronous subtraction unit 43, and the synchronous addition unit 44 (S93). Moving the beat positions earlier prevents attacks from being missed, but for music whose beat positions were detected correctly in the first place, it can let in extra sound (see reference numerals 61 and 62). Of these, the sound of reference numeral 62 is removed by the end-of-sound determination (see FIG. 12) performed by the end-of-sound determination unit 46, but the sound of reference numeral 61 must be removed separately. Attack detection is therefore performed again on the individual percussion instrument after the synchronization processing (S94), and the sound before the attack is deleted (silenced) in S86 of FIG. 22. Because the re-attack detection process operates on an individual percussion instrument in this way, the attack position can be detected accurately even for music in which only the snare drum is sounded ahead of the beat.
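S92 and S94 can be sketched roughly as below. The source does not specify how an attack is detected, so the rule used here (the first frame whose summed amplitude exceeds a threshold) is an assumption, as are the function names and the clamping of advanced beat positions at 0 ms.

```python
def advance_beats(beats_ms, lead_ms=50):
    """S92: move every beat position earlier by `lead_ms`.
    Clamping at 0 ms is an assumption for beats near the start."""
    return [max(beat - lead_ms, 0) for beat in beats_ms]


def trim_before_attack(frames, threshold):
    """S94 plus silencing: find the first frame whose summed amplitude
    exceeds `threshold` (assumed attack criterion) and zero out every
    frame before it. Returns (attack_index, trimmed_frames)."""
    for index, frame in enumerate(frames):
        if sum(frame) > threshold:
            silent = [[0.0] * len(f) for f in frames[:index]]
            return index, silent + [list(f) for f in frames[index:]]
    # No attack found: leave the data untouched.
    return None, [list(f) for f in frames]
```

The case where no frame exceeds the threshold corresponds to the weak attacks that the later fill-in step (S85) handles.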

When there is no lag in the beat positions, the processing of S91 and S92 can be omitted. When the sound generation position information includes information indicating that only the snare drum is sounded ahead of the beat, the re-attack detection process (S94) can be omitted.

Next, the end-of-sound determination process performed by the end-of-sound determination unit 46 will be described. Because synchronization processing may not remove all components other than the percussion instrument, this process determines the approximate end of the percussion sound and deletes the non-percussion components. Specifically, the end of the sound is determined by the two methods shown in FIG. 12. First, when the chorus flag has been set in the continuous sound range detection process (see S57 in FIG. 10), the end point given by the average-value end determination method shown in FIG. 12(a) is used for all bands. In a chorus, continuous sound is likely to be steady across the entire band, so the average-value end determination method, which tends not to let in extra sound, is preferable. When the chorus flag has not been set, the end point given by the average-value end determination method is used for the continuous sound range (the bands in which continuous sound exists), and the end point given by the new-attack determination method shown in FIG. 12(b) is used outside the continuous sound range. The average-value end determination method tends not to let in extra sound but cuts the sound off somewhat earlier than the actual sound. The new-attack determination method may let in extra sound, but it detects an end point close to that of the actual sound (the sound can be output until the percussion instrument fades out). Using the average-value end determination method only for the continuous sound range, where extra sound is likely, and the new-attack determination method elsewhere therefore identifies the end point of the percussion sound more appropriately. FIG. 12 illustrates the end-of-sound determination for the bass drum, but the same process is performed for the snare drum and the hi-hat.

In the end-of-sound determination process, the end of the sound is determined for each frequency bin, so the end point of a bin may be extremely short or long compared with the surrounding frequency bins. Such extreme end points degrade the separated percussion sound. The end points therefore need to be aligned.

FIG. 27 is an explanatory diagram of the end-of-sound correction process. When the end-of-sound determination yields extremely short or extremely long end points, as shown in FIG. 27(a), this process aligns them with the surrounding end points. FIG. 27(c) shows the alignment method. As shown in the figure, the entire frequency band is divided into 1/3-octave bands, and within each 1/3-octave band the end points are set to the median. The number of bands and the range of each band can be changed by setting.
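The median alignment can be sketched as follows. The band edges are not given in the text, so the 1/3-octave grid starting at a reference frequency `f_min` is an assumption, as is leaving bins below `f_min` unchanged; `statistics.median_low` is used so that the result is always an end frame that actually occurred.

```python
import math
import statistics


def align_end_points(end_frames, bin_hz, f_min=20.0):
    """Set each bin's end frame to the median of its 1/3-octave band.
    `end_frames[k]` is the end frame detected for frequency bin k."""
    bands = {}
    for k in range(len(end_frames)):
        freq = k * bin_hz
        if freq < f_min:
            continue  # bins below f_min (including DC) are left unchanged
        band = math.floor(3 * math.log2(freq / f_min))
        bands.setdefault(band, []).append(k)
    aligned = list(end_frames)
    for bins in bands.values():
        median = statistics.median_low([end_frames[k] for k in bins])
        for k in bins:
            aligned[k] = median
    return aligned
```

An isolated bin that ends far earlier or later than its neighbors is pulled to the band median, which is the behavior the correction step calls for.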

Next, the method of filling in information for which no attack could be detected (S85 in FIG. 22) will be described. For a percussion instrument with a very weak attack, there may be components that do not exceed the "threshold for regarding a sound as an attack" in the re-attack detection process (see S94 in FIG. 26). This is because that threshold differs between the process that detects the sound generation position information of FIG. 1 and the percussion instrument sound separation process. The mode of the attack values, each of which indicates how far the beat position obtained from the beat position analysis application deviates from the true attack position, is therefore calculated and used to fill in the information for which no attack could be detected. A location where no attack was detected falls short of the threshold and is therefore a very weak attack, but because the sound generation position information shows that the percussion instrument sounds there, this portion is filled in. For example, suppose five hi-hats are sounded, and the attack values (attack deviation times) are calculated as 5 ms for hi-hat 1, 10 ms for hi-hat 2, 10 ms for hi-hat 3, and 8 ms for hi-hat 4, while hi-hat 5 does not reach the attack threshold. Here 5 ms and 8 ms each appear only once, but 10 ms appears twice, so the unknown attack value of hi-hat 5 is set to the most frequent value, 10 ms.
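The hi-hat example above can be reproduced with a few lines. Representing an undetected attack as `None` and the function name are assumptions made for illustration; how ties between equally frequent values are resolved is not stated in the source (here the value seen first wins).

```python
from collections import Counter


def fill_missing_offsets(offsets_ms):
    """Replace undetected attack offsets (None) with the mode of the
    detected offsets. Ties go to the value encountered first."""
    known = [offset for offset in offsets_ms if offset is not None]
    if not known:
        return list(offsets_ms)  # nothing to base a fill-in on
    mode = Counter(known).most_common(1)[0][0]
    return [mode if offset is None else offset for offset in offsets_ms]
```

With the values from the text, `fill_missing_offsets([5, 10, 10, 8, None])` assigns 10 ms to the fifth hi-hat.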

Next, application examples of the separated percussion sounds will be described. FIG. 28 is an explanatory diagram of percussion sound adjustment. As shown in the figure, the volume of each of the bass drum, the snare drum, the hi-hat, and the sound other than these three may be adjusted with an operator such as the rotary operator 71. Instead of the volume, the separation rate (generation rate) of each percussion sound may be made adjustable. In this case, the minimum adjustable separation rate may be 0 (zero).

In addition, a different effect may be applied to each of the bass drum, the snare drum, the hi-hat, and the sound other than these three. The effect application rate (amount of processing) may be adjustable by the user. Various acoustic effects used in the effectors of DJ equipment, such as delay, reverb, and echo, can be applied. As an example of operation, rotating the rotary operator 71 assigned to the bass drum to the right may gradually increase the number of bass drum sounds (adding them with a delay), and rotating it to the left may gradually reduce the number of bass drum sounds. The operator is not limited to the rotary operator 71 and may be of any type, such as a fader operator or a touch panel.

As shown in FIG. 29, the separated percussion sounds may also be displayed as a musical score. That is, the discrimination results for the bass drum, snare drum, and hi-hat may be converted to MIDI (Musical Instrument Digital Interface) data and used as the drum score 72. In this case, the hi-hat may be divided into open hi-hat and closed hi-hat and displayed by percussion instrument type (by group). A hand clap may be displayed on the score instead of the snare drum. The spectrogram shape of a hand clap can be estimated and separated by the same processing steps as for the snare drum.

Although not illustrated, the drums may be sounded via MIDI with a different timbre. That is, another sound (such as an acoustic drum) may be output at the attack timing of each percussion sound. The separated percussion sounds may also be sampled and output according to a sequence entered by the user (or at output timings specified by the user).

As described above, according to this embodiment, the percussion instrument sound separation unit 18 estimates the frequency spectrogram by synchronization processing, so no template is needed. The spectrogram shape of each percussion instrument can therefore be estimated accurately regardless of the piece of music. In addition, because the continuous sound removing unit 12 removes continuous sound components that last for a predetermined time or longer, the percussion instrument sound separation unit 18 can estimate the spectrogram shape more accurately.

Because the detailed discrimination sound processing unit 14 removes unnecessary sound for each percussion instrument as preprocessing for the detailed discrimination process, the rate of correct discrimination by the detailed discrimination unit 15 can be raised. The detailed discrimination sound processing unit 14 specifies a search section from the attack position of the percussion instrument and subtracts the amplitude value at a predetermined position in that search section, so unnecessary sound can be removed with simple processing. Because the method of extracting the amplitude value differs for each percussion instrument, the target percussion sound can be extracted more accurately. In the bass drum processing, the end of the sound is determined and, once the sound has ended, everything after the end is set to zero, so unnecessary sound is reliably removed. The detailed discrimination unit 15 groups frequency spectrograms on the condition that they are of the same percussion instrument type and sound the same way, which raises the accuracy of the synchronization processing in the downstream percussion instrument sound separation unit 18.

Because the velocity specifying unit 16 specifies the velocity of each percussion instrument, the downstream groove determination unit 17 can determine the groove of the music accurately and easily. To specify the velocity, detection is performed only on the part of the eight bars where noise is unlikely to enter (from the second attack to the last attack for the bass drum, and the third to sixth bars for the snare drum and the hi-hat), so an accurate detection result is obtained. The detection results are then averaged for each group and for each unit time, and the averaged values are filled into the sections of the eight bars outside that part, so the velocity of each percussion instrument at each unit time in the eight bars can be specified accurately. Specifying the velocity accurately also raises the accuracy of the synchronization processing in the percussion instrument sound separation unit 18.

The following modifications and application examples may be adopted. For example, in the above embodiment the sound generation position information is acquired from outside, but the audio signal processing apparatus 1 may instead analyze the music and generate the sound generation position information itself. The user may also enter the sound generation position information manually.

The bass drum processing unit 21 of the above embodiment subtracts the amplitude data 51a of the last frame of the search section (t1 to t2), as shown in FIG. 30(a), but it may instead subtract the amplitude data 51b of the first frame of the search section (t1 to t2), as shown in FIG. 30(b). The former (the above embodiment) is effective when the pitch changes at the same time as the bass drum sounds, as shown in FIG. 30(a). The latter is effective when a bass sound has been sounding continuously since before the bass drum sounded, as shown in FIG. 30(b). In that case, the pitch may have changed or the bass sound itself may have decayed by the last frame of the search section, so subtracting the last amplitude data would be unsuitable. The position of the frame to be subtracted (the position extracted by the extraction unit 21d) may be made settable by the user relative to the start position or end position of the search section or the attack position, or may be varied according to the music analysis result.

As a modification of the bass drum processing unit 21, the extraction unit 21d may extract a predetermined number (two or more) of frames from the beginning of the search section, or a predetermined number (two or more) of frames from the end of the search section, and the first processing unit 21e may subtract the average amplitude value of the extracted frames from all frames included in the search section. In this case, the predetermined number is preferably equal to the overlap count (4 in the case of fourfold overlap).

Further, as a modification of the first processing unit 21e in the bass drum processing unit 21, processing other than subtraction may be performed. That is, any calculation method may be used as long as the audio data included in the search section is processed on the basis of the amplitude values extracted by the extraction unit 21d. Even when subtraction is performed, the subtraction ratio need not be 100%; the subtraction may be performed at a predetermined ratio such as 80% or 50%.
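The two modifications above, averaging several frames and subtracting at a reduced ratio, can be combined in one small sketch. The function name `subtract_scaled_mean`, its parameters, and the clamping at zero are hypothetical and chosen only for illustration; the text itself leaves the calculation method open.

```python
def subtract_scaled_mean(frames, count=4, from_start=True, ratio=1.0):
    """Subtract a scaled average of several frames from a search section.

    frames: list of frames, each a list of amplitudes per frequency bin.
    count: number of frames to average (two or more, per the text).
    from_start: take the frames from the beginning (True) or the end.
    ratio: subtraction ratio, e.g. 1.0, 0.8 or 0.5.
    """
    if count < 2:
        raise ValueError("count must be 2 or more")
    selected = frames[:count] if from_start else frames[-count:]
    bins = len(frames[0])
    # Average amplitude of the selected frames for each frequency bin.
    reference = [
        sum(frame[b] for frame in selected) / len(selected)
        for b in range(bins)
    ]
    # Subtract the scaled reference; clamp at zero (assumption).
    return [
        [max(a - ratio * r, 0.0) for a, r in zip(frame, reference)]
        for frame in frames
    ]


section = [[2.0], [4.0], [6.0], [8.0]]
half = subtract_scaled_mean(section, count=2, from_start=True, ratio=0.5)
tail = subtract_scaled_mean(section, count=2, from_start=False, ratio=1.0)
```

The default `count=4` mirrors the preference for matching the overlap count in the fourfold overlap case.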

As an application of the detailed discrimination sound processing unit 14, the processing may also be applied to sounds other than percussion instrument sounds. For example, as shown in FIG. 31, when vocals are sung over a piano accompaniment, only the vocals can be extracted by extracting, in each frequency band, the amplitude data at a position where no vocal fundamental is present (for example, the amplitude data indicated by reference numeral 54 or 55) and performing the subtraction processing.

The velocity specifying unit 16 of the above embodiment specifies (detects and calculates) the velocity in units of sixteenth notes, obtained by dividing one measure into 16 parts, but the unit of specification (the number of divisions) is arbitrary. The unit of specification may also be varied according to the music (according to information such as the sound generation position information, the music genre, the music BPM (beats per minute), and the rhythm).
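As a small worked example of how the unit time depends on the BPM and the number of divisions, the sketch below computes the duration of one unit. The function name `unit_time_seconds` is hypothetical, and a measure of four quarter-note beats (4/4 time) is an assumption that the text does not state.

```python
def unit_time_seconds(bpm, divisions=16, beats_per_measure=4):
    """Duration in seconds of one unit when a measure is split evenly.

    bpm: tempo in beats per minute.
    divisions: number of units per measure (16 for sixteenth notes).
    beats_per_measure: 4 assumes 4/4 time (an assumption of this sketch).
    """
    measure_seconds = beats_per_measure * 60.0 / bpm
    return measure_seconds / divisions


sixteenth = unit_time_seconds(120)             # 16 divisions per measure
eighth = unit_time_seconds(120, divisions=8)   # coarser unit
```

At 120 BPM a 4/4 measure lasts 2 seconds, so a sixteenth-note unit lasts 0.125 seconds and an eighth-note unit 0.25 seconds.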

The percussion instrument sound separation unit 18 of the above embodiment estimates the spectrogram shape and separates the percussion instrument sound for three percussion instruments (bass drum, snare drum, and hi-hat), but the present embodiment is also applicable to percussion instruments other than these. The present embodiment is likewise applicable to rhythm instruments other than percussion instruments and to instruments other than rhythm instruments.

In the above embodiment, the predetermined sound generation period is eight measures, but a longer or shorter period may be used. An entire piece of music may also be used as the predetermined sound generation period.

Each unit and each function of the audio signal processing apparatus 1 described in the above embodiments can be provided as a program (application). The program can also be provided stored on various recording media (CD-ROM, flash memory, and the like). That is, a program for causing a computer to function as each unit of the audio signal processing apparatus 1, and a recording medium on which the program is recorded, are also included in the scope of rights of the present invention. In addition, modifications may be made as appropriate without departing from the gist of the present invention, such as implementing the audio signal processing apparatus 1 on a server on a network (cloud computing).

Description of Symbols

1: Audio signal processing apparatus
11: FFT unit
12: Continuous sound removal unit
13: Band division unit
14: Detailed discrimination sound processing unit
15: Detailed discrimination unit
16: Velocity specifying unit
17: Groove determination unit
18: Percussion instrument sound separation unit
21: Bass drum processing unit
21a: Sound generation position information acquisition unit
21b: Search section specifying unit
21c: Sound end determination unit
21d: Extraction unit
21e: First processing unit
21f: Second processing unit
22: Snare drum processing unit
23: Hi-hat processing unit
31: Velocity detection unit
32: Velocity calculation unit
41: First specifying unit
42: Second specifying unit
43: Synchronous subtraction unit
44: Synchronous addition unit
45: Re-attack detection unit
46: Sound end determination unit
47: Sound source generation unit
48: Sound source separation unit

Claims

An audio signal processing apparatus comprising:
a first specifying unit that specifies, from a predetermined sound generation section, a first spectrogram that is a frequency spectrogram of an arbitrary musical instrument;
a second specifying unit that specifies, from the predetermined sound generation section, a second spectrogram that is a frequency spectrogram of the same musical instrument as the arbitrary musical instrument, on the basis of a correlation value with the first spectrogram; and
a synchronous subtraction unit that extracts a common component of the first spectrogram and the second spectrogram.

The audio signal processing apparatus according to claim 1, wherein, when a plurality of frequency spectrograms of the same musical instrument as the arbitrary musical instrument exist in the predetermined sound generation section, the second specifying unit specifies, as the second spectrogram, the frequency spectrogram having the highest correlation value with the first spectrogram.

The audio signal processing apparatus according to claim 2, further comprising a synchronous addition unit that, when L frequency spectrograms of the arbitrary musical instrument exist in the predetermined sound generation section (where L is an integer satisfying L ≥ 2), calculates a common spectrogram by averaging the L common components extracted by the synchronous subtraction unit.
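The averaging performed by the synchronous addition unit can be illustrated with a short sketch. It assumes that the L common components have already been extracted and aligned to the same shape (frames by frequency bins); the function name `average_spectrograms` and the list-of-lists layout are hypothetical, and the sketch does not model how the common components themselves are obtained.

```python
def average_spectrograms(components):
    """Element-wise average of L equally shaped spectrograms.

    components: list of L spectrograms (L >= 2); each spectrogram is a
                list of frames, each frame a list of amplitudes per bin.
    """
    count = len(components)
    if count < 2:
        raise ValueError("at least two common components are required")
    frames = len(components[0])
    bins = len(components[0][0])
    # Average the L values found at each (frame, bin) position.
    return [
        [sum(c[t][b] for c in components) / count for b in range(bins)]
        for t in range(frames)
    ]


common = average_spectrograms([
    [[1.0, 2.0], [0.0, 4.0]],
    [[3.0, 4.0], [2.0, 0.0]],
])
```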

The audio signal processing apparatus according to claim 3, further comprising a sound source generation unit that generates a synchronization-processed sound source of the arbitrary musical instrument by replacing the L frequency spectrograms of the arbitrary musical instrument existing in the predetermined sound generation section with the common spectrogram calculated by the synchronous addition unit.

The audio signal processing apparatus according to claim 4, further comprising a sound source separation unit that separates, from an arbitrary piece of music, the synchronization-processed sound source of the arbitrary musical instrument generated by the sound source generation unit,
wherein, when the arbitrary piece of music contains a plurality of instrument sounds,
instrument sound separation processing, comprising the processing of the first specifying unit, the second specifying unit, the synchronous subtraction unit, the synchronous addition unit, the sound source generation unit, and the sound source separation unit, is performed for each instrument sound.

The audio signal processing apparatus according to claim 5, wherein, when the plurality of instrument sounds are a bass drum, a snare drum, and a hi-hat, and the arbitrary piece of music contains both a hi-hat sounding alone and a hi-hat sounding simultaneously with another percussion instrument,
the instrument sound separation processing separates the synchronization-processed sound sources in the following order: the hi-hat sounding simultaneously with another percussion instrument, the bass drum, the snare drum, and the hi-hat sounding alone.

The audio signal processing apparatus according to claim 6, further comprising a velocity specifying unit that specifies, for a frequency band defined for each percussion instrument, the velocity of each percussion instrument for each unit time obtained by equally dividing the predetermined sound generation section,
wherein the synchronous subtraction unit extracts the common component after matching the amplitude values of the first spectrogram and the second spectrogram on the basis of the velocity of each percussion instrument.

The audio signal processing apparatus according to claim 6 or 7, further comprising a detailed discrimination unit that groups a plurality of frequency spectrograms existing in the predetermined sound generation section on the condition that they are of the same percussion instrument type and the same sounding method,
wherein the second specifying unit specifies the second spectrogram from among one or more frequency spectrograms belonging to the same group as the first spectrogram.

The audio signal processing apparatus according to claim 8, further comprising a sound generation position information acquisition unit that acquires sound generation position information indicating the sound generation positions of the three percussion instruments,
wherein, when it is determined from the sound generation position information that a plurality of frequency spectrograms of the same percussion instrument as the first spectrogram exist in the predetermined sound generation section, the detailed discrimination unit calculates the average of the correlation values of the plurality of frequency spectrograms with respect to the first spectrogram, and classifies the frequency spectrograms whose correlation values exceed the average into the same group as the first spectrogram.
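The grouping rule in the preceding claim, keeping the spectrograms whose correlation with the first spectrogram exceeds the average correlation, can be sketched as follows. The claim does not define how the correlation value is computed, so the Pearson correlation over flattened spectrograms used here is an assumption, as are the function names `correlation` and `same_group`.

```python
from math import sqrt


def correlation(x, y):
    """Pearson correlation of two equally long sequences (assumed measure)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    dev_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    dev_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    # Treat a constant sequence as uncorrelated to avoid dividing by zero.
    return cov / (dev_x * dev_y) if dev_x and dev_y else 0.0


def same_group(first, candidates):
    """Indices of candidates whose correlation with `first` exceeds the mean.

    first: flattened first spectrogram.
    candidates: flattened spectrograms of the same percussion instrument.
    """
    scores = [correlation(first, c) for c in candidates]
    threshold = sum(scores) / len(scores)
    return [i for i, score in enumerate(scores) if score > threshold]


first = [1.0, 2.0, 3.0, 4.0]
candidates = [
    [2.0, 4.0, 6.0, 8.0],  # same shape, louder
    [4.0, 3.0, 2.0, 1.0],  # reversed shape
    [1.0, 2.0, 3.0, 5.0],  # nearly the same shape
]
group = same_group(first, candidates)
```

In this toy input the first and third candidates correlate strongly with `first` and the second correlates negatively, so only indices 0 and 2 exceed the average.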

The audio signal processing apparatus according to any one of claims 5 to 9, further comprising a continuous sound removal unit that extracts, on the basis of amplitude spectrum information obtained by performing a Fourier transform on the audio signal of the arbitrary piece of music, a continuous sound component that continues for a predetermined time or longer, and removes the continuous sound component from the audio signal,
wherein the first specifying unit and the second specifying unit specify the first spectrogram and the second spectrogram after the continuous sound component has been removed.

A method for controlling an audio signal processing apparatus, the method comprising executing:
a first specifying step of specifying, from a predetermined sound generation section, a first spectrogram that is a frequency spectrogram of an arbitrary musical instrument;
a second specifying step of specifying, from the predetermined sound generation section, a second spectrogram that is a frequency spectrogram of the same musical instrument as the arbitrary musical instrument, on the basis of a correlation value with the first spectrogram; and
a synchronous subtraction step of extracting a common component of the first spectrogram and the second spectrogram.

A program for causing a computer to execute each step of the method for controlling an audio signal processing apparatus according to claim 11.