JP2012053205A

JP2012053205A - Sound source separation device and program

Info

Publication number: JP2012053205A
Application number: JP2010194704A
Authority: JP
Inventors: Noriaki Asemi; 典昭阿瀬見; Mitsuharu Kayama; 満春佳山; Seiji Kurokawa; 誠司黒川
Original assignee: Brother Industries Ltd
Current assignee: Brother Industries Ltd
Priority date: 2010-08-31
Filing date: 2010-08-31
Publication date: 2012-03-15
Anticipated expiration: 2030-08-31
Also published as: JP5310677B2

Abstract

PROBLEM TO BE SOLVED: To reduce, in a sound source separation device, the amount of computation required to separate a target sound from a mixed sound that contains the target sound. SOLUTION: A musical sound transition, a waveform in which the musical sounds constituting a target musical piece progress along the time axis, is acquired (S120). Based on the acquired musical sound transition and score data, corrected score data is generated by correcting the score data so that the pitches of the musical sounds constituting the target musical piece match those of the played sounds and the performance start timing is synchronized with the output timing (S130 and S140). Furthermore, a volume correction amount is derived (S170), and the corrected score data and the volume correction amount are used to generate a target sound transition, a waveform in which the sound output from one sound source progresses along the time axis (S180). In step S170, a volume ratio kv, the ratio between an output sound average amplitude and a musical sound average amplitude, is derived as the volume correction amount.

Description

The present invention relates to a sound source separation device and a program that separate the sound generated by at least one sound source from a mixed sound in which sounds generated by a plurality of sound sources are superimposed.

Conventionally, a sound source separation device is known that treats as a mixed sound a main sound, such as a spoken voice or an ambient noise, mixed with the performance sound of a target musical piece (for example, a sound played as background music), and that separates (removes) the performance sound of the target piece from the mixed sound by using a known musical sound waveform in which the sound pressure of the musical sounds constituting the target piece changes along the time axis (see, for example, Patent Document 1).

In the sound source separation device described in Patent Document 1 (hereinafter, the conventional separation device), the performance sound of the target piece is separated from the mixed sound as follows. First, both the mixed waveform, in which the sound pressure of the mixed sound changes along the time axis, and the musical sound waveform are frequency-analyzed for each unit time to derive a mixed sound frequency spectrum and a musical sound frequency spectrum. Next, the spectral amplitude value at each frequency of the musical sound frequency spectrum is subtracted from the spectral amplitude value at each frequency of the mixed sound frequency spectrum. The resulting frequency spectrum is inverse Fourier transformed into a unit waveform, and the unit waveforms are arranged along the time axis. This generates the waveform of the mixed sound with the performance sound removed, that is, only the main sound contained in the mixed sound.
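The frame-by-frame subtraction described for the conventional separation device can be sketched roughly as follows. This is an illustrative sketch only, not the method of Patent Document 1: the function names, the non-overlapping frames, the reuse of the mixed sound's phase, and the clamping of negative magnitudes to zero are all assumptions made for the example, and the naive DFT is chosen for self-containment rather than speed.

```python
import cmath
import math


def dft(frame):
    """Naive discrete Fourier transform of one frame (O(n^2), for illustration)."""
    n = len(frame)
    return [
        sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft(spectrum):
    """Inverse of dft(); returns the real part of each time-domain sample."""
    n = len(spectrum)
    return [
        (sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n).real
        for t in range(n)
    ]


def subtract_frame(mixed_frame, music_frame):
    """Subtract the music magnitude from the mixed magnitude at each frequency."""
    separated = []
    for mixed_bin, music_bin in zip(dft(mixed_frame), dft(music_frame)):
        # Assumption: clamp negative magnitudes to zero and reuse the mixed phase.
        magnitude = max(abs(mixed_bin) - abs(music_bin), 0.0)
        separated.append(cmath.rect(magnitude, cmath.phase(mixed_bin)))
    return idft(separated)


def spectral_subtraction(mixed, music, frame_len=8):
    """Process non-overlapping unit-time frames and lay them out along the time axis."""
    main_sound = []
    for start in range(0, len(mixed) - frame_len + 1, frame_len):
        end = start + frame_len
        main_sound.extend(subtract_frame(mixed[start:end], music[start:end]))
    return main_sound
```

The sketch only works this cleanly when the musical sound waveform is already aligned with the mixed sound in pitch, time, and volume, which is exactly the alignment that the correction parameters discussed next are meant to provide.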

However, when the conventional separation device subtracts the musical sound frequency spectrum from the mixed sound frequency spectrum, it corrects the musical sound frequency spectrum so that each of its spectral amplitude values is subtracted from the spectral amplitude value of the mixed sound frequency spectrum that most closely approximates it along both the time axis and the frequency axis.

The parameters used for this correction (that is, the correction amounts) are a parameter for correcting the pitch of the performance sound, a parameter for correcting the performance sound along the time axis of the target piece, and a parameter for correcting the volume of the performance sound. Each parameter is derived separately for each individual musical sound spectrum, and each derivation is carried out by iterating a calculation.

Patent Document 1: Japanese Patent No. 4274418

In other words, the conventional separation device must repeat a calculation an enormous number of times to derive a single parameter. The amount of computation needed to derive all the parameters is therefore very large, and so is the amount of computation needed to separate the target sound contained in a mixed sound from that mixed sound.

Accordingly, an object of the present invention is to reduce, in a sound source separation device, the amount of computation required to separate a target sound contained in a mixed sound from that mixed sound.

The sound source separation device of the present invention, made to achieve the above object, comprises musical sound transition acquisition means, output sound transition acquisition means, correction amount derivation means, correction means, corrected sound transition acquisition means, ratio derivation means, musical sound analysis means, specific sound transition acquisition means, specific sound analysis means, amplitude ratio derivation means, section transition derivation means, and sound source separation means.

Of these, the musical sound transition acquisition means acquires a musical sound transition in which the musical sounds constituting the target musical piece progress along the time axis. The output sound transition acquisition means acquires an output sound transition based on score data. The score data represents the score of a piece that simulates the target piece and contains, for each sound source used in that piece, a score track that specifies at least the pitch and the output timing of each output sound, where an output sound simulates an individual sound output from a sound source used in the target piece. The output sound transition is the progression, along the time axis of the score data, of the output sounds specified in all the score tracks.

The correction amount derivation means then causes at least one of pitch correction amount derivation means and time correction amount derivation means to derive a correction amount, and the correction means generates score data in which the output sounds are shifted according to the derived correction amount to become corrected output sounds (hereinafter, corrected score data). The pitch correction amount derivation means compares musical sound information, which is extracted from the musical sound transition and represents its characteristics, with output sound information, which is extracted from the output sound transition and represents its characteristics. Based on the result, it derives, as one of the correction amounts, a pitch correction amount for the score data such that the pitch of each output sound matches the pitch of the corresponding musical sound. Likewise, based on the comparison of the musical sound information with the output sound information, the time correction amount derivation means derives, as one of the correction amounts, a time correction amount for the score data such that the output timing of each output sound matches the performance start timing of the corresponding musical sound.
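As a rough illustration of how score data with per-source tracks might be shifted into corrected score data, consider the sketch below. The `Note` type, its MIDI-style integer pitch, the onset in seconds, and the `pitch_shift`, `time_rate`, and `time_offset` parameters are all hypothetical names chosen for this example; the text only states that each track specifies at least a pitch and an output timing per output sound.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Note:
    """One output sound in a score track (hypothetical representation)."""
    pitch: int    # assumed: note number in semitone steps
    onset: float  # assumed: output timing in seconds


def correct_score(tracks, pitch_shift=0, time_rate=1.0, time_offset=0.0):
    """Return corrected score data: every output sound shifted in pitch and time."""
    return {
        name: [
            replace(
                note,
                pitch=note.pitch + pitch_shift,
                onset=note.onset * time_rate + time_offset,
            )
            for note in notes
        ]
        for name, notes in tracks.items()
    }
```

Applying one shared set of correction amounts to every track, as in this sketch, keeps the tracks mutually consistent; whether the corrections are applied globally or per section is left open here.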

Further, the corrected sound transition acquisition means acquires a corrected sound transition in which the corrected output sounds specified in all the score tracks of the corrected score data progress along the time axis of the corrected score data. The ratio derivation means derives the ratio (hereinafter, the volume ratio) between the average amplitude over the entire corrected sound transition and the average amplitude over the entire musical sound transition acquired by the musical sound transition acquisition means.

The musical sound analysis means derives, for each unit time along the time axis of the target piece, a musical sound amplitude spectrum, which is an amplitude spectrum representing the frequencies contained in the acquired musical sound transition and the intensity at each frequency. The specific sound transition acquisition means acquires a specific sound transition in which the corrected output sounds specified in a target track, which is one of the score tracks of the corrected score data generated by the correction means, progress along the time axis of the corrected score data. The specific sound analysis means derives, for each unit time along the time axis of the corrected score data, a specific sound amplitude spectrum, which represents the frequencies contained in the acquired specific sound transition and the intensity (that is, the amplitude or power) at each frequency, with the intensity at each frequency multiplied by the volume ratio.

Further, the amplitude ratio derivation means derives, for each frequency, an amplitude ratio representing the ratio between the intensity at that frequency in the derived musical sound amplitude spectrum and the intensity at that frequency in the specific sound amplitude spectrum derived by the specific sound analysis means. The section transition derivation means derives a section transition, which is the progression of a sound along the time axis, from a separated spectrum obtained by multiplying the intensity at each frequency of the musical sound amplitude spectrum by the corresponding amplitude ratio. The sound source separation means then arranges the section transitions derived by the section transition derivation means along the time axis of the target piece, thereby generating a target sound transition, which is the progression along the time axis of the target sound that, within the musical sound transition, is output by the sound source corresponding to the target track.
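The per-frequency masking step for a single unit-time frame can be sketched as below. The direction of the ratio (specific sound intensity divided by musical sound intensity), the cap at 1.0, and the handling of empty bins are assumptions made for this example; the text states only that a ratio of the two intensities is derived per frequency and multiplied onto the musical sound amplitude spectrum. The specific sound spectrum passed in is assumed to be already scaled by the volume ratio.

```python
def amplitude_ratios(music_spectrum, specific_spectrum):
    """Per-frequency ratio of specific sound intensity to musical sound intensity."""
    ratios = []
    for music, specific in zip(music_spectrum, specific_spectrum):
        if music == 0.0:
            # Nothing to extract from an empty bin (assumption).
            ratios.append(0.0)
        else:
            # Assumption: never extract more than the mix contains at this bin.
            ratios.append(min(specific / music, 1.0))
    return ratios


def separated_spectrum(music_spectrum, ratios):
    """Multiply each musical sound intensity by its amplitude ratio."""
    return [music * ratio for music, ratio in zip(music_spectrum, ratios)]
```

For example, with musical sound intensities `[4.0, 2.0, 0.0]` and scaled specific sound intensities `[1.0, 3.0, 0.5]`, the ratios come out as `[0.25, 1.0, 0.0]` and the separated spectrum as `[1.0, 2.0, 0.0]`. Converting each separated spectrum back to a waveform and laying the results along the time axis would then give the target sound transition.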

That is, in the sound source separation device of the present invention, the correction amount for correcting the volume of the output sounds specified in the score data (hereinafter, the volume correction amount) is the ratio between the average amplitude over the entire corrected sound transition and the average amplitude over the entire musical sound transition (that is, the volume ratio). Deriving this volume ratio does not require iterating a calculation, so the amount of computation needed to derive the volume correction amount can be reduced.
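A minimal sketch of the volume ratio shows why no iteration is needed: it is a single pass over each waveform. Taking "average amplitude" to mean the mean absolute sample value, and dividing the musical sound average by the corrected sound average (so that multiplying by kv scales the synthesized sound up or down to the level of the recording), are assumptions of this example rather than details given in the text.

```python
def mean_abs_amplitude(samples):
    """Average amplitude over the whole transition (assumed: mean absolute value)."""
    return sum(abs(sample) for sample in samples) / len(samples)


def volume_ratio(music_transition, corrected_transition):
    """Volume ratio kv, computed in closed form with no iterative search."""
    return mean_abs_amplitude(music_transition) / mean_abs_amplitude(corrected_transition)
```

With a recording `[0.5, -0.5, 1.0, -1.0]` and a synthesized waveform `[0.25, -0.25, 0.5, -0.5]`, this gives kv = 2.0.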

As a result, even if the pitch correction amount and the time correction amount, which are the correction amounts other than the volume ratio, were derived by the same methods as in the conventional separation device, the sound source separation device of the present invention would still reduce the amount of computation needed to derive all the correction amounts.

Therefore, the sound source separation device of the present invention can reduce the amount of computation required to extract, from a musical sound transition, the target sound transition contained in it.
In the pitch correction amount derivation means of the present invention, distribution derivation means may derive, as one piece of the musical sound information, a musical sound pitch distribution representing the frequencies contained over the entire musical sound transition and the intensity at each frequency, and may derive, as one piece of the output sound information, an output pitch distribution representing the frequencies contained over the entire output sound transition and the intensity at each frequency. The pitch correction amount derivation means may then derive a pitch correlation value, representing the correlation between the output pitch distribution and the musical sound pitch distribution, each time the output pitch distribution is shifted along the frequency axis from a predefined specified position of the musical sound pitch distribution, and may derive, as the pitch correction amount, the shift along the frequency axis from the specified position that corresponds to the largest of the derived pitch correlation values (claim 2).

This pitch correction amount is the shift along the frequency axis from the specified position of the musical sound pitch distribution at which the musical sound pitch distribution and the output pitch distribution are most strongly correlated. It is derived by shifting the output pitch distribution in a specified direction from the specified position of the musical sound pitch distribution.
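The shift-and-correlate search for the pitch correction amount might look like the following sketch. Representing each pitch distribution as a list of intensities over equally spaced bins, using a plain dot product as the correlation value, searching both directions up to `max_shift` bins, and normalizing by the sum of intensities are all assumptions made for illustration.

```python
def normalize(distribution):
    """Scale intensities to sum to one so overall loudness does not matter."""
    total = sum(distribution)
    return [value / total for value in distribution] if total else list(distribution)


def shift(distribution, bins):
    """Shift a distribution along the frequency axis, padding with zeros."""
    shifted = [0.0] * len(distribution)
    for index, value in enumerate(distribution):
        target = index + bins
        if 0 <= target < len(distribution):
            shifted[target] = value
    return shifted


def pitch_correction(music_distribution, output_distribution, max_shift):
    """Return the shift (in bins) that maximizes the pitch correlation value."""
    music = normalize(music_distribution)
    output = normalize(output_distribution)
    best_shift, best_value = 0, float("-inf")
    for bins in range(-max_shift, max_shift + 1):
        value = sum(m * o for m, o in zip(music, shift(output, bins)))
        if value > best_value:
            best_shift, best_value = bins, value
    return best_shift
```

Because the search covers one bounded range of shifts for the whole piece, its cost is one correlation per candidate shift rather than an iterative fit per spectrum.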

Therefore, the sound source separation device of the present invention can reduce the number of iterations needed to derive the pitch correction amount and, as a result, can reduce the amount of computation needed to derive it compared with the method used in the conventional separation device.

As a result, the sound source separation device according to claim 2 can more reliably reduce the amount of computation required to extract, from a musical sound transition, the target sound transition contained in it.
In particular, if the score data is corrected according to the pitch correction amount derived in this way, the frequencies contained in the corrected output sound transition (that is, the corrected sound transition) and the ratios of the intensities at those frequencies can be brought closer to those of the musical sound transition. This improves the accuracy with which the target sound transition contained in the musical sound transition is separated from it.

In the present invention, the intensity at each frequency of the musical sound pitch distribution and of the output pitch distribution may be normalized (claim 3).
With such a sound source separation device, even if the intensity at each frequency contained in the musical sound transition differs greatly from the intensity at each frequency contained in the output sound transition, the score data can be corrected so that the pitches of the output sounds specified in the score data match the pitches of the target piece. As a result, even if there is a large difference between the amplitude of the musical sound transition and the amplitude of the output sound transition, the target sound transition contained in the corrected sound transition can be separated from the corrected sound transition accurately.

Further, in the time correction amount derivation means of the present invention, change derivation means may derive from the musical sound transition, as one piece of the musical sound information, a musical sound change representing how the non-harmonic component of the musical sound transition changes along the time axis, and may derive from the output sound transition, as one piece of the output sound information, an output sound change representing how the non-harmonic component of the output sound transition changes along the time axis. Time correlation derivation means may then derive a time correlation value, representing the correlation between the musical sound change and the output sound change, each time the output sound change is stretched or compressed along the time axis with the set positions defined on the musical sound change and the output sound change aligned, while sequentially moving the set position along the time axis within a specified range. In this case, the time correction amount derivation means may derive, as the time correction amount, the stretch ratio along the time axis and the set position of the output sound change that correspond to the largest of the derived time correlation values (claim 4).
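The search over stretch ratios and set positions can be sketched as a small grid search. Everything concrete here is an assumption for illustration: the change curves are plain lists of samples, stretching uses nearest-neighbor resampling, the set position is modeled as an offset of the output sound change within the musical sound change, and the correlation is normalized by the vector norms so that longer stretched curves are not favored.

```python
import math


def stretch(change, rate):
    """Stretch or compress a change curve along the time axis (nearest neighbor)."""
    length = max(1, round(len(change) * rate))
    return [change[min(len(change) - 1, int(i / rate))] for i in range(length)]


def normalized_correlation(music_change, output_change, position):
    """Correlation of the two curves with the output curve placed at `position`."""
    dot = 0.0
    for index, value in enumerate(output_change):
        target = index + position
        if 0 <= target < len(music_change):
            dot += music_change[target] * value
    norm = math.sqrt(sum(v * v for v in music_change)) * math.sqrt(
        sum(v * v for v in output_change)
    )
    return dot / norm if norm else 0.0


def time_correction(music_change, output_change, rates, positions):
    """Return the (stretch ratio, set position) pair with the largest correlation."""
    best, best_value = (None, None), float("-inf")
    for rate in rates:
        stretched = stretch(output_change, rate)
        for position in positions:
            value = normalized_correlation(music_change, stretched, position)
            if value > best_value:
                best, best_value = (rate, position), value
    return best
```

The cost is one correlation per (ratio, position) candidate, so the candidate grids `rates` and `positions` directly control the amount of computation.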

In general, the non-harmonic components contained in a musical sound transition or an output sound transition are found mostly in percussion instruments (for example, drums and bass) and in the attack sounds of instruments, and their degree of correlation changes sharply with a shift in time.

Thus, if the sound source separation device of the present invention corrects the output timings of the output sounds specified in the score data according to the time correction amount (which is based on percussion data), the rhythm of the piece represented by the corrected score data can easily be matched to that of the target piece, and hence the output timing of each output sound in the corrected score data can easily be matched to the performance start timing of the corresponding musical sound.

The change derivation means of the present invention may use, as the output sound transition, the corrected sound transition acquired by the corrected sound acquisition means from corrected score data in which the corrected output sounds are output sounds whose pitches have been shifted according to the pitch correction amount derived by the pitch correction amount derivation means (claim 5).

In this case, however, the change derivation means, the time correlation derivation means, and the time amount derivation means that form the correction amount derivation means in claim 5 must be configured in the same way as the change derivation means, the time correlation derivation means, and the time amount derivation means described in claim 4.

In such a sound source separation device, before the time correction amount is derived, the pitch of each output sound specified in the score data has already been corrected to match the pitch of the musical sounds constituting the target piece. This prevents the accuracy of the time correction amount from deteriorating because of a discrepancy between the pitches of the output sounds specified in the score data and the pitches of the musical sounds constituting the target piece.

Further, in the present invention, the change derivation means may derive the musical sound change and the output sound change for each section of the target piece in which the tempo is constant (claim 6).
With such a sound source separation device, the output timings of the output sounds in the score data can be corrected for each section of the target piece in which the tempo is constant.

In the sound source separation means of the sound source separation device of the present invention, storage control means may derive a residual musical sound transition by subtracting the section transition derived by the section transition derivation means from the musical sound transition, and may store the derived residual musical sound transition in a storage device. Update means may then have the section transition derivation means derive section transitions while sequentially changing the target track and, each time a section transition is derived, may subtract it from the residual musical sound transition stored in the storage device to update the stored residual musical sound transition (claim 7).

In such a sound source separation device, when the target sound transition contained in the musical sound transition is separated from it, separating only the transition corresponding to a single target track yields a residual musical sound transition from which only the sound generated by the sound source corresponding to that track has been removed. That is, the sound source separation device of the present invention can produce a so-called minus-one version of the target piece.

Also, for example, when a musical sound transition that contains a singing voice is acquired, subtracting the target sound transitions for all the score tracks from it leaves the progression of the singing voice along the time axis of the score data. That is, the instrument sound separation device of the present invention can extract the progression of the singing voice contained in the target piece.
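The residual update over successive target tracks can be sketched as a simple loop. The callback `derive_target_sound`, which stands in for the whole per-track separation, and the use of plain sample lists are hypothetical choices for this example; an in-memory list also stands in for the storage device.

```python
def subtract(transition, target):
    """Sample-wise subtraction of a target sound from a transition."""
    return [a - b for a, b in zip(transition, target)]


def separate_tracks(music_transition, track_names, derive_target_sound):
    """Separate each listed track in turn and keep the residual up to date."""
    residual = list(music_transition)  # stored residual musical sound transition
    targets = {}
    for name in track_names:
        # Hypothetical callback: returns the target sound transition of one track.
        target = derive_target_sound(name)
        targets[name] = target
        residual = subtract(residual, target)  # update the stored residual
    return targets, residual
```

Passing a single track name gives the minus-one case, while passing every score track leaves in the residual whatever the score does not describe, such as a singing voice.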

The present invention may also be a program for causing a computer to function as the sound source separation device. In that case, the program must cause the computer to execute a musical sound transition acquisition procedure, an output sound transition acquisition procedure, a correction amount derivation procedure, a correction procedure, a corrected sound transition acquisition procedure, a ratio derivation procedure, a musical sound analysis procedure, a specific sound transition acquisition procedure, a specific sound analysis procedure, an amplitude ratio derivation procedure, a section transition derivation procedure, and a sound source separation procedure (claim 8).

In the musical sound transition acquisition procedure, a musical sound transition in which the musical sounds constituting the target musical piece progress along the time axis is acquired. In the output sound transition acquisition procedure, an output sound transition is acquired based on score data. The score data represents the score of a piece that simulates the target piece and contains, for each sound source used in that piece, a score track that specifies at least the pitch and the output timing of each output sound, where an output sound simulates an individual sound output from a sound source used in the target piece. The output sound transition is the progression, along the time axis of the score data, of the output sounds specified in all the score tracks.

Then, in the correction amount derivation procedure, at least one of a pitch correction amount derivation procedure and a time correction amount derivation procedure is made to derive a correction amount, and in the correction procedure, corrected score data is generated, which is score data in which the output sounds are shifted according to the derived correction amount to become corrected output sounds. In the pitch correction amount derivation procedure, musical sound information, which is extracted from the musical sound transition and represents its characteristics, is compared with output sound information, which is extracted from the output sound transition and represents its characteristics. Based on the result, a pitch correction amount for the score data is derived as one of the correction amounts such that the pitch of each output sound matches the pitch of the corresponding musical sound. In the time correction amount derivation procedure, based on the comparison of the musical sound information with the output sound information, a time correction amount for the score data is derived as one of the correction amounts such that the output timing of each output sound matches the performance start timing of the corresponding musical sound.

Further, in the corrected sound transition acquisition procedure, a corrected sound transition is acquired in which the corrected output sounds specified in all the score tracks of the corrected score data generated in the correction procedure progress along the time axis of the corrected score data. In the ratio derivation procedure, the volume ratio is derived, which is the ratio between the average amplitude over the entire acquired corrected sound transition and the average amplitude over the entire musical sound transition. In the musical sound analysis procedure, a musical sound amplitude spectrum representing the frequencies contained in the musical sound transition and the intensity at each frequency is derived for each unit time along the time axis of the target piece.

Then, once the specific sound transition acquisition procedure has acquired a specific sound transition in which the corrected output sounds specified in a target track, which is one of the score tracks of the corrected score data, progress along the time axis of the corrected score data, the specific sound analysis procedure derives, for each unit time along the time axis of the corrected score data, a specific sound amplitude spectrum that represents the frequencies contained in the specific sound transition and the intensity at each frequency, with the intensity at each frequency multiplied by the volume ratio. The amplitude ratio derivation procedure derives, for each frequency, an amplitude ratio representing the ratio between the intensity at that frequency in the musical sound amplitude spectrum and the intensity at that frequency in the specific sound amplitude spectrum.

The section transition derivation procedure then derives a section transition, which is a transition of sound along the time axis, from a separated spectrum obtained by multiplying the intensity at each frequency of the musical sound amplitude spectrum by the corresponding amplitude ratio. The sound source separation procedure arranges these section transitions along the time axis of the target music, thereby generating a target sound transition, in which the target sound (the sound within the musical sound transition that is output from the sound source corresponding to the target track) changes along the time axis.
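The chain from amplitude spectra to the target sound transition can be sketched as follows. This is a minimal illustration under several assumptions that the text does not confirm: the amplitude ratio is taken as the specific sound intensity divided by the musical sound intensity and clipped to 1, the phase of the musical sound transition is reused, and the unit times are non-overlapping rectangular frames. The name `separate_track` and the parameter `frame` are hypothetical.

```python
import numpy as np

def separate_track(mix, track, volume_ratio, frame=1024):
    """Per-frame amplitude-ratio masking (illustrative sketch only).

    mix          : musical sound transition (all sound sources)
    track        : specific sound transition (synthesized target track)
    volume_ratio : scalar applied to the specific sound amplitude spectrum
    """
    mix = np.asarray(mix, dtype=float)
    track = np.asarray(track, dtype=float)
    n_frames = min(len(mix), len(track)) // frame
    out = np.zeros(n_frames * frame)
    for i in range(n_frames):
        seg = slice(i * frame, (i + 1) * frame)
        mix_spec = np.fft.rfft(mix[seg])                 # musical sound spectrum
        mix_amp = np.abs(mix_spec)
        track_amp = np.abs(np.fft.rfft(track[seg])) * volume_ratio
        # Amplitude ratio per frequency (assumed direction), clipped to 1.
        ratio = np.minimum(track_amp / np.maximum(mix_amp, 1e-12), 1.0)
        # Separated spectrum -> section transition, placed on the time axis.
        out[seg] = np.fft.irfft(ratio * mix_spec, n=frame)
    return out
```

A practical implementation would more likely use overlapping windowed frames to avoid discontinuities at frame boundaries. The rectangular frames here only keep the sketch short.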

If the program of the present invention is configured in this way, it can be recorded on a computer-readable recording medium such as a DVD-ROM, CD-ROM, or hard disk and loaded into a computer and started as needed, or it can be acquired by a computer via a communication line and started as needed. By causing a computer to execute each procedure, the computer can be made to function as the sound source separation device described in claim 1.

A block diagram showing the schematic configuration of the sound source separation device in the embodiment. A flowchart showing the procedure of the sound source separation processing. A flowchart showing the procedure of the pitch correction processing. An explanatory diagram illustrating the content of the pitch correction processing. A flowchart showing the procedure of the time correction processing. An explanatory diagram illustrating the content of the time correction processing. A flowchart showing the procedure of the volume correction processing. A flowchart showing the procedure of the track separation processing. An explanatory diagram illustrating the content of the track separation processing.

Embodiments of the present invention will be described below with reference to the drawings.
The sound source separation device to which the present invention is applied separates the sound output from one sound source in a piece of music (hereinafter, the target music) from the sound output from all the sound sources used in that piece, where the target music is generated in advance so that sounds produced by a plurality of sound sources (for example, various musical instruments and people) are superimposed. In this embodiment, the sound source separation device is constituted by the information processing device 10 shown in FIG. 1.
<Configuration of sound source separation device>
As shown in FIG. 1, the information processing device 10 includes a communication unit 11, an acoustic data reading unit 12, an input receiving unit 13, a display unit 14, a voice input unit 15, a voice output unit 16, a sound source module 17, a storage unit 18, and a control unit 20.

Of these, the communication unit 11 connects the information processing device 10 to a network (for example, a dedicated line or a WAN) and communicates with external devices through that network.
The acoustic data reading unit 12 is a device (for example, a CD or DVD reader) that sequentially reads acoustic data stored on a storage medium along the time axis. The acoustic data is obtained by sampling an analog waveform in which the sound pressure of all the musical sounds constituting the target music (that is, all the musical sounds output from all sound sources) changes along the time axis.

The input receiving unit 13 is an input device (for example, a keyboard or a pointing device) that accepts input of information and commands in response to external operations. The display unit 14 is a display device (for example, a liquid crystal display or a CRT) that displays images. The voice input unit 15 is a device (a so-called microphone) that converts sound into an electrical signal and inputs it to the control unit 20. The voice output unit 16 is a device (a so-called speaker) that converts an electrical signal from the control unit 20 into sound and outputs it.

Furthermore, the sound source module 17 is a device that outputs sounds simulating those of sound sources (hereinafter, output sounds) based on score data representing the score of a piece of music that imitates the target music (hereinafter, the corresponding music). In this embodiment, the sound source module 17 is a well-known MIDI (Musical Instrument Digital Interface) sound source. The sound sources whose sounds the sound source module 17 simulates as output sounds (hereinafter, simulated sound sources) are registered in advance and include keyboard instruments (for example, piano and pipe organ), stringed instruments (for example, violin, viola, guitar, and koto), percussion instruments (for example, drums, cymbals, timpani, and xylophone), and wind instruments (for example, clarinet, trumpet, flute, and shakuhachi).

The score data includes at least identification data, which identifies the corresponding music, and score tracks, each representing the score for one simulated sound source used in the corresponding music. The score data in this embodiment is expressed in the well-known MIDI standard.

Each score track defines the output sounds to be output by the sound source module 17 and is assigned an index number mti (mti = 1 to MTN) according to the simulated sound source. A score track defines at least the period during which the sound source module 17 outputs each output sound (hereinafter, note length), the pitch of each output sound (the so-called note number), the intensity of each output sound (so-called attack, velocity, decay, and the like), and the tempo in each section into which the corresponding music is divided (for example, verse or chorus).

The note length in a score track is defined by an output timing (the so-called note-on timing), which is the time from the start of the performance of the music to the start of output of the output sound, and an end timing (the so-called note-off timing), which is the time from the start of the performance to the end of output of the output sound. Hereinafter, an output sound defined in a score track is also referred to as a performance sound.
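The contents of a score track described above can be pictured as a simple record per performance sound. The sketch below is only an illustration: the class name `Note`, its field names, and the use of seconds as the time unit are assumptions, not part of the MIDI standard or of this description.

```python
from dataclasses import dataclass

@dataclass
class Note:
    """One performance sound in a score track (hypothetical layout)."""
    note_on: float    # output timing: time from performance start (assumed seconds)
    note_off: float   # end timing: time from performance start (assumed seconds)
    pitch: int        # note number
    velocity: int     # intensity of the output sound

    @property
    def length(self) -> float:
        # Note length is defined by the note-on and note-off timings.
        return self.note_off - self.note_on
```

For example, `Note(1.0, 1.5, 60, 100).length` evaluates to `0.5`.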

The storage unit 18 is a non-volatile storage device (for example, a hard disk drive) whose stored contents can be read and written. The storage unit 18 stores at least the processing programs and the score data.

Furthermore, the control unit 20 is built around a well-known computer having at least a ROM 21, which stores processing programs and data that must be retained even when the power is turned off; a RAM 22, which temporarily stores processing programs and data; and a CPU 23, which executes each process (various calculations) according to the processing programs stored in the ROM 21 and the RAM 22.

As a processing program in this embodiment, a program is prepared in advance that causes the control unit 20 to execute sound source separation processing. This processing uses score data corrected so that each performance sound (that is, output sound) defined in the score tracks matches the corresponding musical sound constituting the target music, and separates the sound output from one sound source from the sound output from all the sound sources in the target music.
<Content of sound source separation processing>
Next, the sound source separation processing executed by the control unit 20 will be described.

The sound source separation processing starts when an activation command for starting it is input via the input receiving unit 13.
As shown in FIG. 2, when the sound source separation processing is started, it first acquires the score data corresponding to the music specified by information input via the input receiving unit 13 (S110; S denotes a step).

Next, the acoustic data read by the acoustic data reading unit 12 is acquired as a musical sound transition, which is a waveform in which the musical sounds constituting the target music change along the time axis (S120). In this embodiment, it is assumed that a storage medium storing the acoustic data of the target music corresponding to the score data acquired in S110 has been placed in the acoustic data reading unit 12 before the sound source separation processing is started.

Then, based on the score data acquired in S110 and the musical sound transition acquired in S120, pitch correction processing is executed, which corrects the score data so that the pitch of the performance sounds matches the pitch of the musical sounds constituting the target music (S130). Hereinafter, score data in which the performance sounds have been corrected is referred to as corrected score data, and a performance sound that has been corrected is referred to as a corrected performance sound (corresponding to the corrected output sound of the present invention).

Furthermore, time correction processing is executed, which corrects the corrected score data so that the output timing of the performance sounds, whose pitch has been matched to that of the musical sounds by the pitch correction processing, matches the performance start timing of the musical sounds (S150). Next, volume correction processing is executed, which derives a volume correction amount for correcting the corrected score data so that the intensity of the performance sounds, whose output timing has been matched to the performance start timing of the musical sounds by the time correction processing, matches the intensity (that is, the volume) of the musical sounds (S170).

Then, track separation processing is executed, which uses the corrected score data, in which the pitch and output timing of the performance sounds have been corrected, together with the volume correction amount to generate, from the musical sound transition, a target sound transition, which is a waveform in which the sound output from one sound source changes along the time axis (S180).

The sound source separation processing then ends.
<Content of pitch correction processing>
Next, the pitch correction processing started in S130 of the sound source separation processing will be described.

When the pitch correction processing is started, as shown in FIG. 3, it acquires an output sound transition, which is a waveform in which all the performance sounds change along the time axis, based on all the score tracks included in the score data acquired in S110 (S310). Specifically, in this embodiment the output sound transition is acquired by having the sound source module 17 output the individual performance sounds defined in all the score tracks along the time axis of the score data and receiving them via the voice input unit 15.

Next, the acquired output sound transition is frequency-analyzed (in this embodiment, by discrete Fourier transform) for each unit time set along the time axis, and a power spectrum representing the frequencies contained in the output sound transition in each unit time and the intensity at each frequency is derived (S320). Based on the derived power spectra, an average output sound spectrum is derived by taking, for each frequency, the arithmetic mean of the intensity along the time axis (S330). The intensities of the average output sound spectrum are then averaged within each of a set of predefined frequency ranges whose boundaries adjoin one another (for example, semitone units; hereinafter, specified pitch ranges) to obtain a representative value for each range (S340). Finally, a normalized output sound spectrum (see FIG. 4(A)) is derived by normalizing the representative values obtained in S340 so that their variance is 1 and their mean is 0 (S350).
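Steps S320 to S350 can be sketched as below. This is an illustration under assumptions: the unit time is a non-overlapping frame of `frame` samples, the specified pitch ranges are semitone-wide bands centred on `f_lo * 2**(k/12)`, and the defaults for `frame`, `f_lo`, and `n_bands` are arbitrary choices that do not come from the source. The function name is hypothetical.

```python
import numpy as np

def normalized_band_spectrum(signal, sr, frame=8192, f_lo=110.0, n_bands=60):
    """Sketch of S320-S350; frame, f_lo and n_bands are illustrative assumptions."""
    signal = np.asarray(signal, dtype=float)
    n_frames = len(signal) // frame
    frames = signal[: n_frames * frame].reshape(n_frames, frame)
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2      # S320: power spectrum per unit time
    mean_power = power.mean(axis=0)                       # S330: arithmetic mean over time
    freqs = np.fft.rfftfreq(frame, d=1.0 / sr)
    # Adjoining semitone band edges, half a semitone either side of each centre.
    edges = f_lo * 2.0 ** ((np.arange(n_bands + 1) - 0.5) / 12.0)
    bands = np.zeros(n_bands)
    for k in range(n_bands):                              # S340: representative value per band
        in_band = (freqs >= edges[k]) & (freqs < edges[k + 1])
        if in_band.any():
            bands[k] = mean_power[in_band].mean()
    return (bands - bands.mean()) / bands.std()           # S350: mean 0, variance 1
```

The same function would apply unchanged to the musical sound transition in S360 to S390. A band that contains no DFT bin is left at zero here, and a silent input would make the final division undefined, so a real implementation would need to handle both cases.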

The representative value obtained in S340 of this embodiment need not be the average of the intensities at the frequencies within the specified pitch range; the intensity at the frequency corresponding to the center of the specified pitch range may be used instead. In that case, specifically, the value (power) at the frequency closest to each point of a 20-cent grid (one fifth of a semitone) is extracted.
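The 20-cent grid alternative can be sketched as a nearest-bin lookup. The reference frequency `f_ref` and the grid extent (`lo_cents`, `hi_cents`) are assumptions for illustration; the text specifies only the 20-cent spacing. The function name is hypothetical.

```python
import numpy as np

def cent_grid_sample(mean_power, freqs, f_ref=440.0, step_cents=20,
                     lo_cents=-2400, hi_cents=2400):
    """Pick the power at the bin nearest each point of a cent-spaced grid.

    f_ref, lo_cents and hi_cents are illustrative assumptions.
    """
    mean_power = np.asarray(mean_power, dtype=float)
    freqs = np.asarray(freqs, dtype=float)
    cents = np.arange(lo_cents, hi_cents + 1, step_cents)
    grid = f_ref * 2.0 ** (cents / 1200.0)           # grid frequencies in Hz
    nearest = np.abs(freqs[None, :] - grid[:, None]).argmin(axis=1)
    return grid, mean_power[nearest]
```

Compared with band averaging, this keeps a finer pitch resolution (five grid points per semitone) at the cost of using a single bin per grid point.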

Next, the musical sound transition acquired in S120 is frequency-analyzed for each unit time set along the time axis, and a power spectrum representing the frequencies contained in the musical sound transition in each unit time and the intensity at each frequency is derived (S360). Based on the derived power spectra, an average musical sound spectrum is derived by taking, for each frequency, the arithmetic mean of the intensity along the time axis (S370). The intensities of the average musical sound spectrum are averaged within each specified pitch range to obtain a representative value (S380), and a normalized musical sound spectrum (see FIG. 4(B)) is derived by normalizing the representative values obtained in S380 so that their variance is 1 and their mean is 0 (S390).

As with S340, the representative value obtained in S380 of this embodiment need not be the average of the intensities at the frequencies within the specified pitch range; the intensity at the frequency corresponding to the center of the specified pitch range may be used instead. In that case, specifically, the value (power) at the frequency closest to each point of a 20-cent grid (one fifth of a semitone) is extracted.

Then, as described in detail below, a correlation value between the normalized output sound spectrum and the normalized musical sound spectrum (hereinafter, pitch correlation value) is derived (S400). It is then determined whether the shift amount of the normalized output sound spectrum relative to the normalized musical sound spectrum is equal to or greater than a predefined upper limit (S410). If the shift amount is less than the upper limit (S410: NO), the normalized output sound spectrum is shifted along the frequency axis by a predefined amount (S420), and the processing returns to S400 to derive the pitch correlation value again.

That is, in S400 to S420 of this embodiment, as shown in FIG. 4(C), the normalized output sound spectrum is shifted along the frequency axis relative to the normalized musical sound spectrum from a lower limit up to the upper limit, and a pitch correlation value is derived each time the normalized output sound spectrum is shifted.

When the shift amount of the normalized output sound spectrum reaches or exceeds the upper limit (S410: YES), a correction amount for matching the pitch of the performance sounds to the pitch of the musical sounds constituting the target music (hereinafter, pitch correction amount) is derived (S430). Specifically, in S430 of this embodiment, the shift amount of the normalized output sound spectrum corresponding to the largest of all the pitch correlation values derived in S400 is taken as the pitch correction amount.
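The search in S400 to S430 amounts to sliding one normalized spectrum over the other and keeping the shift with the highest correlation. The sketch below assumes both spectra have the same length and the same band spacing, uses the mean product over the overlapping bands as the correlation value, and expresses the shift in band indices. The exact correlation formula, the step size, and the limits are not given in this passage, so these are illustrative choices, and the function name is hypothetical.

```python
import numpy as np

def best_pitch_shift(output_spec, musical_spec, max_shift=12):
    """Return (shift, correlation) maximizing the overlap correlation.

    A positive shift moves the output sound spectrum towards higher bands.
    The correlation measure and max_shift are assumptions.
    """
    output_spec = np.asarray(output_spec, dtype=float)
    musical_spec = np.asarray(musical_spec, dtype=float)
    n = len(output_spec)
    best_shift, best_corr = 0, -np.inf
    for shift in range(-max_shift, max_shift + 1):   # lower limit .. upper limit (S410/S420)
        if shift >= 0:
            a, b = output_spec[: n - shift], musical_spec[shift:]
        else:
            a, b = output_spec[-shift:], musical_spec[: n + shift]
        corr = float(np.dot(a, b)) / len(a)          # pitch correlation value (S400)
        if corr > best_corr:
            best_shift, best_corr = shift, corr
    return best_shift, best_corr                     # S430: shift at the maximum
```

With semitone-wide bands, the returned shift could be added directly to each note number when the score data is corrected. With a finer grid it would first have to be converted to semitones.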

Next, corrected score data is generated by correcting (shifting) the pitch of each performance sound defined in all the score tracks of the score data according to the derived pitch correction amount (S440). That is, in the corrected score data generated in S440 of this embodiment, the pitch of each performance sound is shifted by the pitch correction amount from the pitch of the performance sound prepared in advance.

The pitch correction processing then ends, and the processing returns to the sound source separation processing.
In short, the pitch correction processing corrects the pitch of each performance sound defined in all the score tracks of the score data according to a single pitch correction amount, which is derived by comparing the normalized musical sound spectrum, serving as musical sound information representing the characteristics of the musical sound transition, with the normalized output sound spectrum, serving as output sound information representing the characteristics of the output sound transition.

<Content of time correction processing>
Next, the time correction processing started in S150 of the sound source separation processing will be described.
When the time correction processing is started, as shown in FIG. 5, it acquires a corrected sound transition, which is a waveform in which all the corrected performance sounds change along the time axis, based on all the score tracks included in the corrected score data generated in S440 (S510). In this embodiment, the corrected sound transition may be acquired in the same way as in S310.

Next, an output sound non-harmonic, which is a waveform in which the non-harmonic component of the acquired corrected sound transition changes along the time axis, is derived from the corrected sound transition (S520), and a musical sound non-harmonic, which is a waveform in which the non-harmonic component of the musical sound transition acquired in S120 changes along the time axis, is derived from the musical sound transition (S530). These non-harmonic components may be derived by passing the corrected sound transition or the musical sound transition through a filter prepared in advance.

Furthermore, the output sound non-harmonic and the musical sound non-harmonic are each divided into specific blocks, which are time lengths defined along the time axis (S540). For the output sound non-harmonic, the specific blocks are the constant-tempo sections, in each of which the tempo of the corresponding music is constant. The constant-tempo sections are determined by taking the times at which the tempo changes in the corresponding music, according to the tempo defined in the score tracks, as the start and end times of each section. The specific blocks of the musical sound non-harmonic are determined after those of the output sound non-harmonic: the times from the start of the performance of the target music that correspond to the start and end times of each specific block of the output sound non-harmonic are taken as the start and end times of each specific block of the musical sound non-harmonic.

Then, one pair of specific blocks is selected from the specific blocks divided in S540 (S550), and for the selected pair, unit data representing the change along the time axis is generated for both the musical sound non-harmonic and the output sound non-harmonic (S560). As shown in FIGS. 6(A) and 6(B), the unit data in this embodiment is generated by summing the amplitude values of the non-harmonic component (that is, the musical sound non-harmonic or the output sound non-harmonic) within each specified section, which is shorter than the specific block, and then normalizing the summed amplitude values. Hereinafter, the unit data for the output sound non-harmonic is referred to as output sound unit data (corresponding to the output sound change in the present invention), and the unit data for the musical sound non-harmonic is referred to as musical sound unit data (corresponding to the musical sound change in the present invention).
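The generation of unit data in S560 can be sketched as a per-section sum followed by normalization. The text does not say whether signed or absolute amplitudes are summed, nor which normalization is used. The version below assumes absolute values and division by the largest section sum. The names `unit_data` and `section_len` are hypothetical.

```python
import numpy as np

def unit_data(block, section_len):
    """Sum |amplitude| per specified section, then normalize (assumed: by the peak).

    block       : non-harmonic waveform of one specific block
    section_len : number of samples per specified section
    """
    block = np.asarray(block, dtype=float)
    n_sections = len(block) // section_len
    trimmed = np.abs(block[: n_sections * section_len])
    sums = trimmed.reshape(n_sections, section_len).sum(axis=1)
    peak = sums.max() if n_sections > 0 else 0.0
    return sums / peak if peak > 0 else sums
```

For example, `unit_data([1, -1, 0, 0, 2, 2], 2)` gives section sums of 2, 0, and 4, which normalize to 0.5, 0.0, and 1.0.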

Next, the output sound setting position defined on the time axis of the output sound unit data generated in S560 is aligned with the musical sound setting position defined on the time axis of the musical sound unit data, and a correlation value between the output sound unit data and the musical sound unit data (hereinafter, time correlation value) is derived (S570). It is then determined whether the expansion/contraction rate of the output sound unit data relative to the musical sound unit data is equal to or greater than a predefined upper limit (the upper limit of the expansion/contraction rate) (S580). If the expansion/contraction rate is less than that upper limit (S580: NO), the output sound unit data is expanded along the time axis by a predefined amount (S590), and the processing returns to S570.

If the expansion/contraction rate has reached its upper limit (S580: YES), it is determined whether the shift amount of the output sound unit data along the time axis relative to the musical sound unit data is equal to or greater than a predefined upper limit (the upper limit of the shift amount) (S600). If the shift amount is less than that upper limit (S600: NO), the setting position of the output sound unit data is shifted by a predefined time (S610), the expansion/contraction rate of the output sound unit data is reset to its lower limit, and the processing returns to S570.

That is, in S570 to S610 of this embodiment, as shown in FIG. 6(C), a time correlation value is derived each time the output sound unit data is expanded relative to the musical sound unit data, until the expansion/contraction rate reaches its upper limit. This derivation of time correlation values is repeated while the output sound unit data is shifted along the time axis relative to the musical sound unit data, until the shift amount reaches its upper limit.

On the other hand, if the determination in S600 finds that the shift amount of the output sound unit data is equal to or greater than the upper limit of the shift amount (S600: YES), a correction amount for matching the output timing of the corrected output sounds to the performance start timing of the musical sounds constituting the target music (hereinafter, time correction amount) is derived (S620). Specifically, in S620 of this embodiment, the expansion/contraction rate and shift amount of the output sound unit data corresponding to the largest of all the time correlation values derived in S570 for the pair of specific blocks are taken as the time correction amount for the specific block selected in S550.
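The loop over S570 to S620 is a two-parameter grid search: for every shift, the expansion/contraction rate is swept from its lower to its upper limit, and the pair with the largest time correlation value is kept. The sketch below makes assumptions that go beyond the text: stretching is done by linear interpolation, the correlation value is a cosine similarity, and the candidate `rates` and `shifts` (in unit-data samples) are supplied by the caller. The function name is hypothetical.

```python
import numpy as np

def best_time_correction(output_unit, musical_unit, rates, shifts):
    """Return (rate, shift, correlation) with the largest time correlation value."""
    output_unit = np.asarray(output_unit, dtype=float)
    musical_unit = np.asarray(musical_unit, dtype=float)
    src = np.arange(len(output_unit))
    dst = np.arange(len(musical_unit))
    best = (None, None, -np.inf)
    for shift in shifts:                       # outer loop: S600 / S610
        for rate in rates:                     # inner loop: S580 / S590
            # Where each musical-unit sample falls in the unstretched output unit.
            pos = (dst - shift) / rate
            warped = np.interp(pos, src, output_unit, left=0.0, right=0.0)
            denom = np.linalg.norm(warped) * np.linalg.norm(musical_unit)
            corr = float(np.dot(warped, musical_unit)) / denom if denom > 0 else 0.0
            if corr > best[2]:                 # S620: keep the maximum (S570 values)
                best = (rate, shift, corr)
    return best
```

Normalizing the correlation keeps longer stretched signals from scoring higher merely because they contain more samples. Whether the original method normalizes in this way is not stated.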

According to the derived time correction amount, corrected score data is generated in which the output timing of each performance sound is corrected (here, the corrected performance sounds are corrected further) (S630). In S630 of this embodiment, the start time and end time of the specific block in the corrected score data, in which the pitch of the performance sounds has already been corrected, are corrected based on the shift amount and expansion/contraction rate of the output sound unit data derived as the time correction amount for the specific block selected in S550. Then, the intervals between the output timings of the performance sounds are expanded or contracted according to the period defined by the corrected start and end times so that the ratio between the intervals before correction is preserved, thereby generating corrected score data in which the output timing of each performance sound in the specific block is corrected. In S630 of this embodiment, the end timing of each performance sound is also corrected, using the same method as for the output timing.
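Preserving the interval ratios while moving the block boundaries is a linear remapping of each timing from the old block span to the new one. The following sketch assumes all timings are expressed in the same unit as the block boundaries; the function and parameter names are hypothetical.

```python
def rescale_timings(times, old_start, old_end, new_start, new_end):
    """Linearly map timings from [old_start, old_end] to [new_start, new_end].

    The ratios between successive intervals are unchanged by this mapping.
    """
    scale = (new_end - new_start) / (old_end - old_start)
    return [new_start + (t - old_start) * scale for t in times]
```

For example, `rescale_timings([0.0, 1.0, 2.0, 4.0], 0.0, 4.0, 1.0, 7.0)` returns `[1.0, 2.5, 4.0, 7.0]`: the intervals 1, 1, 2 become 1.5, 1.5, 3, keeping the 1:1:2 ratio. The same call can be applied to the note-off timings.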

Subsequently, it is determined whether the time correction amount has been derived for all the specific blocks divided in S540 (S640). If it has not been derived for all the specific blocks (S640: NO), the process returns to S550. In S550, a new specific block is selected and the steps up to S630 are executed. In S550, specific blocks are selected in descending order of time length, and the time correction amount is derived for each selected block. However, for a specific block adjacent to a specific block whose time correction amount has already been derived, the corrected start time or end time of that already processed block is used as the corresponding value for the block itself.

On the other hand, if the determination in S640 finds that the time correction amount has been derived for all the specific blocks (S640: YES), the time correction process ends and control returns to the sound source separation process.
In other words, the time correction process corrects the output timing of each performance sound defined in all the score tracks of the score data according to the time correction amount, which is derived from the result of comparing the musical sound unit data (musical sound information representing the characteristics of the musical sound transition) with the output sound unit data (output sound information representing the characteristics of the corrected sound transition).

<Details of the volume correction process>
Next, the volume correction process started in S170 of the sound source separation process will be described.
When started, this volume correction process, as shown in FIG. 7, acquires the corrected sound transition, which is the waveform of all the corrected performance sounds along the time axis, based on all the score tracks included in the corrected score data generated in the preceding S640 (S710). In this embodiment, the corrected sound transition may be acquired by the same method as in S310.

The amplitude of the corrected sound transition acquired in S710 is averaged over the whole time axis (the entire period) to derive the output sound average amplitude (S720), and the amplitude of the musical sound transition acquired in the preceding S120 is likewise averaged over the whole time axis (the entire period) to obtain the musical sound average amplitude (S730). Subsequently, the volume ratio kv, which is the ratio between the output sound average amplitude derived in S720 and the musical sound average amplitude derived in S730, is derived as the volume correction amount (S740).
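A minimal Python sketch of S720 to S740 follows. Two points are assumptions of this sketch rather than statements of the embodiment: the average amplitude is taken as the mean absolute sample value, and kv is formed as the musical sound average divided by the output sound average, so that multiplying a synthesized spectrum by kv brings it to the level of the recording. The text itself only says that kv is the ratio between the two averages.

```python
def mean_abs_amplitude(samples):
    """Average amplitude over the whole waveform (mean of |sample|)."""
    if not samples:
        raise ValueError("waveform must not be empty")
    return sum(abs(s) for s in samples) / len(samples)


def volume_ratio(music, corrected):
    """Single-pass volume ratio kv; no iterative fitting is involved."""
    output_avg = mean_abs_amplitude(corrected)  # S720
    music_avg = mean_abs_amplitude(music)       # S730
    if output_avg == 0:
        raise ValueError("corrected sound transition is silent")
    return music_avg / output_avg               # S740 (assumed direction)
```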

Thereafter, the volume correction process ends and control returns to the sound source separation process.
<Details of the track separation process>
Next, the track separation process started in S180 of the sound source separation process will be described.

When started, this track separation process, as shown in FIG. 8, performs frequency analysis (a discrete Fourier transform in this embodiment) on the entire musical sound transition acquired in the preceding S120 for each analysis time twi set along the time axis, and stores the result in the RAM 22 (or the storage unit 18) (S810). The frequency analysis in S810 derives, for both the real part and the imaginary part, the frequencies contained in the musical sound transition at each analysis time twi and the intensity at each frequency (hereinafter, the musical sound spectrum amplitude value) tusp(twi, fi). The symbol fi denotes a frequency division (that is, a frequency division derived by the discrete Fourier transform; unit: [bin]).
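The per-analysis-time transform in S810 can be sketched in Python as below. The sketch assumes consecutive, non-overlapping frames of a fixed length and no window function, neither of which is specified in the text. A direct O(n^2) DFT is used only to keep the example self-contained.

```python
import cmath


def dft(frame):
    """Discrete Fourier transform of one frame; returns complex bins."""
    n = len(frame)
    return [
        sum(frame[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def framewise_dft(samples, frame_len):
    """Return spectra[twi][fi], one complex spectrum per analysis frame."""
    last_start = len(samples) - frame_len
    return [
        dft(samples[start:start + frame_len])
        for start in range(0, last_start + 1, frame_len)
    ]
```

Each complex bin carries the real part and the imaginary part that the embodiment stores for tusp(twi, fi).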

Next, the index number mti of the score track is set to its initial value (0 in this embodiment) (S820). It is then determined whether the currently set index number of the score track (hereinafter, the set index) mti is less than the maximum index number in the score data (hereinafter, the final index) MTN (S830).

If the determination in S830 finds that the set index mti is less than the final index MTN (S830: YES), the set index mti is incremented by one (S840). Subsequently, the target sound transition is set to its initial value (S850). In this embodiment, the initial value of the target sound transition is a zero waveform whose sound pressure is "0" everywhere along the time axis.

Then, the index number ni of the performance sounds in the score track corresponding to the set index mti (hereinafter, the performance sound index) is set to its initial value (0 in this embodiment) (S860). It is then determined whether the performance sound index ni is less than the maximum index number in the score track corresponding to the set index mti (hereinafter, the final performance sound) NNPT(mti) (S870).

If the determination in S870 finds that the performance sound index ni is less than the final performance sound NNPT(mti) (S870: YES), the performance sound index ni is incremented by a specified number (S880). Subsequently, the specific sound transition is acquired, which is the waveform along the time axis of the performance sounds corresponding to the specified number of performance sound indexes ni covered by the increment in this S880 (S890). In this embodiment, the specific sound transition may be acquired by the same method as in S310.

Then, the acquired specific sound transition is subjected to frequency analysis (here, a discrete Fourier transform) for each analysis time twi set along the time axis (S900). As a result, for each frequency contained in the specific sound transition at the analysis time twi, the intensity at that frequency (hereinafter, the spectrum amplitude value) ntsp(twi, fi) is derived for both the real part and the imaginary part.

Subsequently, as shown in FIG. 9(A), the specific sound spectrum amplitude value ntsp_n(twi, fi) is derived by multiplying each spectrum amplitude value ntsp(twi, fi) by the volume ratio kv derived in the preceding volume correction process (S910). An amplitude ratio kr(twi, fi) representing the ratio between the specific sound spectrum amplitude value ntsp_n(twi, fi) and the musical sound spectrum amplitude value tusp(twi, fi) is then derived (S920). In S920 of this embodiment, the amplitude ratio kr is derived for each frequency division fi. However, the amplitude ratio kr(twi, fi) is set to "1" if the specific sound spectrum amplitude value ntsp(twi, fi) is larger than the musical sound spectrum amplitude value tusp(twi, fi), and to the ratio of the two spectrum amplitude values if the specific sound spectrum amplitude value ntsp(twi, fi) is smaller than the musical sound spectrum amplitude value tusp(twi, fi).
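S910 and S920 can be sketched in Python as a clipped per-bin ratio. This sketch makes three assumptions that the text leaves open: the comparison is done on magnitudes, the volume-corrected value ntsp_n (rather than the raw ntsp) is the one compared with tusp, and a bin where tusp is zero is given the ratio 1. The function names are hypothetical.

```python
def amplitude_ratio(ntsp_mag, tusp_mag, kv):
    """Amplitude ratio kr for one (twi, fi) cell, limited to at most 1."""
    ntsp_n = kv * ntsp_mag                   # S910: volume-corrected value
    if tusp_mag <= 0 or ntsp_n >= tusp_mag:  # note is at least as loud as the mix
        return 1.0
    return ntsp_n / tusp_mag                 # S920: ratio of the two values


def amplitude_ratios(ntsp_frame, tusp_frame, kv):
    """kr for every frequency division fi of one analysis time twi."""
    return [
        amplitude_ratio(abs(n), abs(t), kv)
        for n, t in zip(ntsp_frame, tusp_frame)
    ]
```

Because kr never exceeds 1, the spectrum later separated with it cannot exceed the spectrum of the recording in any bin.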

Then, the musical sound spectrum amplitude value tusp(twi, fi) is multiplied by the amplitude ratio kr to derive the separated spectrum amplitude value ntcpsp(twi, fi), that is, the separated spectrum (S930). Specifically, in S930, as shown in FIG. 9(B) and FIG. 9(C), the musical sound spectrum amplitude values tusp(twi, fi) of the real part and of the imaginary part are each multiplied by the amplitude ratio kr(twi, fi) corresponding to the combination of analysis time twi and frequency division fi. In FIGS. 9(B) and 9(C), the solid line is the spectrum amplitude value ntcpsp(twi, fi) derived as the separated spectrum, and the broken line is the musical sound spectrum amplitude value tusp(twi, fi).

Further, the separated spectrum amplitude value ntcpsp(twi, fi) derived in S930 is subtracted from the musical sound spectrum amplitude value tusp for the corresponding time (period) stored in the RAM 22 (or the storage unit 18), and the stored musical sound spectrum amplitude value tusp is thereby updated to a new musical sound spectrum amplitude value tusp (S940).

Subsequently, the separated spectrum amplitude value ntcpsp(twi, fi) is subjected to an inverse discrete Fourier transform (IDFT) to derive a section transition (S950). Then, the corresponding section of the target sound transition, which was set to its initial value, is replaced with the section transition derived in S950, so that the target sound transition is updated to a new target sound transition (S960).
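Steps S930 to S960 can be sketched for a single analysis frame in Python as follows. The sketch assumes that tusp is held as a list of complex bins and that kr is real, so multiplying a bin by kr scales its real and imaginary parts alike. The function names and the in-memory layout are hypothetical.

```python
import cmath


def idft(spectrum):
    """Inverse DFT; returns the real part of each time-domain sample."""
    n = len(spectrum)
    return [
        sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n)
            for k in range(n)).real / n
        for t in range(n)
    ]


def separate_frame(tusp_frame, kr_frame):
    """Return (section transition, updated tusp) for one analysis time."""
    separated = [x * kr for x, kr in zip(tusp_frame, kr_frame)]  # S930
    residual = [x - s for x, s in zip(tusp_frame, separated)]    # S940
    return idft(separated), residual                             # S950


def place_section(target, section, start):
    """S960: overwrite the matching span of the target waveform in place."""
    target[start:start + len(section)] = section
    return target
```

Returning the updated tusp makes the effect of S940 explicit: the next performance sound is separated from whatever remains after the current one has been removed.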

Thereafter, the process returns to S870, and as long as the performance sound index ni is less than the final performance sound NNPT(mti) for the set index mti (S870: YES), the steps from S870 to S960 are repeated. When the performance sound index ni becomes equal to or greater than the final performance sound NNPT(mti) for the set index mti (S870: NO), the target sound transition at that point is stored in the storage unit 18 (S970). That is, once the sound output from the sound source corresponding to the target track (that is, the target sound transition) has been separated from the acoustic data, the process returns to S830 via S970.
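The loop control of S820 to S970 checks each index before incrementing it, so the indexes actually processed run from 1 upward. The Python sketch below records only the order in which (mti, ni) pairs are visited. The list argument, which gives NNPT for each track, and the function name are assumptions for illustration.

```python
def visit_order(notes_per_track, step=1):
    """List the (mti, ni) pairs in the order the flowchart reaches them.

    notes_per_track[mti - 1] plays the role of NNPT(mti); its length
    plays the role of MTN.
    """
    mtn = len(notes_per_track)
    visited = []
    mti = 0                                   # S820
    while mti < mtn:                          # S830
        mti += 1                              # S840
        ni = 0                                # S860 (S850 resets the target)
        while ni < notes_per_track[mti - 1]:  # S870
            ni += step                        # S880
            visited.append((mti, ni))         # S890 to S960 would run here
        # S970: the target sound transition for track mti is stored here
    return visited
```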

In S830, reached again via S970, if the set index mti is less than the final index MTN (S830: YES), the steps from S840 to S970 are repeated. When the set index mti becomes equal to or greater than the final index MTN (S830: NO), the sound source separation process ends. That is, the sound source separation process ends once the target sound transition has been generated and separated from the acoustic data for every score track included in the score data.
[Effect of the embodiment]
As described above, the volume correction process in the sound source separation process of this embodiment derives the volume ratio kv as the volume correction amount, and deriving this volume ratio kv does not require iterative calculation.

For this reason, the volume correction process of this embodiment can reduce the amount of calculation required to derive the volume correction amount compared with the conventional sound source separation device.
Furthermore, the pitch correction process in the sound source separation process derives, as the pitch correction amount, the shift amount along the frequency axis of the normalized output sound spectrum relative to the normalized musical sound spectrum at which the two spectra have their maximum correlation (the largest of the correlation values, at which the difference between the normalized output sound spectrum and the normalized musical sound spectrum is smallest). The pitch correction amount is derived by shifting the normalized output sound spectrum along the frequency axis from a lower limit to an upper limit.
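The bounded search described above can be sketched in Python as follows. The sketch assumes that both normalized spectra are lists indexed by semitone bin and that the correlation value is a plain inner product over the overlapping bins. The embodiment does not define the correlation measure in this passage, so this is one possible choice, and the function name is hypothetical.

```python
def best_pitch_shift(output_spec, music_spec, min_shift, max_shift):
    """Return the shift (in bins) of output_spec that best matches music_spec."""

    def correlation(shift):
        # Shifting by `shift` bins moves output bin j to position j + shift.
        total = 0.0
        for i, m in enumerate(music_spec):
            j = i - shift
            if 0 <= j < len(output_spec):
                total += m * output_spec[j]
        return total

    # One pass over a fixed range of shifts, from the lower to the upper limit.
    return max(range(min_shift, max_shift + 1), key=correlation)
```

The number of correlation evaluations is fixed by the shift range, which is where the bound on the repetition count comes from.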

For this reason, the pitch correction process of this embodiment can reduce the number of calculation iterations needed to derive the pitch correction amount compared with the conventional sound source separation device.
As a result, the sound source separation process of this embodiment can reduce the amount of calculation required to derive the individual correction amounts compared with the conventional sound source separation device, and can in turn reduce the amount of calculation required to extract the target sound transition contained in the musical sound transition.

In the pitch correction process of this embodiment, the pitch correction amount is derived from a comparison of the normalized output sound spectrum and the normalized musical sound spectrum, which are obtained by normalizing the intensity at each frequency in the power spectra of the output sound transition and the musical sound transition.

Therefore, if the pitch of each output sound defined in the score data is corrected using the pitch correction amount derived in this way, the corrected sound transition based on the corrected score data can be brought close to the musical sound transition even when the amplitude of the musical sound transition differs greatly from that of the output sound transition.

Furthermore, in this embodiment, the normalized output sound spectrum and the normalized musical sound spectrum are averaged in frequency over each semitone, so frequency noise is absorbed into the frequencies of the musical sounds and of the output sounds. This reduces the error in the correlation between the normalized output sound spectrum and the normalized musical sound spectrum and improves the accuracy with which the pitch correction amount is derived. As a result, the pitch of the output sound transition based on the corrected score data agrees more closely with the pitch of the musical sound transition.
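One way to picture the semitone averaging and normalization is the Python sketch below. It assumes a 440 Hz reference for semitone numbering and normalization by the peak value. Neither choice is stated in this passage, and the function name is hypothetical.

```python
import math


def semitone_profile(freqs, powers, f_ref=440.0):
    """Average power per semitone bin, scaled so the largest bin is 1."""
    bins = {}
    for f, p in zip(freqs, powers):
        if f <= 0:
            continue  # the DC bin has no pitch
        k = round(12 * math.log2(f / f_ref))  # nearest semitone index
        bins.setdefault(k, []).append(p)
    means = {k: sum(v) / len(v) for k, v in bins.items()}
    peak = max(means.values(), default=0.0)
    if peak == 0:
        return means
    return {k: m / peak for k, m in means.items()}
```

Components a few hertz off pitch fall into the same semitone bin, which is the sense in which small frequency deviations are absorbed.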

In the data correction process of this embodiment, the time correction process is executed after the pitch correction process has corrected the pitch of each performance sound defined in the score data to match the pitch of the musical sounds constituting the target music. Therefore, the sound source separation device 10 can prevent the accuracy of deriving the time correction amount from degrading because of a mismatch between the pitch of the performance sounds defined in the score data and the pitch of the musical sounds constituting the target music.

In particular, the time correction process of this embodiment derives the time correction amount for each section of the target music in which the tempo is constant. Correcting the output timing of the output sounds with the time correction amount derived in this way allows the output timing of each output sound in the corrected score data to match the performance start timing of each musical sound in the target music more accurately.

In such a sound source separation process, when the target sound transition contained in the musical sound transition is separated from it, separating only the part corresponding to a single target track makes it possible to generate a residual musical sound transition from which only the sound generated by the sound source corresponding to that target track has been removed. That is, the sound source separation process of this embodiment can perform so-called minus-one processing on the target music.
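In the time domain, the residual can be pictured as a sample-by-sample subtraction of the separated target sound transitions from the musical sound transition. The Python sketch below assumes that all waveforms share the same sampling and are already aligned. The function name is hypothetical.

```python
def remove_tracks(music, *targets):
    """Subtract each separated target waveform from the music waveform."""
    residual = list(music)
    for target in targets:
        # Samples beyond the end of the music waveform are ignored.
        for i, value in enumerate(target[:len(residual)]):
            residual[i] -= value
    return residual
```

Passing a single target waveform gives the minus-one result for that track.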

In particular, when the sound source separation process of this embodiment acquires a musical sound transition that contains a singing voice as a musical sound, subtracting the section transitions (that is, the target sound transitions) for all the score tracks from that musical sound transition leaves the transition of the sound pressure of the singing voice along the time axis. In other words, the sound source separation device 10 can extract the transition of the sound pressure of the singing voice in the music.
[Other Embodiments]
Although an embodiment of the present invention has been described above, the present invention is not limited to that embodiment and can be implemented in various modes without departing from the gist of the present invention.

For example, in S310 of the pitch correction process of the above embodiment, the output sound transition was acquired by having the sound source module 17 output the individual output sounds defined in all the score tracks along the time axis of the score data and receiving them through the sound input unit 15, but the method of acquiring the output sound transition is not limited to this. That is, if the information processing apparatus 10 is configured so that the sound source module 17 generates an acoustic signal (an electric signal) representing the waveform of the output sounds along the time axis and the sound output unit 16 sounds according to that acoustic signal, the acoustic signal generated by the sound source module 17 may be acquired as the output sound transition.

In the time correction process of the above embodiment, the time correction amount was derived for each specific block, but a single time correction amount may instead be derived for the whole piece of music.
Also, in the time correction process of the above embodiment, the output sound unit data compared with the musical sound unit data when deriving the time correction amount was generated from the corrected sound transition acquired from the corrected score data in which the pitch of the output sounds had been corrected. However, the signal used to generate the output sound unit data may be, for example, the output sound transition acquired from the score data before the pitch of the output sounds is corrected.

Furthermore, in the volume correction process of the above embodiment, the volume ratio was derived using the corrected sound transition acquired from the corrected score data, but in the present invention the volume ratio may instead be derived using the output sound transition based on the score data.

Although the sound source separation process of the above embodiment executes both the pitch correction process and the time correction process, it may execute at least one of the pitch correction process and the time correction process.
[Correspondence between Embodiment and Claims]
Finally, the relationship between the description of the above embodiment and the description of the claims will be described.

S120 in the sound source separation process of the above embodiment corresponds to the musical sound transition acquisition means of the present invention, and S310 of the pitch correction process and S510 of the time correction process correspond to the output sound transition acquisition means. S320 to S430 in the pitch correction process and S520 to S620 in the time correction process of the above embodiment correspond to the correction amount deriving means of the present invention; of these, the former corresponds to the pitch correction amount deriving means and the latter to the time correction amount deriving means.

Further, S440 in the pitch correction process and S630 in the time correction process of the above embodiment correspond to the correction means of the present invention, S710 in the volume correction process corresponds to the corrected sound transition acquisition means, and S720 to S740 in the volume correction process correspond to the ratio deriving means. In the track separation process of the above embodiment, S890 corresponds to the specific sound transition acquisition means of the present invention, S900 and S910 correspond to the specific sound analysis means, S920 corresponds to the amplitude ratio deriving means, S930 and S950 correspond to the section transition deriving means, and S940, S960, and S970 correspond to the sound source separation means.

S320 to S390 in the pitch correction process of the above embodiment correspond to the distribution deriving means of the present invention, S520 to S560 of the time correction process correspond to the change deriving means, and S570 to S610 of the time correction process correspond to the time correlation deriving means. In addition, S940 in the track separation process of the above embodiment corresponds to the storage control means of the present invention, and S830 to S970 in the track separation process correspond to the update means of the present invention.

DESCRIPTION OF SYMBOLS: 10: information processing apparatus (sound source separation device), 11: communication unit, 12: acoustic data reading unit, 13: input receiving unit, 14: display unit, 15: sound input unit, 16: sound output unit, 17: sound source module, 18: storage unit, 20: control unit, 21: ROM, 22: RAM, 23: CPU

Claims

A musical sound transition acquisition means for acquiring a musical sound transition in which the musical sounds constituting the target music have shifted along the time axis;
Output sound transition acquisition means for acquiring, based on score data that represents a musical score of music simulating the target music and includes, for each sound source used in that music, a score track defining at least the pitch and output timing of output sounds simulating the individual sounds output from the sound source used in the target music, an output sound transition in which the output sounds defined in all the score tracks have shifted along the time axis;
Correction amount deriving means including: pitch correction amount deriving means for deriving, as one of the correction amounts, a pitch correction amount of the score data so that the pitch of the output sound matches the pitch of the musical sound corresponding to the output sound, based on the result of comparing musical sound information representing the characteristics of the musical sound transition, extracted from the musical sound transition acquired by the musical sound transition acquisition means, with output sound information representing the characteristics of the output sound transition, extracted from the output sound transition acquired by the output sound transition acquisition means; and time correction amount deriving means for deriving, as one of the correction amounts, a time correction amount of the score data so that the output timing of the output sound matches the performance start timing of the musical sound corresponding to the output sound, based on the result of comparing the musical sound information with the output sound information, wherein at least one of the pitch correction amount deriving means and the time correction amount deriving means derives the correction amount;
Correction means for generating corrected score data that is the score data corrected to the corrected output sound by shifting the output sound according to the correction amount derived by the correction amount derivation means;
Correction sound transition acquisition means for acquiring the correction sound transition in which the correction output sound defined in all the score tracks in the correction score data generated by the correction means has shifted along the time axis in the correction score data;
The average amplitude of the entire modified sound transition derived from the modified sound transition acquired by the modified sound transition acquiring means, and the average amplitude of the entire musical sound transition derived from the musical sound transition acquired by the musical sound transition acquiring means A ratio deriving means for deriving a volume ratio that is a ratio of
A musical sound analysis means for deriving a musical sound amplitude spectrum, which is an amplitude spectrum representing the frequency included in the musical sound transition acquired by the musical sound transition acquisition means and the intensity at each frequency, per unit time along the time axis to the target music; ,
Specific sound transition acquisition means for acquiring the specific sound transition in which the corrected output sound defined in the target track, which is one of the score tracks in the corrected score data generated by the correction means, has shifted along the time axis in the corrected score data;
Specific sound analysis means for deriving, for each unit time along the time axis of the corrected score data, a specific sound amplitude spectrum that represents the frequencies contained in the specific sound transition acquired by the specific sound transition acquisition means and the intensity at each frequency, the intensity at each frequency having been multiplied by the volume ratio derived by the ratio deriving means;
Amplitude ratio deriving means for deriving, for each frequency, an amplitude ratio representing the ratio between the intensity at that frequency in the musical sound amplitude spectrum derived by the musical sound analysis means and the intensity at that frequency in the specific sound amplitude spectrum derived by the specific sound analysis means;
Section transition deriving means for deriving a section transition, which is a transition of sound along the time axis, from the separated spectrum obtained by multiplying the intensity at each frequency of the musical sound amplitude spectrum by the corresponding amplitude ratio derived by the amplitude ratio deriving means;
By arranging the section transition derived by the section transition deriving means along the time axis of the target music, the target sound output from the sound source corresponding to the target track is along the time axis in the musical sound transition. A sound source separation device comprising: sound source separation means for generating a transition of the target sound that has changed.

The sound source separation device according to claim 1, wherein the pitch correction amount derivation means includes:
Distribution derivation means for deriving, as one item of the musical sound information, a musical sound pitch distribution representing the frequencies included in the entire musical sound transition and the intensity at each frequency, and for deriving, as one item of the output sound information, an output pitch distribution representing the frequencies included in the entire output sound transition and the intensity at each frequency; and
The pitch correction amount derivation means derives a pitch correlation value, which represents the correlation between the output pitch distribution and the musical sound pitch distribution derived by the distribution derivation means, each time the output pitch distribution is shifted along the frequency axis from a predetermined specified position of the musical sound pitch distribution, and derives, as the pitch correction amount, the shift amount along the frequency axis from the specified position that corresponds to the largest of the derived pitch correlation values.

The sound source separation device according to claim 2, wherein the distribution derivation means derives the musical sound pitch distribution and the output pitch distribution with the intensity at each frequency normalized.
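The shift-and-correlate search described in the two claims above can be sketched as follows. This is an illustrative sketch only: the function name `pitch_correction_amount`, the use of L2 normalization, the dot product as the correlation value, and the assumption that both distributions are sampled on the same logarithmic frequency grid (so that a shift of a fixed number of bins is a uniform transposition) are not taken from the claims.

```python
import numpy as np

def pitch_correction_amount(music_dist, output_dist, max_shift):
    """Return (shift, correlation) for the bin shift of output_dist that best
    matches music_dist, searching shifts in [-max_shift, +max_shift]."""
    music_dist = np.asarray(music_dist, dtype=float)
    output_dist = np.asarray(output_dist, dtype=float)

    # Normalize intensities so that loudness differences do not bias the result.
    m = music_dist / (np.linalg.norm(music_dist) + 1e-12)
    o = output_dist / (np.linalg.norm(output_dist) + 1e-12)

    best_shift, best_corr = 0, -np.inf
    for shift in range(-max_shift, max_shift + 1):
        shifted = np.roll(o, shift)  # np.roll returns a new array
        # Discard the bins that wrapped around the ends.
        if shift > 0:
            shifted[:shift] = 0.0
        elif shift < 0:
            shifted[shift:] = 0.0
        corr = float(np.dot(m, shifted))  # pitch correlation value for this shift
        if corr > best_corr:
            best_shift, best_corr = shift, corr
    return best_shift, best_corr
```

A positive result means the output sounds of the score data would have to be raised by that many bins to match the pitch of the musical sounds.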

The sound source separation device according to any one of claims 1 to 3, wherein the time correction amount derivation means includes:
Change derivation means for deriving from the musical sound transition, as one item of the musical sound information, a musical sound change representing the change along the time axis of the non-harmonic component of the musical sound transition, and for deriving from the output sound transition, as one item of the output sound information, an output sound change representing the change along the time axis of the non-harmonic component of the output sound transition; and
Time correlation derivation means for deriving a time correlation value, which represents the correlation between the musical sound change and the output sound change derived by the change derivation means, each time the output sound change is expanded or contracted along the time axis with the set positions set in the musical sound change and in the output sound change aligned with each other, and each time the set position is changed sequentially along the time axis within a specified range;
The time correction amount derivation means derives, as the time correction amount, the expansion/contraction rate along the time axis of the output sound change and the set position that correspond to the largest of the time correlation values derived by the time correlation derivation means.
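The search over expansion/contraction rate and set position recited in the claim above can be sketched as a simple grid search. This is a hedged illustration, not the claimed implementation: the function name `time_correction_amount`, the use of linear interpolation to stretch the output sound change, the normalized dot product as the time correlation value, and the assumption that both inputs are already non-harmonic (onset-like) envelopes sampled at the same frame rate are all assumptions.

```python
import numpy as np

def time_correction_amount(music_change, output_change, rates, offsets):
    """Return (rate, offset, correlation) maximizing the correlation between
    music_change and output_change stretched by `rate` and placed at `offset`
    (offset in frames of music_change)."""
    music_change = np.asarray(music_change, dtype=float)
    output_change = np.asarray(output_change, dtype=float)
    t = np.arange(len(music_change))
    src = np.arange(len(output_change))
    music_norm = np.linalg.norm(music_change)

    best = (None, None, -np.inf)
    for rate in rates:
        for offset in offsets:
            # Value of the stretched and shifted output change at each music frame.
            warped = np.interp((t - offset) / rate, src, output_change,
                               left=0.0, right=0.0)
            denom = music_norm * np.linalg.norm(warped)
            corr = float(np.dot(music_change, warped) / denom) if denom > 0 else 0.0
            if corr > best[2]:
                best = (rate, offset, corr)
    return best
```

The returned rate and offset together play the role of the time correction amount: the rate rescales the tempo of the score data and the offset aligns its start with the performance.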

The sound source separation device according to claim 2, wherein the time correction amount derivation means includes:
Change derivation means for deriving from the musical sound transition, as one item of the musical sound information, a musical sound change representing the change along the time axis of the non-harmonic component of the musical sound transition, and for deriving from the output sound transition, as one item of the output sound information, an output sound change representing the change along the time axis of the non-harmonic component of the output sound transition; and
Time correlation derivation means for deriving a time correlation value, which represents the correlation between the musical sound change and the output sound change derived by the change derivation means, each time the output sound change is expanded or contracted along the time axis with the set positions set in the musical sound change and in the output sound change aligned with each other, and each time the set position is changed sequentially along the time axis within a specified range;
The time correction amount derivation means derives, as the time correction amount, the expansion/contraction rate along the time axis of the output sound change and the set position that correspond to the largest of the time correlation values derived by the time correlation derivation means; and
The change derivation means uses, as the output sound transition, the corrected sound transition that the corrected sound transition acquisition means acquires based on corrected score data in which the output sounds, with their pitch shifted according to the pitch correction amount derived by the pitch correction amount derivation means, are the corrected output sounds.

The sound source separation device according to claim 5, wherein the change derivation means derives the musical sound change for each target section, which is a section of the target music in which the tempo is constant, and derives the output sound change for each section corresponding to the target section.

The sound source separation device according to any one of claims 1 to 6, wherein the sound source separation means includes:
Storage control means for deriving a residual musical sound transition by subtracting the section transition derived by the section transition derivation means from the musical sound transition, and for storing the derived residual musical sound transition in a storage device; and
Update means for causing the section transition derivation means to derive section transitions while sequentially changing the target track and, each time a section transition is derived, subtracting the derived section transition from the residual musical sound transition stored in the storage device, thereby updating the residual musical sound transition stored in the storage device.
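The bookkeeping described in the claim above, in which a residual is kept and updated as each target track is separated, can be sketched as follows. This is an assumption-laden illustration: `separate_all_tracks` and the callback `derive_target` are hypothetical names, a plain array stands in for the storage device, and the claim does not state whether later tracks are derived from the original musical sound transition (as assumed here) or from the residual.

```python
import numpy as np

def separate_all_tracks(music, track_names, derive_target):
    """Separate each target track in turn and maintain the residual.

    derive_target(name, music) is a hypothetical callback that returns the
    target sound transition for one track, with the same length as `music`.
    """
    music = np.asarray(music, dtype=float)
    residual = music.copy()  # stands in for the transition held in storage
    separated = {}
    for name in track_names:  # sequentially change the target track
        target = np.asarray(derive_target(name, music), dtype=float)
        separated[name] = target
        residual = residual - target  # update the stored residual
    return separated, residual
```

After the loop, `residual` holds whatever part of the musical sound transition was not attributed to any of the processed tracks.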

A musical sound transition acquisition procedure for acquiring a musical sound transition in which the musical sounds constituting the target music transition along the time axis;
An output sound transition acquisition procedure for acquiring, based on score data, an output sound transition in which the output sounds defined in all score tracks transition along the time axis of the score data, the score data representing the score of a piece of music imitating the target music and including, for each sound source used in that piece of music, a score track that defines at least the pitch and output timing of output sounds imitating the individual sounds output from the sound sources used in the target music;
A correction amount derivation procedure that derives a correction amount by executing at least one of a pitch correction amount derivation procedure and a time correction amount derivation procedure, wherein the pitch correction amount derivation procedure derives, as one of the correction amounts, a pitch correction amount for the score data such that the pitch of each output sound matches the pitch of the musical sound corresponding to that output sound, based on the result of comparing musical sound information, which represents characteristics of the musical sound transition extracted from the musical sound transition acquired in the musical sound transition acquisition procedure, with output sound information, which represents characteristics of the output sound transition extracted from the output sound transition acquired in the output sound transition acquisition procedure, and wherein the time correction amount derivation procedure derives, as one of the correction amounts, a time correction amount for the score data such that the output timing of each output sound matches the performance start timing of the musical sound corresponding to that output sound, based on the result of comparing the musical sound information with the output sound information;
A correction procedure for generating corrected score data, which is the score data in which the output sounds have been shifted according to the correction amount derived in the correction amount derivation procedure to become corrected output sounds;
A corrected sound transition acquisition procedure for acquiring a corrected sound transition in which the corrected output sounds defined in all score tracks of the corrected score data generated in the correction procedure transition along the time axis of the corrected score data;
A ratio derivation procedure for deriving a volume ratio, which is the ratio between the average amplitude of the entire corrected sound transition, derived from the corrected sound transition acquired in the corrected sound transition acquisition procedure, and the average amplitude of the entire musical sound transition, derived from the musical sound transition acquired in the musical sound transition acquisition procedure;
A musical sound analysis procedure for deriving, for each unit time along the time axis of the target music, a musical sound amplitude spectrum, which is an amplitude spectrum representing the frequencies included in the musical sound transition acquired in the musical sound transition acquisition procedure and the intensity at each frequency;
A specific sound transition acquisition procedure for acquiring a specific sound transition in which the corrected output sounds defined in a target track, which is one of the score tracks in the corrected score data generated in the correction procedure, transition along the time axis of the corrected score data;
A specific sound analysis procedure for deriving, for each unit time along the time axis of the corrected score data, a specific sound amplitude spectrum that represents the frequencies included in the specific sound transition acquired in the specific sound transition acquisition procedure and the intensity at each frequency, each intensity having been multiplied by the volume ratio derived in the ratio derivation procedure;
An amplitude ratio derivation procedure for deriving, for each frequency, an amplitude ratio representing the ratio between the intensity at that frequency in the musical sound amplitude spectrum derived in the musical sound analysis procedure and the intensity at that frequency in the specific sound amplitude spectrum derived in the specific sound analysis procedure;
A section transition derivation procedure for deriving a section transition, which is a transition of sound along the time axis, from a separated spectrum obtained by multiplying the intensity at each frequency of the musical sound amplitude spectrum by the amplitude ratio derived for that frequency in the amplitude ratio derivation procedure; and
A sound source separation procedure for generating a target sound transition, in which the target sound output from the sound source corresponding to the target track transitions along the time axis within the musical sound transition, by arranging the section transitions derived in the section transition derivation procedure along the time axis of the target music.
A program that causes a computer to execute the above procedures.
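The ratio derivation procedure recited in the program claim above can be sketched in a few lines. This is a minimal sketch under assumptions: the function name `volume_ratio` is hypothetical, "average amplitude" is taken to be the mean absolute sample value (a root-mean-square value would be another reasonable reading), and the ratio is taken as musical sound over corrected sound so that multiplying the synthesized spectrum by it brings the synthesized level up or down to that of the recording. The claim itself does not fix the direction of the ratio.

```python
import numpy as np

def volume_ratio(music, corrected):
    """Ratio of the average amplitude of the whole musical sound transition to
    the average amplitude of the whole corrected sound transition."""
    music_avg = np.mean(np.abs(np.asarray(music, dtype=float)))
    corrected_avg = np.mean(np.abs(np.asarray(corrected, dtype=float)))
    if corrected_avg == 0.0:
        raise ValueError("corrected sound transition is silent")
    # Assumption: ratio direction is music / corrected.
    return music_avg / corrected_avg
```

For example, a recording whose average amplitude is twice that of the sound synthesized from the corrected score data would give a ratio of 2, and the specific sound amplitude spectrum would be doubled before the amplitude ratio is derived.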