JP2020003536A

JP2020003536A - Learning device, automatic music transcription device, learning method, automatic music transcription method and program

Info

Publication number: JP2020003536A
Application number: JP2018120235A
Authority: JP
Inventors: 大輝日暮; Daiki Higure
Original assignee: Casio Computer Co Ltd
Current assignee: Casio Computer Co Ltd
Priority date: 2018-06-25
Filing date: 2018-06-25
Publication date: 2020-01-09
Also published as: JP2023081946A; JP7448053B2

Abstract

To provide acoustic processing technology for automatically generating a musical score from audio data in which the pitch and section of each sound are not clear. SOLUTION: A learning device 100 comprises: a learning data acquisition section for acquiring a single tone sound source and pitch information as learning data of a first machine learning model, acquiring a sound source of a transcription object and musical score information as learning data of a second machine learning model, performing preprocessing on the single tone sound source and the sound source of the transcription object, and acquiring respective spectrograms; a first model learning section for performing learning by the pitch information so that the spectrogram of the single tone sound source is input as learning input data and a prediction probability of the pitch of the single tone sound source is output; and a second model learning section for performing learning by the musical score information so that a feature map generated by inputting the spectrogram of the sound source of the transcription object to the learnt first machine learning model is input as learning input data and a prediction probability that notes exist in a section of fixed length in the feature map is output. SELECTED DRAWING: Figure 2

Description

The present disclosure relates to sound processing technology.

Automatic music transcription techniques that automatically generate a musical score from audio data are conventionally known. For example, JP-A-2007-033479 describes a technique for automatically transcribing a musical score from an acoustic signal played on a single musical instrument, even when multiple sounds are played at the same time.

JP-A-2007-033479

However, conventional automatic transcription achieves relatively high accuracy only for audio data that is played or sung accurately according to the score and in which the pitch and section of each sound are clear. For audio data in which the pitch or section of each sound is not clear, for example, transcription of the expected quality has been difficult.

In view of the above problem, an object of the present disclosure is to provide a sound processing technique for more effectively generating a musical score automatically from various kinds of audio data.

To solve the above problem, one aspect of the present disclosure relates to a learning device comprising: a learning data acquisition unit that acquires a single-tone sound source and pitch information as learning data for a first machine learning model, acquires a sound source to be transcribed and score information as learning data for a second machine learning model, performs preprocessing on the single-tone sound source and the sound source to be transcribed, and acquires a spectrogram of each; a first model learning unit that trains the first machine learning model with the pitch information so that it takes the spectrogram of the single-tone sound source as learning input data and outputs a prediction probability of the pitch of the single-tone sound source; and a second model learning unit that trains the second machine learning model with the score information so that it takes, as learning input data, a feature map generated by inputting the spectrogram of the sound source to be transcribed into the trained first machine learning model, and outputs a prediction probability that a note exists in a fixed-length section of the feature map.

According to the present disclosure, it is possible to provide a sound processing technique for automatically generating a musical score from audio data in which the pitch or section of each sound is not clear.

FIG. 1 is a schematic diagram illustrating an automatic transcription device having trained machine learning models according to an embodiment of the present disclosure.
FIG. 2 is a block diagram illustrating the functional configuration of a learning device according to an embodiment of the present disclosure.
FIG. 3 is a schematic diagram illustrating the configuration of a feature map generation model according to an embodiment of the present disclosure.
FIG. 4 is a schematic diagram illustrating the configuration of a note existence probability prediction model according to an embodiment of the present disclosure.
FIG. 5 is a conceptual diagram illustrating the relationship between feature maps and default boxes according to an embodiment of the present disclosure.
FIG. 6 is a flowchart illustrating the learning process of the feature map generation model according to an embodiment of the present disclosure.
FIG. 7 is a flowchart illustrating the learning process of the note existence probability prediction model according to an embodiment of the present disclosure.
FIG. 8 is a block diagram illustrating the functional configuration of an automatic transcription device according to an embodiment of the present disclosure.
FIG. 9 is a flowchart illustrating an automatic transcription process according to an embodiment of the present disclosure.
FIG. 10 is a block diagram illustrating the hardware configuration of the learning device and the automatic transcription device according to an embodiment of the present disclosure.

The following embodiments disclose an automatic transcription device that uses machine learning models to generate score information from a sound source (audio data, that is, sound waveform data).

Conventional automatic transcription techniques have focused on pitch prediction, and predicting the onsets and offsets that mark the boundaries of notes has remained one of the challenges of automatic transcription. The automatic transcription device according to the present disclosure predicts the onsets and offsets in a sound source using SSD (Single Shot Detection), which is one type of machine learning model.

SSD is a technique for detecting objects in an input image using a single neural network. The input to the neural network is an image, and the output is, for each of a plurality of rectangular regions (called default boxes in SSD), the center coordinates, height, and width, together with the prediction probability of the object type. A preset number of default boxes, determined by the size of the input image, is prepared as candidates. Post-processing (such as NMS: Non-Maximum Suppression) removes most of the default boxes from the candidates, and the remaining default boxes are taken as the detection result.

In the automatic transcription device according to the present disclosure, the input to the neural network is waveform data or a spectrogram of the musical sound to be transcribed, and the output is the onset, offset, and pitch of each musical sound. The device identifies the onset and offset (that is, the shape or length of the musical sound) in correspondence with the center coordinates and width in SSD, and identifies the pitch in correspondence with the object type in SSD.
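
The correspondence between an SSD box and a note can be illustrated with a minimal Python sketch. The class name `NoteBox`, its fields, and the use of seconds as the unit are hypothetical choices made for illustration only; the disclosure does not specify a data structure.

```python
from dataclasses import dataclass


@dataclass
class NoteBox:
    """Hypothetical 1-D analogue of an SSD box along the time axis."""
    center: float      # center of the box on the time axis (seconds, assumed unit)
    width: float       # temporal width of the box (seconds, assumed unit)
    pitch_class: int   # plays the role of the SSD object type

    @property
    def onset(self) -> float:
        # The note starts half a width before the center.
        return self.center - self.width / 2

    @property
    def offset(self) -> float:
        # The note ends half a width after the center.
        return self.center + self.width / 2


box = NoteBox(center=1.0, width=2.0, pitch_class=9)
print(box.onset, box.offset)
```

Here the box height, which SSD needs for images, has no counterpart because only the time axis is used to delimit a note.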

To outline the embodiments described below, the automatic transcription device uses two trained machine learning models (such as convolutional neural networks). One model outputs the prediction probability of a pitch from a single-tone sound source. The other outputs, from a feature map, the prediction probability that a note exists in a fixed-length section of that feature map. The device inputs the sound source to be transcribed into the former trained model (the feature map generation model), inputs each feature map generated by the convolutional layers of the trained feature map generation model into the latter trained model (the note existence probability prediction model), and generates score information based on the predicted existence probability of a note of each pitch in the fixed-length section, or default box, that the trained note existence probability prediction model outputs for each point of each feature map.

The feature maps generated by the trained feature map generation model have different temporal resolutions as a result of convolution, so the fixed-length sections, or default boxes, have different temporal lengths. Therefore, by using the note existence probability prediction model to detect, in each feature map, notes whose length equals that of the fixed-length section, the onsets and offsets of notes of different lengths can be identified.

First, an automatic transcription device according to an embodiment of the present disclosure is described with reference to FIG. 1. FIG. 1 is a schematic diagram illustrating an automatic transcription device having trained machine learning models according to an embodiment of the present disclosure.

As shown in FIG. 1, the automatic transcription device 200 according to an embodiment of the present disclosure has two types of trained models, each implemented as any type of neural network such as, without limitation, a convolutional neural network. Using the machine learning models trained by the learning device 100 with the learning data storage 50, it generates score information from the sound source to be transcribed.

Next, a learning device according to an embodiment of the present disclosure is described with reference to FIGS. 2 to 7. The learning device 100 uses the learning data in the learning data storage 50 to train the feature map generation model and the note existence probability prediction model used by the automatic transcription device 200. FIG. 2 is a block diagram illustrating the functional configuration of the learning device according to an embodiment of the present disclosure.

As shown in FIG. 2, the learning device 100 includes a learning data acquisition unit 110, a first model learning unit 120, and a second model learning unit 130.

The learning data acquisition unit 110 acquires a single-tone sound source and pitch information as learning data for the feature map generation model, acquires a sound source to be transcribed and score information as learning data for the note existence probability prediction model, performs preprocessing on the single-tone sound source and the sound source to be transcribed, and acquires a spectrogram of each.

Specifically, the learning data acquisition unit 110 acquires, from the learning data storage 50, pairs of waveform data of single-tone (single-note) sound sources for training the feature map generation model (for example, 12 types of sound sources from "do" to "si") and pitch information (for example, the pitches from "do" to "si"). It then performs preprocessing (for example, a short-time Fourier transform) on the acquired waveform data to generate a learning data set of spectrograms of the single-tone sound sources and pitch information.

The learning data acquisition unit 110 also acquires, from the learning data storage 50, pairs of waveform data of a monophonic (single-melody) sound source, such as a singing sound source, for training the note existence probability prediction model and score information (for example, the time-series change in pitch). It then performs preprocessing (for example, a short-time Fourier transform) on the acquired waveform data of the monophonic sound source to generate a learning data set of spectrograms of the monophonic sound source and score information. The score information may conform to, for example, the MIDI (Musical Instrument Digital Interface) standard.

Typically, a spectrogram represents the intensity of signal components along the time and frequency axes and is generated by applying a short-time Fourier transform to waveform data. The short-time Fourier transform requires various parameters to be set; for example, it may be performed with an FFT window width of 1024, a sampling frequency of 16 kHz, an overlap width of 768, a Hann (Hanning) window as the window function, and a mel filter bank (128 bands) as the filter bank. After conversion to a spectrogram, a fixed number of samples (for example, 1024 samples) may be extracted along the time axis. The spectrogram in this embodiment may also have a logarithmically transformed frequency axis so that low-frequency components are represented more finely.
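
As a rough illustration of the example parameters above, the following NumPy sketch computes a magnitude spectrogram with a 1024-sample Hann window and a 768-sample overlap at 16 kHz. The function name `stft_magnitude` and the 440 Hz test tone are assumptions introduced here; the 128-band mel filter bank and the logarithmic frequency axis mentioned in the text are deliberately omitted to keep the sketch short, and only the amplitude is log-compressed.

```python
import numpy as np

SR = 16_000              # sampling frequency (Hz), from the example parameters
N_FFT = 1024             # FFT window width
OVERLAP = 768            # overlap width
HOP = N_FFT - OVERLAP    # hop size implied by window and overlap: 256 samples


def stft_magnitude(wave: np.ndarray) -> np.ndarray:
    """Return a (frames, N_FFT // 2 + 1) magnitude spectrogram."""
    window = np.hanning(N_FFT)
    n_frames = 1 + (len(wave) - N_FFT) // HOP
    frames = np.stack(
        [wave[i * HOP: i * HOP + N_FFT] * window for i in range(n_frames)]
    )
    return np.abs(np.fft.rfft(frames, axis=1))


# One second of a 440 Hz sine stands in for a single-tone sound source.
t = np.arange(SR) / SR
wave = np.sin(2 * np.pi * 440 * t)

spec = stft_magnitude(wave)
log_spec = np.log1p(spec)    # simple amplitude compression (an assumption)
frame_rate = SR / HOP        # spectrogram frames per second
print(spec.shape, frame_rate)
```

With these settings the hop is 256 samples, so the spectrogram advances at 62.5 frames per second before any downsampling by the convolutional layers.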

The first model learning unit 120 trains the feature map generation model with the pitch information so that the model takes the spectrogram of a single-tone sound source as learning input data and outputs a prediction probability of the pitch of the single-tone sound source.

For example, as shown in FIG. 3, the feature map generation model is composed of a convolutional neural network including a plurality of convolutional layers and is implemented as an SSD that converts the input spectrogram of a single-tone sound source into prediction probabilities of pitch. Here, pitch is represented as a discrete value rather than a continuous value and may be represented as a one-hot vector. When unpitched, noise-like sound sources such as percussion instruments are also to be learned, single-sound or single-note audio of such sources may be included in the data set. In that case, a class representing noise may be set as one of the pitch classes and used as the teacher label.

The first model learning unit 120 inputs the spectrogram of the single-tone sound source, which is the learning input data, into the feature map generation model, and updates the parameters of the feature map generation model by backpropagation so that the error between the output of the model and the pitch information, which is the learning output data, becomes smaller. As the loss function representing this error, the cross entropy between the output of the feature map generation model and the pitch in the learning output data may be used, without limitation.
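
The cross-entropy loss between predicted pitch probabilities and a one-hot pitch label can be sketched as follows. This is a generic NumPy illustration, not the model in the disclosure: the function names, the 12-class setup, and the small `eps` added for numerical safety are assumptions.

```python
import numpy as np

N_PITCH = 12  # assumed number of pitch classes ("do" to "si")


def softmax(logits: np.ndarray) -> np.ndarray:
    # Subtract the row maximum for numerical stability.
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def cross_entropy(probs: np.ndarray, one_hot: np.ndarray, eps: float = 1e-12) -> float:
    # Mean over the batch of -sum_k y_k * log(p_k).
    return float(-(one_hot * np.log(probs + eps)).sum(axis=-1).mean())


# A model that knows nothing predicts a uniform distribution ...
uniform_loss = cross_entropy(softmax(np.zeros((1, N_PITCH))), np.eye(N_PITCH)[[0]])

# ... while a model that strongly favors the correct class has a much lower loss.
confident_logits = np.zeros((1, N_PITCH))
confident_logits[0, 0] = 10.0
confident_loss = cross_entropy(softmax(confident_logits), np.eye(N_PITCH)[[0]])
print(uniform_loss, confident_loss)
```

For a uniform prediction the loss equals the natural logarithm of the number of classes, which gives a convenient baseline when monitoring training.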

When a predetermined learning end condition is satisfied, for example, when the update process has been completed for a predetermined number of learning data items, when the error has converged to a predetermined threshold or less, or when the improvement in the error has converged to a predetermined threshold or less, the first model learning unit 120 sets the updated feature map generation model as the trained machine learning model.

The second model learning unit 130 trains the note existence probability prediction model with the score information so that the model takes, as learning input data, a feature map generated by inputting the spectrogram of the sound source to be transcribed into the trained feature map generation model, and outputs a prediction probability that a note exists in a fixed-length section of the feature map.

For example, as shown in FIG. 4, the note existence probability prediction model is composed of a convolutional neural network including a plurality of convolutional layers. It is implemented as an SSD that converts a feature map, generated by inputting the spectrogram of a monophonic sound source into the trained feature map generation model, into the prediction probability that a note exists whose length equals that of a fixed-length section starting at each point of the feature map. For example, when transcribing with the 12 tones from do to si, each point on the feature map has prediction probabilities for 13 note or pitch classes: each of the pitches from do to si, plus a rest (silence).

As described above, the trained feature map generation model includes a plurality of convolutional layers, and each convolutional layer generates a feature map of the spectrogram of the monophonic sound source. The generated feature maps differ in temporal resolution according to the level of the convolutional layer, as shown in FIG. 3. Typically, as shown in FIG. 5, a convolutional layer relatively close to the input layer generates a feature map with relatively high temporal resolution (32 Hz in the illustrated example), and a convolutional layer relatively close to the output layer generates a feature map with relatively low temporal resolution (16 Hz in the illustrated example). When a fixed-length section, or default box, is set as illustrated, a section in a feature map with relatively high temporal resolution occupies a shorter time than a section in a feature map with relatively low temporal resolution. This makes it possible to derive the predicted existence probabilities of notes with different temporal lengths and thus to identify the temporal length of each note.
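
The relationship between temporal resolution and the time covered by a fixed-length section reduces to simple arithmetic, sketched below. The section length of 8 feature-map points is a hypothetical value chosen for illustration; the 32 Hz and 16 Hz rates are the ones given for the illustrated example.

```python
def section_duration(n_points: int, rate_hz: float) -> float:
    """Time in seconds covered by a section of n_points on a feature map
    whose temporal resolution is rate_hz points per second."""
    return n_points / rate_hz


BOX_POINTS = 8  # hypothetical fixed length of a default box, in feature-map points

durations = {rate: section_duration(BOX_POINTS, rate) for rate in (32, 16)}
for rate, seconds in durations.items():
    print(f"{rate} Hz feature map: {BOX_POINTS} points cover {seconds} s")
```

Under this assumption the same box covers 0.25 s on the 32 Hz map and 0.5 s on the 16 Hz map, so shallower layers are suited to short notes and deeper layers to long ones.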

The second model learning unit 130 inputs the spectrogram of the sound source, which is the learning input data, into the trained feature map generation model, inputs each feature map generated by the trained feature map generation model into the note existence probability prediction model, and updates the parameters of the note existence probability prediction model by backpropagation so that the error between the output of the note existence probability prediction model and the score information, which is the learning output data, becomes smaller.

As the loss function representing this error, a weighted sum of a timing error and a confidence error, calculated from the output of the note existence probability prediction model and the time-series change in pitch, may be used without limitation. The time-series change in pitch is expressed as a collection of sets, each consisting of a start timing, an end timing, and a pitch in the piece of music, and is derived from the score information. Each such set may be called a sounding. For example, the time-series change in pitch may be expressed as sounding #1: "0:00 to 0:02, A3 (la)", sounding #2: "0:03 to 0:05, B3 (si)", sounding #3: "0:05 to 0:08, C4 (do)", and so on. Each default box shown in FIG. 5 represents one sounding and has a plurality of channels. The first sample of each channel of the default box holds, respectively, the predicted onset of the sounding of that default box, the predicted offset, and the predicted probability of each pitch class. That is, there are 2 + (number of pitch classes) channels in total.
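
The sounding representation and the channel count can be sketched as follows. The `Sounding` type, the conversion of the example timings to seconds, and the function `n_channels` are hypothetical names and assumptions introduced for illustration.

```python
from typing import NamedTuple


class Sounding(NamedTuple):
    """One set of start timing, end timing, and pitch derived from the score."""
    start: float   # seconds (assuming "0:02" means 2 seconds)
    end: float     # seconds
    pitch: str


# The three example soundings from the text.
soundings = [
    Sounding(0.0, 2.0, "A3"),
    Sounding(3.0, 5.0, "B3"),
    Sounding(5.0, 8.0, "C4"),
]


def n_channels(n_pitch_classes: int) -> int:
    # One channel for the predicted onset, one for the predicted offset,
    # and one per pitch class for the predicted probabilities.
    return 2 + n_pitch_classes


# With the 12 tones plus a rest class there are 13 pitch classes.
print(n_channels(13))
```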

For each sounding, the second model learning unit 130 searches for the default box that minimizes the sum of the onset and the offset, and obtains the timing error and the confidence error for the detected default box and the sounding. Here, the timing error may be the sum of the deviation of the start timing, taking the predicted onset into account, and the deviation of the end timing, taking the predicted offset into account. The differences may, however, be expressed as values relative to the length of the default box. The confidence error may be the cross entropy calculated from the pitch of the sounding and the predicted pitch. A class representing silence may also be prepared as a teacher label, in which case sections without any sounding can be predicted.
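
One possible reading of the timing error, the confidence error, and their weighted sum is sketched below. The parameterization is an assumption: the predicted onset and offset are treated as corrections, relative to the box length, applied to the start and end of the default box, and `alpha` is a hypothetical weight. The disclosure does not fix these details.

```python
import math


def timing_error(box_start, box_len, on_pred, off_pred, true_start, true_end):
    """Sum of start and end deviations, relative to the default box length."""
    pred_start = box_start + on_pred * box_len            # start shifted by predicted onset
    pred_end = box_start + box_len + off_pred * box_len   # end shifted by predicted offset
    return (abs(pred_start - true_start) + abs(pred_end - true_end)) / box_len


def confidence_error(pitch_probs, true_pitch):
    """Cross entropy against a one-hot label: -log of the true-class probability."""
    return -math.log(pitch_probs[true_pitch])


def total_loss(t_err, c_err, alpha=1.0):
    """Weighted sum of the timing error and the confidence error."""
    return t_err + alpha * c_err


# A 2-second box starting at 0 s, matched to a sounding from 0 s to 2 s.
perfect = timing_error(0.0, 2.0, 0.0, 0.0, 0.0, 2.0)
shifted = timing_error(0.0, 2.0, 0.25, 0.0, 0.0, 2.0)   # start predicted 0.5 s late
loss = total_loss(shifted, confidence_error([0.5, 0.5], 0), alpha=2.0)
print(perfect, shifted, loss)
```

Expressing the deviations relative to the box length keeps the timing error comparable across feature maps whose boxes span different durations.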

The second model learning unit 130 may reduce the default boxes set for each point of each feature map according to NMS (Non-Maximum Suppression) and take the remaining default boxes as the predicted soundings. Specifically, the second model learning unit 130 first obtains, for each default box, the predicted note existence probability for each pitch class. It may then delete default boxes whose prediction probability is at or below a predetermined threshold (for example, 0.9). Among the remaining default boxes, the second model learning unit 130 sets a threshold on the ratio of intersection to union (intersection over union), deletes one of any two default boxes whose ratio is at or above the threshold, and thereby eliminates duplicate default boxes. The second model learning unit 130 takes the default boxes that finally remain as the predicted soundings.
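
The NMS procedure described above can be sketched for boxes on the time axis as follows. Representing a box as a `(start, end, score)` tuple, the IoU threshold of 0.5, and keeping the higher-scoring box of an overlapping pair are assumptions; the text only states that one of the two boxes is deleted.

```python
def iou_1d(a, b):
    """Intersection over union of two time intervals (start, end, ...)."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def nms_1d(boxes, score_thresh=0.9, iou_thresh=0.5):
    """boxes: list of (start, end, score). Returns surviving boxes sorted by start."""
    # Step 1: delete boxes whose probability is at or below the score threshold.
    candidates = sorted(
        (b for b in boxes if b[2] > score_thresh), key=lambda b: b[2], reverse=True
    )
    # Step 2: visit boxes from highest score down and drop any box that
    # overlaps an already kept box with IoU at or above the threshold.
    kept = []
    for box in candidates:
        if all(iou_1d(box, k) < iou_thresh for k in kept):
            kept.append(box)
    return sorted(kept)


boxes = [
    (0.0, 2.0, 0.95),
    (0.1, 2.1, 0.92),   # near-duplicate of the first box
    (3.0, 5.0, 0.97),
    (3.0, 5.0, 0.50),   # below the score threshold
]
predicted = nms_1d(boxes)
print(predicted)
```

In this example the low-probability box is removed by the score threshold and the near-duplicate is removed by the overlap test, leaving one box per sounding.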

When a predetermined learning end condition is satisfied, for example, when the update process has been completed for a predetermined number of learning data items, when the error has converged to a predetermined threshold or less, or when the improvement in the error has converged to a predetermined threshold or less, the second model learning unit 130 sets the updated note existence probability prediction model as the trained model.

In one embodiment, the first model learning unit 120 may train a feature map generation model for each of a plurality of types of audio components, and the second model learning unit 130 may train the note existence probability prediction model so that, for a sound source to be transcribed that contains the plurality of types of audio components, it outputs the prediction probability that a note exists for each audio component type.

For example, the feature map generation model and the note existence probability prediction model may be applied to music that contains a monophonic vocal and an accompaniment. In this case, a vocal feature map generation model and an accompaniment feature map generation model are trained in the same manner as the learning process described above, using vocal learning data composed of pairs of a vocal single-tone sound source and pitch information, and accompaniment learning data composed of pairs of an accompaniment single-tone sound source and pitch information. A vocal note existence probability prediction model and an accompaniment note existence probability prediction model are then trained in the same manner as the learning process described above, using a learning sound source and score information, with the feature maps generated by inputting the sound source into the trained vocal feature map generation model and the trained accompaniment feature map generation model as input.

Alternatively, the feature map generation model and the note existence probability prediction model may be applied to music that contains a plurality of parts, such as one part per instrument. The learning process is the same as that for music containing a vocal and an accompaniment described above, but in this case the output of the note existence probability prediction model may be the prediction probability that a specific note of a specific part exists in a fixed-length section of the feature map. For example, the model may output the prediction probability of the existence of a specific note of a specific part, such as "the pitch A3 in a male voice" or "the pitch A3 in a female voice".

Alternatively, the present disclosure may be applied to music that has a meter. In this case, the output of the note existence probability prediction model may relate to the onsets and offsets of beats; for example, the prediction probability that a default box corresponds to one beat may be output.

図６は、本開示の一実施例による特徴マップ生成モデルの学習処理を示すフローチャートである。当該学習処理は、上述した学習装置１００又は学習装置１００のプロセッサによって実現される。 FIG. 6 is a flowchart illustrating a learning process of a feature map generation model according to an embodiment of the present disclosure. The learning process is realized by the learning device 100 or the processor of the learning device 100 described above.

図６に示されるように、ステップＳ１０１において、学習用データ取得部１１０は、学習用データストレージ５０から単音音源と音高情報とのペアを取得する。例えば、音高は、「ド」から「シ」の１２音と無音との１３通りであり、当該１３通りの音高に対応する単音音源が取得されてもよい。 As shown in FIG. 6, in step S101, the learning data acquisition unit 110 acquires a pair of a single sound source and pitch information from the learning data storage 50. For example, there are 13 pitches of 12 pitches from “do” to “shi” and no sound, and a single tone sound source corresponding to the 13 pitches may be acquired.

ステップＳ１０２において、学習用データ取得部１１０は、取得した単音音源を前処理する。具体的には、学習用データ取得部１１０は、単音音源の波形データに対して前処理（例えば、短時間フーリエ変換など）を実行し、単音音源のスペクトログラムを取得する。 In step S102, the learning data acquisition unit 110 pre-processes the acquired single sound source. Specifically, the learning data acquisition unit 110 performs preprocessing (for example, short-time Fourier transform) on the waveform data of the single sound source, and acquires a spectrogram of the single sound source.

ステップＳ１０３において、第１モデル学習部１２０は、前処理された単音音源と音高情報とのペアによって特徴マップ生成モデルを学習する。例えば、特徴マップ生成モデルは、畳み込みニューラルネットワークにより構成され、入力音源を音高の予測確率に変換する。具体的には、第１モデル学習部１２０は、単音音源のスペクトログラムを特徴マップ生成モデルに入力し、特徴マップ生成モデルの出力と音高情報との誤差が小さくなるように、バックプロパゲーションによって特徴マップ生成モデルのパラメータを更新する。 In step S103, the first model learning unit 120 learns a feature map generation model based on a pair of a preprocessed single sound source and pitch information. For example, the feature map generation model is formed by a convolutional neural network, and converts an input sound source into a pitch prediction probability. Specifically, the first model learning unit 120 inputs the spectrogram of the single sound source to the feature map generation model, and performs the backpropagation so as to reduce the error between the output of the feature map generation model and the pitch information. Update the parameters of the map generation model.

In step S104, the first model learning unit 120 determines whether a predetermined learning end condition is satisfied. The condition may be, for example, that the update process has been completed for a predetermined number of learning data items, that the error has converged to or below a predetermined threshold, or that the improvement in the error has converged to or below a predetermined threshold. If the condition is satisfied (S104: YES), the first model learning unit 120 may set the updated feature map generation model as the trained model. If the condition is not satisfied (S104: NO), the process returns to step S101 and the above steps are repeated.
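The three example end conditions can be combined into a single check, sketched below. The function name `should_stop` and every default threshold are hypothetical; the disclosure only states that such thresholds are predetermined.

```python
def should_stop(num_updates, losses, max_updates=1000,
                loss_threshold=0.01, improvement_threshold=1e-4):
    """Return True when any of the three example end conditions holds.

    num_updates: number of learning data items processed so far.
    losses: history of error values, most recent last.
    """
    # Condition 1: a predetermined number of items has been processed.
    if num_updates >= max_updates:
        return True
    # Condition 2: the error is at or below its threshold.
    if losses and losses[-1] <= loss_threshold:
        return True
    # Condition 3: the latest improvement is at or below its threshold
    # (this also triggers when the error got worse).
    if len(losses) >= 2 and losses[-2] - losses[-1] <= improvement_threshold:
        return True
    return False
```

A training loop would call this after each update and return to data acquisition while it yields False. The same check applies unchanged to the second model described later.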

FIG. 7 is a flowchart illustrating the learning process for the note existence probability prediction model according to an embodiment of the present disclosure. The learning process is carried out by the learning device 100 described above or by a processor of the learning device 100.

As shown in FIG. 7, in step S201, the learning data acquisition unit 110 acquires a pair of a monophonic sound source and score information from the learning data storage 50. For example, the monophonic sound source may be waveform data of a singing voice, and the score information represents the musical score of that monophonic sound source.

In step S202, the learning data acquisition unit 110 preprocesses the acquired monophonic sound source. Specifically, the learning data acquisition unit 110 applies preprocessing (for example, a short-time Fourier transform) to the waveform data of the monophonic sound source and obtains its spectrogram.

In step S203, the second model learning unit 130 inputs the preprocessed monophonic sound source to the trained feature map generation model and acquires the feature maps that the model generates. Specifically, the second model learning unit 130 acquires the feature map produced by each convolutional layer of the trained feature map generation model. The resulting feature maps have different temporal resolutions depending on how far each convolutional layer has convolved the input.
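How successive layers yield feature maps of different temporal resolution can be sketched as below. Averaging adjacent time steps is used here purely as a stand-in for a strided convolution; the function names, the stride of 2, and the three layers are assumptions, not details from the disclosure.

```python
def downsample_time(feature_map, stride=2):
    """Average groups of adjacent time steps (stand-in for a strided conv)."""
    out = []
    for t in range(0, len(feature_map) - stride + 1, stride):
        group = feature_map[t:t + stride]
        # zip(*group) iterates over feature channels across the grouped steps.
        out.append([sum(col) / stride for col in zip(*group)])
    return out


def feature_pyramid(spectrogram, num_layers=3):
    """Return one feature map per layer, each coarser in time than the last."""
    maps = []
    current = spectrogram
    for _ in range(num_layers):
        current = downsample_time(current)
        maps.append(current)
    return maps


# Hypothetical spectrogram: 16 time steps x 2 frequency bins.
spectrogram = [[float(t), 1.0] for t in range(16)]
pyramid = feature_pyramid(spectrogram)
lengths = [len(m) for m in pyramid]  # time steps per layer: 8, 4, 2
```

Each point of a deeper map summarizes a longer stretch of audio, which is what lets a fixed-length section on a coarse map correspond to a longer note.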

In step S204, the second model learning unit 130 trains the note existence probability prediction model on pairs of an acquired feature map and score information. For example, the note existence probability prediction model is a convolutional neural network that converts an input feature map into note existence prediction probabilities, that is, the predicted probability that a note exists in each fixed-length section of the feature map. Specifically, the second model learning unit 130 inputs each feature map to the note existence probability prediction model and updates the parameters of the model by backpropagation so that the error between the model output and the score information decreases.

In step S205, the second model learning unit 130 determines whether a predetermined learning end condition is satisfied. The condition may be, for example, that the update process has been completed for a predetermined number of learning data items, that the error has converged to or below a predetermined threshold, or that the improvement in the error has converged to or below a predetermined threshold. If the condition is satisfied (S205: YES), the second model learning unit 130 may set the updated note existence probability prediction model as the trained model. If the condition is not satisfied (S205: NO), the process returns to step S201 and the above steps are repeated.

Next, an automatic transcription device according to an embodiment of the present disclosure is described with reference to FIGS. 8 and 9. FIG. 8 is a block diagram illustrating the functional configuration of the automatic transcription device according to an embodiment of the present disclosure.

As shown in FIG. 8, the automatic transcription device 200 includes a model processing unit 210 and a score generation unit 220.

The model processing unit 210 uses a trained feature map generation model, which outputs predicted pitch probabilities from a single-tone sound source, and a trained note existence probability prediction model, which outputs from a feature map the predicted probability that a note exists in a fixed-length section of that feature map. The model processing unit 210 inputs the sound source to be transcribed to the trained feature map generation model, inputs the feature map generated by that model to the trained note existence probability prediction model, and outputs the predicted probability that a note exists in each fixed-length section of the feature map.

Specifically, the model processing unit 210 applies preprocessing such as a short-time Fourier transform to the sound source to be transcribed to obtain its spectrogram, inputs the spectrogram to the feature map generation model trained by the learning device 100, and acquires a feature map from each convolutional layer of that model. The model processing unit 210 then inputs each acquired feature map to the note existence probability prediction model trained by the learning device 100, obtains the predicted probability that a note exists whose length equals that of the fixed-length section, or default box, starting at each point of the input feature map, and passes the note existence prediction probabilities for each feature map to the score generation unit 220. For example, the note existence prediction probability is the predicted probability of each pitch (for example, "do", "re", ..., "si", or silence) within a default box of the feature map, and a pitch with a high predicted probability can be judged to correspond to a note of the corresponding duration.
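The relationship between points on a feature map and default boxes can be sketched as a mapping from each point to a time interval. The function name `default_boxes`, the step durations, and the box length of two steps are hypothetical values chosen for illustration; the disclosure does not give concrete numbers.

```python
def default_boxes(num_points, step_seconds, box_steps):
    """Return the (start, end) time in seconds of the box at each point.

    num_points: number of time steps in the feature map.
    step_seconds: audio duration covered by one feature-map step (assumed).
    box_steps: fixed box length, measured in feature-map steps (assumed).
    """
    boxes = []
    for point in range(num_points):
        start = point * step_seconds
        boxes.append((start, start + box_steps * step_seconds))
    return boxes


# A fine feature map (0.25 s per step) and a coarser one (0.5 s per step),
# both using boxes that are two steps long.
fine = default_boxes(num_points=4, step_seconds=0.25, box_steps=2)
coarse = default_boxes(num_points=2, step_seconds=0.5, box_steps=2)
```

Under these assumptions the fine map proposes 0.5 s notes and the coarse map proposes 1.0 s notes, so the same box length in steps covers different note durations at different temporal resolutions.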

The score generation unit 220 generates score information based on the predicted probabilities that notes exist. Specifically, the score generation unit 220 post-processes the output of the trained note existence probability prediction model using NMS (Non-Maximum Suppression) as used in SSD. The trained note existence probability prediction model typically outputs a large number of predicted note candidates, from which the predicted notes must be selected, and SSD commonly uses NMS to narrow down such candidates.

For example, the score generation unit 220 first takes, for each point on the feature map input to the trained note existence probability prediction model, the note with the highest note existence prediction probability as the predicted note at that time. Having determined a predicted note for each point on the feature map, the score generation unit 220 builds a list of data sets, each consisting of a point, its predicted note, and the corresponding note existence prediction probability, and sorts the list in descending order of probability. The score generation unit 220 then applies predetermined extraction conditions to narrow down the predicted note candidates in the list. For example, it may delete from the list any data set whose note existence prediction probability is at or below a predetermined threshold (for example, 0.9). In addition, to eliminate duplicate detections of the same note, when a data set with the same predicted note and a degree of overlap at or above a predetermined threshold (for example, 80%) appears higher in the list, only the higher-ranked data set may be kept. The score generation unit 220 generates the score from the data sets in the final list.
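The narrowing procedure described above can be sketched as a one-dimensional non-maximum suppression over time intervals. The disclosure does not define how the degree of overlap is measured, so intersection over union of the two intervals is assumed here; the function names, the tuple layout, and the sample candidates are likewise hypothetical.

```python
def overlap_ratio(a, b):
    """Intersection over union of two (start, end) time intervals."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def select_notes(candidates, prob_threshold=0.9, overlap_threshold=0.8):
    """Filter (start, end, pitch, probability) tuples; return by start time."""
    # Sort in descending order of probability, as in the description.
    ranked = sorted(candidates, key=lambda c: c[3], reverse=True)
    kept = []
    for start, end, pitch, prob in ranked:
        if prob <= prob_threshold:
            continue  # at or below the probability threshold: dropped
        # Duplicate if a higher-ranked kept note has the same pitch and
        # overlaps this one by at least the overlap threshold.
        duplicate = any(
            k[2] == pitch
            and overlap_ratio((k[0], k[1]), (start, end)) >= overlap_threshold
            for k in kept
        )
        if not duplicate:
            kept.append((start, end, pitch, prob))
    return sorted(kept, key=lambda c: c[0])


candidates = [
    (0.0, 1.0, "do", 0.97),
    (0.125, 1.0, "do", 0.93),  # overlaps the first "do" by 87.5%: suppressed
    (1.0, 2.0, "mi", 0.95),
    (1.0, 2.0, "re", 0.50),    # probability too low: removed
    (2.0, 3.0, "do", 0.92),    # same pitch but no overlap: kept
]
notes = select_notes(candidates)
```

Comparing each candidate only against already kept, higher-ranked entries is the usual greedy form of NMS; whether the embodiment compares against kept entries or against all higher-ranked entries is not stated in the text.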

FIG. 9 is a flowchart illustrating the automatic transcription process according to an embodiment of the present disclosure. The automatic transcription process is carried out by the automatic transcription device 200 described above or by a processor of the automatic transcription device 200.

As shown in FIG. 9, in step S301, the model processing unit 210 acquires the sound source to be transcribed. For example, the sound source may be a monophonic sound source, or it may include a plurality of types of audio components.

In step S302, the model processing unit 210 preprocesses the acquired sound source. Specifically, the model processing unit 210 applies preprocessing such as a short-time Fourier transform to the acquired sound source and obtains its spectrogram.

In step S303, the model processing unit 210 inputs the preprocessed sound source to the trained feature map generation model to obtain feature maps, inputs the obtained feature maps to the trained note existence probability prediction model, and obtains the predicted probability that a note exists whose length equals that of the fixed-length section, or default box, starting at each point of the input feature map.

In step S304, the score generation unit 220 determines a predicted note for each point on the feature map based on the note existence prediction probabilities obtained for that point. Specifically, the score generation unit 220 selects, for each point, the note with the highest of the note existence prediction probabilities obtained for that point as the predicted note for the point.
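Step S304 amounts to taking the highest-probability class at each point, as in the following sketch. The letter names in `PITCH_NAMES` are an assumed labelling of the 12 tones plus silence, and `predict_notes` and the sample probabilities are hypothetical.

```python
# Assumed labels for the 13 classes: 12 tones from "do" (C) to "si" (B),
# plus silence.
PITCH_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B",
               "silence"]


def predict_notes(probabilities):
    """Pick the most probable class at each feature-map point.

    probabilities: one list of 13 class probabilities per point.
    Returns (point index, pitch name, probability) tuples.
    """
    predictions = []
    for point, probs in enumerate(probabilities):
        best = max(range(len(probs)), key=lambda c: probs[c])
        predictions.append((point, PITCH_NAMES[best], probs[best]))
    return predictions


# Hypothetical model output for two points.
row0 = [0.0] * 13
row0[0], row0[12] = 0.95, 0.05   # mostly "C"
row1 = [0.0] * 13
row1[4], row1[12] = 0.6, 0.4     # mostly "E"
predicted = predict_notes([row0, row1])
```

The resulting tuples have the shape of the data sets that the subsequent post-processing sorts and filters.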

In step S305, the score generation unit 220 post-processes the predicted notes determined for the points of the feature map. Specifically, the score generation unit 220 narrows down the predicted notes of the points of the feature map using NMS as in SSD. For example, based on the predicted note determined for each point on the feature map, the score generation unit 220 may build a list of data sets, each consisting of a point, its predicted note, and the corresponding note existence prediction probability, sort the list in descending order of probability, and delete from the list any data set whose note existence prediction probability is at or below a predetermined threshold (for example, 0.9). In addition, when a data set with the same predicted note and a degree of overlap at or above a predetermined threshold (for example, 80%) appears higher in the list, only the higher-ranked data set may be kept.

In step S306, the score generation unit 220 generates the score from the data sets in the final list.

The learning device 100 and the automatic transcription device 200 described above may each have, for example, the hardware configuration shown in FIG. 10, comprising a CPU (Central Processing Unit) 101, a GPU (Graphics Processing Unit) 102, a RAM (Random Access Memory) 103, a communication interface (IF) 104, a hard disk 105, an input device 106, and an output device 107. The CPU 101 and the GPU 102 may be referred to as processors or processing circuits and execute the various processes of the learning device 100 and the automatic transcription device 200. In particular, the CPU 101 controls the execution of the various processes in the learning device 100 and the automatic transcription device 200, and the GPU 102 executes the processes for training and running the machine learning models. The RAM 103 and the hard disk 105 function as memories that store the data and programs of the learning device 100 and the automatic transcription device 200. In particular, the RAM 103 functions as a working memory that stores working data for the CPU 101 and the GPU 102, and the hard disk 105 stores the control programs for the CPU 101 and the GPU 102 and/or the learning data. The communication IF 104 is a communication interface for acquiring learning data from the learning data storage 50. The input device 106 comprises devices for inputting information and data (for example, a display, a speaker, a keyboard, or a touch screen), and the output device 107 comprises devices for presenting information such as the content, progress, and results of processing (for example, a display, a printer, or a speaker). However, the learning device 100 and the automatic transcription device 200 according to the present disclosure are not limited to the hardware configuration described above and may have any other suitable hardware configuration.

In one aspect of the present disclosure, there is provided a learning device comprising:
a learning data acquisition unit that acquires a single-tone sound source and pitch information as learning data for a first machine learning model, acquires a sound source to be transcribed and score information as learning data for a second machine learning model, performs preprocessing on the single-tone sound source and the sound source to be transcribed, and obtains their respective spectrograms;
a first model learning unit that trains the first machine learning model with the pitch information so that the model takes the spectrogram of the single-tone sound source as learning input data and outputs predicted probabilities of the pitch of the single-tone sound source; and
a second model learning unit that trains the second machine learning model with the score information so that the model takes, as learning input data, a feature map generated by inputting the spectrogram of the sound source to be transcribed to the trained first machine learning model and outputs the predicted probability that a note exists in a fixed-length section of the feature map.

In one embodiment, the first machine learning model and the second machine learning model may be configured as convolutional neural networks.

In one embodiment, the second model learning unit may input, to the second machine learning model, a plurality of feature maps that are generated by the first machine learning model and have different temporal resolutions.

In one embodiment, the second model learning unit may implement the first machine learning model and the second machine learning model as an SSD (Single Shot Detection).

In one embodiment, the first model learning unit may train the first machine learning model for each of a plurality of types of audio components, and the second model learning unit may train the second machine learning model so that, for a sound source to be transcribed that includes a plurality of types of audio components, the model outputs the predicted probability that a note exists for each audio component type.

In one aspect of the present disclosure, there is provided an automatic transcription device comprising:
a model processing unit that uses a first trained machine learning model, which outputs predicted pitch probabilities from a single-tone sound source, and a second trained machine learning model, which outputs from a feature map the predicted probability that a note exists in a fixed-length section of the feature map, and that inputs a sound source to be transcribed to the first trained machine learning model, inputs the feature map generated by the first trained machine learning model to the second trained machine learning model, and outputs the predicted probability that a note exists in a fixed-length section of the feature map; and
a score generation unit that generates score information based on the predicted probability that the note exists.

In one embodiment, the model processing unit may obtain a spectrogram by performing preprocessing on the sound source to be transcribed and may input the spectrogram to the first trained machine learning model.

In one embodiment, the model processing unit may determine, for each point on the feature map, the note with the highest predicted probability output from the second trained machine learning model as the predicted note.

In one embodiment, the score generation unit may generate the score information based on predicted notes extracted according to NMS (Non-Maximum Suppression).

In one aspect of the present disclosure, there is provided a learning method comprising:
a step in which a processor acquires a single-tone sound source and pitch information as learning data for a first machine learning model, acquires a sound source to be transcribed and score information as learning data for a second machine learning model, performs preprocessing on the single-tone sound source and the sound source to be transcribed, and obtains their respective spectrograms;
a step in which the processor trains the first machine learning model with the pitch information so that the model takes the spectrogram of the single-tone sound source as learning input data and outputs predicted probabilities of the pitch of the single-tone sound source; and
a step in which the processor trains the second machine learning model with the score information so that the model takes, as learning input data, a feature map generated by inputting the spectrogram of the sound source to be transcribed to the trained first machine learning model and outputs the predicted probability that a note exists in a fixed-length section of the feature map.

In one aspect of the present disclosure, there is provided an automatic transcription method comprising:
a step in which a processor inputs a sound source to be transcribed to a first trained machine learning model that outputs predicted pitch probabilities from a single-tone sound source;
a step in which the processor inputs the feature map generated by the first trained machine learning model to a second trained machine learning model that outputs from a feature map the predicted probability that a note exists in a fixed-length section of the feature map; and
a step in which the processor generates score information based on the predicted probability, output from the second trained machine learning model, that the note exists.

In one aspect of the present disclosure, there is provided a program that causes a processor to execute:
a step of acquiring a single-tone sound source and pitch information as learning data for a first machine learning model, acquiring a sound source to be transcribed and score information as learning data for a second machine learning model, performing preprocessing on the single-tone sound source and the sound source to be transcribed, and obtaining their respective spectrograms;
a step of training the first machine learning model with the pitch information so that the model takes the spectrogram of the single-tone sound source as learning input data and outputs predicted probabilities of the pitch of the single-tone sound source; and
a step of training the second machine learning model with the score information so that the model takes, as learning input data, a feature map generated by inputting the spectrogram of the sound source to be transcribed to the trained first machine learning model and outputs the predicted probability that a note exists in a fixed-length section of the feature map.

In one aspect of the present disclosure, there is provided a program that causes a processor to execute:
a step of inputting a sound source to be transcribed to a first trained machine learning model that outputs predicted pitch probabilities from a single-tone sound source;
a step of inputting the feature map generated by the first trained machine learning model to a second trained machine learning model that outputs from a feature map the predicted probability that a note exists in a fixed-length section of the feature map; and
a step of generating score information based on the predicted probability, output from the second trained machine learning model, that the note exists.

In one aspect of the present disclosure, there is provided a computer-readable storage medium storing the program described above.

Although embodiments of the present disclosure have been described in detail above, the present disclosure is not limited to the specific embodiments described, and various modifications and changes are possible within the scope of the gist of the present disclosure as set forth in the claims.

50 Learning data storage
100 Learning device
200 Automatic transcription device

Claims

A learning device comprising:
a learning data acquisition unit that acquires a single-tone sound source and pitch information as learning data for a first machine learning model, acquires a sound source to be transcribed and score information as learning data for a second machine learning model, performs preprocessing on the single-tone sound source and the sound source to be transcribed, and obtains their respective spectrograms;
a first model learning unit that trains the first machine learning model with the pitch information so that the model takes the spectrogram of the single-tone sound source as learning input data and outputs predicted probabilities of the pitch of the single-tone sound source; and
a second model learning unit that trains the second machine learning model with the score information so that the model takes, as learning input data, a feature map generated by inputting the spectrogram of the sound source to be transcribed to the trained first machine learning model and outputs the predicted probability that a note exists in a fixed-length section of the feature map.

The learning device according to claim 1, wherein the first machine learning model and the second machine learning model are configured as convolutional neural networks.

The learning device according to claim 2, wherein the second model learning unit inputs a plurality of feature maps having different temporal resolutions generated by the first machine learning model to the second machine learning model.

The learning device according to claim 1, wherein the second model learning unit implements the first machine learning model and the second machine learning model as an SSD (Single Shot Detection).

The learning device according to any one of claims 1 to 4, wherein the first model learning unit trains the first machine learning model for each of a plurality of types of audio components, and
the second model learning unit trains the second machine learning model so that, for a sound source to be transcribed that includes a plurality of types of audio components, the model outputs the predicted probability that a note exists for each audio component type.

An automatic transcription device comprising:
a model processing unit that uses a first trained machine learning model, which outputs predicted pitch probabilities from a single-tone sound source, and a second trained machine learning model, which outputs from a feature map the predicted probability that a note exists in a fixed-length section of the feature map, and that inputs a sound source to be transcribed to the first trained machine learning model, inputs the feature map generated by the first trained machine learning model to the second trained machine learning model, and outputs the predicted probability that a note exists in a fixed-length section of the feature map; and
a score generation unit that generates score information based on the predicted probability that the note exists.

The automatic transcription device according to claim 6, wherein the model processing unit obtains a spectrogram by performing preprocessing on the sound source to be transcribed and inputs the spectrogram to the first trained machine learning model.

The automatic transcription device according to claim 6, wherein the model processing unit determines, for each point on the feature map, the note with the highest predicted probability output from the second trained machine learning model as the predicted note.

The automatic transcription device according to claim 8, wherein the score generation unit generates the score information based on predicted notes extracted according to NMS (Non-Maximum Suppression).

A learning method comprising:
a step in which a processor acquires a single-tone sound source and pitch information as learning data for a first machine learning model, acquires a sound source to be transcribed and score information as learning data for a second machine learning model, performs preprocessing on the single-tone sound source and the sound source to be transcribed, and obtains their respective spectrograms;
a step in which the processor trains the first machine learning model with the pitch information so that the model takes the spectrogram of the single-tone sound source as learning input data and outputs predicted probabilities of the pitch of the single-tone sound source; and
a step in which the processor trains the second machine learning model with the score information so that the model takes, as learning input data, a feature map generated by inputting the spectrogram of the sound source to be transcribed to the trained first machine learning model and outputs the predicted probability that a note exists in a fixed-length section of the feature map.

An automatic transcription method comprising:
a step in which a processor inputs a sound source to be transcribed to a first trained machine learning model that outputs predicted pitch probabilities from a single-tone sound source;
a step in which the processor inputs the feature map generated by the first trained machine learning model to a second trained machine learning model that outputs from a feature map the predicted probability that a note exists in a fixed-length section of the feature map; and
a step in which the processor generates score information based on the predicted probability, output from the second trained machine learning model, that the note exists.

Obtaining a single-tone sound source and pitch information as learning data of a first machine learning model, obtaining a sound source to be transcribed and musical score information as learning data of a second machine learning model, performing preprocessing on the single-tone sound source and the sound source to be transcribed, and obtaining respective spectrograms;
Training the first machine learning model with the pitch information so that the spectrogram of the single-tone sound source is input as learning input data and a prediction probability of the pitch of the single-tone sound source is output;
Training the second machine learning model with the musical score information so that a feature map, generated by inputting the spectrogram of the sound source to be transcribed to the learned first machine learning model, is input as learning input data and a prediction probability that a note exists in a fixed-length section of the feature map is output;
A program that causes a processor to execute the above steps.

Inputting a sound source to be transcribed to a first learned machine learning model that outputs a prediction probability of a pitch from a single-tone sound source;
Inputting a feature map generated by the first learned machine learning model to a second learned machine learning model that outputs, from the feature map, a prediction probability that a note exists in a fixed-length section of the feature map;
Generating musical score information based on the prediction probability, output from the second learned machine learning model, that the note exists;
A program that causes a processor to execute the above steps.