JP2023081946A

JP2023081946A - Learning device, automatic music transcription device, learning method, automatic music transcription method and program

Info

Publication number: JP2023081946A
Application number: JP2023032348A
Authority: JP
Inventors: 大輝日暮; Daiki Higure
Original assignee: Casio Computer Co Ltd
Current assignee: Casio Computer Co Ltd
Priority date: 2018-06-25
Filing date: 2023-03-03
Publication date: 2023-06-13
Anticipated expiration: 2038-06-25
Also published as: JP7448053B2; JP2020003536A

Abstract

To provide acoustic processing technology for automatically generating a musical score from audio data where a pitch and a section of each sound are not clear. SOLUTION: A learning device 100 comprises: a learning data acquisition section for acquiring a single tone sound source and pitch information as learning data of a first machine learning model, acquiring a sound source of a transcription object and musical score information as learning data of a second machine learning model, performing preprocessing on the single tone sound source and the sound source of the transcription object, and acquiring respective spectrograms; a first model learning section for performing learning by pitch information so that the spectrogram of the single tone sound source is input as learning input data and prediction probability of the pitch of the single tone sound source is output; and a second model learning section for performing learning by musical score information so that a feature map generated by inputting the spectrogram of the sound source of the transcription object to the learnt first machine learning model is input as learning input data and prediction probability that notes exist in the section of fixed length in the feature map is output. SELECTED DRAWING: Figure 2

Description

The present disclosure relates to acoustic processing technology.

Automatic transcription technology that automatically generates a musical score from audio data is conventionally known. For example, Japanese Unexamined Patent Application Publication No. 2007-033479 describes a technique for automatically transcribing a musical score from an acoustic signal played on a single musical instrument, even when multiple notes are played simultaneously.

Japanese Unexamined Patent Application Publication No. 2007-033479

However, conventional automatic transcription achieves relatively accurate results only for audio data that is played or sung accurately according to the score and in which the pitch and section of each note are clear. For audio data in which the pitch and section of each note are not clear, for example, it has been difficult to obtain the expected transcription.

In view of the above problem, an object of the present disclosure is to provide acoustic processing technology for automatically generating musical scores more effectively from a variety of audio data.

To solve the above problem, one aspect of the present disclosure relates to a learning device having a first model learning unit and a second model learning unit. The first model learning unit trains a first machine learning model on teacher data that pairs a first spectrogram generated from a single-tone sound source with the corresponding pitch information, so that the model outputs, in response to the input of a spectrogram, a first feature map indicating the predicted probability of the corresponding pitch. The second model learning unit trains a second machine learning model on teacher data that pairs the first feature map output by the first model learning unit with musical score information, so that the model outputs information for generating a musical score in response to the input of the first feature map, which indicates the predicted pitch probability and is output when a second spectrogram generated from the sound source to be transcribed is input to the first machine learning model.

According to the present disclosure, it is possible to provide acoustic processing technology for automatically generating a musical score from audio data in which the pitch and section of each sound are unclear.

FIG. 1 is a schematic diagram showing an automatic music transcription device having trained machine learning models according to one embodiment of the present disclosure.
FIG. 2 is a block diagram showing the functional configuration of a learning device according to one embodiment of the present disclosure.
FIG. 3 is a schematic diagram showing the configuration of a feature map generation model according to one embodiment of the present disclosure.
FIG. 4 is a schematic diagram showing the configuration of a note existence probability prediction model according to one embodiment of the present disclosure.
FIG. 5 is a conceptual diagram showing the relationship between feature maps and default boxes according to one embodiment of the present disclosure.
FIG. 6 is a flowchart showing the learning process of the feature map generation model according to one embodiment of the present disclosure.
FIG. 7 is a flowchart showing the learning process of the note existence probability prediction model according to one embodiment of the present disclosure.
FIG. 8 is a block diagram showing the functional configuration of an automatic music transcription device according to one embodiment of the present disclosure.
FIG. 9 is a flowchart showing an automatic music transcription process according to one embodiment of the present disclosure.
FIG. 10 is a block diagram showing the hardware configuration of the learning device and the automatic music transcription device according to one embodiment of the present disclosure.

The following embodiments disclose an automatic music transcription device that uses machine learning models to generate musical score information from a sound source (audio data, that is, sound waveform data).

Conventional automatic transcription techniques have focused on pitch prediction, and predicting the onsets and offsets that mark the boundaries between notes has remained one of the challenges of automatic transcription. The automatic music transcription device according to the present disclosure predicts the onsets and offsets in a sound source using SSD (Single Shot Detection), a type of machine learning model.

SSD is a technique for detecting objects in an input image using a single neural network. The input to the neural network is an image, and its output is, for each of a plurality of rectangular regions (called default boxes in SSD), the center coordinates, height, width, and predicted probability of each object type. A preset number of default boxes, determined by the size of the input image, are prepared as candidates. Post-processing such as NMS (Non-Maximum Suppression) removes most of them from the candidates, and the remaining default boxes are taken as the detection result.

The input to the neural network in the automatic music transcription device according to the present disclosure is the waveform data or spectrogram of the musical sound to be transcribed, and its output is the onset, offset, and pitch of the musical sound. The device identifies the onset and offset (that is, the shape or length of the musical sound) in correspondence with the center coordinates and width in SSD, and identifies the pitch in correspondence with the object type in SSD.
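The correspondence between an SSD-style box and a note can be illustrated with a short sketch. The `NoteBox` type and `box_to_onset_offset` function are hypothetical names introduced here for illustration; the disclosure does not specify a data layout, and the sketch assumes the box is centered on the note along the time axis.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class NoteBox:
    """Hypothetical one-dimensional analogue of an SSD box on the time axis."""
    center: float      # center of the box, in seconds
    width: float       # temporal width of the box, in seconds
    pitch_class: int   # index of the most probable pitch class (the "object type")


def box_to_onset_offset(box: NoteBox) -> Tuple[float, float]:
    """Convert a (center, width) box into (onset, offset) times in seconds."""
    half = box.width / 2.0
    return box.center - half, box.center + half


box = NoteBox(center=1.0, width=0.5, pitch_class=9)
onset, offset = box_to_onset_offset(box)
print(onset, offset)
```

In this reading, the box height used in image detection has no counterpart, because a note occupies only an interval on the time axis.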

In outline, the embodiments described below use two trained machine learning models (for example, convolutional neural networks). One model outputs the predicted probability of a pitch from a single-tone sound source, and the other outputs, from a feature map, the predicted probability that a note exists in a fixed-length section of that feature map. The automatic music transcription device inputs the sound source to be transcribed into the former trained model (the feature map generation model) and inputs each feature map generated by the convolutional layers of the trained feature map generation model into the latter trained model (the note existence probability prediction model). It then generates musical score information based on the predicted existence probability of a note of each pitch in the fixed-length section, or default box, that the trained note existence probability prediction model outputs for each point of each feature map.

The feature maps generated by the trained feature map generation model have different temporal resolutions as a result of the convolutions, so the fixed-length sections, or default boxes, span different lengths of time. By using the note existence probability prediction model to detect, in each feature map, notes with the same length as the fixed-length section, the onsets and offsets of notes of different lengths can be identified.

First, an automatic music transcription device according to one embodiment of the present disclosure is described with reference to FIG. 1. FIG. 1 is a schematic diagram showing an automatic music transcription device having trained machine learning models according to one embodiment of the present disclosure.

As shown in FIG. 1, the automatic music transcription device 200 according to one embodiment of the present disclosure has two types of trained models, each implemented as any type of neural network such as, without limitation, a convolutional neural network. It generates musical score information from the sound source to be transcribed using the machine learning models trained by the learning device 100 with the learning data storage 50.

Next, a learning device according to one embodiment of the present disclosure is described with reference to FIGS. 2 to 7. The learning device 100 uses the learning data in the learning data storage 50 to train the feature map generation model and the note existence probability prediction model used by the automatic music transcription device 200. FIG. 2 is a block diagram showing the functional configuration of the learning device according to one embodiment of the present disclosure.

As shown in FIG. 2, the learning device 100 has a learning data acquisition unit 110, a first model learning unit 120, and a second model learning unit 130.

The learning data acquisition unit 110 acquires single-tone sound sources and pitch information as learning data for the feature map generation model, acquires sound sources to be transcribed and musical score information as learning data for the note existence probability prediction model, and performs preprocessing on the single-tone sound sources and the sound sources to be transcribed to obtain their respective spectrograms.

Specifically, the learning data acquisition unit 110 acquires, from the learning data storage 50, pairs of waveform data of single-tone (single-note) sound sources for training the feature map generation model (for example, 12 sound sources from "do" to "si") and pitch information (for example, the pitches from "do" to "si"). It then performs preprocessing (for example, a short-time Fourier transform) on the acquired waveform data of the single-tone sound sources to generate a learning data set of spectrograms of the single-tone sound sources and pitch information.

The learning data acquisition unit 110 also acquires, from the learning data storage 50, pairs of waveform data of monophonic (single-melody) sound sources, such as singing sound sources, for training the note existence probability prediction model and musical score information (for example, the time-series change in pitch). It then performs preprocessing (for example, a short-time Fourier transform) on the acquired waveform data of the monophonic sound sources to generate a learning data set of spectrograms of the monophonic sound sources and musical score information. Here, the musical score information may conform to, for example, the MIDI (Musical Instrument Digital Interface) standard.
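If the musical score information is given as MIDI note numbers, mapping them to the 12 pitch classes is a simple modulo operation. The following is a minimal sketch; the function names are hypothetical, and the octave label assumes the common convention in which note number 60 is written C4 (some tools label it C3).

```python
PITCH_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]


def midi_to_pitch_class(note_number: int) -> int:
    """Return the pitch class index (0 = C, ..., 11 = B) of a MIDI note number."""
    if not 0 <= note_number <= 127:
        raise ValueError("MIDI note numbers range from 0 to 127")
    return note_number % 12


def midi_to_name(note_number: int) -> str:
    """Return a note name such as 'A3', assuming note number 60 is C4."""
    octave = note_number // 12 - 1
    return f"{PITCH_NAMES[midi_to_pitch_class(note_number)]}{octave}"


print(midi_to_name(57), midi_to_name(60), midi_to_name(69))
```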

Typically, a spectrogram represents the intensity of signal components along the time and frequency axes and is generated by applying a short-time Fourier transform to waveform data. The short-time Fourier transform requires various parameters to be set. For example, it may be performed with an FFT window width of 1024, a sampling frequency of 16 kHz, an overlap width of 768, a Hanning window as the window function, and a mel filter bank (128 bands) as the filter bank. After conversion to a spectrogram, a fixed number of samples (for example, 1024 samples) may be extracted along the time axis. The spectrogram of this embodiment may also have its frequency axis logarithmically transformed so that low-frequency components are represented in finer detail.
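The example parameters imply a hop size and a frame rate that are easy to work out. The sketch below only does this arithmetic and builds the window; it assumes frames are taken without padding at the signal edges, which the disclosure does not state, and it omits the FFT and the mel filter bank themselves.

```python
import math

SAMPLE_RATE = 16_000        # sampling frequency in Hz
N_FFT = 1024                # FFT window width in samples
OVERLAP = 768               # overlap between consecutive windows in samples
HOP = N_FFT - OVERLAP       # hop size: 256 samples

FRAME_RATE = SAMPLE_RATE / HOP   # spectrogram frames per second
FREQ_BINS = N_FFT // 2 + 1       # one-sided FFT bins before the mel filter bank


def hann_window(n: int) -> list:
    """Periodic Hann (Hanning) window of length n."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * i / n) for i in range(n)]


def num_frames(num_samples: int) -> int:
    """Number of full windows that fit in the signal (no edge padding assumed)."""
    if num_samples < N_FFT:
        return 0
    return 1 + (num_samples - N_FFT) // HOP


print(HOP, FRAME_RATE, FREQ_BINS, num_frames(SAMPLE_RATE))
```

With these values, each window covers 64 ms of audio and consecutive windows start 16 ms apart.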

The first model learning unit 120 trains the feature map generation model using the pitch information so that the model takes the spectrogram of a single-tone sound source as learning input data and outputs the predicted probability of the pitch of that single-tone sound source.

For example, as shown in FIG. 3, the feature map generation model is configured as a convolutional neural network including multiple convolutional layers and is implemented as an SSD that converts the spectrogram of an input single-tone sound source into predicted pitch probabilities. Here, pitch is expressed as a discrete value rather than a continuous value and may be expressed as a one-hot vector. When unpitched (noise) sound sources such as percussion instruments are also to be learned, single-tone or single-note sounds of such sources may be included in the data set. In that case, a class representing noise may be defined as one of the pitch classes and used as the teacher label.
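A one-hot teacher label for the 12 pitch classes, with an optional extra class for unpitched noise, might be built as follows. This is an illustrative sketch; the class ordering, the label `"noise"`, and the function names are assumptions, not part of the disclosure.

```python
PITCH_CLASSES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]


def one_hot(index: int, num_classes: int) -> list:
    """Return a list with 1 at `index` and 0 elsewhere."""
    if not 0 <= index < num_classes:
        raise ValueError("index out of range")
    return [1 if i == index else 0 for i in range(num_classes)]


def pitch_label(name: str, include_noise: bool = False) -> list:
    """One-hot teacher label for a pitch class name, optionally with a noise class."""
    classes = PITCH_CLASSES + (["noise"] if include_noise else [])
    return one_hot(classes.index(name), len(classes))


print(pitch_label("A"))
print(pitch_label("noise", include_noise=True))
```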

The first model learning unit 120 inputs the spectrogram of a single-tone sound source, which is the learning input data, into the feature map generation model and updates the parameters of the feature map generation model by backpropagation so that the error between the output of the model and the pitch information, which is the learning output data, becomes smaller. Here, the cross entropy between the output of the feature map generation model and the pitch in the learning output data may be used, without limitation, as the loss function representing the error.
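The cross-entropy loss between a predicted pitch distribution and a one-hot pitch label can be sketched in plain Python as follows. The softmax step is an assumption about how raw model outputs become probabilities, and the small `eps` term is added only for numerical safety.

```python
import math


def softmax(logits: list) -> list:
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)                       # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs: list, target_one_hot: list, eps: float = 1e-12) -> float:
    """Cross entropy between predicted probabilities and a one-hot target."""
    return -sum(t * math.log(p + eps) for p, t in zip(probs, target_one_hot))


target = [0] * 12
target[9] = 1                             # teacher label: pitch class index 9 ("A")

uniform_loss = cross_entropy(softmax([0.0] * 12), target)
confident_logits = [0.0] * 12
confident_logits[9] = 5.0
confident_loss = cross_entropy(softmax(confident_logits), target)
print(uniform_loss, confident_loss)
```

A prediction that spreads probability evenly over the 12 classes gives a loss of about ln 12, while a prediction concentrated on the correct class gives a smaller loss, which is the direction in which backpropagation pushes the parameters.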

When a predetermined learning end condition is satisfied, for example when the update process has been completed for a predetermined number of learning data items, when the error has converged to or below a predetermined threshold, or when the improvement in the error has converged to or below a predetermined threshold, the first model learning unit 120 sets the updated feature map generation model as the trained machine learning model.

The second model learning unit 130 trains the note existence probability prediction model using the musical score information so that the model takes, as learning input data, the feature maps generated by inputting the spectrogram of the sound source to be transcribed into the trained feature map generation model, and outputs the predicted probability that a note exists in a fixed-length section of the feature map.

For example, as shown in FIG. 4, the note existence probability prediction model is configured as a convolutional neural network including multiple convolutional layers. It is implemented as an SSD that converts a feature map, generated by inputting the spectrogram of a monophonic sound source into the trained feature map generation model, into the predicted probability that a note exists with the same length as the fixed-length section starting at each point of the feature map. For example, when transcribing with the 12 notes from "do" to "si", each point on the feature map has a predicted existence probability for each of 13 note or pitch classes, namely the 12 pitches from "do" to "si" plus a rest (silence).

As described above, the trained feature map generation model includes multiple convolutional layers, and each convolutional layer generates a feature map of the spectrogram of the monophonic sound source. The generated feature maps differ in temporal resolution according to the level of the convolutional layer, as shown in FIG. 3. Typically, as shown in FIG. 5, a convolutional layer relatively close to the input layer generates a feature map with relatively high temporal resolution (32 Hz in the illustrated example), and a convolutional layer relatively close to the output layer generates a feature map with relatively low temporal resolution (16 Hz in the illustrated example). When fixed-length sections, or default boxes, are set as illustrated, a section in a feature map with relatively high temporal resolution occupies a shorter time than a section in a feature map with relatively low temporal resolution. This makes it possible to derive predicted existence probabilities for notes of different temporal lengths and thus to identify the temporal length of each note.
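The effect of temporal resolution on the time spanned by a fixed-length section can be checked with a few lines of arithmetic. The box length of 4 feature-map points used here is a hypothetical value chosen for illustration; only the 32 Hz and 16 Hz resolutions come from the illustrated example.

```python
def box_duration_seconds(box_length_points: int, time_resolution_hz: float) -> float:
    """Time spanned by a default box of a given length on a feature map."""
    return box_length_points / time_resolution_hz


BOX_LENGTH = 4  # hypothetical fixed length of a default box, in feature-map points

for resolution_hz in (32.0, 16.0):
    duration = box_duration_seconds(BOX_LENGTH, resolution_hz)
    print(f"{resolution_hz:.0f} Hz feature map: box spans {duration} s")
```

Under this assumption the same box covers 0.125 s on the 32 Hz feature map and 0.25 s on the 16 Hz feature map, so boxes on deeper layers correspond to longer notes.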

The second model learning unit 130 inputs the spectrogram of the sound source, which is the learning input data, into the trained feature map generation model, inputs each feature map generated by the trained feature map generation model into the note existence probability prediction model, and updates the parameters of the note existence probability prediction model by backpropagation so that the error between the output of the model and the musical score information, which is the learning output data, becomes smaller.

Here, a weighted sum of a timing error and a confidence error, calculated from the output of the note existence probability prediction model and the time-series change in pitch, may be used, without limitation, as the loss function representing the error. The time-series change in pitch is derived from the musical score information and is expressed as a collection of sets, each consisting of a start timing, an end timing, and a pitch in the piece of music. Such a set may be called a note event. For example, the time-series change in pitch may be expressed as note event #1: "0:00 to 0:02, A3 (la)", note event #2: "0:03 to 0:05, B3 (si)", note event #3: "0:05 to 0:08, C4 (do)", and so on. A default box as shown in FIG. 5 represents one note event and has multiple channels. The first sample of each channel of a default box holds, respectively, the predicted onset, the predicted offset, and the predicted probability of a pitch class for the note event of that default box. In total, there are therefore 2 + (number of pitch classes) channels.
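The example time series and the channel count can be written down directly. The `NoteEvent` type and the `parse_time` helper are hypothetical names used only for this sketch; times are assumed to be in minutes:seconds as in the example.

```python
from typing import NamedTuple


class NoteEvent(NamedTuple):
    """One set of start timing, end timing, and pitch."""
    start: int   # seconds from the beginning of the piece
    end: int     # seconds from the beginning of the piece
    pitch: str


def parse_time(mmss: str) -> int:
    """Convert a 'm:ss' string into seconds."""
    minutes, seconds = mmss.split(":")
    return int(minutes) * 60 + int(seconds)


events = [
    NoteEvent(parse_time("0:00"), parse_time("0:02"), "A3"),
    NoteEvent(parse_time("0:03"), parse_time("0:05"), "B3"),
    NoteEvent(parse_time("0:05"), parse_time("0:08"), "C4"),
]


def num_channels(num_pitch_classes: int) -> int:
    """Channels per default box: onset, offset, and one per pitch class."""
    return 2 + num_pitch_classes


print(events[2].end - events[2].start, num_channels(13))
```

With the 13 classes mentioned earlier (12 pitches plus silence), each default box would carry 15 channels.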

For each note event, the second model learning unit 130 searches for the default box for which the sum of the onset and the offset is smallest, and calculates the timing error and the confidence error for the detected default box and the note event. Here, the timing error may be the sum of the deviation of the start timing, taking the predicted onset into account, and the deviation of the end timing, taking the predicted offset into account. The differences may, however, be expressed as values relative to the length of the default box. The confidence error may be the cross entropy calculated from the pitch of the note event and the predicted pitch. A class representing silence may also be prepared as a teacher label, in which case sections in which no note sounds can be predicted.
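One possible reading of this loss is sketched below. It assumes that the predicted onset and offset are corrections to the start and end of the default box, expressed relative to the box length, and that the matching box is the one whose start and end deviate least from the note event. Both are interpretations rather than details given in the text, and `confidence_weight` is a hypothetical parameter.

```python
import math
from typing import List, Sequence, Tuple

Box = Tuple[float, float]  # (start, end) of a default box in seconds


def best_matching_box(boxes: Sequence[Box], true_start: float, true_end: float) -> int:
    """Index of the box whose start and end deviate least from the note event."""
    return min(
        range(len(boxes)),
        key=lambda i: abs(boxes[i][0] - true_start) + abs(boxes[i][1] - true_end),
    )


def timing_error(box: Box, onset_pred: float, offset_pred: float,
                 true_start: float, true_end: float) -> float:
    """Start and end deviations after applying the predicted corrections,
    expressed relative to the length of the default box."""
    length = box[1] - box[0]
    pred_start = box[0] + onset_pred * length
    pred_end = box[1] + offset_pred * length
    return (abs(pred_start - true_start) + abs(pred_end - true_end)) / length


def confidence_error(probs: List[float], true_class: int, eps: float = 1e-12) -> float:
    """Cross entropy between the predicted pitch distribution and the true class."""
    return -math.log(probs[true_class] + eps)


def total_loss(t_err: float, c_err: float, confidence_weight: float = 1.0) -> float:
    """Weighted sum of the timing error and the confidence error."""
    return t_err + confidence_weight * c_err


boxes = [(0.0, 1.0), (1.0, 2.0), (0.0, 2.0)]
idx = best_matching_box(boxes, true_start=1.0, true_end=2.0)
loss = total_loss(
    timing_error(boxes[idx], 0.0, 0.0, 1.0, 2.0),
    confidence_error([0.1, 0.9], true_class=1),
)
print(idx, loss)
```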

The second model learning unit 130 may reduce the default boxes set for each point of each feature map in accordance with NMS (Non-Maximum Suppression) and take the remaining default boxes as the predicted note events. Specifically, the second model learning unit 130 first obtains, for each default box, the predicted note existence probability for each pitch class. It may then delete default boxes whose predicted probability is at or below a predetermined threshold (for example, 0.9). Among the remaining default boxes, the second model learning unit 130 applies a threshold to the ratio of intersection to union (intersection over union), deletes one of any pair of default boxes whose ratio is at or above the threshold, and thereby eliminates overlapping default boxes. The second model learning unit 130 takes the default boxes that finally remain as the predicted note events.
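The two-stage filtering can be sketched for one-dimensional boxes on the time axis. The score threshold of 0.9 comes from the example in the text; the overlap threshold of 0.5 and the choice to discard the lower-scoring box of an overlapping pair are assumptions borrowed from common NMS practice.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float]  # (start, end) on the time axis, in seconds


def interval_iou(a: Box, b: Box) -> float:
    """Intersection over union of two time intervals."""
    intersection = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - intersection
    return intersection / union if union > 0 else 0.0


def nms_1d(boxes: Sequence[Box], scores: Sequence[float],
           score_threshold: float = 0.9, iou_threshold: float = 0.5) -> List[int]:
    """Return indices of the boxes kept after thresholding and suppression."""
    # Stage 1: drop boxes whose predicted probability is at or below the threshold.
    candidates = [i for i, s in enumerate(scores) if s > score_threshold]
    # Stage 2: visit boxes from highest to lowest score and keep a box only if
    # it does not overlap an already kept box by the IoU threshold or more.
    candidates.sort(key=lambda i: scores[i], reverse=True)
    kept: List[int] = []
    for i in candidates:
        if all(interval_iou(boxes[i], boxes[j]) < iou_threshold for j in kept):
            kept.append(i)
    return kept


boxes = [(0.0, 1.0), (0.1, 1.1), (2.0, 3.0), (4.0, 5.0)]
scores = [0.95, 0.92, 0.99, 0.5]
kept = nms_1d(boxes, scores)
print(sorted(kept))
```

In this example the last box is removed by the score threshold, and the second box is suppressed because it overlaps the higher-scoring first box.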

When a predetermined learning end condition is satisfied, for example when the update process has been completed for a predetermined number of learning data items, when the error has converged to or below a predetermined threshold, or when the improvement in the error has converged to or below a predetermined threshold, the second model learning unit 130 sets the updated note existence probability prediction model as the trained model.

In one embodiment, the first model learning unit 120 may train a feature map generation model for each of multiple types of audio components, and the second model learning unit 130 may train the note existence probability prediction model so that, for a sound source to be transcribed that contains the multiple types of audio components, it outputs the predicted probability that a note exists for each audio component type.

For example, the feature map generation model and the note existence probability prediction model may be applied to a piece of music containing monophonic vocals and accompaniment. In this case, a vocal feature map generation model and an accompaniment feature map generation model are trained in the same manner as the learning process described above, using vocal learning data consisting of pairs of vocal single-tone sound sources and pitch information and accompaniment learning data consisting of pairs of accompaniment single-tone sound sources and pitch information. A vocal note existence probability prediction model and an accompaniment note existence probability prediction model are likewise trained in the same manner as the learning process described above, using training sound sources and musical score information, with the feature maps generated by inputting each sound source into the trained vocal feature map generation model and the trained accompaniment feature map generation model as input.

Alternatively, the feature map generation model and the note existence probability prediction model may be applied to a piece of music containing multiple parts, for example one part per instrument. The learning process is the same as for the piece of music containing vocals and accompaniment described above, but in this case the output of the note existence probability prediction model may be the predicted probability that a specific note of a specific part exists in a fixed-length section of the feature map. For example, the model may output the predicted probability of the existence of a specific note of a specific part, such as "pitch A3 in a male voice" or "pitch A3 in a female voice".

Alternatively, the present disclosure may be applied to a piece of music that has a meter. In this case, the output of the note existence probability prediction model may relate to the onsets and offsets of beats. For example, the predicted probability that a default box corresponds to one beat may be output.

FIG. 6 is a flowchart showing the learning process of the feature map generation model according to one embodiment of the present disclosure. The learning process is carried out by the learning device 100 described above or by a processor of the learning device 100.

As shown in FIG. 6, in step S101, the learning data acquisition unit 110 acquires pairs of single-tone sound sources and pitch information from the learning data storage 50. For example, there may be 13 pitch classes, namely the 12 notes from "do" to "si" plus silence, and single-tone sound sources corresponding to these 13 pitch classes may be acquired.

In step S102, the learning data acquisition unit 110 preprocesses the acquired single-tone sound sources. Specifically, the learning data acquisition unit 110 performs preprocessing (for example, a short-time Fourier transform) on the waveform data of each single-tone sound source to obtain its spectrogram.

In step S103, the first model learning unit 120 trains the feature map generation model using the pairs of preprocessed single-tone sound sources and pitch information. For example, the feature map generation model is configured as a convolutional neural network and converts an input sound source into predicted pitch probabilities. Specifically, the first model learning unit 120 inputs the spectrogram of a single-tone sound source into the feature map generation model and updates the parameters of the feature map generation model by backpropagation so that the error between the output of the model and the pitch information becomes smaller.

In step S104, the first model learning unit 120 determines whether the learning end condition is satisfied. The predetermined learning end condition may be, for example, that the update process has been completed for a predetermined number of learning data items, that the error has converged to or below a predetermined threshold, or that the improvement in the error has converged to or below a predetermined threshold. If the predetermined learning end condition is satisfied (S104: YES), the first model learning unit 120 may set the updated feature map generation model as the trained model. If the predetermined learning end condition is not satisfied (S104: NO), the process returns to step S101 and the above steps are repeated.
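The three example end conditions in step S104 can be combined into a single check, as in the following sketch. The function name and all threshold values are hypothetical; the disclosure only states that such thresholds are predetermined.

```python
from typing import Sequence


def should_stop(num_updates: int, loss_history: Sequence[float],
                max_updates: int = 10_000,
                loss_threshold: float = 0.01,
                improvement_threshold: float = 1e-4) -> bool:
    """Return True when any of the example learning end conditions holds."""
    # Condition 1: a predetermined number of updates has been processed.
    if num_updates >= max_updates:
        return True
    # Condition 2: the error is at or below a predetermined threshold.
    if loss_history and loss_history[-1] <= loss_threshold:
        return True
    # Condition 3: the improvement over the previous error is at or below
    # a predetermined threshold (this also stops when the error increases).
    if len(loss_history) >= 2:
        improvement = loss_history[-2] - loss_history[-1]
        if improvement <= improvement_threshold:
            return True
    return False


print(should_stop(5, [1.0, 0.5]), should_stop(5, [1.0, 0.005]))
```

In a training loop, this check would be evaluated after each parameter update, with a NO result leading back to the data acquisition step.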

FIG. 7 is a flowchart showing the learning process for the note presence probability prediction model according to an embodiment of the present disclosure. The learning process is carried out by the learning device 100 described above or by a processor of the learning device 100.

As shown in FIG. 7, in step S201, the learning data acquisition unit 110 acquires a pair of a monophonic sound source and musical score information from the learning data storage 50. For example, the monophonic sound source may be waveform data of a singing voice, and the musical score information represents the score of that monophonic sound source.

In step S202, the learning data acquisition unit 110 preprocesses the acquired monophonic sound source. Specifically, the learning data acquisition unit 110 applies preprocessing (for example, a short-time Fourier transform) to the waveform data of the monophonic sound source and obtains a spectrogram of the monophonic sound source.

In step S203, the second model learning unit 130 inputs the preprocessed monophonic sound source into the trained feature map generation model and obtains the feature maps generated by that model. Specifically, the second model learning unit 130 obtains the feature map generated by each convolutional layer of the trained feature map generation model. The resulting feature maps have different temporal resolutions, depending on how much convolution has been applied up to each layer.
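How a stack of layers yields feature maps with different temporal resolutions can be sketched as follows. This is a toy stand-in, not the disclosed network: each "layer" is a fixed averaging kernel of width 2 applied along time with stride 2 and followed by ReLU, whereas a real convolutional layer would have learned kernels. The names `temporal_layer` and `multi_resolution_feature_maps` are hypothetical.

```python
def temporal_layer(frames, kernel=(0.5, 0.5), stride=2):
    """Stand-in for one convolutional layer: a 1-D convolution along time,
    applied independently to each frequency channel, followed by ReLU."""
    width = len(kernel)
    out = []
    for start in range(0, len(frames) - width + 1, stride):
        frame = []
        for ch in range(len(frames[0])):
            value = sum(kernel[i] * frames[start + i][ch] for i in range(width))
            frame.append(max(0.0, value))
        out.append(frame)
    return out


def multi_resolution_feature_maps(spectrogram, num_layers=3):
    """Collect the output of every layer, not only the last one."""
    maps = []
    current = spectrogram
    for _ in range(num_layers):
        current = temporal_layer(current)
        maps.append(current)
    return maps


# Usage: 16 time frames with 4 frequency bins each.
toy_spectrogram = [[float(t), 1.0, 0.0, 2.0] for t in range(16)]
feature_maps = multi_resolution_feature_maps(toy_spectrogram)
time_lengths = [len(m) for m in feature_maps]
```

Each successive map covers the same audio with half as many time points, so one point in a deeper map spans a longer stretch of time. That is what lets deeper maps represent longer notes.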

In step S204, the second model learning unit 130 trains the note presence probability prediction model on pairs of an obtained feature map and musical score information. For example, the note presence probability prediction model is configured as a convolutional neural network and converts an input feature map into note presence prediction probabilities, that is, probabilities that a note exists in a fixed-length section of the feature map. Specifically, the second model learning unit 130 inputs each feature map into the note presence probability prediction model and updates the parameters of the note presence probability prediction model by backpropagation so that the error between the output of the model and the musical score information becomes smaller.

In step S205, the second model learning unit 130 determines whether a predetermined learning end condition is satisfied. The predetermined learning end condition may be, for example, that the update process has been completed for a predetermined number of learning data items, that the error has converged to a predetermined threshold or below, or that the improvement in the error has converged to a predetermined threshold or below. If the predetermined learning end condition is satisfied (S205: YES), the second model learning unit 130 may set the updated note presence probability prediction model as the trained model. If the predetermined learning end condition is not satisfied (S205: NO), the process returns to step S201 and repeats the steps described above.

Next, an automatic music transcription device according to an embodiment of the present disclosure is described with reference to FIGS. 8 and 9. FIG. 8 is a block diagram showing the functional configuration of the automatic music transcription device according to an embodiment of the present disclosure.

As shown in FIG. 8, the automatic music transcription device 200 has a model processing unit 210 and a score generation unit 220.

The model processing unit 210 uses a trained feature map generation model, which outputs predicted pitch probabilities from a single-tone sound source, and a trained note presence probability prediction model, which outputs from a feature map the predicted probability that a note exists in a fixed-length section of that feature map. The model processing unit 210 inputs the sound source to be transcribed into the trained feature map generation model, inputs the feature map generated by that model into the trained note presence probability prediction model, and outputs the predicted probability that a note exists in a fixed-length section of the feature map.

Specifically, the model processing unit 210 applies preprocessing such as a short-time Fourier transform to the sound source to be transcribed to obtain its spectrogram, inputs the spectrogram into the feature map generation model trained by the learning device 100, and obtains a feature map from each convolutional layer of that model. The model processing unit 210 then inputs each obtained feature map into the note presence probability prediction model trained by the learning device 100, obtains the predicted probability that a note exists whose length equals the fixed-length section, or default box, starting at each point of the input feature map, and passes the note presence prediction probabilities for each feature map to the score generation unit 220. For example, the note presence prediction probability is the predicted probability of each pitch (for example, "do", "re", ..., "ti", or silence) in a default box of the feature map, and a pitch with a high predicted probability can be judged to correspond to a note of that temporal length.
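One way to picture the default boxes is as time intervals of fixed length, one starting at each point of a feature map. The sketch below is an assumption about that mapping, since the text does not state how points are converted to seconds; the function name `default_boxes` and its parameters are hypothetical.

```python
def default_boxes(num_points, seconds_per_point, box_length_points=1):
    """Return (start_sec, end_sec) for a fixed-length box starting at each point."""
    return [(p * seconds_per_point, (p + box_length_points) * seconds_per_point)
            for p in range(num_points)]


# Usage: three feature maps covering the same 2-second clip at
# 8, 4, and 2 points, so one point spans 0.25 s, 0.5 s, and 1.0 s.
boxes_fine = default_boxes(8, 0.25)
boxes_mid = default_boxes(4, 0.5)
boxes_coarse = default_boxes(2, 1.0)
```

Under this reading, candidates from the fine map describe short notes and candidates from the coarse map describe long notes, and each box carries its own set of pitch probabilities.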

The score generation unit 220 generates musical score information based on the predicted probability that a note exists. Specifically, the score generation unit 220 post-processes the output of the trained note presence probability prediction model according to NMS (Non-Maximum Suppression), as used in SSD. Typically, the trained note presence probability prediction model outputs a large number of predicted note candidates. The predicted notes must be identified from among these candidates, and SSD commonly uses NMS to narrow the candidates down.

For example, the score generation unit 220 first takes, for each point on the feature map input to the trained note presence probability prediction model, the note with the highest note presence prediction probability as the predicted note at that time. The score generation unit 220 thus determines a predicted note for each point on the feature map, builds a list of datasets each consisting of a point, its predicted note, and the corresponding note presence prediction probability, and sorts the datasets in the list in descending order of note presence prediction probability. The score generation unit 220 then applies predetermined extraction conditions to narrow down the predicted note candidates in the list. For example, the score generation unit 220 may delete from the list any dataset whose note presence prediction probability is at or below a predetermined threshold (for example, 0.9). In addition, to eliminate duplicates of a note that has been detected more than once, when a dataset with the same predicted note and a degree of overlap at or above a predetermined threshold (for example, 80%) appears higher in the list, the score generation unit 220 may keep only the higher-ranked dataset. The score generation unit 220 generates a musical score based on the datasets in the final list.
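The narrowing procedure described above can be sketched as a one-dimensional non-maximum suppression over time intervals. The text does not define how the degree of overlap is measured; the sketch assumes intersection-over-union of the note intervals, which is the measure commonly used with SSD. The dictionary keys and the function names `interval_overlap` and `suppress_notes` are hypothetical.

```python
def interval_overlap(a_start, a_end, b_start, b_end):
    """Intersection-over-union of two time intervals (assumed overlap measure)."""
    intersection = max(0.0, min(a_end, b_end) - max(a_start, b_start))
    union = (a_end - a_start) + (b_end - b_start) - intersection
    return intersection / union if union > 0 else 0.0


def suppress_notes(candidates, min_probability=0.9, max_overlap=0.8):
    """candidates: list of dicts with keys 'start', 'end', 'note', 'prob'."""
    # Delete candidates at or below the probability threshold,
    # then sort the rest in descending order of probability.
    ranked = sorted((c for c in candidates if c["prob"] > min_probability),
                    key=lambda c: c["prob"], reverse=True)
    kept = []
    for cand in ranked:
        # A candidate is a duplicate if a higher-ranked entry has the same
        # note and overlaps it by at least max_overlap.
        duplicate = any(
            k["note"] == cand["note"]
            and interval_overlap(k["start"], k["end"],
                                 cand["start"], cand["end"]) >= max_overlap
            for k in kept)
        if not duplicate:
            kept.append(cand)
    return kept


# Usage with made-up candidates.
candidates = [
    {"start": 0.0, "end": 1.0, "note": "do", "prob": 0.98},
    {"start": 0.1, "end": 1.0, "note": "do", "prob": 0.95},  # overlaps the first
    {"start": 1.0, "end": 1.5, "note": "re", "prob": 0.93},
    {"start": 0.0, "end": 1.0, "note": "mi", "prob": 0.50},  # below threshold
]
final_notes = suppress_notes(candidates)
```

In this example the second "do" is dropped as a duplicate of the first, the "mi" candidate is dropped by the probability threshold, and the remaining two entries form the final list.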

FIG. 9 is a flowchart showing automatic music transcription processing according to an embodiment of the present disclosure. The automatic music transcription processing is carried out by the automatic music transcription device 200 described above or by a processor of the automatic music transcription device 200.

As shown in FIG. 9, in step S301, the model processing unit 210 acquires the sound source to be transcribed. For example, the sound source may be a monophonic sound source or may contain multiple types of audio components.

In step S302, the model processing unit 210 preprocesses the acquired sound source. Specifically, the model processing unit 210 applies preprocessing such as a short-time Fourier transform to the acquired sound source and obtains a spectrogram of the sound source.

In step S303, the model processing unit 210 inputs the preprocessed sound source into the trained feature map generation model to obtain feature maps, inputs the obtained feature maps into the trained note presence probability prediction model, and obtains the predicted probability that a note exists whose length equals the fixed-length section, or default box, starting at each point of the input feature map.

In step S304, the score generation unit 220 determines a predicted note for each point on the feature map based on the note presence prediction probabilities obtained for that point. Specifically, the score generation unit 220 determines, as the predicted note for a point, the note corresponding to the highest of the note presence prediction probabilities obtained for that point.
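Step S304 amounts to taking the most probable class at each point. A minimal sketch follows; the label list `PITCH_LABELS` is illustrative only (the text names "do", "re", "ti", and silence as examples), and the function name `predict_notes` is hypothetical.

```python
# Illustrative class labels; the actual label set is not specified in the text.
PITCH_LABELS = ["do", "re", "mi", "fa", "sol", "la", "ti", "silence"]


def predict_notes(probabilities_per_point):
    """For each point, return (point_index, label, probability) of the top class."""
    predictions = []
    for index, probs in enumerate(probabilities_per_point):
        best = max(range(len(probs)), key=lambda k: probs[k])
        predictions.append((index, PITCH_LABELS[best], probs[best]))
    return predictions


# Usage with made-up probabilities for two points.
point_probs = [
    [0.91, 0.02, 0.01, 0.01, 0.01, 0.01, 0.01, 0.02],
    [0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.93],
]
predicted = predict_notes(point_probs)
```

Each resulting tuple corresponds to one dataset (point, predicted note, probability) of the kind that the subsequent post-processing sorts and filters.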

In step S305, the score generation unit 220 post-processes the predicted notes determined for the points of the feature map. Specifically, the score generation unit 220 narrows down the predicted notes for the points of the feature map according to NMS as used in SSD. For example, based on the predicted note determined for each point on the feature map, the score generation unit 220 may build a list of datasets each consisting of a point, its predicted note, and the corresponding note presence prediction probability, sort the datasets in the list in descending order of note presence prediction probability, and delete from the list any dataset whose note presence prediction probability is at or below a predetermined threshold (for example, 0.9). In addition, when a dataset with the same predicted note and a degree of overlap at or above a predetermined threshold (for example, 80%) appears higher in the list, the score generation unit 220 may keep only the higher-ranked dataset.

In step S306, the score generation unit 220 generates a musical score based on the datasets in the final list.

Each of the learning device 100 and the automatic music transcription device 200 described above may have, for example, a hardware configuration as shown in FIG. 10, comprising a CPU (Central Processing Unit) 101, a GPU (Graphics Processing Unit) 102, a RAM (Random Access Memory) 103, a communication interface (IF) 104, a hard disk 105, an input device 106, and an output device 107. The CPU 101 and the GPU 102 may be referred to as processors or processing circuits and execute the various processes of the learning device 100 and the automatic music transcription device 200. In particular, the CPU 101 controls execution of the various processes in the learning device 100 and the automatic music transcription device 200, and the GPU 102 executes the processes for training and running the machine learning models. The RAM 103 and the hard disk 105 function as memories that store various data and programs in the learning device 100 and the automatic music transcription device 200. In particular, the RAM 103 functions as working memory that stores work data for the CPU 101 and the GPU 102, and the hard disk 105 stores control programs for the CPU 101 and the GPU 102 and/or learning data. The communication IF 104 is a communication interface for acquiring learning data from the learning data storage 50. The input device 106 comprises various devices for inputting information and data (for example, a display, a speaker, a keyboard, and a touch screen), and the output device 107 comprises various devices that present information such as the content, progress, and results of processing (for example, a display, a printer, and a speaker). However, the learning device 100 and the automatic music transcription device 200 according to the present disclosure are not limited to the hardware configuration described above and may have any other suitable hardware configuration.

In one aspect of the present disclosure, there is provided a learning device comprising:
a learning data acquisition unit that acquires a single-tone sound source and pitch information as learning data for a first machine learning model, acquires a sound source to be transcribed and musical score information as learning data for a second machine learning model, performs preprocessing on the single-tone sound source and the sound source to be transcribed, and obtains a spectrogram of each;
a first model learning unit that trains the first machine learning model using the pitch information so that the model receives the spectrogram of the single-tone sound source as learning input data and outputs predicted probabilities of the pitch of the single-tone sound source; and
a second model learning unit that trains the second machine learning model using the musical score information so that the model receives, as learning input data, a feature map generated by inputting the spectrogram of the sound source to be transcribed into the trained first machine learning model, and outputs a predicted probability that a note exists in a fixed-length section of the feature map.

In one embodiment, the first machine learning model and the second machine learning model may each be configured as a convolutional neural network.

In one embodiment, the second model learning unit may input, to the second machine learning model, a plurality of feature maps with different temporal resolutions generated by the first machine learning model.

In one embodiment, the second model learning unit may implement the first machine learning model and the second machine learning model as an SSD (Single Shot Detection).

In one embodiment, the first model learning unit may train the first machine learning model for each of a plurality of types of audio components, and the second model learning unit may train the second machine learning model so that, for a sound source to be transcribed that contains the plurality of types of audio components, the model outputs a predicted probability that a note exists for each audio component type.

In one aspect of the present disclosure, there is provided an automatic music transcription device comprising:
a model processing unit that uses a first trained machine learning model, which outputs predicted pitch probabilities from a single-tone sound source, and a second trained machine learning model, which outputs from a feature map a predicted probability that a note exists in a fixed-length section of the feature map, the model processing unit inputting a sound source to be transcribed into the first trained machine learning model, inputting the feature map generated by the first trained machine learning model into the second trained machine learning model, and outputting the predicted probability that a note exists in a fixed-length section of the feature map; and
a score generation unit that generates musical score information based on the predicted probability that the note exists.

In one embodiment, the model processing unit may obtain a spectrogram by performing preprocessing on the sound source to be transcribed and input the spectrogram into the first trained machine learning model.

In one embodiment, the model processing unit may determine, for each point on the feature map, the note with the highest predicted probability output from the second trained machine learning model as the predicted note.

In one embodiment, the score generation unit may generate the musical score information based on predicted notes extracted according to NMS (Non-Maximum Suppression).

In one aspect of the present disclosure, there is provided a learning method comprising:
a step in which a processor acquires a single-tone sound source and pitch information as learning data for a first machine learning model, acquires a sound source to be transcribed and musical score information as learning data for a second machine learning model, performs preprocessing on the single-tone sound source and the sound source to be transcribed, and obtains a spectrogram of each;
a step in which the processor trains the first machine learning model using the pitch information so that the model receives the spectrogram of the single-tone sound source as learning input data and outputs predicted probabilities of the pitch of the single-tone sound source; and
a step in which the processor trains the second machine learning model using the musical score information so that the model receives, as learning input data, a feature map generated by inputting the spectrogram of the sound source to be transcribed into the trained first machine learning model, and outputs a predicted probability that a note exists in a fixed-length section of the feature map.

In one aspect of the present disclosure, there is provided an automatic music transcription method comprising:
a step in which a processor inputs a sound source to be transcribed into a first trained machine learning model that outputs predicted pitch probabilities from a single-tone sound source;
a step in which the processor inputs the feature map generated by the first trained machine learning model into a second trained machine learning model that outputs from a feature map a predicted probability that a note exists in a fixed-length section of the feature map; and
a step in which the processor generates musical score information based on the predicted probability that the note exists, as output from the second trained machine learning model.

In one aspect of the present disclosure, there is provided a program that causes a processor to execute:
a step of acquiring a single-tone sound source and pitch information as learning data for a first machine learning model, acquiring a sound source to be transcribed and musical score information as learning data for a second machine learning model, performing preprocessing on the single-tone sound source and the sound source to be transcribed, and obtaining a spectrogram of each;
a step of training the first machine learning model using the pitch information so that the model receives the spectrogram of the single-tone sound source as learning input data and outputs predicted probabilities of the pitch of the single-tone sound source; and
a step of training the second machine learning model using the musical score information so that the model receives, as learning input data, a feature map generated by inputting the spectrogram of the sound source to be transcribed into the trained first machine learning model, and outputs a predicted probability that a note exists in a fixed-length section of the feature map.

In one aspect of the present disclosure, there is provided a program that causes a processor to execute:
a step of inputting a sound source to be transcribed into a first trained machine learning model that outputs predicted pitch probabilities from a single-tone sound source;
a step of inputting the feature map generated by the first trained machine learning model into a second trained machine learning model that outputs from a feature map a predicted probability that a note exists in a fixed-length section of the feature map; and
a step of generating musical score information based on the predicted probability that the note exists, as output from the second trained machine learning model.

In one aspect of the present disclosure, there is provided a computer-readable storage medium storing the program described above.

Although embodiments of the present disclosure have been described in detail above, the present disclosure is not limited to the specific embodiments described, and various modifications and changes are possible within the scope of the gist of the present disclosure as set forth in the claims.

50 Learning data storage
100 Learning device
200 Automatic music transcription device

To solve the above problem, one aspect of the present disclosure relates to a learning device comprising: a first model learning unit that trains a first machine learning model, configured as a convolutional neural network, on training data pairing a first spectrogram generated from a single-tone sound source with corresponding pitch information, so that, in response to input of the first spectrogram, the model outputs feature maps that are generated by a plurality of convolutional layers with different temporal resolutions and that indicate predicted probabilities of the corresponding pitch; and a second model learning unit that trains a second machine learning model on training data pairing musical score information with the feature maps generated by the trained first machine learning model when a second spectrogram generated from a single-melody sound source is input to it, so that, in response to input of the feature maps output when a third spectrogram generated from a sound source to be transcribed is input to the first machine learning model, the second machine learning model outputs a note presence prediction probability that a note exists whose length equals the fixed-length section, or default box, starting at each point on each feature map.

Claims

1. A learning device comprising:
a first model learning unit that trains a first machine learning model on training data pairing a first spectrogram generated from a single-tone sound source with corresponding pitch information, so that the first machine learning model outputs, in response to input of a spectrogram, a first feature map indicating predicted probabilities of the corresponding pitch; and
a second model learning unit that trains a second machine learning model on training data pairing the first feature map output by the first model learning unit with musical score information, so that the second machine learning model outputs information for generating a musical score in response to input of the first feature map indicating predicted pitch probabilities that is output when a second spectrogram generated from a sound source to be transcribed is input to the first machine learning model.

2. The learning device according to claim 1, wherein said first machine learning model and said second machine learning model are configured by a convolutional neural network.

3. The learning device according to claim 2, wherein said second model learning unit inputs a plurality of said first feature maps having different temporal resolutions generated by said first machine learning model to said second machine learning model.

4. The learning device according to claim 1, wherein said second model learning unit realizes said first machine learning model and said second machine learning model as SSD (Single Shot Detection).

5. The learning device according to any one of claims 1 to 4, wherein the first model learning unit trains the first machine learning model for each of multiple types of audio components, and the second model learning unit trains the second machine learning model for a sound source to be transcribed that includes the multiple types of audio components.

6. An automatic music transcription device comprising a model processing unit that uses a first trained machine learning model, which outputs a first feature map indicating predicted pitch probabilities from a single-tone sound source, and a second trained machine learning model, which outputs information for generating a musical score from the first feature map, wherein the model processing unit inputs a sound source to be transcribed into the first trained machine learning model, inputs the first feature map output by the first trained machine learning model into the second trained machine learning model, and outputs the information for generating a musical score.

7. The automatic music transcription device according to claim 6, wherein said model processing unit obtains a spectrogram by performing preprocessing on said sound source to be transcribed, and inputs said spectrogram to said first trained machine learning model.

8. The automatic music transcription device according to claim 6, wherein the model processing unit determines, for each point on the first feature map, the note having the highest predicted probability output from the second trained machine learning model as the predicted note.

9. The automatic music transcription device according to claim 8, further comprising a score generation unit that generates musical score information based on the predicted probability that the note exists, wherein said score generation unit generates the musical score information based on predicted notes extracted according to NMS (Non-Maximum Suppression).

A learning method in which a processor executes:
a step of training a first machine learning model that outputs, in response to input of a spectrogram, a first feature map indicating the predicted probability of the corresponding pitch, by learning teacher data that pairs a first spectrogram generated from a single-tone sound source with the corresponding pitch information; and
a step of training a second machine learning model that outputs information for generating a musical score in response to input of a first feature map indicating predicted probabilities of pitches, by learning teacher data that pairs musical score information with the first feature map output when a second spectrogram generated from a sound source to be transcribed is input to the first machine learning model.
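The second training step depends on teacher data built from the output of the first model. The sketch below shows only that data-preparation step, under the assumption that the first model is already trained and is exposed as a callable; `build_second_stage_pairs`, `toy_first_model`, and the list-based data layout are hypothetical, and the actual optimisation of either model is omitted.

```python
from typing import Callable, List, Sequence, Tuple

Spectrogram = List[List[float]]   # assumed layout: [time][frequency bin]
FeatureMap = List[List[float]]    # assumed layout: [time][pitch probability]
ScoreInfo = List[str]             # assumed: note labels for the excerpt


def build_second_stage_pairs(
    first_model: Callable[[Spectrogram], FeatureMap],
    spectrograms: Sequence[Spectrogram],
    score_infos: Sequence[ScoreInfo],
) -> List[Tuple[FeatureMap, ScoreInfo]]:
    """Pair each first-model feature map with its musical score information."""
    if len(spectrograms) != len(score_infos):
        raise ValueError("each spectrogram needs matching score information")
    return [(first_model(spec), score) for spec, score in zip(spectrograms, score_infos)]


# Hypothetical stand-in for the trained first model.
def toy_first_model(spec: Spectrogram) -> FeatureMap:
    return [[v / (sum(frame) or 1.0) for v in frame] for frame in spec]


pairs = build_second_stage_pairs(
    toy_first_model,
    [[[1.0, 1.0]], [[0.0, 2.0]]],
    [["C4 quarter"], ["D4 quarter"]],
)
```

Each resulting pair is one teacher-data item for the second model: the feature map is the learning input and the score information is the target.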

An automatic music transcription method in which a processor executes:
a step of training a first machine learning model that outputs, in response to input of a spectrogram, a first feature map indicating the predicted probability of the corresponding pitch, by learning teacher data that pairs a first spectrogram generated from a single-tone sound source with the corresponding pitch information; and
a step of training a second machine learning model that outputs information for generating a musical score in response to input of a first feature map indicating predicted probabilities of pitches, by learning teacher data that pairs musical score information with the first feature map output when a second spectrogram generated from a sound source to be transcribed is input to the first machine learning model.

A program that causes a processor to execute:
a step of training a first machine learning model that outputs, in response to input of a spectrogram, a first feature map indicating the predicted probability of the corresponding pitch, by learning teacher data that pairs a first spectrogram generated from a single-tone sound source with the corresponding pitch information; and
a step of training a second machine learning model that outputs information for generating a musical score in response to input of a first feature map indicating predicted probabilities of pitches, by learning teacher data that pairs musical score information with the first feature map output when a second spectrogram generated from a sound source to be transcribed is input to the first machine learning model.

A program that causes a processor to execute:
a step of inputting a sound source to be transcribed into a first trained machine learning model that outputs predicted probabilities of pitches from a single-tone sound source; and
a step of inputting a feature map, generated by the first trained machine learning model and indicating the predicted probabilities of the pitches, into a second trained machine learning model that outputs information for generating a musical score in response to the input.