CN110852375A - End-to-end music score note identification method based on deep learning - Google Patents


Info

Publication number
CN110852375A
CN110852375A (application CN201911090621.2A)
Authority
CN
China
Prior art keywords
note
music score
duration
pitch
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911090621.2A
Other languages
Chinese (zh)
Inventor
黄志清
贾翔
王师凯
张煜森
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Technology
Original Assignee
Beijing University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Technology filed Critical Beijing University of Technology
Priority to CN201911090621.2A priority Critical patent/CN110852375A/en
Publication of CN110852375A publication Critical patent/CN110852375A/en
Pending legal-status Critical Current

Classifications

    • G06F 18/214: Pattern recognition; generating training patterns, bootstrap methods, e.g. bagging or boosting
    • G06N 3/045: Neural network architectures; combinations of networks
    • G06N 3/048: Neural network architectures; activation functions
    • G06N 3/08: Neural networks; learning methods
    • G06V 2201/07: Image or video recognition; target detection


Abstract

The invention discloses an end-to-end music score note identification method based on deep learning, comprising three steps: (1) data preprocessing: the corresponding data set is downloaded from MuseScore, and the pitch and duration labels are re-encoded; (2) data enhancement: the invention provides 4 different enhancement methods for augmenting the re-encoded music score data; (3) end-to-end model: a deep convolutional neural network model is applied to end-to-end music score note identification, the enhanced data is input into the model, and the model outputs note durations and pitches. For printed music scores, the invention provides a deep-learning-based note recognition model: the whole score image is input to the model, which directly outputs the duration and pitch of the notes on the score. The model is completely end-to-end and can accurately recognize polyphonic music score images.

Description

End-to-end music score note identification method based on deep learning
Technical Field
The invention belongs to the field of optical music score recognition, and relates to an end-to-end neural network recognition method based on deep learning, which can be applied to music score note recognition.
Background
Optical music score recognition applies optical character recognition to music, converting scores into editable or playable forms such as MIDI (for playback) and MusicXML (for page layout). Notes make up a very high proportion of a score's symbols relative to all others; they record pitch and duration and carry important semantic information. Note recognition is therefore the core and key of score recognition. Notes take widely varying forms, and this diversity and polymorphism make them difficult to recognize. Traditional note recognition methods must first delete the staff lines, then extract primitive symbols, and finally combine the primitives to complete recognition; the whole pipeline is complicated, and every step can degrade the final note recognition accuracy.
In recent years, breakthroughs in deep learning for computer vision have profoundly changed how optical music recognition (OMR) is approached, and more and more research focuses on solving OMR with deep learning. The methods fall roughly into two categories: target detection and sequence recognition. However, current deep-learning-based target detection methods cannot identify the pitch and duration of notes, and sequence recognition methods suffer from low accuracy when processing polyphonic scores.
Disclosure of Invention
The invention aims to provide a deep-learning-based note recognition model for printed music scores: a whole score image is input into the model, which directly outputs the duration and pitch of the notes on the score. The model is completely end-to-end and can accurately recognize polyphonic music score images.
In order to achieve this purpose, the technical scheme adopted by the invention is an end-to-end music score note identification method based on deep learning, divided into three steps:
(1) Data preprocessing: the corresponding data set is downloaded from MuseScore, and the pitch and duration labels are re-encoded.
(2) Data enhancement: the invention provides 4 different enhancement methods for augmenting the re-encoded music score data.
(3) End-to-end model: Fig. 1 illustrates the deep convolutional neural network model applied to end-to-end score note recognition; the enhanced data is input into the model, and the model outputs note durations and pitches.
(1) Data pre-processing
From a corpus of selected MusicXML files, a data set of music score images and corresponding note annotations is created. Each MusicXML file is converted into a score image using MuseScore, and the label for a score image is represented by vectors consisting of pitch, duration, and the position of the note bounding box. Each note is represented by two values: pitch and duration. In the present invention, pitch is re-encoded as a vertical distance, i.e. the note's offset along the vertical axis of the staff. The pitch value of a note is determined by this vertical distance from the staff; as shown in Fig. 2, the numbers at the sides indicate the pitch labels, with the red note labeled 5 and the yellow note labeled -2. Fig. 3 shows the duration encoding: the Note column shows the note forms for different durations, the Duration column gives the note's length, and the Label column gives the encoded duration label. Durations are measured in units of quarter notes. The duration and pitch labels of each score are encoded according to this scheme.
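The label re-encoding above can be sketched in Python as follows. The function names and the exact duration-to-label table are illustrative assumptions (the patent gives the table only in Fig. 2 and Fig. 3), but the scheme matches the description: pitch is a signed vertical staff offset, and duration is a class index over quarter-note multiples.

```python
# Illustrative label encoding; DURATION_CLASSES is an assumed mapping from
# note length (in quarter notes) to the encoded duration label of Fig. 3.
DURATION_CLASSES = {0.25: 0, 0.5: 1, 1.0: 2, 2.0: 3, 4.0: 4}  # 16th .. whole note

def encode_pitch(staff_offset: int) -> int:
    """Pitch label: signed vertical distance of the note head on the staff."""
    return staff_offset  # e.g. the red note of Fig. 2 -> 5, the yellow note -> -2

def encode_duration(quarter_length: float) -> int:
    """Duration label: class index of the note length in quarter-note units."""
    return DURATION_CLASSES[quarter_length]

def encode_note_label(staff_offset, quarter_length, bbox):
    """Per-note label vector: pitch, duration class, and bounding-box position."""
    return [encode_pitch(staff_offset), encode_duration(quarter_length), *bbox]
```

The bounding-box format (here a generic x, y, w, h tuple) is likewise an assumption; the patent only states that the label vector contains pitch, duration, and a box position.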
(2) Data enhancement
Computer-generated score images contain no noise or variation, so a model trained only on them does not generalize. To make the model of the present invention robust to lower-quality inputs and different types of score images, the invention proposes 4 different enhancement methods, each simulating a source of input noise in a natural environment. As shown in Fig. 4: image a is processed with Gaussian blur; image b uses an elastic transformation to change the viewing angle; image c undergoes an affine transformation, rotating the image 5 degrees to the left; image d undergoes a color transformation to simulate the effect of lighting on the image.
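One of the four enhancements, Gaussian blur, can be sketched as a separable NumPy convolution. This is a minimal illustration under assumed parameters (the patent does not state a kernel size or sigma), not the patented implementation:

```python
import numpy as np

def gaussian_kernel1d(sigma: float) -> np.ndarray:
    """Normalized 1-D Gaussian kernel with a 3-sigma radius."""
    radius = max(1, int(3 * sigma))
    x = np.arange(-radius, radius + 1)
    k = np.exp(-x**2 / (2 * sigma**2))
    return k / k.sum()

def gaussian_blur(img: np.ndarray, sigma: float = 1.0) -> np.ndarray:
    """Blur a grayscale score image by convolving rows, then columns."""
    k = gaussian_kernel1d(sigma)
    blurred = np.apply_along_axis(lambda r: np.convolve(r, k, mode="same"), 1, img)
    return np.apply_along_axis(lambda c: np.convolve(c, k, mode="same"), 0, blurred)
```

Separating the 2-D blur into two 1-D passes is a standard optimization; it produces the same result as a full 2-D Gaussian convolution at far lower cost.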
(3) End-to-end note recognition model
The note recognition model works as follows: the score image is input into a convolutional neural network, and a feature map of the score image is extracted through a series of convolution, residual, and concatenation operations; the note durations and pitches are then classified on the feature map, and the note bounding boxes are regressed.
To give the notes a sufficiently large receptive field, the model uses the backbone network of YOLOv3 to extract features. The network is divided into 5 stages, conv1_x, conv2_x, conv3_x, conv4_x and conv5_x, containing 1, 2, 8, 8 and 6 building blocks respectively; each building block consists of 2 convolutional layers and one residual connection. Because small objects lose features after repeated convolution, the feature map output by the YOLOv3 backbone is upsampled by a factor of 8 and fused with a feature map from a lower layer of the network to obtain more comprehensive feature information.
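The stage structure above can be summarized numerically, with a toy residual block standing in for a real building block. The dense-layer block is an illustrative simplification (the actual blocks use 1x1 and 3x3 convolutions), not the patented network:

```python
import numpy as np

# Stage structure of the YOLOv3 backbone: building blocks in conv1_x .. conv5_x.
BLOCKS_PER_STAGE = [1, 2, 8, 8, 6]
# Each building block holds 2 convolutional layers plus a residual connection,
# so the blocks alone contribute 2 * (1 + 2 + 8 + 8 + 6) = 50 conv layers.
CONV_LAYERS_IN_BLOCKS = sum(2 * b for b in BLOCKS_PER_STAGE)

def residual_block(x: np.ndarray, w1: np.ndarray, w2: np.ndarray) -> np.ndarray:
    """Toy building block: two transforms and a skip connection, i.e. x + F(x)."""
    leaky = lambda z: np.where(z > 0, z, 0.1 * z)  # leaky-ReLU-style activation
    return x + leaky(leaky(x @ w1) @ w2)
```

The skip connection is what lets a 25-block network train stably: even if a block's transform contributes nothing, the input signal passes through unchanged.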
As shown in Fig. 5, after the convolutional neural network outputs the feature map, an n-dimensional feature vector is generated through an intermediate layer for each pixel of the feature map, where n = 7 x (confidence + candidate box coordinates + pitch classes + duration classes), i.e. 7 target candidate regions are generated per feature vector. For each target candidate region, a sigmoid activation function yields the target box confidence, the candidate box coordinates, the note pitch, and the note duration, enabling multi-task training.
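The per-cell decoding described above can be sketched as follows. The pitch and duration class counts are assumptions (the patent does not state them), but the layout of 7 candidate regions, each scored through a sigmoid, follows the description:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Assumed class counts for illustration; the patent does not specify them.
N_PITCH, N_DURATION, N_ANCHORS = 21, 5, 7
PER_ANCHOR = 1 + 4 + N_PITCH + N_DURATION  # confidence + box + pitch + duration

def decode_cell(raw: np.ndarray):
    """Decode one feature-map cell's raw n-dimensional vector
    (n = 7 * PER_ANCHOR) into 7 candidate detections via the sigmoid."""
    detections = []
    for a in raw.reshape(N_ANCHORS, PER_ANCHOR):
        conf = float(sigmoid(a[0]))                       # target box confidence
        box = sigmoid(a[1:5])                             # candidate box coords
        pitch_cls = int(np.argmax(sigmoid(a[5:5 + N_PITCH])))
        dur_cls = int(np.argmax(sigmoid(a[5 + N_PITCH:])))
        detections.append((conf, box, pitch_cls, dur_cls))
    return detections
```

Because confidence, box, pitch, and duration are all read from the same feature vector, one backward pass trains all four tasks jointly, which is the multi-task training the description refers to.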
The invention provides an end-to-end note recognition model for printed music scores, applying a deep convolutional neural network to detect note bounding boxes and recognize their durations and pitches. Experimental results show that recognizing a whole score image takes only 1 second, with a duration accuracy of 96% and a pitch accuracy of 98%; the results are shown in Fig. 6.
The core technology of the invention comprises:
(1) A score data set suited to the target detection algorithm is generated; the data set contains 10,000 samples in total, each consisting of a score image and its corresponding label.
(2) Four data enhancement methods, namely blur, elastic transformation, color transformation and affine transformation, are introduced to simulate scores in natural scenes and improve the model's generalization ability.
(3) An end-to-end note recognition model is constructed, achieving a note-head average precision of 0.87, a duration accuracy of 0.96 and a pitch accuracy of 0.98.
Drawings
Fig. 1 is an end-to-end model.
Fig. 2 is a pitch label diagram.
Fig. 3 is a graph of a duration label.
Fig. 4 is a data enhancement diagram.
Fig. 5 is a graph of the network loss function.
FIG. 6 is a graph showing the results of detection.
Detailed Description
The corpus of the present invention consists of 10,000 MusicXML files downloaded from the MuseScore dataset, from which a data set of score images and corresponding labels is created. The process has two stages: the MusicXML files are downloaded from MuseScore and converted into vector graphics (SVG) files; the SVG files are then parsed to obtain each symbol's bounding box, duration, and pitch. The data is divided into three subsets: 60% for training, 15% for validation, and 25% for evaluating the model.
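The 60/15/25 split can be sketched as a partition of sample indices. The shuffling and seed are assumptions (the patent specifies only the proportions):

```python
import random

def split_dataset(n_samples: int = 10_000, seed: int = 0):
    """Partition sample indices 60/15/25 into train/validation/evaluation sets."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)  # deterministic shuffle for reproducibility
    n_train, n_val = int(0.60 * n_samples), int(0.15 * n_samples)
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]
```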
Data enhancement: for each selected music score image, the whole music score image is cut into a, b, c and d 4 images to amplify the data set, so that the total data amount is enlarged by 4 times. Then 4 data enhancement methods of fuzzy, elastic transformation, color transformation and affine transformation are adopted to process the cut music score image and input the music score image to the neural network model.
After the data is input into the neural network model, the model is trained with a stochastic gradient descent optimizer using a batch size of 32 and an initial learning rate of 0.001; the learning rate decays throughout training, halving every ten epochs. After approximately 40 epochs the model begins to converge. Training on a single Nvidia Titan X takes approximately 6 hours.
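The stated schedule (initial rate 0.001, halved every ten epochs) reduces to a one-line function; this is a sketch of the schedule as described, not code from the patent:

```python
def learning_rate(epoch: int, base_lr: float = 0.001) -> float:
    """Step decay: the learning rate is halved once every ten epochs."""
    return base_lr * 0.5 ** (epoch // 10)
```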
After the neural network model is trained, an image is input and the model outputs each note's bounding box, pitch, and duration.

Claims (4)

1. An end-to-end music score note identification method based on deep learning, characterized in that the method comprises three steps:
(1) data preprocessing: downloading the corresponding data set from MuseScore, and re-encoding the pitch and duration labels;
(2) data enhancement: applying the 4 different enhancement methods provided by the invention to the re-encoded music score data;
(3) end-to-end model: applying a deep convolutional neural network model to end-to-end music score note identification, inputting the enhanced data into the model, the model outputting note durations and pitches.
2. The deep learning based end-to-end score note recognition method of claim 1, wherein: a data set of music score images and corresponding note annotations is created from a corpus of selected MusicXML files; each MusicXML file is converted into a score image using MuseScore, and the label corresponding to the score image is represented by a vector consisting of pitch, duration, and the position of the note bounding box; each note is represented by two values: pitch and duration; pitch is re-encoded as a vertical distance, i.e. the note's offset along the vertical axis of the staff; the pitch value of a note is determined by this vertical distance, the numbers at the sides indicating the pitch labels, the red note being labeled 5 and the yellow note labeled -2; the Note column shows the note forms for different durations, the Duration column gives the note's length, and the Label column gives the encoded duration label; durations are measured in units of quarter notes; the duration and pitch labels of the score are encoded as described above.
3. The deep learning based end-to-end score note recognition method of claim 1, wherein:
computer-generated score images contain no noise or variation, so a model trained only on them does not generalize; to make the model robust to lower-quality inputs and different types of score images, enhancement methods simulating input noise sources in a natural environment are provided, namely: the images are processed with Gaussian blur, subjected to an affine transformation rotating them 5 degrees to the left, subjected to an elastic transformation changing the viewing angle, and subjected to a color transformation simulating the effect of lighting.
4. The deep learning based end-to-end score note recognition method of claim 1, wherein:
the note recognition model works as follows: the score image is input into a convolutional neural network, and a feature map of the score image is extracted through a series of convolution, residual, and concatenation operations; the note durations and pitches are then classified on the feature map, and the note bounding boxes are regressed;
to give the notes a sufficiently large receptive field, the model uses the backbone network of YOLOv3 to extract features; the network is divided into 5 stages, conv1_x, conv2_x, conv3_x, conv4_x and conv5_x, containing 1, 2, 8, 8 and 6 building blocks respectively, each building block consisting of 2 convolutional layers and one residual connection; because small objects lose features after convolution, the feature map output by the YOLOv3 backbone is upsampled by a factor of 8 and fused with a lower-layer feature map to obtain more comprehensive feature information;
after the convolutional neural network outputs the feature map, an n-dimensional feature vector is generated through an intermediate layer for each pixel of the feature map, where n = 7 x (confidence + candidate box coordinates + pitch classes + duration classes), i.e. 7 target candidate regions are generated per feature vector; for each target candidate region, a sigmoid activation function yields the target box confidence, the candidate box coordinates, the note pitch and the note duration, enabling multi-task training.
CN201911090621.2A 2019-11-09 2019-11-09 End-to-end music score note identification method based on deep learning Pending CN110852375A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911090621.2A CN110852375A (en) 2019-11-09 2019-11-09 End-to-end music score note identification method based on deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911090621.2A CN110852375A (en) 2019-11-09 2019-11-09 End-to-end music score note identification method based on deep learning

Publications (1)

Publication Number Publication Date
CN110852375A (en) 2020-02-28

Family

ID=69599934

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911090621.2A Pending CN110852375A (en) 2019-11-09 2019-11-09 End-to-end music score note identification method based on deep learning

Country Status (1)

Country Link
CN (1) CN110852375A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112686104A (en) * 2020-12-19 2021-04-20 北京工业大学 Deep learning-based multi-vocal music score identification method
CN112686272A (en) * 2020-12-19 2021-04-20 北京工业大学 Handwritten music score spectral line deleting method based on deep learning
CN114332903A (en) * 2021-12-02 2022-04-12 厦门大学 Lute music score identification method and system based on end-to-end neural network
JP2022151387A (en) * 2021-03-27 2022-10-07 知行 宍戸 Method for generating music information from musical score image and computing device thereof and program
CN112686104B (en) * 2020-12-19 2024-05-28 北京工业大学 Multi-sound part music score recognition method based on deep learning

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105611477A (en) * 2015-12-27 2016-05-25 北京工业大学 Depth and breadth neural network combined speech enhancement algorithm of digital hearing aid
CN106446952A (en) * 2016-09-28 2017-02-22 北京邮电大学 Method and apparatus for recognizing score image
CN107888843A (en) * 2017-10-13 2018-04-06 深圳市迅雷网络技术有限公司 Sound mixing method, device, storage medium and the terminal device of user's original content


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Zhiqing Huang et al., "State-of-the-Art Model for Music Object Recognition with Deep Learning", Applied Sciences. *


Similar Documents

Publication Publication Date Title
CN108804530B (en) Subtitling areas of an image
CN111444343B (en) Cross-border national culture text classification method based on knowledge representation
Kadam et al. Detection and localization of multiple image splicing using MobileNet V1
CN111582241A (en) Video subtitle recognition method, device, equipment and storage medium
Krishnan et al. Textstylebrush: transfer of text aesthetics from a single example
CN110852375A (en) End-to-end music score note identification method based on deep learning
CN112149603B (en) Cross-modal data augmentation-based continuous sign language identification method
CN112541501A (en) Scene character recognition method based on visual language modeling network
Pacha et al. Towards self-learning optical music recognition
CN110580458A (en) music score image recognition method combining multi-scale residual error type CNN and SRU
CN111523420A (en) Header classification and header list semantic identification method based on multitask deep neural network
CN113283336A (en) Text recognition method and system
Devi S et al. A deep learning approach for recognizing the cursive Tamil characters in palm leaf manuscripts
CN113140020A (en) Method for generating image based on text of countermeasure network generated by accompanying supervision
CN114419174A (en) On-line handwritten text synthesis method, device and storage medium
Vankadaru et al. Text Identification from Handwritten Data using Bi-LSTM and CNN with FastAI
CN114444488B (en) Few-sample machine reading understanding method, system, equipment and storage medium
Kaddoura A Primer on Generative Adversarial Networks
Bajpai et al. Custom dataset creation with tensorflow framework and image processing for google t-rex
Jia et al. Printed score detection based on deep learning
Baró-Mas Optical music recognition by long short-term memory recurrent neural networks
CN112686104B (en) Multi-sound part music score recognition method based on deep learning
Thuon et al. Generate, transform, and clean: the role of GANs and transformers in palm leaf manuscript generation and enhancement
Liang Analysis of Emotional Deconstruction and the Role of Emotional Value for Learners in Animation Works Based on Digital Multimedia Technology
Liang et al. HFENet: Hybrid Feature Enhancement Network for Detecting Texts in Scenes and Traffic Panels

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20200228