CN110852375A - End-to-end music score note identification method based on deep learning - Google Patents


Info

Publication number
CN110852375A
CN110852375A (application CN201911090621.2A)
Authority
CN
China
Prior art keywords
note
music score
duration
pitch
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911090621.2A
Other languages
Chinese (zh)
Inventor
黄志清
贾翔
王师凯
张煜森
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Technology
Original Assignee
Beijing University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Technology filed Critical Beijing University of Technology
Priority to CN201911090621.2A priority Critical patent/CN110852375A/en
Publication of CN110852375A publication Critical patent/CN110852375A/en
Pending legal-status Critical Current

Classifications

    • G06F 18/214: Pattern recognition; generating training patterns, bootstrap methods, e.g. bagging or boosting
    • G06N 3/045: Neural network architectures; combinations of networks
    • G06N 3/048: Neural network architectures; activation functions
    • G06N 3/08: Neural networks; learning methods
    • G06V 2201/07: Image or video recognition; target detection


Abstract

The invention discloses an end-to-end music score note identification method based on deep learning, comprising three steps: (1) data preprocessing: the corresponding data set is downloaded from MuseScore, and the pitch and duration labels are re-encoded; (2) data enhancement: the invention provides 4 different enhancement methods for augmenting the re-encoded music score data; (3) end-to-end model: a deep convolutional neural network model is applied to end-to-end music score note identification, the enhanced data is input into the model, and the model outputs note durations and pitches. For printed music scores, the invention provides a deep-learning-based note recognition model: the whole score image is input to the model, which directly outputs the duration and pitch of the notes on the score. The model is completely end-to-end and can accurately recognize polyphonic music score images.

Description

End-to-end music score note identification method based on deep learning
Technical Field
The invention belongs to the field of optical music score recognition, and relates to an end-to-end neural network recognition method based on deep learning, which can be applied to music score note recognition.
Background
Optical music score recognition applies optical character recognition to music, converting scores into editable or playable forms such as MIDI (for playback) and MusicXML (for page layout). Notes make up a very high proportion of a score's symbols relative to all others; they record pitch and duration and carry important semantic information. Note recognition is therefore the core and key of score recognition. Notes take widely varying forms, and this diversity and polymorphism make them difficult to recognize. Traditional note recognition methods must first delete the staff lines, then extract primitive symbols, and finally combine the primitives to complete recognition; the whole pipeline is complicated, and every step can degrade the final note recognition accuracy.
In recent years, breakthroughs in deep learning for computer vision have profoundly changed how optical music recognition (OMR) is approached, and more and more research focuses on solving OMR with deep learning. The methods fall roughly into two categories: target detection and sequence recognition. However, current deep-learning-based target detection methods cannot identify the pitch and duration of notes, and sequence recognition methods suffer from low accuracy when processing polyphonic scores.
Disclosure of Invention
The invention aims to provide a deep-learning-based note recognition model for printed music scores: a whole score image is input into the model, which directly outputs the duration and pitch of the notes on the score. The model is completely end-to-end and can accurately recognize polyphonic music score images.
In order to achieve this purpose, the technical scheme adopted by the invention is an end-to-end music score note identification method based on deep learning, divided into three steps:
(1) Data preprocessing: the corresponding data set is downloaded from MuseScore, and the pitch and duration labels are re-encoded.
(2) Data enhancement: the invention provides 4 different enhancement methods for augmenting the re-encoded music score data.
(3) End-to-end model: Fig. 1 illustrates the deep convolutional neural network model applied to end-to-end score note recognition; the enhanced data is input into the model, and the model outputs note durations and pitches.
(1) Data pre-processing
From a corpus of selected MusicXML files, a data set of music score images and corresponding note annotations is created. Each MusicXML file is converted into a score image using MuseScore, and the label for a score image is represented by vectors consisting of pitch, duration, and the position of the note bounding box. Each note is represented by two values: pitch and duration. In the present invention, pitch is re-encoded as a vertical distance, i.e. the note's offset along the vertical axis of the staff. The pitch value of a note is determined by this vertical distance from the staff; as shown in Fig. 2, the numbers at the sides indicate the pitch labels, with the red note labeled 5 and the yellow note labeled -2. Fig. 3 shows the duration encoding: the Note column shows the note forms for different durations, the Duration column gives the note's length, and the Label column gives the encoded duration label. Durations are measured in units of quarter notes. The duration and pitch labels of each score are encoded according to this scheme.
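The label re-encoding above can be sketched in Python as follows. The function names and the exact duration-to-label table are illustrative assumptions (the patent gives the table only in Fig. 2 and Fig. 3), but the scheme matches the description: pitch is a signed vertical staff offset, and duration is a class index over quarter-note multiples.

```python
# Illustrative label encoding; DURATION_CLASSES is an assumed mapping from
# note length (in quarter notes) to the encoded duration label of Fig. 3.
DURATION_CLASSES = {0.25: 0, 0.5: 1, 1.0: 2, 2.0: 3, 4.0: 4}  # 16th .. whole note

def encode_pitch(staff_offset: int) -> int:
    """Pitch label: signed vertical distance of the note head on the staff."""
    return staff_offset  # e.g. the red note of Fig. 2 -> 5, the yellow note -> -2

def encode_duration(quarter_length: float) -> int:
    """Duration label: class index of the note length in quarter-note units."""
    return DURATION_CLASSES[quarter_length]

def encode_note_label(staff_offset, quarter_length, bbox):
    """Per-note label vector: pitch, duration class, and bounding-box position."""
    return [encode_pitch(staff_offset), encode_duration(quarter_length), *bbox]
```

The bounding-box format (here a generic x, y, w, h tuple) is likewise an assumption; the patent only states that the label vector contains pitch, duration, and a box position.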
(2) Data enhancement
Computer-generated score images contain no noise or variation, so a model trained only on them does not generalize. To make the model of the present invention robust to lower-quality inputs and different types of score images, the invention proposes 4 different enhancement methods, each simulating a source of input noise in a natural environment. As shown in Fig. 4: image a is processed with Gaussian blur; image b uses an elastic transformation to change the viewing angle; image c undergoes an affine transformation, rotating the image 5 degrees to the left; image d undergoes a color transformation to simulate the effect of lighting on the image.
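One of the four enhancements, Gaussian blur, can be sketched as a separable NumPy convolution. This is a minimal illustration under assumed parameters (the patent does not state a kernel size or sigma), not the patented implementation:

```python
import numpy as np

def gaussian_kernel1d(sigma: float) -> np.ndarray:
    """Normalized 1-D Gaussian kernel with a 3-sigma radius."""
    radius = max(1, int(3 * sigma))
    x = np.arange(-radius, radius + 1)
    k = np.exp(-x**2 / (2 * sigma**2))
    return k / k.sum()

def gaussian_blur(img: np.ndarray, sigma: float = 1.0) -> np.ndarray:
    """Blur a grayscale score image by convolving rows, then columns."""
    k = gaussian_kernel1d(sigma)
    blurred = np.apply_along_axis(lambda r: np.convolve(r, k, mode="same"), 1, img)
    return np.apply_along_axis(lambda c: np.convolve(c, k, mode="same"), 0, blurred)
```

Separating the 2-D blur into two 1-D passes is a standard optimization; it produces the same result as a full 2-D Gaussian convolution at far lower cost.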
(3) End-to-end note recognition model
The note recognition model works as follows: the score image is input into a convolutional neural network, and a feature map of the score image is extracted through a series of convolution, residual, and concatenation operations; the note durations and pitches are then classified on the feature map, and the note bounding boxes are regressed.
To give the notes a sufficiently large receptive field, the model uses the backbone network of YOLOv3 to extract features. The network is divided into 5 stages, conv1_x, conv2_x, conv3_x, conv4_x and conv5_x, containing 1, 2, 8, 8 and 6 building blocks respectively; each building block consists of 2 convolutional layers and one residual connection. Because small objects lose features after repeated convolution, the feature map output by the YOLOv3 backbone is upsampled by a factor of 8 and fused with a feature map from a lower layer of the network to obtain more comprehensive feature information.
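The stage structure above can be summarized numerically, with a toy residual block standing in for a real building block. The dense-layer block is an illustrative simplification (the actual blocks use 1x1 and 3x3 convolutions), not the patented network:

```python
import numpy as np

# Stage structure of the YOLOv3 backbone: building blocks in conv1_x .. conv5_x.
BLOCKS_PER_STAGE = [1, 2, 8, 8, 6]
# Each building block holds 2 convolutional layers plus a residual connection,
# so the blocks alone contribute 2 * (1 + 2 + 8 + 8 + 6) = 50 conv layers.
CONV_LAYERS_IN_BLOCKS = sum(2 * b for b in BLOCKS_PER_STAGE)

def residual_block(x: np.ndarray, w1: np.ndarray, w2: np.ndarray) -> np.ndarray:
    """Toy building block: two transforms and a skip connection, i.e. x + F(x)."""
    leaky = lambda z: np.where(z > 0, z, 0.1 * z)  # leaky-ReLU-style activation
    return x + leaky(leaky(x @ w1) @ w2)
```

The skip connection is what lets a 25-block network train stably: even if a block's transform contributes nothing, the input signal passes through unchanged.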
As shown in Fig. 5, after the convolutional neural network outputs the feature map, an n-dimensional feature vector is generated through an intermediate layer for each pixel of the feature map, where n = 7 x (confidence + candidate box coordinates + pitch classes + duration classes), i.e. 7 target candidate regions are generated per feature vector. For each target candidate region, a sigmoid activation function yields the target box confidence, the candidate box coordinates, the note pitch, and the note duration, enabling multi-task training.
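The per-cell decoding described above can be sketched as follows. The pitch and duration class counts are assumptions (the patent does not state them), but the layout of 7 candidate regions, each scored through a sigmoid, follows the description:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Assumed class counts for illustration; the patent does not specify them.
N_PITCH, N_DURATION, N_ANCHORS = 21, 5, 7
PER_ANCHOR = 1 + 4 + N_PITCH + N_DURATION  # confidence + box + pitch + duration

def decode_cell(raw: np.ndarray):
    """Decode one feature-map cell's raw n-dimensional vector
    (n = 7 * PER_ANCHOR) into 7 candidate detections via the sigmoid."""
    detections = []
    for a in raw.reshape(N_ANCHORS, PER_ANCHOR):
        conf = float(sigmoid(a[0]))                       # target box confidence
        box = sigmoid(a[1:5])                             # candidate box coords
        pitch_cls = int(np.argmax(sigmoid(a[5:5 + N_PITCH])))
        dur_cls = int(np.argmax(sigmoid(a[5 + N_PITCH:])))
        detections.append((conf, box, pitch_cls, dur_cls))
    return detections
```

Because confidence, box, pitch, and duration are all read from the same feature vector, one backward pass trains all four tasks jointly, which is the multi-task training the description refers to.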
The invention provides an end-to-end note recognition model for printed music scores, applying a deep convolutional neural network to detect note bounding boxes and recognize their durations and pitches. Experimental results show that recognizing a whole score image takes only 1 second, with a duration accuracy of 96% and a pitch accuracy of 98%; the results are shown in Fig. 6.
The core technology of the invention comprises:
(1) A score data set suited to the target detection algorithm is generated; the data set contains 10,000 samples in total, each consisting of a score image and its corresponding label.
(2) Four data enhancement methods, namely blur, elastic transformation, color transformation and affine transformation, are introduced to simulate scores in natural scenes and improve the model's generalization ability.
(3) An end-to-end note recognition model is constructed, achieving a note-head average precision of 0.87, a duration accuracy of 0.96 and a pitch accuracy of 0.98.
Drawings
Fig. 1 is an end-to-end model.
Fig. 2 is a pitch label diagram.
Fig. 3 is a graph of a duration label.
Fig. 4 is a data enhancement diagram.
Fig. 5 is a graph of the network loss function.
FIG. 6 is a graph showing the results of detection.
Detailed Description
The corpus of the present invention consists of 10,000 MusicXML files downloaded from the MuseScore dataset, from which a data set of score images and corresponding labels is created. The process has two stages: the MusicXML files are downloaded from MuseScore and converted into vector graphics (SVG) files; the SVG files are then parsed to obtain each symbol's bounding box, duration, and pitch. The data is divided into three subsets: 60% for training, 15% for validation, and 25% for evaluating the model.
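The 60/15/25 split can be sketched as a partition of sample indices. The shuffling and seed are assumptions (the patent specifies only the proportions):

```python
import random

def split_dataset(n_samples: int = 10_000, seed: int = 0):
    """Partition sample indices 60/15/25 into train/validation/evaluation sets."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)  # deterministic shuffle for reproducibility
    n_train, n_val = int(0.60 * n_samples), int(0.15 * n_samples)
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]
```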
Data enhancement: for each selected music score image, the whole music score image is cut into a, b, c and d 4 images to amplify the data set, so that the total data amount is enlarged by 4 times. Then 4 data enhancement methods of fuzzy, elastic transformation, color transformation and affine transformation are adopted to process the cut music score image and input the music score image to the neural network model.
After the data is input into the neural network model, the model is trained with a stochastic gradient descent optimizer using a batch size of 32 and an initial learning rate of 0.001; the learning rate decays throughout training, halving every ten epochs. After approximately 40 epochs the model begins to converge. Training on a single Nvidia Titan X takes approximately 6 hours.
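The stated schedule (initial rate 0.001, halved every ten epochs) reduces to a one-line function; this is a sketch of the schedule as described, not code from the patent:

```python
def learning_rate(epoch: int, base_lr: float = 0.001) -> float:
    """Step decay: the learning rate is halved once every ten epochs."""
    return base_lr * 0.5 ** (epoch // 10)
```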
After the neural network model is trained, an image is input and the model outputs each note's bounding box, pitch, and duration.

Claims (4)

1. An end-to-end music score note identification method based on deep learning, characterized in that the method comprises three steps:
(1) data preprocessing: downloading the corresponding data set from MuseScore, and re-encoding the pitch and duration labels;
(2) data enhancement: applying the 4 different enhancement methods provided by the invention to the re-encoded music score data;
(3) end-to-end model: applying a deep convolutional neural network model to end-to-end music score note identification, inputting the enhanced data into the model, the model outputting note durations and pitches.
2. The deep learning based end-to-end score note recognition method of claim 1, wherein: a data set of music score images and corresponding note annotations is created from a corpus of selected MusicXML files; each MusicXML file is converted into a score image using MuseScore, and the label corresponding to the score image is represented by a vector consisting of pitch, duration, and the position of the note bounding box; each note is represented by two values: pitch and duration; pitch is re-encoded as a vertical distance, i.e. the note's offset along the vertical axis of the staff; the pitch value of a note is determined by this vertical distance, the numbers at the sides indicating the pitch labels, the red note being labeled 5 and the yellow note labeled -2; the Note column shows the note forms for different durations, the Duration column gives the note's length, and the Label column gives the encoded duration label; durations are measured in units of quarter notes; the duration and pitch labels of the score are encoded as described above.
3. The deep learning based end-to-end score note recognition method of claim 1, wherein:
computer-generated score images contain no noise or variation, so a model trained only on them does not generalize; to make the model robust to lower-quality inputs and different types of score images, enhancement methods simulating input noise sources in a natural environment are provided, namely: the images are processed with Gaussian blur, subjected to an affine transformation rotating them 5 degrees to the left, subjected to an elastic transformation changing the viewing angle, and subjected to a color transformation simulating the effect of lighting.
4. The deep learning based end-to-end score note recognition method of claim 1, wherein:
the note recognition model works as follows: the score image is input into a convolutional neural network, and a feature map of the score image is extracted through a series of convolution, residual, and concatenation operations; the note durations and pitches are then classified on the feature map, and the note bounding boxes are regressed;
to give the notes a sufficiently large receptive field, the model uses the backbone network of YOLOv3 to extract features; the network is divided into 5 stages, conv1_x, conv2_x, conv3_x, conv4_x and conv5_x, containing 1, 2, 8, 8 and 6 building blocks respectively, each building block consisting of 2 convolutional layers and one residual connection; because small objects lose features after convolution, the feature map output by the YOLOv3 backbone is upsampled by a factor of 8 and fused with a lower-layer feature map to obtain more comprehensive feature information;
after the convolutional neural network outputs the feature map, an n-dimensional feature vector is generated through an intermediate layer for each pixel of the feature map, where n = 7 x (confidence + candidate box coordinates + pitch classes + duration classes), i.e. 7 target candidate regions are generated per feature vector; for each target candidate region, a sigmoid activation function yields the target box confidence, the candidate box coordinates, the note pitch and the note duration, enabling multi-task training.
CN201911090621.2A 2019-11-09 2019-11-09 End-to-end music score note identification method based on deep learning Pending CN110852375A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911090621.2A CN110852375A (en) 2019-11-09 2019-11-09 End-to-end music score note identification method based on deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911090621.2A CN110852375A (en) 2019-11-09 2019-11-09 End-to-end music score note identification method based on deep learning

Publications (1)

Publication Number Publication Date
CN110852375A (en) 2020-02-28

Family

ID=69599934

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911090621.2A Pending CN110852375A (en) 2019-11-09 2019-11-09 End-to-end music score note identification method based on deep learning

Country Status (1)

Country Link
CN (1) CN110852375A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112686104A (en) * 2020-12-19 2021-04-20 北京工业大学 Deep learning-based multi-vocal music score identification method
CN112686272A (en) * 2020-12-19 2021-04-20 北京工业大学 Handwritten music score spectral line deleting method based on deep learning
CN114332903A (en) * 2021-12-02 2022-04-12 厦门大学 Lute music score identification method and system based on end-to-end neural network
JP2022151387A (en) * 2021-03-27 2022-10-07 知行 宍戸 Method for generating music information from musical score image and computing device thereof and program
CN112686104B (en) * 2020-12-19 2024-05-28 北京工业大学 Multi-sound part music score recognition method based on deep learning

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105611477A (en) * 2015-12-27 2016-05-25 北京工业大学 Depth and breadth neural network combined speech enhancement algorithm of digital hearing aid
CN106446952A (en) * 2016-09-28 2017-02-22 北京邮电大学 Method and apparatus for recognizing score image
CN107888843A (en) * 2017-10-13 2018-04-06 深圳市迅雷网络技术有限公司 Sound mixing method, device, storage medium and the terminal device of user's original content


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Zhiqing Huang et al., "State-of-the-Art Model for Music Object Recognition with Deep Learning", Applied Sciences. *


Similar Documents

Publication Publication Date Title
CN108804530B (en) Subtitling areas of an image
CN111444343B (en) Cross-border national culture text classification method based on knowledge representation
Kadam et al. Detection and localization of multiple image splicing using MobileNet V1
CN111582241A (en) Video subtitle recognition method, device, equipment and storage medium
Krishnan et al. Textstylebrush: transfer of text aesthetics from a single example
CN110852375A (en) End-to-end music score note identification method based on deep learning
CN112149603B (en) Cross-modal data augmentation-based continuous sign language identification method
CN112541501A (en) Scene character recognition method based on visual language modeling network
Pacha et al. Towards self-learning optical music recognition
CN110580458A (en) music score image recognition method combining multi-scale residual error type CNN and SRU
CN111523420A (en) Header classification and header list semantic identification method based on multitask deep neural network
CN113283336A (en) Text recognition method and system
Devi S et al. A deep learning approach for recognizing the cursive Tamil characters in palm leaf manuscripts
CN113140020A (en) Method for generating image based on text of countermeasure network generated by accompanying supervision
CN114419174A (en) On-line handwritten text synthesis method, device and storage medium
Vankadaru et al. Text Identification from Handwritten Data using Bi-LSTM and CNN with FastAI
CN114444488B (en) Few-sample machine reading understanding method, system, equipment and storage medium
Kaddoura A Primer on Generative Adversarial Networks
Bajpai et al. Custom dataset creation with tensorflow framework and image processing for google t-rex
Jia et al. Printed score detection based on deep learning
Baró-Mas Optical music recognition by long short-term memory recurrent neural networks
CN112686104B (en) Multi-sound part music score recognition method based on deep learning
Thuon et al. Generate, transform, and clean: the role of GANs and transformers in palm leaf manuscript generation and enhancement
Liang Analysis of Emotional Deconstruction and the Role of Emotional Value for Learners in Animation Works Based on Digital Multimedia Technology
Liang et al. HFENet: Hybrid Feature Enhancement Network for Detecting Texts in Scenes and Traffic Panels

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20200228