JP6985543B1

JP6985543B1 - Method for creating music information from musical score images, and computing device and program therefor

Info

Publication number: JP6985543B1
Application number: JP2021054429A
Authority: JP
Inventors: 知行宍戸; 靖弘小野; ファティフェヒミユ; 大輔徳重
Original assignee: Individual
Current assignee: Individual
Priority date: 2021-03-27
Filing date: 2021-03-27
Publication date: 2021-12-22
Anticipated expiration: 2041-03-27
Also published as: JP2022151387A

Abstract

PROBLEM TO BE SOLVED: To provide a method, a computing device, and a program for creating music information from a musical score image. SOLUTION: A method of creating music information from a musical score image, the method including: a step of inputting a musical score image; a step of extracting at least one measure from the musical score image, optionally using a deep learning model; optionally, a step of correcting the position of the staff lines within each of the at least one measure; a step of identifying the notes within each of the at least one measure, optionally using a plurality of deep learning models; and a step of creating music information from the identified notes. [Selected drawing] Fig. 1

Description

The present invention relates to a method, a computing device, and a program for creating music information from a musical score image.

Optical music recognition (OMR) is a field of research that investigates how to read music notation in documents computationally. The goal of OMR is to have a computer read and interpret sheet music and produce a machine-readable version of the written score. An OMR pipeline is commonly divided into four stages: preprocessing, music symbol recognition, musical notation reconstruction, and final representation construction (Non-Patent Document 1).

As for specific processing, Patent Document 1 discloses a score recognition device that recognizes the staves, notes, symbols, and their positions in a score from image data obtained by scanning the score, and that generates information such as the pitch, onset timing, and duration of each tone based on the recognition result. The device performs (1) preprocessing (staff and bar line recognition, skew correction, staff removal, and beam removal), (2) object recognition (circumscribed rectangle search and matching), (3) event recognition (pitch recognition and note length recognition) and performance data creation, and (4) automatic performance (MIDI data creation and output).

Patent Document 2 discloses a score recognition device comprising an image acquisition means that acquires, from an image reading means, an image containing the information of a paper score, and a score symbol recognition means that recognizes the score symbols contained in the acquired image using a plurality of score symbol recognition methods and outputs a plurality of recognition results. The score symbol recognition means performs staff recognition, paragraph recognition, score symbol recognition, and whole-score processing. It detects a plurality of candidates for the relationships between score symbols and, for each candidate, uses various information to estimate which is musically valid and to select a single relationship between the score symbols.

Patent Document 3 discloses a score recognition device comprising a pre-recognition processing unit that recognizes some of the score symbols from a score image, a correction unit that corrects the recognition result of the pre-recognition processing unit, and a main recognition processing unit that recognizes the other score symbols from the score image using the corrected recognition result. The pre-recognition processing unit recognizes time signatures, bar lines, clefs, and key signatures, and the main recognition processing unit recognizes notes and rests.

As disclosed, devices using these techniques first recognize the staff lines and bar lines, then remove them, and then recognize notes and other symbols using techniques such as OCR.

Examples of such conventional OMR systems include Aruspix, Audiveris, Gamera, and PhotoScore (used in the notation software Sibelius). However, OMR accuracy has remained in need of improvement.

Approaches based on deep learning have been tried in order to improve OMR accuracy. Deep learning has transformed the analysis and use of information in data that includes still and moving images, such as photographs, pictures, and video. The current state and potential of deep learning are discussed in many publications (e.g., Non-Patent Documents 2 and 3). Various deep learning models, such as YOLO (Non-Patent Document 4) and SSD (Non-Patent Document 5), can determine not only the class of an arbitrary object but also its position. Using both class and position makes these models useful and versatile in many applications, including object detection in recorded video and real-time object detection in live images. Their applications are now expanding in various fields and are expected to be studied widely.

Specifically, several deep learning models have been applied to OMR. Calvo-Zaragoza et al. (Non-Patent Document 6) used the Connectionist Temporal Classification (CTC) loss function to locate music symbols in a score. Zhiqing Huang et al. (Non-Patent Document 7) proposed an end-to-end detection model based on a deep convolutional neural network and feature fusion. This model processes the whole image directly and then outputs the symbol category together with the pitch and duration of each note. A method that uses a convolutional neural network (CNN) and a recurrent neural network (RNN) to process the note data on a staff as a time series has also been disclosed (Patent Document 4).

Furthermore, in Non-Patent Document 8, Pacha et al. use a deep learning model that recognizes measures in order to recognize score images. They show that, as in the method disclosed in Non-Patent Document 7, music symbols can be recognized using a deep learning model that recognizes symbol categories and notes.

Patent Document 1: Japanese Unexamined Patent Publication No. H6-103416
Patent Document 2: Japanese Unexamined Patent Publication No. 2012-138009
Patent Document 3: Japanese Unexamined Patent Publication No. 2015-56149
Patent Document 4: International Publication No. WO 2018/194456

Non-Patent Document 1: Rebelo, Ana; Fujinaga, Ichiro; Paszkiewicz, Filipe; Marcal, Andre R.S.; Guedes, Carlos; Cardoso, Jamie dos Santos (2012). "Optical music recognition: state-of-the-art and open issues" (PDF). International Journal of Multimedia Information Retrieval. 1(3): 173-190. doi: 10.1007/s13735-012-0004-6.
Non-Patent Document 2: Yutaka Matsuo, "The Difficulties of Deep Learning and Artificial Intelligence" (in Japanese), Journal of the Institute of Systems, Control and Information Engineers, Vol. 60, No. 3, pp. 92-98, 2016.
Non-Patent Document 3: Z. Zhao, P. Zheng, S. Xu and X. Wu, "Object Detection With Deep Learning: A Review," in IEEE Transactions on Neural Networks and Learning Systems, vol. 30, no. 11, pp. 3212-3232, Nov. 2019, doi: 10.1109/TNNLS.2018.2876865.
Non-Patent Document 4: Redmon, J., Farhadi, A., YOLOv3: An Incremental Improvement, arXiv 2018, arXiv:1804.02767.
Non-Patent Document 5: Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.Y., Berg, A.C., SSD: Single Shot MultiBox Detector, in European Conference on Computer Vision; Springer: Cham, Switzerland, 2016; pp. 21-37, doi: 10.1007/978-3-319-46448-0_2.
Non-Patent Document 6: Jorge Calvo-Zaragoza and David Rizo, End-to-End Neural Optical Music Recognition of Monophonic Scores, Appl. Sci., 2018, 8, 606.
Non-Patent Document 7: Zhiqing Huang, Xiang Jia and Yifan Guo, State-of-the-Art Model for Music Object Recognition with Deep Learning, Appl. Sci., 2019, 9, 2645.
Non-Patent Document 8: https://www.youtube.com/watch?v=Mr7simdf0eA

An object of the present invention is to identify notes in a musical score image with high accuracy.

Specifically, a first aspect of the present invention provides a method of creating music information from a musical score image, the method comprising: a step of inputting a musical score image; a step of extracting at least one measure from the musical score image; a step of identifying the notes within each of the at least one measure; and a step of creating music information from the identified notes. In particular, because this method includes the step of extracting at least one measure from the score image, it can identify notes with high accuracy.
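The four steps of the first aspect can be sketched as a small pipeline. This is only an illustration under stated assumptions: the `Measure` and `Note` types, the function name `create_music_information`, and the callables passed in are all hypothetical and do not come from the patent, which does not specify an implementation at this point.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Measure:
    """One extracted measure (hypothetical type, not defined in the patent)."""
    index: int
    pixels: object  # image data; the representation is left open


@dataclass
class Note:
    """One identified note or rest (hypothetical type)."""
    measure_index: int
    step: str      # "A".."G", or "rest"
    duration: str  # e.g. "quarter"


def create_music_information(
    score_image: object,
    extract_measures: Callable[[object], Sequence[Measure]],
    identify_notes: Callable[[Measure], Sequence[Note]],
    build_output: Callable[[Sequence[Note]], str],
    correct_staff: Callable[[Measure], Measure] = lambda m: m,
) -> str:
    """Input image -> measures -> (optional staff correction) -> notes -> output."""
    notes: List[Note] = []
    for measure in extract_measures(score_image):
        corrected = correct_staff(measure)  # optional step; identity by default
        notes.extend(identify_notes(corrected))
    return build_output(notes)
```

Passing the stages in as callables keeps the sketch neutral about whether a stage is a deep learning model or a conventional routine, which matches the optional wording of the aspect.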

In one embodiment, the at least one measure may be extracted by a deep learning model. Preferably, each of the at least one measure is extracted along the frame of the staff, in particular along its top and bottom lines. This facilitates the staff correction described below.

In one embodiment, the method further includes a step of correcting the position of the staff lines within each of the at least one measure. This staff position correction step optionally includes a step of correcting the tilt of a given staff so that the entire input score image becomes horizontal. The extraction of each measure by deep learning may then be performed on this horizontally corrected score image. The step may also include horizontally correcting the staff lines within each extracted measure. The positions of the staff lines of each measure horizontally corrected in this way may be corrected by, but not limited to, the methods described in Examples 6 and 7. This staff correction step is made possible by extracting each measure with a deep learning model, and it is particularly effective for images in which the staff distortion is non-uniform, such as photographs of sheet music.
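As a rough illustration of the horizontal correction idea, the tilt of a staff line can be estimated from two points on it, and coordinates can then be rotated by the opposite angle. This is a generic geometric sketch, not the method of Examples 6 and 7 (which are not reproduced in this section); the function names and the assumption that two points on one staff line are already known are hypothetical.

```python
import math
from typing import Tuple

Point = Tuple[float, float]


def staff_tilt_degrees(left: Point, right: Point) -> float:
    """Angle, in degrees, of the line through two points on one staff line."""
    dx = right[0] - left[0]
    dy = right[1] - left[1]
    return math.degrees(math.atan2(dy, dx))


def rotate_point(p: Point, center: Point, degrees: float) -> Point:
    """Rotate p about center by the given angle (standard 2D rotation)."""
    theta = math.radians(degrees)
    x, y = p[0] - center[0], p[1] - center[1]
    return (
        center[0] + x * math.cos(theta) - y * math.sin(theta),
        center[1] + x * math.sin(theta) + y * math.cos(theta),
    )


def level_staff_line(left: Point, right: Point) -> Tuple[Point, Point]:
    """Rotate the right end about the left end so the line becomes horizontal."""
    tilt = staff_tilt_degrees(left, right)
    return left, rotate_point(right, left, -tilt)
```

In practice the same rotation would be applied to every pixel of the image or the measure; an image library would normally do that, and the choice of library is left open here.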

In one embodiment, the method may further include a step of identifying the notes within each of the at least one measure. In one embodiment, the notes within each of the at least one measure may be identified using a plurality of deep learning models. Combining deep learning models that correspond to a plurality of feature categories makes it possible to represent a wide variety of note symbols. It was also found that combining deep learning models for a plurality of feature categories is superior, in terms of the practicality of training and inference, accuracy, and the like, to training and using one large deep learning model that discriminates among many feature types. In addition, normalizing each measure extracted according to the present invention and using it as training data is considered to have contributed to the improved accuracy of training and inference. These features produce a notable effect.

In one embodiment, the plurality of deep learning models are processed in parallel. This can significantly shorten inference time, and the benefit of the present invention is expected to grow as CPU, GPU, and TPU performance improves.
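One way to picture the parallel processing is to submit each per-category model to a worker pool and collect the results by category name. The sketch below uses Python's standard `concurrent.futures.ThreadPoolExecutor`; the function name `run_models_in_parallel`, the `Detection` format, and the idea that each model is a plain callable are assumptions for illustration, not details from the patent.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List

Detection = dict  # hypothetical format, e.g. {"type": "sharp", "x": 12}
Model = Callable[[object], List[Detection]]


def run_models_in_parallel(
    measure_image: object, models: Dict[str, Model]
) -> Dict[str, List[Detection]]:
    """Run every feature-category model on the same measure concurrently."""
    # At least one worker is required even when no models are supplied.
    with ThreadPoolExecutor(max_workers=max(1, len(models))) as pool:
        futures = {
            name: pool.submit(model, measure_image)
            for name, model in models.items()
        }
        # result() blocks until that model has finished.
        return {name: future.result() for name, future in futures.items()}
```

Threads are only one option; whether real inference benefits more from threads, processes, or batching on an accelerator depends on the framework in use, which the text does not specify.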

In one embodiment, the music information is selected from the group consisting of an XML file, a MusicXML file, a MIDI file, an mp3 file, a wav file, and a musical score.
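For the MusicXML case, a minimal sketch of turning identified notes into a document might look like the following. It uses only Python's standard `xml.etree.ElementTree`; the function name, the `(step, octave, type)` tuple format, the single-part single-measure layout, and the part name "Music" are assumptions made for brevity rather than anything stated in the patent.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple


def notes_to_musicxml(notes: List[Tuple[str, int, str]]) -> str:
    """Build a one-part, one-measure MusicXML string from (step, octave, type)."""
    root = ET.Element("score-partwise", version="3.1")
    part_list = ET.SubElement(root, "part-list")
    score_part = ET.SubElement(part_list, "score-part", id="P1")
    ET.SubElement(score_part, "part-name").text = "Music"

    part = ET.SubElement(root, "part", id="P1")
    measure = ET.SubElement(part, "measure", number="1")
    attributes = ET.SubElement(measure, "attributes")
    ET.SubElement(attributes, "divisions").text = "1"  # 1 division = quarter note

    durations = {"whole": 4, "half": 2, "quarter": 1}  # in divisions
    for step, octave, note_type in notes:
        note = ET.SubElement(measure, "note")
        pitch = ET.SubElement(note, "pitch")
        ET.SubElement(pitch, "step").text = step
        ET.SubElement(pitch, "octave").text = str(octave)
        ET.SubElement(note, "duration").text = str(durations[note_type])
        ET.SubElement(note, "type").text = note_type
    return ET.tostring(root, encoding="unicode")
```

A real exporter would also emit clef, key, and time signature attributes and handle rests, accidentals, and shorter note values; those are omitted here to keep the example short.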

A second aspect of the present invention provides a computing device for creating music information from a musical score image, the device comprising: an input unit that inputs a musical score image; a measure extraction unit that extracts at least one measure from the musical score image; a staff correction unit that corrects the position of the staff lines within each of the at least one measure; a note identification unit that identifies the notes within each of the at least one measure using a plurality of deep learning models; and a music information creation unit that creates music information from the identified notes, wherein the at least one measure is extracted by a deep learning model, the plurality of deep learning models are processed in parallel, and the music information is selected from the group consisting of an XML file, a MusicXML file, a MIDI file, an mp3 file, a wav file, and a musical score. This computing device provides the notable effects obtained with the first aspect.

A third aspect of the present invention provides a program for creating music information from a musical score image, the program comprising: an input unit that inputs a musical score image; a measure extraction unit that extracts at least one measure from the musical score image; a staff correction unit that corrects the position of the staff lines within each of the at least one measure; a note identification unit that identifies the notes within each of the at least one measure using a plurality of deep learning models; and a music information creation unit that creates music information from the identified notes, wherein the at least one measure is extracted by a deep learning model, the plurality of deep learning models are processed in parallel, and the music information is selected from the group consisting of an XML file, a MusicXML file, a MIDI file, an mp3 file, a wav file, and a musical score. This program also provides the notable effects obtained with the first aspect.

According to one aspect of the present invention, notes can be identified from a musical score image with high accuracy.

FIG. 1 is a flowchart showing the steps of a method according to one embodiment of the present invention.
FIG. 2 is a diagram showing the result of applying the measure deep learning model to a plurality of score images so that each measure is recognized, according to Example 1 of the present invention.
FIG. 3 is a diagram showing that the deep learning models for a plurality of feature categories were applied to various analysis regions to identify the kinds and positions of feature types, according to Example 5 of the present invention.
FIG. 4 is a diagram showing the result of leveling a tilted score image with respect to the staff, according to Examples 6 and 7 of the present invention.
FIG. 5 is a diagram showing the result of correcting the positions and spacing of the staff lines, according to Example 7 of the present invention.
FIG. 6 is a diagram in which MusicXML created from a score image by the present method is displayed in two common notation software packages, according to Example 8 of the present invention.
FIG. 7 is a diagram showing a photographic image of a tilted score and the MusicXML created from that image by the present method, displayed in common notation software, according to Example 8 of the present invention.
FIG. 8 is a diagram showing the result of OMR processing of a photographic image of a tilted score with an existing technique, according to a comparative example of the present invention.

Hereinafter, embodiments of the present invention will be described in detail.
Terms and definitions
Image
As used herein, an "image" (the two Japanese terms rendered here as "image" are used interchangeably and have the same meaning unless otherwise indicated) is any kind of image that can be analyzed by the method of the present invention. An image may be two-dimensional, such as a photograph or a screen display, or three-dimensional, such as a hologram. Examples of images include pictures, videos, and photographs, which can be displayed and/or stored, individually or together, as files (e.g., .jpg, .jpeg, .tiff, .png, .gif, .mp3, .mp4, or .mov files) on a computer, a server, a storage medium (e.g., RAM, ROM, cache, SSD, or hard disk), or the like.

Information
As used herein, information is related to data. The difference is that information resolves uncertainty. Data can represent redundant symbols, but it approaches information through optimal data compression. Information can be encoded into various forms for transmission and interpretation (for example, information may be encoded into a sequence of symbols or transmitted via a signal). This general concept of information applies herein. As to its form, information may be in written form, digitized form, audio form, video form, or a combination of such forms, and is not limited to any particular form. In optical music recognition (OMR), information may be provided, for example, as a musical score or as any other medium in a digitized, readable, or audible format. Both visualized and audible forms are acceptable.

Region unit
In the present specification, a region unit may be an individual measure. In OMR, a region unit may be a staff containing five lines (the terms "staff" and "staff notation" may be used interchangeably herein), or a measure (also called a bar; the two terms may be used interchangeably herein) containing one or more staves.

Position reference
The position reference used herein may be one or more of the five lines of a staff.

Feature model
The feature model used herein may be any feature model that can extract information from an image. The feature model may be, for example, a general feature model, preferably an AI model, more preferably a machine learning model, and still more preferably a deep learning model. A plurality of models may be used for inference on an image or on at least one analysis region (including each measure). The number of feature models used may be 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 20, 25, 50, 100, 350, 500, 750, or 1000 or more. Any number between any two of these values is also included. The number of feature models used to extract measures is preferably 1. The number of feature models used for inference on an analysis region containing a measure is not particularly limited, but is preferably 1 to 100, more preferably 1 to 25, still more preferably 1 to 10, and even more preferably 1 to 5.
Specific examples of the feature models disclosed herein for creating music information include, but are not limited to, a measure model, a Clef model, a Body model, an Accidental model, an Arm/Beam model, and/or a Rest model. Details of these models are described later.

Feature category
A feature category, as used herein, corresponds to its associated feature model. Unless otherwise indicated, a feature category relates to the feature handled by the model in use. A feature category may be of any kind as long as the model can extract data about that feature from an image. The data obtained may be arbitrary and is not necessarily useful, so not all of the extracted data is used in subsequent analysis. Each category may be selected manually or selected automatically by another model, which facilitates the automatic generation of music information from a score image.
In one embodiment of the present invention, some feature categories were newly created for this purpose; they are written as Clef, Accidental, Body, Arm/Beam, and Rest to indicate that they are feature categories.

Feature type
In the present specification, the feature category of each feature model contains one or more feature types. The kinds of feature type are not particularly limited, and any kind may be used alone or in combination. Note feature types may also be annotated using a combination of one or more of these feature categories and the position reference. Herein, note feature types include those for both notes and rests, so a reference to a note may cover both notes and rests.
In one embodiment of the present invention, the Clef feature category contains the treble clef, bass clef, and octave shift feature types. The Accidental feature category contains the sharp (♯), flat (♭), and natural feature types. The Body feature category contains the filled notehead, dotted filled notehead, half note (open notehead), dotted half note, whole note, and dotted whole note feature types. The Arm/Beam feature category contains the feature types for the stem of an unbeamed quarter note (up and down), the flagged eighth note (up and down), the beamed portion of eighth notes (upper or lower; start, middle, or end), and the beamed portion of sixteenth notes (upper or lower; start, middle, or end). The Rest feature category contains the whole, half, quarter, eighth, and sixteenth rest feature types. These feature types are listed in Table 5. See FIG. 3 for their specific shapes.
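The category-to-type relationship described above can be written down as a simple lookup table. The sketch below is illustrative only: the label strings are paraphrases chosen for this example (the authoritative list is Table 5, which is not reproduced in this section), and `FEATURE_CATEGORIES` and `category_of` are hypothetical names.

```python
from itertools import product
from typing import Dict, List, Optional

# Beamed segments: note value x beam side x position within the beam group.
_BEAMED = [
    f"{value} beam {side} {part}"
    for value, side, part in product(
        ("8th", "16th"), ("upper", "lower"), ("start", "middle", "end")
    )
]

# Labels are illustrative paraphrases, not the patent's own identifiers.
FEATURE_CATEGORIES: Dict[str, List[str]] = {
    "Clef": ["treble clef", "bass clef", "octave shift"],
    "Accidental": ["sharp", "flat", "natural"],
    "Body": [
        "filled notehead", "dotted filled notehead",
        "half note", "dotted half note",
        "whole note", "dotted whole note",
    ],
    "Arm/Beam": [
        "quarter stem up", "quarter stem down",
        "flagged 8th up", "flagged 8th down",
    ] + _BEAMED,
    "Rest": ["whole rest", "half rest", "quarter rest", "8th rest", "16th rest"],
}


def category_of(feature_type: str) -> Optional[str]:
    """Return the feature category containing the given type, or None."""
    for category, types in FEATURE_CATEGORIES.items():
        if feature_type in types:
            return category
    return None
```

Keeping one list per category mirrors the one-model-per-category design: each model only needs to discriminate among the handful of types in its own list.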

Sheet music (score)
Sheet music (a score) includes handwritten, printed, or electronically readable forms of notation that use music symbols to indicate the pitches, rhythms, and/or chords of a song or an instrumental musical work. "Score" is a common alternative (and more general) term for sheet music. Both are referred to herein simply as sheet music or a score. Examples of score images as used herein include any form of visualized or digitized score image.

Staff and measure (bar)
A staff consists of five horizontal lines and four spaces, each of which represents a different pitch. A staff includes, for example, the following forms. Appropriate music symbols are placed on the staff according to the corresponding pitch or function, depending on the intended effect. Notes are placed according to pitch. Pitch is determined by vertical position on the staff, and the music is played from left to right. Which note corresponds to which position is determined by the clef at the beginning of the staff. The clef identifies a particular line as a particular note, and all other notes are determined relative to that line. When music on two staves is joined together or played at once by a single performer, a grand staff is used. In general, the upper staff uses the treble clef and the lower staff uses the bass clef. Piano music, for example, is written on two staves, one for the right hand and one for the left hand. Bar lines are used to group the notes on the staff into measures.
In music notation, a measure or bar (hereinafter sometimes called a measure) is a segment of time corresponding to a specific number of beats, in which each beat is represented by a particular note value and the boundaries of the measure are indicated by vertical bar lines. Dividing music into measures provides regular reference points for locating positions within a composition. Because each measure on a staff can be read and played as a unit, the music is also easier to follow.

五線の線（５つの線）
各スタッフは５つの線（ライン）（五線）で構成されている。ラインとスペースには下から上へ番号を振ることができる。音符は、ライン（音符の玉部分の中央を通る線）上またはスペースに配置することができる。このスペースには４つの内側のスペースと、上部または下部の２つの外側のスペースとが含まれる。
本発明の一実施形態では、スタッフの５つの線の位置を位置基準にして、音階（ステップ）をト音記号またはへ音記号に対応させて割り当てた。本明細書中では音階はＡ（ラ）、Ｂ（シ）、Ｃ（ド）、Ｄ（レ）、Ｅ（ミ）、Ｆ（ファ）、Ｇ（ソ）を原則的に使用する。 Staff line (five lines)
Each staff consists of five lines. The lines and spaces can be numbered from bottom to top. A note can be placed on a line (with the line passing through the center of the notehead) or in a space. The spaces include the four inner spaces and the two outer spaces above and below the staff.
In one embodiment of the invention, pitch steps are assigned according to the treble clef or the bass clef, using the positions of the five staff lines as the position reference. In this specification, the steps A (la), B (si), C (do), D (re), E (mi), F (fa), and G (sol) are used in principle.
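The assignment of steps relative to the clef can be illustrated with a minimal sketch. It relies only on standard notation (the bottom line is E4 under the treble clef and G2 under the bass clef). The function name `step_at` and the position numbering (0 for the bottom line, 1 for the first space, and so on) are assumptions made for illustration and do not appear in the specification.

```python
STEPS = ["C", "D", "E", "F", "G", "A", "B"]

# Pitch of the bottom staff line for each clef (standard notation):
# treble (G) clef -> E4, bass (F) clef -> G2.
BOTTOM_LINE = {"treble": ("E", 4), "bass": ("G", 2)}


def step_at(position: int, clef: str) -> tuple[str, int]:
    """Return (step, octave) for a staff position.

    position 0 is the bottom line, 1 the first space, 2 the second line,
    and so on; negative values lie below the staff.
    """
    base_step, base_octave = BOTTOM_LINE[clef]
    index = STEPS.index(base_step) + position
    # Octave numbers change at C, so integer division by 7 carries the octave.
    return STEPS[index % 7], base_octave + index // 7
```

For instance, position 8 (the top line) under the treble clef gives F in octave 5, and position 10 under the bass clef gives middle C.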

音楽記号（特徴）タイプ
音楽記号の例には：線（例、五線、小節線、ブレース、カッコ）、音符と休符（例、全音、半音、四分音、八分音、１６分音、３２分音、６４分音、１２８分音、２５６分音、ビーム音、ドット音または休符）、臨時記号（フラット、シャープ、ナチュラル、ダブルフラット、ダブルシャープなど）、調号（例、フラット調号、シャープ調号）、四分音（デミフラット、フラットアンドハーフ、デミシャープ、シャープアンドハーフ）、拍子記号（例、ビート数とビートタイプで表示されるシンプルな拍子記号、コモンタイム、テンポなどのメトロノームマーク）、音符の関係性を示すもの（例、タイ、スラー、グリッサンド、グリッサンド、タプレット、コード、アルペジオコード）、ダイナミクス（例、ピアニッシモ、ピアニッシモ、ピアノ、メゾピアノ、メゾフォルテ、フォルテ、フォルティッシモ、フォルティッシモ、スフォルツァンド、クレッシェンド、ディミヌエンド）、奏法記号（例、スタッカティッシモ、スタッカティッシモ、スタッカティッシモスタッカティッシモ、スタッカート、テヌート、フェルマータ、アクセント、マルカート）、装飾音（例、トリル、アッパー・モーデント、ロア・モーデント、グルペット、アポッジアトゥーラ、アッキアッカトゥーラ）、オクターブ記号（例えば、オッタバ）、反復とコーダ（例、トレモロ、反復記号、シミュレーション記号、ボルタカッコ、ダカポ、ダルセグノ、セグノ、コーダ）、またはその他の音楽記号が含まれる。
本発明の一実施形態では、楽譜の画像から情報を生成するという問題に対処するために、いくつかのタイプが修正または作成される。本実施形態で使用される特徴タイプは、表５に記載されている。 Music symbol (feature) types Examples of music symbols include: lines (e.g., staff lines, bar lines, braces, brackets); notes and rests (e.g., whole, half, quarter, eighth, 16th, 32nd, 64th, 128th, and 256th notes, beamed notes, and dotted notes or rests); accidentals (flat, sharp, natural, double flat, double sharp, etc.); key signatures (e.g., flat key signatures, sharp key signatures); quarter tones (demiflat, flat-and-a-half, demisharp, sharp-and-a-half); time signatures (e.g., simple time signatures given by a number of beats and a beat type, common time, and metronome marks such as tempo); note relationships (e.g., ties, slurs, glissandi, tuplets, chords, arpeggiated chords); dynamics (e.g., pianissimo, piano, mezzo piano, mezzo forte, forte, fortissimo, sforzando, crescendo, diminuendo); articulation marks (e.g., staccatissimo, staccato, tenuto, fermata, accent, marcato); ornaments (e.g., trill, upper mordent, lower mordent, gruppetto, appoggiatura, acciaccatura); octave signs (e.g., ottava); repetitions and codas (e.g., tremolo, repeat signs, simile marks, volta brackets, da capo, dal segno, segno, coda); and other music symbols.
In one embodiment of the invention, several types are modified or created to address the problem of generating information from images of musical scores. The feature types used in this embodiment are listed in Table 5.

方向
別段の記載がない限り、本明細書で指定された方向は、当技術分野で通常使用される意味を有する。水平方向と垂直方向は、任意の画像に提供される。水平方向、垂直方向のいずれかを任意に設定してもよいが、位置は、各特徴モデルによって、ｘ位置、ｙ位置として提供されてもよい。これらの位置は、直接使用してもよいし、位置基準のいずれかを参照して再設定可能である。 Direction Unless otherwise stated, the directions specified herein have the meaning commonly used in the art. Horizontal and vertical orientations are provided for any image. Either the horizontal direction or the vertical direction may be arbitrarily set, but the positions may be provided as x positions and y positions by each feature model. These positions may be used directly or may be reset with reference to any of the position criteria.

概要
既存技術との対比
特許文献１〜３に開示される技術では、五線と小節線を認識し、その後、五線等を消去して音符記号等を認識し、その際に小節線を利用して認識した音符情報の再構築を行うものである。したがって、各小節に着目し、各小節を抽出してその後の音符情報の再構築を行う本発明とは技術思想が異なる。五線の傾斜を補正する工程も記載されているが、各小節内の五線の位置を補正する記載はない。 Overview
Comparison with existing technology In the technologies disclosed in Patent Documents 1 to 3, the staff and bar lines are recognized, the staff and the like are then erased so that note symbols and the like can be recognized, and the bar lines are used to reconstruct the recognized note information. The technical idea therefore differs from that of the present invention, which focuses on each measure, extracts each measure, and then reconstructs the note information. A step of correcting the inclination of the staff is also described, but there is no description of correcting the position of the staff within each measure.

非特許文献６では画像全体を直接処理してシンボルカテゴリと音程と持続時間を出力するエンドツーエンドの検出モデルが提案されているが、得られるシンボルカテゴリをどのようにして作成するか、音程と持続時間からどのように音楽情報を生成するのかは明らかにされていない。また、小節に着目して各小節を抽出して音符情報の再構築を行う技術思想は開示されていない。 Non-Patent Document 6 proposes an end-to-end detection model that directly processes the entire image and outputs the symbol category, pitch, and duration. However, it does not make clear how the resulting symbol categories are created or how music information is generated from the pitch and duration. In addition, it does not disclose the technical idea of focusing on measures, extracting each measure, and reconstructing the note information.

特許文献４では、畳み込みニューラルネットワークとリカレントニューラルネットワークを使用して五線上の音符データを時系列で処理しているが、各小節を抽出して音符データを作成して時系列処理するものではない。 In Patent Document 4, a convolutional neural network and a recurrent neural network are used to process note data on the staff as a time series, but each measure is not extracted to create note data for time-series processing.

非特許文献７と８では、音符記号等の検出に１つのエンドツーエンドのディープラーニング検出モデルを利用しているが、各シンボルカテゴリ（特徴タイプ）の検出に複数のモデルを利用することは検討されていない。シンボルカテゴリとタイプの数を増やす必要があるが、どのような方法でアノテーションして、その結果を再構築するかも具体的には提示されていない。また、五線の位置情報により、各音符のステップを同定することが開示されているが、各小節を抽出して位置を各小節に関して補正する技術思想は開示されていない。 Non-Patent Documents 7 and 8 use a single end-to-end deep learning detection model to detect note symbols and the like, but do not consider using multiple models to detect each symbol category (feature type). The number of symbol categories and types needs to be increased, but these documents do not specifically present how to annotate them or how to reconstruct the results. Further, although they disclose identifying the step of each note from the position information of the staff, they do not disclose the technical idea of extracting each measure and correcting the position with respect to each measure.

複数のモデルを、各記号カテゴリに属する特徴タイプの何れかを検出して解析するというタスクに使用する場合、複数のモデルの出力から音楽情報を生成するための最適な手順と処理構成を見出す必要がある。 When multiple models are used for the task of detecting and analyzing any of the feature types belonging to each symbol category, it is necessary to find the optimal procedure and processing configuration for generating music information from the outputs of the multiple models.

非特許文献８では、ディープラーニングモデルによって楽譜イメージ内の小節を認識可能なことが示されている。しかしながら、認識された小節はグランドスタッフ（大五線譜：２つのスタッフを含むもの）であり、本願明細書中に記載される小節（一つのスタッフ中の各小節線で区切られるセグメント）とは異なっている。また、小節を認識する目的は画像が音楽画像であるかどうかを識別するための構造情報を提供するためである。さらに、非特許文献８の小節の認識は小節を含む五線の領域より大きなものを認識しており、できるだけ五線の領域に絞って認識するモデルではない。従って、各小節を抽出して、その単位を用いて五線情報を補正したり、各音符記号をディープラーニングモデルで認識したりするという技術思想とは異なる。さらに、得られた音符記号情報等を再構築して最終的に音楽情報にするやり方は著者も認めているように現在はまだ無い。
以下具体的な実施形態について詳述する。 Non-Patent Document 8 shows that measures in a musical score image can be recognized by a deep learning model. However, the recognized measure is that of a grand staff (one that includes two staves), which differs from the measure described herein (a segment delimited by bar lines within a single staff). Further, the purpose of recognizing measures there is to provide structural information for identifying whether or not an image is a music image. Moreover, the measure recognition of Non-Patent Document 8 recognizes a region larger than the staff region containing the measure, and the model is not designed to narrow recognition down to the staff region as far as possible. It therefore differs from the technical idea of extracting each measure and using that unit to correct the staff information or to recognize each note symbol with a deep learning model. Furthermore, as the authors themselves acknowledge, there is currently no method for reconstructing the obtained note symbol information into final music information.
Specific embodiments will be described in detail below.

実施形態１
本発明の第１実施形態は、楽譜画像から音楽情報を作成する方法であって、楽譜画像から少なくとも一つの小節を抽出する工程を含む、方法を提供する。この方法は、例えば、楽譜画像を入力する工程又は前記少なくとも一つの小節の各小節内の音符から音楽情報を作成する工程を含んでもよい。以下、本発明のある実施形態の工程を説明したフローチャート（図１）に基づいて、本方法の工程と任意ではあるが含む場合がある工程とを詳細に説明する。これら工程の順序は変更される場合がある。 Embodiment 1
The first embodiment of the present invention provides a method for creating music information from a musical score image, the method including a step of extracting at least one measure from the musical score image. The method may include, for example, a step of inputting a musical score image or a step of creating music information from the notes in each of the at least one measure. Hereinafter, the steps of the method, including steps that are optional, are described in detail with reference to the flowchart (FIG. 1) illustrating the steps of an embodiment of the present invention. The order of these steps may be changed.

（１）楽譜画像入力工程（工程Ｓ１００）
楽譜画像入力工程（１）では、楽譜画像を入力する。楽譜画像の画像は上記で定義されたような任意の画像である。楽譜には、楽曲の全体または一部が含まれる。楽譜は複数のページを含む場合があり、各ページが対象となる場合がある。入力は下記のコンピューティングデバイスが読み取り可能または認識可能な任意の方式で実施される。 (1) Musical score image input process (process S100)
In the score image input step (1), a musical score image is input. The musical score image is any image as defined above. The score contains all or part of a piece of music. The score may contain multiple pages, and each page may be processed. The input is performed in any manner that is readable or recognizable by the computing device described below.

（２）小節抽出工程（工程Ｓ２００）
小節抽出工程（２）では、前記楽譜画像から少なくとも一つの小節を抽出する。本明細書中で使用する、用語「小節」は領域単位として上記で定義されるものであり、小節またはメジャーと呼ぶ場合がある。本明細書では、各小節は好ましくはグランドスタッフ（大五線譜）のものではなく、一つのスタッフの中の単位（一つのスタッフ中の各小節線で区切られるセグメント）を指す。小節は領域単位として抽出されてもよい。また抽出された小節に対して、小節ごとに（例えば、小節単位で）音符を同定してもよい。抽出した小節を解析後に再構築して音楽情報を作成する工程を含んでもよい。 (2) Bar extraction step (step S200)
In the measure extraction step (2), at least one measure is extracted from the musical score image. As used herein, the term "measure" is defined above as a region unit and may be referred to as a bar or a measure. In the present specification, each measure preferably refers not to a measure of the grand staff but to a unit within a single staff (a segment delimited by bar lines within one staff). Measures may be extracted as region units. In addition, notes may be identified measure by measure for the extracted measures. The method may include a step of reconstructing the extracted measures after analysis to create music information.

小節の数は特に限定されず、例えば、１、２、３、４、５、６、７、８、９、１０、１５、２０、２５、５０、１００、１５０、２００、２５０、５０、１００、１５０、２００、２５０、５００、１０００以上であってもよい。また、その数は、上記の数よりも大きくても低くてもよく、また、それらの内のいずれか２つの数値の間であってもよい。 The number of measures is not particularly limited and may be, for example, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 20, 25, 50, 100, 150, 200, 250, 500, or 1000 or more. The number may also be larger or smaller than the above numbers, or between any two of them.

（２−１）小節抽出機械学習モデル
各小節は機械学習モデルで抽出されてもよい。この際には、小節モデルの種類は、どのようなものであってもよい。また、小節モデルの数は特に限定されず、１、２、３、４、５、１０以上であってもよい。また、その数は、上記の数よりも多くても少なくてもよく、また、その間の任意の数であってもよい。好ましくは、各小節の取得に要する処理時間の観点から、その数は１である。 (2-1) Bar Extraction Machine Learning Model Each bar may be extracted by a machine learning model. In this case, any type of bar model may be used. Further, the number of bar models is not particularly limited and may be 1, 2, 3, 4, 5, 10 or more. Further, the number may be larger or smaller than the above number, or may be any number in between. Preferably, the number is 1 in terms of the processing time required to acquire each measure.

小節モデルは、それぞれ好ましくは、ＡＩモデル、より好ましくは機械学習モード、さらに好ましくは深層学習（ディープラーニング；深層学習とディープラーニングは互換的に本明細書中で使用される）モデルであってもよい。それらの任意の組み合わせが許容され、それらは単独で使用されてもよいし、組み合わせて使用されてもよい。 Each measure model is preferably an AI model, more preferably a machine learning model, and even more preferably a deep learning model (the Japanese terms 深層学習 and ディープラーニング are used interchangeably herein). Any combination thereof is acceptable, and they may be used alone or in combination.

小節モデルの機能には、小節の種類の分類と位置決めが含まれる。分類と位置決めは、ＳＳＤやＹＯＬＯモデルなどの１つの特徴モデルを用いて行うことができる。ただし、複数のモデルを組み合わせて使用してもよい。後述する他の特徴モデルについても同様である。 Functions of the bar model include classification and positioning of bar types. Classification and positioning can be performed using one feature model such as SSD or YOLO model. However, a plurality of models may be used in combination. The same applies to other feature models described later.

実施例１では、表１に記載される小節を３つのタイプ（ｘ０、ｘ１、ｙ０）に分類するディープラーニングモデルを適用することで非常に効率よく楽譜内の各小節を認識できることが示された。従って、効率よく（例、９４％〜１００％）各小節を認識できるという顕著な効果を本発明が奏することが示される。 In Example 1, it was shown that each measure in the score can be recognized very efficiently by applying a deep learning model that classifies the measures shown in Table 1 into three types (x0, x1, y0). .. Therefore, it is shown that the present invention exerts a remarkable effect of being able to recognize each measure efficiently (eg, 94% to 100%).

（２−２）各小節に基づいて解析領域と前記各小節中に少なくとも一つの位置基準を設定する工程
各小節に基づいて解析領域が設定される。この解析領域は、各小節の一部であってもよいし、各小節の一部または全体を含んでいてもよい。解析領域は、任意の形状を有していてもよい。解析領域の形状は、各小節の形状と同じであってもよいし、異なる形状であってもよい。 (2-2) Analysis area based on each measure and step of setting at least one position reference in each measure The analysis area is set based on each measure. This analysis area may be a part of each bar, or may include a part or the whole of each bar. The analysis region may have any shape. The shape of the analysis region may be the same as the shape of each bar, or may be different.

また、各小節から導出される解析領域の大きさや数は特に限定されるものではなく、上述した領域単位と実質的に同様の方法で提供されてもよい。本実施例では、上側のマージンと下側のマージンを五線の縦幅の１倍または１．２倍にしている。これにより、小節の五線内の音符だけでなく、下側および上側に位置する音符等も各小節に属する音楽記号として認識することができる。 Further, the size and number of analysis regions derived from each measure are not particularly limited, and may be provided in substantially the same manner as the above-mentioned region unit. In this embodiment, the upper margin and the lower margin are set to 1 or 1.2 times the vertical width of the staff. As a result, not only the notes in the staff of the bar but also the notes located on the lower and upper sides can be recognized as musical symbols belonging to each bar.
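The margin rule described above can be sketched as follows. This is only an illustration under stated assumptions: the `Box` type, the function name `analysis_region`, and the optional clamping to the image height are hypothetical and are not part of the specification. Only the margin factors of 1 and 1.2 times the staff height come from the text.

```python
from typing import NamedTuple, Optional


class Box(NamedTuple):
    """Axis-aligned box in image coordinates (y grows downward)."""
    left: float
    top: float
    right: float
    bottom: float


def analysis_region(measure: Box, margin_ratio: float = 1.0,
                    image_height: Optional[float] = None) -> Box:
    """Extend a measure box above and below by margin_ratio * staff height.

    The measure box is assumed to run from the top staff line to the
    bottom staff line, so its height equals the staff height.
    """
    staff_height = measure.bottom - measure.top
    margin = margin_ratio * staff_height
    top = measure.top - margin
    bottom = measure.bottom + margin
    if image_height is not None:
        # Keep the region inside the image (an added safeguard, not from the text).
        top = max(0.0, top)
        bottom = min(image_height, bottom)
    return Box(measure.left, top, measure.right, bottom)
```

With a measure spanning y = 100 to y = 140 and a ratio of 1.0, the analysis region spans y = 60 to y = 180, so notes on ledger lines above or below the staff fall inside the region.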

少なくとも１つの位置基準を設定する。位置基準は上記で定義されるものである。位置基準の種類は特に限定されない。位置基準の種類は、その位置基準が後述する音楽記号をマッピングしたりアノテーションしたりするのに使用できるものであれば、どのような種類であってもよい。好ましくは五線譜内の五線の一又は複数の線である。また、五線間の間隔を適用して、スタッフの上側と下側にも位置基準の線を設けて、上側と下側の領域にある音符のステップを同定することができる。 At least one position reference is set. The position reference is as defined above. The type of position reference is not particularly limited and may be any type that can be used for mapping or annotating the musical symbols described later. It is preferably one or more of the five staff lines. In addition, by applying the spacing between the staff lines, position reference lines can also be provided above and below the staff to identify the steps of notes in the upper and lower regions.
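Extending the reference lines beyond the staff with the same line spacing can be sketched as a simple quantization of a vertical coordinate. The function name `staff_position` and the numbering (0 for the bottom line, 8 for the top line) are assumptions for illustration. The sketch also assumes the five lines are evenly spaced and horizontal within the measure.

```python
def staff_position(y: float, top_line_y: float, bottom_line_y: float) -> int:
    """Index of the nearest line or space for a notehead centre at height y.

    0 is the bottom line and 8 the top line; values below 0 or above 8 fall
    on the virtual reference lines and spaces outside the staff.
    Image coordinates are assumed, so y grows downward.
    """
    # Four gaps separate the five lines; lines and spaces alternate every half gap.
    half_gap = (bottom_line_y - top_line_y) / 8.0
    return round((bottom_line_y - y) / half_gap)
```

With the top line at y = 100 and the bottom line at y = 140, a notehead at y = 150 maps to position -2, that is, the first virtual line below the staff.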

位置基準の数は特に限定されず、例えば、１、２、３、４、５、６、７、８、９、１０、１５、２０、２５、５０、７５，または１００以上であってもよい。また、その数は、上記の数よりも多くても少なくてもよく、また、いずれか２つの間であってもよい。 The number of position references is not particularly limited and may be, for example, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 20, 25, 50, 75, or 100 or more. .. Further, the number may be larger or smaller than the above number, or may be between any two.

好ましくは、前記少なくとも一つの小節のそれぞれが、五線の枠、特に最上部と最下部の線に沿って抽出される。この点は非特許文献８で開示される方法と異なる部分であり、後述する五線の補正を容易にする優れた効果を有する。 Preferably, each of the at least one measure is extracted along the staff frame, particularly the top and bottom lines. This point is different from the method disclosed in Non-Patent Document 8, and has an excellent effect of facilitating the correction of the staff described later.

（３）位置基準補正工程
（３−１）全体画像傾斜補正工程（工程Ｓ３０１）
位置基準補正工程（３）は、前記少なくとも一つの小節の各小節内の五線の位置を補正する工程である。この五線位置補正工程は、任意ではあるが、入力した前記楽譜画像全体をある五線の傾斜を補正して水平にするようにする工程を含む。この楽譜画像全体の五線の傾斜を補正する方法は、好ましくは小節抽出工程（２）の前に実施される。これにより、より効率的に各小節を抽出することを可能とする。 (3) Position reference correction process
(3-1) Overall image tilt correction step (step S301)
The position reference correction step (3) is a step of correcting the position of the staff in each bar of at least one bar. This staff position correction step includes, although optional, a step of correcting the inclination of a staff to make the entire input score image horizontal. This method of correcting the inclination of the staff of the entire musical score image is preferably performed before the bar extraction step (2). This makes it possible to extract each measure more efficiently.

この全体画像傾斜補正は、例えば、以下のような工程で実施可能である。
１．入力イメージをグレースケール化し、Ｃａｎｎｙ法を用いて画像のエッジを抽出する。
２．Ｈｏｕｇｈ法を用いて直線を検出する。
３．一番長い直線の傾き角を計算して画像の回転角度を求める。
４．求めた回転角度で画像全体を回転する。 This overall image tilt correction can be performed, for example, in the following steps.
1. The input image is converted to grayscale and the edges of the image are extracted using the Canny method.
2. Straight lines are detected using the Hough method.
3. The tilt angle of the longest straight line is calculated to obtain the rotation angle of the image.
4. Rotate the entire image at the obtained rotation angle.
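Step 3 of the procedure above can be sketched in isolation. The sketch assumes that steps 1 and 2 have already produced line segments as `(x1, y1, x2, y2)` tuples, for example from an image library such as OpenCV (`cv2.Canny` followed by `cv2.HoughLinesP`). The specification names the Canny and Hough methods but not a library, so that choice is an assumption. The function name `rotation_angle` is hypothetical.

```python
import math

Segment = tuple[float, float, float, float]  # (x1, y1, x2, y2)


def rotation_angle(segments: list[Segment]) -> float:
    """Tilt, in degrees, of the longest segment, folded into (-90, 90]."""
    if not segments:
        return 0.0
    x1, y1, x2, y2 = max(
        segments, key=lambda s: math.hypot(s[2] - s[0], s[3] - s[1])
    )
    angle = math.degrees(math.atan2(y2 - y1, x2 - x1))
    # A segment has no direction, so 179 degrees and -1 degree describe the same tilt.
    if angle > 90.0:
        angle -= 180.0
    elif angle <= -90.0:
        angle += 180.0
    return angle
```

The returned value is the measured tilt of the longest line. Whether the image is then rotated by this angle or by its negative in step 4 depends on the sign convention of the rotation routine used.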

工程（３−１）は効果的に画像全体の傾斜を補正することはできるが、楽譜の写真のように（例、図４Ａ，４Ｂ参照）画像の各領域で小節の傾きが均一でないものに対しては、各小節が抽出できるようにはなるものの、位置基準である五線の傾斜を画一的に定めるにはまだ課題が存在していた。既存技術で五線の補正をする場合は、全体の五線を画一的に補正するか又は各五線（小節を跨って存在するもの）の傾斜を補正するにとどまっていた。そこでさらに正確な位置基準を提供するという課題を解決するために、以下の各小節内の五線に対する傾斜の補正を実施する場合がある。 Step (3-1) can effectively correct the tilt of the entire image, but the bar tilt is not uniform in each area of the image, as in a photograph of a musical score (eg, see FIGS. 4A and 4B). On the other hand, although it became possible to extract each measure, there was still a problem in uniformly determining the slope of the staff, which is the position reference. When correcting the staff with the existing technology, the whole staff is corrected uniformly or the inclination of each staff (existing across measures) is corrected only. Therefore, in order to solve the problem of providing a more accurate position reference, the inclination of the staff in each of the following measures may be corrected.

（３−２）各小節傾斜補正工程（工程Ｓ３０２）
各小節の五線傾斜の補正は、基本的に（３−１）全体画像傾斜補正と同様に実施することができる。画像の各領域で五線の傾斜が異なるものに対しては、各小節内の五線の傾斜を個別に補正することが好ましい。但し、各小節内の五線は横方向に伸びる直線の閾値で選択を掛けてもよい。この各小節に対する五線傾斜の補正は既存技術には無い顕著な効果を奏する（例、図４Ｃ）。この補正により、楽譜の写真等の五線譜の歪みが画像に不均一なものにおいてさえも位置基準となる五線をより精度高く提供できる。 (3-2) Each measure inclination correction step (step S302)
The staff inclination of each measure can be corrected in basically the same way as in (3-1) overall image tilt correction. When the staff inclination differs between regions of the image, it is preferable to correct the staff inclination within each measure individually. In that case, the staff lines within each measure may be selected by applying a threshold to straight lines extending in the horizontal direction. This per-measure correction of the staff inclination has a remarkable effect not found in the existing technology (e.g., FIG. 4C). With this correction, the staff that serves as the position reference can be provided with higher accuracy even when the distortion of the staff is not uniform across the image, as in a photograph of a musical score.
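Within a single measure, bar lines and note stems can be longer than the visible staff-line fragments, which is one reason to restrict the candidates to near-horizontal lines. The sketch below illustrates such a selection on segments given as `(x1, y1, x2, y2)` tuples. The function name `measure_tilt` and the threshold values `max_abs_angle` and `min_length` are assumptions and are not taken from the specification.

```python
import math


def measure_tilt(segments, max_abs_angle=15.0, min_length=20.0):
    """Tilt in degrees of the longest near-horizontal segment, or None."""
    best_length, best_angle = 0.0, None
    for x1, y1, x2, y2 in segments:
        length = math.hypot(x2 - x1, y2 - y1)
        if length < min_length:
            continue  # too short to be a reliable staff-line fragment
        if x2 < x1:
            # Orient the segment left to right so the angle lies in [-90, 90].
            x1, y1, x2, y2 = x2, y2, x1, y1
        angle = math.degrees(math.atan2(y2 - y1, x2 - x1))
        if abs(angle) > max_abs_angle:
            continue  # bar lines, stems and other steep lines are rejected
        if length > best_length:
            best_length, best_angle = length, angle
    return best_angle
```

A return value of `None` signals that no usable staff line was found in the measure, in which case a caller might fall back to the overall image tilt.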

（３−３）五線位置／間隔補正工程（工程Ｓ３０３）
五線の位置は小節モデルで抽出した小節が正確な位置で（特に、五線譜の上下の線に沿って）抽出されると仮定して計算する。このように工程（２）で抽出される小節は、各小節を単に抽出するだけでなく、各小節の位置基準を定める指標となるという二重の効果を奏する。また、解析領域は五線譜の高さを指標として上部と下部に任意のサイズで設定可能である。上部と下部の解析領域は楽譜により幅があるので幅広に検出した特徴モデルを利用するかしないかは選択できるようにしてもよい。このようにして仮定した五線は実際の五線とズレがある場合がある。このズレを補正するためにａｌｐｈａとｂｅｔａ変数を導入してもよい。ａｌｐｈａは五線譜の中央からのズレであり、ｂｅｔａは五線譜間の間隔を補正する値である場合がある。この二つの値を以下のアルゴリズムを用いて自動で求めることができる。 (3-3) Staff position / interval correction step (step S303)
The position of the staff is calculated on the assumption that the measure extracted by the measure model is extracted at the correct position (in particular, along the top and bottom lines of the staff). The measures extracted in step (2) thus serve a dual purpose: they delimit each measure and also act as an index for setting the position reference of each measure. The analysis region can be extended above and below the staff by any size, using the staff height as an index. Because the extent needed above and below the staff varies from score to score, it may be made selectable whether or not to use feature models that detect over a wider region. The staff assumed in this way may deviate from the actual staff. Variables alpha and beta may be introduced to correct this deviation: alpha may be the offset from the center of the staff, and beta may be a value that corrects the spacing between the staff lines. These two values can be obtained automatically using the following algorithm.

１．イメージ全体の縦幅（五線＋上部と下部にそれぞれ五線の高さサイズを任意に拡張した部分を設けたイメージ）を１とする。ａｌｐｈａの範囲を−０．０３〜０．０３の間０．００１刻みでループさせ、その各値でｂｅｔａを−０．００５〜０．００５の間０．００１刻みでループさせる。
２．その各ａｌｐｈａ、ｂｅｔａを使い五線譜をイメージ中に重ね書きする。
３．画像をグレースケール化しＧａｕｓｓｉａｎ閾値処理した画像の黒い部分の面積を求める。
４．五線譜が重なる場合が面積は最小になると考え最小値を求め、その時のａｌｐｈａ、ｂｅｔａの値を補正に使用する。 1. Let the vertical extent of the entire image (the staff plus the regions added above and below it, each an arbitrary multiple of the staff height) be 1. Loop alpha over the range -0.03 to 0.03 in steps of 0.001, and for each value loop beta over the range -0.005 to 0.005 in steps of 0.001.
2. For each pair of alpha and beta, draw the staff lines over the image.
3. Convert the image to grayscale, apply Gaussian thresholding, and compute the area of the black part of the result.
4. On the assumption that the black area is smallest when the drawn staff coincides with the actual staff, find the minimum and use the corresponding alpha and beta values for the correction.
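The search over alpha and beta can be sketched on a small binary image. Several points here are assumptions rather than statements from the specification: the image is taken as already thresholded (a list of rows of 0/1 values, 1 meaning black); the margins above and below are taken as one staff height each, so the nominal line gap is 1/12 of the image height; alpha is treated as an additive offset of the staff center and beta as an additive correction to the line gap. The function names are hypothetical.

```python
def staff_rows(height, alpha, beta):
    """Rows of the five staff lines in an image whose height is normalised to 1."""
    centre = (0.5 + alpha) * height
    gap = (1.0 / 12.0 + beta) * height  # 1/12 assumes margins of one staff height
    return [round(centre + (k - 2) * gap) for k in range(5)]


def fit_alpha_beta(image):
    """Grid-search alpha and beta; image is a list of rows of 0/1 (1 = black)."""
    height, width = len(image), len(image[0])
    black_per_row = [sum(row) for row in image]
    base_area = sum(black_per_row)
    best = None
    for i in range(-30, 31):        # alpha from -0.030 to 0.030 in steps of 0.001
        for j in range(-5, 6):      # beta from -0.005 to 0.005 in steps of 0.001
            alpha, beta = i / 1000.0, j / 1000.0
            rows = {r for r in staff_rows(height, alpha, beta) if 0 <= r < height}
            # Black area after drawing one-pixel, full-width black lines on those
            # rows: only pixels that were not already black add to the area.
            area = base_area + sum(width - black_per_row[r] for r in rows)
            if best is None or area < best[0]:
                best = (area, alpha, beta)
    return best[1], best[2]
```

The black area is computed from per-row counts instead of redrawing the image for each of the 61 x 11 candidates, which gives the same area as overlaying the lines and counting black pixels. On real photographs the staff lines are several pixels thick and not perfectly straight, so the minimum is less sharp than in this idealized case.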

この（３−３）五線位置／間隔補正工程により、五線の各線の位置が正確に位置決めされてより正確な位置基準を提供することができる。従って、各音符のステップが正確に決定されることで得られる音楽情報がより有用で、その後のヒトによる補正工程の負担を軽減できるという優れた効果を有する。 By this (3-3) staff position / interval correction step, the position of each line of the staff can be accurately positioned to provide a more accurate position reference. Therefore, the music information obtained by accurately determining the step of each note is more useful, and has an excellent effect that the burden of the subsequent correction process by a human can be reduced.

以上に記載されるように、本発明の一実施形態では、画像を水平に補正し、五線の位置や間隔を補正する方法が好ましくは用いられる。自動補正に用いられる手法の例には、Ｃａｎｎｙ法、Ｈｏｕｇｈ法、Ｇａｕｓｓｉａｎ閾値処理（実施例６）、本明細書で開示される独自の五線位置間隔補正方法（実施例７）が含まれる。楽譜の写真等の五線譜の歪みが画像に不均一なものに対してさえも五線の位置を個別に補正することにより、音符のステップや臨時記号（アクシデンタル）（例、＃、♭、ナチュラル）等の位置をより精度高く同定することができる。 As described above, in one embodiment of the present invention, a method of correcting the image to be horizontal and correcting the position and spacing of the staff lines is preferably used. Examples of methods used for automatic correction include the Canny method, the Hough method, Gaussian thresholding (Example 6), and the original staff position and spacing correction method disclosed herein (Example 7). By individually correcting the staff position, the positions of note steps, accidentals (e.g., #, ♭, natural), and the like can be identified with higher accuracy, even when the distortion of the staff is not uniform across the image, as in a photograph of a musical score.

（４）各小節内の音符を複数のディープラーニングモデルを使用して同定する工程（音符同定工程Ｓ４００）
（４−１）複数の特徴モデルと特徴タイプの使用
この工程では、複数の特徴モデルが推論のために各小節に基づいた解析領域に適用される。複数の特徴カテゴリに対応するディープラーニングモデルを組み合わせることで、多様な音符記号等を表現することができる。特徴モデルは、それぞれ好ましくは、ＡＩモデル、より好ましくは機械学習モード、さらに好ましくはディープラーニングモデルであってもよい。それらの任意の組み合わせが許容され、それらは単独で使用してもよいし、組み合わせて使用してもよい。 (4) A step of identifying the notes in each measure using a plurality of deep learning models (note identification step S400).
(4-1) Use of multiple feature models and feature types In this step, multiple feature models are applied, for inference, to the analysis region based on each measure. By combining deep learning models corresponding to multiple feature categories, a wide variety of note symbols and the like can be represented. Each feature model is preferably an AI model, more preferably a machine learning model, and even more preferably a deep learning model. Any combination thereof is acceptable, and they may be used alone or in combination.

特徴モデルの数は特に限定されず、２、３、４、５、６、７、８、９、１０、１５、２０、２５、５０、１００、１５０、２００、２５０、５００、１００以上であってもよい。また、上記の数字よりも大きい数であっても、小さい数であってもよく、いずれか２つの間の数であってもよい。 The number of feature models is not particularly limited, and is 2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 20, 25, 50, 100, 150, 200, 250, 500, 100 or more. You may. Further, the number may be larger than the above number, may be smaller than the above number, or may be a number between any two.

特徴カテゴリ（例、実施例５）は、任意の音楽記号を認識する特徴モデルに関する。任意の音楽記号には、既定の音楽記号そのものと自作したもの、例えば、音符の各パーツに関するものが含まれる。具体例には、表２に記載されるａｃｃｉｄｅｎｔａｌ、ａｒｍ／ｂｅａｍ、ｂｏｄｙ、ｃｌｅｆ、ｒｅｓｔカテゴリが挙げられ、其々のカテゴリには複数の特徴タイプが設定可能である。 Feature categories (eg, Example 5) relate to feature models that recognize arbitrary musical symbols. Arbitrary musical symbols include the default musical symbol itself and self-made ones, such as those related to each part of the note. Specific examples include the accidental, arm / beam, body, clef, and rest categories shown in Table 2, and a plurality of feature types can be set for each category.

実施例２で示されているように、推論のために複数の特徴モデルを使用することは、単一の特徴モデルを使用することに比較していくつかの利点がある。 As shown in Example 2, using multiple feature models for inference has several advantages over using a single feature model.

複数の特徴モデルは、並列に処理されてもよいし、直列に処理されてもよい。しかし、複数の特徴モデルは、実施例３と４で示されるように、推論に必要な時間を短縮するために、並列に処理されることが好ましい。 The plurality of feature models may be processed in parallel or in series. However, it is preferred that the plurality of feature models be processed in parallel in order to reduce the time required for inference, as shown in Examples 3 and 4.

（ｉ）訓練性能
特徴タイプの数が少ない複数の特徴モデルは、特徴タイプの数が多い１つの特徴モデルよりも容易に学習実施可能であった。また、実施例２は、少数の特徴タイプを持つように特徴カテゴリを選択した場合に、各特徴タイプの認識精度が高くなることを実証する。このように、本発明によれば、特徴モデルの学習性能を向上可能とするという顕著な効果を奏する。 (I) Training performance A plurality of feature models with a small number of feature types could be learned more easily than a single feature model with a large number of feature types. Further, the second embodiment demonstrates that the recognition accuracy of each feature type is improved when the feature category is selected so as to have a small number of feature types. As described above, according to the present invention, there is a remarkable effect that the learning performance of the feature model can be improved.

（ｉｉ）推論性能
推論処理の数は、抽出される領域単位の数が大きくなると増加する。近い将来起こるであろうＣＰＵやＧＰＵの数が多いコンピュータの設定の場合、この設定を利用して推論処理を並列に処理し、処理時間を短縮することが考えられる。例えば、解析領域数が１００、推論用の特徴モデル数が１０の場合、１，０００個の独立した推論処理を完了させる必要がある。ＣＰＵやＧＰＵの数が増えるにつれて、複数の特徴モデルを並列に使用すると、すべての推論処理にかかる時間が短くなることが期待される。本実施例３で示すように８コアのＣＰＵで並列処理しても処理時間は単純に１／８にならないので、実際に現状の検証可能なアーキテクチャーで試験して処理時間を測定することが必要である。そこで実際に処理時間を比較した本実施例３と４が並列処理の有用性を実証した。実施例４では、処理時間はＣＰＵを直列で処理した時間の約１０分の１であり、ＧＰＵでの並列処理により顕著に処理時間の短縮ができることを実証する。複数の特徴モデルによる推論に複数のＣＰＵ／ＧＰＵを使用することは、総処理時間の点で優れていると考えられる。したがって、本発明の好ましい実施形態では、並列処理により推論処理に要する時間を短縮することが可能となるという顕著な効果を奏する。 (Ii) Inference performance The number of inference processes increases as the number of extracted area units increases. In the case of a computer setting with a large number of CPUs and GPUs, which will occur in the near future, it is conceivable to use this setting to process inference processing in parallel and shorten the processing time. For example, when the number of analysis regions is 100 and the number of feature models for inference is 10, it is necessary to complete 1,000 independent inference processes. As the number of CPUs and GPUs increases, it is expected that the time required for all inference processing will be shortened when multiple feature models are used in parallel. As shown in Example 3, the processing time is not simply reduced to 1/8 even if parallel processing is performed by an 8-core CPU, so it is possible to actually test with the current verifiable architecture and measure the processing time. is necessary. Therefore, Examples 3 and 4 in which the processing times were actually compared demonstrated the usefulness of parallel processing. In the fourth embodiment, the processing time is about one tenth of the time when the CPUs are processed in series, and it is demonstrated that the processing time can be remarkably shortened by the parallel processing on the GPU. Using multiple CPUs / GPUs for inference by multiple feature models is considered to be superior in terms of total processing time. 
Therefore, in a preferred embodiment of the present invention, the parallel processing has a remarkable effect that the time required for the inference processing can be shortened.
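The independence of the inference jobs (for example, 100 analysis regions times 10 feature models giving 1,000 jobs) can be illustrated with a minimal sketch using the Python standard library. The function name `run_all`, the dictionary-of-callables representation of the models, and the use of a thread pool are assumptions for illustration. The specification does not state how the parallel processing in Examples 3 and 4 was implemented.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def run_all(regions, models, max_workers=8):
    """Run every model on every region.

    regions: list of analysis regions (any objects the models accept).
    models:  dict mapping a model name to a callable taking one region.
    Returns {(region_index, model_name): model_output}.
    """
    # One job per (region, model) pair; the jobs do not depend on each other.
    jobs = list(product(range(len(regions)), models.items()))

    def run(job):
        region_index, (name, model) = job
        return (region_index, name), model(regions[region_index])

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(run, jobs))
```

Threads help in CPython mainly when the model call releases the interpreter lock, as native inference runtimes typically do. For pure-Python workloads a process pool would be the usual substitute. As the text notes, the speed-up actually obtained has to be measured on the target hardware.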

（４−２）各特徴モデル中の前記複数の特徴タイプのそれぞれの位置をマッピングして整列させる工程
各特徴モデル（例、ａｃｃｉｄｅｎｔａｌ、ａｒｍ／ｂｅａｍ、ｂｏｄｙ、ｃｌｅｆ、ｒｅｓｔモデル）によって推論された各特徴タイプがマッピングされる。このマッピングは、特徴モデルで使用される座標系を使用して実行してもよいし、位置基準を使用して実行してもよい。さらに、座標系と位置参照との組み合わせが、各特徴タイプをマッピングするために使用されてもよい。 (4-2) Step of mapping and aligning the positions of the plurality of feature types in each feature model Each inferred by each feature model (eg, accidental, arm / beam, body, clef, rest model). Feature types are mapped. This mapping may be performed using the coordinate system used in the feature model or using a positional reference. In addition, a combination of coordinate system and position reference may be used to map each feature type.

各特徴タイプは、水平方向または垂直方向に、または二方向に整列させてもよい。１つの特徴カテゴリの特徴タイプを整列させてもよいし、１つ以上の特徴カテゴリの特徴タイプを整列させてもよいし、すべての特徴カテゴリの特徴タイプを整列させてもよい。 Each feature type may be aligned horizontally, vertically, or bidirectionally. The feature types of one feature category may be aligned, the feature types of one or more feature categories may be aligned, or the feature types of all feature categories may be aligned.

整列の方向は特に限定されず、水平方向、垂直方向のいずれであってもよい。また、整列の方向は、１方向であってもよいし、２方向以上であってもよい。 The alignment direction is not particularly limited, and may be either a horizontal direction or a vertical direction. Further, the alignment direction may be one direction or two or more directions.

一つ以上の特徴タイプは、アライメントの前、途中、および／または後に除外されてもよい。 One or more feature types may be excluded before, during, and / or after alignment.
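The mapping and alignment of detections from several feature models can be sketched as a merge followed by a horizontal sort. The record layout (`type`, `x`, `y`, `score`), the function name `merge_and_align`, and the use of a score threshold as the exclusion criterion are assumptions for illustration. The specification only states that feature types may be excluded before, during, or after alignment.

```python
def merge_and_align(outputs, min_score=0.5):
    """Merge per-model detections and align them left to right.

    outputs: {category: [{"type": str, "x": float, "y": float, "score": float}, ...]}
    Detections below min_score are excluded before alignment.
    """
    merged = [
        {**det, "category": category}
        for category, detections in outputs.items()
        for det in detections
        if det["score"] >= min_score
    ]
    # Horizontal alignment: sort by x, using y to break ties.
    return sorted(merged, key=lambda d: (d["x"], d["y"]))
```

Each returned record keeps the category of the model that produced it, so later steps can branch on the category while walking the list from left to right.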

（４−３）五線位置（位置基準）を使用して各特徴タイプを解析することにより音符をアノテーションする工程
各特徴タイプは、少なくとも１つの位置基準である五線位置を用いて解析され、順に音符のアノテーション（同定；これらは互換的に用いられる場合がある）に使用してもよい。解析の方向は任意に設定してもよいし、水平方向または垂直方向であってもよい。整列された特徴タイプは、一部の特徴タイプが解析の対象から除外されてもよいが、順次解析されてもよい。 (4-3) A process of annotating notes by analyzing each feature type using the staff position (position reference) Each feature type is analyzed using at least one position reference, the staff position. They may be used in sequence for note annotation (identification; these may be used interchangeably). The direction of analysis may be set arbitrarily, and may be horizontal or vertical. As for the aligned feature types, some feature types may be excluded from the analysis target, but they may be analyzed sequentially.

解析される特徴タイプは、複数の特徴モデルのうちの少なくとも１つの特徴モデルからの少なくとも１つの先行解析された特徴タイプの影響を受けてもよい。少なくとも１つの先行解析された特徴タイプの特徴カテゴリは、解析されている特徴タイプの特徴カテゴリと同じであってもよいし、異なるものであってもよい。このようにして、解析結果として得られるアノテーションされた特徴タイプは、先行する特徴タイプが同じ特徴カテゴリまたは異なる特徴カテゴリの後続の特徴タイプに影響を与える間、特定の方向に向けて解析およびアノテーションされてもよい。 The feature type analyzed may be influenced by at least one pre-analyzed feature type from at least one feature model among the plurality of feature models. The feature category of at least one pre-analyzed feature type may be the same as or different from the feature category of the feature type being analyzed. In this way, the annotated feature type obtained as a result of the analysis is analyzed and annotated in a specific direction while the preceding feature type affects the subsequent feature types of the same feature category or different feature categories. You may.

具体的には、実施例８では、ａｃｃｉｄｅｎｔａｌ、ｃｌｅｆの各特徴タイプが少なくとも１つの先行解析された特徴タイプに相当する。 Specifically, in Example 8, each feature type of the accidental and clef categories corresponds to the at least one previously analyzed feature type.

本発明の好ましい実施形態では、水平方向または垂直方向に整列された各特徴タイプと、それぞれ、垂直方向または水平方向に重なって整列された各特徴タイプとを使用して前記新たな音符特徴タイプのアノテーションを行う。特徴タイプの全ての位置が水平方向または垂直方向に整列される場合、解析対象となる各特徴タイプは、複数の特徴モデルのうちの少なくとも１つの特徴モデルから、それぞれ垂直方向または水平方向に重なる少なくとも１つの特徴タイプを用いてアノテーションを行ってもよい。 In a preferred embodiment of the invention, each feature type aligned horizontally or vertically and each feature type aligned vertically or horizontally overlaps each of the new note feature types. Annotate. When all positions of feature types are aligned horizontally or vertically, each feature type to be analyzed is at least vertically or horizontally overlapped from at least one feature model among the plurality of feature models. Annotation may be performed using one feature type.
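One way to read "overlapping in the vertical direction" for horizontally aligned feature types is that two detections share part of their horizontal extent, as a notehead and its stem or beam do. Under that reading, which is an interpretation and not a definition from the specification, the lookup can be sketched as an interval test. The field names `x0` and `x1` and the function name are hypothetical.

```python
def overlapping_vertically(target, features):
    """Features whose horizontal extent [x0, x1] intersects that of target."""
    return [
        f for f in features
        if f is not target
        and f["x0"] <= target["x1"]
        and target["x0"] <= f["x1"]
    ]
```

A note could then be annotated from a body detection together with whatever arm/beam detections this lookup returns for it.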

具体的には、各小節の水平方向への特徴タイプのソーティングを実施する場合がある。スタッフ番号を１か２に指定して、スタッフの小節（メジャー（ｍｅａｓｕｒｅ））を一続きのリストにし、前から順に一つずつ小節を取り出してもよい。そして、各小節に含まれる全ての特徴タイプを水平方向（ｘ）（順方向）にソーティングする。各アノテーションに影響する要素として現状のＣｌｅｆの状態とＡｃｃｉｄｅｎｔａｌテーブル（どの音階にシャープやフラットがあるかを教示するテーブル）とを更新しながら各音符をアノテーションしてもよい。Ａｃｃｉｄｅｎｔａｌテーブルは初期値のｆｉｆｔｈｓ（どの長調または短調かを指定するもの）の状態を入力し、次の小節を解析する際には直前のｆｉｆｔｈｓの状態を反映させる場合がある。 Specifically, the feature types in each measure may be sorted horizontally. The staff number may be set to 1 or 2, the measures of that staff arranged into one continuous list, and the measures taken out one at a time from the front. All feature types contained in each measure are then sorted in the horizontal (x) direction (forward). Each note may be annotated while updating the two elements that affect every annotation: the current Clef state and the Accidental table (a table indicating which scale degrees carry a sharp or flat). The Accidental table is initialized from the fifths value (which specifies the major or minor key), and the immediately preceding fifths state may be carried over when the next measure is analyzed.
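The initialization of the Accidental table from the fifths value can be sketched as follows. This is a minimal illustration, not the implementation used in the patent: the function name `accidental_table_from_fifths` and the choice to key the table by step letter only (ignoring octave) are assumptions. The sign convention (positive for sharps, negative for flats) follows the MusicXML `fifths` element, and the sharp and flat orders are standard music theory.

```python
# Order in which sharps and flats are added to a key signature.
SHARP_ORDER = ["F", "C", "G", "D", "A", "E", "B"]
FLAT_ORDER = ["B", "E", "A", "D", "G", "C", "F"]


def accidental_table_from_fifths(fifths):
    """Return {step: alter} for a key signature given as a fifths count.

    Positive values count sharps, negative values count flats.
    Keying by step letter only is an assumption of this sketch.
    """
    if not -7 <= fifths <= 7:
        raise ValueError("fifths must be between -7 and 7")
    table = {step: 0 for step in "CDEFGAB"}
    if fifths > 0:
        for step in SHARP_ORDER[:fifths]:
            table[step] = 1
    else:
        for step in FLAT_ORDER[:-fifths]:
            table[step] = -1
    return table
```

For D major (`fifths=2`) the table marks F and C as sharp; a fresh copy of the table would be made at the start of each measure so that accidentals found inside one measure do not leak into the next.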

水平方向にソーティングした各特徴タイプを前から順に解析するのが好ましい。解析は各タイプがどの特徴カテゴリにあるかに場合分けすることができる。 It is preferable to analyze each horizontally sorted feature type in order from the front. The analysis can be categorized according to which feature category each type is in.

Ａ．Ｃｌｅｆカテゴリ
解析中の特徴タイプがＣｌｅｆカテゴリＧまたはＦ（ｃｆ０またはｃｆ１）である場合は、Ｃｌｅｆの状態を変化させる。 A. Clef category: If the feature type being analyzed is Clef category G or F (cf0 or cf1), the Clef state is changed.

Ｂ．Ａｃｃｉｄｅｎｔａｌカテゴリ
解析中の特徴タイプがＡｃｃｉｄｅｎｔａｌカテゴリである場合は、位置基準を組み合わせてＡｃｃｉｄｅｎｔａｌテーブルを変更する。 B. Accidental category: If the feature type being analyzed belongs to the Accidental category, the Accidental table is updated using the feature together with the position reference.

Ｃ．Ｒｅｓｔカテゴリ
解析中の特徴タイプがＲｅｓｔカテゴリである場合は、Ｒｅｓｔタイプに合わせてアノテーションして、その要素を出力リストに追加する。 C. Rest category: If the feature type being analyzed belongs to the Rest category, it is annotated according to its Rest type and the resulting element is added to the output list.

Ｄ．Ｂｏｄｙカテゴリ（垂直方向に重なる特徴タイプにより音符を同定）
解析中の特徴タイプがＢｏｄｙカテゴリである場合は、和音を検出する。そして、音符の長さをＡｒｍ／Ｂｅａｍタイプで特定するために、垂直方向に重なる特徴タイプをソーティングしてリストにするのが好ましい。その中にＲｅｓｔタイプが含まれる場合は、その位置によってＶｏｉｃｅを指定するのが好ましい（一番下にある場合はＶｏｉｃｅ１、一番上にある場合はＶｏｉｃｅ２に設定可能）。中間位置にある場合は前後の位置に応じてＢｏｄｙタイプの前の要素として追加するか後の要素として追加するかを決定し、出力リストに追加してもよい。 D. Body category (notes are identified by feature types that overlap vertically)
If the feature type being analyzed is the Body category, chords are detected. Then, in order to specify the length of the note by the Arm / Beam type, it is preferable to sort and list the feature types that overlap in the vertical direction. If the Rest type is included in it, it is preferable to specify Voice according to its position (Voice1 can be set if it is at the bottom, and Voice2 can be set if it is at the top). If it is in the middle position, it may be added to the output list by deciding whether to add it as an element before or after the body type according to the position before and after.

Ｂｏｄｙタイプは垂直方向に重なる特徴タイプの数と位置によって場合分けしてアノテーションすることができる。複数のＢｏｄｙタイプが含まれる場合はｍｕｓｉｃＸＭＬファイルの規定に従って和音（Ｃｈｏｒｄ）を割り当て可能である。 Body types can be annotated separately according to the number and position of feature types that overlap in the vertical direction. When a plurality of body types are included, chords can be assigned according to the rules of the musicXML file.

ケース１：一番下と上の特徴タイプが共にＡｒｍ／Ｂｅａｍである場合
３個以上のＢｏｄｙタイプがある場合は、対象のものと、下のＡｒｍ／Ｂｅａｍに属する（下向きのステムの）ものとの距離と、上のＡｒｍ／Ｂｅａｍに属する（上向きのステムの）ものとの距離を計算して近いものに割り当てることができる。その際、下のＡｒｍ／Ｂｅａｍに属するものはＶｏｉｃｅ１に割り当て、上のＡｒｍ／Ｂｅａｍに属するものはＶｏｉｃｅ２に割り当てるのが好ましい。 Case 1: When the bottom and top feature types are both Arm/Beam. If there are three or more Body types, the distance from the Body in question to the Body belonging to the lower Arm/Beam (stem down) and its distance to the Body belonging to the upper Arm/Beam (stem up) can be calculated, and the Body assigned to the nearer one. Bodies belonging to the lower Arm/Beam are preferably assigned to Voice1, and those belonging to the upper Arm/Beam to Voice2.

ケース２：一番下がＲｅｓｔである場合
一番下がＲｅｓｔである場合はＲｅｓｔをＶｏｉｃｅ１に割り当て、一又は複数のＢｏｄｙタイプはＶｏｉｃｅ２に割り当てることが好ましい。 Case 2: When the bottom is a Rest. The Rest is preferably assigned to Voice1 and the one or more Body types to Voice2.

ケース３：一番上がＲｅｓｔである場合
一番上がＲｅｓｔである場合はＲｅｓｔをＶｏｉｃｅ２に割り当て、一又は複数のＢｏｄｙタイプはＶｏｉｃｅ１に割り当てることが好ましい。 Case 3: When the top is a Rest. The Rest is preferably assigned to Voice2 and the one or more Body types to Voice1.

ケース４：一番上がＡｒｍ／Ｂｅａｍである場合
一番上がＡｒｍ／Ｂｅａｍである場合は、Ｂｏｄｙタイプの種類によって場合分けする。特徴タイプｂｄ０〜ｂｄ３の様にＡｒｍまたはＢｅａｍと組み合わせて音符をアノテーションするものと、ｂｄ４〜ｂｄ５のようにＡｒｍとＢｅａｍを持たないものとをそれぞれアノテーションする。この際にＶｏｉｃｅはＶｏｉｃｅ１に設定し、後述するＶｏｉｃｅ調整工程で適宜変更する場合がある。 Case 4: When the top is Arm/Beam. The handling depends on the kind of Body type: feature types such as bd0 to bd3, whose notes are annotated in combination with an Arm or Beam, and types such as bd4 to bd5, which have neither Arm nor Beam, are annotated separately. Voice is provisionally set to Voice1 and may be changed as appropriate in the Voice adjustment step described later.

ケース５：一番下がＡｒｍ／Ｂｅａｍである場合
一番下がＡｒｍ／Ｂｅａｍである場合も、Ｂｏｄｙタイプの種類によって場合分けする。特徴タイプｂｄ０〜ｂｄ３の様にＡｒｍまたはＢｅａｍと組み合わせて音符をアノテーションするものと、ｂｄ４〜ｂｄ５のようにＡｒｍとＢｅａｍを持たないものとをそれぞれアノテーションする。この際にＶｏｉｃｅはＶｏｉｃｅ１に設定し、後述するＶｏｉｃｅ調整工程で適宜変更する場合がある。 Case 5: When the bottom is Arm/Beam. Here too the handling depends on the kind of Body type: feature types such as bd0 to bd3, whose notes are annotated in combination with an Arm or Beam, and types such as bd4 to bd5, which have neither Arm nor Beam, are annotated separately. Voice is provisionally set to Voice1 and may be changed as appropriate in the Voice adjustment step described later.

ケース６：一番上と下が共にＢｏｄｙである場合
この場合は、ｂｄ４〜ｂｄ５の特徴タイプが想定される。しかしながら、Ａｒｍ／Ｂｅａｍ特徴タイプやＲｅｓｔ特徴タイプが認識されなかった結果（例、小節の最下部や最上部に位置していて認識できない場合や特徴モデルの推論で検出されなかった場合も含む）である場合も考えられる。従って、ｂｄ０〜ｂｄ３の者が含まれている場合は、適宜Ａｒｍ／Ｂｅａｍを補うように処理することが好ましい。また、このケースでも音符はＶｏｉｃｅ１に割り当てることが好ましい。 Case 6: When the top and bottom are both Body. In this case, feature types bd4 to bd5 are expected. However, the situation may also arise because an Arm/Beam or Rest feature type was not recognized (for example, because it lies at the very bottom or top of the measure and cannot be recognized, or because it was not detected by the feature model's inference). Therefore, when types bd0 to bd3 are included, it is preferable to supply the missing Arm/Beam as appropriate. In this case as well, the notes are preferably assigned to Voice1.
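The six cases can be read as a classification of one vertical stack of overlapping features. The sketch below is illustrative only: the category labels (`"arm_beam"`, `"rest"`, `"body"`), the function names, the order in which overlapping conditions are tested, and the tie-breaking rule in Case 1 are all assumptions, since the description does not fix them.

```python
def classify_stack(stack):
    """Return the case number (1 to 6) for one vertical stack.

    `stack` lists the feature categories of vertically overlapping
    features ordered from top to bottom. The precedence between
    conditions is an assumption of this sketch.
    """
    if not stack:
        raise ValueError("empty stack")
    top, bottom = stack[0], stack[-1]
    if top == "arm_beam" and bottom == "arm_beam":
        return 1
    if bottom == "rest":
        return 2
    if top == "rest":
        return 3
    if top == "arm_beam":
        return 4
    if bottom == "arm_beam":
        return 5
    if top == "body" and bottom == "body":
        return 6
    raise ValueError("stack does not match any case: %r" % (stack,))


def voice_for_case1(body_y, upper_body_y, lower_body_y):
    """Case 1: join the nearer group; upper group is Voice 2, lower is Voice 1.

    Image y grows downward. Ties go to Voice 1 (an assumption).
    """
    if abs(body_y - upper_body_y) < abs(body_y - lower_body_y):
        return 2
    return 1
```

In Cases 4 to 6 the caller would set Voice 1 provisionally and leave the final decision to the later Voice adjustment step.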

上記した各Ｂｏｄｙタイプのアノテーションでは現在のＣｌｅｆとａｃｃｉｄｅｎｔａｌテーブルを引数として渡して、音符特徴タイプをアノテーションするのが好ましい。そして、各Ｂｏｄｙタイプのステップを五線の位置との相対距離に従って同定する。 In the above-mentioned annotation of each Body type, it is preferable to pass the current Clef and the accidental table as arguments to annotate the note feature type. Then, each Body type step is identified according to the relative distance to the position of the staff.
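Identifying the step of a Body from its distance to the staff can be sketched as below. The function name, the parameters `bottom_line_y` and `line_gap`, and the use of the bottom staff line as the reference are assumptions made for illustration; the reference pitches (E4 on the bottom line of the G clef, G2 on the bottom line of the F clef) are standard notation.

```python
STEPS = "CDEFGAB"
# Pitch written on the bottom staff line for each clef.
BOTTOM_LINE = {"G": ("E", 4), "F": ("G", 2)}


def pitch_from_position(y, bottom_line_y, line_gap, clef="G"):
    """Map a note-head center y (image coordinates, growing downward)
    to a (step, octave) pair.

    One diatonic step corresponds to half the distance between two
    adjacent staff lines.
    """
    half_gaps = round((bottom_line_y - y) / (line_gap / 2))
    step, octave = BOTTOM_LINE[clef]
    diatonic = octave * 7 + STEPS.index(step) + half_gaps
    return STEPS[diatonic % 7], diatonic // 7
```

With a bottom line at y = 100 and a line gap of 10 pixels, a head centered at y = 80 under a G clef lies on the middle line and is read as B4. The alteration would then be looked up in the current Accidental table.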

解析済みのＢｏｄｙとＡｒｍとＲｅｓｔタイプは除外リストに入れて再度解析されるのを防止することができる。また、Ｂｅａｍは隣接するＢｏｄｙタイプの解析のために再度使用可能である。 The analyzed Body, Arm, and Rest types can be put in the exclusion list to prevent them from being analyzed again. Beam can also be used again for analysis of adjacent Body types.

このようにして水平方向にソーティングした特徴タイプを、以前に解析したある種の特定タイプ（Ｃｌｅｆ、Ａｃｃｉｄｅｎｔａｌ）がその後に特徴タイプに影響を及ぼすようにし、また、垂直方向に重なる特徴タイプを垂直方向に影響を及ぼす特徴タイプ（例、Ａｒｍ／Ｂｅａｍ）を使用してアノテーションを実施するのが好ましい。 In this way, the horizontally sorted feature types are preferably annotated so that certain previously analyzed types (Clef, Accidental) influence the feature types that follow, while vertically overlapping feature types are annotated using the feature types that act in the vertical direction (e.g., Arm/Beam).
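The left-to-right sweep, in which earlier Clef and Accidental features change how later features are read, can be sketched as follows. The dictionary layout of a feature (`x`, `category`, `type`, `step`, `alter`) and the function name `sweep_measure` are hypothetical; in particular, the step of each Accidental and Body is assumed to have been resolved from the staff position already.

```python
def sweep_measure(features, clef="G", accidentals=None):
    """Walk the features of one measure from left to right.

    Clef and Accidental features only update state; Rest and Body
    features emit output elements that depend on that state.
    """
    accidentals = dict(accidentals or {})  # copy so the caller's table is untouched
    output = []
    for feat in sorted(features, key=lambda f: f["x"]):
        category = feat["category"]
        if category == "clef":
            clef = feat["type"]
        elif category == "accidental":
            accidentals[feat["step"]] = feat["alter"]
        elif category == "rest":
            output.append({"rest": feat["type"]})
        elif category == "body":
            output.append({
                "step": feat["step"],
                "alter": accidentals.get(feat["step"], 0),
                "clef": clef,
            })
    return output
```

A Body to the left of a sharp sign keeps its unaltered pitch, while a Body on the same step to the right of it is raised, which is the horizontal influence described above.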

好ましい実施形態では、前記複数の特徴タイプと前記位置基準（五線位置）を組み合わせて使用して、新たな音符特徴タイプをアノテーションする。音符特徴タイプの数は前記前記複数の特徴タイプと前記位置基準の合計数の好ましくは少なくとも１０倍であり、より好ましくは少なくとも１００倍であり、さらに好ましくは少なくとも１０００倍である。 In a preferred embodiment, the plurality of feature types and the position reference (staff position) are used in combination to annotate a new note feature type. The number of note feature types is preferably at least 10 times, more preferably at least 100 times, and even more preferably at least 1000 times the total number of the plurality of feature types and the position reference.

（４−４）各音符のＶｏｉｃｅ調整工程
小節は楽曲によって決められた音符長を有する。この工程では、上記（４−３）音符アノテーション工程で同定された音符群のＶｏｉｃｅが正しく割り当てられたかどうかを確認する。ケース１〜３では、各音符がＶｏｉｃｅ１またはＶｏｉｃｅ２に割り当てられているが、ケース４〜６では、各音符は便宜的にＶｏｉｃｅ１に割り当てられている。そこで、この状態で、Ｖｏｉｃｅ１とＶｏｉｃｅ２に属する各音符の長さを、和音を考慮して計算する。そして、小節の規定の音符長よりも長くなった場合は、Ｖｏｉｃｅの調整を実施する。例えば、上側にＡｒｍ／Ｂｅａｍを有するＢｏｄｙタイプをＶｏｉｃｅ２にし、残り（例、ｂｄ４〜ｂｄ５）のＢｏｄｙタイプをＶｏｉｃｅ１にする場合がある。また、下側にＡｒｍ／Ｂｅａｍを有するＢｏｄｙタイプをＶｏｉｃｅ１にし、残り（例、ｂｄ４〜ｂｄ５）のＢｏｄｙタイプをＶｏｉｃｅ２にする場合がある。さらに全音符（ｂｄ４〜ｂｄ５）をＶｏｉｃｅ２にする場合がある。この調整工程を繰り返して行ってもよい。 (4-4) Voice adjustment step for each note. Each measure has a total note length determined by the piece. This step checks whether the Voice of each note identified in the note annotation step (4-3) above was assigned correctly. In Cases 1 to 3 each note has been assigned to Voice1 or Voice2, whereas in Cases 4 to 6 each note has provisionally been assigned to Voice1. In this state, the total length of the notes belonging to Voice1 and to Voice2 is calculated, taking chords into account. If a total exceeds the prescribed note length of the measure, the Voice assignment is adjusted. For example, Body types with an Arm/Beam above them may be set to Voice2 and the remaining Body types (e.g., bd4 to bd5) to Voice1. Alternatively, Body types with an Arm/Beam below them may be set to Voice1 and the remaining Body types (e.g., bd4 to bd5) to Voice2. Whole notes (bd4 to bd5) may also be set to Voice2. This adjustment step may be repeated.
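A minimal sketch of this check is given below. The note dictionaries (`x`, `duration`, `voice`, `stem`), the rule that notes sharing an x position form a chord, and the choice of the first adjustment rule (stem-up notes go to Voice 2) are assumptions for illustration; durations are expressed as fractions of a whole note.

```python
from fractions import Fraction


def voice_length(notes, voice):
    """Total length of one voice; chord members (same x) count once."""
    per_onset = {}
    for n in notes:
        if n["voice"] == voice:
            current = per_onset.get(n["x"], Fraction(0))
            per_onset[n["x"]] = max(current, n["duration"])
    return sum(per_onset.values(), Fraction(0))


def adjust_voices(notes, measure_length):
    """If Voice 1 overflows the measure, move stem-up notes to Voice 2."""
    if voice_length(notes, 1) <= measure_length:
        return notes
    return [
        {**n, "voice": 2 if n.get("stem") == "up" else 1}
        for n in notes
    ]
```

For a 4/4 measure holding a whole note and four stem-up quarter notes, all provisionally in Voice 1, the Voice 1 total exceeds one whole note, so the quarter notes are moved to Voice 2 and both voices then fill the measure exactly.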

実施例５では、少数の特徴モデルの少数の特徴タイプを用いて新たに音符特徴タイプを作成する例を示す。実施例５では、複数カテゴリの比較的少数の特徴タイプを組み合わせることで多数の音符特徴タイプを同定、アノテーションできるという本発明の顕著な効果を実証する。 Example 5 shows an example of creating a new note feature type using a small number of feature types of a small number of feature models. Example 5 demonstrates the remarkable effect of the present invention that a large number of note feature types can be identified and annotated by combining a relatively small number of feature types in a plurality of categories.

（５）各小節内の音符から音楽情報を作成する工程（音楽情報作成工程Ｓ５００）
（５−１）前記領域単位に関してアノテーションした各特徴タイプのデータを組み立てる工程
この工程では、各領小節に関してアノテーションされた音符特徴タイプ由来のデータが組み立てられる。組み立て中に、アノテーションに利用した１つ以上の特徴タイプが削除されてもよい。削除された特徴タイプは、アノテーション中に別の特徴タイプに影響を与える可能性があるが、情報を生成するためには不要である場合があるからである。 (5) A process of creating music information from notes in each measure (music information creation process S500)
(5-1) Step of assembling the data of each feature type annotated for the region unit
In this step, the data derived from the note feature types annotated for each measure are assembled. During assembly, one or more of the feature types used for annotation may be deleted, because a deleted feature type may have influenced another feature type during annotation yet be unnecessary for generating the information.

組み立て方は特に限定されない。組み立ての方向は、解析中またはアノテーション中と同じ方向であってもよい。ただし、組み立てる方向は、解析中またはアノテーション中とは逆の方向であってもよい。また、アノテーションを時間的に処理する（すなわち、時系列で組み立てる）場合もあるため、アノテーション中は同じ方向にデータを組み立てることが好ましい。 The assembly method is not particularly limited. The direction of assembly may be the same as during analysis or annotation. However, the assembling direction may be opposite to that during analysis or annotation. In addition, since annotations may be processed in time (that is, assembled in chronological order), it is preferable to assemble data in the same direction during annotation.

本発明の好ましい実施形態では、前記アノテーションされた音符特徴タイプのデータが時間方向に組み立てられる。 In a preferred embodiment of the invention, the annotated note feature type data is assembled in the time direction.

（５−２）一又は複数の小節に関するデータを直列および／または並列に接続して音楽情報を作成する工程
一つ以上の小節について得られたデータを直列または並列に接続して情報を生成する。場合によっては、小節の数は１であってもよい。この場合、１つの小節に含まれるアノテーションされた音符特徴タイプのデータを使用してもよい。 (5-2) Step of connecting the data for one or more measures in series and/or in parallel to create music information. The data obtained for one or more measures are connected in series or in parallel to generate the information. In some cases the number of measures may be one, in which case the annotated note feature type data contained in that single measure may be used.

また、複数の小節を有する場合には、複数の小節は直列に接続されていてもよいし、並列に接続されていてもよい。また、直列に接続されたデータをさらに直列に接続してもよいし、並列に接続されたデータをさらに直列に接続してもよいし、並列に接続されたデータをさらに直列に接続して音楽情報を生成してもよい。これにより、複数のスタッフがある楽譜にも対応することができる。 When there are multiple measures, they may be connected in series or in parallel. Data already connected in series may be further connected in series, and data already connected in parallel may be further connected in series, to generate the music information. This makes it possible to handle scores with multiple staves.

大五線譜を含む楽譜の場合には、右手用の五線譜を直列および並列（段が違うもの）に接続してスタッフ１とし、左手用の五線譜も直列および並列（段が違うもの）に接続してスタッフ２としてもよい。 For a score containing a grand staff, the right-hand staves may be connected in series and in parallel (those on different systems) to form Staff 1, and the left-hand staves likewise connected in series and in parallel (those on different systems) to form Staff 2.
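One way to picture this connection step is to group the extracted measures by staff and chain them in reading order. The sketch below is an assumption-laden illustration: the fields `id`, `system` (row on the page), `staff`, and `x`, and the function name `connect_measures`, do not come from the description.

```python
def connect_measures(measures):
    """Group measures by staff number and chain them in reading order.

    Reading order is taken to be system (row) first, then x position.
    """
    staves = {}
    for m in sorted(measures, key=lambda m: (m["system"], m["x"])):
        staves.setdefault(m["staff"], []).append(m["id"])
    return staves
```

For a piano score this yields one chain for Staff 1 (right hand) and one for Staff 2 (left hand), each running across all systems of the page.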

接続される小節の数は特に限定されず、例えば、１、２、３、４、５、６、７、８、９、１０、１５、２０、２５、５０、１００、１５０、２００、２００、２５０、５００、１０００、２５００、５０００、１００００、２５０００、５００００、または１００００以上であってもよい。また、上記の数字よりも大きくても小さくてもよく、また、いずれか２つの間の数字であってもよい。 The number of connected measures is not particularly limited, and for example, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 20, 25, 50, 100, 150, 200, 200, It may be 250, 500, 1000, 2500, 5000, 10000, 25000, 50000, or 10000 or more. Further, it may be larger or smaller than the above number, or it may be a number between any two.

各小節の音符データを接続する方法は特に限定されない。音符データは直接接続してもよいし、間接的に接続してもよい。間接的に接続されている場合には、データ間に他のデータや素材を挿入してもよいし、同じデータを繰り返し挿入して音楽情報を生成してもよい。 The method of connecting the note data of each measure is not particularly limited. The note data may be directly connected or indirectly connected. When indirectly connected, other data or material may be inserted between the data, or the same data may be repeatedly inserted to generate music information.

本発明の一実施形態では、接続されるべき小節は、先行する小節内の特徴タイプ（例、調号や臨時記号）に影響されてもよい。 In one embodiment of the invention, the measures to be connected may be influenced by feature types (eg, key signatures and accidentals) within the preceding measures.

本発明の一実施形態では、接続される小節の特徴タイプ（例、反復記号等）は、先行する小節に影響を与えてもよい。あるいは、小節を、単にそのまま接続してもよい。 In one embodiment of the invention, a feature type in a connected measure (e.g., a repeat sign) may affect the preceding measures. Alternatively, the measures may simply be connected as they are.

本発明の一実施形態では、音楽情報は、ＸＭＬファイル、ｍｕｓｉｃＸＭＬファイル、ＭＩＤＩファイル、ｍｐ３ファイル、ｗａｖファイル、および楽譜からなる群より選択される。 In one embodiment of the invention, the music information is selected from the group consisting of XML files, musicXML files, MIDI files, mp3 files, wav files, and musical scores.

本発明の一実施形態では、得られた音楽情報はそのまま最終製品（例、ＭｕｓｉｃＸＭＬ、ＭＩＤＩ、ｍｐ３ファイル、ｗａｖファイル、楽譜）として実施する場合がある。 In one embodiment of the present invention, the obtained music information may be implemented as a final product (eg, MusicXML, MIDI, mp3 file, wav file, musical score) as it is.

実施例８では、各音符のアノテーションとＭｕｓｉｃＸＭＬファイルの作成の例を示し、本発明の方法が楽譜画像から音楽情報を作成する際に顕著な効果を奏することを実証する。 In Example 8, an annotation of each note and an example of creating a MusicXML file are shown, and it is demonstrated that the method of the present invention exerts a remarkable effect in creating music information from a musical score image.
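As an illustration of what writing annotated notes to a MusicXML file can look like, the sketch below builds a minimal `score-partwise` document with Python's standard `xml.etree.ElementTree`. It is not the code of Example 8: the input layout (a list of measures, each a list of note dictionaries) and the function name `build_musicxml` are assumptions, and only a small subset of MusicXML elements is emitted.

```python
import xml.etree.ElementTree as ET


def build_musicxml(measures, divisions=1):
    """Serialize measures of annotated notes as a MusicXML string."""
    root = ET.Element("score-partwise", version="3.1")
    part_list = ET.SubElement(root, "part-list")
    score_part = ET.SubElement(part_list, "score-part", id="P1")
    ET.SubElement(score_part, "part-name").text = "Music"
    part = ET.SubElement(root, "part", id="P1")
    for number, notes in enumerate(measures, start=1):
        measure = ET.SubElement(part, "measure", number=str(number))
        if number == 1:
            attributes = ET.SubElement(measure, "attributes")
            ET.SubElement(attributes, "divisions").text = str(divisions)
        for n in notes:
            note = ET.SubElement(measure, "note")
            if n.get("chord"):
                # Marks this note as sounding with the previous one.
                ET.SubElement(note, "chord")
            if n.get("rest"):
                ET.SubElement(note, "rest")
            else:
                pitch = ET.SubElement(note, "pitch")
                ET.SubElement(pitch, "step").text = n["step"]
                if n.get("alter"):
                    ET.SubElement(pitch, "alter").text = str(n["alter"])
                ET.SubElement(pitch, "octave").text = str(n["octave"])
            ET.SubElement(note, "duration").text = str(n["duration"])
            ET.SubElement(note, "voice").text = str(n.get("voice", 1))
    return ET.tostring(root, encoding="unicode")
```

A complete file would also carry the key (`fifths`), time signature, clef, and note `type` elements; they are omitted here to keep the sketch short.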

実施形態２
本発明の方法を実施して画像から情報を作成するためのコンピューティングデバイス
実施形態２は、本発明の方法を実施して画像から情報を作成するためのコンピューティングデバイスに関する。 Embodiment 2
A computing device for carrying out the method of the present invention to create information from an image
Embodiment 2 relates to a computing device for carrying out the method of the present invention to create information from an image.

本発明の第２実施形態は、楽譜画像から音楽情報を作成するためのコンピューティングデバイスであって、楽譜画像から少なくとも一つの小節を抽出する小節抽出部を含む、コンピューティングデバイスを提供する。このコンピューティングデバイスは、例えば、楽譜画像を入力する入力部、前記少なくとも一つの小節の各小節の五線の位置を補正する五線補正部、前記少なくとも一つの小節の各小節内の音符を複数のディープラーニングモデルを使用して同定する音符同定部、又は同定された前記音符から音楽情報を作成する音楽情報作成部、を含んでもよい。ここで、前記少なくとも一つの小節がディープラーニングモデルによって抽出され、前記複数のディープラーニングモデルが並列に処理され、又は前記音楽情報が、ＸＭＬファイル、ｍｕｓｉｃＸＭＬファイル、ＭＩＤＩファイル、ｍｐ３ファイル、ｗａｖファイル、および楽譜からなる群より選択される、コンピューティングデバイスであってもよい。 The second embodiment of the present invention provides a computing device for creating music information from a musical score image, the computing device including a measure extraction unit that extracts at least one measure from the score image. The computing device may further include, for example, an input unit that inputs the score image, a staff correction unit that corrects the position of the staff in each of the at least one measure, a note identification unit that identifies the notes in each of the at least one measure using a plurality of deep learning models, or a music information creation unit that creates music information from the identified notes. In the computing device, the at least one measure may be extracted by a deep learning model, the plurality of deep learning models may be processed in parallel, or the music information may be selected from the group consisting of an XML file, a musicXML file, a MIDI file, an mp3 file, a wav file, and a musical score.

コンピューティングデバイスの例には、特に限定はされないが、ＲＡＭ、ＲＯＭ、キャッシュ、ＳＳＤ、ハードディスクが含まれる。また、クラウド上のもの、サーバ上のもの、オンプレミスのコンピュータ上のもの等の任意の形態のコンピューティングデバイスが含まれる。 Examples of computing devices include, but are not limited to, RAM, ROM, cache, SSD, and hard disk. It also includes any form of computing device, such as one on the cloud, one on a server, one on an on-premises computer, and so on.

楽譜画像を入力する入力部は、実施形態１の（１）楽譜画像入力工程を実行する。小節抽出部は、実施形態１の（２）小節抽出工程を実行する。五線補正部は、実施形態１の（３）位置基準補正工程を実行する。音符同定部は、実施形態１の（４）音符同定工程を実行する。音楽情報作成部は、実施形態１の（５）音楽情報作成工程を実行する。また、各部の好ましい態様は、実施例１に記載された態様を準用する。 The input unit for inputting the score image executes the (1) score image input step of the first embodiment. The bar extraction unit executes the (2) bar extraction step of the first embodiment. The staff correction unit executes the (3) position reference correction step of the first embodiment. The musical note identification unit executes (4) the musical note identification step of the first embodiment. The music information creation unit executes (5) the music information creation step of the first embodiment. Further, as the preferred embodiment of each part, the embodiment described in Example 1 shall be applied mutatis mutandis.

実施形態３
本発明の方法を実施して画像から情報を作成するためのプログラム
実施形態３は、本発明の方法を実施して画像から情報を作成するためのプログラムに関する。本発明のプログラムは、本発明の方法を実施できる限り、プログラム全体または部分を含む。 Embodiment 3
A program for implementing the method of the present invention to create information from an image
Embodiment 3 relates to a program for implementing the method of the present invention to create information from an image. The program of the present invention may be a whole program or a part of one, as long as the method of the present invention can be carried out.

本発明のプログラムは、本発明の方法を実施できる限り、任意の言語で記載可能である。その言語の例には、特に限定はされないが、Ｐｙｔｈｏｎ，Ｊａｖａ，Ｋｏｔｌｉｎ，Ｓｗｉｆｔ，Ｃ，Ｃ＃，Ｃ＋＋，ＰＨＰ，Ｒｕｂｙ，ＪａｖａＳｃｒｉｐｔ，Ｓｃａｌａ，Ｇｏ，Ｒ，Ｐｅｒｌ，Ｕｎｉｔｙ，ＣＯＢＯＬ等が含まれる。 The program of the present invention can be written in any language as long as the method of the present invention can be carried out. Examples of the language include, but are not limited to, Python, Java, Kotlin, Swift, C, C #, C ++, PHP, Ruby, JavaScript, Scala, Go, R, Perl, Unity, COBOL and the like.

実施形態３は、楽譜画像から音楽情報を作成するためのプログラムであって、楽譜画像から少なくとも一つの小節を抽出する小節抽出部を含む、プログラムを提供する。このプログラムは、楽譜画像を入力する入力部、前記少なくとも一つの小節の各小節内の五線の位置を補正する五線補正部、前記少なくとも一つの小節の各小節内の音符を複数のディープラーニングモデルを使用して同定する音符同定部、又は同定された前記音符から音楽情報を作成する音楽情報作成部、を含んでもよい。ここで、前記少なくとも一つの小節がディープラーニングモデルによって抽出され、前記複数のディープラーニングモデルが並列に処理され、又は前記音楽情報が、ＸＭＬファイル、ｍｕｓｉｃＸＭＬファイル、ＭＩＤＩファイル、ｍｐ３ファイル、ｗａｖファイル、および楽譜からなる群より選択される、プログラムであってもよい。 Embodiment 3 provides a program for creating music information from a musical score image, the program including a measure extraction unit that extracts at least one measure from the score image. The program may further include an input unit that inputs the score image, a staff correction unit that corrects the position of the staff in each of the at least one measure, a note identification unit that identifies the notes in each of the at least one measure using a plurality of deep learning models, or a music information creation unit that creates music information from the identified notes. In the program, the at least one measure may be extracted by a deep learning model, the plurality of deep learning models may be processed in parallel, or the music information may be selected from the group consisting of an XML file, a musicXML file, a MIDI file, an mp3 file, a wav file, and a musical score.

その他の実施形態
本発明の一実施形態によれば、画像から情報を作成する方法であって、画像から領域単位を抽出する工程、前記領域単位に基づいて解析領域と前記領域単位中に少なくとも一つの位置基準を設定する工程、複数の特徴モデルを前記解析領域に適用して推論を行い、各特徴モデルは複数の特徴タイプに対して前記推論を実行する工程、各特徴モデル中の前記複数の特徴タイプのそれぞれの位置をマッピングして整列させる工程、前記少なくとも一つの位置基準を使用して、各特徴タイプを解析しアノテーションする工程、前記領域単位に関してアノテーションした各特徴タイプのデータを組み立てる工程、一又は複数の前記領域単位に関する前記データを直列および／または並列に接続して情報を作成する工程、の少なくとも1つの工程を含む方法が提供される。また本発明の一実施形態によれば、上記方法を実施して画像から情報を作成するためのコンピューティングデバイスが提供される。コンピューティングデバイスの例には、特に限定はされないが、ＲＡＭ、ＲＯＭ、キャッシュ、ＳＳＤ、ハードディスクが含まれる。また、クラウド上のもの、サーバ上のもの、オンプレミスのコンピュータ上のもの等の任意の形態のコンピューティングデバイスが含まれる。また本発明の一実施形態によれば、上記方法を実施して画像から情報を作成するためのプログラム又はこのプログラムを記録した記録媒体が提供される。記録媒体は、非一時的なコンピュータ読み取り可能な記録媒体であってもよい。 Other Embodiments According to one embodiment of the present invention, there is a method of creating information from an image, in which a step of extracting a region unit from an image, at least one in an analysis region and the region unit based on the region unit. A process of setting one position reference, a process of applying a plurality of feature models to the analysis area to perform inference, and each feature model performing the inference for a plurality of feature types, the plurality of features in each feature model. A process of mapping and aligning each position of a feature type, a process of analyzing and annotating each feature type using the at least one position reference, a process of assembling data of each feature type annotated with respect to the area unit, Provided is a method comprising at least one step of connecting the data about one or more of the domain units in series and / or in parallel to create information. Further, according to one embodiment of the present invention, there is provided a computing device for carrying out the above method and creating information from an image. Examples of computing devices include, but are not limited to, RAM, ROM, cache, SSD, and hard disk. It also includes any form of computing device, such as one on the cloud, one on a server, one on an on-premises computer, and so on. 
Further, according to one embodiment of the present invention, a program for carrying out the above method to create information from an image, or a recording medium on which this program is recorded, is provided. The recording medium may be a non-transitory computer-readable recording medium.

本明細書中で「Ａ〜Ｂ」という記載は、ＡおよびＢを含む。また、本発明に係る工程等について各実施形態で説明したが、これらの記載に限定されるものではなく、種々の変更を行うことができる。 In this specification, the notation "A to B" (A〜B) includes both A and B. The steps and other aspects of the present invention have been described through the embodiments, but the invention is not limited to these descriptions and various modifications can be made.

以下、実施例を参照して本発明をさらに詳細に説明するが、本発明は以下の実施例に限定はされない。 Hereinafter, the present invention will be described in more detail with reference to Examples, but the present invention is not limited to the following Examples.

実施例１
楽譜中の小節用ディープラーニングモデルの訓練と推論
まず、４７個の楽譜全体図（各楽譜は数個から約５０個の小節を含んでいた）を使用してＹＯＬＯｖ５の小節モデルを訓練し、ｍＡＰ＠．５（特徴タイプ用のモデル中での正確性の指標）が０．９５を達成した。この小節モデルのカテゴリはｘ０、ｘ１、およびｙ０の小節特徴タイプがあり、それらは以下の表１に示されるようにそれぞれ、ト音記号（Ｇｃｌｅｆ）で始まる小節、へ音記号（Ｆｃｌｅｆ）で始まる小節、それ以外の残りの小節を示していた。訓練データの作成にはｌａｂｅｌＩｍｇソフトウエア（https://github.com/tzutalin/labelImg）を使用してバウンディングボックス（ＢｏｕｎｄｉｎｇＢｏｘ）を各イメージ中で各タイプを割り当てた。その際に、五線の最上部と最下部の線に沿うようにバウンディングボックスを設定した。また、訓練用の訓練データ、試験データ、および検証データはＲｏｂｏｆｌｏｗ（https://app.roboflow.com/）で調整した。 Example 1
Training and inference of a deep learning model for measures in sheet music
First, a YOLOv5 measure model was trained on 47 full-page score images (each containing from a few to about 50 measures) and achieved an mAP@.5 (an accuracy metric for the feature types in the model) of 0.95. The measure model has the measure feature types x0, x1, and y0, which, as shown in Table 1 below, denote measures beginning with a treble clef (G clef), measures beginning with a bass clef (F clef), and all remaining measures, respectively. The training data were created with the labelImg software (https://github.com/tzutalin/labelImg) by assigning a bounding box of the appropriate type to each measure in each image, with each box set along the top and bottom lines of the staff. The training, test, and validation data were prepared with Roboflow (https://app.roboflow.com/).
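For reference, YOLO-style training labels store each bounding box as a class index followed by the box center and size, all normalized by the image dimensions. The helper below sketches that conversion from pixel corner coordinates; the function name and the mapping of x0, x1, and y0 to class indices 0, 1, and 2 are assumptions, not values taken from the patent.

```python
# Hypothetical class indices for the three measure feature types.
CLASS_IDS = {"x0": 0, "x1": 1, "y0": 2}


def to_yolo_label(feature_type, box, image_width, image_height):
    """Convert a pixel box (x_min, y_min, x_max, y_max) to a YOLO label line."""
    x_min, y_min, x_max, y_max = box
    x_center = (x_min + x_max) / 2 / image_width
    y_center = (y_min + y_max) / 2 / image_height
    width = (x_max - x_min) / image_width
    height = (y_max - y_min) / image_height
    return "%d %.6f %.6f %.6f %.6f" % (
        CLASS_IDS[feature_type], x_center, y_center, width, height)
```

Setting `y_min` and `y_max` to the top and bottom staff lines reproduces the labeling rule described above.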

次に、この小節モデルの訓練に用いなかった楽譜イメージでの推論に適用した。図２Ａはヘンデルによる「サラバンドと変奏」の楽譜の一部をスキャンして得たＰＤＦ由来イメージ中の推論結果を示す。図２Ｂは同じ楽譜をスマートフォンのカメラを使用して得た写真イメージ中の推論結果を示す。 Next, it was applied to the inference in the musical score image that was not used for training this bar model. FIG. 2A shows the inference results in the PDF-derived image obtained by scanning a part of the score of "Sarabande and Variations" by Handel. FIG. 2B shows the inference result in the photographic image obtained by using the camera of the smartphone with the same score.

その結果、各楽譜イメージ中で１００％の小節が、その推論の正確度０．９１〜０．９５で認識され抽出された。 As a result, 100% of the measures in each musical score image were recognized and extracted with the accuracy of the inference of 0.91 to 0.95.

また、ベートーベンの「悲愴第二楽章」の楽譜の一部（この小節モデルのトレーニングに用いたもの）も１００％の小節がその推論の正確度０．９２〜０．９３で認識され抽出された。 In addition, for part of the score of the second movement of Beethoven's "Pathétique" sonata (which was used to train this measure model), 100% of the measures were likewise recognized and extracted, with an inference accuracy of 0.92 to 0.93.

さらに、この小節モデルの訓練に用いなかった別の楽譜である、バッハの「メヌエット」の楽譜イメージでは、６６個の小節のうち一つの小節がｘ０とｘ１で重複して認識され、２つの小節が融合して認識されていた。また、一つの小節では隣接する一つの音符を含んでいた。推論の正確度は０．７９〜０．９３であり、総合的には約９４％の小節が正しく認識されていた。結果を図２Ｃに示す。 Furthermore, in a score image of Bach's "Minuet", another score not used to train this measure model, one of the 66 measures was recognized twice (as both x0 and x1), and two measures were recognized as a single fused measure. In addition, one extracted measure included one adjacent note. The inference accuracy was 0.79 to 0.93, and overall about 94% of the measures were recognized correctly. The results are shown in FIG. 2C.
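The figure of about 94% is consistent with counting four of the 66 measures as misrecognized. The short calculation below makes that arithmetic explicit; treating the duplicated measure, the two fused measures, and the measure containing an adjacent note as four errors is an interpretation, since the description does not state how the percentage was derived.

```python
total_measures = 66
# Assumed error count: 1 duplicated + 2 fused + 1 with an adjacent note.
misrecognized = 1 + 2 + 1
accuracy_percent = round((total_measures - misrecognized) / total_measures * 100, 1)
print(accuracy_percent)
```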

これにより本小節モデルが、訓練に用いなかった楽譜のＰＤＦ由来イメージや写真イメージにおいてさえも効率的に小節を抽出することができて有用であることが実証された。 This proved that this bar model is useful because it can efficiently extract bars even in PDF-derived images and photographic images of musical scores that were not used for training.

実施例２
複数のディープラーニングモデルを使って実行する訓練
各音楽記号特徴カテゴリ（以下の実施例５で説明する）に対応する複数のＹＯＬＯｖ５モデルを訓練した。また、複数の特徴タイプを組み合わせて表現することで、全体で表現される音楽記号（音符）特徴タイプの数も飛躍的に増加し、これは有利な効果となった。 Example 2
Training using multiple deep learning models
Multiple YOLOv5 models were trained, one for each music symbol feature category (described in Example 5 below). Because notes are expressed as combinations of several feature types, the total number of music symbol (note) feature types that can be expressed also increases dramatically, which is an advantageous effect.

各小節を抽出し、それに基づいて解析領域を決定し、拡大してサイズを一定（４１６ｘ４１６ピクセル）にし、訓練データを作成した。訓練データの作成は、実施例１と同様にｌａｂｅｌＩｍｇソフトウエアを使用してバウンディングボックス（ＢｏｕｎｄｉｎｇＢｏｘ）を割り当てた。 Each measure was extracted, the analysis area was determined based on it, enlarged to make the size constant (416 x 416 pixels), and training data was created. To create the training data, a bounding box was assigned using the labelImg software as in Example 1.

特徴カテゴリ（詳細は、実施例５で記載する）は、ａｃｃｉｄｅｎｔａｌ、ａｒｍ／ｂｅａｍ、ｂｏｄｙ、ｃｌｅｆ、ｒｅｓｔカテゴリを作成し、其々のカテゴリには複数の特徴タイプを設定した。特徴タイプの数は、それぞれ、ａｃｃｉｄｅｎｔａｌが３個、ａｒｍ／ｂｅａｍが８個、ｂｏｄｙが６個、ｃｌｅｆが５個、ｒｅｓｔが５個と上記一つのディープラーニングモデルと比べると少なかった。また、訓練に用いた画像数（訓練、テスト、検証用のデータの全体数）は、それぞれ、ａｃｃｉｄｅｎｔａｌが１９９個、ａｒｍ／ｂｅａｍが５４６個、ｂｏｄｙが５３７個、ｃｌｅｆが１４９個、ｒｅｓｔが６１１個とやはり、通常のディープラーニングでの訓練データ数よりも１桁以上少なかった。例えば、手書き数字のデータセットＭＮＩＳＴでは、訓練セット数６０，０００、テストセット数１０，０００である。したがって、特徴タイプの種類によってはこれまで考えられていた必要な数より少ないデータセット数でのディープラーニングの訓練ができた。これは本発明が少ない数の特徴タイプの組み合わせで多数の特徴タイプを表現できることに由来すると考えられる。したがって、訓練の質を落とさず、ディープラーニングの訓練を実施できるという顕著な効果の一つとなった。 For feature categories (details will be described in Example 5), accidental, arm / beam, body, clef, and rest categories were created, and a plurality of feature types were set for each category. The number of feature types was 3 for accidental, 8 for arm / beam, 6 for body, 5 for clef, and 5 for rest, which were smaller than those of the above one deep learning model. The number of images used for training (total number of data for training, testing, and verification) was 199 for accidental, 546 for arm / beam, 537 for body, 149 for clef, and 611 for rest, respectively. After all, it was more than an order of magnitude less than the number of training data in normal deep learning. For example, in the handwritten digit data set MNIST, the number of training sets is 60,000 and the number of test sets is 10,000. Therefore, depending on the type of feature type, it was possible to train deep learning with a smaller number of datasets than previously considered. It is considered that this is because the present invention can express a large number of feature types by combining a small number of feature types. Therefore, it was one of the remarkable effects that deep learning training could be carried out without degrading the quality of training.

訓練の結果、ｍＡＰ＠．５は、それぞれ、ａｃｃｉｄｅｎｔａｌモデルが０．９９、ａｒｍ／ｂｅａｍモデルが０．９９、ｂｏｄｙモデルが０．９４、ｃｌｅｆモデルが０．９９、ｒｅｓｔモデルが０．９９であった。訓練は基本的に５００エポック（ｅｐｏｃｈ）をバッチサイズ（ｂａｔｃｈｓｉｚｅ）１６でＧＰＵ（１６Ｇ）を搭載したＧｏｏｇｌｅＣｏｌａｂｏｒａｔｏｒｙを使用して行った。初期値のウエイト（ｗｅｉｇｈｔｓ）は前回の訓練で用いたものを使用した。したがって、実際は２〜４回のトレーニング（転移学習）の結果である。これまでの結果を表２に示す。 After training, the mAP@.5 was 0.99 for the accidental model, 0.99 for the arm/beam model, 0.94 for the body model, 0.99 for the clef model, and 0.99 for the rest model. Training was basically run for 500 epochs with a batch size of 16 on Google Colaboratory with a GPU (16 GB). The initial weights were those from the previous training run, so the figures are in practice the result of two to four rounds of training (transfer learning). The results so far are shown in Table 2.

これらの結果は比較的少数の特徴タイプを比較的小規模な訓練データを用いて複数のディープラーニングモデルで訓練することにより優れた結果が得られる場合があることを実証した。多数の特徴タイプを判別する一つの大きなディープラーニングモデルをトレーニングし使用するよりも、複数の特徴カテゴリのディープラーニングモデルを組み合わせることが、学習と推論時の実行性、正確度等の点でより優れている場合がある。したがって、本実施例の構成の複数機械学習モデルを訓練して使用することが従来法よりも有利であり、極めて顕著な効果があることを示す。 These results demonstrate that training a relatively small number of feature types with multiple deep learning models using relatively small training data may yield superior results. Combining deep learning models from multiple feature categories is better in terms of learning and inference execution, accuracy, etc. than training and using one large deep learning model that discriminates between multiple feature types. May be. Therefore, it is shown that training and using the multi-machine learning model of the configuration of this embodiment is more advantageous than the conventional method and has a very remarkable effect.

実施例３
直列または並列で複数のモデルを処理した場合に掛かった処理時間の比較
これまで作成したディープラーニングモデルを使って楽譜イメージから各小節を認識および処理して、サイズを揃えた解析領域を用意した。そしてその各解析領域に対して、５つの特徴カテゴリの上記モデルを適用して解析データを作成する手順を自動化した。そして、処理に掛かった時間を計測した。この際、５つの特徴カテゴリのモデルの処理を直列で処理するか、または、並列処理するかして、その処理時間を比較した。結果を表３に示す。 Example 3
Comparison of processing time when multiple models are run in series or in parallel
Using the deep learning models created so far, each measure was recognized in and extracted from the score image, and analysis regions of uniform size were prepared. The procedure of applying the models of the five feature categories to each analysis region to create analysis data was then automated, and the processing time was measured. The five feature-category models were run either in series or in parallel, and the processing times were compared. The results are shown in Table 3.

Processing times were compared using three score images. The computer used was an iMac Pro (processor: 3.2 GHz 8-core Intel Xeon W; memory: 64 GB 2666 MHz DDR4). The average serial processing times for the Minuet (66 measures), the Sarabande (48 measures), and the second movement of the Pathétique (58 measures) were 153.8 seconds, 121.5 seconds, and 138.1 seconds, respectively, roughly proportional to the number of measures. The average parallel processing times for the Minuet, the Sarabande, and the second movement of the Pathétique were 81.3 seconds, 63.0 seconds, and 75.4 seconds, respectively, again roughly proportional to the number of measures. Parallelization reduced the processing time for the Minuet, the Sarabande, and the second movement of the Pathétique to 52.9%, 51.9%, and 54.6% of the serial time, respectively, that is, to about one half.
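The percentages quoted above follow directly from the reported averages. The short check below recomputes them; the dictionary keys are just labels chosen for this sketch.

```python
# Average processing times in seconds, as reported in the text.
serial = {"Minuet": 153.8, "Sarabande": 121.5, "Pathetique II": 138.1}
parallel = {"Minuet": 81.3, "Sarabande": 63.0, "Pathetique II": 75.4}


def percent_of_serial(name):
    """Parallel time expressed as a percentage of the serial time."""
    return round(100 * parallel[name] / serial[name], 1)


ratios = {name: percent_of_serial(name) for name in serial}
```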

Even in serial processing, the work was presumably already distributed across the eight cores to some extent, so the processing time did not fall to 1/8; nevertheless, the time required for deep learning model inference was markedly shortened. In this experiment, five feature models were applied to each of about 50 measures, so about 250 processes had to be handled. In this example, processing was performed on a single CPU (8 cores). However, configurations with multiple CPUs and GPUs are expected to become mainstream, so the processing configuration of the present invention can shorten the processing time further as the number of CPUs/GPUs increases. The configuration of this example therefore has a remarkable effect.

Example 4
Processing speed on GPUs
We examined whether processing time is actually shortened by using GPUs. The processing of Example 3 was run on an AWS EC2 g4dn.metal instance and the processing time was measured. The CPU/GPU configuration of g4dn.metal included eight NVIDIA T4 Tensor Core GPUs, 96 vCPUs, and 384 GiB of RAM. The processing was programmed to use the GPUs either serially or in parallel. The results are shown in Table 4.

When the Minuet score was processed with the GPUs used serially, the average processing time was 70.9 seconds, shorter than both the average of 153.8 seconds for serial CPU processing and the average of 81.3 seconds for parallel CPU processing. With the GPUs used in parallel, the average processing time was 16.4 seconds, about 1/4 of the serial GPU time. This is about one tenth of the serial CPU time, demonstrating that parallel processing on GPUs can shorten processing time markedly. Example 4 thus shows that the effect of the present invention is further enhanced by processing on GPUs in parallel. The greater the capability of the computer (e.g., the capacity and number of CPUs and GPUs), the shorter the time needed to process multiple models in parallel, so the applicability and performance of the present invention improve markedly as computing power increases.

Example 5
Creating new note feature types using a small number of feature types from a small number of feature models
As shown in Table 2, the feature categories and feature types used for training and inference of the deep learning models were: Clef, 5 types (3 of them unused); Accidental, 3 types; Body, 6 types; Arm/Beam, 8 types; and Rest, 5 types. They are shown in Table 5 and FIG. 3.

For the treble clef (G clef), 25 pitches from D3 to G6 were assigned using the staff position as the position reference; for the bass clef (F clef), 25 pitches from F1 to B4 were assigned. Depending on where the Body is located, this gives 2 × 25 × 6 (number of Body types) = 300 variations. In addition, the length of each note is determined by the type of Arm and Beam (whole notes take no Arm/Beam, and half notes take only am0 or am1). A Beam is further expressed as one of three kinds (start, middle, end) according to its position in the run of beamed notes. Therefore, 300 × 2 (two types of whole notes) + 300 × 2 (two types of half notes) × 2 (am0 or am1) + 300 × 2 (types of filled note heads) × (4 (Arm types) + 4 (Beam types) × 3 (start, middle, end)) = 11,400. Because there are three types of Accidental, and although they do not necessarily apply to every pitch, 11,400 × 3 = 34,200. Thus about 30,000 new note feature types, namely notes, can be expressed from 19 feature types. If chords are also considered, a chord being any combination of 2, 3, 4, or 5 notes, the number of expressible feature types increases dramatically further, and easily more than 100,000 kinds of single notes and chords can be expressed.
This demonstrates the remarkable effect of this example: by combining a relatively small number of feature types from multiple categories, a large number of new note feature types, namely notes, can be identified and annotated. The specific annotation method is described in Example 7.
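The count above can be reproduced term by term. The snippet below simply restates the arithmetic given in the text; the variable names are labels chosen for this sketch.

```python
clefs, pitches, body_types = 2, 25, 6
positions = clefs * pitches * body_types            # 2 x 25 x 6 = 300

arm_types, beam_types, beam_positions = 4, 4, 3     # Beam: start, middle, end

whole = positions * 2                                # two types of whole notes
half = positions * 2 * 2                             # two types of half notes x (am0 or am1)
filled = positions * 2 * (arm_types + beam_types * beam_positions)

notes = whole + half + filled                        # 600 + 1,200 + 9,600
with_accidentals = notes * 3                         # three Accidental types
```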

Example 6
Correction of a tilted score image
FIG. 4A is a photograph of the Sarabande score taken at a tilt. Because the staff does not function as a position reference unless it is horizontal, the entire score image was first leveled (FIG. 4B). The procedure was as follows.

1. The input image was converted to grayscale, and the edges of the image were extracted using the Canny method.
2. Straight lines were detected using the Hough method.
3. The tilt angle of the longest straight line was calculated to obtain the rotation angle of the image.
4. The entire image was rotated by the obtained rotation angle.
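Step 3 of the procedure above can be sketched as follows. The sketch assumes that edge extraction and line detection (steps 1 and 2) have already been done by an image library and that the detected lines are available as `(x1, y1, x2, y2)` segments; that input format and the function name are assumptions made for illustration, not details from the text.

```python
import math


def rotation_angle_deg(segments):
    """Return the tilt angle (degrees) of the longest segment.

    segments: iterable of (x1, y1, x2, y2) tuples, the form a Hough-style
    line detector might return (assumed input format).
    """
    def length(seg):
        x1, y1, x2, y2 = seg
        return math.hypot(x2 - x1, y2 - y1)

    x1, y1, x2, y2 = max(segments, key=length)
    if x2 < x1:
        # Orient the segment left to right so the angle does not depend
        # on the order of its end points.
        x1, y1, x2, y2 = x2, y2, x1, y1
    return math.degrees(math.atan2(y2 - y1, x2 - x1))
```

In image coordinates the y axis usually points downward, so the sign with which this angle is passed to a rotation routine depends on that routine's convention.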

In the resulting whole image, the individual measures were still not completely level (the central part was well leveled, but the upper and lower parts still needed correction). Each measure was therefore leveled again by the same procedure as above, except that lines were selected using a threshold for straight lines extending in the horizontal direction (FIG. 4C). When inference was run on the resulting images with the feature models, each feature type was recognized (FIG. 4D).

This result shows that correcting the tilt not only of the whole image but also of each region unit (measure), an element of this example, by means of the position reference has the remarkable effect of improving the accuracy of the invention.

This leveling made it easy to correct the tilt of the staff, which had been a problem in conventional methods, and showed that the present invention can be carried out efficiently.

Example 7
Correction of staff position and spacing
The staff was used as the position reference. The staff position was calculated on the assumption that the measure extracted by the measure model had been extracted at the exact position. Analysis regions 1.2 times the height of the staff were then set above and below the staff. As will be described for the actual annotation, the extent of the upper and lower analysis regions varies from score to score, so whether or not to use features detected in this wider range was made selectable. Here, the initial staff positions deviated from the actual ones, as shown in FIG. 5A. To correct this deviation, two variables (coefficients), alpha and beta, were introduced: alpha is the offset from the center of the staff, and beta is a value that corrects the spacing between the staff lines. These two values were obtained automatically using the following algorithm.

1. The vertical extent of the whole image (the staff plus a margin of 1.2 times the staff height above and below it) was set to 1. alpha was looped over the range -0.03 to 0.03 in steps of 0.001, and for each value beta was looped over -0.005 to 0.005 in steps of 0.001.
2. For each pair of alpha and beta, the staff lines were drawn over the image.
3. The image was converted to grayscale and Gaussian-thresholded, and the area of the black part was measured.
4. On the assumption that the area is smallest when the drawn staff lines coincide with the actual ones, the minimum was found, and the alpha and beta values at that minimum were used for the correction.
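The grid search described in these steps can be sketched as below. Several simplifications are assumptions of this sketch rather than details from the text: the image is taken to be already binarized (1 = black), the candidate lines are full-width and one pixel thick, and alpha and beta enter the line positions as `0.5 + alpha` for the staff center and `staff_height / 4 + beta` for the line spacing.

```python
def staff_rows(alpha, beta, height, margin=1.2):
    """Pixel rows of the five staff lines for a given alpha and beta (assumed model)."""
    staff_height = 1.0 / (1.0 + 2.0 * margin)   # staff share of the normalized height
    spacing = staff_height / 4.0 + beta         # beta corrects the line spacing
    center = 0.5 + alpha                        # alpha shifts the staff center
    return [round((center + (k - 2) * spacing) * height) for k in range(5)]


def black_area_with_overlay(row_black, width, rows):
    """Black pixel count after drawing full-width black lines on the given rows."""
    area = sum(row_black)
    for r in set(rows):
        if 0 <= r < len(row_black):
            area += width - row_black[r]        # only white pixels turn black
    return area


def fit_alpha_beta(image):
    """image: list of rows, each a list of 0 (white) / 1 (black) pixels."""
    height, width = len(image), len(image[0])
    row_black = [sum(row) for row in image]
    best = None
    for i in range(-30, 31):                    # alpha: -0.03 .. 0.03, step 0.001
        for j in range(-5, 6):                  # beta: -0.005 .. 0.005, step 0.001
            alpha, beta = i / 1000, j / 1000
            rows = staff_rows(alpha, beta, height)
            area = black_area_with_overlay(row_black, width, rows)
            if best is None or area < best[0]:
                best = (area, alpha, beta)
    return best[1], best[2]
```

The overlay adds no black pixels where a candidate line falls on an existing staff line, which is why the minimum black area marks the best alignment.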

The result of the correction is shown in FIG. 5B. Running this automatic correction when each measure is annotated made it possible to identify the pitch of notes with high accuracy, which further improved the effect of the present invention.

Example 8
Annotation of each note and creation of a MusicXML file
The main points of the annotation method are briefly described below. Each measure was extracted with the deep learning measure model, and measures that had been recognized with partial overlap were removed automatically based on the position of the overlap. Then, for each staff, the measures, which were laid out in parallel rows, were taken out and joined in series to form the source data for that staff.

8-1 Sorting feature types in the horizontal direction
The staff number was specified as 1 or 2, and the measures of that staff were made into one continuous list. The measures were then taken out one at a time from the front, and all feature types contained in each measure were sorted in the horizontal (x) direction (forward order). Each note was annotated while updating two elements that affect every annotation: the current Clef state and the Accidental table (a table indicating which pitches carry a sharp or flat). The Accidental table was initialized from the initial fifths value (which specifies the major or minor key), and when the next measure was analyzed, the immediately preceding fifths state was carried over.
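The two ingredients of this step can be illustrated as follows. The sketch assumes each detection is a dictionary with an `x` coordinate and that the Accidental table is keyed by step name; both are assumptions of this illustration. The mapping from a fifths value to sharps and flats follows the usual order of sharps (F, C, G, D, A, E, B) and flats (B, E, A, D, G, C, F).

```python
SHARP_ORDER = ["F", "C", "G", "D", "A", "E", "B"]
FLAT_ORDER = ["B", "E", "A", "D", "G", "C", "F"]


def accidental_table_from_fifths(fifths):
    """Map each step to its alteration: +1 sharp, -1 flat, 0 natural."""
    table = {step: 0 for step in "CDEFGAB"}
    if fifths > 0:
        for step in SHARP_ORDER[:fifths]:
            table[step] = 1
    elif fifths < 0:
        for step in FLAT_ORDER[:-fifths]:
            table[step] = -1
    return table


def sort_horizontally(detections):
    """Order detections left to right by their x coordinate (assumed key name)."""
    return sorted(detections, key=lambda d: d["x"])
```

For example, a fifths value of 2 (D major or B minor) marks F and C as sharp, and a value of -1 (F major or D minor) marks B as flat.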

8-2 Analyzing each feature type in order from the front
The horizontally sorted feature types were analyzed in order. The analysis branched according to the feature category to which each type belonged.

A. Clef category
If the feature type under analysis was G or F (cf0 or cf1) in the Clef category, the Clef state was changed.

B. Accidental category
If the feature type under analysis was in the Accidental category, the Accidental table was updated in combination with the position reference.

C. Rest category
If the feature type under analysis was in the Rest category, it was annotated according to its Rest type, and the resulting element was added to the output list.

D. Body category (notes identified from vertically overlapping feature types)
If the feature type under analysis was in the Body category, the feature types overlapping it in the vertical direction were sorted into a list in order to detect chords and to determine the note length from the Arm/Beam type. If the list included a Rest type, the Voice was assigned according to its position (Voice 1 if it was at the bottom, Voice 2 if it was at the top). If it was in a middle position, whether to add it as the element before or after the Body type was decided according to its position relative to its neighbors, and it was added to the output list.

Body types were annotated case by case according to the number and position of the vertically overlapping feature types. When multiple Body types were included, a chord (Chord) was assigned in accordance with the MusicXML file specification.

Case 1: both the bottom and the top feature types are Arm/Beam
Case 2: the bottom is Rest
Case 3: the top is Rest
Case 4: the top is Arm/Beam
Case 5: the bottom is Arm/Beam
Case 6: both the top and the bottom are Body
In annotating each Body type, the current Clef and the Accidental table were passed as arguments, and the note feature type was annotated.
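The six cases can be expressed as a small classifier over a vertically sorted stack. The category names and the rule that the cases are tested in the order listed (so that Case 1 takes precedence over Cases 4 and 5) are assumptions of this sketch; the text lists the cases but does not state a precedence.

```python
def classify_stack(stack):
    """Return the case number (1-6) for a vertical stack of feature categories.

    stack: category names ordered from top to bottom, e.g.
    ["arm_beam", "body", "body"] (names are placeholders for this sketch).
    """
    top, bottom = stack[0], stack[-1]
    if top == "arm_beam" and bottom == "arm_beam":
        return 1
    if bottom == "rest":
        return 2
    if top == "rest":
        return 3
    if top == "arm_beam":
        return 4
    if bottom == "arm_beam":
        return 5
    if top == "body" and bottom == "body":
        return 6
    raise ValueError(f"unhandled stack: {stack!r}")
```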

Body, Arm, and Rest types that had already been analyzed were placed in an exclusion list to prevent them from being analyzed again. Beam types, on the other hand, were reused for the analysis of adjacent Body types.

In this way, the horizontally sorted feature types were annotated so that certain previously analyzed types (Clef, Accidental) affect the feature types that follow them, while vertically overlapping feature types were annotated using the feature types that act in the vertical direction (e.g., Arm/Beam). The pitch positions were corrected for each individual measure using alpha and beta.
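The way earlier Clef and Accidental detections influence later notes can be sketched as a single left-to-right pass. The 25 positions per clef (D3 to G6 for the G clef, F1 to B4 for the F clef) come from Example 5; the feature dictionary layout, the `index` field counted upward from the lowest position, and the step-keyed table are assumptions of this illustration.

```python
STEPS = "CDEFGAB"
# Lowest of the 25 positions for each clef, as stated in Example 5.
LOWEST = {"G": ("D", 3), "F": ("F", 1)}


def pitch_at(clef, index):
    """Return (step, octave) for position index 0..24, counted upward."""
    if not 0 <= index < 25:
        raise ValueError("index must be in 0..24")
    step, octave = LOWEST[clef]
    n = STEPS.index(step) + index
    return STEPS[n % 7], octave + n // 7


def annotate(features, clef="G", table=None):
    """Walk the features from left to right, carrying Clef and Accidental state."""
    table = dict(table or {})
    out = []
    for f in sorted(features, key=lambda f: f["x"]):
        if f["category"] == "clef":
            clef = f["type"]                        # "G" or "F"
        elif f["category"] == "accidental":
            step, _ = pitch_at(clef, f["index"])
            table[step] = f["alter"]                # affects the notes that follow
        elif f["category"] == "rest":
            out.append({"rest": f["type"]})
        elif f["category"] == "body":
            step, octave = pitch_at(clef, f["index"])
            out.append({"step": step, "octave": octave, "alter": table.get(step, 0)})
    return out
```

In this indexing, position 8 is the bottom staff line in both clefs (E4 for the G clef, G2 for the F clef).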

8-3 Adjusting the Voice
The annotation result for each measure was verified. In Cases 4 to 6 above, all notes were initially assigned to Voice 1. If, as a result, the total length of the annotated notes exceeded the length prescribed for the measure, the Voice was changed. Specifically, notes with a downward stem were assigned to Voice 1 and notes with an upward stem to Voice 2. The length of the notes in the measure was then recalculated for each Voice, and if the length of the notes in Voice 1 still exceeded the prescribed length, whole notes were assigned to Voice 2.
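The voice adjustment rule can be sketched as follows. The note dictionary fields (`duration`, `stem`, `type`) and the unit of `measure_length` are assumptions of this illustration, and chord members, which should count only once toward the measure length, are not handled.

```python
def adjust_voices(notes, measure_length):
    """Reassign voices when the notes of a measure are too long for Voice 1.

    notes: dicts with 'duration' (same unit as measure_length),
    'stem' ('up', 'down' or None) and 'type' (e.g. 'whole', 'quarter').
    """
    for n in notes:
        n["voice"] = 1                      # everything starts in Voice 1

    def total(voice):
        return sum(n["duration"] for n in notes if n["voice"] == voice)

    if total(1) <= measure_length:
        return notes

    # Stems down stay in Voice 1, stems up move to Voice 2.
    for n in notes:
        if n["stem"] == "up":
            n["voice"] = 2

    # If Voice 1 is still too long, move the whole notes to Voice 2.
    if total(1) > measure_length:
        for n in notes:
            if n["type"] == "whole":
                n["voice"] = 2
    return notes
```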

8-4 Joining the measures in series
The data for the completed measures were joined in series to create the data for the whole staff. The data were put into ElementTree (ET) form, with elements registered so as to structure the data.

8-5 Creating the MusicXML file
The ET-structured note data were converted to XML using a function that converts such data into an XML file, and a MusicXML file was created.
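A minimal sketch of steps 8-4 and 8-5 using Python's standard `xml.etree.ElementTree` is shown below. The input layout (a list of measures, each a list of note dictionaries), the part name, and the function name are assumptions of this illustration; key, time, and clef attributes, rests, and chords are omitted for brevity.

```python
import xml.etree.ElementTree as ET


def build_musicxml(measures, divisions=1):
    """Build a bare-bones MusicXML string from per-measure note dictionaries.

    measures: list of lists of dicts with 'step', 'octave', 'duration',
    'voice' and 'type' (assumed layout).
    """
    root = ET.Element("score-partwise", version="3.1")
    part_list = ET.SubElement(root, "part-list")
    score_part = ET.SubElement(part_list, "score-part", id="P1")
    ET.SubElement(score_part, "part-name").text = "Staff 1"

    part = ET.SubElement(root, "part", id="P1")
    for number, notes in enumerate(measures, start=1):
        measure = ET.SubElement(part, "measure", number=str(number))
        if number == 1:
            attributes = ET.SubElement(measure, "attributes")
            ET.SubElement(attributes, "divisions").text = str(divisions)
        for n in notes:
            note = ET.SubElement(measure, "note")
            pitch = ET.SubElement(note, "pitch")
            ET.SubElement(pitch, "step").text = n["step"]
            ET.SubElement(pitch, "octave").text = str(n["octave"])
            ET.SubElement(note, "duration").text = str(n["duration"])
            ET.SubElement(note, "voice").text = str(n["voice"])
            ET.SubElement(note, "type").text = n["type"]
    return ET.tostring(root, encoding="unicode")
```

Measures are appended in order under a single part, which mirrors joining the per-measure data in series before serializing.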

Results
FIG. 6 shows the result of identifying each note of staff 1 of the score image of Bach's Minuet in FIG. 2C by the method of the present invention, converting it to XML, and loading and displaying the XML file in Sibelius (FIG. 6A) and MuseScore (FIG. 6B). As FIG. 6 shows, the XML file produced could be loaded and displayed in Sibelius, MuseScore, and Finale (not shown; the displayed measures needed adjustment).

Next, the accuracy of the annotation was evaluated. The result of displaying the XML shown in FIG. 6 in each score software package was compared with the original image, FIG. 2C. The results are summarized in Table 6.

For staff 1, measures were recognized with 97% (32/33) accuracy, demonstrating that measure extraction is highly accurate. For the new note feature types identified by combining the individual feature types with the position reference, namely pitch (step; based on the Clef type and the staff as position reference), note (Note; pitch plus length), and chord (Chord; all constituent notes match), the accuracy was 98% (125/128), 95% (122/128), and 100% (1/1), respectively. Accidentals (Accidental; both pitch and sign match) were also recognized at 100% (3/3).

For staff 2, measures were recognized with 97% (32/33) accuracy. The accuracy for pitch (step), note (Note), and chord (Chord) was 95% (71/75), 95% (71/75), and 100% (1/1), respectively. Rests (Rest) were recognized at 40% (2/5) and accidentals (Accidental) at 50% (1/2).

These results show that notes annotated by the method of this example are highly accurate, and that this example has a remarkable effect.

We further examined whether XML could be created not only from images generated digitally from a PDF but also from photographs of sheet music, which are likely to be used in practice. Because the staff in a photograph is often not horizontal, XML conversion was performed from a tilted photograph such as that shown in FIG. 7A. The resulting score for staff 1 was displayed using Sibelius (FIG. 7B).

As shown in Table 6, measures were recognized with 96% (23/24) accuracy. The accuracy for pitch (step), note (Note), and chord (Chord) was 87% (135/156), 86% (134/156), and 78% (29/37), respectively. Rests (Rest) were recognized at 64% (16/25) and accidentals (Accidental) at 71% (10/14).

In particular, staff 1 of the Sarabande contained relatively complex chords (Chord) in 37 places, and recognizing them with 78% accuracy is a surprising result that demonstrates the remarkable effect of this example.

As a comparative example, FIG. 8 shows the result of inputting the score of FIG. 7A into the existing OMR application PhotoScore2020 and running OMR processing. As FIG. 8 shows, the existing technology could not obtain correct note information from the tilted photograph. Furthermore, because MuseScore3 can currently analyze only PDF images, the photograph of FIG. 7A was converted to PDF and submitted for OMR processing, but "unsuccessful" was output and no analysis was possible.

Thus, the fact that notes could be recognized with high accuracy (about 86%) even from a photographic image whose position reference (staff) was not horizontal demonstrates a further remarkable effect of this example.

Example 9
Playing back sound from MusicXML
We checked whether sound can be played back from the MusicXML created by the present invention using common software.

The XML files of the Minuet and the Sarabande verified in Example 8 were loaded into MuseScore3 and Sibelius First, and it was confirmed that sound was played back using their playback functions.

It was also confirmed that the Export function of MuseScore3 can output mp3, wav, and midi files. The mp3 and wav files were played on a computer and sound output was confirmed. The midi file was loaded into Logic Pro, and sound playback was confirmed.

The image-derived information creation method of the present invention is useful in the field of OMR. More generally, the image-derived information creation method using the deep learning models of the present invention is useful in fields where operation and judgment rely on images, such as automated driving, robot operation, medical diagnosis, operation of medical devices (endoscopes, catheters), and product inspection.

Claims

A method of creating music information from a score image, comprising:
a step of inputting a score image;
a step of extracting at least one measure from the score image;
a step of identifying the notes in each of the at least one measure without erasing the staff; and
a step of creating music information from the identified notes.

The method of claim 1, wherein the at least one measure is extracted by a deep learning model.

The method of claim 1 or 2, further comprising a step of correcting the position of the staff within each of the at least one measure.

The method according to any one of claims 1 to 3, wherein the notes in each of the at least one measure are identified using a deep learning model.

The method according to any one of claims 1 to 4, wherein the notes in each of the at least one measure are identified using a plurality of deep learning models.

The method of claim 5, wherein the plurality of deep learning models are processed in parallel.

The method according to any one of claims 1 to 6, wherein the music information is selected from the group consisting of an XML file, a musicXML file, a MIDI file, an mp3 file, a wav file, and a musical score.

A computing device for creating music information from a score image, comprising:
an input unit for inputting the score image;
a measure extraction unit that extracts at least one measure from the score image;
a staff correction unit that corrects the position of the staff in each of the at least one measure;
a note identification unit that identifies the notes in each of the at least one measure using a plurality of deep learning models without erasing the staff; and
a music information creation unit that creates music information from the identified notes,
wherein the at least one measure is extracted by a deep learning model,
the plurality of deep learning models are processed in parallel, and
the music information is selected from the group consisting of an XML file, a musicXML file, a MIDI file, an mp3 file, a wav file, and a musical score.

A program for creating music information from a score image, comprising:
an input unit for inputting the score image;
a measure extraction unit that extracts at least one measure from the score image;
a staff correction unit that corrects the position of the staff in each of the at least one measure;
a note identification unit that identifies the notes in each of the at least one measure using a plurality of deep learning models without erasing the staff; and
a music information creation unit that creates music information from the identified notes,
wherein the at least one measure is extracted by a deep learning model,
the plurality of deep learning models are processed in parallel, and
the music information is selected from the group consisting of an XML file, a musicXML file, a MIDI file, an mp3 file, a wav file, and a musical score.

A method of creating music information from a score image, comprising:
a step of inputting a score image;
a step of extracting at least one measure from the score image;
a step of correcting the slope and spacing of the staff in each of the at least one measure;
a step of identifying the notes in each of the at least one measure; and
a step of creating music information from the identified notes.

A method of creating music information from a score image, comprising:
a step of inputting a score image;
a step of extracting at least one measure from the score image;
a step of identifying the notes in each of the at least one measure; and
a step of creating music information from the identified notes,
wherein the extraction is performed along the top and bottom lines of the staff.