US20220043854A1 - Sheet Music Search and Discovery System


Info

Publication number
US20220043854A1
Authority
US
United States
Prior art keywords
user
compositions
instruments
sheet music
search
Prior art date
Legal status
Pending
Application number
US17/476,903
Inventor
Jeremy Sawruk
Bruce Donnelly
Michael Hamilton
Current Assignee
JW Pepper & Son Inc
Original Assignee
JW Pepper & Son Inc
Priority date
Filing date
Publication date
Application filed by JW Pepper & Son Inc filed Critical JW Pepper & Son Inc
Priority to US17/476,903
Publication of US20220043854A1
Status: Pending

Classifications

    • G06F16/683: Retrieval of audio data characterised by using metadata automatically derived from the content
    • G06F16/686: Retrieval of audio data characterised by using manually generated metadata, e.g., tags, keywords, comments, title or artist information, user ratings
    • G06F16/638: Presentation of query results (audio data querying)
    • G06F16/9535: Search customisation based on user profiles and personalisation
    • G06F16/9538: Presentation of query results (web querying)
    • G06N3/044: Recurrent networks, e.g., Hopfield networks
    • G06N3/045: Combinations of networks
    • G06N3/08: Neural network learning methods
    • G06N5/01: Dynamic search techniques; heuristics; dynamic trees; branch-and-bound
    • G06N5/046: Forward inferencing; production systems
    • Legacy codes: G06N3/0445, G06N3/0454, G06N5/003

Definitions

  • In the field of natural language processing, vector representations of words that carry syntactic and semantic information (word embeddings) have proven powerful in various natural language processing tasks, in particular sentiment analysis. Meanwhile, machine learning algorithms have also been used to detect patterns in data and generalize those patterns in order to adapt to data they have not previously seen. For example, sequence labeling has been used to algorithmically assign categorical labels to observed values.
  • Some sheet music may be available in structured formats (such as MusicXML) that contain musical data (e.g., pitches, rhythms, clefs, articulations, etc.) in a musically semantic structure.
  • Other sheet music may only be available as (unstructured) image data (such as PDFs).
  • Unstructured sheet music data can be converted to structured music data and analyzed using the same mathematical rules mentioned above.
  • a two-step process of converting image data to structured sheet music data and then using mathematical rules developed for analyzing structured sheet music data may be computationally inefficient.
  • the conversion process may not be precise, particularly if the image data is unclear.
  • the system analyzes compositions stored as structured sheet music data to generate metadata characterizing each composition (or part within the composition).
  • the system stores a global vector space of semantic representations of elements extracted from a corpus of structured music data, where semantically similar elements extracted from the corpus are clustered together in the global vector space, generates semantic representations of each composition, and generates metadata characterizing each composition in part by comparing the semantic representations.
  • the system may also generate metadata characterizing each composition or part through deterministic functions, rules and/or heuristics, extracting and labeling phrases, or machine learning. In particular, machine learning may be used to predict the difficulty of each composition or part.
  • the system analyzes compositions stored as image data using machine learning-based pattern recognition. For example, the system may use algorithms pretrained to determine the range of a composition, extract and describe phrases, extract and analyze measures, and determine the difficulty of each composition (by comparing image patterns to image patterns in a corpus of known compositions).
  • the algorithms may generate metadata without recognizing individual musically-relevant objects (e.g., staves, measures, clefs, or notes).
  • the metadata generated by the system allows the system to provide search and recommendation functionality unlike anything currently available.
  • the system may provide functionality for users to identify instruments and a range for each instrument and identify compositions with similar instruments and ranges.
  • the system may provide functionality for the user to input a search query that includes keywords or audio (input, e.g., by singing or humming).
  • the system may also identify recommendations for the user and/or provide functionality to automatically generate a concert program by comparing the instruments and ranges of each of the compositions.
  • Unlike conventional sheet music search engines, which simply select from available sheet music based on existing textual information to provide a humanly comprehensible number of search results, the disclosed system generates new data; specifically, a new kind of metadata characterizing the underlying sheet music.
  • the metadata generated by the disclosed system enables the disclosed search engine to do things that conventional sheet music search engines cannot.
  • the metadata generated by the disclosed system enables the disclosed system to provide functionality for a user to search and/or filter sheet music based on musically-relevant characterizations of the underlying sheet music, such as range or difficulty.
  • the disclosed search engine can provide functionality for a user to search the underlying sheet music, for example by singing or humming a melodic fragment.
  • FIG. 1 is a block diagram of a sheet music search and discovery system according to an exemplary embodiment of the present invention
  • FIG. 2 is a drawing illustrating an overview of the architecture of a sheet music search and discovery system according to an exemplary embodiment of the present invention
  • FIG. 3 is a flowchart illustrating a process for extracting explicit metadata from structured sheet music data and unstructured sheet music data and converting unstructured sheet music data into structured sheet music data according to an exemplary embodiment of the present invention.
  • FIG. 4 is a flowchart illustrating the inputs and outputs of an example structured sheet music analysis engine according to an exemplary embodiment of the present invention
  • FIG. 5A is a flowchart illustrating a workflow of an example structured sheet music analysis engine according to an exemplary embodiment of the present invention
  • FIG. 5B is a flowchart illustrating a semantic embedding process and a semantic similarity analysis process according to an exemplary embodiment of the present invention
  • FIG. 5C is a diagram illustrating a context encoder (ConEc) NN architecture corresponding to a continuous bag of words (CBOW) Word2Vec model;
  • FIG. 5D is a diagram illustrating a similar model trained to identify a musical element in context according to an exemplary embodiment of the present invention
  • FIG. 5E is a graph illustrating an example global vector space projected down to two dimensions according to an exemplary embodiment of the present invention.
  • FIG. 6 is a flowchart illustrating a preprocessing technique according to an exemplary embodiment of the present invention.
  • FIG. 7 is a flowchart illustrating a feature extraction technique according to an exemplary embodiment of the present invention.
  • FIG. 8 is a flowchart illustrating a rules analysis process according to an exemplary embodiment of the present invention.
  • FIG. 9 is a flowchart illustrating a lyrics extraction process according to an exemplary embodiment of the present invention.
  • FIG. 10 is a flowchart illustrating a data ingestion process for analyzing unstructured sheet music data and determining metadata that describes each composition stored as unstructured sheet music data according to an exemplary embodiment of the present invention
  • FIGS. 11A and 11B are a flowchart illustrating a workflow of an optical analysis engine according to an exemplary embodiment of the present invention
  • FIG. 12 is a flowchart illustrating a query process according to an exemplary embodiment of the present invention.
  • FIG. 13 is a flowchart illustrating a process for querying sheet music data using a user profile according to an exemplary embodiment of the present invention
  • FIG. 14 is a flowchart illustrating a recommendation process according to an exemplary embodiment of the present invention.
  • FIG. 15 is a flowchart illustrating a concert program generation process according to an exemplary embodiment of the present invention.
  • FIG. 16 illustrates a user profile view of a graphical user interface according to an exemplary embodiment of the present invention
  • FIG. 17 illustrates an edit range view of the graphical user interface according to an exemplary embodiment of the present invention
  • FIG. 18 illustrates a search view of the graphical user interface according to an exemplary embodiment of the present invention
  • FIG. 19 illustrates a search results view of the graphical user interface according to an exemplary embodiment of the present invention
  • FIG. 20 is a block diagram of a simplified multifunctional electronic device according to an exemplary embodiment of the present invention.
  • FIG. 21A is an example passage that is rhythmically difficult but not very syncopated
  • FIG. 21B is a musical excerpt
  • FIG. 21C illustrates each note being represented by an index from a vocabulary
  • FIG. 21D is an example input to phrase extraction
  • FIGS. 21E and 21F are an example of parallel corpora
  • FIG. 21G is a part before stitching
  • FIG. 21H is an uninterrupted scale run after stitching.
  • FIG. 1 is a block diagram of a sheet music search and discovery system 100 according to an exemplary embodiment of the present invention.
  • the sheet music search and discovery system 100 includes one or more databases 110 , a structured sheet music analysis engine (SMAE) 140 , a sheet music conversion engine 160 , an optical analysis engine 180 , a search engine 190 , and a graphical user interface 192 .
  • the one or more databases 110 store structured sheet music data 112 , unstructured sheet music data 114 , and metadata 116 identified by the system 100 based on the structured sheet music data 112 and the unstructured sheet music data 114 .
  • the structured sheet music data 112 may be sheet music (i.e., compositions) that contains musical data (e.g., pitches, rhythms, clefs, articulations, etc.) in a musically semantic structure (e.g., MusicXML).
  • MusicXML is a common encoding of structured musical data in Extensible Markup Language (XML) format.
  • XML is a generic encoding of hierarchical data stored in a flat text file.
  • An XML file consists of a set of nodes, also called elements.
  • An XML document consists of a root node which contains zero or more child nodes, each of which may contain zero or more child nodes and so forth. A node which contains no child nodes is called an empty node.
  • a node which contains one or more child nodes is called the parent of the child nodes it contains. Note that while a parent node may contain multiple child nodes, a child node cannot be contained by more than one parent, and therefore cannot have more than one parent node. More detail regarding XML may be found in the World Wide Web Consortium (W3C) documentation (http://www.w3c.org/XML), which is incorporated herein by reference. More detail regarding MusicXML may be found in the MusicXML documentation (http://usermanuals.musicxml.com/MusicXML/MusicXML.htm), which is incorporated herein by reference.
  • the structured sheet music data 112 may be encoded in another format, such as Music Encoding Initiative (MEI) (see http://music-encoding.org/), MNX (see https://www.w3.org/community/music-notation/2017/01/19/introducing-mnx/), ABC (see http://abcnotation.com), MuseData (see http://www.musedata.org/about/), etc.
  • the unstructured sheet music data 114 may be sheet music (i.e., compositions) in a graphical format (e.g., bitmap, vector etc.).
  • the unstructured sheet music data 114 lacks the musical semantics included in the structured sheet music data 112 described above, and therefore requires additional processing to identify musical semantics as described below.
  • Both the structured sheet music data 112 and the unstructured sheet music data 114 may be stored in computer readable formats.
  • the metadata 116 includes data that describes the (structured and unstructured) sheet music data 112 and 114 .
  • FIG. 2 is a drawing illustrating an overview of the architecture 200 of the sheet music search and discovery system 100 according to an exemplary embodiment of the present invention.
  • the architecture 200 may include one or more servers 210 and one or more storage devices 220 connected to a plurality of remote computer systems 240 , such as one or more personal systems 250 and one or more mobile computer systems 260 , via one or more networks 230 .
  • the architecture 200 may also include one or more document scanners 270 that communicate with the one or more servers 210 and output data for storage in the one or more storage devices 220 either directly (via wired or wireless communication) or via the one or more networks 230 .
  • the one or more servers 210 may include an internal storage device 212 and a processor 214 .
  • the one or more servers 210 may be any suitable computing device including, for example, an application server and a web server which hosts websites accessible by the remote computer systems 240 .
  • the one or more storage devices 220 may include external storage devices and/or the internal storage device 212 of the one or more servers 210 .
  • the one or more storage devices 220 may also include any non-transitory computer-readable storage medium, such as an external hard disk array or solid-state memory.
  • the networks 230 may include any combination of the internet, cellular networks, wide area networks (WAN), local area networks (LAN), etc. Communication via the networks 230 may be realized by wired and/or wireless connections.
  • a remote computer system 240 may be any suitable electronic device configured to send and/or receive data via the networks 230 .
  • a remote computer system 240 may be, for example, a network-connected computing device such as a personal computer, a notebook computer, a smartphone, a personal digital assistant (PDA), a tablet, a portable weather detector, a global positioning satellite (GPS) receiver, network-connected vehicle, a wearable device, etc.
  • a personal computer system 250 may include an internal storage device 252 , a processor 254 , output devices 256 and input devices 258 .
  • the one or more mobile computer systems 260 may include an internal storage device 262 , a processor 264 , output devices 266 and input devices 268 .
  • An internal storage device 212 , 252 , and/or 262 may include one or more non-transitory computer-readable storage mediums, such as hard disks or solid-state memory, for storing software instructions that, when executed by a processor 214 , 254 , or 264 , carry out relevant portions of the features described herein.
  • a processor 214 , 254 , and/or 264 may include a central processing unit (CPU), a graphics processing unit (GPU), etc.
  • a processor 214 , 254 , and/or 264 may be realized as a single semiconductor chip or more than one chip.
  • An output device 256 and/or 266 may include a display, speakers, external ports, etc.
  • a display may be any suitable device configured to output visible light, such as a liquid crystal display (LCD), a light emitting polymer display (LPD), a light emitting diode (LED), an organic light emitting diode (OLED), etc.
  • the input devices 258 and/or 268 may include keyboards, mice, trackballs, still or video cameras, touchpads, etc.
  • a touchpad may be overlaid or integrated with a display to form a touch-sensitive display or touchscreen.
  • the structured sheet music data 112 , the unstructured sheet music data 114 , and the metadata 116 may be any organized collection of information, whether stored on a single tangible device or multiple tangible devices, and may be stored, for example, in the one or more storage devices 220 .
  • the metadata 116 describing each composition is associated with that composition such that the metadata 116 can be used to discover compositions from the structured sheet music data 112 that are relevant to users.
  • the structured sheet music data 112 may be in computer readable format that contains musical data such as pitches, rhythms, clefs, articulations, etc. in a musically semantic structure (e.g., MusicXML).
  • the structured sheet music analysis engine (SMAE) 140 , the sheet music conversion engine 160 , the optical analysis engine 180 , and the search engine 190 may be realized by software instructions stored on one or more of the internal storage devices 212 , 252 , and/or 262 and executed by one or more of the processors 214 , 254 , or 264 .
  • the graphical user interface 192 may be any interface that allows a user to input information for transmittal to the sheet music search and discovery system 100 and/or outputs information received from the sheet music search and discovery system 100 to a user.
  • the graphical user interface 192 may be realized by software instructions stored on one or more of the internal storage devices 212 , 252 , and/or 262 executed by one or more of the processors 214 , 254 , or 264 .
  • the metadata 116 describes the compositions stored as structured sheet music data 112 or unstructured sheet music data 114 .
  • FIG. 3 is a flowchart illustrating a process 300 for extracting explicit metadata 350 from structured sheet music data 112 and unstructured sheet music data 114 and converting unstructured sheet music data 114 into structured sheet music data 112 according to an exemplary embodiment of the present invention.
  • Explicit metadata 350 is extracted from the structured sheet music data 112 by the SMAE 140 at 310 .
  • Explicit metadata 350 refers to metadata 116 directly encoded in the structured sheet music data 112 or unstructured sheet music data 114 .
  • structured sheet music data 112 includes low-level semantics such as notes, rhythms, etc. and lacks higher-level semantics such as range or difficulty. (This is analogous to text documents which encode characters, words, and sentences, but do not encode higher-level semantics like subject matter or syntax.) Therefore, as described below, the SMAE 140 analyzes the structured sheet music data 112 to calculate or determine mid-level and/or higher-level semantic metadata 116 describing each composition.
  • explicit metadata 350 includes title, composer, instrumentation, etc.
  • Explicit metadata 350 is extracted from structured sheet music data 112 by reading the structured fields of the data. If any expected explicit metadata 350 is missing, the field is marked as missing.
  • Explicit metadata 350 is extracted from the unstructured sheet music data 114 by the SMAE 140 at 320 . Because unstructured sheet music data 114 does not include structured fields like structured sheet music data 112 , it is more likely unstructured sheet music data 114 is missing musically-relevant explicit metadata 350 while including musically irrelevant explicit metadata 350 such as colorspace, bit resolution, dots per inch, etc. Explicit metadata 350 missing from unstructured sheet music data 114 is marked as missing.
  • the unstructured sheet music data 114 is converted into structured sheet music data 112 by the sheet music conversion engine 160 at 330 and 340 .
  • Optical character recognition (OCR) is used to extract text data (e.g., title, composer, etc.), and the extracted text data is stored as metadata 116 at 330.
  • Optical music recognition (OMR) is used to extract musical notations at 340 .
  • the text data and musical notations extracted from the unstructured sheet music data 114 are formatted into structured sheet music data 112 . Accordingly, as used for the remainder of this description, structured sheet music data 112 refers to both structured sheet music data 112 and unstructured sheet music data 114 that has been converted into structured sheet music data 112 .
  • FIG. 4 is a flowchart 400 illustrating the inputs and outputs of the SMAE 140 according to an exemplary embodiment of the present invention.
  • structured sheet music 112 is input to the SMAE 140 and the SMAE 140 outputs text/keyword output 420 and extended property output 430 .
  • the text/keyword output 420 may include individual words, phrases, or sentences that describe the musical content of the document using musical terminology.
  • the text/keyword output 420 may be fed into a keyword-based search engine, used in evaluating natural language queries, etc.
  • the extended property output 430 may include statistics or other numerical features (e.g., range, difficulty, etc.) that provide additional information about the document in formats other than plain text.
  • the extended property output 430 may be queried through a specialized user interface such as that described in FIGS. 16-19 .
  • FIG. 5A is a flowchart illustrating a workflow 500 of the structured sheet music analysis engine (SMAE) 140 according to the present invention.
  • the structured sheet music 112 is preprocessed at 510 .
  • Preprocessing allows music from different sources, such as different publishers or notation formats, to be analyzed using the same procedures. (The publisher, notation format, etc. may be identified in the explicit metadata 350.) Further, because the SMAE 140 may analyze a complete score, individual parts, or both, the structured sheet music 112 must be preprocessed into a normalized format. If a complete score is provided, individual parts must be extracted. If individual parts are provided, a score must be compiled. This is necessary because some features are specific to individual parts, such as the range of individual instruments, while other features require knowledge of the entire score, such as when individual instruments have solos. Preprocessing 510 will be discussed further in FIG. 6.
  • the preprocessed structured sheet music 112 may be used by the SMAE 140 to perform feature extraction 520 , semantic embedding 532 A, phrase extraction 542 , and lyrics extraction 550 as described below.
  • Feature extraction 520 is a process for analyzing structured music data 112 to calculate features 522 that can be directly computed by a deterministic mathematical function or algorithm. (Deterministic means the same inputs always produce the same outputs.)
  • Features 522 include relevant musical data, statistical features (e.g., the average number of notes in a measure), features derived from lookup tables, range, pitch histograms, etc. Some or all of the features 522 may be used to calculate the higher level semantic metadata at 526 and 530 , as described below.
  • Feature extraction 520 is discussed further in reference to FIG. 7 .
  • the SMAE 140 may perform rules analysis 524 .
  • the rules analysis 524 is the use of rules and/or heuristics to determine higher-level semantic metadata, referred to herein as rule-derived metadata 526 .
  • a “heuristic” is a rule without a theoretical or pedagogical foundation (as opposed to a “rule,” which, as used herein, is explicitly defined based on information from musicology or music education literature.) Examples of heuristics that may be used by the SMAE 140 to generate rule-derived metadata 526 are shown in Table 1:
  • Pitch Histogram: A pitch histogram with a wide standard deviation implies a more difficult composition (more notes and greater range), while a narrow standard deviation implies an easier composition (fewer notes and narrower range). Similarly, skewed distributions may imply more difficult compositions because the central tendency may emphasize or focus on a particular high or low range of the instrument.
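The pitch-histogram heuristic above can be made concrete with a short sketch. The following Python snippet is illustrative only and is not part of the patent; the use of MIDI pitch numbers, the thresholds, and the skew proxy are assumptions.

```python
import numpy as np

def pitch_histogram_difficulty(midi_pitches, wide_std=7.0):
    """Label a part 'easy' or 'difficult' from the spread of its pitch distribution."""
    pitches = np.asarray(midi_pitches, dtype=float)
    std = pitches.std()                          # std of the pitch histogram: wide spread -> harder
    skew = pitches.mean() - np.median(pitches)   # crude skew proxy (mean vs. median offset)
    return "difficult" if std >= wide_std or abs(skew) > 2.0 else "easy"

print(pitch_histogram_difficulty([60, 62, 64, 65, 67]))   # narrow range -> "easy"
```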
  • As used herein, a "rule" is explicitly defined based on information from musicology or music education literature (as opposed to a "heuristic," which is a rule without a theoretical or pedagogical foundation). Examples of rules that may be used by the SMAE 140 to generate rule-derived metadata 526 are shown in Table 2:
  • Rhythmic Vocabulary: Rhythmic vocabulary is the number of distinct note durations and the number of times each occurs in a composition.
  • An example of rhythmic vocabulary is “Whole, Half, Quarter”, indicating that the composition uses notes and rests with durations of whole notes, half notes, and quarter notes. A larger rhythmic vocabulary implies a more difficult composition.
  • a specific vocabulary item implies a greater level of difficulty.
  • Rhythmic vocabulary may also include counts of how often the rhythmic item appears (e.g., 10 whole notes, 30 quarter notes, 24 eighth notes) referred to below as “duration type frequency.”
  • Rhythmic Onset Density: Rhythmic onset density is the number of notes occurring in a measure that are neither a) rests nor b) the continuation or end of a tie. Such notes are referred to as onsets. The more onsets in a measure, the more difficult it is to play. (An illustrative computation is sketched after Table 2 below.)
  • For example, a measure consisting of a single whole note contains one onset and is easy to perform, while a measure of all 16th notes in 4/4 time (16 onsets) is much more difficult.
  • Averaging the rhythmic onset density over all the measures indicates the overall difficulty of a composition.
  • Relations among Meter, Rhythm, and Tempo: A 16th note in a slow 6/8 meter is relatively easy to perform, while a 16th note in a fast 2/2 meter is much more difficult.
  • Rules can be written to capture the relation among meter, rhythm, and tempo to further imply the difficulty of a composition.
  • Idiomatic Instrumental Techniques: The same passage played on different instruments can have varying degrees of difficulty.
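As a rough illustration of the rhythmic onset density rule referenced above, the following Python sketch assumes a hypothetical measure representation (a list of note dictionaries with "is_rest" and "tie" fields); it is not taken from the patent.

```python
def onset_count(measure):
    # Onsets are notes that are neither rests nor the continuation/end of a tie.
    return sum(
        1
        for note in measure
        if not note.get("is_rest") and note.get("tie") not in ("continue", "stop")
    )

def average_onset_density(measures):
    """Average onsets per measure; higher values imply a more difficult composition."""
    if not measures:
        return 0.0
    return sum(onset_count(m) for m in measures) / len(measures)

whole_note_measure = [{"is_rest": False, "tie": None}]
sixteenths_measure = [{"is_rest": False, "tie": None}] * 16
print(average_onset_density([whole_note_measure, sixteenths_measure]))  # 8.5
```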
  • the rules analysis process 524 includes a number of subroutines, each specializing in a particular analysis method, which are discussed further with reference to FIG. 8 .
  • Machine learning analysis 528 may be used to analyze the structured sheet music data 112 (as well as the rule-derived metadata 526 and/or the features 522 ) to generate higher-level semantic metadata 116 , referred to herein as machine-learning derived metadata 530 .
  • Machine learning is a subfield of computer science that studies a class of algorithms that can detect patterns in data and generalize those patterns in order to adapt to data that it has not previously seen.
  • the machine learning analysis 528 may include ensemble learning.
  • Ensemble learning uses multiple machine learning algorithms to obtain better predictive performance than could be achieved from any one constituent learning algorithm.
  • the machine learning algorithms may include a k-nearest neighbors algorithm, support vector machines, neural networks, etc. Multiple machine learning algorithms of the same kind may be used where each algorithm varies with respect to their hyperparameters. For example, multiple neural networks may be used where the number of nodes in the hidden layer of each neural network varies.
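A minimal sketch of such an ensemble, assuming scikit-learn and synthetic placeholder features and labels (none of which are specified by the patent), might look like the following.

```python
import numpy as np
from sklearn.ensemble import VotingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier

estimators = [
    ("knn", KNeighborsClassifier(n_neighbors=5)),
    ("svm", SVC(probability=True)),
]
# Several learners of the same kind, varying a hyperparameter (hidden-layer size).
for size in (16, 32, 64):
    estimators.append((f"mlp_{size}", MLPClassifier(hidden_layer_sizes=(size,), max_iter=2000)))

ensemble = VotingClassifier(estimators=estimators, voting="soft")

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 6))        # hypothetical per-composition feature vectors
y = rng.integers(1, 4, size=40)     # hypothetical labels
ensemble.fit(X, y)
print(ensemble.predict(X[:3]))
```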
  • Syncopation is loosely defined as an uneven or irregular rhythm. Syncopated rhythms are more difficult to perform, and imply a more difficult composition. No agreed upon rule exists as to what constitutes a syncopated rhythm, so a machine learning algorithm appears to be an appropriate solution. Both supervised and unsupervised approaches may be useful in determining an algorithm for identifying syncopation.
  • Form: Form is a higher level of syntactic musical organization. Linguistic analogues include rhyming schemes, poetic forms, or even larger forms like literary arcs. Musical forms include low-level forms like periods, or higher-level forms like rondo and sonata-allegro.
  • Phrasing is a pre-requisite for form, and given that phrasing does not have a clear set of rules, neither does form.
  • Possible approaches to identifying form include recursive neural networks and convolutional neural networks similar to those used in sentence parsing in natural language processing. Co-occurrence matrices may also be used.
  • Style/Time Period: Style is an extremely subjective musical characteristic. Some research has already been done on style classification using statistics, but machine learning techniques are likely to provide more accurate and relevant results.
  • Harmonic Progression: Although the notes and key are explicitly defined in structured data, harmonic analysis information (while specified in the MusicXML specification) is optional and often omitted from the structured data. Harmonic information is included in Piano/Vocal/Guitar (PVG) arrangements and Lead Sheets, but rarely included in other formats.
  • Harmonic information may be extracted by performing a reduction of all the parts (transposed to sounding pitch), and then performing a harmonic analysis.
  • a harmonic analysis may use a hidden Markov model (HMM), or one of the syntactic approaches described above.
  • A reduction is necessary because, while harmonic analysis is relatively easy for homophonic music (a single melody), it is often difficult to determine the harmonic progression in polyphonic music (two or more melodies).
  • Tonal Centricity: Tonal centricity is related to the harmonic progression described above. While key signatures appear in the structured data, a composition may modulate temporarily, either to create musical interest or to develop musical material, such as during the development section of a sonata-allegro form.
  • compositions that temporarily and/or frequently modulate without changing key signatures, or lack a key signature altogether, are more difficult to perform.
  • Exposed Parts: A part is considered exposed if it is playing solo, soli, or a distinct musical figure. It remains unclear whether a rules-based or machine learning-based approach will work better to identify exposed parts. However, current research into a rules-based system indicates that such a system may be insufficient and that determining exposed parts may be more subjective than originally thought.
  • the machine learning analysis 528 can be used to predict the difficulty of each composition (or each part within each composition).
  • the structured sheet music data 112 may include a corpus of compositions or parts that are labeled (e.g., manually labeled) as having a certain difficulty level (e.g., on a 1-5 scale, a 1-10 scale, etc.).
  • a supervised learning process can then be used to learn a function for determining a probability that another composition or part has those difficulty levels (e.g., a 0.5 percent probability of grade 1, an 85.5 percent probability of grade 2, etc.).
  • the supervised learning process may compare the compositions in the structured sheet music data 112 (and the metadata 116 describing those compositions) to the compositions in the corpus (and the metadata 116 describing those compositions).
  • the supervised learning process may compare semantic representations of the compositions in structured sheet music data 112 (e.g., the part embeddings 534 A and score embeddings 536 A discussed below) to semantic representations of the compositions in the corpus.
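For illustration only, a supervised difficulty predictor of this kind could be sketched with a k-nearest neighbors classifier over score embeddings; the embedding dimensionality, grade scale, and data below are hypothetical placeholders, not the patent's trained model.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(1)
corpus_embeddings = rng.normal(size=(200, 300))   # labeled corpus: one 300-d score embedding each
corpus_grades = rng.integers(1, 6, size=200)      # manually assigned difficulty grades 1-5

model = KNeighborsClassifier(n_neighbors=15)
model.fit(corpus_embeddings, corpus_grades)

new_score_embedding = rng.normal(size=(1, 300))   # embedding of an unlabeled composition
probabilities = model.predict_proba(new_score_embedding)[0]
for grade, p in zip(model.classes_, probabilities):
    print(f"grade {grade}: {p:.1%}")              # per-grade probability output
```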
  • Examples of rule-derived metadata 526 and machine learning-derived metadata 530 that may be generated by performing the rules analysis 524 and/or the machine learning analysis 528 are shown in Table 4:
  • FIG. 21A is an example passage that is rhythmically difficult but not very syncopated.
  • Difficulty Grade by Part: The difficulty is computed last in the Rules Engine because it is the culmination of several features and the output of the other submodules.
  • Cumulative Grade of Entire Composition: The cumulative grade of the composition is computed by aggregating the grades across all parts. This includes the average grade across all parts, as well as the mode, median, minimum, and maximum part grades.
  • Harmonic Language: A list of the harmonic languages that appear in the composition. This is distinct from the chord progression, which only represents a sequence of chords.
  • a harmonic language describes the overall harmonic vocabulary.
  • harmonic language examples include pentatonic, diatonic, chromatic, modal, pandiatonic, etc.
  • Parsons Code: The Parsons code of each voice in each part. This represents the overall melodic contour of the voice and can be used for query by humming. Parsons code is only computed for monophonic voices that are not percussive.
  • Melodic Contour: The melodic contour is a descriptive term for the overall contour of the voice in each part. Melodic contour is one of three values: conjunct (smooth), disjunct (jagged), or static.
  • The melodic contour is computed from the intervallic content of the voice as follows (see the sketch following Table 4): compute the histogram of all melodic intervals; if the unison is the most frequent interval and its frequency is greater than the sum of all other interval frequencies, the melodic contour is static; if the unison, minor 2nd, or major 2nd is the most frequent interval and the sum of those three interval frequencies is greater than the sum of the other interval frequencies, the melodic contour is conjunct; otherwise, the melodic contour is disjunct.
  • Form: The formal structure of the composition. This is computed from the repeating note and/or measure structure. A set of rules and/or heuristics is then used to compute the form from common known forms, such as song, rondo, sonata-allegro, 12-bar blues, etc.
  • Style: The musical style of the composition, such as march, bebop, etc. The style is computed from several of the above factors.
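The Parsons code and melodic contour rules described in Table 4 can be sketched as follows; this Python snippet is illustrative only, and the MIDI pitch input and example melody are assumptions.

```python
from collections import Counter

def parsons_code(pitches):
    """*=start, u=up, d=down, r=repeat, for a monophonic voice."""
    code = ["*"]
    for prev, cur in zip(pitches, pitches[1:]):
        code.append("u" if cur > prev else "d" if cur < prev else "r")
    return "".join(code)

def melodic_contour(pitches):
    intervals = [abs(b - a) for a, b in zip(pitches, pitches[1:])]  # melodic intervals in semitones
    hist = Counter(intervals)
    if not hist:
        return "static"
    most_common, _ = hist.most_common(1)[0]
    total = sum(hist.values())
    small = hist[0] + hist[1] + hist[2]          # unison, minor 2nd, major 2nd
    if most_common == 0 and hist[0] > total - hist[0]:
        return "static"
    if most_common in (0, 1, 2) and small > total - small:
        return "conjunct"
    return "disjunct"

melody = [60, 62, 64, 65, 67, 65, 64, 62]        # stepwise scale fragment
print(parsons_code(melody), melodic_contour(melody))   # *uuuuddd conjunct
```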
  • Semantic embedding 532 A creates a numerical representation (embedding) of each element (e.g., note, chord, rest, measure, etc.), part, and score within a vector space so that it may be analyzed within a larger context.
  • an embedding algorithm 532 A analyzes a large corpus of structured music data to produce a global vector space containing semantic representations of each element in the corpus. As described in detail below, the global vector space is a map of each element where similar elements are represented closer than dissimilar ones. Once the global vector space is produced, each element (e.g., note, chord, rest, measure, etc.) in the structured music data 112 is given the semantic representation (embedding) from the global vector space.
  • Element embeddings are averaged to create part embeddings 534 A and part embeddings 534 A are averaged to create a score embedding 536 A.
  • the part embeddings 534 A and the score embeddings 536 A are compared for semantic similarity at 538 A as described in detail below.
  • FIG. 5B is a flowchart illustrating the semantic embedding 532 A and semantic similarity analysis 538 A in detail.
  • a large corpus 550 of structured music documents (e.g., MusicXML documents) is used to create a global vector space 560 .
  • Each element (e.g., note, chord, rest, measure, etc.) extracted from the corpus 550 is added to a vocabulary 558; if an element already appears in the vocabulary 558, the count of that element is incremented at 556.
  • Different embodiments may make different determinations regarding equivalent elements. For example, one embodiment may be programmed such that an “A4 quarter” is equivalent to an “A4 quarter staccato”. Other embodiments may be programmed such that those elements are distinct.
  • the vocabulary 558 is sorted from the most common element to the least common element at 560. (In some embodiments, the number of elements included in the vocabulary 558 may be trimmed to include only the N most common elements.) Each element in the vocabulary 558 is then assigned a unique integer at 562. (The integer assignments can be arbitrary, for example the position of the element within the sorted vocabulary 558.)
  • the elements extracted from the corpus 550 are then replaced with the integers at 564. (If an extracted element is no longer in the trimmed vocabulary 558, the extracted element may be replaced with a special integer value specifying "unknown".) Each document in the corpus 550 is then represented by a set of integers.
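A minimal sketch of this vocabulary-building and integer-encoding step, using hypothetical string element names rather than any format defined in the patent, is shown below.

```python
from collections import Counter

corpus = [
    ["A4-quarter", "B4-quarter", "A4-quarter", "rest-quarter"],
    ["A4-half", "A4-quarter", "C5-quarter"],
]

counts = Counter(element for document in corpus for element in document)

N = 4                                               # keep only the N most common elements
vocabulary = [elem for elem, _ in counts.most_common(N)]
index = {elem: i for i, elem in enumerate(vocabulary)}
UNK = len(vocabulary)                               # special "unknown" integer for trimmed elements

# Each document is now represented by a list of integers.
encoded = [[index.get(elem, UNK) for elem in document] for document in corpus]
print(index)
print(encoded)
```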
  • a neural network is then used to map targets and context at 566 .
  • an element may be fed to a skip-gram model, which is used to predict the elements around it (context).
  • the skip-gram model may be used to predict the context of a given element.
  • the elements around a target element may be fed to a continuous bag of words (CBOW) model, which is used to predict the target element.
  • CBOW model may be used to predict an element in a given context.
  • a next word algorithm is given an element (context) and trained to predict the next element (target).
  • The input to the neural network (e.g., skip-gram model, CBOW model, etc.) is a combination of the integer values, defined in the vocabulary, for each element. (The vectors may be added or stacked, depending on the representation.)
  • The output is also the integers defined in the vocabulary.
  • Dimensionality of the vocabulary 558 is reduced at 568 . Any dimensionality may be used. However, the target dimensionality is typically much lower than the number of items in the vocabulary 558 . For example, for a vocabulary of 10,000 elements, a target dimensionality of 300 may be used.
  • FIG. 5C is a diagram illustrating a context encoder (ConEc) NN architecture corresponding to a CBOW Word2Vec model.
  • FIG. 5D is a diagram illustrating a similar model trained to predict a target musical element from the context.
  • each element is represented by a vector. Because similar elements appear in similar contexts in the corpus 550 , the neural network clusters similar elements.
  • FIG. 5E is a graph showing an example global vector space 560 projected down to two dimensions. For example, “A4-half” and “A4-quarter” are very close to each other (because both represent the pitch A4 with different durations).
  • the global vector space 560 described above is used to analyze the structured sheet music data 112 to generate part embeddings 534 A and score embeddings 536 A.
  • The semantic representation (vector) of each element (e.g., note, chord, rest, measure, etc.) is identified at 564 by looking it up in the global vector space 560. For example, if an "A4 quarter" has been mapped in the global vector space 560 to [0.4, 0.99, 0.05], then any "A4 quarter" in any document in the structured sheet music data 112 is replaced with that vector.
  • Element vectors are averaged at 566 .
  • note vectors within each measure are averaged to create a measure vector.
  • Measure vectors within each part are averaged to create a part embedding 534 A for that part.
  • Part embeddings 534 A within each score are averaged to create a score embedding 536 A for that score.
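The lookup-and-average procedure can be sketched as follows; the toy global vector space and document structure are hypothetical, and the 3-dimensional vectors merely echo the example above.

```python
import numpy as np

global_vector_space = {
    "A4-quarter":   np.array([0.40, 0.99, 0.05]),
    "A4-half":      np.array([0.42, 0.95, 0.07]),
    "rest-quarter": np.array([0.10, 0.10, 0.90]),
}

def measure_vector(measure):                  # measure: list of element names
    return np.mean([global_vector_space[e] for e in measure], axis=0)

def part_embedding(part):                     # part: list of measures
    return np.mean([measure_vector(m) for m in part], axis=0)

def score_embedding(score):                   # score: list of parts
    return np.mean([part_embedding(p) for p in score], axis=0)

score = [
    [["A4-quarter", "A4-half"], ["rest-quarter"]],       # part 1: two measures
    [["A4-quarter", "rest-quarter"]],                    # part 2: one measure
]
print(score_embedding(score))
```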
  • Similar items are identified by comparing the vectors representing each item at 538 A.
  • Vectors may be compared using a distance function (metric), such as the L2-norm (Euclidean distance) or cosine similarity.
  • Vectors that are "close" (e.g., separated by a small distance in the global vector space 560) represent semantically similar items, while vectors that are far apart (e.g., separated by a large distance in the global vector space 560) represent semantically dissimilar items.
  • Semantically related items may be grouped together using another machine learning algorithm, such as k-means clustering, support vector machines (SVMs), or another neural network.
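For illustration, the distance comparisons described above might be computed as in the following sketch (the vectors are hypothetical placeholders).

```python
import numpy as np

def euclidean_distance(a, b):
    return float(np.linalg.norm(a - b))          # L2 norm

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([0.40, 0.99, 0.05])      # e.g. "A4 quarter"
b = np.array([0.42, 0.95, 0.07])      # e.g. "A4 half" -- close, so semantically similar
c = np.array([0.10, 0.10, 0.90])      # e.g. a rest -- far away

print(euclidean_distance(a, b), cosine_similarity(a, b))   # small distance, similarity near 1
print(euclidean_distance(a, c), cosine_similarity(a, c))   # large distance, lower similarity
```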
  • The vocabulary 558 shown in Table 6 may be sorted by frequency and all but the N most frequently occurring entries may be kept; entries that are removed would be replaced with a single symbol "UNK" (unknown). In this example, that step is omitted. Each note is represented by the index from the vocabulary 558 as shown in FIG. 21C.
  • the indexes above are converted to one-hot vectors, which are fed into the neural network.
  • One-hot vectors are used in this example because every component of the vector is either a one or a zero, so the target probability can be interpreted as 0 percent or 100 percent, cross-entropy loss can be used to train the network, and all of the one-hot vectors are mutually orthogonal.
  • To convert an index value x to a one-hot vector, a vector of all zeros is created and then the entry at position x is changed to 1.
  • the one-hot vector for index value 1 is [1, 0, 0, 0, . . . ]
  • the one-hot vector for index value 2 is [0, 1, 0, 0, . . . ].
  • the one-hot vector for index value 3 is [0, 0, 1, 0, . . . ], etc.
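A one-line conversion, assuming 0-based indices (the text above numbers index values from 1), might look like this sketch.

```python
import numpy as np

def one_hot(index, vocab_size):
    vec = np.zeros(vocab_size)
    vec[index] = 1.0          # set the single active position
    return vec

print(one_hot(0, 5))          # index value 1 in the text's 1-based numbering -> [1. 0. 0. 0. 0.]
```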
  • a next word algorithm is given the current element and used to predict the next element.
  • some example pairs where the second item comes after the first:
  • the first element in the pair is added to set X and the expected output of the pair is added to the set y.
  • an embedding size less than the size of the vocabulary (24 elements) is selected. In this instance, an embedding size of 5 is selected.
  • a neural network is created:
  • The W elements are weight matrices.
  • The b elements are biases that provide an additive factor to the model.
  • The g elements are non-linear functions, such as tanh, sigmoid, or ReLU (known as activation functions), that model non-linear relationships between the inputs and outputs.
  • the softmax function is a standard mathematical function that normalizes a value to a probability.
  • the softmax function computes the probability that y belongs to class j given input vector x as softmax_j(x) = exp(x·w_j) / Σ_k exp(x·w_k), i.e., the exponent of x·w_j (the input to the activation function) over the sum of the exponents x·w_k for all K distinct classes.
  • the output from the softmax is compared to the one-hot of the expected output.
  • the cross-entropy loss function is used to determine the difference between the output and the expected value. This difference (the loss) is then backpropagated through the network to adjust the weight matrices and biases.
  • the cross-entropy loss function is H(y, ŷ) = -Σ_j y_j log(ŷ_j), where y is the expected (one-hot) output and ŷ is the predicted (softmax) output.
  • the model is updated using gradient descent, an optimization algorithm that numerically tries to find the minimum of a function.
  • the function being minimized in this case is the loss computed from the neural network's output.
  • the neural network is then trained for several iterations.
  • The inner matrix W1 now holds the embedding vectors; this is called the embedding matrix.
  • The column is the index of the vocabulary item and the row is the 5-dimensional embedding vector (or, depending on orientation, the column is the vector and the row is the index).
  • An example 5-dimensional embedding vector, derived in the process above, is shown below:
  • the embedding vectors encode contextual information and the vector shown above represents a semantically meaningful encoding.
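Putting the steps above together, the following sketch trains a comparable two-layer "next element" network and then reads out the embedding matrix. PyTorch is used purely for readability; the patent does not prescribe a framework, and the vocabulary size, training pairs, learning rate, and iteration count are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, embed_size = 24, 5

W1 = nn.Linear(vocab_size, embed_size)     # inner weight matrix W1 + bias b1 (embedding matrix)
W2 = nn.Linear(embed_size, vocab_size)     # outer weight matrix W2 + bias b2
activation = torch.tanh                    # the non-linear function g

# (current element index, next element index) training pairs, i.e. the sets X and y.
pairs = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 1), (1, 2)]
X = F.one_hot(torch.tensor([p[0] for p in pairs]), vocab_size).float()
y = torch.tensor([p[1] for p in pairs])

optimizer = torch.optim.SGD(list(W1.parameters()) + list(W2.parameters()), lr=0.1)
loss_fn = nn.CrossEntropyLoss()            # softmax + cross-entropy loss

for _ in range(200):                       # train for several iterations
    logits = W2(activation(W1(X)))
    loss = loss_fn(logits, y)              # compare prediction to the expected next element
    optimizer.zero_grad()
    loss.backward()                        # backpropagate the loss
    optimizer.step()                       # gradient descent update

embedding_matrix = W1.weight.detach().T    # one 5-dimensional embedding vector per vocabulary item
print(embedding_matrix.shape)              # torch.Size([24, 5])
```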
  • An example plot projected down to 2 dimensions is shown in FIG. 5E .
  • the semantic similarity between two vectors can then be determined by measuring their distance from each other in the domain space. Two example vectors and the Euclidean distance (L2 norm) computed between them are shown below:
  • the semantic similarity metadata 540 A from semantic similarity analysis 538 A may include the outputs described in Table 7:
  • Similar Measures Within Part: For each part, identify similar measures within the part. Similar measures may then be further analyzed to determine features such as repeated measures, ostinatos, and form.
  • Similar Measures Across Parts: For each measure, identify similar measures within any other part. Measures that are similar but occur in different parts suggest that either the parts are coupled/doubled, or that the parts may be interchanged (e.g., substitute a trombone for a bassoon).
  • Similar Parts: For each part, identify any other part that is similar. Similar parts may then be used to identify coupled/doubled parts or parts that are interchangeable.
  • Similar Documents: Compare this document to all other documents in the database and assign a similarity score to them. Pre-computing and then indexing these similarity scores improves performance of searching for similar documents.
  • embedding vectors for each part allow individual part embeddings 534 A to be compared within the same structured sheet music data 112 or against other structured sheet music data 112 .
  • For example, a music director may search for similar clarinet parts by having a computer search for other pieces of music that contain semantically similar clarinet parts (i.e., where the distance between the embeddings is minimized).
  • the computer will then be able to rank all clarinet parts in order of closeness to the original part.
  • a separate score embedding 536 A allows compositions to be holistically compared.
  • the search engine 190 provides functionality for a choir director to search for semantically similar compositions.
  • the sheet music search and discovery system 100 compares the score embeddings 536 A as described above and the search engine 190 returns search results ranked by semantic similarity.
  • part embeddings 534 A and the score embeddings 536 A may be used in the machine learning analysis 528 described above, for example to predict the difficulty of each part and score.
  • phrase extraction 542 is a process for identifying smaller musical passages known as phrases from a larger musical selection.
  • a phrase 544 may be identified using rules derived from music theory, heuristics, analysis of individual measures obtained via rules analysis 524 , machine learning analysis 528 , or a combination thereof.
  • phrase analysis 546 may be used to generate phrase descriptions 548 that describe each phrase using keywords or sentences which in turn may be stored in a database to augment the data available for querying.
  • An example input to phrase extraction 542 is shown in FIG. 21D and example phrase descriptions 548 are described in Table 8:
  • the phrase description 548 may be generated using sequence labeling. Sequence labeling is similar to the machine learning task of image captioning.
  • the sequence labeling may be performed by two neural networks trained together to understand structured music documents. This model is known as neural machine translation (NMT).
  • the first neural network is called the encoder and transforms each input phrase into a numerical representation.
  • the second neural network is called the decoder and transforms the numerical representation into output text, such as keywords describing the phrase.
  • the two neural networks of the NMT model are trained using parallel corpora.
  • the input to the training algorithm includes a set of passages and a separate set of descriptive texts such that each passage is described by one or more corresponding descriptive texts.
  • An example of parallel corpora for use in training the two neural networks is shown in FIGS. 21E and 21F and described in Table 9:
  • Lyrics extraction 550 is a process for extracting lyrics 552 from preprocessed structured sheet music data 112 in order to generate semantic similarity metadata 540 B pertaining to the lyrics 552 . Certain elements of lyrical metadata, such as the number of verses, may be obtained directly from the preprocessed structured sheet music data while other lyrical metadata must be inferred. Lyrics extraction 550 is described further in FIG. 9 . When the lyrics 552 appear in structured sheet music data 112 , those lyrics 552 are fragmented at syllable boundaries. The lyrics extraction process 550 combines those syllables to form words.
  • the lyrics extraction process 550 also parses out verses. A single note may be associated with text from the first verse and the second verse. The lyrics extraction process 550 puts this all together into a single document so that all of verse 1 is together, all of verse 2 is together, etc.
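A simplified sketch of the syllable-combining and verse-collating steps is shown below; the (verse, syllabic, text) triples are hypothetical, and the syllabic values are modeled on MusicXML's single/begin/middle/end convention rather than taken from the patent.

```python
from collections import defaultdict

lyrics = [
    (1, "begin", "A"), (1, "middle", "ma"), (1, "end", "zing"), (1, "single", "grace"),
    (2, "single", "How"), (2, "single", "sweet"),
]

def collate_verses(syllables):
    verses = defaultdict(list)
    for verse, syllabic, text in syllables:
        if syllabic in ("single", "begin"):
            verses[verse].append(text)            # start a new word
        else:                                     # "middle" or "end": continue the current word
            verses[verse][-1] += text
    return {verse: " ".join(words) for verse, words in verses.items()}

print(collate_verses(lyrics))
# {1: 'Amazing grace', 2: 'How sweet'}
```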
  • the lyrics 552 are input to semantic embedding 532 B and semantic similarity analysis 538 B.
  • the semantic embedding 532 B and semantic similarity analysis 538 B are processes similar to the semantic embedding 532 A and semantic similarity analysis 538 A described above, but performed on text data (the lyrics 552 ) rather than on musical data (the part embeddings 534 A and score embeddings 536 A).
  • semantic embedding 532 A and 532 B create a numerical representation (embedding) of data within a vector space by analyzing it within a larger context.
  • the semantic embedding 532 B similarly processes each word, phrase, verse, etc. within the extracted lyrics 552 , adds each unique element to a list, and assigns each unique element a unique value.
  • the list of unique elements (i.e., the "vocabulary"), along with frequency, context, etc., is input to an algorithm, neural network, etc., to create a probabilistic model of the likelihood that a set of elements occur together.
  • Semantic similarity analysis 538 B is performed to generate semantic similarity metadata 540 B.
  • An embedding vector for each part allows different individual parts to be compared within the same structured sheet music data 112 or against other structured sheet music data 112 . For example, where a choir has enjoyed the theme and language of a particular piece of choral music, the choir director may search for other choral works with similar themes and language. Note that harmonic and lyric similarity are independent and a user may search for music based on either.
  • Semantic similarity metadata 540 B from semantic similarity analysis 538 B may include the outputs described above in Table 7.
  • FIG. 6 is a flowchart illustrating the preprocessing 510 performed by the SMAE 140 according to an exemplary embodiment of the present invention.
  • structured sheet music data 112 is preprocessed to normalize the data into a consistent format before analysis, allowing music from different publishers, different notation structures, etc. to be analyzed using the same process.
  • Structured sheet music data 112 may be a compressed music file 605 A or an uncompressed music file 605 B.
  • Compressed music files 605 A are uncompressed at 610 .
  • Both uncompressed music files 605 B and newly uncompressed music files 605 B are validated at 615 .
  • the files are validated using a predefined schema for the given format. For example, the MusicXML schema is available in Document Type Definition (DTD) and XML Schema Definition (XSD) formats, and validation is performed by a standard XML validator. If the music file is invalid, an error is generated and the preprocessing technique 510 is halted until the invalid music file is corrected. If the music file is valid, header metadata is extracted at 620. Header metadata may include the fields described in Table 10:
  • the publisher of the music file may also be extracted, usually from the copyright field, in order to optimize the performance of the rules analysis 524 and machine learning analysis 528 according to any common patterns or peculiarities specific to the particular publisher, known as “house style.”
  • the publisher name informs other features, so it is advantageous to extract the publisher name early in the feature extraction process. Other features may be extracted at any time because they do not influence later extracted features.
  • the music file undergoes general cleanup at 625 to remove information useful for visual display but not useful for semantic analysis, such as color. Additionally, any specific encoding errors or conventions dependent on the software used to produce the encoding are corrected at 625 .
  • the instrument names are extracted from the music file at 630 .
  • the instrument names are normalized at 635 .
  • Instrument names are converted into their base instrument names in a process that may be similar to stemming in computational linguistics. For example, the instrument names Trumpet 1, Trumpet II, 2nd Trumpet, 4 Trumpets, and Trumpet in C would all be converted to the base instrument name Trumpet. If more than one instrument name appears on a given part, separated by a space, new line, slash, etc., or there is a "change instrument" instruction within the part, the part is marked as having multiple instruments. This informs the extraction of parts at 645 A. If an instrument name appears in a language other than English, a lookup table may be used to translate the instrument name into English. For example, the German equivalent of "2nd Flute" ("2. Flöte") is normalized to "Flute."
  • An instrument that cannot be identified is designated as “unknown.”
  • the instruments are identified with a standard sound.
  • a max polyphony property describing the number of pitches the instrument can play simultaneously is assigned to each instrument through the use of a lookup table. For example, a clarinet has a max polyphony of one, a guitar has a max polyphony of six, and a piano has a max polyphony of eighty-eight. This information is used in extraction of parts 645 A as well as the analysis described herein.
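  • As an illustration of the normalization at 635 and the max polyphony lookup, the following Python sketch reduces printed part names to base instrument names and looks up their polyphony; the regular expression, translation entries, and polyphony table are illustrative assumptions rather than the system's actual tables.

        import re

        BASE_NAMES = {"trumpet": "Trumpet", "floete": "Flute", "flute": "Flute",
                      "clarinet": "Clarinet", "guitar": "Guitar", "piano": "Piano"}
        MAX_POLYPHONY = {"Trumpet": 1, "Clarinet": 1, "Guitar": 6, "Piano": 88}

        def normalize_instrument(raw_name: str) -> str:
            """Reduce a printed part name (e.g. '2nd Trumpet', 'Trumpet in C') to a base name."""
            text = raw_name.lower().replace("ö", "oe")
            # Strip numbering, Roman numerals, and key designations such as "in C".
            text = re.sub(r"\b(\d+(st|nd|rd|th)?\.?|[ivx]+\.?|in [a-g]b?)\b", " ", text)
            for key, base in BASE_NAMES.items():
                if key in text:
                    return base
            return "unknown"   # instruments that cannot be identified

        def max_polyphony(base_name: str) -> int:
            return MAX_POLYPHONY.get(base_name, 1)

        print(normalize_instrument("2. Flöte"))    # Flute
        print(normalize_instrument("Trumpet II"))  # Trumpet
        print(max_polyphony("Guitar"))             # 6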
  • the ensemble type is identified at 640 .
  • the normalized instrument names are compared to a lookup table of predefined ensemble types and a percent match is found.
  • the percent match may be found using Jaccard similarity or other similarity metrics.
  • the type of ensemble is then identified based on the percent match. For example, a string quartet must be an exact match while a marching band may vary in instrumentation. Information about the ensemble type can be used to further clean up the score, as well as provide additional information used in the analysis described herein. If the ensemble type cannot be identified, the ensemble type is left empty. Last, either parts are extracted at 645 A or a score is compiled at 645 B depending on the input music file. If the input music file includes a score, individual parts are extracted at 645 A.
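  • A minimal sketch of the ensemble-type identification at 640, assuming Jaccard similarity between the set of normalized instrument names and predefined ensemble templates; the template contents and match thresholds are illustrative assumptions.

        ENSEMBLE_TEMPLATES = {
            # template instrument set, minimum percent match required
            "String Quartet": ({"Violin", "Viola", "Cello"}, 1.0),   # must match exactly
            "Brass Quintet": ({"Trumpet", "Horn", "Trombone", "Tuba"}, 0.9),
            "Marching Band": ({"Flute", "Clarinet", "Saxophone", "Trumpet",
                               "Trombone", "Tuba", "Percussion"}, 0.6),  # may vary
        }

        def jaccard(a: set, b: set) -> float:
            return len(a & b) / len(a | b) if (a | b) else 0.0

        def identify_ensemble(instruments: set) -> str:
            best_name, best_score = "", 0.0
            for name, (template, threshold) in ENSEMBLE_TEMPLATES.items():
                score = jaccard(instruments, template)
                if score >= threshold and score > best_score:
                    best_name, best_score = name, score
            return best_name  # empty when the ensemble type cannot be identified

        print(identify_ensemble({"Violin", "Viola", "Cello"}))  # String Quartet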
  • If the input music file includes only individual parts, a score is compiled at 645 B. Extraction of individual parts 650 A generates a separate output for each part or voice.
  • a part usually corresponds to a single instrument while a voice is a subdivision of instruments occurring within the same part. For example, a part labelled “2 Trumpets” may have two voices because the two trumpets may play different notes at the same time.
  • the parts extracted at 645 A may be labeled “2 Trumpets A” and “2 Trumpets B.”
  • a score 650 B is compiled into a single file to be used in conjunction with the parts files during analysis. Each instrument in the compiled score 650 B consists of only one voice.
  • FIG. 7 is a flowchart illustrating the feature extraction process 520 performed by the SMAE 140 according to an exemplary embodiment of the present invention.
  • the feature extraction process 520 analyzes the preprocessed structured music data 112 and returns features 522 that are then used during the rules analysis process 524 and machine learning process 528 .
  • Features 522 extracted include relevant musical data, statistical features such as the average number of notes in a measure, and features derived from lookup tables.
  • Lookup tables return simple semantic features which may be input for more complex rules analysis 524 and machine learning analysis 528 .
  • a lookup table may associate a time signature to a measure of difficulty. For example, a 4/4 time signature may be labeled as easy while a 17/32 time signature may be labeled as difficult.
  • the time signatures and associated difficulty levels may be used during the rules analysis 524 and machine learning analysis 528 , which may change the difficulty level.
  • Feature extraction is a bottom-up process: features are first extracted from individual notes 710 A- 710 H, then from measures 720 A- 720 D, then from parts 730 A- 730 B, and lastly from document 740 .
  • Note features 712 include features and subfeatures described in Table 11:
  • Note features 712 may be used to determine note statistics 714 .
  • Note statistics 714 may include features described in Table 12:
  • Pitch Range: The lowest pitch and the highest pitch within the voice.
  • Pitch Histogram: A frequency distribution of all of the pitches within the voice.
  • Tessitura: The range of all pitches within the first standard deviation of the mean pitch within the voice (known as "a truncated distribution"). This serves as an indicator of what notes commonly occur. A range includes the extremes, while the tessitura excludes the extremes and focuses on notes near the mean.
  • Polyphony per voice: The number of notes sounding simultaneously within the voice. This is used to find multiple stops or implicit divisi within a voice. Multiple stops are a technique for performing multiple simultaneous pitches on bowed instruments.
  • Duration Type Frequency: A frequency distribution of all the duration types within the voice.
  • Number of Onsets: Count of notes that are neither rests nor the continuation or end of a tie within the voice across all measures.
  • Average Number of Onsets Per Measure: The Number of Onsets divided by the number of measures.
  • Note n-grams: N-grams of all notes within a voice. N-grams can cross barlines.
  • Repeated Identical Notes Within Measure: Find any repeated notes (same pitch and duration) within the voice. This is useful for finding ostinatos.
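  • The following sketch computes a few of the note statistics from Table 12 over MIDI pitch numbers for a single voice; representing pitches as MIDI numbers and using a plain mean and standard deviation for the tessitura are illustrative assumptions.

        from collections import Counter
        from statistics import mean, pstdev

        def note_statistics(pitches: list) -> dict:
            mu, sigma = mean(pitches), pstdev(pitches)
            # Tessitura: pitches within one standard deviation of the mean,
            # i.e. the truncated distribution that excludes the extremes.
            tessitura = [p for p in pitches if mu - sigma <= p <= mu + sigma]
            return {
                "pitch_range": (min(pitches), max(pitches)),   # lowest and highest pitch
                "pitch_histogram": dict(Counter(pitches)),     # frequency distribution
                "tessitura": (min(tessitura), max(tessitura)) if tessitura else None,
            }

        print(note_statistics([60, 62, 64, 65, 67, 67, 72]))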
  • measure features 722 are extracted from measures 720 A- 720 D.
  • Measure features 722 may include features described in Table 13:
  • Multi-measure Rest (Boolean): Indicates that this measure is part of a multi-measure rest (i.e., the measure(s) before and/or after it are whole measure rests according to the criteria above).
  • Tuning Change (Boolean): Indicates that this measure includes a tuning change, scordatura, or capo.
  • Cross Staff Beaming (Boolean): Indicates that this measure uses cross staff beaming, which appears on parts that use a grand staff.
  • Measure statistics 724 include features described in Table 14:
  • part features 732 are extracted from parts 730 A- 730 B.
  • parts 730 A- 730 B generally correspond to a single instrument, although a part may refer to more than one instrument if the parts are doubled, for example where a flute and an oboe play the same part, or if the performer changes instruments within the same part, such as where a flute player switches to a piccolo.
  • Part features 732 include the features described in Table 15:
  • Part statistics 734 are extracted. Part statistics 734 may be extracted from individual parts or a collection of parts and include the features described in Table 16:
  • Chord Progression: Determine the large-scale chord progression of the overall composition. In this case, all parts must first be transposed to the sounding key. Next, a harmonic reduction is performed, from which the chord progression is estimated using a sequence model such as a hidden Markov model, chord n-grams, or Word2Vec-style chord embeddings. This output is then filtered to remove uncommon or infrequently occurring chords. After the chord progression has been extracted, a further analysis is used to compare the overall chord progression to common harmonic patterns, such as 12-bar blues or rhythm changes. This analysis uses chord n-grams to find repeated patterns and then measures the similarity to known chord progressions using Jaccard similarity or another similarity metric.
  • Inexact doublings are only computed for common intervals, such as the minor 3rd, major 3rd, perfect 5th, octave, and common compound intervals such as the octave + major 3rd, etc.
  • This information informs the user of the use of orchestration throughout the composition.
  • Exposed Solo: All parts are compared across all measures to find exposed solos.
  • An exposed solo is defined to be any part in which all other parts are either whole measure rests or playing sustained notes for the entirety of the measure(s).
  • document features 742 are extracted from document 740 .
  • the document features 742 include the duration of the music, which may be extracted or estimated. If the duration of the music is included in document 740 , it is often preceded by the word “duration” and in a format such as the formats described in Table 17:
  • the duration of the music is estimated.
  • the duration of the music is estimated using at least the number of measures, the meters, the tempo markings, and any repeats, endings, codas, segnos, or other text indicating a repeated section. If the tempo marking does not contain a number indicating a number of beats per minute, the text of the tempo marking is compared to a lookup table of common tempos and, if a match is found, the lookup table's corresponding beats per minute is used. If no match is found, the duration is calculated using a substitute tempo. In some embodiments, the substitute tempo is 120 beats per minute. If a tempo marking indicates a range of numbers, for example “Allegro (116-120 bpm)”, the average of the specified range is used.
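  • A hedged sketch of the duration estimate described above: beats are accumulated from the number of measures and the meter, and the tempo is taken from a number in the marking, a lookup table of common tempo words, the average of a specified range, or the substitute tempo of 120 beats per minute. The tempo-word values and the simplified treatment of repeats are illustrative assumptions.

        import re

        COMMON_TEMPOS = {"largo": 50, "adagio": 70, "andante": 90,
                         "moderato": 110, "allegro": 130, "presto": 180}

        def tempo_bpm(marking: str) -> float:
            numbers = [int(n) for n in re.findall(r"\d+", marking)]
            if len(numbers) >= 2:                    # e.g. "Allegro (116-120 bpm)"
                return (numbers[0] + numbers[1]) / 2
            if numbers:
                return float(numbers[0])
            for word, bpm in COMMON_TEMPOS.items():  # text-only tempo markings
                if word in marking.lower():
                    return float(bpm)
            return 120.0                             # substitute tempo

        def estimate_duration_seconds(num_measures: int, beats_per_measure: int,
                                      marking: str, repeated_measures: int = 0) -> float:
            total_beats = (num_measures + repeated_measures) * beats_per_measure
            return total_beats * 60.0 / tempo_bpm(marking)

        print(estimate_duration_seconds(32, 4, "Allegro (116-120 bpm)"))  # ~65 seconds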
  • FIG. 8 is a flowchart illustrating the rules analysis process 524 performed by the SMAE 140 according to an exemplary embodiment of the present invention.
  • note features 712 , note statistics 714 , measure features 722 , measure statistics 724 , part features 732 , part statistics 734 , and document features 742 may be used to perform melodic rules analysis 810 , harmonic rules analysis 820 , rhythmic rules analysis 830 , and form rules analysis 840 .
  • the melodic rules analysis 810 applies rules to interpret and analyze the melody, the sequence of notes one after another.
  • Harmonic rules analysis 820 applies rules to interpret and analyze the harmony, which includes chords (notes occurring at the same time) and chord progressions (how the harmony changes over time).
  • Rhythmic rules analysis 830 applies rules to interpret and analyze the rhythm, the duration and timing of each note.
  • Form rules analysis 840 applies rules to interpret and analyze the music as a whole to determine musical form.
  • Each of the melodic rules analysis 810 , the harmonic rules analysis 820 , the rhythmic rules analysis 830 , and the form rules analysis 840 generate the rule-derived metadata 526 described above.
  • the rule-derived metadata 526 may be used during the machine learning analysis 528 described above.
  • FIG. 9 is a flowchart illustrating the lyrics extraction 550 process performed by the SMAE 140 according to an exemplary embodiment of the present invention.
  • the preprocessed structured sheet music data 112 is input to lyrics extraction 550 and explicit lyric metadata is extracted at 910 .
  • Explicit lyric metadata includes the data described in Table 18:
  • Humming: Indicates whether the lyrics include humming.
  • Laughing: Indicates whether the lyrics include laughter.
  • Non-English Characters: Indicates that the text uses characters not found in the English alphabet. This is helpful in identifying a text that may be in a language other than English, in the event that the lyric language cannot be identified.
  • the lyrics might use a constructed language or International Phonetic Alphabet (IPA).
  • a constructed language is one that was specifically created (rather than evolving naturally), for example Esperanto or languages that occur in science fiction or fantasy literature.
  • the lyrics are extracted from the structured sheet music data 112 into a separate text file 920 .
  • the separate text file combines all syllables into words and includes all verses. All further lyrical analysis uses this separate text file as input.
  • the language of the lyrics is estimated using language identification techniques from the field of natural language processing. In some embodiments, a combination of writing system and statistical analysis such as letter frequency and n-gram frequency is used to estimate the language. If the lyrics include a passing phrase in another language, the dominant language is identified. If the lyrics include equal parts of multiple languages such as translated lyrics or a macaronic text, the language is identified as “multiple languages.” If no match is determined, the language is marked as unknown.
  • the language label may include a note regarding the non-English characters.
  • the language label may be “unknown (contains letters other than A-Z).”
  • the text difficulty is determined at 940 .
  • the text difficulty may be determined using readability metrics such as the Flesch-Kincaid grade level, which uses the number of words per sentence and the number of syllables per word to compute an estimated difficulty level.
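  • For example, a rough Flesch-Kincaid grade-level estimate over the extracted lyrics text might look like the following sketch; the vowel-group syllable counter is a crude approximation.

        import re

        def count_syllables(word: str) -> int:
            return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

        def flesch_kincaid_grade(text: str) -> float:
            sentences = max(1, len(re.findall(r"[.!?]+", text)))
            words = re.findall(r"[A-Za-z']+", text)
            syllables = sum(count_syllables(w) for w in words)
            return 0.39 * (len(words) / sentences) + 11.8 * (syllables / len(words)) - 15.59

        lyrics = "Twinkle, twinkle, little star. How I wonder what you are."
        print(round(flesch_kincaid_grade(lyrics), 1))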
  • the content of the lyrics is analyzed at 950 .
  • known content is looked up at 970 and compared against the text of the lyrics.
  • the known content may include scripture, quotations, poems, literary text, etc.
  • the subject matter of the text of the lyrics is analyzed at 960 to classify the meaning of the lyrics. In some embodiments, this may include comparing the lyrics 552 (or the embeddings generated during the semantic embedding process 532 B or the semantic similarity metadata 540 B generated by the semantic similarity analysis 538 B) to the text of other compositions with known subject matter, including love songs, holiday songs, religious or spiritual songs, novelty songs, etc. Mature or objectionable content may be identified at 960 as well.
  • FIG. 10 is a flowchart illustrating a data ingestion process 1000 for analyzing unstructured sheet music data 114 and determining metadata 116 that describes each composition stored as unstructured sheet music data 114 according to an exemplary embodiment of the present invention.
  • the structured sheet music analysis engine 140 operates on structured sheet music data 112 , not unstructured sheet music data 114 , because all explicit information stored in structured sheet music data 112 (e.g., pitch, duration, etc.) must be inferred from unstructured sheet music data 114 .
  • unstructured sheet music data 114 undergoes analysis performed by the optical analysis engine 180 using heuristics, optical character recognition (OCR), traditional optical music recognition (OMR), machine learning-based pattern recognition (“fuzzy OMR”), etc., instead of the structured sheet music analysis engine 140 .
  • the unstructured sheet music 114 may be multi-page digital image files (such as PDFs). Sheet music that is not printed (such as an engraving plate) is printed. Sheet music that is printed but not computer readable is converted to a computer readable format by scanning the printed sheet music (for example, by the optical scanner 270 ). If the conversion from a printed format to a computer readable format fails (e.g., printed documents that have been damaged, marked, or destroyed, handwritten manuscripts that produce poor quality OMR results) the music may be restored and re-converted.
  • Explicit metadata 350 is extracted at 1010 (using a similar process as 320 above). Each page is extracted at 1020 .
  • Heuristics may be used to determine higher-level semantic metadata 116 at 1030 .
  • Examples of heuristics that may be used to determine metadata 116 include the heuristics described in Table 19:
  • Ink Density (number of dark pixels as a percentage of all pixels): A higher ink density suggests the use of more musical notation, such as dynamics, articulations, etc. A high ink density therefore implies a more difficult composition.
  • System detection by measuring the run-length of dark pixels over each horizontal row: Counting the number of consecutive dark pixels in every horizontal row helps determine potential system regions. 1. Count the number of dark pixels in a row. 2. Find the longest run of consecutive dark pixels in the row (maximum run-length). 3. If the maximum run-length exceeds a threshold, add it to the current potential region; otherwise, do not consider this row as part of a potential system region. A threshold is used to filter out any potential noise in the digital image. The output from this process is a set of potential system regions, which are then used for further processing. This algorithm is common within OMR software for finding systems and staves within a piece.
  • Potential system regions on each page may be identified at 1040 .
  • the system 100 may count the amount, and maximum run-length, of dark pixels in every horizontal row to identify potential system regions (e.g. areas of the page containing staves) to be used for further processing.
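  • A minimal sketch of this run-length heuristic: for each horizontal row of a binarized page image, find the longest run of dark pixels, and group consecutive rows whose maximum run-length exceeds a threshold into potential system regions. Binarization into a 0/1 array is assumed to have been done during preprocessing.

        import numpy as np

        def potential_system_regions(binary_page: np.ndarray, threshold: int) -> list:
            """binary_page: 2-D array where 1 = dark pixel. Returns (start_row, end_row) pairs."""
            regions, start = [], None
            for y, row in enumerate(binary_page):
                current, longest = 0, 0
                for px in row:                       # longest run of consecutive dark pixels
                    current = current + 1 if px else 0
                    longest = max(longest, current)
                if longest >= threshold:             # row belongs to a potential system region
                    start = y if start is None else start
                elif start is not None:              # region just ended
                    regions.append((start, y - 1))
                    start = None
            if start is not None:
                regions.append((start, len(binary_page) - 1))
            return regions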
  • Classified system regions may be identified at 1050 .
  • a classifier (e.g., a machine learning classifier) may be used to classify each potential system region at 1050 .
  • Machine learning may be used to determine higher-level semantic metadata 116 describing sheet music stored as unstructured sheet music data 114 at 1060 .
  • a machine learning algorithm may be used to determine the range of unstructured sheet music data 114 by using a convolutional neural network pretrained for this task.
  • the optical analysis engine 180 uses image classification methods to determine musically-relevant metadata 116 describing the sheet music stored as unstructured sheet music data 114 .
  • image classification methods based on existing image recognition models (e.g., AlexNet, VGG, Inception, ResNet, etc.) can be quickly retrained to extract range, motifs, playing techniques (double stop, cross staff beaming, etc.), etc.
  • the metadata 116 extracted from each classified system may be aggregated and reduced at 1070 .
  • a composition with two classified system regions, A and B, may be mapped to two implicit metadata extraction functions, topNote(x) and bottomNote(x), whose per-region outputs are then aggregated and reduced into composition-level metadata, as sketched below.
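  • A minimal sketch of one such reduction, assuming the composition-level range is simply the extremes across the classified system regions; topNote and bottomNote here return MIDI pitch numbers and stand in for the implicit metadata extraction functions.

        def aggregate_range(regions, top_note, bottom_note):
            """Reduce per-region top/bottom notes to a composition-level range."""
            return {"top_note": max(top_note(r) for r in regions),
                    "bottom_note": min(bottom_note(r) for r in regions)}

        # Hypothetical per-region results for classified system regions A and B:
        top = {"A": 79, "B": 84}.get       # e.g. G5 and C6
        bottom = {"A": 55, "B": 60}.get    # e.g. G3 and C4
        print(aggregate_range(["A", "B"], top, bottom))  # {'top_note': 84, 'bottom_note': 55}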
  • FIGS. 11A and 11B are a flowchart illustrating the workflow 1100 of the optical analysis engine 180 according to an exemplary embodiment of the present invention.
  • the optical analysis engine 180 uses heuristics, OCR, OMR, fuzzy OMR, etc. to infer information from images of sheet music.
  • the output of the optical analysis engine 180 is the same as the output of the structured sheet music analysis engine 140 .
  • Because images of sheet music vary enormously (e.g., in color, resolution, skew, etc.), the optical analysis engine 180 oftentimes provides results that approximate the output of the structured sheet music analysis engine 140 .
  • a large number of processes performed by the optical analysis engine 180 are based on object detection.
  • Object detection algorithms identify certain areas of an image as belonging to a particular object class.
  • the object detection algorithms return a bounding region around a detected object or the set of pixels representing the detected object, the likely class of detected object, and a probability the detected object belongs to the object class.
  • object classes include text, staves, measures, clefs, notes, etc.
  • a single object detector is trained to identify all object classes in a monolithic fashion, while in other embodiments, several specialized object detectors are trained to detect similar classes in a modular fashion. Any object detection algorithm may be used, including single-shot detection (SSD), you only look once (YOLO), common objects in context (COCO), etc.
  • a combination of object detection algorithms is used.
  • the optical analysis engine 180 works with images of sheet music that are either typeset or handwritten and is impervious to features such as scale and font.
  • the optical analysis engine 180 follows a general pattern of identifying segments, analyzing the segments, then further segmenting into smaller segments in a top-down approach (that differs from the bottom-up approach of the structured sheet music analysis engine 140 ). For example, a staff is segmented into measures, which are then segmented into notes.
  • the output from the optical analysis engine 180 may be directly fed into a keyword based search engine, used in evaluating natural language queries, or both.
  • the output from the optical analysis engine 180 may be further processed by the structured sheet music analysis engine 140 .
  • the optical analysis engine 180 may extract lyrics from the images of sheet music but the structured sheet music analysis engine 140 may analyze the lyrics using the process described above at 532 B and 538 B.
  • Data may be input to the optical analysis engine 180 as either raster image data 1102 (e.g., PNG, JPEG, etc.), vector image data 1104 (e.g., SVG), etc.
  • Raster formats encode information about every pixel while vector formats store instructions on how the content is drawn. If vector information is available as well as raster data, accuracy may be improved by analyzing both raster and vector data.
  • Where the vector image data 1104 is input to the optical analysis engine 180 , the vector image data 1104 is rasterized at 1106 by selecting a resolution and rendering the vector image data 1104 at that resolution.
  • the newly generated raster image data 1112 undergoes preprocessing at 1114 just as raster image data 1102 does.
  • a copy of the original vector image data 1104 is used for vector analysis 1110 .
  • vector image data 1104 encodes graphics as a set of drawing commands, it is sometimes possible to identify when two drawing commands are similar through a vector analysis 1110 process known as template matching.
  • Each command in the vector image file 1104 is either stored in a dictionary within the file that is referenced at later points in the file or is used inline. These commands may be compared to a set of known templates that identify that command as belonging to a particular glyph within a particular typeface.
  • a vector command consists of one or more instructions that tell the computer what type of graphic primitive to draw (such as a line, circle, or Bezier curve), where to draw it, and its relative proportions. The use of relative proportions means that the commands may appear at different scales, even though the commands draw the same figure.
  • a vector command on a 1× scale may be M 2.578125 2.984375 C 1.835938 2.984375 1.222656 2.8125 0.734375 2.46875 whereas the same vector command on a 2× scale may be M 5.15625 5.96875 C 3.67188 5.96875 2.44531 5.625 1.46875 4.9375.
  • the same commands can be compared after they are normalized to the same scale (for example, by dividing every coordinate by the maximum coordinate so that the largest coordinate is now equal to 1).
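  • A sketch of that normalization: divide every coordinate in a drawing command by the command's largest coordinate, so that the same glyph drawn at different scales yields approximately the same normalized command for template matching.

        import re

        NUMBER = r"-?\d+(?:\.\d+)?"

        def normalize_command(command: str, precision: int = 4) -> str:
            coords = [float(c) for c in re.findall(NUMBER, command)]
            largest = max(abs(c) for c in coords)
            scaled = iter(str(round(c / largest, precision)) for c in coords)
            return re.sub(NUMBER, lambda _: next(scaled), command)

        cmd_1x = "M 2.578125 2.984375 C 1.835938 2.984375 1.222656 2.8125 0.734375 2.46875"
        cmd_2x = "M 5.15625 5.96875 C 3.67188 5.96875 2.44531 5.625 1.46875 4.9375"
        print(normalize_command(cmd_1x))
        print(normalize_command(cmd_2x))  # nearly identical to the normalized 1x command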
  • a structured music document 112 may be reconstructed. For example, barline glyphs may be identified to isolate individual measure regions. Additionally, duration may be computed for all note, rest, and chord objects, and their position in time is determined using a time cursor within each measure region.
  • the reconstructed structured music document 112 may then be analyzed by the structured sheet music analysis engine 140 as described above. In some embodiments, the outputs of both vector analysis 1110 and other analysis performed by the optical analysis engine 1100 may be aggregated for greater accuracy. Where raster image data 1102 is the input to optical analysis engine 1100 , the raster image data 1102 is preprocessed at 1114 .
  • Preprocessing 1114 is intended to improve the accuracy of the image analysis by straightening, denoising, whitening, etc. Scanned images skewed during the scanning process are straightened during preprocessing 1114 .
  • the images of sheet music are straightened using a Hough transform to identify staff lines, measure their angles, and rotate the image until the staff lines are straight.
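  • A sketch of that deskewing step; using OpenCV's Hough transform is an assumption here, and the sign of the angle correction may need to be flipped depending on the scanning setup.

        import cv2
        import numpy as np

        def deskew_page(gray: np.ndarray) -> np.ndarray:
            edges = cv2.Canny(gray, 50, 150)
            lines = cv2.HoughLines(edges, 1, np.pi / 180, threshold=400)
            if lines is None:
                return gray
            # Staff lines are near-horizontal, i.e. theta close to pi/2; ignore other lines.
            thetas = [theta for rho, theta in lines[:, 0]
                      if abs(theta - np.pi / 2) < np.radians(10)]
            if not thetas:
                return gray
            skew_degrees = float(np.degrees(np.median(thetas) - np.pi / 2))
            h, w = gray.shape
            rotation = cv2.getRotationMatrix2D((w / 2, h / 2), skew_degrees, 1.0)
            return cv2.warpAffine(gray, rotation, (w, h), borderValue=255)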
  • Denoising identifies and removes random variations in pixel intensities and may be done by any number of processes, including smoothing, which averages the intensities around a pixel, non-local means, which averages similar patches within the image, etc. Whitening changes the statistical properties of the pixel values such that all pixels are uncorrelated and have a variance of one.
  • document information is extracted at 1116 .
  • Document header information, if available, provides the document information. If not, the information is extracted using OCR.
  • Document information includes the fields described in Table 20:
  • the output of the OCR may contain recognition errors which are corrected in a post-processing step.
  • the post-processing step may include one or more of (1) collating any header information contained in the original document, (2) collating any text, (3) identifying and fixing common recognition errors through the use of a lookup table, (4) identifying known entities (such as composers, arrangers, etc.) through the use of a lookup table, (5) identifying common patterns using a template, (6) spell-checking, and (7) using named entity recognition to identify names.
  • That part includes a scale run that begins in measure 5 and ends in measure 6, but is interrupted by a system break. After stitching, the scale run is uninterrupted as shown in FIG. 21H .
  • the unstructured sheet music data 114 may be converted (e.g., using OMR) to structured sheet music data 112 (e.g., a MusicXML file) by the sheet music conversion engine 160 and then passed to the SMAE 140 , which determines metadata 116 by analyzing the data 112 down to each note 710 as described above.
  • the optical analysis engine 180 includes one or more image recognition algorithms (described above) that have been trained to recognize musical metadata 116 without having to analyze every single note 710 .
  • lyrics may be extracted from the staff level features at 1136 (using a similar process as 550 above) and analyzed at 1138 (by the SMAE 140 using a similar process as 532 B- 540 B above) to generate semantic similarity metadata 1140 .
  • Phrases may be extracted from the staff level features at 1142 and analyzed at 1144 using image recognition algorithm(s) trained to generate phrase descriptions 1146 (similar to phrase descriptions 548 ).
  • Those one or more image recognition algorithms may extract measures from the staff level features at 1148 and analyze measure features at 1150 , and extract notes from the staff level features at 1152 and analyze note features at 1154 .
  • the results of phrase, measure, and note analysis may be analyzed by the structured sheet music analysis engine 140 .
  • the results of the machine learning-based pattern recognition processes described above may be analyzed by the structured sheet music analysis engine 140 .
  • the machine learning analysis 528 described above can be used to predict the difficulty of each composition (or each part within each composition) stored as image data.
  • a corpus of compositions or parts stored as images may have known difficulty levels.
  • a supervised learning process can then be used to learn a function for determining a probability that another composition or part stored as image data has those difficulty levels based on the patterns recognized in the image data.
  • the structured sheet music analysis engine 140 analyzes structured sheet music data 112 as well as unstructured sheet music data 114 that has been converted to structured sheet music data 112 by the sheet music conversion engine 160 to determine musically relevant metadata 116 describing the sheet music.
  • the optical analysis engine 180 analyzes unstructured sheet music data 114 to determine musically relevant metadata 116 describing the sheet music.
  • the metadata 116 describing each composition is stored in the one or more databases 110 .
  • the sheet music search and discovery system 100 also includes a search engine 190 that enables users to determine compositions that are relevant to them based on the metadata 116 extracted, calculated, and generated by the sheet music search and discovery system 100 .
  • FIG. 12 is a flowchart illustrating a query process 1200 executed by the search engine 190 to search sheet music data according to an exemplary embodiment of the present invention.
  • a query language statement 1210 may be directly input to database query execution engine 1270 to generate search results 1280 .
  • the query language statement 1210 may be constructed using structured query language (SQL). Examples of SQL query pseudocode are shown in Table 21, where composition-related data is stored in a table called Compositions and part-related data is stored in a table called Parts. There is a 1:n relationship between Compositions and Parts (because each composition may include multiple parts), such that an entry in the composition table may link to one or more entries in the Part table via a foreign key called compositionId.
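  • As an illustration of the kind of query pseudocode referenced in Table 21, the following sketch joins the Compositions and Parts tables via the compositionId foreign key to find compositions that include a trumpet part staying at or below a given pitch; all column names other than compositionId, and the database file, are assumptions.

        import sqlite3

        QUERY = """
            SELECT c.title, c.composer
            FROM Compositions AS c
            JOIN Parts AS p ON p.compositionId = c.id
            WHERE p.instrument = ?
              AND p.highestPitch <= ?      -- MIDI 77 corresponds to F5
            GROUP BY c.id
        """

        connection = sqlite3.connect("sheet_music.db")   # hypothetical database file
        for title, composer in connection.execute(QUERY, ("Trumpet", 77)):
            print(title, composer)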
  • query generator 1260 may create a query based on user profile 1220 , keywords and filters 1230 , audio input 1240 analyzed at 1245 , natural language query 1250 analyzed at 1255 , audio fingerprint, QR code, unstructured sheet music data, etc.
  • the search engine 190 provides functionality for users to input the keywords 1230 via the graphical user interface 192 .
  • the graphical user interface 192 may provide functionality for users to input keywords 1230 in an unstructured manner. For example, the graphical user interface 192 may simply allow the user to enter keywords 1230 and the search engine 190 may be able to determine whether any of the metadata 116 matches or is similar to those keywords 1230 .
  • the graphical user interface 192 may provide functionality for users to input keywords 1230 in a structured manner.
  • the graphical user interface 192 may provide functionality for the user to input keywords 1230 in one or more categories of musical attributes (e.g., composer, range, etc.) and the search engine 190 may determine whether any of the metadata 116 in those categories matches or is similar to those keywords 1230 .
  • the search engine 190 may also provide functionality for users to augment a keyword search by selecting a filter 1230 via the graphical user interface 192 .
  • the graphical user interface 192 may allow the user to select a category (e.g., key signature, meter, lyric language, etc.) and input a value or range of values. The search engine 190 then determines whether any of the metadata 116 matches that value or is within that range of values.
  • the search engine 190 may also search the metadata 116 to identify sheet music matching user profiles 1220 .
  • User profiles 1220 will be discussed further in reference to FIG. 16 .
  • the database query execution engine 1270 compares the query generated by query generator 1260 to the metadata 116 describing structured and unstructured sheet music data to generate search results 1280 .
  • the query generator 1260 may also construct a query based on audio input 1240 .
  • the search engine 190 may provide functionality for a user to submit a query by humming or singing a melodic fragment (audio input 1240 ) and have the search engine 190 return the compositions with the highest similarity.
  • the search engine 190 may search for similar compositions using Parsons code, dynamic time warping, audio fingerprint and/or a neural network. Parsons code describes the melodic contour of a composition.
  • audio analysis at 1245 determines an approximate pitch using standard pitch detection, such as autocorrelation, fast Fourier transform (FFT), or the Yin algorithm.
  • the audio analysis at 1245 determines whether subsequent pitches are higher, lower, or the same as the previous pitch, within a certain error threshold. Converting the audio input 1240 to Parsons code has certain advantages. For example, the user's pitch does not have to be exact, the user's rhythm does not have to be exact, and the approach can easily search structured data. However, converting the audio input 1240 to Parsons code also has certain disadvantages. For example, the audio input 1240 must be monophonic, the audio input 1240 must start at the beginning of the composition, and, due to a lack of rhythmic information, it is difficult to distinguish between compositions that have the same melodic contour but different rhythms.
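  • A sketch of the Parsons code conversion: after pitch detection at 1245, each note is compared to the previous one and encoded as up, down, or repeat; the half-semitone error threshold is an assumption.

        def to_parsons(pitches: list, threshold: float = 0.5) -> str:
            """'*' marks the first note, then 'U', 'D', or 'R' for up, down, or repeat."""
            code = ["*"]
            for previous, current in zip(pitches, pitches[1:]):
                if current - previous > threshold:
                    code.append("U")
                elif previous - current > threshold:
                    code.append("D")
                else:
                    code.append("R")
            return "".join(code)

        # Opening of "Twinkle, Twinkle, Little Star" (C C G G A A G) as MIDI numbers:
        print(to_parsons([60, 60, 67, 67, 69, 69, 67]))  # *RURURD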
  • Dynamic time warping is a process of comparing two waveforms and determining how similar they are under time altering transforms. Dynamic time warping has certain advantages. For example, the user's pitch does not have to be exact (as a stretched waveform will have a lower pitch and a time-compressed waveform will have a higher pitch), the user's rhythm does not have to be exact, the audio input 1240 may be polyphonic (e.g. searching for a piano composition by playing a piano as the audio input 1240 ), and the audio input 1240 does not have to start at the beginning of the composition.
  • dynamic time warping also has certain disadvantages. For example, because this approach involves waveform comparison, the search engine 190 must store at least one audio version of every composition that is searched in audio format.
  • the search engine 190 may also utilize a neural network trained against several queries and known ground truth structured data, such as the Multimedia Information Retrieval Query By Singing/Humming (MIR-QBSH) corpus, the Institute Of Acoustics Chinese Academy Of Sciences (IOACAS) corpus, etc.
  • Utilizing a neural network has certain advantages. For example, it allows for a direct query of structured data from an audio waveform, the pitch and rhythm of the audio input 1240 do not need to be exact (as the convolutional layers of the network should capture the time and pitch variation), and audio input 1240 does not have to start at the beginning of the composition.
  • utilizing a neural network has certain disadvantages.
  • training the neural network may take a long time and the query may be limited only to the primary melodic motifs of the compositions (whereas other methods may search the entire composition for a match). However, this may also be an advantage as people are more likely to search for common, memorable motifs.
  • the query generator 1260 may also construct a query based on natural language query 1250 (e.g., “What concertos feature the clarinet?” “Are there any SAB choral works based on the poems of Robert Frost?” “Is there a trumpet solo that features triple-tonguing?” etc.).
  • the search engine 190 may analyze the natural language query at 1255 and use that analysis to perform natural language querying against natural language descriptions of the compositions stored in the one or more databases 110 .
  • Natural language descriptions of musical compositions may include information from the metadata 116 (in particular the text/keyboard output 420 ) described above—including, for example, the explicit metadata 350 (e.g., composer, lyricist, etc.), the implicit metadata (e.g., range, meters, etc.), the rule-derived metadata 526 , the machine learning derived metadata 530 , the semantic similarity metadata 540 A, 540 B, and 1140 , the phrase descriptions 548 and 1146 , etc.—as well as the publisher's description of a composition, other advertising copy of the composition, other sources of information pertaining to the composition (e.g., electronic sources such as Wikipedia), crowd-sourced information about the composition, etc.
  • FIG. 13 is a flowchart illustrating a process 1300 for querying sheet music data using a user profile according to an exemplary embodiment of the present invention.
  • Keywords 1310 are transformed into a query 1340 (e.g., an SQL query) by the query generator 1260 .
  • a selected user profile 1320 is considered and any mandatory filters are extracted at 1330 and appended to the query 1340 before query 1340 is executed at 1350 .
  • a mandatory filter may include, for example, an ensemble type, a requirement that a specific instrument be included in the search results, etc. Extracting and appending mandatory filters 1330 to the query 1340 limits the results returned. For example, if the user selects a user profile 1320 for a choir, all non-choir literature is automatically excluded. Similarly, if the user selects a user profile 1320 for wind ensemble and requires that the English horn be in the instrumentation, then all wind ensemble compositions not having an English horn are automatically excluded. Extracting mandatory filters at 1330 before the query 1340 is executed at 1350 limits the number of records that the query 1340 runs over, resulting in faster execution.
  • the execution of the query 1340 at 1350 results in unsorted query results 1360 , which are sorted at 1370 by comparing each composition in the unsorted query results 1360 to the selected user profile 1320 by using similarity metrics.
  • Potential metrics include, for example, Manhattan (L1) distance or Jaccard similarity.
  • the unsorted query results 1360 are then sorted at 1370 from highest scoring (i.e., most similar) to lowest scoring (least similar).
  • the sorted query results 1380 are then displayed to the user. Sorting allows the most relevant compositions to appear at the top of the page.
  • the search engine 190 compares how well the selected user profile 1320 matches a given item.
  • the sorting process 1370 may include an instrumentation comparison and/or a range comparison.
  • For an instrumentation comparison, the selected user profile 1320 contains a list of one or more instruments and their respective ranges.
  • In order to recommend appropriate compositions, the instrumentation of the selected user profile 1320 must be compared to the instrumentation of each composition. Both the number of instruments as well as the type of each instrument must be considered to make an accurate match.
  • Because instrument names may vary, the instrument names are normalized during preprocessing at 635 and/or are limited to a predefined set of drop-down values provided by the graphical user interface 192 that the user may select when creating their profile.
  • a user has defined a selected user profile 1320 for a brass quartet.
  • the search engine 190 may use, for example, a method to determine the similarity of the instrumentation of the selected user profile 1320 to the composition. Without this comparison, it is possible that this example query 1340 would not return any results. By making this comparison, the example query 1340 can return results, even if the results are inexact matches.
  • the search engine 190 returns compositions where the range of a given part of the composition falls within the range of the same part in the selected user profile 1320 .
  • the search engine 190 may run two penalty functions to determine how far the range of the composition falls above and below the range of the selected user profile 1320 .
  • Each function may determine an exponential penalty for each semitone that a given composition goes above or below the range of the selected user profile. For example, if a composition's range exceeds the upper range of the selected user profile 1320 by two semitones, the function may return a 96% match (100 − 2*2).
  • Such penalties may be more heavily weighted when the composition's range exceeds not only the range of the selected user profile 1320 , but also the physical capabilities of the instrument selected in the selected user profile 1320 .
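  • A sketch of such a penalty, consistent with the 96% example above if the penalty doubles per semitone of excess; the exact penalty curve and the extra weighting for exceeding the instrument's physical range are assumptions.

        def range_match(excess_semitones: int, beyond_instrument: bool = False,
                        instrument_weight: float = 2.0) -> float:
            """Return a percent match for a range that exceeds the profile by excess_semitones."""
            if excess_semitones <= 0:
                return 100.0
            penalty = 2 ** excess_semitones              # exponential per-semitone penalty
            if beyond_instrument:                        # heavier weight past the instrument's limits
                penalty *= instrument_weight
            return max(0.0, 100.0 - penalty)

        print(range_match(2))                            # 96.0
        print(range_match(2, beyond_instrument=True))    # 92.0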
  • the search engine 190 may include a content-based recommendation system that recommends compositions that the user has not yet purchased that are similar to compositions that the user has already purchased.
  • FIG. 14 is a flowchart illustrating a recommendation process 1400 performed by the search engine 190 to identify recommendations 1490 according to an exemplary embodiment of the present invention.
  • the selected user profile 1320 may include an instrumentation 1422 and a range 1424 for each of those instruments, as well as past purchases 1426 of the user.
  • the search engine 190 may identify similar compositions 1445 at 1440 by comparing the metadata 116 describing each of the past purchases 1426 to the metadata 116 of the remaining compositions in the one or more databases 110 .
  • the similar composition process 1440 may identify compositions with similar instrumentations and/or ranges as described above with respect to the sorting process 1370 . After discarding purchases already made by the user at 1480 , the similar compositions 1445 may be sorted for relevance using the sorting process 1370 described above.
  • the search engine 190 may recommend compositions for middle school orchestras, followed by compositions for orchestras that are not specifically targeted for middle school but that match closely to other compositions that the user has already purchased.
  • Compositions for different ensembles (e.g., choir or PVG arrangements) may be excluded from these recommendations.
  • the search engine 190 may recommend other compositions by Beethoven regardless of ensemble (e.g., PVG arrangements, orchestral arrangements, compositions for wind ensemble, etc.).
  • the search engine 190 may also determine compositions that are often purchased together 1455 at 1450 .
  • the search engine 190 may store a co-purchasing matrix 1452 identifying compositions that are often purchased by the same user (either at the same time or separately). For example, the items "Guitar Method Book: Beginner" and "Guitar Method Book: Intermediate" are likely to be purchased either at the same time (in the same shopping cart) or at different points in time (a user finishes the beginner book and then purchases the intermediate book). Based on historical purchase information of items across all users in the co-purchasing matrix 1452 , the search engine 190 can identify compositions 1455 frequently bought with the current composition and present them on the product page for the current composition.
  • the search engine 190 may utilize logistic regression, a neural-network based approach, etc. Again, after discarding purchases already made by the user at 1480 , the compositions purchased together 1455 may be sorted for relevance using the sorting process 1370 described above.
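  • A rough sketch of the co-purchasing matrix 1452: count how often pairs of items appear in the same user's purchase history, then surface the items most frequently bought with the composition currently being viewed. The data layout is an assumption, and the third item in the example history is hypothetical.

        from collections import defaultdict
        from itertools import combinations

        def build_co_purchasing(purchases_by_user: dict) -> dict:
            matrix = defaultdict(int)
            for items in purchases_by_user.values():
                for a, b in combinations(sorted(items), 2):
                    matrix[(a, b)] += 1
                    matrix[(b, a)] += 1
            return matrix

        def frequently_bought_with(item: str, matrix: dict, top_n: int = 3) -> list:
            scores = {b: n for (a, b), n in matrix.items() if a == item}
            return sorted(scores, key=scores.get, reverse=True)[:top_n]

        history = {
            "user1": {"Guitar Method Book: Beginner", "Guitar Method Book: Intermediate"},
            "user2": {"Guitar Method Book: Beginner", "Guitar Method Book: Intermediate"},
            "user3": {"Guitar Method Book: Beginner", "Jazz Standards for Guitar"},
        }
        matrix = build_co_purchasing(history)
        print(frequently_bought_with("Guitar Method Book: Beginner", matrix))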
  • the search engine 190 may also recommend the past purchases 1465 of users with similar user profiles 1220 . Similar user profiles 1462 are identified at 1460 .
  • the user profile similarity process 1460 may compare the instrumentation 1422 and range(s) 1424 of the selected user profile 1320 to the instrumentation 1422 and the range(s) 1424 of the other user profiles 1220 using similarity metrics such as Manhattan (L1) distance or Jaccard similarity. (A similar process is used to select a composition for a user, as described above.)
  • a composition that is wholly contained within the range 1424 of the selected user profile 1320 may be considered to be a 100 percent match.
  • the search engine 190 may compare the ranges of the two ensembles by determining whether the largest interval difference between the two ranges exceeds a prescribed threshold (e.g., +/−2 semitones).
  • Conventional systems recommend products using collaborative filtering. For example, user ratings are collected into a matrix, which is factored (e.g., using singular value decomposition), and the user is projected onto the new basis.
  • the search engine 190 stores user profiles 1220 that include instrumentation 1422 (and, in some cases, a range 1424 for each instrument), the search engine 190 is able to recommend compositions that are more relevant than would be generated using conventional collaborative filtering. For example, a user who made a purchase for a middle school marching band where their trumpets can only play up to F5 may be informed about another purchase made for a different middle school marching band with trumpets that can only play up to F5. Meanwhile, compositions outside that range and compositions for other ensembles can be excluded.
  • the past purchases 1465 of users with similar user profiles 1462 may be sorted for relevance using the sorting process 1370 described above.
  • the past purchases 1465 of users with similar user profiles 1462 may be sorted by the similarity of the user profiles 1220 .
  • the search engine 190 may also use the information contained within the user profile 1320 and the ordered recommendations 1490 to market compositions.
  • new compositions may be electronically marketed specifically to that user.
  • the system 100 may also output email marketing campaigns, electronic ads, push notifications, etc. that include the recommendations 1490 .
  • existing electronic marketing materials may be customized to include recommendations 1490 based on the user profile 1320 of the user receiving the electronic marketing materials. For example, for a director of a high school marching band that subscribes to a marching band email newsletter, the newsletter may be customized to meet that director's ensemble by highlighting or sorting items that most closely match their ensemble. In another example, the same user may subscribe to new product push notifications.
  • the director When a new marching band arrangement of a current pop song is released and it matches that ensemble, the director receives a push notification on their phone indicating that new inventory is available that would be appropriate for their ensemble. If the user opens the push notification, they can then view the inventory and its marketing materials directly on their mobile device, allowing them to purchase the item as soon as it becomes available.
  • the search engine 190 may also recommend multiple compositions that together form a concert program (or set list).
  • FIG. 15 is a flowchart illustrating concert program generation processes 1500 performed by the search engine 190 according to an exemplary embodiment of the present invention.
  • the search engine 190 may include an automatic concert program generator that generates a concert program 1590 (for example, within a user-specified length) based on one or more user-specified compositions, referred to as a concert program generator seeds 1510 .
  • the concert generation process 1500 may identify compositions 1445 similar to the concert program generator seed(s) 1510 or to compositions previously purchased by the user (identified using the similar composition identification process 1440 described above), compositions programmed 1565 (and past purchases 1465 ) of users with similar user profiles 1462 (identified using the user profile similarity process 1460 described above), compositions frequently purchased together 1455 with the concert program generator seed(s) 1510 (identified using the co-purchasing identification process 1450 and the co-purchasing matrix 1452 described above), compositions frequently programmed together 1555 with the concert program generator seed(s) 1510 , etc.
  • a co-programming matrix 1552 of compositions frequently programmed together (derived, for example, from actual concert programs, CD track listings, etc.) may be stored and a co-occurrence analysis 1550 may be performed to determine compositions frequently programmed with the one or more concert program generator seeds 1510 .
  • the similar compositions 1445 , the compositions purchased together 1455 , the compositions programmed together 1555 , and the compositions programmed 1565 by (and/or past purchases 1465 of) users with similar user profiles 1462 may be sorted using the sorting process 1370 described above.
  • the results may be trimmed at 1580 such that the concert program 1590 meets any requirements 1582 set by the user (such as the user-specified length).
  • the search engine 190 may provide functionality for a user to automatically generate a concert program 1590 (for example, using a single composition as the concert program generator seed 1510 ) with minimal input from the user.
  • each page for each composition may include a button (or other input mechanism) to automatically generate a concert program 1590 using that composition as the concert program generator seed 1510 .
  • the search engine 190 may infer the ensemble type from the composition's instrumentation, as well as the ranges of similar compositions.
  • the search engine 190 may include a guided concert program generator interface 1592 that automatically generates a concert program 1590 using all of the specified requirements 1582 .
  • the guided concert program generator interface 1592 may provide functionality for the user to specify one or more concert program generator seeds 1510 .
  • the guided concert program generator interface 1592 may provide functionality for the user to interact with the search engine 190 and specify each of the requirements 1582 using natural language (e.g., by voice or text).
  • the search engine 190 (S) may interact with a user (U) as follows:
  • FIG. 16 illustrates a user profile view 1600 of the graphical user interface 192 according to an exemplary embodiment of the present invention.
  • the user profile view 1600 provides functionality for a user to identify several parameters at once.
  • the graphical user interface 192 may provide functionality for a user to add/edit/delete multiple instruments 1610 , a number 1620 of each instrument, a written range 1640 of each instrument with a tolerance in semitones (for example, "C4 to C5 +/−2 semitones" would search for compositions with the range of Bb3 to D5, since Bb3 is 2 semitones lower than C4, and D5 is 2 semitones higher than C5), a grade level 1630 (or range of grade levels) for each instrument, whether an instrument should be featured in a solo, etc. (In a preferred embodiment, the written range 1640 is always used for pitched instruments.
  • the graphical user interface 192 may display the written and/or sounding pitch. For unpitched percussion, the graphical user interface 192 may not permit the user to enter this information because range does not apply.)
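  • A sketch reproducing the range-tolerance example above ("C4 to C5 +/−2 semitones" searches Bb3 to D5) by converting note names to MIDI numbers, widening by the tolerance, and converting back; spelling the lowered result as Bb rather than A# is a choice made to match the example.

        NOTE_TO_SEMITONE = {"C": 0, "C#": 1, "Db": 1, "D": 2, "D#": 3, "Eb": 3, "E": 4,
                            "F": 5, "F#": 6, "Gb": 6, "G": 7, "G#": 8, "Ab": 8, "A": 9,
                            "A#": 10, "Bb": 10, "B": 11}
        SEMITONE_TO_NOTE = {0: "C", 1: "C#", 2: "D", 3: "D#", 4: "E", 5: "F", 6: "F#",
                            7: "G", 8: "G#", 9: "A", 10: "Bb", 11: "B"}

        def to_midi(name: str) -> int:
            return 12 * (int(name[-1]) + 1) + NOTE_TO_SEMITONE[name[:-1]]

        def to_name(midi: int) -> str:
            return f"{SEMITONE_TO_NOTE[midi % 12]}{midi // 12 - 1}"

        def expand_range(low: str, high: str, tolerance: int) -> tuple:
            return to_name(to_midi(low) - tolerance), to_name(to_midi(high) + tolerance)

        print(expand_range("C4", "C5", 2))  # ('Bb3', 'D5')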
  • the graphical user interface 192 may allow each user to create multiple user profiles. For example, a user may create one profile for high school marching band (because, e.g., the user is the director of the band), another profile for church choir (because, e.g., the user is the director of the choir), a third profile for a saxophonist (e.g., a student of the user), and a fourth profile for the user (because, e.g., the user is learning to play guitar).
  • the search engine 190 may provide pre-defined user profiles that users can select and edit.
  • the pre-defined user profiles may include, for example, a sixth grade SAB (soprano, alto, baritone) choir, a community church choir, a high school marching band, etc.
  • FIG. 17 illustrates an edit range view 1700 of the graphical user interface 192 according to an exemplary embodiment of the present invention.
  • the user interface 192 may provide functionality for a user to edit the range 1640 of each instrument in the user profile.
  • FIG. 18 illustrates a search view 1800 of the graphical user interface 192 according to an exemplary embodiment of the present invention.
  • a user may input one or more keywords in field 1810 and indicate a user profile at 1820 .
  • the user profile may be chosen from a drop-down menu listing user profiles.
  • FIG. 19 illustrates a search results view 1900 of the graphical user interface 192 according to an exemplary embodiment of the present invention.
  • the search results view 1900 displays the results of a query constructed by the query generator based on the keywords 1920 , a user profile 1930 , or both.
  • the search results view 1900 provides functionality for the user to refine the results by selecting one or more filters 1910 .
  • the search results may be ranked according to their similarity to the one or more keywords 1920 and parameters stored in the selected profile 1930 .
  • FIG. 20 is a block diagram of a simplified multifunctional electronic device 2000 according to an exemplary embodiment of the present invention.
  • Multifunction electronic device 2000 may include processor 2005 , display 2010 , user interface 2015 , graphics hardware 2020 , device sensors 2025 (e.g., proximity sensor/ambient light sensor, accelerometer and/or gyroscope), microphone 2030 , audio codec(s) 2035 , speaker(s) 2040 , communications circuitry 2045 , digital image capture circuitry 2050 , video codec(s) 2055 (e.g., in support of digital image capture unit 2050 ), memory 2060 , storage device 2065 , and communications bus 2070 .
  • Multifunction electronic device 2000 may be, for example, a personal electronic device such as a personal digital assistant (PDA), mobile telephone, or a tablet computer.
  • Processor 2005 may execute instructions necessary to carry out or control the operation of many functions performed by device 2000 .
  • Processor 2005 may, for instance, drive display 2010 and receive user input from user interface 2015 .
  • User interface 2015 may allow a user to interact with device 2000 .
  • user interface 2015 can take a variety of forms, such as a button, keypad, dial, a click wheel, keyboard, display screen and/or a touch screen.
  • Processor 2005 may also, for example, be a system-on-chip such as those found in mobile devices and include a dedicated graphics processing unit (GPU).
  • Processor 2005 may be based on reduced instruction-set computer (RISC) or complex instruction-set computer (CISC) architectures or any other suitable architecture and may include one or more processing cores.
  • Graphics hardware 2020 may be special purpose computational hardware for processing graphics and/or assisting processor 2005 to process graphics information.
  • graphics hardware 2020 may include a programmable GPU.
  • Image capture circuitry 2050 may include a lens assembly 2080 , which may have an associated sensor element 2090 .
  • Image capture circuitry 2050 may capture still and/or video images. Output from image capture circuitry 2050 may be processed, at least in part, by video codec(s) 2055 and/or processor 2005 and/or graphics hardware 2020 , and/or a dedicated image processing unit or pipeline incorporated within circuitry 2055 .
  • Images so captured may be stored in memory 2060 and/or storage 2065 .
  • Sensor and camera circuitry 2050 may capture still and video images that may be processed in accordance with this disclosure, at least in part, by video codec(s) 2055 and/or processor 2005 and/or graphics hardware 2020 , and/or a dedicated image processing unit incorporated within circuitry 2050 . Images so captured may be stored in memory 2060 and/or storage 2065 .
  • Microphone 2030 may capture audio recordings that may be processed in accordance with this disclosure, at least in part, by audio codec(s) 2035 and/or processor 2005 . Audio recordings so captured may be stored in memory 2060 and/or storage 2065 .
  • Memory 2060 may include one or more different types of media used by processor 2005 and graphics hardware 2020 to perform device functions.
  • memory 2060 may include memory cache, read-only memory (ROM), and/or random access memory (RAM).
  • Storage 2065 may store media (e.g., audio, image and video files), computer program instructions or software, preference information, device profile information, and any other suitable data.
  • Storage 2065 may include one or more non-transitory storage mediums including, for example, magnetic disks (fixed, floppy, and removable) and tape, optical media such as CD-ROMs and digital video disks (DVDs), and semiconductor memory devices such as Electrically Programmable Read-Only Memory (EPROM), and Electrically Erasable Programmable Read-Only Memory (EEPROM).
  • Memory 2060 and storage 2065 may be used to tangibly retain computer program instructions or code organized into one or more modules and written in any desired computer programming language. When executed by, for example, processor 2005 such computer program code may implement one or more of the methods described herein.

Abstract

A sheet music search and discovery system is disclosed that uses specific mathematical rules to analyze and characterize sheet music and provides functionality for users to identify sheet music based on those characterizations. The system stores sheet music data and metadata characterizing each composition, provides a graphical user interface that provides functionality for users to search the sheet music data for compositions, and generates search results based at least in part on the metadata characterizing each composition. In one embodiment, metadata describing structured sheet music data is generated using a global vector space that includes semantic representations of elements extracted from a large corpus. In another embodiment, metadata describing unstructured sheet music data is generated using machine learning-based pattern recognition. In another embodiment, the interface provides functionality for users to identify instruments and a range for each of the instruments and identify compositions with similar instruments and ranges.

Description

    CROSS REFERENCE TO RELATED APPLICATIONS
  • This application is a continuation of U.S. patent application Ser. No. 15/988,719, filed May 24, 2018, which claims priority to U.S. Provisional Patent Application No. 62/511,025, filed May 25, 2017, which is hereby incorporated by reference.
  • BACKGROUND
  • Searching for and selecting sheet music, particularly for an ensemble, is an imprecise and challenging process because the buyer needs to understand both the musical abilities of the ensemble and how well the printed music matches those abilities. With extensive music training and experience, a buyer (or a seller making a recommendation) can manually review sheet music and make a subjective determination as to the suitability of each work over the others. With thousands of pages of sheet music available, however, such a manual review is so inefficient that a comprehensive review of all available options is unrealistic. Furthermore, such subjective interpretations are often error prone, as even an experienced buyer can incorrectly assess the suitability of a piece or the abilities of an ensemble, which may change over time.
  • Online sheet music retailers have dedicated search engines. However, conventional search engines are configured to provide functionality for users to search for and filter by only textual data (such as title, composer, arranger, a short text description of the composition, keywords, and possibly lyrics). Conventional sheet music search engines are incapable of providing functionality to search or filter by other musically relevant information, such as range or difficulty. Even manually viewing each composition in search results in order to make a subjective determination is difficult, as retailers generally do not provide the entire document because of piracy concerns.
  • In the field of natural language processing, vector representations of words that carry syntactic and semantic information (word embeddings) have proven powerful in various natural language processing tasks, in particular in sentiment analysis. Meanwhile, machine learning algorithms have also been used to detect patterns in data and generalize those patterns in order to adapt to data that they have not previously seen. For example, sequence labeling has been used to algorithmically assign categorical labels to observed values.
  • However, processes have not been developed for generating vector representations of sheet music data to analyze and characterize sheet music and provide functionality for a user to search and/or filter sheet music based on musically-relevant characterizations of the underlying sheet music. Furthermore, machine learning algorithms have not been developed for analyzing and characterizing sheet music to provide functionality for a user to search and/or filter sheet music based on those characterizations.
  • Accordingly, there is a need for a system that uses specific mathematical rules to analyze and characterize sheet music and provides functionality for a user to leverage those characterizations while searching for and selecting sheet music. Furthermore, there is a need for a search engine and graphical user interface that provides functionality for a user to search and/or filter sheet music based on musically-relevant characterizations of the underlying sheet music, such as the instrumentation and range of the compositions.
  • Some sheet music may be available in structured formats (such as MusicXML) that contain musical data (e.g., pitches, rhythms, clefs, articulations, etc.) in a musically semantic structure. Other sheet music, however, may only be available as (unstructured) image data (such as PDFs). Unstructured sheet music data can be converted to structured music data and analyzed using the same mathematical rules mentioned above. However, a two-step process of converting image data to structured sheet music data and then using mathematical rules developed for analyzing structured sheet music data may be computationally inefficient. Furthermore, the conversion process may not be precise, particularly if the image data is unclear.
  • Accordingly, there is an additional need for a system that uses mathematical rules specifically developed to analyze and characterize unstructured sheet music images along with the functionality for a user to search and/or filter sheet music based on those characterizations.
  • SUMMARY
  • In order to overcome those and other drawbacks in the prior art, there is provided a sheet music search and discovery system.
  • In some embodiments, the system analyzes compositions stored as structured sheet music data to generate metadata characterizing each composition (or part within the composition). To do so, the system stores a global vector space of semantic representations of elements extracted from a corpus of structured music data, where semantically similar elements extracted from the corpus are clustered together in the global vector space, generates semantic representations of each composition, and generates metadata characterizing each composition in part by comparing the semantic representations. The system may also generate metadata characterizing each composition or part through deterministic functions, rules and/or heuristics, extracting and labeling phrases, or machine learning. In particular, machine learning may be used to predict the difficulty of each composition or part.
  • In some embodiments, the system analyzes compositions stored as image data using machine learning-based pattern recognition. For example, the system may use algorithms pretrained to determine a range of a composition, extract and describe phrases, extract and analyze measures, or determine the difficulty of each composition (e.g., by comparing image patterns to image patterns in a corpus of known compositions). In some embodiments, the algorithms may generate metadata without recognizing individual notes. In other embodiments, musically-relevant objects (e.g., staves, measures, clefs, or notes) may be detected using object detection algorithms or by analyzing drawing commands in vector image data.
  • The metadata generated by the system allows the system to provide search and recommendation functionality unlike anything currently available. For example, the system may provide functionality for users to identify instruments and a range for each instrument and identify compositions with similar instruments and ranges. Additionally, the system may provide functionality for the user to input a search query that includes keywords or audio (input, e.g., by singing or humming). The system may also identify recommendations for the user and/or provide functionality to automatically generate a concert program by comparing the instruments and ranges of each of the compositions.
  • Using the specific mathematical rules to analyze and characterize sheet music as described herein is distinct from the subjective determinations previously performed by the buyers of sheet music (and sellers making recommendations).
  • Unlike conventional sheet music search engines, which simply select from available sheet music based on existing textual information to provide a humanly comprehensible number of search results, the disclosed system generates new data; specifically, a new kind of metadata characterizing the underlying sheet music. The metadata generated by the disclosed system enables the disclosed search engine to do things that conventional sheet music search engines cannot. Specifically, the metadata generated by the disclosed system enables the disclosed system to provide functionality for a user to search and/or filter sheet music based on musically-relevant characterizations of the underlying sheet music, such as range or difficulty.
  • Furthermore, since the disclosed system analyzes the underlying sheet music data (rather than just the textual data indexed by conventional sheet music search engines), the disclosed search engine can provide functionality for a user to search the underlying sheet music, for example by singing or humming a melodic fragment.
  • BRIEF DESCRIPTION OF THE DRAWINGS
  • A preferred embodiment of the present invention will be set forth in detail with reference to the drawings, in which:
  • FIG. 1 is a block diagram of a sheet music search and discovery system according to an exemplary embodiment of the present invention;
  • FIG. 2 is a drawing illustrating an overview of the architecture of a sheet music search and discovery system according to an exemplary embodiment of the present invention;
  • FIG. 3 is a flowchart illustrating a process for extracting explicit metadata from structured sheet music data and unstructured sheet music data and converting unstructured sheet music data into structured sheet music data according to an exemplary embodiment of the present invention;
  • FIG. 4 is a flowchart illustrating the inputs and outputs of an example structured sheet music analysis engine according to an exemplary embodiment of the present invention;
  • FIG. 5A is a flowchart illustrating a workflow of an example structured sheet music analysis engine according to an exemplary embodiment of the present invention;
  • FIG. 5B is a flowchart illustrating a semantic embedding process and a semantic similarity analysis process according to an exemplary embodiment of the present invention;
  • FIG. 5C is a diagram illustrating a context encoder (ConEc) NN architecture corresponding to a continuous bag of words (CBOW) Word2Vec model;
  • FIG. 5D is a diagram illustrating a similar model trained to identify a musical element in context according to an exemplary embodiment of the present invention;
  • FIG. 5E is a graph illustrating an example global vector space projected down to two dimensions according to an exemplary embodiment of the present invention;
  • FIG. 6 is a flowchart illustrating a preprocessing technique according to an exemplary embodiment of the present invention;
  • FIG. 7 is a flowchart illustrating a feature extraction technique according to an exemplary embodiment of the present invention;
  • FIG. 8 is a flowchart illustrating a rules analysis process according to an exemplary embodiment of the present invention;
  • FIG. 9 is a flowchart illustrating a lyrics extraction process according to an exemplary embodiment of the present invention;
  • FIG. 10 is a flowchart illustrating a data ingestion process for analyzing unstructured sheet music data and determining metadata that describes each composition stored as unstructured sheet music data according to an exemplary embodiment of the present invention;
  • FIGS. 11A and 11B are a flowchart illustrating a workflow of an optical analysis engine according to an exemplary embodiment of the present invention;
  • FIG. 12 is a flowchart illustrating a query process according to an exemplary embodiment of the present invention;
  • FIG. 13 is a flowchart illustrating a process for querying sheet music data using a user profile according to an exemplary embodiment of the present invention;
  • FIG. 14 is a flowchart illustrating a recommendation process according to an exemplary embodiment of the present invention;
  • FIG. 15 is a flowchart illustrating a concert program generation process according to an exemplary embodiment of the present invention;
  • FIG. 16 illustrates a user profile view of a graphical user interface according to an exemplary embodiment of the present invention;
  • FIG. 17 illustrates an edit range view of the graphical user interface according to an exemplary embodiment of the present invention;
  • FIG. 18 illustrates a search view of the graphical user interface according to an exemplary embodiment of the present invention;
  • FIG. 19 illustrates a search results view of the graphical user interface according to an exemplary embodiment of the present invention;
  • FIG. 20 is a block diagram of a simplified multifunctional electronic device according to an exemplary embodiment of the present invention;
  • FIG. 21A is an example passage that is rhythmically difficult but not very syncopated;
  • FIG. 21B is a musical excerpt;
  • FIG. 21C illustrates each note being represented by an index from a vocabulary;
  • FIG. 21D is an example input to phrase extraction;
  • FIGS. 21E and 21F are an example of parallel corpora;
  • FIG. 21G is a part before stitching; and
  • FIG. 21H is an uninterrupted scale run after stitching.
  • DETAILED DESCRIPTION
  • Preferred embodiments of the present invention will be set forth in detail with reference to the drawings, in which like reference numerals refer to like elements or steps throughout.
  • FIG. 1 is a block diagram of a sheet music search and discovery system 100 according to an exemplary embodiment of the present invention.
  • As shown in FIG. 1, the sheet music search and discovery system 100 includes one or more databases 110, a structured sheet music analysis engine (SMAE) 140, a sheet music conversion engine 160, an optical analysis engine 180, a search engine 190, and a graphical user interface 192. The one or more databases 110 store structured sheet music data 112, unstructured sheet music data 114, and metadata 116 identified by the system 100 based on the structured sheet music data 112 and the unstructured sheet music data 114.
  • The structured sheet music data 112 may be sheet music (i.e., compositions) that contains musical data (e.g., pitches, rhythms, clefs, articulations, etc.) in a musically semantic structure (e.g., MusicXML). MusicXML is a common encoding of structured musical data in Extensible Markup Language (XML) format. XML is a generic encoding of hierarchical data stored in a flat text file. An XML file consists of a set of nodes, also called elements. An XML document consists of a root node which contains zero or more child nodes, each of which may contain zero or more child nodes and so forth. A node which contains no child nodes is called an empty node. A node which contains one or more child nodes is called the parent of the child nodes it contains. Note that while a parent node may contain multiple child nodes, a child node cannot be contained by more than one parent, and therefore cannot have more than one parent node. More detail regarding XML may be found in the World Wide Web Consortium (W3C) documentation (http://www.w3c.org/XML), which is incorporated herein by reference. More detail regarding MusicXML may be found in the MusicXML documentation (http://usermanuals.musicxml.com/MusicXML/MusicXML.htm), which is incorporated herein by reference. Additionally or alternatively, the structured sheet music data 112 may be encoded in another format, such as Music Encoding Initiative (MEI) (see http://music-encoding.org/), MNX (see https://www.w3.org/community/music-notation/2016/05/19/introducing-mnx/), ABC (see http://abcnotation.com), MuseData (see http://www.musedata.org/about/), etc.
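  • By way of illustration only, the following minimal Python sketch walks the hierarchical structure described above for a score-partwise MusicXML file, using only the standard xml.etree.ElementTree module, and prints the pitch and duration of each note; the file name score.xml is a hypothetical placeholder, and this sketch is not a description of the system's internal parser.

      import xml.etree.ElementTree as ET

      def extract_notes(path):
          """Yield (part_id, measure_number, step, octave, duration) tuples
          from a score-partwise MusicXML file."""
          root = ET.parse(path).getroot()
          for part in root.findall("part"):
              part_id = part.get("id")
              for measure in part.findall("measure"):
                  number = measure.get("number")
                  for note in measure.findall("note"):
                      pitch = note.find("pitch")
                      if pitch is None:          # rests have no <pitch> child
                          continue
                      step = pitch.findtext("step")
                      octave = pitch.findtext("octave")
                      duration = note.findtext("duration")
                      yield part_id, number, step, octave, duration

      if __name__ == "__main__":
          for row in extract_notes("score.xml"):   # hypothetical file name
              print(row)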
  • The unstructured sheet music data 114 may be sheet music (i.e., compositions) in a graphical format (e.g., bitmap, vector etc.). The unstructured sheet music data 114 lacks the musical semantics included in the structured sheet music data 112 described above, and therefore requires additional processing to identify musical semantics as described below. Both the structured sheet music data 112 and the unstructured sheet music data 114 may be stored in computer readable formats. As described in detail below, the metadata 116 includes data that describes the (structured and unstructured) sheet music data 112 and 114.
  • FIG. 2 is a drawing illustrating an overview of the architecture 200 of the sheet music search and discovery system 100 according to an exemplary embodiment of the present invention.
  • As shown in FIG. 2, the architecture 200 may include one or more servers 210 and one or more storage devices 220 connected to a plurality of remote computer systems 240, such as one or more personal systems 250 and one or more mobile computer systems 260, via one or more networks 230. The architecture 200 may also include one or more document scanners 270 that communicate with the one or more servers 210 and output data for storage in the one or more storage devices 220 either directly (via wired or wireless communication) or via the one or more networks 230.
  • The one or more servers 210 may include an internal storage device 212 and a processor 214. The one or more servers 210 may be any suitable computing device including, for example, an application server and a web server which hosts websites accessible by the remote computer systems 240. The one or more storage devices 220 may include external storage devices and/or the internal storage device 212 of the one or more servers 210. The one or more storage devices 220 may also include any non-transitory computer-readable storage medium, such as an external hard disk array or solid-state memory. The networks 230 may include any combination of the internet, cellular networks, wide area networks (WAN), local area networks (LAN), etc. Communication via the networks 230 may be realized by wired and/or wireless connections. A remote computer system 240 may be any suitable electronic device configured to send and/or receive data via the networks 230. A remote computer system 240 may be, for example, a network-connected computing device such as a personal computer, a notebook computer, a smartphone, a personal digital assistant (PDA), a tablet, a portable weather detector, a global positioning satellite (GPS) receiver, network-connected vehicle, a wearable device, etc. A personal computer system 250 may include an internal storage device 252, a processor 254, output devices 256 and input devices 258. The one or more mobile computer systems 260 may include an internal storage device 262, a processor 264, output devices 266 and input devices 268. An internal storage device 212, 252, and/or 262 may include one or more non-transitory computer-readable storage mediums, such as hard disks or solid-state memory, for storing software instructions that, when executed by a processor 214, 254, or 264, carry out relevant portions of the features described herein. A processor 214, 254, and/or 264 may include a central processing unit (CPU), a graphics processing unit (GPU), etc. A processor 214, 254, and/or 264 may be realized as a single semiconductor chip or more than one chip. An output device 256 and/or 266 may include a display, speakers, external ports, etc. A display may be any suitable device configured to output visible light, such as a liquid crystal display (LCD), a light emitting polymer display (LPD), a light emitting diode (LED), an organic light emitting diode (OLED), etc. The input devices 258 and/or 268 may include keyboards, mice, trackballs, still or video cameras, touchpads, etc. A touchpad may be overlaid or integrated with a display to form a touch-sensitive display or touchscreen.
  • Referring back to FIG. 1, the structured sheet music data 112, the unstructured sheet music data 114, and the metadata 116 may be any organized collection of information, whether stored on a single tangible device or multiple tangible devices, and may be stored, for example, in the one or more storage devices 220. The metadata 116 describing each composition is associated with that composition such that the metadata 116 can be used to discover compositions from the structured sheet music data 112 that are relevant to users. As described above, the structured sheet music data 112 may be in computer readable format that contains musical data such as pitches, rhythms, clefs, articulations, etc. in a musically semantic structure (e.g., MusicXML). The structured sheet music analysis engine (SMAE) 140, the sheet music conversion engine 160, the optical analysis engine 180, and the search engine 190 may be realized by software instructions stored on one or more of the internal storage devices 212, 252, and/or 262 and executed by one or more of the processors 214, 254, or 264. The graphical user interface 192 may be any interface that allows a user to input information for transmittal to the sheet music search and discovery system 100 and/or outputs information received from the sheet music search and discovery system 100 to a user. The graphical user interface 192 may be realized by software instructions stored on one or more of the internal storage devices 212, 252, and/or 262 executed by one or more of the processors 214, 254, or 264.
  • The metadata 116 describes the compositions stored as structured sheet music data 112 or unstructured sheet music data 114.
  • Analyzing Structured Sheet Music Data
  • FIG. 3 is a flowchart illustrating a process 300 for extracting explicit metadata 350 from structured sheet music data 112 and unstructured sheet music data 114 and converting unstructured sheet music data 114 into structured sheet music data 112 according to an exemplary embodiment of the present invention.
  • Explicit metadata 350 is extracted from the structured sheet music data 112 by the SMAE 140 at 310. Explicit metadata 350 refers to metadata 116 directly encoded in the structured sheet music data 112 or unstructured sheet music data 114. Often, structured sheet music data 112 includes low-level semantics such as notes, rhythms, etc. and lacks higher-level semantics such as range or difficulty. (This is analogous to text documents which encode characters, words, and sentences, but do not encode higher-level semantics like subject matter or syntax.) Therefore, as described below, the SMAE 140 analyzes the structured sheet music data 112 to calculate or determine mid-level and/or higher-level semantic metadata 116 describing each composition. In the context of sheet music data, explicit metadata 350 includes title, composer, instrumentation, etc. Explicit metadata 350 is extracted from structured sheet music data 112 by reading the structured fields of the data. If any expected explicit metadata 350 is missing, the field is marked as missing.
  • Explicit metadata 350 is extracted from the unstructured sheet music data 114 by the SMAE 140 at 320. Because unstructured sheet music data 114 does not include structured fields like structured sheet music data 112, it is more likely unstructured sheet music data 114 is missing musically-relevant explicit metadata 350 while including musically irrelevant explicit metadata 350 such as colorspace, bit resolution, dots per inch, etc. Explicit metadata 350 missing from unstructured sheet music data 114 is marked as missing.
  • The unstructured sheet music data 114 is converted into structured sheet music data 112 by the sheet music conversion engine 160 at 330 and 340. Optical character recognition (OCR) is used to extract text data (e.g., title, composer, etc.) and store the extracted text data as metadata 116 at 330. Optical music recognition (OMR) is used to extract musical notations at 340. The text data and musical notations extracted from the unstructured sheet music data 114 are formatted into structured sheet music data 112. Accordingly, as used for the remainder of this description, structured sheet music data 112 refers to both structured sheet music data 112 and unstructured sheet music data 114 that has been converted into structured sheet music data 112.
  • FIG. 4 is a flowchart 400 illustrating the inputs and outputs of the SMAE 140 according to an exemplary embodiment of the present invention.
  • As shown in FIG. 4, structured sheet music 112 is input to the SMAE 140 and the SMAE 140 outputs text/keyword output 420 and extended property output 430. The text/keyword output 420 may include individual words, phrases, or sentences that describe the musical content of the document using musical terminology. The text/keyword output 420 may be fed into a keyword-based search engine, used in evaluating natural language queries, etc. The extended property output 430 may include statistics or other numerical features (e.g., range, difficulty, etc.) that provide additional information about the document in formats other than plain text. The extended property output 430 may be queried through a specialized user interface such as that described in FIGS. 16-19.
  • FIG. 5A is a flowchart illustrating a workflow 500 of the structured sheet music analysis engine (SMAE) 140 according to the present invention.
  • The structured sheet music 112 is preprocessed at 510. Preprocessing allows music from different sources such as different publishers or notation formats to be analyzed using the same procedures. (The publisher, notation format, etc. may be identified in the explicit metadata 350.) Further, because the SMAE 140 may analyze a complete score, individual parts, or both, the structured sheet music 112 must be preprocessed into a normalized format. If a complete score is provided, individual parts must be extracted. If individual parts are provided, a score must be compiled. This is necessary because some features are specific to individual parts, such as the range of individual instruments, while other features require knowledge of the entire score, such as when individual instruments have solos. Preprocessing 510 will be discussed further with reference to FIG. 6. The preprocessed structured sheet music 112 may be used by the SMAE 140 to perform feature extraction 520, semantic embedding 532A, phrase extraction 542, and lyrics extraction 550 as described below.
  • Feature extraction 520 is a process for analyzing structured music data 112 to calculate features 522 that can be directly computed by a deterministic mathematical function or algorithm. (Deterministic means the same inputs always produce the same outputs.) Features 522 include relevant musical data, statistical features (e.g., the average number of notes in a measure), features derived from lookup tables, range, pitch histograms, etc. Some or all of the features 522 may be used to calculate the higher level semantic metadata at 526 and 530, as described below. Feature extraction 520 is discussed further in reference to FIG. 7.
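  • As a non-limiting sketch of what such a deterministic feature computation could look like, the Python example below derives a range, an average number of notes per measure, and a pitch histogram from a part represented as a list of measures of MIDI note numbers; the data layout and values are illustrative assumptions rather than the system's actual representation.

      from collections import Counter

      def extract_features(measures):
          """measures: list of measures, each a list of MIDI note numbers.
          Returns deterministic features: range, average notes per measure,
          and a pitch histogram (same inputs always produce the same outputs)."""
          pitches = [p for m in measures for p in m]
          return {
              "lowest_pitch": min(pitches),
              "highest_pitch": max(pitches),
              "range_semitones": max(pitches) - min(pitches),
              "avg_notes_per_measure": len(pitches) / len(measures),
              "pitch_histogram": Counter(pitches),
          }

      # Example: three measures of a hypothetical flute part
      part = [[72, 74, 76, 77], [79, 79, 81], [84]]
      print(extract_features(part))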
  • The SMAE 140 may perform rules analysis 524. The rules analysis 524 is the use of rules and/or heuristics to determine higher-level semantic metadata, referred to herein as rule-derived metadata 526. As used herein, a “heuristic” is a rule without a theoretical or pedagogical foundation (as opposed to a “rule,” which, as used herein, is explicitly defined based on information from musicology or music education literature.) Examples of heuristics that may be used by the SMAE 140 to generate rule-derived metadata 526 are shown in Table 1:
  • TABLE 1
    Heuristic and what it implies:
    Number of characters/bytes: A larger file size implies a more difficult composition.
    Number of notes: More notes implies a more difficult composition.
    Average number of notes per measure: A higher average number of notes per measure implies a more difficult composition, while a lower average number of notes per measure implies a less difficult composition.
    Average number of child nodes of the <measure> element: If a <measure> element contains more children, it indicates a more difficult measure because the child nodes include more notes, dynamics, etc.
    Average number of child nodes of the <note> element: If a <note> element contains more children, it indicates a more difficult note because the child nodes include more articulations, dynamics, etc. Averaging over all of the <note> elements implies a more difficult composition.
    Pitch histogram: A pitch histogram with a wide standard deviation implies a more difficult composition (more notes and greater range), while a narrow standard deviation implies an easier composition (fewer notes and narrower range). Similarly, skewed distributions may imply more difficult compositions because the central tendency may emphasize or focus on a particular high or low range of the instrument.
  • As used herein, a “rule” is explicitly defined based on information from musicology or music education literature (as opposed to a “heuristic,” which is rule without a theoretical or pedagogical foundation). Examples of rules that may be used by the SMAE 140 to generate rule-derived metadata 526 are shown in Table 2:
  • TABLE 2
    Rule and description:
    Range: Iterate over all the notes to find the minimum and maximum notes of a given part. Musicians refer to this as either the range or ambitus.
    Lyrics: Lyrics are encoded on a per-note basis in MusicXML, and must be extracted from each note into a separate entity of lyrics. Additional processing must be done for multiple/alternate verses, as well as melismas (single syllables that span more than one note).
    Key Signatures: The key signature of a composition may indicate the difficulty of a composition. Similarly, the number of key signature changes and the rate of key signature changes also imply difficulty. A composition that changes keys frequently is more difficult than a composition in a single key or with only one key change.
    Meter: The meter of a composition is an extremely important part of determining the difficulty of a composition. "Simple meters" and "compound meters" are relatively easy, while "irregular meters" can be extremely difficult. Also, the lack of meter ("senza misura") may indicate a cadenza, which would greatly increase the difficulty of the composition. ("Simple meters" and "compound meters" are precise terms in music theory with specific meanings.)
    Rhythmic Vocabulary: Rhythmic vocabulary is the number of distinct note durations and the number of times each occurs in a composition. An example of rhythmic vocabulary is "Whole, Half, Quarter", indicating that the composition uses notes and rests with durations of whole notes, half notes, and quarter notes. A larger rhythmic vocabulary implies a more difficult composition. Additionally, a specific vocabulary item (such as an augmentation dot appearing inside of a tuplet) implies a greater level of difficulty. Rhythmic vocabulary may also include counts of how often the rhythmic item appears (e.g., 10 whole notes, 30 quarter notes, 24 eighth notes), referred to below as "duration type frequency."
    Rhythmic Onset Density: Rhythmic onset density is the number of notes occurring in a measure that are neither a) rests nor b) the continuation or end of a tie. Such notes are referred to as onsets. The more onsets in a measure, the more difficult it is to play. For example, a measure consisting of a single whole note contains one onset, and is easy to perform, while a measure of all 16th notes in 4/4 time (16 onsets) is much more difficult. Averaging the rhythmic density of onsets over all the measures implies overall difficulty of a composition.
    Relations among Meter, Rhythm, and Tempo: A 16th note in a slow 6/8 meter is relatively easy to perform, while a 16th note in a fast 2/2 meter is much more difficult. Rules can be written to capture the relation among meter, rhythm, and tempo to further imply the difficulty of a composition.
    Idiomatic Instrumental Techniques: The same passage played on different instruments can have varying degrees of difficulty. For example, 16th note flourishes are relatively easy to perform on flute and piccolo, but extremely difficult on tuba. Additionally, there may be difficulties peculiar to individual instruments. For example, it is difficult for clarinets to play around the "break", and it is also difficult for trombones to play between B natural in 7th position and Bb in 1st position without the use of an F attachment. These idiomatic issues imply difficulty on a per-instrument basis.
    Instrument Changes: Some compositions require a single player to change instruments at some point during the composition. For example, an oboist may also be required to perform on an English horn within the same part. Any instrument change implies a more difficult part because most performers generally only perform on one instrument. Changing instruments is generally limited only to advanced and professional musicians.
    Aggregate Difficulty Measure (Grade/Level): The above may be combined into an aggregate metric, known as the grade or difficulty level.
  • The rules analysis process 524 includes a number of subroutines, each specializing in a particular analysis method, which are discussed further with reference to FIG. 8.
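  • As one illustration of how a rule such as the rhythmic onset density rule of Table 2 could be implemented as a subroutine, the following sketch counts onsets per measure under the simplifying assumption that each note is a dictionary with hypothetical "is_rest" and "tie" fields; this is a sketch of the technique, not the system's actual rules engine.

      def onset_density(measures):
          """Average number of onsets per measure, where an onset is a note
          that is neither a rest nor the continuation/end of a tie."""
          densities = []
          for measure in measures:
              onsets = sum(
                  1
                  for note in measure
                  if not note["is_rest"] and note.get("tie") not in ("stop", "continue")
              )
              densities.append(onsets)
          return sum(densities) / len(densities)

      # A whole-note measure (1 onset) vs. a measure of sixteenth notes (16 onsets)
      easy = [[{"is_rest": False, "tie": None}]]
      hard = [[{"is_rest": False, "tie": None} for _ in range(16)]]
      print(onset_density(easy), onset_density(hard))   # 1.0 16.0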
  • Music-based rules and heuristics, such as those described above, can provide a multitude of musically-relevant information from structured sheet music data 112. However, some musical idioms are highly subjective and no clear rule or heuristic exists. Therefore, machine learning analysis 528 may be used to analyze the structured sheet music data 112 (as well as the rule-derived metadata 526 and/or the features 522) to generate higher-level semantic metadata 116, referred to herein as machine-learning derived metadata 530.
  • Machine learning is a subfield of computer science that studies a class of algorithms that can detect patterns in data and generalize those patterns in order to adapt to data that it has not previously seen.
  • In some embodiments, the machine learning analysis 528 may include ensemble learning. Ensemble learning uses multiple machine learning algorithms to obtain better predictive performance than could be achieved from any one constituent learning algorithm. The machine learning algorithms may include a k-nearest neighbors algorithm, support vector machines, neural networks, etc. Multiple machine learning algorithms of the same kind may be used where each algorithm varies with respect to their hyperparameters. For example, multiple neural networks may be used where the number of nodes in the hidden layer of each neural network varies.
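  • A minimal sketch of such an ensemble, assuming scikit-learn is available and that each composition has already been reduced to a fixed-length feature vector with a difficulty label, is shown below; the feature values and labels are synthetic placeholders, and the particular estimators and hyperparameters are only examples of the kinds of constituent learners described above.

      import numpy as np
      from sklearn.ensemble import VotingClassifier
      from sklearn.neighbors import KNeighborsClassifier
      from sklearn.svm import SVC
      from sklearn.neural_network import MLPClassifier

      # Synthetic placeholder data: 100 compositions, 10 features, grades 1-5
      rng = np.random.default_rng(0)
      X = rng.normal(size=(100, 10))
      y = rng.integers(1, 6, size=100)

      # Several neural networks that differ only in hidden-layer size,
      # combined with k-NN and an SVM in a soft-voting ensemble.
      ensemble = VotingClassifier(
          estimators=[
              ("knn", KNeighborsClassifier(n_neighbors=5)),
              ("svm", SVC(probability=True)),
              ("mlp_small", MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000)),
              ("mlp_large", MLPClassifier(hidden_layer_sizes=(64,), max_iter=2000)),
          ],
          voting="soft",
      )
      ensemble.fit(X, y)
      print(ensemble.predict(X[:3]))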
  • Examples of subjective musical characteristics that may be identified in the structured sheet music data 112 by performing the machine learning analysis 528 are shown in Table 3:
  • TABLE 3
    Subjective musical characteristic and description:
    Syncopation: Syncopation is loosely defined as an uneven or irregular rhythm. Syncopated rhythms are more difficult to perform, and imply a more difficult composition. No agreed upon rule exists as to what constitutes a syncopated rhythm, so a machine learning algorithm appears to be an appropriate solution. Both supervised and unsupervised approaches may be useful in determining an algorithm for identifying syncopation.
    Form: Form is a higher level of syntactic musical organization. Linguistic analogues include rhyming schemes, poetic forms, or even larger forms like literary arcs. Musical forms include low level forms like periods, or higher level forms like rondo and sonata-allegro. Phrasing is a pre-requisite for form, and given that phrasing does not have a clear set of rules, neither does form. Possible approaches to identifying form include recursive neural networks and convolutional neural networks similar to those used in sentence parsing in natural language processing. Co-occurrence matrices may also be used.
    Style/Time Period: Style is an extremely subjective musical characteristic. Some research has already been done on style classification using statistics, but machine learning techniques are likely to provide more accurate and relevant results.
    Harmonic Progression: Although the notes and key are explicitly defined in structured data, harmonic analysis information (while specified in the MusicXML specification) is optional and often omitted from the structured data. Harmonic information is included in Piano/Vocal/Guitar (PVG) arrangements and Lead Sheets, but rarely included in other formats. Harmonic information may be extracted by performing a reduction of all the parts (transposed to sounding pitch), and then performing a harmonic analysis. Such a harmonic analysis may use a hidden Markov model (HMM), or one of the syntactic approaches described above. A reduction is necessary because while harmonic analysis is relatively easy on homophonic music (a single melody), it is often difficult to determine the harmonic progression in polyphonic music (two or more melodies).
    Tonal Centricity: Tonal centricity is related to harmonic progression described above. While key signatures appear in the structured data, a composition may modulate temporarily, either to create musical interest, or to develop musical material such as during the development section of a sonata-allegro form. Compositions that temporarily and/or frequently modulate without changing key signatures, or lack a key signature altogether, are more difficult to perform.
    Exposed Parts: A part is considered exposed if it is playing solo, soli, or a distinct musical figure. It remains unclear whether a rules-based or machine learning based approach will work better to identify exposed parts. However, current research into a rules-based system indicates that such a system may be insufficient and that determining exposed parts may be more subjective than originally thought.
  • In particular, the machine learning analysis 528 can be used to predict the difficulty of each composition (or each part within each composition). The structured sheet music data 112 may include a corpus of compositions or parts that are labeled (e.g., manually labeled) as having a certain difficulty level (e.g., on a 1-5 scale, a 1-10 scale, etc.). A supervised learning process can then be used to learn a function for determining a probability that another composition or part has those difficulty levels (e.g., a 0.5 percent probability of grade 1, an 85.5 percent probability of grade 2, etc.). The supervised learning process may compare the compositions in structured sheet music data 112 (and the metadata 116 describing those compositions) to the compositions in the corpus (and metadata 116 describing those compositions). Additionally or alternatively, the supervised learning process may compare semantic representations of the compositions in structured sheet music data 112 (e.g., the part embeddings 534A and score embeddings 536A discussed below) to semantic representations of the compositions in the corpus.
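  • One possible sketch of such a supervised difficulty predictor, assuming scikit-learn and a labeled corpus reduced to fixed-length embeddings, is shown below; the per-grade probabilities printed at the end correspond to the kind of output described above, and all data here is synthetic.

      import numpy as np
      from sklearn.linear_model import LogisticRegression

      rng = np.random.default_rng(1)
      corpus_embeddings = rng.normal(size=(200, 300))   # labeled corpus (synthetic)
      corpus_grades = rng.integers(1, 6, size=200)      # difficulty levels 1-5

      model = LogisticRegression(max_iter=1000)
      model.fit(corpus_embeddings, corpus_grades)

      new_score = rng.normal(size=(1, 300))             # an unlabeled composition
      probabilities = model.predict_proba(new_score)[0]
      for grade, p in zip(model.classes_, probabilities):
          print(f"grade {grade}: {p:.1%}")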
  • Examples of rule-derived metadata 526 and machine learning-derived metadata 530 that may be generated by performing rules analysis 524 and/or the machine learning analysis 528 are shown in Table 4:
  • TABLE 4
    Output and description:
    Part Doubling: Two or more instruments playing similar musical material, as well as the measures and intervals at which the doubling occurs.
    Exposed Solos: The instruments and measure numbers of any exposed solos.
    Independence of Parts: The instruments and measure numbers of any independent parts. This does not include exposed solos, but may include lead lines (non-exposed solos), solis, descants, etc.
    Degree of Syncopation by Part: A grade of the syncopation level of each part on a scale (e.g., from 1-5) with a higher number indicating greater syncopation.
    Rhythmic Difficulty by Part: A grade of the overall rhythmic difficulty of each part on a scale (e.g., from 1-5) with a higher number indicating greater rhythmic difficulty. While syncopation and rhythmic difficulty both refer to the rhythmic character of the composition, they are somewhat independent. (FIG. 21A is an example passage that is rhythmically difficult but not very syncopated.)
    Difficulty Grade by Part: The difficulty is computed last in the Rules Engine because it is the culmination of several features and the output of the other submodules.
    Cumulative Grade of Entire Composition: The cumulative grade of the composition computed by aggregating the grades across all parts. This includes the average grade across all parts, as well as the mode, median, minimum, and maximum part grades.
    Harmonic Language: A list of the harmonic languages that appear in the composition. This is distinct from the chord progression, which only represents a sequence of chords. A harmonic language describes the overall harmonic vocabulary. Examples of harmonic language include pentatonic, diatonic, chromatic, modal, pandiatonic, etc.
    Parsons Code: The Parsons code of each voice in each part. This represents the overall melodic contour of the voice and can be used for query by humming. Parsons code is only computed for monophonic voices that are not percussive.
    Melodic Contour: The melodic contour is a descriptive term for the overall contour of the voice in each part. Melodic contour is one of three values: conjunct (smooth), disjunct (jagged), or static. The melodic contour is computed from the intervallic content of the voice as follows: (1) compute the histogram for all melodic intervals; (2) if the unison is the most frequent interval and is greater than the sum of all other interval frequencies, the melodic contour is static; (3) if the unison, minor 2nd, or major 2nd is the most frequent interval and the sum of all three of those intervals is greater than the sum of the other interval frequencies, the melodic contour is conjunct; (4) otherwise, the melodic contour is disjunct.
    Form: The formal structure of the composition. This is computed from the repeating note and/or measure structure. A set of rules and/or heuristics is then used to compute the form from common known forms, such as song, rondo, sonata-allegro, 12-bar blues, etc.
    Style: The musical style of the composition, such as march, bebop, etc. The style is computed from several of the above factors.
  • Semantic embedding 532A creates a numerical representation (embedding) of each element (e.g., note, chord, rest, measure, etc.), part, and score within a vector space so that it may be analyzed within a larger context. For example, an embedding algorithm 532A analyzes a large corpus of structured music data to produce a global vector space containing semantic representations of each element in the corpus. As described in detail below, the global vector space is a map of each element where similar elements are represented closer than dissimilar ones. Once the global vector space is produced, each element (e.g., note, chord, rest, measure, etc.) in the structured music data 112 is given the semantic representation (embedding) from the global vector space. Element embeddings are averaged to create part embeddings 534A and part embeddings 534A are averaged to create a score embedding 536A. The part embeddings 534A and the score embeddings 536A are compared for semantic similarity at 538A as described in detail below.
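  • The averaging described above may be sketched in a few lines of numpy, assuming each element has already been looked up in the global vector space; the vectors below are placeholders (the first one reuses the example mapping for "A4 quarter" given later in this description), and this is not the system's actual data layout.

      import numpy as np

      # Element vectors looked up in the global vector space (placeholders)
      measure_1 = np.array([[0.40, 0.99, 0.05],    # e.g., "A4 quarter"
                            [0.38, 0.90, 0.10]])
      measure_2 = np.array([[0.10, 0.20, 0.70],
                            [0.15, 0.25, 0.60]])

      # Note vectors are averaged into measure vectors, measure vectors into a
      # part embedding, and part embeddings into a score embedding.
      measure_vectors = np.array([m.mean(axis=0) for m in (measure_1, measure_2)])
      part_embedding = measure_vectors.mean(axis=0)
      score_embedding = np.mean([part_embedding], axis=0)   # only one part here
      print(part_embedding, score_embedding)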
  • FIG. 5B is a flowchart illustrating the semantic embedding 532A and semantic similarity analysis 538A in detail.
  • As shown in FIG. 5B, a large corpus 550 of structured music documents (e.g., MusicXML documents) is used to create a global vector space 560. For example, each element (e.g., note, chord, rest, measure, etc.) is extracted at 552 and added to a vocabulary 558 at 554. If the extracted element is already included in the vocabulary (552: Yes), the count of that element is incremented at 556. Different embodiments may make different determinations regarding equivalent elements. For example, one embodiment may be programmed such that an “A4 quarter” is equivalent to an “A4 quarter staccato”. Other embodiments may be programmed such that those elements are distinct. After all of the elements in the corpus 550 have been added to the vocabulary 558, the vocabulary 558 is sorted from the most common element to the least common element at 560. (In some embodiments, the number of elements included in the vocabulary 558 may be trimmed to include only the N most common elements.) Each element in the vocabulary 558 is then assigned a unique integer at 562. (The integer assignments can be random, for example the position of the element within the sorted vocabulary 558.)
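  • A compact Python sketch of this vocabulary-building step (extract, count, sort, trim to the N most common elements, and assign integers) might look as follows; the element strings and the choice of N are illustrative assumptions.

      from collections import Counter

      def build_vocabulary(corpus_elements, n_most_common=10000):
          """corpus_elements: iterable of element strings such as 'A4-quarter'.
          Returns a dict mapping each kept element to a unique integer."""
          counts = Counter(corpus_elements)                 # extract and count
          kept = counts.most_common(n_most_common)          # sort, trim to N
          vocab = {element: index for index, (element, _) in enumerate(kept, start=1)}
          vocab["UNK"] = 0                                   # out-of-vocabulary marker
          return vocab

      corpus = ["A4-quarter", "G4-quarter", "A4-quarter", "C5-half", "A4-quarter"]
      print(build_vocabulary(corpus, n_most_common=3))
      # e.g. {'A4-quarter': 1, 'G4-quarter': 2, 'C5-half': 3, 'UNK': 0}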
  • The elements extracted from the corpus 550 are then replaced with the integers at 564. (If an extracted element is no longer in the trimmed vocabulary 558, the extracted element may be replaced with a special integer value specifying “unknown”.) Each document in the corpus 550 is then represented by a set of integers.
  • A neural network is then used to map targets and context at 566. In one embodiment, an element (target) may be fed to a skip-gram model, which is used to predict the elements around it (context). In other words, the skip-gram model may be used to predict the context of a given element. In another embodiment, the elements around a target element may be fed to a continuous bag of words (CBOW) model, which is used to predict the target element. In other words, CBOW model may be used to predict an element in a given context. In yet another embodiment, a next word algorithm is given an element (context) and trained to predict the next element (target). Accordingly, the neural network (e.g., skip-gram model, CBOW model, etc.) is trained to map the target to the context or vice versa. The input to the neural network is a combination of the integer values, defined in the vocabulary, for each element. (The vectors may be added or stacked, depending on representation.) The output is also the integers defined in the vocabulary. Consider an example from natural language processing: the phrase “the black cat slept on” is defined by the vocabulary shown in Table 5:
  • TABLE 5
    Word Index
    Black 789
    Cat 342
    On 48
    Slept 1208
    The 2
  • If “cat” is the target, then [2, 789, 1208, 48] (The black slept on) is input to a CBOW model, which outputs [342] (Cat). Alternatively, using a skip-gram model, the input and output are reversed such that [342] (Cat) is input and [2, 789, 1208, 48] (The black slept on) is output by the skip-gram model.
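  • The target/context mapping can be made concrete with a short sketch that builds CBOW-style (context, target) pairs from the integer sequence of the natural language example above; the window size of two is an assumption for illustration, and a skip-gram model would simply reverse each pair.

      def cbow_pairs(sequence, window=2):
          """Return (context, target) pairs: the surrounding indices predict the target."""
          pairs = []
          for i, target in enumerate(sequence):
              context = sequence[max(0, i - window):i] + sequence[i + 1:i + 1 + window]
              pairs.append((context, target))
          return pairs

      # "the black cat slept on" encoded with the vocabulary in Table 5
      sentence = [2, 789, 342, 1208, 48]
      for context, target in cbow_pairs(sentence):
          print(context, "->", target)
      # The pair ([2, 789, 1208, 48], 342) is the CBOW example given above.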
  • Dimensionality of the vocabulary 558 is reduced at 568. Any dimensionality may be used. However, the target dimensionality is typically much lower than the number of items in the vocabulary 558. For example, for a vocabulary of 10,000 elements, a target dimensionality of 300 may be used.
  • After the neural network is trained, a weight matrix is extracted where each row (or column) of the weight matrix corresponds to each element in the vocabulary 558. These are called the embeddings. Collectively, these vocabulary embeddings form the global vector space 560 that represents the data in the corpus 550. Using the natural language processing example above, FIG. 5C is a diagram illustrating a context encoder (ConEc) NN architecture corresponding to a CBOW Word2Vec model. FIG. 5D is a diagram illustrating a similar model trained to predict a target musical element from the context. In the global vector space 560, each element is represented by a vector. Because similar elements appear in similar contexts in the corpus 550, the neural network clusters similar elements. In other words, the semantic similarity of the elements is proportional to the distance between those elements in the global vector space 560. Accordingly, semantically meaningful contextual information is encoded in each vector. FIG. 5E is a graph showing an example global vector space 560 projected down to two dimensions. For example, “A4-half” and “A4-quarter” are very close to each other (because both represent the pitch A4 with different durations).
  • Referring back to FIG. 5B, the global vector space 560 described above is used to analyze the structured sheet music data 112 to generate part embeddings 534A and score embeddings 536A. Each element (e.g., note, chord, rest, measure, etc.) is extracted from the structured sheet music data 112 at 562. For each extracted element, the semantic representation (vector) of each element is identified at 564 by looking it up in the global vector space 560. For example, if an “A4 quarter” has been mapped in the global vector space 560 to [0.4, 0.99, 0.05], then any “A4 quarter” in any document in the structured sheet music data 112 is replaced with that vector. Element vectors are averaged at 566. For example, note vectors within each measure are averaged to create a measure vector. Measure vectors within each part are averaged to create a part embedding 534A for that part. Part embeddings 534A within each score are averaged to create a score embedding 536A for that score.
  • Similar items (e.g., elements, measures, parts, scores, etc.) are identified by comparing the vectors representing each item at 538A. Vectors may be compared using a distance function (metric), such as the L2-norm (Euclidean distance) or cosine similarity. Vectors that are “close” (e.g., separated by a small distance in the global vector space 560) are labeled as semantically related, and vectors that are far apart (e.g., separated by a large distance in the global vector space 560) are labeled as semantically unrelated. Semantically related items may be grouped together using another machine learning algorithm, such as k-means clustering, support vector machines (SVMs), or another neural network.
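  • A brief sketch of this comparison step, assuming numpy and scikit-learn and using placeholder part embeddings, is shown below; both the Euclidean distance and cosine similarity variants are included, along with a k-means grouping of related items.

      import numpy as np
      from sklearn.cluster import KMeans

      def cosine_similarity(a, b):
          return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

      def euclidean_distance(a, b):
          return float(np.linalg.norm(a - b))

      # Placeholder part embeddings
      parts = np.array([[0.40, 0.99, 0.05],
                        [0.38, 0.90, 0.10],
                        [0.90, 0.05, 0.80]])

      print(euclidean_distance(parts[0], parts[1]))   # small distance: semantically related
      print(cosine_similarity(parts[0], parts[2]))    # lower similarity: less related

      # Group semantically related items, e.g. with k-means clustering
      labels = KMeans(n_clusters=2, n_init=10).fit_predict(parts)
      print(labels)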
  • For example, consider the musical excerpt shown in FIG. 21B.
  • Each note is added to the vocabulary 558 shown in Table 6, where C4 is middle C:
  • TABLE 6
    Index Vocabulary Item Frequency Count
    1 C4-quarter 2
    2 D4-quarter 2
    3 E4-quarter 2
    4 F4-quarter 3
    5 G4-half 2
    6 A4-quarter 6
    7 G4-quarter 6
    8 F4-half 1
    9 E4-half 1
    10 D4-half 1
    11 A4-half 2
    12 B4-half 2
    13 B4-quarter 3
    14 C5-whole 2
    15 D5-half 1
    16 C5-half 2
    17 A4-eighth 5
    18 B4-eighth 1
    19 G4-eighth 4
    20 B4-dotted-quarter 1
    21 C5-eighth 1
    22 B4-dotted-half 1
    23 F4-eighth 1
    24 G4-dotted-quarter 1
  • (The vocabulary 558 shown in Table 6 may be sorted by frequency and only the N most frequently occurring entries may be kept. Entries that are removed would be replaced with a single symbol “UNK” (unknown). In this example, that step is omitted.) Each note is represented by the index from the vocabulary 558 as shown in FIG. 21C.
  • In this example, the indexes above are converted to one-hot vectors, which are fed into the neural network. (One-hot vectors are used in this example because every component of the vector is only either a one or a zero, the target probability can be interpreted as 0 percent or 100 percent, cross-entropy loss can be used to train the network, and all of the one-hot vectors are mutually orthogonal.) To convert an index value x to a one-hot vector, a vector of all zeros is created and then the number at position x is changed to 1. For example, the one-hot vector for index value 1 is [1, 0, 0, 0, . . . ], the one-hot vector for index value 2 is [0, 1, 0, 0, . . . ], the one-hot vector for index value 3 is [0, 0, 1, 0, . . . ], etc.
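  • The one-hot conversion described above can be sketched in a few lines of numpy for the 24-element vocabulary of Table 6.

      import numpy as np

      def one_hot(index, vocab_size=24):
          """Return a vector of zeros with a 1 at position `index` (1-based, as in Table 6)."""
          vector = np.zeros(vocab_size)
          vector[index - 1] = 1.0
          return vector

      print(one_hot(1))   # 1 in the first position
      print(one_hot(2))   # 1 in the second position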
  • To create the training data in this example, a next word algorithm is given the current element and used to predict the next element. Here are some example pairs (where the second item comes after the first):
  • [‘C4-quarter’, ‘D4-quarter’]
    [‘C4-quarter’, ‘E4-quarter’]
    [‘D4-quarter’, ‘C4-quarter’]
  • To create the training data, the first element in the pair is added to set X and the expected output of the pair is added to the set y.
  • Because the purpose of this process is to reduce dimensionality and find compact representations, an embedding size less than the size of the vocabulary (24 elements) is selected. In this instance, an embedding size of 5 is selected. A neural network is created:

  • $y = \mathrm{softmax}\big(g_2(W_2\, g_1(W_1 x + b_1) + b_2)\big)$
  • where the W terms are weight matrices, the b terms are biases that provide an additive factor to the model, and the g terms are non-linear functions, such as tanh, sigmoid, or ReLU, known as activation functions, that model non-linear relationships between the inputs and outputs.
  • The softmax function is a standard mathematical function that normalizes a vector of values into a probability distribution.
  • $P(y = j \mid x) = \dfrac{e^{x^{T} w_j}}{\sum_{k=1}^{K} e^{x^{T} w_k}}$
  • The softmax function computes the probability that y belongs to class j given input vector x by computing the exponent of x*wj (the input to the activation function) over the sum of all the exponents x*wk for all K distinct classes. By using the softmax function, we guarantee the range of the output values to be between 0 and 1 inclusive. The output from the softmax is compared to the one-hot of the expected output. The cross-entropy loss function is used to determine the difference between the output and the expected value. This difference (the loss) is then backpropagated through the network to adjust the weight matrices and biases. The cross-entropy loss function is:

  • $-\big[\,y \log(p) + (1 - y)\log(1 - p)\,\big]$
  • where p is the actual output and y is the expected output.
  • The model is updated using gradient descent, an optimization algorithm that numerically tries to find the minimum of a function. The function in this case is the neural network itself. The neural network is then trained for several iterations. The inner matrix W1 now holds the embedding vectors. This is called the embedding matrix. The column is the index of the vocabulary item, and the row is the 5-dimensional embedding vector (or the column is the vector and the row is the index). An example 5-dimensional embedding vector, derived in the process above, is shown below:
  • [0.53852046 −0.15422869 2.3443408 −0.02002307 −0.11357132]
  • Because similar items appear in similar contexts in the data, the process described above causes similar items to cluster together in the 5-dimensional space. Accordingly, the embedding vectors encode contextual information and the vector shown above represents a semantically meaningful encoding. An example plot projected down to 2 dimensions is shown in FIG. 5E. The semantic similarity between two vectors can then be determined by measuring their distance from each other in the domain space. Two example vectors and the Euclidean distance (L2 norm) computed between them are shown below:
    • [−6.81328893e-01 1.76126170e+00 1.10498583e+00 1.66290998e-03 −3.98039997e-01]
    • [2.12125826 0.02730192 0.66247171 0.16639504 −0.38257411]
    • 3.32931
  • These two embeddings are “3.32” units apart from each other, which represents the relative proximity of those vectors. A smaller distance implies a stronger semantic relationship among the vectors.
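  • The distance reported above can be reproduced with a few lines of numpy; the two vectors are copied from the example.

      import numpy as np

      a = np.array([-6.81328893e-01, 1.76126170e+00, 1.10498583e+00,
                    1.66290998e-03, -3.98039997e-01])
      b = np.array([2.12125826, 0.02730192, 0.66247171, 0.16639504, -0.38257411])

      # Euclidean distance (L2 norm) between the two embedding vectors
      print(np.linalg.norm(a - b))   # approximately 3.32931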
  • The semantic similarity metadata 540A from semantic similarity analysis 538A may include the outputs described in Table 7:
  • TABLE 7
    Output and description:
    Similar Measures Within Part: For each part, identify similar measures within the part. Similar measures may then be further analyzed to determine features such as repeated measures, ostinatos, and form.
    Similar Measures Across Parts: For each measure, identify similar measures within any other part. Measures that are similar but occur in different parts suggest that either the parts are coupled/doubled, or that the parts may be interchanged (e.g., substitute a trombone for a bassoon).
    Similar Parts: For each part, identify any other part that is similar. Similar parts may then be used to identify coupled/doubled parts or parts that are interchangeable.
    Similar Documents: Compare this document to all other documents in the database and assign a similarity score to them. Pre-computing and then indexing these similarity scores improves performance of searching for similar documents.
  • As described above, embedding vectors for each part allow individual part embeddings 534A to be compared within the same structured sheet music data 112 or against other structured sheet music data 112. For example, where a music director finds a clarinetist performs a certain clarinet part extremely well, the music director may search for similar clarinet parts by having a computer search for other pieces of music that contain semantically similar clarinet parts (i.e., the distance between the embeddings is minimized). The computer will then be able to rank all clarinet parts in order of closeness to the original part. Similarly, a separate score embedding 536A allows compositions to be holistically compared. For example, where a choir director finds the choir enjoys a particular composition, the search engine 190 provides functionality for a choir director to search for semantically similar compositions. The sheet music search and discovery system 100 compares the score embeddings 536A as described above and the search engine 190 returns search results ranked by semantic similarity.
  • Additionally, the part embeddings 534A and the score embeddings 536A may be used in the machine learning analysis 528 described above, for example to predict the difficulty of each part and score.
  • Referring back to FIG. 5A, phrase extraction 542 is a process for identifying smaller musical passages known as phrases from a larger musical selection. A phrase 544 may be identified using rules derived from music theory, heuristics, analysis of individual measures obtained via rules analysis 524 and/or machine learning analysis 528, or a combination of these approaches. Once phrases 544 are extracted, phrase analysis 546 may be used to generate phrase descriptions 548 that describe each phrase using keywords or sentences, which in turn may be stored in a database to augment the data available for querying. An example input to phrase extraction 542 is shown in FIG. 21D and example phrase descriptions 548 are described in Table 8:
  • TABLE 8
    Music: See FIG. 21D
    Output: G major, walking bass, moderate tempo, blue notes
  • The phrase description 548 may be generated using sequence labeling. Sequence labeling is similar to the machine learning task of image captioning. In some embodiments, the sequence labeling may be performed by two neural networks trained together to understand structured music documents. This model is known as neural machine translation (NMT). The first neural network is called the encoder and transforms each input phrase into a numerical representation. The second neural network is called the decoder and transforms the numerical representation into output text, such as keywords describing the phrase. The two neural networks of the NMT model are trained using parallel corpora. The input to the training algorithm includes a set of passages and a separate set of descriptive texts such that each passage is described by one or more corresponding descriptive texts. An example of parallel corpora for use in training the two neural networks is shown in FIGS. 21E and 21F and described in Table 9:
  • TABLE 9
    Music Description
    See FIG. 21E C major scale, ascending, fast, syncopated
    See FIG. 21F Very fast, chromatic, tuplets, trill, very high
  • Lyrics extraction 550 is a process for extracting lyrics 552 from preprocessed structured sheet music data 112 in order to generate semantic similarity metadata 540B pertaining to the lyrics 552. Certain elements of lyrical metadata, such as the number of verses, may be obtained directly from the preprocessed structured sheet music data while other lyrical metadata must be inferred. Lyrics extraction 550 is described further in FIG. 9. When the lyrics 552 appear in structured sheet music data 112, those lyrics 552 are fragmented at syllable boundaries. The lyrics extraction process 550 combines those syllables to form words. For example, “to-night” is combined to form “tonight” and “el-e-va-tion” is combined to form “elevation.” The lyrics extraction process 550 also parses out verses. A single note may be associated with text from the first verse and the second verse. The lyrics extraction process 550 puts this all together into a single document so that all of verse 1 is together, all of verse 2 is together, etc.
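  • A minimal sketch of the syllable-joining and verse-grouping steps is shown below. It assumes the lyric fragments have already been read out of the structured file as (verse number, syllable text, syllabic type) tuples, with syllabic types following MusicXML-style values (single, begin, middle, end); the sample data is invented.

```python
from collections import defaultdict

def assemble_lyrics(syllables):
    """Join syllable fragments into words and group the resulting words by verse."""
    verses = defaultdict(list)
    partial = defaultdict(str)  # word currently being built up, per verse
    for verse, text, syllabic in syllables:
        if syllabic == "single":
            verses[verse].append(text)
        elif syllabic == "begin":
            partial[verse] = text
        elif syllabic == "middle":
            partial[verse] += text
        elif syllabic == "end":
            verses[verse].append(partial[verse] + text)
            partial[verse] = ""
    return {verse: " ".join(words) for verse, words in sorted(verses.items())}

# Invented sample: "to-night" split across notes in verse 1, "el-e-va-tion" in verse 2.
sample = [
    (1, "to", "begin"), (1, "night", "end"),
    (2, "el", "begin"), (2, "e", "middle"), (2, "va", "middle"), (2, "tion", "end"),
]
print(assemble_lyrics(sample))  # {1: 'tonight', 2: 'elevation'}
```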
  • The lyrics 552 are input to semantic embedding 532B and semantic similarity analysis 538B. The semantic embedding 532B and semantic similarity analysis 538B are processes similar to the semantic embedding 532A and semantic similarity analysis 538A described above, but performed on text data (the lyrics 552) rather than on the musical data from which the part embeddings 534A and score embeddings 536A were derived. Recall that semantic embedding 532A and 532B create a numerical representation (embedding) of data within a vector space by analyzing it within a larger context. The semantic embedding 532B similarly processes each word, phrase, verse, etc. within the extracted lyrics 552, adds each unique element to a list, and assigns each unique element a unique value. The list of unique elements (i.e., the "vocabulary"), along with frequency, context, etc., is input to an algorithm, neural network, etc., to create a probabilistic model of the likelihood that a set of elements occurs together.
  • Semantic similarity analysis 538B is performed to generate semantic similarity metadata 540B. An embedding vector for each part allows different individual parts to be compared within the same structured sheet music data 112 or against other structured sheet music data 112. For example, where a choir has enjoyed the theme and language of a particular piece of choral music, the choir director may search for other choral works with similar themes and language. Note that harmonic and lyric similarity are independent and a user may search for music based on either. Semantic similarity metadata 540B from semantic similarity analysis 538B may include the outputs described above in Table 7.
  • FIG. 6 is a flowchart illustrating the preprocessing 510 performed by the SMAE 140 according to an exemplary embodiment of the present invention.
  • As discussed above, structured sheet music data 112 is preprocessed to normalize the data into a consistent format before analysis, allowing music from different publishers, different notation structures, etc. to be analyzed using the same process. Structured sheet music data 112 may be a compressed music file 605A or an uncompressed music file 605B. Compressed music files 605A are uncompressed at 610. Both originally uncompressed music files 605B and music files newly uncompressed at 610 are validated at 615. The files are validated using a predefined schema for the given format. For example, the MusicXML schema is available in Document Type Definition (DTD) and XML Schema Definition (XSD) formats, and validation is performed by a standard XML validator. If the music file is invalid, an error is generated and the preprocessing 510 is halted until the invalid music file is corrected. If the music file is valid, header metadata is extracted at 620. Header metadata may include the fields described in Table 10:
  • TABLE 10
    Header Metadata Fields
    Document Format and Version
    Work Title
    Work Number
    Movement Title
    Movement Number
    Composer
    Arranger
    Lyricist
    Copyright
    Encoding Software
  • Header metadata may be used to inform the rules analysis process 524 and machine learning process 528 by adjusting the weight of certain features. For example, a known arranger may be considered “easier,” so music files arranged by this arranger may be weighted less (multiplicative factor <1). Similarly, a known arranger may be considered “difficult,” so music files arranged by this arranger may be weighted more (multiplicative factor >1). These relative weights may be stored in and retrieved from a lookup table. If no corresponding entry is found, the music file is not weighted (multiplicative factor=1). The publisher of the music file may also be extracted, usually from the copyright field, in order to optimize the performance of the rules analysis 524 and machine learning analysis 528 according to any common patterns or peculiarities specific to the particular publisher, known as “house style.” The publisher name informs other features, so it is advantageous to extract the publisher name early in the feature extraction process. Other features may be extracted at any time because they do not influence later extracted features. The music file undergoes general cleanup at 625 to remove information useful for visual display but not useful for semantic analysis, such as color. Additionally, any specific encoding errors or conventions dependent on the software used to produce the encoding are corrected at 625. The instrument names are extracted from the music file at 630. The instrument names are normalized at 635. Instrument names are converted into their base instrument names in a process that may be similar to stemming in computational linguistics. For example, the instrument names Trumpet 1, Trumpet II, 2nd Trumpet, 4 Trumpets, and Trumpet in C would all be converted to the base instrument name Trumpet. If more than one instrument name appears on a given part, separated by a space, new line, slash, etc., or there is a “change instrument” instruction within the part, the part is marked as having multiple instruments. This informs extraction of parts at 645A. If an instrument name appears in a language other than English, a lookup table may be used to translate the instrument name into English. For example, the German equivalent of “2nd Flute” (“2. Flöte”) is normalized to “Flute.” An instrument that cannot be identified is designated as “unknown.” After the instrument names are normalized, the instruments are identified with a standard sound. In addition, a max polyphony property describing the number of pitches the instrument can play simultaneously is assigned to each instrument through the use of a lookup table. For example, a clarinet has a max polyphony of one, a guitar has a max polyphony of six, and a piano has a max polyphony of eighty-eight. This information is used in the extraction of parts at 645A as well as the analysis described herein. Once the instrument names are normalized at 635, the ensemble type is identified at 640. The normalized instrument names are compared to a lookup table of predefined ensemble types and a percent match is found. The percent match may be found using Jaccard similarity or other similarity metrics. The type of ensemble is then identified based on the percent match. For example, a string quartet must be an exact match while a marching band may vary in instrumentation. Information about the ensemble type can be used to further clean up the score, as well as provide additional information used in the analysis described herein.
If the ensemble type cannot be identified, the ensemble type is left empty. Last, either parts are extracted at 645A or a score is compiled at 645B, depending on the input music file. If the input music file includes a score, individual parts are extracted at 645A. If the input music file includes only individual parts, a score is compiled at 645B. Extraction at 645A generates individual parts 650A, with a separate output for each part or voice. A part usually corresponds to a single instrument while a voice is a subdivision of instruments occurring within the same part. For example, a part labelled “2 Trumpets” may have two voices because the two trumpets may play different notes at the same time. In this case, the parts extracted at 645A may be labeled “2 Trumpets A” and “2 Trumpets B.” Where the input music file includes only individual parts, the compiled score 650B is a single file to be used in conjunction with the parts files during analysis. Each instrument in the compiled score 650B consists of only one voice.
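  • The normalization at 635 and the ensemble identification at 640 might be sketched as follows. The regular expressions, the translation table, the ensemble templates, and the match threshold are illustrative placeholders rather than the system's actual lookup tables, and the sketch applies a single threshold where the description above notes that some ensembles (e.g., a string quartet) require an exact match.

```python
import re

TRANSLATIONS = {"flöte": "Flute", "oboe": "Oboe"}   # illustrative foreign-name lookup
ENSEMBLE_TEMPLATES = {                               # illustrative predefined ensembles
    "String Quartet": {"Violin", "Viola", "Cello"},
    "Brass Quartet": {"Trumpet", "Horn", "Trombone"},
}

def normalize_instrument_name(raw):
    """Strip numbering, Roman numerals, counts, and key designations such as 'in C'."""
    name = raw.strip()
    name = re.sub(r"\bin\s+[A-G][b#]?\b", "", name, flags=re.IGNORECASE)  # "Trumpet in C"
    name = re.sub(r"\b(\d+(st|nd|rd|th)?|[IVX]+)\b", "", name)            # "Trumpet 1", "2nd", "II"
    name = name.strip(" .-")
    return TRANSLATIONS.get(name.lower(), name.title()) or "unknown"

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

def identify_ensemble(raw_names, threshold=0.75):
    """Compare normalized names to ensemble templates and return the best match, if any."""
    instruments = {normalize_instrument_name(n) for n in raw_names}
    best, best_score = "", 0.0
    for ensemble, template in ENSEMBLE_TEMPLATES.items():
        score = jaccard(instruments, template)
        if score > best_score:
            best, best_score = ensemble, score
    return best if best_score >= threshold else ""

print(normalize_instrument_name("2nd Trumpet"))                            # Trumpet
print(identify_ensemble(["Trumpet 1", "Trumpet II", "Horn", "Trombone"]))  # Brass Quartet
```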
  • FIG. 7 is a flowchart illustrating the feature extraction process 520 performed by the SMAE 140 according to an exemplary embodiment of the present invention.
  • The feature extraction process 520 analyzes the preprocessed structured music data 112 and returns features 522 that are then used during the rules analysis process 524 and machine learning process 528. Features 522 extracted include relevant musical data, statistical features such as the average number of notes in a measure, and features derived from lookup tables. Lookup tables return simple semantic features which may be input for more complex rules analysis 524 and machine learning analysis 528. In one embodiment, a lookup table may associate a time signature with a difficulty level. For example, a 4/4 time signature may be labeled as easy while a 17/32 time signature may be labeled as difficult. The time signatures and associated difficulty levels may be used during the rules analysis 524 and machine learning analysis 528, which may change the difficulty level. For example, a music file with frequent time signature changes may be considered difficult even though each individual time signature is associated with an easy difficulty level. Feature extraction is a bottom-up process: features are first extracted from individual notes 710A-710H, then from measures 720A-720D, then from parts 730A-730B, and lastly from document 740. Note features 712 include the features and subfeatures described in Table 11:
  • TABLE 11
    Feature (with indented sub-features): Data Type
    Pitch
      Pitch Step/Pitch Class: Alphanumeric (A-G)
      Octave: Integer
      Alteration (sharp, flat, natural): Floating point (to support microtones)
    Unpitched: Boolean
    Duration: Alphanumeric (Whole, Half, Quarter, 16th, etc.)
    Rest: Boolean
    Augmentation Dots: Integer
    Accidental
      Accidental Type: Alphabetic (Sharp, Natural, Flat, Double Sharp, etc.)
      Cautionary: Boolean
    Part of a Chord: Boolean
    Articulation
      Accent: Boolean
      Detached Legato: Boolean
      Doit: Boolean
      Falloff: Boolean
      Plop: Boolean
      Scoop: Boolean
      Spiccato: Boolean
      Staccatissimo: Boolean
      Staccato: Boolean
      Strong Accent: Boolean
      Tenuto: Boolean
    Ornament
      Delayed Inverted Turn: Boolean
      Delayed Turn: Boolean
      Inverted Mordent: Boolean
      Inverted Turn: Boolean
      Mordent: Boolean
      Shake: Boolean
      Tremolo: Integer
      Trill: Boolean
      Turn: Boolean
      Vertical Turn: Boolean
      Wavy Line: Boolean
    Technique
      Bend (see following)
        Bend-Semitones: Floating Point
        Bend-Prebend: Boolean
        Bend-Release: Boolean
        Bend-With Bar: Boolean
      Double Tongue: Boolean
      Down Bow: Boolean
      Fingering: Array of Integers
      Fingernails: Boolean
      Fret: Integer
      Hammer On: Boolean
      Pull Off: Boolean
      Handbell (see following)
        Handbell-Damp: Boolean
        Handbell-Echo: Boolean
        Handbell-Gyro: Boolean
        Handbell-Hand Martellato: Boolean
        Handbell-Mallet Lift: Boolean
        Handbell-Mallet Table: Boolean
        Handbell-Martellato: Boolean
        Handbell-Martellato Lift: Boolean
        Handbell-Muted Martellato: Boolean
        Handbell-Pluck Lift: Boolean
        Handbell-Swing: Boolean
      Harmonic (see following)
        Harmonic-Artificial: Boolean
        Harmonic-Natural: Boolean
        Harmonic-Base Pitch: Pitch type (see above)
        Harmonic-Sounding Pitch: Pitch type (see above)
        Harmonic-Touching Pitch: Pitch type (see above)
      Heel: Boolean
      Hole: Alphabetic (one of: closed, open, half-open)
      Open String: Boolean
      Pluck: Alphabetic (one of: p, i, m, a)
      Snap Pizzicato: Boolean
      Stopped: Boolean
      String: Integer
      Tap: Boolean
      Thumb Position: Boolean
      Toe: Boolean
      Triple Tongue: Boolean
      Up Bow: Boolean
    Arpeggio
      Part of Arpeggio: Boolean
      Non-Arpeggio (a technique specific to instruments that are normally played arpeggio, e.g. a harp, to specifically not play using an arpeggio): Boolean
    Cue: Boolean
    Dynamics: Alphabetic
    Fermata
      Normal: Boolean
      Angled: Boolean
      Square: Boolean
    Glissando: Alphabetic (one of: start, stop)
    Grace: Boolean
    Grace-Slash: Boolean
    Tie: Alphabetic (one of: start, stop, continue)
    Slur: Alphabetic (one of: start, stop, continue)
    Tuplet
      Ratio Numerator: Integer
      Ratio Denominator: Integer
      Duration: Alphanumeric (see above)
  • Note features 712 may be used to determine note statistics 714. Note statistics 714 may include features described in Table 12:
  • TABLE 12
    Feature: Pitch Range
    Description: The lowest pitch and the highest pitch within the voice.
    Feature: Pitch Histogram
    Description: A frequency distribution of all of the pitches within the voice.
    Feature: Tessitura
    Description: The range of all pitches within the first standard deviation of the mean pitch within the voice (known as “a truncated distribution”). This serves as an indicator of what notes commonly occur. A range includes the extremes, while the tessitura excludes the extremes and focuses on notes near the mean.
    Feature: Polyphony per voice
    Description: The number of notes sounding simultaneously within the voice. This is used to find multiple stops or implicit divisi within a voice. Multiple stops are a technique for performing multiple simultaneous pitches on bowed instruments.
    Feature: Duration Type Frequency
    Description: A frequency distribution of all the duration types within the voice. Example: 2 quarter notes, 3 eighth notes, 2 16th notes, etc.
    Feature: Number of Onsets
    Description: Count of notes that are neither rests nor the continuation or end of a tie within the voice across all measures.
    Feature: Average Number of Onsets Per Measure
    Description: The Number of Onsets divided by the number of measures.
    Feature: Note n-grams
    Description: N-grams of all notes within a voice. N-grams can cross barlines.
    Feature: Repeated Identical Notes Within Measure
    Description: Find any repeated notes (same pitch and duration) within the voice. This is useful for finding ostinatos.
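  • For instance, the note n-gram statistic listed in Table 12 might be computed along the following lines, where the notes of a voice are assumed to have already been flattened across barlines into (pitch, duration) tuples; the sample data is invented.

```python
def note_ngrams(notes, n=3):
    """Return all n-grams over a flattened sequence of (pitch, duration) tuples.

    Because the sequence is flattened before n-grams are taken, the n-grams
    can cross barlines, as described in Table 12.
    """
    return [tuple(notes[i:i + n]) for i in range(len(notes) - n + 1)]

# Invented sample: a short quarter-note figure.
voice = [("C4", "quarter"), ("D4", "quarter"), ("E4", "quarter"), ("C4", "quarter")]
for gram in note_ngrams(voice, n=3):
    print(gram)
```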
  • Once the note features 712 and note statistics 714 are extracted from notes 710A-710H, measure features 722 are extracted from measures 720A-720D. Measure features 722 may include features described in Table 13:
  • TABLE 13
    Feature (Data Type): Description
    Clefs (Associative Array of Clef type and count)
    Time Signature (Alphanumeric tuple of Numerator, Denominator)
    Key Signature (Alphanumeric)
    Number of Voices (Integer): Number of unique voices in this measure
    Starts Repeat (Boolean)
    Ends Repeat (Boolean)
    Repeated Measure (Boolean): Indicates that this measure is identical or very nearly identical to the previous measure
    Begins Ending (Boolean)
    Continues Ending (Boolean): This measure is part of an ending but neither starts nor ends it
    Ends Ending (Boolean)
    Begins Coda (Boolean): The first measure in a coda
    Jump to Coda (Boolean): This measure includes the instruction “To Coda”
    Da Capo (Boolean): This measure includes the instruction “Da Capo” or “D.C.”, directing the player to go to the first measure of the composition
    Segno (Boolean): This measure includes a segno glyph, indicating a jump back point in the composition
    Dal Segno (Boolean): Indicates that this measure includes the instruction “Dal Segno” or “D.S.”, directing the player to go back to the segno earlier in the composition
    Whole Measure Rest (Boolean): Indicates that this measure contains a whole measure rest in voice 1 and all other voices are either a) whole measure rests or b) cue notes
    Part of a Multi-measure Rest (Boolean): Indicates that this measure is part of a multi-measure rest (i.e. the measure(s) before and/or after it are whole measure rests according to the criteria above)
    Tuning Change (Boolean): Indicates that this measure includes a tuning change, scordatura, or capo
    Cross Staff Beaming (Boolean): Indicates that this measure uses cross staff beaming that appears on parts that use a grand staff
  • Once measure features 722 are extracted, measure statistics 724 are extracted. Measure statistics 724 include features described in Table 14:
  • TABLE 14
    Feature: Measure n-grams
    Description: N-grams of all measures within a part.
    Feature: Repeated Measures
    Description: Find all passages of repeated measures, either explicitly marked (repeat measure mark, single measure repeat/vamp) or implicit (identical content).
  • Once measure statistics 724 are extracted, part features 732 are extracted from parts 730A-730B. As described previously, parts 730A-730B generally correspond to a single instrument, although a part may refer to more than one instrument if the parts are doubled, for example where a flute and an oboe play the same part, or if the performer changes instruments within the same part, such as where a flute player switches to a piccolo. Part features 732 include the features described in Table 15:
  • TABLE 15
    Feature (Data Type): Description
    Instrument Name(s) (Alphanumeric)
    Number of Staves (Integer): Number of staves for this part (1, 2, or 3)
  • Once part features 732 are extracted, part statistics 734 are extracted. Part statistics 734 may be extracted from individual parts or a collection of parts and include the features described in Table 16:
  • TABLE 16
    Feature: Chord Progression
    Description: Determine the large-scale chord progression of the overall composition. In this case, all parts must first be transposed to the sounding key. Next, a harmonic reduction is performed, from which the chord progression is estimated using a sequence model such as a hidden Markov model, chord n-grams, or Word2Vec-style chord embeddings. This output is then filtered to remove uncommon or infrequently occurring chords. After the chord progression has been extracted, a further analysis is used to compare the overall chord progression to common harmonic patterns, such as 12-bar blues or rhythm changes. This analysis uses chord n-grams to find repeated patterns and then measures the similarity to known chord progressions using Jaccard similarity or another similarity metric.
    Feature: Part Doubling
    Description: All parts are compared across all measures to find doubled parts. All parts must first be transposed to the sounding key. Next, parts are compared to other parts with similar ranges or to known common doubles to find doublings. For example, Piccolo frequently doubles Flute, but an octave higher. In order to filter out any transient doublings, the system may ignore a doubling that only occurs for a short duration. Exact doublings (identical pitch and rhythm) are then marked and used to suggest potential substitutions. For example, if the bassoon and trombone parts frequently double exactly, the user will be informed that the bassoon and trombone parts could be interchanged if the ensemble lacks one of those instruments. Inexact doublings (identical rhythm but different pitch) are only computed for common intervals, such as the minor 3rd, major 3rd, perfect 5th, octave, and common compound intervals such as the octave plus major 3rd. This information informs the user of the use of orchestration throughout the composition.
    Feature: Exposed Solo
    Description: All parts are compared across all measures to find exposed solos. An exposed solo is defined to be any part in which all other parts are either whole measure rests or playing sustained notes for the entirety of the measure(s).
  • Once part statistics 734 are extracted, document features 742 are extracted from document 740. The document features 742 include the duration of the music, which may be extracted or estimated. If the duration of the music is included in document 740, it is often preceded by the word “duration” and appears in one of the formats described in Table 17:
  • TABLE 17
    Format: mm:ss (Examples: 3:45, 18:25)
    Format: mm'ss" (Example: 4’33”)
    Format: mm[m]ss[s], where [m] is either of the literals “m” or “min” and [s] is either of the literals “s” or “sec” (Examples: 2 m 30 s, 5 min 41 sec)
  • If the duration of the music is not included in document 740, the duration is estimated. In one embodiment, the duration of the music is estimated using at least the number of measures, the meters, the tempo markings, and any repeats, endings, codas, segnos, or other text indicating a repeated section. If the tempo marking does not contain a number indicating a number of beats per minute, the text of the tempo marking is compared to a lookup table of common tempos and, if a match is found, the lookup table's corresponding beats per minute is used. If no match is found, the duration is calculated using a substitute tempo. In some embodiments, the substitute tempo is 120 beats per minute. If a tempo marking indicates a range of numbers, for example “Allegro (116-120 bpm)”, the average of the specified range is used.
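  • A minimal sketch of the tempo handling and duration estimation described above is shown below. The lookup table of common tempo words and the 120 beats-per-minute fallback follow the description; the regular expression and the simplified duration formula (beats per measure times number of measures at the resolved tempo, ignoring repeats and tempo changes) are assumptions.

```python
import re

COMMON_TEMPOS = {"largo": 50, "andante": 76, "moderato": 108, "allegro": 120, "presto": 180}  # illustrative

def resolve_bpm(tempo_marking, fallback=120):
    """Return beats per minute from a tempo marking such as 'Allegro (116-120 bpm)'."""
    numbers = [int(n) for n in re.findall(r"\d+", tempo_marking)]
    if numbers:
        return sum(numbers) / len(numbers)     # average of a range such as 116-120
    word = tempo_marking.strip().lower()
    for name, bpm in COMMON_TEMPOS.items():
        if name in word:
            return bpm                         # lookup of a common tempo word
    return fallback                            # substitute tempo when nothing matches

def estimate_duration_seconds(num_measures, beats_per_measure, tempo_marking):
    """Rough duration estimate, ignoring repeats, codas, and tempo changes."""
    bpm = resolve_bpm(tempo_marking)
    return num_measures * beats_per_measure * 60.0 / bpm

print(resolve_bpm("Allegro (116-120 bpm)"))          # 118.0
print(estimate_duration_seconds(64, 4, "Moderato"))  # ~142 seconds
```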
  • FIG. 8 is a flowchart illustrating the rules analysis process 524 performed by the SMAE 140 according to an exemplary embodiment of the present invention.
  • During rules analysis 524, note features 712, note statistics 714, measure features 722, measure statistics 724, part features 732, part statistics 734, and document features 742 may be used to perform melodic rules analysis 810, harmonic rules analysis 820, rhythmic rules analysis 830, and form rules analysis 840. The melodic rules analysis 810 applies rules to interpret and analyze the melody, the sequence of notes one after another. Harmonic rules analysis 820 applies rules to interpret and analyze the harmony, which includes chords (notes occurring at the same time) and chord progressions (how the harmony changes over time). Rhythmic rules analysis 830 applies rules to interpret and analyze the rhythm, the duration and timing of each note. Form rules analysis 840 applies rules to interpret and analyze the music as a whole to determine musical form. Each of the melodic rules analysis 810, the harmonic rules analysis 820, the rhythmic rules analysis 830, and the form rules analysis 840 generate the rule-derived metadata 526 described above. The rule-derived metadata 526 may be used during the machine learning analysis 528 described above.
  • FIG. 9 is a flowchart illustrating the lyrics extraction 550 process performed by the SMAE 140 according to an exemplary embodiment of the present invention.
  • The preprocessed structured sheet music data 112 is input to lyrics extraction 550 and explicit lyric metadata is extracted at 910. Explicit lyric metadata includes the data described in Table 18:
  • TABLE 18
    Explicit Lyrical Metadata: Number of Verses
    Description: The number of verses appearing in the music.
    Explicit Lyrical Metadata: Humming
    Description: Indicates whether the lyrics include humming.
    Explicit Lyrical Metadata: Laughing
    Description: Indicates whether the lyrics include laughter.
    Explicit Lyrical Metadata: Non-English Characters
    Description: Indicates that the text uses characters not found in the English alphabet. This is helpful in identifying a text that may be in a language other than English, in the event that the lyric language cannot be identified. For example, the lyrics might use a constructed language or the International Phonetic Alphabet (IPA). A constructed language is one specifically created (rather than a natural evolution), for example Esperanto or languages that occur in science fiction or fantasy literature.
  • Once the explicit lyric metadata is extracted at 910, the lyrics are extracted from the structured sheet music data 112 into a separate text file 920. The separate text file combines all syllables into words and includes all verses. All further lyrical analysis uses this separate text file as input. At 930, the language of the lyrics is estimated using language identification techniques from the field of natural language processing. In some embodiments, a combination of writing system and statistical analysis such as letter frequency and n-gram frequency is used to estimate the language. If the lyrics include a passing phrase in another language, the dominant language is identified. If the lyrics include equal parts of multiple languages such as translated lyrics or a macaronic text, the language is identified as “multiple languages.” If no match is determined, the language is marked as unknown. If the language cannot be determined and the text contains non-English characters, the language label may include a note regarding the non-English characters. In one embodiment, the language label may be “unknown (contains letters other than A-Z).” Next, the text difficulty is determined at 940. The text difficulty may be determined using readability metrics such as the Flesch-Kincaid grade level, which uses the number of words per sentence and the number of syllables per word to compute an estimated difficulty level. Next, the content of the lyrics is analyzed at 950. In some embodiments, known content is looked up at 970 and compared against the text of the lyrics. The known content may include scripture, quotations, poems, literary text, etc. In some embodiments, the subject matter of the text of the lyrics is analyzed at 960 to classify the meaning of the lyrics. In some embodiments, this may include comparing the lyrics 552 (or the embeddings generated during the semantic embedding process 532B or the semantic similarity metadata 540B generated by the semantic similarity analysis 538B) to the text of other compositions with known subject matter, including love songs, holiday songs, religious or spiritual songs, novelty songs, etc. Mature or objectionable content may be identified at 960 as well.
  • Analyzing Unstructured Sheet Music Images
  • FIG. 10 is a flowchart illustrating a data ingestion process 1000 for analyzing unstructured sheet music data 114 and determining metadata 116 that describes each composition stored as unstructured sheet music data 114 according to an exemplary embodiment of the present invention.
  • As discussed previously, the structured sheet music analysis engine 140 operates on structured sheet music data 112, not unstructured sheet music data 114, because all explicit information stored in structured sheet music data 112 (e.g., pitch, duration, etc.) must be inferred from unstructured sheet music data 114. As such, unstructured sheet music data 114 undergoes analysis performed by the optical analysis engine 180 using heuristics, optical character recognition (OCR), traditional optical music recognition (OMR), machine learning-based pattern recognition (“fuzzy OMR”), etc., instead of the structured sheet music analysis engine 140.
  • The unstructured sheet music 114 may be multi-page digital image files (such as PDFs). Sheet music that is not printed (such as an engraving plate) is printed. Sheet music that is printed but not computer readable is converted to a computer readable format by scanning the printed sheet music (for example, by the optical scanner 270). If the conversion from a printed format to a computer readable format fails (e.g., for printed documents that have been damaged, marked, or destroyed, or handwritten manuscripts that produce poor quality OMR results), the music may be restored and re-converted. Explicit metadata 350 is extracted at 1010 (using a similar process as 320 above). Each page is extracted at 1020.
  • Heuristics may be used to determine higher-level semantic metadata 116 at 1030. Examples of heuristics that may be used to determine metadata 116 include the heuristics described in Table 19:
  • TABLE 19
    Heuristic: Ink Density (number of dark pixels as a percentage of all pixels)
    Description: A higher ink density suggests the use of more musical notation, such as dynamics, articulations, etc. A high ink density therefore implies a more difficult composition.
    Heuristic: System detection by measuring the run-length of dark pixels over each horizontal row
    Description: Counting the number of consecutive dark pixels in every horizontal row helps determine potential system regions.
    1. Count the number of dark pixels in a row.
    2. Find the longest run of consecutive dark pixels in the row (maximum run-length).
    3. If the maximum run-length exceeds a threshold, add the row to the current potential region. Otherwise, do not consider this row as part of a potential system region. A threshold is used to filter out any potential noise in the digital image.
    The output from this process is a set of potential system regions, which are then used for further processing. This algorithm is common within OMR software for finding systems and staves within a piece.
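  • A minimal sketch of the run-length heuristic from Table 19 is shown below, operating on a binarized page in which dark pixels are 1 and light pixels are 0 (NumPy assumed); the run-length threshold and the toy page are illustrative.

```python
import numpy as np

def potential_system_regions(binary_page, run_threshold=50):
    """Group consecutive rows whose longest run of dark pixels exceeds a threshold."""
    regions, start = [], None
    for y, row in enumerate(binary_page):
        # Longest run of consecutive dark (1) pixels in this row.
        padded = np.concatenate(([0], row, [0]))
        changes = np.flatnonzero(np.diff(padded))
        run_lengths = changes[1::2] - changes[0::2]
        max_run = run_lengths.max() if run_lengths.size else 0
        if max_run >= run_threshold:
            start = y if start is None else start     # row joins the current region
        elif start is not None:
            regions.append((start, y - 1))            # close the region
            start = None
    if start is not None:
        regions.append((start, len(binary_page) - 1))
    return regions

# Tiny invented example: rows 2-4 contain long horizontal runs (e.g., staff lines).
page = np.zeros((8, 100), dtype=int)
page[2:5, 10:90] = 1
print(potential_system_regions(page))  # [(2, 4)]
```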
  • Potential system regions on each page may be identified at 1040. For example, the system 100 may count the number, and maximum run-length, of dark pixels in every horizontal row to identify potential system regions (e.g. areas of the page containing staves) to be used for further processing. Classified system regions may be identified at 1050. For example, a classifier (e.g., a machine learning classifier) may analyze each potential system region to positively identify which of the potential system regions are classified system regions. Machine learning may be used to determine higher-level semantic metadata 116 describing sheet music stored as unstructured sheet music data 114 at 1060. For example, a machine learning algorithm may be used to determine the range of unstructured sheet music data 114 by using a convolutional neural network pretrained for this task. Because unstructured sheet music data 114 is encoded as a digital image, the optical analysis engine 180 uses image classification methods to determine musically-relevant metadata 116 describing the sheet music stored as unstructured sheet music data 114. By using transfer learning, machine learning-based classifiers based on existing image recognition models (e.g., AlexNet, VGG, Inception, ResNet, etc.) can be quickly retrained to extract range, motifs, playing techniques (double stop, cross staff beaming, etc.), etc. For each (potentially multi-page) composition, the metadata 116 extracted from each classified system may be aggregated and reduced at 1070. For example, a composition with two classified system regions, A and B, may be mapped to two implicit metadata extraction functions, topNote(x) and bottomNote(x), as follows:
      • Compute the functions directly:
        • topNote(A)=‘C5’
        • bottomNote(A)=‘D4’
        • topNote(B)=‘F5’
        • bottomNote(B)=‘E4’
      • Group the results together by algorithm
        • topNote(A, B)=[‘C5’, ‘F5’]
        • bottomNote(A, B)=[‘D4’, ‘E4’]
      • Compute the aggregates for each page. Given the example above, the aggregate function for topNote(x) is max(x), and the aggregate function for bottomNote(x) is min(x)
        • topNote(Page)=max(topNote(A, B))=max([‘C5’, ‘F5’])=‘F5’
        • bottomNote(Page)=min(bottomNote(A, B))=min([‘D4’, ‘E4’])=‘D4’
      • Repeat the aggregation over all of the pages to give the final results for each metadata extraction function
      • Store the final result of each metadata extraction function in the one or more databases 110.
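  • A minimal sketch of this aggregate-and-reduce step, using the topNote/bottomNote example above, is shown below. Comparing pitches by MIDI note number, with a lookup limited to natural notes, is an implementation assumption; the per-system values are taken from the example.

```python
# Map note names to MIDI numbers so pitches can be compared (assumption: no accidentals).
NOTE_OFFSETS = {"C": 0, "D": 2, "E": 4, "F": 5, "G": 7, "A": 9, "B": 11}

def midi(note):                       # e.g. 'C5' -> 72
    return NOTE_OFFSETS[note[0]] + 12 * (int(note[1:]) + 1)

# Per-system results from the metadata extraction functions, as in the example above.
top_notes = {"A": "C5", "B": "F5"}
bottom_notes = {"A": "D4", "B": "E4"}

# Aggregate: max() for topNote, min() for bottomNote, first per page, then per composition.
page_top = max(top_notes.values(), key=midi)        # 'F5'
page_bottom = min(bottom_notes.values(), key=midi)  # 'D4'
print(page_top, page_bottom)
```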
  • FIGS. 11A and 11B are a flowchart illustrating the workflow 1100 of the optical analysis engine 180 according to an exemplary embodiment of the present invention.
  • As discussed previously, the optical analysis engine 180 uses heuristics, OCR, OMR, fuzzy OMR, etc. to infer information from images of sheet music. Ideally, the output of the optical analysis engine 180 is the same as the output of the structured sheet music analysis engine 140. However, because images of sheet music vary enormously (e.g., in color, resolution, skew, etc.), oftentimes the optical analysis engine 180 provides results that approximate the output of the structured sheet music analysis engine 140.
  • A large number of processes performed by the optical analysis engine 180 are based on object detection. Object detection algorithms identify certain areas of an image as belonging to a particular object class. In some embodiments, the object detection algorithms return a bounding region around a detected object or the set of pixels representing the detected object, the likely class of detected object, and a probability the detected object belongs to the object class. Within the context of the optical analysis engine, object classes include text, staves, measures, clefs, notes, etc. In some embodiments, a single object detector is trained to identify all object classes in a monolithic fashion, while in other embodiments, several specialized object detectors are trained to detect similar classes in a modular fashion. Any object detection algorithm may be used, including single-shot detection (SSD), you only look once (YOLO), common objects in context (COCO), etc. In some embodiments, a combination of object detection algorithms is used.
  • The optical analysis engine 180 works with images of sheet music that are either typeset or handwritten and is insensitive to features such as scale and font. The optical analysis engine 180 follows a general pattern of identifying segments, analyzing the segments, then further segmenting into smaller segments in a top-down approach (which differs from the bottom-up approach of the structured sheet music analysis engine 140). For example, a staff is segmented into measures, which are then segmented into notes. The output from the optical analysis engine 180 may be directly fed into a keyword-based search engine, used in evaluating natural language queries, or both. In some embodiments, the output from the optical analysis engine 180 may be further processed by the structured sheet music analysis engine 140. For example, the optical analysis engine 180 may extract lyrics from the images of sheet music but the structured sheet music analysis engine 140 may analyze the lyrics using the process described above at 532B and 538B.
  • Data may be input to the optical analysis engine 180 as either raster image data 1102 (e.g., PNG, JPEG, etc.) or vector image data 1104 (e.g., SVG). Raster formats encode information about every pixel while vector formats store instructions on how the content is drawn. If vector information is available as well as raster data, accuracy may be improved by analyzing both the raster and vector data. Where vector image data 1104 is input to the optical analysis engine 180, the vector image data 1104 is rasterized at 1106 by selecting a resolution and rendering the vector image data 1104 at that resolution. The newly generated raster image data 1112 undergoes preprocessing at 1114 just as raster image data 1102 does. A copy of the original vector image data 1104 is used for vector analysis 1110.
  • Since vector image data 1104 encodes graphics as a set of drawing commands, it is sometimes possible to identify when two drawing commands are similar through a vector analysis 1110 process known as template matching. Each command in the vector image file 1104 is either stored in a dictionary within the file that is referenced at later points in the file or is used inline. These commands may be compared to a set of known templates that identify that command as belonging to a particular glyph within a particular typeface. A vector command consists of one or more instructions that tell the computer what type of graphic primitive to draw (such as a line, circle, or Bezier curve), where to draw it, and its relative proportions. The use of relative proportions means that the commands may appear at different scales, even though the commands draw the same figure. For example, a vector command on a 1× scale may be M 2.578125 2.984375 C 1.835938 2.984375 1.222656 2.8125 0.734375 2.46875 whereas the same vector command on a 2× scale may be M 5.15625 5.96875 C 3.67188 5.96875 2.44531 5.625 1.46875 4.9375. The same commands can be compared after they are normalized to the same scale (for example, by dividing every coordinate by the maximum coordinate so that the largest coordinate is now equal to 1).
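  • A minimal sketch of the scale normalization described above is shown below, applied to the two example drawing commands; the numeric tolerance used for the comparison is an assumption.

```python
import numpy as np

def normalize_command(command):
    """Extract the numeric coordinates of a drawing command and rescale so the maximum is 1."""
    coords = np.array([float(tok) for tok in command.split() if not tok.isalpha()])
    return coords / coords.max()

def commands_match(cmd_a, cmd_b, tolerance=1e-3):
    a, b = normalize_command(cmd_a), normalize_command(cmd_b)
    return a.shape == b.shape and np.allclose(a, b, atol=tolerance)

one_x = "M 2.578125 2.984375 C 1.835938 2.984375 1.222656 2.8125 0.734375 2.46875"
two_x = "M 5.15625 5.96875 C 3.67188 5.96875 2.44531 5.625 1.46875 4.9375"
print(commands_match(one_x, two_x))  # True: the same glyph drawn at different scales
```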
  • Once the commands have been identified with known glyphs and/or shapes, a structured music document 112 may be reconstructed. For example, barline glyphs may be identified to isolate individual measure regions. Additionally, duration may be computed for all note, rest, and chord objects, and their position in time is determined using a time cursor within each measure region. The reconstructed structured music document 112 may then be analyzed by the structured sheet music analysis engine 140 as described above. In some embodiments, the outputs of both vector analysis 1110 and other analysis performed by the optical analysis engine 1100 may be aggregated for greater accuracy. Where raster image data 1102 is the input to optical analysis engine 1100, the raster image data 1102 is preprocessed at 1114.
  • Preprocessing 1114 is intended to improve the accuracy of the image analysis by straightening, denoising, whitening, etc. Scanned images skewed during the scanning process are straightened during preprocessing 1114. In one embodiment, the images of sheet music are straightened using a Hough transform to identify staff lines, measure their angles, and rotate the image until the staff lines are straight. Denoising identifies and removes random variations in pixel intensities and may be done by any number of processes, including smoothing, which averages the intensities around a pixel, non-local means, which averages similar patches within the image, etc. Whitening changes the statistical properties of the pixel values such that all pixels are uncorrelated and have a variance of one. This makes it easier for the analysis algorithms to identify statistically significant features, improving overall accuracy. After preprocessing at 1114, document information is extracted at 1116. Document header information, if available, provides the document information. If not, the information is extracted using OCR. Document information includes the fields described in Table 20:
  • TABLE 20
    Document Info Fields
    Work Title
    Work Number
    Movement Title
    Movement Number
    Composer
    Arranger
    Lyricist
    Copyright
  • The output of the OCR may contain recognition errors which are corrected in a post-processing step. The post-processing step may include one or more of (1) collating any header information contained in the original document, (2) collating any text, (3) identifying and fixing common recognition errors through the use of a lookup table, (4) identifying known entities (such as composers, arrangers, etc.) through the use of a lookup table, (5) identifying common patterns using a template, (6) spell-checking, and (7) using named entity recognition to identify names. At 1118, it is determined whether the document is a score containing multiple instruments or a part which contains only a single instrument. If the document is identified as a part 1120, staves are identified at 1124. Staves are stitched together at 1126 to simplify feature extraction by minimizing discontinuities at system and page boundaries.
  • For example, consider the part before stitching shown in FIG. 21G. That part includes a scale run that begins in measure 5 and ends in measure 6, but is interrupted by a system break. After stitching, the scale run is uninterrupted as shown in FIG. 21H.
  • If the document is identified as a score 1122, systems are identified and stitched together at 1128. Pages are stitched together at 1130. Staves are identified at 1132. Staff level features of the pages and staves are identified at 1134.
  • In some embodiments, the unstructured sheet music data 114 may be converted (e.g., using OMR) to structured sheet music data 112 (e.g., a MusicXML file) by the sheet music conversion engine 160 and then passed to the SMAE 140, which determines metadata 116 by analyzing the data 112 down to each note 710 as described above. In other embodiments, however, the optical analysis engine 180 includes one or more image recognition algorithms (described above) that have been trained to recognize musical metadata 116 without having to analyze every single note 710. For example, lyrics may be extracted from the staff level features at 1136 (using a similar process as 550 above) and analyzed at 1138 (by the SMAE 140, using a similar process as 532B-540B above) to generate semantic similarity metadata 1140. Phrases may be extracted from the staff level features at 1142 and analyzed at 1144 using image recognition algorithm(s) trained to generate phrase descriptions 1146 (similar to phrase descriptions 548). Those one or more image recognition algorithms may extract measures from the staff level features at 1148 and analyze measure features at 1150, and extract notes from the staff level features at 1152 and analyze note features at 1154. The results of phrase, measure, and note analysis may be analyzed by the structured sheet music analysis engine 140.
  • The results of the machine learning-based pattern recognition processes described above may be analyzed by the structured sheet music analysis engine 140. In particular, the machine learning analysis 528 described above can be used to predict the difficulty of each composition (or each part within each composition) stored as image data. For example, a corpus of compositions or parts stored as images may have known difficulty levels. A supervised learning process can then be used to learn a function for determining a probability that another composition or part stored as image data has those difficulty levels based on the patterns recognized in the image data.
  • As described above, the structured sheet music analysis engine 140 analyzes structured sheet music data 112, as well as unstructured sheet music data 114 that has been converted to structured sheet music data 112 by the sheet music conversion engine 160, to determine musically relevant metadata 116 describing the sheet music. Meanwhile, the optical analysis engine 180 analyzes unstructured sheet music data 114 to determine musically relevant metadata 116 describing the sheet music. The metadata 116 describing each composition is stored in the one or more databases 110. As described in detail below, the sheet music search and discovery system 100 also includes a search engine 190 that enables users to find compositions that are relevant to them based on the metadata 116 extracted, calculated, and generated by the sheet music search and discovery system 100.
  • Search and Recommendations
  • FIG. 12 is a flowchart illustrating a query process 1200 executed by the search engine 190 to search sheet music data according to an exemplary embodiment of the present invention.
  • Once the metadata 116 describing structured sheet music data 112 and unstructured sheet music data 114 has been generated using the structured sheet music analysis engine 140 or the optical analysis engine 180, the metadata 116 is stored in a database and available for search. A query language statement 1210 may be directly input to the database query execution engine 1270 to generate search results 1280. The query language statement 1210 may be constructed using structured query language (SQL). Examples of SQL query pseudocode are shown in Table 21, where composition-related data is stored in a table called Compositions and part-related data is stored in a table called Parts. There is a 1:n relationship between Compositions and Parts (because each composition may include multiple parts), such that an entry in the Compositions table may link to one or more entries in the Parts table via a foreign key called compositionId.
  • TABLE 21
    SQL Query Pseudocode:
      SELECT * FROM Compositions
      INNER JOIN Parts ON Part.compositionId = Composition.compositionID
      WHERE Composition.tempo BETWEEN 80 AND 120
      AND Composition.gradeLevel = 2
      AND Composition.ensembleType = 'Marching Band'
      AND Composition.duration < 180
      AND (Part.featured = True WHERE Part.instrument = 'Flute'
      AND Part.featured = True WHERE Part.instrument = 'Trumpet')
    Description: Find all Grade 2 Marching Band titles where Flute and Trumpet are featured instruments, with a moderate tempo between 80 and 120 beats per minute, under 3 minutes (180 seconds) in length.
    SQL Query Pseudocode:
      SELECT TOP 5 * FROM Compositions
      INNER JOIN Parts ON Part.compositionId = Composition.compositionID
      WHERE Composition.ensembleType = 'SAB'
      AND Composition.LyricLanguage = 'Latin'
      AND (Part.highestNote = 'F5' WHERE Part.instrument = 'Soprano')
    Description: Find the top 5 compositions for SAB chorus with a Latin text where the highest note in the Soprano part is F5.
  • Where the query language statement 1210 is not available, query generator 1260 may create a query based on user profile 1220, keywords and filters 1230, audio input 1240 analyzed at 1245, natural language query 1250 analyzed at 1255, audio fingerprint, QR code, unstructured sheet music data, etc.
  • The search engine 190 provides functionality for users to input the keywords 1230 via the graphical user interface 192. The graphical user interface 192 may provide functionality for users to input keywords 1230 in an unstructured manner. For example, the graphical user interface 192 may simply allow the user to enter keywords 1230 and the search engine 190 may be able to determine whether any of the metadata 116 matches or is similar to those keywords 1230. In some embodiments, the graphical user interface 192 may provide functionality for users to input keywords 1230 in a structured manner. For example, the graphical user interface 192 may provide functionality for the user to input keywords 1230 in one or more categories of musical attributes (e.g., composer, range, etc.) and the search engine 190 may determine whether any of the metadata 116 in those categories matches or is similar to those keywords 1230. The search engine 190 may also provide functionality for users to augment a keyword search by selecting a filter 1230 via the graphical user interface 192. For example, the graphical user interface 192 may allow the user to select a category (e.g., key signature, meter, lyric language, etc.) and input a value or range of values. The search engine 190 then determines whether any of the metadata 116 matches that value or is within that range of values. The search engine 190 may also search the metadata 116 to identify sheet music matching user profiles 1220. User profiles 1220 will be discussed further in reference to FIG. 16. The database query execution engine 1270 compares the query generated by query generator 1260 to the metadata 116 describing structured and unstructured sheet music data to generate search results 1280.
  • The query generator 1260 may also construct a query based on audio input 1240. For example, the search engine 190 may provide functionality for a user to submit a query by humming or singing a melodic fragment (audio input 1240) and have the search engine 190 return the compositions with the highest similarity. The search engine 190 may search for similar compositions using Parsons code, dynamic time warping, audio fingerprint and/or a neural network. Parsons code describes the melodic contour of a composition. In order to convert an audio input 1240 to Parsons code, audio analysis at 1245 determines an approximate pitch using standard pitch detection, such as autocorrelation, fast Fourier transform (FFT), or the Yin algorithm. The audio analysis at 1245 then determines whether subsequent pitches are higher, lower, or the same as the previous pitch, within a certain error threshold. Converting the audio input 1240 to Parsons code has certain advantages. For example, the user's pitch does not have to be exact, the user's rhythm does not have to be exact, and the approach can easily search structured data. However, converting the audio input 1240 to Parsons code also has certain disadvantages. For example, the audio input 1240 must be monophonic, the audio input 1240 must start at the beginning of the composition, and, due to a lack of rhythmic information, it is difficult to distinguish between compositions that have the same melodic contour but different rhythms.
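  • A minimal sketch of the Parsons code conversion is shown below. It assumes the pitch detection at 1245 has already produced a monophonic sequence of pitches in Hz; the semitone threshold and the sample hummed fragment are illustrative.

```python
import math

def parsons_code(pitches_hz, threshold_semitones=0.5):
    """Convert a monophonic pitch sequence to Parsons code: U (up), D (down), R (repeat)."""
    code = "*"                                   # conventional start symbol
    for prev, curr in zip(pitches_hz, pitches_hz[1:]):
        semitones = 12 * math.log2(curr / prev)  # signed interval in semitones
        if semitones > threshold_semitones:
            code += "U"
        elif semitones < -threshold_semitones:
            code += "D"
        else:
            code += "R"
    return code

# Invented hummed fragment (roughly C4, D4, E4, E4, C4).
print(parsons_code([261.6, 293.7, 329.6, 330.1, 261.8]))  # *UURD
```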
  • Dynamic time warping is a process of comparing two waveforms and determining how similar they are under time-altering transforms. Dynamic time warping has certain advantages. For example, the user's pitch does not have to be exact (as a stretched waveform will have a lower pitch and a time-compressed waveform will have a higher pitch), the user's rhythm does not have to be exact, the audio input 1240 may be polyphonic (e.g. searching for a piano composition by playing a piano as the audio input 1240), and the audio input 1240 does not have to start at the beginning of the composition. However, dynamic time warping also has certain disadvantages. For example, because this approach involves waveform comparison, the search engine 190 must store at least one audio version of every composition that is searched.
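  • A minimal sketch of dynamic time warping over two one-dimensional sequences (for example, pitch contours derived from the query audio and from a stored rendition) is shown below; a production implementation would compare richer audio features, so the sequences here are illustrative only.

```python
import numpy as np

def dtw_distance(seq_a, seq_b):
    """Classic dynamic-programming DTW cost between two 1-D sequences."""
    n, m = len(seq_a), len(seq_b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(seq_a[i - 1] - seq_b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j],      # insertion
                                 cost[i, j - 1],      # deletion
                                 cost[i - 1, j - 1])  # match
    return cost[n, m]

# The hummed query is a time-stretched version of the stored melody, yet still matches closely.
stored = [60, 62, 64, 65, 64, 62, 60]
hummed = [60, 60, 62, 62, 64, 64, 65, 64, 62, 60]
print(dtw_distance(stored, hummed))  # 0.0: perfectly aligned under time warping
```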
  • The search engine 190 may also utilize a neural network trained against several queries and known ground truth structured data, such as the Multimedia Information Retrieval Query By Singing/Humming (MIR-QBSH) corpus, the Institute Of Acoustics Chinese Academy Of Sciences (IOACAS) corpus, etc. Utilizing a neural network has certain advantages. For example, it allows for a direct query of structured data from an audio waveform, the pitch and rhythm of the audio input 1240 do not need to be exact (as the convolutional layers of the network should capture the time and pitch variation), and audio input 1240 does not have to start at the beginning of the composition. However, utilizing a neural network has certain disadvantages. For example, training the neural network may take a long time and the query may be limited only to the primary melodic motifs of the compositions (whereas other methods may search the entire composition for a match). However, this may also be an advantage as people are more likely to search for common, memorable motifs.
  • The query generator 1260 may also construct a query based on natural language query 1250 (e.g., “What concertos feature the clarinet?” “Are there any SAB choral works based on the poems of Robert Frost?” “Is there a trumpet solo that features triple-tonguing?” etc.). For example, the search engine 190 may analyze the natural language query at 1255 and use that analysis to perform natural language querying against natural language descriptions of the compositions stored in the one or more databases 110. Natural language descriptions of musical compositions may include information from the metadata 116 (in particular the text/keyboard output 420) described above—including, for example, the explicit metadata 350 (e.g., composer, lyricist, etc.), the implicit metadata (e.g., range, meters, etc.), the rule-derived metadata 526, the machine learning derived metadata 530, the semantic similarity metadata 540A, 540B, and 1140, the phrase descriptions 548 and 1146, etc.—as well as the publisher's description of a composition, other advertising copy of the composition, other sources of information pertaining to the composition (e.g., electronic sources such as Wikipedia), crowd-sourced information about the composition, etc.
  • FIG. 13 is a flowchart illustrating a process 1300 for querying sheet music data using a user profile according to an exemplary embodiment of the present invention.
  • Keywords 1310 are transformed into a query 1340 (e.g., an SQL query) by the query generator 1260. A selected user profile 1320 is considered and any mandatory filters are extracted at 1330 and appended to the query 1340 before query 1340 is executed at 1350. A mandatory filter may include, for example, an ensemble type, a requirement that a specific instrument be included in the search results, etc. Extracting and appending mandatory filters 1330 to the query 1340 limits the results returned. For example, if the user selects a user profile 1320 for a choir, all non-choir literature is automatically excluded. Similarly, if the user selects a user profile 1320 for wind ensemble and requires that the English horn be in the instrumentation, then all wind ensemble compositions not having an English horn are automatically excluded. Extracting mandatory filters at 1330 before the query 1340 is executed at 1350 limits the number of records that the query 1340 runs over, resulting in faster execution.
  • The execution of the query 1340 at 1350 results in unsorted query results 1360, which are sorted at 1370 by comparing each composition in the unsorted query results 1360 to the selected user profile 1320 by using similarity metrics. Potential metrics include, for example, Manhattan (L1) distance or Jaccard similarity. After the similarity metric is computed for each composition in the unsorted query results 1360, the unsorted query results 1360 are then sorted at 1370 from highest scoring (i.e., most similar) to lowest scoring (least similar). The sorted query results 1380 are then displayed to the user. Sorting allows the most relevant compositions to appear at the top of the page. In order for the search engine 190 to find items that match a selected user profile 1320, the search engine 190 compares how well the selected user profile 1320 matches a given item.
  • The sorting process 1370 may include an instrumentation comparison and/or a range comparison. In an instrumentation comparison, the selected user profile 1320 contains a list of one or more instruments and their respective ranges. In order to recommend appropriate compositions, this instrumentation must be compared to the instrumentation of each composition. Both the number of instruments as well as the type of each instrument must be considered to make an accurate match. Because instrument names may vary, the instrument names are normalized during preprocessing at 635 and/or are limited to a predefined set of drop-down values by the graphical user interface 192 that the user may select when creating their profile. In one example of an instrumentation comparison, a user has defined a selected user profile 1320 for a brass quartet. Their profile contains four instruments, with the names “Trumpet 1”, “Trumpet 2”, “Horn”, and “Trombone”. The first two instruments have their names normalized to “Trumpet”. The user then performs a search. If the search engine 190 considered only the number of instruments, it could return results from string quartets or choral music; such results are irrelevant and should not be returned to the user. Instead, the search engine 190 takes both the number of instruments and their names into account when performing a search. In another example of an instrumentation comparison, a user has defined a selected user profile 1320 for a wind ensemble, but has not included “Bassoon” in the selected user profile 1320 because their ensemble does not have a bassoon. However, virtually all wind ensemble literature includes a bassoon part. The user should still be able to find compositions with very similar instrumentation. The search engine 190 may use, for example, a similarity metric to determine how closely the instrumentation of the selected user profile 1320 matches that of each composition. Without this comparison, it is possible that this example query 1340 would not return any results. By making this comparison, the example query 1340 can return results, even if the results are inexact matches.
  • In a range comparison, the search engine 190 returns compositions where the range of a given part of the composition falls within the range of the same part in the selected user profile 1320. For example, the search engine 190 may run two penalty functions to determine how far the range of the composition extends above and below the range of the selected user profile 1320. Each function may determine an exponential penalty for each semitone that a given composition goes above or below the range of the selected user profile 1320. For example, if a composition's range exceeds the upper range of the selected user profile 1320 by two semitones, the function may return a 96% match (100−2*2). Such penalties may be more heavily weighted when the composition's range exceeds not only the range of the selected user profile 1320, but also the physical capabilities of the instrument selected in the selected user profile 1320.
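  • A hedged sketch of such a penalty function follows (in Python); it assumes a penalty of 2^n points for exceeding the profile's range by n semitones, which reproduces the 96% figure in the example above, though the exact penalty curve and the note-numbering helper shown here are illustrative assumptions rather than the required implementation.

    def semitone(note):
        """Map a note name such as 'C4' or 'Bb3' to a MIDI-style semitone
        number (C4 = 60). Illustrative helper only."""
        steps = {"C": 0, "D": 2, "E": 4, "F": 5, "G": 7, "A": 9, "B": 11}
        letter, rest = note[0], note[1:]
        accidental = 0
        if rest and rest[0] in "#b":
            accidental = 1 if rest[0] == "#" else -1
            rest = rest[1:]
        return (int(rest) + 1) * 12 + steps[letter] + accidental

    def range_match(profile_low, profile_high, comp_low, comp_high):
        """Assumed exponential penalty: exceeding the profile's range by n
        semitones (above or below) costs 2**n points, floored at zero."""
        over = max(0, semitone(comp_high) - semitone(profile_high))
        under = max(0, semitone(profile_low) - semitone(comp_low))
        penalty = (2 ** over if over else 0) + (2 ** under if under else 0)
        return max(0.0, 100.0 - penalty)

    # A composition reaching two semitones above a C4-C5 profile scores 96%.
    print(range_match("C4", "C5", "C4", "D5"))  # 96.0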
  • The search engine 190 may include a content-based recommendation system that recommends compositions that the user has not yet purchased that are similar to compositions that the user has already purchased. FIG. 14 is a flowchart illustrating a recommendation process 1400 performed by the search engine 190 to identify recommendations 1490 according to an exemplary embodiment of the present invention.
  • As shown in FIG. 14, the selected user profile 1320 may include an instrumentation 1422 and a range 1424 for each of those instruments, as well as past purchases 1426 of the user. The search engine 190 may identify similar compositions 1445 at 1440 by comparing the metadata 116 describing each of the past purchases 1426 to the metadata 116 of the remaining compositions in the one or more databases 110. In particular, the similar composition process 1440 may identify compositions with similar instrumentations and/or ranges as described above with respect to the sorting process 1370. After discarding purchases already made by the user at 1480, the similar compositions 1445 may be sorted for relevance using the sorting process 1370 described above. In one example, for a user that has purchased compositions for middle school orchestra, the search engine 190 may recommend compositions for middle school orchestras, followed by compositions for orchestras that are not specifically targeted for middle school but that match closely to other compositions that the user has already purchased. Compositions for different ensembles (e.g., choir or PVG arrangements), although still related to a user's other previous purchases, may either be listed last or excluded from the recommended compositions entirely because they are not relevant to the ensemble that the user is shopping for. In another example, for a user that has purchased different compositions by Beethoven, the search engine 190 may recommend other compositions by Beethoven regardless of ensemble (e.g., PVG arrangements, orchestral arrangements, compositions for wind ensemble, etc.).
  • The search engine 190 may also determine compositions that are often purchased together 1455 at 1450. For example, the search engine 190 may store a co-purchasing matrix 1452 identifying compositions that are often purchased by the same user (either at the same time or separately). For example, the items "Guitar Method Book: Beginner" and "Guitar Method Book: Intermediate" are likely to be purchased either at the same time (in the same shopping cart) or at different points in time (a user finishes the beginner book and then purchases the intermediate book). Based on historical purchase information of items across all users in the co-purchasing matrix 1452, the search engine 190 can identify compositions 1455 frequently bought with the current composition and display them on the product page for that composition. In order to determine compositions that are often purchased by the same user, the search engine 190 may utilize logistic regression, a neural-network based approach, etc. Again, after discarding purchases already made by the user at 1480, the compositions purchased together 1455 may be sorted for relevance using the sorting process 1370 described above.
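  • As a hedged sketch (in Python, with hypothetical item titles), the co-purchasing matrix 1452 could be approximated by counting, across all users' purchase histories, how often each pair of items is bought by the same user; the logistic-regression or neural-network approaches mentioned above would replace this simple counting step rather than the surrounding logic.

    from collections import defaultdict
    from itertools import combinations

    def build_co_purchasing_matrix(purchase_histories):
        """Count how often each pair of items appears in the same user's
        purchase history (a simple stand-in for the co-purchasing matrix 1452)."""
        matrix = defaultdict(lambda: defaultdict(int))
        for items in purchase_histories:
            for a, b in combinations(sorted(set(items)), 2):
                matrix[a][b] += 1
                matrix[b][a] += 1
        return matrix

    def frequently_bought_with(matrix, item, already_purchased=(), top_n=5):
        """Items most often co-purchased with `item`, discarding any the user owns."""
        candidates = [(count, other) for other, count in matrix.get(item, {}).items()
                      if other not in already_purchased]
        return [other for _, other in sorted(candidates, reverse=True)[:top_n]]

    histories = [
        ["Guitar Method Book: Beginner", "Guitar Method Book: Intermediate"],
        ["Guitar Method Book: Beginner", "Guitar Method Book: Intermediate", "Chord Chart"],
        ["Guitar Method Book: Beginner", "Chord Chart"],
    ]
    matrix = build_co_purchasing_matrix(histories)
    print(frequently_bought_with(matrix, "Guitar Method Book: Beginner"))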
  • The search engine 190 may also recommend the past purchases 1465 of users with similar user profiles 1220. Similar user profiles 1462 are identified at 1460. The user profile similarity process 1460 may compare the instrumentation 1422 and range(s) 1424 of the selected user profile 1320 to the instrumentation 1422 and the range(s) 1424 of the other user profiles 1220 using similarity metrics such as Manhattan (L1) distance or Jaccard similarity. (A similar process is used to select a composition for a user as described above. However, unlike when identifying similar user profiles 1462, a composition that is wholly contained within the range 1424 of the selected user profile 1320 may be considered to be a 100 percent match.) Additionally, the search engine 190 may compare the ranges of the two ensembles by determining whether the largest interval difference between the two ranges exceeds a prescribed threshold (e.g., +/−2 semitones). Conventional systems recommend products using collaborative filtering. For example, user ratings are collected into a matrix, which is factored (e.g., using singular value decomposition), and the user is projected onto the new basis. Because the search engine 190 stores user profiles 1220 that include instrumentation 1422 (and, in some cases, a range 1424 for each instrument), the search engine 190 is able to recommend compositions that are more relevant than would be generated using conventional collaborative filtering. For example, a user who made a purchase for a middle school marching band where their trumpets can only play up to F5 may be informed about another purchase made for a different middle school marching band with trumpets that can only play up to F5. Meanwhile, compositions outside that range and compositions for other ensembles can be excluded. Again, after discarding purchases already made by the user at 1480, the past purchases 1465 of users with similar user profiles 1462 may be sorted for relevance using the sorting process 1370 described above. Alternatively, the past purchases 1465 of users with similar user profiles 1462 may be sorted by the similarity of the user profiles 1220.
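  • The following is one hedged way (in Python, with hypothetical field names and semitone-numbered ranges) to combine the two comparisons described above: Jaccard similarity over the instruments shared by two user profiles 1220, rejected outright when the largest interval difference between corresponding ranges exceeds a prescribed threshold such as +/−2 semitones.

    def profile_similarity(profile_a, profile_b, semitone_threshold=2):
        """Jaccard similarity over instrument names, zeroed out when any shared
        instrument's range differs by more than the threshold (an assumed
        combination of the instrumentation and range checks described above)."""
        instruments_a, instruments_b = set(profile_a), set(profile_b)
        if not instruments_a or not instruments_b:
            return 0.0
        score = len(instruments_a & instruments_b) / len(instruments_a | instruments_b)
        for instrument in instruments_a & instruments_b:
            low_a, high_a = profile_a[instrument]
            low_b, high_b = profile_b[instrument]
            if abs(high_a - high_b) > semitone_threshold or abs(low_a - low_b) > semitone_threshold:
                return 0.0
        return score

    # Two middle school marching bands whose trumpets top out at F5 (MIDI 77).
    band_a = {"Trumpet": (58, 77), "Trombone": (40, 70)}
    band_b = {"Trumpet": (58, 77), "Trombone": (40, 69), "Tuba": (28, 53)}
    print(profile_similarity(band_a, band_b))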
  • The search engine 190 may also use the information contained within the user profile 1320 and the ordered recommendations 1490 to market compositions. When new compositions arrive and/or the user modifies his or her profile(s) 1320, new compositions may be electronically marketed specifically to that user. The system 100 may also output email marketing campaigns, electronic ads, push notifications, etc. that include the recommendations 1490. Additionally, existing electronic marketing materials may be customized to include recommendations 1490 based on the user profile 1320 of the user receiving the electronic marketing materials. For example, for a director of a high school marching band who subscribes to a marching band email newsletter, the newsletter may be customized to match that director's ensemble by highlighting or sorting the items that most closely match their ensemble. In another example, the same user may subscribe to new product push notifications. When a new marching band arrangement of a current pop song is released and it matches that ensemble, the director receives a push notification on their phone indicating that new inventory is available that would be appropriate for their ensemble. If the user opens the push notification, they can then view the inventory and its marketing materials directly on their mobile device, allowing them to purchase the item as soon as it becomes available.
  • The search engine 190 may also recommend multiple compositions that together form a concert program (or set list). FIG. 15 is a flowchart illustrating concert program generation processes 1500 performed by the search engine 190 according to an exemplary embodiment of the present invention.
  • The search engine 190 may include an automatic concert program generator that generates a concert program 1590 (for example, within a user-specified length) based on one or more user-specified compositions, referred to as concert program generator seeds 1510. In order to automatically generate a concert program 1590, the concert program generation process 1500 may identify compositions 1445 similar to the concert program generator seed(s) 1510 or to compositions previously purchased by the user (identified using the similar composition identification process 1440 described above), compositions programmed 1565 by (and past purchases 1465 of) users with similar user profiles 1462 (identified using the user profile similarity process 1460 described above), compositions frequently purchased together 1455 with the concert program generator seed(s) 1510 (identified using the co-purchasing identification process 1450 and the co-purchasing matrix 1452 described above), compositions frequently programmed together 1555 with the concert program generator seed(s) 1510, etc. To determine compositions frequently programmed together 1555, a co-programming matrix 1552 of compositions frequently programmed together (derived, for example, from actual concert programs, CD track listings, etc.) may be stored and a co-occurrence analysis 1550 may be performed to determine compositions frequently programmed with the one or more concert program generator seeds 1510.
  • The similar compositions 1445, the compositions purchased together 1455, the compositions programmed together 1555, and the compositions programmed 1565 by (and/or past purchases 1465 of) users with similar user profiles 1462 may be sorted using the sorting process 1370 described above. The results may then be trimmed at 1580 such that the concert program 1590 meets any requirements 1582 set by the user (such as the user-specified length), as in the sketch below.
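  • As a hedged sketch of the trimming step 1580 (in Python, with hypothetical duration fields), the already-sorted candidate list can be walked greedily, keeping compositions until adding another would exceed the user-specified length requirement 1582; other selection strategies (e.g., optimizing total program length) are equally possible.

    def trim_to_length(sorted_candidates, max_seconds):
        """Greedy trim: keep compositions in relevance order until the next one
        would push the concert program 1590 past the requested length."""
        program, total = [], 0
        for piece in sorted_candidates:
            if total + piece["duration_seconds"] <= max_seconds:
                program.append(piece)
                total += piece["duration_seconds"]
        return program, total

    candidates = [
        {"title": "Beethoven Symphony #5 (1st mvt.)", "duration_seconds": 440},
        {"title": "Für Elise (Flute Solo)", "duration_seconds": 180},
        {"title": "Egmont Overture", "duration_seconds": 540},
        {"title": "Turkish March", "duration_seconds": 220},
    ]
    program, total = trim_to_length(candidates, max_seconds=45 * 60)
    print([p["title"] for p in program], total)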
  • In some embodiments, the search engine 190 may provide functionality for a user to automatically generate a concert program 1590 (for example, using a single composition as the concert program generator seed 1510) with minimal input from the user. For example, each page for each composition may include a button (or other input mechanism) to automatically generate a concert program 1590 using that composition as the concert program generator seed 1510. The search engine 190 may infer the ensemble type from the composition's instrumentation, as well as the ranges of similar compositions.
  • Additionally or alternatively, a user may wish to specify several requirements 1582 for a concert program 1590 (e.g., length, overarching theme, focus on a particular composer, etc.). Accordingly, the search engine 190 may include a guided concert program generator interface 1592 that automatically generates a concert program 1590 using all of the specified requirements 1582. The guided concert program generator interface 1592 may provide functionality for the user to specify one or more concert program generator seeds 1510. The guided concert program generator interface 1592 may also provide functionality for the user to interact with the search engine 190 and specify each of the requirements 1582 using natural language (e.g., by voice or text). For example, the search engine 190 (S) may interact with a user (U) as follows (a minimal sketch of the underlying requirement-gathering loop appears after the example dialogue):
      • S: Hi, I would like to help you design your concert. First, can you tell me what type of ensemble this is?
      • U: High School Wind Ensemble.
      • S: Tell me more about your ensemble. Are there any instruments that are not in your ensemble?
      • U: I don't have any bassoons or bass clarinets.
      • S: Are there any instruments you would like to feature, for example in a solo?
      • U: Flute
      • S: Now that I know about your ensemble, tell me about the concert you wish to perform. What is the theme of your concert?
      • U: The music of Beethoven.
      • S: How long would you like the concert to be?
      • U: I would like the concert to be 45 minutes long.
      • S: Last question: Is there any composition that must be included in this program?
      • U: Yes, the concert must include his 5th Symphony.
      • S: Ok. Generating concert program.
      • S: Here are your results:
      • 1) Beethoven Symphony #5
      • 2) Für Elise (Flute Solo)
      • . . .
      • Estimated concert length: 43 minutes, 20 seconds.
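  • A minimal sketch of the requirement-gathering loop behind such a dialogue is shown below (in Python); the question wording is taken from the example above, while the slot names and the answer() callback are illustrative assumptions, and a production interface would add natural-language parsing of each reply before generating the concert program 1590.

    QUESTIONS = [
        ("ensemble", "First, can you tell me what type of ensemble this is?"),
        ("missing_instruments", "Are there any instruments that are not in your ensemble?"),
        ("featured_instrument", "Are there any instruments you would like to feature, for example in a solo?"),
        ("theme", "What is the theme of your concert?"),
        ("length", "How long would you like the concert to be?"),
        ("required_composition", "Is there any composition that must be included in this program?"),
    ]

    def collect_requirements(answer):
        """Ask each scripted question in turn and collect the replies into a
        requirements dict (a stand-in for the requirements 1582)."""
        return {slot: answer(question) for slot, question in QUESTIONS}

    # Replaying the example dialogue above as canned answers.
    scripted = iter(["High School Wind Ensemble", "no bassoons or bass clarinets",
                     "Flute", "The music of Beethoven", "45 minutes",
                     "Beethoven Symphony #5"])
    print(collect_requirements(lambda question: next(scripted)))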
  • FIG. 16 illustrates a user profile view 1600 of the graphical user interface 192 according to an exemplary embodiment of the present invention.
  • The user profile view 1600 provides functionality for a user to identify several parameters at once. For each user profile, the graphical user interface 192 may provide functionality for a user to add/edit/delete multiple instruments 1610, a number 1620 of each instrument, a written range 1640 of each instrument with a tolerance in semitones (for example, “C4 to C5+/−2 semitones” would search for compositions with the range of Bb3 to D5, since Bb3 is 2 semitones lower than C4, and D5 is 2 semitones higher than C5), a grade level 1630 (or range of grade levels) for each instrument, whether an instrument should be featured in a solo, etc. (In a preferred embodiment, the written range 1640 is always used for pitched instruments. If searching for a transposing instrument, the graphical user interface 192 may display the written and/or sounding pitch. For unpitched percussion, the graphical user interface 192 may not permit the user to enter this information because range does not apply.) The graphical user interface 192 may allow each user to create multiple user profiles. For example, a user may create one profile for high school marching band (because, e.g., the user is the director of the band), another profile for church choir (because, e.g., the user is the director of the choir), a third profile for a saxophonist (e.g., a student of the user), and a fourth profile for the user (because, e.g., the user is learning to play guitar). In some embodiments, the search engine 190 may provide pre-defined user profiles that users can select and edit. The pre-defined user profiles may include, for example, a sixth grade SAB (soprano, alto, baritone) choir, a community church choir, a high school marching band, etc.
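  • The semitone-tolerance example above ("C4 to C5 +/−2 semitones" searching Bb3 to D5) could be computed as in the following hedged sketch (in Python, using an assumed MIDI-style note numbering and a flat-preferring spelling of accidentals):

    NOTE_STEPS = {"C": 0, "D": 2, "E": 4, "F": 5, "G": 7, "A": 9, "B": 11}
    NOTE_NAMES = ["C", "Db", "D", "Eb", "E", "F", "Gb", "G", "Ab", "A", "Bb", "B"]

    def to_semitone(note):
        """'C4' -> 60, 'Bb3' -> 58 (MIDI-style numbering; illustrative only)."""
        letter, rest = note[0], note[1:]
        accidental = 0
        if rest and rest[0] in "#b":
            accidental = 1 if rest[0] == "#" else -1
            rest = rest[1:]
        return (int(rest) + 1) * 12 + NOTE_STEPS[letter] + accidental

    def to_name(semitone):
        """Inverse of to_semitone(), spelling accidentals as flats."""
        return NOTE_NAMES[semitone % 12] + str(semitone // 12 - 1)

    def expand_range(low, high, tolerance_semitones):
        """Widen a written range 1640 by the given tolerance in semitones."""
        return (to_name(to_semitone(low) - tolerance_semitones),
                to_name(to_semitone(high) + tolerance_semitones))

    print(expand_range("C4", "C5", 2))  # ('Bb3', 'D5')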
  • FIG. 17 illustrates an edit range view 1700 of the graphical user interface 192 according to an exemplary embodiment of the present invention. The user interface 192 may provide functionality for a user to edit the range 1640 of each instrument in the user profile.
  • FIG. 18 illustrates a search view 1800 of the graphical user interface 192 according to an exemplary embodiment of the present invention. As discussed previously, a user may input one or more keywords in field 1810 and indicate a user profile at 1820. In some embodiments, the user profile may be chosen from a drop-down menu listing user profiles.
  • FIG. 19 illustrates a search results view 1900 of the graphical user interface 192 according to an exemplary embodiment of the present invention. The search results view 1900 displays the results of a query constructed by the query generator based on the keywords 1920, a user profile 1930, or both. In some embodiments, the search results view 1900 provides functionality for the user to refine the results by selecting one or more filters 1910. The search results may be ranked according to their similarity to the one or more keywords 1920 and parameters stored in the selected profile 1930.
  • Hardware/Software
  • FIG. 20 is a block diagram of a simplified multifunctional electronic device 2000 according to an exemplary embodiment of the present invention. Multifunction electronic device 2000 may include processor 2005, display 2010, user interface 2015, graphics hardware 2020, device sensors 2025 (e.g., proximity sensor/ambient light sensor, accelerometer and/or gyroscope), microphone 2030, audio codec(s) 2035, speaker(s) 2040, communications circuitry 2045, digital image capture circuitry 2050, video codec(s) 2055 (e.g., in support of digital image capture unit 2050), memory 2060, storage device 2065, and communications bus 2070. Multifunction electronic device 2000 may be, for example, a personal electronic device such as a personal digital assistant (PDA), mobile telephone, or a tablet computer. Processor 2005 may execute instructions necessary to carry out or control the operation of many functions performed by device 2000. Processor 2005 may, for instance, drive display 2010 and receive user input from user interface 2015. User interface 2015 may allow a user to interact with device 2000. For example, user interface 2015 can take a variety of forms, such as a button, keypad, dial, a click wheel, keyboard, display screen and/or a touch screen. Processor 2005 may also, for example, be a system-on-chip such as those found in mobile devices and include a dedicated graphics processing unit (GPU). Processor 2005 may be based on reduced instruction-set computer (RISC) or complex instruction-set computer (CISC) architectures or any other suitable architecture and may include one or more processing cores. Graphics hardware 2020 may be special purpose computational hardware for processing graphics and/or assisting processor 2005 to process graphics information. In one embodiment, graphics hardware 2020 may include a programmable GPU. Image capture circuitry 2050 may include lens 2080. The lens assembly may have an associated sensor element 2090. Image capture circuitry 2050 may capture still and/or video images that may be processed in accordance with this disclosure, at least in part, by video codec(s) 2055 and/or processor 2005 and/or graphics hardware 2020, and/or a dedicated image processing unit or pipeline incorporated within circuitry 2050. Images so captured may be stored in memory 2060 and/or storage 2065. Microphone 2030 may capture audio recordings that may be processed in accordance with this disclosure, at least in part, by audio codec(s) 2035 and/or processor 2005. Audio recordings so captured may be stored in memory 2060 and/or storage 2065. Memory 2060 may include one or more different types of media used by processor 2005 and graphics hardware 2020 to perform device functions. For example, memory 2060 may include memory cache, read-only memory (ROM), and/or random access memory (RAM). Storage 2065 may store media (e.g., audio, image and video files), computer program instructions or software, preference information, device profile information, and any other suitable data.
Storage 2065 may include one or more non-transitory storage media including, for example, magnetic disks (fixed, floppy, and removable) and tape, optical media such as CD-ROMs and digital video disks (DVDs), and semiconductor memory devices such as Electrically Programmable Read-Only Memory (EPROM) and Electrically Erasable Programmable Read-Only Memory (EEPROM). Memory 2060 and storage 2065 may be used to tangibly retain computer program instructions or code organized into one or more modules and written in any desired computer programming language. When executed by, for example, processor 2005, such computer program code may implement one or more of the methods described herein.
  • While a preferred embodiment has been set forth above, those skilled in the art who have reviewed the present disclosure will readily appreciate that other embodiments can be realized within the scope of the present invention.

Claims (22)

1.-20. (canceled)
21. A system, comprising:
non-transitory computer readable storage media that stores:
sheet music data comprising compositions; and
metadata characterizing each composition, including one or more instruments and a range for each of the one or more instruments;
a graphical user interface that provides functionality for a user to identify a user profile that includes one or more instruments and a range for each of the one or more instruments;
a search engine, executed by a hardware processor, that:
identifies compositions by comparing the one or more instruments and the one or more ranges included in the user profile with the one or more instruments and the one or more ranges of each of the compositions; and
displays the identified compositions to the user via the graphical user interface.
22. The system of claim 21, wherein the search engine compares the one or more instruments and the one or more ranges included in the user profile with the one or more instruments and the one or more ranges of each of the compositions using Manhattan (L1) distance or Jaccard similarity.
23. The system of claim 21, wherein the search engine identifies compositions that include one or more parts corresponding to the one or more instruments, where each of the one or more parts are within the range of the instrument corresponding to that part.
24. The system of claim 21, wherein:
the graphical user interface further provides functionality for the user to search the compositions; and
the search engine:
generates a query based on the user search;
queries the compositions and the metadata characterizing each composition using the query to generate search results; and
returns the search results to the user via the graphical user interface.
25. The system of claim 21, wherein:
the graphical user interface further provides functionality for the user to input audio; and
the search engine:
identifies compositions that are similar to the audio input by the user;
sorts the identified compositions by similarity; and
returns the identified compositions to the user via the graphical user interface.
26. The system of claim 21, wherein the search engine is further configured to identify similar compositions by comparing the one or more instruments and the one or more ranges of the compositions.
27. The system of claim 26, wherein the search engine is further configured to provide functionality to identify recommendations for the user by identifying compositions similar to at least one composition purchased by the user.
28. The system of claim 26, wherein the search engine is further configured to generate a concert program for the user by identifying compositions similar to one or more compositions identified by the user.
29. The system of claim 26, wherein:
the system is further configured to store a plurality of user profiles, each of the user profiles including one or more instruments and a range for each of the instruments; and
the search engine is further configured to identify similar user profiles by comparing the one or more instruments and the one or more ranges included in each of the user profiles.
30. The system of claim 29, wherein the search engine is further configured to identify recommendations for the user by identifying compositions purchased by a user with a similar user profile.
31. A method, comprising:
storing sheet music data comprising compositions;
storing metadata characterizing each composition, including one or more instruments and a range for each of the one or more instruments;
providing a graphical user interface that provides functionality for a user to identify a user profile that includes one or more instruments and a range for each of the one or more instruments;
identifying compositions by comparing the one or more instruments and the one or more ranges included in the user profile with the one or more instruments and the one or more ranges of each of the compositions; and
displaying the identified compositions to the user via the graphical user interface.
32. The method of claim 31, wherein the comparison of the one or more instruments and the one or more ranges included in the user profile with the one or more instruments and the one or more ranges of each of the compositions uses Manhattan (L1) distance or Jaccard similarity.
33. The method of claim 31, wherein compositions are identified that include one or more parts corresponding to the one or more instruments, where each of the one or more parts are within the range of the instrument corresponding to that part.
34. The method of claim 31, wherein the graphical user interface further provides functionality for the user to search the compositions, the method further comprising:
generating a query based on the user search;
querying the compositions and the metadata characterizing each composition using the query to generate search results; and
returning the search results to the user via the graphical user interface.
35. The method of claim 31, wherein the graphical user interface further provides functionality for the user to input audio, the method further comprising:
identifying compositions that are most similar to the audio input by the user;
sorting the identified compositions by similarity; and
returning the identified compositions to the user via the graphical user interface.
36. The method of claim 31, further comprising:
identifying similar compositions by comparing the one or more instruments and the one or more ranges of the compositions.
37. The method of claim 36, further comprising:
identifying recommendations for the user by identifying compositions similar to at least one composition purchased by the user.
38. The method of claim 36, further comprising:
providing functionality to generate a concert program by identifying compositions similar to one or more compositions identified by the user.
39. The method of claim 36, further comprising:
storing a plurality of user profiles, each of the user profiles including one or more instruments and a range for each of the instruments; and
identifying similar user profiles by comparing the one or more instruments and the one or more ranges included in each of the user profiles.
40. The method of claim 39, further comprising:
identifying recommendations for the user by identifying compositions purchased by a user with a similar user profile.
41.-60. (canceled)
US17/476,903 2017-05-25 2021-09-16 Sheet Music Search and Discovery System Pending US20220043854A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US17/476,903 US20220043854A1 (en) 2017-05-25 2021-09-16 Sheet Music Search and Discovery System

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
US201762511025P 2017-05-25 2017-05-25
US15/988,719 US11157553B2 (en) 2017-05-25 2018-05-24 Sheet music search and discovery system
US17/476,903 US20220043854A1 (en) 2017-05-25 2021-09-16 Sheet Music Search and Discovery System

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
US15/988,719 Continuation US11157553B2 (en) 2017-05-25 2018-05-24 Sheet music search and discovery system

Publications (1)

Publication Number Publication Date
US20220043854A1 true US20220043854A1 (en) 2022-02-10

Family

ID=64397069

Family Applications (2)

Application Number Title Priority Date Filing Date
US15/988,719 Active 2039-01-25 US11157553B2 (en) 2017-05-25 2018-05-24 Sheet music search and discovery system
US17/476,903 Pending US20220043854A1 (en) 2017-05-25 2021-09-16 Sheet Music Search and Discovery System

Family Applications Before (1)

Application Number Title Priority Date Filing Date
US15/988,719 Active 2039-01-25 US11157553B2 (en) 2017-05-25 2018-05-24 Sheet music search and discovery system

Country Status (4)

Country Link
US (2) US11157553B2 (en)
AU (3) AU2018271971B2 (en)
CA (1) CA3062700A1 (en)
WO (1) WO2018218034A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11308971B2 (en) * 2020-07-15 2022-04-19 Bank Of America Corporation Intelligent noise cancellation system for video conference calls in telepresence rooms

Families Citing this family (27)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3616143A1 (en) * 2017-04-28 2020-03-04 Covered Insurance Solutions, Inc. System and method for secure information validation and exchange
RU2652461C1 (en) * 2017-05-30 2018-04-26 Общество с ограниченной ответственностью "Аби Девелопмент" Differential classification with multiple neural networks
JP6708179B2 (en) * 2017-07-25 2020-06-10 ヤマハ株式会社 Information processing method, information processing apparatus, and program
KR101942814B1 (en) * 2017-08-10 2019-01-29 주식회사 쿨잼컴퍼니 Method for providing accompaniment based on user humming melody and apparatus for the same
US10885277B2 (en) 2018-08-02 2021-01-05 Google Llc On-device neural networks for natural language understanding
US20200073890A1 (en) * 2018-08-22 2020-03-05 Three10 Solutions, Inc. Intelligent search platforms
US11288975B2 (en) 2018-09-04 2022-03-29 Aleatoric Technologies LLC Artificially intelligent music instruction methods and systems
CN111368996B (en) * 2019-02-14 2024-03-12 谷歌有限责任公司 Retraining projection network capable of transmitting natural language representation
US11080601B2 (en) 2019-04-03 2021-08-03 Mashtraxx Limited Method of training a neural network to reflect emotional perception and related system and method for categorizing and finding associated content
US10614345B1 (en) * 2019-04-12 2020-04-07 Ernst & Young U.S. Llp Machine learning based extraction of partition objects from electronic documents
US11030446B2 (en) * 2019-06-11 2021-06-08 Open Text Sa Ulc System and method for separation and classification of unstructured documents
US10891872B1 (en) * 2019-06-18 2021-01-12 School of Rock, LLC Method and apparatus of music education
CN110415679B (en) * 2019-07-25 2021-12-17 北京百度网讯科技有限公司 Voice error correction method, device, equipment and storage medium
US11341945B2 (en) * 2019-08-15 2022-05-24 Samsung Electronics Co., Ltd. Techniques for learning effective musical features for generative and retrieval-based applications
EP4022553A4 (en) * 2019-08-30 2023-08-23 Tata Consultancy Services Limited System and method for handling popularity bias in item recommendations
CN111010595B (en) * 2019-12-25 2021-08-24 广州欢聊网络科技有限公司 New program recommendation method and device
CN111309866B (en) * 2020-02-15 2023-09-15 深圳前海黑顿科技有限公司 System and method for intelligently searching authoring materials by utilizing semantic fuzzy search
CN111626120B (en) * 2020-04-24 2023-08-22 南京理工大学 Target detection method based on improved YOLO-6D algorithm in industrial environment
CN112069352B (en) * 2020-09-18 2022-05-17 四川长虹电器股份有限公司 Television music recommendation method based on improved content-based
US11720346B2 (en) * 2020-10-02 2023-08-08 International Business Machines Corporation Semantic code retrieval using graph matching
GB2599441B (en) * 2020-10-02 2024-02-28 Emotional Perception Ai Ltd System and method for recommending semantically relevant content
CN113076715A (en) * 2021-04-02 2021-07-06 苏州谋时软件科技有限公司 Method and system for creating character block in Allegro
CA3223067A1 (en) * 2021-06-18 2022-12-22 Sam DRESSER System and method of facilitating live jam session with song recommendation engine to recommend a song to be played
EP4145438A1 (en) * 2021-09-01 2023-03-08 Yousician Oy Automatic music document displaying on performing music
CN113889146A (en) * 2021-09-22 2022-01-04 北京小米移动软件有限公司 Audio recognition method and device, electronic equipment and storage medium
US11790014B2 (en) 2021-12-31 2023-10-17 Microsoft Technology Licensing, Llc System and method of determining content similarity by comparing semantic entity attributes
CN115064184A (en) * 2022-06-28 2022-09-16 镁佳(北京)科技有限公司 Audio file musical instrument content identification vector representation method and device

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020044158A1 (en) * 2000-08-21 2002-04-18 Peyser C. Alan System for storing and retrieving electronic sheet music
US20080056491A1 (en) * 2006-08-31 2008-03-06 Corevalus Systems, Llc Methods and Systems For Managing Digital Sheet Music on a Digital Sheet Music Display System
US20080190272A1 (en) * 2007-02-14 2008-08-14 Museami, Inc. Music-Based Search Engine
US20150059558A1 (en) * 2013-08-27 2015-03-05 NiceChart LLC Systems and methods for creating customized music arrangements
US20150066685A1 (en) * 2013-09-03 2015-03-05 muse IQ, LLC System and method for digital content discovery, recommendations and purchasing
US20160379274A1 (en) * 2015-06-25 2016-12-29 Pandora Media, Inc. Relating Acoustic Features to Musicological Features For Selecting Audio with Similar Musical Characteristics

Family Cites Families (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US3069654A (en) 1960-03-25 1962-12-18 Paul V C Hough Method and means for recognizing complex patterns
US4972349A (en) 1986-12-04 1990-11-20 Kleinberger Paul J Information retrieval system and method
US4996642A (en) 1987-10-01 1991-02-26 Neonics, Inc. System and method for recommending items
US5051745A (en) 1990-08-21 1991-09-24 Pkware, Inc. String searcher, and compressor using same
US5454106A (en) 1993-05-17 1995-09-26 International Business Machines Corporation Database retrieval system using natural language for presenting understood components of an ambiguous query on a user interface
US5726435A (en) 1994-03-14 1998-03-10 Nippondenso Co., Ltd. Optically readable two-dimensional code and method and apparatus using the same
US6272467B1 (en) 1996-09-09 2001-08-07 Spark Network Services, Inc. System for data collection and matching compatible profiles
US5835905A (en) 1997-04-09 1998-11-10 Xerox Corporation System for predicting documents relevant to focus documents by spreading activation through network representations of a linked collection of documents
US6360215B1 (en) 1998-11-03 2002-03-19 Inktomi Corporation Method and apparatus for retrieving documents based on information other than document content
US7013301B2 (en) 2003-09-23 2006-03-14 Predixis Corporation Audio fingerprinting system and method
US6990453B2 (en) 2000-07-31 2006-01-24 Landmark Digital Services Llc System and methods for recognizing sound and music signals in high noise and distortion
US7233937B2 (en) 2001-06-18 2007-06-19 Siebel Systems, Inc. Method, apparatus, and system for searching based on filter search specification
US7518052B2 (en) * 2006-03-17 2009-04-14 Microsoft Corporation Musical theme searching
US8280539B2 (en) 2007-04-06 2012-10-02 The Echo Nest Corporation Method and apparatus for automatically segueing between audio tracks
US7761466B1 (en) 2007-07-30 2010-07-20 Hewlett-Packard Development Company, L.P. Hash-based image identification
KR101459766B1 (en) * 2008-02-12 2014-11-10 삼성전자주식회사 Method for recognizing a music score image with automatic accompaniment in a mobile device
US8494257B2 (en) 2008-02-13 2013-07-23 Museami, Inc. Music score deconstruction
JP5344715B2 (en) 2008-11-07 2013-11-20 国立大学法人北海道大学 Content search apparatus and content search program
US9111547B2 (en) * 2012-08-22 2015-08-18 Kodak Alaris Inc. Audio signal semantic concept classification method
US9037464B1 (en) 2013-01-15 2015-05-19 Google Inc. Computing numeric representations of words in a high-dimensional space

Also Published As

Publication number Publication date
AU2018271971B2 (en) 2023-06-29
US20180341702A1 (en) 2018-11-29
CA3062700A1 (en) 2018-11-29
US11157553B2 (en) 2021-10-26
AU2023204297A1 (en) 2023-07-27
WO2018218034A1 (en) 2018-11-29
AU2023204294A1 (en) 2023-07-27
AU2018271971A1 (en) 2020-01-16

Similar Documents

Publication Publication Date Title
US20220043854A1 (en) Sheet Music Search and Discovery System
Corrêa et al. A survey on symbolic data-based music genre classification
Harte et al. Symbolic Representation of Musical Chords: A Proposed Syntax for Text Annotations.
Byrd et al. Problems of music information retrieval in the real world
Mauch Automatic chord transcription from audio using computational models of musical context
US6225546B1 (en) Method and apparatus for music summarization and creation of audio summaries
López‐huertas Thesaurus structure design: a conceptual approach for improved interaction
Kempfert et al. Where does Haydn end and Mozart begin? Composer classification of string quartets
Cambouropoulos The harmonic musical surface and two novel chord representation schemes
Liu et al. From audio to music notation
Weiß Computational methods for tonality-based style analysis of classical music audio recordings
Van Balen Audio description and corpus analysis of popular music
Pickens Harmonic modeling for polyphonic music retrieval
Gross A set of computer programs to aid in music analysis.
Cournut et al. What are the most used guitar positions?
Lupker et al. Music theory, the missing link between music-related big data and artificial intelligence.
Alvarez et al. Composer classification using melodic combinatorial n-grams
Sutcliffe et al. Searching for musical features using natural language queries: the C@ merata evaluations at MediaEval
Ju Addressing ambiguity in supervised machine learning: A case study on automatic chord labelling
Nápoles López Automatic Roman numeral analysis in symbolic music representations
Panteli Computational analysis of world music corpora
López Automatic roman numeral analysis in symbolic music representations
Foscarin The Musical Score: a challenging goal for automatic music transcription
Židek Controlled music generation with deep learning
Schelehoff Optical music recognition: overview, challenges, and possibilities

Legal Events

Date Code Title Description
STPP Information on status: patent application and granting procedure in general

Free format text: DOCKETED NEW CASE - READY FOR EXAMINATION

STPP Information on status: patent application and granting procedure in general

Free format text: NON FINAL ACTION MAILED

STPP Information on status: patent application and granting procedure in general

Free format text: RESPONSE TO NON-FINAL OFFICE ACTION ENTERED AND FORWARDED TO EXAMINER

STPP Information on status: patent application and granting procedure in general

Free format text: FINAL REJECTION MAILED

STPP Information on status: patent application and granting procedure in general

Free format text: NON FINAL ACTION MAILED