SE544738C2 - Method and system for recognising patterns in sound - Google Patents

Method and system for recognising patterns in sound

Info

Publication number
SE544738C2
Authority
SE
Sweden
Prior art keywords
representation vector
sound
dictionary
audio signal
pattern representation
Prior art date
Application number
SE2051550A
Other languages
Swedish (sv)
Other versions
SE2051550A1 (en)
Inventor
Stanislaw Gorlow
Original Assignee
Algoriffix Ab
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Algoriffix Ab filed Critical Algoriffix Ab
Priority to SE2051550A priority Critical patent/SE544738C2/en
Publication of SE2051550A1 publication Critical patent/SE2051550A1/en
Publication of SE544738C2 publication Critical patent/SE544738C2/en

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/60 - Information retrieval; Database structures therefor; File system structures therefor of audio data
    • G06F16/68 - Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/683 - Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10G - REPRESENTATION OF MUSIC; RECORDING MUSIC IN NOTATION FORM; ACCESSORIES FOR MUSIC OR MUSICAL INSTRUMENTS NOT OTHERWISE PROVIDED FOR, e.g. SUPPORTS
    • G10G1/00 - Means for the representation of music
    • G10G1/04 - Transposing; Transcribing
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10H - ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H3/00 - Instruments in which the tones are generated by electromechanical means
    • G10H3/12 - Instruments in which the tones are generated by electromechanical means using mechanical resonant generators, e.g. strings or percussive instruments, the tones of which are picked up by electromechanical transducers, the electrical signals being further manipulated or amplified and subsequently converted to sound by a loudspeaker or equivalent instrument
    • G10H3/125 - Extracting or recognising the pitch or fundamental frequency of the picked up signal
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10H - ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H2210/00 - Aspects or methods of musical processing having intrinsic musical character, i.e. involving musical theory or musical parameters or relying on musical knowledge, as applied in electrophonic musical tools or instruments
    • G10H2210/031 - Musical analysis, i.e. isolation, extraction or identification of musical elements or musical parameters from a raw acoustic signal or from an encoded audio signal
    • G10H2210/046 - Musical analysis, i.e. isolation, extraction or identification of musical elements or musical parameters from a raw acoustic signal or from an encoded audio signal for differentiation between music and non-music signals, based on the identification of musical parameters, e.g. based on tempo detection
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10H - ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H2210/00 - Aspects or methods of musical processing having intrinsic musical character, i.e. involving musical theory or musical parameters or relying on musical knowledge, as applied in electrophonic musical tools or instruments
    • G10H2210/031 - Musical analysis, i.e. isolation, extraction or identification of musical elements or musical parameters from a raw acoustic signal or from an encoded audio signal
    • G10H2210/066 - Musical analysis, i.e. isolation, extraction or identification of musical elements or musical parameters from a raw acoustic signal or from an encoded audio signal for pitch analysis as part of wider processing for musical purposes, e.g. transcription, musical performance evaluation; Pitch recognition, e.g. in polyphonic sounds; Estimation or use of missing fundamental
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10H - ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H2250/00 - Aspects of algorithms or signal processing methods without intrinsic musical character, yet specifically adapted for or used in electrophonic musical processing
    • G10H2250/131 - Mathematical functions for musical analysis, processing, synthesis or composition
    • G10H2250/145 - Convolution, e.g. of a music input signal with a desired impulse response to compute an output
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Acoustics & Sound (AREA)
  • Library & Information Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Auxiliary Devices For Music (AREA)

Abstract

Method for interpreting a digital audio signal (AS), comprising the steps of a) determining a dictionary (D) of characteristic sound patterns (SP); b) determining a time window (TW) of said audio signal; c) transforming said time window to the frequency domain to obtain a snapshot (X); d) calculating, using a selected multiplicative update rule, a pattern representation vector (r) minimising a distance measure between the snapshot (X) and an approximation (Dr) being a linear combination of said sound patterns multiplied by the pattern representation vector; and e) repeating using a subsequent time window, wherein both said dictionary (D) and said pattern representation vector (r) have only non-negative elements, and wherein step d is performed using a cost function that includes an elementwise Bregman divergence and in addition thereto one or several specified penalty terms. The invention also relates to a system and to a computer software product.

Description

Method and system for recognising patterns in sound
The present invention relates to a method and a system for recognising patterns in sound, in particular tonal sound, such as music or singing. In other words, the present invention relates to the interpretation of sound, which means the cognition of basic elements in sound. The present invention may further relate to transcription of sound, i.e., the recognition and interpretation of musical notes, in the case of music or singing.
Systems for automatic music transcription have been developed for years and decades, as such systems can be useful in many situations. A typical use case is when a musician performs a piece of music and a computing system recognises the played notes and writes them down in musical notation and/or responds with feedback. There are also other use cases, where the recognised notes can be used in many ways. As an example, the recognised notes can be used to generate new musical elements, such as a countermelody or a chord progression. This is a matter of particular interest in human-machine interactive composition, which has a strong connection with live music, electronic music, and computer music.
Practicing a musical instrument is usually associated with professional supervision and personalised feedback, especially for an unskilled apprentice. This is particularly true for a novice. Otherwise, fatigue may set in quickly, and even the most talented student can lose interest in continuing with practice or even in learning music as such. And yet, not everybody is willing to pay a personal tutor, especially if the outcome is unclear. Other factors, such as dispensability, can also influence one's decision. A reasonable compromise may consist in learning agents that take the role of the tutor. And to avoid further spending on expensive hardware, the agents would preferably be installed on a tablet computer, which as of today is equipped with a speaker and a microphone.
Commonly, sound is captured as a waveform signal over one or several microphones. It is this waveform signal which can be digitally processed with the aim of extracting musical information, such as the presence of complex tones or notes.
This task has proven to be challenging for several reasons.
Firstly, there is the problem of resilience to disturbances and noise. Often, captured sound is polluted by noise of different types, and it may become difficult to identify the sounds of interest.
Secondly, in monophonic music it is often a matter of basic ear training to identify the pitch of individual notes. However, when the texture is polyphonic, the identification and interpretation become much more difficult. In addition, complex tones consist of multiple partials that are not always an integer multiple of the fundamental and the strengths of which can be influenced in many ways.
Thirdly, there is the temporal aspect. Processing captured sound in retrospect can be done from beginning to end, several times in a row, under different angles. In many situations, however, it is desirable to interpret a detected sound in (near) real-time, while the signal is being captured. For example, a musician may wish to receive immediate feedback from a device while he/she is playing an instrument. The device must be able to ”listen” to the produced sounds and make out a meaning from it within tens of milliseconds to respond.
The problem of real-time processing and analysis is not solved with computing power alone, since the detection of tones requires some minimum amount of data that is accumulated over a sufficiently long time.
The thesis ”REAL-TIME MUSICAL ANALYSIS OF POLYPHONIC GUITAR AUDIO” by John Hartquist, which was presented to the Faculty of California Polytechnic State University in June 2012, describes a method for real-time polyphonic music transcription. The method relies on a predetermined dictionary of harmonic templates for each playable note. The guitar signal is transformed to the frequency domain using the discrete Fourier transform and compared against the note templates in the dictionary. The corresponding likelihood is obtained by means of non-negative matrix factorisation (NMF). The optimisation problem is thus solved using an iterative approach and a cost function from the class of Bregman divergences.
US 2017/0243571 A1 describes a similar method for transcribing piano recordings.
One problem with transcription techniques that rely only on a Bregman divergence as the goodness of fit is robustness. A signal captured in a room usually is not without noise or reverberation. Moreover, interference always occurs between the partials of a tone, and even more so between the partials of multiple tones. The body of a stringed instrument, e.g., resonates at certain frequencies and affects the strengths of the partials, which can change the tone colour. The instrument can also be slightly untuned, shifting the partials in frequency. This means that tone patterns of a predefined or pre-learnt dictionary never perfectly match with the sounds of a captured signal, leading to recognition errors and inaccuracies in the transcription.
The present invention addresses the problems described above.
Hence, the invention relates to a method for interpreting a digital audio signal, comprising the steps of a) determining a dictionary defining a set of different characteristic sound patterns; b) determining a time window of a certain length of said digital audio signal; c) transforming said time window of the digital audio signal to the frequency domain to obtain a transformed snapshot; d) calculating, using a selected multiplicative update rule, a pattern representation vector that minimises a distance measure between the snapshot and an approximation, being a linear combination of said sound patterns defined by said dictionary multiplied by the coefficients of the pattern representation vector; and e) repeating from step b using a subsequent time window, wherein both said dictionary and said pattern representation vector have only non-negative elements, wherein step d is performed using a cost function that includes an elementwise Bregman divergence and in addition thereto one or several penalty terms, said one or several penalty terms being selected from the list of a first penalty term, comprising a diversity index, preferably the Shannon index, being a measure of the number of different patterns in the approximation according to their respective proportional abundancies, as quantified by the pattern representation vector; a second penalty term, comprising a norms difference between the taxicab norm and the Euclidian norm of the pattern representation vector; a third penalty term, comprising a sum of off-diagonal entries in the correlation matrix of the pattern representation vector; and a fourth penalty term, comprising a sum of the squares of the differences between elements of said pattern representation vector and their smoothed values, said smoothed values being calculated for each element in question separately, to represent a time-smoothed representation of the element in question.
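By way of illustration only, the following Python/NumPy sketch evaluates such a composite cost for a given snapshot x, dictionary D and candidate representation vector r. The helper name penalised_cost, the penalty weights w and the particular β value are illustrative assumptions and are not prescribed by the text; the outer product is used here as a simple stand-in for the correlation matrix of r.

    import numpy as np

    def beta_divergence(x, y, beta=1.5, eps=1e-12):
        # Elementwise beta-divergence (a Bregman divergence), summed over all bins.
        x, y = x + eps, y + eps
        if beta == 1:                    # generalised Kullback-Leibler
            return np.sum(x * np.log(x / y) - x + y)
        if beta == 0:                    # Itakura-Saito
            return np.sum(x / y - np.log(x / y) - 1)
        return np.sum((x**beta + (beta - 1) * y**beta - beta * x * y**(beta - 1))
                      / (beta * (beta - 1)))

    def penalised_cost(x, D, r, r_smoothed, w=(0.1, 0.1, 0.1, 0.1), beta=1.5, eps=1e-12):
        # Data term: divergence between the snapshot x and the approximation Dr.
        cost = beta_divergence(x, D @ r, beta)
        # First penalty: Shannon diversity index of the proportional abundancies.
        p = r / (np.sum(r) + eps)
        shannon = -np.sum(p * np.log(p + eps))
        # Second penalty: difference between taxicab (L1) and Euclidean (L2) norms.
        norms_diff = np.sum(np.abs(r)) - np.linalg.norm(r)
        # Third penalty: sum of the off-diagonal entries of the outer product r r^T.
        C = np.outer(r, r)
        off_diag = np.sum(C) - np.trace(C)
        # Fourth penalty: squared deviation from the time-smoothed representation.
        smooth = np.sum((r - r_smoothed) ** 2)
        return cost + w[0] * shannon + w[1] * norms_diff + w[2] * off_diag + w[3] * smooth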
Furthermore, the invention relates to a system for interpreting a sampled sound signal, said system being arranged with a dictionary defining a set of different characteristic sound patterns, said system being arranged to determine, in a step b, a time window of a certain length of said digital audio signal, said system being arranged to transform, in a step c, said time window of the digital audio signal to the frequency domain to obtain a transformed snapshot, said system being arranged to calculate, in a step d, using a selected multiplicative update rule, a pattern representation vector that minimises a distance measure between the snapshot and an approximation, being a linear combination of said sound patterns defined by said dictionary multiplied by the coefficients of the pattern representation vector, and said system being arranged to repeat from said step b using a subsequent time window, both said dictionary and said pattern representation vector having only non-negative elements, said system being arranged to perform said calculation of said pattern representation vector using a cost function that includes an elementwise Bregman divergence and in addition thereto one or several penalty terms, said one or several penalty terms being selected from the list of a first penalty term, comprising a diversity index, preferably the Shannon index, being a measure of the number of different patterns in the approximation according to their respective proportional abundancies, as quantified by the pattern representation vector; a second penalty term, comprising a norms difference between the taxicab norm and the Euclidian norm of the pattern representation vector; a third penalty term, comprising a sum of off-diagonal entries in the correlation matrix of the pattern representation vector; and a fourth penalty term, comprising a sum of the squares of the differences between elements of said pattern representation vector and their smoothed values, said smoothed values being calculated for each element in question separately, to represent a time-smoothed representation of the element in question.
Moreover, the invention relates to a computer software product for interpreting a sampled sound signal, said computer software product comprising or being associated with a dictionary defining a set of different characteristic sound patterns, said computer software product being arranged to, in a step b, when executed on computer hardware, determine a time window of a certain length of said digital audio signal; transform, in a step c, said time window of the digital audio signal to the frequency domain to obtain a transformed snapshot; calculate, in a step d, using a selected multiplicative update rule, a pattern representation vector that minimises a distance measure between the snapshot and an approximation, being a linear combination of said sound patterns defined by said dictionary multiplied by the coefficients of the pattern representation vector; and repeat, from step b, using a subsequent time window, both said dictionary and said pattern representation vector having only non-negative elements, said computer software product being arranged to, when executed on computer hardware, perform said calculation of said pattern representation vector using a cost function that includes an elementwise Bregman divergence and in addition thereto one or several penalty terms, said one or several penalty terms being selected from the list of a first penalty term, comprising a diversity index, preferably the Shannon index, being a measure of the number of different patterns in the approximation according to their respective proportional abundancies, as quantified by the pattern representation vector; a second penalty term, comprising a norms difference between the taxicab norm and the Euclidian norm of the pattern representation vector; a third penalty term, comprising a sum of off-diagonal entries in the correlation matrix of the pattern representation vector; and a fourth penalty term, comprising a sum of the squares of the differences between elements of said pattern representation vector and their smoothed values, said smoothed values being calculated for each element in question separately, to represent a time-smoothed representation of the element in question.
In the following, the invention will be described in detail, with reference to exemplifying embodiments of the invention and to the enclosed drawings, wherein:
Figure 1 illustrates a system according to the present invention, arranged to perform a method according to the invention;
Figure 2 is a flow chart illustrating a method according to the present invention;
Figure 3a illustrates a first sound pattern definition according to the invention;
Figure 3b illustrates a second sound pattern definition according to the invention;
Figure 4 illustrates a dictionary according to the present invention; and
Figure 5 illustrates the processing of a digital audio signal according to a method of the present invention.
Hence, Figure 1 shows a system 100 according to the invention, the system 100 being arranged to interpret a digital audio signal AS by performing a method according to the present invention.
The system 100 comprises a computer software product, arranged to be executed on virtual or physical hardware, and when executed, arranged to perform a method according to the present invention. In Figure 1, this is exemplified by a computer 110, on the hardware of which said computer software product is arranged to be executed. The computer 110 may be a desktop computer, a laptop computer, a tablet computer, a mobile phone or similar. The computer 110 may also be virtual hardware, such as a cloud-hosted virtual server.
The computer 110 may be standalone or distributed. In some embodiments, the computer 110 may be associated with or comprise one or more different hardware interfaces accessible by said computer software product.
Such hardware interface may comprise an acoustic sound capturing device 111, such as a microphone. The sound capturing device 111 may be arranged to capture acoustic sound in one or several channels, such as in mono or stereo, and to provide a corresponding electric signal. The sound capturing device 111 may also comprise an analogue-to-digital converter, arranged to provide a digital audio signal interpretable by said computer software product.
Acoustic sound, detectable by the sound capturing device 111, may be generated by any sound source, such as a human voice or a musical instrument. In Figure 1, an exemplifying sound source in the form of an acoustic piano 200 is shown. In general, the sound source may be any sound generating entity, such as a human being or an acoustic and/or electric instrument, such as a stringed instrument, a wind instrument, or even a percussion instrument, or a mixture of different sound sources. In preferred embodiments, the sound source 200 is a pitched sound source in the sense that it can produce different sound patterns differing with respect to frequency and preferably being able to produce sound having different fundamental frequencies that a human listener perceives as different notes. Such a sound pattern may hence be associated with a fundamental frequency and one or several additional frequencies, such as integer multiples of the fundamental (harmonics). A single sound source may furthermore be arranged to produce several such sound patterns simultaneously, such as a chord played on a guitar or a piano.
Such hardware interfaces may also comprise an acoustic sound generating device 112, such as a loudspeaker. The sound generating device 112 may be arranged to generate acoustic sound over one or several channels, such as in mono or stereo, from a provided electric signal. The sound generating device 112 may also comprise a digital-to-analogue converter, arranged to produce an analogue electric signal from a digital audio signal provided by said computer software product, and to use the analogue signal to produce said acoustic sound.
The computer 110 may be connected to the sound capturing device 111 and to the sound generating device 112 via a suitable respective wired or wireless connection.
Figure 1 also shows a computer server 120, that may be physical or virtual hardware, a distributed or standalone computing entity, in a way corresponding to what has been said in relation to the computer 110. The computer 110 and the server 120 may communicate digitally via a network 10, such as the Internet.
As the word is used herein, the term ”hardware” denotes specialised or general-purpose computer hardware, typically comprising a CPU, RAM, and a computer bus, as well as any required peripheral interfaces and any conventional computer hardware, such as a GPU. Instead of, or in addition to, a digital audio signal provided by the sound capturing device 111, the computer software product may be arranged to process and interpret a digital sound signal provided to the computer 110 from the server 120. In this case, the digital sound signal may be a pre-recorded and/or synthesised digital sound signal, being provided to the computer software product either in a file-based fashion, for instance as a complete music file, or in a chunk-based fashion as in streaming. The sound signal received by the computer software product from the sound capturing device 111 may, alternatively, be captured in (near) real time, as the acoustic sound is generated by the sound source 200, and continuously provided to the computer software product as a continuous signal. Irrespectively, the audio signal may be provided to the computer software product in any conventional data format, providing encoding and possibly compression, such as a WAV data file.
As used herein, the terms ”audio” and ”sound” both refer to an acoustic wave that may be physical or represented as a signal. Hence, an ”audio signal” refers to such an analogue or digitally coded signal representing said audio.
As used herein, a ”sound pattern” is any pattern of a sound defined in terms of frequencies and amplitudes. For instance, a sound pattern may be defined in terms of a frequency-amplitude profile. A sound pattern may be associated with a time dimension, so that the sound pattern may be defined to change in a particular way over time. However, in preferred embodiments of the invention a sound pattern is static in relation to time. In the below-described dictionary, comprising definitions of such sound patterns, each definition of a sound pattern may cover or comprise several different variants of the sound pattern in question, such as to reflect different playing techniques.
A ”tone”, as used herein, is the sound resulting from the generation of a particular sound pattern. Such a tone is typically associated with a pitch, typically associated with a fundamental frequency, being the lowest frequency in the sound pattern, and a timbre, also known as tone colour or tone quality. For a piano, the tone of a key on the keyboard would be the sound resulting from the string being struck by pressing down the key in question. A tone may be generated with relatively stronger or weaker amplitude, usually referred to as velocity.
A ”timbre”, as used herein, refers to the frequency profile of a sound pattern, given a fundamental frequency of the sound pattern. In the example of the piano key, the timbre of the sound pattern refers to the spectral envelope of a tone produced by striking the string in question. For one and the same instrument, different tones may be associated with a similar timbre.
A ”note”, as used herein, refers to a piece of information, typically comprising a pitch and defining a particular playable sound pattern on an instrument. Such ”notes” can be defined using musical notation. For some sound sources, such as fretted instruments like the guitar, notes with the same pitch can be played in several different positions on the fretboard. The timbre of one and the same note played in different positions on the fretboard may differ significantly.
Figure 2 illustrates a method according to the present invention, for interpreting a digital audio signal AS of the type generally illustrated in Figure 1. Figure 5 is a simplified overview illustrating the processing of the audio signal AS according to the present invention. As is clear from Figure 5, the audio signal AS is produced as at least the concatenation of different tones TO generated in parallel and/or in series to each other over time. For instance, the tones TO may be played on one or several instruments. A time window TW is moved along the audio signal AS (along a time axis), such as with an overlap, and for every new position of the time window TW the audio signal is multiplied by a window function WF. The windowed audio signal is subjected to a time-frequency transform and subsequent analysis in the frequency domain, resulting in a symbolic representation of the audio signal AS at the position of the time window TW in question, expressed as a particular combination of sound patterns SP recognised in the audio signal. Each sound pattern SP is shown in Figure 5 as a (possibly unique) distribution of relative amplitudes (Y-axis) for a set of frequencies (X-axis). If the recognised sound patterns SP were to be generated simultaneously as a sound, together they would approximate the sound observed in the audio signal AS at the position of the time window TW in question.
As mentioned, the present method can be performed on an audio signal AS in the form of a data stream being made available to the computer software product in real-time, or at least continuously, but it is also envisioned to use the present method to process music files or groups of files, as in batch processing. Since a file can be viewed as a stream of audio data of limited duration, the corresponding method steps as described herein below for an audio file stream can be applied to either a continuously provided data stream or an already existing audio data file.
The present invention achieves a flexible, tuneable, compact, and fast method of interpreting a digital audio signal AS and can produce high-quality results in (near) real-time, even when implemented using standard computer hardware. In a first step, the method starts. In a subsequent step, the audio signal AS is provided, as discussed. In another step, that may be performed before or after said audio signal AS provision step, a dictionary D is determined. Figure 4 illustrates an example of the dictionary D, in the form of a matrix, defining for each of the eleven sound patterns SP1, ..., SPI (I being the total number of sound patterns SP) relative amplitudes for each of a set of seven frequencies F1, ..., FK (K being the number of considered frequencies). It is understood that this is a simplified view, and in a practical application both the number of sound patterns and frequencies would be larger. The role of such a dictionary D in the context of the present invention is to define the set of expected sound patterns SP (SP1-SP11) that may be recognised in said sound signal AS and used jointly to find an approximation to said sound signal AS. To be precise, the dictionary D numerically defines a set of different characteristic sound patterns SP. Such definition may be in the form of a set of different frequencies, associated with a corresponding set of amplitudes. For instance, the dictionary D may define a function that associates a frequency value to an amplitude, or the dictionary D may define a set of one or more (preferably at least two or more) frequencies and for each such frequency a corresponding amplitude.
One way to interpret the information in the dictionary D defining such a sound pattern SP is as a characteristic spectral symbolic representation (fingerprint) resulting from an activation of the sound pattern SP in question in isolation, without activating any other sound pattern SP and with no ambient sounds or noise. For instance, the sound pattern SP may define the characteristic sound of a particular string being picked or plucked when the string is pressed down at a particular fret, or of a particular key pressed down on a piano. However, the sound pattern SP may also represent other types of musical primitives, such as a particular vocal sound (e.g., a vowel) produced by a singer or a particular sound made by any sound-generating device in any situation. The sound pattern SP may define a ”musical” sound in the sense that the sound pattern SP relates to a particular musical tone, of a certain timbre. But the sound pattern SP may also define any sound primitive that per se does not easily translate to a particular played or sung tone, occurring in either musical or non-musical contexts (atonal sounds).
For convenience, the dictionary D may be stored as a matrix representation available to said computer software product.
That the dictionary D is ”determined” may mean that it is established or calculated, by the computer 110 and/or the server 120, based on available information, or that it is provided to the computer 110 and/or the server 120, or a combination of the two (such that the computer 110 and/or the server 120 receives a dictionary and modifies it). The dictionary D may be selected, modified, or calculated based on information that is specific to the implementation, using a priori available information about the digital audio signal AS. For instance, the dictionary may be created with the purpose of extracting musical information from an audio signal knowing that the audio signal contains representations of sound from an acoustic guitar or from a female singer, and the dictionary D may then be created to specifically define sound patterns SP typically produced by such an acoustic guitar or singer.
Such a bespoke dictionary D may then be pre-created and provided to the computer 110 and/or server 120, such as selected from a library of pre-created dictionaries stored in a database. The dictionary may also be created on-the-fly by combining a plurality of different pre-defined sound pattern SP definitions together forming the created dictionary D. In this and other embodiments, the dictionary D may then contain sound pattern SP definitions for several different sound sources, such as for different instruments. For instance, the dictionary D may contain sound pattern SP definitions for several different tones that can be generated by an acoustic guitar and/or by an acoustic piano and/or by a singer.
The creation of the dictionary D in a way specifically tailored to the situation may take place in different ways. In general, the dictionary D is created by collecting a plurality, such as at least three, such as at least ten, such as at least fifty, different sound pattern SP definitions of the above-described type and forming the dictionary D from these collected sound patterns SP. For instance, and as illustrated in Figure 4, the dictionary D may then be formed as a matrix having as rows or columns the different relative amplitudes, for a discrete set of different frequencies, representative of the sound pattern SP in question.
One way to create such sound pattern SP representations is to record a sample of the sound pattern SP and determine the frequency-amplitude information content of the signal. For instance, the recorded event may be the picking of one particular string on a guitar, with the string being pressed down on a particular fret. Another way to create such sound pattern representations is to calculate the sound pattern SP definition based on a physical model of the device generating the sound. For instance, for string or wind instruments the produced sounds are the result of the physical properties of the instrument in question, properties that are known to a large extent and similar across different individual instruments belonging to one and the same class of instruments.
So, for both string and wind instruments a played tone will in general comprise a fundamental frequency, typically associated with the length of a vibrating string on a guitar or piano or the length of the tube in a flute. Several overtones will also be generated, representing standing waves of higher frequencies along the vibrating string or column of air. The calculated sound pattern SP representation will then typically contain the fundamental frequency of that particular tone defined with a particular amplitude, such as an amplitude of 1, and additional frequency/amplitude pairs for each overtone. Such sound patterns SP may represent reasonably well many or all instruments of the same class of instruments, due to physical similarities between such instruments of the same class.
As used herein, the term ”class” of instruments relates to any group of instruments sharing one or several determined common characteristics. For instance, ”guitars” may be such a class, or ”acoustic guitars”, or ”six-stringed acoustic guitars”, or even ”all guitars of a particular model from a particular guitar-making brand”. Hence, class definitions in a narrower or wider sense may be warranted in different circumstances, depending on the prerequisites and aims for the concrete implementation of the invention. In general, the determination of the dictionary D may comprise a step of, for at least one pitched musical instrument or at least one class of pitched musical instruments, and for each note from a set of playable notes on such instrument or class of instruments, synthesising a pattern based on a physical model of an acoustic resonator of the instrument or class in question, and further based on the one-dimensional wave equation. It has turned out that a reasonable simplification is to assume that, for said instrument or class of instruments, said physical model assumes the same relative amplitudes of the respective overtones for all playable notes. Hence, if each playable note for example is associated with a fundamental frequency and five overtone frequencies, the sound patterns SP for that instrument or class of instruments are represented using the same amplitudes for the six different frequencies associated with each playable note. Another way of expressing this is that one and the same spectral envelope is used for the entire set of playable notes for the instrument or instrument class in question.
This may refer to a specific set of amplitudes or a mapping between frequencies and amplitudes, as the case may be. Typically, the relative amplitude decreases with increasing frequency from the fundamental frequency upwards. Figure 3a illustrates a sound pattern SP representation, relating harmonic partials to specific amplitude values for the example of a random note. Figure 3b illustrates a sound pattern representation, in which the amplitude values are obtained from a corresponding function. In particular, the inventor has discovered that it is often possible to limit the frequency-domain analysis described below to a finite set of frequencies, such as to at most 10 distinct frequencies for each playable note and/or at most 200 distinct frequencies for each set of playable notes of each detected sound-generating device. This results in computational efficiency. Such a finite set of frequencies may depend on a particular type of sound-generating device, such as on a particular instrument or class of instruments.
One way of selecting such a finite set of frequencies is to use only the fundamental frequency and a corresponding set of partials for each playable note to define each sound pattern SP. This will result in a dictionary D comprising information about amplitudes of harmonics for a set of such fundamental frequencies. In particular, the dictionary D may be limited to information about amplitudes of such fundamental frequencies and partials for a set of playable notes from one or several sound-generating devices. It is realized that a ”frequency”, in this context, may be defined as a distinct frequency or as a narrow frequency band, for instance centred about a distinct frequency. The frequency of a partial of one playable note may coincide or almost coincide with the fundamental or a partial of a different playable note. By adjusting the width of said frequency band, one single ”frequency” may suffice to represent said respective frequencies of both playable notes. This way, the number of analysed frequencies can be limited, resulting in a smaller dictionary D and reduced computational complexity.
As mentioned above, each of said fundamental frequencies used when constructing such a dictionary D may correspond to a respective playable note on a particular instrument or class of instruments. In some embodiments, said frequencies selected as fundamental frequencies and associated partials may be selected as the frequencies associated with or specified in the MIDI tuning standard, MTS, for a given concert A up to the Nyquist frequency. This will be exemplified in the following. In general, the frequency scale used for the time-frequency analysis may be derived from the MIDI tuning standard, limited in range in accordance with the frequency range of said instrument or class of instruments. For a tuned regular guitar, for instance, the frequency scale ranges from 82.41 Hz (E2) to about 5 kHz (from the lowest fundamental frequency to the highest harmonic), and correspondingly for a four-string bass guitar the scale starts at 41.20 Hz (E1).
The frequency scale may be subdivided into discrete data points between dmin and dmax, where
d = 12 log2(f / 440 Hz) + 69.
The corresponding frequencies are given by
f = 2^((d - 69)/12) · 440 Hz
with d = dmin, dmin + 1, ..., dmax - 1, dmax. As outlined above, the present method makes use of representation learning to recognize patterns of complex tones. This is achieved as described in detail below, by formulating the task of identifying sound patterns SP in the audio signal AS as a non-convex optimization problem and using said dictionary D defining typical patterns or atoms that are determined beforehand.
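As a small worked illustration of this mapping (standard 12TET with concert A at 440 Hz; the Python/NumPy sketch below and its 5 kHz upper limit for a guitar are illustrative rather than limiting):

    import numpy as np

    def freq_to_midi(f_hz, concert_a=440.0):
        # d = 12 * log2(f / 440 Hz) + 69
        return 12.0 * np.log2(f_hz / concert_a) + 69.0

    def midi_to_freq(d, concert_a=440.0):
        # f = 2**((d - 69) / 12) * 440 Hz
        return concert_a * 2.0 ** ((np.asarray(d, dtype=float) - 69.0) / 12.0)

    # Frequency scale for a regular guitar in standard tuning: from E2 (MIDI 40,
    # about 82.41 Hz) up to roughly 5 kHz, covering the highest analysed harmonic.
    d_min = 40
    d_max = int(np.floor(freq_to_midi(5000.0)))
    scale = midi_to_freq(np.arange(d_min, d_max + 1))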
One way of constructing the dictionary D is by supervised learning, which learns the sound patterns automatically. All it requires is at least one sound example (audio signal) for each note in the dictionary and the corresponding pitches (labels), which can be known a priori or inferred from the audio signal using a pitch detection algorithm. Such a supervised learning algorithm can be iterative and resemble non-negative matrix factorisation. Using non-negative matrix factorisation, e.g., one would compute a spectrogram V for each note signal, such as for the note C4, and factorise the spectrogram into a matrix W and a matrix H while setting the rank of the factorisation (number of patterns per note) to, e.g., 1. The K-by-1 vector W can then be added to the dictionary D as the sound pattern for the note C4, while the 1-by-N vector H is discarded. This procedure is to be repeated for every note in the dictionary.
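A possible sketch of this supervised, per-note learning step is shown below (Python, using scikit-learn's NMF with rank 1 on a magnitude spectrogram); the STFT parameters, the chosen divergence and the helper name learn_note_pattern are illustrative assumptions rather than values prescribed by the text.

    import numpy as np
    from sklearn.decomposition import NMF

    def learn_note_pattern(note_signal, n_fft=4096, hop=512):
        # Magnitude spectrogram V (K-by-N) of one labelled single-note recording.
        frames = np.lib.stride_tricks.sliding_window_view(note_signal, n_fft)[::hop]
        V = np.abs(np.fft.rfft(frames * np.hanning(n_fft), axis=1)).T
        # Rank-1 non-negative factorisation V ~ W @ H; keep W, discard H.
        model = NMF(n_components=1, beta_loss='kullback-leibler', solver='mu',
                    init='random', max_iter=500, random_state=0)
        W = model.fit_transform(V)            # K-by-1 column: the learnt sound pattern
        return W[:, 0] / W[:, 0].max()        # normalise the peak before adding to D

    # The dictionary D is then assembled column by column, one call per labelled note,
    # e.g. D[:, i] = learn_note_pattern(recording_of_c4)   # hypothetical recording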
The advantage of supervised learning is that one can create an instrument-specific dictionary D that considers the timbre (tone colour) of each note. A drawback is that one requires labelled data for all playable notes. Instead, the dictionary D may be constructed in the following way, achieving a general-purpose dictionary D for complex tones.
Here, the guitar is used as an example. Knowing the number of strings, for instance six strings, the number of frets, such as nineteen, and the tuning, e.g., E2-A2-D3-G3-B3-E4 or 40-45-50-55-59-64 in MIDI note numbers, we also know the total number of sound patterns SP and all the fundamental frequencies of the corresponding tones:
f0(i, j) = 2^((d(i) + j - 69)/12) · 440 Hz,
wherein i is the string index, d(i) is the MIDI note number of the open string, and j is the fret index on the fingerboard. The fundamental of the open string is addressed with j = 0. For each note on the guitar, a time signal is generated that has the length of the time window (see below) and which consists of pure tones of a harmonic series. Each harmonic corresponds to the frequency of a vibrating mode, which is an integer multiple of the fundamental. Since the MIDI tuning standard is based on a chromatic scale (12-tone equal temperament, 12TET), which is slightly out of tune with respect to just intonation, we may add a correction factor c. Considering the first K = 31 harmonics, a complex tone is generated as
s_ij(n) = sum over k = 1..K of a_ik · sin(2π · k · c_k · f0(i, j) · n / fs), with c_k = 2^(Δk/1200),
wherein Δk is the deviation in cents of the k-th harmonic from a just interval. The exact deviation values are shown in the following Table 1:
Table 1
Harmonic (k)            Deviation (Δk, cents)
1, 2, 4, 8, 16          0
17                      +5
9, 18                   +4
19                      -2
5, 10, 20               -14
21                      -29
11, 22                  -49
23                      +28
3, 6, 12, 24            +2
25                      -27
13, 26                  +41
27                      +6
7, 14, 28               -31
29                      +30
15, 30                  -12
31                      +45
The strength a_ik of the respective harmonic, k = 1, 2, ..., K, is computed from the plucked-string amplitude model derived from the one-dimensional wave equation (see further below). As mentioned above, it has proven beneficial to use a single spectral envelope for all tones, irrespective of their location on the fretboard. By doing so, the complexity of the algorithm can be reduced, since one can limit the dictionary to tones with unique pitches. We can do the same with a trained or learned dictionary, by assigning tones with the same pitch the same label before training.
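A sketch of this tone generation is given below (Python/NumPy). The deviation table follows Table 1 above; the sample rate, the window length and the relative plucking position are illustrative assumptions, and the harmonic strengths follow the plucked-string relation discussed further below in the description.

    import numpy as np

    # Deviations (in cents) of harmonics 1..31 from the nearest 12TET pitch (Table 1).
    DEVIATION_CENTS = np.array([0, 0, 2, 0, -14, 2, -31, 0, 4, -14, -49, 2, 41, -31,
                                -12, 0, 5, 4, -2, -14, -29, -49, 28, 2, -27, 41, 6,
                                -31, 30, -12, 45], dtype=float)

    def complex_tone(open_string_midi, fret, fs=16000, length=4096, n_harm=31, pluck=1.0/6.0):
        # Fundamental of string i at fret j: f0(i, j) = 2**((d(i) + j - 69) / 12) * 440 Hz.
        f0 = 2.0 ** ((open_string_midi + fret - 69) / 12.0) * 440.0
        k = np.arange(1, n_harm + 1)
        c = 2.0 ** (DEVIATION_CENTS[:n_harm] / 1200.0)    # just-intonation correction c_k
        # Harmonic strengths from the plucked-string model; the relative plucking
        # position 'pluck' is an assumed value, not one prescribed by the text.
        a = np.abs(np.sin(k * np.pi * pluck)) / (k ** 2 * np.sin(np.pi * pluck))
        n = np.arange(length)
        phases = 2.0 * np.pi * np.outer(k * c * f0, n) / fs
        return np.sum(a[:, None] * np.sin(phases), axis=0)

    # Example: the open low E string of a guitar (MIDI 40, E2 at about 82.41 Hz).
    tone = complex_tone(open_string_midi=40, fret=0)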
However, if it is deemed important to distinguish notes in terms of their location on the fretboard (such that each playable note is defined by a unique pair of string and fret), a fretted-instrument dictionary must be created with all possible such pairs (for instance, each possible finger position on the fretboard).
For non-fretted instruments, such as the piano or the voice, the creation of a dictionary D is similar but simpler because each key on the keyboard represents a unique tone within the tonal range of the piano. The tonal range of, for example, an 88-key standard piano lies between A0 and C8 or 21 and 108 in MIDI note numbers. The vocal range of classical performance covers about five octaves from a low G1 to a high G6. The common vocal range is from a low C2 to a high D6. Any individual's voice covers a range of one and a half to more than two octaves, depending on the voice type (from male bass to female soprano).
The dictionary D may be limited to the individual vocal range of a particular sound-generating device or class of devices (such as a class of instruments), to save on memory and to reduce computational complexity. In principle, however, the dictionary D could contain any number of sound patterns SP, which the algorithm would try to recognise in the audio signal AS. For instance, in the case of atonal music or non-musical audio, the dictionary D would not be based on the MIDI tuning standard. Hence, a complex tone is generated using the above-described principles, is multiplied by the same window function as used for processing the digital audio signal AS (see below) and is transformed to the frequency domain in the same manner as the time window TW produced from the digital audio signal AS (see also below). In other words, each sound pattern SP is produced in such a way as to best correspond to an audio signal AS which has been windowed and transformed in accordance with the present invention and as described below.
For instance, a predetermined frequency scale, such as the 12TET frequency scale, can be used for this transform to the frequency domain, in the sense that only frequencies belonging to the frequency scale in question are evaluated. In some embodiments, a second-order Goertzel algorithm may be used to compute the discrete Fourier transform for the respective frequency values. Preferably, only the magnitude spectra, which represent the absolute or relative strengths of the harmonics, are stored in the dictionary D as said sound patterns SP. The peak value of each sound pattern SP may be adjusted to one and the same level across sound patterns SP, such as to -15 dBFS, which for pure tones corresponds to an alignment level of -18 dBFS RMS.
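A minimal sketch of such a second-order Goertzel evaluation at an arbitrary list of scale frequencies is given below (Python); as noted in the text, a careful implementation works on the real-valued power and only takes the square root to obtain the magnitude, which is what this sketch does.

    import numpy as np

    def goertzel_magnitudes(x, freqs_hz, fs):
        # |DFT| of the windowed frame x, evaluated only at the given frequencies.
        mags = np.empty(len(freqs_hz))
        for i, f in enumerate(freqs_hz):
            w = 2.0 * np.pi * f / fs
            coeff = 2.0 * np.cos(w)
            s_prev = s_prev2 = 0.0
            for sample in x:                       # second-order Goertzel recursion
                s = sample + coeff * s_prev - s_prev2
                s_prev2, s_prev = s_prev, s
            power = s_prev**2 + s_prev2**2 - coeff * s_prev * s_prev2
            mags[i] = np.sqrt(max(power, 0.0))     # magnitude from the power value
        return mags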
Relative amplitudes of harmonics for a string of length L that is fixed at both ends and pulled, picked, or plucked at position p from the end of the string (for a guitar, from the bridge) can be derived from the one-dimensional wave equation. The ratio of amplitudes of the n-th to the first harmonic (fundamental) is given by
A_n / A_1 = sin(n·π·p/L) / (n² · sin(π·p/L)).
A pitch, string, or fret map of a particular instrument or class of instruments is an associative array. It is a static data structure, that may be generated for each such instrument or class of instruments, and then used to look up the MIDI note number, the string index, or the fret index that is associated with a tone on, for instance, the fingerboard (fretboard) of a guitar.
This information may be interesting because a note can be played from different locations on the fingerboard of fretted instruments, generating the same pitch. What differentiates these tones is then their timbre.
For the piano, which is a non-fretted instrument, the pitch map would instead consist of 88 entries with unique MIDI note numbers ranging from 21 to 108. The string map in this case could contain values from 1 to 88, i.e., the key numbers, and the fret map would contain zeroes, following our convention of indexing open strings on a guitar or any other fretted instrument.
The corresponding principle can also be applied to voice, indicating the pitch and the index of the notes in the vocal range of a singer. Strictly speaking, these maps are only necessary for fretted instruments, and only if one wishes to distinguish the location of notes (string, fret) on the fingerboard. Otherwise, they can be omitted. In the context of the present invention, they may be used to implement a superstructure that is compatible with different tonal sources.
As mentioned, the present method operates on a digital audio signal AS, that may be provided to the computer 110 and/or server 120 or that may be measured by the sound capturing equipment 111. Irrespectively, the digital audio signal AS is provided with an information granularity in the form of a sampling rate. Such a sampling rate may result from the operation of an analogue-to-digital conversion of the sound capturing device 111. Again referring to Figure 2, in a subsequent method step the method according to the present invention comprises the step of determining a time window TW of a certain length of said digital audio signal AS. In some embodiments, the digital audio signal AS is processed at a sampling rate of 32 kHz or lower, such as 16 kHz or lower. This may mean that the audio signal AS is provided to the computer 110 and/or server 120 at this sampling rate, or that the computer software function is arranged to down-sample the audio signal AS prior to said time window TW being applied to the audio signal AS. Such down-sampling may take place by the computer software product (or a down-sampling hardware function) on the fly or prior to interpreting a particular audio file. In other words, the method may comprise the additional step of resampling the digital audio signal AS to 32 kHz, 16 kHz, or even lower sampling rate, before the application of the time window TW.
Namely, the present inventor has discovered that, using the present methodology, it is possible to use such low sampling rates and still achieve high-quality results for automatic sound pattern SP identification in an audio signal AS of sounds that are relevant to an average human listener.
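One straightforward way to perform such resampling is a polyphase resampler, sketched below in Python with SciPy; the 44.1 kHz input rate and the helper name downsample_to_16k are merely examples.

    import numpy as np
    from math import gcd
    from scipy.signal import resample_poly

    def downsample_to_16k(audio, fs_in):
        # Resample the (mono) audio signal AS to 16 kHz before applying the time window TW.
        g = gcd(16000, int(fs_in))
        return resample_poly(audio, 16000 // g, int(fs_in) // g)

    # Example: one second of a signal captured at 44.1 kHz, brought down to 16 kHz.
    x_16k = downsample_to_16k(np.random.randn(44100), 44100)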
The time window TW is a time window of a certain time duration, which is moved in the positive time direction along the audio signal AS (or the audio signal AS moves past the time window), as will be described below. In some embodiments, the time window TW is a weighted time window, being weighted using a window function WF, such as an asymmetric window function. In other words, the audio signal may be multiplied by the window function WF, the window function WF being zero everywhere apart from a time interval constituting said time window TW. Across this time interval, the window function WF may be non-constant, so that a weighting is applied by said multiplication. In some embodiments, the window function WF is defined having a maximum in the second (later) half of the time window TW. The area under the window function WF may be larger in the second half of the window function WF, in relation to time, as compared to the area under the window function WF in the first half of the window function WF, in relation to time.
Hence, the present invention uses transcription based on a time-frequency analysis, in turn being based on a time window TW, of said type, moving in relation to the audio signal AS. Any suitable window function WF can be used. However, in some embodiments the window function WF may be an asymmetric Kaiser-Bessel-derived, KBD, window. In practical applications, a Kaiser-Bessel-derived window function WF with a non-zero shape parameter α has been found to yield good results. In some embodiments, the shape parameter α is at least 1, such as at least 5, and in some embodiments, it is at most 20, such as at most 10. The duration (length) of the time window TW may be selected according to a desired frequency resolution. For a regular guitar and standard tuning, the resolution should be about the distance between E2 and F2, whereas for a bass guitar with six strings the resolution should be about the distance between B0 and C1.
As a rule, the time window TW should capture at least one period of the difference tone, the frequency of which is given by the positive difference between the fundamental of the lowest playable note and the next-higher note. For the examples given above, such window length T in seconds would correspond to
T = 1 / (F2 - E2)
in the case of a six-string guitar and standard tuning, and
T = 1 / (C1 - B0)
in the case of a six-string bass guitar. Furthermore, the time window TW would normally not cover more than 10, such as not more than 2, such periods.
Furthermore, the window function WF, here denoted w(n), may be defined to be asymmetrical, such as in the way described above. This has the advantage of minimizing the latency of note onsets when performing real-time analysis and interpretation of a received audio stream.
Also, the window function WF may be scaled, so its coefficients sum up to 2. As a concrete example, the peak of the window function WF, such as a window function WF of the general type described above, may be shifted to the right by
w(n) ← n · w(n), n = 0, 1, ..., L - 1,
where L is the window length in samples. The skewed window may then be rescaled (normalized), e.g., so that its coefficients again sum up to the desired value. As used herein, the term ”time window” of the digital audio signal AS refers to the audio signal multiplied by the window function WF as described above, to obtain a windowed digital audio signal. The window function WF may, in some embodiments, be zero at either or both of its end points, tapering away from the maximum towards both ends. In a subsequent step, the time window TW of the digital audio signal AS obtained in that manner is transformed to the frequency domain, to obtain a transformed snapshot X. This snapshot is a short-time spectral representation of the audio signal at a particular time instant for the duration of said time window TW.
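The following Python/NumPy sketch builds a Kaiser-Bessel-derived window from a Kaiser prototype, sets its length from the guitar difference tone F2 - E2 as described above, skews it towards its later half and rescales it. The relation beta = π·α between the Kaiser parameter and the shape parameter α, and the 16 kHz sample rate, are assumptions made for the example.

    import numpy as np

    def kbd_window(length, alpha=6.0):
        # Kaiser-Bessel-derived window built from a Kaiser prototype (beta = pi * alpha).
        kaiser = np.kaiser(length // 2 + 1, np.pi * alpha)
        csum = np.cumsum(kaiser)
        half = np.sqrt(csum[:-1] / csum[-1])
        return np.concatenate([half, half[::-1]])

    fs = 16000
    F2, E2 = 87.31, 82.41                       # Hz: lowest semitone step on a guitar
    L = int(round(fs / (F2 - E2)))              # about one period of the difference tone
    L += L % 2                                  # this KBD construction assumes even length

    w = kbd_window(L, alpha=6.0)
    n = np.arange(L)
    w = n * w                                   # skew the peak towards the later half
    w = 2.0 * w / w.sum()                       # rescale so the coefficients sum up to 2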
As is the case for the construction of the dictionary D above, the transform of the time window TW may be performed using the discrete Fourier transform (DFT), e.g., implemented in the form of the Goertzel algorithm, such as the second-order Goertzel algorithm. These algorithms are well-known as such and will not be further explained herein.
As is also correspondingly the case for the construction of the dictionary D as described above, the time-frequency transform may be accomplished so that only certain frequencies of a particular frequency scale, such as the 12TET frequency scale, are evaluated in the process. This is one way of achieving a spectral representation with low dimensionality, as described above in relation to the construction of the dictionary D but correspondingly applicable to the time-frequency transform and spectral analysis of the time window TW. In more concrete terms, this may be performed according to the following. Firstly, the audio signal AS, consisting of carried over (overlap) past and newly read (hop) data, is windowed (note that x here denotes the audio signal AS, not to be confused with X denoting said transformed snapshot):
x(n) ← w(n) · x(n), n = 0, 1, ..., L - 1.
Secondly, the discrete Fourier transform (DFT) of the windowed signal at only the specified frequencies using 12TET is computed using the Goertzel algorithm:
X(k) = DFT{x(n)} evaluated at the 12TET frequencies.
In case the window function WF was normalized (see above), and furthermore if the signal is levelled (see below), the peak magnitude of x(n) will not exceed 0 dBFS, with its RMS level being around -18 dBFS. Moreover, it is in this case guaranteed that the representation values will be bounded by 0 and 1, because the tone patterns were aligned with the same RMS level. As a result, a careful implementation of the Goertzel algorithm can directly return the real-valued power spectrum, from which the magnitude spectrum is obtained taking the square root. In a subsequent step, a sound pattern representation vector r is calculated, using a selected multiplicative update rule. The sound pattern representation vector r is calculated to minimise a distance measure between the snapshot X and an approximation Dr, being a linear combination of the different sound patterns SP defined by the dictionary D multiplied by the coefficients of the pattern representation vector r.
Hence, once calculated, the pattern vector r contains, for each sound pattern SP, a corresponding velocity value. In practice, given the magnitude spectrum, or any other non-negative spectrum representation, |X(k)| of the short-time Fourier transformed input signal, written as a vector
x ∈ R+^K = [|x(k_1)|, ..., |x(k_K)|]^T,
and the dictionary D ∈ R+^(K×I) consisting of I distinct tone patterns, we want to find a representation r ∈ R+^I such that the distance between x and a linear combination of a few selected tone patterns, Dr, is minimised. See Figure 4. The components of r indicate the tone patterns that are detected in x and their magnitude. To find r, we may employ the following loss function, which is known as an entrywise β-divergence:
L(x, Dr) = sum over k = 1..K of d_β(x_k | sum over i = 1..I of d_ki · r_i)
with
d_β(p | q) = (p^β + (β - 1)·q^β - β·p·q^(β-1)) / (β(β - 1)),  β ∉ {0, 1},
d_β(p | q) = p·log(p/q) - p + q,  β = 1,
d_β(p | q) = p/q - log(p/q) - 1,  β = 0,
where 0 ≤ β ≤ 2. It was found that perceptually most meaningful results were obtained for β = 3/2, i.e., when
L(x, Dr) = sum over k = 1..K of [ (4/3)·x_k^(3/2) + (2/3)·(sum over i of d_ki·r_i)^(3/2) - 2·x_k·(sum over i of d_ki·r_i)^(1/2) ].
This specific loss function will be used throughout the rest of the document.
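A compact sketch of a standard multiplicative update that minimises this β-divergence for a fixed dictionary is given below (Python/NumPy); the penalty terms discussed earlier are omitted for clarity, and the iteration count and initialisation are arbitrary choices rather than values taken from the text.

    import numpy as np

    def solve_representation(x, D, beta=1.5, n_iter=50, eps=1e-12):
        # Multiplicative update for min over r >= 0 of sum_k d_beta(x_k | (D @ r)_k),
        # with the dictionary D kept fixed.
        K, I = D.shape
        r = np.full(I, 1.0 / I)                 # non-negative initialisation
        for _ in range(n_iter):
            v = D @ r + eps                     # current approximation Dr
            numer = D.T @ (x * v ** (beta - 2.0))
            denom = D.T @ (v ** (beta - 1.0)) + eps
            r *= numer / denom                  # update preserves non-negativity
        return r

Because every factor in the update is non-negative, the elements of r remain non-negative throughout. In a penalised variant, the gradients of the selected penalty terms would typically be split into their negative and positive parts and added to the numerator and denominator of the update, respectively.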
This minimisation may take place in a way which is conventional per se, such as using an iterative minimum-seeking algorithm which may be performed until a predetermined convergence criterion is met, and/or until a maximum number of iterations has been reached.
For instance, the dictionary D may in such cases contain information about amplitudes of harmonics for a set of fundamental frequencies corresponding to different tones, such as playable notes on a particular instrument or class of instruments, whereby the transform is evaluated only at a finite set of frequencies of a defined frequency scale as described above, comprising said fundamental frequencies and frequencies corresponding to said harmonics.

In a subsequent step, the procedure is repeated, producing a new snapshot by multiplying the digital audio signal AS with the window function WF at a new time instant and determining a new sound pattern representation vector r for the respective time window TW.
This repetition may be performed until an end of the audio signal AS has been reached, or until any other stop criterion is satisfied.
Simply put, the main processing loop of the present method consists of reading in new data from the audio stream AS, processing the dynamics of the read data (if applicable, see below), applying a short-time Fourier transform (STFT) to a windowed portion of the data (to obtain a series of said snapshots X), and performing ”most likely” matching of tonal components in the new data (represented by the sound pattern representation vector r).
The main processing loop may also comprise the tracking of the occurrence of note ”on” and note ”off” events (see below), and finally generating a note list from such detected note events, arranged in a suitable temporal order, and possibly fed to any subsequent processing step.
As mentioned above, the detection of tonal data may be based on MIDI note numbers, and in this case the detection of tonal data may comprise a mapping of the detected tonal components (according to the representation vector r) to MIDI note numbers, using a MIDI note number table.

In general, new data from the audio stream AS is read in in a way that is typical for time-frequency analysis of music signals. The essential parameters are the length of the time window TW, the hop size, the amount of overlap, and the transform length. As pointed out above, the time window TW length or duration is chosen in a way so that tones at the lower end of the register can be separated. The hop size is typically at least 10 milliseconds, such as at least 20 milliseconds, and at the most 50 milliseconds, such as at the most 30 milliseconds, preferably in the range between 20 and 30 milliseconds, which usually offers a good compromise between a time resolution that is sufficiently fine to detect fast note transitions and a computational burden that is not too high. This is somewhat related to onset attenuation by the auditory system, which lasts around 20 milliseconds.
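A minimal sketch of such a framing loop is given below, assuming a window length of 256 milliseconds and a hop of 25 milliseconds; both values are illustrative and would in practice be chosen as discussed above.

```python
import numpy as np

def frames(signal, fs, win_len_s=0.256, hop_s=0.025):
    """Yield successive analysis frames: (L - hop) carried-over samples plus one new hop."""
    L = int(round(win_len_s * fs))      # time window length in samples
    hop = int(round(hop_s * fs))        # hop size: new samples read per frame
    buf = np.zeros(L)                   # analysis buffer holding overlap + newest hop
    for start in range(0, len(signal) - hop + 1, hop):
        buf = np.concatenate([buf[hop:], signal[start:start + hop]])
        yield buf.copy()
```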
Here, it is important to understand that the goal of the present method is not to synthesize the transformed data, but to interpret its tonal contents. Therefore, things known as ”perfect reconstruction” are not relevant.
The overlap in samples between past and newly read data is the difference between the time window TW length and the hop size. Since a transform length greater than the time window TW does not yield any more insight into the data, it may be set equal to or smaller than the length of the time window TW.
Moreover, in case the audio is recorded or provided across several channels, such as in stereo, it may be converted to one single channel (mono) as an introductory step, before the iteration across the audio signal AS begins. This computation-saving step can be taken since it is not an aim for the present method to capture the stereo-specific image of a recording. The mono conversion can be performed, for instance, by simple channel averaging in the case of stereo, or similar, prior to further processing and analysis of the audio stream AS.
According to the invention, both the dictionary D and the pattern representation vector r have only non-negative elements.
Further, according to the invention, the step in which the pattern representation vector r is calculated is performed using a cost function that includes an elementwise Bregman divergence. In addition, one or several penalty terms are included in said cost function.
An acoustic signal that is recorded in a room is usually not without reverberation or noise. Moreover, interference always occurs between the partials of a tone, and even more so between the partials of multiple tones. The body of an instrument, for instance, such as a stringed instrument, also resonates at certain frequencies and affects the relative strengths of the various partials (timbre). This means that the sound patterns SP in the dictionary D, predefined or trained, will never perfectly match.
Hence, the penalty terms according to the present invention are added to the cost function with the purpose of favouring (or penalising) a certain property of the representation in a computationally efficient and numerically stable manner, tailored to real-life recording situations.
A penalty term is added to the objective function, so that the objective function, which is minimized by the optimization algorithm, consists of a Bregman divergence, which establishes a distance between the observed spectral distribution and its approximation (mixture distribution) by a linear combination of spectral distributions (mixture components), each representing a distinct tone (via the sound patterns SP), plus one or several penalty terms. The trade-off between fitting the observed data (minimizing the cost) and reducing the penalty may be controlled by a respective Lagrange multiplier (a constant factor) for each penalty term used, as will be described below.
A first-order iterative optimisation algorithm such as gradient descent, or its stochastic approximation, is typically used to find a local minimum of the objective function by taking steps proportional to the negative of the gradient (or approximate gradient) of the objective function at the current point. Restricting the data, the dictionary, and the representation to the domain of non-negative real numbers, the update rule can be converted to multiplicative form with an implicit learning rate (step size).
According to the invention, said one or several penalty terms are selected from the first, second, third, and fourth such penalty terms, which are described in detail below. It is understood that more penalty terms could be used, in addition to the one or several of the penalty terms described herein, depending on the concrete application.

The penalty terms described herein are generally non-exclusive and can be used jointly, as separate penalty terms, for the benefit of an improved interpretation of the audio signal AS.
The first penalty term comprises a diversity index, preferably the Shannon index, being a measure of the number of different patterns in the approximation Dr according to their respective proportional abundancies, as quantified by the pattern representation vector r.
The second penalty term comprises a norms difference between the taxicab norm and the Euclidean norm of the pattern representation vector r.
One common purpose of both such first and such second penalty term is to suppress ”filler” notes (notes with small coefficients that mainly fill the holes in the approximation of x by Dr) and put emphasis on the prominent notes in the melody.
The first penalty term is hence expressed as a diversity index, and more specifically the Shannon index, which indicates how many different tones (mixture components) there are in the approximation (mixture distribution) according to their proportional abundances (mixture weights). By minimizing the Shannon index, the uncertainty of predicting the tone identity, and thus the number of different tones in the approximation, is minimized. This is accomplished through multiplicative updates of the weights, which decrease for less certain notes and increase for more certain notes.
The second penalty term is hence expressed as a norms difference between the taxicab norm and the Euclidean norm of the representation. In two dimensions, taxicab circles are squares with sides oriented at a 45° angle to the coordinate axes. Therefore, the norms difference is minimal at the vertices. By minimizing the norms difference, the solution is moved away from the edges to the vertices, thus reducing the number of different tones in the approximation. This is accomplished through multiplicative updates of the representations, which become smaller for notes with small representations and larger for notes with large representations.
More concretely put, the first and second penalty terms both address weak tones that may be the result of any of the mentioned negative effects. They are designed to favour stronger tones over weaker tones.
The first penalty term may be useful in cases where the sound-generating device is used to produce higher sounding notes (G-clef), such as is the case of a regular guitar:

$R_1(\mathbf{r}) = -\sum_{i=1}^{I} p_i \log p_i \quad \text{with} \quad p_i = \frac{r_i}{\sum_{j=1}^{I} r_j}.$

The second penalty term may be useful in cases where the sound-generating device (instrument) is used to produce lower sounding notes (F-clef), such as is the case of a bass guitar:

$R_2(\mathbf{r}) = \sum_{i=1}^{I} r_i - \sqrt{\sum_{i=1}^{I} r_i^2}.$
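The two penalty terms above might, under the definitions given, be computed as in the following sketch; function names are illustrative.

```python
import numpy as np

def shannon_penalty(r, eps=1e-12):
    """R1(r): Shannon index of the proportional abundances p_i = r_i / sum(r)."""
    p = r / max(r.sum(), eps)
    p = np.maximum(p, eps)
    return -np.sum(p * np.log(p))

def norms_difference_penalty(r):
    """R2(r): taxicab norm minus Euclidean norm of the representation."""
    return np.sum(r) - np.sqrt(np.sum(r * r))
```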
The third penalty term comprises a sum of off-diagonal entries in the correlation matrix of the pattern representation vector r.

The third penalty term may be useful to suppress ghost notes and note fractions that may spread over different octaves, so-called octave errors, putting emphasis on the predominant note in a pitch class.
The third penalty term is expressed as a sum of off-diagonal entries in the correlation matrix of the representations (random variables), which indicates the degree to which the representations are linearly related to one another. By minimizing the off-diagonal elements, i.e., the correlations between pairs of representations of different tones, predominant tones that belong to different pitch classes remain in the representation due to the divergence term, and tone fractions belonging to the same pitch class are reassembled into a single tone. This is accomplished through multiplicative updates of the representations, which become smaller for all fractions but the largest fraction in a pitch class, which grows.
The third penalty term favours unique notes. The corresponding expression may be

$R_3(\mathbf{r}) = \sum_{i=1}^{I-1} \sum_{j=i+1}^{I} r_i r_j.$

The fourth penalty term comprises a sum of the squares of the differences between elements of said pattern representation vector (r) and their smoothed values, said smoothed values being calculated for each element in question separately, to represent a time-smoothed representation of the element in question.
The fourth penalty term may be useful to avoid dropouts in the note envelope or fading of notes because of destructive interference between the partials of a tone or the partials of different tones, between original and reflected sound (reverberation), or between original and unwanted sound (noise).
The fourth penalty term is expressed as a sum of the squares of the differences between note representations and the estimated values of the representations, such as smoothed representations and exponentially smoothed representations. By minimizing the differences, short dropouts in the envelopes are smoothed out and tones are stabilised. This is accomplished through multiplicative updates of the representations, which move closer to their estimated values.
This fourth penalty term hence addresses note interruptions that mainly result from interference, and therefore favours slurs (legato articulation). This is achieved using

$R_4(\mathbf{r}) = \frac{1}{2} \sum_{i=1}^{I} \left( \tilde{r}_i^{\,t} - r_i^{\,t} \right)^2$

with

$\tilde{r}_i^{\,t} = \begin{cases} 0, & \text{for } t = 0, \\ \alpha\, r_i^{\,t} + (1 - \alpha)\, \tilde{r}_i^{\,t-1}, & \text{for } t > 0, \end{cases}$

where t is the frame (time) index and α is the smoothing factor (0 < α < 1). The smoothing factor may be derived from a corresponding halftime $t_{1/2}$ expressed in seconds, such as a fraction of a second, using the relation $\alpha = 1 - 2^{-T/t_{1/2}}$, where T is the hop duration in seconds.
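The third and fourth penalty terms, as reconstructed above, might be computed as in the following sketch; the per-frame smoothing recursion is included, and all names are illustrative.

```python
import numpy as np

def correlation_penalty(r):
    """R3(r): sum over all pairs i < j of r_i * r_j (off-diagonal terms)."""
    s = np.sum(r)
    return 0.5 * (s * s - np.sum(r * r))     # equals sum_{i<j} r_i r_j

def smooth_representation(r, r_smoothed_prev, alpha, t):
    """Exponentially smoothed representation r~ used by the fourth penalty term."""
    if t == 0:
        return np.zeros_like(r)
    return alpha * r + (1.0 - alpha) * r_smoothed_prev

def smoothness_penalty(r, r_smoothed):
    """R4(r): half the squared distance between r and its time-smoothed estimate."""
    return 0.5 * np.sum((r_smoothed - r) ** 2)
```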
In a subsequent method step, the method ends.

Using such a method, a procedure is obtained that is capable of producing accurate results, while being computationally efficient enough to be able to run in real time even on conventional hardware.

As mentioned above, in some embodiments of the invention each of the used penalty terms (first, second, third, and/or fourth) is individually weighted by a weight coefficient $\lambda_1$, $\lambda_2$, $\lambda_3$ and $\lambda_4$, respectively.
Hence, a composite cost function C may consist of the loss function (elementwise Bregman divergence) and one or several said penalty terms, which are weighted relative to their importance by the respective multipliers ($0 \leq \lambda \leq 1$):

$C(\mathbf{x}, \mathbf{D}\mathbf{r}) = L(\mathbf{x}, \mathbf{D}\mathbf{r}) + \lambda_1 R_1(\mathbf{r}) + \lambda_2 R_2(\mathbf{r}) + \lambda_3 R_3(\mathbf{r}) + \lambda_4 R_4(\mathbf{r}).$
To find the most likely representation, the cost is to be minimized. This may be accomplished using the following iterative optimization procedure. The procedure is executed once per frame (with every hop of the time window TW). Even in the case where no weights λ are used, the corresponding procedure can be used.

1. Initialize the representation with random nonnegative values, $\mathbf{r}^{t=0}$.
2. Set the value(s) for $\lambda_1$, $\lambda_2$, $\lambda_3$ and $\lambda_4$.
3. Compute the initial cost, $C^{t=0}(\mathbf{x}, \mathbf{y}^{t=0})$ with $\mathbf{y}^{t=0} = \mathbf{D}\mathbf{r}^{t=0}$.
4. Repeat the following steps until a predetermined stopping criterion is met:

a. Update the representation multiplicatively,

$r_i^{t+1} = r_i^{t} \cdot \frac{\left[\partial C / \partial r_i\right]^{-}}{\left[\partial C / \partial r_i\right]^{+}},$

where $[\partial C / \partial r_i]^{-}$ and $[\partial C / \partial r_i]^{+}$ denote the negative and positive parts of the gradient of the composite cost with respect to $r_i$. For the divergence term with $\beta = 2/3$ these parts are $\sum_{k=1}^{K} d_{ki}\, x_k / \sqrt[3]{y_k^4}$ and $\sum_{k=1}^{K} d_{ki} / \sqrt[3]{y_k}$, respectively, complemented by the corresponding gradient contributions of the active penalty terms, weighted by $\lambda_1$, $\lambda_2$, $\lambda_3$ and $\lambda_4$. One variant of the update rule may be used for a higher-register instrument, emphasising the first penalty term, and another variant for a lower-register instrument, emphasising the second penalty term, both variants retaining the contributions of the third and fourth penalty terms.
Compute the new approximation of the input signal, $\mathbf{y}^{t+1} = \mathbf{D}\mathbf{r}^{t+1}$. Compute the new cost, $C^{t+1}(\mathbf{x}, \mathbf{y}^{t+1})$. Compute the steepness s of the cost function. If s is below some predetermined small value ε, return the representation, or continue with step a above, for t ← t + 1.

Alternatively, or in addition, the procedure may be performed for a predetermined number of iterations, possibly in combination with the convergence criterion using ε described above.

In some embodiments, the cost function may include a function being an entrywise β-divergence with $0 \leq \beta \leq 2$, such as $0.5 \leq \beta \leq 1$, such as $\beta = 2/3$.

In general, the calculation of the pattern representation vector r may be performed using a first-order iterative optimisation algorithm, such as the one described in the example above.
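A minimal sketch of such a first-order multiplicative-update procedure is given below, restricted to the divergence term with β = 2/3; the penalty terms described above would add their λ-weighted gradient contributions to the numerator and denominator of the update. Initialisation, iteration count and convergence threshold are illustrative assumptions.

```python
import numpy as np

def fit_representation(x, D, beta=2.0 / 3.0, max_iter=200, tol=1e-6, rng=None):
    """Multiplicative updates minimising the beta-divergence between x and Dr.

    Sketch of the divergence term only; the penalty terms would contribute
    additional (lambda-weighted) terms to numerator and denominator.
    """
    rng = np.random.default_rng() if rng is None else rng
    tiny = 1e-12

    def div(p, q):  # entrywise beta-divergence for beta not in {0, 1}
        p, q = np.maximum(p, tiny), np.maximum(q, tiny)
        return np.sum((p ** beta + (beta - 1.0) * q ** beta - beta * p * q ** (beta - 1.0))
                      / (beta * (beta - 1.0)))

    r = rng.uniform(0.1, 1.0, size=D.shape[1])   # step 1: random nonnegative initialisation
    y = np.maximum(D @ r, tiny)
    cost_prev = div(x, y)                        # step 3: initial cost
    for _ in range(max_iter):                    # step 4: iterate until convergence
        num = D.T @ (x * y ** (beta - 2.0))      # beta = 2/3: sum_k d_ki x_k / y_k^(4/3)
        den = np.maximum(D.T @ (y ** (beta - 1.0)), tiny)  # sum_k d_ki / y_k^(1/3)
        r *= num / den                           # multiplicative update, implicit step size
        y = np.maximum(D @ r, tiny)
        cost = div(x, y)
        if abs(cost_prev - cost) / max(cost_prev, tiny) < tol:
            break                                # cost steepness below threshold: converged
        cost_prev = cost
    return r
```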
As mentioned above, steps from determining the time window TW up to and including the calculation of the representation vector r may be performed iteratively, and particularly by successively shifting the time window TW or the digital audio signal AS in relation to each other.
This process may furthermore comprise the detection of sound pattern ”on” and sound pattern ”off” events. Such event detection may be performed based on the pattern representation vector r calculated for each of the consecutive time windows TW.
Hence, as mentioned above, the components of the sound pattern representation vector r indicate which tone patterns were detected and with which magnitude or likelihood. This information will normally change with every incoming frame. To convert this information into, e.g., MIDI note events, which, apart from the pitch, signal the beginning and the end of a note, and with which velocity the key was struck, or a string was picked or plucked, the pitch representation must be tracked over time. The following paragraphs explain how this can be put into practice.

In some embodiments, the sound pattern representation vector r may be pre-processed with a median filter and a short window to remove outliers and so improve the onset (and offset) detection. Both onset and offset detection require a threshold value. Each tone has a state (active or inactive). Should the magnitude of a component in the pattern representation vector r exceed the onset threshold, and should the corresponding tone be inactive, a note ”on” event may be signalled, and the corresponding tone may be activated. Should the tone be already activated, a note ”off” event may be signalled before a new note ”on” event, but only if the magnitude of the current component is by a factor greater than 1 stronger than its magnitude one time instant before. In the opposite case, no new event occurs. Should the magnitude of an activated tone fall below the offset threshold, a note ”off” event may be signalled, and the tone is deactivated.
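A possible realisation of this tracking logic is sketched below; the threshold values and the retrigger factor are illustrative assumptions, since the description above only requires onset and offset thresholds and a factor greater than 1.

```python
import numpy as np

class NoteTracker:
    """Track note 'on'/'off' events from successive representation vectors."""

    def __init__(self, n_tones, on_thr=0.1, off_thr=0.05, retrigger=1.5):
        self.active = np.zeros(n_tones, dtype=bool)
        self.prev = np.zeros(n_tones)
        self.on_thr, self.off_thr, self.retrigger = on_thr, off_thr, retrigger

    def step(self, r, t):
        """Return a list of (time, tone_index, 'on'/'off') events for frame t."""
        events = []
        for i, value in enumerate(r):
            if value > self.on_thr and not self.active[i]:
                events.append((t, i, 'on'))              # onset: activate the tone
                self.active[i] = True
            elif value > self.on_thr and self.active[i] and value > self.retrigger * self.prev[i]:
                events.append((t, i, 'off'))             # retrigger: close the old note...
                events.append((t, i, 'on'))              # ...and signal a new one
            elif value < self.off_thr and self.active[i]:
                events.append((t, i, 'off'))             # offset: deactivate the tone
                self.active[i] = False
        self.prev = np.asarray(r, dtype=float).copy()
        return events
```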
The velocity of such a detected note may be derived from the corresponding magnitude value at onset using, for instance, Stevens' power law:

$\psi(I) = k I^{a},$

where I is the intensity or strength of the stimulus, ψ(I) is the magnitude of the sensation evoked by the stimulus, a is an exponent and k is a proportionality constant. The constants a and k are tuned to approximate the relationship between an increased representation value and the perceived loudness increase in the sensation when hearing the tone. A standard value for k is 1 and 0.67 for a.
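As a small illustration, the mapping below applies the power law to a representation magnitude at onset and scales the result to a MIDI velocity; the scaling to the 0-127 range is an assumption made for illustration and is not prescribed above.

```python
def velocity_from_magnitude(magnitude, k=1.0, a=0.67, midi_max=127):
    """Map a representation magnitude at onset to a MIDI velocity via Stevens' power law."""
    loudness = k * magnitude ** a                    # psi(I) = k * I^a
    return max(1, min(midi_max, int(round(loudness * midi_max))))
```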
Once note ”on” and note ”off” events have been detected and signalled, one may generate a note list from such note events. A note begins at onset time and lasts until the corresponding offset time. A note's duration may thus be given by the time difference between these two events. The pitch of a note may be expressed using MIDI note numbers, which may be obtained from the component index in the representation vector r via a pitch map of the above-described type. Its velocity, i.e., the force with which a note is played, may be determined at onset time. It should be understood that a note list can be updated at run time without awaiting the end of a live performance or the end of a recording. In addition to the standard note properties pitch, time, duration, and velocity, one can also indicate the string and the fret of a note on a fretted instrument with the help of the string map and the fret map of the above-described type, if used.
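When the note list is later written out (see the next paragraph), it may, for example, be serialised to a standard MIDI file. The sketch below uses the third-party mido library, which is an assumption made for illustration; the note-list format (onset, duration, MIDI note, velocity) is likewise illustrative.

```python
import mido  # third-party MIDI library, used here only as an example writer

def save_note_list(notes, path, ticks_per_beat=480, tempo=500_000):
    """Write a note list of (onset_s, duration_s, midi_note, velocity) tuples to a MIDI file."""
    def to_ticks(seconds):
        return int(round(seconds * 1_000_000 / tempo * ticks_per_beat))

    mid = mido.MidiFile(ticks_per_beat=ticks_per_beat)
    track = mido.MidiTrack()
    mid.tracks.append(track)
    track.append(mido.MetaMessage('set_tempo', tempo=tempo, time=0))
    events = []                                      # absolute-time on/off events
    for onset, duration, note, velocity in notes:
        events.append((to_ticks(onset), 'note_on', note, velocity))
        events.append((to_ticks(onset + duration), 'note_off', note, 0))
    events.sort(key=lambda e: e[0])
    now = 0
    for tick, kind, note, velocity in events:        # convert to delta times
        track.append(mido.Message(kind, note=note, velocity=velocity, time=tick - now))
        now = tick
    mid.save(path)

# Example: two short notes written to a hypothetical output file
save_note_list([(0.0, 0.5, 60, 96), (0.5, 0.5, 64, 80)], 'transcription.mid')
```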
When the end of the file is reached, or the reading of the audio stream AS is stopped for any reason, a note list may be saved, for instance as a standard MIDI file, which can be opened and further edited in a digital audio workstation, DAW, or any other music composition and notation software.

Moreover, in some embodiments the method comprises a pre-processing step, that may be performed before the step in which the audio signal AS is transformed to the frequency domain, in which the audio signal AS is subjected to at least one of noise gating, levelling, or limiting, as desired. In Figure 1, it is illustrated how such signal processing can be performed. Signal processing may be performed on the sound signal before and/or after the determination of said time window TW. For instance, when processing a sound file, at least some processing may take place before application of the time window function, on the whole file. In other examples, for instance in the processing of streamed audio, the time window function may be applied first and then sound processing such as gating may be applied to the windowed audio signal.
Such dynamics processing may be used to make sure that all recordings are interpreted and transcribed in a consistent manner. A noise gate removes unwanted noise, such as hum and hiss, during periods when the main signal is not present. A leveller stabilizes the volume of the signal by making too quiet notes louder and too loud notes softer. The leveller may be designed in such a way as to align the long-term signal level (RMS) to -18 dBFS. The limiter may be used for safety, i.e., to prevent the amplitude of the levelled signal from clipping, which could cause unwanted distortion and thus worsen the performance of the algorithm. For instance, a standard noise gate without hysteresis may be used, and/or a so-called brick-wall limiter.
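A deliberately simplified sketch of such a dynamics chain is given below; the gate threshold and the analysis block length are illustrative assumptions, while the -18 dBFS levelling target follows the description above.

```python
import numpy as np

def preprocess_dynamics(x, fs, gate_thr_db=-60.0, target_rms_db=-18.0, block_s=0.4):
    """Noise gate, block-wise RMS leveller (towards -18 dBFS) and hard limiter."""
    hop = max(1, int(block_s * fs))
    y = x.astype(float).copy()
    target = 10.0 ** (target_rms_db / 20.0)
    gate = 10.0 ** (gate_thr_db / 20.0)
    for start in range(0, len(y), hop):
        block = y[start:start + hop]                 # view into y, modified in place
        rms = np.sqrt(np.mean(block ** 2)) + 1e-12
        if rms < gate:
            block[:] = 0.0                           # noise gate: mute quiet blocks
        else:
            block *= target / rms                    # leveller: align block RMS to target
    return np.clip(y, -1.0, 1.0)                     # brick-wall limiter against clipping
```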
Once a sequence of notes has been generated from the analysed audio signal AS, it can be used in various ways.
As mentioned, the notes can be stored for future use or for additional processing, e.g., in a MIDI file. In other examples, they may be used on the fly, for instance as a basis for the automatic generation of an additional sequence of notes, such as a countermelody. Such a countermelody may for instance be generated, for instance on the fly, using a neural network specifically designed to generate countermelodies. A corresponding sound signal (after synthesis by the computer software function) may be routed by the computer 110 to the sound-generating device 112 for playback.

In other cases, such as in analysis of machine-produced sounds, the recognised notes may form the basis for a monitoring, debugging, or testing protocol, as the case may be.

In any case, the interpreted set of notes may be stored, sent to a different entity for processing, and/or displayed to a user on a screen of a computing device, such as in the form of automatically produced musical notation. In some cases, the computer 110 is locally arranged with respect to the sound-generating device 200, and either executes the computer software product on its own hardware or communicates with the server 120 that in turn performs one or several of the steps described above, for instance as a service. Said storage may take place on the server 120 and/or may be performed by the server.

As mentioned above, the present invention further relates to the system 100, being arranged to perform the method steps described above.
These method steps are then typically performed by said computer software product, executed on the computer 110 and/or the server 120 and arranged to perform said steps as described above.
The invention also relates to such a computer software product.
Above, preferred embodiments have been described. However, it should be apparent to the skilled person that many modifications can be made to the disclosed embodiments without departing from the basic idea of the invention.
For instance, the system 100 may comprise many additional parts, such as a music composition or production software, integrated with the audio signal interpretation functionality, or the system 100 may be comprised in such a system.

The sound-generating device 200 may be arranged locally or remotely in relation to the sound-generating device

In general, all that has been said in relation to the present method is equally applicable to the present system and computer software product.
Hence, the invention is not limited to the described embodiments, but can be varied within the scope of the enclosed claims.

Claims (19)

1. Method for interpreting a digital audio signal (AS), comprising the steps of
a) determining a dictionary (D) defining a set of different characteristic sound patterns (SP);
b) determining a time window (TW) of a certain length of said digital audio signal (AS);
c) transforming said time window (TW) of the digital audio signal (AS) to the frequency domain to obtain a transformed snapshot (X);
d) calculating, using a selected multiplicative update rule, a pattern representation vector (r) that minimises a distance measure between the snapshot (X) and an approximation (Dr), being a linear combination of said sound patterns (SP) defined by said dictionary (D) multiplied by the coefficients of the pattern representation vector (r); and
e) repeating from step b using a subsequent time window (TW),
wherein both said dictionary (D) and said pattern representation vector (r) have only non-negative elements,
wherein step d is performed using a cost function that includes an elementwise Bregman divergence and in addition thereto one or several penalty terms, said one or several penalty terms being selected from the list of
a first penalty term, comprising a diversity index, preferably the Shannon index, being a measure of the number of different patterns in the approximation (Dr) according to their respective proportional abundancies, as quantified by the pattern representation vector (r);
a second penalty term, comprising a norms difference between the taxicab norm and the Euclidean norm of the pattern representation vector (r);
a third penalty term, comprising a sum of off-diagonal entries in the correlation matrix of the pattern representation vector (r); and
a fourth penalty term, comprising a sum of the squares of the differences between elements of said pattern representation vector (r) and their smoothed values, said smoothed values being calculated for each element in question separately, to represent a time-smoothed representation of the element in question.
2. Method according to claim 1, wherein each of the penalty terms is individually weighted by a weight coefficient.
3. Method according to claim 1 or 2, wherein the cost function comprises a function being a β-divergence with 0 ≤ β ≤ 2, such as 0.5 ≤ β ≤ 1, such as β = 2/3.
4. Method according to any one of the preceding claims, wherein the time window (TW) is weighted using an asymmetric window function with a maximum in the second half of the time window (TW).
5. Method according to claim 4, wherein the window function is an asymmetric Kaiser- Bessel-derived, KBD, window.
6. Method according to any one of the preceding claims, wherein the digital audio signal (AS) is processed at a sampling rate of 16 kHz or lower.
7. Method according to claim 6, wherein the method comprises a step in which the digital audio signal (AS) is resampled to 16 kHz or lower sampling rate, before step c.
8. Method according to any one of the preceding claims, wherein step a comprises the step of, for at least one pitched musical instrument or at least one class of pitched musical instruments and for each one of a set of playable notes on such instrument or class of instruments, synthesising a pattern based on a physical model of an acoustic resonator and further based on the one-dimensional wave equation.
9. Method according to claim 8, wherein the physical model assumes the same relative amplitudes of harmonics for all playable notes.
10. Method according to any one of the preceding claims, wherein the dictionary (D) contains information about amplitudes of harmonics for a set of fundamental frequencies, and wherein the transform in step c is evaluated only at said fundamental frequencies and frequencies corresponding to said harmonics.

11. Method according to claim 10, wherein each of said fundamental frequencies corresponds to a respective playable note on a particular instrument or class of instruments, and wherein the transform in step c is evaluated only at said fundamental frequencies and said frequencies corresponding to said harmonics.
12. Method according to claim 11, wherein said frequencies are selected as all fundamental frequencies associated with the MIDI tuning standard, MTS, for a given concert A up to the Nyquist frequency.
13. Method according to any one of claims 10-12, wherein the transform in step c is the discrete Fourier transform (DFT), implemented in the form of the Goertzel algorithm, such as the second-order Goertzel algorithm.
14. Method according to any one of the preceding claims, wherein steps b to d are performed successively by time-shifting said time window (TW) in relation to the digital audio signal (AS).
15. Method according to claim 14, wherein the method further comprises detecting pattern on and pattern off events based on the pattern representation vector (r) calculated in step d for consecutively time-shifted time windows (TW).
16. Method according to any one of the preceding claims, wherein the method further comprises a pre-processing step, performed before step c, in which the digital audio signal (AS) is subjected to at least one of noise gating, levelling, or limiting.
17. Method according to any one of the preceding claims, wherein the calculation of the pattern representation vector (r) is performed using a first-order iterative optimisation algorithm.
18. System (100) for interpreting a sampled sound signal, said system (100) being arranged with a dictionary (D) defining a set of different characteristic sound patterns (SP),
said system (100) being arranged to determine, in a step b, a time window (TW) of a certain length of said digital audio signal (AS),
said system (100) being arranged to transform, in a step c, said time window (TW) of the digital audio signal (AS) to the frequency domain to obtain a transformed snapshot (X),
said system (100) being arranged to calculate, in a step d, using a selected multiplicative update rule, a pattern representation vector (r) that minimises a distance measure between the snapshot (X) and an approximation (Dr), being a linear combination of said sound patterns (SP) defined by said dictionary (D) multiplied by the coefficients of the pattern representation vector (r), and
said system (100) being arranged to repeat from said step b using a subsequent time window (TW), both said dictionary (D) and said pattern representation vector (r) having only non-negative elements,
said system (100) being arranged to perform said calculation of said pattern representation vector (r) using a cost function that includes an elementwise Bregman divergence and in addition thereto one or several penalty terms, said one or several penalty terms being selected from the list of
a first penalty term, comprising a diversity index, preferably the Shannon index, being a measure of the number of different patterns in the approximation (Dr) according to their respective proportional abundancies, as quantified by the pattern representation vector (r);
a second penalty term, comprising a norms difference between the taxicab norm and the Euclidean norm of the pattern representation vector (r);
a third penalty term, comprising a sum of off-diagonal entries in the correlation matrix of the pattern representation vector (r); and
a fourth penalty term, comprising a sum of the squares of the differences between elements of said pattern representation vector (r) and their smoothed values, said smoothed values being calculated for each element in question separately, to represent a time-smoothed representation of the element in question.
19. Computer software product for interpreting a sampled sound signal, said computer software product comprising or being associated with a dictionary (D) defining a set of different characteristic sound patterns (SP),
said computer software product being arranged to, when executed on computer hardware,
determine, in a step b, a time window (TW) of a certain length of said digital audio signal (AS);
transform, in a step c, said time window (TW) of the digital audio signal (AS) to the frequency domain to obtain a transformed snapshot (X);
calculate, in a step d, using a selected multiplicative update rule, a pattern representation vector (r) that minimises a distance measure between the snapshot (X) and an approximation (Dr), being a linear combination of said sound patterns (SP) defined by said dictionary (D) multiplied by the coefficients of the pattern representation vector (r); and
repeat, from step b, using a subsequent time window (TW), both said dictionary (D) and said pattern representation vector (r) having only non-negative elements,
said computer software product being arranged to, when executed on computer hardware, perform said calculation of said pattern representation vector (r) using a cost function that includes an elementwise Bregman divergence and in addition thereto one or several penalty terms, said one or several penalty terms being selected from the list of
a first penalty term, comprising a diversity index, preferably the Shannon index, being a measure of the number of different patterns in the approximation (Dr) according to their respective proportional abundancies, as quantified by the pattern representation vector (r);
a second penalty term, comprising a norms difference between the taxicab norm and the Euclidean norm of the pattern representation vector (r);
a third penalty term, comprising a sum of off-diagonal entries in the correlation matrix of the pattern representation vector (r); and
a fourth penalty term, comprising a sum of the squares of the differences between elements of said pattern representation vector (r) and their smoothed values, said smoothed values being calculated for each element in question separately, to represent a time-smoothed representation of the element in question.
SE2051550A 2020-12-22 2020-12-22 Method and system for recognising patterns in sound SE544738C2 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
SE2051550A SE544738C2 (en) 2020-12-22 2020-12-22 Method and system for recognising patterns in sound

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
SE2051550A SE544738C2 (en) 2020-12-22 2020-12-22 Method and system for recognising patterns in sound

Publications (2)

Publication Number Publication Date
SE2051550A1 SE2051550A1 (en) 2022-06-23
SE544738C2 true SE544738C2 (en) 2022-11-01

Family

ID=82402566

Family Applications (1)

Application Number Title Priority Date Filing Date
SE2051550A SE544738C2 (en) 2020-12-22 2020-12-22 Method and system for recognising patterns in sound

Country Status (1)

Country Link
SE (1) SE544738C2 (en)


Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050038635A1 (en) * 2002-07-19 2005-02-17 Frank Klefenz Apparatus and method for characterizing an information signal
US20060075884A1 (en) * 2004-10-11 2006-04-13 Frank Streitenberger Method and device for extracting a melody underlying an audio signal
US20110075851A1 (en) * 2009-09-28 2011-03-31 Leboeuf Jay Automatic labeling and control of audio algorithms by audio recognition
WO2013022930A1 (en) * 2011-08-08 2013-02-14 The Intellisis Corporation System and method of processing a sound signal including transforming the sound signal into a frequency-chirp domain
EP3048607A2 (en) * 2015-01-20 2016-07-27 Harman International Industries, Inc. Automatic transcription of musical content and real-time musical accompaniment
WO2017124116A1 (en) * 2016-01-15 2017-07-20 Bao Sheng Searching, supplementing and navigating media
US20190102144A1 (en) * 2017-10-03 2019-04-04 Google Llc Identifying Music as a Particular Song

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Jia-Min Ren; Jyh-Shing Roger Jang, Discovering Time-Constrained Sequential Patterns for Music Genre Classification, IEEE Transactions on Audio, Speech and Language Processing, 20120501, IEEE, US *
Meinard Muller; Daniel P W Ellis; Anssi Klapuri; Gaël Richard, Signal Processing for Music Analysis, IEEE Journal of Selected Topics in Signal Processing, 20111001, IEEE, US *
Zheng Guibin; Liu Sheng, Automatic Transcription Method for Polyphonic Music Based on Adaptive Comb Filter and Neural Network, Mechatronics and Automation, 2007. ICMA 2007. International Conference on, 20070801, IEEE, Pi *

Also Published As

Publication number Publication date
SE2051550A1 (en) 2022-06-23

Similar Documents

Publication Publication Date Title
Klapuri Multiple fundamental frequency estimation based on harmonicity and spectral smoothness
Eronen Comparison of features for musical instrument recognition
De Poli et al. Sonological models for timbre characterization
Klapuri Automatic music transcription as we know it today
Marolt A connectionist approach to automatic transcription of polyphonic piano music
Klapuri A perceptually motivated multiple-f0 estimation method
US6798886B1 (en) Method of signal shredding
Klapuri et al. Robust multipitch estimation for the analysis and manipulation of polyphonic musical signals
Carabias-Orti et al. Musical instrument sound multi-excitation model for non-negative spectrogram factorization
CN107851444A (en) For acoustic signal to be decomposed into the method and system, target voice and its use of target voice
JP4815436B2 (en) Apparatus and method for converting an information signal into a spectral representation with variable resolution
Cogliati et al. Context-dependent piano music transcription with convolutional sparse coding
KR20060102757A (en) Method for classifying music genre using a classification algorithm
WO2005062291A1 (en) Signal analysis method
US6054646A (en) Sound-based event control using timbral analysis
SE544738C2 (en) Method and system for recognising patterns in sound
CN112289289A (en) Editable universal tone synthesis analysis system and method
Klapuri et al. Automatic music transcription
Trail et al. Direct and surrogate sensing for the Gyil african xylophone.
Klapuri Auditory model-based methods for multiple fundamental frequency estimation
Vaghy Automatic Drum Transcription Using Template-Initialized Variants of Non-negative Matrix Factorization
Chunghsin Multiple fundamental frequency estimation of polyphonic recordings
US11024273B2 (en) Method and apparatus for performing melody detection
Chétry Computer models for musical instrument identification
Stöter Separation and Count Estimation for Audio Sources Overlapping in Time and Frequency