SE2051550A1 - Method and system for recognising patterns in sound

Method and system for recognising patterns in sound

Info

Publication number
SE2051550A1
Authority
SE
Sweden
Prior art keywords
sound
representation vector
pattern
audio signal
dictionary
Prior art date
Application number
SE2051550A
Other versions
SE544738C2 (en)
Inventor
Stanislaw Gorlow
Original Assignee
Algoriffix Ab
Priority date
Filing date
Publication date
Application filed by Algoriffix Ab filed Critical Algoriffix Ab
Priority to SE2051550A priority Critical patent/SE544738C2/en
Publication of SE2051550A1 publication Critical patent/SE2051550A1/en
Publication of SE544738C2 publication Critical patent/SE544738C2/en

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/60: Information retrieval; Database structures therefor; File system structures therefor of audio data
    • G06F16/68: Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/683: Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10G: REPRESENTATION OF MUSIC; RECORDING MUSIC IN NOTATION FORM; ACCESSORIES FOR MUSIC OR MUSICAL INSTRUMENTS NOT OTHERWISE PROVIDED FOR, e.g. SUPPORTS
    • G10G1/00: Means for the representation of music
    • G10G1/04: Transposing; Transcribing
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10H: ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H3/00: Instruments in which the tones are generated by electromechanical means
    • G10H3/12: Instruments in which the tones are generated by electromechanical means using mechanical resonant generators, e.g. strings or percussive instruments, the tones of which are picked up by electromechanical transducers, the electrical signals being further manipulated or amplified and subsequently converted to sound by a loudspeaker or equivalent instrument
    • G10H3/125: Extracting or recognising the pitch or fundamental frequency of the picked up signal
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10H: ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H2210/00: Aspects or methods of musical processing having intrinsic musical character, i.e. involving musical theory or musical parameters or relying on musical knowledge, as applied in electrophonic musical tools or instruments
    • G10H2210/031: Musical analysis, i.e. isolation, extraction or identification of musical elements or musical parameters from a raw acoustic signal or from an encoded audio signal
    • G10H2210/046: Musical analysis, i.e. isolation, extraction or identification of musical elements or musical parameters from a raw acoustic signal or from an encoded audio signal for differentiation between music and non-music signals, based on the identification of musical parameters, e.g. based on tempo detection
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10H: ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H2210/00: Aspects or methods of musical processing having intrinsic musical character, i.e. involving musical theory or musical parameters or relying on musical knowledge, as applied in electrophonic musical tools or instruments
    • G10H2210/031: Musical analysis, i.e. isolation, extraction or identification of musical elements or musical parameters from a raw acoustic signal or from an encoded audio signal
    • G10H2210/066: Musical analysis, i.e. isolation, extraction or identification of musical elements or musical parameters from a raw acoustic signal or from an encoded audio signal for pitch analysis as part of wider processing for musical purposes, e.g. transcription, musical performance evaluation; Pitch recognition, e.g. in polyphonic sounds; Estimation or use of missing fundamental
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10H: ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H2250/00: Aspects of algorithms or signal processing methods without intrinsic musical character, yet specifically adapted for or used in electrophonic musical processing
    • G10H2250/131: Mathematical functions for musical analysis, processing, synthesis or composition
    • G10H2250/145: Convolution, e.g. of a music input signal with a desired impulse response to compute an output
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Acoustics & Sound (AREA)
  • Library & Information Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Auxiliary Devices For Music (AREA)

Abstract

Method for interpreting a digital audio signal (AS), comprising the steps of a) determining a dictionary (D) of characteristic sound patterns (SP); b) determining a time window (TW) of said audio signal; c) transforming said time window to the frequency domain to obtain a snapshot (X); d) calculating, using a selected multiplicative update rule, a pattern representation vector (r) minimising a distance measure between the snapshot (X) and an approximation (Dr) being a linear combination of said sound patterns multiplied by the pattern representation vector; and e) repeating using a subsequent time window, wherein both said dictionary (D) and said pattern representation vector (r) have only non-negative elements, and wherein step d is performed using a cost function that includes an elementwise Bregman divergence and in addition thereto one or several specified penalty terms. The invention also relates to a system and to a computer software product.

Description

Method and system for recognising patterns in sound

The present invention relates to a method and a system for recognising patterns in sound, in particular tonal sound, such as music or singing. In other words, the present invention relates to the interpretation of sound, which means the cognition of basic elements in sound. The present invention may further relate to transcription of sound, i.e., the recognition and interpretation of musical notes, in the case of music or singing.
Systems for automatic music transcription have been developed for years and decades, as such systems can be useful in many situations. A typical use case is when a musician performs a piece of music and a computing system recognises the played notes and writes them down in musical notation and/or responds with feedback. There are also other use cases, where the recognised notes can be used in many ways. As an example, the recognised notes can be used to generate new musical elements, such as a countermelody or a chord progression. This is a matter of particular interest in human-machine interactive composition, which has a strong connection with live music, electronic music, and computer music.
Practicing a musical instrument is usually associated with professional supervision and personalised feedback when it comes to an unskilled apprentice. This is particularly true for a novice. Otherwise, fatigue may set in quickly, and even the most talented student can lose interest in continuing with practice or even in learning music as such. And yet, not everybody is willing to pay a personal tutor, especially if the outcome is unclear. Other factors, such as dispensability, can also influence one's decision. A reasonable compromise may consist in learning agents that take the role of the tutor. And to avoid further spending on expensive hardware, the agents would preferably be installed on a tablet computer, which as of today is equipped with a speaker and a microphone.
Commonly, sound is captured as a waveform signal over one or several microphones. It is this waveform signal which can be digitally processed with the aim of extracting musical information, such as the presence of complex tones or notes.
This task has proven to be challenging for several reasons.
Firstly, there is a problem of resiliency to disturbances and noise. Often, captured sound is polluted by noise of different types and it may become difficult to identify the sounds of interest.
Secondly, in monophonic music it is often a matter of basic ear training to identify the pitch of individual notes. However, when the texture is polyphonic, the identification and interpretation become much more difficult. In addition, complex tones consist of multiple partials that are not always an integer multiple of the fundamental and the strengths of which can be influenced in many ways.
Thirdly, there is the temporal aspect. Processing captured sound in retrospect can be done from beginning to end, several times in a row, under different angles. In many situations, however, it is desirable to interpret a detected sound in (near) real-time, while the signal is being captured. For example, a musician may wish to receive immediate feedback from a device while he/she is playing an instrument. The device must be able to ”listen” to the produced sounds and make out a meaning from it within tens of milliseconds to respond.
The problem of real-time processing and analysis is not solved with computer power alone, since the detection of tones requires some minimum amount of data that is accumulated over a sufficiently long time.
The thesis ”REAL-TIME MUSICAL ANALYSIS OF POLYPHONIC GUITAR AUDIO” by John Hartquist, which was presented to the Faculty of California Polytechnic State University in June 2012, describes a method for real-time polyphonic music transcription. The method relies on a predetermined dictionary of harmonic templates for each playable note. The guitar signal is transformed to the frequency domain using the discrete Fourier transform and compared against the note templates in the dictionary. The corresponding likelihood is obtained by means of non-negative matrix factorisation (NMF). The optimisation problem is thus solved using an iterative approach and a cost function from the class of Bregman divergences.
US 2017/0243571 A1 describes a similar method for transcribing piano recordings.
One problem with transcription techniques that rely only on a Bregman divergence as the goodness of fit is robustness. A signal captured in a room usually is not without noise or reverberation. Moreover, interference always occurs between the partials of a tone, and even more so between the partials of multiple tones. The body of a stringed instrument, e.g., resonates at certain frequencies and affects the strengths of the partials, which can change the tone colour. The instrument can also be slightly untuned, shifting the partials in frequency. This means that tone patterns of a predefined or pre-learnt dictionary never perfectly match with the sounds of a captured signal, leading to recognition errors and inaccuracies in the transcription.
The present invention addresses the problems described above.
Hence, the invention relates to a method for interpreting a digital audio signal, comprising the steps of a) determining a dictionary defining a set of different characteristic sound patterns; b) determining a time window of a certain length of said digital audio signal; c) transforming said time window of the digital audio signal to the frequency domain to obtain a transformed snapshot; d) calculating, using a selected multiplicative update rule, a pattern representation vector that minimises a distance measure between the snapshot and an approximation, being a linear combination of said sound patterns defined by said dictionary multiplied by the coefficients of the pattern representation vector; and e) repeating from step b using a subsequent time window, wherein both said dictionary and said pattern representation vector have only non-negative elements, wherein step d is performed using a cost function that includes an elementwise Bregman divergence and in addition thereto one or several penalty terms, said one or several penalty terms being selected from the list of a first penalty term, comprising a diversity index, preferably the Shannon index, being a measure of the number of different patterns in the approximation according to their respective proportional abundances, as quantified by the pattern representation vector; a second penalty term, comprising a norms difference between the taxicab norm and the Euclidean norm of the pattern representation vector; a third penalty term, comprising a sum of off-diagonal entries in the correlation matrix of the pattern representation vector; and a fourth penalty term, comprising a sum of the squares of the differences between elements of said pattern representation vector and their smoothed values, said smoothed values being calculated for each element in question separately, to represent a time-smoothed representation of the element in question.
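As a non-limiting illustration of how such a cost function could be composed, the following Python sketch evaluates a beta-divergence between a snapshot x and the approximation Dr together with the four penalty terms listed above. The function names, the penalty weights, the choice beta = 2/3 and the reading of the third penalty as the off-diagonal sum of the outer product of r are assumptions made for this example only.

```python
import numpy as np

def beta_divergence(x, y, beta=2/3, eps=1e-12):
    """Elementwise beta-divergence, summed over frequency bins (0 < beta < 1)."""
    x = x + eps
    y = y + eps
    return np.sum(x**beta / (beta * (beta - 1))
                  + y**beta / beta
                  - x * y**(beta - 1) / (beta - 1))

def cost(x, D, r, r_smooth=None, lambdas=(0.1, 0.1, 0.1, 0.1), beta=2/3, eps=1e-12):
    """Hypothetical total cost: divergence between snapshot x and approximation D @ r
    plus the four penalty terms named above (weights are illustrative)."""
    l1, l2, l3, l4 = lambdas
    fit = beta_divergence(x, D @ r, beta)

    # 1) Diversity (Shannon index) of the proportional abundances of r.
    p = r / (r.sum() + eps)
    shannon = -np.sum(p * np.log(p + eps))

    # 2) Difference between the taxicab (L1) and Euclidean (L2) norms of r.
    norms_diff = np.sum(np.abs(r)) - np.linalg.norm(r)

    # 3) Sum of off-diagonal entries of the outer-product correlation matrix of r;
    #    for a non-negative vector this equals ||r||_1^2 - ||r||_2^2 (assumed reading).
    off_diag = np.sum(r) ** 2 - np.sum(r ** 2)

    # 4) Squared deviation of r from its time-smoothed values (e.g. from earlier frames).
    if r_smooth is None:
        r_smooth = r
    smooth = np.sum((r - r_smooth) ** 2)

    return fit + l1 * shannon + l2 * norms_diff + l3 * off_diag + l4 * smooth
```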
Furthermore, the invention relates to a system for interpreting a sampled sound signal, said system being arranged with a dictionary defining a set of different characteristic sound patterns, said system being arranged to determine, in a step b, a time window of a certain length of said digital audio signal, said system being arranged to transform, in a step c, said time window of the digital audio signal to the frequency domain to obtain a transformed snapshot, said system being arranged to calculate, in a step d, using a selected multiplicative update rule, a pattern representation vector that minimises a distance measure between the snapshot and an approximation, being a linear combination of said sound patterns defined by said dictionary multiplied by the coefficients of the pattern representation vector, and said system being arranged to repeat from said step b using a subsequent time window, both said dictionary and said pattern representation vector having only non-negative elements, said system being arranged to perform said calculation of said pattern representation vector using a cost function that includes an elementwise Bregman divergence and in addition thereto one or several penalty terms, said one or several penalty terms being selected from the list of a first penalty term, comprising a diversity index, preferably the Shannon index, being a measure of the number of different patterns in the approximation according to their respective proportional abundances, as quantified by the pattern representation vector; a second penalty term, comprising a norms difference between the taxicab norm and the Euclidean norm of the pattern representation vector; a third penalty term, comprising a sum of off-diagonal entries in the correlation matrix of the pattern representation vector; and a fourth penalty term, comprising a sum of the squares of the differences between elements of said pattern representation vector and their smoothed values, said smoothed values being calculated for each element in question separately, to represent a time-smoothed representation of the element in question.
Moreover, the invention relates to a computer software product for interpreting a sampled sound signal, said computer software product comprising or being associated with a dictionary defining a set of different characteristic sound patterns, said computer software product being arranged to, when executed on computer hardware, determine, in a step b, a time window of a certain length of said digital audio signal; transform, in a step c, said time window of the digital audio signal to the frequency domain to obtain a transformed snapshot; calculate, in a step d, using a selected multiplicative update rule, a pattern representation vector that minimises a distance measure between the snapshot and an approximation, being a linear combination of said sound patterns defined by said dictionary multiplied by the coefficients of the pattern representation vector; and repeat, from step b, using a subsequent time window, both said dictionary and said pattern representation vector having only non-negative elements, said computer software product being arranged to, when executed on computer hardware, perform said calculation of said pattern representation vector using a cost function that includes an elementwise Bregman divergence and in addition thereto one or several penalty terms, said one or several penalty terms being selected from the list of a first penalty term, comprising a diversity index, preferably the Shannon index, being a measure of the number of different patterns in the approximation according to their respective proportional abundances, as quantified by the pattern representation vector; a second penalty term, comprising a norms difference between the taxicab norm and the Euclidean norm of the pattern representation vector; a third penalty term, comprising a sum of off-diagonal entries in the correlation matrix of the pattern representation vector; and a fourth penalty term, comprising a sum of the squares of the differences between elements of said pattern representation vector and their smoothed values, said smoothed values being calculated for each element in question separately, to represent a time-smoothed representation of the element in question.

In the following, the invention will be described in detail, with reference to exemplifying embodiments of the invention and to the enclosed drawings, wherein:

Figure 1 illustrates a system according to the present invention, arranged to perform a method according to the invention;
Figure 2 is a flow chart illustrating a method according to the present invention;
Figure 3a illustrates a first sound pattern definition according to the invention;
Figure 3b illustrates a second sound pattern definition according to the invention;
Figure 4 illustrates a dictionary according to the present invention; and
Figure 5 illustrates the processing of a digital audio signal according to a method of the present invention.
Hence, Figure 1 shows a system 100 according to the invention, the system 100 being arranged to interpret a digital audio signal AS by performing a method according to the present invention.
The system 100 comprises a computer software product, arranged to be executed on virtual or physical hardware, and when executed, arranged to perform a method according to the present invention. In Figure 1, this is exemplified by a computer 110, on the hardware of which said computer software product is arranged to be executed. The computer 110 may be a desktop computer, a laptop computer, a tablet computer, a mobile phone or similar. The computer 110 may also be virtual hardware, such as a cloud-hosted virtual server.
The computer 110 may be standalone or distributed. In some embodiments, the computer 110 may be associated with or comprise one or more different hardware interfaces accessible by said computer software product.
Such a hardware interface may comprise an acoustic sound capturing device 111, such as a microphone. The sound capturing device 111 may be arranged to capture acoustic sound in one or several channels, such as in mono or stereo, and to provide a corresponding electric signal. The sound capturing device 111 may also comprise an analogue-to-digital converter, arranged to provide a digital audio signal interpretable by said computer software product.
Acoustic sound, detectable by sound capturing device 111, may be generated by any sound source, such as a human voice or a musical instrument. In Figure 1, an exemplifying sound source in the form of an acoustic piano 200 is shown. In general, the sound source may be any sound generating entity, such as a human being or an acoustic and/or electric instrument, such as a stringed instrument, a wind instrument, or even a percussion instrument, or a mixture of different sound sources. In preferred embodiments, the sound source 200 is a pitched sound source in the sense that it can produce different sound patterns differing with respect to frequency and preferably being able to produce sound having different fundamental frequencies that a human listener perceives as different notes. Such a sound pattern may hence be associated with a fundamental frequency and one or several additional frequencies, such as integer multiples of the fundamental (harmonics). A single sound source may furthermore be arranged to produce several such sound patterns simultaneously, such as a chord played on a guitar or a piano.
Such hardware interfaces may also comprise an acoustic sound generating device 112, such as a loudspeaker. The sound generating device 112 may be arranged to generate acoustic sound over one or several channels, such as in mono or stereo, from a provided electric signal. The sound generating device 112 may also comprise a digital-to-analogue converter, arranged to produce an analogue electric signal from a digital audio signal provided by said computer software product, and to use the analogue signal to produce said acoustic sound.
The computer 110 may be connected to the sound capturing device 111 and to the sound generating device 112 via a suitable respective wired or wireless connection.
Figure 1 also shows a computer server 120, which may be physical or virtual hardware and a distributed or standalone computing entity, in a way corresponding to what has been said in relation to the computer 110. The computer 110 and the server 120 may communicate digitally via a network 10, such as the Internet.
As the word is used herein, the term ”hardware” denotes specialised or general-purpose computer hardware, typically comprising a CPU, RAM, and a computer bus, as well as any required peripheral interfaces and any conventional computer hardware, such as a GPU.

Instead of, or in addition to, a digital audio signal provided by the sound capturing device 111, the computer software product may be arranged to process and interpret a digital sound signal provided to the computer 110 from the server 120. In this case, the digital sound signal may be a pre-recorded and/or synthesised digital sound signal, being provided to the computer software product either in a file-based fashion, for instance as a complete music file, or in a chunk-based fashion as in streaming. The sound signal received by the computer software product from the sound capturing device 111 may, alternatively, be captured in (near) real time, as the acoustic sound is generated by the sound source 200, and continuously provided to the computer software product as a continuous signal. Irrespectively, the audio signal may be provided to the computer software product in any conventional data format, providing encoding and possibly compression, such as a WAV data file.
As used herein, the terms ”audio” and ”sound” both refer to an acoustic wave that may be physical or represented as a signal. Hence, an ”audio signal” refers to such an analogue or digitally coded signal representing said audio.
As used herein, a ”sound pattern” is any pattern of a sound defined in terms of frequencies and amplitudes. For instance, a sound pattern may be defined in terms of a frequency-amplitude profile. A sound pattern may be associated with a time dimension, so that the sound pattern may be defined to change in a particular way over time. However, in preferred embodiments of the invention a sound pattern is static in relation to time. In the below-described dictionary, comprising definitions of such sound patterns, each definition of a sound pattern may cover or comprise several different variants of the sound pattern in question, such as to reflect different playing techniques.
A ”tone”, as used herein, is the sound resulting from the generation of a particular sound pattern. Such a tone is typically associated with a pitch, typically associated with a fundamental frequency, being the lowest frequency in the sound pattern, and a timbre, also known as tone colour or tone quality. For a piano, the tone of a key on the keyboard would be the sound resulting from the string being struck by pressing down that key in question. A tone may be generated with relatively stronger or weaker amplitude, usually referred to as velocity.
A ”timbre”, as used herein, refers to the frequency profile of a sound pattern, given a fundamental frequency of the sound pattern. In the example of the piano key, the timbre of the sound pattern refers to the spectral envelope of a tone produced by striking the string in question. For one and the same instrument, different tones may be associated with a similar timbre.
A ”note”, as used herein, refers to a piece of information, typically comprising a pitch and defining a particular playable sound pattern on an instrument. Such ”notes” can be defined using musical notation. For some sound sources, such as fretted instruments like the guitar, notes with the same pitch can be played in several different positions on the fretboard. The timbre of one and the same note played in different positions on the fretboard may differ significantly.
Figure 2 illustrates a method according to the present invention, for interpreting a digital audio signal AS of the type generally illustrated in Figure 5.
Figure 5 is a simplified overview illustrating the processing of the audio signal AS according to the present invention. As is clear from Figure 5, the audio signal AS is produced as at least the concatenation of different tones TO generated in parallel and/or in series to each other over time. For instance, the tones TO may be played on one or several instruments. A time window TW is moved along the audio signal AS (along a time axis), such as with an overlap, and for every new position of the time window TW the audio signal is multiplied by a window function WF. The windowed audio signal is subjected to a time-frequency transform and subsequent analysis in the frequency domain, resulting in a symbolic representation of the audio signal AS at the position of the time window TW in question, expressed as a particular combination of sound patterns SP recognised in the audio signal. Each sound pattern SP is shown in Figure 5 as a (possibly unique) distribution of relative amplitudes (Y-axis) for a set of frequencies (X-axis). If the recognised sound patterns SP were to be generated simultaneously as a sound, together they would approximate the sound observed in the audio signal AS at the position of the time window TW in question.
As mentioned, the present method can be performed on an audio signal AS in the form of a data stream being made available to the computer software product in real-time, or at least continuously, but it is also envisioned to use the present method to process music files or groups of files, as in batch processing. Since a file can be viewed as a stream of audio data of limited duration, the corresponding method steps as described herein below for an audio file stream can be applied to either a continuously provided data stream or an already existing audio data file.
The present invention achieves a flexible, tuneable, compact, and fast method of interpreting a digital audio signal AS and can produce high-quality results in (near) real-time, even when implemented using standard computer hardware.

In a first step, the method starts.

In a subsequent step, the audio signal AS is provided, as discussed.

In another step, that may be performed before or after said audio signal AS provision step, a dictionary D is determined. Figure 4 illustrates an example of the dictionary D, in the form of a matrix, defining for each of the eleven sound patterns SP1, ..., SPI (I being the total number of sound patterns SP) relative amplitudes for each of a set of seven frequencies F1, ..., FK (K being the number of considered frequencies). It is understood that this is a simplified view, and in a practical application both the number of sound patterns and frequencies would be larger.
The role of such a dictionary D in the context of the present invention is to define the set of expected sound patterns SP (SP1-SP11) that may be recognised in said sound signal AS and used jointly to find an approximation to said sound signal AS. To be precise, the dictionary D numerically defines a set of different characteristic sound patterns SP. Such a definition may be in the form of a set of different frequencies, associated with a corresponding set of amplitudes. For instance, the dictionary D may define a function that associates a frequency value to an amplitude, or the dictionary D may define a set of one or more (preferably at least two or more) frequencies and for each such frequency a corresponding amplitude.
One way to interpret the information in the dictionary D defining such a sound pattern SP is as a characteristic spectral symbolic representation (fingerprint) resulting from an activation of the sound pattern SP in question in isolation, without activating any other sound pattern SP and with no ambient sounds or noise. For instance, the sound pattern SP may define the characteristic sound of a particular string being picked or plucked when the string is pressed down at a particular fret, or of a particular key pressed down on a piano. However, the sound pattern SP may also represent other types of musical primitives, such as a particular vocal sound (e.g., a vowel) produced by a singer or a particular sound made by any sound-generating device in any situation. The sound pattern SP may define a ”musical” sound in the sense that the sound pattern SP relates to a particular musical tone, of a certain timbre. But the sound pattern SP may also define any sound primitive that per se does not easily translate to a particular played or sung tone, occurring in either musical or non-musical contexts (atonal sounds).
For convenience, the dictionary D may be stored as a matrix representation available to said computer software product.
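A minimal sketch of such a matrix representation, with made-up numbers and far fewer patterns and frequencies than a practical dictionary would contain, could look as follows; the note labels and amplitude values are purely illustrative.

```python
import numpy as np

# Toy dictionary with K = 7 frequency bins (rows) and I = 3 sound patterns (columns).
note_labels = ["E2", "A2", "D3"]
D = np.array([[1.00, 0.00, 0.00],
              [0.55, 1.00, 0.00],
              [0.30, 0.60, 1.00],
              [0.20, 0.35, 0.62],
              [0.12, 0.20, 0.38],
              [0.08, 0.12, 0.25],
              [0.05, 0.08, 0.15]])

r = np.array([0.8, 0.0, 0.3])     # pattern representation vector (velocities)
x_approx = D @ r                  # approximation of the snapshot X as D r
```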
That the dictionary D is ”determined” may mean that it is established or calculated, by the computer 110 and/or the server 120, based on available information, or that it is provided to the computer 110 and/or the server 120, or a combination of the two (such that the computer 110 and/or the server 120 receives a dictionary and modifies it).
The dictionary D may be selected, modified, or calculated based on the information that is specific to the implementation, using a priori available information about the digital audio signal AS. For instance, the dictionary may be created with the purpose of extracting musical information from an audio signal knowing that the audio signal contains representations of sound from an acoustic guitar or from a female singer, and the dictionary D may then be created to specifically define sound patterns SP typically produced by such an acoustic guitar or singer.
Such a bespoke dictionary D may then be pre-created and provided to the computer 110 and/or server 120, such as selected from a library of pre-created dictionaries stored in a database. The dictionary may also be created on-the-fly by combining a plurality of different pre-defined sound pattern SP definitions together forming the created dictionary D. In this and other embodiments, the dictionary D may then contain sound pattern SP definitions for several different sound sources, such as for different instruments. For instance, the dictionary D may contain sound pattern SP definitions for several different tones that can be generated by an acoustic guitar and/or by an acoustic piano and/or by a singer.
The creation of the dictionary D in a way specifically tailored to the situation may take place in different ways. In general, the dictionary D is created by collecting a plurality, such as at least three, such as at least ten, such as at least fifty, different sound pattern SP definitions of the above-described type and forming the dictionary D from these collected sound patterns SP. For instance, and as illustrated in Figure 4, the dictionary D may then be formed as a matrix having as rows or columns the different relative amplitudes, for a discrete set of different frequencies, representative for the sound pattern SP in question.
One way to create such sound pattern SP representations is to record a sample of the sound pattern SP and determine the frequency-amplitude information content of the signal. For instance, the recorded event may be the picking of one particular string on a guitar, with the string being pressed down on a particular fret.
Another way to create such sound pattern representations is to calculate the sound pattern SP definition based on a physical model of the device generating the sound. For instance, for string or wind instruments the produced sounds are the result of the physical properties of the instrument in question, properties that are known to a large extent and similar across different individual instruments belonging to one and the same class of instruments.
So, for both string and wind instruments a played tone will in general comprise a fundamental frequency, typically associated with the length of a vibrating string on a guitar or piano or the length of the tube in a flute. Several overtones will also be generated, representing standing waves of higher frequencies along the vibrating string or column of air. The calculated sound pattern SP representation will then typically contain the fundamental frequency of that particular tone defined with a particular amplitude, such as an amplitude of 1, and additional frequency/amplitude pairs for each overtone. Such sound patterns SP may represent reasonably well many or all instruments of the same class of instruments, due to physical similarities between such instruments of the same class.
As used herein, the term ”class” of instruments relates to any group of instruments sharing one or several determined common characteristics. For instance, ”guitars” may be such a class, or ”acoustic guitars”, or ”six-stringed acoustic guitars”, or even ”all guitars of a particular model from a particular guitar-making brand”. Hence, class definitions in a narrower or wider sense may be warranted in different circumstances, depending on the prerequisites and aims for the concrete implementation of the invention.

In general, the determination of the dictionary D may comprise a step of, for at least one pitched musical instrument or at least one class of pitched musical instruments, and for each note from a set of playable notes on such instrument or class of instruments, synthesising a pattern based on a physical model of an acoustic resonator of the instrument or class in question, and further based on the one-dimensional wave equation.

It has turned out that a reasonable simplification is to assume that, for said instrument or class of instruments, said physical model assumes the same relative amplitudes of the respective overtones for all playable notes. Hence, if each playable note for example is associated with a fundamental frequency and five overtone frequencies, the sound patterns SP for that instrument or class of instruments are represented using the same amplitudes for the six different frequencies associated with each playable note. Another way of expressing this is that one and the same spectral envelope is used for the entire set of playable notes for the instrument or instrument class in question.
This may refer to a specific set of amplitudes or a mapping between frequencies and amplitudes, as the case may be. Typically, the relative amplitude decreases with increasing frequency from the fundamental frequency upwards. Figure 3a illustrates a sound pattern SP representation, relating harmonic partials to specific amplitude values for the example of a random note. Figure 3b illustrates a sound pattern representation, in which the amplitude values are obtained from a corresponding function.

In particular, the inventor has discovered that it is often possible to limit the frequency-domain analysis described below to a finite set of frequencies, such as to at most 10 distinct frequencies for each playable note and/or at most 200 distinct frequencies for each set of playable notes of each detected sound-generating device. This results in computational efficiency. Such a finite set of frequencies may depend on a particular type of sound-generating device, such as on a particular instrument or class of instruments.
One way of selecting such a finite set of frequencies is to use only the fundamental frequency and a corresponding set of partials for each playable note to define each sound pattern SP. This will result in a dictionary D comprising information about amplitudes of harmonics for a set of such fundamental frequencies. In particular, the dictionary D may be limited to information about amplitudes of such fundamental frequencies and partials for a set of playable notes from one or several sound-generating devices.

It is realized that a ”frequency”, in this context, may be defined as a distinct frequency or as a narrow frequency band, for instance centred about a distinct frequency. The frequency of a partial of one playable note may coincide or almost coincide with the fundamental or a partial of a different playable note. By adjusting the width of said frequency band, one single ”frequency” may suffice to represent said respective frequencies of both playable notes. This way, the number of analysed frequencies can be limited, resulting in a smaller dictionary D and reduced computational complexity.
As mentioned above, each of said fundamental frequencies used when constructing such a dictionary D may correspond to a respective playable note on a particular instrument or class of instruments.

In some embodiments, said frequencies selected as fundamental frequencies and associated partials may be selected as the frequencies associated with or specified in the MIDI tuning standard, MTS, for a given concert A up to the Nyquist frequency. This will be exemplified in the following.

In general, the frequency scale used for the time-frequency analysis may be derived from the MIDI tuning standard, limited in range in accordance with the frequency range of said instrument or class of instruments. For a tuned regular guitar, for instance, the frequency scale ranges from 82.41 Hz (E2) to about 5 kHz (from the lowest fundamental frequency to the highest harmonic), and correspondingly for a four-string bass guitar the scale starts at 41.20 Hz (E1).
The frequency scale may be subdivided into discrete data points between d_min and d_max, where

d = 12 · log2(f / 440 Hz) + 69.

The corresponding frequencies are given by

f = 2^((d - 69) / 12) · 440 Hz,

with d = d_min, d_min + 1, ..., d_max - 1, d_max.

As outlined above, the present method makes use of representation learning to recognize patterns of complex tones. This is achieved as described in detail below, by formulating the task of identifying sound patterns SP in the audio signal AS as a non-convex optimization problem and using said dictionary D defining typical patterns or atoms that are determined beforehand.
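A short Python sketch of this conversion between frequencies and MIDI note numbers, assuming concert A = 440 Hz and an illustrative range of note numbers, might read:

```python
import numpy as np

def freq_to_midi(f_hz):
    """MIDI note number (possibly fractional) of a frequency, with concert A = 440 Hz."""
    return 12.0 * np.log2(f_hz / 440.0) + 69.0

def midi_to_freq(d):
    """Frequency in Hz of MIDI note number d."""
    return 2.0 ** ((d - 69.0) / 12.0) * 440.0

# Discrete frequency scale between d_min and d_max, e.g. from E2 (MIDI 40) upwards.
d_min, d_max = 40, 127
scale = midi_to_freq(np.arange(d_min, d_max + 1))
```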
One way of constructing the dictionary D is by supervised learning, which learns the sound patterns automatically. All it requires is at least one sound example (audio signal) for each note in the dictionary and the corresponding pitches (labels), which can be known a priori or inferred from the audio signal using a pitch detection algorithm. Such a supervised learning algorithm can be iterative and resemble non-negative matrix factorisation. Using non-negative matrix factorisation, e.g., one would compute a spectrogram V for each note signal, such as for the note C4, and factorise the spectrogram into a matrix W and a matrix H while setting the rank of the factorisation (number of patterns per note) to, e.g., 1. The K-by-1 vector W can then be added to the dictionary D as the sound pattern for the note C4, while the 1-by-N vector H is discarded. This procedure is to be repeated for every note in the dictionary.
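The following sketch illustrates one possible such procedure, using plain Euclidean multiplicative updates for the rank-1 factorisation; the choice of divergence, the iteration count and the normalisation are assumptions made for the example and are not prescribed by the text.

```python
import numpy as np

def learn_pattern(V, n_iter=200, eps=1e-12):
    """Rank-1 NMF of a non-negative magnitude spectrogram V (K x N): V ~= W @ H.
    Returns the K-by-1 column W (added to the dictionary); H is discarded."""
    K, N = V.shape
    rng = np.random.default_rng(0)
    W = rng.random((K, 1)) + eps
    H = rng.random((1, N)) + eps
    for _ in range(n_iter):
        # Lee-Seung multiplicative updates for the Euclidean cost.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W / (np.linalg.norm(W) + eps)

# Example: one labelled recording per note, e.g. a spectrogram for the note C4.
# dictionary_columns["C4"] = learn_pattern(spectrogram_C4)
```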
The advantage of supervised learning is that one can create an instrument-specific dictionary D that considers the timbre (tone colour) of each note. A drawback is that one requires labelled data for all playable notes.

Instead, the dictionary D may be constructed in the following way, achieving a general-purpose dictionary D for complex tones.
Here, the guitar is used as an example. Knowing the number of strings, for instance six strings, the number of frets, such as nineteen, and the tuning, e.g., E2-A2-D3-G3-B3-E4 or 40-45-50-55-59-64 in MIDI note numbers, we also know the total number of sound patterns SP and all the fundamental frequencies of the corresponding tones:

f0(i, j) = 2^((d(i) + j - 69) / 12) · 440 Hz,

wherein i is the string index, d(i) is the MIDI note number of the open string, and j is the fret index on the fingerboard. The fundamental of the open string is addressed with j = 0. For each note on the guitar, a time signal is generated that has the length of the time window (see below) and which consists of pure tones of a harmonic series.
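Applied to the six-string, nineteen-fret example above, the formula could be tabulated as in the following sketch; the function and variable names are illustrative.

```python
# Fundamental frequencies f0(i, j) for a six-string, nineteen-fret guitar in standard
# tuning; i is the string index, j the fret index (j = 0 is the open string).
OPEN_STRING_MIDI = (40, 45, 50, 55, 59, 64)      # E2 A2 D3 G3 B3 E4

def f0(i, j):
    return 2.0 ** ((OPEN_STRING_MIDI[i] + j - 69) / 12.0) * 440.0

fundamentals = {(i, j): f0(i, j) for i in range(6) for j in range(20)}
# e.g. f0(0, 0) ~= 82.41 Hz (E2), f0(5, 0) ~= 329.63 Hz (E4)
```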
Each harmonic corresponds to the frequency of a vibrating mode, which is an integer multiple of the fundamental. Since the MIDI tuning standard is based on a chromatic scale (12-tone equal temperament, 12TET), which is slightly out of tune with respect to just intonation, we may add a correction factor c. Considering the first K = 31 harmonics, a complex tone is generated as

s_ij(n) = Σ_{k=1}^{K} a_jk · sin(2π · k · c_k · f0(i, j) · n), with c_k = 2^(Δk / 1200),

wherein Δk is the deviation in cents of the k-th harmonic from a just interval. The exact deviation values are shown in the following Table 1:

Table 1
Harmonic (k): Deviation (Δk) in cents
1, 2, 4, 8, 16: 0
3, 6, 12, 24: +2
5, 10, 20: -14
7, 14, 28: -31
9, 18: +4
11, 22: -49
13, 26: +41
15, 30: -12
17: +5
19: -2
21: -29
23: +28
25: -27
27: +6
29: +30
31: +45

The strength a_jk of the respective harmonic, for k = 1, 2, ..., K, is computed from the one-dimensional wave equation as set out further below.

As mentioned above, it has proven beneficial to use a single spectral envelope for all tones, irrespective of their location on the fretboard. By doing so, the complexity of the algorithm can be reduced, since one can limit the dictionary to tones with unique pitches. We can do the same with a trained or learned dictionary, by assigning tones with the same pitch the same label before training.
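A hedged sketch of this synthesis step is given below; the sample rate, tone duration, relative plucking position p/L and the exact reading of the amplitude ratio a_jk (taken from the wave-equation formula further below) are assumptions for the example.

```python
import numpy as np

# Deviations (in cents) of the first 31 harmonics from 12TET, as in Table 1.
CENTS = {1: 0, 2: 0, 3: 2, 4: 0, 5: -14, 6: 2, 7: -31, 8: 0, 9: 4, 10: -14,
         11: -49, 12: 2, 13: 41, 14: -31, 15: -12, 16: 0, 17: 5, 18: 4,
         19: -2, 20: -14, 21: -29, 22: -49, 23: 28, 24: 2, 25: -27, 26: 41,
         27: 6, 28: -31, 29: 30, 30: -12, 31: 45}

def harmonic_amplitudes(j, K=31, p_over_L=0.15):
    """Relative harmonic strengths a_jk for fret j; p_over_L is an assumed value."""
    k = np.arange(1, K + 1)
    arg = np.pi * p_over_L * 2 ** (j / 12)
    return np.sin(k * arg) / (k ** 2 * np.sin(arg))

def complex_tone(f0, j, fs=16000, duration=0.2, K=31):
    """Synthesise one dictionary tone as a sum of corrected harmonics."""
    t = np.arange(int(fs * duration)) / fs
    a = harmonic_amplitudes(j, K)
    tone = np.zeros_like(t)
    for k in range(1, K + 1):
        ck = 2 ** (CENTS[k] / 1200)          # correction towards just intonation
        if k * ck * f0 < fs / 2:             # keep partials below the Nyquist frequency
            tone += a[k - 1] * np.sin(2 * np.pi * k * ck * f0 * t)
    return tone
```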
However, if it is deemed important to distinguish notes in terms of their location on the fretboard (such that each playable note is defined by a unique pair of string and fret), a fretted-instrument dictionary must be created with all possible such pairs (for instance, each possible finger position on the fretboard).
For non-fretted instruments, such as the piano or the voice, the creation of a dictionary D is similar but simpler because each key on the keyboard represents a unique tone within the tonal range of the piano. The tonal range of, for example, an 88-key standard piano lies between A0 and C8 or 21 and 108 in MIDI note numbers. The vocal range of classical performance covers about five octaves from a low G1 to a high G6. The common vocal range is from a low C2 to a high D6. Any individual's voice covers a range of one and a half to more than two octaves, depending on the voice type (from male bass to female soprano).
The dictionary D may be limited to the individual vocal range of a particular sound-generating device or class of devices (such as a class of instruments), to save on memory and to reduce computational complexity. In principle, however, the dictionary D could contain any number of sound patterns SP, which the algorithm would try to recognise in the audio signal AS. For instance, in the case of atonal music or non-musical audio, the dictionary D would not be based on the MIDI tuning standard.
Hence, a complex tone is generated using the above-described principles, is multiplied by the same window function as used for processing the digital audio signal AS (see below) and is transformed to the frequency domain in the same manner as the time window TW produced from the digital audio signal AS (see also below). In other words, each sound pattern SP is produced in such a way as to best correspond to an audio signal AS which has been windowed and transformed in accordance with the present invention and as described below.
For instance, a predetermined frequency scale, such as the 12TET frequency scale, can be used for this transform to the frequency domain, in the sense that only frequencies belonging to the frequency scale in question are evaluated.

In some embodiments, a second-order Goertzel algorithm may be used to compute the discrete Fourier transform for the respective frequency values. Preferably, only the magnitude spectra, which represent the absolute or relative strengths of the harmonics, are stored in the dictionary D as said sound patterns SP. The peak value of each sound pattern SP may be adjusted to one and the same level across sound patterns SP, such as to -15 dBFS, which for pure tones corresponds to an alignment level of -18 dBFS RMS.
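A straightforward (unoptimised) second-order Goertzel evaluation of the magnitude spectrum at a chosen set of frequencies could be sketched as follows; the function names are illustrative.

```python
import numpy as np

def goertzel_magnitude(x, f, fs):
    """Magnitude of the DFT of x at a single frequency f (Hz), second-order Goertzel."""
    w = 2.0 * np.pi * f / fs
    coeff = 2.0 * np.cos(w)
    s_prev, s_prev2 = 0.0, 0.0
    for sample in x:
        s = sample + coeff * s_prev - s_prev2
        s_prev2, s_prev = s_prev, s
    power = s_prev2 ** 2 + s_prev ** 2 - coeff * s_prev * s_prev2
    return np.sqrt(max(power, 0.0))

def spectrum_at(x, freqs, fs):
    """Magnitude spectrum evaluated only at the selected (e.g. 12TET) frequencies."""
    return np.array([goertzel_magnitude(x, f, fs) for f in freqs])
```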
Relative amplitudes of harmonics for a string of length L that is fixed at both ends and pulled, picked, or plucked at position p from the end of the string (for a guitar, from the bridge) can be derived from the one-dimensional wave equation. The ratio of amplitudes of the k-th to the first harmonic (fundamental) is given by

a_jk = sin(k · π · (p/L) · 2^(j/12)) / (k² · sin(π · (p/L) · 2^(j/12))).

A pitch, string, or fret map of a particular instrument or class of instruments is an associative array. It is a static data structure, that may be generated for each such instrument or class of instruments, and then used to look up the MIDI note number, the string index, or the fret index that is associated with a tone on, for instance, the fingerboard (fretboard) of a guitar.
This information may be interesting because a note can be played from different locations on the fingerboard of fretted instruments, generating the same pitch. What differentiates these tones is then their timbre.
For the piano, which is a non-fretted instrument, the pitch map would instead consist of 88 entries with unique MIDI note numbers ranging from 21 to 108. The string map in this case could contain values from 1 to 88, i.e., the key numbers, and the fret map would contain zeroes, following our convention of indexing open strings on a guitar or any other fretted instrument.
The corresponding principle can also be applied to voice, indicating the pitch and the index of the notes in the vocal range of a singer. Strictly speaking, these maps are only necessary for fretted instruments, and only if one wishes to distinguish the location of notes (string, fret) on the fingerboard. Otherwise, they can be omitted. In the context of the present invention, they may be used to implement a superstructure that is compatible with different tonal sources.
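As an illustration, such an associative array for the guitar example used above could be built as follows; the names and the nineteen-fret limit follow the earlier example, everything else is assumed.

```python
# A hypothetical pitch map for a six-string, nineteen-fret guitar in standard tuning
# (open-string MIDI notes 40-45-50-55-59-64). Keys are (string, fret) pairs;
# values are MIDI note numbers.
OPEN_STRINGS = (40, 45, 50, 55, 59, 64)   # E2 A2 D3 G3 B3 E4
N_FRETS = 19

pitch_map = {(i, j): OPEN_STRINGS[i] + j
             for i in range(len(OPEN_STRINGS))
             for j in range(N_FRETS + 1)}        # j = 0 addresses the open string

# Example: the same pitch (E4, MIDI 64) is reachable from several positions.
positions_of_e4 = [pos for pos, midi in pitch_map.items() if midi == 64]
# -> [(1, 19), (2, 14), (3, 9), (4, 5), (5, 0)]
```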
As mentioned, the present method operates on a digital audio signal AS, that may be provided to the computer 110 and/or server 120 or that may be measured by the sound capturing equipment 111. Irrespectively, the digital audio signal AS is provided with an information granularity in form of a sampling rate. Such a sampling rate may result from the operation of an analogue-to-digital conversion of the sound capturing device 111.
Again, referring to Figure 2, in a subsequent method step the method according to the present invention comprises the step of determining a time window TW of a certain length of said digital audio signal AS.

In some embodiments, the digital audio signal AS is processed at a sampling rate of 32 kHz or lower, such as 16 kHz or lower. This may mean that the audio signal AS is provided to the computer 110 and/or server 120 at this sampling rate, or that the computer software function is arranged to down-sample the audio signal AS prior to said time window TW being applied to the audio signal AS. Such down-sampling may take place by the computer software product (or a down-sampling hardware function) on the fly or prior to interpreting a particular audio file.

In other words, the method may comprise the additional step of resampling the digital audio signal AS to 32 kHz, 16 kHz, or an even lower sampling rate, before the application of the time window TW.
Namely, the present inventors have discovered that, using the present methodology, it is possible to use such low sampling rates and still achieve high-quality results for automatic sound pattern SP identification in an audio signal AS of sounds that are relevant to an average human listener.
The time window TW is a time window of a certain time duration, which is moved in the positive time direction along the audio signal AS (or the audio signal AS moves past the time window), as will be described below. In some embodiments, the time window TW is a weighted time window, being weighted using a window function WF, such as an asymmetric window function.

In other words, the audio signal may be multiplied by the window function WF, the window function WF being zero everywhere apart from a time interval constituting said time window TW. Across this time interval, the window function WF may be non-constant, so that a weighting is applied by said multiplication. In some embodiments, the window function WF is defined having a maximum in the second (later) half of the time window TW. The area under the window function WF may be larger in the second half of the window function WF, in relation to time, as compared to the area under the window function WF in the first half of the window function WF, in relation to time.
Hence, the present invention uses transcription based on a time-frequency analysis, in turn being based on a time window TW, of said type, moving in relation to the audio signal AS.
Any suitable window function WF can be used. However, in some embodiments the window function WF may be an asymmetric Kaiser-Bessel-derived, KBD, window.

In practical applications, a Kaiser-Bessel-derived window function WF with a non-zero shape parameter α has been found to yield good results. In some embodiments, the shape parameter α is at least 1, such as at least 5, and in some embodiments, it is at most 20, such as at most 10. The duration (length) of the time window TW may be selected according to a desired frequency resolution. For a regular guitar and standard tuning, the resolution should be about the distance between E2 and F2, whereas for a bass guitar with six strings the resolution should be about the distance between B0 and C1.
As a rule, the time window TW should capture at least one period of the difference tone, the frequency of which is given by the positive difference between the fundamental of the lowest playable note and the next-higher note. For the examples given above, such a window length T in seconds would correspond to

T = 1 / (F2 - E2)

in the case of a six-string guitar and standard tuning, and

T = 1 / (C1 - B0)

in the case of a six-string bass guitar. Furthermore, the time window TW would normally not cover more than 10 such periods, such as not more than 2 such periods.
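As a worked example using standard 12TET frequencies (the exact numbers are not stated in the application text): E2 ≈ 82.41 Hz and F2 ≈ 87.31 Hz give a difference tone of about 4.9 Hz and hence T ≈ 0.20 s, while B0 ≈ 30.87 Hz and C1 ≈ 32.70 Hz give about 1.8 Hz and T ≈ 0.55 s.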
Furthermore, the window function WF, here denoted w(n), may be defined to be asymmetrical, such as in the way described above. This has the advantage of minimizing the latency of note onsets when performing real-time analysis and interpretation of a received audio stream.
Also, the window function WF may be scaled, so its coefficients sum up to 2. As a concrete example, the peak of the window function WF, such as a window function WF of the general type described above, may be shifted to the right by

w(n) ← n · w(n), n = 0, 1, ..., L - 1,

where L is the window length in samples. The skewed window may then be rescaled (normalized), e.g., by

w(n) ← 2 · w(n) / Σ_{m=0}^{L-1} w(m).

As used herein, the term ”time window” of the digital audio signal AS refers to the audio signal multiplied by the window function WF as described above, to obtain a windowed digital audio signal. The window function WF may, in some embodiments, be zero at either or both of its end points, tapering away from the maximum towards both ends.

In a subsequent step, the time window TW of the digital audio signal AS obtained in that manner is transformed to the frequency domain, to obtain a transformed snapshot X. This snapshot is a short-time spectral representation of the audio signal at a particular time instant for the duration of said time window TW.
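One possible construction of such an asymmetric, rescaled KBD window is sketched below; mapping the shape parameter α to the Kaiser parameter as πα, and the exact half-window construction, are assumptions of this example.

```python
import numpy as np

def kbd_window(L, alpha=5.0):
    """Kaiser-Bessel-derived window of even length L with shape parameter alpha."""
    kaiser = np.kaiser(L // 2 + 1, np.pi * alpha)
    csum = np.cumsum(kaiser)
    half = np.sqrt(csum[:-1] / csum[-1])
    return np.concatenate([half, half[::-1]])

def asymmetric_window(L, alpha=5.0):
    """Skew the KBD window towards its later half and rescale so it sums to 2."""
    n = np.arange(L)
    w = n * kbd_window(L, alpha)          # w(n) <- n * w(n): peak moves right
    return 2.0 * w / w.sum()              # coefficients sum up to 2
```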
As is the case for the construction of the dictionary D above, the transform of the time window TW may be performed using the discrete Fourier transform (DFT), e.g., implemented in the form of the Goertzel algorithm, such as the second-order Goertzel algorithm. These algorithms are well-known as such and will not be further explained herein.
As is also correspondingly the case for the construction of the dictionary D as described above, the time-frequency transform may be accomplished so that only certain frequencies of a particular frequency scale, such as the 12TET frequency scale, are evaluated in the process. This is one way of achieving a spectral representation with low dimensionality, as described above in relation to the construction of the dictionary D but correspondingly applicable to the time-frequency transform and spectral analysis of the time window TW.

In more concrete terms, this may be performed according to the following.
Firstly, the audio signal AS, consisting of carried over (overlap) past and newly read (hop) data, is windowed (note that x here denotes the audio signal AS, not to be confused with X denoting said transformed snapshot):

x(n) ← w(n) · x(n), n = 0, 1, ..., L - 1.
Secondly, the discrete Fourier transform (DFT) of the windowed signal at only the specified frequencies using 12TET is computed using the Goertzel algorithm:

X(k) = DFT{x(n)}|12TET.

In case the window function WF was normalized (see above), and furthermore if the signal is levelled (see below), the peak magnitude of x(n) will not exceed 0 dBFS, with its RMS level being around -18 dBFS. Moreover, it is in this case guaranteed that the representation values will be bounded by 0 and 1, because the tone patterns were aligned with the same RMS level. As a result, a careful implementation of the Goertzel algorithm can directly return the real-valued power spectrum, from which the magnitude spectrum is obtained taking the square root.

In a subsequent step, a sound pattern representation vector r is calculated, using a selected multiplicative update rule. The sound pattern representation vector r is calculated to minimise a distance measure between the snapshot X and an approximation Dr, being a linear combination of the different sound patterns SP defined by the dictionary D multiplied by the coefficients of the pattern representation vector r.
Hence, once calculated, the pattern vector r contains, for each sound pattern SP, a corresponding velocity value.

In practice, given the magnitude spectrum, or any other non-negative spectrum representation, |X(k)| of the short-time Fourier transformed input signal, written as a vector

x ∈ ℝ^K = [|X(k_1)| ... |X(k_K)|]^T,

and the dictionary D ∈ ℝ_{≥0}^{K×I} consisting of I distinct tone patterns, we want to find a representation r ∈ ℝ_{≥0}^{I} such that the distance between x and a linear combination of a few selected tone patterns, Dr, is minimised. See Figure 4. The components of r indicate the tone patterns that are detected in x and their magnitude. To find r, we may employ the following loss function, which is known as an entrywise β-divergence:

L(x, Dr) = Σ_{k=1}^{K} d_β(x_k, Σ_{i=1}^{I} d_ki r_i)

with

d_β(p, q) =
  (p^β + (β − 1) q^β − β p q^(β−1)) / (β (β − 1)),  for β ∉ {0, 1},
  p log(p/q) − p + q,                               for β = 1,
  p/q − log(p/q) − 1,                               for β = 0,

where 0 ≤ β ≤ 2. It was found that perceptually most meaningful results were obtained for β = 2/3, i.e., when

L(x, Dr) = Σ_{k=1}^{K} [ 3 x_k (Σ_{i=1}^{I} d_ki r_i)^(−1/3) + (3/2) (Σ_{i=1}^{I} d_ki r_i)^(2/3) − (9/2) x_k^(2/3) ].

This specific loss function will be used throughout the rest of the document.

This minimisation may take place in a way which is conventional per se, such as using an iterative minimum-seeking algorithm which may be performed until a predetermined convergence criterion is met, and/or until a maximum number of iterations has been reached.
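For illustration, the entrywise β-divergence and the resulting loss could be written as follows; the small epsilon guard is an assumption of this sketch, added to keep the fractional powers finite when the approximation contains zeros:

```python
import numpy as np

def beta_divergence(p, q, beta=2.0 / 3.0, eps=1e-12):
    """Entrywise beta-divergence d_beta(p, q), summed over all entries."""
    p = np.asarray(p, dtype=float)
    q = np.maximum(np.asarray(q, dtype=float), eps)
    if beta == 1.0:  # Kullback-Leibler case
        return float(np.sum(p * np.log((p + eps) / q) - p + q))
    if beta == 0.0:  # Itakura-Saito case
        return float(np.sum(p / q - np.log((p + eps) / q) - 1.0))
    return float(np.sum((p ** beta + (beta - 1.0) * q ** beta
                         - beta * p * q ** (beta - 1.0)) / (beta * (beta - 1.0))))

def loss(x, D, r, beta=2.0 / 3.0):
    """L(x, Dr): distance between the observed spectrum x and the
    non-negative linear combination Dr of tone patterns."""
    return beta_divergence(x, D @ r, beta)
```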
For instance, the dictionary D may in such cases contain information about amplitudes of harmonics for a set of fundamental frequencies corresponding to different tones, such as playable notes on a particular instrument or class of instruments, whereby the transform is evaluated only at a finite set of frequencies of a defined frequency scale as described above, comprising said fundamental frequencies and frequencies corresponding to said harmonics.
In a subsequent step, the procedure is repeated, producing a new snapshot by multiplying the digital audio signal AS with the window function WF at a new time instant and determining a new sound pattern representation vector r for the respective time window TW.
This repetition may be performed until an end of the audio signal AS has been reached, or until any other stop criterion is satisfied.
Simply put, the main processing loop of the present method consists of reading in new data from the audio stream AS, processing the dynamics of the read data (if applicable, see below), applying a short-time Fourier transform (STFT) to a windowed portion of the data (to obtain a series of said snapshots X), and performing "most likely" matching of tonal components in the new data (represented by the sound pattern representation vector r).
The main processing loop may also comprise the tracking of the occurrence of note "on" and note "off" events (see below), and finally generating a note list from such detected note events, arranged in a suitable temporal order, and possibly fed to any subsequent processing step.
As mentioned above, the detection of tonal data may be based on MIDI note numbers, and in this case the detection of tonal data may comprise a mapping of the detected tonal components (according to the representation vector r) to MIDI note numbers, using a MIDI note number table.

In general, new data from the audio stream AS is read in in a way that is typical for time-frequency analysis of music signals. The essential parameters are the length of the time window TW, the hop size, the amount of overlap, and the transform length. As pointed out above, the time window TW length or duration is chosen in a way so that tones at the lower end of the register can be separated. The hop size is typically at least 10 milliseconds, such as at least 20 milliseconds, and at the most 50 milliseconds, such as at the most 30 milliseconds, preferably in the range between 20 and 30 milliseconds, which usually offers a good compromise between a time resolution that is sufficiently fine to detect fast note transitions and a computational burden that is not too high. This is somewhat related to onset attenuation by the auditory system, which lasts around 20 milliseconds.
Here, it is important to understand that the goal of the present method is not to synthesize the transformed data, but to interpret its tonal contents. Therefore, things known as "perfect reconstruction" are not relevant.
The overlap in samples between past and newly read data is the difference between the time window TW length and the hop size. Since a transform length greater than the time window TW length does not yield any more insight into the data, it may be set equal to or smaller than the length of the time window TW.
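By way of example, with an assumed 16 kHz sampling rate, an assumed 80 ms window length and a 25 ms hop (within the ranges discussed above), the framing parameters would be:

```python
fs = 16000                           # sampling rate in Hz (assumed)
L = int(round(0.080 * fs))           # window length in samples (assumed 80 ms)
hop = int(round(0.025 * fs))         # hop size in samples (25 ms, within 20-30 ms)
overlap = L - hop                    # samples carried over between frames
n_transform = L                      # transform length not larger than the window
print(L, hop, overlap, n_transform)  # 1280 400 880 1280
```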
Moreover, in case the audio is recorded or provided across several channels, such as in stereo, it may be converted to one single channel (mono) as an introductory step, before the iteration across the audio signal AS begins. This computation-saving step can be taken since it is not an aim for the present method to capture the stereo-specific image of a recording. The mono conversion can be performed, for instance, by simple channel averaging in the case of stereo, or similar, prior to further processing and analysis of the audio stream AS.
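A minimal sketch of such a mono conversion by channel averaging, assuming a samples-by-channels array:

```python
import numpy as np

def to_mono(x: np.ndarray) -> np.ndarray:
    """Average all channels of a (num_samples, num_channels) signal to one channel."""
    return x if x.ndim == 1 else x.mean(axis=1)
```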
According to the invention, both the dictionary D and the pattern representation vector r have only non-negative elements.
Further, according to the invention, the step in which the pattern representation vector r is calculated is performed using a cost function that includes an elementwise Bregman divergence. In addition, one or several penalty terms are included in said cost function.
An acoustic signal that is recorded in a room is usually not without reverberation or noise. Moreover, interference always occurs between the partials of a tone, and even more so between the partials of multiple tones. The body of an instrument, for instance, such as a stringed instrument, also resonates at certain frequencies and affects the relative strengths of the various partials (timbre). This means that the sound patterns SP in the dictionary D, predefined or trained, will never perfectly match.
Hence, the penalty terms according to the present invention are added to the cost function with the purpose of favouring (or penalising) a certain property of the representation in a computationally efficient and numerically stable manner, tailored to real-life recording situations.
A penalty term is thus added to the objective function, so that the objective function, which is minimized by the optimization algorithm, consists of a Bregman divergence, which establishes a distance between the observed spectral distribution and its approximation (mixture distribution) by a linear combination of spectral distributions (mixture components), each representing a distinct tone (via the sound patterns SP), plus one or several penalty terms. The trade-off between fitting the observed data (minimizing the cost) and reducing the penalty may be controlled by a respective Lagrange multiplier (a constant factor) for each penalty term used, as will be described below.
A first-order iterative optimisation algorithm such as gradient descent, or its stochastic approximation, is typically used to find a local minimum of the objective function by taking steps proportional to the negative of the gradient (or approximate gradient) of the objective function at the current point. Restricting the data, the dictionary, and the representation to the domain of non-negative real numbers, the update rule can be converted to multiplicative form with an implicit learning rate (step size).
According to the invention, said one or several penalty terms are selected from the first, second, third, and fourth such penalty terms, which are described in detail below. It is understood that more penalty terms could be used, in addition to the one or several of the penalty terms described herein, depending on the concrete application.
The penalty terms described herein are generally non-exclusive and can be used jointly, as separate penalty terms, for the benefit of an improved interpretation of the audio signal AS.
The first penalty term comprises a diversity index, preferably the Shannon index, being a measure of the number of different patterns in the approximation Dr according to their respective proportional abundancies, as quantified by the pattern representation vector r.
The second penalty term comprises a norms difference between the taxicab norm and the Euclidian norm of the pattern representation vector r.
One common purpose of both such first and such second penalty term is to suppress "filler" notes (notes with small coefficients that mainly fill the holes in the approximation of x by Dr) and put emphasis on the prominent notes in the melody.
The first penalty term is hence expressed as a diversity index, and more specifically the Shannon index, which indicates how many different tones (mixture components) there are in the approximation (mixture distribution) according to their proportional abundances (mixture weights). By minimizing the Shannon index, the uncertainty of predicting the tone identity, and thus the number of different tones in the approximation, is minimized. This is accomplished through multiplicative updates of the weights, which decrease for less certain notes and increase for more certain notes.
The second penalty term is hence expressed as a norms difference between the taxicab norm and the Euclidean norm of the representation. In two dimensions, taxicab circles are squares with sides oriented at a 45° angle to the coordinate axes. Therefore, the norms difference is minimal at the vertices. By minimizing the norms difference, the solution is moved away from the edges to the vertices, thus reducing the number of different tones in the approximation. This is accomplished through multiplicative updates of the representations, which become smaller for notes with small representations and larger for notes with large representations.
More concretely put, the first and second penalty terms both address weak tones that may be the result of any of the mentioned negative effects. They are designed to favour stronger tones over weaker tones.
The first penalty term may be useful in cases where the sound-generating device is used to produce higher sounding notes (G-clef), such as is the case of a regular guitar:

R1(r) = − Σ_{i=1}^{I} p_i log p_i, with p_i = r_i / Σ_{j=1}^{I} r_j.

The second penalty term may be useful in cases where the sound-generating device (instrument) is used to produce lower sounding notes (F-clef), such as is the case of a bass guitar:

R2(r) = Σ_{i=1}^{I} r_i − √(Σ_{i=1}^{I} r_i²).

The third penalty term comprises a sum of off-diagonal entries in the correlation matrix of the pattern representation vector r.
The third penalty term may be useful to suppress ghost notes and note fractions that may spread over different octaves, so-called octave errors, putting emphasis on the predominant note in a pitch class.
The third penalty term is expressed as a sum of off-diagonal entries in the correlation matrix of the representations (random variables), which indicates the degree to which the representations are linearly related to one another. By minimizing the off-diagonal elements, i.e., the correlations between pairs of representations of different tones, predominant tones that belong to different pitch classes remain in the representation due to the divergence term, and tone fractions belonging to the same pitch class are reassembled to a single tone. This is accomplished through multiplicative updates of the representations, which become smaller for all fractions but the largest fraction in a pitch class, which grows.
The third penalty term favours unique notes. The corresponding expression may be

R3(r) = Σ_{i=1}^{I−1} Σ_{j=i+1}^{I} r_i r_j.

The fourth penalty term comprises a sum of the squares of the differences between elements of said pattern representation vector (r) and their smoothed values, said smoothed values being calculated for each element in question separately, to represent a time-smoothed representation of the element in question.
The fourth penalty term may be useful to avoid dropouts in the note envelope or fading of notes because of destructive interference between the partials of a tone or the partials of different tones, between original and reflected sound (reverberation), or between original and unwanted sound (noise).
The fourth penalty term is expressed as a sum of the squares of the differences between note representations and the estimated values of the representations, such as smoothed representations and exponentially smoothed representations. By minimizing the differences, short dropouts in the envelopes are smoothed out and tones are stabilised. This is accomplished through multiplicative updates of the representations, which move closer to their estimated values.
This fourth penalty term hence addresses note interruptions that mainly result from interference, and therefore favours slurs (legato articulation). This is achieved using

R4(r) = (1/2) Σ_{i=1}^{I} (r_i − r̃_i)²,

with

r̃_i^t = 0 for t = 0, and r̃_i^t = α r_i^t + (1 − α) r̃_i^(t−1) for t > 0,

where t is the frame (time) index and α is the smoothing factor (0 < α < 1). The smoothing factor may be derived from a corresponding halftime t_1/2 expressed in seconds, such as 0.05 s, using the formula

α = 1 − (1/2)^(1 / (t_1/2 · f_s)).

In a subsequent method step, the method ends.
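Summarising, the four penalty terms and the exponential smoothing above could be written as follows; the epsilon guard and the use of a frame rate in the halftime formula are assumptions of this sketch:

```python
import numpy as np

def penalty_terms(r, r_smoothed, eps=1e-12):
    """R1..R4 evaluated on a non-negative representation r, following the
    formulas given in the text."""
    p = r / (r.sum() + eps)                       # proportional abundances
    R1 = -np.sum(p * np.log(p + eps))             # Shannon diversity index
    R2 = r.sum() - np.sqrt(np.sum(r ** 2))        # taxicab norm minus Euclidean norm
    R3 = 0.5 * (r.sum() ** 2 - np.sum(r ** 2))    # sum of products r_i * r_j over i < j
    R4 = 0.5 * np.sum((r - r_smoothed) ** 2)      # squared distance to the smoothed values
    return R1, R2, R3, R4

def smooth(r, r_smoothed_prev, alpha):
    """Exponential smoothing of the representation across frames."""
    return alpha * r + (1.0 - alpha) * r_smoothed_prev

def smoothing_factor(halftime_s=0.05, frame_rate_hz=40.0):
    """Smoothing factor derived from a halftime (frame rate value assumed)."""
    return 1.0 - 0.5 ** (1.0 / (halftime_s * frame_rate_hz))
```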
Using such a method, a procedure is obtained that is capable of producing accurate results, while being computationally efficient enough to be able to run in real time even on conventional hardware.
As mentioned above, in some embodiments of the invention each of the used penalty terms (first, second, third, and/or fourth) is individually weighted by a weight coefficient λ1, λ2, λ3 and λ4, respectively.
Hence, a composite cost function C may consist of the loss function (elementwise Bregman divergence) and one or several said penalty terms, which are weighted relative to their importance by the respective multipliers (0 ≤ λ ≤ 1):

C(x, Dr) = L(x, Dr) + λ1 R1(r) + λ2 R2(r) + λ3 R3(r) + λ4 R4(r).
To find the most likely representation, the cost is to be minimized. This may be accomplished using the following iterative optimization procedure. The procedure is executed once per frame (with every hop of the time window TW). Even in the case where no weights λ are used, the corresponding procedure can be used.

1. Initialize the representation with random non-negative values, r^(t=0).
2. Set the value(s) for λ1, λ2, λ3 and λ4.
3. Compute the initial cost, C^(t=0)(x, y^(t=0)) with y^(t=0) = Dr^(t=0).
4. Repeat the following steps until a predetermined stopping criterion is met:
   a. Update the representation by

      r_i^(t+1) = r_i^t · [ Σ_{k=1}^{K} d_ki x_k (y_k^t)^(−4/3) + λ1 + λ3 r_i^t + λ4 (1 − α) r̃_i^t ] / [ Σ_{k=1}^{K} d_ki (y_k^t)^(−1/3) + λ1 (1 − log p_i^t) + λ3 Σ_{j=1}^{I} r_j^t + λ4 (1 − α) r_i^t ],

      with p_i^t = r_i^t / Σ_{j=1}^{I} r_j^t, for a higher-register instrument, or

      r_i^(t+1) = r_i^t · [ Σ_{k=1}^{K} d_ki x_k (y_k^t)^(−4/3) + λ2 r_i^t (Σ_{j=1}^{I} (r_j^t)²)^(−1/2) + λ3 r_i^t + λ4 (1 − α) r̃_i^t ] / [ Σ_{k=1}^{K} d_ki (y_k^t)^(−1/3) + λ2 + λ3 Σ_{j=1}^{I} r_j^t + λ4 (1 − α) r_i^t ],

      for a lower-register instrument, respectively.
   b. Compute the new approximation of the input signal, y^(t+1) = Dr^(t+1).
   c. Compute the new cost, C^(t+1)(x, y^(t+1)).
   d. Compute the steepness, s, of the cost function.
   e. If s is below some predetermined small value ε, return the representation, or continue with step a above, for t ← t + 1.
Alternatively, or in addition, the procedure may be performed for a predetermined number of iterations, possibly in combination with the convergence criterion using ε described above.
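A minimal sketch of such an iteration is given below for the plain β = 2/3 divergence with the dictionary held fixed; the penalty gradients are omitted for brevity, and the relative-decrease convergence test is an assumption of the sketch:

```python
import numpy as np

def estimate_representation(x, D, beta=2.0 / 3.0, n_iter=200, tol=1e-5, seed=None):
    """Multiplicative updates minimising the beta-divergence between x and D @ r
    for a fixed non-negative dictionary D (penalty terms omitted in this sketch)."""
    rng = np.random.default_rng(seed)
    eps = 1e-12

    def cost(y):
        # entrywise beta-divergence for beta not in {0, 1}
        return float(np.sum((x ** beta + (beta - 1.0) * y ** beta
                             - beta * x * y ** (beta - 1.0)) / (beta * (beta - 1.0))))

    r = rng.random(D.shape[1]) + eps            # 1. random non-negative initialisation
    c = cost(D @ r + eps)                       # 3. initial cost
    for _ in range(n_iter):                     # 4. iterate until convergence
        y = D @ r + eps
        numer = D.T @ (x * y ** (beta - 2.0))
        denom = D.T @ (y ** (beta - 1.0)) + eps
        r *= numer / denom                      # a. multiplicative update (implicit step size)
        c_new = cost(D @ r + eps)               # b./c. new approximation and its cost
        if abs(c - c_new) / max(c, eps) < tol:  # d./e. stop when the decrease flattens out
            break
        c = c_new
    return r
```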
In some embodiments, the cost function may include a function being an entrywise β-divergence with 0 ≤ β ≤ 2, such as 0.5 ≤ β ≤ 1, such as β = 2/3.

In general, the calculation of the pattern representation vector r may be performed using a first-order iterative optimisation algorithm, such as the one described in the example above.
As mentioned above, steps from determining the time window TW up to and including the calculation of the representation vector r may be performed iteratively, and particularly by successively shifting the time window TW or the digital audio signal AU in relation to each other.
This process may furthermore comprise the detection of sound pattern "on" and sound pattern "off" events. Such event detection may be performed based on the pattern representation vector r calculated for each of the consecutive time windows TW.
Hence, as mentioned above, the components of the sound pattern representation vector r indicate which tone patterns were detected and with which magnitude or likelihood. This information will normally change with every incoming frame. To convert this information into, e.g., MIDI note events, which, apart from the pitch, signal the beginning and the end of a note, and with which velocity the key was struck, or a string was picked or plucked, the pitch representation must be tracked over time. The following paragraphs explain how this can be put into practice.

In some embodiments, the sound pattern representation vector r may be pre-processed with a median filter and a short window to remove outliers and so to improve the onset (and offset) detection. Both onset and offset detection require a threshold value. Each tone has a state (active or inactive). Should the magnitude of a component in the pattern representation vector r exceed the onset threshold, and should the corresponding tone be inactive, a note "on" event may be signalled, and the corresponding tone may be activated. Should the tone be already activated, a note "off" event may be signalled before a new "note on" event, but only if the magnitude of the current component is by a factor greater than 1 stronger than its magnitude one time instant before. In the opposite case, no new event occurs. Should the magnitude of an activated tone fall below the offset threshold, a note "off" event may be signalled, and the tone is deactivated.
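The tracking logic described above might be sketched as a simple per-tone state machine; the threshold values and the re-trigger factor below are illustrative assumptions, and the median pre-filter is omitted:

```python
def track_note_events(r_frames, onset_thr=0.10, offset_thr=0.05, retrigger=1.5):
    """Convert per-frame representation vectors into note on/off events.
    Returns (frame_index, 'on'/'off', tone_index, magnitude) tuples."""
    events, active = [], {}                      # active: tone_index -> last magnitude
    for t, r in enumerate(r_frames):
        for i, m in enumerate(r):
            if i not in active:
                if m >= onset_thr:               # inactive tone exceeds the onset threshold
                    events.append((t, 'on', i, m))
                    active[i] = m
            elif m < offset_thr:                 # active tone falls below the offset threshold
                events.append((t, 'off', i, m))
                del active[i]
            elif m > retrigger * active[i]:      # clearly stronger again: off, then a new on
                events.append((t, 'off', i, active[i]))
                events.append((t, 'on', i, m))
                active[i] = m
            else:
                active[i] = m
    return events
```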
The velocity of such a detected note may be derived from the corresponding magnitude value at onset using, for instance, Stevens' power law:

ψ(I) = k · I^a,

where I is the intensity or strength of the stimulus, ψ(I) is the magnitude of the sensation evoked by the stimulus, a is an exponent and k is a proportionality constant. The constants a and k are tuned to approximate the relationship between an increased representation value and the perceived loudness increase in the sensation when hearing the tone. A standard value for k is 1 and 0.67 for a.
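For illustration, mapping a representation magnitude at onset to a MIDI velocity with the stated constants could look like this; the scaling to the 1–127 range is an added assumption:

```python
def velocity_from_magnitude(intensity, k=1.0, a=0.67, v_max=127):
    """Map a representation magnitude in [0, 1] to a MIDI velocity via psi(I) = k * I**a."""
    psi = k * max(intensity, 0.0) ** a
    return max(1, min(v_max, int(round(psi * v_max))))
```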
Once note "on" and note "off" events have been detected and signalled, one may generate a note list from such note events. A note begins at onset time and lasts until the corresponding offset time. A note's duration may thus be given by the time difference between these two events. The pitch of a note may be expressed using MIDI note numbers, which may be obtained from the component index in the representation vector r via a pitch map of the above-described type. Its velocity, i.e., the force with which a note is played, may be determined at onset time. It should be understood that a note list can be updated at run time without awaiting the end of a live performance or the end of a recording. In addition to the standard note properties pitch, time, duration, and velocity, one can also indicate the string and the fret of a note on a fretted instrument with the help of the string map and the fret map of the above-described type, if used.
When the end of the file is reached, or the reading of the audio stream AS is stopped for any reason, a note list may be saved, for instance as a standard MIDI file, which can be opened and further edited in a digital audio workstation, DAW, or any other music composition and notation software.
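One possible way to write such a note list to a standard MIDI file is via the third-party mido package; the package choice and the fixed 120 BPM tempo are assumptions of this sketch:

```python
import mido

def save_note_list(notes, path, ticks_per_beat=480):
    """Write (onset_s, duration_s, midi_note, velocity) tuples to a MIDI file,
    assuming 120 BPM so that one beat corresponds to 0.5 seconds."""
    mid = mido.MidiFile(ticks_per_beat=ticks_per_beat)
    track = mido.MidiTrack()
    mid.tracks.append(track)
    timed = []
    for onset, dur, note, vel in notes:
        timed.append((onset, mido.Message('note_on', note=note, velocity=vel, time=0)))
        timed.append((onset + dur, mido.Message('note_off', note=note, velocity=0, time=0)))
    timed.sort(key=lambda item: item[0])
    prev = 0.0
    for t, msg in timed:
        msg.time = int(round((t - prev) * 2 * ticks_per_beat))  # seconds -> ticks at 120 BPM
        track.append(msg)
        prev = t
    mid.save(path)
```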
Moreover, in some embodiments the method comprises a pre-processing step, that may be performed before the step in which the audio signal AS is transformed to the frequency domain, in which the audio signal AS is subjected to at least one of noise gating, levelling, or limiting, as desired.

In Figure 1, it is illustrated how such signal processing can be performed. Signal processing may be performed on the sound signal before and/or after the determination of said time window TW. For instance, when processing a sound file, at least some processing may take place before application of the time window function, on the whole file. In other examples, for instance in the processing of streamed audio, the time window function may be applied first and then sound processing such as gating may be applied to the windowed audio signal.
Such dynamics processing may be used to make sure that all recordings are interpreted and transcribed in a consistent manner. A noise gate removes unwanted noise, such as hum and hiss, during periods when the main signal is not present. A leveller stabilizes the volume of the signal by making too quiet notes louder and too loud notes softer. The leveller may be designed in such a way as to align the long-term signal level (RMS) to −18 dBFS. The limiter may be used for safety, i.e., to prevent the amplitude of the levelled signal from clipping, which could cause unwanted distortion and thus worsen the performance of the algorithm. For instance, a standard noise gate without hysteresis may be used, and/or a so-called brick-wall limiter.
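As a simplified illustration of the levelling and limiting steps (whole-signal RMS instead of an adaptive leveller, a hard clip as the brick-wall limiter, and the noise gate omitted):

```python
import numpy as np

def level_and_limit(x, target_dbfs=-18.0):
    """Scale the signal so its long-term RMS sits at target_dbfs,
    then hard-limit the result to +/- 1.0 full scale."""
    rms = np.sqrt(np.mean(x ** 2)) + 1e-12
    gain = 10.0 ** (target_dbfs / 20.0) / rms
    return np.clip(gain * x, -1.0, 1.0)
```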
Once a sequence of notes has been generated from the analysed audio signal AS, it can be used in various ways.
As mentioned, the notes can be stored for future use or for additional processing, e.g., in a MIDI file. In other examples, they may be used on the fly, for instance as a basis for the automatic generation of an additional sequence of notes, such as a countermelody. Such a countermelody may be generated, for instance on the fly, using a neural network specifically designed to generate countermelodies. A corresponding sound signal (after synthesis by the computer software function) may be routed by the computer 110 to the sound-generating device 112 for playback.

In other cases, such as in analysis of machine-produced sounds, the recognised notes may form the basis for a monitoring, debugging, or testing protocol, as the case may be.

In any case, the interpreted set of notes may be stored, sent to a different entity for processing, and/or displayed to a user on a screen of a computing device, such as in the form of automatically produced musical notation. In some cases, the computer 110 is locally arranged with respect to the sound-generating device 200, and either executes the computer software product on its own hardware or communicates with the server 120 that in turn performs one or several of the steps described above, for instance as a service. Said storage may take place on the server 120 and/or may be performed by the server 120.
As mentioned above, the present invention further relates to the system 100, being arranged to perform the method steps described above.
These method steps are then typically performed by said computer software product, executed on the computer 110 and/or the server 120 and arranged to perform said steps as described above.
The invention also relates to such a computer software product.
Above, preferred embodiments have been described. However, it should be apparent to the skilled person that many modifications can be made to the disclosed embodiments without departing from the basic idea of the invention.
For instance, the system 100 may comprise many additional parts, such as music composition or production software integrated with the audio signal interpretation functionality, or the system 100 may be comprised in such a system.
The sound-generating device 200 may be arranged locally or remotely in relation to the sound-generating device 112.

In general, all that has been said in relation to the present method is equally applicable to the present system and computer software product.
Hence, the invention is not limited to the described embodiments, but can be varied within the scope of the enclosed claims.

Claims (19)

1. Method for interpreting a digital audio signal (AS), comprising the steps of
a) determining a dictionary (D) defining a set of different characteristic sound patterns (SP);
b) determining a time window (TW) of a certain length of said digital audio signal (AS);
c) transforming said time window (TW) of the digital audio signal (AS) to the frequency domain to obtain a transformed snapshot (X);
d) calculating, using a selected multiplicative update rule, a pattern representation vector (r) that minimises a distance measure between the snapshot (X) and an approximation (Dr), being a linear combination of said sound patterns (SP) defined by said dictionary (D) multiplied by the coefficients of the pattern representation vector (r); and
e) repeating from step b using a subsequent time window (TW),
wherein both said dictionary (D) and said pattern representation vector (r) have only non-negative elements, wherein step d is performed using a cost function that includes an elementwise Bregman divergence and in addition thereto one or several penalty terms, said one or several penalty terms being selected from the list of
a first penalty term, comprising a diversity index, preferably the Shannon index, being a measure of the number of different patterns in the approximation (Dr) according to their respective proportional abundancies, as quantified by the pattern representation vector (r);
a second penalty term, comprising a norms difference between the taxicab norm and the Euclidian norm of the pattern representation vector (r);
a third penalty term, comprising a sum of off-diagonal entries in the correlation matrix of the pattern representation vector (r); and
a fourth penalty term, comprising a sum of the squares of the differences between elements of said pattern representation vector (r) and their smoothed values, said smoothed values being calculated for each element in question separately, to represent a time-smoothed representation of the element in question.
2. Method according to claim 1, wherein each of the penalty terms is individually weighted by a weight coefficient.
3. Method according to claim 1 or 2, wherein the cost function comprises a function being a β-divergence with 0 ≤ β ≤ 2, such as 0.5 ≤ β ≤ 1, such as β = 2/3.
4. Method according to any one of the preceding claims, wherein the time window (TW) is weighted using an asymmetric window function with a maximum in the second half of the time window (TW).
5. Method according to claim 4, wherein the window function is an asymmetric Kaiser-Bessel-derived, KBD, window.
6. Method according to any one of the preceding claims, wherein the digital audio signal (AS) is processed at a sampling rate of 16 kHz or lower.
7. Method according to claim 6, wherein the method comprises a step in which the digital audio signal (AS) is resampled to 16 kHz or lower sampling rate, before step c.
8. Method according to any one of the preceding claims, wherein step a comprises the step of, for at least one pitched musical instrument or at least one class of pitched musical instruments and for each one of a set of playable notes on such instrument or class of instruments, synthesising a pattern based on a physical model of an acoustic resonator and further based on the one-dimensional wave equation.
9. Method according to claim 8, wherein the physical model assumes the same relative amplitudes of harmonics for all playable notes.
10. Method according to any one of the preceding claims, wherein the dictionary (D) contains information about amplitudes of harmonics for a set of fundamental frequencies, and wherein the transform in step c is evaluated only at said fundamental frequencies and at frequencies corresponding to said harmonics.
11. Method according to claim 10, wherein each of said fundamental frequencies corresponds to a respective playable note on a particular instrument or class of instruments, and wherein the transform in step c is evaluated only at said fundamental frequencies and at said frequencies corresponding to said harmonics.
12. Method according to claim 11, wherein said frequencies are selected as all fundamental frequencies and associated harmonic frequencies associated with the MIDI tuning standard, MTS, for a given concert A up to the Nyquist frequency.
13. Method according to any one of claims 10-12, wherein the transform in step c is the discrete Fourier transform (DFT), implemented in the form of the Goertzel algorithm, such as the second-order Goertzel algorithm.
14. Method according to any one of the preceding claims, wherein steps b to d are performed successively by time-shifting said time window (TW) in relation to the digital audio signal (AU).
15. Method according to claim 14, wherein the method further comprises detecting pattern on and pattern off events based on the pattern representation vector (r) calculated in step d for consecutively time-shifted time windows (TW).
16. Method according to any one of the preceding claims, wherein the method further comprises a pre-processing step, performed before step c, in which the digital audio signal (AS) is subjected to at least one of noise gating, levelling, or limiting.
17. Method according to any one of the preceding claims, wherein the calculation of the pattern representation vector (r) is performed using a first-order iterative optimisation algorithm.
18. System (100) for interpreting a sampled sound signal, said system (100) being arranged with a dictionary (D) defining a set of different characteristic sound patterns (SP),
said system (100) being arranged to determine, in a step b, a time window (TW) of a certain length of said digital audio signal (AS),
said system (100) being arranged to transform, in a step c, said time window (TW) of the digital audio signal (AS) to the frequency domain to obtain a transformed snapshot (X),
said system (100) being arranged to calculate, in a step d, using a selected multiplicative update rule, a pattern representation vector (r) that minimises a distance measure between the snapshot (X) and an approximation (Dr), being a linear combination of said sound patterns (SP) defined by said dictionary (D) multiplied by the coefficients of the pattern representation vector (r), and
said system (100) being arranged to repeat from said step b using a subsequent time window (TW),
both said dictionary (D) and said pattern representation vector (r) having only non-negative elements,
said system (100) being arranged to perform said calculation of said pattern representation vector (r) using a cost function that includes an elementwise Bregman divergence and in addition thereto one or several penalty terms, said one or several penalty terms being selected from the list of
a first penalty term, comprising a diversity index, preferably the Shannon index, being a measure of the number of different patterns in the approximation (Dr) according to their respective proportional abundancies, as quantified by the pattern representation vector (r);
a second penalty term, comprising a norms difference between the taxicab norm and the Euclidian norm of the pattern representation vector (r);
a third penalty term, comprising a sum of off-diagonal entries in the correlation matrix of the pattern representation vector (r); and
a fourth penalty term, comprising a sum of the squares of the differences between elements of said pattern representation vector (r) and their smoothed values, said smoothed values being calculated for each element in question separately, to represent a time-smoothed representation of the element in question.
19. Computer software product for interpreting a sampled sound signal, said computer software product comprising or being associated with a dictionary (D) defining a set of different characteristic sound patterns (SP),
said computer software product being arranged to, in a step b, when executed on computer hardware, determine a time window (TW) of a certain length of said digital audio signal (AS);
transform, in a step c, said time window (TW) of the digital audio signal (AS) to the frequency domain to obtain a transformed snapshot (X);
calculate, in a step c, using a selected multiplicative update rule, a pattern representation vector (r) that minimises a distance measure between the snapshot (X) and an approximation (Dr), being a linear combination of said sound patterns (SP) defined by said dictionary (D) multiplied by the coefficients of the pattern representation vector (r); and
repeat, from step b, using a subsequent time window (TW),
both said dictionary (D) and said pattern representation vector (r) having only non-negative elements,
said computer software product being arranged to, when executed on computer hardware, perform said calculation of said pattern representation vector (r) using a cost function that includes an elementwise Bregman divergence and in addition thereto one or several penalty terms, said one or several penalty terms being selected from the list of
a first penalty term, comprising a diversity index, preferably the Shannon index, being a measure of the number of different patterns in the approximation (Dr) according to their respective proportional abundancies, as quantified by the pattern representation vector (r);
a second penalty term, comprising a norms difference between the taxicab norm and the Euclidian norm of the pattern representation vector (r);
a third penalty term, comprising a sum of off-diagonal entries in the correlation matrix of the pattern representation vector (r); and
a fourth penalty term, comprising a sum of the squares of the differences between elements of said pattern representation vector (r) and their smoothed values, said smoothed values being calculated for each element in question separately, to represent a time-smoothed representation of the element in question.
SE2051550A 2020-12-22 2020-12-22 Method and system for recognising patterns in sound SE544738C2 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
SE2051550A SE544738C2 (en) 2020-12-22 2020-12-22 Method and system for recognising patterns in sound

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
SE2051550A SE544738C2 (en) 2020-12-22 2020-12-22 Method and system for recognising patterns in sound

Publications (2)

Publication Number Publication Date
SE2051550A1 true SE2051550A1 (en) 2022-06-23
SE544738C2 SE544738C2 (en) 2022-11-01

Family

ID=82402566

Family Applications (1)

Application Number Title Priority Date Filing Date
SE2051550A SE544738C2 (en) 2020-12-22 2020-12-22 Method and system for recognising patterns in sound

Country Status (1)

Country Link
SE (1) SE544738C2 (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050038635A1 (en) * 2002-07-19 2005-02-17 Frank Klefenz Apparatus and method for characterizing an information signal
US20060075884A1 (en) * 2004-10-11 2006-04-13 Frank Streitenberger Method and device for extracting a melody underlying an audio signal
US20110075851A1 (en) * 2009-09-28 2011-03-31 Leboeuf Jay Automatic labeling and control of audio algorithms by audio recognition
WO2013022930A1 (en) * 2011-08-08 2013-02-14 The Intellisis Corporation System and method of processing a sound signal including transforming the sound signal into a frequency-chirp domain
EP3048607A2 (en) * 2015-01-20 2016-07-27 Harman International Industries, Inc. Automatic transcription of musical content and real-time musical accompaniment
WO2017124116A1 (en) * 2016-01-15 2017-07-20 Bao Sheng Searching, supplementing and navigating media
US20190102144A1 (en) * 2017-10-03 2019-04-04 Google Llc Identifying Music as a Particular Song

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Jia-Min Ren; Jyh-Shing Roger Jang, Discovering Time-Constrained Sequential Patterns for Music Genre Classification, IEEE Transactions on Audio, Speech and Language Processing, 20120501, IEEE, US *
Meinard Muller; Daniel P W Ellis; Anssi Klapuri; Gaël Richard, Signal Processing for Music Analysis, IEEE Journal of Selected Topics in Signal Processing, 20111001, IEEE, US *
Zheng Guibin; Liu Sheng, Automatic Transcription Method for Polyphonic Music Based on Adaptive Comb Filter and Neural Network, Mechatronics and Automation, 2007. ICMA 2007. International Conference on, 20070801, IEEE, Pi *

Also Published As

Publication number Publication date
SE544738C2 (en) 2022-11-01

Similar Documents

Publication Publication Date Title
Klapuri Multiple fundamental frequency estimation based on harmonicity and spectral smoothness
De Poli et al. Sonological models for timbre characterization
Klapuri Automatic music transcription as we know it today
Marolt A connectionist approach to automatic transcription of polyphonic piano music
Eronen Comparison of features for musical instrument recognition
Klapuri et al. Robust multipitch estimation for the analysis and manipulation of polyphonic musical signals
Carabias-Orti et al. Musical instrument sound multi-excitation model for non-negative spectrogram factorization
CN107851444A (en) For acoustic signal to be decomposed into the method and system, target voice and its use of target voice
US9779706B2 (en) Context-dependent piano music transcription with convolutional sparse coding
Cogliati et al. Context-dependent piano music transcription with convolutional sparse coding
EP3739569A1 (en) Device and method for simulating a sound timbre, particularly for stringed electrical musical instruments
Ewert et al. Score-Informed Voice Separation For Piano Recordings.
Schneider Perception of timbre and sound color
US10319353B2 (en) Method for audio sample playback using mapped impulse responses
US6054646A (en) Sound-based event control using timbral analysis
Li et al. Musical sound separation using pitch-based labeling and binary time-frequency masking
Hansen et al. Estimation of fundamental frequencies in stereophonic music mixtures
SE2051550A1 (en) Method and system for recognising patterns in sound
Caetano et al. Evaluating how well filtered white noise models the residual from sinusoidal modeling of musical instrument sounds
Stöter et al. Unison Source Separation.
CN112289289A (en) Editable universal tone synthesis analysis system and method
Trail et al. Direct and surrogate sensing for the Gyil african xylophone.
Tolonen Object-based sound source modeling for musical signals
Klapuri et al. Automatic music transcription
Vaghy Automatic Drum Transcription Using Template-Initialized Variants of Non-negative Matrix Factorization