WO2023079419A1 - Aligning digital note files with audio - Google Patents

Aligning digital note files with audio

Info

Publication number
WO2023079419A1
Authority
WO
WIPO (PCT)
Prior art keywords
spectrogram
mapping
spectra
time
computing
Application number
PCT/IB2022/060330
Other languages
French (fr)
Inventor
Yoav MOR
Hagay KONYO
Original Assignee
Sphereo Sound Ltd.
Application filed by Sphereo Sound Ltd.
Publication of WO2023079419A1

Classifications

    • G: PHYSICS
    • G11: INFORMATION STORAGE
    • G11B: INFORMATION STORAGE BASED ON RELATIVE MOVEMENT BETWEEN RECORD CARRIER AND TRANSDUCER
    • G11B 27/00: Editing; Indexing; Addressing; Timing or synchronising; Monitoring; Measuring tape travel
    • G11B 27/005: Reproducing at a different information rate from the information rate of recording
    • G11B 27/10: Indexing; Addressing; Timing or synchronising; Measuring tape travel
    • G11B 27/19: Indexing; Addressing; Timing or synchronising; Measuring tape travel by using information detectable on the record carrier
    • G11B 27/28: Indexing; Addressing; Timing or synchronising; Measuring tape travel by using information detectable on the record carrier by using information signals recorded by the same method as the main recording

Abstract

A system (20) includes a memory (26, 29) and one or more processors (24, 27) configured to cooperatively carry out a process. The process includes loading a first audio file (28) and a digital note file (30) from the memory, computing a first spectrogram of the first audio file, computing a second spectrogram of a second audio file (32) generated from the digital note file, computing a mapping between first spectra of the first spectrogram and respective second spectra of the second spectrogram that minimizes a distance measure under predefined constraints, and based on the mapping, shifting, and adjusting respective durations of, notes in the digital note file so as to increase an alignment of the notes with the first audio file. Other embodiments are also described.

Description

ALIGNING DIGITAL NOTE FILES WITH AUDIO
CROSS-REFERENCE TO RELATED APPLICATIONS
The present application claims the benefit of US Provisional Application 63/274,977, filed November 3, 2021, whose disclosure is incorporated herein by reference.
FIELD OF THE INVENTION
The present invention relates to the field of digital audio, particularly digital music.
BACKGROUND
A Musical Instrument Digital Interface (MIDI) file specifies the timing and loudness of the musical notes belonging to a musical piece. To play the musical piece, the MIDI file is converted to a suitable audio format, such as Moving Picture Experts Group Layer-3 Audio (MP3), via conventional techniques such as those implemented by FluidSynth™.
SUMMARY OF THE INVENTION
There is provided, in accordance with some embodiments of the present invention, a system including a memory and one or more processors configured to cooperatively carry out a process. The process includes loading a first audio file and a digital note file from the memory, computing a first spectrogram of the first audio file, computing a second spectrogram of a second audio file generated from the digital note file, computing a mapping between first spectra of the first spectrogram and respective second spectra of the second spectrogram that minimizes a distance measure under predefined constraints, and based on the mapping, shifting, and adjusting respective durations of, notes in the digital note file so as to increase an alignment of the notes with the first audio file.
In some embodiments, the digital note file includes a Musical Instrument Digital Interface (MIDI) file.
In some embodiments, the first spectrogram includes a first short-time Fourier transform (STFT) spectrogram and the second spectrogram includes a second STFT spectrogram.
In some embodiments, computing the first spectrogram and second spectrogram includes: computing a first time-localized frequency transform of the first audio file and a second time-localized frequency transform of the second audio file, filtering each of the first time-localized frequency transform and second time-localized frequency transform so as to emphasize frequencies of musical notes, and computing the first spectrogram and second spectrogram from the filtered first time- localized frequency transform and filtered second time-localized frequency transform, respectively.
In some embodiments, filtering each of the first time-localized frequency transform and second time-localized frequency transform includes filtering each of the first time-localized frequency transform and second time-localized frequency transform using a bank of Gaussian filters, each of the filters being centered at a respective one of the frequencies.
In some embodiments, the distance measure is a function of respective local distances between pairs of spectra mapped to one another.
In some embodiments, each of the local distances is a Minkowski distance.
In some embodiments, computing the mapping includes: computing a K1 x K2 or K2 x K1 matrix D', which is equal to or derived from another matrix of which an (i, j) or (j, i) element is a local distance between an i-th spectrum of the first spectra and a j-th spectrum of the second spectra for 1 ≤ i ≤ K1 and 1 ≤ j ≤ K2, K1 and K2 being respective numbers of the first spectra and the second spectra, computing a non-reversing path from D'[1, 1] to D'[K1, K2] or to D'[K2, K1] that minimizes a sum of elements of D' through which the path passes, and for each of the elements of D' through which the path passes, mapping a pair of spectra corresponding to the element to one another.
In some embodiments, the process further includes, prior to modifying the digital note file, smoothing the mapping by fitting the mapping to a piecewise linear function including one or more linear segments.
In some embodiments, the process further includes identifying one or more points in the mapping at which a second derivative of the mapping is not within a predefined range, and the piecewise linear function includes multiple linear segments joined to each other at respective coordinates of the identified points representing the second spectrogram.

There is further provided, in accordance with some embodiments of the present invention, a method including computing a first spectrogram of a first audio file, computing a second spectrogram of a second audio file generated from a digital note file, computing a mapping between first spectra of the first spectrogram and respective second spectra of the second spectrogram that minimizes a distance measure under predefined constraints, and based on the mapping, shifting, and adjusting respective durations of, notes in the digital note file so as to increase an alignment of the notes with the first audio file.
There is further provided, in accordance with some embodiments of the present invention, a computer software product including a tangible non-transitory computer-readable medium in which program instructions are stored. The instructions, when read by a processor, cause the processor to compute a first spectrogram of a first audio file, to compute a second spectrogram of a second audio file generated from a digital note file, to compute a mapping between first spectra of the first spectrogram and respective second spectra of the second spectrogram that minimizes a distance measure under predefined constraints, and, based on the mapping, to shift, and adjust respective durations of, notes in the digital note file so as to increase an alignment of the notes with the first audio file.
The present invention will be more fully understood from the following detailed description of embodiments thereof, taken together with the drawings, in which:
BRIEF DESCRIPTION OF THE DRAWINGS
Fig. 1 is a schematic illustration of a system for aligning a digital note file with an audio file, in accordance with some embodiments of the present invention;
Fig. 2 shows a flow diagram for an example algorithm for aligning a digital note file with an audio file, in accordance with some embodiments of the present invention;
Fig. 3 shows a spectrum of a short-time Fourier transform, which was filtered in accordance with some embodiments of the present invention;
Figs. 4A-B show a hypothetical mapping between spectra of a first spectrogram and spectra of a second spectrogram, in accordance with some embodiments of the present invention; and
Fig. 5 shows an example mapping and smoothed mapping, in accordance with some embodiments of the present invention.

DETAILED DESCRIPTION OF EMBODIMENTS
OVERVIEW
A digital note file, such as a MIDI file, representing a musical piece is often used in conjunction with an audio recording of the musical piece. For example, a user may attempt to synchronize the playing of an instrument represented in the digital note file with the audio recording. Alternatively (e.g., in karaoke), the user may attempt to synchronize the singing of vocal notes represented in the digital note file with the audio recording. Alternatively, the user may practice his singing with reference to the digital note file, add an instrument to, or replace an instrument in, the digital note file, or otherwise modify the digital note file (e.g., so as to change the tempo of the musical piece). For all of these applications, it is helpful to have alignment between the digital note file and the audio recording.
However, in many cases, the digital note file may be misaligned with the audio recording. For example, the timing and/or duration of particular notes in the digital note file may be different from the timing and/or duration of these notes in the audio recording. Alternatively or additionally, the digital note file may contain one or more notes that are not played in the audio recording; for example, the number of times a chorus is repeated in the digital note file may be greater than the number of times the chorus is repeated in the audio recording.
To address this challenge, embodiments of the present invention align (or “synchronize”) the digital note file with the audio recording. In particular, the digital note file is converted to an audio file using conventional techniques, and a time-localized spectrogram, such as a short-time Fourier transform (STFT) spectrogram, of both audio files is computed. Subsequently, the spectra of one spectrogram are mapped to the spectra of the other spectrogram, under predefined constraints, so as to minimize a total distance. A piecewise linear function may then be fit to the mapping so as to smooth the mapping. Finally, the notes in the digital note file are shifted, and are also stretched or compressed, in accordance with the mapping, so as to better align the notes with the audio recording.
SYSTEM DESCRIPTION
Reference is initially made to Fig. 1, which is a schematic illustration of a system 20 for aligning a digital note file 30 with a first audio file 28, in accordance with some embodiments of the present invention.
System 20 comprises a server 22 comprising a processor 24, a network interface 25 comprising, for example, a network interface controller (NIC), and a memory 26. In some embodiments, server 22 belongs to a cloud server farm.
Memory 26 may comprise a volatile memory (e.g., a random-access memory (RAM)) and/or a non-volatile memory (e.g., a flash drive). Memory 26 is configured to store first audio file 28, which may have any suitable audio format such as MP3. Memory 26 is further configured to store digital note file 30, which may include a Musical Instrument Digital Interface (MIDI) file, for example.
In alternative embodiments, processor 24 and memory 26 belong to different respective computers, such as different respective servers in a cloud server farm. Alternatively or additionally, memory 26 may be distributed over multiple computers.
System 20 further comprises a device 21 comprising, for example, a desktop computer, a laptop computer, or a smartphone. Device 21 comprises a processor 27, a network interface 31 comprising, for example, a network interface controller (NIC), and a memory 29. Memory 29 may comprise a volatile memory (e.g., a random-access memory (RAM)) and/or a non-volatile memory (e.g., a flash drive). Memory 29 is also configured to store first audio file 28 and digital note file 30.
Processor 24 and processor 27 are configured to communicate with one another over a network 23, such as the Internet, via network interface 25 and network interface 31. Thus, for example, in response to instructions from a user 33, processor 27 may load the first audio file and digital note file from memory 29 and then upload these files to server 22. Subsequently, first audio file 28 and digital note file 30 may be stored in memory 26. Processor 24 may then load the files from memory 26 and process the files as described below with reference to the subsequent figures. Based on the processing, the processor may modify digital note file 30 so as to increase the alignment of the notes in the digital note file with first audio file 28, i.e., so as to render the digital note file convertible to a third audio file 36 that is more similar to first audio file 28 than is second audio file 32. Subsequently to modifying the digital note file, processor 24 may store the resulting modified digital note file 34 in memory 26 and/or communicate modified digital note file 34 to device 21. In response to receiving the modified digital note file, processor 27 may store the modified digital note file in memory 29.
In some embodiments, processor 24, in performing the aforementioned processing, converts digital note file 30 to a second audio file 32 using conventional techniques. In other embodiments, processor 27 performs this conversion; in such embodiments, second audio file 32 may be uploaded to server 22 together with the first audio file and the digital note file.
In alternative embodiments, at least some of the aforementioned processing and/or modification of the digital note file is performed by processor 27 and/or at least one other processor, such as the processor of another server residing, together with server 22, on a cloud server farm.
In general, each of processor 24 and processor 27 may be embodied as a single processor, or as a cooperatively networked or clustered set of processors. The functionality of each of the processors may be implemented solely in hardware, e.g., using one or more fixed-function or general-purpose integrated circuits, Application-Specific Integrated Circuits (ASICs), and/or Field-Programmable Gate Arrays (FPGAs). Alternatively, this functionality may be implemented at least partly in software. For example, processor 24 and/or processor 27 may be embodied as a programmed processor comprising, for example, a central processing unit (CPU) and/or a Graphics Processing Unit (GPU). Program code, including software programs, and/or data may be loaded for execution and processing by the CPU and/or GPU. The program code and/or data may be downloaded to the processor in electronic form, over a network, for example. Alternatively or additionally, the program code and/or data may be provided and/or stored on non-transitory tangible media, such as magnetic, optical, or electronic memory. Such program code and/or data, when provided to the processor, produce a machine or special-purpose computer, configured to perform the tasks described herein.
ALIGNMENT OF DIGITAL NOTE FILE WITH AUDIO FILE
Reference is now made to Fig. 2, which shows a flow diagram for an example algorithm 38 for aligning a digital note file with an audio file, in accordance with some embodiments of the present invention. Algorithm 38 may be executed, for example, by processor 24 (Fig. 1), by processor 27, or cooperatively by both processors.
At a first-spectrogram computing step 40 of algorithm 38, the processor computes a first spectrogram of first audio file 28 (Fig. 1). For example, the processor may compute a first time- localized frequency transform of the first audio file, and then compute the first spectrogram from the first time-localized frequency transform.
Next, at a converting step 42, the processor converts digital note file 30 to second audio file 32 (Fig. 1). Subsequently, at a second-spectrogram computing step 44, the processor computes a second spectrogram of the second audio file. For example, the processor may compute a second time-localized frequency transform of the second audio file, and then compute the second spectrogram from the second time-localized frequency transform.
Alternatively, converting step 42 and, optionally, second-spectrogram computing step 44 may be performed before first-spectrogram computing step 40. For example, as described above with reference to Fig. 1, converting step 42 may be performed by processor 27, and the second audio file may then be uploaded together with the first audio file and digital note file.
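By way of a non-limiting illustration, converting step 42 might be sketched in Python as follows, assuming the pretty_midi and soundfile packages are available; the sampling rate and file names are illustrative assumptions, and any conventional synthesis back end (such as FluidSynth with a SoundFont) could be substituted.

    import pretty_midi
    import soundfile as sf

    def render_note_file(midi_path: str, wav_path: str, fs: int = 44100) -> None:
        # Load the digital note file and synthesize it to a waveform.
        # synthesize() uses simple sinusoidal synthesis; fluidsynth() could be
        # used instead when a SoundFont is available.
        midi = pretty_midi.PrettyMIDI(midi_path)
        audio = midi.synthesize(fs=fs)
        sf.write(wav_path, audio, fs)

    # Hypothetical file names, for illustration only.
    render_note_file("digital_note_file.mid", "second_audio_file.wav")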
In some embodiments, each of the spectrograms includes a short-time Fourier transform (STFT) spectrogram; in other words, the time-localized frequency transforms, from which the spectrograms are computed, are STFTs. The processor may compute each STFT per the equation

X(k, n) = Σ_{i=0}^{K-1} x(i) * γ*(i - k) * e^(-j2πni/N),

where:

X(k, n) is the STFT for k = 0...K-1 and n = 0...N-1,
x(i), i = 0...K-1, is the signal encoded by the file, K being the length of the signal,
γ*(·) is the complex conjugate of a symmetric window γ(·), such as a trapezoid, a Hanning window, or a Blackman window,
k is the time bin,
n is the frequency bin, and
N is typically K/2 or (K+1)/2.
In other embodiments, the time-localized frequency transforms include wavelet transforms. Suitable mother wavelets for such transforms include the complex Gabor, the Symlet, the Coifman, the Morlet, and the cubic spline wavelets.
In some embodiments, the processor computes the spectrograms directly from the time-localized frequency transforms. For example, each STFT spectrogram may be computed as |X(k, n)|^2.
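As a non-limiting sketch of first-spectrogram computing step 40 (and, equivalently, second-spectrogram computing step 44), such an unfiltered STFT spectrogram might be computed with SciPy as follows; the window type, segment length, and hop size are illustrative assumptions rather than the parameters of the equation above.

    import numpy as np
    import soundfile as sf
    from scipy.signal import stft

    def stft_spectrogram(wav_path: str, nperseg: int = 4096, hop: int = 1024):
        x, fs = sf.read(wav_path)
        if x.ndim > 1:                      # mix down to mono if necessary
            x = x.mean(axis=1)
        # Hann window, one of the symmetric windows mentioned above.
        freqs, times, X = stft(x, fs=fs, window="hann",
                               nperseg=nperseg, noverlap=nperseg - hop)
        return freqs, times, np.abs(X) ** 2   # spectrogram as |X(k, n)|^2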
In other embodiments, the processor first filters each of the time-localized frequency transforms so as to emphasize frequencies of musical notes, and then computes the spectrograms from the filtered transforms. For example, each STFT spectrogram may be computed as |X'|^2, where X' is a filtered STFT.
To filter each transform, the processor may use a bank of Gaussian filters, each of the filters being centered at a respective one of the musical-note frequencies. For example, the processor may calculate a maximum number of filters in the filter bank per the equation
L = floor(log2((fs/2)/fmin)/log2(fr)),

where:

L is the maximum number of filters,
fs is the sampling frequency of the signal encoded in the audio file,
fmin is the minimum musical-note frequency of interest, such as the lowest frequency in the first octave (32.7 Hz),
fr = 2^(1/12) (12 being the number of notes in each octave), and
the function floor(·) returns the greatest integer less than or equal to the argument of the function.

The processor may further calculate the central frequency f_l of each filter as fmin * 2^(l/12), l = 0...L-1. After optionally eliminating any higher central frequencies that are not of interest, the processor may calculate the bandwidth δf_l of each l-th filter, l = 0...L'-1 for L' ≤ L, as

δf_l = max(f_l*(fr - 1), (fs/2)/N).
The processor may then define each filter F(l, n), for l = 0...L'-1 and n = 0...N-1, as follows:

F(l, n) = exp(-(n*(fs/2)/N - f_l)^2 / (2*δf_l^2)).
The processor may then compute each element X’(l,n) as follows:
X'(l, n) = Σ_{k=0}^{N-1} X(k, n) * F(l, k).
For example, Fig. 3 shows a spectrum 54 of an STFT, which was filtered in accordance with some embodiments of the present invention. By virtue of the application of the bank of Gaussian filters, spectrum 54 includes a plurality of Gaussian segments 56 having respective peaks at the musical-note frequencies. For example, a segment 56a has a peak at 3951 Hz, which is the frequency of note B in the seventh octave.
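A minimal NumPy sketch of the Gaussian filter bank described above is given below; for simplicity it applies the filters to the magnitude spectrogram rather than to the complex STFT, and fmin, the bin spacing, and all names are illustrative assumptions.

    import numpy as np

    def gaussian_note_filters(fs: float, n_freq_bins: int, fmin: float = 32.7):
        fr = 2.0 ** (1.0 / 12.0)                       # ratio between adjacent notes
        L = int(np.floor(np.log2((fs / 2) / fmin) / np.log2(fr)))
        centers = fmin * fr ** np.arange(L)            # f_l = fmin * 2^(l/12)
        bandwidths = np.maximum(centers * (fr - 1.0), (fs / 2) / n_freq_bins)
        bin_freqs = np.arange(n_freq_bins) * (fs / 2) / n_freq_bins
        # One Gaussian per musical-note frequency: rows are filters, columns are bins.
        return np.exp(-0.5 * ((bin_freqs[None, :] - centers[:, None])
                              / bandwidths[:, None]) ** 2)

    def emphasize_note_frequencies(S: np.ndarray, F: np.ndarray) -> np.ndarray:
        # S: (n_freq_bins, n_frames) spectrogram; result: (L, n_frames),
        # with peaks at the musical-note frequencies as in Fig. 3.
        return F @ S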
Regardless of how the spectrograms are computed, the size of each of the time bins of the first spectrogram is typically the same as the size of each of the time bins of the second spectrogram.
As shown in Fig. 2, subsequently to computing the spectrograms, the processor, at a mapping step 46, computes a mapping between spectra {s1_i}, i = 1...K1, of the first spectrogram, K1 being the number of spectra in the first spectrogram, and respective spectra {s2_j}, j = 1...K2, of the second spectrogram, K2 being the number of spectra in the second spectrogram, that minimizes a distance measure under predefined constraints.
This mapping may be represented by the notation {[i_1, j_1], [i_2, j_2], ..., [i_M, j_M]}, where M ≥ max(K1, K2) and each pair of indices [i_m, j_m] indicates a mapping between s1_{i_m} and s2_{j_m}. Typical constraints on the minimization are that each spectrum is mapped to at least one other spectrum, and that for any i_m and j_m, either (i) i_{m+1} = i_m + 1 and j_{m+1} = j_m + 1, (ii) i_{m+1} = i_m + 1 and j_{m+1} = j_m, or (iii) i_{m+1} = i_m and j_{m+1} = j_m + 1. (These two constraints imply that i_1 = 1, j_1 = 1, i_M = K1, and j_M = K2.) By way of example, Fig. 4A shows a hypothetical mapping {[1,1], [2,2], [2,3], [3,4], [4,4], [5,5], [6,6]} between the first few spectra of the first spectrogram and the first few spectra of the second spectrogram, in accordance with some embodiments of the present invention.
Typically, the distance measure, which is minimized in the mapping, is a function of respective local distances d_{i_m, j_m} between pairs of spectra mapped to one another. (In Fig. 4A, the mapping between each pair of spectra is annotated to show the local distance between the pair.) Typically, the local distance between each pair of spectra is a Minkowski distance, such as the L2 distance

d_{i,j} = sqrt(Σ_n (s1_i(n) - s2_j(n))^2).
For example, the distance measure may be a sum of the local distances; this sum may be minimized by finding a minimum-distance path through a matrix D of local distances between pairs of spectra, as further described below. Alternatively, the distance measure may be a more complicated function; an example of such a function is minimized by finding a minimum-distance path through the “accumulated distance gradient” matrix ADG described below.
In some embodiments, the processor computes the mapping by first computing a K1 x K2 or K2 x K1 matrix D', which is equal to or derived from a matrix D of which the (i, j) or (j, i) element is the local distance between the i-th spectrum of the first spectra and the j-th spectrum of the second spectra for 1 ≤ i ≤ K1 and 1 ≤ j ≤ K2. Subsequently to computing D', the processor computes a non-reversing path from D'[1, 1] to D'[K1, K2] or to D'[K2, K1] that minimizes the sum of the elements of D' through which the path passes. For example, the processor may compute the path using the Dynamic Time Warping algorithm, which is described, for example, in Muller, Meinard, "Dynamic time warping," Information Retrieval for Music and Motion (2007): 69-84, whose disclosure is incorporated herein by reference. Finally, for each element of D' through which the path passes, the processor maps the pair of spectra corresponding to the element to one another.
Fig. 4B illustrates this technique for the example shown in Fig. 4A. In particular, Fig. 4B shows a 6 x 6 corner of a K1 x K2 matrix D'. (As opposed to the usual convention, D'[1,1] is shown as the bottom-left element of D', rather than the top-left element.) A path 58 passes through D'[1,1], D'[2,2], D'[2,3], D'[3,4], D'[4,4], D'[5,5], and D'[6,6], thus implying the mapping {[1,1], [2,2], [2,3], [3,4], [4,4], [5,5], [6,6]}. Path 58 is non-reversing in that the path does not move backwards through the rows or columns of D' at any point, i.e., the path does not go from the i-th row to the (i-1)-th row or from the j-th column to the (j-1)-th column.
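A minimal sketch of the technique of mapping step 46 follows: it computes the matrix of local L2 distances and finds a non-reversing minimum-sum path by dynamic programming, in the spirit of the Dynamic Time Warping algorithm referenced above; the step pattern and the absence of pruning or windowing constraints are illustrative simplifications.

    import numpy as np

    def dtw_mapping(S1: np.ndarray, S2: np.ndarray):
        # S1: (K1, n_features) first spectra; S2: (K2, n_features) second spectra.
        K1, K2 = S1.shape[0], S2.shape[0]
        # Local L2 (Minkowski, p = 2) distance between every pair of spectra.
        D = np.sqrt(((S1[:, None, :] - S2[None, :, :]) ** 2).sum(axis=2))
        # Accumulated cost, allowing only the three non-reversing steps.
        C = np.full((K1, K2), np.inf)
        C[0, 0] = D[0, 0]
        for i in range(K1):
            for j in range(K2):
                if i == 0 and j == 0:
                    continue
                best_prev = min(
                    C[i - 1, j] if i > 0 else np.inf,
                    C[i, j - 1] if j > 0 else np.inf,
                    C[i - 1, j - 1] if i > 0 and j > 0 else np.inf,
                )
                C[i, j] = D[i, j] + best_prev
        # Backtrack from the last element to the first to recover the mapping.
        path, i, j = [(K1 - 1, K2 - 1)], K1 - 1, K2 - 1
        while i > 0 or j > 0:
            steps = [(i - 1, j - 1), (i - 1, j), (i, j - 1)]
            steps = [(a, b) for a, b in steps if a >= 0 and b >= 0]
            i, j = min(steps, key=lambda ab: C[ab])
            path.append((i, j))
        return path[::-1]   # [(i_m, j_m), ...], 0-based index pairs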
In some embodiments, D' = D, i.e., D'[i, j] or D'[j, i] is the local distance between the i-th spectrum of the first spectra and the j-th spectrum of the second spectra. In other embodiments, D' is an accumulated distance gradient matrix ADG, which may be computed by executing the following algorithm:
(i) Compute Gx and Gy, the horizontal and vertical gradient matrices of D. For example, Gx[i,j] may equal D[i,j+1] - D[i,j] and Gy[i,j] may equal D[i+1,j] - D[i,j].
(ii) Compute the matrix G such that G[i,j] = sqrt(Gx[i,j]^2 + Gy[i,j]^2).
(iii) Compute the matrix P such that P[i,j] = arctan(Gy[i,j]/Gx[i,j]).
(iv) Compute the matrices P1, P2, and P3 such that:

P1[i,j] = {P[i,j], 0 < P[i,j] <= 90 or 270 < P[i,j] <= 360; P[i,j] + 180, 90 < P[i,j] <= 270},

P2[i,j] = {P[i,j], 0 < P[i,j] <= 135 or 315 < P[i,j] <= 360; P[i,j] + 180, 135 < P[i,j] <= 315}, and

P3[i,j] = {P[i,j], 0 < P[i,j] <= 180; P[i,j] + 180, 180 < P[i,j] <= 360}.
(v) Compute the matrix ADG such that ADG[i,j] = G[i,j] + min(ADG[i-1,j]*cos(P1[i-1,j]), ADG[i-1,j-1]*cos(P2[i-1,j-1] - 45), ADG[i,j-1]*cos(P3[i,j-1] - 90)) for i > 1 and j > 1. (The first row and column of ADG may have any arbitrary values.)
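The accumulated distance gradient of steps (i)-(v) might be sketched as follows; the angle convention (degrees in [0, 360), obtained with arctan2 rather than a plain arctangent) and the handling of the last row, last column, and borders are illustrative assumptions.

    import numpy as np

    def accumulated_distance_gradient(D: np.ndarray) -> np.ndarray:
        # (i) Horizontal and vertical gradients of the local-distance matrix D.
        Gx = np.diff(D, axis=1, append=D[:, -1:])     # D[i, j+1] - D[i, j]
        Gy = np.diff(D, axis=0, append=D[-1:, :])     # D[i+1, j] - D[i, j]
        # (ii) Gradient magnitude and (iii) gradient direction in degrees.
        G = np.sqrt(Gx ** 2 + Gy ** 2)
        P = np.degrees(np.arctan2(Gy, Gx)) % 360.0
        # (iv) Fold the direction per the piecewise rules for P1, P2, and P3.
        def fold(angle, lo, hi):
            return np.where((angle > lo) & (angle <= hi), angle + 180.0, angle)
        P1 = fold(P, 90.0, 270.0)
        P2 = fold(P, 135.0, 315.0)
        P3 = fold(P, 180.0, 360.0)
        # (v) Accumulate; the first row and column may hold arbitrary values.
        ADG = np.zeros_like(D)
        ADG[0, :] = G[0, :]
        ADG[:, 0] = G[:, 0]
        for i in range(1, D.shape[0]):
            for j in range(1, D.shape[1]):
                ADG[i, j] = G[i, j] + min(
                    ADG[i - 1, j] * np.cos(np.radians(P1[i - 1, j])),
                    ADG[i - 1, j - 1] * np.cos(np.radians(P2[i - 1, j - 1] - 45.0)),
                    ADG[i, j - 1] * np.cos(np.radians(P3[i, j - 1] - 90.0)),
                )
        return ADG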
As shown in Fig. 2, optionally, following mapping step 46, the processor, at a smoothing step 48, smooths the mapping by fitting the mapping to a piecewise linear function including one or more linear segments.
In this regard, reference is now made to Fig. 5, which shows an example mapping 60 and smoothed mapping 62, which includes multiple linear segments, in accordance with some embodiments of the present invention. Per mapping 60, the spectra of the second spectrogram are mapped to corresponding spectra of the first spectrogram. Thus, mapping 60 also maps the time bins of the second spectrogram to the time bins of the first spectrogram. In other words, if the i_m-th spectrum of the first spectrogram and the j_m-th spectrum of the second spectrogram are mapped to one another, then the time bin represented by the i_m-th spectrum and the time bin represented by the j_m-th spectrum are also mapped to one another. By way of example, the horizontal axis in Fig. 5 corresponds to the second spectrogram, i.e., each point in mapping 60 is a point (j_m, i_m) representing a mapping between the i_m-th spectrum of the first spectrogram and the j_m-th spectrum of the second spectrogram.
Typically, to smooth the mapping, the processor first identifies any points 64 in mapping 60 at which the second derivative of the mapping is not within a predefined range (e.g., at which an absolute value of the second derivative is greater than a predefined threshold). For example, the processor may first calculate a local slope at each point in the mapping, by fitting a neighborhood of the point to a line (e.g., using linear regression) and then taking the slope of the line as the local slope. The processor may then calculate the second derivative at each point based on the local slopes.
For example, to calculate the second derivative at the m0-th point (i.e., the point (i_{m0}, j_{m0}) or (j_{m0}, i_{m0})), the processor may first calculate the 2p+1 local slopes {ls_{m0-p}, ..., ls_{m0-1}, ls_{m0}, ls_{m0+1}, ..., ls_{m0+p}} for the (m0-p)-th ... (m0+p)-th pairs of spectra mapped to one another, where p = 1, 2, or more. The processor may then fit a line to the local slopes using linear regression, and then calculate the second derivative as the slope of this line.
In the event that no points 64 are identified, the mapping is smoothed so as to include exactly one line, i.e., a piecewise linear function including exactly one segment, that best fits the points in mapping 60. This line may be calculated, for example, using linear regression.
Otherwise, the processor fits the mapping to a piecewise linear function including multiple linear segments joined to each other at the respective coordinates of the identified points representing the second spectrogram, which in Fig. 5 are the first coordinates of the identified points. For example, Fig. 5 shows a first linear segment 66a joined to a second linear segment 66b at an intersection point 65 having the first coordinate of a point 64, and second linear segment 66b joined to a third linear segment 66c at another intersection point 65 having the first coordinate of another point 64. Typically, the fitting is performed using a linear programming algorithm, which selects intersection points 65 so as to maximize a smoothness of the fitting.
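A simplified sketch of smoothing step 48 follows: local slopes are obtained by linear regression over a small neighborhood, breakpoints are flagged where the estimated second derivative leaves a threshold, and an ordinary least-squares line is fitted per segment. The neighborhood size and threshold are illustrative assumptions, and the linear-programming selection of intersection points described above is not reproduced here.

    import numpy as np

    def smooth_mapping(j_idx: np.ndarray, i_idx: np.ndarray,
                       half_window: int = 5, threshold: float = 0.05):
        # Local slope at each point, by fitting a line to a neighborhood of it.
        n = len(j_idx)
        slopes = np.empty(n)
        for m in range(n):
            lo, hi = max(0, m - half_window), min(n, m + half_window + 1)
            slopes[m] = np.polyfit(j_idx[lo:hi], i_idx[lo:hi], 1)[0]
        # Second derivative estimated from the change in local slope; points
        # where it leaves the allowed range become segment boundaries.
        second = np.gradient(slopes)
        breaks = np.where(np.abs(second) > threshold)[0]
        bounds = [0] + list(breaks) + [n - 1]
        # One least-squares line per segment of the piecewise linear function.
        segments = []
        for a, b in zip(bounds[:-1], bounds[1:]):
            if b - a < 2:
                continue
            slope, intercept = np.polyfit(j_idx[a:b + 1], i_idx[a:b + 1], 1)
            segments.append((j_idx[a], j_idx[b], slope, intercept))
        return segments   # each entry: (j_start, j_end, slope, intercept)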
Finally, as shown in Fig. 2, the processor, at a modifying step 50, modifies digital note file 30 (Fig. 1), based on the mapping (e.g., smoothed mapping 62), so as to increase the alignment of the notes in the digital note file with first audio file 28, i.e., so as to render the digital note file convertible to third audio file 36, which is more similar to the first audio file than is second audio file 32.
In particular, the processor shifts, and adjusts respective durations of, notes in the digital note file based on the mapping. Specifically, for each note, the processor identifies the current time period spanned by the note, identifies a new time period to which this current time period is mapped, and shifts and/or adjusts the duration of the note such that the note occupies the new time period. (In some cases, the new time period may be identical to the current time period, such that no changes in the duration or timing of the note are required.)
For example, for any time t at which the note begins or ends, the processor may first identify the time bin B2, of the second spectrogram, to which t belongs. Subsequently, the processor may identify the time bin B1, of the first spectrogram, that is mapped to B2 or to which B2 is mapped. Next, the processor may calculate t', the new time at which the note is to begin or end, by applying the equation t' = tsB1 + (t - tsB2) * LB1/LB2, where:

tsB1 and tsB2 are the start times of B1 and B2, respectively, and
LB1 and LB2 are the lengths of B1 and B2, respectively.

It is noted that, typically, LB1 = LB2, and t = tsB2 (in which case t' = tsB1) or t = teB2, the end time of B2 (in which case t' = teB1).
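The per-note retiming of modifying step 50 might be sketched as follows, assuming equal-length time bins and a mapping given as an array bin_map in which entry b2 holds the bin of the first spectrogram mapped to bin b2 of the second spectrogram; all names are illustrative.

    import numpy as np

    def remap_time(t: float, bin_map: np.ndarray, bin_len: float) -> float:
        # Identify the bin B2 of the second spectrogram containing t, then the
        # bin B1 of the first spectrogram mapped to it.
        b2 = min(int(t / bin_len), len(bin_map) - 1)
        b1 = int(bin_map[b2])
        ts_b1, ts_b2 = b1 * bin_len, b2 * bin_len
        # t' = tsB1 + (t - tsB2) * LB1/LB2, with LB1 = LB2 = bin_len.
        return ts_b1 + (t - ts_b2)

    def remap_note(start: float, end: float, bin_map: np.ndarray, bin_len: float):
        # Shift and stretch/compress the note so that it occupies the new time
        # period to which its current time period is mapped.
        return remap_time(start, bin_map, bin_len), remap_time(end, bin_map, bin_len)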
For example, Fig. 5 shows an example note 68a spanning a time period (t0, t1). Per the mapping, this time period is mapped to a slightly longer time period (t0', t1'), t0' being slightly greater than t0. Hence, note 68a is lengthened and shifted forward, such that in modified digital note file 34 (Fig. 1), note 68a is longer and occurs later in the musical piece.
Fig. 5 also shows another example note 68b, which spans a time period (t2, t3). As opposed to first linear segment 66a, whose slope is greater than one, third linear segment 66c has a slope less than one; hence, note 68b is shortened so as to occupy, in modified digital note file 34, a shorter time period (t2’, t3’).
It is noted that second linear segment 66b, which is horizontal, corresponds to notes in the digital note file that are not played in the audio file. Any notes occurring between the beginning and end of second linear segment 66b are removed from the digital note file. As a result of this removal, t2’ < t2, i.e., note 68b occurs earlier in the modified digital note file.
It will be appreciated by persons skilled in the art that the present invention is not limited to what has been particularly shown and described hereinabove. Rather, the scope of the present invention includes both combinations and subcombinations of the various features described hereinabove, as well as variations and modifications thereof that are not in the prior art, which would occur to persons skilled in the art upon reading the foregoing description.

Claims

1. A system, comprising: a memory; and one or more processors configured to cooperatively carry out a process that includes: loading a first audio file and a digital note file from the memory, computing a first spectrogram of the first audio file, computing a second spectrogram of a second audio file generated from the digital note file, computing a mapping between first spectra of the first spectrogram and respective second spectra of the second spectrogram that minimizes a distance measure under predefined constraints, and based on the mapping, shifting, and adjusting respective durations of, notes in the digital note file so as to increase an alignment of the notes with the first audio file.
2. The system according to claim 1, wherein the digital note file includes a Musical Instrument Digital Interface (MIDI) file.
3. The system according to claim 1, wherein the first spectrogram includes a first short-time Fourier transform (STFT) spectrogram and the second spectrogram includes a second STFT spectrogram.
4. The system according to any one of claims 1-3, wherein computing the first spectrogram and second spectrogram includes: computing a first time-localized frequency transform of the first audio file and a second time-localized frequency transform of the second audio file, filtering each of the first time-localized frequency transform and second time-localized frequency transform so as to emphasize frequencies of musical notes, and computing the first spectrogram and second spectrogram from the filtered first time-localized frequency transform and filtered second time-localized frequency transform, respectively.
5. The system according to claim 4, wherein filtering each of the first time-localized frequency transform and second time-localized frequency transform includes filtering each of the first time-localized frequency transform and second time-localized frequency transform using a bank of Gaussian filters, each of the filters being centered at a respective one of the frequencies.
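Purely as an illustration of the filtering recited in claims 4 and 5, the sketch below multiplies an STFT magnitude spectrogram by a bank of Gaussian filters, one per equal-tempered note frequency. The use of scipy, the piano note range, the log-frequency Gaussian shape, and the bandwidth are assumptions and not part of the claims.

```python
import numpy as np
from scipy.signal import stft

def note_filtered_spectrogram(audio, fs, nperseg=4096, sigma_semitones=0.25):
    """Compute an STFT magnitude spectrogram and emphasize the frequencies of
    musical notes with a bank of Gaussian filters, one per equal-tempered
    note (assumed here to span MIDI notes 21-108, i.e. a piano keyboard)."""
    freqs, times, Z = stft(audio, fs=fs, nperseg=nperseg)
    mag = np.abs(Z)                                          # (n_freqs, n_frames)

    midi = np.arange(21, 109)                                # A0 .. C8
    note_freqs = 440.0 * 2.0 ** ((midi - 69) / 12.0)

    # Each filter is a Gaussian (here, in log-frequency) centered on one note.
    log_f = np.log2(np.maximum(freqs, 1e-6))[None, :]        # (1, n_freqs)
    log_c = np.log2(note_freqs)[:, None]                     # (n_notes, 1)
    sigma = sigma_semitones / 12.0                           # width in octaves
    bank = np.exp(-0.5 * ((log_f - log_c) / sigma) ** 2)     # (n_notes, n_freqs)

    # Project the magnitude spectrogram onto the note filters.
    return bank @ mag, note_freqs, times
```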
6. The system according to any one of claims 1-3, wherein the distance measure is a function of respective local distances between pairs of spectra mapped to one another.
7. The system according to claim 6, wherein each of the local distances is a Minkowski distance.
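For reference, a Minkowski local distance of order p between two spectra, as recited in claims 7 and 17, may be computed as sketched below; the default order p = 2 is an assumption.

```python
import numpy as np

def minkowski_distance(spectrum_a, spectrum_b, p=2):
    """Minkowski distance of order p between two spectra
    (p = 2 gives the Euclidean distance, p = 1 the Manhattan distance)."""
    diff = np.abs(np.asarray(spectrum_a, float) - np.asarray(spectrum_b, float))
    return float(np.sum(diff ** p) ** (1.0 / p))
```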
8. The system according to any one of claims 1-3, wherein computing the mapping includes: computing a K1 x K2 or K2 x K1 matrix D', which is equal to or derived from another matrix of which an (i, j) or (j, i) element is a local distance between an i-th spectrum of the first spectra and a j-th spectrum of the second spectra for 1 ≤ i ≤ K1 and 1 ≤ j ≤ K2, K1 and K2 being respective numbers of the first spectra and the second spectra, computing a non-reversing path from D'[1, 1] to D'[K1, K2] or to D'[K2, K1] that minimizes a sum of elements of D' through which the path passes, and for each of the elements of D' through which the path passes, mapping a pair of spectra corresponding to the element to one another.
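The non-reversing minimum-cost path of claim 8 resembles a dynamic-time-warping accumulation. The sketch below illustrates one generic dynamic-programming search over a local-distance matrix D, assuming standard step constraints (one step along either axis or along both); the claims do not specify these particular constraints.

```python
import numpy as np

def min_cost_path(D):
    """Find a monotonic (non-reversing) path from D[0, 0] to D[-1, -1] that
    minimizes the sum of the matrix elements it passes through."""
    K1, K2 = D.shape
    cost = np.full((K1, K2), np.inf)
    cost[0, 0] = D[0, 0]
    for i in range(K1):
        for j in range(K2):
            if i == 0 and j == 0:
                continue
            best_prev = min(
                cost[i - 1, j] if i > 0 else np.inf,                 # advance in the first spectrogram
                cost[i, j - 1] if j > 0 else np.inf,                 # advance in the second spectrogram
                cost[i - 1, j - 1] if i > 0 and j > 0 else np.inf,   # advance in both
            )
            cost[i, j] = D[i, j] + best_prev

    # Backtrack to recover the path, i.e. the (i, j) pairs of spectra mapped to one another.
    path, i, j = [(K1 - 1, K2 - 1)], K1 - 1, K2 - 1
    while (i, j) != (0, 0):
        candidates = []
        if i > 0 and j > 0:
            candidates.append((cost[i - 1, j - 1], i - 1, j - 1))
        if i > 0:
            candidates.append((cost[i - 1, j], i - 1, j))
        if j > 0:
            candidates.append((cost[i, j - 1], i, j - 1))
        _, i, j = min(candidates)
        path.append((i, j))
    return cost[-1, -1], path[::-1]
```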
9. The system according to any one of claims 1-3, wherein the process further includes, prior to modifying the digital note file, smoothing the mapping by fitting the mapping to a piecewise linear function including one or more linear segments.
10. The system according to claim 9, wherein the process further includes identifying one or more points in the mapping at which a second derivative of the mapping is not within a predefined range, and wherein the piecewise linear function includes multiple linear segments joined to each other at respective coordinates of the identified points representing the second spectrogram.
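One plausible reading of the smoothing of claims 9 and 10 is sketched below: breakpoints are placed where a discrete second derivative of the mapping leaves a predefined range, and a straight line is fitted to each stretch between breakpoints. The breakpoint test, the threshold range, and the least-squares fit are illustrative assumptions.

```python
import numpy as np

def piecewise_linear_smooth(x, y, second_deriv_range=(-0.5, 0.5)):
    """Fit the mapping y(x) to a piecewise linear function whose segments are
    joined where the discrete second derivative of y falls outside
    second_deriv_range. Returns a list of (slope, intercept, x_start, x_end)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    d2 = np.gradient(np.gradient(y, x), x)

    lo, hi = second_deriv_range
    breakpoints = [0] + [i for i in range(1, len(x) - 1)
                         if not (lo <= d2[i] <= hi)] + [len(x) - 1]

    segments = []
    for a, b in zip(breakpoints[:-1], breakpoints[1:]):
        if b <= a:
            continue
        slope, intercept = np.polyfit(x[a:b + 1], y[a:b + 1], 1)  # least-squares line
        segments.append((slope, intercept, x[a], x[b]))
    return segments
```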
11. A method, comprising: computing a first spectrogram of a first audio file; computing a second spectrogram of a second audio file generated from a digital note file; computing a mapping between first spectra of the first spectrogram and respective second spectra of the second spectrogram that minimizes a distance measure under predefined constraints; and based on the mapping, shifting, and adjusting respective durations of, notes in the digital note file so as to increase an alignment of the notes with the first audio file.
12. The method according to claim 11, wherein the digital note file includes a Musical Instrument Digital Interface (MIDI) file.
13. The method according to claim 11, wherein the first spectrogram includes a first short-time Fourier transform (STFT) spectrogram and the second spectrogram includes a second STFT spectrogram.
14. The method according to any one of claims 11-13, wherein computing the first spectrogram and second spectrogram comprises: computing a first time-localized frequency transform of the first audio file and a second time-localized frequency transform of the second audio file; filtering each of the first time-localized frequency transform and second time-localized frequency transform so as to emphasize frequencies of musical notes; and computing the first spectrogram and second spectrogram from the filtered first time-localized frequency transform and filtered second time-localized frequency transform, respectively.
15. The method according to claim 14, wherein filtering each of the first time-localized frequency transform and second time-localized frequency transform comprises filtering each of the first time-localized frequency transform and second time-localized frequency transform using a bank of Gaussian filters, each of the filters being centered at a respective one of the frequencies.
16. The method according to any one of claims 11-13, wherein the distance measure is a function of respective local distances between pairs of spectra mapped to one another.
17. The method according to claim 16, wherein each of the local distances is a Minkowski distance.
18. The method according to any one of claims 11-13, wherein computing the mapping comprises: computing a K1 x K2 or K2 x K1 matrix D', which is equal to or derived from another matrix of which an (i, j) or (j, i) element is a local distance between an i-th spectrum of the first spectra and a j-th spectrum of the second spectra for 1 ≤ i ≤ K1 and 1 ≤ j ≤ K2, K1 and K2 being respective numbers of the first spectra and the second spectra; computing a non-reversing path from D'[1, 1] to D'[K1, K2] or to D'[K2, K1] that minimizes a sum of elements of D' through which the path passes; and for each of the elements of D' through which the path passes, mapping a pair of spectra corresponding to the element to one another.
19. The method according to any one of claims 11-13, further comprising, prior to modifying the digital note file, smoothing the mapping by fitting the mapping to a piecewise linear function including one or more linear segments.
20. The method according to claim 19, further comprising identifying one or more points in the mapping at which a second derivative of the mapping is not within a predefined range, wherein the piecewise linear function includes multiple linear segments joined to each other at respective coordinates of the identified points representing the second spectrogram.
21. A computer software product comprising a tangible non-transitory computer-readable medium in which program instructions are stored, which instructions, when read by a processor, cause the processor to: compute a first spectrogram of a first audio file, compute a second spectrogram of a second audio file generated from a digital note file, compute a mapping between first spectra of the first spectrogram and respective second spectra of the second spectrogram that minimizes a distance measure under predefined constraints, and based on the mapping, shift, and adjust respective durations of, notes in the digital note file so as to increase an alignment of the notes with the first audio file.
PCT/IB2022/060330 2021-11-03 2022-10-27 Aligning digital note files with audio WO2023079419A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US202163274977P 2021-11-03 2021-11-03
US63/274,977 2021-11-03

Publications (1)

Publication Number Publication Date
WO2023079419A1 true WO2023079419A1 (en) 2023-05-11

Family

ID=86240853

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/IB2022/060330 WO2023079419A1 (en) 2021-11-03 2022-10-27 Aligning digital note files with audio

Country Status (1)

Country Link
WO (1) WO2023079419A1 (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040182229A1 (en) * 2001-06-25 2004-09-23 Doill Jung Method and apparatus for designating performance notes based on synchronization information
US20120198317A1 (en) * 2011-02-02 2012-08-02 Eppolito Aaron M Automatic synchronization of media clips
CN103354092A (en) * 2013-06-27 2013-10-16 天津大学 Audio music-score comparison method with error detection function

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22889533

Country of ref document: EP

Kind code of ref document: A1