WO2023079419A1 - Aligning digital note files with audio - Google Patents

Aligning digital note files with audio

Info

Publication number
WO2023079419A1
Authority
WO
WIPO (PCT)
Prior art keywords
spectrogram
mapping
spectra
time
computing
Application number
PCT/IB2022/060330
Other languages
French (fr)
Inventor
Yoav MOR
Hagay KONYO
Original Assignee
Sphereo Sound Ltd.
Application filed by Sphereo Sound Ltd.
Publication of WO2023079419A1

Classifications

    • G: PHYSICS
    • G11: INFORMATION STORAGE
    • G11B: INFORMATION STORAGE BASED ON RELATIVE MOVEMENT BETWEEN RECORD CARRIER AND TRANSDUCER
    • G11B 27/00: Editing; Indexing; Addressing; Timing or synchronising; Monitoring; Measuring tape travel
    • G11B 27/005: Reproducing at a different information rate from the information rate of recording
    • G11B 27/10: Indexing; Addressing; Timing or synchronising; Measuring tape travel
    • G11B 27/19: Indexing; Addressing; Timing or synchronising; Measuring tape travel by using information detectable on the record carrier
    • G11B 27/28: Indexing; Addressing; Timing or synchronising; Measuring tape travel by using information detectable on the record carrier by using information signals recorded by the same method as the main recording

Abstract

A system (20) includes a memory (26, 29) and one or more processors (24, 27) configured to cooperatively carry out a process. The process includes loading a first audio file (28) and a digital note file (30) from the memory, computing a first spectrogram of the first audio file, computing a second spectrogram of a second audio file (32) generated from the digital note file, computing a mapping between first spectra of the first spectrogram and respective second spectra of the second spectrogram that minimizes a distance measure under predefined constraints, and based on the mapping, shifting, and adjusting respective durations of, notes in the digital note file so as to increase an alignment of the notes with the first audio file. Other embodiments are also described.

Description

ALIGNING DIGITAL NOTE FILES WITH AUDIO
CROSS-REFERENCE TO RELATED APPLICATIONS
The present application claims the benefit of US Provisional Application 63/274,977, filed November 3, 2021, whose disclosure is incorporated herein by reference.
FIELD OF THE INVENTION
The present invention relates to the field of digital audio, particularly digital music.
BACKGROUND
A Musical Instrument Digital Interface (MIDI) file specifies the timing and loudness of the musical notes belonging to a musical piece. To play the musical piece, the MIDI file is converted to a suitable audio format, such as Moving Picture Experts Group Layer-3 Audio (MP3), via conventional techniques such as those implemented by FluidSynth™.
SUMMARY OF THE INVENTION
There is provided, in accordance with some embodiments of the present invention, a system including a memory and one or more processors configured to cooperatively carry out a process. The process includes loading a first audio file and a digital note file from the memory, computing a first spectrogram of the first audio file, computing a second spectrogram of a second audio file generated from the digital note file, computing a mapping between first spectra of the first spectrogram and respective second spectra of the second spectrogram that minimizes a distance measure under predefined constraints, and based on the mapping, shifting, and adjusting respective durations of, notes in the digital note file so as to increase an alignment of the notes with the first audio file.
In some embodiments, the digital note file includes a Musical Instrument Digital Interface (MIDI) file.
In some embodiments, the first spectrogram includes a first short-time Fourier transform (STFT) spectrogram and the second spectrogram includes a second STFT spectrogram.
In some embodiments, computing the first spectrogram and second spectrogram includes: computing a first time-localized frequency transform of the first audio file and a second time-localized frequency transform of the second audio file, filtering each of the first time-localized frequency transform and second time-localized frequency transform so as to emphasize frequencies of musical notes, and computing the first spectrogram and second spectrogram from the filtered first time- localized frequency transform and filtered second time-localized frequency transform, respectively.
In some embodiments, filtering each of the first time-localized frequency transform and second time-localized frequency transform includes filtering each of the first time-localized frequency transform and second time-localized frequency transform using a bank of Gaussian filters, each of the filters being centered at a respective one of the frequencies.
In some embodiments, the distance measure is a function of respective local distances between pairs of spectra mapped to one another.
In some embodiments, each of the local distances is a Minkowski distance.
In some embodiments, computing the mapping includes: computing a K1 x K2 or K2 x K1 matrix D', which is equal to or derived from another matrix of which an (i, j) or (j, i) element is a local distance between an i-th spectrum of the first spectra and a j-th spectrum of the second spectra for 1 ≤ i ≤ K1 and 1 ≤ j ≤ K2, K1 and K2 being respective numbers of the first spectra and the second spectra, computing a non-reversing path from D'[1, 1] to D'[K1, K2] or to D'[K2, K1] that minimizes a sum of elements of D' through which the path passes, and for each of the elements of D' through which the path passes, mapping a pair of spectra corresponding to the element to one another.
In some embodiments, the process further includes, prior to modifying the digital note file, smoothing the mapping by fitting the mapping to a piecewise linear function including one or more linear segments.
In some embodiments, the process further includes identifying one or more points in the mapping at which a second derivative of the mapping is not within a predefined range, and the piecewise linear function includes multiple linear segments joined to each other at respective coordinates of the identified points representing the second spectrogram.

There is further provided, in accordance with some embodiments of the present invention, a method including computing a first spectrogram of a first audio file, computing a second spectrogram of a second audio file generated from a digital note file, computing a mapping between first spectra of the first spectrogram and respective second spectra of the second spectrogram that minimizes a distance measure under predefined constraints, and based on the mapping, shifting, and adjusting respective durations of, notes in the digital note file so as to increase an alignment of the notes with the first audio file.
There is further provided, in accordance with some embodiments of the present invention, a computer software product including a tangible non-transitory computer-readable medium in which program instructions are stored. The instructions, when read by a processor, cause the processor to compute a first spectrogram of a first audio file, to compute a second spectrogram of a second audio file generated from a digital note file, to compute a mapping between first spectra of the first spectrogram and respective second spectra of the second spectrogram that minimizes a distance measure under predefined constraints, and, based on the mapping, to shift, and adjust respective durations of, notes in the digital note file so as to increase an alignment of the notes with the first audio file.
The present invention will be more fully understood from the following detailed description of embodiments thereof, taken together with the drawings, in which:
BRIEF DESCRIPTION OF THE DRAWINGS
Fig. 1 is a schematic illustration of a system for aligning a digital note file with an audio file, in accordance with some embodiments of the present invention;
Fig. 2 shows a flow diagram for an example algorithm for aligning a digital note file with an audio file, in accordance with some embodiments of the present invention;
Fig. 3 shows a spectrum of a short-time Fourier transform, which was filtered in accordance with some embodiments of the present invention;
Figs. 4A-B show a hypothetical mapping between spectra of a first spectrogram and spectra of a second spectrogram, in accordance with some embodiments of the present invention; and
Fig. 5 shows an example mapping and smoothed mapping, in accordance with some embodiments of the present invention.

DETAILED DESCRIPTION OF EMBODIMENTS
OVERVIEW
A digital note file, such as a MIDI file, representing a musical piece is often used in conjunction with an audio recording of the musical piece. For example, a user may attempt to synchronize the playing of an instrument represented in the digital note file with the audio recording. Alternatively (e.g., in karaoke), the user may attempt to synchronize the singing of vocal notes represented in the digital note file with the audio recording. Alternatively, the user may practice his singing with reference to the digital note file, add an instrument to, or replace an instrument in, the digital note file, or otherwise modify the digital note file (e.g., so as to change the tempo of the musical piece). For all of these applications, it is helpful to have alignment between the digital note file and the audio recording.
However, in many cases, the digital note file may be misaligned with the audio recording. For example, the timing and/or duration of particular notes in the digital note file may be different from the timing and/or duration of these notes in the audio recording. Alternatively or additionally, the digital note file may contain one or more notes that are not played in the audio recording; for example, the number of times a chorus is repeated in the digital note file may be greater than the number of times the chorus is repeated in the audio recording.
To address this challenge, embodiments of the present invention align (or “synchronize”) the digital note file with the audio recording. In particular, the digital note file is converted to an audio file using conventional techniques, and a time-localized spectrogram, such as a short-time Fourier transform (STFT) spectrogram, of both audio files is computed. Subsequently, the spectra of one spectrogram are mapped to the spectra of the other spectrogram, under predefined constraints, so as to minimize a total distance. A piecewise linear function may then be fit to the mapping so as to smooth the mapping. Finally, the notes in the digital note file are shifted, and are also stretched or compressed, in accordance with the mapping, so as to better align the notes with the audio recording.
SYSTEM DESCRIPTION
Reference is initially made to Fig. 1, which is a schematic illustration of a system 20 for aligning a digital note file 30 with a first audio file 28, in accordance with some embodiments of the present invention.
System 20 comprises a server 22 comprising a processor 24, a network interface 25 comprising, for example, a network interface controller (NIC), and a memory 26. In some embodiments, server 22 belongs to a cloud server farm.
Memory 26 may comprise a volatile memory (e.g., a random-access memory (RAM)) and/or a non-volatile memory (e.g., a flash drive). Memory 26 is configured to store first audio file 28, which may have any suitable audio format such as MP3. Memory 26 is further configured to store digital note file 30, which may include a Musical Instrument Digital Interface (MIDI) file, for example.
In alternative embodiments, processor 24 and memory 26 belong to different respective computers, such as different respective servers in a cloud server farm. Alternatively or additionally, memory 26 may be distributed over multiple computers.
System 20 further comprises a device 21 comprising, for example, a desktop computer, a laptop computer, or a smartphone. Device 21 comprises a processor 27, a network interface 31 comprising, for example, a network interface controller (NIC), and a memory 29. Memory 29 may comprise a volatile memory (e.g., a random-access memory (RAM)) and/or a non-volatile memory (e.g., a flash drive). Memory 29 is also configured to store first audio file 28 and digital note file 30.
Processor 24 and processor 27 are configured to communicate with one another over a network 23, such as the Internet, via network interface 25 and network interface 31. Thus, for example, in response to instructions from a user 33, processor 27 may load the first audio file and digital note file from memory 29 and then upload these files to server 22. Subsequently, first audio file 28 and digital note file 30 may be stored in memory 26. Processor 24 may then load the files from memory 26 and process the files as described below with reference to the subsequent figures. Based on the processing, the processor may modify digital note file 30 so as to increase the alignment of the notes in the digital note file with first audio file 28, i.e., so as to render the digital note file convertible to a third audio file 36 that is more similar to first audio file 28 than is second audio file 32. Subsequently to modifying the digital note file, processor 24 may store the resulting modified digital note file 34 in memory 26 and/or communicate modified digital note file 34 to device 21. In response to receiving the modified digital note file, processor 27 may store the modified digital note file in memory 29.
In some embodiments, processor 24, in performing the aforementioned processing, converts digital note file 30 to a second audio file 32 using conventional techniques. In other embodiments, processor 27 performs this conversion; in such embodiments, second audio file 32 may be uploaded to server 22 together with the first audio file and the digital note file.
In alternative embodiments, at least some of the aforementioned processing and/or modification of the digital note file is performed by processor 27 and/or at least one other processor, such as the processor of another server residing, together with server 22, on a cloud server farm.
In general, each of processor 24 and processor 27 may be embodied as a single processor, or as a cooperatively networked or clustered set of processors. The functionality of each of the processors may be implemented solely in hardware, e.g., using one or more fixed-function or general-purpose integrated circuits, Application-Specific Integrated Circuits (ASICs), and/or Field-Programmable Gate Arrays (FPGAs). Alternatively, this functionality may be implemented at least partly in software. For example, processor 24 and/or processor 27 may be embodied as a programmed processor comprising, for example, a central processing unit (CPU) and/or a Graphics Processing Unit (GPU). Program code, including software programs, and/or data may be loaded for execution and processing by the CPU and/or GPU. The program code and/or data may be downloaded to the processor in electronic form, over a network, for example. Alternatively or additionally, the program code and/or data may be provided and/or stored on non-transitory tangible media, such as magnetic, optical, or electronic memory. Such program code and/or data, when provided to the processor, produce a machine or special-purpose computer, configured to perform the tasks described herein.
ALIGNMENT OF DIGITAL NOTE FILE WITH AUDIO FILE
Reference is now made to Fig. 2, which shows a flow diagram for an example algorithm 38 for aligning a digital note file with an audio file, in accordance with some embodiments of the present invention. Algorithm 38 may be executed, for example, by processor 24 (Fig. 1), by processor 27, or cooperatively by both processors.
At a first-spectrogram computing step 40 of algorithm 38, the processor computes a first spectrogram of first audio file 28 (Fig. 1). For example, the processor may compute a first time- localized frequency transform of the first audio file, and then compute the first spectrogram from the first time-localized frequency transform.
Next, at a converting step 42, the processor converts digital note file 30 to second audio file 32 (Fig. 1). Subsequently, at a second-spectrogram computing step 44, the processor computes a second spectrogram of the second audio file. For example, the processor may compute a second time-localized frequency transform of the second audio file, and then compute the second spectrogram from the second time-localized frequency transform.
Alternatively, converting step 42 and, optionally, second-spectrogram computing step 44 may be performed before first-spectrogram computing step 40. For example, as described above with reference to Fig. 1, converting step 42 may be performed by processor 27, and the second audio file may then be uploaded together with the first audio file and digital note file.
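By way of a non-limiting illustration, converting step 42 might be sketched in Python as follows, assuming the pretty_midi and soundfile packages are available; the sampling rate and file names are illustrative assumptions, and any conventional synthesis back end (such as FluidSynth with a SoundFont) could be substituted.

    import pretty_midi
    import soundfile as sf

    def render_note_file(midi_path: str, wav_path: str, fs: int = 44100) -> None:
        # Load the digital note file and synthesize it to a waveform.
        # synthesize() uses simple sinusoidal synthesis; fluidsynth() could be
        # used instead when a SoundFont is available.
        midi = pretty_midi.PrettyMIDI(midi_path)
        audio = midi.synthesize(fs=fs)
        sf.write(wav_path, audio, fs)

    # Hypothetical file names, for illustration only.
    render_note_file("digital_note_file.mid", "second_audio_file.wav")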
In some embodiments, each of the spectrograms includes a short-time Fourier transform (STFT) spectrogram; in other words, the time-localized frequency transforms, from which the spectrograms are computed, are STFTs. The processor may compute each STFT per the equation

X(k, n) = Σ_{i=0}^{K-1} x(i) * γ*(i - k) * e^(-j2πni/N),

where:

X(k, n) is the STFT for k = 0...K-1 and n = 0...N-1,
x(i), i = 0...K-1, is the signal encoded by the file, K being the length of the signal,
γ*(·) is the complex conjugate of a symmetric window γ(·), such as a trapezoid, a Hanning window, or a Blackman window,
k is the time bin,
n is the frequency bin, and
N is typically K/2 or (K+1)/2.
In other embodiments, the time-localized frequency transforms include wavelet transforms. Suitable mother wavelets for such transforms include the complex Gabor, the Symlet, the Coifman, the Morlet, and the cubic spline wavelets.
In some embodiments, the processor computes the spectrograms directly from the time-localized frequency transforms. For example, each STFT spectrogram may be computed as |X(k, n)|^2.
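As a non-limiting sketch of first-spectrogram computing step 40 (and, equivalently, second-spectrogram computing step 44), such an unfiltered STFT spectrogram might be computed with SciPy as follows; the window type, segment length, and hop size are illustrative assumptions rather than the parameters of the equation above.

    import numpy as np
    import soundfile as sf
    from scipy.signal import stft

    def stft_spectrogram(wav_path: str, nperseg: int = 4096, hop: int = 1024):
        x, fs = sf.read(wav_path)
        if x.ndim > 1:                      # mix down to mono if necessary
            x = x.mean(axis=1)
        # Hann window, one of the symmetric windows mentioned above.
        freqs, times, X = stft(x, fs=fs, window="hann",
                               nperseg=nperseg, noverlap=nperseg - hop)
        return freqs, times, np.abs(X) ** 2   # spectrogram as |X(k, n)|^2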
In other embodiments, the processor first filters each of the time-localized frequency transforms so as to emphasize frequencies of musical notes, and then computes the spectrograms from the filtered transforms. For example, each STFT spectrogram may be computed as |X'|^2, where X' is a filtered STFT.
To filter each transform, the processor may use a bank of Gaussian filters, each of the filters being centered at a respective one of the musical-note frequencies. For example, the processor may calculate a maximum number of filters in the filter bank per the equation
L = floor(log2((fs/2)/fmin)/log2(fr)),

where:

L is the maximum number of filters,
fs is the sampling frequency of the signal encoded in the audio file,
fmin is the minimum musical-note frequency of interest, such as the lowest frequency in the first octave (32.7 Hz),
fr = 2^(1/12) (12 being the number of notes in each octave), and
the function floor(·) returns the greatest integer less than or equal to the argument of the function.

The processor may further calculate the central frequency f_l of each filter as fmin * 2^(l/12), l = 0...L-1. After optionally eliminating any higher central frequencies that are not of interest, the processor may calculate the bandwidth δf_l of each l-th filter, l = 0...L'-1 for L' ≤ L, as

δf_l = max(f_l*(fr - 1), (fs/2)/N).
The processor may then define each filter F(l, n), for l = 0...L'-1 and n = 0...N-1, as follows:

F(l, n) = exp(-(n*(fs/2)/N - f_l)^2 / (2*δf_l^2)).
The processor may then compute each element X’(l,n) as follows:
X'(l, n) = Σ_{k=0}^{N-1} X(k, n) * F(l, k).
For example, Fig. 3 shows a spectrum 54 of an STFT, which was filtered in accordance with some embodiments of the present invention. By virtue of the application of the bank of Gaussian filters, spectrum 54 includes a plurality of Gaussian segments 56 having respective peaks at the musical-note frequencies. For example, a segment 56a has a peak at 3951 Hz, which is the frequency of note B in the seventh octave.
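A minimal NumPy sketch of the Gaussian filter bank described above is given below; for simplicity it applies the filters to the magnitude spectrogram rather than to the complex STFT, and fmin, the bin spacing, and all names are illustrative assumptions.

    import numpy as np

    def gaussian_note_filters(fs: float, n_freq_bins: int, fmin: float = 32.7):
        fr = 2.0 ** (1.0 / 12.0)                       # ratio between adjacent notes
        L = int(np.floor(np.log2((fs / 2) / fmin) / np.log2(fr)))
        centers = fmin * fr ** np.arange(L)            # f_l = fmin * 2^(l/12)
        bandwidths = np.maximum(centers * (fr - 1.0), (fs / 2) / n_freq_bins)
        bin_freqs = np.arange(n_freq_bins) * (fs / 2) / n_freq_bins
        # One Gaussian per musical-note frequency: rows are filters, columns are bins.
        return np.exp(-0.5 * ((bin_freqs[None, :] - centers[:, None])
                              / bandwidths[:, None]) ** 2)

    def emphasize_note_frequencies(S: np.ndarray, F: np.ndarray) -> np.ndarray:
        # S: (n_freq_bins, n_frames) spectrogram; result: (L, n_frames),
        # with peaks at the musical-note frequencies as in Fig. 3.
        return F @ S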
Regardless of how the spectrograms are computed, the size of each of the time bins of the first spectrogram is typically the same as the size of each of the time bins of the second spectrogram.
As shown in Fig. 2, subsequently to computing the spectrograms, the processor, at a mapping step 46, computes a mapping between spectra {s1_i}, i = 1...K1, of the first spectrogram, K1 being the number of spectra in the first spectrogram, and respective spectra {s2_j}, j = 1...K2, of the second spectrogram, K2 being the number of spectra in the second spectrogram, that minimizes a distance measure under predefined constraints.
This mapping may be represented by the notation {[i_1, j_1], [i_2, j_2], ..., [i_M, j_M]}, where M ≥ max(K1, K2) and each pair of indices [i_m, j_m] indicates a mapping between s1_{i_m} and s2_{j_m}. Typical constraints on the minimization are that each spectrum is mapped to at least one other spectrum, and that for any i_m and j_m, either (i) i_{m+1} = i_m + 1 and j_{m+1} = j_m + 1, (ii) i_{m+1} = i_m + 1 and j_{m+1} = j_m, or (iii) i_{m+1} = i_m and j_{m+1} = j_m + 1. (These two constraints imply that i_1 = 1, j_1 = 1, i_M = K1, and j_M = K2.) By way of example, Fig. 4A shows a hypothetical mapping {[1,1], [2,2], [2,3], [3,4], [4,4], [5,5], [6,6]} between the first few spectra of the first spectrogram and the first few spectra of the second spectrogram, in accordance with some embodiments of the present invention.
Typically, the distance measure, which is minimized in the mapping, is a function of respective local distances d_{i_m, j_m} between pairs of spectra mapped to one another. (In Fig. 4A, the mapping between each pair of spectra is annotated to show the local distance between the pair.) Typically, the local distance between each pair of spectra is a Minkowski distance, such as the L2 distance

d_{i,j} = sqrt(Σ_n (s1_i(n) - s2_j(n))^2).
For example, the distance measure may be a sum of the local distances; this sum may be minimized by finding a minimum-distance path through a matrix D of local distances between pairs of spectra, as further described below. Alternatively, the distance measure may be a more complicated function; an example of such a function is minimized by finding a minimum-distance path through the “accumulated distance gradient” matrix ADG described below.
In some embodiments, the processor computes the mapping by first computing a K1 x K2 or K2 x K1 matrix D', which is equal to or derived from a matrix D of which the (i, j) or (j, i) element is the local distance between the i-th spectrum of the first spectra and the j-th spectrum of the second spectra for 1 ≤ i ≤ K1 and 1 ≤ j ≤ K2. Subsequently to computing D', the processor computes a non-reversing path from D'[1, 1] to D'[K1, K2] or to D'[K2, K1] that minimizes the sum of the elements of D' through which the path passes. For example, the processor may compute the path using the Dynamic Time Warping algorithm, which is described, for example, in Muller, Meinard, "Dynamic time warping," Information Retrieval for Music and Motion (2007): 69-84, whose disclosure is incorporated herein by reference. Finally, for each element of D' through which the path passes, the processor maps the pair of spectra corresponding to the element to one another.
Fig. 4B illustrates this technique for the example shown in Fig. 4A. In particular, Fig. 4B shows a 6 x 6 corner of a K1 x K2 matrix D'. (As opposed to the usual convention, D'[1,1] is shown as the bottom-left element of D', rather than the top-left element.) A path 58 passes through D'[1,1], D'[2,2], D'[2,3], D'[3,4], D'[4,4], D'[5,5], and D'[6,6], thus implying the mapping {[1,1], [2,2], [2,3], [3,4], [4,4], [5,5], [6,6]}. Path 58 is non-reversing in that the path does not move backwards through the rows or columns of D' at any point, i.e., the path does not go from the i-th row to the (i-1)-th row or from the j-th column to the (j-1)-th column.
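A minimal sketch of the technique of mapping step 46 follows: it computes the matrix of local L2 distances and finds a non-reversing minimum-sum path by dynamic programming, in the spirit of the Dynamic Time Warping algorithm referenced above; the step pattern and the absence of pruning or windowing constraints are illustrative simplifications.

    import numpy as np

    def dtw_mapping(S1: np.ndarray, S2: np.ndarray):
        # S1: (K1, n_features) first spectra; S2: (K2, n_features) second spectra.
        K1, K2 = S1.shape[0], S2.shape[0]
        # Local L2 (Minkowski, p = 2) distance between every pair of spectra.
        D = np.sqrt(((S1[:, None, :] - S2[None, :, :]) ** 2).sum(axis=2))
        # Accumulated cost, allowing only the three non-reversing steps.
        C = np.full((K1, K2), np.inf)
        C[0, 0] = D[0, 0]
        for i in range(K1):
            for j in range(K2):
                if i == 0 and j == 0:
                    continue
                best_prev = min(
                    C[i - 1, j] if i > 0 else np.inf,
                    C[i, j - 1] if j > 0 else np.inf,
                    C[i - 1, j - 1] if i > 0 and j > 0 else np.inf,
                )
                C[i, j] = D[i, j] + best_prev
        # Backtrack from the last element to the first to recover the mapping.
        path, i, j = [(K1 - 1, K2 - 1)], K1 - 1, K2 - 1
        while i > 0 or j > 0:
            steps = [(i - 1, j - 1), (i - 1, j), (i, j - 1)]
            steps = [(a, b) for a, b in steps if a >= 0 and b >= 0]
            i, j = min(steps, key=lambda ab: C[ab])
            path.append((i, j))
        return path[::-1]   # [(i_m, j_m), ...], 0-based index pairs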
In some embodiments, D' = D, i.e., D'[i, j] or D'[j, i] is the local distance between the i-th spectrum of the first spectra and the j-th spectrum of the second spectra. In other embodiments, D' is an accumulated distance gradient matrix ADG, which may be computed by executing the following algorithm:
(i) Compute Gx and Gy, the horizontal and vertical gradient matrices of D. For example, Gx[i,j] may equal D[i,j+1] - D[i,j] and Gy[i,j] may equal D[i+1,j] - D[i,j].
(ii) Compute the matrix G such that G[i,j] = sqrt(Gx[i,j]^2 + Gy[i,j]^2).
(iii) Compute the matrix P such that P[i,j] = arctan(Gy[i,j]/Gx[i,j]).
(iv) Compute the matrices P1, P2, and P3 such that:

P1[i,j] = {P[i,j], 0 < P[i,j] <= 90 or 270 < P[i,j] <= 360; P[i,j] + 180, 90 < P[i,j] <= 270},

P2[i,j] = {P[i,j], 0 < P[i,j] <= 135 or 315 < P[i,j] <= 360; P[i,j] + 180, 135 < P[i,j] <= 315}, and

P3[i,j] = {P[i,j], 0 < P[i,j] <= 180; P[i,j] + 180, 180 < P[i,j] <= 360}.
(v) Compute the matrix ADG such that ADG[i,j] = G[i,j] + min(ADG[i-1,j]*cos(P1[i-1,j]), ADG[i-1,j-1]*cos(P2[i-1,j-1] - 45), ADG[i,j-1]*cos(P3[i,j-1] - 90)) for i > 1 and j > 1. (The first row and column of ADG may have any arbitrary values.)
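The accumulated distance gradient of steps (i)-(v) might be sketched as follows; the angle convention (degrees in [0, 360), obtained with arctan2 rather than a plain arctangent) and the handling of the last row, last column, and borders are illustrative assumptions.

    import numpy as np

    def accumulated_distance_gradient(D: np.ndarray) -> np.ndarray:
        # (i) Horizontal and vertical gradients of the local-distance matrix D.
        Gx = np.diff(D, axis=1, append=D[:, -1:])     # D[i, j+1] - D[i, j]
        Gy = np.diff(D, axis=0, append=D[-1:, :])     # D[i+1, j] - D[i, j]
        # (ii) Gradient magnitude and (iii) gradient direction in degrees.
        G = np.sqrt(Gx ** 2 + Gy ** 2)
        P = np.degrees(np.arctan2(Gy, Gx)) % 360.0
        # (iv) Fold the direction per the piecewise rules for P1, P2, and P3.
        def fold(angle, lo, hi):
            return np.where((angle > lo) & (angle <= hi), angle + 180.0, angle)
        P1 = fold(P, 90.0, 270.0)
        P2 = fold(P, 135.0, 315.0)
        P3 = fold(P, 180.0, 360.0)
        # (v) Accumulate; the first row and column may hold arbitrary values.
        ADG = np.zeros_like(D)
        ADG[0, :] = G[0, :]
        ADG[:, 0] = G[:, 0]
        for i in range(1, D.shape[0]):
            for j in range(1, D.shape[1]):
                ADG[i, j] = G[i, j] + min(
                    ADG[i - 1, j] * np.cos(np.radians(P1[i - 1, j])),
                    ADG[i - 1, j - 1] * np.cos(np.radians(P2[i - 1, j - 1] - 45.0)),
                    ADG[i, j - 1] * np.cos(np.radians(P3[i, j - 1] - 90.0)),
                )
        return ADG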
As shown in Fig. 2, optionally, following mapping step 46, the processor, at a smoothing step 48, smooths the mapping by fitting the mapping to a piecewise linear function including one or more linear segments.
In this regard, reference is now made to Fig. 5, which shows an example mapping 60 and smoothed mapping 62, which includes multiple linear segments, in accordance with some embodiments of the present invention. Per mapping 60, the spectra of the second spectrogram are mapped to corresponding spectra of the first spectrogram. Thus, mapping 60 also maps the time bins of the second spectrogram to the time bins of the first spectrogram. In other words, if the i_m-th spectrum of the first spectrogram and the j_m-th spectrum of the second spectrogram are mapped to one another, then the time bin represented by the i_m-th spectrum and the time bin represented by the j_m-th spectrum are also mapped to one another. By way of example, the horizontal axis in Fig. 5 corresponds to the second spectrogram, i.e., each point in mapping 60 is a point (j_m, i_m) representing a mapping between the i_m-th spectrum of the first spectrogram and the j_m-th spectrum of the second spectrogram.
Typically, to smooth the mapping, the processor first identifies any points 64 in mapping 60 at which the second derivative of the mapping is not within a predefined range (e.g., at which an absolute value of the second derivative is greater than a predefined threshold). For example, the processor may first calculate a local slope at each point in the mapping, by fitting a neighborhood of the point to a line (e.g., using linear regression) and then taking the slope of the line as the local slope. The processor may then calculate the second derivative at each point based on the local slopes.
For example, to calculate the second derivative at the m0-th point (i.e., the point (i_{m0}, j_{m0}) or (j_{m0}, i_{m0})), the processor may first calculate the 2p+1 local slopes {ls_{m0-p}, ..., ls_{m0-1}, ls_{m0}, ls_{m0+1}, ..., ls_{m0+p}} for the (m0-p)-th ... (m0+p)-th pairs of spectra mapped to one another, where p = 1, 2, or more. The processor may then fit a line to the local slopes using linear regression, and then calculate the second derivative as the slope of this line.
In the event that no points 64 are identified, the mapping is smoothed so as to include exactly one line, i.e., a piecewise linear function including exactly one segment, that best fits the points in mapping 60. This line may be calculated, for example, using linear regression.
Otherwise, the processor fits the mapping to a piecewise linear function including multiple linear segments joined to each other at the respective coordinates of the identified points representing the second spectrogram, which in Fig. 5 are the first coordinates of the identified points. For example, Fig. 5 shows a first linear segment 66a joined to a second linear segment 66b at an intersection point 65 having the first coordinate of a point 64, and second linear segment 66b joined to a third linear segment 66c at another intersection point 65 having the first coordinate of another point 64. Typically, the fitting is performed using a linear programming algorithm, which selects intersection points 65 so as to maximize a smoothness of the fitting.
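A simplified sketch of smoothing step 48 follows: local slopes are obtained by linear regression over a small neighborhood, breakpoints are flagged where the estimated second derivative leaves a threshold, and an ordinary least-squares line is fitted per segment. The neighborhood size and threshold are illustrative assumptions, and the linear-programming selection of intersection points described above is not reproduced here.

    import numpy as np

    def smooth_mapping(j_idx: np.ndarray, i_idx: np.ndarray,
                       half_window: int = 5, threshold: float = 0.05):
        # Local slope at each point, by fitting a line to a neighborhood of it.
        n = len(j_idx)
        slopes = np.empty(n)
        for m in range(n):
            lo, hi = max(0, m - half_window), min(n, m + half_window + 1)
            slopes[m] = np.polyfit(j_idx[lo:hi], i_idx[lo:hi], 1)[0]
        # Second derivative estimated from the change in local slope; points
        # where it leaves the allowed range become segment boundaries.
        second = np.gradient(slopes)
        breaks = np.where(np.abs(second) > threshold)[0]
        bounds = [0] + list(breaks) + [n - 1]
        # One least-squares line per segment of the piecewise linear function.
        segments = []
        for a, b in zip(bounds[:-1], bounds[1:]):
            if b - a < 2:
                continue
            slope, intercept = np.polyfit(j_idx[a:b + 1], i_idx[a:b + 1], 1)
            segments.append((j_idx[a], j_idx[b], slope, intercept))
        return segments   # each entry: (j_start, j_end, slope, intercept)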
Finally, as shown in Fig. 2, the processor, at a modifying step 50, modifies digital note file 30 (Fig. 1), based on the mapping (e.g., smoothed mapping 62), so as to increase the alignment of the notes in the digital note file with first audio file 28, i.e., so as to render the digital note file convertible to third audio file 36, which is more similar to the first audio file than is second audio file 32.
In particular, the processor shifts, and adjusts respective durations of, notes in the digital note file based on the mapping. Specifically, for each note, the processor identifies the current time period spanned by the note, identifies a new time period to which this current time period is mapped, and shifts and/or adjusts the duration of the note such that the note occupies the new time period. (In some cases, the new time period may be identical to the current time period, such that no changes in the duration or timing of the note are required.)
For example, for any time t at which the note begins or ends, the processor may first identify the time bin B2, of the second spectrogram, to which t belongs. Subsequently, the processor may identify the time bin B1, of the first spectrogram, that is mapped to B2 or to which B2 is mapped. Next, the processor may calculate t', the new time at which the note is to begin or end, by applying the equation t' = tsB1 + (t - tsB2) * LB1/LB2, where:

tsB1 and tsB2 are the start times of B1 and B2, respectively, and
LB1 and LB2 are the lengths of B1 and B2, respectively.

It is noted that, typically, LB1 = LB2, and t = tsB2 (in which case t' = tsB1) or t = teB2, the end time of B2 (in which case t' = teB1).
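The per-note retiming of modifying step 50 might be sketched as follows, assuming equal-length time bins and a mapping given as an array bin_map in which entry b2 holds the bin of the first spectrogram mapped to bin b2 of the second spectrogram; all names are illustrative.

    import numpy as np

    def remap_time(t: float, bin_map: np.ndarray, bin_len: float) -> float:
        # Identify the bin B2 of the second spectrogram containing t, then the
        # bin B1 of the first spectrogram mapped to it.
        b2 = min(int(t / bin_len), len(bin_map) - 1)
        b1 = int(bin_map[b2])
        ts_b1, ts_b2 = b1 * bin_len, b2 * bin_len
        # t' = tsB1 + (t - tsB2) * LB1/LB2, with LB1 = LB2 = bin_len.
        return ts_b1 + (t - ts_b2)

    def remap_note(start: float, end: float, bin_map: np.ndarray, bin_len: float):
        # Shift and stretch/compress the note so that it occupies the new time
        # period to which its current time period is mapped.
        return remap_time(start, bin_map, bin_len), remap_time(end, bin_map, bin_len)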
For example, Fig. 5 shows an example note 68a spanning a time period (t0, t1). Per the mapping, this time period is mapped to a slightly longer time period (t0', t1'), t0' being slightly greater than t0. Hence, note 68a is lengthened and shifted forward, such that in modified digital note file 34 (Fig. 1), note 68a is longer and occurs later in the musical piece.
Fig. 5 also shows another example note 68b, which spans a time period (t2, t3). As opposed to first linear segment 66a, whose slope is greater than one, third linear segment 66c has a slope less than one; hence, note 68b is shortened so as to occupy, in modified digital note file 34, a shorter time period (t2’, t3’).
It is noted that second linear segment 66b, which is horizontal, corresponds to notes in the digital note file that are not played in the audio file. Any notes occurring between the beginning and end of second linear segment 66b are removed from the digital note file. As a result of this removal, t2’ < t2, i.e., note 68b occurs earlier in the modified digital note file.
It will be appreciated by persons skilled in the art that the present invention is not limited to what has been particularly shown and described hereinabove. Rather, the scope of the present invention includes both combinations and subcombinations of the various features described hereinabove, as well as variations and modifications thereof that are not in the prior art, which would occur to persons skilled in the art upon reading the foregoing description.

Claims

1. A system, comprising: a memory; and one or more processors configured to cooperatively carry out a process that includes: loading a first audio file and a digital note file from the memory, computing a first spectrogram of the first audio file, computing a second spectrogram of a second audio file generated from the digital note file, computing a mapping between first spectra of the first spectrogram and respective second spectra of the second spectrogram that minimizes a distance measure under predefined constraints, and based on the mapping, shifting, and adjusting respective durations of, notes in the digital note file so as to increase an alignment of the notes with the first audio file.
2. The system according to claim 1, wherein the digital note file includes a Musical Instrument Digital Interface (MIDI) file.
3. The system according to claim 1, wherein the first spectrogram includes a first short-time Fourier transform (STFT) spectrogram and the second spectrogram includes a second STFT spectrogram.
4. The system according to any one of claims 1-3, wherein computing the first spectrogram and second spectrogram includes: computing a first time-localized frequency transform of the first audio file and a second time-localized frequency transform of the second audio file, filtering each of the first time-localized frequency transform and second time-localized frequency transform so as to emphasize frequencies of musical notes, and computing the first spectrogram and second spectrogram from the filtered first time-localized frequency transform and filtered second time-localized frequency transform, respectively.
5. The system according to claim 4, wherein filtering each of the first time-localized frequency transform and second time-localized frequency transform includes filtering each of the first time-localized frequency transform and second time-localized frequency transform using a bank of Gaussian filters, each of the filters being centered at a respective one of the frequencies.
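Purely as an illustration of the filtering recited in claims 4 and 5, the sketch below multiplies an STFT magnitude spectrogram by a bank of Gaussian filters, one per equal-tempered note frequency. The use of scipy, the piano note range, the log-frequency Gaussian shape, and the bandwidth are assumptions and not part of the claims.

```python
import numpy as np
from scipy.signal import stft

def note_filtered_spectrogram(audio, fs, nperseg=4096, sigma_semitones=0.25):
    """Compute an STFT magnitude spectrogram and emphasize the frequencies of
    musical notes with a bank of Gaussian filters, one per equal-tempered
    note (assumed here to span MIDI notes 21-108, i.e. a piano keyboard)."""
    freqs, times, Z = stft(audio, fs=fs, nperseg=nperseg)
    mag = np.abs(Z)                                          # (n_freqs, n_frames)

    midi = np.arange(21, 109)                                # A0 .. C8
    note_freqs = 440.0 * 2.0 ** ((midi - 69) / 12.0)

    # Each filter is a Gaussian (here, in log-frequency) centered on one note.
    log_f = np.log2(np.maximum(freqs, 1e-6))[None, :]        # (1, n_freqs)
    log_c = np.log2(note_freqs)[:, None]                     # (n_notes, 1)
    sigma = sigma_semitones / 12.0                           # width in octaves
    bank = np.exp(-0.5 * ((log_f - log_c) / sigma) ** 2)     # (n_notes, n_freqs)

    # Project the magnitude spectrogram onto the note filters.
    return bank @ mag, note_freqs, times
```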
6. The system according to any one of claims 1-3, wherein the distance measure is a function of respective local distances between pairs of spectra mapped to one another.
7. The system according to claim 6, wherein each of the local distances is a Minkowski distance.
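For reference, a Minkowski local distance of order p between two spectra, as recited in claims 7 and 17, may be computed as sketched below; the default order p = 2 is an assumption.

```python
import numpy as np

def minkowski_distance(spectrum_a, spectrum_b, p=2):
    """Minkowski distance of order p between two spectra
    (p = 2 gives the Euclidean distance, p = 1 the Manhattan distance)."""
    diff = np.abs(np.asarray(spectrum_a, float) - np.asarray(spectrum_b, float))
    return float(np.sum(diff ** p) ** (1.0 / p))
```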
8. The system according to any one of claims 1-3, wherein computing the mapping includes: computing a K1 x K2 or K2 x K1 matrix D', which is equal to or derived from another matrix of which an (i, j) or (j, i) element is a local distance between an i-th spectrum of the first spectra and a j-th spectrum of the second spectra for 1 ≤ i ≤ K1 and 1 ≤ j ≤ K2, K1 and K2 being respective numbers of the first spectra and the second spectra, computing a non-reversing path from D'[1, 1] to D'[K1, K2] or to D'[K2, K1] that minimizes a sum of elements of D' through which the path passes, and for each of the elements of D' through which the path passes, mapping a pair of spectra corresponding to the element to one another.
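The non-reversing minimum-cost path of claim 8 resembles a dynamic-time-warping accumulation. The sketch below illustrates one generic dynamic-programming search over a local-distance matrix D, assuming standard step constraints (one step along either axis or along both); the claims do not specify these particular constraints.

```python
import numpy as np

def min_cost_path(D):
    """Find a monotonic (non-reversing) path from D[0, 0] to D[-1, -1] that
    minimizes the sum of the matrix elements it passes through."""
    K1, K2 = D.shape
    cost = np.full((K1, K2), np.inf)
    cost[0, 0] = D[0, 0]
    for i in range(K1):
        for j in range(K2):
            if i == 0 and j == 0:
                continue
            best_prev = min(
                cost[i - 1, j] if i > 0 else np.inf,                 # advance in the first spectrogram
                cost[i, j - 1] if j > 0 else np.inf,                 # advance in the second spectrogram
                cost[i - 1, j - 1] if i > 0 and j > 0 else np.inf,   # advance in both
            )
            cost[i, j] = D[i, j] + best_prev

    # Backtrack to recover the path, i.e. the (i, j) pairs of spectra mapped to one another.
    path, i, j = [(K1 - 1, K2 - 1)], K1 - 1, K2 - 1
    while (i, j) != (0, 0):
        candidates = []
        if i > 0 and j > 0:
            candidates.append((cost[i - 1, j - 1], i - 1, j - 1))
        if i > 0:
            candidates.append((cost[i - 1, j], i - 1, j))
        if j > 0:
            candidates.append((cost[i, j - 1], i, j - 1))
        _, i, j = min(candidates)
        path.append((i, j))
    return cost[-1, -1], path[::-1]
```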
9. The system according to any one of claims 1-3, wherein the process further includes, prior to modifying the digital note file, smoothing the mapping by fitting the mapping to a piecewise linear function including one or more linear segments.
10. The system according to claim 9, wherein the process further includes identifying one or more points in the mapping at which a second derivative of the mapping is not within a predefined range, and wherein the piecewise linear function includes multiple linear segments joined to each other at respective coordinates of the identified points representing the second spectrogram.
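One plausible reading of the smoothing of claims 9 and 10 is sketched below: breakpoints are placed where a discrete second derivative of the mapping leaves a predefined range, and a straight line is fitted to each stretch between breakpoints. The breakpoint test, the threshold range, and the least-squares fit are illustrative assumptions.

```python
import numpy as np

def piecewise_linear_smooth(x, y, second_deriv_range=(-0.5, 0.5)):
    """Fit the mapping y(x) to a piecewise linear function whose segments are
    joined where the discrete second derivative of y falls outside
    second_deriv_range. Returns a list of (slope, intercept, x_start, x_end)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    d2 = np.gradient(np.gradient(y, x), x)

    lo, hi = second_deriv_range
    breakpoints = [0] + [i for i in range(1, len(x) - 1)
                         if not (lo <= d2[i] <= hi)] + [len(x) - 1]

    segments = []
    for a, b in zip(breakpoints[:-1], breakpoints[1:]):
        if b <= a:
            continue
        slope, intercept = np.polyfit(x[a:b + 1], y[a:b + 1], 1)  # least-squares line
        segments.append((slope, intercept, x[a], x[b]))
    return segments
```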
11. A method, comprising: computing a first spectrogram of a first audio file; computing a second spectrogram of a second audio file generated from a digital note file; computing a mapping between first spectra of the first spectrogram and respective second spectra of the second spectrogram that minimizes a distance measure under predefined constraints; and based on the mapping, shifting, and adjusting respective durations of, notes in the digital note file so as to increase an alignment of the notes with the first audio file.
12. The method according to claim 11, wherein the digital note file includes a Musical Instrument Digital Interface (MIDI) file.
13. The method according to claim 11, wherein the first spectrogram includes a first short-time Fourier transform (STFT) spectrogram and the second spectrogram includes a second STFT spectrogram.
14. The method according to any one of claims 11-13, wherein computing the first spectrogram and second spectrogram comprises: computing a first time-localized frequency transform of the first audio file and a second time-localized frequency transform of the second audio file; filtering each of the first time-localized frequency transform and second time-localized frequency transform so as to emphasize frequencies of musical notes; and computing the first spectrogram and second spectrogram from the filtered first time-localized frequency transform and filtered second time-localized frequency transform, respectively.
15. The method according to claim 14, wherein filtering each of the first time-localized frequency transform and second time-localized frequency transform comprises filtering each of the first time-localized frequency transform and second time-localized frequency transform using a bank of Gaussian filters, each of the filters being centered at a respective one of the frequencies.
16. The method according to any one of claims 11-13, wherein the distance measure is a function of respective local distances between pairs of spectra mapped to one another.
17. The method according to claim 16, wherein each of the local distances is a Minkowski distance.
18. The method according to any one of claims 11-13, wherein computing the mapping comprises: computing a K1 x K2 or K2 x K1 matrix D', which is equal to or derived from another matrix of which an (i, j) or (j, i) element is a local distance between an i-th spectrum of the first spectra and a j-th spectrum of the second spectra for 1 ≤ i ≤ K1 and 1 ≤ j ≤ K2, K1 and K2 being respective numbers of the first spectra and the second spectra; computing a non-reversing path from D'[1, 1] to D'[K1, K2] or to D'[K2, K1] that minimizes a sum of elements of D' through which the path passes; and for each of the elements of D' through which the path passes, mapping a pair of spectra corresponding to the element to one another.
19. The method according to any one of claims 11-13, further comprising, prior to modifying the digital note file, smoothing the mapping by fitting the mapping to a piecewise linear function including one or more linear segments.
20. The method according to claim 19, further comprising identifying one or more points in the mapping at which a second derivative of the mapping is not within a predefined range, wherein the piecewise linear function includes multiple linear segments joined to each other at respective coordinates of the identified points representing the second spectrogram.
21. A computer software product comprising a tangible non-transitory computer-readable medium in which program instructions are stored, which instructions, when read by a processor, cause the processor to: compute a first spectrogram of a first audio file, compute a second spectrogram of a second audio file generated from a digital note file, compute a mapping between first spectra of the first spectrogram and respective second spectra of the second spectrogram that minimizes a distance measure under predefined constraints, and based on the mapping, shift, and adjust respective durations of, notes in the digital note file so as to increase an alignment of the notes with the first audio file.
PCT/IB2022/060330 2021-11-03 2022-10-27 Aligning digital note files with audio WO2023079419A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US202163274977P 2021-11-03 2021-11-03
US63/274,977 2021-11-03

Publications (1)

Publication Number Publication Date
WO2023079419A1 true WO2023079419A1 (en) 2023-05-11

Family

ID=86240853

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/IB2022/060330 WO2023079419A1 (en) 2021-11-03 2022-10-27 Aligning digital note files with audio

Country Status (1)

Country Link
WO (1) WO2023079419A1 (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040182229A1 (en) * 2001-06-25 2004-09-23 Doill Jung Method and apparatus for designating performance notes based on synchronization information
US20120198317A1 (en) * 2011-02-02 2012-08-02 Eppolito Aaron M Automatic synchronization of media clips
CN103354092A (en) * 2013-06-27 2013-10-16 天津大学 Audio music-score comparison method with error detection function

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22889533

Country of ref document: EP

Kind code of ref document: A1