WO2010043258A1 - Method for analyzing a digital music audio signal - Google Patents

Method for analyzing a digital music audio signal

Info

Publication number
WO2010043258A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
music audio
music
algorithm
window
Prior art date
Application number
PCT/EP2008/063911
Other languages
English (en)
French (fr)
Inventor
Lars FÄRNSTRÖM
Riccardo Leonardi
Nicolas Scaringella
Original Assignee
Museeka S.A.
Priority date
Filing date
Publication date
Application filed by Museeka S.A. filed Critical Museeka S.A.
Priority to EP08875184A priority Critical patent/EP2342708B1/de
Priority to CA2740638A priority patent/CA2740638A1/en
Priority to BRPI0823192A priority patent/BRPI0823192A2/pt
Priority to CN2008801315891A priority patent/CN102187386A/zh
Priority to PCT/EP2008/063911 priority patent/WO2010043258A1/en
Priority to EA201170559A priority patent/EA201170559A1/ru
Priority to JP2011531363A priority patent/JP2012506061A/ja
Publication of WO2010043258A1 publication Critical patent/WO2010043258A1/en

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10H - ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H1/00 - Details of electrophonic musical instruments
    • G10H1/36 - Accompaniment arrangements
    • G10H1/38 - Chord
    • G10H1/383 - Chord detection and/or recognition, e.g. for correction, or automatic bass generation
    • G10H2210/00 - Aspects or methods of musical processing having intrinsic musical character, i.e. involving musical theory or musical parameters or relying on musical knowledge, as applied in electrophonic musical tools or instruments
    • G10H2210/031 - Musical analysis, i.e. isolation, extraction or identification of musical elements or musical parameters from a raw acoustic signal or from an encoded audio signal
    • G10H2210/066 - Musical analysis for pitch analysis as part of wider processing for musical purposes, e.g. transcription, musical performance evaluation; Pitch recognition, e.g. in polyphonic sounds; Estimation or use of missing fundamental
    • G10H2210/081 - Musical analysis for automatic key or tonality recognition, e.g. using musical rules or a knowledge base
    • G10H2250/00 - Aspects of algorithms or signal processing methods without intrinsic musical character, yet specifically adapted for or used in electrophonic musical processing
    • G10H2250/311 - Neural networks for electrophonic musical instruments or musical processing, e.g. for musical recognition or control, automatic composition or improvisation

Definitions

  • the invention relates to the automatic analysis of a music audio signal, preferably a digital music audio signal.
  • the present invention relates to a music audio representation method and apparatus for analyzing a music audio signal in order to extract a set of characteristics representative of the informative content of the audio music signal, according to the preambles of claims 1 and 17, respectively.
  • Pitch - Perceived fundamental frequency of a sound. A pitch is associated with a single (possibly isolated) sound and is instantaneous (the percept lasts more or less as long as the sound itself, typically 200 to 500 ms in music signals).
  • the pitches over the register of a piano are associated with their corresponding fundamental frequencies (in Hertz) assuming a standard tuning, i.e. the pitch A4 corresponds to a fundamental frequency of 440 Hz.
  • Pitch Class - A set of all pitches that are a whole number of octaves apart, e.g. the pitch class C consists of the Cs in all octaves.
  • Chord - In music theory, a chord is two or more different pitches that occur simultaneously; in this document, single pitches may also be referred to as chords (see figures 1a and 1b for a sketch).
  • Chord Root - The note or pitch upon which a chord is perceived or labelled as being built or hierarchically centred (see figures 1a and 1b for a sketch).
  • Chord Family - A chord family is a set of chords that share a number of characteristics, including (see figures 1a and 1b for an illustration):
  • Tonality - A system of music in which pitches are hierarchically organized (around a tonal centre) and tend to be perceived as referring to each other; notice that the percept of tonality is not instantaneous and requires a sufficiently long tonal context.
  • Tonal context - A combination of chords implying a particular tonality percept.
  • Key - Ordered set of pitch classes, i.e. the union of a tonic and a mode (see figures 2a and 2b for an illustration).
  • Tonal Centre or Tonic - The dominating pitch class in a particular tonal context, upon which all other pitches are hierarchically referenced (see figures 2a and 2b for an illustration).
  • Mode - Ordered set of intervals (see figures 2a and 2b for an illustration).
  • Transposition - The process of moving a collection of pitches up or down in pitch by a constant interval.
  • Modulation - The process of changing from one tonal centre to another.
  • Chromatic scale - The set of all 12 pitch classes.
  • Beat - Basic time unit of a piece of music (see figure 3 for an illustration). Measure or Bar - Segment of time defined as a recurring sequence of stressed and unstressed beats; figure 3 shows an audio signal and detected onset positions, where the higher the amplitude associated with an onset, the higher its weight in the detected metrical hierarchy (i.e. bar positions have the highest weights, beats have intermediate weights, unmetrical onsets have the lowest weights).
  • Frame - A short slice of the audio signal, typically a 20 to 50 ms segment.
  • PCP - Pitch Class Profiles.
  • the PCP/Chroma approach is a general low-level feature extraction method that measures the strength of pitch classes in the audio music signal.
  • the intensity of each of the twelve semitones of the tonal scale is measured.
  • Such an implementation consists in mapping some time/frequency representation to a time/pitch-class representation; in other words, the spectrum peaks (or spectrum bins) are assigned to the closest pitch of the chromatic scale.
  • Some PCP algorithms of this type decrease the quantization level to less than a semitone.
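The PCP/Chroma mapping described above can be sketched in a few lines. The function name, frequency range and normalisation below are illustrative assumptions, not the patent's implementation:

```python
import numpy as np

def pitch_class_profile(power_spectrum, sample_rate, n_fft, f_ref=440.0,
                        f_min=55.0, f_max=2000.0):
    """Map FFT power-spectrum bins to the 12 pitch classes (a minimal PCP sketch).

    Each bin between f_min and f_max is assigned to the closest pitch of the
    chromatic scale, assuming standard tuning (A4 = f_ref Hz); index 0 of the
    returned profile therefore corresponds to pitch class A.
    """
    pcp = np.zeros(12)
    freqs = np.arange(n_fft // 2 + 1) * sample_rate / n_fft  # bin centre frequencies
    for k, f in enumerate(freqs):
        if f < f_min or f > f_max:
            continue
        # distance from the reference pitch in semitones, rounded to the nearest pitch
        semitone = int(round(12.0 * np.log2(f / f_ref)))
        pcp[semitone % 12] += power_spectrum[k]
    total = pcp.sum()
    return pcp / total if total > 0 else pcp
```

Finer-than-semitone variants, as mentioned above, would simply round to a fraction of a semitone instead of a whole one.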
  • the template-based approach to high-level musical feature extraction is, however, restricted by the choice of templates.
  • state-of-the-art algorithms use templates for the Major key and for the Minor key (one such template for each of the 12 possible pitch classes, i.e. 24 templates in total).
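The template-based key detection mentioned here can be sketched as correlation against 24 rotated templates (12 Major + 12 Minor). The patent does not give template values; the well-known Krumhansl-Kessler profiles are used purely for illustration:

```python
import numpy as np

# Krumhansl-Kessler key profiles (major and minor), one template per mode;
# rotating each by 0..11 semitones yields the 24 key templates.
MAJOR = np.array([6.35, 2.23, 3.48, 2.33, 4.38, 4.09, 2.52, 5.19, 2.39, 3.66, 2.29, 2.88])
MINOR = np.array([6.33, 2.68, 3.52, 5.38, 2.60, 3.53, 2.54, 4.75, 3.98, 2.69, 3.34, 3.17])

def estimate_key(pcp):
    """Return (pitch_class, mode) maximising the correlation of a 12-dim PCP
    vector with the 24 rotated key templates."""
    best, best_score = None, -np.inf
    for mode, profile in (("major", MAJOR), ("minor", MINOR)):
        for root in range(12):
            template = np.roll(profile, root)  # template for this key's root
            score = np.corrcoef(pcp, template)[0, 1]
            if score > best_score:
                best, best_score = (root, mode), score
    return best
```

This illustrates the restriction noted above: the detector can only report whichever of the 24 templates correlates best, whatever the actual tonal content is.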
  • the object of the present invention is to develop a feature extraction algorithm able to compute a musicologically valid description of the pitch content of the audio signal of a music piece. Moreover, it is an object of the present invention to provide an algorithm for the detection of the tonal centre of a music piece in audio format, and to provide a set of features that encode a transposition-invariant representation of the distribution of pitches in a music piece and their correlations.
  • a further object of the present invention is to map spectral observations directly to a chord space without using an intermediate note identification unit. It is another object of the present invention to allow following the tonal centre along the course of a piece of music if a modulation occurs. It is a specificity of the tonal centre following algorithm to take into account a sufficiently long time scale, so as to avoid tracking chord changes that occur at a faster rate than modulations. It is an object of the present invention to take into account musical accentuation - more specifically, metrical accentuation - in the process of detecting the tonal centre of a music piece.
  • FIG. 3 shows a graphical representation of metrical levels
  • FIG. 4 a block diagram of the music audio analysis method according to the present invention
  • FIG. 5a shows a block diagram of a first algorithm of the music audio analysis method according to the present invention
  • FIG. 5b shows the music audio signal and the plurality of vectors resulting from the application of the first algorithm to the audio music signal
  • FIG. 6a shows a block diagram of a first way of training a step of the first algorithm according to the present invention
  • FIG. 6b shows a block diagram of a second way of training a step of the first algorithm according to the present invention
  • FIG. 7 shows a block diagram of a second algorithm for the music audio analysis method according to the present invention
  • FIG. 8 shows a block diagram of the music audio analysis apparatus according to the present invention.
  • FIG. 9 shows a graphical representation of a moving average when applied to a power spectrum of the audio signal of Figure 3.
  • the digital music audio signal 2 can be an extract of an audio signal representing a song, or a complete version of a song.
  • the method 1 comprises the steps of: a) applying a first algorithm 4 to the music audio signal 2 in order to extract first data 5 representative of the tonal context of the music audio signal 2, and b) applying a second algorithm 6 to said first data 5 in order to provide second data 7 representative of the tonal centre contained in the first data 5.
  • by tonality is meant a combination of chord roots and chord families hierarchically organized around a tonal centre, i.e. a combination of chord roots and chord families whose perceived significance is measured relative to a tonal centre.
  • the step a) of the method 1, i.e. the first algorithm 4, is able to extract the first data 5 representing the combination of chord roots and chord families observed in the digital music audio signal 2; that is, the first data 5 contain the tonal context of the digital music audio signal 2.
  • the step a) of the method 1, i.e. the first algorithm 4 does not aim explicitly at detecting chord roots and chord families contained in the digital music audio signal 2. On the contrary, it aims at obtaining an abstract, and possibly redundant, representation correlated with the chord roots and chord families observed in the digital music audio signal 2.
  • step b) of the method 1, i.e. the second algorithm 6, is able to process the first data 5 in order to provide second data 7 which represent the tonal centre.
  • the method 1 further comprises the step of: c) applying a third algorithm 8 to the first data 5 as a function of the second data 7 in order to provide third data 9, which are the normalized version of the first data 5.
  • the first algorithm 4 comprises the steps of: a1) identifying 10 a sequence of note onsets in the music audio signal 2, in order to define the time positions of a plurality of peaks p1, p2, p3, ..., pi, where "i" is an index with 1 ≤ i ≤ N, N being the number of samples of the digital audio signal 2 and, in practice, i << N; a2) dividing the audio music signal 2 into a plurality of audio segments s-on-1, s-on-2, s-on-3, ..., s-on-i, each audio segment containing one peak p1, p2, p3, ..., pi; a3) applying a frequency analysis to each audio segment s-on-1, s-on-2, s-on-3, ..., s-on-i in order to obtain a plurality of spectrum segments sp-1, sp-2, sp-3, ..., sp-i.
  • the first data 5 comprise a plurality of vectors v1, v2, v3, ..., vi, wherein each vector of the plurality of vectors v1, v2, v3, ..., vi is associated with the respective audio segment s-on-1, s-on-2, s-on-3, ..., s-on-i.
  • each vector v1, v2, v3, ..., vi has a dimension equal to the twelve pitches (A to G#) times a predefined number "n" of chord types.
  • the predefined number "n" of chord types can be set equal to five so as to represent, for example, "pitches", "major chords", "minor chords", "diminished chords" and "augmented chords".
  • step al) of the first algorithm 4 is performed by an onset detection algorithm in order to detect the attacks of musical events of the audio signal 2.
  • each peak p1, p2, p3, ..., pi represents the attack of a musical event in the respective audio segment s-on-1, s-on-2, s-on-3, ..., s-on-i.
  • the onset detection algorithm 10 can be implemented as described in [J.P. Bello, L. Daudet, S. Abdallah, C. Duxbury, M. Davies, M. Sandler, " A tutorial on Onset Detection in Music Signals", in IEEE Transactions on Speech and Audio Processing, 2005].
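A minimal spectral-flux detector, one of the families surveyed in the cited Bello et al. tutorial, could serve as step a1); all parameter values below are illustrative assumptions:

```python
import numpy as np

def spectral_flux_onsets(x, sr, frame=1024, hop=512, threshold=1.5):
    """Detect note onsets with a spectral-flux novelty function; returns onset
    times in seconds. A sketch only - the patent does not prescribe a detector."""
    window = np.hanning(frame)
    n_frames = 1 + (len(x) - frame) // hop
    mags = np.array([np.abs(np.fft.rfft(window * x[i * hop:i * hop + frame]))
                     for i in range(n_frames)])
    # half-wave-rectified magnitude increase between consecutive frames
    flux = np.maximum(np.diff(mags, axis=0), 0.0).sum(axis=1)
    onsets = []
    for i in range(1, len(flux) - 1):
        local_mean = flux[max(0, i - 8):i + 8].mean()
        # peak picking: a local maximum well above the local average flux
        if flux[i] > flux[i - 1] and flux[i] >= flux[i + 1] and flux[i] > threshold * local_mean:
            onsets.append(i * hop / sr)
    return onsets
```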
  • step a2) of the first algorithm 4 divides the audio music signal 2 into the plurality of audio segments s-on-1, s-on-2, s-on-3, ..., s-on-i, each audio segment s-on-1, s-on-2, s-on-3, ..., s-on-i having its own duration "T".
  • step a3) of the first algorithm 4 advantageously applies the frequency analysis to each audio segment s-on-1, s-on-2, s-on-3, ..., s-on-i only during a predetermined sub-duration "t", wherein the sub-duration "t" is less than the duration "T".
  • in other words, the audio segments s-on-1, s-on-2, s-on-3, ..., s-on-i are analysed in frequency only during the sub-duration "t", even if they extend beyond it.
  • prefixed sub-duration "t" can be set manually by the user.
  • the prefixed sub-duration "t" is within a range from 250 to 350 msec.
  • when the duration "T" of an audio segment s-on-1, s-on-2, s-on-3, ..., s-on-i is longer than the predefined sub-duration "t", i.e. more than 250-350 msec, only the data contained in the sub-duration "t" are considered, while the rest of the segment is assumed to contain irrelevant data and is therefore disregarded.
  • conversely, when the duration "T" is shorter than the sub-duration "t", the frequency analysis is limited to the smaller time interval, i.e. the duration "T": the frequency analysis of each audio segment s-on-1, s-on-2, s-on-3, ..., s-on-i is then performed only on the music samples occurring during the duration "T".
  • the frequency analysis, applied during step a3), is performed, in the preferred embodiment, by a D.F.T. (Discrete Fourier Transform).
  • during step a3) a further operation can also be performed, in which a function that reduces the uncertainty of the time-frequency representation of the audio signal 2 is applied, preferably an apodization function such as a Hanning window.
  • the length of the Hanning window equals the length "T" of the audio segment s-on-1, s-on-2, s-on-3, ..., s-on-i.
  • the apodization function is applied to each audio segment s-on-1, s-on-2, s-on-3, ..., s-on-i by multiplying it, on a sample-by-sample basis, with the audio data of the corresponding segment prior to applying the frequency analysis performed by the D.F.T.
  • a further reason for using the apodization function is the attenuation of the musical event attacks p1, p2, p3, ..., pi, since they are located around the boundaries of the apodization window. In this way an attenuated version of the musical event attacks p1, p2, p3, ..., pi is obtained.
  • the power spectrum is computed with the D.F.T. or any of its fast implementations, for example the F.F.T. (Fast Fourier Transform). It is to be noted that, in the case of using the F.F.T., the choice of the sub-duration "t" allows for controlling the frequency resolution of the FFT, i.e. the choice of the sub-duration "t" is such that the length in samples of the resulting segment equals a power of two.
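Steps a2)-a3) - segmenting at the onsets, truncating to the sub-duration "t", Hanning windowing and power-spectrum computation - can be sketched as follows. The power-of-two truncation reflects the remark above; the function and parameter names are hypothetical:

```python
import numpy as np

def segment_power_spectra(x, sr, onset_times, t_max=0.3):
    """For each onset, analyse at most t_max seconds of signal (cf. step a3).

    The segment is truncated to the largest power-of-two length not exceeding
    t_max seconds (or the distance to the next onset, if shorter), multiplied
    sample-by-sample by a Hanning window, and passed to the FFT.
    """
    spectra = []
    onsets = [int(t * sr) for t in onset_times] + [len(x)]
    for start, nxt in zip(onsets[:-1], onsets[1:]):
        length = min(nxt - start, int(t_max * sr))
        length = 2 ** int(np.log2(length))  # power of two controls FFT resolution
        seg = x[start:start + length] * np.hanning(length)
        spectra.append(np.abs(np.fft.rfft(seg)) ** 2)
    return spectra
```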
  • the computation network 12 can be implemented, preferably, with a trained machine-learning algorithm.
  • the trained machine-learning algorithm consists of a Multi-layer Perceptron (MLP).
  • the task of the Multi-layer Perceptron is to estimate the posterior probabilities of each combination of chord family (i.e. a chord type) and chord root (i.e. a pitch class), given the spectrum segments sp-1, sp-2, sp-3, ..., sp-i.
  • the Multi-layer Perceptron is trained in two steps:
  • 1st step: supervised training using a first set 13 of training data built upon a set of known isolated chords, for which a first ground-truth mapping can be established from the corresponding spectra of said plurality of segments sp-1, sp-2, sp-3, ..., sp-i to chord families and chord roots.
  • 2nd step: unsupervised training using a second set 14 of training data comprising a large set of music pieces, in order to adapt the set of weights of the trained machine-learning algorithm obtained after the 1st step to the variety of mixtures of instruments encountered in real polyphonic music.
  • the trained machine-learning algorithm 12 is trained in two steps: a first supervised training with a few hand-labelled training data and a subsequent unsupervised training with a larger set of unlabelled training data.
  • the set of hand labelled training data consists of isolated chords saved as MIDI files.
  • the set of chords should cover each considered chord type (Major, Minor, Diminished, Augmented, ...), each pitch class (C, C#, D, ...) and should cover a number of octaves.
  • a large variety of audio training data is created from these MIDI files by using a variety of MIDI instruments. These audio examples, together with their pitch class and chord type, are used to train the machine-learning algorithm 12, which is set to produce from the ground truth a single output per "pitch class / chord type" pair.
  • the training of the various weights of the machine-learning algorithm is performed by standard stochastic gradient descent. Once such training has been achieved, at the end of this 1st training step, a first preliminary mapping from any input spectral segment sp-1, sp-2, sp-3, ..., sp-i to chord families can be produced.
  • the training of the trained machine-learning algorithm 12 needs to be refined by using the data from a larger set of music pieces.
  • the machine-learning algorithm 12 is then trained in an unsupervised fashion.
  • the initially trained machine-learning algorithm 12 (after the 1st step) is cascaded with a mirrored version of itself, which uses as initial weights the same weights of the trained machine-learning network after the 1st step (so as to operate some sort of inversion of the corresponding operator, were it linear).
  • the machine-learning algorithm 12 (were it a linear operator) would achieve a projection of the high-dimensional input data (the spectral segments) into a low-dimensional space corresponding to the chord families. Its mirrored version attempts to go from the low-dimensional chord features back to the initial high-dimensional spectral peak representation.
  • the initial setting of the cascaded algorithm adopts initially the transposed set of weights of the training engine algorithm.
  • This training approach is reminiscent of the training of auto-encoder networks.
  • the initialisation of the network with a supervised strategy ensures an initial set of weights that is consistent with the physical meaning of a low-level representation in terms of chord families.
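The role of the computation network 12 - estimating, for each spectrum segment, posterior probabilities over the 12 pitch classes x 5 chord families - can be sketched with a one-hidden-layer MLP. Layer sizes and the random initialisation are illustrative; the patent does not fix an architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

class ChordMLP:
    """One-hidden-layer MLP mapping a spectrum segment to posterior
    probabilities over 12 pitch classes x 5 chord families (60 outputs).
    Sizes and initialisation are illustrative only."""

    def __init__(self, n_in, n_hidden=64, n_out=12 * 5):
        self.W1 = rng.normal(0.0, 0.1, (n_in, n_hidden))
        self.b1 = np.zeros(n_hidden)
        self.W2 = rng.normal(0.0, 0.1, (n_hidden, n_out))
        self.b2 = np.zeros(n_out)

    def forward(self, spectrum):
        h = np.tanh(spectrum @ self.W1 + self.b1)   # hidden layer
        logits = h @ self.W2 + self.b2
        e = np.exp(logits - logits.max())
        return e / e.sum()  # softmax: non-negative outputs summing to 1
```

The softmax output layer is what makes the outputs interpretable as posterior probabilities over the chord root / chord family pairs; training by stochastic gradient descent, as described above, would adjust W1, b1, W2, b2.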
  • the first algorithm 4 may comprise the further step a5) of filtering, after the D.F.T. step a3).
  • Such filtering step a5), also called peak detection 15, is an optional step of the method 1.
  • the filtering step a5) filters the plurality of spectrum segments sp-1, sp-2, sp-3, ..., sp-i generated by the block 11 with a moving average, in order to emphasize the peaks p1', p2', p3', ..., pi' in each of said plurality of spectrum segments sp-1, sp-2, sp-3, ..., sp-i.
  • a moving average 20, typically operating over the power spectrum 21 resulting from the frequency analysis, is computed, and the spectral components having power below this moving average are zeroed.
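The moving-average peak emphasis of step a5) can be sketched as follows (the window width is an illustrative assumption):

```python
import numpy as np

def emphasize_peaks(power_spectrum, width=21):
    """Zero every spectral component whose power is below its moving average,
    keeping only the local peaks (cf. filtering step a5)."""
    kernel = np.ones(width) / width
    moving_avg = np.convolve(power_spectrum, kernel, mode="same")
    return np.where(power_spectrum > moving_avg, power_spectrum, 0.0)
```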
  • the music audio analysis method 1 comprises, before the computing step a4), a further step of decorrelating, also called whitening 16.
  • the plurality of spectrum segments sp-1', sp-2', sp-3', ..., sp-i' is de-correlated with reference to a predetermined database 19 (Figure 8) of audio segment spectra in order to provide a plurality of decorrelated spectrum segments sp-1", sp-2", sp-3", ..., sp-i".
  • the second algorithm 6 of the music audio analysis method 1 comprises the steps of: b1) providing a first window "w1" having a first prefixed duration T1 and containing a first group "g1" of the plurality of vectors composing the first data 5, and b2) processing said first group "g1" of vectors contained in said first window "w1" to estimate a first tonal context Tc1 representative of the local tonal centre contained in said first window "w1".
  • the first prefixed duration T1 of said first window "w1" is much longer than the sub-duration "t" of each of the plurality of audio segments s-on-1, s-on-2, s-on-3, ..., s-on-i.
  • the second algorithm 6 comprises the further steps of: b3) providing a second window "w2", being a shifted version of said first window "w1", said second window "w2" having a second prefixed duration T2 and comprising a second group "g2" of the plurality of vectors; b4) processing said second group "g2" of vectors contained in said second window "w2" to estimate a second tonal context Tc2 representative of the local tonal centre contained in said second window "w2"; b5) processing the tonal context Tc1 of said first window "w1" and the tonal context Tc2 of said second window "w2" in order to generate said second data 7, representative of the evolution of the tonal centre of said first data 5.
  • the second window "w2" is shifted by a prefixed duration Ts with respect to said temporal duration T1 of the first window "w1".
  • the second prefixed duration T2 can vary in the range between T1-Ts and the first prefixed duration T1.
  • the second prefixed duration T2 is much longer than the sub-duration "t".
  • the prefixed time Ts is less than the first prefixed duration T1, so that the first group g1 of vectors and the second group g2 of vectors overlap each other.
  • while chords typically change with musical bars - or even faster, at the beat level - tonality requires a longer time duration to be perceived.
  • the first prefixed duration T1 is typically set in the range of 25 - 35 sec, more preferably about 30 sec, whereas the prefixed time Ts is typically set in the range of 10 - 20 sec, more preferably about 15 sec.
  • the first group g1 of vectors is contiguous with the second group of vectors g2.
  • the second algorithm 6 of the music audio analysis method 1 also comprises the further step of: b6) repeating the steps b3) to b5) until the end of the plurality of audio segments s-on-1, s-on-2, s-on-3, ..., s-on-i, defining further windows "wi", wherein each further window "wi" contains a group "gi" of vectors.
  • consecutive windows, e.g. w3 and w4, have to be overlapping or at most consecutive without gaps, but any subsequent window, e.g. w4, must not be contained in the previous windows, i.e. w1, w2 and w3.
  • the prefixed duration T2 of the window w2 could be equal to the prefixed duration T1 of the window w1, or could be greater than it, e.g. T2 > 3/2 T1; T2 could also be adjusted locally to its associated window, so as to be tailored to local properties of the underlying audio signal, without however violating the principle of partial overlapping.
  • windows may be tailored to the overall structure of the music signal, i.e. windows may be set so as to match sections like e.g. verse or chorus of a song.
  • An automatic estimation of the temporal boundaries of these structural sections may be obtained by using a state-of-the-art music summarization algorithm well known to the person skilled in the art. In this latter case, different windows may have different durations and may be contiguous instead of overlapping.
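The windowing of steps b1)-b6) can be sketched by grouping onset times into windows of duration T1 shifted by Ts. The 30 s / 15 s defaults follow the preferred values given above; the helper itself is hypothetical:

```python
def sliding_windows(onset_times, T1=30.0, Ts=15.0):
    """Group segment indices into windows of duration T1 shifted by Ts seconds.

    Returns, for each window "wi", the indices of the segments whose onset
    falls inside it; since Ts < T1, consecutive windows overlap.
    """
    if not onset_times:
        return []
    windows, start = [], 0.0
    end_time = max(onset_times)
    while start <= end_time:
        group = [i for i, t in enumerate(onset_times) if start <= t < start + T1]
        if group:
            windows.append(group)
        start += Ts
    return windows
```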
  • a first way to generate the second data 7, representative of the tonal centre of said first data 5, is to compute a mean vector "m" of said first data 5 and to choose the highest chord-root value in such mean vector "m" in order to set the tonal centre.
  • statistical estimates measured over time, such as the mean, variance and first-order covariance of the vectors contained in the first group g1, and the same statistical estimates for the other groups (i.e. g2, ..., gi), can be used to recover a better description of the local tonal context of each audio segment s-on-1, s-on-2, s-on-3, ..., s-on-i.
  • D is the dimension of the resulting feature vector, F is the number of considered chord families, and 3 is the number of statistical estimates measured over time, i.e. mean, variance and first-order covariance.
  • it is also possible to use a weighting scheme during the extraction of data 7, to account for the fact that audio segments s-on-1, s-on-2, ..., s-on-i are perceived as being accentuated when synchronised with the underlying metrical grid.
  • the most stable pitches producing the percept of tonality are typically played in synchrony with the metrical grid while less relevant pitches are more likely to be played on unmetrical time positions.
  • the incorporation of metrical information during the tonality estimation is as follows.
  • each audio segment s-on-1, s-on-2, ..., s-on-i is associated with a particular metrical weight depending on its synchronisation with identified metrical events. For example, it is possible to assign a weight of 1.0 to the audio segment if a musical bar position has been detected at some time position covered by the corresponding audio segment. A lower weight of e.g. 0.5 may be used if a beat position has been detected at some time position covered by the audio segment. Finally, the smallest weight of e.g. 0.25 may be used if no metrical event corresponds to the audio segment. Given such weights, it is possible to re-evaluate data 7A as:
  • Step b5): the weighted statistics over the group "gi" of the window "wi" are computed as

    μ_w = (1/N) Σ_i (w_i · x_i)

    σ_w² = (1/(N−1)) Σ_i (w_i · x_i − μ_w)²

    cov_1_w = (1/(N−2)) Σ_i (w_i · x_i − μ_w) · (w_{i+1} · x_{i+1} − μ_w)

    where N is the number of vectors within the group "gi" of the window "wi", w_i and x_i are the metrical weight and the vector of the i-th audio segment, μ_w is the weighted mean, σ_w² the weighted variance and cov_1_w the first-order weighted covariance.
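The metrically weighted statistics of step b5) can be sketched as below; the exact normalisation factors (1/(N-1), 1/(N-2)) are assumptions, since the original formula is only partially legible:

```python
import numpy as np

def weighted_window_stats(vectors, weights):
    """Weighted mean, variance and first-order covariance of the CFP vectors
    in one window "wi" (cf. step b5); normalisations are assumptions.

    vectors: (N, D) chord-family-profile vectors; weights: (N,) metrical
    weights (e.g. 1.0 bar / 0.5 beat / 0.25 unmetrical).
    """
    x = vectors * weights[:, None]           # w_i * x_i
    n = len(x)
    mu_w = x.mean(axis=0)                    # weighted mean
    var_w = ((x - mu_w) ** 2).sum(axis=0) / (n - 1)
    # first-order covariance: correlation between consecutive segments
    cov1_w = ((x[:-1] - mu_w) * (x[1:] - mu_w)).sum(axis=0) / (n - 2)
    return np.concatenate([mu_w, var_w, cov1_w])  # D * 3 features per window
```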
  • the step b5) of the second algorithm 6 of the music audio analysis method 1, i.e. the extraction of data 7 representative of the evolution of the tonal centre of the music piece given data 7A, is implemented as follows. Firstly, localized tonal centre estimates are computed by feeding each vector of data 7A independently into the Multi-Layer Perceptron (MLP).
  • the architecture of the MLP is such that its number of inputs matches the size of the vectors in data 7A.
  • the number of inputs of the MLP corresponds to the number of features describing the tonal context of window "w" (or generic window “wi").
  • the MLP may be built with an arbitrary number of hidden layers and hidden neurons.
  • the number of outputs is however fixed to 12 so that each output corresponds to one of the 12 possible pitches of the chromatic scale.
  • the parameters of the MLP are trained in a supervised fashion with stochastic gradient descent.
  • the training data consists of a large set of feature vectors describing the tonal context of window "w" (or generic window “wi") for a variety of different music pieces.
  • for each such feature vector, a target tonal centre has been manually associated by a number of expert musicologists.
  • given the corresponding training data, i.e. the pairs of feature vectors and tonal centre targets, the training consists in finding the set of parameters that maximises the output corresponding to the target tonal centre and minimises the other outputs, given the corresponding input data.
  • provided that suitable non-linearity functions (e.g. the sigmoid function) and a suitable training cost function (e.g. the cross-entropy cost function) are used, the MLP outputs will estimate tonal centre posterior probabilities, i.e. each output will be bounded between 0 and 1 and the outputs will sum to 1.
  • the evolution of the tonal centre is modelled with a transition matrix, which encodes the probability of going from the tonal centre estimate i-1 to the tonal centre estimate i.
  • although the transition probabilities could be learnt from data, the transition matrix is set manually according to some expert musicological knowledge (see table 2 for an example).
  • the problem of finding data 7, i.e. the optimal sequence of tonal centres over the course of the music piece, can be formulated as follows.
  • let Tc1*, Tc2*, ..., Tcn* be the optimal sequence of tonal centres and let Obs1, Obs2, ..., Obsn be the sequence of feature vectors fed independently into the local tonal centre estimation MLP.
  • the optimal sequence Tc1*, Tc2*, ..., Tcn* is such that:
  • (Tc1*, Tc2*, ..., Tcn*) = argmax over (Tc1, Tc2, ..., Tcn) of p(Tc1, Tc2, ..., Tcn | Obs1, Obs2, ..., Obsn)
  • the sequence Tc1*, Tc2*, ..., Tcn* can be obtained thanks to the Viterbi algorithm.
  • the Viterbi algorithm is a dynamic programming algorithm for finding the most likely sequence of hidden states, in this case the most likely sequence of tonal centres, that results in a sequence of observed events, in this case the local tonal centre estimations of the MLP.
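A Viterbi decoding of the tonal-centre sequence from the per-window MLP posteriors can be sketched as follows; the uniform prior and the small floor added inside the logarithms are implementation assumptions:

```python
import numpy as np

def viterbi_tonal_centres(obs_probs, transition, prior=None):
    """Most likely sequence of tonal centres given per-window MLP posteriors.

    obs_probs: (n_windows, 12) local tonal-centre probabilities from the MLP;
    transition: (12, 12) matrix, transition[i, j] = p(centre j | previous i).
    """
    n, k = obs_probs.shape
    log_obs = np.log(obs_probs + 1e-12)
    log_tr = np.log(transition + 1e-12)
    delta = (np.log(np.full(k, 1.0 / k)) if prior is None else np.log(prior)) + log_obs[0]
    back = np.zeros((n, k), dtype=int)
    for t in range(1, n):
        scores = delta[:, None] + log_tr      # score of each predecessor/state pair
        back[t] = scores.argmax(axis=0)       # best predecessor for each state
        delta = scores.max(axis=0) + log_obs[t]
    path = [int(delta.argmax())]
    for t in range(n - 1, 0, -1):             # backtrack the optimal path
        path.append(int(back[t][path[-1]]))
    return path[::-1]
```

With a transition matrix favouring self-transitions, isolated noisy local estimates are smoothed out, which is exactly why modulations are tracked at a slower rate than chord changes.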
  • the modelling of the tonal context is implemented in practice by the computation of mean/variance/covariance 7A of the CFPs 7 in a generic window "wi" together with the MLP in charge of estimating the probability of each tonal centre Tci.
  • Figures 7A to 7D illustrate graphically the algorithm 6 once it has been applied to the first data 5.
  • Figure 7B shows a graphical representation of a sequence of D-dimensional vectors representative of the tonal content over window "wi", i.e. second data 7, with the generic window "wi" on the abscissa and the vector dimension on the ordinate.
  • Figure 7B shows the longer-term vectors corresponding to the mean/variance/covariance of the shorter-term CFP vectors over the windows "w".
  • Figure 7C shows a graphical representation of the sequence of local tonal centre estimates, i.e. the 12-dimensional outputs of the MLP, with the generic window "wi" on the abscissa and the pitch class on the ordinate.
  • Figure 7D finally shows a graphical representation of the corresponding optimal sequence of tonal centres obtained thanks to the Viterbi algorithm, i.e. the final tonal centre estimate for each window "wi", with the generic window "wi" on the abscissa and the pitch class on the ordinate.
  • Step c) By referring again to figure 4, the third algorithm 8 comprises the step c1) of transposing the first data 5 to a reference pitch as a function of the second data 7, so as to generate the third data 9.
  • each CFP vector of the group g1 (or g2, ..., gi) is made invariant to transposition by transposing the vector values to a reference pitch.
  • the reference pitch can be C.
  • TCFPt(i, mod(j - Tt, 12)) = CFPt(i, j), where:
  • TCFPt is the transposed CFP vector at time t,
  • i is the chord family index,
  • j is the pitch class,
  • Tt is the tonal centre pitch class at time t.
  • the step c1) of transposing the first data 5 to a reference pitch is a normalization operation that allows any kind of music audio signal to be compared on the basis of tonal considerations.
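With C as the reference pitch, the transposition formula above amounts to a circular shift of the pitch-class axis. A minimal sketch, assuming the CFP at time t is stored as a (chord family x pitch class) matrix and pitch classes are numbered 0 = C through 11 = B:

```python
import numpy as np

def transpose_to_reference(cfp, tonal_centre):
    """Rotate a CFP matrix so its tonal centre lands on C.

    cfp          -- (n_chord_families, 12) matrix; rows index chord
                    families i, columns index pitch classes j
    tonal_centre -- tonal centre pitch class Tt (0 = C, ..., 11 = B)

    Implements TCFPt(i, mod(j - Tt, 12)) = CFPt(i, j): the value at
    pitch class j moves to pitch class (j - Tt) mod 12, i.e. each row
    is circularly shifted by -Tt positions.
    """
    return np.roll(cfp, -tonal_centre, axis=1)
```

After this normalization, two CFP sequences in different keys become directly comparable, since both are expressed relative to the same reference pitch.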
  • the apparatus able to perform the method heretofore described comprises:
  • the processor unit 18 is configured to extract the CFP 7 representative of the tonal centre of the music audio signal 2.

Landscapes

  • Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Auxiliary Devices For Music (AREA)
PCT/EP2008/063911 2008-10-15 2008-10-15 Method for analyzing a digital music audio signal WO2010043258A1 (en)

Priority Applications (7)

Application Number Priority Date Filing Date Title
EP08875184A EP2342708B1 (de) 2008-10-15 2008-10-15 Verfahren zum analysieren eines digitalen musikaudiosignals
CA2740638A CA2740638A1 (en) 2008-10-15 2008-10-15 Method for analyzing a digital music audio signal
BRPI0823192A BRPI0823192A2 (pt) 2008-10-15 2008-10-15 método para analisar um sinal de áudio de música digital
CN2008801315891A CN102187386A (zh) 2008-10-15 2008-10-15 分析数字音乐音频信号的方法
PCT/EP2008/063911 WO2010043258A1 (en) 2008-10-15 2008-10-15 Method for analyzing a digital music audio signal
EA201170559A EA201170559A1 (ru) 2008-10-15 2008-10-15 Способ анализа цифрового музыкального аудиосигнала
JP2011531363A JP2012506061A (ja) 2008-10-15 2008-10-15 デジタル音楽音響信号の分析方法

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/EP2008/063911 WO2010043258A1 (en) 2008-10-15 2008-10-15 Method for analyzing a digital music audio signal

Publications (1)

Publication Number Publication Date
WO2010043258A1 true WO2010043258A1 (en) 2010-04-22

Family

ID=40344486

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EP2008/063911 WO2010043258A1 (en) 2008-10-15 2008-10-15 Method for analyzing a digital music audio signal

Country Status (7)

Country Link
EP (1) EP2342708B1 (de)
JP (1) JP2012506061A (de)
CN (1) CN102187386A (de)
BR (1) BRPI0823192A2 (de)
CA (1) CA2740638A1 (de)
EA (1) EA201170559A1 (de)
WO (1) WO2010043258A1 (de)


Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9257954B2 (en) * 2013-09-19 2016-02-09 Microsoft Technology Licensing, Llc Automatic audio harmonization based on pitch distributions
JP6671245B2 (ja) * 2016-06-01 2020-03-25 株式会社Nttドコモ 識別装置
CN107135578B (zh) * 2017-06-08 2020-01-10 复旦大学 基于TonaLighting调节技术的智能音乐和弦-氛围灯系统
JP7375302B2 (ja) * 2019-01-11 2023-11-08 ヤマハ株式会社 音響解析方法、音響解析装置およびプログラム

Citations (2)

Publication number Priority date Publication date Assignee Title
US6057502A (en) 1999-03-30 2000-05-02 Yamaha Corporation Apparatus and method for recognizing musical chords
US20080245215A1 (en) 2006-10-20 2008-10-09 Yoshiyuki Kobayashi Signal Processing Apparatus and Method, Program, and Recording Medium

Family Cites Families (7)

Publication number Priority date Publication date Assignee Title
JPH1091199A (ja) * 1996-09-18 1998-04-10 Mitsubishi Electric Corp 記録再生装置
JP3870727B2 (ja) * 2001-06-20 2007-01-24 ヤマハ株式会社 演奏タイミング抽出方法
JP2006202235A (ja) * 2005-01-24 2006-08-03 Nara Institute Of Science & Technology 経時的現象発生解析装置及び経時的現象発生解析方法
JP2007041234A (ja) * 2005-08-02 2007-02-15 Univ Of Tokyo 音楽音響信号の調推定方法および調推定装置
JP4722738B2 (ja) * 2006-03-14 2011-07-13 三菱電機株式会社 楽曲分析方法及び楽曲分析装置
JP4823804B2 (ja) * 2006-08-09 2011-11-24 株式会社河合楽器製作所 コード名検出装置及びコード名検出用プログラム
JP4214491B2 (ja) * 2006-10-20 2009-01-28 ソニー株式会社 信号処理装置および方法、プログラム、並びに記録媒体


Non-Patent Citations (6)

Title
CHING-HUA CHUAN ET AL: "Polyphonic Audio Key Finding Using the Spiral Array CEG Algorithm", MULTIMEDIA AND EXPO, 2005. ICME 2005. IEEE INTERNATIONAL CONFERENCE ON AMSTERDAM, THE NETHERLANDS 06-06 JULY 2005, PISCATAWAY, NJ, USA,IEEE, 6 July 2005 (2005-07-06), pages 21 - 24, XP010843224, ISBN: 978-0-7803-9331-8 *
EMILIA GÓMEZ: "TONAL DESCRIPTION OF MUSIC AUDIO SIGNALS", 20060101, 1 January 2006 (2006-01-01), XP002501266 *
FUJISHIMA T: "REALTIME CHORD RECOGNITION OF MUSICAL SOUND: A SYSTEM USING COMMON LISP MUSIC", ICMC. INTERNATIONAL COMPUTER MUSIC CONFERENCE. PROCEEDINGS, XX, XX, 27 September 1999 (1999-09-27), pages 464 - 467, XP009053025 *
ÖZGÜR IZMIRLI: "An Algorithm for Audio Key Finding", INTERNET CITATION, XP002426658, Retrieved from the Internet <URL:http://www.music-ir.org/evaluation/mirex-results/articles/key_audio/i zmirli.pdf> [retrieved on 20070326] *
PEETERS G: "Chroma-based estimation of musical key from audio-signal analysis", PROCEEDINGS ANNUAL INTERNATIONAL SYMPOSIUM ON MUSIC INFORMATIONRETRIEVAL, XX, XX, 1 October 2006 (2006-10-01), pages 1 - 6, XP002447156 *
ZOIA G ET AL: "A multi-timbre chord/harmony analyzer based on signal processing and neural networks", MULTIMEDIA SIGNAL PROCESSING, 2004 IEEE 6TH WORKSHOP ON SIENA, ITALY SEPT. 29 - OCT. 1, 2004, PISCATAWAY, NJ, USA,IEEE, 29 September 2004 (2004-09-29), pages 219 - 222, XP010802125, ISBN: 978-0-7803-8578-8 *

Cited By (4)

Publication number Priority date Publication date Assignee Title
US20110254688A1 (en) * 2010-04-15 2011-10-20 Samsung Electronics Co., Ltd. User state recognition in a wireless communication system
US9263060B2 (en) 2012-08-21 2016-02-16 Marian Mason Publishing Company, Llc Artificial neural network based system for classification of the emotional content of digital music
US20210287662A1 (en) * 2018-09-04 2021-09-16 Gracenote, Inc. Methods and apparatus to segment audio and determine audio segment similarities
US11657798B2 (en) * 2018-09-04 2023-05-23 Gracenote, Inc. Methods and apparatus to segment audio and determine audio segment similarities

Also Published As

Publication number Publication date
CA2740638A1 (en) 2010-04-22
EA201170559A1 (ru) 2012-01-30
EP2342708B1 (de) 2012-07-18
EP2342708A1 (de) 2011-07-13
JP2012506061A (ja) 2012-03-08
BRPI0823192A2 (pt) 2018-10-23
CN102187386A (zh) 2011-09-14

Similar Documents

Publication Publication Date Title
JP5282548B2 (ja) 情報処理装置、音素材の切り出し方法、及びプログラム
US7908135B2 (en) Music-piece classification based on sustain regions
Shetty et al. Raga mining of Indian music by extracting arohana-avarohana pattern
Paiement et al. A probabilistic model for chord progressions
EP2342708B1 (de) Verfahren zum analysieren eines digitalen musikaudiosignals
CN110136730B (zh) 一种基于深度学习的钢琴和声自动编配系统及方法
CN111739491B (zh) 一种自动编配伴奏和弦的方法
Bittner et al. Multitask learning for fundamental frequency estimation in music
CN117334170A (zh) 生成音乐数据的方法
Zhang et al. Melody extraction from polyphonic music using particle filter and dynamic programming
Shi et al. Music genre classification based on chroma features and deep learning
WO2022038958A1 (ja) 楽曲構造解析装置および楽曲構造解析方法
CN112634841B (zh) 一种基于声音识别的吉他谱自动生成方法
Lerch Software-based extraction of objective parameters from music performances
Camurri et al. An experiment on analysis and synthesis of musical expressivity
Noland et al. Influences of signal processing, tone profiles, and chord progressions on a model for estimating the musical key from audio
Madsen et al. Exploring pianist performance styles with evolutionary string matching
Papadopoulos Joint estimation of musical content information from an audio signal
Wang et al. A framework for automated pop-song melody generation with piano accompaniment arrangement
JP2006201278A (ja) 楽曲の拍節構造の自動分析方法および装置、ならびにプログラムおよびこのプログラムを記録した記録媒体
Paiement Probabilistic models for music
Noland Computational tonality estimation: signal processing and hidden Markov models
Ryynänen Automatic transcription of pitch content in music and selected applications
Färnström et al. Method for Analyzing a Digital Music Audio Signal
Vatolkin et al. Performance of specific vs. generic feature sets in polyphonic music instrument recognition

Legal Events

Date Code Title Description
WWE WIPO information: entry into national phase; ref document number: 200880131589.1; country of ref document: CN

121 Ep: the EPO has been informed by WIPO that EP was designated in this application; ref document number: 08875184; country of ref document: EP; kind code of ref document: A1

WWE WIPO information: entry into national phase; ref document number: 2008875184; country of ref document: EP

WWE WIPO information: entry into national phase; ref document number: 2740638; country of ref document: CA

NENP Non-entry into the national phase; ref country code: DE

WWE WIPO information: entry into national phase; ref document number: 2011531363; country of ref document: JP

WWE WIPO information: entry into national phase; ref document number: 3218/CHENP/2011; country of ref document: IN

WWE WIPO information: entry into national phase; ref document number: 201170559; country of ref document: EA

ENP Entry into the national phase; ref document number: PI0823192; country of ref document: BR; kind code of ref document: A2; effective date: 2011-04-15