WO2007072394A2 - Audio structure analysis - Google Patents

Audio structure analysis

Info

Publication number
WO2007072394A2
WO2007072394A2 (PCT/IB2006/054915)
Authority
WO
WIPO (PCT)
Prior art keywords
energy
similarity
determining
music signal
beat
Prior art date
Application number
PCT/IB2006/054915
Other languages
French (fr)
Other versions
WO2007072394A3 (en)
Inventor
Aweke N. Lemma
Original Assignee
Koninklijke Philips Electronics N.V.
Priority date
Filing date
Publication date
Application filed by Koninklijke Philips Electronics N.V.
Publication of WO2007072394A2
Publication of WO2007072394A3

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10H: ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H1/00: Details of electrophonic musical instruments
    • G10H1/36: Accompaniment arrangements
    • G10H1/40: Rhythm
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10H: ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H2210/00: Aspects or methods of musical processing having intrinsic musical character, i.e. involving musical theory or musical parameters or relying on musical knowledge, as applied in electrophonic musical tools or instruments
    • G10H2210/031: Musical analysis, i.e. isolation, extraction or identification of musical elements or musical parameters from a raw acoustic signal or from an encoded audio signal
    • G10H2210/076: Musical analysis for extraction of timing, tempo; Beat detection
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10H: ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H2220/00: Input/output interfacing specifically adapted for electrophonic musical tools or instruments
    • G10H2220/021: Indicator, i.e. non-screen output user interfacing, e.g. visual or tactile instrument status or guidance information using lights, LEDs, seven-segment displays
    • G10H2220/081: Beat indicator, e.g. marks or flashing LEDs to indicate tempo or beat positions
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10H: ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H2250/00: Aspects of algorithms or signal processing methods without intrinsic musical character, yet specifically adapted for or used in electrophonic musical processing
    • G10H2250/131: Mathematical functions for musical analysis, processing, synthesis or composition
    • G10H2250/135: Autocorrelation

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Auxiliary Devices For Music (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A device (1) for determining accented beats in a music signal (x[n]) comprises: energy determination means (12) for determining the energy (E[n]; E[k]) of the music signal (x[n]), segmentation means (15) for segmenting the energy on the basis of a tempo estimate (T), similarity determination means (17) for determining the similarity between the energy (E[n]; E[k]) of segments, and selecting means (18) for selecting the segment having the smallest similarity as the segment containing an accented beat. The tempo estimate (T) may be determined by external means. The energy may be determined in the time domain or in a transform domain. The similarity may be determined using cross-correlation, entropy or a distance measure. The device (1) may advantageously be used in AutoDJ apparatus.

Description

Audio structure analysis
The present invention relates to audio structure analysis. More in particular, the present invention relates to a device for and method of determining accented beats in a music signal.
It is well known to analyze the structure of audio signals, in particular music signals, both manually and automatically. In order to compare music pieces or sound tracks, several characteristics of the music may be determined, such as the meter of the music, including the beat and the bar boundaries. When automatically processing music, for example in AutoDJ (Automatic Disc Jockey) applications, it is necessary to match the meter of successive music pieces. When mixing songs, it is highly desirable to synchronize the beat of the songs, in particular the accented beats (downbeats). Although many different methods of beat detection are known, very few prior art documents deal with detecting the accented beat.
United States Patent US 6 542 869 (Foote) discloses a method of determining points of change in an audio signal by measuring the self-similarity of components of the audio signal. The self-similarity as well as cross-similarity between each of a set of signal parameterization values is determined for all past and future time window regions. A significant point of change will have a high self-similarity in the past and future, and a low cross-similarity. This known method may be used for beat tracking, including finding the tempo and location of downbeats in music.
This known method has several disadvantages. A self-similarity matrix is very complex and its compilation is computationally very demanding, while requiring a large amount of memory. This makes the known method less suitable for consumer devices, which typically have relatively little computational power and a limited amount of memory. In addition, the known method suffers from a high degree of ambiguity, as the nature of the detected points of change has to be derived from their frequency of occurrence, which may only be determined accurately if a sufficiently high resolution is used. It has been found that this method is less suitable for accurately determining accented beats in music. It is an object of the present invention to overcome these and other problems of the prior art and to provide a device for and method of determining accented beats in a music signal which are simple yet provide sufficient accuracy.
Accordingly, the present invention provides a device for determining accented beats in a music signal, the device comprising: energy determination means for determining the energy of the music signal, segmentation means for segmenting the energy on the basis of a tempo estimate, similarity determination means for determining the similarity between the energy of segments, and selecting means for selecting the segment having the smallest similarity as the segment containing an accented beat. By determining the similarity between the energy of signal segments, a very simple yet effective way of detecting accented beats is obtained, as the accented beat will be dissimilar from the other beats. The present invention uses a one-dimensional approach (comparing consecutive signal segments), which is computationally far less demanding and requires far less memory than the two-dimensional approach of the Foote patent mentioned above.
A tempo estimate is used to aid the segmentation of the calculated energy. This tempo estimate may be produced using any known method and may also involve detecting beat onsets, although this is not essential. It is preferred that the segmentation substantially corresponds with the beat onsets (the beginning of each beat), but this is not essential.
It is noted that instead of the energy of the music signal, any other equivalent property may be determined, such as its magnitude.
The similarity determination means may be arranged for carrying out a cross-correlation, an autocorrelation, a distance measurement, an information measurement and/or a pattern match. A cross-correlation is preferred, but other (dis)similarity measures may also be used.
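As an illustration of two such (dis)similarity measures (a minimal sketch, not taken from the patent; the function names and the use of NumPy are assumptions), a zero-lag normalized cross-correlation and a Euclidean distance between two equal-length energy segments could be computed as:

```python
import numpy as np

def normalized_xcorr(a, b):
    """Zero-lag normalized cross-correlation: 1.0 for identical
    shapes, lower (down to -1.0) for dissimilar shapes."""
    a = a - a.mean()
    b = b - b.mean()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b) / denom if denom else 0.0

def euclidean_distance(a, b):
    """A simple distance measure; larger means more dissimilar."""
    return float(np.linalg.norm(a - b))
```

Identical segments then yield a cross-correlation of 1.0 and a distance of 0.0, while a segment with a clearly different shape scores much lower on the correlation measure.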
As mentioned above, the segmentation means are preferably arranged for segmenting the energy on beat onset positions. Additionally, or alternatively, the segmentation means are preferably arranged for providing the segments in parallel so as to allow a simple, essentially one-dimensional comparison.
The device of the present invention may further comprise tempo estimation means for estimating the tempo of the music signal. However, such tempo estimation means, which may also determine beat onsets, may also be external to the device.
The energy determination means may be arranged for determining the time domain energy. However, in a preferred embodiment the device of the present invention further comprises a transform means for transforming the music signal to a transform domain, while the energy determination means are arranged for determining the transform domain energy, said transform domain preferably being the frequency domain. Accordingly, the transform means are preferably arranged for performing a Fast Fourier Transform (FFT).
The device of the present invention may further comprise a frame compilation means for compiling frames of the music signal, and/or an energy buffer means for buffering the (time and/or transform domain) energy. The device of the present invention may advantageously further comprise a filter means arranged between the segmentation means and the similarity determination means for filtering the energy segments prior to determining their similarity. The filter means serve to reduce any influence of transients and improve the reliability of the accented beat estimates. A music system, such as an AutoDJ system, according to the present invention comprises an accented beat determination device as defined above.
The present invention also provides a method of determining accented beats in a music signal, the method comprising the steps of: determining the energy of the music signal, segmenting the energy on the basis of a tempo estimate, determining the similarity between the energy of segments, and selecting the segment having the smallest similarity as the segment containing an accented beat.
The method of the present invention may advantageously be used for detecting bar boundaries, as a bar typically starts with an accented beat. Accordingly, the present invention also provides a method of detecting bar boundaries in a music signal, the method comprising the steps of: determining the energy of the music signal, segmenting the energy on the basis of a tempo estimate, determining the similarity between the energy of segments, selecting the segment having the smallest similarity as the segment containing an accented beat, and equating the bar boundary with the beat onset of the accented beat. Further advantageous embodiments of the inventive device and methods will become apparent from the description below.
The present invention additionally provides a computer program product for carrying out the method as defined above. A computer program product may comprise a set of computer executable instructions stored on a data carrier, such as a CD or a DVD. The set of computer executable instructions, which allow a programmable computer to carry out the method as defined above, may also be available for downloading from a remote server, for example via the Internet.
The present invention will further be explained below with reference to exemplary embodiments illustrated in the accompanying drawings, in which:
Fig. 1 schematically shows the energy of signal segments as processed according to the present invention,
Fig. 2 schematically shows a first embodiment of an accented beat detection device according to the present invention,
Fig. 3 schematically shows a second embodiment of an accented beat detection device according to the present invention,
Fig. 4 schematically shows an AutoDJ system in which the invention may advantageously be utilized.
The energy of a music signal as a function of time is schematically illustrated in Fig. 1. The energy E illustrated in Fig. 1 may be determined by an accented beat detection device of the present invention, which will be discussed later with reference to Figs. 2 and 3. The top diagram of Fig. 1 shows the energy E of a music signal as a function of time (sample number n) or frequency (frequency bin k). In the following discussion, it will be assumed that the energy E is a function of time and that the music has four beats per measure, although the invention is not so limited. The music signal is segmented into segments or beat periods BP. In the example shown, the segment boundaries are at the peaks of the energy signal E. Assuming (or knowing) that the music signal has four beats per measure, the segments can be labeled I, II, III and IV so as to correspond with the four respective beats. It is noted that at this stage, the accented beat and the beginning of the measure are not yet known and the label I is essentially arbitrary.
Although it is possible to use only a single copy of each segment I, II, III and IV, it is preferred to use multiple copies of each segment so as to average out any noise. Accordingly, the energy E of all first segments I (of a certain time period or time frame) are concatenated, resulting in the energy signal E labeled I in the leftmost lower diagram of Fig. 1. It is to be understood that the lower diagram labeled I contains a succession of segments I of the top diagram. Similarly, the second segments II are concatenated so as to produce the succession of segments illustrated in the lower diagram labeled II, while the same action is repeated for the segments III and IV. As can be seen, the successions of segments I, II and III are very similar and a similarity measure (such as cross-correlation) would yield a high degree of similarity. The segments IV, however, have a different shape and are therefore less similar. It can therefore be concluded that the segments IV represent the accented beats (downbeats), as they are the most dissimilar. The (dis)similarity can be determined in various ways, for example by determining the cross-correlation of each succession I, II, III and IV with each of the other successions, the succession having the lowest aggregate cross-correlation with the other successions representing the accented beat. Additionally, or alternatively, the autocorrelation of each succession may be determined, the most dissimilar autocorrelation value indicating the accented beat. In other embodiments, the shape and/or amplitude of the successions may be involved using pattern matching techniques or distance measures. It will be understood that the particular technique of determining the (dis)similarity of the successions is not essential.
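The procedure walked through above for Fig. 1 can be sketched as follows (a hedged illustration, not the patent's implementation: the function name `find_accented_beat` and the use of NumPy are assumptions, and the cross-correlation is taken at zero lag only):

```python
import numpy as np

def find_accented_beat(energy, beat_onsets, beats_per_measure):
    """Group beat-period energy segments by their position in the
    measure, concatenate each group into a succession (I, II, ...),
    and return the position whose succession is the outlier."""
    # Segment the energy at the beat onsets (Fig. 1, top diagram).
    segments = [np.asarray(energy[a:b], dtype=float)
                for a, b in zip(beat_onsets[:-1], beat_onsets[1:])]
    n = min(len(s) for s in segments)          # common segment length
    # Concatenate the segments of each beat position (lower diagrams).
    successions = [np.concatenate([s[:n] for s in segments[i::beats_per_measure]])
                   for i in range(beats_per_measure)]
    m = min(len(s) for s in successions)
    successions = [s[:m] for s in successions]

    def ncc(a, b):                             # zero-lag normalized xcorr
        a, b = a - a.mean(), b - b.mean()
        d = np.linalg.norm(a) * np.linalg.norm(b)
        return float(a @ b) / d if d else 0.0

    # Aggregate similarity of each succession with all the others; the
    # smallest aggregate marks the accented beat (the outlier).
    scores = [sum(ncc(s, t) for j, t in enumerate(successions) if j != i)
              for i, s in enumerate(successions)]
    return int(np.argmin(scores))
```

With four beats per measure, a return value of 0 would correspond to succession I, 1 to succession II, and so on.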
A first embodiment of an accented beat detection device 1 according to the present invention is schematically illustrated in Fig. 2. The device 1 shown merely by way of non-limiting example in Fig. 2 is arranged for time domain similarity determination and comprises an energy calculation unit 12, a segmentation unit 15, a similarity determination unit 17 and a selecting unit 18. The energy calculation unit 12 receives a (digital) music signal x[n] and determines its energy (or any other suitable parameter), for example the signal energy E[n] (energy per sample n). This energy signal E[n] is fed to the segmentation unit 15, which acts as a demultiplexer (DMux). The segmentation unit receives tempo (beat and/or beat onset) information T and beats-per-measure information M and segments the energy E[n] accordingly (see also Fig. 1). The segmented energy is fed per segment number (I - IV in Fig. 1) to the similarity (SIM) determination unit 17. As a result, the similarity determination unit 17 receives the successions I - IV (Fig. 1) at each of its inputs.
The similarity determination unit 17 then determines the similarity between its input signals, essentially as indicated above. Similarity information relating to each of its inputs is produced at its respective outputs and fed to a selecting unit 18. This selecting unit 18 is, in the present example, arranged for outlier selection (OS) so as to determine which of its input signals is the most dissimilar, that is, is the outlier. Information identifying the outlier, and hence the corresponding segment (of the segments I - IV of Fig. 1) is output as accented beat information abi.
The embodiment of Fig. 3 is very similar to the embodiment of Fig. 2 but is arranged for operating in the frequency domain. The device 1 of Fig. 3 comprises a frame compilation (FC) unit 10 for compiling frames of the input time domain music signal x[n]. It will be understood that when the music signal x[n] is input in frame format, the frame compilation unit 10 may be dispensed with.
The music signal frames containing time domain signal data are fed to a transform unit 11 which in the present embodiment is arranged for carrying out a Fast Fourier Transform (FFT). It will be understood that other transforms, such as a Discrete Cosine Transform (DCT), may be used instead. The transform domain signal data produced by the transform unit 11 are fed to the energy calculation unit 12, which calculates the energy of each frame using the transform domain data. The resulting transform domain energy E[k], with k indicating the frequency bin number, is fed to the segmentation means 15 via an energy buffer (EB) 13. The embodiment shown also comprises a tempo estimator (TE) unit 14, which also receives the transform domain energy E[k] so as to derive the beat and optionally also the beat onsets.
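A minimal sketch of the frame compilation and transform-domain energy computation described above (assumed, not the patent's code; NumPy's real FFT stands in for the transform unit):

```python
import numpy as np

def frame_energies(x, frame_len, hop):
    """Split x into frames (frame compilation), FFT each frame, and
    return the spectral energy per frame as the sum of squared
    magnitudes of the transform coefficients."""
    x = np.asarray(x, dtype=float)
    energies = []
    for start in range(0, len(x) - frame_len + 1, hop):
        spectrum = np.fft.rfft(x[start:start + frame_len])
        energies.append(float(np.sum(np.abs(spectrum) ** 2)))
    return np.array(energies)
```

A louder frame yields a proportionally larger transform-domain energy, so the resulting sequence E[k] can be buffered and segmented exactly like the time-domain energy E[n].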
This tempo information T produced by the tempo estimator unit 14 is fed to the segmentation means 15, which also receive the beats-per-measure information M as in the embodiment of Fig. 2. The energy E[k] is then processed by the segmentation unit 15 essentially as in the embodiment of Fig. 2. In the embodiment shown in Fig. 3, a low-pass filter (LPF) 16 is arranged between the segmentation unit 15 and the similarity determination unit 17 so as to remove any undesired frequency components, such as noise components.
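The patent does not specify the low-pass filter 16; as one simple possibility, a moving-average filter applied to each energy segment before the similarity measurement could look like this (all names hypothetical):

```python
import numpy as np

def smooth_segments(segments, width=5):
    """Sketch of LPF 16: moving-average low-pass filter applied to each
    energy segment to suppress noise before similarity determination."""
    kernel = np.ones(width) / width
    return [np.convolve(seg, kernel, mode="same") for seg in segments]
```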
It is possible to use additional filters to process the music signal x[n] or its energy E[n] or E[k] per sub-band. In the case of the time domain energy E[n], the energy computation is preceded by a filter bank that splits the incoming signal into a number of sub-bands (m = 1 ... M). For each sub-band m, the energy function Em[n] is computed. The similarity is then determined for each sub-band independently. The selection of the accented beat is based upon the weighted sum of the similarity values of the sub-bands. Similarly, in the case of the transform domain energy E[k], the transform domain spectrum is first divided into a number of sub-bands. Then the (spectral) weighted energy is computed by taking the weighted sum of the transform domain coefficients (in the example shown: FFT coefficients) of the respective sub-bands.
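For the transform domain variant, the weighted sub-band energy described above can be sketched as follows; the band edges, the function name, and the uniform default weights are assumptions for illustration:

```python
import numpy as np

def subband_energies(spectrum, band_edges, weights=None):
    """Sketch of the spectral weighting: split a magnitude-squared
    spectrum E[k] into sub-bands and return the weighted energy of each.
    band_edges gives the bin index where each band starts/ends,
    e.g. [0, 8, 32, 128] yields three bands."""
    bands = []
    for lo, hi in zip(band_edges[:-1], band_edges[1:]):
        coeffs = spectrum[lo:hi]
        w = np.ones_like(coeffs) if weights is None else weights[lo:hi]
        bands.append(float(np.sum(w * coeffs)))  # weighted band energy
    return bands
```

The per-band energies would then feed independent similarity measurements whose weighted sum drives the accented beat selection.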
The AutoDJ system 5 illustrated merely by way of non-limiting example in Fig. 4 comprises a song database (SDB) 51 coupled to a player device (PD) 50, a playlist generator (PG) 54 and an audio analyzer (AA) 52. The player device may be a home music (e.g. 5.1) set, an MP3 player, a computer sound card, or any other device capable of playing music, and is coupled to a loudspeaker 56. The playlist generator 54 selects songs from the song database 51 and compiles playlists in accordance with user preferences. The audio analyzer 52 comprises an accented beat determination device 1 according to the present invention and supplies audio analysis information, including the positions of the accented beats, to a feature database (FDB) 53. A playlist recorder (PLR) 55 uses information provided by both the playlist generator 54 and the feature database 53 to record a playlist, and feeds this playlist (or playlists) to the player device 50. Using the accented beat information, smooth transitions between various songs can be achieved.
The method of the present invention may advantageously be used for detecting bar boundaries, as a bar typically starts with an accented beat.
The present invention is based upon the insight that accented beats may be detected on the basis of their (dis)similarity with the unaccented beats. The present invention benefits from the further insight that an accented beat typically indicates the beginning of a measure.
It is noted that any terms used in this document should not be construed so as to limit the scope of the present invention. In particular, the words "comprise(s)" and "comprising" are not meant to exclude any elements not specifically stated. Single (circuit) elements may be substituted with multiple (circuit) elements or with their equivalents.
It will be understood by those skilled in the art that the present invention is not limited to the embodiments illustrated above and that many modifications and additions may be made without departing from the scope of the invention as defined in the appended claims.

Claims

1. A device (1) for determining accented beats in a music signal, the device comprising:
- energy determination means (12) for determining the energy (E[n]; E[k]) of the music signal (x[n]),
- segmentation means (15) for segmenting the energy on the basis of a tempo estimate (T),
- similarity determination means (17) for determining the similarity between the energy (E[n]; E[k]) of segments, and
- selecting means (18) for selecting the segment having the smallest similarity as the segment containing an accented beat.
2. The device according to claim 1, wherein the similarity determination means (17) are arranged for carrying out a cross-correlation, an autocorrelation, a distance measurement, an information measurement and/or a pattern match.
3. The device according to claim 1, wherein the segmentation means (15) are arranged for segmenting the energy on beat onset positions.
4. The device according to claim 1, wherein the segmentation means (15) are arranged for providing the segments in parallel.
5. The device according to claim 1, further comprising tempo estimation means (14) for estimating the tempo of the music signal (x[n]).
6. The device according to claim 1, wherein the energy determination means (12) are arranged for determining the time domain energy (E[n]).
7. The device according to claim 1, further comprising a transform means (11) for transforming the music signal x[n] to a transform domain, wherein the energy determination means (12) are arranged for determining the transform domain energy (E[k]), said transform domain preferably being the frequency domain.
8. The device according to claim 1, further comprising a frame compilation means (10) for compiling frames of the music signal, and/or an energy buffer means (13) for buffering the energy (E[n]; E[k]).
9. The device according to claim 1, further comprising a filter means (16) arranged between the segmentation means (15) and the similarity determination means (17) for filtering the energy segments prior to determining their similarity.
10. An AutoDJ system (5), comprising a device (1; 52) according to claim 1.
11. A method of determining accented beats in a music signal (x[n]), the method comprising the steps of:
- determining the energy (E[n]; E[k]) of the music signal (x[n]),
- segmenting the energy on the basis of a tempo estimate (T),
- determining the similarity between the energy (E[n]; E[k]) of segments, and
- selecting the segment having the smallest similarity as the segment containing an accented beat.
12. A method of detecting bar boundaries in a music signal (x[n]), the method comprising the steps of:
- determining the energy (E[n]; E[k]) of the music signal (x[n]),
- segmenting the energy on the basis of a tempo estimate (T),
- determining the similarity between the energy (E[n]; E[k]) of segments,
- selecting the segment having the smallest similarity as the segment containing an accented beat, and
- equating the bar boundary with the beat onset of the accented beat.
13. A computer program product for carrying out the method according to claim 11 and/or 12.
PCT/IB2006/054915 2005-12-22 2006-12-18 Audio structure analysis WO2007072394A2 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
EP05112778 2005-12-22
EP05112778.5 2005-12-22

Publications (2)

Publication Number Publication Date
WO2007072394A2 true WO2007072394A2 (en) 2007-06-28
WO2007072394A3 WO2007072394A3 (en) 2007-10-18

Family

ID=38137441

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/IB2006/054915 WO2007072394A2 (en) 2005-12-22 2006-12-18 Audio structure analysis

Country Status (1)

Country Link
WO (1) WO2007072394A2 (en)


Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP5625235B2 (en) * 2008-11-21 2014-11-19 ソニー株式会社 Information processing apparatus, voice analysis method, and program
JP5463655B2 (en) * 2008-11-21 2014-04-09 ソニー株式会社 Information processing apparatus, voice analysis method, and program


Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6316712B1 (en) * 1999-01-25 2001-11-13 Creative Technology Ltd. Method and apparatus for tempo and downbeat detection and alteration of rhythm in a musical segment
US6542869B1 (en) * 2000-05-11 2003-04-01 Fuji Xerox Co., Ltd. Method for automatic analysis of audio including music and speech
US20050211072A1 (en) * 2004-03-25 2005-09-29 Microsoft Corporation Beat analysis of musical signals

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
GOTO M ET AL: "Real-time beat tracking for drumless audio signals: Chord change detection for musical decisions" SPEECH COMMUNICATION, ELSEVIER SCIENCE PUBLISHERS, AMSTERDAM, NL, vol. 27, no. 3-4, April 1999 (1999-04), pages 311-335, XP004163257 ISSN: 0167-6393 *
SCHEIRER ERIC D: "Tempo and beat analysis of acoustic musical signals" JOURNAL OF THE ACOUSTICAL SOCIETY OF AMERICA, AIP / ACOUSTICAL SOCIETY OF AMERICA, MELVILLE, NY, US, vol. 103, no. 1, January 1998 (1998-01), pages 588-601, XP012000051 ISSN: 0001-4966 *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2007036846A2 (en) * 2005-09-30 2007-04-05 Koninklijke Philips Electronics N.V. Method and apparatus for automatic structure analysis of music
WO2007036846A3 (en) * 2005-09-30 2007-11-29 Koninkl Philips Electronics Nv Method and apparatus for automatic structure analysis of music
DE102009031673A1 (en) * 2009-02-13 2010-08-26 Kajetan Dvoracek Method for determining clock speed of electrical signals for e.g. pulse frequency-oriented sports activity, involves supplementing maximum value if necessary, and determining clock speed of piece of music from period values
US9830896B2 (en) 2013-05-31 2017-11-28 Dolby Laboratories Licensing Corporation Audio processing method and audio processing apparatus, and training method
US20200357369A1 (en) * 2018-01-09 2020-11-12 Guangzhou Baiguoyuan Information Technology Co., Ltd. Music classification method and beat point detection method, storage device and computer device
US11715446B2 (en) 2018-01-09 2023-08-01 Bigo Technology Pte, Ltd. Music classification method and beat point detection method, storage device and computer device

Also Published As

Publication number Publication date
WO2007072394A3 (en) 2007-10-18

Similar Documents

Publication Publication Date Title
JP5362178B2 (en) Extracting and matching characteristic fingerprints from audio signals
JP4900960B2 (en) Apparatus and method for analyzing information signals
US8586847B2 (en) Musical fingerprinting based on onset intervals
US7085613B2 (en) System for monitoring audio content in a video broadcast
US7386357B2 (en) System and method for generating an audio thumbnail of an audio track
JP4949687B2 (en) Beat extraction apparatus and beat extraction method
US6604072B2 (en) Feature-based audio content identification
US7500176B2 (en) Method and apparatus for automatically creating a movie
US20020116195A1 (en) System for selling a product utilizing audio content identification
JP4650662B2 (en) Signal processing apparatus, signal processing method, program, and recording medium
EP1579419B1 (en) Audio signal analysing method and apparatus
GB2518663A (en) Audio analysis apparatus
MX2007002071A (en) Methods and apparatus for generating signatures.
WO2007072394A2 (en) Audio structure analysis
US20110067555A1 (en) Tempo detecting device and tempo detecting program
US8983082B2 (en) Detecting musical structures
Zhou et al. Music onset detection based on resonator time frequency image
EP2022041A1 (en) Selection of tonal components in an audio spectrum for harmonic and key analysis
US9767846B2 (en) Systems and methods for analyzing audio characteristics and generating a uniform soundtrack from multiple sources
EP1497935B1 (en) Feature-based audio content identification
JP5395399B2 (en) Mobile terminal, beat position estimating method and beat position estimating program
JP2005292207A (en) Method of music analysis
CN112687247A (en) Audio alignment method and device, electronic equipment and storage medium
EP2355104A1 (en) Apparatus and method for processing audio data
AU2002249371B2 (en) Method and apparatus for identifying electronic files

Legal Events

Date Code Title Description
NENP Non-entry into the national phase in:

Ref country code: DE

121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 06842576

Country of ref document: EP

Kind code of ref document: A2

122 Ep: pct application non-entry in european phase

Ref document number: 06842576

Country of ref document: EP

Kind code of ref document: A2