EP1636789A2 - Method for processing an audio sequence for example a piece of music

Info

Publication number: EP1636789A2
Application number: EP04767355A
Authority: EP
Inventors: Geoffroy c/o IRCAM PEETERS
Original assignee: France Telecom SA
Current assignee: Orange SA
Priority date: 2003-06-25
Filing date: 2004-06-16
Publication date: 2006-03-22
Also published as: US20060288849A1; WO2005004002A2; JP2007520727A; WO2005004002A3; FR2856817A1

Abstract

The invention relates to the processing of an audio sequence, for example a piece of music. After a spectral transform is applied to said sequence, at least one sub-sequence repeated in said sequence, such as a refrain and/or a verse of a piece of music, is determined by statistical analysis of the resulting spectral coefficients, and the start and end times of said sub-sequence are determined, in particular for the preparation of an audio summary of the piece of music.

Description

Method for processing a sound sequence, such as a musical piece

The present invention relates to the processing of a sound sequence, such as a piece of music or, more generally, a sound sequence comprising the repetition of a sub-sequence.

Distributors of musical productions, recorded for example on CD, cassette or another medium, make kiosks available to potential customers, where customers can listen to music of their choice or to music promoted because of its novelty. When a customer recognizes a verse or a chorus of the musical piece he is listening to, he can decide to buy the corresponding musical production.

More generally, a listener of average attention concentrates more on a sequence of verse and chorus than on the introduction of the piece, in particular. It will thus be understood that an audio summary comprising at least one verse and one chorus would be sufficient for broadcast in kiosks of the aforementioned type, rather than broadcasting the complete musical production.

In another application such as the transmission of sound data by mobile telephone, it will be understood that downloading the complete piece of music to a mobile terminal, from a remote server, is much longer and, therefore, more expensive than downloading a sound summary of the aforementioned type.

Similarly, in the context of e-commerce, sound summaries can be downloaded to a computer station communicating with a remote server via an extended network of the Internet type. The user of the computer station can thus order a musical production whose sound summary he appreciates.

However, detecting a verse and a chorus by ear, and thus creating a sound summary for every musical production distributed, would be a prohibitively laborious task.

The present invention improves the situation.

One of the aims of the present invention is to propose an automated detection of a repeated subsequence in a sound sequence.

Another object of the present invention is to propose an automated creation of sound summaries of the type described above.

To this end, the present invention relates firstly to a method of processing a sound sequence, in which: a) a spectral transform is applied to said sequence in order to obtain spectral coefficients varying as a function of time in said sequence. The method within the meaning of the invention further comprises the following steps: b) at least one sub-sequence repeated in said sequence is determined by statistical analysis of said spectral coefficients, and c) the start and end instants of said sub-sequence in the sound sequence are evaluated.

Advantageously, according to an additional step: d) the above-mentioned sub-sequence is extracted in order to store, in a memory, sound samples representing said sub-sequence.

Preferably, the extraction of step d) relates to at least one sub-sequence whose duration is the greatest and/or a sub-sequence whose repetition frequency is the greatest in said sequence.

The present invention finds an advantageous application in assisting the detection of failures of industrial machines or of engines, in particular by obtaining sound recordings of acceleration and deceleration phases of the engine speed. The application of the method within the meaning of the invention makes it possible to isolate a sound sub-sequence corresponding, for example, to full speed or to an acceleration phase, this sub-sequence being compared, if necessary, to a reference sub-sequence.

In another advantageous application to obtaining musical data of the type described above, the aforementioned sound sequence is a piece of music comprising a succession of sub-sequences among at least an introduction, a verse, a chorus, a transition bridge, a theme, a motif, or a movement which is repeated in the sequence.

In step c), at least the respective start and end instants of a first sub-sequence and of a second sub-sequence are preferably determined.

In a particularly advantageous embodiment, in step d), a first and a second sub-sequence are then extracted to obtain, on a memory medium, a sound summary of said piece of music comprising at least the first sub-sequence chained with the second subsequence.

Preferably, the first sub-sequence corresponds to a verse and the second sub-sequence corresponds to a chorus.

However, it may happen that the first and second subsequences, extracted from a sound sequence, are not contiguous in time.

To this end, the following steps are also provided: d1) detecting at least one tempo of the first sub-sequence and/or of the second sub-sequence to estimate the average duration of a measure at said tempo, as well as at least one end segment of the first sub-sequence and at least one start segment of the second sub-sequence, of respective durations corresponding substantially to said average duration and separated in the sequence by a whole number of average durations; d2) generating at least one transition measure of duration corresponding to said average duration and comprising an addition of sound samples of at least said end segment and at least said start segment; d3) concatenating the first sub-sequence, the transition measure or measures and the second sub-sequence to obtain the sequence of the first and the second sub-sequence.

It will be noted that the succession of steps d1) to d3) finds, beyond the automatic generation of sound summaries, an advantageous application in computer-assisted musical creation. In this application, a user can create two sub-sequences of a musical piece himself, while software comprising instructions for carrying out steps d1) to d3) ensures a concatenation of the two sub-sequences that is free of artifacts and pleasant to the ear.

More generally, the present invention also relates to a computer program product, stored in a computer memory or on a removable medium suitable for cooperating with a corresponding computer reader, and comprising instructions for carrying out the steps of the method within the meaning of the invention.

Other characteristics and advantages of the invention will appear on examining the detailed description below and the attached drawings, in which:
- FIG. 1a represents an audio signal of a piece of music corresponding, in the example shown, to a variety song;
- FIG. 1b represents the variation of spectral energy as a function of time for the piece of music whose audio signal is shown in FIG. 1a;
- FIG. 1c illustrates the durations occupied by the different passages of the piece of music of FIG. 1a that are repeated in this piece;
- FIG. 2 schematically represents time windows selected in two respective parts of the musical piece to prepare the concatenation of these two parts, according to the succession of steps d1) to d3) above;
- FIG. 3a schematically represents segments si(t) and sj(t) selected from the respective parts of the aforementioned piece, to prepare a concatenation of the two parts by superposition/addition;
- FIG. 3b schematically illustrates by the sign "Θ" the above superposition/addition;
- FIG. 4 illustrates a time window for the above concatenation, of preferred shape and width; and
- FIG. 5 represents a flow diagram for processing a sound sequence, in a preferred embodiment of the present invention.

The audio signal in FIG. 1a represents the sound intensity (on the ordinate) as a function of time (on the abscissa) of a musical piece (here, the song "Head Over Feet" by the artist Alanis Morissette). To build this audio signal, the respective signals of the right and left channels (in stereophonic mode) have been synchronized and added.

To the audio signal represented in FIG. 1a, a spectral transform is applied (for example of the fast Fourier transform FFT type) to obtain a temporal variation of the spectral energy of the type represented in FIG. 1b.

In one embodiment, a plurality of successive short-term FFTs is computed, the result of which is applied to a filter bank covering several frequency ranges (preferably with bandwidths that increase as the logarithm of the frequency). Another Fourier transform is then applied to obtain dynamic parameters of the audio signal (referenced PD in FIG. 1b). In particular, the ordinate scale of FIG. 1b indicates the amplitude of the variations of the components at different speeds in a given frequency domain. Thus, the index 0 or 2 of the arbitrary ordinate scale of FIG. 1b corresponds to a slow variation in the low frequencies, while the index 12 of this same scale corresponds to a rapid variation in the high frequencies. These variations are expressed as a function of time, on the abscissa (in seconds). The intensities associated with these dynamic parameters PD over time are illustrated by different gray levels, whose relative values are indicated by the reference column COL (on the right of FIG. 1b).
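As a rough illustration of the chain described above (short-term FFTs, a filter bank with logarithmically spaced bands, then a second Fourier transform over time), the following Python sketch uses NumPy. The function name `dynamic_features`, the frame, hop and block sizes, the 50 Hz lower band edge and the rectangular band shapes are all assumptions made for illustration; the text does not specify them.

```python
import numpy as np

def dynamic_features(signal, sr, frame_len=1024, hop=512, n_bands=8, mod_len=64):
    """Hypothetical sketch: STFT -> log-spaced band energies -> FFT over time."""
    # Successive short-term FFTs on Hann-windowed frames.
    n_frames = 1 + (len(signal) - frame_len) // hop
    window = np.hanning(frame_len)
    frames = np.stack([signal[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2        # (n_frames, n_bins)

    # Filter bank: band edges spaced logarithmically in frequency, so that
    # bandwidths grow with frequency (rectangular bands, an assumption).
    freqs = np.fft.rfftfreq(frame_len, d=1.0 / sr)
    edges = np.geomspace(50.0, sr / 2.0, n_bands + 1)
    bands = np.stack([power[:, (freqs >= lo) & (freqs < hi)].sum(axis=1)
                      for lo, hi in zip(edges[:-1], edges[1:])], axis=1)

    # Second Fourier transform along time, over blocks of mod_len frames:
    # low output indices correspond to slow variations of a band's energy,
    # high indices to fast variations.
    n_blocks = n_frames // mod_len
    blocks = bands[:n_blocks * mod_len].reshape(n_blocks, mod_len, n_bands)
    return np.abs(np.fft.rfft(blocks, axis=1))   # (n_blocks, mod_len//2 + 1, n_bands)
```

Each row of the result can then serve as one time-indexed observation for the statistical analysis.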

It is indicated that the dynamic parameters of the type represented in FIG. 1b make it possible to completely identify a piece of music. In this context of a "fingerprint" of a piece of music, application FR-2834363 from the Applicant describes these parameters and the manner of obtaining them in detail.

As a variant, the variables deduced from the audio signal and making it possible to characterize the piece of music can be of different types, in particular the so-called "Mel Frequency Cepstral Coefficients". Overall, it is indicated that these coefficients (known per se) are likewise obtained by short-term fast Fourier transform.

FIG. 1c provides a visual representation of the evolution of the spectral energy of FIG. 1b. In FIG. 1c, the abscissa represents time (in seconds) and the ordinate represents the different parts of the piece, such as verses, choruses, introduction, theme, or others. The repetition over time of a similar part, such as a verse or a chorus, is represented by shaded rectangles which appear at different abscissas in time (and which can be of different temporal widths), but at the same ordinate. To pass from the representation of FIG. 1b to the representation of FIG. 1c, a statistical analysis is implemented using, for example, the "K-means" algorithm, the "fuzzy K-means" algorithm, or a hidden Markov model, with learning by the Baum-Welch algorithm followed by an evaluation by the Viterbi algorithm.

Typically, the number of states (the parts of the piece of music) necessary for the representation of a piece of music is determined in an automated manner, by comparing the similarity of the states found at each iteration of the algorithms above and eliminating redundant states. This technique, known as "pruning", thus makes it possible to isolate each redundant part of the piece of music and to determine its time coordinates (its start and end times, as indicated above).
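Of the algorithms named above, K-means is the simplest to sketch. The Python example below assigns each feature frame to one of `k` states. The function name `kmeans_states` and the deterministic farthest-first initialisation are assumptions made so that the sketch needs no random seed; it is not presented as the analysis actually used for FIG. 1c.

```python
import numpy as np

def kmeans_states(features, k, n_iter=20):
    """Label each feature frame with one of k states using plain K-means."""
    features = np.asarray(features, dtype=float)
    # Farthest-first initialisation (an assumption): start from frame 0, then
    # repeatedly add the frame farthest from the centroids chosen so far.
    centers = [features[0]]
    for _ in range(1, k):
        d = np.min([np.linalg.norm(features - c, axis=1) for c in centers], axis=0)
        centers.append(features[np.argmax(d)])
    centers = np.array(centers)

    for _ in range(n_iter):
        # Assignment step: nearest centroid for every frame.
        dists = np.linalg.norm(features[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid becomes the mean of its frames.
        for c in range(k):
            if np.any(labels == c):
                centers[c] = features[labels == c].mean(axis=0)

    # Final assignment against the updated centroids.
    dists = np.linalg.norm(features[:, None, :] - centers[None, :, :], axis=2)
    return dists.argmin(axis=1), centers
```

Frames belonging to two occurrences of the same passage should receive the same label, which is what produces rectangles at the same ordinate in a representation such as FIG. 1c.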
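One possible reading of this pruning step is sketched below in Python: states whose centroids are closer than a threshold are merged and the frame labels are renumbered accordingly. The function name `prune_states`, the Euclidean distance and the fixed `threshold` are assumptions; the text only says that similar states are compared and redundant ones eliminated.

```python
import numpy as np

def prune_states(centers, labels, threshold):
    """Merge states whose centroids lie closer than `threshold` (hypothetical criterion)."""
    centers = np.asarray(centers, dtype=float)
    keep = []     # indices of the centroids that are retained
    remap = {}    # old state index -> new state index
    for c in range(len(centers)):
        for new_idx, kept in enumerate(keep):
            if np.linalg.norm(centers[c] - centers[kept]) < threshold:
                remap[c] = new_idx        # redundant state: reuse the kept one
                break
        else:
            remap[c] = len(keep)          # genuinely new state
            keep.append(c)
    new_labels = np.array([remap[int(l)] for l in labels])
    return centers[keep], new_labels
```

Starting with deliberately too many states and pruning after each iteration lets the number of parts be found without fixing it in advance.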

Thus, we study the variations, for example in the tonal frequencies (of a human voice), of the spectral energy to determine the repetition of a particular musical passage in the audio signal.

Preferably, one seeks to extract one or more musical passages whose duration is the greatest in the piece of music and / or whose frequency of repetition is the most important.
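The two selection criteria can be sketched as follows in Python, assuming the passages are already available as `(label, start, end)` tuples in seconds. The function name `pick_repeated` and this tuple layout are hypothetical.

```python
from collections import defaultdict

def pick_repeated(segments):
    """Return (most frequently repeated label, repeated label with the longest occurrence)."""
    by_label = defaultdict(list)
    for label, start, end in segments:
        by_label[label].append((start, end))
    # Only passages that occur at least twice count as repeated.
    repeated = {l: occ for l, occ in by_label.items() if len(occ) >= 2}
    if not repeated:
        return None, None
    most_frequent = max(repeated, key=lambda l: len(repeated[l]))
    longest = max(repeated, key=lambda l: max(e - s for s, e in repeated[l]))
    return most_frequent, longest
```

For a typical variety song, the first value would be expected to name the chorus and the second the verse, although nothing guarantees this for a given piece.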

For example, for most variety pieces, one can choose to isolate the chorus parts, whose repetition is generally the most frequent, then the verse parts, whose repetition is frequent, then, if necessary, other parts if they are repeated. It is indicated that other types of sub-sequences representative of the piece of music can be extracted, provided that these sub-sequences are repeated in the piece of music. For example, one can choose to extract a musical motif, generally shorter than a verse or a chorus, such as a percussion passage repeated in the song, or a vocal phrase repeated several times in the song. A theme can also be extracted from the piece of music, for example a musical phrase repeated in a piece of jazz or classical music. In classical music, a passage such as a movement can also be extracted.

On the visual summary shown as an example in FIG. 1c, the shaded rectangles indicate the presence of a part of the song, such as the introduction ("intro"), a verse or a chorus, in a time window indicated by the time abscissa (in seconds). Thus, between 0 and about 15 seconds, the piece of music starts with an introduction (indexed by the number 2 on the ordinate scale). The introduction is followed by two alternations of verse (indexed by the number 3) and refrain (indexed by the number 1) up to approximately 100 seconds.

Reference is now made to FIG. 5 to describe the main steps of the method for obtaining the abovementioned sound summary, according to a preferred embodiment. First, the audio signals on the left channel "audio L" and on the right channel "audio R" are obtained in the respective steps 10 and 11, when the initial sound sequence is represented in stereophonic mode. The signals from these two channels are added in step 12 to obtain an audio signal of the type shown in FIG. 1a. This audio signal is, if necessary, stored in sampled form in a working memory, with sound intensity values arranged as a function of their associated time coordinates (step 14). A spectral transform (of FFT type in the example shown) is applied to this audio data in step 16 to obtain, in step 18, the spectral coefficients Fi(t) and/or their variation ΔFi(t) as a function of time. In step 20, a statistical analysis module operates on the basis of the coefficients obtained in step 18 to isolate instants t₀, t₁, ..., t₇ which correspond to the start and end instants of the various sub-sequences that are repeated in the audio signal of step 14.

In the example shown, the piece of music has a structure (classic in variety) of the type comprising:
- an introduction at the start of the piece, between an instant t₀ and an instant t₁,
- a verse between t₁ and t₂,
- a refrain between t₂ and t₃,
- a second verse between t₃ and t₄,
- a second refrain between t₄ and t₅,
- an introduction, again, if necessary with an instrumental solo, between the instants t₅ and t₆, and
- the repetition of two refrains at the end of the piece, between the instants t₆ and t₇.

In step 22, the instants t₀ to t₇ are listed and indexed as a function of the corresponding musical passage (introduction, verse or refrain) and stored, if necessary, in a working memory. In step 23, a visual summary of this piece of music can then be constructed, as shown in FIG. 1c.

In the example described above of a variety piece with a typical structure, the sound summary is constructed from a verse extracted from the piece, followed by a chorus extracted from the piece. In step 24, a concatenation of the sound samples of the audio signal is prepared between the instants t₁ and t₂, on the one hand, and between the instants t₂ and t₃, on the other hand, in the example described. If necessary, the result of this concatenation is stored in a permanent memory MEM for later use, in step 26.
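The last part of step 20, turning a per-frame sequence of state labels into start and end instants, can be sketched in a few lines of Python. The function name `segments_from_labels` and the fixed `frame_duration` are assumptions; how the labels are obtained is left to whichever statistical analysis is used.

```python
import numpy as np

def segments_from_labels(labels, frame_duration):
    """Turn a per-frame label sequence into (label, start_time, end_time) tuples."""
    labels = np.asarray(labels)
    # Frame indices at which the label changes mark the boundaries.
    change = np.flatnonzero(np.diff(labels)) + 1
    starts = np.concatenate(([0], change))
    ends = np.concatenate((change, [len(labels)]))
    return [(int(labels[s]), float(s * frame_duration), float(e * frame_duration))
            for s, e in zip(starts, ends)]
```

The boundary times returned here play the role of the instants t₀, t₁, ..., t₇, and the labels indicate which passages repeat.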
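When the two passages are adjacent in time, as with t₁ to t₂ and t₂ to t₃ here, step 24 amounts to slicing the sampled signal and joining the slices. A minimal Python sketch follows, in which the helper names `extract` and `naive_summary` and the sample-rate argument `sr` are assumptions.

```python
import numpy as np

def extract(signal, sr, t_start, t_end):
    """Return the samples of `signal` between two instants given in seconds."""
    return signal[int(round(t_start * sr)):int(round(t_end * sr))]

def naive_summary(signal, sr, verse, chorus):
    """Join a verse and a chorus, each given as a (start, end) pair in seconds."""
    return np.concatenate([extract(signal, sr, *verse),
                           extract(signal, sr, *chorus)])
```

This direct joining makes no attempt to preserve the tempo across the junction, so it is only suitable when the end of the first passage coincides with the start of the second.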

However, as a general rule, the end time of an isolated verse and the start time of an isolated chorus are not necessarily identical; alternatively, one can choose to construct the sound summary from the first verse and the second chorus (between t₄ and t₅) or the end chorus (between t₆ and t₇). Thus, the two passages selected to build the sound summary are not necessarily contiguous. A blind concatenation of sound signals corresponding to two parts of a piece of music gives an unpleasant impression to the ear. The construction of a sound signal by concatenation of two parts of a piece of music, so as to overcome this problem, is described below with reference to FIGS. 2, 3a, 3b and 4.

One of the aims of this concatenation construction is to locally preserve the tempo of the sound signal.

Another aim is to ensure a temporal distance between concatenation points (or "alignment" points) equal to an integer multiple of the duration of a measure.

Preferably, this concatenation is carried out by superposition/addition of sound segments selected and isolated from the two aforementioned respective parts of the piece of music.

A superposition/addition of such sound segments is described below, firstly with beat synchronization (called "beat-synchronous"), then with measure synchronization according to a preferred embodiment.

The following notation is used below:
- bpm, the number of beats per minute of a piece of music;
- D, the reference note value of this number bpm (for example, in the case of a piece marked "quarter note = 120", bpm = 120 and D = quarter note);
- T, the duration (expressed in seconds) of a beat, that is to say of the reference D: in the previous example, where D = quarter note, we have $T = 60/\mathrm{bpm}$;
- N, the numerator of the metric of the piece of music (for example, in the case of a measure noted "3/4", N = 3);
- M, the duration (expressed in seconds) of a measure, given by the relation $M = N \cdot T$ (i.e. $M = 3 \times 60/120$ in the previous example);
- $s(t)$, the audio signal of a piece of music;
- $\hat{s}(t)$, the signal reconstructed by superposition/addition; and
- $s_i(t)$ and $s_j(t)$, the i-th and j-th segments, comprising respective audio signals belonging to a first and a second passage of a piece of music, and which are used for the construction of $\hat{s}(t)$ by superposition/addition.
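With the values used in the examples of this notation (bpm = 120, N = 3), the beat and measure durations work out as follows:

```latex
T = \frac{60}{\mathrm{bpm}} = \frac{60}{120} = 0.5\ \text{s},
\qquad
M = N \cdot T = 3 \times 0.5\ \text{s} = 1.5\ \text{s}.
```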

In principle, the first and second passages mentioned above are not contiguous. s (t) is then obtained as follows.

Referring to FIG. 2, the segments $s_i(t)$ and $s_j(t)$ are first formed by cutting the audio signal using a time window $h_L(t)$, of width L and defined (with non-zero value) between 0 and L. This window can be of rectangular type, of so-called "Hanning" type, of so-called "Hanning with plateau" type, or other. Referring to FIG. 4, a preferred type of time window is obtained by concatenating a rising edge, a plateau and a falling edge. The preferred time width of this window is indicated below.

We then define the first segment $s_i(t)$ so that: $s_i(t) = s(t + m_i) \cdot h_L(t)$ [1], where $m_i$ is the start time of the first segment.

As shown in FIG. 3a, we construct $s_j(t)$ in substantially the same way: $s_j(t) = s(t + m_j) \cdot h_L(t)$ [1bis], where $m_j$ is the start time of the second segment.
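A window made of a rising edge, a plateau and a falling edge can be sketched as below in Python. The raised-cosine shape of the edges, the half-sample offset and the function name `plateau_window` are assumptions; the text fixes only the three-part structure.

```python
import numpy as np

def plateau_window(length, ramp):
    """Window of `length` samples: rising edge, flat plateau at 1, falling edge."""
    if ramp < 0 or 2 * ramp > length:
        raise ValueError("ramp must satisfy 0 <= 2 * ramp <= length")
    w = np.ones(length)
    if ramp > 0:
        # Raised-cosine edge sampled at half-sample offsets, so that the rising
        # edge and the time-reversed falling edge sum exactly to 1.
        rise = 0.5 * (1.0 - np.cos(np.pi * (np.arange(ramp) + 0.5) / ramp))
        w[:ramp] = rise
        w[length - ramp:] = rise[::-1]
    return w
```

Because the two edges are complementary, overlapping the falling edge of one segment with the rising edge of the next keeps the overall gain constant across the junction.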

Even if the duration L of the time window is the same for the two segments, it is indicated that the shape of the window can nevertheless differ from one segment $s_i(t)$ to the other $s_j(t)$, as is moreover shown in FIG. 2.

Let $b_i$ and $b_j$ be two respective positions inside the first and second segments, called "synchronization positions", with respect to which the superposition/addition takes place, such that: $0 \le b_i \le L$ and $0 \le b_j \le L$ [2]

Advantageously, the temporal distance between $b_i$ and $b_j$ is chosen equal to an integer multiple of the duration T of a beat ($b_j - b_i = kT$). Under these conditions, the reconstruction is said to be "beat-synchronous" if

$\hat{s}(t) = \sum_i s'_i\big(t - (i-1)\,k'T + c\big)$ [4]

with $s'_i(t) = s_i(t + b_i)$ [5], where $k'$ is the largest integer such that $k'T \le L - (b_i - m_i)$ and c is a time constant such that $c = -b_i - m_i$. Advantageously, the distance between the instants $m_i$ and $m_j$ is chosen equal to an integer multiple of $k'NT$, in which N denotes the numerator of the metric.

Thus, the reconstructed signal is written: $\hat{s}(t) = \sum_i s'_i\big(t - (i-1)\,k'NT + c\big)$

We then obtain a measure-synchronous superposition/addition. FIG. 3b illustrates this situation. Note in FIG. 4 that the width L of the aforementioned time window is close to $k'NT$ (to within the rising and falling edges). However, in this case the edge ramps are preferably chosen such that $k'T \le L - 2(b_i - m_i)$.

More particularly, the instants $m_i$ and $m_j$ are chosen so that they correspond to first beats of a measure. Under these conditions, a so-called "aligned" beat-synchronous superposition/addition is advantageously obtained.
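The following Python sketch shows a discrete version of such a measure-synchronous superposition/addition under simplifying assumptions: successive segments are placed a whole number of measures apart (`hop` samples, standing in for k'NT), the synchronization positions and the constant c are folded into the choice of segment start, and linear edge ramps are used. The names `overlap_add`, `hop`, `ramp` and the numerical values are illustrative only.

```python
import numpy as np

def overlap_add(segments, hop):
    """Add windowed segments, segment number i starting i * hop samples into the output."""
    total = max(i * hop + len(seg) for i, seg in enumerate(segments))
    out = np.zeros(total)
    for i, seg in enumerate(segments):
        out[i * hop:i * hop + len(seg)] += seg
    return out

# Illustrative values (assumptions): 120 bpm, 4 beats per measure, 1 measure per hop.
sr, bpm, N, k = 1000, 120, 4, 1
T = 60.0 / bpm                        # beat duration in seconds
hop = int(round(k * N * T * sr))      # k * N * T expressed in samples
ramp = 100                            # length of each edge ramp, in samples

# Window: linear rising edge, plateau, linear falling edge; width = hop + ramp,
# so that the falling edge of one segment overlaps the rising edge of the next.
rise = (np.arange(ramp) + 0.5) / ramp
window = np.concatenate([rise, np.ones(hop - ramp), rise[::-1]])

# Two constant test passages stand in for the windowed segments.
first = np.ones(hop + ramp) * window
second = np.ones(hop + ramp) * window
summary = overlap_add([first, second], hop)
```

With complementary ramps the summed gain stays at 1 through the transition, and because `hop` is a whole number of measures the beat grid of the second passage continues that of the first.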

Thus, by further determining the metric of the first passage and/or of the second passage, it is possible to perform a beat-synchronous reconstruction to the measure. If, in addition, the first and second segments are chosen so that they begin on the first beat of a measure, this beat-synchronous reconstruction is aligned. It is indicated that a reconstruction of the signal $\hat{s}(t)$ can be carried out on the basis of more than two musical passages to be concatenated. For i musical passages (i > 2), the generalization of the above process is expressed by the relation: $\hat{s}(t) = s'_1(t + c) + s'_2(t - k'_1 T + c) + s'_3\big(t - (k'_1 T + k'_2 T) + c\big) + \dots$

Each integer $k'_j$ is defined as the largest integer such that $k'_j T \le L_j - (b_j - m_j)$, where $L_j$ corresponds to the width of the window of the j-th musical passage to be concatenated.

It is indicated that the first beats of the measures, the metric, or even the tempo of a piece of music can be detected automatically, for example by using existing software applications. For example, the MPEG-7 standard (Audio Version 2) provides for the determination and description of the tempo and the metric of a piece of music, using such software applications.
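The definition of each integer can be evaluated directly, as in the Python sketch below. The function names, the small tolerance against floating-point rounding, and the assumption that the delays of successive passages accumulate (0, k'₁T, (k'₁ + k'₂)T, ...) are illustrative choices, not statements taken from the text.

```python
import math

def largest_k(window_len, b, m, beat):
    """Largest integer k' such that k' * beat <= window_len - (b - m)."""
    # The tiny tolerance guards against results such as 7.9999999 where the
    # exact quotient is 8 (an implementation choice).
    return math.floor((window_len - (b - m)) / beat + 1e-9)

def passage_delays(ks, beat):
    """Cumulative delays, in seconds, applied to successive passages (assumed cumulative)."""
    delays, total = [0.0], 0
    for k in ks[:-1]:
        total += k
        delays.append(total * beat)
    return delays
```

For instance, with a 4.25 s window, a synchronization position 0.1 s after the segment start and a 0.5 s beat, `largest_k(4.25, 0.6, 0.5, 0.5)` evaluates the floor of 8.3.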

Of course, the present invention is not limited to the embodiment described above by way of example; it extends to other variants.

Thus, it will be understood that the sound summary may include more than two musical passages, for example an introduction, a verse and a chorus, or even two passages other than a verse and a chorus, such as the introduction and a chorus, for example.

It will also be noted that the steps represented in the form of a flowchart in FIG. 5 can be implemented by computer software, the algorithm of which generally takes up the structure of the flowchart. As such, the present invention also relates to such a computer program.

Claims

1. A method for processing a sound sequence, in which: a) a spectral transform is applied to said sequence in order to obtain spectral coefficients varying as a function of time in said sequence, characterized in that it further comprises the following steps: b) at least one sub-sequence repeated in said sequence is determined by statistical analysis of said spectral coefficients, and c) the instants of the start and end of said sub-sequence in the sound sequence are evaluated.

2. Method according to claim 1, characterized in that it further comprises a step: d) of extracting the sub-sequence to store, in a memory, sound samples representing said sub-sequence.

3. Method according to claim 2, characterized in that the extraction of step d) relates to at least one sub-sequence whose duration is the greatest and / or a sub-sequence whose repetition frequency is the most important in said sequence.

4. Method according to one of claims 1 to 3, wherein the sound sequence is a piece of music comprising a succession of sub-sequences among at least an introduction, a verse, a refrain, a transition bridge, a theme, a motif, a movement, characterized in that, in step c), at least the respective start and end instants of a first sub-sequence and of a second sub-sequence are determined.

5. Method according to claim 4, taken in combination with claim 3, characterized in that the first sub-sequence corresponds to a verse and the second sub-sequence corresponds to a chorus.

6. Method according to one of claims 4 and 5, taken in combination with claim 2, characterized in that, in step d), a first and a second sub-sequence are extracted to obtain, on a memory medium, a sound summary of said piece of music comprising at least the first sub-sequence chained with the second sub-sequence.

7. The method as claimed in claim 6, in which the extracted sub-sequences are not contiguous in time, characterized in that it further comprises the following steps: d1) detecting at least one tempo of the first sub-sequence and/or of the second sub-sequence to estimate the average duration of a measure at said tempo, as well as at least one end segment of the first sub-sequence and at least one start segment of the second sub-sequence, of respective durations corresponding substantially to said average duration and separated in the sequence by a whole number of average durations; d2) generating at least one transition measure of duration corresponding to said average duration and comprising an addition of sound samples of at least said end segment and at least said start segment; d3) concatenating the first sub-sequence, the transition measure or measures and the second sub-sequence to obtain a sequence of the first and the second sub-sequence.

8. Method according to claim 7, characterized in that step d1) comprises a division into at least two windows, of rectangular type, of Hanning type, of Hanning type with plateau, or preferably of a type comprising a rising edge, a plateau and a falling edge in time.

9. Method according to one of claims 7 and 8, characterized in that step d2) comprises a beat-synchronous reconstruction.

10. Method according to claim 9, characterized in that, in step d1), the metric of the first sub-sequence and/or of the second sub-sequence is further determined, and in that step d2) includes a beat-synchronous reconstruction to the measure.

11. Method according to one of claims 9 and 10, characterized in that, in step d1), said end and start segments are determined so that they begin on the first beat of a measure, and in that step d2) includes an aligned beat-synchronous reconstruction.

12. Computer program product, stored in a computer memory or on a removable medium suitable for cooperating with a computer reader, characterized in that it includes instructions for carrying out the steps of the method according to one of the preceding claims.

[Drawings. Sheet 1/3: Alanis Morissette, "Head Over Feet"; abscissa: time (seconds). Sheet 2/3: FIG. 2. FIG. 4: time window of width k'NT.]