EP2509073A1 - Time-stretching of an audio signal - Google Patents

Time-stretching of an audio signal Download PDF

Info

Publication number
EP2509073A1
Authority
EP
European Patent Office
Prior art keywords
grains
audio signal
time
grain
texture
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
EP11305407A
Other languages
German (de)
French (fr)
Inventor
Alexis Moinet
Thierry Dutoit
Philippe Latour
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Universite de Mons
Evs International (swiss) Sarl
Original Assignee
Universite de Mons
Evs International (swiss) Sarl
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Universite de Mons, Evs International (swiss) Sarl filed Critical Universite de Mons
Priority to EP11305407A priority Critical patent/EP2509073A1/en
Priority to PCT/EP2012/001540 priority patent/WO2012136380A1/en
Publication of EP2509073A1 publication Critical patent/EP2509073A1/en
Withdrawn legal-status Critical Current

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/04 Time compression or expansion


Abstract

Time-stretching method for changing the length of an audio signal without changing its perceived content, comprising the steps of:
a) dividing an input audio signal into a set of grains, a grain being a group of consecutive audio samples from the input audio signal, the length of each grain of the set of grains being determined so as not to split a transient of the audio signal between two successive grains,
b) inserting the grains in an output audio signal, the insertion being done at a rate corresponding to a speed factor, and
c) synthesizing an audio texture to fill every gap between two successive grains in the output audio signal.

Description

  • The invention generally relates to the time-stretching of an audio signal.
  • Time-stretching (or time-scaling) permits changing the length of an audio signal without changing its perceived content. It differs from simple re-sampling (up-sampling or down-sampling), which changes both the length of the audio signal and its spectral content. Indeed, with simple re-sampling, the frequencies contained in the audio signal are shifted upward or downward when the audio signal length is respectively reduced or increased. This is illustrated by the low-pitched sounds heard in some movies during slow-motion scenes.
  • For example, time-stretching may be used to increase the length of an audio signal in order to preserve synchronization with a video during slow-motion playback. In the same way, it may be used to decrease the length of the audio signal in order to preserve synchronization with a video during fast-motion playback.
  • In order to create an output audio signal perceived as similar to the input audio signal but with a different length, one needs to create new information from the input audio signal. This creation of new information may comprise adding information and/or removing information.
  • Time-stretching methods are described in the article of Jordi Bonada Sanjaume, "Audio Time-Scale Modification in the Context of Professional Audio Postproduction", Research work for PhD program, 2002.
  • The known methods generally use time-domain processing, like the SOLA (Synchronous Overlap-Add), WSOLA (Waveform Similarity based Overlap-Add) and PSOLA (Pitch-Synchronous Overlap-Add) algorithms; spectral-domain processing, like phase-vocoder-based solutions; or model-based processing, like LPC (Linear Predictive Coding) and HNM (Harmonic plus Noise Model) methods.
  • These methods make the fundamental assumption that the input audio signal mainly contains voice, music or both. In other words, the input audio signal is assumed to be mainly composed of time-varying sinusoids, with noise considered a negligible background signal having little or no impact on the final result. This sinusoidal hypothesis makes it difficult to apply these methods to audio signals that mainly contain noise, like the audio signals usually encountered during sport events, for example crowd, applause, ball impacts, and/or field events.
  • Besides, transient sounds such as impact sounds are distorted by these methods and therefore need to be detected and processed separately. Transient detection has many solutions when applied to clean audio recordings such as speech or music, but it is much more difficult in noisy environments such as a football or basketball stadium. As a result, it can be very complex to detect and separately process those transient sounds during sport events.
  • Texture-based methods have been developed to synthesize audio content such as crowd noise and applause. Although these methods create similar-sounding signals, they do not manipulate the length of the original signal.
  • In the thesis "Expressive Sound Synthesis for Animation", PhD thesis, INRIA, 2009, C. Picard proposes a texture-based approach to sound re-synthesis and time-stretching. However, it is based on the notion of correlation patterns, and these patterns must be pre-computed, which takes a relatively long time. Moreover, it assumes that all the audio samples are known beforehand. This makes it unsuitable for real-time processing.
  • In the article of Lie Lu, Liu Wenyin, and Hong-Jiang Zhang, "Audio textures: theory and applications", Transactions on Speech and Audio Processing, IEEE, Volume 12, n° 2, pp 156-167, 2004, a method is presented to replace missing parts of an audio stream using texture synthesis. However, this method is not adapted to faster playback speeds. Moreover, it needs a pre-processing of the whole signal and the computation of a similarity matrix. As a consequence, it cannot work in real-time.
  • The invention aims to improve this situation.
  • SUMMARY
  • A first aspect of the invention relates to a time-stretching method for changing the length of an audio signal without changing its perceived content, comprising the steps of:
    1. a) dividing an input audio signal into a set of grains, a grain being a group of consecutive audio samples from the input audio signal, the length of each grain of the set of grains being determined so as not to split a transient of the audio signal between two successive grains,
    2. b) inserting the grains in an output audio signal, the insertion being done at a rate corresponding to a speed factor, and
    3. c) synthesizing an audio texture to fill every gap between two successive grains in the output audio signal.
  • Thanks to these provisions, the performance of the time-stretching is improved. Indeed, the proposed method combines both aspects, time-stretching and textures, in order to manipulate the length of an audio signal without changing its perceived content and without having to process its transients separately. Moreover, since the transients are not split into different grains and there is no overlap between grains, this method does not suffer from the transient duplication artefact that characterizes many time-domain time-stretching algorithms. Any transient from the input signal is copied into the output signal once and only once.
  • Step a) may comprise setting grain boundaries at positions of minimum values of a spectral flux computed on the input audio signal. Indeed, these minimum values correspond to samples where the input audio signal is most stable, in particular to samples where the probability of having a transient is minimized.
  • The spectral flux may be computed by:
    • dividing the input audio signal into overlapping constant-length frames,
    • computing, for each frame n and for each frequency bin k of the input audio signal, a spectral amplitude X(n,k) and a variation of amplitude ΔX(n,k) between the said frame and the previous frame:
      ΔX(n,k) = X(n,k) - X(n-1,k)
    • computing the spectral flux SF by using the variation of amplitude values:
      SF(n) = Σ_{k=1}^{Nfft} [ΔX(n,k) + |ΔX(n,k)|] / 2
  • Step b) may comprise inserting in the output audio signal a grain Gi of the set of grains, which starts at a time ti in the input audio signal, at time ti' being calculated from a starting time ti-1' of a previous grain Gi-1 in the output audio signal S' and from a length Li-1 of the previous grain Gi-1:
      ti' = ti-1' + r·Li-1, with t0' = t0,
    the variable r being equal to the inverse of the speed factor.
  • Step c) may comprise synthesizing an audio texture to fill a gap between two successive grains by using past and current grains, which are stored in a database.
  • The texture is preferably synthesized so as not to contain any transient.
  • The texture may be synthesized by using the N grains closest to the two surrounding grains of the gap to be filled, the N grains being selected in the database, N being a predetermined integer, the measure of closeness being based on the spectral envelope or its parameterization at the position where the two successive grains were split.
  • Synthesizing may comprise selecting in the database a grain which permits either minimizing the Euclidean distance or maximizing the correlation between its M first samples and the M last samples of the grain before the gap to be filled, M being a predetermined integer. Synthesizing may also comprise selecting in the database a grain which permits either minimizing the Euclidean distance or maximizing the correlation between its M last samples and the M first samples of the grain after the gap to be filled, M being a predetermined integer.
  • According to other embodiments of the invention, step c) may comprise synthesizing an audio texture to fill a gap between two successive grains by using linear prediction analysis.
  • According to other embodiments of the invention, step c) may comprise synthesizing an audio texture to fill a gap between two successive grains by using spectral analysis.
  • A second aspect of the invention relates to a time-stretching device for changing the length of an audio signal without changing its perceived content, comprising an extraction block configured to divide an input audio signal into a set of grains and to insert the grains in an output audio signal, a grain being a group of consecutive audio samples from the input audio signal, the grains of the set of grains having variable lengths so as not to split a transient of the audio signal between two successive grains, the insertion being done at a rate corresponding to a speed factor, the time-stretching device further comprising a texture synthesizer block configured to synthesize a texture to fill every gap between two successive grains in the output audio signal.
  • The extraction block may be configured to set grains boundaries at positions of minimum values of a spectral flux computed on the input audio signal.
  • The extraction block may be configured to insert in the output audio signal a grain Gi of the set of grains, which starts at a time ti in the input audio signal, at time ti' being calculated from a starting time ti-1' of a previous grain Gi-1 in the output audio signal S' and from a length Li-1 of the previous grain Gi-1:
      ti' = ti-1' + r·Li-1, with t0' = t0,
    the variable r being equal to the inverse of the speed factor.
  • The texture synthesizer block may be configured to synthesize an audio texture to fill a gap between two successive grains by using past and current grains, which are stored in a database. The texture synthesizer block may also be configured to synthesize the texture so as not to contain any transient.
  • For example, the texture synthesizer block may be configured to synthesize the texture by using the N grains closest to the two surrounding grains of the gap to be filled, the N grains being selected in the database, N being a predetermined integer, the measure of closeness being based on the spectral envelope or its parameterization at the position where the two successive grains were split.
  • BRIEF DESCRIPTION OF THE DRAWINGS
  • Embodiments of the invention are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings, in which like reference numerals refer to similar elements and in which:
    • Fig.1 is a schematic block diagram of a time-stretching device according to a first embodiment of the invention;
    • Fig.2 is a flow chart showing steps of a time-stretching method;
    • Fig.3 is a chart showing an example of input audio signal;
    • Fig.4 is a chart showing the input audio signal of Fig.3 after it has been divided into a set of grains;
    • Fig.5 is a chart showing a spectral flux computed on another example of input audio signal;
    • Fig.6 is a chart showing an output audio signal computed from the input audio signal of Fig.3; and
    • Fig.7 is a schematic view illustrating the insertion of grains in an output audio signal.
    DESCRIPTION OF PREFERRED EMBODIMENTS
  • Fig.1 shows a time-stretching device 1, which is configured to change the length of an audio signal S (Fig.3) without changing its perceived content. The audio signal S is for example the audio part of a sport event signal comprising an audio signal and a video signal. The audio signal mainly contains noise and transients, like the audio signals usually encountered during sport events, for example crowd, applause, ball impacts, and/or field events.
  • The time-stretching device 1 comprises an extraction block 2, which is configured to receive the audio signal S and to divide it into a set of grains G. A grain G is a variable-length group of consecutive audio samples from the audio signal S. The grains G do not overlap and are created so as not to split a transient of the audio signal S between two successive grains G. Their lengths vary between two reasonable values, for example between 5 and 50 ms, preferably between 10 and 40 ms. The extraction block 2 is also configured to insert the grains G in an output audio signal S'.
  • The time-stretching device 1 also comprises a database block 3 configured to memorize the extracted grains G and a texture synthesizer block 4 configured to synthesize a texture to fill gaps between the grains in the output audio signal S' (Fig.6).
  • Referring to Fig.2, a time-stretching method is described below, which may be executed by the time-stretching device 1 of Fig.1.
  • In step S1, the input audio signal S is received by extraction block 2. Then, extraction block 2 divides the input audio signal S into a set of grains G, for example into three grains G0, G1, G2, as represented in Fig.4.
  • The boundaries of the grains G0, G1, G2 may be set at the positions of minimum values of a spectral flux SF computed on the input audio signal S. Indeed, these minimum values correspond to samples where the input audio signal S is most stable, in particular to samples where the probability of having a transient is minimized.
  • In order to compute the spectral flux SF, the input audio signal S is first divided into overlapping constant-length frames n. Then, for each frame n and for each frequency bin k, the spectral amplitude X(n,k) is computed, as well as the variation of amplitude ΔX(n,k) between frame n and frame n-1:
      ΔX(n,k) = X(n,k) - X(n-1,k)
  • The spectral flux SF is then defined as the sum of the positive contributions in the previous result:
      SF(n) = Σ_{k=1}^{Nfft} [ΔX(n,k) + |ΔX(n,k)|] / 2
  • Thus, in this embodiment of the invention, the only parameter for the definition of the spectral flux SF is the constant length of the analysis frames n.
  • Once the spectral flux SF is computed, its local minima are located, with the constraint that they must be separated by at least a first predetermined value, for example 5 ms, and by no more than a second predetermined value, for example 50 ms, which means that some minima may not be taken into consideration.
  • Grain boundaries are then positioned, for each minimum taken into account, at the closest zero-crossing of the audio samples. As a variant, the boundaries can be positioned at the closest energy minimum instead of the closest zero-crossing.
  • Fig.5 shows an embodiment where the audio signal S corresponds to a recording of a football match. Curve CSF represents the spectral flux SF computed for the audio signal S. Lines B1 to B18 represent grain boundaries positioned at local minima, for example at the closest zero-crossing. In this example, the set of grains G comprises nineteen grains G, each delimited by consecutive grain boundaries B.
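  • By way of illustration only (the following sketch is not part of the original disclosure), the boundary detection described above could be implemented as follows in Python. The frame length, hop size and separation constraint are assumptions, expressed in samples at an assumed 44.1 kHz sampling rate; the maximum-separation constraint (forcing a boundary at least every 50 ms) is omitted for brevity.

```python
import numpy as np

def spectral_flux(s, frame_len=1024, hop=512):
    """Half-wave rectified spectral flux SF(n) of signal s."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(s) - frame_len) // hop
    sf = np.zeros(n_frames)
    X_prev = None
    for n in range(n_frames):
        frame = s[n * hop : n * hop + frame_len] * window
        X = np.abs(np.fft.rfft(frame))             # spectral amplitude X(n,k)
        if X_prev is not None:
            dX = X - X_prev                        # variation ΔX(n,k)
            sf[n] = np.sum((dX + np.abs(dX)) / 2)  # keep positive contributions
        X_prev = X
    return sf

def grain_boundaries(s, sf, hop=512, min_sep=220):
    """Local minima of SF, at least min_sep samples apart (~5 ms at 44.1 kHz),
    each snapped to the closest zero-crossing of s."""
    zc = np.where(np.diff(np.signbit(s)))[0]       # zero-crossing positions
    bounds, last = [], -min_sep
    for n in range(1, len(sf) - 1):
        if sf[n] <= sf[n - 1] and sf[n] <= sf[n + 1]:
            pos = n * hop
            if pos - last >= min_sep and len(zc):
                pos = int(zc[np.argmin(np.abs(zc - pos))])  # snap to zero-crossing
                bounds.append(pos)
                last = pos
    return bounds
```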
  • Thus, the extraction step is based on a time-domain algorithm with a variable grain size, allowing the management of transients.
  • Some variants can be considered, such as a multi-band spectral flux, which means that a separate spectral flux is computed for each band of a filter bank. This variant may improve the precision of the extraction step.
  • In step S2, the grains G, for example the grains G0, G1, G2 in the example of Fig.4, are inserted in an output audio signal S', as represented in Fig.6. The output audio signal S' corresponds to the audio signal S in slow-motion, the insertion being done at a rate corresponding to the speed factor of the slow-motion. In other words, empty spaces or gaps H0, H1, H2 (Fig.7) are inserted between the grains G0, G1, G2 of the input audio signal S in order to obtain an output signal S' whose length is equal to the original length divided by the speed factor of the slow-motion. For example, a 33% slow-motion signal S' is three times longer than the original signal S, its length being the original length divided by 0.33.
  • We consider a grain Gi starting at time ti in the input audio signal S. Its position ti' in the output audio signal S' can be calculated from the position ti-1' and the length Li-1 of the grain Gi-1 in the output audio signal S':
      ti' = ti-1' + r·Li-1, with t0' = t0,
    where the variable r is equal to the inverse of the speed factor. For example, r = 3 if the speed factor is 33.333%. The speed factor and the variable r can be changed with every grain G, for example if the user changes the playback speed of the video between grains of the input audio signal S.
  • Note that time ti' is not necessarily an integer. Indeed, the closest integer value may be used instead of time ti' to position the grain Gi in the output audio signal S', but its real value ti' is preferably used to compute ti+1' in order to minimize rounding errors.
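  • To make the bookkeeping concrete, here is a minimal sketch (an illustration, not the original implementation) of this positioning rule, keeping the real-valued positions so that rounding errors do not accumulate:

```python
def position_grains(grain_lengths, r):
    """Output positions of each grain for a speed-factor inverse r
    (e.g. r = 2.0 for a 50% slow-motion)."""
    positions, t = [], 0.0           # t0' = t0 = 0
    for L in grain_lengths:
        positions.append(round(t))   # rounded value used for insertion
        t += r * L                   # real value kept for the next grain
    return positions

# Three 1000-sample grains at 50% slow-motion (r = 2) are placed at
# [0, 2000, 4000], leaving 1000-sample gaps H to fill with textures.
print(position_grains([1000, 1000, 1000], 2.0))
```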
  • Fig.7 illustrates the positioning of grains G0, G1, G2 in the output audio signal S' for a 50% slow-motion, that is to say a speed factor of 50%. Dashed rectangles represent gaps H0, H1, H2 that have to be filled with textures.
  • If an accumulation of rounding errors leads to a non-negligible offset between the audio and the video, it can be corrected by re-synchronizing the grain Gi with the video frame it was extracted from. An offset is considered as non-negligible if the viewer can notice the offset between sound and video.
  • The "ideal" position ti' with respect to audio/video synchronism of the grain Gi is not always that ideal when it comes to overlap well with samples generated to fill a gap H. Therefore the position of the ith grain Gi can optionally be slightly adjusted around its theoretical position ti' by an offset δ, with value -Δ ≤ δ ≤ Δ. In case such an adjustment is made, the final position of the grain Gi in the output signal S' is then noted ti", with ti" = ti' + δ.
  • In step S3, the texture synthesizer block 4 synthesizes a texture to fill every gap H in the output audio signal S'.
  • According to some embodiments of the invention, the texture is synthesized using past and current grains G, stored in database 3, so as to best fit the surrounding grains G in the output audio signal S' and so as not to contain any transient.
  • The texture synthesizer block 4 has to deal with several constraints. Indeed, the synthesized texture must connect well (temporally speaking) with its surrounding grains G to reduce audible discontinuities. For example, the synthesized texture of gap H0 must connect well with grains G0 and G1.
  • Moreover, the synthesized texture must have a spectral content matching its surrounding grains G; in other words, it must be perceptually relevant. Furthermore, the synthesized texture must not contain any transient. For example, the synthesized texture of gap H0 must have a spectral content matching grains G0 and G1 and must not contain any transient.
  • The texture synthesis algorithm and the organization of the database 3 are adapted to these constraints. During the recording of an audio signal S, for example from a sport event, grains G are extracted from the input audio signal S and organized in the database 3. In the example of Fig.4, grains G0, G1 and G2 are extracted from audio signal S and memorized in database 3.
  • The organization of the database 3 can for instance rely on a k-means clustering, using a parameterization of the grains' spectral envelope, for example MFCC (Mel-Frequency Cepstrum Coefficients) or LPCC (Linear Prediction Cepstrum Coefficients).
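  • A possible sketch of such an organization is given below (an assumption for illustration; the patent does not prescribe an implementation). Grains are clustered on precomputed MFCC or LPCC vectors, and the N closest grains are then searched only within the nearest cluster:

```python
import numpy as np
from sklearn.cluster import KMeans

def build_index(grain_features, n_clusters=32):
    """grain_features: one MFCC (or LPCC) vector per grain in the database."""
    return KMeans(n_clusters=n_clusters, n_init=10).fit(np.asarray(grain_features))

def nearest_grains(km, grain_features, query, n=8):
    """Return the indices of the n grains closest to the query vector,
    searching only the query's own cluster."""
    feats = np.asarray(grain_features)
    cluster = km.predict(query[None, :])[0]
    idx = np.where(km.labels_ == cluster)[0]
    d = np.linalg.norm(feats[idx] - query, axis=1)
    return idx[np.argsort(d)[:n]]
```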
  • As a variant, any approach that allows quickly finding a grain G can be used. Organizing the database 3 can also mean that redundant or oldest grains G are progressively discarded in order to keep the database 3 at a reasonable size. Moreover, grains G that might contain a transient are immediately discarded, using the spectral flux SF as a measure of the likelihood of a transient.
  • During a slow-motion, input grains G are positioned in the output audio signal S' as explained previously. For example, grains G0, G1, G2 are positioned in the output audio signal S' as represented in Fig.7.
  • Then, for each gap H, the texture synthesizer block 4 selects in the database the N grains closest to the two surrounding grains G, N being a predetermined integer. The measure of closeness is based on the spectral envelope or its parameterization at the position where the two grains G were split, since it is likely the most stable part of the audio signal S and it corresponds to the spectral envelope that the texture will have to match.
  • Using those N preselected grains, the texture synthesizer block 4 synthesizes a texture for filling the gap H.
  • For example, we denote by GL the grain before the gap H to be filled and by GR the grain after the gap H. Texture synthesizer block 4 selects a first grain GA which permits either minimizing the Euclidean distance or maximizing the correlation between its M first samples and the M last samples of grain GL, M being a predetermined integer. A cost CA different from zero is memorized for grain GA so as not to select this grain again in the next step. For example, cost CA = C + R, where C is a constant minimal cost and R is a random positive value, C and R being integer values.
  • Then texture synthesizer block 4 selects a second grain GB which permits either minimizing the Euclidean distance or maximizing the correlation between its M first samples and the M last samples of grain GA, and whose cost CB is zero. A new cost CB different from zero is then memorized for grain GB so as not to select this grain again at the next step. For example, cost CB = C + R, where C is the constant minimal cost and R is a random positive value, C and R being integer values. All the other memorized costs Ci are then decreased by one unit.
  • And so on, until a grain GP finally overlaps with grain GR. The last selected grain GP permits either minimizing the Euclidean distance or maximizing the correlation between its M last samples and the M first samples of grain GR. Note that if grain GA is longer than the gap H, it overlaps with grain GR and, as a consequence, it is selected so as to be optimal considering both the M last samples of grain GL and the M first samples of grain GR.
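  • A greedy sketch of this chaining step is given below (an illustration under assumptions, not the original implementation; the within-texture joining of consecutive grains and the final match against GR are simplified):

```python
import random
import numpy as np

def chain_grains(candidates, g_left, gap_len, M=128, C=3, R_max=4):
    """Chain database grains across a gap of gap_len samples.
    candidates: the N preselected grains (1-D numpy arrays)."""
    costs = [0] * len(candidates)
    texture, filled = [], 0
    tail = g_left[-M:]                        # M last samples of G_L
    while filled < gap_len + M:               # stop once we overlap G_R
        best_i, best_d = None, np.inf
        for i, g in enumerate(candidates):
            if costs[i] != 0 or len(g) <= M:  # skip recently used / too short
                continue
            d = np.sum((g[:M] - tail) ** 2)   # Euclidean distance on M samples
            if d < best_d:
                best_i, best_d = i, d
        if best_i is None:
            break                             # no eligible grain left
        g = candidates[best_i]
        texture.append(g)
        filled += len(g) - M                  # consecutive grains overlap by M
        tail = g[-M:]
        costs[best_i] = C + random.randint(1, R_max)  # cost C + R, non-zero
        costs = [max(0, c - 1) if i != best_i else c
                 for i, c in enumerate(costs)]        # decrease other costs
    return texture
```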
  • When the texture to be used in gap H has been synthesized with grains GA to GP, it is overlap-added with the two grains GL and GR, with an overlap of M samples on each side. If the texture goes beyond the Mth sample of GR, the exceeding samples are zeroed. This can happen if the grain GP is longer than 2M + LP, where LP is the length of the gap remaining when GA to GP-1 have been added to the texture. Of course, this could also apply when grain GA is longer than the gap H.
  • Overlapping the samples of GL, GR and the texture to be used in gap H means that the samples occupying a same temporal position are summed with a weighting factor. The weighting factor is determined by a temporal envelope that creates two fade-in/fade-out effects, respectively between the M last samples of GL and the M first samples of the texture, and between the M last samples of the texture and the M first samples of GR. The temporal envelope usually used is a normalized raised cosine.
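  • For instance, an M-sample raised-cosine crossfade may be sketched as follows (an illustration, with M assumed to be at least 2):

```python
import numpy as np

def crossfade(tail, head):
    """Overlap-add the last M samples of one segment (tail) with the
    first M samples of the next (head) using a normalized raised cosine."""
    M = len(tail)
    fade_in = 0.5 * (1.0 - np.cos(np.pi * np.arange(M) / (M - 1)))  # 0 -> 1
    fade_out = 1.0 - fade_in                                        # 1 -> 0
    return tail * fade_out + head[:M] * fade_in
```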
  • Note that an amplitude normalization of the texture (or even an amplitude normalization of each of its grains GA to GP separately) may be performed in order to ensure the perceptual continuity between the input grains GL, GR and the texture in the output audio signal S'.
  • According to other embodiments of the invention, the texture is synthesized using either filter-based (LP) or spectral synthesis. The parameters of the filter or of the spectral synthesis may be computed using parts of the input audio signal S surrounding the sample that separates the grains G between which the synthetic audio will be inserted. Finally, the synthetic signal is inserted in the output audio signal S' so as to best fit the surrounding grains G in the signal S', that is to say to reduce discontinuities. The exact position of each grain G0, G1, G2 in the final output audio signal S' may be adjusted around the temporary positions t0', t1', t2' so as to fit better with the samples that will fill the gaps H0, H1, H2.
  • An example of linear prediction analysis and synthesis is described below. We consider the input signal S and two consecutive grains Gn and Gn+1. We call S(m) the mth sample of signal S, which is the first sample of grain Gn+1. We assume that grain Gn has just been added to the output signal S'.
  • Here, the step S3 comprises an operation of calculating the Linear Prediction Coefficients (LPC) of a p-order LP-filter using for example the samples from S(m-N) to S(m+N-1), weighted by a windowing function such as a Hann window. N and p (order of the filter) are predefined parameters.
  • The step S3 then comprises an operation of initializing the Linear Prediction (LP) filter using the p last samples of Gn, that is to say samples S(m-p) to S(m-1). This ensures continuity between the samples of grain Gn and the first samples that will be generated by the filter.
  • At instant n, the output y(n) of a Linear Prediction filter with input x(n) obeys an equation of the form: y(n) = b·x(n) - (a1·y(n-1) + a2·y(n-2) + ... + ap·y(n-p)), where p is the order of the filter, b is the gain of the filter and the ai are the Linear Prediction Coefficients (LPC). One can see that the output at time n is a function of the outputs at instants n-1 to n-p. For the p first values of n, some of these past outputs are not available (y(-1), ..., y(-p)) and they are usually set to zero: y(n) = 0 for n < 0.
  • Instead, if we consider that the p last values of grain Gn are the p last outputs of the filter before y(0) (i.e. y(-1), ..., y(-p)), we obtain a filter output that is continuous with the grain samples. The filter is thus initialized to a state where its output continues the samples of grain Gn.
  • The step S3 further comprises an operation of filtering a white noise using the LP filter properly initialized. The white noise has a length of at least Dn + H + Δ + Ln+1, where Dn is the length of the gap H to fill between the last sample of the grain Gn in the output signal S' (i.e. tn" + Ln, with Ln the length of grain Gn) and the temporary position tn+1' of grain Gn+1, H is the length of the overlapping region between the synthetic signal and the grain Gn+1, Δ is the maximum displacement allowed for the grain Gn+1 around tn+1' (in order to get tn+1"), and Ln+1 is the length of grain Gn+1.
  • The step S3 further comprises an operation of appending the synthetic signal (i.e. the filtered white noise) to grain Gn in the output signal S'. As an optional refinement, the synthetic signal energy can be normalized to the energy of a group of samples surrounding sample S(m) (i.e. from S(m-M) to S(m+M)) beforehand.
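  • A minimal Python sketch of these operations is given below (an illustration under assumptions, not the original implementation; the LPC computation uses the autocorrelation method, and scipy is assumed available):

```python
import numpy as np
from scipy.linalg import solve_toeplitz
from scipy.signal import lfilter, lfiltic

def lpc_coeffs(frame, p):
    """Order-p LPC of a (windowed) frame via the normal equations."""
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    a = solve_toeplitz(r[:p], r[1:p + 1])
    return np.concatenate(([1.0], -a))        # denominator A(z) of the LP filter

def lp_gap_synthesis(s, m, grain_prev, length, N=1024, p=16):
    """Filter white noise through the LP filter of S around S(m), with the
    filter state initialized on the p last samples of the previous grain."""
    frame = s[m - N : m + N] * np.hanning(2 * N)   # samples S(m-N)..S(m+N-1)
    a = lpc_coeffs(frame, p)
    zi = lfiltic([1.0], a, y=grain_prev[::-1][:p]) # past outputs, newest first
    synth, _ = lfilter([1.0], a, np.random.randn(length), zi=zi)
    ref = s[m - N : m + N]                         # optional energy normalization
    synth *= np.sqrt(np.mean(ref ** 2) / np.mean(synth ** 2))
    return synth
```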
  • The step S3 further comprises an operation of computing the cross-correlation between grain Gn+1 and the output signal samples from tn+1'-Δ to tn+1'+Ln+1+Δ, and an operation of locating the maximum of cross-correlation in the region that corresponds to positioning grain Gn+1 between tn+1' - Δ to tn+1' + Δ. The position of that maximum sets the value for tn+1".
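  • The cross-correlation placement may be sketched as follows (again an illustration; positions are in samples):

```python
import numpy as np

def align_grain(output, grain, t_prime, delta):
    """Find t'' in [t' - Δ, t' + Δ] maximizing the cross-correlation
    between grain G_{n+1} and the already-generated output samples."""
    seg = output[t_prime - delta : t_prime + len(grain) + delta]
    xc = np.correlate(seg, grain, mode="valid")  # 2Δ+1 candidate positions
    return t_prime - delta + int(np.argmax(xc))
```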
  • The step S3 further comprises an operation of overlapping and adding grain Gn+1 to the output signal S', starting from sample at tn+1". The region from time tn+1" to tn+1"+H is where the overlap takes place. The LP synthetic samples beyond tn+1"+H are discarded (that is, weighted to zero). N, p, H and Δ (and possibly M) are parameters of the system.
  • An example of spectral analysis and synthesis is described below. This embodiment is very similar to the linear prediction analysis and synthesis embodiment.
  • The main difference is that an FFT is computed on the windowed frame [S(m-N)...S(m+N-1)] instead of an LP analysis.
  • Moreover, the white noise used to generate the synthetic signal is sliced into overlapping windowed frames, the Fourier Transform (by means of an FFT) of each of these frames is computed, and the amplitudes of their spectra are modified to be equal to the amplitude of the spectrum previously computed on frame [S(m-N)...S(m+N-1)]. The modified FFTs are inverted (IFFT) in order to obtain frames of a "colored" noise. These frames are then overlap-added so as to end up with a continuous signal r whose spectral characteristics match those of the input signal frame [S(m-N)...S(m+N-1)].
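  • This noise coloring may be sketched as follows (an illustration under assumptions; the analysis frame is assumed to be already windowed, and the small amplitude ripple of the double-windowed overlap-add is not compensated):

```python
import numpy as np

def colored_noise(target_frame, length):
    """Overlapping white-noise frames are given the amplitude spectrum of
    target_frame (keeping the noise phases), inverted and overlap-added
    into a continuous colored-noise signal r of the requested length."""
    frame_len = len(target_frame)
    hop = frame_len // 2
    win = np.hanning(frame_len)
    target_mag = np.abs(np.fft.rfft(target_frame))
    r = np.zeros(length + frame_len)
    for start in range(0, length, hop):
        spec = np.fft.rfft(np.random.randn(frame_len) * win)
        spec = target_mag * np.exp(1j * np.angle(spec))  # impose amplitude
        r[start : start + frame_len] += np.fft.irfft(spec) * win
    return r[:length]
```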
  • The length R of the synthetic signal r should be at least Dn + 2·H + 2·Δ + Ln+1.
  • The synthetic signal r is not necessarily continuous with the last samples of grain Gn. Therefore, it is overlap-added with the H last samples of grain Gn. Optionally, the same cross-correlation based method for position adjustment as in the LP synthesis (which can also be used for grain Gn+1) can be applied to adjust the position of the synthetic signal in the output signal S', i.e. to better fit the last samples of grain Gn. However, the offset can only go to the left, and the samples of the synthetic signal r that fall before the overlapping region are discarded, that is to say that the samples added to the output are [r(δ) ... r(R)], with δ the offset (0 ≤ δ ≤ Δ) obtained by the cross-correlation alignment method, and the samples [r(1) ... r(δ-1)] are discarded. As an optional refinement, the synthetic signal energy can be normalized beforehand to the energy of a group of samples surrounding S(m) (i.e. from sample S(m-M) to S(m+M)). Once the synthetic signal r has been added to the output, the grain Gn+1 is positioned and added.
  • As a variant, other texture synthesis algorithms can be used. For instance, another possible approach is to select a single grain Gs with the exact needed duration in the part of the audio signal S already recorded during the event. But this adds a strong constraint (fixed length) reducing the number of available candidates, whereas the previous algorithm can find a combination of various small grains that globally fits better in the gap.
  • While there has been illustrated and described what are presently considered to be the preferred embodiments of the present invention, it will be understood by those skilled in the art that various other modifications may be made, and equivalents may be substituted, without departing from the true scope of the present invention. Additionally, many modifications may be made to adapt a particular situation to the teachings of the present invention without departing from the central inventive concept described herein. Furthermore, an embodiment of the present invention may not include all of the features described above. Therefore, it is intended that the present invention not be limited to the particular embodiments disclosed, but that the invention include all embodiments falling within the scope of the invention as broadly defined above. In particular, the embodiments described above could be combined.
  • Expressions such as "comprise", "include", "incorporate", "contain", "is" and "have" are to be construed in a non-exclusive manner when interpreting the description and its associated claims, namely construed to allow for other items or components which are not explicitly defined also to be present.
  • A person skilled in the art will readily appreciate that various parameters disclosed in the description may be modified and that various embodiments disclosed may be combined without departing from the scope of the invention.

Claims (15)

  1. Time-stretching method for changing the length of an audio signal without changing its perceived content, comprising the steps of:
    a) dividing an input audio signal (S) into a set of grains (G0, G1, G2), a grain being a group of consecutive audio samples from the input audio signal (S), the length of each grain of the set of grains being determined so as not to split a transient of the audio signal (S) between two successive grains,
    b) inserting the grains in an output audio signal (S'), the insertion being done at a rate corresponding to a speed factor, and
    c) synthesizing an audio texture to fill every gap (H) between two successive grains in the output audio signal (S').
  2. Time-stretching method according to claim 1, wherein step a) comprises setting boundaries of the grains (G0, G1, G2) at positions of minimum values of a spectral flux computed on the input audio signal (S).
  3. Time-stretching method according to claim 2, wherein the spectral flux is computed by:
    - dividing the input audio signal (S) into overlapping constant-length frames,
    - computing, for each frame n and for each frequency bin k of the input audio signal (S), a spectral amplitude X(n,k) and a variation of amplitude ΔX(n,k) between the said frame and the previous frame:
      ΔX(n,k) = X(n,k) - X(n-1,k)
    - computing the spectral flux SF by using the variation of amplitude values:
      SF(n) = Σ_{k=1}^{Nfft} [ΔX(n,k) + |ΔX(n,k)|] / 2
  4. Time-stretching method according to one of the claims 1 to 3, wherein step b) comprises inserting in the output audio signal (S') a grain Gi of the set of grains, which starts at a time ti in the input audio signal (S), at time ti' being calculated from a starting time ti-1' of a previous grain Gi-1 in the output audio signal S' and from a length Li-1 of the previous grain Gi-1:
      ti' = ti-1' + r·Li-1, with t0' = t0,
    the variable r being equal to the inverse of the speed factor.
  5. Time-stretching method according to one of the claims 1 to 4, wherein step c) comprises synthesizing an audio texture to fill a gap between two successive grains by using past and current grains, which are stored in a database.
  6. Time-stretching method according to claim 5, wherein the texture is synthesized so as not to contain any transient.
  7. Time-stretching method according to claim 6, wherein the texture is synthesized by using the N grains closest to the two surrounding grains of the gap to be filled, the N grains being selected in the database, N being a predetermined integer, the measure of closeness being based on the spectral envelope or its parameterization at the position where the two successive grains were split.
  8. Time-stretching method according to claim 7, wherein synthesizing comprises selecting in the database a grain which permits either minimizing the Euclidean distance or maximizing the correlation between its M first samples and the M last samples of the grain before the gap to be filled, M being a predetermined integer.
  9. Time-stretching method according to claim 8, wherein synthesizing comprises selecting in the database a grain which permits either minimizing the Euclidean distance or maximizing the correlation between its M last samples and the M first samples of the grain after the gap to be filled, M being a predetermined integer.
  10. Time-stretching method according to one of the claims 1 to 4, wherein step c) comprises synthesizing an audio texture to fill a gap between two successive grains by using linear prediction analysis.
  11. Time-stretching method according to one of the claims 1 to 4, wherein step c) comprises synthesizing an audio texture to fill a gap between two successive grains by using spectral analysis.
  12. Time-stretching device (1) for changing the length of an audio signal without changing its perceived content, comprising an extraction block configured to divide an input audio signal (S) into a set of grains (G0, G1, G2) and to insert the grains in an output audio signal (S'), a grain being a group of consecutive audio samples from the input audio signal (S), the grains of the set of grains having variable lengths so as not to split a transient of the audio signal (S) between two successive grains, the insertion being done at a rate corresponding to a speed factor, the time-stretching device further comprising a texture synthesizer block (4) configured to synthesize a texture to fill every gap (H) between two successive grains in the output audio signal (S').
  13. Time-stretching device according to claim 12, wherein the extraction block is configured to set boundaries of the grains (G0, G1, G2) at positions of minimum values of a spectral flux computed on the input audio signal (S).
  14. Time-stretching device according to claim 12 or 13, wherein the extraction block is configured to insert in the output audio signal (S') a grain Gi of the set of grains, which starts at a time ti in the input audio signal (S), at time ti' being calculated from a starting time ti-1' of a previous grain Gi-1 in the output audio signal S' and from a length Li-1 of the previous grain Gi-1:
      ti' = ti-1' + r·Li-1, with t0' = t0,
    the variable r being equal to the inverse of the speed factor.
  15. Time-stretching device according to one of the claims 12 to 14, wherein the texture synthesizer block is configured to synthesize an audio texture to fill a gap between two successive grains by using past and current grains, which are stored in a database.
EP11305407A 2011-04-07 2011-04-07 Time-stretching of an audio signal Withdrawn EP2509073A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
EP11305407A EP2509073A1 (en) 2011-04-07 2011-04-07 Time-stretching of an audio signal
PCT/EP2012/001540 WO2012136380A1 (en) 2011-04-07 2012-04-06 Time-stretching of an audio signal

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
EP11305407A EP2509073A1 (en) 2011-04-07 2011-04-07 Time-stretching of an audio signal

Publications (1)

Publication Number Publication Date
EP2509073A1 true EP2509073A1 (en) 2012-10-10

Family

ID=44358241

Family Applications (1)

Application Number Title Priority Date Filing Date
EP11305407A Withdrawn EP2509073A1 (en) 2011-04-07 2011-04-07 Time-stretching of an audio signal

Country Status (2)

Country Link
EP (1) EP2509073A1 (en)
WO (1) WO2012136380A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2960904A1 (en) * 2014-06-27 2015-12-30 Nokia Technologies Oy Method and apparatus for synchronizing audio and video signals
US10231001B2 (en) 2016-05-24 2019-03-12 Divx, Llc Systems and methods for providing audio content during trick-play playback

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10176813B2 (en) 2015-04-17 2019-01-08 Dolby Laboratories Licensing Corporation Audio encoding and rendering with discontinuity compensation
CN112511886B (en) * 2020-11-25 2023-03-21 杭州当虹科技股份有限公司 Audio and video synchronous playing method based on audio expansion and contraction

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20010028634A1 (en) * 2000-01-18 2001-10-11 Ying Huang Packet loss compensation method using injection of spectrally shaped noise

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20010028634A1 (en) * 2000-01-18 2001-10-11 Ying Huang Packet loss compensation method using injection of spectrally shaped noise

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
C. PICARD: "PhD thesis", 2009, INRIA, article "Expressive Sound Synthesis for Animation"
ISMO KAUPPINEN ET AL: "Audio Signal Extrapolation - Theory and Application", PROCEEDINGS OF THE 5TH INTERNATIONAL CONFERENCE ON DIGITAL AUDIO EFFECTS, 1 September 2002 (2002-09-01), pages 105 - 110, XP055005290 *
JORDI BONADA SANJAUME: "Audio Time-Scale Modification in the Context of Professional Audio Postproduction", RESEARCH WORK FOR PHD PROGRAM, 2002
LIE LU, LIU WENYIN, HONG-JIANG ZHANG: "Audio textures: theory and applications", TRANSACTIONS ON SPEECH AND AUDIO PROCESSING, IEEE, vol. 12, no. 2, 2004, pages 156 - 167, XP011110595, DOI: doi:10.1109/TSA.2003.819947
LU L ET AL: "Audio Textures: Theory and Applications", IEEE TRANSACTIONS ON SPEECH AND AUDIO PROCESSING, IEEE SERVICE CENTER, NEW YORK, NY, US, vol. 12, no. 2, 1 March 2004 (2004-03-01), pages 156 - 167, XP011110595, ISSN: 1063-6676, DOI: 10.1109/TSA.2003.819947 *
PICARD CÉCILE ET AL: "Retargetting Example Sounds to Interactive Physics-driven Animations", CONFERENCE: 35TH INTERNATIONAL CONFERENCE: AUDIO FOR GAMES; FEBRUARY 2009, AES, 60 EAST 42ND STREET, ROOM 2520 NEW YORK 10165-2520, USA, 1 February 2009 (2009-02-01), XP040509265 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2960904A1 (en) * 2014-06-27 2015-12-30 Nokia Technologies Oy Method and apparatus for synchronizing audio and video signals
US9508386B2 (en) 2014-06-27 2016-11-29 Nokia Technologies Oy Method and apparatus for synchronizing audio and video signals
US10231001B2 (en) 2016-05-24 2019-03-12 Divx, Llc Systems and methods for providing audio content during trick-play playback
US11044502B2 (en) 2016-05-24 2021-06-22 Divx, Llc Systems and methods for providing audio content during trick-play playback
US11546643B2 (en) 2016-05-24 2023-01-03 Divx, Llc Systems and methods for providing audio content during trick-play playback

Also Published As

Publication number Publication date
WO2012136380A1 (en) 2012-10-11

Similar Documents

Publication Publication Date Title
Laroche Time and pitch scale modification of audio signals
US8280724B2 (en) Speech synthesis using complex spectral modeling
US6360202B1 (en) Variable rate video playback with synchronized audio
US9368103B2 (en) Estimation system of spectral envelopes and group delays for sound analysis and synthesis, and audio signal synthesis system
US5749073A (en) System for automatically morphing audio information
JP4906230B2 (en) A method for time adjustment of audio signals using characterization based on auditory events
US6073100A (en) Method and apparatus for synthesizing signals using transform-domain match-output extension
EP2296145B1 (en) Device and method for manipulating an audio signal having a transient event
AU2010209943B2 (en) Apparatus, method and computer program for manipulating an audio signal comprising a transient event
US8706496B2 (en) Audio signal transforming by utilizing a computational cost function
US8280738B2 (en) Voice quality conversion apparatus, pitch conversion apparatus, and voice quality conversion method
WO2008106698A1 (en) Method for processing audio data into a condensed version
US8996363B2 (en) Apparatus and method for determining a plurality of local center of gravity frequencies of a spectrum of an audio signal
JP2000511651A (en) Non-uniform time scaling of recorded audio signals
JPH06266390A (en) Waveform editing type speech synthesizing device
Grofit et al. Time-scale modification of audio signals using enhanced WSOLA with management of transients
EP2509073A1 (en) Time-stretching of an audio signal
US5787398A (en) Apparatus for synthesizing speech by varying pitch
CN108369803A (en) The method for being used to form the pumping signal of the parameter speech synthesis system based on glottal model
Dorran Audio time-scale modification
JP3081108B2 (en) Speaker classification processing apparatus and method
Verfaille et al. Adaptive digital audio effects
Sanjaume Audio Time-Scale Modification in the Context of Professional Audio Post-production
EP1500080B1 (en) Method for synthesizing speech
Lin et al. High quality and low complexity pitch modification of acoustic signals

Legal Events

Date Code Title Description
PUAI Public reference made under article 153(3) epc to a published international application that has entered the european phase

Free format text: ORIGINAL CODE: 0009012

AK Designated contracting states

Kind code of ref document: A1

Designated state(s): AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR

AX Request for extension of the european patent

Extension state: BA ME

17P Request for examination filed

Effective date: 20130328

17Q First examination report despatched

Effective date: 20130805

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: THE APPLICATION IS DEEMED TO BE WITHDRAWN

18D Application deemed to be withdrawn

Effective date: 20131217