WO2012136380A1 - Time-stretching of an audio signal - Google Patents

Time-stretching of an audio signal

Info

Publication number
WO2012136380A1
Authority
WO
WIPO (PCT)
Prior art keywords
grains
audio signal
time
grain
texture
Prior art date
Application number
PCT/EP2012/001540
Other languages
French (fr)
Inventor
Alexis Moinet
Thierry Dutoit
Philippe Latour
Original Assignee
Evs International (Swiss) Sarl
Universite De Mons
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Evs International (Swiss) Sarl and Universite De Mons
Publication of WO2012136380A1

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/04Time compression or expansion

Definitions

  • the invention generally relates to the time-stretching of an audio signal.
  • Time-stretching permits changing the length of the audio signal without changing its perceived content. It is different from a simple re-sampling (up-sampling and down-sampling), which changes both the length of the audio signal and its spectral content. Indeed, with a simple re-sampling, the frequencies contained in the audio signal are shifted upward or downward when the audio signal length is respectively reduced or increased. This is illustrated by the low-pitched sounds heard in some movies during slow-motion scenes.
  • time-stretching may be used to increase the length of an audio signal in order to preserve synchronization with a video when doing slow-motion playbacks.
  • time-stretching may be used to decrease the length of the audio signal in order to preserve synchronization with a video when doing fast-motion playbacks.
  • This creation of new information may comprise adding information and/or removing information.
  • the known methods generally use time-domain processing, like the SOLA algorithm (Synchronous Overlap and Add), the WSOLA algorithm (Waveform Similarity based synchronized Overlap Add), and the PSOLA algorithm (Pitch Synchronous Overlap Add), or spectral-domain processing, like phase vocoder-based solutions, or model-based processing, like LPC methods (Linear Predictive Coding) and HNM methods (Harmonic plus Noise Model).
  • Transient detection has many solutions when applied to quiet audio recordings such as speech or music, but it is much more difficult in noisy environments such as a football or basketball stadium. As a result, it can be very complex to detect and process those transient sounds separately during sport events.
  • Texture-based methods have been developed to synthesize audio content such as crowds and applause. Although these methods create similar-sounding signals, they do not manipulate the length of the original signal.
  • the invention will improve the situation.
  • a first aspect of the invention relates to a time-stretching method for changing the length of an audio signal without changing its perceived content, comprising the steps of: a) dividing an input audio signal into a set of grains, a grain being a group of consecutive audio samples from the input audio signal, the length of each grain of the set of grains being determined so as not to split a transient of the audio signal between two successive grains,
  • the performance of the time-stretching is improved.
  • the proposed method combines both aspects, time-stretching and textures, in order to manipulate the length of an audio signal without changing its perceived content and without having to process its transients.
  • since the transients are not split into different grains and there is no overlap between grains, this method will not suffer from the transient duplication artefact that characterizes many time-domain time-stretching algorithms. Any transient from the input signal will be copied in the output signal once and only once.
  • a grain can be defined as a variable-length group of consecutive audio samples.
  • the grains of the above recited step a) do not overlap and are created in a way so as not to split a transient between two successive grains. Their lengths can vary between two reasonable values, such as between 10 and 40 ms.
  • the temporary positions of the grains in the output audio signal can be computed at a rate corresponding for example to the speed factor of a slow-motion of a corresponding image. Then, empty spaces (gaps) can be inserted between the grains of the input audio signal in order to obtain an output signal having a length equal to the original length divided by the speed factor of the slow-motion. For example, a 33% slow-motion signal is three times longer than the original signal (i.e. divided by 0.33). The exact position of each grain in the final output signal can then be adjusted around these temporary positions so as to fit better with the samples that will fill the gaps.
  • in step c), for every gap in the output audio, samples are synthesized to fill it.
  • the samples are generated using (for instance) either filter-based (LP) or spectral synthesis.
  • the parameters of the filter or the spectral synthesis are computed using parts of the input signal surrounding the sample that separates the grains between which the synthetic audio will be inserted.
  • the synthetic signal is inserted in the output audio so as to fit the best with the surrounding grains in the signal (reducing discontinuities).
  • in step a), grain boundaries are set at the positions of the minimum values of the spectral flux (SF) computed on the input signal. These minimums correspond to the samples where the input signal is the most stable, typically samples where the probability of having a transient is low.
  • the boundaries can be positioned at minimum values of energy.
  • the boundaries are positioned at minimum values of energy closest to the minimum value of the spectral flux (SF).
  • a first operation consists in finding an SF minimum and then finding the closest minimum of energy.
  • the position of the boundaries can be repositioned at the zero-crossings of the audio signal closest to the boundaries set using any of the above embodiments.
  • step a) may comprise setting grain boundaries at positions of minimum values of a spectral flux computed on the input audio signal, and these minimum values correspond to samples of the input audio signal where the input audio signal is the most stable, in particular to samples where the probability of having a transient is minimized.
  • the spectral flux may be computed by dividing the input audio signal into overlapping constant-length frames and summing, for each frame, the positive part of the bin-wise amplitude variation ΔX(n,k) = |X(n,k)| - |X(n-1,k)|: SF(n) = Σk (ΔX(n,k) + |ΔX(n,k)|) / 2.
  • Step b) may comprise inserting in the output audio signal a grain Gi of the set of grains, which starts at a time ti in the input audio signal, at a time ti' calculated from the starting time ti-1' of the previous grain Gi-1 in the output audio signal S' and from the length Li-1 of the previous grain Gi-1: ti' = ti-1' + r · Li-1, with t0' = t0.
  • Step c) may comprise synthesizing an audio texture to fill a gap between two successive grains by using past and current grains, which are stored in a database.
  • the texture is preferably synthesized so as not to contain any transient.
  • the texture may be synthesized by using the N grains closest to the two surrounding grains of the gap to be filled, the N grains being selected in the database, N being a predetermined integer, the measure of closeness being based on the spectral envelope or its parameterization at the position where the two successive grains were split.
  • Synthesizing may comprise selecting in the database a grain which permits either minimizing the Euclidean distance or maximizing the correlation between its M first samples and the M last samples of the grain before the gap to be filled, M being a predetermined integer. Synthesizing may also comprise selecting in the database a grain which permits either minimizing the Euclidean distance or maximizing the correlation between its M last samples and the M first samples of the grain after the gap to be filled, M being a predetermined integer.
  • step c) may comprise synthesizing an audio texture to fill a gap between two successive grains by using linear prediction analysis.
  • step c) may comprise synthesizing an audio texture to fill a gap between two successive grains by using spectral analysis.
  • a second aspect of the invention relates to a time-stretching device for changing the length of an audio signal without changing its perceived content, comprising an extraction block configured to divide an input audio signal into a set of grains and to insert the grains in an output audio signal, a grain being a group of consecutive audio samples from the input audio signal, grains of the set of grains having a variable-length so as not to split a transient of the audio signal between two successive grains, the insertion being done at a rate corresponding to a speed factor, the time-stretching device further comprising a texture synthesizer block configured to synthesize a texture to fill every gap between two successive grains in the output audio signal.
  • the extraction block may be configured to set grain boundaries at positions of minimum values of a spectral flux computed on the input audio signal.
  • the extraction block may be configured to insert in the output audio signal a grain Gi of the set of grains, which starts at a time ti in the input audio signal, at a time ti' calculated from the starting time ti-1' of the previous grain Gi-1 in the output audio signal S' and from the length Li-1 of the previous grain Gi-1: ti' = ti-1' + r · Li-1, with t0' = t0.
  • the texture synthesizer block may be configured to synthesize an audio texture to fill a gap between two successive grains by using past and current grains, which are stored in a database.
  • the texture synthesizer block may also be configured to synthesize the texture so as not to contain any transient.
  • the texture synthesizer block may be configured to synthesize the texture by using the N grains closest to the two surrounding grains of the gap to be filled, the N grains being selected in the database, N being a predetermined integer, the measure of closeness being based on the spectral envelope or its parameterization at the position where the two successive grains were split.
  • Fig.1 is a schematic block diagram of a time-stretching device according to a first embodiment of the invention
  • - Fig.2 is a flow chart showing steps of a time-stretching method
  • - Fig.3 is a chart showing an example of input audio signal
  • - Fig.4 is a chart showing the input audio signal of Fig.3 which has been divided into a set of grains;
  • - Fig.5 is a chart showing a spectral flux computed on another example of input audio signal
  • - Fig.6 is a chart showing an output audio signal computed from the input audio signal of Fig.3;
  • - Fig.7 is a schematic view illustrating the insertion of grains in an output audio signal
  • - Fig.8 is a schematic workflow of a self-cross-synthesis method, in a possible embodiment of step c).
  • Fig.1 shows a time-stretching device 1, which is configured to change the length of an audio signal S (Fig.3) without changing its perceived content.
  • the audio signal S is for example the audio part of a sport-event signal comprising an audio signal and a video signal.
  • the audio signal mainly contains noise, that is to say transients, like audio signals usually encountered during sport events, for example crowd, applause, ball impacts, and/or field events.
  • the time-stretching device 1 comprises an extraction block 2, which is configured to receive the audio signal S and to divide it into a set of grains G.
  • a grain G is a variable-length group of consecutive audio samples from the audio signal S.
  • the grains G do not overlap and are created in a way so as not to split a transient of the audio signal S between two successive grains G. Their lengths vary between two reasonable values, for example between 5 and 50ms, preferably between 10 and 40ms.
  • the extraction block 2 is also configured to insert the grains G in an output audio signal S' .
  • the time-stretching device 1 also comprises a database block 3 configured to memorize the extracted grains G and a texture synthesizer block 4 configured to synthesize a texture to fill gaps between the grains in the output audio signal S' (Fig.6).
  • in step S1, the input audio signal S is received by extraction block 2. Then, extraction block 2 divides the input audio signal S into a set of grains G, for example into three grains G0, G1, G2, as represented in Fig.4.
  • the boundaries of the grains G0, G1, G2 may be set at the positions of minimum values of a spectral flux SF computed on the input audio signal S. Indeed, these minimum values correspond to samples of the input audio signal S where the input audio signal S is the most stable, in particular to samples where the probability of having a transient is minimized.
  • the spectral flux SF is defined as the sum of the positive contributions of the frame-to-frame amplitude variations ΔX(n,k) = |X(n,k)| - |X(n-1,k)|: SF(n) = Σk (ΔX(n,k) + |ΔX(n,k)|) / 2.
  • the only parameter for the definition of the spectral flux SF is the constant length of the analysis frames n.
  • once the spectral flux SF is computed, its local minimums are located, with the constraint that they must be separated by at least a first predetermined value, for example 5 ms, and no more than a second predetermined value, for example 50 ms, which means that some minimums may not be taken into consideration.
  • grain boundaries are positioned at the closest zero-crossing of the audio samples for each minimum taken into account.
  • the boundaries can be positioned at the closest minimum of energy instead of the closest zero-crossing.
  • Fig.5 shows an embodiment where the audio signal S corresponds to a recording of a football match.
  • Curve CSF represents the spectral flux SF computed for the audio signal S.
  • Lines B1 to B18 represent grain boundaries positioned at local minimums, for example at the closest zero-crossing.
  • the set of grains G comprises nineteen grains G, respectively defined by two consecutive grain boundaries B.
  • the extraction step is based on a time-domain algorithm, with variable grain size for allowing management of transients.
  • Some variants can be considered, such as a multi-band spectral flux, which means that a separate spectral flux is computed for each band of a filter bank. This variant may improve the precision of the extraction step.
  • in step S2, the grains G, for example the grains G0, G1, G2 in the example of Fig.4, are inserted in an output audio signal S', as represented in Fig.6.
  • the output audio signal S' corresponds to the audio signal S in slow-motion, the insertion being done at a rate corresponding to the speed factor of the slow-motion.
  • empty spaces or gaps H0, H1, H2 (Fig.7) are inserted between the grains G0, G1, G2 of the input audio signal S in order to obtain an output signal S' whose length is equal to the original length divided by the speed factor of the slow-motion.
  • a 33% slow-motion signal S' is three times longer than the original signal S, which means its length is divided by 0.33.
  • the speed factor and the variable r can be changed with every grain G, for example if the user changes the playback speed of the video between grains of the input audio signal S.
  • time ti' is not necessarily an integer. Indeed, the closest integer value may be used instead of time ti' to position the grain Gi in the output audio signal S', but its real value ti' is preferably used to compute ti+1' in order to minimize rounding errors.
  • Fig.7 illustrates the positioning of grains G0, G1, G2 in the output audio signal S' for a 50% slow-motion; dashed rectangles represent the gaps H0, H1, H2 that have to be filled with textures.
  • step S3 the texture synthesizer block 4 synthesizes a texture to fill every gap H in the output audio signal S' .
  • the texture is synthesized using past and current grains G, stored in database 3, so as to fit the best with the surrounding grains G in the output audio signal S' and so as not to contain any transient.
  • the texture synthesizer block 4 has to deal with different constraints. Indeed, the synthesized texture must connect well (temporally speaking) with its surrounding grains G to reduce audible discontinuities. For example, the synthesized texture of gap H0 must connect well with grains G0 and G1.
  • the texture synthesis algorithm and the organization of the database 3 are adapted to these constraints.
  • grains G are extracted from the input audio signal S and organized in the database 3.
  • grains G0, G1 and G2 are extracted from audio signal S and memorized in database 3.
  • Organization of the database 3 can be for instance a k-means clustering, using a parameterization of the grains' spectral envelope, for example MFCC (Mel-Frequency Cepstrum Coefficients) or LPCC (Linear Prediction Cepstrum Coefficients).
  • any approach that allows quickly finding a grain G can be used.
  • Organization of the database 3 can also mean that redundant or oldest grains G are progressively discarded in order to keep the database 3 at a reasonable size.
  • grains G that might contain a transient are immediately discarded, using the spectral flux SF as a measure of transientness likelihood.
  • input grains G are positioned in the output audio signal S' as explained previously.
  • grains G0, G1, G2 are positioned in the output audio signal S' as represented in Fig.7.
  • the texture synthesizer block 4 selects in the database the N grains closest to the two surrounding grains G, N being a predetermined integer.
  • the measure of closeness is based on the spectral envelope or its parameterization at the position where the two grains G were split, since it is likely the most stable part of the audio signal S and it corresponds to the spectral envelope that the texture will have to match.
  • texture synthesizer block 4 selects a second grain GB which permits either minimizing the Euclidean distance or maximizing the correlation between its M first samples and the M last samples of grain GA and whose cost CB is zero. Then a new cost CB different from zero is memorized for grain GB so as not to select this grain again at the next step.
  • cost CB = C + R, where C is the constant minimal cost and R is a random positive value, C and R being integer values. Then all the other memorized costs are decreased by one unit.
  • the last selected grain GP permits either minimizing the Euclidean distance or maximizing the correlation between its M last samples and the M first samples of grain GR. Note that if grain GA is longer than the gap H, it overlaps with grain GR and as a consequence it is selected so that it is optimal considering the M last samples of grain GL and the M first samples of grain GR.
  • Overlapping of the samples of GL, GR and the texture to be used in gap H means that the samples occupying a same temporal position are summed with a weighting factor.
  • the weighting factor is determined by a temporal envelope that creates two fade-in/fade-out effects, respectively between the M last samples of GL and the M first samples of the texture and between the M last samples of the texture and the M first samples of GR.
  • the temporal envelope usually used is a normalized raised cosine.
  • an amplitude normalization of the texture may be realized in order to ensure the perceptual continuity between the input grains GL, GR and the texture in the output audio signal S' .
  • the texture is synthesized using either filter-based (LP) or spectral synthesis.
  • the parameters of the filter or the spectral synthesis may be computed using parts of the input audio signal S surrounding the sample that separates the grains G between which the synthetic audio will be inserted.
  • the synthetic signal is inserted in the output audio signal S' so as to fit the best with the surrounding grains G in the signal S', that is to say to reduce discontinuities.
  • the exact position of each grain G0, G1, G2 in the final output audio signal S' may be adjusted around the temporary positions t0', t1', t2' so as to fit better with the samples that will fill the gaps H0, H1, H2.
  • the step S3 comprises an operation of calculating the Linear Prediction Coefficients (LPC) of a p-order LP filter using for example the samples from S(m-N) to S(m+N-1), weighted by a windowing function such as a Hann window.
  • N and p are predefined parameters.
  • the step S3 then comprises an operation of initializing the Linear Prediction (LP) filter using the p last samples of Gn, that is to say samples S(m-p) to S(m-1). This ensures continuity between the samples of grain Gn and the first samples that will be generated by the filter.
  • the step S3 further comprises an operation of filtering a white noise using the LP filter properly initialized.
  • the white noise has a length of at least Dn + H + Δ + Ln+1, where Dn is the length of the gap to fill between the last sample of grain Gn in the output signal S' (i.e. tn'' + Ln, with Ln the length of grain Gn) and the temporary position tn+1' of grain Gn+1, H is the length of the overlapping region between the synthetic signal and grain Gn+1, Δ is the maximum displacement allowed for grain Gn+1 around tn+1', and Ln+1 is the length of grain Gn+1.
  • the step S3 further comprises an operation of appending the synthetic signal (i.e. the filtered white noise) to grain Gn in the output signal S'.
  • the synthetic signal energy can be normalized to the energy of a group of samples surrounding sample S(m) (i.e. from S(m-M) to S(m+M)) beforehand.
  • the step S3 further comprises an operation of computing the cross-correlation between grain Gn+1 and the output signal samples from tn+1'-Δ to tn+1'+Ln+1+Δ, and an operation of locating the maximum of cross-correlation in the region that corresponds to positioning grain Gn+1 between tn+1'-Δ and tn+1'+Δ; the position of that maximum sets the value of tn+1''.
  • the step S3 further comprises an operation of overlapping and adding grain Gn+1 to the output signal S', starting from the sample at tn+1''.
  • the region from time tn+1'' to tn+1''+H is where the overlap takes place.
  • the LP synthetic samples beyond tn+1''+H are discarded (that is, weighted to zero).
  • N, p, H and Δ (and possibly M) are parameters of the system.
  • the white noise used to generate the synthetic signal is sliced into overlapping windowed frames, the Fourier Transforms (by means of an FFT) of these frames are computed and the amplitudes of their spectra are modified to be equal to the amplitude of the spectrum previously computed on frame [S(m-N) ... S(m+N-1)].
  • the modified FFTs are inverted (IFFT) in order to obtain frames of a "colored" noise. These frames are then OverLap-Added so as to end up with a continuous signal r whose spectral characteristics match those of the input signal frame [S(m-N) ... S(m+N-1)].
  • the length R of the synthetic signal r should be at least Dn + 2H + 2Δ + Ln+1.
  • the synthetic signal r is not necessarily continuous with the last samples of grain G n . Therefore, it is OverLap-added with the H last samples of grain G n .
  • the same (as in LP synthesis) cross-correlation based method for position adjustment (which can also be used for grain Gn+1) can be applied to adjust the position of the synthetic signal in the output signal S' (i.e. to better fit with the last samples of grain Gn).
  • the offset can only go to the left and the samples of synthetic signal r that fall before the overlapping region are discarded, that is to say that the samples added to the output are [r(δ) ...].
  • the synthetic signal energy can be normalized to the energy of a group of samples surrounding S(m) (i.e. from sample S(m-M) to S(m+M)) beforehand. Once the synthetic signal r has been added to the output, the grain Gn+1 is positioned and added.
  • step S3 (corresponding to step c) in the general wording of the invention) aims to fill every gap in the output audio signal (as shown in Fig.2). To that end, texture is to be synthesized.
  • Texture can be synthesized using either an LP synthesis (a linear prediction using white noise filtering as described above) or a spectral-based white noise modification (by changing the spectral envelope of a white noise for that of the desired texture, using classical overlap and add analysis and synthesis filtering) or even a phase vocoder.
  • the aim of the method is to create a texture by modifying the spectral envelope of a white noise so that it matches the spectral envelope of the grains between which it can be inserted.
  • the spectral envelope of the grains is computed by selecting L samples in the interval [s(n)-L/2 ; s(n)+L/2-1].
  • an advantageous embodiment consists in applying a post-processing step, based on the combination of a long-term and a short-term analysis, as explained hereafter.
  • a long-term analysis frame [s(n)-L/2 ; s(n)+L/2-1] can contain one or more transients compared to a short-term analysis frame [s(n)-M/2 ; s(n)+M/2-1], with M significantly smaller than L.
  • L is of the order of 100-200 ms (8192 samples at a sampling rate of 48 kHz), whereas M is one order of magnitude smaller (for example 20 ms, corresponding to 1024 samples at 48 kHz).
  • Grains are preferably split at minimums of energy (theoretically far away from transients). It is thus almost assured that such a small frame will not contain any transient (if it happens, this is a rare event and not a frequent one).
  • the long-term frame centered on a sample s(n) almost certainly contains transients, whereas the short-term frame centered on the same sample s(n) will almost certainly contain none.
  • the short-term frame is used in this embodiment to fill the gap between two grains.
  • the long-term frame contains acoustical information (details on the spectrum) that is preferably kept, whereas the short-term frame does not contain such information.
  • noisy environments such as the ones encountered during sport events present some stationarity in the long term, which is preferably kept in order to ensure that the sound synthesized is very similar to the original sound.
  • Fig.8 is a schematic workflow of the self-cross-synthesis method.
  • Long-term analysis contains distortions (dashed circles) in the spectral amplitude due to some transients in the long-term audio frame, whereas the short-term spectral amplitude does not present such a distortion.
  • Spectral envelopes are computed for both long-term and short-term spectral amplitudes.
  • the contribution of its spectral envelope is removed from long-term spectral amplitude, giving a flattened spectral content.
  • in the step referenced by the sign "+", the spectral envelope of the short-term analysis is applied to the flattened spectral content from the long-term analysis.
  • the resulting spectral amplitude is then used to synthesize a texture to fill the gap between two grains (either by spectral synthesis or LP synthesis or any other synthesis method that can make use of it, a phase vocoder being also possible).
  • although Fig.8 presents the results of the long-term and short-term analysis steps as "spectral amplitudes", the exact same process can be applied with LP analysis instead (an FFT being an intermediate tool in LP analysis).
  • a band-by-band mean energy computation followed by an interpolation (the bands can be linearly or logarithmically spaced, for instance);
  • the filter obtained through LP analysis has a frequency response that smoothly fits the spectral amplitude (this frequency response can thus be considered as a spectral envelope);
  • a "liftering" of the signal can give a spectral envelope: one can compute the cepstrum, keep only the N first coefficients (for example the 50 first cepstral coefficients for an 8192-point FFT) and compute the inverse cepstral transform to obtain a smoothed version of the spectral amplitude, thus creating a spectral envelope.
  • An example of embodiment consists in using the last of the methods listed above to extract the spectral envelopes of both the long-term and short-term spectral amplitudes; a sketch of this liftering-based envelope swap is given after this list.
  • the method to compute the cepstrum uses a Discrete Cosine Transform of the log-amplitude of the spectrum (DCT(log|X|)).
  • the method used to extract the spectral envelope is independent of the method used to actually synthesize the samples. For instance, one can use LP analysis to extract and modify the envelope, but use the spectral synthesis by modification of a white noise spectral amplitude (plus overlap and add).
  • a transient detection step could be added to the method so that the post-processing is used only when a transient is detected, in order to spare some CPU cycles. However, this would suppose that all transients are detected (which is a problem in noisy sport environments).
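As a minimal sketch of the liftering-based envelope swap of Fig.8, under assumptions: both magnitude spectra are taken on the same FFT grid (e.g. the short-term frame zero-padded to the long-term FFT size), the cepstrum uses a DCT-II as described above, and n_keep = 50 is an illustrative number of retained coefficients.

```python
import numpy as np
from scipy.fft import dct, idct

def lifter_envelope(log_mag, n_keep=50):
    """Smooth log spectral envelope: keep only the first cepstral coefficients."""
    ceps = dct(log_mag, type=2, norm="ortho")
    ceps[n_keep:] = 0.0                        # drop high quefrencies
    return idct(ceps, type=2, norm="ortho")

def self_cross_synthesis(long_mag, short_mag, n_keep=50):
    """Remove the long-term envelope (flattening), re-apply the short-term one."""
    log_long = np.log(long_mag + 1e-12)
    long_env = lifter_envelope(log_long, n_keep)
    short_env = lifter_envelope(np.log(short_mag + 1e-12), n_keep)
    return np.exp(log_long - long_env + short_env)   # distortion-free amplitude
```

The resulting amplitude can then feed any of the synthesis methods above (spectral noise shaping, LP synthesis or a phase vocoder).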

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Quality & Reliability (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

Time-stretching method for changing the length of an audio signal without changing its perceived content, comprising the steps of: a) dividing an input audio signal into a set of grains, a grain being a group of consecutive audio samples from the input audio signal, the length of each grain of the set of grains being determined so as not to split a transient of the audio signal between two successive grains, b) inserting the grains in an output audio signal, the insertion being done at a rate corresponding to a speed factor, and c) synthesizing an audio texture to fill every gap between two successive grains in the output audio signal.

Description

TIME-STRETCHING OF AN AUDIO SIGNAL
The invention generally relates to the time-stretching of an audio signal.
Time-stretching (or time-scaling) permits changing the length of the audio signal without changing its perceived content. It is different from a simple re-sampling (up-sampling and down-sampling), which changes both the length of the audio signal and its spectral content. Indeed, with a simple re-sampling, the frequencies contained in the audio signal are shifted upward or downward when the audio signal length is respectively reduced or increased. This is illustrated by the low-pitched sounds heard in some movies during slow-motion scenes.
For example, time-stretching may be used to increase the length of an audio signal in order to preserve synchronization with a video when doing slow-motion playbacks. In the same way, time-stretching may be used to decrease the length of the audio signal in order to preserve synchronization with a video when doing fast-motion playbacks.
In order to create an output audio signal perceived as similar to the input audio signal but with a different length, one needs to create new information from the input audio signal in order to increase or decrease its length. This creation of new information may comprise adding information and/or removing information.
Time-stretching methods are described in the article of Jordi Bonada Sanjaume, "Audio Time-Scale Modification in the Context of Professional Audio Postproduction" , Research work for PhD program, 2002.
The known methods generally use time-domain processing, like the SOLA algorithm (Synchronous Overlap and Add), the WSOLA algorithm (Waveform Similarity based synchronized Overlap Add), and the PSOLA algorithm (Pitch Synchronous Overlap Add), or spectral-domain processing, like phase vocoder-based solutions, or model-based processing, like LPC methods (Linear Predictive Coding) and HNM methods (Harmonic plus Noise Model).
These methods make the fundamental assumption that the input audio signal mainly contains either voice or music or both. In other words, the input audio signal is mainly composed of time-varying sinusoids, where noise is generally considered as a negligible background signal with little or no impact on the final result. The sinusoidal hypothesis makes it difficult to apply these methods to audio signals that mainly contain noise, like audio signals usually encountered during sport events, for example crowd, applause, ball impacts, and/or field events.
Besides, transient sounds like impact sounds are distorted by these methods and therefore need to be detected and processed separately. Transient detection has many solutions when applied to quiet audio recordings such as speech or music, but it is much more difficult in noisy environments such as a football or basketball stadium. As a result, it can be very complex to detect and process those transient sounds separately during sport events.
Texture-based methods have been developed to synthesize audio content such as crowds and applause. Although these methods create similar-sounding signals, they do not manipulate the length of the original signal.
In the thesis "Expressive Sound Synthesis for Animation", PhD thesis, INRIA, 2009, C. Picard proposes a texture-based approach to sound re-synthesis and time-stretching. However, it is based on the notion of correlation patterns and these patterns must be pre-computed. That computation takes a relatively long time. Besides, it supposes that all the audio samples are known beforehand. This makes it unsuitable for real-time processing.
In the article of Lie Lu, Liu Wenyin, and Hong-Jiang Zhang, "Audio textures: theory and applications", Transactions on Speech and Audio Processing, IEEE, Volume 12, n° 2, pp 156-167, 2004, a method is presented to replace missing parts in an audio stream using texture synthesis. However, this method is not adapted for faster playback speeds. Moreover, this method needs a pre-processing of the whole signal and a computation of a similarity matrix. As a consequence, this method cannot work in real-time.
The invention will improve the situation.
SUMMARY
A first aspect of the invention relates to a time-stretching method for changing the length of an audio signal without changing its perceived content, comprising the steps of: a) dividing an input audio signal into a set of grains, a grain being a group of consecutive audio samples from the input audio signal, the length of each grain of the set of grains being determined so as not to split a transient of the audio signal between two successive grains,
b) inserting the grains in an output audio signal, the insertion being done at a rate corresponding to a speed factor, and
c) synthesizing an audio texture to fill every gap between two successive grains in the output audio signal.
Thanks to these provisions, the performance of the time-stretching is improved. Indeed, the proposed method combines both aspects, time-stretching and textures, in order to manipulate the length of an audio signal without changing its perceived content and without having to process its transients. Moreover, since the transients are not split into different grains and there is no overlap between grains, this method will not suffer from the transient duplication artefact that characterizes many time-domain time-stretching algorithms. Any transient from the input signal will be copied in the output signal once and only once.
A grain can be defined as a variable-length group of consecutive audio samples. Preferably, the grains of the above recited step a) do not overlap and are created in a way so as not to split a transient between two successive grains. Their lengths can vary between two reasonable values, such as between 10 and 40 ms.
In step b), the temporary positions of the grains in the output audio signal can be computed at a rate corresponding for example to the speed factor of a slow-motion of a corresponding image. Then, empty spaces (gaps) can be inserted between the grains of the input audio signal in order to obtain an output signal having a length equal to the original length divided by the speed factor of the slow-motion. For example, a 33% slow-motion signal is three times longer than the original signal (i.e. divided by 0.33). The exact position of each grain in the final output signal can then be adjusted around these temporary positions so as to fit better with the samples that will fill the gaps.
In an embodiment of step c), for every gap in the output audio, samples are synthesized to fill it. The samples are generated using (for instance) either filter-based (LP) or spectral synthesis. The parameters of the filter or the spectral synthesis are computed using parts of the input signal surrounding the sample that separates the grains between which the synthetic audio will be inserted. Finally, the synthetic signal is inserted in the output audio so as to fit the best with the surrounding grains in the signal (reducing discontinuities).
In step a), grain boundaries are set at the positions of the minimum values of the spectral flux (SF) computed on the input signal. These minimums correspond to the samples where the input signal is the most stable, typically samples where the probability of having a transient is low.
However, in another embodiment of step a), the boundaries can be positioned at minimum values of energy.
In another embodiment, the boundaries are positioned at the minimum values of energy closest to the minimum values of the spectral flux (SF). A first operation consists in finding an SF minimum and then finding the closest minimum of energy.
Also, in a further refinement, the position of the boundaries can be repositioned at the zero-crossings of the audio signal closest to the boundaries set using any of the above embodiments.
Thus, to summarize the above presentation of the invention, step a) may comprise setting grain boundaries at positions of minimum values of a spectral flux computed on the input audio signal, and these minimum values correspond to samples of the input audio signal where the input audio signal is the most stable, in particular to samples where the probability of having a transient is minimized.
The spectral flux may be computed by:
- dividing the input audio signal into overlapping constant-length frames,
- computing, for each frame n and for each frequency bin k of the input audio signal, a spectral amplitude X(n,k) and a variation of amplitude ΔX(n,k) between the said frame and the previous frame:
ΔX(n,k) = |X(n,k)| - |X(n-1,k)|
- computing the spectral flux SF by keeping only the positive part of these variations:
SF(n) = Σk (ΔX(n,k) + |ΔX(n,k)|) / 2
Step b) may comprise inserting in the output audio signal a grain Gi of the set of grains, which starts at a time ti in the input audio signal, at a time ti' calculated from the starting time ti-1' of the previous grain Gi-1 in the output audio signal S' and from the length Li-1 of the previous grain Gi-1:
ti' = ti-1' + r · Li-1
t0' = t0, the variable r being equal to the inverse of the speed factor.
Step c) may comprise synthesizing an audio texture to fill a gap between two successive grains by using past and current grains, which are stored in a database.
The texture is preferably synthesized so as not to contain any transient.
The texture may be synthesized by using the N grains closest to the two surrounding grains of the gap to be filled, the N grains being selected in the database, N being a predetermined integer, the measure of closeness being based on the spectral envelope or its parameterization at the position where the two successive grains were split.
Synthesizing may comprise selecting in the database a grain which permits either minimizing the Euclidean distance or maximizing the correlation between its M first samples and the M last samples of the grain before the gap to be filled, M being a predetermined integer. Synthesizing may also comprise selecting in the database a grain which permits either minimizing the Euclidean distance or maximizing the correlation between its M last samples and the M first samples of the grain after the gap to be filled, M being a predetermined integer.
According to other embodiments of the invention, step c) may comprise synthesizing an audio texture to fill a gap between two successive grains by using linear prediction analysis.
According to other embodiments of the invention, step c) may comprise synthesizing an audio texture to fill a gap between two successive grains by using spectral analysis.
A second aspect of the invention relates to a time-stretching device for changing the length of an audio signal without changing its perceived content, comprising an extraction block configured to divide an input audio signal into a set of grains and to insert the grains in an output audio signal, a grain being a group of consecutive audio samples from the input audio signal, grains of the set of grains having a variable length so as not to split a transient of the audio signal between two successive grains, the insertion being done at a rate corresponding to a speed factor, the time-stretching device further comprising a texture synthesizer block configured to synthesize a texture to fill every gap between two successive grains in the output audio signal.
The extraction block may be configured to set grain boundaries at positions of minimum values of a spectral flux computed on the input audio signal.
The extraction block may be configured to insert in the output audio signal a grain Gi of the set of grains, which starts at a time ti in the input audio signal, at a time ti' calculated from the starting time ti-1' of the previous grain Gi-1 in the output audio signal S' and from the length Li-1 of the previous grain Gi-1:
ti' = ti-1' + r · Li-1
t0' = t0, the variable r being equal to the inverse of the speed factor.
The texture synthesizer block may be configured to synthesize an audio texture to fill a gap between two successive grains by using past and current grains, which are stored in a database. The texture synthesizer block may also be configured to synthesize the texture so as not to contain any transient.
For example, the texture synthesizer block may be configured to synthesize the texture by using the N grains closest to the two surrounding grains of the gap to be filled, the N grains being selected in the database, N being a predetermined integer, the measure of closeness being based on the spectral envelope or its parameterization at the position where the two successive grains were split.

BRIEF DESCRIPTION OF THE DRAWINGS
Embodiments of the invention are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings, in which like reference numerals refer to similar elements and in which:
- Fig.1 is a schematic block diagram of a time-stretching device according to a first embodiment of the invention;
- Fig.2 is a flow chart showing steps of a time-stretching method;
- Fig.3 is a chart showing an example of input audio signal;
- Fig.4 is a chart showing the input audio signal of Fig.3 which has been divided into a set of grains;
- Fig.5 is a chart showing a spectral flux computed on another example of input audio signal;
- Fig.6 is a chart showing an output audio signal computed from the input audio signal of Fig.3;
- Fig.7 is a schematic view illustrating the insertion of grains in an output audio signal; and
- Fig.8 is a schematic workflow of a self-cross-synthesis method, in a possible embodiment of step c).

DESCRIPTION OF PREFERRED EMBODIMENTS
Fig.1 shows a time-stretching device 1, which is configured to change the length of an audio signal S (Fig.3) without changing its perceived content. The audio signal S is for example the audio part of a sport-event signal comprising an audio signal and a video signal. The audio signal mainly contains noise, that is to say transients, like audio signals usually encountered during sport events, for example crowd, applause, ball impacts, and/or field events.
The time-stretching device 1 comprises an extraction block 2, which is configured to receive the audio signal S and to divide it into a set of grains G. A grain G is a variable-length group of consecutive audio samples from the audio signal S. The grains G do not overlap and are created in a way so as not to split a transient of the audio signal S between two successive grains G. Their lengths vary between two reasonable values, for example between 5 and 50ms, preferably between 10 and 40ms. The extraction block 2 is also configured to insert the grains G in an output audio signal S' .
The time-stretching device 1 also comprises a database block 3 configured to memorize the extracted grains G and a texture synthesizer block 4 configured to synthesize a texture to fill gaps between the grains in the output audio signal S' (Fig.6).
Referring to Fig.2, we are describing below a time-stretching method, which may be executed by the time-stretching device 1 of Fig.1.
In step S1, the input audio signal S is received by extraction block 2. Then, extraction block 2 divides the input audio signal S into a set of grains G, for example into three grains G0, G1, G2, as represented in Fig.4.
The boundaries of the grains G0, G1, G2 may be set at the positions of minimum values of a spectral flux SF computed on the input audio signal S. Indeed, these minimum values correspond to samples of the input audio signal S where the input audio signal S is the most stable, in particular to samples where the probability of having a transient is minimized.
In order to compute the spectral flux SF, the input audio signal S is first divided into overlapping constant-length frames n. Then for each frame n and for each frequency bin k, the spectral amplitude X(n,k) is computed as well as the variation of amplitude ΔX(n,k) between frame n and frame n-1:
ΔX(n,k) = |X(n,k)| - |X(n-1,k)|
Then the spectral flux SF is defined as the sum of the positive contributions in the previous result, over all frequency bins k:
SF(n) = Σk (ΔX(n,k) + |ΔX(n,k)|) / 2
Thus, in this embodiment of the invention, the only parameter for the definition of the spectral flux SF is the constant length of the analysis frames n.
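As an illustration only (not the patent's reference implementation), the computation can be sketched with NumPy; the 1024-sample frame, 512-sample hop and Hann window are assumptions, since the text fixes only the constant frame length:

```python
import numpy as np

def spectral_flux(signal, frame_len=1024, hop=512):
    """Half-wave rectified frame-to-frame magnitude difference, one value per frame."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    mag = np.abs(np.fft.rfft(frames, axis=1))        # |X(n, k)|
    diff = np.diff(mag, axis=0)                      # deltaX(n, k)
    sf = np.sum((diff + np.abs(diff)) / 2, axis=1)   # keep positive contributions
    return np.concatenate(([0.0], sf))               # frame 0 has no predecessor
```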
Once the spectral flux SF is computed, its local minimums are located, with the constraint that they must be separated by at least a first predetermined value, for example 5ms, and no more than a second predetermined value, for example 50ms, which means that some minimums may not be taken into consideration.
Then grain boundaries are positioned at the closest zero-crossing of the audio samples for each minimum taken into account. As a variant, the boundaries can be positioned at the closest minimum of energy instead of the closest zero-crossing.
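A possible sketch of this boundary selection, under the same illustrative assumptions (the greedy handling of the minimum-separation constraint and the omission of the maximum-separation fallback are simplifications):

```python
import numpy as np

def grain_boundaries(signal, sf, hop=512, sr=48000, min_sep_ms=5.0):
    """Pick spectral-flux local minima at least min_sep_ms apart,
    then snap each boundary to the nearest zero-crossing."""
    min_sep = max(1, int(min_sep_ms * 1e-3 * sr / hop))
    minima = [n for n in range(1, len(sf) - 1)
              if sf[n] <= sf[n - 1] and sf[n] <= sf[n + 1]]
    kept, last = [], -min_sep
    for n in minima:                       # greedy minimum-separation constraint
        if n - last >= min_sep:
            kept.append(n)
            last = n
    signs = np.signbit(signal).astype(np.int8)
    zc = np.where(np.diff(signs) != 0)[0]  # zero-crossing sample positions
    return [int(zc[np.argmin(np.abs(zc - n * hop))]) for n in kept]
```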
Fig.5 shows an embodiment where the audio signal S corresponds to a recording of a football match. Curve CSF represents the spectral flux SF computed for the audio signal S. Lines B1 to B18 represent grain boundaries positioned at local minimums, for example at the closest zero-crossing. In this example, the set of grains G comprises nineteen grains G, respectively defined by two consecutive grain boundaries B.
Thus, the extraction step is based on a time-domain algorithm, with variable grain size for allowing management of transients.
Some variants can be considered, such as a multi-band spectral flux, which means that a separate spectral flux is computed for each band of a filter bank. This variant may improve the precision of the extraction step.
In step S2, the grains G, for example the grains G0, G1, G2 in the example of Fig.4, are inserted in an output audio signal S', as represented in Fig.6. The output audio signal S' corresponds to the audio signal S in slow-motion, the insertion being done at a rate corresponding to the speed factor of the slow-motion. In other words, empty spaces or gaps H0, H1, H2 (Fig.7) are inserted between the grains G0, G1, G2 of the input audio signal S in order to obtain an output signal S' whose length is equal to the original length divided by the speed factor of the slow-motion. For example, a 33% slow-motion signal S' is three times longer than the original signal S, which means its length is divided by 0.33.
We consider a grain Gi starting at time ti in the input audio signal S. Its position ti' in the output audio signal S' can be calculated from the position ti-1' and the length Li-1 of the grain Gi-1 in the output audio signal S':
ti' = ti-1' + r · Li-1
t0' = t0, where the variable r is equal to the inverse of the speed factor. For example r = 3 if the speed factor is 33.333%. The speed factor and the variable r can be changed with every grain G, for example if the user changes the playback speed of the video between grains of the input audio signal S.
Note that time ti' is not necessarily an integer. Indeed, the closest integer value may be used instead of time ti' to position the grain Gi in the output audio signal S', but its real value ti' is preferably used to compute ti+1' in order to minimize rounding errors.
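A minimal sketch of this bookkeeping (a hypothetical helper, assuming the per-grain values of r are given):

```python
def output_positions(grain_starts, grain_lengths, r_per_grain):
    """Return integer placement times while keeping the exact fractional
    t_i' for the recursion, so rounding errors do not accumulate."""
    t = float(grain_starts[0])            # t_0' = t_0
    placed = [round(t)]
    for i in range(1, len(grain_starts)):
        t += r_per_grain[i - 1] * grain_lengths[i - 1]   # t_i' = t_{i-1}' + r * L_{i-1}
        placed.append(round(t))           # rounded only for placement
    return placed
```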
Fig.7 illustrates the positioning of grains G0, G1, G2 in the output audio signal S' for a 50% slow-motion, that is to say a speed factor of 50%. Dashed rectangles represent gaps H0, H1, H2 that have to be filled with textures.
If an accumulation of rounding errors leads to a non-negligible offset between the audio and the video, it can be corrected by re-synchronizing the grain Gj with the video frame it was extracted from. An offset is considered as non-negligible if the viewer can notice the offset between sound and video.
The "ideal" position ti' with respect to audio/video synchronism of the grain Gi is not always that ideal when it comes to overlapping well with the samples generated to fill a gap H. Therefore the position of the i-th grain Gi can optionally be slightly adjusted around its theoretical position ti' by an offset δ, with -Δ ≤ δ ≤ Δ. In case such an adjustment is made, the final position of the grain Gi in the output signal S' is then noted ti'', with ti'' = ti' + δ.
In step S3, the texture synthesizer block 4 synthesizes a texture to fill every gap H in the output audio signal S' .
According to some embodiments of the invention, the texture is synthesized using past and current grains G, stored in database 3, so as to fit the best with the surrounding grains G in the output audio signal S' and so as not to contain any transient.
The texture synthesizer block 4 has to deal with different constraints. Indeed, the synthesized texture must connect well (temporally speaking) with its surrounding grains G to reduce audible discontinuities. For example, the synthesized texture of gap H0 must connect well with grains G0 and G1. Moreover, the synthesized texture must have a spectral content matching its surrounding grains G. In other words, it must be perceptually relevant. And furthermore, the synthesized texture must not contain any transient. For example, the synthesized texture of gap H0 must have a spectral content matching grains G0 and G1 and must not contain any transient.
The texture synthesis algorithm and the organization of the database 3 are adapted to these constraints. During the recording of an audio signal S, for example from a sport event, grains G are extracted from the input audio signal S and organized in the database 3. In the example of Fig.4, grains G0, G1 and G2 are extracted from audio signal S and memorized in database 3.
Organization of the database 3 can be for instance a k-means clustering, using a parameterization of the grains' spectral envelope, for example MFCC (Mel-Frequency Cepstrum Coefficients) or LPCC (Linear Prediction Cepstrum Coefficients).
As a variant, any approach that allows quickly finding a grain G can be used. Organization of the database 3 can also mean that redundant or oldest grains G are progressively discarded in order to keep the database 3 at a reasonable size. Moreover, grains G that might contain a transient are immediately discarded, using the spectral flux SF as a measure of transientness likelihood.
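One possible organization, sketched with librosa and scikit-learn (illustrative tool choices, not mandated by the text; the reduced FFT size accounts for grains being only tens of milliseconds long):

```python
import numpy as np
import librosa
from sklearn.cluster import KMeans

def build_grain_database(grains, sr=48000, n_mfcc=13, n_clusters=32):
    """Cluster grains by the MFCC parameterization of their spectral envelope."""
    feats = np.stack([
        librosa.feature.mfcc(y=np.asarray(g, dtype=float), sr=sr,
                             n_mfcc=n_mfcc, n_fft=512, hop_length=128).mean(axis=1)
        for g in grains])
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(feats)
    return km, feats   # a query grain is routed with km.predict(its_features)
```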
During a slow-motion, input grains G are positioned in the output audio signal S' as explained previously. For example, grains Go, G i , G2 are positioned in the output audio signal S' as represented in Fig.7.
Then for each gap H the texture synthesizer block 4 selects in the database the N grains closest to the two surrounding grains G, N being a predetermined integer. The measure of closeness is based on the spectral envelope or its parameterization at the position where the two grains G were split, since it is likely the most stable part of the audio signal S and it corresponds to the spectral envelope that the texture will have to match.
Using those N preselected grains, the texture synthesizer block 4 synthesizes a texture for filling the gap H. For example, we denote by GL the grain before the gap H to be filled and by GR the grain after the gap H. Texture synthesizer block 4 selects a first grain GA which permits either minimizing the Euclidean distance or maximizing the correlation between its M first samples and the M last samples of grain GL, M being a predetermined integer. A cost CA different from zero is memorized for grain GA so as not to select this grain again in the next step. For example cost CA = C + R, where C is a constant minimal cost and R is a random positive value, C and R being integer values.
Then texture synthesizer block 4 selects a second grain GB which permits either minimizing the Euclidean distance or maximizing the correlation between its M first samples and the M last samples of grain GA and whose cost CB is zero. Then a new cost CB different from zero is memorized for grain GB so as not to select this grain again at the next step. For example cost CB = C + R, where C is the constant minimal cost and R is a random positive value, C and R being integer values. Then all the other memorized costs are decreased by one unit.
And so on until a grain GP finally overlaps with grain GR. The last selected grain GP permits either minimizing the Euclidean distance or maximizing the correlation between its M last samples and the M first samples of grain GR. Note that if grain GA is longer than the gap H, it overlaps with grain GR and as a consequence it is selected so that it is optimal considering the M last samples of grain GL and the M first samples of grain GR.
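The chaining loop can be sketched as follows (a simplification: the final match against GR is omitted, grains are assumed longer than M samples, enough candidates are assumed so that a zero-cost grain always exists, and C = 3 and the range of R are arbitrary choices):

```python
import random
import numpy as np

def chain_grains(candidates, g_left, gap_len, M=256, C=3):
    """Greedily chain database grains across a gap of gap_len samples."""
    costs = [0] * len(candidates)
    chain = []
    tail = np.asarray(g_left[-M:], dtype=float)
    filled = 0
    while filled < gap_len:
        free = [i for i, c in enumerate(costs) if c == 0]
        best = min(free, key=lambda i: np.linalg.norm(
            np.asarray(candidates[i][:M], dtype=float) - tail))
        chain.append(candidates[best])
        costs[best] = C + random.randint(1, 3)   # penalize reuse of this grain
        costs = [c if i == best else max(0, c - 1)
                 for i, c in enumerate(costs)]   # other costs decay by one unit
        tail = np.asarray(candidates[best][-M:], dtype=float)
        filled += len(candidates[best]) - M      # M samples are overlapped
    return chain
```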
When the texture to be used in gap H has been synthesized with grains GA to GP, it is overlap-added with the two grains GL and GR with an overlap of M samples on each side. If the texture goes beyond the M-th sample of GR, the exceeding samples are zeroed. This can happen if the grain GP is longer than 2M+LP, where LP is the length of the remaining gap when GA to GP-1 have been added to the texture. Of course this could also apply when grain GA is longer than the gap H.
Overlapping of the samples of GL, GR and the texture to be used in gap H means that the samples occupying a same temporal position are summed with a weighting factor. The weighting factor is determined by a temporal envelope that creates two fade-in/fade-out effects, respectively between the M last samples of GL and the M first samples of the texture and between the M last samples of the texture and the M first samples of GR. The temporal envelope usually used is a normalized raised cosine.
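A minimal sketch of one such M-sample raised-cosine crossfade (the complementary Hann-shaped fades sum to one at every overlapped sample):

```python
import numpy as np

def crossfade(left, right, M=256):
    """Overlap-add the last M samples of `left` with the first M of `right`."""
    fade_in = 0.5 * (1.0 - np.cos(np.pi * np.arange(M) / M))   # 0 -> 1
    fade_out = 1.0 - fade_in                                   # 1 -> 0
    mixed = left[-M:] * fade_out + right[:M] * fade_in
    return np.concatenate([left[:-M], mixed, right[M:]])
```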
Note that an amplitude normalization of the texture (or even an amplitude normalization of each of its grains GA to GP separately) may be realized in order to ensure the perceptual continuity between the input grains GL, GR and the texture in the output audio signal S'.
According to other embodiments of the invention, the texture is synthesized using either filter-based (LP) or spectral synthesis. The parameters of the filter or the spectral synthesis may be computed using parts of the input audio signal S surrounding the sample that separates the grains G between which the synthetic audio will be inserted. Finally, the synthetic signal is inserted in the output audio signal S' so as to fit the best with the surrounding grains G in the signal S', that is to say to reduce discontinuities. The exact position of each grain G0, G1, G2 in the final output audio signal S' may be adjusted around the temporary positions t0', t1', t2' so as to fit better with the samples that will fill the gaps H0, H1, H2.
We describe below an example of linear prediction analysis and synthesis. We consider the input signal S and two consecutive grains Gn and Gn+1. We call S(m) the m-th sample of signal S and the first sample of grain Gn+1. We assume that grain Gn has just been added to the output signal S'.
Here, the step S3 comprises an operation of calculating the Linear Prediction Coefficients (LPC) of a p-order LP filter, using for example the samples from S(m-N) to S(m+N-1) weighted by a windowing function such as a Hann window. N and p (the order of the filter) are predefined parameters.
The step S3 then comprises an operation of initializing the Linear Prediction (LP) filter using the p last samples of Gn, that is to say samples S(m-p) to S(m-1). This ensures continuity between the samples of grain Gn and the first samples that will be generated by the filter.
At instant n, the output y(n) of a Linear Prediction filter with input x(n) obeys an equation of the form: y(n) = b·x(n) - (a1·y(n-1) + a2·y(n-2) + ... + ap·y(n-p)), where p is the order of the filter, b is the gain of the filter and the ai are the Linear Prediction Coefficients (LPC). One can see that the output at time n is a function of the outputs at instants n-1 to n-p. For the p first values of n, part of these past outputs (y(-1), ..., y(-p)) are not available and they are usually fixed to zero: y(n) = 0 for n < 0.
Instead, if we consider that the p last samples of a grain Gn are the last p outputs of the filter before y(0) (i.e. y(-1), ..., y(-p)), we obtain a filter output that is continuous with the grain samples. The filter has been initialized to a state where its samples are a continuation of the samples of grain Gn.
The step S3 further comprises an operation of filtering a white noise using the LP filter properly initialized. The white noise has a length of at least Dn + H + Δ + Ln+1, where Dn is the length of the gap to fill between the last sample of the grain Gn in the output signal S' (i.e. tn'' + Ln, with Ln the length of grain Gn) and the temporary position tn+1' of grain Gn+1, H is the length of the overlapping region between the synthetic signal and the grain Gn+1, Δ is the maximum displacement allowed for the grain Gn+1 around tn+1' (in order to get tn+1''), and Ln+1 is the length of grain Gn+1.
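By way of illustration, the LP analysis, the initialization of the filter on the p last samples of Gn and the filtering of white noise could be sketched as follows (a minimal, non-limiting example; the autocorrelation/Levinson-Durbin estimation of the coefficients and the scipy-based filtering are assumptions about one possible implementation):

```python
import numpy as np
from scipy.signal import lfilter, lfiltic

def lpc_levinson(frame, p):
    """Estimate LPC a_1..a_p by the autocorrelation (Levinson-Durbin) method."""
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    a, err = np.zeros(p), r[0]
    for i in range(p):
        k = -(r[i + 1] + np.dot(a[:i], r[i:0:-1])) / err  # reflection coefficient
        a[:i] += k * a[:i][::-1]
        a[i] = k
        err *= (1.0 - k * k)
    return a, np.sqrt(err)      # coefficients a_i and an estimate of the gain b

def lp_noise(S, m, grain_n, N, p, length, rng=None):
    """Filter white noise with an LP filter fitted on S(m-N)..S(m+N-1) and
    initialized so that its output continues the p last samples of grain_n."""
    rng = rng or np.random.default_rng()
    frame = S[m - N:m + N] * np.hanning(2 * N)   # windowed analysis frame
    a, b = lpc_levinson(frame, p)
    den = np.concatenate(([1.0], a))             # y(n) = b·x(n) - sum a_i·y(n-i)
    zi = lfiltic([b], den, grain_n[-p:][::-1])   # past outputs, most recent first
    synth, _ = lfilter([b], den, rng.standard_normal(length), zi=zi)
    return synth
```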
The step S3 further comprises an operation of appending the synthetic signal (i.e. the filtered white noise) to grain Gn in the output signal S'. As an optional refinement, the synthetic signal energy can be normalized beforehand to the energy of a group of samples surrounding sample S(m) (i.e. from S(m-M) to S(m+M)).
The step S3 further comprises an operation of computing the cross-correlation between grain Gn+1 and the output signal samples from tn+1'-Δ to tn+1'+Ln+1+Δ, and an operation of locating the maximum of the cross-correlation in the region that corresponds to positioning grain Gn+1 between tn+1'-Δ and tn+1'+Δ. The position of that maximum sets the value of tn+1''.
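A minimal sketch of this alignment step (the function name and the use of numpy's correlate are assumptions) could be:

```python
import numpy as np

def align_grain(out_sig, grain, t_prime, delta):
    """Return the position t'' in [t' - delta, t' + delta] maximizing the
    cross-correlation between the grain and the output signal."""
    L = len(grain)
    segment = out_sig[t_prime - delta : t_prime + L + delta]
    xc = np.correlate(segment, grain, mode="valid")  # 2*delta + 1 candidates
    return t_prime - delta + int(np.argmax(xc))
```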
The step S3 further comprises an operation of overlapping and adding grain Gn+1 to the output signal S', starting from the sample at tn+1''. The region from time tn+1'' to tn+1''+H is where the overlap takes place. The LP synthetic samples beyond tn+1''+H are discarded (that is, weighted to zero). N, p, H and Δ (and possibly M) are parameters of the system.
We describe below an example of spectral analysis and synthesis. This embodiment is very similar to the linear prediction analysis and synthesis embodiment. The difference is that an FFT is computed on the windowed frame [S(m-N)...S(m+N-1)] instead of an LP analysis.
Moreover, the white noise used to generate the synthetic signal is sliced into overlapping windowed frames, the Fourier transforms (by means of an FFT) of these frames are computed and the amplitudes of their spectra are modified to be equal to the amplitude of the spectrum previously computed on the frame [S(m-N)...S(m+N-1)]. The modified FFTs are inverted (IFFT) in order to obtain frames of a "colored" noise. These frames are then overlap-added so as to end up with a continuous signal r whose spectral characteristics match those of the input signal frame [S(m-N)...S(m+N-1)].
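The following Python sketch illustrates this spectral shaping of white noise (a minimal example; the frame length, the hop size, the Hann windowing and the truncation of the analysis frame to frame_len samples are assumptions):

```python
import numpy as np

def shaped_noise(target_frame, length, frame_len=1024, rng=None):
    """Overlap-add white-noise frames whose spectral amplitude is replaced by
    that of target_frame, yielding a continuous 'colored' noise signal r."""
    rng = rng or np.random.default_rng()
    win = np.hanning(frame_len)
    hop = frame_len // 4       # 75% overlap so the squared windows OLA smoothly
    target_amp = np.abs(np.fft.rfft(target_frame, frame_len))
    noise = rng.standard_normal(length + frame_len)
    r = np.zeros(length + frame_len)
    for start in range(0, length, hop):
        spec = np.fft.rfft(noise[start:start + frame_len] * win)
        spec = target_amp * np.exp(1j * np.angle(spec))  # keep the noise phase
        r[start:start + frame_len] += np.fft.irfft(spec, frame_len) * win
    return r[:length]
```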
The length R of the synthetic signal r should be at least Dn + 2H + 2Δ + Ln+1.
The synthetic signal r is not necessarily continuous with the last samples of grain Gn. Therefore, it is overlap-added with the H last samples of grain Gn. Optionally, the same cross-correlation based method for position adjustment as in the LP synthesis (which can also be used for grain Gn+1) can be applied to adjust the position of the synthetic signal in the output signal S' (i.e. to better fit with the last samples of grain Gn). However the offset can only go to the left, and the samples of the synthetic signal r that fall before the overlapping region are discarded, that is to say that the samples added to the output are [r(δ) ... r(R)], with δ the offset (0 ≤ δ ≤ Δ) obtained by the cross-correlation alignment method, the samples [r(1) ... r(δ-1)] being discarded. As an optional refinement, the synthetic signal energy can be normalized beforehand to the energy of a group of samples surrounding S(m) (i.e. from sample S(m-M) to S(m+M)). Once the synthetic signal r has been added to the output, the grain Gn+1 is positioned and added.
In a variant, other texture synthesis algorithms can be used. For instance, another possible approach is to select a single grain GS with the exact needed duration in the part of the audio signal S already recorded during the event. But this adds a strong constraint (a fixed length) reducing the number of available candidates, whereas the previous algorithm can find a combination of various small grains that globally fits better in the gap. As explained above, step S3 (corresponding to step c) in the general wording of the invention) aims to fill every gap in the output audio signal (as shown in Fig. 2). To that end, a texture is to be synthesized.
Texture can be synthesized using either an LP synthesis (a linear prediction using white noise filtering as described above), or a spectral-based white noise modification (by changing the spectral envelope of a white noise for that of the desired texture, using classical overlap-and-add analysis and synthesis filtering), or even a phase vocoder.
In both cases, the aim of the method is to create a texture by modifying the spectral envelope of a white noise so that it matches the spectral envelope of the grains between which it will be inserted.
The spectral envelope of the grains is computed by selecting L samples in the interval [s(n)-L/2 ; s(n)+L/2-1] from the input audio signal s, with s(n) being the sample that separates two grains Gk and Gk+1 (typically the first sample of Gk+1), as computed in the general first step a).
However, this method raises a problem when the samples included in [s(n)-L/2 ; s(n)+L/2-1] contain one or more transients. In order to eliminate (or significantly reduce) the impact of transients on the spectral envelope used to generate the texture, an advantageous embodiment consists in applying a post-processing step based on the combination of a long-term and a short-term analysis, as explained hereafter.
It is considered that a long-term analysis frame [s(n)-L/2 ; s(n)+L/2-1] can contain one or more transients, as compared to a short-term analysis frame [s(n)-M/2 ; s(n)+M/2-1], with M significantly smaller than L.
In a possible implementation, L is of the order of 100-200 ms (8192 samples at a sampling rate of 48 kHz), whereas M is one order of magnitude smaller (for example 20 ms, corresponding to 1024 samples at 48 kHz). Grains are preferably split at a minimum of energy (theoretically far away from transients). It is thus almost assured that such a small frame will not contain any transient (if it does, this is a rare event and not a frequent one).
Therefore, on the one hand, the long-term frame centered on a sample s(n) almost certainly contains transients and, on the other hand, the short-term frame centered on the same sample s(n) will almost certainly contain none. As far as the spectral envelope is concerned, the short-term frame is the one used in this embodiment to fill the gap between two grains.
However, the long-term frame contains acoustical information (details of the spectrum) that is preferably kept, and the short-term frame does not contain such information. Indeed, noisy environments such as the ones encountered during sport events present some stationarity in the long term, which is preferably kept in order to ensure that the synthesized sound is very similar to the original sound.
Therefore, it is sought to insert between two grains a texture which has the spectral envelope of a short-term analysis frame but the spectral content of a long-term one. To that end, a cross-synthesis as detailed below is used in an advantageous embodiment.
Cross-synthesis consists in:
- using the spectral envelope of some sound and the spectral residual of some other sound,
- and then combining them, to obtain a third sound which has features from both.
Here, two different analyses of the same sound signal are combined to remove the artifacts due to the transients without detecting said transients. This method is hereafter called "Self-Cross-Synthesis" to designate a cross-synthesis with the same audio signal (though analyzed differently) used as both inputs of the method. This makes it possible to clean the texture of most of the distortions induced by surrounding transients.
Fig. 8 is a schematic workflow of the self-cross-synthesis method. The long-term analysis contains distortions (dashed circles) in the spectral amplitude due to some transients in the long-term audio frame, whereas the short-term spectral amplitude does not present such a distortion. Spectral envelopes are computed for both long-term and short-term spectral amplitudes. In the step referenced by the sign "-", the contribution of its spectral envelope is removed from the long-term spectral amplitude, giving a flattened spectral content. In the step referenced by the sign "+", the spectral envelope of the short-term analysis is applied to the flattened spectral content from the long-term analysis. The resulting spectral amplitude is then used to synthesize a texture to fill the gap between two grains (either by spectral synthesis or LP synthesis or any other synthesis method that can make use of it, a phase vocoder being also possible).
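A compact sketch of this self-cross-synthesis on spectral amplitudes could read as follows (a minimal example; envelope_of is a hypothetical helper such as the cepstral liftering described below, and the two amplitudes are assumed to be interpolated to the same number of frequency bins):

```python
import numpy as np

def self_cross_synthesis(long_amp, short_amp, envelope_of):
    """Combine the flattened long-term spectrum with the short-term envelope.

    long_amp, short_amp : spectral amplitudes of the long-term and short-term
    analysis frames; envelope_of : function returning a spectral envelope.
    """
    eps = 1e-12                                           # avoid division by zero
    flattened = long_amp / (envelope_of(long_amp) + eps)  # the "-" step of Fig. 8
    return flattened * envelope_of(short_amp)             # the "+" step of Fig. 8
```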
Although Fig. 8 presents the results of the long-term and short-term analysis steps as "spectral amplitudes", the exact same process can be applied with an LP analysis instead (an FFT being an intermediate tool in LP analysis).
Extraction of spectral envelopes can be performed according to different possible methods. The most commonly encountered methods are:
• peak-to-peak interpolation of the spectrum (taking all the peaks of the spectral amplitude and linking them together);
• band-by-band mean energy computation, followed by an interpolation (the bands can be linearly or logarithmically spaced, for instance);
• the filter obtained through LP analysis has a frequency response that smoothly fits the spectral amplitude (this frequency response can thus be considered as a spectral envelope);
• a "liftering" of the signal (i.e. a filtering in the cepstral domain) can give a spectral envelope: one can compute the cepstrum, keep only the N first coefficients (for example the 50 first cepstral coefficients for an 8192-point FFT) and compute the inverse cepstral transform to obtain a smoothed version of the spectral amplitude, thus creating a spectral envelope.
An example of embodiment consists in using the last of the possible methods listed above to extract the spectral envelope of both the long-term and short-term spectral amplitudes. The method to compute the cepstrum uses a Discrete Cosine Transform of the log-amplitude of the spectrum (DCT(log10(|FFT(x)|))).
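As an illustration, this cepstral liftering can be sketched as follows (a minimal example; scipy's DCT is assumed as the transform and n_ceps is a hypothetical parameter name):

```python
import numpy as np
from scipy.fft import dct, idct

def cepstral_envelope(amplitude, n_ceps=50):
    """Spectral envelope by liftering: DCT of the log-amplitude, keep only the
    n_ceps first coefficients, inverse DCT, then back to linear amplitude."""
    log_amp = np.log10(amplitude + 1e-12)      # avoid log of zero
    ceps = dct(log_amp, norm="ortho")
    ceps[n_ceps:] = 0.0                        # liftering: drop the fine detail
    return 10.0 ** idct(ceps, norm="ortho")    # smoothed spectral amplitude
```

A function of this kind can serve as the envelope extractor assumed in the self-cross-synthesis sketch above.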
It should be noted that the method used to extract the spectral envelope is independent of the method used to actually synthesize the samples. For instance, one can use an LP analysis to extract and modify the envelope, but use the spectral synthesis by modification of a white noise spectral amplitude (plus overlap and add).
In practice, however, it is much more efficient implementation-wise to use similar methods for the spectral envelope modification and the texture synthesis (e.g. modify the spectral envelope in the spectral domain using the envelopes obtained by liftering, and synthesize the samples of the texture with the spectral method); otherwise one may need to switch from one domain (LP) to the other (spectral) and vice versa, which might be costly in terms of computational load.
Finally, a transient detection step could be added to the method so that the post-processing is used only when a transient is detected, in order to spare some CPU cycles. However, this would suppose that all transients are detected (which is a problem in noisy sport environments).
While there has been illustrated and described what are presently considered to be the preferred embodiments of the present invention, it will be understood by those skilled in the art that various other modifications may be made, and equivalents may be substituted, without departing from the true scope of the present invention. Additionally, many modifications may be made to adapt a particular situation to the teachings of the present invention without departing from the central inventive concept described herein. Furthermore, an embodiment of the present invention may not include all of the features described above. Therefore, it is intended that the present invention not be limited to the particular embodiments disclosed, but that the invention include all embodiments falling within the scope of the invention as broadly defined above. In particular, the embodiments described above could be combined.
Expressions such as "comprise", "include", "incorporate", "contain", "is" and "have" are to be construed in a non-exclusive manner when interpreting the description and its associated claims, namely construed to allow for other items or components which are not explicitly defined also to be present.
A person skilled in the art will readily appreciate that various parameters disclosed in the description may be modified and that various embodiments disclosed may be combined without departing from the scope of the invention.

Claims

1. Time-stretching method for changing the length of an audio signal without changing its perceived content, comprising the steps of:
a) dividing an input audio signal (S) into a set of grains (G0, G1, G2), a grain being a group of consecutive audio samples from the input audio signal (S), the length of each grain of the set of grains being determined so as not to split a transient of the audio signal (S) between two successive grains,
b) inserting the grains in an output audio signal (S'), the insertion being done at a rate corresponding to a speed factor, and
c) synthesizing an audio texture to fill every gap (H) between two successive grains in the output audio signal (S' ).
2. Time-stretching method according to claim 1, wherein step a) comprises setting grain (G0, G1, G2) boundaries at positions of minimum values of a spectral flux computed on the input audio signal (S).
3. Time-stretching method according to claim 2, wherein the spectral flux is computed by:
dividing the input audio signal (S) into overlapping constant-length frames,
computing, for each frame n and for each frequency bin k of the input audio signal (S), a spectral amplitude X(n,k) and a variation of amplitude ΔX(n,k) between the said frame and the previous frame: ΔX(n,k) = |X(n,k)| - |X(n-1,k)|,
computing the spectral flux SF by using the variation of amplitude values: SF(n) = Σk max(ΔX(n,k), 0).
4. Time-stretching method according to one of the claims 1 to 3, wherein step b) comprises inserting in the output audio signal (S') a grain Gi of the set of grains, which starts at a time ti in the input audio signal (S), at a time ti' calculated from the starting time t(i-1)' of the previous grain Gi-1 in the output audio signal S' and from the length L(i-1) of the previous grain Gi-1:
ti' = t(i-1)' + r · L(i-1), with t0' = t0,
the variable r being equal to the inverse of the speed factor.
5. Time-stretching method according to one of the claims 1 to 4, wherein step c) comprises synthesizing an audio texture to fill a gap between two successive grains by using past and current grains, which are stored in a database.
6. Time-stretching method according to claim 5, wherein the texture is synthesized so as not to contain any transient.
7. Time-stretching method according to claim 6, wherein the texture is synthesized by using the N grains closest to the two grains surrounding the gap to be filled, the N grains being selected in the database, N being a predetermined integer, the measure of closeness being based on the spectral envelope or its parameterization at the position where the two successive grains were split.
8. Time-stretching method according to claim 7, wherein synthesizing comprises selecting in the database a grain which permits either minimizing the Euclidean distance or maximizing the correlation between its M first samples and the M last samples of the grain before the gap to be filled, M being a predetermined integer.
9. Time-stretching method according to claim 8, wherein synthesizing comprises selecting in the database a grain which permits either minimizing the Euclidean distance or maximizing the correlation between its M last samples and the M first samples of the grain after the gap to be filled, M being a predetermined integer.
10. Time-stretching method according to one of the claims 1 to 4, wherein step c) comprises synthesizing an audio texture to fill a gap between two successive grains by using linear prediction analysis.
11. Time-stretching method according to one of the claims 1 to 4, wherein step c) comprises synthesizing an audio texture to fill a gap between two successive grains by using spectral analysis.
12. Time-stretching device (1) for changing the length of an audio signal without changing its perceived content, comprising an extraction block configured to divide an input audio signal (S) into a set of grains (G0, G1, G2) and to insert the grains in an output audio signal (S'), a grain being a group of consecutive audio samples from the input audio signal (S), the grains of the set of grains having a variable length so as not to split a transient of the audio signal (S) between two successive grains, the insertion being done at a rate corresponding to a speed factor, the time-stretching device further comprising a texture synthesizer block (4) configured to synthesize a texture to fill every gap (H) between two successive grains in the output audio signal (S').
13. Time-stretching device according to claim 12, wherein the extraction block is configured to set grain (G0, G1, G2) boundaries at positions of minimum values of a spectral flux computed on the input audio signal (S).
14. Time-stretching device according to claim 12 or 13, wherein the extraction block is configured to insert in the output audio signal (S') a grain Gi of the set of grains, which starts at a time ti in the input audio signal (S), at a time ti' calculated from the starting time t(i-1)' of the previous grain Gi-1 in the output audio signal S' and from the length L(i-1) of the previous grain Gi-1:
ti' = t(i-1)' + r · L(i-1), with t0' = t0,
the variable r being equal to the inverse of the speed factor.
15. Time-stretching device according to one of the claims 12 to 14, wherein the texture synthesizer block is configured to synthesize an audio texture to fill a gap between two successive grains by using past and current grains, which are stored in a database.
PCT/EP2012/001540 2011-04-07 2012-04-06 Time-stretching of an audio signal WO2012136380A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
EP11305407.6 2011-04-07
EP11305407A EP2509073A1 (en) 2011-04-07 2011-04-07 Time-stretching of an audio signal

Publications (1)

Publication Number Publication Date
WO2012136380A1 true WO2012136380A1 (en) 2012-10-11

Family

ID=44358241

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EP2012/001540 WO2012136380A1 (en) 2011-04-07 2012-04-06 Time-stretching of an audio signal

Country Status (2)

Country Link
EP (1) EP2509073A1 (en)
WO (1) WO2012136380A1 (en)


Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9508386B2 (en) 2014-06-27 2016-11-29 Nokia Technologies Oy Method and apparatus for synchronizing audio and video signals
US10231001B2 (en) 2016-05-24 2019-03-12 Divx, Llc Systems and methods for providing audio content during trick-play playback


Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20010028634A1 (en) * 2000-01-18 2001-10-11 Ying Huang Packet loss compensation method using injection of spectrally shaped noise

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
C. PICARD: "Expressive Sound Synthesis for Animation", PhD thesis, INRIA, 2009
ISMO KAUPPINEN ET AL: "Audio Signal Extrapolation - Theory and Application", PROCEEDINGS OF THE 5TH INTERNATIONAL CONFERENCE ON DIGITAL AUDIO EFFECTS, 1 September 2002 (2002-09-01), pages 105 - 110, XP055005290 *
JORDI BONADA SANJAUME: "Audio Time-Scale Modification in the Context of Professional Audio Postproduction", RESEARCH WORK FOR PHD PROGRAM, 2002
LIE LU; LIU WENYIN; HONG-JIANG ZHANG: "Audio textures: theory and applications", TRANSACTIONS ON SPEECH AND AUDIO PROCESSING, IEEE, vol. 12, no. 2, 2004, pages 156 - 167, XP011110595, DOI: doi:10.1109/TSA.2003.819947
LU L ET AL: "Audio Textures: Theory and Applications", IEEE TRANSACTIONS ON SPEECH AND AUDIO PROCESSING, IEEE SERVICE CENTER, NEW YORK, NY, US, vol. 12, no. 2, 1 March 2004 (2004-03-01), pages 156 - 167, XP011110595, ISSN: 1063-6676, DOI: 10.1109/TSA.2003.819947 *
PICARD CÉCILE ET AL: "Retargetting Example Sounds to Interactive Physics-driven Animations", CONFERENCE: 35TH INTERNATIONAL CONFERENCE: AUDIO FOR GAMES; FEBRUARY 2009, AES, 60 EAST 42ND STREET, ROOM 2520 NEW YORK 10165-2520, USA, 1 February 2009 (2009-02-01), XP040509265 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10176813B2 (en) 2015-04-17 2019-01-08 Dolby Laboratories Licensing Corporation Audio encoding and rendering with discontinuity compensation
CN112511886A (en) * 2020-11-25 2021-03-16 杭州当虹科技股份有限公司 Audio and video synchronous playing method based on audio expansion and contraction
CN112511886B (en) * 2020-11-25 2023-03-21 杭州当虹科技股份有限公司 Audio and video synchronous playing method based on audio expansion and contraction

Also Published As

Publication number Publication date
EP2509073A1 (en) 2012-10-10

Similar Documents

Publication Publication Date Title
AU2010209943B2 (en) Apparatus, method and computer program for manipulating an audio signal comprising a transient event
EP2293294B1 (en) Device and method for manipulating an audio signal having a transient event
Laroche Time and pitch scale modification of audio signals
US8706496B2 (en) Audio signal transforming by utilizing a computational cost function
US6073100A (en) Method and apparatus for synthesizing signals using transform-domain match-output extension
CA2721402C (en) Apparatus and method for determining a plurality of local center of gravity frequencies of a spectrum of an audio signal
US20080221876A1 (en) Method for processing audio data into a condensed version
US20070276657A1 (en) Method for the time scaling of an audio signal
EP2272062A1 (en) An audio signal classifier
WO2012136380A1 (en) Time-stretching of an audio signal
Dorran Audio time-scale modification
Moinet et al. Audio time-scaling for slow motion sports videos
Sanjaume Audio Time-Scale Modification in the Context of Professional Audio Post-production
Driedger Time-scale modification algorithms for music audio signals
JPH09510554A (en) Language synthesis
AU2012216537B2 (en) Device and method for manipulating an audio signal having a transient event

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 12729853

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 12729853

Country of ref document: EP

Kind code of ref document: A1