EP2509073A1 - Time-stretching of an audio signal - Google Patents

Time-stretching of an audio signal Download PDF

Info

Publication number
EP2509073A1
Authority
EP
European Patent Office
Prior art keywords
grains
audio signal
time
grain
texture
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
EP11305407A
Other languages
German (de)
French (fr)
Inventor
Alexis Moinet
Thierry Dutoit
Philippe Latour
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Universite de Mons
Evs International (swiss) Sarl
Original Assignee
Universite de Mons
Evs International (swiss) Sarl
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Universite de Mons, Evs International (swiss) Sarl filed Critical Universite de Mons
Priority to EP11305407A priority Critical patent/EP2509073A1/en
Priority to PCT/EP2012/001540 priority patent/WO2012136380A1/en
Publication of EP2509073A1 publication Critical patent/EP2509073A1/en
Withdrawn legal-status Critical Current

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/04 Time compression or expansion


Abstract

Time-stretching method for changing the length of an audio signal without changing its perceived content, comprising the steps of:
a) dividing an input audio signal into a set of grains, a grain being a group of consecutive audio samples from the input audio signal, the length of each grain of the set of grains being determined so as not to split a transient of the audio signal between two successive grains,
b) inserting the grains in an output audio signal, the insertion being done at a rate corresponding to a speed factor, and
c) synthesizing an audio texture to fill every gap between two successive grains in the output audio signal.

Description

  • The invention generally relates to the time-stretching of an audio signal.
  • Time-stretching (or time-scaling) permits changing the length of an audio signal without changing its perceived content. It differs from simple re-sampling (up-sampling or down-sampling), which changes both the length of the audio signal and its spectral content. Indeed, with simple re-sampling, the frequencies contained in the audio signal are shifted upward or downward when the audio signal length is respectively reduced or increased. This is illustrated by the low-pitched sounds heard in some movies during slow-motion scenes.
  • For example, time-stretching may be used to increase the length of an audio signal in order to preserve synchronization with a video during slow-motion playback. In the same way, it may be used to decrease the length of the audio signal in order to preserve synchronization with a video during fast-motion playback.
  • In order to create an output audio signal perceived as similar to the input audio signal but with a different length, one needs to create new information from the input audio signal. This creation of new information may comprise adding information and/or removing information.
  • Time-stretching methods are described in the article of Jordi Bonada Sanjaume, "Audio Time-Scale Modification in the Context of Professional Audio Postproduction", Research work for PhD program, 2002.
  • The known methods generally use time-domain processing, like the SOLA (Synchronous Overlap-Add), WSOLA (Waveform Similarity based Overlap-Add) and PSOLA (Pitch-Synchronous Overlap-Add) algorithms; spectral-domain processing, like phase-vocoder-based solutions; or model-based processing, like LPC (Linear Predictive Coding) and HNM (Harmonic plus Noise Model) methods.
  • These methods make the fundamental assumption that the input audio signal mainly contains voice, music or both. In other words, the input audio signal is assumed to be mainly composed of time-varying sinusoids, with noise considered a negligible background signal having little or no impact on the final result. This sinusoidal hypothesis makes it difficult to apply these methods to audio signals that mainly contain noise, like the audio signals usually encountered during sport events, for example crowd, applause, ball impacts, and/or field events.
  • Besides, transient sounds such as impact sounds are distorted by these methods and therefore need to be detected and processed separately. Transient detection has many solutions when applied to clean audio recordings such as speech or music, but it is much more difficult in noisy environments such as a football or basketball stadium. As a result, it can be very complex to detect and separately process those transient sounds during sport events.
  • Texture-based methods have been developed to synthesize audio content such as crowd noise and applause. Although these methods create similar-sounding signals, they do not manipulate the length of the original signal.
  • In the thesis "Expressive Sound Synthesis for Animation", PhD thesis, INRIA, 2009, C. Picard proposes a texture-based approach to sound re-synthesis and time-stretching. However, it is based on the notion of correlation patterns, and these patterns must be pre-computed, which takes a relatively long time. Moreover, it assumes that all the audio samples are known beforehand. This makes it unsuitable for real-time processing.
  • In the article of Lie Lu, Liu Wenyin, and Hong-Jiang Zhang, "Audio textures: theory and applications", Transactions on Speech and Audio Processing, IEEE, Volume 12, n° 2, pp 156-167, 2004, a method is presented to replace missing parts of an audio stream using texture synthesis. However, this method is not adapted to faster playback speeds. Moreover, it needs a pre-processing of the whole signal and the computation of a similarity matrix. As a consequence, it cannot work in real-time.
  • The invention aims to improve this situation.
  • SUMMARY
  • A first aspect of the invention relates to a time-stretching method for changing the length of an audio signal without changing its perceived content, comprising the steps of:
    1. a) dividing an input audio signal into a set of grains, a grain being a group of consecutive audio samples from the input audio signal, the length of each grain of the set of grains being determined so as not to split a transient of the audio signal between two successive grains,
    2. b) inserting the grains in an output audio signal, the insertion being done at a rate corresponding to a speed factor, and
    3. c) synthesizing an audio texture to fill every gap between two successive grains in the output audio signal.
  • Thanks to these provisions, the performance of the time-stretching is improved. Indeed, the proposed method combines both aspects, time-stretching and textures, in order to manipulate the length of an audio signal without changing its perceived content and without having to process its transients separately. Moreover, since the transients are not split into different grains and there is no overlap between grains, this method does not suffer from the transient duplication artefact that characterizes many time-domain time-stretching algorithms. Any transient from the input signal is copied into the output signal once and only once.
  • Step a) may comprise setting grain boundaries at positions of minimum values of a spectral flux computed on the input audio signal. Indeed, these minimum values correspond to samples where the input audio signal is most stable, in particular to samples where the probability of having a transient is minimized.
  • The spectral flux may be computed by:
    • dividing the input audio signal into overlapping constant-length frames,
    • computing, for each frame n and for each frequency bin k of the input audio signal, a spectral amplitude X(n,k) and a variation of amplitude ΔX(n,k) between the said frame and the previous frame:
      ΔX(n,k) = X(n,k) - X(n-1,k)
    • computing the spectral flux SF by using the variation of amplitude values:
      SF(n) = Σ_{k=1}^{Nfft} [ΔX(n,k) + |ΔX(n,k)|] / 2
  • Step b) may comprise inserting in the output audio signal a grain Gi of the set of grains, which starts at a time ti in the input audio signal, at time ti' being calculated from a starting time ti-1' of a previous grain Gi-1 in the output audio signal S' and from a length Li-1 of the previous grain Gi-1:
      ti' = ti-1' + r·Li-1, with t0' = t0,
    the variable r being equal to the inverse of the speed factor.
  • Step c) may comprise synthesizing an audio texture to fill a gap between two successive grains by using past and current grains, which are stored in a database.
  • The texture is preferably synthesized so as not to contain any transient.
  • The texture may be synthesized by using the N grains closest to the two surrounding grains of the gap to be filled, the N grains being selected in the database, N being a predetermined integer, the measure of closeness being based on the spectral envelope or its parameterization at the position where the two successive grains were split.
  • Synthesizing may comprise selecting in the database a grain which permits either minimizing the Euclidean distance or maximizing the correlation between its M first samples and the M last samples of the grain before the gap to be filled, M being a predetermined integer. Synthesizing may also comprise selecting in the database a grain which permits either minimizing the Euclidean distance or maximizing the correlation between its M last samples and the M first samples of the grain after the gap to be filled, M being a predetermined integer.
  • According to other embodiments of the invention, step c) may comprise synthesizing an audio texture to fill a gap between two successive grains by using linear prediction analysis.
  • According to other embodiments of the invention, step c) may comprise synthesizing an audio texture to fill a gap between two successive grains by using spectral analysis.
  • A second aspect of the invention relates to a time-stretching device for changing the length of an audio signal without changing its perceived content, comprising an extraction block configured to divide an input audio signal into a set of grains and to insert the grains in an output audio signal, a grain being a group of consecutive audio samples from the input audio signal, the grains of the set of grains having variable lengths so as not to split a transient of the audio signal between two successive grains, the insertion being done at a rate corresponding to a speed factor, the time-stretching device further comprising a texture synthesizer block configured to synthesize a texture to fill every gap between two successive grains in the output audio signal.
  • The extraction block may be configured to set grains boundaries at positions of minimum values of a spectral flux computed on the input audio signal.
  • The extraction block may be configured to insert in the output audio signal a grain Gi of the set of grains, which starts at a time ti in the input audio signal, at time ti' being calculated from a starting time ti-1' of a previous grain Gi-1 in the output audio signal S' and from a length Li-1 of the previous grain Gi-1:
      ti' = ti-1' + r·Li-1, with t0' = t0,
    the variable r being equal to the inverse of the speed factor.
  • The texture synthesizer block may be configured to synthesize an audio texture to fill a gap between two successive grains by using past and current grains, which are stored in a database. The texture synthesizer block may also be configured to synthesize the texture so as not to contain any transient.
  • For example, the texture synthesizer block may be configured to synthesize the texture by using the N grains closest to the two surrounding grains of the gap to be filled, the N grains being selected in the database, N being a predetermined integer, the measure of closeness being based on the spectral envelope or its parameterization at the position where the two successive grains were split.
  • BRIEF DESCRIPTION OF THE DRAWINGS
  • Embodiments of the invention are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings, in which like reference numerals refer to similar elements and in which:
    • Fig.1 is a schematic block diagram of a time-stretching device according to a first embodiment of the invention;
    • Fig.2 is a flow chart showing steps of a time-stretching method;
    • Fig.3 is a chart showing an example of input audio signal;
    • Fig.4 is a chart showing the input audio signal of Fig.3 after it has been divided into a set of grains;
    • Fig.5 is a chart showing a spectral flux computed on another example of input audio signal;
    • Fig.6 is a chart showing an output audio signal computed from the input audio signal of Fig.3; and
    • Fig.7 is a schematic view illustrating the insertion of grains in an output audio signal.
    DESCRIPTION OF PREFERRED EMBODIMENTS
  • Fig.1 shows a time-stretching device 1, which is configured to change the length of an audio signal S (Fig.3) without changing its perceived content. The audio signal S is for example the audio part of a sport event signal comprising an audio signal and a video signal. The audio signal mainly contains noise and transients, like the audio signals usually encountered during sport events, for example crowd, applause, ball impacts, and/or field events.
  • The time-stretching device 1 comprises an extraction block 2, which is configured to receive the audio signal S and to divide it into a set of grains G. A grain G is a variable-length group of consecutive audio samples from the audio signal S. The grains G do not overlap and are created so as not to split a transient of the audio signal S between two successive grains G. Their lengths vary between two reasonable values, for example between 5 and 50 ms, preferably between 10 and 40 ms. The extraction block 2 is also configured to insert the grains G in an output audio signal S'.
  • The time-stretching device 1 also comprises a database block 3 configured to memorize the extracted grains G and a texture synthesizer block 4 configured to synthesize a texture to fill gaps between the grains in the output audio signal S' (Fig.6).
  • Referring to Fig.2, a time-stretching method is described below, which may be executed by the time-stretching device 1 of Fig.1.
  • In step S1, the input audio signal S is received by extraction block 2. Then, extraction block 2 divides the input audio signal S into a set of grains G, for example into three grains G0, G1, G2, as represented in Fig.4.
  • The boundaries of the grains G0, G1, G2 may be set at the positions of minimum values of a spectral flux SF computed on the input audio signal S. Indeed, these minimum values correspond to samples where the input audio signal S is most stable, in particular to samples where the probability of having a transient is minimized.
  • In order to compute the spectral flux SF, the input audio signal S is first divided into overlapping constant-length frames n. Then, for each frame n and for each frequency bin k, the spectral amplitude X(n,k) is computed, as well as the variation of amplitude ΔX(n,k) between frame n and frame n-1:
      ΔX(n,k) = X(n,k) - X(n-1,k)
  • The spectral flux SF is then defined as the sum of the positive contributions in the previous result:
      SF(n) = Σ_{k=1}^{Nfft} [ΔX(n,k) + |ΔX(n,k)|] / 2
  • Thus, in this embodiment of the invention, the only parameter for the definition of the spectral flux SF is the constant length of the analysis frames n.
  • Once the spectral flux SF is computed, its local minima are located, with the constraint that they must be separated by at least a first predetermined value, for example 5 ms, and by no more than a second predetermined value, for example 50 ms, which means that some minima may not be taken into consideration.
  • Grain boundaries are then positioned, for each minimum taken into account, at the closest zero-crossing of the audio samples. As a variant, the boundaries can be positioned at the closest energy minimum instead of the closest zero-crossing.
  • Fig.5 shows an embodiment where the audio signal S corresponds to a recording of a football match. Curve CSF represents the spectral flux SF computed for the audio signal S. Lines B1 to B18 represent grain boundaries positioned at local minima, for example at the closest zero-crossing. In this example, the set of grains G comprises nineteen grains G, each delimited by consecutive grain boundaries B.
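  • By way of illustration only (the following sketch is not part of the original disclosure), the boundary detection described above could be implemented as follows in Python. The frame length, hop size and separation constraint are assumptions, expressed in samples at an assumed 44.1 kHz sampling rate; the maximum-separation constraint (forcing a boundary at least every 50 ms) is omitted for brevity.

```python
import numpy as np

def spectral_flux(s, frame_len=1024, hop=512):
    """Half-wave rectified spectral flux SF(n) of signal s."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(s) - frame_len) // hop
    sf = np.zeros(n_frames)
    X_prev = None
    for n in range(n_frames):
        frame = s[n * hop : n * hop + frame_len] * window
        X = np.abs(np.fft.rfft(frame))             # spectral amplitude X(n,k)
        if X_prev is not None:
            dX = X - X_prev                        # variation ΔX(n,k)
            sf[n] = np.sum((dX + np.abs(dX)) / 2)  # keep positive contributions
        X_prev = X
    return sf

def grain_boundaries(s, sf, hop=512, min_sep=220):
    """Local minima of SF, at least min_sep samples apart (~5 ms at 44.1 kHz),
    each snapped to the closest zero-crossing of s."""
    zc = np.where(np.diff(np.signbit(s)))[0]       # zero-crossing positions
    bounds, last = [], -min_sep
    for n in range(1, len(sf) - 1):
        if sf[n] <= sf[n - 1] and sf[n] <= sf[n + 1]:
            pos = n * hop
            if pos - last >= min_sep and len(zc):
                pos = int(zc[np.argmin(np.abs(zc - pos))])  # snap to zero-crossing
                bounds.append(pos)
                last = pos
    return bounds
```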
  • Thus, the extraction step is based on a time-domain algorithm with a variable grain size, allowing the management of transients.
  • Some variants can be considered, such as a multi-band spectral flux, which means that a separate spectral flux is computed for each band of a filter bank. This variant may improve the precision of the extraction step.
  • In step S2, the grains G, for example the grains G0, G1, G2 in the example of Fig.4, are inserted in an output audio signal S', as represented in Fig.6. The output audio signal S' corresponds to the audio signal S in slow-motion, the insertion being done at a rate corresponding to the speed factor of the slow-motion. In other words, empty spaces or gaps H0, H1, H2 (Fig.7) are inserted between the grains G0, G1, G2 of the input audio signal S in order to obtain an output signal S' whose length is equal to the original length divided by the speed factor of the slow-motion. For example, a 33% slow-motion signal S' is three times longer than the original signal S, its length being the original length divided by 0.33.
  • We consider a grain Gi starting at time ti in the input audio signal S. Its position ti' in the output audio signal S' can be calculated from the position ti-1' and the length Li-1 of the grain Gi-1 in the output audio signal S':
      ti' = ti-1' + r·Li-1, with t0' = t0,
    where the variable r is equal to the inverse of the speed factor. For example, r = 3 if the speed factor is 33.333%. The speed factor and the variable r can be changed with every grain G, for example if the user changes the playback speed of the video between grains of the input audio signal S.
  • Note that time ti' is not necessarily an integer. Indeed, the closest integer value may be used instead of time ti' to position the grain Gi in the output audio signal S', but its real value ti' is preferably used to compute ti+1' in order to minimize rounding errors.
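  • To make the bookkeeping concrete, here is a minimal sketch (an illustration, not the original implementation) of this positioning rule, keeping the real-valued positions so that rounding errors do not accumulate:

```python
def position_grains(grain_lengths, r):
    """Output positions of each grain for a speed-factor inverse r
    (e.g. r = 2.0 for a 50% slow-motion)."""
    positions, t = [], 0.0           # t0' = t0 = 0
    for L in grain_lengths:
        positions.append(round(t))   # rounded value used for insertion
        t += r * L                   # real value kept for the next grain
    return positions

# Three 1000-sample grains at 50% slow-motion (r = 2) are placed at
# [0, 2000, 4000], leaving 1000-sample gaps H to fill with textures.
print(position_grains([1000, 1000, 1000], 2.0))
```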
  • Fig.7 illustrates the positioning of grains G0, G1, G2 in the output audio signal S' for a 50% slow-motion, that is to say a speed factor of 50%. Dashed rectangles represent gaps H0, H1, H2 that have to be filled with textures.
  • If an accumulation of rounding errors leads to a non-negligible offset between the audio and the video, it can be corrected by re-synchronizing the grain Gi with the video frame it was extracted from. An offset is considered as non-negligible if the viewer can notice the offset between sound and video.
  • The "ideal" position ti' with respect to audio/video synchronism of the grain Gi is not always that ideal when it comes to overlap well with samples generated to fill a gap H. Therefore the position of the ith grain Gi can optionally be slightly adjusted around its theoretical position ti' by an offset δ, with value -Δ ≤ δ ≤ Δ. In case such an adjustment is made, the final position of the grain Gi in the output signal S' is then noted ti", with ti" = ti' + δ.
  • In step S3, the texture synthesizer block 4 synthesizes a texture to fill every gap H in the output audio signal S'.
  • According to some embodiments of the invention, the texture is synthesized using past and current grains G, stored in database 3, so as to best fit the surrounding grains G in the output audio signal S' and so as not to contain any transient.
  • The texture synthesizer block 4 has to deal with several constraints. Indeed, the synthesized texture must connect well (temporally speaking) with its surrounding grains G to reduce audible discontinuities. For example, the synthesized texture of gap H0 must connect well with grains G0 and G1.
  • Moreover, the synthesized texture must have a spectral content matching its surrounding grains G; in other words, it must be perceptually relevant. Furthermore, the synthesized texture must not contain any transient. For example, the synthesized texture of gap H0 must have a spectral content matching grains G0 and G1 and must not contain any transient.
  • The texture synthesis algorithm and the organization of the database 3 are adapted to these constraints. During the recording of an audio signal S, for example from a sport event, grains G are extracted from the input audio signal S and organized in the database 3. In the example of Fig.4, grains G0, G1 and G2 are extracted from audio signal S and memorized in database 3.
  • The organization of the database 3 can for instance rely on a k-means clustering, using a parameterization of the grains' spectral envelope, for example MFCC (Mel-Frequency Cepstrum Coefficients) or LPCC (Linear Prediction Cepstrum Coefficients).
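  • A possible sketch of such an organization is given below (an assumption for illustration; the patent does not prescribe an implementation). Grains are clustered on precomputed MFCC or LPCC vectors, and the N closest grains are then searched only within the nearest cluster:

```python
import numpy as np
from sklearn.cluster import KMeans

def build_index(grain_features, n_clusters=32):
    """grain_features: one MFCC (or LPCC) vector per grain in the database."""
    return KMeans(n_clusters=n_clusters, n_init=10).fit(np.asarray(grain_features))

def nearest_grains(km, grain_features, query, n=8):
    """Return the indices of the n grains closest to the query vector,
    searching only the query's own cluster."""
    feats = np.asarray(grain_features)
    cluster = km.predict(query[None, :])[0]
    idx = np.where(km.labels_ == cluster)[0]
    d = np.linalg.norm(feats[idx] - query, axis=1)
    return idx[np.argsort(d)[:n]]
```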
  • As a variant, any approach that allows quickly finding a grain G can be used. Organizing the database 3 can also mean that redundant or oldest grains G are progressively discarded in order to keep the database 3 at a reasonable size. Moreover, grains G that might contain a transient are immediately discarded, using the spectral flux SF as a measure of the likelihood of a transient.
  • During a slow-motion, input grains G are positioned in the output audio signal S' as explained previously. For example, grains G0, G1, G2 are positioned in the output audio signal S' as represented in Fig.7.
  • Then, for each gap H, the texture synthesizer block 4 selects in the database the N grains closest to the two surrounding grains G, N being a predetermined integer. The measure of closeness is based on the spectral envelope or its parameterization at the position where the two grains G were split, since it is likely the most stable part of the audio signal S and it corresponds to the spectral envelope that the texture will have to match.
  • Using those N preselected grains, the texture synthesizer block 4 synthesizes a texture for filling the gap H.
  • For example, we denote by GL the grain before the gap H to be filled and by GR the grain after the gap H. Texture synthesizer block 4 selects a first grain GA which permits either minimizing the Euclidean distance or maximizing the correlation between its M first samples and the M last samples of grain GL, M being a predetermined integer. A cost CA different from zero is memorized for grain GA so as not to select this grain again in the next step. For example, cost CA = C + R, where C is a constant minimal cost and R is a random positive value, C and R being integer values.
  • Then texture synthesizer block 4 selects a second grain GB which permits either minimizing the Euclidean distance or maximizing the correlation between its M first samples and the M last samples of grain GA, and whose cost CB is zero. A new cost CB different from zero is then memorized for grain GB so as not to select this grain again at the next step. For example, cost CB = C + R, where C is the constant minimal cost and R is a random positive value, C and R being integer values. All the other memorized costs Ci are then decreased by one unit.
  • And so on, until a grain GP finally overlaps with grain GR. The last selected grain GP permits either minimizing the Euclidean distance or maximizing the correlation between its M last samples and the M first samples of grain GR. Note that if grain GA is longer than the gap H, it overlaps with grain GR and, as a consequence, it is selected so as to be optimal considering both the M last samples of grain GL and the M first samples of grain GR.
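  • A greedy sketch of this chaining step is given below (an illustration under assumptions, not the original implementation; the within-texture joining of consecutive grains and the final match against GR are simplified):

```python
import random
import numpy as np

def chain_grains(candidates, g_left, gap_len, M=128, C=3, R_max=4):
    """Chain database grains across a gap of gap_len samples.
    candidates: the N preselected grains (1-D numpy arrays)."""
    costs = [0] * len(candidates)
    texture, filled = [], 0
    tail = g_left[-M:]                        # M last samples of G_L
    while filled < gap_len + M:               # stop once we overlap G_R
        best_i, best_d = None, np.inf
        for i, g in enumerate(candidates):
            if costs[i] != 0 or len(g) <= M:  # skip recently used / too short
                continue
            d = np.sum((g[:M] - tail) ** 2)   # Euclidean distance on M samples
            if d < best_d:
                best_i, best_d = i, d
        if best_i is None:
            break                             # no eligible grain left
        g = candidates[best_i]
        texture.append(g)
        filled += len(g) - M                  # consecutive grains overlap by M
        tail = g[-M:]
        costs[best_i] = C + random.randint(1, R_max)  # cost C + R, non-zero
        costs = [max(0, c - 1) if i != best_i else c
                 for i, c in enumerate(costs)]        # decrease other costs
    return texture
```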
  • When the texture to be used in gap H has been synthesized with grains GA to GP, it is overlap-added with the two grains GL and GR, with an overlap of M samples on each side. If the texture goes beyond the Mth sample of GR, the exceeding samples are zeroed. This can happen if the grain GP is longer than 2M + LP, where LP is the length of the gap remaining when GA to GP-1 have been added to the texture. Of course, this could also apply when grain GA is longer than the gap H.
  • Overlapping the samples of GL, GR and the texture to be used in gap H means that the samples occupying a same temporal position are summed with a weighting factor. The weighting factor is determined by a temporal envelope that creates two fade-in/fade-out effects, respectively between the M last samples of GL and the M first samples of the texture, and between the M last samples of the texture and the M first samples of GR. The temporal envelope usually used is a normalized raised cosine.
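  • For instance, an M-sample raised-cosine crossfade may be sketched as follows (an illustration, with M assumed to be at least 2):

```python
import numpy as np

def crossfade(tail, head):
    """Overlap-add the last M samples of one segment (tail) with the
    first M samples of the next (head) using a normalized raised cosine."""
    M = len(tail)
    fade_in = 0.5 * (1.0 - np.cos(np.pi * np.arange(M) / (M - 1)))  # 0 -> 1
    fade_out = 1.0 - fade_in                                        # 1 -> 0
    return tail * fade_out + head[:M] * fade_in
```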
  • Note that an amplitude normalization of the texture (or even an amplitude normalization of each of its grains GA to GP separately) may be performed in order to ensure the perceptual continuity between the input grains GL, GR and the texture in the output audio signal S'.
  • According to other embodiments of the invention, the texture is synthesized using either filter-based (LP) or spectral synthesis. The parameters of the filter or of the spectral synthesis may be computed using parts of the input audio signal S surrounding the sample that separates the grains G between which the synthetic audio will be inserted. Finally, the synthetic signal is inserted in the output audio signal S' so as to best fit the surrounding grains G in the signal S', that is to say to reduce discontinuities. The exact position of each grain G0, G1, G2 in the final output audio signal S' may be adjusted around the temporary positions t0', t1', t2' so as to fit better with the samples that will fill the gaps H0, H1, H2.
  • An example of linear prediction analysis and synthesis is described below. We consider the input signal S and two consecutive grains Gn and Gn+1. We call S(m) the mth sample of signal S, which is the first sample of grain Gn+1. We assume that grain Gn has just been added to the output signal S'.
  • Here, the step S3 comprises an operation of calculating the Linear Prediction Coefficients (LPC) of a p-order LP-filter using for example the samples from S(m-N) to S(m+N-1), weighted by a windowing function such as a Hann window. N and p (order of the filter) are predefined parameters.
  • The step S3 then comprises an operation of initializing the Linear Prediction (LP) filter using the p last samples of Gn, that is to say samples S(m-p) to S(m-1). This ensures continuity between the samples of grain Gn and the first samples that will be generated by the filter.
  • At instant n, the output y(n) of a Linear Prediction filter with input x(n) obeys an equation of the form: y(n) = b·x(n) - (a1·y(n-1) + a2·y(n-2) + ... + ap·y(n-p)), where p is the order of the filter, b is the gain of the filter and the ai are the Linear Prediction Coefficients (LPC). One can see that the output at time n is a function of the outputs at instants n-1 to n-p. For the p first values of n, some of these past outputs are not available (y(-1), ..., y(-p)) and they are usually set to zero: y(n) = 0 for n < 0.
  • Instead, if we consider that the p last values of grain Gn are the p last outputs of the filter before y(0) (i.e. y(-1), ..., y(-p)), we obtain a filter output that is continuous with the grain samples. The filter is thus initialized to a state where its output continues the samples of grain Gn.
  • The step S3 further comprises an operation of filtering a white noise using the LP filter properly initialized. The white noise has a length of at least Dn + H + Δ + Ln+1, where Dn is the length of the gap H to fill between the last sample of the grain Gn in the output signal S' (i.e. tn" + Ln, with Ln the length of grain Gn) and the temporary position tn+1' of grain Gn+1, H is the length of the overlapping region between the synthetic signal and the grain Gn+1, Δ is the maximum displacement allowed for the grain Gn+1 around tn+1' (in order to get tn+1"), and Ln+1 is the length of grain Gn+1.
  • The step S3 further comprises an operation of appending the synthetic signal (i.e. the filtered white noise) to grain Gn in the output signal S'. As an optional refinement, the synthetic signal energy can be normalized to the energy of a group of samples surrounding sample S(m) (i.e. from S(m-M) to S(m+M)) beforehand.
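  • A minimal Python sketch of these operations is given below (an illustration under assumptions, not the original implementation; the LPC computation uses the autocorrelation method, and scipy is assumed available):

```python
import numpy as np
from scipy.linalg import solve_toeplitz
from scipy.signal import lfilter, lfiltic

def lpc_coeffs(frame, p):
    """Order-p LPC of a (windowed) frame via the normal equations."""
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    a = solve_toeplitz(r[:p], r[1:p + 1])
    return np.concatenate(([1.0], -a))        # denominator A(z) of the LP filter

def lp_gap_synthesis(s, m, grain_prev, length, N=1024, p=16):
    """Filter white noise through the LP filter of S around S(m), with the
    filter state initialized on the p last samples of the previous grain."""
    frame = s[m - N : m + N] * np.hanning(2 * N)   # samples S(m-N)..S(m+N-1)
    a = lpc_coeffs(frame, p)
    zi = lfiltic([1.0], a, y=grain_prev[::-1][:p]) # past outputs, newest first
    synth, _ = lfilter([1.0], a, np.random.randn(length), zi=zi)
    ref = s[m - N : m + N]                         # optional energy normalization
    synth *= np.sqrt(np.mean(ref ** 2) / np.mean(synth ** 2))
    return synth
```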
  • The step S3 further comprises an operation of computing the cross-correlation between grain Gn+1 and the output signal samples from tn+1'-Δ to tn+1'+Ln+1+Δ, and an operation of locating the maximum of cross-correlation in the region that corresponds to positioning grain Gn+1 between tn+1' - Δ to tn+1' + Δ. The position of that maximum sets the value for tn+1".
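  • The cross-correlation placement may be sketched as follows (again an illustration; positions are in samples):

```python
import numpy as np

def align_grain(output, grain, t_prime, delta):
    """Find t'' in [t' - Δ, t' + Δ] maximizing the cross-correlation
    between grain G_{n+1} and the already-generated output samples."""
    seg = output[t_prime - delta : t_prime + len(grain) + delta]
    xc = np.correlate(seg, grain, mode="valid")  # 2Δ+1 candidate positions
    return t_prime - delta + int(np.argmax(xc))
```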
  • The step S3 further comprises an operation of overlapping and adding grain Gn+1 to the output signal S', starting from sample at tn+1". The region from time tn+1" to tn+1"+H is where the overlap takes place. The LP synthetic samples beyond tn+1"+H are discarded (that is, weighted to zero). N, p, H and Δ (and possibly M) are parameters of the system.
  • An example of spectral analysis and synthesis is described below. This embodiment is very similar to the linear prediction analysis and synthesis embodiment.
  • The main difference is that an FFT is computed on the windowed frame [S(m-N)...S(m+N-1)] instead of an LP analysis.
  • Moreover, the white noise used to generate the synthetic signal is sliced into overlapping windowed frames, the Fourier Transform (by means of an FFT) of each of these frames is computed, and the amplitudes of their spectra are modified to be equal to the amplitude of the spectrum previously computed on frame [S(m-N)...S(m+N-1)]. The modified FFTs are inverted (IFFT) in order to obtain frames of a "colored" noise. These frames are then overlap-added so as to end up with a continuous signal r whose spectral characteristics match those of the input signal frame [S(m-N)...S(m+N-1)].
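  • This noise coloring may be sketched as follows (an illustration under assumptions; the analysis frame is assumed to be already windowed, and the small amplitude ripple of the double-windowed overlap-add is not compensated):

```python
import numpy as np

def colored_noise(target_frame, length):
    """Overlapping white-noise frames are given the amplitude spectrum of
    target_frame (keeping the noise phases), inverted and overlap-added
    into a continuous colored-noise signal r of the requested length."""
    frame_len = len(target_frame)
    hop = frame_len // 2
    win = np.hanning(frame_len)
    target_mag = np.abs(np.fft.rfft(target_frame))
    r = np.zeros(length + frame_len)
    for start in range(0, length, hop):
        spec = np.fft.rfft(np.random.randn(frame_len) * win)
        spec = target_mag * np.exp(1j * np.angle(spec))  # impose amplitude
        r[start : start + frame_len] += np.fft.irfft(spec) * win
    return r[:length]
```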
  • The length R of the synthetic signal r should be at least Dn + 2·H + 2·Δ + Ln+1.
  • The synthetic signal r is not necessarily continuous with the last samples of grain Gn. Therefore, it is overlap-added with the H last samples of grain Gn. Optionally, the same cross-correlation based method for position adjustment as in the LP synthesis (which can also be used for grain Gn+1) can be applied to adjust the position of the synthetic signal in the output signal S', i.e. to better fit the last samples of grain Gn. However, the offset can only go to the left, and the samples of the synthetic signal r that fall before the overlapping region are discarded, that is to say that the samples added to the output are [r(δ) ... r(R)], with δ the offset (0 ≤ δ ≤ Δ) obtained by the cross-correlation alignment method, and the samples [r(1) ... r(δ-1)] are discarded. As an optional refinement, the synthetic signal energy can be normalized beforehand to the energy of a group of samples surrounding S(m) (i.e. from sample S(m-M) to S(m+M)). Once the synthetic signal r has been added to the output, the grain Gn+1 is positioned and added.
  • As a variant, other texture synthesis algorithms can be used. For instance, another possible approach is to select a single grain Gs with the exact needed duration in the part of the audio signal S already recorded during the event. But this adds a strong constraint (fixed length) reducing the number of available candidates, whereas the previous algorithm can find a combination of various small grains that globally fits better in the gap.
  • While there has been illustrated and described what are presently considered to be the preferred embodiments of the present invention, it will be understood by those skilled in the art that various other modifications may be made, and equivalents may be substituted, without departing from the true scope of the present invention. Additionally, many modifications may be made to adapt a particular situation to the teachings of the present invention without departing from the central inventive concept described herein. Furthermore, an embodiment of the present invention may not include all of the features described above. Therefore, it is intended that the present invention not be limited to the particular embodiments disclosed, but that the invention include all embodiments falling within the scope of the invention as broadly defined above. In particular, the embodiments described above could be combined.
  • Expressions such as "comprise", "include", "incorporate", "contain", "is" and "have" are to be construed in a non-exclusive manner when interpreting the description and its associated claims, namely construed to allow for other items or components which are not explicitly defined also to be present.
  • A person skilled in the art will readily appreciate that various parameters disclosed in the description may be modified and that various embodiments disclosed may be combined without departing from the scope of the invention.

Claims (15)

  1. Time-stretching method for changing the length of an audio signal without changing its perceived content, comprising the steps of:
    a) dividing an input audio signal (S) into a set of grains (G0, G1, G2), a grain being a group of consecutive audio samples from the input audio signal (S), the length of each grain of the set of grains being determined so as not to split a transient of the audio signal (S) between two successive grains,
    b) inserting the grains in an output audio signal (S'), the insertion being done at a rate corresponding to a speed factor, and
    c) synthesizing an audio texture to fill every gap (H) between two successive grains in the output audio signal (S').
  2. Time-stretching method according to claim 1, wherein step a) comprises setting boundaries of the grains (G0, G1, G2) at positions of minimum values of a spectral flux computed on the input audio signal (S).
  3. Time-stretching method according to claim 2, wherein the spectral flux is computed by:
    - dividing the input audio signal (S) into overlapping constant-length frames,
    - computing, for each frame n and for each frequency bin k of the input audio signal (S), a spectral amplitude X(n,k) and a variation of amplitude ΔX(n,k) between the said frame and the previous frame:
      ΔX(n,k) = X(n,k) - X(n-1,k)
    - computing the spectral flux SF by using the variation of amplitude values:
      SF(n) = Σ_{k=1}^{Nfft} [ΔX(n,k) + |ΔX(n,k)|] / 2
  4. Time-stretching method according to one of the claims 1 to 3, wherein step b) comprises inserting in the output audio signal (S') a grain Gi of the set of grains, which starts at a time ti in the input audio signal (S), at time ti' being calculated from a starting time ti-1' of a previous grain Gi-1 in the output audio signal S' and from a length Li-1 of the previous grain Gi-1:
      ti' = ti-1' + r·Li-1, with t0' = t0,
    the variable r being equal to the inverse of the speed factor.
  5. Time-stretching method according to one of the claims 1 to 4, wherein step c) comprises synthesizing an audio texture to fill a gap between two successive grains by using past and current grains, which are stored in a database.
  6. Time-stretching method according to claim 5, wherein the texture is synthesized so as not to contain any transient.
  7. Time-stretching method according to claim 6, wherein the texture is synthesized by using the N grains closest to the two surrounding grains of the gap to be filled, the N grains being selected in the database, N being a predetermined integer, the measure of closeness being based on the spectral envelope or its parameterization at the position where the two successive grains were split.
  8. Time-stretching method according to claim 7, wherein synthesizing comprises selecting in the database a grain which permits either minimizing the Euclidean distance or maximizing the correlation between its M first samples and the M last samples of the grain before the gap to be filled, M being a predetermined integer.
  9. Time-stretching method according to claim 8, wherein synthesizing comprises selecting in the database a grain which permits either minimizing the Euclidean distance or maximizing the correlation between its M last samples and the M first samples of the grain after the gap to be filled, M being a predetermined integer.
  10. Time-stretching method according to one of the claims 1 to 4, wherein step c) comprises synthesizing an audio texture to fill a gap between two successive grains by using linear prediction analysis.
  11. Time-stretching method according to one of the claims 1 to 4, wherein step c) comprises synthesizing an audio texture to fill a gap between two successive grains by using spectral analysis.
  12. Time-stretching device (1) for changing the length of an audio signal without changing its perceived content, comprising an extraction block configured to divide an input audio signal (S) into a set of grains (G0, G1, G2) and to insert the grains in an output audio signal (S'), a grain being a group of consecutive audio samples from the input audio signal (S), the grains of the set of grains having variable lengths so as not to split a transient of the audio signal (S) between two successive grains, the insertion being done at a rate corresponding to a speed factor, the time-stretching device further comprising a texture synthesizer block (4) configured to synthesize a texture to fill every gap (H) between two successive grains in the output audio signal (S').
  13. Time-stretching device according to claim 12, wherein the extraction block is configured to set boundaries of the grains (G0, G1, G2) at positions of minimum values of a spectral flux computed on the input audio signal (S).
  14. Time-stretching device according to claim 12 or 13, wherein the extraction block is configured to insert in the output audio signal (S') a grain Gi of the set of grains, which starts at a time ti in the input audio signal (S), at time ti' being calculated from a starting time ti-1' of a previous grain Gi-1 in the output audio signal S' and from a length Li-1 of the previous grain Gi-1:
      ti' = ti-1' + r·Li-1, with t0' = t0,
    the variable r being equal to the inverse of the speed factor.
  15. Time-stretching device according to one of the claims 12 to 14, wherein the texture synthesizer block is configured to synthesize an audio texture to fill a gap between two successive grains by using past and current grains, which are stored in a database.
EP11305407A 2011-04-07 2011-04-07 Time-stretching of an audio signal Withdrawn EP2509073A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
EP11305407A EP2509073A1 (en) 2011-04-07 2011-04-07 Time-stretching of an audio signal
PCT/EP2012/001540 WO2012136380A1 (en) 2011-04-07 2012-04-06 Time-stretching of an audio signal

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
EP11305407A EP2509073A1 (en) 2011-04-07 2011-04-07 Time-stretching of an audio signal

Publications (1)

Publication Number Publication Date
EP2509073A1 true EP2509073A1 (en) 2012-10-10

Family

ID=44358241

Family Applications (1)

Application Number Title Priority Date Filing Date
EP11305407A Withdrawn EP2509073A1 (en) 2011-04-07 2011-04-07 Time-stretching of an audio signal

Country Status (2)

Country Link
EP (1) EP2509073A1 (en)
WO (1) WO2012136380A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2960904A1 (en) * 2014-06-27 2015-12-30 Nokia Technologies Oy Method and apparatus for synchronizing audio and video signals
US10231001B2 (en) 2016-05-24 2019-03-12 Divx, Llc Systems and methods for providing audio content during trick-play playback

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10176813B2 (en) 2015-04-17 2019-01-08 Dolby Laboratories Licensing Corporation Audio encoding and rendering with discontinuity compensation
CN112511886B (en) * 2020-11-25 2023-03-21 杭州当虹科技股份有限公司 Audio and video synchronous playing method based on audio expansion and contraction

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20010028634A1 (en) * 2000-01-18 2001-10-11 Ying Huang Packet loss compensation method using injection of spectrally shaped noise

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20010028634A1 (en) * 2000-01-18 2001-10-11 Ying Huang Packet loss compensation method using injection of spectrally shaped noise

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
C. PICARD: "PhD thesis", 2009, INRIA, article "Expressive Sound Synthesis for Animation"
ISMO KAUPPINEN ET AL: "Audio Signal Extrapolation - Theory and Application", PROCEEDINGS OF THE 5TH INTERNATIONAL CONFERENCE ON DIGITAL AUDIO EFFECTS, 1 September 2002 (2002-09-01), pages 105 - 110, XP055005290 *
JORDI BONADA SANJAUME: "Audio Time-Scale Modification in the Context of Professional Audio Postproduction", RESEARCH WORK FOR PHD PROGRAM, 2002
LIE LU, LIU WENYIN, HONG-JIANG ZHANG: "Audio textures: theory and applications", TRANSACTIONS ON SPEECH AND AUDIO PROCESSING, IEEE, vol. 12, no. 2, 2004, pages 156 - 167, XP011110595, DOI: doi:10.1109/TSA.2003.819947
LU L ET AL: "Audio Textures: Theory and Applications", IEEE TRANSACTIONS ON SPEECH AND AUDIO PROCESSING, IEEE SERVICE CENTER, NEW YORK, NY, US, vol. 12, no. 2, 1 March 2004 (2004-03-01), pages 156 - 167, XP011110595, ISSN: 1063-6676, DOI: 10.1109/TSA.2003.819947 *
PICARD CÉCILE ET AL: "Retargetting Example Sounds to Interactive Physics-driven Animations", CONFERENCE: 35TH INTERNATIONAL CONFERENCE: AUDIO FOR GAMES; FEBRUARY 2009, AES, 60 EAST 42ND STREET, ROOM 2520 NEW YORK 10165-2520, USA, 1 February 2009 (2009-02-01), XP040509265 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2960904A1 (en) * 2014-06-27 2015-12-30 Nokia Technologies Oy Method and apparatus for synchronizing audio and video signals
US9508386B2 (en) 2014-06-27 2016-11-29 Nokia Technologies Oy Method and apparatus for synchronizing audio and video signals
US10231001B2 (en) 2016-05-24 2019-03-12 Divx, Llc Systems and methods for providing audio content during trick-play playback
US11044502B2 (en) 2016-05-24 2021-06-22 Divx, Llc Systems and methods for providing audio content during trick-play playback
US11546643B2 (en) 2016-05-24 2023-01-03 Divx, Llc Systems and methods for providing audio content during trick-play playback

Also Published As

Publication number Publication date
WO2012136380A1 (en) 2012-10-11

Similar Documents

Publication Publication Date Title
Laroche Time and pitch scale modification of audio signals
US8280724B2 (en) Speech synthesis using complex spectral modeling
US6360202B1 (en) Variable rate video playback with synchronized audio
US9368103B2 (en) Estimation system of spectral envelopes and group delays for sound analysis and synthesis, and audio signal synthesis system
US5749073A (en) System for automatically morphing audio information
JP4906230B2 (en) A method for time adjustment of audio signals using characterization based on auditory events
US6073100A (en) Method and apparatus for synthesizing signals using transform-domain match-output extension
EP2296145B1 (en) Device and method for manipulating an audio signal having a transient event
AU2010209943B2 (en) Apparatus, method and computer program for manipulating an audio signal comprising a transient event
US8706496B2 (en) Audio signal transforming by utilizing a computational cost function
US8280738B2 (en) Voice quality conversion apparatus, pitch conversion apparatus, and voice quality conversion method
WO2008106698A1 (en) Method for processing audio data into a condensed version
US8996363B2 (en) Apparatus and method for determining a plurality of local center of gravity frequencies of a spectrum of an audio signal
JP2000511651A (en) Non-uniform time scaling of recorded audio signals
JPH06266390A (en) Waveform editing type speech synthesizing device
Grofit et al. Time-scale modification of audio signals using enhanced WSOLA with management of transients
EP2509073A1 (en) Time-stretching of an audio signal
US5787398A (en) Apparatus for synthesizing speech by varying pitch
CN108369803A (en) The method for being used to form the pumping signal of the parameter speech synthesis system based on glottal model
Dorran Audio time-scale modification
JP3081108B2 (en) Speaker classification processing apparatus and method
Verfaille et al. Adaptive digital audio effects
Sanjaume Audio Time-Scale Modification in the Context of Professional Audio Post-production
EP1500080B1 (en) Method for synthesizing speech
Lin et al. High quality and low complexity pitch modification of acoustic signals

Legal Events

Date Code Title Description
PUAI Public reference made under article 153(3) epc to a published international application that has entered the european phase

Free format text: ORIGINAL CODE: 0009012

AK Designated contracting states

Kind code of ref document: A1

Designated state(s): AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR

AX Request for extension of the european patent

Extension state: BA ME

17P Request for examination filed

Effective date: 20130328

17Q First examination report despatched

Effective date: 20130805

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: THE APPLICATION IS DEEMED TO BE WITHDRAWN

18D Application deemed to be withdrawn

Effective date: 20131217