EP2215632B1 - Method, device and computer program code means for voice conversion - Google Patents

Method, device and computer program code means for voice conversion

Info

Publication number
EP2215632B1
EP2215632B1
Authority
EP
European Patent Office
Prior art keywords
glottal
parameters
vocal tract
converted
lsf
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Not-in-force
Application number
EP08804436A
Other languages
German (de)
English (en)
Other versions
EP2215632A1 (fr)
Inventor
María Arantzazu DEL POZO ECHEZARRETA
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fundacion Centro de Tecnologias de Interaccion Visual y Comunicaciones Vicomtech
Original Assignee
Fundacion Centro de Tecnologias de Interaccion Visual y Comunicaciones Vicomtech
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fundacion Centro de Tecnologias de Interaccion Visual y Comunicaciones Vicomtech filed Critical Fundacion Centro de Tecnologias de Interaccion Visual y Comunicaciones Vicomtech
Publication of EP2215632A1 publication Critical patent/EP2215632A1/fr
Application granted granted Critical
Publication of EP2215632B1 publication Critical patent/EP2215632B1/fr
Not-in-force legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0316 Speech enhancement, e.g. noise reduction or echo cancellation by changing the amplitude
    • G10L21/0364 Speech enhancement, e.g. noise reduction or echo cancellation by changing the amplitude for improving intelligibility
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/003 Changing voice quality, e.g. pitch or formants
    • G10L21/007 Changing voice quality, e.g. pitch or formants characterised by the process used
    • G10L21/013 Adapting to target pitch
    • G10L2021/0135 Voice conversion or morphing
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/04 Time compression or expansion
    • G10L21/057 Time compression or expansion for improving intelligibility
    • G10L2021/0575 Aids for the handicapped in speaking

Definitions

  • the present invention relates to methods and systems for voice conversion.
  • Voice Conversion aims at transforming a source speaker's speech to sound like that of a different target speaker. Text-to-speech synthesisers, dialogue systems and speech repair are among the numerous applications which can greatly benefit from the development of voice conversion technology.
  • the most widely used speech signal representations are the Source-Filter Model and the Sinusoidal Model.
  • the Source-Filter representation ( G. Fant, Acoustic Theory of Speech Production, ISBN 9027916004 ) is based on a simple production model composed of a glottal source waveform exciting a time-varying filter loaded at its output by the radiation of the lips.
  • the main challenge in Source-Filter modelling is the estimation of the glottal waveform and vocal tract filter parameters from the speech signal.
  • the Liljencrants-Fant (LF) model (G. Fant, The LF-model revisited. Transformations and frequency domain analysis, STL-QPSR, vol. 36, number 2-3, 1995, pages 119-156) has become the model of choice for research on the glottal source. It has been shown to be capable of modelling a wide range of naturally occurring phonations, and the effects of its parameter variations are well understood. It exploits the linearity and time-invariance properties of the Source-Filter representation and assumes that the vocal tract and lip radiation filters commute, so that the modelling of the source excitation and lip radiation can be combined in the parameterisation of the derivative of the glottal waveform.
  • Linear Prediction (LP) is a popular technique used to obtain a combined parameterisation of the glottal source, vocal tract and lip radiation components in a single all-pole filter H(z).
  • this filter is then excited, as shown in Figure 1, by a sequence of impulses spaced at the fundamental period T0 during voiced speech and by white Gaussian noise during unvoiced speech.
  • ideally, the LP error or residual would be a train of impulses spaced at the voiced excitation instants, and the impulse/noise voice source modelling would be accurate.
  • in practice, the LP residual looks more like a white noise signal with larger values around the instants of excitation.
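  • As an illustration of this classical excitation scheme, the following Python sketch excites an all-pole LP filter with an impulse train during voiced speech and white Gaussian noise during unvoiced speech; the function and parameter names are ours, not the patent's:

```python
import numpy as np
from scipy.signal import lfilter

def lp_synthesise(a, n_samples, voiced, T0=80, gain=1.0, seed=0):
    """Excite an all-pole filter H(z) = gain / A(z), as in Figure 1.

    a      : LP coefficients [1, a1, ..., ap] of A(z)
    voiced : impulse train spaced at the fundamental period T0 (in samples)
             if True, white Gaussian noise otherwise.
    """
    rng = np.random.default_rng(seed)
    if voiced:
        excitation = np.zeros(n_samples)
        excitation[::T0] = 1.0          # impulses at the excitation instants
    else:
        excitation = rng.standard_normal(n_samples)
    return lfilter([gain], a, excitation)

# e.g. a crude vowel-like frame from arbitrary stable LP coefficients
frame = lp_synthesise(a=[1.0, -1.3, 0.8], n_samples=400, voiced=True)
```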
  • H. Lu et al. have proposed a convex optimization method to automatically estimate the vocal tract filter and glottal waveform jointly (Joint estimation of vocal tract filter and glottal source waveform via convex optimization, Proc. 1999 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, New Paltz, New York, Oct. 17-20, 1999).
  • the better modelling of the glottal source employed by this approach results in speech of better quality than that of LP.
  • the parameterisation of the glottal waveform allows its parametric modification, which can be exploited in voice conversion applications.
  • Sinusoidal Models assume the speech waveform to be composed of the sum of a small number of sinusoids with time-varying amplitudes, frequencies and phases. Such modelling was mainly developed by McAulay and Quatieri (Speech Analysis/Synthesis Based on a Sinusoidal Representation, IEEE Transactions on Acoustics, Speech and Signal Processing, pp. 744-754, 1986) in the mid-1980s and has been shown to be capable of producing high quality speech even after pitch and time-scale transformations. However, because of the high number of sinusoidal amplitudes, frequencies and phases involved, sinusoidal modelling is less flexible than the source-filter representation for modifying spectral features.
  • state-of-the-art voice conversion (VC) implementations mainly employ variations and extensions of the original sinusoidal model.
  • they generally adopt a source-filter formulation based on LP to carry out spectral transformations.
  • Spectral envelopes are generally encoded in line spectral frequencies (LSF) for voice conversion, since LSFs have been shown to possess very good linear interpolation characteristics and to relate well to formant location and bandwidth. Because the frequency resolution of the human ear is greater at low frequencies than at high frequencies, spectral envelopes are often warped to a non-linear scale, e.g. the Bark scale, taking the non-uniform sensitivity of the human ear into account. Usually, only the spectral envelopes of voiced speech segments are transformed, since unvoiced sounds contain little vocal tract information and their spectral envelopes present high variations. Among the different existing spectral envelope conversion techniques, continuous probabilistic linear transformations have been found to be the most robust and efficient approach.
  • to convert LP residuals, sinusoidal VC systems have developed residual prediction and selection methods (D. Suendermann, A. Bonafonte, H. Ney, and H. Hoege, A study on residual prediction techniques for voice conversion, in Proc. ICASSP, 2005, pp. 13-16) based on the correlation between spectral envelope and LP residuals. These methods reintroduce the target spectral detail lost after envelope conversion. Because residuals contain the errors introduced by the LP parameterisation, residual prediction techniques have been found to improve conversion performance. However, LP residuals do not constitute an accurate model of the voice source, and residual prediction alone is not capable of modifying the quality of the voice source. This prevents their use in applications requiring voice quality modifications such as, for example, speech repair.
  • the stage of training comprises, given a training database of parallel source and target data, for each pitch period of said training database: modelling each pitch period by means of a glottal waveform and a vocal tract filter according to Lu and Smith's model, to obtain a set of LF parameters, said set of LF parameters comprising an excitation strength parameter and a set of T-parameters modelling a glottal waveform, and a set of all-pole vocal tract filter coefficients; converting said T-parameters into R-parameters; converting said all-pole vocal tract filter coefficients into line spectral frequencies in Bark scale; defining a glottal vector to be converted; defining a vocal tract vector to be converted, said vocal tract vector comprising said line spectral frequencies in Bark scale; applying wavelet denoising to obtain an estimate of a glottal aspiration noise.
  • the stage of training also comprises, from the set of vocal tract vectors obtained for each pitch period of the said training database, estimating a vocal tract continuous probabilistic linear transformation function using the least square error criterion.
  • the previous stage of modelling further comprises the steps of modelling said aspiration noise estimate by modulating zero mean unit variance Gaussian noise with the said modelled glottal waveform and adjusting its energy to match that of the said aspiration noise estimate.
  • the glottal vector to be converted comprises said excitation strength parameter, said R-parameters and said energy of the aspiration noise estimate.
  • a given test speech waveform is modelled and transformed into a set of converted parameters.
  • a converted speech waveform is synthesised from the said set of converted parameters.
  • the stage of training further comprises: from the set of glottal vectors obtained for each pitch period of the said training database, estimating a glottal waveform continuous probabilistic linear transformation function using the least square error criterion.
  • the step of modelling each pitch period by means of a glottal waveform and a vocal tract filter according to Lu and Smith's model preferably comprises the steps of: modelling the glottal waveform using the Rosenberg-Klatt model; using convex optimization to obtain a set of Rosenberg-Klatt glottal waveform parameters and the all-pole vocal tract filter coefficients, wherein said step of using convex optimization comprises a step of adaptive pre-emphasis for estimating and removing a spectral tilt filter contribution from the speech waveform before convex optimization.
  • the step of modelling each pitch period by means of a glottal waveform and a vocal tract filter according to Lu and Smith's model further comprises the steps of: obtaining a derivative glottal waveform by inverse filtering said pitch period using said all-pole vocal tract filter coefficients; fitting said set of LF parameters to the said inverse filtered derivative glottal waveform by direct estimation and constrained non-linear optimization.
  • the stage of conversion preferably comprises, for each pitch period of said test speech waveform: obtaining a glottal vector to be converted, said glottal vector comprising an excitation strength parameter, a set of R-parameters and the energy of the said aspiration noise estimate; obtaining a vocal tract vector to be converted, said vocal tract vector comprising a set of line spectral frequencies in Bark scale; applying said vocal tract continuous probabilistic linear transformation function estimated during the training stage to obtain a converted vocal tract parameter vector; transforming said glottal vector using said glottal waveform continuous probabilistic linear transformation function estimated during the training stage, thus obtaining a converted glottal vector comprising a set of converted parameters.
  • these steps of obtaining a glottal vector to be converted and a vocal tract vector to be converted further comprise the steps of: modelling each pitch period by means of a glottal waveform and a vocal tract filter according to Lu and Smith's model, to obtain a set of LF parameters, said set of LF parameters comprising an excitation strength parameter and a set of T-parameters modelling a glottal waveform, and a set of all-pole vocal tract filter coefficients; converting said all-pole vocal tract filter coefficients into line spectral frequencies in Bark scale; converting said T-parameters into R-parameters; defining a glottal vector to be converted; and defining a vocal tract vector to be converted.
  • the stage of conversion further comprises a step of post-filtering said converted vocal tract parameter vector.
  • the stage of synthesis, in which said converted speech waveform is synthesised from the said set of converted parameters, preferably comprises the steps of: interpolating the trajectories of said converted parameters of each pitch period, thus obtaining a set of interpolated parameters comprising interpolated R-parameters, interpolated energy and an interpolated vocal tract vector; converting said interpolated vocal tract vector into an all-pole filter coefficient vector; converting said interpolated R-parameters into interpolated T-parameters; and, for each frame of said test speech waveform, generating an excitation signal.
  • the stage of generating an excitation signal comprises, for each of said frames: if said frame is voiced: from said interpolated T-parameters and said excitation strength parameter, generating an interpolated glottal waveform; from said interpolated aspiration noise energy parameter, generating interpolated aspiration noise; and generating said voiced excitation signal by adding said interpolated glottal waveform and said interpolated aspiration noise. If said frame is unvoiced: generating said unvoiced excitation signal from a Gaussian noise source.
  • the stage of synthesis further comprises: generating a synthetic contribution of each frame by filtering said excitation signal with said interpolated all-pole filter coefficient vector; multiplying said synthetic contribution by a Hamming window, overlapping and adding, in order to generate the converted speech signal.
  • the present invention also provides a method applicable to voice quality transformations, such as tracheoesophageal speech repair, which comprises at least some of the above-mentioned method steps.
  • the term “approximately” and terms of its family should be understood as indicating values or forms very near to those which accompany the aforementioned term. That is to say, a deviation within reasonable limits from an exact value or form should be accepted, because the person skilled in the art will understand that such a deviation from the values or forms indicated is inevitable due to measurement inaccuracies, etc. The same applies to the term “nearly”.
  • pitch period means a segment of a speech waveform which comprises a period of the fundamental frequency.
  • frame means a segment of a speech waveform, which corresponds to a pitch period in voiced parts and to a fixed amount of time in unvoiced parts. In a preferred embodiment of the present invention, which should not be interpreted as a limitation to the present invention, a frame corresponds to 10 ms in unvoiced parts.
  • source data refers to a collection of speech waveforms uttered by a source speaker.
  • target data refers to a collection of speech waveforms uttered by a target speaker.
  • parallel source and target data refers to a collection of speech waveforms uttered both by the source and the target speakers.
  • Figure 2 shows a schematic diagram of the Joint Estimation Analysis Synthesis (JEAS) model. It is based on a general Source-Filter representation. It employs white Gaussian and amplitude-modulated white Gaussian noise to model the Turbulence and Aspiration Noise components respectively, a digital differentiator for Lip Radiation and an all-pole filter to represent the Vocal Tract. In addition, the Liljencrants-Fant (LF) model is adopted to better capture the characteristics of the derivative glottal wave. Then, in order to estimate the different model component parameterisations from the speech wave, a joint voice source and vocal tract parameter estimation technique based on Convex Optimization is applied.
  • the present method adopts the well-known LF model, which is a four-parameter time-domain model of one cycle of the derivative glottal waveform.
  • Typical LF pulses corresponding to glottal and derivative glottal waves are shown in Figure 5 .
  • $g(n) = E_0\, e^{\alpha n} \sin(\omega_g n)$ for $0 \le n \le T_e$; $\quad g(n) = -\dfrac{E_e}{\varepsilon T_a}\left[e^{-\varepsilon (n - T_e)} - e^{-\varepsilon (T_c - T_e)}\right]$ for $T_e \le n \le T_c$
  • the model consists of two segments: the first one characterises the derivative glottal waveform from the instant of glottal opening to the instant of main excitation Te, where the amplitude reaches the maximum negative value -Ee.
  • E0 is a scaling factor used to ensure that the signal has a zero mean.
  • Ee is closely related to the strength of the source excitation and the main determinant of the intensity of the speech signal. Its variation affects the overall harmonic amplitudes, except the very lowest components, which are more determined by the shape of the pulse.
  • the second segment models the closing or return phase from the main excitation Te to the instant of full closure Tc using an exponential function.
  • the duration of the return phase is thus determined by Tc - Te.
  • the main parameter characterising this segment is Ta, which represents the "effective duration" of the return phase. This is defined by the duration from Te to the point where a tangent fitted at the start of the return phase crosses zero.
  • T0 corresponds to the fundamental period.
  • Tc is made to coincide with the opening of the following pulse. This fact might suggest that the model does not account for the closed phase of the glottal waveform. However, for reasonably small values of Ta, the exponential function will fit closely to the zero line, providing a closed phase without the need for additional control parameters.
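  • As a worked sketch of the two-segment pulse above, assuming the standard LF conventions that are not spelled out here (ωg = π/Tp, ε from the implicit equation εTa = 1 - e^(-ε(Tc - Te)), E0 from continuity at Te, and α from the zero net flow condition), one LF derivative glottal pulse can be generated in Python as follows:

```python
import numpy as np
from scipy.optimize import brentq

def lf_pulse(Ee, Tp, Te, Ta, Tc, fs=16000):
    """Sketch of one LF derivative glottal pulse from T-parameters (seconds)."""
    wg = np.pi / Tp
    eps = 1.0 / Ta                      # solve eps*Ta = 1 - exp(-eps*(Tc - Te))
    for _ in range(50):                 # simple fixed-point iteration
        eps = (1.0 - np.exp(-eps * (Tc - Te))) / Ta
    t1 = np.arange(0.0, Te, 1.0 / fs)   # opening branch samples
    t2 = np.arange(Te, Tc, 1.0 / fs)    # return phase samples

    def net_flow(alpha):                # integral of g over the cycle
        e0 = -Ee / (np.exp(alpha * Te) * np.sin(wg * Te))
        opening = e0 * np.exp(alpha * t1) * np.sin(wg * t1)
        closing = -(Ee / (eps * Ta)) * (np.exp(-eps * (t2 - Te))
                                        - np.exp(-eps * (Tc - Te)))
        return (opening.sum() + closing.sum()) / fs

    alpha = brentq(net_flow, -1e4, 1e4)  # zero net flow; generous bracket
    e0 = -Ee / (np.exp(alpha * Te) * np.sin(wg * Te))
    g1 = e0 * np.exp(alpha * t1) * np.sin(wg * t1)
    g2 = -(Ee / (eps * Ta)) * (np.exp(-eps * (t2 - Te)) - np.exp(-eps * (Tc - Te)))
    return np.concatenate([g1, g2])

# e.g. a modal-voice pulse: T0 = Tc = 8 ms, Tp = 3.2 ms, Te = 4.5 ms, Ta = 0.3 ms
pulse = lf_pulse(Ee=1.0, Tp=0.0032, Te=0.0045, Ta=0.0003, Tc=0.008)
```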
  • the R-parameters (Rg, Rk, Ra) are normalised with respect to T0 and correlate with the most salient glottal phenomena, i.e. the glottal pulse width and the skewness and abruptness of closure.
  • $R_g = \dfrac{T_0}{2 T_p}; \quad R_k = \dfrac{T_e - T_p}{T_p}; \quad R_a = \dfrac{T_a}{T_0}$
  • Rg is a normalised version of the glottal formant frequency Fg, which is defined as the inverse of twice the duration of the opening phase Tp.
  • Rk is the LF parameter which captures glottal asymmetry. It is defined as the ratio between the times of the opening and closing branches of the glottal pulse, and the larger its value, the more symmetrical the pulse.
  • the open quotient OQ is positively correlated with Rk and negatively correlated with Rg.
  • the Ra parameter corresponds to the effective "return time" Ta normalised by the fundamental period and captures the differences relating to the spectral tilt.
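  • A minimal helper for the T-to-R conversion just defined (a sketch; Rk is written as (Te - Tp)/Tp, the usual LF convention consistent with the definitions above):

```python
def t_to_r(T0, Tp, Te, Ta):
    """Convert LF T-parameters to normalised R-parameters."""
    Rg = T0 / (2.0 * Tp)      # normalised glottal formant frequency
    Rk = (Te - Tp) / Tp       # glottal asymmetry
    Ra = Ta / T0              # normalised effective return time
    return Rg, Rk, Ra

print(t_to_r(0.008, 0.0032, 0.0045, 0.0003))   # ~ (1.25, 0.406, 0.0375)
```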
  • the aim of Source-Filter deconvolution is to obtain estimates of the glottal source and vocal tract filter components from the speech wave.
  • Inverse Filtering (IF) was the most commonly employed deconvolution method. It is based on calculating a vocal tract filter transfer function, whose inverse is used to obtain a glottal waveform estimate which can then be parameterised.
  • a different approach involves modelling both glottal source and vocal tract filter, and developing techniques to jointly estimate the source and tract model parameters from the speech wave.
  • Joint Estimation methods are fully automatic. This is an important condition that a mathematical model aimed at analysis, synthesis and modification of the speech signal should meet. Due to the characteristics of the mathematical voice source and vocal tract descriptions, such an approach is a complex nonlinear problem. For this reason, LP has been deployed more widely, as a simpler method to obtain a direct and efficient source-filter parameterisation of the speech signal. Its poor modelling of the voice source has not limited its application in speech coding, where the aim is to represent the speech spectrum efficiently with a small number of parameters. However, it has prevented its use in speech synthesis and transformation applications. Advances in voice conversion and Hidden Markov Model (HMM) speech synthesis in the last few years have emphasised the importance of refined vocoding and, thus, the problem of automatic joint estimation of voice source and vocal tract filter parameters has gained renewed interest.
  • the method employed to obtain the JEAS voice source and vocal tract model parameters from the speech wave follows the second deconvolution approach and is based on the joint estimation of the vocal tract filter and the glottal waveform proposed by Lu and Smith (Proc. 1999 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, New Paltz, New York, Oct. 17-20, 1999 ).
  • voiced and unvoiced speech segments are processed differently due to their diverse source characteristics. While the voice source in voiced speech is represented by a combination of the LF and aspiration noise models, white Gaussian noise is used to excite the vocal tract filter in unvoiced frames (see Figure 2). Their different modelling requires a preprocessing step where the voiced and unvoiced speech sections are determined and the glottal closure instants (GCI) of the voiced segments are estimated. Then, the voice source and vocal tract parameters are obtained through joint source-filter estimation and LF re-parameterisation in voiced sections (V) and through standard autocorrelation LP and Gaussian noise energy matching in unvoiced portions (U).
  • An algorithm such as the well-known Dynamic Programming Projected Phase-Slope Algorithm (DYPSA) is used for GCI estimation. It employs the group-delay function in combination with a phase-slope projection method to determine GCI candidates, plus N-best dynamic programming to select the most likely candidates according to a cost function which takes waveform similarity, pitch deviation, normalised energy and deviation from the ideal phase-slope into account.
  • the voicing decision is made based on energy, zero-crossing and GCI information. Voiced segments are then processed pitch-synchronously, while unvoiced frames are periodically extracted. In a particular embodiment, they are extracted every 10 ms.
  • the method employed by the invention to obtain the JEAS voice source and vocal tract model parameters involves using a voice source model simple enough to allow the source filter deconvolution to be formulated as a Convex Optimization problem. Then, the derivative glottal waveform obtained by inverse filtering (IF) with the estimated filter coefficients is reparameterised by LF model fitting.
  • the success of the technique lies in providing a derivative glottal waveform constraint when estimating the vocal tract filter. Because of this, the resulting IF derivative glottal waveform is closer to the true glottal excitation and its fitting to an LF model is less error prone.
  • the joint estimation algorithm models the voice source using the well-known Rosenberg-Klatt (RK) model, which consists of a basic voicing waveform describing the shape of the derivative glottal wave and a low-pass filter, $\frac{1}{1 - \mu z^{-1}}$, with $\mu > 0$, as shown in Figure 6.
  • OQ is the open quotient, i.e. the fraction of the pitch period in which the glottis is open.
  • Source-filter deconvolution via convex optimization is accomplished by minimising the squared error between the modelled and the true derivative glottal waveforms.
  • the derived quadratic program can be solved using a number of existing iterative numerical algorithms.
  • the quadratic programming function of the MATLAB Optimization Toolbox has been employed.
  • the result of the minimization problem is the simultaneous estimation of the RK model parameters a and b and the all-pole filter coefficients αk.
  • Figure 7 shows a joint estimation example for one pitch period.
  • the described joint estimation process assumes that the closed and open phases are defined, while in practice the parameter which delimits the end of the closed phase and the beginning of the open phase, nc, is unknown. Its optimal value is found by uniformly sampling the possible nc values (empirically shown to vary from 0% to 60% of the pitch period T0), solving the quadratic problem at each sampled nc value and choosing the estimate resulting in minimum error.
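  • The following Python sketch illustrates the joint estimation idea, including the grid search over nc; it simplifies the constrained quadratic program (solved in the patent with the MATLAB Optimization Toolbox) to unconstrained least squares, and the RK open-phase shape 2an - 3bn² is our assumption for the voicing waveform of equation (7), which did not survive extraction:

```python
import numpy as np

def joint_estimate(s, p=18, fs=16000):
    """Sketch of Lu and Smith style joint source-filter estimation for one
    (pre-emphasised) pitch period s, solved as plain least squares."""
    N = len(s)
    best = None
    for nc in range(0, int(0.6 * N), max(1, N // 20)):  # nc in [0%, 60%] of T0
        tt = (np.arange(N) - nc) / fs                   # time from open-phase onset
        open_phase = np.arange(N) >= nc
        # columns: past speech samples (all-pole part) and the RK source shape
        X = np.zeros((N, p + 2))
        for k in range(1, p + 1):
            X[k:, k - 1] = s[:-k]
        X[:, p] = 2.0 * tt * open_phase
        X[:, p + 1] = -3.0 * tt ** 2 * open_phase
        theta = np.linalg.lstsq(X, s, rcond=None)[0]
        err = np.sum((s - X @ theta) ** 2)
        if best is None or err < best[0]:
            best = (err, nc, theta[:p], theta[p:])      # filter coeffs, (a, b)
    return best
```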
  • the basic RK voicing waveform of equation (7) does not explicitly model the return phase of the derivative glottal waveform and changes abruptly at the glottal closure instants. For this reason, a low-pass filter is added to the basic model, with the purpose of reducing the abruptness of glottal closure.
  • the filter coefficient μ is responsible for controlling the tilt of the source spectrum.
  • the spectral tilt filter is separated from the source model and incorporated into the vocal tract model by adding an extra pole to the all-pole filter, as shown in Figure 9.
  • the vocal tract filter coefficients estimated using this formulation also encode the spectral slope information of the voice source.
  • the derivative glottal waveforms obtained using this approach fail to adequately capture the variations in the return phase of the glottal source.
  • the present invention uses adaptive pre-emphasis to estimate and remove the spectral tilt filter contribution from the speech wave before convex optimization.
  • Order one LP analysis and IF is applied to estimate and remove the spectral slope from the speech frames under analysis.
  • the effect of adaptive pre-emphasis is illustrated in Figure 10: a) Speech spectrum and estimated spectral envelope, b) IF derivative glottal wave and fitted LF waveform, c) IF derivative glottal wave spectrum and fitted LF wave spectrum.
  • the vocal tract filter envelope estimates obtained this way do not encode source spectral tilt characteristics, which are reflected in the closing phase of the resulting derivative glottal waveforms instead. This improves the fitting of the return phase of the LF model and thus, of the high frequencies of the glottal source.
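  • A minimal sketch of this adaptive pre-emphasis step, assuming a standard order-one autocorrelation LP analysis followed by inverse filtering:

```python
import numpy as np
from scipy.signal import lfilter

def adaptive_preemphasis(frame):
    """Estimate the spectral tilt with order-one LP and remove it by
    inverse filtering (the name and interface are ours)."""
    r0 = np.dot(frame, frame)
    r1 = np.dot(frame[1:], frame[:-1])
    k = r1 / r0 if r0 > 0 else 0.0      # first-order LP coefficient
    return lfilter([1.0, -k], [1.0], frame), k
```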
  • the LF model is capable of more accurately describing the glottal derivative waveform than the RK model.
  • its more complex nonlinear formulation fails to fulfil the convexity condition and prevents its use in the joint voice source and vocal tract filter parameter estimation algorithm.
  • the RK model is employed during source-filter deconvolution and the LF model is then used to re-parameterise the derivative glottal wave obtained by inverse filtering the speech waveform with the jointly estimated filter coefficients.
  • LF model fitting is carried out in two steps. First, initial estimates of the LF T-parameters (Tp, Te, Ta, Tc) and the glottal excitation strength Ee are obtained from the time-domain IF voice source waveform by conventional direct estimation methods. Then, their values are refined using conventional constrained nonlinear optimization.
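  • A hedged sketch of the refinement step, using SciPy's bounded L-BFGS-B in place of whatever constrained optimiser is meant by "conventional" here, and reusing the lf_pulse helper sketched earlier; the bounds are illustrative assumptions, and the ordering constraint Tp < Te is left to the initial guess:

```python
import numpy as np
from scipy.optimize import minimize

def fit_lf(dg, T0, init, fs=16000):
    """Refine initial LF estimates init = (Ee, Tp, Te, Ta) against the
    inverse-filtered derivative glottal wave dg (mean squared error)."""
    def cost(x):
        Ee, Tp, Te, Ta = x
        model = lf_pulse(Ee, Tp, Te, Ta, Tc=T0, fs=fs)
        n = min(len(model), len(dg))
        return np.mean((model[:n] - dg[:n]) ** 2)

    bounds = [(1e-3, None), (1e-4, 0.9 * T0), (2e-4, 0.95 * T0), (1e-5, 0.2 * T0)]
    return minimize(cost, x0=np.asarray(init), bounds=bounds, method="L-BFGS-B").x
```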
  • Figure 11 shows an example of LF fitting in normal, breathy and pressed phonations.
  • Wavelet denoising is used to extract the glottal aspiration noise from the IF derivative glottal wave estimate.
  • the wavelet denoising technique used is Wavelet Packet Analysis, which has been found to obtain more reliable aspiration noise estimates compared to other techniques employed to identify and separate the periodic and aperiodic components of quasi-periodic signals, such as frequency transform analysis or periodic prediction.
  • Wavelet Packet Analysis is preferably performed at level 4 with the 7th order Daubechies wavelet, using soft-thresholding and the Stein Unbiased Risk Estimate threshold evaluation criteria.
  • Figure 12 shows a typical denoising result: a) original and denoised IF derivative glottal wave, b) noise estimate.
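  • A PyWavelets-based sketch of this denoising step; since PyWavelets has no built-in SURE threshold rule, the universal threshold with a median-absolute-deviation noise estimate is substituted here (an assumption):

```python
import numpy as np
import pywt

def aspiration_noise_estimate(dg):
    """Level-4 wavelet packet denoising of the IF derivative glottal wave
    with the db7 wavelet and soft thresholding (a sketch)."""
    wp = pywt.WaveletPacket(dg, wavelet="db7", mode="symmetric", maxlevel=4)
    sigma = np.median(np.abs(wp["d"].data)) / 0.6745   # noise level, finest details
    thr = sigma * np.sqrt(2.0 * np.log(len(dg)))       # universal threshold
    for node in wp.get_level(4, order="natural"):
        node.data = pywt.threshold(node.data, thr, mode="soft")
    denoised = wp.reconstruct(update=False)[: len(dg)]
    return dg - denoised, denoised                     # noise estimate, smoothed wave
```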
  • the aspiration noise estimate obtained for a particular pitch period during JEAS analysis is parameterised as follows. First, zero mean unit variance Gaussian noise is modulated with the already fitted LF waveform for that pitch period. Then, its energy is adjusted to match the energy (ANE) of the aspiration noise estimate. Because using a spectral shaping filter has informally been found not to make a perceptual difference, it is not included in the parameterisation.
  • Figure 15 depicts a diagram of the employed aspiration noise modelling approach.
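  • The parameterisation just described reduces to a few lines (names are ours; ane is the measured energy of the aspiration noise estimate and lf_wave the fitted LF pulse):

```python
import numpy as np

def model_aspiration_noise(lf_wave, ane, seed=0):
    """Modulate zero-mean unit-variance Gaussian noise with the fitted LF
    waveform and scale it to the measured aspiration noise energy ANE."""
    rng = np.random.default_rng(seed)
    an = rng.standard_normal(len(lf_wave)) * lf_wave   # amplitude modulation
    energy = np.sum(an ** 2)
    if energy > 0:
        an *= np.sqrt(ane / energy)                    # match the target ANE
    return an
```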
  • the source parameters Rg, Rk, Ra and ANE are used to generate smoothed LF derivative glottal waveforms lf(n) and amplitude-modulated aspiration noise estimates an(n), which form the filter excitation e(n) for resynthesis.
  • Figure 16 shows a schematic diagram of the employed overlap-add synthesis scheme.
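  • A sketch of this overlap-add scheme (the frame bookkeeping and names are our assumptions):

```python
import numpy as np
from scipy.signal import lfilter

def ola_synthesis(excitations, filters, positions, n_out):
    """Filter each frame's excitation e_k(n) with its all-pole coefficient
    vector, Hamming-window the synthetic contribution and overlap-add."""
    out = np.zeros(n_out)
    for e, a, pos in zip(excitations, filters, positions):
        contrib = lfilter([1.0], a, e) * np.hamming(len(e))
        end = min(pos + len(e), n_out)
        out[pos:end] += contrib[: end - pos]
    return out
```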
  • Both pitch and time-scale transformations are based on a parameter trajectory interpolation approach, where the first task involves calculating the number of frames in a particular segment required to achieve the desired modifications. Once the modified number of frames has been calculated, frame size contours, excitation and vocal tract parameter trajectories are resampled at the modified number of frames using, for example, cubic spline interpolation. Because JEAS modelling is pitch-synchronous, the frame sizes correspond with the pitch periods in voiced segments while they are fixed in unvoiced segments. Due to their better interpolation characteristics, LSF coefficients and R-parameters are employed during pitch and time-scale transformations to represent the vocal tract and glottal source respectively, in addition to the aspiration (ANE) and Gaussian (GNE) noise energies.
  • Pitch can be altered by simply multiplying the fundamental period contour by a scaling factor.
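  • A sketch of the trajectory resampling and pitch scaling under these conventions (the contour values below are made-up examples):

```python
import numpy as np
from scipy.interpolate import CubicSpline

def resample_trajectory(values, n_frames_mod):
    """Resample a per-frame parameter trajectory (an R-parameter, an LSF
    coefficient, ANE or GNE) to the modified number of frames."""
    x = np.linspace(0.0, 1.0, num=len(values))
    return CubicSpline(x, values)(np.linspace(0.0, 1.0, num=n_frames_mod))

t0_contour = np.array([0.008, 0.0081, 0.0079, 0.0080])
t0_raised = t0_contour / 1.2                     # raise pitch by a factor of 1.2
stretched = resample_trajectory(t0_raised, 6)    # time-scale to 6 frames
```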
  • because JEAS modelling includes an explicit parameterisation of the voice source, the source characteristics can also be transformed to match the target.
  • this also avoids the need for conventional residual prediction methods.
  • because the JEAS parameterisation does not involve a magnitude and phase division of the spectrum, the artifacts due to converted magnitude and phase mismatches are not produced and, thus, the use of additional techniques, such as phase prediction, is not required.
  • the jointly estimated JEAS all-pole vocal tract filter coefficients {α1 ... αp} are converted to Bark-scaled LSF parameters for the transformation of the JEAS spectral envelopes.
  • the linear frequency response of the jointly estimated vocal tract filter is calculated. This is resampled according to the Bark scale using, for example, the well-known cubic spline interpolation technique.
  • the warped all-pole filter coefficients are then computed by applying, for example, the conventional Levinson-Durbin algorithm to the autocorrelation sequence of the warped power spectrum. Then, the filter coefficients are transformed into LSF for conversion.
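  • A sketch of this warping chain using SciPy; Traunmueller's Hz-to-Bark formula is our choice, and the final conversion to LSF, for which SciPy offers no helper, is omitted:

```python
import numpy as np
from scipy.signal import freqz
from scipy.interpolate import CubicSpline
from scipy.linalg import solve_toeplitz

def bark_warp_lpc(a, p=18, fs=16000, n_freq=512):
    """Warp an all-pole envelope to the Bark scale and re-fit warped LP
    coefficients via Levinson-Durbin (solve_toeplitz) applied to the
    autocorrelation of the warped power spectrum."""
    w, h = freqz([1.0], a, worN=n_freq, fs=fs)      # linear-frequency response
    bark = 26.81 * w / (1960.0 + w) - 0.53          # Hz -> Bark (Traunmueller)
    grid = np.linspace(bark[0], bark[-1], n_freq)
    warped = CubicSpline(bark, np.abs(h) ** 2)(grid)
    r = np.fft.irfft(warped)[: p + 1]               # autocorrelation sequence
    alphas = solve_toeplitz((r[:p], r[:p]), -r[1 : p + 1])
    return np.concatenate([[1.0], alphas])          # warped A(z) coefficients
```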
  • a continuous probabilistic linear transformation function is employed to convert the LSF spectral envelopes.
  • Gaussian Mixture Models are used to describe the source and target glottal feature vector spaces, classify them into M classes and train class specific linear transformations.
  • the new LSF parameters are transformed to all-pole filter coefficients and resampled back to the linear scale before synthesis. Because the use of linear transformations broadens the formants of the converted speech, a perceptual post-filter is applied to narrow the formant bandwidths, deepen the spectral valleys and sharpen the formant peaks.
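  • A sketch of applying such a continuous probabilistic linear transformation, in the Stylianou-style form alluded to here; the per-class bias vectors V and matrices G are assumed to have been trained by least squares on aligned source-target pairs:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def continuous_linear_transform(x, gmm, V, G):
    """Convert a source vector x (LSF or glottal features) as a posterior-
    weighted sum of class-specific linear transforms.

    gmm : GaussianMixture (covariance_type='full') fitted on the source space
    V, G: per-class bias vectors and transform matrices (trained elsewhere)
    """
    post = gmm.predict_proba(x[None, :])[0]          # p(class m | x)
    y = np.zeros_like(x)
    for m in range(gmm.n_components):
        inv_cov = np.linalg.inv(gmm.covariances_[m])
        y += post[m] * (V[m] + G[m] @ inv_cov @ (x - gmm.means_[m]))
    return y

# e.g. gmm = GaussianMixture(n_components=8, covariance_type="full").fit(X_src)
```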
  • Figure 18 compares the JEAS and conventional pitch-synchronous harmonic model (PSHM) spectral envelopes, where it can be seen that the PSHM envelopes capture the spectral tilt, whereas in JEAS it is encoded by the glottal waveforms instead.
  • while both methods manage to represent the most important formants, small differences exist in their amplitudes, frequencies and/or bandwidths.
  • the glottal waveform morphing approach adopted within JEAS voice conversion employs Continuous Probabilistic Linear Transformations to map glottal LF parameters of different modal male and female speakers, which are the most commonly used speaker types in voice conversion applications.
  • Continuous probabilistic linear transformations have been chosen for being the most robust and efficient approach found to convert spectral envelopes.
  • the limitations of the codebook-based conversion methods for envelope transformations, i.e. the discontinuities caused by the use of a discrete number of codebook entries, can also be extrapolated to the modification of glottal waveforms.
  • the use of continuous probabilistic modelling and transformations is expected to achieve better glottal conversions too.
  • the feature vectors employed to convert the glottal source characteristics are derived from the JEAS model parameters linked to the voice source of every pitch period, i.e. the glottal excitation strength Ee and T-parameters (Tp, Te, Ta, Tc) obtained from the LF fitting procedure and the energy (ANE) of the aspiration noise estimate used to adjust that of the modelled pitch-synchronous amplitude-modulated Gaussian noise.
  • Figure 19 shows that the described glottal conversion approach is capable of bringing the source feature vector parameter contours closer to the target which, as a consequence, also produces converted glottal waveforms more similar to the target.
  • Figure 19 shows the linear transformation of LF glottal waveforms: a) source, target and converted derivative glottal LF waves; b) source, target and converted trajectories of the glottal feature vector parameters (Ee, Rg, Rk, Ra, ANE).
  • the speech data was recorded using a 'mimicking' approach, which resulted in a natural time-alignment between the identical sentences produced by the different speakers and factored out the prosodic cues of speaker identity to some extent.
  • Glottal closure instants derived from laryngograph signals are also provided for each sentence, and have been used for both PSHM and JEAS pitch synchronous analysis.
  • Four different voice conversion experiments have been investigated: male-to-male (MM), male-to-female (MF), female-to-male (FM) and female-to-female (FF) transformations.
  • the first 120 sentences are used for training and the remaining 30 for testing each speaker pair conversion.
  • LSF spectral vectors of order 30 have been employed throughout the conversion experiments, to train 8 linear spectral envelope transforms between each source and target speaker pair using the parallel VOICES training data. This number has been chosen for being capable of achieving small spectral distortion ratios while still generalising to the test data. Aligned source-target vector pairs were obtained by applying forced alignment to mark sub-phone boundaries and using Dynamic Time Warping to further constrain their time alignment. For residual and phase prediction, target GMMs of 40 classes and codebooks of 40 entries have been built. Finally, glottal waveform conversions have also been carried out using 8 linear transforms per speaker pair. Objective and subjective evaluations have been used to compare the performance of the two methods.
  • lsfsrc(t), lsftgt(t) and lsfconv(t) are the source, target and converted LSF vectors respectively, and the summation is computed over the time-aligned test data, L being the total number of test vectors after time alignment. Note that a 100% distortion ratio corresponds to the distortion between the source and the target.
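  • The ratio these variables define did not survive extraction; a formula consistent with the legend above (our reconstruction) is: $R_{LSF} = \dfrac{\sum_{t=1}^{L} \lVert lsf_{tgt}(t) - lsf_{conv}(t) \rVert}{\sum_{t=1}^{L} \lVert lsf_{tgt}(t) - lsf_{src}(t) \rVert} \times 100\%$.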
  • Similar objective distortion measures can also be used to evaluate the conversion of the voice source characteristics, i.e. Residual Prediction and Glottal Waveform Conversion in the PSHM and JEAS implementations respectively.
  • Residual Prediction reintroduces the target spectral details not captured by spectral envelope conversion, bringing as a result the converted speech spectra closer to the target.
  • Glottal Waveform Conversion maps time-domain representations of the glottal waveforms, which in the frequency domain results in better matching glottal formants and spectral tilts of the converted spectra. Whilst the methods differ, their spectral effect is similar, i.e. they aim to reduce the differences between the converted and the target speech spectra.
  • a distortion ratio RLSD, based on log spectral distances (LSD) and similar to RLSF, can be used to compare the converted-to-target log spectral distances with and without voice source conversion.
  • a 100% ratio corresponds to the distortion between spectral envelope converted spectra without voice source transformation and the target spectra.
  • Figure 21 illustrates RLSD ratios computed for Residual Prediction and Glottal Waveform Conversion on the test set. Results show that both voice source conversion techniques manage to reduce the distortions between the converted and target speech spectra. Residual Prediction performs slightly better, mainly because the algorithm is designed to predict residuals which minimise the log spectral distance represented in RLSD. In contrast, glottal waveform conversion is trained to minimise the glottal parameter conversion error over the training data and not the log spectral distance. Nevertheless, both methods are successful in bringing the converted spectra close to the target.
  • the first part was an ABX test in which subjects were presented with PSHM-converted (A), JEAS-converted (B) and target (X) utterances and were asked to choose the speech sample A or B they found sounded more like the target X in terms of speaker identity.
  • Spectral envelopes and voice source characteristics were transformed with the methods described above for each system, i.e. spectral envelope conversion, residual and phase prediction were used for PSHM transformations and spectral envelope and glottal waveform conversion for JEAS transformations.
  • the prosody of the target was employed to synthesise the converted sentences in order to normalise the pitch, duration and energy differences between source and target speakers for the perceptual comparison.
  • Figure 22 shows the results of the ABX test.
  • the JEAS-converted samples are preferred over the PSHM-converted ones overall, but the preference difference varies depending on the type of conversion, being for example almost the same for FM transformations.
  • the 'NO STRONG PREFERENCE' (NSP) option has been selected almost as often as the JEAS-converted utterances in general, which reveals that subjects found it very difficult to distinguish between conversion systems in terms of speaker identity.
  • the second listening test aimed at determining which system produces speech with a higher quality. Subjects were presented with PSHM and JEAS converted speech utterance pairs and asked to choose the one they thought had a better speech quality. Results are illustrated in Figure 23 . There is a clear preference for the sentences converted using the JEAS method, chosen 75.7% of the time on average, which stems from the clearly distinguishable quality difference between the PSHM and JEAS transformed samples. Utterances obtained after PSHM conversion have a 'noisy' quality caused by phase discontinuities which still exist despite Phase Prediction. Comparatively, JEAS converted sentences sound much smoother. This quality difference is also thought to have slightly biased the preference for JEAS conversion in the ABX test.
  • the method and device of voice conversion of the present invention are applicable to frameworks requiring voice quality transformations.
  • for example, they can be used to repair the deviant voice source characteristics of tracheoesophageal speech.


Claims (13)

  1. A method for converting a speech signal of a source speaker into a converted voice signal, comprising:
    a training stage, in which:
    given a training database of parallel source and target data, for each pitch period of said training database, the method comprises the steps of:
    modelling each pitch period by means of a glottal waveform and a vocal tract filter according to Lu and Smith's model, to obtain a set of Liljencrants-Fant LF parameters, said set of LF parameters comprising an excitation strength parameter Ee and a set of T-parameters, Tp, Te, Ta, Tc, modelling a glottal waveform, and a set of all-pole vocal tract filter coefficients α1 ... αp;
    converting said T-parameters, Tp, Te, Ta, Tc, into R-parameters, Rg, Rk, Ra;
    converting said all-pole vocal tract filter coefficients, α1 ... αp, into line spectral frequencies in Bark scale lsf1 ... lsfp;
    defining a glottal vector G to be converted;
    defining a vocal tract vector LSF to be converted, said vocal tract vector LSF comprising said line spectral frequencies in Bark scale lsf1 ... lsfp;
    applying wavelet denoising to obtain an estimate of a glottal aspiration noise;
    from the set of vocal tract vectors LSF obtained for each pitch period of said training database, estimating a vocal tract continuous probabilistic linear transformation function using the least square error criterion;
    the method being characterised in that said modelling step further comprises the steps of:
    modelling said aspiration noise estimate by modulating zero mean unit variance Gaussian noise with said modelled glottal waveform and adjusting its energy ANE to match that of said aspiration noise estimate;
    said glottal vector G to be converted comprising said excitation strength parameter Ee, said R-parameters, Rg, Rk, Ra, and said energy ANE of the aspiration noise estimate,
    the method further comprising:
    a conversion stage, in which a given test speech waveform is modelled and transformed into a set of converted parameters, Ee', Rg', Rk', Ra', ANE', LSF';
    a synthesis stage, in which a converted speech waveform is synthesised from said set of converted parameters, Ee', Rg', Rk', Ra', ANE', LSF'.
  2. A method according to claim 1, wherein said training stage further comprises the step of:
    from the set of glottal vectors G obtained for each pitch period of said training database, estimating a glottal waveform continuous probabilistic linear transformation function using the least square error criterion.
  3. A method according to claim 1 or claim 2, wherein said step of modelling each pitch period by means of a glottal waveform and a vocal tract filter according to Lu and Smith's model comprises the steps of:
    modelling the glottal waveform using the Rosenberg-Klatt model;
    using convex optimization to obtain a set of Rosenberg-Klatt glottal waveform parameters and the all-pole vocal tract filter coefficients α1 ... αp, wherein said step of using convex optimization comprises a step of adaptive pre-emphasis for estimating and removing a spectral tilt filter contribution from the speech waveform before convex optimization.
  4. A method according to claim 3, wherein said step of modelling each pitch period by means of a glottal waveform and a vocal tract filter according to Lu and Smith's model further comprises the steps of:
    obtaining a derivative glottal waveform by inverse filtering said pitch period using said all-pole vocal tract filter coefficients α1 ... αp;
    fitting said set of LF parameters to said inverse filtered derivative glottal waveform by direct estimation and constrained non-linear optimization.
  5. A method according to any one of the preceding claims, wherein said conversion stage comprises, for each pitch period of said test speech waveform, the steps of:
    obtaining a glottal vector G to be converted, said glottal vector comprising an excitation strength parameter Ee, a set of R-parameters, Rg, Rk, Ra, and the energy ANE of said aspiration noise estimate;
    obtaining a vocal tract vector LSF to be converted, said vocal tract vector LSF comprising a set of line spectral frequencies in Bark scale lsf1 ... lsfp;
    applying said vocal tract continuous probabilistic linear transformation function estimated during the training stage to obtain a converted vocal tract parameter vector LSF';
    transforming said glottal vector G using said glottal waveform continuous probabilistic linear transformation function estimated during the training stage, thus obtaining a converted glottal vector G' comprising a set of converted parameters Ee', Rg', Rk', Ra', ANE', LSF'.
  6. A method according to claim 5, wherein said steps of obtaining a glottal vector G to be converted and a vocal tract vector LSF to be converted further comprise the steps of:
    modelling each pitch period by means of a glottal waveform and a vocal tract filter according to Lu and Smith's model, to obtain a set of LF parameters, said set of LF parameters comprising an excitation strength parameter Ee and a set of T-parameters, Tp, Te, Ta, Tc, modelling a glottal waveform, and a set of all-pole vocal tract filter coefficients α1 ... αp;
    converting said all-pole vocal tract filter coefficients into line spectral frequencies in Bark scale lsf1 ... lsfp;
    converting said T-parameters into R-parameters, Rg, Rk, Ra;
    defining a glottal vector G to be converted;
    defining a vocal tract vector LSF to be converted.
  7. A method according to claim 5 or claim 6, wherein said conversion stage further comprises a step of post-filtering said converted vocal tract parameter vector LSF'.
  8. A method according to any one of the preceding claims, wherein said synthesis stage, in which said converted speech waveform is synthesised from said set of converted parameters Ee', Rg', Rk', Ra', ANE', LSF', comprises the steps of:
    interpolating the trajectories of said converted parameters Rg', Rk', Ra', ANE', LSF' of each pitch period, thus obtaining a set of interpolated parameters Rg'', Rk'', Ra'', ANE'', LSF'' comprising interpolated R-parameters, Rg'', Rk'', Ra'', an interpolated energy, ANE'', and an interpolated vocal tract vector, LSF'';
    converting said interpolated vocal tract vector LSF'' into an all-pole filter coefficient vector A'';
    converting said interpolated R-parameters, Rg'', Rk'', Ra'', into interpolated T-parameters, Tp'', Te'', Ta'', Tc'';
    for each frame of said test speech waveform, generating an excitation signal ek(n), where k denotes the k-th frame.
  9. A method according to claim 8, wherein said step of generating an excitation signal comprises, for each of said frames, the steps of:
    if said frame is voiced:
    from said interpolated T-parameters, Tp'', Te'', Ta'', Tc'', and said excitation strength parameter Ee, generating an interpolated glottal waveform lfk(n);
    from the interpolated energy parameter ANE'', generating interpolated aspiration noise ank(n);
    generating said voiced excitation signal ek(n) by adding said interpolated glottal waveform lfk(n) and said interpolated aspiration noise ank(n);
    if said frame is unvoiced:
    generating said unvoiced excitation signal ek(n) from a Gaussian noise source gnk(n).
  10. A method according to claim 8 or claim 9, wherein said synthesis stage further comprises the steps of:
    generating a synthetic contribution of each frame by filtering said excitation signal ek(n) with said interpolated all-pole filter coefficient vector A'';
    multiplying said synthetic contribution by a Hamming window, overlapping and adding, in order to generate the converted speech signal.
  11. A method applicable to voice quality transformations, such as tracheoesophageal speech repair, comprising the method steps of any one of the preceding claims.
  12. A device comprising means adapted to carry out the steps of the method of any one of the preceding claims.
  13. Computer program code means adapted to carry out the steps of the method according to any one of claims 1 to 11, when said program is run on a computer, a digital signal processor, a field-programmable gate array FPGA, an application-specific integrated circuit, a microprocessor, a microcontroller or any other form of programmable hardware.
EP08804436A 2008-09-19 2008-09-19 Method, device and computer program code means for voice conversion Not-in-force EP2215632B1 (fr)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/EP2008/062502 WO2010031437A1 (fr) 2008-09-19 2008-09-19 Method and system for voice conversion

Publications (2)

Publication Number Publication Date
EP2215632A1 (fr) 2010-08-11
EP2215632B1 true EP2215632B1 (fr) 2011-03-16

Family

ID=40277465

Family Applications (1)

Application Number Title Priority Date Filing Date
EP08804436A Not-in-force EP2215632B1 (fr) 2008-09-19 2008-09-19 Procede, dispositif, et code de programme pour la conversion vocale

Country Status (5)

Country Link
EP (1) EP2215632B1 (fr)
AT (1) ATE502380T1 (fr)
DE (1) DE602008005641D1 (fr)
ES (1) ES2364005T3 (fr)
WO (1) WO2010031437A1 (fr)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9607610B2 (en) 2014-07-03 2017-03-28 Google Inc. Devices and methods for noise modulation in a universal vocoder synthesizer
US11100940B2 (en) 2019-12-20 2021-08-24 Soundhound, Inc. Training a voice morphing apparatus
US11600284B2 (en) 2020-01-11 2023-03-07 Soundhound, Inc. Voice morphing apparatus having adjustable parameters

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101901598A (zh) * 2010-06-30 2010-12-01 北京捷通华声语音技术有限公司 Humming synthesis method and system
ES2364401B2 (es) * 2011-06-27 2011-12-23 Universidad Politécnica de Madrid Method and system for the estimation of physiological parameters of phonation.
RU2510954C2 (ru) * 2012-05-18 2014-04-10 Александр Юрьевич Бредихин Method for re-voicing audio materials and apparatus for carrying it out
EP3857541B1 (fr) * 2018-09-30 2023-07-19 Microsoft Technology Licensing, LLC Génération de forme d'onde de parole
WO2020174356A1 (fr) * 2019-02-25 2020-09-03 Technologies Of Voice Interface Ltd Système et dispositif d'interprétation de la parole
CN113780107B (zh) * 2021-08-24 2024-03-01 电信科学技术第五研究所有限公司 一种基于深度学习双输入网络模型的无线电信号检测方法

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR100809368B1 (ko) * 2006-08-09 2008-03-05 한국과학기술원 Timbre conversion system using the glottal wave

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9607610B2 (en) 2014-07-03 2017-03-28 Google Inc. Devices and methods for noise modulation in a universal vocoder synthesizer
US11100940B2 (en) 2019-12-20 2021-08-24 Soundhound, Inc. Training a voice morphing apparatus
US11600284B2 (en) 2020-01-11 2023-03-07 Soundhound, Inc. Voice morphing apparatus having adjustable parameters

Also Published As

Publication number Publication date
DE602008005641D1 (de) 2011-04-28
ES2364005T3 (es) 2011-08-22
EP2215632A1 (fr) 2010-08-11
WO2010031437A1 (fr) 2010-03-25
ATE502380T1 (de) 2011-04-15


Legal Events

PUAI Public reference made under article 153(3) EPC to a published international application that has entered the European phase. Free format text: ORIGINAL CODE: 0009012

17P Request for examination filed. Effective date: 20091014

AK Designated contracting states. Kind code of ref document: A1. Designated state(s): AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MT NL NO PL PT RO SE SI SK TR

AX Request for extension of the European patent. Extension state: AL BA MK RS

GRAP Despatch of communication of intention to grant a patent. Free format text: ORIGINAL CODE: EPIDOSNIGR1

RTI1 Title (correction). Free format text: METHOD, DEVICE AND COMPUTER PROGRAM CODE MEANS FOR VOICE CONVERSION

GRAS Grant fee paid. Free format text: ORIGINAL CODE: EPIDOSNIGR3

GRAA (Expected) grant. Free format text: ORIGINAL CODE: 0009210

AK Designated contracting states. Kind code of ref document: B1. Designated state(s): AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MT NL NO PL PT RO SE SI SK TR

REG Reference to a national code. Ref country code: GB. Ref legal event code: FG4D

REG Reference to a national code. Ref country code: CH. Ref legal event code: EP

REG Reference to a national code. Ref country code: IE. Ref legal event code: FG4D

REF Corresponds to: Ref document number 602008005641 (DE). Date of ref document: 20110428. Kind code of ref document: P

REG Reference to a national code. Ref country code: DE. Ref legal event code: R096. Ref document number: 602008005641 (DE). Effective date: 20110428

REG Reference to a national code. Ref country code: PT. Ref legal event code: SC4A. Free format text: AVAILABILITY OF NATIONAL TRANSLATION. Effective date: 20110606

REG Reference to a national code. Ref country code: NL. Ref legal event code: VDEP. Effective date: 20110316

PG25 Lapsed in a contracting state [announced via postgrant information from national office to EPO]. Lapse because of failure to submit a translation of the description or to pay the fee within the prescribed time-limit: SE, LT, HR, LV (effective 20110316); NO (effective 20110616)

REG Reference to a national code. Ref country code: ES. Ref legal event code: FG2A. Ref document number: 2364005 (ES). Kind code of ref document: T3. Effective date: 20110822

LTIE LT: invalidation of European patent or patent extension. Effective date: 20110316

PG25 Lapsed in a contracting state [announced via postgrant information from national office to EPO]. Lapse because of failure to submit a translation of the description or to pay the fee within the prescribed time-limit: FI, CY, SI, AT, BE, EE, CZ, SK, RO, NL (effective 20110316); BG (effective 20110616); IS (effective 20110716)

PLBE No opposition filed within time limit. Free format text: ORIGINAL CODE: 0009261

STAA Information on the status of an EP patent application or granted EP patent. Free format text: STATUS: NO OPPOSITION FILED WITHIN TIME LIMIT

26N No opposition filed. Effective date: 20111219

PG25 Lapsed in a contracting state [announced via postgrant information from national office to EPO]. Lapse because of failure to submit a translation of the description or to pay the fee within the prescribed time-limit: PL, DK (effective 20110316)

REG Reference to a national code. Ref country code: DE. Ref legal event code: R097. Ref document number: 602008005641 (DE). Effective date: 20111219

PG25 Lapsed in a contracting state [announced via postgrant information from national office to EPO]. Lapse because of non-payment of due fees: MC (effective 20110930)

REG Reference to a national code. Ref country code: IE. Ref legal event code: MM4A

PG25 Lapsed in a contracting state [announced via postgrant information from national office to EPO]. Lapse because of non-payment of due fees: IE (effective 20110919)

PG25 Lapsed in a contracting state [announced via postgrant information from national office to EPO]. Lapse because of failure to submit a translation of the description or to pay the fee within the prescribed time-limit: MT (effective 20110316)

REG Reference to a national code. Ref country code: CH. Ref legal event code: PL

PG25 Lapsed in a contracting state [announced via postgrant information from national office to EPO]. Lapse because of non-payment of due fees: LU (effective 20110919)

REG Reference to a national code. Ref country code: ES. Ref legal event code: PC2A. Owner name: FUNDACION CENTRO DE TECNOLOGIAS DE INTERACCION VIS. Effective date: 20130604

PG25 Lapsed in a contracting state [announced via postgrant information from national office to EPO]. Lapse because of non-payment of due fees: CH, LI (effective 20120930)

PG25 Lapsed in a contracting state [announced via postgrant information from national office to EPO]. Lapse because of failure to submit a translation of the description or to pay the fee within the prescribed time-limit: TR, HU, GR (effective 20110316)

PGFP Annual fee paid to national office [announced via postgrant information from national office to EPO]: GB (payment date 20140929, year of fee payment 7); PT (payment date 20140320, year of fee payment 7); IT (payment date 20140922, year of fee payment 7); DE (payment date 20140929, year of fee payment 7)

REG Reference to a national code. Ref country code: PT. Ref legal event code: MM4A. Free format text: LAPSE DUE TO NON-PAYMENT OF FEES. Effective date: 20160321

REG Reference to a national code. Ref country code: DE. Ref legal event code: R119. Ref document number: 602008005641 (DE)

PG25 Lapsed in a contracting state [announced via postgrant information from national office to EPO]. Lapse because of non-payment of due fees: IT (effective 20150919)

GBPC GB: European patent ceased through non-payment of renewal fee. Effective date: 20150919

PG25 Lapsed in a contracting state [announced via postgrant information from national office to EPO]. Lapse because of non-payment of due fees: PT (effective 20160321)

PG25 Lapsed in a contracting state [announced via postgrant information from national office to EPO]. Lapse because of non-payment of due fees: GB (effective 20150919); DE (effective 20160401)

REG Reference to a national code. Ref country code: FR. Ref legal event code: PLFP. Year of fee payment: 9

REG Reference to a national code. Ref country code: FR. Ref legal event code: PLFP. Year of fee payment: 10

PGFP Annual fee paid to national office [announced via postgrant information from national office to EPO]: FR (payment date 20170926, year of fee payment 10); ES (payment date 20171010, year of fee payment 10)

PG25 Lapsed in a contracting state [announced via postgrant information from national office to EPO]. Lapse because of non-payment of due fees: FR (effective 20180930)

REG Reference to a national code. Ref country code: ES. Ref legal event code: FD2A. Effective date: 20191104

PG25 Lapsed in a contracting state [announced via postgrant information from national office to EPO]. Lapse because of non-payment of due fees: ES (effective 20180920)